How to Scrape a Website Using Node.js

news/2024/7/17 17:19:21

Introduction

Web scraping is the technique of extracting data from websites. This data can further be stored in a database or any other storage system for analysis or other uses. While extracting data from websites can be done manually, web scraping usually refers to an automated process.

Web scraping is used by most bots and web crawlers for data extraction. There are various methodologies and tools you can use for web scraping. In this tutorial, we will focus on a technique that involves parsing the DOM of a webpage.

Prerequisites

Web scraping can be done in virtually any programming language that has support for HTTP and XML or DOM parsing. In this tutorial, we will focus on web scraping using JavaScript in a Node.js server environment.

With that in mind, this tutorial assumes that readers know the following:

  • An understanding of JavaScript, including ES6 and ES7 syntax
  • Familiarity with jQuery
  • Functional programming concepts

Next, we will go through what our end project will be.

Project Specs

We will be using web scraping to extract some data from the Scotch website. Scotch does not provide an API for fetching the profiles and tutorials/posts of authors. So, we will be building an API for fetching the profiles and tutorials/posts of Scotch authors.

Here is a screenshot of a demo app built on the API we will build in this tutorial. You can see the app on Heroku and the source code on GitHub.

Before we begin, let’s go over the packages and dependencies you will need to complete this project.

Project Setup

Before you begin, ensure that you have Node and npm or yarn installed on your machine. Since we will use a lot of ES6/7 syntax in this tutorial, it is recommended that you use the following versions of Node and npm for complete ES6/7 support: Node 8.9.0 or higher and npm 5.2.0 or higher.
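One way to confirm that the running Node version meets the recommended minimum is a small numeric comparison of the dotted version strings. This is only a sketch: meetsMinimum is a hypothetical helper name introduced here, not part of the tutorial's code, and it compares segments as numbers because a plain string comparison would wrongly rank '8.10.1' below '8.9.0'.

```javascript
// Hypothetical helper (not from the tutorial): compares dotted version
// strings segment by segment as numbers.
const meetsMinimum = (current, minimum) => {
  const a = current.split('.').map(Number);
  const b = minimum.split('.').map(Number);

  for (let i = 0; i < Math.max(a.length, b.length); i++) {
    const x = a[i] || 0; // missing or non-numeric segments count as 0
    const y = b[i] || 0;
    if (x !== y) return x > y;
  }

  return true; // the versions are equal
};

// process.versions.node holds the running Node version without the "v" prefix
const nodeVersion = process.versions.node;
console.log(`Node ${nodeVersion} satisfies >= 8.9.0:`, meetsMinimum(nodeVersion, '8.9.0'));
```

The npm version can be checked the same way by passing the output of npm --version as the first argument.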

We will be using the following core packages:

  1. Cheerio - Cheerio is a fast, flexible, and lean implementation of core jQuery designed specifically for the server. It makes DOM parsing very easy.

  2. Axios - Axios is a promise-based HTTP client for the browser and Node.js. It will enable us to fetch page contents through HTTP requests.

  3. Express - Express is a minimal and flexible Node.js web application framework that provides a robust set of features for web and mobile applications.

  4. Lodash - Lodash is a modern JavaScript utility library delivering modularity, performance & extras. It makes JavaScript easier by taking the hassle out of working with arrays, numbers, objects, strings, etc.

Step 1: Create the Application Directory

Create a new directory for the application and run the following command to install the required dependencies for the app.

# Create a new directory
mkdir scotch-scraping

# cd into the new directory
cd scotch-scraping

# Initiate a new package and install app dependencies
npm init -y
npm install express morgan axios cheerio lodash

Step 2: Set Up the Express Server Application

We will now set up an HTTP server application using Express. Create a server.js file in the root directory of your application and add the following code snippet to set up the server:

/* server.js */

// Require dependencies
const logger = require('morgan');
const express = require('express');

// Create an Express application
const app = express();

// Configure the app port
const port = process.env.PORT || 3000;
app.set('port', port);

// Load middlewares
app.use(logger('dev'));

// Start the server and listen on the preconfigured port
app.listen(port, () => console.log(`App started on port ${port}.`));

Step 3: Modify npm scripts

Finally, we will modify the "scripts" section of the package.json file to look like the following snippet:

"scripts": {
  "start": "node server.js"
}

We now have everything we need to start building our application. If you run the command npm start in your terminal now, it will start the application server on port 3000, provided that port is available. However, we cannot access any route yet because we have not added any routes to the application. Let’s start building some helper functions we will need for web scraping.

Step 4: Create Helper Functions

As stated earlier, we will create a couple of helper functions that will be used in several parts of our application. Create a new app directory in your project root. Create a new file named helpers.js in the newly created directory and add the following content to it:

/* app/helpers.js */

const _ = require('lodash');
const axios = require("axios");
const cheerio = require("cheerio");

In this code, we are requiring the dependencies we will need for our helper functions. Let’s go ahead and add the helper functions.

Creating Utility Helper Functions

We will start by creating some utility helper functions. Add the following snippet to the app/helpers.js file.

/* app/helpers.js */

///
// UTILITY FUNCTIONS
///

/**
 * Compose function arguments starting from right to left
 * to an overall function and returns the overall function
 */
const compose = (...fns) => arg => {
  return _.flattenDeep(fns).reduceRight((current, fn) => {
    if (_.isFunction(fn)) return fn(current);
    throw new TypeError("compose() expects only functions as parameters.");
  }, arg);
};

/**
 * Compose async function arguments starting from right to left
 * to an overall async function and returns the overall async function
 */
const composeAsync = (...fns) => arg => {
  return _.flattenDeep(fns).reduceRight(async (current, fn) => {
    if (_.isFunction(fn)) return fn(await current);
    throw new TypeError("composeAsync() expects only functions as parameters.");
  }, arg);
};

/**
 * Enforces the scheme of the URL is https
 * and returns the new URL
 */
const enforceHttpsUrl = url =>
  _.isString(url) ? url.replace(/^(https?:)?\/\//, "https://") : null;

/**
 * Strips number of all non-numeric characters
 * and returns the sanitized number
 */
const sanitizeNumber = number =>
  _.isString(number)
    ? number.replace(/[^0-9-.]/g, "")
    : _.isNumber(number) ? number : null;

/**
 * Filters null values from array
 * and returns an array without nulls
 */
const withoutNulls = arr =>
  _.isArray(arr) ? arr.filter(val => !_.isNull(val)) : [];

/**
 * Transforms an array of ({ key: value }) pairs to an object
 * and returns the transformed object
 */
const arrayPairsToObject = arr =>
  arr.reduce((obj, pair) => ({ ...obj, ...pair }), {});

/**
 * A composed function that removes null values from array of ({ key: value }) pairs
 * and returns the transformed object of the array
 */
const fromPairsToObject = compose(arrayPairsToObject, withoutNulls);

Let’s go through the functions one at a time to understand what they do.

  • compose() - This is a higher-order function that takes one or more functions as its arguments and returns a composed function. The composed function has the same effect as invoking the functions passed in as arguments from right to left, passing the result of each function invocation as the argument to the next function.

    If any of the arguments passed to compose() is not a function, the composed function will throw an error whenever it is invoked. Here is a code snippet that describes how compose() works.

/**
 * -------------------------------------------------
 * Method 1: Functions in sequence
 * -------------------------------------------------
 */
function1( function2( function3(arg) ) );

/**
 * -------------------------------------------------
 * Method 2: Using compose()
 * -------------------------------------------------
 * Invoking the composed function has the same effect as (Method 1)
 */
const composedFunction = compose(function1, function2, function3);

composedFunction(arg);
  • composeAsync() - This function works in the same way as the compose() function. The only difference is that it is asynchronous. Hence, it is ideal for composing functions that have asynchronous behaviour, for example functions that return promises.

  • enforceHttpsUrl() - This function takes a url string as argument and returns the url with https scheme provided the url begins with either https://, http:// or //. If the url is not a string then null is returned. Here is an example.

enforceHttpsUrl('scotch.io'); // returns => 'scotch.io'
enforceHttpsUrl('//scotch.io'); // returns => 'https://scotch.io'
enforceHttpsUrl('http://scotch.io'); // returns => 'https://scotch.io'
  • sanitizeNumber() - This function expects a number or string as argument. If a number is passed to it, it returns the number. However, if a string is passed to it, it removes non-numeric characters from the string and returns the sanitized string. For other value types, it returns null. Here is an example:

sanitizeNumber(53.56); // returns => 53.56
sanitizeNumber('-2oo,40'); // returns => '-240'
sanitizeNumber('badnumber.decimal'); // returns => '.'
  • withoutNulls() - This function expects an array as argument and returns a new array that only contains the non-null items of the original array. Here is an example.

withoutNulls([ 'String', [], null, {}, null, 54 ]); // returns => ['String', [], {}, 54]
  • arrayPairsToObject() - This function expects an array of ({ key: value }) objects, and returns a transformed object with the keys and values. Here is an example.

const pairs = [ { key1: 'value1' }, { key2: 'value2' }, { key3: 'value3' } ];

arrayPairsToObject(pairs); // returns => { key1: 'value1', key2: 'value2', key3: 'value3' }
  • fromPairsToObject() - This is a composed function created using compose(). It has the same effect as executing:

arrayPairsToObject( withoutNulls(array) );
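To see the right-to-left ordering in action, the following is a dependency-free sketch of compose() and composeAsync(). It is a simplified stand-in rather than the tutorial's exact code: it uses typeof instead of Lodash and does not flatten nested arrays of functions. The functions addOne, double and delayedSquare are hypothetical examples introduced only for illustration.

```javascript
// Simplified compose(): applies the given functions from right to left.
const compose = (...fns) => arg =>
  fns.reduceRight((current, fn) => {
    if (typeof fn !== 'function') {
      throw new TypeError('compose() expects only functions as parameters.');
    }
    return fn(current);
  }, arg);

// Simplified composeAsync(): awaits each intermediate result before
// handing it to the next function, so the functions may return promises.
const composeAsync = (...fns) => arg =>
  fns.reduceRight(async (current, fn) => {
    if (typeof fn !== 'function') {
      throw new TypeError('composeAsync() expects only functions as parameters.');
    }
    return fn(await current);
  }, arg);

// Hypothetical example functions
const addOne = n => n + 1;
const double = n => n * 2;
const delayedSquare = n =>
  new Promise(resolve => setTimeout(() => resolve(n * n), 10));

// double runs first, then addOne: (5 * 2) + 1
const doubleThenAddOne = compose(addOne, double);
console.log(doubleThenAddOne(5)); // 11

// delayedSquare runs first, then double: (4 * 4) * 2
const squareThenDouble = composeAsync(double, delayedSquare);
const asyncResult = squareThenDouble(4);
asyncResult.then(value => console.log(value)); // 32
```

Note that the composed async function always returns a promise once at least one function is supplied, so callers need await or .then() to read the final value.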

Request and Response Helper Functions

Add the following to the app/helpers.js file.

/* app/helpers.js */

/**
 * Handles the request(Promise) when it is fulfilled
 * and sends a JSON response to the HTTP response stream(res).
 */
const sendResponse = res => async request => {
  return await request
    .then(data => res.json({ status: "success", data }))
    .catch(({ status: code = 500 }) =>
      res.status(code).json({ status: "failure", code, message: code === 404 ? 'Not found.' : 'Request failed.' })
    );
};

/**
 * Loads the html string returned for the given URL
 * and sends a Cheerio parser instance of the loaded HTML
 */
const fetchHtmlFromUrl = async url => {
  return await axios
    .get(enforceHttpsUrl(url))
    .then(response => cheerio.load(response.data))
    .catch(error => {
      // Propagate the upstream HTTP status if there is one, else default to 500
      error.status = (error.response && error.response.status) || 500;
      throw error;
    });
};

Here, we have added two new functions: sendResponse() and fetchHtmlFromUrl(). Let’s try to understand what they do.

  • sendResponse() - This is a higher-order function that expects an Express HTTP response stream (res) as its argument and returns an async function. The returned async function expects a promise or a thenable as its argument (request).

    If the request promise resolves, a successful JSON response containing the resolved data is sent using res.json(). If the promise rejects, an error JSON response with an appropriate HTTP status code is sent. Here is how it can be used in an Express route:

app.get('/path', (req, res, next) => {
  const request = Promise.resolve([1, 2, 3, 4, 5]);
  sendResponse(res)(request);
});

Making a GET request to the /path endpoint will return this JSON response:

{
  "status": "success",
  "data": [1, 2, 3, 4, 5]
}
  • fetchHtmlFromUrl() - This is an async function that expects a url string as its argument. First, it uses axios.get() to fetch the content of the URL (which returns a promise). If the promise resolves, it uses cheerio.load() with the returned content to create a Cheerio parser instance, and then returns the instance. However, if the promise rejects, it throws an error with an appropriate status code.

    The Cheerio parser instance returned by this function will enable us to extract the data we require. We can use it in much the same way as the jQuery instance returned by calling $() or jQuery() on a DOM target.
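The failure path of sendResponse() is easier to follow when it is run against a stand-in response object. The sketch below repeats the helper so it runs on its own, and uses a hypothetical createFakeRes() factory that mimics only the two Express methods the helper calls (status() and json()). It is not a real Express response, and the error object is made up for illustration.

```javascript
// Same logic as the tutorial's sendResponse(), repeated so this runs alone.
const sendResponse = res => request =>
  request
    .then(data => res.json({ status: 'success', data }))
    .catch(({ status: code = 500 }) =>
      res.status(code).json({
        status: 'failure',
        code,
        message: code === 404 ? 'Not found.' : 'Request failed.'
      })
    );

// Hypothetical stand-in for an Express response: records what was sent.
const createFakeRes = () => ({
  statusCode: 200,
  body: null,
  status(code) { this.statusCode = code; return this; },
  json(payload) { this.body = payload; return this; }
});

const okRes = createFakeRes();
const failRes = createFakeRes();

// An error carrying a status property, as fetchHtmlFromUrl() would throw
const notFound = Object.assign(new Error('No such author'), { status: 404 });

const settled = Promise.all([
  sendResponse(okRes)(Promise.resolve([1, 2, 3])),
  sendResponse(failRes)(Promise.reject(notFound))
]).then(() => {
  console.log(okRes.statusCode, okRes.body);
  console.log(failRes.statusCode, failRes.body);
});
```

Because the status property defaults to 500 in the destructuring pattern, any rejection without a status is reported as a generic server error.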

DOM Parsing Helper Functions

Let’s go ahead and add some additional functions to help us with DOM parsing. Add the following content to the app/helpers.js file.

/* app/helpers.js */

///
// HTML PARSING HELPER FUNCTIONS
///

/**
 * Fetches the inner text of the element
 * and returns the trimmed text
 */
const fetchElemInnerText = elem => (elem.text && elem.text().trim()) || null;

/**
 * Fetches the specified attribute from the element
 * and returns the attribute value
 */
const fetchElemAttribute = attribute => elem =>
  (elem.attr && elem.attr(attribute)) || null;

/**
 * Extract an array of values from a collection of elements
 * using the extractor function and returns the array
 * or the return value from calling transform() on array
 */
const extractFromElems = extractor => transform => elems => $ => {
  const results = elems.map((i, element) => extractor($(element))).get();
  return _.isFunction(transform) ? transform(results) : results;
};

/**
 * A composed function that extracts number text from an element,
 * sanitizes the number text and returns the parsed integer
 */
const extractNumber = compose(parseInt, sanitizeNumber, fetchElemInnerText);

/**
 * A composed function that extracts url string from the element's attribute(attr)
 * and returns the url with https scheme
 */
const extractUrlAttribute = attr =>
  compose(enforceHttpsUrl, fetchElemAttribute(attr));


module.exports = {
  compose,
  composeAsync,
  enforceHttpsUrl,
  sanitizeNumber,
  withoutNulls,
  arrayPairsToObject,
  fromPairsToObject,
  sendResponse,
  fetchHtmlFromUrl,
  fetchElemInnerText,
  fetchElemAttribute,
  extractFromElems,
  extractNumber,
  extractUrlAttribute
};

We’ve added a few more functions. Here are the functions and what they do:

  • fetchElemInnerText() - This function expects an element as argument. It extracts the innerText of the element by calling elem.text(), trims the surrounding whitespace and returns the trimmed inner text. Here is an example.

const $ = cheerio.load('<div class="fullname">  Glad Chinda </div>');
const elem = $('div.fullname');

fetchElemInnerText(elem); // returns => 'Glad Chinda'
  • fetchElemAttribute() - This is a higher-order function that expects an attribute as argument and returns another function that expects an element as argument. The returned function extracts the value of the given attribute of the element by calling elem.attr(attribute). Here is an example.

const $ = cheerio.load('<div class="username" title="Glad Chinda">@gladchinda</div>');
const elem = $('div.username');

// fetchTitle is a function that expects an element as argument
const fetchTitle = fetchElemAttribute('title');

fetchTitle(elem); // returns => 'Glad Chinda'
  • extractFromElems() - This is a higher-order function that returns another higher-order function. Here, we have used a functional programming technique known as currying to create a sequence of functions each requiring just one argument. Here is the sequence of arguments:

extractorFunction -> transformFunction -> elementsCollection -> cheerioInstance

extractFromElems() makes it possible to extract data from a collection of similar elements using an extractor function, and also to transform the extracted data using a transform function. The extractor function receives an element as argument, while the transform function receives an array of values as argument.

Let’s say we have a collection of elements, each containing the name of a person as innerText. We want to extract all these names and return them in an array, all in uppercase. Here is how we can do this using extractFromElems():

const $ = cheerio.load('<div class="people"><span>Glad Chinda</span><span>John Doe</span><span>Brendan Eich</span></div>');

// Get the collection of span elements containing names
const elems = $('div.people span');

// The transform function
const transformUpperCase = values => values.map(val => String(val).toUpperCase());

// The arguments sequence: extractorFn => transformFn => elemsCollection => cheerioInstance($)
// fetchElemInnerText is used as extractor function
const extractNames = extractFromElems(fetchElemInnerText)(transformUpperCase)(elems);

// Finally pass in the cheerioInstance($)
extractNames($); // returns => ['GLAD CHINDA', 'JOHN DOE', 'BRENDAN EICH']
  • extractNumber() - This is a composed function that expects an element as argument and tries to extract a number from the innerText of the element. It does this by composing parseInt(), sanitizeNumber() and fetchElemInnerText(). It has the same effect as executing:

parseInt( sanitizeNumber( fetchElemInnerText(elem) ) );
  • extractUrlAttribute() - This is a composed higher-order function that expects an attribute as argument and returns another function that expects an element as argument. The returned function tries to extract the URL value of an attribute in the element and returns it with the https scheme. Here is a snippet that shows how it works:

// METHOD 1
const fetchAttribute = fetchElemAttribute(attr);
enforceHttpsUrl( fetchAttribute(elem) );

// METHOD 2: Using extractUrlAttribute()
const fetchUrlAttribute = extractUrlAttribute(attr);
fetchUrlAttribute(elem);
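The extractNumber() pipeline described above can be traced without Cheerio by passing in a stand-in element. The sketch below is a simplified, Lodash-free version of the helpers: fakeElem() is a hypothetical factory that mimics only the text() method of a Cheerio selection, and toInteger() is an assumed wrapper that pins the radix to 10 (the tutorial passes parseInt directly).

```javascript
// Simplified compose(): right-to-left function application
const compose = (...fns) => arg =>
  fns.reduceRight((current, fn) => fn(current), arg);

// Lodash-free equivalents of the tutorial's helpers
const fetchElemInnerText = elem => (elem.text && elem.text().trim()) || null;

const sanitizeNumber = number =>
  typeof number === 'string'
    ? number.replace(/[^0-9-.]/g, '')
    : typeof number === 'number' ? number : null;

// Assumed wrapper: always parse as a base-10 integer
const toInteger = value => parseInt(value, 10);

const extractNumber = compose(toInteger, sanitizeNumber, fetchElemInnerText);

// Hypothetical stand-in for a Cheerio selection: only text() is provided
const fakeElem = text => ({ text: () => text });

// '  24,280 ' -> '24,280' -> '24280' -> 24280
console.log(extractNumber(fakeElem('  24,280 '))); // 24280

// An empty element yields null text, which parses to NaN
console.log(extractNumber(fakeElem(''))); // NaN
```

The second call shows that a missing or empty element produces NaN rather than null, so callers that serialize the result may want to check it with Number.isNaN() first.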

Finally, we export all the helper functions we have created using module.exports. Now that we have our helper functions, we can proceed to the web scraping part of this tutorial.

Step 5: Set Up Scraping by Calling the URL

Create a new file named scotch.js in the app directory of your project and add the following content to it:

/* app/scotch.js */

const _ = require('lodash');

// Import helper functions
const {
  compose,
  composeAsync,
  extractNumber,
  enforceHttpsUrl,
  fetchHtmlFromUrl,
  extractFromElems,
  fromPairsToObject,
  fetchElemInnerText,
  fetchElemAttribute,
  extractUrlAttribute
} = require("./helpers");

// scotch.io (Base URL)
const SCOTCH_BASE = "https://scotch.io";

///
// HELPER FUNCTIONS
///

/**
 * Resolves the url as relative to the base scotch url
 * and returns the full URL
 */
const scotchRelativeUrl = url =>
  _.isString(url) ? `${SCOTCH_BASE}${url.replace(/^\/*?/, "/")}` : null;

/**
 * A composed function that extracts a url from element attribute,
 * resolves it to the Scotch base url and returns the url with https
 */
const extractScotchUrlAttribute = attr =>
  compose(enforceHttpsUrl, scotchRelativeUrl, fetchElemAttribute(attr));

As you can see, we imported lodash as well as some of the helper functions we created earlier. We also defined a constant named SCOTCH_BASE that contains the base URL of the Scotch website. Finally, we added two helper functions:

  • scotchRelativeUrl() - This function takes a relative url string as argument and returns the URL with the pre-configured SCOTCH_BASE prepended to it. If the url is not a string then null is returned. Here is an example.

scotchRelativeUrl('tutorials'); // returns => 'https://scotch.io/tutorials'
scotchRelativeUrl('//tutorials'); // returns => 'https://scotch.io///tutorials'
scotchRelativeUrl('http://domain.com'); // returns => 'https://scotch.io/http://domain.com'
  • extractScotchUrlAttribute() - This is a composed higher-order function that expects an attribute as argument and returns another function that expects an element as argument. The returned function tries to extract the URL value of an attribute in the element, prepends the pre-configured SCOTCH_BASE to it and returns it with the https scheme. Here is a snippet that shows how it works:

// METHOD 1
const fetchAttribute = fetchElemAttribute(attr);
enforceHttpsUrl( scotchRelativeUrl( fetchAttribute(elem) ) );

// METHOD 2: Using extractScotchUrlAttribute()
const fetchUrlAttribute = extractScotchUrlAttribute(attr);
fetchUrlAttribute(elem);
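The scotchRelativeUrl() examples above keep any leading slashes already present in the input, because the lazy quantifier in its pattern matches the empty string and a slash is simply inserted in front. The sketch below reproduces that behaviour and shows two possible alternatives for comparison. Neither alternative is the tutorial's code: greedyResolve is a hypothetical variant with a greedy quantifier, and the last line uses the URL class from Node's built-in url module.

```javascript
const { URL } = require('url');

const SCOTCH_BASE = 'https://scotch.io';

// The tutorial's pattern: *? is lazy, so it matches zero characters and
// a "/" is inserted before whatever the string already starts with.
const lazyResolve = url => `${SCOTCH_BASE}${url.replace(/^\/*?/, '/')}`;

// Hypothetical alternative: a greedy * collapses leading slashes into one.
const greedyResolve = url => `${SCOTCH_BASE}${url.replace(/^\/*/, '/')}`;

console.log(lazyResolve('/tutorials'));   // https://scotch.io//tutorials
console.log(greedyResolve('/tutorials')); // https://scotch.io/tutorials
console.log(greedyResolve('tutorials'));  // https://scotch.io/tutorials

// Another option: let the URL class resolve a path against the base
console.log(new URL('/tutorials', SCOTCH_BASE).href); // https://scotch.io/tutorials
```

The URL class treats absolute and protocol-relative inputs differently from simple concatenation, so it is only a drop-in replacement when the scraped href values are known to be site-relative paths.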

Step 6: Using Extraction Functions

We want to be able to extract the following data for any Scotch author:

  • profile (name, role, avatar, etc.)

  • social links (Facebook, Twitter, GitHub, etc.)

  • stats (total views, total posts, etc.)

  • posts

If you recall, the extractFromElems() helper function we created earlier requires an extractor function for extracting content from a collection of similar elements. We are going to define some extractor functions in this section.

First, we will create an extractSocialUrl() function for extracting the social network name and URL from a social link <a> element. Here is the DOM structure of the social link <a> element expected by extractSocialUrl().

<a href="https://github.com/gladchinda" target="_blank" title="GitHub">
  <span class="icon icon-github">
    <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1" id="Capa_1" x="0px" y="0px" width="50" height="50" viewBox="0 0 512 512" style="enable-background:new 0 0 512 512;" xml:space="preserve">
      ...
    </svg>
  </span>
</a>

Calling the extractSocialUrl() function should return an object that looks like the following:

{ github: 'https://github.com/gladchinda' }

Let’s go on to create the function. Add the following content to the app/scotch.js file.

/* app/scotch.js */

///
// EXTRACTION FUNCTIONS
///

/**
 * Extract a single social URL pair from container element
 */
const extractSocialUrl = elem => {

  // Find all social-icon <span> elements
  const icon = elem.find('span.icon');

  // Regex for social classes
  const regex = /^(?:icon|color)-(.+)$/;

  // Extracts only social classes from the class attribute
  const onlySocialClasses = regex => (classes = '') => classes
      .replace(/\s+/g, ' ')
      .split(' ')
      .filter(classname => regex.test(classname));

  // Gets the social network name from a class name
  const getSocialFromClasses = regex => classes => {
    let social = null;
    const [classname = null] = classes;

    if (_.isString(classname)) {
      const [, name = null] = classname.match(regex);
      social = name ? _.snakeCase(name) : null;
    }

    return social;
  };

  // Extract the href URL from the element
  const href = extractUrlAttribute('href')(elem);

  // Get the social-network name using a composed function
  const social = compose(
    getSocialFromClasses(regex),
    onlySocialClasses(regex),
    fetchElemAttribute('class')
  )(icon);

  // Return an object of social-network-name(key) and social-link(value)
  // Else return null if no social-network-name was found
  return social && { [social]: href };

};

Let’s try to understand how the extractSocialUrl() function works:

  1. First, we fetch the <span> child element with an icon class. We also define a regular expression that matches social-icon class names.

  2. We define the onlySocialClasses() higher-order function, which takes a regular expression as its argument and returns a function. The returned function takes a string of class names separated by spaces. It then uses the regular expression to extract only the social class names from the list and returns them in an array. Here is an example:

const regex = /^(?:icon|color)-(.+)$/;
const extractSocial = onlySocialClasses(regex);
const classNames = 'first-class another-class color-twitter icon-github';

extractSocial(classNames); // returns [ 'color-twitter', 'icon-github' ]
  3. Next, we define the getSocialFromClasses() higher-order function, which takes a regular expression as its argument and returns a function. The returned function takes an array of single class strings. It then uses the regular expression to extract the social network name from the first class in the list and returns it. Here is an example:

const regex = /^(?:icon|color)-(.+)$/;
const extractSocialName = getSocialFromClasses(regex);
const classNames = [ 'color-twitter', 'icon-github' ];

extractSocialName(classNames); // returns 'twitter'
  4. Afterwards, we extract the href attribute URL from the element. We also extract the social network name from the <span> icon element using a composed function created by composing getSocialFromClasses(regex), onlySocialClasses(regex) and fetchElemAttribute('class').

  5. Finally, we return an object with the social network name as key and the href URL as value. However, if no social network was fetched, then null is returned. Here is an example of the returned object:

{ twitter: 'https://twitter.com/gladchinda' }
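Returning null for links that are not recognised pairs naturally with the fromPairsToObject() helper created earlier, which drops nulls before merging. The sketch below shows one plausible way the per-link results could be combined into a single object. The socialPairs array is hypothetical sample data standing in for the results of calling extractSocialUrl() on several <a> elements, and the helpers are Lodash-free approximations of the tutorial's versions.

```javascript
// Lodash-free approximations of the helpers from app/helpers.js
const withoutNulls = arr =>
  Array.isArray(arr) ? arr.filter(val => val !== null) : [];

const arrayPairsToObject = arr =>
  arr.reduce((obj, pair) => Object.assign({}, obj, pair), {});

const fromPairsToObject = arr => arrayPairsToObject(withoutNulls(arr));

// Hypothetical extractSocialUrl() results for three links; the second
// link had no recognisable social class, so it produced null.
const socialPairs = [
  { github: 'https://github.com/gladchinda' },
  null,
  { twitter: 'https://twitter.com/gladchinda' }
];

const socials = fromPairsToObject(socialPairs);
console.log(socials);
// { github: 'https://github.com/gladchinda', twitter: 'https://twitter.com/gladchinda' }
```

If two links resolve to the same network name, the later pair overwrites the earlier one, since the merge simply copies keys in order.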

Extracting Posts and Stats

Next, we will create two additional extraction functions, extractPost() and extractStat(), for extracting posts and stats respectively. Before we create the functions, let’s take a look at the DOM structure of the elements expected by these functions.

Here is the DOM structure of the element expected by extractPost().

<div class="card large-card" data-type="post" data-id="2448">
  <a href="/tutorials/password-strength-meter-in-angularjs" class="card__img lazy-background" data-src="https://cdn.scotch.io/7540/iKZoyh9WSlSzB9Bt5MNK_post-cover-photo.jpg">
    <span class="tag is-info">Post</span>
  </a>
  <h2 class="card__title">
    <a href="/tutorials/password-strength-meter-in-angularjs">Password Strength Meter in AngularJS</a>
  </h2>
  <div class="card-footer">
    <a class="name" href="/@gladchinda">Glad Chinda</a>
    <a href="/tutorials/password-strength-meter-in-angularjs" title="Views">
      ?️ <span>24,280</span>
    </a>
    <a href="/tutorials/password-strength-meter-in-angularjs#comments-section" title="Comments">
      ? <span class="comment-number" data-id="2448">5</span>
    </a>
  </div>
</div>

Here is the DOM structure of the element expected by extractStat().

<div class="profile__stat column is-narrow">
  <div class="stat">41,454</div>
  <div class="label">Pageviews</div>
</div>

Add the following content to the app/scotch.js file.

/* app/scotch.js */

/**
 * Extract a single post from container element
 */
const extractPost = elem => {
  const title = elem.find('.card__title a');
  const image = elem.find('a[data-src]');
  const views = elem.find("a[title='Views'] span");
  const comments = elem.find("a[title='Comments'] span.comment-number");

  return {
    title: fetchElemInnerText(title),
    image: extractUrlAttribute('data-src')(image),
    url: extractScotchUrlAttribute('href')(title),
    views: extractNumber(views),
    comments: extractNumber(comments)
  };
};

/**
 * Extract a single stat from container element
 */
const extractStat = elem => {
  const statElem = elem.find(".stat");
  const labelElem = elem.find('.label');

  const lowercase = val => _.isString(val) ? val.toLowerCase() : null;

  const stat = extractNumber(statElem);
  const label = compose(lowercase, fetchElemInnerText)(labelElem);

  return { [label]: stat };
};

The extractPost() function extracts the title, image, URL, views and comments of a post by parsing the children of the given element. It uses a couple of helper functions we created earlier to extract data from the appropriate elements.

Here is an example of the object returned from calling extractPost().

{
  title: "Password Strength Meter in AngularJS",
  image: "https://cdn.scotch.io/7540/iKZoyh9WSlSzB9Bt5MNK_post-cover-photo.jpg",
  url: "https://scotch.io//tutorials/password-strength-meter-in-angularjs",
  views: 24280,
  comments: 5
}

The extractStat() function extracts the stat data contained in the given element. Here is an example of the object returned from calling extractStat().

{ pageviews: 41454 }
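
The computed key in { [label]: stat } is what lets each stat element contribute one named entry. The following is a minimal, dependency-free sketch of that idea. Note that parseCount and toStatEntry are hypothetical stand-ins (they are not the tutorial's extractNumber() or extractStat()), and plain strings stand in for the text Cheerio would read from the .stat and .label elements.

```javascript
// Hypothetical stand-in for a number extractor: strips thousands separators
// such as the comma in "41,454" before parsing.
const parseCount = text => {
  const value = parseInt(String(text).replace(/,/g, ''), 10);
  return Number.isNaN(value) ? null : value;
};

// Builds a single-key object, using the lowercased label as a computed key.
const toStatEntry = (labelText, statText) => ({
  [labelText.trim().toLowerCase()]: parseCount(statText)
});

// Label/value text pairs as they might be read from three stat elements.
const rawStats = [
  ['Posts', '6'],
  ['Pageviews', '41,454'],
  ['Readers', '31,676']
];

// Merging the single-key objects gives one combined stats object.
const stats = Object.assign(
  {},
  ...rawStats.map(([label, stat]) => toStatEntry(label, stat))
);

console.log(stats);
```

Keeping each entry as its own small object means a later transform step can decide how to combine them, which is the role a transform function plays when several stat elements are extracted together.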

Step 7 - Extracting a Specific Web Page

Now we will proceed to define the extractAuthorProfile() function that extracts the complete profile of the Scotch author. Add the following content to the app/scotch.js file.

/* app/scotch.js */

/**
 * Extract profile from a Scotch author's page using the Cheerio parser instance
 * and returns the author profile object
 */
const extractAuthorProfile = $ => {

  const mainSite = $('#site__main');
  const metaScotch = $("meta[property='og:url']");
  const scotchHero = mainSite.find('section.hero--scotch');
  const superGrid = mainSite.find('section.super-grid');

  const authorTitle = scotchHero.find(".profile__name h1.title");
  const profileRole = authorTitle.find(".tag");
  const profileAvatar = scotchHero.find("img.profile__avatar");
  const profileStats = scotchHero.find(".profile__stats .profile__stat");
  const authorLinks = scotchHero.find(".author-links a[target='_blank']");
  const authorPosts = superGrid.find(".super-grid__item [data-type='post']");

  const extractPosts = extractFromElems(extractPost)();
  const extractStats = extractFromElems(extractStat)(fromPairsToObject);
  const extractSocialUrls = extractFromElems(extractSocialUrl)(fromPairsToObject);

  return Promise.all([
    fetchElemInnerText(authorTitle.contents().first()),
    fetchElemInnerText(profileRole),
    extractUrlAttribute('content')(metaScotch),
    extractUrlAttribute('src')(profileAvatar),
    extractSocialUrls(authorLinks)($),
    extractStats(profileStats)($),
    extractPosts(authorPosts)($)
  ]).then(([ author, role, url, avatar, social, stats, posts ]) => ({ author, role, url, avatar, social, stats, posts }));

};

/**
 * Fetches the Scotch profile of the given author
 */
const fetchAuthorProfile = author => {
  const AUTHOR_URL = `${SCOTCH_BASE}/@${author.toLowerCase()}`;
  return composeAsync(extractAuthorProfile, fetchHtmlFromUrl)(AUTHOR_URL);
};

module.exports = { fetchAuthorProfile };

The extractAuthorProfile() function is straightforward. We first use $ (the Cheerio parser instance) to find a couple of elements and element collections.

Next, we use the extractFromElems() helper function together with the extractor functions we created earlier in this section (extractPost, extractStat and extractSocialUrl) to create higher-order extraction functions. Notice how we use the fromPairsToObject helper function we created earlier as a transform function.
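
To illustrate the higher-order pattern, here is a dependency-free sketch of a curried extractor factory. It is not the tutorial's extractFromElems() helper: the name extractFromItems, the sample extractor and the mergeObjects transform are all hypothetical, and the step that receives the Cheerio instance ($) is left out so the example runs on plain arrays.

```javascript
// Hypothetical sketch: extractor -> optional transform -> list of items -> Promise.
// The tutorial's own helper also takes the Cheerio instance ($); that step is
// omitted here so the example needs no dependencies.
const extractFromItems = extractor => (transform = results => results) => items =>
  Promise.all(items.map(item => extractor(item))).then(transform);

// Example extractor: each item yields a single-key object.
const extractLink = ({ name, url }) => ({ [name]: url });

// Example transform: merge the single-key objects into one object.
const mergeObjects = objects => Object.assign({}, ...objects);

const links = [
  { name: 'twitter', url: 'https://twitter.com/gladchinda' },
  { name: 'github', url: 'https://github.com/gladchinda' }
];

// Without a transform, the results stay a list.
const asList = extractFromItems(extractLink)();
// With a transform, the list is reshaped before it is returned.
const asObject = extractFromItems(extractLink)(mergeObjects);

const listResult = asList(links);
const objectResult = asObject(links);

objectResult.then(social => console.log(social));
```

Calling the factory with no transform leaves the results as an array, which suits a list of posts, while passing a transform reshapes the results into a single object, which suits stats and social links.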

Finally, we use Promise.all() to extract all the required data, using a couple of the helper functions we created earlier. The extracted data is returned as an array in this order: author name, role, Scotch link, avatar link, social links, stats and posts.

Notice how we use destructuring in the .then() promise handler to construct the final object that is returned when all the promises resolve. The returned object should look like the following:

{
  author: 'Glad Chinda',
  role: 'Author',
  url: 'https://scotch.io/@gladchinda',
  avatar: 'https://cdn.scotch.io/7540/EnhoZyJOQ2ez9kVhsS9B_profile.jpg',
  social: {
    twitter: 'https://twitter.com/gladchinda',
    github: 'https://github.com/gladchinda'
  },
  stats: {
    posts: 6,
    pageviews: 41454,
    readers: 31676
  },
  posts: [
    {
      title: 'Password Strength Meter in AngularJS',
      image: 'https://cdn.scotch.io/7540/iKZoyh9WSlSzB9Bt5MNK_post-cover-photo.jpg',
      url: 'https://scotch.io//tutorials/password-strength-meter-in-angularjs',
      views: 24280,
      comments: 5
    },
    ...
  ]
}
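
The combination of Promise.all() and array destructuring can be seen in isolation in the short sketch below. The slow() helper and the sample values are made up for illustration. The point is that the resolved array keeps the order of the input array, even when the promises settle in a different order, which is why positional destructuring is safe here.

```javascript
// Hypothetical helper: resolves with the given value after a delay.
const slow = (value, ms) =>
  new Promise(resolve => setTimeout(() => resolve(value), ms));

// Promise.all() accepts plain values and promises alike. The resolved array
// follows the input order, not the order in which the promises settle.
const profile = Promise.all([
  'Glad Chinda',          // plain value
  slow('Author', 20),     // settles last
  slow({ posts: 6 }, 5)   // settles first
]).then(([author, role, stats]) => ({ author, role, stats }));

profile.then(result => console.log(result));
```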

We also define the fetchAuthorProfile() function that accepts an author’s Scotch username and returns a Promise that resolves to the profile of the author. For an author whose username is gladchinda, the Scotch URL is https://scotch.io/@gladchinda.

fetchAuthorProfile() uses the composeAsync() helper function to create a composed function that first fetches the DOM content of the author’s Scotch page using the fetchHtmlFromUrl() helper function, and finally extracts the profile of the author using the extractAuthorProfile() function we just created.
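
For readers who want to see the composition order spelled out, here is one possible way to write a right-to-left asynchronous composition. It is only a sketch under assumptions: composeAsyncSketch, fakeFetchHtml and fakeExtractTitle are hypothetical names, and this is not necessarily how the tutorial's composeAsync() helper is implemented.

```javascript
// One possible right-to-left async composition: the last function runs first,
// and each result (or resolved promise value) is passed to the function before it.
const composeAsyncSketch = (...fns) => input =>
  fns.reduceRight((chain, fn) => chain.then(fn), Promise.resolve(input));

// Stand-ins for the two stages: "fetch" a page, then "extract" data from it.
const fakeFetchHtml = url =>
  Promise.resolve(`<h1 class="title">Profile at ${url}</h1>`);
const fakeExtractTitle = html => html.replace(/<[^>]+>/g, '');

// Reads like the tutorial's call: extractor first, fetcher second.
const fetchTitle = composeAsyncSketch(fakeExtractTitle, fakeFetchHtml);
const title = fetchTitle('https://scotch.io/@gladchinda');

title.then(text => console.log(text));
```

Because every stage is chained with .then(), a stage may return either a plain value or a promise, and the composed function always returns a promise.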

Finally, we export fetchAuthorProfile as the only identifier in the module.exports object.

Step 8 - How to Create a Route

We are almost done with our API. We need to add a route to our server to enable us to fetch the profile of any Scotch author. The route will have the following structure, where the author parameter represents the username of the Scotch author.

GET /scotch/:author

Let’s go ahead and create this route. We will make a couple of changes to the server.js file. First, add the following to the server.js file to require some of the functions we need.

/* server.js */

// Require the needed functions
const { sendResponse } = require('./app/helpers');
const { fetchAuthorProfile } = require('./app/scotch');

Finally, add the route to the server.js file immediately after the middlewares.

/* server.js */

// Add the Scotch author profile route
app.get('/scotch/:author', (req, res, next) => {
  const author = req.params.author;
  sendResponse(res)(fetchAuthorProfile(author));
});

As you can see, we pass the author received from the route parameter to the fetchAuthorProfile() function to get the profile of the given author. We then use the sendResponse() helper method to send the returned profile as a JSON response.
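
The curried call sendResponse(res)(promise) can be pictured with the sketch below. This is an assumption-laden illustration, not the tutorial's helper: the name sendResponseSketch, the JSON shape and the status codes are hypothetical, and a small fake object stands in for the Express response so the example runs without a server.

```javascript
// Hypothetical sketch of a curried response helper. The JSON shape and status
// codes are assumptions, not the tutorial's actual sendResponse() output.
const sendResponseSketch = res => request =>
  Promise.resolve(request)
    .then(data => res.status(200).json({ status: 'success', data }))
    .catch(error => res.status(500).json({ status: 'failure', message: error.message }));

// Minimal fake of the two Express response methods used above.
const createFakeRes = () => {
  const res = { statusCode: null, body: null };
  res.status = code => { res.statusCode = code; return res; };
  res.json = payload => { res.body = payload; return res; };
  return res;
};

const okRes = createFakeRes();
const failRes = createFakeRes();

// A resolved promise produces a success payload, a rejected one an error payload.
const okDone = sendResponseSketch(okRes)(Promise.resolve({ author: 'Glad Chinda' }));
const failDone = sendResponseSketch(failRes)(Promise.reject(new Error('Profile not found')));

okDone.then(() => console.log(okRes.statusCode, okRes.body));
failDone.then(() => console.log(failRes.statusCode, failRes.body));
```

Taking the response object first and the promise second keeps the route handler to a single line, since the handler only has to pass along whatever promise the scraping function returns.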

We have successfully built our API using a web scraping technique. Test the API by running the npm start command in your terminal. Launch your favorite HTTP testing tool, e.g. Postman, and test the API endpoint. If you followed all the steps correctly, you should get a result that looks like the following demo:

[Image: Scotch Scraping API Demo (https://www.flickr.com/photos/100345980@N08/41038838905/in/dateposted-public/)]

Conclusion

In this tutorial, we have seen how to use web scraping techniques (especially DOM parsing) to extract data from a website. We used the Cheerio package to parse the content of a webpage with DOM methods that closely resemble those of the popular jQuery library. Note, however, that Cheerio has its limitations: it parses static markup and does not execute the page's JavaScript. You can achieve more advanced parsing with a fuller DOM implementation such as JSDOM or a headless browser such as PhantomJS.

You can find the source code for the API we built in this tutorial on GitHub. We have also built a demo app based on the API from this tutorial as shown in the initial screenshot. You can see the app on Heroku and the source code on GitHub.

Original article: https://www.digitalocean.com/community/tutorials/how-to-scrape-a-website-with-node-js

