node-html-parser: An HTML Parser That Supports the Selectors API

Introduction

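For parsing HTML in crawlers I had always used cheerio. As jQuery fades from everyday use, cheerio's jQuery-style API has become a burden, so I now use node-html-parser, a parser that supports the Selectors API, in its place. The HTML Selectors API is finally the same on the front end and the back end.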

Official page: https://www.npmjs.com/package/node-html-parser

Weekly downloads on npmjs.com at the time of writing: 1,777,505 for node-html-parser, compared with 6,696,323 for cheerio.

Execution speed (ms per file, lower is better):

cheerio                         : 12.0726 ms/file ± 7.31605
parse5                          : 8.18615 ms/file ± 6.15337
node-html-parser (last release) : 2.16533 ms/file ± 1.56924
htmlparser                      : 17.0658 ms/file ± 120.901
htmlparser2                     : 2.62695 ms/file ± 4.17579
node-html-parser                : 2.14907 ms/file ± 1.66632
html-parser                     : 24.6505 ms/file ± 18.9996
htmljs-parser                   : 5.81797 ms/file ± 6.55537
html-dom-parser                 : 2.52265 ms/file ± 3.54858
html5parser                     : 2.01144 ms/file ± 2.53570
high5                           : 3.91342 ms/file ± 2.65563

Installation

npm install --save node-html-parser

Usage

// const { parse } = require('node-html-parser');

import { parse } from 'node-html-parser';

const root = parse('<ul id="list"><li>Hello World</li></ul>');

console.log(root.firstChild.structure);
// ul#list
//   li
//     #text

console.log(root.querySelector('#list'));
// { tagName: 'ul',
//   rawAttrs: 'id="list"',
//   childNodes:
//    [ { tagName: 'li',
//        rawAttrs: '',
//        childNodes: [Object],
//        classNames: [] } ],
//   id: 'list',
//   classNames: [] }
console.log(root.toString());
// <ul id="list"><li>Hello World</li></ul>
root.set_content('<li>Hello World</li>');
root.toString();	// <li>Hello World</li>

The same setup using CommonJS:

var HTMLParser = require('node-html-parser');

var root = HTMLParser.parse('<ul id="list"><li>Hello World</li></ul>');

Core methods

parse(data[, options])

data is the string to parse. The function returns the root node of the generated DOM tree.

The options are:

{
  lowerCaseTagName: false,  // convert tag name to lower case (hurts performance heavily)
  comment: false,           // retrieve comments (hurts performance slightly)
  blockTextElements: {
    script: true,           // keep text content when parsing
    noscript: true,         // keep text content when parsing
    style: true,            // keep text content when parsing
    pre: true               // keep text content when parsing
  }
}

valid(data[, options])

Checks whether the string to be parsed is valid HTML.

HTMLElement Methods

HTMLElement#trimRight()

Trim element from right (in block) after seeing pattern in a TextNode.

HTMLElement#removeWhitespace()

Remove whitespace in this subtree.

HTMLElement#querySelectorAll(selector)

Find all nodes that match a CSS selector.

Note: Full range of CSS3 selectors supported since v3.0.0.

HTMLElement#querySelector(selector)

Find the first node that matches a CSS selector.

HTMLElement#getElementsByTagName(tagName)

Get all elements with the specified tagName. Note: Use * for all elements.

HTMLElement#closest(selector)

Find the closest ancestor element (or the element itself) that matches a CSS selector.

HTMLElement#appendChild(node)

Append a child node to childNodes.

HTMLElement#insertAdjacentHTML(where, html)

Parses the specified text as HTML and inserts the resulting nodes into the DOM tree at a specified position.

HTMLElement#setAttribute(key: string, value: string)

Set the attribute named key to value.

HTMLElement#setAttributes(attrs: Record<string, string>)

Set attributes of the element.

HTMLElement#removeAttribute(key: string)

Remove the attribute named key.

HTMLElement#getAttribute(key: string)

Get the value of the attribute named key.

HTMLElement#exchangeChild(oldNode: Node, newNode: Node)

Replace the given child node with a new node.

HTMLElement#removeChild(node: Node)

Remove child node.

HTMLElement#toString()

Same as outerHTML.

HTMLElement#set_content(content: string | Node | Node[])

Set content. Notice: Do not set content of the root node.

HTMLElement#remove()

Remove current element.

HTMLElement#replaceWith(...nodes: (string | Node)[])

Replace current element with other node(s).

HTMLElement#classList

HTMLElement#classList.add(className: string)

Add class name.

HTMLElement#classList.replace(old: string, new: string)

Replace class name with another one.

HTMLElement#classList.remove(className: string)

Remove class name.

HTMLElement#classList.toggle(className: string): void

Toggle class. Remove it if it is already included, otherwise add.

HTMLElement#classList.contains(className: string): boolean

Returns true if the classname is already in the classList.

HTMLElement#classList.values()

Get class names.

HTMLElement Properties

HTMLElement#text

Get unescaped text value of current node and its children. Like innerText. (slow for the first time)

HTMLElement#rawText

Get escaped (as-is) text value of current node and its children. May have & in it. (fast)

HTMLElement#tagName

Get or set the tag name of the HTMLElement. Note: the getter returns an uppercase string.

HTMLElement#structuredText

Get structured Text.

HTMLElement#structure

Get DOM structure.

HTMLElement#firstChild

Get first child node.

HTMLElement#lastChild

Get last child node.

HTMLElement#innerHTML

Set or Get innerHTML.

HTMLElement#outerHTML

Get outerHTML.

HTMLElement#nextSibling

Returns a reference to the next child node of the current element's parent.

HTMLElement#nextElementSibling

Returns a reference to the next child element of the current element's parent.

HTMLElement#textContent

Get or set the textContent of the current element. Setting it is more efficient than calling set_content.

HTMLElement#attributes

Get all attributes of current element. Notice: do not try to change the returned value.

HTMLElement#classList

Get the class names of the current element as a token list; see the classList methods listed above.

HTMLElement#range

Start and end indexes of the element in the source code (e.g. [ 0, 40 ]).

Reference:

https://www.npmjs.com/package/node-html-parser


Last modified: 2022-03-03
