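Introduction
For parsing HTML in crawlers I had always used cheerio. As jQuery fades from everyday use, cheerio's jQuery-like API has become a burden, so I now use node-html-parser, a parser that supports the Selectors API, in place of cheerio. The HTML Selectors API is finally the same on the front end and the back end.
Official page: https://www.npmjs.com/package/node-html-parser
Weekly downloads on npmjs.com: 1,777,505 for node-html-parser, compared with 6,696,323 for cheerio.
Execution speed: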
cheerio :12.0726 ms/file ± 7.31605
parse5 :8.18615 ms/file ± 6.15337
node-html-parser (last release):2.16533 ms/file ± 1.56924
htmlparser :17.0658 ms/file ± 120.901
htmlparser2 :2.62695 ms/file ± 4.17579
node-html-parser:2.14907 ms/file ± 1.66632
html-parser :24.6505 ms/file ± 18.9996
htmljs-parser :5.81797 ms/file ± 6.55537
html-dom-parser :2.52265 ms/file ± 3.54858
html5parser :2.01144 ms/file ± 2.53570
high5 :3.91342 ms/file ± 2.65563
Installation
npm install --save node-html-parser
Usage
// const { parse } = require('node-html-parser');
import { parse } from 'node-html-parser';
const root = parse('<ul id="list"><li>Hello World</li></ul>');
console.log(root.firstChild.structure);
// ul#list
//   li
//     #text
console.log(root.querySelector('#list'));
// { tagName: 'ul',
//   rawAttrs: 'id="list"',
//   childNodes:
//    [ { tagName: 'li',
//        rawAttrs: '',
//        childNodes: [Object],
//        classNames: [] } ],
//   id: 'list',
//   classNames: [] }
console.log(root.toString());
// <ul id="list"><li>Hello World</li></ul>
root.set_content('<li>Hello World</li>');
root.toString(); // <li>Hello World</li>
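// CommonJS alternative to the import above (use one or the other, since
// both declare `root`):
// var HTMLParser = require('node-html-parser');
// var root = HTMLParser.parse('<ul id="list"><li>Hello World</li></ul>');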
Core methods
parse(data[, options])
data is the string to parse; the call returns the root node of the generated DOM.
The options are:
{
  lowerCaseTagName: false, // convert tag name to lower case (hurts performance heavily)
  comment: false,          // retrieve comments (hurts performance slightly)
  blockTextElements: {
    script: true,   // keep text content when parsing
    noscript: true, // keep text content when parsing
    style: true,    // keep text content when parsing
    pre: true       // keep text content when parsing
  }
}
valid(data[, options])
Checks whether the string to be parsed is valid; returns true if it is and false otherwise.
HTMLElement Methods
HTMLElement#trimRight()
Trim element from right (in block) after seeing pattern in a TextNode.
HTMLElement#removeWhitespace()
Remove whitespaces in this sub tree.
HTMLElement#querySelectorAll(selector)
Query CSS selector to find matching nodes.
Note: Full range of CSS3 selectors supported since v3.0.0.
HTMLElement#querySelector(selector)
Query CSS Selector to find matching node.
HTMLElement#getElementsByTagName(tagName)
Get all elements with the specified tagName. Note: Use * for all elements.
HTMLElement#closest(selector)
Query closest element by css selector.
HTMLElement#appendChild(node)
Append a child node to childNodes.
HTMLElement#insertAdjacentHTML(where, html)
Parses the specified text as HTML and inserts the resulting nodes into the DOM tree at a specified position.
HTMLElement#setAttribute(key: string, value: string)
Set value to key attribute.
HTMLElement#setAttributes(attrs: Record<string, string>)
Set attributes of the element.
HTMLElement#removeAttribute(key: string)
Remove key attribute.
HTMLElement#getAttribute(key: string)
Get key attribute.
HTMLElement#exchangeChild(oldNode: Node, newNode: Node)
Exchanges given child with new child.
HTMLElement#removeChild(node: Node)
Remove child node.
HTMLElement#toString()
Same as outerHTML.
HTMLElement#set_content(content: string | Node | Node[])
Set content. Notice: Do not set content of the root node.
HTMLElement#remove()
Remove current element.
HTMLElement#replaceWith(...nodes: (string | Node)[])
Replace current element with other node(s).
HTMLElement#classList
HTMLElement#classList.add(className: string)
Add class name.
HTMLElement#classList.replace(old: string, new: string)
Replace class name with another one.
HTMLElement#classList.remove(className: string)
Remove class name.
HTMLElement#classList.toggle(className: string):void
Toggle class. Remove it if it is already included, otherwise add.
HTMLElement#classList.contains(className: string): boolean
Returns true if the class name is already in the classList.
HTMLElement#classList.values()
Get class names.
HTMLElement Properties
HTMLElement#text
Get unescaped text value of current node and its children. Like innerText. (slow for the first time)
HTMLElement#rawText
Get escaped (as-is) text value of current node and its children. May contain entities such as &amp;. (fast)
HTMLElement#tagName
Get or Set tag name of HTMLElement. Notice: the returned value would be an uppercase string.
HTMLElement#structuredText
Get structured Text.
HTMLElement#structure
Get DOM structure.
HTMLElement#firstChild
Get first child node.
HTMLElement#lastChild
Get last child node.
HTMLElement#innerHTML
Set or Get innerHTML.
HTMLElement#outerHTML
Get outerHTML.
HTMLElement#nextSibling
Returns a reference to the next child node of the current element's parent.
HTMLElement#nextElementSibling
Returns a reference to the next child element of the current element's parent.
HTMLElement#textContent
Get or Set textContent of current element, more efficient than set_content.
HTMLElement#attributes
Get all attributes of current element. Notice: do not try to change the returned value.
HTMLElement#classList
Get the class names of the current element as a token list. Change it through the classList methods listed under HTMLElement Methods (add, replace, remove, toggle).
HTMLElement#range
Corresponding source code start and end indexes (e.g. [0, 40]).
Reference:
https://www.npmjs.com/package/node-html-parser
Last modified: 2022-03-03