简介
做爬虫解析 HTML,之前一直是用 cheerio,随着 jQuery 的渐行渐远,使用 cheerio 的类 jQuery API 已经成为一种负担, 我使用这个支持 Selectors API 的解析器 node-html-parser 来代替 cheerio。前后端的 HTML Selectors API 终于统一了。
官方地址:https://www.npmjs.com/package/node-html-parser
在 npmjs.com 的周下载量是:1,777,505。 cheerio 的周下载量是 6,696,323。
执行速度上:
cheerio :12.0726 ms/file ± 7.31605 parse5 :8.18615 ms/file ± 6.15337 node-html-parser (last release):2.16533 ms/file ± 1.56924 htmlparser :17.0658 ms/file ± 120.901 htmlparser2 :2.62695 ms/file ± 4.17579 node-html-parser:2.14907 ms/file ± 1.66632 html-parser :24.6505 ms/file ± 18.9996 htmljs-parser :5.81797 ms/file ± 6.55537 html-dom-parser :2.52265 ms/file ± 3.54858 html5parser :2.01144 ms/file ± 2.53570 high5 :3.91342 ms/file ± 2.65563
安装
npm install --save node-html-parser
使用
// const { parse } = require('node-html-parser'); import { parse } from 'node-html-parser'; const root = parse('<ul id="list"><li>Hello World</li></ul>'); console.log(root.firstChild.structure); // ul#list // li // #text console.log(root.querySelector('#list')); // { tagName: 'ul', // rawAttrs: 'id="list"', // childNodes: // [ { tagName: 'li', // rawAttrs: '', // childNodes: [Object], // classNames: [] } ], // id: 'list', // classNames: [] } console.log(root.toString()); // <ul id="list"><li>Hello World</li></ul> root.set_content('<li>Hello World</li>'); root.toString(); // <li>Hello World</li> var HTMLParser = require('node-html-parser'); var root = HTMLParser.parse('<ul id="list"><li>Hello World</li></ul>');
核心方法
parse(data[, options])
data 是需要解析的字符串,返回生成的 DOM 对象根节点。
options 如下:
{ lowerCaseTagName: false, // convert tag name to lower case (hurts performance heavily) comment: false, // retrieve comments (hurts performance slightly) blockTextElements: { script: true, // keep text content when parsing noscript: true, // keep text content when parsing style: true, // keep text content when parsing pre: true // keep text content when parsing } }
valid(data[, options])
验证需要解析的字符串是否合法。
HTMLElement Methods
HTMLElement#trimRight()
Trim element from right (in block) after seeing pattern in a TextNode.
HTMLElement#removeWhitespace()
Remove whitespaces in this sub tree.
HTMLElement#querySelectorAll(selector)
Query CSS selector to find matching nodes.
Note: Full range of CSS3 selectors supported since v3.0.0.
HTMLElement#querySelector(selector)
Query CSS Selector to find matching node.
HTMLElement#getElementsByTagName(tagName)
Get all elements with the specified tagName. Note: Use * for all elements.
HTMLElement#closest(selector)
Query closest element by css selector.
HTMLElement#appendChild(node)
Append a child node to childNodes
HTMLElement#insertAdjacentHTML(where, html)
Parses the specified text as HTML and inserts the resulting nodes into the DOM tree at a specified position.
HTMLElement#setAttribute(key: string, value: string)
Set value to key attribute.
HTMLElement#setAttributes(attrs: Record<string, string>)
Set attributes of the element.
HTMLElement#removeAttribute(key: string)
Remove key attribute.
HTMLElement#getAttribute(key: string)
Get key attribute.
HTMLElement#exchangeChild(oldNode: Node, newNode: Node)
Exchanges given child with new child.
HTMLElement#removeChild(node: Node)
Remove child node.
HTMLElement#toString()
Same as outerHTML
HTMLElement#set_content(content: string | Node | Node[])
Set content. Notice: Do not set content of the root node.
HTMLElement#remove()
Remove current element.
HTMLElement#replaceWith(...nodes: (string | Node)[])
Replace current element with other node(s).
HTMLElement#classList
HTMLElement#classList.add
Add class name.
HTMLElement#classList.replace(old: string, new: string)
Replace class name with another one.
HTMLElement#classList.remove()
Remove class name.
HTMLElement#classList.toggle(className: string):void
Toggle class. Remove it if it is already included, otherwise add.
HTMLElement#classList.contains(className: string): boolean
Returns true if the classname is already in the classList.
HTMLElement#classList.values()
Get class names.
HTMLElement Properties
HTMLElement#text
Get unescaped text value of current node and its children. Like innerText. (slow for the first time)
HTMLElement#rawText
Get escaped (as-is) text value of current node and its children. May have & in it. (fast)
HTMLElement#tagName
Get or Set tag name of HTMLElement. Notice: the returned value would be an uppercase string.
HTMLElement#structuredText
Get structured Text.
HTMLElement#structure
Get DOM structure.
HTMLElement#firstChild
Get first child node.
HTMLElement#lastChild
Get last child node.
HTMLElement#innerHTML
Set or Get innerHTML.
HTMLElement#outerHTML
Get outerHTML.
HTMLElement#nextSibling
Returns a reference to the next child node of the current element's parent.
HTMLElement#nextElementSibling
Returns a reference to the next child element of the current element's parent.
HTMLElement#textContent
Get or Set textContent of current element, more efficient than set_content.
HTMLElement#attributes
Get all attributes of current element. Notice: do not try to change the returned value.
HTMLElement#classList
Get all attributes of current element. Notice: do not try to change the returned value.
HTMLElement#range
Corresponding source code start and end indexes (ie [ 0, 40 ])
参考:
https://www.npmjs.com/package/node-html-parser
修改时间 2022-03-03