How to implement an HTML parser using JS

Mondo Technology Updated on 2024-01-30

One very important component at the core of a browser is the HTML parser. Its job is to parse an HTML string into a tree in which every element is a node. Many readers are curious about how this is done, so this article uses JS to implement a simple HTML parser.

1. Effect. We need to implement a parse method that takes an HTML string and returns a tree structure:

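const root = parse(`<div id="test" class="container" c="b"><div class="text-block"><span id="xxx">hello world</span></div><img src="xx.jpg" /></div>`);

console.log(root);

Output (abridged to the rawattrs, type, range and children fields of each node):

{"rawattrs":"id=\"test\" class=\"container\" c=\"b\"","type":"element","range":[0,128],"children":[

{"rawattrs":"class=\"text-block\"","type":"element","range":[39,102],"children":[

{"rawattrs":"id=\"xxx\"","type":"element","range":[63,96],"children":[

{"type":"text","range":[78,89],"value":"hello world"}]}]},

{"rawattrs":"src=\"xx.jpg\" ","type":"element","range":[102,122],"children":[]}]}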

The approach has two parts: match each tag with a regular expression, and pair opening and closing tags using a stack (first in, last out). First we need to initialize a few simple variables and stub out the method:

Initialize 2 node types.

HTML defines more node types (nodeType) than these; to keep the core principle clear, the less important ones are omitted here.

const nodetype = { element: 'element', text: 'text' };

Set root as the parent node.

let currentparent = root;

Stack management.

const stack = [root];

let lasttextpos = -1;

Stitch together the simulated root node and the HTML that needs to be parsed.

data = `<${frameflag}>${data}</${frameflag}>`;

... Start traversing and parsing.

After processing, the stack holds the final result, so we return it.

return stack;

Let's illustrate this with an example. Take the following HTML snippet:

hello world

For this snippet, we need to parse out the following strings in turn:
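As a rough sketch of what "in turn" means here, the code below walks a hypothetical input and lists each tag and each run of text between tags in order of appearance. The input string, the simplified pattern tagpattern and the helper name tokenize are assumptions made for illustration, not the article's own code.

```javascript
// Hypothetical simplified tag pattern: optional "/", tag name, raw attributes, optional "/".
const tagpattern = /<(\/?)([a-zA-Z][-.:0-9_a-zA-Z]*)([^>]*?)(\/?)>/g;

// Return every tag and every run of text between tags, in source order.
function tokenize(html) {
  const tokens = [];
  let lastpos = 0;
  for (const match of html.matchAll(tagpattern)) {
    // Any characters between the previous tag and this one are plain text.
    if (match.index > lastpos) tokens.push(html.slice(lastpos, match.index));
    tokens.push(match[0]);
    lastpos = match.index + match[0].length;
  }
  // Text after the last tag, if any.
  if (lastpos < html.length) tokens.push(html.slice(lastpos));
  return tokens;
}

// Logs five tokens in order: <div>, <span id="a">, hello world, </span>, </div>
console.log(tokenize('<div><span id="a">hello world</span></div>'));
```

The parser does the same walk, but instead of collecting strings it builds a node for each token.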

Before parsing, let's review RegExp.prototype.exec(); skip this part if you already know how to use it.

The exec() method searches a string for a match and returns a result array or null. If the regex has the global flag set, repeated calls walk through the matches one by one, and after each match the regex's lastIndex property records the position where that match ended. Take a look at the demo below:

const regex = /foo/g;

const str = 'table football, foosball';

let matcharray;

while ((matcharray = regex.exec(str)) !== null) {
  console.log(`found ${matcharray[0]}. next starts at ${regex.lastIndex}.`);
}

expected output: "found foo. next starts at 9."

expected output: "found foo. next starts at 19."

We can then use this behavior of regex.exec to match the strings we need one by one:

Refer to the tag documentation:

const kmarkuppattern = //g;

while ((match = kmarkuppattern.exec(data))) {

const [matchtext, leadingslash, tagname, attributes] = match;

The length of the string that was matched this time.

const matchlength = matchtext.length;

The starting position of this match.

const tagstartpos = kmarkuppattern.lastIndex - matchlength;

The position at the end of this match.

const tagendpos = kmarkuppattern.lastIndex;

if (lasttextpos > 1) )

Record the last matched location.

lasttextpos = kmarkuppattern.lastIndex;

If the matched tag is the simulated root tag, skip it.

if (tagname === frameflag) continue;

... Handle the logic for nodes whose nodetype is element.
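To make the position bookkeeping concrete, here is a minimal sketch of how the text between two matches could be turned into a text node. The simplified pattern, the collecttext helper and the node shape ({ type, range, value }) are assumptions for illustration; the shape loosely mirrors the text node in the sample output at the start of the article.

```javascript
// Hypothetical simplified tag pattern (an assumption, not the article's exact regex).
const pattern = /<(\/?)([a-zA-Z][-.:0-9_a-zA-Z]*)([^>]*?)(\/?)>/g;

// Collect a text node for every non-empty run of text that sits between two tags.
function collecttext(data) {
  const textnodes = [];
  let lasttextpos = -1; // -1 means "no tag matched yet"
  let match;
  pattern.lastIndex = 0; // reset: a global regex keeps its position between calls
  while ((match = pattern.exec(data)) !== null) {
    const matchlength = match[0].length;
    const tagendpos = pattern.lastIndex;         // exec() leaves lastIndex just past the match
    const tagstartpos = tagendpos - matchlength; // so the match started here
    if (lasttextpos > -1 && lasttextpos < tagstartpos) {
      textnodes.push({
        type: 'text',
        range: [lasttextpos, tagstartpos],
        value: data.substring(lasttextpos, tagstartpos),
      });
    }
    lasttextpos = tagendpos; // the next text run, if any, starts after this tag
  }
  return textnodes;
}

// Yields two text nodes: "hello " at [3, 9] and "world" at [12, 17].
console.log(collecttext('<p>hello <b>world</b></p>'));
```

Subtracting the match length from lastIndex gives the same value as the index property of the match array; either can be used.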

Next, let's handle the logic for opening tags. Opening tags include both self-closing and non-self-closing tags; let's go straight to the code:

if (!leadingslash) {

const attrs = {};

Parse the id and class attributes and store them on the attrs object.

const kattributepattern = /(?:^|\s)(id|class)\s*=\s*(('[^']*')|("[^"]*")|\S+)/gi;

for (let attmatch; (attmatch = kattributepattern.exec(attributes)); ) {

const [, key, val] = attmatch;

Whether the attribute value is in quotation marks.

const isquoted = val[0] === `'` || val[0] === `"`;

attrs[key.toLowerCase()] = isquoted ? val.slice(1, val.length - 1) : val;

}
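The attribute loop can be tried in isolation. The sketch below wraps it in a hypothetical parseattrs helper (the function name is an assumption) and uses a pattern that anchors on the start of the string or on whitespace and accepts unquoted values via \S+.

```javascript
// Matches id/class attributes whose value is single-quoted, double-quoted, or unquoted.
const attributepattern = /(?:^|\s)(id|class)\s*=\s*('[^']*'|"[^"]*"|\S+)/gi;

// Hypothetical helper: extract id and class from a raw attribute string.
function parseattrs(rawattrs) {
  const attrs = {};
  for (const [, key, val] of rawattrs.matchAll(attributepattern)) {
    const isquoted = val[0] === "'" || val[0] === '"';
    // Strip the surrounding quotes when present; keep unquoted values as they are.
    attrs[key.toLowerCase()] = isquoted ? val.slice(1, -1) : val;
  }
  return attrs;
}

console.log(parseattrs('id="test" class="container" c="b"'));
// { id: 'test', class: 'container' }
```

Attributes other than id and class (such as c="b" above) are ignored by this pattern but remain available in the raw attribute string.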

const currentnode = ;

Put the current node information into the children of the currentparent.

currentparent.children.push(currentnode);

Reset the currentparent node to the current node.

currentparent = currentnode;

Push each node onto the stack in turn; it is popped off again when the matching closing tag is encountered later.

stack.push(currentparent);

The stack is essential here: its first-in, last-out order is what lets us pair each opening tag with its corresponding closing tag.
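The pairing idea can be shown on its own, stripped of attributes and text nodes. The following sketch (the ispaired name and the simplified pattern are assumptions) pushes the name of each opening tag and pops on each closing tag, reporting whether every tag is properly paired.

```javascript
// Hypothetical helper: check that opening and closing tags are correctly nested.
function ispaired(html) {
  const pattern = /<(\/?)([a-zA-Z][-.:0-9_a-zA-Z]*)[^>]*?(\/?)>/g;
  const stack = [];
  for (const [, leadingslash, tagname, closingslash] of html.matchAll(pattern)) {
    if (closingslash) continue;  // <img /> opens and closes at once
    if (!leadingslash) {
      stack.push(tagname);       // opening tag: remember it
    } else if (stack.pop() !== tagname) {
      return false;              // closing tag does not match the most recent opening tag
    }
  }
  return stack.length === 0;     // anything left over was never closed
}

console.log(ispaired('<div><span>hi</span><img src="x.jpg" /></div>')); // true
console.log(ispaired('<div><span>hi</div></span>'));                    // false
```

Because the most recently opened tag is always on top of the stack, a closing tag only ever needs to be compared with that one entry.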

After pushing the node onto the stack while handling the opening tag, we also need to update its range and pop it off the stack once the matching closing tag is found.

Self-closing elements. 

const kselfclosingelements = else
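Putting the pieces together, the following is a minimal end-to-end sketch under several assumptions: the simplified kmarkuppattern, the list of void elements in kselfclosingelements, the plain-object node shape, and returning a root object instead of stitching a simulated root tag onto the input are all choices made for this illustration rather than the article's exact code. Comments, script and style are not handled.

```javascript
const nodetype = { element: 'element', text: 'text' };
// Assumed simplified patterns for tags and for id/class attributes.
const kmarkuppattern = /<(\/?)([a-zA-Z][-.:0-9_a-zA-Z]*)\s*([^>]*?)\s*(\/?)>/g;
const kattributepattern = /(?:^|\s)(id|class)\s*=\s*('[^']*'|"[^"]*"|\S+)/gi;
// Void elements never have a closing tag, so they are never pushed onto the stack.
const kselfclosingelements = new Set(['area', 'base', 'br', 'col', 'embed', 'hr',
  'img', 'input', 'link', 'meta', 'source', 'track', 'wbr']);

function parse(data) {
  const root = { type: nodetype.element, tagname: null, attrs: {}, rawattrs: '',
    range: [0, data.length], children: [] };
  const stack = [root];
  let currentparent = root;
  let lasttextpos = 0;
  let match;
  kmarkuppattern.lastIndex = 0;
  while ((match = kmarkuppattern.exec(data)) !== null) {
    const [matchtext, leadingslash, rawname, attributes, closingslash] = match;
    const tagname = rawname.toLowerCase();
    const tagendpos = kmarkuppattern.lastIndex;
    const tagstartpos = tagendpos - matchtext.length;
    // Text between the previous tag and this one becomes a text node.
    if (lasttextpos < tagstartpos) {
      currentparent.children.push({ type: nodetype.text,
        range: [lasttextpos, tagstartpos], value: data.substring(lasttextpos, tagstartpos) });
    }
    lasttextpos = tagendpos;
    if (!leadingslash) {
      // Opening tag: parse id/class, create the node, attach it to the current parent.
      const attrs = {};
      for (const [, key, val] of attributes.matchAll(kattributepattern)) {
        const isquoted = val[0] === "'" || val[0] === '"';
        attrs[key.toLowerCase()] = isquoted ? val.slice(1, -1) : val;
      }
      const currentnode = { type: nodetype.element, tagname, attrs,
        rawattrs: attributes, range: [tagstartpos, tagendpos], children: [] };
      currentparent.children.push(currentnode);
      if (!closingslash && !kselfclosingelements.has(tagname)) {
        stack.push(currentnode);
        currentparent = currentnode;
      }
    } else {
      // Closing tag: find the nearest open element with the same name.
      const index = stack.map((node) => node.tagname).lastIndexOf(tagname);
      if (index > 0) {
        stack[index].range[1] = tagendpos; // the range now covers the whole element
        stack.length = index;              // pop it, plus any unclosed descendants
        currentparent = stack[stack.length - 1];
      }
    }
  }
  // Text after the last tag, if any.
  if (lasttextpos < data.length) {
    currentparent.children.push({ type: nodetype.text,
      range: [lasttextpos, data.length], value: data.substring(lasttextpos) });
  }
  return root;
}

const tree = parse('<div id="a"><p class="x y">hi<br>there</p></div>');
console.log(JSON.stringify(tree.children[0].range)); // [0,48]
console.log(tree.children[0].children[0].children.map((n) => n.tagname || n.value));
// [ 'hi', 'br', 'there' ]
```

Looking up the closing tag's name in the stack, rather than only popping the top entry, is a small leniency so that an unclosed inner tag does not derail the rest of the tree; a stricter sketch could compare against the top of the stack only.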

The above explains how to implement a basic HTML parser with JS, but some cases are not handled: for example, the processing of script, style and other tags is omitted (the nodetype list is not complete). Also, the nodes above are represented as plain objects, whereas in practice the object for each nodetype inherits from Node: Element, HTMLElement, Text and Comment. Interested readers can implement a real HTML parser based on the W3C standard.
