This repository provides a minimal HTML parser written in Go without using any external libraries. The parser reads an HTML string, breaks it down into elements, and constructs a hierarchical structure of nodes, allowing you to inspect or manipulate the HTML content programmatically.
- Basic HTML Parsing: Parses HTML elements and text nodes.
- Attributes Parsing: Handles element attributes.
- Hierarchical Structure: Builds a tree structure representing the HTML document.
- Console Output: Prints the parsed tree to the console in a formatted layout that mirrors the nesting of the original document.
The parser identifies two types of nodes:
- ElementNode: Represents an HTML element with a tag (e.g., `<div>`, `<p>`) and optional attributes.
- TextNode: Represents text content within an HTML element.
The parser tokenizes the HTML input and recursively constructs a tree of nodes. Each node can have a list of child nodes, making it easy to visualize or traverse the document structure.
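To make the node tree concrete, here is a minimal sketch of how the two node types and a recursive traversal could look. The field names (`Type`, `Tag`, `Attrs`, `Text`, `Children`) and the `collectText` helper are assumptions for illustration; the struct in this repository may be laid out differently.

```go
package main

import (
	"fmt"
	"strings"
)

// NodeType distinguishes the two kinds of nodes described above.
type NodeType int

const (
	ElementNode NodeType = iota // an element such as <div> or <p>
	TextNode                    // text content inside an element
)

// Node is a hypothetical layout; the repository's struct may differ.
type Node struct {
	Type     NodeType
	Tag      string            // set for ElementNode
	Attrs    map[string]string // set for ElementNode, may be nil
	Text     string            // set for TextNode
	Children []*Node
}

// collectText walks the tree depth-first and concatenates all text nodes.
func collectText(n *Node) string {
	if n.Type == TextNode {
		return n.Text
	}
	var sb strings.Builder
	for _, c := range n.Children {
		sb.WriteString(collectText(c))
	}
	return sb.String()
}

func main() {
	// Hand-built tree for <p>This is a <b>simple</b> parser.</p>
	p := &Node{Type: ElementNode, Tag: "p", Children: []*Node{
		{Type: TextNode, Text: "This is a "},
		{Type: ElementNode, Tag: "b", Children: []*Node{
			{Type: TextNode, Text: "simple"},
		}},
		{Type: TextNode, Text: " parser."},
	}}
	fmt.Println(collectText(p))
}
```

Because every node carries its own list of children, any traversal (printing, searching, counting) can follow the same recursive shape as `collectText`.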
For the HTML input:
<html>
<head><title>Sample Page</title></head>
<body>
<h1>Welcome to the Sample Page</h1>
<p>This is a <b>simple</b> HTML parser in Go.</p>
</body>
</html>
The output would look like:
<root>
  <html>
    <head>
      <title>
        Sample Page
      </title>
    </head>
    <body>
      <h1>
        Welcome to the Sample Page
      </h1>
      <p>
        This is a
        <b>
          simple
        </b>
        HTML parser in Go.
      </p>
    </body>
  </html>
</root>
The code is broken down into several key components:
- Node Struct: Represents each HTML element or text in the structure.
- Parser Struct: Manages the parsing process, including the current position in the HTML string.
- Parsing Functions: Functions to parse elements, tags, attributes, and text nodes.
- Print Function: A recursive function to display the parsed HTML structure in a readable format.
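The components listed above can be sketched roughly as follows. This is not the repository's code: the struct fields, the helper names `parseNodes` and `parseElement`, and the error messages are all assumptions, and the attribute handling is deliberately naive. Only `NewParser` and `Parse` are names taken from the usage example in this README.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical layout; Tag is empty for text nodes.
type Node struct {
	Tag      string
	Attrs    map[string]string
	Text     string
	Children []*Node
}

type Parser struct {
	input string
	pos   int // current read position in input
}

func NewParser(input string) *Parser { return &Parser{input: input} }

// Parse collects all top-level nodes under a synthetic <root> element.
func (p *Parser) Parse() (*Node, error) {
	children, err := p.parseNodes("")
	if err != nil {
		return nil, err
	}
	return &Node{Tag: "root", Children: children}, nil
}

// parseNodes reads siblings until it consumes </parent> (or hits end of input when parent is "").
func (p *Parser) parseNodes(parent string) ([]*Node, error) {
	var nodes []*Node
	for p.pos < len(p.input) {
		rest := p.input[p.pos:]
		switch {
		case strings.HasPrefix(rest, "</"):
			end := strings.IndexByte(rest, '>')
			if end < 0 || strings.TrimSpace(rest[2:end]) != parent {
				return nil, fmt.Errorf("bad closing tag at offset %d", p.pos)
			}
			p.pos += end + 1
			return nodes, nil
		case rest[0] == '<':
			n, err := p.parseElement()
			if err != nil {
				return nil, err
			}
			nodes = append(nodes, n)
		default: // text up to the next tag
			end := strings.IndexByte(rest, '<')
			if end < 0 {
				end = len(rest)
			}
			if text := strings.TrimSpace(rest[:end]); text != "" {
				nodes = append(nodes, &Node{Text: text})
			}
			p.pos += end
		}
	}
	if parent != "" {
		return nil, fmt.Errorf("missing </%s>", parent)
	}
	return nodes, nil
}

// parseElement reads an opening tag (attribute values may not contain spaces), then its children.
func (p *Parser) parseElement() (*Node, error) {
	rest := p.input[p.pos:]
	end := strings.IndexByte(rest, '>')
	if end < 0 {
		return nil, fmt.Errorf("unterminated tag at offset %d", p.pos)
	}
	fields := strings.Fields(rest[1:end])
	if len(fields) == 0 {
		return nil, fmt.Errorf("empty tag at offset %d", p.pos)
	}
	n := &Node{Tag: fields[0], Attrs: map[string]string{}}
	for _, f := range fields[1:] {
		key, val, _ := strings.Cut(f, "=")
		n.Attrs[key] = strings.Trim(val, `"'`)
	}
	p.pos += end + 1
	children, err := p.parseNodes(n.Tag)
	if err != nil {
		return nil, err
	}
	n.Children = children
	return n, nil
}

func main() {
	root, err := NewParser(`<div class="box"><p>Hello</p></div>`).Parse()
	if err != nil {
		panic(err)
	}
	fmt.Println(root.Children[0].Tag, root.Children[0].Attrs["class"])
}
```

The sketch uses mutual recursion between `parseNodes` and `parseElement`: each element parses its own children and returns once its closing tag has been consumed, which is what produces the nested tree.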
To use the parser, include the code, create a parser with `NewParser`, passing your HTML content, and call its `Parse` method.
- Clone this repository and navigate to the directory:
  git clone https://github.com/your-username/html-parser-go.git
  cd html-parser-go
- Run the code:
  go run main.go
The sample HTML included in `main.go` will be parsed, and the output structure will be printed to the console.
parser := NewParser("<html><body><h1>Title</h1></body></html>")
root, err := parser.Parse()
if err != nil {
	fmt.Println("Error:", err)
	return
}
printNode(root, 0)
The parser’s output displays the HTML elements and text nodes in a tree-like format, preserving the original HTML hierarchy.
This parser is intended as a minimal example for learning purposes and does not cover the full HTML specification. In particular, it does not handle:
- Void (self-closing) elements such as `<img>` or `<br>`.
- Nested structures in more complex HTML.
- Advanced error handling for malformed HTML.
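As one possible starting point for the first limitation, a parser could check each opening tag against the set of void elements and skip child parsing for those. The sketch below is not part of this repository; the `isVoidTag` name and the way it would be wired into the parser are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// voidElements lists the HTML elements that never have a closing tag.
var voidElements = map[string]bool{
	"area": true, "base": true, "br": true, "col": true, "embed": true,
	"hr": true, "img": true, "input": true, "link": true, "meta": true,
	"source": true, "track": true, "wbr": true,
}

// isVoidTag reports whether the text between '<' and '>' names a void
// element, e.g. `br`, `br/`, or `img src="a.png" /`.
// The function name and its use inside a parser are assumptions.
func isVoidTag(raw string) bool {
	raw = strings.TrimSuffix(strings.TrimSpace(raw), "/")
	fields := strings.Fields(raw)
	if len(fields) == 0 {
		return false
	}
	return voidElements[strings.ToLower(fields[0])]
}

func main() {
	for _, raw := range []string{"br", "br/", `img src="a.png" /`, "div"} {
		fmt.Printf("%-20q void=%v\n", raw, isVoidTag(raw))
	}
}
```

When `isVoidTag` returns true, the element-parsing function would return the node immediately instead of looking for children and a matching closing tag.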
Feel free to open issues or submit pull requests if you’d like to improve this parser.
This project is licensed under the MIT License.