[TIPS] How to correct HTML tags - Python

By JoeVu, at: June 9, 2024, 10:46 a.m.


To repair malformed HTML in Python, you can use BeautifulSoup, which is installed as the beautifulsoup4 package and imported from bs4. It parses broken markup into a tree and can write that tree back out as well-formed HTML.

 

Here’s a step-by-step guide on how to use it:

 

Step 1: Install BeautifulSoup

 

If you haven’t installed BeautifulSoup and lxml (a parser library), you can install them using pip:

 

pip install beautifulsoup4 lxml
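
To confirm that the installation worked, a quick check along these lines can help. It is a minimal sketch: it prints the installed BeautifulSoup version and reports whether lxml can be found. The variable name lxml_available is only illustrative.

```python
import importlib.util

import bs4

# Version of the installed beautifulsoup4 package
print("beautifulsoup4", bs4.__version__)

# find_spec returns None when the module cannot be located
lxml_available = importlib.util.find_spec("lxml") is not None
print("lxml installed:", lxml_available)
```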

 

Step 2: Use BeautifulSoup to Parse and Correct HTML

 

Here’s an example script that reads an HTML string, parses it with BeautifulSoup, and then outputs the corrected HTML.

 

from bs4 import BeautifulSoup

# Example of messed-up HTML content
messed_up_html = """ YOUR MESSY HTML CONTENT """

# Parse the HTML
soup = BeautifulSoup(messed_up_html, 'lxml')

# Pretty print the corrected HTML
corrected_html = soup.prettify()
print(corrected_html)
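
If the messy HTML lives in a file rather than a string, the same steps can be wrapped in a small helper. This is a sketch under assumptions: the function name fix_html_file and the file names messy.html and fixed.html are hypothetical, and it uses Python's built-in 'html.parser' so it runs even without lxml.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def fix_html_file(source, destination, parser="html.parser"):
    """Read HTML from source, repair it, and write the result to destination."""
    raw_html = Path(source).read_text(encoding="utf-8")
    soup = BeautifulSoup(raw_html, parser)
    Path(destination).write_text(soup.prettify(), encoding="utf-8")
    return destination


# Demo in a temporary directory so no real files are touched
with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "messy.html"
    destination = Path(workdir) / "fixed.html"
    source.write_text("<p>Some <b>bold text</p>", encoding="utf-8")  # <b> is never closed
    fix_html_file(source, destination)
    fixed_text = destination.read_text(encoding="utf-8")

print(fixed_text)
```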

 

 

Explanation

 

  • BeautifulSoup: A Python library for parsing HTML and XML documents. It creates a parse tree from page source code that can be used to extract data from HTML.

 

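  • lxml: A third-party parser that BeautifulSoup can use. It is faster than Python's built-in html.parser, and it wraps fragments in <html> and <body> tags. Both parsers tolerate broken HTML, but they can repair the same markup in different ways.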
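
To see how the choice of parser affects the result, you can run the same fragment through both. The snippet below is a minimal sketch: the helper name repair and the sample fragment are illustrative, and the lxml branch is skipped with a message if lxml is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# A fragment whose <b> tag is never closed
fragment = "<p>Hello <b>world</p>"


def repair(html, parser):
    """Parse html with the named parser and return the repaired markup."""
    return str(BeautifulSoup(html, parser))


builtin_result = repair(fragment, "html.parser")
print("html.parser:", builtin_result)

try:
    print("lxml:", repair(fragment, "lxml"))
except FeatureNotFound:
    # Raised by BeautifulSoup when the requested parser is not available
    print("lxml is not installed; run: pip install lxml")
```

Comparing the two printed lines shows whether the parser added a surrounding document structure and how it closed the open tag.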

 

Output

 

The prettify method returns the parsed tree as an indented string. The complete example below runs a deliberately messy snippet through BeautifulSoup and prints the result:

 

from bs4 import BeautifulSoup

# A messy HTML string with unclosed and badly nested tags
messy_html = """
<center><font size="5" color="red"><b>Welcome to my 1999 Website!!</font></b>
<br><br>
<div style="background-color: yellow; padding: 10px; border: 5px dotted blue; float: left; width: 100%;">
<p>This is a paragraph that <i>never really ends because the tags are <b>all over the place.
<marquee>Check out this scrolling text!</marquee>
<table border=1><tr><td>Bad Table Formatting<td>No Closing Tag</tr>
</table>
<br>
<a href="#"> <img src="cool_gif.gif" width="50" height="50">Click here!!</a>
</p></div>
<br clear="all">
<center><footer>Copyright 2025 - Best Viewed in Netscape Navigator</footer></center>
"""

# Parse with Python's built-in 'html.parser'
# You can also pass 'lxml' (installed in Step 1), which is faster and may repair some markup differently
soup = BeautifulSoup(messy_html, 'html.parser')

# The parser has already closed and nested the tags; .prettify() writes the tree out with indentation
clean_html = soup.prettify()

print(clean_html)
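
Note that prettify adds line breaks and indentation, which changes the whitespace in the document. If you only want the repaired markup without extra whitespace, converting the soup to a string is one option. The sketch below uses a short illustrative fragment, and the variable names compact_html and pretty_html are assumptions for this example.

```python
from bs4 import BeautifulSoup

# <p> and <em> are never closed in this fragment
broken = "<div><p>Unclosed paragraph <em>with emphasis</div>"
soup = BeautifulSoup(broken, "html.parser")

compact_html = str(soup)       # repaired markup, no added whitespace
pretty_html = soup.prettify()  # same tree, indented across several lines

print(compact_html)
print(pretty_html)
```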

 

 

Alternatives

 

Several online services can also validate HTML and help you correct it:

 

  1. https://validator.w3.org/#validate_by_input
     
  2. https://www.freeformatter.com/html-validator.html
     
  3. https://www.htmlcorrector.com/
     
  4. https://jsonformatter.org/html-validator
Tag list:
- BeautifulSoup
- html
- extract html tags
- html tags
- correct html tags
