Beautiful Soup (HTML parser)


Beautiful Soup is a Python package for parsing HTML and XML documents. It builds a parse tree for each parsed page, from which data can be extracted, which makes it useful for web scraping.
It is available for Python 2.7 and Python 3.
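
As a minimal sketch of what working with the parse tree can look like, the snippet below parses a small inline document and reads a few values from it. The markup and variable names are invented for illustration, and "html.parser" is chosen only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <p class="intro">Hello, <b>world</b>!</p>
    <a href="/about">About</a>
  </body>
</html>
"""

# "html.parser" is bundled with Python, so no extra install is needed.
soup = BeautifulSoup(html, "html.parser")

title = soup.title.string                          # text of the <title> element
intro = soup.find("p", class_="intro").get_text()  # text with nested tags flattened
link = soup.find("a")["href"]                      # attribute lookup on a tag

print(title)
print(intro)
print(link)
```

Tags behave roughly like nested objects here: attribute-style access (soup.title) returns the first matching tag, while find and find_all search the tree by tag name and attributes.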

Code example


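#!/usr/bin/env python3
# Anchor extraction from HTML document
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Placeholder URL; substitute the page to scrape.
url = "https://example.com/"

with urlopen(url) as response:
    # Parse the response with Python's built-in HTML parser.
    soup = BeautifulSoup(response, "html.parser")
    # Print the href of every anchor that has one.
    for anchor in soup.find_all("a", href=True):
        print(anchor["href"])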

Advantages and Disadvantages

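The following summarizes the typical usage, advantages, and disadvantages of each parser library (markup stands for the document being parsed):

Python’s html.parser
  Typical usage: BeautifulSoup(markup, "html.parser")
  Advantages:
    • Included in the Python standard library
    • Decent speed
  Disadvantages:
    • Not as fast as lxml, less lenient than html5lib

lxml’s HTML parser
  Typical usage: BeautifulSoup(markup, "lxml")
  Advantages:
    • Very fast
    • Lenient
  Disadvantages:
    • External C dependency

lxml’s XML parser
  Typical usage: BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")
  Advantages:
    • Very fast
    • The only currently supported XML parser
  Disadvantages:
    • External C dependency

html5lib
  Typical usage: BeautifulSoup(markup, "html5lib")
  Advantages:
    • Extremely lenient
    • Parses pages the same way a web browser does
    • Creates valid HTML5
  Disadvantages:
    • Very slow
    • External Python dependency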
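
Because lxml and html5lib are optional dependencies, code that should run on any installation may want to fall back to the built-in parser. The sketch below shows one way to do that; make_soup and its preference order are hypothetical names and choices made for this example, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Hypothetical helper: returns the soup and the name of the parser used.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # Raised by Beautiful Soup when the requested parser is not installed.
            continue
    raise RuntimeError("no usable parser found")


# Deliberately sloppy markup: neither <p> is closed.
soup, parser_used = make_soup("<p>Unclosed paragraph<p>Another")
print(parser_used, len(soup.find_all("p")))
```

The parsers can build different trees from the same invalid markup, so pinning a single parser is usually the safer choice when output must be reproducible across machines.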

Release

Beautiful Soup 3 was the official release line of Beautiful Soup from May 2006 to March 2012. The current release line is Beautiful Soup 4, which can be installed with pip install beautifulsoup4.
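
After installation, a quick way to confirm which version is importable is to read the package's version string. This is a small sketch that assumes the beautifulsoup4 package is installed in the active environment; the variable major is introduced only for the example.

```python
import bs4

# __version__ holds the version string of the installed beautifulsoup4 package.
print(bs4.__version__)

# The bs4 module belongs to the Beautiful Soup 4 release line,
# so the leading component of the version should be 4.
major = int(bs4.__version__.split(".")[0])
print(major)
```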