BeautifulSoup is a Python library for parsing HTML and XML documents.
Think of it like a "Google Maps for websites" – it helps you navigate through a webpage’s structure (tags, attributes, text) and extract the data you need.
👉 In practice, it is a core part of the Python web scraping toolkit:
🌐 Extract data from websites (news, e-commerce, weather, etc.)
📝 Convert messy HTML into structured data
🚀 Works well with requests or urllib to fetch webpage content
💡 Provides simple methods like .find(), .find_all(), and .select()
pip install beautifulsoup4
(Optionally, you may also want a third-party parser: lxml or html5lib)
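Once installed, the parser name goes in as the second argument to BeautifulSoup. A minimal sketch ("html.parser" ships with Python; the HTML string here is just an illustration):

```python
from bs4 import BeautifulSoup

html = "<html><body><p>Hello</p></body></html>"

# Built-in parser -- no extra install needed
soup = BeautifulSoup(html, "html.parser")

# If lxml or html5lib is installed, pass its name instead:
# soup = BeautifulSoup(html, "lxml")
# soup = BeautifulSoup(html, "html5lib")

print(soup.p.text)  # → Hello
```

lxml is generally the fastest; html5lib is the most lenient with broken markup.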
1️⃣ Import & Fetch Content
import requests
from bs4 import BeautifulSoup
url = "https://example.com"
response = requests.get(url)
2️⃣ Create Soup Object
soup = BeautifulSoup(response.text, "html.parser")
3️⃣ Navigate & Extract Data
# Title of page
print(soup.title.string)
# First <h1> tag
print(soup.h1.text)
# All links
for link in soup.find_all('a'):
    print(link['href'])
soup.title → <title>Example Domain</title>
soup.h1 → First <h1> tag
soup.p → First <p> tag
soup.find('h1') → Finds first <h1> tag
soup.find_all('p') → Finds all <p> tags
soup.find('a', {'class': 'link'}) → Finds first tag with a specific attribute
soup.select("div.article h2") → Selects all <h2> inside <div class="article">
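The cheat sheet above can be tried out on a small page. The HTML below is made up purely for illustration:

```python
from bs4 import BeautifulSoup

# A small hypothetical page to exercise each method on
html = """
<html>
  <head><title>Example Domain</title></head>
  <body>
    <h1>Main Heading</h1>
    <div class="article">
      <h2>Section A</h2>
      <h2>Section B</h2>
    </div>
    <p>First paragraph</p>
    <p>Second paragraph</p>
    <a class="link" href="/about">About</a>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.title)                                 # <title>Example Domain</title>
print(soup.find('h1').text)                       # Main Heading
print(len(soup.find_all('p')))                    # 2
print(soup.find('a', {'class': 'link'})['href'])  # /about
print([h2.text for h2 in soup.select("div.article h2")])  # ['Section A', 'Section B']
```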
link = soup.find('a')
print(link['href']) # URL inside <a>
print(soup.get_text()) # Full plain text
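One caveat worth sketching: indexing like link['href'] raises a KeyError when the attribute is missing, while .get() quietly returns None. A small example (the HTML is hypothetical):

```python
from bs4 import BeautifulSoup

html = '<a href="https://example.com">Site</a><a>No href here</a>'
soup = BeautifulSoup(html, "html.parser")

for link in soup.find_all('a'):
    # link['href'] would raise KeyError on the second <a>;
    # .get() returns None for missing attributes
    print(link.get('href'))
```

This prints the URL for the first link and None for the second.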
📌 Scraping Quotes from a Website
import requests
from bs4 import BeautifulSoup
url = "http://quotes.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for quote in soup.find_all('span', class_="text"):
    print(quote.text)
👉 Output:
“The world as we have created it is a process of our thinking.”
“It is our choices, Harry, that show what we truly are.”
...
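On quotes.toscrape.com, each quote also sits inside a <div class="quote"> together with a <small class="author"> tag; assuming that structure, the example can be extended to pair quotes with their authors. Shown here on a local snippet mimicking that markup, so it runs without a network request:

```python
from bs4 import BeautifulSoup

# Local snippet mimicking the quotes.toscrape.com structure
html = """
<div class="quote">
  <span class="text">“It is our choices, Harry, that show what we truly are.”</span>
  <small class="author">J.K. Rowling</small>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

for quote in soup.find_all('div', class_="quote"):
    text = quote.find('span', class_="text").text
    author = quote.find('small', class_="author").text
    print(f"{author}: {text}")
```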
Website HTML → BeautifulSoup Parser → Soup Object → Find/Select Tags → Extract Data → Store in CSV/DB
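The last step of that pipeline, storing in CSV, can be sketched with the standard csv module. The HTML string below stands in for a page fetched with requests:

```python
import csv
from bs4 import BeautifulSoup

# Hypothetical fetched HTML (in a real scraper this comes from requests.get)
html = """
<div class="quote"><span class="text">Quote one</span></div>
<div class="quote"><span class="text">Quote two</span></div>
"""

soup = BeautifulSoup(html, "html.parser")                        # Parser → Soup Object
rows = [[s.text] for s in soup.find_all('span', class_="text")]  # Find → Extract

with open("quotes.csv", "w", newline="", encoding="utf-8") as f:  # Store in CSV
    writer = csv.writer(f)
    writer.writerow(["quote"])
    writer.writerows(rows)
```

The same rows list could instead go into a pandas DataFrame or a database insert.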
❌ Can’t handle JavaScript-rendered websites (use Selenium or Playwright)
❌ Dependent on website structure (if website changes, scraper breaks)
❌ Very large pages can make parsing slow
Always check robots.txt before scraping 🤖
Use time.sleep() between requests to avoid overloading servers ⏳
Combine with pandas or CSV to store data 📊
For dynamic content, pair with Selenium / Playwright
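The time.sleep() tip boils down to a simple pacing pattern when scraping several pages. A sketch (the fetch and parse lines are commented out, since only the pacing matters here):

```python
import time

urls = [
    "http://quotes.toscrape.com/page/1/",
    "http://quotes.toscrape.com/page/2/",
]

for url in urls:
    # response = requests.get(url)                          # fetch the page
    # soup = BeautifulSoup(response.text, "html.parser")    # parse it
    time.sleep(1)  # pause between requests so the server isn't hammered
```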
✨ In short, BeautifulSoup is like a detective 🔍:
It reads a webpage’s HTML structure
Helps you search and extract data
Makes web scraping clean, easy, and pythonic 🐍