Lesson 3: Web Scraping with BeautifulSoup

Web scraping is the process of extracting data from websites. It’s widely used for data collection, market analysis, competitive research, and more. In Python, BeautifulSoup is one of the most popular libraries for parsing HTML and XML documents, making web scraping easier. Combined with the requests library, it allows you to fetch and extract data from web pages effortlessly.

In this lesson, we’ll cover:

  1. Introduction to Web Scraping
  2. Installing beautifulsoup4 and requests
  3. Scraping Data from HTML Pages

1. Introduction to Web Scraping

What is Web Scraping?
Web scraping involves automating the process of visiting web pages, retrieving their content, and extracting specific information such as text, images, links, etc. It’s commonly used for:

  • Gathering data from e-commerce sites
  • Extracting news headlines
  • Collecting social media trends
  • Aggregating job listings

Is Web Scraping Legal?
While web scraping is technically possible for most websites, it’s important to follow legal and ethical guidelines:

  • Check the website’s robots.txt file to see if scraping is allowed.
  • Always comply with the website’s Terms of Service.
  • Avoid sending too many requests quickly, which can overwhelm servers.

2. Installing beautifulsoup4 and requests

Before starting with web scraping, you need to install two key Python libraries:

  • requests: To send HTTP requests and retrieve web page content.
  • beautifulsoup4: To parse and extract data from HTML.

Installation with pip:

bash
pip install beautifulsoup4
pip install requests

You can verify the installation in Python:

python
# If both imports succeed without an ImportError, the installation worked
import requests
from bs4 import BeautifulSoup

3. Scraping Data from HTML Pages

Let’s walk through a simple web scraping example.

Step 1: Sending an HTTP Request

Use the requests library to get the content of a web page:

python
import requests

url = 'https://example.com'
response = requests.get(url)
print(response.text)  # Displays the raw HTML content

The response.text contains the entire HTML of the page.
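
In practice, it’s a good idea to confirm the request actually succeeded before parsing anything. Here is a minimal sketch using standard requests features (the URL is a placeholder):

python
import requests

url = 'https://example.com'  # placeholder URL
response = requests.get(url, timeout=10)

# Raise an HTTPError for 4xx/5xx responses instead of silently parsing an error page
response.raise_for_status()
print(response.status_code)  # 200 indicates success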

Step 2: Parsing HTML with BeautifulSoup

Now, use BeautifulSoup to parse the HTML content:

python
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())  # Prints the formatted HTML structure

Step 3: Extracting Specific Data

To extract specific elements like headings, links, or paragraphs:

python
# Extract all headings (h1 tags)
headings = soup.find_all('h1')
for heading in headings:
    print(heading.text)

# Extract all links
links = soup.find_all('a')
for link in links:
    print(link.get('href'))

  • find_all() searches for all occurrences of the specified tag.
  • .text retrieves the text content inside an HTML element.
  • .get('href') fetches the URL from anchor (<a>) tags.
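
For comparison, find() returns only the first matching element, or None when nothing matches, which is handy when you expect a single result:

python
# find() returns the first match (or None), unlike find_all(), which returns a list
first_link = soup.find('a')
if first_link is not None:
    print(first_link.get('href'))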

Handling Complex Web Pages

Web pages often have nested HTML elements. You can target specific sections using:

CSS Selectors with select():

python
articles = soup.select('div.article > h2')
for article in articles:
    print(article.text)

Attribute Filtering:

python
images = soup.find_all('img', {'class': 'featured-image'})
for img in images:
    print(img['src'])
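
Equivalently, attribute filters can be passed as keyword arguments; because class is a reserved word in Python, BeautifulSoup spells it class_:

python
# Same query using a keyword argument instead of an attribute dict
images = soup.find_all('img', class_='featured-image')
for img in images:
    print(img['src'])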

Error Handling and Best Practices

Handle Missing Elements Gracefully:

python
title = soup.find('h1')
if title:
    print(title.text)
else:
    print("Title not found.")

Avoid Overloading Servers: Use delays between requests:

python
import time

time.sleep(2)  # Sleep for 2 seconds before the next request
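
For example, when scraping several pages in a row, pause between requests so you don’t hammer the server. A minimal sketch (the page URLs are placeholders):

python
import time
import requests

# Hypothetical list of pages to scrape
urls = ['https://example.com/page1', 'https://example.com/page2']

for url in urls:
    response = requests.get(url)
    print(response.status_code)
    time.sleep(2)  # Be polite: wait 2 seconds between requests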

Respect Robots.txt: Check if scraping is allowed:

python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

url = 'https://example.com/'  # the page you want to scrape
print(rp.can_fetch('*', url))  # Returns True if scraping this URL is allowed

Real-World Example: Scraping Quotes from a Website

python
import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

quotes = soup.find_all('span', class_='text')
authors = soup.find_all('small', class_='author')

for quote, author in zip(quotes, authors):
    print(f'{quote.text} — {author.text}')

Key Takeaways

  • Web scraping automates data extraction from websites.
  • Use requests to fetch web pages and BeautifulSoup to parse and extract data.
  • Always respect the website’s rules (robots.txt) and Terms of Service.
  • For dynamic websites (JavaScript-heavy), consider advanced tools like Selenium; see the sketch below.
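
To give a flavor of what that involves, here is a minimal sketch of rendering a page with Selenium and handing the result to BeautifulSoup. It assumes Selenium is installed (pip install selenium) and a Chrome browser is available; setup details vary by environment:

python
from selenium import webdriver
from bs4 import BeautifulSoup

# Assumes Chrome is installed; Selenium 4 can locate the driver automatically
driver = webdriver.Chrome()
driver.get('https://example.com')

# page_source holds the HTML after JavaScript has executed
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.title)

driver.quit()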

In the next lessons, we’ll dive deeper into handling dynamic content, pagination, and working with APIs, which are often a cleaner alternative to web scraping.

