Web scraping is the process of extracting data from websites. It’s widely used for data collection, market analysis, competitive research, and more. In Python, BeautifulSoup is one of the most popular libraries for parsing HTML and XML documents, making web scraping easier. Combined with the requests library, it allows you to fetch and extract data from web pages effortlessly.
In this lesson, we’ll cover:
- Introduction to Web Scraping
- Installing beautifulsoup4 and requests
- Scraping Data from HTML Pages
1. Introduction to Web Scraping
What is Web Scraping?
Web scraping involves automating the process of visiting web pages, retrieving their content, and extracting specific information such as text, images, links, etc. It’s commonly used for:
- Gathering data from e-commerce sites
- Extracting news headlines
- Collecting social media trends
- Aggregating job listings
Is Web Scraping Legal?
While web scraping is technically possible for most websites, it’s important to follow legal and ethical guidelines:
- Check the website’s robots.txt file to see if scraping is allowed.
- Always comply with the website’s Terms of Service.
- Avoid sending too many requests quickly, which can overwhelm servers.
2. Installing beautifulsoup4 and requests
Before starting with web scraping, you need to install two key Python libraries:
- requests: To send HTTP requests and retrieve web page content.
- beautifulsoup4: To parse and extract data from HTML.
Installation with pip:
pip install beautifulsoup4
pip install requests
You can verify the installation in Python:
import requests
from bs4 import BeautifulSoup  # No error on import means both libraries are installed
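If you want to double-check which versions were installed, both packages expose a version string:
import bs4
import requests

print(bs4.__version__)       # e.g. 4.x
print(requests.__version__)  # e.g. 2.x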
3. Scraping Data from HTML Pages
Let’s walk through a simple web scraping example.
Step 1: Sending an HTTP Request
Use the requests library to get the content of a web page:
import requests

url = 'https://example.com'
response = requests.get(url)
print(response.text)  # Displays the raw HTML content
The response.text contains the entire HTML of the page.
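Before parsing, it is worth confirming that the request actually succeeded. The requests library provides raise_for_status() for exactly this:
response.raise_for_status()  # Raises requests.exceptions.HTTPError for 4xx/5xx responses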
Step 2: Parsing HTML with BeautifulSoup
Now, use BeautifulSoup to parse the HTML content:
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())  # Prints the formatted HTML structure
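Once the HTML is parsed, the soup object also lets you navigate the document directly. For example:
print(soup.title)         # The whole <title> tag
print(soup.title.string)  # Just the text inside it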
Step 3: Extracting Specific Data
To extract specific elements like headings, links, or paragraphs:
headings = soup.find_all('h1')
for heading in headings:
    print(heading.text)

# Extract all links
links = soup.find_all('a')
for link in links:
    print(link.get('href'))
- find_all() searches for all occurrences of the specified tag.
- .text retrieves the text content inside an HTML element.
- .get('href') fetches the URL from anchor (<a>) tags.
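If you only need the first match instead of every occurrence, find() is the single-element counterpart to find_all():
first_heading = soup.find('h1')  # Returns the first <h1>, or None if there isn't one
if first_heading is not None:
    print(first_heading.text)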
Handling Complex Web Pages
Web pages often have nested HTML elements. You can target specific sections using:
CSS Selectors with select():
# 'div.article' is a hypothetical selector; adjust it to match the page you are scraping
articles = soup.select('div.article')
for article in articles:
    print(article.text)
Attribute Filtering:
images = soup.find_all('img', {'class': 'featured-image'})
for img in images:
    print(img['src'])
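find_all() also accepts keyword arguments for attribute filtering. For example, href=True matches only tags that actually carry that attribute:
# Only anchors that have an href attribute
for link in soup.find_all('a', href=True):
    print(link['href'])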
Error Handling and Best Practices
Handle Missing Elements Gracefully:
title = soup.find('h1')  # find() returns None if the tag is absent
if title:
    print(title.text)
else:
    print("Title not found.")
Avoid Overloading Servers: Use delays between requests:
import time

time.sleep(2)  # Sleep for 2 seconds before the next request
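In a real scraping loop you would typically combine this delay with a descriptive User-Agent header so site operators can identify your client. A minimal sketch, using a hypothetical list of URLs:
import time
import requests

urls = ['https://example.com/page1', 'https://example.com/page2']  # Hypothetical pages
headers = {'User-Agent': 'my-scraper/1.0 (you@example.com)'}       # Identify your client

for url in urls:
    response = requests.get(url, headers=headers)
    print(response.status_code)
    time.sleep(2)  # Pause between requests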
Respect Robots.txt: Check if scraping is allowed:
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()
print(rp.can_fetch('*', url))  # Returns True if scraping is allowed
Real-World Example: Scraping Quotes from a Website
import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

quotes = soup.find_all('span', class_='text')
authors = soup.find_all('small', class_='author')

for quote, author in zip(quotes, authors):
    print(f'{quote.text} — {author.text}')
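Pairing quotes with authors via zip() works here, but it silently relies on the two lists staying aligned. A more robust pattern, assuming each quote on this site sits in its own div with class 'quote', is to iterate over those containers and search inside each one:
for quote_block in soup.find_all('div', class_='quote'):
    text = quote_block.find('span', class_='text').text
    author = quote_block.find('small', class_='author').text
    print(f'{text} - {author}')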
Key Takeaways
- Web scraping automates data extraction from websites.
- Use requests to fetch web pages and BeautifulSoup to parse and extract data.
- Always respect the website’s rules (robots.txt) and Terms of Service.
- For dynamic websites (JavaScript-heavy), consider advanced tools like Selenium.
In the next lessons, we’ll dive deeper into handling dynamic content, pagination, and working with APIs, which are often a cleaner alternative to web scraping.