Web scraping is the process of extracting data from websites. It’s widely used for data collection, market analysis, competitive research, and more. In Python, BeautifulSoup is one of the most popular libraries for parsing HTML and XML documents, making web scraping easier. Combined with the requests library, it allows you to fetch and extract data from web pages effortlessly.
In this lesson, we’ll cover:
- Introduction to Web Scraping
- Installing beautifulsoup4 and requests
- Scraping Data from HTML Pages
1. Introduction to Web Scraping
What is Web Scraping?
Web scraping involves automating the process of visiting web pages, retrieving their content, and extracting specific information such as text, images, links, etc. It’s commonly used for:
- Gathering data from e-commerce sites
- Extracting news headlines
- Collecting social media trends
- Aggregating job listings
Is Web Scraping Legal?
While web scraping is technically possible for most websites, it’s important to follow legal and ethical guidelines:
- Check the website’s robots.txt file to see if scraping is allowed.
- Always comply with the website’s Terms of Service.
- Avoid sending too many requests quickly, which can overwhelm servers.
2. Installing beautifulsoup4 and requests
Before starting with web scraping, you need to install two key Python libraries:
- requests: To send HTTP requests and retrieve web page content.
- beautifulsoup4: To parse and extract data from HTML.
Installation with pip:
pip install beautifulsoup4
pip install requests
You can verify the installation in Python:
import requests
from bs4 import BeautifulSoup  # No error on import means both libraries are installed
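If you want to double-check which versions were installed, both packages expose a version string:
import bs4
import requests

print(bs4.__version__)       # e.g. 4.x
print(requests.__version__)  # e.g. 2.x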
3. Scraping Data from HTML Pages
Let’s walk through a simple web scraping example.
Step 1: Sending an HTTP Request
Use the requests library to get the content of a web page:
import requests

url = 'https://example.com'
response = requests.get(url)
print(response.text)  # Displays the raw HTML content
The response.text contains the entire HTML of the page.
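Before parsing, it is worth confirming that the request actually succeeded. The requests library provides raise_for_status() for exactly this:
response.raise_for_status()  # Raises requests.exceptions.HTTPError for 4xx/5xx responses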
Step 2: Parsing HTML with BeautifulSoup
Now, use BeautifulSoup to parse the HTML content:
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())  # Prints the formatted HTML structure
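Once the HTML is parsed, the soup object also lets you navigate the document directly. For example:
print(soup.title)         # The whole <title> tag
print(soup.title.string)  # Just the text inside it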
Step 3: Extracting Specific Data
To extract specific elements like headings, links, or paragraphs:
headings = soup.find_all('h1')
for heading in headings:
    print(heading.text)

# Extract all links
links = soup.find_all('a')
for link in links:
    print(link.get('href'))
- find_all() searches for all occurrences of the specified tag.
- .text retrieves the text content inside an HTML element.
- .get('href') fetches the URL from anchor (<a>) tags.
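If you only need the first match instead of every occurrence, find() is the single-element counterpart to find_all():
first_heading = soup.find('h1')  # Returns the first <h1>, or None if there isn't one
if first_heading is not None:
    print(first_heading.text)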
Handling Complex Web Pages
Web pages often have nested HTML elements. You can target specific sections using:
CSS Selectors with select():
# 'div.article' is a hypothetical selector; adjust it to match the page you are scraping
articles = soup.select('div.article')
for article in articles:
    print(article.text)
Attribute Filtering:
images = soup.find_all('img', {'class': 'featured-image'})
for img in images:
    print(img['src'])
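find_all() also accepts keyword arguments for attribute filtering. For example, href=True matches only tags that actually carry that attribute:
# Only anchors that have an href attribute
for link in soup.find_all('a', href=True):
    print(link['href'])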
Error Handling and Best Practices
Handle Missing Elements Gracefully:
title = soup.find('h1')  # find() returns None if the tag is absent
if title:
    print(title.text)
else:
    print("Title not found.")
Avoid Overloading Servers: Use delays between requests:
import time

time.sleep(2)  # Sleep for 2 seconds before the next request
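In a real scraping loop you would typically combine this delay with a descriptive User-Agent header so site operators can identify your client. A minimal sketch, using a hypothetical list of URLs:
import time
import requests

urls = ['https://example.com/page1', 'https://example.com/page2']  # Hypothetical pages
headers = {'User-Agent': 'my-scraper/1.0 (you@example.com)'}       # Identify your client

for url in urls:
    response = requests.get(url, headers=headers)
    print(response.status_code)
    time.sleep(2)  # Pause between requests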
Respect Robots.txt: Check if scraping is allowed:
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()
print(rp.can_fetch('*', url))  # Returns True if scraping is allowed
Real-World Example: Scraping Quotes from a Website
import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

quotes = soup.find_all('span', class_='text')
authors = soup.find_all('small', class_='author')

for quote, author in zip(quotes, authors):
    print(f'{quote.text} — {author.text}')
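Pairing quotes with authors via zip() works here, but it silently relies on the two lists staying aligned. A more robust pattern, assuming each quote on this site sits in its own div with class 'quote', is to iterate over those containers and search inside each one:
for quote_block in soup.find_all('div', class_='quote'):
    text = quote_block.find('span', class_='text').text
    author = quote_block.find('small', class_='author').text
    print(f'{text} - {author}')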
Key Takeaways
- Web scraping automates data extraction from websites.
- Use requests to fetch web pages and BeautifulSoup to parse and extract data.
- Always respect the website’s rules (robots.txt) and Terms of Service.
- For dynamic websites (JavaScript-heavy), consider advanced tools like Selenium.
In the next lessons, we’ll dive deeper into handling dynamic content, pagination, and working with APIs, which are often a cleaner alternative to web scraping.