Scrape Quotes using Python Requests and BeautifulSoup
By hientd, at: 21:55, June 30, 2023
Introduction
Scraping data from websites has become an essential task in various domains, ranging from research to business intelligence. In this article, I will guide you through the process of scraping quotes from the popular website, https://quotes.toscrape.com/, using the powerful combination of Python Requests and BeautifulSoup. By the end, you will have a clear understanding of how to retrieve and extract specific content from web pages efficiently.
The full script is included near the end of this article.
Overview of the scraping process
Before diving into the details, let's get an overview of the tools we will be using: Python Requests, BeautifulSoup, and the csv module.
Python Requests is a library for sending HTTP requests with a simple API.
BeautifulSoup is a Python library used for parsing HTML and XML documents.
csv is a module in Python's standard library for reading and writing CSV files.
By combining these three tools, we can fetch a web page, navigate its structure, extract the desired information, and write it to a file.
Setting up the environment
First things first, let's ensure that we have the necessary tools installed. To follow along, you will need Python, Requests, and BeautifulSoup; the csv module is part of Python's standard library. If you don't have Python installed, head over to the official Python website and download the latest version. Once Python is set up, you can install Requests and BeautifulSoup by running the following commands in your terminal:
pip install requests
pip install beautifulsoup4
# csv is a Python built-in library
Making a request to the website
To begin scraping, we need to make a request to the website we want to extract data from. Using Python Requests, we can send a GET request to the URL of the quotes website. It's important to handle potential failures, such as connection timeouts, invalid URLs, or unexpected status codes, so that the scraping script is robust and reliable.
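import requests

url = "https://quotes.toscrape.com/"
# A timeout keeps the script from hanging on an unresponsive server
response = requests.get(url, timeout=10)
if response.status_code == 200:
    # Proceed with scraping
    print("Request succeeded")
else:
    # Handle the error gracefully
    print(f"Request failed with status {response.status_code}")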
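The status check above does not cover failures that happen before any response arrives, such as timeouts or DNS errors. A minimal sketch of one way to handle both cases follows. The helper name `fetch_html`, the 10-second timeout, and the optional `session` parameter are illustrative assumptions, not part of the original script.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the page HTML, or None if the request fails.

    `fetch_html` is a hypothetical helper; `session` may be a
    requests.Session or any object with a compatible `get` method.
    """
    client = session or requests
    try:
        response = client.get(url, timeout=timeout)
        # raise_for_status() raises HTTPError for 4xx and 5xx responses
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        # Covers timeouts, connection errors, invalid URLs, and HTTP errors
        print(f"Request to {url} failed: {error}")
        return None
    return response.text
```

A caller would then write `html = fetch_html("https://quotes.toscrape.com/")` and skip parsing when the result is `None`.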
Parsing the HTML response
Once we have obtained the response from the website, we need to parse the HTML content to extract the desired information. BeautifulSoup comes to the rescue here. It provides a convenient way to navigate and search through the HTML document using selectors.
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")
Extracting quotes
With BeautifulSoup, we can now locate and extract the quote text and author information from the parsed HTML. By inspecting the web page's structure, we identify the appropriate HTML elements and use the appropriate selectors to retrieve the data.
quotes = soup.select(".quote")
for quote in quotes:
    text = quote.select_one(".text").get_text()
    author = quote.select_one(".author").get_text()
    # Process the scraped data as desired
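To try the selectors without hitting the network, the same extraction can be run on a small HTML fragment. The markup below is a simplified assumption modeled on the class names used above (`.quote`, `.text`, `.author`); the real page contains more elements. The helper name `extract_quotes` is illustrative.

```python
from bs4 import BeautifulSoup

# Simplified sample markup (an assumption, not a copy of the live page)
SAMPLE_HTML = """
<div class="quote">
  <span class="text">“First sample quote.”</span>
  <small class="author">Author One</small>
</div>
<div class="quote">
  <span class="text">“Second sample quote.”</span>
  <small class="author">Author Two</small>
</div>
"""


def extract_quotes(html):
    """Return a list of {"text", "author"} dicts found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for quote in soup.select(".quote"):
        text_tag = quote.select_one(".text")
        author_tag = quote.select_one(".author")
        # Skip malformed blocks instead of failing on None
        if text_tag is None or author_tag is None:
            continue
        results.append({
            # strip=True trims whitespace; str.strip removes the curly quotes
            "text": text_tag.get_text(strip=True).strip("“”"),
            "author": author_tag.get_text(strip=True),
        })
    return results


for item in extract_quotes(SAMPLE_HTML):
    print(f'{item["author"]}: {item["text"]}')
```

Checking for `None` before calling `get_text()` is a defensive choice: `select_one` returns `None` when nothing matches, which would otherwise raise an `AttributeError`.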
Handling pagination
In many cases, we may need to scrape multiple pages to retrieve all the desired data. The quotes website has multiple pages, and we don't want to miss any quotes. We can tackle pagination by following the pagination links or modifying query parameters in the URL. This allows us to iterate through each page and scrape the quotes.
next_page_link = soup.select_one(".next > a")["href"]
# Construct the URL for the next page (strip the trailing slash to avoid "//")
next_page_url = f"{url.rstrip('/')}{next_page_link}"
# Send a GET request to the next page and continue scraping
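The pagination step can be wrapped in a loop that keeps following the "next" link until none is left. A minimal sketch follows, assuming a hypothetical `fetch` callable that returns the HTML for a URL (for example, a thin wrapper around `requests.get`). `urljoin` from the standard library resolves the relative link against the current page, which avoids doubled slashes. The names `iter_pages`, `max_pages`, and `FAKE_PAGES` are illustrative.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def iter_pages(start_url, fetch, max_pages=100):
    """Yield (url, soup) for each page, following the ".next > a" link.

    `fetch` is a hypothetical callable taking a URL and returning HTML.
    `max_pages` is a safety cap against accidental infinite loops.
    """
    url = start_url
    for _ in range(max_pages):
        soup = BeautifulSoup(fetch(url), "html.parser")
        yield url, soup
        next_link = soup.select_one(".next > a")
        if next_link is None:
            # Last page: there is no "next" link to follow
            break
        url = urljoin(url, next_link["href"])


# Offline demonstration with two fake pages standing in for the website
FAKE_PAGES = {
    "https://example.test/": '<li class="next"><a href="/page/2/">Next</a></li>',
    "https://example.test/page/2/": "<p>last page</p>",
}
visited = [url for url, _ in iter_pages("https://example.test/", FAKE_PAGES.__getitem__)]
print(visited)
```

Passing `fetch` in as a parameter keeps the loop testable offline; in a real run it would perform the HTTP request.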
Storing scraped data
Once we have successfully scraped the quotes, it's crucial to decide how to store the extracted data for future use. We have various options, such as saving it to a CSV file, JSON format, or storing it in a database. The choice depends on your specific requirements and the scalability of your project.
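import csv

file_name = "quotes.csv"
quotes = [{"author": "Joe", "text": "this is a quote"}]
# newline="" prevents blank rows on Windows; utf-8 keeps non-ASCII text intact
with open(file_name, "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=["author", "text"])
    writer.writeheader()
    writer.writerows(quotes)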
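CSV is only one of the options mentioned above. As a hedged sketch of the JSON alternative, the standard-library `json` module can write the same list of dictionaries. The function names `save_quotes_json` and `load_quotes_json` and the file name are illustrative assumptions.

```python
import json
import os
import tempfile


def save_quotes_json(file_name, quotes):
    """Write the list of quote dicts to a JSON file."""
    with open(file_name, "w", encoding="utf-8") as jsonfile:
        # ensure_ascii=False keeps characters such as curly quotes readable
        json.dump(quotes, jsonfile, ensure_ascii=False, indent=2)


def load_quotes_json(file_name):
    """Read the quotes back as a list of dicts."""
    with open(file_name, encoding="utf-8") as jsonfile:
        return json.load(jsonfile)


quotes = [{"author": "Joe", "text": "this is a quote"}]
# Written to a temporary directory here so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "quotes.json")
    save_quotes_json(path, quotes)
    restored = load_quotes_json(path)
print(restored == quotes)
```

JSON is a reasonable choice when records may later gain nested fields, which a flat CSV row cannot represent directly.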
Edge cases and challenges
While web scraping can be straightforward, it's essential to address potential edge cases and challenges.
Some websites load content dynamically via JavaScript, so the HTML returned by a plain HTTP request does not contain the data. In this case, we might need a headless browser driven by Selenium or Playwright.
Additionally, websites may implement anti-scraping measures, such as CAPTCHAs or rate limiting (for example, https://www.walmart.com/). Handling such challenges requires advanced strategies like session management, IP rotation, or dedicated scraping frameworks.
Tips and best practices
To ensure smooth and effective web scraping, here are some tips and best practices to keep in mind:
- Respect website policies and terms of service.
- Include a delay mechanism between requests to avoid overloading the server.
- Use proper user-agent headers to mimic real user behavior.
- Monitor and handle exceptions gracefully to prevent scraping interruptions.
- Test your scraping script regularly to adapt to potential website changes.
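Two of the tips above, delaying between requests and sending an explicit User-Agent header, can be sketched as follows. The header value, the one-second delay, and the helper names `make_session` and `polite_get` are illustrative assumptions; pick values that match the target site's policies.

```python
import time

import requests

# Hypothetical identifier; replace with one that describes your scraper
USER_AGENT = "quotes-scraper-example/0.1"


def make_session():
    """Create a requests.Session that sends the same User-Agent every time."""
    session = requests.Session()
    session.headers.update({"User-Agent": USER_AGENT})
    return session


def polite_get(session, url, delay_seconds=1.0, timeout=10):
    """Wait `delay_seconds`, then fetch the URL with the given session."""
    time.sleep(delay_seconds)
    return session.get(url, timeout=timeout)
```

A scraping loop would create one session with `make_session()` and call `polite_get(session, url)` for every page, so the delay applies uniformly and the connection is reused.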
Conclusion
In this article, we explored how to scrape quotes from https://quotes.toscrape.com/ using Python Requests and BeautifulSoup. We discussed the overall scraping process, including making requests, parsing HTML responses, extracting quotes, handling pagination, and storing scraped data. By following best practices and considering potential edge cases, you can create robust and efficient web scraping scripts tailored to your specific needs.
Many other sites also offer quotes that can be scraped for free:
- https://www.brainyquote.com/topics/scrape-quotes
- https://www.overallmotivation.com/quotes/scrape-quotes/
- https://bhavyasree.github.io/PythonClass/Notebooks/18.scrape-quotes/
The full script is:
# Python 3.11
import argparse
import csv

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://quotes.toscrape.com"


def scrape_url(url):
    """Scrape one page; return the next page URL (or None) and its quotes."""
    print(f"Scrape url {url}")
    response = requests.get(url, timeout=10)
    # Raise an HTTPError on 4xx/5xx instead of dropping into the debugger
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    quote_divs = soup.select(".quote")
    quotes = []
    for quote_div in quote_divs:
        text = quote_div.select_one(".text").get_text()
        # Remove straight and curly double quotes from the text
        for mark in ('"', "“", "”"):
            text = text.replace(mark, "")
        author = quote_div.select_one(".author").get_text()
        quote = {"text": text, "author": author}
        quotes.append(quote)
    print(f"Found #{len(quotes)} quotes")
    next_url = None
    next_anchor = soup.select_one(".next > a")
    if next_anchor:
        # The href is relative (it starts with "/"), so prepend the base URL
        next_url = f"{BASE_URL}{next_anchor['href']}"
    return next_url, quotes


def write_to_file(file_name, quotes):
    """Write the quotes to a CSV file with a header row."""
    field_names = ["author", "text"]
    # newline="" prevents blank rows on Windows
    with open(file_name, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=field_names)
        writer.writeheader()
        writer.writerows(quotes)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Scrape quotes")
    parser.add_argument("-f", "--file_name")
    args = parser.parse_args()
    file_name = "quotes.csv"
    if args.file_name:
        file_name = args.file_name
    print("Start scraping")
    next_url = "https://quotes.toscrape.com/"
    quotes = []
    # Follow the "next" links until the last page is reached
    while next_url:
        next_url, new_quotes = scrape_url(next_url)
        quotes.extend(new_quotes)
    print(f"Total found #{len(quotes)}")
    write_to_file(file_name, quotes)
FAQs
- Can I scrape any website? While web scraping is technically possible for most websites, it's important to review the website's terms of service and policies. Some websites may explicitly prohibit scraping or have specific restrictions.
- How can I handle websites with dynamic content? Websites that load data dynamically using JavaScript may require additional techniques like using headless browsers or reverse engineering API endpoints.
- Is web scraping legal? The legality of web scraping depends on various factors, including the website's terms of service, applicable laws, and the purpose of scraping. It's advisable to consult legal experts or review local regulations to ensure compliance.
- How can I handle anti-scraping measures like CAPTCHA? Some websites employ anti-scraping measures like CAPTCHA to prevent automated access. In such cases, it may be necessary to use additional tools or services to bypass these measures.
- Can web scraping overload a server and cause issues? Yes, excessive scraping can put a strain on servers and impact website performance. It's crucial to implement delays, throttling, and other measures to scrape responsibly and avoid overloading the server.
- Are there any ethical considerations when scraping websites? When scraping websites, it's important to respect the website's policies, terms of service, and privacy considerations. Avoid scraping sensitive or personal information without proper consent.
- What are some alternative libraries for web scraping in Python? Besides Requests and BeautifulSoup, other popular libraries for web scraping in Python include Scrapy, Selenium, and lxml. The choice of library depends on the specific requirements and complexity of the scraping task.
- Can I scrape websites protected by login/authentication? Scraping authenticated or login-protected websites requires additional steps, such as sending proper session cookies or using API endpoints. It's important to understand the website's authentication mechanism before attempting scraping.
- How can I store and analyze the scraped data? The scraped data can be stored in various formats like CSV, JSON, or databases for further analysis. Python provides numerous libraries, such as pandas, for data manipulation and analysis.
- How can I handle changes in website structure? Websites may undergo changes in HTML structure over time, which can break scraping scripts. Regularly reviewing and updating the scraping logic can help handle