How to Scrape The First Edition site - A Sample Code Walkthrough
By hientd, at: 14:10, November 8, 2024
Scraping data from websites is often a useful skill, whether you're collecting data for research, creating an aggregation of listings, or building a web-based product. In this guide, we'll show you how to scrape books listed on The First Edition, a site with a rich catalog of rare books. We'll walk you through a problem-solving approach to gather information like SKU, title, author, price, edition, description, location, date published, and ISBN.
1. A Sample Problem
Our goal is to collect book information from The First Edition. We'll aim to extract essential details like SKU, title, author, price, edition, description, location, date published, and ISBN for each book. Let's imagine a scenario where we need this data for analysis or to showcase book data in an app.
The sample output for a single book might look like this:
{
    "sku": "12345",
    "title": "To Kill a Mockingbird",
    "author": "Harper Lee",
    "price": "$1,250.00",
    "edition": "First Edition",
    "description": "A rare first edition of Harper Lee's 'To Kill a Mockingbird' with original dust jacket.",
    "location": "New York, USA",
    "date_published": "1960",
    "isbn": "978-0-06-112008-4"
}
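One way to keep every scraped record in the same shape as the sample above is a small dataclass. This is only a sketch: the `BookRecord` name and the decision to default every field to `None` are assumptions made for illustration, not part of the original scraper.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class BookRecord:
    """Hypothetical container mirroring the keys of the sample output."""
    sku: Optional[str] = None
    title: Optional[str] = None
    author: Optional[str] = None
    price: Optional[str] = None
    edition: Optional[str] = None
    description: Optional[str] = None
    location: Optional[str] = None
    date_published: Optional[str] = None
    isbn: Optional[str] = None


# Fields that could not be scraped simply stay None.
record = BookRecord(sku="12345", title="To Kill a Mockingbird", author="Harper Lee")
row = asdict(record)  # plain dict, ready for a CSV writer or a DataFrame
print(row)
```

Because missing fields stay `None` instead of being dropped, every row ends up with the same columns when it is written out later.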
2. Analyze the Problem and Build the Solution Steps
When approaching a web scraping task, breaking down the problem is crucial. Here are the steps we'll follow:
- Identify the Data: Look at the webpage structure to find the exact elements that contain the required data.
- Understand Pagination: Most e-commerce sites split their listings across multiple pages. We need to figure out how to navigate through these pages.
- Structure the Scraper: Build a function to extract data from a single book page, then scale it to scrape data across multiple books and pages.
- Compile Data: Store the scraped data in a structured format (e.g., CSV) for further analysis.
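The "Identify the Data" step can be rehearsed offline before any request is sent. The sketch below runs a few candidate CSS selectors against a saved HTML snippet and reports which ones match nothing. The markup, the `check_selectors` helper, and the `span.author` selector are all hypothetical; the real page has to be inspected in the browser to confirm its structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, shaped like a typical WooCommerce product page.
SAMPLE_HTML = """
<div class="product">
  <h1 class="product_title">To Kill a Mockingbird</h1>
  <p class="price">$1,250.00</p>
  <span class="sku">12345</span>
</div>
"""

# Candidate selectors to verify before writing the real scraper.
SELECTORS = {
    "sku": "span.sku",
    "title": "h1.product_title",
    "price": "p.price",
    "author": "span.author",  # assumed selector, absent from the sample markup
}


def check_selectors(html, selectors):
    """Return {field: text or None} so unmatched selectors are easy to spot."""
    soup = BeautifulSoup(html, "html.parser")
    found = {}
    for field, css in selectors.items():
        node = soup.select_one(css)
        found[field] = node.get_text(strip=True) if node else None
    return found


report = check_selectors(SAMPLE_HTML, SELECTORS)
missing = [field for field, text in report.items() if text is None]
print(report)
print("Selectors that matched nothing:", missing)
```

Running a check like this against one or two saved pages is a cheap way to catch a wrong class name before the scraper crawls a whole category.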
3. Implement the Solution
Step 1: Import Required Libraries and Set Up Headers
To make our requests look like a regular browser visit, we'll send a browser-like User-Agent header with each request.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
# Base URL of the website
base_url = "https://thefirstedition.com"
# Headers to mimic a browser visit
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
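As an optional variation, the headers can be attached to a `requests.Session` so they are sent automatically and transient failures are retried. This is a sketch under assumptions: the `build_session` helper, the retry count, the backoff factor, and the example User-Agent string are illustrative choices, not settings the site requires.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(user_agent, retries=3, backoff=1.0):
    """Create a Session with a default User-Agent and automatic retries."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    retry = Retry(
        total=retries,
        backoff_factor=backoff,  # wait longer after each failed attempt
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


session = build_session("Mozilla/5.0 (compatible; example-scraper)")
# Later calls would look like: session.get(url, timeout=10)
```

A session also reuses the underlying connection between requests, which is slightly gentler on the server than opening a new connection for every page.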
Step 2: Define Functions to Handle Pagination
We need to find the total number of pages in each category so our scraper knows when to stop.
def get_total_pages(category_url):
    response = requests.get(category_url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    # Find the pagination element and extract the total number of pages
    pagination = soup.find("nav", class_="woocommerce-pagination")
    if pagination:
        # Keep only the numeric links; the "next" arrow is not a page number
        texts = [a.get_text(strip=True) for a in pagination.find_all("a")]
        numbers = [int(text) for text in texts if text.isdigit()]
        if numbers:
            return max(numbers)
    # No pagination element means the category fits on a single page
    return 1
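To test the pagination logic without hitting the site, the parsing can be separated from the HTTP request. The sketch below takes an HTML string and returns the highest page number it finds. The `parse_total_pages` name and the sample markup are hypothetical; only the `woocommerce-pagination` class comes from the code above.

```python
from bs4 import BeautifulSoup


def parse_total_pages(html):
    """Return the highest page number in the pagination block, or 1 if absent."""
    soup = BeautifulSoup(html, "html.parser")
    pagination = soup.find("nav", class_="woocommerce-pagination")
    if pagination is None:
        return 1
    numbers = []
    # The current page is often a <span>, the other pages are <a> links.
    for node in pagination.find_all(["a", "span"]):
        text = node.get_text(strip=True)
        if text.isdigit():
            numbers.append(int(text))
    return max(numbers, default=1)


# Hypothetical pagination markup for a category with 12 pages.
SAMPLE = """
<nav class="woocommerce-pagination"><ul>
  <li><span class="current">1</span></li>
  <li><a href="/page/2/">2</a></li>
  <li><a href="/page/12/">12</a></li>
  <li><a class="next" href="/page/2/">&rarr;</a></li>
</ul></nav>
"""
print(parse_total_pages(SAMPLE))
print(parse_total_pages("<html></html>"))
```

Taking the maximum of the numeric labels avoids relying on the position of the "next" arrow, which may be missing on the last page.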
Step 3: Extract Book Details
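Each book page contains specific elements that hold details like SKU, title, author, price, and description. The function below retrieves the SKU, title, price, and description; the remaining fields follow the same pattern once you have identified their elements on the page: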
def extract_book_details(book_url):
    response = requests.get(book_url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    details = {}
    # Extract SKU
    sku = soup.find("span", class_="sku")
    details["SKU"] = sku.get_text(strip=True) if sku else None
    # Extract title
    title = soup.find("h1", class_="product_title")
    details["Title"] = title.get_text(strip=True) if title else None
    # Extract price
    price = soup.find("p", class_="price")
    details["Price"] = price.get_text(strip=True) if price else None
    # Extract description
    description = soup.find("div", class_="woocommerce-product-details__short-description")
    details["Description"] = description.get_text(strip=True) if description else None
    return details
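The price comes back as display text such as "$1,250.00", which is awkward to sort or sum. A minimal sketch of a normalization step follows. The `parse_price` helper is hypothetical and assumes US-style formatting (comma as thousands separator, dot as decimal point). If a page shows several prices, for example an original and a sale price, it picks up only the first one.

```python
import re
from decimal import Decimal, InvalidOperation


def parse_price(text):
    """Convert a price string such as '$1,250.00' to Decimal, or None."""
    if not text:
        return None
    # First run of digits, optional thousands commas, optional decimals.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    try:
        return Decimal(match.group(0).replace(",", ""))
    except InvalidOperation:
        return None


print(parse_price("$1,250.00"))
print(parse_price("Call for price"))
print(parse_price(None))
```

Using `Decimal` instead of `float` keeps the cents exact, which matters if the prices are later totaled or compared.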
Step 4: Scrape Multiple Pages in a Category
To collect data across a category, we loop over its pages, gather the book links on each page, and pass every link to extract_book_details.
def scrape_category(category_url):
    books = []
    # Fall back to a single page if no page count could be determined
    total_pages = get_total_pages(category_url) or 1
    # Strip any trailing slash so the page URL has no double slash
    category_root = category_url.rstrip("/")
    for page in range(1, total_pages + 1):
        print(f"Scraping page {page} of {total_pages} in category {category_url}")
        page_url = f"{category_root}/page/{page}/"
        response = requests.get(page_url, headers=headers)
        soup = BeautifulSoup(response.content, "html.parser")
        # Find all book links on the page
        book_links = soup.find_all("a", class_="woocommerce-LoopProduct-link")
        for link in book_links:
            book_url = link.get("href")
            print(f"Scraping book: {book_url}")
            book_details = extract_book_details(book_url)
            books.append(book_details)
            time.sleep(1)  # Delay to avoid overwhelming the server
    return books
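Building the page URL is a small step that is easy to get wrong, for instance when the category URL already ends in a slash. The sketch below isolates it in a helper that can be tested on its own. The `build_page_url` name is hypothetical, and treating page 1 as the plain category URL is an assumption about how the site's `/page/N/` scheme behaves.

```python
def build_page_url(category_url, page):
    """Return the URL for one page of a category, without double slashes."""
    root = category_url.rstrip("/")
    if page <= 1:
        # Assumption: the first page lives at the category URL itself.
        return root + "/"
    return f"{root}/page/{page}/"


category = "https://thefirstedition.com/product-category/literature-classics/"
for number in (1, 2, 3):
    print(build_page_url(category, number))
```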
4. Put It All Together and Demo a Category Link
Now we can apply our scraper to one category and save the output to a CSV file:
# Choose a category to scrape
category_url = "https://thefirstedition.com/product-category/literature-classics/"
# Scrape the chosen category
books_data = scrape_category(category_url)
# Convert the list of books to a DataFrame
df = pd.DataFrame(books_data)
# Save to CSV
df.to_csv("the_first_edition_books.csv", index=False)
print("Scraping completed. Data saved to 'the_first_edition_books.csv'")
Run the above code, and you'll collect the specified details for each book in the chosen category, saved in the_first_edition_books.csv.
The full code snippet is stored here
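If pandas is not available, the same export can be sketched with the standard library's csv module. The `write_books_csv` helper and the sample rows are hypothetical; the column names follow the `SKU`, `Title`, `Price`, and `Description` keys produced by `extract_book_details`. The helper also skips rows whose SKU has already been written, which can happen if a product is linked more than once on a listing page.

```python
import csv
import io

FIELDNAMES = ["SKU", "Title", "Price", "Description"]


def write_books_csv(books, stream):
    """Write book dicts to a text stream as CSV, skipping repeated SKUs."""
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES, extrasaction="ignore")
    writer.writeheader()
    seen = set()
    written = 0
    for book in books:
        sku = book.get("SKU")
        if sku is not None and sku in seen:
            continue  # duplicate product, already written
        seen.add(sku)
        writer.writerow({name: book.get(name) or "" for name in FIELDNAMES})
        written += 1
    return written


# Hypothetical rows, including one duplicate SKU.
sample_books = [
    {"SKU": "12345", "Title": "To Kill a Mockingbird", "Price": "$1,250.00", "Description": None},
    {"SKU": "12345", "Title": "To Kill a Mockingbird", "Price": "$1,250.00", "Description": None},
    {"SKU": "67890", "Title": "Another Book", "Price": "$95.00", "Description": "Signed."},
]
buffer = io.StringIO()
count = write_books_csv(sample_books, buffer)
print(count)
print(buffer.getvalue())
```

To write to disk instead of memory, pass a file opened with `open("the_first_edition_books.csv", "w", newline="", encoding="utf-8")` as the stream.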
5. Lessons Learned
- Respectful Scraping: It's essential to be respectful when scraping. Always add delays between requests to avoid overwhelming the server. Be sure to follow the site's robots.txt guidelines.
- Error Handling: Not all pages are structured the same way. When building scrapers, add checks to handle missing fields or unexpected layouts.
- Pagination Logic: Navigating multi-page content is crucial for comprehensive data gathering. Test your pagination logic carefully to ensure all items are captured.
- Data Structure: Organize scraped data meaningfully. Using a structured format like CSV or a database makes it easier to analyze or use the data later.
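The robots.txt point can be checked in code with Python's built-in `urllib.robotparser`. The sketch below parses a hypothetical robots.txt body offline; the rules shown are invented for illustration and are not the site's actual rules. In a real run, the file would be fetched from the site first, for example with the parser's `set_url` and `read` methods.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, NOT the real file from the site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /cart/
Disallow: /my-account/
Crawl-delay: 2
"""


def make_robot_checker(robots_text):
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


robots = make_robot_checker(SAMPLE_ROBOTS)
agent = "example-scraper"  # hypothetical user agent name
category = "https://thefirstedition.com/product-category/literature-classics/"
cart = "https://thefirstedition.com/cart/"

print(robots.can_fetch(agent, category))
print(robots.can_fetch(agent, cart))

# Use the advertised crawl delay if there is one, else fall back to 1 second.
delay = robots.crawl_delay(agent) or 1
print(delay)
```

The resulting `delay` could replace the fixed one-second pause between requests, so the scraper follows whatever pace the site asks for.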
This guide shows how to break down and solve a web scraping problem efficiently, leaving you with both structured data and insights into building web scrapers. Happy scraping!
Where to go from here: We only covered how to scrape the list of book links from a category link and the book details from a book link. We haven't covered how to get the list of category links from the site. You can give it a try.
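As a starting point for that exercise, one possible approach is to collect every link whose path contains `/product-category/`, the pattern seen in the category URL used earlier. This is only a sketch: the `extract_category_links` helper and the sample markup are hypothetical, and the real site may organize its category links differently.

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def extract_category_links(html, base_url):
    """Collect unique category URLs from a page's HTML, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        url = urljoin(base_url, anchor["href"])  # resolve relative links
        path = urlparse(url).path
        # Keep category listings, skip their paginated variants.
        if "/product-category/" in path and "/page/" not in path:
            if url not in links:
                links.append(url)
    return links


# Hypothetical navigation markup with a duplicate and a paginated link.
SAMPLE_NAV = """
<a href="https://thefirstedition.com/product-category/literature-classics/">Literature</a>
<a href="/product-category/poetry/">Poetry</a>
<a href="/product-category/poetry/">Poetry again</a>
<a href="/product-category/poetry/page/2/">Next</a>
<a href="/about/">About</a>
"""
categories = extract_category_links(SAMPLE_NAV, "https://thefirstedition.com")
print(categories)
```

Each URL in the result could then be passed to `scrape_category` in turn.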