How to Scrape The First Edition site - A Sample Code Walkthrough

By hientd, at 14:10 on November 8, 2024

Estimated reading time: 7 min read


Scraping data from websites is a useful skill, whether you're collecting data for research, aggregating listings, or building a web-based product. In this guide, we'll show you how to scrape books listed on The First Edition, a site with a rich catalog of rare books. We'll walk you through a problem-solving approach to gather information like SKU, title, author, price, edition, description, location, date published, and ISBN.

 

1. A Sample Problem

Our goal is to collect book information from The First Edition. We'll aim to extract essential details like SKU, title, author, price, edition, description, location, date published, and ISBN for each book. Let's imagine a scenario where we need this data for analysis or to showcase book data in an app.

The sample output for a book might look like this:

{
  "sku": "12345",
  "title": "To Kill a Mockingbird",
  "author": "Harper Lee",
  "price": "$1,250.00",
  "edition": "First Edition",
  "description": "A rare first edition of Harper Lee's 'To Kill a Mockingbird' with original dust jacket.",
  "location": "New York, USA",
  "date_published": "1960",
  "isbn": "978-0-06-112008-4"
}

 

2. Analyze the Problem and Build the Solution Steps

When approaching a web scraping task, breaking down the problem is crucial. Here are the steps we'll follow:

  1. Identify the Data: Look at the webpage structure to find the exact elements that contain the required data.
     
  2. Understand Pagination: Most e-commerce pages have multiple pages of listings. We need to figure out how to navigate through these pages.
     
  3. Structure the Scraper: Build a function to extract data from a single book page, then scale it to scrape data across multiple books and pages.
     
  4. Compile Data: Store the scraped data in a structured format (e.g., CSV) for further analysis.
     

3. Implement the Solution

Step 1: Import Required Libraries and Set Up Headers

To mimic a human browser visit, we'll use HTTP headers in our requests.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

# Base URL of the website
base_url = "https://thefirstedition.com"

# Headers to mimic a browser visit
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
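
Before building the scraper, it can help to confirm that the site actually answers a browser-like request. This quick check is our own addition, not part of the final scraper:

# Quick sanity check: a browser-like request should return HTTP 200
response = requests.get(base_url, headers=headers)
print(response.status_code)  # Expect 200; a 403 would suggest the request was blocked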

 

Step 2: Define Functions to Handle Pagination

We need to find the total number of pages in each category so our scraper knows when to stop.

def get_total_pages(category_url):
    response = requests.get(category_url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    # Find the pagination element and extract the total number of pages
    pagination = soup.find("nav", class_="woocommerce-pagination")
    if pagination:
        pages = pagination.find_all("a")
        if pages:
            # The last link is usually the "next" arrow, so the
            # second-to-last link holds the highest page number
            last_page = pages[-2].get_text()
            return int(last_page)
    # No pagination element means the category fits on a single page
    return 1
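
For example, assuming the Literature & Classics category used in the demo below is reachable, you can try the helper like this:

total_pages = get_total_pages(f"{base_url}/product-category/literature-classics/")
print(f"Total pages: {total_pages}")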

 

Step 3: Extract Book Details

Each book page contains specific elements that hold details like SKU, title, author, price, and description. Here's how to retrieve these elements:

def extract_book_details(book_url):
    response = requests.get(book_url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    details = {}
    # Extract SKU
    sku = soup.find("span", class_="sku")
    details["SKU"] = sku.get_text(strip=True) if sku else None
    # Extract title
    title = soup.find("h1", class_="product_title")
    details["Title"] = title.get_text(strip=True) if title else None
    # Extract price
    price = soup.find("p", class_="price")
    details["Price"] = price.get_text(strip=True) if price else None
    # Extract description
    description = soup.find("div", class_="woocommerce-product-details__short-description")
    details["Description"] = description.get_text(strip=True) if description else None
    return details
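
Note that the function above only covers SKU, title, price, and description, while our target output also includes author, edition, location, date published, and ISBN. On WooCommerce sites these extra details often live in the product's attributes table. The helper below is a hypothetical sketch: the woocommerce-product-attributes selector and the label texts are assumptions about this site's markup, so inspect a real book page and adjust as needed.

def extract_extra_details(soup):
    # Hypothetical helper: pull author, edition, location, date published,
    # and ISBN from a WooCommerce-style attributes table, if present.
    # Map the attribute labels we expect (an assumption) to output field names
    wanted = {"author": "Author", "edition": "Edition", "location": "Location",
              "date published": "Date Published", "isbn": "ISBN"}
    extras = {field: None for field in wanted.values()}
    # Assumed markup: <table class="woocommerce-product-attributes"> with
    # one <tr> per attribute (label in <th>, value in <td>)
    table = soup.find("table", class_="woocommerce-product-attributes")
    if table:
        for row in table.find_all("tr"):
            label, value = row.find("th"), row.find("td")
            if label and value:
                key = label.get_text(strip=True).lower()
                if key in wanted:
                    extras[wanted[key]] = value.get_text(strip=True)
    return extras

If the table exists, you can merge these extras into the result of extract_book_details with details.update(extract_extra_details(soup)) before returning.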

 

Step 4: Scrape Multiple Pages in a Category

To loop through every page in a category and collect data, we combine the pagination helper with the book-detail extractor.

def scrape_category(category_url):
    books = []
    total_pages = get_total_pages(category_url)
    for page in range(1, total_pages + 1):
        print(f"Scraping page {page} of {total_pages} in category {category_url}")
        # Strip any trailing slash to avoid a double slash in the page URL
        page_url = f"{category_url.rstrip('/')}/page/{page}/"
        response = requests.get(page_url, headers=headers)
        soup = BeautifulSoup(response.content, "html.parser")
        # Find all book links on the page
        book_links = soup.find_all("a", class_="woocommerce-LoopProduct-link")
        for link in book_links:
            book_url = link.get("href")
            print(f"Scraping book: {book_url}")
            book_details = extract_book_details(book_url)
            books.append(book_details)
            time.sleep(1)  # Delay to avoid overwhelming the server
    return books
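
On a long crawl, the occasional request will time out or come back with an error status. The wrapper below is a sketch of our own (not part of the original walkthrough) that retries a few times before giving up; you could swap it in for the bare requests.get calls above:

def get_with_retries(url, retries=3, backoff=2):
    # Fetch a URL, retrying with a growing delay; returns None if all attempts fail
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException as exc:
            print(f"Attempt {attempt} failed for {url}: {exc}")
            if attempt < retries:
                time.sleep(backoff * attempt)  # Wait longer after each failure
    return None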

 

4. Put It All Together and Demo a Category Link

Now we can apply our scraper to one category and save the output to a CSV file:

# Choose a category to scrape
category_url = "https://thefirstedition.com/product-category/literature-classics/"

# Scrape the chosen category
books_data = scrape_category(category_url)

# Convert the list of books to a DataFrame
df = pd.DataFrame(books_data)

# Save to CSV
df.to_csv("the_first_edition_books.csv", index=False)
print("Scraping completed. Data saved to 'the_first_edition_books.csv'")

 

Run the above code, and you'll collect the specified details for each book in the chosen category, saved in the_first_edition_books.csv.

The full code snippet is stored here

 

5. Lessons Learned

 

  • Respectful Scraping: It’s essential to be respectful when scraping. Always add delays between requests to avoid overwhelming the server, and follow the site’s robots.txt guidelines (see the sketch after this list).
     
  • Error Handling: Not all pages are structured the same way. When building scrapers, add checks to handle missing fields or unexpected layouts.
     
  • Pagination Logic: Navigating multi-page content is crucial for comprehensive data gathering. Test your pagination logic carefully to ensure all items are captured.
     
  • Data Structure: Organize scraped data meaningfully. Using a structured format like CSV or a database makes it easier to analyze or use the data later.
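
As a concrete example of the robots.txt point above, Python's standard-library urllib.robotparser can check whether a path is allowed before you fetch it. A minimal sketch, reusing the category URL from the demo:

from urllib.robotparser import RobotFileParser

# Load the site's robots.txt and check a category path before crawling it
robot_parser = RobotFileParser()
robot_parser.set_url("https://thefirstedition.com/robots.txt")
robot_parser.read()

category_url = "https://thefirstedition.com/product-category/literature-classics/"
if robot_parser.can_fetch(headers["User-Agent"], category_url):
    print("Allowed to crawl this category")
else:
    print("robots.txt disallows this path; skip it")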
     

This guide shows how to break down and solve a web scraping problem efficiently, leaving you with both structured data and insights into building web scrapers. Happy scraping!

Where to go from here: We only covered how to scrape the list of book links from a category page and the book details from a book page. We haven't covered how to get the list of category links from the site itself; give that a try.

