How to Scrape The First Edition site - A Sample Code Walkthrough
By hientd, at: Nov. 8, 2024, 2:10 p.m.
Scraping data from websites is often a useful skill, whether you're collecting data for research, creating an aggregation of listings, or building a web-based product. In this guide, we'll show you how to scrape books listed on The First Edition, a site with a rich catalog of rare books. We'll walk you through a problem-solving approach to gather information like SKU, title, author, price, edition, description, location, date published, and ISBN.
1. A Sample Problem
Our goal is to collect book information from The First Edition. We'll aim to extract essential details like SKU, title, author, price, edition, description, location, date published, and ISBN for each book. Let's imagine a scenario where we need this data for analysis or to showcase book data in an app.
The sample output for a book might look like this:
{
"sku": "12345",
"title": "To Kill a Mockingbird",
"author": "Harper Lee",
"price": "$1,250.00",
"edition": "First Edition",
"description": "A rare first edition of Harper Lee's 'To Kill a Mockingbird' with original dust jacket.",
"location": "New York, USA",
"date_published": "1960",
"isbn": "978-0-06-112008-4"
}
2. Analyze the Problem and Build the Solution Steps
When approaching a web scraping task, breaking down the problem is crucial. Here are the steps we'll follow:
- Identify the Data: Look at the webpage structure to find the exact elements that contain the required data.
- Understand Pagination: Most e-commerce pages have multiple pages of listings. We need to figure out how to navigate through these pages.
- Structure the Scraper: Build a function to extract data from a single book page, then scale it to scrape data across multiple books and pages.
- Compile Data: Store the scraped data in a structured format (e.g., CSV) for further analysis.
3. Implement the Solution
Step 1: Import Required Libraries and Set Up Headers
To mimic a human browser visit, we'll use HTTP headers in our requests.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
# Base URL of the website
base_url = "https://thefirstedition.com"
# Headers to mimic a browser visit
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
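Since the scraper will make many requests to the same host, it can help to wrap the headers in a `requests.Session`, which applies them to every request and reuses the underlying connection across pages. This is a small optional refinement of the setup above, not something the rest of the walkthrough depends on:

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/91.0.4472.124 Safari/537.36"
}

# A Session sends these headers with every request and keeps the TCP
# connection alive between pages, which is faster and gentler on the server.
session = requests.Session()
session.headers.update(headers)
# From here on, session.get(url) behaves like requests.get(url, headers=headers)
```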
Step 2: Define Functions to Handle Pagination
We need to find the total number of pages in each category so our scraper knows when to stop.
def get_total_pages(category_url):
    response = requests.get(category_url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    # Find the pagination element and extract the total number of pages
    pagination = soup.find("nav", class_="woocommerce-pagination")
    if pagination:
        pages = pagination.find_all("a")
        if pages:
            # The last link is the "next" arrow; the one before it is the final page
            last_page = pages[-2].get_text()
            return int(last_page)
    # No pagination element means the category fits on a single page
    return 1
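You can sanity-check the pagination logic offline against a static snippet of WooCommerce-style markup. The HTML below is an assumption modeled on the selectors used in `get_total_pages`, not copied from the live site:

```python
from bs4 import BeautifulSoup

# A minimal WooCommerce-style pagination block (markup is an assumption
# modeled on the selectors above -- inspect the real page to confirm).
html = """
<nav class="woocommerce-pagination">
  <a href="/page/1/">1</a>
  <a href="/page/2/">2</a>
  <a href="/page/7/">7</a>
  <a href="/page/2/" class="next">&rarr;</a>
</nav>
"""

soup = BeautifulSoup(html, "html.parser")
pagination = soup.find("nav", class_="woocommerce-pagination")
pages = pagination.find_all("a")
# The last <a> is the "next" arrow, so the second-to-last holds the page count
last_page = int(pages[-2].get_text())
print(last_page)  # 7
```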
Step 3: Extract Book Details
Each book page contains specific elements that hold details like SKU, title, author, price, and description. Here's how to retrieve these elements:
def extract_book_details(book_url):
    response = requests.get(book_url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    details = {}
    # Extract SKU
    sku = soup.find("span", class_="sku")
    details["SKU"] = sku.get_text(strip=True) if sku else None
    # Extract title
    title = soup.find("h1", class_="product_title")
    details["Title"] = title.get_text(strip=True) if title else None
    # Extract price
    price = soup.find("p", class_="price")
    details["Price"] = price.get_text(strip=True) if price else None
    # Extract description
    description = soup.find("div", class_="woocommerce-product-details__short-description")
    details["Description"] = description.get_text(strip=True) if description else None
    return details
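The sample output in section 1 also lists author, edition, location, date published, and ISBN, which the function above doesn't yet pull out. On many WooCommerce product pages these live in an attributes table; here's a hedged sketch of parsing one, where the table markup and class name are assumptions you should verify against the actual page source:

```python
from bs4 import BeautifulSoup

# Sketch of pulling extra fields (author, ISBN, etc.) from a product
# attributes table. The table structure and class name are assumptions --
# inspect the real product page and adjust the selectors to match.
html = """
<table class="woocommerce-product-attributes">
  <tr><th>Author</th><td>Harper Lee</td></tr>
  <tr><th>ISBN</th><td>978-0-06-112008-4</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
attributes = {}
table = soup.find("table", class_="woocommerce-product-attributes")
if table:
    for row in table.find_all("tr"):
        label = row.find("th")
        value = row.find("td")
        if label and value:
            attributes[label.get_text(strip=True)] = value.get_text(strip=True)

print(attributes)
```

Merging this dictionary into `details` inside `extract_book_details` would complete the record shown in the sample output.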
Step 4: Scrape Multiple Pages in a Category
To loop through pages in a category and collect data, we’ll use a pagination function.
def scrape_category(category_url):
    books = []
    total_pages = get_total_pages(category_url)
    for page in range(1, total_pages + 1):
        print(f"Scraping page {page} of {total_pages} in category {category_url}")
        page_url = f"{category_url}/page/{page}/"
        response = requests.get(page_url, headers=headers)
        soup = BeautifulSoup(response.content, "html.parser")
        # Find all book links on the page
        book_links = soup.find_all("a", class_="woocommerce-LoopProduct-link")
        for link in book_links:
            book_url = link.get("href")
            print(f"Scraping book: {book_url}")
            book_details = extract_book_details(book_url)
            books.append(book_details)
        time.sleep(1)  # Delay to avoid overwhelming the server
    return books
4. Put It All Together and Demo a Category Link
Now we can apply our scraper to one category and save the output to a CSV file:
# Choose a category to scrape
category_url = "https://thefirstedition.com/product-category/literature-classics/"
# Scrape the chosen category
books_data = scrape_category(category_url)
# Convert the list of books to a DataFrame
df = pd.DataFrame(books_data)
# Save to CSV
df.to_csv("the_first_edition_books.csv", index=False)
print("Scraping completed. Data saved to 'the_first_edition_books.csv'")
Run the above code, and you'll collect the specified details for each book in the chosen category, saved to the_first_edition_books.csv.
5. Lessons Learned
- Respectful Scraping: It’s essential to be respectful when scraping. Always add delays between requests to avoid overwhelming the server, and be sure to follow the site’s robots.txt guidelines.
- Error Handling: Not all pages are structured the same way. When building scrapers, add checks to handle missing fields or unexpected layouts.
- Pagination Logic: Navigating multi-page content is crucial for comprehensive data gathering. Test your pagination logic carefully to ensure all items are captured.
- Data Structure: Organize scraped data meaningfully. Using a structured format like CSV or a database makes it easier to analyze or use the data later.
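The error-handling and respectful-scraping points above can be combined in a small retry helper with exponential backoff. This is a sketch of the pattern rather than part of the scraper above; the `fetch` callable is a hypothetical parameter (e.g. a wrapper around `requests.get`) that keeps the helper testable without a live server:

```python
import time

def fetch_with_retries(fetch, url, retries=3, base_delay=1.0):
    """Call fetch(url), retrying with exponential backoff on failure.

    `fetch` is any callable that takes a URL -- for example a thin
    wrapper around requests.get that raises on bad responses.
    """
    delay = base_delay
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts; let the caller decide what to do
            time.sleep(delay)  # wait before retrying, growing each time
            delay *= 2  # exponential backoff keeps pressure off the server
```

Dropping `extract_book_details` in as the `fetch` argument would make every book request resilient to transient network hiccups.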
This guide shows how to break down and solve a web scraping problem efficiently, leaving you with both structured data and insights into building web scrapers. Happy scraping!