How to scrape zalando.co.uk (fashion only)

By huynq, at: Jan. 17, 2024, 10:08 p.m.

Estimated Reading Time: 12 min read


How to Scrape zalando.co.uk
 

1. Playwright: A Powerful Web Automation Tool
 

What is Playwright?

Playwright is a comprehensive web automation framework that Microsoft first released in January 2020. It is designed to drive web browsers such as Chrome, Firefox, and Safari, providing a flexible and robust approach to web page interaction and data extraction.
 

Why Playwright?

Playwright goes beyond being a simple browser automation tool. It offers versatility in performing various tasks including web interactions, mouse and keyboard manipulations, screenshot capturing, and even handling multi-page applications. This flexibility makes Playwright an ideal choice for efficient web scraping.

Another option is Selenium, but Playwright is much better in terms of:

  • Improved Browser Support: Playwright provides native support for all major browsers (Chromium, Firefox, and WebKit) using a single API, allowing for consistent cross-browser testing without browser-specific configurations.

  • Better Handling of Modern Web Features: Designed for modern web applications, Playwright effortlessly supports complex scenarios like single-page applications (SPAs), web sockets, and service workers, which might pose challenges in Selenium.

  • Faster Execution: Thanks to its architecture and optimizations, Playwright offers faster test execution compared to Selenium, reducing overall test run time, especially in extensive automated test suites.

  • Built-in Parallel Test Execution: Playwright includes built-in support for running tests in parallel, simplifying setup and scaling of test suites, unlike Selenium, which requires additional tools for parallel testing.

  • Auto-Wait Features: It automatically waits for elements to be ready before executing actions, minimizing flakiness and improving the reliability of tests by reducing the need for explicit waits.

  • Simplified Setup for Headless Testing: Offering straightforward configuration for headless testing, Playwright facilitates easier integration into CI/CD pipelines for automated testing environments.

  • Rich Set of APIs for Modern Interactions: With a comprehensive API suite, Playwright simulates complex user interactions like multi-page scenarios, file uploads, and downloads, making it adaptable for testing sophisticated user interfaces.

  • Enhanced Debugging Capabilities: Playwright provides tools to capture screenshots, record test session videos, and trace actions, aiding in the diagnosis and resolution of issues in automated tests.

  • Multi-language Support: Initially for Node.js, Playwright now also supports Python, Java, and C#, making it accessible to a broader range of development teams across different tech stacks.

  • Active Development and Community Support: As a newer tool, Playwright benefits from active development, frequent updates, and a growing community that offers a wealth of resources and support.

 

2. Objectives and Data Output

The goal is to scrape all men's fashion products, capturing each product's price, title, and description. The desired output format is shown below:

[
    {
        "listing_url": "https://www.zalando.co.uk/pier-one-shirt-olive-pi922d0b0-n11.html",
        "title": "Pier One Shirt - olive",
        "description": "Pier One Shirt - olive for 327.99 (2023-08-02) Free shipping on most orders*"
    },
    {
        "listing_url": "https://www.zalando.co.uk/pier-one-shirt-black-pi922d0b0-q11.html",
        "title": "Pier One Shirt - black",
        "description": "Pier One Shirt - black for 327.99 (2023-08-02) Free shipping on most orders*"
    }
]

# Each entry in the JSON array represents a product with details such as listing URL, title, and description
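Such records can be written out with Python's standard json module. A minimal sketch (the file name and the single sample record are illustrative):

```python
import json

# Hypothetical scraped records matching the format above
products = [
    {
        "listing_url": "https://www.zalando.co.uk/pier-one-shirt-olive-pi922d0b0-n11.html",
        "title": "Pier One Shirt - olive",
        "description": "Pier One Shirt - olive for 327.99 (2023-08-02) Free shipping on most orders*",
    },
]

# Write the array to disk as pretty-printed JSON
with open("products.json", "w") as fp:
    json.dump(products, fp, indent=4)
```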

 

3. Solution


3.1. Review website

Why is it necessary to review a website before we proceed with scraping? If we don't review the website, we won't know what information it offers, how the products are presented, or the overall structure of the website. Without a clear understanding of the website we want to scrape, devising an effective scraping strategy becomes challenging. Therefore, reviewing the website is crucial to developing the most efficient and optimal scraping strategy.

Take the time to go through all the pages dedicated to men's products. Right-click on the page and select "Inspect" to examine the detailed structure of the web elements. The more thorough our review, the easier it will be to formulate the best scraping strategy for our needs.
 

3.2. Planning

Firstly, to collect all product pages in the men's category, we have two approaches:

  1. If the main page displays the total number of pages, we will immediately retrieve that value to obtain all the pages of the website.
  2. Otherwise, we will retrieve the total number of products and divide it by the number of products per page to calculate the total number of pages.
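The second approach reduces to a ceiling division. A sketch, using hypothetical figures (the 84-products-per-page value is an assumption, not taken from the site):

```python
import math

def calculate_total_pages(total_products: int, products_per_page: int) -> int:
    """Derive the page count when only the product total is displayed."""
    return math.ceil(total_products / products_per_page)

# Hypothetical figures: 12,345 products at 84 per page
print(calculate_total_pages(12345, 84))  # -> 147
```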

Afterward, having gathered all the URLs of product pages, we will access each link/url to scrape product detail.

For this website, the process will involve collecting a list of URLs for all products first. Then, we will gather all the remaining information.
 

3.3. Using Playwright

Installing Playwright:

pip install playwright
playwright install


Writing Playwright Code:

  • Import Playwright into your Python code.
  • Initialize a browser context.
  • Navigate to zalando.co.uk.
  • Interact with elements using selectors.
  • Extract and store the necessary data.

Example:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://www.zalando.co.uk/mens-clothing/")
    # Perform data extraction operations
    # ...
    browser.close()

 

3.4. Scraping All Paging of Men Fashion

After the initial steps to access the website, the next task is to extract information about all products by navigating through the pagination.


# Example code for building all product-page links
total_page = calculate_total_pages()  # Derive the total page count here
results = []

for page in range(1, total_page + 1):
    link = f"https://www.zalando.co.uk/mens-clothing/?p={page}"
    results.append(link)

# Write links to file, one URL per line
with open(result_path, "w") as fp:
    for line in results:
        fp.write(f"{line}\n")

 

3.5. Scraping Product Details for Each Product

Once you have a file containing all the detailed links to the products, read each link one by one and extract the necessary information.

# Example code for scraping product details
for link in read_links_from_file(result_path):
    product_data = scrape_product_details(link)
    # Process product data as needed
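A minimal sketch of the hypothetical read_links_from_file helper used above, which simply yields one stripped URL per non-empty line of the links file:

```python
def read_links_from_file(path):
    """Yield one product URL per non-empty line of the links file."""
    with open(path) as fp:
        for line in fp:
            line = line.strip()
            if line:
                yield line
```

Using a generator here keeps memory flat even when the links file holds tens of thousands of URLs.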


Use Playwright's JavaScript rendering capabilities to extract information such as the URL, title, and description.

# Example code for product information retrieval (a method on the scraper class;
# requires `import json` at the top of the module)
def extract_product(self):
    data = self.page.query_selector('script[type="application/ld+json"]')
    if not data:
        return None
    data = json.loads(data.text_content())
    title = f'{data.get("manufacturer")} {data.get("name")}'
    description = data.get("description")
    listing_url = f'https://www.zalando.co.uk/{data.get("url")}'
    offers = data.get("offers") or []
    price = offers[0].get("price") if offers else None
    return {
        "listing_url": listing_url,
        "title": title,
        "description": description,
        "price": price,
    }
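To illustrate, here is the same parsing logic applied to a hypothetical JSON-LD payload. The field names mirror the snippet above, but the payload itself is invented; the actual schema served by zalando.co.uk may differ:

```python
import json

# Hypothetical application/ld+json content, as query_selector might return it
raw = """{
    "manufacturer": "Pier One",
    "name": "Shirt - olive",
    "description": "Pier One Shirt - olive for 327.99",
    "url": "pier-one-shirt-olive-pi922d0b0-n11.html",
    "offers": [{"price": "327.99"}]
}"""

data = json.loads(raw)
offers = data.get("offers") or []
product = {
    "listing_url": f'https://www.zalando.co.uk/{data.get("url")}',
    "title": f'{data.get("manufacturer")} {data.get("name")}',
    "description": data.get("description"),
    "price": offers[0].get("price") if offers else None,
}
print(product["title"])  # -> Pier One Shirt - olive
```

Guarding the offers lookup with `or []` avoids a crash when the payload omits pricing entirely.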


 

4. Final Source Code

5. Potential problems

  1. ReCaptcha

    ReCaptcha poses a significant challenge for browser automation tools like Playwright. This can hinder the automation of login processes or interactions on websites that require user confirmation.

    Solution: Use a CAPTCHA-solving service or library to handle ReCaptcha effectively. Alternatively, detect the challenge, raise an exception, and retry the request from a fresh session.

  2. IP Banned

    When using Playwright to make multiple requests from the same IP address, there is a risk of being banned by the server.

    Solution: Use proxies to dynamically change the IP address and avoid being banned.

  3. Request Limit

    Some websites impose restrictions on the number of allowed requests within a specified time frame, especially when using Playwright for automation.

    Solution: Optimize the number of requests and the wait time between them. For websites requiring login, consider using sessions to maintain the login state.

  4. Different Page Layouts

    Websites often change their structure and interfaces, which can reduce the stability of automated scripts.

    Solution: Create flexible scripts, paying attention to handling various cases of page structure.

  5. Crashing/Hanging

    Playwright may encounter issues with certain browsers or websites, leading to crashes or application closures.

    Solution: Use Playwright's error-handling mechanisms to log and notify issues. Employ services like Sentry or other monitoring tools to track and promptly report occurring problems.
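Several of the solutions above (request limits, crashes, transient bans) come down to retrying with a growing delay. A minimal sketch of such a wrapper; the attempt counts and delays are illustrative, not tuned for any particular site:

```python
import random
import time

def with_retries(action, max_attempts=3, base_delay=1.0):
    """Run a scraping action, retrying with exponential backoff on failure.

    `action` is any zero-argument callable (e.g. a page fetch).
    """
    for attempt in range(max_attempts):
        try:
            return action()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff with a little jitter between attempts
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```

Wrapping each page visit in `with_retries(lambda: page.goto(url))` keeps one flaky response from killing a long crawl.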

By proactively addressing these issues and applying suitable solutions, you can harness the full potential of Playwright in browser automation.
 

6. Tips, Tricks, Best Practices

  • Wait Time Management: Integrate suitable wait times into your scripts to allow web pages ample time to render JavaScript. This is crucial for ensuring that elements are fully loaded before interacting with them, reducing the likelihood of errors.
     
  • Error Handling: Implement robust error-handling mechanisms to maintain stability during the scraping process. Log errors appropriately, and consider retrying failed actions with exponential backoff strategies to improve reliability.
     
  • User-Agent Management: Set the User-Agent header to emulate different browsers and devices, helping to avoid detection and potential blocking from servers. Rotate User-Agents periodically to mimic diverse user behaviors.
     
  • Use ChatGPT Vision to extract data from HTML: Leverage ChatGPT Vision to enhance your data extraction capabilities. By integrating ChatGPT Vision, you can efficiently extract information from HTML, making your data scraping process more intelligent and adaptable.
     
  • Dynamic Element Identification: Employ robust strategies for identifying web elements dynamically. Using stable identifiers such as CSS classes or data attributes helps your scripts adapt to changes in the webpage structure.
     
  • Proxy Rotation: Utilize a rotation of proxies to diversify IP addresses, preventing IP bans and enhancing the anonymity of your web scraping activities.
     
  • Logging and Monitoring: Implement comprehensive logging to keep track of script activities and identify potential issues. Utilize monitoring tools or services like Sentry to receive real-time alerts about critical errors.
     
  • Avoid Overloading Servers: Be mindful of the rate of your requests to avoid overloading servers, which can lead to IP bans. Implement rate-limiting strategies to align with the target website's policies.
     
  • Regular Script Maintenance: Periodically review and update your scripts to accommodate changes in the target website's structure or policies. Regular maintenance ensures continued effectiveness and minimizes disruptions.
     
  • Testing Environments: Develop and test your scripts in controlled environments before deploying them in production. This helps identify potential issues early and ensures a smoother automation process.
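The User-Agent management tip above can be sketched as a simple rotation. The strings below are illustrative placeholders; with Playwright, the chosen value would typically be passed via `browser.new_context(user_agent=...)`:

```python
import itertools

# Illustrative User-Agent strings; real rotations usually draw from a larger pool
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

ua_cycle = itertools.cycle(USER_AGENTS)

def next_user_agent() -> str:
    """Return the next User-Agent in the rotation."""
    return next(ua_cycle)
```

Each new browser context can then pick up a fresh header, making the traffic look less uniform.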
     

By incorporating these tips, tricks, and best practices into your Playwright scripts, you can enhance their reliability, adaptability, and overall effectiveness in automating browser interactions and data extraction.
 

7. Conclusion

Utilizing Playwright for scraping data from zalando.co.uk provides flexibility and power. You can automate various tasks and efficiently gather detailed product information. Adhering to best practices ensures stability and safety throughout the scraping process. Happy scraping!

 

