How to Build a Web Scraper with Python and Playwright

By JoeVu, at 22:19 on March 23, 2023

Estimated reading time: 23 min read


1. Introduction

Web scraping is a technique used to extract data from websites, allowing users to gather information that can be used for various purposes such as market research, competitor analysis, or data analysis. With the increasing amount of data available on the internet, web scraping has become an essential tool for many businesses and individuals.


1.1 Brief explanation of web scraping and its uses

Web scraping is the automated extraction of data from websites, carried out by web scrapers or web spiders that crawl through web pages, following links and gathering data as they go.

For example, an e-commerce website may use web scraping to gather data on its competitors' pricing strategies or to monitor customer reviews.


1.2 Overview of Playwright and its features

Playwright is a cross-browser automation library developed by Microsoft that allows developers to automate browser actions, such as clicking buttons, filling out forms, and extracting data. It supports multiple programming languages, including Python, and provides a powerful and flexible API for interacting with web pages.

Some of the features of Playwright include:

  • Support for multiple browsers, including Chromium, Firefox, and WebKit
  • Headless and non-headless modes
  • Integration with testing frameworks, such as pytest for Python (and Jest or Mocha for Node.js)
  • Mobile-browser automation, with emulation of different devices and screen sizes (see the sketch after this list)
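
For instance, here is a minimal sketch of device emulation using Playwright's built-in device registry (the device name and URL are just examples):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Look up a predefined device descriptor (viewport, user agent, touch support)
    iphone = p.devices['iPhone 13']
    browser = p.webkit.launch()
    # Apply the descriptor to a new browser context
    context = browser.new_context(**iphone)
    page = context.new_page()
    page.goto('https://www.example.com')
    print(page.viewport_size)  # reflects the emulated device's screen size
    browser.close()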

 

1.3 Explanation of the benefits of using Playwright for web scraping

Playwright provides a number of benefits for web scraping:

  • A powerful API for automating browser actions and extracting data from web pages, making it an excellent choice for web scraping tasks
  • Support for multiple browsers and headless mode, which makes it easy to scrape data from websites that require JavaScript rendering
  • Support for various testing frameworks can also be useful for web scraping tasks, as it allows developers to easily integrate scraping scripts into their testing workflows
  • Support for mobile browsers and emulation of different devices can be useful for scraping data from mobile websites.

Here's an example of how to use Playwright to automate the process of navigating to a web page and extracting its title:

import asyncio
from playwright.async_api import async_playwright, Browser

async def main() -> None:
    async with async_playwright() as p:
        # Launch a headless Chromium browser
        browser: Browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://www.example.com")
        title = await page.title()
        print(f"The page title is: {title}")
        await browser.close()

asyncio.run(main())


This code launches a headless Chromium browser, navigates to the example.com website, extracts the page title, and then closes the browser.

Here is the synchronous version of the above example:

from playwright.sync_api import sync_playwright

def scrape_data(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # extract data from the page using page methods
        title = page.title()
        # close the browser when done
        browser.close()
        return title

# call the function to scrape data from a website
title = scrape_data('https://www.example.com')
print(title)


In this version, we use the sync_playwright() function to create a synchronous instance of the Playwright library. We use regular functions instead of async functions, and manage the Playwright lifecycle with a with statement. Inside the function, we use the browser and page objects to navigate to the website and extract data from it. Finally, we close the browser and return the extracted data.

The main difference between this and the previous code snippet is that we use synchronous functions instead of asynchronous functions. This can be useful if you prefer to use a synchronous programming style, or if you're working with code that doesn't support async functions. However, it's worth noting that using the async version of Playwright can provide better performance and scalability, especially when working with large or complex web scraping tasks.

 

2. Getting Started with Playwright


2.1 Installation of Playwright and Python dependencies

To use Playwright, you first need to install the Python package along with the browser binaries. The recommended way to install Playwright is via pip.

You can follow the instructions on the official website, or simply run the following commands in your terminal:

pip install playwright
pip install pytest-playwright  # recommended if you also plan to write tests
playwright install  # download the browser binaries


2.2 Creating a new Playwright project

To create a new Playwright project, you can simply create a new Python file and import the Playwright library:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Your code here
    pass


This code creates a new Playwright instance and allows you to use its synchronous API.

 

2.3 Explanation of the basic structure of a Playwright project

A typical Playwright project consists of the following components:

  1. Launching a browser
  2. Creating a new page
  3. Navigating to a URL
  4. Interacting with the page (e.g. clicking links, filling out forms, etc.)
  5. Extracting data from the page

For each component, we can create a module if the project is complicated, otherwise, we can keep them all in a single script/class.

Here is an example script that demonstrates these steps:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch the browser
    browser = p.chromium.launch(headless=False)
    # Create a new page
    page = browser.new_page()
    # Navigate to a URL
    page.goto('https://example.com')
    # Interact with the page (placeholder selectors; adapt them to
    # elements that actually exist on your target site)
    page.click('a')
    page.fill('input[type="text"]', 'example')
    # Extract data from the page
    title = page.title()
    content = page.text_content('body')
    # Close the browser
    browser.close()


This script launches a Chromium browser in non-headless mode, creates a new page, navigates to example.com, clicks the first link on the page, fills out a text input field, extracts the title and body text of the page, and finally closes the browser.

In the next section, we'll look at how to use Playwright to build a web scraper.


3. Building a Web Scraper with Playwright


3.1 Defining the target website and its structure

Before starting to build a web scraper with Playwright, it is crucial to define the target website and its structure. This means identifying the specific pages, elements, and data that you want to extract. You can use browser developer tools to inspect the HTML code and CSS selectors of the target website.

For example, if you want to extract product information from an e-commerce website, you need to identify the page containing the product information, such as the product title, price, description, and image. You can use CSS selectors to target the specific HTML elements that contain this information.

# Example CSS selector to target product title
product_title_selector = 'div.product-info h1.product-title'
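
If a page exposes several fields, it helps to keep the selectors together in one place. A minimal sketch, with hypothetical selectors for the product fields mentioned above:

# Hypothetical selectors for the other product fields mentioned above
product_selectors = {
    'title': 'div.product-info h1.product-title',
    'price': 'div.product-info span.product-price',
    'description': 'div.product-info div.product-description',
    'image': 'div.product-info img.product-image',
}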

 

3.2 Writing a Playwright script to extract data from the website

Once you have identified the target website and its structure, you can start writing a Playwright script to extract data from it. The Playwright API provides various methods to interact with the website, such as navigating to a page, clicking on elements, typing text, and extracting data.

Here is an example Playwright script that navigates to a page, waits for a specific element to load, and extracts its text content:

from playwright.sync_api import sync_playwright

def scrape_website(url: str) -> None:
    with sync_playwright() as playwright:
        browser = playwright.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.wait_for_selector('div.product-info')
        product_title = page.text_content('div.product-info h1.product-title')
        print(f"Product title: {product_title}")
        browser.close()

scrape_website('https://www.example.com/product/123')


In this script, the Playwright API is used to launch a Chromium browser, create a new page, navigate to the target URL, wait for the div.product-info element to load, extract the text content of the h1.product-title element, and print it to the console.

 

3.3 Handling dynamic content and JavaScript-generated content

Many modern websites use dynamic content and JavaScript to load and update content without requiring a full page refresh. Such pages are hard to scrape with traditional HTTP-based tools like Scrapy or requests + BeautifulSoup, which do not execute JavaScript. Playwright solves this problem by letting you interact with dynamic content and wait for specific events or conditions to occur.

For example, if the target website uses JavaScript to load additional data when the user scrolls down the page, you can simulate scrolling with Playwright's mouse.wheel method (or page.evaluate with window.scrollBy) and wait for the new data to load before extracting it.

# Example of scrolling to load dynamic content (inside an async function)
await page.mouse.wheel(0, 1000)
await page.wait_for_selector('.new-data')
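
For pages that keep loading items as you scroll (infinite scroll), you can repeat this until the page height stops growing. A minimal sketch, assuming an async context:

async def scroll_to_bottom(page):
    # Keep scrolling until the page height stops growing
    previous_height = 0
    while True:
        current_height = await page.evaluate('document.body.scrollHeight')
        if current_height == previous_height:
            break
        previous_height = current_height
        await page.mouse.wheel(0, current_height)
        await page.wait_for_timeout(1000)  # give new content time to render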

 

3.4 Saving scraped data to a file or database

After extracting data from the target website, you may want to save it to a file or database for further analysis or use. Once the data is in Python, you can write it out in whatever format suits you, such as JSON, CSV, or XML, using standard libraries.

Here is an example Playwright script that extracts product information from a website and saves it to a CSV file:

import csv
from playwright.sync_api import sync_playwright

def save_to_csv():
    with sync_playwright() as playwright:
        browser = playwright.chromium.launch()
        page = browser.new_page()
        page.goto('https://www.example.com')

        # scrape data
        data = []
        rows = page.query_selector_all('table tr')
        for row in rows:
            cells = row.query_selector_all('td')
            row_data = [cell.inner_text() for cell in cells]
            data.append(row_data)

        browser.close()

    # write data to CSV file
    with open('data.csv', 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerows(data)

save_to_csv()
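
If you prefer JSON, the same rows can be written out with the standard json module (this sketch reuses the data list from the example above):

import json

# Write the scraped rows out as JSON instead of CSV
with open('data.json', 'w') as file:
    json.dump(data, file, indent=2)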

 

4. Advanced Techniques in Playwright Web Scraping


4.1 Handling authentication and login pages

Many websites require users to log in before they can access certain pages or data. Playwright provides a way to handle authentication and login pages by using the fill and click methods to enter credentials and submit the login form.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto('https://example.com/login')

    # Fill in the login form and submit
    page.fill('#username', 'myusername')
    page.fill('#password', 'mypassword')
    page.click('#submit')

    # Wait for the page to load after login
    page.wait_for_selector('#dashboard')
    
    # Continue scraping the authenticated pages
    # ...
    
    browser.close()


In the example above, we first navigate to the login page and use the fill method to enter our credentials. We then use the click method to submit the login form. Finally, we wait for the page to load after login using the wait_for_selector method before continuing with scraping the authenticated pages.
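
Logging in on every run is slow and can draw attention. One option is to save the authenticated session with Playwright's storage_state and reuse it in later runs (the paths and selectors here are illustrative):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    context = browser.new_context()
    page = context.new_page()
    page.goto('https://example.com/login')
    page.fill('#username', 'myusername')
    page.fill('#password', 'mypassword')
    page.click('#submit')
    page.wait_for_selector('#dashboard')
    # Persist cookies and local storage for later runs
    context.storage_state(path='state.json')
    browser.close()

# In a later run, start from the saved session and skip the login form
with sync_playwright() as p:
    browser = p.chromium.launch()
    context = browser.new_context(storage_state='state.json')
    page = context.new_page()
    page.goto('https://example.com/dashboard')
    browser.close()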

 

4.2 Scraping multiple pages and websites

Often, web scraping requires scraping data from multiple pages or websites. Playwright makes this easy by allowing us to use loops to navigate to different pages and websites, and then extract the desired data.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    # Scrape multiple pages
    for i in range(1, 4):
        page.goto(f'https://example.com/page/{i}')
        # Extract data from the current page
        # ...

    # Scrape multiple websites
    for url in ['https://website1.com', 'https://website2.com', 'https://website3.com']:
        page.goto(url)
        # Extract data from the current website
        # ...

    browser.close()


In the example above, we use a loop to navigate to multiple pages on the same website, and then extract data from each page. We then use another loop to navigate to multiple websites and extract data from each website.

A better approach is to scrape multiple pages in parallel:

import asyncio
from playwright.async_api import async_playwright, Browser

async def scrape_page(browser: Browser, url: str):
    # Each task gets its own page in the shared browser
    page = await browser.new_page()
    await page.goto(url)
    # scrape data from the page (placeholder extraction)
    data = await page.title()
    await page.close()
    return data

async def scrape_multiple_pages(urls: list):
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        # Run all page scrapes concurrently with asyncio.gather
        results = await asyncio.gather(*(scrape_page(browser, url) for url in urls))
        await browser.close()
        return results

urls = ['https://example.com/page1', 'https://example.com/page2', 'https://example.com/page3']
scraped_data = asyncio.run(scrape_multiple_pages(urls))


In this example, we define a scrape_page function that takes a shared Browser object and a URL and returns the scraped data from that page. We then define a scrape_multiple_pages function that takes a list of URLs and uses asyncio.gather to run multiple instances of scrape_page in parallel. Finally, we run the scrape_multiple_pages function with a list of URLs and collect the results.

Note that this example assumes that the data extraction code is the same for all pages. If the extraction code is different for each page, you may need to pass that code as an argument to the scrape_page function.
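
A minimal sketch of that idea, passing a page-specific extractor as an argument (the extractor shown is illustrative):

async def scrape_page(browser, url, extract):
    page = await browser.new_page()
    await page.goto(url)
    # Delegate page-specific extraction to the supplied callable
    data = await extract(page)
    await page.close()
    return data

# Usage: supply whatever extraction logic this page needs
# data = await scrape_page(browser, url, lambda page: page.title())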


4.3 Building a customizable web scraping framework using Playwright

For more complex web scraping projects, it can be useful to build a customizable web scraping framework using Playwright. This allows us to reuse common code and configurations across multiple scraping scripts.

from playwright.sync_api import Playwright, sync_playwright

class Scraper:
    def __init__(self, playwright: Playwright, browser_type: str = 'chromium'):
        self.playwright = playwright
        self.browser_type = browser_type
        self.browser = None
        self.page = None

    def start_browser(self):
        # Look up the launcher (chromium, firefox, or webkit) by name
        launcher = getattr(self.playwright, self.browser_type)
        self.browser = launcher.launch()
        self.page = self.browser.new_page()

    def goto(self, url: str):
        self.page.goto(url)

    def extract_data(self):
        # Implement data extraction logic here
        return self.page.title()

    def close_browser(self):
        self.browser.close()

if __name__ == '__main__':
    with sync_playwright() as p:
        scraper = Scraper(p, browser_type='chromium')
        scraper.start_browser()
        scraper.goto('https://example.com')
        data = scraper.extract_data()
        scraper.close_browser()


In the example above, we define a Scraper class that encapsulates the common logic for starting a browser, navigating to a page, and extracting data. We can then reuse this class across scraping scripts by instantiating it with a Playwright instance and the desired browser type.

 

5. Best Practices and Tips for Playwright Web Scraping

When scraping with Playwright, a few best practices can help you avoid detection and blocking, keep your request rate from overloading the website, and stay within the site's terms of service and legal limits. Testing and debugging your scraping scripts is also important to ensure they work as expected. Here are some best practices and tips for Playwright web scraping:

5.1 Avoiding detection and blocking by websites

Websites can detect and block web scrapers using various techniques such as analyzing user-agent strings, detecting unusual behavior patterns, or implementing CAPTCHA challenges. Here are some tips to avoid detection and blocking by websites:

  • Use a user-agent string that mimics a real web browser to avoid being identified as a bot.
  • Add a delay between requests to avoid sending too many requests too quickly.
  • Randomize the timing and order of requests to make the scraping behavior less predictable.
  • Use proxies or VPNs to hide your IP address and avoid being detected by websites that block specific IP addresses. (A user-agent and proxy sketch follows this list.)
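
A minimal sketch of the first and last points, setting a custom user-agent and routing traffic through a proxy (the user-agent string and proxy address are placeholders):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Route all traffic through a proxy (placeholder address)
    browser = p.chromium.launch(proxy={'server': 'http://myproxy.example.com:8080'})
    # Present a browser-like user-agent string (placeholder value)
    context = browser.new_context(
        user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/111.0.0.0 Safari/537.36'
    )
    page = context.new_page()
    page.goto('https://www.example.com')
    browser.close()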

Example of randomizing the timing and order of requests:

import asyncio
import random

async def scrape_multiple_pages(page, urls):
    # Visit the URLs in a random order
    for url in random.sample(urls, len(urls)):
        await page.goto(url)
        # Scrape data from the page
        # Add a random delay without blocking the event loop
        await asyncio.sleep(random.uniform(1, 5))


5.2 Limiting the number of requests to avoid overloading the website

Sending too many requests too quickly can overload the website and cause it to slow down or crash. Here are some tips to limit the number of requests and avoid overloading the website:

  • Set a maximum number of requests per minute or hour to avoid sending too many requests too quickly.
  • Use a scraping queue to manage the order and rate of requests.
  • Run the browser in headless mode, and consider blocking heavy resources such as images, to reduce the load on both your machine and the website's servers.

Example of limiting concurrent connections (shown here with aiohttp rather than Playwright):

import asyncio
from aiohttp import ClientSession, TCPConnector

async def scrape_multiple_pages(urls):
    conn = TCPConnector(limit=30)  # Set a maximum of 30 concurrent connections
    async with ClientSession(connector=conn) as session:
        tasks = []
        for url in urls:
            task = asyncio.ensure_future(scrape_page(session, url))
            tasks.append(task)
        await asyncio.gather(*tasks)

async def scrape_page(session, url):
    async with session.get(url) as response:
        # Scrape data from the response
        await asyncio.sleep(1)  # Add a delay between requests
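
The same idea with Playwright itself: a minimal sketch that caps the number of concurrently open pages with an asyncio.Semaphore (the limit and delay values are arbitrary):

import asyncio
from playwright.async_api import async_playwright

async def scrape_with_limit(urls, max_concurrent=3):
    # Allow at most max_concurrent pages to be open at once
    semaphore = asyncio.Semaphore(max_concurrent)

    async with async_playwright() as p:
        browser = await p.chromium.launch()

        async def scrape_one(url):
            async with semaphore:
                page = await browser.new_page()
                await page.goto(url)
                data = await page.title()  # placeholder extraction
                await page.close()
                await asyncio.sleep(1)  # polite delay between requests
                return data

        results = await asyncio.gather(*(scrape_one(u) for u in urls))
        await browser.close()
        return results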


5.3 Respecting the website's terms of service and legal limitations

Web scraping can be legally questionable, and violating a website's terms of service or legal limitations can result in legal action. Here are some tips to respect the website's terms of service and legal limitations:

  • Read and comply with the website's terms of service and legal limitations.
  • Respect the website's robots.txt file, which specifies which pages are allowed to be scraped and which are not.
  • Do not scrape sensitive or personal data that is protected by privacy laws.

Example of respecting the website's robots.txt file:

import aiohttp
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

async def scrape_page(page, url):
    # Check the site's robots.txt before visiting the page
    parts = urlsplit(url)
    robots_txt_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    parser = RobotFileParser()
    robots_lines = []
    async with aiohttp.ClientSession() as session:
        async with session.get(robots_txt_url) as response:
            if response.status == 200:
                robots_lines = (await response.text()).splitlines()
    parser.parse(robots_lines)
    if parser.can_fetch('*', url):  # only proceed if allowed for our user agent
        await page.goto(url)
        # Scrape data from the page

6. Comparison between Playwright, Selenium, and Scrapy

 
Feature                 Playwright                   Selenium                   Scrapy
Browser Support         Chromium, Firefox, WebKit    Chrome, Firefox, Safari    Relies on third-party libraries
Async Support           Yes                          Yes                        No
Multi-Browser Support   Yes                          No                         No
Performance             Fast                         Slow                       Fast
Learning Curve          Moderate                     Steep                      Moderate
Community Support       Growing                      Large                      Established
Data Extraction         Yes                          Yes                        Yes
JavaScript Execution    Yes                          Yes                        No
Scalability             Good                         Good                       Good
Customizability         High                         Moderate                   High
Web Scraping            Yes                          Yes                        Yes
Price                   Free                         Free                       Free

Overall, Playwright offers the most features and flexibility compared to Selenium and Scrapy. While Selenium has a large community and Scrapy is a well-established web scraping tool, Playwright's multi-browser support, async support, and good performance make it a top choice for web scraping and browser automation tasks. However, beginners may find the learning curve a bit steeper than with Scrapy.

7. Conclusion

In conclusion, Playwright provides a powerful and efficient solution for web scraping with Python. With its intuitive API, support for multiple programming languages, and ability to handle dynamic content, Playwright makes it easier than ever to extract valuable data from websites. Additionally, the ability to run headless browsers and multiple instances in parallel can greatly speed up the scraping process.

We encourage you to explore and experiment with Playwright for your web scraping needs. Whether you're scraping e-commerce websites for pricing data, gathering research from news websites, or analyzing data from social media platforms, Playwright can provide the flexibility and power necessary for the task.

Overall, Playwright's extensive features and benefits make it a valuable tool in the world of web scraping, and we recommend it for any web scraping project.

