How to Build a Web Scraper with Python and Playwright
By JoeVu, at: May 23, 2023, 22:19
1. Introduction
Web scraping is a technique used to extract data from websites, allowing users to gather information that can be used for various purposes such as market research, competitor analysis, or data analysis. With the increasing amount of data available on the internet, web scraping has become an essential tool for many businesses and individuals.
1.1 Brief explanation of web scraping and its uses
Web scraping involves automated extraction of data from websites. It is done by using web scrapers or web spiders that crawl through web pages, following links and gathering data as they go. The data extracted from websites can be used for various purposes such as data analysis, market research, or competitor analysis.
For example, an e-commerce website may use web scraping to gather data on its competitors' pricing strategies or to monitor customer reviews.
1.2 Overview of Playwright and its features
Playwright is a cross-browser automation library developed by Microsoft that allows developers to automate browser actions, such as clicking buttons, filling out forms, and extracting data. It supports multiple programming languages, including Python, and provides a powerful and flexible API for interacting with web pages.
Some of the features of Playwright include:
- Support for multiple browsers, including Chromium, Firefox, and WebKit
- Headless and non-headless (headed) modes
- Integration with testing frameworks, such as pytest in Python (via the pytest-playwright plugin) or Jest and Mocha in JavaScript
- Emulation of mobile browsers, devices, and screen sizes
1.3 Explanation of the benefits of using Playwright for web scraping
Playwright provides a number of benefits for web scraping:
- A powerful API for automating browser actions and extracting data from web pages, making it an excellent choice for web scraping tasks
- Support for multiple browsers and headless mode makes it easy to scrape data from websites that require JavaScript rendering
- Support for various testing frameworks can also be useful for web scraping tasks, as it allows developers to easily integrate scraping scripts into their testing workflows
- Support for mobile browsers and emulation of different devices can be useful for scraping data from mobile websites, as shown in the sketch below.
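As a quick illustration of the last point, here is a minimal sketch of device emulation using Playwright's built-in device registry ("iPhone 13" is one of the registered device descriptors; the URL is a placeholder):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    iphone = p.devices["iPhone 13"]  # built-in descriptor: viewport, user agent, touch support
    browser = p.webkit.launch()
    context = browser.new_context(**iphone)  # apply the device settings to a browser context
    page = context.new_page()
    page.goto("https://www.example.com")
    print(page.viewport_size)  # reflects the emulated device's screen size
    browser.close()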
Here's an example of how to use Playwright to automate the process of navigating to a web page and extracting its title:

import asyncio
from playwright.async_api import async_playwright

async def main() -> None:
    async with async_playwright() as p:
        # Launch a headless Chromium browser
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://www.example.com")
        title = await page.title()
        print(f"The page title is: {title}")
        await browser.close()

asyncio.run(main())
This code launches a headless Chromium browser, navigates to the example.com website, extracts the page title, and then closes the browser.
The non-async version of the above example:

from playwright.sync_api import sync_playwright

def scrape_data(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # Extract data from the page using page methods
        title = page.title()
        # Close the browser when done
        browser.close()
        return title

# Call the function to scrape data from a website
title = scrape_data('https://www.example.com')
print(title)
In this version, we use the sync_playwright() function to create a synchronous Playwright context, managed with a with statement, and use regular functions instead of async functions. Inside the function, we use the browser and page objects to navigate to the website and extract data from it. Finally, we close the browser and return the extracted data.
The main difference between this and the previous code snippet is that we use synchronous functions instead of asynchronous functions. This can be useful if you prefer to use a synchronous programming style, or if you're working with code that doesn't support async functions. However, it's worth noting that using the async version of Playwright can provide better performance and scalability, especially when working with large or complex web scraping tasks.
2. Getting Started with Playwright
2.1 Installation of Playwright and Python dependencies
To use Playwright, you first need to install it along with its Python bindings. The recommended way to install Playwright is via pip.
You can follow the instructions on the official website, or simply run the following commands in your terminal:
pip install playwright
pip install pytest-playwright  # recommended: adds the pytest plugin
playwright install  # download the browser binaries (Chromium, Firefox, WebKit)
2.2 Creating a new Playwright project
To create a new Playwright project, you can simply create a new Python file and import the Playwright library:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Your code here
    pass

This code creates a new Playwright instance and allows you to use its synchronous API.
2.3 Explanation of the basic structure of a Playwright project
A typical Playwright project consists of the following components:
- Launching a browser
- Creating a new page
- Navigating to a URL
- Interacting with the page (e.g. clicking links, filling out forms, etc.)
- Extracting data from the page
For each component, we can create a separate module if the project is complicated; otherwise, we can keep them all in a single script or class.
Here is an example script that demonstrates these steps:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch the browser
    browser = p.chromium.launch(headless=False)
    # Create a new page
    page = browser.new_page()
    # Navigate to a URL
    page.goto('https://example.com')
    # Interact with the page
    page.click('a')
    page.fill('input[type="text"]', 'example')
    # Extract data from the page
    title = page.title()
    content = page.text_content('body')
    # Close the browser
    browser.close()
This script launches a Chromium browser in non-headless mode, creates a new page, navigates to example.com, clicks the first link on the page, fills out a text input field, extracts the title and body text of the page, and finally closes the browser.
In the next section, we'll look at how to use Playwright to build a web scraper.
3. Building a Web Scraper with Playwright
3.1 Defining the target website and its structure
Before starting to build a web scraper with Playwright, it is crucial to define the target website and its structure. This means identifying the specific pages, elements, and data that you want to extract. You can use browser developer tools to inspect the HTML code and CSS selectors of the target website.
For example, if you want to extract product information from an e-commerce website, you need to identify the page containing the product information, such as the product title, price, description, and image. You can use CSS selectors to target the specific HTML elements that contain this information.
# Example CSS selector to target product title
product_title_selector = 'div.product-info h1.product-title'
3.2 Writing a Playwright script to extract data from the website
Once you have identified the target website and its structure, you can start writing a Playwright script to extract data from it. The Playwright API provides various methods to interact with the website, such as navigating to a page, clicking on elements, typing text, and extracting data.
Here is an example Playwright script that navigates to a page, waits for a specific element to load, and extracts its text content:
from playwright.sync_api import sync_playwright

def scrape_website(url: str) -> None:
    with sync_playwright() as playwright:
        browser = playwright.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # Wait for the product info block to appear
        page.wait_for_selector('div.product-info')
        product_title = page.text_content('div.product-info h1.product-title')
        print(f"Product title: {product_title}")
        browser.close()

scrape_website('https://www.example.com/product/123')
In this script, the Playwright API is used to launch a Chromium browser, create a new page, navigate to the target URL, wait for the div.product-info element to load, extract the text content of the h1.product-title element, and print it to the console.
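If a listing page contains many such blocks, the same idea extends naturally with query_selector_all. A sketch, assuming each product sits in its own div.product-info block with h1.product-title and span.price children (the span.price selector is a placeholder, not taken from any real site):

from playwright.sync_api import sync_playwright

def scrape_product_list(url: str) -> list:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.wait_for_selector('div.product-info')
        products = []
        for card in page.query_selector_all('div.product-info'):
            title = card.query_selector('h1.product-title')
            price = card.query_selector('span.price')  # placeholder selector
            products.append({
                'title': title.text_content() if title else None,
                'price': price.text_content() if price else None,
            })
        browser.close()
        return products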
3.3 Handling dynamic content and JavaScript-generated content
Many modern websites use JavaScript to load and update content without requiring a full page refresh. This makes them hard to scrape with traditional HTTP-based tools such as Scrapy or requests + BeautifulSoup, which do not execute JavaScript. Playwright solves this problem by driving a real browser, allowing you to interact with dynamic content and wait for specific events or conditions to occur.
For example, if the target website uses JavaScript to load additional data when the user scrolls down the page, you can simulate scrolling (Playwright exposes this via the mouse wheel) and wait for the new data to load before extracting it.
# Example of scrolling to load dynamic content
await page.mouse.wheel(0, 1000)  # scroll down by 1000 pixels
await page.wait_for_selector('.new-data')
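For pages with infinite scroll, a common pattern is to keep scrolling until the page height stops growing. A minimal sync sketch (the URL and the .item selector are placeholders):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://www.example.com/feed")  # placeholder URL
    previous_height = 0
    while True:
        page.mouse.wheel(0, 2000)      # scroll down by 2000 pixels
        page.wait_for_timeout(1000)    # give new items time to load
        height = page.evaluate("document.body.scrollHeight")
        if height == previous_height:  # no new content was loaded
            break
        previous_height = height
    items = page.query_selector_all(".item")  # placeholder selector
    print(f"Loaded {len(items)} items")
    browser.close()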
3.4 Saving scraped data to a file or database
After extracting data from the target website, you may want to save it to a file or database for further analysis or use. You can combine Playwright with Python's standard library, such as the csv or json modules, to export data in different formats.
Here is an example Playwright script that extracts product information from a website and saves it to a CSV file:
import csv
from playwright.sync_api import sync_playwright

def save_to_csv():
    with sync_playwright() as playwright:
        browser = playwright.chromium.launch()
        page = browser.new_page()
        page.goto('https://www.example.com')
        # Scrape data from every table row
        data = []
        rows = page.query_selector_all('table tr')
        for row in rows:
            cells = row.query_selector_all('td')
            row_data = [cell.inner_text() for cell in cells]
            data.append(row_data)
        browser.close()
        # Write data to a CSV file
        with open('data.csv', 'w', newline='') as file:
            writer = csv.writer(file)
            writer.writerows(data)

save_to_csv()
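To save to a database instead, Python's built-in sqlite3 module follows the same pattern. A sketch, assuming data is the list of rows scraped above and each row has exactly two columns (the table and column names are arbitrary):

import sqlite3

conn = sqlite3.connect('data.db')
conn.execute('CREATE TABLE IF NOT EXISTS scraped_rows (col1 TEXT, col2 TEXT)')
conn.executemany('INSERT INTO scraped_rows VALUES (?, ?)', data)
conn.commit()
conn.close()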
4. Advanced Techniques in Playwright Web Scraping
4.1 Handling authentication and login pages
Many websites require users to log in before they can access certain pages or data. Playwright provides a way to handle authentication and login pages by using the fill and click methods to enter credentials and submit the login form.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto('https://example.com/login')
    # Fill in the login form and submit
    page.fill('#username', 'myusername')
    page.fill('#password', 'mypassword')
    page.click('#submit')
    # Wait for the page to load after login
    page.wait_for_selector('#dashboard')
    # Continue scraping the authenticated pages
    # ...
    browser.close()
In the example above, we first navigate to the login page and use the fill method to enter our credentials. We then use the click method to submit the login form. Finally, we wait for the page to load after login using the wait_for_selector method before continuing with scraping the authenticated pages.
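To avoid logging in on every run, you can also persist the authenticated session with Playwright's storage_state and restore it later. A sketch (the auth.json file name is arbitrary; the selectors are carried over from the example above):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # First run: log in and save cookies and local storage to disk
    browser = p.chromium.launch()
    context = browser.new_context()
    page = context.new_page()
    page.goto('https://example.com/login')
    page.fill('#username', 'myusername')
    page.fill('#password', 'mypassword')
    page.click('#submit')
    page.wait_for_selector('#dashboard')
    context.storage_state(path='auth.json')
    browser.close()

    # Later runs: restore the session instead of logging in again
    browser = p.chromium.launch()
    context = browser.new_context(storage_state='auth.json')
    page = context.new_page()
    page.goto('https://example.com/dashboard')
    browser.close()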
4.2 Scraping multiple pages and websites
Often, web scraping requires scraping data from multiple pages or websites. Playwright makes this easy by allowing us to use loops to navigate to different pages and websites, and then extract the desired data.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # Scrape multiple pages of the same website
    for i in range(1, 4):
        page.goto(f'https://example.com/page/{i}')
        # Extract data from the current page
        # ...
    # Scrape multiple websites
    for url in ['https://website1.com', 'https://website2.com', 'https://website3.com']:
        page.goto(url)
        # Extract data from the current website
        # ...
    browser.close()
In the example above, we use a loop to navigate to multiple pages on the same website, and then extract data from each page. We then use another loop to navigate to multiple websites and extract data from each website.
There is a better way to handle scraping multiple pages: running them in parallel.

import asyncio
from playwright.async_api import async_playwright, Browser

async def scrape_page(browser: Browser, url: str):
    page = await browser.new_page()
    await page.goto(url)
    # Scrape data from the page (page.title() is a placeholder for real extraction logic)
    data = await page.title()
    await page.close()
    return data

async def scrape_multiple_pages(urls: list):
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        # Run one scrape_page task per URL concurrently
        results = await asyncio.gather(*(scrape_page(browser, url) for url in urls))
        await browser.close()
        return results

urls = ['https://example.com/page1', 'https://example.com/page2', 'https://example.com/page3']
scraped_data = asyncio.run(scrape_multiple_pages(urls))
In this example, we define a scrape_page function that takes a URL and a Playwright browser object and returns the scraped data from that page. We then define a scrape_multiple_pages function that takes a list of URLs and uses asyncio.gather to run multiple instances of scrape_page in parallel. Finally, we run the scrape_multiple_pages function with a list of URLs and collect the results.
Note that this example assumes that the data extraction code is the same for all pages. If the extraction code is different for each page, you may need to pass that code as an argument to the scrape_page function.
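One way to do that is to pass the extraction logic in as a coroutine function. A sketch building on the parallel example above (extract_title is just an illustrative callback):

from typing import Awaitable, Callable
from playwright.async_api import Browser, Page

async def scrape_page(browser: Browser, url: str,
                      extract: Callable[[Page], Awaitable]):
    page = await browser.new_page()
    await page.goto(url)
    data = await extract(page)  # page-specific extraction logic
    await page.close()
    return data

async def extract_title(page: Page) -> str:
    return await page.title()

Each URL can then be paired with its own extractor when the asyncio.gather tasks are created.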
4.3 Building a customizable web scraping framework using Playwright
For more complex web scraping projects, it can be useful to build a customizable web scraping framework using Playwright. This allows us to reuse common code and configurations across multiple scraping scripts.
from playwright.sync_api import Playwright, sync_playwright

class Scraper:
    def __init__(self, playwright: Playwright, browser_type: str = 'chromium'):
        self.playwright = playwright
        self.browser_type = browser_type
        self.browser = None
        self.page = None

    def start_browser(self):
        # Look up the browser type (chromium, firefox, or webkit) by name
        self.browser = getattr(self.playwright, self.browser_type).launch()
        self.page = self.browser.new_page()

    def goto(self, url: str):
        self.page.goto(url)

    def extract_data(self):
        # Implement data extraction logic here
        # ...
        pass

    def close_browser(self):
        self.browser.close()

if __name__ == '__main__':
    with sync_playwright() as p:
        scraper = Scraper(p, browser_type='chromium')
        scraper.start_browser()
        scraper.goto('https://example.com')
        scraper.extract_data()
        scraper.close_browser()
In the example above, we define a Scraper class that encapsulates the common logic for starting a browser, navigating to a page, and extracting data. We can then reuse it across scraping scripts by instantiating it and overriding extract_data for each target site, as shown below.
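A sketch of a site-specific subclass (the selectors are placeholders for a hypothetical product page):

class ProductScraper(Scraper):
    def extract_data(self):
        # Placeholder selectors for a hypothetical product page
        title = self.page.text_content('h1.product-title')
        price = self.page.text_content('span.price')
        return {'title': title, 'price': price}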
5. Best Practices and Tips for Playwright Web Scraping
When it comes to web scraping with Playwright, a few best practices can help you avoid detection and blocking by websites, limit the number of requests so you don't overload the website, and respect the website's terms of service and legal limitations. Testing and debugging your Playwright scripts is also important to ensure they work as expected. Here are some best practices and tips:
5.1 Avoiding detection and blocking by websites
Websites can detect and block web scrapers using various techniques such as analyzing user-agent strings, detecting unusual behavior patterns, or implementing CAPTCHA challenges. Here are some tips to avoid detection and blocking by websites:
- Use a user-agent string that mimics a real web browser to avoid being identified as a bot.
- Add a delay between requests to avoid sending too many requests too quickly.
- Randomize the timing and order of requests to make the scraping behavior less predictable.
- Use proxies or VPNs to hide your IP address and avoid being detected by websites that block specific IP addresses (see the sketch below).
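Here is a minimal sketch of the first and last tips, setting a realistic user agent and routing traffic through a proxy (the proxy address and the user-agent string are placeholders):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # The proxy server below is a placeholder, not a real service
    browser = p.chromium.launch(proxy={"server": "http://myproxy.example.com:3128"})
    context = browser.new_context(
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        )
    )
    page = context.new_page()
    page.goto("https://www.example.com")
    browser.close()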
Example of randomizing the timing and order of requests:
import asyncio
import random

async def scrape_multiple_pages(page, urls):
    # Visit the URLs in a random order
    for url in random.sample(urls, len(urls)):
        await page.goto(url)
        # Scrape data from the page
        # ...
        await asyncio.sleep(random.randint(1, 5))  # add a random delay between requests
5.2 Limiting the number of requests to avoid overloading the website
Sending too many requests too quickly can overload the website and cause it to slow down or crash. Here are some tips to limit the number of requests and avoid overloading the website:
- Set a maximum number of requests per minute or hour to avoid sending too many requests too quickly.
- Use a scraping queue to manage the order and rate of requests.
- Use a headless browser to reduce resource usage on your side, and consider disabling images and other heavy assets to reduce the amount of data transferred.
Example of limiting concurrent requests with a connection pool (note that this snippet uses aiohttp for plain HTTP requests rather than Playwright):
import asyncio
from aiohttp import ClientSession, TCPConnector

async def scrape_multiple_pages(urls):
    conn = TCPConnector(limit=30)  # allow at most 30 concurrent connections
    async with ClientSession(connector=conn) as session:
        tasks = []
        for url in urls:
            task = asyncio.ensure_future(scrape_page(session, url))
            tasks.append(task)
        await asyncio.gather(*tasks)

async def scrape_page(session, url):
    async with session.get(url) as response:
        # Scrape data from the response
        await asyncio.sleep(1)  # add a delay between requests
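The same idea applies to Playwright itself: an asyncio.Semaphore can cap how many pages are open at once. A sketch (the limit of 5 is arbitrary, and page.title() stands in for real extraction logic):

import asyncio
from playwright.async_api import async_playwright

async def scrape_with_limit(urls, max_concurrency=5):
    semaphore = asyncio.Semaphore(max_concurrency)  # cap concurrent pages
    async with async_playwright() as p:
        browser = await p.chromium.launch()

        async def scrape_one(url):
            async with semaphore:
                page = await browser.new_page()
                await page.goto(url)
                data = await page.title()  # placeholder extraction logic
                await page.close()
                return data

        results = await asyncio.gather(*(scrape_one(u) for u in urls))
        await browser.close()
        return results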
5.3 Respecting the website's terms of service and legal limitations
Web scraping can be legally questionable, and violating a website's terms of service or legal limitations can result in legal action. Here are some tips to respect the website's terms of service and legal limitations:
- Read and comply with the website's terms of service and legal limitations.
- Respect the website's robots.txt file, which specifies which pages are allowed to be scraped and which are not.
- Do not scrape sensitive or personal data that is protected by privacy laws.
Example of checking the website's robots.txt file before scraping a page (this sketch uses aiohttp and the standard library's RobotFileParser):

import aiohttp
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

async def scrape_page(page, url):
    # Check whether the URL may be scraped according to robots.txt
    parsed = urlparse(url)
    robots_txt_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    async with aiohttp.ClientSession() as session:
        async with session.get(robots_txt_url) as response:
            if response.status == 200:
                parser = RobotFileParser()
                parser.parse((await response.text()).splitlines())
                if not parser.can_fetch('*', url):
                    return  # the page is disallowed by robots.txt
    await page.goto(url)
6. Comparison between Playwright, Selenium, and Scrapy
Feature | Playwright | Selenium | Scrapy
--- | --- | --- | ---
Browser Support | Chromium, Firefox, WebKit | Chrome, Firefox, Safari, Edge | Relies on third-party libraries
Async Support | Yes (sync and async APIs) | No (Python bindings are synchronous) | Yes (built on Twisted)
Multi-Browser Support | Yes | Yes | No
Performance | Fast | Slow | Fast
Learning Curve | Moderate | Steep | Moderate
Community Support | Growing | Large | Established
Data Extraction | Yes | Yes | Yes
JavaScript Execution | Yes | Yes | No
Scalability | Good | Good | Good
Customizability | High | Moderate | High
Web Scraping | Yes | Yes | Yes
Price | Free | Free | Free
Overall, Playwright offers the most features and flexibility compared to Selenium and Scrapy. While Selenium has a large community and Scrapy is a well-established web scraping tool, Playwright's multi-browser support, async support, and good performance make it a top choice for web scraping and browser automation tasks. However, beginners may find the learning curve a bit steeper than with Scrapy.
7. Conclusion
In conclusion, Playwright provides a powerful and efficient solution for web scraping with Python. With its intuitive API, support for multiple programming languages, and ability to handle dynamic content, Playwright makes it easier than ever to extract valuable data from websites. Additionally, the ability to run headless browsers and multiple instances in parallel can greatly speed up the scraping process.
We encourage you to explore and experiment with Playwright for your web scraping needs. Whether you're scraping e-commerce websites for pricing data, gathering research from news websites, or analyzing data from social media platforms, Playwright can provide the flexibility and power necessary for the task.
Overall, Playwright's extensive features and benefits make it a valuable tool in the world of web scraping, and we recommend it for any web scraping project.