Glinteco | Blog | How to Download Audio and PDF Files from a URL

Example: How to download audio and PDF files from a URL

Downloading files from a URL is a common task in web scraping and automation. Python offers several libraries and tools that make this easier. In this article, we explore four approaches to downloading files from a given URL: requests with BeautifulSoup, Scrapy, Playwright, and Selenium.

We will use https://event.choruscall.com/mediaframe/webcast.html?webcastid=370tVnvP&securityString=Eiys0PocsKYan5O3oWpFsYe3 as the example page; it links to an audio file and a PDF that we can download. Let's get started.

Method 1: Using `requests` and `BeautifulSoup`

Prerequisites

    pip install requests
    pip install beautifulsoup4

Full example

    import pathlib
    import re
    import urllib.parse

    import requests
    from bs4 import BeautifulSoup

    url = "https://event.choruscall.com/mediaframe/webcast.html?webcastid=370tVnvP&securityString=Eiys0PocsKYan5O3oWpFsYe3"

    # Send a request to the URL
    response = requests.get(url)

    # Check whether the request succeeded (status code 200)
    if response.status_code == 200:
        # Parse the HTML content of the page
        soup = BeautifulSoup(response.text, "html.parser")

        title = soup.title.text  # the page title is used as the file name

        # Find the links to the PDF and audio files
        audio_link = soup.find("a", string=re.compile("Download Audio"))
        pdf_link = soup.find("a", string=re.compile("Download Presentation"))

        # Download the PDF file
        if pdf_link:
            pdf_url = urllib.parse.urljoin(url, pdf_link["href"])
            pdf_response = requests.get(pdf_url)
            # .suffix already includes the leading dot, e.g. ".pdf"
            extension = pathlib.Path(pdf_link["href"]).suffix
            with open(f"{title}{extension}", "wb") as pdf_file:
                pdf_file.write(pdf_response.content)

        # Download the audio file
        if audio_link:
            audio_url = urllib.parse.urljoin(url, audio_link["href"])
            audio_response = requests.get(audio_url)
            extension = pathlib.Path(audio_link["href"]).suffix
            with open(f"{title}{extension}", "wb") as audio_file:
                audio_file.write(audio_response.content)
    else:
        print(f"Could not retrieve the page. Status code: {response.status_code}")
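The example for this method writes each file to disk in one piece and uses the page title directly as the file name. As a hedged sketch, the two helpers below show one way to harden that step: `build_filename` replaces characters that are awkward or invalid in file names and takes the extension from the URL path only, and `download_file` streams the response to disk in chunks. Both function names, the `session` parameter, the 30-second timeout, the chunk size, and the `.bin` fallback are illustrative assumptions, not part of the original code.

```python
import pathlib
import re
import urllib.parse


def build_filename(title, href, default_ext=".bin"):
    """Return a file name built from a page title and a link's extension."""
    # Keep only the path part of the URL so a query string never ends up in the suffix.
    path = urllib.parse.urlsplit(href).path
    extension = pathlib.PurePosixPath(path).suffix or default_ext
    # Collapse whitespace and characters that are invalid in file names
    # on common systems into single underscores.
    stem = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_") or "download"
    return f"{stem}{extension}"


def download_file(session, url, dest, chunk_size=8192):
    """Stream `url` to `dest` using a requests-style session object."""
    # stream=True avoids loading the whole file into memory at once.
    with session.get(url, stream=True, timeout=30) as response:
        response.raise_for_status()
        with open(dest, "wb") as fh:
            for chunk in response.iter_content(chunk_size=chunk_size):
                fh.write(chunk)
    return pathlib.Path(dest)
```

With `requests` installed, a call might look like `download_file(requests.Session(), pdf_url, build_filename(title, pdf_link["href"]))`, where `pdf_url`, `title`, and `pdf_link` come from the surrounding script.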

Method 2: Using `Scrapy`

Prerequisites

    pip install scrapy

Full example

    import pathlib
    import re
    import urllib.parse

    import requests
    import scrapy
    from bs4 import BeautifulSoup


    class DownloadSpider(scrapy.Spider):
        name = "download_spider"
        start_urls = [
            "https://event.choruscall.com/mediaframe/webcast.html?webcastid=370tVnvP&securityString=Eiys0PocsKYan5O3oWpFsYe3"
        ]

        def parse(self, response):
            # Scrapy responses expose the HTTP status as `status`, not `status_code`
            if response.status == 200:
                # Parse the HTML content of the page
                soup = BeautifulSoup(response.text, "html.parser")

                title = soup.title.text

                # Find the links to the PDF and audio files
                audio_link = soup.find("a", string=re.compile("Download Audio"))
                pdf_link = soup.find("a", string=re.compile("Download Presentation"))

                # Download the PDF file
                if pdf_link:
                    pdf_url = urllib.parse.urljoin(response.url, pdf_link["href"])
                    pdf_response = requests.get(pdf_url)
                    # .suffix already includes the leading dot, e.g. ".pdf"
                    extension = pathlib.Path(pdf_link["href"]).suffix
                    with open(f"{title}{extension}", "wb") as pdf_file:
                        pdf_file.write(pdf_response.content)

                # Download the audio file
                if audio_link:
                    audio_url = urllib.parse.urljoin(response.url, audio_link["href"])
                    audio_response = requests.get(audio_url)
                    extension = pathlib.Path(audio_link["href"]).suffix
                    with open(f"{title}{extension}", "wb") as audio_file:
                        audio_file.write(audio_response.content)
            else:
                print(f"Could not retrieve the page. Status code: {response.status}")

Save the code as `my_spider.py` and run the spider with: `scrapy runspider my_spider.py`

Method 3: Using `Playwright`

Prerequisites

    pip install pytest-playwright
    playwright install

Full example

    # Save this as download_files.py

    import pathlib
    import re
    import urllib.parse

    import requests
    from bs4 import BeautifulSoup
    from playwright.sync_api import Page


    def test_has_title(page: Page):
        page.goto("https://event.choruscall.com/mediaframe/webcast.html?webcastid=370tVnvP&securityString=Eiys0PocsKYan5O3oWpFsYe3")

        # Parse the rendered HTML of the page
        soup = BeautifulSoup(page.content(), "html.parser")

        title = soup.title.text

        # Find the links to the PDF and audio files
        audio_link = soup.find("a", string=re.compile("Download Audio"))
        pdf_link = soup.find("a", string=re.compile("Download Presentation"))

        # Download the PDF file
        if pdf_link:
            pdf_url = urllib.parse.urljoin(page.url, pdf_link["href"])
            pdf_response = requests.get(pdf_url)
            # .suffix already includes the leading dot, e.g. ".pdf"
            extension = pathlib.Path(pdf_link["href"]).suffix
            with open(f"{title}{extension}", "wb") as pdf_file:
                pdf_file.write(pdf_response.content)

            assert pathlib.Path(f"{title}{extension}").is_file()

        # Download the audio file
        if audio_link:
            audio_url = urllib.parse.urljoin(page.url, audio_link["href"])
            audio_response = requests.get(audio_url)
            extension = pathlib.Path(audio_link["href"]).suffix
            with open(f"{title}{extension}", "wb") as audio_file:
                audio_file.write(audio_response.content)

            assert pathlib.Path(f"{title}{extension}").is_file()

Run the test with: `pytest download_files.py`

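Method 4: Using `Selenium`

Prerequisites

    pip install selenium

The example below drives Firefox, so Firefox must be installed locally. Selenium 4.6 and later download a matching geckodriver automatically through Selenium Manager.

Full example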

    # Save this as download_files_selenium.py

    import pathlib
    import urllib.parse

    import requests
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    url = "https://event.choruscall.com/mediaframe/webcast.html?webcastid=370tVnvP&securityString=Eiys0PocsKYan5O3oWpFsYe3"
    driver = webdriver.Firefox()
    driver.get(url)
    title = driver.title

    # Find the links to the PDF and audio files
    audio_links = driver.find_elements(By.XPATH, '//a[contains(text(), "Download Audio")]')
    pdf_links = driver.find_elements(By.XPATH, '//a[contains(text(), "Download Presentation")]')

    # Download the PDF file
    if pdf_links:
        pdf_link = pdf_links[0].get_property("href")
        pdf_url = urllib.parse.urljoin(url, pdf_link)
        pdf_response = requests.get(pdf_url)
        # .suffix already includes the leading dot, e.g. ".pdf"
        extension = pathlib.Path(pdf_link).suffix
        with open(f"{title}{extension}", "wb") as pdf_file:
            pdf_file.write(pdf_response.content)

    # Download the audio file
    if audio_links:
        audio_link = audio_links[0].get_property("href")
        audio_url = urllib.parse.urljoin(url, audio_link)
        audio_response = requests.get(audio_url)
        extension = pathlib.Path(audio_link).suffix
        with open(f"{title}{extension}", "wb") as audio_file:
            audio_file.write(audio_response.content)

    # quit() ends the WebDriver session and closes every browser window
    driver.quit()

Run the script with: `python download_files_selenium.py`

Conclusion

In this article, we explored several ways to download files from a URL with Python. Each one, whether it is the simplicity of requests and BeautifulSoup, the robustness of Scrapy, headless browser automation with Playwright, or Selenium's handling of dynamic pages, offers a distinct toolset for different web scraping and file download scenarios.

Which method to choose depends on the specific requirements of your project, the complexity of the target website, and your familiarity with each library. Whichever you pick, remember to adapt the snippets to the structure of the website you are working with.
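Adapting the snippets to another site usually comes down to changing which links are matched. As an illustration only, the sketch below uses the standard library's `html.parser` to collect every `<a>` element whose text matches a pattern and resolves each link against the page URL, so the pattern can be swapped per site. The class name `LinkCollector`, the function `find_links`, the sample HTML, and the `example.com` URL are all hypothetical.

```python
import re
import urllib.parse
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only accumulate text while inside an <a> that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def find_links(html, base_url, pattern):
    """Return absolute URLs of links whose visible text matches `pattern`."""
    collector = LinkCollector()
    collector.feed(html)
    regex = re.compile(pattern)
    return [
        urllib.parse.urljoin(base_url, href)
        for text, href in collector.links
        if regex.search(text)
    ]


# Hypothetical page fragment used only to demonstrate the helper.
sample = (
    '<a href="files/deck.pdf">Download Presentation</a>'
    '<a href="/a.mp3">Download Audio</a>'
)
print(find_links(sample, "https://example.com/webcast/page.html",
                 r"Download (Audio|Presentation)"))
```

Keeping the pattern as a parameter means a different site only needs a different regular expression, not a different script.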

With these Python tools, you can retrieve files from URLs efficiently and make web automation and data extraction tasks easier and more customizable. Happy coding, and may your Python scripts download files seamlessly from across the internet!