Scraping VietnamNet with Newspaper3k: A Step-by-Step Guide

By hientd, at: Dec. 1, 2023, 10:30 p.m.

Estimated Reading Time: 5 min read



In this blog post, we will explore how to use the newspaper3k library to scrape articles from VietnamNet. We'll go through a step-by-step process, discuss the pros and cons of this approach, and look at further features and future applications.

 

Step-by-Step Guide


Step 1: Setting Up the Environment

First, you'll need to install the newspaper3k library. You can do this using pip:

pip install newspaper3k

 

Step 2: Importing Necessary Libraries

Next, import the necessary libraries in your Python script:

from newspaper import Article
import newspaper

 

Step 3: Building the Scraper

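We'll create a scraper that extracts articles from VietnamNet. Here's the complete code:

config = newspaper.Config()
config.request_timeout = 10  # Seconds to wait for each request

news_url = 'https://vietnamnet.vn/en-page1'
news_paper = newspaper.build(news_url, config=config)

for article in news_paper.articles[:10]:  # Limiting to the first 10 articles for simplicity
    article.download()
    article.parse()
    article.nlp()  # Fills in article.summary and article.keywords (needs NLTK tokenizer data)
    print(f"Title: {article.title}")
    print(f"Authors: {article.authors}")
    print(f"Publish date: {article.publish_date}")
    print(f"Summary: {article.summary}")
    print(f"URL: {article.url}\n")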
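
One article that fails to download or parse can raise an exception and stop the whole loop. The sketch below shows one way to make the loop more tolerant. Note that collect_articles is a hypothetical helper name, not part of newspaper3k, and catching Exception broadly is an assumption that you simply want to skip anything that fails.

```python
def collect_articles(articles, limit=10):
    """Download and parse up to `limit` articles, skipping any that fail.

    `articles` is any iterable of objects that offer download(), parse(),
    title and url, for example the articles list of a built source.
    """
    results = []
    for article in list(articles)[:limit]:
        try:
            article.download()
            article.parse()
        except Exception as exc:  # Assumption: skip the article on any failure
            print(f"Skipping {article.url}: {exc}")
            continue
        results.append({"title": article.title, "url": article.url})
    return results
```

With the scraper from this step, it could be called as collect_articles(news_paper.articles, limit=10).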

 

Step 4: Running the Scraper

Run your script, and it will output the titles, authors, publish dates, summaries, and URLs of the articles it finds on VietnamNet.

 

Pros and Cons of This Approach

 

Pros

  1. Ease of Use: The newspaper3k library is user-friendly and simplifies the process of extracting information from news articles.
     
  2. Comprehensive Parsing: It automatically handles downloading, parsing, and extracting metadata from articles.
     
  3. Language Support: newspaper3k supports multiple languages, making it versatile for various applications.

 

Cons

  1. Dynamic Content: It may not handle content loaded via JavaScript well (e.g., https://www.wsj.com/). Articles loaded after the initial HTML render might be missed. In that case, you might need a browser automation tool such as Playwright, Selenium, or Puppeteer (JavaScript).
     
  2. Limited Control: The library abstracts away many details, which can be a downside if you need fine-grained control over the scraping process.
     
  3. Dependency Management: newspaper3k relies on several dependencies that may occasionally cause compatibility issues or require updates.
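
For the dynamic content limitation, one possible workaround is to render the page in a real browser and hand the resulting HTML to newspaper3k. The sketch below assumes Playwright for Python and a Chromium build are installed. The function names render_html and article_from_html are hypothetical, and the article_cls parameter exists only so the parsing step can be tried without the real library.

```python
def render_html(url):
    """Return the page HTML after JavaScript has run (requires Playwright)."""
    from playwright.sync_api import sync_playwright  # Imported lazily

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        html = page.content()
        browser.close()
    return html


def article_from_html(url, html, article_cls=None):
    """Parse pre-rendered HTML with newspaper3k instead of downloading it."""
    if article_cls is None:
        from newspaper import Article as article_cls
    article = article_cls(url)
    article.download(input_html=html)  # Use the supplied HTML, no HTTP request
    article.parse()
    return article
```

A typical call would be article_from_html(url, render_html(url)). Whether this recovers the full article still depends on how the target site loads its content.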

 

Further Features


Article Keywords and Summary

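newspaper3k provides additional features such as extracting keywords and generating a summary for each article. Call nlp() after download() and parse(). It relies on NLTK tokenizer data, which may need to be downloaded once before first use:

article.nlp()
print(f"Keywords: {article.keywords}")
print(f"Summary: {article.summary}")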

 

Source Categorization

You can also list the category (section) URLs that newspaper3k detects on a source, which can be useful for organizing a large number of articles:

news_paper = newspaper.build('https://samplesite-with-categories.com', memoize_articles=False)
for category in news_paper.category_urls():
    print(f"Category URL: {category}")
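
Once you have the category URLs, you may want to sort article URLs under them. The sketch below is one simple way to do that, under the assumption that the site places each article beneath its category path (for example /business/some-article.html). Many sites, possibly including VietnamNet, do not follow this layout, so treat group_by_category as a hypothetical helper to adapt.

```python
from urllib.parse import urlparse


def group_by_category(article_urls, category_urls):
    """Map each category URL to the article URLs found under its path."""
    groups = {category: [] for category in category_urls}
    for url in article_urls:
        path = urlparse(url).path
        for category in category_urls:
            prefix = urlparse(category).path.rstrip("/")
            # Skip the site root, which would otherwise match every article
            if prefix and path.startswith(prefix + "/"):
                groups[category].append(url)
                break
    return groups
```

The inputs could come from [a.url for a in news_paper.articles] and news_paper.category_urls().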

 

Future Applications


Sentiment Analysis

Integrate sentiment analysis to gauge the overall tone of the articles. This can be particularly useful for market analysis and understanding public opinion. Demand for it is growing: with the boom in small, useful products made by Chinese factories, resellers have to find good products in order to compete.
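
To make the idea concrete, here is a toy lexicon-based scorer that could be run on article.text. The word lists are made-up placeholders and sentiment_score is a hypothetical name. A real project would more likely use a trained model or an established sentiment library.

```python
import re

# Assumption: tiny illustrative word lists, not a real sentiment lexicon
POSITIVE = {"good", "great", "growth", "gain", "success"}
NEGATIVE = {"bad", "loss", "decline", "crisis", "fail"}


def sentiment_score(text):
    """Return (positive - negative) / matched words, or 0.0 if nothing matched."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    total = pos + neg
    return (pos - neg) / total if total else 0.0
```

The score ranges from -1.0 (only negative words matched) to 1.0 (only positive words matched).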

 

Automated News Aggregator

Build an automated news aggregator that collects articles from multiple sources, categorizes them, and presents them in a user-friendly dashboard. There is also a need for a news platform that lets readers filter out particular topics, people's names, or offensive content.
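
The filtering part of such an aggregator can start very simply. The sketch below assumes each scraped article has been stored as a dict with "title" and "text" keys. Both that shape and the filter_articles name are assumptions made for illustration.

```python
def filter_articles(articles, blocked_terms):
    """Keep only articles whose title and text mention none of the blocked terms."""
    blocked = [term.lower() for term in blocked_terms]
    kept = []
    for article in articles:
        haystack = f"{article['title']} {article['text']}".lower()
        if not any(term in haystack for term in blocked):
            kept.append(article)
    return kept
```

Plain substring matching like this is crude (it will also match inside longer words), so a production filter would probably need word-boundary or topic-level matching.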

 

Trend Analysis

Analyze trends over time by tracking the frequency and sentiment of specific keywords in articles. This can provide insights into emerging topics and industry trends.
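
As a minimal sketch of the frequency side of this idea, the function below counts how many articles per day carry a given keyword. It assumes you have already collected (publish date, keywords) pairs, for example from article.publish_date and article.keywords. The keyword_trend name and the input shape are hypothetical.

```python
from collections import Counter


def keyword_trend(articles, keyword):
    """Count, per publish date, how many articles list `keyword`.

    `articles` is an iterable of (publish_date, keywords) pairs.
    """
    wanted = keyword.lower()
    counts = Counter()
    for published, keywords in articles:
        if wanted in (k.lower() for k in keywords):
            counts[published] += 1
    return dict(sorted(counts.items()))  # Oldest date first
```

Plotting these daily counts over several weeks would give a first view of whether a topic is rising or fading.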

 

Custom Alerts

Create a system that sends custom alerts based on specific keywords or topics of interest. For instance, receive notifications whenever there are new articles about "Artificial Intelligence" or "Blockchain."
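
A possible core for such an alert system is sketched below. It assumes articles are dicts with "title" and "url" keys and that the caller keeps a set of URLs already reported, so that only new articles trigger an alert. The new_alerts name and these data shapes are assumptions, and delivering the notification (email, chat message) is left out.

```python
def new_alerts(articles, watch_terms, seen_urls):
    """Return alerts for unseen articles whose title contains a watched term.

    `seen_urls` is a set that is updated in place across runs.
    """
    alerts = []
    for article in articles:
        if article["url"] in seen_urls:
            continue  # Already handled on an earlier run
        seen_urls.add(article["url"])
        title = article["title"].lower()
        matched = [term for term in watch_terms if term.lower() in title]
        if matched:
            alerts.append(
                {"url": article["url"], "title": article["title"], "terms": matched}
            )
    return alerts
```

Running it on a schedule with the same seen_urls set means each matching article is reported once.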

 

Conclusion

Using newspaper3k to scrape articles from VietnamNet is a straightforward and efficient way to gather news data. While it has its limitations, the library's ease of use and comprehensive parsing capabilities make it a valuable tool for many applications. By leveraging further features and exploring future applications, you can create powerful tools for news aggregation, sentiment analysis, and trend tracking.

 

