Regex Powerful Features
By hientd, at: Jan. 25, 2024, 2:48 p.m.
Text Searching
Text searching with regex allows you to locate specific patterns within a body of text. This can be useful for finding keywords, email addresses, phone numbers, or any other identifiable patterns.
Example 1: Find all email addresses in a text
import re
text = "Contact us at support@example.com or sales@example.org."  # placeholder addresses
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)
print(emails) # Output: ['support@example.com', 'sales@example.org']
Explanation: This regex pattern matches standard email addresses by ensuring the presence of alphanumeric characters and certain special characters before the "@" symbol and a valid domain name format.
Example 2: Search for specific keywords in a document
import re
document = "Python is a great programming language. Java is also popular."
keywords = re.findall(r'\bPython\b|\bJava\b', document)
print(keywords) # Output: ['Python', 'Java']
Explanation: This pattern matches the words "Python" and "Java" as whole words, ensuring that partial matches are not included.
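The same idea extends to a keyword list that is only known at run time. The following is a minimal sketch, not a definitive implementation: the keyword list and the sample document are hypothetical, and the lookarounds stand in for \b so that a keyword ending in punctuation (such as "C++") still matches as a whole word.

```python
import re

document = "Python is a great programming language. java is also popular, as is C++."
keywords = ["Python", "Java", "C++"]  # hypothetical keyword list

# re.escape keeps characters such as "+" from being read as regex operators.
alternation = "|".join(re.escape(k) for k in keywords)

# (?<!\w) and (?!\w) behave like word boundaries but also work after "C++",
# where a plain \b would fail because "+" is not a word character.
pattern = re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)

found = pattern.findall(document)
print(found)  # ['Python', 'java', 'C++']
```

re.IGNORECASE returns the text as it appears in the document, so "java" comes back in lowercase here.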
Example 3: Find all occurrences of a date pattern
import re
text = "The event is on 2024-07-21. Another event is on 2023-11-15."
dates = re.findall(r'\d{4}-\d{2}-\d{2}', text)
print(dates) # Output: ['2024-07-21', '2023-11-15']
Explanation: This regex pattern matches dates in the format YYYY-MM-DD by looking for four digits, followed by a hyphen, two digits, another hyphen, and two more digits.
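The pattern above checks only the shape of a date, so a string like 2023-13-45 also matches. One possible way to filter out impossible dates, sketched here with a hypothetical sample text, is to capture the three parts and let datetime.date reject invalid combinations.

```python
import re
from datetime import date

text = "Valid: 2024-07-21. Invalid: 2023-13-45."  # hypothetical sample
date_pattern = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

valid_dates = []
for match in date_pattern.finditer(text):
    year, month, day = (int(part) for part in match.groups())
    try:
        valid_dates.append(date(year, month, day))
    except ValueError:
        # Right shape, but not a real calendar date (month 13, day 45, ...).
        pass

print(valid_dates)  # [datetime.date(2024, 7, 21)]
```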
Text Validation
Regex can validate input formats to ensure they meet specified criteria, such as correct email, phone number, and date formats.
Example 4: Validate an email address
import re
email = "user@example.com"  # placeholder address
pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
is_valid = re.match(pattern, email)
print(is_valid is not None) # Output: True
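Explanation: The ^ and $ anchors tie the pattern to the start and end of the string, so the input must consist of a local part, an "@", a domain, and a top-level domain of at least two letters. The pattern checks the shape of the address only; it cannot tell whether the mailbox exists.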
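One detail worth knowing when validating with re.match: in Python, $ also matches just before a trailing newline, so an input with a stray "\n" at the end still passes. A small sketch of the difference, using a hypothetical input, shows that re.fullmatch is stricter.

```python
import re

pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
candidate = "user@example.com\n"  # hypothetical input with a stray newline

# "$" is satisfied just before the final newline, so re.match accepts this.
accepted_by_match = re.match(pattern, candidate) is not None

# re.fullmatch requires the match to cover the entire string, newline included.
accepted_by_fullmatch = re.fullmatch(pattern, candidate) is not None

print(accepted_by_match, accepted_by_fullmatch)  # True False
```

For form input that may carry trailing whitespace, stripping the value first or using re.fullmatch avoids this surprise.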
Example 5: Validate a phone number (US format)
import re
phone_number = "+1 (123) 456-7890"
pattern = r'^(\+\d{1,2}\s?)?(\(\d{3}\)|\d{3})[-.\s]?\d{3}[-.\s]?\d{4}$'
is_valid = re.match(pattern, phone_number)
print(is_valid is not None) # Output: True
Explanation: This pattern checks for various US phone number formats, including optional country code and different separators.
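A pattern this dense is easier to maintain when each part is annotated. The sketch below rewrites the same pattern with re.VERBOSE and runs it against a few hypothetical sample inputs; the labels in the comments (area code, exchange, line number) are descriptive names, not part of the original.

```python
import re

phone_pattern = re.compile(r"""
    (\+\d{1,2}\s?)?        # optional country code such as +1
    (\(\d{3}\)|\d{3})      # area code, with or without parentheses
    [-.\s]?                # optional separator
    \d{3}                  # exchange
    [-.\s]?                # optional separator
    \d{4}                  # line number
""", re.VERBOSE)

samples = ["+1 (123) 456-7890", "123.456.7890", "1234567890", "123-45-6789"]

# fullmatch replaces the ^ and $ anchors of the original pattern.
results = {s: phone_pattern.fullmatch(s) is not None for s in samples}
for number, ok in results.items():
    print(f"{number!r}: {ok}")
```

The last sample fails because its middle group has only two digits.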
Text Extraction
Extracting specific data from text using regex is powerful for parsing logs, web scraping, and processing structured documents.
Example 6: Extract all URLs from a document
import re
document = "Check out https://example.com and http://domain.org for more info."
urls = re.findall(r'https?://[^\s]+', document)
print(urls) # Output: ['https://example.com', 'http://domain.org']
Explanation: This pattern matches URLs starting with "http" or "https" followed by "://", capturing everything until the next whitespace.
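Because the pattern runs until the next whitespace, punctuation that directly follows a URL in running text is captured as well. One simple mitigation, sketched here on a hypothetical document, is to strip common trailing punctuation after matching.

```python
import re

document = "Docs live at https://example.com/docs, see also (http://domain.org/a?b=1)."

# \S+ is equivalent to [^\s]+: everything up to the next whitespace.
raw_urls = re.findall(r'https?://\S+', document)

# Strip punctuation that usually belongs to the sentence, not the URL.
urls = [u.rstrip('.,;:!?)') for u in raw_urls]

print(raw_urls)  # ['https://example.com/docs,', 'http://domain.org/a?b=1).']
print(urls)      # ['https://example.com/docs', 'http://domain.org/a?b=1']
```

This is a heuristic: a URL that legitimately ends in ")" would be shortened, so a dedicated URL parser is the safer choice when accuracy matters.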
Example 7: Extract hashtags from a tweet
import re
tweet = "Loving the new features in #Python3 and #Django!"
hashtags = re.findall(r'#\w+', tweet)
print(hashtags) # Output: ['#Python3', '#Django']
Explanation: This pattern matches a "#" followed by one or more word characters (\w), which covers letters, digits, and underscores. The trailing "!" is not a word character, so it is left out of the match.
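If only the tag names are needed, a capturing group returns the text without the "#". The sketch below also adds a lookbehind so that a "#" in the middle of a word is skipped; the tweet text is a hypothetical sample.

```python
import re

tweet = "Loving #Python3 and #Django! Issue#42 is fixed."

# With one capturing group, findall returns only the group's contents.
# (?<!\w) rejects a "#" that is glued to a preceding word, as in "Issue#42".
tags = re.findall(r'(?<!\w)#(\w+)', tweet)

print(tags)  # ['Python3', 'Django']
```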
Text Replacement
Regex is useful for replacing text patterns, enabling text formatting or data cleaning.
Example 8: Replace multiple spaces with a single space
import re
text = "This   is  an    example."
normalized_text = re.sub(r'\s+', ' ', text)
print(normalized_text) # Output: "This is an example."
Explanation: This pattern matches one or more whitespace characters and replaces them with a single space.
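Since \s also matches tabs and newlines, the substitution above flattens multi-line text into a single line. When line breaks should survive, one option is to target runs of literal spaces only, as in this small sketch with a hypothetical input.

```python
import re

text = "one   two\n\nthree    four"

# \s+ collapses every run of whitespace, including the blank line.
flattened = re.sub(r'\s+', ' ', text)

# " {2,}" touches only runs of two or more spaces; newlines are preserved.
spaces_only = re.sub(r' {2,}', ' ', text)

print(repr(flattened))    # 'one two three four'
print(repr(spaces_only))  # 'one two\n\nthree four'
```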
Example 9: Anonymize email addresses in a document
import re
text = "Contact alice@example.com or bob@example.org."  # placeholder addresses
anonymized_text = re.sub(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', '[REDACTED]', text)
print(anonymized_text) # Output: "Contact [REDACTED] or [REDACTED]."
Explanation: This pattern finds all email addresses and replaces them with "[REDACTED]".
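re.sub also accepts a function as the replacement, which allows partial masking instead of full redaction. The following is a sketch under assumptions: the helper name mask_email and the sample addresses are hypothetical, and keeping the first character plus the domain is just one possible masking policy.

```python
import re

text = "Contact alice@example.com or bob@example.org."  # hypothetical addresses

# Group 1 is the local part, group 2 is the domain.
email_pattern = re.compile(r'([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})')

def mask_email(match):
    """Keep the first character of the local part and the full domain."""
    local, domain = match.group(1), match.group(2)
    return f"{local[0]}***@{domain}"

masked_text = email_pattern.sub(mask_email, text)
print(masked_text)  # Contact a***@example.com or b***@example.org.
```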
Text Splitting
Splitting text into parts based on patterns is useful for tokenizing, parsing CSV data, and more.
Example 10: Split a CSV line into fields
import re
csv_line = "name,age,location"
fields = re.split(r',', csv_line)
print(fields) # Output: ['name', 'age', 'location']
Explanation: This pattern splits the text at each comma, returning a list of fields.
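A plain comma split works for simple lines, but two common cases need more care: stray spaces around the commas, and quoted fields that themselves contain commas. The sketch below, using hypothetical input lines, handles the first case with a slightly wider pattern and the second with Python's standard csv module, since a regex split cannot tell a delimiter from a comma inside quotes.

```python
import csv
import re

# Case 1: optional whitespace around the delimiter.
spaced_line = "name , age,  location"
fields = re.split(r'\s*,\s*', spaced_line)
print(fields)  # ['name', 'age', 'location']

# Case 2: a quoted field containing a comma.
quoted_line = 'name,"Doe, Jane",location'
naive = re.split(r',', quoted_line)
parsed = next(csv.reader([quoted_line]))
print(naive)   # ['name', '"Doe', ' Jane"', 'location']
print(parsed)  # ['name', 'Doe, Jane', 'location']
```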
Example 11: Split a paragraph into sentences
import re
paragraph = "This is sentence one. This is sentence two! Is this sentence three?"
sentences = re.split(r'[.!?]\s', paragraph)
print(sentences) # Output: ['This is sentence one', 'This is sentence two', 'Is this sentence three?']
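Explanation: This pattern splits the text at a period, exclamation mark, or question mark that is followed by a whitespace character. The matched delimiter is discarded, which is why the first two sentences lose their punctuation. The final "?" survives because nothing follows it, so no split happens there.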
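To keep the punctuation attached to each sentence, the split point can be described with a lookbehind, which checks for the punctuation without consuming it. A minimal sketch on the same paragraph:

```python
import re

paragraph = "This is sentence one. This is sentence two! Is this sentence three?"

# (?<=[.!?]) asserts that the whitespace is preceded by sentence punctuation;
# only the whitespace itself is consumed by the split.
sentences = re.split(r'(?<=[.!?])\s+', paragraph)

print(sentences)
# ['This is sentence one.', 'This is sentence two!', 'Is this sentence three?']
```

This remains a heuristic: abbreviations such as "Dr. Smith" would still be split after the period.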
Together, these patterns cover the five core regex tasks in text processing: searching, validating, extracting, replacing, and splitting.