Speech-to-Text in Python: Tools, Code, and Comparisons
By manhnv, at: Nov. 26, 2024, 11:17 a.m.
Converting speech into text is a critical feature in many applications, from personal assistants to transcription tools. In this post, we'll explore the top libraries and services for implementing speech-to-text in Python: SpeechRecognition, Google Cloud Speech-to-Text, Azure Speech Service, and Whisper by OpenAI. We'll provide sample code for each and compare their performance, accuracy, and pricing.
1. SpeechRecognition
Overview
- Type: Open-source
- Dependencies: PyAudio for microphone input; pre-recorded audio files work without it
- Backend: Google Web Speech API through recognize_google (the library also wraps other engines)
- Ideal for: Quick prototyping, hobby projects
Code Snippet
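import speech_recognition as sr
recognizer = sr.Recognizer()
# Capture one utterance from the default microphone.
with sr.Microphone() as source:
    print("Listening...")
    audio = recognizer.listen(source)
# Send the captured audio to the Google Web Speech API.
try:
    text = recognizer.recognize_google(audio)
    print("You said:", text)
except sr.UnknownValueError:
    # The speech was unintelligible.
    print("Could not understand the audio.")
except sr.RequestError as e:
    # The API was unreachable or rejected the request.
    print(f"API error: {e}")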
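The overview mentions pre-recorded audio files, which the microphone snippet does not cover. Here is a minimal sketch of that path, assuming a WAV, AIFF, or FLAC file (the formats `sr.AudioFile` reads). The helper name `transcribe_file` and the file name `audio.wav` are illustrative, not part of the library.

```python
def transcribe_file(path):
    """Transcribe a pre-recorded audio file, or return None on failure."""
    # Imported inside the function so the helper can be defined
    # before the SpeechRecognition package is loaded.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    # AudioFile reads WAV, AIFF, and FLAC; record() loads the whole file.
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)

    try:
        return recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        print("Could not understand the audio.")
    except sr.RequestError as e:
        print(f"API error: {e}")
    return None


# Example call (the file name is a placeholder):
# print(transcribe_file("audio.wav"))
```

Returning None on failure keeps the calling code simple for a prototype; a larger application would probably want to re-raise or log the error instead.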
2. Google Cloud Speech-to-Text
Overview
- Type: Cloud-based API
- Dependencies: Google Cloud account with the Speech-to-Text API enabled and application credentials configured
- Ideal for: High accuracy, diverse language support
Code Snippet
from google.cloud import speech
client = speech.SpeechClient()
# Read the audio bytes from disk.
with open("audio.wav", "rb") as audio_file:
    content = audio_file.read()
audio = speech.RecognitionAudio(content=content)
# The config must describe the audio: 16-bit PCM at 16 kHz, US English.
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)
# Synchronous recognition, intended for short clips.
response = client.recognize(config=config, audio=audio)
for result in response.results:
    # Each result lists alternatives, most likely first.
    print("Transcript:", result.alternatives[0].transcript)
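The config in the snippet above declares LINEAR16 at 16000 Hz, so the file has to match it; a mismatch can lead to request errors or poor transcripts. As a rough pre-flight check, the standard-library `wave` module can read the WAV header before anything is uploaded. The helper name `wav_matches_config` is hypothetical, and the mono default is an assumption based on the config not setting a channel count.

```python
import wave


def wav_matches_config(path, sample_rate_hertz=16000, channels=1):
    """Return True if a WAV file is 16-bit PCM at the expected rate."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getsampwidth() == 2              # LINEAR16 means 16-bit samples
            and wav.getframerate() == sample_rate_hertz
            and wav.getnchannels() == channels   # assumed mono, see lead-in
            and wav.getcomptype() == "NONE"      # uncompressed PCM
        )


# Example call (the file name is a placeholder):
# if not wav_matches_config("audio.wav"):
#     print("Convert the file or adjust RecognitionConfig first.")
```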
3. Azure Speech Service
Overview
- Type: Cloud-based API
- Dependencies: Azure Cognitive Services account
- Ideal for: Enterprise applications, integration with Microsoft ecosystem
Code Snippet
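import azure.cognitiveservices.speech as speechsdk
# Replace the placeholders with the key and region of your Speech resource.
speech_key = "YOUR_AZURE_SPEECH_KEY"
service_region = "YOUR_SERVICE_REGION"
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
# With no audio config given, the recognizer listens on the default microphone.
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
print("Speak something...")
# recognize_once() returns after a single utterance.
result = speech_recognizer.recognize_once()
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print("Recognized:", result.text)
else:
    # Covers both "no speech matched" and canceled requests.
    print("Error:", result.reason)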
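The snippet above reports every non-success case as a bare reason code. A slightly more informative sketch separates "nothing recognized" from a canceled request, which is where a wrong key or region usually surfaces. The helper name `summarize_result` is hypothetical; the `ResultReason` values and `cancellation_details` attribute come from the Speech SDK.

```python
def summarize_result(result):
    """Turn an Azure recognition result into a readable status line."""
    # Imported inside the function so the helper can be defined
    # before the Speech SDK is loaded.
    import azure.cognitiveservices.speech as speechsdk

    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        return f"Recognized: {result.text}"
    if result.reason == speechsdk.ResultReason.NoMatch:
        return "No speech could be recognized."
    if result.reason == speechsdk.ResultReason.Canceled:
        details = result.cancellation_details
        return f"Canceled: {details.reason} ({details.error_details})"
    return f"Unexpected result: {result.reason}"


# Usage sketch with a file instead of the microphone
# (speech_config as built earlier; the file name is a placeholder):
# audio_config = speechsdk.audio.AudioConfig(filename="audio.wav")
# recognizer = speechsdk.SpeechRecognizer(
#     speech_config=speech_config, audio_config=audio_config)
# print(summarize_result(recognizer.recognize_once()))
```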
4. Whisper by OpenAI
Overview
- Type: Open-source, offline
- Dependencies: Pre-trained Whisper model (downloaded on first use) and ffmpeg for audio decoding
- Ideal for: Offline transcription, high accuracy for varied accents
Code Snippet
import whisper
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print("Transcription:", result["text"])
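Besides the full text, the dictionary returned by `transcribe` also carries a `segments` list with start and end times in seconds, which is handy for subtitles or for jumping to a point in the recording. A small sketch of printing them follows; `format_segments` is an illustrative helper name, not part of Whisper.

```python
def format_segments(result):
    """Render Whisper segments as '[start -> end] text' lines."""
    lines = []
    for segment in result.get("segments", []):
        start, end = segment["start"], segment["end"]
        text = segment["text"].strip()  # segment text often starts with a space
        lines.append(f"[{start:.2f}s -> {end:.2f}s] {text}")
    return "\n".join(lines)


# Usage sketch (the file name is a placeholder):
# import whisper
# model = whisper.load_model("base")
# print(format_segments(model.transcribe("audio.mp3")))
```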
Comparison Table
Performance Insights
- Accuracy:
  - Whisper and Google Cloud Speech provide superior accuracy for diverse accents and noisy environments.
  - SpeechRecognition can struggle with unclear audio.
- Speed:
  - SpeechRecognition and cloud services like Google and Azure are faster than Whisper, especially for real-time tasks.
  - Whisper is slower but operates offline, making it valuable for privacy-focused applications.
- Cost:
  - SpeechRecognition and Whisper are cost-effective for smaller projects.
  - Cloud services (Google and Azure) are better for large-scale, high-accuracy needs but come with ongoing costs.
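Accuracy and speed depend heavily on your own audio, so it is worth measuring both on a few representative clips. The sketch below computes word error rate (word-level edit distance divided by the number of reference words) and times a single call. It uses only the standard library; the names `word_error_rate` and `benchmark` are illustrative, and `transcribe` stands for any function that takes a file path and returns text. Note the simplification: words are lowercased but punctuation is not stripped.

```python
import time


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def benchmark(transcribe, audio_path, reference):
    """Time one transcription call and score it against a reference."""
    started = time.perf_counter()
    hypothesis = transcribe(audio_path)
    elapsed = time.perf_counter() - started
    return {"seconds": elapsed, "wer": word_error_rate(reference, hypothesis)}
```

Any of the four tools can be plugged in by wrapping it in a function that returns the transcript as a string; a single timed call is noisy, so averaging over several clips gives a fairer picture.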
Final Recommendations
- Use SpeechRecognition for quick and small-scale projects.
- Choose Google Cloud Speech or Azure Speech Service for enterprise applications requiring high accuracy and language support.
- Opt for Whisper if offline capabilities or privacy are priorities.