Speech-to-Text in Python: Tools, Code, and Comparisons

Converting speech into text is a critical feature in many applications, from personal assistants to transcription tools. In this post, we'll explore the top libraries and services for implementing speech-to-text in Python: SpeechRecognition, Google Cloud Speech-to-Text, Azure Speech Service, and Whisper by OpenAI. We'll provide sample code for each and compare their performance, accuracy, and pricing.

1. SpeechRecognition

Overview

Type: Open-source
Dependencies: Local microphone (requires the PyAudio package) or pre-recorded audio files (WAV, AIFF, or FLAC)
Backend: Google Web Speech API by default
Ideal for: Quick prototyping, hobby projects

Code Snippet

import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.Microphone() as source:
    print("Listening...")
    audio = recognizer.listen(source)

try:
    text = recognizer.recognize_google(audio)
    print("You said:", text)
except sr.UnknownValueError:
    print("Could not understand the audio.")
except sr.RequestError as e:
    print(f"API error: {e}")
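The overview also mentions pre-recorded audio files. A minimal sketch of that path might look like the following, using the library's AudioFile source in place of the microphone. The function name transcribe_file and the up-front path check are illustrative choices for this post, not part of the library.

```python
from pathlib import Path


def transcribe_file(path: str, language: str = "en-US") -> str:
    """Transcribe a WAV, AIFF, or FLAC file with the Google Web Speech backend."""
    audio_path = Path(path)
    if not audio_path.is_file():
        raise FileNotFoundError(f"No such audio file: {audio_path}")

    # Imported after the path check so a bad path fails fast, before the library loads.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(str(audio_path)) as source:
        audio = recognizer.record(source)  # read the whole file into memory

    # May raise sr.UnknownValueError or sr.RequestError, as in the microphone example.
    return recognizer.recognize_google(audio, language=language)
```

Calling transcribe_file("audio.wav") should return the transcript as a string. The same try/except handling shown for the microphone applies here.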

2. Google Cloud Speech-to-Text

Overview

Type: Cloud-based API
Dependencies: Google Cloud account with the Speech-to-Text API enabled, application credentials, and the google-cloud-speech package
Ideal for: High accuracy, diverse language support

Code Snippet

from google.cloud import speech
import io

client = speech.SpeechClient()

with io.open("audio.wav", "rb") as audio_file:
    content = audio_file.read()

audio = speech.RecognitionAudio(content=content)
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)

for result in response.results:
    print("Transcript:", result.alternatives[0].transcript)
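The config above assumes 16-bit PCM (LINEAR16) sampled at 16000 Hz, and it does not set audio_channel_count, so the API treats the audio as mono. A mismatch between those settings and the actual file is a common source of errors or empty results. Here is a small sketch, using only the standard library's wave module, for inspecting a WAV file before sending it. The helper names describe_wav and check_linear16 are made up for this example.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read basic format information from a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hertz": rate,
            "duration_seconds": wav.getnframes() / rate,
        }


def check_linear16(path: str, expected_rate: int = 16000) -> list[str]:
    """Return a list of mismatches between the file and the config used above."""
    info = describe_wav(path)
    problems = []
    if info["sample_width_bytes"] != 2:
        problems.append("LINEAR16 expects 16-bit samples")
    if info["sample_rate_hertz"] != expected_rate:
        problems.append(
            f"file is {info['sample_rate_hertz']} Hz, config says {expected_rate} Hz"
        )
    if info["channels"] != 1:
        problems.append("file is not mono; set audio_channel_count or downmix")
    return problems
```

An empty list from check_linear16("audio.wav") suggests the file matches the config. Otherwise, either convert the file or adjust sample_rate_hertz to the value reported by describe_wav.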

3. Azure Speech Service

Overview

Type: Cloud-based API
Dependencies: Azure Cognitive Services account
Ideal for: Enterprise applications, integration with Microsoft ecosystem

Code Snippet

import azure.cognitiveservices.speech as speechsdk

speech_key = "YOUR_AZURE_SPEECH_KEY"
service_region = "YOUR_SERVICE_REGION"
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)

speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

print("Speak something...")
result = speech_recognizer.recognize_once()

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print("Recognized:", result.text)
else:
    print("Error:", result.reason)
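The snippet above hardcodes the key and region as placeholders. In a real project you would probably keep them out of source control. One possible sketch reads them from environment variables instead. The variable names SPEECH_KEY and SPEECH_REGION and the helper names are assumptions for this example, so use whatever your deployment defines.

```python
import os
from collections.abc import Mapping


def load_speech_settings(env: Mapping[str, str] = os.environ) -> tuple[str, str]:
    """Return (key, region) from the environment, failing with a clear message."""
    try:
        return env["SPEECH_KEY"], env["SPEECH_REGION"]
    except KeyError as missing:
        raise RuntimeError(
            f"Set the {missing.args[0]} environment variable first"
        ) from None


def make_recognizer():
    """Build a SpeechRecognizer for the default microphone from environment settings."""
    # Imported lazily so load_speech_settings can be used without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    key, region = load_speech_settings()
    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    return speechsdk.SpeechRecognizer(speech_config=speech_config)
```

With the two variables exported in your shell, make_recognizer().recognize_once() behaves like the snippet above without the credentials appearing in the code.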

4. Whisper by OpenAI

Overview

Type: Open-source, offline
Dependencies: Pre-trained Whisper model (downloaded on first use), PyTorch, and the ffmpeg command-line tool
Ideal for: Offline transcription, high accuracy for varied accents

Code Snippet

import whisper

model = whisper.load_model("base")

result = model.transcribe("audio.mp3")
print("Transcription:", result["text"])
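Besides the full text, the dictionary returned by transcribe also carries a "segments" list whose entries include "start", "end", and "text". That is handy for captions or for jumping to a point in the recording. The sketch below formats those segments as timestamped lines. The helper names and the MM:SS.mmm layout are choices made for this example.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    minutes, ms = divmod(total_ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segments(result: dict) -> list[str]:
    """Turn a Whisper result dict into one timestamped line per segment."""
    return [
        f"[{format_timestamp(seg['start'])} -> {format_timestamp(seg['end'])}] "
        f"{seg['text'].strip()}"
        for seg in result.get("segments", [])
    ]


def transcribe_with_timestamps(path: str, model_name: str = "base") -> list[str]:
    """Transcribe a file and return timestamped lines."""
    # Imported lazily: whisper pulls in PyTorch, which the helpers above do not need.
    import whisper

    model = whisper.load_model(model_name)
    # fp16=False avoids the half-precision warning on CPU-only machines.
    result = model.transcribe(path, fp16=False)
    return format_segments(result)
```

For example, print("\n".join(transcribe_with_timestamps("audio.mp3"))) would print one line per segment.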

Comparison Table

[Table image: Speech-to-text services comparison]

Performance Insights

Accuracy:
- Whisper and Google Cloud Speech generally handle diverse accents and noisy environments best.
- SpeechRecognition, with its default Google Web Speech backend, can struggle with noisy or unclear audio.
Speed:
- SpeechRecognition and cloud services like Google and Azure typically return results faster than Whisper, especially for real-time tasks.
- Whisper is slower, particularly with larger models or on a CPU, but it operates offline, making it valuable for privacy-focused applications.
Cost:
- SpeechRecognition and Whisper are open source, which keeps them cost-effective for smaller projects; Whisper's main cost is the local compute it needs.
- Cloud services (Google and Azure) are better for large-scale, high-accuracy needs but come with ongoing, usage-based costs.
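Accuracy claims like these are easiest to check on your own recordings. A common metric is word error rate (WER): the word-level edit distance between a reference transcript and the tool's output, divided by the number of reference words. The sketch below is a simplified version that only lowercases and splits on whitespace. Real evaluations usually also normalize punctuation and numbers, and the function name is our own.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Running each tool on the same clips and comparing word_error_rate(reference, output) gives a like-for-like number for your own accents and recording conditions. Lower is better, and 0.0 means an exact match after normalization.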

Final Recommendations

Use SpeechRecognition for quick and small-scale projects.
Choose Google Cloud Speech or Azure Speech Service for enterprise applications requiring high accuracy and language support.
Opt for Whisper if offline capabilities or privacy are priorities.