How to Use Python Speech Recognition: Step by Step Beginner Guide
By Braincuber Team
Published on May 6, 2026
Speech recognition converts human speech into digital text. This beginner guide walks through Python speech recognition step by step. We will use the SpeechRecognition library to read audio files, capture microphone input, handle noise, and work with several APIs, including the Google Web Speech API and IBM Speech to Text.
What You'll Learn:
- How to install SpeechRecognition and PyAudio libraries
- Reading and transcribing audio files (WAV, AIFF, FLAC)
- Using the Recognizer class and available API methods
- Reading specific segments of audio with offset and duration
- Handling noise with adjust_for_ambient_noise()
- Working with microphones for real-time speech capture
- Handling errors and unknown speech exceptions
What is Python Speech Recognition?
Speech recognition systems have evolved from single-speaker models with limited vocabularies to sophisticated systems that recognize multiple speakers with huge vocabularies across many languages. The process starts with a microphone, which converts speech from physical sound into an electrical signal, and an analogue-to-digital converter, which turns that signal into digital data. One or more models then transcribe the audio into text. A classic technique is the Hidden Markov Model (HMM), in which the speech signal is divided into 10-millisecond fragments that the model processes as a sequence.
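To make the idea of 10-millisecond fragments concrete, here is a minimal sketch that cuts a list of samples into fixed-length frames. The function name `split_into_frames` and the 16 kHz sample rate are assumptions chosen for illustration; real recognizers do considerably more work per frame.

```python
def split_into_frames(samples, sample_rate, frame_ms=10):
    """Split a sequence of audio samples into fixed-length frames.

    A trailing partial frame is dropped.
    """
    frame_len = sample_rate * frame_ms // 1000  # samples per frame
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# One second of silence at 16 kHz (an assumed rate) gives 100 frames of 160 samples.
one_second = [0] * 16000
frames = split_into_frames(one_second, sample_rate=16000)
print(len(frames), len(frames[0]))  # 100 160
```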
Available APIs in Python Speech Recognition
Several Python packages provide speech recognition. Packages like Wit and Api.ai offer more than basic speech recognition, but this guide uses SpeechRecognition, which is easier to get started with and includes a hardcoded default API key for the Google Web Speech API. That default key is meant for testing rather than production use.
| API Method | Service | Requirements |
|---|---|---|
| recognize_bing() | Microsoft Bing Speech | API key |
| recognize_google() | Google Web Speech API | None (default key included) |
| recognize_google_cloud() | Google Cloud Speech | google-cloud-speech package |
| recognize_houndify() | Houndify by SoundHound | Client ID & Client Key |
| recognize_ibm() | IBM Speech to Text | Username & Password |
| recognize_sphinx() | CMU Sphinx | PocketSphinx installation |
| recognize_wit() | Wit.ai | API key |
Step by Step Guide to Installation
Install SpeechRecognition
Use pip to install the SpeechRecognition library. This beginner guide assumes you have Python 3.6+ installed.
pip install SpeechRecognition
Verify Installation
Open the Python interpreter and check the installed version.
>>> import speech_recognition as sr
>>> sr.__version__
'3.8.1'
Install PyAudio for Microphone Support
PyAudio is required to work with microphones. On Windows, download the binary wheel from Christoph Gohlke's repository.
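# On Windows, download the appropriate .whl file from:
# https://www.lfd.uci.edu/~gohlke/pythonlibs/#pyaudio
# Then install the file that matches your Python version and architecture, for example:
pip install PyAudio-0.2.11-cp37-cp37m-win32.whl
# On Mac/Linux (PyAudio builds against PortAudio, so install PortAudio first,
# e.g. `brew install portaudio` on macOS or `sudo apt-get install portaudio19-dev` on Debian/Ubuntu):
pip install PyAudio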
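After both installs, a quick check can confirm that the packages are visible to your interpreter. The sketch below uses only the standard library; the helper name `check_speech_environment` is made up for this guide, and it assumes Python 3.8 or later for `importlib.metadata`.

```python
from importlib import metadata, util


def check_speech_environment():
    """Return a dict describing which speech packages are installed."""
    try:
        sr_version = metadata.version("SpeechRecognition")
    except metadata.PackageNotFoundError:
        sr_version = None  # SpeechRecognition is not installed
    return {
        "speech_recognition_version": sr_version,
        # PyAudio is imported under the module name "pyaudio"
        "pyaudio_available": util.find_spec("pyaudio") is not None,
    }


print(check_speech_environment())
```

If `pyaudio_available` is False, file transcription still works; only the microphone examples need PyAudio.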
Supported Audio File Types
The SpeechRecognition library supports the following audio formats:
WAV (PCM/LPCM)
Uncompressed audio format, best for speech recognition accuracy.
AIFF / AIFF-C
Audio Interchange File Format, commonly used on Apple systems.
FLAC
Free Lossless Audio Codec, compressed without quality loss.
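File extensions can be misleading, so it may help to check a file's container format before handing it to the library. The sketch below inspects the first 12 bytes of a file; `sniff_audio_format` and `sniff_file` are hypothetical helpers, not part of SpeechRecognition. It identifies only the container: a RIFF/WAVE header does not guarantee that the audio inside is PCM.

```python
def sniff_audio_format(header: bytes):
    """Guess the container format from the first 12 bytes of a file."""
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "WAV"
    if header[:4] == b"FORM" and header[8:12] == b"AIFF":
        return "AIFF"
    if header[:4] == b"FORM" and header[8:12] == b"AIFC":
        return "AIFF-C"
    if header[:4] == b"fLaC":
        return "FLAC"
    return None  # not one of the formats listed in this guide


def sniff_file(path):
    """Read just enough of a file to identify its container format."""
    with open(path, "rb") as f:
        return sniff_audio_format(f.read(12))


print(sniff_audio_format(b"RIFF\x24\x00\x00\x00WAVE"))  # WAV
```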
Reading an Audio File in Python
The Recognizer Class
The Recognizer class is the core component for speech recognition. Create an instance and use its methods to interact with different APIs.
import speech_recognition as sr
# Create a Recognizer instance
r = sr.Recognizer()
# Recognizer methods for the different APIs:
# r.recognize_google() - Google Web Speech API
# r.recognize_google_cloud() - Google Cloud Speech
# r.recognize_bing() - Microsoft Bing Speech
# r.recognize_ibm() - IBM Speech to Text
# r.recognize_houndify() - Houndify
# r.recognize_wit() - Wit.ai
# r.recognize_sphinx() - CMU Sphinx (offline)
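If you want to switch services without rewriting your code, one option is to look the method up by name. The sketch below is an illustration, not part of the library: the short keys ("google", "bing", and so on) and the helper `get_recognize_method` are invented here, while the method names come from the table above. Which methods exist may vary between SpeechRecognition versions, so the helper reports a missing one explicitly.

```python
# Short keys are hypothetical labels; method names are taken from this guide's table.
RECOGNIZER_METHODS = {
    "google": "recognize_google",
    "google_cloud": "recognize_google_cloud",
    "bing": "recognize_bing",
    "ibm": "recognize_ibm",
    "houndify": "recognize_houndify",
    "wit": "recognize_wit",
    "sphinx": "recognize_sphinx",
}


def get_recognize_method(recognizer, service):
    """Return the bound recognize_* method for a service label."""
    try:
        method_name = RECOGNIZER_METHODS[service]
    except KeyError:
        raise ValueError(f"unknown service: {service!r}") from None
    method = getattr(recognizer, method_name, None)
    if method is None:
        raise AttributeError(f"this recognizer has no {method_name}() method")
    return method


# Usage (requires SpeechRecognition to be installed):
#   import speech_recognition as sr
#   recognize = get_recognize_method(sr.Recognizer(), "google")
#   text = recognize(audio)
```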
Capturing Data with record()
Use the AudioFile context manager to open an audio file and the record() method to capture its contents into an AudioData instance.
import speech_recognition as sr
r = sr.Recognizer()
# Open audio file
demo = sr.AudioFile('demo.wav')
# Read the audio file
with demo as source:
    audio = r.record(source)
# Check the type
print(type(audio))
# Output (SpeechRecognition 3.8.1):
# <class 'speech_recognition.AudioData'>
Recognizing Speech in Audio
Call recognize_google() to transcribe the audio into text. You can also specify a different language using the language parameter.
import speech_recognition as sr
r = sr.Recognizer()
with sr.AudioFile('demo.wav') as source:
    audio = r.record(source)
# Transcribe audio to text
text = r.recognize_google(audio)
print(text)
# For different languages (e.g., Romanian)
# text = r.recognize_google(audio, language='ro-RO')
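recognize_google() also accepts show_all=True, which returns the raw response instead of a single string. The sketch below picks the most confident alternative from such a response. Treat the response shape as an assumption based on the library's observed behavior: a dict with an "alternative" list of transcripts, some with a "confidence" value, or an empty list when nothing was recognized. The `sample` data is hand-written for illustration, not real API output.

```python
def best_transcript(response):
    """Pick the highest-confidence transcript from a show_all=True response.

    Returns (transcript, confidence) or None when nothing was recognized.
    """
    if not isinstance(response, dict):
        return None  # assumed: an empty list signals "no match"
    alternatives = response.get("alternative", [])
    if not alternatives:
        return None
    # Alternatives without a confidence value are ranked as 0.0.
    best = max(alternatives, key=lambda alt: alt.get("confidence", 0.0))
    return best["transcript"], best.get("confidence")


# Hand-written sample in the assumed response shape.
sample = {
    "alternative": [
        {"transcript": "hello world", "confidence": 0.92},
        {"transcript": "hello word"},
    ],
    "final": True,
}
print(best_transcript(sample))  # ('hello world', 0.92)

# Usage with the library:
#   response = r.recognize_google(audio, show_all=True)
#   print(best_transcript(response))
```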
Reading a Segment of Audio
When you only want to read part of an audio file, use the offset parameter (how many seconds to skip before capturing) and the duration parameter (how many seconds to capture).
import speech_recognition as sr
r = sr.Recognizer()
with sr.AudioFile('demo.wav') as source:
    # Start at 4 seconds, capture 3 seconds
    audio = r.record(source, offset=4, duration=3)
text = r.recognize_google(audio)
print(text)
# Try different offset values
with sr.AudioFile('demo.wav') as source:
    audio = r.record(source, offset=3.3, duration=3)
text = r.recognize_google(audio)
print(text)
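The same two parameters can be used to work through a long recording in pieces. Here is a minimal sketch that computes the (offset, duration) pairs; `segment_windows` is a hypothetical helper, and the total length is something you supply yourself.

```python
def segment_windows(total_seconds, chunk_seconds):
    """Return (offset, duration) pairs that cover a recording in order."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    windows = []
    offset = 0.0
    while offset < total_seconds:
        # The last window is shortened so it does not run past the end.
        duration = float(min(chunk_seconds, total_seconds - offset))
        windows.append((offset, duration))
        offset += chunk_seconds
    return windows


print(segment_windows(10, 4))  # [(0.0, 4.0), (4.0, 4.0), (8.0, 2.0)]

# Usage with the library (reopening the file for each window):
#   for offset, duration in segment_windows(10, 4):
#       with sr.AudioFile('demo.wav') as source:
#           audio = r.record(source, offset=offset, duration=duration)
```

Cutting at fixed positions can split a word in half, so expect some errors at the boundaries.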
Dealing with Noise in Python Speech Recognition
Noise is inevitable in audio recordings. The adjust_for_ambient_noise() method calibrates the recognizer to the noise level of the audio. By default it reads the first second of the audio stream; use the duration parameter to change how long it listens.
import speech_recognition as sr
r = sr.Recognizer()
with sr.AudioFile('demo.wav') as source:
    # Calibrate for ambient noise using the first 0.5 seconds of the stream
    r.adjust_for_ambient_noise(source, duration=0.5)
    # The offset is counted from the current stream position, after the calibration audio
    audio = r.record(source, offset=2.5, duration=3)
text = r.recognize_google(audio)
print(text)
# Note: A small difference in duration (0.005) can produce different outputs
# This is because adjust_for_ambient_noise() consumes part of the audio
Important Note
adjust_for_ambient_noise() is not a miracle worker. For best results, consider preprocessing audio with software like Audacity to remove noise before using speech recognition.
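To build some intuition for what "noise level" means, the sketch below computes the root-mean-square (RMS) energy of raw 16-bit PCM samples and compares it to a threshold. This is a simplified illustration, not the library's implementation. The helper names are invented, and the default of 300 is an assumption that mirrors the default of the Recognizer's energy_threshold attribute.

```python
import math
import struct


def rms_16bit(pcm_bytes):
    """Root-mean-square amplitude of little-endian 16-bit mono PCM data."""
    n = len(pcm_bytes) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", pcm_bytes[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


def is_probably_speech(pcm_bytes, energy_threshold=300):
    """Treat audio as speech when its energy exceeds the threshold (an assumption)."""
    return rms_16bit(pcm_bytes) > energy_threshold


# Two tiny synthetic signals: a quiet one and a loud one.
quiet = struct.pack("<4h", 10, -10, 10, -10)
loud = struct.pack("<4h", 1000, -1000, 1000, -1000)
print(rms_16bit(quiet), rms_16bit(loud))  # 10.0 1000.0
print(is_probably_speech(quiet), is_probably_speech(loud))  # False True
```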
Working with Microphones
To work with your own voice in real-time, you need the PyAudio package and the Microphone class from SpeechRecognition.
Using the Microphone Class
import speech_recognition as sr
r = sr.Recognizer()
# List available microphones
print(sr.Microphone.list_microphone_names())
# Use default microphone
mic = sr.Microphone()
# Or select a specific microphone by device index
# (its position in the list printed above; 3 is only an example)
mic = sr.Microphone(device_index=3)
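Device indexes differ between machines, so hardcoding one is fragile. One option is to search the list of names for a keyword. In this sketch `find_microphone_index` is a hypothetical helper and the device names are invented examples.

```python
def find_microphone_index(names, keyword):
    """Return the index of the first device whose name contains keyword, else None."""
    keyword = keyword.lower()
    for index, name in enumerate(names):
        # Skip entries whose name could not be determined.
        if name is not None and keyword in name.lower():
            return index
    return None


# Invented example names; on a real machine use sr.Microphone.list_microphone_names().
example_names = ["HDA Intel PCH: ALC272 Analog (hw:0,0)", "USB Headset Mic", "default"]
print(find_microphone_index(example_names, "usb"))  # 1

# Usage with the library:
#   index = find_microphone_index(sr.Microphone.list_microphone_names(), "usb")
#   mic = sr.Microphone(device_index=index) if index is not None else sr.Microphone()
```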
Capturing Microphone Input
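Use the listen() method to capture input from the microphone. It records until it detects silence; the optional timeout and phrase_time_limit arguments cap how long it waits for speech to start and how long a phrase may last.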
import speech_recognition as sr
r = sr.Recognizer()
with sr.Microphone() as source:
    print("Please speak something...")
    audio = r.listen(source)
try:
    text = r.recognize_google(audio)
    print("You said:", text)
except sr.UnknownValueError:
    print("Sorry, could not understand the audio")
except sr.RequestError:
    print("Could not request results from Google Speech API")
Handling Unintelligible Speech
When the recognizer cannot match audio to text, it raises an UnknownValueError exception. This happens with sounds such as coughing, gagging, hand claps, or tongue clicks.
import speech_recognition as sr
r = sr.Recognizer()
with sr.Microphone() as source:
    r.adjust_for_ambient_noise(source)
    print("Say something!")
    audio = r.listen(source)
try:
    text = r.recognize_google(audio)
    print("You said:", text)
except sr.UnknownValueError:
    print("Could not understand the audio")
except sr.RequestError as e:
    print(f"Could not request results; {e}")
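A RequestError often means a temporary network or service problem, so retrying can help. An UnknownValueError usually will not change on a retry with the same audio. The sketch below retries only the exception types you pass in. `transcribe_with_retry` is a hypothetical helper, and the demo uses a fake recognizer function with ConnectionError so it runs without the library.

```python
import time


def transcribe_with_retry(recognize, audio, retry_on, attempts=3, delay=1.0):
    """Call recognize(audio), retrying when one of the retry_on exceptions is raised."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return recognize(audio)
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: re-raise the last error
            time.sleep(delay * attempt)  # simple linear backoff


# Demo with a fake recognizer that fails twice before succeeding.
calls = {"n": 0}


def flaky(audio):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return f"transcript of {audio}"


print(transcribe_with_retry(flaky, "clip", retry_on=(ConnectionError,), delay=0))
# transcript of clip

# Usage with the library:
#   text = transcribe_with_retry(r.recognize_google, audio, retry_on=(sr.RequestError,))
```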
Applications of Python Speech Recognition
Python speech recognition has numerous real-world applications in AI and automation:
Voice Assistants
Build smart assistants like Siri or Alexa that respond to voice commands.
Phone Bots
Create automated phone systems that understand and respond to customer queries.
Accessibility Tools
Help people who cannot type by converting their speech directly to text.
Emotion Analysis
Analyze tone and speed of speech to understand emotions or stress levels.
Frequently Asked Questions
What is Python Speech Recognition used for?
Python speech recognition is used for voice assistants, phone bots, accessibility tools, transcription services, smart home systems, and emotion analysis in voice recordings.
Which API is best for beginners?
Google Web Speech API (recognize_google) is best for beginners as it comes with a default API key and requires no additional setup or authentication.
How do I handle noise in audio files?
Use the adjust_for_ambient_noise() method to calibrate the recognizer, or preprocess audio with tools like Audacity to remove noise before transcription.
What audio formats are supported?
SpeechRecognition supports WAV (PCM/LPCM), AIFF, AIFF-C, and FLAC formats. WAV is recommended for best speech recognition accuracy.
Can I use speech recognition offline?
Yes, CMU Sphinx (recognize_sphinx) works offline but requires installing PocketSphinx. Most other APIs require an internet connection.
