How to Use Python Speech Recognition: Step by Step Beginner Guide

Speech recognition converts spoken language into text. This beginner guide walks through speech recognition in Python step by step. We will use the SpeechRecognition library to read audio files, capture microphone input, handle noise, and work with several recognition services, including the Google Web Speech API, IBM Speech to Text, and others.

What You'll Learn:

How to install SpeechRecognition and PyAudio libraries
Reading and transcribing audio files (WAV, AIFF, FLAC)
Using the Recognizer class and available API methods
Reading specific segments of audio with offset and duration
Handling noise with adjust_for_ambient_noise()
Working with microphones for real-time speech capture
Handling errors and unknown speech exceptions

What is Python Speech Recognition?

Speech recognition systems have evolved from single-speaker models with limited vocabularies to sophisticated systems that recognize multiple speakers with huge vocabularies across many languages. The process starts with a microphone, which converts physical sound into an electrical signal, and an analogue-to-digital converter, which turns that signal into digital data. One or more models then transcribe the digital audio into text. A common technique is the Hidden Markov Model (HMM), which works on the speech signal divided into fragments of about 10 milliseconds, short enough that the signal within each fragment can be treated as roughly stationary.
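The fragmenting step described above can be sketched with the standard library alone. This is only an illustration of cutting digitized audio into 10-millisecond frames, not how any particular recognizer implements it; the function name `split_into_frames`, the 16 kHz sample rate, and the 440 Hz test tone are assumptions chosen for the example.

```python
import math


def split_into_frames(samples, sample_rate, frame_ms=10):
    """Split a sequence of samples into consecutive fixed-length frames."""
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per frame
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    # Drop the trailing partial frame, if any.
    usable = len(samples) - len(samples) % frame_len
    return [samples[i:i + frame_len] for i in range(0, usable, frame_len)]


# One second of a 440 Hz tone sampled at 16 kHz, standing in for digitized speech.
SAMPLE_RATE = 16000
signal = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]

frames = split_into_frames(signal, SAMPLE_RATE)
print(len(frames), "frames of", len(frames[0]), "samples each")
# prints: 100 frames of 160 samples each
```

At 16,000 samples per second, each 10-millisecond frame holds 160 samples, so one second of audio yields 100 frames.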

Available APIs in Python Speech Recognition

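Several Python packages provide speech recognition. Some, such as Wit and Api.ai, offer more than basic speech recognition. This guide uses SpeechRecognition, which is easier to use and includes a hardcoded default API key for the Google Web Speech API, so you can start without signing up for a service. The library's documentation describes that default key as intended for personal or testing use only.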

API Method	Service	Requirements
recognize_bing()	Microsoft Bing Speech	API key
recognize_google()	Google Web Speech API	None (default key included)
recognize_google_cloud()	Google Cloud Speech	google-cloud-speech package
recognize_houndify()	Houndify by SoundHound	Client ID & Client Key
recognize_ibm()	IBM Speech to Text	Username & Password
recognize_sphinx()	CMU Sphinx	PocketSphinx installation
recognize_wit()	Wit.ai	API key
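Because every service in the table is exposed as a method on the same Recognizer object, the choice of service can be made at run time. The sketch below is one way to do that, assuming the method names listed in the table; `ENGINE_METHODS` and `get_recognize_method` are hypothetical names introduced for this example, not part of the library.

```python
# Short engine name -> Recognizer method name, copied from the table above.
ENGINE_METHODS = {
    "bing": "recognize_bing",
    "google": "recognize_google",
    "google_cloud": "recognize_google_cloud",
    "houndify": "recognize_houndify",
    "ibm": "recognize_ibm",
    "sphinx": "recognize_sphinx",
    "wit": "recognize_wit",
}


def get_recognize_method(recognizer, engine):
    """Return the bound recognize_*() method for the named engine."""
    try:
        method_name = ENGINE_METHODS[engine]
    except KeyError:
        raise ValueError(f"unknown engine: {engine!r}") from None
    # getattr raises AttributeError if the installed version lacks the method.
    return getattr(recognizer, method_name)


# Intended usage (requires the SpeechRecognition package and an AudioData object):
#   import speech_recognition as sr
#   recognize = get_recognize_method(sr.Recognizer(), "google")
#   text = recognize(audio)
```

Most of these methods also need credentials passed as keyword arguments, as listed in the Requirements column.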

Step by Step Guide to Installation

Install SpeechRecognition

Use pip to install the SpeechRecognition library. This beginner guide assumes you have Python 3.6+ installed.

Terminal Command

pip install SpeechRecognition

Verify Installation

Open a Python interpreter and check the installed version.

Python Interpreter

>>> import speech_recognition as sr
>>> sr.__version__
'3.8.1'

Install PyAudio for Microphone Support

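PyAudio is required to work with microphones. On Windows, if a plain pip install fails, download the binary wheel that matches your Python version from Christoph Gohlke's repository. On macOS and Linux, pip builds PyAudio against the PortAudio library, so PortAudio must already be installed.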

Install PyAudio

# On Windows, download the .whl file that matches your Python version
# and architecture from:
# https://www.lfd.uci.edu/~gohlke/pythonlibs/#pyaudio
# Then install it (this file name is for 32-bit Python 3.7):
pip install PyAudio-0.2.11-cp37-cp37m-win32.whl

# On Mac/Linux:
pip install PyAudio
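After both installs, a quick way to confirm what is actually present is to ask the packaging metadata. This is a small sketch using `importlib.metadata`, which is in the standard library from Python 3.8 onward (older interpreters would need a different approach); `installed_version` is a hypothetical helper name.

```python
from importlib import metadata  # standard library in Python 3.8+


def installed_version(distribution):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Report the two packages this guide relies on.
for name in ("SpeechRecognition", "PyAudio"):
    version = installed_version(name)
    print(name, "->", version if version else "not installed")
```

Unlike importing the package, this check does not load any audio drivers, so it is safe to run on a machine without a microphone.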

Supported Audio File Types

The SpeechRecognition library supports the following audio formats:

WAV (PCM/LPCM)

Uncompressed audio format, best for speech recognition accuracy.

AIFF / AIFF-C

Audio Interchange File Format, commonly used on Apple systems.

FLAC

Free Lossless Audio Codec, compressed without quality loss.
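A program that accepts arbitrary files can screen them against this list before handing them to the library. The sketch below checks only the file extension, which is a hint rather than a guarantee (a .wav container can hold non-PCM audio); `SUPPORTED_EXTENSIONS` and `is_supported_audio` are hypothetical names, and the extension set is an assumption based on the formats listed above.

```python
from pathlib import Path

# Common extensions for the formats listed above (WAV, AIFF/AIFF-C, FLAC).
SUPPORTED_EXTENSIONS = {".wav", ".aiff", ".aif", ".aifc", ".flac"}


def is_supported_audio(path):
    """Return True if the file extension matches a supported format."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


for candidate in ("demo.wav", "interview.FLAC", "podcast.mp3"):
    print(candidate, "->", is_supported_audio(candidate))
# prints:
# demo.wav -> True
# interview.FLAC -> True
# podcast.mp3 -> False
```

Files in other formats, such as MP3, would need converting to one of the supported formats first.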

Reading an Audio File in Python

The Recognizer Class

The Recognizer class is the core component for speech recognition. Create an instance and use its methods to interact with different APIs.

Create Recognizer Instance

import speech_recognition as sr

# Create a Recognizer instance
r = sr.Recognizer()

# List available methods for different APIs
# r.recognize_google() - Google Web Speech API
# r.recognize_google_cloud() - Google Cloud Speech
# r.recognize_bing() - Microsoft Bing Speech
# r.recognize_ibm() - IBM Speech to Text
# r.recognize_houndify() - Houndify
# r.recognize_wit() - Wit.ai
# r.recognize_sphinx() - CMU Sphinx (offline)

Capturing Data with record()

Use the AudioFile context manager to open an audio file and the record() method to capture its contents into an AudioData instance.

Read Audio File

import speech_recognition as sr

r = sr.Recognizer()

# Open audio file
demo = sr.AudioFile('demo.wav')

# Read the audio file
with demo as source:
    audio = r.record(source)

# Check the type
print(type(audio))
# Output (with SpeechRecognition 3.8.1):
# <class 'speech_recognition.AudioData'>
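The examples in this guide assume a file named demo.wav containing recorded speech. If you have no recording at hand and only want to check that file reading works, you can generate a small PCM WAV file with the standard library. The sketch below writes a one-second test tone; it contains no speech, so a recognizer will not produce a meaningful transcript from it. The helper name `write_tone_wav` and the file name tone_demo.wav are assumptions made for this example.

```python
import math
import os
import struct
import tempfile
import wave


def write_tone_wav(path, seconds=1.0, sample_rate=16000, freq=440.0):
    """Write a mono, 16-bit PCM WAV file containing a sine tone."""
    n_frames = int(seconds * sample_rate)
    frames = bytearray()
    for n in range(n_frames):
        # Scale to 30% of the 16-bit range to keep the tone quiet.
        value = int(0.3 * 32767 * math.sin(2 * math.pi * freq * n / sample_rate))
        frames += struct.pack("<h", value)  # little-endian signed 16-bit sample
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return n_frames


# Write into the temporary directory so no existing recording is overwritten.
tone_path = os.path.join(tempfile.gettempdir(), "tone_demo.wav")
frames_written = write_tone_wav(tone_path)
print("wrote", frames_written, "frames to", tone_path)
```

Passing `tone_path` to `sr.AudioFile()` in place of 'demo.wav' exercises the same reading code path.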

Recognizing Speech in Audio

Call recognize_google() to transcribe the audio into text. You can also specify a different language using the language parameter.

Transcribe Audio

import speech_recognition as sr

r = sr.Recognizer()

with sr.AudioFile('demo.wav') as source:
    audio = r.record(source)

# Transcribe audio to text
text = r.recognize_google(audio)
print(text)

# For different languages (e.g., Romanian)
# text = r.recognize_google(audio, language='ro-RO')
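The read and transcribe steps can be wrapped in one function so that the file name and language become parameters. This is a sketch rather than a recommended interface: `transcribe_file` is a hypothetical name, and the `sr_module` parameter exists only so the function can be exercised with a stand-in object when the real package or a network connection is unavailable. The default of "en-US" mirrors the default language of recognize_google().

```python
def transcribe_file(path, language="en-US", sr_module=None):
    """Read a whole audio file and return the Google Web Speech API transcript."""
    if sr_module is None:
        # Imported lazily; requires: pip install SpeechRecognition
        import speech_recognition as sr_module
    recognizer = sr_module.Recognizer()
    with sr_module.AudioFile(path) as source:
        audio = recognizer.record(source)  # read the entire file
    # Sends the audio to the Google Web Speech API over the network.
    return recognizer.recognize_google(audio, language=language)


# Intended usage:
#   print(transcribe_file("demo.wav"))
#   print(transcribe_file("demo.wav", language="ro-RO"))
```

Exceptions from the recognizer are left to propagate here, so the caller decides how to handle unintelligible audio or a failed request.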

Reading a Segment of Audio

When you only want to read part of an audio file, pass record() the offset parameter (where to start, in seconds from the current position in the file) and the duration parameter (how many seconds to read).

Read Audio Segment

import speech_recognition as sr

r = sr.Recognizer()

with sr.AudioFile('demo.wav') as source:
    # Start at 4 seconds, listen for 3 seconds
    audio = r.record(source, offset=4, duration=3)

text = r.recognize_google(audio)
print(text)

# Try different offset values
with sr.AudioFile('demo.wav') as source:
    audio = r.record(source, offset=3.3, duration=3)

text = r.recognize_google(audio)
print(text)
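Before choosing an offset and duration, it can help to know how long the file is and how much audio a given segment will actually cover. The sketch below uses the standard wave module, so it applies to WAV files only; `wav_duration_seconds` and `clamp_segment` are hypothetical helper names.

```python
import wave


def wav_duration_seconds(path):
    """Return the length of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def clamp_segment(total_seconds, offset, duration):
    """Return (offset, duration) trimmed so the segment fits inside the file."""
    if offset < 0 or duration <= 0:
        raise ValueError("offset must be >= 0 and duration must be > 0")
    if offset >= total_seconds:
        raise ValueError("offset is past the end of the audio")
    return offset, min(duration, total_seconds - offset)


# A 6-second file cannot supply 3 seconds starting at the 4-second mark.
print(clamp_segment(6.0, 4, 3))
# prints: (4, 2.0)
```

For a real file, the first argument would come from `wav_duration_seconds('demo.wav')`.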

Dealing with Noise in Python Speech Recognition

Noise is inevitable in audio recordings. By default, the adjust_for_ambient_noise() method reads the first second of an audio stream and calibrates the recognizer to the noise level it finds there. Use the duration parameter to change how long it listens.

Handle Noise in Audio

import speech_recognition as sr

r = sr.Recognizer()

with sr.AudioFile('demo.wav') as source:
    # Calibrate for ambient noise (listens for 0.5 seconds)
    r.adjust_for_ambient_noise(source, duration=0.5)
    audio = r.record(source, offset=2.5, duration=3)

text = r.recognize_google(audio)
print(text)

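# Note: adjust_for_ambient_noise() consumes the audio it analyzes, so the
# offset passed to record() counts from 0.5 seconds into the file.
# Even a small change in duration (such as 0.005) can shift the segment
# and produce a different transcript.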
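Because the calibration step consumes audio from the same source, the segment that record() returns starts later in the file than its offset alone suggests. The arithmetic can be sketched as follows; `effective_segment` is a hypothetical helper, and the result is approximate because the library reads audio in fixed-size chunks.

```python
def effective_segment(calibration_seconds, offset, duration):
    """Return (start, end) in file time for a record() call that follows
    adjust_for_ambient_noise() on the same source."""
    start = calibration_seconds + offset  # calibration audio is already consumed
    return start, start + duration


# Values from the example above: duration=0.5 for calibration,
# then offset=2.5 and duration=3 for record().
start, end = effective_segment(0.5, 2.5, 3)
print(start, end)
# prints: 3.0 6.0
```

So the example above transcribes roughly seconds 3 to 6 of the file, not seconds 2.5 to 5.5.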

Important Note

adjust_for_ambient_noise() is not a miracle worker. For best results, consider preprocessing audio with software like Audacity to remove noise before using speech recognition.

Working with Microphones

To work with your own voice in real time, you need the PyAudio package and the Microphone class from SpeechRecognition.

Using the Microphone Class

Microphone Setup

import speech_recognition as sr

r = sr.Recognizer()

# List available microphones
print(sr.Microphone.list_microphone_names())

# Use default microphone
mic = sr.Microphone()

# Or select a specific microphone by device index
# (3 is only an example; use an index from the list printed above)
mic = sr.Microphone(device_index=3)
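Device indexes differ from machine to machine, so hard-coding a number is fragile. Since the position of each name in the list returned by list_microphone_names() is its device index, the index can be looked up by name instead. The sketch below works on any list of names; `find_microphone_index` is a hypothetical helper, and the device names are made up for the example.

```python
def find_microphone_index(names, keyword):
    """Return the index of the first device whose name contains keyword, or None."""
    keyword = keyword.lower()
    for index, name in enumerate(names):
        if name and keyword in name.lower():  # skip entries without a name
            return index
    return None


# Made-up device list standing in for sr.Microphone.list_microphone_names().
example_names = ["Built-in Output", "USB Microphone", None, "Webcam Mic"]
print(find_microphone_index(example_names, "usb"))
# prints: 1

# Intended usage with the real library:
#   index = find_microphone_index(sr.Microphone.list_microphone_names(), "usb")
#   mic = sr.Microphone(device_index=index) if index is not None else sr.Microphone()
```

Falling back to the default microphone when no name matches keeps the program usable on machines with different hardware.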

Capturing Microphone Input

Use the listen() method to capture input from the microphone. It waits for speech to begin, then records until it detects silence.

Capture Microphone Input

import speech_recognition as sr

r = sr.Recognizer()

with sr.Microphone() as source:
    print("Please speak something...")
    audio = r.listen(source)

try:
    text = r.recognize_google(audio)
    print("You said:", text)
except sr.UnknownValueError:
    print("Sorry, could not understand the audio")
except sr.RequestError:
    print("Could not request results from the Google Web Speech API")

Handling Unintelligible Speech

When the recognizer cannot match audio to text, it raises an UnknownValueError exception. This can happen with non-speech sounds such as coughing, gagging, hand claps, or tongue clicks.

Error Handling

import speech_recognition as sr

r = sr.Recognizer()

with sr.Microphone() as source:
    r.adjust_for_ambient_noise(source)
    print("Say something!")
    audio = r.listen(source)

try:
    text = r.recognize_google(audio)
    print("You said:", text)
except sr.UnknownValueError:
    print("Could not understand the audio")
except sr.RequestError as e:
    print(f"Could not request results; {e}")
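In an interactive program, an UnknownValueError usually means "ask the user to repeat", while a RequestError means the service itself is unreachable and retrying the same phrase will not help. The sketch below separates those two cases. It is one possible design, not the library's own: `transcribe_with_retries` is a hypothetical helper, and the capture function, recognize function, and exception classes are passed in so the logic can be tried without a microphone or network.

```python
def transcribe_with_retries(capture, recognize, unknown_error, request_error, attempts=3):
    """Capture and transcribe audio, asking again when speech is unintelligible.

    Returns the transcript, or None if every attempt was unintelligible.
    Raises RuntimeError if the recognition service cannot be reached.
    """
    for attempt in range(1, attempts + 1):
        audio = capture()
        try:
            return recognize(audio)
        except unknown_error:
            # Unintelligible audio: worth asking the speaker to try again.
            print(f"Attempt {attempt}: could not understand the audio")
        except request_error as exc:
            # Service or network problem: retrying the same phrase will not help.
            raise RuntimeError(f"speech service unavailable: {exc}") from exc
    return None


# Intended usage with the real library (timeout values are arbitrary examples):
#   import speech_recognition as sr
#   r = sr.Recognizer()
#   with sr.Microphone() as source:
#       r.adjust_for_ambient_noise(source)
#       text = transcribe_with_retries(
#           capture=lambda: r.listen(source, timeout=5, phrase_time_limit=10),
#           recognize=r.recognize_google,
#           unknown_error=sr.UnknownValueError,
#           request_error=sr.RequestError,
#       )
```

With a timeout set, listen() raises sr.WaitTimeoutError if no speech starts in time; this sketch does not catch it, so the caller would need to handle that case as well.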

Applications of Python Speech Recognition

Python speech recognition has numerous real-world applications in AI and automation:

Voice Assistants

Build smart assistants like Siri or Alexa that respond to voice commands.

Phone Bots

Create automated phone systems that understand and respond to customer queries.

Accessibility Tools

Help people who cannot type by converting their speech directly to text.

Emotion Analysis

Analyze tone and speed of speech to understand emotions or stress levels.

Frequently Asked Questions

What is Python Speech Recognition used for?

Python speech recognition is used for voice assistants, phone bots, accessibility tools, transcription services, smart home systems, and emotion analysis in voice recordings.

Which API is best for beginners?

Google Web Speech API (recognize_google) is best for beginners as it comes with a default API key and requires no additional setup or authentication.

How do I handle noise in audio files?

Use the adjust_for_ambient_noise() method to calibrate the recognizer, or preprocess audio with tools like Audacity to remove noise before transcription.

What audio formats are supported?

SpeechRecognition supports WAV (PCM/LPCM), AIFF, AIFF-C, and FLAC formats. WAV is recommended for best speech recognition accuracy.

Can I use speech recognition offline?

Yes, CMU Sphinx (recognize_sphinx) works offline but requires installing PocketSphinx. Most other APIs require an internet connection.

