Speech-to-Text in Python: Build Your Own Voice Assistant with Whisper + Pyttsx3

Introduction

In the age of smart assistants and voice-driven interfaces, speech recognition technology has become more accessible than ever. From virtual assistants like Alexa and Siri to custom voice bots, the ability to convert spoken language into actionable commands is revolutionizing human-computer interaction. This blog explores how to build your own voice assistant in Python by combining OpenAI’s Whisper for speech-to-text and Pyttsx3 for text-to-speech (TTS).
We’ll walk through the architecture, installation, and development of a basic assistant capable of listening to your voice, transcribing it into text, and responding audibly. The best part? You don’t need high-end GPUs or proprietary APIs—Python makes it easy to create this powerful functionality on your local machine.
What You Will Learn
- How to transcribe audio with Whisper
- How to make your assistant talk using Pyttsx3
- How to integrate the components into a real-time loop
- Basic Natural Language Processing (NLP) for intent detection
Tools & Libraries Used
- Whisper by OpenAI: Automatic Speech Recognition (ASR)
- Pyttsx3: Offline TTS engine
- SpeechRecognition: For real-time microphone input
- Python 3.7+
Prerequisites
- Python installed (3.7+ recommended)
- Basic understanding of Python syntax
- Microphone and speaker/headphones
Setting Up the Environment
First, install the required libraries:
pip install openai-whisper pyttsx3 SpeechRecognition pyaudio
Note: on some systems, pyaudio needs additional build tools before it will install. Refer to the PyAudio Installation Guide for help.
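PyAudio wraps the PortAudio C library, so that library usually has to be present first. The exact commands depend on your platform; the package names below are typical examples for common setups, not a universal recipe:

```shell
# Debian/Ubuntu: install the PortAudio development headers first
sudo apt-get install portaudio19-dev

# macOS with Homebrew
brew install portaudio

# then install the Python binding
pip install pyaudio
```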
Step 1: Transcribing Speech with Whisper
Whisper is a deep learning model that can transcribe audio in multiple languages with high accuracy. Here’s how to use it for short audio clips:
import whisper
model = whisper.load_model("base") # You can use 'tiny', 'base', 'small', 'medium', 'large'
result = model.transcribe("audio.wav")
print("Transcribed Text:", result["text"])
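Besides the full transcript under `"text"`, the dictionary returned by `transcribe` also carries per-segment timestamps under `"segments"`. Here is a minimal sketch of formatting them; the `result` dict below is hard-coded sample data standing in for a real Whisper result, so the values are illustrative only:

```python
# Sample data shaped like Whisper's transcribe() output (illustrative values)
result = {
    "text": " Hello there. How can I help?",
    "segments": [
        {"start": 0.0, "end": 1.2, "text": " Hello there."},
        {"start": 1.2, "end": 2.8, "text": " How can I help?"},
    ],
}

def format_segments(result):
    """Return one '[start-end] text' line per transcription segment."""
    lines = []
    for seg in result["segments"]:
        lines.append(f"[{seg['start']:.1f}-{seg['end']:.1f}]{seg['text']}")
    return "\n".join(lines)

print(format_segments(result))
```

With a real model you would pass `model.transcribe("audio.wav")` straight into `format_segments`, which is handy for subtitling or debugging what the model actually heard.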
To record your voice from the microphone and save it:
import speech_recognition as sr
import speech_recognition as sr

def record_audio(filename="input.wav"):
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        print("Listening...")
        audio = recognizer.listen(source)
    with open(filename, "wb") as f:
        f.write(audio.get_wav_data())

record_audio()
Step 2: Making the Assistant Speak with Pyttsx3
Pyttsx3 is a text-to-speech library for Python that works offline and is compatible with both Windows and Unix-like systems.
import pyttsx3
def speak(text):
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()

speak("Hello! How can I help you today?")
You can also control the speaking rate, volume, and which voice is used:

engine.setProperty('rate', 150)      # words per minute
engine.setProperty('volume', 0.9)    # 0.0 to 1.0
voices = engine.getProperty('voices')
engine.setProperty('voice', voices[1].id)  # available voices and their order vary by OS
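Because the voice list differs between operating systems, picking a voice by index is fragile. A more robust pattern is to search the list by name. The sketch below uses a hypothetical list of (name, id) pairs in place of the objects that `engine.getProperty('voices')` actually returns:

```python
def pick_voice_id(voices, preferred_substrings):
    """Return the id of the first voice whose name contains one of the
    preferred substrings (case-insensitive); fall back to the first voice."""
    for want in preferred_substrings:
        for name, voice_id in voices:
            if want.lower() in name.lower():
                return voice_id
    return voices[0][1]

# Hypothetical voice list; real entries come from engine.getProperty('voices')
sample_voices = [
    ("Microsoft David Desktop", "voice_david"),
    ("Microsoft Zira Desktop", "voice_zira"),
]

print(pick_voice_id(sample_voices, ["zira", "female"]))
```

With a real engine you would build the list as `[(v.name, v.id) for v in engine.getProperty('voices')]` and then call `engine.setProperty('voice', chosen_id)`.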
Step 3: Combining Everything into a Voice Assistant
Let’s create a loop where the assistant listens, transcribes, and responds.
import whisper
import pyttsx3
import speech_recognition as sr
from datetime import datetime

model = whisper.load_model("base")
engine = pyttsx3.init()
recognizer = sr.Recognizer()

def speak(text):
    engine.say(text)
    engine.runAndWait()

while True:
    try:
        with sr.Microphone() as source:
            print("Say something...")
            audio = recognizer.listen(source)

        with open("temp.wav", "wb") as f:
            f.write(audio.get_wav_data())

        result = model.transcribe("temp.wav")
        text = result["text"]
        print("You said:", text)

        if "exit" in text.lower():
            speak("Goodbye!")
            break
        elif "your name" in text.lower():
            speak("I am your personal voice assistant built with Python.")
        elif "time" in text.lower():
            now = datetime.now()
            speak(f"The current time is {now.strftime('%H:%M')}")
        else:
            speak("I heard you say: " + text)
    except Exception as e:
        print("Error:", e)
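As the assistant grows, the if/elif chain becomes hard to maintain. One common refactor is a keyword-to-handler dispatch table that returns the reply to speak. This is only a sketch with hypothetical handler names, reproducing the behavior of the chain above:

```python
from datetime import datetime

def handle_exit(text):
    return "Goodbye!"

def handle_name(text):
    return "I am your personal voice assistant built with Python."

def handle_time(text):
    return f"The current time is {datetime.now().strftime('%H:%M')}"

# Keywords are checked in order; the first match wins
HANDLERS = [
    ("exit", handle_exit),
    ("your name", handle_name),
    ("time", handle_time),
]

def respond(text):
    """Pick a reply for the transcribed text; no keyword match -> echo."""
    lowered = text.lower()
    for keyword, handler in HANDLERS:
        if keyword in lowered:
            return handler(lowered)
    return "I heard you say: " + text

print(respond("What is your name?"))
```

In the main loop you would then call `speak(respond(text))`, and break out of the loop when the reply is "Goodbye!". Adding a new command becomes a one-line change to HANDLERS.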
Step 4: Add Natural Language Understanding (Optional)
To make your assistant smarter, you can use NLP libraries like spaCy or NLTK to detect intent.
pip install spacy
python -m spacy download en_core_web_sm
Example intent extraction:
import spacy

nlp = spacy.load("en_core_web_sm")

def get_intent(text):
    doc = nlp(text.lower())
    if any(token.lemma_ == "time" for token in doc):
        return "get_time"
    elif "name" in text.lower():
        return "get_name"
    else:
        return "default"
Use Cases of Python Voice Assistants
- Voice-controlled smart home systems
- Accessibility tools for differently-abled users
- Virtual teaching or storytelling bots
- Personalized AI companions
Useful Resources
- Whisper GitHub: https://github.com/openai/whisper
- Pyttsx3 Docs: https://pyttsx3.readthedocs.io
- SpeechRecognition Library: https://pypi.org/project/SpeechRecognition/
Conclusion
Creating a voice assistant using Whisper and Pyttsx3 is a powerful demonstration of how Python continues to make AI and voice-based technologies more accessible. Whether you’re a hobbyist, student, or professional developer, you can integrate these tools to build custom assistants tailored to your needs.
As AI models become more efficient and speech interfaces more commonplace, mastering these tools will allow you to innovate in fields like IoT, education, health tech, and beyond.
Ready to build your own voice assistant? Start small, experiment with new features, and make your assistant smarter over time!
Find more Python content at: https://allinsightlab.com/category/software-development