Speech-to-Text in Python: Build Your Own Voice Assistant with Whisper + Pyttsx3

Introduction

In the age of smart assistants and voice-driven interfaces, speech recognition technology has become more accessible than ever. From virtual assistants like Alexa and Siri to custom voice bots, the ability to convert spoken language into actionable commands is revolutionizing human-computer interaction. This blog explores how to build your own voice assistant in Python by combining OpenAI’s Whisper for speech-to-text and Pyttsx3 for text-to-speech (TTS).
We’ll walk through the architecture, installation, and development of a basic assistant capable of listening to your voice, transcribing it into text, and responding audibly. The best part? You don’t need high-end GPUs or proprietary APIs—Python makes it easy to create this powerful functionality on your local machine.
What You Will Learn
- How to transcribe audio with Whisper
- How to make your assistant talk using Pyttsx3
- How to integrate the components into a real-time loop
- Basic Natural Language Processing (NLP) for intent detection
Tools & Libraries Used
- Whisper by OpenAI: Automatic Speech Recognition (ASR)
- Pyttsx3: Offline TTS engine
- SpeechRecognition: For real-time microphone input
- Python 3.7+
Prerequisites
- Python installed (3.7+ recommended)
- Basic understanding of Python syntax
- Microphone and speaker/headphones
Setting Up the Environment
First, install the required libraries:
pip install openai-whisper pyttsx3 SpeechRecognition pyaudio
Note: on some systems, pyaudio needs additional build tools before it will install. Refer to the PyAudio Installation Guide for help.
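PyAudio wraps the PortAudio C library, so that library usually has to be present first. The exact commands depend on your platform; the package names below are typical examples for common setups, not a universal recipe:

```shell
# Debian/Ubuntu: install the PortAudio development headers first
sudo apt-get install portaudio19-dev

# macOS with Homebrew
brew install portaudio

# then install the Python binding
pip install pyaudio
```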
Step 1: Transcribing Speech with Whisper
Whisper is a deep learning model that can transcribe audio in multiple languages with high accuracy. Here’s how to use it for short audio clips:
import whisper
model = whisper.load_model("base") # You can use 'tiny', 'base', 'small', 'medium', 'large'
result = model.transcribe("audio.wav")
print("Transcribed Text:", result["text"])
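Besides the full transcript under `"text"`, the dictionary returned by `transcribe` also carries per-segment timestamps under `"segments"`. Here is a minimal sketch of formatting them; the `result` dict below is hard-coded sample data standing in for a real Whisper result, so the values are illustrative only:

```python
# Sample data shaped like Whisper's transcribe() output (illustrative values)
result = {
    "text": " Hello there. How can I help?",
    "segments": [
        {"start": 0.0, "end": 1.2, "text": " Hello there."},
        {"start": 1.2, "end": 2.8, "text": " How can I help?"},
    ],
}

def format_segments(result):
    """Return one '[start-end] text' line per transcription segment."""
    lines = []
    for seg in result["segments"]:
        lines.append(f"[{seg['start']:.1f}-{seg['end']:.1f}]{seg['text']}")
    return "\n".join(lines)

print(format_segments(result))
```

With a real model you would pass `model.transcribe("audio.wav")` straight into `format_segments`, which is handy for subtitling or debugging what the model actually heard.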
To record your voice from the microphone and save it:
import speech_recognition as sr
import speech_recognition as sr

def record_audio(filename="input.wav"):
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        print("Listening...")
        audio = recognizer.listen(source)
    with open(filename, "wb") as f:
        f.write(audio.get_wav_data())

record_audio()
Step 2: Making the Assistant Speak with Pyttsx3
Pyttsx3 is a text-to-speech library for Python that works offline and is compatible with both Windows and Unix-like systems.
import pyttsx3
def speak(text):
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()

speak("Hello! How can I help you today?")
You can also control the speaking rate, volume, and which voice is used:

engine.setProperty('rate', 150)      # words per minute
engine.setProperty('volume', 0.9)    # 0.0 to 1.0
voices = engine.getProperty('voices')
engine.setProperty('voice', voices[1].id)  # available voices and their order vary by OS
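Because the voice list differs between operating systems, picking a voice by index is fragile. A more robust pattern is to search the list by name. The sketch below uses a hypothetical list of (name, id) pairs in place of the objects that `engine.getProperty('voices')` actually returns:

```python
def pick_voice_id(voices, preferred_substrings):
    """Return the id of the first voice whose name contains one of the
    preferred substrings (case-insensitive); fall back to the first voice."""
    for want in preferred_substrings:
        for name, voice_id in voices:
            if want.lower() in name.lower():
                return voice_id
    return voices[0][1]

# Hypothetical voice list; real entries come from engine.getProperty('voices')
sample_voices = [
    ("Microsoft David Desktop", "voice_david"),
    ("Microsoft Zira Desktop", "voice_zira"),
]

print(pick_voice_id(sample_voices, ["zira", "female"]))
```

With a real engine you would build the list as `[(v.name, v.id) for v in engine.getProperty('voices')]` and then call `engine.setProperty('voice', chosen_id)`.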
Step 3: Combining Everything into a Voice Assistant
Let’s create a loop where the assistant listens, transcribes, and responds.
import whisper
import pyttsx3
import speech_recognition as sr
from datetime import datetime

model = whisper.load_model("base")
engine = pyttsx3.init()
recognizer = sr.Recognizer()

def speak(text):
    engine.say(text)
    engine.runAndWait()

while True:
    try:
        with sr.Microphone() as source:
            print("Say something...")
            audio = recognizer.listen(source)

        with open("temp.wav", "wb") as f:
            f.write(audio.get_wav_data())

        result = model.transcribe("temp.wav")
        text = result["text"]
        print("You said:", text)

        if "exit" in text.lower():
            speak("Goodbye!")
            break
        elif "your name" in text.lower():
            speak("I am your personal voice assistant built with Python.")
        elif "time" in text.lower():
            now = datetime.now()
            speak(f"The current time is {now.strftime('%H:%M')}")
        else:
            speak("I heard you say: " + text)
    except Exception as e:
        print("Error:", e)
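As the assistant grows, the if/elif chain becomes hard to maintain. One common refactor is a keyword-to-handler dispatch table that returns the reply to speak. This is only a sketch with hypothetical handler names, reproducing the behavior of the chain above:

```python
from datetime import datetime

def handle_exit(text):
    return "Goodbye!"

def handle_name(text):
    return "I am your personal voice assistant built with Python."

def handle_time(text):
    return f"The current time is {datetime.now().strftime('%H:%M')}"

# Keywords are checked in order; the first match wins
HANDLERS = [
    ("exit", handle_exit),
    ("your name", handle_name),
    ("time", handle_time),
]

def respond(text):
    """Pick a reply for the transcribed text; no keyword match -> echo."""
    lowered = text.lower()
    for keyword, handler in HANDLERS:
        if keyword in lowered:
            return handler(lowered)
    return "I heard you say: " + text

print(respond("What is your name?"))
```

In the main loop you would then call `speak(respond(text))`, and break out of the loop when the reply is "Goodbye!". Adding a new command becomes a one-line change to HANDLERS.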
Step 4: Add Natural Language Understanding (Optional)
To make your assistant smarter, you can use NLP libraries like spaCy or NLTK to detect intent.
pip install spacy
python -m spacy download en_core_web_sm
Example intent extraction:
import spacy

nlp = spacy.load("en_core_web_sm")

def get_intent(text):
    doc = nlp(text.lower())
    if any(token.lemma_ == "time" for token in doc):
        return "get_time"
    elif "name" in text.lower():
        return "get_name"
    else:
        return "default"
Use Cases of Python Voice Assistants
- Voice-controlled smart home systems
- Accessibility tools for differently-abled users
- Virtual teaching or storytelling bots
- Personalized AI companions
Useful Resources
- Whisper GitHub: https://github.com/openai/whisper
- Pyttsx3 Docs: https://pyttsx3.readthedocs.io
- SpeechRecognition Library: https://pypi.org/project/SpeechRecognition/
Conclusion
Creating a voice assistant using Whisper and Pyttsx3 is a powerful demonstration of how Python continues to make AI and voice-based technologies more accessible. Whether you’re a hobbyist, student, or professional developer, you can integrate these tools to build custom assistants tailored to your needs.
As AI models become more efficient and speech interfaces more commonplace, mastering these tools will allow you to innovate in fields like IoT, education, health tech, and beyond.
Ready to build your own voice assistant? Start small, experiment with new features, and make your assistant smarter over time!
Find more Python content at: https://allinsightlab.com/category/software-development