Voice mode

Headmaster supports real-time voice interaction — talk to the agent with your voice and hear responses spoken aloud.

Under the hood

Voice mode uses:
  • Speech-to-text (STT): Your spoken audio is transcribed to text and sent to the agent as a message. Headmaster uses a local faster-whisper model by default, or OpenAI Whisper for higher accuracy.
  • Text-to-speech (TTS): The agent’s text response is converted to speech audio and played back. Headmaster supports OpenAI TTS, xAI, MiniMax, ElevenLabs, and Edge as TTS providers.
The cycle is: you speak → STT transcribes → agent processes → TTS speaks the response → you speak again.
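One turn of the cycle above can be sketched as three pluggable steps. This is only an illustration of the data flow, not Headmaster's implementation: `run_voice_turn` and the three `fake_*` stand-ins are hypothetical names, and real STT and TTS steps would call the configured backend instead.

```python
from typing import Callable


def run_voice_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    ask_agent: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run one turn: audio in -> transcript -> reply text -> audio out."""
    text = transcribe(audio)    # STT step
    reply = ask_agent(text)     # agent step
    return synthesize(reply)    # TTS step


# Hypothetical stand-ins so the sketch runs without any provider.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_agent(text: str) -> str:
    return f"You said: {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


spoken = run_voice_turn(b"hello", fake_transcribe, fake_agent, fake_synthesize)
print(spoken)  # b'You said: hello'
```

Passing the steps in as callables mirrors the idea that the STT backend and TTS provider can be swapped without changing the loop itself.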

Enabling voice mode

  1. Open Settings → My Headmaster → Look → Voice.
  2. Turn on Enable voice mode.
  3. Choose a TTS provider and voice.
  4. Choose an STT backend (local faster-whisper or OpenAI Whisper).
  5. Save.

TTS providers

Provider   | Voices                                  | Notes
OpenAI     | Alloy, Echo, Fable, Onyx, Nova, Shimmer | Natural, high quality. Requires OpenAI key.
xAI        | Various                                 | Requires xAI key.
MiniMax    | Various                                 | Requires MiniMax key.
ElevenLabs | 5k-40k voice options                    | Premium quality. Requires ElevenLabs key.
Edge       | Built-in voices                         | Free, no API key needed. Lower quality.
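The key requirements in the table can be expressed as a small lookup. The sketch below is illustrative only: the provider identifiers and the `usable_providers` helper are assumptions, not names from Headmaster's settings or API.

```python
from typing import List, Set

# Whether each provider needs an API key, copied from the table above.
# The lowercase identifiers are hypothetical.
REQUIRES_KEY = {
    "openai": True,
    "xai": True,
    "minimax": True,
    "elevenlabs": True,
    "edge": False,
}


def usable_providers(configured_keys: Set[str]) -> List[str]:
    """Return the providers that can be selected given which keys are configured."""
    return [
        name
        for name, needs_key in REQUIRES_KEY.items()
        if not needs_key or name in configured_keys
    ]


print(usable_providers(set()))       # ['edge']
print(usable_providers({"openai"}))  # ['openai', 'edge']
```

With no keys configured, only Edge remains available, which matches its "no API key needed" note.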

STT backends

Backend              | Quality | Notes
Local faster-whisper | Good    | Free, runs locally, no API key. Default.
OpenAI Whisper       | High    | Requires OpenAI key. Better accuracy for accents and noise.

Using voice mode

In the desktop app

Click the microphone icon in the chat composer. The icon turns red to indicate recording. Speak your message, then click the icon again (or press Esc) to stop recording. The agent transcribes your speech, processes it, and speaks the response.

Push to talk

Enable Push to talk in voice settings. Hold the microphone button (or a keyboard shortcut) to talk, release to send. The agent responds with speech automatically.

Continuous conversation

In continuous mode, the agent listens for your speech, responds, then automatically listens again. You don’t need to click the microphone each time — just talk. Enable in Settings → Voice → Continuous mode.

Interrupting

While the agent is speaking, click the stop button or press Esc to interrupt. The agent stops speaking and the partial audio is discarded. You can then speak a new message.
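The interrupt behaviour described above (stop playback, discard what has not been played) can be sketched with a `threading.Event` that is checked before each audio chunk. This is an assumed design for illustration; `play_until_interrupted` is a hypothetical helper, and appending to a list stands in for writing to an audio device.

```python
import threading
from typing import Iterable, List


def play_until_interrupted(chunks: Iterable[bytes], stop: threading.Event) -> List[bytes]:
    """Play chunks in order; once `stop` is set, drop everything not yet played."""
    played: List[bytes] = []
    for chunk in chunks:
        if stop.is_set():
            break  # remaining audio is discarded
        played.append(chunk)  # stand-in for sending the chunk to the speakers
    return played


stop = threading.Event()


def chunks_with_interrupt():
    yield b"one"
    yield b"two"
    stop.set()  # simulates the user pressing Esc mid-response
    yield b"three"


print(play_until_interrupted(chunks_with_interrupt(), stop))  # [b'one', b'two']
```

Checking the flag between chunks keeps the interruption latency bounded by the length of one chunk.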

Voice on messaging platforms

On Telegram and Discord, voice messages you send are transcribed and processed. The agent’s response is sent as text, or as a voice message if TTS is enabled for that platform. For example, when you send a voice message to your Headmaster bot on Telegram, the agent transcribes it, processes it, and replies with text or, when TTS is enabled, with a voice message.

Voice settings

Setting         | What it controls
TTS provider    | Which service generates speech
TTS voice       | Which voice to use
TTS speed       | How fast the agent speaks (0.5x to 2x)
STT backend     | Which service transcribes your speech
Auto-listen     | Start listening automatically after the agent responds
Push to talk    | Hold to talk, release to send
Continuous mode | Agent listens → responds → listens again
Voice volume    | Output volume for TTS audio
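As a rough model of these settings, the sketch below groups them in a dataclass and clamps TTS speed to the documented 0.5x to 2x range. The field names, the default values other than the faster-whisper STT default, and the 0.0 to 1.0 volume scale are assumptions made for illustration; they are not taken from Headmaster's configuration format.

```python
from dataclasses import dataclass

MIN_SPEED, MAX_SPEED = 0.5, 2.0  # range stated in the settings table


@dataclass
class VoiceSettings:
    tts_provider: str = "edge"           # assumed default
    tts_voice: str = "default"           # assumed default
    tts_speed: float = 1.0               # assumed default
    stt_backend: str = "faster-whisper"  # documented default backend
    auto_listen: bool = False
    push_to_talk: bool = False
    continuous_mode: bool = False
    voice_volume: float = 1.0            # assumed 0.0-1.0 scale

    def __post_init__(self) -> None:
        # Clamp instead of rejecting, so out-of-range input still yields usable settings.
        self.tts_speed = min(MAX_SPEED, max(MIN_SPEED, self.tts_speed))
        self.voice_volume = min(1.0, max(0.0, self.voice_volume))


print(VoiceSettings(tts_speed=3.0).tts_speed)  # 2.0
print(VoiceSettings(tts_speed=0.1).tts_speed)  # 0.5
```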

Speech input button

The microphone button in the chat composer shows the current voice state:
  • Gray mic — voice mode is off. Click to start recording.
  • Red mic — recording in progress. Click to stop and send.
  • Blue mic — processing. The agent is transcribing or generating speech.
  • Green mic — speaking. The agent is speaking the response.
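The four button states can be read as a small state machine. The sketch below models one click-to-talk turn; the `MicState` enum and the transition order are assumptions based on the descriptions above, not Headmaster's actual UI code.

```python
from enum import Enum


class MicState(Enum):
    OFF = "gray"
    RECORDING = "red"
    PROCESSING = "blue"
    SPEAKING = "green"


# Assumed transitions for a single turn: record, process, speak, return to off.
NEXT_STATE = {
    MicState.OFF: MicState.RECORDING,
    MicState.RECORDING: MicState.PROCESSING,
    MicState.PROCESSING: MicState.SPEAKING,
    MicState.SPEAKING: MicState.OFF,
}


def advance(state: MicState) -> MicState:
    """Move to the next state in the assumed single-turn sequence."""
    return NEXT_STATE[state]


state = MicState.OFF
for _ in range(4):
    state = advance(state)
    print(state.name, state.value)
# RECORDING red
# PROCESSING blue
# SPEAKING green
# OFF gray
```

In continuous mode the last transition would presumably lead back to recording rather than off; that variant is left out to keep the sketch minimal.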

Tips for better voice interaction

  • Speak clearly — the STT model works best with clear, moderate-paced speech.
  • Use a quiet environment — background noise reduces transcription accuracy.
  • Try different voices — some TTS voices sound more natural for your use case. Try them all.
  • Use push to talk — prevents the agent from picking up background conversation as input.
  • Adjust speed — if the agent speaks too fast or slow, adjust the TTS speed setting.