How to Use the OpenAI GPT-Realtime-2 API: Step-by-Step Guide
By Braincuber Team
Published on May 13, 2026
OpenAI recently launched GPT-Realtime-2, a speech-to-speech reasoning model that simplifies building real-time voice applications. Alongside it, OpenAI released two specialized models: GPT-Realtime-Whisper for transcription and GPT-Realtime-Translate for live translation. This step-by-step guide walks you through how to connect to each API using Python WebSockets, send audio, and handle responses. By the end, you will have tested all three models with working code and will understand when to use each one.
What You'll Learn:
- Choose between WebRTC and WebSocket for the Realtime API
- Set up Python environment with WebSocket dependencies
- Authenticate and connect to Realtime API WebSocket endpoints
- Build real-time audio transcription with GPT-Realtime-Whisper
- Implement live speech translation with GPT-Realtime-Translate
- Create a full voice assistant with GPT-Realtime-2 including turn-taking and barge-in
- Understand cost, latency, API limits, and common troubleshooting
What is the GPT-Realtime-2 API?
GPT-Realtime-2 is OpenAI's speech-to-speech reasoning model. It replaces the separate STT-to-LLM-to-TTS pipeline with a single model that understands audio directly, reasons about it, and responds with natural speech. It supports a 128K context window and controllable reasoning effort, making it suitable for long voice sessions.
The Realtime API actually covers three distinct models, each designed for a specific job:
| Model | Purpose | Billing |
|---|---|---|
| gpt-realtime-2 | Full voice assistant with reasoning, tool calls, and context | Per token |
| gpt-realtime-translate | Live speech translation across 70+ languages | Per minute |
| gpt-realtime-whisper | Real-time audio transcription (text only) | Per minute |
This beginner guide covers the complete setup process. Before you begin, make sure you have:
OpenAI Account
OpenAI API key with access to GPT-Realtime-2 models. The Free tier is not supported for Realtime API.
Python 3.9+
Python 3.9 or newer with pip package manager and a working microphone for audio capture.
Important Note
Each model has its own endpoint and billing model. Using GPT-Realtime-2 for simple transcription is overkill and costs more. Always pick the smallest model that matches your task.
Step 1: Choose Between WebRTC and WebSocket
OpenAI provides two transport paths for the Realtime API. The right choice depends entirely on your client environment:
- WebRTC — Use for browser and mobile clients. WebRTC handles jitter buffering, audio transport, and media stream management natively. For browser builds, use ephemeral client secrets so your API key never reaches the frontend.
- WebSocket — Use for server-side applications. WebSockets make sense when your backend already receives raw audio from a telephony provider or media pipeline. Working at this level also exposes the raw event names, which makes the differences between the three models visible in the code.
For this complete tutorial, all three tests use WebSockets since we are connecting from a Python server environment.
Step 2: Set Up Python Environment and Dependencies
The full code for all scripts is available on GitHub. Clone the repository first, then install the required dependencies:
git clone https://github.com/KhalidAbdelaty/gpt-realtime-api.git
cd gpt-realtime-api
pip install websocket-client sounddevice numpy python-dotenv
The websocket-client library handles the WebSocket connection. sounddevice captures microphone audio, numpy converts the audio buffer, and python-dotenv loads the API key from a .env file.
Platform-Specific Audio Setup
On macOS, you may need brew install portaudio before sounddevice works. On Linux, install portaudio19-dev. Windows typically works out of the box.
Step 3: Configure Authentication and WebSocket URL
Create a .env file at the root of your project and add your OpenAI API key:
OPENAI_API_KEY=sk-...
Server-side connections use an Authorization: Bearer header on the WebSocket handshake. Each model uses a different WebSocket URL:
| Model | WebSocket URL |
|---|---|
| gpt-realtime-2 | wss://api.openai.com/v1/realtime?model=gpt-realtime-2 |
| gpt-realtime-translate | wss://api.openai.com/v1/realtime/translations?model=gpt-realtime-translate |
| gpt-realtime-whisper | wss://api.openai.com/v1/realtime?intent=transcription |
Load the API key in each script using python-dotenv:
import os
from dotenv import load_dotenv
load_dotenv()
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
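With the key loaded, the handshake only needs the Bearer header. A minimal connection sketch, assuming the websocket-client library installed in Step 2 and the transcription URL from the table above:
import json
import websocket  # provided by the websocket-client package

url = "wss://api.openai.com/v1/realtime?intent=transcription"

# Open the WebSocket with the API key on the handshake headers
ws = websocket.create_connection(
    url,
    header=[f"Authorization: Bearer {OPENAI_API_KEY}"]
)

# The server's first event confirms the session (typically session.created)
print(json.loads(ws.recv()).get("type"))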
Step 4: Test Real-Time Audio Transcription with GPT-Realtime-Whisper
Transcription is the simplest use case. If your output is only text, the transcription model is all you need. GPT-Realtime-Whisper takes audio in and emits transcript deltas. It does not reason, call tools, or speak back, which makes it more cost-effective than using the full GPT-Realtime-2 model.
The key configuration field is session.type: "transcription". This tells the API to skip assistant responses and emit transcript events only. Use 24 kHz PCM16 mono audio, encoded as base64:
session_config = {
    "type": "session.update",
    "session": {
        "type": "transcription",
        "audio": {
            "input": {
                "format": {"type": "audio/pcm", "rate": 24000},
                "transcription": {
                    "model": "gpt-realtime-whisper",
                    "language": "en"
                },
                "turn_detection": None
            }
        }
    }
}
ws.send(json.dumps(session_config))

ws.send(json.dumps({
    "type": "input_audio_buffer.append",
    "audio": audio_b64
}))
ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
Unlike the voice agent session, the transcription script commits the input buffer manually on a timer instead of using server_vad turn detection. Empty commits raise input_audio_buffer_commit_empty errors, so the script only commits after real audio has been sent. Transcript deltas arrive word by word in real time, typically within 3 to 4 seconds after speech starts.
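The commit loop itself is straightforward. A rough sketch, assuming microphone capture through sounddevice and the ws connection from Step 3 (audio_queue and COMMIT_EVERY are illustrative names, not API fields):
import base64, json, queue, time
import sounddevice as sd

audio_queue = queue.Queue()
COMMIT_EVERY = 2.0  # seconds between manual commits

def mic_callback(indata, frames, time_info, status):
    # indata is int16 PCM from the microphone; hand it to the sender loop
    audio_queue.put(indata.tobytes())

stream = sd.InputStream(samplerate=24000, channels=1, dtype="int16",
                        callback=mic_callback)
stream.start()

last_commit = time.time()
audio_sent = False
while True:
    chunk = audio_queue.get()
    ws.send(json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(chunk).decode("ascii")
    }))
    audio_sent = True
    # Commit only after real audio has been appended, to avoid
    # input_audio_buffer_commit_empty errors
    if audio_sent and time.time() - last_commit >= COMMIT_EVERY:
        ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        last_commit = time.time()
        audio_sent = False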
Event Ordering Note
Completion events from overlapping turns can arrive out of order. Reconcile by item_id if you build a UI around the stream. The transcription model is billed by audio duration, not by token.
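If you do build a UI around the stream, a small per-item buffer is enough to reconcile out-of-order completions. A sketch, assuming delta and completed events that mirror the conversation.item.input_audio_transcription.* names used later in this guide (the exact field names are an assumption here):
# Accumulate transcript text per item_id so out-of-order completion
# events do not scramble the displayed text
transcripts = {}

def handle_transcript_event(event):
    etype = event.get("type", "")
    item_id = event.get("item_id")
    if etype.endswith("input_audio_transcription.delta"):
        transcripts[item_id] = transcripts.get(item_id, "") + event.get("delta", "")
    elif etype.endswith("input_audio_transcription.completed"):
        # Prefer the final transcript carried by the completion event
        transcripts[item_id] = event.get("transcript", transcripts.get(item_id, ""))
        print(item_id, transcripts[item_id])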
Step 5: Test Real-Time Translation with GPT-Realtime-Translate
Live speech translation looks similar to transcription but uses a separate endpoint that removes the assistant response loop entirely: there is no assistant turn and no response.create. The model works as a live interpreter, not a conversational agent.
The model supports more than 70 input languages and 13 output languages. You set the target language with session.audio.output.language. Source language detection is automatic. Key limitations: no custom prompting, no voice selection, and no domain glossaries.
url = "wss://api.openai.com/v1/realtime/translations?model=gpt-realtime-translate"

session_config = {
    "type": "session.update",
    "session": {
        "audio": {
            "output": {"language": "es"},
            "input": {
                "transcription": {"model": "gpt-realtime-whisper"},
                "noise_reduction": {"type": "near_field"}
            }
        }
    }
}
ws.send(json.dumps(session_config))

ws.send(json.dumps({
    "type": "session.input_audio_buffer.append",
    "audio": audio_b64
}))
Translated audio arrives on session.output_audio.delta events, and the audio bytes are in event["delta"], not event["audio"]. The source and translated transcripts arrive separately:
if event_type == "session.output_audio.delta":
    audio_out_queue.put(base64.b64decode(event["delta"]))
elif event_type == "session.input_transcript.delta":
    print("[EN]", event.get("delta", ""), end="")
elif event_type == "session.output_transcript.delta":
    print("[ES]", event.get("delta", ""), end="")
For short English-to-Spanish phrases, translated audio begins before the source utterance finishes. More distant language pairs can wait longer for context. One edge case: if the source audio is already in the target language, the model may produce silence rather than pass it through.
Step 6: Build a Voice Assistant with GPT-Realtime-2
This is the main GPT-Realtime-2 test: a voice agent that listens, speaks, keeps context, and can call tools. It is a speech-to-speech reasoning model with a 128K context window. The session configuration uses semantic_vad turn detection, which looks at speech cues rather than silence alone:
session_config = {
    "type": "session.update",
    "session": {
        "type": "realtime",
        "model": "gpt-realtime-2",
        "output_modalities": ["audio"],
        "audio": {
            "input": {
                "format": {"type": "audio/pcm", "rate": 24000},
                "transcription": {"model": "gpt-realtime-whisper", "language": "en"},
                "turn_detection": {
                    "type": "semantic_vad",
                    "eagerness": "medium",
                    "create_response": False,
                    "interrupt_response": True
                }
            },
            "output": {
                "format": {"type": "audio/pcm", "rate": 24000},
                "voice": "marin"
            }
        },
        "instructions": "You are a helpful voice assistant. Keep answers short.",
        "reasoning": {"effort": "low"}
    }
}
When the user's transcript completes, the client creates the assistant response. Assistant audio arrives on response.output_audio.delta events:
if event_type == "conversation.item.input_audio_transcription.completed":
    ws.send(json.dumps({"type": "response.create"}))
elif event_type == "response.output_audio.delta":
    audio_out_queue.put(base64.b64decode(event["delta"]))
Handling Barge-In and Interruptions
The interruption sequence is critical for a natural voice experience. When the user speaks over the assistant, the server sends input_audio_buffer.speech_started. The client must stop playback, record how much audio it played, and send conversation.item.truncate with audio_end_ms to tell the server where it was cut off:
if current_response_item_id and playback_position_ms > 0:
    ws.send(json.dumps({
        "type": "conversation.item.truncate",
        "item_id": current_response_item_id,
        "content_index": 0,
        "audio_end_ms": playback_position_ms
    }))
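Tracking playback_position_ms is simple with 24 kHz PCM16 mono audio: every millisecond of playback is exactly 48 bytes, so counting bytes as you play them gives the truncation point. A minimal sketch (on_chunk_played is a hypothetical hook in your playback loop):
# 24,000 samples/sec * 2 bytes/sample (PCM16 mono) = 48 bytes per millisecond
BYTES_PER_MS = 24000 * 2 / 1000

played_bytes = 0
playback_position_ms = 0

def on_chunk_played(chunk: bytes):
    global played_bytes, playback_position_ms
    played_bytes += len(chunk)
    playback_position_ms = int(played_bytes / BYTES_PER_MS)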
Laptop Speaker Feedback Loop
With laptop speakers, the microphone can pick up the assistant's audio output and send it back to the model. Use MUTE_MIC_DURING_ASSISTANT = True to silence the input stream while the assistant speaks. Set it to False only if you are using headphones and want interruption support.
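A sketch of how that flag can gate the microphone callback; assistant_speaking is an illustrative flag your event handler would set while response.output_audio.delta events are playing and clear when the response finishes:
MUTE_MIC_DURING_ASSISTANT = True
assistant_speaking = False  # toggled by the event handler during playback

def mic_callback(indata, frames, time_info, status):
    if MUTE_MIC_DURING_ASSISTANT and assistant_speaking:
        return  # drop mic audio so the model never hears its own voice
    audio_queue.put(indata.tobytes())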
Cost and Latency Overview
Pricing splits into two groups: GPT-Realtime-2 is billed by token, while translation and transcription are billed by audio minute:
| Model | Billing Model | ~30 Min Cost |
|---|---|---|
| gpt-realtime-whisper | Per minute | ~$0.51 |
| gpt-realtime-translate | Per minute | ~$1.02 |
| gpt-realtime-2 | Per token | Varies with usage |
Voice agents are harder to estimate because audio tokens accumulate on both sides of the conversation. Session length, speaking ratio, reasoning effort, and context size all matter. Prompt caching can reduce cost when earlier conversation turns stay stable.
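For the per-minute models the math is simple. The rates below are back-calculated from the approximate 30-minute figures in the table above, so treat them as rough estimates rather than official pricing:
# Approximate per-minute rates implied by the ~30-minute costs above
RATE_PER_MINUTE = {
    "gpt-realtime-whisper": 0.51 / 30,    # ~$0.017 per minute
    "gpt-realtime-translate": 1.02 / 30,  # ~$0.034 per minute
}

def estimate_cost(model: str, minutes: float) -> float:
    return round(RATE_PER_MINUTE[model] * minutes, 2)

print(estimate_cost("gpt-realtime-translate", 45))  # ~1.53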
API Limits, Audio Requirements, and Troubleshooting
Audio Format Requirements
WebSocket audio must be PCM16 at 24 kHz, mono, base64-encoded. Each input_audio_buffer.append event is capped at 15 MB (50-millisecond chunks are well below the limit). G.711 is also supported for telephony.
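If your capture or DSP pipeline produces float samples, converting to the required wire format takes a couple of lines of numpy. A sketch assuming float32 input in the -1.0 to 1.0 range:
import base64
import numpy as np

def float_to_pcm16_b64(samples: np.ndarray) -> str:
    # Clip to [-1, 1], scale to the int16 range, and base64-encode the raw bytes
    pcm16 = (np.clip(samples, -1.0, 1.0) * 32767).astype(np.int16)
    return base64.b64encode(pcm16.tobytes()).decode("ascii")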
Session Duration Limits
Realtime sessions end after 60 minutes on OpenAI and 30 minutes on Azure OpenAI. Longer applications need a reconnect plan and a way to rebuild state. The voice must be chosen before the first audio output and cannot be switched mid-session.
Rate Limits and Quotas
Rate limits are tier-based and project-specific. Tier 1 currently lists 200 requests per minute and 40,000 tokens per minute for GPT-Realtime-2. The Free tier is not supported for any Realtime API model.
Common Errors
The most frequent errors are empty buffer commits and wrong audio formatting. For voice agents, watch for feedback loops where the microphone hears the assistant's speaker output. Use headphones, echo cancellation, or mic muting to prevent this.
Session Reconnection Strategy
For long sessions, reconnect around 55 minutes instead of waiting for expiry. Note: the GPT-Realtime-2 model page has a generic "Streaming: Not supported" row that refers to Chat Completions streaming, not the Realtime API behavior.
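A minimal version of that strategy, assuming a connect() helper that opens the WebSocket and sends session.update, and a handle_events() loop for normal traffic (both hypothetical names):
import time

RECONNECT_AFTER = 55 * 60  # reconnect before the 60-minute server-side limit

while True:
    ws = connect()                  # open the WebSocket and configure the session
    started = time.time()
    while time.time() - started < RECONNECT_AFTER:
        handle_events(ws)           # your normal send/receive loop
    ws.close()
    # On the next connection, rebuild state, e.g. by re-sending a summary of
    # the conversation so far in the session instructions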
Final Thoughts
The same pattern shows up across all three tests: each job has its own model and endpoint. That split affects what the model can do, how it bills, and how much client code you need to own. GPT-Realtime-Whisper covers live text, GPT-Realtime-Translate covers direct speech translation, and GPT-Realtime-2 covers full assistant behavior with speech, reasoning, and context.
The code does not show one model replacing the others. It shows that realtime voice apps depend on session design. Your starting point should be the smallest model that matches the job, with the remaining engineering time spent on audio quality, turn-taking, reconnects, and client state management.
Frequently Asked Questions
What is the difference between GPT-Realtime-2 and GPT-Realtime-Whisper?
GPT-Realtime-2 is a full speech-to-speech reasoning model that listens, reasons, and responds with audio. GPT-Realtime-Whisper only transcribes audio to text. Use Whisper for transcription and GPT-Realtime-2 for voice assistant scenarios.
Should I use WebRTC or WebSocket for the Realtime API?
Use WebRTC for browser and mobile clients since it handles jitter buffering and audio transport natively. Use WebSockets for server-side applications where your backend receives raw audio from a telephony provider or media pipeline.
What audio format does the Realtime API require?
PCM16 at 24 kHz, mono, base64-encoded. Each input_audio_buffer.append event is capped at 15 MB. G.711 is also supported for telephony use cases.
How long can a Realtime API session last?
Realtime sessions end after 60 minutes on OpenAI and 30 minutes on Azure OpenAI. For longer applications, implement a reconnect plan around 55 minutes and rebuild the conversation state from the last known turn.
Can I use GPT-Realtime-2 for translation?
Technically yes, but it is overkill. GPT-Realtime-Translate is purpose-built for live translation, supports 70+ input languages and 13 output languages, and is billed per minute instead of per token. Use GPT-Realtime-2 only when you need reasoning, tool calls, or conversation state.
Need Help with AI Voice Solutions?
Our AI experts can help you integrate real-time voice AI, transcription, and translation into your applications. From strategy to deployment, we guide you through every step.
