How to Use the OpenAI GPT-Realtime-2 API: Step-by-Step Guide
By Braincuber Team
Published on May 13, 2026
OpenAI recently launched GPT-Realtime-2, a speech-to-speech reasoning model that simplifies building real-time voice applications. Alongside it, OpenAI released two specialized models: GPT-Realtime-Whisper for transcription and GPT-Realtime-Translate for live translation. This step-by-step guide walks you through how to connect to each API using Python WebSockets, send audio, and handle responses. By the end, you will have tested all three models with working code and will understand when to use each one.
What You'll Learn:
- Choose between WebRTC and WebSocket for the Realtime API
- Set up Python environment with WebSocket dependencies
- Authenticate and connect to Realtime API WebSocket endpoints
- Build real-time audio transcription with GPT-Realtime-Whisper
- Implement live speech translation with GPT-Realtime-Translate
- Create a full voice assistant with GPT-Realtime-2 including turn-taking and barge-in
- Understand cost, latency, API limits, and common troubleshooting
What is the GPT-Realtime-2 API?
GPT-Realtime-2 is OpenAI's speech-to-speech reasoning model. It replaces the separate STT-to-LLM-to-TTS pipeline with a single model that understands audio directly, reasons about it, and responds with natural speech. It supports a 128K context window and controllable reasoning effort, making it suitable for long voice sessions.
The Realtime API actually covers three distinct models, each designed for a specific job:
| Model | Purpose | Billing |
|---|---|---|
| gpt-realtime-2 | Full voice assistant with reasoning, tool calls, and context | Per token |
| gpt-realtime-translate | Live speech translation across 70+ languages | Per minute |
| gpt-realtime-whisper | Real-time audio transcription (text only) | Per minute |
This beginner guide covers the complete setup process. Before you begin, make sure you have:
OpenAI Account
OpenAI API key with access to GPT-Realtime-2 models. The Free tier is not supported for Realtime API.
Python 3.9+
Python 3.9 or newer with pip package manager and a working microphone for audio capture.
Important Note
Each model has its own endpoint and billing model. Using GPT-Realtime-2 for simple transcription is overkill and costs more. Always pick the smallest model that matches your task.
Step 1: Choose Between WebRTC and WebSocket
OpenAI provides two transport paths for the Realtime API. The right choice depends entirely on your client environment:
- WebRTC — Use for browser and mobile clients. WebRTC handles jitter buffering, audio transport, and media stream management natively. For browser builds, use ephemeral client secrets so your API key never reaches the frontend.
- WebSocket — Use for server-side applications. WebSockets make sense when your backend already receives raw audio from a telephony provider or media pipeline. Working at this level also exposes the raw event names, which makes the differences between the three models visible in the code.
For this complete tutorial, all three tests use WebSockets since we are connecting from a Python server environment.
Step 2: Set Up Python Environment and Dependencies
The full code for all scripts is available on GitHub. Clone the repository first, then install the required dependencies:
git clone https://github.com/KhalidAbdelaty/gpt-realtime-api.git
cd gpt-realtime-api
pip install websocket-client sounddevice numpy python-dotenv
The websocket-client library handles the WebSocket connection. sounddevice captures microphone audio, numpy converts the audio buffer, and python-dotenv loads the API key from a .env file.
Platform-Specific Audio Setup
On macOS, you may need brew install portaudio before sounddevice works. On Linux, install portaudio19-dev. Windows typically works out of the box.
Step 3: Configure Authentication and WebSocket URL
Create a .env file at the root of your project and add your OpenAI API key:
OPENAI_API_KEY=sk-...
Server-side connections use an Authorization: Bearer header on the WebSocket handshake. Each model uses a different WebSocket URL:
| Model | WebSocket URL |
|---|---|
| gpt-realtime-2 | wss://api.openai.com/v1/realtime?model=gpt-realtime-2 |
| gpt-realtime-translate | wss://api.openai.com/v1/realtime/translations?model=gpt-realtime-translate |
| gpt-realtime-whisper | wss://api.openai.com/v1/realtime?intent=transcription |
Load the API key in each script using python-dotenv:
import os
from dotenv import load_dotenv
load_dotenv()
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
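With the key loaded, the handshake only needs the Bearer header. A minimal connection sketch, assuming the websocket-client library installed in Step 2 and the transcription URL from the table above:
import json
import websocket  # provided by the websocket-client package

url = "wss://api.openai.com/v1/realtime?intent=transcription"

# Open the WebSocket with the API key on the handshake headers
ws = websocket.create_connection(
    url,
    header=[f"Authorization: Bearer {OPENAI_API_KEY}"]
)

# The server's first event confirms the session (typically session.created)
print(json.loads(ws.recv()).get("type"))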
Step 4: Test Real-Time Audio Transcription with GPT-Realtime-Whisper
Transcription is the simplest use case. If your output is only text, the transcription model is all you need. GPT-Realtime-Whisper takes audio in and emits transcript deltas. It does not reason, call tools, or speak back, which makes it more cost-effective than using the full GPT-Realtime-2 model.
The key configuration field is session.type: "transcription". This tells the API to skip assistant responses and emit transcript events only. Use 24 kHz PCM16 mono audio, encoded as base64:
session_config = {
    "type": "session.update",
    "session": {
        "type": "transcription",
        "audio": {
            "input": {
                "format": {"type": "audio/pcm", "rate": 24000},
                "transcription": {
                    "model": "gpt-realtime-whisper",
                    "language": "en"
                },
                "turn_detection": None
            }
        }
    }
}
ws.send(json.dumps(session_config))

ws.send(json.dumps({
    "type": "input_audio_buffer.append",
    "audio": audio_b64
}))
ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
Unlike the voice agent session, the transcription script commits the input buffer manually on a timer instead of using server_vad turn detection. Empty commits raise input_audio_buffer_commit_empty errors, so the script only commits after real audio has been sent. Transcript deltas arrive word by word in real time, typically within 3 to 4 seconds after speech starts.
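The commit loop itself is straightforward. A rough sketch, assuming microphone capture through sounddevice and the ws connection from Step 3 (audio_queue and COMMIT_EVERY are illustrative names, not API fields):
import base64, json, queue, time
import sounddevice as sd

audio_queue = queue.Queue()
COMMIT_EVERY = 2.0  # seconds between manual commits

def mic_callback(indata, frames, time_info, status):
    # indata is int16 PCM from the microphone; hand it to the sender loop
    audio_queue.put(indata.tobytes())

stream = sd.InputStream(samplerate=24000, channels=1, dtype="int16",
                        callback=mic_callback)
stream.start()

last_commit = time.time()
audio_sent = False
while True:
    chunk = audio_queue.get()
    ws.send(json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(chunk).decode("ascii")
    }))
    audio_sent = True
    # Commit only after real audio has been appended, to avoid
    # input_audio_buffer_commit_empty errors
    if audio_sent and time.time() - last_commit >= COMMIT_EVERY:
        ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        last_commit = time.time()
        audio_sent = False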
Event Ordering Note
Completion events from overlapping turns can arrive out of order. Reconcile by item_id if you build a UI around the stream. The transcription model is billed by audio duration, not by token.
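If you do build a UI around the stream, a small per-item buffer is enough to reconcile out-of-order completions. A sketch, assuming delta and completed events that mirror the conversation.item.input_audio_transcription.* names used later in this guide (the exact field names are an assumption here):
# Accumulate transcript text per item_id so out-of-order completion
# events do not scramble the displayed text
transcripts = {}

def handle_transcript_event(event):
    etype = event.get("type", "")
    item_id = event.get("item_id")
    if etype.endswith("input_audio_transcription.delta"):
        transcripts[item_id] = transcripts.get(item_id, "") + event.get("delta", "")
    elif etype.endswith("input_audio_transcription.completed"):
        # Prefer the final transcript carried by the completion event
        transcripts[item_id] = event.get("transcript", transcripts.get(item_id, ""))
        print(item_id, transcripts[item_id])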
Step 5: Test Real-Time Translation with GPT-Realtime-Translate
Live speech translation looks similar to transcription but uses a separate endpoint that removes the assistant response loop entirely: there is no assistant turn and no response.create. The model works as a live interpreter, not a conversational agent.
The model supports more than 70 input languages and 13 output languages. You set the target language with session.audio.output.language. Source language detection is automatic. Key limitations: no custom prompting, no voice selection, and no domain glossaries.
url = "wss://api.openai.com/v1/realtime/translations?model=gpt-realtime-translate"

session_config = {
    "type": "session.update",
    "session": {
        "audio": {
            "output": {"language": "es"},
            "input": {
                "transcription": {"model": "gpt-realtime-whisper"},
                "noise_reduction": {"type": "near_field"}
            }
        }
    }
}
ws.send(json.dumps(session_config))

ws.send(json.dumps({
    "type": "session.input_audio_buffer.append",
    "audio": audio_b64
}))
Translated audio arrives on session.output_audio.delta events, and the audio bytes are in event["delta"], not event["audio"]. The source and translated transcripts arrive separately:
if event_type == "session.output_audio.delta":
    audio_out_queue.put(base64.b64decode(event["delta"]))
elif event_type == "session.input_transcript.delta":
    print("[EN]", event.get("delta", ""), end="")
elif event_type == "session.output_transcript.delta":
    print("[ES]", event.get("delta", ""), end="")
For short English-to-Spanish phrases, translated audio begins before the source utterance finishes. More distant language pairs can wait longer for context. One edge case: if the source audio is already in the target language, the model may produce silence rather than pass it through.
Step 6: Build a Voice Assistant with GPT-Realtime-2
This is the main GPT-Realtime-2 test: a voice agent that listens, speaks, keeps context, and can call tools. It is a speech-to-speech reasoning model with a 128K context window. The session configuration uses semantic_vad turn detection, which looks at speech cues rather than silence alone:
session_config = {
    "type": "session.update",
    "session": {
        "type": "realtime",
        "model": "gpt-realtime-2",
        "output_modalities": ["audio"],
        "audio": {
            "input": {
                "format": {"type": "audio/pcm", "rate": 24000},
                "transcription": {"model": "gpt-realtime-whisper", "language": "en"},
                "turn_detection": {
                    "type": "semantic_vad",
                    "eagerness": "medium",
                    "create_response": False,
                    "interrupt_response": True
                }
            },
            "output": {
                "format": {"type": "audio/pcm", "rate": 24000},
                "voice": "marin"
            }
        },
        "instructions": "You are a helpful voice assistant. Keep answers short.",
        "reasoning": {"effort": "low"}
    }
}
When the user's transcript completes, the client creates the assistant response. Assistant audio arrives on response.output_audio.delta events:
if event_type == "conversation.item.input_audio_transcription.completed":
    ws.send(json.dumps({"type": "response.create"}))
elif event_type == "response.output_audio.delta":
    audio_out_queue.put(base64.b64decode(event["delta"]))
Handling Barge-In and Interruptions
The interruption sequence is critical for a natural voice experience. When the user speaks over the assistant, the server sends input_audio_buffer.speech_started. The client must stop playback, record how much audio it played, and send conversation.item.truncate with audio_end_ms to tell the server where it was cut off:
if current_response_item_id and playback_position_ms > 0:
    ws.send(json.dumps({
        "type": "conversation.item.truncate",
        "item_id": current_response_item_id,
        "content_index": 0,
        "audio_end_ms": playback_position_ms
    }))
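Tracking playback_position_ms is simple with 24 kHz PCM16 mono audio: every millisecond of playback is exactly 48 bytes, so counting bytes as you play them gives the truncation point. A minimal sketch (on_chunk_played is a hypothetical hook in your playback loop):
# 24,000 samples/sec * 2 bytes/sample (PCM16 mono) = 48 bytes per millisecond
BYTES_PER_MS = 24000 * 2 / 1000

played_bytes = 0
playback_position_ms = 0

def on_chunk_played(chunk: bytes):
    global played_bytes, playback_position_ms
    played_bytes += len(chunk)
    playback_position_ms = int(played_bytes / BYTES_PER_MS)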
Laptop Speaker Feedback Loop
With laptop speakers, the microphone can pick up the assistant's audio output and send it back to the model. Use MUTE_MIC_DURING_ASSISTANT = True to silence the input stream while the assistant speaks. Set it to False only if you are using headphones and want interruption support.
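A sketch of how that flag can gate the microphone callback; assistant_speaking is an illustrative flag your event handler would set while response.output_audio.delta events are playing and clear when the response finishes:
MUTE_MIC_DURING_ASSISTANT = True
assistant_speaking = False  # toggled by the event handler during playback

def mic_callback(indata, frames, time_info, status):
    if MUTE_MIC_DURING_ASSISTANT and assistant_speaking:
        return  # drop mic audio so the model never hears its own voice
    audio_queue.put(indata.tobytes())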
Cost and Latency Overview
Pricing splits into two groups: GPT-Realtime-2 is billed by token, while translation and transcription are billed by audio minute:
| Model | Billing Model | ~30 Min Cost |
|---|---|---|
| gpt-realtime-whisper | Per minute | ~$0.51 |
| gpt-realtime-translate | Per minute | ~$1.02 |
| gpt-realtime-2 | Per token | Varies with usage |
Voice agents are harder to estimate because audio tokens accumulate on both sides of the conversation. Session length, speaking ratio, reasoning effort, and context size all matter. Prompt caching can reduce cost when earlier conversation turns stay stable.
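For the per-minute models the math is simple. The rates below are back-calculated from the approximate 30-minute figures in the table above, so treat them as rough estimates rather than official pricing:
# Approximate per-minute rates implied by the ~30-minute costs above
RATE_PER_MINUTE = {
    "gpt-realtime-whisper": 0.51 / 30,    # ~$0.017 per minute
    "gpt-realtime-translate": 1.02 / 30,  # ~$0.034 per minute
}

def estimate_cost(model: str, minutes: float) -> float:
    return round(RATE_PER_MINUTE[model] * minutes, 2)

print(estimate_cost("gpt-realtime-translate", 45))  # ~1.53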
API Limits, Audio Requirements, and Troubleshooting
Audio Format Requirements
WebSocket audio must be PCM16 at 24 kHz, mono, base64-encoded. Each input_audio_buffer.append event is capped at 15 MB (50-millisecond chunks are well below the limit). G.711 is also supported for telephony.
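If your capture or DSP pipeline produces float samples, converting to the required wire format takes a couple of lines of numpy. A sketch assuming float32 input in the -1.0 to 1.0 range:
import base64
import numpy as np

def float_to_pcm16_b64(samples: np.ndarray) -> str:
    # Clip to [-1, 1], scale to the int16 range, and base64-encode the raw bytes
    pcm16 = (np.clip(samples, -1.0, 1.0) * 32767).astype(np.int16)
    return base64.b64encode(pcm16.tobytes()).decode("ascii")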
Session Duration Limits
Realtime sessions end after 60 minutes on OpenAI and 30 minutes on Azure OpenAI. Longer applications need a reconnect plan and a way to rebuild state. The voice must be chosen before the first audio output and cannot be switched mid-session.
Rate Limits and Quotas
Rate limits are tier-based and project-specific. Tier 1 currently lists 200 requests per minute and 40,000 tokens per minute for GPT-Realtime-2. The Free tier is not supported for any Realtime API model.
Common Errors
The most frequent errors are empty buffer commits and wrong audio formatting. For voice agents, watch for feedback loops where the microphone hears the assistant's speaker output. Use headphones, echo cancellation, or mic muting to prevent this.
Session Reconnection Strategy
For long sessions, reconnect around 55 minutes instead of waiting for expiry. Note: the GPT-Realtime-2 model page has a generic "Streaming: Not supported" row that refers to Chat Completions streaming, not the Realtime API behavior.
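A minimal version of that strategy, assuming a connect() helper that opens the WebSocket and sends session.update, and a handle_events() loop for normal traffic (both hypothetical names):
import time

RECONNECT_AFTER = 55 * 60  # reconnect before the 60-minute server-side limit

while True:
    ws = connect()                  # open the WebSocket and configure the session
    started = time.time()
    while time.time() - started < RECONNECT_AFTER:
        handle_events(ws)           # your normal send/receive loop
    ws.close()
    # On the next connection, rebuild state, e.g. by re-sending a summary of
    # the conversation so far in the session instructions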
Final Thoughts
The same pattern shows up across all three tests: each job has its own model and endpoint. That split affects what the model can do, how it bills, and how much client code you need to own. GPT-Realtime-Whisper covers live text, GPT-Realtime-Translate covers direct speech translation, and GPT-Realtime-2 covers full assistant behavior with speech, reasoning, and context.
The code does not show one model replacing the others. It shows that realtime voice apps depend on session design. Your starting point should be the smallest model that matches the job, with the remaining engineering time spent on audio quality, turn-taking, reconnects, and client state management.
Frequently Asked Questions
What is the difference between GPT-Realtime-2 and GPT-Realtime-Whisper?
GPT-Realtime-2 is a full speech-to-speech reasoning model that listens, reasons, and responds with audio. GPT-Realtime-Whisper only transcribes audio to text. Use Whisper for transcription and GPT-Realtime-2 for voice assistant scenarios.
Should I use WebRTC or WebSocket for the Realtime API?
Use WebRTC for browser and mobile clients since it handles jitter buffering and audio transport natively. Use WebSockets for server-side applications where your backend receives raw audio from a telephony provider or media pipeline.
What audio format does the Realtime API require?
PCM16 at 24 kHz, mono, base64-encoded. Each input_audio_buffer.append event is capped at 15 MB. G.711 is also supported for telephony use cases.
How long can a Realtime API session last?
Realtime sessions end after 60 minutes on OpenAI and 30 minutes on Azure OpenAI. For longer applications, implement a reconnect plan around 55 minutes and rebuild the conversation state from the last known turn.
Can I use GPT-Realtime-2 for translation?
Technically yes, but it is overkill. GPT-Realtime-Translate is purpose-built for live translation, supports 70+ input languages and 13 output languages, and is billed per minute instead of per token. Use GPT-Realtime-2 only when you need reasoning, tool calls, or conversation state.
Need Help with AI Voice Solutions?
Our AI experts can help you integrate real-time voice AI, transcription, and translation into your applications. From strategy to deployment, we guide you through every step.
