How to Use the DeepSeek V4 API: Complete Step-by-Step Guide
By Braincuber Team
Published on May 6, 2026
DeepSeek V4 is the latest generation of Mixture-of-Experts (MoE) language models from DeepSeek AI, featuring three explicit reasoning modes: Non-think, Think High, and Think Max. This complete beginner's guide covers setting up the DeepSeek V4 API, building a Think Mode Arena with Streamlit, and comparing the reasoning modes on latency, cost, and quality in 2026.
What You'll Learn:
- How to set up DeepSeek V4 API and obtain API keys
- Understanding the three reasoning modes (Non-think, Think High, Think Max)
- Building a Streamlit Arena app with parallel API calls
- Estimating costs with cache-aware pricing
- Comparing model responses across metrics and user ratings
What is DeepSeek V4?
DeepSeek V4 is DeepSeek AI's latest MoE model series, including two variants: DeepSeek-V4-Pro (1.6T total parameters, 49B activated) and DeepSeek-V4-Flash (284B total, 13B activated), both with 1M token context length.
Key architectural improvements include Hybrid Attention (CSA/HCA), DeepSeekMoE, Manifold-Constrained Hyper-Connections (mHC), and Muon Optimizer for faster convergence.
DeepSeek V4 Benchmark Scores
| Benchmark | DeepSeek-V4-Pro Max | Claude Opus 4.6 Max | Gemini-3.1-Pro High |
|---|---|---|---|
| LiveCodeBench (Pass@1) | 93.5 | 88.8 | 91.7 |
| Codeforces Rating | 3206 | — | 3052 |
| GPQA Diamond (Pass@1) | 90.1 | 91.3 | 94.3 |
| SWE Verified (Resolved) | 80.6 | 80.8 | 80.6 |
Understanding the Three Thinking Modes
DeepSeek V4 supports three reasoning modes, each suited for different tasks:
| Mode | Description | Best For |
|---|---|---|
| Non-think | Direct responses without internal chain of thought | Trivial tasks, format conversion, simple lookups |
| Think High | Step-by-step reasoning before responding | Coding, debugging, system design, planning |
| Think Max | Exhaustive reasoning with verification | Hard math proofs, complex refactoring |
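Before building the full Arena, here is a minimal single-call sketch of how a mode maps to request parameters. This is our illustration, not code from the Arena itself; the thinking and reasoning_effort fields are explained in the configuration step below, and parameter names should be verified against DeepSeek's current API docs.

```python
import os
from openai import OpenAI

# The DeepSeek API is OpenAI-compatible, so the standard OpenAI client works.
client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",
)

# One "Think High" request; disable thinking and omit reasoning_effort
# for Non-think, or use reasoning_effort="max" for Think Max.
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Find the bug in: sorted(xs, key=xs.index)"}],
    reasoning_effort="high",
    extra_body={"thinking": {"type": "enabled"}},
)
print(response.choices[0].message.content)
```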
Step-by-Step Guide to Building the DeepSeek V4 Arena
Project Setup and Dependencies
Create a project folder and install the two required packages: Streamlit and the OpenAI SDK (the DeepSeek API is OpenAI-compatible, so no dedicated SDK is needed).
```bash
mkdir deepseek-arena && cd deepseek-arena
pip install "streamlit>=1.32.0" "openai>=1.12.0"
```
Note that the version specifiers are quoted so the shell does not interpret `>=` as a redirect.
Set your DeepSeek API key as an environment variable:
```bash
export DEEPSEEK_API_KEY="sk-..."
```
Configuration — Models, Modes, Pricing
Define imports, model identifiers, reasoning mode configurations, and pricing constants. These control the app's behavior.
```python
import os
import time
import concurrent.futures
from dataclasses import dataclass
from typing import Optional, Dict

import streamlit as st
from openai import OpenAI

DEEPSEEK_BASE_URL = "https://api.deepseek.com"
DEEPSEEK_API_KEY_ENV = "DEEPSEEK_API_KEY"

MODELS = {
    "DeepSeek-V4-Flash (default)": "deepseek-v4-flash",
    "DeepSeek-V4-Pro": "deepseek-v4-pro",
}

MODES: Dict[str, dict] = {
    "Non-think": {
        "icon": "",
        "color": "#10b981",
        "badge": "green",
        "desc": "Fast, direct answers — no internal reasoning",
        "thinking_type": "disabled",
        "reasoning_effort": None,
    },
    "Think High": {
        "icon": "",
        "color": "#3b82f6",
        "badge": "blue",
        "desc": "Careful step-by-step reasoning before responding",
        "thinking_type": "enabled",
        "reasoning_effort": "high",
    },
    "Think Max": {
        "icon": "",
        "color": "#ef4444",
        "badge": "red",
        "desc": "Exhaustive reasoning — push analysis to the limit",
        "thinking_type": "enabled",
        "reasoning_effort": "max",
    },
}

# USD per 1M tokens; verify current rates at api-docs.deepseek.com
PRICING = {
    "deepseek-v4-flash": {
        "input_cache_hit": 0.0028,
        "input_cache_miss": 0.14,
        "output": 0.28,
    },
    "deepseek-v4-pro": {
        "input_cache_hit": 0.003625,
        "input_cache_miss": 0.435,
        "output": 0.87,
    },
}
```
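The snippets that follow assume an OpenAI client pointed at DeepSeek's endpoint. The article does not show the client construction, so here is a small helper (the name make_client is ours) built from the constants above:

```python
def make_client() -> OpenAI:
    # Build an OpenAI-compatible client for DeepSeek from the constants above.
    # Hypothetical helper, not part of the original article's code.
    api_key = os.environ.get(DEEPSEEK_API_KEY_ENV)
    if not api_key:
        raise RuntimeError(f"Set {DEEPSEEK_API_KEY_ENV} before running the app.")
    return OpenAI(api_key=api_key, base_url=DEEPSEEK_BASE_URL)
```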
Task Templates
Define task templates spanning different difficulty levels so each reasoning mode has a category where it wins. Includes trivial, coding, system design, planning, and math tasks.
```python
TASKS = {
    "Trivial / Lookup": {
        "prompt": (
            "Complete both tasks below:\n\n"
            "Task A — Convert this JSON to YAML:\n"
            '{"name": "Alice", "age": 30, "skills": ["Python", "ML", "LLMs"]}\n'
            'Task B — Summarize this paragraph in exactly one sentence:\n'
            '"Large language models have rapidly transformed natural language processing..."'
        ),
        "expected_winner": "Non-think",
        "tip": "Non-think should dominate here. No reasoning is required.",
    },
    "Coding / Debugging": {
        "prompt": "...",  # Full prompt as per source
        "expected_winner": "Think High",
        "tip": "Think High usually finds the bugs and explains them well.",
    },
    # Additional tasks: System Design, Planning, Math (IMO-style) as per source
}
```
Data Class for Results
Define a RunResult dataclass to capture metrics from each API call: mode, answer, thinking trace, latency, tokens, cost, etc.
```python
@dataclass
class RunResult:
    mode: str
    answer: str = ""
    thinking: str = ""
    latency: float = 0.0
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0
    error: Optional[str] = None

    @property
    def tokens_per_second(self) -> Optional[float]:
        if self.latency > 0 and self.output_tokens > 0:
            return self.output_tokens / self.latency
        return None

    @property
    def thinking_word_count(self) -> int:
        return len(self.thinking.split()) if self.thinking else 0
```
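As a quick usage check: a run that produced 1,200 output tokens in 8 seconds reports 150 tokens per second, while a run with no output yields None.

```python
r = RunResult(mode="Think High", output_tokens=1_200, latency=8.0)
assert r.tokens_per_second == 150.0  # 1200 / 8
assert RunResult(mode="Non-think").tokens_per_second is None  # no output yet
```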
Cost Estimation Helpers
Helper functions to get cached prompt tokens and estimate cost accurately by splitting input tokens into cache-hit and cache-miss portions.
```python
def get_cached_prompt_tokens(usage) -> int:
    """Return how many prompt tokens were served from the prefix cache."""
    prompt_details = getattr(usage, "prompt_tokens_details", None)
    if prompt_details is None:
        return 0
    cached_tokens = getattr(prompt_details, "cached_tokens", None)
    if cached_tokens is not None:
        return cached_tokens or 0
    if isinstance(prompt_details, dict):
        return prompt_details.get("cached_tokens", 0) or 0
    return 0


def estimate_cost_usd(
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
    cached_prompt_tokens: int,
) -> float:
    """Estimate cost by splitting input tokens into cache-hit and cache-miss."""
    pricing = PRICING.get(model, PRICING["deepseek-v4-flash"])
    cached_tokens = min(cached_prompt_tokens, prompt_tokens)
    uncached_tokens = max(prompt_tokens - cached_tokens, 0)
    return (
        cached_tokens / 1_000_000 * pricing["input_cache_hit"]
        + uncached_tokens / 1_000_000 * pricing["input_cache_miss"]
        + completion_tokens / 1_000_000 * pricing["output"]
    )
```
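A quick sanity check of the math using the V4-Flash rates above: 100k prompt tokens, of which 80k hit the cache, plus 4k output tokens comes to about $0.0041.

```python
cost = estimate_cost_usd(
    model="deepseek-v4-flash",
    prompt_tokens=100_000,
    completion_tokens=4_000,
    cached_prompt_tokens=80_000,
)
# 80k * $0.0028/1M + 20k * $0.14/1M + 4k * $0.28/1M
# = $0.000224 + $0.0028 + $0.00112 = $0.004144
print(f"${cost:.6f}")  # $0.004144
```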
Parallel API Execution
Use ThreadPoolExecutor to fire three parallel API calls (one per reasoning mode), so the total wall-clock time equals that of the slowest mode.
```python
def call_mode(client: OpenAI, model: str, mode_name: str, user_prompt: str) -> RunResult:
    result = RunResult(mode=mode_name)
    mode_cfg = MODES[mode_name]
    start = time.perf_counter()
    try:
        request_kwargs = {
            "model": model,
            "messages": [{"role": "user", "content": user_prompt}],
            "max_tokens": 4096,
            # DeepSeek-specific fields ride through the OpenAI SDK via extra_body
            "extra_body": {"thinking": {"type": mode_cfg["thinking_type"]}},
        }
        if mode_cfg["reasoning_effort"]:
            request_kwargs["reasoning_effort"] = mode_cfg["reasoning_effort"]
        response = client.chat.completions.create(**request_kwargs)
        result.latency = time.perf_counter() - start
        message = response.choices[0].message
        result.thinking = (getattr(message, "reasoning_content", None) or "").strip()
        result.answer = (message.content or "").strip()
        usage = response.usage
        result.input_tokens = getattr(usage, "prompt_tokens", 0) or 0
        result.output_tokens = getattr(usage, "completion_tokens", 0) or 0
        result.cost_usd = estimate_cost_usd(
            model=model,
            prompt_tokens=result.input_tokens,
            completion_tokens=result.output_tokens,
            cached_prompt_tokens=get_cached_prompt_tokens(usage),
        )
    except Exception as exc:
        result.latency = time.perf_counter() - start
        result.error = str(exc)
    return result


def run_parallel(client: OpenAI, model: str, prompt: str) -> Dict[str, RunResult]:
    results: Dict[str, RunResult] = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as pool:
        futures = {
            pool.submit(call_mode, client, model, mode_name, prompt): mode_name
            for mode_name in MODES
        }
        for fut in concurrent.futures.as_completed(futures):
            results[futures[fut]] = fut.result()
    return results
```
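Outside Streamlit you can exercise the arena loop directly; a minimal sketch, assuming the make_client helper from earlier:

```python
client = make_client()
results = run_parallel(client, "deepseek-v4-flash", "Explain the CAP theorem in three bullets.")
for mode, run in results.items():
    status = run.error or f"{run.latency:.1f}s, {run.output_tokens} tok, ${run.cost_usd:.4f}"
    print(f"{mode}: {status}")
```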
Streamlit UI and CSS
Build the Streamlit UI with CSS design system, per-mode result columns, collapsible thinking traces, and tab structure for overview, full responses, and ratings.
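The snippet below references an inject_css helper that the article does not list; here is a minimal placeholder (ours) so the code runs, to be replaced with the source project's full stylesheet:

```python
def inject_css() -> None:
    # Placeholder for the article's CSS design system; swap in the real styles.
    st.markdown(
        "<style>div[data-testid='stMetric'] { text-align: center; }</style>",
        unsafe_allow_html=True,
    )
```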
```python
def main():
    st.set_page_config(
        page_title="DeepSeek V4 Think Mode Arena",
        layout="wide",
        initial_sidebar_state="expanded",
    )
    inject_css()
    # ... rest of UI code as per source
```
Metrics and Ratings
Render metrics table comparing latency, tokens, cost, and thinking depth across modes, plus a winner summary across four dimensions.
```python
def render_metrics_table(results: Dict[str, RunResult], ratings: Dict[str, int]):
    # ... table rendering code as per source
    ...


def render_winner_summary(results: Dict[str, RunResult], ratings: Dict[str, int], expected: str):
    # ... winner summary code as per source
    ...
```
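The rendering bodies are elided above. As one hedged sketch of render_metrics_table (column names and layout are our choices, not the source's), plus the entry point that kicks off the UI:

```python
import pandas as pd  # add to the imports at the top of app.py

def render_metrics_table(results: Dict[str, RunResult], ratings: Dict[str, int]):
    # One row per mode, using the metrics captured in RunResult.
    rows = [
        {
            "Mode": mode,
            "Latency (s)": round(run.latency, 2),
            "Output tokens": run.output_tokens,
            "Tokens/s": round(run.tokens_per_second or 0.0, 1),
            "Thinking words": run.thinking_word_count,
            "Cost (USD)": round(run.cost_usd, 6),
            "Rating (1-5)": ratings.get(mode, 0),
            "Error": run.error or "",
        }
        for mode, run in results.items()
    ]
    st.dataframe(pd.DataFrame(rows), use_container_width=True)

# Streamlit runs app.py top to bottom, so invoke the UI at the end of the file.
if __name__ == "__main__":
    main()
```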
Running the App
Start the Streamlit app with two commands. The app opens at localhost:8501, allowing you to select tasks, run the arena, and compare results.
```bash
# Set your API key
export DEEPSEEK_API_KEY="sk-..."
# Run
streamlit run app.py
```
Key Features of the Think Mode Arena
Parallel API Execution
Fires three parallel API calls using ThreadPoolExecutor, so the total wall-clock time equals that of the slowest mode.
Three Reasoning Modes
Compare Non-think, Think High, and Think Max across latency, cost, and quality.
Cache-Aware Cost Estimation
Estimates cost by splitting input tokens into cache-hit and cache-miss portions.
User Rating System
Rate answers 1-5 per mode and see winner summary across four dimensions.
Important Notes
Always verify current pricing at api-docs.deepseek.com as rates change frequently. Keep your API key secure—never hardcode it. Legacy model IDs (deepseek-chat, deepseek-reasoner) will be retired on 2026-07-24.
Frequently Asked Questions
Why use OpenAI SDK instead of DeepSeek SDK?
DeepSeek's API is OpenAI-compatible, so you can point the OpenAI client at https://api.deepseek.com with a DeepSeek API key. No DeepSeek-specific SDK is needed.
How do I enable thinking mode in the API?
Use extra_body={"thinking": {"type": "enabled"}} and set reasoning_effort to "high" or "max". Non-think mode sets thinking type to "disabled" and omits reasoning_effort.
What is the context length of DeepSeek V4?
Both V4-Pro and V4-Flash support a 1 million token context window, with up to 384,000 tokens of output.
How much does DeepSeek V4 API cost?
V4-Flash costs $0.0028 per 1M input tokens on a cache hit, $0.14 per 1M on a cache miss, and $0.28 per 1M output tokens. V4-Pro runs roughly 3x higher for cache-miss input and output ($0.435 and $0.87 per 1M). Verify current rates, as promotions change.
Is the DeepSeek V4 API production-ready?
Yes. The API went live alongside the model weights in April 2026, running on the same infrastructure as V3 and V3.2, which scaled reliably for over a year.
Need Help with AI Model Integration?
Our AI experts can help you build custom AI applications with DeepSeek V4 API, Streamlit, and other modern tools. Get started with a free consultation today.
