How to Use the DeepSeek V4 API: Complete Step-by-Step Guide
By Braincuber Team
Published on May 6, 2026
DeepSeek V4 is the latest generation of Mixture-of-Experts (MoE) language models from DeepSeek AI, featuring three explicit reasoning modes: Non-think, Think High, and Think Max. This complete beginner's guide covers setting up the DeepSeek V4 API, building a Think Mode Arena with Streamlit, and comparing the reasoning modes on latency, cost, and quality in 2026.
What You'll Learn:
- How to set up DeepSeek V4 API and obtain API keys
- Understanding the three reasoning modes (Non-think, Think High, Think Max)
- Building a Streamlit Arena app with parallel API calls
- Estimating costs with cache-aware pricing
- Comparing model responses across metrics and user ratings
What is DeepSeek V4?
DeepSeek V4 is DeepSeek AI's latest MoE model series, including two variants: DeepSeek-V4-Pro (1.6T total parameters, 49B activated) and DeepSeek-V4-Flash (284B total, 13B activated), both with 1M token context length.
Key architectural improvements include Hybrid Attention (CSA/HCA), DeepSeekMoE, Manifold-Constrained Hyper-Connections (mHC), and Muon Optimizer for faster convergence.
DeepSeek V4 Benchmark Scores
| Benchmark | DeepSeek-V4-Pro Max | Claude Opus 4.6 Max | Gemini-3.1-Pro High |
|---|---|---|---|
| LiveCodeBench (Pass@1) | 93.5 | 88.8 | 91.7 |
| Codeforces Rating | 3206 | — | 3052 |
| GPQA Diamond (Pass@1) | 90.1 | 91.3 | 94.3 |
| SWE Verified (Resolved) | 80.6 | 80.8 | 80.6 |
Understanding the Three Thinking Modes
DeepSeek V4 supports three reasoning modes, each suited for different tasks:
| Mode | Description | Best For |
|---|---|---|
| Non-think | Direct responses without internal chain of thought | Trivial tasks, format conversion, simple lookups |
| Think High | Step-by-step reasoning before responding | Coding, debugging, system design, planning |
| Think Max | Exhaustive reasoning with verification | Hard math proofs, complex refactoring |
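Before building the full Arena, here is a minimal single-call sketch of how a mode maps to request parameters. This is our illustration, not code from the Arena itself; the thinking and reasoning_effort fields are explained in the configuration step below, and parameter names should be verified against DeepSeek's current API docs.

```python
import os
from openai import OpenAI

# The DeepSeek API is OpenAI-compatible, so the standard OpenAI client works.
client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",
)

# One "Think High" request; disable thinking and omit reasoning_effort
# for Non-think, or use reasoning_effort="max" for Think Max.
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Find the bug in: sorted(xs, key=xs.index)"}],
    reasoning_effort="high",
    extra_body={"thinking": {"type": "enabled"}},
)
print(response.choices[0].message.content)
```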
Step-by-Step Guide to Building the DeepSeek V4 Arena
Project Setup and Dependencies
Create a project folder and install the two required packages: Streamlit and the OpenAI SDK (the DeepSeek API is OpenAI-compatible, so no dedicated SDK is needed).
```bash
mkdir deepseek-arena && cd deepseek-arena
pip install "streamlit>=1.32.0" "openai>=1.12.0"
```
Note that the version specifiers are quoted so the shell does not interpret `>=` as a redirect.
Set your DeepSeek API key as an environment variable:
```bash
export DEEPSEEK_API_KEY="sk-..."
```
Configuration — Models, Modes, Pricing
Define imports, model identifiers, reasoning mode configurations, and pricing constants. These control the app's behavior.
```python
import os
import time
import concurrent.futures
from dataclasses import dataclass
from typing import Optional, Dict

import streamlit as st
from openai import OpenAI

DEEPSEEK_BASE_URL = "https://api.deepseek.com"
DEEPSEEK_API_KEY_ENV = "DEEPSEEK_API_KEY"

MODELS = {
    "DeepSeek-V4-Flash (default)": "deepseek-v4-flash",
    "DeepSeek-V4-Pro": "deepseek-v4-pro",
}

MODES: Dict[str, dict] = {
    "Non-think": {
        "icon": "",
        "color": "#10b981",
        "badge": "green",
        "desc": "Fast, direct answers — no internal reasoning",
        "thinking_type": "disabled",
        "reasoning_effort": None,
    },
    "Think High": {
        "icon": "",
        "color": "#3b82f6",
        "badge": "blue",
        "desc": "Careful step-by-step reasoning before responding",
        "thinking_type": "enabled",
        "reasoning_effort": "high",
    },
    "Think Max": {
        "icon": "",
        "color": "#ef4444",
        "badge": "red",
        "desc": "Exhaustive reasoning — push analysis to the limit",
        "thinking_type": "enabled",
        "reasoning_effort": "max",
    },
}

# USD per 1M tokens; verify current rates at api-docs.deepseek.com
PRICING = {
    "deepseek-v4-flash": {
        "input_cache_hit": 0.0028,
        "input_cache_miss": 0.14,
        "output": 0.28,
    },
    "deepseek-v4-pro": {
        "input_cache_hit": 0.003625,
        "input_cache_miss": 0.435,
        "output": 0.87,
    },
}
```
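The snippets that follow assume an OpenAI client pointed at DeepSeek's endpoint. The article does not show the client construction, so here is a small helper (the name make_client is ours) built from the constants above:

```python
def make_client() -> OpenAI:
    # Build an OpenAI-compatible client for DeepSeek from the constants above.
    # Hypothetical helper, not part of the original article's code.
    api_key = os.environ.get(DEEPSEEK_API_KEY_ENV)
    if not api_key:
        raise RuntimeError(f"Set {DEEPSEEK_API_KEY_ENV} before running the app.")
    return OpenAI(api_key=api_key, base_url=DEEPSEEK_BASE_URL)
```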
Task Templates
Define task templates spanning different difficulty levels so each reasoning mode has a category where it wins. Includes trivial, coding, system design, planning, and math tasks.
```python
TASKS = {
    "Trivial / Lookup": {
        "prompt": (
            "Complete both tasks below:\n\n"
            "Task A — Convert this JSON to YAML:\n"
            '{"name": "Alice", "age": 30, "skills": ["Python", "ML", "LLMs"]}\n'
            'Task B — Summarize this paragraph in exactly one sentence:\n'
            '"Large language models have rapidly transformed natural language processing..."'
        ),
        "expected_winner": "Non-think",
        "tip": "Non-think should dominate here. No reasoning is required.",
    },
    "Coding / Debugging": {
        "prompt": "...",  # Full prompt as per source
        "expected_winner": "Think High",
        "tip": "Think High usually finds the bugs and explains them well.",
    },
    # Additional tasks: System Design, Planning, Math (IMO-style) as per source
}
```
Data Class for Results
Define a RunResult dataclass to capture metrics from each API call: mode, answer, thinking trace, latency, tokens, cost, etc.
```python
@dataclass
class RunResult:
    mode: str
    answer: str = ""
    thinking: str = ""
    latency: float = 0.0
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0
    error: Optional[str] = None

    @property
    def tokens_per_second(self) -> Optional[float]:
        if self.latency > 0 and self.output_tokens > 0:
            return self.output_tokens / self.latency
        return None

    @property
    def thinking_word_count(self) -> int:
        return len(self.thinking.split()) if self.thinking else 0
```
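As a quick usage check: a run that produced 1,200 output tokens in 8 seconds reports 150 tokens per second, while a run with no output yields None.

```python
r = RunResult(mode="Think High", output_tokens=1_200, latency=8.0)
assert r.tokens_per_second == 150.0  # 1200 / 8
assert RunResult(mode="Non-think").tokens_per_second is None  # no output yet
```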
Cost Estimation Helpers
Helper functions to get cached prompt tokens and estimate cost accurately by splitting input tokens into cache-hit and cache-miss portions.
```python
def get_cached_prompt_tokens(usage) -> int:
    """Return how many prompt tokens were served from the prefix cache."""
    prompt_details = getattr(usage, "prompt_tokens_details", None)
    if prompt_details is None:
        return 0
    cached_tokens = getattr(prompt_details, "cached_tokens", None)
    if cached_tokens is not None:
        return cached_tokens or 0
    if isinstance(prompt_details, dict):
        return prompt_details.get("cached_tokens", 0) or 0
    return 0


def estimate_cost_usd(
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
    cached_prompt_tokens: int,
) -> float:
    """Estimate cost by splitting input tokens into cache-hit and cache-miss."""
    pricing = PRICING.get(model, PRICING["deepseek-v4-flash"])
    cached_tokens = min(cached_prompt_tokens, prompt_tokens)
    uncached_tokens = max(prompt_tokens - cached_tokens, 0)
    return (
        cached_tokens / 1_000_000 * pricing["input_cache_hit"]
        + uncached_tokens / 1_000_000 * pricing["input_cache_miss"]
        + completion_tokens / 1_000_000 * pricing["output"]
    )
```
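A quick sanity check of the math using the V4-Flash rates above: 100k prompt tokens, of which 80k hit the cache, plus 4k output tokens comes to about $0.0041.

```python
cost = estimate_cost_usd(
    model="deepseek-v4-flash",
    prompt_tokens=100_000,
    completion_tokens=4_000,
    cached_prompt_tokens=80_000,
)
# 80k * $0.0028/1M + 20k * $0.14/1M + 4k * $0.28/1M
# = $0.000224 + $0.0028 + $0.00112 = $0.004144
print(f"${cost:.6f}")  # $0.004144
```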
Parallel API Execution
Use ThreadPoolExecutor to fire three parallel API calls (one per reasoning mode), so the total wall-clock time equals that of the slowest mode.
```python
def call_mode(client: OpenAI, model: str, mode_name: str, user_prompt: str) -> RunResult:
    result = RunResult(mode=mode_name)
    mode_cfg = MODES[mode_name]
    start = time.perf_counter()
    try:
        request_kwargs = {
            "model": model,
            "messages": [{"role": "user", "content": user_prompt}],
            "max_tokens": 4096,
            # DeepSeek-specific fields ride through the OpenAI SDK via extra_body
            "extra_body": {"thinking": {"type": mode_cfg["thinking_type"]}},
        }
        if mode_cfg["reasoning_effort"]:
            request_kwargs["reasoning_effort"] = mode_cfg["reasoning_effort"]
        response = client.chat.completions.create(**request_kwargs)
        result.latency = time.perf_counter() - start
        message = response.choices[0].message
        result.thinking = (getattr(message, "reasoning_content", None) or "").strip()
        result.answer = (message.content or "").strip()
        usage = response.usage
        result.input_tokens = getattr(usage, "prompt_tokens", 0) or 0
        result.output_tokens = getattr(usage, "completion_tokens", 0) or 0
        result.cost_usd = estimate_cost_usd(
            model=model,
            prompt_tokens=result.input_tokens,
            completion_tokens=result.output_tokens,
            cached_prompt_tokens=get_cached_prompt_tokens(usage),
        )
    except Exception as exc:
        result.latency = time.perf_counter() - start
        result.error = str(exc)
    return result


def run_parallel(client: OpenAI, model: str, prompt: str) -> Dict[str, RunResult]:
    results: Dict[str, RunResult] = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as pool:
        futures = {
            pool.submit(call_mode, client, model, mode_name, prompt): mode_name
            for mode_name in MODES
        }
        for fut in concurrent.futures.as_completed(futures):
            results[futures[fut]] = fut.result()
    return results
```
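Outside Streamlit you can exercise the arena loop directly; a minimal sketch, assuming the make_client helper from earlier:

```python
client = make_client()
results = run_parallel(client, "deepseek-v4-flash", "Explain the CAP theorem in three bullets.")
for mode, run in results.items():
    status = run.error or f"{run.latency:.1f}s, {run.output_tokens} tok, ${run.cost_usd:.4f}"
    print(f"{mode}: {status}")
```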
Streamlit UI and CSS
Build the Streamlit UI with CSS design system, per-mode result columns, collapsible thinking traces, and tab structure for overview, full responses, and ratings.
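The snippet below references an inject_css helper that the article does not list; here is a minimal placeholder (ours) so the code runs, to be replaced with the source project's full stylesheet:

```python
def inject_css() -> None:
    # Placeholder for the article's CSS design system; swap in the real styles.
    st.markdown(
        "<style>div[data-testid='stMetric'] { text-align: center; }</style>",
        unsafe_allow_html=True,
    )
```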
```python
def main():
    st.set_page_config(
        page_title="DeepSeek V4 Think Mode Arena",
        layout="wide",
        initial_sidebar_state="expanded",
    )
    inject_css()
    # ... rest of UI code as per source
```
Metrics and Ratings
Render metrics table comparing latency, tokens, cost, and thinking depth across modes, plus a winner summary across four dimensions.
```python
def render_metrics_table(results: Dict[str, RunResult], ratings: Dict[str, int]):
    # ... table rendering code as per source
    ...


def render_winner_summary(results: Dict[str, RunResult], ratings: Dict[str, int], expected: str):
    # ... winner summary code as per source
    ...
```
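The rendering bodies are elided above. As one hedged sketch of render_metrics_table (column names and layout are our choices, not the source's), plus the entry point that kicks off the UI:

```python
import pandas as pd  # add to the imports at the top of app.py

def render_metrics_table(results: Dict[str, RunResult], ratings: Dict[str, int]):
    # One row per mode, using the metrics captured in RunResult.
    rows = [
        {
            "Mode": mode,
            "Latency (s)": round(run.latency, 2),
            "Output tokens": run.output_tokens,
            "Tokens/s": round(run.tokens_per_second or 0.0, 1),
            "Thinking words": run.thinking_word_count,
            "Cost (USD)": round(run.cost_usd, 6),
            "Rating (1-5)": ratings.get(mode, 0),
            "Error": run.error or "",
        }
        for mode, run in results.items()
    ]
    st.dataframe(pd.DataFrame(rows), use_container_width=True)

# Streamlit runs app.py top to bottom, so invoke the UI at the end of the file.
if __name__ == "__main__":
    main()
```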
Running the App
Start the Streamlit app with two commands. The app opens at localhost:8501, allowing you to select tasks, run the arena, and compare results.
```bash
# Set your API key
export DEEPSEEK_API_KEY="sk-..."
# Run
streamlit run app.py
```
Key Features of the Think Mode Arena
Parallel API Execution
Fires three parallel API calls using ThreadPoolExecutor, so the total wall-clock time equals that of the slowest mode.
Three Reasoning Modes
Compare Non-think, Think High, and Think Max across latency, cost, and quality.
Cache-Aware Cost Estimation
Estimates cost by splitting input tokens into cache-hit and cache-miss portions.
User Rating System
Rate answers 1-5 per mode and see winner summary across four dimensions.
Important Notes
Always verify current pricing at api-docs.deepseek.com as rates change frequently. Keep your API key secure—never hardcode it. Legacy model IDs (deepseek-chat, deepseek-reasoner) will be retired on 2026-07-24.
Frequently Asked Questions
Why use OpenAI SDK instead of DeepSeek SDK?
DeepSeek's API is OpenAI-compatible, so you can point the OpenAI client at https://api.deepseek.com with a DeepSeek API key. No DeepSeek-specific SDK is needed.
How do I enable thinking mode in the API?
Use extra_body={"thinking": {"type": "enabled"}} and set reasoning_effort to "high" or "max". Non-think mode sets thinking type to "disabled" and omits reasoning_effort.
What is the context length of DeepSeek V4?
Both V4-Pro and V4-Flash support a 1 million token context window, with up to 384,000 tokens of output.
How much does DeepSeek V4 API cost?
V4-Flash costs $0.0028 per 1M input tokens on a cache hit, $0.14 per 1M on a cache miss, and $0.28 per 1M output tokens. V4-Pro runs roughly 3x higher for cache-miss input and output ($0.435 and $0.87 per 1M). Verify current rates, as promotions change.
Is the DeepSeek V4 API production-ready?
Yes. The API went live alongside the model weights in April 2026, running on the same infrastructure as V3 and V3.2, which scaled reliably for over a year.
Need Help with AI Model Integration?
Our AI experts can help you build custom AI applications with DeepSeek V4 API, Streamlit, and other modern tools. Get started with a free consultation today.
