How to Use DeepSeek V4: Complete Beginner Guide
By Braincuber Team
Published on April 24, 2026
DeepSeek V4 just dropped with 1.6 trillion parameters, a 1-million-token context window, and input pricing of $0.14 per million tokens. This is a huge deal - you can now paste an entire codebase, a 300-page PDF, or a month of chat logs into a single API call. But there are gotchas nobody is putting in the headlines.
What You'll Learn:
- What DeepSeek V4 actually gives you (1M context, MoE architecture)
- How to get API access in 3 minutes
- Flash vs Pro - which one you need
- The pricing trap nobody mentions
- Three gotchas that will break your app
What You Are Actually Getting
Two models are available: DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated). Both support a context length of one million tokens. Both use Mixture-of-Experts (MoE) architecture, which means only a fraction of the model activates for each request - keeping inference costs low.
The architecture upgrade is significant. In the 1M-token context setting, DeepSeek-V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache of DeepSeek-V3.2. Translation: the KV-cache memory needed to process a million tokens is 90% lower than it was three months ago.
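To make the MoE ratios concrete, here is a back-of-the-envelope sketch using only the figures quoted above (activated vs. total parameters, and the stated 10% KV-cache footprint). The function name is ours, not DeepSeek's:

```python
# Figures from the article: Pro activates 49B of 1.6T params; Flash 13B of 284B.
def active_fraction(active_b, total_b):
    """Fraction of weights touched per token in an MoE forward pass."""
    return active_b / total_b

pro_frac = active_fraction(49, 1600)   # ~3.1% of Pro's weights run per token
flash_frac = active_fraction(13, 284)  # ~4.6% of Flash's weights run per token

# KV cache: V4-Pro at 1M tokens uses 10% of V3.2's cache, i.e. a 90% cut.
kv_reduction = 1 - 0.10
```

This is why a 1.6T-parameter model can price like a much smaller one: per request you are paying for ~49B parameters of compute, not 1.6T.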
Flash vs Pro: Which One You Need
Use Flash For
Drafting emails, code generation, summarization, chat. Faster, cheaper, and handles 90% of what people actually do with LLMs. Start here.
Use Pro For
Repository-scale code analysis, multi-file refactoring, complex reasoning chains. Better for complex agentic workflows but costs more.
Getting Started: API Access in 3 Minutes
The web chat is live at chat.deepseek.com. You can test V4 there for free, no credit card. But if you want repeatable results or you are building something, you need the API.
Sign Up
Create an account at platform.deepseek.com
Top Up
Add at least $2 (minimum balance requirement)
Generate API Key
Create an API key from your dashboard
Test with cURL
curl https://api.deepseek.com/v1/chat/completions \
  -H "Authorization: Bearer $DEEPSEEK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [{"role": "user", "content": "Explain sparse attention in one sentence."}],
    "temperature": 0.2
  }'
Python Quick Start
The API is OpenAI-compatible. If you have used the OpenAI Python SDK before, swap the base URL and the API key and you are done:
from openai import OpenAI

client = OpenAI(
    api_key="your-deepseek-key",
    base_url="https://api.deepseek.com/v1",
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Fix this Python function..."}],
    temperature=0.2,
)
print(response.choices[0].message.content)
Important: Use Correct Sampling Settings
DeepSeek recommends temperature=1.0 and top_p=1.0 - the model was tuned with these settings, so do not carry over GPT or Claude defaults (usually 0.7/0.95). The examples in this guide use temperature=0.2 to favor deterministic output on code tasks; for general chat, stick with the recommended defaults.
The Pricing Trap Nobody Mentions
Every tutorial shows the rate card. The pricing looks great, but two things will surprise you.
| Model | Cache Miss | Cache Hit | Output |
|---|---|---|---|
| V4-Flash | $0.14/M | $0.028/M | $0.28/M |
| V4-Pro | $1.74/M | $0.145/M | $3.48/M |
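The rate card is easier to reason about as a function. Here is a minimal cost estimator built directly from the table above (the dictionary keys and function name are ours, not part of any SDK):

```python
# USD per million tokens, taken from the pricing table above.
PRICES = {
    "v4-flash": {"miss": 0.14, "hit": 0.028, "out": 0.28},
    "v4-pro":   {"miss": 1.74, "hit": 0.145, "out": 3.48},
}

def estimate_cost(model, miss_tokens=0, hit_tokens=0, out_tokens=0):
    """Estimated USD cost of one call, split by cache-miss/hit input and output."""
    p = PRICES[model]
    return (miss_tokens * p["miss"]
            + hit_tokens * p["hit"]
            + out_tokens * p["out"]) / 1_000_000
```

For example, a 20K-token prompt on Flash with no cache hit is `estimate_cost("v4-flash", miss_tokens=20_000)`, about $0.0028 - fractions of a cent, which is exactly why the catches below matter more than the headline rates.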
Catch #1: The 1M context is for input, not output. Most answers fit in 2,000 output tokens. If you expect to generate a 50K-token response, you are capped by max_tokens, not by the context window.
Catch #2: Thinking modes change your token burn rate. V4 supports three reasoning modes: non-thinking, thinking, and thinking_max. Prices are the same - but thinking modes consume more tokens because the model writes reasoning traces.
Cache Hits Are Your Best Friend
Cache-hit discount: roughly 80% off Flash, 92% off Pro on repeated prefixes. If you are running 100 queries with the same 20K-token system prompt on Flash, the first call's prompt costs $0.0028 (cache miss) and calls 2-100 cost $0.00056 each (cache hit). That is a 5x per-call difference, cutting total prompt spend from $0.28 to about $0.06 on an identical workload - and the gap scales with prompt size and call volume.
Cache Optimization Rule
Keep your first 1024+ tokens static across calls. Move dynamic content (timestamps, user IDs) to the end of the prompt. Hash your prefix - if it changes, cache invalidates for everyone.
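A minimal sketch of that rule in practice: keep the static prefix identical across calls, push per-request details to the end, and hash the prefix so a CI check can catch accidental invalidation. The helper names and session-tag format are our own convention, not a DeepSeek API:

```python
import hashlib
import json
import time

# Static across all calls - any edit here invalidates the prefix cache.
SYSTEM_PROMPT = "You are a senior code reviewer. Be terse and concrete."

def build_messages(user_text, user_id):
    # Static prefix first so the provider's prefix cache can match it;
    # dynamic content (user ID, timestamp) goes at the END of the prompt.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",
         "content": f"{user_text}\n\n[session: {user_id} @ {int(time.time())}]"},
    ]

def prefix_hash(messages, static_count=1):
    # Hash only the leading static messages. If this value changes between
    # deploys, you have invalidated the cache for every caller.
    blob = json.dumps(messages[:static_count], sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()
```

Two different requests should always produce the same `prefix_hash`; if they do not, something dynamic has leaked into your prefix.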
Three Gotchas That Will Break Your App
Rate Limits Are Dynamic and Invisible
DeepSeek does NOT impose a fixed, strict rate cap. Instead it uses dynamic controls: if the service is under heavy load or you send an unusually high volume of calls, the system may throttle you. Solution: add exponential backoff with jitter.
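A minimal backoff wrapper looks like this - "full jitter" exponential backoff, which you would wrap around your API call. The function is a generic sketch, not part of any DeepSeek SDK; in production you would narrow `retry_on` to rate-limit and timeout exceptions rather than all errors:

```python
import random
import time

def call_with_backoff(fn, max_attempts=6, base=0.5, cap=30.0,
                      retry_on=(Exception,)):
    """Retry fn() with exponential backoff plus full jitter.

    Sleeps a random duration in [0, min(cap, base * 2**attempt)] between
    attempts, which spreads out retries from many throttled clients.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts - surface the error
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Usage: `call_with_backoff(lambda: client.chat.completions.create(...))`. The jitter matters: without it, every throttled client retries on the same schedule and re-creates the spike that got them throttled.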
Your Old Model IDs Are Already on V4
deepseek-chat and deepseek-reasoner will be fully retired after July 24, 2026. They already route to deepseek-v4-flash. Check your usage dashboard.
Think Max Mode Needs Huge Context
For Think Max reasoning mode, DeepSeek recommends setting the context window to at least 384K tokens. If you allocate 128K (the V3 default), Think Max will either fail silently or truncate reasoning chains.
When to Use V4 (and When Not To)
| Use V4 | Skip V4 |
|---|---|
| Processing full codebases (1M context) | Image/video generation (not supported) |
| 100K+ token context regularly | Guaranteed sub-second latency |
| Cost per token matters | Compliance bans Chinese AI |
| Open weights for local deployment | 24/7 SLA guarantees |
What to Do Next
Do not just read about V4. Test it on your actual workload. Here is the smallest useful experiment:
- Take a task you currently send to GPT-4 or Claude (code review, summarization, extraction)
- Run it through V4-Flash with temperature=0.2
- Log the token count, response time, and output quality
- Compare the cost
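The four steps above can be wrapped in a tiny harness. `run_trial` is our own sketch: it times any zero-argument callable that returns `(output_text, prompt_tokens, completion_tokens)` - e.g. a thin wrapper around `client.chat.completions.create(...)` that reads `response.usage` - and reports the numbers you need to log:

```python
import time

def run_trial(call_fn, price_in_per_m, price_out_per_m):
    """Time one model call; report tokens, latency, and estimated cost.

    call_fn: zero-arg callable returning (text, prompt_tokens, completion_tokens).
    Prices are USD per million tokens (Flash cache-miss: 0.14 in, 0.28 out).
    """
    start = time.perf_counter()
    text, p_tok, c_tok = call_fn()
    latency = time.perf_counter() - start
    cost = (p_tok * price_in_per_m + c_tok * price_out_per_m) / 1_000_000
    return {
        "latency_s": round(latency, 3),
        "prompt_tokens": p_tok,
        "completion_tokens": c_tok,
        "cost_usd": cost,
        "output": text,
    }
```

Run the same `call_fn` against your current provider's pricing and you have a like-for-like cost and latency comparison in a dozen lines.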
If quality is close and cost is 10x lower, you have found a use case. The API key takes 3 minutes. The test takes 10. Do it before your next standup.
Frequently Asked Questions
Is DeepSeek V4 better than GPT-4 or Claude?
Depends on the task. For coding and math, V4-Pro is competitive. For general knowledge and nuanced writing, Claude still leads. For cost-per-token, V4 wins by a wide margin.
Can I run V4 locally?
Yes, weights are MIT licensed. You will need dual RTX 4090s or a single RTX 5090 for V4-Flash. V4-Pro needs a serious cluster (community reports suggest 4x H100 for 50-150 tokens/sec).
What is the real difference between Flash and Pro?
Flash is 13B active params, Pro is 49B. For simple tasks, Flash matches Pro. For complex multi-step reasoning and huge context, Pro pulls ahead. Start with Flash; upgrade only if you measure a quality gap.
Does V4 support images or video?
No, as of April 2026 V4 is text-only. DeepSeek states they are working on multimodal capabilities. Keep a fallback if your pipeline needs image inputs.
Why did DeepSeek release V4 the same day as GPT-5.5?
The timing was deliberate. DeepSeek needed a launch window where open-source 1M-context MoE at a fraction of the cost would not be buried under a closed-source price hike.
