How to Use Claude Fable 5 API: Complete Python Tutorial Guide
Claude Fable 5 is Anthropic's most capable model, released on June 9, 2026. This tutorial is a step-by-step guide to using the Claude Fable 5 API from Python, from a first API call through advanced patterns such as structured output, streaming, and tool use loops. It covers authentication and client initialization, FastAPI integration, and production cost management. By the end of this guide you will have a working task-assistant project demonstrating every major feature of the Fable 5 API.
What You'll Learn:
- What Claude Fable 5 is, how it differs from previous Claude models, and when to choose it over Haiku, Sonnet, or Opus
- How to set up a Python project with the Anthropic SDK, python-dotenv, and Pydantic
- How to make basic messages.create API calls with system prompts and parse response content blocks
- How to use messages.parse with a Pydantic model to get structured, type-safe output from Fable 5
- How to stream responses token-by-token using messages.stream and the text_stream iterator
- How to define tools and run a complete tool use loop until end_turn or refusal
- How to count tokens before a call and calculate input/output costs at Fable 5 pricing
- How to handle refusals, control reasoning effort, and integrate Fable 5 into a FastAPI application
What Is Claude Fable 5?
Claude Fable 5 is Anthropic's most capable publicly available model as of June 2026. Released on June 9, 2026, it is identified by the model string claude-fable-5 in all API calls. Fable 5 is built on the Claude Mythos 5 architecture with additional safety classifiers layered on top, making it Anthropic's most safety-aligned release to date while also being the most capable for complex reasoning and long-context tasks.
The context window is 1 million tokens — large enough to load entire codebases, legal contracts, or research corpora in a single request. The maximum output per request is 128,000 tokens, which is sufficient for generating large documents, comprehensive test suites, or multi-file code changes in a single response. Adaptive thinking is always on in Fable 5 and cannot be disabled; every request benefits from extended internal reasoning before the response is generated.
Pricing is $10 per million input tokens and $50 per million output tokens. This makes Fable 5 the most expensive model in the current Claude lineup, but it is designed for tasks where the cost of an error far exceeds the cost of a more thorough analysis. Two important compliance constraints apply: Fable 5 has mandatory 30-day data retention and Zero Data Retention is unavailable, which means it may not be suitable for certain regulated industries or data privacy use cases. Check your data processing agreements before deploying Fable 5 in healthcare, legal, or financial contexts.
1M Token Context Window
Claude Fable 5 accepts up to 1 million input tokens per request and can return up to 128,000 output tokens. This allows entire codebases, books, or large document sets to be processed in a single API call, eliminating the need for chunking or summarization pipelines that lose context across segments.
Structured Output with Pydantic
The messages.parse method accepts a Pydantic BaseModel class as the output_format parameter and returns a parsed_output attribute that is already validated and type-safe. No regex parsing, JSON repair, or manual schema injection is required: the SDK and API handle enforcement for you.
Streaming and Tool Use
The messages.stream context manager exposes a text_stream iterator that skips thinking blocks automatically, delivering clean token-by-token output to your UI. Tool use enables multi-step agentic loops where Fable 5 calls your Python functions and incorporates their results into a final answer.
Cost-Aware Development
At $10/$50 per million tokens, production Fable 5 usage requires cost management from day one. The SDK provides messages.count_tokens to estimate costs before a call, and prompt caching reduces repeat-context costs to $1/M input tokens. The Batch API halves all prices for non-real-time workloads.
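To make the pricing above concrete, here is a minimal sketch of the arithmetic. The function name `estimate_request_cost` and its parameters are illustrative and not part of the SDK. It assumes the rates quoted in this guide ($10/M input, $50/M output, $1/M for cached input), treats the Batch API discount as a flat 50% on the whole request, and ignores any cache-write surcharge.

```python
# Rates in USD per million tokens, as quoted in this guide.
INPUT_PER_M = 10.0
OUTPUT_PER_M = 50.0
CACHED_INPUT_PER_M = 1.0


def estimate_request_cost(input_tokens, output_tokens, cached_tokens=0, batch=False):
    """Return the estimated USD cost of one request.

    cached_tokens is the portion of input_tokens served from the prompt cache.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached = input_tokens - cached_tokens
    cost = (
        uncached * INPUT_PER_M
        + cached_tokens * CACHED_INPUT_PER_M
        + output_tokens * OUTPUT_PER_M
    ) / 1_000_000
    # Assumption: the batch discount halves the total.
    return cost / 2 if batch else cost


print(f"{estimate_request_cost(100_000, 2_000):.2f}")                        # 1.10
print(f"{estimate_request_cost(100_000, 2_000, cached_tokens=90_000):.2f}")  # 0.29
print(f"{estimate_request_cost(100_000, 2_000, batch=True):.2f}")            # 0.55
```

With a 100,000-token context, output tokens are a small share of the bill, so caching the repeated context is where most of the saving comes from in this example.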
Step by Step Guide: Building a Task Assistant with Claude Fable 5
This step by step guide builds a task-assistant project that demonstrates every major Fable 5 API feature in sequence. Each step is self-contained — you can implement them independently if you only need one pattern — but they are designed to compose into a complete production-ready assistant application by step 6.
Set Up the Python Project
Create a new directory called task-assistant, create and activate a virtual environment, then install the Anthropic SDK alongside FastAPI, Uvicorn, python-dotenv, and Pydantic. Create a .env file containing your ANTHROPIC_API_KEY. Once python-dotenv has loaded that file into the environment, the Anthropic SDK reads the variable automatically when you call Anthropic() with no arguments, so there is no need to pass the key explicitly, and it should never be stored in source code. Initialize the client once at module level; creating a new Anthropic() instance on every request wastes connection pool resources.
Make Basic API Calls with messages.create
Use client.messages.create with model="claude-fable-5", a required max_tokens value, and a messages list containing at least one user-role message. The response object has a content list that may contain text blocks, thinking blocks, and tool_use blocks. Extract only text using the helper lambda shown in the code example. Token usage is available on response.usage.input_tokens and response.usage.output_tokens for cost tracking. Fable 5's tokenizer produces approximately 30% more tokens than older Claude models on the same input, so existing cost estimates need to be revised upward.
Add System Prompts for Consistent Behaviour
Pass the system parameter as a top-level field directly on the messages.create call — it is separate from the messages array and must never be placed inside the messages list as a system-role message. A well-crafted system prompt establishes the assistant's persona, output format, language preferences, and tool-use policy before the conversation begins. For long system prompts that repeat across many requests, prompt caching reduces their cost from $10/M to $1/M input tokens after the minimum 512-token cache threshold is reached. Store the system prompt in a Python constant at module level for easy reuse across functions.
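A minimal sketch of how a cached system prompt might be assembled. It assumes Fable 5 accepts the same prompt-caching request shape as the current Messages API, where `system` can be a list of text blocks and a block is marked with `cache_control`. The helper name `build_request` is hypothetical; it only builds keyword arguments and makes no network call.

```python
def build_request(system_prompt, user_message, model="claude-fable-5",
                  max_tokens=1024, cache_system=True):
    """Build keyword arguments for client.messages.create.

    The system prompt stays a top-level field; it is never added to messages.
    """
    if cache_system:
        # Block form: lets the system prompt carry a cache marker.
        system = [{
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},
        }]
    else:
        # Plain string form for short, uncached prompts.
        system = system_prompt
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": system,
        "messages": [{"role": "user", "content": user_message}],
    }


kwargs = build_request("You are a senior software architect assistant.", "Plan a login feature.")
print(kwargs["system"][0]["cache_control"])  # {'type': 'ephemeral'}
```

The result can be passed as `client.messages.create(**kwargs)`. Prompts shorter than the cache threshold mentioned above would simply not be cached.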
Use Structured Output with messages.parse and Pydantic
Define a Pydantic BaseModel subclass that describes the exact JSON structure you need, then pass it as the output_format argument to client.messages.parse. The SDK enforces the schema server-side and returns a parsed_output attribute that is already validated. Always check that parsed_output is not None before accessing its fields — a refusal or parsing failure leaves it as None. Critical constraint: do not use assistant turn prefilling when calling messages.parse. Adding an assistant-role message at the end of the messages list conflicts with the structured output enforcement and returns an HTTP 400 error.
Implement Streaming with messages.stream
Use client.messages.stream as a context manager and iterate over stream.text_stream to receive tokens as they are generated. The text_stream iterator automatically skips thinking blocks — you only receive the final text output without any filtering code on your side. After the loop, call stream.get_final_message() to obtain the complete response object including usage statistics. This is essential for cost tracking because token counts are only available on the final message, not during the stream. Streaming is the recommended pattern for any user-facing interface where perceived latency matters.
Add Tool Use for Agentic Workflows
Define tools as Python dictionaries with a name, description, and input_schema following the JSON Schema specification. Pass the tools list to messages.create and implement a loop that continues until stop_reason is "end_turn" or "refusal". On each iteration, check whether any content block has type == "tool_use", extract the tool name and input arguments, execute the corresponding Python function, and append a tool_result message to the conversation history before sending the next request. Fable 5's adaptive thinking makes it particularly effective at planning multi-step tool sequences and recovering from unexpected tool outputs without explicit re-prompting.
Step 1: Project Setup
Create the project directory, set up a virtual environment, and install all required packages. The anthropic package is the Anthropic Python SDK. fastapi and uvicorn are used in the final integration step. python-dotenv loads your ANTHROPIC_API_KEY from .env when load_dotenv() is called. pydantic provides the BaseModel class for structured output.
# Create project directory and virtual environment
mkdir task-assistant && cd task-assistant
python -m venv .venv
# Activate on macOS / Linux
source .venv/bin/activate
# Activate on Windows (PowerShell)
.venv\Scripts\Activate.ps1
# Install all required packages
pip install anthropic fastapi uvicorn python-dotenv pydantic
# Create .env file — add your API key here
# NEVER commit this file to version control
echo "ANTHROPIC_API_KEY=sk-ant-..." > .env
echo ".env" >> .gitignore
# Verify installation
python -c "import anthropic; print(anthropic.__version__)"
After installation, create the main module file assistant.py. Initialize the Anthropic client at module level — once per process, not once per request:
"""
assistant.py — Task assistant powered by Claude Fable 5.
Initialize the client once at module level for connection pool reuse.
"""
from anthropic import Anthropic
from dotenv import load_dotenv
# Load ANTHROPIC_API_KEY from .env into the environment.
# The Anthropic() constructor reads it automatically — no argument needed.
load_dotenv()
# Module-level client: one instance, shared across all functions.
# Never instantiate Anthropic() inside a request handler or loop.
client = Anthropic()
# Model constant — update here when upgrading to a future version
MODEL = "claude-fable-5"
# Helper: extract first text block from a response
get_text = lambda r: next((b.text for b in r.content if b.type == "text"), "")
# System prompt constant — stored once, reused across calls
# Long system prompts (512+ tokens) benefit from prompt caching ($1/M instead of $10/M)
SYSTEM_PROMPT = """You are a senior software architect assistant.
You help developers plan features, review code, and estimate complexity.
Always provide actionable recommendations backed by concrete reasoning.
When asked to plan a feature, break it into discrete implementation steps
with clear file paths and test coverage requirements."""
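Because the client reads the key from the environment, a missing or empty .env entry only surfaces at the first API call. The sketch below fails fast instead. `require_api_key` is a hypothetical helper, not an SDK function; it accepts an optional mapping so it can be exercised without touching the real environment.

```python
import os


def require_api_key(env=None):
    """Return ANTHROPIC_API_KEY from the environment or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "ANTHROPIC_API_KEY is not set. Add it to .env and call "
            "load_dotenv() before creating the client."
        )
    return key
```

Calling `require_api_key()` right after `load_dotenv()` and before `Anthropic()` turns a confusing authentication failure into an immediate, descriptive startup error.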
Step 2: Basic API Calls with messages.create
The messages.create method is the foundation of every Fable 5 integration. The max_tokens parameter is mandatory — the API returns a validation error if it is omitted. The response content list may contain multiple blocks of different types: text blocks contain the assistant's written response, thinking blocks contain Fable 5's internal reasoning (always present because adaptive thinking cannot be disabled), and tool_use blocks appear when the model invokes a tool. The helper lambda filters to text only, which is the correct approach for most non-agentic use cases.
from anthropic import Anthropic
client = Anthropic()
MODEL = "claude-fable-5"
# Helper: extract first text block from a response content list.
# Fable 5 always includes thinking blocks — this lambda skips them.
get_text = lambda r: next((b.text for b in r.content if b.type == "text"), "")
def ask(prompt: str, system: str = "", max_tokens: int = 512) -> str:
    """
    Send a single-turn message to Claude Fable 5 and return the text reply.
    Parameters
    ----------
    prompt : The user message text.
    system : Optional system prompt string (top-level field, NOT in messages).
    max_tokens : Maximum output tokens (required by the API; raise for long answers).
    Returns
    -------
    The first text block in the response, or an empty string on refusal.
    """
    kwargs = {
        "model": MODEL,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    # system is a top-level field; never place it inside the messages list
    if system:
        kwargs["system"] = system
    response = client.messages.create(**kwargs)
    # Check stop_reason before reading content
    if response.stop_reason == "refusal":
        category = (
            response.stop_details.category
            if response.stop_details else "policy"
        )
        print(f"[INFO] Request declined ({category}); no output generated.")
        return ""
    # Log token usage for cost tracking
    usage = response.usage
    INPUT_PRICE = 10.0 / 1_000_000  # $10 per million input tokens
    OUTPUT_PRICE = 50.0 / 1_000_000  # $50 per million output tokens
    cost = usage.input_tokens * INPUT_PRICE + usage.output_tokens * OUTPUT_PRICE
    print(
        f"[Usage] in={usage.input_tokens} out={usage.output_tokens} "
        f"cost=${cost:.4f}"
    )
    return get_text(response)
if __name__ == "__main__":
    reply = ask(
        prompt="What are the three main design patterns for async Python APIs?",
        system="You are a senior Python architect. Be concise.",
        max_tokens=512,
    )
    print(reply)
Key Insight: Adaptive Thinking Is Always On in Fable 5
Unlike Sonnet and Opus, which allow you to disable extended thinking, Claude Fable 5's adaptive thinking is always active and cannot be turned off. Every request goes through an internal reasoning phase before the final response is generated. This means every response content list will contain at least one thinking block in addition to the text block. Always use the get_text helper (or an equivalent filter) when you need only the text output — iterating directly over response.content without filtering will mix thinking text into your application output. The upside is that Fable 5 produces significantly better results on multi-step problems without any configuration changes on your part.
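The get_text helper returns only the first text block. When a response may contain several text blocks, a variant that joins all of them can be safer. The sketch below uses `SimpleNamespace` stand-ins shaped like content blocks, so it runs without an API call. `collect_text` and the sample data are illustrative, not part of the SDK.

```python
from types import SimpleNamespace


def collect_text(content):
    """Join the text of every block whose type is "text", in order."""
    return "".join(block.text for block in content if block.type == "text")


# Stand-in blocks shaped like response.content (made-up data for illustration).
fake_content = [
    SimpleNamespace(type="thinking", thinking="internal reasoning"),
    SimpleNamespace(type="text", text="Use asyncio. "),
    SimpleNamespace(type="text", text="Prefer task groups."),
]
print(collect_text(fake_content))  # Use asyncio. Prefer task groups.
```

Filtering on `block.type` rather than on attribute presence keeps the helper correct if new block types are added later.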
Step 3: Structured Output with Pydantic and messages.parse
Structured output eliminates the fragile JSON-parsing code that plagues most LLM integrations. Instead of asking the model to "respond only in JSON" and then parsing the string yourself, you define a Pydantic model that describes the exact schema you need. The Fable 5 API enforces this schema server-side and returns a parsed_output attribute that is already a validated Python object. The example below defines a FeaturePlan model for software feature planning — a natural fit for Fable 5's complex reasoning capabilities.
from anthropic import Anthropic
from pydantic import BaseModel
client = Anthropic()
MODEL = "claude-fable-5"
# ── Pydantic schema: define the exact structure you need ────────────────────
class FeaturePlan(BaseModel):
    """Structured plan for implementing a software feature."""
    summary: str  # One-paragraph feature overview
    steps: list[str]  # Ordered implementation steps
    files: list[str]  # Files to create or modify
    risks: list[str]  # Technical risks and mitigations
    tests: list[str]  # Test cases to cover
def plan_feature(feature_description: str) -> FeaturePlan | None:
    """
    Ask Claude Fable 5 to produce a structured feature plan.
    IMPORTANT: Do NOT add an assistant-role message at the end of the
    messages list when using messages.parse. Prefilling the assistant
    turn conflicts with structured output enforcement and returns HTTP 400.
    Returns
    -------
    A validated FeaturePlan object, or None if the request was refused
    or the schema could not be satisfied.
    """
    response = client.messages.parse(
        model=MODEL,
        max_tokens=2048,
        output_format=FeaturePlan,  # Pass the Pydantic class directly
        system=(
            "You are a senior software architect. "
            "Produce detailed, actionable plans that a mid-level developer "
            "can follow without additional clarification."
        ),
        messages=[
            {
                "role": "user",
                "content": (
                    "Plan the implementation of the following feature:\n"
                    f"{feature_description}\n"
                    "Include specific file paths relative to a Django project root."
                ),
            }
        ],
    )
    # Always check for None before accessing parsed_output fields.
    # None is returned on refusal or if the schema could not be satisfied.
    if response.parsed_output is None:
        print("[WARN] Structured output returned None; check stop_reason.")
        print(f" stop_reason={response.stop_reason}")
        return None
    plan: FeaturePlan = response.parsed_output
    # Log usage
    u = response.usage
    cost = (u.input_tokens * 10 + u.output_tokens * 50) / 1_000_000
    print(f"[Usage] in={u.input_tokens} out={u.output_tokens} cost=${cost:.4f}")
    return plan
if __name__ == "__main__":
    plan = plan_feature(
        "Add a real-time notification system to the user dashboard. "
        "Notifications should be pushed via WebSocket when a task is assigned "
        "or completed, and persisted to a PostgreSQL table for 30 days."
    )
    if plan:
        print("Summary:", plan.summary)
        print("\nImplementation Steps:")
        for i, step in enumerate(plan.steps, 1):
            print(f" {i}. {step}")
        print("\nFiles to modify:")
        for f in plan.files:
            print(f" - {f}")
        print("\nRisks:")
        for r in plan.risks:
            print(f" - {r}")
        print("\nTest Cases:")
        for t in plan.tests:
            print(f" - {t}")
Two Critical Constraints
1. Never pass API keys through frontend request bodies. API keys belong in server-side environment variables only. A frontend JavaScript application must never send the key in a request body or header that originates from a browser, because this exposes it to any user who inspects network traffic. Use a thin backend proxy instead. This applies to every API call, not only messages.parse.
2. Assistant turn prefilling returns HTTP 400. Adding an assistant-role message at the end of the messages array when using messages.parse causes the API to return an HTTP 400 Bad Request error. The structured output mode manages the assistant turn internally, so do not add one manually.
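The prefilling constraint can be checked locally before a request is sent, which avoids a round trip that ends in HTTP 400. The guard below is a sketch; `assert_no_prefill` is a hypothetical helper name and it inspects only the last message's role.

```python
def assert_no_prefill(messages):
    """Return messages unchanged, or raise ValueError if the list is empty
    or ends with an assistant turn (prefill)."""
    if not messages:
        raise ValueError("messages must contain at least one user message")
    if messages[-1].get("role") == "assistant":
        raise ValueError(
            "messages.parse does not accept a trailing assistant message (prefill)"
        )
    return messages


ok = assert_no_prefill([{"role": "user", "content": "Plan a search feature."}])
print(len(ok))  # 1
```

Earlier assistant turns in a multi-turn history are still allowed; only the final position is checked.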
Step 4: Streaming Responses with messages.stream
Streaming is the recommended delivery mechanism for any interface where the user is waiting for a response. Without streaming, the user sees nothing until Fable 5 finishes its full adaptive thinking pass and generates the complete response — which can take tens of seconds for complex outputs. With streaming, the first tokens arrive as soon as generation begins. The text_stream iterator handles the filtering of thinking blocks for you: only final text content is yielded.
from anthropic import Anthropic
client = Anthropic()
MODEL = "claude-fable-5"
def stream_response(
    user_message: str,
    system: str = "",
    max_tokens: int = 2048,
) -> tuple[str, dict]:
    """
    Stream a Fable 5 response and return the full text plus usage stats.
    The text_stream iterator yields only final text tokens; thinking blocks
    are filtered out automatically by the SDK. Call get_final_message() after
    the loop to obtain token counts; they are not available during streaming.
    Returns
    -------
    (full_text: str, usage: dict): the complete text and token counts.
    """
    full_text_parts: list[str] = []
    kwargs = {
        "model": MODEL,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": user_message}],
    }
    if system:
        kwargs["system"] = system
    with client.messages.stream(**kwargs) as stream:
        # text_stream skips thinking blocks; only final response text is yielded
        for chunk in stream.text_stream:
            print(chunk, end="", flush=True)  # stream to stdout in real time
            full_text_parts.append(chunk)
        # get_final_message() blocks until the stream completes,
        # then returns the full response object including usage statistics
        final = stream.get_final_message()
    print()  # newline after streaming completes
    usage = {
        "input_tokens": final.usage.input_tokens,
        "output_tokens": final.usage.output_tokens,
        "cost_usd": (
            final.usage.input_tokens * 10 / 1_000_000
            + final.usage.output_tokens * 50 / 1_000_000
        ),
    }
    return "".join(full_text_parts), usage
def stream_multi_turn(
    conversation: list[dict],
    system: str = "",
    max_tokens: int = 2048,
) -> str:
    """
    Stream a multi-turn conversation. The caller manages the messages list
    and appends the assistant reply after each turn.
    """
    kwargs = {
        "model": MODEL,
        "max_tokens": max_tokens,
        "messages": conversation,
    }
    if system:
        kwargs["system"] = system
    reply_chunks: list[str] = []
    with client.messages.stream(**kwargs) as stream:
        for chunk in stream.text_stream:
            print(chunk, end="", flush=True)
            reply_chunks.append(chunk)
    print()
    return "".join(reply_chunks)
if __name__ == "__main__":
    text, usage = stream_response(
        user_message=(
            "Write a complete Python implementation of a least-recently-used "
            "(LRU) cache with O(1) get and put operations. Include full type "
            "annotations, docstrings, and pytest tests."
        ),
        system="You are a Python expert. Write clean, production-ready code.",
        max_tokens=4096,
    )
    print(f"\n[Usage] in={usage['input_tokens']} out={usage['output_tokens']} cost=${usage['cost_usd']:.4f}")
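Since the case for streaming rests on perceived latency, it can be useful to measure time to first chunk. The sketch below consumes any iterable of text chunks, so `stream.text_stream` could be passed in directly; here it is driven by a plain list so it runs offline. The name `consume_text_stream` and its parameters are illustrative.

```python
import time


def consume_text_stream(chunks, on_chunk=None, clock=time.monotonic):
    """Consume an iterable of text chunks.

    Returns (full_text, seconds_to_first_chunk); the second value is None
    when the iterable yields nothing.
    """
    start = clock()
    first_chunk_after = None
    parts = []
    for chunk in chunks:
        if first_chunk_after is None:
            first_chunk_after = clock() - start
        if on_chunk is not None:
            on_chunk(chunk)  # e.g. write to a socket or to stdout
        parts.append(chunk)
    return "".join(parts), first_chunk_after


text, ttfc = consume_text_stream(["class LRU", "Cache:", " ..."])
print(text)  # class LRUCache: ...
```

Passing `on_chunk=lambda c: print(c, end="", flush=True)` reproduces the stdout behaviour of the streaming functions above while also recording the latency figure.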
Step 5: Tool Use for Agentic Workflows
Tool use (also called function calling) enables Fable 5 to invoke Python functions and incorporate their results into a multi-step reasoning process. The model does not call your functions directly — it returns a structured request describing which function to call and with what arguments, and your code executes the function and returns the result. The loop continues until stop_reason is "end_turn" (task complete) or "refusal" (policy violation). Fable 5's adaptive thinking makes it particularly reliable at planning tool call sequences and handling unexpected tool outputs without explicit re-prompting from the developer.
import json
import os
from pathlib import Path
from anthropic import Anthropic
client = Anthropic()
MODEL = "claude-fable-5"
# ── Tool Definitions ─────────────────────────────────────────────────────────
READ_FILE_TOOL = {
    "name": "read_project_file",
    "description": (
        "Read the contents of a file in the project directory. "
        "Use this to inspect source code, configuration files, and documentation."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {
                "type": "string",
                "description": "Relative path from the project root (e.g. 'src/models/user.py')",
            }
        },
        "required": ["path"],
    },
}
LIST_DIR_TOOL = {
    "name": "list_directory",
    "description": "List all files in a project directory (non-recursive).",
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {
                "type": "string",
                "description": "Relative directory path from project root (e.g. 'src/models')",
            }
        },
        "required": ["path"],
    },
}
TOOLS = [READ_FILE_TOOL, LIST_DIR_TOOL]
# ── Tool Implementations ─────────────────────────────────────────────────────
def read_project_file(path: str) -> str:
    try:
        return Path(path).read_text(encoding="utf-8")
    except FileNotFoundError:
        return f"Error: File not found at path '{path}'"
    except PermissionError:
        return f"Error: Permission denied reading '{path}'"
def list_directory(path: str) -> str:
    try:
        entries = sorted(os.listdir(path))
        return "\n".join(entries) if entries else "(empty directory)"
    except FileNotFoundError:
        return f"Error: Directory not found at path '{path}'"
def dispatch_tool(name: str, inputs: dict) -> str:
    """Route a tool_use request to the correct Python function."""
    if name == "read_project_file":
        return read_project_file(inputs["path"])
    elif name == "list_directory":
        return list_directory(inputs["path"])
    else:
        return f"Error: Unknown tool '{name}'"
# ── Agentic Tool Use Loop ─────────────────────────────────────────────────────
def run_agent(task: str, max_iterations: int = 10) -> str:
    """
    Run a Fable 5 agentic loop until end_turn or refusal.
    The loop:
    1. Send messages + tools to Fable 5
    2. If stop_reason == "tool_use": execute each tool_use block,
       append tool_result messages, and loop back
    3. If stop_reason == "end_turn": return the final text response
    4. If stop_reason == "refusal": log and return empty string
    5. Safety cap: stop after max_iterations to prevent runaway loops
    """
    messages = [{"role": "user", "content": task}]
    iteration = 0
    while iteration < max_iterations:
        iteration += 1
        print(f"[Agent] Iteration {iteration}/{max_iterations}")
        response = client.messages.create(
            model=MODEL,
            max_tokens=4096,
            system=(
                "You are a senior code reviewer. Explore the project structure "
                "using the available tools, then provide a thorough analysis."
            ),
            tools=TOOLS,
            messages=messages,
        )
        # Append assistant response to conversation history
        messages.append({"role": "assistant", "content": response.content})
        # ── Terminal conditions ───────────────────────────────────────────────
        if response.stop_reason == "end_turn":
            # Extract and return the final text response
            return next(
                (b.text for b in response.content if b.type == "text"), ""
            )
        if response.stop_reason == "refusal":
            category = (
                response.stop_details.category
                if response.stop_details else "policy"
            )
            print(f"[Agent] Refusal ({category}); stopping loop.")
            return ""
        # ── Handle tool_use blocks ────────────────────────────────────────────
        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type != "tool_use":
                    continue
                print(f"[Tool] Calling {block.name}({json.dumps(block.input)})")
                result = dispatch_tool(block.name, block.input)
                print(f"[Tool] Result ({len(result)} chars)")
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result,
                })
            # Append all tool results as a single user message
            messages.append({"role": "user", "content": tool_results})
            continue  # Loop back to send tool results to the model
        # Unexpected stop_reason: return here so the max_iterations
        # message below is not printed misleadingly
        print(f"[Agent] Unexpected stop_reason: {response.stop_reason}")
        return ""
    print(f"[Agent] Reached max_iterations ({max_iterations}); stopping.")
    return ""
if __name__ == "__main__":
    result = run_agent(
        "List the files in the current directory, then read 'assistant.py' "
        "and identify any potential improvements to the error handling."
    )
    print("\n--- Agent Result ---")
    print(result)
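The file tools above read whatever path the model supplies, so a production version would normally confine them to the project root and report failures explicitly. The sketch below shows one way to do both. `resolve_in_root` and `run_tool` are hypothetical helpers, and the `is_error` flag assumes Fable 5 accepts the same tool_result shape as the current Messages API.

```python
from pathlib import Path


def resolve_in_root(root, relative_path):
    """Resolve relative_path under root; raise ValueError if it escapes root."""
    root_path = Path(root).resolve()
    candidate = (root_path / relative_path).resolve()
    if candidate != root_path and root_path not in candidate.parents:
        raise ValueError(f"path escapes project root: {relative_path!r}")
    return candidate


def run_tool(registry, block_id, name, inputs):
    """Run a tool from registry and wrap the outcome as a tool_result block."""
    handler = registry.get(name)
    if handler is None:
        content, is_error = f"Unknown tool '{name}'", True
    else:
        try:
            content, is_error = str(handler(**inputs)), False
        except Exception as exc:  # report the failure to the model, do not crash
            content, is_error = f"{type(exc).__name__}: {exc}", True
    return {
        "type": "tool_result",
        "tool_use_id": block_id,
        "content": content,
        "is_error": is_error,
    }


registry = {"echo": lambda text: text.upper()}
print(run_tool(registry, "toolu_demo", "echo", {"text": "hi"})["content"])  # HI
```

A registry dictionary replaces the if/elif chain in dispatch_tool, and returning errors as tool results gives the model a chance to recover, for example by trying a different path.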
Step 6: Token Counting and Cost Management
At $10 per million input tokens and $50 per million output tokens, production Fable 5 usage requires active cost tracking from day one. The Anthropic SDK provides messages.count_tokens to estimate the input token count before you commit to a full API call. This is particularly valuable for large-context use cases: knowing that a request will cost $0.80 in input tokens alone before sending it allows you to decide whether to truncate the context, switch to a cheaper model, or proceed with confidence. The code below also includes a prompt-cache pricing constant, the effort parameter, and a helper for choosing between models.
from anthropic import Anthropic
client = Anthropic()
# ── Pricing constants (per token) ────────────────────────────────────────────
# Fable 5 tokenizer produces ~30% more tokens than older Claude models
# on equivalent input — update existing cost estimates accordingly.
PRICING = {
    "claude-haiku-4-5": {"input": 1.0 / 1_000_000, "output": 5.0 / 1_000_000},
    "claude-sonnet-4-6": {"input": 3.0 / 1_000_000, "output": 15.0 / 1_000_000},
    "claude-opus-4-8": {"input": 5.0 / 1_000_000, "output": 25.0 / 1_000_000},
    "claude-fable-5": {"input": 10.0 / 1_000_000, "output": 50.0 / 1_000_000},
}
# Prompt cache pricing: $1/M input tokens after first cache write
# Minimum cache block size: 512 tokens
CACHE_INPUT_PRICE = 1.0 / 1_000_000
def estimate_cost(model: str, system: str, messages: list[dict]) -> dict:
    """
    Count tokens and estimate cost BEFORE sending the request.
    Uses messages.count_tokens, which does NOT generate a response
    and is much cheaper than a full API call. Returns a dict with
    token count and estimated input cost.
    """
    count_response = client.messages.count_tokens(
        model=model,
        system=system,
        messages=messages,
    )
    input_tokens = count_response.input_tokens
    prices = PRICING.get(model, PRICING["claude-fable-5"])
    estimated_input_cost = input_tokens * prices["input"]
    return {
        "model": model,
        "input_tokens": input_tokens,
        "estimated_input_cost_usd": estimated_input_cost,
    }
def calculate_actual_cost(model: str, usage) -> float:
    """Calculate exact cost from a completed response's usage object."""
    prices = PRICING.get(model, PRICING["claude-fable-5"])
    return usage.input_tokens * prices["input"] + usage.output_tokens * prices["output"]
def send_with_effort(prompt: str, effort: str = "high") -> str:
    """
    Send a request to Fable 5 with a specific effort level.
    effort values: "low" | "medium" | "high" | "xhigh" | "max"
    Higher effort = more internal reasoning steps = better quality,
    but higher latency and potentially more output tokens.
    "high" is the recommended default for complex tasks.
    """
    model = "claude-fable-5"
    system = "You are a senior software architect."
    messages_list = [{"role": "user", "content": prompt}]
    # Estimate cost before sending
    estimate = estimate_cost(model, system, messages_list)
    print(
        f"[Pre-flight] model={model} input_tokens={estimate['input_tokens']} "
        f"est_input_cost=${estimate['estimated_input_cost_usd']:.4f}"
    )
    response = client.messages.create(
        model=model,
        max_tokens=4096,
        system=system,
        output_config={"effort": effort},  # "low"|"medium"|"high"|"xhigh"|"max"
        messages=messages_list,
    )
    actual_cost = calculate_actual_cost(model, response.usage)
    print(
        f"[Actual] in={response.usage.input_tokens} "
        f"out={response.usage.output_tokens} "
        f"cost=${actual_cost:.4f}"
    )
    return next((b.text for b in response.content if b.type == "text"), "")
def choose_model(task_complexity: str) -> str:
    """
    Simple heuristic for model selection based on task complexity.
    Returns the recommended model string for the given complexity level.
    Fable 5 is only appropriate for the most demanding tasks; use
    lighter models for straightforward workloads to manage costs.
    """
    THRESHOLDS = {
        "simple": "claude-haiku-4-5",  # $1/$5: fastest, cheapest
        "moderate": "claude-sonnet-4-6",  # $3/$15: balanced quality/cost
        "complex": "claude-opus-4-8",  # $5/$25: high quality
        "critical": "claude-fable-5",  # $10/$50: most capable
    }
    return THRESHOLDS.get(task_complexity, "claude-sonnet-4-6")
if __name__ == "__main__":
    # Demonstrate pre-flight token counting
    test_system = "You are a senior software architect."
    test_messages = [
        {"role": "user", "content": "Design a microservices architecture for a B2B SaaS platform serving 10,000 concurrent users."}
    ]
    est = estimate_cost("claude-fable-5", test_system, test_messages)
    print(f"Pre-flight estimate: {est}")
    # Demonstrate effort control
    answer = send_with_effort(
        "What are the top five architectural mistakes in high-traffic Django applications?",
        effort="high",
    )
    print("\nResponse:", answer[:500], "...")
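A pre-flight token count becomes more useful when it feeds a decision. The sketch below turns a count into a worst-case cost, assuming the model uses its full max_tokens allowance, and rescales token counts measured with an older tokenizer by the roughly 30% figure this guide quotes. Both helper names are hypothetical, and the default rates are the Fable 5 prices stated above.

```python
def within_budget(input_tokens, max_output_tokens, budget_usd,
                  input_per_m=10.0, output_per_m=50.0):
    """Return (ok, worst_case_cost_usd) for a request that uses all of max_tokens."""
    worst_case = (
        input_tokens * input_per_m + max_output_tokens * output_per_m
    ) / 1_000_000
    return worst_case <= budget_usd, worst_case


def rescale_legacy_estimate(legacy_tokens, inflation=0.30):
    """Scale a token count from an older tokenizer by the quoted ~30% inflation."""
    return round(legacy_tokens * (1 + inflation))


ok, worst = within_budget(input_tokens=80_000, max_output_tokens=4096, budget_usd=1.00)
print(ok, f"{worst:.4f}")              # False 1.0048
print(rescale_legacy_estimate(10_000))  # 13000
```

In this example the $0.80 of input fits a $1.00 budget on its own, but the output allowance pushes the worst case over it, so a caller could lower max_tokens or route the task to a cheaper model.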
Handling Refusals
Fable 5 introduces structured refusals with a stop_reason of "refusal" and a stop_details object that classifies the refusal category. Refusal categories include "cyber" (cybersecurity attack assistance), "bio" (biological weapon information), and "reasoning_extraction" (attempts to extract internal reasoning traces). Critically, refusals do not generate billable output tokens — the model stops before generating content, so you are charged only for the input tokens consumed. Always check stop_reason before accessing response.content to avoid index errors on the empty content list that a refusal returns.
from anthropic import Anthropic

client = Anthropic()


def safe_call(prompt: str, system: str = "", max_tokens: int = 512) -> dict:
    """
    Make a Fable 5 API call with full stop_reason handling.

    stop_reason values for Fable 5:
        "end_turn"   - normal completion
        "max_tokens" - hit max_tokens limit; consider increasing it
        "tool_use"   - model wants to call a tool (agentic loop needed)
        "refusal"    - content policy violation; no output generated
        "pause_turn" - model paused mid-turn (rare; resume with empty user msg)

    Refusal categories (response.stop_details.category):
        "cyber"                - cybersecurity attack assistance
        "bio"                  - biological weapon information
        "reasoning_extraction" - attempt to extract internal reasoning traces

    Refusals are NOT billed for output tokens; only input tokens are consumed.
    """
    kwargs = {
        "model": "claude-fable-5",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    if system:
        kwargs["system"] = system

    response = client.messages.create(**kwargs)

    result = {
        "stop_reason": response.stop_reason,
        "text": "",
        "refusal_category": None,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "request_id": getattr(response, "_request_id", None),
    }

    if response.stop_reason == "end_turn":
        result["text"] = next(
            (b.text for b in response.content if b.type == "text"), ""
        )
    elif response.stop_reason == "refusal":
        # Refusals do not generate billable output tokens
        category = (
            response.stop_details.category
            if response.stop_details else "policy"
        )
        result["refusal_category"] = category
        print(f"[Refusal] Request declined — category: {category}")
        print(f"  Input tokens consumed: {response.usage.input_tokens}")
        print("  Output tokens billed: 0 (refusals are not billed)")
    elif response.stop_reason == "max_tokens":
        # Response was truncated: extract partial text and log a warning
        result["text"] = next(
            (b.text for b in response.content if b.type == "text"), ""
        )
        print(
            f"[Warning] Response truncated at {max_tokens} tokens. "
            "Consider increasing max_tokens."
        )
    elif response.stop_reason == "tool_use":
        # Unexpected tool_use outside an agentic loop
        print("[Warning] Received tool_use stop_reason outside agentic loop.")

    # Log request ID for support if something goes wrong
    if result["request_id"]:
        print(f"[Debug] request_id={result['request_id']}")
    return result
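A caller will usually branch on the dictionary that safe_call returns. The sketch below is one possible way to do that. The helper name describe_result and its message wording are hypothetical, and it assumes only the result keys built in safe_call above.

```python
def describe_result(result: dict) -> str:
    """Map a safe_call-style result dict to a user-facing string.

    Hypothetical helper: it relies only on the keys that safe_call builds.
    """
    reason = result.get("stop_reason")
    if reason == "end_turn":
        return result.get("text", "")
    if reason == "max_tokens":
        # Partial text is still useful, so return it with a truncation marker.
        return result.get("text", "") + " [truncated]"
    if reason == "refusal":
        category = result.get("refusal_category") or "policy"
        return f"Request declined (category: {category})."
    # tool_use, pause_turn, or any value this sketch does not know about
    return f"No text returned (stop_reason={reason})."


if __name__ == "__main__":
    sample = {"stop_reason": "refusal", "text": "", "refusal_category": "cyber"}
    print(describe_result(sample))
```

Keeping this mapping separate from the API call means it can be unit-tested with plain dictionaries, without spending tokens.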
FastAPI Integration
For production web APIs, use the AsyncAnthropic client and manage its lifecycle with FastAPI's lifespan context manager. This ensures the client is initialized once when the server starts and closed cleanly when it stops, preventing connection leaks. The streaming endpoint below uses FastAPI's StreamingResponse with an async generator to forward Fable 5's token stream directly to the HTTP response. Rate limit errors (HTTP 429) and overload errors (HTTP 529) should be handled with retries; the SDK retries twice by default, but production services should configure 4-5 retries with exponential backoff.
"""
api.py — FastAPI application with Claude Fable 5 streaming endpoint.
Run with: uvicorn api:app --reload --port 8000
"""

from contextlib import asynccontextmanager
from typing import AsyncGenerator

from anthropic import AsyncAnthropic, RateLimitError, APIStatusError
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

MODEL = "claude-fable-5"

# Module-level async client: initialized in lifespan, shared across requests
client_instance: AsyncAnthropic | None = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    """
    FastAPI lifespan context manager.
    Initializes the AsyncAnthropic client on startup and closes it on shutdown.
    ANTHROPIC_API_KEY is read from the environment automatically.
    Never pass the API key through the request body from a frontend client.
    """
    global client_instance
    # max_retries=5 for production; the SDK default is 2
    client_instance = AsyncAnthropic(max_retries=5)
    print("[Startup] AsyncAnthropic client initialized")
    yield
    # Clean shutdown: close connection pool
    await client_instance.close()
    print("[Shutdown] AsyncAnthropic client closed")


app = FastAPI(
    title="Task Assistant API",
    description="Claude Fable 5 powered task assistant",
    lifespan=lifespan,
)

# ── Request / Response models ─────────────────────────────────────────────────
class ChatRequest(BaseModel):
    message: str
    system: str = "You are a helpful senior software architect."
    max_tokens: int = 2048


class ChatResponse(BaseModel):
    text: str
    input_tokens: int
    output_tokens: int
    cost_usd: float

# ── Streaming endpoint ────────────────────────────────────────────────────────
@app.post("/chat/stream")
async def chat_stream(req: ChatRequest) -> StreamingResponse:
    """
    Stream a Fable 5 response chunk-by-chunk as a plain-text HTTP response.
    The client receives text as it is generated, not all at once.
    """
    if client_instance is None:
        raise HTTPException(status_code=503, detail="Client not initialized")

    async def token_generator() -> AsyncGenerator[str, None]:
        try:
            async with client_instance.messages.stream(
                model=MODEL,
                max_tokens=req.max_tokens,
                system=req.system,
                messages=[{"role": "user", "content": req.message}],
            ) as stream:
                async for chunk in stream.text_stream:
                    yield chunk
        except RateLimitError:
            # HTTP 429: rate limit hit; the SDK retried max_retries times and failed
            yield "\n[Error: Rate limit exceeded. Please retry after a moment.]"
        except APIStatusError as e:
            if e.status_code == 529:
                # HTTP 529: Anthropic API overloaded
                yield "\n[Error: API overloaded. Please retry after a moment.]"
            else:
                yield f"\n[Error: API error {e.status_code}]"

    return StreamingResponse(token_generator(), media_type="text/plain")

# ── Non-streaming endpoint ────────────────────────────────────────────────────
@app.post("/chat", response_model=ChatResponse)
async def chat(req: ChatRequest) -> ChatResponse:
    """
    Send a message to Fable 5 and return the complete response with cost.
    """
    if client_instance is None:
        raise HTTPException(status_code=503, detail="Client not initialized")
    try:
        response = await client_instance.messages.create(
            model=MODEL,
            max_tokens=req.max_tokens,
            system=req.system,
            messages=[{"role": "user", "content": req.message}],
        )
    except RateLimitError:
        raise HTTPException(status_code=429, detail="Rate limit exceeded")
    except APIStatusError as e:
        raise HTTPException(status_code=e.status_code, detail=str(e))

    text = next((b.text for b in response.content if b.type == "text"), "")
    u = response.usage
    # Fable 5 pricing: $10 per million input tokens, $50 per million output tokens
    cost = (u.input_tokens * 10 + u.output_tokens * 50) / 1_000_000
    return ChatResponse(
        text=text,
        input_tokens=u.input_tokens,
        output_tokens=u.output_tokens,
        cost_usd=cost,
    )

# ── Health check ──────────────────────────────────────────────────────────────
@app.get("/health")
async def health():
    return {"status": "ok", "model": MODEL, "client_ready": client_instance is not None}
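The streaming endpoint above returns raw text chunks with media type text/plain. If a browser client expects Server-Sent Events framing instead, each chunk would need to be wrapped in the SSE wire format. The sketch below shows one way to do that framing. The function name sse_event is hypothetical and is not part of FastAPI or the Anthropic SDK.

```python
def sse_event(chunk: str) -> str:
    """Frame one text chunk as a Server-Sent Events message.

    Each line of the payload gets its own "data:" field, and a blank line
    terminates the event, as the SSE wire format requires.
    """
    lines = chunk.split("\n")
    return "".join(f"data: {line}\n" for line in lines) + "\n"


if __name__ == "__main__":
    # repr() makes the framing characters visible
    print(repr(sse_event("Hello")))
    print(repr(sse_event("line one\nline two")))
```

To use it, the generator would yield sse_event(chunk) instead of chunk, and the StreamingResponse would be created with media_type="text/event-stream". Splitting on newlines matters because a bare newline inside a data field would otherwise end the field early.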
Claude Model Comparison: Choosing the Right Model
Fable 5 is not the right choice for every task. The table below compares all current Claude models to help you select the most cost-effective option for a given workload. Note that Fable 5's tokenizer produces approximately 30% more tokens than older models on equivalent input, so its effective cost premium over Opus is larger than the raw per-token prices suggest for most typical workloads.
| Model | Model String | Input ($/M tokens) | Output ($/M tokens) | Best For |
|---|---|---|---|---|
| Haiku 4.5 | claude-haiku-4-5 | $1 | $5 | High-volume pipelines, classification, short FAQs, summarization |
| Sonnet 4.6 | claude-sonnet-4-6 | $3 | $15 | Balanced quality/cost, customer-facing chat, code review, RAG |
| Opus 4.8 | claude-opus-4-8 | $5 | $25 | Complex reasoning, multi-step analysis, document understanding |
| Fable 5 | claude-fable-5 | $10 | $50 | Most capable: 1M context, adaptive thinking, agentic coding, high-stakes decisions |
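To make the pricing comparison concrete, the following sketch computes the cost of one workload on each model using the prices in the table. It assumes, for illustration only, that the roughly 30% tokenizer difference applies equally to input and output tokens. The names PRICES, TOKENIZER_FACTOR, and workload_cost are hypothetical.

```python
# Prices in USD per million tokens (input, output), copied from the table above.
PRICES = {
    "claude-haiku-4-5": (1.0, 5.0),
    "claude-sonnet-4-6": (3.0, 15.0),
    "claude-opus-4-8": (5.0, 25.0),
    "claude-fable-5": (10.0, 50.0),
}

# Assumption: the ~30% tokenizer inflation applies to both input and output.
TOKENIZER_FACTOR = {"claude-fable-5": 1.3}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a workload measured in older-tokenizer token counts."""
    in_price, out_price = PRICES[model]
    factor = TOKENIZER_FACTOR.get(model, 1.0)
    billed_in = input_tokens * factor
    billed_out = output_tokens * factor
    return (billed_in * in_price + billed_out * out_price) / 1_000_000


if __name__ == "__main__":
    for model in PRICES:
        print(model, round(workload_cost(model, 200_000, 20_000), 2))
```

Under these assumptions, a workload of 200,000 input tokens and 20,000 output tokens costs about $1.50 on Opus 4.8 and about $3.90 on Fable 5. That is a ratio of 2.6x, compared with the 2x ratio of the listed per-token prices.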
When to Use vs. Avoid Claude Fable 5
| Use Fable 5 When... | Avoid Fable 5 When... |
|---|---|
| Complex multi-step reasoning where errors are expensive (architecture reviews, legal analysis, financial modelling) | Basic summarization or extraction tasks where Haiku 4.5 produces equivalent results at 10x lower cost |
| Software planning across large codebases using the full 1M token context window to review entire repositories | Short FAQ answering or customer service routing where speed and cost per query matter more than depth |
| Long-context document review — legal contracts, research papers, audit reports — where the full document must be in context | High-volume real-time pipelines (thousands of requests per minute) where Sonnet 4.6 is fast enough and 3x cheaper |
| Agentic coding tasks with many tool calls — debugging cascading failures, refactoring across multiple files, generating integration tests | Applications with strict Zero Data Retention requirements — Fable 5 has mandatory 30-day retention and ZDR is unavailable |
| Debugging problems with indirect effects where understanding causality across many files requires deep reasoning | Latency-sensitive applications where users expect sub-500ms first-token responses — Fable 5's adaptive thinking adds latency |
Production Deployment Checklist
Before deploying a Fable 5 integration to production, verify each of the following items. These are the most common sources of production incidents in Anthropic SDK integrations based on developer feedback.
Store API Keys as Server Environment Variables Only
Set ANTHROPIC_API_KEY as an environment variable on your server, CI/CD system, or secrets manager (AWS Secrets Manager, HashiCorp Vault, Doppler). Never commit it to source control, never log it, and never expose it to browser clients. For frontend applications, all API calls must go through a server-side proxy — the ANTHROPIC_API_KEY must never appear in a JavaScript bundle or be sent in a request body from a browser.
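One way to enforce this at startup is to fail fast when the variable is missing, rather than discovering it on the first request. The sketch below is a minimal example. The function name require_api_key is hypothetical, and the env parameter exists only so the check can be tested without touching the real environment.

```python
import os
from typing import Mapping


def require_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return ANTHROPIC_API_KEY, or raise at startup if it is missing or blank."""
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        # The message names the variable but never echoes any key material.
        raise RuntimeError(
            "ANTHROPIC_API_KEY is not set; configure it in the server environment."
        )
    return key
```

Calling require_api_key() once during application startup surfaces a misconfigured deployment immediately. The Anthropic client still reads the variable on its own, so the returned value does not need to be passed around or logged.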
Configure Retry Logic for Rate Limits and Overload
HTTP 429 means you have hit your rate limit. HTTP 529 means the Anthropic API is temporarily overloaded. The SDK retries twice by default, with exponential backoff between attempts. For production services, initialize the client with Anthropic(max_retries=5) to raise the retry count. For batch workloads, consider the Batch API, which is not subject to the same rate limits and costs 50% less per token across all Claude models.
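To get a feel for how much waiting a higher retry count can add, the sketch below generates an illustrative exponential backoff schedule. The base delay, cap, and jitter values are assumptions chosen for the example, and backoff_delays is a hypothetical helper. The SDK's internal timing may differ.

```python
import random


def backoff_delays(max_retries: int, base: float = 0.5, cap: float = 8.0,
                   jitter: float = 0.0) -> list[float]:
    """Illustrative exponential backoff schedule, in seconds, per retry attempt.

    base, cap, and jitter are assumed values for illustration, not SDK settings.
    """
    delays = []
    for attempt in range(max_retries):
        # Double the delay on each attempt, but never exceed the cap.
        delay = min(cap, base * (2 ** attempt))
        # Optional jitter shrinks the delay by up to `jitter` (0.0 to 1.0).
        delay *= 1 - jitter * random.random()
        delays.append(delay)
    return delays


if __name__ == "__main__":
    schedule = backoff_delays(5)
    print(schedule, "total:", sum(schedule))
```

With these assumed values and no jitter, five retries add up to 15.5 seconds of waiting in the worst case. That is worth weighing against any request timeout in front of your service.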
Log Input and Output Tokens on Every Request
Structured logging of token counts enables cost attribution, anomaly detection, and capacity planning. At minimum, log model, input_tokens, output_tokens, stop_reason, and the computed cost per request. Aggregate these logs in a time-series dashboard to detect token count spikes (often caused by unexpectedly large user inputs reaching the 1M context limit) and to project monthly costs before they appear on your invoice.
Log request_id for Support Escalation
The response object includes a _request_id field that uniquely identifies the API call on Anthropic's infrastructure. Log this value alongside your own request ID and user ID. If a response appears incorrect or a rate limit is incorrectly applied, providing the request_id to Anthropic support allows them to investigate the specific request without any ambiguity. This is the single most valuable piece of information for diagnosing API-side issues.
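The two logging recommendations above can be combined into a single structured record per request. The sketch below shows one possible shape. The logger name, the usage_record and log_usage helpers, and the field names are all hypothetical, and the prices are the Fable 5 rates quoted in this guide.

```python
import json
import logging
from typing import Optional

logger = logging.getLogger("fable5.usage")  # hypothetical logger name

# USD per million tokens for claude-fable-5, as quoted in this guide.
INPUT_PRICE = 10.0
OUTPUT_PRICE = 50.0


def usage_record(model: str, input_tokens: int, output_tokens: int,
                 stop_reason: str, request_id: Optional[str] = None) -> dict:
    """Build one structured log record for a completed request."""
    cost = (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000
    return {
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "stop_reason": stop_reason,
        "request_id": request_id,
        "cost_usd": round(cost, 6),
    }


def log_usage(record: dict) -> None:
    """Emit the record as a single JSON line for downstream aggregation."""
    logger.info(json.dumps(record, sort_keys=True))
```

In a handler, the record could be built from response.usage, response.stop_reason, and getattr(response, "_request_id", None), then passed to log_usage. One JSON object per line keeps the output easy to ingest into most log pipelines.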
Frequently Asked Questions
What is the model string for Claude Fable 5 in the Python SDK?
The model string is claude-fable-5. Pass it as the model parameter in any client.messages.create, client.messages.parse, client.messages.stream, or client.messages.count_tokens call. Fable 5 was released June 9, 2026 and is available to all Anthropic API customers at $10/M input and $50/M output tokens. There is no "preview" or "latest" alias — always use the explicit version string to avoid unexpected behaviour when newer models are released.
Can I disable adaptive thinking in Claude Fable 5?
No. Adaptive thinking is always on in Claude Fable 5 and cannot be disabled. Every response will include internal reasoning before the final text output. The text_stream iterator and the get_text lambda helper both filter out thinking blocks automatically, so you do not need to handle them explicitly unless you want to expose the reasoning to users. If you need to minimize response latency for simple queries, use claude-haiku-4-5 or claude-sonnet-4-6 instead — they allow thinking to be disabled or constrained by budget.
Why does messages.parse return HTTP 400 when I add an assistant message?
Adding an assistant-role message at the end of the messages list when calling messages.parse is called "assistant turn prefilling" — it instructs the model to continue from a partial assistant response. This technique conflicts with the structured output enforcement mechanism that messages.parse uses internally, causing the API to return HTTP 400 Bad Request. Remove any trailing assistant-role messages from the messages list and the error will resolve. If you need prefilling for a non-structured call, use messages.create instead.
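A small guard can remove trailing assistant turns before the list is handed to messages.parse. The sketch below is a minimal version. The helper name strip_trailing_assistant is hypothetical, and it assumes messages are plain dictionaries with a "role" key.

```python
def strip_trailing_assistant(messages: list[dict]) -> list[dict]:
    """Return a copy of messages with any trailing assistant turns removed.

    Earlier assistant turns in the conversation history are left untouched;
    only a prefill-style tail is dropped.
    """
    cleaned = list(messages)  # shallow copy so the caller's list is not mutated
    while cleaned and cleaned[-1].get("role") == "assistant":
        cleaned.pop()
    return cleaned


if __name__ == "__main__":
    history = [
        {"role": "user", "content": "Extract the tasks from this note."},
        {"role": "assistant", "content": "{"},  # prefill that would trigger HTTP 400
    ]
    print(strip_trailing_assistant(history))
```

Only the tail is trimmed, so multi-turn histories that legitimately contain earlier assistant messages pass through unchanged.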
Does Claude Fable 5 support Zero Data Retention?
No. Claude Fable 5 has mandatory 30-day data retention and Zero Data Retention (ZDR) is not available for this model. If your application handles data that is subject to HIPAA, GDPR Right to Erasure, or other regulations requiring immediate data deletion after processing, you must use a different Claude model that supports ZDR or implement your own data handling controls outside the API. Check the Anthropic privacy policy and your data processing agreement for the exact retention terms before deploying Fable 5 in regulated contexts.
How do I reduce Claude Fable 5 API costs in production?
Four strategies reduce Fable 5 costs effectively: (1) Prompt caching — system prompts and repeated context blocks longer than 512 tokens cost $1/M instead of $10/M after the first cache write; (2) Batch API — non-real-time workloads processed through the Batch API cost 50% less per token across all Claude models; (3) Model routing — use messages.count_tokens to estimate complexity and route simple tasks to Haiku 4.5 ($1/$5) or Sonnet 4.6 ($3/$15) instead of Fable 5; (4) Effort control — for tasks that do not require maximum reasoning depth, pass output_config={"effort": "medium"} to reduce internal reasoning steps and output token counts.
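Strategy (3), model routing, can be sketched as a small pure function. The token thresholds and the needs_deep_reasoning flag below are hypothetical values chosen for illustration, and pick_model is not an SDK function. A real router would be tuned against your own traffic and quality measurements.

```python
# Hypothetical thresholds for illustration; tune them against your own traffic.
SMALL_PROMPT_TOKENS = 2_000
LARGE_PROMPT_TOKENS = 200_000


def pick_model(input_tokens: int, needs_deep_reasoning: bool = False) -> str:
    """Choose a model string from a pre-flight token count and a task flag."""
    if needs_deep_reasoning or input_tokens > LARGE_PROMPT_TOKENS:
        # Hard reasoning or very long context: pay for the most capable model.
        return "claude-fable-5"
    if input_tokens > SMALL_PROMPT_TOKENS:
        return "claude-sonnet-4-6"
    # Short, simple prompts go to the cheapest model.
    return "claude-haiku-4-5"


if __name__ == "__main__":
    print(pick_model(500))
    print(pick_model(50_000))
    print(pick_model(500, needs_deep_reasoning=True))
```

The input_tokens value would typically come from the input_tokens field of a client.messages.count_tokens result, computed before the main call is sent.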
About the author
Co-founder & AI Practice Lead, Braincuber Technologies
Co-founder at Braincuber. Builds production AI agents (Anthropic Claude, OpenAI, AWS Bedrock) for US fintech, healthcare, and retail clients with SOC 2 Type II / HIPAA-scope deployments. Joins every architecture review personally.
