How to Build a Qwen 3.7 Max AI Agent: Complete Step by Step Guide
Most AI demos still look like chatbots: you ask a question, the model answers, and the interaction ends. But real productivity work rarely happens in one step. A useful assistant needs to break down a vague request, gather context, compare options, produce a usable artifact, and check its own work before handing it back. Qwen 3.7 Max, released on May 20, 2026 by Alibaba's Qwen team, is an agent-first model designed for exactly these workflows. It excels at planning, tool use, coding, office automation, long-context reasoning, and multi-step execution. In this complete tutorial, you will build a compact productivity agent using Qwen 3.7 Max and Streamlit that takes a vague business request and turns it into a polished competitive analysis report through a fixed five-turn agent workflow.
What You'll Learn:
- How Qwen 3.7 Max differs from traditional LLMs for agentic workflows
- How to build a fixed five-turn agent workflow with planning, research, synthesis, writing, and self-review
- How to enable and preserve thinking across turns using enable_thinking and preserve_thinking
- How to add lightweight parallel research with ThreadPoolExecutor and fallback fixture data
- How to write, patch, and self-review a markdown artifact
- How to build a complete Streamlit progress UI with step cards and file preview
- How to handle provider errors and rate limits gracefully
What Is Qwen 3.7 Max?
Qwen 3.7 Max is Alibaba's flagship agentic model, optimized for tasks that require planning, tool use, coding, office automation, long-context reasoning, and multi-step execution. Unlike one-shot chat models, it is designed to stay coherent across a longer workflow involving multiple turns with different purposes: planning, evidence synthesis, long-form writing, self-critique, and revision. Early users on X have reported that it performs on par with Opus 4.7 and GPT 5.5 on long-horizon coding tasks.
Agent-First Architecture
Built specifically for agentic workflows, not one-shot chat. Stays coherent across multi-turn tasks where each turn has a different purpose: planning, research, writing, critique, and revision. Designed for the agent era with strengths across coding agents, productivity assistants, MCP-based workflows, and long-horizon autonomous tasks.
Top-Tier Coding Benchmarks
Scores 69.7 on Terminal-Bench 2.0-Terminus, 60.6 on SWE-bench Pro, 78.3 on SWE-bench Multilingual, and 47.2 on NL2Repo, all ahead of DeepSeek V4 Pro. In one standout long-horizon run, the model achieved a 10.0x geometric-mean speedup over the Triton reference after 35 hours of autonomous optimization.
Thinking Preservation
Supports enable_thinking and preserve_thinking flags that allow reasoning content to be passed across turns. The model can stay aligned with earlier decisions in multi-turn workflows. Planning criteria chosen in turn 1 remain consistent through synthesis and self-review in later turns.
General Agent Benchmarks
Scores 60.8 on MCP-Mark, 76.4 on MCP-Atlas, 67.2 on CoWorkBench, and 87.0 on SpreadSheetBench-v1. These benchmarks measure real agent capabilities like tool use, multi-step execution, and office automation rather than static question-answering.
Benchmark Performance Overview
The table below shows Qwen 3.7 Max benchmark scores across coding-agent and general-agent evaluations. All results are ahead of DeepSeek V4 Pro, with the largest gap in long-horizon coding (47.2 vs 35.5 on NL2Repo).
| Benchmark | Category | Score |
|---|---|---|
| Terminal-Bench 2.0-Terminus | Coding Agent | 69.7 |
| SWE-bench Pro | Coding Agent | 60.6 |
| SWE-bench Multilingual | Coding Agent | 78.3 |
| NL2Repo | Coding Agent | 47.2 |
| SciCode | Coding Agent | 53.5 |
| MCP-Mark | General Agent | 60.8 |
| MCP-Atlas | General Agent | 76.4 |
| CoWorkBench | General Agent | 67.2 |
| SpreadSheetBench-v1 | General Agent | 87.0 |
Project Overview: The Five-Turn Productivity Agent
The application takes a vague business request such as "I need to prepare a competitive analysis of the top 3 vector databases for a team presentation next week. I have no idea where to start." and runs a fixed five-turn workflow that produces a polished competitive_analysis.md file. The main project logic is split across two files: thinking_agent.py for the backend workflow and agent.py for the Streamlit interface. The backend centers on a ThinkingAgent class that runs the five-turn sequence and returns structured turn records, while the UI only needs to render each turn as it arrives.
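The article refers to turn records and ToolEvent entries without listing their fields. As a minimal sketch, assuming plain dataclasses with field names chosen here for illustration (phase, content, reasoning, tool_events, name, and detail are all hypothetical), the records the backend hands to the UI could look like this:

```python
from dataclasses import dataclass, field


@dataclass
class ToolEvent:
    """A side effect the agent performed, such as writing a file (hypothetical shape)."""
    name: str
    detail: str


@dataclass
class TurnRecord:
    """One completed turn, ready for the UI to render (hypothetical fields)."""
    phase: str
    content: str
    reasoning: str = ""
    # default_factory gives every record its own list instead of a shared one.
    tool_events: list[ToolEvent] = field(default_factory=list)
```

Keeping the records as plain data means the Streamlit side only reads fields and never calls the model itself.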
OpenRouter vs Qwen API
This tutorial uses OpenRouter as the model gateway because it provides an OpenAI-compatible endpoint. You can also access Qwen 3.7 Max via the official Alibaba Cloud Model Studio or any other compatible provider. The backend code uses the OpenAI Python SDK format, so switching providers only requires changing the base_url and api_key.
Step by Step Implementation Guide
Set Up Your Python Environment
Create and activate a Python virtual environment, then install three dependencies: httpx for lightweight web search, openai for Qwen 3.7 Max API calls via the OpenAI-compatible client, and streamlit for the interactive web interface. Run python3 -m venv .venv && source .venv/bin/activate followed by pip install httpx openai streamlit. Set your OPENROUTER_API_KEY environment variable or configure it directly in the app.
Configure the Qwen 3.7 Max Client
Create a build_client() function in thinking_agent.py that centralizes all provider-specific configuration. Use the OpenAI Python SDK with OpenRouter's base URL (https://openrouter.ai/api/v1), pass your API key plus the optional HTTP-Referer and X-Title headers that OpenRouter uses for app attribution, and keep qwen/qwen3.7-max as the default model ID sent with each request. The rest of the app never needs to know whether requests go through OpenRouter or another compatible endpoint.
Build the Five-Turn Agent Workflow
Implement the ThinkingAgent class with a run() method that executes exactly five controlled stages: Planning (converts vague prompt to structured plan), Research (collects external context), Synthesis (builds comparison matrix), Draft (writes the markdown report), and Self-Review (reviews and patches the report). Each call to _run_turn() has a specific phase, making the workflow easier to debug because each model call has one clear responsibility.
Enable and Preserve Thinking Across Turns
Add extra_body={"enable_thinking": True, "preserve_thinking": preserve_thinking} to the chat completion call. This allows Qwen 3.7 Max to produce reasoning content during each response and preserve that reasoning across later turns. After each model response, extract both the visible answer and the reasoning content, then append the assistant message back into the conversation history with the reasoning content included. If the planning turn decides the final report should compare databases across deployment model, scalability, developer experience, hybrid search, and enterprise readiness, later turns can continue using those same criteria without drifting.
Add Parallel Research with Fallback Data
Define a fixed target list of databases to research (Pinecone, Weaviate, Qdrant). Use concurrent.futures.ThreadPoolExecutor with max_workers=3 to launch one search per database in parallel. If a live search fails, fall back to built-in, deterministic fixture snippets for that database. The design principle is graceful degradation: one failed external search should never crash the agent.
Generate the Synthesis and Comparison Matrix
After research, run the synthesis turn (turn 3) to convert scattered information into a structured comparison matrix. By this time, the conversation history already contains the original request, the planning output, the research summaries, and optionally the preserved thinking from earlier turns. The synthesis prompt focuses on one job: converting research into a decision-ready comparison across dimensions such as deployment model, scalability, developer experience, hybrid search support, enterprise readiness, and best-fit use case.
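Pinning the comparison columns in the synthesis prompt is one way to keep the matrix aligned with the criteria chosen during planning. The prompt wording and the build_synthesis_prompt name below are assumptions; only the dimension list comes from this section:

```python
DIMENSIONS = (
    "deployment model",
    "scalability",
    "developer experience",
    "hybrid search support",
    "enterprise readiness",
    "best-fit use case",
)


def build_synthesis_prompt(databases: list[str], dimensions=DIMENSIONS) -> str:
    """Ask for one markdown table: a row per database, a column per dimension."""
    # Pre-build the table header so every row uses the same columns.
    header = "| Database | " + " | ".join(dimensions) + " |"
    divider = "|" + "---|" * (len(dimensions) + 1)
    lines = [
        "Using only the research already in this conversation, build a "
        "decision-ready comparison matrix.",
        f"Compare: {', '.join(databases)}.",
        "Return one markdown table with exactly these columns:",
        header,
        divider,
        "Mark any cell the research does not support as 'unknown'.",
    ]
    return "\n".join(lines)
```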
Write the Markdown Report
The draft turn (turn 4) uses all previous context to write the first report draft. The model response is cleaned using a _strip_markdown_fences() helper that removes triple backtick fences, then written to disk as competitive_analysis.md. The backend also records a ToolEvent for the file write so the Streamlit UI can show that a real file was created, not just another text response.
Implement Self-Review and Patch the Report
The final model turn (turn 5) is a self-review step. The agent re-reads the draft, checks it against the original goal, identifies weaknesses, and generates a revised version. The self-review prompt asks the model to check whether the recommendation is clear, comparison criteria are consistent, and the report is useful for a team presentation. The revised output overwrites the original file. This turns the app from a simple generator into a small editorial workflow where the output is the reviewed and patched version, not the first draft.
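The review-and-overwrite step can be sketched as a function that feeds the draft and the three review checks back to the model. Here revise is a hypothetical callable standing in for the model turn, the prompt wording is illustrative, and the empty-reply guard is an added safeguard rather than something the article describes:

```python
from pathlib import Path
from typing import Callable

REVIEW_CHECKS = (
    "Is the final recommendation clear?",
    "Are the comparison criteria used consistently?",
    "Is the report useful for a team presentation?",
)


def self_review_and_patch(report_path: Path, goal: str, revise: Callable[[str], str]) -> str:
    """Send the draft plus review checks to the model and overwrite the file."""
    draft = report_path.read_text(encoding="utf-8")
    prompt = (
        f"Original goal: {goal}\n\n"
        "Review the draft below against these checks, then return the full revised report:\n"
        + "\n".join(f"- {check}" for check in REVIEW_CHECKS)
        + f"\n\nDRAFT:\n{draft}"
    )
    revised = revise(prompt).strip()
    # Only overwrite when the model returned something; keep the draft otherwise.
    if revised:
        report_path.write_text(revised + "\n", encoding="utf-8")
    return revised or draft
```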
Build the Streamlit Progress Interface
Create agent.py with a render_streamlit_app() function. Use st.set_page_config(layout="wide") and a simple input area with a text box and a "Run Demo" button. When clicked, create a ThinkingAgent instance and call agent.run() with an on_turn callback that updates the progress bar and step cards. The callback receives a completed TurnRecord and refreshes the UI. During execution, the cards act as status indicators; after completion, they become clickable result views.
Preview and Download the Final Report
After the backend finishes, agent.py reads the final markdown file, checks that it exists, creates a download button with st.download_button(), and renders the markdown content directly in the Streamlit page using st.markdown(). This maintains a clean separation of responsibilities: thinking_agent.py writes the report and agent.py previews and downloads it.
Handle Errors and Rate Limits Gracefully
Add retry logic in the backend that attempts the model call, catches retryable failures, waits briefly (1.0s, then 2.0s), and tries again. Short delays smooth over temporary provider instability without making the user wait too long. In the Streamlit UI, display a clean error message instead of raw provider JSON or stack traces. Use a helper function like _friendly_rate_limit_message() that returns a clear explanation and practical next step for the user.
Run the Demo and Generate Your First Report
Start the app with streamlit run agent.py, enter a prompt like "I need to prepare a competitive analysis of the top 3 vector databases for a team presentation next week. I have no idea where to start.", and click Run Demo. The app moves through all five stages: Planning creates a structured plan, Research collects snippets, Synthesis builds a comparison matrix, Draft writes the first markdown report, and Self-Review patches the report. The final competitive_analysis.md is saved under output/showcase_runs/<timestamp>/with_preserve/competitive_analysis.md and displayed with a download button in the Streamlit UI.
Key Code: Client Configuration
The following code snippet shows how to configure the OpenAI-compatible client for Qwen 3.7 Max via OpenRouter. This build_client() function centralizes all provider configuration so the rest of the app never needs to know which provider is being used.
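```python
import os

from openai import OpenAI

DEFAULT_MODEL = "qwen/qwen3.7-max"
DEFAULT_BASE_URL = "https://openrouter.ai/api/v1"
DEFAULT_REFERER = "https://qwen.ai"
DEFAULT_TITLE = "Qwen3.7-Max Productivity Agent"


def build_client(
    api_key: str | None = None,
    base_url: str | None = None,
    referer: str = DEFAULT_REFERER,
    title: str = DEFAULT_TITLE,
) -> OpenAI:
    # Fall back to OPENROUTER_API_KEY; otherwise the OpenAI SDK would only
    # look for OPENAI_API_KEY when api_key is None.
    return OpenAI(
        base_url=base_url or DEFAULT_BASE_URL,
        api_key=api_key or os.environ.get("OPENROUTER_API_KEY"),
        # Optional OpenRouter headers identifying the calling app.
        default_headers={
            "HTTP-Referer": referer,
            "X-Title": title,
        },
    )
```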
Key Code: Thinking Preservation
The enable_thinking=True flag allows Qwen 3.7 Max to produce reasoning content during the response. The preserve_thinking flag controls whether that reasoning content is kept across later turns. This is the core mechanism that keeps the agent's reasoning consistent throughout the five-turn workflow.
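```python
# Excerpt from a ThinkingAgent method: self.client, self.config,
# request_messages, history, and preserve_thinking come from the enclosing scope.
response = self.client.chat.completions.create(
    model=self.config.model,
    messages=request_messages,
    # Provider-specific flags are passed through extra_body.
    extra_body={
        "enable_thinking": True,
        "preserve_thinking": preserve_thinking,
    },
)
message = response.choices[0].message
# Separate the visible answer from the model's reasoning trace.
assistant_content = _normalize_text(getattr(message, "content", ""))
reasoning_content = _extract_reasoning_content(message)

assistant_message = {
    "role": "assistant",
    "content": assistant_content,
}
# Carry the reasoning forward only when preservation is enabled for this run.
if preserve_thinking and reasoning_content:
    assistant_message["reasoning_content"] = reasoning_content
history.append(assistant_message)
```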
Key Code: Parallel Research with ThreadPoolExecutor
The research step launches three web searches in parallel using Python's concurrent.futures.ThreadPoolExecutor. If a live search fails, the system gracefully falls back to built-in fixture data so the workflow never breaks.
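```python
from concurrent.futures import ThreadPoolExecutor, as_completed

DATABASES = ("Pinecone", "Weaviate", "Qdrant")


# The wrapper name is illustrative; backend and FixtureSearchBackend
# are the project's search backends.
def run_research(backend, queries: dict[str, str]) -> dict:
    """Search each database in parallel, falling back to fixtures on failure."""
    results = {}
    with ThreadPoolExecutor(max_workers=3) as executor:
        # One search job per database, mapped back to its name.
        future_map = {
            executor.submit(backend.search, database, query): database
            for database, query in queries.items()
        }
        for future in as_completed(future_map):
            database = future_map[future]
            try:
                results[database] = future.result()
            except Exception as exc:
                # Graceful degradation: swap in deterministic fixture data.
                fallback = FixtureSearchBackend().search(database, queries[database])
                fallback.warning = (
                    f"Live search failed for {database} ({exc}). "
                    "Fallback demo research was used."
                )
                results[database] = fallback
    return results
```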
Key Code: Streamlit UI with Progress Callback
The Streamlit interface in agent.py keeps the UI simple: a prompt text area, a Run button, and a progress display. The on_turn callback updates the progress bar and step cards after each completed turn.
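```python
from pathlib import Path

import streamlit as st

# Assumed to live in thinking_agent.py.
from thinking_agent import ThinkingAgent, TurnRecord

# DEFAULT_GOAL, config, phase_order, _phase_label, and _render_step_cards
# are defined elsewhere in agent.py and are not shown here.


def render_streamlit_app() -> None:
    st.set_page_config(
        page_title="Qwen3.7-Max Demo Agent",
        page_icon=":material/psychology:",
        layout="wide",
        initial_sidebar_state="collapsed",
    )
    goal = st.text_area(
        "Prompt", value=DEFAULT_GOAL, height=120, label_visibility="collapsed"
    )
    run_clicked = st.button("Run Demo", use_container_width=True)

    # Placeholders are created once and refreshed in place after each turn.
    progress_placeholder = st.empty()
    steps_placeholder = st.empty()
    partial_turns: list[TurnRecord] = []

    # Defined before agent.run() so the backend can call it after every turn.
    def on_turn(turn: TurnRecord) -> None:
        partial_turns.append(turn)
        progress_placeholder.progress(
            len(partial_turns) / len(phase_order),
            text=f"Completed {_phase_label(turn.phase)}",
        )
        with steps_placeholder.container():
            _render_step_cards(partial_turns, result_ready=False)

    if run_clicked:
        agent = ThinkingAgent(config=config)
        result = agent.run(
            goal=goal,
            preserve_thinking=True,
            data_mode="live",
            on_turn=on_turn,
        )


def _render_file_preview(report_path: str) -> None:
    path = Path(report_path)
    if not path.exists():
        st.info("No file was written yet.")
        return
    content = path.read_text(encoding="utf-8")
    st.caption(f"Created file: {path.name}")
    st.download_button(
        "Download competitive_analysis.md",
        data=content,
        file_name=path.name,
        use_container_width=True,
    )
    st.markdown(content)


# `streamlit run agent.py` executes this file as __main__.
if __name__ == "__main__":
    render_streamlit_app()
```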
Retry Logic and Error Handling
Model providers can timeout or rate-limit requests. The backend implements a simple retry mechanism with short delays, and the UI displays clean error messages instead of raw provider errors.
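```python
import time
from typing import Any

# Excerpt from a ThinkingAgent method; self.client and the elided
# request arguments come from the enclosing scope.
delays = (1.0, 2.0)  # wait 1s, then 2s: three attempts in total
for attempt_index in range(len(delays) + 1):
    try:
        return self.client.chat.completions.create(...)
    except Exception as exc:
        # Fail fast on errors a retry cannot fix (bad key, bad request).
        if not _is_retryable_provider_error(exc):
            raise
        # Out of retries: surface the last error to the caller.
        if attempt_index >= len(delays):
            raise
        time.sleep(delays[attempt_index])


def _friendly_rate_limit_message(exc: Any) -> str:
    # Shown in the Streamlit UI instead of raw provider JSON or a stack trace.
    return (
        "Qwen 3.7 Max is temporarily rate-limited upstream. "
        "Please wait 30 to 60 seconds and try again."
    )
```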
Complete Code Repository
The full source code for this tutorial is available on GitHub at the official repository. The project is intentionally small with the main logic split across agent.py for the Streamlit UI and thinking_agent.py for the backend workflow. The requirements.txt file contains just three dependencies: httpx, openai, and streamlit.
Frequently Asked Questions
When should I use preserve_thinking=True with Qwen 3.7 Max?
preserve_thinking=True helps later turns stay aligned with earlier decisions in multi-turn workflows. In the five-turn agent example, if the planning turn chooses specific comparison criteria like scalability and enterprise readiness, the synthesis and self-review turns can continue using that same reasoning instead of drifting to different criteria.
Is Qwen 3.7 Max only for coding agents?
No. Coding is one of its major strengths, but the model is also positioned for productivity agents, office automation, MCP-based workflows, multilingual tasks, long-context reasoning, and long-horizon tool use. The SpreadSheetBench-v1 score of 87.0 demonstrates strong office automation capability.
Does the demo app use real web search or simulated data?
The default path attempts live DuckDuckGo scraping for real research. If that fails, it falls back to built-in fixture research snippets. This graceful degradation design ensures the app always completes the full workflow regardless of external search availability.
Is the model creating subagents in the five-turn demo app?
No. The app uses one model session with shared conversation history. The parallel part happens only in the research layer, where the backend launches three web searches concurrently and feeds the results back into the same model session for the synthesis turn.
Why separate agent.py and thinking_agent.py?
agent.py keeps the Streamlit UI readable by handling only page layout, prompt box, progress bar, step cards, result selector, and download button. thinking_agent.py holds the backend workflow, prompt construction, retries, research logic, and file operations so the agent behavior can be tested independently of the interface.
About the author
Head of AI Agents Practice
Builds production AI agents for US financial services, healthcare, and retail. SOC 2 Type II / HIPAA-scope deployments on AWS Bedrock. Anthropic Claude, OpenAI, LangChain, MCP.
