How to Run Qwen3.5-27B Locally: Complete Step by Step Guide
By Braincuber Team
Published on April 20, 2026
Running large language models locally gives you complete control over your AI infrastructure. Instead of relying on expensive API calls to OpenAI or Anthropic, you can deploy powerful open-weight models like Qwen3.5-27B on your own GPU hardware. This complete tutorial shows you how to set up a remote H100 SXM instance on Vast.ai, serve Qwen3.5-27B with vLLM, connect it to OpenCode, and test its agentic coding ability on a real FastAPI project.
What You'll Learn:
- How to launch and access a Vast.ai H100 SXM GPU instance
- Setting up vLLM and serving Qwen3.5-27B as an OpenAI-compatible API
- Installing and configuring OpenCode to connect to your local model
- Testing the full setup with a real agentic coding task
- Understanding why vLLM is preferred over llama.cpp for 27B models
Why Choose vLLM Over llama.cpp for Qwen3.5-27B
While llama.cpp is excellent for many local inference setups, running a quantized 27B model through it often comes with trade-offs: lower output quality, reduced coding performance, and more setup friction. For larger models like Qwen3.5-27B, quantization can noticeably hurt reliability, and local deployments through llama.cpp can become difficult to manage if you want smooth, production-like agent behavior.
Instead, this tutorial uses vLLM on a rented H100 SXM GPU. That gives you a more stable OpenAI-compatible endpoint, better performance from the full model, and a much cleaner setup for agentic coding workflows. You get very high tokens-per-second throughput, a large context length, and a much smoother overall coding experience.
Prerequisites
Before you begin, make sure you have the following:
| Prerequisite | Description |
|---|---|
| Vast.ai Account | Create an account at vast.ai |
| Credits | At least $5 in credits to rent a GPU instance |
| Linux Terminal Knowledge | Basic familiarity with the Linux terminal |
GPU Recommendation
An H100 SXM is a strong choice because Qwen3.5-27B needs plenty of memory and fast throughput, especially for longer context windows and smoother coding-focused inference.
Step 1: Launch and Access Your Vast.ai Instance
We are using Vast.ai because it is a GPU marketplace that usually gives you far more flexibility on price and hardware than a traditional single-vendor cloud. You can rent anything from community-hosted machines to Secure Cloud / datacenter offers, which Vast marks with a blue datacenter label. For a setup like this, we strongly recommend choosing one of those blue-labeled datacenter machines for better reliability and a smoother experience.
Create a Vast.ai Account and Add Credits
Start by creating a Vast.ai account at vast.ai and adding enough credit to cover your session. Pricing changes constantly because it is a live marketplace, so treat any hourly rate as approximate rather than fixed.
Search for and Rent an H100 SXM Instance
Open the search page on Vast.ai, pick the PyTorch template with Jupyter enabled, and look for a 1x H100 SXM offer. Click to rent the machine.
Access the Instance Portal
Go to the Instances tab. Once the instance finishes loading and is running, click the Open button to launch the instance portal in your browser. From there, you can open Jupyter and start a terminal session directly on the remote machine.
Keep Two Terminals Open
Keep two terminals open: one terminal to serve Qwen3.5-27B with vLLM, and one terminal to install and run OpenCode. Keeping these tasks separate makes the workflow much easier to manage once the model server is running.
Step 2: Install vLLM and Serve Qwen3.5
In this step, you will create an isolated Python environment, install vLLM, and launch Qwen3.5-27B as an OpenAI-compatible API that OpenCode can talk to. vLLM is a high-throughput inference engine for large language models, designed to serve models efficiently through an API.
Create a Workspace and Virtual Environment
In your first terminal, create a clean workspace for the model server with the following commands:
mkdir qwen-server
cd qwen-server/
uv venv --python 3.12
source .venv/bin/activate
This gives you a dedicated Python 3.12 environment so the model server dependencies stay isolated from the rest of the machine.
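The commands above assume the uv tool is already available on the instance. If the `uv` command is not found, you can install it first with its standalone installer (this is uv's standard installer URL) and then re-run the commands above:

```shell
# Install uv (a fast Python package and environment manager),
# then make sure its install directory is on PATH for this shell.
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"
```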
Install vLLM
Now install vLLM using the following command:
uv pip install vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
Note
You may not need the nightly wheels for this setup, since current vLLM support for Qwen3.5 often works with the standard installation as well.
Start the Qwen3.5 Server
Once vLLM is installed, start the model server with the following command. The first time you run this command, vLLM will download the model and tokenizer files:
vllm serve Qwen/Qwen3.5-27B \
--host 127.0.0.1 \
--port 8000 \
--api-key local-dev-key \
--served-model-name qwen3.5-27b-local \
--tensor-parallel-size 1 \
--max-model-len 64000 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3
After the model downloads and loads into GPU memory, you should see a message showing that the server startup is complete. This launches Qwen3.5-27B locally on port 8000 and protects the endpoint with the API key local-dev-key.
| Parameter | Purpose |
|---|---|
| --served-model-name | Gives the model a name OpenCode can reference |
| --enable-auto-tool-choice | Supports agent-style behavior |
| --tool-call-parser qwen3_coder | Suited for Qwen coding workflows |
| --reasoning-parser qwen3 | Handles Qwen's reasoning output correctly |
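Before moving on, it is worth confirming the endpoint is actually up. In a separate terminal, a quick request to the standard OpenAI-compatible `/v1/models` route should return a JSON list that includes the served model name (this assumes the server from this step is running with the `local-dev-key` API key):

```shell
# List the models the server exposes; the response should include
# "qwen3.5-27b-local". The Bearer token must match --api-key.
curl -s http://127.0.0.1:8000/v1/models \
  -H "Authorization: Bearer local-dev-key"
```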
Leave Terminal Running
Leave this terminal running once the server is ready. Do not close it until you are done using the model.
Step 3: Install and Configure OpenCode
In this step, you will open a second terminal, install OpenCode, and connect it to the local vLLM endpoint you started in Step 2. OpenCode is an open-source coding agent that can run in the terminal and connect to local models through its provider configuration.
Open a Second Terminal
Go back to the instance portal and open a second Jupyter terminal. Keep your first terminal running, since that is the one serving Qwen3.5 through vLLM. If clicking the Jupyter Terminal button opens the same terminal again, you can usually open another one by changing the number at the end of the terminal URL. For example, if your current terminal ends in /terminals/1, change it to /terminals/2.
Install OpenCode
Run the following command to install OpenCode:
curl -fsSL https://opencode.ai/install | bash
Then refresh your shell:
exec bash
This reloads your shell so the opencode command is available right away.
Configure OpenCode to Use Local vLLM
OpenCode looks for its global config in ~/.config/opencode/opencode.json, and its provider settings support custom baseURL and apiKey values. That makes it a good fit for a local OpenAI-compatible server like the one you launched with vLLM. Create the config file with:
mkdir -p ~/.config/opencode && cat > ~/.config/opencode/opencode.json <<'EOF'
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"vllm": {
"npm": "@ai-sdk/openai-compatible",
"name": "Local vLLM",
"options": {
"baseURL": "http://127.0.0.1:8000/v1",
"apiKey": "local-dev-key"
},
"models": {
"qwen3.5-27b-local": {
"name": "Qwen3.5-27B Local"
}
}
}
},
"model": "vllm/qwen3.5-27b-local",
"small_model": "vllm/qwen3.5-27b-local"
}
EOF
This tells OpenCode to use your local vLLM endpoint at http://127.0.0.1:8000/v1 instead of a hosted API. It also uses the same API key you set when launching the vLLM server in Step 2, so OpenCode can authenticate successfully.
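If you want to sanity-check the provider settings before launching OpenCode, you can replay what OpenCode will do: send a chat completion request to the same baseURL with the same key and model name. A short reply confirms that the URL, key, and `--served-model-name` all line up:

```shell
# Minimal chat completion against the local vLLM server, using the
# same model name and API key that opencode.json references.
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Authorization: Bearer local-dev-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-27b-local",
    "messages": [{"role": "user", "content": "Say hello in one word."}]
  }'
```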
Step 4: Connect OpenCode to the Local Model
Now, create a project directory for testing the coding agent:
mkdir investment-api
cd investment-api
opencode
This launches OpenCode inside the investment-api folder and uses your local Qwen3.5 model as its backend. From here, you can begin asking it to inspect files, explain code, or generate new project components using the model running on your rented GPU.
Step 5: Test Qwen3.5 on a Real Coding Task
Now it is time to test the full setup on a real coding task. Start with a very simple prompt just to make sure everything is working. If you get a reply back within a second or two, the model is connected properly.
Next, give it a more realistic agentic coding task. For example:
Build a FastAPI backend for investment market data that
collects the latest public data every few minutes for S&P 500,
gold, silver, Brent, major indices, and selected stocks.
After around a minute of reasoning, the model may come back with useful follow-up questions: which data source to use, what kind of storage or database to set up, and which stock tickers to include. This is a good sign, because it shows the model is not jumping into code blindly; it is clarifying the requirements first.
Once that is done, it creates a plan and starts working through the task step by step. It breaks the problem into smaller tasks, adds them to its to-do list, and then completes each part one by one. In less than five minutes, it can create the full project structure and generate a mostly working FastAPI backend, which is very fast for a task like this.
After that, ask it to test the API and run a smoke test. You will get an almost-working financial API. There may still be a few issues to fix, but for a short run like this, the performance is very impressive.
Performance Advantages of This Setup
High Tokens-Per-Second Throughput
With this setup, you get very high tokens-per-second throughput, making it practical for longer prompts and faster response times.
Large Context Length
The large context length supports longer code files and multi-file projects without losing track of important details.
Smoother Coding Experience
Experience a much smoother overall coding experience compared to running quantized models locally.
Full Model Performance
Avoid the performance trade-offs of quantization and get the full power of Qwen3.5-27B for agentic workflows.
Compared with trying to run a heavily quantized 27B model locally through llama.cpp, this setup is often more stable and better suited for serious coding work. You avoid many of the performance trade-offs, memory limitations, and setup issues that can show up when pushing larger models into a purely local environment.
Frequently Asked Questions
Why choose vLLM over llama.cpp for Qwen3.5-27B?
While llama.cpp works well for many local setups, running a quantized 27B model often results in lower output quality and reduced coding performance. vLLM provides an OpenAI-compatible endpoint, better full-model performance, and a cleaner setup for agentic coding workflows without quantization trade-offs.
How much does it cost to run Qwen3.5-27B on Vast.ai?
H100 SXM instances on Vast.ai typically cost around $1-3 per hour depending on the offer. You can start with as little as $5 in credits to test the setup. Pricing varies on the live marketplace.
Can I use a different GPU besides H100?
Yes, you can use other GPUs like A100 or even consumer cards like RTX 4090, but you may need to adjust the tensor-parallel-size and max-model-len parameters based on available VRAM. Performance will vary.
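As a rough illustration (the checkpoint name and numbers below are assumptions, not tested values): on a 24 GB card, a 27B model in 16-bit weights will not fit, so you would point vLLM at a quantized build of the model, if one is published, and shrink the context window. The flags themselves are standard vLLM options:

```shell
# Hypothetical settings for a 24 GB GPU such as an RTX 4090.
# <quantized-checkpoint> stands in for an AWQ/FP8 build of the model;
# the smaller context and higher memory utilization trade headroom
# for fitting the weights and KV cache at all.
vllm serve <quantized-checkpoint> \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size 1
```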
What is the max context length for Qwen3.5-27B with vLLM?
This tutorial sets --max-model-len to 64000 tokens. The actual maximum depends on your GPU memory. Qwen3.5 supports up to 128K context, but you'll need sufficient VRAM to accommodate it.
How do I connect other coding tools to this setup?
Since vLLM provides an OpenAI-compatible API, you can connect any tool that supports OpenAI API endpoints. Simply point the tool to http://127.0.0.1:8000/v1 and use the API key you configured (local-dev-key).
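For many tools, you do not even need tool-specific configuration: the official OpenAI SDKs (and clients built on them) read the standard `OPENAI_BASE_URL` and `OPENAI_API_KEY` environment variables, so exporting them is often enough to redirect a tool at the vLLM server:

```shell
# Point OpenAI-SDK-based tools at the local vLLM server via the
# standard environment variables the SDKs read at startup.
export OPENAI_BASE_URL="http://127.0.0.1:8000/v1"
export OPENAI_API_KEY="local-dev-key"
```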
Need Help Setting Up Your Local AI Infrastructure?
Our experts can help you configure local LLM deployments, optimize vLLM settings, and integrate agentic coding workflows into your development pipeline.
