How to Run Qwen3.5-27B Locally: Complete Step by Step Guide
By Braincuber Team
Published on April 20, 2026
Running large language models locally gives you complete control over your AI infrastructure. Instead of relying on expensive API calls to OpenAI or Anthropic, you can deploy powerful open-weight models like Qwen3.5-27B on your own GPU hardware. This complete tutorial shows you how to set up a remote H100 SXM instance on Vast.ai, serve Qwen3.5-27B with vLLM, connect it to OpenCode, and test its agentic coding ability on a real FastAPI project.
What You'll Learn:
- How to launch and access a Vast.ai H100 SXM GPU instance
- Setting up vLLM and serving Qwen3.5-27B as an OpenAI-compatible API
- Installing and configuring OpenCode to connect to your local model
- Testing the full setup with a real agentic coding task
- Understanding why vLLM is preferred over llama.cpp for 27B models
Why Choose vLLM Over llama.cpp for Qwen3.5-27B
While llama.cpp is excellent for many local inference setups, running a quantized 27B model through it often comes with trade-offs: lower output quality, reduced coding performance, and more setup friction. For larger models like Qwen3.5-27B, quantization can noticeably hurt reliability, and local deployments through llama.cpp can become difficult to manage if you want smooth, production-like agent behavior.
Instead, this tutorial uses vLLM on a rented H100 SXM GPU. That gives you a more stable OpenAI-compatible endpoint, better performance from the full model, and a much cleaner setup for agentic coding workflows. You get very high tokens-per-second throughput, a large context length, and a much smoother overall coding experience.
Prerequisites
Before you begin, make sure you have the following:
| Prerequisite | Description |
|---|---|
| Vast.ai Account | Create an account at vast.ai |
| Credits | At least $5 in credits to rent a GPU instance |
| Linux Terminal Knowledge | Basic familiarity with the Linux terminal |
GPU Recommendation
An H100 SXM is a strong choice because Qwen3.5-27B needs plenty of memory and fast throughput, especially for longer context windows and smoother coding-focused inference.
Step 1: Launch and Access Your Vast.ai Instance
We are using Vast.ai because it is a GPU marketplace that usually gives you far more flexibility on price and hardware than a traditional single-vendor cloud. You can rent anything from community-hosted machines to Secure Cloud / datacenter offers, which Vast marks with a blue datacenter label. For a setup like this, we strongly recommend choosing one of those blue-labeled datacenter machines for better reliability and a smoother experience.
Create a Vast.ai Account and Add Credits
Start by creating a Vast.ai account at vast.ai and adding enough credit to cover your session. Pricing changes constantly because it is a live marketplace, so treat any hourly rate as approximate rather than fixed.
Search for and Rent an H100 SXM Instance
Open the search page on Vast.ai, pick the PyTorch template with Jupyter enabled, and look for a 1x H100 SXM offer. Click to rent the machine.
Access the Instance Portal
Go to the Instances tab. Once the instance finishes loading and is running, click the Open button to launch the instance portal in your browser. From there, you can open Jupyter and start a terminal session directly on the remote machine.
Keep Two Terminals Open
Keep two terminals open: one terminal to serve Qwen3.5-27B with vLLM, and one terminal to install and run OpenCode. Keeping these tasks separate makes the workflow much easier to manage once the model server is running.
Step 2: Install vLLM and Serve Qwen3.5
In this step, you will create an isolated Python environment, install vLLM, and launch Qwen3.5-27B as an OpenAI-compatible API that OpenCode can talk to. vLLM is a high-throughput inference engine for large language models, designed to serve models efficiently through an API.
Create a Workspace and Virtual Environment
In your first terminal, create a clean workspace for the model server with the following commands:
mkdir qwen-server
cd qwen-server/
uv venv --python 3.12
source .venv/bin/activate
This gives you a dedicated Python 3.12 environment so the model server dependencies stay isolated from the rest of the machine.
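The commands above assume the uv tool is already available on the instance. If the `uv` command is not found, you can install it first with its standalone installer (this is uv's standard installer URL) and then re-run the commands above:

```shell
# Install uv (a fast Python package and environment manager),
# then make sure its install directory is on PATH for this shell.
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"
```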
Install vLLM
Now install vLLM using the following command:
uv pip install vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
Note
You may not need the nightly wheels for this setup, since current vLLM support for Qwen3.5 often works with the standard installation as well.
Start the Qwen3.5 Server
Once vLLM is installed, start the model server with the following command. The first time you run this command, vLLM will download the model and tokenizer files:
vllm serve Qwen/Qwen3.5-27B \
--host 127.0.0.1 \
--port 8000 \
--api-key local-dev-key \
--served-model-name qwen3.5-27b-local \
--tensor-parallel-size 1 \
--max-model-len 64000 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3
After the model downloads and loads into GPU memory, you should see a message showing that the server startup is complete. This launches Qwen3.5-27B locally on port 8000 and protects the endpoint with the API key local-dev-key.
| Parameter | Purpose |
|---|---|
| --served-model-name | Gives the model a name OpenCode can reference |
| --enable-auto-tool-choice | Supports agent-style behavior |
| --tool-call-parser qwen3_coder | Suited for Qwen coding workflows |
| --reasoning-parser qwen3 | Handles Qwen's reasoning output correctly |
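Before moving on, it is worth confirming the endpoint is actually up. In a separate terminal, a quick request to the standard OpenAI-compatible `/v1/models` route should return a JSON list that includes the served model name (this assumes the server from this step is running with the `local-dev-key` API key):

```shell
# List the models the server exposes; the response should include
# "qwen3.5-27b-local". The Bearer token must match --api-key.
curl -s http://127.0.0.1:8000/v1/models \
  -H "Authorization: Bearer local-dev-key"
```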
Leave Terminal Running
Leave this terminal running once the server is ready. Do not close it until you are done using the model.
Step 3: Install and Configure OpenCode
In this step, you will open a second terminal, install OpenCode, and connect it to the local vLLM endpoint you started in Step 2. OpenCode is an open-source coding agent that can run in the terminal and connect to local models through its provider configuration.
Open a Second Terminal
Go back to the instance portal and open a second Jupyter terminal. Keep your first terminal running, since that is the one serving Qwen3.5 through vLLM. If clicking the Jupyter Terminal button opens the same terminal again, you can usually open another one by changing the number at the end of the terminal URL. For example, if your current terminal ends in /terminals/1, change it to /terminals/2.
Install OpenCode
Run the following command to install OpenCode:
curl -fsSL https://opencode.ai/install | bash
Then refresh your shell:
exec bash
This reloads your shell so the opencode command is available right away.
Configure OpenCode to Use Local vLLM
OpenCode looks for its global config in ~/.config/opencode/opencode.json, and its provider settings support custom baseURL and apiKey values. That makes it a good fit for a local OpenAI-compatible server like the one you launched with vLLM. Create the config file with:
mkdir -p ~/.config/opencode && cat > ~/.config/opencode/opencode.json <<'EOF'
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"vllm": {
"npm": "@ai-sdk/openai-compatible",
"name": "Local vLLM",
"options": {
"baseURL": "http://127.0.0.1:8000/v1",
"apiKey": "local-dev-key"
},
"models": {
"qwen3.5-27b-local": {
"name": "Qwen3.5-27B Local"
}
}
}
},
"model": "vllm/qwen3.5-27b-local",
"small_model": "vllm/qwen3.5-27b-local"
}
EOF
This tells OpenCode to use your local vLLM endpoint at http://127.0.0.1:8000/v1 instead of a hosted API. It also uses the same API key you set when launching the vLLM server in Step 2, so OpenCode can authenticate successfully.
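If you want to sanity-check the provider settings before launching OpenCode, you can replay what OpenCode will do: send a chat completion request to the same baseURL with the same key and model name. A short reply confirms that the URL, key, and `--served-model-name` all line up:

```shell
# Minimal chat completion against the local vLLM server, using the
# same model name and API key that opencode.json references.
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Authorization: Bearer local-dev-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-27b-local",
    "messages": [{"role": "user", "content": "Say hello in one word."}]
  }'
```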
Step 4: Connect OpenCode to the Local Model
Now, create a project directory for testing the coding agent:
mkdir investment-api
cd investment-api
opencode
This launches OpenCode inside the investment-api folder and uses your local Qwen3.5 model as its backend. From here, you can begin asking it to inspect files, explain code, or generate new project components using the model running on your rented GPU.
Step 5: Test Qwen3.5 on a Real Coding Task
Now it is time to test the full setup on a real coding task. Start with a very simple prompt just to make sure everything is working. If you get a reply back within a second or two, the model is connected properly.
Next, give it a more realistic agentic coding task. For example:
Build a FastAPI backend for investment market data that
collects the latest public data every few minutes for S&P 500,
gold, silver, Brent, major indices, and selected stocks.
After around a minute of reasoning, the model may come back with useful follow-up questions: which data source to use, what kind of storage or database to set up, and which stock tickers to include. This is a good sign, because it shows the model is not jumping into code blindly; it is clarifying the requirements first.
Once that is done, it creates a plan and starts working through the task step by step. It breaks the problem into smaller tasks, adds them to its to-do list, and then completes each part one by one. In less than five minutes, it can create the full project structure and generate a mostly working FastAPI backend, which is very fast for a task like this.
After that, ask it to test the API and run a smoke test. You will get an almost-working financial API. There may still be a few issues to fix, but for a short run like this, the performance is very impressive.
Performance Advantages of This Setup
High Tokens-Per-Second Throughput
With this setup, you get very high tokens-per-second throughput, making it practical for longer prompts and faster response times.
Large Context Length
The large context length supports longer code files and multi-file projects without losing track of important details.
Smoother Coding Experience
Experience a much smoother overall coding experience compared to running quantized models locally.
Full Model Performance
Avoid the performance trade-offs of quantization and get the full power of Qwen3.5-27B for agentic workflows.
Compared with trying to run a heavily quantized 27B model locally through llama.cpp, this setup is often more stable and better suited for serious coding work. You avoid many of the performance trade-offs, memory limitations, and setup issues that can show up when pushing larger models into a purely local environment.
Frequently Asked Questions
Why choose vLLM over llama.cpp for Qwen3.5-27B?
While llama.cpp works well for many local setups, running a quantized 27B model often results in lower output quality and reduced coding performance. vLLM provides an OpenAI-compatible endpoint, better full-model performance, and a cleaner setup for agentic coding workflows without quantization trade-offs.
How much does it cost to run Qwen3.5-27B on Vast.ai?
H100 SXM instances on Vast.ai typically cost around $1-3 per hour depending on the offer. You can start with as little as $5 in credits to test the setup. Pricing varies on the live marketplace.
Can I use a different GPU besides H100?
Yes, you can use other GPUs like A100 or even consumer cards like RTX 4090, but you may need to adjust the tensor-parallel-size and max-model-len parameters based on available VRAM. Performance will vary.
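As a rough illustration (the checkpoint name and numbers below are assumptions, not tested values): on a 24 GB card, a 27B model in 16-bit weights will not fit, so you would point vLLM at a quantized build of the model, if one is published, and shrink the context window. The flags themselves are standard vLLM options:

```shell
# Hypothetical settings for a 24 GB GPU such as an RTX 4090.
# <quantized-checkpoint> stands in for an AWQ/FP8 build of the model;
# the smaller context and higher memory utilization trade headroom
# for fitting the weights and KV cache at all.
vllm serve <quantized-checkpoint> \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size 1
```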
What is the max context length for Qwen3.5-27B with vLLM?
This tutorial sets --max-model-len to 64000 tokens. The actual maximum depends on your GPU memory. Qwen3.5 supports up to 128K context, but you'll need sufficient VRAM to accommodate it.
How do I connect other coding tools to this setup?
Since vLLM provides an OpenAI-compatible API, you can connect any tool that supports OpenAI API endpoints. Simply point the tool to http://127.0.0.1:8000/v1 and use the API key you configured (local-dev-key).
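For many tools, you do not even need tool-specific configuration: the official OpenAI SDKs (and clients built on them) read the standard `OPENAI_BASE_URL` and `OPENAI_API_KEY` environment variables, so exporting them is often enough to redirect a tool at the vLLM server:

```shell
# Point OpenAI-SDK-based tools at the local vLLM server via the
# standard environment variables the SDKs read at startup.
export OPENAI_BASE_URL="http://127.0.0.1:8000/v1"
export OPENAI_API_KEY="local-dev-key"
```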
Need Help Setting Up Your Local AI Infrastructure?
Our experts can help you configure local LLM deployments, optimize vLLM settings, and integrate agentic coding workflows into your development pipeline.
