How to Run GLM 5.1 Locally: Complete Setup for Agentic Coding
By Braincuber Team
Published on April 18, 2026
GLM 5.1 is one of the strongest open-source AI models available for local deployment. According to Artificial Analysis, it ranks as the leading open-weights model on their Intelligence Index. Z.ai positions it as a flagship release for coding, reasoning, and agentic workflows. Running it locally gives you complete control over your data while maintaining privacy.
What You'll Learn:
- Configure an H100 GPU environment on RunPod
- Build llama.cpp with CUDA support for GPU acceleration
- Download and optimize the GLM 5.1 GGUF model
- Set up a local model serving API
- Test the API with cURL and OpenAI Python SDK
- Access the model through the llama.cpp WebUI
- Integrate GLM 5.1 with Claude Code for agentic coding
Why Run GLM 5.1 Locally?
Running GLM 5.1 locally provides several key advantages over cloud-based APIs. Your data never leaves your environment, which is critical for sensitive projects. You have full control over the model configuration and can experiment with different prompts without usage limits or API costs.
GLM 5.1 features 744B total parameters with 40B active per token, a 200K context window, and strong coding performance. The Unsloth Dynamic 2-bit GGUF version reduces storage requirements by approximately 80%, making deployment practical on a single rented data-center GPU rather than a multi-node cluster.
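Those storage figures follow from simple arithmetic: parameter count times bits per weight. The sketch below assumes 16 bits per weight for the full-precision model and roughly 2.5 effective bits per weight for the UD-IQ2_M quant (an assumption; dynamic quantization uses different precisions per tensor, so the effective rate is an average):

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough on-disk size: parameter count times bits per weight, in gigabytes."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# 16-bit weights: 744 * 2 bytes ~ 1488 GB, in the ballpark of the ~1.65TB figure
# once metadata and any higher-precision tensors are included.
print(model_size_gb(744, 16))   # 1488.0
# ~2.5 effective bits per weight lands inside the stated 220-236GB range.
print(model_size_gb(744, 2.5))  # 232.5
```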
Hardware Requirements
This tutorial uses an H100 GPU with 80GB VRAM and 125GB RAM. The 2-bit quantized model requires approximately 220-236GB storage after download.
Step 1: Rent and Configure an H100 GPU Pod
Start by accessing the RunPod platform and navigating to the Pods tab. Select an H100 SXM machine type for optimal performance with large language models.
Choose Template
Select the latest PyTorch template from the available options. This provides a ready-to-use environment with Python and CUDA drivers pre-configured.
Configure Storage
Set the container disk to 100GB and volume disk to 300GB. This provides sufficient space for model files, dependencies, and cached downloads.
Expose Port
Expose port 8910 for the local model server and llama.cpp WebUI. This port will be used for both services.
Add HuggingFace Token
Add your HuggingFace token as an environment variable named HF_TOKEN to access gated model files.
After deploying the pod, open the JupyterLab instance. Launch a new terminal and install the required system packages:
apt-get update
apt-get install -y pciutils build-essential cmake curl git tmux libcurl4-openssl-dev
Step 2: Build llama.cpp with CUDA Support
Now build llama.cpp with CUDA support to enable GPU acceleration for inference. This step compiles the necessary binaries for running large models on your H100 GPU.
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF \
-DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
--target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
Step 3: Download the GLM 5.1 GGUF Model
The full GLM 5.1 model has 744B parameters and requires approximately 1.65TB of storage. The Unsloth Dynamic 2-bit GGUF version reduces this to 220-236GB, making local deployment practical while maintaining strong performance.
| Model Version | Parameters | Storage |
|---|---|---|
| Full Model | 744B (40B active) | ~1.65TB |
| UD-IQ2_M (2-bit) | 744B (40B active) | ~220-236GB |
Install the required tools and download the model:
pip -q install -U "huggingface_hub[hf_xet]" hf-xet
pip -q install -U hf_transfer
export HF_XET_HIGH_PERFORMANCE=1
hf download unsloth/GLM-5.1-GGUF \
--local-dir models/GLM-5.1-GGUF \
--include "*UD-IQ2_M*"
The download takes approximately 17 minutes, depending on your connection speed. The model is split across multiple GGUF files, which llama.cpp stitches together automatically when the server loads the first shard.
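Before serving, it is worth confirming that every shard actually landed, since a partially downloaded split GGUF fails at load time. A minimal sketch, assuming the six-shard naming pattern visible in the serve command in Step 4 (`GLM-5.1-UD-IQ2_M-0000N-of-00006.gguf`):

```python
from pathlib import Path

def expected_shards(prefix: str, total: int) -> list[str]:
    """Build the shard filenames llama.cpp expects for a split GGUF."""
    return [f"{prefix}-{i:05d}-of-{total:05d}.gguf" for i in range(1, total + 1)]

def missing_shards(model_dir: str, prefix: str, total: int) -> list[str]:
    """Return any expected shard filenames not present in model_dir."""
    d = Path(model_dir)
    return [name for name in expected_shards(prefix, total) if not (d / name).exists()]

# Example for the download directory used in this tutorial:
# print(missing_shards("models/GLM-5.1-GGUF/UD-IQ2_M", "GLM-5.1-UD-IQ2_M", 6))
```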
Step 4: Start the GLM 5.1 Local Server
Launch the llama.cpp server to load the model into memory and expose it as a local API. The --fit on flag automatically places as much of the model on the GPU as possible while offloading the rest to system RAM.
./llama.cpp/llama-server \
--model ./models/GLM-5.1-GGUF/UD-IQ2_M/GLM-5.1-UD-IQ2_M-00001-of-00006.gguf \
--alias "GLM-5.1" \
--host 0.0.0.0 \
--port 8910 \
--jinja \
--fit on \
--threads 16 \
--threads-batch 16 \
--ctx-size 32768 \
--batch-size 2048 \
--ubatch-size 512 \
--flash-attn on \
--temp 0.7 \
--top-p 0.95 \
--cont-batching \
--metrics \
--perf
Key Parameter: --fit on
This flag automatically places as much of the model as possible on the GPU (80GB VRAM) while offloading the rest to system RAM (125GB). This is essential for running such a large model without manual layer placement.
Once the model loads, you will see a message indicating the server is listening at http://0.0.0.0:8910.
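Loading a model this large can take several minutes, so rather than watching the log, you can poll the server's `/health` endpoint, which llama.cpp serves with a 200 status once the model is ready. A stdlib-only sketch (host, port, and timeout values here match this tutorial's setup):

```python
import time
import urllib.error
import urllib.request

def server_url(host: str, port: int, path: str = "/health") -> str:
    """Build the health-check URL for the local llama.cpp server."""
    return f"http://{host}:{port}{path}"

def wait_until_ready(url: str, timeout_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Poll until the endpoint returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # still loading or not yet listening
        time.sleep(interval_s)
    return False

if __name__ == "__main__":
    print(wait_until_ready(server_url("127.0.0.1", 8910)))
```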
Step 5: Test the Local API with cURL
Open a new terminal and test the API to verify the model is responding correctly. The llama.cpp server provides OpenAI-compatible endpoints.
curl http://127.0.0.1:8910/v1/messages \
-H "Content-Type: application/json" \
-H "x-api-key: local-test" \
-d '{
"model": "GLM-5.1",
"max_tokens": 300,
"messages": [
{"role": "user", "content": "Write a Python hello world function."}
]
}'
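The same Anthropic-style request can be issued from Python using only the standard library, which is convenient for scripting repeatable checks. A sketch assuming the server from Step 4 is running on port 8910 (the `x-api-key` value is arbitrary, as in the cURL example):

```python
import json
import urllib.request

def build_messages_request(base_url: str, model: str, prompt: str, max_tokens: int = 300):
    """Build an Anthropic-style /v1/messages request for the local server."""
    payload = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/messages",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", "x-api-key": "local-test"},
    )

if __name__ == "__main__":
    req = build_messages_request("http://127.0.0.1:8910", "GLM-5.1",
                                 "Write a Python hello world function.")
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read()))
```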
Step 6: Connect with OpenAI Python SDK
Connect the local GLM 5.1 server to the OpenAI Python SDK. This enables seamless integration with existing applications that use the OpenAI client.
pip install openai
python - <<'PY'
from openai import OpenAI

client = OpenAI(
    api_key="local-key",  # any non-empty string; llama.cpp does not validate it
    base_url="http://127.0.0.1:8910/v1",
)

resp = client.completions.create(
    model="GLM-5.1",
    prompt=(
        "Answer briefly and in plain text only.\n"
        "Question: What is the capital city of Australia?\n"
        "Answer:"
    ),
    temperature=0.2,
    max_tokens=12,
)
print(resp.choices[0].text.strip())
PY
The expected output is Canberra. You can now point any existing OpenAI-compatible application to your local server.
Step 7: Access Through the llama.cpp WebUI
The llama.cpp server includes a built-in WebUI for interactive conversations. Return to your RunPod dashboard and access the exposed HTTP Service link for port 8910.
Interactive Chat
Chat with GLM 5.1 directly through the WebUI without using the terminal or API calls.
Generation Speed
Expect around 8 tokens per second for basic prompts. Throughput drops as prompt complexity and context length grow.
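At these speeds, response latency is easy to estimate: tokens requested divided by throughput. A quick helper (the 8 tok/s figure comes from the WebUI test above; real throughput varies with context length):

```python
def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Estimated wall-clock time to generate a response at a given throughput."""
    return tokens / tokens_per_second

# A 300-token reply at ~8 tok/s takes about 37.5 seconds; at the ~2 tok/s seen
# in long-context agentic runs (Step 8), the same reply takes ~150 seconds.
print(generation_seconds(300, 8))  # 37.5
print(generation_seconds(300, 2))  # 150.0
```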
Step 8: Integrate Claude Code with Local GLM 5.1
Connect Claude Code to use GLM 5.1 as the underlying model for agentic coding tasks. This allows you to leverage Claude Code's agent capabilities while running locally.
curl -fsSL https://claude.ai/install.sh | bash
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc
cat >> ~/.bashrc <<'EOF'
export ANTHROPIC_BASE_URL="http://127.0.0.1:8910"
export ANTHROPIC_AUTH_TOKEN="local-dev-token"
export ANTHROPIC_MODEL="GLM-5.1"
export ANTHROPIC_DEFAULT_SONNET_MODEL="GLM-5.1"
export API_TIMEOUT_MS=1200000
EOF
source ~/.bashrc
mkdir -p test-claude-local
cd test-claude-local
claude
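If Claude Code ignores your local server and tries to reach Anthropic's cloud, the usual cause is an environment variable that did not get exported into the shell that launched it. A small pre-flight check you can run before starting claude (the variable names match the ~/.bashrc block above):

```python
import os

REQUIRED = ("ANTHROPIC_BASE_URL", "ANTHROPIC_AUTH_TOKEN", "ANTHROPIC_MODEL")

def missing_vars(env=None):
    """Return the required Claude Code variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]

if __name__ == "__main__":
    missing = missing_vars()
    print("ok" if not missing else f"missing: {', '.join(missing)}")
```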
Performance Note
Claude Code with local GLM 5.1 works but is noticeably slower than cloud APIs. With longer context and coding prompts, generation can drop to approximately 2 tokens per second. Consider disabling thinking mode for faster responses.
Common Issues and Fixes
This section covers common problems you may encounter when running GLM 5.1 locally and their solutions.
Model Fails to Load or Crashes
The model is too large for available memory. Try a smaller quantization or lower the context size. Use --fit on for automatic memory management.
WebUI Does Not Open
Open the WebUI through the RunPod HTTP Service link for port 8910, not the JupyterLab URL. The server binds to 0.0.0.0:8910, so any proxy URL exposing that port will reach it.
API Works in One Tool But Not Another
Check the base URL and endpoint path. llama.cpp supports both OpenAI routes (/v1) and Anthropic Messages routes (/v1/messages).
Model Feels Too Slow
Longer context and thinking mode significantly slow generation. Reduce context size, prompt length, or disable thinking mode for faster responses.
Local vs. Managed API: Which to Choose?
Running GLM 5.1 locally is straightforward for experimentation and learning. However, for production agentic coding workflows, consider the trade-offs carefully.
| Aspect | Local Deployment | Managed API |
|---|---|---|
| Data Privacy | Full control | Data sent to cloud |
| Cost | GPU rental | Per-token |
| Performance | Variable (2-8 tok/s) | Optimized |
| Maintenance | Self-managed | Handled by provider |
For experimentation and lighter workloads, local deployment is practical. For production agentic coding at scale, managed APIs typically offer better performance, less maintenance, and more reliable availability.
Frequently Asked Questions
What hardware do I need to run GLM 5.1 locally?
An H100 GPU with 80GB VRAM and at least 125GB system RAM is recommended. The 2-bit quantized model requires approximately 220-236GB storage.
Why is GLM 5.1 running slowly?
Longer context windows and thinking mode significantly slow generation. Try reducing context size or disabling thinking mode for faster responses.
Can I use GLM 5.1 with Claude Code?
Yes, configure Claude Code to point to your local GLM 5.1 server using environment variables for ANTHROPIC_BASE_URL and related settings.
Why does the model fail to load?
This usually indicates insufficient GPU or system memory. Use the --fit on flag for automatic memory management or try a smaller quantization.
Local or managed API: which is better?
Local is great for learning and experiments. Managed APIs are better for production agentic coding due to optimized performance and less maintenance overhead.
Need Help Setting Up Local LLMs?
Our experts can help you configure llama.cpp, optimize GPU settings, and integrate local models with your development workflow.
