How to Run GLM 5.1 Locally: Complete Setup for Agentic Coding
By Braincuber Team
Published on April 18, 2026
GLM 5.1 is one of the strongest open-source AI models available for local deployment. According to Artificial Analysis, it ranks as the leading open-weights model on their Intelligence Index. Z.ai positions it as a flagship release for coding, reasoning, and agentic workflows. Running it locally gives you complete control over your data while maintaining privacy.
What You'll Learn:
- Configure an H100 GPU environment on RunPod
- Build llama.cpp with CUDA support for GPU acceleration
- Download and optimize the GLM 5.1 GGUF model
- Set up a local model serving API
- Test the API with cURL and OpenAI Python SDK
- Access the model through the llama.cpp WebUI
- Integrate GLM 5.1 with Claude Code for agentic coding
Why Run GLM 5.1 Locally?
Running GLM 5.1 locally provides several key advantages over cloud-based APIs. Your data never leaves your environment, which is critical for sensitive projects. You have full control over the model configuration and can experiment with different prompts without usage limits or API costs.
GLM 5.1 features 744B total parameters with 40B active per token, a 200K context window, and strong coding performance. The Unsloth Dynamic 2-bit GGUF version reduces storage requirements by approximately 80%, making deployment practical on a single rented data-center GPU rather than a multi-node cluster.
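Those storage figures follow from simple arithmetic: parameter count times bits per weight. The sketch below assumes 16 bits per weight for the full-precision model and roughly 2.5 effective bits per weight for the UD-IQ2_M quant (an assumption; dynamic quantization uses different precisions per tensor, so the effective rate is an average):

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough on-disk size: parameter count times bits per weight, in gigabytes."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# 16-bit weights: 744 * 2 bytes ~ 1488 GB, in the ballpark of the ~1.65TB figure
# once metadata and any higher-precision tensors are included.
print(model_size_gb(744, 16))   # 1488.0
# ~2.5 effective bits per weight lands inside the stated 220-236GB range.
print(model_size_gb(744, 2.5))  # 232.5
```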
Hardware Requirements
This tutorial uses an H100 GPU with 80GB VRAM and 125GB RAM. The 2-bit quantized model requires approximately 220-236GB storage after download.
Step 1: Rent and Configure an H100 GPU Pod
Start by accessing the RunPod platform and navigating to the Pods tab. Select an H100 SXM machine type for optimal performance with large language models.
Choose Template
Select the latest PyTorch template from the available options. This provides a ready-to-use environment with Python and CUDA drivers pre-configured.
Configure Storage
Set the container disk to 100GB and volume disk to 300GB. This provides sufficient space for model files, dependencies, and cached downloads.
Expose Port
Expose port 8910 for the local model server and llama.cpp WebUI. This port will be used for both services.
Add HuggingFace Token
Add your HuggingFace token as an environment variable named HF_TOKEN to access gated model files.
After deploying the pod, open the JupyterLab instance. Launch a new terminal and install the required system packages:
apt-get update
apt-get install -y pciutils build-essential cmake curl git tmux libcurl4-openssl-dev
Step 2: Build llama.cpp with CUDA Support
Now build llama.cpp with CUDA support to enable GPU acceleration for inference. This step compiles the necessary binaries for running large models on your H100 GPU.
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF \
-DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
--target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
Step 3: Download the GLM 5.1 GGUF Model
The full GLM 5.1 model has 744B parameters and requires approximately 1.65TB of storage. The Unsloth Dynamic 2-bit GGUF version reduces this to 220-236GB, making local deployment practical while maintaining strong performance.
| Model Version | Parameters | Storage |
|---|---|---|
| Full Model | 744B (40B active) | ~1.65TB |
| UD-IQ2_M (2-bit) | 744B (40B active) | ~220-236GB |
Install the required tools and download the model:
pip -q install -U "huggingface_hub[hf_xet]" hf-xet
pip -q install -U hf_transfer
export HF_XET_HIGH_PERFORMANCE=1
hf download unsloth/GLM-5.1-GGUF \
--local-dir models/GLM-5.1-GGUF \
--include "*UD-IQ2_M*"
The download takes approximately 17 minutes, depending on your connection speed. The model is split across multiple GGUF files, which llama.cpp stitches together automatically when the server loads the first shard.
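Before serving, it is worth confirming that every shard actually landed, since a partially downloaded split GGUF fails at load time. A minimal sketch, assuming the six-shard naming pattern visible in the serve command in Step 4 (`GLM-5.1-UD-IQ2_M-0000N-of-00006.gguf`):

```python
from pathlib import Path

def expected_shards(prefix: str, total: int) -> list[str]:
    """Build the shard filenames llama.cpp expects for a split GGUF."""
    return [f"{prefix}-{i:05d}-of-{total:05d}.gguf" for i in range(1, total + 1)]

def missing_shards(model_dir: str, prefix: str, total: int) -> list[str]:
    """Return any expected shard filenames not present in model_dir."""
    d = Path(model_dir)
    return [name for name in expected_shards(prefix, total) if not (d / name).exists()]

# Example for the download directory used in this tutorial:
# print(missing_shards("models/GLM-5.1-GGUF/UD-IQ2_M", "GLM-5.1-UD-IQ2_M", 6))
```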
Step 4: Start the GLM 5.1 Local Server
Launch the llama.cpp server to load the model into memory and expose it as a local API. The --fit on flag automatically places as much of the model on the GPU as possible while offloading the rest to system RAM.
./llama.cpp/llama-server \
--model ./models/GLM-5.1-GGUF/UD-IQ2_M/GLM-5.1-UD-IQ2_M-00001-of-00006.gguf \
--alias "GLM-5.1" \
--host 0.0.0.0 \
--port 8910 \
--jinja \
--fit on \
--threads 16 \
--threads-batch 16 \
--ctx-size 32768 \
--batch-size 2048 \
--ubatch-size 512 \
--flash-attn on \
--temp 0.7 \
--top-p 0.95 \
--cont-batching \
--metrics \
--perf
Key Parameter: --fit on
This flag automatically places as much of the model as possible on the GPU (80GB VRAM) while offloading the rest to system RAM (125GB). This is essential for running such a large model without manual layer placement.
Once the model loads, you will see a message indicating the server is listening at http://0.0.0.0:8910.
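Loading a model this large can take several minutes, so rather than watching the log, you can poll the server's `/health` endpoint, which llama.cpp serves with a 200 status once the model is ready. A stdlib-only sketch (host, port, and timeout values here match this tutorial's setup):

```python
import time
import urllib.error
import urllib.request

def server_url(host: str, port: int, path: str = "/health") -> str:
    """Build the health-check URL for the local llama.cpp server."""
    return f"http://{host}:{port}{path}"

def wait_until_ready(url: str, timeout_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Poll until the endpoint returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # still loading or not yet listening
        time.sleep(interval_s)
    return False

if __name__ == "__main__":
    print(wait_until_ready(server_url("127.0.0.1", 8910)))
```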
Step 5: Test the Local API with cURL
Open a new terminal and test the API to verify the model is responding correctly. The llama.cpp server provides OpenAI-compatible endpoints.
curl http://127.0.0.1:8910/v1/messages \
-H "Content-Type: application/json" \
-H "x-api-key: local-test" \
-d '{
"model": "GLM-5.1",
"max_tokens": 300,
"messages": [
{"role": "user", "content": "Write a Python hello world function."}
]
}'
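The same Anthropic-style request can be issued from Python using only the standard library, which is convenient for scripting repeatable checks. A sketch assuming the server from Step 4 is running on port 8910 (the `x-api-key` value is arbitrary, as in the cURL example):

```python
import json
import urllib.request

def build_messages_request(base_url: str, model: str, prompt: str, max_tokens: int = 300):
    """Build an Anthropic-style /v1/messages request for the local server."""
    payload = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/messages",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", "x-api-key": "local-test"},
    )

if __name__ == "__main__":
    req = build_messages_request("http://127.0.0.1:8910", "GLM-5.1",
                                 "Write a Python hello world function.")
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read()))
```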
Step 6: Connect with OpenAI Python SDK
Connect the local GLM 5.1 server to the OpenAI Python SDK. This enables seamless integration with existing applications that use the OpenAI client.
pip install openai
python - <<'PY'
from openai import OpenAI

client = OpenAI(
    api_key="local-key",  # any non-empty string; llama.cpp does not validate it
    base_url="http://127.0.0.1:8910/v1",
)

resp = client.completions.create(
    model="GLM-5.1",
    prompt=(
        "Answer briefly and in plain text only.\n"
        "Question: What is the capital city of Australia?\n"
        "Answer:"
    ),
    temperature=0.2,
    max_tokens=12,
)
print(resp.choices[0].text.strip())
PY
The expected output is Canberra. You can now point any existing OpenAI-compatible application to your local server.
Step 7: Access Through the llama.cpp WebUI
The llama.cpp server includes a built-in WebUI for interactive conversations. Return to your RunPod dashboard and access the exposed HTTP Service link for port 8910.
Interactive Chat
Chat with GLM 5.1 directly through the WebUI without using the terminal or API calls.
Generation Speed
Expect around 8 tokens per second for basic prompts. Throughput drops as prompt complexity and context length grow.
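At these speeds, response latency is easy to estimate: tokens requested divided by throughput. A quick helper (the 8 tok/s figure comes from the WebUI test above; real throughput varies with context length):

```python
def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Estimated wall-clock time to generate a response at a given throughput."""
    return tokens / tokens_per_second

# A 300-token reply at ~8 tok/s takes about 37.5 seconds; at the ~2 tok/s seen
# in long-context agentic runs (Step 8), the same reply takes ~150 seconds.
print(generation_seconds(300, 8))  # 37.5
print(generation_seconds(300, 2))  # 150.0
```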
Step 8: Integrate Claude Code with Local GLM 5.1
Connect Claude Code to use GLM 5.1 as the underlying model for agentic coding tasks. This allows you to leverage Claude Code's agent capabilities while running locally.
curl -fsSL https://claude.ai/install.sh | bash
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc
cat >> ~/.bashrc <<'EOF'
export ANTHROPIC_BASE_URL="http://127.0.0.1:8910"
export ANTHROPIC_AUTH_TOKEN="local-dev-token"
export ANTHROPIC_MODEL="GLM-5.1"
export ANTHROPIC_DEFAULT_SONNET_MODEL="GLM-5.1"
export API_TIMEOUT_MS=1200000
EOF
source ~/.bashrc
mkdir -p test-claude-local
cd test-claude-local
claude
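If Claude Code ignores your local server and tries to reach Anthropic's cloud, the usual cause is an environment variable that did not get exported into the shell that launched it. A small pre-flight check you can run before starting claude (the variable names match the ~/.bashrc block above):

```python
import os

REQUIRED = ("ANTHROPIC_BASE_URL", "ANTHROPIC_AUTH_TOKEN", "ANTHROPIC_MODEL")

def missing_vars(env=None):
    """Return the required Claude Code variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]

if __name__ == "__main__":
    missing = missing_vars()
    print("ok" if not missing else f"missing: {', '.join(missing)}")
```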
Performance Note
Claude Code with local GLM 5.1 works but is noticeably slower than cloud APIs. With longer context and coding prompts, generation can drop to approximately 2 tokens per second. Consider disabling thinking mode for faster responses.
Common Issues and Fixes
This section covers common problems you may encounter when running GLM 5.1 locally and their solutions.
Model Fails to Load or Crashes
The model is too large for available memory. Try a smaller quantization or lower the context size. Use --fit on for automatic memory management.
WebUI Does Not Open
Open the WebUI through the RunPod HTTP Service link for port 8910, not the JupyterLab URL. The server binds to 0.0.0.0:8910, so any proxy URL exposing that port will reach it.
API Works in One Tool But Not Another
Check the base URL and endpoint path. llama.cpp supports both OpenAI routes (/v1) and Anthropic Messages routes (/v1/messages).
Model Feels Too Slow
Longer context and thinking mode significantly slow generation. Reduce context size, prompt length, or disable thinking mode for faster responses.
Local vs. Managed API: Which to Choose?
Running GLM 5.1 locally is straightforward for experimentation and learning. However, for production agentic coding workflows, consider the trade-offs carefully.
| Aspect | Local Deployment | Managed API |
|---|---|---|
| Data Privacy | Full control | Data sent to cloud |
| Cost | GPU rental | Per-token |
| Performance | Variable (2-8 tok/s) | Optimized |
| Maintenance | Self-managed | Handled by provider |
For experimentation and lighter workloads, local deployment is practical. For production agentic coding at scale, managed APIs typically offer better performance, less maintenance, and more reliable availability.
Frequently Asked Questions
What hardware do I need to run GLM 5.1 locally?
An H100 GPU with 80GB VRAM and at least 125GB system RAM is recommended. The 2-bit quantized model requires approximately 220-236GB storage.
Why is GLM 5.1 running slowly?
Longer context windows and thinking mode significantly slow generation. Try reducing context size or disabling thinking mode for faster responses.
Can I use GLM 5.1 with Claude Code?
Yes, configure Claude Code to point to your local GLM 5.1 server using environment variables for ANTHROPIC_BASE_URL and related settings.
Why does the model fail to load?
This usually indicates insufficient GPU or system memory. Use the --fit on flag for automatic memory management or try a smaller quantization.
Local or managed API: which is better?
Local is great for learning and experiments. Managed APIs are better for production agentic coding due to optimized performance and less maintenance overhead.
Need Help Setting Up Local LLMs?
Our experts can help you configure llama.cpp, optimize GPU settings, and integrate local models with your development workflow.
