How to Build a Local AI Coding Agent with Gemma 4: Complete Guide
By Braincuber Team
Published on April 20, 2026
Local LLMs have reached a turning point. With Gemma 4's long context window, native multimodal support, and the accessibility of Ollama, it is now practical to run a capable agentic coding assistant entirely on your own machine.
What You Will Learn:
- Understanding Gemma 4 model family and variants
- Setting up Ollama for local inference
- Installing and configuring Gradio for the UI
- Building a split-pane code editor and chat interface
- Implementing agentic tool use for code execution
- Adding multimodal support for images and files
- Running a fully local AI coding assistant
What Is Gemma 4?
Gemma 4 is Google DeepMind's open-weights model family, designed for both local deployment and research. It builds on the Gemma lineage with improved instruction following, longer context windows, and native multimodal input handling.
Models like Gemma-4-26B-A4B (MoE) and Gemma-4-31B achieve Elo scores comparable to much larger models, indicating strong performance per parameter. The 31B model currently ranks as the third-best open model on the Arena AI text leaderboard.
The Gemma 4 Model Family
| Model | Architecture | Active Params | Context Length |
|---|---|---|---|
| Gemma-4-31B | Dense Transformer | 31B | 256K tokens |
| Gemma-4-26B-A4B | MoE (128 Experts) | 3.8B | 256K tokens |
| Gemma-4-E4B | Dense Transformer | 4.5B effective | 128K tokens |
| Gemma-4-E2B | Dense Transformer | 2.3B effective | 128K tokens |
Model Selection
In this tutorial, we use the gemma4:e4b variant (9.6GB) via Ollama, a quantized version well-suited to local inference on consumer hardware.
Running Gemma 4 via Ollama
Ollama handles model downloading, quantization, and serving, and exposes an OpenAI-compatible HTTP API. For this tutorial, Ollama acts as the inference backend, and our app communicates with it over localhost:11434.
curl -fsSL https://ollama.com/install.sh | sh
ollama pull gemma4:e4b
ollama serve
The commands above install Ollama using the official install script, then download the gemma4:e4b model variant for on-device inference. Finally, ollama serve starts the Ollama server, enabling our app to send requests to the model.
Build a Coding Assistant with Ollama
In this section, we will build the code assistant step by step. At a high level, here is what the app does:
- Accepts a natural-language message in the chat panel, with optional image or file attachments
- Injects the current editor code as context for the model
- Sends a streaming request to Gemma 4 via Ollama's /api/chat endpoint
- Optionally calls tools (code execution, math evaluation) in an agentic loop
- Pushes extracted code blocks from the response into the live editor
Install Dependencies
Install Gradio, requests, and Pillow for the UI and HTTP communication.
pip install gradio requests pillow
Configuration and Constants
Define imports, model name, Ollama base URL, supported languages, and the default system prompt.
Create a file called app.py and add the following imports and configuration:
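A minimal sketch of that configuration (the constant names, the language list, and the system prompt wording are illustrative choices, not prescribed by the app):

```python
# app.py -- configuration and constants (names are illustrative)
import base64
import json
import re
import subprocess

import requests

MODEL = "gemma4:e4b"                      # the Ollama model tag pulled earlier
OLLAMA_URL = "http://localhost:11434"     # default Ollama server address
CHAT_ENDPOINT = f"{OLLAMA_URL}/api/chat"  # streaming chat endpoint

# Languages offered in the editor's language dropdown
LANGUAGES = ["python", "javascript", "typescript", "go", "rust", "bash"]

SYSTEM_PROMPT = (
    "You are a local AI coding assistant. The user's current editor code is "
    "included with each message; return complete, runnable code in fenced "
    "blocks so the editor can be updated automatically."
)
```

The system prompt explicitly asks for fenced code blocks because the editor-update logic later relies on extracting them.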
Define Agentic Tools
Create tools for code execution and mathematical calculations that the model can call during inference.
The assistant can operate in an agentic mode where it calls tools during inference. We define two tools in the standard function-calling schema that Ollama supports: run_code and calculate.
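Here is one way to express those two tools in the JSON function-calling schema Ollama accepts (the descriptions and parameter names are illustrative):

```python
# Tool schemas advertised to the model when agentic mode is enabled
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "run_code",
            "description": "Execute a Python snippet and return its stdout/stderr.",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {"type": "string", "description": "Python source to run"},
                },
                "required": ["code"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Evaluate a pure arithmetic expression, e.g. '2**10 + 7'.",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {"type": "string", "description": "Math expression"},
                },
                "required": ["expression"],
            },
        },
    },
]
```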
Tool Execution Layer
Implement the functions that execute code and evaluate math expressions safely.
Safety Design
The code execution uses a 5-second timeout, output is capped at 3,000 characters, and the math evaluator uses eval with __builtins__ set to empty to prevent access to dangerous functions.
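A sketch of that execution layer, implementing the limits just described: a 5-second timeout, a 3,000-character output cap, and eval with empty __builtins__ (the function and constant names are illustrative):

```python
import subprocess
import sys

CODE_TIMEOUT = 5     # seconds before a run_code call is killed
MAX_OUTPUT = 3000    # cap tool output so it cannot flood the context window

def run_code(code: str) -> str:
    """Run a Python snippet in a subprocess with a hard timeout."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=CODE_TIMEOUT,
        )
        out = (proc.stdout + proc.stderr).strip()
    except subprocess.TimeoutExpired:
        out = f"[error] execution exceeded {CODE_TIMEOUT}s timeout"
    return out[:MAX_OUTPUT]

def calculate(expression: str) -> str:
    """Evaluate a pure math expression with builtins disabled."""
    try:
        # An empty __builtins__ blocks open(), __import__(), exec(), etc.
        return str(eval(expression, {"__builtins__": {}}, {}))
    except Exception as exc:
        return f"[error] {exc}"

# Dispatch table used later by the agentic loop
TOOL_IMPLS = {"run_code": run_code, "calculate": calculate}
```

Note that the subprocess isolates run_code from the app's own process, but it is not a sandbox; treat agentic mode as trusted-input only.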
Helper Utilities
Create utility functions for image encoding, file handling, and code extraction.
Core Chat Streaming Generator
Build the main function that handles conversation flow, tool calls, and streaming responses.
The chat function builds conversation history, enriches user input with context (code, files, images), sends streaming requests to Ollama, optionally executes tools in agentic mode, and yields partial responses for real-time UI updates.
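A condensed sketch of that generator, assuming the TOOLS schemas and TOOL_IMPLS dispatch table from the earlier sections (a few constants are repeated so the snippet stands alone; error handling is omitted for brevity):

```python
import json

import requests

# Repeated from the configuration section so this snippet stands alone
MODEL = "gemma4:e4b"
CHAT_ENDPOINT = "http://localhost:11434/api/chat"
SYSTEM_PROMPT = "You are a local AI coding assistant."

def build_messages(history, user_msg, editor_code="", images=None):
    """Assemble the messages list, injecting the editor code as context."""
    msgs = [{"role": "system", "content": SYSTEM_PROMPT}] + list(history)
    content = user_msg
    if editor_code.strip():
        content += "\n\n[Current editor code]\n" + editor_code
    msg = {"role": "user", "content": content}
    if images:
        msg["images"] = images          # base64 strings from encode_image()
    msgs.append(msg)
    return msgs

def chat(history, user_msg, editor_code="", images=None, agentic=False):
    """Yield progressively longer assistant replies; loop on tool calls."""
    messages = build_messages(history, user_msg, editor_code, images)
    while True:
        payload = {"model": MODEL, "messages": messages, "stream": True}
        if agentic:
            payload["tools"] = TOOLS    # schemas from the tools section
        reply, tool_calls = "", []
        with requests.post(CHAT_ENDPOINT, json=payload, stream=True) as resp:
            for line in resp.iter_lines():
                if not line:
                    continue
                msg = json.loads(line).get("message", {})
                reply += msg.get("content", "")
                tool_calls.extend(msg.get("tool_calls", []))
                yield reply             # partial text for live UI updates
        if not tool_calls:
            return
        # Agentic loop: run each tool, append results, query the model again
        messages.append({"role": "assistant", "content": reply,
                         "tool_calls": tool_calls})
        for call in tool_calls:
            fn = call["function"]
            result = TOOL_IMPLS[fn["name"]](**fn["arguments"])
            messages.append({"role": "tool", "content": result})
```

Each iteration of the while loop is one model turn; the loop exits as soon as a turn finishes without requesting any tools.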
Gradio UI Layout
Create a split-pane interface with code editor on the left and chat panel on the right.
Split-Pane Layout
Code workspace on the left (55%) and chat panel on the right (45%) for seamless workflow between coding and AI assistance.
Multimodal Input
Support for images (vision) and code/text files as context, enabling the model to reason over multiple modalities.
Event Wiring
Connect UI components to the chat function for interactive responses.
Theming and CSS
Add custom theme and CSS for a polished, GitHub-inspired look with light and dark mode support.
Launch the Application
Start the Gradio app and access it at localhost:7860.
Access the App
Navigate to http://localhost:7860 once the app is running. The ollama_ok() check will warn you on startup if Ollama is not running.
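The startup check and launch can be sketched like this (/api/tags is Ollama's model-listing endpoint and serves as a cheap liveness probe; the main() shape is illustrative):

```python
import requests

def ollama_ok() -> bool:
    """Return True if the local Ollama server answers on its default port."""
    try:
        return requests.get("http://localhost:11434/api/tags", timeout=2).ok
    except requests.RequestException:
        return False

def main(demo) -> None:
    """Launch the Gradio app (demo is the gr.Blocks object built earlier)."""
    if not ollama_ok():
        print("Warning: Ollama not reachable -- start it with 'ollama serve'.")
    demo.launch(server_port=7860)    # then open http://localhost:7860
```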
Conclusion
In this tutorial, we built a fully local AI coding assistant using Gemma 4, Ollama, and Gradio. The app supports multimodal input, real tool use, streaming responses, and a live code editor, all running on your own machine without any external API.
The whole-context approach of injecting the editor code directly into every prompt is simpler than RAG for a single-file workflow and works particularly well for tasks like explaining, refactoring, or extending code you are actively editing.
Next Steps
- Add a file tree panel to support multi-file projects
- Add support for additional Ollama models via a dropdown
- Persist conversation history to disk so sessions survive app restarts
Frequently Asked Questions
Does this require a GPU?
Not necessarily. Gemma 4 e4b is a quantized model that runs on CPU, though a GPU significantly improves inference speed.
What is the difference between agentic mode and regular chat?
In regular mode, the model streams text only. In agentic mode, the model can call run_code or calculate, check the results, and incorporate them before finishing its response.
How does the editor update automatically?
The extract_last_code_block helper function scans the assistant response for the last fenced code block and pushes it to the editor after streaming completes.
What happens if Ollama is not running?
The ollama_ok check catches this on startup and on every chat submission. If Ollama is not reachable, the chat returns a formatted error message rather than crashing.
Can I use other models with this app?
Yes, you can modify the MODEL constant to use any model available in Ollama, such as llama3, mistral, or other models from the Ollama library.
Need Help with AI Implementation?
Our experts can help you implement AI solutions like building local coding assistants with Gemma 4. Get a free consultation to discuss your project requirements.
