How to Use Agentic Vision in Gemini 3: Complete Step by Step Guide
By Braincuber Team
Published on April 21, 2026
Agentic vision in Gemini 3 Flash represents a breakthrough in AI-powered image analysis. Unlike traditional vision APIs that glance at an image once and provide answers, Gemini 3 can write and execute Python code mid-response to crop, zoom, and annotate images before committing to an answer. This comprehensive guide walks you through setup and four progressively challenging examples.
What You'll Learn:
- What agentic vision is and how the Think, Act, Observe loop works
- Setting up the environment with Google Genai SDK
- Configuring code execution for agentic vision
- Reading fine detail with crop and zoom techniques
- Handling low light and obstructions in images
- Counting and annotating objects with bounding boxes
- Multi-step extraction and data visualization
What Is Agentic Vision?
When you send an image to a standard vision API asking it to read a small price label, the response often comes back confident but wrong. The same issue occurs when counting objects in a cluttered photo: the model glances at the full image once and guesses a number that is way off.
Gemini 3 Flash handles this differently. With code execution turned on, it can write and run Python mid-response to crop, zoom, and annotate images before committing to an answer. Google calls this agentic vision and reports a 5-10% quality lift on vision benchmarks.
The "Think, Act, Observe" Loop
The concept is straightforward. With code execution active, Gemini 3 can stop mid-response, write and run Python in a sandbox, look at what the code produced, and decide what to do next. For vision tasks, this usually means the model is cutting out regions of an image, enlarging them, drawing bounding boxes, or adjusting contrast to pull out details that are not visible at the original resolution.
Think
The model looks at what it has (original image or cropped version) and plans its next move.
Act
Turns the plan into Python code. The sandbox has 43 pre-installed libraries.
Observe
Output feeds back into context. The model either answers or loops again.
Iterate
A single API call can cycle through this loop multiple times.
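Because the loop runs inside a single request, its stages surface as interleaved parts of the response when thoughts are included. Here is a minimal sketch of labeling those stages, assuming the `google-genai` part fields `thought`, `executable_code`, `code_execution_result`, and `inline_data` (check the SDK reference for your version):

```python
def describe_parts(parts):
    """Label each response part by the loop stage it most likely represents."""
    labels = []
    for part in parts:
        if getattr(part, "thought", False):
            labels.append("think")    # internal planning text
        elif getattr(part, "executable_code", None):
            labels.append("act")      # Python the model wrote
        elif getattr(part, "code_execution_result", None):
            labels.append("observe")  # sandbox output fed back into context
        elif getattr(part, "inline_data", None):
            labels.append("image")    # e.g. a cropped or annotated image
        else:
            labels.append("answer")   # final (or interim) text
    return labels
```

Calling `describe_parts(response.candidates[0].content.parts)` on an agentic response typically shows several think/act/observe cycles before the final answer.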
Important Limitations
Code execution and custom function calling cannot be used in the same request. Google also warns that code execution can regress performance on non-visual tasks, so enable it only when the task benefits from image manipulation.
Setting Up Your Environment
Grab an API key from ai.google.dev and store it as GOOGLE_API_KEY in your environment or a .env file. Then install the SDK and image libraries:
```shell
pip install google-genai python-dotenv Pillow matplotlib opencv-python
```
Creating the Client
Start by creating the client with a proper timeout configuration:
```python
from google import genai
from google.genai import types
from dotenv import load_dotenv

load_dotenv()

client = genai.Client(
    http_options=types.HttpOptions(timeout=600_000)
)

MODEL = "gemini-3-flash-preview"
```
Important
The timeout=600_000 matters: HttpOptions takes the timeout in milliseconds, so this allows up to 10 minutes per request. Agentic vision calls involve multiple rounds of code generation and execution inside the sandbox, and a single request can take around four minutes. The default httpx timeout would kill the connection well before that.
Configuring Agentic Vision
Next, set up the agentic config with code execution, thinking config, and media resolution:
```python
agentic_config = types.GenerateContentConfig(
    tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    thinking_config=types.ThinkingConfig(
        thinking_level="HIGH",
        include_thoughts=True,
    ),
    media_resolution=types.MediaResolution.MEDIA_RESOLUTION_HIGH,
)
```
| Component | Description |
|---|---|
| tools | Code execution tool - gives the model a sandboxed Python environment with 43 libraries (numpy, pandas, PIL, matplotlib, OpenCV) |
| thinking_level | HIGH gives the model more room to plan before writing code |
| media_resolution | MEDIA_RESOLUTION_HIGH is a good default; ULTRA_HIGH for dense images, LOW when speed matters |
Reading Fine Detail with Crop and Zoom
The first example involves a German grocery store shelf with three rows of products. Price labels line each shelf edge, and a "Liebe Kunden" (Dear Customers) notice sits near the bottom. The labels are readable if you zoom in, but small enough that a model looking at the full image has to decide what it can and cannot make out.
Loading the Image
```python
from pathlib import Path

image_path = Path("images/grocery_shelf.jpg")
image_part = types.Part.from_bytes(
    data=image_path.read_bytes(), mime_type="image/jpeg"
)

prompt = (
    "Read the 'Liebe Kunden' sign at the bottom of the shelf. "
    "What does it say? Also list all visible product names "
    "and their prices from the shelf labels."
)
```
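With the client, config, image, and prompt in place, the request itself is a single generate_content call. A small wrapper, sketched here around the `client.models.generate_content` call shape of the google-genai SDK, keeps the two comparison runs identical apart from the config:

```python
MODEL = "gemini-3-flash-preview"

def ask_with_vision(client, image_part, prompt, config, model=MODEL):
    """One request; the model may run several crop/zoom rounds internally."""
    response = client.models.generate_content(
        model=model,
        contents=[image_part, prompt],  # image first, then the question
        config=config,
    )
    return response.text
```

Usage: `answer = ask_with_vision(client, image_part, prompt, agentic_config)`.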
Standard Vision vs Agentic Vision
Run the prompt first without code execution, then with the agentic config. The difference is striking:
Standard Vision Result
About a third of entries come back as "(Label obscured)". The model is honest about what it cannot read.
Agentic Vision Result
The model crops seven regions, zooms into each, and reads 30+ products across all three shelves including Oryza, Reis-fit, Madras Curry, and Reiskugeln.
The key insight: the prompt asked to "list all visible product names and their prices", which set a completeness bar that pushed the model toward cropping. A vaguer prompt like "describe this image" would likely get a single-pass answer even with code execution enabled.
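To reproduce the baseline side of the comparison, send the same request with a config that omits the code execution tool. A sketch, assuming the same google-genai types used above:

```python
from google.genai import types

# Baseline config: identical resolution, but no code-execution tool,
# so the model must answer from a single look at the full image.
standard_config = types.GenerateContentConfig(
    media_resolution=types.MediaResolution.MEDIA_RESOLUTION_HIGH,
)
```

Everything else (model, image, prompt) stays the same, which isolates code execution as the only variable.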
Reading Through Low Light and Obstructions
This example involves a Brazilian bar menu photographed in dim lighting. Three glass panels list drinks in gold text, but pendant lamps block parts of the left panel, the right panel is barely visible at the frame edge, and glass reflections scatter across the surface.
First Round: Crop All Panels
The model crops all three panels (middle, left, right) and reads the well-lit center panel easily.
Second Round: Self-Correction
After observing the first crops, the model decides the left and right panels are still too hard to read and runs another round of tighter zooms - without being told to.
This demonstrates the power of leaving the method open: the model does not just preprocess once, it iterates until the results meet the bar set by the prompt. It worked backward from the goal of reading drinks and their styles.
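A prompt in this spirit names the goal and the obstacles but not the method. The wording below is our illustration, not the exact prompt from the original run:

```python
# Illustrative prompt: states the goal, leaves the preprocessing open.
menu_prompt = (
    "Read every drink and its style from all three glass menu panels. "
    "Some panels are dark, reflective, or partially blocked by lamps; "
    "process the image however you need to in order to read them."
)
```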
Counting and Annotation
This example has coins scattered across a desk. Some overlap, a few are partially hidden, and the mix of copper and silver makes it easy to lose track.
The Ambiguity Problem
When you ask the model to "draw a bounding box around each coin," it returns JSON coordinates instead of actually drawing on the image. The word "draw" is ambiguous.
Pro Tip
Be explicit about the tool: say "Use Python to draw" instead of just "draw". This triggers the code execution path reliably.
```python
prompt_v2 = (
    "Count every coin in this image. Use Python to draw a numbered "
    "bounding box on the image around each coin you find. Check for "
    "overlapping or partially hidden coins. After annotating, give "
    "me the final count."
)
```
This time, the model writes actual drawing code using PIL and produces an annotated image with red boxes and numbered coins. The annotation acts as a visual scratchpad: instead of a number you have to take on faith, you get a verifiable artifact you can check box by box.
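To keep that artifact, pull the image bytes out of the response parts and write them to disk. A minimal helper, assuming sandbox-produced images arrive as `inline_data` parts (as in the google-genai SDK):

```python
from pathlib import Path

def save_inline_images(parts, out_dir="outputs"):
    """Write any images the sandbox produced (e.g. annotated coins) to disk."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for i, part in enumerate(parts):
        blob = getattr(part, "inline_data", None)
        if blob is not None and getattr(blob, "data", None):
            path = out / f"annotated_{i}.png"
            path.write_bytes(blob.data)
            saved.append(str(path))
    return saved
```

Usage: `save_inline_images(response.candidates[0].content.parts)` after the counting request returns.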
Multi-Step Extraction and Plotting
This example chains several operations together: crop a table from a photo, extract the numbers into a DataFrame, and generate a chart. The image is an IRS-style tax table photographed on a desk with a calculator.
Round 1-3
Progressive zooms: crop the table, isolate column headers, then first 10 rows.
Round 4
Extract numbers into pandas DataFrame, print the table, and generate a grouped bar chart.
One API call produced a structured DataFrame, a summary with max/min rows identified, and a publication-ready bar chart. The prompt listed four things (crop, extract, chart, summarize), and the model treated each as a stage in a pipeline where the output of one step fed into the next.
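A prompt that lists the stages in order tends to elicit exactly this pipeline behavior. The wording below is our paraphrase of the four-stage request, not the exact original prompt:

```python
# Illustrative prompt: four explicit stages, each feeding the next.
pipeline_prompt = (
    "From this photo: 1) crop out the tax table, 2) extract the first "
    "10 rows into a pandas DataFrame and print it, 3) use Python to "
    "plot a grouped bar chart of the columns, and 4) summarize which "
    "rows have the highest and lowest values."
)
```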
Key Takeaways
| Pattern | Example |
|---|---|
| Set a completeness bar | "list all products" pushes toward thoroughness |
| Leave method open | Let model choose its preprocessing path |
| Be explicit about tools | "Use Python to draw" vs just "draw" |
Frequently Asked Questions
What is agentic vision in Gemini 3 Flash?
Agentic vision is Gemini 3 Flash's ability to write and run Python code mid-response, following a Think, Act, Observe loop, to manipulate images before answering.
How do I enable agentic vision in the Gemini API?
Add a code execution tool to your GenerateContentConfig: tools=[types.Tool(code_execution=types.ToolCodeExecution())].
Why does the model skip code execution even when enabled?
The model only writes code when it decides a single look is not enough. If it can answer confidently from original resolution, it skips code execution - by design.
What is the difference between "draw" and "Use Python to draw"?
"Draw" is ambiguous and may return text descriptions. "Use Python to draw" tells the model to write executable code that produces visual output.
Can I combine agentic vision with custom function calling?
No. Code execution and custom function calling cannot be used in the same API request.
Ready to Build with Agentic Vision?
Agentic vision in Gemini 3 opens up powerful possibilities for image analysis tasks. Start with the examples in this tutorial, experiment with your own images, and explore how the Think, Act, Observe loop can transform your vision workflows.
