How to Use Agentic Vision in Gemini 3: Complete Step by Step Guide
By Braincuber Team
Published on April 21, 2026
Agentic vision in Gemini 3 Flash represents a breakthrough in AI-powered image analysis. Unlike traditional vision APIs that glance at an image once and provide answers, Gemini 3 can write and execute Python code mid-response to crop, zoom, and annotate images before committing to an answer. This comprehensive guide walks you through setup and four progressively challenging examples.
What You'll Learn:
- What agentic vision is and how the Think, Act, Observe loop works
- Setting up the environment with Google Genai SDK
- Configuring code execution for agentic vision
- Reading fine detail with crop and zoom techniques
- Handling low light and obstructions in images
- Counting and annotating objects with bounding boxes
- Multi-step extraction and data visualization
What Is Agentic Vision?
When you send an image to a standard vision API asking it to read a small price label, the response often comes back confident but wrong. The same issue occurs when counting objects in a cluttered photo: the model glances at the full image once and guesses a number that is way off.
Gemini 3 Flash handles this differently. With code execution turned on, it can write and run Python mid-response to crop, zoom, and annotate images before committing to an answer. Google calls this agentic vision and reports a 5-10% quality lift on vision benchmarks.
The "Think, Act, Observe" Loop
The concept is straightforward. With code execution active, Gemini 3 can stop mid-response, write and run Python in a sandbox, look at what the code produced, and decide what to do next. For vision tasks, this usually means the model is cutting out regions of an image, enlarging them, drawing bounding boxes, or adjusting contrast to pull out details that are not visible at the original resolution.
Think
The model looks at what it has (original image or cropped version) and plans its next move.
Act
Turns the plan into Python code. The sandbox has 43 pre-installed libraries.
Observe
Output feeds back into context. The model either answers or loops again.
Iterate
A single API call can cycle through this loop multiple times.
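Because the loop runs inside a single request, its stages surface as interleaved parts of the response when thoughts are included. Here is a minimal sketch of labeling those stages, assuming the `google-genai` part fields `thought`, `executable_code`, `code_execution_result`, and `inline_data` (check the SDK reference for your version):

```python
def describe_parts(parts):
    """Label each response part by the loop stage it most likely represents."""
    labels = []
    for part in parts:
        if getattr(part, "thought", False):
            labels.append("think")    # internal planning text
        elif getattr(part, "executable_code", None):
            labels.append("act")      # Python the model wrote
        elif getattr(part, "code_execution_result", None):
            labels.append("observe")  # sandbox output fed back into context
        elif getattr(part, "inline_data", None):
            labels.append("image")    # e.g. a cropped or annotated image
        else:
            labels.append("answer")   # final (or interim) text
    return labels
```

Calling `describe_parts(response.candidates[0].content.parts)` on an agentic response typically shows several think/act/observe cycles before the final answer.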
Important Limitations
Code execution and custom function calling cannot be used in the same request. Google also warns that code execution can regress performance on non-visual tasks, so enable it only when the task benefits from image manipulation.
Setting Up Your Environment
Grab an API key from ai.google.dev and store it as GOOGLE_API_KEY in your environment or a .env file. Then install the SDK and image libraries:
```shell
pip install google-genai python-dotenv Pillow matplotlib opencv-python
```
Creating the Client
Start by creating the client with a proper timeout configuration:
```python
from google import genai
from google.genai import types
from dotenv import load_dotenv

load_dotenv()

client = genai.Client(
    http_options=types.HttpOptions(timeout=600_000)
)

MODEL = "gemini-3-flash-preview"
```
Important
The timeout=600_000 matters: HttpOptions takes the timeout in milliseconds, so this allows up to 10 minutes per request. Agentic vision calls involve multiple rounds of code generation and execution inside the sandbox, and a single request can take around four minutes. The default httpx timeout would kill the connection well before that.
Configuring Agentic Vision
Next, set up the agentic config with code execution, thinking config, and media resolution:
```python
agentic_config = types.GenerateContentConfig(
    tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    thinking_config=types.ThinkingConfig(
        thinking_level="HIGH",
        include_thoughts=True,
    ),
    media_resolution=types.MediaResolution.MEDIA_RESOLUTION_HIGH,
)
```
| Component | Description |
|---|---|
| tools | Code execution tool - gives the model a sandboxed Python environment with 43 libraries (numpy, pandas, PIL, matplotlib, OpenCV) |
| thinking_level | HIGH gives the model more room to plan before writing code |
| media_resolution | MEDIA_RESOLUTION_HIGH is a good default; ULTRA_HIGH for dense images, LOW when speed matters |
Reading Fine Detail with Crop and Zoom
The first example involves a German grocery store shelf with three rows of products. Price labels line each shelf edge, and a "Liebe Kunden" (Dear Customers) notice sits near the bottom. The labels are readable if you zoom in, but small enough that a model looking at the full image has to decide what it can and cannot make out.
Loading the Image
```python
from pathlib import Path

image_path = Path("images/grocery_shelf.jpg")
image_part = types.Part.from_bytes(
    data=image_path.read_bytes(), mime_type="image/jpeg"
)

prompt = (
    "Read the 'Liebe Kunden' sign at the bottom of the shelf. "
    "What does it say? Also list all visible product names "
    "and their prices from the shelf labels."
)
```
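With the client, config, image, and prompt in place, the request itself is a single generate_content call. A small wrapper, sketched here around the `client.models.generate_content` call shape of the google-genai SDK, keeps the two comparison runs identical apart from the config:

```python
MODEL = "gemini-3-flash-preview"

def ask_with_vision(client, image_part, prompt, config, model=MODEL):
    """One request; the model may run several crop/zoom rounds internally."""
    response = client.models.generate_content(
        model=model,
        contents=[image_part, prompt],  # image first, then the question
        config=config,
    )
    return response.text
```

Usage: `answer = ask_with_vision(client, image_part, prompt, agentic_config)`.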
Standard Vision vs Agentic Vision
Run the prompt first without code execution, then with the agentic config. The difference is striking:
Standard Vision Result
About a third of entries come back as "(Label obscured)". The model is honest about what it cannot read.
Agentic Vision Result
The model crops seven regions, zooms into each, and reads 30+ products across all three shelves including Oryza, Reis-fit, Madras Curry, and Reiskugeln.
The key insight: the prompt asked to "list all visible product names and their prices", which set a completeness bar that pushed the model toward cropping. A vaguer prompt like "describe this image" would likely get a single-pass answer even with code execution enabled.
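To reproduce the baseline side of the comparison, send the same request with a config that omits the code execution tool. A sketch, assuming the same google-genai types used above:

```python
from google.genai import types

# Baseline config: identical resolution, but no code-execution tool,
# so the model must answer from a single look at the full image.
standard_config = types.GenerateContentConfig(
    media_resolution=types.MediaResolution.MEDIA_RESOLUTION_HIGH,
)
```

Everything else (model, image, prompt) stays the same, which isolates code execution as the only variable.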
Reading Through Low Light and Obstructions
This example involves a Brazilian bar menu photographed in dim lighting. Three glass panels list drinks in gold text, but pendant lamps block parts of the left panel, the right panel is barely visible at the frame edge, and glass reflections scatter across the surface.
First Round: Crop All Panels
The model crops all three panels (middle, left, right) and reads the well-lit center panel easily.
Second Round: Self-Correction
After observing the first crops, the model decides the left and right panels are still too hard to read and runs another round of tighter zooms - without being told to.
This demonstrates the power of leaving the method open: the model does not just preprocess once, it iterates until the results meet the bar set by the prompt. It worked backward from the goal of reading drinks and their styles.
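A prompt in this spirit names the goal and the obstacles but not the method. The wording below is our illustration, not the exact prompt from the original run:

```python
# Illustrative prompt: states the goal, leaves the preprocessing open.
menu_prompt = (
    "Read every drink and its style from all three glass menu panels. "
    "Some panels are dark, reflective, or partially blocked by lamps; "
    "process the image however you need to in order to read them."
)
```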
Counting and Annotation
This example has coins scattered across a desk. Some overlap, a few are partially hidden, and the mix of copper and silver makes it easy to lose track.
The Ambiguity Problem
When you ask the model to "draw a bounding box around each coin," it returns JSON coordinates instead of actually drawing on the image. The word "draw" is ambiguous.
Pro Tip
Be explicit about the tool: say "Use Python to draw" instead of just "draw". This triggers the code execution path reliably.
```python
prompt_v2 = (
    "Count every coin in this image. Use Python to draw a numbered "
    "bounding box on the image around each coin you find. Check for "
    "overlapping or partially hidden coins. After annotating, give "
    "me the final count."
)
```
This time, the model writes actual drawing code using PIL and produces an annotated image with red boxes and numbered coins. The annotation acts as a visual scratchpad: instead of a number you have to take on faith, you get a verifiable artifact you can check box by box.
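To keep that artifact, pull the image bytes out of the response parts and write them to disk. A minimal helper, assuming sandbox-produced images arrive as `inline_data` parts (as in the google-genai SDK):

```python
from pathlib import Path

def save_inline_images(parts, out_dir="outputs"):
    """Write any images the sandbox produced (e.g. annotated coins) to disk."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for i, part in enumerate(parts):
        blob = getattr(part, "inline_data", None)
        if blob is not None and getattr(blob, "data", None):
            path = out / f"annotated_{i}.png"
            path.write_bytes(blob.data)
            saved.append(str(path))
    return saved
```

Usage: `save_inline_images(response.candidates[0].content.parts)` after the counting request returns.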
Multi-Step Extraction and Plotting
This example chains several operations together: crop a table from a photo, extract the numbers into a DataFrame, and generate a chart. The image is an IRS-style tax table photographed on a desk with a calculator.
Round 1-3
Progressive zooms: crop the table, isolate column headers, then first 10 rows.
Round 4
Extract numbers into pandas DataFrame, print the table, and generate a grouped bar chart.
One API call produced a structured DataFrame, a summary with max/min rows identified, and a publication-ready bar chart. The prompt listed four things (crop, extract, chart, summarize), and the model treated each as a stage in a pipeline where the output of one step fed into the next.
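A prompt that lists the stages in order tends to elicit exactly this pipeline behavior. The wording below is our paraphrase of the four-stage request, not the exact original prompt:

```python
# Illustrative prompt: four explicit stages, each feeding the next.
pipeline_prompt = (
    "From this photo: 1) crop out the tax table, 2) extract the first "
    "10 rows into a pandas DataFrame and print it, 3) use Python to "
    "plot a grouped bar chart of the columns, and 4) summarize which "
    "rows have the highest and lowest values."
)
```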
Key Takeaways
| Pattern | Example |
|---|---|
| Set a completeness bar | "list all products" pushes toward thoroughness |
| Leave method open | Let model choose its preprocessing path |
| Be explicit about tools | "Use Python to draw" vs just "draw" |
Frequently Asked Questions
What is agentic vision in Gemini 3 Flash?
Agentic vision is Gemini 3 Flash's ability to write and run Python code mid-response, following a Think, Act, Observe loop, to manipulate images before answering.
How do I enable agentic vision in the Gemini API?
Add a code execution tool to your GenerateContentConfig: tools=[types.Tool(code_execution=types.ToolCodeExecution())].
Why does the model skip code execution even when enabled?
The model only writes code when it decides a single look is not enough. If it can answer confidently from original resolution, it skips code execution - by design.
What is the difference between "draw" and "Use Python to draw"?
"Draw" is ambiguous and may return text descriptions. "Use Python to draw" tells the model to write executable code that produces visual output.
Can I combine agentic vision with custom function calling?
No. Code execution and custom function calling cannot be used in the same API request.
Ready to Build with Agentic Vision?
Agentic vision in Gemini 3 opens up powerful possibilities for image analysis tasks. Start with the examples in this tutorial, experiment with your own images, and explore how the Think, Act, Observe loop can transform your vision workflows.
