How to Build a Gemini Multimodal Document Agent for AI Hackathons
By Braincuber Team
Published on May 12, 2026
Enterprise teams spend thousands of hours manually extracting data from invoices, contracts, and scanned documents. Gemini 2.5 Flash can read those documents natively, and Google's Agent Development Kit (ADK) gives you a clean framework to turn that capability into a production service. This complete beginner's guide walks you through building a document extraction agent, wrapping it in a FastAPI service, containerizing it, and shipping it to a live Vultr server. By the end you will have a public API endpoint that accepts a PDF, image, or text file and returns structured JSON with every relevant field pulled out of the document.
What You Will Learn:
- How to get a Gemini API key and set up the project structure
- How to define Pydantic response schemas for structured extraction
- How to build three ADK extraction tools for invoices, contracts, and general documents
- How to wire the tools into an ADK agent with InMemoryRunner
- How to build a FastAPI service with file upload and validation
- How to containerize the agent with Docker and docker-compose
- How to provision a Vultr Cloud Compute instance and deploy
- How to test the live API endpoint with curl and inspect responses
What You Will Build
A containerized FastAPI service backed by a Google ADK agent that accepts file uploads (PDF, image, plain text), identifies the document type automatically, calls the appropriate extraction tool (invoice, contract, or general), and returns clean structured JSON. The stack is Python 3.11, Google ADK 1.18, Gemini 2.5 Flash, FastAPI, Docker, and Vultr Cloud Compute.
This kind of document intelligence agent is particularly valuable for teams at 2026 AI hackathons, where shipping a working prototype in 48 hours is the whole game. Whether you are building a fintech tool, a legal document processor, or a contract analyzer, this stack gives you a production-ready foundation within a single sprint.
Prerequisites
Gemini API Key
A Google AI Studio account and API key from aistudio.google.com/app/apikey.
Vultr Account
A Vultr account with a billing method added to deploy the containerized agent to a cloud instance.
Docker Installed
Docker installed locally for building and testing the container image before deployment.
Python 3.10+
Python 3.10 or higher and basic familiarity with FastAPI and async Python.
Step 1: Get Your Gemini API Key
Get Your Gemini API Key
Go to aistudio.google.com/app/apikey, sign in, and click Get API Key. Create a new project if prompted, then copy the key. Keep it in a safe place as you will need it for both local development and the Vultr deployment.
Step 2: Set Up the Project
Create Project Directory and Install Dependencies
Create the project directory, set up a Python virtual environment, and create the requirements.txt with google-adk==1.18.0, fastapi, uvicorn, python-multipart, pydantic, and python-dotenv. Install with pip and create a .env file for the API key.
mkdir gemini-multimodal-document-agent
cd gemini-multimodal-document-agent
python3.10 -m venv .venv
source .venv/bin/activate

# requirements.txt
google-adk==1.18.0
fastapi>=0.111.0
uvicorn[standard]>=0.29.0
python-multipart>=0.0.9
pydantic>=2.7.0
python-dotenv>=1.0.0

pip install -r requirements.txt

# .env
GOOGLE_API_KEY=your-gemini-api-key
Step 3: Define the Response Schemas
Create app/schemas.py with a Pydantic AnalysisResponse model that defines the shape of the JSON returned by the API endpoint. The response includes the document type, filename, extracted data dictionary, a human-readable summary, and optional processing notes.
from pydantic import BaseModel
from typing import Optional, Any

class AnalysisResponse(BaseModel):
    document_type: str
    filename: str
    extracted_data: dict[str, Any]
    summary: str
    processing_notes: Optional[str] = None
Step 4: Build the Extraction Tools
The ADK agent uses function-calling tools to return structured data. Each tool corresponds to a document type. When the agent reads a document, it decides which tool to call and passes every extracted field as typed arguments. The tool writes those arguments into the session state, which we read back after the agent finishes.
The Invoice Extraction Tool
save_invoice_extraction accepts fields like vendor name, invoice number, dates, total amount, currency, line items, payment terms, and billing address. All list parameters are typed as list[str] because the Gemini API requires concrete generic types in tool schemas.
from typing import Optional

from google.adk.tools import ToolContext

def save_invoice_extraction(
    tool_context: ToolContext,
    vendor_name: Optional[str] = None,
    invoice_number: Optional[str] = None,
    invoice_date: Optional[str] = None,
    due_date: Optional[str] = None,
    total_amount: Optional[str] = None,
    currency: Optional[str] = None,
    subtotal: Optional[str] = None,
    tax_amount: Optional[str] = None,
    line_items: Optional[list[str]] = None,
    payment_terms: Optional[str] = None,
    billing_address: Optional[str] = None,
    notes: Optional[str] = None,
) -> str:
    """Save structured data extracted from an invoice document."""
    tool_context.state["extraction_result"] = {
        "document_type": "invoice",
        "extracted_data": {
            "vendor_name": vendor_name,
            "invoice_number": invoice_number,
            "invoice_date": invoice_date,
            "due_date": due_date,
            "total_amount": total_amount,
            "currency": currency,
            "subtotal": subtotal,
            "tax_amount": tax_amount,
            "line_items": line_items or [],
            "payment_terms": payment_terms,
            "billing_address": billing_address,
            "notes": notes,
        },
    }
    return "Invoice extraction saved."
The Contract and General Extraction Tools
save_contract_extraction handles contracts, agreements, NDAs, and MOUs with fields for parties, effective date, expiration date, contract type, key obligations, termination conditions, governing law, and jurisdiction. save_general_extraction handles everything else including reports, images, and plain text with fields for document title, summary, key entities, dates mentioned, key figures, and main topics.
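Following the same pattern as the invoice tool, a sketch of save_contract_extraction might look like the code below. The field names are chosen to match the list above; the try/except fallback only exists so the sketch can run without google-adk installed.

```python
from typing import Optional

try:
    from google.adk.tools import ToolContext
except ImportError:  # fallback so this sketch runs without google-adk installed
    ToolContext = object


def save_contract_extraction(
    tool_context: ToolContext,
    parties: Optional[list[str]] = None,
    effective_date: Optional[str] = None,
    expiration_date: Optional[str] = None,
    contract_type: Optional[str] = None,
    key_obligations: Optional[list[str]] = None,
    termination_conditions: Optional[str] = None,
    governing_law: Optional[str] = None,
    jurisdiction: Optional[str] = None,
) -> str:
    """Save structured data extracted from a contract, agreement, NDA, or MOU."""
    tool_context.state["extraction_result"] = {
        "document_type": "contract",
        "extracted_data": {
            "parties": parties or [],
            "effective_date": effective_date,
            "expiration_date": expiration_date,
            "contract_type": contract_type,
            "key_obligations": key_obligations or [],
            "termination_conditions": termination_conditions,
            "governing_law": governing_law,
            "jurisdiction": jurisdiction,
        },
    }
    return "Contract extraction saved."
```

save_general_extraction follows the identical structure with its own field set.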
Important: Typed List Parameters
All list parameters must be typed as list[str] rather than just list. The Gemini API generates a JSON schema from your tool's type annotations. An untyped list produces a schema without an items field, which the API rejects with a 400 INVALID_ARGUMENT error.
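A minimal before/after pair (hypothetical tool signatures) illustrates the difference:

```python
from typing import Optional

# Rejected: bare `list` generates an array schema with no `items` field,
# which the Gemini API refuses with 400 INVALID_ARGUMENT.
def save_topics_untyped(topics: Optional[list] = None) -> str:
    return "saved"

# Accepted: `list[str]` generates a complete array-of-strings schema.
def save_topics_typed(topics: Optional[list[str]] = None) -> str:
    return "saved"
```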
Step 5: Build the ADK Agent
Create app/agent.py with the agent instruction, runner setup, and the analyze_document function. The agent uses InMemoryRunner for session management, event routing, and LLM calls. types.Part.from_bytes passes raw file bytes directly to Gemini, which reads PDFs, images, and text natively without any preprocessing.
import uuid

from google.adk.agents import Agent
from google.adk.runners import InMemoryRunner
from google.genai import types

from app.tools import (
    save_contract_extraction,
    save_general_extraction,
    save_invoice_extraction,
)

INSTRUCTION = """
You are an enterprise document intelligence agent. Your job is to analyze
uploaded documents and extract all relevant structured data from them.
When you receive a document, follow these steps:
1. Identify the document type: invoice, contract, or general.
2. Read the document carefully and extract every relevant field.
3. Call exactly ONE of the following tools with the extracted data.
Rules:
- Extract ALL fields you can find. If a field is missing, pass null.
- For line_items in invoices, format as: "Description | Qty | Unit Price | Total"
- For scanned images or photos, read all visible text before extracting.
- Always call one of the save tools. Never respond without calling a tool.
- Be precise with amounts, dates, and names. Do not infer missing values.
"""
def create_runner() -> InMemoryRunner:
    agent = Agent(
        model="gemini-2.5-flash",
        name="document_agent",
        instruction=INSTRUCTION,
        tools=[
            save_invoice_extraction,
            save_contract_extraction,
            save_general_extraction,
        ],
    )
    return InMemoryRunner(agent=agent, app_name="document_agent")
async def analyze_document(
    runner: InMemoryRunner, file_bytes: bytes,
    mime_type: str, filename: str,
) -> dict:
    user_id = "api_user"
    session_id = str(uuid.uuid4())
    await runner.session_service.create_session(
        app_name="document_agent",
        user_id=user_id, session_id=session_id,
    )
    content = types.Content(
        role="user",
        parts=[
            types.Part.from_bytes(data=file_bytes, mime_type=mime_type),
            types.Part.from_text(text=f"Analyze this document: {filename}"),
        ],
    )
    async for _ in runner.run_async(
        user_id=user_id, session_id=session_id, new_message=content,
    ):
        pass
    session = await runner.session_service.get_session(
        app_name="document_agent", user_id=user_id, session_id=session_id,
    )
    result = session.state.get("extraction_result")
    if not result:
        return {"document_type": "unknown", "extracted_data": {},
                "summary": "Could not extract structured data from this document."}
    return result
Step 6: Build the FastAPI Service
Create app/main.py with the FastAPI application, file validation, and the /analyze endpoint. The service supports PDF, JPEG, PNG, WebP, plain text, and Markdown files up to 20 MB. The runner is created once at startup and reused across requests via the lifespan context manager.
import os
from contextlib import asynccontextmanager

from dotenv import load_dotenv
from fastapi import FastAPI, File, HTTPException, UploadFile

from app.agent import analyze_document, create_runner
from app.schemas import AnalysisResponse

load_dotenv()

SUPPORTED_MIME_TYPES = {
    "application/pdf", "image/jpeg", "image/jpg",
    "image/png", "image/webp", "text/plain", "text/markdown",
}
MAX_FILE_SIZE_MB = 20

def _build_summary(document_type: str, data: dict, filename: str) -> str:
    """Build a one-line human-readable summary from the extracted fields."""
    filled = sum(1 for v in data.values() if v)
    return f"Extracted {filled} fields from {document_type or 'unknown'} document '{filename}'."

@asynccontextmanager
async def lifespan(app: FastAPI):
    if not os.getenv("GOOGLE_API_KEY"):
        raise RuntimeError("GOOGLE_API_KEY is not set.")
    app.state.runner = create_runner()
    yield

app = FastAPI(title="Document Intelligence Agent", lifespan=lifespan)

@app.get("/health")
async def health():
    return {"status": "ok"}

@app.post("/analyze", response_model=AnalysisResponse)
async def analyze(file: UploadFile = File(...)):
    if file.content_type not in SUPPORTED_MIME_TYPES:
        raise HTTPException(status_code=415, detail="Unsupported file type.")
    file_bytes = await file.read()
    if len(file_bytes) > MAX_FILE_SIZE_MB * 1024 * 1024:
        raise HTTPException(status_code=413, detail="File too large. Max 20MB.")
    if len(file_bytes) == 0:
        raise HTTPException(status_code=400, detail="Uploaded file is empty.")
    result = await analyze_document(
        runner=app.state.runner, file_bytes=file_bytes,
        mime_type=file.content_type, filename=file.filename or "document",
    )
    return AnalysisResponse(
        document_type=result.get("document_type", "unknown"),
        filename=file.filename or "document",
        extracted_data=result.get("extracted_data", {}),
        summary=_build_summary(result.get("document_type", ""),
                               result.get("extracted_data", {}),
                               file.filename or "document"),
    )
Test it locally before deploying by running uvicorn app.main:app --host 0.0.0.0 --port 8000 and sending a test file with curl. FastAPI also provides an interactive docs UI at http://localhost:8000/docs where you can upload files and inspect responses without writing any curl commands.
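Collected as a copy-paste sketch, the local smoke test looks like this (assuming a sample_invoice.txt in the current directory):

```shell
# Terminal 1: start the service
uvicorn app.main:app --host 0.0.0.0 --port 8000

# Terminal 2: send a test document
curl -X POST http://localhost:8000/analyze \
  -F "file=@sample_invoice.txt;type=text/plain"
```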
Step 7: Containerize with Docker
Create Dockerfile and docker-compose.yml
Create a Dockerfile based on python:3.11-slim that copies requirements.txt, installs dependencies, copies the app directory, and runs uvicorn on port 8000. Then create docker-compose.yml that builds the image, maps port 8000, loads the .env file, and sets restart to unless-stopped. Build and verify locally with docker compose up --build.
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app/ ./app/
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

# docker-compose.yml
services:
  app:
    build: .
    ports:
      - "8000:8000"
    env_file:
      - .env
    restart: unless-stopped
Step 8: Provision a Vultr Instance
Deploy a Vultr Cloud Compute Instance
Add a billing method to your Vultr account first. Then log in to console.vultr.com, click Quick Deploy then Instances. Select Shared CPU, location Amsterdam, image Ubuntu 24.04 LTS, plan vc2-1c-1gb ($5/month), hostname document-agent. Click Deploy and wait about 60 seconds for the status to change from Installing to Running. The IP address and root password appear on the instance overview page.
Step 9: Deploy the Agent on Vultr
SSH Into the Server and Install Docker
SSH into the server using the IP and root password from the dashboard. Install Docker with curl -fsSL https://get.docker.com | sh and enable it with systemctl. Back on your local machine, copy the project to the server with scp -r ./gemini-multimodal-document-agent root@YOUR_VULTR_IP:/opt/document-agent. On the server, create the .env file with your API key and run docker compose up -d --build. The first build takes 2 to 3 minutes.
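The deployment steps above, collected as a copy-paste sketch (replace YOUR_VULTR_IP and the API key placeholder with your own values):

```shell
# On the server
ssh root@YOUR_VULTR_IP
curl -fsSL https://get.docker.com | sh
systemctl enable --now docker

# Back on your local machine
scp -r ./gemini-multimodal-document-agent root@YOUR_VULTR_IP:/opt/document-agent

# On the server again
cd /opt/document-agent
echo "GOOGLE_API_KEY=your-gemini-api-key" > .env
docker compose up -d --build
```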
Step 10: Test the Live API
Send Documents to the Public Endpoint
From your local machine, send a real document with curl -X POST http://YOUR_VULTR_IP:8000/analyze -F "file=@sample_invoice.txt;type=text/plain". The agent identifies the document type and returns structured JSON. Try a contract file to see the agent switch tools automatically and return parties, obligations, termination conditions, and governing law instead.
Expected response from the invoice endpoint:
{
  "document_type": "invoice",
  "filename": "sample_invoice.txt",
  "extracted_data": {
    "vendor_name": "Acme Solutions Ltd.",
    "invoice_number": "INV-2026-0042",
    "invoice_date": "2026-05-05",
    "due_date": "2026-06-04",
    "total_amount": "$6,032.00",
    "currency": "USD",
    "line_items": [
      "API Integration Services | 1 | $2,500.00 | $2,500.00",
      "Cloud Infrastructure Setup | 1 | $1,200.00 | $1,200.00"
    ],
    "payment_terms": "Net 30"
  },
  "summary": "Invoice #INV-2026-0042 from Acme Solutions Ltd. for USD $6,032.00."
}
What Is Happening Under the Hood
When a file hits the /analyze endpoint, here is the execution path: FastAPI reads the file bytes and validates the MIME type. analyze_document creates a new ADK session and sends the file to Gemini via InMemoryRunner. The agent reads the document using Gemini's native multimodal understanding. Based on what it reads, the agent calls one of the three extraction tools. The tool writes structured data into the session state. After the agent finishes, we read that state and return it as JSON.
The key design decision is that the tools do not receive the document. Gemini has already read it from the multimodal message context. The tools only receive the extracted fields as typed arguments, which forces the model to commit to specific values rather than returning freeform text.
Next Steps
Add a Firewall
On Vultr, create a Firewall Group under Network to restrict port 8000 to trusted IPs, or put Nginx in front as a reverse proxy with SSL termination.
Handle Larger Files
For files over 20MB, swap Part.from_bytes for the Gemini Files API and pass a file URI. Gemini supports PDFs up to 1,000 pages.
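A sketch of that swap, assuming the google-genai SDK's Files API (imports are kept inside the function so the snippet stands alone; verify the exact upload signature against the current SDK docs):

```python
def build_large_file_part(path: str, mime_type: str = "application/pdf"):
    """Upload a large file via the Gemini Files API and return a Part
    that references its URI instead of inlining the raw bytes."""
    from google import genai
    from google.genai import types

    client = genai.Client()  # reads GOOGLE_API_KEY from the environment
    uploaded = client.files.upload(file=path)
    return types.Part.from_uri(file_uri=uploaded.uri, mime_type=mime_type)
```

In analyze_document, this Part would replace the types.Part.from_bytes call for oversized uploads.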
Add More Document Types
Define a new tool function, for example save_purchase_order_extraction, add it to the agent, and update the instruction to describe when to call it.
Persist Results
Swap InMemorySessionService for a database-backed session service and store extraction results in Postgres or Supabase for audit trails and historical queries.
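A sketch of that swap, assuming ADK's DatabaseSessionService and a Postgres connection string (imports kept inside the function so the snippet stands alone; the db_url is a placeholder):

```python
def create_persistent_runner(agent):
    """Build a runner whose sessions live in Postgres instead of memory,
    so extraction results survive restarts and can be queried later."""
    from google.adk.runners import Runner
    from google.adk.sessions import DatabaseSessionService

    session_service = DatabaseSessionService(
        db_url="postgresql://user:password@localhost:5432/document_agent"  # placeholder
    )
    return Runner(
        agent=agent,
        app_name="document_agent",
        session_service=session_service,
    )
```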
Frequently Asked Questions
What file types does the /analyze endpoint accept?
PDF, JPEG, PNG, WebP, plain text (.txt), and Markdown (.md) files up to 20 MB. For larger files, swap Part.from_bytes for the Gemini Files API and pass a file URI instead.
Why must list parameters in ADK tools be typed as list[str]?
The Gemini API generates a JSON schema from your tool annotations. An untyped list produces a schema without an items field, which the API rejects with a 400 INVALID_ARGUMENT error. Use list[str] for concrete generic types.
Can I use this stack in an AI hackathon project?
Yes, that is the point. The Docker Compose setup deploys to any cloud instance in one command, and the ADK tool-calling pattern makes it easy to extend with new document types or swap Gemini for another model.
How do I add support for a new document type like purchase orders?
Create a new extraction function in app/tools.py following the same pattern with typed parameters and ToolContext, then add it to the tools list in app/agent.py and update the instruction.
Do I need a GPU or expensive hardware to run this?
No, all the AI processing happens on Google's servers via the Gemini API. The Vultr instance only needs the $5/month plan with 1 vCPU and 1 GB RAM to run FastAPI and Docker.
Need Help with AI Agent Development?
Our experts can help you build multimodal document agents, deploy with Docker on cloud infrastructure, and design production-ready extraction pipelines for your AI hackathon projects.
