How to Use Microsoft MAI Models on Azure: Complete Guide
By Braincuber Team
Published on April 17, 2026
Microsoft dropped three production-ready models under its MAI (Microsoft AI) family, including MAI-Transcribe-1 for speech-to-text, MAI-Voice-1 for text-to-speech, and MAI-Image-2 for text-to-image generation. All three are available through Microsoft Foundry on Azure, with real API access. In this complete tutorial, we walk through exactly how to provision and connect to each model via Azure, then build a three-tab Streamlit app that tests all three models with live prompts and API calls.
What You'll Learn:
- Provision MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 on Azure
- Understand key benchmarks and what they actually mean
- Call all three models from Python via a Streamlit interface
- Build a complete three-tab demo app with speech, voice, and image generation
- Understand key features and limitations of each model before committing to build
What Are the Microsoft MAI Models?
Microsoft's MAI model family is built by the Microsoft AI Superintelligence team. These are fully in-house models, trained on Microsoft's own infrastructure, optimized for Microsoft's serving stack, and deployed directly through Azure. MAI-Transcribe-1 and MAI-Voice-1 run on the same Azure Speech infrastructure that already serves enterprise customers at scale, while MAI-Image-2 uses a flow-matching diffusion architecture with between 10 and 50 billion non-embedding parameters.
Model Overview
| Model | Task | Key Number | Price |
|---|---|---|---|
| MAI-Transcribe-1 | Speech to Text | 3.88% avg WER on FLEURS | $0.36/hr |
| MAI-Voice-1 | Text to Speech | 60s audio in 1s | $22/1M chars |
| MAI-Image-2 | Text to Image | Elo 1190 overall | $5/1M text + $33/1M img tokens |
MAI-Transcribe-1: Speech-to-Text
MAI-Transcribe-1 is a speech-to-text model supporting 25 languages. Its headline claim is a 3.88% average Word Error Rate (WER) on the FLEURS benchmark, which puts it ahead of GPT-Transcribe (4.17%), Scribe v2 (4.32%), Gemini 3.1 Flash-Lite (4.89%), and Whisper-large-v3 (7.60%).
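As a refresher on what those percentages mean: WER is word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. A minimal sketch you can use to score the model against your own ground-truth transcripts:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (S + D + I) / N via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

Note this is a plain edit-distance implementation for illustration; published benchmarks like FLEURS also apply text normalization (casing, punctuation) before scoring.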
The model is specifically built for messy real-world audio, including noisy environments, mixed accents, and natural speaking styles. Microsoft also claims MAI-Transcribe-1 delivers good accuracy at approximately 50% lower GPU cost than leading alternatives, which is a meaningful advantage for enterprises running transcription at scale.
Current Limitations
MAI-Transcribe-1 still lacks real-time transcription, speaker diarization, and context biasing, all listed as "coming soon." If these are hard requirements, consider Whisper or Azure Speech's existing real-time offering as a fallback.
MAI-Voice-1: Text-to-Speech
MAI-Voice-1 is Microsoft's high-fidelity TTS model. Its key capability is voice prompting, where you provide a 10-second audio sample and the model clones that voice without any fine-tuning. However, voice cloning requires gated approval from Microsoft.
The model generates 60 seconds of audio in a single second on a single GPU, supports per-turn emotion control via SSML, and is built for long-form content like audiobooks and podcasts. It already powers Copilot's Audio Expressions and podcast features in Microsoft's own products.
Voice Cloning
Voice prompting lets you clone any voice from a 10-second sample without fine-tuning. Requires gated approval via Microsoft's Limited Access Review process.
Emotion Control
SSML-based emotion control with four styles: neutral, joy, excitement, and empathy. Limited compared to ElevenLabs but built directly into the Azure platform.
English Only
MAI-Voice-1 is currently English-only with 10+ languages listed as "coming soon." If multilingual support is a requirement, consider alternative TTS providers.
MAI-Image-2: Text-to-Image
MAI-Image-2 uses a flow-matching diffusion architecture with 10-50 billion non-embedding parameters. It ranks in the top three on Arena.ai's image leaderboard and runs 2x faster than MAI-Image-1 in production. The model was developed in collaboration with photographers, designers, and visual storytellers, and is already deployed inside Copilot, Bing Image Creator, and PowerPoint.
Elo Scores by Category
| Category | Elo Score | Notes |
|---|---|---|
| Photorealistic | 1201 | Top-tier performance |
| Portraits | 1201 | Accurate skin tones |
| Text Rendering | 1186 | Better than before but still weakest category |
The model is built in particular for photorealism, accurate skin tones, and legible in-image text. However, text rendering is inconsistent across runs: in multiple test generations, the model occasionally produced misspelled or garbled text. This is a known failure mode for all current diffusion models, and MAI-Image-2 is better than most but not immune.
Human Review Required
If you are generating images where text accuracy is critical, always add a human review step. The current maximum resolution is 1024x1024 pixels, which rules out print-quality output without an upscaling step.
Setting Up Azure for All Three Models
All three MAI models are available on Microsoft Azure. MAI-Transcribe-1 and MAI-Voice-1 share the same Azure Speech resource, while MAI-Image-2 runs through Microsoft Foundry. Let's set up both step by step.
Prerequisites
- Sign up or log in to your Azure account to get $200 free credit
- Python 3.9 or higher
Step 1: Create an Azure Speech Resource
Both MAI-Transcribe-1 and MAI-Voice-1 models are accessed through the Azure AI Speech service, so one resource covers both.
Navigate to Azure Portal
Go to portal.azure.com and search for "Speech" in the search bar, then select Speech services.
Click Create
Click the + Create button to start provisioning a new Speech resource.
Configure the Resource
Fill in the details: Resource group (create mai-demo-rg), Region (East US - MAI models only supported in East US and West US), Name (mai-speech-demo), Pricing tier (Standard S0). Note: Free F0 has rate limits that may interrupt the demo.
Get Your Credentials
Once deployment completes, go to Keys and Endpoint in the left sidebar and copy Key 1 and note the Region value (eastus).
Step 2: Create a Foundry Resource and Deploy MAI-Image-2
MAI-Image-2 is accessed through Microsoft Foundry, which is separate from the regular Azure portal.
Access Foundry
From the Azure portal homepage, click Foundry in the top services row, then click Create a resource.
Configure Foundry
Set the Resource group to mai-demo-rg (same as Step 1), Region to East US, Name to mai-image-demo, and Default project name to proj-default.
Deploy MAI-Image-2
Once the resource is live, click Go to Foundry portal. In the Foundry interface, click Model catalog in the left sidebar, search for MAI, select MAI-Image-2, click Use this model, set the Deployment name to MAI-Image-2 and Deployment type to Global Standard, then click deploy.
Copy Credentials
After deployment, copy the Target URI (something like https://mai-image-demo.services.ai.azure.com/models) and the Key from the deployment details page.
Step 3: Configure the .env File
Create a .env file in your local project directory and add the credentials from the previous two steps:
AZURE_SPEECH_KEY=your_speech_key_1_here
AZURE_SPEECH_REGION=eastus
AZURE_IMAGE_ENDPOINT=https://your-resource-name.services.ai.azure.com/models
AZURE_IMAGE_KEY=your_foundry_key_here
AZURE_IMAGE_DEPLOYMENT=MAI-Image-2
Case Sensitivity Warning
AZURE_IMAGE_DEPLOYMENT is case-sensitive. Use MAI-Image-2 exactly as it appears in Foundry. Lowercase mai-image-2 will return an unknown_model error.
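To catch a missing or miscased variable at startup rather than at request time, you can validate the loaded environment before making any API calls. This is a convenience sketch (the `check_env` helper is not part of any SDK) based on the five variable names in the .env layout above:

```python
# The five variables the demo expects, matching the .env layout above.
REQUIRED_VARS = [
    "AZURE_SPEECH_KEY",
    "AZURE_SPEECH_REGION",
    "AZURE_IMAGE_ENDPOINT",
    "AZURE_IMAGE_KEY",
    "AZURE_IMAGE_DEPLOYMENT",
]

def check_env(env: dict) -> list:
    """Return a list of problems found in the loaded environment."""
    problems = [f"missing: {v}" for v in REQUIRED_VARS if not env.get(v)]
    # The deployment name is case-sensitive; anything other than the exact
    # Foundry spelling triggers an unknown_model error at request time.
    dep = env.get("AZURE_IMAGE_DEPLOYMENT", "")
    if dep and dep != "MAI-Image-2":
        problems.append(f"deployment name mismatch: {dep!r} != 'MAI-Image-2'")
    return problems
```

Call it with `check_env(dict(os.environ))` right after `load_dotenv()` and fail fast if the returned list is non-empty.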
Building the MAI Models Demo App
We'll build a three-tab Streamlit app that calls all three MAI models via live Azure API calls. The app is split into two files: mai_clients.py, which handles all Azure API calls, and app.py, which handles the Streamlit UI.
Project Structure
mai_demo/
├── app.py # Streamlit UI
├── mai_clients.py # Azure API wrappers
├── requirements.txt
└── .env
Install Dependencies
Create a requirements.txt file with the following packages:
streamlit>=1.35.0
azure-cognitiveservices-speech>=1.38.0
requests>=2.31.0
pillow>=10.0.0
python-dotenv>=1.0.0
openai>=1.30.0
Install all dependencies:
pip install -r requirements.txt
MAI-Transcribe-1: Speech-to-Text Client
MAI-Transcribe-1 is accessed via the Azure Speech REST API. The key detail is that it takes a multipart form request with an audio file and a JSON definition object. The response comes back with combinedPhrases, which is an array of text segments that we join into a single transcript string.
import os

import requests
from dotenv import load_dotenv

load_dotenv()
speech_key = os.environ["AZURE_SPEECH_KEY"]
speech_region = os.environ["AZURE_SPEECH_REGION"]

def _mime_type(filename: str) -> str:
    # Map the supported audio extensions to MIME types
    ext = filename.rsplit(".", 1)[-1].lower()
    return {
        "wav": "audio/wav",
        "mp3": "audio/mpeg",
        "flac": "audio/flac",
    }.get(ext, "application/octet-stream")

def transcribe_audio(audio_bytes: bytes, filename: str = "audio.wav") -> dict:
    url = (
        f"https://{speech_region}.api.cognitive.microsoft.com"
        f"/speechtotext/transcriptions:transcribe?api-version=2024-11-15"
    )
    headers = {"Ocp-Apim-Subscription-Key": speech_key}
    definition = '{"locales":["en-US"],"profanityFilterMode":"None"}'
    files = {
        "audio": (filename, audio_bytes, _mime_type(filename)),
        "definition": (None, definition, "application/json"),
    }
    resp = requests.post(url, headers=headers, files=files, timeout=60)
    resp.raise_for_status()
    data = resp.json()
    combined = " ".join(
        p.get("text", "") for p in data.get("combinedPhrases", [])
    ).strip()
    return {"success": True, "transcript": combined, "raw": data}
Supported audio formats include WAV, MP3, and FLAC. The API is region-locked to East US and West US for now.
MAI-Voice-1: Text-to-Speech Client
MAI-Voice-1 uses standard Azure Speech TTS via SSML. The key is constructing the <mstts:express-as> block for emotion control and referencing the right voice name format (en-us-Jasper:MAI-Voice-1).
def _build_ssml(text: str, emotion: str, voice_name: str) -> str:
    style_map = {"neutral": None, "joy": "joy", "excitement": "excitement", "empathy": "empathy"}
    style = style_map.get(emotion)
    # Wrap the text in an <mstts:express-as> block unless the style is neutral
    text_block = (
        text if style is None
        else f'<mstts:express-as style="{style}">{text}</mstts:express-as>'
    )
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">'
        f'<voice name="{voice_name}">{text_block}</voice>'
        "</speak>"
    )
Available voice names follow the pattern en-us-{Name}:MAI-Voice-1. The current roster includes Jasper, June, Grant, Iris, Reed, and Joy.
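Since every voice follows the same naming pattern, a small helper keeps the strings consistent across the app. This is a convenience sketch (the `mai_voice` helper is not part of any SDK), and it treats the six names above as a snapshot of the roster rather than an exhaustive list:

```python
# Voice roster from the section above; update as Microsoft adds voices.
KNOWN_VOICES = {"Jasper", "June", "Grant", "Iris", "Reed", "Joy"}

def mai_voice(name: str) -> str:
    """Build a full MAI-Voice-1 voice name like 'en-us-Jasper:MAI-Voice-1'."""
    if name not in KNOWN_VOICES:
        raise ValueError(f"unknown voice: {name!r}")
    return f"en-us-{name}:MAI-Voice-1"
```

The early `ValueError` means a typo in a voice name surfaces in your code rather than as an opaque synthesis failure from the API.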
MAI-Image-2: Text-to-Image Client
MAI-Image-2 uses a Foundry-specific endpoint, which is not the standard Azure OpenAI /openai/deployments/ path. The correct endpoint format is:
url = f"{base_endpoint}/mai/v1/images/generations"
Dimension Constraints
Both width and height must be at least 768 pixels, and their product cannot exceed 1,048,576. This works out to a maximum of 1024x1024 for square images. Passing 512x512 or omitting dimensions entirely will return a 400 Bad Request.
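These constraints are cheap to enforce client-side before spending a request. A minimal sketch, using the limits stated above:

```python
MIN_SIDE = 768          # both width and height must be at least 768 px
MAX_PIXELS = 1_048_576  # width * height cap (1024 * 1024)

def validate_dims(width: int, height: int) -> None:
    """Raise ValueError if MAI-Image-2 would reject these dimensions."""
    if width < MIN_SIDE or height < MIN_SIDE:
        raise ValueError(f"each side must be >= {MIN_SIDE}px, got {width}x{height}")
    if width * height > MAX_PIXELS:
        raise ValueError(f"{width}x{height} exceeds the {MAX_PIXELS}-pixel cap")
```

Calling this in `generate_image` before the POST turns a round-trip 400 Bad Request into an immediate, readable local error.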
Building the Streamlit UI
The app is organized into three tabs, one per model:
import streamlit as st
from mai_clients import transcribe_audio, synthesize_speech, generate_image
tab1, tab2, tab3 = st.tabs(["Transcribe", "Voice", "Image"])
Running the App
With your .env file configured and dependencies installed, run the app:
streamlit run app.py
Streamlit will start a local server and automatically open the app in your browser at http://localhost:8501. You should see the three-tab interface.
Testing Results and Real-World Performance
Noisy Audio Transcription
Testing MAI-Transcribe-1 with a 7-second audio clip recorded in a noisy environment with an Indian accent returned a clean, accurate transcript in under 2 seconds. The result held up well on accented speech, which is where Whisper-large-v3 typically struggles most.
Per-Language Variance
While the headline WER is 3.88%, per-language numbers tell a different story. Arabic sits at 10.1% and Danish at 13.2%, both well above the headline figure. If your use case involves a specific language or regional accent distribution, test it against your actual audio before committing to the model in production.
Emotion Control Performance
Synthesizing a 113-character sentence with the excitement style using the en-us-Jasper:MAI-Voice-1 voice model returned a 6-second MP3 in under 1 second. The emotion was clearly present in the output with more energetic pacing and higher pitch variance. However, the emotion control is limited to four styles: neutral, joy, excitement, and empathy.
Photorealism and Text Rendering
Testing with a prompt combining photorealistic scene composition and in-image text: "A worn paperback book on a cafe table, natural window light, steam rising from an espresso cup beside it. The book cover reads: INFERENCE AT SCALE -- text clearly legible, not warped."
The photorealism was genuinely strong with natural window lighting, convincing steam, and believable texture on the book and cup. The text "INFERENCE AT SCALE" also rendered correctly and legibly. Image generation typically takes 2-4 seconds, which is noticeably faster than MAI-Image-1 in the same environment.
Cost Analysis
New Azure accounts come with $200 in free credit, and Azure for Students accounts include $100. Here is what real usage costs:
| Model | Example Usage | Cost |
|---|---|---|
| MAI-Transcribe-1 | 7-second clip | ~$0.0007 |
| MAI-Voice-1 | 113-character synthesis | ~$0.0025 |
| MAI-Image-2 | 1 generation (1024x1024) | $0.005-0.02 |
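Working from the list prices in the overview table, a back-of-the-envelope estimate for a demo session looks like this (a rough sketch; the per-image cost is quoted above as a range rather than derived from token counts, so it is left out of the helpers):

```python
TRANSCRIBE_PER_HOUR = 0.36  # $ per hour of audio
VOICE_PER_MCHAR = 22.0      # $ per 1M characters synthesized

def transcribe_cost(seconds: float) -> float:
    """Cost of transcribing an audio clip of the given length."""
    return seconds / 3600 * TRANSCRIBE_PER_HOUR

def voice_cost(chars: int) -> float:
    """Cost of synthesizing the given number of characters."""
    return chars / 1_000_000 * VOICE_PER_MCHAR

# The 7-second clip and 113-character synthesis from the tests above
total = transcribe_cost(7) + voice_cost(113)
```

Even adding a handful of image generations at the high end of the quoted range, a full session stays in the low cents.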
Given that a full demo session across all three models costs only a few cents, the free credit is more than enough to get started.
Frequently Asked Questions
Do I need a paid Azure account to use the MAI models?
No. New Azure accounts come with $200 in free credit, and Azure for Students accounts include $100. Given that a full demo session across all three models costs only a few cents, the free credit is more than enough to get started.
Can I use MAI-Transcribe-1 for real-time transcription in a voice agent?
Not yet. MAI-Transcribe-1 is batch-only, so you submit a complete audio file and receive a transcript back. If real-time is a hard requirement, keep Whisper or Azure Speech's existing real-time offering as a fallback.
Why does MAI-Image-2 sometimes misspell words in generated images?
This is a known limitation of diffusion-based image generation models. The model generates images by transforming noise into pixels, so it has no explicit understanding of spelling or character sequences. MAI-Image-2 handles text rendering significantly better than its predecessor (Elo 1186 vs 1069), but inconsistencies still occur. Always add a human review step before using such images in production.
Can I clone any voice with MAI-Voice-1?
Voice cloning through the Personal Voice feature requires a 10-second audio sample from the voice talent and explicit recorded consent. Access is gated, so you need to submit an approval request through Microsoft's Limited Access Review process. The curated voice library is available immediately without approval.
What are the MAI-Image-2 resolution constraints?
Both width and height must be at least 768 pixels, and their product cannot exceed 1,048,576 pixels. This works out to a maximum of 1024x1024 for square images. Passing 512x512 or omitting dimensions will return a 400 Bad Request. The current maximum resolution rules out print-quality output without an upscaling step.
Need Help with Microsoft AI on Azure?
Our AI experts can help you integrate MAI models into your applications, optimize costs, and build production-ready AI workflows on Azure.
