How to Use Microsoft MAI Models on Azure: Complete Guide
By Braincuber Team
Published on April 17, 2026
Microsoft dropped three production-ready models under its MAI (Microsoft AI) family, including MAI-Transcribe-1 for speech-to-text, MAI-Voice-1 for text-to-speech, and MAI-Image-2 for text-to-image generation. All three are available through Microsoft Foundry on Azure, with real API access. In this complete tutorial, we walk through exactly how to provision and connect to each model via Azure, then build a three-tab Streamlit app that tests all three models with live prompts and API calls.
What You'll Learn:
- Provision MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 on Azure
- Understand key benchmarks and what they actually mean
- Call all three models from Python via a Streamlit interface
- Build a complete three-tab demo app with speech, voice, and image generation
- Understand key features and limitations of each model before committing to build
What Are the Microsoft MAI Models?
Microsoft's MAI model family is built by the Microsoft AI Superintelligence team. These are fully in-house models, trained on Microsoft's own infrastructure, optimized for Microsoft's serving stack, and deployed directly through Azure. MAI-Transcribe-1 and MAI-Voice-1 run on the same Azure Speech infrastructure that already serves enterprise customers at scale, while MAI-Image-2 uses a flow-matching diffusion architecture with between 10 and 50 billion non-embedding parameters.
Model Overview
| Model | Task | Key Number | Price |
|---|---|---|---|
| MAI-Transcribe-1 | Speech to Text | 3.88% avg WER on FLEURS | $0.36/hr |
| MAI-Voice-1 | Text to Speech | 60s audio in 1s | $22/1M chars |
| MAI-Image-2 | Text to Image | Elo 1190 overall | $5/1M text + $33/1M img tokens |
MAI-Transcribe-1: Speech-to-Text
MAI-Transcribe-1 is a speech-to-text model supporting 25 languages. Its headline claim is a 3.88% average Word Error Rate (WER) on the FLEURS benchmark, which puts it ahead of GPT-Transcribe (4.17%), Scribe v2 (4.32%), Gemini 3.1 Flash-Lite (4.89%), and Whisper-large-v3 (7.60%).
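As a refresher on what those percentages mean: WER is word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. A minimal sketch you can use to score the model against your own ground-truth transcripts:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (S + D + I) / N via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

Note this is a plain edit-distance implementation for illustration; published benchmarks like FLEURS also apply text normalization (casing, punctuation) before scoring.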
The model is specifically built for messy real-world audio, including noisy environments, mixed accents, and natural speaking styles. Microsoft also claims MAI-Transcribe-1 delivers good accuracy at approximately 50% lower GPU cost than leading alternatives, which is a meaningful advantage for enterprises running transcription at scale.
Current Limitations
MAI-Transcribe-1 still lacks real-time transcription, speaker diarization, and context biasing, all listed as "coming soon." If these are hard requirements, consider Whisper or Azure Speech's existing real-time offering as a fallback.
MAI-Voice-1: Text-to-Speech
MAI-Voice-1 is Microsoft's high-fidelity TTS model. Its key capability is voice prompting, where you provide a 10-second audio sample and the model clones that voice without any fine-tuning. However, voice cloning requires gated approval from Microsoft.
The model generates 60 seconds of audio in a single second on a single GPU, supports per-turn emotion control via SSML, and is built for long-form content like audiobooks and podcasts. It already powers Copilot's Audio Expressions and podcast features in Microsoft's own products.
Voice Cloning
Voice prompting lets you clone any voice from a 10-second sample without fine-tuning. Requires gated approval via Microsoft's Limited Access Review process.
Emotion Control
SSML-based emotion control with four styles: neutral, joy, excitement, and empathy. Limited compared to ElevenLabs but built directly into the Azure platform.
English Only
MAI-Voice-1 is currently English-only with 10+ languages listed as "coming soon." If multilingual support is a requirement, consider alternative TTS providers.
MAI-Image-2: Text-to-Image
MAI-Image-2 uses a flow-matching diffusion architecture with 10-50 billion non-embedding parameters. It ranks in the top three on Arena.ai's image leaderboard and runs 2x faster than MAI-Image-1 in production. The model was developed in collaboration with photographers, designers, and visual storytellers, and is already deployed inside Copilot, Bing Image Creator, and PowerPoint.
Elo Scores by Category
| Category | Elo Score | Notes |
|---|---|---|
| Photorealistic | 1201 | Top-tier performance |
| Portraits | 1201 | Accurate skin tones |
| Text Rendering | 1186 | Better than before but still weakest category |
The model is built in particular for photorealism, accurate skin tones, and legible in-image text. However, text rendering is inconsistent across runs: in multiple test generations, the model occasionally produced misspelled or garbled text. This is a known failure mode for all current diffusion models, and MAI-Image-2 is better than most but not immune.
Human Review Required
If you are generating images where text accuracy is critical, always add a human review step. The current maximum resolution is 1024x1024 pixels, which rules out print-quality output without an upscaling step.
Setting Up Azure for All Three Models
All three MAI models are available on Microsoft Azure. MAI-Transcribe-1 and MAI-Voice-1 share the same Azure Speech resource, while MAI-Image-2 runs through Microsoft Foundry. Let's set up both step by step.
Prerequisites
- Sign up or log in to your Azure account to get $200 free credit
- Python 3.9 or higher
Step 1: Create an Azure Speech Resource
Both MAI-Transcribe-1 and MAI-Voice-1 models are accessed through the Azure AI Speech service, so one resource covers both.
Navigate to Azure Portal
Go to portal.azure.com and search for "Speech" in the search bar, then select Speech services.
Click Create
Click the + Create button to start provisioning a new Speech resource.
Configure the Resource
Fill in the details: Resource group (create mai-demo-rg), Region (East US - MAI models only supported in East US and West US), Name (mai-speech-demo), Pricing tier (Standard S0). Note: Free F0 has rate limits that may interrupt the demo.
Get Your Credentials
Once deployment completes, go to Keys and Endpoint in the left sidebar and copy Key 1 and note the Region value (eastus).
Step 2: Create a Foundry Resource and Deploy MAI-Image-2
MAI-Image-2 is accessed through Microsoft Foundry, which is separate from the regular Azure portal.
Access Foundry
From the Azure portal homepage, click Foundry in the top services row, then click Create a resource.
Configure Foundry
Set the Resource group to mai-demo-rg (same as Step 1), Region to East US, Name to mai-image-demo, and Default project name to proj-default.
Deploy MAI-Image-2
Once the resource is live, click Go to Foundry portal. In the Foundry interface, click Model catalog in the left sidebar, search for MAI, select MAI-Image-2, click Use this model, set the Deployment name to MAI-Image-2 and Deployment type to Global Standard, then click deploy.
Copy Credentials
After deployment, copy the Target URI (something like https://mai-image-demo.services.ai.azure.com/models) and the Key from the deployment details page.
Step 3: Configure the .env File
Create a .env file in your local project directory and add the credentials from the previous two steps:
AZURE_SPEECH_KEY=your_speech_key_1_here
AZURE_SPEECH_REGION=eastus
AZURE_IMAGE_ENDPOINT=https://your-resource-name.services.ai.azure.com/models
AZURE_IMAGE_KEY=your_foundry_key_here
AZURE_IMAGE_DEPLOYMENT=MAI-Image-2
Case Sensitivity Warning
AZURE_IMAGE_DEPLOYMENT is case-sensitive. Use MAI-Image-2 exactly as it appears in Foundry. Lowercase mai-image-2 will return an unknown_model error.
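To catch a missing or miscased variable at startup rather than at request time, you can validate the loaded environment before making any API calls. This is a convenience sketch (the `check_env` helper is not part of any SDK) based on the five variable names in the .env layout above:

```python
# The five variables the demo expects, matching the .env layout above.
REQUIRED_VARS = [
    "AZURE_SPEECH_KEY",
    "AZURE_SPEECH_REGION",
    "AZURE_IMAGE_ENDPOINT",
    "AZURE_IMAGE_KEY",
    "AZURE_IMAGE_DEPLOYMENT",
]

def check_env(env: dict) -> list:
    """Return a list of problems found in the loaded environment."""
    problems = [f"missing: {v}" for v in REQUIRED_VARS if not env.get(v)]
    # The deployment name is case-sensitive; anything other than the exact
    # Foundry spelling triggers an unknown_model error at request time.
    dep = env.get("AZURE_IMAGE_DEPLOYMENT", "")
    if dep and dep != "MAI-Image-2":
        problems.append(f"deployment name mismatch: {dep!r} != 'MAI-Image-2'")
    return problems
```

Call it with `check_env(dict(os.environ))` right after `load_dotenv()` and fail fast if the returned list is non-empty.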
Building the MAI Models Demo App
We'll build a three-tab Streamlit app that calls all three MAI models via live Azure API calls. The app is split into two files: mai_clients.py, which handles all Azure API calls, and app.py, which handles the Streamlit UI.
Project Structure
mai_demo/
├── app.py # Streamlit UI
├── mai_clients.py # Azure API wrappers
├── requirements.txt
└── .env
Install Dependencies
Create a requirements.txt file with the following packages:
streamlit>=1.35.0
azure-cognitiveservices-speech>=1.38.0
requests>=2.31.0
pillow>=10.0.0
python-dotenv>=1.0.0
openai>=1.30.0
Install all dependencies:
pip install -r requirements.txt
MAI-Transcribe-1: Speech-to-Text Client
MAI-Transcribe-1 is accessed via the Azure Speech REST API. The key detail is that it takes a multipart form request with an audio file and a JSON definition object. The response comes back with combinedPhrases, which is an array of text segments that we join into a single transcript string.
import os

import requests
from dotenv import load_dotenv

load_dotenv()
speech_key = os.environ["AZURE_SPEECH_KEY"]
speech_region = os.environ["AZURE_SPEECH_REGION"]

def _mime_type(filename: str) -> str:
    # Map the supported audio extensions to MIME types
    ext = filename.rsplit(".", 1)[-1].lower()
    return {
        "wav": "audio/wav",
        "mp3": "audio/mpeg",
        "flac": "audio/flac",
    }.get(ext, "application/octet-stream")

def transcribe_audio(audio_bytes: bytes, filename: str = "audio.wav") -> dict:
    url = (
        f"https://{speech_region}.api.cognitive.microsoft.com"
        f"/speechtotext/transcriptions:transcribe?api-version=2024-11-15"
    )
    headers = {"Ocp-Apim-Subscription-Key": speech_key}
    definition = '{"locales":["en-US"],"profanityFilterMode":"None"}'
    files = {
        "audio": (filename, audio_bytes, _mime_type(filename)),
        "definition": (None, definition, "application/json"),
    }
    resp = requests.post(url, headers=headers, files=files, timeout=60)
    resp.raise_for_status()
    data = resp.json()
    combined = " ".join(
        p.get("text", "") for p in data.get("combinedPhrases", [])
    ).strip()
    return {"success": True, "transcript": combined, "raw": data}
Supported audio formats include WAV, MP3, and FLAC. The API is region-locked to East US and West US for now.
MAI-Voice-1: Text-to-Speech Client
MAI-Voice-1 uses standard Azure Speech TTS via SSML. The key is constructing the <mstts:express-as> block for emotion control and referencing the right voice name format (en-us-Jasper:MAI-Voice-1).
def _build_ssml(text: str, emotion: str, voice_name: str) -> str:
    style_map = {"neutral": None, "joy": "joy", "excitement": "excitement", "empathy": "empathy"}
    style = style_map.get(emotion)
    # Wrap the text in an <mstts:express-as> block unless the style is neutral
    text_block = (
        text if style is None
        else f'<mstts:express-as style="{style}">{text}</mstts:express-as>'
    )
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">'
        f'<voice name="{voice_name}">{text_block}</voice>'
        "</speak>"
    )
Available voice names follow the pattern en-us-{Name}:MAI-Voice-1. The current roster includes Jasper, June, Grant, Iris, Reed, and Joy.
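Since every voice follows the same naming pattern, a small helper keeps the strings consistent across the app. This is a convenience sketch (the `mai_voice` helper is not part of any SDK), and it treats the six names above as a snapshot of the roster rather than an exhaustive list:

```python
# Voice roster from the section above; update as Microsoft adds voices.
KNOWN_VOICES = {"Jasper", "June", "Grant", "Iris", "Reed", "Joy"}

def mai_voice(name: str) -> str:
    """Build a full MAI-Voice-1 voice name like 'en-us-Jasper:MAI-Voice-1'."""
    if name not in KNOWN_VOICES:
        raise ValueError(f"unknown voice: {name!r}")
    return f"en-us-{name}:MAI-Voice-1"
```

The early `ValueError` means a typo in a voice name surfaces in your code rather than as an opaque synthesis failure from the API.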
MAI-Image-2: Text-to-Image Client
MAI-Image-2 uses a Foundry-specific endpoint, which is not the standard Azure OpenAI /openai/deployments/ path. The correct endpoint format is:
url = f"{base_endpoint}/mai/v1/images/generations"
Dimension Constraints
Both width and height must be at least 768 pixels, and their product cannot exceed 1,048,576. This works out to a maximum of 1024x1024 for square images. Passing 512x512 or omitting dimensions entirely will return a 400 Bad Request.
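These constraints are cheap to enforce client-side before spending a request. A minimal sketch, using the limits stated above:

```python
MIN_SIDE = 768          # both width and height must be at least 768 px
MAX_PIXELS = 1_048_576  # width * height cap (1024 * 1024)

def validate_dims(width: int, height: int) -> None:
    """Raise ValueError if MAI-Image-2 would reject these dimensions."""
    if width < MIN_SIDE or height < MIN_SIDE:
        raise ValueError(f"each side must be >= {MIN_SIDE}px, got {width}x{height}")
    if width * height > MAX_PIXELS:
        raise ValueError(f"{width}x{height} exceeds the {MAX_PIXELS}-pixel cap")
```

Calling this in `generate_image` before the POST turns a round-trip 400 Bad Request into an immediate, readable local error.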
Building the Streamlit UI
The app is organized into three tabs, one per model:
import streamlit as st
from mai_clients import transcribe_audio, synthesize_speech, generate_image
tab1, tab2, tab3 = st.tabs(["Transcribe", "Voice", "Image"])
Running the App
With your .env file configured and dependencies installed, run the app:
streamlit run app.py
Streamlit will start a local server and automatically open the app in your browser at http://localhost:8501. You should see the three-tab interface.
Testing Results and Real-World Performance
Noisy Audio Transcription
Testing MAI-Transcribe-1 with a 7-second audio clip recorded in a noisy environment with an Indian accent returned a clean, accurate transcript in under 2 seconds. The result held up well on accented speech, which is where Whisper-large-v3 typically struggles most.
Per-Language Variance
While the headline WER is 3.88%, per-language numbers tell a different story. Arabic sits at 10.1% and Danish at 13.2%, both well above the headline figure. If your use case involves a specific language or regional accent distribution, test it against your actual audio before committing to the model in production.
Emotion Control Performance
Synthesizing a 113-character sentence with the excitement style using the en-us-Jasper:MAI-Voice-1 voice model returned a 6-second MP3 in under 1 second. The emotion was clearly present in the output with more energetic pacing and higher pitch variance. However, the emotion control is limited to four styles: neutral, joy, excitement, and empathy.
Photorealism and Text Rendering
Testing with a prompt combining photorealistic scene composition and in-image text: "A worn paperback book on a cafe table, natural window light, steam rising from an espresso cup beside it. The book cover reads: INFERENCE AT SCALE -- text clearly legible, not warped."
The photorealism was genuinely strong with natural window lighting, convincing steam, and believable texture on the book and cup. The text "INFERENCE AT SCALE" also rendered correctly and legibly. Image generation typically takes 2-4 seconds, which is noticeably faster than MAI-Image-1 in the same environment.
Cost Analysis
New Azure accounts come with $200 in free credit, and Azure for Students accounts include $100. Here is what real usage costs:
| Model | Example Usage | Cost |
|---|---|---|
| MAI-Transcribe-1 | 7-second clip | ~$0.0007 |
| MAI-Voice-1 | 113-character synthesis | ~$0.0025 |
| MAI-Image-2 | 1 generation (1024x1024) | $0.005-0.02 |
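Working from the list prices in the overview table, a back-of-the-envelope estimate for a demo session looks like this (a rough sketch; the per-image cost is quoted above as a range rather than derived from token counts, so it is left out of the helpers):

```python
TRANSCRIBE_PER_HOUR = 0.36  # $ per hour of audio
VOICE_PER_MCHAR = 22.0      # $ per 1M characters synthesized

def transcribe_cost(seconds: float) -> float:
    """Cost of transcribing an audio clip of the given length."""
    return seconds / 3600 * TRANSCRIBE_PER_HOUR

def voice_cost(chars: int) -> float:
    """Cost of synthesizing the given number of characters."""
    return chars / 1_000_000 * VOICE_PER_MCHAR

# The 7-second clip and 113-character synthesis from the tests above
total = transcribe_cost(7) + voice_cost(113)
```

Even adding a handful of image generations at the high end of the quoted range, a full session stays in the low cents.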
Given that a full demo session across all three models costs only a few cents, the free credit is more than enough to get started.
Frequently Asked Questions
Do I need a paid Azure account to use the MAI models?
No. New Azure accounts come with $200 in free credit, and Azure for Students accounts include $100. Given that a full demo session across all three models costs only a few cents, the free credit is more than enough to get started.
Can I use MAI-Transcribe-1 for real-time transcription in a voice agent?
Not yet. MAI-Transcribe-1 is batch-only, so you submit a complete audio file and receive a transcript back. If real-time is a hard requirement, keep Whisper or Azure Speech's existing real-time offering as a fallback.
Why does MAI-Image-2 sometimes misspell words in generated images?
This is a known limitation of diffusion-based image generation models. The model generates images by transforming noise into pixels, so it has no explicit understanding of spelling or character sequences. MAI-Image-2 handles text rendering significantly better than its predecessor (Elo 1186 vs 1069), but inconsistencies still occur. Always add a human review step before using such images in production.
Can I clone any voice with MAI-Voice-1?
Voice cloning through the Personal Voice feature requires a 10-second audio sample from the voice talent and explicit recorded consent. Access is gated, so you need to submit an approval request through Microsoft's Limited Access Review process. The curated voice library is available immediately without approval.
What are the MAI-Image-2 resolution constraints?
Both width and height must be at least 768 pixels, and their product cannot exceed 1,048,576 pixels. This works out to a maximum of 1024x1024 for square images. Passing 512x512 or omitting dimensions will return a 400 Bad Request. The current maximum resolution rules out print-quality output without an upscaling step.
Need Help with Microsoft AI on Azure?
Our AI experts can help you integrate MAI models into your applications, optimize costs, and build production-ready AI workflows on Azure.
