Media (image, audio & transcription)

Rig provides unified abstractions for image generation, audio generation (text-to-speech), and audio transcription (speech-to-text) alongside its core text completion and embedding capabilities. Each media type is a trait implemented per provider, and models are created from the same provider clients you already use for completions and embeddings (see Providers & Clients).

Image generation

Requires the image feature flag: cargo add rig -F image

The rig::image_generation module provides the ImageGenerationModel trait for generating images from text prompts.

Core Types

pub trait ImageGenerationModel: Clone + Send + Sync {
    type Response: Send + Sync;

    async fn image_generation(
        &self,
        request: ImageGenerationRequest,
    ) -> Result<ImageGenerationResponse<Self::Response>, ImageGenerationError>;
}

Request Building

Use ImageGenerationRequestBuilder to construct requests:

use rig::image_generation::ImageGenerationModel;

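// `model` is any value implementing `ImageGenerationModel`,
// for example one created from a provider client.
let response = model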
    .image_generation_request()
    .prompt("A futuristic city at sunset")
    .width(1024)
    .height(1024)
    .send()
    .await?;

// Access the generated image
let image_data = response.image;

ImageGenerationResponse

pub struct ImageGenerationResponse<T> {
    /// The generated image data
    pub image: Vec<u8>,
    /// The raw provider response (`T` is the provider's `Response` type)
    pub response: T,
}
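Because `image` is a plain `Vec<u8>`, persisting it is ordinary file I/O. The following is a minimal sketch using only the standard library; it assumes the bytes are an encoded image file (the actual format depends on the provider and model), and the helper names `guess_image_extension` and `save_image` are hypothetical, not part of Rig.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Guess a file extension from the leading "magic" bytes of an image.
/// Falls back to "bin" when the format is not recognised.
fn guess_image_extension(bytes: &[u8]) -> &'static str {
    const PNG: [u8; 8] = [0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A];
    const JPEG: [u8; 3] = [0xFF, 0xD8, 0xFF];

    if bytes.starts_with(&PNG) {
        "png"
    } else if bytes.starts_with(&JPEG) {
        "jpg"
    } else if bytes.len() >= 12 && &bytes[0..4] == b"RIFF" && &bytes[8..12] == b"WEBP" {
        "webp"
    } else {
        "bin"
    }
}

/// Write image bytes to `dir/<stem>.<ext>` and return the full path.
/// Hypothetical helper: pass it `&response.image` from a generation call.
fn save_image(bytes: &[u8], dir: &Path, stem: &str) -> io::Result<PathBuf> {
    let path = dir.join(format!("{stem}.{}", guess_image_extension(bytes)));
    fs::write(&path, bytes)?;
    Ok(path)
}
```

With a real response this would be called as `save_image(&response.image, Path::new("."), "city")?`. Sniffing the header avoids hard-coding an extension that may not match what the provider returned.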

Using with Agents

Create an image generation model from any provider client that supports it, then build requests on the model:

use rig::providers::openai;

let openai = openai::Client::from_env()?;
let dalle = openai.image_generation_model("dall-e-3");

let response = dalle
    .image_generation_request()
    .prompt("A robot painting a landscape")
    .send()
    .await?;

Audio generation (text-to-speech)

Requires the audio feature flag: cargo add rig -F audio

The rig::audio_generation module provides the AudioGenerationModel trait for converting text to speech.

Core Types

pub trait AudioGenerationModel:
    Sized
    + Clone
    + WasmCompatSend
    + WasmCompatSync {
    type Response: Send + Sync;
    type Client;

    // Required methods
    fn make(client: &Self::Client, model: impl Into<String>) -> Self;
    fn audio_generation(
        &self,
        request: AudioGenerationRequest,
    ) -> impl Future<Output = Result<AudioGenerationResponse<Self::Response>, AudioGenerationError>> + Send;

    // Provided method
    fn audio_generation_request(&self) -> AudioGenerationRequestBuilder<Self> { ... }
}

Request Building

use rig::audio_generation::AudioGenerationModel;
use rig::providers::openai;

let openai = openai::Client::from_env()?;
let model = openai.audio_generation_model("tts-1"); // requires the `audio` feature

let response = model
    .audio_generation_request()
    .text("Hello, how can I help you today?")
    .voice("alloy")
    .send()
    .await?;

// Access the generated audio
let audio_bytes = response.audio;

AudioGenerationResponse

pub struct AudioGenerationResponse<T> {
    /// The generated audio data
    pub audio: Vec<u8>,
    /// The raw provider response (`T` is the provider's `Response` type)
    pub response: T,
}

Audio transcription (speech-to-text)

The rig::transcription module provides the TranscriptionModel trait for transcribing audio to text.

Core Trait

pub trait TranscriptionModel:
    Clone
    + WasmCompatSend
    + WasmCompatSync {
    type Response: WasmCompatSend + WasmCompatSync;
    type Client;

    // Required methods
    fn make(client: &Self::Client, model: impl Into<String>) -> Self;
    fn transcription(
        &self,
        request: TranscriptionRequest,
    ) -> impl Future<Output = Result<TranscriptionResponse<Self::Response>, TranscriptionError>> + WasmCompatSend;

    // Provided method
    fn transcription_request(&self) -> TranscriptionRequestBuilder<Self> { ... }
}
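The split between the required `transcription` method and the provided `transcription_request` builder can be easier to see in a stripped-down form. The sketch below is not Rig's code: it is a synchronous stand-in with a `String` error, no `make` or `Client`, and a hypothetical `ByteCountModel` provider. It only illustrates how a provider implements one method and gets the builder for free.

```rust
/// Simplified request: the audio bytes plus an optional language hint.
#[derive(Debug, Clone, PartialEq)]
struct TranscriptionRequest {
    data: Vec<u8>,
    language: Option<String>,
}

/// Same shape as the response struct documented on this page.
#[derive(Debug)]
struct TranscriptionResponse<T> {
    text: String,
    response: T,
}

/// Synchronous stand-in: one required method, one provided method.
trait TranscriptionModel: Clone {
    type Response;

    /// Required: each provider implements the actual call.
    fn transcription(
        &self,
        request: TranscriptionRequest,
    ) -> Result<TranscriptionResponse<Self::Response>, String>;

    /// Provided: every implementor gets a request builder.
    fn transcription_request(&self) -> TranscriptionRequestBuilder<Self> {
        TranscriptionRequestBuilder { model: self.clone(), data: Vec::new(), language: None }
    }
}

struct TranscriptionRequestBuilder<M: TranscriptionModel> {
    model: M,
    data: Vec<u8>,
    language: Option<String>,
}

impl<M: TranscriptionModel> TranscriptionRequestBuilder<M> {
    fn data(mut self, data: Vec<u8>) -> Self {
        self.data = data;
        self
    }

    fn language(mut self, language: String) -> Self {
        self.language = Some(language);
        self
    }

    /// Assemble the request and hand it to the model's required method.
    fn send(self) -> Result<TranscriptionResponse<M::Response>, String> {
        let request = TranscriptionRequest { data: self.data, language: self.language };
        self.model.transcription(request)
    }
}

/// Hypothetical provider that "transcribes" by reporting the byte count.
#[derive(Clone)]
struct ByteCountModel;

impl TranscriptionModel for ByteCountModel {
    // A real provider would expose its raw API response type here.
    type Response = usize;

    fn transcription(
        &self,
        request: TranscriptionRequest,
    ) -> Result<TranscriptionResponse<Self::Response>, String> {
        if request.data.is_empty() {
            return Err("no audio data supplied".to_string());
        }
        let language = request.language.unwrap_or_else(|| "unknown".to_string());
        Ok(TranscriptionResponse {
            text: format!("{} bytes of {} audio", request.data.len(), language),
            response: request.data.len(),
        })
    }
}
```

Calling `ByteCountModel.transcription_request().data(bytes).language("en".to_string()).send()` follows the same chain as the real builder, minus the `.await`.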

Request Building

use rig::transcription::TranscriptionModel;
use rig::providers::openai;

let openai = openai::Client::from_env()?;
let model = openai.transcription_model("whisper-1");

let audio_data: Vec<u8> = std::fs::read("audio.mp3")?;

let response = model
    .transcription_request()
    .data(audio_data)
    .language("en".to_string())
    .send()
    .await?;

println!("Transcription: {}", response.text);

Transcription: Hey, just calling to confirm our meeting tomorrow at ten. Talk soon.
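The request above loads the whole file with `std::fs::read`, so a very large recording ends up in memory before the provider has a chance to reject it. One possible guard is to check the file size first. This is a sketch using only the standard library; `read_audio_capped` and `MAX_AUDIO_BYTES` are hypothetical names, and the 25 MB figure is an assumption based on the upload limit OpenAI has documented for its audio endpoints, so check your provider's current limit.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Assumed upload limit for illustration (25 MB); verify against your provider.
const MAX_AUDIO_BYTES: u64 = 25 * 1024 * 1024;

/// Read an audio file into memory only if it is no larger than `max_bytes`.
fn read_audio_capped(path: &Path, max_bytes: u64) -> io::Result<Vec<u8>> {
    // Ask the filesystem for the size before allocating a buffer.
    let size = fs::metadata(path)?.len();
    if size > max_bytes {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            format!("{} is {size} bytes, over the {max_bytes}-byte limit", path.display()),
        ));
    }
    fs::read(path)
}
```

The result can be passed straight to `.data(...)` in place of the unguarded `std::fs::read("audio.mp3")?` call, for example `read_audio_capped(Path::new("audio.mp3"), MAX_AUDIO_BYTES)?`.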

TranscriptionResponse

pub struct TranscriptionResponse<T> {
    /// The transcribed text
    pub text: String,
    /// The raw provider response (`T` is the provider's `Response` type)
    pub response: T,
}

Provider support

Not all providers support all media types. Here is a summary of current support:

Provider	Image Generation	Audio Generation	Transcription
OpenAI	Yes (DALL-E)	Yes (TTS)	Yes (Whisper)
Other providers	Varies	Varies	Varies

Check the individual provider documentation for specific model support.

Best practices

Feature Flags: Only enable the feature flags you need (image, audio) to minimize compile times and binary size.
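Error Handling: Each media type has its own error type (ImageGenerationError, AudioGenerationError, TranscriptionError). Match on the specific type, or convert all three into a single application error, rather than unwrapping.
Large Payloads: Audio and image data can be large, and the response structs hold the whole payload in memory as a Vec<u8>. Avoid unnecessary clones, and stream to disk or the network where your pipeline allows.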
Model Selection: Different models within the same provider may have different capabilities, pricing, and quality. Refer to provider documentation for guidance.
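
One way to act on the error-handling advice above is a single application-level enum with `From` conversions, so that `?` works across media types in one function. The sketch below is an assumption-laden illustration: the three error structs are local stand-ins so the code compiles without Rig (in a real project they would be Rig's own types), and `MediaError`, `transcribe_step`, `speak_step`, and `pipeline` are hypothetical names.

```rust
use std::fmt;

// Local stand-ins for Rig's per-media error types (assumption for this sketch).
#[derive(Debug)]
struct ImageGenerationError(String);
#[derive(Debug)]
struct AudioGenerationError(String);
#[derive(Debug)]
struct TranscriptionError(String);

/// One application error that wraps all three media errors.
#[derive(Debug)]
enum MediaError {
    Image(ImageGenerationError),
    Audio(AudioGenerationError),
    Transcription(TranscriptionError),
}

impl From<ImageGenerationError> for MediaError {
    fn from(e: ImageGenerationError) -> Self {
        MediaError::Image(e)
    }
}

impl From<AudioGenerationError> for MediaError {
    fn from(e: AudioGenerationError) -> Self {
        MediaError::Audio(e)
    }
}

impl From<TranscriptionError> for MediaError {
    fn from(e: TranscriptionError) -> Self {
        MediaError::Transcription(e)
    }
}

impl fmt::Display for MediaError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            MediaError::Image(e) => write!(f, "image generation failed: {}", e.0),
            MediaError::Audio(e) => write!(f, "audio generation failed: {}", e.0),
            MediaError::Transcription(e) => write!(f, "transcription failed: {}", e.0),
        }
    }
}

impl std::error::Error for MediaError {}

/// Hypothetical step standing in for a transcription call.
fn transcribe_step(fail: bool) -> Result<String, TranscriptionError> {
    if fail {
        Err(TranscriptionError("simulated failure".to_string()))
    } else {
        Ok("hello".to_string())
    }
}

/// Hypothetical step standing in for a text-to-speech call.
fn speak_step(text: &str) -> Result<Vec<u8>, AudioGenerationError> {
    if text.is_empty() {
        Err(AudioGenerationError("empty text".to_string()))
    } else {
        Ok(text.as_bytes().to_vec())
    }
}

/// `?` converts each step's error into `MediaError` through the `From` impls.
fn pipeline(fail_transcription: bool) -> Result<Vec<u8>, MediaError> {
    let text = transcribe_step(fail_transcription)?;
    let audio = speak_step(&text)?;
    Ok(audio)
}
```

Callers can then match on `MediaError` variants to decide, for example, whether to retry a transcription or surface an image-generation failure to the user.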

Next steps

Providers & Clients: See how the client that builds these media models also creates completion and embedding models.

Model Providers: Check which providers support image, audio, and transcription models, and their model names.

Completion: Combine media output with text completions to build richer multimodal responses.

Agents: Wrap media models in an agent so a prompt can decide when to generate or transcribe.
