Media (image, audio & transcription)

Rig provides unified abstractions for image generation, audio generation (text-to-speech), and audio transcription (speech-to-text) alongside its core text completion and embedding capabilities. Each media type is a trait implemented per provider, and models are created from the same provider clients you already use for completions and embeddings (see Providers & Clients).

Image Generation

Requires the image feature flag: cargo add rig -F image

The rig::image_generation module provides the ImageGenerationModel trait for generating images from text prompts.

pub trait ImageGenerationModel: Clone + Send + Sync {
    type Response: Send + Sync;

    async fn image_generation(
        &self,
        request: ImageGenerationRequest,
    ) -> Result<ImageGenerationResponse<Self::Response>, ImageGenerationError>;
}

Use ImageGenerationRequestBuilder to construct requests:

use rig::image_generation::ImageGenerationModel;

let response = model
    .image_generation_request()
    .prompt("A futuristic city at sunset")
    .width(1024)
    .height(1024)
    .send()
    .await?;

// Access the generated image
let image_data = response.image;

The response pairs the image bytes with the raw provider response:

pub struct ImageGenerationResponse<T> {
    /// The generated image data
    pub image: Vec<u8>,
    /// The raw provider response (`T` is the provider's `Response` type)
    pub response: T,
}
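Because `image` is a plain `Vec<u8>`, saving the result is an ordinary file write. The following is a minimal sketch using only the standard library. It assumes the provider returned PNG-encoded bytes, which depends on the provider and the request. `looks_like_png` and `save_image` are hypothetical helper names for illustration, not part of Rig.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// The 8-byte signature that starts every PNG file.
const PNG_SIGNATURE: [u8; 8] = [0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A];

/// Hypothetical helper: true if `bytes` begins with the PNG signature.
fn looks_like_png(bytes: &[u8]) -> bool {
    bytes.starts_with(&PNG_SIGNATURE)
}

/// Hypothetical helper: writes the image bytes to `path`,
/// rejecting an empty payload instead of creating an empty file.
fn save_image(bytes: &[u8], path: &Path) -> io::Result<()> {
    if bytes.is_empty() {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "empty image payload",
        ));
    }
    fs::write(path, bytes)
}
```

In application code this would be called as `save_image(&response.image, Path::new("city.png"))?`, optionally after checking `looks_like_png` to choose the file extension.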

Image generation models can be accessed through providers that support them:

use rig::providers::openai;

let openai = openai::Client::from_env()?;
let dalle = openai.image_generation_model("dall-e-3");

let response = dalle
    .image_generation_request()
    .prompt("A robot painting a landscape")
    .send()
    .await?;

Audio Generation

Requires the audio feature flag: cargo add rig -F audio

The rig::audio_generation module provides the AudioGenerationModel trait for converting text to speech.

pub trait AudioGenerationModel:
    Sized
    + Clone
    + WasmCompatSend
    + WasmCompatSync
{
    type Response: Send + Sync;
    type Client;

    // Required methods
    fn make(client: &Self::Client, model: impl Into<String>) -> Self;
    fn audio_generation(
        &self,
        request: AudioGenerationRequest,
    ) -> impl Future<Output = Result<AudioGenerationResponse<Self::Response>, AudioGenerationError>> + Send;

    // Provided method
    fn audio_generation_request(&self) -> AudioGenerationRequestBuilder<Self> { ... }
}

Use the request builder to generate speech from text:

use rig::audio_generation::AudioGenerationModel;
use rig::providers::openai;

let openai = openai::Client::from_env()?;
let model = openai.audio_generation_model("tts-1"); // requires the `audio` feature

let response = model
    .audio_generation_request()
    .text("Hello, how can I help you today?")
    .voice("alloy")
    .send()
    .await?;

// Access the generated audio
let audio_bytes = response.audio;

The response pairs the audio bytes with the raw provider response:

pub struct AudioGenerationResponse<T> {
    /// The generated audio data
    pub audio: Vec<u8>,
    /// The raw provider response (`T` is the provider's `Response` type)
    pub response: T,
}

Transcription

The rig::transcription module provides the TranscriptionModel trait for transcribing audio to text.

pub trait TranscriptionModel:
    Clone
    + WasmCompatSend
    + WasmCompatSync
{
    type Response: WasmCompatSend + WasmCompatSync;
    type Client;

    // Required methods
    fn make(client: &Self::Client, model: impl Into<String>) -> Self;
    fn transcription(
        &self,
        request: TranscriptionRequest,
    ) -> impl Future<Output = Result<TranscriptionResponse<Self::Response>, TranscriptionError>> + WasmCompatSend;

    // Provided method
    fn transcription_request(&self) -> TranscriptionRequestBuilder<Self> { ... }
}

Use the request builder to transcribe an audio file:

use rig::providers::openai;
use rig::transcription::TranscriptionModel;

let openai = openai::Client::from_env()?;
let model = openai.transcription_model("whisper-1");

let audio_data: Vec<u8> = std::fs::read("audio.mp3")?;

let response = model
    .transcription_request()
    .data(audio_data)
    .language("en".to_string())
    .send()
    .await?;

println!("Transcription: {}", response.text);

Example output:

Transcription: Hey, just calling to confirm our meeting tomorrow at ten. Talk soon.

The response pairs the transcribed text with the raw provider response:

pub struct TranscriptionResponse<T> {
    /// The transcribed text
    pub text: String,
    /// The raw provider response (`T` is the provider's `Response` type)
    pub response: T,
}
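The transcription example reads the whole audio file into memory with `std::fs::read`. A size check before the read can catch oversized files early. This is a minimal sketch using only the standard library. `read_audio_checked` is a hypothetical helper, and `MAX_UPLOAD_BYTES` is an assumed 25 MiB limit chosen for illustration, so check your provider's documentation for the real upload limit.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Assumed upload limit for illustration only (25 MiB).
const MAX_UPLOAD_BYTES: u64 = 25 * 1024 * 1024;

/// Hypothetical helper: reads the file only if it is at most `max_bytes` long.
fn read_audio_checked(path: &Path, max_bytes: u64) -> io::Result<Vec<u8>> {
    // Query the size from file metadata so nothing is loaded yet.
    let size = fs::metadata(path)?.len();
    if size > max_bytes {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            format!("audio file is {size} bytes, limit is {max_bytes}"),
        ));
    }
    fs::read(path)
}
```

The returned `Vec<u8>` can then be passed to `.data(...)` in place of the unchecked `std::fs::read` call.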

Provider Support

Not all providers support all media types. Here is a summary of current support:

Provider         Image Generation  Audio Generation  Transcription
OpenAI           Yes (DALL-E)      Yes (TTS)         Yes (Whisper)
Other providers  Varies            Varies            Varies

Check the individual provider documentation for specific model support.

Best Practices

  1. Feature Flags: Only enable the feature flags you need (image, audio) to minimize compile times and binary size.

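  2. Error Handling: Each media type has its own error type (ImageGenerationError, AudioGenerationError, TranscriptionError). Handle each one explicitly, for example by matching on it or converting it into your application's error type, rather than discarding the detail.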

  3. Large Payloads: Audio and image data can be large. Consider streaming where possible and be mindful of memory usage.

  4. Model Selection: Different models within the same provider may have different capabilities, pricing, and quality. Refer to provider documentation for guidance.
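One way to act on the error-handling advice is to wrap the per-media errors in a single application error, so `?` works across all three call sites while the media type stays visible. The sketch below shows only that pattern, under stated assumptions. `ImageErr`, `AudioErr`, and `TranscriptionErr` are stand-ins for Rig's `ImageGenerationError`, `AudioGenerationError`, and `TranscriptionError` (the real types have their own variants). `MediaError`, `generate_image_stub`, and `image_len` are hypothetical names.

```rust
use std::error::Error;
use std::fmt;

// Stand-ins for Rig's per-media error types; they only carry a message.
#[derive(Debug)]
struct ImageErr(String);
#[derive(Debug)]
#[allow(dead_code)]
struct AudioErr(String);
#[derive(Debug)]
#[allow(dead_code)]
struct TranscriptionErr(String);

/// Hypothetical application-level error that records which media call failed.
#[derive(Debug)]
enum MediaError {
    Image(ImageErr),
    Audio(AudioErr),
    Transcription(TranscriptionErr),
}

impl From<ImageErr> for MediaError {
    fn from(e: ImageErr) -> Self { MediaError::Image(e) }
}
impl From<AudioErr> for MediaError {
    fn from(e: AudioErr) -> Self { MediaError::Audio(e) }
}
impl From<TranscriptionErr> for MediaError {
    fn from(e: TranscriptionErr) -> Self { MediaError::Transcription(e) }
}

impl fmt::Display for MediaError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            MediaError::Image(e) => write!(f, "image generation failed: {}", e.0),
            MediaError::Audio(e) => write!(f, "audio generation failed: {}", e.0),
            MediaError::Transcription(e) => write!(f, "transcription failed: {}", e.0),
        }
    }
}

impl Error for MediaError {}

/// Placeholder for a call that would return the image error type.
fn generate_image_stub(fail: bool) -> Result<Vec<u8>, ImageErr> {
    if fail {
        Err(ImageErr("quota exceeded".to_string()))
    } else {
        Ok(vec![0u8; 3])
    }
}

/// `?` converts `ImageErr` into `MediaError` through the `From` impl.
fn image_len(fail: bool) -> Result<usize, MediaError> {
    let bytes = generate_image_stub(fail)?;
    Ok(bytes.len())
}
```

Callers can then match on `MediaError::Image(_)`, `MediaError::Audio(_)`, or `MediaError::Transcription(_)` to decide whether to retry, fall back to another provider, or surface the message.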