Media (image, audio & transcription)
Rig provides unified abstractions for image generation, audio generation (text-to-speech), and audio transcription (speech-to-text) alongside its core text completion and embedding capabilities. Each media type is a trait implemented per provider, and models are created from the same provider clients you already use for completions and embeddings (see Providers & Clients).
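The trait-per-media-type design can be pictured with a small, self-contained sketch. Everything in it is a hypothetical stand-in (`ImageGen`, `Transcribe`, `FullProvider`, `SpeechOnlyProvider`); Rig's real traits are async and use dedicated request and response types, so read this only as an illustration of how a provider opts into a capability by implementing the matching trait.

```rust
// Hypothetical, simplified stand-ins. These are NOT Rig's real traits:
// Rig's versions are async and carry request/response types.

/// Capability: turn a text prompt into image bytes.
trait ImageGen {
    fn generate_image(&self, prompt: &str) -> Vec<u8>;
}

/// Capability: turn audio bytes into text.
trait Transcribe {
    fn transcribe(&self, audio: &[u8]) -> String;
}

/// A provider that supports both capabilities.
struct FullProvider;

/// A provider that only supports transcription.
struct SpeechOnlyProvider;

impl ImageGen for FullProvider {
    fn generate_image(&self, prompt: &str) -> Vec<u8> {
        // Placeholder bytes standing in for encoded image data.
        prompt.as_bytes().to_vec()
    }
}

impl Transcribe for FullProvider {
    fn transcribe(&self, audio: &[u8]) -> String {
        format!("{} bytes of audio", audio.len())
    }
}

impl Transcribe for SpeechOnlyProvider {
    fn transcribe(&self, audio: &[u8]) -> String {
        format!("{} bytes of audio", audio.len())
    }
}

/// Generic code names the capability it needs through a trait bound.
fn describe_image<M: ImageGen>(model: &M, prompt: &str) -> usize {
    model.generate_image(prompt).len()
}

fn main() {
    let full = FullProvider;
    let speech = SpeechOnlyProvider;

    println!("image bytes: {}", describe_image(&full, "a city"));
    println!("full: {}", full.transcribe(&[0u8; 2]));
    println!("speech-only: {}", speech.transcribe(&[0u8; 4]));

    // describe_image(&speech, "a city");
    // ^ would not compile: SpeechOnlyProvider does not implement ImageGen.
}
```

With this shape, asking a provider for a media type it does not support is a compile-time error instead of a runtime surprise.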
Image generation
Requires the `image` feature flag: `cargo add rig -F image`
The `rig::image_generation` module provides the `ImageGenerationModel` trait for generating images from text prompts.
Core Types
```rust
pub trait ImageGenerationModel: Clone + Send + Sync {
    type Response: Send + Sync;

    async fn image_generation(
        &self,
        request: ImageGenerationRequest,
    ) -> Result<ImageGenerationResponse<Self::Response>, ImageGenerationError>;
}
```

Request Building
Use `ImageGenerationRequestBuilder` to construct requests:
```rust
use rig::image_generation::ImageGenerationModel;

let response = model
    .image_generation_request()
    .prompt("A futuristic city at sunset")
    .width(1024)
    .height(1024)
    .send()
    .await?;

// Access the generated image
let image_data = response.image;
```

ImageGenerationResponse
```rust
pub struct ImageGenerationResponse<T> {
    /// The generated image data
    pub image: Vec<u8>,
    /// The raw provider response (`T` is the provider's `Response` type)
    pub response: T,
}
```

Using with Agents
Image generation models can be accessed through providers that support them:
```rust
use rig::providers::openai;

// Create the provider client, then ask it for an image generation model
let openai = openai::Client::from_env()?;
let dalle = openai.image_generation_model("dall-e-3");

let response = dalle
    .image_generation_request()
    .prompt("A robot painting a landscape")
    .send()
    .await?;
```

Audio generation (text-to-speech)
Requires the `audio` feature flag: `cargo add rig -F audio`
The `rig::audio_generation` module provides the `AudioGenerationModel` trait for converting text to speech.
Core Types
```rust
pub trait AudioGenerationModel: Sized + Clone + WasmCompatSend + WasmCompatSync {
    type Response: Send + Sync;
    type Client;

    // Required methods
    fn make(client: &Self::Client, model: impl Into<String>) -> Self;
    fn audio_generation(
        &self,
        request: AudioGenerationRequest,
    ) -> impl Future<Output = Result<AudioGenerationResponse<Self::Response>, AudioGenerationError>> + Send;

    // Provided method
    fn audio_generation_request(&self) -> AudioGenerationRequestBuilder<Self> { ... }
}
```

Request Building
```rust
use rig::audio_generation::AudioGenerationModel;
use rig::providers::openai;

let openai = openai::Client::from_env()?;
let model = openai.audio_generation_model("tts-1"); // requires the `audio` feature

let response = model
    .audio_generation_request()
    .text("Hello, how can I help you today?")
    .voice("alloy")
    .send()
    .await?;

// Access the generated audio
let audio_bytes = response.audio;
```

AudioGenerationResponse
```rust
pub struct AudioGenerationResponse<T> {
    /// The generated audio data
    pub audio: Vec<u8>,
    /// The raw provider response (`T` is the provider's `Response` type)
    pub response: T,
}
```

Audio transcription (speech-to-text)
The `rig::transcription` module provides the `TranscriptionModel` trait for transcribing audio to text.
Core Trait
```rust
pub trait TranscriptionModel: Clone + WasmCompatSend + WasmCompatSync {
    type Response: WasmCompatSend + WasmCompatSync;
    type Client;

    // Required methods
    fn make(client: &Self::Client, model: impl Into<String>) -> Self;
    fn transcription(
        &self,
        request: TranscriptionRequest,
    ) -> impl Future<Output = Result<TranscriptionResponse<Self::Response>, TranscriptionError>> + WasmCompatSend;

    // Provided method
    fn transcription_request(&self) -> TranscriptionRequestBuilder<Self> { ... }
}
```

Request Building
```rust
use rig::providers::openai;
use rig::transcription::TranscriptionModel;

let openai = openai::Client::from_env()?;
let model = openai.transcription_model("whisper-1");

// Read the whole audio file into memory
let audio_data: Vec<u8> = std::fs::read("audio.mp3")?;

let response = model
    .transcription_request()
    .data(audio_data)
    .language("en".to_string())
    .send()
    .await?;

println!("Transcription: {}", response.text);
```

```text
Transcription: Hey, just calling to confirm our meeting tomorrow at ten. Talk soon.
```

TranscriptionResponse
```rust
pub struct TranscriptionResponse<T> {
    /// The transcribed text
    pub text: String,
    /// The raw provider response (`T` is the provider's `Response` type)
    pub response: T,
}
```

Provider support
Not all providers support all media types. Here is a summary of current support:
| Provider | Image Generation | Audio Generation | Transcription |
|---|---|---|---|
| OpenAI | Yes (DALL-E) | Yes (TTS) | Yes (Whisper) |
| Other providers | Varies | Varies | Varies |
Check the individual provider documentation for specific model support.
Best practices
- Feature Flags: Only enable the feature flags you need (`image`, `audio`) to minimize compile times and binary size.
- Error Handling: Each media type has its own error type (`ImageGenerationError`, `AudioGenerationError`, `TranscriptionError`). Handle each one explicitly rather than treating all failures alike.
- Large Payloads: Audio and image data can be large. Consider streaming where possible and be mindful of memory usage.
- Model Selection: Different models within the same provider may have different capabilities, pricing, and quality. Refer to provider documentation for guidance.
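The error-handling and large-payload points can be combined in one small sketch. It assumes a local stand-in `ImageGenerationError` (a plain wrapper around a message, unlike Rig's real error type) and a hypothetical `fake_generate` function in place of a real model call, so that it compiles with the standard library alone. The idea is to wrap each failure source in an application-level enum so `?` works across both, and to write the bytes to disk right away instead of holding them.

```rust
use std::fmt;
use std::fs;
use std::io;
use std::path::Path;

// Local stand-in so the sketch compiles without the `rig` crate. In real code
// this role would be played by `rig::image_generation::ImageGenerationError`.
#[derive(Debug)]
struct ImageGenerationError(String);

/// Application-level error that keeps media failures distinct from I/O failures.
#[derive(Debug)]
enum MediaError {
    Image(ImageGenerationError),
    Io(io::Error),
}

impl From<ImageGenerationError> for MediaError {
    fn from(e: ImageGenerationError) -> Self {
        MediaError::Image(e)
    }
}

impl From<io::Error> for MediaError {
    fn from(e: io::Error) -> Self {
        MediaError::Io(e)
    }
}

impl fmt::Display for MediaError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            MediaError::Image(e) => write!(f, "image generation failed: {}", e.0),
            MediaError::Io(e) => write!(f, "could not store media: {}", e),
        }
    }
}

impl std::error::Error for MediaError {}

/// Hypothetical stand-in for a model call that returns image bytes.
fn fake_generate(prompt: &str) -> Result<Vec<u8>, ImageGenerationError> {
    if prompt.trim().is_empty() {
        return Err(ImageGenerationError("empty prompt".to_string()));
    }
    Ok(vec![0u8; 16]) // placeholder bytes, not a real image
}

/// Generates an image and writes it to `path`, returning the byte count.
/// The buffer is dropped when the function returns, so it does not linger.
fn generate_to_file(prompt: &str, path: &Path) -> Result<usize, MediaError> {
    let bytes = fake_generate(prompt)?; // ImageGenerationError -> MediaError
    fs::write(path, &bytes)?; // io::Error -> MediaError
    Ok(bytes.len())
}

fn main() {
    let path = std::env::temp_dir().join("rig_media_sketch.bin");
    match generate_to_file("A futuristic city at sunset", &path) {
        Ok(n) => println!("wrote {} bytes to {}", n, path.display()),
        Err(e) => eprintln!("{}", e),
    }
}
```

The same pattern extends to audio and transcription by adding one variant and one `From` impl per error type.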
See also
- Completions: Text completion
- Providers & Clients: How provider clients create models
