Microsoft.Extensions.AI Integration

Cross-SDK comparison

See the centralized Microsoft.Extensions.AI (MEAI) documentation for feature matrices and comparisons across all tryAGI SDKs.

The Cartesia SDK implements ISpeechToTextClient from Microsoft.Extensions.AI.

Supported Interfaces

Interface	Support Level
`ISpeechToTextClient`	Full (synchronous transcription, 115+ languages, word timestamps)

ISpeechToTextClient

Installation

dotnet add package Cartesia

File-Based Transcription

Transcribe an audio file to text. Unlike polling-based providers, Cartesia processes audio synchronously, so the transcript comes back from a single awaited call with no job to poll:

using Cartesia;
using Microsoft.Extensions.AI;

// apiKey is a string holding your Cartesia API key, defined elsewhere.
using var client = new CartesiaClient(apiKey);
ISpeechToTextClient sttClient = client;

using var audioStream = File.OpenRead("audio.wav");
var response = await sttClient.GetTextAsync(audioStream);

Console.WriteLine(response.Text);
// StartTime and EndTime are offsets into the audio, not a single duration value.
Console.WriteLine($"Time range: {response.StartTime} - {response.EndTime}");

Transcription with Language Hint

Specify a language code for more accurate transcription:

ISpeechToTextClient sttClient = client;

using var audioStream = File.OpenRead("spanish-audio.wav");
var response = await sttClient.GetTextAsync(audioStream, new SpeechToTextOptions
{
    SpeechLanguage = "es",
});

Console.WriteLine(response.Text);

Specifying a Model

Select a specific Cartesia STT model:

ISpeechToTextClient sttClient = client;

using var audioStream = File.OpenRead("audio.wav");
var response = await sttClient.GetTextAsync(audioStream, new SpeechToTextOptions
{
    ModelId = "ink-whisper",
});

Console.WriteLine(response.Text);

Advanced Configuration with RawRepresentationFactory

Use RawRepresentationFactory to access Cartesia-specific features like audio encoding and timestamp granularity:

ISpeechToTextClient sttClient = client;

using var audioStream = File.OpenRead("audio.wav");
var response = await sttClient.GetTextAsync(audioStream, new SpeechToTextOptions
{
    RawRepresentationFactory = _ => new SttTranscribeRequest
    {
        Model = "ink-whisper",
        Language = SttTranscribeRequestLanguage.En,
        TimestampGranularities = [TimestampGranularity.Word],
    },
});

Console.WriteLine(response.Text);

// Access word-level timestamps from the raw response
var raw = (TranscriptionResponse)response.RawRepresentation!;
if (raw.Words is { Count: > 0 } words)
{
    foreach (var word in words)
    {
        Console.WriteLine($"  [{word.Start:F2}s - {word.End:F2}s] {word.Word}");
    }
}

Streaming Behavior

GetStreamingTextAsync delegates to the non-streaming GetTextAsync method internally. The Cartesia STT API processes the audio synchronously within a single request (no polling needed); the complete result is then converted into SpeechToTextResponseUpdate items using ToSpeechToTextResponseUpdates().

This means you will not receive incremental transcription updates as audio is processed. The entire transcript is returned at once after processing completes. For most use cases, calling GetTextAsync directly is equivalent and simpler.
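For code that is written against the streaming API, the behavior described above can be sketched as follows. This is a minimal illustration, not the SDK's reference usage: the environment variable name CARTESIA_API_KEY and the file name audio.wav are assumptions chosen for the example, and the number of updates you receive depends on how the response is converted.

```csharp
using System;
using System.IO;
using System.Text;
using Cartesia;
using Microsoft.Extensions.AI;

// Assumption: the API key comes from an environment variable. The name
// CARTESIA_API_KEY is illustrative and is not defined by the SDK.
var apiKey = Environment.GetEnvironmentVariable("CARTESIA_API_KEY")
    ?? throw new InvalidOperationException("Set CARTESIA_API_KEY first.");

using var client = new CartesiaClient(apiKey);
ISpeechToTextClient sttClient = client;

using var audioStream = File.OpenRead("audio.wav");

var transcript = new StringBuilder();
var updateCount = 0;

// Updates are produced only after Cartesia has finished processing the audio,
// so this loop does not deliver text while transcription is still in progress.
await foreach (var update in sttClient.GetStreamingTextAsync(audioStream))
{
    updateCount++;
    transcript.Append(update.Text);
}

Console.WriteLine($"Received {updateCount} update(s)");
Console.WriteLine(transcript.ToString());
```

Consuming the stream this way keeps calling code portable across ISpeechToTextClient providers, even though Cartesia gives no latency benefit here. Depending on the Microsoft.Extensions.AI version in use, ISpeechToTextClient may be flagged as experimental, in which case the corresponding compiler diagnostic has to be suppressed in the project.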

Accessing the Underlying Client

Use GetService to retrieve provider metadata and the underlying CartesiaClient from the MEAI interface:

ISpeechToTextClient sttClient = client;

var metadata = sttClient.GetService<SpeechToTextClientMetadata>();
Console.WriteLine($"Provider: {metadata?.ProviderName}"); // "cartesia"

var cartesiaClient = sttClient.GetService<CartesiaClient>();
// Use cartesiaClient for Cartesia-specific APIs (TTS, voice cloning, agents, etc.)