Skip to content

Microsoft.Extensions.AI Integration

Cross-SDK comparison

See the centralized MEAI documentation for feature matrices and comparisons across all tryAGI SDKs.

The Cartesia SDK implements the ISpeechToTextClient interface from Microsoft.Extensions.AI, enabling you to use Cartesia speech-to-text through a standardized .NET AI abstraction.

Supported Interfaces

Interface Support Level
ISpeechToTextClient Full (synchronous transcription, 115+ languages, word timestamps)

ISpeechToTextClient

Installation

1
dotnet add package Cartesia

File-Based Transcription

Transcribe an audio file to text. Unlike polling-based providers, Cartesia processes audio synchronously:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
using Cartesia;
using Microsoft.Extensions.AI;

using var client = new CartesiaClient(apiKey);
ISpeechToTextClient sttClient = client;

using var audioStream = File.OpenRead("audio.wav");
var response = await sttClient.GetTextAsync(audioStream);

Console.WriteLine(response.Text);
Console.WriteLine($"Duration: {response.StartTime} - {response.EndTime}");

Transcription with Language Hint

Specify a language code for more accurate transcription:

1
2
3
4
5
6
7
8
9
ISpeechToTextClient sttClient = client;

using var audioStream = File.OpenRead("spanish-audio.wav");
var response = await sttClient.GetTextAsync(audioStream, new SpeechToTextOptions
{
    SpeechLanguage = "es",
});

Console.WriteLine(response.Text);

Specifying a Model

Select a specific Cartesia STT model:

1
2
3
4
5
6
7
8
9
ISpeechToTextClient sttClient = client;

using var audioStream = File.OpenRead("audio.wav");
var response = await sttClient.GetTextAsync(audioStream, new SpeechToTextOptions
{
    ModelId = "ink-whisper",
});

Console.WriteLine(response.Text);

Advanced Configuration with RawRepresentationFactory

Use RawRepresentationFactory to access Cartesia-specific features like audio encoding and timestamp granularity:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
ISpeechToTextClient sttClient = client;

using var audioStream = File.OpenRead("audio.wav");
var response = await sttClient.GetTextAsync(audioStream, new SpeechToTextOptions
{
    RawRepresentationFactory = _ => new SttTranscribeRequest
    {
        Model = "ink-whisper",
        Language = SttTranscribeRequestLanguage.En,
        TimestampGranularities = [TimestampGranularity.Word],
    },
});

Console.WriteLine(response.Text);

// Access word-level timestamps from the raw response
var raw = (TranscriptionResponse)response.RawRepresentation!;
if (raw.Words is { Count: > 0 } words)
{
    foreach (var word in words)
    {
        Console.WriteLine($"  [{word.Start:F2}s - {word.End:F2}s] {word.Word}");
    }
}

Accessing the Underlying Client

Retrieve the CartesiaClient from the MEAI interface:

1
2
3
4
5
6
7
ISpeechToTextClient sttClient = client;

var metadata = sttClient.GetService<SpeechToTextClientMetadata>();
Console.WriteLine($"Provider: {metadata?.ProviderName}"); // "cartesia"

var cartesiaClient = sttClient.GetService<CartesiaClient>();
// Use cartesiaClient for Cartesia-specific APIs (TTS, voice cloning, agents, etc.)