The Cartesia SDK implements ITextToSpeechClient and ISpeechToTextClient from Microsoft.Extensions.AI.
Supported Interfaces
| Interface | Support Level |
| --- | --- |
| ITextToSpeechClient | Full (Sonic TTS bytes plus low-latency SSE streaming) |
| ISpeechToTextClient | Full (synchronous transcription, 115+ languages, word timestamps) |
ITextToSpeechClient
Text-to-Speech
Generate speech through the standard MEAI interface. Cartesia requires an explicit VoiceId:
    using Cartesia;
    using Microsoft.Extensions.AI;

    using var client = new CartesiaClient(apiKey);
    ITextToSpeechClient ttsClient = client;

    var response = await ttsClient.GetAudioAsync(
        "Cartesia Sonic is available through Microsoft.Extensions.AI.",
        new TextToSpeechOptions
        {
            ModelId = "sonic-3.5",
            VoiceId = "694f9389-aac1-45b6-b726-9d9369183238",
            AudioFormat = "wav",
            Language = "en-US",
            Speed = 1.05f,
        });

    var audio = response.Contents.OfType<DataContent>().Single();
    File.WriteAllBytes("cartesia.wav", audio.Data.ToArray());
Streaming Text-to-Speech
GetStreamingAudioAsync uses Cartesia's SSE endpoint and emits raw PCM audio chunks:
    await foreach (var update in ttsClient.GetStreamingAudioAsync(
        "Streaming speech starts returning audio before the full generation is done.",
        new TextToSpeechOptions
        {
            ModelId = "sonic-3.5",
            VoiceId = "694f9389-aac1-45b6-b726-9d9369183238",
            AudioFormat = "pcm_s16le",
        }))
    {
        foreach (var chunk in update.Contents.OfType<DataContent>())
        {
            Console.WriteLine($"{update.Kind}: {chunk.Data.Length} bytes");
        }
    }
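Because the streamed chunks are raw PCM, they carry no container header, and most players will not open them directly. The following is a minimal sketch of wrapping collected chunks in a 44-byte RIFF/WAVE header. WrapPcmAsWav is a hypothetical helper, not part of the SDK, and the sample rate and channel count are assumptions that must match what your request actually produces. Only the 16-bit little-endian sample format follows from pcm_s16le.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

// Assumed stream parameters (not taken from the SDK docs); verify them
// against the output format your request actually uses.
const int SampleRate = 44100;    // assumption
const short Channels = 1;        // assumption
const short BitsPerSample = 16;  // pcm_s16le = signed 16-bit little-endian

// Hypothetical helper: concatenates raw PCM chunks and prepends a
// minimal 44-byte RIFF/WAVE header. BinaryWriter writes little-endian.
static byte[] WrapPcmAsWav(IEnumerable<byte[]> chunks, int sampleRate, short channels, short bitsPerSample)
{
    byte[] pcm = chunks.SelectMany(c => c).ToArray();
    short blockAlign = (short)(channels * bitsPerSample / 8);
    int byteRate = sampleRate * blockAlign;

    using var ms = new MemoryStream();
    using var w = new BinaryWriter(ms, Encoding.ASCII, leaveOpen: true);
    w.Write(Encoding.ASCII.GetBytes("RIFF"));
    w.Write(36 + pcm.Length);                 // size of everything after this field
    w.Write(Encoding.ASCII.GetBytes("WAVE"));
    w.Write(Encoding.ASCII.GetBytes("fmt "));
    w.Write(16);                              // fmt chunk size for linear PCM
    w.Write((short)1);                        // audio format 1 = linear PCM
    w.Write(channels);
    w.Write(sampleRate);
    w.Write(byteRate);
    w.Write(blockAlign);
    w.Write(bitsPerSample);
    w.Write(Encoding.ASCII.GetBytes("data"));
    w.Write(pcm.Length);                      // number of PCM bytes that follow
    w.Write(pcm);
    w.Flush();
    return ms.ToArray();
}

// Stand-in chunks; in practice, collect chunk.Data.ToArray() for each
// DataContent received from the streaming loop.
var chunks = new List<byte[]> { new byte[] { 0, 0, 1, 0 }, new byte[] { 2, 0 } };
byte[] wav = WrapPcmAsWav(chunks, SampleRate, Channels, BitsPerSample);
Console.WriteLine(wav.Length); // 44-byte header + 6 PCM bytes = 50
```

Buffering every chunk before writing gives up the latency benefit of streaming, so this approach suits saving a file after the fact. For live playback, feed the chunks to an audio output that accepts raw PCM instead.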
Cartesia-Specific Options
Use AdditionalProperties for provider-specific fields:
Advanced Configuration with RawRepresentationFactory
Use RawRepresentationFactory to access Cartesia-specific features like audio encoding and timestamp granularity:
    ISpeechToTextClient sttClient = client;

    using var audioStream = File.OpenRead("audio.wav");
    var response = await sttClient.GetTextAsync(audioStream, new SpeechToTextOptions
    {
        RawRepresentationFactory = _ => new SttTranscribeRequest
        {
            Model = "ink-whisper",
            Language = SttTranscribeRequestLanguage.En,
            TimestampGranularities = [TimestampGranularity.Word],
        },
    });

    Console.WriteLine(response.Text);

    // Access word-level timestamps from the raw response
    var raw = (TranscriptionResponse)response.RawRepresentation!;
    if (raw.Words is { Count: > 0 } words)
    {
        foreach (var word in words)
        {
            Console.WriteLine($" [{word.Start:F2}s - {word.End:F2}s] {word.Word}");
        }
    }
Streaming Behavior
GetStreamingTextAsync delegates to the non-streaming GetTextAsync method internally. The Cartesia STT API processes audio synchronously (no polling needed), and then the full result is converted to SpeechToTextResponseUpdate events using ToSpeechToTextResponseUpdates().
This means you will not receive incremental transcription updates as audio is processed. The entire transcript is returned at once after processing completes. For most use cases, calling GetTextAsync directly is equivalent and simpler.
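The delegation pattern described above can be illustrated with a small, self-contained sketch. TranscribeAsync and TranscribeStreamingAsync are hypothetical stand-ins, not SDK methods, and the sketch assumes the whole transcript arrives as a single update. The actual number and kind of updates the SDK emits may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for a non-streaming transcription call:
// it completes only once the whole transcript is available.
static async Task<string> TranscribeAsync()
{
    await Task.Delay(50); // simulates server-side processing time
    return "the full transcript";
}

// Streaming facade over the non-streaming call: nothing is yielded
// until TranscribeAsync has finished, so no partial results appear.
static async IAsyncEnumerable<string> TranscribeStreamingAsync()
{
    string full = await TranscribeAsync();
    yield return full; // assumption: one update carrying the whole text
}

var updates = new List<string>();
await foreach (string update in TranscribeStreamingAsync())
{
    updates.Add(update);
}

Console.WriteLine($"{updates.Count} update(s): {updates[0]}");
// prints "1 update(s): the full transcript"
```

The streaming signature is still useful when calling code is written against the streaming interface and must work across providers, even though it offers no latency advantage here.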
Accessing the Underlying Client
Retrieve the CartesiaClient from the MEAI interface:
    ISpeechToTextClient sttClient = client;

    var metadata = sttClient.GetService<SpeechToTextClientMetadata>();
    Console.WriteLine($"Provider: {metadata?.ProviderName}"); // "cartesia"

    var cartesiaClient = sttClient.GetService<CartesiaClient>();
    // Use cartesiaClient for Cartesia-specific APIs (TTS, voice cloning, agents, etc.)