Explore Audio Models

Browse and discover the best AI audio models for text to speech, speech to text, and music.

ACE-Step

Music

ACE-Step composes complete songs from text descriptions. Guide genre, mood, and structure with style tags and optional custom lyrics. Generates up to 4 minutes of multi-track audio with vocals.

≈ $0.050 per audio

ACE-Step 1.5

Music

ACE-Step 1.5 composes complete songs from text descriptions. Guide genre, mood, and structure with style tags and required custom lyrics. Generates up to 4 minutes of multi-track audio with vocals.

≈ $0.050 per audio

ACE-Step v1.5 Base

Music

ACE-Step v1.5 Base is a Runware-hosted music model built for creator workflows, with stronger fidelity, more reliable stylistic consistency, and prompt-driven genre control for full-song generation.

≈ $0.009 per audio

ACE-Step v1.5 Turbo

Music

ACE-Step v1.5 Turbo is the faster, lower-cost Runware variant for full-song generation, with broad genre coverage, improved stylistic consistency, and text-guided music creation for creator workflows.

≈ $0.006 per audio

Alibaba Fun-ASR Flash

Speech to Text

Alibaba Cloud DashScope non-realtime speech recognition with multilingual transcription, punctuation, and sentence/word timestamps.

≈ $0.004 per minute

ByteDance Seed Audio 1.0

Music

ByteDance Seed Audio 1.0 generates natural audio from text, with optional preset voices, up to three reference audio clips, or a single reference image.

≈ $0.360 per audio

ByteDance Seed Speech TTS 2.0

Text to Speech

ByteDance Seed Speech TTS 2.0 generates natural multilingual speech, with voice instructions and delivery controls.

≈ $0.051 per audio

ElevenLabs Music v1

Music

Eleven Music is a studio-grade text-to-music model. Generate music from natural-language prompts in any style, for uses such as game soundtracks, podcast backgrounds, and marketing reels. Control genre, style, and structure, and choose between vocal and instrumental output. Produces 10 to 300 seconds of MP3 audio with selectable sample rate and bitrate.

≈ $0.100 per audio

ElevenLabs Scribe V1

Speech to Text

ElevenLabs Scribe V1 transcription with word-level timestamps and speaker identification.

≈ $0.051 per minute

ElevenLabs Scribe V2

Speech to Text

ElevenLabs Scribe V2 transcription with improved accuracy, word-level timestamps, and speaker identification.

≈ $0.051 per minute

ElevenLabs Turbo V2.5

Text to Speech

High-quality speech with the lowest latency, ideal for real-time applications. Supports 32 languages while maintaining natural voice quality.

≈ $0.102 per audio

ElevenLabs v3

Text to Speech

High-quality text-to-speech with enhanced controls and natural voices.

≈ $0.170 per audio

Gemini 2.5 Flash Preview TTS

Text to Speech

Google Gemini native TTS, with single-speaker and multi-speaker output controlled through the prompt.

≈ $0.051 per audio

Gemini 2.5 Pro Preview TTS

Text to Speech

Higher-quality Gemini TTS with controllable style and tone.

≈ $0.102 per audio

Gemini 3.1 Flash TTS Preview

Text to Speech

Google Gemini 3.1 Flash text-to-speech with support for inline audio tags and multi-speaker prompts.

≈ $0.102 per audio

Google Lyria 3 Pro Music

Music

Google Lyria 3 Pro generates premium music clips from a text prompt, with optional image guidance, negative prompts, and seed-based repeatability.

≈ $0.080 per audio

GPT-4o Mini Transcribe

Speech to Text

OpenAI's efficient speech-to-text model, with improved accuracy over Whisper.

≈ $0.005 per minute

GPT-4o Mini Transcribe (2025-03-20)

Speech to Text

Original release snapshot of GPT-4o Mini Transcribe.

≈ $0.005 per minute
