MARS5 TTS: Advanced Customizable Voice Synthesis

Generates customizable speech audio from text.

Overview of MARS5-TTS: Advanced Text-to-Speech Model from CAMB.AI

MARS5-TTS is a text-to-speech (TTS) model developed by CAMB.AI, designed to generate high-quality speech audio from text input. This model is particularly noted for its ability to handle complex prosodic scenarios, such as sports commentary or animated character voices, making it suitable for a variety of applications in different industries.

Key Features

  • Two-Stage AR-NAR Pipeline: MARS5 uses a two-stage architecture. An autoregressive (AR) transformer first generates coarse speech features, and a non-autoregressive (NAR) multinomial DDPM (denoising diffusion probabilistic model) then refines those features before they are decoded into the final audio.
  • Prosody Control: The model supports prosody control through textual cues like punctuation and capitalization, allowing users to influence the speech output's rhythm and emphasis naturally.
  • Speaker Identity Cloning: By using a reference audio file, MARS5 can mimic the voice of the speaker in the reference, offering capabilities ranging from shallow cloning (fast and requires no transcript) to deep cloning (slower but higher quality and requires a transcript).
  • Requirements: The model requires Python 3.10 or later and depends on Torch, Torchaudio, and Librosa, among other libraries.
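
The choice between shallow and deep cloning comes down to whether a transcript of the reference audio is available. A minimal sketch of that decision follows. Note that `clone_settings` is a hypothetical helper and not part of MARS5; it only assumes the `deep_clone` flag that the usage section passes to the inference config.

```python
from typing import Optional


def clone_settings(ref_transcript: Optional[str]) -> dict:
    """Pick shallow or deep cloning based on whether a transcript exists.

    Hypothetical helper (not part of MARS5): returns values that could be
    fed into the inference config and the synthesis call.
    """
    transcript = (ref_transcript or "").strip()
    return {
        # Deep cloning needs the reference transcript; without one,
        # fall back to the faster shallow clone.
        "deep_clone": bool(transcript),
        "ref_transcript": transcript,
    }


print(clone_settings("Hello there."))  # transcript given: deep clone
print(clone_settings(None))            # no transcript: shallow clone
```

Whitespace-only transcripts are treated as missing, so a blank form field would not accidentally request a deep clone.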

Usage

Installation

Users can install the necessary libraries using pip:

pip install --upgrade torch torchaudio librosa vocos encodec safetensors regex
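
Before loading the model, it can help to confirm that the environment meets these requirements. The following is a small sketch using only the standard library. The helper names `missing_modules` and `python_ok` are illustrative, and the module list assumes that each package in the pip command above is imported under the same name.

```python
import importlib.util
import sys

# Import names assumed to match the packages in the pip command above.
REQUIRED_MODULES = ["torch", "torchaudio", "librosa", "vocos",
                    "encodec", "safetensors", "regex"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_ok(minimum=(3, 10)):
    """Return True when the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= minimum


if __name__ == "__main__":
    print("Python >= 3.10:", python_ok())
    print("Missing modules:", missing_modules(REQUIRED_MODULES) or "none")
```

The check only looks modules up and does not import them, so it stays fast even when large packages such as Torch are installed.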

Model Loading

MARS5 can be loaded directly via torch.hub or from a cloned repository:

import torch, librosa
mars5, config_class = torch.hub.load('Camb-ai/mars5-tts', 'mars5_english', trust_repo=True)

Or, from a cloned copy of the repository:

from inference import Mars5TTS, InferenceConfig as config_class
import torch, librosa
mars5 = Mars5TTS.from_pretrained("CAMB-AI/MARS5-TTS")

Performing Synthesis

To generate speech, load a reference audio clip, choose the cloning type, and run TTS:

# Load the reference audio as mono, resampled to the model's sample rate
wav, sr = librosa.load('<path to 24kHz waveform>.wav', sr=mars5.sr, mono=True)
wav = torch.from_numpy(wav)
ref_transcript = "<transcript of the reference audio>"

# Deep cloning requires the reference transcript; set to False for a faster shallow clone
deep_clone = True
cfg = config_class(deep_clone=deep_clone, rep_penalty_window=100, top_k=100, temperature=0.7, freq_penalty=3)
# output_audio holds the synthesized waveform at the model's sample rate
ar_codes, output_audio = mars5.tts("The quick brown rat.", wav, ref_transcript, cfg=cfg)
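
The snippet above stops once the audio has been generated. One way to save the result to disk with only the standard library is sketched below. It assumes that `output_audio` is a one-dimensional float waveform with values roughly in [-1, 1] and that `mars5.sr` is its sample rate. `write_wav` is a hypothetical helper, not part of MARS5.

```python
import struct
import wave


def write_wav(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip to the valid range
        frames += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(bytes(frames))


# Assumed usage with the synthesis result above:
# write_wav("output.wav", output_audio.tolist(), mars5.sr)
```

Clipping before conversion avoids integer overflow if a sample slightly exceeds the expected range.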

Model Details

Checkpoints

  • AR fp16 Checkpoint: Approximately 750M parameters, config embedded.
  • NAR fp16 Checkpoint: Roughly 450M parameters, config embedded.

Both checkpoints are available in PyTorch .pt format and as .safetensors files, with the default loading via torch.hub.load() using the safetensors format.
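
As a rough guide to download and memory requirements, the parameter counts above can be converted into approximate weight sizes. fp16 stores each parameter in 2 bytes, so the estimate is simply the parameter count times two. This is back-of-the-envelope arithmetic covering weights only; actual file sizes and runtime memory use will differ.

```python
BYTES_PER_FP16 = 2  # a half-precision float occupies 2 bytes


def fp16_weights_gb(num_params):
    """Approximate size of fp16 weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_FP16 / 1e9


ar_gb = fp16_weights_gb(750e6)   # AR checkpoint, ~750M parameters
nar_gb = fp16_weights_gb(450e6)  # NAR checkpoint, ~450M parameters
print(f"AR ~{ar_gb:.1f} GB, NAR ~{nar_gb:.1f} GB, total ~{ar_gb + nar_gb:.1f} GB")
```

Under these assumptions, the two checkpoints come to roughly 1.5 GB and 0.9 GB of weights.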

Licensing

MARS5 is released under the AGPL-3.0 license. This is a copyleft license: the model is free and open-source, and modified versions that are distributed or offered as a network service must be released under the same license.

Additional Resources

  • Documentation and Demos: Further details on the model's architecture and performance can be found in the docs folder of the repository.
  • Online Demo: An online demo of the model is available.
  • Docker Support: Users can pull a Docker image from DockerHub or build their own using the provided Dockerfile.

MARS5-TTS is a robust solution for developers and companies looking to integrate advanced speech synthesis capabilities into their applications, with particular strengths in handling varied and challenging prosodic tasks.
