MARS5 TTS: Advanced Customizable Voice Synthesis

Overview of MARS5-TTS: Advanced Text-to-Speech Model from CAMB.AI

MARS5-TTS is an English text-to-speech (TTS) model developed by CAMB.AI that generates speech audio from text input and a reference recording of the target voice. It is designed to handle prosodically demanding material, such as sports commentary or animated character voices.

Key Features

Two-Stage AR-NAR Pipeline: MARS5 uses a two-stage architecture. An autoregressive (AR) transformer first produces an initial set of speech features, and a non-autoregressive (NAR) multinomial DDPM (denoising diffusion probabilistic model) then refines those features into the final audio output.
Prosody Control: The model supports prosody control through textual cues like punctuation and capitalization, allowing users to influence the speech output's rhythm and emphasis naturally.
Speaker Identity Cloning: By using a reference audio file, MARS5 can mimic the voice of the speaker in the reference, offering capabilities ranging from shallow cloning (fast and requires no transcript) to deep cloning (slower but higher quality and requires a transcript).
High Compatibility: The model runs on Python 3.10 and above and depends on widely used libraries, including Torch, Torchaudio, Librosa, Vocos, Encodec, safetensors, and regex.
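
The trade-off between shallow and deep cloning can be captured in a small helper. The sketch below is illustrative only: `clone_settings` is a hypothetical function, not part of MARS5, and it simply selects deep cloning whenever a non-empty reference transcript is available. The sampling values are copied from the synthesis example on this page and are not tuned recommendations.

```python
from typing import Optional


def clone_settings(ref_transcript: Optional[str]) -> dict:
    """Return keyword arguments for the inference config (hypothetical helper).

    Deep cloning needs a transcript of the reference audio, so it is chosen
    only when a non-empty transcript is supplied; otherwise shallow cloning
    is used.
    """
    has_transcript = bool(ref_transcript and ref_transcript.strip())
    return {
        "deep_clone": has_transcript,
        # Sampling values taken from the synthesis example; adjust as needed.
        "rep_penalty_window": 100,
        "top_k": 100,
        "temperature": 0.7,
        "freq_penalty": 3,
    }


shallow = clone_settings(None)
deep = clone_settings("Hello there, this is my reference clip.")
print(shallow["deep_clone"], deep["deep_clone"])
```

The returned dictionary could then be unpacked into the config, for example `config_class(**clone_settings(ref_transcript))`, assuming the config class accepts these keyword arguments as it does in the synthesis example.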

Usage

Installation

Install the required libraries with pip:

pip install --upgrade torch torchaudio librosa vocos encodec safetensors regex
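
Before loading the model, it can help to confirm that the environment matches these requirements. The following is a minimal sketch using only the standard library. The helper names `missing_modules` and `python_is_supported` are hypothetical, and the check assumes each package's import name matches its pip name, which holds for the packages listed in the pip command.

```python
import importlib.util
import sys

# Import names for the packages in the pip command.
REQUIRED_MODULES = [
    "torch", "torchaudio", "librosa", "vocos", "encodec", "safetensors", "regex",
]


def missing_modules(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_supported(version_info=sys.version_info):
    """Check the stated minimum of Python 3.10."""
    return tuple(version_info[:2]) >= (3, 10)


print("Python version OK:", python_is_supported())
print("Missing packages:", missing_modules(REQUIRED_MODULES))
```

An empty list of missing packages only shows that the packages are installed; it does not verify their versions.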

Model Loading

MARS5 can be loaded directly via torch.hub or from a cloned repository:

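import torch, librosa

# `mars5` bundles the AR and NAR models with the inference code;
# `config_class` holds tunable inference settings such as temperature.
mars5, config_class = torch.hub.load('Camb-ai/mars5-tts', 'mars5_english', trust_repo=True)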

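Alternatively, from a local clone of the repository:

# Run from the root of the cloned repository so that `inference` is importable.
from inference import Mars5TTS, InferenceConfig as config_class
import torch, librosa
mars5 = Mars5TTS.from_pretrained("CAMB-AI/MARS5-TTS")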

Performing Synthesis

To generate speech, users need to load a reference audio, set the cloning type, and perform TTS:

# Load the reference audio as mono, resampled to the model's sample rate.
wav, sr = librosa.load('<path to 24kHz waveform>.wav', sr=mars5.sr, mono=True)
wav = torch.from_numpy(wav)
ref_transcript = "<transcript of the reference audio>"

# Deep cloning needs the reference transcript; set this to False for shallow cloning.
deep_clone = True
cfg = config_class(deep_clone=deep_clone, rep_penalty_window=100, top_k=100, temperature=0.7, freq_penalty=3)
# Returns the codes produced by the AR stage and the synthesized waveform.
ar_codes, output_audio = mars5.tts("The quick brown rat.", wav, ref_transcript, cfg=cfg)
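
To listen to the result, the waveform has to be written to disk. The following is a minimal sketch using only Python's standard library. `save_wav` is a hypothetical helper, and it assumes the output is a one-dimensional sequence of float samples in the range -1 to 1 (for a torch tensor, something like `output_audio.tolist()`) at a 24 kHz sample rate, as the reference-audio placeholder suggests. A generated sine tone stands in for real model output so the example runs without the model.

```python
import math
import os
import struct
import tempfile
import wave


def save_wav(path, samples, sample_rate=24000):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as f:
        f.setnchannels(1)            # mono
        f.setsampwidth(2)            # 2 bytes = 16 bits per sample
        f.setframerate(sample_rate)
        f.writeframes(bytes(frames))


# Stand-in signal: one second of a 440 Hz tone in place of real model output.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 24000) for n in range(24000)]
out_path = os.path.join(tempfile.gettempdir(), "mars5_example_output.wav")
save_wav(out_path, tone)
```

Audio libraries such as torchaudio provide their own save functions, which may be more convenient when the output is already a tensor.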

Model Details

Checkpoints

AR fp16 Checkpoint: Approximately 750M parameters, config embedded.
NAR fp16 Checkpoint: Roughly 450M parameters, config embedded.

Both checkpoints are available in PyTorch .pt format and as .safetensors files. Loading through torch.hub.load() uses the safetensors files by default.
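
As a rough guide to download size and memory footprint, a parameter count converts to bytes by multiplying by the bytes per parameter, and fp16 stores each parameter in 2 bytes. The sketch below applies that arithmetic to the approximate counts quoted for the two checkpoints. Actual file sizes will differ somewhat, since the counts are rounded and the files also carry configuration and metadata.

```python
BYTES_PER_FP16_PARAM = 2  # half precision uses 16 bits per value


def fp16_size_gb(num_params):
    """Approximate size in gigabytes (10**9 bytes) of fp16 weights."""
    return num_params * BYTES_PER_FP16_PARAM / 1e9


ar_gb = fp16_size_gb(750e6)   # AR checkpoint, roughly 750M parameters
nar_gb = fp16_size_gb(450e6)  # NAR checkpoint, roughly 450M parameters
print(f"AR ~{ar_gb:.1f} GB, NAR ~{nar_gb:.1f} GB, total ~{ar_gb + nar_gb:.1f} GB")
# prints: AR ~1.5 GB, NAR ~0.9 GB, total ~2.4 GB
```

These figures cover the weights only; running inference needs additional memory for activations and the reference audio.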

Licensing

MARS5 is released under the AGPL-3.0 license, a copyleft license. The model remains free and open-source, and modified versions that are distributed or offered as a network service must be made available under the same license.

Additional Resources

Documentation and Demos: Further details on the model's architecture and performance can be found in the docs folder of the repository.
Online Demo: An online demo of the model is also available.
Docker Support: Users can pull a Docker image from DockerHub or build their own using the provided Dockerfile.

MARS5-TTS suits developers and companies that want to integrate speech synthesis with voice cloning into their applications, particularly where the material is prosodically varied or challenging.