MARS5-TTS is a text-to-speech (TTS) model developed by CAMB.AI that generates high-quality speech audio from text input. It is particularly noted for handling prosodically complex scenarios, such as sports commentary or animated character voices.
Install the required libraries with pip:
pip install --upgrade torch torchaudio librosa vocos encodec safetensors regex
MARS5 can be loaded directly via torch.hub or from a cloned repository:
import torch, librosa
mars5, config_class = torch.hub.load('Camb-ai/mars5-tts', 'mars5_english', trust_repo=True)
Or, from a cloned copy of the repository:
from inference import Mars5TTS, InferenceConfig as config_class
import torch, librosa
mars5 = Mars5TTS.from_pretrained("CAMB-AI/MARS5-TTS")
To generate speech, users need to load a reference audio, set the cloning type, and perform TTS:
wav, sr = librosa.load('<path to 24kHz waveform>.wav', sr=mars5.sr, mono=True)
wav = torch.from_numpy(wav)
ref_transcript = "<transcript of the reference audio>"
deep_clone = True  # Deep clone uses the reference transcript; set to False if no transcript is available
cfg = config_class(deep_clone=deep_clone, rep_penalty_window=100, top_k=100, temperature=0.7, freq_penalty=3)
ar_codes, output_audio = mars5.tts("The quick brown rat.", wav, ref_transcript, cfg=cfg)
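The call above returns the synthesized audio in output_audio. To listen to it, the samples need to be written to a file. The following is a minimal sketch using only the Python standard library, assuming that output_audio is a mono, one-dimensional sequence of float samples in the range [-1.0, 1.0] at the model's sample rate (mars5.sr). The helper name write_wav_16bit is hypothetical and not part of the MARS5 repository.

```python
import struct
import wave


def write_wav_16bit(samples, path, sample_rate):
    """Write mono float samples (assumed range [-1.0, 1.0]) as a 16-bit PCM WAV file.

    Hypothetical helper for illustration; returns the number of samples written.
    """
    frames = bytearray()
    for s in samples:
        # Clip to the valid range, then scale to a signed 16-bit integer.
        clipped = max(-1.0, min(1.0, float(s)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)            # mono
        wf.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)  # e.g. mars5.sr
        wf.writeframes(bytes(frames))
    return len(frames) // 2
```

Under those assumptions, usage after the tts call might look like write_wav_16bit(output_audio.flatten().tolist(), "output.wav", mars5.sr). If torchaudio is already installed, its save function is an alternative that works on tensors directly.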
Both checkpoints are available in PyTorch .pt format and as .safetensors files; torch.hub.load() loads the safetensors format by default.
MARS5 is released under the AGPL-3.0 license, ensuring that it remains free and open-source, with modifications and shared improvements encouraged under the same license.
Additional documentation is available in the docs folder of the repository.
MARS5-TTS is a robust solution for developers and companies looking to integrate advanced speech synthesis into their applications, with particular strengths in varied and challenging prosodic tasks.