Meta has released SeamlessM4T, a general-purpose multilingual speech/text model claimed to surpass OpenAI’s Whisper. It’s available on GitHub, and everything can be used for free in a non-commercial setting.

Model Features:

  • Automatic speech recognition for ~100 languages.
  • Speech-to-text translation for ~100 input/output languages.
  • Speech-to-speech translation for ~100 input languages and 35 output languages.
  • Text-to-text and text-to-speech translation for nearly 100 languages.

Dataset:

  • SeamlessAlign: Open multimodal translation dataset with 270,000 hours of speech and text alignments.

Technical Insights:

  • Uses a multilingual and multimodal text embedding space for 200 languages.
  • Applies a teacher-student approach to extend this embedding space to the speech modality, currently covering 36 languages.
  • Mining performed on publicly available repositories resulted in 443,000 hours of speech aligned with texts and 29,000 hours of speech-to-speech alignments.
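
To make the mining idea concrete: once speech and text live in the same embedding space, candidate alignments can be found by comparing vectors across modalities. The sketch below is only an illustration under assumptions. The function names (`cosine`, `mine_pairs`), the toy 3-dimensional vectors, and the plain cosine threshold of 0.8 are all hypothetical, and the scoring used in the actual mining pipeline may well differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_pairs(speech_embs, text_embs, threshold=0.8):
    """Pair each speech segment with its most similar text sentence.

    A pair is kept only if the best similarity reaches `threshold`
    (a hypothetical cutoff chosen for this toy example).
    """
    pairs = []
    for s_id, s_vec in speech_embs.items():
        best_id, best_score = None, -1.0
        for t_id, t_vec in text_embs.items():
            score = cosine(s_vec, t_vec)
            if score > best_score:
                best_id, best_score = t_id, score
        if best_id is not None and best_score >= threshold:
            pairs.append((s_id, best_id, best_score))
    return pairs


# Toy embeddings standing in for encoder outputs (made-up values).
speech_embs = {
    "clip_a": [0.9, 0.1, 0.0],
    "clip_b": [0.0, 0.2, 0.9],
    "clip_c": [0.5, 0.5, 0.5],  # not close to any sentence
}
text_embs = {
    "sent_1": [1.0, 0.0, 0.0],
    "sent_2": [0.0, 0.0, 1.0],
}

mined = mine_pairs(speech_embs, text_embs)
for speech_id, text_id, score in mined:
    print(f"{speech_id} <-> {text_id} (similarity {score:.3f})")
```

In this toy run, `clip_c` is discarded because its best match falls below the threshold; at real scale the brute-force inner loop would be replaced by an approximate nearest-neighbor index.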

Toxicity Filter:

  • The model identifies toxic words in speech inputs/outputs, and training pairs with unbalanced toxicity (toxic words on only one side) are filtered out of the training data.
  • The demo detects toxicity in both input and output. If toxicity is detected only in the output, a warning is included and the output is not shown.
  • Given how impaired llama2-chat has been by these kinds of filters, it’s unclear how useful these models are in a general setting.
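
The filtering and demo behavior described in this list can be sketched as a simple word-list check. This is a minimal illustration under assumptions, not Meta's implementation. The word list, the function names, and the use of one list for both languages are hypothetical. The sketch also assumes "unbalanced" means toxic words appear on exactly one side of a pair, and that the demo shows the output unchanged in every case other than output-only toxicity.

```python
import re

# Hypothetical placeholder word list; real per-language lists are not reproduced here.
TOXIC_WORDS = {"badword", "slur"}


def toxic_terms(text, wordlist=TOXIC_WORDS):
    """Return the set of listed words that appear in `text` (case-insensitive)."""
    tokens = re.findall(r"\w+", text.lower())
    return {t for t in tokens if t in wordlist}


def is_unbalanced(source, target):
    """True when exactly one side of a pair contains toxic words."""
    return bool(toxic_terms(source)) != bool(toxic_terms(target))


def filter_training_pairs(pairs):
    """Drop training pairs whose toxicity is unbalanced."""
    return [(s, t) for s, t in pairs if not is_unbalanced(s, t)]


def present_output(source, translation):
    """Return (text_to_show, warning) following the demo rule described above.

    If toxicity appears only in the output, the output is withheld and a
    warning is returned. All other cases show the output (an assumption).
    """
    in_toxic = bool(toxic_terms(source))
    out_toxic = bool(toxic_terms(translation))
    if out_toxic and not in_toxic:
        return None, "Toxicity detected in the output only; translation withheld."
    return translation, None


training_pairs = [
    ("hello there", "bonjour"),          # clean on both sides: kept
    ("hello there", "badword bonjour"),  # toxic on target only: dropped
    ("you badword", "toi slur"),         # toxic on both sides: kept
]
kept = filter_training_pairs(training_pairs)
print(kept)
print(present_output("hello", "hi badword"))
print(present_output("you badword", "toi slur"))
```

A check this blunt shows why such filters can hurt general usefulness: a legitimate translation that happens to contain a listed word is suppressed just like a hallucinated insult.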