Prepare to be amazed as I introduce you to the groundbreaking Universal Speech Model (USM)! This magnificent creation is a state-of-the-art speech model family, boasting a staggering 2 billion parameters. It has been meticulously trained on 12 million hours of speech and 28 billion sentences of text, encompassing over 300 languages. Yes, you read that right: 300 languages!
Designed for use in YouTube (for instance, to generate closed captions), USM treats popular languages like English and Mandarin the same as lesser-known gems like Punjabi, Assamese, Santali, Balinese, Shona, and the list goes on. Some of these languages are spoken by fewer than 20 million people, which makes finding the necessary training data a Herculean task.
The secret sauce behind USM’s success is its clever utilization of a large, unlabeled multilingual dataset for pre-training the encoder, which is then fine-tuned on a smaller labeled dataset. This approach enables USM to recognize under-represented languages with remarkable efficiency.
And the results? Simply astonishing! Despite having limited supervised data, USM achieves less than 30% word error rate (WER; lower is better) on average across 73 languages, a milestone never reached before. Compared to the current internal state-of-the-art model, USM achieves a 6% lower WER for en-US. And when pitted against the recently released Whisper (large-v2), USM boasts a 32.7% lower WER on average across the 18 languages in which Whisper manages to keep WER under 40%.
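For readers unfamiliar with the metric: WER is just the word-level edit distance (substitutions, insertions, deletions) between the system's transcript and the reference, divided by the number of reference words. Here's a minimal toy sketch in Python (not USM's actual evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

print(wer("the quick brown fox", "the quick brown box"))  # 0.25
```

One substituted word out of four reference words gives a WER of 25%; note that WER can exceed 100% when the hypothesis inserts many spurious words.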
But wait, there’s more! USM outperforms Whisper in downstream ASR tasks on publicly available datasets such as CORAAL (African American Vernacular English), SpeechStew (English), and FLEURS (102 languages). The results are nothing short of impressive: for FLEURS, USM has a 65.8% lower WER without in-domain data and a 67.8% lower WER with in-domain data when compared to Whisper.
And as if that weren’t enough, USM also excels in automatic speech translation (AST)! Fine-tuned on the CoVoST dataset, USM outperforms Whisper across all resource-availability segments, as measured by BLEU scores.
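BLEU, the standard translation metric, rewards n-gram overlap between the system's output and a reference translation. A simplified sentence-level sketch (toy code, not the implementation used in the paper, which typically uses corpus-level BLEU with higher-order n-grams):

```python
import math
from collections import Counter

def bleu(reference: str, candidate: str, max_n: int = 2) -> float:
    """Simplified sentence BLEU: geometric mean of clipped n-gram
    precisions, scaled by a brevity penalty."""
    ref, cand = reference.split(), candidate.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        # Clip each candidate n-gram count by its count in the reference.
        matches = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if matches == 0:
            return 0.0
        log_precisions.append(math.log(matches / total))
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

score = bleu("the cat is on the mat", "the cat sat on the mat")
print(round(score, 4))  # 0.7071
```

Here the unigram precision is 5/6 and the bigram precision is 3/5, so the score is the geometric mean sqrt(0.5) ≈ 0.71; unlike WER, higher BLEU is better.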
In conclusion, the Universal Speech Model is a game-changer, revolutionizing the world of speech recognition and translation. With USM on the scene, I’m confident that we’re on the cusp of a new era where no language—no matter how rare or under-represented—will be left behind. So, brace yourselves for a future where everyone can finally have their voice heard!
"prompt": "Google's Groundbreaking Speech Recognition AI Model, deepleaps.com, best quality, 4k, 8k, ultra highres, raw photo in hdr, sharp focus, intricate texture, skin imperfections, photograph of",
"negative_prompt": "worst quality, low quality, normal quality, child, painting, drawing, sketch, cartoon, anime, render, 3d, blurry, deformed, disfigured, morbid, mutated, bad anatomy, bad art",
"original_prompt": "Google's Groundbreaking Speech Recognition AI Model, deepleaps.com, best quality, 4k, 8k, ultra highres, raw photo in hdr, sharp focus, intricate texture, skin imperfections, photograph of",