The MLCommons Association, an open engineering consortium dedicated to improving machine learning for everyone, today announced the general availability of the People’s Speech Dataset and the Multilingual Spoken Words Corpus (MSWC).
These permissively licensed datasets advance innovation in machine learning research and commercial applications, the consortium said.
According to an official release, the People’s Speech Dataset is among the world’s largest English speech recognition datasets licensed for academic and commercial use. The 30,000-hour supervised conversational dataset is an order of magnitude larger than what was available just a few years ago.
The dataset, released under a Creative Commons license, democratises access to speech technology such as voice assistants and transcription, and unlocks innovation in the machine learning community. Contributors to the dataset include researchers from Baidu, Factored, Harvard University, Intel, Landing AI, and NVIDIA. It can be downloaded at mlcommons.org/speech.
The other release of the day, the Multilingual Spoken Words Corpus (MSWC), is a rich audio speech dataset containing more than 340,000 keywords in 50 languages and upwards of 23.4 million examples. Previous datasets relied on manual efforts to collect and validate thousands of utterances for each keyword and were commonly restricted to a single language.
A diverse multilingual dataset that spans languages spoken by over five billion people, MSWC advances the research and development of applications such as voice interfaces for a broad global audience. Contributors to the MSWC include researchers from Coqui, Factored, Google, Harvard University, Intel, Landing AI, NVIDIA, and the University of Michigan. It can be downloaded at mlcommons.org/words.
MLCommons also unveiled DataPerf, a new benchmark suite that supports data-centric AI innovation by measuring the quality of datasets for common ML tasks and the impact of enhancing those datasets. The association is also issuing a call for participation in the suite.