The dataset includes over 170,000 unique dialogs
In March 2005, a team of LTI researchers launched a spoken dialog system aimed at providing after-hours information to users of the Allegheny County public transit system. Thirteen years later, the system has handled over 200,000 calls, producing data that has been used in over 22 doctoral theses and more than 250 publications outside the CMU community. And now, that data is publicly available to researchers everywhere, in hopes of continuing to advance the state of the art in this ever-evolving field.
The NSF-funded “Let’s Go!” project, brainchild of LTI faculty members Maxine Eskenazi and Alan Black, was originally noteworthy not just for its utility to real-world users, but also for its focus on groups that had been neglected by previous spoken dialog systems, including the elderly and non-native English speakers. The system provided access to crucial information about bus schedules and service changes in real time, allowing users to find the relevant information using natural spoken language even when a human operator was unavailable.
This public-facing implementation allowed for the creation of a dataset comprising more than 171,000 dialogs. Of those, more than 93,000 include at least three turns of dialog between the human and the computer and led to a database lookup whose result was presented to the user – the simplest measure of a “successful” user interaction. All of those interactions – including WAV audio files, log files and automatically generated labels – are now available for download through the dataset’s GitHub page.
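For readers who want to apply that success measure to their own copy of the data, a minimal sketch in Python follows. It assumes each dialog has already been parsed into a record with a turn count and a flag for whether a database lookup was presented to the user. The `Dialog` class, its field names and the sample records are hypothetical illustrations, not the dataset's actual log format.

```python
from dataclasses import dataclass


@dataclass
class Dialog:
    # Hypothetical record; the real log files use their own format.
    dialog_id: str
    num_turns: int        # number of human-computer turns in the call
    had_db_lookup: bool   # True if a lookup result was presented to the user


def is_successful(dialog: Dialog, min_turns: int = 3) -> bool:
    """Apply the article's simplest success measure:
    at least `min_turns` turns and a database lookup presented to the user."""
    return dialog.num_turns >= min_turns and dialog.had_db_lookup


# Made-up sample records, for illustration only.
dialogs = [
    Dialog("a", 5, True),    # enough turns and a lookup
    Dialog("b", 2, True),    # too few turns
    Dialog("c", 7, False),   # no lookup presented
]

successful = [d.dialog_id for d in dialogs if is_successful(d)]
print(successful)
```

Keeping the threshold as a parameter makes it easy to try stricter or looser definitions of success on the same records.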
“Most very large datasets belong to some company and are not publicly available,” Eskenazi explained when noting the unique value of the Let’s Go! dataset. She added, “It’s easy to make simulated data or paid user data, but it’s hard to get real user data.”
Eskenazi said that it was the widespread use of artificial neural networks, which require large amounts of data to operate effectively, that motivated the decision to make the dataset public now – along with the hope that researchers outside the CMU community will find the dataset as useful as many within CMU already have.
“We hope people use the data to train and compare their systems,” she said.
Let’s Go! was also recently integrated into another LTI-based spoken dialog project, DialPort, and can now be accessed through the DialPort website. More information on the Let’s Go! project can be found on the project’s web page.