LTI Research Areas
Computational Biology
LTI researchers began identifying the synergies between human language technologies and computational biology more than a decade ago. LTI faculty and students continue to identify problems in computational biology that could benefit from techniques originally developed for human language technologies. These problems fall into many areas, including learning protein-protein interactions, learning 3-D protein structure, biological information management, genotype-phenotype mapping, disease association studies, genomic motif detection, and computational epidemiology.
Information Extraction, Summarization and Question Answering
The LTI has made great strides in information extraction and question answering since it began investigating the topic more than a decade ago. The institute’s ongoing focus is on machine learning algorithms that a QA system can use to automatically learn a new domain, significantly reducing the cost and development time of new applications. Work includes developing a modular, multi-strategy architecture; a metrics and measurement framework for open advancement that combines architecture, metrics and challenge-problem datasets to maximize the impact of incremental R&D on component- and system-level performance; language-independent QA algorithms and multilingual components; structured information retrieval and rank-learning models to support more effective document and passage retrieval for question answering; algorithms for answer scoring; and algorithms for source expansion.
Information Retrieval, Text Mining and Analytics
Information retrieval (IR) and text mining research develops tools that enable people to find, organize and analyze text-based information. This area of computer science has become increasingly important due to the rapid growth of digitized information, social networks and mobile computing. The LTI’s research on new algorithms and experimental methodologies advances the state of the art in information retrieval. We also produce community research infrastructure such as datasets, computational services and open-source software that support the broader scientific community in its efforts to move the field forward. Most past and current work in text mining has focused on converting text to structured information, leading to an emphasis on techniques like clustering, classification, and extraction of facts and entities. The LTI is making contributions to the underlying technologies applied to text mining and analytics, and exploring the use of large-scale computing to mine very large text datasets.
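At the heart of retrieval systems like those described above is the scoring of documents against a query using term statistics. A minimal TF-IDF sketch illustrates the idea; the toy corpus and the smoothed IDF formula here are illustrative choices, not any particular LTI system.

```python
import math
from collections import Counter

def tfidf_rank(query, docs):
    """Rank documents against a query by a simple TF-IDF score.
    A toy illustration of term-statistics ranking, not a production system."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # document frequency: number of documents containing each term
    df = Counter(t for doc in tokenized for t in set(doc))

    def idf(t):
        # smoothed inverse document frequency (illustrative variant)
        return math.log((n + 1) / (df[t] + 1)) + 1

    scores = []
    for i, doc in enumerate(tokenized):
        tf = Counter(doc)
        score = sum(tf[t] * idf(t) for t in query.lower().split())
        scores.append((score, i))
    # highest-scoring documents first
    return [i for _, i in sorted(scores, reverse=True)]

docs = ["language technologies institute research",
        "retrieval of text documents",
        "text mining and text analytics research"]
print(tfidf_rank("text research", docs))  # the text-mining document ranks first
```

Real systems replace this scoring with learned ranking models, but the pipeline of tokenizing, counting and scoring is the common skeleton.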
Knowledge Representation, Reasoning and Acquisition
Data extracted automatically from natural sources like text, speech, video and physical sensors is often limited and less than perfectly accurate. The LTI’s knowledge representation, reasoning and acquisition research develops algorithms for reaching reasonable conclusions from such approximate, incomplete and unreliable data. It includes techniques for explicit representation of uncertainty; reasoning and hypothesis verification based on uncertain data; and planning of additional data collection. Researchers working on these problems investigate a variety of approaches, including Bayesian statistics; explicit representation of uncertainty in numeric and nominal data as distributions that are refined, merged, and used in planning and optimization under uncertainty; targeted information gathering; and integration of probabilistic reasoning with active and proactive learning.
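As a concrete illustration of reasoning from unreliable data, a minimal Bayesian update for a binary hypothesis observed through a noisy sensor can be sketched as follows; the 80% sensor reliability and the uniform prior are invented for illustration.

```python
def bayes_update(prior, likelihood_true, likelihood_false):
    """Posterior P(H | observation) for a binary hypothesis H,
    given the observation's likelihood under H and under not-H.
    A toy sketch of explicit uncertainty representation."""
    num = prior * likelihood_true
    return num / (num + (1 - prior) * likelihood_false)

# A sensor that is right 80% of the time (assumed figure) reports
# the event twice; each report refines the belief.
p = 0.5  # uninformative prior
for _ in range(2):
    p = bayes_update(p, likelihood_true=0.8, likelihood_false=0.2)
print(round(p, 3))  # belief rises well above the prior
```

Two agreeing but individually unreliable observations already yield a fairly confident posterior, which is the basic mechanism behind merging uncertain evidence.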
Language Technologies for Education
Much of the LTI’s current work in language technologies for education lies in the area of intelligent tutors. Researchers are investigating how to automatically choose sentences for pronunciation practice that would be confusing to the listener, given the native language of the speaker and the semantics of the context. Other research looks at teaching the use of articles in English by showing students a text with a pull-down menu from which they can select “the,” “a” or nothing. Yet another project tests the effects of reading documents on interesting topics. To assess student progress, there is ongoing work on question generation. There is also work on delivering language learning on small devices in novel, interesting ways, and the LTI is a pioneer in the use of games for language education.
Machine Translation
The LTI has a long history of research activity in machine translation (MT). Current LTI MT research spans the spectrum of state-of-the-art topics in the field, including data-driven, statistical machine-learning approaches in which statistical models of translation equivalence are learned automatically from large volumes of parallel text. The institute’s research on statistical and data-driven MT investigates learning methods that incorporate deeper linguistic knowledge, primarily from syntax-annotated parallel and monolingual data. Other important research challenges include developing effective data-driven MT methods for languages with limited data resources, learning models of translation equivalence from non-parallel corpora, and dealing with morphologically rich and synthetic languages.
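The core idea of learning translation equivalence from parallel text can be sketched with IBM Model 1, a classic EM-based word-translation model. The sentence pairs below are a standard textbook-style toy example, not LTI data, and the sketch omits everything a real MT system needs beyond the lexical model.

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """Estimate word-translation probabilities t(f|e) from sentence
    pairs by expectation-maximization (IBM Model 1, toy sketch)."""
    src_vocab = {w for s, _ in pairs for w in s.split()}
    tgt_vocab = {w for _, f in pairs for w in f.split()}
    # uniform initialization of t(f|e)
    t = {(f, e): 1.0 / len(tgt_vocab) for f in tgt_vocab for e in src_vocab}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected alignment counts
        for e_sent, f_sent in pairs:
            e_words, f_words = e_sent.split(), f_sent.split()
            for f in f_words:
                z = sum(t[(f, e)] for e in e_words)  # normalization
                for e in e_words:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: re-estimate probabilities from the counts
        for (f, e) in t:
            if total[e]:
                t[(f, e)] = count[(f, e)] / total[e]
    return t

pairs = [("the house", "das haus"),
         ("the book", "das buch"),
         ("a book", "ein buch")]
t = ibm_model1(pairs)
# After EM, "das" should align most strongly with "the".
best = max(("the", "house", "book", "a"), key=lambda e: t[("das", e)])
print(best)
```

Because "das" co-occurs with "the" in two sentence pairs but with "house" and "book" only once each, EM concentrates its probability mass on the correct pairing without any word-level supervision.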
Natural Language Processing/Computational Linguistics
The LTI boasts strong research programs in natural language processing, which develops representations and algorithms for linguistic data that support a wide range of applied technologies. Examples include part-of-speech tags, morphological segmentation and labeling, chunked phrases, parse trees, and semantic expressions, along with algorithms for mapping raw text into these representations. The aim is to use statistical modeling to automatically infer such mappings from data. This approach has become the dominant technique for building linguistic analyzers. As part of its natural language processing and computational linguistics research, the LTI also investigates multimedia analysis, retrieval and interfaces. At the core of its vision is the desire to make video as accessible and searchable as text is today by intelligently combining the strengths of automated speech recognition, video analysis, natural language processing and interfaces to overcome the shortfalls inherent in the technology components.
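Inferring a mapping from raw text to linguistic representations can be illustrated with the simplest data-driven analyzer: a most-frequent-tag part-of-speech baseline. The training sentences and the small tag set here are toy examples invented for illustration.

```python
from collections import Counter, defaultdict

def train_tagger(tagged_sentences):
    """Most-frequent-tag baseline: for each word, remember the tag it
    carried most often in labeled data. The simplest statistical
    mapping from words to part-of-speech tags."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(model, sentence, default="NOUN"):
    # unseen words fall back to the most common open-class tag
    return [(w, model.get(w.lower(), default)) for w in sentence.split()]

# toy labeled data with a hypothetical three-tag set
train = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("barks", "VERB")],
]
model = train_tagger(train)
print(tag(model, "the cat barks"))
```

State-of-the-art taggers replace the per-word lookup with models that condition on context, but the workflow of estimating a mapping from labeled data and applying it to new text is the same.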
Speech Processing
Speech contains nuances that humans can interpret but that prove challenging for machines. The LTI’s work in speech processing focuses on three areas: converting human speech to a string of the words spoken (automatic speech recognition); converting text into human-sounding speech (text-to-speech); and identifying a person, dialect or emotion from a short piece of audio (speaker identification). Current LTI speech research has moved beyond simple conversion between text and speech and task-oriented dialogue to investigate more advanced speech issues concerned with content outside the textual transcription. Speech is far more than a representation of the words spoken, and the LTI looks to identify — and synthesize — emotions, styles and sarcasm. Our researchers want storytelling to be engaging, automatic tutors to be persuasive, and conversational agents to take part in conversations like humans, not machines. Current research at the LTI is leading speech in these directions.
Spoken Interfaces and Dialogue Processing
To the outside user, dialogue systems most often appear as information-giving assistants. As the need to communicate with automated systems grows, the challenges for dialogue systems researchers center on the quality of a system’s responses and on its flexibility and ability to adapt to changing situations and new groups of users. The domain of dialogue systems now encompasses work on system architecture; adaptation to speakers, new tasks and environments; the use of reinforcement learning and simulated users; multimodal applications; the status of the participants in a dialogue; and the means of assessing these systems. The general theme of this research has been to build on our current knowledge and infrastructure to investigate novel uses for spoken interaction.