Language Technologies Institute
Carnegie Mellon University
School of Computer Science

LTI Projects by Research Area

Machine Translation (3)

Information Retrieval (5)

Language Technologies for Education (9)

Computational Biology (3)

Information Extraction, Summarization, and Question Answering (4)

Other, Interdisciplinary Projects (6)

Speech Processing (7)

Knowledge Representation, Reasoning, and Acquisition (2)

Dialogue Processing (7)

Natural Language Processing / Computational Linguistics (6)

Spinoff Companies (2)

Older Projects (5)

Machine Translation


Knowledge-based Machine Translation

The KANT project was founded in 1989 for the research and development of large-scale, practical translation systems for technical documentation. KANT uses a controlled vocabulary and grammar for each source language, and explicit yet focused semantic models for each technical domain to achieve very high accuracy in translation. Designed for multilingual document production, KANT has been applied to the domains of electric power utility management and heavy equipment technical documentation.

Contacts: Teruko Mitamura and Eric Nyberg

The Linguistic-Core Approach to Structured Translation and Analysis of Low-Resource Languages

Of the many diverse world languages, very few are within reach of current natural language processing (NLP) and machine translation (MT) techniques. While mainstream approaches fail to generalize to most languages due to the lack of resources (e.g., text and annotations), our approach is designed to discover and leverage deep syntactic and semantic structures elicited from human experts.

Contacts: Jaime Carbonell, Lori Levin, Noah Smith, Chris Dyer

Portable Speech to Speech Translation

The Portable Speech to Speech Translation project develops two-way hand-held speech-to-speech translation systems, making use of the intelligence of the machine in the middle to automatically recognize ambiguities in the cross-lingual conversation, and resolve issues by asking the users disambiguation questions.

Contacts: Alan W Black and Alex Waibel

Speech Processing

FestVox: Building Synthetic Voices

This project is designed to provide the tools, scripts and documentation to allow people to build synthetic voices for use with general speech applications. Support for English and other languages is provided. Voices produced by these methods run within Edinburgh University's Festival Speech Synthesis System. We are also developing a small, fast synthesis engine suitable for these voices called Flite. This project involves a number of aspects of speech synthesis research, including prosodic modelling, unit selection synthesis, statistical parametric synthesis (HMM synthesis), diphone synthesis, text analysis, lexicon representation, and limited domain synthesis. It also provides a forum for research and development of automatic labeling tools and synthesis evaluation tools. Voices built from these methods have been used in other CMU and external projects such as the CMU Let's Go Bus Information System, and various speech-to-speech translation systems. Recent work investigates processing languages without written forms, and stylistic variation including emotion, casual and specific speaker styles.

Contact: Alan W Black


A Spoken Dialog System For the General Public

The Let's Go! project is building a spoken dialog system that can be used by the general public. While there has been success in building spoken dialog systems that are able to interact well with people (for example, the CMU Communicator system), these systems often work only for a limited group of people. The system we are developing for Let's Go! is designed to work with a much wider population, including groups which typically have trouble interacting with dialog systems, such as non-native English speakers and the elderly.

The Let's Go! project works in the domain of bus information for Pittsburgh's Port Authority Transit bus system. The system provides a telephone-based interface to bus schedules and route information.

Contacts: Maxine Eskenazi and Alan W Black


PhonBank

The PhonBank project seeks to develop a shared web-based database for the analysis of phonological development. There are 50 members of the PhonBank consortium group who are contributing their data to the project. The Phon program facilitates a number of tasks required for the analysis of phonological development. Phon supports multimedia data linkage, unit segmentation, multiple-blind transcription, automatic labeling of data, and systematic comparisons between target (model) and actual (produced) phonological forms. All of these functions are accessible through a user-friendly graphical interface. Databases managed within Phon can also be queried using a powerful search interface. This software program works on both Mac OS X and Windows platforms, is fully compliant with the CHILDES format, and supports Unicode font encoding. Phon is being made freely available to the community as open-source software. It meets specific needs related to the study of first language phonological development (including babbling), second language acquisition, and speech disorders. Phon will facilitate data exchange among researchers and the construction of a shared PhonBank database, another new initiative within CHILDES to support methodological and empirical needs of research in all areas of phonological development.

Contact: Brian MacWhinney

SLT4ID: Speech and Language Technologies for International Development

In underserved communities around the world, spoken language systems are potentially more natural, cheaper to deploy/maintain/upgrade, and place fewer requirements (such as literacy) on the user than traditional PC/GUI-based systems, while still offering valuable services such as information access and information sharing. In our first project in this domain we created and tested a speech user interface for accessing health information resources by low-literate community health care workers in rural Sindh province, Pakistan. Our current project, Polly, is a telephone-based dialog system for reaching low-literate populations via a voice-based game, then providing them with development-related voice-based services. In less than a year the system spread virally to over 165,000 users all over Pakistan, who engaged in over 630,000 calls. For more information, see http://www.cs.cmu.edu/~Polly/.

Contact: Roni Rosenfeld


Sphinx

The Sphinx project is an umbrella for research in basic speech technologies. Current activities include systems for real-time recognition (PocketSphinx) and for multi-modal interaction (KinectOly). Research projects include multi-party conversation management, learning through spoken language, long-term user adaptation and the development of human-robot interfaces (such as avatars). Research in speech recognition includes out-of-vocabulary (OOV) word detection and representation, and the use of conversational structure to enhance spoken term detection. The Sphinx recognition code-base and the Olympus dialog code-base are open-source and are used by a number of projects in the LTI, elsewhere in the University, and by a large number of other sites.

Contact: Alex Rudnicky


Speech Processing Interactive Creation and Evaluation Toolkit

Speech technology potentially allows everyone to participate in today's information revolution and can bridge the language barrier gap. Unfortunately, construction of speech processing systems requires significant resources. With some 4,500-6,000 languages in the world, speech processing has traditionally been prohibitively expensive for all but the most economically viable languages. In spite of recent improvements in speech processing, supporting new languages is a skilled job requiring significant effort from trained individuals. This project aims to overcome both limitations by providing innovative methods and tools for naive users to develop speech processing models, collect appropriate data to build these models, and evaluate the results, allowing iterative improvements. By integrating speech recognition and synthesis technologies into an interactive language creation and evaluation toolkit usable by unskilled users, speech system generation will be revolutionized. Data and components for new languages will become available to everybody, improving mutual understanding and the educational and cultural exchange between the U.S. and other countries.

Contacts: Alan W Black and Tanja Schultz


Flexible voice synthesis through articulatory voice transformation

We have always wanted our machines to talk to us, but most people have strong preferences for particular voices. Current techniques in speech synthesis can build voices that sound very close to the original speaker, capturing the style, manner and articulation of the source voice. However such systems require many hours of carefully recorded speech and expert tuning to reach an acceptable level of quality.

An exciting new alternative method for building synthetic voices is voice transformation. Here we use an existing recorded database and convert it to a target voice using as little as 10-20 sentences. These techniques offer the potential to make speech synthesizers talk in whatever voice we desire, with significantly less effort required than previous techniques.

This project offers a new direction in voice transformation. Current transformation techniques concentrate on a spectral mapping of the voice, i.e. converting the properties of the speech signal. Instead we can use the underlying positions of the vocal tract articulators (i.e. the position of the teeth, tongue, lips, velum) which give rise to the spectral output of the voice.

Using new statistical modeling techniques, we can successfully predict the positions of a speaker's articulators from the speech signal. We can then map between speakers in this virtual vocal tract domain and regenerate the speech in the target voice.

This work enables the easy construction of new synthetic voices, allowing personalization of speech output. It increases our knowledge of the speech generation process and characterizes what makes a voice personal.

Contact: Alan W Black

Also see: Fluency

Information Retrieval

Federated, Vertical, and Distributed Search

Large-scale search portals often provide access to a large collection of underlying search engines, for example, to integrate diverse content such as web, news, and image content; to integrate information obtained from different providers; or to distribute a massive search index across a computer cluster. Our research develops new ways to partition massive indexes into specialized search engines; select which specialized search engines are the best match for an individual query; and merge results returned by different specialized search engines into a single, coherent set of results. It enables principled integration of diverse resources into a single, integrated portal; and improves the speed of large-scale search by an order of magnitude using modest computational resources.
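The select-then-merge pipeline described above can be sketched in a few lines. The engine statistics, the overlap-based selection score, and the CombSUM-style score merge below are illustrative stand-ins for the project's actual resource-selection and result-merging models, which are far more sophisticated.

```python
def select_engines(query_terms, engine_term_stats, k=2):
    """Rank engines by how well their term statistics match the
    query (a crude stand-in for resource-selection models)."""
    scores = {}
    for engine, term_counts in engine_term_stats.items():
        scores[engine] = sum(term_counts.get(t, 0) for t in query_terms)
    return sorted(scores, key=scores.get, reverse=True)[:k]

def merge_results(result_lists):
    """CombSUM-style merge: sum per-engine scores for each document
    after normalizing each list by its top score."""
    merged = {}
    for results in result_lists:
        top = max(score for _, score in results)
        for doc, score in results:
            merged[doc] = merged.get(doc, 0.0) + score / top
    return sorted(merged, key=merged.get, reverse=True)

# Example: a query about "goal" is routed to the sports and image
# engines, and two result lists are merged into one ranking.
stats = {"news": {"election": 40, "goal": 1},
         "sports": {"goal": 50, "election": 2},
         "images": {"goal": 3}}
chosen = select_engines(["goal"], stats, k=2)
ranking = merge_results([[("d1", 10), ("d2", 5)], [("d2", 4), ("d3", 2)]])
```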

Contact: Jamie Callan

Large-Scale Hierarchical Classification

Using classification to provide organizational views of large data has become increasingly important in the Big-Data Era. For instance, Wikipedia articles are indexed using over 600,000 categories in a dependency graph. Jointly optimizing all the classifiers (one per node) in such a large graph or hierarchy presents significant challenges for structured learning. Our research develops new statistical learning frameworks and scalable algorithms that successfully solve joint optimization problems with over one trillion model parameters (4 TB of storage) to produce state-of-the-art effectiveness in international benchmark evaluations for large scale classification.
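The one-classifier-per-node idea can be illustrated with a toy top-down routing scheme: each internal node has a local classifier that picks the best child category, and a document is routed root-to-leaf. The category tree, keyword "classifiers", and documents below are invented placeholders for the trained models that the project jointly optimizes.

```python
# Hypothetical category hierarchy: node -> list of children.
TREE = {
    "root": ["science", "arts"],
    "science": ["physics", "biology"],
    "arts": [], "physics": [], "biology": [],
}

# Stand-in "classifiers": each category scores a document by
# keyword overlap instead of a learned model.
KEYWORDS = {
    "science": {"experiment", "cell", "quantum"},
    "arts": {"painting", "opera"},
    "physics": {"quantum", "particle"},
    "biology": {"cell", "gene"},
}

def classify(doc_words, node="root"):
    """Route a bag of words down the tree, one local decision per node."""
    children = TREE[node]
    if not children:
        return node
    best = max(children, key=lambda c: len(KEYWORDS[c] & doc_words))
    return classify(doc_words, best)
```

The appeal of top-down routing is that each query touches only one path of classifiers rather than all 600,000+ categories; the joint-optimization challenge the paragraph mentions comes from training those local models so that early routing mistakes do not cascade.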

Contact: Yiming Yang

Lemur Project

The Lemur Project is a collaboration between researchers at the LTI and the University of Massachusetts to provide state-of-the-art software, datasets, and search services that support research by a broad, international community. Indri and Galago are extensible open-source search engines that provide powerful query languages; state-of-the-art retrieval models; indexing support for metadata, text annotations, and multiple text representations; and indexes capable of storing more than a billion documents. ClueWeb09 and ClueWeb12 are among the largest and most widely-used web document collections. Each year, dozens of papers at the leading IR conferences report on research that was conducted using software, datasets, and services provided by the Lemur Project.

Contact: Jamie Callan

Multi-Field Hierarchical Discovery and Tracking

Modeling information dynamics at different levels of granularity is an open challenge. We are developing new Bayesian von Mises-Fisher topical clustering techniques, including hierarchical and dynamic models that outperform existing methods and scale to large data. Our approach consists of multi-field graphical models for correlated latent topics, semi-supervised topology learning, metric learning, transfer learning and temporal trend modeling. We evaluate on large datasets of scientific literature (Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, Statistics, etc.), as well as news-story collections.
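For intuition, the hard-assignment analogue of a von Mises-Fisher mixture is spherical k-means: documents live on the unit sphere, each is assigned to its most-cosine-similar centroid, and each centroid is the re-normalized mean of its cluster. This minimal sketch is only that simplified analogue, not the project's Bayesian, hierarchical, or dynamic models.

```python
import math
import random

def normalize(v):
    """Project a vector onto the unit sphere."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def spherical_kmeans(points, k, iters=20, seed=0):
    """Cluster unit vectors by cosine similarity: assign each point
    to its nearest centroid, then re-normalize each cluster mean."""
    rng = random.Random(seed)
    points = [normalize(p) for p in points]
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            sims = [sum(a * b for a, b in zip(p, c)) for c in centroids]
            clusters[sims.index(max(sims))].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties
                mean = [sum(xs) for xs in zip(*members)]
                centroids[i] = normalize(mean)
    return centroids, clusters

# Two tight directions on the circle separate into two clusters.
cents, clus = spherical_kmeans([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]], 2)
```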

Contact: Yiming Yang

SIDE: Summarization Integrated Development Environment

In this project, we are developing a configurable summarization environment that uses multi-level analyses of discourse to support a new generation of summarization technology addressing a variety of information management tasks.

SIDE is an infrastructure that facilitates construction of summaries tailored to the needs of the user. It aims to address the issue that there is no such thing as the perfect summary for all purposes. Rather, the quality of a summary is subjective, task dependent, and possibly specific to a user. The SIDE framework allows users flexibility in determining what they find more useful in a summary, both in terms of structure and content. In recent work we have begun to explore statistical approaches to text compression that utilize syntactic dependency features and discourse level features to achieve a higher level of fluency at more severe levels of compression. Our near future plans include exploring the idea of text simplification. An important application area is rapid prototyping of reporting interfaces for on-line discussion facilitators.

Also see: TagHelper 2.0

Contact: Carolyn Rosé

Knowledge Representation, Reasoning, and Acquisition

Better Social Media Maps

Mapping social media messages can provide a geographic overview of an event. Most geographic Twitter maps are generated from GPS-device locations or user-registered location metadata. But such metadata are sparse. Our research creates an algorithm that mines locations mentioned within social media messages. The resulting map is potentially more versatile than those that rely on sparse metadata alone. The map may have more messages, and more reliability due to message overlap; it will indicate which locations are most prominent due to overlap of message locations, and it will be precise to the level of streets and buildings within a city when such locations are found in messages.
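The simplest form of mining locations from message text is a gazetteer lookup, sketched below. The place names, coordinates, and matching rule are made-up examples; the project's actual algorithm must additionally handle ambiguity, misspellings, and evidence pooled across overlapping messages.

```python
# Hypothetical gazetteer: place name -> (latitude, longitude).
GAZETTEER = {
    "forbes avenue": (40.4443, -79.9436),
    "pittsburgh": (40.4406, -79.9959),
    "oakland": (40.4420, -79.9530),
}

def extract_locations(message):
    """Return (place, coordinates) pairs whose names occur in the text."""
    text = message.lower()
    return [(place, coords) for place, coords in GAZETTEER.items()
            if place in text]

found = extract_locations("Flooding on Forbes Avenue in Oakland tonight")
```

Each extracted pair can then be plotted directly, which is how text-mined locations yield denser, street-level maps than GPS metadata alone.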

Contact: Judith Gelernter


Symbolic Knowledge Base

Scone is a high-performance, open-source knowledge-base (KB) system intended for use as a component in many software applications. Scone was specifically designed to support natural language understanding and generation, so our emphasis has been on efficiency, scalability (up to millions of entities and statements), and ease of adding new knowledge – not on theorem-proving or solving logic puzzles. At the LTI, Scone has improved the performance of search engines and document classifiers through the use of background knowledge (which disambiguates references in text and provides synonyms and related words to help the engines and classifiers). The system has also been used to extract events and time relations from free-text recipes, to model the belief states and motivations of characters in children's stories, and to extract meaning from very informal, ungrammatical text and speech. Our long-term goal is to use Scone as the foundation for a true natural-language understanding system and also, to develop a very flexible system for planning and reasoning about actions – applying Scone as the representation engine.

Contact: Scott Fahlman

Language Technologies for Education

Cycle Talk

Dialogue technology for supporting simulation based learning

In the CycleTalk project, we are developing a new collaborative learning support approach that makes use of tutorial dialogue technology in the context of a collaborative simulation based learning environment for college level thermodynamics instruction.

In order to encourage productive patterns of collaborative discourse, we are using language technologies to develop an infrastructure for scaffolding the interactions between students in computer supported collaborative learning environments, to help coordinate their communication, and to encourage deep thinking and reflection. Students who work with a partner using this support learn 1.25 standard deviations more than their counterparts working individually in the same environment without the collaboration support, where 1 standard deviation translates into 1 full letter grade. An important part of this work is dialogue technology capable of interacting with groups of humans that is designed to draw out reflection and engage students in directed lines of reasoning. This work builds on previous results demonstrating the effectiveness of these tutorial dialogue agents for supporting learning of individuals working alone in this domain.

Contact: Carolyn Rosé

Digital Bridges

Online education fostering partnerships for research and teaching

The key idea of the Digital Bridges project is to extend existing resources for technology-based education to create a vibrant environment for globalized instructional support and collaborative professional development.

Previous pilot efforts towards encouraging global teaching partnerships have been very high-effort, niche partnerships. This project is unique in that it brings together expertise in technology-related fields such as Artificial Intelligence, Machine Learning, Language Technologies, Robotics, and Computer Supported Collaborative Learning, with expertise in international development to propose a much more scalable, organic option that would complement such efforts. Existing resources developed in the team’s prior research such as state-of-the-art technology for supporting highly effective group learning provide both the infrastructure for the global professional development effort as well as one of the resources participating instructors can use. The technology for computer supported collaborative learning that we begin with has already been tested and proven successful on a small scale with multiple age groups (middle school, high school, and college aged students), multiple domains (psychology, earth sciences, mechanical engineering, and math), and multiple cultures (students in the U.S. and students in Taiwan). Thus, it has proven itself ready for testing in this more challenging, diverse, global on-line environment. The learning sciences principles that have provided the theoretical framework for its development have largely come from research conducted in the U.S. and in Europe. Thus, the proposed research provides the opportunity for testing the generality of findings from educational research primarily conducted in the U.S. and in Europe in the developing world, beginning with a pilot effort in collaboration with IIT Guwahati.

Contact: Carolyn Rosé


Foreign language accent correction

Fluency uses speech recognition (SPHINX II) to help users perfect their accents in a foreign language. The system detects pronunciation errors, such as duration mistakes and incorrect phones, and offers visual and aural suggestions as to how to correct them. Users can also listen to themselves and to a native speaker.

Contact: Maxine Eskenazi

The Intelligent Writing Tutor (IWT)

The Intelligent Writing Tutor (IWT) project for ESL learners explores the issue of transfer and long-term retention of acquired knowledge, as part of the PSLC's underlying goal of developing a theory of robust learning. Through a series of learning experiments, we will look at both positive and negative transfer from a student's native language (L1) to English, the effects of an informed knowledge tracer on learning, and the role of level-appropriate feedback in achieving competency.

Contact: Teruko Mitamura

Mining the Web for Customized Curriculum Planning

With massive quantities of educational materials freely available on the web, the vision of universal education appears within our grasp. General-purpose search engines are insufficient as they do not focus on educational materials, objectives, pre-requisite relations, etc., nor do they stitch together multiple sources to create customized curricula for students' goals and current knowledge. The project focuses on: 1) extracting educational units from diverse web sites and representing them in a large directed graph, whose nodes are content descriptors and whose edges encode pre-requisite and other relations, 2) conducting multi-field topic inference via a new family of graphical models to infer relations among educational units, enriching the graph, and 3) automated curricular planning, focusing on providing sequences of lessons, courses, exercises and other education units for a student to achieve his or her educational goals, conditioned on current skills. The curriculum planner enriches a graph traversal path, with alternate paths, reinforcement options, and conditional branches.
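The core of step 3 can be sketched as a walk over the prerequisite graph: emit a study sequence that reaches a goal unit while skipping skills the student already has. The unit names below are illustrative, and this sketch omits the alternate paths and conditional branches the full planner produces.

```python
def plan_curriculum(prereqs, goal, known=frozenset()):
    """Post-order walk over a prerequisite DAG (unit -> prerequisites),
    so every unit appears after the units it depends on."""
    plan, seen = [], set(known)

    def visit(unit):
        if unit in seen:  # already planned, or already known
            return
        seen.add(unit)
        for p in prereqs.get(unit, []):
            visit(p)
        plan.append(unit)

    visit(goal)
    return plan

# Hypothetical prerequisite graph for illustration.
prereqs = {"calculus": ["algebra"], "algebra": ["arithmetic"],
           "machine learning": ["calculus", "probability"],
           "probability": ["algebra"]}
plan = plan_curriculum(prereqs, "machine learning", known={"arithmetic"})
```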

Contact: Jaime Carbonell and Yiming Yang

PSLC Fluency Studies

With support from the Pittsburgh Science of Learning Center (PSLC-NSF), we have constructed online tutors for the consolidation of basic skills in second language learners of Chinese and French. These tutors assist with learning Chinese pinyin and correct detection of Chinese segments and tones, acquisition of vocabulary in various languages, practice with the assignment of nominal gender in French, and dictation of French from spoken input. Recent work looks at methods for consolidating fluency in sentence repetition and ways of achieving greater robustness in learning.

Contact: Brian MacWhinney

The REAP Project

Reader-Specific Lexical Practice for Improved Reading Comprehension

The core ideas of the project are i) a search engine that finds text passages satisfying very specific lexical constraints, ii) selecting materials from an open-corpus (the Web), thus satisfying a wide range of student interests and classroom needs, and iii) the ability to model an individual's degree of acquisition and fluency for each word in a constantly-expanding lexicon so as to provide student-specific practice and remediation. This combination enables research on a wide range of reading comprehension topics that were formerly difficult to investigate.
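Idea (i) above, a search that enforces lexical constraints, can be illustrated with a toy passage filter: keep passages that contain enough of the student's current target words while limiting words assumed to be beyond the student's level. The word lists and thresholds are invented; REAP's real system models per-word acquisition for each student.

```python
def passage_ok(passage, target_words, known_words,
               min_targets=2, max_unknown_ratio=0.2):
    """Accept a passage if it practices enough target words and
    keeps the share of presumed-unknown words low."""
    words = passage.lower().split()
    targets_hit = len(target_words & set(words))
    unknown = [w for w in words
               if w not in known_words and w not in target_words]
    return (targets_hit >= min_targets and
            len(unknown) / len(words) <= max_unknown_ratio)
```

In the real system the `known_words` side would come from the student model rather than a fixed list, which is what makes the practice reader-specific.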

Contact: Maxine Eskenazi


Supporting virtual math teams with language technologies

In collaboration with Gerry Stahl and the Math Forum at Drexel University, this project seeks to develop a technological augmentation to available human support in a lightly staffed Virtual Math Teams (VMT) environment as well as deploying conversational agents that are triggered by automatically detected conversational events and that have the ability to elicit valuable collaborative behavior such as reflection, help seeking, and help provision.

Free on-line learning promises to transform the educational landscape of the United States through a significant broadening of supplemental educational opportunities for low income and minority students who do not have access to high quality private tutoring to supplement their in school education. This research attempts to understand how to structure interactions among peer learners in online education environments using language technologies. It seeks to enhance effective participation and learning in the Virtual Math Teams (VMT) online math service, housed in the Math Forum, a major NSF-funded initiative that specifically targets inner-city, low-income minority students, and reaches over a million kids per month with its various services. This will be accomplished by designing, developing, testing, refining and deploying automated interventions to support significantly less expensive but nevertheless highly effective group facilitation. The key research goal is to experimentally learn broadly applicable principles for supporting effective collaborative problem solving by eliciting behavior that is productive for student learning in diverse groups. These principles will be used to optimize the pedagogical effectiveness of the existing VMT-Basilica environment as one example of their concrete realization. The proposed research will yield new knowledge about how characteristics of the on-line VMT environment necessitate adaptation of approaches that have proven successful in lab and classroom studies in order to achieve comparable success in this challenging environment.

Contact: Carolyn Rosé

See also: Project Listen

Dialogue Processing


Assessing design engineering project classes with multi-disciplinary teams

This project brings together an interdisciplinary team with expertise in computer supported collaborative learning, language and information technologies, engineering education, and a variety of specific engineering fields in order to develop an infrastructure for supporting effective group functioning in engineering design project based courses.

The increasing emphasis in engineering education on maintaining the competitive advantage of U.S. engineers requires an understanding of how students learn the higher-order engineering skills of problem-solving and design, and the increasingly rapid technological change requires students to develop sophisticated information management skills so that they can build on and repurpose innovations and discoveries made in previous projects. The key to addressing these knowledge building problems is to develop DesignWebs, an infrastructure that supports effective storage and retrieval of documents as they evolve as part of a collaborative design process, and GRASP, an automatic assessment technology that monitors the well-functioning of design teams through automatic speech processing. The GRASP unobtrusive assessment technology is designed to facilitate the supporting role the instructor can play in the development of team participation, in that it promotes transparency of group work so that instructors are able to identify groups that need more of their attention and involvement.

Contact: Carolyn Rosé


Reconfigurable multi-party dialogue environment

Based on our experiences with designing and engineering multi-party conversational environments such as collaborative learning systems that involve integrating the state of the art in text classification and conversational agent technology, we are developing a framework that facilitates such integration.

The goal of the instructional approach underlying the design of the VMT-Basilica framework is to maximize the benefit students receive from the interactions they have with one another by providing support for learning and effective collaboration in a way that is responsive to what is happening in the interaction in real time. Previous discourse analyses of collaborative conversations reveal that the majority of those interactions between students do not display the “higher order thinking” that collaborative learning is meant to elicit, and we have found this as well in our own observations in lab and classroom studies, both at the college level and at the middle school level. The literature on support for collaborative learning and learning more generally tells us that scaffolding should be faded over time, that over-scripting is detrimental to collaboration, and unnecessary support is demotivating. Thus, a major goal of our research is to address these issues with a framework that allows us to track what is happening in the interaction so that the automatically triggered support interventions can respond to it appropriately. Desiderata of the framework include reusability of component technologies, compatibility with other platforms, and the ability to provide flexibility to system designers to select from a wide range of existing components and then to synchronize, prioritize and coordinate them as desired in a convenient way.

Contact: Carolyn Rosé


Child Language Data Exchange System

The CHILDES Project has focused on the construction of a computerized database for the study of child language acquisition. There are currently 230 corpora in the database from 30 different languages. These corpora are composed of transcripts of spontaneous verbal interactions between young children and their parents, playmates, and teachers. Some of the corpora represent detailed longitudinal studies of single children or small groups of children collected across several years. Others represent cross-sectional studies of larger groups of children recorded less frequently. The total size of the transcript database is 2.0 gigabytes. Many of the transcripts are linked to additional audio and video media files that allow researchers to immediately playback the interactions on the level of individual sentences at any point in the transcript. The project maintains a list of 4000 child language researchers and students who have used the database and has records of over 3000 published articles based on the use of these materials. The project has also constructed a set of computer programs that are useful for conducting research into the various levels of language usage including lexicon, syntax, morphology, phonology, discourse, and narrative.

Contact: Brian MacWhinney

Dynamic Support for Computer Mediated Intercultural Communication

New Integration of theories and methods from the fields of CSCW and CSCL

Today, people connect with others from around the world in chatrooms, discussion lists, blogs, virtual game communities and other Internet locales. In the work domain, firms are increasingly taking advantage of computer-mediated communication (CMC) tools to establish global teams with members from a diverse set of nations. In education, schools are implementing virtual campuses and immersing students in other cultures. Bridging nations via technology does not, however, guarantee that the cultures of the nations involved are similarly bridged. Mismatches in social conventions, work styles, power relationships and conversational norms can lead to misunderstandings that negatively affect the interaction, relationships among team members, and ultimately the quality of group work. This project seeks to offer a novel, dynamic approach to promoting intercultural communication, adapted from the field of Computer-Supported Collaborative Learning (CSCL) that relies on context sensitive interventions triggered on an as-needed basis. Specifically, the proposed work focuses on communication problems related to what has been called transactivity, or the extent to which messages in a conversation build on one another in appropriate ways. In the CSCL literature, this communication-oriented approach has been used to tailor interventions for on-line collaborative learning dialogues. The proposed work extends this approach to the problem of intercultural communication by (a) identifying and categorizing the types of problems that arise in intercultural dialogues and delineating how these problems impact subjective and objective group outcomes; (b) applying machine learning techniques to coded dialogues with the aim of automatically recognizing when problems arise (or are likely to arise) in an intercultural conversation; and (c) developing and testing interventions to improve intercultural communication that can be triggered by this automatic analysis. 
These goals are addressed by a combination of laboratory studies of intercultural CMC and machine learning research.
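
To give a concrete flavor of the transactivity idea behind aim (b): one crude proxy for whether a message builds on the previous one is lexical overlap between consecutive turns. The project's actual detectors are learned from coded dialogues; the stopword list, threshold, and example turns below are assumptions for illustration only.

```python
def content_words(message, stopwords=frozenset({"the", "a", "an", "of",
                                                "to", "and", "is", "i", "you"})):
    """Lowercase tokens with punctuation stripped, minus a tiny stopword list."""
    return {w.strip(".,!?").lower() for w in message.split()} - stopwords

def builds_on(prev_msg, curr_msg, threshold=0.2):
    """Crude transactivity proxy: Jaccard overlap of content words
    between consecutive turns. Real systems learn this from coded data."""
    p, c = content_words(prev_msg), content_words(curr_msg)
    if not p or not c:
        return False
    return len(p & c) / len(p | c) >= threshold

dialogue = [
    "The force on the block depends on the incline angle.",
    "Right, and the angle also changes the normal force.",
    "What time is the meeting tomorrow?",
]
# The second turn builds on the first; the third is a topic break.
flags = [builds_on(a, b) for a, b in zip(dialogue, dialogue[1:])]  # [True, False]
```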

Contact: Carolyn Rosé


RavenClaw

RavenClaw is an advanced architecture for dialogue management, based on a dynamic representation that captures knowledge about the tasks that humans perform in given domains. It is the basis for a number of dialogue system projects at the LTI and elsewhere at Carnegie Mellon. Current dialogue management research centers on two topics: (1) developing techniques for self-awareness that allow systems to adaptively detect and recover from misunderstandings, and (2) automating the configuration of dialogue systems by inferring task and dialogue structure from human-human interactions in limited domains.

Contact: Alex Rudnicky

TagHelper Tools

Tools for machine learning with text

This project provides a basic resource for researchers who use text processing technology in their work or want to learn about text mining at a basic level. It has been used by a wide range of researchers in fields as diverse as law, medicine, the social sciences, education, architecture, and civil engineering. It has also been used as a teaching tool in a variety of courses at Carnegie Mellon University and other universities. A specific goal of our research is to develop text classification technology to address concerns specific to classifying sentences using coding schemes developed for behavioral research, especially in the area of computer-supported collaborative learning. A particular focus of our work is developing text classification technology that performs well on highly skewed data sets, which is an active area of machine learning research. Another important problem is avoiding overfitting to idiosyncratic features on non-IID datasets. TagHelper Tools has been downloaded over a thousand times in the past 18 months.
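
Why skew matters here: on a heavily skewed coding scheme, raw accuracy rewards a classifier that always predicts the majority code, so chance-corrected agreement measures such as Cohen's kappa are typically reported instead. A self-contained sketch (the codes and counts below are invented for illustration):

```python
from collections import Counter

def cohens_kappa(gold, pred):
    """Chance-corrected agreement between gold codes and predicted codes."""
    assert len(gold) == len(pred)
    n = len(gold)
    observed = sum(g == p for g, p in zip(gold, pred)) / n
    gc, pc = Counter(gold), Counter(pred)
    expected = sum(gc[c] * pc[c] for c in gc) / (n * n)
    return (observed - expected) / (1 - expected)

# 90% of sentences carry the majority code "OTHER": a classifier that
# always predicts "OTHER" scores 90% accuracy, yet kappa = 0.
gold = ["OTHER"] * 9 + ["TRANSACTIVE"]
majority = ["OTHER"] * 10
```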

Contact: Carolyn Rosé


TalkBank

TalkBank is an interdisciplinary project designed to create an openly available database of recordings and transcriptions of spoken language interactions. It is composed of a series of topic-specific databases for particular research areas, including classroom discourse (ClassBank), aphasia (AphasiaBank), Conversation Analysis (CABank), the Supreme Court (SCOTUS), bilingualism (BilingBank), second language learning (SLABank), dementia (DementiaBank), child language (CHILDES), and five other more specific topic areas. All materials in TalkBank databases are transcribed in the CHAT format, as specified by the CHAT XML schema, and media are linked to transcripts at the sentence level.
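
A rough sketch of what CHAT-style transcripts look like and how one might pull speaker/utterance pairs out of the main tiers. This covers only a minimal subset of the format; real CHAT files also carry headers, dependent tiers such as %mor, and the media links mentioned above, all of which this ignores:

```python
sample = """@Begin
@Participants: CHI Target_Child, MOT Mother
*CHI: more cookie .
%mor: qn|more n|cookie .
*MOT: you want another cookie ?
@End"""

def main_tiers(chat_text):
    """Yield (speaker, utterance) from CHAT main lines (those starting with '*')."""
    for line in chat_text.splitlines():
        if line.startswith("*"):
            speaker, _, utterance = line[1:].partition(":")
            yield speaker.strip(), utterance.strip()

turns = list(main_tiers(sample))
# [('CHI', 'more cookie .'), ('MOT', 'you want another cookie ?')]
```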

Contact: Brian MacWhinney


Dialogue systems mostly involve one human talking to one machine. What about interaction among the members of a human-robot team? This project focuses on two specific research issues: the management of multi-participant dialogues (touching on issues such as turn-taking), and the development of grounding strategies that allow humans and robots to agree on mutually understandable descriptions of objects and actions in the context of a treasure hunt.

Contact: Alex Rudnicky

Computational Biology [top]

Computational Discovery of Protein-Protein Interactions in the Mental Health and Inflammation Interactome

The objective of this research is to develop systematically designed computational algorithms to discover the human mental health and inflammation (MHAIN) interactome. The MHAIN interactome is the network of protein-protein interactions (PPIs) in which at least one of the two proteins is involved in either the brain or inflammation. The PPI network can help us understand the molecular mechanisms behind the mind-body processes of mental health (happiness and tranquility) and disease (depression, psychosis, etc.), and behind the mutual modulation of psychological, neurological, and inflammatory processes. Today, PPI network information is incomplete, with nearly 90% of interactions unknown. There is a critical need to discover the PPIs involved in brain and behavior and to apply systems biology methods to study these processes. We employ proactive learning and transfer learning to discover hitherto unknown interactions, and mine the network to generate hypotheses about system-level interconnections.
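
One classical way to generate candidate-interaction hypotheses from a partial PPI network is to look for unconnected protein pairs that share interaction partners (triadic closure). The toy network below uses real protein names but invented edges for illustration; the project's actual methods (proactive and transfer learning) are far richer than this sketch:

```python
from itertools import combinations

def candidate_interactions(known_edges):
    """Propose unobserved protein pairs that share at least one interaction
    partner, scored by the number of shared partners."""
    neighbors = {}
    for a, b in known_edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    known = {frozenset(e) for e in known_edges}
    candidates = {}
    for p, q in combinations(sorted(neighbors), 2):
        if frozenset((p, q)) in known:
            continue  # already observed, not a hypothesis
        shared = neighbors[p] & neighbors[q]
        if shared:
            candidates[(p, q)] = len(shared)
    return candidates

edges = [("IL6", "STAT3"), ("IL6", "JAK2"), ("STAT3", "JAK2"), ("BDNF", "STAT3")]
cands = candidate_interactions(edges)
# proposes ('BDNF', 'IL6') and ('BDNF', 'JAK2'), each via shared partner STAT3
```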

Contact: Madhavi Ganapathiraju (Assistant Professor, University of Pittsburgh)

Predicting Host-Pathogen Protein-Protein Interactions

Protein-protein interactions are a key mechanism in cellular physiology and also in pathogen infections of hosts, such as viral or bacterial diseases in humans. This project pushes the limits of machine learning on sparse data, including active learning and new transfer learning techniques, to predict host-pathogen interactomes. It leverages known interactions from better-studied pathogens to infer potential interactions in less-well-studied ones, resulting in improved prediction accuracy.

Contacts: Jaime Carbonell and Judith Klein-Seetharaman

Viruses, Vaccines, and Digital Life

Viruses are the simplest known self-replicating computational systems. They also happen to be among the leading emerging threats to humanity in the 21st century. Fortunately, the new understanding of life in general, and viruses in particular, as digital programs opens the door to computational methods of defending against these threats. In collaboration with epidemiologists, virologists, and others at the University of Pittsburgh's School of Public Health, we computationally model epidemic spread, create epidemic forecasts, and try to better understand the evolution of influenza, a critical step toward the development of a universal flu vaccine.

Contact: Roni Rosenfeld

Natural Language Processing / Computational Linguistics [top]

Big Multilinguality for Data-Driven Lexical Semantics

A key challenge in natural language processing is defining the computational representation of words. Data-driven distributional approaches use corpora to induce vector-space representations for words, based on the contexts they occur in. This project goes beyond traditional approaches (e.g., latent semantic analysis; Deerwester et al., 1990), which define a word's context by the words that tend to occur near it in corpora, by extending the types of contexts used in constructing semantic vectors. First, this project incorporates translation contexts, i.e., words readily available in multilingual parallel corpora, alongside traditional monolingual corpora. This allows evidence-sharing across languages, most importantly from resource-rich languages with large corpora to more resource-poor languages. Second, this project incorporates social context inferable from social network platforms, captured through author, time, geographic, and social connection metadata. Taken together, these additional features give a broader definition of a word's context and lead to a more unified distributional model of human language, moving in the direction of a language-independent semantics.
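
The core distributional step can be sketched in a few lines: count the words occurring within a window of each target word, then compare the resulting context vectors. The corpus and window size below are toy assumptions; the project's translation and social contexts would simply contribute additional dimensions to the same kind of vector.

```python
from collections import Counter
from math import sqrt

def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vecs = {}
    for s in sentences:
        toks = s.lower().split()
        for i, w in enumerate(toks):
            ctx = toks[max(0, i - window):i] + toks[i + 1:i + 1 + window]
            vecs.setdefault(w, Counter()).update(ctx)
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "stocks fell sharply on tuesday",
]
vecs = context_vectors(corpus)
# 'cat' and 'dog' share contexts ('the', 'sat', 'on'), so their vectors end up
# closer to each other than either is to 'stocks'.
```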

Contacts: Noah Smith and Chris Dyer

Data-Driven, Computational Models for Discovery and Analysis of Framing

This project studies framing, a central concept in political communication that refers to portraying an issue from one perspective with corresponding de-emphasis of competing perspectives. Framing is known to significantly influence public attitudes toward policy issues and policy outcomes. As social media allow greater citizen engagement in political discourse, scientific study of the political world requires reliable analysis of how issues are framed, not only by traditional media and elites but by citizens participating in public discourse. Yet conventional content analysis for frame discovery and classification is complex and labor-intensive. Additionally, existing methods are ill-equipped to capture those many instances when one frame evolves into another frame over time. This project therefore develops new computational modeling methods, grounded in data-driven computational linguistics, aimed at improving the scientific understanding of how issues are framed by political elites, the media, and the public.

Contact: Noah Smith

Flexible Learning for Natural Language Processing

Statistical learning is now central to natural language processing (NLP). Bridging the gap between learning and linguistic representation requires going beyond learning parameters. This CAREER project addresses three challenging, unresolved questions: (1) Given recent advances in learning the parameters of linguistic models and in approximate inference, how can the process of feature design be automated? (2) Given that NLP tasks are often defined without recourse to real applications and that a specific annotated dataset is unlikely to fulfill the needs of multiple NLP projects, can learning frameworks be extended to perform automatic task refinement, simplifying a linguistic analysis task to obtain more consistent, more precise, or faster performance? (3) Can computational models of language take into account the non-text context in which our linguistic data are embedded? Building on recent success in social text analysis and text-driven forecasting, this CAREER project seeks to exploit context to refine models of linguistic structure while enabling advances in this application area.

Contact: Noah Smith

FUDG Framework for Syntactic Annotation

FUDG (Fragmentary Unlabeled Dependency Grammar) is a formalism that offers a flexible way to describe the syntactic structure of text. Beyond the traditional view of dependency syntax in which the tokens of a sentence form nodes in a tree, FUDG allows for a distinction between nodes and lexical items (which may be multiword expressions); provides special devices for coordination and coreference; and facilitates underspecified (partial) annotations where producing a complete parse would be difficult. GFL (Graph Fragment Language) is an ASCII-based encoding of unlabeled dependency annotations in the FUDG formalism.
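A fragmentary annotation can be modeled as a partial set of head-dependent edges over nodes, where a node may span a multiword expression and some tokens may be left unattached. The tiny data structure below is an illustrative assumption of this paragraph's ideas, not the actual FUDG/GFL encoding:

```python
class FragmentaryParse:
    """Partial dependency annotation: some edges asserted, the rest left open."""
    def __init__(self, tokens):
        self.tokens = tokens
        self.nodes = {t: (t,) for t in tokens}   # node name -> token span
        self.edges = set()                        # (head_node, dependent_node)

    def mwe(self, name, span):
        """Group several tokens into one multiword-expression node."""
        self.nodes[name] = tuple(span)
        for t in span:
            self.nodes.pop(t, None)

    def attach(self, head, dep):
        """Assert a single head-dependent edge."""
        self.edges.add((head, dep))

    def is_complete(self):
        """A parse is complete iff every node except one (the root) has a head."""
        headed = {d for _, d in self.edges}
        return len(self.nodes) - len(headed) == 1

p = FragmentaryParse(["She", "gave", "up", "smoking"])
p.mwe("gave_up", ["gave", "up"])   # 'gave up' acts as a single lexical node
p.attach("gave_up", "She")         # assert one edge; 'smoking' stays unattached
```

A partial annotation like this is valid on its own; attaching the remaining token would make the parse complete.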

Contacts: Noah Smith and Chris Dyer


This project focuses on the construction of grammatical relations taggers for the English, Spanish, Japanese, and Hebrew data in the CHILDES database.

Contact: Brian MacWhinney

Usable Privacy Policy Project

Natural language privacy policies have become a de facto standard to address expectations of "notice and choice" on the Web. Yet, there is ample evidence that users generally do not read these policies and that those who occasionally do struggle to understand what they read. Initiatives aimed at addressing this problem through the development of machine-implementable standards or other solutions that require website operators to adhere to more stringent requirements have run into obstacles, with many website operators showing reluctance to commit to anything more than what they currently do. This frontier project builds on recent advances in natural language processing, privacy preference modeling, crowdsourcing, formal methods, and privacy interfaces to overcome this situation. It combines fundamental research with the development of scalable technologies to: (1) semi-automatically extract key privacy policy features from natural language website privacy policies, and (2) present these features to users in an easy-to-digest format that enables them to make more informed privacy decisions as they interact with different websites. As such, this project offers the prospect of overcoming the limitations of current natural language privacy policies without imposing new requirements on website operators.

Contact: Noah Smith

Information Extraction, Summarization, and Question Answering [top]

Combating Human Trafficking using Natural Language Processing

The trafficking of people into and within the USA is a nefarious and pervasive problem. Many forms of trafficking (cross-border smuggling, child sex work, involuntary farm and factory labor) leave some trace in the infosphere, and wide-scale information extraction and distillation can help detect them. In addition, efforts by law enforcement and other government agencies to combat trafficking provide a wealth of statistics about apprehensions, success rates, and so on, which can be mined for patterns and correlations with external events. This funding supports a small amount of initial analysis and reporting to assist government efforts and to locate areas in which more elaborate research is likely to be worth increased investment.

Contact: Eduard Hovy

DHS Analytics 5

Pilot Studies in Data Extraction and Analysis for DHS

This work investigates innovative ways in which data analytics can be brought to bear on problems experienced by government agencies working to ensure the safety and security of the public. A wide variety of pilot projects have been undertaken; when initial success is found, a specific project splits off to form a larger independent effort. In one pilot we investigated the use of language technology to find evidence of human trafficking, including forced underage work, and to compile, track, and channel this evidence to the appropriate agencies. In another, we applied text mining to harvest a comprehensive compilation of challenges to cybersecurity (such as worms, viruses, and DDoS attacks) along with their remedies. We are also looking at social media such as Twitter, treating it as an extensive distributed sensor network, in order to identify anomalous events and track their evolution.

Contact: Eduard Hovy


Open-Domain Question Answering

Typical IR systems return a set of documents, or perhaps a set of queries. LTI Question Answering software extracts information from documents in large, open-domain corpora to answer questions in subject areas that are not known in advance.

Contacts: Eric Nyberg and Teruko Mitamura

SAFT: Deep Reading through the Semantic Analysis and Filtering of Text

Most language technology deals with large volumes of text and overcomes gaps, omissions, and idiosyncratic phrasing in any one text by finding semantically equivalent material in other texts. But when the challenge is to interpret a single text, or a small set, deeper reading of each sentence is required. This project investigates various aspects of such deeper reading, including: semantic frame-based interpretation, using the SEMAFOR parser (developed by Noah Smith and students); the detection of full and partial event and entity coreference (Eduard Hovy, Teruko Mitamura, and students); relation harvesting to fill frames (William Cohen and students); information extraction and inference (Hans Chalupsky); gap-filling, expectation postulation, and various kinds of support from large background knowledge (Eduard Hovy and students); and novelty and anomaly detection (Jaime Carbonell, Yiming Yang, and students).

Contact: Eduard Hovy

Other, Interdisciplinary Projects [top]


AphasiaBank

The AphasiaBank Project focuses on the construction of a computerized database for the study of language processing in aphasia. A consortium of 60 researchers has developed a shared methodological and conceptual framework for the processes of recording, transcription, coding, analysis, and commentary. These methods are based on the TalkBank XML schema and related computational tools for corpus analysis, parsing, and phonological analysis. Our specific aims include protocol standardization, database development, analysis customization, measure development, syndrome classification, qualitative analysis, profiles of recovery processes, and the evaluation of treatment effects.

Contact: Brian MacWhinney


Informedia

The Informedia project aims to understand video and to enable search, visualization, and summarization in both contemporaneous and archival content collections. The core technology combines speech, image, and natural language understanding to automatically transcribe, segment, and index linear video for intelligent search and image retrieval.

Contacts: Howard Wactlar and Alex Hauptmann

Make Fewer Mistakes

Inattentional blindness (IB), or unawareness of peripheral or rare circumstances, is generally a positive adaptation that allows people to finish their daily work without becoming distracted by anomalies. IB becomes harmful when the overlooked events are threatening. People in security, the military, or monitoring tasks, or those who review large sets of numbers or x-rays, for example, attend to the expected, but must also notice anomalies so that they do not overlook the unexpected. Our research should provide insight into the science of IB, while lending evidence to support the design of a system that lessens IB and helps people make fewer mistakes.

Contact: Judith Gelernter

Project Listen

A reading tutor that listens

Project LISTEN's Reading Tutor listens to children read aloud, and then helps them learn to read. This project offers exciting opportunities for interdisciplinary research in speech technologies, cognitive and motivational psychology, human-computer interaction, computational linguistics, artificial intelligence, machine learning, graphic design, and of course reading. Project LISTEN is currently extending its automated Reading Tutor to accelerate children’s progress in fluency, vocabulary, and comprehension.

Contact: Jack Mostow

Read the Web

Can computers learn to read? We think so. This research is developing a computer system that learns over time to read the web. Since January 2010, this computer system called NELL (Never-Ending Language Learner) has been running continuously, 24x7, attempting to perform two tasks each day: First, it attempts to "read," or extract facts from text found in hundreds of millions of web pages (e.g., playsInstrument(George_Harrison, guitar)). Second, it attempts to improve its reading competence, so that tomorrow it can extract more facts from the web, more accurately. As of January 2012, NELL had extracted approximately 15 million beliefs at different levels of confidence, into a diverse ontology containing approximately 700 categories and relations. Of these, NELL had high confidence in roughly 1 million beliefs, approximately 85% of which were correct. NELL is still imperfect, but it continues to learn daily. Follow its progress at http://rtw.ml.cmu.edu.
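
The flavor of NELL's first task, extracting relational facts such as playsInstrument(George_Harrison, guitar) from text, can be illustrated with a single hard-coded lexical pattern. NELL itself learns thousands of such patterns, and their reliabilities, automatically; the pattern and sentences below are illustrative assumptions only:

```python
import re

# One hand-written pattern standing in for a learned extraction pattern.
# It only matches a capitalized name sequence as the subject.
PATTERN = re.compile(r"([A-Z]\w+(?: [A-Z]\w+)*) plays the (\w+)")

def extract_plays_instrument(sentences):
    """Return a set of (relation, subject, object) triples found by the pattern."""
    facts = set()
    for s in sentences:
        for who, what in PATTERN.findall(s):
            facts.add(("playsInstrument", who, what))
    return facts

sents = [
    "George Harrison plays the guitar on most tracks.",
    "Every evening she plays the piano.",   # no capitalized subject: no match
]
facts = extract_plays_instrument(sents)
# {('playsInstrument', 'George Harrison', 'guitar')}
```

A learning reader like NELL would also estimate confidence in each extracted belief and use its growing knowledge base to propose and filter new patterns.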

Contacts: Tom Mitchell and William Cohen

Understanding Cybersecurity

Cybersecurity takes many forms. Some of them are amenable to natural language analysis. In particular, people are often unaware of the risks they face, or even what to do when they encounter fake email, phishing, and so on. We are developing software to harvest, compile, and structure an inventory of online texts, discussions, and other information relating to all aspects of cybersecurity into a single easy-to-navigate portal. This can be used as an educational resource or as a springboard for more in-depth investigation.

Contact: Eduard Hovy

Spinoff Companies [top]


LightSide

LightSide develops tools that instantly evaluate student writing. They give students immediate feedback throughout the writing process, from the first draft through every revision, all the way to an automated assessment delivered to teachers.


Safaba

Safaba develops enterprise machine translation technology for globalization operations, including pre- and post-sales interactions, global customer care, and on-demand access to information across global units.

Older Projects [top]


Real-time analysis of massive structured data

We developed techniques for the indexing of massive incomplete and under-specified data, fast identification of exact and approximate matches among the available data, processing of massive information streams, and real-time identification of both known and surprising patterns.
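
One simple stand-in for the approximate-matching piece: score candidate records against an under-specified query with a string-similarity ratio and keep those above a cutoff. The use of difflib, the 0.8 cutoff, and the sample records are illustrative choices, not the project's actual algorithms:

```python
from difflib import SequenceMatcher

def approximate_matches(query, records, cutoff=0.8):
    """Return (record, score) pairs whose similarity to `query` clears `cutoff`,
    best match first."""
    scored = ((r, SequenceMatcher(None, query.lower(), r.lower()).ratio())
              for r in records)
    return sorted((p for p in scored if p[1] >= cutoff),
                  key=lambda p: p[1], reverse=True)

records = ["Jaime Carbonell", "Jamie Carbonel", "Eugene Fink", "Robert Frederking"]
# A misspelled, under-specified query still surfaces the right records.
hits = approximate_matches("Jaime Carbonel", records)
```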

Contacts: Jaime Carbonell, Eugene Fink and Robert Frederking


In this project, we explored the impact of tutor strategy and example selection on student explanation behavior. The purpose was to identify strategies that make the most productive use of the time students spend with a tutorial dialogue system. We collected a corpus of tutoring dialogues in the calculus domain, in which students discussed worked-out examples (which may or may not have contained an error) with a human tutor. The students reasoned through the worked examples and identified, explained, and corrected errors. As part of this project, we experimented with automatic approaches to corpus analysis, applying and extending approaches used previously for text classification, dialogue act tagging, and automatic essay grading.

Contact: Carolyn Rosé


Dialogue Management System

The CAMMIA project (A Conversational Agent for Multilingual Mobile Information Access) is focused on research and development of a multi-tasking dialog management system that can be used with automatic speech recognition and VoiceXML to provide mobile information access.

Contacts: Teruko Mitamura and Eric Nyberg


RADAR

Machine learning has been developed to the point where it can perform some truly useful tasks. However, much of the learning technology currently available requires extensive 'tuning' in order to work for any particular user, in the context of any particular task.

The focus of the RADAR project was to build a cognitive assistant embodying machine learning technology able to function "in the wild": the technology need not be tuned by experts, and the person using the system need not be trained in any special way. Using the RADAR system itself, for the task for which it was designed, should be enough to allow RADAR to learn to improve its performance.

RADAR was a joint project between SRI International and Carnegie Mellon University and was funded by DARPA.

Contacts: Scott Fahlman and Jaime Carbonell

LTI-related RADAR Components:

Space-Time Planner - Contact: Eugene Fink
Knowledge Representation - Contact: Scott Fahlman
Briefing Assistant - Contact: Alex Rudnicky
NLP/Email - Contact: Eric Nyberg
Summarization - Contact: Alex Rudnicky

RADAR/Space-Time (Subset of RADAR project)

Resource management under uncertainty

We built a system for the automated and semi-automated management of office resources, such as office space and equipment. The related research challenges included the representation of uncertain knowledge about available resources, optimization based on uncertain knowledge, elicitation of more accurate data and user preferences, negotiations for office space, learning of user behavior and planning strategies, and collaboration with human administrators.

Contacts: Eugene Fink and Jaime Carbonell


General-purpose tools for reasoning under uncertainty

We developed general techniques for the representation and analysis of uncertainties in available data, identification of critical uncertainties and missing data, evaluation of their impact on specific conclusions and reasoning tasks, and planning of proactive information gathering.

Contacts: Eugene Fink, Anatole Gershman and Jaime Carbonell


Infrastructure for authoring and experimenting with natural language dialogue in tutoring systems and learning research

The focus of this work was to provide an infrastructure that would allow learning researchers to study dialogue in new ways and for educational technology researchers to quickly build dialogue based help systems for their tutoring systems. At the time of this research, we were entering a new phase in which we as a research community had to continue to improve the effectiveness of basic tutorial dialogue technology while also finding ways to accelerate both the process of investigating the effective use of dialogue as a learning intervention and the development of usable tutorial dialogue systems. We developed a community resource to address all three of these problems on a grand scale, building upon prior work developing both basic dialogue technology and tools for rapid development of running dialogue systems.

Contact: Carolyn Rosé

Language Technologies Institute • 5000 Forbes Ave • Pittsburgh, PA 15213-3891 • (412) 268-6591