LTI Seminar Abstracts
Fall 07 - Spring 08


April 21
Gideon Mann
Google

Generalized Expectation Criteria for Semi-Supervised Learning

Current machine learning systems can be effective when they have sufficient training data. However, human annotation is costly, and it is too expensive to have humans hand-annotate training data for all classification tasks of interest. This dilemma has led to the appeal of semi-supervised learning algorithms, where a small amount of labeled data is augmented by a larger pool of unannotated data. In this talk, I show how generalized expectation (GE) criteria can be used for semi-supervised learning. Unlike traditional semi-supervised learning methods that use conventionally labeled instances as their supervised seed information, GE makes use of labeled features, where individual features are labeled with their correlation with output labels. Experiments with logistic regression and conditional random fields on natural language processing problems demonstrate that training with GE on labeled features, as opposed to traditional supervised training and alternative semi-supervised learning methods, can substantially reduce the amount of time it takes to train high-performance models.

Biography: Gideon Mann is a researcher at Google New York. He received an Sc.B. in Computer Science from Brown University and an M.S. and a Ph.D. in Computer Science from Johns Hopkins University, and worked as a post-doctoral researcher at the University of Massachusetts, Amherst. His current research is in minimally-supervised techniques, particularly for information and fact extraction.

 

April 18
Hiroshi Kanayama
Tokyo Research Laboratory, IBM

ESPER: Extractor of Sentiment and Preference ExpRessions

Sentiment Analysis (SA) is the task of recognizing writers' feelings, as expressed in positive or negative comments, by analyzing unreadably large numbers of documents. SA is becoming a useful tool for the commercial activities of both companies and individual consumers. However, much domain-specific knowledge is required for a deeper understanding of targets, and preparing such a lexicon for each domain is laborious work.

We propose an unsupervised lexicon building method for the detection of polar clauses, which convey positive or negative aspects in a specific domain. The lexical entries to be acquired are called polar atoms, the minimum human-understandable syntactic structures that specify the polarity of clauses. As a clue for obtaining candidate polar atoms, we use context coherency, the tendency for the same polarity to appear successively in context. Using the overall density and precision of coherency in the corpus, the statistical estimation picks appropriate polar atoms from among the candidates, without any manual tuning of threshold values. The experimental results show that the precision of polarity assignment with the automatically acquired lexicon was 94% on average, and that our method is robust across corpora in diverse domains and to the size of the initial lexicon. Though the experiments focus on Japanese text, the basic idea should be applicable to any language.

Biography:
Hiroshi Kanayama is a researcher at Tokyo Research Laboratory, IBM Japan. In 2000 he received a Master's degree from the Graduate School of the University of Tokyo for research on syntactic analysis. At IBM, his research focuses on several types of deep language analysis, including machine translation, sentiment analysis, and components for text mining.

 

April 15
Ronald M. Kaplan
Powerset, Inc.

Powerset: Deep Natural Language Processing for Web-Scale Information Retrieval

Google, Yahoo, and other conventional search engines have been remarkably successful at making vast amounts of information available to ordinary users. They achieve robustness and scale by creating efficient bag-of-words indexes of the terms they extract from unstructured text and by encouraging users to specify their information needs with keywords that are well-suited to bag-of-words retrieval. These methods suffer from errors of both precision and recall. Undesired results are returned because the systems do not index and cannot filter according to the semantic relations that the user has in mind, and desired results are missed because keyword matches cannot identify passages that use different terms and different syntactic constructions to express semantically equivalent concepts.

It is not a novel idea that these precision and recall problems can be addressed in principle by using deep natural language processing to extract underlying semantic concepts and relations both from text and from queries. Powerset is a start-up company that is attempting to address these problems in practice. We are combining fairly mature natural language technologies with carefully tuned indexing and retrieval components to build a web-scale semantic index for a natural language search engine. In this talk I'll point out why search is a particularly good application for natural language processing, outline some of the factors that justify this effort, and describe some of the technologies that make it possible. I'll also show examples of our current system to illustrate some of the strengths (and some of the current weaknesses) of our approach.

Biography:
Ronald M. Kaplan is Chief Scientific Officer at Powerset, Inc. Prior to joining Powerset, he was a Research Fellow at the (Xerox) Palo Alto Research Center where he created and directed the Natural Language Theory and Technology research group. He is also a Consulting Professor in the Linguistics Department at Stanford University and a Principal of Stanford’s Center for the Study of Language and Information.

He received his Ph.D. in Social Psychology from Harvard University, where he investigated how explicit computational models of grammar could be embedded in models of human language performance. He has made many contributions to computational linguistics and linguistic theory. These include the notions of consumer-producer and active-chart parsing, the design of the formal theory of Lexical Functional Grammar and its initial computational implementation, and the mathematical, linguistic, and computational concepts that underlie the use of finite-state phonological and morphological descriptions.

Kaplan is a past President of the Association for Computational Linguistics, a co-recipient of the 1992 Software System Award of the Association for Computing Machinery, and a Fellow of the ACM. He has also been a Fellow-in-Residence at the Netherlands Institute for Advanced Study in the Humanities and Social Sciences. He holds over 30 patents in computational linguistics and related areas. 

 


April 10
Qin Jin
CMU

Robust Speaker Recognition

Automatic speaker recognition has become an increasingly important technology, required by many speech-aided applications. The main challenge for automatic speaker recognition is dealing with the variability of the environments and channels from which the speech was obtained. Research in automatic speaker recognition has focused on telephone speech for several years. However, it has ignored key problems such as far-field speaker recognition, where speech is recorded with distant microphones and in the presence of noise. In this talk, I will present our work to improve speaker identification system performance in far-field scenarios. First, we investigate approaches to improve robustness for traditional speaker recognition systems, which are based on low-level spectral information. We introduce a new reverberation compensation approach which, along with feature warping in the feature processing procedure, improves system performance significantly. We propose four multiple channel combination approaches, which utilize information from multiple far-field microphones, to improve robustness under mismatched training-testing conditions. Secondly, we investigate approaches that use high-level speaker information to improve robustness. We propose new techniques to model speaker pronunciation idiosyncrasy along two dimensions: the cross-stream dimension and the time dimension. Such high-level information is expected to be robust across different channels. Thirdly, we investigate speaker segmentation and clustering, aiming to improve the robustness of speaker recognition as well as automatic speech recognition performance in multiple-speaker scenarios such as conversations and meetings.

 

Biography:
Qin Jin is a post-doc at the Language Technologies Institute, Carnegie Mellon University. Her research interests include robust speaker recognition, human biometrics, automatic speech recognition, and machine learning in general. Qin Jin received her PhD in January 2007 from the Language Technologies Institute, Carnegie Mellon University, and BA and MS degrees from Tsinghua University, P.R. China.

 

 

 

March 21
Robert Frederking
Carnegie Mellon University

NineOneOne: Speech Translation for Spanish 9-1-1 Calls in the U.S. 

In many U.S. localities, when emergency calls are received in languages other than English (primarily Spanish), the dispatching center connects the call to the Language Line human translation service to translate for them. Though using human translators to assist callers during emergency calls might seem optimal, this scheme actually has serious shortcomings. The process is very slow, especially in starting up, and the translators are unfamiliar with the task, resulting in very poor quality service.

We will present NineOneOne, a system being developed here to recognize and translate Spanish emergency calls for better dispatching. The 9-1-1 domain has many research challenges, but we believe it is also a feasible domain for a real-world speech translation application. The domain is challenging because it requires real-time operation, and the recognition and translation of stressed telephone-quality speech in multiple dialects; it is still feasible because we have in-domain data from three 9-1-1 centers, there are strong vocabulary and task constraints, and perfection is not required. This domain also has clear, significant social value, addressing the chronic shortage of translation for Spanish (and in larger cities, a large number of other languages) at U.S. emergency dispatch centers. While we currently are working on Spanish/English, the approaches we use are fundamentally language-independent.

Our initial work has been aimed at demonstrating that we can produce ASR and utterance classification of sufficient quality to allow the development of a practical, limited-domain system. We will describe our initial results, which we believe have achieved that goal. We will also lay out the novel overall design for the system, and our plans for its further development.

Biography:
Robert Frederking is a Senior Systems Scientist at the Language Technologies Institute (LTI) in the School of Computer Science at Carnegie Mellon University, where he is the Chair of the LTI's graduate programs. He managed the Diplomat rapid-deployment, wearable speech translation project, and was the CMU PI for the ensuing Tongues project, which was led by Lockheed Martin, to produce a pocketable speech-to-speech translator for potential field use by the US Army. He was also a member of the Nespole! and Babylon speech translation projects.

March 7
Alex Hauptmann
Carnegie Mellon University

Video Analysis and Semantic Concepts for Multimedia Retrieval

Recent years have seen increasing popularity of video as a shared, searchable medium, with over 100 million daily views of YouTube clips as prime evidence. This talk will cover some of the successes in video retrieval research, focusing on broadcast content as the most extensively studied area. After beginning with a background of audio and video analysis and spoken document retrieval, the talk will discuss some current research issues, foremost among them retrieval based on semantic concepts. One conjecture is that robust semantic concept classification could provide the means to close the semantic gap; that is, to allow a characterization of visual or graphic content through natural language. This leads to an examination of the combination of knowledge sources for retrieval, relevance feedback, and interfaces for retrieval. This talk is based on joint work with numerous members of the Informedia Project over many years.

 

February 29
John Tait
Information Retrieval Facility

Introduction to Patent Retrieval and the IRF

Being granted a valid patent involves ensuring that the invention has not been previously patented or put in the public domain. Patent searching involves indexing and searching existing patent collections, but it also involves searching collections of scientific papers and more general web searching, to determine whether the idea has been previously publicly disclosed. In contrast to information exploration through internet search engines, patent searching involves long and complex queries; session-based queries in which very lengthy periods (sometimes days) are spent refining and reviewing results; and, last but by no means least, very high recall to ensure a critical item has not been missed.

The Information Retrieval Facility (IRF) is a new international not-for-profit organisation based in Vienna, with a distinguished International Scientific Board. It aims to support large scale information retrieval research generally, but more specifically work on patent retrieval and related areas. The facilities and plans of the IRF will be described. 

Biography:
Dr. John Tait is Chief Scientific Officer of the Information Retrieval Facility (IRF), a not-for-profit foundation dedicated to promoting research on large-scale information retrieval. Prior to joining the IRF in 2007, Dr. Tait was Professor of Intelligent Information Systems and Associate Dean of Computing And Technology at the University of Sunderland in Great Britain. He is a past Program Chair of the SIGIR and ECIR conferences, an Associate Editor of ACM Transactions on Information Systems, and author of over 90 refereed conference and journal papers. His current research focuses on problems of retrieving still and moving images, and on patent retrieval.

December 7
Noah Smith
Carnegie Mellon University

Statistical Parsing Triptych: Jeopardy, Morphosyntax, and M-Estimation 

This talk covers three recent advances in statistical parsing: an application, an algorithmic solution, and a learning solution.

The first part of the talk presents our work using quasi-synchronous dependency grammars - an elegant model originally designed for machine translation - in question answering. By modeling loose answer-to-question transformations at the level of bare-bones dependency structure, we achieve notably high performance on a TREC-style answer-selection task (Wang, Smith, and Mitamura, EMNLP-CoNLL 2007).

The second part of the talk turns to parsing algorithms. While much research has been devoted to parsing algorithms for languages that have clear morpheme boundaries (e.g., English), it is not clear what to do when a language displays morphological ambiguity as well. We describe two efficient ways to apply models for morphological and syntactic disambiguation in tandem, giving significant gains on parsing the Hebrew Treebank (Cohen and Smith, EMNLP-CoNLL 2007).

The third part of the talk turns to a learning problem. Since log-linear ("maximum entropy") models were first applied to NLP at IBM in the 1990s, they have been widely used. Training them, however, is very expensive for models of sequences and trees. We present a novel, generative parameter estimation algorithm for log-linear structure models based on a generalization of maximum likelihood estimation called M-estimation. We compare this method to existing learning algorithms on a shallow parsing task (Smith, Vail, and Lafferty, ACL 2007).

Biography:
Noah Smith is Assistant Professor of Language Technologies at Carnegie Mellon University. His research has spanned statistical machine translation, parallel corpus discovery, unsupervised statistical grammar induction, efficient morphological and syntactic processing algorithms, weighted logic programming, and the formal study of weighted grammars. He is a Hertz Fellow (2001-6), the recipient of an IBM Faculty Award (2007), and a member of the DARPA Computer Science Study Panel (2007). 

November 16
Eugene Fink
Carnegie Mellon University

Reasoning Under Uncertainty

The development of a robust AI system for reasoning under uncertainty involves several research challenges, including representation of uncertain data, optimization and inferences based on uncertain knowledge, identification of critical uncertainties, planning of additional data collection, and anticipation of possible future developments.

I will present work on these challenges, and describe the application of the related results in two AI systems. The first system is part of the RADAR architecture, which supports the use of uncertain data in planning and scheduling. The purpose of the second system is hypothesis verification based on massive uncertain data, as well as planning of proactive data gathering.

Biography:
Eugene Fink received the B.S. degree from Mount Allison University (Canada) in 1991, M.S. from the University of Waterloo (Canada) in 1992, and Ph.D. from Carnegie Mellon University in 1999. He was an assistant professor in the Computer Science and Engineering Department at the University of South Florida from 1999 to 2003. He is now a senior systems scientist in the School of Computer Science at Carnegie Mellon University. His research interests are in various aspects of artificial intelligence, including machine learning, planning, problem solving, e-commerce applications, medical applications, and theoretical foundations of artificial intelligence. His interests also include computational geometry and algorithm theory.

November 9
Raul Valdes-Perez
Vivisimo Inc.

Enterprise search: Not Your Kid Sister's Search Engine

Vivisimo, a CMU CSD spinoff from 2000, has evolved from its origins as a web search engine into an enterprise search software vendor, growing fast and competing well against large public companies.

Raul will discuss:

  • CMU origins of Vivisimo and its evolution thereafter
  • Vivisimo's view of post-retrieval navigation: clustering, taxonomies, meta-data navigation, etc.
  • Key ideas that underlie successful document clustering
  • Search 2.0: users teach the search engine
  • What corporate customers value in enterprise search
  • Ongoing technical challenges
  • Some entrepreneurial lessons learned

Biography:
Raul has been CEO of Vivisimo since co-founding it. He was selected as a 1997 Ernst & Young Entrepreneur of the Year, North Central Region. On the CS research faculty during 1991-2001, he worked on AI methods and applications to both data-driven and theory-driven problems in scientific discovery. From 1986-91 he was a CS PhD student at CMU, advised by the late Herbert Simon.
More information at: http://vivisimo.com/docs/valdes.pdf 

November 2
Luke Zettlemoyer
MIT

Learning to Map Sentences to Logical Form

Recent research has focused on the problem of learning to map natural language sentences to semantic representations of their underlying meaning. In this talk, I will present an online algorithm that learns a weighted combinatory categorial grammar (CCG) which is used to parse sentences to a lambda-calculus meaning representation. In particular, I will focus on recent work that addresses the challenge of parsing spontaneous, unedited natural language input which is commonly seen in dialogue domains such as ATIS travel planning. Finally, I will briefly discuss current research on developing a data set and learning algorithm for context-dependent parsing to logical form. 

Biography:
Luke Zettlemoyer is a Ph.D. student at MIT working on research at the intersection of natural language processing, machine learning, and automated planning. The research he is presenting in this talk was previously described in papers at EMNLP 07 and UAI 05, where it received a best paper award.

October 26
Guy Lebanon
Purdue University

Sequential Document Visualization

Documents and other categorical valued time series are often characterized by the frequencies of short range sequential patterns such as n-grams. This representation converts sequential data of varying lengths to high dimensional histogram vectors which are easily modeled by standard statistical models. Unfortunately, the histogram representation ignores most of the medium and long range sequential dependencies making it unsuitable for visualizing sequential data. We present a novel framework for sequential visualization of documents based on the idea of local statistical modeling. The framework embeds categorical time series as smooth curves in the multinomial simplex summarizing the progression of sequential trends. We discuss several visualization techniques based on the above framework and demonstrate their usefulness for document visualization.

Biography:
Guy Lebanon is an assistant professor at Purdue University with a joint appointment in Statistics and Electrical and Computer Engineering. His research areas include machine learning and computational statistics, with a particular emphasis on modeling text documents and partially ranked data. Prof. Lebanon received the 2007 Teaching for Tomorrow Award from Purdue University and the Best Presentation Award in the 2004 LTI Student Research Symposium. Prof. Lebanon received his PhD in 2005 from the Language Technologies Institute, Carnegie Mellon University, and BA and MS degrees from Technion - Israel Institute of Technology.

Rayid Ghani
Accenture

Research Challenges in Enterprise Information Retrieval

Information Retrieval is a major component of Knowledge Management systems in every business but most of the research that is being done in IR today focuses on the Web and not on the needs and challenges of businesses. This is primarily due to the availability of data on the Web for academic researchers as well as familiarity with the problems in Web IR since all of us are consumers and can relate to the domain. In contrast, for Enterprise Information Retrieval, the data is not available to most researchers and the challenges and needs are not obvious to people who are not everyday users of such systems.

In this talk, I will point out some challenges in this domain, pose open research questions for the Information Retrieval, NLP, Machine Learning & Data Mining communities, and describe the experimental infrastructure that is being set up at Accenture Technology Labs to undertake those challenges. Our experimental test-bed for Enterprise Knowledge Management Research has access to potentially 150,000 users in Accenture and will allow us to solve large-scale Enterprise Information Retrieval problems in collaboration with academic institutions around the world.

Biography:
Rayid Ghani is a Senior Researcher at Accenture Technology Labs and leads the Analytics group which focuses on applied research in Machine Learning, Data Mining, and Information Retrieval. His current research interests include Machine Learning & Data Mining with special focus on Text Learning and Active Learning. Rayid has organized several workshops at ICML and KDD and holds a master's degree in Knowledge Discovery and Data Mining from Carnegie Mellon University. More details can be found at www.accenture.com/techlabs/ghani

October 5
Rebecca Hwa
University of Pittsburgh

Learning Evaluation Metrics for Sentence-Level Machine-Translation

The field of machine translation (MT) has made major strides in recent years. An important enabling factor has been the adoption of automatic evaluation metrics to guide researchers in making improvements to their systems. Research in automatic evaluation metrics faces two major challenges. One is to achieve higher agreement with human judgments when evaluating MT outputs at the sentence level. Another is to minimize the reliance on expensive, human-developed resources such as reference sentences. In this talk, I present a regression-based approach to metric development. Our experiments suggest that combining a wide range of features yields a metric with higher correlations with human judgments. Moreover, we show that the features do not have to be extracted from comparisons with human-produced references. Using weaker indicators of fluency and adequacy, our learned metrics rival standard reference-based metrics in terms of correlations with human judgments on new test instances.

Biography:
Rebecca Hwa is an Assistant Professor in the Department of Computer Science at the University of Pittsburgh. Her recent research focus is on multilingual processing and machine translation. Before joining Pitt, she was a postdoc at the University of Maryland. She received her PhD from Harvard University and her B.S. from UCLA.


August 27
Luo Si
Purdue University

A Knowledge Driven Regression Model in Micro-Array Data Analysis

One of the main challenges with applying regression models in micro-array data analysis is that the number of model parameters is much larger than the number of data points available for training. This sparse data problem often leads to a significant degradation in modeling accuracy. We present a knowledge driven approach to effectively address this problem. The main idea is to extract the profiles of keywords from the research articles that describe the properties of the genes involved in the regression model. These keyword profiles of genes are then used to construct the similarity measure of the genes, which will be used to regulate the assignment of regression weights. More specifically, we present a Bayesian framework that automatically determines the importance of keywords in determining the similarity of genes in their roles in a given biological process. Empirical studies with a real biological dataset show that the proposed Bayesian framework is effective in exploiting the text profiles of genes to reduce the regression errors.

This is joint work with Rong Jin and Christian Chan at Michigan State University.

Biography:
Luo Si is an Assistant Professor in the Department of Computer Science and Statistics (by courtesy) at Purdue University. He received his Ph.D. degree from Carnegie Mellon University under the supervision of Professor Jamie Callan. His research spans a range of topics in information retrieval, machine learning, text mining and biomedical applications. His recent research focuses on federated search (distributed information retrieval), probabilistic models for collaborative filtering, and text/data mining for biomedical applications. He has published more than 40 conference and journal papers.

 




LTI is part of the School of Computer Science at Carnegie Mellon University.
This page is maintained by The LTI Webmaster.