Generalized Expectation Criteria for Semi-Supervised Learning
Current machine learning systems can be effective when they have
sufficient training data. However, human annotation is costly and it
is too expensive to have humans hand-annotate training data for all
classification tasks of interest. This dilemma has led to the appeal
of semi-supervised learning algorithms, where a small amount of
labeled data is augmented by a larger pool of unannotated data. In
this talk, I show how generalized expectation (GE) criteria can be
used for semi-supervised learning. Unlike traditional semi-supervised
learning methods that use conventionally labeled instances as their
supervised seed information, GE makes use of labeled features, where
individual features are labeled with their correlation with output
labels. Experiments with logistic regression and conditional random
fields on natural language processing problems demonstrate that
training with GE on labeled features, as opposed to traditional
supervised training and alternative semi-supervised learning methods,
can substantially reduce the amount of time it takes to train
high-performance models.
Biography:
Gideon Mann is a researcher at Google New York. He received an Sc.B.
in Computer Science from Brown University and an M.S. and a Ph.D. in
Computer Science from Johns Hopkins University, and worked as a
post-doctoral researcher at the University of Massachusetts, Amherst.
His current research is in minimally-supervised techniques,
particularly for information and fact extraction.
April 18
Hiroshi Kanayama
Tokyo Research Laboratory, IBM
ESPER: Extractor of Sentiment and Preference ExpRessions
Sentiment Analysis (SA) is the task of recognizing writers' feelings as expressed in positive or negative comments, by analyzing unreadably large numbers of documents. SA is becoming a useful tool for the commercial activities of both companies and individual consumers. However, much domain-specific knowledge is required for deeper understanding of targets, and it is laborious work to prepare such a lexicon for each domain.
We propose an unsupervised lexicon building method for the detection of polar clauses, which convey positive or negative aspects in a specific domain. The lexical entries to be acquired are called polar atoms, the minimum human-understandable syntactic structures that specify the polarity of clauses. As a clue to obtain candidate polar atoms, we use context coherency, the tendency for the same polarities to appear successively in contexts. Using the overall density and precision of coherency in the corpus, the statistical estimation picks up appropriate polar atoms among candidates, without any manual tuning of the threshold values. The experimental results show that the precision of polarity assignment with the automatically acquired lexicon was 94% on average, and that our method is robust across corpora in diverse domains and to the size of the initial lexicon. Though the experiments focus on Japanese text, the basic idea will be applicable to any language.
Biography:
Hiroshi Kanayama is a researcher at Tokyo Research Laboratory, IBM Japan. In 2000 he received a Master's degree from the Graduate School of the University of Tokyo for research on syntactic analysis. At IBM, his research focuses on several types of deep language analysis, including machine translation, sentiment analysis, and components for text mining.
April 15
Ronald M. Kaplan
Powerset, Inc.
Powerset: Deep Natural Language Processing for Web-Scale Information Retrieval
Google, Yahoo, and other conventional search engines have been remarkably successful at making vast amounts of information available to ordinary users. They achieve robustness and scale by creating efficient bag-of-words indexes of the terms they extract from unstructured text and by encouraging users to specify their information needs with keywords that are well-suited to bag-of-words retrieval. These methods suffer from errors of both precision and recall. Undesired results are returned because the systems do not index and cannot filter according to the semantic relations that the user has in mind, and desired results are missed because keyword matches cannot identify passages that use different terms and different syntactic constructions to express semantically equivalent concepts.
It is not a novel idea that these precision and recall problems can be addressed in principle by using deep natural language processing to extract underlying semantic concepts and relations both from text and from queries. Powerset is a start-up company that is attempting to address these problems in practice. We are combining fairly mature natural language technologies with carefully tuned indexing and retrieval components to build a web-scale semantic index for a natural language search engine. In this talk I'll point out why search is a particularly good application for natural language processing, outline some of the factors that justify this effort, and describe some of the technologies that make it possible. I'll also show examples of our current system to illustrate some of the strengths (and some of the current weaknesses) of our approach.
Biography:
Ronald M. Kaplan is Chief Scientific Officer at Powerset, Inc. Prior to joining Powerset, he was a Research Fellow at the (Xerox) Palo Alto Research Center where he created and directed the Natural Language Theory and Technology research group. He is also a Consulting Professor in the Linguistics Department at Stanford University and a Principal of Stanford’s Center for the Study of Language and Information.
He received his Ph.D. in Social Psychology from Harvard University, where he investigated how explicit computational models of grammar could be embedded in models of human language performance. He has made many contributions to computational linguistics and linguistic theory. These include the notions of consumer-producer and active-chart parsing, the design of the formal theory of Lexical Functional Grammar and its initial computational implementation, and the mathematical, linguistic, and computational concepts that underlie the use of finite-state phonological and morphological descriptions.
Kaplan is a past President of the Association for Computational Linguistics, a co-recipient of the 1992 Software System Award of the Association for Computing Machinery, and a Fellow of the ACM. He has also been a Fellow-in-Residence at the Netherlands Institute for Advanced Study in the Humanities and Social Sciences. He holds over 30 patents in computational linguistics and related areas.
April 10
Qin Jin
CMU
Robust Speaker Recognition
Automatic speaker recognition has become an increasingly important technology, required by many speech-aided applications. The main challenge for automatic speaker recognition is to deal with the variability of the environments and channels from which the speech was obtained. Research in automatic speaker recognition has focused on telephone speech for several years. However, it has ignored key problems such as far-field speaker recognition, where speech is recorded with distant microphones and in the presence of noise. In this talk, I will present our work to improve speaker identification system performance in far-field scenarios. First, we investigate approaches to improve the robustness of a traditional speaker recognition system, which is based on low-level spectral information. We introduce a new reverberation compensation approach which, along with feature warping in the feature processing procedure, improves the system performance significantly. We propose four multiple-channel combination approaches, which utilize information from multiple far-field microphones, to improve robustness under mismatched training-testing conditions. Second, we investigate approaches that use high-level speaker information to improve robustness. We propose new techniques to model speaker pronunciation idiosyncrasy along two dimensions: the cross-stream dimension and the time dimension. Such high-level information is expected to be robust under different channels. Third, we investigate speaker segmentation and clustering, aiming to improve the robustness of speaker recognition as well as automatic speech recognition performance in multiple-speaker scenarios such as conversations and meetings.
Biography:
Qin Jin is a post-doc at the Language Technologies Institute, Carnegie Mellon University. Her research interests include robust speaker recognition, human biometrics, automatic speech recognition, and machine learning in general. Qin Jin received her PhD in January 2007 from the Language Technologies Institute, Carnegie Mellon University, and BA and MS degrees from Tsinghua University, P.R. China.
March 21
Robert Frederking
Carnegie Mellon University
NineOneOne: Speech Translation for Spanish 9-1-1 Calls in the U.S.
In many U.S. localities, when emergency calls are received in languages
other than English (primarily Spanish), the dispatching center connects
the call to the Language Line human translation service to translate for
them. Though using human translators to assist callers during emergency
calls might seem optimal, this scheme actually has serious shortcomings.
The process is very slow, especially in starting up, and the translators
are unfamiliar with the task, resulting in very poor quality service.
We will present NineOneOne, a system being developed here to recognize
and translate Spanish emergency calls for better dispatching. The 9-1-1
domain has many research challenges, but we believe it is also a
feasible domain for a real-world speech translation application. The
domain is challenging because it requires real-time operation, and the
recognition and translation of stressed telephone-quality speech in
multiple dialects; it is still feasible because we have in-domain data
from three 9-1-1 centers, there are strong vocabulary and task
constraints, and perfection is not required. This domain also has clear,
significant social value, addressing the chronic shortage of translation
for Spanish (and in larger cities, a large number of other languages) at
U.S. emergency dispatch centers. While we currently are working on
Spanish/English, the approaches we use are fundamentally
language-independent.
Our initial work has been aimed at demonstrating that we can produce ASR
and utterance classification of sufficient quality to allow the
development of a practical, limited-domain system. We will describe our
initial results, which we believe have achieved that goal. We will also
lay out the novel overall design for the system, and our plans for its
further development.
Biography:
Robert Frederking is a Senior Systems Scientist at the Language
Technologies Institute (LTI) in the School of Computer Science at
Carnegie Mellon University, where he is the Chair of the LTI's graduate
programs. He managed the Diplomat rapid-deployment, wearable speech
translation project, and was the CMU PI for the ensuing Tongues project,
which was led by Lockheed Martin, to produce a pocketable
speech-to-speech translator for potential field use by the US Army. He
was also a member of the Nespole! and Babylon speech translation projects.
March 7
Alex Hauptmann
Carnegie Mellon University
Video Analysis and Semantic Concepts for Multimedia Retrieval
Recent years have seen increasing popularity of video as a shared,
searchable medium, with over 100 million daily views from YouTube clips
as prime evidence. This talk will cover some of the successes in video
retrieval research, focusing on broadcast content as the most
extensively studied area.
After beginning with background on audio and video analysis and
spoken document retrieval, the talk will discuss some current
research issues, foremost among them retrieval based on semantic
concepts. One conjecture is that robust semantic concept classification
could provide the means to close the semantic gap; that is, to allow a
characterization of visual or graphic content through natural language.
This leads to an examination into the combination of knowledge sources
for retrieval, relevance feedback and interfaces for retrieval.
This talk is based on joint work with numerous members of the Informedia
Project over many years.
February 29
John Tait
Information Retrieval Facility
Introduction to Patent Retrieval and the IRF
Being granted a valid patent involves ensuring that the invention has
not been previously patented or put in the public domain. Patent
searching involves indexing and searching existing patent collections
but it also involves searching collections of scientific papers and
more general web searching, to determine whether the idea has been
previously publicly disclosed. In contrast to information exploration
through internet search engines, patent searching involves long and
complex queries; session-based queries in which very lengthy periods
(sometimes days) are spent refining and reviewing results; and finally
but by no means least important, very high recall to ensure a critical
item has not been missed.
The Information Retrieval Facility (IRF) is a new international
not-for-profit organisation based in Vienna, with a distinguished
International Scientific Board. It aims to support large-scale information
retrieval research generally, but more specifically work on patent
retrieval and related areas. The facilities and plans of the IRF will
be described.
Biography:
Dr. John Tait is Chief Scientific Officer of the Information Retrieval
Facility (IRF), a not-for-profit foundation dedicated to promoting
research in large-scale information retrieval. Prior to joining the
IRF in 2007, Dr. Tait was Professor of Intelligent Information Systems and
Associate Dean of Computing And Technology at the University of
Sunderland in Great Britain. He is a past Program Chair of the SIGIR
and ECIR conferences, an Associate Editor of ACM Transactions on
Information Systems, and author of over 90 refereed conference and
journal papers. His current research focuses on problems of retrieving
still and moving images, and on patent retrieval.
December 7
Noah Smith
Carnegie Mellon University
Statistical Parsing Triptych: Jeopardy, Morphosyntax, and M-Estimation
This talk covers three recent advances in statistical parsing: an
application, an algorithmic solution, and a learning solution.
The first part of the talk presents our work using quasi-synchronous
dependency grammars - an elegant model originally designed for machine
translation - in question answering. By modeling loose
answer-to-question transformations at the level of bare-bones
dependency structure, we achieve notably high performance on a TREC-style
answer-selection task (Wang, Smith, and Mitamura, EMNLP-CoNLL 2007).
The second part of the talk turns to parsing algorithms. While much
research has been devoted to parsing algorithms for languages that
have clear morpheme boundaries (e.g., English), it is not clear what
to do when a language displays morphological ambiguity as well. We
describe two efficient ways to apply models for morphological and
syntactic disambiguation in tandem, giving significant gains on
parsing the Hebrew Treebank (Cohen and Smith, EMNLP-CoNLL 2007).
The third part of the talk turns to a learning problem. Since
log-linear ("maximum entropy") models were first applied to NLP at IBM
in the 1990s, they have been widely used. Training them, however, is
very expensive for models of sequences and trees. We present a novel,
generative parameter estimation algorithm for log-linear structure
models based on a generalization of maximum likelihood estimation
called M-estimation. We compare this method to existing learning
algorithms on a shallow parsing task (Smith, Vail, and Lafferty, ACL
2007).
Biography:
Noah Smith is Assistant Professor of Language Technologies at Carnegie
Mellon University. His research has spanned statistical machine
translation, parallel corpus discovery, unsupervised statistical
grammar induction, efficient morphological and syntactic processing
algorithms, weighted logic programming, and the formal study of
weighted grammars. He is a Hertz Fellow (2001-6), the recipient of an
IBM Faculty Award (2007), and a member of the DARPA Computer Science
Study Panel (2007).
November 16
Eugene Fink
Carnegie Mellon University
Reasoning Under Uncertainty
The development of a robust AI system for reasoning under uncertainty
involves several research challenges, including representation of
uncertain data, optimization and inferences based on uncertain
knowledge, identification of critical uncertainties, planning of
additional data collection, and anticipation of possible future
developments.
I will present work on these challenges, and describe the application
of the related results in two AI systems. The first system is part of
the RADAR architecture, which supports the use of uncertain data in
planning and scheduling. The purpose of the second system is
hypothesis verification based on massive uncertain data, as well as
planning of proactive data gathering.
Biography:
Eugene Fink received the B.S. degree from Mount Allison University
(Canada) in 1991, M.S. from the University of Waterloo (Canada) in 1992,
and Ph.D. from Carnegie Mellon University in 1999. He was an assistant
professor in the Computer Science and Engineering Department at the
University of South Florida from 1999 to 2003. He is now a senior systems
scientist in the School of Computer Science at Carnegie Mellon University.
His research interests are in various aspects of artificial intelligence,
including machine learning, planning, problem solving, e-commerce
applications, medical applications, and theoretical foundations of
artificial intelligence. His interests also include computational
geometry and algorithm theory.
November 9
Raul Valdes-Perez
Vivisimo Inc.
Enterprise Search: Not Your Kid Sister's Search Engine
Vivisimo, a CMU CSD spinoff from 2000, has evolved from its origins as
a web search engine into an enterprise search software vendor, growing
fast and competing well against large public companies.
Raul will discuss:
CMU origins of Vivisimo and its evolution thereafter
Vivisimo's view of post-retrieval navigation: clustering, taxonomies,
meta-data navigation, etc.
Key ideas that underlie successful document clustering
Search 2.0: users teach the search engine
What corporate customers value in enterprise search
Ongoing technical challenges
Some entrepreneurial lessons learned
Biography:
Raul has been CEO of Vivisimo since co-founding it. He was selected
as a 1997 Ernst & Young Entrepreneur of the Year, North Central
Region. On the CS research faculty during 1991-2001, he worked on AI
methods and applications to both data-driven and theory-driven
problems in scientific discovery. From 1986 to 1991 he was a CS PhD
student at CMU, advised by the late Herbert Simon.
More information at: http://vivisimo.com/docs/valdes.pdf
November 2
Luke Zettlemoyer
MIT
Learning to Map Sentences to Logical Form
Recent research has focused on the problem of learning to map
natural language sentences to semantic representations of their
underlying meaning. In this talk, I will present an online algorithm
that learns a weighted combinatory categorial grammar (CCG) which
is used to parse sentences to a lambda-calculus meaning representation.
In particular, I will focus on recent work that addresses the challenge
of parsing spontaneous, unedited natural language input which is
commonly seen in dialogue domains such as ATIS travel planning.
Finally, I will briefly discuss current research on developing a data set
and learning algorithm for context-dependent parsing to logical form.
Biography:
Luke Zettlemoyer is a Ph.D. student at MIT working at the
intersection of natural language processing, machine learning and
automated planning. The research he is presenting in this talk was
previously described in papers at EMNLP 07 and UAI 05, where it
received a best paper award.
October 26
Guy Lebanon
Purdue University
Sequential Document Visualization
Documents and other categorical-valued time series are
often characterized by the frequencies of short-range sequential
patterns such as n-grams. This representation converts
sequential data of varying lengths to high-dimensional histogram
vectors, which are easily modeled by standard statistical models.
Unfortunately, the histogram representation ignores most of the
medium- and long-range sequential dependencies, making it unsuitable
for visualizing sequential data. We present a novel framework for
sequential visualization of documents based on the idea of local
statistical modeling. The framework embeds
categorical time series as smooth curves in the multinomial
simplex summarizing the progression of sequential trends. We
discuss several visualization techniques based on the above
framework and demonstrate their usefulness for document
visualization.
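As a rough illustration of the local modeling idea in the abstract above, the sketch below computes a kernel-weighted word histogram at several positions along a document, so that the sequence of histograms traces a path through the multinomial simplex. It is only a minimal sketch under stated assumptions, not the speaker's actual method: the Gaussian kernel, the additive smoothing constant, and the names local_histograms, bandwidth, and smoothing are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def local_histograms(tokens, vocab, num_points=5, bandwidth=0.1, smoothing=0.01):
    """Return smoothed local word distributions at evenly spaced positions.

    Hypothetical helper: each returned dict is a point in the multinomial
    simplex over `vocab`, estimated from tokens near one document position.
    """
    n = len(tokens)
    # Map each token to a normalized position in (0, 1).
    positions = [(i + 0.5) / n for i in range(n)]
    curve = []
    for k in range(num_points):
        mu = k / (num_points - 1) if num_points > 1 else 0.5
        weights = Counter()
        for tok, pos in zip(tokens, positions):
            # Gaussian kernel (an assumption): nearby tokens count more.
            weights[tok] += math.exp(-0.5 * ((pos - mu) / bandwidth) ** 2)
        # Additive smoothing keeps every probability strictly positive.
        total = sum(weights[w] for w in vocab) + smoothing * len(vocab)
        curve.append({w: (weights[w] + smoothing) / total for w in vocab})
    return curve


if __name__ == "__main__":
    doc = ["a"] * 10 + ["b"] * 10
    for point in local_histograms(doc, ["a", "b"], num_points=3):
        print(point)
```

With a small bandwidth the curve follows local word usage closely; as the bandwidth grows, every point approaches the ordinary whole-document histogram, which is the representation the abstract describes as losing sequential information.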
Biography:
Guy Lebanon is an assistant professor at Purdue University with a joint
appointment in Statistics and Electrical and Computer Engineering. His
research area includes machine learning and computational statistics
with a particular emphasis on modeling text documents and partially
ranked data. Prof. Lebanon received the 2007 Teaching for Tomorrow Award
from Purdue University and the Best Presentation Award in the 2004 LTI
Student Research Symposium. Prof. Lebanon received his PhD in 2005 from
the Language Technologies Institute, Carnegie Mellon University, and BA
and MS degrees from Technion - Israel Institute of Technology.
Rayid Ghani
Accenture
Research Challenges in Enterprise Information Retrieval
Information Retrieval is a major component of Knowledge Management
systems in every business but most of the research that is being done in
IR today focuses on the Web and not on the needs and challenges of
businesses. This is primarily due to the availability of data on the Web
for academic researchers as well as familiarity with the problems in Web
IR since all of us are consumers and can relate to the domain. In
contrast, for Enterprise Information Retrieval, the data is not
available to most researchers and the challenges and needs are not
obvious to people who are not everyday users of such systems.
In this talk, I will point out some challenges in this domain, pose open
research questions for the Information Retrieval, NLP, Machine Learning
& Data Mining communities, and describe the experimental infrastructure
that is being set up at Accenture Technology Labs to undertake those
challenges. Our experimental test-bed for Enterprise Knowledge
Management Research has access to potentially 150,000 users in Accenture
and will allow us to solve large-scale Enterprise Information Retrieval
problems in collaboration with academic institutions around the world.
Biography:
Rayid Ghani is a Senior Researcher at Accenture Technology Labs and
leads the Analytics group which focuses on applied research in Machine
Learning, Data Mining, and Information Retrieval. His current research
interests include Machine Learning & Data Mining with special focus on
Text Learning and Active Learning. Rayid has organized several workshops
at ICML and KDD and holds a master's degree in Knowledge Discovery and
Data Mining from Carnegie Mellon University.
More details can be found at www.accenture.com/techlabs/ghani
October 5
Rebecca Hwa
University of Pittsburgh
Learning Evaluation Metrics for Sentence-Level Machine-Translation
The field of machine translation (MT) has made major strides in recent
years. An important enabling factor has been the adoption of automatic
evaluation metrics to guide researchers in making improvements to
their systems. Research in automatic evaluation metrics faces two
major challenges. One is to achieve higher agreement with human
judgments when evaluating MT outputs at the sentence level. Another
is to minimize the reliance on expensive, human-developed resources
such as reference sentences. In this talk, I present a
regression-based approach to metric development. Our experiments suggest
that by combining a wide range of features, the resulting metric has
higher correlations with human judgments. Moreover, we show that the
features do not have to be extracted from comparisons with
human-produced references. Using weaker indicators of fluency and adequacy,
our learned metrics rival standard reference-based metrics in terms of
correlations with human judgments on new test instances.
Biography:
Rebecca Hwa is an Assistant Professor in the Department of Computer
Science at the University of Pittsburgh. Her recent research focus is
on multilingual processing and machine translation. Before joining
Pitt, she was a postdoc at the University of Maryland. She received her
PhD from Harvard University and her B.S. from UCLA.
August 27
Luo Si
Purdue University
A Knowledge Driven Regression Model in Micro-Array Data Analysis
One of the main challenges with applying regression models in
micro-array data analysis is that the number of model parameters is much
larger than the number of data points available for training. This
sparse data problem often leads to a significant degradation in the
modeling accuracy. We present a knowledge driven approach to effectively
address this problem. The main idea is to extract the profiles of
keywords from the research articles that describe the properties of the
genes involved in the regression model. These keyword profiles of genes
are then used to construct the similarity measure of the genes, which
will be used to regulate the assignment of regression weights. More
specifically, we present a Bayesian framework that automatically
determines the importance of keywords in determining the similarity of
genes in their roles in a given biological process. Empirical studies
with a real biological dataset show that the proposed Bayesian framework
is effective in exploiting the text profiles of genes to reduce the
regression errors.
This is joint work with Rong Jin and Christian Chan at Michigan State
University.
Biography:
Luo Si is an Assistant Professor in the Department of Computer Science
and Statistics (by courtesy) at Purdue University. He received his Ph.D.
degree from Carnegie Mellon University under the supervision of
Professor Jamie Callan. His research spans a range of topics in
information retrieval, machine learning, text mining and biomedical
applications. His recent research focuses on federated search
(distributed information retrieval), probabilistic models for
collaborative filtering, and text/data mining for biomedical
applications. He has published more than 40 conference and journal papers.