LTI Seminar Abstracts
Fall 05 - Spring 06
May 26,2006
Elaine Rich
Senior Lecturer
The University of Texas at Austin
The Theory of Automata and Formal Languages: A Discussion of its Uses in Modern
NLP and Computational Biology
The theoretical underpinnings of computing form a standard part of almost every
computer science curriculum. But the classic treatment of this material
isolates it from the myriad ways in which the theory influences the design of
modern hardware and software systems. I am working on a textbook that changes
that. The book will be published by Prentice-Hall, probably in mid 2007.
The main part of the book is organized in the standard way: It begins with
finite state machines and regular languages. Next it covers context-free
languages and it contains an optional chapter on context-free parsing
techniques. Then it introduces Turing machines and the question of
undecidability. Finally, it considers the problem of practical computability
by introducing time and space complexity classes. For more detail, see the
book's Table of Contents:
http://www.cs.utexas.edu/~ear/cs341/automatabook/TOC.html
Throughout the book there are links to applications of the key concepts. A
substantial appendix describes many of those applications. For a table that
lists some of them, see: http://www.cs.utexas.edu/~ear/cs341/automatabook/.
My goal, in presenting this talk, is to solicit input on the ways that the core
material is used in practice in state of the art NLP and computational biology
systems.
Bio:
Elaine Rich received her Ph.D. in Computer Science from Carnegie-Mellon in 1979.
Her thesis, Building and Exploiting User Models, laid the groundwork for the
next twenty years of work on personalizing information systems to meet the
needs of individual users. Dr. Rich joined the UT CS faculty in 1979. She
continued her work in the area of human/machine interfaces, with a focus on the
use of knowledge-based systems. In 1985, Dr. Rich left UT for the
Microelectronic and Computing Technology Corporation (MCC), where she focused
on the use of natural language in human/machine interfaces. In 1998, Dr. Rich
returned to the CS department at UT Austin. Since returning to UT, she has
taught Automata Theory, Artificial Intelligence, and Natural Language
Processing. She also served for two years as Associate Chair for Academic
Affairs in the department, during which time she created several new programs,
including the Turing Scholars honors program for undergraduates the First Bytes
summer camp program for high school girls. In 1983, Dr. Rich published her
first textbook, Artificial Intelligence. The book was translated into
Japanese, French, Spanish, German, Italian and Portuguese, and it was recently
selected by members of the ACM as number 21 in a list of most significant
classic AI books. In 1991, with Kevin Knight, she published a second edition.
The two editions have sold over 250,000 copies. In 1991, Dr. Rich was elected
a Fellow of the American Association for Artificial Intelligence.
May, 19, 2006
Alicia Tribble
LTI PhD Student
Recent Advances in CMU's Example-Based Machine Translation
In the past year, the Radar-Scone team has worked on
making our knowledge engine more accessible. This includes
building tools that integrate Scone with larger software systems
as well as creating new interfaces that make Scone easier for
people to use. An important aspect of this work is its impact on
knowledge acquisition using Scone. In this talk, I will present
several tools for expanding the knowledge base, ranging from
scripts that import structured data (SQL, KIF, XML) to a
GUI that lets humans explore and edit the knowledge base.
Acquired knowledge using each of these tools will be presented
and assessed.
May, 19, 2006
Yi Chang
LTI PhD Student
People Identification With Limited Labels in Privacy-Protected Video
IPeople identification is an essential task for video content analysis in a surveillance system.
A good classifier, however, requires a large amount of training data, which may not be
obtained in some scenario. In this paper, we propose an approach to augment insufficient
training data with pairwise constraints that can be offered from video images that have removed
people's identities by masking faces. We show user study results that human subjects can
perform reasonably well in labeling pairwise constraints from face masked images. We also
present a new discriminative learning algorithm WPKLR to handle uncertainties in pairwise
constraints. The effectiveness of the proposed approach is demonstrated using video captured
from a nursing home environment. The new method provides a way to obtain high accuracy of
people identification from limited labeled data with noisy pairwise constraints, and meanwhile
minimize the risk of exposing people's identities.
March 31, 2006
Ralf Brown
LTI Senior Systems Scientist
Recent Advances in CMU's Example-Based Machine Translation
Example-Based Machine Translation (EBMT) has been under investigation
at CMU for more than a decade. This talk will present work done in
the past year or so to improve the translation quality of the EBMT
system, including new word-alignment algorithms, generalization via
word clustering, the use of morphology, and provisions for preserving
ambiguities in the input text that can arise due to uncertain word
segmentation or the use of speech recognition.
These enhancements have produced increases of up to 30% on automated
quality metrics.
March 10, 2006
Chun Jin PhD student LTI
Incremental Computation Sharing on Continuous Queries
Computation sharing has been recognized as a key component for a Data Stream
Management System(DSMS) to support large-scale concurrent continuous queries
over high data-rate streams. In such a system, queries are more likely to
arrive individually, instead of in batch. Thus it is desirable to develop an
incremental sharing system that adds new queries individually into an
existing shared query evaluation plan. To do that, the system needs to
identify the common computations between the new query and the existing
query plan, choose optimally among multiple sharing paths, and add
non-sharable new computations to the plan.
Given the existing query plan presented as a query network R and a new query
Q of selection and joins, we focus on 1. methods of indexing and searching
existing computations in R, 2. algorithms of identifying common computations
between Q and R, and 3. two sharing path optimization strategies, match-plan
and sharing-selection. The solutions provide an unified approach of indexing
and identifying general common computations that allows flexible deployment
of incremental sharing optimization strategies.
The evaluation study shows that the indexing methods and common computation
identification algorithms lead to construction of more concise query
networks and thus leads to significant reduction in execution cost. The
study also shows that sharing-selection is more effective than match-plan.
POSTPONED
Yilu Zhou
PhD student, MIS, University of Arizona
Combining Probability Model and Web Mining Model: A Framework for Proper Name Transliteration
The World Wide Web has become the biggest knowledge repository. There are Web pages in almost every popular language. However, language boundaries prevent information sharing and discovery across countries. Web search engines, such as Google, were adding multilingual capabilities to search for foreign documents. Proper names, such as organizations, company names, product names, person names, play an important role in search queries. However, they are often foreign names which are often translated phonetically, referreds to as transliteration. Previous transliteration models can be categorized into three approaches: a rule-based approach, a machine learning approach and a statistical approach. In this research, we propose a generic framework for proper name translation which combines an enhanced Hidden Markov Model (HMM) statistical model and a Web mining model. We improved the traditional statistical-based transliteration in three areas: 1) incorporate phonetic transliteration knowledge base; 2) incorporate a bigram and a trigram HMM; 3) incorporate a Web mining model that uses word frequency of occurrence information from the Web. We evaluated the framework on two different language pairs, English-Arabic pair and English-Chinese pair. For English/Arabic transliteration, we found that a combination of bigram and trigram HMM method performed the best. While the bigram model alone achieved fairly good performance, the trigram model alone did not. The Web mining approach boosted the performance by 46%. For English/Chinese transliteration, we found that a combination of the bigram and the trigram HMM method performed the best. The trigram model out-performed bigram in English/Chinese transliteration. The Web mining approach again improved the performance by 12%. Overall, our framework achieved a precision of 86%-93% when 8 best transliterations were considered. Our results are encouraging and show promise of successful transliteration techniques to multilingual Web retrieval.
Bio: Yilu Zhou is a doctoral candidate in the Department of Management Information Systems at the University of Arizona, where she is also a research associate of the Artificial Intelligence Lab. Her research interests include multilingual knowledge discovery, Web mining and human computer interaction. She received a B.S. in Computer Science from Shanghai Jiaotong University.
December 14, 2005
Kevin Knight USC, ISI
Unsupervised Analysis for Decipherment Problems
We study a number of natural language decipherment problems
using unsupervised analysis methods. These problems include letter
substitution ciphers, character code conversion, archaeological phonetic
decipherment, and word-based ciphers related to machine translation. We
describe a handful of techniques that improve results across all these
problems.
December 2, 2005
Julia Hockenmaier
University of Pennsylvania
Protein Folding as Chart Parsing
In this talk I will show how techniques that are commonly used in statistical
parsing of natural language can be used in protein folding.
Just as there may be millions of possible analyses for a sentence in
the Wall Street Journal, there may be millions of possible structures
for a protein. However, humans do not have difficulties understanding
these sentences, and proteins do find their correct "native" structure
in a very short amount of time.
I hope to convince you in this talk that as computer scientists we are
faced with two very similar tasks, and that the knowledge we have
gained in working on statistical parsing of natural language can help
us in our understanding of the protein folding problem. Specifically,
I will show you how adapting CKY to a simplified structural
representation of proteins yields an algorithm that is not only
efficient (most of the time), but that also has a certain degree of
physical plausibility, and allows insight into the folding process
that cannot be obtained from more standard techniques such as Monte
Carlo sampling. Time permitting, I will outline how the parsing
perspective on protein folding yields a novel definition of folding
routes that might allow us to define a better dynamical model of the
folding process.
Mini bio:
Julia Hockenmaier is a postdoc with Aravind Joshi at the University of Pennsylvania and Ken Dill at the University of California, San Francisco. Her current research investigates how statistical parsing techniques can be applied to protein folding.
For her PhD at the University of Edinburgh, Julia developed one of the first statistical wide-coverage parsers for Combinatory Categorial Grammar, and created CCGbank, a (now publicly available) corpus of CCG derivations that is derived from the Penn Treebank.
November 21, 2005
John Carroll
Faculty at University of Sussex
Unsupervised Acquisition of Predominant Word Senses
There has been a great deal of recent research into word sense
disambiguation, particularly since the inception of the SENSEVAL
evaluation exercises. Since a word often has more than one meaning,
resolving word sense ambiguity could benefit applications that need some
level of semantic interpretation of language input. A major problem is
that the accuracy of word sense disambiguation systems is strongly
dependent on the quantity of manually sense-tagged data available, and
even the best systems when tagging every word token perform little
better than a simple heuristic that guesses the first, or predominant
sense of a word in all contexts. The success of this heuristic is due
to the skewed nature of word sense distributions. Data for the heuristic
can come from either lexicographer intuition or a sample of sense-tagged
data. However, there is a limited supply of the latter, and the sense
distributions and predominant sense of a word can depend on the domain
or source of a document. (The first sense of `star' for example would
be different in the popular press and scientific journals). In this
talk, I will describe a method for determining the predominant sense of
a word automatically from raw text, and some experiments investigating
the influence of different data sources. One of our main results is that
the method produces more accurate predominant sense information than the
widely-used SemCor corpus for nouns with low coverage in that corpus.
[Joint work with Diana McCarthy and Rob Koeling]
November 10, 2005
Barbara Rosario University of California, Berkeley Natural Language Processing in Bioinformatics: Uncovering Semantic Relations
Current-generation search engines provide a glimpse of the kinds of activities that can be catalyzed by intelligent processing of
large-scale document corpora. Further progress in this area will require the tools of statistical natural language processing, including tools for
automatic extraction of propositional information from text. This presentation will explore several lines of research on one of the core problems that
arise in this domain---the identification of semantic relations between constituents in sentences. First, I will discuss the problem of identifying relationships between two-word noun compounds (to characterize, for example, the treatment-for-disease relationship between the words of "migraine treatment" versus the method-of-treatment relationship between the words of "aerosol treatment".) Second, I'll describe my work in the area of Information Extraction, in
particular the problem of identifying semantic entities such as "treatment" and "disease" from biomedical text. Finally, I will present my recent work on the
problem of predicting protein-protein interactions from biological text.
A major impediment to such work is the acquisition of appropriately labeled training data; for my experiments I have identified a database that serves as a proxy for training data. In each of these cases I will describe the statistical machine learning methods---both generative and discriminative---used to tackle these
tasks.
( http://www.sims.berkeley.edu/~rosario)
October 21, 2005
Kevyn Collins-Thompson LTI PhD Student Query Expansion using Random Walk Models
We describe a Markov chain framework that combines
multiple sources of knowledge on term associations.
The stationary distribution of the model is used to obtain
probability estimates that a potential expansion term
reflects aspects of the original query. We use this model
for query expansion and evaluate the effectiveness of the
model by examining the accuracy and robustness of the
expansion methods, and investigate the relative
effectiveness of various sources of term evidence.
Statistically significant differences in accuracy were
observed depending on the weighting of evidence in the
random walk. For example, using co-occurrence data
later in the walk was generally better than using it early,
suggesting further improvements in effectiveness may
be possible by learning walk behaviors.
October 18, 2005
Makoto Nagao, President of National Institute
of Information and Communications Technology (NICT), Japan
National Institute of Information and Communications Technology
(NICT) is the sole national institute for the research and funding
in the area of information and communications. Its vision is to
realize the "universal communication" on the Earth. It has been
the central research group to realize the governmental plan of "e-Japan"
(2001-2005), and will keep the same position in the next five year
plan of the government: "u-Japan"(2006-2010), which is a step to
realize the universal communication. NICT's research activity includes
wireless communication and sensing, ultra-high speed network technology,
particularly optical communication technology, information security
technology, and human communication technology such as natural language
processing, information retrieval, human interface, virtual reality
and urban information robotic system. The talk will be composed
of general introduction of the above activities, and the future
direction of example-based machine translation and information retrieval.
About the speaker:
Dr. Makoto Nagao was a professor of Kyoto University and did research
works of natural language processing, image processing and digital
library system. He was the originator of example-based machine translation.
After serving the President of Kyoto University for six years, he
has been the president of NICT since 2004. He received many awards
including Japan Prize, ACL Lifetime Achievement Award, and Medal
of Honor of IAMT.
October 4, 2005
Douglas W. Oard, University of Maryland, College
Park
Test collections for information retrieval tasks have traditionally
assumed that what we are searching for are documents (e.g., Web pages,
news stories, or academic documents). Most information that is
generated is, however, not in originally generated as part of a
document, but rather as what we might refer to as "conversational
media" (e.g., email, speech, or instant messaging). In this talk,
I'll describe the creation of a spoken word test collection for the
the Cross-Language Evaluation Forum (CLEF) cross-language speech
retrieval track. I'll illustrate the issues that arise in the
creation of such test collections with some of the results that we
have obtained from our experiments with that collection. This is
joint work with Charles University, the IBM TJ Watson Research Center,
the Johns Hopkins University, the Survivors of the Shoah Visual
History Foundation, and the University of West Bohemia.
About the speaker:
Douglas Oard is an Associate Professor at the University of Maryland,
College Park, with a joint appointment in the College of Information
Studies and the Institute for Advanced Computer Studies. He holds a
Ph.D. in Electrical Engineering from the University of Maryland, and
his research interests center around the use of emerging technologies
to support information seeking by end users. His recent work has
focused on interactive techniques for cross-language information
retrieval, searching conversational media, and leveraging observable
behavior to improve user modeling.
September 28, 2005
Rosie Jones
Sr. Research Scientist, Information Retrieval
Yahoo! Matching Sciences
Many documents, such as emails and news stories, have timestamps,
corresponding to the creation date, or date the information was sent.
Using the timestamp, each document can be represented as a point on a
timeline. We can then represent the entire document collection as a
function of the number of documents at each point. A text query on a
collection of documents on a timeline selects the subset of documents
that are relevant to the query. The resulting set of documents can
also be represented on a timeline. For example, a query for
"earthquake" would lead to documents clustered in peaks on the time
line, near each earthquake described in the news. We describe
experiments on ways of using this timeline to identify temporal query
ambiguity, as well as predict when queries are likely to lead to poor
search results.
Bio:
Rosie Jones has been a research scientist in Information Retrieval at
Yahoo! since 2002. Her research interests are in information
retrieval, machine learning, and statistical natural language
processing. She received her PhD from Carnegie Mellon University's
Language Technologies Institute.
September 16, 2005
Noah A. Smith, Department of Computer Science / Center for
Language and Speech Processing, Johns Hopkins University
In lexicalized phrase-structure or dependency parses, a word's
modifiers tend to fall near it in the string. We show that a model
that considers dependency length as a feature can improve parsing
speed and accuracy of simple models for English, Chinese, and German.
We then show similar improvements by imposing *hard* bounds on
dependency length and (additionally) modeling the resulting sequence
of parse fragments. This simple "vine grammar" formalism has only
finite-state power and so allows guaranteed linear-time parsing, but
is parameterized like a lexicalized phrase-structure or dependency
grammar, with some extra parameters for stringing fragments together.
We describe an O(n) runtime chart parsing algorithm whose grammar
constant is the same as the parser of Eisner and Satta (1999).
Different kinds of bounds allow different trade-offs between speed and
recall.
This is joint work with Jason Eisner and will be presented at IWPT 2005.
Short biography:
Noah A. Smith is a PhD student at Johns Hopkins University. His
graduate research has focused on unsupervised approaches to finding
grammatical (or other useful) structure in text, by improving search
(e.g., using deterministic annealing) and introducing novel,
EM-rivaling objective functions (contrastive estimation; see CALD talk
on September 15). Noah has contributed to the Giza statistical MT
toolkit, the STRAND system for finding parallel text on the Web, and
the Dyna programming language. He works with Jason Eisner and is
supported by a fellowship from the Fannie and John Hertz Foundation.
August 26, 2005
Ahmed Rafea, CS Department, American University in Cairo
In this presentation, we will describe a fully grammar-based
generation approach for task-oriented interlingua-based spoken
dialogue that transforms a shallow semantic interlingua representation
called Interchange Format (IF) into Arabic sentences that correspond
to the intentions underlying the speakers' utterances. This
interlingua was developed within the framework of the NEgotiating
through SPOken Language in E-commerce (NESPOLE! ) project in Carnegie
Mellon University (CMU). Our generator operates in two stages:
mapping Interlingua representations into feature structures, and
generating Arabic sentence from feature structures. During mapping,
which is partially domain-specific, information about lexemes, tense,
number and other properties of the utterance is used to generate the
feature structure. A domain-specific ontology and a map lexicon are
used to select the right words. During sentence generation, which is
domain-independent, linguistic (lexical, morphological and
grammatical) knowledge is used to generate an Arabic sentence that
represents a speakers intention. We conducted automatic and human
evaluations of our generation approach using the output from the
English analyzer provided by CMU. The results of these evaluations
were promising. Our Arabic generator can be integrated with the "Machine Translation of Spoken Arabic into English" package developed
at CMU.
The work presented is part of the collaborative project between Cairo
University and Carnegie Mellon University, NSF Project Number:
INT-0001613 US Egypt Joint Science and Technology Board Project
Number: INF4001023. I co supervise the PhD thesis of Azza Abdel Monem
that includes this work with Dr Khaled Shaalan.
Short Biography:
Dr. Ahmed Rafea got his Ph.D. from University Paul Sabatier, Toulouse,
France in 1980. Dr. Rafea is a Computer Science Professor at the
American university in Cairo He served as a Professor of CS and Vice
Dean at Cairo University. He also served as a Visiting Professor in
San Diego State University, and National University in the United
States. Dr. Rafea was the principal investigator and manger of
several projects for developing expert systems in Agriculture. He is
currently the P.I of a collaborative research project on SMT with USC.
His research interests include NLP, MT, KBS, and KDD.
|