Home

About
   How To Apply
   The LTI Brochure


Education

   Ph.D.
   Dual Ph.D. with Portugal
   M.S.

   Undergrad Minor

   Courses


LTI Forms

Seminars
   LTI Seminar Series
   Joint Speech Seminar (JSS)

   Machine Translation (MT)

   Student Research Symposium

   Information Retrieval Series

   Large Scale Lunch Seminar

   Intelligence Seminar


Visitor Information
   General
   Maps & Directions
   Hotel Links
   Parking Information


Research
   Projects

   Reports

    Dissertations


People

   Faculty

   Students

   Upcoming Graduates

   Staff

   Visitors   

   Who to See for What


Administrative Contacts


LTI Seminar Abstracts
Fall 05 - Spring 06


May 26,2006
Elaine Rich
Senior Lecturer
The University of Texas at Austin

The Theory of Automata and Formal Languages: A Discussion of its Uses in Modern NLP and Computational Biology

The theoretical underpinnings of computing form a standard part of almost every computer science curriculum. But the classic treatment of this material isolates it from the myriad ways in which the theory influences the design of modern hardware and software systems. I am working on a textbook that changes that. The book will be published by Prentice-Hall, probably in mid 2007. The main part of the book is organized in the standard way: It begins with finite state machines and regular languages. Next it covers context-free languages and it contains an optional chapter on context-free parsing techniques. Then it introduces Turing machines and the question of undecidability. Finally, it considers the problem of practical computability by introducing time and space complexity classes. For more detail, see the book's Table of Contents:

http://www.cs.utexas.edu/~ear/cs341/automatabook/TOC.html

Throughout the book there are links to applications of the key concepts. A substantial appendix describes many of those applications. For a table that lists some of them, see: http://www.cs.utexas.edu/~ear/cs341/automatabook/. My goal, in presenting this talk, is to solicit input on the ways that the core material is used in practice in state of the art NLP and computational biology systems.

Bio:
Elaine Rich received her Ph.D. in Computer Science from Carnegie-Mellon in 1979. Her thesis, Building and Exploiting User Models, laid the groundwork for the next twenty years of work on personalizing information systems to meet the needs of individual users. Dr. Rich joined the UT CS faculty in 1979. She continued her work in the area of human/machine interfaces, with a focus on the use of knowledge-based systems. In 1985, Dr. Rich left UT for the Microelectronic and Computing Technology Corporation (MCC), where she focused on the use of natural language in human/machine interfaces. In 1998, Dr. Rich returned to the CS department at UT Austin. Since returning to UT, she has taught Automata Theory, Artificial Intelligence, and Natural Language Processing. She also served for two years as Associate Chair for Academic Affairs in the department, during which time she created several new programs, including the Turing Scholars honors program for undergraduates the First Bytes summer camp program for high school girls. In 1983, Dr. Rich published her first textbook, Artificial Intelligence. The book was translated into Japanese, French, Spanish, German, Italian and Portuguese, and it was recently selected by members of the ACM as number 21 in a list of most significant classic AI books. In 1991, with Kevin Knight, she published a second edition. The two editions have sold over 250,000 copies. In 1991, Dr. Rich was elected a Fellow of the American Association for Artificial Intelligence.

May, 19, 2006
Alicia Tribble
LTI PhD Student
Recent Advances in CMU's Example-Based Machine Translation

In the past year, the Radar-Scone team has worked on making our knowledge engine more accessible. This includes building tools that integrate Scone with larger software systems as well as creating new interfaces that make Scone easier for people to use. An important aspect of this work is its impact on knowledge acquisition using Scone. In this talk, I will present several tools for expanding the knowledge base, ranging from scripts that import structured data (SQL, KIF, XML) to a GUI that lets humans explore and edit the knowledge base. Acquired knowledge using each of these tools will be presented and assessed.

May, 19, 2006
Yi Chang
LTI PhD Student
People Identification With Limited Labels in Privacy-Protected Video

IPeople identification is an essential task for video content analysis in a surveillance system. A good classifier, however, requires a large amount of training data, which may not be obtained in some scenario. In this paper, we propose an approach to augment insufficient training data with pairwise constraints that can be offered from video images that have removed people's identities by masking faces. We show user study results that human subjects can perform reasonably well in labeling pairwise constraints from face masked images. We also present a new discriminative learning algorithm WPKLR to handle uncertainties in pairwise constraints. The effectiveness of the proposed approach is demonstrated using video captured from a nursing home environment. The new method provides a way to obtain high accuracy of people identification from limited labeled data with noisy pairwise constraints, and meanwhile minimize the risk of exposing people's identities.

March 31, 2006
Ralf Brown
LTI Senior Systems Scientist
Recent Advances in CMU's Example-Based Machine Translation

Example-Based Machine Translation (EBMT) has been under investigation at CMU for more than a decade. This talk will present work done in the past year or so to improve the translation quality of the EBMT system, including new word-alignment algorithms, generalization via word clustering, the use of morphology, and provisions for preserving ambiguities in the input text that can arise due to uncertain word segmentation or the use of speech recognition. These enhancements have produced increases of up to 30% on automated quality metrics.

March 10, 2006
Chun Jin
PhD student LTI
Incremental Computation Sharing on Continuous Queries

Computation sharing has been recognized as a key component for a Data Stream Management System(DSMS) to support large-scale concurrent continuous queries over high data-rate streams. In such a system, queries are more likely to arrive individually, instead of in batch. Thus it is desirable to develop an incremental sharing system that adds new queries individually into an existing shared query evaluation plan. To do that, the system needs to identify the common computations between the new query and the existing query plan, choose optimally among multiple sharing paths, and add non-sharable new computations to the plan.

Given the existing query plan presented as a query network R and a new query Q of selection and joins, we focus on 1. methods of indexing and searching existing computations in R, 2. algorithms of identifying common computations between Q and R, and 3. two sharing path optimization strategies, match-plan and sharing-selection. The solutions provide an unified approach of indexing and identifying general common computations that allows flexible deployment of incremental sharing optimization strategies.

The evaluation study shows that the indexing methods and common computation identification algorithms lead to construction of more concise query networks and thus leads to significant reduction in execution cost. The study also shows that sharing-selection is more effective than match-plan.

POSTPONED
Yilu Zhou
PhD student, MIS, University of Arizona

Combining Probability Model and Web Mining Model: A Framework for Proper Name Transliteration

The World Wide Web has become the biggest knowledge repository. There are Web pages in almost every popular language. However, language boundaries prevent information sharing and discovery across countries. Web search engines, such as Google, were adding multilingual capabilities to search for foreign documents. Proper names, such as organizations, company names, product names, person names, play an important role in search queries. However, they are often foreign names which are often translated phonetically, referreds to as transliteration. Previous transliteration models can be categorized into three approaches: a rule-based approach, a machine learning approach and a statistical approach. In this research, we propose a generic framework for proper name translation which combines an enhanced Hidden Markov Model (HMM) statistical model and a Web mining model. We improved the traditional statistical-based transliteration in three areas: 1) incorporate phonetic transliteration knowledge base; 2) incorporate a bigram and a trigram HMM; 3) incorporate a Web mining model that uses word frequency of occurrence information from the Web. We evaluated the framework on two different language pairs, English-Arabic pair and English-Chinese pair. For English/Arabic transliteration, we found that a combination of bigram and trigram HMM method performed the best. While the bigram model alone achieved fairly good performance, the trigram model alone did not. The Web mining approach boosted the performance by 46%. For English/Chinese transliteration, we found that a combination of the bigram and the trigram HMM method performed the best. The trigram model out-performed bigram in English/Chinese transliteration. The Web mining approach again improved the performance by 12%. Overall, our framework achieved a precision of 86%-93% when 8 best transliterations were considered. Our results are encouraging and show promise of successful transliteration techniques to multilingual Web retrieval.

Bio:
Yilu Zhou is a doctoral candidate in the Department of Management Information Systems at the University of Arizona, where she is also a research associate of the Artificial Intelligence Lab. Her research interests include multilingual knowledge discovery, Web mining and human computer interaction. She received a B.S. in Computer Science from Shanghai Jiaotong University.

December 14, 2005
Kevin Knight
USC, ISI
Unsupervised Analysis for Decipherment Problems

We study a number of natural language decipherment problems using unsupervised analysis methods. These problems include letter substitution ciphers, character code conversion, archaeological phonetic decipherment, and word-based ciphers related to machine translation. We describe a handful of techniques that improve results across all these problems.

December 2, 2005
Julia Hockenmaier
University of Pennsylvania
Protein Folding as Chart Parsing


In this talk I will show how techniques that are commonly used in statistical
parsing of natural language can be used in protein folding.

Just as there may be millions of possible analyses for a sentence in the Wall Street Journal, there may be millions of possible structures for a protein.  However, humans do not have difficulties understanding these sentences, and proteins do find their correct "native" structure in a very short amount of time.

I hope to convince you in this talk that as computer scientists we are faced with two very similar tasks, and that the knowledge we have gained in working on statistical parsing of natural language can help us in our understanding of the protein folding problem. Specifically, I will show you how adapting CKY to a simplified structural representation of proteins yields an algorithm that is not only efficient (most of the time), but that also has a certain degree of physical plausibility, and allows insight into the folding process that cannot be obtained from more standard techniques such as Monte Carlo sampling.  Time permitting, I will outline how the parsing perspective on protein folding yields a novel definition of folding routes that might allow us to define a better dynamical model of the folding process.

Mini bio: Julia Hockenmaier is a postdoc with Aravind Joshi at the University of Pennsylvania and Ken Dill at the University of California, San Francisco. Her current research investigates how statistical parsing techniques can be applied to protein folding. For her PhD at the University of Edinburgh, Julia developed one of the first statistical wide-coverage parsers for Combinatory Categorial Grammar, and created CCGbank, a (now publicly available) corpus of CCG derivations that is derived from the Penn Treebank.

November 21, 2005
John Carroll
Faculty at University of Sussex
Unsupervised Acquisition of Predominant Word Senses


There has been a great deal of recent research into word sense disambiguation, particularly since the inception of the SENSEVAL evaluation exercises. Since a word often has more than one meaning, resolving word sense ambiguity could benefit applications that need some level of semantic interpretation of language input. A major problem is that the accuracy of word sense disambiguation systems is strongly dependent on the quantity of manually sense-tagged data available, and even the best systems when tagging every word token perform little better than a simple heuristic that guesses the first, or predominant sense of a word in all contexts. The success of this heuristic is due to the skewed nature of word sense distributions. Data for the heuristic can come from either lexicographer intuition or a sample of sense-tagged data. However, there is a limited supply of the latter, and the sense distributions and predominant sense of a word can depend on the domain or source of a document. (The first sense of `star' for example would be different in the popular press and scientific journals). In this talk, I will describe a method for determining the predominant sense of a word automatically from raw text, and some experiments investigating the influence of different data sources. One of our main results is that the method produces more accurate predominant sense information than the widely-used SemCor corpus for nouns with low coverage in that corpus.

[Joint work with Diana McCarthy and Rob Koeling]

November 10, 2005
Barbara Rosario
University of California, Berkeley
Natural Language Processing in Bioinformatics: Uncovering Semantic Relations


Current-generation search engines provide a glimpse of the kinds of activities that can be catalyzed by intelligent processing of large-scale document corpora. Further progress in this area will require the tools of statistical natural language processing, including tools for automatic extraction of propositional information from text. This presentation will explore several lines of research on one of the core problems that arise in this domain---the identification of semantic relations between constituents in sentences. First, I will discuss the problem of identifying relationships between two-word noun compounds (to characterize, for example, the treatment-for-disease relationship between the words of "migraine treatment" versus the method-of-treatment relationship between the words of "aerosol treatment".) Second, I'll describe my work in the area of Information Extraction, in particular the problem of identifying semantic entities such as "treatment" and "disease" from biomedical text. Finally, I will present my recent work on the problem of predicting protein-protein interactions from biological text.

A major impediment to such work is the acquisition of appropriately labeled training data; for my experiments I have identified a database that serves as a proxy for training data. In each of these cases I will describe the statistical machine learning methods---both generative and discriminative---used to tackle these tasks.

(http://www.sims.berkeley.edu/~rosario)

October 21, 2005
Kevyn Collins-Thompson
LTI PhD Student
Query Expansion using Random Walk Models


We describe a Markov chain framework that combines multiple sources of knowledge on term associations. The stationary distribution of the model is used to obtain probability estimates that a potential expansion term reflects aspects of the original query. We use this model for query expansion and evaluate the effectiveness of the model by examining the accuracy and robustness of the expansion methods, and investigate the relative effectiveness of various sources of term evidence. Statistically significant differences in accuracy were observed depending on the weighting of evidence in the random walk. For example, using co-occurrence data later in the walk was generally better than using it early, suggesting further improvements in effectiveness may be possible by learning walk behaviors.



October 18, 2005
Makoto Nagao, President of National Institute of Information and Communications Technology (NICT), Japan

National Institute of Information and Communications Technology (NICT) is the sole national institute for the research and funding in the area of information and communications. Its vision is to realize the "universal communication" on the Earth. It has been the central research group to realize the governmental plan of "e-Japan" (2001-2005), and will keep the same position in the next five year plan of the government: "u-Japan"(2006-2010), which is a step to realize the universal communication. NICT's research activity includes wireless communication and sensing, ultra-high speed network technology, particularly optical communication technology, information security technology, and human communication technology such as natural language processing, information retrieval, human interface, virtual reality and urban information robotic system. The talk will be composed of general introduction of the above activities, and the future direction of example-based machine translation and information retrieval.

About the speaker:

Dr. Makoto Nagao was a professor of Kyoto University and did research works of natural language processing, image processing and digital library system. He was the originator of example-based machine translation. After serving the President of Kyoto University for six years, he has been the president of NICT since 2004. He received many awards including Japan Prize, ACL Lifetime Achievement Award, and Medal of Honor of IAMT.


October 4, 2005
Douglas W. Oard, University of Maryland, College Park

Test collections for information retrieval tasks have traditionally assumed that what we are searching for are documents (e.g., Web pages, news stories, or academic documents). Most information that is generated is, however, not in originally generated as part of a document, but rather as what we might refer to as "conversational media" (e.g., email, speech, or instant messaging). In this talk, I'll describe the creation of a spoken word test collection for the the Cross-Language Evaluation Forum (CLEF) cross-language speech retrieval track. I'll illustrate the issues that arise in the creation of such test collections with some of the results that we have obtained from our experiments with that collection. This is joint work with Charles University, the IBM TJ Watson Research Center, the Johns Hopkins University, the Survivors of the Shoah Visual History Foundation, and the University of West Bohemia.

About the speaker:

Douglas Oard is an Associate Professor at the University of Maryland, College Park, with a joint appointment in the College of Information Studies and the Institute for Advanced Computer Studies. He holds a Ph.D. in Electrical Engineering from the University of Maryland, and his research interests center around the use of emerging technologies to support information seeking by end users. His recent work has focused on interactive techniques for cross-language information retrieval, searching conversational media, and leveraging observable behavior to improve user modeling.


September 28, 2005
Rosie Jones Sr. Research Scientist, Information Retrieval Yahoo! Matching Sciences

Many documents, such as emails and news stories, have timestamps, corresponding to the creation date, or date the information was sent. Using the timestamp, each document can be represented as a point on a timeline. We can then represent the entire document collection as a function of the number of documents at each point. A text query on a collection of documents on a timeline selects the subset of documents that are relevant to the query. The resulting set of documents can also be represented on a timeline. For example, a query for "earthquake" would lead to documents clustered in peaks on the time line, near each earthquake described in the news. We describe experiments on ways of using this timeline to identify temporal query ambiguity, as well as predict when queries are likely to lead to poor search results.

Bio:

Rosie Jones has been a research scientist in Information Retrieval at Yahoo! since 2002. Her research interests are in information retrieval, machine learning, and statistical natural language processing. She received her PhD from Carnegie Mellon University's Language Technologies Institute.

 


September 16, 2005
Noah A. Smith, Department of Computer Science / Center for Language and Speech Processing, Johns Hopkins University

In lexicalized phrase-structure or dependency parses, a word's modifiers tend to fall near it in the string. We show that a model that considers dependency length as a feature can improve parsing speed and accuracy of simple models for English, Chinese, and German. We then show similar improvements by imposing *hard* bounds on dependency length and (additionally) modeling the resulting sequence of parse fragments. This simple "vine grammar" formalism has only finite-state power and so allows guaranteed linear-time parsing, but is parameterized like a lexicalized phrase-structure or dependency grammar, with some extra parameters for stringing fragments together. We describe an O(n) runtime chart parsing algorithm whose grammar constant is the same as the parser of Eisner and Satta (1999). Different kinds of bounds allow different trade-offs between speed and recall.

This is joint work with Jason Eisner and will be presented at IWPT 2005.

Short biography:

Noah A. Smith is a PhD student at Johns Hopkins University. His graduate research has focused on unsupervised approaches to finding grammatical (or other useful) structure in text, by improving search (e.g., using deterministic annealing) and introducing novel, EM-rivaling objective functions (contrastive estimation; see CALD talk on September 15). Noah has contributed to the Giza statistical MT toolkit, the STRAND system for finding parallel text on the Web, and the Dyna programming language. He works with Jason Eisner and is supported by a fellowship from the Fannie and John Hertz Foundation.


 

August 26, 2005
Ahmed Rafea, CS Department, American University in Cairo

In this presentation, we will describe a fully grammar-based generation approach for task-oriented interlingua-based spoken dialogue that transforms a shallow semantic interlingua representation called Interchange Format (IF) into Arabic sentences that correspond to the intentions underlying the speakers' utterances. This interlingua was developed within the framework of the NEgotiating through SPOken Language in E-commerce (NESPOLE! ) project in Carnegie Mellon University (CMU). Our generator operates in two stages: mapping Interlingua representations into feature structures, and generating Arabic sentence from feature structures. During mapping, which is partially domain-specific, information about lexemes, tense, number and other properties of the utterance is used to generate the feature structure. A domain-specific ontology and a map lexicon are used to select the right words. During sentence generation, which is domain-independent, linguistic (lexical, morphological and grammatical) knowledge is used to generate an Arabic sentence that represents a speakers intention. We conducted automatic and human evaluations of our generation approach using the output from the English analyzer provided by CMU. The results of these evaluations were promising. Our Arabic generator can be integrated with the "Machine Translation of Spoken Arabic into English" package developed at CMU. The work presented is part of the collaborative project between Cairo University and Carnegie Mellon University, NSF Project Number: INT-0001613 US Egypt Joint Science and Technology Board Project Number: INF4001023. I co supervise the PhD thesis of Azza Abdel Monem that includes this work with Dr Khaled Shaalan.

Short Biography:

Dr. Ahmed Rafea got his Ph.D. from University Paul Sabatier, Toulouse, France in 1980. Dr. Rafea is a Computer Science Professor at the American university in Cairo He served as a Professor of CS and Vice Dean at Cairo University. He also served as a Visiting Professor in San Diego State University, and National University in the United States. Dr. Rafea was the principal investigator and manger of several projects for developing expert systems in Agriculture. He is currently the P.I of a collaborative research project on SMT with USC. His research interests include NLP, MT, KBS, and KDD.



 
 
 


LTI is part of the School of Computer Science at Carnegie Mellon University.
This page is maintained by The LTI Webmaster.