
LTI Seminar Abstracts Summer 2002


August 2, 2002
Talk 1: Yi Zhang, LTI Ph.D. student
Novelty and Redundancy Detection in Adaptive Filtering


We address the problem of extending an adaptive information filtering system to make decisions about the novelty and redundancy of relevant documents. We argue that relevance and redundancy should each be modeled explicitly and separately. A set of five redundancy measures is proposed and evaluated in experiments with and without redundancy thresholds. The experimental results demonstrate that the cosine similarity metric and a redundancy measure based on a mixture of language models are both effective for identifying redundant documents. This is a practice talk for a presentation at SIGIR 2002.
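
As a rough illustration of the thresholded cosine measure mentioned in this abstract, the sketch below flags a relevant document as redundant when it is too similar to any document already delivered. It is a minimal sketch and not the speaker's implementation: the whitespace tokenization, the raw term-count vectors, the function names, and the 0.8 threshold are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_redundant(new_doc: str, delivered: list, threshold: float = 0.8) -> bool:
    """True if new_doc is at least `threshold` similar to any delivered document.

    The threshold value is a hypothetical choice, not one taken from the talk.
    """
    new_vec = Counter(new_doc.lower().split())
    return any(
        cosine_similarity(new_vec, Counter(old.lower().split())) >= threshold
        for old in delivered
    )


delivered = ["stocks fell sharply on monday"]
print(is_redundant("stocks fell sharply on monday", delivered))
print(is_redundant("new vaccine approved by regulators", delivered))
```

Keeping the redundancy check separate from the relevance decision mirrors the abstract's point that the two should be modeled separately: only documents already judged relevant would be passed to a function like `is_redundant`.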

Talk 2: Luo Si, LTI Ph.D. student
Using Sampled Data and Regression to Merge Search Engine Results


This work addresses the problem of merging results obtained from different databases and search engines in a distributed information retrieval environment. Prior research on this problem either assumes the exchange of statistics necessary for normalizing scores (cooperative solutions) or is heuristic. Both approaches have disadvantages. We show that the problem in uncooperative environments is simpler when viewed as a component of a distributed IR system that uses query-based sampling to create resource descriptions. Documents sampled for creating resource descriptions can also be used to create a sample centralized index, and this index is a source of training data for adaptive results merging algorithms. A variety of experiments demonstrate that this new approach is more effective than a well-known alternative, and that it allows query-by-query tuning of the results merging function. This is a practice talk for the upcoming SIGIR conference.
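
The idea of using the sample centralized index as training data can be sketched as follows. Documents that appear both in a database's result list and in the centralized sample index provide pairs of (database-specific score, centralized score); a regression fitted to those pairs, per database and per query, maps every score from that database onto a common scale before the lists are merged. This is only a sketch under assumptions: the abstract does not state the form of the regression, so the simple linear fit, the function names, and the toy scores below are all hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("need at least two distinct database scores")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def merge_results(results_by_db, overlap_by_db):
    """Merge ranked lists after mapping each database's scores to a common scale.

    results_by_db: {db: [(doc_id, db_score), ...]} for the current query.
    overlap_by_db: {db: [(db_score, centralized_score), ...]} for documents
                   found both in that database's results and in the sample index.
    """
    merged = []
    for db, results in results_by_db.items():
        pairs = overlap_by_db[db]
        slope, intercept = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
        merged.extend((doc_id, slope * score + intercept) for doc_id, score in results)
    return sorted(merged, key=lambda item: item[1], reverse=True)


# Toy data: database A scores on a 0-10 scale, database B on a 0-1 scale.
results = {"A": [("a1", 10.0), ("a2", 5.0)], "B": [("b1", 0.9), ("b2", 0.2)]}
overlap = {"A": [(10.0, 0.8), (4.0, 0.2)], "B": [(1.0, 0.6), (0.0, 0.1)]}
for doc_id, score in merge_results(results, overlap):
    print(doc_id, round(score, 2))
```

Because the fit is recomputed from the overlap documents of each query, a scheme like this tunes the merging function query by query, which is the property the abstract highlights.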

June 28, 2002
Talk 1: Kathrin Probst, LTI Ph.D. student
Using Similarity Scoring To Improve the Bilingual Dictionary for Word Alignment


This is a practice talk for a presentation at ACL 2002. Automatic word alignment generally uses a bilingual dictionary that, in one form or another, is based on co-occurrence of words. In this talk, I will describe an approach to improving such a dictionary. In particular, we rebuild the bilingual dictionary by clustering similar words in a language and assigning them a higher co-occurrence score with a given word in the other language than each single word would have otherwise. The improved dictionary is evaluated using a version of the Competitive Linking alignment algorithm. Experimental results show a significant improvement in precision and recall for word alignment when the improved dictionary is used.

Talk 2: Chad Langley, LTI Ph.D. student
Analysis for Speech Translation Using Grammar-Based Parsing and Automatic Classification


I describe a novel approach to analysis for spoken language translation which uses a combination of phrase-level grammar-based parsing and automatic classification. The job of the analyzer is to transform spoken task-oriented utterances into a shallow semantic interlingua representation. The goal of this hybrid approach is to provide accurate real-time analyses and to improve robustness and portability to new domains and languages.


June 21, 2002 --Michael Witbrock, Cycorp
Thinking about the answers: Knowledge Formation and Question Answering with Cyc.


Over the past seventeen years, a team of computer scientists, philosophers and, recently, computational linguists have been pushing towards one vision of artificial intelligence: building a system that knows and can reason about large numbers of things in the world (Sid Vicious, toothpaste, mathematical sets and viruses spring to mind), and that can use this knowledge to perform useful tasks, notably working with people, and reading text, to increase its knowledge.

In the ARPA-funded Rapid Knowledge Formation Project, Cyc's one and a half million facts and rules about the world are used to drive interactive, English-language dialogues with lightly trained people, who can enter facts and rules about their specialities in an integrated and inferentially productive way. In the ARDA-funded Aquaint Project, Cyc's common sense knowledge is used to validate the results of statistical and template-driven information extraction, and to represent that information in a logical form that enables questions to be answered by combining information across multiple sources.

BIO: Dr. Michael Witbrock (Cycorp) has a PhD in Computer Science from Carnegie Mellon University, and is currently Director of the Knowledge Formation and Dialogue department at Cycorp. Previously he was Principal Scientist at Terra Lycos, working on integrating statistical and knowledge-based approaches to understanding web user behavior; a research scientist at Just Systems Pittsburgh Research Center, working on statistical summarization; and a systems scientist at Carnegie Mellon on the Informedia spoken document information retrieval project. He also performed dissertation work in the area of speaker modeling. He is the author of numerous publications in areas ranging across neural networks, parallel computer architecture, multimedia information retrieval, web browser design, genetic design, computational linguistics and speech recognition.


May 29, 2002 --Mark Sanderson, University of Sheffield, UK
Better estimation of term frequency


When discussing the utility of natural language processing techniques to information retrieval, it has sometimes been suggested that if anaphoric references within a text were resolved, one could calculate term frequency (within a document) based on a term's occurrences, both actual and anaphoric. This would result in a more accurate calculation of the term's usage in documents, with, presumably, a consequent improvement in retrieval effectiveness. When such ideas were tested in past work, no improvement in effectiveness was observed and no entirely satisfactory explanation for the result was put forward. My talk will present an analysis that leads to such an explanation: examining a set of documents where anaphoric references have been manually resolved; exploring the similarities of anaphor resolution to stemming, showing the impact of this commoner process on term frequencies; and reviewing past work on phrases. The analysis indicates that the lack of improvement in retrieval effectiveness shown in past work is due to the lack of any substantial or useful additional information being provided by the resolution of anaphors.

This is very much work in progress, and in the talk I'm keen to hear your views on further approaches that could be taken in the analysis.

Bio: Mark Sanderson has been a lecturer at the Information Studies Department at the University of Sheffield for several years. Prior to joining Sheffield he received his Ph.D. at the University of Glasgow under the direction of Keith van Rijsbergen. Mark's research has addressed a variety of topics in Information Retrieval, including word-sense disambiguation, use of hyperlink information, automatic forecasting of avalanches, creation of query-specific topic hierarchies, and pattern-based retrieval.


Webmaster: ehn@cs.cmu.edu



LTI is part of the School of Computer Science at Carnegie Mellon University.
This page is maintained by ckoch+@cs.cmu.edu, and was last updated 26 June 2002.