
LTI Seminar Abstracts Summer 2002
August 2, 2002
Talk 1: Yi Zhang, LTI Ph.D. student
Novelty and Redundancy Detection in Adaptive Filtering
We address the problem of extending an adaptive information filtering system
to make decisions about the novelty and redundancy of relevant documents.
We argue that relevance and redundancy should each be modelled explicitly
and separately. A set of five redundancy measures is proposed and evaluated
in experiments with and without redundancy thresholds. The experimental
results demonstrate that the cosine similarity metric and a redundancy measure
based on a mixture of language models are both effective for identifying
redundant documents. This is a practice talk for a presentation at SIGIR
2002.
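As a rough illustration of the cosine similarity measure mentioned above, the following Python sketch flags a newly delivered document as redundant when its cosine similarity to any previously delivered document reaches a threshold. The bag-of-words representation, the function names, and the threshold value of 0.8 are assumptions made for illustration; they are not details taken from the talk.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def is_redundant(new_doc: str, delivered: list[str], threshold: float = 0.8) -> bool:
    """Return True if new_doc is close to any previously delivered document.

    The threshold is a hypothetical value; in practice it would be tuned.
    """
    new_vec = Counter(new_doc.lower().split())
    return any(
        cosine_similarity(new_vec, Counter(old.lower().split())) >= threshold
        for old in delivered
    )
```

A document is compared only against documents already delivered to the user, so the first document on a topic is never redundant under this sketch.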
Talk 2: Luo Si, LTI Ph.D. student
Using Sampled Data and Regression to Merge Search Engine Results
This work addresses the problem of merging results obtained from different
databases and search engines in a distributed information retrieval environment.
Prior research on this problem either assumes the exchange of statistics
necessary for normalizing scores (cooperative solutions) or is heuristic.
Both approaches have disadvantages. We show that the problem in uncooperative
environments is simpler when viewed as a component of a distributed IR system
that uses query-based sampling to create resource descriptions. Documents
sampled for creating resource descriptions can also be used to create a
sample centralized index, and this index is a source of training data for
adaptive results merging algorithms. A variety of experiments demonstrate
that this new approach is more effective than a well-known alternative,
and that it allows query-by-query tuning of the results merging function.
This is a practice talk for the coming SIGIR conference.
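The idea described above can be sketched in Python as follows. For each database, documents that appear both in its result list and in the centralized sample index supply training pairs (database score, centralized score). A linear regression fitted to those pairs maps every database-specific score onto the centralized scale, and the mapped scores are merged into one ranking. The linear model, the data layout, and the names `fit_line` and `merge_results` are assumptions for illustration, not the method as presented in the talk.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a * x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return 0.0, mean_y  # degenerate case: all x values are equal
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return a, mean_y - a * mean_x


def merge_results(results_by_db, centralized_scores):
    """Merge per-database rankings into a single ranked list of doc ids.

    results_by_db: {db_name: {doc_id: database-specific score}}
    centralized_scores: {doc_id: score from the centralized sample index}
    """
    merged = []
    for results in results_by_db.values():
        overlap = [d for d in results if d in centralized_scores]
        if len(overlap) < 2:
            # Too few training pairs to fit a line. This sketch skips the
            # database; a real system would need a fallback strategy.
            continue
        a, b = fit_line(
            [results[d] for d in overlap],
            [centralized_scores[d] for d in overlap],
        )
        merged.extend((a * score + b, doc) for doc, score in results.items())
    return [doc for _, doc in sorted(merged, reverse=True)]
```

Because the regression is refitted from the overlap documents of each query's results, the merging function is tuned query by query.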
June 28, 2002
Talk 1: Kathrin Probst, LTI Ph.D. student
Using Similarity Scoring To Improve the Bilingual Dictionary for Word
Alignment
This is a practice talk for a presentation at ACL 2002.
Automatic word alignment generally uses a bilingual dictionary that, in
one form or another, is based on co-occurrence of words. In this talk, I
will describe an approach to improve such a dictionary. In particular, we
rebuild the bilingual dictionary by clustering similar words in a language
and assigning them a higher co-occurrence score with a given word in the
other language than each single word would have otherwise. The improved
dictionary is evaluated using a version of the Competitive Linking alignment
algorithm. Experimental results show a significant improvement in precision
and recall for word alignment when the improved dictionary is used.
Talk 2: Chad Langley, LTI Ph.D. student
Analysis for Speech Translation Using Grammar-Based Parsing and Automatic
Classification
I describe a novel approach to analysis for spoken language translation
which uses a combination of phrase-level grammar-based parsing and automatic
classification. The job of the analyzer is to transform spoken task-oriented
utterances into a shallow semantic interlingua representation. The goal
of this hybrid approach is to provide accurate real-time analyses and to
improve robustness and portability to new domains and languages.
June 21, 2002 --Michael Witbrock, Cycorp
Thinking about the answers: Knowledge Formation and Question Answering
with Cyc.
Over the past seventeen years, a team of computer scientists, philosophers
and, recently, computational linguists has been pushing towards one vision
of artificial intelligence: building a system that knows and can reason
about large numbers of things in the world (Sid Vicious, toothpaste, mathematical
sets and viruses spring to mind), and that can use this knowledge to perform
useful tasks, notably working with people, and reading text, to increase
its knowledge.
In the ARPA-funded Rapid Knowledge Formation Project, Cyc's one and a half
million facts and rules about the world are used to drive interactive, English-language
interactions with lightly trained people, who can enter facts and
rules about their specialities in an integrated and inferentially productive
way. In the ARDA-funded Aquaint Project, Cyc's common sense knowledge is
used to validate the results of statistical and template-driven information
extraction, and to represent that information in a logical form that enables
questions to be answered by combining information across multiple sources.
BIO: Dr. Michael Witbrock (Cycorp) has a PhD in Computer Science from Carnegie
Mellon University, and currently is Director of the Knowledge Formation
and Dialogue department at Cycorp. Previously he was Principal Scientist
at Terra Lycos, working on integrating statistical and knowledge-based approaches
to understanding web user behavior; a research scientist at Just Systems
Pittsburgh Research Center, working on statistical summarization; and a
systems scientist at Carnegie Mellon on the Informedia spoken document information
retrieval project. He also performed dissertation work in the area of speaker
modeling. He is the author of numerous publications in areas ranging across
neural networks, parallel computer architecture, multimedia information
retrieval, web browser design, genetic design, computational linguistics
and speech recognition.
May 29, 2002 --Mark Sanderson, University of Sheffield, UK
Better estimation of term frequency
When discussing the utility of natural language processing techniques to
information retrieval, it has sometimes been suggested that if anaphoric
references within a text were resolved, one could calculate term frequency
(within a document) based on a term's occurrences, both actual and anaphoric.
This would result in a more accurate calculation of the term's usage in
documents, with, presumably, a consequent improvement in retrieval effectiveness.
When such ideas were tested in past work, no improvement in effectiveness
was observed and no entirely satisfactory explanation for the result was
put forward. My talk will present an analysis that leads to such an explanation
by examining a set of documents where anaphoric references have been manually
resolved; exploring the similarities of anaphor resolution to stemming and
showing the impact of this more common process on term frequencies; and
reviewing past work on phrases. The analysis indicates that the lack of
improvement in retrieval effectiveness shown in past work is due to
anaphor resolution providing no substantial or useful additional
information.
This is very much work in progress and in the talk I'm keen to hear your
views on further approaches that could be taken in the analysis.
Bio: Mark Sanderson has been a lecturer at the Information Studies Department
at the University of Sheffield for several years. Prior to joining Sheffield he
received his Ph.D. at the University of Glasgow under the direction of Keith
van Rijsbergen. Mark's research has addressed a variety of topics in Information
Retrieval, including word-sense disambiguation, use of hyperlink information,
automatic forecasting of avalanches, creation of query-specific topic hierarchies,
and pattern-based retrieval.