There are more than 6,000 human languages, but fewer than a hundred of them have robust language technologies such as search engines, spell checkers or speech recognition. Nevertheless, speakers of these technologically under-resourced languages may have uses for language technologies to access information related to education, politics, business, weather and health conditions — not to mention preservation of culture and community relationships. My work focuses on the linguistic aspects of language technologies, specifically the following:
Language Technologies With Low Resources
Many languages lack sufficient data for supervised or unsupervised machine learning. Low-resource scenarios also arise for technology-rich languages like English in specialized styles or subject areas. Such cases may call for hybrids of human-engineered knowledge and machine learning, where the human-engineered knowledge takes the form of handwritten rules, priors or feature engineering.
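As a toy sketch of such a hybrid (the words, tags and rules below are purely illustrative, not drawn from any particular project), a part-of-speech guesser might consult a small lexicon learned from limited annotated data, and back off to handwritten suffix rules when the lexicon fails:

```python
# Toy hybrid POS guesser: a small "learned" lexicon (standing in for a
# statistical model trained on scarce data) backed off to handwritten
# suffix rules (human-engineered knowledge). All data is illustrative.

# Learned component: entries from a (very) small annotated corpus.
lexicon = {
    "walks": "VERB",
    "dog": "NOUN",
    "quickly": "ADV",
}

# Human-engineered component: handwritten English suffix rules.
suffix_rules = [
    ("ing", "VERB"),
    ("ly", "ADV"),
    ("tion", "NOUN"),
]

def tag(word):
    """Prefer the learned lexicon; fall back to rules, then a default."""
    if word in lexicon:
        return lexicon[word]                # data-driven knowledge
    for suffix, pos in suffix_rules:
        if word.endswith(suffix):
            return pos                      # human-engineered knowledge
    return "NOUN"  # open-class default for unknown words

print(tag("dog"), tag("running"), tag("zorp"))
```

In a real system the two components would be combined less crudely, for example by treating the rules as priors or as features inside a statistical model rather than as a hard backoff.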
Language Universals and Typology
When there is insufficient time or data to build an NLP system, it is useful to fall back on what is known about human languages in general or what is known about related languages. The field of linguistic typology and universals provides expectations for what languages might be like. We are exploring how to use typology and universals to develop language technologies for new languages on short timelines or in low-resource scenarios.
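For illustration only (the feature names are loosely modeled on WALS-style typological categories, and the values are hard-coded rather than queried from a real typology database), one could seed a new language's parsing expectations from coarse typological knowledge, falling back to cross-linguistic majority patterns when even that is unavailable:

```python
# Illustrative sketch: derive parsing expectations from coarse
# typological features when no treebank exists for a language.
# Feature values are hard-coded for the example; a real system
# would consult a typological database such as WALS.

typology = {
    "japanese": {"word_order": "SOV", "adposition": "postposition"},
    "irish":    {"word_order": "VSO", "adposition": "preposition"},
}

def parsing_priors(language):
    """Turn coarse typological features into simple parsing heuristics."""
    feats = typology.get(language)
    if feats is None:
        # Nothing known: fall back on cross-linguistic majority
        # expectations (SOV is the most common basic word order).
        feats = {"word_order": "SOV", "adposition": "postposition"}
    return {
        "head_final": feats["word_order"].endswith("V"),
        "expect_postpositions": feats["adposition"] == "postposition",
    }

print(parsing_priors("japanese"))
print(parsing_priors("some-undocumented-language"))
```

The point of the sketch is the fallback structure: known facts about the specific language are used when available, and universals or typological tendencies fill the gaps.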
Corpus Annotation and Linguistic Resources
When time and data are available, language technologies can be based on supervised learning from annotated data. The annotations may capture any level of linguistic knowledge, from sounds to social hierarchies. My approach to annotation is based on the linguistic theory of construction grammar.