In the spring of 2013, Language Technologies Institute Ph.D. student Leonid Boytsov and Bilegsaikhan (Bileg) Naidan, a Ph.D. student at the Norwegian University of Science and Technology, had a problem. They needed to evaluate a novel nearest-neighbor search method for non-metric spaces, but no adequate software suite was available.
So they built one.
That software, Non-Metric Space Library (NMSLIB), is available to the public and gaining traction on GitHub, where it's offered under the business-friendly Apache 2 license.
Built on a codebase originally developed by Naidan, NMSLIB is designed to find nearest data neighbors in generic spaces. The nearest neighbor search is an old problem that requires finding data points close to a query point. These data points are the neighbors of the query with respect to some distance. For example, data points might be airport locations and airplane locations may serve as the query points. In an emergency, you may need to find the closest airport location — the nearest neighbor to the plane's location.
Since NMSLIB's inception, Boytsov and Naidan have worked with their colleagues to improve the software, in particular by incorporating new search methods. Two important additions have included the Hierarchical Navigable Small World Graph (HNSW) and a pivoting algorithm for sparse spaces. The former, contributed by Yury Malkov, a research fellow at the Institute of Applied Physics of the Russian Academy of Sciences, is a novel controllable hierarchy of navigable neighborhood graphs that provide a data structure connecting sufficiently close data points. It works blazingly fast in the case of dense vector spaces. The pivoting approach for sparse spaces, such as text data, was created by David Novak, a research fellow at Masaryk University, Czech Republic. While still in its early stages, the approach could become a replacement for traditional retrieval libraries such as Lucene in some domains.
Most research pertaining to the nearest neighbor problem focuses on data sets with metric distances, but there are situations in which metric distances don't apply. In these cases, search software like NMSLIB can be incredibly useful.
"We are different from most of the software that was created for this problem in at least two ways," said Boytsov, who is advised by LTI Professor and Director of the Master of Computational Data Science Program Eric Nyberg. "We are generic — that means we work for a lot of similarity measures, a lot of similarity functions, so we work with images and text. And we're fast. In some cases, we are so fast that we're outperforming other methods by a good margin. We believe we are pushing the state of the art in nearest neighbor search right now."
Learn more about their software on GitHub.