Nearest-neighbors search is a common machine learning task. In this setting, we have a query and a reference dataset. For each point in the query dataset, we wish to know the points in the reference dataset which are closest to the given query point.
Alternately, if the query and reference datasets are the same, the problem can be stated more simply: for each point in the dataset, we wish to know the nearest points to that point.
The simplest way to perform nearest-neighbors search in mlpack is to use the mlpack_knn executable. This program will perform nearest-neighbors search and place the resultant neighbors into one file and the resultant distances into another. The output files are organized such that the first row corresponds to the nearest neighbors of the first query point, with the first column corresponding to the nearest neighbor, and so forth.
Below are several examples of simple usage (and the resultant output). The -v option is used so that output is given. Further documentation on each individual option can be found by typing
[INFO ] Loading 'dataset.csv' as CSV data. Size is 3 x 1000.
[INFO ] Loaded reference data from 'dataset.csv' (3 x 1000).
[INFO ] Building reference tree...
[INFO ] Tree built.
[INFO ] Searching for 5 nearest neighbors with dual-tree kd-tree search...
[INFO ] 18412 node combinations were scored.
[INFO ] 54543 base cases were calculated.
[INFO ] Search complete.
[INFO ] Saving CSV data to 'neighbors_out.csv'.
[INFO ] Saving CSV data to 'distances_out.csv'.
[INFO ]
[INFO ] Execution parameters:
[INFO ] distances_file: distances_out.csv
[INFO ] help: false
[INFO ] info: ""
[INFO ] input_model_file: ""
[INFO ] k: 5
[INFO ] leaf_size: 20
[INFO ] naive: false
[INFO ] neighbors_file: neighbors_out.csv
[INFO ] output_model_file: ""
[INFO ] query_file: ""
[INFO ] random_basis: false
[INFO ] reference_file: dataset.csv
[INFO ] seed: 0
[INFO ] single_mode: false
[INFO ] tree_type: kd
[INFO ] verbose: true
[INFO ] version: false
[INFO ]
[INFO ] Program timers:
[INFO ] computing_neighbors: 0.108968s
[INFO ] loading_data: 0.006495s
[INFO ] saving_data: 0.003843s
[INFO ] total_time: 0.126036s
[INFO ] tree_building: 0.003442s
Convenient program timers are given for different parts of the calculation at the bottom of the output, as well as the parameters the simulation was run with. Now, if we look at the output files:
So, the nearest neighbor to point 0 is point 862, with a distance of 5.986076164057e-02. The second nearest neighbor to point 0 is point 344, with a distance of 7.664920518084e-02. The third nearest neighbor to point 5 is point 751, with a distance of 1.085637706630e-01.
Using the KNN class is particularly simple; first, the object must be constructed and given a dataset. Then, the method is run, and two matrices are returned: one which holds the indices of the nearest neighbors, and one which holds the distances of the nearest neighbors. These are of the same structure as the output –neighbors_file and –distances_file for the CLI interface (see above). A handful of examples of simple usage of the KNN class are given below.
By choosing different components for each of these template classes, a very arbitrary neighbor searching object can be constructed. Note that each of these template parameters have defaults, so it is not necessary to specify each one.
SortPolicy policy class
The SortPolicy template parameter allows specification of how the NeighborSearch object will decide which points are to be searched for. The mlpack::neighbor::NearestNeighborSort class is a well-documented example. A custom SortPolicy class must implement the same methods which NearestNeighborSort does:
The mlpack::neighbor::FurthestNeighborSort class is another implementation, which is used to create the 'KFN' typedef class, which finds the furthest neighbors, as opposed to the nearest neighbors.
MetricType policy class
The MetricType policy class allows the neighbor search to take place in any arbitrary metric space. The mlpack::metric::LMetric class is a good example implementation. A MetricType class must provide the following functions:
// Empty constructor is required.
MetricType();
// Compute the distance between two points.
template<typename VecType>
double Evaluate(const VecType& a, const VecType& b);
Internally, the NeighborSearch class keeps an instantiated MetricType class (which can be given in the constructor). This is useful for a metric like the Mahalanobis distance (mlpack::metric::MahalanobisDistance), which must store state (the covariance matrix). Therefore, you can write a non-static MetricType class and use it seamlessly with NeighborSearch.
For more information on the MetricType policy, see the documentation here.
MatType policy class
The MatType template parameter specifies the type of data matrix used. This type must implement the same operations as an Armadillo matrix, and so standard choices are arma::mat and arma::sp_mat.
TreeType policy class
The NeighborSearch class allows great extensibility in the selection of the type of tree used for search. This type must follow the typical mlpack TreeType policy, documented here.
The last template parameter the NeighborSearch class offers is the TraverserType class. The TraverserType class holds the strategy used to traverse the trees in either single-tree or dual-tree search mode. By default, it is set to use the default traverser of the given TreeType (which is the member TreeType::DualTreeTraverser).
This class must implement the following two methods:
Note also that any traverser given must satisfy the definition of a pruning dual-tree traversal given in the paper "Tree-independent dual-tree algorithms".