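Linear regression and ridge regression are simple machine learning techniques that aim to estimate the parameters of a linear model. Assuming we have $n$ predictor points $\mathbf{x}_i$, $0 \le i < n$, of dimensionality $d$ and $n$ responses $y_i$, $0 \le i < n$, we are trying to estimate the best fit for $\beta_j$, $0 \le j \le d$, in the linear model
\[ y_i = \beta_0 + \sum_{j=1}^{d} \beta_j x_{ij} \]
for each predictor $\mathbf{x}_i$ and response $y_i$. If we take each predictor $\mathbf{x}_i$ as a row in the matrix $\mathbf{X}$ and each response $y_i$ as an entry of the vector $\mathbf{y}$, we can represent the model in vector form:
\[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \beta_0 \]
The result of this method is the vector $\boldsymbol{\beta}$, including the offset term (or intercept term) $\beta_0$.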
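The linear model above amounts to an intercept plus a weighted sum of a point's coordinates. A minimal sketch in standard C++ follows; the function name `PredictOne` is hypothetical and chosen only for illustration, and it is not part of mlpack.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Evaluate the linear model for a single predictor point:
//   y = beta0 + sum_j beta[j] * x[j]
// `beta0` is the intercept; `beta` holds one coefficient per dimension.
// (Hypothetical helper for illustration; not an mlpack function.)
double PredictOne(double beta0,
                  const std::vector<double>& beta,
                  const std::vector<double>& x)
{
  // The point must have exactly one value per coefficient.
  assert(beta.size() == x.size());

  double y = beta0;
  for (std::size_t j = 0; j < beta.size(); ++j)
    y += beta[j] * x[j];
  return y;
}
```

For example, with an intercept of 0.5 and coefficients (2, -1), the point (3, 4) evaluates to 0.5 + 6 - 4 = 2.5.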
The simplest way to perform linear regression or ridge regression in mlpack is to use the mlpack_linear_regression executable. This program will perform linear regression and place the resultant coefficients into one file.
The output file holds a vector of coefficients in increasing order of dimension; that is, the offset term ($\beta_0$), the coefficient for dimension 1 ($\beta_1$), then dimension 2 ($\beta_2$), and so forth. This executable can also predict the values of a second dataset based on the computed coefficients.
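Below are several examples of simple usage (and the resultant output). The --verbose (or -v) option is used so that verbose output is given. Further documentation on each individual option can be found by typing
$ mlpack_linear_regression --help
The first example trains a model on dataset.csv and saves it to lr.xml:
$ mlpack_linear_regression --training_file dataset.csv --output_model_file lr.xml -v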
[INFO ] Loading 'dataset.csv' as CSV data. Size is 2 x 5.
[INFO ]
[INFO ] Execution parameters:
[INFO ] help: false
[INFO ] info: ""
[INFO ] input_model_file: ""
[INFO ] lambda: 0
[INFO ] output_model_file: lr.xml
[INFO ] output_predictions: predictions.csv
[INFO ] test_file: ""
[INFO ] training_file: dataset.csv
[INFO ] training_responses: ""
[INFO ] verbose: true
[INFO ] version: false
[INFO ]
[INFO ] Program timers:
[INFO ] load_regressors: 0.000263s
[INFO ] loading_data: 0.000220s
[INFO ] regression: 0.000392s
[INFO ] total_time: 0.001920s
Convenient program timers are given for different parts of the calculation at the bottom of the output, as well as the parameters the program was run with. The trained model is written to the output model file, lr.xml.
In dataset.csv, every response equals its predictor, so the function for this input is $y = 0 + 1 x_1$. The model we have trained captures this: in the <parameters> section of lr.xml, there are two elements, which are (approximately) 0 and 1. The first element corresponds to the intercept 0, and the second element corresponds to the coefficient 1 for the variable $x_1$. Note that in this example, the responses for the dataset are the second column. That is, the dataset is one-dimensional, and the last column has the $y$ values, or responses, for each row. You can specify these responses in a separate file if you want, using the --training_responses (or -r) option.
To train a model and predict responses for a second dataset in one run, pass the test points with the --test_file option:
$ mlpack_linear_regression --training_file dataset.csv --test_file predict.csv -v
[INFO ] Loading 'dataset.csv' as CSV data. Size is 2 x 5.
[INFO ] Loading 'predict.csv' as raw ASCII formatted data. Size is 1 x 3.
[INFO ] Saving CSV data to 'predictions.csv'.
[INFO ]
[INFO ] Execution parameters:
[INFO ] help: false
[INFO ] info: ""
[INFO ] input_model_file: ""
[INFO ] lambda: 0
[INFO ] output_model_file: ""
[INFO ] output_predictions: predictions.csv
[INFO ] test_file: predict.csv
[INFO ] training_file: dataset.csv
[INFO ] training_responses: ""
[INFO ] verbose: true
[INFO ] version: false
[INFO ]
[INFO ] Program timers:
[INFO ] load_regressors: 0.000371s
[INFO ] load_test_points: 0.000229s
[INFO ] loading_data: 0.000491s
[INFO ] prediction: 0.000075s
[INFO ] regression: 0.000449s
[INFO ] saving_data: 0.000186s
[INFO ] total_time: 0.002731s
$ cat dataset.csv
0,0
1,1
2,2
3,3
4,4
$ cat predict.csv
2
3
4
$ cat predictions.csv
2.0000000000e+00
3.0000000000e+00
4.0000000000e+00
We used the same dataset, so we got the same parameters. The key thing to note about the predict.csv dataset is that it has the same dimensionality as the dataset used to create the model: one. If the dataset used to generate the model has $d$ dimensions, so must the dataset we want to predict for.
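The fit and the predictions shown above can be checked by hand for a single predictor. The sketch below uses the textbook closed form for one-dimensional least squares; it is not how mlpack computes the model internally, and the names `SimpleFit`, `FitSimple`, and `PredictSimple` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for a one-dimensional fit: y = intercept + slope * x.
struct SimpleFit
{
  double intercept;
  double slope;
};

// Ordinary least squares for a single predictor, using the closed form
//   slope = S_xy / S_xx,   intercept = mean(y) - slope * mean(x).
// Assumes at least two distinct x values, so that S_xx is non-zero.
SimpleFit FitSimple(const std::vector<double>& x, const std::vector<double>& y)
{
  assert(x.size() == y.size() && x.size() >= 2);

  double sumX = 0.0, sumY = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i)
  {
    sumX += x[i];
    sumY += y[i];
  }
  const double meanX = sumX / static_cast<double>(x.size());
  const double meanY = sumY / static_cast<double>(x.size());

  // Centered sums of squares and cross products.
  double sxx = 0.0, sxy = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i)
  {
    sxx += (x[i] - meanX) * (x[i] - meanX);
    sxy += (x[i] - meanX) * (y[i] - meanY);
  }

  const double slope = sxy / sxx;
  return SimpleFit{meanY - slope * meanX, slope};
}

// Predict the response for one new one-dimensional point.
double PredictSimple(const SimpleFit& fit, double x)
{
  return fit.intercept + fit.slope * x;
}
```

Fitting the five rows of dataset.csv this way gives an intercept of 0 and a slope of 1, so the points 2, 3, and 4 from predict.csv map to 2, 3, and 4, in agreement with predictions.csv.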
Sometimes, the input matrix of predictors has a covariance matrix that is not invertible, or the system is underdetermined. In this case, ridge regression is useful: it adds a regularization term to the covariance matrix to make it invertible. Ridge regression is a standard technique, and the mathematics behind it is widely documented. In short, the covariance matrix
\[ \mathbf{X}^T \mathbf{X} \]
is replaced with
\[ \mathbf{X}^T \mathbf{X} + \lambda \mathbf{I} \]
where $\mathbf{I}$ is the identity matrix. So, a $\lambda$ parameter greater than zero should be specified to perform ridge regression, using the --lambda (or -l) option. An example is given below.
$ mlpack_linear_regression --training_file dataset.csv --output_model_file lr.xml --lambda 0.5 -v
[INFO ] Loading 'dataset.csv' as CSV data. Size is 2 x 5.
[INFO ]
[INFO ] Execution parameters:
[INFO ] help: false
[INFO ] info: ""
[INFO ] input_model_file: ""
[INFO ] lambda: 0.5
[INFO ] output_model_file: lr.xml
[INFO ] output_predictions: predictions.csv
[INFO ] test_file: ""
[INFO ] training_file: dataset.csv
[INFO ] training_responses: ""
[INFO ] verbose: true
[INFO ] version: false
[INFO ]
[INFO ] Program timers:
[INFO ] load_regressors: 0.000210s
[INFO ] loading_data: 0.000170s
[INFO ] regression: 0.000332s
[INFO ] total_time: 0.001835s
Further documentation on options can be found by using the --help option.
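To see what the ridge term does, consider a single predictor with centered data. There the covariance matrix reduces to the scalar $S_{xx}$, and the replacement described above becomes $S_{xx} + \lambda$. The sketch below implements that textbook variant, leaving the intercept unpenalized. This is an assumption made for illustration: mlpack's exact handling of the intercept may differ, and the names `RidgeFit` and `FitRidgeSimple` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for a one-dimensional fit: y = intercept + slope * x.
struct RidgeFit
{
  double intercept;
  double slope;
};

// One-dimensional ridge regression on centered data:
//   slope = S_xy / (S_xx + lambda),   intercept = mean(y) - slope * mean(x).
// With lambda > 0 the denominator is strictly positive even when S_xx == 0,
// which is the scalar analogue of making the covariance matrix invertible.
// With lambda == 0 this is ordinary least squares and requires S_xx != 0.
RidgeFit FitRidgeSimple(const std::vector<double>& x,
                        const std::vector<double>& y,
                        double lambda)
{
  assert(x.size() == y.size() && !x.empty());
  assert(lambda >= 0.0);

  double sumX = 0.0, sumY = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i)
  {
    sumX += x[i];
    sumY += y[i];
  }
  const double meanX = sumX / static_cast<double>(x.size());
  const double meanY = sumY / static_cast<double>(x.size());

  // Centered sums of squares and cross products.
  double sxx = 0.0, sxy = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i)
  {
    sxx += (x[i] - meanX) * (x[i] - meanX);
    sxy += (x[i] - meanX) * (y[i] - meanY);
  }

  const double slope = sxy / (sxx + lambda);
  return RidgeFit{meanY - slope * meanX, slope};
}
```

Under these assumptions, the five rows of dataset.csv with a lambda of 0.5 give a slope of 10 / 10.5 (about 0.952) and an intercept of about 0.095, so the fit is shrunk slightly away from the exact line. If every predictor value is identical, $S_{xx}$ is zero and ordinary least squares has no unique solution, but any positive lambda still yields a slope of 0 and an intercept equal to the mean response.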
The 'LinearRegression' class
The 'LinearRegression' class is a simple implementation of linear regression.
Using the LinearRegression class is very simple. It has two constructors: one that generates a model from a matrix of predictors and a vector of responses, and one that loads an already computed model from a given file.
The class provides one method that performs computation, Predict(). Once you have generated or loaded a model, you can call this method and pass it a matrix of data points to predict values for using the model. The second parameter, predictions, will be modified to contain the predicted values corresponding to each row of the points matrix.
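The calling pattern described above, where the caller passes in the points and an output vector that the method fills, can be sketched with a small stand-in class. `ToyLinearModel` is hypothetical and uses only the standard library; it is not mlpack's LinearRegression class and does not reproduce its signatures or matrix types.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for an already trained model; not mlpack's class.
// parameters[0] is the intercept and parameters[j] is the coefficient
// for dimension j, matching the ordering of the output file described earlier.
class ToyLinearModel
{
 public:
  explicit ToyLinearModel(std::vector<double> parameters)
      : parameters_(std::move(parameters))
  {
    assert(!parameters_.empty());
  }

  // Overwrite `predictions` with one predicted value per row of `points`.
  void Predict(const std::vector<std::vector<double>>& points,
               std::vector<double>& predictions) const
  {
    // Start every prediction at the intercept.
    predictions.assign(points.size(), parameters_[0]);
    for (std::size_t i = 0; i < points.size(); ++i)
    {
      // Each point needs one value per non-intercept coefficient.
      assert(points[i].size() + 1 == parameters_.size());
      for (std::size_t j = 0; j < points[i].size(); ++j)
        predictions[i] += parameters_[j + 1] * points[i][j];
    }
  }

 private:
  std::vector<double> parameters_;
};
```

Taking the output vector by reference lets the caller reuse one buffer across repeated calls. With parameters (0, 1) and the three one-dimensional points 2, 3, and 4, the vector is filled with 2, 3, and 4.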
As discussed in the ridge regression section above, ridge regression is useful when the covariance of the predictors is not invertible. The standard constructor accepts a value of lambda for this purpose.