This package provides functions for performing basic Linear and Ridge Regression on datasets. The reader is assumed to be familiar with both techniques, so they are not explained here.
The functions available to users of this package are described in the section “Function Descriptions”. The section “Example Usage” gives examples of their usage and shows how the functions should be used together.
Limitations:
This package was created as part of an Introductory Data Science course, and its functions are not especially robust: they perform minimal checking of their input parameters. The user must therefore ensure that the inputs to these functions have the correct type and dimensions.
All calculations are performed on in-memory data, i.e. each dataset is read into memory in full and all calculations are performed there. This package is therefore not suitable for very large datasets.
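Because the functions do little checking of their own, it may help to validate inputs before calling them. The following is a minimal sketch in base R; `check_inputs` is a hypothetical helper and not part of this package, and the checks are assumptions based on the type and dimension requirements described in this document.

```r
# Hypothetical helper (NOT part of the package): basic type and dimension
# checks on X and y before they are passed to the package functions.
check_inputs <- function(X, y) {
  stopifnot(
    is.matrix(X), is.numeric(X),  # X must be a numeric matrix
    is.matrix(y), is.numeric(y),  # y must be a numeric matrix
    ncol(y) == 1,                 # y is an (n x 1) column vector
    nrow(X) == nrow(y),           # one label per data point
    all(X[, 1] == 1)              # first column of X is the column of 1's
  )
  invisible(TRUE)
}

data(mtcars)
X <- cbind(1, as.matrix(mtcars[, -1]))  # augment with a column of 1's
y <- matrix(mtcars[, 1], ncol = 1)      # labels as an (n x 1) matrix
check_inputs(X, y)                      # passes silently
```

If any check fails, `stopifnot` raises an error naming the failed condition, for example when the column of 1's has been forgotten.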
This package provides three functions to its users. Each of these is described below:
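The predict function calculates label values from a matrix of data points and a coefficient vector. Specifically: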
| \(\overline{y}\) | Return Value: The calculated label values. An (n x 1) column vector. |
| X | Matrix of data points. Has n rows (one per data point) and m columns (one per parameter of a data point). The first column of this matrix must be the augmenting column of 1’s. |
| \(\beta\) | Column vector (m x 1) of coefficients used to derive predicted label values. |
| missingAction | Should not be changed from the default value! |
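The dimensions in the table above fit together as a matrix product: an (n x m) matrix times an (m x 1) vector gives an (n x 1) vector. A minimal sketch of that calculation follows; `predict_sketch` is a hypothetical name, and the assumption that the prediction is simply the product of X and \(\beta\) is ours, not taken from the package source.

```r
# Hypothetical stand-in for the prediction step (NOT the package's code).
# Assumes predicted labels are the matrix product X %*% beta.
predict_sketch <- function(X, beta) {
  stopifnot(ncol(X) == nrow(beta))  # (n x m) times (m x 1)
  X %*% beta                        # (n x 1) column vector of predicted labels
}

X <- cbind(1, c(1, 2, 3))            # 3 data points, 1 parameter, plus the column of 1's
beta <- matrix(c(0.5, 2), ncol = 1)  # intercept 0.5, slope 2
y_hat <- predict_sketch(X, beta)     # 0.5 + 2 * c(1, 2, 3)
dim(y_hat)                           # 3 rows, 1 column
```

The column of 1's in X is what lets the first coefficient act as the intercept.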
The linreg function calculates the coefficient vector \(\beta\). When \(\lambda = 0\) (the default case) the function performs Linear Regression only. When \(\lambda > 0\), Ridge Regression is performed using \(\lambda\) as the “ridge” parameter.
Specifically:
| \(\beta\) | Return Value: The calculated coefficient vector. An (m x 1) column vector. |
| y | The label vector associated with the data points in X. An (n x 1) column vector. |
| X | Matrix of data points. Has n rows (one per data point) and m columns (one per parameter of a data point). The first column of this matrix must be the augmenting column of 1’s. |
| ridge | The ‘ridge’ parameter; must be non-negative. When 0, Linear Regression is performed. When \(>0\), Ridge Regression is performed with the specified ‘ridge’ value. |
| missingAction | Should not be changed from the default value! |
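To illustrate how a single ‘ridge’ parameter can cover both cases, here is a minimal sketch using the standard closed-form solution \(\beta = (X^T X + \lambda I)^{-1} X^T y\). `linreg_sketch` is a hypothetical name. Whether the package uses this formula, and whether it penalises the intercept, is not stated in this document, so both are assumptions.

```r
# Hypothetical stand-in for the coefficient calculation (NOT the package's code).
# Assumes the closed form beta = (X'X + lambda * I)^(-1) X'y, with every
# coefficient (including the intercept) penalised.
linreg_sketch <- function(y, X, ridge = 0) {
  stopifnot(ridge >= 0, nrow(X) == nrow(y))
  A <- crossprod(X) + ridge * diag(ncol(X))  # X'X + lambda * I, an (m x m) matrix
  solve(A, crossprod(X, y))                  # (m x 1) coefficient vector
}

data(mtcars)
X <- cbind(1, as.matrix(mtcars[, -1]))
y <- matrix(mtcars[, 1], ncol = 1)

beta_ols   <- linreg_sketch(y, X)             # ridge = 0: ordinary least squares
beta_ridge <- linreg_sketch(y, X, ridge = 2)  # ridge > 0: coefficients are shrunk
```

With `ridge = 0` the result should agree with `coef(lm(mpg ~ ., data = mtcars))` up to numerical error, which is a convenient sanity check for the sketch.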
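The selectlambda function tries a range of ‘ridge’ values and reports the one with the lowest mean RSS error. Specifically: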
| result | Return Value: A list with two entries: (i) min_lambda, the ridge value that minimizes the mean RSS error; (ii) min_mean, the minimum mean RSS value, attained at that lambda. |
| y | The label vector associated with the data points in X. An (n x 1) column vector. |
| X | Matrix of data points. Has n rows (one per data point) and m columns (one per parameter of a data point). The first column of this matrix must be the augmenting column of 1’s. |
| lambda.max | The maximum ‘ridge’ value to be tried; must be non-negative. |
| n.lambda | Number of ‘ridge’ values to try. E.g. with lambda.max=20 and n.lambda=4, the ridge values tested are: 0.000000, 6.666667, 13.333333, 20.000000 |
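The ridge values in the n.lambda example are evenly spaced between 0 and lambda.max. A sketch of how such a grid can be produced in base R is shown below; that the package builds its grid with `seq` is an assumption, although the spacing matches the example values given above.

```r
# Evenly spaced grid of candidate ridge values from 0 to lambda.max.
# Assumption: this mirrors the grid described in the table; the package's
# own construction is not shown in this document.
lambda.max <- 20
n.lambda   <- 4
lambdas <- seq(0, lambda.max, length.out = n.lambda)
print(lambdas)  # four values: 0, 20/3, 40/3 and 20
```

Because the grid always starts at 0, plain Linear Regression is always one of the candidates.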
The example below applies the functions available in this package to the built-in R dataset “mtcars”.
# Load our package
library("assesment5")
##
## Attaching package: 'assesment5'
## The following object is masked from 'package:stats':
##
## predict
# Load the 'mtcars' dataset - part of R 'built-in' datasets
data(mtcars)
# Sample records:
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# Split the data into:
# data (X) has details about the cars - this is the set of data points
# (all columns except 1st column)
# labels (y) - has the petrol usage as mpg - this is the label data we
# want to predict (1st column)
# Note: 1. We always need to augment the data (X) matrix with a column of 1's
# 2. data read in as a dataframe needs to be "converted" to matrices
X <- as.matrix(mtcars[,-1])
X <- cbind(1,X)
y <- mtcars[,1]
y <- matrix(y, ncol = 1)  # (n x 1) column vector
# Calculate the minimum mean RSS error and the lambda at which it occurred
result <- selectlambda(y,X, n.lambda=30)
# print critical values from 'result':
cat('Minimum rss-mean:', result$min_mean, ', Min-lambda:', result$min_lambda)
## Minimum rss-mean: 52.45021 , Min-lambda: 2.068966
# Lambda != 0, so plain Linear Regression was not optimal.
# Let us calculate y_hat (the estimated label values) and check the
# difference from the actuals.
# First determine coefficient params:
beta <- linreg(y, X, ridge=result$min_lambda)
# Next, calculate y_hat from this 'beta'
y_hat <- predict(X, beta)
# Show diffs as percentage of actuals:
head(100*(abs(y - y_hat)/y))
##                        [,1]
## Mazda RX4          3.656302
## Mazda RX4 Wag      3.410764
## Datsun 710        14.022872
## Hornet 4 Drive     3.417813
## Hornet Sportabout  8.363229
## Valiant           12.968367
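The per-record percentages above can also be condensed into single summary figures. The sketch below defines two hypothetical helpers, `rss` and `mape`, which are not part of this package, and applies them to toy values chosen only to make the arithmetic easy to follow. In practice they would be called with the `y` and `y_hat` from the example.

```r
# Hypothetical helpers (NOT part of the package) for summarising a fit.
rss  <- function(y, y_hat) sum((y - y_hat)^2)                   # residual sum of squares
mape <- function(y, y_hat) mean(100 * abs(y - y_hat) / abs(y))  # mean absolute % error; needs y != 0

# Toy values, not taken from mtcars.
y     <- matrix(c(10, 20, 40), ncol = 1)
y_hat <- matrix(c(11, 18, 40), ncol = 1)

rss(y, y_hat)   # 1 + 4 + 0 = 5
mape(y, y_hat)  # mean of 10%, 10% and 0%
```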