Introduction

This package provides functions for performing basic Linear Regression and Ridge Regression on datasets. The reader is assumed to be familiar with both techniques, so they are not explained further here.

The functions available to users of this package are described in the section “Function Descriptions”. The section “Example Usage” gives examples of their usage and shows how the functions should be used together.

Limitations:
This package was created as part of an Introductory Data Science course and its functions are not robust: they perform only minimal checking of their input parameters. Consequently, the user must ensure that the inputs to the functions in this package are correct in terms of type and dimension.
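Because the package does little checking itself, it can help to validate the inputs before calling its functions. The following is a minimal sketch of such a check. The helper name check_inputs is hypothetical and not part of this package, and the specific checks (numeric matrix, matching row counts, leading column of 1's, no missing values) are assumptions derived from the parameter descriptions in this document.

```r
# Hypothetical helper (NOT part of the package): basic sanity checks on y and X
check_inputs <- function(y, X) {
  stopifnot(
    is.matrix(X), is.numeric(X),   # X must be a numeric matrix
    is.numeric(y),                 # y must be numeric
    NROW(y) == nrow(X),            # one label per data point
    all(X[, 1] == 1),              # first column of X must be the column of 1's
    !anyNA(X), !anyNA(y)           # assumption: missing values are not accepted
  )
  invisible(TRUE)
}

# Usage on the built-in 'mtcars' dataset
X <- cbind(1, as.matrix(mtcars[, -1]))
y <- matrix(mtcars[, 1], ncol = 1)
check_inputs(y, X)   # passes silently; stops with an error if a check fails
```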

All calculations are performed on in-memory data, i.e. each dataset is read into memory in full and all calculations are carried out there. This means that this package is not suitable for very large datasets.

Function Descriptions

This package provides three functions to its users. Each of these is described below:

predict(X, \(\beta\), missingAction= “fail”)

This function calculates predicted label values from a data set and a vector of coefficients. Effectively, it performs the following simple multiplication of a matrix (X) and a column vector (\(\beta\)):

\(\overline{y} = X \beta\)


Specifically:

\(\overline{y}\) Return Value: The calculated label values. An (n x 1) column vector.
X Matrix of data points. Has n rows (1 per data point) and m columns: a first column of 1’s, which the user must add, followed by 1 column per parameter of a data point.
\(\beta\) Column vector (m x 1) of coefficients used to derive predicted label values.
missingAction Should not be changed from the default value!
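The multiplication described above can be sketched in a few lines of base R. This is an illustration only: predict_sketch is a hypothetical name, the toy data are made up for the example, and the package's own predict function may differ in detail (for instance in how it handles missingAction).

```r
# Sketch only: 'predict_sketch' is a hypothetical name, not the package's function
predict_sketch <- function(X, beta) {
  stopifnot(ncol(X) == NROW(beta))   # columns of X must match the length of beta
  X %*% beta                         # (n x m) %*% (m x 1) -> (n x 1)
}

# Toy data: 3 data points, the column of 1's plus one parameter
X <- cbind(1, c(1, 2, 3))
beta <- matrix(c(0.5, 2), ncol = 1)  # intercept 0.5, slope 2
predict_sketch(X, beta)              # a (3 x 1) matrix holding 2.5, 4.5, 6.5
```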


linreg(y, X, missingAction= “fail”, ridge = 0)

This function performs both the Linear and the Ridge Regression. Its purpose is to determine the “best” set of coefficients (\(\beta\)), which can then be used to predict label values for new data points. The coefficients can be determined using either Linear Regression or Ridge Regression.
This function is invoked as shown below:

\(\beta\) = linreg(y, X, ridge=\(\lambda\))


When \(\lambda = 0\) (the default) the function performs plain Linear Regression. When \(\lambda > 0\), Ridge Regression is performed using \(\lambda\) as the “ridge” parameter.
Specifically:

\(\beta\) Return Value: The calculated coefficient vector. An (m x 1) column vector.
y The label vector associated with the data points in X. An (n x 1) column vector.
X Matrix of data points. Has n rows (1 per data point) and m columns: a first column of 1’s, which the user must add, followed by 1 column per parameter of a data point.
ridge The ‘ridge’ parameter - must be non-negative! When 0, Linear Regression is performed. When \(>0\), Ridge Regression is performed with the specified ‘ridge’ value.
missingAction Should not be changed from the default value!
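For orientation, the textbook closed-form solution that matches this description is \(\beta = (X^T X + \lambda I)^{-1} X^T y\), which reduces to ordinary least squares when \(\lambda = 0\). The sketch below implements that formula in base R. It is not the package's source: linreg_sketch is a hypothetical name, and the sketch assumes that every coefficient, including the intercept, is penalised, which this document does not state.

```r
# Sketch only: textbook closed form, not necessarily what the package's linreg does
linreg_sketch <- function(y, X, ridge = 0) {
  stopifnot(ridge >= 0, nrow(X) == NROW(y))
  m <- ncol(X)
  # Solve (X'X + ridge * I) beta = X'y; ridge = 0 gives ordinary least squares.
  # Assumption: the penalty is applied to all m coefficients, intercept included.
  solve(crossprod(X) + ridge * diag(m), crossprod(X, y))
}

# Usage: regress mpg on weight using the built-in 'mtcars' dataset
X <- cbind(1, mtcars$wt)
y <- matrix(mtcars$mpg, ncol = 1)
beta_ols   <- linreg_sketch(y, X)              # Linear Regression
beta_ridge <- linreg_sketch(y, X, ridge = 2)   # Ridge Regression
```

With ridge = 0 the result agrees with the coefficients from R's built-in lm(mpg ~ wt, data = mtcars), which gives a quick way to sanity-check the sketch.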


selectlambda(y, X, lambda.max=20, n.lambda=10)

This function attempts to determine the ‘ridge’ parameter that yields the optimal+ coefficient vector (\(\beta\)). To do this, it repeatedly invokes the linreg function with a range of ‘ridge’ values. The ‘ridge’ values start at 0, which means that the first check uses Linear Regression rather than Ridge Regression, and increase until the maximum ridge value (lambda.max) is reached. The ridge value with the minimum RSS value is returned.

+ Optimal coefficient vector, in this context, means the one that minimizes the Residual Sum of Squares (RSS) between the calculated and actual label values.
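
For reference, the standard definition of the RSS, written in the notation used in this document for n data points, is given below. How the package averages this quantity to obtain the “mean” RSS it reports is not stated here, so the formula shows only the basic definition.

\(\mathrm{RSS}(\beta) = \sum_{i=1}^{n} (y_i - \overline{y}_i)^2 = \lVert y - X\beta \rVert^2\)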

The function is invoked as shown below:

result = selectlambda(y, X)



Specifically:

result Return Value: A list with two entries: (i) min_lambda is the ridge value that minimizes the mean RSS error; (ii) min_mean is the actual minimum RSS value, i.e. the value obtained for that lambda.
y The label vector associated with the data points in X. An (n x 1) column vector.
X Matrix of data points. Has n rows (1 per data point) and m columns: a first column of 1’s, which the user must add, followed by 1 column per parameter of a data point.
lambda.max The maximum ‘ridge’ parameter to be tried - must be non-negative!
n.lambda Number of ‘ridge’ values to try. E.g. with lambda.max=20 and n.lambda=4, the ridge values tested are: 0.000000, 6.666667, 13.333333, 20.000000
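
The ridge values in the example above are evenly spaced between 0 and lambda.max. A grid like this can be generated in base R with seq, as sketched below. That the package builds its grid in exactly this way is an assumption; the sketch only shows that the call reproduces the documented values.

```r
# Assumption: an evenly spaced grid from 0 to lambda.max with n.lambda points
lambda.max <- 20
n.lambda   <- 4
lambdas <- seq(0, lambda.max, length.out = n.lambda)
print(lambdas)   # the four values listed above: 0, 20/3, 40/3 and 20

# With the defaults shown in the signature (lambda.max=20, n.lambda=10)
default_grid <- seq(0, 20, length.out = 10)
```

The same construction with n.lambda=30 has 60/29 (approximately 2.068966) as its fourth value, which is consistent with the Min-lambda reported in the “Example Usage” section.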



Example Usage

Below is an example of the functions available in this package used on the R dataset “mtcars”.

# Load our package
library("assesment5")
## 
## Attaching package: 'assesment5'
## The following object is masked from 'package:stats':
## 
##     predict
# Load the 'mtcars' dataset - part of R 'built-in' datasets
data(mtcars)

# Sample records:
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# Split the data into:
#    data (X) has details about the cars - this is the set of data points
#             (all columns except 1st column)
#    labels (y) - has the petrol usage as mpg - this is the label data we
#             want to predict (1st column)
# Note: 1. We always need to augment the data (X) matrix with a column of 1's
#       2. data read in as a dataframe needs to be "converted" to matrices
X <- as.matrix(mtcars[,-1])
X <- cbind(1,X)
y <- mtcars[,1]
y <- matrix(y, ncol=1)

# Calculate the minimum RSS error and the lambda at which it occurred
result <- selectlambda(y,X, n.lambda=30)
# print critical values from 'result':
cat('Minimum rss-mean:', result$min_mean, ', Min-lambda:', result$min_lambda)
## Minimum rss-mean: 52.45021 , Min-lambda: 2.068966
# Lambda != 0, so Linear Regression was not optimal!
# Let us calculate y_hat (the estimated label values) and check the difference
# to the actuals.
# First determine the coefficient params:
beta <- linreg(y, X, ridge=result$min_lambda)
# Next calculate y_hat from this 'beta'
y_hat <- predict(X, beta)
# Show diffs as percentage of actuals:
head(100*(abs(y - y_hat)/y))
##                        [,1]
## Mazda RX4          3.656302
## Mazda RX4 Wag      3.410764
## Datsun 710        14.022872
## Hornet 4 Drive     3.417813
## Hornet Sportabout  8.363229
## Valiant           12.968367