1 Regression Models

In this section various regression models will be considered. Initially a linear regression model on all the 199 variables will be developed and used to predict the response on both the training and test sets.

This is followed by subset selection, utilising both forward and backward stepwise selection, before considering the shrinkage methods of ridge regression and the lasso.

In all these parametric models the transformed response variable log(y+1) will be used, as was discussed in the exploration section. Also, throughout this project, standardised predictor variables will be used.
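
As a point of reference, the transformation and standardisation step might look as follows in R. This is a minimal sketch only: the data frame names train and test, and the response column y, are illustrative assumptions rather than the project's actual code.

    # Sketch only: train and test are assumed data frames with numeric
    # predictors and a response column y (names are illustrative assumptions)
    train$log_y <- log(train$y + 1)                 # transformed response log(y + 1)
    test$log_y  <- log(test$y + 1)

    pred_cols <- setdiff(names(train), c("y", "log_y"))
    ctr <- colMeans(train[pred_cols])               # training means
    sds <- apply(train[pred_cols], 2, sd)           # training standard deviations
    train[pred_cols] <- scale(train[pred_cols], center = ctr, scale = sds)
    test[pred_cols]  <- scale(test[pred_cols],  center = ctr, scale = sds)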

1.1 Full Linear Model

Initially a model considering all 119 predictors was created. Given the size of the feature space, a detailed summary of the model, in which the predictors, their significance and their coefficients are examined, will not be presented for the full model; this will be revisited when subset selection is considered.

The full model exhibited an adjusted R-squared value of 0.3285949, suggesting that the predictors may provide only relatively poor predictive accuracy. This is reflected in the model's F-statistic, which is large enough for the model to be significant overall, but not so large as to engender confidence.
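
A sketch of how such a full model could be fitted and summarised in R is shown below, reusing the illustrative object names introduced above.

    # Fit the full linear model on all standardised predictors (sketch)
    full_fit <- lm(log_y ~ ., data = train[c(pred_cols, "log_y")])
    summary(full_fit)$adj.r.squared   # adjusted R-squared quoted in the text
    summary(full_fit)$fstatistic      # F-statistic with its degrees of freedom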

Considering the diagnostic plots in Fig… below:

  • The spread of the residuals around zero in the first plot, as well as the clear clustering of a relatively small number of observations, gives some cause for concern.
  • Apart from discrepancies in the lower tail, the normal Q-Q plot appears to be acceptable.
  • The extreme outlier(s) already identified in the exploration section are also picked up here and exert inordinate influence.

In summary, with regard to the diagnostics, although the residuals do appear to follow a reasonably normal distribution, the clustering, potentially unequal variance and outliers may argue for the use of non-parametric methods to obtain more accurate predictions.
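
For completeness, the standard diagnostic plots referred to above can be produced directly from the fitted model object; a sketch assuming the full_fit object defined earlier:

    # Residuals vs fitted, normal Q-Q, scale-location and residuals vs leverage
    par(mfrow = c(2, 2))
    plot(full_fit)
    par(mfrow = c(1, 1))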

Finally, predictions were generated for both the training and test sets and their mean squared errors (MSE) were determined. The MSE on the training set turned out to be 2.6757861 × 10^4 and on the test set it was 2.5618113 × 10^4. The apparent anomaly of a training MSE larger than the test MSE could be explained by the training set containing the extreme outlier identified in the exploration section above.
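
The MSE calculation itself is straightforward; the sketch below assumes, as the quoted values suggest, that predictions are converted back to the original scale of y via exp(.) - 1 before the errors are squared.

    # Train and test MSE on the original scale of y (sketch)
    mse <- function(obs, pred) mean((obs - pred)^2)
    train_pred <- exp(predict(full_fit, newdata = train)) - 1
    test_pred  <- exp(predict(full_fit, newdata = test)) - 1
    mse(train$y, train_pred)
    mse(test$y,  test_pred)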

The following section will attempt to reduce the feature space to allow for a more parsimonious model and to begin interpreting the value and contributions of key variables.

1.2 Subset Model Selection

In this section forward and backward stepwise regression will be applied to the training set. The decision on an optimally sized feature space will be made based on the test MSE values for different numbers of predictors.

Once this has been decided, a new model built on the selected predictors but fitted to the full dataset will be used to establish more reliable coefficients for a final determination of the test MSE.

1.2.1 Forward Stepwise Regression

From the plots in Fig…, there appears to be quite a large discrepancy between the model sizes favoured by the adjusted R-squared and the BIC criteria.

The lowest BIC value is for a model with 25 predictors, while the largest adjusted R-Squared value is for a model with 98 predictors.
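
A sketch of the forward stepwise fit, using the leaps package and the illustrative object names introduced earlier, is given below; nvmax and really.big are set so that all model sizes over the full feature space are considered.

    # Forward stepwise selection over all model sizes (sketch)
    library(leaps)
    fwd_fit <- regsubsets(log_y ~ ., data = train[c(pred_cols, "log_y")],
                          nvmax = length(pred_cols), method = "forward",
                          really.big = TRUE)
    fwd_sum <- summary(fwd_fit)
    which.min(fwd_sum$bic)     # model size with the lowest BIC
    which.max(fwd_sum$adjr2)   # model size with the highest adjusted R-squared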

1.2.2 Backward Stepwise Regression

Applying backward stepwise regression, as the plots in Fig… illustrate, shows only a slightly different picture. The large discrepancy remains: the lowest BIC value appears for a model with 29 predictors, while the largest adjusted R-squared value is found for a model with 103 predictors.
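
The corresponding backward stepwise sketch differs only in the method argument:

    # Backward stepwise selection (sketch)
    bwd_fit <- regsubsets(log_y ~ ., data = train[c(pred_cols, "log_y")],
                          nvmax = length(pred_cols), method = "backward",
                          really.big = TRUE)
    bwd_sum <- summary(bwd_fit)
    which.min(bwd_sum$bic)
    which.max(bwd_sum$adjr2)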

1.2.3 Optimum Feature Size

In order to select an optimum number of features, the test error rates for the forward and backward stepwise regressions were considered.

The use of the test set to determine optimum number of predictors resulted in 43 features for forward selection and 57 features for backward selection.
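
One way to obtain these test error rates is sketched below for the forward fit: the test design matrix is built once, the coefficients of each model size are applied to it, and the predictions are back-transformed before computing the MSE. The same loop applies to the backward fit. The predictors are assumed to be numeric so that the coefficient names match the design matrix columns.

    # Test MSE for each model size from the forward stepwise fit (sketch)
    test_mat <- model.matrix(log_y ~ ., data = test[c(pred_cols, "log_y")])
    fwd_test_mse <- sapply(seq_along(pred_cols), function(i) {
      coefs <- coef(fwd_fit, id = i)                 # coefficients of the size-i model
      pred  <- exp(test_mat[, names(coefs)] %*% coefs) - 1
      mean((test$y - pred)^2)
    })
    which.min(fwd_test_mse)                          # feature count minimising test MSE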

A model would now need to be developed using the selected features from either or both the procedures but trained on the full dataset to ensure more robust coefficients.

The final, full-sample test MSE for 29 features using the forward stepwise procedure amounted to 2.1048143 × 10^4, while for the 57 features from the backward stepwise procedure it was found to be 1.9679692 × 10^4.
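
A sketch of this refit for the 57-feature backward selection is shown below; full_data is an assumed data frame holding the combined (standardised) training and test observations, and it is again assumed that the selected variable names match the data frame's column names.

    # Refit on the full dataset using the variables selected by backward stepwise (sketch)
    sel_vars   <- names(coef(bwd_fit, id = 57))[-1]  # drop the intercept
    final_fit  <- lm(log_y ~ ., data = full_data[c(sel_vars, "log_y")])
    final_pred <- exp(predict(final_fit, newdata = test)) - 1
    mean((test$y - final_pred)^2)                    # final test MSE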

In the following section ridge and the lasso regression models will be considered.

1.3 Ridge Regression

In this section ridge regression on the dataset will be considered.

The ridge regression plot shown in Fig… confirms a notion that has been developing with regard to this dataset, namely that many features collectively contribute to the response variable. Apart from two features that seem to be more differentiated, the others appear to cluster together.

MSE calculations applied to the test set for the 100 different values of lambda are plotted in the second plot in Fig… and identify an optimal lambda value of 1.

Using the predicted values (converted back to the original scale) for this optimal lambda resulted in a test MSE of 1.9032867 × 10^4.
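
A minimal glmnet sketch of this procedure is given below; x_train, x_test and the other names are illustrative assumptions, and the test MSE is evaluated on the original scale for every lambda on the default path of 100 values.

    # Ridge regression (alpha = 0) over glmnet's default lambda path (sketch)
    library(glmnet)
    x_train <- as.matrix(train[pred_cols]);  y_train <- train$log_y
    x_test  <- as.matrix(test[pred_cols])
    ridge_fit  <- glmnet(x_train, y_train, alpha = 0)
    plot(ridge_fit, xvar = "lambda")                 # coefficient paths
    ridge_pred <- exp(predict(ridge_fit, newx = x_test)) - 1   # one column per lambda
    ridge_mse  <- colMeans((test$y - ridge_pred)^2)  # test MSE for each lambda
    ridge_fit$lambda[which.min(ridge_mse)]           # optimal lambda
    min(ridge_mse)                                   # corresponding test MSE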

1.4 The Lasso

The lasso regression was applied to the training set, allowing the function to select the range of lambdas by default.

The first plot in Fig… shows a similar meshing of lines, although the algorithm does force a few of the features to zero, but only towards the end of the path. The second plot again shows the result of calculating MSEs on the test set for 100 values of lambda and identifies the optimal lambda as 0.0281396.

Using the predicted values (converted back to the original scale) for this optimal lambda resulted in a test MSE of 1.9141124 × 10^4.
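
The lasso sketch is identical to the ridge code above except for alpha = 1, under which some coefficients are forced exactly to zero, as noted above:

    # The lasso (alpha = 1), reusing the matrices from the ridge sketch
    lasso_fit  <- glmnet(x_train, y_train, alpha = 1)
    plot(lasso_fit, xvar = "lambda")                 # a few coefficients hit zero
    lasso_pred <- exp(predict(lasso_fit, newx = x_test)) - 1
    lasso_mse  <- colMeans((test$y - lasso_pred)^2)
    lasso_fit$lambda[which.min(lasso_mse)]           # optimal lambda
    min(lasso_mse)                                   # corresponding test MSE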

End