An Introduction to Statistical Learning

2: Statistical learning

The Trade-Off Between Prediction Accuracy and Model Interpretability (prediction and inference)

Choosing an appropriate model depends on the problem at hand and on the data available. Some methods achieve high prediction accuracy, but the way they arrive at their predictions may be hard to interpret. This is typical of many classification methods.

Parametric vs. non-parametric methods:
- Parametric methods, such as linear regression, are easily interpretable.
- Non-parametric methods, such as splines, are more flexible but can be overfitted to the training data.

For inference problems, we may be interested in these questions:
- Which predictors are related to the response variable?
- What is the relationship between each predictor and the response variable? It may be positive or negative, and there may be synergies (multiplicative effects) between predictors.
- Is a linear relationship sufficient, or is a more complicated fit needed to describe the relationship?

Supervised Versus Unsupervised Learning

Unsupervised learning describes situations where there is no response variable. When all variables are predictors, we may be interested in the relationship between the variables. We may want to find clusters of observations that are similar.

Regression Versus Classification Problems

Regression problems have a quantitative response variable. Classification problems have a categorical response variable.

Assessing model accuracy

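\(MSE\), the mean squared error, measures the fit of the model: it is the average squared difference between the observed and the predicted responses.
Training MSE is the MSE computed on the data the model was fitted to.
Test MSE is the MSE on previously unseen observations, and is the quantity we are interested in for prediction.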
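The difference between training and test MSE can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data-generating process, the sample sizes and the helper names (`mse`, `fit_line`, `nearest_neighbor_predict`, `simulate`) are invented for illustration and are not from the book. It compares an inflexible least-squares line with a very flexible 1-nearest-neighbor fit.

```python
import random

def mse(y, y_hat):
    """Mean squared error: the average squared difference."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def nearest_neighbor_predict(x_train, y_train, x_new):
    """Very flexible fit: copy the response of the closest training point."""
    preds = []
    for x0 in x_new:
        i = min(range(len(x_train)), key=lambda j: abs(x_train[j] - x0))
        preds.append(y_train[i])
    return preds

rng = random.Random(0)

def simulate(n):
    # Hypothetical data: a linear signal plus Gaussian noise.
    x = [rng.uniform(0, 10) for _ in range(n)]
    y = [2.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]
    return x, y

x_train, y_train = simulate(40)
x_test, y_test = simulate(400)

b0, b1 = fit_line(x_train, y_train)
line_train = mse(y_train, [b0 + b1 * xi for xi in x_train])
line_test = mse(y_test, [b0 + b1 * xi for xi in x_test])

nn_train = mse(y_train, nearest_neighbor_predict(x_train, y_train, x_train))
nn_test = mse(y_test, nearest_neighbor_predict(x_train, y_train, x_test))

print(f"linear fit       train MSE={line_train:.3f}  test MSE={line_test:.3f}")
print(f"nearest neighbor train MSE={nn_train:.3f}  test MSE={nn_test:.3f}")
```

The nearest-neighbor fit reproduces every training response, so its training MSE is zero, yet its test MSE is not. A low training MSE therefore says little about how well a flexible method will predict new data.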

Bias-variance tradeoff

The expected test MSE always decomposes into three parts: the variance of the fitted model, the squared bias of the fitted model, and the variance of the irreducible error.
- Variance is the amount by which the fitted model would change if it were estimated from a different set of training data. A linear model usually has low variance, as new training data will not change the fit much.
- Bias is the error introduced by simplifying the relationship between the variables to a simple shape or formula.
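In the book's notation, which is assumed here (\(\hat{f}\) for the fitted model, \(x_0\) and \(y_0\) for a test observation, \(\epsilon\) for the irreducible error term), the decomposition can be written as:

\[E\left(y_0 - \hat{f}(x_0)\right)^2 = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\epsilon)\]

All three terms are non-negative, so the expected test MSE can never fall below \(\mathrm{Var}(\epsilon)\). More flexible methods tend to lower the bias term and raise the variance term.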

The Bayes classifier and the KNN classifier

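The Bayes classifier is the ideal classifier: it assigns each observation to its most likely class given the predictors. It relies on knowing the conditional distribution of Y given X, which is unknown for real data. The KNN (K-nearest neighbors) classifier tries to approximate the Bayes classifier by estimating that distribution.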

Given a positive integer \(K\) and a test observation \(x_0\), the KNN classifier:
first identifies the \(K\) points in the training data that are closest to \(x_0\), represented by \(N_0\);
then estimates the conditional probability for class \(j\) as the fraction of points in \(N_0\) whose response values equal \(j\);
finally applies Bayes rule and classifies the test observation \(x_0\) to the class with the largest estimated probability.
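The KNN steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes Euclidean distance, breaks ties by whichever class is encountered first, and the function name `knn_predict` and the toy data are made up for the example.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, x0, k):
    """Classify x0 by majority vote among its k nearest training points."""
    # Step 1: find the k training points closest to x0 (the set N_0).
    # Euclidean distance is an assumption; other metrics are possible.
    by_distance = sorted(
        range(len(train_x)),
        key=lambda i: math.dist(train_x[i], x0),
    )
    n0 = by_distance[:k]
    # Step 2: estimate Pr(Y = j | X = x0) as the fraction of N_0 in class j.
    counts = Counter(train_y[i] for i in n0)
    probs = {label: c / k for label, c in counts.items()}
    # Step 3: assign the class with the largest estimated probability.
    return max(probs, key=probs.get), probs

# Toy training data (hypothetical): two well-separated classes.
train_x = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
train_y = ["blue", "blue", "blue", "orange", "orange", "orange"]

label, probs = knn_predict(train_x, train_y, x0=(2.0, 2.0), k=3)
print(label, probs)
```

A small \(K\) gives a very flexible decision boundary (low bias, high variance), while a large \(K\) gives a smoother one.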

3: Linear regression

Chapter 3 exercises

Simple linear regression

Multiple linear regression

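In multiple linear regression, \(R^2\) is no longer the squared correlation between a single predictor and the response, as it is in simple linear regression; it equals the squared correlation between the response and the fitted values. \(R^2\) is therefore used. The F-statistic increases when the model explains more of the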

Other considerations

Common problems when we fit a linear regression model to a particular data set:

Non-linearity of the response-predictor relationships. A quadratic or other non-linear fit may be more suitable.
Correlation of error terms. Common in time-series data. Can be mitigated by good research design.
Non-constant variance of error terms. Shows up as a funnel shape in the residual plot. May be solved by transforming the response variable with a concave function such as \(\log Y\) or \(\sqrt{Y}\).
Outliers. To address this problem of unusual values of the response variable, instead of plotting the residuals, we can plot the studentized residuals, computed by dividing each residual \(e_i\) by its estimated standard error. Observations whose studentized residuals are greater than 3 in absolute value are possible outliers.
High-leverage points. Observations with an unusual value of \(x_i\) have high leverage, meaning they can have a large influence on the fitted model.
Collinearity. Two or more of the explanatory variables are closely correlated. When three or more variables are involved it is called multicollinearity, which may not show up in pairwise correlations and is detected with the variance inflation factor (VIF).
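As a rough illustration of the VIF, the sketch below handles only the simplest case of exactly two predictors, where the \(R^2\) from regressing one predictor on the other equals their squared correlation, so that VIF = 1 / (1 - r²). The function names and the data are hypothetical; with more predictors, each VIF requires a full regression of one predictor on all the others.

```python
def pearson_r(a, b):
    """Sample correlation between two equally long sequences."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    sab = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    saa = sum((x - a_bar) ** 2 for x in a)
    sbb = sum((y - b_bar) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with two predictors R^2 is the squared correlation."""
    r2 = pearson_r(x1, x2) ** 2
    return 1.0 / (1.0 - r2)

# Hypothetical predictors.
x1 = [1.0, 2.0, 3.0, 4.0]
x2_unrelated = [1.0, -1.0, -1.0, 1.0]  # uncorrelated with x1
x2_collinear = [1.1, 1.9, 3.2, 3.8]    # nearly a copy of x1

print(vif_two_predictors(x1, x2_unrelated))  # smallest possible value, 1
print(vif_two_predictors(x1, x2_collinear))  # large value signals collinearity
```

The book's rule of thumb is that a VIF above 5 or 10 indicates a problematic amount of collinearity.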

4: Classification

Overview

Logistic regression

\(p(X) = \Pr(Y = 1 \mid X)\)

Logistic function

\[p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}\]
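A small sketch of the logistic function in Python follows. The coefficient values are arbitrary illustrations rather than fitted estimates, and the helper `log_odds` is added only to show that the log-odds of \(p(X)\) are linear in \(X\).

```python
import math

def logistic(x, beta0, beta1):
    """p(X) = e^(b0 + b1*X) / (1 + e^(b0 + b1*X)), always between 0 and 1."""
    z = beta0 + beta1 * x
    # Algebraically equivalent form that avoids overflow for large positive z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_odds(p):
    """Logit: log(p / (1 - p)); for the logistic model it equals b0 + b1*X."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = -1.0, 0.5
for x in (-4.0, 2.0, 8.0):
    p = logistic(x, beta0, beta1)
    print(f"x={x:5.1f}  p(x)={p:.3f}  log-odds={log_odds(p):.3f}")
```

Whatever the value of \(X\), the output stays strictly between 0 and 1 for moderate inputs, which is why this function is preferred over a straight line for modelling a probability.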

Maximum likelihood

Linear discriminant analysis

Comparison of classification methods

5: Resampling Methods

Chapter 5 exercises

Model assessment

Evaluating the model's performance
Model assessment measures how well the model represents the data. The distribution of the estimated standard error (SE) is one measure.

Model selection

Selecting the right level of flexibility for the model

Cross-validation

CV can be used to estimate the test error (the training error is computed on the same data the model is fitted to, and typically underestimates it). CV repeatedly splits the training data, fits the model on one part and evaluates it on the held-out part, so that every fit is tested on observations it has not seen. This “simulates” having more data than we actually have.

CV can be LOOCV (leave-one-out cross-validation), where \(n\) models are fitted and each leaves out a single observation, or K-fold CV, where only K models are fitted: the data are split into K folds, and one fold at a time is held out to serve as the validation set.

Validation set approach

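The validation set approach splits the available observations randomly into two parts: the model is fitted on one part, and the other part is held out to serve as a validation set. Unlike K-fold CV with K = 2, the split is made only once and the two parts do not swap roles.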

Drawback 1: The estimate of the test error rate is highly variable, as it depends on which observations end up in the validation set.
Drawback 2: Only a subset of the observations is used to fit the model. Models tend to perform worse when fitted to fewer observations, so the validation set error rate may overestimate the test error rate.

LOOCV

A refined version of the validation set approach. The model is fitted \(n\) times, each time on a training set of \(n - 1\) observations with the remaining observation held out for validation, so it can be computationally heavy.

K-fold CV

Randomly divide the training set into \(k\) groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining \(k - 1\) folds. This is repeated \(k\) times with a different fold held out each time, which results in \(k\) estimates of the test error. The \(k\)-fold CV estimate is computed by averaging these values. Typical choices are k = 5 or k = 10.
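The k-fold procedure can be sketched as follows. This is a minimal illustration assuming a simple least-squares line as the model and MSE as the error measure; the function names, the seed and the simulated data are made up for the example. Setting `k` equal to the number of observations gives LOOCV.

```python
import random

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def k_fold_cv_mse(x, y, k, seed=0):
    """Estimate the test MSE of a least-squares line by k-fold CV."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    # Split the shuffled indices into k folds of approximately equal size.
    folds = [indices[i::k] for i in range(k)]
    fold_mses = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # Fit on the other k - 1 folds, evaluate on the held-out fold.
        b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
        errors = [(y[i] - (b0 + b1 * x[i])) ** 2 for i in held_out]
        fold_mses.append(sum(errors) / len(errors))
    # The CV estimate is the average of the k fold-wise errors.
    return sum(fold_mses) / k, fold_mses

# Hypothetical data: linear signal plus Gaussian noise.
rng = random.Random(1)
x = [rng.uniform(0, 10) for _ in range(60)]
y = [2.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]

cv5, fold_mses5 = k_fold_cv_mse(x, y, k=5)
cv10, _ = k_fold_cv_mse(x, y, k=10)
loocv, _ = k_fold_cv_mse(x, y, k=len(x))  # LOOCV as the special case k = n
print(f"5-fold={cv5:.3f}  10-fold={cv10:.3f}  LOOCV={loocv:.3f}")
```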

Bias-variance trade-off for k-fold CV

Each LOOCV fit uses \(n - 1\) observations, almost the full data set, so LOOCV gives very low bias compared to the validation set approach and k-fold CV. However, its variance is high, as the \(n\) training sets are almost identical and the fitted models are therefore highly correlated with each other. This problem of variance is reduced with k = 5 or k = 10, where the overlap between training sets is smaller. K-fold CV with k = 5 or k = 10 is often preferred.

CV on Classification problems

The bootstrap

The bootstrap is most commonly used to provide a measure of accuracy, such as the standard error, of a parameter estimate or of a statistical learning method.
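A minimal sketch of the idea: resample the observed data with replacement many times, recompute the estimate on each resample, and use the spread of those estimates as a standard error. The function name `bootstrap_se`, the number of resamples and the simulated data are assumptions made for illustration. The sample mean is used as the statistic because its standard error also has a textbook formula to compare against.

```python
import random
import statistics

def bootstrap_se(data, statistic, n_boot=1000, seed=0):
    """Bootstrap standard error of `statistic` computed on `data`."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        # Draw n observations from the data with replacement.
        sample = rng.choices(data, k=n)
        estimates.append(statistic(sample))
    # The standard deviation of the resampled estimates is the bootstrap SE.
    return statistics.stdev(estimates)

# Hypothetical data set of 100 observations.
rng = random.Random(1)
data = [rng.gauss(10, 2) for _ in range(100)]

se_boot = bootstrap_se(data, statistics.mean)
se_formula = statistics.stdev(data) / len(data) ** 0.5
print(f"bootstrap SE={se_boot:.3f}  formula SE={se_formula:.3f}")
```

The same function works for statistics without a simple formula, for example by passing `statistics.median` instead of `statistics.mean`.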