Choosing the appropriate model depends on the problem at hand and on the data we have. Prediction accuracy can be high with certain methods, but the way predictions are made may be hard to interpret. This is typical of classification methods.
Parametric vs. non-parametric methods
- Parametric methods, such as linear regression, are easily interpretable.
- Non-parametric methods, such as splines, can be over-fitted to the training data.
We may be interested in these questions for inference problems:
- Which predictors are related to the response variable?
- What is the relationship between each predictor and the response variable? It may be positive or negative, and there may be synergies (interaction effects) between predictors.
- Is a linear relationship sufficient, or is a more complicated fit needed to describe the relationship?
Unsupervised learning describes situations where there is no response variable. When all variables are predictors, we may be interested in the relationship between the variables. We may want to find clusters of observations that are similar.
Regression problems require a quantitative response variable. Classification problems have a categorical response variable.
Bias-variance tradeoff
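The expected test MSE at a point \(x_0\) always decomposes into three parts: the variance of \(\hat{f}(x_0)\), the squared bias of \(\hat{f}(x_0)\), and the variance of the irreducible error \(\epsilon\).
\[E\left(y_0 - \hat{f}(x_0)\right)^2 = \text{Var}(\hat{f}(x_0)) + [\text{Bias}(\hat{f}(x_0))]^2 + \text{Var}(\epsilon)\]
- Variance is the amount by which the fitted model would change if it were estimated on a different set of training data. A linear model usually has lower variance, as new training data will not change the model much.
- Bias is the error introduced by simplifying the relationship between the variables to a simple shape or formula.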
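The two definitions above can be illustrated with a small simulation. This is only a sketch under assumed conditions: the "true" function `true_f`, the noise level, the evaluation point `x0`, and the polynomial degrees are hypothetical choices made for illustration and do not come from these notes. It repeatedly draws new training sets, refits a polynomial, and looks at how the prediction at `x0` varies (variance) and how far its average is from the truth (bias).

```python
import numpy as np

def true_f(x):
    # Hypothetical "true" relationship, chosen only for illustration.
    return np.sin(2 * x)

def simulate(degree, x0=1.0, n_train=30, n_sets=500, noise_sd=0.5, seed=0):
    """Estimate squared bias and variance of a polynomial fit at x0."""
    rng = np.random.default_rng(seed)
    preds = np.empty(n_sets)
    for i in range(n_sets):
        # Draw a fresh training set each time.
        x = rng.uniform(0.0, 3.0, n_train)
        y = true_f(x) + rng.normal(0.0, noise_sd, n_train)
        # Fit a polynomial of the given degree and predict at x0.
        coefs = np.polyfit(x, y, degree)
        preds[i] = np.polyval(coefs, x0)
    bias_sq = (preds.mean() - true_f(x0)) ** 2  # squared bias at x0
    variance = preds.var()                      # spread across training sets
    return bias_sq, variance

linear_bias_sq, linear_var = simulate(degree=1)      # inflexible model
flexible_bias_sq, flexible_var = simulate(degree=6)  # flexible model

print(f"degree 1: bias^2={linear_bias_sq:.4f}, variance={linear_var:.4f}")
print(f"degree 6: bias^2={flexible_bias_sq:.4f}, variance={flexible_var:.4f}")
```

In this setup the straight line should show the larger squared bias and the degree-6 polynomial the larger variance, which is the tradeoff the heading refers to.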
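The Bayes classifier and the KNN classifier
The Bayes classifier is the ideal classifier, but it relies on knowing the conditional distribution of Y given X. The KNN classifier tries to approximate the Bayes classifier by estimating that conditional distribution from the K nearest training observations.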
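As a rough illustration of how KNN stands in for the unknown conditional distribution, the sketch below estimates \(Pr(Y = j | X = x_0)\) as the share of class \(j\) among the K nearest training points and then picks the most probable class. The function names `knn_predict_proba` and `knn_classify`, the Euclidean distance, and the toy data are assumptions made for this example.

```python
import numpy as np

def knn_predict_proba(X_train, y_train, x0, k):
    """Estimate Pr(Y = j | X = x0) as class shares among the k nearest neighbours."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Euclidean distance from x0 to every training observation (an assumption).
    distances = np.linalg.norm(X_train - np.asarray(x0, dtype=float), axis=1)
    nearest = np.argsort(distances)[:k]
    classes = np.unique(y_train)
    probs = np.array([(y_train[nearest] == c).mean() for c in classes])
    return classes, probs

def knn_classify(X_train, y_train, x0, k):
    """Assign x0 to the class with the highest estimated probability."""
    classes, probs = knn_predict_proba(X_train, y_train, x0, k)
    return classes[np.argmax(probs)]

# Toy data: two well-separated groups (hypothetical values).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])

label = knn_classify(X_train, y_train, x0=[0.15, 0.15], k=3)
print(label)
```

A small K gives a flexible, high-variance decision boundary; a large K gives a smoother boundary with more bias.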
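In multiple regression, \(R^2\) is no longer the squared correlation between a single predictor and the response, and it never decreases when more variables are added. Adjusted \(R^2\) is therefore used to compare models with different numbers of variables. The F-statistic increases when the model explains more of the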
Common problems when we fit a linear regression model to a particular data set:
\(p(X) = Pr(Y = 1|X)\)
\[p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}\]
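A minimal numerical sketch of the logistic function above. The coefficient values are hypothetical, and the helper `log_odds` is added here only to show that the log-odds are linear in X.

```python
import numpy as np

def logistic(x, beta0, beta1):
    """p(X) = e^(b0 + b1*X) / (1 + e^(b0 + b1*X)), written as 1 / (1 + e^(-z))."""
    z = beta0 + beta1 * np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

def log_odds(p):
    """Inverse of the logistic function: log(p / (1 - p))."""
    return np.log(p / (1.0 - p))

x = np.array([-2.0, 0.0, 2.0])
p = logistic(x, beta0=-1.0, beta1=2.0)  # hypothetical coefficients

print(p)            # probabilities, always strictly between 0 and 1
print(log_odds(p))  # recovers beta0 + beta1 * x
```

Whatever the value of X, the output stays between 0 and 1, which is why this form is used for \(Pr(Y = 1|X)\) instead of a straight line.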
Model assessment
Model selection
Cross-validation (CV) can be used to estimate the test error; the training error, by contrast, is computed on the same data the model was fitted to. CV repeatedly fits the model on different slices of the training data and evaluates each fit on the part that was held out, which shows how the error varies across subsets of the training set. This “simulates” having more data than we actually have.
CV can be LOOCV (leave-one-out cross-validation), where n models are fit, or K-fold CV, where only K models are fit: the data are split into K folds and one fold at a time is held out to serve as the validation set.
The validation set approach uses a single split, where a share of the training data is held out to serve as a validation set. It resembles K = 2, except that the model is fit and validated only once.
Drawback 1: The estimate of the test error rate is highly variable, as it depends on which observations end up in the validation set.
Drawback 2: Only a subset of the observations in the training set is used to fit the model. Models perform worse with fewer observations, so the validation set error rate tends to overestimate the test error rate.
LOOCV is a refined version of the validation set approach. It uses n training sets with n - 1 observations each, so it can be computationally heavy.
K-fold CV randomly divides the training set into \(k\) groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining \(k - 1\) folds. This is repeated \(k\) times, each time with a different fold as the validation set, which results in \(k\) estimates of the test error. The \(k\)-fold CV estimate is computed by averaging these values. Typical choices are k = 5 or k = 10.
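The k-fold procedure can be sketched in a few lines. This is an illustrative version under assumptions: the model is a polynomial fit with `np.polyfit`, the error measure is MSE, and the function name `k_fold_cv_mse` and the simulated data are hypothetical.

```python
import numpy as np

def k_fold_cv_mse(x, y, k=5, degree=1, seed=0):
    """Return the k-fold CV estimate of test MSE and the per-fold MSEs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    # Shuffle the indices, then split them into k folds of near-equal size.
    folds = np.array_split(rng.permutation(len(x)), k)
    fold_mse = []
    for i in range(k):
        val_idx = folds[i]  # fold i is the validation set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the remaining k - 1 folds, evaluate on the held-out fold.
        coefs = np.polyfit(x[train_idx], y[train_idx], degree)
        preds = np.polyval(coefs, x[val_idx])
        fold_mse.append(float(np.mean((y[val_idx] - preds) ** 2)))
    return float(np.mean(fold_mse)), fold_mse

# Simulated data: a linear signal with unit-variance noise (hypothetical).
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 3.0, 100)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, 100)

cv_estimate, per_fold = k_fold_cv_mse(x, y, k=5)
print(cv_estimate, per_fold)
```

Calling the same function with `k=len(x)` puts one observation in each fold, which corresponds to LOOCV.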
Because each model is fit on almost all of the data, LOOCV has very low bias compared to the validation set approach and k-fold CV. However, its variance is high, as the n training sets are almost identical and the fitted models are therefore highly correlated with each other. This problem is reduced with k = 5 or k = 10, where the overlap between training sets is smaller. K-fold CV with k = 5 or k = 10 is often preferred.
The bootstrap is most commonly used to provide a measure of accuracy of a parameter estimate or of a statistical learning method. It works by repeatedly sampling observations from the original data set with replacement.
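As an example of measuring the accuracy of an estimate, the sketch below uses the bootstrap to approximate the standard error of a sample mean and compares it with the usual analytic formula. The function name `bootstrap_se`, the number of resamples, and the simulated data are assumptions for illustration.

```python
import numpy as np

def bootstrap_se(data, statistic, n_boot=1000, seed=0):
    """Bootstrap estimate of the standard error of `statistic`."""
    data = np.asarray(data)
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        # Resample n observations with replacement from the original data.
        sample = data[rng.integers(0, n, size=n)]
        estimates[b] = statistic(sample)
    # The spread of the bootstrap estimates approximates the standard error.
    return estimates.std(ddof=1)

# Simulated data (hypothetical): 200 draws from a normal distribution.
rng = np.random.default_rng(42)
data = rng.normal(loc=10.0, scale=2.0, size=200)

se_mean = bootstrap_se(data, np.mean)
analytic_se = data.std(ddof=1) / np.sqrt(len(data))
print(se_mean, analytic_se)
```

For the mean, the two numbers should be close. The value of the bootstrap is that the same function works for statistics with no simple formula, for example `bootstrap_se(data, np.median)`.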