This set of lecture notes is based on chapter 2 of “The Elements of Statistical Learning” by Hastie, Tibshirani, and Friedman.
See also “The Akaike information criterion: Background, derivation, properties, application, interpretation, and refinements” (2019) by Cavanaugh and Neath.
This class assumes you are familiar with Multiple Linear Regression and Logistic Regression.
We will build on this knowledge base to improve our predictions and try to find solutions to other challenges.
First let us get comfortable with notation and terminology.
We call the observed outcome values the target/response variable Y.
We call the input values the predictor(s) X. \[ \begin{aligned} Y=f(X) \end{aligned} \]
f is the function relating X to Y; in M.L.R., f is defined as
\[ \begin{aligned} Y=\beta_{0}+\beta_{1} \times X_{1}+\ldots+\beta_{K} \times X_{K} + \epsilon \end{aligned} \]
In general, f is an unknown function whose form we assume and then learn from data.
In M.L.R., the assumed form is a linear sum of model parameters (\(\beta\)) and features (predictors/independent variables) \(X\).
Learning, in this context, refers to estimating the model parameters.
The following code and data are available at: https://rdrr.io/cran/ElemStatLearn/man/mixture.example.html
From the data object we extract the features x and the response/target variable y.
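A minimal sketch of that extraction in R (assuming the ElemStatLearn package is installed; the component names x and y follow the documentation linked above):

```r
# mixture.example is a list; x holds the features, y the 0/1 responses
library(ElemStatLearn)
x <- mixture.example$x  # 200 x 2 feature matrix
y <- mixture.example$y  # 200 binary (0/1) responses
```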
The object x has 200 rows and 2 columns, which in turn means the number of associated responses is also 200.
The response variable contains 100 values of 0 and 100 values of 1.
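The output below appears to come from a least-squares fit of the 0/1 response on the two features, with each case classified as 1 when the fitted value exceeds 0.5. That reconstruction is an assumption on our part (it is the linear classifier applied to these data in ESL, chapter 2); a sketch:

```r
# Least-squares fit to the binary response; lm() names the matrix columns x1, x2
fit <- lm(y ~ x)
coef(fit)
yhat <- as.numeric(fitted(fit) > 0.5)    # classify as 1 when Y-hat > 0.5
print("Number of Predictions");         print(length(yhat))
print("Number of correct Predictions"); print(sum(yhat == y))
```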
```
(Intercept)          x1          x2 
  0.3290614  -0.0226360   0.2495983 
[1] "Number of Predictions"
[1] 200
[1] "Number of correct Predictions"
[1] 146
```
To predict the value of Y at a point x (\(\hat{Y}\)), average the \(y_{i}\) values of the k training observations whose \(x_{i}\) are nearest to x. \[ \begin{aligned} \hat{Y}=\frac{1}{k} \sum_{x_{i} \in N_{k}(x) }{y_{i}} \end{aligned} \]
If this is a classification problem, \(\hat{Y}\) is a proportion: with a 0/1 response, it is the fraction of the k nearest neighbors labeled 1.
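For concreteness, a sketch of this k-nearest-neighbors rule on the same mixture data using class::knn; the choice k = 15 is an assumption (it is one of the values used for this example in ESL, chapter 2):

```r
# k-NN classification: each point gets the majority vote of its 15 nearest neighbors
library(class)
knn_hat <- knn(train = x, test = x, cl = factor(y), k = 15)
sum(knn_hat == y)  # number of correct training-set predictions
```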
The more complexity you are willing to introduce into your model, the better its fit to the data at hand is going to be.
For example, constructing new terms from variables you already have: interactions, polynomials, etc.
However, the more complicated the model is, the more likely you are to decrease its generalizability, as the sketch below illustrates.
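A quick simulated illustration (the data and polynomial degrees here are our own, not from the notes' code): training error keeps falling as the model grows more complex, while error on fresh data eventually rises.

```r
# Simulated data: the true signal is sin(x) plus Gaussian noise
set.seed(1)
n   <- 50
xtr <- runif(n, -2, 2); ytr <- sin(xtr) + rnorm(n, sd = 0.3)  # training set
xte <- runif(n, -2, 2); yte <- sin(xte) + rnorm(n, sd = 0.3)  # fresh test set
for (deg in c(1, 3, 9, 15)) {
  fit <- lm(ytr ~ poly(xtr, deg))
  tr  <- mean(residuals(fit)^2)                                     # training MSE
  te  <- mean((yte - predict(fit, newdata = data.frame(xtr = xte)))^2)  # test MSE
  cat(sprintf("degree %2d: train MSE %.3f, test MSE %.3f\n", deg, tr, te))
}
```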
The overfitting and underfitting conundrum is referred to as the bias-variance tradeoff.
To understand the bias-variance tradeoff, we will first conduct simulations that demonstrate overfitting and introduce the basic concepts.
If you have taken ANYL 3334, you should know that with a dataset of N responses, N-1 features (covariates, independent variables) plus an intercept will perfectly predict those N responses.
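A small simulation of this fact (the numbers are illustrative): with an intercept plus N-1 pure-noise features, the residuals are exactly zero even though the features carry no information about the response.

```r
# N responses, N - 1 noise features: the fit is exact despite X being unrelated to y
set.seed(1)
N    <- 10
Xsim <- matrix(rnorm(N * (N - 1)), nrow = N)  # N - 1 pure-noise features
ysim <- rnorm(N)                              # responses generated independently of Xsim
fit  <- lm(ysim ~ Xsim)                       # intercept + N - 1 slopes = N parameters
max(abs(residuals(fit)))                      # ~ 1e-15: a "perfect" (overfit) fit
```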
How are we to know in the non-extreme cases whether we are overfitting our model?
Recall also that you can fit a prediction line that minimizes the sum of squared errors without assuming normally distributed errors.
The normality assumption on the residuals is what creates the inferential framework:
Confidence and prediction intervals, both on \(Y\) and on \(\beta\).
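A minimal sketch (simulated data) of the intervals this framework provides:

```r
# Simulated straight-line data with Gaussian noise
set.seed(1)
d   <- data.frame(x = 1:20)
d$y <- 2 + 3 * d$x + rnorm(20, sd = 4)
fit <- lm(y ~ x, data = d)
confint(fit)                                # confidence intervals on the betas
new <- data.frame(x = 10)
predict(fit, new, interval = "confidence")  # CI for the mean response at x = 10
predict(fit, new, interval = "prediction")  # PI for a new observation at x = 10
```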
Comprehending the process that generates the data; in other words, at least being able to communicate with a subject-matter expert (SME).
Comprehending the structure behind the models and the methods.