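MLE (maximum likelihood estimation) chooses the parameter value that maximizes the likelihood function. For independent observations, the likelihood is the product of the density (or probability mass) of each observation under the assumed distribution: \(L(\theta) = \prod_{i=1}^{n} f(x_i|\theta)\). In practice we usually maximize the log-likelihood \(l(\theta) = \log L(\theta)\), which has the same maximizer.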
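Note that an MLE does not always exist. When we find a candidate by solving \(l'(\theta) = 0\), we must check that \(l''(\theta) < 0\) at that point to make sure we have a max (as opposed to a min). This derivative-based approach requires the log-likelihood to be differentiable (which also implies continuous); when it is not, the maximum has to be found another way, for example by inspecting the likelihood directly. We can usually simplify the maximization (given an MLE exists) by only looking at the parts of the function that depend on the parameters, since terms that do not involve \(\theta\) do not change where the max occurs.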
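As a small illustration of the derivative check, here is a minimal sketch for a Bernoulli model. The model choice, the function names, and the data are assumptions made up for this example; they do not come from the notes.

```python
import math

def bernoulli_log_likelihood(p, data):
    """Log-likelihood of i.i.d. Bernoulli(p) observations (each 0 or 1)."""
    successes = sum(data)
    failures = len(data) - successes
    return successes * math.log(p) + failures * math.log(1 - p)

def bernoulli_mle(data):
    """Closed-form solution of l'(p) = 0: the sample proportion of ones."""
    return sum(data) / len(data)

def second_derivative(p, data):
    """l''(p) = -s/p^2 - (n - s)/(1 - p)^2, which is negative for 0 < p < 1."""
    successes = sum(data)
    failures = len(data) - successes
    return -successes / p**2 - failures / (1 - p) ** 2

data = [1, 0, 1, 1, 0, 1, 1, 0]  # made-up sample
p_hat = bernoulli_mle(data)
print(p_hat, second_derivative(p_hat, data))
```

Because the second derivative is negative at `p_hat`, the stationary point is a maximum rather than a minimum.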
Linear regression of \(Y|X\) assumes that \(Y = \sum_{j=1}^{p} \beta_j X_j + \epsilon\), where \(p\) is the number of \(\beta\)’s and \(\epsilon\) is the error term. We typically use OLS (Ordinary Least Squares), which minimizes the Residual Sum of Squares (RSS): \(RSS(\beta) = \sum_{i=1}^{n}(y_i - \beta^T x_i)^2\). For simple linear regression (as in, what we would need for purposes of this midterm), we have \(RSS = e_1^2 + e_2^2 + ... + e_n^2\), which we can rewrite as \((y_1 - \beta_0 - \beta_1x_1)^2 + (y_2 - \beta_0 - \beta_1x_2)^2 + ... + (y_n - \beta_0 - \beta_1x_n)^2\). Finally, if the errors are assumed to be normally distributed, we can interpret the OLS estimate of each \(\beta_j\) as the MLE. That is, we have \(\hat{\beta} = \arg\max_{\beta} \sum_{i=1}^{n} \log p(y_i|x_i)\). Note that the goal of regression is to find a relationship between our input (X) and our response (Y). Linear regression has a model of the following form: \(Y = \beta_0 + \beta_1X_1 + ... + \beta_pX_p + \epsilon\).
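For simple linear regression, the RSS has a closed-form minimizer. The sketch below assumes the usual formulas \(\hat{\beta}_1 = S_{xy}/S_{xx}\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\); the function names and the data are made up for illustration.

```python
def fit_simple_ols(x, y):
    """Return (beta0, beta1) minimizing sum((y_i - beta0 - beta1 * x_i)^2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

def rss(beta0, beta1, x, y):
    """Residual sum of squares e_1^2 + ... + e_n^2 for a given line."""
    return sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up inputs
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up responses
beta0, beta1 = fit_simple_ols(x, y)
print(beta0, beta1, rss(beta0, beta1, x, y))
```

Any other choice of intercept or slope gives a larger RSS on the same data, which is what "least squares" means.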
We can fit any model by minimizing its training RSS. HOWEVER, the whole point of Statistical Learning is that we want to accurately predict UNSEEN (testing) data. Additionally, training error typically underestimates the generalization error. We must be aware of potential overfitting: overfit models have low training error but high testing error, which is one side of the bias-variance tradeoff. A general rule: the more flexible a model is, the less biased it tends to be, but the more variance it has. This is because flexible models can follow the data more closely, which reduces bias, but makes the fit more sensitive to the particular training sample, which increases variance.
The loss function (here, the 0-1 loss used for classification) is a simple concept. It takes value 1 if our model incorrectly predicts something and takes value 0 if it correctly predicts.
The risk function is the expected value of the loss function. The optimal classifier minimizes the risk function with respect to f.
The Bayes classifier predicts class 1 if \(\eta(x) \geq 1/2\) and -1 (the other class) otherwise, where \(\eta(x) = P(Y=1|X=x)\). If we are given \(P(Y=-1|X=x)\) instead, we can use the complement rule to get \(P(Y=1|X=x) = 1 - P(Y=-1|X=x)\).
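The loss, risk, and Bayes classifier fit together as in the minimal sketch below. It assumes \(\eta(x)\) is known, which is rarely true in practice, and the function names are hypothetical.

```python
def zero_one_loss(y_true, y_pred):
    """0 if the prediction is correct, 1 otherwise."""
    return 0 if y_true == y_pred else 1

def bayes_classifier(eta):
    """Predict 1 if eta = P(Y = 1 | X = x) is at least 1/2, else -1."""
    return 1 if eta >= 0.5 else -1

def conditional_risk(eta, prediction):
    """Expected 0-1 loss at x: the probability that the prediction is wrong."""
    # Predicting 1 is wrong with probability P(Y = -1 | x) = 1 - eta;
    # predicting -1 is wrong with probability P(Y = 1 | x) = eta.
    return (1 - eta) if prediction == 1 else eta

for eta in [0.1, 0.5, 0.8]:
    pred = bayes_classifier(eta)
    print(eta, pred, conditional_risk(eta, pred))
```

At every point the Bayes classifier attains conditional risk \(\min(\eta(x), 1 - \eta(x))\), so no other rule can have lower risk.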
The truth is almost never exactly linear. Therefore, we need alternate ways to create models for data. Polynomials, step functions, splines, local regression, and generalized additive models offer a lot of flexibility, without losing the ease and interpretability of linear models.
Our polynomial regression model is:
\(y_i = \beta_0 + \beta_1x_i + \beta_2x_i^2 + ... + \beta_dx_i^d + \epsilon_i\). This is essentially just a case of multiple linear regression with predictors \(x_i, x_i^2, ..., x_i^d\). Note that, by convention, to have an \(x^4\) term we would also include the \(x^3\), \(x^2\), and \(x\) terms.
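To see why this is multiple linear regression, note that each observation is simply expanded into a row of powers, and the model is linear in the coefficients. A minimal sketch follows; the function name and inputs are made up for illustration.

```python
def polynomial_features(x, degree):
    """Return the design-matrix row [1, x, x^2, ..., x^degree] for one input."""
    return [x**k for k in range(degree + 1)]

# Each row can be fed to any multiple linear regression routine.
design_matrix = [polynomial_features(xi, 4) for xi in [0.0, 1.0, 2.0]]
for row in design_matrix:
    print(row)
```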
A step function looks like this:
A linear spline looks like this:
Cubic splines follow similarly:
In all of the above approaches, we must pick a number of knots, K, to use.
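A minimal sketch of basis functions commonly used for these three models, assuming the standard indicator and truncated power constructions. The knot locations and function names are hypothetical. Unlike some textbook presentations, the step basis here uses all \(K+1\) interval indicators and no separate intercept.

```python
def step_basis(x, knots):
    """Indicators of the K + 1 intervals cut out by K sorted knots."""
    edges = [float("-inf")] + list(knots) + [float("inf")]
    return [1 if lo <= x < hi else 0 for lo, hi in zip(edges[:-1], edges[1:])]

def truncated_power(x, knot, degree):
    """(x - knot)^degree if x > knot, and 0 otherwise."""
    return (x - knot) ** degree if x > knot else 0.0

def linear_spline_basis(x, knots):
    """[1, x] plus one hinge term per knot: K + 2 columns."""
    return [1.0, x] + [truncated_power(x, k, 1) for k in knots]

def cubic_spline_basis(x, knots):
    """[1, x, x^2, x^3] plus one cubic term per knot: K + 4 columns."""
    return [1.0, x, x**2, x**3] + [truncated_power(x, k, 3) for k in knots]

knots = [1.0, 2.0, 3.0]  # hypothetical knot locations, K = 3
print(step_basis(2.5, knots))
print(linear_spline_basis(2.5, knots))
print(cubic_spline_basis(2.5, knots))
```

In each case the fitted model is an ordinary linear regression on the basis columns, so a larger K means more columns and a more flexible fit.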
Pros and Cons
• Tree-based methods are simple and useful for interpretation.
• However, they typically are not competitive with the best supervised learning approaches in terms of prediction accuracy.
• Hence we also discuss bagging, random forests, and boosting. These methods grow multiple trees which are then combined to yield a single consensus prediction.
• Combining a large number of trees can often result in dramatic improvements in prediction accuracy, at the expense of some loss in interpretability.
The ends of the branches are called terminal nodes/leaves. The regions \(R_1\), \(R_2\), and \(R_3\) represent the terminal nodes.
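Predictions at a terminal node are made in one of two ways: majority vote or average. With majority vote, we predict the most common class (the mode) among the training observations in that node. With averaging, we take the average of the responses in the node as an estimate of \(\eta(x)\), and classify as class 1 if it is \(\geq 0.5\) and as the other class otherwise.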
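The two leaf-level rules can be sketched as follows. The function names are hypothetical, and the averaging rule assumes the responses in the leaf are coded as 0/1 indicators of class 1.

```python
from collections import Counter

def majority_vote(labels):
    """Predict the most common class among the training labels in a leaf."""
    return Counter(labels).most_common(1)[0][0]

def leaf_average(responses):
    """Average of the responses that fall in a leaf."""
    return sum(responses) / len(responses)

def classify_by_average(is_class_1, threshold=0.5):
    """Average 0/1 indicators to estimate eta(x), then threshold at 0.5."""
    eta_hat = leaf_average(is_class_1)
    return 1 if eta_hat >= threshold else -1

print(majority_vote([1, -1, 1, 1, -1]))
print(classify_by_average([1, 1, 0, 0]))
```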
The MSE is defined as:
MSE\((\theta)\) = \(E[||\hat{\theta}-\theta||_2^2] = E [\sum_{j=1}^{p} (\hat{\theta}_j-\theta_j)^2]\)
The bias of the estimator is denoted as bias\((\hat{\theta})\) = \(E[\hat{\theta}] - \theta \in R^p\)
Thus, we can decompose MSE into the following parts:
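A sketch of the standard decomposition, obtained by adding and subtracting \(E[\hat{\theta}]\) inside the norm (the cross term has expectation zero). The symbols \(\text{Cov}(\hat{\theta})\) and \(\text{tr}\) (the trace) are notation introduced here for illustration:

\[
\text{MSE}(\theta) = E[||\hat{\theta}-\theta||_2^2] = ||E[\hat{\theta}] - \theta||_2^2 + E[||\hat{\theta} - E[\hat{\theta}]||_2^2] = ||\text{bias}(\hat{\theta})||_2^2 + \text{tr}(\text{Cov}(\hat{\theta}))
\]

That is, the MSE is the squared bias plus the total variance of the components of the estimator.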
RSS is defined to be: \(\sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_jx_{ij})^2\)
For ridge regression, our “new” RSS is now defined to be (we want to minimize this):
\(\sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_jx_{ij})^2 + \lambda\sum_{j=1}^{p} (\beta_j)^2\)
We note that this is the same as the OLS RSS, but with an added penalty term, which is the part following the plus sign. The penalty term \(\lambda\sum_{j=1}^{p} \beta_j^2\) is called a shrinkage penalty. It is small when the coefficients \(\beta_1, ..., \beta_p\) are close to zero, so it has the effect of shrinking the estimates of \(\beta_j\) towards zero. Note that the intercept \(\beta_0\) is not penalized.
When \(\lambda = 0\), there is no penalty term, so ridge regression produces the same coefficients as OLS. As \(\lambda \to \infty\), the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates approach zero (but are not exactly zero for any finite \(\lambda\)). This means that for large values of \(\lambda\) the predictors have almost no influence on the fitted model, even though none of them is actually removed. The best value of \(\lambda\) is usually chosen through cross validation.
As \(\lambda\) increases, the flexibility of ridge regression decreases, which increases bias but reduces variance. OLS is unbiased but can have high variance; accepting some bias by using ridge regression allows us to have a model with lower variance. A super important fact about ridge regression is that it pushes the \(\beta\) coefficients to be smaller, but it does not force them to be zero. That is, it will not get rid of irrelevant features but rather minimize their impact on the trained model.
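The shrinkage effect is easy to see in the simplest case. The sketch below assumes a single predictor and no intercept (for example, after centering the data), where minimizing \(\sum_i (y_i - \beta x_i)^2 + \lambda\beta^2\) gives \(\hat{\beta} = \sum_i x_iy_i / (\sum_i x_i^2 + \lambda)\). The function name and data are made up for illustration.

```python
def ridge_slope(x, y, lam):
    """Minimizer of sum((y_i - b * x_i)^2) + lam * b^2 (one predictor, no intercept)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

x = [1.0, 2.0, 3.0, 4.0]  # made-up data with y = 2x exactly
y = [2.0, 4.0, 6.0, 8.0]
for lam in [0.0, 30.0, 1e6]:
    print(lam, ridge_slope(x, y, lam))
```

With \(\lambda = 0\) the estimate equals the OLS slope; larger values of \(\lambda\) pull it towards zero, but it stays nonzero for every finite \(\lambda\).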
A disadvantage of Ridge Regression: it will include all \(p\) predictors in the final model, since the coefficients shrink towards zero but are not exactly zero for any finite \(\lambda\). This can create challenges for interpretation when \(p\) is large. A reminder again: increasing values of \(\lambda\) will tend to reduce the magnitude of the coefficients, but will not result in exclusion of any of the variables. This point brings us to Lasso regression (or as Tesi called it, “luh-so”).
For Lasso, our “new” RSS here is now:
\(\sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_jx_{ij})^2 + \lambda\sum_{j=1}^{p} |\beta_j|\).
Compared to Ridge Regression, the only thing that changed here is the penalty following \(\lambda\): it uses \(|\beta_j|\) instead of \(\beta_j^2\). This difference allows “useless” coefficients to be exactly equal to zero when \(\lambda\) is sufficiently large. This means Lasso can be used as a variable selection tool; this is a big pro when compared to Ridge Regression, and allows for more interpretability. However, Lasso implicitly assumes that a number of the coefficients are truly zero; this means Lasso may perform better when we have data with many insignificant predictors, but it can do poorly when most of the predictors are actually related to the response. Ridge regression may do better when the response is a function of many predictors, all with coefficients of roughly equal size.
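Why the absolute-value penalty produces exact zeros can be seen in the simplest case. The sketch below assumes a single predictor and no intercept, where minimizing \(\sum_i (y_i - \beta x_i)^2 + \lambda|\beta|\) gives a soft-thresholded estimate: \(\hat{\beta} = \text{sign}(S_{xy}) \max(|S_{xy}| - \lambda/2, 0) / S_{xx}\), with \(S_{xy} = \sum_i x_iy_i\) and \(S_{xx} = \sum_i x_i^2\). The function name and data are made up for illustration.

```python
def lasso_slope(x, y, lam):
    """Minimizer of sum((y_i - b * x_i)^2) + lam * |b| (one predictor, no intercept)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    # Soft-thresholding: shrink |sxy| by lam / 2 and clip at zero.
    shrunk = max(abs(sxy) - lam / 2, 0.0)
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * shrunk / sxx

x = [1.0, 2.0, 3.0, 4.0]  # made-up data with y = 2x exactly
y = [2.0, 4.0, 6.0, 8.0]
for lam in [0.0, 60.0, 120.0, 200.0]:
    print(lam, lasso_slope(x, y, lam))
```

Once \(\lambda/2\) reaches \(|S_{xy}|\), the estimate is exactly zero and stays there, which is the variable selection behaviour described above.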
A note on Ridge and Lasso: when the tuning parameter is zero, both of these will be exactly the same as OLS, since the penalty term is what differentiates them from OLS. Both penalties typically drive coefficients towards zero, which is why they are referred to as shrinkage methods. Additionally, scaling matters for these: if we convert a predictor from dollars to cents, the penalized coefficient estimates do not simply rescale (as they would in OLS), so the fitted values can change as well. For this reason, predictors are usually standardized before fitting.
We select the tuning parameter by computing the cross-validation Mean Squared Error for each candidate value and picking the value for which it is smallest.
Cross-Validation is a very important concept since it helps us pick our best tuning parameter. It is one of two resampling methods we cover.
There are two types of cross-validation: k-fold cross-validation and leave-one-out cross-validation (LOOCV). Note that LOOCV is a special case of k-fold CV with \(k=n\).
LOOCV has a couple of major advantages over the validation set approach.
1. It has far less bias. LOOCV repeatedly fits the statistical learning method using training sets that contain \(n−1\) observations, almost as many as are in the entire data set. This contrasts with the validation set approach, where the training set is about half the size of the original data set. As a result, LOOCV tends not to overestimate the test error rate as much as the validation set approach does.
2. Performing LOOCV multiple times always produces the same results, because there is no randomness in how the data are split. This is not typically true for the validation set approach, where the result depends on the random split.
Remark: LOOCV has the potential to be expensive to implement, since the model has to be fit \(n\) times. This can be very time consuming if \(n\) is large, and if each individual model is slow to fit.
To visualize LOOCV, please see the following gif:
This approach (k-fold CV) involves randomly dividing the set of observations into \(k\) groups, or folds, of approximately equal size (\(k\) is usually 5 or 10).
1. The first fold is treated as a validation set, and the method is fit on the remaining \(k−1\) folds.
2. The mean squared error, \(MSE_1\), is then computed on the observations in the held-out fold.
3. This procedure is repeated \(k\) times; each time, a different group of observations is treated as a validation set.
This process results in \(k\) estimates of the test error, \(MSE_1,...,MSE_k\). The k-fold CV estimate is computed by averaging these values: \(CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} MSE_i\).
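The procedure above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes a simple linear regression as the model being evaluated, and the function names, the seed, and the data are made up. Setting `k` equal to the number of observations gives LOOCV.

```python
import random

def fit_line(x, y):
    """Simple OLS fit; returns (intercept, slope). Needs at least two distinct x values."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def k_fold_cv(x, y, k, seed=0):
    """Return (CV estimate, list of per-fold MSEs) for the simple linear model."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)        # random assignment to folds
    folds = [indices[i::k] for i in range(k)]   # k folds of near-equal size
    fold_mses = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
        errors = [(y[i] - b0 - b1 * x[i]) ** 2 for i in held_out]
        fold_mses.append(sum(errors) / len(errors))
    return sum(fold_mses) / k, fold_mses

x = [float(i) for i in range(10)]                            # made-up inputs
y = [1.2, 2.9, 5.1, 7.0, 9.2, 10.8, 13.1, 15.0, 16.9, 19.1]  # made-up responses
cv_5, fold_mses = k_fold_cv(x, y, k=5)
loocv, _ = k_fold_cv(x, y, k=len(x))  # LOOCV is the k = n special case
print(cv_5, loocv)
```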
Recall that k-fold CV with \(k<n\) has a computational advantage over LOOCV. But putting computational issues aside, a less obvious but potentially more important advantage of k-fold CV is that it often gives more accurate estimates of the test error rate than does LOOCV. This has to do with a bias-variance trade-off. The validation set approach can lead to overestimates of the test error rate, since in this approach the training set used to fit the statistical learning method contains only half the observations of the entire data set. Using this logic, it is not hard to see that LOOCV will give approximately unbiased estimates of the test error, since each training set contains \(n−1\) observations, which is almost as many as the number of observations in the full data set. Performing k-fold CV for, say, \(k = 5\) or \(k = 10\) will lead to an intermediate level of bias, since each training set contains \((k−1)n/k\) observations: fewer than in the LOOCV approach, but substantially more than in the validation set approach. Therefore, from the perspective of bias reduction, LOOCV is preferred to k-fold CV. However, LOOCV has higher variance than does k-fold CV with \(k<n\): the \(n\) LOOCV models are trained on almost identical data sets, so their outputs are highly correlated, and the average of highly correlated quantities has higher variance than the average of less correlated ones.
To visualize k-fold cross validation, please see this gif and image:
For more info on CV & Bootstrap, please see this link: http://www2.stat.duke.edu/~rcs46/lectures_2017/05-resample/05-cv.pdf
Bootstrap is another resampling method. We essentially use our sample to simulate the population distribution by “sampling” from our sample, WITH replacement. Computing a statistic (such as the mean) on each resample allows us to estimate the variability of that statistic, for example its standard error. Increasing the number of bootstrap resamples gives a more accurate approximation of this sampling distribution.
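A minimal sketch of the bootstrap, assuming we want the standard error of the sample mean. The function name, the number of resamples, the seed, and the data are all made up for illustration.

```python
import random
import statistics

def bootstrap_standard_error(sample, statistic, n_resamples=2000, seed=0):
    """Estimate the standard error of `statistic` by resampling WITH replacement."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = []
    for _ in range(n_resamples):
        # Draw n observations from the sample, with replacement.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    # The spread of the resampled statistics approximates its standard error.
    return statistics.stdev(estimates)

sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 3.1]  # made-up data
se_mean = bootstrap_standard_error(sample, statistics.mean)
print(se_mean)
```

For the mean, the result should be close to the familiar \(s/\sqrt{n}\); the value of the bootstrap is that the same recipe works for statistics with no simple formula, such as the median.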