Supervised learning is like teaching someone the difference between two different things, for example the difference between a house and a bike.
Unsupervised learning - there is no outcome variable, just a set of predictors measured on a set of samples. The best way to think about it is trying to group things based on their features.
Machine learning has a greater emphasis on large-scale applications and prediction accuracy, whereas *statistical learning* emphasizes models and their interpretability, precision, and uncertainty.
The optimal or ideal regression is \(f(x) = E(Y|X = x)\), where \(x\) is some value of \(X\) and we want the expected value of \(Y\). The expected value is the conditional mean of \(Y\) given that value of \(X\). Since there may be few or no observations at the exact value \(x\), we relax the definition and average over the neighbours around that value of \(X\). That formula is \(\hat{f}(x) = Ave(Y|X \in N(x))\), where \(N(x)\) is the neighbourhood of \(x\). This can be pretty good for a small number of predictors and a large-ish \(N\). The method can be lousy when \(p\) is large. This is what is known as the curse of dimensionality, because nearest neighbours tend to be far away in high dimensions.
Curse of dimensionality - we need to average a reasonable fraction of the \(N\) values of \(y_i\) to bring the variance down - e.g. \(10\%\). But a \(10\%\) neighbourhood in high dimensions need no longer be local, so we lose the spirit of estimating \(E(Y|X = x)\) by local averaging. In other words, the more predictors we have, the further the neighbourhood must reach to capture that \(10\%\) of the data, until at some point the average is no longer local at all.
To get around the curse we construct linear models, where we fit a line to the training data to find our coefficients for each parameter. The linear model function is \(f_L(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p\). This model is almost never correct but serves as a good and interpretable approximation to the unknown true function \(f(X)\).
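As a quick illustration, here is a minimal sketch of fitting such a linear model by least squares; the data are simulated and all values are made up for the example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# simulated data: n = 100 observations, p = 3 predictors (purely illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

# least-squares fit of f_L(X) = beta_0 + beta_1*X_1 + ... + beta_3*X_3
fit = LinearRegression().fit(X, y)
print(fit.intercept_, fit.coef_)  # estimates of beta_0 and beta_1..beta_3
```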
VERY IMPORTANT If ANY value has a hat - \(\hat{Y}\) for example - it is an estimate. Without the hat it is the true population quantity.
TERM Warning Parsimony means selecting a simpler model involving fewer variables over a black-box predictor involving them all.
Assessing model accuracy can be done by fitting our model to some training data and computing the mean squared error \(MSE_{Tr} = Ave_{i \in Tr}[y_i - \hat{f}(x_i)]^2\). This may be biased toward overfit models, because a flexible enough model can fit the training data almost perfectly. This is why we use test data: we take the model derived from fitting the training data and measure its fit on the test data to spot problems with overfitting or underfitting.
For flexible models we need to be on the lookout for overfitting, since it is possible to fit those models to every data point in the set. This also means that rigid models are more prone to underfit rather than overfit.
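A rough sketch of that train/test comparison on simulated data, using polynomial degree as the flexibility knob (the degrees chosen are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=200)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.5, random_state=1)

for degree in (1, 3, 10):                        # rigid -> very flexible
    Xtr = PolynomialFeatures(degree).fit_transform(x_tr)
    Xte = PolynomialFeatures(degree).fit_transform(x_te)
    fit = LinearRegression().fit(Xtr, y_tr)
    mse_tr = mean_squared_error(y_tr, fit.predict(Xtr))   # training MSE
    mse_te = mean_squared_error(y_te, fit.predict(Xte))   # test MSE
    print(degree, round(mse_tr, 3), round(mse_te, 3))
```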
Bias-Variance Trade-off. Say we have a model and we want to know if we are fitting it well. Say we evaluate our model using a single test observation \(y_0\) against our \(\hat{f}(x_0)\); the quantity is \(E(y_0 - \hat{f}(x_0))^2\) - the expected test (prediction) error. This can be broken down into three pieces: \[ E(y_0 - \hat{f}(x_0))^2 = Var(\hat{f}(x_0)) + [Bias(\hat{f}(x_0))]^2 + Var(\epsilon) \]
There is nothing to be done about \(Var(\epsilon)\) since it is irreducible. But \(Var(\hat{f}(x_0))\) is the variance of the prediction at \(x_0\) across different training sets, and \([Bias(\hat{f}(x_0))]^2\) is the squared difference between the average prediction at \(x_0\) and the true value at \(x_0\).
This means that as the flexibility of \(\hat{f}\) increases, our variance increases and our bias decreases, because the model chases the individual data points in our training set. So choosing the flexibility based on average test error amounts to a bias-variance trade-off.
Think about classification with a 0/1 response: if we just estimate the conditional probability by a local average at a given value of X, we run into the same curse of dimensionality as before.
To measure the error we use the misclassification error rate, which is simply the fraction of instances we classified wrongly. The formula is:
\[ Err_{Te} = Ave_{i \in Te}I[y_i \neq \hat{C}(x_i)] \] We are therefore looking for the classifier that makes the smallest number of errors (in the population).
In K-nearest-neighbour classification, \(K\) is the number of training observations we use to classify a new observation \(L\) from our test data. That is, we look only at the \(K\) training points closest to \(L\). If \(K = 1\) we use only the single closest point to classify \(L\). If \(K = 10\) we take the 10 nearest training points around \(L\) and use the majority of 0’s and 1’s among them to make our classification.
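A small sketch of KNN classification with different values of \(K\), on simulated data (the sample size and \(K\) values are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=2, n_redundant=0,
                           random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

for k in (1, 10, 100):   # K = 1 chases the training data; large K is smoother
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(k, round(1 - knn.score(X_te, y_te), 3))   # test misclassification rate
```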
We need to calculate the standard errors of the slope and intercept. We are most interested in the SE of the slope. Its formula is:
\[ SE(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \overline{x})^2} \] The denominator is the spread of our x values, so the more spread out they are, the more they anchor our line down, decreasing our standard error.
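The same quantity can be worked out by hand; a sketch in numpy for simple regression, using \(\hat{\sigma}^2 = RSS/(n-2)\) as the estimate of \(\sigma^2\) (data simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=50)
y = 1 + 2 * x + rng.normal(size=50)

# least-squares slope and intercept
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()

rss = np.sum((y - (b0 + b1 * x))**2)
sigma2_hat = rss / (len(x) - 2)                        # estimate of sigma^2
se_b1 = np.sqrt(sigma2_hat / np.sum((x - x.mean())**2))
print(round(b1, 3), round(se_b1, 3))
```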
To calculate \(R^2\), the proportion of variance explained by the model, we have:
\[ R^2 = \frac{TSS - RSS}{TSS} = 1-\frac{RSS}{TSS} \] where TSS is the total sum of squares \(= \sum_{i=1}^{n}(y_i - \bar{y})^2\) and RSS \(= \sum_{i=1}^{n}(y_i - \hat{y}_i)^2\). Notice that RSS subtracts our estimated \(\hat{y}_i\) from the real value of \(y_i\), while TSS uses the mean of \(y\).
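A small sketch computing \(R^2\) directly from TSS and RSS on simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=50)
y = 1 + 2 * x + rng.normal(size=50)
y_hat = np.poly1d(np.polyfit(x, y, 1))(x)   # fitted values from a simple linear fit

tss = np.sum((y - y.mean())**2)             # total sum of squares
rss = np.sum((y - y_hat)**2)                # residual sum of squares
print(round(1 - rss / tss, 3))              # R^2
```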
Adding variables to the model can increase its \(R^2\) and its \(F\)-ratio even if they do not significantly impact the model, or they may be highly correlated with other predictors; for these and other reasons we need a method of selecting important variables.
This method computes all possible combinations of variables when creating the model. Its complexity is \(2^p\), where \(p\) is the number of predictors. This means that if we have \(40\) predictors we have over a trillion possible models. THIS IS USUALLY NOT USED
Forward selection - here we start with a null model and add variables one at a time: fit the variable that gives the lowest RSS, fix that variable in the model, and continue to the next one.
Backwards selection - here you start with all the variables and remove the one with the highest p-value, i.e. the one that is least significant (smallest t-statistic), and repeat.
So what if we think there is an interaction between two variables in our model? We can multiply the two predictors together and include that product as an interaction term. If the interaction is significant, it means the effect of one predictor on the response depends on the level of the other.
If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.
The rationale is that interactions are hard to interpret in a model without main effects - their meaning is changed.
IF asked: yes, you can use linear regression as a classification model, but it is bad at it since it can produce values below 0 and above 1. How bad depends on how the data points in the set are distributed between 0’s and 1’s. And linear regression is NOT appropriate when the # of classes is > 2. Here multiclass logistic regression or discriminant analysis is more appropriate.
This is a good method since it produces a probability \(p(X) = Pr(Y=1|X)\) between 0 and 1. It is useful to see the formula, which is:
\[ p(X) = \frac{e^{\beta_0 + \beta_1X}}{1+e^{\beta_0+\beta_1X}} \iff \log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0+\beta_1X \] The left-hand side of the second equation is a monotone transformation of \(p(X)\) called the log odds or logit.
We then use maximum likelihood to estimate the parameters. The likelihood gives the probability of the observed zeros and ones in the data, so we pick \(\beta_0\) and \(\beta_1\) to maximize the likelihood of the observed data.
IMPORTANT NOTE The coefficients depend on the units of the variable. So if a coefficient is 0.005 per dollar and we instead measure the variable in thousands of dollars, the coefficient becomes 5 (0.005 * 1000).
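A sketch showing how the fitted coefficient scales with the units of the predictor; the "balance in dollars" data are simulated purely for illustration, and the nearly unregularized settings (large C) are an assumption of the example:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# simulated "balance in dollars" vs. a 0/1 outcome (illustrative only)
rng = np.random.default_rng(5)
balance = rng.uniform(0, 3000, size=1000)
p = 1 / (1 + np.exp(-(-6 + 0.004 * balance)))
y = rng.binomial(1, p)

fit_dollars = LogisticRegression(C=1e6, solver="newton-cg", max_iter=1000)
fit_dollars.fit(balance.reshape(-1, 1), y)
fit_thousands = LogisticRegression(C=1e6, solver="newton-cg", max_iter=1000)
fit_thousands.fit((balance / 1000).reshape(-1, 1), y)

# the slope per $1000 is ~1000x the slope per $1
print(fit_dollars.coef_[0, 0], fit_thousands.coef_[0, 0])
```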
Here we model the distribution of X in each class separately and then use Bayes' theorem to flip things around and obtain \(Pr(Y|X)\). When we use a normal (Gaussian) distribution for each class, this leads to linear or quadratic discriminant analysis.
What this means is that we plot the distribution of X for both classes, and where those (weighted) densities intersect determines the boundary for deciding which class an observation belongs to.
If a class is well-separated, the parameter estimates for the logistic regression model are unstable (they can run off to infinity); linear discriminant analysis does not suffer from this problem.
If n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.
Linear discriminant analysis is popular when we have more than two response classes because it also provides low-dimensional views of the data.
There is a problem with LDA when we have many features, for example 4000, so in those cases we might want to reduce the dimensions.
If we construct a confusion matrix (a matrix that crosses our predicted groupings with the true groupings so we can see our accuracy), we can adjust the results by adjusting our threshold, meaning the probability cut-off that classifies between groups. This will change our false positive rate and false negative rate, so we need to be careful in setting the threshold, since one will affect the other. The effect of sweeping the threshold is captured by the ROC curve (think of the ABC analysis curve). It is also possible to use the AUC (area under the curve) to summarize the performance. High AUC = GOOD.
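A sketch of the confusion matrix / threshold / AUC mechanics on simulated data (evaluated on the training data just to keep the example short; the thresholds are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score

X, y = make_classification(n_samples=500, random_state=6)
probs = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

for threshold in (0.3, 0.5, 0.7):        # moving the cut-off trades FP for FN
    pred = (probs >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
    print(threshold, tn, fp, fn, tp)

print("AUC:", round(roc_auc_score(y, probs), 3))   # threshold-free summary
```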
When the discriminant analysis is quadratic (QDA), we get a non-linear boundary, meaning the boundary is curved and in some cases fits the data better.
For both logistic regression and LDA, the log odds is a linear function of the predictors.
The difference is in how the parameters are estimated. Logistic regression uses the conditional likelihood based on \(Pr(Y|X)\), known as discriminative learning.
LDA uses the full likelihood based on \(Pr(X,Y)\) known as generative learning.
Despite these differences, in practice the results are often very similar.
Logistic regression is popular for classification, especially when K = 2. LDA is useful when \(n\) is small, or the classes are well separated and Gaussian assumptions are reasonable, and also when K > 2. Naive Bayes is useful when \(p\) is very large.
Here we talk about cross-validation and the bootstrap.
These methods refit a model of interest to samples formed from the training set in order to obtain additional information about the fitted model. For example, they provide estimates of test-set prediction error, and of the standard deviation and bias of our parameter estimates.
Test error is the average error that results from using a statistical learning method to predict the response on a new observation, one that was not used in training. In contrast, training error can be easily calculated by applying the statistical learning method to the observations used in its training. But the training error rate is often quite different from the test error rate, and in particular the former can dramatically underestimate the latter.
When we think about this we need to remember the bias-variance trade-off and training- vs test-set performance. Recall: \(Var(\hat{f}(x_0))\) is the variance of the prediction at \(x_0\) across training sets and \([Bias(\hat{f}(x_0))]^2\) is the squared difference between the average prediction at \(x_0\) and the true value. So as model complexity increases, our training prediction error goes down but our test prediction error eventually goes up due to high variance (and low bias). Contrast that with low model complexity, where we may see high prediction error due to low variance but high bias.
OVERFITTING Increases our TEST ERROR.
Fig 1
For the validation-set approach we divide the dataset into two even parts. This is a random split into two halves without replacement.
This gives an OK estimate of how many predictors we need, but there can be high variance in the mean squared error between random splits, and we tend to overestimate the test error of the model fit on the entire data set.
This method also uses only a subset of the observations (\(\frac{n}{2}\)) to fit the model: only those that fall in the training half, not the validation half.
This is the most used cross-validation method. It is used to select the best model and to give an idea of the test error of the final chosen model.
K-fold means we divide the dataset into \(K\) subsets, where \(K\) can be any integer (the most widely used choices are \(K = 5\) or \(10\)). We randomly divide the dataset (without replacement) into \(K\) parts, fit the model on \(K-1\) parts, and test on the remaining part. This is repeated so that every part takes a turn as the test data.
There is a special case of this called leave-one-out cross-validation (LOOCV), where \(K = n\) and we use \(n\) folds. This is sometimes useful but typically doesn’t shake up the data enough. The estimates from each fold are highly correlated and hence their average can have high variance (bias is low, but the error estimate is noisy). Therefore it is better to use \(K = 5\) or \(10\).
Since each training set is only \((K-1)/K\) as big as the original training set, the estimates of prediction error will typically be biased upward, because the models are fit on less data than the full training set.
We can minimize this bias by using LOOCV, but that estimate has high variance, as noted earlier.
This is why the rule of thumb is: using K = 5 or 10 provides a good compromise in this bias-variance trade-off.
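A minimal sketch of a 10-fold CV error estimate with scikit-learn (which reports negative MSE by convention), on simulated data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=7)

# 10-fold cross-validated estimate of the test MSE
scores = cross_val_score(LinearRegression(), X, y, cv=10,
                         scoring="neg_mean_squared_error")
print(round(-scores.mean(), 2))
```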
We use the same CV approach for classification, but here we measure classification error instead.
There is a way in which one can cheat with CV. If we have too many predictors and a small sample, for example, we cannot cherry-pick our predictors before cross-validation (screening them against the outcome and then only cross-validating the final model), because we would have already learned about the class labels from the full data.
Using the example in the lectures: with 5000 predictors and 50 samples, if I pick the top 100 predictors by their correlation with the outcome, ignore the rest, and then cross-validate a logistic regression on those 100, my error estimate will be far too low, since the predictors were cherry-picked on the full data to fit the outcome variable.
Fig 2
Here we see that CV must be applied to both step one (the screening) and step two (fitting the model): within each fold, the predictors are selected and the model is fit using only that fold’s training data, and then validated on the held-out part.
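A sketch of the wrong way versus the right way, assuming pure-noise data so the true accuracy should be about 50%; the screening step (SelectKBest) and the k = 100 cut-off are illustrative stand-ins for "pick the top 100 predictors":

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# many predictors, few samples, and no real signal: the labels are pure noise
rng = np.random.default_rng(8)
X = rng.normal(size=(50, 5000))
y = rng.integers(0, 2, size=50)

# WRONG: screen the top 100 predictors on the full data, then cross-validate
top = SelectKBest(f_classif, k=100).fit(X, y).get_support()
wrong = cross_val_score(LogisticRegression(max_iter=1000), X[:, top], y, cv=5)

# RIGHT: screening sits inside the pipeline, so it is redone within each fold
pipe = Pipeline([("screen", SelectKBest(f_classif, k=100)),
                 ("clf", LogisticRegression(max_iter=1000))])
right = cross_val_score(pipe, X, y, cv=5)

print(wrong.mean(), right.mean())   # overly optimistic vs. roughly chance (0.5)
```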
So the bootstrap is when we resample our sample many times, with replacement, recompute our estimator on each resample, and look at the spread of those estimates. The standard error (in this case the standard deviation of the bootstrap estimates) gives a good idea of the accuracy of our estimator.
A good thing to remember is that everything is an estimate, so our dataset stands in for the population (\(\hat{P}\)) and so does everything derived from it.
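A minimal bootstrap sketch for the standard error of a sample mean (the estimator and sample are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.normal(loc=5, scale=2, size=100)       # the observed sample
B = 1000

# resample WITH replacement and recompute the estimator on each resample
boot_means = np.array([rng.choice(x, size=len(x), replace=True).mean()
                       for _ in range(B)])

se_boot = boot_means.std(ddof=1)               # bootstrap SE of the sample mean
print(round(x.mean(), 3), round(se_boot, 3))   # compare to sigma/sqrt(n) = 0.2
```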
Why would we consider an alternative to least squares?
Prediction accuracy, especially when \(p > n\), to control the variance.
Model interpretability: by removing irrelevant features - that is, by setting the corresponding coefficient estimates to zero - we can obtain a model that is more easily interpreted (feature selection).
There are three classes of methods: subset selection, shrinkage, and dimension reduction.
This method (best subset selection) fits all possible combinations of predictors: for each model size, the combination with the lowest RSS is kept as the best candidate of that size. But this means fitting ALL \(2^{p}\) possible models, so with 10 predictors there are 1024 possible combinations. It is also prone to overfitting: with so many candidates it is hard to reliably pick the best model, since the training fit only gets better as we add more predictors.
These are more restrictive alternatives to best subset selection.
Forward stepwise selection - very similar to best subset, but the difference is that we look at each predictor individually and at what it adds to the current model. We start with the null model, find the variable that gives the biggest improvement to the model, fix it in place, then check again until we run out of predictors.
Backward stepwise selection is the exact opposite of forward: you start with all the variables and eliminate, one at a time, the least useful variable (the one whose removal hurts the fit the least).
For both methods we now fit on the order of \(p^2\) models; with 10 predictors that is roughly 100 models, which is far less than best subset.
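One way to run a greedy forward/backward search in Python is scikit-learn's SequentialFeatureSelector; note it scores candidate variables by cross-validation rather than raw RSS, so it is a stand-in for the classical stepwise procedures rather than an exact copy:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5, random_state=10)

forward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                    direction="forward").fit(X, y)
backward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                     direction="backward").fit(X, y)
print(forward.get_support())    # which predictors the forward search kept
print(backward.get_support())   # which predictors the backward search kept
```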
These criteria adjust our training RSS to give us an estimate of the test error.
Mallow's Cp adds an adjustment to our RSS: \(C_p = \frac{1}{n}(RSS + 2d\hat{\sigma}^2)\). We have to have n > p, otherwise \(\hat{\sigma}^2\) from the full model is not defined (the full model's training error would be \(0\)). So Cp is limited to cases where n > p.
AIC \(= -2\log L + 2d\), where d = # predictors and L = the maximized likelihood function for the model. Cp and AIC are equivalent for linear models with Gaussian errors, but AIC also applies to other models fit by maximum likelihood.
BIC \(= \frac{1}{n}(RSS + \log(n)\,d\,\hat{\sigma}^2)\). BIC uses the number of observations in its penalty, so it places a heavier penalty on models with many variables and hence results in the selection of smaller models than Cp. Again, with BIC we are estimating the average test error across observations, so we want this value to be small.
The difference between AIC and BIC is the penalty term: BIC's penalty grows with \(\log(n)\) while AIC's is a constant 2 per predictor, so bigger models are punished harder. SO BIC picks smaller models, AIC does not.
Adjusted \(R^2\) has the same idea as BIC: it penalizes your model for having too many predictors. Let's first look at the base \(R^2\) formula
\[ R^2 = 1 - \frac{RSS}{TSS} \] for adjusted \(R^2\)
\[ Adj.R^2 = 1 - \frac{RSS/(n - d - 1)}{TSS / (n-1)} \]
So the added terms penalize the model for having too many predictors. Remember that n is the number of observations and d is the number of predictors; these adjustments matter especially when \(p\) is large relative to \(n\).
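A sketch computing Cp, BIC and adjusted \(R^2\) from the RSS of a single fitted model, following the formulas written in these notes (with \(\hat{\sigma}^2\) estimated from the full model); the data are simulated:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=11)
n, d = X.shape

fit = LinearRegression().fit(X, y)
rss = np.sum((y - fit.predict(X))**2)
tss = np.sum((y - y.mean())**2)
sigma2_hat = rss / (n - d - 1)                  # error-variance estimate

cp = (rss + 2 * d * sigma2_hat) / n             # Mallow's Cp as in the notes
bic = (rss + np.log(n) * d * sigma2_hat) / n    # BIC on the same scale
adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
print(round(cp, 2), round(bic, 2), round(adj_r2, 3))
```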
Here we use \(\lambda\) as a tuning parameter that penalizes the coefficients: the larger they get, the larger the penalty. \(\lambda\) is a non-negative number; larger values push the coefficients towards zero. The value of \(\lambda\) is found using CV. We usually apply ridge after we have standardized our predictors by:
\[ \hat{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij} - \bar{x}_j)^2}} \]
We use CV to find the optimal lambda, the one that minimizes the mean squared error. Lambda decreases the variance, but remember it also increases our bias.
There is also an obvious disadvantage of ridge regression: unlike subset selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model. The lasso overcomes this disadvantage.
Between the lasso and ridge: in ridge we penalize using lambda times the sum of the squared coefficients, while in the lasso we use the sum of the absolute values of the coefficients. So it is shrinkage using the absolute value.
Unlike ridge regression, which only pushes the coefficients towards \(0\), in the lasso the coefficients can become exactly \(0\) if the value of lambda is big enough, making it useful as a subset method for variable selection. To find the best value of lambda, we use CV.
The lasso creates a sparse model since it can push coefficients to exactly 0, but ridge produces dense models since they contain all the variables.
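A sketch contrasting the two on simulated data with only a few truly useful predictors; the alpha grid and CV settings are arbitrary choices (scikit-learn calls \(\lambda\) "alpha"):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10, random_state=12)
Xs = StandardScaler().fit_transform(X)          # standardize the predictors first

ridge = RidgeCV(alphas=np.logspace(-3, 3, 50)).fit(Xs, y)   # lambda chosen by CV
lasso = LassoCV(cv=10).fit(Xs, y)

# ridge keeps all 20 coefficients nonzero; the lasso sets many exactly to zero
print(np.sum(ridge.coef_ != 0), np.sum(lasso.coef_ != 0))
```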
Well, with the first three methods the number of predictors in the final model is known; with ridge regression it is not, since all p predictors remain in the model. This is not the case for the lasso, since it does "remove" some variables that are not important.
We do this using CV, but as lambda increases, the closer to 0 all the coefficients get, and therefore our bias increases too much, increasing our chances of underfitting.
PCA is principal component analysis. This one I know fairly well, so only the difference between it and partial least squares will be considered.
This method is very much like PCA, except that PCA just blindly creates linear combinations in the directions that best explain the predictors, taking no regard of the response variable. So the major drawback with PCA is that there is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response.
So PLS also identifies new features that are linear combinations of the original features and then fits a linear model via OLS using these M new features - but it chooses directions that are related to the response variable.
This in turn means that PLS is a supervised method, since it knows the response variable and uses it to build its new features.
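A sketch comparing PCR (PCA directions, then OLS) with PLS on simulated data; the number of components and the CV setup are arbitrary:

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=10, random_state=13)

# PCR: unsupervised directions (PCA) followed by OLS on the first M components
pcr = Pipeline([("pca", PCA(n_components=5)), ("ols", LinearRegression())])
# PLS: the directions are chosen using the response as well
pls = PLSRegression(n_components=5)

for name, model in [("PCR", pcr), ("PLS", pls)]:
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(name, round(mse, 1))
```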
To decide how big our tree needs to be, we add a penalty for how large the tree gets (sort of like ridge and lasso). To find the right alpha value we use cross-validation and pick the one with the lowest MSE. This is called cost-complexity pruning.
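A sketch of cost-complexity pruning with cross-validation, where the candidate alphas come from the pruning path of a fully grown tree (data simulated, settings arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=14)

# candidate alphas from the cost-complexity pruning path of a fully grown tree
path = DecisionTreeRegressor(random_state=14).cost_complexity_pruning_path(X, y)
alphas = np.clip(path.ccp_alphas, 0.0, None)    # guard against tiny negatives

# pick the alpha whose pruned tree has the lowest cross-validated MSE
grid = GridSearchCV(DecisionTreeRegressor(random_state=14),
                    {"ccp_alpha": alphas},
                    cv=5, scoring="neg_mean_squared_error").fit(X, y)
print(grid.best_params_["ccp_alpha"])
```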
WHAT NOT TO USE - this is very noisy. For classification trees we use the classification error rate rather than RSS; it is simply the fraction of the training observations in a region that do not belong to the most common class. Because it is noisy, it is not preferred for growing the tree.
The Gini index is also known as a measure of node purity: a small value indicates that a node contains predominantly observations from a single class.
Trees are very easy to explain, even more so than linear regression.
Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches seen in previous chapters.
Trees can be displayed graphically and are easily interpreted even by a non-expert
Trees can easily handle qualitative predictors without the need to create dummy variables
Unfortunately, trees generally DO NOT have the same level of predictive accuracy as some of the other regression and classification approaches seen in this book.
Bagging shares the same idea as bootstrapping. It randomly samples from the training set (with replacement) to create new sets, and it does this to grow hundreds of (full, unpruned) trees. Then, for classification, we take a majority vote: the overall prediction is the most commonly occurring class among the B predictions.
This is similar in spirit to leave-one-out CV. Each bootstrapped tree does not use all of the training set, so for each observation we can use only the trees in which it was left out (out-of-bag) to predict it, and check how accurately those predictions match the outcome.
Here we do the same procedure as in bagging, but only a small random subset of the predictors is allowed to be considered at each split. This helps decorrelate the trees, which reduces the variance when we then average the trees.
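A sketch contrasting bagging and a random forest via their out-of-bag accuracy; bagging is approximated here as a random forest that considers all predictors at every split (data simulated, settings arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=15)

# "bagging": every predictor is available at each split
bag = RandomForestClassifier(n_estimators=500, max_features=None,
                             oob_score=True, random_state=15).fit(X, y)
# random forest: only a random ~sqrt(p) subset is considered at each split
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            oob_score=True, random_state=15).fit(X, y)

print(bag.oob_score_, rf.oob_score_)   # out-of-bag accuracy estimates
```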
We only discuss boosting for decision trees.
Here we grow trees one after another, each fitted to the residuals of the current model. We shrink each new tree by a factor lambda and then update the residuals. There is a big chance of overfitting, so we need to learn slowly, i.e. absorb only a small amount of the residuals at each step to make sure we don't overfit.
There are a number of tuning parameters for boosting
Number of trees B. Unlike bagging and random forests, boosting can overfit if B is too large, although this overfitting tends to occur slowly if at all. We use CV to select B.
Shrinkage parameter lambda, a small positive number. This controls the rate at which boosting learns; typical values are \(0.01\) and \(0.001\), and the right choice can depend on the problem. A very small lambda can require using a very large value of B in order to achieve good performance.
The number of splits d in each tree, which controls the complexity of the boosted ensemble. Often d = 1 works well, in which case each tree is a stump, consisting of a single split and resulting in an additive model. More generally, d is the interaction depth and controls the interaction order of the boosted model, since d splits can involve at most d variables.
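A sketch mapping these three tuning parameters onto a boosted regression model (B = n_estimators, lambda = learning_rate, d = max_depth); the particular values are arbitrary:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=10, noise=10, random_state=16)

# slow learning (small lambda) with stumps (d = 1); B would be chosen via CV
boost = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.01,
                                  max_depth=1, random_state=16)
mse = -cross_val_score(boost, X, y, cv=5,
                       scoring="neg_mean_squared_error").mean()
print(round(mse, 1))
```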
For bagged/RF regression trees we look at the RSS decrease due to splits over a given predictor, averaged over all B trees; a large value indicates an important predictor.
Similarly, for bagged/RF classification trees, we add up the total amount that the Gini index is decreased by splits over a given predictor, averaged over all B trees.
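In scikit-learn these impurity-based importances are exposed as feature_importances_ on a fitted forest; a short sketch on simulated data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           random_state=17)
rf = RandomForestClassifier(n_estimators=500, random_state=17).fit(X, y)

# total impurity (Gini) decrease attributable to each predictor, averaged over trees
for j, imp in enumerate(rf.feature_importances_):
    print(f"X{j}: {imp:.3f}")
```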