In general, we have datasets with response variables, and different predictors of that response. The general relationship between the two is \[ Y = f(X) + \varepsilon\]
Statistical learning is essentially just a number of different methods for estimating \(f\), in order to make predictions about the response variable \(Y\).
In statistical learning models, a dataset is split into a training and a test set. The training set is the data that the model uses to “learn” from, and the test set is data that the model has not yet seen, but wants to predict. As such, there are two main measures which we are concerned with when it comes to model fit: the training MSE (error rate) and the test MSE (error rate). MSE is the measure of fit when the response variable is continuous, while error rate is the proportion of incorrect predictions when the response variable is categorical.
We want to choose the model which gives the lowest test MSE (error rate) as this means that the model performs well on previously unseen data, and has some predictive value outside of just the data used to train it. In general, we expect the training MSE (error rate) to be smalller than test MSE, because a lot of the models are specifically designed to minimize training MSE. When the observed difference is too large, then we should be concerned about “overfitting”, which means that the model follows the errors or noise of the training data too closely. This will result in highly accurate results for predicting the training data, but highly inaccurate results when predicting any observation outside of the training data, and overfit is something that should be avoided at all costs.
More flexible models have higher variance, but less bias, while less flexible models are the opposite. Finding the right levels of flexibility for a model in order to get the lowest test MSE (error rate) possible without overfitting, is the goal of the statistical learning methods which will be covered in this course.
The Bayes’ classifier, in simplest terms, assigns each observation to the most likely class given its predictor values. It does this by computing the conditional probability:
\[\operatorname{Pr}(Y=j | X = x_0)\]
Where \(Y\) is the response variable and \(j\) is the specific class that it belongs to. This is all given \(x_0\), the vector of predictors. For each value of each predictor, there is a different probability of the response variable belonging to a specific class. When the above equation is >50% is when it gets assigned to class \(j\).
The Bayes’ classifier always produces the lowest error (analagous to the irreducible error rate) rate which is defined as
\[1-E\left(\max _{j} \operatorname{Pr}(Y=j | X)\right)\]
but is not equal to 0 because there is always some overlap of the classes. In reality, we never really know the true conditional distriution of \(Y\) given \(X\), so many other methods attempt to estimate this, and then classify an observation to the class with the highest probability. Thus, the Bayes’ classifier is simply a “gold standard” for which to compare other methods to.
In KNN, a test observation \(x_0\) is selected, and a positive integer \(K\) is chosen as the number of closest points to \(x_0\) for which to estimate the conditional probability, by taking the fraction of points in the neighbourhood around \(x_0\), defined as the set \(\mathcal{N}_0\) whose response values are class \(j\)
\[\operatorname{Pr}\left(Y=j | X=x_{0}\right)=\frac{1}{K} \sum_{i \in \mathcal{N}_{0}} I\left(y_{i}=j\right)\]
For example, if we set \(K = 10\), then the model will look for the 10 closest points to \(x_0\), then check the proportions of how many observations are assigned to each class, and then assign \(x_0\) to the class with the highest probability.
The smaller \(K\) is, the the more flexible the model becomes and the more non-linear the decision boundary becomes. The opposite is true as \(K\) gets large. The method for choosing \(K\) is discussed in the next chapter, which is a very important part of this process as the results drastically change depending on what the value for \(K\) is.
When dealing with qualitative responses, linear regression is not feasible. Instead, logistic regression is used to model the probability that a response belongs to a particular category rather than modeling the response itself. Logistic regression uses the logistic function to ensure a prediction of a probability between 0 and 1. The logistic function takes the form: \[p(x)=\frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}\]
Maximum likelihood is used to model the coefficients of the logistic function. Rearranging and taking the logarithm of both sides of the logistic equation yields an equation for the logit or log-odds: \[\ln(\frac{p(x)}{1-p(x)})=\beta_0+\beta_1X\]
The logit is linear in \(X\). For logistic regression, increasing \(X\) by one-unit changes the log odds by \(\beta_1\). The estimates \(\beta_0\) and \(\beta_1\) are chosen to maximize this likelihood function.
Multiple logistic regression involves multiple predictors. The multiple logistic regression model returns a probability expressed as: \[p(x)=\frac{e^{\beta_0+\beta_1X_1+...+\beta_pX_p}}{1+e^{\beta_0+\beta_1X_1+...+\beta_pX_p}}\]
When dealing with multiple-class logistic regression, logistic regression is possible, but discriminant analysis is preferred.
In the Linear Discriminant Analysis approach, the distribution of the predictors X, is modeled separately in each of the response classes (Y=k), and then Bayes’ theorem is used to flip these around into estimates for Pra(Y=k│X=x). This is similar in form to the logistic regression when the distributions are assumed to be normal.
Bayes’ theorem states that \[p_{k}(x)=\operatorname{Pr}(Y=k |X=x)=\frac{\pi_{k}f_{k}(x)}{\sum_{l=1}^{K} \pi_{l} f_{l}(x)}\]
where
\[ f_{k}(x)=\frac{1}{\sqrt{2 \pi} \sigma_{k}} \exp \left(-\frac{1}{2 \sigma_{k}^{2}}\left(x-\mu_{k}\right)^{2}\right) \] is the density function of X for an observation from the kth class. \(f_k (x)\) is relatively large if there is a high probability that an observation in the kth class has \(X\approx x\) and is relatively small otherwise.
Plugging \(f_k (x)\) into \(p_k (x)\) and taking the log of the equation, we have that
\[ \delta_{k}(x)=x \frac{\mu_{k}}{\sigma^{2}}-\frac{\mu_{k}^{2}}{2 \sigma^{2}}+\log \left(\pi_{k}\right) \] Therefore, an observation is assigned to the class for which \(δ_k (x)\) is largest.
The LDA test error rate is used to determine how well the LDA classifier performs on the data. This measures the proportion of the entire dataset that is wrongly predicted by the model and is usually above the Bayes’ error rate. The performance of the classifier is also characterized by its sensitivity and specificity. A confusion matrix is used to measure this by comparing the LDA predictions to the true statuses of the observations. Sensitivity is the percentage of “positives” that are correctly identified and specificity is the percentage of “negatives” that are correctly identified.
The Quadratic Discriminant Analysis on the other hand, assumes that each class has its own covariance matrix. Choosing LDA over QDA or vice-versa, depends on the bias-variance trade-off. LDA is a much flexible classifier than the QDA and has substantially lower variance. However, if LDA’s assumption that K classes share a common covariance matrix is badly off, then LDA can suffer from high bias. LDA tends to be better than QDA if there are relatively few training observations so reducing variance is crucial. QDA is recommended if the training set is very large.
The dataset we used for this was a dataset on incomes in the United States, defined as whether or not a persons income is above $50,000. The predictors included in this dataset are:
## [1] "Bachelors" "HS-grad" "11th" "Masters"
## [5] "9th" "Some-college" "Assoc-acdm" "Assoc-voc"
## [9] "7th-8th" "Doctorate" "Prof-school" "5th-6th"
## [13] "10th" "1st-4th" "Preschool" "12th"
## [1] "White" "Black" "Asian-Pac-Islander"
## [4] "Amer-Indian-Eskimo" "Other"
## [1] "Male" "Female"
## [1] "Adm-clerical" "Exec-managerial" "Handlers-cleaners"
## [4] "Prof-specialty" "Other-service" "Sales"
## [7] "Craft-repair" "Transport-moving" "Farming-fishing"
## [10] "Machine-op-inspct" "Tech-support" "?"
## [13] "Protective-serv" "Armed-Forces" "Priv-house-serv"
Hours worked per week (Integer)
Immigrant status (dummy variable for whether or not someone was born in the US)
set.seed(1) # need to run this so that it can break ties when points are equidistant by randomly choosing which ones to include
predictors <- cbind(race, education, `hours-per-week`, occupation, sex, immigrant)
knn1 <- knn(predictors, predictors, income, k = 10)
## income
## knn1 <=50K >50K
## <=50K 23074 4741
## >50K 1646 3100
The specificity for this model is:
## [1] 0.9334142
The sensitivity for this model is:
## [1] 0.1811339
The total proportion of correct predictions is:
## [1] 0.8038451
The proportion of correct predictions from a trivial model which only predicts “<=50K” everytime would be:
## [1] 0.7591904
So then this model only improves upon a trivial model by roughly 4%.
logistic1 <- glm(income ~ race + education + sex + `hours-per-week` + occupation + immigrant, data = income_data, family = binomial)
summary(logistic1)
##
## Call:
## glm(formula = income ~ race + education + sex + `hours-per-week` +
## occupation + immigrant, family = binomial, data = income_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3431 -0.7010 -0.4010 -0.1061 3.2178
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.597630 0.082669 -19.326 < 2e-16 ***
## raceBlack -0.361491 0.062271 -5.805 6.43e-09 ***
## raceAsian-Pac-Islander -0.068844 0.092238 -0.746 0.455441
## raceAmer-Indian-Eskimo -0.608477 0.189856 -3.205 0.001351 **
## raceOther -0.785206 0.234993 -3.341 0.000834 ***
## educationHS-grad -0.911590 0.045484 -20.042 < 2e-16 ***
## education11th -1.848459 0.141062 -13.104 < 2e-16 ***
## educationMasters 0.505106 0.061375 8.230 < 2e-16 ***
## education9th -1.864142 0.205424 -9.075 < 2e-16 ***
## educationSome-college -0.728854 0.045756 -15.929 < 2e-16 ***
## educationAssoc-acdm -0.494065 0.082450 -5.992 2.07e-09 ***
## educationAssoc-voc -0.410007 0.073216 -5.600 2.14e-08 ***
## education7th-8th -1.862103 0.172431 -10.799 < 2e-16 ***
## educationDoctorate 1.245146 0.124668 9.988 < 2e-16 ***
## educationProf-school 1.154125 0.106891 10.797 < 2e-16 ***
## education5th-6th -1.753487 0.265978 -6.593 4.32e-11 ***
## education10th -1.675109 0.140898 -11.889 < 2e-16 ***
## education1st-4th -2.129208 0.427174 -4.984 6.22e-07 ***
## educationPreschool -11.945906 70.918411 -0.168 0.866233
## education12th -1.457949 0.191038 -7.632 2.32e-14 ***
## sexFemale -1.210453 0.039609 -30.560 < 2e-16 ***
## `hours-per-week` 0.028276 0.001339 21.113 < 2e-16 ***
## occupationExec-managerial 0.947608 0.062470 15.169 < 2e-16 ***
## occupationHandlers-cleaners -1.102174 0.125353 -8.793 < 2e-16 ***
## occupationProf-specialty 0.488865 0.065993 7.408 1.28e-13 ***
## occupationOther-service -1.143020 0.103064 -11.090 < 2e-16 ***
## occupationSales 0.300343 0.065938 4.555 5.24e-06 ***
## occupationCraft-repair 0.099916 0.066415 1.504 0.132474
## occupationTransport-moving -0.056130 0.085232 -0.659 0.510184
## occupationFarming-fishing -0.947699 0.118483 -7.999 1.26e-15 ***
## occupationMachine-op-inspct -0.275656 0.087968 -3.134 0.001727 **
## occupationTech-support 0.493799 0.091986 5.368 7.95e-08 ***
## occupation? -0.353377 0.096396 -3.666 0.000246 ***
## occupationProtective-serv 0.441658 0.102941 4.290 1.78e-05 ***
## occupationArmed-Forces -1.029652 1.089490 -0.945 0.344619
## occupationPriv-house-serv -2.382861 1.034760 -2.303 0.021289 *
## immigrant -0.188090 0.059969 -3.136 0.001710 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 35948 on 32560 degrees of freedom
## Residual deviance: 28078 on 32524 degrees of freedom
## AIC: 28152
##
## Number of Fisher Scoring iterations: 12
contrasts(income) # shows us break down of y variable to confirm which one we are predicting as "1" in the regression.
## >50K
## <=50K 0
## >50K 1
Below is the confusion matrix for this application
## income
## logistic_predictions <=50K >50K
## <=50K 23053 4784
## >50K 1667 3057
The specificity for this model is:
## [1] 0.9325647
The sensitivity for this model is:
## [1] 0.1832248
The total proportion of correct predictions is:
## [1] 0.8018795
The proportion of correct predictions from a trivial model which only predicts “<=50K” everytime would be:
## [1] 0.7591904
So then this model only improves upon a trivial model by roughly 4% as well.
lda1 <- lda(income ~ race + education + sex + `hours-per-week` + occupation + immigrant, data = income_data)
lda_predictions <- predict(lda1, income)
lda_class <- lda_predictions$class
## income
## lda_class <=50K >50K
## <=50K 22938 4717
## >50K 1782 3124
The specificity for this model is:
## [1] 0.9279126
The sensitivity for this model is:
## [1] 0.1809915
The total proportion of correct predictions is:
## [1] 0.8004054
The proportion of correct predictions from a trivial model which only predicts “<=50K” everytime would be:
## [1] 0.7591904
So then this model only improves upon a trivial model by roughly 4% as well.
For QDA, we had to remove education do to collinearity with the other variables, as the model could not compute with this variable included.
qda1 <- qda(income ~ race + `hours-per-week` + occupation + sex + immigrant, data = income_data) # removed education because it was too correlated with race
qda_predictions <- predict(qda1, income)
qda_class <- qda_predictions$class
qda_confusion <- table(qda_class, income)
qda_confusion
## income
## qda_class <=50K >50K
## <=50K 10773 1013
## >50K 13947 6828
The specificity for this model is:
## [1] 0.435801
The sensitivity for this model is:
## [1] 0.05755355
The total proportion of correct predictions is:
## [1] 0.5405547
The proportion of correct predictions from a trivial model which only predicts “<=50K” everytime would be:
## [1] 0.7591904
So then it is clear that QDA is much worse as it doesn’t even do as good as a trivial classifier.