Statistical Learning

In general, we have datasets with a response variable and a number of predictors of that response. The general relationship between the two is \[ Y = f(X) + \varepsilon\]

Statistical learning is, in essence, a collection of methods for estimating \(f\) in order to make predictions about the response variable \(Y\).

In statistical learning models, a dataset is split into a training set and a test set. The training set is the data that the model uses to “learn” from, while the test set is data that the model has not yet seen and on which its predictions are evaluated. As such, there are two main measures of model fit: the training MSE (error rate) and the test MSE (error rate). MSE is the measure of fit when the response variable is continuous, while the error rate, the proportion of incorrect predictions, is used when the response variable is categorical.
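
A minimal sketch of a common way to form such a split in R, where `mydata` is a hypothetical data frame and the 70/30 ratio is arbitrary:

set.seed(1)
n <- nrow(mydata)                             # mydata is a placeholder data frame
train_idx <- sample(n, size = round(0.7 * n)) # 70% of rows go to the training set
train <- mydata[train_idx, ]                  # data the model learns from
test  <- mydata[-train_idx, ]                 # held-out data for evaluating predictions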

We want to choose the model which gives the lowest test MSE (error rate), as this means that the model performs well on previously unseen data and has some predictive value beyond the data used to train it. In general, we expect the training MSE (error rate) to be smaller than the test MSE, because many of the methods are specifically designed to minimize training MSE. When the observed difference is too large, we should be concerned about “overfitting”, which means that the model follows the errors, or noise, of the training data too closely. Overfitting produces highly accurate predictions on the training data but highly inaccurate predictions on any observation outside of it, and is something to be avoided at all costs.

More flexible models have higher variance but less bias, while less flexible models are the opposite. Finding the right level of flexibility, in order to get the lowest possible test MSE (error rate) without overfitting, is the goal of the statistical learning methods covered in this course.
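
To make this trade-off concrete, the small simulation below fits polynomials of increasing degree; the true \(f\), noise level, sample size, and degrees are all invented for illustration. Training MSE keeps falling as flexibility grows, while test MSE eventually rises once the fit starts chasing noise.

set.seed(42)
f <- function(x) sin(2 * x)                     # the (normally unknown) true f
n <- 100
x_train <- runif(n, 0, 3); y_train <- f(x_train) + rnorm(n, sd = 0.3)
x_test  <- runif(n, 0, 3); y_test  <- f(x_test)  + rnorm(n, sd = 0.3)

for (d in c(1, 3, 10, 20)) {                    # increasing flexibility
  fit <- lm(y_train ~ poly(x_train, d))
  train_mse <- mean((y_train - fitted(fit))^2)
  test_mse  <- mean((y_test - predict(fit, data.frame(x_train = x_test)))^2)
  cat(sprintf("degree %2d: train MSE %.3f, test MSE %.3f\n", d, train_mse, test_mse))
}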

Bayes’ Classifier

The Bayes’ classifier, in simplest terms, assigns each observation to the most likely class given its predictor values. It does this by computing the conditional probability:

\[\operatorname{Pr}(Y=j | X = x_0)\]

where \(Y\) is the response variable and \(j\) is a specific class, all conditional on \(x_0\), the vector of predictor values. For each value of the predictors, there is a different probability of the response variable belonging to each class, and an observation is assigned to the class \(j\) for which this conditional probability is largest; in the two-class case, this is the class whose probability exceeds 50%.

The Bayes’ classifier always produces the lowest possible error rate (analogous to the irreducible error), which is defined as

\[1-E\left(\max _{j} \operatorname{Pr}(Y=j | X)\right)\]

but it is not equal to 0 because the classes always overlap to some extent. In reality, we never know the true conditional distribution of \(Y\) given \(X\), so many other methods attempt to estimate it and then classify an observation to the class with the highest estimated probability. Thus, the Bayes’ classifier is simply a “gold standard” against which to compare other methods.
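
Since the Bayes’ classifier requires the true conditional distribution, it can only be computed in a simulation where that distribution is chosen by us. A minimal sketch with two one-dimensional Gaussian classes (all parameters invented):

posterior1 <- function(x) {                       # Pr(Y = 1 | X = x) when X | Y=1 ~ N(0,1),
  dnorm(x, 0, 1) * 0.5 /                          # X | Y=2 ~ N(2,1), and equal priors of 0.5
    (dnorm(x, 0, 1) * 0.5 + dnorm(x, 2, 1) * 0.5)
}
bayes_class <- function(x) ifelse(posterior1(x) > 0.5, 1, 2)

bayes_class(c(0.5, 1.5))                          # assigns class 1, then class 2

# Bayes error rate: 1 - E[max_j Pr(Y = j | X)], estimated over draws from the mixture
x <- c(rnorm(5000, 0, 1), rnorm(5000, 2, 1))
mean(1 - pmax(posterior1(x), 1 - posterior1(x)))  # about 0.16 in this setup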

K-Nearest Neighbours (KNN)

In KNN, given a test observation \(x_0\), a positive integer \(K\) is chosen as the number of points closest to \(x_0\) to use when estimating the conditional probability. Letting \(\mathcal{N}_0\) denote this neighbourhood of the \(K\) points nearest to \(x_0\), the probability of class \(j\) is estimated by the fraction of points in \(\mathcal{N}_0\) whose response values equal \(j\):

\[\operatorname{Pr}\left(Y=j | X=x_{0}\right)=\frac{1}{K} \sum_{i \in \mathcal{N}_{0}} I\left(y_{i}=j\right)\]

For example, if we set \(K = 10\), the model finds the 10 points closest to \(x_0\), computes the proportion of those observations belonging to each class, and assigns \(x_0\) to the class with the highest estimated probability.
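
A hand-rolled sketch of this estimate for a single test point, using toy data invented here, makes the formula concrete:

x_train <- c(1.0, 1.2, 1.9, 2.1, 2.4, 3.0, 3.3, 3.8, 4.0, 4.5)        # one predictor
y_train <- factor(c("A", "A", "A", "B", "A", "B", "B", "B", "B", "B")) # two classes

knn_prob <- function(x0, K) {
  nbrs <- order(abs(x_train - x0))[1:K] # indices of the K nearest neighbours, the set N_0
  table(y_train[nbrs]) / K              # fraction of each class among the neighbours
}

knn_prob(2.0, K = 3) # A: 2/3, B: 1/3, so x0 = 2.0 is assigned to class A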

The smaller \(K\) is, the more flexible the model becomes and the more non-linear the decision boundary becomes; the opposite is true as \(K\) grows. The method for choosing \(K\) is discussed in the next chapter; this choice is a very important part of the process, as the results change drastically with the value of \(K\).

Logistic Regression

When dealing with qualitative responses, linear regression is not appropriate. Instead, logistic regression is used to model the probability that the response belongs to a particular category, rather than modeling the response itself. Logistic regression uses the logistic function to guarantee predicted probabilities between 0 and 1. The logistic function takes the form: \[p(X)=\frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}\]

Maximum likelihood is used to estimate the coefficients of the logistic function. Rearranging and taking the logarithm of both sides of the logistic equation yields an equation for the logit, or log-odds: \[\ln\left(\frac{p(X)}{1-p(X)}\right)=\beta_0+\beta_1X\]

The logit is linear in \(X\): increasing \(X\) by one unit changes the log-odds by \(\beta_1\). The estimates of \(\beta_0\) and \(\beta_1\) are the values that maximize the likelihood function.
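
A quick numerical check of this interpretation, using invented coefficients \(\beta_0 = -3\) and \(\beta_1 = 0.5\):

p <- function(x, b0 = -3, b1 = 0.5) exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x))
logit <- function(pr) log(pr / (1 - pr))

logit(p(5)) - logit(p(4)) # exactly beta_1 = 0.5: one unit in X shifts the log-odds by beta_1
p(5) - p(4)               # the probability itself does not change by a constant amount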

Multiple logistic regression involves multiple predictors and returns a probability expressed as: \[p(X)=\frac{e^{\beta_0+\beta_1X_1+\cdots+\beta_pX_p}}{1+e^{\beta_0+\beta_1X_1+\cdots+\beta_pX_p}}\]

Logistic regression can be extended to response variables with more than two classes, but in that setting discriminant analysis is often preferred.

Linear and Quadratic Discriminant Analysis (LDA & QDA)

In the Linear Discriminant Analysis approach, the distribution of the predictors \(X\) is modeled separately in each of the response classes (\(Y=k\)), and Bayes’ theorem is then used to flip these around into estimates for \(\operatorname{Pr}(Y=k \mid X=x)\). When the distributions are assumed to be normal, the resulting model is similar in form to logistic regression.

Bayes’ theorem states that \[p_{k}(x)=\operatorname{Pr}(Y=k |X=x)=\frac{\pi_{k}f_{k}(x)}{\sum_{l=1}^{K} \pi_{l} f_{l}(x)}\]

where

\[ f_{k}(x)=\frac{1}{\sqrt{2 \pi} \sigma_{k}} \exp \left(-\frac{1}{2 \sigma_{k}^{2}}\left(x-\mu_{k}\right)^{2}\right) \] is the density function of \(X\) for an observation from the \(k\)th class. \(f_k(x)\) is relatively large if there is a high probability that an observation in the \(k\)th class has \(X\approx x\), and relatively small otherwise.

Plugging \(f_k(x)\) into \(p_k(x)\), taking the log, assuming additionally that the classes share a common variance (\(\sigma_1^2 = \cdots = \sigma_K^2 = \sigma^2\)), and discarding the terms that do not depend on \(k\), we obtain the discriminant function

\[ \delta_{k}(x)=x \frac{\mu_{k}}{\sigma^{2}}-\frac{\mu_{k}^{2}}{2 \sigma^{2}}+\log \left(\pi_{k}\right) \] An observation is then assigned to the class for which \(\delta_k(x)\) is largest.
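
A sketch of estimating and applying \(\delta_k(x)\) by hand on invented one-predictor data for two classes (means, pooled variance, and priors are all estimated from the toy samples):

set.seed(1)
x1 <- rnorm(50, mean = 0) # class 1 training observations
x2 <- rnorm(50, mean = 2) # class 2 training observations

mu     <- c(mean(x1), mean(x2))                                   # estimated class means
pi_k   <- c(0.5, 0.5)                                             # equal priors here
sigma2 <- (sum((x1 - mu[1])^2) + sum((x2 - mu[2])^2)) / (100 - 2) # pooled variance estimate

delta <- function(x, k) x * mu[k] / sigma2 - mu[k]^2 / (2 * sigma2) + log(pi_k[k])

x0 <- 0.8
which.max(c(delta(x0, 1), delta(x0, 2))) # assign x0 to the class with the largest discriminant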

The LDA test error rate is used to determine how well the LDA classifier performs on the data. It measures the proportion of the entire dataset that the model predicts incorrectly, and is usually above the Bayes’ error rate. The performance of the classifier is also characterized by its sensitivity and specificity, which are computed from a confusion matrix comparing the LDA predictions to the true statuses of the observations. Sensitivity is the percentage of “positives” that are correctly identified, and specificity is the percentage of “negatives” that are correctly identified.
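
As a concrete illustration, all three quantities can be read off a 2x2 confusion matrix; the counts below are invented:

confusion <- matrix(c(50, 10,  # predicted negative: 50 true negatives, 10 false negatives
                      5, 35),  # predicted positive: 5 false positives, 35 true positives
                    nrow = 2, byrow = TRUE,
                    dimnames = list(predicted = c("neg", "pos"),
                                    truth     = c("neg", "pos")))

sensitivity <- confusion["pos", "pos"] / sum(confusion[, "pos"]) # 35/45: positives caught
specificity <- confusion["neg", "neg"] / sum(confusion[, "neg"]) # 50/55: negatives caught
accuracy    <- sum(diag(confusion)) / sum(confusion)             # overall correct predictions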

Quadratic Discriminant Analysis, on the other hand, assumes that each class has its own covariance matrix. Choosing LDA over QDA, or vice-versa, depends on the bias-variance trade-off. LDA is a much less flexible classifier than QDA and has substantially lower variance. However, if LDA’s assumption that the \(K\) classes share a common covariance matrix is badly off, then LDA can suffer from high bias. LDA tends to be better than QDA when there are relatively few training observations, so that reducing variance is crucial; QDA is recommended when the training set is very large, so that the variance of the classifier is not a major concern.
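
A sketch comparing the two on simulated data (all parameters invented) in which each class truly has its own covariance matrix, so QDA’s assumption holds:

library(MASS) # provides lda(), qda(), and mvrnorm()
set.seed(2)
n <- 200
x <- rbind(mvrnorm(n, c(0, 0), diag(2)),         # class 1: identity covariance
           mvrnorm(n, c(1, 1), diag(c(3, 0.3)))) # class 2: its own covariance
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = factor(rep(1:2, each = n)))

lda_fit <- lda(y ~ x1 + x2, data = dat)
qda_fit <- qda(y ~ x1 + x2, data = dat)
mean(predict(lda_fit)$class == dat$y) # training accuracy of LDA
mean(predict(qda_fit)$class == dat$y) # training accuracy of QDA (typically higher here)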

Applications

Data

The dataset used here covers incomes in the United States, coded as whether or not a person’s income is above $50,000. The predictors included in this dataset are:

  • Education
##  [1] "Bachelors"    "HS-grad"      "11th"         "Masters"     
##  [5] "9th"          "Some-college" "Assoc-acdm"   "Assoc-voc"   
##  [9] "7th-8th"      "Doctorate"    "Prof-school"  "5th-6th"     
## [13] "10th"         "1st-4th"      "Preschool"    "12th"
  • Race
## [1] "White"              "Black"              "Asian-Pac-Islander"
## [4] "Amer-Indian-Eskimo" "Other"
  • Sex
## [1] "Male"   "Female"
  • Occupation
##  [1] "Adm-clerical"      "Exec-managerial"   "Handlers-cleaners"
##  [4] "Prof-specialty"    "Other-service"     "Sales"            
##  [7] "Craft-repair"      "Transport-moving"  "Farming-fishing"  
## [10] "Machine-op-inspct" "Tech-support"      "?"                
## [13] "Protective-serv"   "Armed-Forces"      "Priv-house-serv"
  • Hours worked per week (Integer)

  • Immigrant status (dummy variable for whether or not someone was born in the US)

K-Nearest Neighbours

library(class) # provides the knn() function

set.seed(1) # knn() breaks ties between equidistant points at random, so a seed is needed for reproducibility

predictors <- cbind(race, education, `hours-per-week`, occupation, sex, immigrant)

knn1 <- knn(predictors, predictors, income, k = 10) # trained and evaluated on the same observations
table(knn1, income)
##        income
## knn1    <=50K  >50K
##   <=50K 23074  4741
##   >50K   1646  3100

The specificity for this model is:

## [1] 0.9334142

The sensitivity for this model is:

## [1] 0.1811339

The total proportion of correct predictions is:

## [1] 0.8038451

The proportion of correct predictions from a trivial model which only predicts “<=50K” every time would be:

## [1] 0.7591904

So this model improves upon the trivial model by only roughly 4 percentage points.

Logistic Regression

logistic1 <- glm(income ~ race + education + sex + `hours-per-week` + occupation + immigrant, data = income_data, family = binomial)

summary(logistic1)
## 
## Call:
## glm(formula = income ~ race + education + sex + `hours-per-week` + 
##     occupation + immigrant, family = binomial, data = income_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3431  -0.7010  -0.4010  -0.1061   3.2178  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -1.597630   0.082669 -19.326  < 2e-16 ***
## raceBlack                    -0.361491   0.062271  -5.805 6.43e-09 ***
## raceAsian-Pac-Islander       -0.068844   0.092238  -0.746 0.455441    
## raceAmer-Indian-Eskimo       -0.608477   0.189856  -3.205 0.001351 ** 
## raceOther                    -0.785206   0.234993  -3.341 0.000834 ***
## educationHS-grad             -0.911590   0.045484 -20.042  < 2e-16 ***
## education11th                -1.848459   0.141062 -13.104  < 2e-16 ***
## educationMasters              0.505106   0.061375   8.230  < 2e-16 ***
## education9th                 -1.864142   0.205424  -9.075  < 2e-16 ***
## educationSome-college        -0.728854   0.045756 -15.929  < 2e-16 ***
## educationAssoc-acdm          -0.494065   0.082450  -5.992 2.07e-09 ***
## educationAssoc-voc           -0.410007   0.073216  -5.600 2.14e-08 ***
## education7th-8th             -1.862103   0.172431 -10.799  < 2e-16 ***
## educationDoctorate            1.245146   0.124668   9.988  < 2e-16 ***
## educationProf-school          1.154125   0.106891  10.797  < 2e-16 ***
## education5th-6th             -1.753487   0.265978  -6.593 4.32e-11 ***
## education10th                -1.675109   0.140898 -11.889  < 2e-16 ***
## education1st-4th             -2.129208   0.427174  -4.984 6.22e-07 ***
## educationPreschool          -11.945906  70.918411  -0.168 0.866233    
## education12th                -1.457949   0.191038  -7.632 2.32e-14 ***
## sexFemale                    -1.210453   0.039609 -30.560  < 2e-16 ***
## `hours-per-week`              0.028276   0.001339  21.113  < 2e-16 ***
## occupationExec-managerial     0.947608   0.062470  15.169  < 2e-16 ***
## occupationHandlers-cleaners  -1.102174   0.125353  -8.793  < 2e-16 ***
## occupationProf-specialty      0.488865   0.065993   7.408 1.28e-13 ***
## occupationOther-service      -1.143020   0.103064 -11.090  < 2e-16 ***
## occupationSales               0.300343   0.065938   4.555 5.24e-06 ***
## occupationCraft-repair        0.099916   0.066415   1.504 0.132474    
## occupationTransport-moving   -0.056130   0.085232  -0.659 0.510184    
## occupationFarming-fishing    -0.947699   0.118483  -7.999 1.26e-15 ***
## occupationMachine-op-inspct  -0.275656   0.087968  -3.134 0.001727 ** 
## occupationTech-support        0.493799   0.091986   5.368 7.95e-08 ***
## occupation?                  -0.353377   0.096396  -3.666 0.000246 ***
## occupationProtective-serv     0.441658   0.102941   4.290 1.78e-05 ***
## occupationArmed-Forces       -1.029652   1.089490  -0.945 0.344619    
## occupationPriv-house-serv    -2.382861   1.034760  -2.303 0.021289 *  
## immigrant                    -0.188090   0.059969  -3.136 0.001710 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 35948  on 32560  degrees of freedom
## Residual deviance: 28078  on 32524  degrees of freedom
## AIC: 28152
## 
## Number of Fisher Scoring iterations: 12
contrasts(income) # shows the coding of the response, confirming which level is predicted as "1" in the regression
##       >50K
## <=50K    0
## >50K     1
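
The code that produces `logistic_predictions` is not shown above; a plausible sketch, assuming the usual 0.5 cutoff on the fitted probabilities:

logistic_probs <- predict(logistic1, type = "response")              # fitted Pr(income > 50K)
logistic_predictions <- ifelse(logistic_probs > 0.5, ">50K", "<=50K")
table(logistic_predictions, income)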

Below is the confusion matrix for this application:

##                     income
## logistic_predictions <=50K  >50K
##                <=50K 23053  4784
##                >50K   1667  3057

The specificity for this model is:

## [1] 0.9325647

The sensitivity for this model is:

## [1] 0.1832248

The total proportion of correct predictions is:

## [1] 0.8018795

The proportion of correct predictions from a trivial model which only predicts “<=50K” every time would be:

## [1] 0.7591904

So this model, too, improves upon the trivial model by only roughly 4 percentage points.

Linear Discriminant Analysis

library(MASS) # provides lda() and qda()

lda1 <- lda(income ~ race + education + sex + `hours-per-week` + occupation + immigrant, data = income_data)

lda_predictions <- predict(lda1, income_data) # predict on the same data used to fit the model

lda_class <- lda_predictions$class
table(lda_class, income)
##          income
## lda_class <=50K  >50K
##     <=50K 22938  4717
##     >50K   1782  3124

The specificity for this model is:

## [1] 0.9279126

The sensitivity for this model is:

## [1] 0.1809915

The total proportion of correct predictions is:

## [1] 0.8004054

The proportion of correct predictions from a trivial model which only predicts “<=50K” every time would be:

## [1] 0.7591904

Again, this model improves upon the trivial model by only roughly 4 percentage points.

Quadratic Discriminant Analysis

For QDA, we had to remove education due to collinearity with the other variables, as the model could not be fit with this variable included.

qda1 <- qda(income ~ race + `hours-per-week` + occupation + sex + immigrant, data = income_data) # education removed due to collinearity

qda_predictions <- predict(qda1, income_data) # predict on the same data used to fit the model

qda_class <- qda_predictions$class
qda_confusion <- table(qda_class, income)
qda_confusion
##          income
## qda_class <=50K  >50K
##     <=50K 10773  1013
##     >50K  13947  6828

The specificity for this model is:

## [1] 0.435801

The sensitivity for this model is:

## [1] 0.05755355

The total proportion of correct predictions is:

## [1] 0.5405547

The proportion of correct predictions from a trivial model which only predicts “<=50K” every time would be:

## [1] 0.7591904

So it is clear that QDA is much worse here, as it does not even perform as well as the trivial classifier.