Classification

Jake

05/07/2021

  • Classification is used when the response \(y_i\) is qualitative/categorical. Our goal is to find a predictive function \(f\) that estimates the probability of each class and assigns the observation to a class once the relevant decision boundary is crossed:

\[ P(Y=k|X) \in [0,1]\]

  • Error rate is a commonly used metric in classification problems:
    • The test error rate is more useful than the training error rate and is often estimated through resampling methods

\[ \frac{1}{n}\sum^n_{i=1}\mathbb{I}(y_i\neq\hat{y}_i)\]

  • The confusion matrix is another useful tool that is more in depth than the error rate (a small R sketch of these quantities follows the figure below):
    • Sometimes a specific type of error is more impactful on a project than the overall error.
      • For example, when deciding whether someone has cancer and needs medication, it is worth misdiagnosing some healthy people to ensure that as many people with cancer as possible are diagnosed.
    • True positive rate in this case is: \(\frac{11}{14}\)
    • False positive rate in this case is: \(\frac{2}{16}\)
    • Positive predicted value is: \(\frac{14}{17}\)
    • Negative predicted value is: \(\frac{11}{13}\)

Figure: Confusion Matrix
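
A minimal R sketch of the error rate and the confusion-matrix summaries above, assuming 0/1 vectors of predictions pred and true labels y (hypothetical names):

mean(pred != y)                            # overall error rate

cm <- table(Predicted = pred, Truth = y)   # confusion matrix

TP <- cm["1", "1"]; FP <- cm["1", "0"]
FN <- cm["0", "1"]; TN <- cm["0", "0"]

TP / (TP + FN)   # true positive rate (sensitivity)
FP / (FP + TN)   # false positive rate
TP / (TP + FP)   # positive predicted value
TN / (TN + FN)   # negative predicted value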

  • The ROC curve plots the true positive rate against the false positive rate, which is obtained by varying the threshold.
    • The threshold is the probability required to classify an observation into a specific class
    • The threshold is 50% by default, but with domain knowledge of which errors we want to minimise, other thresholds are beneficial
      • If we want to lower the false positive rate (saying something is true when it is not), we increase the probability required to classify the observation into that class.
      • For example, we may only want to assign an observation to class 1 if the probability is above 75%. \[ P(Y=1|X) > 0.75\]
      • This may increase the overall error rate, but decrease a different type of error. The graph below shows how decreasing the threshold increases the overall error rate but decreases a specific type of error (a short sketch of this trade-off follows the figure).
Figure: Error curve
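
A minimal sketch of how changing the threshold shifts the error types, assuming a vector of predicted probabilities probs and 0/1 labels y (both hypothetical names):

errors_at <- function(probs, y, threshold) {
  pred <- as.integer(probs > threshold)          # classify at the given threshold
  c(threshold      = threshold,
    error_rate     = mean(pred != y),
    false_positive = mean(pred == 1 & y == 0),
    false_negative = mean(pred == 0 & y == 1))
}

# Compare a few thresholds, e.g. the default 0.5 versus a stricter 0.75
t(sapply(c(0.25, 0.5, 0.75), function(th) errors_at(probs, y, th)))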

  • A good ROC curve hugs the top left.
    • We can also consider the area under the curve (AUC) to get a numerically interpretable metric (a manual sketch follows the figure below)
Figure: ROC
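
A minimal manual ROC/AUC sketch, assuming the same hypothetical probs and 0/1 labels y as above:

# True/false positive rates over a decreasing grid of thresholds
thresholds <- sort(unique(c(0, probs, 1)), decreasing = TRUE)
tpr <- sapply(thresholds, function(th) mean(probs > th & y == 1) / mean(y == 1))
fpr <- sapply(thresholds, function(th) mean(probs > th & y == 0) / mean(y == 0))

plot(fpr, tpr, type = "l", xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)                        # reference line for a random classifier

# Area under the curve via the trapezoidal rule
sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)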

Logistic Regression

  • Logistic regression is used for a binary response, which follows the Bernoulli distribution. Because the model we use is linear in X on the log-odds scale, the decision boundary is linear.

\[ f(y_i) = p^{y_i}(1-p)^{1-y_i}\]

Figure: Logistic Boundary

  • Logistic regression coefficients correspond to a change in the log odds rather than in the probability.

  • Most modelling limitations of linear regression, such as collinearity, carry over from the Gaussian model.

  • Our goal is to model the probability of an event occurring, which is bounded between \((0,1)\); the logistic function maps the linear predictor onto this range, and its inverse gives a linear model.

\[ P(x_i) = \frac{\exp(x_i)}{1+\exp(x_i)} = \frac{1}{1+\exp(-x_i)}\]

  • With inverse (the logit function):

\[ x(p) = \log\left(\frac{p}{1-p}\right)\]

  • This forms the basis of our binomial model, where we model the log odds in a linear fashion:

\[ \underbrace{\log\left(\frac{P(x_i)}{1-P(x_i)}\right)}_{\text{Log odds}} = \mathbf{x}_i^\top\boldsymbol{\beta}\]

  • Logistic regression in R (\(1\) trial, e.g. win or lose):

## 
## Call:
## glm(formula = isjunk ~ ., family = binomial, data = spam)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.4904  -0.6549  -0.5781   0.5594   1.9608  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.96341    0.06573 -29.871   <2e-16 ***
## freq.excl    1.43767    0.11126  12.922   <2e-16 ***
## freq.dollar 12.13917    0.62319  19.479   <2e-16 ***
## freq.hash    0.27763    0.13236   2.097    0.036 *  
## average      0.19913    0.01739  11.448   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6170.2  on 4600  degrees of freedom
## Residual deviance: 4227.1  on 4596  degrees of freedom
## AIC: 4237.1
## 
## Number of Fisher Scoring iterations: 15
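
A minimal sketch of the call that would produce a summary like the one above, assuming a data frame spam with a binary 0/1 response isjunk (the names are taken from the Call shown; the data source itself is an assumption here):

fit <- glm(isjunk ~ ., family = binomial, data = spam)   # fit the logistic regression
summary(fit)

probs <- predict(fit, type = "response")   # fitted probabilities
pred  <- as.integer(probs > 0.5)           # default 50% threshold
mean(pred != spam$isjunk)                  # training error rate
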
  • We use maximum likelihood to estimate the coefficients:

\[ L(\boldsymbol{\beta})=\prod^n_{i=1}P(x_i)^{y_i}(1-P(x_i))^{1-y_i}\]

  • Beta estimates, note this needs to be solved numerically:

\[ \sum^n_{i=1}y_ix_i=\sum^n_{i=1}x_i\left(1-\frac{\exp(-\mathbf{x}_i^\top\boldsymbol{\hat{\beta}})}{1+\exp(-\mathbf{x}_i^\top\boldsymbol{\hat{\beta}})}\right)\]
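
For reference, a short derivation sketch: taking logs of the likelihood gives the log-likelihood, and setting its gradient with respect to \(\boldsymbol{\beta}\) to zero yields the estimating equation above.

\[ \log L(\boldsymbol{\beta}) = \sum^n_{i=1}\left[y_i\,\mathbf{x}_i^\top\boldsymbol{\beta} - \log\left(1+\exp(\mathbf{x}_i^\top\boldsymbol{\beta})\right)\right], \qquad \frac{\partial \log L(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}} = \sum^n_{i=1}\mathbf{x}_i\left(y_i - P(x_i)\right) = \mathbf{0} \]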

Linear Discriminant Analysis

  • Linear discriminant analysis is a classifier that attempts to mimic the theoretically best classifier (the Bayes classifier) through the use of Bayes' formula and the assumption that the covariates are normally distributed within each class.
    • Note that the Bayes classifier minimises the total error rate, but we often want to minimise a specific error, which can be done by changing the decision boundaries
    • As the name suggests, the decision boundary is linear, because the discriminant function we maximise is linear in the covariates.
Figure: LDA p=1

  • Given the response is a specific class, the covariates are assumed to be normally distributed with class specific mean but equal variance.

    \[ X|Y=k \sim N(\mu_k,\sigma^2) \]

  • Bayes Formula:

    • The prior \(\pi_k = P(Y=k)\) is relatively easy to estimate
    • The density is hard to estimate, but we assume the density of the predictors given Y is normally distributed

    \[ f_k(x)\equiv X|Y=k \sim N(\mu_k,\sigma^2)\]

    \[ p_k(x)\equiv P(Y=k|X=x)=\frac{\pi_kf_k(x)}{\sum^K_{l=1}\pi_lf_l(x)}\]

  • Using Bayes' formula and the normal assumption, we can create a discriminant function. Maximising this discriminant function is equivalent to maximising the Bayes probability, therefore we can classify based on the maximum discriminant.

    • Note that while maximising this value is equivalent to maximising the probability, \(\delta_k(x)\neq p_k(x)\).

\[ \delta_k(x) = x\frac{\hat{\mu}_k}{\hat{\sigma}^2} - \frac{\hat{\mu}_k^2}{2\hat{\sigma}^2} + \log(\hat{\pi}_k) \]

  • Where we use the estimates:
    • Class specific average

\[ \hat{\mu}_k = \frac{1}{n_k}\sum_{i:y_i=k}x_i\]

    • Common variance, effectively a weighted average of the class variances

\[ \hat{\sigma}^2 = \frac{1}{N-K}\sum^K_{k=1}\sum_{i:y_i=k}(x_i-\hat{\mu}_k)^2\]

    • The prior probability is estimated as the proportion of observations in the kth class

\[ \hat{\pi}_k = \frac{n_k}{n} \]
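
A minimal hand-rolled sketch of these estimates and the discriminant for \(p=1\), assuming a numeric predictor vector x and class labels y (both hypothetical names):

lda_1d <- function(x, y) {
  y <- factor(y)
  classes <- levels(y)
  n <- length(x); K <- length(classes)
  pi_hat <- as.numeric(table(y)) / n; names(pi_hat) <- classes   # priors n_k / n
  mu_hat <- tapply(x, y, mean)                                   # class-specific means
  sigma2_hat <- sum((x - mu_hat[as.character(y)])^2) / (n - K)   # pooled variance
  delta <- function(x0)                                          # discriminant at a new point x0
    x0 * mu_hat / sigma2_hat - mu_hat^2 / (2 * sigma2_hat) + log(pi_hat)
  list(pi = pi_hat, mu = mu_hat, sigma2 = sigma2_hat, delta = delta)
}

# Classify a new observation by the largest discriminant, e.g.:
# fit <- lda_1d(x, y); names(which.max(fit$delta(2.5)))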

  • This can be extended to multiple predictors, where we assume the covariates are drawn from a multivariate Gaussian distribution with a class-specific mean vector and a common covariance matrix. Under this distribution each individual predictor follows a one-dimensional normal distribution, with some correlation between pairs of predictors.

\[ \mathbf{X}|Y=k\sim N(\boldsymbol{\mu}_k, \boldsymbol{\Sigma})\]

\[ \delta_k(x) = \mathbf{x}^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k-\frac{1}{2}\boldsymbol{\mu}_k^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k+\log(\pi_k)\]

  • Example of the decision boundary with 2 predictors for a 3 class problem:
Figure: LDA p=2

  • In R we can conduct LDA using a command similar to glm (a sketch of the full workflow follows the output below)
    • The lda() function is in the MASS package, and the data is from the ISLR package
## Call:
## lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
## 
## Prior probabilities of groups:
##     Down       Up 
## 0.491984 0.508016 
## 
## Group means:
##             Lag1        Lag2
## Down  0.04279022  0.03389409
## Up   -0.03954635 -0.03132544
## 
## Coefficients of linear discriminants:
##             LD1
## Lag1 -0.6420190
## Lag2 -0.5135293
  • predict() returns a list of 3 elements
    • class contains the predicted class, posterior is a matrix whose kth column contains the posterior probability that the corresponding observation belongs to the kth class, and x contains the linear discriminants. Note that the first column of the posterior probabilities corresponds to the market decreasing.
## [1] "class"     "posterior" "x"
##          
## lda.class Down  Up
##      Down   35  35
##      Up     76 106
## [1] 0.5595238
## [1] 0
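
A minimal sketch of the workflow that would produce output like the above; the train/test split is assumed to be the usual ISLR one (training on years before 2005), and object names such as Smarket.2005 and Direction.2005 are assumptions:

library(MASS)
library(ISLR)

train <- Smarket$Year < 2005                     # assumed training split
Smarket.2005 <- Smarket[!train, ]
Direction.2005 <- Smarket$Direction[!train]

lda.fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
lda.pred <- predict(lda.fit, Smarket.2005)       # list with class, posterior and x
names(lda.pred)

lda.class <- lda.pred$class
table(lda.class, Direction.2005)                 # confusion matrix
mean(lda.class == Direction.2005)                # test accuracy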

Quadratic Discriminant Analysis

  • Quadratic discriminant analysis is an extension of LDA which also attempts to mimic the Bayes decision boundary.

    • However, QDA does not assume a common covariance matrix

    \[ X|Y=k\sim N(\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)\]

  • The discriminant function is quadratic due to this change in assumption.

\[ \delta_k(x) = -\frac{1}{2}\mathbf{x}^\top\boldsymbol{\Sigma}_k^{-1}\mathbf{x} + \mathbf{x}^\top\boldsymbol{\Sigma}_k^{-1}\boldsymbol{\mu}_k - \frac{1}{2}\boldsymbol{\mu}_k^\top\boldsymbol{\Sigma}_k^{-1}\boldsymbol{\mu}_k - \frac{1}{2}\log|\boldsymbol{\Sigma}_k| +\log(\pi_k)\]

  • QDA is much more flexible than LDA, but this can result in higher variance
    • Estimating one common covariance matrix requires \(p(p+1)/2\) parameters, while estimating \(K\) separate covariance matrices requires \(Kp(p+1)/2\) parameters
    • Use LDA on smaller data sets where reducing variance is most important. However, if the common covariance assumption is incorrect it can result in extreme bias.
    • Use QDA on larger data sets where reducing variance is not as important, or when the common covariance assumption is obviously incorrect
Figure: QDA vs Bayes, uncommon variance

  • We can implement this in R (a sketch follows the output below):
    • The qda() function is part of the MASS library
## Call:
## qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
## 
## Prior probabilities of groups:
##     Down       Up 
## 0.491984 0.508016 
## 
## Group means:
##             Lag1        Lag2
## Down  0.04279022  0.03389409
## Up   -0.03954635 -0.03132544
##          
## qda.class Down  Up
##      Down   30  20
##      Up     81 121
## [1] 0.5992063
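
As with LDA, a minimal sketch of the corresponding QDA workflow, reusing the assumed train/test objects from the LDA sketch above:

qda.fit <- qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
qda.class <- predict(qda.fit, Smarket.2005)$class
table(qda.class, Direction.2005)                 # confusion matrix
mean(qda.class == Direction.2005)                # test accuracy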

KNN

  • KNN classification is a simple, non-parametric classification method that considers the K nearest neighbours of a data point and assigns the new observation to the class with the highest rate of occurrence in the neighbourhood:

\[ f_k(x) = \frac{1}{K}\sum_{x_i\in N_0}\mathbb{I}(y_i=k) \]

  • Due to this local approach, KNN is extremely non-linear and therefore fares well when there is a very complicated decision boundary.
    • This high flexibility can also be an issue due to the high possible variance
    • Struggles in high dimensions due to the curse of dimensionality
  • We can implement this in R (a sketch follows at the end of this section)
    • The knn() function is part of the class package
    • train.X is the training data
    • test.X is the test data
    • train.Direction is the training responses
    • k is the number of nearest neighbours to compare
##         
## knn.pred Down Up
##     Down   48 56
##     Up     63 85
## [1] 0.5277778
  • Note that KNN is based on distances, so we may need to scale the variables
    • Note that we have to exclude factors when scaling
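
A minimal sketch of the KNN workflow with scaling, reusing the assumed Smarket split from the LDA sketch above; the choice of predictors and k here is illustrative:

library(class)

X <- scale(Smarket[, c("Lag1", "Lag2")])          # scale numeric predictors (exclude factors)

train.X <- X[train, ]
test.X  <- X[!train, ]
train.Direction <- Smarket$Direction[train]

set.seed(1)                                       # knn() breaks ties at random
knn.pred <- knn(train.X, test.X, train.Direction, k = 3)
table(knn.pred, Direction.2005)                   # confusion matrix
mean(knn.pred == Direction.2005)                  # test accuracy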