Linear discriminant analysis.

References

Load packages

## for lda() and qda()
library(MASS)
## Data for An Introduction to Statistical Learning with Applications in R
library(ISLR)
## Plotting
library(ggplot2)
library(gridExtra)
## for knn()
library(class)

Load data

data(package = "ISLR")
Data sets in package ‘ISLR’:

Auto                               Auto Data Set
Caravan                            The Insurance Company (TIC) Benchmark
Carseats                           Sales of Child Car Seats
College                            U.S. News and World Report's College Data'
Default                            Credit Card Default Data
Hitters                            Baseball Data
Khan                               Khan Gene Data
NCI60                              NCI 60 Data
OJ                                 Orange Juice Data
Portfolio                          Portfolio Data
Smarket                            S&P Stock Market Data
Wage                               Mid-Atlantic Wage Data
Weekly                             Weekly S&P Stock Market Data
## Load Default (Credit Card Default Data)
data(Default)
head(Default)
##   default student balance income
## 1      No      No   729.5  44362
## 2      No     Yes   817.2  12106
## 3      No      No  1073.5  31767
## 4      No      No   529.3  35704
## 5      No      No   785.7  38463
## 6      No     Yes   919.6   7492
Description:
     A simulated data set containing information on ten thousand
     customers. The aim here is to predict which customers will default
     on their credit card debt.
Format:
     A data frame with 10000 observations on the following 4 variables.
     ‘default’ A factor with levels ‘No’ and ‘Yes’ indicating whether
          the customer defaulted on their debt
     ‘student’ A factor with levels ‘No’ and ‘Yes’ indicating whether
          the customer is a student
     ‘balance’ The average balance that the customer has remaining on
          their credit card after making their monthly payment
     ‘income’ Income of customer

Discriptive

Individuals with higher credit card balance may be more likely to default.

## Summary
summary(Default)
##  default    student       balance         income     
##  No :9667   No :7056   Min.   :   0   Min.   :  772  
##  Yes: 333   Yes:2944   1st Qu.: 482   1st Qu.:21340  
##                        Median : 824   Median :34553  
##                        Mean   : 835   Mean   :33517  
##                        3rd Qu.:1166   3rd Qu.:43808  
##                        Max.   :2654   Max.   :73554
## Plot
plotData <- ggplot(data = Default,
       mapping = aes(x = balance, y = income, color = default, shape = student)) +
    layer(geom = "point", stat = "identity", alpha = 0.5) +
    scale_color_manual(values = c("No" = "yellow", "Yes" = "red")) +
    theme_bw() +
    theme(legend.key = element_blank()) +
    labs(title = "Original data")
plotData

plot of chunk unnamed-chunk-5

Logistic Regression

Both the balance and the income variables are standardized for interpretability. Both factors are significant, but the coefficient is less impressive for the income variable. The color of the plot is arbitrary, the threshold (predicted probability > 0.5 is used) for deciding the predicted classification is up to the investigator.

## Fit logistic regression
resLogit <- glm(formula = default ~ scale(balance) + scale(income),
                family  = binomial(link = "logit"),
                data    = Default)
## Show the result
summary(resLogit)
## 
## Call:
## glm(formula = default ~ scale(balance) + scale(income), family = binomial(link = "logit"), 
##     data = Default)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.473  -0.144  -0.057  -0.021   3.724  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -6.1256     0.1876  -32.66  < 2e-16 ***
## scale(balance)   2.7316     0.1100   24.84  < 2e-16 ***
## scale(income)    0.2775     0.0665    4.17  0.00003 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

## Put the predicted probability
Default$predProbLogit <- predict(resLogit, type = "response")
Default$predClassLogit <- factor(predict(resLogit, type = "response") > 0.5,
                                 levels = c(FALSE,TRUE), labels = c("No","Yes"))

## Plot (probability)
plotLogit <- ggplot(data = Default,
       mapping = aes(x = balance, y = income, color = predProbLogit, shape = student)) +
    layer(geom = "point", alpha = 0.5) +
    scale_color_gradient(low = "yellow", high = "red") +
    theme_bw() +
    theme(legend.key = element_blank()) +
    labs(title = "Predicted probability of outcome (Logistic)")
grid.arrange(plotData, plotLogit, ncol = 2)

plot of chunk unnamed-chunk-6


## Plot (classification)
plotLdaClass <- ggplot(data = Default,
       mapping = aes(x = balance, y = income, color = predClassLogit, shape = student)) +
    layer(geom = "point", alpha = 0.5) +
    scale_color_manual(values = c("No" = "yellow", "Yes" = "red")) +
    theme_bw() +
    theme(legend.key = element_blank()) +
    labs(title = "Predicted outcome (Logistic; p>0.5)")
grid.arrange(plotData, plotLdaClass, ncol = 2)

plot of chunk unnamed-chunk-6

Linear Discriminant Analysis

LDA is similar to logistic regression in this kind of a simple case. However, there are certain advantages.

## Fit LDA
resLda <- lda(formula = default ~ scale(balance) + scale(income),
              data    = Default)
resLda
## Call:
## lda(default ~ scale(balance) + scale(income), data = Default)
## 
## Prior probabilities of groups:
##     No    Yes 
## 0.9667 0.0333 
## 
## Group means:
##     scale(balance) scale(income)
## No        -0.06498      0.003688
## Yes        1.88633     -0.107061
## 
## Coefficients of linear discriminants:
##                   LD1
## scale(balance) 1.0791
## scale(income)  0.1039

## Predict
predLda <- predict(resLda)
str(predLda)
## List of 3
##  $ class    : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ posterior: num [1:10000, 1:2] 0.997 0.997 0.987 0.999 0.996 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:10000] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:2] "No" "Yes"
##  $ x        : num [1:10000, 1] -0.1516 -0.2075 0.5177 -0.6659 -0.0724 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:10000] "1" "2" "3" "4" ...
##   .. ..$ : chr "LD1"

## Put into the dataset
Default$predProbLda <- predLda$posterior[,"Yes"]
Default$predClassLda <- predLda$class

## Plot (probability)
plotLdaProb <- ggplot(data = Default,
       mapping = aes(x = balance, y = income, color = predProbLda, shape = student)) +
    layer(geom = "point", alpha = 0.5) +
    scale_color_gradient(low = "yellow", high = "red") +
    theme_bw() +
    theme(legend.key = element_blank()) +
    labs(title = "Predicted probability of outcome (LDA)")
grid.arrange(plotData, plotLdaProb, ncol = 2)