Why Classification?

The linear regression model discussed in Chapter 3 assumes that the response variable \(Y\) is quantitative. In many situations, however, the response variable is qualitative (categorical), such as eye color or type of disease. In this section we study classification models in R. There are many possible classification techniques; some of the most widely used classifiers are logistic regression, linear discriminant analysis, naive Bayes, and K-nearest neighbors. This discussion focuses largely on logistic regression, viewed as a modification of linear regression.
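
Instead of modeling the response directly, logistic regression models the probability that \(Y\) belongs to a particular category. For a single predictor \(X\), the model is

\[
p(X) = \Pr(Y = 1 \mid X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}},
\]

which keeps the fitted probabilities between 0 and 1, unlike the straight line produced by linear regression.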

Default Dataset

To illustrate the concept of classification we will use the Default data set, which contains records for 10,000 individuals. We are interested in predicting whether an individual will default on his or her credit card payment on the basis of annual income and monthly credit card balance.

setwd("C:\\Users\\Asus\\Documents\\UP Files\\UPV Subjects\\Stat 197 (Intro to BI)")
Credit <- read.csv(".\\Default.csv")
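
Before plotting, it is worth confirming that the file loaded as expected. The check below is a minimal sketch and assumes the CSV contains the default, balance, and income columns used in this section:

# Inspect the structure and the class balance of the response
str(Credit)
table(Credit$default)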

Exploratory Charts

We visualize how those who default differ from those who do not in terms of balance and income. Based on the scatter plot and box plots shown below, there appears to be a difference between those who default and those who do not across both balance and income.

colors <- c("cyan4", 
            "coral2") 

attach(Credit)
plot(x=balance, y=income,
     pch=19,
     col=colors[factor(default)])

legend("topright",
       legend = c("Did not default", "Default"),
       pch=19,
       col=colors)

par(mfrow=c(1,2))

boxplot(balance ~ default,
        data=Credit,
        col=colors)

boxplot(income ~ default,
        data=Credit,
        col=colors)
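
The visual impression can be backed up with simple group summaries. The sketch below (using the same Credit data frame) computes the mean balance and income for each default group:

# Mean balance and income by default status
aggregate(cbind(balance, income) ~ default, data = Credit, FUN = mean)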

Fit a Logistic Regression Model

The default variable was first recoded to a numeric variable, with "Yes" coded as 1 and "No" coded as 0. The logistic regression model was then fitted using the glm() function, as shown in the code below.

Based on the model summary, both balance and income are statistically significant predictors of customer default.

# Recode default: "Yes" -> 1, "No" -> 0
Credit$default2 <- as.numeric(Credit$default == "Yes")
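
As a quick sanity check, cross-tabulating the original and recoded variables should place every "No" under 0 and every "Yes" under 1:

# Verify the recoding
table(Credit$default, Credit$default2)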


class.fit <- glm(default2 ~ balance + income, 
                 data = Credit,
                 family = binomial)

summary(class.fit)
## 
## Call:
## glm(formula = default2 ~ balance + income, family = binomial, 
##     data = Credit)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8
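
The coefficients above are on the log-odds scale. A common way to interpret them is to exponentiate, which gives multiplicative effects on the odds of default; the sketch below also adds normal-approximation (Wald) 95% confidence intervals. For example, the balance estimate of 5.647e-03 corresponds to an odds ratio of about exp(0.005647) ≈ 1.006 per additional dollar of balance.

# Odds ratios and Wald 95% confidence intervals
exp(coef(class.fit))
exp(confint.default(class.fit))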

Assess the Model Fit Using a Confusion Matrix

In logistic regression, predictions are assessed by comparing the predicted classes against the observed classes and computing the percentage of correct classifications, which is summarized in a confusion matrix. Ideally, the proportion of correct classifications should be high (at least around 80%). Based on the fitted model with a 0.30 cutoff, the confusion matrix below shows that non-defaulters are classified very accurately (9,522 of 9,667 correct), while only about half of the actual defaulters are identified (164 of 333).

class.prob <- predict(class.fit, type="response")

par(mfrow = c(1,1))

# Use a 0.3 cutoff to classify an observation as "Yes" (default)
class.pred <- ifelse(class.prob > 0.3, "Yes", "No")
table(class.pred)
## class.pred
##   No  Yes 
## 9691  309
# Create Confusion Matrix from Default Data
table(Credit$default, class.pred)
##      class.pred
##         No  Yes
##   No  9522  145
##   Yes  169  164
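
The entries of the confusion matrix can be turned into the usual summary measures. The sketch below (using the class.pred vector created above) computes the overall accuracy, the sensitivity (proportion of actual defaulters correctly flagged), and the specificity (proportion of non-defaulters correctly cleared); with the counts above these are roughly 0.969, 0.492, and 0.985. Raising the cutoff toward 0.5 would reduce false alarms on non-defaulters but catch even fewer actual defaulters.

# Summary measures from the confusion matrix
cm <- table(Credit$default, class.pred)

accuracy    <- sum(diag(cm)) / sum(cm)              # overall proportion correct
sensitivity <- cm["Yes", "Yes"] / sum(cm["Yes", ])  # defaulters correctly flagged
specificity <- cm["No", "No"]  / sum(cm["No", ])    # non-defaulters correctly cleared

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)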

Creating Predictions for a New Data Set

The model fitted on the 10,000 observations may now be applied to a different data set containing the variables balance and income. For illustration, suppose our new data set is created from rows randomly picked from the original data.

# Make Predictions with New Data

Credit_new <- Credit[c(23, 5647, 8721, 4444, 23, 456, 231, 2313, 33, 210),]

class.prob.new <- predict(class.fit, newdata=Credit_new, type="response")
class.pred.new <- ifelse(class.prob.new > 0.30, "Default", "No Default")

class.pred.new
##           23         5647         8721         4444         23.1          456 
## "No Default" "No Default" "No Default" "No Default" "No Default" "No Default" 
##          231         2313           33          210 
## "No Default" "No Default" "No Default"    "Default"
# Inspect the actual record for observation 210, the one predicted to default
Credit[210,]
##     default student  balance  income default2
## 210     Yes     Yes 1899.391 20655.2        1
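
In practice, new data would not come from the training set. A minimal sketch, using made-up balance and income values (hypothetical, chosen only for illustration), looks like this:

# Score a genuinely new customer (hypothetical values)
new_customer <- data.frame(balance = 1800, income = 35000)

prob_new <- predict(class.fit, newdata = new_customer, type = "response")
ifelse(prob_new > 0.30, "Default", "No Default")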