The linear regression model discussed in Chapter 3 assumes that the response variable \(Y\) is quantitative. In many situations, however, the response is qualitative (categorical), such as eye color or type of disease. In this section we study classification models in R. There are many possible classification techniques; widely used classifiers include logistic regression, linear discriminant analysis, naive Bayes, and K-nearest neighbors. This discussion focuses largely on logistic regression as a modification of linear regression.
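As a brief sketch of what this modification looks like (the symbols \(p(X)\), \(\beta_0\), \(\beta_1\), \(\beta_2\), \(X_1\) and \(X_2\) are generic notation assumed here for a binary response with two predictors), logistic regression does not model \(Y\) directly. It models the probability \(p(X) = \Pr(Y = 1 \mid X)\) so that the log-odds are linear in the predictors:

\[
\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2,
\qquad
p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2}}.
\]

The right-hand side of the first equation is the familiar linear predictor, and the logistic transformation in the second keeps every fitted probability between 0 and 1.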
To illustrate the concept of classification we will use the
Default data set. We are interested in predicting whether
an individual will default on his or her credit card payment, on the
basis of annual income and monthly credit card
balance. The data set contains 10000 individuals.
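# Set the working directory to the folder that contains Default.csv.
# The path below is machine-specific; change it to match your own setup.
setwd("C:\\Users\\Asus\\Documents\\UP Files\\UPV Subjects\\Stat 197 (Intro to BI)")
Credit <- read.csv("Default.csv")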
We first visualize how those who default differ from those who do not
in terms of balance and income. Based on the scatter plot and box plots
shown below, there appears to be a difference between the two groups
across balance and income.
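# Colors for the two groups: first = "No", second = "Yes" (factor level order)
colors <- c("cyan4", "coral2")
# Scatter plot of income against balance, colored by default status.
# Columns are referenced through Credit$ instead of attach(Credit).
plot(x = Credit$balance, y = Credit$income,
     xlab = "balance", ylab = "income",
     pch = 19,
     col = colors[factor(Credit$default)])
legend("topright",
       legend = c("Did not default", "Default"),
       pch = 19,
       col = colors)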
par(mfrow=c(1,2))
boxplot(balance ~ default,
data=Credit,
col=colors)
boxplot(income ~ default,
data=Credit,
col=colors)
The default variable was first recoded to a numeric
variable, with “Yes” coded as 1 and “No” coded as 0. The logistic
regression model was then fitted using the glm() function,
as shown in the code below.
In the resulting output, both balance and income are
statistically significant predictors of customer default (p < 0.001 for each).
# Recode default: "Yes" -> 1, "No" -> 0
Credit$default2 <- as.numeric(Credit$default == "Yes")
class.fit <- glm(default2 ~ balance + income,
data = Credit,
family = binomial)
summary(class.fit)
##
## Call:
## glm(formula = default2 ~ balance + income, family = binomial,
## data = Credit)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
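One way to read these coefficients is on the odds scale: exponentiating a coefficient gives the multiplicative change in the odds of default for a one-unit increase in that predictor, holding the other fixed. The sketch below hard-codes the rounded estimates printed above, so its results are approximate. The names `est`, `odds_per_unit`, `odds_mult_100` and `odds_mult_10k` and the step sizes of 100 and 10000 are illustrative assumptions. With the fitted object available, `exp(coef(class.fit))` would give the per-unit multipliers directly.

```r
# Rounded estimates copied from the summary() output above (approximate).
est <- c(intercept = -1.154e+01, balance = 5.647e-03, income = 2.081e-05)

# Odds multipliers for a one-unit increase in each predictor.
odds_per_unit <- exp(est[c("balance", "income")])

# A one-unit change is tiny on these scales, so also look at larger steps
# (100 units of balance, 10000 units of income; chosen only for illustration).
odds_mult_100 <- exp(100 * est[["balance"]])
odds_mult_10k <- exp(10000 * est[["income"]])

print(odds_per_unit)
print(c(balance_plus_100 = odds_mult_100, income_plus_10k = odds_mult_10k))
```

With these estimates, a balance that is 100 units higher multiplies the odds of default by roughly 1.76, while an income that is 10000 units higher multiplies them by roughly 1.23.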
In logistic regression, predictions are assessed by comparing the percentage of correct classifications with the percentage of incorrect classifications. We want the rate of correct classification to be as high as possible (a common rule of thumb is at least 80%). These counts are summarized in the confusion matrix. Based on the fitted model at a 0.30 cutoff, the classification of non-defaulters is highly accurate (9522 of 9667 “No” cases, about 98.5%), but the classification of defaulters is only moderate to low (164 of 333 “Yes” cases, about 49.2%).
class.prob <- predict(class.fit, type="response")
par(mfrow = c(1,1))
#Use 0.3 cutoff to classify as "default"
class.pred <- ifelse(class.prob > 0.3, "Yes", "No")
table(class.pred)
## class.pred
##   No  Yes
## 9691  309
# Create Confusion Matrix from Default Data
table(Credit$default, class.pred)
##      class.pred
##         No  Yes
##   No  9522  145
##   Yes  169  164
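The usual summary rates can be computed directly from the confusion matrix. The sketch below re-enters the four counts printed above by hand, so it runs without the data file. The object names `cm`, `accuracy`, `sensitivity` and `specificity` are illustrative choices, not part of the original analysis.

```r
# Confusion matrix copied from the output above:
# rows = actual default status, columns = predicted class at the 0.3 cutoff.
# matrix() fills column by column, so the first two values are the "No" column.
cm <- matrix(c(9522, 169, 145, 164), nrow = 2,
             dimnames = list(actual = c("No", "Yes"),
                             predicted = c("No", "Yes")))

# Overall share of correct classifications.
accuracy <- sum(diag(cm)) / sum(cm)

# Share of actual defaulters that were predicted to default.
sensitivity <- cm["Yes", "Yes"] / sum(cm["Yes", ])

# Share of actual non-defaulters that were predicted not to default.
specificity <- cm["No", "No"] / sum(cm["No", ])

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 3))
```

Because defaulters make up only 333 of the 10000 cases, the overall accuracy is dominated by the non-defaulters, which is why it helps to look at the two class-specific rates separately.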
The model fitted on the 10000 observations may now be applied
to a different data set that contains the variables balance and
income. For illustration, suppose our new data set is created
from rows picked at random from the data.
# Make Predictions with New Data
Credit_new <- Credit[c(23, 5647, 8721, 4444, 23, 456, 231, 2313, 33, 210),]
class.prob.new <- predict(class.fit, newdata=Credit_new, type="response")
class.pred.new <- ifelse(class.prob.new > 0.30, "Default", "No Default")
class.pred.new
##           23         5647         8721         4444         23.1          456
## "No Default" "No Default" "No Default" "No Default" "No Default" "No Default"
##          231         2313           33          210
## "No Default" "No Default" "No Default"    "Default"
Credit[210,]
##     default student  balance  income default2
## 210     Yes     Yes 1899.391 20655.2        1
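To see where the prediction for row 210 comes from, the fitted probability can be reproduced by hand from the coefficient estimates. This is a minimal sketch that assumes the rounded estimates printed in the model summary, so the probability is approximate. The names `est`, `eta`, `p_210` and `pred_210` are illustrative.

```r
# Rounded estimates copied from the summary() output (approximate).
est <- c(intercept = -1.154e+01, balance = 5.647e-03, income = 2.081e-05)

# Predictor values for row 210, copied from the output above.
balance_210 <- 1899.391
income_210 <- 20655.2

# Linear predictor (log-odds of default) for this individual.
eta <- est[["intercept"]] +
  est[["balance"]] * balance_210 +
  est[["income"]] * income_210

# plogis() is the logistic function exp(eta) / (1 + exp(eta)).
p_210 <- plogis(eta)

# Apply the same 0.30 cutoff used for the predictions above.
pred_210 <- ifelse(p_210 > 0.30, "Default", "No Default")

print(p_210)
print(pred_210)
```

The resulting probability is roughly 0.4, which is above the 0.30 cutoff and so matches the "Default" label predicted for row 210.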