A few months ago, I began learning how to apply two supervised learning algorithms: linear regression and logistic regression. There are a variety of online resources revealing different approaches to utilizing both linear regression and logistic regression. In this blog post, I will be focusing on two guides I used to learn more about logistic regression.
Recall logistic regression is used to estimate the probability of a categorial response based on one or more predictor variables. For example, we can use logistic regression to determine the likehihood a patient might have cancer on the basis of his or her symptoms.
The UC Business Analytics R Programming Guide has a great tutorial on the basics of logistic regression. The tutorial uses a simulated data set containing information on ten thousand customers such as whether a customer defaulted, is a student, the average balance carried by the customer and income of the customer. I will be providing an overview of the material covered and what I learned.
This guide demonstrates why one would use logistic regression over linear regression. Utilizing linear regression to classify customers as high- or low-risk based on their balance, for example, results in nonsensical predictions. By trying to fit a linear regression model to the data, we would predict a negative probability of defaulting for customers with balances close to zero. We would also predict values larger than 1 for customers with very large balances. Of course, the true probability that someone would default must fall between 0 and 1. The logist function equation defined in Eq. 1 gives outputs between 0 and 1 for all values of X and therefore avoids issues we would encounter by using linear regression.
\(p(X) = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}} \tag{1}\)
To create a logistic regression model in R, we can use the built-in glm function. Here is the single-line of code required to create a linear model that attempts to predict whether a customer has defaulted due to balance.
model1 <- glm(default ~ balance, family = "binomial", data = train)
The glm function uses maximum likelihood estimation to fit the model to our data. We now develop an intution for maximum likelihood estimation and then formalize the concept mathematically. In the case of our example, the glm function attempts to find parameter estimates to plug into Eq. 1 such that the predicted probability of default for each individual is as close to the individual’s observed default status. Therefore we will try to find parameter estimates to plug into Eq. 1 that yield a number close to 1 for all individuals who defaulted and a number close to zero for all individuals who did not. Here is a formalization of maximum likelihood estimation referred to as a likelihood function.
\(\ell(\beta_0, \beta_1) = \prod_{i:y_i=1}p(x_i) \prod_{i':y_i'=0}(1-p(x_i')) \tag{2}\)
The guide then provides a visual of how the glm function fits the model to the example data.
default %>%
mutate(prob = ifelse(default == "Yes", 1, 0)) %>%
ggplot(aes(balance, prob)) +
geom_point(alpha = .15) +
geom_smooth(method = "glm", method.args = list(family = "binomial")) +
ggtitle("Logistic regression model fit") +
xlab("Balance") +
ylab("Probability of Default")
## Warning: package 'bindrcpp' was built under R version 3.4.4
Although creating a logistic model in R is simple, I initially struggled to interpret information given by the model. We can use the tidy function to see coefficient estimates, standard error, and other useful related information.
tidy(model1)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -11.0 0.489 -22.5 2.66e⁻¹¹²
## 2 balance 0.00567 0.000295 19.2 2.53e⁻ ⁸²
In the case of the example model, we can see that the estimate for the balance term is about 0.0057. This means that if there is a one dollar increase in a customer’s balance, we should expect the odds of a customer defaulting to increase by a factor of 1.0057.
To predict the probability of defaulting, we can use the predict function. The guide demonstrates how we can use the predict function to compare the proabilty of defaulting based on range of balances.
predict(model1, data.frame(balance = c(1000, 2000)), type = "response")
## 1 2
## 0.004785057 0.582089269
The guide also explains the phenomenon known as confounding. When we perform a regression with only one predictor as we did above, we may obtain radically different results than those obtained from a regression with multiple predictors, especially when there is correlation among the predictors.
If we fit a model that uses the student variable, we will find that students seem to nearly twice the odds of defaulting compared to non-students.
model2 <- glm(default ~ student, family = "binomial", data = train)
Now let’s create a model that predicts the probability of default based on balance, income, and student status.
model3 <- glm(default ~ balance + income + student, family = "binomial", data = train)
tidy(model3)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -10.9 0.648 -16.8 1.47e⁻⁶³
## 2 balance 0.00591 0.000310 19.0 7.90e⁻⁸¹
## 3 income - 0.00000501 0.0000108 - 0.465 6.42e⁻ ¹
## 4 studentYes - 0.809 0.313 - 2.58 9.78e⁻ ³
The p-values of student status and balance are extremely small, which means both variables are associated with probability of defaulting. The coefficient for the student variable is negative, which indicates that students are less likely to default than non-students. Our previous model, however, claimed that students are more likely to default than non-students. The guide explains that a student is definitely more likely to default than a non-student if there is no available information about the student’s credit card balance. However, a student is less likely to default than a non-student with the same credit card balance.
Before reading this guide, I was uncertain how to determine which predictors are most influential in predicting the resonse variable. We can use the varImp function from caret, a R machine learning package. Let’s use the varImp function to determine which predictors in our third model are the most influential.
caret::varImp(model3)
## Overall
## balance 19.0403764
## income 0.4647343
## studentYes 2.5835947
We can see a customer’s balance is the most important predictor while income appears to be the least important.
How do we determine the performance of our models? The guide provides a few different tests of how well our model fits to our data. Remember, if a model fits to the data too well, it could run the risk of not generalizing well to unseen data (overfitting).
One goal of logistic regression is to mimimize deviance residuals. To see the deviance residuals of a model, we can use the summary function. The null deviance represents the difference in fit between a model with only the intercept and a model with theoretically perfect fit. One way to usually improve the fit of a model and reduce model deviance is to add more predictor variables. We can perform a likelihood ratio test to determine whether added predictor variables are statistically significant improvements. The anova function can allow us to compare the results of multiple models at once. Let’s compare the first and third models from the guide.
anova(model1, model3, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: default ~ balance
## Model 2: default ~ balance + income + student
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 6045 908.69
## 2 6043 895.02 2 13.668 0.001076 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results show that the third model is a statistically significant improvement from the first model and therefore an improved model fit.
Another test of fit attempts to mimmic the R-squared statistic used in linear regression to explain the proportion of variance in the dependent variable that is explained by the predictors. One pseudo R-squared metric, McFadden’s R-squared, is defined as
\(1−\frac{ln(LM_1)}{ln(LM_0)}\)
where \(ln(LM_1)\) is the log likelihood for the fitted model and \(ln(LM_0)\) is the log likelihood for the null model. According to McFadden, a model with a McFadden pseudo R-squared value of roughly .40 represents very good fit. Here is the code necessary to see the McFadden pseudo R-squared values of the three models created throughout the tutorial.
list(model1 = pscl::pR2(model1)["McFadden"],
model2 = pscl::pR2(model2)["McFadden"],
model3 = pscl::pR2(model3)["McFadden"])
## $model1
## McFadden
## 0.4726215
##
## $model2
## McFadden
## 0.004898314
##
## $model3
## McFadden
## 0.4805543
We can see that the second model does not have a very good fit. The third model appears to improve upon the first slightly.
We can determine whether individual points do not fit our model by assessing deviance residuals. The following code from the tutorial provides a visual of which standardized deviance residuals exceed three standard deviations.
model1_data <- augment(model1) %>%
mutate(index = 1:n())
ggplot(model1_data, aes(index, .std.resid, color = default)) +
geom_point(alpha = .5) +
geom_ref_line(h = 3)
Filtering for residuals greater than three standard deviations, we can see all observations represent customers who defaulted with budgets much lower than normal defaulters.
Another way to visual influential observations is to look at Cook’s distance values. The following code shows the top 5 largest values.
plot(model1, which = 4, id.n = 5)
Investigating these observations, we see they include customers who defaulted with very low balances and customers who did not default, but had balances over $2,000. Although it is not recommended, removing these observations would affect the shape, location, and confidence interval of the logistic regression S-curve.
According to the guide, the most important metric for predictive models is their ability to predict the target variable on out-of-sample observations. We will first use the three models to predict values on our training data set.
test.predicted.m1 <- predict(model1, newdata = test, type = "response")
test.predicted.m2 <- predict(model2, newdata = test, type = "response")
test.predicted.m3 <- predict(model3, newdata = test, type = "response")
We can now compare the predicted target variable to the observed values for each model and see which performs best. Let’s use a confusion matrix, which is a table that describes the classification performance of a model on the test data.
list(
model1 = table(test$default, test.predicted.m1 > 0.5) %>% prop.table() %>% round(3),
model2 = table(test$default, test.predicted.m2 > 0.5) %>% prop.table() %>% round(3),
model3 = table(test$default, test.predicted.m3 > 0.5) %>% prop.table() %>% round(3)
)
## $model1
##
## FALSE TRUE
## No 0.962 0.003
## Yes 0.025 0.010
##
## $model2
##
## FALSE
## No 0.965
## Yes 0.035
##
## $model3
##
## FALSE TRUE
## No 0.963 0.003
## Yes 0.026 0.009
The “No” and “Yes” in the rows represent whether customers actually defaulted or not. The “FALSE” and “TRUE” in the columns represent whether we predicted customers to default or not. Let’s look at each quadrant of the confusion matrix.
The bottom-right quadrant value is the percentage of true positives, cases in which we predicted the customer would default and they did.
The top-left quadrant value is the percentage of true negatives, cases in which we predicted a customer would not default and the customer did not.
The top-right quadrant value is the percentage of false positives, cases in which we predicted a customer would default but they actually didn’t (type I error).
The bottom-left quadrant value is the percentage of false negatives, cases in which we predicted a customer would not default but they actually did (type II error).
We can see that the results for our first model and third model are relatively similar. The second model never actually predicts those customers that default.
We should also look at the missclassification rates of each model.
test %>%
mutate(m1.pred = ifelse(test.predicted.m1 > 0.5, "Yes", "No"),
m2.pred = ifelse(test.predicted.m2 > 0.5, "Yes", "No"),
m3.pred = ifelse(test.predicted.m3 > 0.5, "Yes", "No")) %>%
summarise(m1.error = mean(default != m1.pred),
m2.error = mean(default != m2.pred),
m3.error = mean(default != m3.pred))
## # A tibble: 1 x 3
## m1.error m2.error m3.error
## <dbl> <dbl> <dbl>
## 1 0.0278 0.0349 0.0281
We see each model has a relatively low error rate. However, we must remember that our second model never actually predicts customers that did default.
Instead of looking at the percentages in our confusion matrix, we could look at the raw numbers of true positives, true negatives, false positives, and false negatives. Let’s look at the first model.
table(test$default, test.predicted.m1 > 0.5)
##
## FALSE TRUE
## No 3803 12
## Yes 98 40
Although the overall error rate of the first model was low, we see now that 40 of the total 138 customers who defaulted were predicted by the first model. The sensitivity (the true positive rate) of our model is only about 29%, which is not very good. This means only 29% of default occurrences were actually predicted by the first model. We also care about the specificity (the true negative rate) of our model. Our first model has a specificity (percentage of non-defaulters correctly identified) of about 99.6%. This can be explained by the fact that 97% percentage of our observations are non-defaulters. The tutorial explains that, depending on the circumstance, we may be more concerned with specificity over sensitivity (or vice versa). A credit card company, for example, would be more likely to be concerned with sensitivty since they wish to reduce their risk.
As a visual learner, the receiving operating characteristic (ROC) is one useful way to measure classifier performance visually. We can generate a graphic that shows the trade off between the rate you can correctly predict something with the rate of incorrectly predicting something. We are concerned about maximizing the area under the curve of the ROC graphic. The guide states that a model does a good job in discriminating between the two categories that comprise our target variable if its area under the curve is about .8. Here is the code necessarily to show ROC plots for both our first and second models.
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
par(mfrow=c(1, 2))
prediction(test.predicted.m1, test$default) %>%
performance(measure = "tpr", x.measure = "fpr") %>%
plot()
prediction(test.predicted.m2, test$default) %>%
performance(measure = "tpr", x.measure = "fpr") %>%
plot()
We would definitely prefer our first model compared to the second!
Another tutorial I found value from helped me learn about one way to deal with data sets that have imbalanced classes.
Suppose you are training a model to detect whether a patient has a particularly rare disease that only affects 5% of the patients screened. We could create an accurate (but useless) model that always predicts a patient does not have the disease. Our model would therefore have an accuracy of 95%. Of course, this model only has a high accuracy due to the small number of patients who have the disease. One beginner’s guide from Selva Prabhakaran demonstrated a particularly useful solution if your data set has highly imbalanced classes.
To adjust for bias in datasets with imbalanced classes, we can use a technique called down-sampling. Down-sampling is the process of randomly removing observations from the majority class to prevent its signal from dominating the learning algorithm. The following is an example that utlizes breast cancer data from the mlbench package. We first will view the proportion of individuals with malignant tumors to individuals with benign tumors.
## View proportion of benign tumors to malignant tumors
table(bc$Class)
##
## 0 1
## 444 239
There are nearly twice as many individuals with benign tumors compared to individuals with malignant tumors. Let’s try to correct the class imbalance.
set.seed(100)
down_train <- downSample(x = trainData[, colnames(trainData) %ni% "Class"],
y = trainData$Class)
table(down_train$Class)
##
## 0 1
## 168 168
As we can see, the new table shows how we can use down-sampling to fix issues with biased classes.