A logistic regression is used to predict the probability of an event occurring. Unlike a linear regression model, logistic regression involves a categorical variable as its target, with the output being a probability between 0 and 1. Situations which may necessitate the use of a logistic model include:
These are examples of binary logistic regression, in which there are only two possible outcomes (win/loss, rain/no rain, etc.). It’s also possible to extend the approach to situations with more than two outcomes, either without an order (multinomial logistic regression) or with an order (ordinal logistic regression). This run-through will build a simple binary logistic regression model and analyse it.
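As a quick illustration of why the output always lies between 0 and 1, the sketch below applies the logistic (inverse-logit) function to a few values of the linear predictor. The input values are made up for illustration and do not come from any fitted model; plogis() is base R's built-in version of the same function.

```r
# Hypothetical values of the linear predictor (log-odds); not from any model
log_odds <- c(-5, -1, 0, 1, 5)

# The logistic function maps any real number into the interval (0, 1)
logistic <- function(x) 1 / (1 + exp(-x))
probs <- logistic(log_odds)

# plogis() from the stats package computes the same quantity
all.equal(probs, plogis(log_odds))

# A log-odds of 0 corresponds to a probability of exactly 0.5
round(probs, 3)
```

Large negative log-odds give probabilities close to 0 and large positive log-odds give probabilities close to 1, but the result never leaves that range.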
The data set ‘Credit’, available from the R package ‘ISLR’, will be used in this example. It contains information regarding the credit card balance of customers.
# Install 'ISLR', if not already downloaded. Use "install.packages('ISLR')"
library(ISLR)
credit_data <- Credit
summary(credit_data)
## ID Income Limit Rating
## Min. : 1.0 Min. : 10.35 Min. : 855 Min. : 93.0
## 1st Qu.:100.8 1st Qu.: 21.01 1st Qu.: 3088 1st Qu.:247.2
## Median :200.5 Median : 33.12 Median : 4622 Median :344.0
## Mean :200.5 Mean : 45.22 Mean : 4736 Mean :354.9
## 3rd Qu.:300.2 3rd Qu.: 57.47 3rd Qu.: 5873 3rd Qu.:437.2
## Max. :400.0 Max. :186.63 Max. :13913 Max. :982.0
## Cards Age Education Gender Student
## Min. :1.000 Min. :23.00 Min. : 5.00 Male :193 No :360
## 1st Qu.:2.000 1st Qu.:41.75 1st Qu.:11.00 Female:207 Yes: 40
## Median :3.000 Median :56.00 Median :14.00
## Mean :2.958 Mean :55.67 Mean :13.45
## 3rd Qu.:4.000 3rd Qu.:70.00 3rd Qu.:16.00
## Max. :9.000 Max. :98.00 Max. :20.00
## Married Ethnicity Balance
## No :155 African American: 99 Min. : 0.00
## Yes:245 Asian :102 1st Qu.: 68.75
## Caucasian :199 Median : 459.50
## Mean : 520.01
## 3rd Qu.: 863.00
## Max. :1999.00
The logistic regression model being built here will use ‘Married’ as its target variable. This binary variable represents whether the customer is married (“Yes” or “No”). The aim of the model is to predict the likelihood of a given customer being married based on the other indicators. This information could be used by a bank to identify customers who may be suitable for a new couples’ credit card plan, for example.
The variable ‘Married’ will be altered for this model. ‘1’ will represent a married individual, and ‘0’ otherwise.
credit_data$marriedTarget <- ifelse(credit_data$Married == "Yes", 1, 0)
We can also split the data into two sets. The larger set, known as the training set, will be used to build the model. The second and smaller set, the testing set, will be used to assess the accuracy of the model. Here, we’ll use a 70-30 split.
# set.seed() allows us to reproduce this exact analysis with the same results.
set.seed(111)
split <- sort(sample(nrow(credit_data), nrow(credit_data)*0.7))
training <- credit_data[split,]
testing <- credit_data[-split,]
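A quick sanity check on a split like this is to confirm that the two sets have the expected sizes and share no rows. The sketch below uses a stand-in data frame with 400 rows (the same number of rows as ‘Credit’) so that it runs without the ISLR package; toy_data and its columns are hypothetical.

```r
# Stand-in for credit_data: 400 rows, as in 'Credit' (values are arbitrary)
toy_data <- data.frame(ID = 1:400, x = seq_len(400))

set.seed(111)
split <- sort(sample(nrow(toy_data), nrow(toy_data) * 0.7))
training <- toy_data[split, ]
testing  <- toy_data[-split, ]

nrow(training)  # 280 rows (70%)
nrow(testing)   # 120 rows (30%)

# No row should appear in both sets
length(intersect(training$ID, testing$ID))  # 0
```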
To fit a logistic model, the function glm() can be used with the family specified as “binomial”. The variables ‘ID’ and ‘Ethnicity’ are left out of the model formula.
log_model <- glm(marriedTarget ~ Income + Limit + Rating + Cards + Age + Education +
Gender + Student + Balance,
family = binomial,
data = training)
summary(log_model)
##
## Call:
## glm(formula = marriedTarget ~ Income + Limit + Rating + Cards +
## Age + Education + Gender + Student + Balance, family = binomial,
## data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8955 -1.2795 0.8018 0.9792 1.3932
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.1955657 1.0796625 -1.107 0.2681
## Income -0.0156693 0.0118931 -1.318 0.1877
## Limit -0.0005632 0.0009157 -0.615 0.5385
## Rating 0.0195659 0.0132642 1.475 0.1402
## Cards -0.0860103 0.1095398 -0.785 0.4323
## Age -0.0120293 0.0077979 -1.543 0.1229
## Education 0.0181959 0.0410206 0.444 0.6573
## GenderFemale 0.3801467 0.2542046 1.495 0.1348
## StudentYes 0.6018666 0.7091414 0.849 0.3960
## Balance -0.0027436 0.0013465 -2.038 0.0416 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 372.46 on 279 degrees of freedom
## Residual deviance: 358.66 on 270 degrees of freedom
## AIC: 378.66
##
## Number of Fisher Scoring iterations: 4
We have now created a logistic regression model which estimates the probability that a customer is married. However, every explanatory variable except ‘Balance’ is insignificant in this model (their p-values, shown as Pr(>|z|), are greater than 0.05). One option is to remove some of the insignificant variables (‘Income’, ‘Gender’ and ‘Student’, for instance) and see whether this improves the model. A lower AIC is ideal! For now, we’ll stick with this model.
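One way to compare a full model against a smaller one is to fit both and look at their AIC values side by side. The sketch below does this on simulated data so that it runs on its own; the data frame sim, its columns and the coefficients used to generate it are all invented for illustration and are unrelated to ‘Credit’.

```r
set.seed(1)
n <- 300

# Simulated predictors; only x1 truly affects the outcome here
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * sim$x1))

full_model    <- glm(y ~ x1 + x2 + x3, family = binomial, data = sim)
reduced_model <- glm(y ~ x1,           family = binomial, data = sim)

# Lower AIC suggests a better trade-off between fit and complexity
AIC(full_model, reduced_model)

# Likelihood-ratio test of whether the dropped terms add anything
anova(reduced_model, full_model, test = "Chisq")
```

With the real data, the same pattern would apply to log_model and a version of it refitted with fewer terms on the same training set.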
The estimates in this summary make up the equation of the logistic model. They also allow us to interpret the odds of a customer being married according to the model. For the variable ‘Balance’, the interpretation is as follows:
The odds of the customer being married are multiplied by e^(-0.00274), or roughly 0.9973 (a decrease of about 0.27%), for each additional dollar in credit card balance the customer has, whilst holding every other variable constant.
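To make the arithmetic concrete, the sketch below exponentiates the ‘Balance’ estimate. The coefficient is typed in by hand from the summary output so that the snippet runs on its own; balance_coef and odds_ratio are names introduced here for illustration.

```r
# Coefficient for 'Balance', copied from the model summary
balance_coef <- -0.0027436

# Multiplicative change in the odds for one additional dollar of balance
odds_ratio <- exp(balance_coef)
odds_ratio                  # approximately 0.9973

# Percentage change in the odds per additional dollar
(odds_ratio - 1) * 100      # approximately -0.27

# The effect compounds: change in the odds for an additional 100 dollars
exp(100 * balance_coef)     # approximately 0.76
```

With the fitted model available, exp(coef(log_model)) gives the same kind of odds ratio for every term at once.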
We will now predict whether the customers are married or not in the testing set.
probabilities <- predict(log_model,
newdata = testing,
type = "response")
This produces a predicted probability of being married for each customer in the testing set. We will classify each customer with a probability greater than 0.5 as married.
marriagePredictions <- ifelse(probabilities > 0.5, "Yes", "No")
We can create a table to compare the predictions with the true marriage status of each customer.
table(marriagePredictions, testing$Married)
##
## marriagePredictions No Yes
## No 5 10
## Yes 43 62
In this table, the left side represents the predictions and the top represents the true value.
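The table can be summarised with a few standard measures. The sketch below rebuilds it by hand from the printed counts so that it runs on its own; the object names (confusion, accuracy and so on) are introduced here for illustration.

```r
# Counts copied from the table above (rows = predicted, columns = actual)
confusion <- matrix(c(5, 43, 10, 62), nrow = 2,
                    dimnames = list(predicted = c("No", "Yes"),
                                    actual    = c("No", "Yes")))

# Share of all customers classified correctly: (5 + 62) / 120
accuracy <- sum(diag(confusion)) / sum(confusion)

# Share of married customers correctly predicted as married: 62 / 72
sensitivity <- confusion["Yes", "Yes"] / sum(confusion[, "Yes"])

# Share of unmarried customers correctly predicted as unmarried: 5 / 48
specificity <- confusion["No", "No"] / sum(confusion[, "No"])

# Accuracy from always predicting the more common class: 72 / 120
baseline <- max(colSums(confusion)) / sum(confusion)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, baseline = baseline), 3)
```

On these counts the accuracy (about 0.558) is below the 0.6 that would be achieved by predicting “Yes” for every customer, and the specificity is very low.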
The model incorrectly predicted 43 of the 48 unmarried customers in the testing set as married. This could be a sign of overfitting, or simply a poor model. When attempting logistic regression with your own data, it is important to do the following to find a suitable model.