Basic Logistic Regression in R

A logistic regression is used to predict the probability of an event occurring. Unlike a linear regression model, logistic regression involves a categorical variable as its target, with the output being a probability between 0 and 1. Situations which may necessitate the use of a logistic model include:

Predicting a team’s chances of winning a sporting contest
Predicting whether it will rain on a particular day
Classing an email as spam or not

These are examples of binary logistic regression, in which there are only two possible outcomes (win/loss, rain/no rain, etc.). The approach can also be extended to situations with more than two outcomes, either without an order (multinomial logistic regression) or with an order (ordinal logistic regression). This run-through will build a simple binary logistic regression and analyse it.
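To make the "probability between 0 and 1" idea concrete, the short sketch below uses base R's plogis(), the logistic function, to turn a linear predictor (the log-odds) into a probability. The input values are made up for illustration and are not taken from the Credit data.

```r
# Hypothetical linear predictor values (log-odds); not from the Credit data.
log_odds <- c(-4, -1, 0, 1, 4)

# plogis() is the logistic function: 1 / (1 + exp(-x)).
probs <- plogis(log_odds)

# The same values computed by hand, to show what plogis() does.
probs_by_hand <- 1 / (1 + exp(-log_odds))

# Every value lies strictly between 0 and 1, and log-odds of 0 maps to 0.5.
print(round(probs, 3))
```

However large or small the linear predictor gets, the result stays inside the interval (0, 1), which is why it can be read as a probability.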

The data set ‘Credit’, available from the R package ‘ISLR’, will be used in this example. It contains information regarding the credit card balance of customers.

# Install 'ISLR', if not already downloaded. Use "install.packages('ISLR')"
library(ISLR)

credit_data <- Credit

summary(credit_data)

##        ID            Income           Limit           Rating     
##  Min.   :  1.0   Min.   : 10.35   Min.   :  855   Min.   : 93.0  
##  1st Qu.:100.8   1st Qu.: 21.01   1st Qu.: 3088   1st Qu.:247.2  
##  Median :200.5   Median : 33.12   Median : 4622   Median :344.0  
##  Mean   :200.5   Mean   : 45.22   Mean   : 4736   Mean   :354.9  
##  3rd Qu.:300.2   3rd Qu.: 57.47   3rd Qu.: 5873   3rd Qu.:437.2  
##  Max.   :400.0   Max.   :186.63   Max.   :13913   Max.   :982.0  
##      Cards            Age          Education        Gender    Student  
##  Min.   :1.000   Min.   :23.00   Min.   : 5.00    Male :193   No :360  
##  1st Qu.:2.000   1st Qu.:41.75   1st Qu.:11.00   Female:207   Yes: 40  
##  Median :3.000   Median :56.00   Median :14.00                         
##  Mean   :2.958   Mean   :55.67   Mean   :13.45                         
##  3rd Qu.:4.000   3rd Qu.:70.00   3rd Qu.:16.00                         
##  Max.   :9.000   Max.   :98.00   Max.   :20.00                         
##  Married              Ethnicity      Balance       
##  No :155   African American: 99   Min.   :   0.00  
##  Yes:245   Asian           :102   1st Qu.:  68.75  
##            Caucasian       :199   Median : 459.50  
##                                   Mean   : 520.01  
##                                   3rd Qu.: 863.00  
##                                   Max.   :1999.00

The logistic regression model built here will use ‘Married’ as its target variable. This binary variable represents whether the customer is married (“Yes” or “No”). The aim of the model is to predict the likelihood of a given customer being married based on the other indicators. A bank could use this information to identify customers who may be suitable for a new couples’ credit card plan, for example.

Data preparation

The variable ‘Married’ will be recoded into a new numeric column, ‘marriedTarget’, for this model. ‘1’ will represent a married individual, and ‘0’ otherwise.

credit_data$marriedTarget <- ifelse(credit_data$Married == "Yes", 1, 0)
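A quick cross-tabulation is one way to confirm that a recode like this behaved as intended. The sketch below uses a small made-up data frame (named `toy` here purely for illustration) rather than the Credit data, so the counts are illustrative only.

```r
# Toy stand-in for the 'Married' column (hypothetical values).
toy <- data.frame(Married = factor(c("Yes", "No", "Yes", "Yes", "No")))

# Same recode as above: 1 for married, 0 otherwise.
toy$marriedTarget <- ifelse(toy$Married == "Yes", 1, 0)

# Every "Yes" should land in the 1 column and every "No" in the 0 column.
recode_check <- table(original = toy$Married, recoded = toy$marriedTarget)
print(recode_check)
```

The same table() call on credit_data$Married and credit_data$marriedTarget should show non-zero counts only on the matching cells.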

We can also split the data into two sets. The larger set, known as the training set, will be used to build the model. The second and smaller set, the testing set, will be used to assess the accuracy of the model. Here, we’ll use a 70-30 split.

# set.seed() lets us reproduce this exact analysis with the same results.
set.seed(111)
split <- sort(sample(nrow(credit_data), nrow(credit_data)*0.7))

training <- credit_data[split,]
testing <- credit_data[-split,]
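It can be worth checking that a split like this has the expected sizes and that no row ends up in both sets. The sketch below works on row indices only, using the 400 rows reported in the summary above, so it does not need the data itself. The names `train_idx` and `test_idx` are illustrative, and round() is added here as a precaution against fractional sample sizes; it is not part of the original code.

```r
# The Credit data has 400 rows (see the summary above); only indices are used here.
n_rows <- 400

set.seed(111)
train_idx <- sort(sample(n_rows, round(n_rows * 0.7)))

# Negative indexing, as used above, keeps exactly the rows not in train_idx.
test_idx <- setdiff(seq_len(n_rows), train_idx)

print(length(train_idx))                      # 280 rows for training
print(length(test_idx))                       # 120 rows for testing
print(length(intersect(train_idx, test_idx))) # 0 rows shared
```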

Building the model

To fit a logistic model, the function glm() can be used with the family specified as binomial. The variables ‘ID’ and ‘Ethnicity’ are left out of the formula below, as is the original ‘Married’ column, which ‘marriedTarget’ replaces.

log_model <- glm(marriedTarget ~ Income + Limit + Rating + Cards + Age + Education +
                   Gender + Student + Balance,
                 family = binomial,
                 data = training)

summary(log_model)

## 
## Call:
## glm(formula = marriedTarget ~ Income + Limit + Rating + Cards + 
##     Age + Education + Gender + Student + Balance, family = binomial, 
##     data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8955  -1.2795   0.8018   0.9792   1.3932  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -1.1955657  1.0796625  -1.107   0.2681  
## Income       -0.0156693  0.0118931  -1.318   0.1877  
## Limit        -0.0005632  0.0009157  -0.615   0.5385  
## Rating        0.0195659  0.0132642   1.475   0.1402  
## Cards        -0.0860103  0.1095398  -0.785   0.4323  
## Age          -0.0120293  0.0077979  -1.543   0.1229  
## Education     0.0181959  0.0410206   0.444   0.6573  
## GenderFemale  0.3801467  0.2542046   1.495   0.1348  
## StudentYes    0.6018666  0.7091414   0.849   0.3960  
## Balance      -0.0027436  0.0013465  -2.038   0.0416 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 372.46  on 279  degrees of freedom
## Residual deviance: 358.66  on 270  degrees of freedom
## AIC: 378.66
## 
## Number of Fisher Scoring iterations: 4

We have now created a logistic regression model which estimates the probability that a customer is married. However, most of the explanatory variables are insignificant in this model (their p-value, shown as Pr(>|z|), is greater than 0.05); only ‘Balance’ falls below that threshold. One option is to remove some of the insignificant variables (for example ‘Income’, ‘Gender’ and ‘Student’) and see whether this improves the model, with a lower AIC indicating a better trade-off between fit and complexity. For now, we’ll stick with this model.
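As a rough sketch of how such a comparison might look, the example below fits a fuller and a reduced model and compares their AIC values. It uses simulated data rather than the Credit data, and the names `sim`, `x1`, `x2`, `full` and `reduced` are all made up for illustration. It also checks that, for a 0/1 response, AIC equals the residual deviance plus twice the number of coefficients, which matches the summary above (358.66 + 2 × 10 = 378.66).

```r
# Simulated data: y depends on x1 only; x2 is pure noise (illustrative assumption).
set.seed(1)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(0.5 + 1.2 * sim$x1))

# Fuller model and a reduced model with the noise variable dropped.
full    <- glm(y ~ x1 + x2, family = binomial, data = sim)
reduced <- glm(y ~ x1,      family = binomial, data = sim)

# AIC() accepts several fitted models and returns one row per model.
aic_table <- AIC(full, reduced)
print(aic_table)

# For a 0/1 response, AIC = residual deviance + 2 * number of coefficients.
aic_by_hand <- deviance(full) + 2 * length(coef(full))
print(c(AIC = AIC(full), by_hand = aic_by_hand))
```

The model with the lower AIC is usually preferred, although a small difference is weak evidence either way.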

The estimates in this summary make up the equation of the logistic model. They also allow us to interpret the odds of a customer being married according to the model. For the variable ‘Balance’, the interpretation is as follows:

The odds of the customer being married are multiplied by e^(-0.00274) ≈ 0.997, a decrease of about 0.27%, for each additional dollar of credit card balance, holding every other variable constant.
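The arithmetic behind this interpretation can be checked directly. The sketch below takes the ‘Balance’ estimate from the model summary above and exponentiates it; the 100-dollar step is an arbitrary choice for illustration.

```r
# Balance coefficient copied from the model summary above.
beta_balance <- -0.0027436

# Multiplicative change in the odds for a one-dollar increase in balance.
odds_ratio_1 <- exp(beta_balance)

# The same change for a 100-dollar increase (an arbitrary illustrative step).
odds_ratio_100 <- exp(100 * beta_balance)

# Percentage change in the odds per additional dollar (roughly -0.27).
pct_change_1 <- (odds_ratio_1 - 1) * 100

print(c(odds_ratio_1 = odds_ratio_1,
        odds_ratio_100 = odds_ratio_100,
        pct_change_1 = pct_change_1))
```

With the fitted model in hand, exp(coef(log_model)) gives the corresponding odds ratio for every term at once.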

Testing the data

We will now predict whether the customers are married or not in the testing set.

probabilities <- predict(log_model, 
                       newdata = testing,
                       type = "response")

This produces one predicted probability of being married for each customer in the testing set. We can then classify each customer with a probability greater than 0.5 as married.

marriagePredictions <- ifelse(probabilities > 0.5, "Yes", "No")
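The role of type = "response" in the predict() call can be seen in a small sketch. It uses simulated data and hypothetical names (`sim`, `fit`, `new_obs`), not the Credit model, and shows that "response" predictions are the logistic transform of the default "link" (log-odds) predictions, so a 0.5 probability cut-off corresponds to a log-odds cut-off of 0.

```r
# Simulated data and model, for illustration only.
set.seed(42)
n <- 150
sim <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-0.3 + 0.8 * sim$x))
fit <- glm(y ~ x, family = binomial, data = sim)

# Three hypothetical new observations to score.
new_obs <- data.frame(x = c(-2, 0, 2))

# type = "link" returns log-odds; type = "response" returns probabilities.
log_odds <- predict(fit, newdata = new_obs, type = "link")
probs    <- predict(fit, newdata = new_obs, type = "response")

# Classify with the same 0.5 cut-off used in the text.
labels_50 <- ifelse(probs > 0.5, "Yes", "No")

print(data.frame(x = new_obs$x, log_odds, probs, labels_50))
```

Forgetting type = "response" is a common slip: the log-odds are not confined to 0 and 1, so comparing them against 0.5 would give different classifications.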

We can create a table to compare the predictions with the true marriage status of each customer.

table(marriagePredictions, testing$Married)

##                    
## marriagePredictions No Yes
##                 No   5  10
##                 Yes 43  62

In this table, the left side represents the predictions and the top represents the true value.
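The counts in the table can be condensed into a few summary measures. The sketch below re-enters the four counts printed above by hand (the object name `cm` is illustrative) and computes overall accuracy, the share of married and unmarried customers classified correctly, and the accuracy of a naive rule that labels every customer as married.

```r
# Counts copied from the table above: rows are predictions, columns are true values.
cm <- matrix(c(5, 43, 10, 62), nrow = 2,
             dimnames = list(predicted = c("No", "Yes"),
                             actual    = c("No", "Yes")))

# Share of all testing customers classified correctly: (5 + 62) / 120.
accuracy <- sum(diag(cm)) / sum(cm)

# Share of truly married customers predicted as married: 62 / 72.
sensitivity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])

# Share of truly unmarried customers predicted as unmarried: 5 / 48.
specificity <- cm["No", "No"] / sum(cm[, "No"])

# Accuracy of always predicting the more common class ("Yes"): 72 / 120.
baseline <- max(colSums(cm)) / sum(cm)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, baseline = baseline), 3))
```

On these counts the model is right for about 56% of customers, which is slightly below the 60% achieved by predicting "Yes" for everyone.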

The model predicted most customers to be married, including 43 of the 48 who were not. This could be a sign of overfitting, or simply a poor model. When attempting logistic regression with your own data, it is important to do the following to find a suitable model:

Build several models with different variables to find the one with the best fit.
Try different split sizes to allow for enough training data.
Address any overfitting issues by using fewer variables.

Steven Azzopardi
