Example 1: Logistic Regression (1 binary dependent variable, 2 predictor variables)

The problem: Simmons’ catalogs are expensive and Simmons would like to send them to only those customers who have the highest probability of making a $200 purchase using the discount coupon included in the catalog. Simmons’ management thinks that annual spending at Simmons Stores and whether a customer has a Simmons credit card are two variables that might be helpful in predicting whether a customer who receives the catalog will use the coupon to make a $200 purchase.

library(readxl)
Simmons <- read_excel("Simmons.xlsx")
Simmons

## # A tibble: 100 × 4
##    Customer Spending  Card Coupon
##       <dbl>    <dbl> <dbl>  <dbl>
##  1        1     2.29     1      0
##  2        2     3.22     1      0
##  3        3     2.13     1      0
##  4        4     3.92     0      0
##  5        5     2.53     1      0
##  6        6     2.47     0      1
##  7        7     2.38     0      0
##  8        8     7.08     0      0
##  9        9     1.18     1      1
## 10       10     3.34     0      0
## # … with 90 more rows

M1 = glm(Coupon~Card+Spending,family=binomial, Simmons)
summary(M1)

## 
## Call:
## glm(formula = Coupon ~ Card + Spending, family = binomial, data = Simmons)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6839  -1.0140  -0.6503   1.1216   1.8794  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.1464     0.5772  -3.718 0.000201 ***
## Card          1.0987     0.4447   2.471 0.013483 *  
## Spending      0.3416     0.1287   2.655 0.007928 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 134.60  on 99  degrees of freedom
## Residual deviance: 120.97  on 97  degrees of freedom
## AIC: 126.97
## 
## Number of Fisher Scoring iterations: 4

Interpreting these results:

The equation with the coefficients is: ln(p / (1 - p)) = -2.1464 + 1.0987 * Card + 0.3416 * Spending, where p is the probability that Coupon equals 1.
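To turn the log-odds equation into a probability, invert the logit: p = 1 / (1 + exp(-eta)), where eta is the right-hand side. The following is a minimal sketch that uses the rounded coefficients from the summary output; the helper name `simmons_prob` is made up for illustration and is not part of the original notes.

```r
# Rounded coefficients copied from summary(M1)
b0 <- -2.1464
b_card <- 1.0987
b_spending <- 0.3416

# Hypothetical helper: estimated probability that Coupon = 1
simmons_prob <- function(card, spending) {
  eta <- b0 + b_card * card + b_spending * spending  # log-odds
  plogis(eta)                                        # same as 1 / (1 + exp(-eta))
}

p_no_card <- simmons_prob(card = 0, spending = 2)
p_card <- simmons_prob(card = 1, spending = 2)
round(c(p_no_card, p_card), 3)  # about 0.188 and 0.410
```

Because the coefficients are rounded, the results agree with the model's own predictions only to about three decimal places.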

Interpreting the p-values: for each possible predictor variable, we run a hypothesis test. For Card, H0 is that the coefficient equals 0 (Card has no effect on the log-odds) and HA is that the coefficient is not 0. Since the p-value (0.0135) is below 5%, we reject H0 and conclude that Card is a significant predictor; 1.0987 is the estimate of its coefficient.
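The z value and p-value in the Card row can be reproduced from the estimate and its standard error. This is a small sketch using the rounded numbers from the summary table, so the result matches the printed p-value only approximately.

```r
# Values copied from the Card row of summary(M1)
estimate <- 1.0987
std_error <- 0.4447

z_value <- estimate / std_error       # Wald z statistic for H0: coefficient = 0
p_value <- 2 * pnorm(-abs(z_value))   # two-sided p-value

round(z_value, 3)   # 2.471, as in the summary table
p_value < 0.05      # TRUE, so H0 is rejected at the 5% level
```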

When comparing 2 models fit to the same data, look at their AICs: the lower the AIC, the better.
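For a 0/1 response like Coupon, the AIC in the summary is the residual deviance plus 2 times the number of estimated coefficients, so extra predictors have to reduce the deviance by enough to pay for themselves. A quick check with the numbers printed for M1 (the variable names here are illustrative):

```r
# Numbers copied from summary(M1)
resid_dev <- 120.97   # residual deviance
k <- 3                # estimated coefficients: intercept, Card, Spending

aic_m1 <- resid_dev + 2 * k
aic_m1  # 126.97, matching the AIC line in the summary
```

With two fitted model objects, `AIC(model_a, model_b)` lists both values side by side; `model_a` and `model_b` are placeholder names.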

We don’t have a simple way to tell from this output whether the model is any good. We would need to divide the data into a training set and a test set, fit the model on the training set, and score its predictions on the test set.
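A train/test split could look something like the sketch below. It runs on simulated data, not the Simmons data, so the coefficients used to generate it, the 70/30 split, and the 0.5 cutoff are all assumptions chosen for illustration.

```r
set.seed(1)  # make the simulation reproducible

# Simulated stand-in data (NOT the Simmons data)
n <- 200
sim <- data.frame(
  Card = rbinom(n, 1, 0.5),
  Spending = runif(n, 1, 7)
)
sim$Coupon <- rbinom(n, 1, plogis(-2 + 1 * sim$Card + 0.3 * sim$Spending))

# Hold out 30% of the rows as a test set
test_rows <- sample(n, size = round(0.3 * n))
train <- sim[-test_rows, ]
test <- sim[test_rows, ]

# Fit on the training rows only
fit <- glm(Coupon ~ Card + Spending, family = binomial, data = train)

# Classify test rows with a 0.5 probability cutoff and score them
test_prob <- predict(fit, newdata = test, type = "response")
test_class <- as.integer(test_prob > 0.5)
accuracy <- mean(test_class == test$Coupon)
accuracy
```

Accuracy is only one possible prediction score; it depends on the cutoff and can look good simply because one outcome is much more common than the other.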

Now run a prediction:

a = data.frame(Spending=c(2,2), Card=c(0,1))
predict.glm(M1, newdata=a, type="response")

##         1         2 
## 0.1879957 0.4099058

##   Spending Card
## 1        2    0
## 2        2    1

Interpreting the results:

With equal spending (Spending = 2), the predicted probability of using the coupon is about 41% for a customer who has the card and about 19% for one who does not, a difference of about 22 percentage points. See the next example for more on interpretation.
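"More likely" can be measured in several ways, and they give different numbers. The sketch below compares the two predicted probabilities as a difference, as a ratio of probabilities, and as a ratio of odds; the variable names are illustrative.

```r
# Predicted probabilities copied from the predict.glm() output above
p_no_card <- 0.1879957
p_card <- 0.4099058

diff_points <- p_card - p_no_card   # difference in probability
risk_ratio <- p_card / p_no_card    # ratio of probabilities
odds_ratio <- (p_card / (1 - p_card)) /
  (p_no_card / (1 - p_no_card))     # ratio of odds

round(c(diff_points, risk_ratio, odds_ratio), 2)  # 0.22 2.18 3.00
```

The odds ratio of about 3 equals exp(1.0987), the exponentiated Card coefficient, and it is the same at every spending level. The difference and the ratio of probabilities change with Spending.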

Example 2

The problem:

Over the past few years the percentage of students who leave Lakeland College at the end of the first year has increased. Last year Lakeland started a voluntary one-week orientation program to help first-year students adjust to campus life. If Lakeland is able to show that the orientation program has a positive effect on retention, they will consider making the program a requirement for all first-year students. Lakeland’s administration also suspects that students with lower GPAs have a higher probability of leaving Lakeland at the end of the first year. In order to investigate the relation of these variables to retention, Lakeland selected a random sample of 100 students from last year’s entering class. The data are contained in the data set named Lakeland.

library(readxl)
Lakeland <- read_excel("Lakeland.xlsx")
Lakeland

## # A tibble: 100 × 4
##    Student   GPA Program Return
##      <dbl> <dbl>   <dbl>  <dbl>
##  1       1  3.78       1      1
##  2       2  2.38       0      1
##  3       3  1.3        0      0
##  4       4  2.19       1      0
##  5       5  3.22       1      1
##  6       6  2.68       1      1
##  7       7  2.72       0      0
##  8       8  1.74       0      0
##  9       9  1.86       0      0
## 10      10  3.53       1      1
## # … with 90 more rows

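What if the column Program had "true"/"false" values and we needed to turn them into 1s and 0s?

Use this code (a sketch: it assumes Program was read in as text, so the column is converted back to numeric at the end):

Lakeland$Program[Lakeland$Program == "true"] <- 1
Lakeland$Program[Lakeland$Program == "false"] <- 0
Lakeland$Program <- as.numeric(Lakeland$Program)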
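A shorter way to do the same recoding is to compare the column to "true" and convert the resulting TRUE/FALSE values to integers. The example below uses a made-up toy vector rather than the Lakeland data.

```r
# Toy column standing in for Lakeland$Program (values are made up)
program_chr <- c("true", "false", "true", "true", "false")

# The comparison gives TRUE/FALSE; as.integer() turns that into 1/0
program_num <- as.integer(program_chr == "true")
program_num  # 1 0 1 1 0
```

Note that this maps anything other than "true" to 0, so check the column for other spellings (such as "TRUE" or "yes") before relying on it.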

M2 = glm(Return~GPA + Program, family="binomial", data=Lakeland)
summary(M2)

## 
## Call:
## glm(formula = Return ~ GPA + Program, family = "binomial", data = Lakeland)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9610  -0.4828   0.2848   0.5980   1.8154  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -6.8926     1.7472  -3.945 7.98e-05 ***
## GPA           2.5388     0.6729   3.773 0.000161 ***
## Program       1.5608     0.5631   2.772 0.005579 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 128.207  on 99  degrees of freedom
## Residual deviance:  80.338  on 97  degrees of freedom
## AIC: 86.338
## 
## Number of Fisher Scoring iterations: 5

Interpreting: both predictor variables are significant at the 5% level (GPA: p = 0.000161; Program: p = 0.005579).
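Beyond the individual p-values, the two deviance lines in the summary allow a rough check of the model as a whole. This is a likelihood-ratio (chi-square) test, which these notes do not otherwise cover; the sketch compares the fitted model to the intercept-only model using the printed numbers.

```r
# Deviances and degrees of freedom copied from summary(M2)
null_dev <- 128.207   # intercept-only model, 99 df
resid_dev <- 80.338   # model with GPA and Program, 97 df

lr_stat <- null_dev - resid_dev   # likelihood-ratio statistic
lr_df <- 99 - 97                  # number of predictors added
lr_p <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

lr_stat       # 47.869
lr_p < 0.001  # TRUE: the two predictors together improve on the null model
```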

Make a prediction:

a = data.frame(Program=c(1,0), GPA=c(2.5,2.5))
pred = predict.glm(M2, newdata=a, type="response")
cbind(a, pred)

##   Program GPA      pred
## 1       1 2.5 0.7340349
## 2       0 2.5 0.3668944

Interpreting: a student with a 2.5 GPA who did not attend the program has an estimated 36.7% probability of returning the next year; a student with a 2.5 GPA who did attend the program has an estimated 73.4% probability.
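To see how the estimated probability changes across GPA levels, the same calculation can be repeated over a small grid. This sketch uses the rounded coefficients from the summary rather than the fitted object, and the GPA values are arbitrary choices for illustration.

```r
# Rounded coefficients copied from summary(M2)
b0 <- -6.8926
b_gpa <- 2.5388
b_program <- 1.5608

# Every combination of a few GPA values and program attendance
grid <- expand.grid(GPA = c(2.0, 2.5, 3.0, 3.5), Program = c(0, 1))
grid$prob <- round(plogis(b0 + b_gpa * grid$GPA + b_program * grid$Program), 3)
grid
```

The rows with GPA = 2.5 reproduce the 0.367 and 0.734 from the prediction above. With the fitted model available, `predict(M2, newdata = grid, type = "response")` would give the unrounded equivalents.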

Estimating the odds ratio:

exp(coef(M2))

##  (Intercept)          GPA      Program 
##  0.001015316 12.664425631  4.762412959

The estimated odds of returning to Lakeland for students who attended the orientation program are 4.76 times the estimated odds for students who did not attend the orientation program, if the GPA is the same.

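In terms of GPA, a higher GPA increases the odds of returning: each one-point increase in GPA multiplies the estimated odds of returning by about 12.66, if program attendance is the same. The GPA odds ratio is larger than the Program odds ratio, but the two are on different scales (a one-point change in GPA versus attended or did not attend), so they should be compared with care. To compare two groups of students directly, we could convert GPA into a categorical variable (for example, yes for GPA > 3 and no otherwise, or another cutoff).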
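The GPA odds ratio from exp(coef(M2)) applies to a one-point increase in GPA. For any other increase, multiply the coefficient by the size of the increase before exponentiating. A small sketch, with a made-up helper name:

```r
b_gpa <- 2.5388   # GPA coefficient copied from summary(M2)

# Hypothetical helper: odds ratio for a GPA increase of `delta` points,
# with Program held fixed
gpa_odds_ratio <- function(delta) exp(b_gpa * delta)

or_one_point <- gpa_odds_ratio(1)      # one-point increase
or_half_point <- gpa_odds_ratio(0.5)   # half-point increase
round(c(or_one_point, or_half_point), 2)  # 12.66 3.56
```

A half-point increase in GPA multiplies the estimated odds of returning by about 3.56, which is the square root of the one-point odds ratio.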

Logistic Notes

Nabil Arnaoot

2/12/2022
