The problem: Simmons’ catalogs are expensive and Simmons would like to send them to only those customers who have the highest probability of making a $200 purchase using the discount coupon included in the catalog. Simmons’ management thinks that annual spending at Simmons Stores and whether a customer has a Simmons credit card are two variables that might be helpful in predicting whether a customer who receives the catalog will use the coupon to make a $200 purchase.
library(readxl)
Simmons <- read_excel("Simmons.xlsx")
Simmons
## # A tibble: 100 × 4
## Customer Spending Card Coupon
## <dbl> <dbl> <dbl> <dbl>
## 1 1 2.29 1 0
## 2 2 3.22 1 0
## 3 3 2.13 1 0
## 4 4 3.92 0 0
## 5 5 2.53 1 0
## 6 6 2.47 0 1
## 7 7 2.38 0 0
## 8 8 7.08 0 0
## 9 9 1.18 1 1
## 10 10 3.34 0 0
## # … with 90 more rows
M1 = glm(Coupon ~ Card + Spending, family = binomial, data = Simmons)
summary(M1)
##
## Call:
## glm(formula = Coupon ~ Card + Spending, family = binomial, data = Simmons)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6839 -1.0140 -0.6503 1.1216 1.8794
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.1464 0.5772 -3.718 0.000201 ***
## Card 1.0987 0.4447 2.471 0.013483 *
## Spending 0.3416 0.1287 2.655 0.007928 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 134.60 on 99 degrees of freedom
## Residual deviance: 120.97 on 97 degrees of freedom
## AIC: 126.97
##
## Number of Fisher Scoring iterations: 4
Interpreting these results:
The equation with the estimated coefficients is: ln(p/(1-p)) = -2.1464 + 1.0987 * Card + 0.3416 * Spending, where p is the probability that Coupon equals 1.
Interpreting the p-values: for each possible predictor variable, we run a hypothesis test. For Card, H0 is that the coefficient is 0 (Card has no effect on the log-odds of using the coupon) and HA is that the coefficient is not 0. Since the p-value (0.0135) is below 5%, we reject H0 and conclude that Card is a significant predictor; 1.0987 is the estimate of its coefficient.
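The z value and p-value reported for Card can be reproduced by hand as a check on what the test is doing. This is a minimal sketch that assumes the rounded estimate and standard error printed in the summary(M1) output; the variable names are made up for illustration.

```r
# Estimate and standard error for Card, copied from the summary(M1) output
estimate  <- 1.0987
std_error <- 0.4447

# Wald z statistic: the estimate divided by its standard error
z_value <- estimate / std_error

# Two-sided p-value under H0: the coefficient equals 0
p_value <- 2 * pnorm(-abs(z_value))

round(c(z = z_value, p = p_value), 4)
```

Because the inputs are rounded, the result should agree with the printed z value (2.471) and p-value (0.013483) only to about three decimal places.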
When comparing two models fitted to the same data, look at their AICs: the lower the AIC, the better.
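As a small worked example of an AIC comparison, the sketch below compares the fitted model with an intercept-only model using only numbers from the summary(M1) output. It assumes ungrouped 0/1 responses, for which the deviance equals minus 2 times the log-likelihood, so AIC is the deviance plus 2 times the number of estimated coefficients. The variable names are hypothetical.

```r
# Deviances copied from the summary(M1) output
null_deviance     <- 134.60  # intercept-only model
residual_deviance <- 120.97  # model with Card and Spending

# Number of estimated coefficients in each model
k_null <- 1  # intercept only
k_full <- 3  # intercept, Card, Spending

# For 0/1 responses: AIC = deviance + 2 * (number of coefficients)
aic_null <- null_deviance + 2 * k_null
aic_full <- residual_deviance + 2 * k_full

c(null = aic_null, full = aic_full)
```

The value for the full model matches the AIC of 126.97 printed by summary(M1), and it is lower than the intercept-only value, so by this criterion the two-predictor model is preferred.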
We don't have a simple way to tell from this output alone how well the model predicts. We would need to divide the data into a training set and a test set, fit the model on the training set, and predict on the test set. That way we could get a prediction score.
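A rough sketch of the training/test idea follows. Everything in it is an assumption made for illustration: the data are simulated rather than read from Simmons.xlsx, the 70/30 split and the 0.5 classification cutoff are arbitrary choices, and accuracy is only one possible prediction score.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Synthetic stand-in for the Simmons data (NOT the real file); the
# coefficients used to simulate Coupon are borrowed from summary(M1)
n <- 100
sim <- data.frame(
  Spending = round(runif(n, 1, 7), 2),
  Card     = rbinom(n, 1, 0.5)
)
sim$Coupon <- rbinom(n, 1, plogis(-2.1464 + 1.0987 * sim$Card + 0.3416 * sim$Spending))

# Hold out 30% of the rows as a test set (the split ratio is an assumption)
train_rows <- sample(seq_len(n), size = round(0.7 * n))
train <- sim[train_rows, ]
test  <- sim[-train_rows, ]

# Fit the model on the training rows only
fit <- glm(Coupon ~ Card + Spending, family = binomial, data = train)

# Classify the test rows with a 0.5 cutoff (also an assumption) and score them
test_prob  <- predict(fit, newdata = test, type = "response")
test_class <- as.integer(test_prob > 0.5)
accuracy   <- mean(test_class == test$Coupon)
accuracy
```

With the real data, sim would be replaced by the Simmons tibble; the share of correctly classified test rows then gives a simple out-of-sample score.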
Now run a prediction:
a = data.frame(Spending=c(2,2), Card=c(0,1))
predict.glm(M1, newdata=a, type="response")
## 1 2
## 0.1879957 0.4099058
a
## Spending Card
## 1 2 0
## 2 2 1
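Interpreting the results: a customer with Spending = 2 and no Simmons credit card has an estimated probability of about 0.19 of using the coupon, while a customer with the same spending who has a Simmons credit card has an estimated probability of about 0.41.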
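The two predicted probabilities can also be reproduced by hand from the fitted equation, which shows what predict.glm does with type="response". This sketch assumes the rounded coefficients from the summary(M1) output; the variable names are made up for illustration.

```r
# Coefficients copied (rounded) from the summary(M1) output
b_intercept <- -2.1464
b_card      <-  1.0987
b_spending  <-  0.3416

# The same two customers as in the data frame a
card     <- c(0, 1)
spending <- c(2, 2)

# Linear predictor: the log-odds of using the coupon
log_odds <- b_intercept + b_card * card + b_spending * spending

# Inverse logit turns log-odds into probabilities: 1 / (1 + exp(-x))
prob <- plogis(log_odds)
round(prob, 3)
```

The results should come out at about 0.188 and 0.410, matching the predict.glm output up to the rounding of the coefficients.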
The problem:
Over the past few years the percentage of students who leave Lakeland College at the end of the first year has increased. Last year Lakeland started a voluntary one-week orientation program to help first-year students adjust to campus life. If Lakeland is able to show that the orientation program has a positive effect on retention, they will consider making the program a requirement for all first-year students. Lakeland’s administration also suspects that students with lower GPAs have a higher probability of leaving Lakeland at the end of the first year. In order to investigate the relation of these variables to retention, Lakeland selected a random sample of 100 students from last year’s entering class. The data are contained in the data set named Lakeland.
library(readxl)
Lakeland <- read_excel("Lakeland.xlsx")
Lakeland
## # A tibble: 100 × 4
## Student GPA Program Return
## <dbl> <dbl> <dbl> <dbl>
## 1 1 3.78 1 1
## 2 2 2.38 0 1
## 3 3 1.3 0 0
## 4 4 2.19 1 0
## 5 5 3.22 1 1
## 6 6 2.68 1 1
## 7 7 2.72 0 0
## 8 8 1.74 0 0
## 9 9 1.86 0 0
## 10 10 3.53 1 1
## # … with 90 more rows
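What if the column Program had true/false values and we needed to turn them into ones and zeros?
Use this code. The first line only simulates the situation by recoding 1 as "true"; the second line turns "true" back into 1. Because the first assignment coerces the whole column to character, the last line converts it back to numeric.
Lakeland$Program[Lakeland$Program == 1] <- "true"
Lakeland$Program[Lakeland$Program == "true"] <- 1
Lakeland$Program <- as.numeric(Lakeland$Program)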
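If the column really did arrive as true/false values, a direct conversion may be simpler than recoding one value at a time. The sketch below uses a small made-up data frame (toy, not the Lakeland file) with the flag stored once as a logical and once as "true"/"false" text; the column names are hypothetical.

```r
# Hypothetical toy data (not the Lakeland file): the same flag stored
# as a logical and as "true"/"false" strings
toy <- data.frame(
  Program_lgl = c(TRUE, FALSE, TRUE),
  Program_chr = c("true", "false", "true"),
  stringsAsFactors = FALSE
)

# Logical -> 0/1: TRUE becomes 1, FALSE becomes 0
toy$from_lgl <- as.integer(toy$Program_lgl)

# Character -> 0/1: compare against "true" first, then convert the result
toy$from_chr <- as.integer(toy$Program_chr == "true")

toy
```

Both new columns hold 1, 0, 1, so either one could be used as a numeric predictor in glm.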
M2 = glm(Return~GPA + Program, family="binomial", data=Lakeland)
summary(M2)
##
## Call:
## glm(formula = Return ~ GPA + Program, family = "binomial", data = Lakeland)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9610 -0.4828 0.2848 0.5980 1.8154
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.8926 1.7472 -3.945 7.98e-05 ***
## GPA 2.5388 0.6729 3.773 0.000161 ***
## Program 1.5608 0.5631 2.772 0.005579 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 128.207 on 99 degrees of freedom
## Residual deviance: 80.338 on 97 degrees of freedom
## AIC: 86.338
##
## Number of Fisher Scoring iterations: 5
Interpreting: * Both predictor variables are significant at the 5% level (p = 0.000161 for GPA and p = 0.005579 for Program).
Make a prediction:
a = data.frame(Program=c(1,0), GPA=c(2.5,2.5))
pred = predict.glm(M2, newdata=a, type="response")
cbind(a, pred)
## Program GPA pred
## 1 1 2.5 0.7340349
## 2 0 2.5 0.3668944
Interpreting: * A student with a 2.5 GPA who did not attend the program has an estimated 36.7% probability of returning the next year. * A student with a 2.5 GPA who did attend the program has an estimated 73.4% probability of returning.
Estimating the odds ratio:
exp(coef(M2))
## (Intercept) GPA Program
## 0.001015316 12.664425631 4.762412959
The estimated odds of returning to Lakeland for students who attended the orientation program are 4.76 times the estimated odds of returning to Lakeland for students who did not attend the orientation program, if the GPA is the same.
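The 4.76 figure can be recovered from the two predicted probabilities for a 2.5 GPA, which ties the odds ratio back to the predictions. This is a small sketch using the printed values; the variable names are made up for illustration.

```r
# Predicted probabilities of returning at GPA = 2.5, copied from cbind(a, pred)
p_attended     <- 0.7340349
p_not_attended <- 0.3668944

# Odds = probability of returning / probability of not returning
odds_attended     <- p_attended / (1 - p_attended)
odds_not_attended <- p_not_attended / (1 - p_not_attended)

# The ratio of the two odds is the odds ratio for Program
odds_ratio <- odds_attended / odds_not_attended

# Compare with exp() of the rounded Program coefficient from summary(M2)
round(c(from_probabilities = odds_ratio, from_coefficient = exp(1.5608)), 3)
```

Both values come out near 4.76. The same ratio would result at any other GPA, because Program enters the log-odds additively.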
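In terms of GPA, I can say that a higher GPA increases the odds of returning: holding program attendance fixed, each one-point increase in GPA multiplies the estimated odds of returning by about 12.66. Per unit, that is a larger odds ratio than the one for attending the program, but the two variables are on different scales (a full GPA point versus a yes/no attendance flag), so the comparison should be read with care. If we wanted a yes/no comparison like the one for Program, we would need to convert GPA into a categorical variable. (For example, we could code GPA as yes for a GPA above 3 and no otherwise, or use another cutoff.)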
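One way to say something about GPA without recoding it is to scale the coefficient to a chosen GPA difference before exponentiating. This sketch assumes the rounded GPA coefficient from summary(M2); the half-point increment is an arbitrary illustration and the variable names are hypothetical.

```r
# GPA coefficient copied (rounded) from the summary(M2) output
b_gpa <- 2.5388

# Odds ratio for a one-point GPA difference, Program held fixed
or_one_point <- exp(b_gpa)

# Odds ratio for a half-point GPA difference (an arbitrary example increment)
or_half_point <- exp(b_gpa * 0.5)

round(c(one_point = or_one_point, half_point = or_half_point), 2)
```

The one-point value matches the 12.66 shown by exp(coef(M2)). The half-point value is its square root, since odds ratios multiply across successive increments.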