The owner of a company would like to be able to predict whether employees will stay with the company or leave.
The data contains information about various characteristics of employees. Please note that the dataset has been updated to account for repetitions, which is needed for Goodness of Fit Assessment. See below for the description of these characteristics.
You can import the data and set up the problem with the following R code:
## [1] 158
Fit a logistic regression model using Staying as the response variable with Num.Of.Products as the predictor and logit as the link function. Ensure to include the weights parameter for specifying the number of trials. Call it model1. Note that Num.Of.Products should be treated as a categorical variable.
(a) 3 pts - Display the summary of model1. What are the model parameters and estimates?
## Fit a logistic regression model
model1 = glm(Staying ~ Num.Of.Products, weights=Employees, family=binomial(link = "logit"), data=data)# simply binomial will work too
summary(model1)
##
## Call:
## glm(formula = Staying ~ Num.Of.Products, family = binomial(link = "logit"),
## data = data, weights = Employees)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.2827 -1.4676 -0.1022 1.4490 4.7231
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.37886 0.04743 7.988 1.37e-15 ***
## Num.Of.Products2 -1.76683 0.10313 -17.132 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 981.04 on 157 degrees of freedom
## Residual deviance: 632.04 on 156 degrees of freedom
## AIC: 1056.8
##
## Number of Fisher Scoring iterations: 4
#Answer:
Intercept = 0.37886 (b0)
Num.Of.Products2 = -1.76683 (b1)
The intercept, or the β0 is 0.37886. The coefficient for Num.Of.Products or the β1 is -1.767
The estimated parameters are bo and b1 as shown above
(b) 3 pts - Write down the equation for the Odds of Staying.
#Answer
Pstaying/(1-Pstaying) = exp(beta0+beta1x)=exp(0.37886-1.76683*Num.Of.Products2 )
note: There are no error terms, so No parameter for the variance of error
(c) 3 pts - Provide a meaningful interpretation for the estimated coefficient for Num.Of.Products2 with respect to the log-odds of staying and the odds of staying.
#Answer
The estimated coefficient for Num.Of.Products2 is the logs odds of staying for employees who own products2.
In this case, -1.76683 is the log odds of employees staying: This indicates the log odds of Staying decrease by 1.767 for every product owned
Or conversely, for every product owned, we expect the odds of staying than leaving to be exp(-1.76683)=0.1708 or whopping 82.91% decrease!
(a) 3 pts - Using model1, find a 90% confidence interval for the coefficient for Num.Of.Products2.
#at 90% confidence level
# a confidence interval for the logit
CI <- confint.default(model1,level=0.90)
#To translate to the probability scale we calculate inverse logits
plogis(confint(model1))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.5710552 0.6158799
## Num.Of.Products2 0.1222091 0.1726168
#compute confidence intervals for some or all the parameters.
exp(confint(model1,level=.9))
## Waiting for profiling to be done...
## 5 % 95 %
## (Intercept) 1.3512484 1.5794323
## Num.Of.Products2 0.1439397 0.2021056
CI
## 5 % 95 %
## (Intercept) 0.3008488 0.4568652
## Num.Of.Products2 -1.9364587 -1.5971969
#Anwer:
(b) 3 pts - Is model1 significant overall at the 0.01 significance level?
## Test for overall regression
gstat = model1$null.deviance - deviance(model1)# same as this-> model.agg$deviance
cbind(gstat, 1-pchisq(gstat,length(coef(model1))-1))# model1$df.null - model1$df.residual works too instead of lenght
## gstat
## [1,] 348.996 0
#Answer
(c) 3 pts - Which regression coefficients are significantly nonzero at the 0.01 significance level? Which are significantly negative? Why?
#Answer:
We can see the estimates and p values of the model and see that the coefficients associated with the intercept and Num.Of.Products are significantly non-zero with p-values essentially zero: Below is a quick summary from the model summary:
(Intercept) = 0.37886, pr = 1.37e-15 ***
Num.Of.Products2 = -1.76683, pr < 2e-16 ***
The coefficient for Num.of.Products2 is negatively significant at alpha of 0.01
The intercept is positively significant also at alpha of 0.01
The reason is both coefficients have P-value levels essentially zero
(a) 3.5 pts - Perform goodness-of-fit hypothesis tests using both Deviance and Pearson residuals. What do you conclude? Explain the differences, if any, between these findings and what you found in Question 2b.
#GOF Hypothesis Test: using deviance residuals & using Pearson Residuals
# Deviance Test for GOF using deviance residuals
c(deviance(model1), 1-pchisq(deviance(model1),156))# using lower tail way=>pchisq(deviance(model1),156, lower.tail=F)
## [1] 632.04 0.00
# same as above, you can calc deviance like you would for Pearsons
# calc_dev= residuals(model1, type='deviance')
# dev.tvalue=sum(calc_dev^2)
# c(dev.tvalue, 1-pchisq(dev.tvalue,156))
## GOF test using Pearson residuals-same as above
pearres2 = residuals(model1,type="pearson")
pearson.tvalue2 = sum(pearres2^2)
c(pearson.tvalue2, 1-pchisq(pearson.tvalue2,156))
## [1] 562.1763 0.0000
#Answer:
-Both deviance way and Pearson’s way led to the same conclusion, the p-value is essentially zero which means the model is NOT a good fit as we reject the Null hypothesis (model is good fit) and conclude that the model is NOT a good fit
(b) 3.5 pts - Evaluate whether the deviance residuals are normally distributed by producing a QQ plot and histogram of the deviance residuals. What assessments can you make about the goodness of fit of the model1 based on these plots?
## Residual Analysis
res = resid(model1,type="deviance")
par(mfrow=c(1,2))
#qq plot of resid=>you can do it 2 ways: using pearrons' residuals or direrctly thru "deviance"
qqnorm(res, ylab="Std residuals")
qqline(res,col="red",lwd=2)
# qqnorm(pearres2,ylab="test"), pearres2 from previous question from top
# qqline(res2,col="blue",lwd=2)
#hhistogram of resid
hist(res,breaks=20,xlab="Std residuals", main="Histogram of Residuals",col="magenta",)
#Answer
QQ plot suggest we have quite normal profile in the inner part of the data but with deviations from normality on both ends
The histogram showed quite a normal fit
The Q-Q plot is slightly tailed but the Histogram looks relatively normal and both looked reasonable in the sense that they do not point to a departure from normality
(c) 3 pts - Calculate the estimated dispersion parameter for this model. Is this an overdispersed model?
D <- sum(residuals(model1, type='deviance')^2)
phi <- D/(nrow(data)-(length(coef(model1))-1)-1)
phi
## [1] 4.051539
#Answer:
Fit a logistic regression model using Staying as the response variable with Age.Group, Gender, Tenure, Num.Of.Products, and Is.Active.Member as the predictors and logit as the link function. Ensure to include the weights parameter for specifying the number of trials. Call it model2. Note that Age.Group, Gender, Num.Of.Products, and Is.Active.Member should be treated as categorical variables.
# Data before aggregation-> To customize your own categorical factors
data.f = within(data, {
Age = factor(Age.Group, labels=c("Twenties","Thirties","Forties","Fifties"))
gender = factor(Gender, labels=c("Female", "Male"))
is.active.member = factor(Is.Active.Member, labels=c("Non_Active_Member", "Active_Member"))})
model2 <- glm(Staying~Age+gender+Tenure+Num.Of.Products+is.active.member, weights=Employees, family=binomial(link = "logit"),data=data.f)
summary(model2)
##
## Call:
## glm(formula = Staying ~ Age + gender + Tenure + Num.Of.Products +
## is.active.member, family = binomial(link = "logit"), data = data.f,
## weights = Employees)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.10929 -0.76949 -0.07324 0.74079 3.06551
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.109572 0.282617 -0.388 0.698
## AgeThirties 0.384480 0.267984 1.435 0.151
## AgeForties 1.734115 0.270384 6.414 1.42e-10 ***
## AgeFifties 2.955578 0.337727 8.751 < 2e-16 ***
## genderMale -0.572069 0.093776 -6.100 1.06e-09 ***
## Tenure -0.003319 0.016569 -0.200 0.841
## Num.Of.Products2 -1.410946 0.112000 -12.598 < 2e-16 ***
## is.active.memberActive_Member -0.850280 0.095829 -8.873 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 981.04 on 157 degrees of freedom
## Residual deviance: 162.35 on 150 degrees of freedom
## AIC: 599.07
##
## Number of Fisher Scoring iterations: 4
#or using original factors
model2.originalfactor <- glm(Staying~Age.Group+Gender+Tenure+Num.Of.Products+Is.Active.Member, weights=Employees, family=binomial(link = "logit"),data=data)
summary(model2.originalfactor)
##
## Call:
## glm(formula = Staying ~ Age.Group + Gender + Tenure + Num.Of.Products +
## Is.Active.Member, family = binomial(link = "logit"), data = data,
## weights = Employees)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.10929 -0.76949 -0.07324 0.74079 3.06551
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.109572 0.282617 -0.388 0.698
## Age.Group3 0.384480 0.267984 1.435 0.151
## Age.Group4 1.734115 0.270384 6.414 1.42e-10 ***
## Age.Group5 2.955578 0.337727 8.751 < 2e-16 ***
## Gender1 -0.572069 0.093776 -6.100 1.06e-09 ***
## Tenure -0.003319 0.016569 -0.200 0.841
## Num.Of.Products2 -1.410946 0.112000 -12.598 < 2e-16 ***
## Is.Active.Member1 -0.850280 0.095829 -8.873 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 981.04 on 157 degrees of freedom
## Residual deviance: 162.35 on 150 degrees of freedom
## AIC: 599.07
##
## Number of Fisher Scoring iterations: 4
#Answer
There were 2 options to do this: model2 which used the original factors given by the problem or customize to more meaningful categories
As the summary results of both models showed the results are the exactly same, either way would’ve been fine for this problem
(a) 3 pts - Write down the equation for the probability of staying.
#Get the coeeficients of the model2; using the customized version only
coef(model2)
## (Intercept) AgeThirties
## -0.10957208 0.38447954
## AgeForties AgeFifties
## 1.73411499 2.95557843
## genderMale Tenure
## -0.57206916 -0.00331918
## Num.Of.Products2 is.active.memberActive_Member
## -1.41094581 -0.85027969
#Answer:
(b) 3 pts - Provide a meaningful interpretation for the estimated coefficients of Tenure and Is.Active.Member1 with respect to the odds of staying.
#Answer:
Holding all else constant, a 1-unit increase in Tenure, we expect to see exp=(-0.003319)=0.996 or 0.33% decrease in odds of staying
Holding all else constant, the odds of staying for active member is exp(-0.850279)=0.4273 or 57.27% lower than non-active members
(c) 3 pts - Is Is.Active.Member1 statistically significant given the other variables in model2 at the 0.01 significance level?
#Answer:
(d) 10 pts - Has your goodness of fit been affected? Follow the instructions to repeat the tests, plots, and dispersion parameter calculation you performed in Question 3 with model2.
(d-1) Perform goodness-of-fit hypothesis tests using both Deviance and Pearson residuals. What do you conclude?
#GOF Hypothesis Test: using deviance residuals & using Pearson Residuals
# Deviance Test for GOF using deviance residuals
c(deviance(model2), 1-pchisq(deviance(model2),150))# using lower tail way=>pchisq(deviance(model2),150, lower.tail=F)
## [1] 162.3494910 0.2319118
# same as above, you can calc deviance like you would for Pearsons
# calc_devr= residuals(model2, type='deviance')
# dev.tvaluer=sum(calc_devr^2)
# c(dev.tvaluer, 1-pchisq(dev.tvaluer,150))
## GOF test using Pearson residuals-same as above
pearres2.2 = residuals(model2,type="pearson")
pearson.tvalue2.2 = sum(pearres2.2^2)
c(pearson.tvalue2.2, 1-pchisq(pearson.tvalue2.2,150))
## [1] 154.1554225 0.3912174
#Answer:
(d-2) Evaluate the linearity assumption of model2 by plotting the log-odds of Staying vs. Tenure. What do you conclude?
# Is it a linear fit?
# plot(Tenure,log((Staying)/(1-Staying)), ylab="Logit of Staying", main="Scatterplot of logit
# staying rate vs Tenureshipt", col=c("red","blue"), lwd=3)
g1 <- ggplot(data, aes(Tenure, log((Staying)/(1-Staying))) ) +
geom_point(color="green") +
stat_smooth()+theme_economist()+ ggtitle("Tenure vs. logit staying rate")
g1
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 10 rows containing non-finite values (stat_smooth).
#Answer:
(d-3) Evaluate whether the deviance residuals are normally distributed by producing a QQ plot and histogram of the deviance residuals. What do you conclude?
## Residual Analysis
res2 = resid(model2,type="deviance")
par(mfrow=c(1,2))
#qq plot of resid=>you can do it 2 ways: using pearrons' residuals or direrctly thru "deviance"
qqnorm(res2, ylab="Std residuals")
qqline(res2,col="blue",lwd=2)
# qqnorm(pearres2.2,ylab="test"), same as before
# qqline(res2,col="blue",lwd=2)
#hhistogram of resid
hist(res,breaks=20,xlab="Std residuals", main="Histogram of Residuals",col="yellow",)
#Answer:
The QQ plot improved a bit especially at the ends, it’s not as tailed as previously shown. So that was a bit of improvement
The Histogram plot looked very similar. It was quite good before and this is about the same. So from a normality perspective, I say both do not show signs of departure from normality
(d-4) Calculate the estimated dispersion parameter for this model. Is this an overdispersed model?
D2 <- sum(residuals(model2, type='deviance')^2)
phi2 <- D2/(nrow(data)-(length(coef(model2))-1)-1)
phi2
## [1] 1.08233
#Answer
Moerover, A dispersion parameter close to 1 indicates the variability of the response is close to the variability estimated by model
(e) 4 pts - Overall, would you say model2 is a good-fitting model? If so, why? If not, what would you suggest to improve the fit and why? Note: We are not asking you to spend hours finding the best possible model but to offer plausible suggestions along with your reasoning.
#Answer:
From the plots and test results from previous questions, the goodness of fit (GOF) had improved. Overall, model2 appears to be a better fitting model.
One possibility to improve the GOF, is to change the link function. although not shown here could possibly reduce the deviance
Some other ideas include transforming the data, removing influential points from data; especially the 2 QQ plots showed that there were some outliers lurking in the ends of the data set.
Suppose there is an employee with the following characteristics:
Age.Group: 2
Gender: 0
Tenure: 2
Num.Of.Products: 2
Is.Active.Member: 1
(a) 3 pts - Predict their probability of staying using model1.
Age.Group<-'2'
Gender<-'0'
Tenure<-2
Num.Of.Products<-'2'
Is.Active.Member<-'1'
test <- data.frame(Age.Group, Gender,Tenure,Num.Of.Products,Is.Active.Member)
## Build a prediction for model1 with the test data
test.pred1 <- predict(model1, test, type="response")
test.pred1
## 1
## 0.1997319
(b) 3 pts - Predict their probability of staying using model2.
#perdict using model2
Age<-'Twenties'
gender<-'Female'
Tenure<-2
Num.Of.Products<-'2'
is.active.member<-'Active_Member'
test2 <- data.frame(Age, gender,Tenure,Num.Of.Products,is.active.member)
test.pred2 <- predict(model2, test2, type="response")
test.pred2
## 1
## 0.08490958
(c) 3 pts - Comment on how your predictions compare.
#Answer:
The Prediction from model 1 is about 20% (rounding up) of staying while model 2 predicts a lower probability of staying at 8%.
At high level, since model2 had a better GOF, I prefer to use model2’s prediction level as that seems to be more reliable
However, a real robust Cross validation process should be performed by splitting the data into training and testing sets. This more robust way allows us to calculate accuracy metrics and evaluate which model is truly more accurate