Background

The owner of a company would like to be able to predict whether employees will stay with the company or leave.

Data Description

The data contains information about various characteristics of employees. Please note that the dataset has been updated to account for repetitions, which is needed for the goodness-of-fit assessment. See below for the description of these characteristics.

  1. Age.Group: 1-9 (1 corresponds to teen, 2 corresponds to twenties, etc.)
  2. Gender: 1 if male, 0 if female
  3. Tenure: Number of years with the company
  4. Num.Of.Products: Number of products owned
  5. Is.Active.Member: 1 if active member, 0 if inactive member
  6. Staying: Fraction of employees who stayed with the company for a given combination of predictor values.

Setup

You can import the data and set up the problem with the following R code:
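A minimal setup sketch (the file name data_hw3.csv is a placeholder assumption; the key steps are reading the aggregated data and treating the categorical predictors as factors, which the later model output presumes):

# Read the aggregated employee data (file name is a placeholder assumption)
data <- read.csv("data_hw3.csv", header = TRUE)
# Treat the categorical predictors as factors, as required in Questions 1 and 4
data$Age.Group        <- as.factor(data$Age.Group)
data$Gender           <- as.factor(data$Gender)
data$Num.Of.Products  <- as.factor(data$Num.Of.Products)
data$Is.Active.Member <- as.factor(data$Is.Active.Member)
# Number of rows (covariate patterns)
nrow(data)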

## [1] 158

Question 1: Fitting a Model - 9 pts

Fit a logistic regression model using Staying as the response variable with Num.Of.Products as the predictor and logit as the link function. Be sure to include the weights parameter for specifying the number of trials. Call it model1. Note that Num.Of.Products should be treated as a categorical variable.

(a) 3 pts - Display the summary of model1. What are the model parameters and estimates?

## Fit a logistic regression model
model1 = glm(Staying ~ Num.Of.Products, weights = Employees, family = binomial(link = "logit"), data = data)  # family = binomial alone works too
summary(model1)
## 
## Call:
## glm(formula = Staying ~ Num.Of.Products, family = binomial(link = "logit"), 
##     data = data, weights = Employees)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.2827  -1.4676  -0.1022   1.4490   4.7231  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       0.37886    0.04743   7.988 1.37e-15 ***
## Num.Of.Products2 -1.76683    0.10313 -17.132  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 981.04  on 157  degrees of freedom
## Residual deviance: 632.04  on 156  degrees of freedom
## AIC: 1056.8
## 
## Number of Fisher Scoring iterations: 4

#Answer:

The model parameters are the intercept (β0) and the coefficient for Num.Of.Products2 (β1). The estimates are β0 = 0.37886 and β1 = -1.76683, as shown in the summary above.

(b) 3 pts - Write down the equation for the Odds of Staying.

#Answer

Odds of Staying = P(Staying) / (1 - P(Staying)) = exp(β0 + β1*Num.Of.Products2) = exp(0.37886 - 1.76683*Num.Of.Products2)

Note: there is no additive error term in a logistic regression model, so there is no error-variance parameter to estimate.
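As a quick check, the same equation can be read off the fitted model on the odds scale; a small sketch using model1 from above:

# Baseline odds (Num.Of.Products = 1) and the multiplicative change for Num.Of.Products = 2
exp(coef(model1))
# exp(0.37886)  is about 1.46 -> odds of staying at the baseline level
# exp(-1.76683) is about 0.17 -> factor applied to the odds when Num.Of.Products = 2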

(c) 3 pts - Provide a meaningful interpretation for the estimated coefficient for Num.Of.Products2 with respect to the log-odds of staying and the odds of staying.

#Answer

The coefficient for Num.Of.Products2 is -1.76683: employees with 2 products have a log-odds of staying that is 1.76683 lower than employees with 1 product (the baseline). Equivalently, their odds of staying are multiplied by exp(-1.76683) ≈ 0.17, i.e. roughly 83% lower than the odds for employees with 1 product.

Question 2: Inference - 9 pts

(a) 3 pts - Using model1, find a 90% confidence interval for the coefficient for Num.Of.Products2.

# 90% Wald confidence interval on the coefficient (log-odds) scale
CI <- confint.default(model1, level = 0.90)
# For reference: inverse-logit (probability-scale) intervals, at confint's default 95% level
plogis(confint(model1))
## Waiting for profiling to be done...
##                      2.5 %    97.5 %
## (Intercept)      0.5710552 0.6158799
## Num.Of.Products2 0.1222091 0.1726168
# 90% profile-likelihood intervals, exponentiated to the odds scale
exp(confint(model1, level = 0.9))
## Waiting for profiling to be done...
##                        5 %      95 %
## (Intercept)      1.3512484 1.5794323
## Num.Of.Products2 0.1439397 0.2021056
CI
##                         5 %       95 %
## (Intercept)       0.3008488  0.4568652
## Num.Of.Products2 -1.9364587 -1.5971969

#Answer:

The 90% confidence interval for the Num.Of.Products2 coefficient is approximately (-1.936, -1.597) on the log-odds scale; exponentiating gives roughly (0.144, 0.202) on the odds scale.

(b) 3 pts - Is model1 significant overall at the 0.01 significance level?

## Test for overall regression: likelihood ratio test against the intercept-only model
gstat = model1$null.deviance - deviance(model1)
cbind(gstat, 1-pchisq(gstat, length(coef(model1))-1))  # df = model1$df.null - model1$df.residual also works
##        gstat  
## [1,] 348.996 0
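Equivalently, the overall test can be run by comparing model1 to an intercept-only fit with anova(); a short sketch (null.model is a name introduced here for illustration):

# Likelihood ratio (chi-square) test of model1 against the null model
null.model <- glm(Staying ~ 1, weights = Employees, family = binomial(link = "logit"), data = data)
anova(null.model, model1, test = "Chisq")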

#Answer

The test statistic is G = 348.996 on 1 degree of freedom with a p-value of essentially 0. Since this is far below 0.01, model1 is significant overall at the 0.01 level.

(c) 3 pts - Which regression coefficients are significantly nonzero at the 0.01 significance level? Which are significantly negative? Why?

#Answer:

From the model summary, both coefficients are significantly nonzero at the 0.01 significance level, with p-values far below 0.01:

(Intercept) = 0.37886, p = 1.37e-15

Num.Of.Products2 = -1.76683, p < 2e-16

Only Num.Of.Products2 is significantly negative: its estimate is negative and its p-value is essentially zero, so it is significantly less than zero at the 0.01 level. The intercept, by contrast, is significantly positive.
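For reference, the z-statistics and p-values behind this conclusion can be pulled out programmatically; a small sketch:

# Two-sided 0.01 critical value for the z-statistics, and the full coefficient table
qnorm(1 - 0.01/2)
summary(model1)$coefficients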

Question 3: Goodness of fit - 10 pts

(a) 3.5 pts - Perform goodness-of-fit hypothesis tests using both Deviance and Pearson residuals. What do you conclude? Explain the differences, if any, between these findings and what you found in Question 2b.

#GOF Hypothesis Test: using deviance residuals & using Pearson Residuals
# Deviance Test for GOF using deviance residuals
c(deviance(model1), 1-pchisq(deviance(model1),156))  # equivalently: pchisq(deviance(model1), 156, lower.tail=FALSE)
## [1] 632.04   0.00
# same as above, you can calc deviance like you would for Pearsons
# calc_dev= residuals(model1, type='deviance')
# dev.tvalue=sum(calc_dev^2)
# c(dev.tvalue, 1-pchisq(dev.tvalue,156))

## GOF test using Pearson residuals-same as above
pearres2 = residuals(model1,type="pearson")
pearson.tvalue2 = sum(pearres2^2)
c(pearson.tvalue2, 1-pchisq(pearson.tvalue2,156))
## [1] 562.1763   0.0000

#Answer:

Both the deviance and the Pearson goodness-of-fit tests give p-values that are essentially zero, so we reject the null hypothesis that the model fits the data well and conclude that model1 is not a good fit. This does not contradict Question 2b: the overall test only shows that model1 explains significantly more than the intercept-only model, whereas the goodness-of-fit tests show that it still does not describe the data adequately.

(b) 3.5 pts - Evaluate whether the deviance residuals are normally distributed by producing a QQ plot and histogram of the deviance residuals. What assessments can you make about the goodness of fit of the model1 based on these plots?

## Residual Analysis

res = resid(model1,type="deviance")
par(mfrow=c(1,2))

# QQ plot of the residuals: can be done with either the Pearson or the deviance residuals
qqnorm(res, ylab="Std residuals")
qqline(res, col="red", lwd=2)
# qqnorm(pearres2, ylab="Pearson residuals")  # pearres2 computed in the previous question
# qqline(pearres2, col="blue", lwd=2)

# Histogram of the deviance residuals
hist(res, breaks=20, xlab="Std residuals", main="Histogram of Residuals", col="magenta")

#Answer

The Q-Q plot shows some deviation in the tails, but the histogram looks roughly symmetric and bell-shaped. Overall, neither plot points to a clear departure from normality of the deviance residuals, so these plots alone do not suggest a poor fit.

(c) 3 pts - Calculate the estimated dispersion parameter for this model. Is this an overdispersed model?

D <- sum(residuals(model1, type='deviance')^2)
phi <- D/(nrow(data)-(length(coef(model1))-1)-1)
phi
## [1] 4.051539
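An alternative, commonly used estimate divides the Pearson statistic by the residual degrees of freedom; a brief sketch reusing the residuals from Question 3a:

# Pearson-based estimate of the dispersion parameter
sum(residuals(model1, type = "pearson")^2) / model1$df.residual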

#Answer:

The estimated dispersion parameter is about 4.05. Since this is well above 1 (values much larger than about 2 are usually taken as evidence of overdispersion), model1 is overdispersed.

Question 4: Fitting the full model- 23 pts

Fit a logistic regression model using Staying as the response variable with Age.Group, Gender, Tenure, Num.Of.Products, and Is.Active.Member as the predictors and logit as the link function. Be sure to include the weights parameter for specifying the number of trials. Call it model2. Note that Age.Group, Gender, Num.Of.Products, and Is.Active.Member should be treated as categorical variables.

# Recode the categorical predictors as labelled factors (labels are easier to read in the output)
data.f = within(data, {
  Age = factor(Age.Group, labels = c("Twenties", "Thirties", "Forties", "Fifties"))
  gender = factor(Gender, labels = c("Female", "Male"))
  is.active.member = factor(Is.Active.Member, labels = c("Non_Active_Member", "Active_Member"))
})

model2 <- glm(Staying ~ Age + gender + Tenure + Num.Of.Products + is.active.member,
              weights = Employees, family = binomial(link = "logit"), data = data.f)
summary(model2)
## 
## Call:
## glm(formula = Staying ~ Age + gender + Tenure + Num.Of.Products + 
##     is.active.member, family = binomial(link = "logit"), data = data.f, 
##     weights = Employees)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.10929  -0.76949  -0.07324   0.74079   3.06551  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -0.109572   0.282617  -0.388    0.698    
## AgeThirties                    0.384480   0.267984   1.435    0.151    
## AgeForties                     1.734115   0.270384   6.414 1.42e-10 ***
## AgeFifties                     2.955578   0.337727   8.751  < 2e-16 ***
## genderMale                    -0.572069   0.093776  -6.100 1.06e-09 ***
## Tenure                        -0.003319   0.016569  -0.200    0.841    
## Num.Of.Products2              -1.410946   0.112000 -12.598  < 2e-16 ***
## is.active.memberActive_Member -0.850280   0.095829  -8.873  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 981.04  on 157  degrees of freedom
## Residual deviance: 162.35  on 150  degrees of freedom
## AIC: 599.07
## 
## Number of Fisher Scoring iterations: 4
#or using original factors
model2.originalfactor <- glm(Staying~Age.Group+Gender+Tenure+Num.Of.Products+Is.Active.Member, weights=Employees, family=binomial(link = "logit"),data=data)
summary(model2.originalfactor)
## 
## Call:
## glm(formula = Staying ~ Age.Group + Gender + Tenure + Num.Of.Products + 
##     Is.Active.Member, family = binomial(link = "logit"), data = data, 
##     weights = Employees)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.10929  -0.76949  -0.07324   0.74079   3.06551  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -0.109572   0.282617  -0.388    0.698    
## Age.Group3         0.384480   0.267984   1.435    0.151    
## Age.Group4         1.734115   0.270384   6.414 1.42e-10 ***
## Age.Group5         2.955578   0.337727   8.751  < 2e-16 ***
## Gender1           -0.572069   0.093776  -6.100 1.06e-09 ***
## Tenure            -0.003319   0.016569  -0.200    0.841    
## Num.Of.Products2  -1.410946   0.112000 -12.598  < 2e-16 ***
## Is.Active.Member1 -0.850280   0.095829  -8.873  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 981.04  on 157  degrees of freedom
## Residual deviance: 162.35  on 150  degrees of freedom
## AIC: 599.07
## 
## Number of Fisher Scoring iterations: 4

#Answer

(a) 3 pts - Write down the equation for the probability of staying.

#Get the coefficients of model2 (using the labelled-factor version only)
coef(model2)
##                   (Intercept)                   AgeThirties 
##                   -0.10957208                    0.38447954 
##                    AgeForties                    AgeFifties 
##                    1.73411499                    2.95557843 
##                    genderMale                        Tenure 
##                   -0.57206916                   -0.00331918 
##              Num.Of.Products2 is.active.memberActive_Member 
##                   -1.41094581                   -0.85027969

#Answer:

P(Staying) = exp(η) / (1 + exp(η)), where the linear predictor is

η = -0.1096 + 0.3845*AgeThirties + 1.7341*AgeForties + 2.9556*AgeFifties - 0.5721*genderMale - 0.0033*Tenure - 1.4109*Num.Of.Products2 - 0.8503*is.active.memberActive_Member

and each term other than Tenure is a 0/1 indicator for the named level.
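As a sanity check, predict() reproduces this relationship: the response-scale prediction is the inverse logit (plogis) of the linear predictor. A small sketch on the first few rows:

# These two lines agree: probability = exp(eta) / (1 + exp(eta))
plogis(predict(model2, type = "link"))[1:3]
predict(model2, type = "response")[1:3]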

(b) 3 pts - Provide a meaningful interpretation for the estimated coefficients of Tenure and Is.Active.Member1 with respect to the odds of staying.

#Answer:

Holding the other predictors fixed, each additional year of Tenure multiplies the odds of staying by exp(-0.00332) ≈ 0.997, essentially no change (and the coefficient is not statistically significant). Being an active member multiplies the odds of staying by exp(-0.85028) ≈ 0.43 relative to a non-active member, i.e. roughly 57% lower odds of staying.
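The odds-scale factors quoted above can be computed directly from the fitted model; a small sketch:

# Multiplicative effects on the odds of staying
exp(coef(model2)[c("Tenure", "is.active.memberActive_Member")])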

(c) 3 pts - Is Is.Active.Member1 statistically significant given the other variables in model2 at the 0.01 significance level?

#Answer:

Yes. The p-value for Is.Active.Member1 (is.active.memberActive_Member in the labelled fit) is < 2e-16, far below 0.01, so it is statistically significant given the other variables in model2.

(d) 10 pts - Has your goodness of fit been affected? Follow the instructions to repeat the tests, plots, and dispersion parameter calculation you performed in Question 3 with model2.

(d-1) Perform goodness-of-fit hypothesis tests using both Deviance and Pearson residuals. What do you conclude?

#GOF Hypothesis Test: using deviance residuals & using Pearson Residuals
# Deviance Test for GOF using deviance residuals
c(deviance(model2), 1-pchisq(deviance(model2),150))  # equivalently: pchisq(deviance(model2), 150, lower.tail=FALSE)
## [1] 162.3494910   0.2319118
# same as above, you can calc deviance like you would for Pearsons
# calc_devr= residuals(model2, type='deviance')
# dev.tvaluer=sum(calc_devr^2)
# c(dev.tvaluer, 1-pchisq(dev.tvaluer,150))

## GOF test using Pearson residuals-same as above
pearres2.2 = residuals(model2,type="pearson")
pearson.tvalue2.2 = sum(pearres2.2^2)
c(pearson.tvalue2.2, 1-pchisq(pearson.tvalue2.2,150))
## [1] 154.1554225   0.3912174

#Answer:

Both tests now give large p-values (about 0.23 for the deviance test and 0.39 for the Pearson test), so we fail to reject the null hypothesis of good fit. Unlike model1, there is no evidence of lack of fit for model2.

(d-2) Evaluate the linearity assumption of model2 by plotting the log-odds of Staying vs. Tenure. What do you conclude?

# Linearity check: log-odds of Staying vs. Tenure
# Base-R alternative:
# plot(data$Tenure, log(data$Staying/(1 - data$Staying)), ylab="Logit of Staying",
#      main="Scatterplot of logit staying rate vs. Tenure", col=c("red","blue"), lwd=3)

library(ggplot2)
library(ggthemes)  # provides theme_economist()
# Rows with Staying equal to 0 or 1 have infinite logits and are dropped (hence the warning below)
g1 <- ggplot(data, aes(Tenure, log(Staying/(1 - Staying)))) +
  geom_point(color="green") +
  stat_smooth() + theme_economist() + ggtitle("Tenure vs. logit staying rate")
g1
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 10 rows containing non-finite values (stat_smooth).

#Answer:

(d-3) Evaluate whether the deviance residuals are normally distributed by producing a QQ plot and histogram of the deviance residuals. What do you conclude?

## Residual Analysis

res2 = resid(model2,type="deviance")
par(mfrow=c(1,2))

# QQ plot of the residuals: can be done with either the Pearson or the deviance residuals
qqnorm(res2, ylab="Std residuals")
qqline(res2, col="blue", lwd=2)
# qqnorm(pearres2.2, ylab="Pearson residuals")  # same approach as before
# qqline(pearres2.2, col="blue", lwd=2)

# Histogram of the deviance residuals for model2 (res2, not res from model1)
hist(res2, breaks=20, xlab="Std residuals", main="Histogram of Residuals", col="yellow")

#Answer:

(d-4) Calculate the estimated dispersion parameter for this model. Is this an overdispersed model?

D2 <- sum(residuals(model2, type='deviance')^2)
phi2 <- D2/(nrow(data)-(length(coef(model2))-1)-1)
phi2
## [1] 1.08233

#Answer

The estimated dispersion parameter is about 1.08, so model2 is not an overdispersed model. A dispersion parameter close to 1 indicates that the variability of the response is close to the variability assumed by the binomial model.

(e) 4 pts - Overall, would you say model2 is a good-fitting model? If so, why? If not, what would you suggest to improve the fit and why? Note: We are not asking you to spend hours finding the best possible model but to offer plausible suggestions along with your reasoning.

#Answer:

Overall, model2 appears to be a good-fitting model: the deviance and Pearson goodness-of-fit tests do not reject it (p ≈ 0.23 and 0.39) and the estimated dispersion parameter is close to 1 (≈ 1.08). If further improvement were desired, plausible options would be to explore interaction terms among the categorical predictors or a nonlinear transformation of Tenure, but the diagnostics above do not suggest this is necessary.

Question 5: Prediction - 9 pts

Suppose there is an employee with the following characteristics:

  1. Age.Group: 2

  2. Gender: 0

  3. Tenure: 2

  4. Num.Of.Products: 2

  5. Is.Active.Member: 1

(a) 3 pts - Predict their probability of staying using model1.

Age.Group<-'2'
Gender<-'0'
Tenure<-2
Num.Of.Products<-'2'
Is.Active.Member<-'1'
test <- data.frame(Age.Group, Gender,Tenure,Num.Of.Products,Is.Active.Member)
## Build a prediction for model1 with the test data
test.pred1 <- predict(model1, test, type="response")
test.pred1
##         1 
## 0.1997319
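For reference, this matches a hand calculation from the model1 coefficients (a small sketch; the intercept is the baseline log-odds for one product):

# plogis() is the inverse logit: exp(x) / (1 + exp(x))
plogis(coef(model1)["(Intercept)"] + coef(model1)["Num.Of.Products2"])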

(b) 3 pts - Predict their probability of staying using model2.

# Predict using model2
Age<-'Twenties'
gender<-'Female'
Tenure<-2
Num.Of.Products<-'2'
is.active.member<-'Active_Member'
test2 <- data.frame(Age, gender,Tenure,Num.Of.Products,is.active.member)

test.pred2 <- predict(model2, test2, type="response")
test.pred2
##          1 
## 0.08490958
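As a check, the same number can be built by hand from the model2 coefficients; a small sketch (the Twenties age group and Female are the baseline levels, so they contribute nothing to the linear predictor):

# Linear predictor for this employee, then inverse logit
eta <- coef(model2)["(Intercept)"] + 2 * coef(model2)["Tenure"] +
  coef(model2)["Num.Of.Products2"] + coef(model2)["is.active.memberActive_Member"]
plogis(unname(eta))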

(c) 3 pts - Comment on how your predictions compare.

#Answer:

The predictions differ noticeably: model1 gives a staying probability of about 0.20, while model2 gives about 0.085. model1 conditions only on Num.Of.Products, whereas model2 also accounts for the employee being in their twenties, female, having 2 years of tenure, and being an active member. Active membership in particular is associated with substantially lower odds of staying in model2, which pulls its predicted probability well below model1's.