Background

The owner of a company would like to be able to predict whether employees will stay with the company or leave.

Data Description

The data contains information about various characteristics of employees. Please note that the dataset has been updated to account for repetitions, which is needed for Goodness of Fit Assessment. See below for the description of these characteristics.

Age.Group: 1-9 (1 corresponds to teen, 2 corresponds to twenties, etc.)
Gender: 1 if male, 0 if female
Tenure: Number of years with the company
Num.Of.Products: Number of products owned
Is.Active.Member: 1 if active member, 0 if inactive member
Staying: Fraction of employees that stayed with the company for a given set of predicting variables.

Setup

You can import the data and set up the problem with the following R code:

## [1] 158

Question 1: Fitting a Model - 9 pts

Fit a logistic regression model using Staying as the response variable with Num.Of.Products as the predictor and logit as the link function. Ensure to include the weights parameter for specifying the number of trials. Call it model1. Note that Num.Of.Products should be treated as a categorical variable.

(a) 3 pts - Display the summary of model1. What are the model parameters and estimates?

## Fit a logistic regression model
model1 = glm(Staying ~ Num.Of.Products, weights=Employees, family=binomial(link = "logit"), data=data)# simply binomial will work too
summary(model1)

## 
## Call:
## glm(formula = Staying ~ Num.Of.Products, family = binomial(link = "logit"), 
##     data = data, weights = Employees)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.2827  -1.4676  -0.1022   1.4490   4.7231  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       0.37886    0.04743   7.988 1.37e-15 ***
## Num.Of.Products2 -1.76683    0.10313 -17.132  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 981.04  on 157  degrees of freedom
## Residual deviance: 632.04  on 156  degrees of freedom
## AIC: 1056.8
## 
## Number of Fisher Scoring iterations: 4

#Answer:

Intercept = 0.37886 (b0)
Num.Of.Products2 = -1.76683 (b1)

The intercept, or the β0 is 0.37886. The coefficient for Num.Of.Products or the β1 is -1.767

The estimated parameters are bo and b1 as shown above

(b) 3 pts - Write down the equation for the Odds of Staying.

#Answer

Pstaying/(1-Pstaying) = exp(beta0+beta1x)=exp(0.37886-1.76683*Num.Of.Products2 )

note: There are no error terms, so No parameter for the variance of error

(c) 3 pts - Provide a meaningful interpretation for the estimated coefficient for Num.Of.Products2 with respect to the log-odds of staying and the odds of staying.

#Answer

The estimated coefficient for Num.Of.Products2 is the logs odds of staying for employees who own products2.
In this case, -1.76683 is the log odds of employees staying: This indicates the log odds of Staying decrease by 1.767 for every product owned
Or conversely, for every product owned, we expect the odds of staying than leaving to be exp(-1.76683)=0.1708 or whopping 82.91% decrease!

Question 2: Inference - 9 pts

(a) 3 pts - Using model1, find a 90% confidence interval for the coefficient for Num.Of.Products2.

#at 90% confidence level
# a confidence interval for the logit 
CI <- confint.default(model1,level=0.90)
#To translate to the probability scale we calculate inverse logits
plogis(confint(model1))

## Waiting for profiling to be done...

##                      2.5 %    97.5 %
## (Intercept)      0.5710552 0.6158799
## Num.Of.Products2 0.1222091 0.1726168

#compute confidence intervals for some or all the parameters.
exp(confint(model1,level=.9))

## Waiting for profiling to be done...

##                        5 %      95 %
## (Intercept)      1.3512484 1.5794323
## Num.Of.Products2 0.1439397 0.2021056

CI

##                         5 %       95 %
## (Intercept)       0.3008488  0.4568652
## Num.Of.Products2 -1.9364587 -1.5971969

#Anwer:

The confidence interval of the estimated parameter or coefficient (Num.Of.Products2) is [-1.9364587 , -1.5971969]

(b) 3 pts - Is model1 significant overall at the 0.01 significance level?

## Test for overall regression
gstat = model1$null.deviance - deviance(model1)# same as this-> model.agg$deviance 
cbind(gstat, 1-pchisq(gstat,length(coef(model1))-1))# model1$df.null - model1$df.residual works too instead of lenght

##        gstat  
## [1,] 348.996 0

#Answer

The p-value is essentially zero which indicates that model1 is statistically significant overall

(c) 3 pts - Which regression coefficients are significantly nonzero at the 0.01 significance level? Which are significantly negative? Why?

#Answer:

We can see the estimates and p values of the model and see that the coefficients associated with the intercept and Num.Of.Products are significantly non-zero with p-values essentially zero: Below is a quick summary from the model summary:

(Intercept) = 0.37886, pr = 1.37e-15 ***

Num.Of.Products2 = -1.76683, pr < 2e-16 ***

The coefficient for Num.of.Products2 is negatively significant at alpha of 0.01
The intercept is positively significant also at alpha of 0.01
The reason is both coefficients have P-value levels essentially zero

Question 3: Goodness of fit - 10 pts

(a) 3.5 pts - Perform goodness-of-fit hypothesis tests using both Deviance and Pearson residuals. What do you conclude? Explain the differences, if any, between these findings and what you found in Question 2b.

#GOF Hypothesis Test: using deviance residuals & using Pearson Residuals
# Deviance Test for GOF using deviance residuals
c(deviance(model1), 1-pchisq(deviance(model1),156))# using lower tail way=>pchisq(deviance(model1),156, lower.tail=F)

## [1] 632.04   0.00

# same as above, you can calc deviance like you would for Pearsons
# calc_dev= residuals(model1, type='deviance')
# dev.tvalue=sum(calc_dev^2)
# c(dev.tvalue, 1-pchisq(dev.tvalue,156))

## GOF test using Pearson residuals-same as above
pearres2 = residuals(model1,type="pearson")
pearson.tvalue2 = sum(pearres2^2)
c(pearson.tvalue2, 1-pchisq(pearson.tvalue2,156))

## [1] 562.1763   0.0000

#Answer:

-Both deviance way and Pearson’s way led to the same conclusion, the p-value is essentially zero which means the model is NOT a good fit as we reject the Null hypothesis (model is good fit) and conclude that the model is NOT a good fit

This is not contrary to question 2b which suggested the model was significant overall. Remember, it’s is still possible for a model to have explanatory/predictive power while still being a poor fit

(b) 3.5 pts - Evaluate whether the deviance residuals are normally distributed by producing a QQ plot and histogram of the deviance residuals. What assessments can you make about the goodness of fit of the model1 based on these plots?

## Residual Analysis

res = resid(model1,type="deviance")
par(mfrow=c(1,2))

#qq plot of resid=>you can do it 2 ways:  using pearrons' residuals or direrctly thru "deviance"
qqnorm(res, ylab="Std residuals")
qqline(res,col="red",lwd=2)
# qqnorm(pearres2,ylab="test"), pearres2 from previous question from top
# qqline(res2,col="blue",lwd=2)

#hhistogram of resid
hist(res,breaks=20,xlab="Std residuals", main="Histogram of Residuals",col="magenta",)

#Answer

QQ plot suggest we have quite normal profile in the inner part of the data but with deviations from normality on both ends
The histogram showed quite a normal fit

The Q-Q plot is slightly tailed but the Histogram looks relatively normal and both looked reasonable in the sense that they do not point to a departure from normality

(c) 3 pts - Calculate the estimated dispersion parameter for this model. Is this an overdispersed model?

D <- sum(residuals(model1, type='deviance')^2)
phi <- D/(nrow(data)-(length(coef(model1))-1)-1)
phi

## [1] 4.051539

#Answer:

Based on the Over dispersion parameter which is calculated to be greater than 2, this model is over dispersed

Question 4: Fitting the full model- 23 pts

Fit a logistic regression model using Staying as the response variable with Age.Group, Gender, Tenure, Num.Of.Products, and Is.Active.Member as the predictors and logit as the link function. Ensure to include the weights parameter for specifying the number of trials. Call it model2. Note that Age.Group, Gender, Num.Of.Products, and Is.Active.Member should be treated as categorical variables.

# Data before aggregation-> To customize your own categorical factors
data.f = within(data, {
Age = factor(Age.Group, labels=c("Twenties","Thirties","Forties","Fifties"))
gender = factor(Gender, labels=c("Female", "Male"))
is.active.member = factor(Is.Active.Member, labels=c("Non_Active_Member", "Active_Member"))})

model2 <- glm(Staying~Age+gender+Tenure+Num.Of.Products+is.active.member, weights=Employees, family=binomial(link = "logit"),data=data.f)
summary(model2)

## 
## Call:
## glm(formula = Staying ~ Age + gender + Tenure + Num.Of.Products + 
##     is.active.member, family = binomial(link = "logit"), data = data.f, 
##     weights = Employees)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.10929  -0.76949  -0.07324   0.74079   3.06551  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -0.109572   0.282617  -0.388    0.698    
## AgeThirties                    0.384480   0.267984   1.435    0.151    
## AgeForties                     1.734115   0.270384   6.414 1.42e-10 ***
## AgeFifties                     2.955578   0.337727   8.751  < 2e-16 ***
## genderMale                    -0.572069   0.093776  -6.100 1.06e-09 ***
## Tenure                        -0.003319   0.016569  -0.200    0.841    
## Num.Of.Products2              -1.410946   0.112000 -12.598  < 2e-16 ***
## is.active.memberActive_Member -0.850280   0.095829  -8.873  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 981.04  on 157  degrees of freedom
## Residual deviance: 162.35  on 150  degrees of freedom
## AIC: 599.07
## 
## Number of Fisher Scoring iterations: 4

#or using original factors
model2.originalfactor <- glm(Staying~Age.Group+Gender+Tenure+Num.Of.Products+Is.Active.Member, weights=Employees, family=binomial(link = "logit"),data=data)
summary(model2.originalfactor)

## 
## Call:
## glm(formula = Staying ~ Age.Group + Gender + Tenure + Num.Of.Products + 
##     Is.Active.Member, family = binomial(link = "logit"), data = data, 
##     weights = Employees)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.10929  -0.76949  -0.07324   0.74079   3.06551  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -0.109572   0.282617  -0.388    0.698    
## Age.Group3         0.384480   0.267984   1.435    0.151    
## Age.Group4         1.734115   0.270384   6.414 1.42e-10 ***
## Age.Group5         2.955578   0.337727   8.751  < 2e-16 ***
## Gender1           -0.572069   0.093776  -6.100 1.06e-09 ***
## Tenure            -0.003319   0.016569  -0.200    0.841    
## Num.Of.Products2  -1.410946   0.112000 -12.598  < 2e-16 ***
## Is.Active.Member1 -0.850280   0.095829  -8.873  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 981.04  on 157  degrees of freedom
## Residual deviance: 162.35  on 150  degrees of freedom
## AIC: 599.07
## 
## Number of Fisher Scoring iterations: 4

#Answer

There were 2 options to do this: model2 which used the original factors given by the problem or customize to more meaningful categories
As the summary results of both models showed the results are the exactly same, either way would’ve been fine for this problem

(a) 3 pts - Write down the equation for the probability of staying.

#Get the coeeficients of the model2; using the customized version only
coef(model2)

##                   (Intercept)                   AgeThirties 
##                   -0.10957208                    0.38447954 
##                    AgeForties                    AgeFifties 
##                    1.73411499                    2.95557843 
##                    genderMale                        Tenure 
##                   -0.57206916                   -0.00331918 
##              Num.Of.Products2 is.active.memberActive_Member 
##                   -1.41094581                   -0.85027969

#Answer:

𝑝(𝑥) = exp(-0.10957208+0.38447954x1+1.73411499x2+2.95557843x3-0.57206916x4-0.00331918x5-1.41094581x6-0.85027969x7)/ [1 + exp(-0.10957208+0.38447954x1+1.73411499x2+2.95557843x3-0.57206916x4-0.00331918x5-1.41094581x6-0.85027969x7)]

(b) 3 pts - Provide a meaningful interpretation for the estimated coefficients of Tenure and Is.Active.Member1 with respect to the odds of staying.

#Answer:

Holding all else constant, a 1-unit increase in Tenure, we expect to see exp=(-0.003319)=0.996 or 0.33% decrease in odds of staying
Holding all else constant, the odds of staying for active member is exp(-0.850279)=0.4273 or 57.27% lower than non-active members

(c) 3 pts - Is Is.Active.Member1 statistically significant given the other variables in model2 at the 0.01 significance level?

#Answer:

Yes, the p-value of Is.Active.Member1 (Active Member) is essentially zero (pr< 2e-16), given the other variables in the model

(d) 10 pts - Has your goodness of fit been affected? Follow the instructions to repeat the tests, plots, and dispersion parameter calculation you performed in Question 3 with model2.

(d-1) Perform goodness-of-fit hypothesis tests using both Deviance and Pearson residuals. What do you conclude?

#GOF Hypothesis Test: using deviance residuals & using Pearson Residuals
# Deviance Test for GOF using deviance residuals
c(deviance(model2), 1-pchisq(deviance(model2),150))# using lower tail way=>pchisq(deviance(model2),150, lower.tail=F)

## [1] 162.3494910   0.2319118

# same as above, you can calc deviance like you would for Pearsons
# calc_devr= residuals(model2, type='deviance')
# dev.tvaluer=sum(calc_devr^2)
# c(dev.tvaluer, 1-pchisq(dev.tvaluer,150))

## GOF test using Pearson residuals-same as above
pearres2.2 = residuals(model2,type="pearson")
pearson.tvalue2.2 = sum(pearres2.2^2)
c(pearson.tvalue2.2, 1-pchisq(pearson.tvalue2.2,150))

## [1] 154.1554225   0.3912174

#Answer:

Both ways lead to same conclusion in that p-value is HIGH now and as such it’s now a good fit

(d-2) Evaluate the linearity assumption of model2 by plotting the log-odds of Staying vs. Tenure. What do you conclude?

# Is it a linear fit?
# plot(Tenure,log((Staying)/(1-Staying)), ylab="Logit of Staying", main="Scatterplot of logit 
# staying rate vs Tenureshipt", col=c("red","blue"), lwd=3)

g1 <- ggplot(data, aes(Tenure, log((Staying)/(1-Staying))) ) +
  geom_point(color="green") +
  stat_smooth()+theme_economist()+ ggtitle("Tenure vs. logit staying rate")
g1

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## Warning: Removed 10 rows containing non-finite values (stat_smooth).

#Answer:

There does not appear to be strong linear relationship between log-odds of Staying vs. Tenure. The trend is almost “flat-line” (log-odds of Staying) even as Tenurship increased

(d-3) Evaluate whether the deviance residuals are normally distributed by producing a QQ plot and histogram of the deviance residuals. What do you conclude?

## Residual Analysis

res2 = resid(model2,type="deviance")
par(mfrow=c(1,2))

#qq plot of resid=>you can do it 2 ways:  using pearrons' residuals or direrctly thru "deviance"
qqnorm(res2, ylab="Std residuals")
qqline(res2,col="blue",lwd=2)
# qqnorm(pearres2.2,ylab="test"), same as before
# qqline(res2,col="blue",lwd=2)

#hhistogram of resid
hist(res,breaks=20,xlab="Std residuals", main="Histogram of Residuals",col="yellow",)

#Answer:

The QQ plot improved a bit especially at the ends, it’s not as tailed as previously shown. So that was a bit of improvement
The Histogram plot looked very similar. It was quite good before and this is about the same. So from a normality perspective, I say both do not show signs of departure from normality

(d-4) Calculate the estimated dispersion parameter for this model. Is this an overdispersed model?

D2 <- sum(residuals(model2, type='deviance')^2)
phi2 <- D2/(nrow(data)-(length(coef(model2))-1)-1)
phi2

## [1] 1.08233

#Answer

The Dispersion parameter is now less than 2 and approximately 1 which means we do NOT have an Overdispersed model

Moerover, A dispersion parameter close to 1 indicates the variability of the response is close to the variability estimated by model

(e) 4 pts - Overall, would you say model2 is a good-fitting model? If so, why? If not, what would you suggest to improve the fit and why? Note: We are not asking you to spend hours finding the best possible model but to offer plausible suggestions along with your reasoning.

#Answer:

From the plots and test results from previous questions, the goodness of fit (GOF) had improved. Overall, model2 appears to be a better fitting model.
One possibility to improve the GOF, is to change the link function. although not shown here could possibly reduce the deviance
Some other ideas include transforming the data, removing influential points from data; especially the 2 QQ plots showed that there were some outliers lurking in the ends of the data set.

Question 5: Prediction - 9 pts

Suppose there is an employee with the following characteristics:

Age.Group: 2
Gender: 0
Tenure: 2
Num.Of.Products: 2
Is.Active.Member: 1

(a) 3 pts - Predict their probability of staying using model1.

Age.Group<-'2'
Gender<-'0'
Tenure<-2
Num.Of.Products<-'2'
Is.Active.Member<-'1'
test <- data.frame(Age.Group, Gender,Tenure,Num.Of.Products,Is.Active.Member)
## Build a prediction for model1 with the test data
test.pred1 <- predict(model1, test, type="response")
test.pred1

##         1 
## 0.1997319

(b) 3 pts - Predict their probability of staying using model2.

#perdict using model2
Age<-'Twenties'
gender<-'Female'
Tenure<-2
Num.Of.Products<-'2'
is.active.member<-'Active_Member'
test2 <- data.frame(Age, gender,Tenure,Num.Of.Products,is.active.member)

test.pred2 <- predict(model2, test2, type="response")
test.pred2

##          1 
## 0.08490958

(c) 3 pts - Comment on how your predictions compare.

#Answer:

The Prediction from model 1 is about 20% (rounding up) of staying while model 2 predicts a lower probability of staying at 8%.
At high level, since model2 had a better GOF, I prefer to use model2’s prediction level as that seems to be more reliable
However, a real robust Cross validation process should be performed by splitting the data into training and testing sets. This more robust way allows us to calculate accuracy metrics and evaluate which model is truly more accurate

HW3 Peer Assessment SUMMER 2022

Sat Jul 9 10:07:48 2022 EDT

Background

Data Description

Setup

Question 1: Fitting a Model - 9 pts

Question 2: Inference - 9 pts

Question 3: Goodness of fit - 10 pts

Question 4: Fitting the full model- 23 pts

Question 5: Prediction - 9 pts