The owner of a company would like to be able to predict whether employees will stay with the company or leave.
The data contains information about various characteristics of employees. Please note that the dataset has been updated to account for repetitions, which is needed for Goodness of Fit Assessment. See below for the description of these characteristics.
You can import the data and set up the problem with the following R code:
# Import the data
data = read.csv("C:/Users/abhij/Downloads/hw3_data.csv", header=TRUE, fileEncoding="UTF-8-BOM")
# Create variable Staying
data$Staying = data$Stay/data$Employees
# Set variables as categoricals
data$Num.Of.Products<-as.factor(data$Num.Of.Products)
data$Age.Group<-as.factor(data$Age.Group)
data$Gender<-as.factor(data$Gender)
data$Is.Active.Member<-as.factor(data$Is.Active.Member)
# Print head of data
head(data)
Fit a logistic regression model using Staying as the response variable with Num.Of.Products as the predictor and logit as the link function. Ensure to include the weights parameter for specifying the number of trials. Call it model1. Note that Num.Of.Products should be treated as a categorical variable.
(a) 3 pts - Display the summary of model1. What are the model parameters and estimates?
model1 = glm(Staying ~ Num.Of.Products, weights=Employees, family=binomial(link = "logit"), data=data)
summary(model1)
##
## Call:
## glm(formula = Staying ~ Num.Of.Products, family = binomial(link = "logit"),
## data = data, weights = Employees)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.2827 -1.4676 -0.1022 1.4490 4.7231
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.37886 0.04743 7.988 1.37e-15 ***
## Num.Of.Products2 -1.76683 0.10313 -17.132 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 981.04 on 157 degrees of freedom
## Residual deviance: 632.04 on 156 degrees of freedom
## AIC: 1056.8
##
## Number of Fisher Scoring iterations: 4
The model parameters are as follows -
Intercept = 0.37886
Num of Products2 = -1.76683
In this case, \(\beta0\)is 0.37886, which is the intercept and \(\beta1\) is -1.76883.
The model is the same as the estimate.
(b) 3 pts - Write down the equation for the Odds of Staying.
\[
\frac{PStaying}{(1-PStaying)} =
𝗲^(ß0+ß1*x)=𝗲^(0.37886-(1.76683*Num.Of.Products2))
\]
(c) 3 pts - Provide a meaningful interpretation for the estimated coefficient for Num.Of.Products2 with respect to the log-odds of staying and the odds of staying.
Here \(\beta1\) is the intercept pointing to the log odds of employees who are going to stay for every number of products that are owned by the employees and thus, the value that we see is the log odds of \(-1.76883\) shows that log odds of staying decreases for every product owned
The log odds can be converted into regular value by taking exponential
So odds of staying will be \(e^(-1.76883) = 0.1708\)
(a) 3 pts - Using model1, find a 90% confidence interval for the coefficient for Num.Of.Products2.
Conf_Int <- confint(model1, level = 0.90)
## Waiting for profiling to be done...
Conf_Int
## 5 % 95 %
## (Intercept) 0.3010289 0.4570655
## Num.Of.Products2 -1.9383610 -1.5989652
The Confidence intervals for the model1 which is the confidence interval for \(\beta1\) (Num_Of_Products2) is (-1.9383610, -1.5989652)
(b) 3 pts - Is model1 significant overall at the 0.01 significance level?
model1$null.deviance - deviance(model1)
## [1] 348.996
model1$df.null-model1$df.residual
## [1] 1
round(c(model1$null.deviance - deviance(model1), 1-pchisq(deviance(model1),10)),3)
## [1] 348.996 0.000
Here we can see that by Deviance residuals – the P-value is 0.000, so at a 0.01 significance level, the model1 is statistically significant.
(c) 3 pts - Which regression coefficients are significantly nonzero at the 0.01 significance level? Which are significantly negative? Why?
\(\beta0\) value is $ = 0.37886$ and the P-Value is \(= 1.37e-15 ***\)
\(\beta1\) value is $ = -1.76683$ and the P-Value is \(= < 2e-16 ***\)
Here we can see \(\beta0\) is statistically positively significant and \(\beta1\) is negatively significant at \(\alpha\) value of 0.01 as the value is as good as zero.
(a) 3.5 pts - Perform goodness-of-fit hypothesis tests using both Deviance and Pearson residuals. What do you conclude? Explain the differences, if any, between these findings and what you found in Question 2b.
## Deviance Test for GOF using deviance residuals
c(deviance(model1), 1-pchisq(deviance(model1),156))
## [1] 632.04 0.00
## GOF test using Pearson residuals
pearres2 = residuals(model1,type="pearson")
pearson.tvalue = sum(pearres2^2)
c(pearson.tvalue, 1-pchisq(pearson.tvalue,156))
## [1] 562.1763 0.0000
Test for goodness-of-fit:
Using deviance residuals: P-value ≈ 0
Reject the null hypothesis of good fit (thus NOT a good fit)
Using Pearson residual: P-value ≈ 0.0000
Reject the null hypothesis of good fit (thus NOT a good fit)
Here we can see that the model is NOT a good fit, versus in Q2b we saw that the model is significant. This is because the model has predictive power even though it is a poor fit overall.
(b) 3.5 pts - Evaluate whether the deviance residuals are normally distributed by producing a QQ plot and histogram of the deviance residuals. What assessments can you make about the goodness of fit of the model1 based on these plots?
res = resid(model1, type="deviance")
boxplot(res~Num.Of.Products,
xlab="Number of Products",
ylab="Std residuals",
data=data)
qqnorm(res, ylab="Std residuals")
qqline(res, col="blue", lwd=2)
hist(res, 10, xlab="Std residuals", main="")
hist(res, 30, xlab="Std residuals", main="")
As of the plots, we can see that the Histogram and QQ Plot both are pointing that the spread is a normal fit. However, the QQ plot does show that the tails are heavier so there could be some points that are either outliers or need to be evaluated.
(c) 3 pts - Calculate the estimated dispersion parameter for this model. Is this an overdispersed model?
As per the lessons, the overdispersion parameter is given by
\[
\frac{D}{(n-p-1)}
\]
where D is the sum of squares of the deviances
D <- sum(residuals(model1,type='deviance')^2)
n <- nrow(data)
p <- length(coef(model1))-1
dispersion_parameter <- D/(n-p-1)
dispersion_parameter
## [1] 4.051539
As per the general rule mentioned in the lesson, if \(𝜙>2\) then overdispersed model - so here since dispersion_parameter is greater than 2, i.e. 4.051539, this model is considered to be overdispersed
Fit a logistic regression model using Staying as the response variable with Age.Group, Gender, Tenure, Num.Of.Products, and Is.Active.Member as the predictors and logit as the link function. Ensure to include the weights parameter for specifying the number of trials. Call it model2. Note that Age.Group, Gender, Num.Of.Products, and Is.Active.Member should be treated as categorical variables.
# to convert the columns, Age.Group, Gender, Num.Of.Products and Is.Active.Member as categorical variables
attach(data)
#Age.Group <- as.factor(Age.Group)
str(data)
## 'data.frame': 158 obs. of 8 variables:
## $ Age.Group : Factor w/ 4 levels "2","3","4","5": 1 1 1 1 1 1 1 2 2 2 ...
## $ Gender : Factor w/ 2 levels "0","1": 2 2 2 1 2 1 2 1 2 2 ...
## $ Tenure : int 3 4 4 7 7 4 8 0 0 0 ...
## $ Num.Of.Products : Factor w/ 2 levels "1","2": 1 1 1 1 1 2 2 1 1 1 ...
## $ Is.Active.Member: Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 2 ...
## $ Stay : int 5 5 2 3 2 4 1 7 5 3 ...
## $ Employees : int 11 10 13 10 14 12 12 14 13 13 ...
## $ Staying : num 0.455 0.5 0.154 0.3 0.143 ...
levels(Age.Group)
## [1] "2" "3" "4" "5"
Age.Group = factor(Age.Group, labels = c("Twenties", "Thirties", "Forties", "Fifties"))
levels(Age.Group)
## [1] "Twenties" "Thirties" "Forties" "Fifties"
levels(Gender)
## [1] "0" "1"
Gender = factor(Gender, labels = c("Female","Male"))
levels(Gender)
## [1] "Female" "Male"
levels(Num.Of.Products)
## [1] "1" "2"
levels(Is.Active.Member)
## [1] "0" "1"
Is.Active.Member <- factor(Is.Active.Member, labels = c("Inactive_Member","Active_Member"))
levels(Is.Active.Member)
## [1] "Inactive_Member" "Active_Member"
head(data)
## Creating the full model with the fields above
model2 <- glm(Staying~Age.Group+Gender+Tenure+Num.Of.Products+Is.Active.Member, family = binomial(link ="logit"), weights = Employees)
summary(model2)
##
## Call:
## glm(formula = Staying ~ Age.Group + Gender + Tenure + Num.Of.Products +
## Is.Active.Member, family = binomial(link = "logit"), weights = Employees)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.10929 -0.76949 -0.07324 0.74079 3.06551
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.109572 0.282617 -0.388 0.698
## Age.GroupThirties 0.384480 0.267984 1.435 0.151
## Age.GroupForties 1.734115 0.270384 6.414 1.42e-10 ***
## Age.GroupFifties 2.955578 0.337727 8.751 < 2e-16 ***
## GenderMale -0.572069 0.093776 -6.100 1.06e-09 ***
## Tenure -0.003319 0.016569 -0.200 0.841
## Num.Of.Products2 -1.410946 0.112000 -12.598 < 2e-16 ***
## Is.Active.MemberActive_Member -0.850280 0.095829 -8.873 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 981.04 on 157 degrees of freedom
## Residual deviance: 162.35 on 150 degrees of freedom
## AIC: 599.07
##
## Number of Fisher Scoring iterations: 4
(a) 3 pts - Write down the equation for the probability of staying.
coef(model2)
## (Intercept) Age.GroupThirties
## -0.10957208 0.38447954
## Age.GroupForties Age.GroupFifties
## 1.73411499 2.95557843
## GenderMale Tenure
## -0.57206916 -0.00331918
## Num.Of.Products2 Is.Active.MemberActive_Member
## -1.41094581 -0.85027969
Here the equation is as below
\(p(𝑋_1,⋯,𝑋_𝑝 )=e^(𝛽_0+𝛽_1 𝑋_1+⋯+𝛽_𝑝 𝑋_𝑝 )/(1+e^(𝛽_0+𝛽_1 𝑋_1+⋯+𝛽_𝑝 𝑋_𝑝 ) )\)
p(x) = {e^(-0.10957208 + 0.384480X1 +1.734115 X2 + 2.955578 * X3 - 0.572069 * X4 - 0.003319 * X5 - 1.410946 * X6 - 0.850280 * X7 )}/{1 + e^(-0.10957208 + 0.384480X1 +1.734115 X2 + 2.955578 * X3 - 0.572069 * X4 - 0.003319 * X5 - 1.410946 * X6 - 0.850280 * X7 )}
$$
$$
(b) 3 pts - Provide a meaningful interpretation for the estimated coefficients of Tenure and Is.Active.Member1 with respect to the odds of staying.
For Tenure the coefficient is -0.003319. So we can say,holding all the rest of the factors as constant - a 1-unit increase in Tenure will lead to DECREASE of \(e^(-0.003319) = 0.996686502\) i.e. 0.3313498% decrease in Employee Staying
For Is.Active.MemberActive_Member, the coefficient is -0.850280. So we can say, holding all the rest of the factors as constant - a 1-unit increase in an Active Member will lead to DECREASE of \(e^(-0.850280) = 0.427295273\) i.e. 57.2704727% decrease in Employee Staying
(c) 3 pts - Is Is.Active.Member1 statistically significant given the other variables in model2 at the 0.01 significance level?
(d) 10 pts - Has your goodness of fit been affected? Follow the instructions to repeat the tests, plots, and dispersion parameter calculation you performed in Question 3 with model2.
(d-1) Perform goodness-of-fit hypothesis tests using both Deviance and Pearson residuals. What do you conclude?
## Deviance Test for GOF using deviance residuals
c(deviance(model2), 1-pchisq(deviance(model2),150))
## [1] 162.3494910 0.2319118
## GOF test using Pearson residuals
pearres3 = residuals(model2,type="pearson")
pearson.tvalue_1 = sum(pearres3^2)
c(pearson.tvalue_1, 1-pchisq(pearson.tvalue_1,150))
## [1] 154.1554225 0.3912174
Test for goodness-of-fit:
Using deviance residuals: P-value > 0
Accept the null hypothesis of good fit (thus a good fit)
Using Pearson residual: P-value > 0
Accept the null hypothesis of good fit (thus a good fit)
Here we can see that the model IS a good fit, versus in Q3 we saw that the model was NOT a significant fit.
(d-2) Evaluate the linearity assumption of model2 by plotting the log-odds of Staying vs. Tenure. What do you conclude?
plot(Tenure,log((Staying)/(1-Staying)), ylab="Logit Staying Rate", main="Logit Staying Rate vs Tenure", col=c("green","red"), lwd=2)
abline(a=0, b=0, col = "red")
Here we can see that there is no relationship of Logit Staying Rate and Tenure. There does not seem to be any trend that can be found here.
(d-3) Evaluate whether the deviance residuals are normally distributed by producing a QQ plot and histogram of the deviance residuals. What do you conclude?
res2 = resid(model2, type="deviance")
boxplot(res2~Num.Of.Products,
xlab="Number of Products",
ylab="Std residuals",
data=data)
boxplot(res2~Age.Group,
xlab="Age Group",
ylab="Std residuals",
data=data)
boxplot(res2~Gender,
xlab="Gender",
ylab="Std residuals",
data=data)
boxplot(res2~Tenure,
xlab="Tenure",
ylab="Std residuals",
data=data)
boxplot(res2~Is.Active.Member,
xlab="Is Active Member",
ylab="Std residuals",
data=data)
qqnorm(res2, ylab="Std residuals")
qqline(res2, col="blue", lwd=2)
hist(res2, 10, xlab="Std residuals", main="")
hist(res2, 30, xlab="Std residuals", main="")
– We see that the Normal QQ Plot tail is better than before
– Also Histogram is showing improvement too
So we can say that the model is NORMAL
(d-4) Calculate the estimated dispersion parameter for this model. Is this an overdispersed model?
As per the lessons, the overdispersion parameter is given by
\[
\frac{D}{(n-p-1)}
\]
where D is the sum of squares of the deviances
D2 <- sum(residuals(model2,type='deviance')^2)
n2 <- nrow(data)
p2 <- length(coef(model2))-1
dispersion_parameter2 <- D2/(n2-p2-1)
dispersion_parameter2
## [1] 1.08233
As per the general rule mentioned in the lesson, if \(𝜙>2\) then overdispersed model - so here since dispersion_parameter is LESS than 2, i.e. 1.08233, this model is considered to NOT BE overdispersed
(e) 4 pts - Overall, would you say model2 is a good-fitting model? If so, why? If not, what would you suggest to improve the fit and why? Note: We are not asking you to spend hours finding the best possible model but to offer plausible suggestions along with your reasoning.
with(model1, cbind(res.deviance = deviance, df = df.residual,
p = pchisq(deviance, df.residual, lower.tail=FALSE)))
## res.deviance df p
## [1,] 632.04 156 1.5526e-58
with(model2, cbind(res2.deviance = deviance(model2), df = df.residual(model2),
p = pchisq(deviance(model2), df.residual(model2), lower.tail=FALSE)))
## res2.deviance df p
## [1,] 162.3495 150 0.2319118
As seen from the Boxplots, QQ Plot and Histogram, we can see that the overall normality has been improved.
The GOF tests also shows that we have a large p-value (which is what we want)
So the model2 is mostly better than model1
Suppose there is an employee with the following characteristics:
Age.Group: 2
Gender: 0
Tenure: 2
Num.Of.Products: 2
Is.Active.Member: 1
(a) 3 pts - Predict their probability of staying using model1.
Age.Group <- '2'
Gender <- '0'
Tenure <- 2
Num.Of.Products <- '2'
Is.Active.Member <- '1'
new_employee <- data.frame(Age.Group,Gender,Tenure,Num.Of.Products,Is.Active.Member)
prediction_test <- predict(model1,new_employee,type="response")
prediction_test
## 1
## 0.1997319
Staying probability will be 19.97319%
(b) 3 pts - Predict their probability of staying using model2.
Age.Group <- "Twenties"
Gender <- "Female"
Tenure <- 2
Num.Of.Products <- "2"
Is.Active.Member <- "Active_Member"
new_employee2 <- data.frame(Age.Group, Gender, Tenure, Num.Of.Products, Is.Active.Member)
prediction_test2 <- predict(model2,new_employee2,type="response")
prediction_test2
## 1
## 0.08490958
Staying probability will be 8.490958 %
(c) 3 pts - Comment on how your predictions compare.i.e. which model is more reliable based on the analysis?
As discussed above, model2 is fit better than model1. So for this new employee I would prefer to keep 8.490958 % as the probability of the employee staying.