Background

The owner of a company would like to be able to predict whether employees will stay with the company or leave.

Data Description

The data contains information about various characteristics of employees. Please note that the dataset has been updated to account for repetitions, which is needed for Goodness of Fit Assessment. See below for the description of these characteristics.

  1. Age.Group: 1-9 (1 corresponds to teen, 2 corresponds to twenties, etc.)
  2. Gender: 1 if male, 0 if female
  3. Tenure: Number of years with the company
  4. Num.Of.Products: Number of products owned
  5. Is.Active.Member: 1 if active member, 0 if inactive member
  6. Staying: Fraction of employees that stayed with the company for a given set of predicting variables.

Setup

You can import the data and set up the problem with the following R code:

# Import the data
data = read.csv("C:/Users/abhij/Downloads/hw3_data.csv", header=TRUE, fileEncoding="UTF-8-BOM")

# Create variable Staying
data$Staying = data$Stay/data$Employees

# Set variables as categoricals
data$Num.Of.Products<-as.factor(data$Num.Of.Products)
data$Age.Group<-as.factor(data$Age.Group)
data$Gender<-as.factor(data$Gender)
data$Is.Active.Member<-as.factor(data$Is.Active.Member)

# Print head of data
head(data)

Question 1: Fitting a Model - 9 pts

Fit a logistic regression model using Staying as the response variable with Num.Of.Products as the predictor and logit as the link function. Ensure to include the weights parameter for specifying the number of trials. Call it model1. Note that Num.Of.Products should be treated as a categorical variable.

(a) 3 pts - Display the summary of model1. What are the model parameters and estimates?

model1 = glm(Staying ~ Num.Of.Products, weights=Employees, family=binomial(link = "logit"), data=data)

summary(model1)
## 
## Call:
## glm(formula = Staying ~ Num.Of.Products, family = binomial(link = "logit"), 
##     data = data, weights = Employees)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.2827  -1.4676  -0.1022   1.4490   4.7231  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       0.37886    0.04743   7.988 1.37e-15 ***
## Num.Of.Products2 -1.76683    0.10313 -17.132  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 981.04  on 157  degrees of freedom
## Residual deviance: 632.04  on 156  degrees of freedom
## AIC: 1056.8
## 
## Number of Fisher Scoring iterations: 4

The model parameters are as follows -

Intercept = 0.37886

Num of Products2 = -1.76683

In this case, \(\beta0\)is 0.37886, which is the intercept and \(\beta1\) is -1.76883.

The model is the same as the estimate.

(b) 3 pts - Write down the equation for the Odds of Staying.

\[ \frac{PStaying}{(1-PStaying)} = 𝗲^(ß0+ß1*x)=𝗲^(0.37886-(1.76683*Num.Of.Products2)) \]

(c) 3 pts - Provide a meaningful interpretation for the estimated coefficient for Num.Of.Products2 with respect to the log-odds of staying and the odds of staying.

Question 2: Inference - 9 pts

(a) 3 pts - Using model1, find a 90% confidence interval for the coefficient for Num.Of.Products2.

Conf_Int <- confint(model1, level = 0.90)
## Waiting for profiling to be done...
Conf_Int
##                         5 %       95 %
## (Intercept)       0.3010289  0.4570655
## Num.Of.Products2 -1.9383610 -1.5989652

The Confidence intervals for the model1 which is the confidence interval for \(\beta1\) (Num_Of_Products2) is (-1.9383610, -1.5989652)

(b) 3 pts - Is model1 significant overall at the 0.01 significance level?

model1$null.deviance - deviance(model1)
## [1] 348.996
model1$df.null-model1$df.residual
## [1] 1
round(c(model1$null.deviance - deviance(model1), 1-pchisq(deviance(model1),10)),3)
## [1] 348.996   0.000

Here we can see that by Deviance residuals – the P-value is 0.000, so at a 0.01 significance level, the model1 is statistically significant.

(c) 3 pts - Which regression coefficients are significantly nonzero at the 0.01 significance level? Which are significantly negative? Why?

\(\beta0\) value is $ = 0.37886$ and the P-Value is \(= 1.37e-15 ***\)

\(\beta1\) value is $ = -1.76683$ and the P-Value is \(= < 2e-16 ***\)

Here we can see \(\beta0\) is statistically positively significant and \(\beta1\) is negatively significant at \(\alpha\) value of 0.01 as the value is as good as zero.

Question 3: Goodness of fit - 10 pts

(a) 3.5 pts - Perform goodness-of-fit hypothesis tests using both Deviance and Pearson residuals. What do you conclude? Explain the differences, if any, between these findings and what you found in Question 2b.

## Deviance Test for GOF using deviance residuals
c(deviance(model1), 1-pchisq(deviance(model1),156))
## [1] 632.04   0.00
## GOF test using Pearson residuals
pearres2 = residuals(model1,type="pearson")
pearson.tvalue = sum(pearres2^2)
c(pearson.tvalue, 1-pchisq(pearson.tvalue,156))
## [1] 562.1763   0.0000

Test for goodness-of-fit:

Here we can see that the model is NOT a good fit, versus in Q2b we saw that the model is significant. This is because the model has predictive power even though it is a poor fit overall.

(b) 3.5 pts - Evaluate whether the deviance residuals are normally distributed by producing a QQ plot and histogram of the deviance residuals. What assessments can you make about the goodness of fit of the model1 based on these plots?

res = resid(model1, type="deviance")

boxplot(res~Num.Of.Products,
    xlab="Number of Products", 
    ylab="Std residuals",
    data=data)

qqnorm(res, ylab="Std residuals")
qqline(res, col="blue", lwd=2)

hist(res, 10, xlab="Std residuals", main="")

hist(res, 30, xlab="Std residuals", main="")

As of the plots, we can see that the Histogram and QQ Plot both are pointing that the spread is a normal fit. However, the QQ plot does show that the tails are heavier so there could be some points that are either outliers or need to be evaluated.

(c) 3 pts - Calculate the estimated dispersion parameter for this model. Is this an overdispersed model?

As per the lessons, the overdispersion parameter is given by

\[ \frac{D}{(n-p-1)} \]
where D is the sum of squares of the deviances

D <- sum(residuals(model1,type='deviance')^2)
n <- nrow(data)
p <- length(coef(model1))-1
dispersion_parameter <- D/(n-p-1)
dispersion_parameter
## [1] 4.051539

As per the general rule mentioned in the lesson, if \(𝜙>2\) then overdispersed model - so here since dispersion_parameter is greater than 2, i.e. 4.051539, this model is considered to be overdispersed

Question 4: Fitting the full model- 23 pts

Fit a logistic regression model using Staying as the response variable with Age.Group, Gender, Tenure, Num.Of.Products, and Is.Active.Member as the predictors and logit as the link function. Ensure to include the weights parameter for specifying the number of trials. Call it model2. Note that Age.Group, Gender, Num.Of.Products, and Is.Active.Member should be treated as categorical variables.

# to convert the columns, Age.Group, Gender, Num.Of.Products and Is.Active.Member as categorical variables

attach(data)

#Age.Group <- as.factor(Age.Group)

str(data)
## 'data.frame':    158 obs. of  8 variables:
##  $ Age.Group       : Factor w/ 4 levels "2","3","4","5": 1 1 1 1 1 1 1 2 2 2 ...
##  $ Gender          : Factor w/ 2 levels "0","1": 2 2 2 1 2 1 2 1 2 2 ...
##  $ Tenure          : int  3 4 4 7 7 4 8 0 0 0 ...
##  $ Num.Of.Products : Factor w/ 2 levels "1","2": 1 1 1 1 1 2 2 1 1 1 ...
##  $ Is.Active.Member: Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 2 ...
##  $ Stay            : int  5 5 2 3 2 4 1 7 5 3 ...
##  $ Employees       : int  11 10 13 10 14 12 12 14 13 13 ...
##  $ Staying         : num  0.455 0.5 0.154 0.3 0.143 ...
levels(Age.Group)
## [1] "2" "3" "4" "5"
Age.Group = factor(Age.Group, labels = c("Twenties", "Thirties", "Forties", "Fifties"))
levels(Age.Group)
## [1] "Twenties" "Thirties" "Forties"  "Fifties"
levels(Gender)
## [1] "0" "1"
Gender = factor(Gender, labels = c("Female","Male"))
levels(Gender)
## [1] "Female" "Male"
levels(Num.Of.Products)
## [1] "1" "2"
levels(Is.Active.Member)
## [1] "0" "1"
Is.Active.Member <- factor(Is.Active.Member, labels = c("Inactive_Member","Active_Member"))
levels(Is.Active.Member)
## [1] "Inactive_Member" "Active_Member"
head(data)
## Creating the full model with the fields above

model2 <- glm(Staying~Age.Group+Gender+Tenure+Num.Of.Products+Is.Active.Member, family = binomial(link ="logit"), weights = Employees)

summary(model2)
## 
## Call:
## glm(formula = Staying ~ Age.Group + Gender + Tenure + Num.Of.Products + 
##     Is.Active.Member, family = binomial(link = "logit"), weights = Employees)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.10929  -0.76949  -0.07324   0.74079   3.06551  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -0.109572   0.282617  -0.388    0.698    
## Age.GroupThirties              0.384480   0.267984   1.435    0.151    
## Age.GroupForties               1.734115   0.270384   6.414 1.42e-10 ***
## Age.GroupFifties               2.955578   0.337727   8.751  < 2e-16 ***
## GenderMale                    -0.572069   0.093776  -6.100 1.06e-09 ***
## Tenure                        -0.003319   0.016569  -0.200    0.841    
## Num.Of.Products2              -1.410946   0.112000 -12.598  < 2e-16 ***
## Is.Active.MemberActive_Member -0.850280   0.095829  -8.873  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 981.04  on 157  degrees of freedom
## Residual deviance: 162.35  on 150  degrees of freedom
## AIC: 599.07
## 
## Number of Fisher Scoring iterations: 4

(a) 3 pts - Write down the equation for the probability of staying.

coef(model2)
##                   (Intercept)             Age.GroupThirties 
##                   -0.10957208                    0.38447954 
##              Age.GroupForties              Age.GroupFifties 
##                    1.73411499                    2.95557843 
##                    GenderMale                        Tenure 
##                   -0.57206916                   -0.00331918 
##              Num.Of.Products2 Is.Active.MemberActive_Member 
##                   -1.41094581                   -0.85027969

Here the equation is as below

\(p(𝑋_1,⋯,𝑋_𝑝 )=e^(𝛽_0+𝛽_1 𝑋_1+⋯+𝛽_𝑝 𝑋_𝑝 )/(1+e^(𝛽_0+𝛽_1 𝑋_1+⋯+𝛽_𝑝 𝑋_𝑝 ) )\)

p(x) = {e^(-0.10957208 + 0.384480X1 +1.734115 X2 + 2.955578 * X3 - 0.572069 * X4 - 0.003319 * X5 - 1.410946 * X6 - 0.850280 * X7 )}/{1 + e^(-0.10957208 + 0.384480X1 +1.734115 X2 + 2.955578 * X3 - 0.572069 * X4 - 0.003319 * X5 - 1.410946 * X6 - 0.850280 * X7 )}

$$

$$

(b) 3 pts - Provide a meaningful interpretation for the estimated coefficients of Tenure and Is.Active.Member1 with respect to the odds of staying.

(c) 3 pts - Is Is.Active.Member1 statistically significant given the other variables in model2 at the 0.01 significance level?

(d) 10 pts - Has your goodness of fit been affected? Follow the instructions to repeat the tests, plots, and dispersion parameter calculation you performed in Question 3 with model2.

(d-1) Perform goodness-of-fit hypothesis tests using both Deviance and Pearson residuals. What do you conclude?

## Deviance Test for GOF using deviance residuals
c(deviance(model2), 1-pchisq(deviance(model2),150))
## [1] 162.3494910   0.2319118
## GOF test using Pearson residuals
pearres3 = residuals(model2,type="pearson")
pearson.tvalue_1 = sum(pearres3^2)
c(pearson.tvalue_1, 1-pchisq(pearson.tvalue_1,150))
## [1] 154.1554225   0.3912174

Test for goodness-of-fit:

Here we can see that the model IS a good fit, versus in Q3 we saw that the model was NOT a significant fit.

(d-2) Evaluate the linearity assumption of model2 by plotting the log-odds of Staying vs. Tenure. What do you conclude?

plot(Tenure,log((Staying)/(1-Staying)), ylab="Logit Staying Rate", main="Logit Staying Rate vs Tenure", col=c("green","red"), lwd=2)
abline(a=0, b=0, col = "red")

Here we can see that there is no relationship of Logit Staying Rate and Tenure. There does not seem to be any trend that can be found here.

(d-3) Evaluate whether the deviance residuals are normally distributed by producing a QQ plot and histogram of the deviance residuals. What do you conclude?

res2 = resid(model2, type="deviance")

boxplot(res2~Num.Of.Products,
    xlab="Number of Products", 
    ylab="Std residuals",
    data=data)

boxplot(res2~Age.Group,
    xlab="Age Group", 
    ylab="Std residuals",
    data=data)

boxplot(res2~Gender,
    xlab="Gender", 
    ylab="Std residuals",
    data=data)

boxplot(res2~Tenure,
    xlab="Tenure", 
    ylab="Std residuals",
    data=data)

boxplot(res2~Is.Active.Member,
    xlab="Is Active Member", 
    ylab="Std residuals",
    data=data)

qqnorm(res2, ylab="Std residuals")
qqline(res2, col="blue", lwd=2)

hist(res2, 10, xlab="Std residuals", main="")

hist(res2, 30, xlab="Std residuals", main="")

– We see that the Normal QQ Plot tail is better than before

– Also Histogram is showing improvement too

So we can say that the model is NORMAL

(d-4) Calculate the estimated dispersion parameter for this model. Is this an overdispersed model?

As per the lessons, the overdispersion parameter is given by

\[ \frac{D}{(n-p-1)} \]
where D is the sum of squares of the deviances

D2 <- sum(residuals(model2,type='deviance')^2)
n2 <- nrow(data)
p2 <- length(coef(model2))-1
dispersion_parameter2 <- D2/(n2-p2-1)
dispersion_parameter2
## [1] 1.08233

As per the general rule mentioned in the lesson, if \(𝜙>2\) then overdispersed model - so here since dispersion_parameter is LESS than 2, i.e. 1.08233, this model is considered to NOT BE overdispersed

(e) 4 pts - Overall, would you say model2 is a good-fitting model? If so, why? If not, what would you suggest to improve the fit and why? Note: We are not asking you to spend hours finding the best possible model but to offer plausible suggestions along with your reasoning.

with(model1, cbind(res.deviance = deviance, df = df.residual,
               p = pchisq(deviance, df.residual, lower.tail=FALSE)))
##      res.deviance  df          p
## [1,]       632.04 156 1.5526e-58
with(model2, cbind(res2.deviance = deviance(model2), df = df.residual(model2),
               p = pchisq(deviance(model2), df.residual(model2), lower.tail=FALSE)))
##      res2.deviance  df         p
## [1,]      162.3495 150 0.2319118

As seen from the Boxplots, QQ Plot and Histogram, we can see that the overall normality has been improved.

The GOF tests also shows that we have a large p-value (which is what we want)

So the model2 is mostly better than model1

Question 5: Prediction - 9 pts

Suppose there is an employee with the following characteristics:

  1. Age.Group: 2

  2. Gender: 0

  3. Tenure: 2

  4. Num.Of.Products: 2

  5. Is.Active.Member: 1

(a) 3 pts - Predict their probability of staying using model1.

Age.Group <- '2'
Gender <- '0'
Tenure <- 2
Num.Of.Products <- '2'
Is.Active.Member <- '1'

new_employee <- data.frame(Age.Group,Gender,Tenure,Num.Of.Products,Is.Active.Member)

prediction_test <- predict(model1,new_employee,type="response")
prediction_test
##         1 
## 0.1997319

Staying probability will be 19.97319%

(b) 3 pts - Predict their probability of staying using model2.

Age.Group <- "Twenties"
Gender <- "Female"
Tenure <- 2
Num.Of.Products <- "2"
Is.Active.Member <- "Active_Member"

new_employee2 <- data.frame(Age.Group, Gender, Tenure, Num.Of.Products, Is.Active.Member)

prediction_test2 <- predict(model2,new_employee2,type="response")


prediction_test2
##          1 
## 0.08490958

Staying probability will be 8.490958 %

(c) 3 pts - Comment on how your predictions compare.i.e. which model is more reliable based on the analysis?

As discussed above, model2 is fit better than model1. So for this new employee I would prefer to keep 8.490958 % as the probability of the employee staying.