1 Setting up the environment

set.seed(456333)
library(gdata)
library(ggplot2)
library(dplyr)
library(tidyr)
library(corrplot)
library(tidyverse)
library(stringr)
library(modelsummary)
library(lubridate)
library(scales)
library(graphics)
library(caret)


There are 18 variables (columns) and 4410 employees, as described in the Canterra Employee Dataset.

1.1 Background

A large company named Canterra employs, at any given point in time, around 4000 employees. However, around 15% of its employees leave the company every year and need to be replaced from the talent pool available in the job market. The management believes that this level of attrition (employees leaving, either on their own or because they got fired) is bad for the company for the following reasons:

  • The former employees’ projects get delayed, which makes it challenging to meet timelines, resulting in a reputation loss among consumers and partners.

  • A sizeable department has to be maintained to recruit new talent.

  • More often than not, new employees must be trained for the job and/or given time to acclimate themselves to the company.

The management hypothesizes that higher job satisfaction and a higher number of total working years will reduce employee attrition. Additionally, the marketing management was interested to know if demographic variables such as gender, education, and age affect employee attrition. Hence, the administration has contracted you as a consultant to understand whether they should focus on these two factors to curb attrition. In other words, they want to know if changes in their internal and external recruitment strategies would help retain employees.

1.2 Data

# Set a seed for reproducibility
set.seed(123)

library(readxl)
employee  <- read_excel("Employee_Data_Project.xlsx")
head(employee)
## # A tibble: 6 × 18
##     Age Attrition BusinessTravel    DistanceFromHome Education EmployeeID Gender
##   <dbl> <chr>     <chr>                        <dbl>     <dbl>      <dbl> <chr> 
## 1    51 No        Travel_Rarely                    6         2          1 Female
## 2    31 Yes       Travel_Frequently               10         1          2 Female
## 3    32 No        Travel_Frequently               17         4          3 Male  
## 4    38 No        Non-Travel                       2         5          4 Male  
## 5    32 No        Travel_Rarely                   10         1          5 Male  
## 6    46 No        Travel_Rarely                    8         3          6 Female
## # ℹ 11 more variables: JobLevel <dbl>, MaritalStatus <chr>, Income <dbl>,
## #   NumCompaniesWorked <chr>, StandardHours <dbl>, TotalWorkingYears <chr>,
## #   TrainingTimesLastYear <dbl>, YearsAtCompany <dbl>,
## #   YearsWithCurrManager <dbl>, EnvironmentSatisfaction <chr>,
## #   JobSatisfaction <chr>
employee_df <- employee %>% select("Age", "Attrition", "Gender", "JobSatisfaction","Income")

# Function to count 'NA' strings in a column
count_NA <- function(column){
  sum(column == "NA")
}
# Apply the function to each column of our new selected data frame
na_string <- sapply(employee_df, count_NA)
print(na_string)
##             Age       Attrition          Gender JobSatisfaction          Income 
##               0               0               0              20               0
#changing the 'NA' strings with true NA values 
employee_df$JobSatisfaction <- ifelse(employee_df$JobSatisfaction =="NA",NA, employee_df$JobSatisfaction)

#counting the variables 
table(employee_df$JobSatisfaction)
## 
##    1    2    3    4 
##  860  840 1323 1367
#sum of NA values for this particular column
sum(is.na(employee_df$JobSatisfaction))
## [1] 20
# Calculate the mode  of JobSatisfaction excluding NA
mode_job_satisfaction <- employee_df %>% 
                          filter(!is.na(JobSatisfaction)) %>% 
                          count(JobSatisfaction) %>% 
                          arrange(desc(n)) %>% 
                          slice(1) %>% 
                          pull(JobSatisfaction)

# Impute missing values in JobSatisfaction with the mode
employee_df$JobSatisfaction[is.na(employee_df$JobSatisfaction)] <- mode_job_satisfaction

# Now check the count of missing values again to confirm imputation
sum(is.na(employee_df$JobSatisfaction))
## [1] 0
# Transform Attrition variable to binary and scale Income variable (normalize)
employee_df$Income <- (employee_df$Income - min(employee_df$Income)) / (max(employee_df$Income) - min(employee_df$Income))
employee_df$Attrition <- ifelse(employee_df$Attrition == "Yes", 1, 0)



# Split the dataset into 70% training and 30% testing using the createDataPartition() function from the caret package
library(caret)

# Partition the rows of employee_df (the selected columns) based on the Attrition outcome
trainIndex <- createDataPartition(employee_df$Attrition, p = .7, list = FALSE, times = 1)
train <- employee_df[trainIndex, ]
test <- employee_df[-trainIndex, ]
head(train)
## # A tibble: 6 × 5
##     Age Attrition Gender JobSatisfaction Income
##   <dbl>     <dbl> <chr>  <chr>            <dbl>
## 1    51         0 Female 4               0.638 
## 2    31         1 Female 2               0.167 
## 3    38         0 Male   4               0.385 
## 4    32         0 Male   1               0.0702
## 5    28         1 Male   3               0.253 
## 6    31         0 Male   4               0.0545

2 Logistic Regression Rationale

2.1 Explain why logistic regression is an appropriate modeling technique for predicting employee attrition in this dataset compared to classical regression methods. (5 points)

For this dataset, attrition is a binary, categorical response variable, and logistic regression is designed for exactly this kind of outcome. Here is why we chose logistic regression:

1. Logistic regression is made for this ‘yes’ or ‘no’ kind of question: it models the probability that each observation belongs to a particular category, and we are interested in the likelihood of class membership rather than a continuous result.
2. Logistic regression applies the sigmoid (S-shaped) inverse-logit function to the output of a linear equation, so the predicted values lie only between 0 and 1, which aligns with the interpretation of probabilities (see the short sketch after this list). In contrast, linear regression can predict outside of this range, making it hard to draw meaning from the result.
3. Logistic regression simplifies interpretation of outcomes; its coefficients show the expected change in the log odds of the outcome for each one-unit increase in the predictor.
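
To illustrate point 2, here is a minimal sketch using base R’s plogis() (the logistic function) on purely hypothetical linear-predictor values; whatever the value of the linear equation, the result always lies strictly between 0 and 1.

# Minimal sketch: the logistic (sigmoid) function squeezes any real-valued
# linear predictor into the (0, 1) interval, unlike a raw linear fit
linear_predictor <- seq(-10, 10, by = 2.5)  # hypothetical values of b0 + b1 * x
round(plogis(linear_predictor), 4)          # plogis(z) = 1 / (1 + exp(-z)); all values in (0, 1)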

Also, when we tried to fit a linear regression model to our data, the diagnostic plots looked wrong: the residuals split into two bands (one for each attrition outcome) and showed clear trends, violating the linearity, constant-variance, and normality assumptions. This was another hint that logistic regression was the way to go for our project.

linear  <- lm(Attrition ~ Age, data = train)
summary(linear)
## 
## Call:
## lm(formula = Attrition ~ Age, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.27712 -0.19166 -0.14893 -0.08178  0.96706 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.3870065  0.0275704   14.04   <2e-16 ***
## Age         -0.0061045  0.0007224   -8.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3632 on 3085 degrees of freedom
## Multiple R-squared:  0.02262,    Adjusted R-squared:  0.02231 
## F-statistic: 71.41 on 1 and 3085 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(linear)

2.2 Provide a specific rationale for utilizing logistic regression in predicting attrition concerning the age of employees. (10 points)

Linear regression is based on the assumption of a linear relationship between the predictors and the outcome variable. When we look at Age as a factor, Age does not increase or decrease the likelihood of someone leaving their job in a simple straight line. In many real-world scenarios, including the relationship between age and attrition, the relationship may be non-linear, which logistic regression can handle effectively (especially with the addition of interactions), while linear regression might oversimplify it.

The output of linear regression (regression coefficients) directly quantifies the change in the dependent variable for a one-unit change in an independent variable. For a binary outcome like attrition, interpreting these changes in terms of the dependent variable itself (rather than the odds of the outcome occurring) is less intuitive and less useful in decision-making contexts. The logistic regression model’s ability to handle binary outcomes, model probabilities correctly, interpret coefficients as odds ratios, capture non-linear relationships, and use maximum likelihood estimation for parameter estimation makes it particularly suitable for predicting employee attrition based on age.

The way logistic regression reports its results is also intuitive: it tells us about the odds of someone leaving for each additional year of age. The coefficients in logistic regression, interpreted as log-odds, have a clear probabilistic interpretation. For instance, the coefficient for age \(\beta_1\) indicates how the log-odds of attrition change with a one-unit change in age. Exponentiating the \(\beta\)’s gives the odds ratio, which can be understood as how the odds of attrition multiply for each additional year of age. In logistic regression, we model the log-odds (logit) of the probability: \[\log\left(\frac{P(Y=1)}{1 - P(Y=1)}\right) = \beta_0 + \beta_1X + ... \] This formulation allows us to establish a linear relationship between the log-odds of the outcome and the predictors. This is crucial because while the relationship between predictors and the probability of an outcome may not be linear, the relationship between the predictors and the log-odds often is.

3 Model Parameter Interpretation

# Fit a logistic regression model
model <- glm(Attrition ~ Age, data = train, family = binomial())


# Summary of the model
summary(model)
## 
## Call:
## glm(formula = Attrition ~ Age, family = binomial(), data = train)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.115179   0.213085   0.541    0.589    
## Age         -0.049461   0.006005  -8.236   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2721.4  on 3086  degrees of freedom
## Residual deviance: 2647.7  on 3085  degrees of freedom
## AIC: 2651.7
## 
## Number of Fisher Scoring iterations: 4

3.1 After conducting the logistic regression analysis, interpret the parameter estimates obtained from the model with a focus on their significance in predicting employee attrition. (15 points)

Let’s break down the interpretation of the coefficients:

  1. Intercept (\(\beta_{0}\)):

    • Estimate: 0.115179
    • Std. Error: 0.213085
    • z value: 0.541
    • Pr(>|z|): 0.589

    The intercept 0.115179 represents the estimated log odds of attrition when Age is zero. Since an employee’s age cannot realistically be zero, this value is not meaningful on its own and serves only as a baseline for the model. The standard error tells us that, on average, the intercept estimate deviates from the true value by about 0.213085. The z-value of 0.541 is not significant, since it does not fall in the rejection region (> 1.96 or < -1.96); there is no strong evidence that the intercept is different from zero. The p-value of 0.589 is greater than 0.05, confirming this insignificance.

  2. Coefficient for Age (\(\beta_{1}\)):

    • Estimate:-0.049461
    • Std. Error: 0.006005
    • z value: -8.236
    • Pr(>|z|): <2e-16

    The Age coefficient of -0.049461 indicates that as Age increases, the log odds of attrition decrease; the negative sign indicates an inverse relationship. The standard error of the Age coefficient in this model is 0.006005, which suggests that the true value likely falls between -0.0612308 and -0.0376912. This interval is calculated as -0.049461 \(\pm\) (1.96 \(\times\) 0.006005), representing a 95% confidence interval (a quick way to reproduce it in R is sketched below). The z-value of -8.236 is highly significant since it falls far into the left tail beyond the critical region, so we can confidently reject the null hypothesis and conclude that the coefficient of Age is indeed different from zero. The p-value of < 2e-16 is far smaller than 0.05, indicating that Age is significant in explaining Attrition.
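
This Wald interval can be reproduced directly from the fitted model object; a quick sketch, assuming model from the chunk above:

# Wald 95% confidence interval for the Age coefficient (should closely match the hand calculation)
confint.default(model)["Age", ]
# Equivalent manual computation: estimate +/- 1.96 * standard error
coef(model)["Age"] + c(-1.96, 1.96) * summary(model)$coefficients["Age", "Std. Error"]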

Overall, the model reports a Null Deviance of 2721.4 on 3086 degrees of freedom, a Residual Deviance of 2647.7 on 3085 degrees of freedom, and an AIC of 2651.7. The Null Deviance represents the baseline model, which includes only the intercept, against which we compare the fit of our model. The Residual Deviance of 2647.7 shows how well our model with Age as a predictor fits the data relative to that baseline; the lower the residual deviance compared to the null deviance, the better, and here adding Age as a predictor reduces the deviance by 73.7. The AIC score of 2651.7, which balances the model’s complexity against its fit, helps us compare this model with others.
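
A short sketch of how this deviance reduction (and its statistical significance) can be checked in R, assuming model from the chunk above:

# Reduction in deviance from adding Age to the intercept-only baseline
model$null.deviance - model$deviance
# Sequential analysis of deviance with a chi-square (likelihood-ratio) test
anova(model, test = "Chisq")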

To further understand the impact of Age, let’s delve into the odds ratio. In logistic regression, the coefficients tell us about the log-odds change for a one-unit increase in the predictor. By exponentiating our Age coefficient (-0.049461), we obtain the odds ratio, which turns out to be 0.9517. This translates into a very interpretable insight: for each additional year of age, the odds of an employee’s attrition decrease by approximately 4.83% (calculated as (1 - 0.9517) × 100%).

Interpretation of the odds ratio:

The relationship between Age and the probability of attrition (denoted as \(\mathbb{P}(\text{Attrition})\)) in logistic regression can be described using the log-odds formula:

\[ \begin{align} \ln\left(\frac{\mathbb{P}(\text{Attrition})}{1 - \mathbb{P}(\text{Attrition})}\right) &= \beta_0 + \beta_{\text{Age}} \times \text{Age} \end{align} \]

Where \(\beta_0\) is the intercept and \(\beta_{\text{Age}}\) is the coefficient for Age. In our model:

\[ \begin{align} \ln\left(\frac{\mathbb{P}(\text{Attrition})}{1 - \mathbb{P}(\text{Attrition})}\right) &= 0.115179 - 0.049461 \times \text{Age} \end{align} \]

To express this in terms of probability and calculate the odds ratio:

\[ \begin{align} \mathbb{P}(\text{Attrition}) &= \frac{e^{0.115179 - 0.049461 \times \text{Age}}}{1 + e^{0.115179 - 0.049461 \times \text{Age}}} \end{align} \]

The odds ratio for Age is then the exponentiation of its coefficient:

\[ \text{Odds Ratio for Age} = e^{-0.049461} \]

By calculating this, we find:

\[ \text{Odds Ratio for Age} \approx 0.9517 \]

This implies that with each additional year of Age, the odds of attrition decrease by about 4.83% (calculated as \((1 - 0.9517) \times 100\%\)). This shows a significant trend where increasing age is associated with a reduced likelihood of employee attrition.
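
The same odds ratio can be obtained directly from the fitted model; a minimal sketch, assuming model from the chunk above:

# Odds ratio for Age: exponentiate the estimated coefficient
odds_ratio_age <- exp(coef(model)["Age"])
odds_ratio_age              # approximately 0.9517
# Approximate percentage decrease in the odds of attrition per additional year of age
(1 - odds_ratio_age) * 100  # approximately 4.83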

4 Distribution Estimates

4.1 Describe the concept of a marginal distribution and explain its relevance in understanding employee attrition. Calculate and report the estimated marginal distribution of attrition (Attrition: Yes/No) without the influence of the logistic regression model. (15 points)

marg_prop <- table(train$Attrition)
print(marg_prop)
## 
##    0    1 
## 2591  496

In statistics, a marginal distribution describes how frequently each category of a single variable, such as attrition status (Yes/No), occurs in our dataset. It tells us the overall probability or proportion of each outcome without conditioning on any other variables or features; it essentially summarizes the frequency of that specific variable in the dataset. For instance, if we’re looking at a dataset with multiple variables like age, income, and job satisfaction, the marginal distribution of age would simply describe how age varies across all observations regardless of the other variables.

When we start looking at employee attrition, the first thing we do is check how many employees left the company (‘Yes’) and how many stayed (‘No’), counting them across our entire dataset. This could tell us a lot about how common or rare it is for employees to leave.

In simple terms, what we’re doing is looking at the marginal distribution of attrition. This means we’re just focusing on how often each value (Yes or No) occurs. It’s like figuring out the chances or probabilities of employees leaving or staying, based only on what we see in our data. From our training data, the marginal count of employees who stayed (Attrition: ‘No’) is 2591, while the count of those who left (Attrition: ‘Yes’) is 496. This information allows us to gauge how likely it is for employees to leave the company. Consequently, we can deduce that, relative to the total number of observed employees, roughly 83.9% have stayed and 16.1% have left. This calculation provides a fundamental insight into the attrition rate, forming the basis for more detailed future analyses.

marg_prop <- prop.table(table(train$Attrition))
print(marg_prop)
## 
##         0         1 
## 0.8393262 0.1606738

4.2 For modeling purposes, choose an age value representing a “younger” employee and another age value for an “older” employee. Calculate and report the model-based estimates of the conditional distribution of attrition for these two age groups. (10 points)

Let’s choose 25 years as representative of a “younger” employee and 50 years as representative of an “older” employee. The logistic regression equation with Age as a predictor is: \[P(\text{Attrition} = 1 | \text{Age}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \times \text{Age})}}\]

model <- glm(Attrition ~ Age, data = train, family = binomial())

# Calculate the estimated probability of attrition for a younger employee age = 25
age_younger <- 25
exp_younger <- exp(model$coefficients[1] + model$coefficients[2] * age_younger)
prob_attrition_younger <- exp_younger / (1 + exp_younger)

# Calculate the estimated probability of attrition for an older employee age =50
age_older <- 50
exp_older <- exp(model$coefficients[1] + model$coefficients[2] * age_older)
prob_attrition_older <- exp_older / (1 + exp_older)

# Print the probabilities
prob_attrition_younger
## (Intercept) 
##   0.2457624
prob_attrition_older
## (Intercept) 
##  0.08644273

Based on the logistic regression model, the estimated probabilities of attrition for the two age groups are:

Younger Employee (Age 25): The estimated probability of attrition is approximately 24.58%. This suggests that younger employees (at age 25) have a higher likelihood of leaving the company.

Older Employee (Age 50): The estimated probability of attrition is approximately 8.64%. This indicates that older employees (at age 50) are less likely to leave the company compared to their younger counterparts.


For the “younger” age group (Age 25), the probability of attrition based on our logistic regression model is

\[ P(\text{Attrition} = 1 | \text{Age}_{\text{young}}) = \frac{1}{1 + e^{-(0.115179 -0.049461 \times 25)}} \]

Similarly, for the “older” age group (Age 50), the probability is calculated using their respective age:

\[ P(\text{Attrition} = 1 | \text{Age}_{\text{old}}) = \frac{1}{1 + e^{-(0.115179 -0.049461 \times 50)}} \]
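
These plug-in calculations can also be verified with predict() on new data; a small sketch, assuming model from the chunk above:

# Model-based conditional probabilities of attrition at ages 25 and 50
predict(model, newdata = data.frame(Age = c(25, 50)), type = "response")
# Expected: roughly 0.246 for age 25 and 0.086 for age 50, matching the values reported above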

5 Graphical Representation

5.1 Create a graph illustrating the relationship between age and attrition using the logistic regression model. Include appropriate labels and explanations. (10 points)

# Predict probabilities of attrition
train$Attrition_pred <- predict(model,type="response")

ggplot(train, aes(x = Age, y = Attrition_pred)) + geom_point(aes(color = Attrition), alpha = 0.5) + geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE) + labs(title = "Relationship between Age and Attrition", x = "Age", y = "Predicted probability of attrition", color = "Attrition") + theme_linedraw() + theme(panel.grid = element_line(color = "#666666", linewidth = 0.65, linetype = 4))

In the graph that depicts the relationship between age and the probability of attrition, as predicted by our logistic regression model, we plot the age of employees on the X-axis, covering a range from 18 to 60 years. On the Y-axis, we show the predicted probability of an employee deciding to leave the company, as it relates to their age. The graph demonstrates a clear downward trend, indicating that the likelihood of attrition decreases with increasing age.

The graph appears approximately linear rather than showing the sigmoid (S-shaped) curve typically associated with logistic regression models. This linear appearance might be attributed to the relatively small magnitude of the coefficient for age, \(\beta_1\), in our model. Additionally, the range of ages in our dataset contributes to this effect; within this specific span of 18 to 60 years, the distinctive S-shaped curve of logistic regression is less pronounced, giving the graph a seemingly linear quality (the sketch below extrapolates the fitted curve over a much wider, purely hypothetical age range to make the S shape visible).
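
As a purely illustrative sketch (extrapolating far outside the observed 18 to 60 age range, so the individual values should not be interpreted), plotting the fitted curve over a much wider span makes the underlying S shape visible:

# Illustrative only: extrapolate the fitted logistic curve well beyond the observed ages
# so that the sigmoid (S-shaped) form becomes apparent
age_grid <- data.frame(Age = seq(-50, 150, by = 1))
age_grid$pred <- predict(model, newdata = age_grid, type = "response")
plot(age_grid$Age, age_grid$pred, type = "l",
     xlab = "Age (extrapolated)", ylab = "Predicted probability of attrition")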

5.2 Create a table summarizing the key statistics of the age variable in relation to attrition and discuss any insights from the table. (10 points)

library(dplyr)

summary_table <- train %>%
    group_by(Attrition) %>%
    summarise(
        Mean_Age = mean(Age, na.rm = TRUE),
        Median_Age = median(Age, na.rm = TRUE),
        Min_Age = min(Age, na.rm = TRUE),
        Max_Age = max(Age, na.rm = TRUE),
        SD_Age = sd(Age, na.rm = TRUE)
    )

summary_table

## # A tibble: 2 × 6
##   Attrition Mean_Age Median_Age Min_Age Max_Age SD_Age
##       <dbl>    <dbl>      <dbl>   <dbl>   <dbl>  <dbl>
## 1         0     37.7         36      18      60   8.79
## 2         1     34.0         32      18      58   9.71

Attrition 0 = employees who stayed; Attrition 1 = employees who left.

Based on the summary table, the mean and median age of employees who stayed are higher than those of employees who left the company. This suggests that, on average, older employees tend to stay with the company, and the median age of 36 for employees who stayed indicates a central tendency for older employees to remain within the organization.

The age range is quite broad for both groups: employees who left range from 18 to 58 years, and employees who stayed range from 18 to 60 years, indicating wide diversity in the ages of employees. The standard deviation of age for employees who stayed is slightly lower (8.79 years) than for those who left (9.71 years), showing more variability in the ages of employees who left than of those who stayed.

7 Model Comparison

Choose four variables from the list provided (e.g., Age, Gender, JobSatisfaction, Income) and use them to create logistic regression models with the following combinations:

7.1 Model 1: A one-variable model with Age.

This model was already fitted in Question 3; the code is repeated here for completeness.

# Fit the null (intercept-only) model for later reference
model_simple <- glm(Attrition ~ 1, data = train, family = binomial())
# Fit Model 1: Simple Logistic Regression with Age
model1 <- glm(Attrition ~ Age, data = train, family = binomial())
summary(model1)
## 
## Call:
## glm(formula = Attrition ~ Age, family = binomial(), data = train)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.115179   0.213085   0.541    0.589    
## Age         -0.049461   0.006005  -8.236   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2721.4  on 3086  degrees of freedom
## Residual deviance: 2647.7  on 3085  degrees of freedom
## AIC: 2651.7
## 
## Number of Fisher Scoring iterations: 4

Intercept: 0.115179. This value represents the log odds of attrition when Age is zero; since Age cannot be zero in practice, it serves only as a baseline.

Coefficient of Age: -0.049461. The negative coefficient for Age indicates that as Age increases, the log odds of attrition decrease.

Residual Deviance: the reduction in deviance by 73.7 (from 2721.4 to 2647.7) shows the contribution of Age to the model.

Z-value: the z-value for Age, at -8.236, lands far into the left tail, showing significance.

P-value: the p-value for Age is far below 0.05, indicating that it is significant.

7.2 Model 2: A two-variable model with Age and Gender.

# Fit Model 2: Logistic Regression with Age and Gender as Predictors
model2 <- glm(Attrition ~ Age + Gender, data = train, family = binomial())
summary(model2)
## 
## Call:
## glm(formula = Attrition ~ Age + Gender, family = binomial(), 
##     data = train)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.143439   0.224580   0.639    0.523    
## Age         -0.049575   0.006015  -8.242   <2e-16 ***
## GenderMale  -0.040422   0.101294  -0.399    0.690    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2721.4  on 3086  degrees of freedom
## Residual deviance: 2647.6  on 3084  degrees of freedom
## AIC: 2653.6
## 
## Number of Fisher Scoring iterations: 4

Intercept: The intercept 0.143439 represents the log odds of attrition when Age is zero and Gender is at its reference level (Female). Since Age cannot realistically be zero, this value is not practically interpretable on its own.

Coefficient of Age: -0.049575 The negative coefficient for Age indicates a negative correlation with attrition. Specifically, as Age increases, the probability of attrition decreases. The magnitude and direction of this coefficient are consistent with our previous model(model1).

Coefficient of GenderMale: -0.040422 This coefficient suggests that being male (GenderMale) has a negative correlation with attrition, but the relationship is not statistically significant (p-value = 0.690). This implies that gender, in this model, does not play a significant role in predicting attrition.

Residual Deviance: The deviance falls from the null deviance of 2721.4 to 2647.6, a decrease of about 73.8, further showing the predictive power of Age; the addition of Gender does not noticeably change this measure relative to the previous model (model1).

Z-value: The z-value for Age remains highly significant at -8.242, similar to the previous model

P-value: The p-value for Age (< 2e-16) is less than 0.05, indicating its significance, but the p-value for GenderMale indicates that it is not a significant predictor of attrition in this model.

Overall Model Fit: The AIC value slightly increased to 2653.6 compared to the previous model. This, along with the GenderMale coefficient’s lack of significance, suggests that adding gender as a predictor does not substantially improve the model’s ability to predict attrition over Age alone.
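
One way to formalize this comparison is a likelihood-ratio test between the two nested models; a brief sketch, assuming model1 and model2 from the chunks above:

# Likelihood-ratio (chi-square) test: does adding Gender improve on the Age-only model?
anova(model1, model2, test = "Chisq")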

7.3 Model 3: A three-variable model with Age, Gender, and JobSatisfaction.

# Fit Model 3: Logistic Regression with Age, Gender, and JobSatisfaction as Predictors
model3 <- glm(Attrition ~ Age + Gender + JobSatisfaction, data = train, family = binomial())
summary(model3)
## 
## Call:
## glm(formula = Attrition ~ Age + Gender + JobSatisfaction, family = binomial(), 
##     data = train)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       0.604422   0.243197   2.485  0.01294 *  
## Age              -0.050343   0.006048  -8.324  < 2e-16 ***
## GenderMale       -0.026387   0.102058  -0.259  0.79598    
## JobSatisfaction2 -0.390296   0.149104  -2.618  0.00885 ** 
## JobSatisfaction3 -0.461991   0.134264  -3.441  0.00058 ***
## JobSatisfaction4 -0.818393   0.141452  -5.786 7.22e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2721.4  on 3086  degrees of freedom
## Residual deviance: 2613.6  on 3081  degrees of freedom
## AIC: 2625.6
## 
## Number of Fisher Scoring iterations: 4

Intercept: The intercept 0.604422 represents the log odds of attrition when Age is zero, Gender is Female (the reference level), and JobSatisfaction is at its reference level (1). While Age being zero is not a realistic condition, the intercept is still needed for the logistic model’s structure.

Coefficient of Age: -0.050343 Consistent with previous models, the negative coefficient for Age indicates that as Age increases, the likelihood of attrition decreases. The high statistical significance (p-value < 2e-16) reinforces Age as a key predictor in attrition.

Coefficient of GenderMale: -0.026387. The model suggests that being male slightly decreases the likelihood of attrition, but this effect is not statistically significant (p-value = 0.79598), implying that gender is not a major factor in predicting attrition.

Coefficients of Job Satisfaction: The coefficients for Job Satisfaction levels 2, 3, and 4 are -0.390296, -0.461991, and -0.818393, respectively. All are negative, indicating that higher job satisfaction is associated with a lower likelihood of attrition. The statistical significance of these coefficients (especially for levels 3 and 4) highlights job satisfaction as an important factor in understanding attrition.

Residual Deviance: The decrease in residual deviance to 2613.6 from the null deviance of 2721.4 suggests that the model, including Age, Gender, and Job Satisfaction, explains a meaningful portion of the variation in attrition. This is a considerable reduction, indicating a good model fit.

Overall Model Fit: The AIC value is 2625.6, which is lower than the previous models, indicating a better fit with the inclusion of Job Satisfaction alongside Age and Gender.

Z-values and P-values: The z-values and corresponding p-values show that Age and Job Satisfaction levels are significant predictors of attrition, while Gender is not.

7.4 Model 4: An interaction model involving Age, Gender, JobSatisfaction, Income, and Gender:Income. (10 points)

# Fit Model 4: Logistic Regression with Interactions
model4 <- glm(Attrition ~ Age + Gender + JobSatisfaction + Income + Gender:Income, data = train, family = binomial())

summary(model4)
## 
## Call:
## glm(formula = Attrition ~ Age + Gender + JobSatisfaction + Income + 
##     Gender:Income, family = binomial(), data = train)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        0.751429   0.262152   2.866 0.004152 ** 
## Age               -0.050820   0.006058  -8.390  < 2e-16 ***
## GenderMale        -0.029049   0.156422  -0.186 0.852673    
## JobSatisfaction2  -0.397560   0.149256  -2.664 0.007731 ** 
## JobSatisfaction3  -0.451003   0.134710  -3.348 0.000814 ***
## JobSatisfaction4  -0.824024   0.141699  -5.815 6.05e-09 ***
## Income            -0.465892   0.334004  -1.395 0.163056    
## GenderMale:Income  0.020162   0.430138   0.047 0.962614    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2721.4  on 3086  degrees of freedom
## Residual deviance: 2608.8  on 3079  degrees of freedom
## AIC: 2624.8
## 
## Number of Fisher Scoring iterations: 5

Intercept: The intercept of 0.751429 represents the log odds of attrition when all predictor variables are at zero or at their reference levels. As before, this is not practically interpretable.

Coefficient of Age: -0.050820. The negative coefficient indicates that as Age increases, the likelihood of attrition decreases, and the p-value (< 2e-16) again highlights Age as a significant factor in attrition.

Coefficient of GenderMale: The coefficient of -0.029049 suggests a slight decrease in the likelihood of attrition for male employees compared to females, but this difference is not statistically significant: its p-value is greater than 0.05 and its z-value lies within the range of -1.96 to 1.96.

Coefficients of Job Satisfaction: The coefficients for Job Satisfaction levels 2, 3, and 4 are -0.397560, -0.451003, and -0.824024, respectively, implying that as job satisfaction increases, the likelihood of attrition decreases.

Coefficient of Income: -0.465892. The negative coefficient for Income suggests a decrease in attrition likelihood with higher income, although its p-value (0.163056) is greater than 0.05, indicating that it is not significant.

Interaction Term (GenderMale:Income): The interaction between Gender and Income has a very small positive coefficient of 0.020162; its p-value (0.962614) is greater than 0.05 and its z-value lies within -1.96 to 1.96, showing that it is insignificant.

Residual Deviance: The reduction in residual deviance to 2608.8 from the null deviance of 2721.4 suggests the model explains a meaningful portion of the variation in attrition, indicating a good model fit.

Overall, the AIC value is 2624.8, which is slightly lower than for the previous model, suggesting a marginally better fit with the addition of Income and the Gender:Income interaction.
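
Because the four models form a nested sequence (each adds terms to the previous one), sequential likelihood-ratio tests give another view of whether each addition is worthwhile; a short sketch, assuming model1 through model4 from the chunks above:

# Sequential likelihood-ratio (chi-square) tests across the nested models
anova(model1, model2, model3, model4, test = "Chisq")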

7.5 Compare the performance of these models using the following metrics: pseudo R^2, Log Likelihood, AIC, and BIC. Include tables summarizing the results. (10 points)

aic_values <- c( AIC(model1),AIC(model2), AIC(model3), AIC(model4))
log_likelihood <- c(logLik(model1), logLik(model2), logLik(model3), logLik(model4))
bic_values <- c(BIC(model1),BIC(model2),BIC(model3),BIC(model4))

model_metrics <- data.frame(
  model_name = c("model1", "model2", "model3", "model4"),
  aic_values = aic_values,
  log_likelihood = log_likelihood,
  bic_value = bic_values
)
model_metrics
##   model_name aic_values log_likelihood bic_value
## 1     model1   2651.733      -1323.866  2663.802
## 2     model2   2653.574      -1323.787  2671.678
## 3     model3   2625.632      -1306.816  2661.842
## 4     model4   2624.821      -1304.410  2673.100
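
The question also asks for a pseudo R². A minimal sketch of McFadden’s pseudo R² (one common choice), computed against the intercept-only model model_simple fitted in Section 7.1:

# McFadden's pseudo R^2: 1 - logLik(model) / logLik(null model)
pseudo_r2 <- sapply(list(model1, model2, model3, model4), function(m) {
  1 - as.numeric(logLik(m)) / as.numeric(logLik(model_simple))
})
model_metrics$pseudo_r2 <- pseudo_r2
model_metrics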
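
For a side-by-side regression table of all four models, one option is the modelsummary package loaded at the top of this document; a sketch (exact formatting arguments may vary by package version):

# Tabulate coefficients and goodness-of-fit statistics for the four models side by side
modelsummary(list("Model 1" = model1, "Model 2" = model2,
                  "Model 3" = model3, "Model 4" = model4))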

Interpretation and Comparison

AIC (Akaike Information Criterion)

Lower AIC values are generally better. Model4 has the lowest AIC with the value of 2624.821 followed closely by model3 with the value of 2625.632, suggesting these models have a better fit compared to model1 and model2.

Log Likelihood

Higher (less negative) log-likelihood values indicate a better model fit. Model4 (-1304.410) has the highest log likelihood, indicating the best fit among the models, followed by model3 with a value of -1306.816.

BIC (Bayesian Information Criterion):

Like AIC, lower BIC values are better, but BIC penalizes model complexity more strongly. Model3 has the lowest BIC (2661.842), suggesting it might be the preferred model when considering both fit and parsimony. Model4, despite having the best AIC and Log Likelihood, has the highest BIC (2673.100), indicating that its extra parameters are penalized more heavily.

Model Selection: If prioritizing parsimony, model3 might be preferable due to its lowest BIC. If the emphasis is purely on how well the model fits, Model4 seems to be the better choice, as suggested by its AIC and Log Likelihood scores, though this comes with the trade-off of being more complex.

There’s a trade-off between model complexity and fit. More complex models might fit the data better but can be less generalizable due to overfitting. Simpler models might generalize better but at the expense of fit.

7.6 Create a graph to visually represent the comparison of the models based on the selected metrics. (10 points)

ggplot(model_metrics, aes(x = model_name)) +
  geom_point(aes(y = aic_values, color = "AIC"), size = 3) +
  geom_point(aes(y = -log_likelihood, color = "Negative Log Likelihood"), size = 3) +
  geom_point(aes(y = bic_value, color = "BIC"), size = 3) +
  geom_line(aes(y = aic_values, group = 1, color = "AIC"), linetype = "solid", linewidth = 1) +
  geom_line(aes(y = -log_likelihood, group = 1, color = "Negative Log Likelihood"), linetype = "solid", linewidth = 1) +
  geom_line(aes(y = bic_value, group = 1, color = "BIC"), linetype = "solid", linewidth = 1) +
  labs(title = "Model Comparison Based on AIC, Log Likelihood, and BIC",
       x = "Models",
       y = "Value") +
  scale_color_manual(values = c("AIC" = "#FFCC00", "Negative Log Likelihood" = "#669966", "BIC" = "#993366")) +
  theme_linedraw()+
  theme(panel.grid = element_line(color = "#666666",
                                  linewidth = 0.65,
                                  linetype = 4))+
  theme(legend.title = element_blank(),
        legend.position = "top") 


The graph presents a comparison of four statistical models based on AIC, BIC, and Log Likelihood values.

Model3 and Model4 have almost the same scores overall and have the lowest AIC and BIC values, but Model3 has a slightly lower BIC than Model4, suggesting that it offers the best balance of model fit and complexity. Model4 shows the highest log likelihood, indicating a strong fit to the training data, but this comes with a slightly higher BIC value compared to Model3.

Models 1 and 2, with higher AIC values, appear to have a less optimal fit. Overall, taking BIC as the selection standard, Model 3 emerges as the best choice for prediction: it combines a robust fit with lower complexity, since BIC penalizes additional parameters more heavily.

8 Predictive Performance

Using the logistic regression models glm1, glm2, glm3, and glm4 (fitted above as model1 through model4) that you have developed, perform a prediction analysis for employee attrition. (15 points)

8.1 For each of the four models, calculate the predicted probabilities of employee attrition for the test dataset.

# Load necessary libraries
library(pROC)

# Calculate predicted probabilities for each model using the test dataset
predicted_probs_glm1 <- predict(model1, newdata = test, type = "response")
predicted_probs_glm2 <- predict(model2, newdata = test, type = "response")
predicted_probs_glm3 <- predict(model3, newdata = test, type = "response")
predicted_probs_glm4 <- predict(model4, newdata = test, type = "response")

# Create ROC curve objects for each model
roc_glm1 <- roc(test$Attrition, predicted_probs_glm1)
roc_glm2 <- roc(test$Attrition, predicted_probs_glm2)
roc_glm3 <- roc(test$Attrition, predicted_probs_glm3)
roc_glm4 <- roc(test$Attrition, predicted_probs_glm4)

8.2 Create ROC curves for each model to evaluate their discriminatory power in predicting attrition.

plot(roc_glm1, col = "blue", main = "ROC Curves for Logistic Regression Models")

plot(roc_glm2, col = "red", add = TRUE)
plot(roc_glm3, col = "green", add = TRUE)
plot(roc_glm4, col = "purple", add = TRUE)

# Add a legend at the bottom right of the plot (you can adjust the position as needed)
legend("bottomright", legend = c("model 1", "model 2", "model 3", "model 4"),
       col = c("blue", "red", "green", "purple"), lwd = 2, cex = 0.8, inset = 0.05)

The ROC curve plot above shows the performance of the four logistic regression models.

The X-axis represents specificity (1 − false positive rate) and the Y-axis represents sensitivity (true positive rate), so the top-left corner represents the ideal point where sensitivity is high and the false positive rate is low. We then plotted multiple ROC curves, each representing a different logistic regression model.

The curves are closely overlapping, suggesting that the predictive performances of the models are quite similar. The diagonal grey line represents a no-skill (random-chance) classifier, so models whose curves lie above this line predict better than random chance. From the graph, model 4 appears to be slightly dominant, as indicated by its curve being the highest at certain points. However, without the AUC values, it is difficult to declare any model significantly superior over the entire range of specificity.

Compare the ROC curves and AUC. Discuss which model appears to be the most effective in distinguishing between employees with and without attrition.

# Calculate AUC values
auc_glm1 <- auc(roc_glm1)
auc_glm2 <- auc(roc_glm2)
auc_glm3 <- auc(roc_glm3)
auc_glm4 <- auc(roc_glm4)

# Print AUC values
auc_values <- c(auc_glm1, auc_glm2, auc_glm3, auc_glm4)
names(auc_values) <- c("Model1", "Model2", "Model3", "Model4")
auc_values
##    Model1    Model2    Model3    Model4 
## 0.6495592 0.6482096 0.6698283 0.6724834

The AUC values provided offer a quantitative assessment of the four models’ capabilities in discrimination between the two classes: attrition and no attrition. With the highest AUC of 0.6724834, Model 4 emerges as the superior model in terms of distinguishing between employees who are likely to leave and those who are likely to stay. Model 3 follows closely with an AUC of 0.6698283, positioning it as the second most effective model. Model 1 ranks third with an AUC of 0.6495592, while Model 2 trails with the lowest AUC of 0.6482096.

Although all models perform above the random chance level of 0.5, suggesting better than random predictions, their AUC values are not exceptionally high, which indicates a moderate discriminative ability overall. Considering these AUC results, Model 4 is deemed the most appropriate choice for predicting employee attrition in the given dataset.

8.3 Explain why it does not matter whether you use Log Likelihood, AIC, or BIC to compare all the models. (Hint: Consider the number of parameters in all models). (10 points)

The reason it does not matter whether we use Log Likelihood, AIC, or BIC to compare all the models is that they all measure how well a statistical model explains the observed data within the framework of maximum likelihood estimation. Assuming the models have the same number of parameters, let’s look at how each criterion works, one by one.

Start with the definition of likelihood: it is a measure of how well our parameters (our model) explain the observed data. The log likelihood is simply the natural logarithm of the likelihood function; we use the natural logarithm to make calculations easier, and it does not change the concept. For a set of observed data, represented as \(\mathcal{D}\), which includes data points \(x_1, x_2, \ldots, x_n\), and given a statistical model with parameters denoted by \(\theta\), we define the likelihood function \(L(\theta)\) as the joint probability of observing the dataset \(\mathcal{D}\) under the model parameters \(\theta\). Mathematically, the likelihood is the product of the individual probabilities of each observed data point:

\[ L(\theta) = P(\mathcal{D} | \theta) = \prod_{i=1}^{n} P(x_i | \theta) \]

To simplify computations, we take the natural logarithm of the likelihood function, resulting in the log likelihood, which is the sum of the logarithms of the individual probabilities:

\[ \log L(\theta) = \sum_{i=1}^{n} \log P(x_i | \theta) \]

This sum is more manageable to work with for purposes of optimization, such as finding the maximum likelihood estimates of the parameters \(\theta\), because it turns the operation of multiplication into addition which simplifies the differentiation process needed for optimization.

The AIC (Akaike Information Criterion) is a measure used to evaluate the quality of statistical models for a given dataset. Its formula is \[ \text{AIC} = 2k - 2\ln(L) \] where \(k\) is the number of parameters in the model and \(L\) is the maximized value of the likelihood function. AIC therefore uses the log likelihood to measure how well the model fits the data; the difference is that AIC penalizes models with more parameters, in proportion to the number of parameters, to discourage overfitting. Lower AIC values are preferable.

Lastly, the BIC (Bayesian Information Criterion) is similar to AIC but differs in how it penalizes model complexity, typically guarding against overfitting more strongly than AIC. Its formula is \[ \text{BIC} = \ln(n)k - 2\ln(L) \] where \(n\) is the sample size, \(k\) is the number of parameters in the model, and \(L\) is the maximized likelihood of the model. BIC penalizes the addition of new parameters more heavily than AIC: the penalty term \(\ln(n)k\) depends on the sample size, so unlike AIC the penalty grows as the sample size increases. A lower BIC is preferable, which corresponds to a higher log likelihood. The reason the choice of criterion does not matter here is that when all models have the same number of parameters, the penalty terms in AIC and BIC are constant and do not affect the ranking of the models; since all three criteria are based on the same maximized likelihood, the order of preference of the models does not change.
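
As a concrete check of these formulas, here is a small sketch that recomputes AIC and BIC for model1 by hand and compares them with R’s built-in values (for a binomial GLM the number of estimated parameters k equals the number of coefficients, since the dispersion is fixed at 1):

# Recompute AIC and BIC for model1 from its log likelihood and compare with the built-ins
k  <- length(coef(model1))        # number of estimated parameters (intercept + Age = 2)
n  <- nrow(train)                 # sample size used to fit the model
ll <- as.numeric(logLik(model1))  # maximized log likelihood
c(manual_AIC = 2 * k - 2 * ll, builtin_AIC = AIC(model1))
c(manual_BIC = log(n) * k - 2 * ll, builtin_BIC = BIC(model1))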

9 Recommendations and Reflection

9.1 Summarize the key findings from the analysis, including insights gained from the EDA, model building, and model comparison. Include tables or visualizations to support your summary. (10 points)

The most important findings from the analysis concern the relationship between the dependent variable and each independent variable: some relationships are inverse, and some contribute little to predicting attrition. Interestingly, Income, which we might expect to be important, was not as significant a predictor of attrition as age and job satisfaction. The analysis revealed a significant relationship between age and attrition: as age increases, the probability of an employee staying with the company also increases. This suggests that older employees are less likely to leave, possibly due to increased job stability or satisfaction. Similarly, Job Satisfaction emerged as a crucial factor associated with a lower probability of leaving the company.

In both the interaction model (Model 4) and the three-variable model (Model 3), age and job satisfaction showed significant predictive power for attrition. The interaction between variables in Model 4 added complexity but did not significantly enhance the predictive power, underscoring the strength of these individual factors.

Below we plot the distribution of age by attrition status using a histogram. As the plot shows, the highest counts of employees who left are recorded in the age range of 30 to 40.

ggplot(train, aes(x = Age, fill = as.factor(Attrition))) + geom_histogram(position = "dodge", binwidth = 1) + scale_fill_manual(values = c("1" = "red", "0" = "green"), labels = c("0" = "Stayed", "1" = "Left")) + labs(title = "Attrition Rates Across Age Groups", x = "Age", y = "Count", fill = "Attrition Status") + theme_minimal()

In our model comparison, we closely examined various evaluation criteria to understand their impact on model selection. This process revealed a critical trade-off between underfitting and overfitting. For example, both the maximum log likelihood and the lowest Akaike Information Criterion (AIC) pointed towards Model 4, which was also supported by the Area Under the Curve (AUC) analysis, whereas the Bayesian Information Criterion (BIC) directed us towards Model 3. This decision-making process highlighted the importance of balancing simplicity and complexity in model selection. Ultimately, we chose Model 3, valuing its blend of straightforwardness and predictive capability, as suggested by the BIC. This choice underscores our approach of prioritizing models that offer an effective balance between ease of interpretation and robust predictive power.

9.2 Provide recommendations to Canterra based on the analysis. Suggest strategies to reduce attrition rates and improve employee retention. (10 points)

While Income is important, these insights suggest that simply increasing salaries may not be the most effective way to reduce attrition. Instead, a more holistic approach that addresses various aspects of job satisfaction might yield better results. Based on the analysis, we have seen the impact of Age and Job Satisfaction on retention and attrition. The following strategies could help Canterra reduce attrition rates and improve employee retention:

  • Develop targeted retention strategies for younger employees.

  • Implement measures to improve job satisfaction.

By focusing on these areas, Canterra can create a more engaging and satisfying work environment.