Q1
setwd("/Users/txharris/Desktop/IS 6489")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
bi_data <- read.csv("bioimplants.csv")
# Overall Attrition Rate
overall_attrition_rate <- mean(bi_data$attrition == "Yes")
# Conditional Attrition Rates by Department and Job Role
conditional_attrition_rates <- bi_data %>%
group_by(department, job_role) %>%
summarise(conditional_attrition_rate = mean(attrition == "Yes"))
## `summarise()` has grouped output by 'department'. You can override using the
## `.groups` argument.
# Sort the table by attrition rate in descending order
conditional_attrition_rates <- conditional_attrition_rates %>%
arrange(desc(conditional_attrition_rate))
# Majority Class Prediction
majority_class <- bi_data %>%
summarise(majority_class_accuracy = max(table(bi_data$attrition))/nrow(bi_data))
# Display results
cat("Overall Attrition Rate:", overall_attrition_rate, "\n\n")
## Overall Attrition Rate: 0.1612245
cat("Summary Table of Conditional Attrition Rates (Sorted by Rate):\n")
## Summary Table of Conditional Attrition Rates (Sorted by Rate):
print(conditional_attrition_rates)
## # A tibble: 11 × 3
## # Groups: department [3]
## department job_role conditional_attrition_rate
## <chr> <chr> <dbl>
## 1 Sales Sales Representative 0.398
## 2 Research & Development Laboratory Technician 0.239
## 3 Human Resources Human Resources 0.231
## 4 Sales Sales Executive 0.175
## 5 Research & Development Research Scientist 0.161
## 6 Research & Development Manufacturing Director 0.0690
## 7 Research & Development Healthcare Representative 0.0687
## 8 Research & Development Manager 0.0556
## 9 Sales Manager 0.0541
## 10 Research & Development Research Director 0.025
## 11 Human Resources Manager 0
cat("\nMajority Class Prediction In-Sample Accuracy:", majority_class$majority_class_accuracy, "\n")
##
## Majority Class Prediction In-Sample Accuracy: 0.8387755
Q2
logit_model <- glm(attrition == "Yes" ~ . - employee_number, data = bi_data, family = "binomial")
# Make Predictions
predicted_probabilities <- predict(logit_model, newdata = bi_data, type = "response")
predicted_attrition <- ifelse(predicted_probabilities >= 0.5, "Yes", "No")
# Calculate Accuracy
accuracy <- mean(predicted_attrition == bi_data$attrition)
majority_class_accuracy <- majority_class$majority_class_accuracy
cat("Logistic Regression Model Accuracy (Decision Threshold = 0.5):", accuracy, "\n")
## Logistic Regression Model Accuracy (Decision Threshold = 0.5): 0.892517
cat("Majority Class Prediction In-Sample Accuracy:", majority_class_accuracy, "\n")
## Majority Class Prediction In-Sample Accuracy: 0.8387755
# Comment
if (accuracy > majority_class_accuracy) {
cat("The logistic regression model offers an improvement over predicting with the majority class.\n")
} else {
cat("The logistic regression model does not offer a significant improvement over predicting with the majority class.\n")
}
## The logistic regression model offers an improvement over predicting with the majority class.
Q3
# Center and Scale Numeric Predictors
bi_data_centered_scaled <- bi_data %>%
mutate_if(is.numeric, scale)
bi_data_centered_scaled <- select(bi_data_centered_scaled, -department)
# Fit Logistic Regression Model
logit_model_centered_scaled <- glm(attrition == "Yes" ~ . - employee_number,
family = "binomial",
data = bi_data_centered_scaled)
# Assuming you've already fitted the logistic regression model and stored it in logit_model_centered_scaled
# Get the coefficients and their details
coefficients_summary <- summary(logit_model_centered_scaled)$coefficients
# Identify predictor with the largest absolute z value
largest_effect_size_predictor <- rownames(coefficients_summary)[which.max(abs(coefficients_summary[, "z value"]))]
# Display results
cat("Predictor with the Largest Effect Size:", largest_effect_size_predictor, "\n")
## Predictor with the Largest Effect Size: over_timeYes
cat("Estimate:", coefficients_summary[largest_effect_size_predictor, "Estimate"], "\n")
## Estimate: 1.97353
cat("Std. Error:", coefficients_summary[largest_effect_size_predictor, "Std. Error"], "\n")
## Std. Error: 0.1930413
cat("z value:", coefficients_summary[largest_effect_size_predictor, "z value"], "\n")
## z value: 10.22336
cat("Pr(>|z|):", coefficients_summary[largest_effect_size_predictor, "Pr(>|z|)"], "\n")
## Pr(>|z|): 1.558457e-24
# Interpretation
cat("\nInterpretation of the Coefficient with the Largest Effect Size:\n")
##
## Interpretation of the Coefficient with the Largest Effect Size:
cat("A 1 standard deviation increase in", largest_effect_size_predictor,
"is associated with a coefficient-sized change of",
coefficients_summary[largest_effect_size_predictor, "Estimate"],
"in the log odds of attrition, on average, while holding other predictors constant.\n")
## A 1 standard deviation increase in over_timeYes is associated with a coefficient-sized change of 1.97353 in the log odds of attrition, on average, while holding other predictors constant.
Q4
## 1 Proposed Policy Change:
# For this scenario, I would likely attempt to decrease the number of overtime hours worked weekly to force the attrition rate downwards. Here we would want to focus on prioritizing punctuality and rewarding individuals who do not have to spend more time at work than necessary. We would also want to provide support for those who cannot avoid increased workload during the week which should aid the attrition rate.
## 2 Estimate on Churn Probability:
# The predictor 'over_timeYes' has an estimated coefficient of 1.974. A 1 standard deviation increase in 'over_timeYes' is associated with a coefficient-sized change of 1.974 in the probability of attrition. The positive coefficient will suggest that an increase in overtime hours is associated with an increase in the log odds of attrition. The proposed communication is to clearly communicate the policy change by emphasizing its positive impact on reducing overtime-related stress and lowering the risk of attrition. We will also want to highlight the beneficial aspects of this policy change on overall morale and performance at work.
Q5
## In her report Angelica should include the following:
# Analysis of current attrition rate: The overall rate for BioImplants is 0.1612245. Conditional attrition rates by department is included in the table from one of the first questions. This is sorted in descending order. The majority class prediction for in-sample accuracy is 0.8387755.
# A logistic regression model was then fitted using predictors, excluding the value for employee number. The model was then 0.892517 accurate and there was a decision threshold of 0.5. This outperformed the majority class model with an accuracy of 0.8387755.
# We then standardized the numeric predictors to make the comparison of effect sizes more correct. The redundant variable of department was removed to avoid multicollineariy. This then lea to finding the predictor with the largest effect size which was overtimeYes. This coefficient was 1.97353.
# The proposed policy change aims to correct overtime-related attrition and focuses on improving worker's work and life schedule. It would be likely an improvement in the attrition rates of the company as a whole because it will focus on avoiding extenuating the number of hours worked for each person. The estimated change in the log odds of attrition is 1 standard deviation increase in overtimeYes leads to a change of 1.97353.
# Of course all findings and proposed solutions should be tracked and monitored overtime to see if the correct predictors are actually addressed and then lead to corrective changes. It can be assumed that the attrition rates will decrease because the proposed policy change is directly related to overtimeYes being eliminated.