The following document was created as part of a degree requirement. Many of the stylistic features used in this document come from a template; you can find more information about it on Yan Holtz's GitHub page.
###### Libraries #######
library(dplyr)
library(tidyverse)
library(psych)
library(kableExtra)
library(ggpubr)
library(glmnet)
library(DescTools)
library(caret)
library(randomForest)

# kbl() comes from kableExtra and makes some neat-looking tables in R Markdown.
# What's important here is really the describe() call. While some of these
# variables are not strictly continuous, I looked at the ones we might treat
# as continuous in analyses.
kbl(round(describe(df[,-c(1:3, 5, 8, 11,24:25)]),3)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

|  | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | 1 | 1470 | 36.924 | 9.135 | 36 | 36.471 | 8.896 | 18 | 60 | 42 | 0.412 | -0.410 | 0.238 |
| Travel_Frequency | 2 | 1470 | 1.086 | 0.532 | 1 | 1.108 | 0.000 | 0 | 2 | 2 | 0.082 | 0.410 | 0.014 |
| DistanceFromHome_Miles | 3 | 1470 | 9.193 | 8.107 | 7 | 8.084 | 7.413 | 1 | 29 | 28 | 0.956 | -0.232 | 0.211 |
| PerformanceRating | 4 | 1470 | 3.069 | 0.474 | 3 | 3.047 | 0.000 | 1 | 4 | 3 | -0.444 | 4.421 | 0.012 |
| MonthlyIncome | 5 | 1470 | 6502.931 | 4707.957 | 4919 | 5667.241 | 3260.237 | 1009 | 19999 | 18990 | 1.367 | 0.992 | 122.793 |
| Other_Companies_Worked_At | 6 | 1470 | 2.693 | 2.498 | 2 | 2.361 | 1.483 | 0 | 9 | 9 | 1.024 | 0.002 | 0.065 |
| TrainingTimesLastYear | 7 | 1470 | 2.799 | 1.289 | 3 | 2.721 | 1.483 | 0 | 6 | 6 | 0.552 | 0.484 | 0.034 |
| YearsAtCompany | 8 | 1470 | 7.008 | 6.127 | 5 | 5.986 | 4.448 | 0 | 40 | 40 | 1.761 | 3.909 | 0.160 |
| YearsInCurrentRole | 9 | 1470 | 4.229 | 3.623 | 3 | 3.855 | 4.448 | 0 | 18 | 18 | 0.915 | 0.467 | 0.094 |
| YearsSinceLastPromotion | 10 | 1470 | 2.188 | 3.222 | 1 | 1.483 | 1.483 | 0 | 15 | 15 | 1.980 | 3.587 | 0.084 |
| YearsWithCurrManager | 11 | 1470 | 4.123 | 3.568 | 3 | 3.769 | 4.448 | 0 | 17 | 17 | 0.832 | 0.162 | 0.093 |
| Manager_Satisfaction | 12 | 1470 | 2.786 | 0.965 | 3 | 2.858 | 1.483 | 1 | 4 | 3 | -0.273 | -0.940 | 0.025 |
| CoWorker_Satisfaction | 13 | 1470 | 2.866 | 0.680 | 3 | 2.853 | 0.000 | 1 | 4 | 3 | -0.140 | -0.180 | 0.018 |
| Job_Involvement | 14 | 1470 | 2.772 | 1.029 | 3 | 2.840 | 1.483 | 1 | 4 | 3 | -0.303 | -1.080 | 0.027 |
| Work_Environment | 15 | 1470 | 3.259 | 0.708 | 3 | 3.329 | 1.483 | 1 | 4 | 3 | -0.495 | -0.605 | 0.018 |
| WorkLifeBalance | 16 | 1470 | 2.702 | 0.750 | 3 | 2.719 | 0.000 | 1 | 4 | 3 | -0.499 | 0.085 | 0.020 |
| JobChallenge | 17 | 1470 | 2.690 | 1.114 | 3 | 2.737 | 1.483 | 1 | 4 | 3 | -0.278 | -1.279 | 0.029 |
None of the variables has a skew greater than or equal to 2, though Years Since Last Promotion comes close at 1.98. Performance Rating, Years at Company, and Years Since Last Promotion all have a kurtosis greater than 2. It will be helpful to visualize these variables before deciding whether to transform them.
######### QQ Plots #########
p<-ggqqplot(df, x = "YearsAtCompany",
ggtheme = theme_pubclean(), title = "Years At Company", color = "#00AFBB")
#adjusts specific color layer, 1 is the points, 2 would be the line
p$layers[[1]]$aes_params$colour <- "#097880"; p

p1<-ggqqplot(df, x = "YearsSinceLastPromotion",
ggtheme = theme_pubclean(), title = "Years Since Last Promotion", color = "#00AFBB")
p1$layers[[1]]$aes_params$colour <- "#097880"; p1

######### Histograms #########
histo <- ggplot(df, aes(x=YearsAtCompany))
histo <- histo + geom_histogram(binwidth=2, colour="black",
aes(y=..density.., fill=..count..))
histo <- histo + scale_fill_gradient("Count", low="#a4d8db", high="#00afbb")
histo <- histo + stat_function(fun=dnorm,
color="red",
args=list(mean=mean(df$YearsAtCompany),
sd=sd(df$YearsAtCompany)))
histo

histo1 <- ggplot(df, aes(x=YearsSinceLastPromotion))
histo1 <- histo1 + geom_histogram(binwidth=1, colour="black",
aes(y=..density.., fill=..count..))
histo1 <- histo1 + scale_fill_gradient("Count", low="#a4d8db", high="#00afbb")
histo1 <- histo1 + stat_function(fun=dnorm,
color="red", args=list(mean=mean(df$YearsSinceLastPromotion),
sd=sd(df$YearsSinceLastPromotion)))
histo1

It is evident from the QQ plots and histograms that these two variables are noticeably right-skewed. I'll transform Years at Company and Years Since Last Promotion by taking the square root of the original values. Once that's done, I'll look at their skew and kurtosis again, along with the QQ plots and histograms.
df$YearsAtCompany<-sqrt(df$YearsAtCompany)
df$YearsSinceLastPromotion <-sqrt(df$YearsSinceLastPromotion)
transformdescrip<-round(describe(df[,c(14,16)]),3)
kbl(transformdescrip[,c(11:13)]) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

|  | skew | kurtosis | se |
|---|---|---|---|
| YearsAtCompany | 0.426 | 0.376 | 0.029 |
| YearsSinceLastPromotion | 0.738 | -0.351 | 0.027 |
######### QQ Plots ###########
p2<-ggqqplot(df, x = "YearsAtCompany",
ggtheme = theme_pubclean(), title = "Years At Company", color = "#00AFBB")
p2$layers[[1]]$aes_params$colour <- "#097880"; p2

p3<-ggqqplot(df, x = "YearsSinceLastPromotion",
ggtheme = theme_pubclean(), title = "Years Since Last Promotion", color = "#00AFBB")
p3$layers[[1]]$aes_params$colour <- "#097880"; p3

######### Histograms #########
histo2 <- ggplot(df, aes(x=YearsAtCompany))
histo2 <- histo2 + geom_histogram(binwidth=.15, colour="black",
aes(y=..density.., fill=..count..))
histo2 <- histo2 + scale_fill_gradient("Count", low="#a4d8db", high="#00afbb")
histo2 <- histo2 + stat_function(fun=dnorm,
color="red",
args=list(mean=mean(df$YearsAtCompany),
sd=sd(df$YearsAtCompany)))
histo2

histo3 <- ggplot(df, aes(x=YearsSinceLastPromotion))
histo3 <- histo3 + geom_histogram(binwidth=.15, colour="black",
aes(y=..density.., fill=..count..))
histo3 <- histo3 + scale_fill_gradient("Count", low="#a4d8db", high="#00afbb")
histo3 <- histo3 + stat_function(fun=dnorm,
color="red", args=list(mean=mean(df$YearsSinceLastPromotion),
sd=sd(df$YearsSinceLastPromotion)))
histo3

After transforming the variables, skew for Years at Company and Years Since Last Promotion came out to 0.426 and 0.738 respectively, and kurtosis to 0.376 and -0.351 respectively. Both variables now look considerably closer to normal.
Next, I need to dummy code the categorical variables Department and Marital Status and assign each observation to a fold for the train/test split; a sketch of that step is below.
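The chunk that actually builds `dfdummy` and the `fold` column did not survive the knit, so here is a minimal sketch of what it plausibly looks like. The column names `Department` and `Marital_Status` and the use of `model.matrix()` are assumptions on my part; the reference levels are inferred from the coefficient table further down (Department shows up as "Sales and Marketing" and "Support", Marital Status as "Single" and "Divorced").

```r
# Sketch only: dummy code the two categorical variables.
# "Department" and "Marital_Status" are assumed column names.
dummies <- model.matrix(~ Department + Marital_Status, data = df)[, -1]
dfdummy <- cbind(df[, !(names(df) %in% c("Department", "Marital_Status"))],
                 as.data.frame(dummies))

# Assign each row to one of five folds; fold 5 becomes the holdout test set.
set.seed(2033)
dfdummy$fold <- sample(rep(1:5, length.out = nrow(dfdummy)))
```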
train<- dfdummy[dfdummy$fold < 5,]
test<- dfdummy[dfdummy$fold == 5,]
logmodel<-glm(Turn_YesNo ~ .-fold, data = train, family = "binomial")
summary(logmodel)

## [1] 0.302269
## McFadden CoxSnell Nagelkerke
## 0.3075264 0.2355045 0.4043713
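The calls that produced the pseudo R-squared output above were stripped from the knit. Given that DescTools is loaded at the top, a plausible reconstruction is below; the single value printed first (0.302269) may be a different variant (for example, an adjusted McFadden), so treat this as an assumption rather than the exact original code.

```r
# Assumed reconstruction using DescTools: the three-value block above matches
# PseudoR2()'s output layout for these measures on the training-sample model.
PseudoR2(logmodel, which = c("McFadden", "CoxSnell", "Nagelkerke"))
```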
logmodeltrain<-glm(Turn_YesNo ~ . -fold,
# predict.glm with no newdata scores the rows the model was fit on,
# so these predictions belong with the training data
data = cbind(train, prediction = predict.glm(logmodel,
type = "response")),
family = "binomial")
logmodeltest<-glm(Turn_YesNo ~ . -fold,
data = cbind(test,prediction = predict.glm(logmodel,
newdata = test, type = "response")),
family = "binomial")
summary(logmodeltrain)
summary(logmodeltest)

## [1] 0.3135195
## [1] 0.3163269
## McFadden CoxSnell Nagelkerke
## 0.3168722 0.2535113 0.4207297
## McFadden CoxSnell Nagelkerke
## 0.3193176 0.2551937 0.4235218
Model performance appears to be slightly better in the test sample across several different pseudo R-squared measures.
Next, I will standardize the variables, fit an elastic net logistic regression on the training sample using repeated cross-validation, and validate it on the holdout test sample.
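For reference, the elastic net penalty that glmnet applies blends the ridge (L2) and lasso (L1) penalties; alpha controls the mix and lambda the overall amount of shrinkage. For logistic regression the fit minimizes the penalized negative log-likelihood:

$$
\min_{\beta_0,\,\beta}\; -\frac{1}{N}\sum_{i=1}^{N} \log L\!\left(y_i,\ \beta_0 + x_i^{\top}\beta\right) \;+\; \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 \;+\; \alpha\,\lVert\beta\rVert_1\right]
$$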
traindata<-cbind(as.data.frame(scale(train[,-c(2,21:22)])),train[,c(2,21:22)])
testdata <- cbind(as.data.frame(scale(test[,-c(2,21:22)])),test[,c(2,21:22)])
traindata$Turn_YesNo <- factor(traindata$Turn_YesNo, levels = c(0,1), labels = c("Stay", "Leave"))
set.seed(2033)
folds<- createMultiFolds(traindata$Turn_YesNo, k = 10, times = 1)
Trainspecses <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
index = folds,
selectionFunction = "best",
classProbs = T,
summaryFunction = multiClassSummary,
verboseIter = TRUE)

set.seed(2033)
ES_fit <- train(x = traindata[,-c(25:26)], y = traindata$Turn_YesNo, "glmnet",
family= "binomial", trControl = Trainspecses,
metric = "Accuracy", standardize = FALSE, tuneLength = 10)## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
Estest<-glm(Turn_YesNo ~ . -fold, data = cbind(test,prediction = predict(ES_fit,
newdata = testdata, type = "prob")), family = "binomial")
summary(Estest)

## [1] 0.8800884
The holdout accuracy at the optimal tuning parameters was .878, with alpha = .1 and lambda = .003.
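Those values live in the caret fit object; in case it's useful, they can be pulled programmatically (a small sketch using the `ES_fit` object from above and caret's standard accessors):

```r
# Best alpha/lambda pair selected by the grid search, plus the best CV accuracy.
ES_fit$bestTune
max(ES_fit$results$Accuracy, na.rm = TRUE)
```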
kbl(round(as.data.frame(Estest$coefficients),3)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

|  | Estest$coefficients |
|---|---|
| (Intercept) | 8.201 |
| Age | -0.050 |
| Male_YesNo | 0.411 |
| Travel_Frequency | 0.050 |
| DistanceFromHome_Miles | 0.012 |
| Education_Level | -0.090 |
| PerformanceRating | -1.855 |
| MonthlyIncome | 0.000 |
| StockOptionLevel | 0.453 |
| Other_Companies_Worked_At | 0.071 |
| TrainingTimesLastYear | -0.014 |
| YearsAtCompany | 0.268 |
| YearsInCurrentRole | -0.240 |
| YearsSinceLastPromotion | 0.141 |
| YearsWithCurrManager | 0.110 |
| Manager_Satisfaction | -0.148 |
| CoWorker_Satisfaction | 0.054 |
| Job_Involvement | -0.210 |
| Work_Environment | -0.031 |
| WorkLifeBalance | -0.264 |
| JobChallenge | 0.004 |
| Sales and Marketing | 0.874 |
| Support | 0.224 |
| Single | 0.773 |
| Divorced | -0.896 |
| prediction.Stay | -2.280 |
| prediction.Leave | NA |
## [1] 0.3161057
## McFadden CoxSnell Nagelkerke
## 0.3195030 0.2553211 0.4237332
The pseudo R-squared typically reported from this kind of summary output is McFadden's R-squared, which comes out to .316 here. This is similar to the simple logistic regression, which also produced a pseudo R-squared of about .316. (The prediction.Leave coefficient is NA because it is perfectly collinear with prediction.Stay; the two predicted probabilities sum to 1, so glm drops one of them.)
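As a quick sanity check on that number, McFadden's measure is just one minus the ratio of the model's log-likelihood to that of an intercept-only model. A hand-rolled version, assuming the `Estest` model and `test` data from above, would look like:

```r
# McFadden's pseudo R-squared computed directly from log-likelihoods,
# with an intercept-only model on the same test rows as the baseline.
nullmodel <- glm(Turn_YesNo ~ 1, data = test, family = "binomial")
1 - as.numeric(logLik(Estest)) / as.numeric(logLik(nullmodel))
```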
set.seed(2033)
folds<-createMultiFolds(traindata$Turn_YesNo,k=10,times=1)
Trainspecsrf<-trainControl(method="repeatedcv",
number=10,
repeats=3,
index=folds,
classProbs = T,
summaryFunction=multiClassSummary,
search="grid",
selectionFunction = "best",
verboseIter = TRUE)
rf_fit<-train(Turn_YesNo~.-fold,data=traindata,method="rf",
metric="Accuracy", trControl=Trainspecsrf,ntree=500,nodesize=5,
standardize=FALSE)

## Random Forest
##
## 1176 samples
## 25 predictor
## 2 classes: 'Stay', 'Leave'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 1058, 1059, 1058, 1058, 1059, 1058, ...
## Resampling results across tuning parameters:
##
## mtry logLoss AUC prAUC Accuracy Kappa F1
## 2 0.3211937 0.8466876 0.7549885 0.8724540 0.2989548 0.9293604
## 13 0.3784352 0.8262538 0.7358867 0.8750181 0.3954654 0.9292954
## 24 0.3324764 0.8239929 0.7204191 0.8767058 0.4024288 0.9303028
## Sensitivity Specificity Pos_Pred_Value Neg_Pred_Value Precision
## 0.9959596 0.2146199 0.8712107 0.9214286 0.8712107
## 0.9767677 0.3330409 0.8865064 0.7546032 0.8865064
## 0.9777778 0.3383041 0.8874342 0.7522908 0.8874342
## Recall Detection_Rate Balanced_Accuracy
## 0.9959596 0.8384543 0.6052897
## 0.9767677 0.8223091 0.6549043
## 0.9777778 0.8231494 0.6580409
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 24.
## [1] 0.8767058
rftest<-glm(Turn_YesNo ~ . -fold, data = cbind(testdata,prediction = predict(rf_fit,
newdata = testdata, type = "prob")), family = "binomial")
predtrainrf<-predict(rf_fit, traindata, type = "prob")
library(ltm)
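The call that produced the value below was also stripped from the knit. Since ltm is loaded at this point and training-set predicted probabilities were just computed, it was plausibly a point-biserial correlation between those probabilities and the outcome; this is purely a guess at the original code:

```r
# Hypothetical reconstruction: point-biserial correlation between the random
# forest's training-set P(Leave) and the observed outcome (level = 2 treats
# "Leave", the second factor level, as the focal group).
biserial.cor(predtrainrf$Leave, traindata$Turn_YesNo, level = 2)
```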
## [1] 0.9429721
set.seed(2033)
traindata<-rename(traindata, SalesAndMarketing = "Sales and Marketing")
Rf_model<-randomForest(Turn_YesNo~.-fold ,data=traindata,mtry=13,
ntree=500,nodesize=5, importance=TRUE)
varImpPlot(Rf_model)
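For a tabular view of what varImpPlot draws, randomForest's `importance()` accessor returns the same measures as a matrix:

```r
# MeanDecreaseAccuracy and MeanDecreaseGini for each predictor,
# i.e. the numbers behind the varImpPlot above.
round(importance(Rf_model), 2)
```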
The most important variables according to mean decrease in accuracy are manager satisfaction, performance rating, monthly income, job challenge, and age. According to the Gini index, the most important are monthly income, manager satisfaction, age, performance rating, and job challenge.
## [1] 0.3536698
## McFadden CoxSnell Nagelkerke
## 0.3471926 0.2741059 0.4549086
We see a somewhat larger pseudo R-squared than with the elastic net approach: .354 here. This is not a dramatic improvement over our other methods, which both hovered around .32, but it does provide some support for using a random forest model in this application.
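To make that comparison explicit, the McFadden values for the three test-set models could be pulled side by side; a sketch, assuming DescTools' `PseudoR2()` was the source of the values reported throughout:

```r
# McFadden pseudo R-squared for each test-set model, side by side.
sapply(list(logistic = logmodeltest, elastic_net = Estest, random_forest = rftest),
       PseudoR2, which = "McFadden")
```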
A work by Amer Odeh