Assignment 2



This document was created as part of a degree requirement. Many of its stylistic features come from a template; you can find more information about the template on Yan Holtz's GitHub page.

###### Libraries #######
library(dplyr)
library(tidyverse)
library(psych)
library(kableExtra)
library(ggpubr)
library(glmnet)
library(DescTools)
library(caret)
library(randomForest)

1 Getting the Data Loaded

df<-read.csv("C:/Users/amero/OneDrive - Wayne State University/School/2020-2021 PhD/Big Data/HW/Assignment 2/HW_file.csv")

2 Create Fold Variable

set.seed(2033)
df$fold <-sample(c(rep(1,nrow(df)*.2),  rep(2,nrow(df)*.2), rep(3,nrow(df)*.2), rep(4,nrow(df)*.2), rep(5,nrow(df)*.2)))
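
A quick check that the folds came out balanced (this works because the number of rows, 1,470, is a multiple of 5, so each fold gets 294 cases):

# Each fold should contain an equal number of rows
table(df$fold)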

3 Pre-Processing Continuous Variables

# Quickly checking col numbers and names for next step
colnames(df)
# kbl() from kableExtra just formats the output as a nicer table; the important piece is the describe() call.
# While some of these variables are not strictly continuous, I included the ones we might treat as continuous in analyses.

kbl(round(describe(df[,-c(1:3, 5, 8, 11,24:25)]),3)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| Variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | 1 | 1470 | 36.924 | 9.135 | 36 | 36.471 | 8.896 | 18 | 60 | 42 | 0.412 | -0.410 | 0.238 |
| Travel_Frequency | 2 | 1470 | 1.086 | 0.532 | 1 | 1.108 | 0.000 | 0 | 2 | 2 | 0.082 | 0.410 | 0.014 |
| DistanceFromHome_Miles | 3 | 1470 | 9.193 | 8.107 | 7 | 8.084 | 7.413 | 1 | 29 | 28 | 0.956 | -0.232 | 0.211 |
| PerformanceRating | 4 | 1470 | 3.069 | 0.474 | 3 | 3.047 | 0.000 | 1 | 4 | 3 | -0.444 | 4.421 | 0.012 |
| MonthlyIncome | 5 | 1470 | 6502.931 | 4707.957 | 4919 | 5667.241 | 3260.237 | 1009 | 19999 | 18990 | 1.367 | 0.992 | 122.793 |
| Other_Companies_Worked_At | 6 | 1470 | 2.693 | 2.498 | 2 | 2.361 | 1.483 | 0 | 9 | 9 | 1.024 | 0.002 | 0.065 |
| TrainingTimesLastYear | 7 | 1470 | 2.799 | 1.289 | 3 | 2.721 | 1.483 | 0 | 6 | 6 | 0.552 | 0.484 | 0.034 |
| YearsAtCompany | 8 | 1470 | 7.008 | 6.127 | 5 | 5.986 | 4.448 | 0 | 40 | 40 | 1.761 | 3.909 | 0.160 |
| YearsInCurrentRole | 9 | 1470 | 4.229 | 3.623 | 3 | 3.855 | 4.448 | 0 | 18 | 18 | 0.915 | 0.467 | 0.094 |
| YearsSinceLastPromotion | 10 | 1470 | 2.188 | 3.222 | 1 | 1.483 | 1.483 | 0 | 15 | 15 | 1.980 | 3.587 | 0.084 |
| YearsWithCurrManager | 11 | 1470 | 4.123 | 3.568 | 3 | 3.769 | 4.448 | 0 | 17 | 17 | 0.832 | 0.162 | 0.093 |
| Manager_Satisfaction | 12 | 1470 | 2.786 | 0.965 | 3 | 2.858 | 1.483 | 1 | 4 | 3 | -0.273 | -0.940 | 0.025 |
| CoWorker_Satisfaction | 13 | 1470 | 2.866 | 0.680 | 3 | 2.853 | 0.000 | 1 | 4 | 3 | -0.140 | -0.180 | 0.018 |
| Job_Involvement | 14 | 1470 | 2.772 | 1.029 | 3 | 2.840 | 1.483 | 1 | 4 | 3 | -0.303 | -1.080 | 0.027 |
| Work_Environment | 15 | 1470 | 3.259 | 0.708 | 3 | 3.329 | 1.483 | 1 | 4 | 3 | -0.495 | -0.605 | 0.018 |
| WorkLifeBalance | 16 | 1470 | 2.702 | 0.750 | 3 | 2.719 | 0.000 | 1 | 4 | 3 | -0.499 | 0.085 | 0.020 |
| JobChallenge | 17 | 1470 | 2.690 | 1.114 | 3 | 2.737 | 1.483 | 1 | 4 | 3 | -0.278 | -1.279 | 0.029 |



None of the variables has a skew greater than or equal to 2, although Years Since Last Promotion comes close at 1.98. Performance Rating, Years at Company, and Years Since Last Promotion all have a kurtosis greater than 2. Visualizing these variables will help triangulate a decision about whether to transform them.
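
The same cutoffs can be applied programmatically; a minimal sketch (reusing psych::describe on the same columns) that flags candidates for transformation:

# Flag variables whose skew or kurtosis is at or beyond +/- 2
desc <- describe(df[,-c(1:3, 5, 8, 11, 24:25)])
desc[abs(desc$skew) >= 2 | abs(desc$kurtosis) >= 2, c("skew", "kurtosis")]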

######### QQ Plots #########
p<-ggqqplot(df, x = "YearsAtCompany",
   ggtheme = theme_pubclean(), title = "Years At Company", color = "#00AFBB")
#adjusts specific color layer, 1 is the points, 2 would be the line
p$layers[[1]]$aes_params$colour <- "#097880"; p

p1<-ggqqplot(df, x = "YearsSinceLastPromotion",
   ggtheme = theme_pubclean(), title = "Years Since Last Promotion", color = "#00AFBB")
p1$layers[[1]]$aes_params$colour <- "#097880"; p1

######### Histograms #########

histo <- ggplot(df, aes(x=YearsAtCompany))
histo <- histo + geom_histogram(binwidth=2, colour="black", 
                          aes(y=..density.., fill=..count..))
histo <- histo + scale_fill_gradient("Count", low="#a4d8db", high="#00afbb")
histo <- histo + stat_function(fun=dnorm,
                         color="red",
                         args=list(mean=mean(df$YearsAtCompany), 
                                  sd=sd(df$YearsAtCompany)))
histo

histo1 <- ggplot(df, aes(x=YearsSinceLastPromotion))
histo1 <- histo1 + geom_histogram(binwidth=1, colour="black", 
                          aes(y=..density.., fill=..count..))
histo1 <- histo1 + scale_fill_gradient("Count", low="#a4d8db", high="#00afbb")
histo1 <- histo1 + stat_function(fun=dnorm,
                         color="red",                         args=list(mean=mean(df$YearsSinceLastPromotion), 
                                  sd=sd(df$YearsSinceLastPromotion)))
histo1



The QQ plots and histograms make it evident that these two variables are substantially positively skewed. I'll transform Years at Company and Years Since Last Promotion by taking the square root of the original values, and then re-examine their skew, kurtosis, QQ plots, and histograms.

df$YearsAtCompany<-sqrt(df$YearsAtCompany)
df$YearsSinceLastPromotion <-sqrt(df$YearsSinceLastPromotion)

transformdescrip<-round(describe(df[,c(14,16)]),3)
kbl(transformdescrip[,c(11:13)]) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| Variable | skew | kurtosis | se |
|---|---|---|---|
| YearsAtCompany | 0.426 | 0.376 | 0.029 |
| YearsSinceLastPromotion | 0.738 | -0.351 | 0.027 |
######### QQ Plots ###########
p2<-ggqqplot(df, x = "YearsAtCompany",
   ggtheme = theme_pubclean(), title = "Years At Company", color = "#00AFBB")
p2$layers[[1]]$aes_params$colour <- "#097880"; p2

p3<-ggqqplot(df, x = "YearsSinceLastPromotion",
   ggtheme = theme_pubclean(), title = "Years Since Last Promotion", color = "#00AFBB")
p3$layers[[1]]$aes_params$colour <- "#097880"; p3

######### Histograms #########

histo2 <- ggplot(df, aes(x=YearsAtCompany))
histo2 <- histo2 + geom_histogram(binwidth=.15, colour="black", 
                          aes(y=..density.., fill=..count..))
histo2 <- histo2 + scale_fill_gradient("Count", low="#a4d8db", high="#00afbb")
histo2 <- histo2 + stat_function(fun=dnorm,
                         color="red",
                         args=list(mean=mean(df$YearsAtCompany), 
                                  sd=sd(df$YearsAtCompany)))
histo2

histo3 <- ggplot(df, aes(x=YearsSinceLastPromotion))
histo3 <- histo3 + geom_histogram(binwidth=.15, colour="black", 
                          aes(y=..density.., fill=..count..))
histo3 <- histo3 + scale_fill_gradient("Count", low="#a4d8db", high="#00afbb")
histo3 <- histo3 + stat_function(fun=dnorm,
                         color="red",                         args=list(mean=mean(df$YearsSinceLastPromotion), 
                                  sd=sd(df$YearsSinceLastPromotion)))
histo3



After transforming the variables, the skew for Years at Company and Years Since Last Promotion dropped to 0.426 and 0.738, respectively, and the kurtosis to 0.376 and -0.351, well within conventional cutoffs.

4 Pre-Processing Categorical Variables



I need to dummy-code the categorical variables Department and Marital Status.

dfdummy<-cbind(df,dummy.code(df$Department),dummy.code(df$MaritalStatus))
# setting my reference column for dummy coding as well as removing the old variables
dfdummy=subset(dfdummy,select = -c(EmployeeNumber,Department,MaritalStatus, Production,Married))
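
As a quick sanity check (a minimal sketch; it assumes the dummy columns keep the original level names, as they do here), the new indicator columns should reproduce the original category counts. Dropping Production and Married above makes them the reference categories.

# Compare original category counts with the sums of the retained dummy columns
table(df$Department)
colSums(dfdummy[, c("Sales and Marketing", "Support")])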

5 Run Standard Logistic Regression

a and b: creating the model object and fit statistics

train<- dfdummy[dfdummy$fold < 5,]
test<- dfdummy[dfdummy$fold == 5,]
logmodel<-glm(Turn_YesNo ~ .-fold, data = train, family = "binomial")
summary(logmodel)
# McFadden's pseudo R^2 = 1 - L(model) / L(intercept)
pseudoRsq = (1 - (716.5 / 1026.9))
pseudoRsq
## [1] 0.302269
# For other pseudo-rsquared
PseudoR2(logmodel, c("McFadden", "CoxSnell", "Nagelkerke"))
##   McFadden   CoxSnell Nagelkerke 
##  0.3075264  0.2355045  0.4043713
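
The same McFadden value can also be computed directly from the fitted model object instead of hard-coding the deviances; a minimal sketch:

# McFadden's pseudo R^2 from the deviance components stored in the glm object
1 - (logmodel$deviance / logmodel$null.deviance)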

c and d: predicting turnover in the test sample

logmodeltrain<-glm(Turn_YesNo ~ . -fold, 
                  data = cbind(test,prediction = predict.glm(logmodel, 
                  type = "response")), 
                  family = "binomial")
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded

## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
logmodeltest<-glm(Turn_YesNo ~ . -fold, 
                  data = cbind(test,prediction = predict.glm(logmodel, 
                  newdata = test, type = "response")), 
                  family = "binomial")
summary(logmodeltrain)
summary(logmodeltest)
pseudoRsqtrain = (1 - (744.9/1085.1))
pseudoRsqtrain
## [1] 0.3135195
pseudoRsqtest = (1 - (185.46 / 271.27))
pseudoRsqtest
## [1] 0.3163269
# For other pseudo-rsquared
PseudoR2(logmodeltrain, c("McFadden", "CoxSnell", "Nagelkerke"))
##   McFadden   CoxSnell Nagelkerke 
##  0.3168722  0.2535113  0.4207297
PseudoR2(logmodeltest, c("McFadden", "CoxSnell", "Nagelkerke"))
##   McFadden   CoxSnell Nagelkerke 
##  0.3193176  0.2551937  0.4235218
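
Classification accuracy in the holdout fold gives another view of fit; a minimal sketch using a .5 probability cutoff (the cutoff is my assumption, not part of the assignment):

# Convert predicted probabilities in the test fold to 0/1 classes and compare to observed turnover
testprob <- predict.glm(logmodel, newdata = test, type = "response")
mean(ifelse(testprob > .5, 1, 0) == test$Turn_YesNo)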

f Model performance



Model performance appears to be slightly better in the test sample across several different pseudo R-squared measures.

5 Elastic Net Logistic Regression Tuned Via Caret



I will standardize the variables, fit an elastic net logistic regression to the training sample with the tuning parameters selected by cross-validation, and then evaluate the model in the test sample.

a and b

traindata<-cbind(as.data.frame(scale(train[,-c(2,21:22)])),train[,c(2,21:22)])
testdata <- cbind(as.data.frame(scale(test[,-c(2,21:22)])),test[,c(2,21:22)])
traindata$Turn_YesNo <- factor(traindata$Turn_YesNo, levels = c(0,1), labels = c("Stay", "Leave"))

set.seed(2033)
folds<- createMultiFolds(traindata$Turn_YesNo, k = 10, times = 1)
Trainspecses <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 3,
                           index = folds,
                           selectionFunction = "best",
                           classProbs = T,
                           summaryFunction = multiClassSummary,
                           verboseIter = TRUE)
set.seed(2033)
ES_fit <- train(x = traindata[,-c(25:26)], y = traindata$Turn_YesNo, "glmnet", 
                family= "binomial", trControl = Trainspecses, 
                metric = "Accuracy", standardize = FALSE, tuneLength = 10)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
Estest<-glm(Turn_YesNo ~ . -fold, data = cbind(test,prediction = predict(ES_fit, 
                newdata = testdata, type = "prob")), family = "binomial")
summary(Estest)
mean(ES_fit$resample$Accuracy)
## [1] 0.8800884
ES_fit$finalModel$tuneValue
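
The coefficients retained at the selected penalty can be pulled from the glmnet object inside the caret fit; a minimal sketch:

# Elastic net coefficients at the tuned lambda (coefficients shrunk to zero were dropped by the penalty)
coef(ES_fit$finalModel, s = ES_fit$bestTune$lambda)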



The mean holdout accuracy at the optimal tuning parameters was approximately .88. The optimal tuning parameters were alpha = .1 and lambda = .003.

c

kbl(round(as.data.frame(Estest$coefficients),3)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| Term | Estest$coefficients |
|---|---|
| (Intercept) | 8.201 |
| Age | -0.050 |
| Male_YesNo | 0.411 |
| Travel_Frequency | 0.050 |
| DistanceFromHome_Miles | 0.012 |
| Education_Level | -0.090 |
| PerformanceRating | -1.855 |
| MonthlyIncome | 0.000 |
| StockOptionLevel | 0.453 |
| Other_Companies_Worked_At | 0.071 |
| TrainingTimesLastYear | -0.014 |
| YearsAtCompany | 0.268 |
| YearsInCurrentRole | -0.240 |
| YearsSinceLastPromotion | 0.141 |
| YearsWithCurrManager | 0.110 |
| Manager_Satisfaction | -0.148 |
| CoWorker_Satisfaction | 0.054 |
| Job_Involvement | -0.210 |
| Work_Environment | -0.031 |
| WorkLifeBalance | -0.264 |
| JobChallenge | 0.004 |
| Sales and Marketing | 0.874 |
| Support | 0.224 |
| Single | 0.773 |
| Divorced | -0.896 |
| prediction.Stay | -2.280 |
| prediction.Leave | NA |

d

pseudoRsqES = (1 - (185.52 / 271.27))
pseudoRsqES
## [1] 0.3161057
# For other pseudo-rsquared
PseudoR2(Estest, c("McFadden", "CoxSnell", "Nagelkerke"))
##   McFadden   CoxSnell Nagelkerke 
##  0.3195030  0.2553211  0.4237332



The pseudo R-squared typically calculated from the deviances in the summary output is McFadden's R-squared. Here that value comes out to .316, essentially the same result as the standard logistic regression, which also produced a pseudo R-squared of .316.

Random Forests

set.seed(2033)
folds<-createMultiFolds(traindata$Turn_YesNo,k=10,times=1)
Trainspecsrf<-trainControl(method="repeatedcv",
                           number=10,
                           repeats=3,
                           index=folds,
                           classProbs = T,
                           summaryFunction=multiClassSummary,
                           search="grid",
                           selectionFunction = "best",
                           verboseIter = TRUE) 
 
rf_fit<-train(Turn_YesNo~.-fold,data=traindata,method="rf",
              metric="Accuracy", trControl=Trainspecsrf,ntree=500,nodesize=5,
              standardize=FALSE)
rf_fit
## Random Forest 
## 
## 1176 samples
##   25 predictor
##    2 classes: 'Stay', 'Leave' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 1058, 1059, 1058, 1058, 1059, 1058, ... 
## Resampling results across tuning parameters:
## 
##   mtry  logLoss    AUC        prAUC      Accuracy   Kappa      F1       
##    2    0.3211937  0.8466876  0.7549885  0.8724540  0.2989548  0.9293604
##   13    0.3784352  0.8262538  0.7358867  0.8750181  0.3954654  0.9292954
##   24    0.3324764  0.8239929  0.7204191  0.8767058  0.4024288  0.9303028
##   Sensitivity  Specificity  Pos_Pred_Value  Neg_Pred_Value  Precision
##   0.9959596    0.2146199    0.8712107       0.9214286       0.8712107
##   0.9767677    0.3330409    0.8865064       0.7546032       0.8865064
##   0.9777778    0.3383041    0.8874342       0.7522908       0.8874342
##   Recall     Detection_Rate  Balanced_Accuracy
##   0.9959596  0.8384543       0.6052897        
##   0.9767677  0.8223091       0.6549043        
##   0.9777778  0.8231494       0.6580409        
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 24.
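
The tuned forest's holdout classification performance can be summarized with caret's confusionMatrix; a minimal sketch (it assumes testdata$Turn_YesNo is still the numeric 0/1 coding, so it is converted to the same Stay/Leave labels used in training):

# Confusion matrix for the random forest predictions in the test fold
rf_class <- predict(rf_fit, newdata = testdata)
confusionMatrix(rf_class,
                factor(testdata$Turn_YesNo, levels = c(0, 1), labels = c("Stay", "Leave")),
                positive = "Leave")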

6 Examining Model Within the Training Sample

a and b

mean(rf_fit$resample$Accuracy)
## [1] 0.8767058
rftest<-glm(Turn_YesNo ~ . -fold, data = cbind(testdata,prediction = predict(rf_fit, 
            newdata = testdata, type = "prob")), family = "binomial")
predtrainrf<-predict(rf_fit, traindata, type = "prob")
library(ltm)
biserial.cor(predtrainrf$Stay,traindata$Turn_YesNo)
## [1] 0.9429721

c

set.seed(2033)
traindata<-rename(traindata, SalesAndMarketing = "Sales and Marketing")
Rf_model<-randomForest(Turn_YesNo~.-fold ,data=traindata,mtry=13,
                       ntree=500,nodesize=5, importance=TRUE)
varImpPlot(Rf_model)



The most important variables according to mean decrease in accuracy are manager satisfaction, performance rating, monthly income, job challenge, and age. According to the Gini index, the most important are monthly income, manager satisfaction, age, performance rating, and job challenge.

7 Examine Model Performance in Test Sample

a

pseudoRsqrftest = (1 - (175.33 / 271.27))
pseudoRsqrftest
## [1] 0.3536698
# For other pseudo-rsquared
PseudoR2(rftest, c("McFadden", "CoxSnell", "Nagelkerke"))
##   McFadden   CoxSnell Nagelkerke 
##  0.3471926  0.2741059  0.4549086



The pseudo R-squared shows a larger gain here than we saw with the elastic net approach: it comes out to .354, compared with roughly .32 for the standard and elastic net logistic regressions. This is not a dramatic improvement, but it does provide some support for using a random forest model in this application.
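
As a complementary check, the biserial correlation computed earlier in the training sample can be recomputed in the holdout fold; a minimal sketch (testdata$Turn_YesNo is converted to the same Stay/Leave factor used in training):

# Biserial correlation between the forest's predicted probability of staying and observed turnover in the test fold
predtestrf <- predict(rf_fit, newdata = testdata, type = "prob")
biserial.cor(predtestrf$Stay,
             factor(testdata$Turn_YesNo, levels = c(0, 1), labels = c("Stay", "Leave")))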

 




A work by Amer Odeh