Assignment 2



This document was created as part of a degree requirement. Many of its stylistic features come from a template; you can find more information about the template on Yan Holtz's GitHub page.

###### Libraries #######
library(dplyr)
library(tidyverse)
library(psych)
library(kableExtra)
library(ggpubr)
library(glmnet)
library(DescTools)
library(caret)
library(randomForest)

1 Getting the Data Loaded

df<-read.csv("C:/Users/amero/OneDrive - Wayne State University/School/2020-2021 PhD/Big Data/HW/Assignment 2/HW_file.csv")

2 Create Fold Variable

set.seed(2033)
df$fold <-sample(c(rep(1,nrow(df)*.2),  rep(2,nrow(df)*.2), rep(3,nrow(df)*.2), rep(4,nrow(df)*.2), rep(5,nrow(df)*.2)))
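
A quick check that the folds came out balanced (this works because the number of rows, 1,470, is a multiple of 5, so each fold gets 294 cases):

# Each fold should contain an equal number of rows
table(df$fold)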

3 Pre-Processing Continuous Variables

# Quickly checking col numbers and names for next step
colnames(df)
# kbl() from kableExtra just formats the output as a nicer table; the important piece is the describe() call.
# While some of these variables are not strictly continuous, I included the ones we might treat as continuous in analyses.

kbl(round(describe(df[,-c(1:3, 5, 8, 11,24:25)]),3)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| Variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | 1 | 1470 | 36.924 | 9.135 | 36 | 36.471 | 8.896 | 18 | 60 | 42 | 0.412 | -0.410 | 0.238 |
| Travel_Frequency | 2 | 1470 | 1.086 | 0.532 | 1 | 1.108 | 0.000 | 0 | 2 | 2 | 0.082 | 0.410 | 0.014 |
| DistanceFromHome_Miles | 3 | 1470 | 9.193 | 8.107 | 7 | 8.084 | 7.413 | 1 | 29 | 28 | 0.956 | -0.232 | 0.211 |
| PerformanceRating | 4 | 1470 | 3.069 | 0.474 | 3 | 3.047 | 0.000 | 1 | 4 | 3 | -0.444 | 4.421 | 0.012 |
| MonthlyIncome | 5 | 1470 | 6502.931 | 4707.957 | 4919 | 5667.241 | 3260.237 | 1009 | 19999 | 18990 | 1.367 | 0.992 | 122.793 |
| Other_Companies_Worked_At | 6 | 1470 | 2.693 | 2.498 | 2 | 2.361 | 1.483 | 0 | 9 | 9 | 1.024 | 0.002 | 0.065 |
| TrainingTimesLastYear | 7 | 1470 | 2.799 | 1.289 | 3 | 2.721 | 1.483 | 0 | 6 | 6 | 0.552 | 0.484 | 0.034 |
| YearsAtCompany | 8 | 1470 | 7.008 | 6.127 | 5 | 5.986 | 4.448 | 0 | 40 | 40 | 1.761 | 3.909 | 0.160 |
| YearsInCurrentRole | 9 | 1470 | 4.229 | 3.623 | 3 | 3.855 | 4.448 | 0 | 18 | 18 | 0.915 | 0.467 | 0.094 |
| YearsSinceLastPromotion | 10 | 1470 | 2.188 | 3.222 | 1 | 1.483 | 1.483 | 0 | 15 | 15 | 1.980 | 3.587 | 0.084 |
| YearsWithCurrManager | 11 | 1470 | 4.123 | 3.568 | 3 | 3.769 | 4.448 | 0 | 17 | 17 | 0.832 | 0.162 | 0.093 |
| Manager_Satisfaction | 12 | 1470 | 2.786 | 0.965 | 3 | 2.858 | 1.483 | 1 | 4 | 3 | -0.273 | -0.940 | 0.025 |
| CoWorker_Satisfaction | 13 | 1470 | 2.866 | 0.680 | 3 | 2.853 | 0.000 | 1 | 4 | 3 | -0.140 | -0.180 | 0.018 |
| Job_Involvement | 14 | 1470 | 2.772 | 1.029 | 3 | 2.840 | 1.483 | 1 | 4 | 3 | -0.303 | -1.080 | 0.027 |
| Work_Environment | 15 | 1470 | 3.259 | 0.708 | 3 | 3.329 | 1.483 | 1 | 4 | 3 | -0.495 | -0.605 | 0.018 |
| WorkLifeBalance | 16 | 1470 | 2.702 | 0.750 | 3 | 2.719 | 0.000 | 1 | 4 | 3 | -0.499 | 0.085 | 0.020 |
| JobChallenge | 17 | 1470 | 2.690 | 1.114 | 3 | 2.737 | 1.483 | 1 | 4 | 3 | -0.278 | -1.279 | 0.029 |



None of the variables has a skew greater than or equal to 2, although Years Since Last Promotion comes close at 1.98. Performance Rating, Years at Company, and Years Since Last Promotion all have a kurtosis greater than 2. Visualizing these variables will help triangulate a decision about whether to transform them.
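
The same cutoffs can be applied programmatically; a minimal sketch (reusing psych::describe on the same columns) that flags candidates for transformation:

# Flag variables whose skew or kurtosis is at or beyond +/- 2
desc <- describe(df[,-c(1:3, 5, 8, 11, 24:25)])
desc[abs(desc$skew) >= 2 | abs(desc$kurtosis) >= 2, c("skew", "kurtosis")]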

######### QQ Plots #########
p<-ggqqplot(df, x = "YearsAtCompany",
   ggtheme = theme_pubclean(), title = "Years At Company", color = "#00AFBB")
#adjusts specific color layer, 1 is the points, 2 would be the line
p$layers[[1]]$aes_params$colour <- "#097880"; p

p1<-ggqqplot(df, x = "YearsSinceLastPromotion",
   ggtheme = theme_pubclean(), title = "Years Since Last Promotion", color = "#00AFBB")
p1$layers[[1]]$aes_params$colour <- "#097880"; p1

######### Histograms #########

histo <- ggplot(df, aes(x=YearsAtCompany))
histo <- histo + geom_histogram(binwidth=2, colour="black", 
                          aes(y=..density.., fill=..count..))
histo <- histo + scale_fill_gradient("Count", low="#a4d8db", high="#00afbb")
histo <- histo + stat_function(fun=dnorm,
                         color="red",
                         args=list(mean=mean(df$YearsAtCompany), 
                                  sd=sd(df$YearsAtCompany)))
histo

histo1 <- ggplot(df, aes(x=YearsSinceLastPromotion))
histo1 <- histo1 + geom_histogram(binwidth=1, colour="black", 
                          aes(y=..density.., fill=..count..))
histo1 <- histo1 + scale_fill_gradient("Count", low="#a4d8db", high="#00afbb")
histo1 <- histo1 + stat_function(fun=dnorm,
                         color="red",                         args=list(mean=mean(df$YearsSinceLastPromotion), 
                                  sd=sd(df$YearsSinceLastPromotion)))
histo1



The QQ plots and histograms make it evident that these two variables are substantially positively skewed. I'll transform Years at Company and Years Since Last Promotion by taking the square root of the original values, and then re-examine their skew, kurtosis, QQ plots, and histograms.

df$YearsAtCompany<-sqrt(df$YearsAtCompany)
df$YearsSinceLastPromotion <-sqrt(df$YearsSinceLastPromotion)

transformdescrip<-round(describe(df[,c(14,16)]),3)
kbl(transformdescrip[,c(11:13)]) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| Variable | skew | kurtosis | se |
|---|---|---|---|
| YearsAtCompany | 0.426 | 0.376 | 0.029 |
| YearsSinceLastPromotion | 0.738 | -0.351 | 0.027 |
######### QQ Plots ###########
p2<-ggqqplot(df, x = "YearsAtCompany",
   ggtheme = theme_pubclean(), title = "Years At Company", color = "#00AFBB")
p2$layers[[1]]$aes_params$colour <- "#097880"; p2

p3<-ggqqplot(df, x = "YearsSinceLastPromotion",
   ggtheme = theme_pubclean(), title = "Years Since Last Promotion", color = "#00AFBB")
p3$layers[[1]]$aes_params$colour <- "#097880"; p3

######### Histograms #########

histo2 <- ggplot(df, aes(x=YearsAtCompany))
histo2 <- histo2 + geom_histogram(binwidth=.15, colour="black", 
                          aes(y=..density.., fill=..count..))
histo2 <- histo2 + scale_fill_gradient("Count", low="#a4d8db", high="#00afbb")
histo2 <- histo2 + stat_function(fun=dnorm,
                         color="red",
                         args=list(mean=mean(df$YearsAtCompany), 
                                  sd=sd(df$YearsAtCompany)))
histo2

histo3 <- ggplot(df, aes(x=YearsSinceLastPromotion))
histo3 <- histo3 + geom_histogram(binwidth=.15, colour="black", 
                          aes(y=..density.., fill=..count..))
histo3 <- histo3 + scale_fill_gradient("Count", low="#a4d8db", high="#00afbb")
histo3 <- histo3 + stat_function(fun=dnorm,
                         color="red",                         args=list(mean=mean(df$YearsSinceLastPromotion), 
                                  sd=sd(df$YearsSinceLastPromotion)))
histo3



After transforming the variables, the skew for Years at Company and Years Since Last Promotion dropped to 0.426 and 0.738, respectively, and the kurtosis to 0.376 and -0.351, well within conventional cutoffs.

4 Pre-Processing Categorical Variables



I need to dummy-code the categorical variables Department and Marital Status.

dfdummy<-cbind(df,dummy.code(df$Department),dummy.code(df$MaritalStatus))
# setting my reference column for dummy coding as well as removing the old variables
dfdummy=subset(dfdummy,select = -c(EmployeeNumber,Department,MaritalStatus, Production,Married))
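
As a quick sanity check (a minimal sketch; it assumes the dummy columns keep the original level names, as they do here), the new indicator columns should reproduce the original category counts. Dropping Production and Married above makes them the reference categories.

# Compare original category counts with the sums of the retained dummy columns
table(df$Department)
colSums(dfdummy[, c("Sales and Marketing", "Support")])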

5 Run Standard Logistic Regression

a and b: creating the model object and fit statistics

train<- dfdummy[dfdummy$fold < 5,]
test<- dfdummy[dfdummy$fold == 5,]
logmodel<-glm(Turn_YesNo ~ .-fold, data = train, family = "binomial")
summary(logmodel)
# McFadden's pseudo R^2 = 1 - L(model) / L(intercept)
pseudoRsq = (1 - (716.5 / 1026.9))
pseudoRsq
## [1] 0.302269
# For other pseudo-rsquared
PseudoR2(logmodel, c("McFadden", "CoxSnell", "Nagelkerke"))
##   McFadden   CoxSnell Nagelkerke 
##  0.3075264  0.2355045  0.4043713
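
The same McFadden value can also be computed directly from the fitted model object instead of hard-coding the deviances; a minimal sketch:

# McFadden's pseudo R^2 from the deviance components stored in the glm object
1 - (logmodel$deviance / logmodel$null.deviance)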

c and d: predicting turnover in the test sample

logmodeltrain<-glm(Turn_YesNo ~ . -fold, 
                  data = cbind(test,prediction = predict.glm(logmodel, 
                  type = "response")), 
                  family = "binomial")
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded

## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
logmodeltest<-glm(Turn_YesNo ~ . -fold, 
                  data = cbind(test,prediction = predict.glm(logmodel, 
                  newdata = test, type = "response")), 
                  family = "binomial")
summary(logmodeltrain)
summary(logmodeltest)
pseudoRsqtrain = (1 - (744.9/1085.1))
pseudoRsqtrain
## [1] 0.3135195
pseudoRsqtest = (1 - (185.46 / 271.27))
pseudoRsqtest
## [1] 0.3163269
# For other pseudo-rsquared
PseudoR2(logmodeltrain, c("McFadden", "CoxSnell", "Nagelkerke"))
##   McFadden   CoxSnell Nagelkerke 
##  0.3168722  0.2535113  0.4207297
PseudoR2(logmodeltest, c("McFadden", "CoxSnell", "Nagelkerke"))
##   McFadden   CoxSnell Nagelkerke 
##  0.3193176  0.2551937  0.4235218
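
Classification accuracy in the holdout fold gives another view of fit; a minimal sketch using a .5 probability cutoff (the cutoff is my assumption, not part of the assignment):

# Convert predicted probabilities in the test fold to 0/1 classes and compare to observed turnover
testprob <- predict.glm(logmodel, newdata = test, type = "response")
mean(ifelse(testprob > .5, 1, 0) == test$Turn_YesNo)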

f Model performance



Model performance appears to be slightly better in the test sample across several different pseudo R-squared measures.

5 Elastic Net Logistic Regression Tuned Via Caret



I will standardize the variables, fit an elastic net logistic regression to the training sample with the tuning parameters selected by cross-validation, and then evaluate the model in the test sample.

a and b

traindata<-cbind(as.data.frame(scale(train[,-c(2,21:22)])),train[,c(2,21:22)])
testdata <- cbind(as.data.frame(scale(test[,-c(2,21:22)])),test[,c(2,21:22)])
traindata$Turn_YesNo <- factor(traindata$Turn_YesNo, levels = c(0,1), labels = c("Stay", "Leave"))

set.seed(2033)
folds<- createMultiFolds(traindata$Turn_YesNo, k = 10, times = 1)
Trainspecses <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 3,
                           index = folds,
                           selectionFunction = "best",
                           classProbs = T,
                           summaryFunction = multiClassSummary,
                           verboseIter = TRUE)
set.seed(2033)
ES_fit <- train(x = traindata[,-c(25:26)], y = traindata$Turn_YesNo, "glmnet", 
                family= "binomial", trControl = Trainspecses, 
                metric = "Accuracy", standardize = FALSE, tuneLength = 10)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
Estest<-glm(Turn_YesNo ~ . -fold, data = cbind(test,prediction = predict(ES_fit, 
                newdata = testdata, type = "prob")), family = "binomial")
summary(Estest)
mean(ES_fit$resample$Accuracy)
## [1] 0.8800884
ES_fit$finalModel$tuneValue
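
The coefficients retained at the selected penalty can be pulled from the glmnet object inside the caret fit; a minimal sketch:

# Elastic net coefficients at the tuned lambda (coefficients shrunk to zero were dropped by the penalty)
coef(ES_fit$finalModel, s = ES_fit$bestTune$lambda)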



The mean holdout accuracy at the optimal tuning parameters was approximately .88. The optimal tuning parameters were alpha = .1 and lambda = .003.

c

kbl(round(as.data.frame(Estest$coefficients),3)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| Term | Estest$coefficients |
|---|---|
| (Intercept) | 8.201 |
| Age | -0.050 |
| Male_YesNo | 0.411 |
| Travel_Frequency | 0.050 |
| DistanceFromHome_Miles | 0.012 |
| Education_Level | -0.090 |
| PerformanceRating | -1.855 |
| MonthlyIncome | 0.000 |
| StockOptionLevel | 0.453 |
| Other_Companies_Worked_At | 0.071 |
| TrainingTimesLastYear | -0.014 |
| YearsAtCompany | 0.268 |
| YearsInCurrentRole | -0.240 |
| YearsSinceLastPromotion | 0.141 |
| YearsWithCurrManager | 0.110 |
| Manager_Satisfaction | -0.148 |
| CoWorker_Satisfaction | 0.054 |
| Job_Involvement | -0.210 |
| Work_Environment | -0.031 |
| WorkLifeBalance | -0.264 |
| JobChallenge | 0.004 |
| Sales and Marketing | 0.874 |
| Support | 0.224 |
| Single | 0.773 |
| Divorced | -0.896 |
| prediction.Stay | -2.280 |
| prediction.Leave | NA |

d

pseudoRsqES = (1 - (185.52 / 271.27))
pseudoRsqES
## [1] 0.3161057
# For other pseudo-rsquared
PseudoR2(Estest, c("McFadden", "CoxSnell", "Nagelkerke"))
##   McFadden   CoxSnell Nagelkerke 
##  0.3195030  0.2553211  0.4237332



The pseudo R-squared typically calculated from the deviances in the summary output is McFadden's R-squared. Here that value comes out to .316, essentially the same result as the standard logistic regression, which also produced a pseudo R-squared of .316.

Random Forests

set.seed(2033)
folds<-createMultiFolds(traindata$Turn_YesNo,k=10,times=1)
Trainspecsrf<-trainControl(method="repeatedcv",
                           number=10,
                           repeats=3,
                           index=folds,
                           classProbs = T,
                           summaryFunction=multiClassSummary,
                           search="grid",
                           selectionFunction = "best",
                           verboseIter = TRUE) 
 
rf_fit<-train(Turn_YesNo~.-fold,data=traindata,method="rf",
              metric="Accuracy", trControl=Trainspecsrf,ntree=500,nodesize=5,
              standardize=FALSE)
rf_fit
## Random Forest 
## 
## 1176 samples
##   25 predictor
##    2 classes: 'Stay', 'Leave' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 1058, 1059, 1058, 1058, 1059, 1058, ... 
## Resampling results across tuning parameters:
## 
##   mtry  logLoss    AUC        prAUC      Accuracy   Kappa      F1       
##    2    0.3211937  0.8466876  0.7549885  0.8724540  0.2989548  0.9293604
##   13    0.3784352  0.8262538  0.7358867  0.8750181  0.3954654  0.9292954
##   24    0.3324764  0.8239929  0.7204191  0.8767058  0.4024288  0.9303028
##   Sensitivity  Specificity  Pos_Pred_Value  Neg_Pred_Value  Precision
##   0.9959596    0.2146199    0.8712107       0.9214286       0.8712107
##   0.9767677    0.3330409    0.8865064       0.7546032       0.8865064
##   0.9777778    0.3383041    0.8874342       0.7522908       0.8874342
##   Recall     Detection_Rate  Balanced_Accuracy
##   0.9959596  0.8384543       0.6052897        
##   0.9767677  0.8223091       0.6549043        
##   0.9777778  0.8231494       0.6580409        
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 24.
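
The tuned forest's holdout classification performance can be summarized with caret's confusionMatrix; a minimal sketch (it assumes testdata$Turn_YesNo is still the numeric 0/1 coding, so it is converted to the same Stay/Leave labels used in training):

# Confusion matrix for the random forest predictions in the test fold
rf_class <- predict(rf_fit, newdata = testdata)
confusionMatrix(rf_class,
                factor(testdata$Turn_YesNo, levels = c(0, 1), labels = c("Stay", "Leave")),
                positive = "Leave")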

6 Examining Model Within the Training Sample

a and b

mean(rf_fit$resample$Accuracy)
## [1] 0.8767058
rftest<-glm(Turn_YesNo ~ . -fold, data = cbind(testdata,prediction = predict(rf_fit, 
            newdata = testdata, type = "prob")), family = "binomial")
predtrainrf<-predict(rf_fit, traindata, type = "prob")
library(ltm)
biserial.cor(predtrainrf$Stay,traindata$Turn_YesNo)
## [1] 0.9429721

c

set.seed(2033)
traindata<-rename(traindata, SalesAndMarketing = "Sales and Marketing")
Rf_model<-randomForest(Turn_YesNo~.-fold ,data=traindata,mtry=13,
                       ntree=500,nodesize=5, importance=TRUE)
varImpPlot(Rf_model)



The most important variables according to mean decrease in accuracy are manager satisfaction, performance rating, monthly income, job challenge, and age. According to the Gini index, the most important are monthly income, manager satisfaction, age, performance rating, and job challenge.

7 Examine Model Performance in Test Sample

a

pseudoRsqrftest = (1 - (175.33 / 271.27))
pseudoRsqrftest
## [1] 0.3536698
# For other pseudo-rsquared
PseudoR2(rftest, c("McFadden", "CoxSnell", "Nagelkerke"))
##   McFadden   CoxSnell Nagelkerke 
##  0.3471926  0.2741059  0.4549086



The pseudo R-squared shows a larger gain here than we saw with the elastic net approach: it comes out to .354, compared with roughly .32 for the standard and elastic net logistic regressions. This is not a dramatic improvement, but it does provide some support for using a random forest model in this application.
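
As a complementary check, the biserial correlation computed earlier in the training sample can be recomputed in the holdout fold; a minimal sketch (testdata$Turn_YesNo is converted to the same Stay/Leave factor used in training):

# Biserial correlation between the forest's predicted probability of staying and observed turnover in the test fold
predtestrf <- predict(rf_fit, newdata = testdata, type = "prob")
biserial.cor(predtestrf$Stay,
             factor(testdata$Turn_YesNo, levels = c(0, 1), labels = c("Stay", "Leave")))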

 




A work by Amer Odeh