Data Initialization

library(corrplot)
library(factoextra)
## Loading required package: ggplot2
library(ggvis)
## 
## Attaching package: 'ggvis'
## The following object is masked from 'package:ggplot2':
## 
##     resolution
library(dummies)
## dummies-1.5.6 provided by Decision Patterns
library(ggplot2)
library(cluster)
# Color palettes for use in the plots below
col1 <- colorRampPalette(c("#7F0000","red","#FF7F00","yellow","white", 
        "cyan", "#007FFF", "blue","#00007F"))
col2 <- colorRampPalette(c("#67001F", "#B2182B", "#D6604D", "#F4A582", "#FDDBC7",
            "#FFFFFF", "#D1E5F0", "#92C5DE", "#4393C3", "#2166AC", "#053061"))  
col3 <- colorRampPalette(c("red", "white", "blue")) 
col4 <- colorRampPalette(c("#7F0000","red","#FF7F00","yellow","#7FFF7F", 
            "cyan", "#007FFF", "blue","#00007F"))   
###############
# Read the raw HR dataset
HR=read.csv("HR_comma_sep.csv",header=TRUE)
##############
##############

For the purpose of the analysis in later steps, we recode the categorical variables (the department, stored in the sales column, and salary) as numeric codes, and name the new data frame HR.f. Code to bin the continuous variables satisfaction level (5 levels), last evaluation, and average monthly hours into categories is kept below, commented out, for reference.

HR.f=data.frame(HR)
# Encode department (the `sales` column) as integer codes 0-9
dept.codes=c(sales=0, accounting=1, hr=2, technical=3, support=4,
             management=5, IT=6, product_mng=7, marketing=8, RandD=9)
HR.f$sales=unname(dept.codes[as.character(HR$sales)])
# Encode salary: low = 0, medium = 1, high = 2
HR.f$salary=ifelse(HR$salary=='low',0,ifelse(HR$salary=='medium',1,2))
#HR.f$satisfaction_level=cut(HR$satisfaction_level, seq(0,5,1)/5)
#HR.f$last_evaluation=as.factor(cut(HR$last_evaluation, seq(0,5,1)/5,labels=c(1:5)))
#HR.f$average_montly_hours=as.factor(cut(HR$average_montly_hours, seq(96,330,54),labels=c(1:4)))
HR.f$salary=as.numeric(HR.f$salary)
str(HR.f)
## 'data.frame':    14999 obs. of  10 variables:
##  $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ time_spend_company   : int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ left                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sales                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ salary               : num  0 1 1 0 0 0 0 0 0 0 ...
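
As a quick sanity check (a small sketch, not part of the original pipeline), a cross-tabulation confirms the encoding of salary:

# Sanity check: original salary labels vs. the new numeric codes
table(HR$salary, HR.f$salary)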

EDA

Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to:

  1. maximize insight into a data set;
  2. uncover underlying structure;
  3. extract important variables;
  4. detect outliers and anomalies;
  5. test underlying assumptions;
  6. develop parsimonious models; and
  7. determine optimal factor settings.

The particular graphical techniques employed in EDA are often quite simple, consisting of:

  1. Plotting the raw data (data traces, histograms, bihistograms, probability plots, lag plots, block plots, and Youden plots).
  2. Plotting simple statistics of the raw data (mean plots, standard deviation plots, box plots, and main effects plots).
  3. Positioning such plots so as to maximize our natural pattern-recognition abilities, for example by using multiple plots per page.
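
As a quick illustration of these ideas on our data, a minimal base-R sketch (the choice of variables is illustrative):

# Raw-data plots: satisfaction distribution, and monthly hours split by outcome
hist(HR$satisfaction_level, main="Satisfaction level", xlab="satisfaction_level")
boxplot(average_montly_hours ~ left, data=HR,
        xlab="left (0 = stayed, 1 = left)", ylab="average_montly_hours")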


First, let’s list all the factors that may affect the response (i.e., whether an employee leaves their current position):

  1. Satisfaction level
  2. Last evaluation
  3. Number of projects
  4. Average monthly hours
  5. Time spent at the company
  6. Work accident
  7. Promotion in the last 5 years
  8. Type of work
  9. Salary

Intuitively, we expect correlations between: 1 vs. (4, 5, 7, 9); 2 vs. (3, 4, 8); 3 vs. (4, 5, 8, 9); 4 vs. (8); 5 vs. (7, 8, 9); and 7 vs. (9). However, we do not know whether these relationships are linear.

To get a first look at the (linear) relationships among our covariates, we start by inspecting the correlation coefficient matrix.

Evaluating Correlation

# Heatmap of absolute pairwise correlations
corrplot(abs(cor(HR.f)),method='color',type='full',bg='black',title='Correlations',addgrid.col='black',tl.cex=0.6,tl.col='grey')

# Correlation matrix, and the correlations of each variable with `left`
cor.mat=cor(HR.f)
left_cor=as.matrix(cor.mat[,"left"])

By evaluating the correlation coefficients between our variable of interest (whether an employee has left the firm) and the covariates, we notice comparatively strong relationships between the outcome and factors such as the amount of time spent at the company, salary, and the level of job satisfaction. A further investigation of the response against each covariate will be made later on (after choosing appropriate factors).
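
To make this concrete, a one-line sketch that ranks the covariates by their absolute correlation with left (dropping the trivial self-correlation):

# Rank covariates by |correlation| with the response `left`
sort(abs(left_cor[rownames(left_cor)!="left",]), decreasing=TRUE)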

  1. Years spent at the company vs. the response
# Split tenure (years at the company) by outcome
time_spend_left=HR$time_spend_company[HR$left==1]    # employees who left
time_spend_nleft=HR$time_spend_company[HR$left==0]   # employees who stayed
# Histogram of tenure for employees who left
HR %>%
  ggvis(~time_spend_left) %>% 
  layer_histograms(width=1,center=35,fill := "#22E5D3") %>%  
  add_axis("x", title = "Years spent at the firm") %>%  
  add_axis("y", title = "Leaving Frequency")
# Histogram of tenure for employees who stayed
HR %>%
  ggvis(~time_spend_nleft) %>% 
  layer_histograms(width=1,center=35,fill := "#694393") %>%  
  add_axis("x", title = "Years spent at the firm") %>%  
  add_axis("y", title = "Staying Frequency")

We see a few outliers. Looking at the two histograms, the tenure distributions of employees who left and of those currently staying are both skewed to the right. We can use logistic regression later to examine the rate of quitting in each year. We also note that it is reasonable to assume the data is not time-homogeneous; dividing the time axis into sub-intervals will therefore allow a more accurate analysis.
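
As a preview of that idea (a minimal sketch, not the later analysis itself), such a model can be fit with glm:

# Sketch: logistic regression of the leaving indicator on tenure
fit.tenure = glm(left ~ time_spend_company, data=HR, family=binomial)
coef(summary(fit.tenure))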

PCA (Principal Component Analysis)

In order to reduce the number of factors that affect the responses, we will use PCA to eliminate the less significant ones. To do so, we create a new data frame, left.HR, containing only the employees who have already left the company, and then split it into a covariate set, left.HRx (which drops the response left and the categorical variable sales), and a response set, resp.

#Separate the response (left) from the covariates
left.HR=subset(HR,HR$left==1)
left.HR$salary=as.numeric(HR.f$salary[HR.f$left==1])
#left.HR$sales=as.numeric(left.HR$sales)
#left.HR$satisfaction_level=as.numeric(left.HR$satisfaction_level)
#left.HR$last_evaluation=as.numeric(left.HR$last_evaluation)
#left.HR$average_montly_hours=as.numeric(left.HR$average_montly_hours)
left.HRx=left.HR[,c(1:6,8,10)]  # covariates: drop `left` (col 7) and `sales` (col 9)
resp=left.HR[,7]                # response: left
str(left.HRx)
## 'data.frame':    3571 obs. of  8 variables:
##  $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ time_spend_company   : int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ salary               : num  0 1 1 0 0 0 0 0 0 0 ...
#covariates.HR=dummy.data.frame(HR1)
#str(covariates.HR)
# PCA on the centered and scaled covariates of the churned employees
left.HRx.pca=prcomp(left.HRx,center=TRUE,scale.=TRUE)
# HR.train=as.matrix(HR[1:10000,])
# #Dividing data into training set and testing set
# pca.train=new.HR[1:10000,]
# pca.test=new.HR[ 10001:nrow(new.HR),]
# #Choosing our re
# pca=prcomp(pca.train,scale.=T)
# names(pca)
  1. center and scale refer to the means and standard deviations of the variables, which are used for normalization prior to implementing PCA:
left.HRx.pca$center
##    satisfaction_level       last_evaluation        number_project 
##          4.400980e-01          7.181126e-01          3.855503e+00 
##  average_montly_hours    time_spend_company         Work_accident 
##          2.074192e+02          3.876505e+00          4.732568e-02 
## promotion_last_5years                salary 
##          5.320638e-03          4.147298e-01
left.HRx.pca$scale
##    satisfaction_level       last_evaluation        number_project 
##            0.26393344            0.19767336            1.81816537 
##  average_montly_hours    time_spend_company         Work_accident 
##           61.20282541            0.97769792            0.21236428 
## promotion_last_5years                salary 
##            0.07275859            0.53734100
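
As a sanity check (a short sketch using only objects already computed), the PC scores in left.HRx.pca$x can be reproduced by standardizing the data with these centers and scales and multiplying by the rotation matrix shown next:

# Reproduce the PC scores from center, scale, and the rotation matrix
scaled.x = scale(left.HRx, center=left.HRx.pca$center, scale=left.HRx.pca$scale)
all.equal(scaled.x %*% left.HRx.pca$rotation, left.HRx.pca$x,
          check.attributes=FALSE)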
  1. The rotation matrix provides the principal component loadings: each of its columns contains the loading vector for one principal component. This is the measure we are most interested in.
left.HRx.pca$rotation
##                                PC1         PC2          PC3          PC4
## satisfaction_level     0.064766948 -0.85199300  0.031004290 -0.016341333
## last_evaluation        0.522095767 -0.06367855 -0.008667552  0.005293749
## number_project         0.493324241  0.31747874 -0.031809385  0.006525901
## average_montly_hours   0.509724188  0.19420707 -0.023656219  0.001097224
## time_spend_company     0.467753449 -0.35887414  0.006383595 -0.008402757
## Work_accident         -0.003869928 -0.03861270 -0.686483885 -0.326131422
## promotion_last_5years -0.034960126 -0.01678162 -0.711589354  0.128613204
## salary                -0.001794815  0.03131479  0.140470178 -0.936315908
##                                PC5           PC6          PC7          PC8
## satisfaction_level     0.032646860  0.3517029762  0.146749345  0.349840947
## last_evaluation        0.004860174  0.4330025705 -0.583693714 -0.441638656
## number_project         0.011793738 -0.0168973640 -0.198599828  0.784166766
## average_montly_hours   0.016824837  0.2526277625  0.773379878 -0.199207801
## time_spend_company     0.006664009 -0.7902732828 -0.002848809 -0.166494195
## Work_accident         -0.648681810  0.0068351859  0.004623174  0.004008782
## promotion_last_5years  0.689327225 -0.0113478134 -0.009640189 -0.014201167
## salary                 0.320131022 -0.0003592198 -0.009248118 -0.005223461
dim(left.HRx.pca$x)
## [1] 3571    8
summary(left.HRx.pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6     PC7
## Standard deviation     1.8162 1.1428 1.0273 1.0030 0.9673 0.41439 0.36125
## Proportion of Variance 0.4123 0.1633 0.1319 0.1258 0.1169 0.02146 0.01631
## Cumulative Proportion  0.4123 0.5756 0.7075 0.8333 0.9502 0.97167 0.98798
##                            PC8
## Standard deviation     0.31010
## Proportion of Variance 0.01202
## Cumulative Proportion  1.00000

PC1 to PC4 together account for more than 80% of the total variance in the data.
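
The cumulative proportions can also be read off numerically (a one-line check):

# Cumulative proportion of variance explained, computed from sdev
round(cumsum(left.HRx.pca$sdev^2)/sum(left.HRx.pca$sdev^2), 4)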

  1. Plotting the resultant principal components:
#cumulative scree plot
# Proportion of variance explained = sdev^2 / total variance
prop.variance=left.HRx.pca$sdev^2/sum(left.HRx.pca$sdev^2)
plot(cumsum(prop.variance), xlab = "Principal Component",
              ylab = "Cumulative Proportion of Variance Explained",
              type = "b")

This plot shows that the first four components explain roughly 83% of the variance, and the first five about 95%. Therefore, in this case, we’ll select five components [PC1 to PC5] and proceed to the modeling stage. This completes the steps to implement PCA on the churned-employee data. For modeling, we’ll use these five components as predictor variables and follow the normal procedures.
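
A minimal sketch of carrying those components forward (the name pc.predictors is illustrative):

# Keep the first five PC scores as predictors for the modeling stage
pc.predictors = as.data.frame(left.HRx.pca$x[,1:5])
str(pc.predictors)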

Screeplot

fviz_screeplot(left.HRx.pca,choice='eigenvalue',barfill="#00AFBB",linecolor='black')

From the scree plot, we notice that dimension 1 plays a significant role in explaining the variation among the responses. Together with PC2 to PC5, it explains about 95% of the variance in the data, while the last three components (PC6 to PC8) are much less informative and might be considered noise. It is therefore reasonable to temporarily ignore these last three components and not take them into account when forming the models.

Biplot

biplot(left.HRx.pca, scale = 0)

fviz_pca_var(left.HRx.pca,col.var = "#00AFBB", 
   gradient.cols = c("white", "blue", "red"),
   ggtheme = theme_minimal())

  1. Clustering the churned employees in PC space
# Project the churned employees onto the principal components
churn = as.data.frame(predict(left.HRx.pca,newdata=left.HRx))
# A clustering step is needed to create the `cluster` labels used below;
# k-means with k = 3 on the first five PCs is assumed here
set.seed(123)
churn$cluster = as.factor(kmeans(churn[,1:5],centers=3,nstart=25)$cluster)
str(churn$cluster)
##  Factor w/ 3 levels "1","2","3": 1 2 3 2 1 1 3 2 2 1 ...
ggplot(data=churn,mapping =
         aes(x=PC1,y=PC2,color=cluster))+geom_point(shape=1)+ggtitle("Clusters of churned employees (k = 3)")+ theme(plot.title = element_text(hjust = 0.5))
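
As a follow-up sketch (using the cluster package loaded at the top), the average silhouette width gives a rough check on how well separated the three clusters are in the plotted PC1/PC2 plane:

# Rough cluster-quality check: average silhouette width in the PC1/PC2 plane
sil = silhouette(as.integer(churn$cluster), dist(churn[,c("PC1","PC2")]))
summary(sil)$avg.width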