Data Initialization
library(corrplot)
library(factoextra)
## Loading required package: ggplot2
library(ggvis)
##
## Attaching package: 'ggvis'
## The following object is masked from 'package:ggplot2':
##
## resolution
library(dummies)
## dummies-1.5.6 provided by Decision Patterns
library(ggplot2)
library(cluster)
col1 <- colorRampPalette(c("#7F0000","red","#FF7F00","yellow","white",
"cyan", "#007FFF", "blue","#00007F"))
col2 <- colorRampPalette(c("#67001F", "#B2182B", "#D6604D", "#F4A582", "#FDDBC7",
"#FFFFFF", "#D1E5F0", "#92C5DE", "#4393C3", "#2166AC", "#053061"))
col3 <- colorRampPalette(c("red", "white", "blue"))
col4 <- colorRampPalette(c("#7F0000","red","#FF7F00","yellow","#7FFF7F",
"cyan", "#007FFF", "blue","#00007F"))
###############
HR=read.csv("HR_comma_sep.csv",header=TRUE)
##############
For the purpose of analysis in later step, we will convert all continuous variables into categorical variables: saitsfaction level(5 levels), last evaluation, average monthly hours, and name the new data HR.f
HR.f=data.frame(HR)
HR.f$sales=ifelse(HR$sales=='sales',0,
ifelse(HR$sales=='accounting',1,
(ifelse(HR$sales=='hr',2,
(ifelse(HR$sales=='technical',3,
(ifelse(HR$sales=='support',4,
ifelse(HR$sales=='management',5,
ifelse(HR$sales=='IT',6,
ifelse(HR$sales=='product_mng',7,
ifelse(HR$sales=='marketing',8,9))))))))))))
#replace(HR$sales,c('sales', 'accounting', 'hr', 'technical', 'support', 'management','IT', 'product_mng', 'marketing', 'RandD'),c(0,1,2,3,4,5,6,7,8,9))
HR.f$salary=ifelse(HR$salary=='low',0,ifelse(HR$salary=='medium',1,2))
#HR.f$satisfaction_level=cut(HR$satisfaction_level, seq(0,5,1)/5)
#HR.f$last_evaluation=as.factor(cut(HR$last_evaluation, seq(0,5,1)/5,labels=c(1:5)))
#HR.f$average_montly_hours=as.factor(cut(HR$average_montly_hours, seq(96,330,54),labels=c(1:4)))
HR.f$salary=as.numeric(HR.f$salary)
str(HR.f)
## 'data.frame': 14999 obs. of 10 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
## $ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
## $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
## $ time_spend_company : int 3 6 4 5 3 3 4 5 5 3 ...
## $ Work_accident : int 0 0 0 0 0 0 0 0 0 0 ...
## $ left : int 1 1 1 1 1 1 1 1 1 1 ...
## $ promotion_last_5years: int 0 0 0 0 0 0 0 0 0 0 ...
## $ sales : num 0 0 0 0 0 0 0 0 0 0 ...
## $ salary : num 0 1 1 0 0 0 0 0 0 0 ...
Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to 1.maximize insight into a data set; 2.uncover underlying structure; 3.extract important variables; 4.detect outliers and anomalies; 5.test underlying assumptions; 6.develop parsimonious models; and 7.determine optimal factor settings.
The particular graphical techniques employed in EDA are often quite simple, consisting of various techniques of:
Plotting the raw data (such as data traces, histograms, bihistograms, probability plots, lag plots, block plots, and Youden plots. Plotting simple statistics such as mean plots, standard deviation plots, box plots, and main effects plots of the raw data. Positioning such plots so as to maximize our natural pattern-recognition abilities, such as using multiple plots per page.
First, let’s list out all the factors that would affect the responses(ie: the frequency of leaving a current position):
Intuitively, we expect the existent correlation between: 1 vs (4,5,7,9) 2 vs (3,4,8) 3 vs (4,5,8,9) 4 vs (8) 5 vs (7,8,9) 7 vs (9) However, we don’t know if their relationship are linear or not linear
To see if the relationship between our covariates are linear or not, we will start to inspect the correlation coefficient matrix
corrplot(abs(cor(HR.f)),method='color',type='full',bg='black',title='Correlations',addgrid.col='black',tl.cex=0.6,tl.col='grey')
cor=as.matrix(cor(HR.f))
left_cor=as.matrix(cor[,7],10,2)
By evaluating the correlation coefficient between our interest (ie: whether an employee has left the firm or not) and the covariates, we notice a strong relationship between the outcomes and factors like amount of time spend at the company, and salary and the level of job satisfaction. A further investigation between the reponses vs each covariate will be made later on (after choosing appropriate factors)
time_spend_left=HR$time_spend_company[HR$left==1]
time_spend_nleft=HR$time_spend_company[HR$left==0]
HR %>%
ggvis(~time_spend_left) %>%
layer_histograms(width=1,center=35,fill := "#22E5D3") %>%
add_axis("x", title = "Years spent at the firm")%>%
add_axis("y", title = "Leaving Frequency")
HR %>%
ggvis(~time_spend_nleft) %>%
layer_histograms(width=1,center=35,fill := "#694393") %>%
add_axis("x", title = "Years spent at the firm")%>%
add_axis("y", title = "Staying Frequency")
In another to reduces the number of factors that affect the reponses, we will use PDA to eliminate the less significant factors. In order to do so, we will create a new data frame, called left.HR which only considered data of those who already left the company. left. HR does not include the catgorical variable like Sale. By seperating the data into covariates set and reponses set called: left.HRx, resp.
#Separate the reponse (#left) vs the rest
left.HR=subset(HR,HR$left==1)
left.HR$salary=as.numeric(HR.f$salary[HR.f$left==1])
#left.HR$sales=as.numeric(left.HR$sales)
#left.HR$satisfaction_level=as.numeric(left.HR$satisfaction_level)
#left.HR$last_evaluation=as.numeric(left.HR$last_evaluation)
#left.HR$average_montly_hours=as.numeric(left.HR$average_montly_hours)
left.HRx=left.HR[,c(1:6,8,10)]
resp=left.HR[,7]
str(left.HRx)
## 'data.frame': 3571 obs. of 8 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
## $ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
## $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
## $ time_spend_company : int 3 6 4 5 3 3 4 5 5 3 ...
## $ Work_accident : int 0 0 0 0 0 0 0 0 0 0 ...
## $ promotion_last_5years: int 0 0 0 0 0 0 0 0 0 0 ...
## $ salary : num 0 1 1 0 0 0 0 0 0 0 ...
#covariates.HR=dummy.data.frame(HR1)
#str(covariates.HR)
left.HRx.pca=prcomp(left.HRx,center=TRUE,scale=TRUE)
# HR.train=as.matrix(HR[1:10000,])
# #Dividing data into training set and testing set
# pca.train=new.HR[1:10000,]
# pca.test=new.HR[ 10001:nrow(new.HR),]
# #Choosing our re
# pca=prcomp(pca.train,scale.=T)
# names(pca)
left.HRx.pca$center
## satisfaction_level last_evaluation number_project
## 4.400980e-01 7.181126e-01 3.855503e+00
## average_montly_hours time_spend_company Work_accident
## 2.074192e+02 3.876505e+00 4.732568e-02
## promotion_last_5years salary
## 5.320638e-03 4.147298e-01
left.HRx.pca$scale
## satisfaction_level last_evaluation number_project
## 0.26393344 0.19767336 1.81816537
## average_montly_hours time_spend_company Work_accident
## 61.20282541 0.97769792 0.21236428
## promotion_last_5years salary
## 0.07275859 0.53734100
left.HRx.pca$rotation
## PC1 PC2 PC3 PC4
## satisfaction_level 0.064766948 -0.85199300 0.031004290 -0.016341333
## last_evaluation 0.522095767 -0.06367855 -0.008667552 0.005293749
## number_project 0.493324241 0.31747874 -0.031809385 0.006525901
## average_montly_hours 0.509724188 0.19420707 -0.023656219 0.001097224
## time_spend_company 0.467753449 -0.35887414 0.006383595 -0.008402757
## Work_accident -0.003869928 -0.03861270 -0.686483885 -0.326131422
## promotion_last_5years -0.034960126 -0.01678162 -0.711589354 0.128613204
## salary -0.001794815 0.03131479 0.140470178 -0.936315908
## PC5 PC6 PC7 PC8
## satisfaction_level 0.032646860 0.3517029762 0.146749345 0.349840947
## last_evaluation 0.004860174 0.4330025705 -0.583693714 -0.441638656
## number_project 0.011793738 -0.0168973640 -0.198599828 0.784166766
## average_montly_hours 0.016824837 0.2526277625 0.773379878 -0.199207801
## time_spend_company 0.006664009 -0.7902732828 -0.002848809 -0.166494195
## Work_accident -0.648681810 0.0068351859 0.004623174 0.004008782
## promotion_last_5years 0.689327225 -0.0113478134 -0.009640189 -0.014201167
## salary 0.320131022 -0.0003592198 -0.009248118 -0.005223461
dim(left.HRx.pca$x)
## [1] 3571 8
summary(left.HRx.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.8162 1.1428 1.0273 1.0030 0.9673 0.41439 0.36125
## Proportion of Variance 0.4123 0.1633 0.1319 0.1258 0.1169 0.02146 0.01631
## Cumulative Proportion 0.4123 0.5756 0.7075 0.8333 0.9502 0.97167 0.98798
## PC8
## Standard deviation 0.31010
## Proportion of Variance 0.01202
## Cumulative Proportion 1.00000
PCA 1, 2, 3, 4 account more than 80% of the total variance in the data.
#cumulative scree plot
prop.variance=left.HRx.pca$sdev/cumsum(left.HRx.pca$sdev)^2
plot(cumsum(prop.variance), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
type = "b")
This plot shows that 4 components results in variance close to ~ 80%. Therefore, in this case, we’ll select number of components as 5 [PC1 to PC5] and proceed to the modeling stage. This completes the steps to implement PCA on train data. For modeling, we’ll use these 5 components as predictor variables and follow the normal procedures.
Screeplot
fviz_screeplot(left.HRx.pca,choice='eigenvalue',barfill="#00AFBB",line.col='black')
From the scree plot, we notice that dimension 1 plays sinigicant role in explaining the variations among the responses. Together with PCA2-PCA6,they explain 90% the variance in the data. While 7,8,9 are much less imparative in making senses of our data, and might be considered as noises. Therefore, it is recommended to temporarily ignored the last three factors, and not taking them into account when forming the models.
Biplot
biplot(left.HRx.pca, scale = 0)
fviz_pca_var(left.HRx.pca,col.var = "#00AFBB",
gradient.cols = c("white", "blue", "red"),
ggtheme = theme_minimal())
churn = as.data.frame(predict(left.HRx.pca,newdata=left.HRx))
str(churn$cluster)
## NULL
## Factor w/ 3 levels "1","2","3": 1 2 3 2 1 1 3 2 2 1 ...
ggplot(data=churn,mapping =
aes(x=PC1,y=PC2,color=cluster))+geom_point(shape=1)+ggtitle("Clusters of churned employees(ncluster=3)")+ theme(plot.title = element_text(hjust = 0.6))