Importing the libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tibble)
library(xtable)
library(ggplot2)
PART A
The file was downloaded from the Kaggle website and loaded into mydf:
mydf <- read.csv("C:/Users/keiva/Dropbox (Personal)/GW/06- Fall 2017/01- Programming in business analytics/04- Week 04 ( 21 Sep 2017 )/Assignment 04/HR_comma_sep.csv")
Getting rows and columns
dim(mydf)
## [1] 14999 10
There are 14999 records and 10 attributes in the dataset.
Exploring the data structure:
str(mydf)
## 'data.frame': 14999 obs. of 10 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
## $ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
## $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
## $ time_spend_company : int 3 6 4 5 3 3 4 5 5 3 ...
## $ Work_accident : int 0 0 0 0 0 0 0 0 0 0 ...
## $ left : int 1 1 1 1 1 1 1 1 1 1 ...
## $ promotion_last_5years: int 0 0 0 0 0 0 0 0 0 0 ...
## $ sales : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ salary : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
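There are 2 numeric attributes, 6 integer attributes, and 2 factors.
Now we compute the pairwise correlations of all numeric and integer fields (every column except the factors sales and salary); xtable() renders the result as a LaTeX table: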
xtable(cor(subset(mydf, select = -c(sales, salary))))
## % latex table generated in R 3.4.2 by xtable 1.8-2 package
## % Sun Oct 01 01:14:02 2017
## \begin{table}[ht]
## \centering
## \begin{tabular}{rrrrrrrrr}
## \hline
## & satisfaction\_level & last\_evaluation & number\_project & average\_montly\_hours & time\_spend\_company & Work\_accident & left & promotion\_last\_5years \\
## \hline
## satisfaction\_level & 1.00 & 0.11 & -0.14 & -0.02 & -0.10 & 0.06 & -0.39 & 0.03 \\
## last\_evaluation & 0.11 & 1.00 & 0.35 & 0.34 & 0.13 & -0.01 & 0.01 & -0.01 \\
## number\_project & -0.14 & 0.35 & 1.00 & 0.42 & 0.20 & -0.00 & 0.02 & -0.01 \\
## average\_montly\_hours & -0.02 & 0.34 & 0.42 & 1.00 & 0.13 & -0.01 & 0.07 & -0.00 \\
## time\_spend\_company & -0.10 & 0.13 & 0.20 & 0.13 & 1.00 & 0.00 & 0.14 & 0.07 \\
## Work\_accident & 0.06 & -0.01 & -0.00 & -0.01 & 0.00 & 1.00 & -0.15 & 0.04 \\
## left & -0.39 & 0.01 & 0.02 & 0.07 & 0.14 & -0.15 & 1.00 & -0.06 \\
## promotion\_last\_5years & 0.03 & -0.01 & -0.01 & -0.00 & 0.07 & 0.04 & -0.06 & 1.00 \\
## \hline
## \end{tabular}
## \end{table}
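To read the table with attrition in mind, the `left` row can be ranked by absolute correlation. This is a minimal sketch that assumes the rounded values printed above; the vector name cor_left is introduced here only for illustration.

```r
# Correlations with `left`, copied from the rounded table above
cor_left <- c(
  satisfaction_level    = -0.39,
  last_evaluation       =  0.01,
  number_project        =  0.02,
  average_montly_hours  =  0.07,
  time_spend_company    =  0.14,
  Work_accident         = -0.15,
  promotion_last_5years = -0.06
)

# Order by strength of association, ignoring the sign
ranked <- cor_left[order(abs(cor_left), decreasing = TRUE)]
print(ranked)
```

Satisfaction level stands out; every other correlation with left is 0.15 or less in absolute value.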
To visualize the correlations, we “melt” the correlation matrix into long format, so that each row holds one pair of variables (Var1, Var2) and their correlation (value). Here we calculate the correlation between all predictors (except sales and salary) and assign the melted data to melted_cormat. Then we draw the melted table as a heatmap with ggplot and apply the desired aesthetics.
library(reshape2)
melted_cormat <- melt(cor(subset(mydf, select = -c(sales, salary))))
head(melted_cormat)
## Var1 Var2 value
## 1 satisfaction_level satisfaction_level 1.00000000
## 2 last_evaluation satisfaction_level 0.10502121
## 3 number_project satisfaction_level -0.14296959
## 4 average_montly_hours satisfaction_level -0.02004811
## 5 time_spend_company satisfaction_level -0.10086607
## 6 Work_accident satisfaction_level 0.05869724
g <- ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_raster()
g <- g + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g
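Because correlations range from -1 to 1, a diverging colour scale centred at zero can make the sign easier to read than the default gradient. The sketch below is one possible variant, not the report's method. It assumes the built-in mtcars data as a stand-in for mydf, and it uses base R's as.table() and as.data.frame() to produce the same long layout that melt() gives.

```r
library(ggplot2)

# Stand-in data: mtcars ships with R; the report would use the numeric columns of mydf
cm <- cor(mtcars[, c("mpg", "hp", "wt", "qsec")])

# Long format with columns Var1, Var2 and value, like the melted table above
long_cm <- as.data.frame(as.table(cm), responseName = "value")

heat <- ggplot(long_cm, aes(x = Var1, y = Var2, fill = value)) +
  geom_raster() +
  # Diverging scale: negative = blue, zero = white, positive = red
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
heat
```

Fixing the limits at -1 and 1 keeps colours comparable between heatmaps drawn from different data.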
Now we add another attribute to mydf, named status, which converts the binary left column into a printable string (Left or Stay). Then we print a three-dimensional table showing how many people at each salary level have left or stayed in each department:
mydf$status <- ifelse(mydf$left == 1, "Left", "Stay")
table(mydf$status, mydf$salary, mydf$sales)
## , , = accounting
##
##
## high low medium
## Left 5 99 100
## Stay 69 259 235
##
## , , = hr
##
##
## high low medium
## Left 6 92 117
## Stay 39 243 242
##
## , , = IT
##
##
## high low medium
## Left 4 172 97
## Stay 79 437 438
##
## , , = management
##
##
## high low medium
## Left 1 59 31
## Stay 224 121 194
##
## , , = marketing
##
##
## high low medium
## Left 9 126 68
## Stay 71 276 308
##
## , , = product_mng
##
##
## high low medium
## Left 6 105 87
## Stay 62 346 296
##
## , , = RandD
##
##
## high low medium
## Left 4 55 62
## Stay 47 309 310
##
## , , = sales
##
##
## high low medium
## Left 14 697 303
## Stay 255 1402 1469
##
## , , = support
##
##
## high low medium
## Left 8 389 158
## Stay 133 757 784
##
## , , = technical
##
##
## high low medium
## Left 25 378 294
## Stay 176 994 853
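Raw counts favour large departments, so it can help to convert them to shares. The sketch below sums the printed table over salary levels and applies prop.table(). The counts are taken from the output above; the object names (left_n, stay_n, counts) are introduced here for illustration.

```r
departments <- c("accounting", "hr", "IT", "management", "marketing",
                 "product_mng", "RandD", "sales", "support", "technical")

# Left / Stay totals per department, summed over high + low + medium above
left_n <- c(204, 215, 273,  91, 203, 198, 121, 1014,  555,  697)
stay_n <- c(563, 524, 954, 539, 655, 704, 666, 3126, 1674, 2023)

counts <- rbind(Left = left_n, Stay = stay_n)
colnames(counts) <- departments

# margin = 2 divides each column (department) by its own total
rates <- prop.table(counts, margin = 2)

# Share of each department's employees who left, highest first
leave_rate <- sort(rates["Left", ], decreasing = TRUE)
round(leave_rate, 3)
```

With the data loaded, the same shares should follow directly from prop.table(table(mydf$status, mydf$sales), margin = 2).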
Here we plot the number of employees who left the company, by department:
ggplot(mydf[ which(mydf$status=='Left'),],
aes(x = sales, fill = status)) + geom_bar() + guides(fill=FALSE) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
And we plot the number of employees who stayed in the company, by department:
ggplot(mydf[ which(mydf$status=='Stay'),],
aes( x = sales, fill = status)) + geom_bar(fill="blue") + guides(fill=FALSE) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
PART B
Now we plot the number of employees who left the company, by salary level:
ggplot(mydf[ which(mydf$status=='Left'),],
aes( x = salary, fill = status)) + geom_bar(fill="green") + guides(fill=FALSE) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Then we plot the number of employees who stayed in the company, by salary level:
ggplot(mydf[ which(mydf$status=='Stay'),],
aes( x = salary, fill = status)) + geom_bar(fill="yellow") + guides(fill=FALSE) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
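The separate Left and Stay charts can also be combined into one chart of shares, which makes salary levels of different sizes directly comparable. This is a hedged sketch on hypothetical toy data (the toy data frame is invented for illustration and is not the HR dataset). It uses geom_bar(position = "fill") to scale each bar to 1.

```r
library(ggplot2)

# Hypothetical toy data with the same column names as mydf
toy <- data.frame(
  salary = c(rep("low", 6), rep("medium", 5), rep("high", 4)),
  status = c("Left", "Left", "Stay", "Stay", "Stay", "Stay",
             "Left", "Stay", "Stay", "Stay", "Stay",
             "Stay", "Stay", "Stay", "Stay")
)

# position = "fill" stacks the bars and rescales each one to a total of 1
p_share <- ggplot(toy, aes(x = salary, fill = status)) +
  geom_bar(position = "fill") +
  ylab("share of employees")
p_share
```

Here the fill legend is kept, since it is what tells Left apart from Stay.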
Here we chart the number of employees who left the company by time spent in the company, faceted by salary level and department. We also save it to a file in the working directory with ggsave(), which gives a better view:
p_left_time <- ggplot(mydf[ which(mydf$status=='Left'),],
       aes( x = time_spend_company, fill = status)) + geom_bar(fill="green") + guides(fill=FALSE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + facet_grid(salary~sales)
p_left_time  # display the plot
# Pass the plot explicitly; by default ggsave() saves the last plot displayed
ggsave("plot.png", plot = p_left_time, width = 10, height = 5)
Now we chart the number of employees who stayed in the company by time spent in the company, faceted by salary level and department:
p_stay_time <- ggplot(mydf[ which(mydf$status=='Stay'),],
       aes( x = time_spend_company, fill = status)) + geom_bar(fill="blue") + guides(fill=FALSE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + facet_grid(salary~sales)
p_stay_time  # display the plot
# Pass the plot explicitly; by default ggsave() saves the last plot displayed
ggsave("plot2.png", plot = p_stay_time, width = 10, height = 5)
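A note on saving: ggsave() is a standalone function, not a plot layer, and its plot argument defaults to last_plot(), the most recently displayed plot. Storing the chart in a variable and passing it explicitly avoids saving the wrong figure. The sketch below assumes hypothetical toy data and writes to a temporary file; the names toy, p and out_file are for illustration only.

```r
library(ggplot2)

# Hypothetical toy data with the same column names as mydf
toy <- data.frame(
  time_spend_company = c(2, 3, 3, 4, 5, 3, 4, 6),
  salary = c("low", "low", "medium", "medium", "high", "low", "medium", "low"),
  sales  = c("hr", "IT", "hr", "IT", "hr", "IT", "hr", "IT")
)

# Build the faceted chart and keep it in a variable
p <- ggplot(toy, aes(x = time_spend_company)) +
  geom_bar() +
  facet_grid(salary ~ sales)

# Save exactly this plot; width and height are in inches by default
out_file <- file.path(tempdir(), "toy_plot.png")
ggsave(out_file, plot = p, width = 10, height = 5)
```

A wide canvas such as 10 by 5 inches gives each of the facet columns more room than the default plotting window.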
PART C
We draw a bar plot of how many years employees stayed in the company before leaving, with years spent in the company on the x axis:
ggplot(mydf[ which(mydf$status=='Left'),],
aes( x = time_spend_company, fill = status)) + geom_bar(fill="green") + guides(fill=FALSE) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Next we draw the same bar plot by department, again with years spent in the company on the x axis. To do this, we add a facet_grid layer to the ggplot and use ggsave to save the chart in the working directory for a better view:
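p_left_dept <- ggplot(mydf[ which(mydf$status=='Left'),],
       aes( x = time_spend_company, fill = status)) + geom_bar(fill="green") + guides(fill=FALSE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + facet_grid(~sales)
p_left_dept  # display the plot
# Pass the plot explicitly; by default ggsave() saves the last plot displayed
ggsave("plot3.png", plot = p_left_dept, width = 10, height = 5)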
Finally, we draw a bar plot of how many years employees stayed in the company before leaving, by salary level, with years spent in the company on the x axis:
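p_left_salary <- ggplot(mydf[ which(mydf$status=='Left'),],
       aes( x = time_spend_company, fill = status)) + geom_bar(fill="green") + guides(fill=FALSE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + facet_grid(~salary)
p_left_salary  # display the plot
# Pass the plot explicitly; by default ggsave() saves the last plot displayed
ggsave("plot4.png", plot = p_left_salary, width = 10, height = 5)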
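Analysis: The plots in Part C show that salary level is strongly associated with leaving the company: the lower the salary, the more employees leave. Summing the table in Part A over departments, 2172 of 7316 low-salary employees left (about 30%), against 1317 of 6446 medium-salary employees (about 20%) and 82 of 1237 high-salary employees (about 7%). Employees most often leave after three years, and the number of leavers decreases beyond the third year. The sales, technical, and support departments have the highest numbers of leavers, in that order. These are counts rather than rates, and they largely reflect department size: as a share of headcount, the same table gives the highest leaving rates for hr (215 of 739, about 29%) and accounting (204 of 767, about 27%).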