Importing the libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tibble)
library(xtable)
library(ggplot2)

PART A

The file was downloaded from the Kaggle website and loaded into mydf:

mydf <- read.csv("C:/Users/keiva/Dropbox (Personal)/GW/06- Fall 2017/01- Programming in business analytics/04- Week 04 ( 21 Sep 2017 )/Assignment 04/HR_comma_sep.csv")
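On R 4.0.0 and later, read.csv() no longer converts strings to factors by default, so sales and salary would load as character columns unless stringsAsFactors = TRUE is passed. A minimal sketch, assuming a hypothetical three-row file written to a temporary path (the rows are made up for illustration and are not the Kaggle data):

```r
# Hypothetical miniature of the HR file, written to a temporary path.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("satisfaction_level,left,sales,salary",
             "0.38,1,sales,low",
             "0.80,1,sales,medium",
             "0.72,0,hr,high"), csv_path)

# stringsAsFactors = TRUE reproduces the pre-4.0 default that this report relies on.
toy <- read.csv(csv_path, stringsAsFactors = TRUE)
str(toy)
```

With this argument, the text columns come back as factors on any R version.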

Getting rows and columns

dim(mydf)
## [1] 14999    10

There are 14999 records and 10 attributes in the dataset.

Exploring the data structure:

str(mydf)
## 'data.frame':    14999 obs. of  10 variables:
##  $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ time_spend_company   : int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ left                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sales                : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ salary               : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...

There are 2 numeric attributes, 6 integer attributes, and 2 factors.
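The same tally can be produced programmatically instead of by reading the str() output. A small sketch on a hypothetical three-column data frame (toy is an assumption standing in for mydf):

```r
# Hypothetical stand-in for mydf with one column of each type.
toy <- data.frame(
  satisfaction_level = c(0.38, 0.80, 0.11),
  number_project     = c(2L, 5L, 7L),
  salary             = factor(c("low", "medium", "medium"))
)

# class() of every column, then a count per class.
col_types   <- sapply(toy, class)
type_counts <- table(col_types)
print(type_counts)
```

Running sapply(mydf, class) in the same way on the real data should give the counts stated above.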

Now we correlate all numeric fields:

xtable(cor(subset(mydf, select = -c(sales, salary))))
## % latex table generated in R 3.4.2 by xtable 1.8-2 package
## % Sun Oct 01 01:14:02 2017
## \begin{table}[ht]
## \centering
## \begin{tabular}{rrrrrrrrr}
##   \hline
##  & satisfaction\_level & last\_evaluation & number\_project & average\_montly\_hours & time\_spend\_company & Work\_accident & left & promotion\_last\_5years \\ 
##   \hline
## satisfaction\_level & 1.00 & 0.11 & -0.14 & -0.02 & -0.10 & 0.06 & -0.39 & 0.03 \\ 
##   last\_evaluation & 0.11 & 1.00 & 0.35 & 0.34 & 0.13 & -0.01 & 0.01 & -0.01 \\ 
##   number\_project & -0.14 & 0.35 & 1.00 & 0.42 & 0.20 & -0.00 & 0.02 & -0.01 \\ 
##   average\_montly\_hours & -0.02 & 0.34 & 0.42 & 1.00 & 0.13 & -0.01 & 0.07 & -0.00 \\ 
##   time\_spend\_company & -0.10 & 0.13 & 0.20 & 0.13 & 1.00 & 0.00 & 0.14 & 0.07 \\ 
##   Work\_accident & 0.06 & -0.01 & -0.00 & -0.01 & 0.00 & 1.00 & -0.15 & 0.04 \\ 
##   left & -0.39 & 0.01 & 0.02 & 0.07 & 0.14 & -0.15 & 1.00 & -0.06 \\ 
##   promotion\_last\_5years & 0.03 & -0.01 & -0.01 & -0.00 & 0.07 & 0.04 & -0.06 & 1.00 \\ 
##    \hline
## \end{tabular}
## \end{table}
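Rather than excluding sales and salary by name, the numeric columns can be selected by type before calling cor(). A minimal sketch, assuming a hypothetical five-row data frame in place of mydf:

```r
# Hypothetical stand-in for mydf: three numeric columns and one factor.
toy <- data.frame(
  satisfaction_level = c(0.38, 0.80, 0.11, 0.72, 0.37),
  number_project     = c(2L, 5L, 7L, 5L, 2L),
  left               = c(1L, 0L, 1L, 0L, 1L),
  salary             = factor(c("low", "medium", "medium", "low", "low"))
)

# Keep only the columns for which is.numeric() is TRUE, then correlate them.
is_num  <- sapply(toy, is.numeric)
cor_mat <- round(cor(toy[, is_num]), 2)
print(cor_mat)
```

This keeps working if further non-numeric columns are added to the data frame later.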

We “melt” the correlation matrix into long format, so that each row holds one pair of variables (Var1, Var2) and their correlation (value). Here we calculate the correlation of all predictors (except sales and salary) with each other and assign the melted result to melted_cormat. Then we draw the melted table as a heat map using ggplot and apply the desired aesthetics.

library(reshape2)
melted_cormat <- melt(cor(subset(mydf, select = -c(sales, salary))))
head(melted_cormat)
##                   Var1               Var2       value
## 1   satisfaction_level satisfaction_level  1.00000000
## 2      last_evaluation satisfaction_level  0.10502121
## 3       number_project satisfaction_level -0.14296959
## 4 average_montly_hours satisfaction_level -0.02004811
## 5   time_spend_company satisfaction_level -0.10086607
## 6        Work_accident satisfaction_level  0.05869724
g <- ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) + 
  geom_raster()
g <- g + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g
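To see what melting does to a correlation matrix, the following sketch reshapes a 2 x 2 matrix using base R only. Here as.data.frame(as.table()) is a stand-in that is analogous to melt(), not the reshape2 call itself, and the matrix m is a hypothetical excerpt that reuses the 0.35 correlation between last_evaluation and number_project reported above:

```r
# Hypothetical 2 x 2 excerpt of the correlation matrix.
vars <- c("last_evaluation", "number_project")
m <- matrix(c(1.00, 0.35, 0.35, 1.00), nrow = 2,
            dimnames = list(vars, vars))

# Wide matrix -> long data frame with one row per (Var1, Var2) pair.
long <- as.data.frame(as.table(m), responseName = "value")
print(long)
```

Each of the four matrix cells becomes one row, which is the shape geom_raster() needs for its x, y and fill aesthetics.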

Now we add another attribute to mydf, named status, which converts the binary left column into a readable string (Left or Stay). Then we print a three-dimensional table showing how many people at each salary level have left or stayed in each department:

mydf$status <- ifelse(mydf$left == 1, "Left", "Stay")
table(mydf$status,  mydf$salary, mydf$sales)
## , ,  = accounting
## 
##       
##        high  low medium
##   Left    5   99    100
##   Stay   69  259    235
## 
## , ,  = hr
## 
##       
##        high  low medium
##   Left    6   92    117
##   Stay   39  243    242
## 
## , ,  = IT
## 
##       
##        high  low medium
##   Left    4  172     97
##   Stay   79  437    438
## 
## , ,  = management
## 
##       
##        high  low medium
##   Left    1   59     31
##   Stay  224  121    194
## 
## , ,  = marketing
## 
##       
##        high  low medium
##   Left    9  126     68
##   Stay   71  276    308
## 
## , ,  = product_mng
## 
##       
##        high  low medium
##   Left    6  105     87
##   Stay   62  346    296
## 
## , ,  = RandD
## 
##       
##        high  low medium
##   Left    4   55     62
##   Stay   47  309    310
## 
## , ,  = sales
## 
##       
##        high  low medium
##   Left   14  697    303
##   Stay  255 1402   1469
## 
## , ,  = support
## 
##       
##        high  low medium
##   Left    8  389    158
##   Stay  133  757    784
## 
## , ,  = technical
## 
##       
##        high  low medium
##   Left   25  378    294
##   Stay  176  994    853
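The three-way table prints one slice per department with an empty slice header. As a sketch of a more compact view, dnn can name the dimensions and ftable() can flatten them into a single grid; the six-row toy data frame below is a hypothetical stand-in for mydf:

```r
# Hypothetical stand-in for the status, salary and sales columns of mydf.
toy <- data.frame(
  status = c("Left", "Stay", "Stay", "Left", "Stay", "Stay"),
  salary = c("low", "low", "medium", "medium", "high", "low"),
  sales  = c("hr", "hr", "hr", "IT", "IT", "IT")
)

# dnn labels the three dimensions so the printed output is self-describing.
tab3 <- table(toy$status, toy$salary, toy$sales,
              dnn = c("status", "salary", "sales"))

# One row per department/status pair, one column per salary level.
flat <- ftable(tab3, row.vars = c("sales", "status"))
print(flat)
```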

Here we plot the number of employees who left, by department:

ggplot(mydf[ which(mydf$status=='Left'),], 
       aes(x = sales, fill = status)) + geom_bar() + guides(fill=FALSE) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

And we plot the employees who stayed in the company, by department:

ggplot(mydf[ which(mydf$status=='Stay'),], 
       aes( x = sales, fill = status)) + geom_bar(fill="blue") + guides(fill=FALSE) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
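As an alternative to two separate charts, both groups can be drawn side by side in one plot so that departments are easier to compare. A sketch on a hypothetical six-row data frame (toy stands in for mydf; the dodge layout is our choice, not part of the original assignment):

```r
library(ggplot2)

# Hypothetical stand-in for the sales and status columns of mydf.
toy <- data.frame(
  sales  = c("hr", "hr", "IT", "IT", "IT", "sales"),
  status = c("Left", "Stay", "Stay", "Stay", "Left", "Stay")
)

# position = "dodge" places the Left and Stay bars next to each other.
p <- ggplot(toy, aes(x = sales, fill = status)) +
  geom_bar(position = "dodge") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
p
```

The legend is kept here because fill now distinguishes the two groups.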

PART B

Now we plot the employees who left the company by salary level:

ggplot(mydf[ which(mydf$status=='Left'),], 
       aes( x = salary, fill = status)) + geom_bar(fill="green") + guides(fill=FALSE) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Then we plot the employees who stayed in the company by salary level:

ggplot(mydf[ which(mydf$status=='Stay'),], 
       aes( x = salary, fill = status)) + geom_bar(fill="yellow") + guides(fill=FALSE) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Here we chart the number of employees who left the company by time spent in the company, faceted by salary level and department. We also save it to a file in the working directory with ggsave(), which gives a better view:

p <- ggplot(mydf[ which(mydf$status=='Left'),], 
       aes( x = time_spend_company, fill = status)) + geom_bar(fill="green") + guides(fill=FALSE) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + facet_grid(salary~sales)
p
# ggsave() is called on its own; newer ggplot2 versions reject `+ ggsave()`
ggsave("plot.png", plot = p, width = 10, height = 5)

Now we chart the number of employees who stayed in the company by time spent in the company, faceted by salary level and department:

p <- ggplot(mydf[ which(mydf$status=='Stay'),], 
       aes( x = time_spend_company, fill = status)) + geom_bar(fill="blue") + guides(fill=FALSE) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + facet_grid(salary~sales)
p
# ggsave() is called on its own; newer ggplot2 versions reject `+ ggsave()`
ggsave("plot2.png", plot = p, width = 10, height = 5)

PART C

We draw a bar plot of how many years employees stayed in the company before leaving, with years spent in the company on the x axis:

ggplot(mydf[ which(mydf$status=='Left'),], 
       aes( x = time_spend_company, fill = status)) + geom_bar(fill="green") + guides(fill=FALSE) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

And we draw a bar plot of how many years employees stayed in the company before leaving, by department, with years spent in the company on the x axis. To do this, we add a facet_grid layer to the ggplot and use ggsave to save the plot in the working directory for a better view:

p <- ggplot(mydf[ which(mydf$status=='Left'),], 
       aes( x = time_spend_company, fill = status)) + geom_bar(fill="green") + guides(fill=FALSE) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+facet_grid(~sales)
p
# ggsave() is called on its own; newer ggplot2 versions reject `+ ggsave()`
ggsave("plot3.png", plot = p, width = 10, height = 5)

Finally, we draw a bar plot of how many years employees stayed in the company before leaving, by salary level, with years spent in the company on the x axis:

p <- ggplot(mydf[ which(mydf$status=='Left'),], 
       aes( x = time_spend_company, fill = status)) + geom_bar(fill="green") + guides(fill=FALSE) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+facet_grid(~salary)
p
# ggsave() is called on its own; newer ggplot2 versions reject `+ ggsave()`
ggsave("plot4.png", plot = p, width = 10, height = 5)

Analysis: The plots in Part C suggest that salary level is associated with leaving the company: the lower the salary level, the greater the number of employees who leave. Employees most often leave after three years, and the tendency decreases beyond the third year. The sales, technical and support departments have the highest numbers of departures, in that order. These are counts rather than rates, and the same three departments are also the largest.
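Because departments differ in size, the counts in the three-way table can also be converted to leaving rates. A sketch for four departments, using Left and Stay totals that we summed by hand across salary levels from that table (the matrix counts is built here for illustration and is not part of the original analysis):

```r
# Left/Stay totals per department, summed across salary levels from the table.
counts <- rbind(
  Left = c(hr = 215, sales = 1014, technical = 697, support = 555),
  Stay = c(hr = 524, sales = 3126, technical = 2023, support = 1674)
)

# Column-wise proportions; the Left row is the share of each department that left.
leave_rate <- round(prop.table(counts, margin = 2)["Left", ], 3)
print(leave_rate)
# hr 0.291, sales 0.245, technical 0.256, support 0.249
```

On this measure hr has a higher leaving rate than sales, even though sales has by far the most departures.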