This document uses data from a fictional company. It shows my efforts to analyze the data without any deeper or additional knowledge beyond the data points provided in the dataset.
I will be using the tidyverse package for the analysis, which makes it easier to work with tibbles, a type of data frame that is lighter on the system and easier to read in R. All the plots will be generated with ggplot2, which is part of the tidyverse. The description of the data below comes from the str() command.
## 'data.frame': 14999 obs. of 10 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
## $ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
## $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
## $ time_spend_company : int 3 6 4 5 3 3 4 5 5 3 ...
## $ Work_accident : int 0 0 0 0 0 0 0 0 0 0 ...
## $ left : int 1 1 1 1 1 1 1 1 1 1 ...
## $ promotion_last_5years: int 0 0 0 0 0 0 0 0 0 0 ...
## $ sales : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ salary : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
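The output above is what str() prints for a data frame whose text columns were read as factors. Below is a minimal sketch of how such a frame could be loaded, assuming the data comes from a CSV file. The source file is not named here, so the sketch writes three illustrative rows (values copied from the output above) to a temporary file first. Since R 4.0, read.csv() no longer converts strings to factors by default, so stringsAsFactors = TRUE is needed to get Factor columns like the ones shown.

```r
# Illustrative only: three rows with values copied from the str() output above,
# written to a temporary CSV because the real file name is not given.
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "satisfaction_level,average_montly_hours,salary",
  "0.38,157,low",
  "0.80,262,medium",
  "0.11,272,medium"
), toy_path)

# stringsAsFactors = TRUE reproduces the Factor columns shown by str();
# R >= 4.0 would otherwise read salary as a character column.
toy <- read.csv(toy_path, stringsAsFactors = TRUE)
str(toy)
```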
I have to rename the sales variable, since it actually describes departments. The name comes from the first row of the file, which is the header holding the variable names.
colnames(data)[9] = "Department"
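Renaming by position only works while sales stays in the ninth column. As a hedged alternative, dplyr's rename() targets the column by name. The two-row data frame below is a hypothetical stand-in for the real data, with department labels picked arbitrarily from the factor levels listed in the str() output.

```r
library(dplyr)

# Hypothetical two-row stand-in for the real data.
toy <- data.frame(
  satisfaction_level = c(0.38, 0.80),
  sales = factor(c("accounting", "hr"))
)

# rename(new_name = old_name) does not depend on column order.
toy <- rename(toy, Department = sales)
names(toy)
```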
Without a specific question to answer, I set out to analyze mainly the differences between those who have left the company and those who have stayed. Let's first focus on satisfaction level; I will mostly break this analysis down by department.
ggplot(data = data, mapping = aes(x = satisfaction_level, y = time_spend_company)) +
  geom_point(mapping = aes(color = left)) +
  facet_grid(~ Department) +
  coord_flip() +
  xlab("Satisfaction Level") +
  ylab("Time spent at company in years") +
  ggtitle("Exploring satisfaction level by those who left within departments by year") +
  labs(color = "Left")  # title for the color legend
As we can see, there are obvious groupings of people who have left in all departments. The light blue points are those who have left: we are losing people with low satisfaction in year 3, and people with high satisfaction after year 6. Since the job market is usually very fluid, with most people leaving their jobs after 5 years to further their career at another company, we will focus on those who had low satisfaction and have left.
low.left.data = filter(data, left == 1, satisfaction_level <= 0.5)
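To put numbers behind the groupings seen in the plot, the filtered rows can be counted by years at the company. This is a minimal sketch, assuming dplyr is available. The toy data frame holds only the first ten rows shown in the str() output, so the counts are illustrative and are not results for the full dataset.

```r
library(dplyr)

# First ten rows of three columns, copied from the str() output.
toy <- data.frame(
  satisfaction_level = c(0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.10, 0.92, 0.89, 0.42),
  time_spend_company = c(3L, 6L, 4L, 5L, 3L, 3L, 4L, 5L, 5L, 3L),
  left = rep(1L, 10)
)

# Same filter as low.left.data, then one count per year of tenure.
toy_low_left <- filter(toy, left == 1, satisfaction_level <= 0.5)
tenure_counts <- count(toy_low_left, time_spend_company)
tenure_counts
```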
I will reproduce the same graph using the subsetted dataset.
ggplot(data = low.left.data, mapping = aes(x = satisfaction_level, y = time_spend_company)) +
  geom_point() +
  facet_grid(~ Department) +
  coord_flip() +
  xlab("Satisfaction Level") +
  ylab("Time spent at company in years") +
  ggtitle("Exploring satisfaction level by those who left within departments by year")
The question is: “Can we persuade those at a satisfaction level around 0.4 to stay by increasing their satisfaction?” This group might be working many hours, which would give a good indication of whether they are overworked.
ggplot(data = low.left.data, mapping = aes(y = satisfaction_level, x = average_montly_hours)) +
  geom_point() +
  facet_wrap(~ Department)
Clearly there are two groupings, consistent across all departments. Those working around 150 hours a month on average are working about 7.5 hours a day (assuming 20 working days a month), so this data does not let me count those individuals as overworked. There is, however, a clear indication of overwork among those averaging around 250 to 300 hours a month, which is 12.5 and 15 hours a day respectively. Let's color this graph by salary to see if we are paying them according to their contribution. The dataset only includes salary as a factor variable with the levels low, medium and high, with no specific salary figures. Let's see how that plots.
ggplot(data = low.left.data, mapping = aes(y = satisfaction_level, x = average_montly_hours)) +
  geom_point(mapping = aes(color = salary)) +
  facet_wrap(~ Department)
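The hours-per-day figures quoted before this plot follow from dividing monthly hours by the number of working days. Here is a small sketch of that arithmetic, assuming 20 working days a month (the divisor that yields 7.5, 12.5 and 15). The helper name hours_per_day is hypothetical.

```r
# Hypothetical helper: average daily hours, assuming 20 working days per month.
hours_per_day <- function(monthly_hours, working_days = 20) {
  monthly_hours / working_days
}

# The three monthly averages discussed in the text.
daily <- hours_per_day(c(150, 250, 300))
daily
```

A different assumption, such as 22 working days, would lower each figure somewhat but would not change which group stands out.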
The salary-colored graph leaves some things to explore, so I want to look at the workers who cost the least, namely the low-salary workers. If they work long hours and take on many projects, they would logically be the workers we want to keep, or whose satisfaction level we want to raise. This also relates to their last evaluation and number of projects, but let's look at those in a later graph. I am going to reproduce the last graph with only low-salary workers.
low.left.salarylow.data = filter(low.left.data, salary == "low")
ggplot(data = low.left.salarylow.data, mapping = aes(y = satisfaction_level, x = average_montly_hours)) +
  geom_point() +
  facet_wrap(~ Department)
This graph tells me more analysis is needed, specifically with regard to the employees' last evaluation numbers. That will be an analysis for a later day, in part 2.