This project introduces a dataset and walks through an exploratory data analysis of it using R. The dataset is called “HR_comma_sep”, and the aim is to tell its story through a series of visualization charts. The Human Resources department of a company would like to understand employee attrition, so the data has to be analyzed to find the reasons why employees are leaving the company. The visualizations below show the relationships between various variables.
This section imports the dataset into the RStudio IDE so that the preprocessing steps of the analysis can be performed on it. This first part shows a general summary of the imported dataset that is to be analyzed.
##  satisfaction_level  last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900      Min.   :0.3600   Min.   :2.000   Min.   : 96.0
##  1st Qu.:0.4400      1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0
##  Median :0.6400      Median :0.7200   Median :4.000   Median :200.0
##  Mean   :0.6128      Mean   :0.7161   Mean   :3.803   Mean   :201.1
##  3rd Qu.:0.8200      3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0
##  Max.   :1.0000      Max.   :1.0000   Max.   :7.000   Max.   :310.0
##
##  time_spend_company  Work_accident   left            promotion_last_5years
##  Min.   : 2.000      Min.   :0.0000  Min.   :0.0000  Min.   :0.00000
##  1st Qu.: 3.000      1st Qu.:0.0000  1st Qu.:0.0000  1st Qu.:0.00000
##  Median : 3.000      Median :0.0000  Median :0.0000  Median :0.00000
##  Mean   : 3.498      Mean   :0.1446  Mean   :0.2381  Mean   :0.02127
##  3rd Qu.: 4.000      3rd Qu.:0.0000  3rd Qu.:0.0000  3rd Qu.:0.00000
##  Max.   :10.000      Max.   :1.0000  Max.   :1.0000  Max.   :1.00000
##
##  sales              salary
##  sales      :4140   high  :1237
##  technical  :2720   low   :7316
##  support    :2229   medium:6446
##  IT         :1227
##  product_mng: 902
##  marketing  : 858
##  (Other)    :2923
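The import and summary step can be sketched roughly as follows. This is only an illustration, not necessarily the code used for this report: the six rows are copied from the `head()` output shown later in the report (with the original column names from the summary), and they are written to a temporary file that stands in for the real `HR_comma_sep.csv`, whose path is an assumption.

```r
# Six rows copied from the head() output in this report, using the original
# column names ("time_spend_company", "sales") that appear in the summary.
# They stand in for the full HR_comma_sep.csv file.
csv_text <- "satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0.38,0.53,2,157,3,0,1,0,sales,low
0.80,0.86,5,262,6,0,1,0,sales,medium
0.11,0.88,7,272,4,0,1,0,sales,medium
0.72,0.87,5,223,5,0,1,0,sales,low
0.37,0.52,2,159,3,0,1,0,sales,low
0.41,0.50,2,153,3,0,1,0,sales,low"

# Write the sample to a temporary file so that read.csv() is called the same
# way it would be for the real file (the path here is a placeholder).
csv_path <- tempfile(fileext = ".csv")
writeLines(csv_text, csv_path)

# stringsAsFactors = TRUE turns the text columns into factors, which matches
# the factor columns reported for this dataset.
hr <- read.csv(csv_path, stringsAsFactors = TRUE)

# Per-column summary: quartiles for numeric columns, counts for factors.
print(summary(hr))
```

With the real file, only the path passed to `read.csv()` would change.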
Before any dataset is analyzed, it has to be checked to confirm that it is clean and ready for analysis. A dataset that has not been cleaned has a high probability of giving wrong results. As part of this step, I renamed the columns “sales” and “time_spend_company” to “role” and “experience_in_company” respectively.
##   satisfaction_level last_evaluation number_project average_montly_hours
## 1               0.38            0.53              2                  157
## 2               0.80            0.86              5                  262
## 3               0.11            0.88              7                  272
## 4               0.72            0.87              5                  223
## 5               0.37            0.52              2                  159
## 6               0.41            0.50              2                  153
##   experience_in_company Work_accident left promotion_last_5years  role salary
## 1                     3             0    1                     0 sales    low
## 2                     6             0    1                     0 sales medium
## 3                     4             0    1                     0 sales medium
## 4                     5             0    1                     0 sales    low
## 5                     3             0    1                     0 sales    low
## 6                     3             0    1                     0 sales    low
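One way to perform the renaming in base R is sketched below. The small data frame holds a few values copied from the first three rows above and keeps only four columns for brevity; the approach of matching on the old name is an assumption about how the step could be done, not a record of the original code.

```r
# A cut-down stand-in for the dataset, with values from the first three rows
# of the report and the original (pre-rename) column names.
hr <- data.frame(
  satisfaction_level = c(0.38, 0.80, 0.11),
  time_spend_company = c(3L, 6L, 4L),
  sales              = c("sales", "sales", "sales"),
  salary             = c("low", "medium", "medium")
)

# Rename by matching on the old name rather than on a column position,
# so the code keeps working if the column order changes.
names(hr)[names(hr) == "sales"] <- "role"
names(hr)[names(hr) == "time_spend_company"] <- "experience_in_company"

# Confirm the new names.
print(head(hr))
```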
The dataset has approximately 15,000 employee observations and only 10 features. The mean employee satisfaction is 0.61, and the employee turnover rate is about 24%.
## [1] 14999 10
## 'data.frame': 14999 obs. of 10 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
## $ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
## $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
## $ experience_in_company: int 3 6 4 5 3 3 4 5 5 3 ...
## $ Work_accident : int 0 0 0 0 0 0 0 0 0 0 ...
## $ left : int 1 1 1 1 1 1 1 1 1 1 ...
## $ promotion_last_5years: int 0 0 0 0 0 0 0 0 0 0 ...
## $ role : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ salary : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
## 0 1
## 11428 3571
## [1] 23.80825
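The turnover rate above follows directly from the two counts in the table. As a small worked check, the sketch below rebuilds a `left` vector from those counts (11428 employees who stayed, 3571 who left) instead of loading the file; the variable names are illustrative.

```r
# Rebuild the 0/1 "left" indicator from the counts reported above.
left <- rep(c(0L, 1L), times = c(11428L, 3571L))

# Frequency table of employees who stayed (0) and who left (1).
left_counts <- table(left)
print(left_counts)

# Turnover rate in percent: the share of rows where left == 1.
turnover_pct <- mean(left == 1) * 100
print(turnover_pct)  # approximately 23.81
```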
This section involves generating graphics that are both insightful and attractive. They help readers learn more about the data and understand what the data analyst is communicating. This is the step of the data analysis process where the questions behind the analysis are answered.
Firstly, I examined the relationships between several employee features: evaluation, employee satisfaction, and the average hours that an employee works monthly. This visualization aims to answer the following questions:
- Could employees be grouped in a specific way using these features?
- Is there a reason for the high spike at low satisfaction in the satisfaction graph?
- Is there any relationship between the average hours an employee works monthly and the evaluation?
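A minimal base R sketch of this kind of feature overview is given below. It assumes histograms and a correlation matrix are an acceptable stand-in for the original charts, and it uses only the six rows shown earlier in the report, so the shapes it draws say nothing about the full dataset.

```r
# The three features for the six rows shown earlier in the report.
hr <- data.frame(
  satisfaction_level   = c(0.38, 0.80, 0.11, 0.72, 0.37, 0.41),
  last_evaluation      = c(0.53, 0.86, 0.88, 0.87, 0.52, 0.50),
  average_montly_hours = c(157, 262, 272, 223, 159, 153)
)
features <- c("satisfaction_level", "last_evaluation", "average_montly_hours")

# One histogram per feature, side by side, to look for spikes and groups.
old_par <- par(mfrow = c(1, 3))
for (feature in features) {
  hist(hr[[feature]], main = feature, xlab = feature, col = "grey80")
}
par(old_par)

# Pairwise Pearson correlations, e.g. between monthly hours and evaluation.
feature_cor <- cor(hr[, features])
print(round(feature_cor, 2))
```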
The chart below aims to answer the following questions:
- Do employees who have more than six projects feel that they are being overworked in the company?
- Are employees with two or fewer projects less valued in the company, and do they decide to leave for that reason?
- Why do employees at the higher or lower end of the project spectrum leave the company?
The conclusion from the visualization is that employees with 2, 6, and 7 projects left the company at alarming rates. All of the employees with seven projects left.
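The turnover rate per project count can be computed with a contingency table, as in the sketch below. The rows are made up purely for illustration and are not taken from HR_comma_sep, so the percentages it prints are not results of this analysis.

```r
# Made-up rows for illustration only; NOT taken from HR_comma_sep.
hr <- data.frame(
  number_project = c(2, 2, 3, 3, 4, 4, 5, 6, 6, 7),
  left           = c(1, 0, 0, 0, 0, 1, 0, 1, 0, 1)
)

# Cross-tabulate project count against the left indicator.
counts <- table(projects = hr$number_project, left = hr$left)

# margin = 1 normalises within each project count, so column "1" is the
# share of employees with that many projects who left.
turnover_by_project <- prop.table(counts, margin = 1)[, "1"] * 100
print(round(turnover_by_project, 1))

# Bar chart of the turnover percentage for each project count.
barplot(turnover_by_project,
        xlab = "Number of projects",
        ylab = "Employees who left (%)")
```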
## Turnover vs. Satisfaction Chart
The satisfaction distribution for the employees who left is tri-modal. The chart shows that employees with low satisfaction left the company more often, and so did employees with high satisfaction.
The natural assumption is that as the project count increases, the average monthly hours increase too. However, the boxplot below shows that the monthly hours of the employees who did not leave remained roughly constant across project counts. In contrast, the average monthly hours of the employees who left increased as their project count increased. The question the chart raises is why the employees who left the company worked more hours than those who stayed, even when the two groups had the same project count.
This boxplot looks similar to the one above, but here it is the evaluation score that increases with project count in the turnover group. The employees in the non-turnover group, by contrast, had a consistent evaluation score despite the increase in project count.
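A comparison of satisfaction distributions like this one can be drawn with overlaid kernel density curves. The sketch below uses base R and a handful of made-up values that are not taken from HR_comma_sep; the colours and labels are arbitrary choices.

```r
# Made-up values for illustration only; NOT taken from HR_comma_sep.
hr <- data.frame(
  satisfaction_level = c(0.10, 0.11, 0.40, 0.42, 0.80, 0.85,
                         0.55, 0.65, 0.70, 0.75, 0.90, 0.60),
  left               = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
)

# Kernel density estimates per group, evaluated on the 0..1 satisfaction scale.
dens_left   <- density(hr$satisfaction_level[hr$left == 1], from = 0, to = 1)
dens_stayed <- density(hr$satisfaction_level[hr$left == 0], from = 0, to = 1)

# Draw both curves on shared axes so their peaks can be compared.
plot(dens_left, col = "red",
     main = "Satisfaction by turnover",
     xlab = "Satisfaction level",
     ylim = range(0, dens_left$y, dens_stayed$y))
lines(dens_stayed, col = "blue")
legend("topright", legend = c("left", "stayed"),
       col = c("red", "blue"), lty = 1)
```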
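Boxplots grouped by both project count and turnover, like the two described above, can be produced with the formula interface of `boxplot()`. The sketch below is one possible base R version; its rows are made up for illustration and are not taken from HR_comma_sep.

```r
# Made-up rows for illustration only; NOT taken from HR_comma_sep.
hr <- data.frame(
  number_project       = rep(c(2, 4, 6), each = 4),
  left                 = rep(c(0, 0, 1, 1), times = 3),
  average_montly_hours = c(190, 205, 140, 150,
                           195, 210, 230, 245,
                           200, 198, 270, 285),
  last_evaluation      = c(0.70, 0.72, 0.48, 0.52,
                           0.71, 0.69, 0.80, 0.84,
                           0.70, 0.73, 0.90, 0.95)
)

# "y ~ left + number_project" draws one box per (left, project count) pair;
# left varies fastest, so the two colours alternate stayed / left.
old_par <- par(mfrow = c(1, 2))
boxplot(average_montly_hours ~ left + number_project, data = hr,
        col = c("steelblue", "tomato"), las = 2,
        xlab = "", ylab = "Average monthly hours")
boxplot(last_evaluation ~ left + number_project, data = hr,
        col = c("steelblue", "tomato"), las = 2,
        xlab = "", ylab = "Last evaluation")
par(old_par)

# The same comparison as numbers: median hours per group.
median_hours <- aggregate(average_montly_hours ~ left + number_project,
                          data = hr, FUN = median)
print(median_hours)
```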