Project Work

This project introduces a dataset and carries out an exploratory data analysis on it using R. The dataset is called “HR_comma_sep”, and the aim is to tell its story through different visualization charts. The Human Resources department of a company would like to understand employee attrition, so the analysis looks for the reasons why employees are leaving the company. The visualizations I provide below show the relationships between various variables.

Introduction to Data

This section imports the dataset into the RStudio IDE so that the preprocessing steps of the analysis can be performed on it. The output below is a general summary of the imported dataset.

##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident         left        promotion_last_5years
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000   Min.   :0.00000      
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000      
##  Median : 3.000     Median :0.0000   Median :0.0000   Median :0.00000      
##  Mean   : 3.498     Mean   :0.1446   Mean   :0.2381   Mean   :0.02127      
##  3rd Qu.: 4.000     3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000      
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000   Max.   :1.00000      
##                                                                            
##          sales         salary    
##  sales      :4140   high  :1237  
##  technical  :2720   low   :7316  
##  support    :2229   medium:6446  
##  IT         :1227                
##  product_mng: 902                
##  marketing  : 858                
##  (Other)    :2923
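The import and summary steps described above can be sketched roughly as follows. This is a minimal sketch rather than the original code: the file name `HR_comma_sep.csv` is an assumption based on the dataset name, so a temporary CSV holding the first six rows of the dataset stands in for the real file to keep the example runnable.

```r
# Stand-in for the real file (assumed to be "HR_comma_sep.csv"):
# the first six rows of the dataset, written to a temporary CSV.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary",
  "0.38,0.53,2,157,3,0,1,0,sales,low",
  "0.80,0.86,5,262,6,0,1,0,sales,medium",
  "0.11,0.88,7,272,4,0,1,0,sales,medium",
  "0.72,0.87,5,223,5,0,1,0,sales,low",
  "0.37,0.52,2,159,3,0,1,0,sales,low",
  "0.41,0.50,2,153,3,0,1,0,sales,low"
), csv_path)

# Read text columns as factors so that summary() reports level counts.
hr <- read.csv(csv_path, stringsAsFactors = TRUE)

# Per-column summary: quartiles for numeric columns, counts for factors.
print(summary(hr))
```

With the full file in place, replacing `csv_path` with the path to the real CSV should produce a summary in the same layout as the one above.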

Data Cleaning

Before it is analyzed, a dataset has to be checked to confirm that it is clean and ready for analysis, because a dataset that has not been cleaned has a high probability of giving wrong results. As part of this step, I renamed the columns “sales” and “time_spend_company” to “role” and “experience_in_company” respectively. The first rows of the renamed dataset are shown below.

##   satisfaction_level last_evaluation number_project average_montly_hours
## 1               0.38            0.53              2                  157
## 2               0.80            0.86              5                  262
## 3               0.11            0.88              7                  272
## 4               0.72            0.87              5                  223
## 5               0.37            0.52              2                  159
## 6               0.41            0.50              2                  153
##   experience_in_company Work_accident left promotion_last_5years  role salary
## 1                     3             0    1                     0 sales    low
## 2                     6             0    1                     0 sales medium
## 3                     4             0    1                     0 sales medium
## 4                     5             0    1                     0 sales    low
## 5                     3             0    1                     0 sales    low
## 6                     3             0    1                     0 sales    low
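A minimal sketch of the renaming step is given here, using a three-row stand-in data frame with only the affected columns. The missing-value count is an assumed example of a cleanliness check; the source does not say which checks were actually run.

```r
# Stand-in data frame: first three rows of the affected columns.
hr <- data.frame(
  time_spend_company = c(3L, 6L, 4L),
  sales = factor(c("sales", "sales", "sales")),
  salary = factor(c("low", "medium", "medium"))
)

# Assumed cleanliness check: number of missing values per column.
missing_per_column <- colSums(is.na(hr))
print(missing_per_column)

# Rename by matching on the old names, so column order does not matter.
names(hr)[names(hr) == "sales"] <- "role"
names(hr)[names(hr) == "time_spend_company"] <- "experience_in_company"

print(head(hr))
```

Matching on the old name rather than on a column position keeps the rename correct even if the column order of the file changes.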

Exploring the Dataset

The dataset contains 14,999 employee observations and 10 features. The mean employee satisfaction level is 0.61, and the turnover rate is about 24% (3,571 of the 14,999 employees left the company).

## [1] 14999    10
## 'data.frame':    14999 obs. of  10 variables:
##  $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ experience_in_company: int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ left                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ role                 : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ salary               : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
##     0     1 
## 11428  3571
## [1] 23.80825
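The turnover figures above can be reproduced with a short sketch. The `left` vector here is rebuilt from the two counts reported in the output (11,428 stayed, 3,571 left) rather than read from the file, which is an assumption made to keep the example self-contained; on the full data frame, `dim()` and `str()` would give the first two outputs.

```r
# Rebuild the `left` column from the reported counts: 0 = stayed, 1 = left.
left <- rep(c(0L, 1L), times = c(11428L, 3571L))

# Frequency of each outcome.
left_counts <- table(left)
print(left_counts)

# Turnover rate in percent: the mean of a 0/1 column is the share of 1s.
turnover_rate <- mean(left) * 100
print(turnover_rate)
```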

Exploratory Data Analysis

This section involves generating graphics that are both insightful and visually appealing. They help readers learn more about the data and understand what the analyst is communicating. This is the step of the analysis process where the questions posed about the data are answered.

Distribution Plots

First, I examined the distributions of three employee features: evaluation, satisfaction, and the average number of hours an employee works per month. This visualization aims to answer the following questions:
- Could employees be grouped in a specific way using these features?
- Is there a reason for the high spike at the low end of the employee satisfaction graph?
- Is there any relationship between the average hours an employee works monthly and evaluation?
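One way to draw such distribution plots is sketched here with base R histograms. This is an illustrative sketch, not the original plotting code: it uses only the first ten rows of the dataset, so the shapes are not representative of the full data, and the correlation at the end is an assumed way of approaching the third question.

```r
# First ten rows of the three features (stand-in for the full dataset).
hr <- data.frame(
  satisfaction_level = c(0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.10, 0.92, 0.89, 0.42),
  last_evaluation = c(0.53, 0.86, 0.88, 0.87, 0.52, 0.50, 0.77, 0.85, 1.00, 0.53),
  average_montly_hours = c(157, 262, 272, 223, 159, 153, 247, 259, 224, 142)
)

# One histogram per feature, side by side.
features <- names(hr)
old_par <- par(mfrow = c(1, 3))
hists <- lapply(features, function(f) hist(hr[[f]], main = f, xlab = f))
par(old_par)

# Linear association between monthly hours and evaluation.
hours_eval_cor <- cor(hr$average_montly_hours, hr$last_evaluation)
print(hours_eval_cor)
```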

Visualization of Turnover vs. ProjectCount

The chart below aims to answer the following questions:
- Do employees who have six or more projects feel that they are being overworked in the company?
- Are employees with only two projects less valued in the company, and is that why they decide to leave?
- Why do employees at the higher or lower end of the project-count spectrum leave the company?

The conclusion from the visualization is that employees with 2, 6, and 7 projects left the company at alarming rates. All of the employees with seven projects left.
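The share of leavers per project count can be computed with a sketch like the one here. It uses the first ten rows of the dataset, all of whom left, so every share comes out as 1; the example only demonstrates the mechanics and says nothing about the real rates.

```r
# First ten rows of the two relevant columns (all ten employees left).
hr <- data.frame(
  number_project = c(2, 5, 7, 5, 2, 2, 6, 5, 5, 2),
  left = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
)

# Cross-tabulate project count against the turnover flag.
project_counts <- table(projects = hr$number_project, left = hr$left)
print(project_counts)

# Share of leavers within each project count (mean of a 0/1 column).
turnover_share <- tapply(hr$left, hr$number_project, mean)
print(turnover_share)

# Bar chart of the turnover share by project count.
barplot(turnover_share, xlab = "Number of projects", ylab = "Share who left")
```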

Turnover vs. Satisfaction Chart

The satisfaction distribution for the employees who left is tri-modal. The chart shows that employees with low satisfaction left the company in large numbers, and so did employees with high satisfaction.
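A density plot of satisfaction for the employees who left can be sketched as follows. The values are the satisfaction levels of the first ten rows of the dataset, all of whom left, and the band cut points (0.2 and 0.5) are hypothetical, chosen only to illustrate how the separate groups could be counted.

```r
# Satisfaction levels of the first ten employees, all of whom left.
satisfaction_left <- c(0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.10, 0.92, 0.89, 0.42)

# Kernel density estimate of satisfaction among leavers.
d <- density(satisfaction_left)
plot(d, main = "Satisfaction of employees who left", xlab = "satisfaction_level")

# Hypothetical bands: count how many leavers fall in each range.
bands <- cut(satisfaction_left,
             breaks = c(0, 0.2, 0.5, 1),
             labels = c("low", "medium", "high"))
band_counts <- table(bands)
print(band_counts)
```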

ProjectCount vs. MonthlyHours Chart

The obvious assumption from the chart is that average monthly hours increase as the project count increases. The boxplot below shows, however, that for the employees who did not leave, monthly hours remained roughly constant across project counts. In contrast, for the employees who left, average monthly hours increased along with the number of projects. The question the chart raises is why the employees who left the company worked more hours than those who stayed, even when the two groups had the same project count.
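A grouped comparison of this kind can be sketched with `aggregate()` and `boxplot()`. The sketch uses the first ten rows of the dataset, which contain only employees who left, so it shows a single turnover group; with the full data, the same calls would produce one box per combination of `left` and `number_project`.

```r
# First ten rows of the relevant columns (all ten employees left).
hr <- data.frame(
  number_project = c(2, 5, 7, 5, 2, 2, 6, 5, 5, 2),
  average_montly_hours = c(157, 262, 272, 223, 159, 153, 247, 259, 224, 142),
  left = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
)

# Median monthly hours for each project count within each turnover group.
median_hours <- aggregate(average_montly_hours ~ number_project + left,
                          data = hr, FUN = median)
print(median_hours)

# One box per (left, number_project) combination.
boxplot(average_montly_hours ~ left + number_project, data = hr,
        xlab = "left . number_project", ylab = "Average monthly hours")
```

Even in these ten rows, the median hours of the leavers rise with the project count (155, 241.5, 247, and 272 for 2, 5, 6, and 7 projects).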

ProjectCount vs. Evaluation Chart

This boxplot looks similar to the one above: in the turnover group, evaluation scores increase as the project count increases. In the non-turnover group, by contrast, evaluation scores remain consistent despite the increase in project count.