Company ABC has very poor employee satisfaction and retention. The company has decided to look into a request to pay its employees for overtime hours. The information available for the sampled employees includes satisfaction level, number of projects, salary level, and hours worked.
# load libraries
library(ggplot2)
library(DT)
library(dplyr)
library(data.table)
# load data
hr_data <- read.csv("https://raw.githubusercontent.com/SubhalaxmiRout002/Data-606-Final-Project/master/HR_comma_sep.csv", stringsAsFactors = FALSE)
# about data
str(hr_data)
## 'data.frame': 14999 obs. of 10 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
## $ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
## $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
## $ time_spend_company : int 3 6 4 5 3 3 4 5 5 3 ...
## $ Work_accident : int 0 0 0 0 0 0 0 0 0 0 ...
## $ left : int 1 1 1 1 1 1 1 1 1 1 ...
## $ promotion_last_5years: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Department : chr "sales" "sales" "sales" "sales" ...
## $ salary : chr "low" "medium" "medium" "low" ...
# data info
dim(hr_data)
## [1] 14999 10
# view data
DT::datatable(hr_data)
# rename columns
colNames <- c("satLevel", "lastEval", "numProj", "avgHrs", "timeCpny"
, "wrkAcdnt", "left", "fiveYrPrmo", "department", "salary")
data.table::setnames(hr_data, colNames)
names(hr_data)
## [1] "satLevel" "lastEval" "numProj" "avgHrs" "timeCpny"
## [6] "wrkAcdnt" "left" "fiveYrPrmo" "department" "salary"
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
Predict how much salary the company would need to pay out to employees who work overtime.
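One way to make this question concrete is to define overtime as the monthly hours worked above a standard full-time load. The sketch below assumes a 160-hour month (40 hours per week times 4 weeks, the same 4-week month implied by the avgHrs/4 conversion in the histogram code). The threshold, the helper name overtime_hours, and the toy data frame are illustrative assumptions, not values taken from the data set.

```r
# Hypothetical helper: monthly hours above an assumed full-time threshold.
# The 160-hour default is an assumption (40 hours/week * 4 weeks).
overtime_hours <- function(avg_hrs, threshold = 160) {
  pmax(avg_hrs - threshold, 0)
}

# Toy rows that mimic the renamed avgHrs column of hr_data
toy <- data.frame(avgHrs = c(157, 262, 160, 96))
toy$overtimeHrs <- overtime_hours(toy$avgHrs)
toy
```

The same helper could be applied to hr_data$avgHrs. Turning overtime hours into a payout would additionally require an hourly rate, which the data set does not provide.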
What are the cases, and how many are there?
Each case represents one employee, with details of their working hours and salary level. There are 14999 observations in the data set.
Describe the method of data collection.
The data was obtained from Kaggle; the source link is given in the citation below.
What type of study is this (observational/experiment)?
This is an observational study.
If you collected the data, state self-collected. If not, provide a citation/link.
The data is hosted on Kaggle and is available online here: https://www.kaggle.com/giripujar/hr-analytics . For this project, the data was downloaded from Kaggle and stored in a GitHub repository, from which it is read with read.csv().
What is the response variable? Is it quantitative or qualitative?
Salary level (salary) and whether the employee left the company (left) are the response variables. Both are qualitative.
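Because both response variables are qualitative, it may help to store them as factors before plotting or modelling. The sketch below uses a toy data frame with the same column names. The level order low < medium < high and the labels "stayed" and "left" are assumptions made for illustration.

```r
# Toy rows standing in for hr_data (values are illustrative only)
toy <- data.frame(
  salary = c("low", "medium", "high", "low"),
  left   = c(1L, 1L, 0L, 0L),
  stringsAsFactors = FALSE
)

# Salary as an ordered factor (assumed order: low < medium < high)
toy$salary <- factor(toy$salary,
                     levels = c("low", "medium", "high"),
                     ordered = TRUE)

# left as a labelled factor: 0 = stayed, 1 = left (labels are assumed)
toy$left <- factor(toy$left, levels = c(0, 1), labels = c("stayed", "left"))

str(toy)
```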
You should have two independent variables, one quantitative and one qualitative.
Number of projects (number_project), average monthly hours (average_montly_hours), and satisfaction level (satisfaction_level) are the independent variables. All three are quantitative.
Department is the qualitative independent variable.
Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
# show summary statistics of each column
summary(hr_data)
## satLevel lastEval numProj avgHrs
## Min. :0.0900 Min. :0.3600 Min. :2.000 Min. : 96.0
## 1st Qu.:0.4400 1st Qu.:0.5600 1st Qu.:3.000 1st Qu.:156.0
## Median :0.6400 Median :0.7200 Median :4.000 Median :200.0
## Mean :0.6128 Mean :0.7161 Mean :3.803 Mean :201.1
## 3rd Qu.:0.8200 3rd Qu.:0.8700 3rd Qu.:5.000 3rd Qu.:245.0
## Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :310.0
## timeCpny wrkAcdnt left fiveYrPrmo
## Min. : 2.000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median : 3.000 Median :0.0000 Median :0.0000 Median :0.00000
## Mean : 3.498 Mean :0.1446 Mean :0.2381 Mean :0.02127
## 3rd Qu.: 4.000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :10.000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## department salary
## Length:14999 Length:14999
## Class :character Class :character
## Mode :character Mode :character
##
##
##
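For the character columns department and salary, summary() reports only length, class, and mode. A frequency table is usually a more informative summary of qualitative variables. The sketch below shows the idea on a small toy data frame used in place of hr_data; the rows are made up for illustration.

```r
# Toy rows standing in for hr_data (values are illustrative only)
toy <- data.frame(
  department = c("sales", "sales", "hr", "technical"),
  salary     = c("low", "medium", "low", "low"),
  stringsAsFactors = FALSE
)

# Number of employees per department
dept_counts <- table(toy$department)
dept_counts

# Share of employees at each salary level
salary_share <- prop.table(table(toy$salary))
salary_share
```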
# histograms of the numeric variables to understand their distributions
# avgHrs/4 approximates weekly hours by treating a month as 4 weeks
ggplot(data = hr_data, aes(x = avgHrs/4)) +
geom_histogram(breaks=seq(10, 100, by=7), color = "black", fill = "steelblue") +
labs(title="Distribution of Average Hours per Week",x="Avg Hours", y = "Frequency")
ggplot(data = hr_data, aes(x = satLevel)) +
geom_histogram(breaks=seq(0, 1, by=0.1), color = "black", fill = "steelblue") +
labs(title="Distribution of Satisfaction Level",x="Satisfaction Level", y = "Frequency")
ggplot(data = hr_data, aes(x = lastEval)) +
geom_histogram(breaks=seq(0, 1, by=.1), color = "black", fill = "steelblue") +
labs(title="Distribution of Last Evaluations",x="Last Eval", y = "Frequency")
ggplot(data = hr_data, aes(x = numProj)) +
geom_histogram(breaks=seq(1, 10, by=1), color = "black", fill = "steelblue") +
labs(title="Distribution of Number of Projects",x="Number of Projects", y = "Frequency")
# box plot to show the distribution of average monthly hours by department
ggplot(data = hr_data) + geom_boxplot(aes(x = department, y = avgHrs), fill = "steelblue") +
labs(title="Department Analysis Data", x="Department", y = "Avg Hours") +
theme(axis.text.x=element_text(angle=30,hjust=1),plot.title = element_text(hjust = 0.5))
We can make many comparisons between variables, e.g. retention by department, employees who left versus employees who stayed by salary level, and the ratio of satisfied to unsatisfied employees. We can also use a linear regression model to predict the salary (per month or year) for employees who worked overtime.
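Two of these comparisons can be sketched in base R. The example below uses a toy data frame with the renamed columns department, salary, and left; the rows are made up for illustration, and the same calls could be pointed at hr_data.

```r
# Toy rows standing in for hr_data (values are illustrative only)
toy <- data.frame(
  department = c("sales", "sales", "hr", "hr", "technical"),
  salary     = c("low", "medium", "low", "high", "low"),
  left       = c(1, 0, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Share of employees who left, per department
# (the mean of a 0/1 column is the proportion of 1s)
left_by_dept <- tapply(toy$left, toy$department, mean)
left_by_dept

# Counts of employees who stayed (0) vs left (1), by salary level
left_by_salary <- table(salary = toy$salary, left = toy$left)
left_by_salary
```

Retention per department is then 1 - left_by_dept.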