The goal of our project is to analyze the Employee Attrition dataset in order to find patterns and predictions. Based on the information given, we are focusing on the impacts and relationships from age, gender, and educational background in respect to other factors such as number of companies worked at, number of promotions, salary increases, monthly income, and other determinants. Since this data is Human Resources centric, the desired end result is always to identify and alleviate inequities within the company in order to enhance the value creation model of the organization for all stakeholders.
Packages used: tidyverse Software used: Tableau, Watson, R, Excel
There is no information available regarding the timeline of the data collection, its source, or country.
Our hypothesis is that generational stereotypes and gender inequalities will be easily recognizable within the dataset, i.e. younger employees will have worked for more companies and women will be paid less than men, among other considerations. For R studio, we hypothesize that the monthly income for employees increases with their age due to work experience, seniority job level, and skills.
The industry our project is based on a typical business scenario that involves employees, salaries, education and other factors that may influence each other such as job satisfaction level, job level, martial status. Our project is using data from a Human Resources perspective, which is an important feature of a business’ performance. Some key metrics for this project’s industry (using the data we have been provided with): 1. Job satisfaction 2. Performance Rating 3. Attrition
See below for images:
knitr::include_graphics('C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\BSAD343H Project\\hr\\Gender\\interesting1.png')
knitr::include_graphics('C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\BSAD343H Project\\hr\\Gender\\interesting2.png')
knitr::include_graphics('C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\BSAD343H Project\\hr\\Gender\\interesting3.png')
knitr::include_graphics('C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\BSAD343H Project\\hr\\Gender\\interesting4.png')
knitr::include_graphics('C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\BSAD343H Project\\hr\\Gender\\interesting5.png')
knitr::include_graphics('C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\BSAD343H Project\\hr\\Gender\\interesting6.png')
The first figure analyzes average monthly_income and age. It’s interesting that Females earn more than males in general, which was not expected. The second analyzes average work life balance and age. It’s interesting that Females have much lower work life balance during the years 22-24. One theory is that they are entry level grads who are still trying to find their passion or working extremely hard to prove themselves. The third analyzes ave number of companies worked and age. This is interesting, because we were under the impression the previous generation did not have as much fluctuations in careers as our generation. The fourth analyzes average perfomance rating and age, which is compared to average percent salary rate. It’s interesting that the first years have a peak in performance and pay, but tends to stagnant in the following years. The fifth demonstrates age vs. rows, which analyzes the amount of employees in that age group. It’s interesting to note the bell curve shape to this figure, but it makes sense that there are low numbers of entry levels and senior management due to recruiting youth and retiring seniors. The sixth shows the main drivers of monthly income, being job level, job role and total working years. The longer you stay at this company, the more you’ll get paid.
These graphs illustrate various ways of analysing our set of data. Using descriptive statistics, Watson provides most insight in this respect. We notice that JobLevel and JobRole are the 2 strongest variables that drive monthly income. When analyzing the monthly income vs. age, there appears to be a general positive trendline. We wanted to integrate a third variable, being gender for comparison. We found that women tend to be paid more in general, which is quite interesting and against our initial hypothesis. This has driven us to discover what are the discrepancies between pay, from job role, number of companies worked at and performance levels.
Data cleansing is an important process, as it makes data more manageable and allows for some variables that are characteristic to be interpreted.
mydata <- read_csv("C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\BSAD343H Project\\hr\\Gender\\employee_attrition.csv")
Parsed with column specification:
cols(
.default = col_integer(),
Attrition = col_character(),
BusinessTravel = col_character(),
Department = col_character(),
EducationField = col_character(),
JobRole = col_character(),
MaritalStatus = col_character(),
Over18 = col_character(),
OverTime = col_character()
)
See spec(...) for full column specifications.
Warning message:
In file(con, "r") :
cannot open file 'Lab8.utf8.md': No such file or directory
head(mydata)
summary(mydata)
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount
Min. :18.00 Length:1470 Length:1470 Min. : 102.0 Length:1470 Min. : 1.000 Min. :1.000 Length:1470 Min. :1
1st Qu.:30.00 Class :character Class :character 1st Qu.: 465.0 Class :character 1st Qu.: 2.000 1st Qu.:2.000 Class :character 1st Qu.:1
Median :36.00 Mode :character Mode :character Median : 802.0 Mode :character Median : 7.000 Median :3.000 Mode :character Median :1
Mean :36.92 Mean : 802.5 Mean : 9.193 Mean :2.913 Mean :1
3rd Qu.:43.00 3rd Qu.:1157.0 3rd Qu.:14.000 3rd Qu.:4.000 3rd Qu.:1
Max. :60.00 Max. :1499.0 Max. :29.000 Max. :5.000 Max. :1
EmployeeNumber EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel JobRole JobSatisfaction MaritalStatus
Min. : 1.0 Min. :1.000 Min. :1.0 Min. : 30.00 Min. :1.00 Min. :1.000 Length:1470 Min. :1.000 Length:1470
1st Qu.: 491.2 1st Qu.:2.000 1st Qu.:1.0 1st Qu.: 48.00 1st Qu.:2.00 1st Qu.:1.000 Class :character 1st Qu.:2.000 Class :character
Median :1020.5 Median :3.000 Median :2.0 Median : 66.00 Median :3.00 Median :2.000 Mode :character Median :3.000 Mode :character
Mean :1024.9 Mean :2.722 Mean :1.6 Mean : 65.89 Mean :2.73 Mean :2.064 Mean :2.729
3rd Qu.:1555.8 3rd Qu.:4.000 3rd Qu.:2.0 3rd Qu.: 83.75 3rd Qu.:3.00 3rd Qu.:3.000 3rd Qu.:4.000
Max. :2068.0 Max. :4.000 Max. :2.0 Max. :100.00 Max. :4.00 Max. :5.000 Max. :4.000
MonthlyIncome MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating RelationshipSatisfaction StandardHours
Min. : 1009 Min. : 2094 Min. :0.000 Length:1470 Length:1470 Min. :11.00 Min. :3.000 Min. :1.000 Min. :80
1st Qu.: 2911 1st Qu.: 8047 1st Qu.:1.000 Class :character Class :character 1st Qu.:12.00 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:80
Median : 4919 Median :14236 Median :2.000 Mode :character Mode :character Median :14.00 Median :3.000 Median :3.000 Median :80
Mean : 6503 Mean :14313 Mean :2.693 Mean :15.21 Mean :3.154 Mean :2.712 Mean :80
3rd Qu.: 8379 3rd Qu.:20462 3rd Qu.:4.000 3rd Qu.:18.00 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:80
Max. :19999 Max. :26999 Max. :9.000 Max. :25.00 Max. :4.000 Max. :4.000 Max. :80
StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
Min. :0.0000 Min. : 0.00 Min. :0.000 Min. :1.000 Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.:0.0000 1st Qu.: 6.00 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 3.000 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 2.000
Median :1.0000 Median :10.00 Median :3.000 Median :3.000 Median : 5.000 Median : 3.000 Median : 1.000 Median : 3.000
Mean :0.7939 Mean :11.28 Mean :2.799 Mean :2.761 Mean : 7.008 Mean : 4.229 Mean : 2.188 Mean : 4.123
3rd Qu.:1.0000 3rd Qu.:15.00 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.: 9.000 3rd Qu.: 7.000 3rd Qu.: 3.000 3rd Qu.: 7.000
Max. :3.0000 Max. :40.00 Max. :6.000 Max. :4.000 Max. :40.000 Max. :18.000 Max. :15.000 Max. :17.000
age <- mydata$Age
education <- mydata$Education
monthly_income <- mydata$MonthlyIncome
gender <-mydata$Gender
YearsAtCompany <- mydata$YearsAtCompany
YearsSinceLastPromotion <-mydata$YearsSinceLastPromotion
WorkLifeBalance <- mydata$WorkLifeBalance
PerformanceRating <- mydata$PerformanceRating
Overtime <- mydata$OverTime
EducationField <- mydata$EducationField
TotalWorkingYears <- mydata$TotalWorkingYears
TrainingTimesLastYear <- mydata$TrainingTimesLastYear
write_csv(x = mydata, path = "C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\BSAD343H Project\\hr\\Gender\\clean_employee_attrition.csv")
summary(mydata)
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount
Min. :18.00 Length:1470 Length:1470 Min. : 102.0 Length:1470 Min. : 1.000 Min. :1.000 Length:1470 Min. :1
1st Qu.:30.00 Class :character Class :character 1st Qu.: 465.0 Class :character 1st Qu.: 2.000 1st Qu.:2.000 Class :character 1st Qu.:1
Median :36.00 Mode :character Mode :character Median : 802.0 Mode :character Median : 7.000 Median :3.000 Mode :character Median :1
Mean :36.92 Mean : 802.5 Mean : 9.193 Mean :2.913 Mean :1
3rd Qu.:43.00 3rd Qu.:1157.0 3rd Qu.:14.000 3rd Qu.:4.000 3rd Qu.:1
Max. :60.00 Max. :1499.0 Max. :29.000 Max. :5.000 Max. :1
EmployeeNumber EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel JobRole JobSatisfaction MaritalStatus
Min. : 1.0 Min. :1.000 Min. :1.0 Min. : 30.00 Min. :1.00 Min. :1.000 Length:1470 Min. :1.000 Length:1470
1st Qu.: 491.2 1st Qu.:2.000 1st Qu.:1.0 1st Qu.: 48.00 1st Qu.:2.00 1st Qu.:1.000 Class :character 1st Qu.:2.000 Class :character
Median :1020.5 Median :3.000 Median :2.0 Median : 66.00 Median :3.00 Median :2.000 Mode :character Median :3.000 Mode :character
Mean :1024.9 Mean :2.722 Mean :1.6 Mean : 65.89 Mean :2.73 Mean :2.064 Mean :2.729
3rd Qu.:1555.8 3rd Qu.:4.000 3rd Qu.:2.0 3rd Qu.: 83.75 3rd Qu.:3.00 3rd Qu.:3.000 3rd Qu.:4.000
Max. :2068.0 Max. :4.000 Max. :2.0 Max. :100.00 Max. :4.00 Max. :5.000 Max. :4.000
MonthlyIncome MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating RelationshipSatisfaction StandardHours
Min. : 1009 Min. : 2094 Min. :0.000 Length:1470 Length:1470 Min. :11.00 Min. :3.000 Min. :1.000 Min. :80
1st Qu.: 2911 1st Qu.: 8047 1st Qu.:1.000 Class :character Class :character 1st Qu.:12.00 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:80
Median : 4919 Median :14236 Median :2.000 Mode :character Mode :character Median :14.00 Median :3.000 Median :3.000 Median :80
Mean : 6503 Mean :14313 Mean :2.693 Mean :15.21 Mean :3.154 Mean :2.712 Mean :80
3rd Qu.: 8379 3rd Qu.:20462 3rd Qu.:4.000 3rd Qu.:18.00 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:80
Max. :19999 Max. :26999 Max. :9.000 Max. :25.00 Max. :4.000 Max. :4.000 Max. :80
StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
Min. :0.0000 Min. : 0.00 Min. :0.000 Min. :1.000 Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.:0.0000 1st Qu.: 6.00 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 3.000 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 2.000
Median :1.0000 Median :10.00 Median :3.000 Median :3.000 Median : 5.000 Median : 3.000 Median : 1.000 Median : 3.000
Mean :0.7939 Mean :11.28 Mean :2.799 Mean :2.761 Mean : 7.008 Mean : 4.229 Mean : 2.188 Mean : 4.123
3rd Qu.:1.0000 3rd Qu.:15.00 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.: 9.000 3rd Qu.: 7.000 3rd Qu.: 3.000 3rd Qu.: 7.000
Max. :3.0000 Max. :40.00 Max. :6.000 Max. :4.000 Max. :40.000 Max. :18.000 Max. :15.000 Max. :17.000
Based on this data, we ca see max and min values for the data set variables. However, not all of them add value/provide meaningful implications. The variables that are meaningful are as follows: -Age -Monthly_Income/Hourly Rate/Monthly Rate (and other means of pay) -Performance Rating -Job Satisfaction
With this new clean data set, we can now analyze gender and another variable, like age or monthly_income. ### 4B) Using descriptive statistics explore dataset investigate min and max values
min(age)
[1] 18
max(age)
[1] 60
min(monthly_income)
[1] 1009
max(monthly_income)
[1] 19999
data_corr <- cor(monthly_income,education)
data_corr
[1] 0.09496068
Corr <- cor(age, monthly_income)
Corr
[1] 0.4978546
It’s interesting to see that age and monthly_income are better correlated than monthly_income and education, which means age and monthly_income are better predictors of each other than the other relationship.
p <- qplot( x = age, y = monthly_income, data = mydata) +geom_point()
p
p + geom_smooth(method="lm")
linear_model <- lm( monthly_income ~ age, data =mydata )
summary( linear_model )
Call:
lm(formula = monthly_income ~ age, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-9990.1 -2592.7 -677.9 1810.5 12540.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2970.67 443.70 -6.695 3.06e-11 ***
age 256.57 11.67 21.995 < 2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 4084 on 1468 degrees of freedom
Multiple R-squared: 0.2479, Adjusted R-squared: 0.2473
F-statistic: 483.8 on 1 and 1468 DF, p-value: < 2.2e-16
Equation of the line = y = -2970.67 + 256.57x
If two new employees were signed on, one being entry level at 18 years old and the other being senior at 50 years old, we can predict their monthly incomes:
age = 18
y = -2970.67 + 256.57*(age)
y
[1] 1647.59
age = 50
y= -2970.67+256.57*(age)
y
[1] 9857.83
Multiple R-squared: 0.2479, Adjusted R-squared: 0.2473
The model suggests that these data points are not very suited to the regression line as the R squared value is under 0.5, which is typically a benchmark.Hence, this means that age is not a very good predictor of montly income.Because of this, it is advisable to add other variables to find a better suited relationship.
See images below from various sources:
knitr::include_graphics('C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\BSAD343H Project\\hr\\Gender\\interesting8.png')
knitr::include_graphics('C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\BSAD343H Project\\hr\\Gender\\interesting9.png')
We couldn’t use other numerical data because the other variables were categorical or ordinal, we decided that a bar graph would be most suitable to illustrate the relationships. In fact, these figures, while they do not offer a better prediction of the dependent variable, definitely offer more insight into our original findings between age and monthly income. For instance, analyzing education field and monthly income, we can identify that employees who have degrees in life sciences are paid more than other degrees. It’s interesting to note how little HR is paid in comparison. Our theory is that there is only a few employees who have that degree. In the other figure that compares gender and pay, it appears that there is a significant gender pay discrepancy. However, this goes against our initial finding. Upon refelction and more analysis, we identified the cause of these two conlficting ideas. This set of data comprises of approximately 60% males and 40% females. Rather than concluding these two findings are unrelated, they actually depict an important attribute about our data, such that females, while there are less, tend to be paid higher due to their seniority roles in the company. Overall, these extra findings provide more insight into our intial finding of monthly income and age, such that education and number of employees who are males or females can affect our first findings. Hence, these extra findings do not predict the dependent variable more accurately, but rather give us more explanation of our current model.