## Loading required package: lpSolveAPI
## Loading required package: tidyverse
## -- Attaching packages ---------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.7.2 v stringr 1.2.0
## v readr 1.1.1 v forcats 0.2.0
## -- Conflicts ------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Loading required package: plotly
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
My project examines a human resources data set from an unidentified company. It contains data on age, monthly income, years with the company, attrition, and a handful of other variables. My goal is to look for correlations among these variables and draw meaningful takeaways that this company, or businesses in general, can benefit from.
The main R package used for this project was tidyverse; lpSolveAPI and plotly were also loaded. The software used was R, Tableau, Watson Analytics, and Microsoft Excel.
I do not have a description of the data, and I do not know its source, year, or country of origin.
I do not expect to find a gender wage gap in this company: after skimming the data, I do not think the company has a gender bias. I do expect higher wages for employees who are older, have worked there longer, and have more experience or education, because these people are more valuable in the company’s eyes. I also expect pay to differ between job roles, because each role requires a different skill level and may be valued differently. I think marital status will have no correlation with any of the variables, while many of the variables combined will affect attrition. I think business travel will affect pay, because frequent travel may mean the person is putting in more hours, depending on where they travel. Lastly, I think distance from home and pay will be correlated: lower-paid employees may live farther away where housing is cheaper, while higher-paid employees can afford to live close to work, for example downtown if the company is located in a city.
Overall, I am interested in creating a predictive model of employee attrition, and I believe most of these variables will contribute to the outcome. I also want to create a predictive model of employees’ monthly income.
The company seems to be some type of research and development business with a sales department as well. The data is human resources data collected on employees, and the variables include age, attrition, business travel, daily rate, department, distance from home, education, education field, environment satisfaction, gender, hourly rate, job involvement, job level, job role, job satisfaction, marital status, monthly income, monthly rate, number of companies worked, overtime, percent salary hike, performance rating, relationship satisfaction, standard hours, stock option level, total working years, training times last year, work-life balance, years at company, years in current role, years since last promotion, and years with current manager. Since it is HR data, the key metrics (KPIs) would be attrition, job satisfaction, job involvement, work-life balance, and relationship satisfaction.
mydata <- read.csv("C:\\Users\\dylan\\Desktop\\BSAD343H\\Project\\HR_data_set.csv")
summary(mydata)
## Age Attrition BusinessTravel DailyRate
## Min. :18.00 No :1233 Non-Travel : 150 Min. : 102.0
## 1st Qu.:30.00 Yes: 237 Travel_Frequently: 277 1st Qu.: 465.0
## Median :36.00 Travel_Rarely :1043 Median : 802.0
## Mean :36.92 Mean : 802.5
## 3rd Qu.:43.00 3rd Qu.:1157.0
## Max. :60.00 Max. :1499.0
##
## Department DistanceFromHome Education
## Human Resources : 63 Min. : 1.000 Min. :1.000
## Research & Development:961 1st Qu.: 2.000 1st Qu.:2.000
## Sales :446 Median : 7.000 Median :3.000
## Mean : 9.193 Mean :2.913
## 3rd Qu.:14.000 3rd Qu.:4.000
## Max. :29.000 Max. :5.000
##
## EducationField EnvironmentSatisfaction Gender
## Human Resources : 27 Min. :1.000 Female:588
## Life Sciences :606 1st Qu.:2.000 Male :882
## Marketing :159 Median :3.000
## Medical :464 Mean :2.722
## Other : 82 3rd Qu.:4.000
## Technical Degree:132 Max. :4.000
##
## HourlyRate JobInvolvement JobLevel
## Min. : 30.00 Min. :1.00 Min. :1.000
## 1st Qu.: 48.00 1st Qu.:2.00 1st Qu.:1.000
## Median : 66.00 Median :3.00 Median :2.000
## Mean : 65.89 Mean :2.73 Mean :2.064
## 3rd Qu.: 83.75 3rd Qu.:3.00 3rd Qu.:3.000
## Max. :100.00 Max. :4.00 Max. :5.000
##
## JobRole JobSatisfaction MaritalStatus
## Sales Executive :326 Min. :1.000 Divorced:327
## Research Scientist :292 1st Qu.:2.000 Married :673
## Laboratory Technician :259 Median :3.000 Single :470
## Manufacturing Director :145 Mean :2.729
## Healthcare Representative:131 3rd Qu.:4.000
## Manager :102 Max. :4.000
## (Other) :215
## MonthlyIncome MonthlyRate NumCompaniesWorked OverTime
## Min. : 1009 Min. : 2094 Min. :0.000 No :1054
## 1st Qu.: 2911 1st Qu.: 8047 1st Qu.:1.000 Yes: 416
## Median : 4919 Median :14236 Median :2.000
## Mean : 6503 Mean :14313 Mean :2.693
## 3rd Qu.: 8379 3rd Qu.:20462 3rd Qu.:4.000
## Max. :19999 Max. :26999 Max. :9.000
##
## PercentSalaryHike PerformanceRating RelationshipSatisfaction
## Min. :11.00 Min. :3.000 Min. :1.000
## 1st Qu.:12.00 1st Qu.:3.000 1st Qu.:2.000
## Median :14.00 Median :3.000 Median :3.000
## Mean :15.21 Mean :3.154 Mean :2.712
## 3rd Qu.:18.00 3rd Qu.:3.000 3rd Qu.:4.000
## Max. :25.00 Max. :4.000 Max. :4.000
##
## StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear
## Min. :80 Min. :0.0000 Min. : 0.00 Min. :0.000
## 1st Qu.:80 1st Qu.:0.0000 1st Qu.: 6.00 1st Qu.:2.000
## Median :80 Median :1.0000 Median :10.00 Median :3.000
## Mean :80 Mean :0.7939 Mean :11.28 Mean :2.799
## 3rd Qu.:80 3rd Qu.:1.0000 3rd Qu.:15.00 3rd Qu.:3.000
## Max. :80 Max. :3.0000 Max. :40.00 Max. :6.000
##
## WorkLifeBalance YearsAtCompany YearsInCurrentRole
## Min. :1.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:2.000 1st Qu.: 3.000 1st Qu.: 2.000
## Median :3.000 Median : 5.000 Median : 3.000
## Mean :2.761 Mean : 7.008 Mean : 4.229
## 3rd Qu.:3.000 3rd Qu.: 9.000 3rd Qu.: 7.000
## Max. :4.000 Max. :40.000 Max. :18.000
##
## YearsSinceLastPromotion YearsWithCurrManager
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 1.000 Median : 3.000
## Mean : 2.188 Mean : 4.123
## 3rd Qu.: 3.000 3rd Qu.: 7.000
## Max. :15.000 Max. :17.000
##
After going through the data, I realized that variables such as environment satisfaction, job satisfaction, work-life balance, relationship satisfaction, and job involvement are rankings on a scale of 1 to 4, while education is ranked from 1 to 5. Also, the minimum monthly income is 1009, which seems like an unlivable income.
knitr::include_graphics("C:\\Users\\dylan\\Desktop\\BSAD343H\\Project\\pic2.png")
This decision tree is a predictive model that predicts attrition within the company with 86% accuracy. At face value this is a high accuracy, and the tree can lend some key insights into which employees tend to leave. From there, the company can build a profile of a person who is likely to succeed and stay, and then refine its hiring process so that fewer people are laid off or quit.
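One way to put the 86% figure in context is to compare it with a no-information baseline, that is, a rule that always predicts the most common outcome. The sketch below uses only the attrition counts printed by `summary(mydata)` (1233 "No", 237 "Yes"); the variable names are illustrative and not part of the original analysis.

```r
# Attrition counts taken from the summary(mydata) output above
attrition_counts <- c(No = 1233, Yes = 237)

# Accuracy of a rule that always predicts the majority class ("No")
baseline_accuracy <- max(attrition_counts) / sum(attrition_counts)

# Accuracy reported for the Watson Analytics decision tree
tree_accuracy <- 0.86

round(baseline_accuracy, 3)                  # share of employees who stayed
round(tree_accuracy - baseline_accuracy, 3)  # improvement over the baseline
```

The baseline comes out at about 84%, so the tree's 86% is roughly two percentage points better than always guessing that an employee stays. Which employees the tree flags correctly matters more than the headline number.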
knitr::include_graphics("C:\\Users\\dylan\\Desktop\\BSAD343H\\Project\\pic3.png")
Interestingly, this chart shows that the wage difference between men and women is very close to zero. To be specific, women make on average $65.90/hour while men make on average $65.88/hour.
knitr::include_graphics("C:\\Users\\dylan\\Desktop\\BSAD343H\\Project\\pic4.png")
As you can see from this graph, there is a positive correlation between age and monthly income as hypothesized.
Watson Analytics provided the insights shown in the images above. It created a predictive decision tree with 86% accuracy, it showed that there is no wage gap in this company (if anything, women make slightly more than men on average), and it showed a positive correlation between age and monthly income.
For the most part, the data seemed quite clean. I deleted both employee count and employee number because neither variable would provide any meaningful correlation with the others. I also deleted the column indicating whether an employee is over 18: the age variable already shows that every employee is at least 18, so that column is redundant.
I cleaned the data in Excel before transferring it to RStudio. I deleted all the unnecessary columns and checked that none of the entries were missing or misspelled. Therefore, I did not need RStudio to clean the data.
As mentioned, the data was cleaned in Excel so I do not need to save a new dataset.
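Although the cleaning was done in Excel, the same checks can be reproduced in R. This is a minimal sketch that uses a small made-up data frame (`toy`) rather than the real HR data: the column names mirror the dataset, but the values are invented for illustration.

```r
# Hypothetical stand-in for the HR data; values are invented for illustration
toy <- data.frame(
  Age           = c(41, 49, NA, 33),
  Attrition     = c("Yes", "No", "No", "no"),
  EmployeeCount = c(1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Count missing entries per column
colSums(is.na(toy))

# List the distinct labels of a categorical column to spot misspellings
unique(toy$Attrition)

# Find columns that hold a single constant value and so carry no information
constant_cols <- names(toy)[sapply(toy, function(col) length(unique(col)) == 1)]

# Drop the constant columns
cleaned <- toy[, !(names(toy) %in% constant_cols), drop = FALSE]
names(cleaned)
```

In the summary above, `StandardHours` is 80 for every employee, so a constant-column check like this one would flag it as well.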
summary(mydata)
## Age Attrition BusinessTravel DailyRate
## Min. :18.00 No :1233 Non-Travel : 150 Min. : 102.0
## 1st Qu.:30.00 Yes: 237 Travel_Frequently: 277 1st Qu.: 465.0
## Median :36.00 Travel_Rarely :1043 Median : 802.0
## Mean :36.92 Mean : 802.5
## 3rd Qu.:43.00 3rd Qu.:1157.0
## Max. :60.00 Max. :1499.0
##
## Department DistanceFromHome Education
## Human Resources : 63 Min. : 1.000 Min. :1.000
## Research & Development:961 1st Qu.: 2.000 1st Qu.:2.000
## Sales :446 Median : 7.000 Median :3.000
## Mean : 9.193 Mean :2.913
## 3rd Qu.:14.000 3rd Qu.:4.000
## Max. :29.000 Max. :5.000
##
## EducationField EnvironmentSatisfaction Gender
## Human Resources : 27 Min. :1.000 Female:588
## Life Sciences :606 1st Qu.:2.000 Male :882
## Marketing :159 Median :3.000
## Medical :464 Mean :2.722
## Other : 82 3rd Qu.:4.000
## Technical Degree:132 Max. :4.000
##
## HourlyRate JobInvolvement JobLevel
## Min. : 30.00 Min. :1.00 Min. :1.000
## 1st Qu.: 48.00 1st Qu.:2.00 1st Qu.:1.000
## Median : 66.00 Median :3.00 Median :2.000
## Mean : 65.89 Mean :2.73 Mean :2.064
## 3rd Qu.: 83.75 3rd Qu.:3.00 3rd Qu.:3.000
## Max. :100.00 Max. :4.00 Max. :5.000
##
## JobRole JobSatisfaction MaritalStatus
## Sales Executive :326 Min. :1.000 Divorced:327
## Research Scientist :292 1st Qu.:2.000 Married :673
## Laboratory Technician :259 Median :3.000 Single :470
## Manufacturing Director :145 Mean :2.729
## Healthcare Representative:131 3rd Qu.:4.000
## Manager :102 Max. :4.000
## (Other) :215
## MonthlyIncome MonthlyRate NumCompaniesWorked OverTime
## Min. : 1009 Min. : 2094 Min. :0.000 No :1054
## 1st Qu.: 2911 1st Qu.: 8047 1st Qu.:1.000 Yes: 416
## Median : 4919 Median :14236 Median :2.000
## Mean : 6503 Mean :14313 Mean :2.693
## 3rd Qu.: 8379 3rd Qu.:20462 3rd Qu.:4.000
## Max. :19999 Max. :26999 Max. :9.000
##
## PercentSalaryHike PerformanceRating RelationshipSatisfaction
## Min. :11.00 Min. :3.000 Min. :1.000
## 1st Qu.:12.00 1st Qu.:3.000 1st Qu.:2.000
## Median :14.00 Median :3.000 Median :3.000
## Mean :15.21 Mean :3.154 Mean :2.712
## 3rd Qu.:18.00 3rd Qu.:3.000 3rd Qu.:4.000
## Max. :25.00 Max. :4.000 Max. :4.000
##
## StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear
## Min. :80 Min. :0.0000 Min. : 0.00 Min. :0.000
## 1st Qu.:80 1st Qu.:0.0000 1st Qu.: 6.00 1st Qu.:2.000
## Median :80 Median :1.0000 Median :10.00 Median :3.000
## Mean :80 Mean :0.7939 Mean :11.28 Mean :2.799
## 3rd Qu.:80 3rd Qu.:1.0000 3rd Qu.:15.00 3rd Qu.:3.000
## Max. :80 Max. :3.0000 Max. :40.00 Max. :6.000
##
## WorkLifeBalance YearsAtCompany YearsInCurrentRole
## Min. :1.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:2.000 1st Qu.: 3.000 1st Qu.: 2.000
## Median :3.000 Median : 5.000 Median : 3.000
## Mean :2.761 Mean : 7.008 Mean : 4.229
## 3rd Qu.:3.000 3rd Qu.: 9.000 3rd Qu.: 7.000
## Max. :4.000 Max. :40.000 Max. :18.000
##
## YearsSinceLastPromotion YearsWithCurrManager
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 1.000 Median : 3.000
## Mean : 2.188 Mean : 4.123
## 3rd Qu.: 3.000 3rd Qu.: 7.000
## Max. :15.000 Max. :17.000
##
My observations were noted in section 2B when I did a summary of the data while I was exploring the dataset.
All the min and max values are in the summary in 4A above; below are the ones I believe to be most important.
min(mydata$Age)
## [1] 18
min(mydata$DailyRate)
## [1] 102
min(mydata$JobSatisfaction)
## [1] 1
min(mydata$YearsAtCompany)
## [1] 0
max(mydata$Age)
## [1] 60
max(mydata$DailyRate)
## [1] 1499
max(mydata$JobSatisfaction)
## [1] 4
max(mydata$YearsAtCompany)
## [1] 40
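The eight separate calls can also be condensed into one. The sketch below assumes a stand-in data frame (`extremes`, a hypothetical name) that holds only the minimum and maximum values reported above, so that it runs without the original CSV. On the real data, the same `sapply()` call would be applied to the corresponding columns of `mydata`.

```r
# Stand-in data: one row of reported minimums and one of reported maximums
extremes <- data.frame(
  Age             = c(18, 60),
  DailyRate       = c(102, 1499),
  JobSatisfaction = c(1, 4),
  YearsAtCompany  = c(0, 40)
)

# range() returns c(min, max); sapply() applies it to every column
ranges <- sapply(extremes, range)
rownames(ranges) <- c("min", "max")
ranges
```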
data_corr <- cor(mydata$Age, mydata$MonthlyIncome)
data_corr
## [1] 0.4978546
This number confirms the positive correlation between age and monthly income shown in the earlier graph; a coefficient of about 0.50 indicates a moderate positive correlation. This makes sense: the older you are, the more work experience you most likely have, which makes you more valuable to the company and means it has to pay you more.
# Scatter plot of monthly income against age, with a fitted linear trend line
p <- ggplot(mydata, aes(x = Age, y = MonthlyIncome)) + geom_point()
p + geom_smooth(method = "lm")
As mentioned, the variables are positively correlated, hence the positive slope.
# Scatter plot of monthly income against distance from home, with a linear trend line
p2 <- ggplot(mydata, aes(x = DistanceFromHome, y = MonthlyIncome)) + geom_point()
p2 + geom_smooth(method = "lm")
In this graph the two variables appear only weakly correlated; however, the farther employees live from work, the fewer of them have high incomes.
Based on your hypothesis, create a predictive model.
reg <- lm( mydata$MonthlyIncome ~ mydata$Age, data = mydata )
reg
##
## Call:
## lm(formula = mydata$MonthlyIncome ~ mydata$Age, data = mydata)
##
## Coefficients:
## (Intercept) mydata$Age
## -2970.7 256.6
summary(reg)
##
## Call:
## lm(formula = mydata$MonthlyIncome ~ mydata$Age, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9990.1 -2592.7 -677.9 1810.5 12540.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2970.67 443.70 -6.695 3.06e-11 ***
## mydata$Age 256.57 11.67 21.995 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4084 on 1468 degrees of freedom
## Multiple R-squared: 0.2479, Adjusted R-squared: 0.2473
## F-statistic: 483.8 on 1 and 1468 DF, p-value: < 2.2e-16
Predicting monthly income from age alone gives a low R-squared of .2479 and an adjusted R-squared of .2473, meaning that age by itself is not a great predictor of income.
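For a regression with a single predictor, R-squared equals the square of the Pearson correlation between the two variables, which ties this result to the correlation computed earlier. This is a quick check using only the numbers printed above; the variable names are illustrative.

```r
# Pearson correlation between Age and MonthlyIncome, as printed by cor() above
r <- 0.4978546

# For simple linear regression, R-squared is the squared correlation
r_squared <- r^2
round(r_squared, 4)

# Compare with the "Multiple R-squared" line of summary(reg)
reported_r_squared <- 0.2479
isTRUE(all.equal(round(r_squared, 4), reported_r_squared))
```

In other words, age alone accounts for roughly a quarter of the variation in monthly income.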
reg2 <- lm( mydata$MonthlyIncome ~ mydata$Age + mydata$JobLevel + mydata$Education, data = mydata )
summary(reg2)
##
## Call:
## lm(formula = mydata$MonthlyIncome ~ mydata$Age + mydata$JobLevel +
## mydata$Education, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5117.5 -962.0 107.1 757.9 3834.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2062.174 177.579 -11.613 <2e-16 ***
## mydata$Age 9.957 4.947 2.013 0.0443 *
## mydata$JobLevel 4001.877 40.139 99.700 <2e-16 ***
## mydata$Education -21.359 38.162 -0.560 0.5758
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1465 on 1466 degrees of freedom
## Multiple R-squared: 0.9033, Adjusted R-squared: 0.9031
## F-statistic: 4567 on 3 and 1466 DF, p-value: < 2.2e-16
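When age is coupled with job level and education, the R-squared shoots up to .9033 and the adjusted R-squared to .9031. The coefficient table shows that job level accounts for most of this improvement: it is highly significant (p < 2e-16), while age is only marginally significant (p = 0.0443) and education is not significant (p = 0.5758).
This model predicts the dependent variable, monthly income, quite accurately: the R-squared and adjusted R-squared are both above .9, which is quite high.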
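To make the fitted model concrete, the coefficients printed by `summary(reg2)` can be plugged into the regression equation by hand. The helper below (`predict_income`, a hypothetical name) and the example employee are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from the summary(reg2) output above
coefs <- c(intercept = -2062.174, age = 9.957, job_level = 4001.877, education = -21.359)

# Hypothetical helper: evaluate the fitted linear equation for one employee
predict_income <- function(age, job_level, education) {
  unname(coefs["intercept"] +
    coefs["age"] * age +
    coefs["job_level"] * job_level +
    coefs["education"] * education)
}

# Example employee (made up): 36 years old, job level 2, education level 3
predict_income(age = 36, job_level = 2, education = 3)

# Moving up one job level changes the prediction by the JobLevel coefficient
predict_income(36, 3, 3) - predict_income(36, 2, 3)
```

For this example the predicted monthly income is about 6,236, and each additional job level adds roughly 4,002, which shows how strongly job level drives the model.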