## Loading required package: lpSolveAPI
## Loading required package: tidyverse
## -- Attaching packages ---------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.7.2     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0
## -- Conflicts ------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## Loading required package: plotly
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Task#1: About this Project


1A) Brief description of your project.

My project looks at a human resources data set from an unnamed company. It contains data on age, monthly income, years with the company, attrition, and a handful of other variables. My goal is to find correlations between some of the variables and to draw meaningful takeaways that this company, or businesses in general, can benefit from.

1B) Resources - Packages/software use on the project (R, Tableau, Watson)

The R packages loaded for this project were tidyverse, lpSolveAPI, and plotly, and the software used was R, Tableau, Watson Analytics, and Microsoft Excel.

1C) Data Description source, year, country.

I do not have a description of the data, and I am not sure of its source, year, or country.

1D) Potential business cases (hypothesis)

I do not expect to find a wage gap in this company because, after skimming through the data, I do not think the company has a gender bias. I do expect higher wages for people who are older, have worked there longer, and have more experience or education, because these people are more valuable in the company’s eyes. I expect a pay difference between job roles because each role involves a different skill level and may be valued differently. I think marital status will have no correlation with any of the variables, and that many of the variables combined affect attrition. I think business travel will affect pay because frequent travel might mean the person is putting in more hours, depending on where they travel. Lastly, I think distance from home and pay will be correlated: people who are paid less may live in cheaper places further from work, while people who earn more can afford to live downtown, if the company is located in a city. Overall, I am interested in creating a predictive model of employee attrition, and I believe most of these variables will contribute to the outcome. I also want to create a predictive model of employees’ monthly income.


Task#2: Data Collection - INDUSTRY - HR_data_set


2A) Brief description and key metrics of the project industry (context)

The company seems to be some type of research and development business with a sales department as well. The data is human resources data collected on employees, and the variables include age, attrition, business travel, daily rate, department, distance from home, education, education field, environment satisfaction, gender, hourly rate, job involvement, job level, job role, job satisfaction, marital status, monthly income, monthly rate, number of companies worked, overtime, percent salary hike, performance rating, relationship satisfaction, standard hours, stock option level, total working years, training times last year, work-life balance, years at company, years in current role, years since last promotion, and years with current manager. Since it is HR data, key metrics or KPIs for HR would be attrition, job satisfaction, job involvement, work-life balance, and relationship satisfaction.

2B) Explore the dataset, note any interesting patterns highlights

mydata <- read.csv("C:\\Users\\dylan\\Desktop\\BSAD343H\\Project\\HR_data_set.csv")
summary(mydata)
##       Age        Attrition            BusinessTravel   DailyRate     
##  Min.   :18.00   No :1233   Non-Travel       : 150   Min.   : 102.0  
##  1st Qu.:30.00   Yes: 237   Travel_Frequently: 277   1st Qu.: 465.0  
##  Median :36.00              Travel_Rarely    :1043   Median : 802.0  
##  Mean   :36.92                                       Mean   : 802.5  
##  3rd Qu.:43.00                                       3rd Qu.:1157.0  
##  Max.   :60.00                                       Max.   :1499.0  
##                                                                      
##                   Department  DistanceFromHome   Education    
##  Human Resources       : 63   Min.   : 1.000   Min.   :1.000  
##  Research & Development:961   1st Qu.: 2.000   1st Qu.:2.000  
##  Sales                 :446   Median : 7.000   Median :3.000  
##                               Mean   : 9.193   Mean   :2.913  
##                               3rd Qu.:14.000   3rd Qu.:4.000  
##                               Max.   :29.000   Max.   :5.000  
##                                                               
##           EducationField EnvironmentSatisfaction    Gender   
##  Human Resources : 27    Min.   :1.000           Female:588  
##  Life Sciences   :606    1st Qu.:2.000           Male  :882  
##  Marketing       :159    Median :3.000                       
##  Medical         :464    Mean   :2.722                       
##  Other           : 82    3rd Qu.:4.000                       
##  Technical Degree:132    Max.   :4.000                       
##                                                              
##    HourlyRate     JobInvolvement    JobLevel    
##  Min.   : 30.00   Min.   :1.00   Min.   :1.000  
##  1st Qu.: 48.00   1st Qu.:2.00   1st Qu.:1.000  
##  Median : 66.00   Median :3.00   Median :2.000  
##  Mean   : 65.89   Mean   :2.73   Mean   :2.064  
##  3rd Qu.: 83.75   3rd Qu.:3.00   3rd Qu.:3.000  
##  Max.   :100.00   Max.   :4.00   Max.   :5.000  
##                                                 
##                       JobRole    JobSatisfaction  MaritalStatus
##  Sales Executive          :326   Min.   :1.000   Divorced:327  
##  Research Scientist       :292   1st Qu.:2.000   Married :673  
##  Laboratory Technician    :259   Median :3.000   Single  :470  
##  Manufacturing Director   :145   Mean   :2.729                 
##  Healthcare Representative:131   3rd Qu.:4.000                 
##  Manager                  :102   Max.   :4.000                 
##  (Other)                  :215                                 
##  MonthlyIncome    MonthlyRate    NumCompaniesWorked OverTime  
##  Min.   : 1009   Min.   : 2094   Min.   :0.000      No :1054  
##  1st Qu.: 2911   1st Qu.: 8047   1st Qu.:1.000      Yes: 416  
##  Median : 4919   Median :14236   Median :2.000                
##  Mean   : 6503   Mean   :14313   Mean   :2.693                
##  3rd Qu.: 8379   3rd Qu.:20462   3rd Qu.:4.000                
##  Max.   :19999   Max.   :26999   Max.   :9.000                
##                                                               
##  PercentSalaryHike PerformanceRating RelationshipSatisfaction
##  Min.   :11.00     Min.   :3.000     Min.   :1.000           
##  1st Qu.:12.00     1st Qu.:3.000     1st Qu.:2.000           
##  Median :14.00     Median :3.000     Median :3.000           
##  Mean   :15.21     Mean   :3.154     Mean   :2.712           
##  3rd Qu.:18.00     3rd Qu.:3.000     3rd Qu.:4.000           
##  Max.   :25.00     Max.   :4.000     Max.   :4.000           
##                                                              
##  StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear
##  Min.   :80    Min.   :0.0000   Min.   : 0.00     Min.   :0.000        
##  1st Qu.:80    1st Qu.:0.0000   1st Qu.: 6.00     1st Qu.:2.000        
##  Median :80    Median :1.0000   Median :10.00     Median :3.000        
##  Mean   :80    Mean   :0.7939   Mean   :11.28     Mean   :2.799        
##  3rd Qu.:80    3rd Qu.:1.0000   3rd Qu.:15.00     3rd Qu.:3.000        
##  Max.   :80    Max.   :3.0000   Max.   :40.00     Max.   :6.000        
##                                                                        
##  WorkLifeBalance YearsAtCompany   YearsInCurrentRole
##  Min.   :1.000   Min.   : 0.000   Min.   : 0.000    
##  1st Qu.:2.000   1st Qu.: 3.000   1st Qu.: 2.000    
##  Median :3.000   Median : 5.000   Median : 3.000    
##  Mean   :2.761   Mean   : 7.008   Mean   : 4.229    
##  3rd Qu.:3.000   3rd Qu.: 9.000   3rd Qu.: 7.000    
##  Max.   :4.000   Max.   :40.000   Max.   :18.000    
##                                                     
##  YearsSinceLastPromotion YearsWithCurrManager
##  Min.   : 0.000          Min.   : 0.000      
##  1st Qu.: 0.000          1st Qu.: 2.000      
##  Median : 1.000          Median : 3.000      
##  Mean   : 2.188          Mean   : 4.123      
##  3rd Qu.: 3.000          3rd Qu.: 7.000      
##  Max.   :15.000          Max.   :17.000      
## 

After going through the data, I realized that variables like environment satisfaction, job satisfaction, work-life balance, relationship satisfaction, and job involvement are all rankings on a scale of 1 to 4, while education is ranked on a scale of 1 to 5. Also, the minimum monthly income is 1009, which seems like an unlivable income.

knitr::include_graphics("C:\\Users\\dylan\\Desktop\\BSAD343H\\Project\\pic2.png")

This decision tree is a predictive model that has a predictive strength of 86% for attrition within the company. This looks like a high percentage, although accuracy alone can flatter a model when most employees stay. The tree can still lend some key insights into which employees tend to leave, whether they quit or are let go. From there the company can build a profile of a person who is likely to stay and succeed, and then refine its hiring process so that fewer people leave.
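
One way to put the 86% figure in context is to compare it with a naive baseline that always predicts the most common outcome. A minimal sketch, using only the attrition counts from the summary above (1233 "No", 237 "Yes"); the variable names are illustrative:

```r
# Attrition counts taken from the summary() output above.
attrition_counts <- c(No = 1233, Yes = 237)

# Accuracy of a rule that always predicts the most common outcome ("No").
baseline_accuracy <- max(attrition_counts) / sum(attrition_counts)
round(baseline_accuracy, 3)
```

The baseline comes out at roughly 84%, so an accuracy of 86% is only a couple of percentage points better than always guessing that an employee stays.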

knitr::include_graphics("C:\\Users\\dylan\\Desktop\\BSAD343H\\Project\\pic3.png")

Interestingly, this chart shows that the difference in hourly wage between men and women is very close to zero. To be specific, women make on average $65.90/hour while men make on average $65.88/hour.
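
The same comparison could be reproduced in R rather than read off a chart. A minimal sketch on a small made-up data frame (the values are invented for illustration; the column names `Gender`, `HourlyRate`, and `MonthlyIncome` follow the summary above):

```r
# Made-up rows standing in for the HR data; these are not real values.
toy <- data.frame(
  Gender = c("Female", "Male", "Female", "Male", "Female", "Male"),
  HourlyRate = c(60, 62, 70, 68, 66, 64),
  MonthlyIncome = c(4000, 4200, 6500, 6100, 5200, 5000)
)

# Mean hourly rate and mean monthly income for each gender.
gap_table <- aggregate(cbind(HourlyRate, MonthlyIncome) ~ Gender,
                       data = toy, FUN = mean)
gap_table
```

With the real data, `toy` would be replaced by `mydata`. Comparing monthly income as well as hourly rate would show whether the two pay measures tell the same story.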

knitr::include_graphics("C:\\Users\\dylan\\Desktop\\BSAD343H\\Project\\pic4.png")

As you can see from this graph, there is a positive correlation between age and monthly income as hypothesized.

2C) Paragraph using descriptive statistics

Watson Analytics provided the insights shown in the images above. It created a predictive decision tree with 86% accuracy, it showed essentially no gap in hourly wage between men and women (if anything, women make slightly more on average), and it showed a positive correlation between age and monthly income.


Task#3: Data Preparation - Cleaning and preparing the data for analysis


3A) Describe the steps to cleaning and prepare the data for analysis. Make an argument of why those steps are necessary.

The data for the most part seemed quite clean. I deleted both employee count and employee number because neither variable would provide any meaningful correlation with other variables. I also deleted the over-18 column: we already have age, and every employee is at least 18, so this variable is redundant.

3B) Clean the data

I cleaned the data in Excel before transferring to RStudio. I deleted all the unnecessary columns, and ran through to make sure none of the entries were missing or misspelled. Therefore, I do not need RStudio for cleaning the data.
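
For reproducibility, the same cleaning steps could also be scripted in R. A minimal sketch, assuming the deleted columns were named `EmployeeCount`, `EmployeeNumber`, and `Over18` (the exact names are an assumption, and the toy data frame below is made up):

```r
# Made-up raw data with the three columns that were deleted in Excel.
# The column names EmployeeCount, EmployeeNumber, and Over18 are assumed.
raw <- data.frame(
  Age = c(41, 49, 37),
  MonthlyIncome = c(5000, 6200, 2100),
  EmployeeCount = c(1, 1, 1),
  EmployeeNumber = c(1, 2, 4),
  Over18 = c("Y", "Y", "Y")
)

# Drop the identifier and constant columns.
drop_cols <- c("EmployeeCount", "EmployeeNumber", "Over18")
clean <- raw[, !(names(raw) %in% drop_cols), drop = FALSE]

# Count missing values per column; all zeros means nothing is missing.
missing_per_column <- colSums(is.na(clean))
missing_per_column
```

Scripting the steps keeps a record of exactly what was removed, which a manual edit in Excel does not.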

3C) Save a new clean dataset

As mentioned, the data was cleaned in Excel so I do not need to save a new dataset.
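
If the cleaning were done in R instead, the cleaned data could be written to a new file. A minimal sketch that writes to a temporary file (the file path and the toy data frame are placeholders, not the project's actual file):

```r
# Made-up cleaned data; stands in for the real cleaned data frame.
clean <- data.frame(Age = c(41, 49, 37), MonthlyIncome = c(5000, 6200, 2100))

# Write to a temporary CSV file without row names, then read it back to check.
clean_path <- tempfile(fileext = ".csv")
write.csv(clean, clean_path, row.names = FALSE)
reloaded <- read.csv(clean_path)
```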


Task#4: Data Analysis - Descriptive Statistics, Correlations


4A) Basic descriptive statistics of the new clean dataset (write down any observations)

summary(mydata)
##       Age        Attrition            BusinessTravel   DailyRate     
##  Min.   :18.00   No :1233   Non-Travel       : 150   Min.   : 102.0  
##  1st Qu.:30.00   Yes: 237   Travel_Frequently: 277   1st Qu.: 465.0  
##  Median :36.00              Travel_Rarely    :1043   Median : 802.0  
##  Mean   :36.92                                       Mean   : 802.5  
##  3rd Qu.:43.00                                       3rd Qu.:1157.0  
##  Max.   :60.00                                       Max.   :1499.0  
##                                                                      
##                   Department  DistanceFromHome   Education    
##  Human Resources       : 63   Min.   : 1.000   Min.   :1.000  
##  Research & Development:961   1st Qu.: 2.000   1st Qu.:2.000  
##  Sales                 :446   Median : 7.000   Median :3.000  
##                               Mean   : 9.193   Mean   :2.913  
##                               3rd Qu.:14.000   3rd Qu.:4.000  
##                               Max.   :29.000   Max.   :5.000  
##                                                               
##           EducationField EnvironmentSatisfaction    Gender   
##  Human Resources : 27    Min.   :1.000           Female:588  
##  Life Sciences   :606    1st Qu.:2.000           Male  :882  
##  Marketing       :159    Median :3.000                       
##  Medical         :464    Mean   :2.722                       
##  Other           : 82    3rd Qu.:4.000                       
##  Technical Degree:132    Max.   :4.000                       
##                                                              
##    HourlyRate     JobInvolvement    JobLevel    
##  Min.   : 30.00   Min.   :1.00   Min.   :1.000  
##  1st Qu.: 48.00   1st Qu.:2.00   1st Qu.:1.000  
##  Median : 66.00   Median :3.00   Median :2.000  
##  Mean   : 65.89   Mean   :2.73   Mean   :2.064  
##  3rd Qu.: 83.75   3rd Qu.:3.00   3rd Qu.:3.000  
##  Max.   :100.00   Max.   :4.00   Max.   :5.000  
##                                                 
##                       JobRole    JobSatisfaction  MaritalStatus
##  Sales Executive          :326   Min.   :1.000   Divorced:327  
##  Research Scientist       :292   1st Qu.:2.000   Married :673  
##  Laboratory Technician    :259   Median :3.000   Single  :470  
##  Manufacturing Director   :145   Mean   :2.729                 
##  Healthcare Representative:131   3rd Qu.:4.000                 
##  Manager                  :102   Max.   :4.000                 
##  (Other)                  :215                                 
##  MonthlyIncome    MonthlyRate    NumCompaniesWorked OverTime  
##  Min.   : 1009   Min.   : 2094   Min.   :0.000      No :1054  
##  1st Qu.: 2911   1st Qu.: 8047   1st Qu.:1.000      Yes: 416  
##  Median : 4919   Median :14236   Median :2.000                
##  Mean   : 6503   Mean   :14313   Mean   :2.693                
##  3rd Qu.: 8379   3rd Qu.:20462   3rd Qu.:4.000                
##  Max.   :19999   Max.   :26999   Max.   :9.000                
##                                                               
##  PercentSalaryHike PerformanceRating RelationshipSatisfaction
##  Min.   :11.00     Min.   :3.000     Min.   :1.000           
##  1st Qu.:12.00     1st Qu.:3.000     1st Qu.:2.000           
##  Median :14.00     Median :3.000     Median :3.000           
##  Mean   :15.21     Mean   :3.154     Mean   :2.712           
##  3rd Qu.:18.00     3rd Qu.:3.000     3rd Qu.:4.000           
##  Max.   :25.00     Max.   :4.000     Max.   :4.000           
##                                                              
##  StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear
##  Min.   :80    Min.   :0.0000   Min.   : 0.00     Min.   :0.000        
##  1st Qu.:80    1st Qu.:0.0000   1st Qu.: 6.00     1st Qu.:2.000        
##  Median :80    Median :1.0000   Median :10.00     Median :3.000        
##  Mean   :80    Mean   :0.7939   Mean   :11.28     Mean   :2.799        
##  3rd Qu.:80    3rd Qu.:1.0000   3rd Qu.:15.00     3rd Qu.:3.000        
##  Max.   :80    Max.   :3.0000   Max.   :40.00     Max.   :6.000        
##                                                                        
##  WorkLifeBalance YearsAtCompany   YearsInCurrentRole
##  Min.   :1.000   Min.   : 0.000   Min.   : 0.000    
##  1st Qu.:2.000   1st Qu.: 3.000   1st Qu.: 2.000    
##  Median :3.000   Median : 5.000   Median : 3.000    
##  Mean   :2.761   Mean   : 7.008   Mean   : 4.229    
##  3rd Qu.:3.000   3rd Qu.: 9.000   3rd Qu.: 7.000    
##  Max.   :4.000   Max.   :40.000   Max.   :18.000    
##                                                     
##  YearsSinceLastPromotion YearsWithCurrManager
##  Min.   : 0.000          Min.   : 0.000      
##  1st Qu.: 0.000          1st Qu.: 2.000      
##  Median : 1.000          Median : 3.000      
##  Mean   : 2.188          Mean   : 4.123      
##  3rd Qu.: 3.000          3rd Qu.: 7.000      
##  Max.   :15.000          Max.   :17.000      
## 

My observations were noted in section 2B when I did a summary of the data while I was exploring the dataset.

4B) Using descriptive statistics explore dataset investigate min and max values

All the min and max values are in the summary in 4A; below are the ones I believe are most important.

min(mydata$Age)
## [1] 18
min(mydata$DailyRate)
## [1] 102
min(mydata$JobSatisfaction)
## [1] 1
min(mydata$YearsAtCompany)
## [1] 0
max(mydata$Age)
## [1] 60
max(mydata$DailyRate)
## [1] 1499
max(mydata$JobSatisfaction)
## [1] 4
max(mydata$YearsAtCompany)
## [1] 40

4C) Create a correlation table ( only numeric data ). Note any significant values

data_corr <- cor(mydata$Age, mydata$MonthlyIncome)
data_corr
## [1] 0.4978546

This number confirms the earlier graph showing a positive correlation between age and monthly income, although at about 0.50 it is moderate rather than strong. This makes sense because the older you are, the more work experience you most likely have, which makes you more valuable to the company and means they have to pay you more.
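
A full correlation table over all numeric columns can be built in one call instead of one pair at a time. A minimal sketch on a small made-up data frame (the values are invented; with the real data, `toy` would be replaced by `mydata`):

```r
# Made-up rows standing in for the HR data; these are not real values.
toy <- data.frame(
  Age = c(25, 32, 41, 50, 58),
  MonthlyIncome = c(2500, 4100, 6000, 9000, 12000),
  DistanceFromHome = c(3, 20, 7, 1, 12),
  Gender = factor(c("Female", "Male", "Male", "Female", "Male"))
)

# Keep only the numeric columns, then compute all pairwise correlations.
numeric_cols <- sapply(toy, is.numeric)
corr_table <- round(cor(toy[, numeric_cols]), 2)
corr_table
```

On the real data, a constant column such as StandardHours (always 80 in the summary) has zero variance, so `cor()` would return `NA` for it with a warning; dropping it first avoids that.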

4D) Create a correlation table ( only numeric data ). Give a brief explanation/guess of why some variables are correlated

data_corr <- cor(mydata$DistanceFromHome, mydata$MonthlyIncome)
data_corr
## [1] -0.01701444

This number is slightly negative, but at about -0.017 it is so close to zero that there is effectively no linear correlation between distance from home and monthly income. It therefore does not support my hypothesis that people with less money tend to live in cheaper places further from the center of the city and so have a longer commute.
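
Whether a correlation this small is distinguishable from zero can be checked from the correlation and the sample size alone. A minimal sketch, taking n = 1470 from the attrition counts in the summary (1233 + 237) and using the standard t statistic for a Pearson correlation; the helper name `cor_p_value` is hypothetical:

```r
# Two-sided p-value for a Pearson correlation r observed on n pairs,
# using t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom.
cor_p_value <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  2 * pt(-abs(t_stat), df = n - 2)
}

n <- 1233 + 237  # number of employees in the summary above

p_distance <- cor_p_value(-0.01701444, n)  # DistanceFromHome vs MonthlyIncome
p_age <- cor_p_value(0.4978546, n)         # Age vs MonthlyIncome
c(distance = p_distance, age = p_age)
```

The p-value for distance from home is far above 0.05, so the data give no evidence of a linear relationship, while the p-value for age is effectively zero. With the data loaded, `cor.test(mydata$DistanceFromHome, mydata$MonthlyIncome)` runs the same test directly.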


Visual Analytics: Use Tableau or R to create plots


# Scatter plot of monthly income against age, with a fitted linear trend line
p <- ggplot(mydata, aes(x = Age, y = MonthlyIncome)) + geom_point()
p + geom_smooth(method = "lm")

As mentioned, the variables are positively correlated, hence the positive slope.

# Scatter plot of monthly income against distance from home, with a linear trend line
p2 <- ggplot(mydata, aes(x = DistanceFromHome, y = MonthlyIncome)) + geom_point()
p2 + geom_smooth(method = "lm")

In this graph the two variables look poorly correlated. There are fewer people with high incomes at larger distances from home, but this may simply reflect that fewer employees live far away.


Predictive Analytics: Create a Predictive Model


Based on your hypothesis create a predictive model


Task#5: Predictive Analytics - Create a Predictive Model


5A) Based on your hypothesis and observations create a predictive model

reg <- lm( mydata$MonthlyIncome ~ mydata$Age, data = mydata )
reg
## 
## Call:
## lm(formula = mydata$MonthlyIncome ~ mydata$Age, data = mydata)
## 
## Coefficients:
## (Intercept)   mydata$Age  
##     -2970.7        256.6
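
To make the fitted line concrete, the coefficients can be plugged into the regression equation by hand. A minimal sketch using the rounded estimates printed above; the function name `predict_income` is hypothetical:

```r
# Coefficients as printed in the model output above.
intercept <- -2970.7
slope_age <- 256.6

# Predicted monthly income for a given age under the simple linear model.
predict_income <- function(age) intercept + slope_age * age

predict_income(c(18, 40, 60))
```

Each additional year of age adds about 257 to the predicted monthly income. The negative intercept corresponds to age 0, far outside the observed range of 18 to 60, so it has no practical meaning on its own.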

5B) Evaluate the efficiency of model. Note R Square and Adjusted R Square, be suspicious of very high R Squares.

summary(reg)
## 
## Call:
## lm(formula = mydata$MonthlyIncome ~ mydata$Age, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9990.1 -2592.7  -677.9  1810.5 12540.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2970.67     443.70  -6.695 3.06e-11 ***
## mydata$Age    256.57      11.67  21.995  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4084 on 1468 degrees of freedom
## Multiple R-squared:  0.2479, Adjusted R-squared:  0.2473 
## F-statistic: 483.8 on 1 and 1468 DF,  p-value: < 2.2e-16

Predicting monthly income from age alone gives a low r-squared of .2479 and an adjusted r-squared of .2473, meaning that age explains only about 25% of the variation in income and is not a great predictor on its own.
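
For a regression with a single predictor, R-squared is the square of the correlation between the two variables, which gives a quick consistency check against the correlation computed in 4C. A minimal sketch:

```r
# Correlation between Age and MonthlyIncome reported in 4C.
r_age_income <- 0.4978546

# Squaring it should reproduce the Multiple R-squared of the one-variable model.
r_squared <- r_age_income^2
round(r_squared, 4)
```

The result rounds to 0.2479, matching the Multiple R-squared in the regression summary.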

5C) Try different models (combination of independent variables). To find a better model if possible.

reg2 <- lm( mydata$MonthlyIncome ~ mydata$Age + mydata$JobLevel + mydata$Education, data = mydata )
summary(reg2)
## 
## Call:
## lm(formula = mydata$MonthlyIncome ~ mydata$Age + mydata$JobLevel + 
##     mydata$Education, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5117.5  -962.0   107.1   757.9  3834.6 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -2062.174    177.579 -11.613   <2e-16 ***
## mydata$Age           9.957      4.947   2.013   0.0443 *  
## mydata$JobLevel   4001.877     40.139  99.700   <2e-16 ***
## mydata$Education   -21.359     38.162  -0.560   0.5758    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1465 on 1466 degrees of freedom
## Multiple R-squared:  0.9033, Adjusted R-squared:  0.9031 
## F-statistic:  4567 on 3 and 1466 DF,  p-value: < 2.2e-16

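When age is coupled with job level and education, the r-squared shoots up to .9033 and the adjusted r-squared goes to .9031. Almost all of the improvement comes from job level (t value of 99.7): the age coefficient drops from about 257 to about 10, and education is not significant (p = 0.5758).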

5D) Describe your findings. How accurately your model predicts the dependent (target) variable

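This model predicts the dependent variable, monthly income, quite accurately on the data it was fitted to. The r-squared and adjusted r-squared are both above .9, which is high, although the residual standard error of 1465 shows that individual predictions can still be off by a fair amount.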
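
Because R-squared measured on the fitting data can be optimistic, one common check is to hold out part of the data and score the model on rows it has not seen. A minimal sketch on synthetic data (the data frame, the coefficients used to simulate it, and the 70/30 split are all assumptions for illustration, not the project's data or method):

```r
# Synthetic stand-in for the HR data (made-up values, not the real dataset).
set.seed(1)
n <- 200
toy <- data.frame(
  Age = sample(18:60, n, replace = TRUE),
  JobLevel = sample(1:5, n, replace = TRUE)
)
toy$MonthlyIncome <- 4000 * toy$JobLevel + 10 * toy$Age + rnorm(n, sd = 1500)

# Hold out 30% of the rows for testing.
test_rows <- sample(n, size = round(0.3 * n))
train <- toy[-test_rows, ]
test <- toy[test_rows, ]

# Bare column names in the formula let predict() work with newdata.
fit <- lm(MonthlyIncome ~ Age + JobLevel, data = train)
pred <- predict(fit, newdata = test)

# Out-of-sample R-squared and root mean squared error.
ss_res <- sum((test$MonthlyIncome - pred)^2)
ss_tot <- sum((test$MonthlyIncome - mean(test$MonthlyIncome))^2)
test_r2 <- 1 - ss_res / ss_tot
test_rmse <- sqrt(mean((test$MonthlyIncome - pred)^2))
c(r_squared = test_r2, rmse = test_rmse)
```

Note that a formula written with `mydata$` prefixes, as in the models above, ties the fit to the original data frame, so `predict()` would ignore `newdata`; using bare column names with `data =` avoids that.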