DATA 605 Week12

Can Job Satisfaction be Predicted?

The Data

I used a data set from Kaggle.com on Human Resources data.

pth <- "C:\\Users\\Nate\\Documents\\DataSet\\HR_comma_sep.csv"
hr <- pth %>% read.csv() %>% data.frame()
head(hr)

##   satisfaction_level last_evaluation number_project average_montly_hours
## 1               0.38            0.53              2                  157
## 2               0.80            0.86              5                  262
## 3               0.11            0.88              7                  272
## 4               0.72            0.87              5                  223
## 5               0.37            0.52              2                  159
## 6               0.41            0.50              2                  153
##   time_spend_company Work_accident left promotion_last_5years sales salary
## 1                  3             0    1                     0 sales    low
## 2                  6             0    1                     0 sales medium
## 3                  4             0    1                     0 sales medium
## 4                  5             0    1                     0 sales    low
## 5                  3             0    1                     0 sales    low
## 6                  3             0    1                     0 sales    low

summary(hr)

##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident         left       
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 3.000     Median :0.0000   Median :0.0000  
##  Mean   : 3.498     Mean   :0.1446   Mean   :0.2381  
##  3rd Qu.: 4.000     3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000  
##                                                      
##  promotion_last_5years         sales         salary    
##  Min.   :0.00000       sales      :4140   high  :1237  
##  1st Qu.:0.00000       technical  :2720   low   :7316  
##  Median :0.00000       support    :2229   medium:6446  
##  Mean   :0.02127       IT         :1227                
##  3rd Qu.:0.00000       product_mng: 902                
##  Max.   :1.00000       marketing  : 858                
##                        (Other)    :2923

‘Left’ is the dichotomial variable with 1 being left the company and 0 being still there. Monthly hours * left will be my dichotomous interaction term to identify if long hours are a reason for leaving a company. Time_spend_company will be the quadratic term.

The Model

time_sq <- hr$time_spend_company^2
fit <- lm(hr$satisfaction_level ~ hr$last_evaluation + hr$number_project + hr$average_montly_hours + hr$time_spend_company + time_sq + hr$Work_accident + hr$left + hr$left:hr$time_spend_company + hr$promotion_last_5years)
summary(fit)

## 
## Call:
## lm(formula = hr$satisfaction_level ~ hr$last_evaluation + hr$number_project + 
##     hr$average_montly_hours + hr$time_spend_company + time_sq + 
##     hr$Work_accident + hr$left + hr$left:hr$time_spend_company + 
##     hr$promotion_last_5years)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71176 -0.15070  0.01652  0.15398  0.79805 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    1.044e+00  1.449e-02  72.082  < 2e-16 ***
## hr$last_evaluation             9.434e-02  1.144e-02   8.250  < 2e-16 ***
## hr$number_project             -5.455e-02  1.636e-03 -33.352  < 2e-16 ***
## hr$average_montly_hours       -1.721e-04  3.945e-05  -4.361  1.3e-05 ***
## hr$time_spend_company         -8.819e-02  4.890e-03 -18.035  < 2e-16 ***
## time_sq                        6.731e-03  4.714e-04  14.277  < 2e-16 ***
## hr$Work_accident              -4.352e-04  4.901e-03  -0.089   0.9292    
## hr$left                       -9.919e-01  1.760e-02 -56.344  < 2e-16 ***
## hr$promotion_last_5years       2.242e-02  1.187e-02   1.889   0.0589 .  
## hr$time_spend_company:hr$left  2.064e-01  4.527e-03  45.592  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2084 on 14989 degrees of freedom
## Multiple R-squared:  0.298,  Adjusted R-squared:  0.2976 
## F-statistic: 706.9 on 9 and 14989 DF,  p-value: < 2.2e-16

Only work accident does not have a statistically significant at the 95% correlation with a P = 0.9292. We can remove it from the model. Promotion_last_5years is borderline, I did remove it, too after its p-value became larger when removing work accidents. Adjusted R\(^2\) did not change much from 0.2976 to 0.2975.

time_sq <- hr$time_spend_company^2
fit <- lm(hr$satisfaction_level ~ hr$last_evaluation + hr$number_project + hr$average_montly_hours + hr$time_spend_company + time_sq + hr$left + hr$left:hr$time_spend_company)
summary(fit)

## 
## Call:
## lm(formula = hr$satisfaction_level ~ hr$last_evaluation + hr$number_project + 
##     hr$average_montly_hours + hr$time_spend_company + time_sq + 
##     hr$left + hr$left:hr$time_spend_company)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7120 -0.1512  0.0165  0.1539  0.7977 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    1.044e+00  1.446e-02  72.232  < 2e-16 ***
## hr$last_evaluation             9.428e-02  1.144e-02   8.245  < 2e-16 ***
## hr$number_project             -5.457e-02  1.636e-03 -33.360  < 2e-16 ***
## hr$average_montly_hours       -1.714e-04  3.945e-05  -4.344 1.41e-05 ***
## hr$time_spend_company         -8.812e-02  4.890e-03 -18.020  < 2e-16 ***
## time_sq                        6.744e-03  4.713e-04  14.307  < 2e-16 ***
## hr$left                       -9.914e-01  1.759e-02 -56.351  < 2e-16 ***
## hr$time_spend_company:hr$left  2.061e-01  4.525e-03  45.554  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2084 on 14991 degrees of freedom
## Multiple R-squared:  0.2978, Adjusted R-squared:  0.2975 
## F-statistic: 908.3 on 7 and 14991 DF,  p-value: < 2.2e-16

The following are negatively correlated, meaning that as job satisfaction increases these numbers tend to decrease: numbers of projects, average monthly hours, time spent at company. This paints a picture that over-work tends to wear people down with a company until they quit.

The following are positively correlated: time since last evaluation, time_sq, and time spent*left. Note that time squared puts extra emphasis on people that spend a very large amount of time at a company (2^2 is 4, but 10^2 is 100). Furthermore, left*time spent at company means that for people that did leave, but spent a great deal of time at a company were more satisfied than people who left after a smaller amount of time. This means that frequent feedback is important to job satisfaction, and that building a sense of loyalty (as measures by time spent) is important for job satisfaction.

The Residuals

Although the residuals are reasonably normal, there is a definite pattern regarding the constant variance, so I would say the the multiple regression is not valid.

plot(fitted(fit), resid(fit))

hist(resid(fit))

qqnorm(resid(fit))
qqline(resid(fit))

DATA 605 Week12

Nathan Cooper

November 14, 2017

Can Job Satisfaction be Predicted?

The Data

The Model

The Residuals