Can Job Satisfaction be Predicted?

The Data

I used a Human Resources data set from Kaggle.com.

pth <- "C:\\Users\\Nate\\Documents\\DataSet\\HR_comma_sep.csv"
hr <- pth %>% read.csv() %>% data.frame()
head(hr)
##   satisfaction_level last_evaluation number_project average_montly_hours
## 1               0.38            0.53              2                  157
## 2               0.80            0.86              5                  262
## 3               0.11            0.88              7                  272
## 4               0.72            0.87              5                  223
## 5               0.37            0.52              2                  159
## 6               0.41            0.50              2                  153
##   time_spend_company Work_accident left promotion_last_5years sales salary
## 1                  3             0    1                     0 sales    low
## 2                  6             0    1                     0 sales medium
## 3                  4             0    1                     0 sales medium
## 4                  5             0    1                     0 sales    low
## 5                  3             0    1                     0 sales    low
## 6                  3             0    1                     0 sales    low
summary(hr)
##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident         left       
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 3.000     Median :0.0000   Median :0.0000  
##  Mean   : 3.498     Mean   :0.1446   Mean   :0.2381  
##  3rd Qu.: 4.000     3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000  
##                                                      
##  promotion_last_5years         sales         salary    
##  Min.   :0.00000       sales      :4140   high  :1237  
##  1st Qu.:0.00000       technical  :2720   low   :7316  
##  Median :0.00000       support    :2229   medium:6446  
##  Mean   :0.02127       IT         :1227                
##  3rd Qu.:0.00000       product_mng: 902                
##  Max.   :1.00000       marketing  : 858                
##                        (Other)    :2923

‘left’ is the dichotomous variable, with 1 meaning the employee left the company and 0 meaning they are still there. left * time_spend_company will be my interaction term, to identify whether the relationship between tenure and satisfaction differs for employees who left, and time_spend_company will also enter as a quadratic term.
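
In formula terms, this plan can be written compactly by passing the data frame to lm(), using I() for the quadratic term and left:time_spend_company for the interaction. The sketch below is just an equivalent way of specifying the model fitted in the next section (only the coefficient names differ):

# Equivalent, more compact specification of the model fitted in the next section:
# I() creates the quadratic tenure term and : creates the left-by-tenure interaction
fit_alt <- lm(satisfaction_level ~ last_evaluation + number_project +
                average_montly_hours + time_spend_company +
                I(time_spend_company^2) + Work_accident + left +
                left:time_spend_company + promotion_last_5years,
              data = hr)
summary(fit_alt)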

The Model

time_sq <- hr$time_spend_company^2
fit <- lm(hr$satisfaction_level ~ hr$last_evaluation + hr$number_project + hr$average_montly_hours + hr$time_spend_company + time_sq + hr$Work_accident + hr$left + hr$left:hr$time_spend_company + hr$promotion_last_5years)
summary(fit)
## 
## Call:
## lm(formula = hr$satisfaction_level ~ hr$last_evaluation + hr$number_project + 
##     hr$average_montly_hours + hr$time_spend_company + time_sq + 
##     hr$Work_accident + hr$left + hr$left:hr$time_spend_company + 
##     hr$promotion_last_5years)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71176 -0.15070  0.01652  0.15398  0.79805 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    1.044e+00  1.449e-02  72.082  < 2e-16 ***
## hr$last_evaluation             9.434e-02  1.144e-02   8.250  < 2e-16 ***
## hr$number_project             -5.455e-02  1.636e-03 -33.352  < 2e-16 ***
## hr$average_montly_hours       -1.721e-04  3.945e-05  -4.361  1.3e-05 ***
## hr$time_spend_company         -8.819e-02  4.890e-03 -18.035  < 2e-16 ***
## time_sq                        6.731e-03  4.714e-04  14.277  < 2e-16 ***
## hr$Work_accident              -4.352e-04  4.901e-03  -0.089   0.9292    
## hr$left                       -9.919e-01  1.760e-02 -56.344  < 2e-16 ***
## hr$promotion_last_5years       2.242e-02  1.187e-02   1.889   0.0589 .  
## hr$time_spend_company:hr$left  2.064e-01  4.527e-03  45.592  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2084 on 14989 degrees of freedom
## Multiple R-squared:  0.298,  Adjusted R-squared:  0.2976 
## F-statistic: 706.9 on 9 and 14989 DF,  p-value: < 2.2e-16

Only Work_accident is not statistically significant at the 95% level (p = 0.9292), so we can remove it from the model. Promotion_last_5years is borderline; I removed it too, after its p-value grew once Work_accident was dropped. Adjusted R\(^2\) barely changed, going from 0.2976 to 0.2975.
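
One way to drop the terms without retyping the whole formula is update(); the sketch below assumes fit still holds the full model fitted above (the object names fit_full, fit_no_wa, and fit_small are just labels for this sketch), and anova() then compares the nested fits:

# Drop the non-significant terms one at a time without retyping the formula
fit_full  <- fit
fit_no_wa <- update(fit_full,  . ~ . - hr$Work_accident)
fit_small <- update(fit_no_wa, . ~ . - hr$promotion_last_5years)

# Nested-model F-test: does the reduced model fit significantly worse?
anova(fit_small, fit_full)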

time_sq <- hr$time_spend_company^2
fit <- lm(hr$satisfaction_level ~ hr$last_evaluation + hr$number_project + hr$average_montly_hours + hr$time_spend_company + time_sq + hr$left + hr$left:hr$time_spend_company)
summary(fit)
## 
## Call:
## lm(formula = hr$satisfaction_level ~ hr$last_evaluation + hr$number_project + 
##     hr$average_montly_hours + hr$time_spend_company + time_sq + 
##     hr$left + hr$left:hr$time_spend_company)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7120 -0.1512  0.0165  0.1539  0.7977 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    1.044e+00  1.446e-02  72.232  < 2e-16 ***
## hr$last_evaluation             9.428e-02  1.144e-02   8.245  < 2e-16 ***
## hr$number_project             -5.457e-02  1.636e-03 -33.360  < 2e-16 ***
## hr$average_montly_hours       -1.714e-04  3.945e-05  -4.344 1.41e-05 ***
## hr$time_spend_company         -8.812e-02  4.890e-03 -18.020  < 2e-16 ***
## time_sq                        6.744e-03  4.713e-04  14.307  < 2e-16 ***
## hr$left                       -9.914e-01  1.759e-02 -56.351  < 2e-16 ***
## hr$time_spend_company:hr$left  2.061e-01  4.525e-03  45.554  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2084 on 14991 degrees of freedom
## Multiple R-squared:  0.2978, Adjusted R-squared:  0.2975 
## F-statistic: 908.3 on 7 and 14991 DF,  p-value: < 2.2e-16

The following predictors have negative coefficients, meaning that higher values go along with lower job satisfaction: number of projects, average monthly hours, and time spent at the company. This paints a picture of over-work gradually wearing people down until they quit.

The following have positive coefficients: time since last evaluation, time_sq, and the left * time spent at company interaction. Note that squaring time puts extra weight on people who spend a very long time at a company (2^2 is 4, but 10^2 is 100). Furthermore, the left * time spent at company interaction means that among people who did leave, those who had spent a great deal of time at the company were more satisfied than those who left after a shorter stay. This suggests that frequent feedback is important to job satisfaction, and that building a sense of loyalty (as measured by time spent) is important as well.
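
To make the tenure story concrete, here is a small sketch that pulls the fitted coefficients and traces the tenure-related part of predicted satisfaction for stayers versus leavers, holding everything else fixed. It assumes fit is the reduced model above and that the coefficient names match the summary output:

b <- coef(fit)
tenure <- 2:10  # observed range of time_spend_company

# Tenure contribution to predicted satisfaction for employees who stayed (left = 0)
stay_effect <- b["hr$time_spend_company"] * tenure + b["time_sq"] * tenure^2

# For employees who left (left = 1): add the left shift and the left-by-tenure interaction
left_effect <- stay_effect + b["hr$left"] + b["hr$time_spend_company:hr$left"] * tenure

round(cbind(tenure, stay_effect, left_effect), 3)

With the coefficients printed above, the leaver curve starts far below the stayer curve at short tenures and rises above it at long tenures, which is the pattern described in the previous paragraph.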

The Residuals

Although the residuals are reasonably normal, the residuals-versus-fitted plot shows a definite pattern rather than constant variance, so I would say the multiple regression is not valid.

plot(fitted(fit), resid(fit))  # residuals vs. fitted values: check for constant variance

hist(resid(fit))  # distribution of the residuals

qqnorm(resid(fit))  # normal Q-Q plot: check normality of the residuals
qqline(resid(fit))
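
Beyond the eyeball check, the Breusch-Pagan test from the lmtest package gives a formal test of the constant-variance assumption; a small p-value would back up the pattern visible in the residuals-vs-fitted plot. A minimal sketch, assuming the lmtest package is installed:

library(lmtest)

# Breusch-Pagan test: the null hypothesis is constant error variance
bptest(fit)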