I used a data set from Kaggle.com on Human Resources data.
pth <- "C:\\Users\\Nate\\Documents\\DataSet\\HR_comma_sep.csv"
hr <- pth %>% read.csv() %>% data.frame()
head(hr)
## satisfaction_level last_evaluation number_project average_montly_hours
## 1 0.38 0.53 2 157
## 2 0.80 0.86 5 262
## 3 0.11 0.88 7 272
## 4 0.72 0.87 5 223
## 5 0.37 0.52 2 159
## 6 0.41 0.50 2 153
## time_spend_company Work_accident left promotion_last_5years sales salary
## 1 3 0 1 0 sales low
## 2 6 0 1 0 sales medium
## 3 4 0 1 0 sales medium
## 4 5 0 1 0 sales low
## 5 3 0 1 0 sales low
## 6 3 0 1 0 sales low
summary(hr)
## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.0900 Min. :0.3600 Min. :2.000 Min. : 96.0
## 1st Qu.:0.4400 1st Qu.:0.5600 1st Qu.:3.000 1st Qu.:156.0
## Median :0.6400 Median :0.7200 Median :4.000 Median :200.0
## Mean :0.6128 Mean :0.7161 Mean :3.803 Mean :201.1
## 3rd Qu.:0.8200 3rd Qu.:0.8700 3rd Qu.:5.000 3rd Qu.:245.0
## Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :310.0
##
## time_spend_company Work_accident left
## Min. : 2.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 3.000 Median :0.0000 Median :0.0000
## Mean : 3.498 Mean :0.1446 Mean :0.2381
## 3rd Qu.: 4.000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :10.000 Max. :1.0000 Max. :1.0000
##
## promotion_last_5years sales salary
## Min. :0.00000 sales :4140 high :1237
## 1st Qu.:0.00000 technical :2720 low :7316
## Median :0.00000 support :2229 medium:6446
## Mean :0.02127 IT :1227
## 3rd Qu.:0.00000 product_mng: 902
## Max. :1.00000 marketing : 858
## (Other) :2923
‘Left’ is the dichotomial variable with 1 being left the company and 0 being still there. Monthly hours * left will be my dichotomous interaction term to identify if long hours are a reason for leaving a company. Time_spend_company will be the quadratic term.
time_sq <- hr$time_spend_company^2
fit <- lm(hr$satisfaction_level ~ hr$last_evaluation + hr$number_project + hr$average_montly_hours + hr$time_spend_company + time_sq + hr$Work_accident + hr$left + hr$left:hr$time_spend_company + hr$promotion_last_5years)
summary(fit)
##
## Call:
## lm(formula = hr$satisfaction_level ~ hr$last_evaluation + hr$number_project +
## hr$average_montly_hours + hr$time_spend_company + time_sq +
## hr$Work_accident + hr$left + hr$left:hr$time_spend_company +
## hr$promotion_last_5years)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.71176 -0.15070 0.01652 0.15398 0.79805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.044e+00 1.449e-02 72.082 < 2e-16 ***
## hr$last_evaluation 9.434e-02 1.144e-02 8.250 < 2e-16 ***
## hr$number_project -5.455e-02 1.636e-03 -33.352 < 2e-16 ***
## hr$average_montly_hours -1.721e-04 3.945e-05 -4.361 1.3e-05 ***
## hr$time_spend_company -8.819e-02 4.890e-03 -18.035 < 2e-16 ***
## time_sq 6.731e-03 4.714e-04 14.277 < 2e-16 ***
## hr$Work_accident -4.352e-04 4.901e-03 -0.089 0.9292
## hr$left -9.919e-01 1.760e-02 -56.344 < 2e-16 ***
## hr$promotion_last_5years 2.242e-02 1.187e-02 1.889 0.0589 .
## hr$time_spend_company:hr$left 2.064e-01 4.527e-03 45.592 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2084 on 14989 degrees of freedom
## Multiple R-squared: 0.298, Adjusted R-squared: 0.2976
## F-statistic: 706.9 on 9 and 14989 DF, p-value: < 2.2e-16
Only work accident does not have a statistically significant at the 95% correlation with a P = 0.9292. We can remove it from the model. Promotion_last_5years is borderline, I did remove it, too after its p-value became larger when removing work accidents. Adjusted R\(^2\) did not change much from 0.2976 to 0.2975.
time_sq <- hr$time_spend_company^2
fit <- lm(hr$satisfaction_level ~ hr$last_evaluation + hr$number_project + hr$average_montly_hours + hr$time_spend_company + time_sq + hr$left + hr$left:hr$time_spend_company)
summary(fit)
##
## Call:
## lm(formula = hr$satisfaction_level ~ hr$last_evaluation + hr$number_project +
## hr$average_montly_hours + hr$time_spend_company + time_sq +
## hr$left + hr$left:hr$time_spend_company)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7120 -0.1512 0.0165 0.1539 0.7977
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.044e+00 1.446e-02 72.232 < 2e-16 ***
## hr$last_evaluation 9.428e-02 1.144e-02 8.245 < 2e-16 ***
## hr$number_project -5.457e-02 1.636e-03 -33.360 < 2e-16 ***
## hr$average_montly_hours -1.714e-04 3.945e-05 -4.344 1.41e-05 ***
## hr$time_spend_company -8.812e-02 4.890e-03 -18.020 < 2e-16 ***
## time_sq 6.744e-03 4.713e-04 14.307 < 2e-16 ***
## hr$left -9.914e-01 1.759e-02 -56.351 < 2e-16 ***
## hr$time_spend_company:hr$left 2.061e-01 4.525e-03 45.554 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2084 on 14991 degrees of freedom
## Multiple R-squared: 0.2978, Adjusted R-squared: 0.2975
## F-statistic: 908.3 on 7 and 14991 DF, p-value: < 2.2e-16
The following are negatively correlated, meaning that as job satisfaction increases these numbers tend to decrease: numbers of projects, average monthly hours, time spent at company. This paints a picture that over-work tends to wear people down with a company until they quit.
The following are positively correlated: time since last evaluation, time_sq, and time spent*left. Note that time squared puts extra emphasis on people that spend a very large amount of time at a company (2^2 is 4, but 10^2 is 100). Furthermore, left*time spent at company means that for people that did leave, but spent a great deal of time at a company were more satisfied than people who left after a smaller amount of time. This means that frequent feedback is important to job satisfaction, and that building a sense of loyalty (as measures by time spent) is important for job satisfaction.
Although the residuals are reasonably normal, there is a definite pattern regarding the constant variance, so I would say the the multiple regression is not valid.
plot(fitted(fit), resid(fit))
hist(resid(fit))
qqnorm(resid(fit))
qqline(resid(fit))