This data set is a mock data set for a fictitious company’s HR department. It was created to help train HR professionals as real world HR data sets are hard to come by in the public domain. There are 36 variables in this data set and 316 observations. If we exclude variables that serve as numerical IDs, are missing for the majority of observations, or contain information also found in other variables we are left with 20 variables that are practical to run regression analysis on.
The variables are the following *Zip code Position Sex Marital status Citizenship Hispanic Race Termination reason Employment status Department, Manager Requirement source, Performance score, and from diversity job fair
The primary objective of this analysis is to determine the nature of any non-zero associations between salary and the predictor variables
In this data set many of the variables contain information that is either repeated or highly related to information from other variables. To avoid multicolinarity in our regression model we will have to remove any redundant variables. 9 of the variables serve as numeric IDs for categorical variables, which can be eliminated as they contain the same information. With only 313 unique observations there is not a lot of room for variables that will require large amounts of dummy variables if we want to hit the 15 observations per parameter rule of thumb. variables like state have simply too many unique values to be implemented in the model. Other variables like race and citizenship can be implemented if we group similar observations together such as considering non-citizens and eligible non-citizens as both non-citizens. Other variables such as employee name, date of hire, and last performance review date can be removed for simply being impractical, correlation with salary is highly unlikely and no practical meaning can be interpreted from any relationship. Some variables like department, position, and manager have large overlaps in their groupings of employees. Other groupings like marital status and married have completely overlapping information, with one variable providing more detailed information than the other. For these variables we are again limited by our number of observations and must use the variables with the least categories and most genral information. The variable date of birth would be much more practical for analysis is we converted it into each employees age, though it wouldn’t make sense to calculate current age for employees that were fired years ago, so we will calculate age as the difference between date of birth and date of termination for any terminated employee, and for those still employed the difference between date of birth and the last recorded performance review date 2/28/2019.
The following 15 variables will be used for the initial full model Salary Department Age Sex MarriedID CitizenDesc HispanicLatino RaceDesc FromDiversityJobFairID PerfScoreID EngagementSurvey EmpSatisfaction SpecialProjectsCount DaysLateLast30 *Absences
The initial model including all variable has some noticeable issues with the assumptions required for multiple linear regression. The vs fit plot shows that linearity of the data breaks down at observations with the smallest salary. The variance of the residuals also increases with larger salary In the residual plot. The QQ plot shows strong indication of non normality of the residuals.
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 46845.77281 | 14891.6926 | 3.1457655 | 0.0018281 |
| DepartmentExecutive Office | 169876.85284 | 20999.5604 | 8.0895433 | 0.0000000 |
| DepartmentIT/IS | 26081.09874 | 7734.1457 | 3.3722017 | 0.0008466 |
| DepartmentProduction | -20713.90122 | 9137.5567 | -2.2668972 | 0.0241303 |
| DepartmentSales | -12588.71973 | 9781.0042 | -1.2870580 | 0.1990969 |
| DepartmentSoftware Engineering | 20935.85939 | 8247.3236 | 2.5385035 | 0.0116540 |
| SexM | 782.44642 | 2079.9162 | 0.3761913 | 0.7070488 |
| MarriedID | -738.96546 | 2162.2171 | -0.3417628 | 0.7327760 |
| CitizenDescUS Citizen | -1475.88129 | 4768.1621 | -0.3095283 | 0.7571412 |
| HispanicLatinoYes | -2653.01368 | 3779.7663 | -0.7018989 | 0.4833032 |
| RaceDescWhite | -3252.71873 | 2318.5294 | -1.4029232 | 0.1617058 |
| Termd | -34.40088 | 2290.4615 | -0.0150192 | 0.9880272 |
| FromDiversityJobFairID | -2032.21489 | 4083.4868 | -0.4976666 | 0.6190949 |
| PerfScoreID | 6359.93385 | 2705.2534 | 2.3509568 | 0.0193924 |
| EngagementSurvey | 653.52371 | 1658.2249 | 0.3941104 | 0.6937882 |
| EmpSatisfaction | 947.09036 | 1206.8335 | 0.7847730 | 0.4332251 |
| SpecialProjectsCount | -1651.11016 | 1769.1955 | -0.9332548 | 0.3514621 |
| DaysLateLast30 | 2350.86086 | 1255.3523 | 1.8726702 | 0.0621165 |
| Absences | 360.39139 | 178.4222 | 2.0198800 | 0.0443120 |
| Age | 198.55942 | 120.8940 | 1.6424255 | 0.1015820 |
## Warning: not plotting observations with leverage one:
## 151
Checking the VIF of the model shows the variables department and special project count both have concerningly high VIF. Out of the two department is the more statisticaly signifcant variable. The variance inflation factor of department drops when special project count is removed from the model. We will keep special project count out of the model for the remainder of the analysis
## GVIF Df GVIF^(1/(2*Df))
## Department 22.650302 5 1.366178
## Sex 1.038347 1 1.018993
## MarriedID 1.095131 1 1.046485
## CitizenDesc 1.084047 1 1.041176
## HispanicLatino 1.143612 1 1.069398
## RaceDesc 1.259194 1 1.122138
## Termd 1.140917 1 1.068137
## FromDiversityJobFairID 1.377570 1 1.173699
## PerfScoreID 2.456543 1 1.567336
## EngagementSurvey 1.671086 1 1.292705
## EmpSatisfaction 1.172679 1 1.082903
## SpecialProjectsCount 16.826731 1 4.102040
## DaysLateLast30 2.572022 1 1.603753
## Absences 1.061992 1 1.030530
## Age 1.128080 1 1.062111
## GVIF Df GVIF^(1/(2*Df))
## Department 1.421808 5 1.035820
## Sex 1.037086 1 1.018374
## MarriedID 1.094835 1 1.046344
## CitizenDesc 1.076992 1 1.037782
## HispanicLatino 1.142464 1 1.068861
## RaceDesc 1.258950 1 1.122030
## Termd 1.132669 1 1.064269
## FromDiversityJobFairID 1.369783 1 1.170377
## PerfScoreID 2.430919 1 1.559140
## EngagementSurvey 1.664528 1 1.290166
## EmpSatisfaction 1.166275 1 1.079942
## DaysLateLast30 2.566187 1 1.601932
## Absences 1.061987 1 1.030527
## Age 1.122536 1 1.059498
performing a box cox transformation on the response variable salary
creates a large improvement in our assumptions for the model residuals.
Normality, linearity, and constant variance are all imporoved when
compared to the un-transformed model.
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 0.8333317 | 2e-07 | 3.838369e+06 | 0.0000000 |
| DepartmentExecutive Office | 0.0000011 | 3e-07 | 3.338734e+00 | 0.0009508 |
| DepartmentIT/IS | 0.0000004 | 1e-07 | 3.178179e+00 | 0.0016411 |
| DepartmentProduction | -0.0000003 | 1e-07 | -2.418385e+00 | 0.0162013 |
| DepartmentSales | 0.0000000 | 1e-07 | -3.475994e-01 | 0.7283915 |
| DepartmentSoftware Engineering | 0.0000004 | 1e-07 | 3.277818e+00 | 0.0011725 |
| SexM | 0.0000000 | 0e+00 | 4.248656e-01 | 0.6712476 |
| MarriedID | 0.0000000 | 0e+00 | 2.945925e-01 | 0.7685145 |
| CitizenDescUS Citizen | 0.0000000 | 1e-07 | -5.744203e-01 | 0.5661257 |
| HispanicLatinoYes | -0.0000001 | 1e-07 | -1.275157e+00 | 0.2032673 |
| RaceDescWhite | -0.0000001 | 0e+00 | -1.445534e+00 | 0.1493804 |
| Termd | 0.0000000 | 0e+00 | 7.206800e-02 | 0.9425971 |
| FromDiversityJobFairID | -0.0000001 | 1e-07 | -9.683425e-01 | 0.3336747 |
| PerfScoreID | 0.0000001 | 0e+00 | 1.449404e+00 | 0.1482979 |
| EngagementSurvey | 0.0000000 | 0e+00 | 5.693107e-01 | 0.5695830 |
| EmpSatisfaction | 0.0000000 | 0e+00 | 2.610800e-02 | 0.9791890 |
| DaysLateLast30 | 0.0000000 | 0e+00 | 9.311570e-01 | 0.3525416 |
| Absences | 0.0000000 | 0e+00 | 1.896328e+00 | 0.0589037 |
| Age | 0.0000000 | 0e+00 | 1.187967e+00 | 0.2358119 |
## Warning: not plotting observations with leverage one:
## 151
Here I reduce the model using a step wise procedure working backwards from the transformed linear model. The step wise procedure leaves only 4 explanatory variables in the linear model. these variables are department,Hispanic(Y/N),performance score, and absences.
## Start: AIC=-9336.52
## Salary ~ Department + Sex + MarriedID + CitizenDesc + HispanicLatino +
## RaceDesc + Termd + FromDiversityJobFairID + PerfScoreID +
## EngagementSurvey + EmpSatisfaction + DaysLateLast30 + Absences +
## Age
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Df Sum of Sq RSS AIC
## - EmpSatisfaction 1 1.0000e-16 2.5221e-11 -9338.5
## - Termd 1 4.0000e-16 2.5221e-11 -9338.5
## - MarriedID 1 7.5000e-15 2.5228e-11 -9338.4
## - Sex 1 1.5600e-14 2.5236e-11 -9338.3
## - EngagementSurvey 1 2.8000e-14 2.5249e-11 -9338.2
## - CitizenDesc 1 2.8500e-14 2.5249e-11 -9338.2
## - DaysLateLast30 1 7.4900e-14 2.5295e-11 -9337.6
## - FromDiversityJobFairID 1 8.1000e-14 2.5302e-11 -9337.5
## - Age 1 1.2190e-13 2.5342e-11 -9337.0
## - HispanicLatino 1 1.4040e-13 2.5361e-11 -9336.8
## <none> 2.5221e-11 -9336.5
## - RaceDesc 1 1.8050e-13 2.5401e-11 -9336.3
## - PerfScoreID 1 1.8140e-13 2.5402e-11 -9336.3
## - Absences 1 3.1060e-13 2.5531e-11 -9334.7
## - Department 5 1.8788e-11 4.4009e-11 -9173.4
##
## Step: AIC=-9338.52
## Salary ~ Department + Sex + MarriedID + CitizenDesc + HispanicLatino +
## RaceDesc + Termd + FromDiversityJobFairID + PerfScoreID +
## EngagementSurvey + DaysLateLast30 + Absences + Age
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Df Sum of Sq RSS AIC
## - Termd 1 5.0000e-16 2.5221e-11 -9340.5
## - MarriedID 1 7.4000e-15 2.5228e-11 -9340.4
## - Sex 1 1.5500e-14 2.5236e-11 -9340.3
## - EngagementSurvey 1 2.8100e-14 2.5249e-11 -9340.2
## - CitizenDesc 1 2.9000e-14 2.5250e-11 -9340.2
## - DaysLateLast30 1 7.4800e-14 2.5295e-11 -9339.6
## - FromDiversityJobFairID 1 8.1100e-14 2.5302e-11 -9339.5
## - Age 1 1.2260e-13 2.5343e-11 -9339.0
## - HispanicLatino 1 1.4100e-13 2.5362e-11 -9338.8
## <none> 2.5221e-11 -9338.5
## - RaceDesc 1 1.8040e-13 2.5401e-11 -9338.3
## - PerfScoreID 1 1.8930e-13 2.5410e-11 -9338.2
## - Absences 1 3.1240e-13 2.5533e-11 -9336.7
## - Department 5 1.8815e-11 4.4035e-11 -9175.2
##
## Step: AIC=-9340.51
## Salary ~ Department + Sex + MarriedID + CitizenDesc + HispanicLatino +
## RaceDesc + FromDiversityJobFairID + PerfScoreID + EngagementSurvey +
## DaysLateLast30 + Absences + Age
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Df Sum of Sq RSS AIC
## - MarriedID 1 7.8000e-15 2.5229e-11 -9342.4
## - Sex 1 1.5500e-14 2.5237e-11 -9342.3
## - EngagementSurvey 1 2.8900e-14 2.5250e-11 -9342.2
## - CitizenDesc 1 2.9700e-14 2.5251e-11 -9342.1
## - DaysLateLast30 1 7.6900e-14 2.5298e-11 -9341.6
## - FromDiversityJobFairID 1 8.1300e-14 2.5302e-11 -9341.5
## - Age 1 1.2240e-13 2.5343e-11 -9341.0
## - HispanicLatino 1 1.4160e-13 2.5363e-11 -9340.8
## <none> 2.5221e-11 -9340.5
## - RaceDesc 1 1.8000e-13 2.5401e-11 -9340.3
## - PerfScoreID 1 1.8910e-13 2.5410e-11 -9340.2
## - Absences 1 3.1680e-13 2.5538e-11 -9338.6
## - Department 5 1.9212e-11 4.4434e-11 -9174.4
##
## Step: AIC=-9342.42
## Salary ~ Department + Sex + CitizenDesc + HispanicLatino + RaceDesc +
## FromDiversityJobFairID + PerfScoreID + EngagementSurvey +
## DaysLateLast30 + Absences + Age
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Df Sum of Sq RSS AIC
## - Sex 1 1.5100e-14 2.5244e-11 -9344.2
## - EngagementSurvey 1 2.6100e-14 2.5255e-11 -9344.1
## - CitizenDesc 1 3.1500e-14 2.5260e-11 -9344.0
## - DaysLateLast30 1 7.3500e-14 2.5302e-11 -9343.5
## - FromDiversityJobFairID 1 8.0400e-14 2.5309e-11 -9343.4
## - Age 1 1.1970e-13 2.5349e-11 -9342.9
## - HispanicLatino 1 1.5020e-13 2.5379e-11 -9342.6
## <none> 2.5229e-11 -9342.4
## - RaceDesc 1 1.7620e-13 2.5405e-11 -9342.3
## - PerfScoreID 1 1.8530e-13 2.5414e-11 -9342.1
## - Absences 1 3.3190e-13 2.5561e-11 -9340.4
## - Department 5 1.9314e-11 4.4543e-11 -9175.6
##
## Step: AIC=-9344.23
## Salary ~ Department + CitizenDesc + HispanicLatino + RaceDesc +
## FromDiversityJobFairID + PerfScoreID + EngagementSurvey +
## DaysLateLast30 + Absences + Age
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Df Sum of Sq RSS AIC
## - EngagementSurvey 1 2.7200e-14 2.5271e-11 -9345.9
## - CitizenDesc 1 3.1800e-14 2.5276e-11 -9345.8
## - FromDiversityJobFairID 1 7.7300e-14 2.5321e-11 -9345.3
## - DaysLateLast30 1 7.8700e-14 2.5323e-11 -9345.3
## - Age 1 1.2020e-13 2.5364e-11 -9344.8
## - HispanicLatino 1 1.4910e-13 2.5393e-11 -9344.4
## <none> 2.5244e-11 -9344.2
## - RaceDesc 1 1.7090e-13 2.5415e-11 -9344.1
## - PerfScoreID 1 1.8570e-13 2.5430e-11 -9344.0
## - Absences 1 3.2960e-13 2.5574e-11 -9342.2
## - Department 5 1.9634e-11 4.4878e-11 -9175.3
##
## Step: AIC=-9345.9
## Salary ~ Department + CitizenDesc + HispanicLatino + RaceDesc +
## FromDiversityJobFairID + PerfScoreID + DaysLateLast30 + Absences +
## Age
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Df Sum of Sq RSS AIC
## - CitizenDesc 1 2.8400e-14 2.5299e-11 -9347.5
## - DaysLateLast30 1 5.7800e-14 2.5329e-11 -9347.2
## - FromDiversityJobFairID 1 7.6200e-14 2.5347e-11 -9347.0
## - Age 1 1.2250e-13 2.5394e-11 -9346.4
## - HispanicLatino 1 1.4490e-13 2.5416e-11 -9346.1
## <none> 2.5271e-11 -9345.9
## - RaceDesc 1 1.6580e-13 2.5437e-11 -9345.9
## - PerfScoreID 1 2.2570e-13 2.5497e-11 -9345.1
## - Absences 1 3.2650e-13 2.5598e-11 -9343.9
## - Department 5 1.9608e-11 4.4879e-11 -9177.3
##
## Step: AIC=-9347.55
## Salary ~ Department + HispanicLatino + RaceDesc + FromDiversityJobFairID +
## PerfScoreID + DaysLateLast30 + Absences + Age
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Df Sum of Sq RSS AIC
## - DaysLateLast30 1 5.2300e-14 2.5352e-11 -9348.9
## - FromDiversityJobFairID 1 7.0400e-14 2.5370e-11 -9348.7
## - Age 1 1.2470e-13 2.5424e-11 -9348.0
## - HispanicLatino 1 1.4930e-13 2.5449e-11 -9347.7
## <none> 2.5299e-11 -9347.5
## - RaceDesc 1 1.8670e-13 2.5486e-11 -9347.3
## - PerfScoreID 1 2.1610e-13 2.5516e-11 -9346.9
## - Absences 1 3.3400e-13 2.5633e-11 -9345.5
## - Department 5 1.9643e-11 4.4942e-11 -9178.9
##
## Step: AIC=-9348.91
## Salary ~ Department + HispanicLatino + RaceDesc + FromDiversityJobFairID +
## PerfScoreID + Absences + Age
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Df Sum of Sq RSS AIC
## - FromDiversityJobFairID 1 6.2200e-14 2.5414e-11 -9350.1
## - Age 1 1.2300e-13 2.5475e-11 -9349.4
## - HispanicLatino 1 1.6310e-13 2.5515e-11 -9348.9
## <none> 2.5352e-11 -9348.9
## - PerfScoreID 1 1.9130e-13 2.5543e-11 -9348.6
## - RaceDesc 1 1.9390e-13 2.5546e-11 -9348.5
## - Absences 1 3.4890e-13 2.5701e-11 -9346.7
## - Department 5 1.9595e-11 4.4947e-11 -9180.8
##
## Step: AIC=-9350.15
## Salary ~ Department + HispanicLatino + RaceDesc + PerfScoreID +
## Absences + Age
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Df Sum of Sq RSS AIC
## - Age 1 1.0250e-13 2.5516e-11 -9350.9
## - RaceDesc 1 1.3800e-13 2.5552e-11 -9350.5
## <none> 2.5414e-11 -9350.1
## - PerfScoreID 1 1.9350e-13 2.5607e-11 -9349.8
## - HispanicLatino 1 2.2260e-13 2.5637e-11 -9349.4
## - Absences 1 3.2910e-13 2.5743e-11 -9348.1
## - Department 5 1.9573e-11 4.4987e-11 -9182.5
##
## Step: AIC=-9350.89
## Salary ~ Department + HispanicLatino + RaceDesc + PerfScoreID +
## Absences
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Df Sum of Sq RSS AIC
## - RaceDesc 1 1.4660e-13 2.5663e-11 -9351.1
## <none> 2.5516e-11 -9350.9
## - PerfScoreID 1 2.2920e-13 2.5746e-11 -9350.1
## - HispanicLatino 1 2.4750e-13 2.5764e-11 -9349.9
## - Absences 1 3.0530e-13 2.5822e-11 -9349.2
## - Department 5 1.9568e-11 4.5084e-11 -9183.9
##
## Step: AIC=-9351.11
## Salary ~ Department + HispanicLatino + PerfScoreID + Absences
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Df Sum of Sq RSS AIC
## <none> 2.5663e-11 -9351.1
## - PerfScoreID 1 2.2530e-13 2.5888e-11 -9350.4
## - HispanicLatino 1 2.5400e-13 2.5917e-11 -9350.0
## - Absences 1 3.0810e-13 2.5971e-11 -9349.4
## - Department 5 1.9895e-11 4.5558e-11 -9182.6
##
## Call:
## lm(formula = Salary ~ Department + HispanicLatino + PerfScoreID +
## Absences, data = THRDATA)
##
## Coefficients:
## (Intercept) DepartmentExecutive Office
## 8.333e-01 1.183e-06
## DepartmentIT/IS DepartmentProduction
## 3.820e-07 -2.247e-07
## DepartmentSales DepartmentSoftware Engineering
## -1.352e-09 4.631e-07
## HispanicLatinoYes PerfScoreID
## -1.026e-07 4.623e-08
## Absences
## 5.445e-09
Here I preform goodness of fit calculations for 3 of the models. The first model is the full model sans coliniar vairables, the second is the box cox transformed model, and the third is the reduced box cox transformed model. Of these three models the reduced model seems the most apropriate.The full model has a slightly higher adjusted r^2, meaning the reduced model explains less of the variability of the observations. This is offset by the reduced models much better mallows Cp and AIC that indicate a higher accuracy in predictions using the reduced model.
| SSE | R.sq | R.adj | Cp | AIC | SBC | PRESS | |
|---|---|---|---|---|---|---|---|
| Full model | 92902252429 | 0.5264570 | 0.4972660 | 19 | 6107.171 | 6178.228 | 106716640789 |
| Transformed model | 0 | 0.4560837 | 0.4225547 | 19 | -9336.521 | -9265.465 | 0 |
| reduced model | 0 | 0.4465416 | 0.4318805 | 9 | -9351.112 | -9317.454 | 0 |
The final model is the following
((Salary^-1.2)-1)/-1.2 = 0.8333318 + 0.0000012* DepartmentExecutive Office + 0.0000004 * DepartmentIT/IS -0.0000002 * DepartmentProduction -0.000000001 * DepartmentSales + 0.0000005 * DepartmentSoftware Engineering -0.0000001 * HispanicLatinoYes + 0.000000046 * PerfScoreID + 0.0000000054 *Absences
In this model department = Admin Offices and HispanicLation = Yes are used as baselines. As we transformed the salary variable a direct interpretation of the values of the coefficients is difficult. We can however comment on the direction and relative sizes of these coefficients. The difference between departments is the most significant predictor of employee salary. The order of salary by department ranked highest to lowest are Execute Office, Software Engineering, It/IS, Admin Offices,Sales,Production. Hispaniclatino having a negative correlation with salary would be a concerning result for any HR employees. The positive relationship between performance score and salary is an expected and desired finding. The positive correlation between absences and salary is strange, but probably not meaningful given it is a fraction of strength of the other variables.