This data set is a mock data set for a fictitious company’s HR department. It was created to help train HR professionals as real world HR data sets are hard to come by in the public domain. There are 36 variables in this data set and 316 observations. Many of the variable contain redundant information and will have to be removed during Initial data analysis before a regression model can be fit.
In this report I will build a multiple linear regression model and bootstrap sampling to find and estimate any correlatiory relationships between employee salary and any of the other variables in the data set. This sort of analysis would be practicle for a company to preform as it could determine the existance and size of any pay disparties between certain groups of employees. It can also help determine if Salary has any sort of affect on employe satisfaction and retention.
In this data set many of the variables contain information that is either repeated or highly related to information from other variables. To avoid multicolinarity in our regression model we will have to remove any redundant variables. 9 of the variables serve as numeric IDs for categorical variables, which can be eliminated as they contain the same information. With only 313 unique observations there is not a lot of room for variables that will require large amounts of dummy variables if we want to hit the 15 observations per parameter rule of thumb. variables like state have simply too many unique values to be implemented in the model. Other variables like race and citizenship can be implemented if we group similar observations together such as considering non-citizens and eligible non-citizens as both non-citizens. Other variables such as employee name, date of hire, and last performance review date can be removed for simply being impractical, correlation with salary is highly unlikely and no practical meaning can be interpreted from any relationship. Some variables like department, position, and manager have large overlaps in their groupings of employees. Other groupings like marital status and married have completely overlapping information, with one variable providing more detailed information than the other. For these variables we are again limited by our number of observations and must use the variables with the least categories and most genral information. The variable date of birth would be much more practical for analysis is we converted it into each employees age, though it wouldn’t make sense to calculate current age for employees that were fired years ago, so we will calculate age as the difference between date of birth and date of termination for any terminated employee, and for those still employed the difference between date of birth and the last recorded performance review date 2/28/2019.
The following 15 variables will be used for the initial full model
The initial model including all variable has some noticeable issues with the assumptions required for multiple linear regression. The vs fit plot shows that linearity of the data breaks down at observations with the smallest salary. The variance of the residuals also increases with larger salary In the residual plot. The QQ plot shows strong indication of non normality of the residuals.
Checking the VIF of the model shows the variables department and special project count both have concerningly high VIF. Out of the two department is the more statisticaly signifcant variable. The variance inflation factor of department drops when special project count is removed from the model. We will keep special project count out of the model for the remainder of the analysis
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 46845.77281 | 14891.6926 | 3.1457655 | 0.0018281 |
| DepartmentExecutive Office | 169876.85284 | 20999.5604 | 8.0895433 | 0.0000000 |
| DepartmentIT/IS | 26081.09874 | 7734.1457 | 3.3722017 | 0.0008466 |
| DepartmentProduction | -20713.90122 | 9137.5567 | -2.2668972 | 0.0241303 |
| DepartmentSales | -12588.71973 | 9781.0042 | -1.2870580 | 0.1990969 |
| DepartmentSoftware Engineering | 20935.85939 | 8247.3236 | 2.5385035 | 0.0116540 |
| SexM | 782.44642 | 2079.9162 | 0.3761913 | 0.7070488 |
| MarriedID | -738.96546 | 2162.2171 | -0.3417628 | 0.7327760 |
| CitizenDescUS Citizen | -1475.88129 | 4768.1621 | -0.3095283 | 0.7571412 |
| HispanicLatinoYes | -2653.01368 | 3779.7663 | -0.7018989 | 0.4833032 |
| RaceDescWhite | -3252.71873 | 2318.5294 | -1.4029232 | 0.1617058 |
| Termd | -34.40088 | 2290.4615 | -0.0150192 | 0.9880272 |
| FromDiversityJobFairID | -2032.21489 | 4083.4868 | -0.4976666 | 0.6190949 |
| PerfScoreID | 6359.93385 | 2705.2534 | 2.3509568 | 0.0193924 |
| EngagementSurvey | 653.52371 | 1658.2249 | 0.3941104 | 0.6937882 |
| EmpSatisfaction | 947.09036 | 1206.8335 | 0.7847730 | 0.4332251 |
| SpecialProjectsCount | -1651.11016 | 1769.1955 | -0.9332548 | 0.3514621 |
| DaysLateLast30 | 2350.86086 | 1255.3523 | 1.8726702 | 0.0621165 |
| Absences | 360.39139 | 178.4222 | 2.0198800 | 0.0443120 |
| Age | 198.55942 | 120.8940 | 1.6424255 | 0.1015820 |
## Warning: not plotting observations with leverage one:
## 151
## GVIF Df GVIF^(1/(2*Df))
## Department 22.650302 5 1.366178
## Sex 1.038347 1 1.018993
## MarriedID 1.095131 1 1.046485
## CitizenDesc 1.084047 1 1.041176
## HispanicLatino 1.143612 1 1.069398
## RaceDesc 1.259194 1 1.122138
## Termd 1.140917 1 1.068137
## FromDiversityJobFairID 1.377570 1 1.173699
## PerfScoreID 2.456543 1 1.567336
## EngagementSurvey 1.671086 1 1.292705
## EmpSatisfaction 1.172679 1 1.082903
## SpecialProjectsCount 16.826731 1 4.102040
## DaysLateLast30 2.572022 1 1.603753
## Absences 1.061992 1 1.030530
## Age 1.128080 1 1.062111
## GVIF Df GVIF^(1/(2*Df))
## Department 1.421808 5 1.035820
## Sex 1.037086 1 1.018374
## MarriedID 1.094835 1 1.046344
## CitizenDesc 1.076992 1 1.037782
## HispanicLatino 1.142464 1 1.068861
## RaceDesc 1.258950 1 1.122030
## Termd 1.132669 1 1.064269
## FromDiversityJobFairID 1.369783 1 1.170377
## PerfScoreID 2.430919 1 1.559140
## EngagementSurvey 1.664528 1 1.290166
## EmpSatisfaction 1.166275 1 1.079942
## DaysLateLast30 2.566187 1 1.601932
## Absences 1.061987 1 1.030527
## Age 1.122536 1 1.059498
As there are concerning violations of the residual assumptions
neccesary for multiple linear regression a transformation of the data is
needed. Here I perform a box cox transformation on the response variable
Salary using a Lamda of -1.2. Normality, linearity, and constant
variance are all improved when compared to the un-transformed model
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 0.8333317 | 2e-07 | 3.838369e+06 | 0.0000000 |
| DepartmentExecutive Office | 0.0000011 | 3e-07 | 3.338734e+00 | 0.0009508 |
| DepartmentIT/IS | 0.0000004 | 1e-07 | 3.178179e+00 | 0.0016411 |
| DepartmentProduction | -0.0000003 | 1e-07 | -2.418385e+00 | 0.0162013 |
| DepartmentSales | 0.0000000 | 1e-07 | -3.475994e-01 | 0.7283915 |
| DepartmentSoftware Engineering | 0.0000004 | 1e-07 | 3.277818e+00 | 0.0011725 |
| SexM | 0.0000000 | 0e+00 | 4.248656e-01 | 0.6712476 |
| MarriedID | 0.0000000 | 0e+00 | 2.945925e-01 | 0.7685145 |
| CitizenDescUS Citizen | 0.0000000 | 1e-07 | -5.744203e-01 | 0.5661257 |
| HispanicLatinoYes | -0.0000001 | 1e-07 | -1.275157e+00 | 0.2032673 |
| RaceDescWhite | -0.0000001 | 0e+00 | -1.445534e+00 | 0.1493804 |
| Termd | 0.0000000 | 0e+00 | 7.206800e-02 | 0.9425971 |
| FromDiversityJobFairID | -0.0000001 | 1e-07 | -9.683425e-01 | 0.3336747 |
| PerfScoreID | 0.0000001 | 0e+00 | 1.449404e+00 | 0.1482979 |
| EngagementSurvey | 0.0000000 | 0e+00 | 5.693107e-01 | 0.5695830 |
| EmpSatisfaction | 0.0000000 | 0e+00 | 2.610800e-02 | 0.9791890 |
| DaysLateLast30 | 0.0000000 | 0e+00 | 9.311570e-01 | 0.3525416 |
| Absences | 0.0000000 | 0e+00 | 1.896328e+00 | 0.0589037 |
| Age | 0.0000000 | 0e+00 | 1.187967e+00 | 0.2358119 |
## Warning: not plotting observations with leverage one:
## 151
Here I reduce the model using a step wise procedure working backwards from the transformed linear model. The step wise procedure leaves only 4 explanatory variables in the linear model. these variables are department,Hispanic(Y/N),performance score, and absences.
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Warning: attempting model selection on an essentially perfect fit is nonsense
##
## Call:
## lm(formula = Salary ~ Department + HispanicLatino + PerfScoreID +
## Absences, data = THRDATA)
##
## Coefficients:
## (Intercept) DepartmentExecutive Office
## 8.333e-01 1.183e-06
## DepartmentIT/IS DepartmentProduction
## 3.820e-07 -2.247e-07
## DepartmentSales DepartmentSoftware Engineering
## -1.352e-09 4.631e-07
## HispanicLatinoYes PerfScoreID
## -1.026e-07 4.623e-08
## Absences
## 5.445e-09
Here I preform goodness of fit calculations for 3 of the models. The first model is the full model sans coliniar variables, the second is the box cox transformed model, and the third is the reduced box cox transformed model. Of these three models the reduced model seems the most appropriate.Comparing the box cox model and the reduced box cox model the reduced box cox model has superior R^2 adjusted, AIC, and SBC scores. In comparing the full model to the reduced box cox model, since the reduced model uses a transformed salary variable the only measures of goodness of fit we can use to compare are the scaled measures R^2 adjusted and mallows Cp. Comparing these two measures the mallows Cp is 10 less for the reduced model as it has 10 less explanatory variables. The R^2 adjusted is actually higher for the full model. Even though the full model has the higher adjusted R^2 the reduced box cox model is still superior as the difference is not high enough to justify using a model has more violations of residual assumptions and a large amount of unnecessary explanatory paramaters.
| SSE | R.sq | R.adj | Cp | AIC | SBC | PRESS | |
|---|---|---|---|---|---|---|---|
| Full model | 92902252429 | 0.5264570 | 0.4972660 | 19 | 6107.171 | 6178.228 | 106716640789 |
| Transformed model | 0 | 0.4560837 | 0.4225547 | 19 | -9336.521 | -9265.465 | 0 |
| reduced model | 0 | 0.4465416 | 0.4318805 | 9 | -9351.112 | -9317.454 | 0 |
| Estimate | Std. Error | t value | Pr(>|t|) | btc.ci.95 | |
|---|---|---|---|---|---|
| (Intercept) | 0.833331782 | 0.000000131 | 6375259.469453854 | 0.000000000 | [ 0.833331447048226 , 0.833332155059029 ] |
| DepartmentExecutive Office | 0.000001183 | 0.000000313 | 3.781593616 | 0.000187821 | [ 8.67487931074518e-07 , 1.49823394490083e-06 ] |
| DepartmentIT/IS | 0.000000382 | 0.000000106 | 3.604894451 | 0.000365174 | [ 6.63679556241574e-08 , 6.90004517645866e-07 ] |
| DepartmentProduction | -0.000000225 | 0.000000099 | -2.259884867 | 0.024540650 | [ -5.36460819077033e-07 , 5.58007181767876e-08 ] |
| DepartmentSales | -0.000000001 | 0.000000111 | -0.012189849 | 0.990282197 | [ -3.27824656989153e-07 , 3.01368478578284e-07 ] |
| DepartmentSoftware Engineering | 0.000000463 | 0.000000131 | 3.534131872 | 0.000473170 | [ 1.59174012419789e-07 , 7.46701334546113e-07 ] |
| HispanicLatinoYes | -0.000000103 | 0.000000059 | -1.728961136 | 0.084837996 | [ -2.56326666822163e-07 , 4.23411838016973e-08 ] |
| PerfScoreID | 0.000000046 | 0.000000028 | 1.628405785 | 0.104481580 | [ -1.20026374277153e-08 , 1.05130508663306e-07 ] |
| Absences | 0.000000005 | 0.000000003 | 1.904226170 | 0.057830886 | [ -8.09852443303752e-11 , 1.04495097962794e-08 ] |
| Estimate | Std. Error | t value | Pr(>|t|) | btr.ci.95 | |
|---|---|---|---|---|---|
| (Intercept) | 0.833331782 | 0.000000131 | 6375259.469453854 | 0.000000000 | [ 0.833331539792289 , 0.833332039981755 ] |
| DepartmentExecutive Office | 0.000001183 | 0.000000313 | 3.781593616 | 0.000187821 | [ 5.90735719938336e-07 , 1.81491140528572e-06 ] |
| DepartmentIT/IS | 0.000000382 | 0.000000106 | 3.604894451 | 0.000365174 | [ 1.87579447921628e-07 , 5.83784904791996e-07 ] |
| DepartmentProduction | -0.000000225 | 0.000000099 | -2.259884867 | 0.024540650 | [ -4.05061753366224e-07 , -2.53296553722555e-08 ] |
| DepartmentSales | -0.000000001 | 0.000000111 | -0.012189849 | 0.990282197 | [ -2.05728214178093e-07 , 2.30324377493449e-07 ] |
| DepartmentSoftware Engineering | 0.000000463 | 0.000000131 | 3.534131872 | 0.000473170 | [ 2.28877073627788e-07 , 7.06062215569661e-07 ] |
| HispanicLatinoYes | -0.000000103 | 0.000000059 | -1.728961136 | 0.084837996 | [ -2.28668327402229e-07 , 9.59996200022065e-09 ] |
| PerfScoreID | 0.000000046 | 0.000000028 | 1.628405785 | 0.104481580 | [ -1.05636065700556e-08 , 1.00425033104526e-07 ] |
| Absences | 0.000000005 | 0.000000003 | 1.904226170 | 0.057830886 | [ -1.61775944139583e-10 , 1.10007580615625e-08 ] |
While the two techniques for producing bootstrap samples don’t always produce the same confidence intervals, in this case the results are the same. As the final model includes Box Cox transformed values for Salary direct Interpretation of the parameter values have no practical meaning. Looking at the confidence intervals and P-values reveal that only the variable to be statistically significant at a alpha of 0.05 is Department. The variables Hispanic/Latino and Absences would be statistically significant at an alpha of 0.1, which would probably be a more appropriate level given the subject matter, but it would be inappropriate to change endpoints after the fact. Employees In the Departments Executive Office, Software Engineering, and IT/IS all have significantly larger salaries than those in the Admin’s office department. Employees in the Production department have significantly lower salaries than those in the Admin’s offices. Employees in the sales department do not have a statistically significant difference in their salaries compared to those in the Admin’s Office. Based on the overlap in the 95% confidence Intervals of their coefficients we can also conclude that there is not a statistically significant difference between the salaries of the executive office, IT/IS, and Software Engineering. From this information we can order the Departments from highest to lowers salaries as follows
In the end we were only able to conclude one of our 15 initial explanatory variables had a signifcant relationship with Salary. Some variables such as Absences and Hispanic/Latino were “nearly” significant and might be worth running a second analyis if a data set with a higher number of employes is possible. The results of this analsis shows a clear pay disparity between the different departments. These results could be utilized to help determine if the value added by certain departments and roles is overvalued/undervalued by the company.