1 Data Set Overview

This data set is a mock data set for a fictitious company’s HR department. It was created to help train HR professionals as real world HR data sets are hard to come by in the public domain. There are 36 variables in this data set and 316 observations. Many of the variable contain redundant information and will have to be removed during Initial data analysis before a regression model can be fit.

2 Purpose of Analysis

In this report I will build a multiple linear regression model and bootstrap sampling to find and estimate any correlatiory relationships between employee salary and any of the other variables in the data set. This sort of analysis would be practicle for a company to preform as it could determine the existance and size of any pay disparties between certain groups of employees. It can also help determine if Salary has any sort of affect on employe satisfaction and retention.

3 Exploratory Data Analysis and Variable Selection

In this data set many of the variables contain information that is either repeated or highly related to information from other variables. To avoid multicolinarity in our regression model we will have to remove any redundant variables. 9 of the variables serve as numeric IDs for categorical variables, which can be eliminated as they contain the same information. With only 313 unique observations there is not a lot of room for variables that will require large amounts of dummy variables if we want to hit the 15 observations per parameter rule of thumb. variables like state have simply too many unique values to be implemented in the model. Other variables like race and citizenship can be implemented if we group similar observations together such as considering non-citizens and eligible non-citizens as both non-citizens. Other variables such as employee name, date of hire, and last performance review date can be removed for simply being impractical, correlation with salary is highly unlikely and no practical meaning can be interpreted from any relationship. Some variables like department, position, and manager have large overlaps in their groupings of employees. Other groupings like marital status and married have completely overlapping information, with one variable providing more detailed information than the other. For these variables we are again limited by our number of observations and must use the variables with the least categories and most genral information. The variable date of birth would be much more practical for analysis is we converted it into each employees age, though it wouldn’t make sense to calculate current age for employees that were fired years ago, so we will calculate age as the difference between date of birth and date of termination for any terminated employee, and for those still employed the difference between date of birth and the last recorded performance review date 2/28/2019.

The following 15 variables will be used for the initial full model

  • Salary
  • Department
  • Age
  • Sex
  • MarriedID
  • CitizenDesc
  • HispanicLatino
  • RaceDesc
  • FromDiversityJobFairID
  • PerfScoreID
  • EngagementSurvey
  • EmpSatisfaction
  • SpecialProjectsCount
  • DaysLateLast30
  • Absences

4 Initial Model and residual analysis

The initial model including all variable has some noticeable issues with the assumptions required for multiple linear regression. The vs fit plot shows that linearity of the data breaks down at observations with the smallest salary. The variance of the residuals also increases with larger salary In the residual plot. The QQ plot shows strong indication of non normality of the residuals.

Checking the VIF of the model shows the variables department and special project count both have concerningly high VIF. Out of the two department is the more statisticaly signifcant variable. The variance inflation factor of department drops when special project count is removed from the model. We will keep special project count out of the model for the remainder of the analysis

Estimate Std. Error t value Pr(>|t|)
(Intercept) 46845.77281 14891.6926 3.1457655 0.0018281
DepartmentExecutive Office 169876.85284 20999.5604 8.0895433 0.0000000
DepartmentIT/IS 26081.09874 7734.1457 3.3722017 0.0008466
DepartmentProduction -20713.90122 9137.5567 -2.2668972 0.0241303
DepartmentSales -12588.71973 9781.0042 -1.2870580 0.1990969
DepartmentSoftware Engineering 20935.85939 8247.3236 2.5385035 0.0116540
SexM 782.44642 2079.9162 0.3761913 0.7070488
MarriedID -738.96546 2162.2171 -0.3417628 0.7327760
CitizenDescUS Citizen -1475.88129 4768.1621 -0.3095283 0.7571412
HispanicLatinoYes -2653.01368 3779.7663 -0.7018989 0.4833032
RaceDescWhite -3252.71873 2318.5294 -1.4029232 0.1617058
Termd -34.40088 2290.4615 -0.0150192 0.9880272
FromDiversityJobFairID -2032.21489 4083.4868 -0.4976666 0.6190949
PerfScoreID 6359.93385 2705.2534 2.3509568 0.0193924
EngagementSurvey 653.52371 1658.2249 0.3941104 0.6937882
EmpSatisfaction 947.09036 1206.8335 0.7847730 0.4332251
SpecialProjectsCount -1651.11016 1769.1955 -0.9332548 0.3514621
DaysLateLast30 2350.86086 1255.3523 1.8726702 0.0621165
Absences 360.39139 178.4222 2.0198800 0.0443120
Age 198.55942 120.8940 1.6424255 0.1015820
## Warning: not plotting observations with leverage one:
##   151

##                             GVIF Df GVIF^(1/(2*Df))
## Department             22.650302  5        1.366178
## Sex                     1.038347  1        1.018993
## MarriedID               1.095131  1        1.046485
## CitizenDesc             1.084047  1        1.041176
## HispanicLatino          1.143612  1        1.069398
## RaceDesc                1.259194  1        1.122138
## Termd                   1.140917  1        1.068137
## FromDiversityJobFairID  1.377570  1        1.173699
## PerfScoreID             2.456543  1        1.567336
## EngagementSurvey        1.671086  1        1.292705
## EmpSatisfaction         1.172679  1        1.082903
## SpecialProjectsCount   16.826731  1        4.102040
## DaysLateLast30          2.572022  1        1.603753
## Absences                1.061992  1        1.030530
## Age                     1.128080  1        1.062111
##                            GVIF Df GVIF^(1/(2*Df))
## Department             1.421808  5        1.035820
## Sex                    1.037086  1        1.018374
## MarriedID              1.094835  1        1.046344
## CitizenDesc            1.076992  1        1.037782
## HispanicLatino         1.142464  1        1.068861
## RaceDesc               1.258950  1        1.122030
## Termd                  1.132669  1        1.064269
## FromDiversityJobFairID 1.369783  1        1.170377
## PerfScoreID            2.430919  1        1.559140
## EngagementSurvey       1.664528  1        1.290166
## EmpSatisfaction        1.166275  1        1.079942
## DaysLateLast30         2.566187  1        1.601932
## Absences               1.061987  1        1.030527
## Age                    1.122536  1        1.059498

5 Transforming the model with Box Cox

As there are concerning violations of the residual assumptions neccesary for multiple linear regression a transformation of the data is needed. Here I perform a box cox transformation on the response variable Salary using a Lamda of -1.2. Normality, linearity, and constant variance are all improved when compared to the un-transformed model

Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.8333317 2e-07 3.838369e+06 0.0000000
DepartmentExecutive Office 0.0000011 3e-07 3.338734e+00 0.0009508
DepartmentIT/IS 0.0000004 1e-07 3.178179e+00 0.0016411
DepartmentProduction -0.0000003 1e-07 -2.418385e+00 0.0162013
DepartmentSales 0.0000000 1e-07 -3.475994e-01 0.7283915
DepartmentSoftware Engineering 0.0000004 1e-07 3.277818e+00 0.0011725
SexM 0.0000000 0e+00 4.248656e-01 0.6712476
MarriedID 0.0000000 0e+00 2.945925e-01 0.7685145
CitizenDescUS Citizen 0.0000000 1e-07 -5.744203e-01 0.5661257
HispanicLatinoYes -0.0000001 1e-07 -1.275157e+00 0.2032673
RaceDescWhite -0.0000001 0e+00 -1.445534e+00 0.1493804
Termd 0.0000000 0e+00 7.206800e-02 0.9425971
FromDiversityJobFairID -0.0000001 1e-07 -9.683425e-01 0.3336747
PerfScoreID 0.0000001 0e+00 1.449404e+00 0.1482979
EngagementSurvey 0.0000000 0e+00 5.693107e-01 0.5695830
EmpSatisfaction 0.0000000 0e+00 2.610800e-02 0.9791890
DaysLateLast30 0.0000000 0e+00 9.311570e-01 0.3525416
Absences 0.0000000 0e+00 1.896328e+00 0.0589037
Age 0.0000000 0e+00 1.187967e+00 0.2358119
## Warning: not plotting observations with leverage one:
##   151

6 Reducing the model through AIC algoritm

Here I reduce the model using a step wise procedure working backwards from the transformed linear model. The step wise procedure leaves only 4 explanatory variables in the linear model. these variables are department,Hispanic(Y/N),performance score, and absences.

## Warning: attempting model selection on an essentially perfect fit is nonsense

## Warning: attempting model selection on an essentially perfect fit is nonsense

## Warning: attempting model selection on an essentially perfect fit is nonsense

## Warning: attempting model selection on an essentially perfect fit is nonsense

## Warning: attempting model selection on an essentially perfect fit is nonsense

## Warning: attempting model selection on an essentially perfect fit is nonsense

## Warning: attempting model selection on an essentially perfect fit is nonsense

## Warning: attempting model selection on an essentially perfect fit is nonsense

## Warning: attempting model selection on an essentially perfect fit is nonsense

## Warning: attempting model selection on an essentially perfect fit is nonsense

## Warning: attempting model selection on an essentially perfect fit is nonsense
## 
## Call:
## lm(formula = Salary ~ Department + HispanicLatino + PerfScoreID + 
##     Absences, data = THRDATA)
## 
## Coefficients:
##                    (Intercept)      DepartmentExecutive Office  
##                      8.333e-01                       1.183e-06  
##                DepartmentIT/IS     DepartmentProduction         
##                      3.820e-07                      -2.247e-07  
##                DepartmentSales  DepartmentSoftware Engineering  
##                     -1.352e-09                       4.631e-07  
##              HispanicLatinoYes                     PerfScoreID  
##                     -1.026e-07                       4.623e-08  
##                       Absences  
##                      5.445e-09

7 Goodness of fit comparing 3 models

Here I preform goodness of fit calculations for 3 of the models. The first model is the full model sans coliniar variables, the second is the box cox transformed model, and the third is the reduced box cox transformed model. Of these three models the reduced model seems the most appropriate.Comparing the box cox model and the reduced box cox model the reduced box cox model has superior R^2 adjusted, AIC, and SBC scores. In comparing the full model to the reduced box cox model, since the reduced model uses a transformed salary variable the only measures of goodness of fit we can use to compare are the scaled measures R^2 adjusted and mallows Cp. Comparing these two measures the mallows Cp is 10 less for the reduced model as it has 10 less explanatory variables. The R^2 adjusted is actually higher for the full model. Even though the full model has the higher adjusted R^2 the reduced box cox model is still superior as the difference is not high enough to justify using a model has more violations of residual assumptions and a large amount of unnecessary explanatory paramaters.

Goodness-of-fit Measures of Candidate Models
SSE R.sq R.adj Cp AIC SBC PRESS
Full model 92902252429 0.5264570 0.4972660 19 6107.171 6178.228 106716640789
Transformed model 0 0.4560837 0.4225547 19 -9336.521 -9265.465 0
reduced model 0 0.4465416 0.4318805 9 -9351.112 -9317.454 0

8 Bootstrapping the model tecnique 1 - bootstraping observations

Regression Coefficient Matrix
Estimate Std. Error t value Pr(>|t|) btc.ci.95
(Intercept) 0.833331782 0.000000131 6375259.469453854 0.000000000 [ 0.833331447048226 , 0.833332155059029 ]
DepartmentExecutive Office 0.000001183 0.000000313 3.781593616 0.000187821 [ 8.67487931074518e-07 , 1.49823394490083e-06 ]
DepartmentIT/IS 0.000000382 0.000000106 3.604894451 0.000365174 [ 6.63679556241574e-08 , 6.90004517645866e-07 ]
DepartmentProduction -0.000000225 0.000000099 -2.259884867 0.024540650 [ -5.36460819077033e-07 , 5.58007181767876e-08 ]
DepartmentSales -0.000000001 0.000000111 -0.012189849 0.990282197 [ -3.27824656989153e-07 , 3.01368478578284e-07 ]
DepartmentSoftware Engineering 0.000000463 0.000000131 3.534131872 0.000473170 [ 1.59174012419789e-07 , 7.46701334546113e-07 ]
HispanicLatinoYes -0.000000103 0.000000059 -1.728961136 0.084837996 [ -2.56326666822163e-07 , 4.23411838016973e-08 ]
PerfScoreID 0.000000046 0.000000028 1.628405785 0.104481580 [ -1.20026374277153e-08 , 1.05130508663306e-07 ]
Absences 0.000000005 0.000000003 1.904226170 0.057830886 [ -8.09852443303752e-11 , 1.04495097962794e-08 ]

9 Bootstrapping the model tecnique 2 - residual bootstraping

Regression residual Matrix
Estimate Std. Error t value Pr(>|t|) btr.ci.95
(Intercept) 0.833331782 0.000000131 6375259.469453854 0.000000000 [ 0.833331539792289 , 0.833332039981755 ]
DepartmentExecutive Office 0.000001183 0.000000313 3.781593616 0.000187821 [ 5.90735719938336e-07 , 1.81491140528572e-06 ]
DepartmentIT/IS 0.000000382 0.000000106 3.604894451 0.000365174 [ 1.87579447921628e-07 , 5.83784904791996e-07 ]
DepartmentProduction -0.000000225 0.000000099 -2.259884867 0.024540650 [ -4.05061753366224e-07 , -2.53296553722555e-08 ]
DepartmentSales -0.000000001 0.000000111 -0.012189849 0.990282197 [ -2.05728214178093e-07 , 2.30324377493449e-07 ]
DepartmentSoftware Engineering 0.000000463 0.000000131 3.534131872 0.000473170 [ 2.28877073627788e-07 , 7.06062215569661e-07 ]
HispanicLatinoYes -0.000000103 0.000000059 -1.728961136 0.084837996 [ -2.28668327402229e-07 , 9.59996200022065e-09 ]
PerfScoreID 0.000000046 0.000000028 1.628405785 0.104481580 [ -1.05636065700556e-08 , 1.00425033104526e-07 ]
Absences 0.000000005 0.000000003 1.904226170 0.057830886 [ -1.61775944139583e-10 , 1.10007580615625e-08 ]

10 Analyis of Bootstrap confidence Intervals and Final conclusions

While the two techniques for producing bootstrap samples don’t always produce the same confidence intervals, in this case the results are the same. As the final model includes Box Cox transformed values for Salary direct Interpretation of the parameter values have no practical meaning. Looking at the confidence intervals and P-values reveal that only the variable to be statistically significant at a alpha of 0.05 is Department. The variables Hispanic/Latino and Absences would be statistically significant at an alpha of 0.1, which would probably be a more appropriate level given the subject matter, but it would be inappropriate to change endpoints after the fact. Employees In the Departments Executive Office, Software Engineering, and IT/IS all have significantly larger salaries than those in the Admin’s office department. Employees in the Production department have significantly lower salaries than those in the Admin’s offices. Employees in the sales department do not have a statistically significant difference in their salaries compared to those in the Admin’s Office. Based on the overlap in the 95% confidence Intervals of their coefficients we can also conclude that there is not a statistically significant difference between the salaries of the executive office, IT/IS, and Software Engineering. From this information we can order the Departments from highest to lowers salaries as follows

  • Executive Office - Software Engineering -IT/IS
  • Admin’s Office - Sales
  • Production

In the end we were only able to conclude one of our 15 initial explanatory variables had a signifcant relationship with Salary. Some variables such as Absences and Hispanic/Latino were “nearly” significant and might be worth running a second analyis if a data set with a higher number of employes is possible. The results of this analsis shows a clear pay disparity between the different departments. These results could be utilized to help determine if the value added by certain departments and roles is overvalued/undervalued by the company.