Assessing factors that contribute to salary

0.1 Data Set Description

This data set is a mock data set for a fictitious company’s HR department. It was created to help train HR professionals as real world HR data sets are hard to come by in the public domain. There are 36 variables in this data set and 316 observations. If we exclude variables that serve as numerical IDs, are missing for the majority of observations, or contain information also found in other variables we are left with 20 variables that are practical to run regression analysis on.

The variables are the following *Zip code Position Sex Marital status Citizenship Hispanic Race Termination reason Employment status Department, Manager Requirement source, Performance score, and from diversity job fair

0.2 Reason for analysis

The primary objective of this analysis is to determine the nature of any non-zero associations between salary and the predictor variables

0.3 Exploratory data analysis

In this data set many of the variables contain information that is either repeated or highly related to information from other variables. To avoid multicolinarity in our regression model we will have to remove any redundant variables. 9 of the variables serve as numeric IDs for categorical variables, which can be eliminated as they contain the same information. With only 313 unique observations there is not a lot of room for variables that will require large amounts of dummy variables if we want to hit the 15 observations per parameter rule of thumb. variables like state have simply too many unique values to be implemented in the model. Other variables like race and citizenship can be implemented if we group similar observations together such as considering non-citizens and eligible non-citizens as both non-citizens. Other variables such as employee name, date of hire, and last performance review date can be removed for simply being impractical, correlation with salary is highly unlikely and no practical meaning can be interpreted from any relationship. Some variables like department, position, and manager have large overlaps in their groupings of employees. Other groupings like marital status and married have completely overlapping information, with one variable providing more detailed information than the other. For these variables we are again limited by our number of observations and must use the variables with the least categories and most genral information. The variable date of birth would be much more practical for analysis is we converted it into each employees age, though it wouldn’t make sense to calculate current age for employees that were fired years ago, so we will calculate age as the difference between date of birth and date of termination for any terminated employee, and for those still employed the difference between date of birth and the last recorded performance review date 2/28/2019.

The following 15 variables will be used for the initial full model Salary Department Age Sex MarriedID CitizenDesc HispanicLatino RaceDesc FromDiversityJobFairID PerfScoreID EngagementSurvey EmpSatisfaction SpecialProjectsCount DaysLateLast30 *Absences

0.4 Initial Model

The initial model including all variable has some noticeable issues with the assumptions required for multiple linear regression. The vs fit plot shows that linearity of the data breaks down at observations with the smallest salary. The variance of the residuals also increases with larger salary In the residual plot. The QQ plot shows strong indication of non normality of the residuals.

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	46845.77281	14891.6926	3.1457655	0.0018281
DepartmentExecutive Office	169876.85284	20999.5604	8.0895433	0.0000000
DepartmentIT/IS	26081.09874	7734.1457	3.3722017	0.0008466
DepartmentProduction	-20713.90122	9137.5567	-2.2668972	0.0241303
DepartmentSales	-12588.71973	9781.0042	-1.2870580	0.1990969
DepartmentSoftware Engineering	20935.85939	8247.3236	2.5385035	0.0116540
SexM	782.44642	2079.9162	0.3761913	0.7070488
MarriedID	-738.96546	2162.2171	-0.3417628	0.7327760
CitizenDescUS Citizen	-1475.88129	4768.1621	-0.3095283	0.7571412
HispanicLatinoYes	-2653.01368	3779.7663	-0.7018989	0.4833032
RaceDescWhite	-3252.71873	2318.5294	-1.4029232	0.1617058
Termd	-34.40088	2290.4615	-0.0150192	0.9880272
FromDiversityJobFairID	-2032.21489	4083.4868	-0.4976666	0.6190949
PerfScoreID	6359.93385	2705.2534	2.3509568	0.0193924
EngagementSurvey	653.52371	1658.2249	0.3941104	0.6937882
EmpSatisfaction	947.09036	1206.8335	0.7847730	0.4332251
SpecialProjectsCount	-1651.11016	1769.1955	-0.9332548	0.3514621
DaysLateLast30	2350.86086	1255.3523	1.8726702	0.0621165
Absences	360.39139	178.4222	2.0198800	0.0443120
Age	198.55942	120.8940	1.6424255	0.1015820

## Warning: not plotting observations with leverage one:
##   151

0.5 Checking VIF and removing colinear variables

Checking the VIF of the model shows the variables department and special project count both have concerningly high VIF. Out of the two department is the more statisticaly signifcant variable. The variance inflation factor of department drops when special project count is removed from the model. We will keep special project count out of the model for the remainder of the analysis

##                             GVIF Df GVIF^(1/(2*Df))
## Department             22.650302  5        1.366178
## Sex                     1.038347  1        1.018993
## MarriedID               1.095131  1        1.046485
## CitizenDesc             1.084047  1        1.041176
## HispanicLatino          1.143612  1        1.069398
## RaceDesc                1.259194  1        1.122138
## Termd                   1.140917  1        1.068137
## FromDiversityJobFairID  1.377570  1        1.173699
## PerfScoreID             2.456543  1        1.567336
## EngagementSurvey        1.671086  1        1.292705
## EmpSatisfaction         1.172679  1        1.082903
## SpecialProjectsCount   16.826731  1        4.102040
## DaysLateLast30          2.572022  1        1.603753
## Absences                1.061992  1        1.030530
## Age                     1.128080  1        1.062111

##                            GVIF Df GVIF^(1/(2*Df))
## Department             1.421808  5        1.035820
## Sex                    1.037086  1        1.018374
## MarriedID              1.094835  1        1.046344
## CitizenDesc            1.076992  1        1.037782
## HispanicLatino         1.142464  1        1.068861
## RaceDesc               1.258950  1        1.122030
## Termd                  1.132669  1        1.064269
## FromDiversityJobFairID 1.369783  1        1.170377
## PerfScoreID            2.430919  1        1.559140
## EngagementSurvey       1.664528  1        1.290166
## EmpSatisfaction        1.166275  1        1.079942
## DaysLateLast30         2.566187  1        1.601932
## Absences               1.061987  1        1.030527
## Age                    1.122536  1        1.059498

0.6 Boxcox tranformation of model

performing a box cox transformation on the response variable salary creates a large improvement in our assumptions for the model residuals. Normality, linearity, and constant variance are all imporoved when compared to the un-transformed model.

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	0.8333317	2e-07	3.838369e+06	0.0000000
DepartmentExecutive Office	0.0000011	3e-07	3.338734e+00	0.0009508
DepartmentIT/IS	0.0000004	1e-07	3.178179e+00	0.0016411
DepartmentProduction	-0.0000003	1e-07	-2.418385e+00	0.0162013
DepartmentSales	0.0000000	1e-07	-3.475994e-01	0.7283915
DepartmentSoftware Engineering	0.0000004	1e-07	3.277818e+00	0.0011725
SexM	0.0000000	0e+00	4.248656e-01	0.6712476
MarriedID	0.0000000	0e+00	2.945925e-01	0.7685145
CitizenDescUS Citizen	0.0000000	1e-07	-5.744203e-01	0.5661257
HispanicLatinoYes	-0.0000001	1e-07	-1.275157e+00	0.2032673
RaceDescWhite	-0.0000001	0e+00	-1.445534e+00	0.1493804
Termd	0.0000000	0e+00	7.206800e-02	0.9425971
FromDiversityJobFairID	-0.0000001	1e-07	-9.683425e-01	0.3336747
PerfScoreID	0.0000001	0e+00	1.449404e+00	0.1482979
EngagementSurvey	0.0000000	0e+00	5.693107e-01	0.5695830
EmpSatisfaction	0.0000000	0e+00	2.610800e-02	0.9791890
DaysLateLast30	0.0000000	0e+00	9.311570e-01	0.3525416
Absences	0.0000000	0e+00	1.896328e+00	0.0589037
Age	0.0000000	0e+00	1.187967e+00	0.2358119

## Warning: not plotting observations with leverage one:
##   151

0.7 Reducing the model through AIC algoritm

Here I reduce the model using a step wise procedure working backwards from the transformed linear model. The step wise procedure leaves only 4 explanatory variables in the linear model. these variables are department,Hispanic(Y/N),performance score, and absences.

## Start:  AIC=-9336.52
## Salary ~ Department + Sex + MarriedID + CitizenDesc + HispanicLatino + 
##     RaceDesc + Termd + FromDiversityJobFairID + PerfScoreID + 
##     EngagementSurvey + EmpSatisfaction + DaysLateLast30 + Absences + 
##     Age

## Warning: attempting model selection on an essentially perfect fit is nonsense

##                          Df  Sum of Sq        RSS     AIC
## - EmpSatisfaction         1 1.0000e-16 2.5221e-11 -9338.5
## - Termd                   1 4.0000e-16 2.5221e-11 -9338.5
## - MarriedID               1 7.5000e-15 2.5228e-11 -9338.4
## - Sex                     1 1.5600e-14 2.5236e-11 -9338.3
## - EngagementSurvey        1 2.8000e-14 2.5249e-11 -9338.2
## - CitizenDesc             1 2.8500e-14 2.5249e-11 -9338.2
## - DaysLateLast30          1 7.4900e-14 2.5295e-11 -9337.6
## - FromDiversityJobFairID  1 8.1000e-14 2.5302e-11 -9337.5
## - Age                     1 1.2190e-13 2.5342e-11 -9337.0
## - HispanicLatino          1 1.4040e-13 2.5361e-11 -9336.8
## <none>                                 2.5221e-11 -9336.5
## - RaceDesc                1 1.8050e-13 2.5401e-11 -9336.3
## - PerfScoreID             1 1.8140e-13 2.5402e-11 -9336.3
## - Absences                1 3.1060e-13 2.5531e-11 -9334.7
## - Department              5 1.8788e-11 4.4009e-11 -9173.4
## 
## Step:  AIC=-9338.52
## Salary ~ Department + Sex + MarriedID + CitizenDesc + HispanicLatino + 
##     RaceDesc + Termd + FromDiversityJobFairID + PerfScoreID + 
##     EngagementSurvey + DaysLateLast30 + Absences + Age

## Warning: attempting model selection on an essentially perfect fit is nonsense

##                          Df  Sum of Sq        RSS     AIC
## - Termd                   1 5.0000e-16 2.5221e-11 -9340.5
## - MarriedID               1 7.4000e-15 2.5228e-11 -9340.4
## - Sex                     1 1.5500e-14 2.5236e-11 -9340.3
## - EngagementSurvey        1 2.8100e-14 2.5249e-11 -9340.2
## - CitizenDesc             1 2.9000e-14 2.5250e-11 -9340.2
## - DaysLateLast30          1 7.4800e-14 2.5295e-11 -9339.6
## - FromDiversityJobFairID  1 8.1100e-14 2.5302e-11 -9339.5
## - Age                     1 1.2260e-13 2.5343e-11 -9339.0
## - HispanicLatino          1 1.4100e-13 2.5362e-11 -9338.8
## <none>                                 2.5221e-11 -9338.5
## - RaceDesc                1 1.8040e-13 2.5401e-11 -9338.3
## - PerfScoreID             1 1.8930e-13 2.5410e-11 -9338.2
## - Absences                1 3.1240e-13 2.5533e-11 -9336.7
## - Department              5 1.8815e-11 4.4035e-11 -9175.2
## 
## Step:  AIC=-9340.51
## Salary ~ Department + Sex + MarriedID + CitizenDesc + HispanicLatino + 
##     RaceDesc + FromDiversityJobFairID + PerfScoreID + EngagementSurvey + 
##     DaysLateLast30 + Absences + Age

## Warning: attempting model selection on an essentially perfect fit is nonsense

##                          Df  Sum of Sq        RSS     AIC
## - MarriedID               1 7.8000e-15 2.5229e-11 -9342.4
## - Sex                     1 1.5500e-14 2.5237e-11 -9342.3
## - EngagementSurvey        1 2.8900e-14 2.5250e-11 -9342.2
## - CitizenDesc             1 2.9700e-14 2.5251e-11 -9342.1
## - DaysLateLast30          1 7.6900e-14 2.5298e-11 -9341.6
## - FromDiversityJobFairID  1 8.1300e-14 2.5302e-11 -9341.5
## - Age                     1 1.2240e-13 2.5343e-11 -9341.0
## - HispanicLatino          1 1.4160e-13 2.5363e-11 -9340.8
## <none>                                 2.5221e-11 -9340.5
## - RaceDesc                1 1.8000e-13 2.5401e-11 -9340.3
## - PerfScoreID             1 1.8910e-13 2.5410e-11 -9340.2
## - Absences                1 3.1680e-13 2.5538e-11 -9338.6
## - Department              5 1.9212e-11 4.4434e-11 -9174.4
## 
## Step:  AIC=-9342.42
## Salary ~ Department + Sex + CitizenDesc + HispanicLatino + RaceDesc + 
##     FromDiversityJobFairID + PerfScoreID + EngagementSurvey + 
##     DaysLateLast30 + Absences + Age

## Warning: attempting model selection on an essentially perfect fit is nonsense

##                          Df  Sum of Sq        RSS     AIC
## - Sex                     1 1.5100e-14 2.5244e-11 -9344.2
## - EngagementSurvey        1 2.6100e-14 2.5255e-11 -9344.1
## - CitizenDesc             1 3.1500e-14 2.5260e-11 -9344.0
## - DaysLateLast30          1 7.3500e-14 2.5302e-11 -9343.5
## - FromDiversityJobFairID  1 8.0400e-14 2.5309e-11 -9343.4
## - Age                     1 1.1970e-13 2.5349e-11 -9342.9
## - HispanicLatino          1 1.5020e-13 2.5379e-11 -9342.6
## <none>                                 2.5229e-11 -9342.4
## - RaceDesc                1 1.7620e-13 2.5405e-11 -9342.3
## - PerfScoreID             1 1.8530e-13 2.5414e-11 -9342.1
## - Absences                1 3.3190e-13 2.5561e-11 -9340.4
## - Department              5 1.9314e-11 4.4543e-11 -9175.6
## 
## Step:  AIC=-9344.23
## Salary ~ Department + CitizenDesc + HispanicLatino + RaceDesc + 
##     FromDiversityJobFairID + PerfScoreID + EngagementSurvey + 
##     DaysLateLast30 + Absences + Age

## Warning: attempting model selection on an essentially perfect fit is nonsense

##                          Df  Sum of Sq        RSS     AIC
## - EngagementSurvey        1 2.7200e-14 2.5271e-11 -9345.9
## - CitizenDesc             1 3.1800e-14 2.5276e-11 -9345.8
## - FromDiversityJobFairID  1 7.7300e-14 2.5321e-11 -9345.3
## - DaysLateLast30          1 7.8700e-14 2.5323e-11 -9345.3
## - Age                     1 1.2020e-13 2.5364e-11 -9344.8
## - HispanicLatino          1 1.4910e-13 2.5393e-11 -9344.4
## <none>                                 2.5244e-11 -9344.2
## - RaceDesc                1 1.7090e-13 2.5415e-11 -9344.1
## - PerfScoreID             1 1.8570e-13 2.5430e-11 -9344.0
## - Absences                1 3.2960e-13 2.5574e-11 -9342.2
## - Department              5 1.9634e-11 4.4878e-11 -9175.3
## 
## Step:  AIC=-9345.9
## Salary ~ Department + CitizenDesc + HispanicLatino + RaceDesc + 
##     FromDiversityJobFairID + PerfScoreID + DaysLateLast30 + Absences + 
##     Age

## Warning: attempting model selection on an essentially perfect fit is nonsense

##                          Df  Sum of Sq        RSS     AIC
## - CitizenDesc             1 2.8400e-14 2.5299e-11 -9347.5
## - DaysLateLast30          1 5.7800e-14 2.5329e-11 -9347.2
## - FromDiversityJobFairID  1 7.6200e-14 2.5347e-11 -9347.0
## - Age                     1 1.2250e-13 2.5394e-11 -9346.4
## - HispanicLatino          1 1.4490e-13 2.5416e-11 -9346.1
## <none>                                 2.5271e-11 -9345.9
## - RaceDesc                1 1.6580e-13 2.5437e-11 -9345.9
## - PerfScoreID             1 2.2570e-13 2.5497e-11 -9345.1
## - Absences                1 3.2650e-13 2.5598e-11 -9343.9
## - Department              5 1.9608e-11 4.4879e-11 -9177.3
## 
## Step:  AIC=-9347.55
## Salary ~ Department + HispanicLatino + RaceDesc + FromDiversityJobFairID + 
##     PerfScoreID + DaysLateLast30 + Absences + Age

## Warning: attempting model selection on an essentially perfect fit is nonsense

##                          Df  Sum of Sq        RSS     AIC
## - DaysLateLast30          1 5.2300e-14 2.5352e-11 -9348.9
## - FromDiversityJobFairID  1 7.0400e-14 2.5370e-11 -9348.7
## - Age                     1 1.2470e-13 2.5424e-11 -9348.0
## - HispanicLatino          1 1.4930e-13 2.5449e-11 -9347.7
## <none>                                 2.5299e-11 -9347.5
## - RaceDesc                1 1.8670e-13 2.5486e-11 -9347.3
## - PerfScoreID             1 2.1610e-13 2.5516e-11 -9346.9
## - Absences                1 3.3400e-13 2.5633e-11 -9345.5
## - Department              5 1.9643e-11 4.4942e-11 -9178.9
## 
## Step:  AIC=-9348.91
## Salary ~ Department + HispanicLatino + RaceDesc + FromDiversityJobFairID + 
##     PerfScoreID + Absences + Age

## Warning: attempting model selection on an essentially perfect fit is nonsense

##                          Df  Sum of Sq        RSS     AIC
## - FromDiversityJobFairID  1 6.2200e-14 2.5414e-11 -9350.1
## - Age                     1 1.2300e-13 2.5475e-11 -9349.4
## - HispanicLatino          1 1.6310e-13 2.5515e-11 -9348.9
## <none>                                 2.5352e-11 -9348.9
## - PerfScoreID             1 1.9130e-13 2.5543e-11 -9348.6
## - RaceDesc                1 1.9390e-13 2.5546e-11 -9348.5
## - Absences                1 3.4890e-13 2.5701e-11 -9346.7
## - Department              5 1.9595e-11 4.4947e-11 -9180.8
## 
## Step:  AIC=-9350.15
## Salary ~ Department + HispanicLatino + RaceDesc + PerfScoreID + 
##     Absences + Age

## Warning: attempting model selection on an essentially perfect fit is nonsense

##                  Df  Sum of Sq        RSS     AIC
## - Age             1 1.0250e-13 2.5516e-11 -9350.9
## - RaceDesc        1 1.3800e-13 2.5552e-11 -9350.5
## <none>                         2.5414e-11 -9350.1
## - PerfScoreID     1 1.9350e-13 2.5607e-11 -9349.8
## - HispanicLatino  1 2.2260e-13 2.5637e-11 -9349.4
## - Absences        1 3.2910e-13 2.5743e-11 -9348.1
## - Department      5 1.9573e-11 4.4987e-11 -9182.5
## 
## Step:  AIC=-9350.89
## Salary ~ Department + HispanicLatino + RaceDesc + PerfScoreID + 
##     Absences

## Warning: attempting model selection on an essentially perfect fit is nonsense

##                  Df  Sum of Sq        RSS     AIC
## - RaceDesc        1 1.4660e-13 2.5663e-11 -9351.1
## <none>                         2.5516e-11 -9350.9
## - PerfScoreID     1 2.2920e-13 2.5746e-11 -9350.1
## - HispanicLatino  1 2.4750e-13 2.5764e-11 -9349.9
## - Absences        1 3.0530e-13 2.5822e-11 -9349.2
## - Department      5 1.9568e-11 4.5084e-11 -9183.9
## 
## Step:  AIC=-9351.11
## Salary ~ Department + HispanicLatino + PerfScoreID + Absences

## Warning: attempting model selection on an essentially perfect fit is nonsense

##                  Df  Sum of Sq        RSS     AIC
## <none>                         2.5663e-11 -9351.1
## - PerfScoreID     1 2.2530e-13 2.5888e-11 -9350.4
## - HispanicLatino  1 2.5400e-13 2.5917e-11 -9350.0
## - Absences        1 3.0810e-13 2.5971e-11 -9349.4
## - Department      5 1.9895e-11 4.5558e-11 -9182.6

## 
## Call:
## lm(formula = Salary ~ Department + HispanicLatino + PerfScoreID + 
##     Absences, data = THRDATA)
## 
## Coefficients:
##                    (Intercept)      DepartmentExecutive Office  
##                      8.333e-01                       1.183e-06  
##                DepartmentIT/IS     DepartmentProduction         
##                      3.820e-07                      -2.247e-07  
##                DepartmentSales  DepartmentSoftware Engineering  
##                     -1.352e-09                       4.631e-07  
##              HispanicLatinoYes                     PerfScoreID  
##                     -1.026e-07                       4.623e-08  
##                       Absences  
##                      5.445e-09

0.8 Goodness of fit comparing 3 models

Here I preform goodness of fit calculations for 3 of the models. The first model is the full model sans coliniar vairables, the second is the box cox transformed model, and the third is the reduced box cox transformed model. Of these three models the reduced model seems the most apropriate.The full model has a slightly higher adjusted r^2, meaning the reduced model explains less of the variability of the observations. This is offset by the reduced models much better mallows Cp and AIC that indicate a higher accuracy in predictions using the reduced model.

Goodness-of-fit Measures of Candidate Models
	SSE	R.sq	R.adj	Cp	AIC	SBC	PRESS
Full model	92902252429	0.5264570	0.4972660	19	6107.171	6178.228	106716640789
Transformed model	0	0.4560837	0.4225547	19	-9336.521	-9265.465	0
reduced model	0	0.4465416	0.4318805	9	-9351.112	-9317.454	0

0.9 Final model

The final model is the following

((Salary^-1.2)-1)/-1.2 = 0.8333318 + 0.0000012* DepartmentExecutive Office + 0.0000004 * DepartmentIT/IS -0.0000002 * DepartmentProduction -0.000000001 * DepartmentSales + 0.0000005 * DepartmentSoftware Engineering -0.0000001 * HispanicLatinoYes + 0.000000046 * PerfScoreID + 0.0000000054 *Absences

In this model department = Admin Offices and HispanicLation = Yes are used as baselines. As we transformed the salary variable a direct interpretation of the values of the coefficients is difficult. We can however comment on the direction and relative sizes of these coefficients. The difference between departments is the most significant predictor of employee salary. The order of salary by department ranked highest to lowest are Execute Office, Software Engineering, It/IS, Admin Offices,Sales,Production. Hispaniclatino having a negative correlation with salary would be a concerning result for any HR employees. The positive relationship between performance score and salary is an expected and desired finding. The positive correlation between absences and salary is strange, but probably not meaningful given it is a fraction of strength of the other variables.