1 Data Set Information

This data set is a mock data set for a fictitious company’s HR department. It was created to help train HR professionals, since real-world HR data sets are hard to come by in the public domain. The data set contains 316 observations and 36 variables (25 variables in practice; 9 are merely numeric IDs for categorical variables).

There are 15 categorical variables in the data set. These variables are Employee name, State, Zip code, Position, Sex, Marital status, Citizenship, Hispanic, Race, Termination reason, Employment status, Department, Manager, Recruitment source, Performance score, and From diversity job fair. Four of the variables are dates: Date of birth, Date hired, Date terminated, and Date of last performance review. Six of the variables are numeric: Salary, Engagement survey score, Employee satisfaction, Special projects count, Days late in the last 30 days, and Absences.

A number of these variables would make effective response variables, but for this analysis I will use Salary as the response. I will determine which variables have a statistically significant relationship with Salary and describe the nature of those relationships. For the theoretical company, this analysis would highlight whether there is any pay discrimination along lines such as race, sex, or marital status. Finding the relationship between pay and the satisfaction and productivity variables can also help guide the company on how to improve employee retention and productivity.

Excluding variables that are redundant, serve as IDs, or are missing for the vast majority of employees, we are left with 20 practical variables. With 312 total observations, this analysis meets the rule of thumb of at least 15 observations per variable needed to estimate regression coefficients (312 / 20 = 15.6).

2 Simple Linear regression

Looking at the pairwise scatter plot for the 6 numeric variables, the response variable Salary has low correlation with all the other variables except one, Special projects count. Special projects count and Salary have a correlation of 0.51, and the scatter plot supports this: it shows a clearly positive, mostly linear relationship between the two variables. This is the variable we will use for our linear model. If we are looking to transform the data, it is important to note that all the variables besides Absences are heavily skewed. The extreme right tail of the response variable Salary makes it the number one candidate for transformation. A Box-Cox transformation would help bring it closer to normality.
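The two quantities discussed above, the Pearson correlation and the Box-Cox transformation, can be sketched as follows. This is only an illustration in Python using the standard library, not the code that produced the results in this report. The data are synthetic, and the names `pearson`, `box_cox`, `profile_loglik`, and `best_lambda` are hypothetical helpers introduced for this example. The Box-Cox parameter is chosen here by a simple grid search over the profile log-likelihood of the model "transformed salary ~ special projects count".

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def box_cox(y, lam):
    """Box-Cox transform of strictly positive values; lam == 0 is the log."""
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]


def profile_loglik(x, y, lam):
    """Profile log-likelihood of lam for the model box_cox(y) ~ a + b * x."""
    n = len(y)
    z = box_cox(y, lam)
    mx, mz = sum(x) / n, sum(z) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (c - mz) for a, c in zip(x, z)) / sxx
    intercept = mz - slope * mx
    rss = sum((c - intercept - slope * a) ** 2 for a, c in zip(x, z))
    # Gaussian likelihood term plus the Jacobian of the transformation.
    return -0.5 * n * math.log(rss / n) + (lam - 1.0) * sum(math.log(v) for v in y)


def best_lambda(x, y, grid=None):
    """Grid value of lam with the highest profile log-likelihood."""
    if grid is None:
        grid = [i / 10 for i in range(-20, 21)]  # -2.0, -1.9, ..., 2.0
    return max(grid, key=lambda lam: profile_loglik(x, y, lam))


# Synthetic stand-in data (an assumption, NOT the HR data set):
# a right-skewed "salary" that increases with a "special projects" count.
random.seed(0)
projects = [random.choice([0, 0, 0, 0, 2, 4, 5, 6, 7]) for _ in range(300)]
salary = [math.exp(10.9 + 0.05 * p + random.gauss(0, 0.5)) for p in projects]

r = pearson(projects, salary)
lam = best_lambda(projects, salary)
print(f"correlation = {r:.2f}, selected lambda = {lam}")
```

A selected lambda well below 1 indicates that the untransformed response is right-skewed relative to what the linear model assumes, which is the situation described for Salary.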

3 Non-transformed linear model

4 Transformed linear model

While not perfect, the residual plot and QQ plot both show approximately normal residuals, a massive improvement over the non-transformed model. The most concerning potential violation is that the residuals in the residual plot are not consistently centered at zero. From my perspective, this trend is not pronounced enough to invalidate the model. No outliers are apparent.

5 Bootstrapped Confidence Interval

Bootstrap confidence intervals of regression coefficients.

                      2.5%       Mean       97.5%
(Intercept)           1.111      1.111      1.111
SpecialProjectsCount  2.134e-06  2.653e-06  3.146e-06
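A percentile bootstrap of this kind can be sketched as follows. This is a minimal illustration in Python using the standard library, assuming case resampling (rows drawn with replacement) and percentile limits. The report does not state which bootstrap variant was used, so both choices are assumptions. The data are synthetic, and `ols`, `percentile`, and `bootstrap_ci` are hypothetical helper names.

```python
import random


def ols(x, y):
    """Least-squares intercept and slope for the model y ~ a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope


def percentile(sorted_vals, q):
    """Linear-interpolation percentile of a sorted list, with 0 <= q <= 1."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap (lower, mean, upper) for intercept and slope."""
    rng = random.Random(seed)
    n = len(x)
    intercepts, slopes = [], []
    while len(slopes) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows
        xs = [x[i] for i in idx]
        if len(set(xs)) < 2:  # slope is undefined if all x are equal
            continue
        a, b = ols(xs, [y[i] for i in idx])
        intercepts.append(a)
        slopes.append(b)
    result = {}
    for name, values in (("intercept", intercepts), ("slope", slopes)):
        values.sort()
        result[name] = (
            percentile(values, alpha / 2),
            sum(values) / len(values),
            percentile(values, 1 - alpha / 2),
        )
    return result


# Synthetic stand-in data (an assumption, NOT the HR data set).
random.seed(0)
x_demo = [random.choice([0, 0, 0, 2, 4, 5, 6, 7]) for _ in range(300)]
y_demo = [1.0 + 2.0 * v + random.gauss(0, 1) for v in x_demo]

for name, (lower, mean, upper) in bootstrap_ci(x_demo, y_demo).items():
    print(f"{name}: 2.5% = {lower:.3f}, mean = {mean:.3f}, 97.5% = {upper:.3f}")
```

Resampling whole rows makes no assumption about the distribution of the residuals, which is the usual motivation for preferring a bootstrap interval when normality is in doubt.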

6 Standard Confidence Interval

Standard confidence intervals of regression coefficients.

                                Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)                     1.111      5.825e-07   1907335  0
NUMHRDATA$SpecialProjectsCount  2.657e-06  2.204e-07   12.06    1.058e-27
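The table reports estimates and standard errors rather than interval limits. As a rough sketch, the implied 95% interval for the slope can be computed as estimate ± quantile × standard error. The Python snippet below uses the normal quantile as an approximation to the t quantile, which is an assumption that is reasonable here because the model has roughly 300 residual degrees of freedom. The function name `wald_interval` is hypothetical.

```python
from statistics import NormalDist


def wald_interval(estimate, std_error, level=0.95):
    """Symmetric interval: estimate +/- z * std_error (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for level = 0.95
    return estimate - z * std_error, estimate + z * std_error


# Slope estimate and standard error taken from the table above.
slope_lo, slope_hi = wald_interval(2.657e-06, 2.204e-07)
print(f"slope 95% interval: {slope_lo:.3e} to {slope_hi:.3e}")
```

Under this approximation the slope interval runs from about 2.2e-06 to 3.1e-06, slightly narrower than the bootstrap limits of 2.134e-06 and 3.146e-06.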

The two sets of estimates for the linear model are extremely similar. There is a slight difference in the slope term (2.653e-06 from the bootstrap versus 2.657e-06 from the standard fit), and the bootstrap confidence interval for the slope is slightly wider. The p-values and confidence intervals of both models show that the slope is highly statistically significant. While these two models are virtually identical, I would prefer to present the bootstrap model on account of the non-normality of the original untransformed Salary and Special projects count variables.