Loading the necesary packages…
In order to determine the factors that significantly influence an individual’s salary within a company, we selected the dataset entitled Human Resources Data Set, published on the Kaggle platform: https://www.kaggle.com/datasets/rhuebner/human-resources-data-set
This dataset was created by Dr. Carla Patalano and Dr. Rich. It was designed as an educational resource to help students learn how to perform exploratory data analysis (EDA). The dataset provides a wide range of features that enable both data visualization and the development of machine learning / predictive analytics models.
Within this dataset, we decided to explore and attempt to answer several research questions, such as:
Are there areas within the company where salary distribution is not equitable?
Is an individual’s salary influenced by any specific factors present in the dataset?
Can we build a scoring model capable of estimating an employee’s salary? If so, to what extent is the model accurate?
loading the dataset
Dataset description
The structure of the dataset
glimpse(HRDataset_v14)
## Rows: 311
## Columns: 36
## $ Employee_Name <chr> "Adinolfi, Wilson K", "Ait Sidi, Karthikey…
## $ EmpID <dbl> 10026, 10084, 10196, 10088, 10069, 10002, 1…
## $ MarriedID <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ MaritalStatusID <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ GenderID <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ EmpStatusID <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ DeptID <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ PerfScoreID <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ FromDiversityJobFairID <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ Salary <dbl> 62506, 104437, 64955, 64991, 50825, 57568, …
## $ Termd <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ PositionID <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ Position <chr> "Production Technician I", "Sr. DBA", "Prod…
## $ State <chr> "MA", "MA", "MA", "MA", "MA", "MA", "MA", "…
## $ Zip <chr> "01960", "02148", "01810", "01886", "02169"…
## $ DOB <chr> "07/10/83", "05/05/75", "09/19/88", "09/27/…
## $ Sex <chr> "M", "M", "F", "F", "F", "F", "F", "M", "F"…
## $ MaritalDesc <chr> "Single", "Married", "Married", "Married", …
## $ CitizenDesc <chr> "US Citizen", "US Citizen", "US Citizen", "…
## $ HispanicLatino <chr> "No", "No", "No", "No", "No", "No", "No", "…
## $ RaceDesc <chr> "White", "White", "White", "White", "White"…
## $ DateofHire <chr> "7/5/2011", "3/30/2015", "7/5/2011", "1/7/2…
## $ DateofTermination <chr> NA, "6/16/2016", "9/24/2012", NA, "9/6/2016…
## $ TermReason <chr> "N/A-StillEmployed", "career change", "hour…
## $ EmploymentStatus <chr> "Active", "Voluntarily Terminated", "Volunt…
## $ Department <chr> "Production", "IT/IS", "Production", "Produ…
## $ ManagerName <chr> "Michael Albert", "Simon Roup", "Kissy Sull…
## $ ManagerID <dbl> 22, 4, 20, 16, 39, 11, 10, 19, 12, 7, 14, 2…
## $ RecruitmentSource <chr> "LinkedIn", "Indeed", "LinkedIn", "Indeed",…
## $ PerformanceScore <chr> "Exceeds", "Fully Meets", "Fully Meets", "F…
## $ EngagementSurvey <dbl> 4.60, 4.96, 3.02, 4.84, 5.00, 5.00, 3.04, 5…
## $ EmpSatisfaction <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ SpecialProjectsCount <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ LastPerformanceReview_Date <chr> "1/17/2019", "2/24/2016", "5/15/2012", "1/3…
## $ DaysLateLast30 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Absences <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…
Based on EDA analysis we so that:
Salary exhibits non-normal behavior and strong right skewness.
Extreme values (executive-level salaries) may influence regression results.
Transformation techniques (e.g., log transformation of salary) may improve model performance.
Categorical variables such as department, position, and employment status are likely strong predictors of salary.
Gender alone may not fully explain salary variation without controlling for position and department.
## Rows: 311
## Columns: 15
## $ married_id <fct> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ marital_status_id <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ gender_id <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ emp_status_id <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ dept_id <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ perf_score_id <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ from_diversity_job_fair_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ termd <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ position_id <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ emp_satisfaction <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ special_projects_count <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ days_late_last30 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ absences <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…
## $ status <chr> "no_married", "married", "married", "marrie…
## $ log10_salary <dbl> 4.795922, 5.018854, 4.812613, 4.812853, 4.7…
## Rows: 311
## Columns: 15
## $ married_id <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ marital_status_id <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ gender_id <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ emp_status_id <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ dept_id <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ perf_score_id <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ from_diversity_job_fair_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ termd <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ position_id <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ emp_satisfaction <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ special_projects_count <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ days_late_last30 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ absences <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…
## $ salary <int> 62506, 104437, 64955, 64991, 50825, 57568, …
## $ status <chr> "no_married", "married", "married", "marrie…
Just a regression …
lm_ih <- lm(salary ~ ., data = dataset)
lm_ih
##
## Call:
## lm(formula = salary ~ ., data = dataset)
##
## Coefficients:
## (Intercept) married_id
## 95093.0 443.8
## marital_status_id gender_id
## -1133.9 -351.1
## emp_status_id dept_id
## -1607.4 -8722.4
## perf_score_id from_diversity_job_fair_id
## 6392.3 3462.4
## termd position_id
## 3924.0 -545.7
## emp_satisfaction special_projects_count
## 432.2 2625.3
## days_late_last30 absences
## 1663.1 324.8
## status
## NA
# lm_ih <- lm(log10_salary ~ ., data = dataset1 %>% dplyr::select(-salary))
# lm_ih
summary(lm_ih)
##
## Call:
## lm(formula = salary ~ ., data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56336 -9497 -2108 5279 159658
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 95093.0 17138.2 5.549 0.0000000638 ***
## married_id 443.8 2520.6 0.176 0.8604
## marital_status_id -1133.9 1297.3 -0.874 0.3828
## gender_id -351.1 2433.9 -0.144 0.8854
## emp_status_id -1607.4 2274.8 -0.707 0.4804
## dept_id -8722.4 1932.4 -4.514 0.0000091892 ***
## perf_score_id 6392.3 3117.4 2.051 0.0412 *
## from_diversity_job_fair_id 3462.4 4256.8 0.813 0.4167
## termd 3924.0 8436.1 0.465 0.6422
## position_id -545.7 219.7 -2.483 0.0136 *
## emp_satisfaction 432.2 1394.5 0.310 0.7568
## special_projects_count 2625.3 796.4 3.296 0.0011 **
## days_late_last30 1663.1 1418.7 1.172 0.2420
## absences 324.8 207.6 1.564 0.1188
## status NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20970 on 297 degrees of freedom
## Multiple R-squared: 0.3345, Adjusted R-squared: 0.3053
## F-statistic: 11.48 on 13 and 297 DF, p-value: < 0.00000000000000022
Identifying the variable that is strongly correlated with status
| Nr.Crt. | Variable | Value |
|---|---|---|
| 1 | (Intercept) | 0 |
| 2 | married_id | 1 |
| 3 | marital_status_id | 0 |
| 4 | gender_id | 0 |
| 5 | emp_status_id | 0 |
| 6 | dept_id | 0 |
| 7 | perf_score_id | 0 |
| 8 | from_diversity_job_fair_id | 0 |
| 9 | termd | 0 |
| 10 | position_id | 0 |
| 11 | emp_satisfaction | 0 |
| 12 | special_projects_count | 0 |
| 13 | days_late_last30 | 0 |
| 14 | absences | 0 |
The importance of the variables
library(performance)
par(mfrow = c(2, 2))
plot(lm_ih, pch = 16, col = '#006EA1')
par(mfrow = c(1, 1))
The estimated values of the regression coefficients
| Nr.Crt. | term | estimate | std.error | statistic | p.value | signif |
|---|---|---|---|---|---|---|
| 1 | (Intercept) | 9.51e+04 | 1.71e+04 | 5.55 | 6.38e-08 | s |
| 2 | married_id | 444 | 2.52e+03 | 0.176 | 0.86 | ns |
| 3 | marital_status_id | -1.13e+03 | 1.3e+03 | -0.874 | 0.383 | ns |
| 4 | gender_id | -351 | 2.43e+03 | -0.144 | 0.885 | ns |
| 5 | emp_status_id | -1.61e+03 | 2.27e+03 | -0.707 | 0.48 | ns |
| 6 | dept_id | -8.72e+03 | 1.93e+03 | -4.51 | 9.19e-06 | s |
| 7 | perf_score_id | 6.39e+03 | 3.12e+03 | 2.05 | 0.0412 | s |
| 8 | from_diversity_job_fair_id | 3.46e+03 | 4.26e+03 | 0.813 | 0.417 | ns |
| 9 | termd | 3.92e+03 | 8.44e+03 | 0.465 | 0.642 | ns |
| 10 | position_id | -546 | 220 | -2.48 | 0.0136 | s |
| 11 | emp_satisfaction | 432 | 1.39e+03 | 0.31 | 0.757 | ns |
| 12 | special_projects_count | 2.63e+03 | 796 | 3.3 | 0.0011 | s |
| 13 | days_late_last30 | 1.66e+03 | 1.42e+03 | 1.17 | 0.242 | ns |
| 14 | absences | 325 | 208 | 1.56 | 0.119 | ns |
| 15 | status |
The validation metrics of the model
| Nr.Crt. | var | values |
|---|---|---|
| 1 | r.squared | 0.334 |
| 2 | adj.r.squared | 0.305 |
| 3 | sigma | 2.1e+04 |
| 4 | statistic | 11.5 |
| 5 | p.value | 4.94e-20 |
| 6 | df | 13 |
| 7 | logLik | -3.53e+03 |
| 8 | AIC | 7.09e+03 |
| 9 | BIC | 7.14e+03 |
| 10 | deviance | 1.31e+11 |
| 11 | df.residual | 297 |
| 12 | nobs | 311 |
## OK: No outliers detected.
## - Based on the following method and threshold: cook (1).
## - For variable: (Whole model)
VIF values obtained for the variables include in the model
| Term | VIF | VIF_CI_low | VIF_CI_high | SE_factor | Tolerance | Tolerance_CI_low | Tolerance_CI_high |
|---|---|---|---|---|---|---|---|
| married_id | 1.08 | 1.02 | 1.37 | 1.04 | 0.928 | 0.73 | 0.984 |
| marital_status_id | 1.06 | 1.01 | 1.45 | 1.03 | 0.947 | 0.689 | 0.993 |
| gender_id | 1.03 | 1 | 2.28 | 1.01 | 0.971 | 0.439 | 0.999 |
| emp_status_id | 11.7 | 9.63 | 14.4 | 3.43 | 0.0851 | 0.0695 | 0.104 |
| dept_id | 2.44 | 2.08 | 2.91 | 1.56 | 0.41 | 0.344 | 0.48 |
| perf_score_id | 2.36 | 2.02 | 2.81 | 1.54 | 0.423 | 0.355 | 0.495 |
| from_diversity_job_fair_id | 1.08 | 1.02 | 1.36 | 1.04 | 0.923 | 0.735 | 0.981 |
| termd | 11.2 | 9.19 | 13.7 | 3.35 | 0.0892 | 0.0729 | 0.109 |
| position_id | 1.32 | 1.19 | 1.55 | 1.15 | 0.758 | 0.647 | 0.843 |
| emp_satisfaction | 1.13 | 1.05 | 1.36 | 1.06 | 0.882 | 0.734 | 0.953 |
| special_projects_count | 2.47 | 2.11 | 2.95 | 1.57 | 0.405 | 0.339 | 0.474 |
| days_late_last30 | 2.38 | 2.04 | 2.83 | 1.54 | 0.42 | 0.353 | 0.491 |
| absences | 1.04 | 1 | 1.65 | 1.02 | 0.96 | 0.608 | 0.997 |
We observe that status variable is NA meaning that between this variable and another variable there is prefect collinearity, suggesting that one or more variables are exact linear combinations of others. As a result, the design matrix becomes singular and VIF cannot be computed.
The variable that is perfect correlated with status is: married_id
# vif(lm_ih)
So we are going to eliminate the status variable and termd variable (highly correlated with emp_status_id; the value of the correlation coefficient is 0.91) from the dataset and rebuild the model.
Estimate regression model
lm_ih <- lm(salary ~ ., data = dataset %>% dplyr::select(-status, -termd))
lm_ih
##
## Call:
## lm(formula = salary ~ ., data = dataset %>% dplyr::select(-status,
## -termd))
##
## Coefficients:
## (Intercept) married_id
## 93194.6 413.6
## marital_status_id gender_id
## -1132.3 -324.3
## emp_status_id dept_id
## -603.2 -8628.1
## perf_score_id from_diversity_job_fair_id
## 6594.0 3185.4
## position_id emp_satisfaction
## -560.5 406.1
## special_projects_count days_late_last30
## 2664.4 1815.1
## absences
## 327.0
summary(lm_ih)
##
## Call:
## lm(formula = salary ~ ., data = dataset %>% dplyr::select(-status,
## -termd))
##
## Residuals:
## Min 1Q Median 3Q Max
## -56169 -9542 -1754 5019 160081
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 93194.6 16623.2 5.606 0.0000000471 ***
## married_id 413.6 2516.4 0.164 0.869563
## marital_status_id -1132.3 1295.6 -0.874 0.382829
## gender_id -324.3 2430.0 -0.133 0.893919
## emp_status_id -603.2 715.8 -0.843 0.400069
## dept_id -8628.1 1919.2 -4.496 0.0000099409 ***
## perf_score_id 6594.0 3083.0 2.139 0.033265 *
## from_diversity_job_fair_id 3185.4 4209.4 0.757 0.449816
## position_id -560.5 217.1 -2.581 0.010318 *
## emp_satisfaction 406.1 1391.6 0.292 0.770612
## special_projects_count 2664.4 790.9 3.369 0.000855 ***
## days_late_last30 1815.1 1378.8 1.316 0.189032
## absences 327.0 207.3 1.577 0.115772
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20940 on 298 degrees of freedom
## Multiple R-squared: 0.334, Adjusted R-squared: 0.3072
## F-statistic: 12.45 on 12 and 298 DF, p-value: < 0.00000000000000022
The importance of the variables
The estimated values of the regression coefficients
| Nr.Crt. | term | estimate | std.error | statistic | p.value | signif |
|---|---|---|---|---|---|---|
| 1 | (Intercept) | 9.32e+04 | 1.66e+04 | 5.61 | 4.71e-08 | s |
| 2 | married_id | 414 | 2.52e+03 | 0.164 | 0.87 | ns |
| 3 | marital_status_id | -1.13e+03 | 1.3e+03 | -0.874 | 0.383 | ns |
| 4 | gender_id | -324 | 2.43e+03 | -0.133 | 0.894 | ns |
| 5 | emp_status_id | -603 | 716 | -0.843 | 0.4 | ns |
| 6 | dept_id | -8.63e+03 | 1.92e+03 | -4.5 | 9.94e-06 | s |
| 7 | perf_score_id | 6.59e+03 | 3.08e+03 | 2.14 | 0.0333 | s |
| 8 | from_diversity_job_fair_id | 3.19e+03 | 4.21e+03 | 0.757 | 0.45 | ns |
| 9 | position_id | -560 | 217 | -2.58 | 0.0103 | s |
| 10 | emp_satisfaction | 406 | 1.39e+03 | 0.292 | 0.771 | ns |
| 11 | special_projects_count | 2.66e+03 | 791 | 3.37 | 0.000855 | s |
| 12 | days_late_last30 | 1.82e+03 | 1.38e+03 | 1.32 | 0.189 | ns |
| 13 | absences | 327 | 207 | 1.58 | 0.116 | ns |
The validation metrics of the model
| Nr.Crt. | var | values |
|---|---|---|
| 1 | r.squared | 0.334 |
| 2 | adj.r.squared | 0.307 |
| 3 | sigma | 2.09e+04 |
| 4 | statistic | 12.5 |
| 5 | p.value | 1.5e-20 |
| 6 | df | 12 |
| 7 | logLik | -3.53e+03 |
| 8 | AIC | 7.09e+03 |
| 9 | BIC | 7.14e+03 |
| 10 | deviance | 1.31e+11 |
| 11 | df.residual | 298 |
| 12 | nobs | 311 |
VIF values obtained for the variables include in the model
| Term | VIF | VIF_CI_low | VIF_CI_high | SE_factor | Tolerance | Tolerance_CI_low | Tolerance_CI_high |
|---|---|---|---|---|---|---|---|
| married_id | 1.08 | 1.02 | 1.38 | 1.04 | 0.929 | 0.727 | 0.985 |
| marital_status_id | 1.06 | 1.01 | 1.46 | 1.03 | 0.947 | 0.686 | 0.993 |
| gender_id | 1.03 | 1 | 2.38 | 1.01 | 0.972 | 0.42 | 0.999 |
| emp_status_id | 1.17 | 1.07 | 1.39 | 1.08 | 0.857 | 0.721 | 0.933 |
| dept_id | 2.41 | 2.06 | 2.88 | 1.55 | 0.415 | 0.347 | 0.485 |
| perf_score_id | 2.32 | 1.98 | 2.76 | 1.52 | 0.432 | 0.362 | 0.504 |
| from_diversity_job_fair_id | 1.06 | 1.01 | 1.42 | 1.03 | 0.941 | 0.705 | 0.991 |
| position_id | 1.29 | 1.16 | 1.52 | 1.14 | 0.775 | 0.659 | 0.859 |
| emp_satisfaction | 1.13 | 1.05 | 1.36 | 1.06 | 0.883 | 0.734 | 0.954 |
| special_projects_count | 2.44 | 2.08 | 2.92 | 1.56 | 0.41 | 0.343 | 0.48 |
| days_late_last30 | 2.25 | 1.93 | 2.68 | 1.5 | 0.444 | 0.373 | 0.518 |
| absences | 1.04 | 1 | 1.67 | 1.02 | 0.961 | 0.598 | 0.998 |
We will rebuild the model including only the significant variables.
##
## Call:
## lm(formula = salary ~ ., data = dataset %>% dplyr::select(-status,
## -termd))
##
## Residuals:
## Min 1Q Median 3Q Max
## -56169 -9542 -1754 5019 160081
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 93194.6 16623.2 5.606 0.0000000471 ***
## married_id 413.6 2516.4 0.164 0.869563
## marital_status_id -1132.3 1295.6 -0.874 0.382829
## gender_id -324.3 2430.0 -0.133 0.893919
## emp_status_id -603.2 715.8 -0.843 0.400069
## dept_id -8628.1 1919.2 -4.496 0.0000099409 ***
## perf_score_id 6594.0 3083.0 2.139 0.033265 *
## from_diversity_job_fair_id 3185.4 4209.4 0.757 0.449816
## position_id -560.5 217.1 -2.581 0.010318 *
## emp_satisfaction 406.1 1391.6 0.292 0.770612
## special_projects_count 2664.4 790.9 3.369 0.000855 ***
## days_late_last30 1815.1 1378.8 1.316 0.189032
## absences 327.0 207.3 1.577 0.115772
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20940 on 298 degrees of freedom
## Multiple R-squared: 0.334, Adjusted R-squared: 0.3072
## F-statistic: 12.45 on 12 and 298 DF, p-value: < 0.00000000000000022
## married_id marital_status_id
## 1.076813 1.055875
## gender_id emp_status_id
## 1.028911 1.166447
## dept_id perf_score_id
## 2.412074 2.316162
## from_diversity_job_fair_id position_id
## 1.062681 1.290889
## emp_satisfaction special_projects_count
## 1.131887 2.441384
## days_late_last30 absences
## 2.252283 1.040777
Variables that has no impact on salary are:
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| married_id | 414 | 2.52e+03 | 0.164 | 0.87 |
| marital_status_id | -1.13e+03 | 1.3e+03 | -0.874 | 0.383 |
| gender_id | -324 | 2.43e+03 | -0.133 | 0.894 |
| emp_status_id | -603 | 716 | -0.843 | 0.4 |
| from_diversity_job_fair_id | 3.19e+03 | 4.21e+03 | 0.757 | 0.45 |
| emp_satisfaction | 406 | 1.39e+03 | 0.292 | 0.771 |
| days_late_last30 | 1.82e+03 | 1.38e+03 | 1.32 | 0.189 |
| absences | 327 | 207 | 1.58 | 0.116 |
## [1] "married_id, marital_status_id, gender_id, emp_status_id, from_diversity_job_fair_id, emp_satisfaction, days_late_last30, absences"
lm_ih <- lm(salary ~ ., data = dataset %>% dplyr::select(-c(status, termd, married_id, marital_status_id, gender_id, emp_status_id, from_diversity_job_fair_id, emp_satisfaction, days_late_last30, absences)))
lm_ih
##
## Call:
## lm(formula = salary ~ ., data = dataset %>% dplyr::select(-c(status,
## termd, married_id, marital_status_id, gender_id, emp_status_id,
## from_diversity_job_fair_id, emp_satisfaction, days_late_last30,
## absences)))
##
## Coefficients:
## (Intercept) dept_id perf_score_id
## 105263.2 -8607.9 4063.2
## position_id special_projects_count
## -620.5 2682.4
# lm_ih <- lm(log10_salary ~ ., data = dataset1 %>% dplyr::select(-c(status, termd, married_id, marital_status_id, gender_id, emp_status_id, from_diversity_job_fair_id, emp_satisfaction, days_late_last30, absences, salary)))
# lm_ih
summary(lm_ih)
##
## Call:
## lm(formula = salary ~ ., data = dataset %>% dplyr::select(-c(status,
## termd, married_id, marital_status_id, gender_id, emp_status_id,
## from_diversity_job_fair_id, emp_satisfaction, days_late_last30,
## absences)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -48859 -9725 -2244 4500 159690
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 105263.2 13558.0 7.764 0.000000000000125 ***
## dept_id -8607.9 1895.7 -4.541 0.000008070682410 ***
## perf_score_id 4063.2 2028.2 2.003 0.046014 *
## position_id -620.5 212.1 -2.925 0.003700 **
## special_projects_count 2682.4 770.8 3.480 0.000575 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20900 on 306 degrees of freedom
## Multiple R-squared: 0.3188, Adjusted R-squared: 0.3099
## F-statistic: 35.81 on 4 and 306 DF, p-value: < 0.00000000000000022
The estimated values of the regression coefficients
| Nr.Crt. | term | estimate | std.error | statistic | p.value | signif |
|---|---|---|---|---|---|---|
| 1 | (Intercept) | 1.05e+05 | 1.36e+04 | 7.76 | 1.25e-13 | s |
| 2 | dept_id | -8.61e+03 | 1.9e+03 | -4.54 | 8.07e-06 | s |
| 3 | perf_score_id | 4.06e+03 | 2.03e+03 | 2 | 0.046 | s |
| 4 | position_id | -620 | 212 | -2.93 | 0.0037 | s |
| 5 | special_projects_count | 2.68e+03 | 771 | 3.48 | 0.000575 | s |
The validation metrics of the model
| Nr.Crt. | var | values |
|---|---|---|
| 1 | r.squared | 0.319 |
| 2 | adj.r.squared | 0.31 |
| 3 | sigma | 2.09e+04 |
| 4 | statistic | 35.8 |
| 5 | p.value | 1.53e-24 |
| 6 | df | 4 |
| 7 | logLik | -3.53e+03 |
| 8 | AIC | 7.08e+03 |
| 9 | BIC | 7.1e+03 |
| 10 | deviance | 1.34e+11 |
| 11 | df.residual | 306 |
| 12 | nobs | 311 |
The importance of the variables
VIF values obtained for the variables include in the model
| Term | VIF | VIF_CI_low | VIF_CI_high | SE_factor | Tolerance | Tolerance_CI_low | Tolerance_CI_high |
|---|---|---|---|---|---|---|---|
| dept_id | 2.36 | 2.01 | 2.83 | 1.54 | 0.423 | 0.353 | 0.497 |
| perf_score_id | 1.01 | 1 | 1.96e+05 | 1 | 0.994 | 5.09e-06 | 1 |
| position_id | 1.24 | 1.12 | 1.47 | 1.11 | 0.808 | 0.682 | 0.892 |
| special_projects_count | 2.33 | 1.99 | 2.79 | 1.53 | 0.43 | 0.358 | 0.504 |
As it can be seen from the checking assumption of the model the multicollinearity effect was eliminated. The metric of the model are the same, 33% of the variation of the dependent variable, salary, is explained by the model. We can see that the model is significant (p-value < 0.05), but not all the included predictors have an significant impact on the salary.
From the checking assumptions of the model we can observe that:
The model does not capture the actual shape of the salary distribution, suggesting potential skewness or the presence of outliers that are not well handled by linear regression.
The relationship between the predictors and salary does not appear to be fully linear, indicating possible model misspecification.
There is evidence of heteroscedasticity, as the variance of the residuals is not constant across fitted values.
Several observations exhibit relatively high leverage (e.g., 132, 27, 151), indicating the presence of influential data points that may affect the estimated regression coefficients.
Additionally, high multicollinearity is detected among some predictors, which can lead to unstable coefficient estimates and inflated standard errors. After eliminating the variables with high VIF values the model was rebuild.
The residuals are not perfectly normally distributed, likely due to positive skewness in salary values. Although this is less critical in large samples, it may still affect statistical inference, including p-values and confidence intervals.
To improve the model, the following steps we should follow:
Using robust regression techniques;
Implementing regularized models such as Ridge or Lasso regression.
Because some of the assumption of the regression model are not checked (like homoscedasticity, the influence of the outliers, deviation from the normality) we will implement more robust regression model.
Because the homoscedasticity assumption is violated
The standard errors become robust and the regression coefficients are not changing
##
## Call:
## lm(formula = salary ~ ., data = dataset)
##
## Coefficients:
## (Intercept) married_id
## 95093.0 443.8
## marital_status_id gender_id
## -1133.9 -351.1
## emp_status_id dept_id
## -1607.4 -8722.4
## perf_score_id from_diversity_job_fair_id
## 6392.3 3462.4
## termd position_id
## 3924.0 -545.7
## emp_satisfaction special_projects_count
## 432.2 2625.3
## days_late_last30 absences
## 1663.1 324.8
## status
## NA
# lm_ih <- lm(salary ~ ., data = dataset %>% dplyr::select(-c(status, termd, married_id, marital_status_id, gender_id, emp_status_id, from_diversity_job_fair_id, emp_satisfaction, days_late_last30, absences)))
# lm_ih
# lm_ih <- lm(log10_salary ~ ., data = dataset1 %>% dplyr::select(-c(status, termd, married_id, marital_status_id, gender_id, emp_status_id, from_diversity_job_fair_id, emp_satisfaction, days_late_last30, absences, salary)))
# lm_ih
library(sandwich)
library(lmtest)
coeftest(lm_ih, vcov = vcovHC(lm_ih, type = "HC1"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 95092.95 38575.24 2.4651 0.01426 *
## married_id 443.76 2561.24 0.1733 0.86256
## marital_status_id -1133.88 874.65 -1.2964 0.19585
## gender_id -351.10 2553.90 -0.1375 0.89075
## emp_status_id -1607.41 1759.77 -0.9134 0.36176
## dept_id -8722.36 5270.46 -1.6550 0.09899 .
## perf_score_id 6392.34 3518.32 1.8169 0.07024 .
## from_diversity_job_fair_id 3462.36 4834.52 0.7162 0.47445
## termd 3923.97 5617.38 0.6985 0.48539
## position_id -545.66 327.55 -1.6659 0.09679 .
## emp_satisfaction 432.19 1050.52 0.4114 0.68107
## special_projects_count 2625.25 1790.43 1.4663 0.14363
## days_late_last30 1663.11 1366.13 1.2174 0.22442
## absences 324.78 185.97 1.7464 0.08177 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
for the outliers and influential observations we will use more robust regression models
for the multicollinearity we will use Ridge, Lasso and ElasticNet regression models
Regression model estimation
library(MASS)
lm_ih <- MASS::rlm(salary ~ ., data = dataset %>% dplyr::select(-status, -termd))
summary(lm_ih)
##
## Call: rlm(formula = salary ~ ., data = dataset %>% dplyr::select(-status,
## -termd))
## Residuals:
## Min 1Q Median 3Q Max
## -44276.7 -7565.1 438.9 6640.3 173976.3
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 82230.1036 9375.5766 8.7707
## married_id -900.1458 1419.2794 -0.6342
## marital_status_id -326.5630 730.7229 -0.4469
## gender_id 146.2642 1370.5508 0.1067
## emp_status_id 50.3524 403.7250 0.1247
## dept_id -5264.4981 1082.4401 -4.8635
## perf_score_id 2700.8751 1738.8486 1.5533
## from_diversity_job_fair_id 103.1237 2374.1452 0.0434
## position_id -298.1443 122.4571 -2.4347
## emp_satisfaction 269.4655 784.8575 0.3433
## special_projects_count 3504.2525 446.0931 7.8554
## days_late_last30 651.2826 777.6269 0.8375
## absences 135.8248 116.9227 1.1617
##
## Residual standard error: 10310 on 298 degrees of freedom
coeftest(lm_ih, vcov = vcovHC(lm_ih, type = "HC1"))
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 82230.104 17405.612 4.7243 0.000002309 ***
## married_id -900.146 1477.479 -0.6092 0.54236
## marital_status_id -326.563 647.397 -0.5044 0.61396
## gender_id 146.264 1418.480 0.1031 0.91787
## emp_status_id 50.352 438.760 0.1148 0.90863
## dept_id -5264.498 2562.226 -2.0547 0.03991 *
## perf_score_id 2700.875 1769.581 1.5263 0.12694
## from_diversity_job_fair_id 103.124 3283.623 0.0314 0.97495
## position_id -298.144 184.308 -1.6176 0.10574
## emp_satisfaction 269.466 744.174 0.3621 0.71728
## special_projects_count 3504.252 880.485 3.9799 0.000068941 ***
## days_late_last30 651.283 672.159 0.9689 0.33257
## absences 135.825 115.144 1.1796 0.23816
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The importance of the variables
The estimated values of the regression coefficients
| Crt. | term | estimate | std.error | statistic | Decision |
|---|---|---|---|---|---|
| 1 | (Intercept) | 8.22e+04 | 9.38e+03 | 8.77 | s |
| 2 | married_id | -900 | 1.42e+03 | -0.634 | ns |
| 3 | marital_status_id | -327 | 731 | -0.447 | ns |
| 4 | gender_id | 146 | 1.37e+03 | 0.107 | ns |
| 5 | emp_status_id | 50.4 | 404 | 0.125 | ns |
| 6 | dept_id | -5.26e+03 | 1.08e+03 | -4.86 | s |
| 7 | perf_score_id | 2.7e+03 | 1.74e+03 | 1.55 | ns |
| 8 | from_diversity_job_fair_id | 103 | 2.37e+03 | 0.0434 | ns |
| 9 | position_id | -298 | 122 | -2.43 | s |
| 10 | emp_satisfaction | 269 | 785 | 0.343 | ns |
| 11 | special_projects_count | 3.5e+03 | 446 | 7.86 | s |
| 12 | days_late_last30 | 651 | 778 | 0.838 | ns |
| 13 | absences | 136 | 117 | 1.16 | ns |
The validation metrics of the model
| Crt. | var | values |
|---|---|---|
| 1 | sigma | 1.03e+04 |
| 2 | converged | 1 |
| 3 | logLik | -3.54e+03 |
| 4 | AIC | 7.1e+03 |
| 5 | BIC | 7.15e+03 |
| 6 | deviance | 1.36e+11 |
| 7 | nobs | 311 |
VIF values obtained for the variables include in the model
| Term | VIF | VIF_CI_low | VIF_CI_high | SE_factor | Tolerance | Tolerance_CI_low | Tolerance_CI_high |
|---|---|---|---|---|---|---|---|
| married_id | 1.08 | 1.02 | 1.38 | 1.04 | 0.929 | 0.727 | 0.985 |
| marital_status_id | 1.06 | 1.01 | 1.46 | 1.03 | 0.947 | 0.686 | 0.993 |
| gender_id | 1.03 | 1 | 2.38 | 1.01 | 0.972 | 0.42 | 0.999 |
| emp_status_id | 1.17 | 1.07 | 1.39 | 1.08 | 0.857 | 0.721 | 0.933 |
| dept_id | 2.41 | 2.06 | 2.88 | 1.55 | 0.415 | 0.347 | 0.485 |
| perf_score_id | 2.32 | 1.98 | 2.76 | 1.52 | 0.432 | 0.362 | 0.504 |
| from_diversity_job_fair_id | 1.06 | 1.01 | 1.42 | 1.03 | 0.941 | 0.705 | 0.991 |
| position_id | 1.29 | 1.16 | 1.52 | 1.14 | 0.775 | 0.659 | 0.859 |
| emp_satisfaction | 1.13 | 1.05 | 1.36 | 1.06 | 0.883 | 0.734 | 0.954 |
| special_projects_count | 2.44 | 2.08 | 2.92 | 1.56 | 0.41 | 0.343 | 0.48 |
| days_late_last30 | 2.25 | 1.93 | 2.68 | 1.5 | 0.444 | 0.373 | 0.518 |
| absences | 1.04 | 1 | 1.67 | 1.02 | 0.961 | 0.598 | 0.998 |
For modeling salary as the dependent variable all columns in dataset except status are included as predictors.
The robust regression technique (MM) is used for modeling, reducing the influence of outliers on coefficients and combines high breakdown point with high efficiency.
Regression model estimation
library(robustbase)
lm_ih <- lmrob(salary ~ ., data = dataset %>% dplyr::select(-status, -termd))
library(robustbase)
summary(lm_ih)
##
## Call:
## lmrob(formula = salary ~ ., data = dataset %>% dplyr::select(-status, -termd))
## \--> method = "MM"
## Residuals:
## Min 1Q Median 3Q Max
## -61592.3 -6537.9 479.6 6375.1 218886.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1993.00 15712.18 0.127 0.899
## married_id 30.57 1219.13 0.025 0.980
## marital_status_id -95.04 577.71 -0.165 0.869
## gender_id -116.98 1291.51 -0.091 0.928
## emp_status_id 418.79 365.10 1.147 0.252
## dept_id 9119.66 2311.09 3.946 0.0000992 ***
## perf_score_id 2523.75 1679.81 1.502 0.134
## from_diversity_job_fair_id 129.86 2974.33 0.044 0.965
## position_id 191.53 149.16 1.284 0.200
## emp_satisfaction -419.97 685.17 -0.613 0.540
## special_projects_count 8067.66 819.51 9.845 < 0.0000000000000002 ***
## days_late_last30 265.20 557.27 0.476 0.635
## absences 115.12 104.15 1.105 0.270
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Robust residual standard error: 10380
## Multiple R-squared: 0.5665, Adjusted R-squared: 0.5491
## Convergence in 31 IRWLS iterations
##
## Robustness weights:
## 15 observations c(30,56,77,97,132,133,151,191,228,241,245,256,260,269,309)
## are outliers with |weight| = 0 ( < 0.00032);
## 28 weights are ~= 1. The remaining 268 ones are summarized as
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01717 0.87990 0.96250 0.89740 0.98750 0.99900
## Algorithmic parameters:
## tuning.chi bb tuning.psi refine.tol
## 1.54764000000000 0.50000000000000 4.68506100000000 0.00000010000000
## rel.tol scale.tol solve.tol zero.tol
## 0.00000010000000 0.00000000010000 0.00000010000000 0.00000000010000
## eps.outlier eps.x warn.limit.reject warn.limit.meanrw
## 0.00032154340836 0.00000000005457 0.50000000000000 0.50000000000000
## nResample max.it best.r.s k.fast.s k.max
## 500 50 2 1 200
## maxit.scale trace.lev mts compute.rd fast.s.large.n
## 200 0 1000 0 2000
## psi subsampling cov
## "bisquare" "nonsingular" ".vcov.avar1"
## compute.outlier.stats
## "SM"
## seed : int(0)
It can be observe that dept_id and special_projects_count are the strongest predictors of salary.
Most of the demographic variables (married, gender, etc.) are not significant.
The variable special projects (special_projects_count) have the largest positive effect on salary (~ 8078.91 increase per project).
The observations with weight = 0 are completely ignored in coefficient estimation, which means that the model will not be distorted by extreme salaries or unusual employees.
The importance of the variables
qr(model.matrix(lm_ih))$rank
## [1] 13
ncol(model.matrix(lm_ih))
## [1] 13
## Model :
## salary ~ married_id + marital_status_id + gender_id + emp_status_id +
## dept_id + perf_score_id + from_diversity_job_fair_id + position_id +
## emp_satisfaction + special_projects_count + days_late_last30 +
## absences
The estimated values of the regression coefficients
| Nr.Crt. | term | estimate | std.error | statistic | p.value | signif |
|---|---|---|---|---|---|---|
| 1 | (Intercept) | 1.99e+03 | 1.57e+04 | 0.127 | 0.899 | ns |
| 2 | married_id | 30.6 | 1.22e+03 | 0.0251 | 0.98 | ns |
| 3 | marital_status_id | -95 | 578 | -0.165 | 0.869 | ns |
| 4 | gender_id | -117 | 1.29e+03 | -0.0906 | 0.928 | ns |
| 5 | emp_status_id | 419 | 365 | 1.15 | 0.252 | ns |
| 6 | dept_id | 9.12e+03 | 2.31e+03 | 3.95 | 9.92e-05 | s |
| 7 | perf_score_id | 2.52e+03 | 1.68e+03 | 1.5 | 0.134 | ns |
| 8 | from_diversity_job_fair_id | 130 | 2.97e+03 | 0.0437 | 0.965 | ns |
| 9 | position_id | 192 | 149 | 1.28 | 0.2 | ns |
| 10 | emp_satisfaction | -420 | 685 | -0.613 | 0.54 | ns |
| 11 | special_projects_count | 8.07e+03 | 820 | 9.84 | 5.55e-20 | s |
| 12 | days_late_last30 | 265 | 557 | 0.476 | 0.635 | ns |
| 13 | absences | 115 | 104 | 1.11 | 0.27 | ns |
The validation metrics of the model
| Nr.Crt. | var | values |
|---|---|---|
| 1 | r.squared | 0.567 |
| 2 | sigma | 1.04e+04 |
| 3 | df.residual | 298 |
The fitted, and residula values
In this case the model explains 58% of the variation in salary (adjusted R² is 0.56), which is pretty good.
VIF values obtained for the variables include in the model
| Term | VIF | VIF_CI_low | VIF_CI_high | SE_factor | Tolerance | Tolerance_CI_low | Tolerance_CI_high |
|---|---|---|---|---|---|---|---|
| married_id | 1.07 | 1.01 | 1.39 | 1.03 | 0.935 | 0.718 | 0.988 |
| marital_status_id | 1.3 | 1.17 | 1.53 | 1.14 | 0.767 | 0.653 | 0.852 |
| gender_id | 1.21 | 1.11 | 1.43 | 1.1 | 0.824 | 0.698 | 0.904 |
| emp_status_id | 1.32 | 1.19 | 1.55 | 1.15 | 0.758 | 0.645 | 0.843 |
| dept_id | 5.12 | 4.25 | 6.23 | 2.26 | 0.195 | 0.161 | 0.235 |
| perf_score_id | 3.8 | 3.18 | 4.59 | 1.95 | 0.263 | 0.218 | 0.314 |
| from_diversity_job_fair_id | 1.25 | 1.13 | 1.47 | 1.12 | 0.8 | 0.68 | 0.883 |
| position_id | 3.33 | 2.81 | 4.02 | 1.83 | 0.3 | 0.249 | 0.356 |
| emp_satisfaction | 1.48 | 1.31 | 1.74 | 1.22 | 0.677 | 0.576 | 0.764 |
| special_projects_count | 2.34 | 2 | 2.79 | 1.53 | 0.427 | 0.358 | 0.499 |
| days_late_last30 | 3.17 | 2.67 | 3.81 | 1.78 | 0.316 | 0.262 | 0.374 |
| absences | 1.37 | 1.23 | 1.61 | 1.17 | 0.727 | 0.62 | 0.814 |
Both methods are robust alternatives to OLS, but MM (lmrob) tends to be more robust with high breakdown points, while M (rlm) focuses on efficient estimation for moderately contaminated data.
In both models special_projects_count is consistently highly significant.
dept_id variable shows large effects, but rlm and lmrob can differ in sign/magnitude depending on outlier treatment.
In both cases most demographic variables are not significant.
Both models account for extreme salaries, but differently (lmrob ignores extreme outliers, rlm downweights them).
Overall, Lasso and Ridge regression are used when we have multicollinearity problems, to many predictors and we want a stable predictive model.
library(glmnet)
## Loaded glmnet 4.1-8
x <- model.matrix(salary ~ ., dataset)[,-1]
y <- dataset$salary
Regression model estimation
ridge_model <- cv.glmnet(x, y, alpha = 0)
ridge_model
##
## Call: cv.glmnet(x = x, y = y, alpha = 0)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 9885 78 476230896 143918814 14
## 1se 859774 30 618809836 175693563 14
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 89160.9564
## married_id 245.5060
## marital_status_id -961.9268
## gender_id 340.6729
## emp_status_id -496.3839
## dept_id -6305.7042
## perf_score_id 3347.7496
## from_diversity_job_fair_id 2518.0345
## termd -331.6805
## position_id -367.1815
## emp_satisfaction 526.8653
## special_projects_count 2414.1555
## days_late_last30 418.6630
## absences 250.1753
## status 244.5240
| Nr.Crt. | Matric | Value |
|---|---|---|
| 1 | lambda.min | 9.89e+03 |
| 2 | lambda.1se | 8.6e+05 |
| 3 | nobs | 311 |
Reduces de value of the regression coefficients to zero, not eliminate them
Interpretation:
special_projects_count (2501.91) → has strongest positive effect on salary, which means that more special projects increases salary.
days_late_last30 (571.18), emp_satisfaction (523.16), absences (266.46) → have a moderate positive effect on salary. Very interesting is that satisfaction and days late are positive here (possibly because Ridge shrinks coefficients but keeps them in model).
position_id (-397.95) → has a negative impact on salary, meaning higher position ID reduces salary after controlling for others.
status (233.94) → has a small positive effect on salary.
Ridge regression shrinks coefficients, reduces variance due to multicollinearity, but does not perform variable selection.
Compare to lmrob: special_projects_count remains the strongest predictor across models.
Evaluating the performance of the Ridge model
## [1] 430105398
The cross-validation curve reflects the bias–variance tradeoff inherent in regularized models. At very small λ values, the model approaches OLS and exhibits signs of mild overfitting. As λ increases, prediction error decreases due to reduced variance. However, excessive regularization leads to underfitting, as indicated by the increasing error for large λ values. The optimal λ balances bias and variance, ensuring better generalization performance.
Regression model estimation
##
## Call: cv.glmnet(x = x, y = y, alpha = 1)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 783 31 468106080 145969732 8
## 1se 11633 2 613398133 164322653 1
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 99689.4344
## married_id .
## marital_status_id -485.7549
## gender_id .
## emp_status_id -145.4528
## dept_id -7593.2077
## perf_score_id 2738.4442
## from_diversity_job_fair_id 309.5148
## termd .
## position_id -437.4469
## emp_satisfaction .
## special_projects_count 2710.0045
## days_late_last30 .
## absences 220.9998
## status .
Interpretation:
special_projects_count (2740.63) → is the strongest positive predictor, having a the strongest effect on salary. Each additional special project increases salary by ~2,740 units.
dept_id (-7249.22) → has a large effect depending on department coding: some departments pay much less/more than the baseline/mean.
perf_score_id (2371.70) → the performance scores positively affect salary.
absences (179.11) → has a minor positive effect on salary (interesting, could reflect correlation with other variables).
position_id (-395.62) → has a small negative effect on salary.
marital_status_id (-235.71) → has a small negative effect on salary. It appears that marital status slightly reduces salary.
| Metric | Value |
|---|---|
| lambda.min | 783 |
| lambda.1se | 1.16e+04 |
| nobs | 311 |
Can eliminate variables by forcing the value of regression coefficient to be zero
Performance of the Lasso model
## [1] 426931353
Lasso shrinks unimportant variables to zero, performing a variable selection while controlling multicollinearity and overfitting. So the final model focuses only on variables that actually explain variation in salary. In this case the important predictors are: dept_id, perf_score_id, special_projects_count, absences, marital_status_id, position_id, while married_id, gender_id, from_diversity_job_fair_id, emp_satisfaction, days_late_last30, status are dropped predictors.
Comparing the performance of the Ridge and Lasso models
| Nr.Crt. | MSE | Model | lambda.min | lambda.1se | nobs |
|---|---|---|---|---|---|
| 1 | 4.3e+08 | Ridge | 9.89e+03 | 8.6e+05 | 311 |
| 2 | 4.27e+08 | Lasso | 783 | 1.16e+04 | 311 |
Robust regression models address violations of the classical assumptions of regression, such as heteroscedasticity, non-normality of errors, multicollinearity, and the presence of influential observations. Ridge regression stabilizes the coefficients in the presence of multicollinearity, while Lasso regression performs variable selection and simultaneously controls model complexity.
The Lasso model performs variable selection by shrinking some coefficients exactly to zero. At λ = lambda.min, the model retains eight predictors, achieving the lowest cross-validated error. However, λ = lambda.1se produces a more parsimonious model with only two predictors and comparable predictive performance, suggesting improved generalization and stability.
Elastic Net regression was implemented to simultaneously address multicollinearity and perform variable selection. By combining L1 and L2 regularization penalties, the method provides more stable coefficient estimates compared to Lasso, particularly in the presence of correlated predictors.
Regression model estimation
##
## Call: cv.glmnet(x = x, y = y, alpha = 0.5)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 1720 30 465955650 107134394 8
## 1se 17600 5 563913327 125347337 2
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 98022.18688
## married_id .
## marital_status_id -401.36371
## gender_id .
## emp_status_id -120.56475
## dept_id -7240.73932
## perf_score_id 2572.98929
## from_diversity_job_fair_id 17.97873
## termd .
## position_id -404.19215
## emp_satisfaction .
## special_projects_count 2706.17379
## days_late_last30 .
## absences 203.25672
## status .
## [1] 428627377
Elastic Net combines L1 and L2 regularization, allowing for both shrinkage and grouped variable selection. It is particularly suitable in the presence of multicollinearity, as it stabilizes coefficient estimates while retaining correlated predictors.
Searching for the optimal hyperparameters
## [1] 467815942 461436413 463404375 468896799 468776405 465053606 465292529
## [8] 476683081 484718055 469561763 479238723
## [1] 0.1
## [1] 5926.088
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 94567.12167
## married_id .
## marital_status_id -594.24170
## gender_id .
## emp_status_id -299.64596
## dept_id -6586.05190
## perf_score_id 2626.97295
## from_diversity_job_fair_id 972.35006
## termd .
## position_id -362.64193
## emp_satisfaction 14.27022
## special_projects_count 2545.52998
## days_late_last30 .
## absences 213.02951
## status .
## [1] 430778136
The optimal Elastic Net model was selected by performing cross-validation over a grid of alpha values (0 to 1). The combination of alpha and lambda that minimized the cross-validated mean squared error was chosen, and the final model was re-estimated using these optimal hyperparameters.
The analysis began with the estimation of an OLS model to explain salary variation based on employees’ demographic and organizational characteristics. Diagnostic tests revealed several issues, including the presence of influential observations, potential heteroscedasticity, deviations from residual normality, and indications of multicollinearity among predictors.
To improve the robustness of the estimates, alternative methods were applied. Robust regression techniques (rlm, lmrob) reduced the influence of extreme observations by down-weighting large residuals, resulting in more stable coefficient estimates in the presence of outliers. However, these methods do not address multicollinearity.
Subsequently, regularized models (Ridge, Lasso, and Elastic Net) were implemented. Cross-validation results showed that introducing a penalty term improves predictive performance compared to OLS.
Ridge regression stabilized coefficient estimates in the presence of correlated predictors. Lasso performed automatic variable selection, leading to a more parsimonious model without substantial loss of predictive accuracy. Elastic Net provided an effective compromise between stability and sparsity, making it particularly suitable in the presence of multicollinearity.
The cross-validation curves clearly illustrated the bias–variance tradeoff, and the optimal selection of λ allowed for the construction of a balanced model with strong predictive performance and controlled complexity.