Scoring model

Analyzed data set: HRDataset_v14 available here

Loading the necesary packages…

Project Objectives

  1. Exploratory data analysis - EDA was done in another post and available here
  2. Identification and detailed analysis of the studied phenomenon through the development of scoring models
  3. Interpretation of results and presentation of conclusions

Dataset description

In order to determine the factors that significantly influence an individual’s salary within a company, we selected the dataset entitled Human Resources Data Set, published on the Kaggle platform: https://www.kaggle.com/datasets/rhuebner/human-resources-data-set

This dataset was created by Dr. Carla Patalano and Dr. Rich. It was designed as an educational resource to help students learn how to perform exploratory data analysis (EDA). The dataset provides a wide range of features that enable both data visualization and the development of machine learning / predictive analytics models.

Within this dataset, we decided to explore and attempt to answer several research questions, such as:

  1. Are there areas within the company where salary distribution is not equitable?

  2. Is an individual’s salary influenced by any specific factors present in the dataset?

  3. Can we build a scoring model capable of estimating an employee’s salary? If so, to what extent is the model accurate?

loading the dataset

Dataset description

The structure of the dataset

glimpse(HRDataset_v14)
## Rows: 311
## Columns: 36
## $ Employee_Name              <chr> "Adinolfi, Wilson  K", "Ait Sidi, Karthikey…
## $ EmpID                      <dbl> 10026, 10084, 10196, 10088, 10069, 10002, 1…
## $ MarriedID                  <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ MaritalStatusID            <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ GenderID                   <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ EmpStatusID                <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ DeptID                     <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ PerfScoreID                <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ FromDiversityJobFairID     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ Salary                     <dbl> 62506, 104437, 64955, 64991, 50825, 57568, …
## $ Termd                      <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ PositionID                 <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ Position                   <chr> "Production Technician I", "Sr. DBA", "Prod…
## $ State                      <chr> "MA", "MA", "MA", "MA", "MA", "MA", "MA", "…
## $ Zip                        <chr> "01960", "02148", "01810", "01886", "02169"…
## $ DOB                        <chr> "07/10/83", "05/05/75", "09/19/88", "09/27/…
## $ Sex                        <chr> "M", "M", "F", "F", "F", "F", "F", "M", "F"…
## $ MaritalDesc                <chr> "Single", "Married", "Married", "Married", …
## $ CitizenDesc                <chr> "US Citizen", "US Citizen", "US Citizen", "…
## $ HispanicLatino             <chr> "No", "No", "No", "No", "No", "No", "No", "…
## $ RaceDesc                   <chr> "White", "White", "White", "White", "White"…
## $ DateofHire                 <chr> "7/5/2011", "3/30/2015", "7/5/2011", "1/7/2…
## $ DateofTermination          <chr> NA, "6/16/2016", "9/24/2012", NA, "9/6/2016…
## $ TermReason                 <chr> "N/A-StillEmployed", "career change", "hour…
## $ EmploymentStatus           <chr> "Active", "Voluntarily Terminated", "Volunt…
## $ Department                 <chr> "Production", "IT/IS", "Production", "Produ…
## $ ManagerName                <chr> "Michael Albert", "Simon Roup", "Kissy Sull…
## $ ManagerID                  <dbl> 22, 4, 20, 16, 39, 11, 10, 19, 12, 7, 14, 2…
## $ RecruitmentSource          <chr> "LinkedIn", "Indeed", "LinkedIn", "Indeed",…
## $ PerformanceScore           <chr> "Exceeds", "Fully Meets", "Fully Meets", "F…
## $ EngagementSurvey           <dbl> 4.60, 4.96, 3.02, 4.84, 5.00, 5.00, 3.04, 5…
## $ EmpSatisfaction            <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ SpecialProjectsCount       <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ LastPerformanceReview_Date <chr> "1/17/2019", "2/24/2016", "5/15/2012", "1/3…
## $ DaysLateLast30             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Absences                   <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…

The key findings from the EDA and their implications for further modeling

Based on EDA analysis we so that:

  • Salary exhibits non-normal behavior and strong right skewness.

  • Extreme values (executive-level salaries) may influence regression results.

  • Transformation techniques (e.g., log transformation of salary) may improve model performance.

  • Categorical variables such as department, position, and employment status are likely strong predictors of salary.

  • Gender alone may not fully explain salary variation without controlling for position and department.

## Rows: 311
## Columns: 15
## $ married_id                 <fct> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ marital_status_id          <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ gender_id                  <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ emp_status_id              <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ dept_id                    <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ perf_score_id              <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ from_diversity_job_fair_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ termd                      <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ position_id                <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ emp_satisfaction           <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ special_projects_count     <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ days_late_last30           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ absences                   <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…
## $ status                     <chr> "no_married", "married", "married", "marrie…
## $ log10_salary               <dbl> 4.795922, 5.018854, 4.812613, 4.812853, 4.7…
## Rows: 311
## Columns: 15
## $ married_id                 <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ marital_status_id          <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ gender_id                  <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ emp_status_id              <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ dept_id                    <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ perf_score_id              <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ from_diversity_job_fair_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ termd                      <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ position_id                <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ emp_satisfaction           <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ special_projects_count     <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ days_late_last30           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ absences                   <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…
## $ salary                     <int> 62506, 104437, 64955, 64991, 50825, 57568, …
## $ status                     <chr> "no_married", "married", "married", "marrie…

Correlation matrix

MRLM - Linear regression model

Generating the regression model

Just a regression

lm_ih <- lm(salary ~ ., data = dataset)
lm_ih
## 
## Call:
## lm(formula = salary ~ ., data = dataset)
## 
## Coefficients:
##                (Intercept)                  married_id  
##                    95093.0                       443.8  
##          marital_status_id                   gender_id  
##                    -1133.9                      -351.1  
##              emp_status_id                     dept_id  
##                    -1607.4                     -8722.4  
##              perf_score_id  from_diversity_job_fair_id  
##                     6392.3                      3462.4  
##                      termd                 position_id  
##                     3924.0                      -545.7  
##           emp_satisfaction      special_projects_count  
##                      432.2                      2625.3  
##           days_late_last30                    absences  
##                     1663.1                       324.8  
##                     status  
##                         NA
# lm_ih <- lm(log10_salary ~ ., data = dataset1 %>% dplyr::select(-salary))
# lm_ih
summary(lm_ih)
## 
## Call:
## lm(formula = salary ~ ., data = dataset)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -56336  -9497  -2108   5279 159658 
## 
## Coefficients: (1 not defined because of singularities)
##                            Estimate Std. Error t value     Pr(>|t|)    
## (Intercept)                 95093.0    17138.2   5.549 0.0000000638 ***
## married_id                    443.8     2520.6   0.176       0.8604    
## marital_status_id           -1133.9     1297.3  -0.874       0.3828    
## gender_id                    -351.1     2433.9  -0.144       0.8854    
## emp_status_id               -1607.4     2274.8  -0.707       0.4804    
## dept_id                     -8722.4     1932.4  -4.514 0.0000091892 ***
## perf_score_id                6392.3     3117.4   2.051       0.0412 *  
## from_diversity_job_fair_id   3462.4     4256.8   0.813       0.4167    
## termd                        3924.0     8436.1   0.465       0.6422    
## position_id                  -545.7      219.7  -2.483       0.0136 *  
## emp_satisfaction              432.2     1394.5   0.310       0.7568    
## special_projects_count       2625.3      796.4   3.296       0.0011 ** 
## days_late_last30             1663.1     1418.7   1.172       0.2420    
## absences                      324.8      207.6   1.564       0.1188    
## status                           NA         NA      NA           NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20970 on 297 degrees of freedom
## Multiple R-squared:  0.3345, Adjusted R-squared:  0.3053 
## F-statistic: 11.48 on 13 and 297 DF,  p-value: < 0.00000000000000022

Identifying the variable that is strongly correlated with status

Nr.Crt.VariableValue
1(Intercept)0
2married_id1
3marital_status_id0
4gender_id0
5emp_status_id0
6dept_id0
7perf_score_id0
8from_diversity_job_fair_id0
9termd0
10position_id0
11emp_satisfaction0
12special_projects_count0
13days_late_last300
14absences0

The importance of the variables

Check assumptions of linear regression

library(performance)
par(mfrow = c(2, 2))
plot(lm_ih, pch = 16, col = '#006EA1')

par(mfrow = c(1, 1))

The estimated values of the regression coefficients

Nr.Crt.termestimatestd.errorstatisticp.valuesignif
1(Intercept)9.51e+041.71e+045.55 6.38e-08s
2married_id444       2.52e+030.1760.86ns
3marital_status_id-1.13e+031.3e+03 -0.8740.383ns
4gender_id-351       2.43e+03-0.1440.885ns
5emp_status_id-1.61e+032.27e+03-0.7070.48ns
6dept_id-8.72e+031.93e+03-4.51 9.19e-06s
7perf_score_id6.39e+033.12e+032.05 0.0412s
8from_diversity_job_fair_id3.46e+034.26e+030.8130.417ns
9termd3.92e+038.44e+030.4650.642ns
10position_id-546       220       -2.48 0.0136s
11emp_satisfaction432       1.39e+030.31 0.757ns
12special_projects_count2.63e+03796       3.3  0.0011s
13days_late_last301.66e+031.42e+031.17 0.242ns
14absences325       208       1.56 0.119ns
15status                  

The validation metrics of the model

Nr.Crt.varvalues
1r.squared 0.334
2adj.r.squared 0.305
3sigma         2.1e+04
4statistic         11.5
5p.value   4.94e-20
6df         13
7logLik         -3.53e+03
8AIC         7.09e+03
9BIC         7.14e+03
10deviance         1.31e+11
11df.residual297
12nobs         311

Cooks distance, predicted values, residuals and influntial points

## OK: No outliers detected.
## - Based on the following method and threshold: cook (1).
## - For variable: (Whole model)

VIF values obtained for the variables include in the model

TermVIFVIF_CI_lowVIF_CI_highSE_factorToleranceTolerance_CI_lowTolerance_CI_high
married_id1.081.021.371.040.928 0.73  0.984
marital_status_id1.061.011.451.030.947 0.689 0.993
gender_id1.031   2.281.010.971 0.439 0.999
emp_status_id11.7 9.6314.4 3.430.08510.06950.104
dept_id2.442.082.911.560.41  0.344 0.48 
perf_score_id2.362.022.811.540.423 0.355 0.495
from_diversity_job_fair_id1.081.021.361.040.923 0.735 0.981
termd11.2 9.1913.7 3.350.08920.07290.109
position_id1.321.191.551.150.758 0.647 0.843
emp_satisfaction1.131.051.361.060.882 0.734 0.953
special_projects_count2.472.112.951.570.405 0.339 0.474
days_late_last302.382.042.831.540.42  0.353 0.491
absences1.041   1.651.020.96  0.608 0.997

We observe that status variable is NA meaning that between this variable and another variable there is prefect collinearity, suggesting that one or more variables are exact linear combinations of others. As a result, the design matrix becomes singular and VIF cannot be computed.

The variable that is perfect correlated with status is: married_id

# vif(lm_ih)

So we are going to eliminate the status variable and termd variable (highly correlated with emp_status_id; the value of the correlation coefficient is 0.91) from the dataset and rebuild the model.

Generating a new regression model without including the status and termd variables

Estimate regression model

lm_ih <- lm(salary ~ ., data = dataset %>% dplyr::select(-status, -termd))
lm_ih
## 
## Call:
## lm(formula = salary ~ ., data = dataset %>% dplyr::select(-status, 
##     -termd))
## 
## Coefficients:
##                (Intercept)                  married_id  
##                    93194.6                       413.6  
##          marital_status_id                   gender_id  
##                    -1132.3                      -324.3  
##              emp_status_id                     dept_id  
##                     -603.2                     -8628.1  
##              perf_score_id  from_diversity_job_fair_id  
##                     6594.0                      3185.4  
##                position_id            emp_satisfaction  
##                     -560.5                       406.1  
##     special_projects_count            days_late_last30  
##                     2664.4                      1815.1  
##                   absences  
##                      327.0
summary(lm_ih)
## 
## Call:
## lm(formula = salary ~ ., data = dataset %>% dplyr::select(-status, 
##     -termd))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -56169  -9542  -1754   5019 160081 
## 
## Coefficients:
##                            Estimate Std. Error t value     Pr(>|t|)    
## (Intercept)                 93194.6    16623.2   5.606 0.0000000471 ***
## married_id                    413.6     2516.4   0.164     0.869563    
## marital_status_id           -1132.3     1295.6  -0.874     0.382829    
## gender_id                    -324.3     2430.0  -0.133     0.893919    
## emp_status_id                -603.2      715.8  -0.843     0.400069    
## dept_id                     -8628.1     1919.2  -4.496 0.0000099409 ***
## perf_score_id                6594.0     3083.0   2.139     0.033265 *  
## from_diversity_job_fair_id   3185.4     4209.4   0.757     0.449816    
## position_id                  -560.5      217.1  -2.581     0.010318 *  
## emp_satisfaction              406.1     1391.6   0.292     0.770612    
## special_projects_count       2664.4      790.9   3.369     0.000855 ***
## days_late_last30             1815.1     1378.8   1.316     0.189032    
## absences                      327.0      207.3   1.577     0.115772    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20940 on 298 degrees of freedom
## Multiple R-squared:  0.334,  Adjusted R-squared:  0.3072 
## F-statistic: 12.45 on 12 and 298 DF,  p-value: < 0.00000000000000022

The importance of the variables

The estimated values of the regression coefficients

Nr.Crt.termestimatestd.errorstatisticp.valuesignif
1(Intercept)9.32e+041.66e+045.61 4.71e-08s
2married_id414       2.52e+030.1640.87ns
3marital_status_id-1.13e+031.3e+03 -0.8740.383ns
4gender_id-324       2.43e+03-0.1330.894ns
5emp_status_id-603       716       -0.8430.4ns
6dept_id-8.63e+031.92e+03-4.5  9.94e-06s
7perf_score_id6.59e+033.08e+032.14 0.0333s
8from_diversity_job_fair_id3.19e+034.21e+030.7570.45ns
9position_id-560       217       -2.58 0.0103s
10emp_satisfaction406       1.39e+030.2920.771ns
11special_projects_count2.66e+03791       3.37 0.000855s
12days_late_last301.82e+031.38e+031.32 0.189ns
13absences327       207       1.58 0.116ns

The validation metrics of the model

Nr.Crt.varvalues
1r.squared 0.334
2adj.r.squared 0.307
3sigma         2.09e+04
4statistic         12.5
5p.value   1.5e-20
6df         12
7logLik         -3.53e+03
8AIC         7.09e+03
9BIC         7.14e+03
10deviance         1.31e+11
11df.residual298
12nobs         311

VIF values obtained for the variables include in the model

TermVIFVIF_CI_lowVIF_CI_highSE_factorToleranceTolerance_CI_lowTolerance_CI_high
married_id1.081.021.381.040.9290.7270.985
marital_status_id1.061.011.461.030.9470.6860.993
gender_id1.031   2.381.010.9720.42 0.999
emp_status_id1.171.071.391.080.8570.7210.933
dept_id2.412.062.881.550.4150.3470.485
perf_score_id2.321.982.761.520.4320.3620.504
from_diversity_job_fair_id1.061.011.421.030.9410.7050.991
position_id1.291.161.521.140.7750.6590.859
emp_satisfaction1.131.051.361.060.8830.7340.954
special_projects_count2.442.082.921.560.41 0.3430.48 
days_late_last302.251.932.681.5 0.4440.3730.518
absences1.041   1.671.020.9610.5980.998

We will rebuild the model including only the significant variables.

## 
## Call:
## lm(formula = salary ~ ., data = dataset %>% dplyr::select(-status, 
##     -termd))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -56169  -9542  -1754   5019 160081 
## 
## Coefficients:
##                            Estimate Std. Error t value     Pr(>|t|)    
## (Intercept)                 93194.6    16623.2   5.606 0.0000000471 ***
## married_id                    413.6     2516.4   0.164     0.869563    
## marital_status_id           -1132.3     1295.6  -0.874     0.382829    
## gender_id                    -324.3     2430.0  -0.133     0.893919    
## emp_status_id                -603.2      715.8  -0.843     0.400069    
## dept_id                     -8628.1     1919.2  -4.496 0.0000099409 ***
## perf_score_id                6594.0     3083.0   2.139     0.033265 *  
## from_diversity_job_fair_id   3185.4     4209.4   0.757     0.449816    
## position_id                  -560.5      217.1  -2.581     0.010318 *  
## emp_satisfaction              406.1     1391.6   0.292     0.770612    
## special_projects_count       2664.4      790.9   3.369     0.000855 ***
## days_late_last30             1815.1     1378.8   1.316     0.189032    
## absences                      327.0      207.3   1.577     0.115772    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20940 on 298 degrees of freedom
## Multiple R-squared:  0.334,  Adjusted R-squared:  0.3072 
## F-statistic: 12.45 on 12 and 298 DF,  p-value: < 0.00000000000000022
##                 married_id          marital_status_id 
##                   1.076813                   1.055875 
##                  gender_id              emp_status_id 
##                   1.028911                   1.166447 
##                    dept_id              perf_score_id 
##                   2.412074                   2.316162 
## from_diversity_job_fair_id                position_id 
##                   1.062681                   1.290889 
##           emp_satisfaction     special_projects_count 
##                   1.131887                   2.441384 
##           days_late_last30                   absences 
##                   2.252283                   1.040777

Generating a new regression model using only the significant variables

Variables that has no impact on salary are:

termestimatestd.errorstatisticp.value
married_id414       2.52e+030.1640.87 
marital_status_id-1.13e+031.3e+03 -0.8740.383
gender_id-324       2.43e+03-0.1330.894
emp_status_id-603       716       -0.8430.4  
from_diversity_job_fair_id3.19e+034.21e+030.7570.45 
emp_satisfaction406       1.39e+030.2920.771
days_late_last301.82e+031.38e+031.32 0.189
absences327       207       1.58 0.116
## [1] "married_id, marital_status_id, gender_id, emp_status_id, from_diversity_job_fair_id, emp_satisfaction, days_late_last30, absences"
lm_ih <- lm(salary ~ ., data = dataset %>% dplyr::select(-c(status, termd, married_id, marital_status_id, gender_id, emp_status_id, from_diversity_job_fair_id, emp_satisfaction, days_late_last30, absences)))
lm_ih
## 
## Call:
## lm(formula = salary ~ ., data = dataset %>% dplyr::select(-c(status, 
##     termd, married_id, marital_status_id, gender_id, emp_status_id, 
##     from_diversity_job_fair_id, emp_satisfaction, days_late_last30, 
##     absences)))
## 
## Coefficients:
##            (Intercept)                 dept_id           perf_score_id  
##               105263.2                 -8607.9                  4063.2  
##            position_id  special_projects_count  
##                 -620.5                  2682.4
# lm_ih <- lm(log10_salary ~ ., data = dataset1 %>% dplyr::select(-c(status, termd, married_id, marital_status_id, gender_id, emp_status_id, from_diversity_job_fair_id, emp_satisfaction, days_late_last30, absences, salary)))
# lm_ih
summary(lm_ih)
## 
## Call:
## lm(formula = salary ~ ., data = dataset %>% dplyr::select(-c(status, 
##     termd, married_id, marital_status_id, gender_id, emp_status_id, 
##     from_diversity_job_fair_id, emp_satisfaction, days_late_last30, 
##     absences)))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -48859  -9725  -2244   4500 159690 
## 
## Coefficients:
##                        Estimate Std. Error t value          Pr(>|t|)    
## (Intercept)            105263.2    13558.0   7.764 0.000000000000125 ***
## dept_id                 -8607.9     1895.7  -4.541 0.000008070682410 ***
## perf_score_id            4063.2     2028.2   2.003          0.046014 *  
## position_id              -620.5      212.1  -2.925          0.003700 ** 
## special_projects_count   2682.4      770.8   3.480          0.000575 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20900 on 306 degrees of freedom
## Multiple R-squared:  0.3188, Adjusted R-squared:  0.3099 
## F-statistic: 35.81 on 4 and 306 DF,  p-value: < 0.00000000000000022

The estimated values of the regression coefficients

Nr.Crt.termestimatestd.errorstatisticp.valuesignif
1(Intercept)1.05e+051.36e+047.761.25e-13s
2dept_id-8.61e+031.9e+03 -4.548.07e-06s
3perf_score_id4.06e+032.03e+032   0.046s
4position_id-620       212       -2.930.0037s
5special_projects_count2.68e+03771       3.480.000575s

The validation metrics of the model

Nr.Crt.varvalues
1r.squared 0.319
2adj.r.squared 0.31
3sigma         2.09e+04
4statistic         35.8
5p.value   1.53e-24
6df         4
7logLik         -3.53e+03
8AIC         7.08e+03
9BIC         7.1e+03
10deviance         1.34e+11
11df.residual306
12nobs         311

The importance of the variables

VIF values obtained for the variables include in the model

TermVIFVIF_CI_lowVIF_CI_highSE_factorToleranceTolerance_CI_lowTolerance_CI_high
dept_id2.362.012.83    1.540.4230.353   0.497
perf_score_id1.011   1.96e+051   0.9945.09e-061    
position_id1.241.121.47    1.110.8080.682   0.892
special_projects_count2.331.992.79    1.530.43 0.358   0.504

As it can be seen from the checking assumption of the model the multicollinearity effect was eliminated. The metric of the model are the same, 33% of the variation of the dependent variable, salary, is explained by the model. We can see that the model is significant (p-value < 0.05), but not all the included predictors have an significant impact on the salary.

From the checking assumptions of the model we can observe that:

  • The model does not capture the actual shape of the salary distribution, suggesting potential skewness or the presence of outliers that are not well handled by linear regression.

  • The relationship between the predictors and salary does not appear to be fully linear, indicating possible model misspecification.

  • There is evidence of heteroscedasticity, as the variance of the residuals is not constant across fitted values.

  • Several observations exhibit relatively high leverage (e.g., 132, 27, 151), indicating the presence of influential data points that may affect the estimated regression coefficients.

  • Additionally, high multicollinearity is detected among some predictors, which can lead to unstable coefficient estimates and inflated standard errors. After eliminating the variables with high VIF values the model was rebuild.

  • The residuals are not perfectly normally distributed, likely due to positive skewness in salary values. Although this is less critical in large samples, it may still affect statistical inference, including p-values and confidence intervals.

To improve the model, the following steps we should follow:

  • Using robust regression techniques;

  • Implementing regularized models such as Ridge or Lasso regression.

Using more robust regression models

Because some of the assumption of the regression model are not checked (like homoscedasticity, the influence of the outliers, deviation from the normality) we will implement more robust regression model.

Because the homoscedasticity assumption is violated

The standard errors become robust and the regression coefficients are not changing

## 
## Call:
## lm(formula = salary ~ ., data = dataset)
## 
## Coefficients:
##                (Intercept)                  married_id  
##                    95093.0                       443.8  
##          marital_status_id                   gender_id  
##                    -1133.9                      -351.1  
##              emp_status_id                     dept_id  
##                    -1607.4                     -8722.4  
##              perf_score_id  from_diversity_job_fair_id  
##                     6392.3                      3462.4  
##                      termd                 position_id  
##                     3924.0                      -545.7  
##           emp_satisfaction      special_projects_count  
##                      432.2                      2625.3  
##           days_late_last30                    absences  
##                     1663.1                       324.8  
##                     status  
##                         NA
# lm_ih <- lm(salary ~ ., data = dataset %>% dplyr::select(-c(status, termd, married_id, marital_status_id, gender_id, emp_status_id, from_diversity_job_fair_id, emp_satisfaction, days_late_last30, absences)))
# lm_ih
# lm_ih <- lm(log10_salary ~ ., data = dataset1 %>% dplyr::select(-c(status, termd, married_id, marital_status_id, gender_id, emp_status_id, from_diversity_job_fair_id, emp_satisfaction, days_late_last30, absences, salary)))
# lm_ih
library(sandwich)
library(lmtest)

coeftest(lm_ih, vcov = vcovHC(lm_ih, type = "HC1"))
## 
## t test of coefficients:
## 
##                            Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                95092.95   38575.24  2.4651  0.01426 *
## married_id                   443.76    2561.24  0.1733  0.86256  
## marital_status_id          -1133.88     874.65 -1.2964  0.19585  
## gender_id                   -351.10    2553.90 -0.1375  0.89075  
## emp_status_id              -1607.41    1759.77 -0.9134  0.36176  
## dept_id                    -8722.36    5270.46 -1.6550  0.09899 .
## perf_score_id               6392.34    3518.32  1.8169  0.07024 .
## from_diversity_job_fair_id  3462.36    4834.52  0.7162  0.47445  
## termd                       3923.97    5617.38  0.6985  0.48539  
## position_id                 -545.66     327.55 -1.6659  0.09679 .
## emp_satisfaction             432.19    1050.52  0.4114  0.68107  
## special_projects_count      2625.25    1790.43  1.4663  0.14363  
## days_late_last30            1663.11    1366.13  1.2174  0.22442  
## absences                     324.78     185.97  1.7464  0.08177 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

for the outliers and influential observations we will use more robust regression models

for the multicollinearity we will use Ridge, Lasso and ElasticNet regression models

Robust regression model from MASS package

Regression model estimation

library(MASS)
lm_ih <- MASS::rlm(salary ~ ., data = dataset %>% dplyr::select(-status, -termd))
summary(lm_ih)
## 
## Call: rlm(formula = salary ~ ., data = dataset %>% dplyr::select(-status, 
##     -termd))
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -44276.7  -7565.1    438.9   6640.3 173976.3 
## 
## Coefficients:
##                            Value      Std. Error t value   
## (Intercept)                82230.1036  9375.5766     8.7707
## married_id                  -900.1458  1419.2794    -0.6342
## marital_status_id           -326.5630   730.7229    -0.4469
## gender_id                    146.2642  1370.5508     0.1067
## emp_status_id                 50.3524   403.7250     0.1247
## dept_id                    -5264.4981  1082.4401    -4.8635
## perf_score_id               2700.8751  1738.8486     1.5533
## from_diversity_job_fair_id   103.1237  2374.1452     0.0434
## position_id                 -298.1443   122.4571    -2.4347
## emp_satisfaction             269.4655   784.8575     0.3433
## special_projects_count      3504.2525   446.0931     7.8554
## days_late_last30             651.2826   777.6269     0.8375
## absences                     135.8248   116.9227     1.1617
## 
## Residual standard error: 10310 on 298 degrees of freedom
coeftest(lm_ih, vcov = vcovHC(lm_ih, type = "HC1"))
## 
## z test of coefficients:
## 
##                             Estimate Std. Error z value    Pr(>|z|)    
## (Intercept)                82230.104  17405.612  4.7243 0.000002309 ***
## married_id                  -900.146   1477.479 -0.6092     0.54236    
## marital_status_id           -326.563    647.397 -0.5044     0.61396    
## gender_id                    146.264   1418.480  0.1031     0.91787    
## emp_status_id                 50.352    438.760  0.1148     0.90863    
## dept_id                    -5264.498   2562.226 -2.0547     0.03991 *  
## perf_score_id               2700.875   1769.581  1.5263     0.12694    
## from_diversity_job_fair_id   103.124   3283.623  0.0314     0.97495    
## position_id                 -298.144    184.308 -1.6176     0.10574    
## emp_satisfaction             269.466    744.174  0.3621     0.71728    
## special_projects_count      3504.252    880.485  3.9799 0.000068941 ***
## days_late_last30             651.283    672.159  0.9689     0.33257    
## absences                     135.825    115.144  1.1796     0.23816    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The importance of the variables

The estimated values of the regression coefficients

Crt.termestimatestd.errorstatisticDecision
1(Intercept)8.22e+049.38e+038.77s
2married_id-900       1.42e+03-0.634ns
3marital_status_id-327       731       -0.447ns
4gender_id146       1.37e+030.107ns
5emp_status_id50.4     404       0.125ns
6dept_id-5.26e+031.08e+03-4.86s
7perf_score_id2.7e+03 1.74e+031.55ns
8from_diversity_job_fair_id103       2.37e+030.0434ns
9position_id-298       122       -2.43s
10emp_satisfaction269       785       0.343ns
11special_projects_count3.5e+03 446       7.86s
12days_late_last30651       778       0.838ns
13absences136       117       1.16ns

The validation metrics of the model

Crt.varvalues
1sigma1.03e+04
2converged1
3logLik-3.54e+03
4AIC7.1e+03
5BIC7.15e+03
6deviance1.36e+11
7nobs311

VIF values obtained for the variables include in the model

TermVIFVIF_CI_lowVIF_CI_highSE_factorToleranceTolerance_CI_lowTolerance_CI_high
married_id1.081.021.381.040.9290.7270.985
marital_status_id1.061.011.461.030.9470.6860.993
gender_id1.031   2.381.010.9720.42 0.999
emp_status_id1.171.071.391.080.8570.7210.933
dept_id2.412.062.881.550.4150.3470.485
perf_score_id2.321.982.761.520.4320.3620.504
from_diversity_job_fair_id1.061.011.421.030.9410.7050.991
position_id1.291.161.521.140.7750.6590.859
emp_satisfaction1.131.051.361.060.8830.7340.954
special_projects_count2.442.082.921.560.41 0.3430.48 
days_late_last302.251.932.681.5 0.4440.3730.518
absences1.041   1.671.020.9610.5980.998

Robust regression model from robustbase package

For modeling salary as the dependent variable all columns in dataset except status are included as predictors.

The robust regression technique (MM) is used for modeling, reducing the influence of outliers on coefficients and combines high breakdown point with high efficiency.

Regression model estimation

library(robustbase)
lm_ih <- lmrob(salary ~ ., data = dataset %>% dplyr::select(-status, -termd))
library(robustbase)
summary(lm_ih)
## 
## Call:
## lmrob(formula = salary ~ ., data = dataset %>% dplyr::select(-status, -termd))
##  \--> method = "MM"
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -61592.3  -6537.9    479.6   6375.1 218886.4 
## 
## Coefficients:
##                            Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)                 1993.00   15712.18   0.127                0.899    
## married_id                    30.57    1219.13   0.025                0.980    
## marital_status_id            -95.04     577.71  -0.165                0.869    
## gender_id                   -116.98    1291.51  -0.091                0.928    
## emp_status_id                418.79     365.10   1.147                0.252    
## dept_id                     9119.66    2311.09   3.946            0.0000992 ***
## perf_score_id               2523.75    1679.81   1.502                0.134    
## from_diversity_job_fair_id   129.86    2974.33   0.044                0.965    
## position_id                  191.53     149.16   1.284                0.200    
## emp_satisfaction            -419.97     685.17  -0.613                0.540    
## special_projects_count      8067.66     819.51   9.845 < 0.0000000000000002 ***
## days_late_last30             265.20     557.27   0.476                0.635    
## absences                     115.12     104.15   1.105                0.270    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Robust residual standard error: 10380 
## Multiple R-squared:  0.5665, Adjusted R-squared:  0.5491 
## Convergence in 31 IRWLS iterations
## 
## Robustness weights: 
##  15 observations c(30,56,77,97,132,133,151,191,228,241,245,256,260,269,309)
##   are outliers with |weight| = 0 ( < 0.00032); 
##  28 weights are ~= 1. The remaining 268 ones are summarized as
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01717 0.87990 0.96250 0.89740 0.98750 0.99900 
## Algorithmic parameters: 
##        tuning.chi                bb        tuning.psi        refine.tol 
##  1.54764000000000  0.50000000000000  4.68506100000000  0.00000010000000 
##           rel.tol         scale.tol         solve.tol          zero.tol 
##  0.00000010000000  0.00000000010000  0.00000010000000  0.00000000010000 
##       eps.outlier             eps.x warn.limit.reject warn.limit.meanrw 
##  0.00032154340836  0.00000000005457  0.50000000000000  0.50000000000000 
##      nResample         max.it       best.r.s       k.fast.s          k.max 
##            500             50              2              1            200 
##    maxit.scale      trace.lev            mts     compute.rd fast.s.large.n 
##            200              0           1000              0           2000 
##                   psi           subsampling                   cov 
##            "bisquare"         "nonsingular"         ".vcov.avar1" 
## compute.outlier.stats 
##                  "SM" 
## seed : int(0)

It can be observe that dept_id and special_projects_count are the strongest predictors of salary.

Most of the demographic variables (married, gender, etc.) are not significant.

The variable special projects (special_projects_count) have the largest positive effect on salary (~ 8078.91 increase per project).

The observations with weight = 0 are completely ignored in coefficient estimation, which means that the model will not be distorted by extreme salaries or unusual employees.

The importance of the variables

qr(model.matrix(lm_ih))$rank
## [1] 13
ncol(model.matrix(lm_ih))
## [1] 13
## Model :
## salary ~ married_id + marital_status_id + gender_id + emp_status_id + 
##     dept_id + perf_score_id + from_diversity_job_fair_id + position_id + 
##     emp_satisfaction + special_projects_count + days_late_last30 + 
##     absences

The estimated values of the regression coefficients

Nr.Crt.termestimatestd.errorstatisticp.valuesignif
1(Intercept)1.99e+031.57e+040.127 0.899ns
2married_id30.6     1.22e+030.02510.98ns
3marital_status_id-95       578       -0.165 0.869ns
4gender_id-117       1.29e+03-0.09060.928ns
5emp_status_id419       365       1.15  0.252ns
6dept_id9.12e+032.31e+033.95  9.92e-05s
7perf_score_id2.52e+031.68e+031.5   0.134ns
8from_diversity_job_fair_id130       2.97e+030.04370.965ns
9position_id192       149       1.28  0.2ns
10emp_satisfaction-420       685       -0.613 0.54ns
11special_projects_count8.07e+03820       9.84  5.55e-20s
12days_late_last30265       557       0.476 0.635ns
13absences115       104       1.11  0.27ns

The validation metrics of the model

Nr.Crt.varvalues
1r.squared 0.567
2sigma         1.04e+04
3df.residual298

The fitted, and residula values

In this case the model explains 58% of the variation in salary (adjusted R² is 0.56), which is pretty good.

VIF values obtained for the variables include in the model

TermVIFVIF_CI_lowVIF_CI_highSE_factorToleranceTolerance_CI_lowTolerance_CI_high
married_id1.071.011.391.030.9350.7180.988
marital_status_id1.3 1.171.531.140.7670.6530.852
gender_id1.211.111.431.1 0.8240.6980.904
emp_status_id1.321.191.551.150.7580.6450.843
dept_id5.124.256.232.260.1950.1610.235
perf_score_id3.8 3.184.591.950.2630.2180.314
from_diversity_job_fair_id1.251.131.471.120.8  0.68 0.883
position_id3.332.814.021.830.3  0.2490.356
emp_satisfaction1.481.311.741.220.6770.5760.764
special_projects_count2.342   2.791.530.4270.3580.499
days_late_last303.172.673.811.780.3160.2620.374
absences1.371.231.611.170.7270.62 0.814

Both methods are robust alternatives to OLS, but MM (lmrob) tends to be more robust with high breakdown points, while M (rlm) focuses on efficient estimation for moderately contaminated data.

In both models special_projects_count is consistently highly significant.

dept_id variable shows large effects, but rlm and lmrob can differ in sign/magnitude depending on outlier treatment.

In both cases most demographic variables are not significant.

Both models account for extreme salaries, but differently (lmrob ignores extreme outliers, rlm downweights them).

Ridge and Lasso Regularization

Overall, Lasso and Ridge regression are used when we have multicollinearity problems, to many predictors and we want a stable predictive model.

library(glmnet)
## Loaded glmnet 4.1-8
x <- model.matrix(salary ~ ., dataset)[,-1]
y <- dataset$salary
Ridge Regression (alpha = 0)

Regression model estimation

ridge_model <- cv.glmnet(x, y, alpha = 0)
ridge_model
## 
## Call:  cv.glmnet(x = x, y = y, alpha = 0) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index   Measure        SE Nonzero
## min   9885    78 476230896 143918814      14
## 1se 859774    30 618809836 175693563      14

## 15 x 1 sparse Matrix of class "dgCMatrix"
##                                    s1
## (Intercept)                89160.9564
## married_id                   245.5060
## marital_status_id           -961.9268
## gender_id                    340.6729
## emp_status_id               -496.3839
## dept_id                    -6305.7042
## perf_score_id               3347.7496
## from_diversity_job_fair_id  2518.0345
## termd                       -331.6805
## position_id                 -367.1815
## emp_satisfaction             526.8653
## special_projects_count      2414.1555
## days_late_last30             418.6630
## absences                     250.1753
## status                       244.5240

Nr.Crt.MatricValue
1lambda.min9.89e+03
2lambda.1se8.6e+05
3nobs    311

Reduces de value of the regression coefficients to zero, not eliminate them

Interpretation:

  • special_projects_count (2501.91) → has strongest positive effect on salary, which means that more special projects increases salary.

  • days_late_last30 (571.18), emp_satisfaction (523.16), absences (266.46) → have a moderate positive effect on salary. Very interesting is that satisfaction and days late are positive here (possibly because Ridge shrinks coefficients but keeps them in model).

  • position_id (-397.95) → has a negative impact on salary, meaning higher position ID reduces salary after controlling for others.

  • status (233.94) → has a small positive effect on salary.

Ridge regression shrinks coefficients, reduces variance due to multicollinearity, but does not perform variable selection.

Compare to lmrob: special_projects_count remains the strongest predictor across models.

Evaluating the performance of the Ridge model

## [1] 430105398

The cross-validation curve reflects the bias–variance tradeoff inherent in regularized models. At very small λ values, the model approaches OLS and exhibits signs of mild overfitting. As λ increases, prediction error decreases due to reduced variance. However, excessive regularization leads to underfitting, as indicated by the increasing error for large λ values. The optimal λ balances bias and variance, ensuring better generalization performance.

Lasso Regression (alpha = 1)

Regression model estimation

## 
## Call:  cv.glmnet(x = x, y = y, alpha = 1) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index   Measure        SE Nonzero
## min    783    31 468106080 145969732       8
## 1se  11633     2 613398133 164322653       1

## 15 x 1 sparse Matrix of class "dgCMatrix"
##                                    s1
## (Intercept)                99689.4344
## married_id                     .     
## marital_status_id           -485.7549
## gender_id                      .     
## emp_status_id               -145.4528
## dept_id                    -7593.2077
## perf_score_id               2738.4442
## from_diversity_job_fair_id   309.5148
## termd                          .     
## position_id                 -437.4469
## emp_satisfaction               .     
## special_projects_count      2710.0045
## days_late_last30               .     
## absences                     220.9998
## status                         .

Interpretation:

  • special_projects_count (2740.63) → is the strongest positive predictor, having a the strongest effect on salary. Each additional special project increases salary by ~2,740 units.

  • dept_id (-7249.22) → has a large effect depending on department coding: some departments pay much less/more than the baseline/mean.

  • perf_score_id (2371.70) → the performance scores positively affect salary.

  • absences (179.11) → has a minor positive effect on salary (interesting, could reflect correlation with other variables).

  • position_id (-395.62) → has a small negative effect on salary.

  • marital_status_id (-235.71) → has a small negative effect on salary. It appears that marital status slightly reduces salary.

MetricValue
lambda.min783       
lambda.1se1.16e+04
nobs311       

Can eliminate variables by forcing the value of regression coefficient to be zero

Performance of the Lasso model

## [1] 426931353

Lasso shrinks unimportant variables to zero, performing a variable selection while controlling multicollinearity and overfitting. So the final model focuses only on variables that actually explain variation in salary. In this case the important predictors are: dept_id, perf_score_id, special_projects_count, absences, marital_status_id, position_id, while married_id, gender_id, from_diversity_job_fair_id, emp_satisfaction, days_late_last30, status are dropped predictors.

Comparing the performance of the Ridge and Lasso models

Nr.Crt.MSEModellambda.minlambda.1senobs
14.3e+08Ridge9.89e+038.6e+05311
24.27e+08Lasso783       1.16e+04311

Robust regression models address violations of the classical assumptions of regression, such as heteroscedasticity, non-normality of errors, multicollinearity, and the presence of influential observations. Ridge regression stabilizes the coefficients in the presence of multicollinearity, while Lasso regression performs variable selection and simultaneously controls model complexity.

The Lasso model performs variable selection by shrinking some coefficients exactly to zero. At λ = lambda.min, the model retains eight predictors, achieving the lowest cross-validated error. However, λ = lambda.1se produces a more parsimonious model with only two predictors and comparable predictive performance, suggesting improved generalization and stability.

Elastic Net regression (alpha = 0.5)

Elastic Net regression was implemented to simultaneously address multicollinearity and perform variable selection. By combining L1 and L2 regularization penalties, the method provides more stable coefficient estimates compared to Lasso, particularly in the presence of correlated predictors.

Regression model estimation

## 
## Call:  cv.glmnet(x = x, y = y, alpha = 0.5) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index   Measure        SE Nonzero
## min   1720    30 465955650 107134394       8
## 1se  17600     5 563913327 125347337       2

## 15 x 1 sparse Matrix of class "dgCMatrix"
##                                     s1
## (Intercept)                98022.18688
## married_id                     .      
## marital_status_id           -401.36371
## gender_id                      .      
## emp_status_id               -120.56475
## dept_id                    -7240.73932
## perf_score_id               2572.98929
## from_diversity_job_fair_id    17.97873
## termd                          .      
## position_id                 -404.19215
## emp_satisfaction               .      
## special_projects_count      2706.17379
## days_late_last30               .      
## absences                     203.25672
## status                         .
## [1] 428627377

Elastic Net combines L1 and L2 regularization, allowing for both shrinkage and grouped variable selection. It is particularly suitable in the presence of multicollinearity, as it stabilizes coefficient estimates while retaining correlated predictors.

Searching for the optimal hyperparameters

##  [1] 467815942 461436413 463404375 468896799 468776405 465053606 465292529
##  [8] 476683081 484718055 469561763 479238723
## [1] 0.1
## [1] 5926.088
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                                     s0
## (Intercept)                94567.12167
## married_id                     .      
## marital_status_id           -594.24170
## gender_id                      .      
## emp_status_id               -299.64596
## dept_id                    -6586.05190
## perf_score_id               2626.97295
## from_diversity_job_fair_id   972.35006
## termd                          .      
## position_id                 -362.64193
## emp_satisfaction              14.27022
## special_projects_count      2545.52998
## days_late_last30               .      
## absences                     213.02951
## status                         .
## [1] 430778136

The optimal Elastic Net model was selected by performing cross-validation over a grid of alpha values (0 to 1). The combination of alpha and lambda that minimized the cross-validated mean squared error was chosen, and the final model was re-estimated using these optimal hyperparameters.

Conclusions

The analysis began with the estimation of an OLS model to explain salary variation based on employees’ demographic and organizational characteristics. Diagnostic tests revealed several issues, including the presence of influential observations, potential heteroscedasticity, deviations from residual normality, and indications of multicollinearity among predictors.

To improve the robustness of the estimates, alternative methods were applied. Robust regression techniques (rlm, lmrob) reduced the influence of extreme observations by down-weighting large residuals, resulting in more stable coefficient estimates in the presence of outliers. However, these methods do not address multicollinearity.

Subsequently, regularized models (Ridge, Lasso, and Elastic Net) were implemented. Cross-validation results showed that introducing a penalty term improves predictive performance compared to OLS.

Ridge regression stabilized coefficient estimates in the presence of correlated predictors. Lasso performed automatic variable selection, leading to a more parsimonious model without substantial loss of predictive accuracy. Elastic Net provided an effective compromise between stability and sparsity, making it particularly suitable in the presence of multicollinearity.

The cross-validation curves clearly illustrated the bias–variance tradeoff, and the optimal selection of λ allowed for the construction of a balanced model with strong predictive performance and controlled complexity.

Bibliography

  1. Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.
  2. Kuhn et al., (2020). Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles. https://www.tidymodels.org
  3. Waring E, Quinn M, McNamara A, Arino de la Rubia E, Zhu H, Ellis S (2022). skimr: Compact and Flexible Summaries of Data. R package version 2.1.5, https://CRAN.R-project.org/package=skimr.
  4. Peterson BG, Carl P (2020). PerformanceAnalytics: Econometric Tools for Performance and Risk Analysis. R package version 2.0.4, https://CRAN.R-project.org/package=PerformanceAnalytics.
  5. Wickham H, Pedersen T, Seidel D (2025). scales: Scale Functions for Visualization. R package version 1.4.0, https://CRAN.R-project.org/package=scales.
  6. Fox J, Weisberg S (2019). An R Companion to Applied Regression, Third edition. Sage, Thousand Oaks CA. https://socialsciences.mcmaster.ca/jfox/Books/Companion/.
  7. Schloerke B, Cook D, Larmarange J, Briatte F, Marbach M, Thoen E, Elberg A, Crowley J (2021). GGally: Extension to ‘ggplot2’. R package version 2.1.2, https://CRAN.R-project.org/package=GGally.
  8. Taiyun Wei and Viliam Simko (2021). R package ‘corrplot’: Visualization of a Correlation Matrix (Version 0.92). Available from https://github.com/taiyun/corrplot
  9. Rubba C (2023). htmltab: Assemble Data Frames from HTML Tables. R package version 0.8.2.9000, https://github.com/htmltab/htmltab.
  10. Brandon M. Greenwell and Bradley C. Boehmke (2020). Variable Importance Plots—An Introduction to the vip Package. The R Journal, 12(1), 343–366. URL https://doi.org/10.32614/RJ-2020-013.
  11. Marvin N. Wright, Andreas Ziegler (2017). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, 77(1), 1-17. doi:10.18637/jss.v077.i01
  12. Robinson D, Hayes A, Couch S (2023). broom: Convert Statistical Objects into Tidy Tibbles. R package version 1.0.5, https://CRAN.R-project.org/package=broom. 13.H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
  13. Wickham H, Hester J, Bryan J (2023). readr: Read Rectangular Text Data. R package version 2.1.4, https://CRAN.R-project.org/package=readr.
  14. Zhu H (2021). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.3.4, https://CRAN.R-project.org/package=kableExtra.
  15. Cui B (2020). DataExplorer: Automate Data Exploration and Treatment. R package version 0.8.2, https://CRAN.R-project.org/package=DataExplorer.
  16. Rushworth A (2022). inspectdf: Inspection, Comparison and Visualisation of Data Frames. R package version 0.0.12, https://CRAN.R-project.org/package=inspectdf.
  17. Grosjean P, Ibanez F (2018). pastecs: Package for Analysis of Space-Time Ecological Series. R package version 1.3.21, https://CRAN.R-project.org/package=pastecs.
  18. https://www.kaggle.com/datasets/rhuebner/human-resources-data-set
  19. William Revelle (2023). psych: Procedures for Psychological, Psychometric, and Personality Research. Northwestern University, Evanston, Illinois. R package version 2.3.9, https://CRAN.R-project.org/package=psych.
  20. B. Venables, Modern Applied Statistics With S, 2002, Edition: 4thPublisher: Springer-Verlag, DOI: 10.1007/b97626.
  21. Friedman J, Tibshirani R, Hastie T (2010). “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software, 33(1), 1-22. doi:10.18637/jss.v033.i01 https://doi.org/10.18637/jss.v033.i01.
  22. Martin Maechler, Peter Rousseeuw, Christophe Croux, Valentin Todorov, Andreas Ruckstuhl, Matias Salibian-Barrera, Tobias Verbeke, Manuel Koller, c(“Eduardo”, “L. T.”) Conceicao and Maria Anna di Palma (2023). robustbase: Basic Robust Statistics R package version 0.99-1. URL http://CRAN.R-project.org/package=robustbase