Scoring model

Analyzed data set: HRDataset_v14 available here

Loading the necessary packages…

Project Objectives

  1. Exploratory data analysis (EDA), which was done in another post and is available here
  2. Identification and detailed analysis of the studied phenomenon through the development of scoring models
  3. Interpretation of results and presentation of conclusions

Dataset description

In order to determine the factors that significantly influence an individual’s salary within a company, we selected the dataset entitled Human Resources Data Set, published on the Kaggle platform: https://www.kaggle.com/datasets/rhuebner/human-resources-data-set

This dataset was created by Dr. Carla Patalano and Dr. Rich. It was designed as an educational resource to help students learn how to perform exploratory data analysis (EDA). The dataset provides a wide range of features that enable both data visualization and the development of machine learning / predictive analytics models.

Within this dataset, we decided to explore and attempt to answer several research questions, such as:

  1. Are there areas within the company where salary distribution is not equitable?

  2. Is an individual’s salary influenced by any specific factors present in the dataset?

  3. Can we build a scoring model capable of estimating an employee’s salary? If so, to what extent is the model accurate?

Loading the dataset

Dataset description

The structure of the dataset

glimpse(HRDataset_v14)
## Rows: 311
## Columns: 36
## $ Employee_Name              <chr> "Adinolfi, Wilson  K", "Ait Sidi, Karthikey…
## $ EmpID                      <dbl> 10026, 10084, 10196, 10088, 10069, 10002, 1…
## $ MarriedID                  <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ MaritalStatusID            <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ GenderID                   <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ EmpStatusID                <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ DeptID                     <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ PerfScoreID                <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ FromDiversityJobFairID     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ Salary                     <dbl> 62506, 104437, 64955, 64991, 50825, 57568, …
## $ Termd                      <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ PositionID                 <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ Position                   <chr> "Production Technician I", "Sr. DBA", "Prod…
## $ State                      <chr> "MA", "MA", "MA", "MA", "MA", "MA", "MA", "…
## $ Zip                        <chr> "01960", "02148", "01810", "01886", "02169"…
## $ DOB                        <chr> "07/10/83", "05/05/75", "09/19/88", "09/27/…
## $ Sex                        <chr> "M", "M", "F", "F", "F", "F", "F", "M", "F"…
## $ MaritalDesc                <chr> "Single", "Married", "Married", "Married", …
## $ CitizenDesc                <chr> "US Citizen", "US Citizen", "US Citizen", "…
## $ HispanicLatino             <chr> "No", "No", "No", "No", "No", "No", "No", "…
## $ RaceDesc                   <chr> "White", "White", "White", "White", "White"…
## $ DateofHire                 <chr> "7/5/2011", "3/30/2015", "7/5/2011", "1/7/2…
## $ DateofTermination          <chr> NA, "6/16/2016", "9/24/2012", NA, "9/6/2016…
## $ TermReason                 <chr> "N/A-StillEmployed", "career change", "hour…
## $ EmploymentStatus           <chr> "Active", "Voluntarily Terminated", "Volunt…
## $ Department                 <chr> "Production", "IT/IS", "Production", "Produ…
## $ ManagerName                <chr> "Michael Albert", "Simon Roup", "Kissy Sull…
## $ ManagerID                  <dbl> 22, 4, 20, 16, 39, 11, 10, 19, 12, 7, 14, 2…
## $ RecruitmentSource          <chr> "LinkedIn", "Indeed", "LinkedIn", "Indeed",…
## $ PerformanceScore           <chr> "Exceeds", "Fully Meets", "Fully Meets", "F…
## $ EngagementSurvey           <dbl> 4.60, 4.96, 3.02, 4.84, 5.00, 5.00, 3.04, 5…
## $ EmpSatisfaction            <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ SpecialProjectsCount       <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ LastPerformanceReview_Date <chr> "1/17/2019", "2/24/2016", "5/15/2012", "1/3…
## $ DaysLateLast30             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Absences                   <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…

The key findings from the EDA and their implications for further modeling

Based on the EDA, we saw that:

  • Salary exhibits non-normal behavior and strong right skewness.

  • Extreme values (executive-level salaries) may influence regression results.

  • Transformation techniques (e.g., log transformation of salary) may improve model performance.

  • Categorical variables such as department, position, and employment status are likely strong predictors of salary.

  • Gender alone may not fully explain salary variation without controlling for position and department.
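
As a small illustration of the log transformation mentioned in the list above, the sketch below applies log10() to the first six salaries visible in the glimpse() output. The object names (salary_sample, ratio_raw, ratio_log, back_transformed) are placeholders chosen for this example and are not objects from the original analysis.

```r
# First six Salary values shown in the glimpse() output (illustrative subset only)
salary_sample <- c(62506, 104437, 64955, 64991, 50825, 57568)

# log10 compresses large values more than small ones, which reduces right skew
log10_salary <- log10(salary_sample)
print(round(log10_salary, 6))

# The spread between the largest and smallest value shrinks after the transformation
ratio_raw <- max(salary_sample) / min(salary_sample)
ratio_log <- max(log10_salary) / min(log10_salary)
print(c(raw = ratio_raw, log10 = ratio_log))

# Values on the log10 scale are mapped back to salary units with 10^x
back_transformed <- 10^log10_salary
```

A model fitted on log10_salary predicts on the log scale, so its predictions need the same 10^x back-transformation before they can be read as salaries.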

## Rows: 311
## Columns: 15
## $ married_id                 <fct> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ marital_status_id          <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ gender_id                  <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ emp_status_id              <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ dept_id                    <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ perf_score_id              <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ from_diversity_job_fair_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ termd                      <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ position_id                <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ emp_satisfaction           <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ special_projects_count     <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ days_late_last30           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ absences                   <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…
## $ status                     <chr> "no_married", "married", "married", "marrie…
## $ log10_salary               <dbl> 4.795922, 5.018854, 4.812613, 4.812853, 4.7…
## Rows: 311
## Columns: 15
## $ married_id                 <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ marital_status_id          <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ gender_id                  <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ emp_status_id              <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ dept_id                    <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ perf_score_id              <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ from_diversity_job_fair_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ termd                      <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ position_id                <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ emp_satisfaction           <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ special_projects_count     <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ days_late_last30           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ absences                   <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…
## $ salary                     <int> 62506, 104437, 64955, 64991, 50825, 57568, …
## $ status                     <chr> "no_married", "married", "married", "marrie…

Correlation matrix

MRLM - Linear regression model

Generating the regression model

Just a regression

lm_ih <- lm(salary ~ ., data = dataset)
lm_ih
## 
## Call:
## lm(formula = salary ~ ., data = dataset)
## 
## Coefficients:
##                (Intercept)                  married_id  
##                    95093.0                       443.8  
##          marital_status_id                   gender_id  
##                    -1133.9                      -351.1  
##              emp_status_id                     dept_id  
##                    -1607.4                     -8722.4  
##              perf_score_id  from_diversity_job_fair_id  
##                     6392.3                      3462.4  
##                      termd                 position_id  
##                     3924.0                      -545.7  
##           emp_satisfaction      special_projects_count  
##                      432.2                      2625.3  
##           days_late_last30                    absences  
##                     1663.1                       324.8  
##                     status  
##                         NA
# lm_ih <- lm(log10_salary ~ ., data = dataset1 %>% dplyr::select(-salary))
# lm_ih
summary(lm_ih)
## 
## Call:
## lm(formula = salary ~ ., data = dataset)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -56336  -9497  -2108   5279 159658 
## 
## Coefficients: (1 not defined because of singularities)
##                            Estimate Std. Error t value     Pr(>|t|)    
## (Intercept)                 95093.0    17138.2   5.549 0.0000000638 ***
## married_id                    443.8     2520.6   0.176       0.8604    
## marital_status_id           -1133.9     1297.3  -0.874       0.3828    
## gender_id                    -351.1     2433.9  -0.144       0.8854    
## emp_status_id               -1607.4     2274.8  -0.707       0.4804    
## dept_id                     -8722.4     1932.4  -4.514 0.0000091892 ***
## perf_score_id                6392.3     3117.4   2.051       0.0412 *  
## from_diversity_job_fair_id   3462.4     4256.8   0.813       0.4167    
## termd                        3924.0     8436.1   0.465       0.6422    
## position_id                  -545.7      219.7  -2.483       0.0136 *  
## emp_satisfaction              432.2     1394.5   0.310       0.7568    
## special_projects_count       2625.3      796.4   3.296       0.0011 ** 
## days_late_last30             1663.1     1418.7   1.172       0.2420    
## absences                      324.8      207.6   1.564       0.1188    
## status                           NA         NA      NA           NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20970 on 297 degrees of freedom
## Multiple R-squared:  0.3345, Adjusted R-squared:  0.3053 
## F-statistic: 11.48 on 13 and 297 DF,  p-value: < 0.00000000000000022

Identifying the variable that is strongly correlated with status

No.  Variable                    Value
1    (Intercept)                 0
2    married_id                  1
3    marital_status_id           0
4    gender_id                   0
5    emp_status_id               0
6    dept_id                     0
7    perf_score_id               0
8    from_diversity_job_fair_id  0
9    termd                       0
10   position_id                 0
11   emp_satisfaction            0
12   special_projects_count      0
13   days_late_last30            0
14   absences                    0

The importance of the variables

Check assumptions of linear regression

library(performance)
par(mfrow = c(2, 2))
plot(lm_ih, pch = 16, col = '#006EA1')

par(mfrow = c(1, 1))

The estimated values of the regression coefficients

No.  term                        estimate   std.error  statistic  p.value   signif
1    (Intercept)                 9.51e+04   1.71e+04   5.55       6.38e-08  s
2    married_id                  444        2.52e+03   0.176      0.86      ns
3    marital_status_id           -1.13e+03  1.3e+03    -0.874     0.383     ns
4    gender_id                   -351       2.43e+03   -0.144     0.885     ns
5    emp_status_id               -1.61e+03  2.27e+03   -0.707     0.48      ns
6    dept_id                     -8.72e+03  1.93e+03   -4.51      9.19e-06  s
7    perf_score_id               6.39e+03   3.12e+03   2.05       0.0412    s
8    from_diversity_job_fair_id  3.46e+03   4.26e+03   0.813      0.417     ns
9    termd                       3.92e+03   8.44e+03   0.465      0.642     ns
10   position_id                 -546       220        -2.48      0.0136    s
11   emp_satisfaction            432        1.39e+03   0.31       0.757     ns
12   special_projects_count      2.63e+03   796        3.3        0.0011    s
13   days_late_last30            1.66e+03   1.42e+03   1.17       0.242     ns
14   absences                    325        208        1.56       0.119     ns
15   status                      NA         NA         NA         NA

The validation metrics of the model

No.  var            values
1    r.squared      0.334
2    adj.r.squared  0.305
3    sigma          2.1e+04
4    statistic      11.5
5    p.value        4.94e-20
6    df             13
7    logLik         -3.53e+03
8    AIC            7.09e+03
9    BIC            7.14e+03
10   deviance       1.31e+11
11   df.residual    297
12   nobs           311

Cook's distance, predicted values, residuals and influential points

## OK: No outliers detected.
## - Based on the following method and threshold: cook (1).
## - For variable: (Whole model)
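
The message above reports that no observation exceeds a Cook's distance of 1. A minimal base-R sketch of the same kind of check is given below, on simulated data with one deliberately influential point. The data and the object names (x, y, fit, cd, influential) are illustrative assumptions, not the HR dataset or the original code.

```r
set.seed(10)

# 20 well-behaved points on a line, plus one point far away from the others
x <- c(1:20, 60)
y <- c(2 * (1:20) + rnorm(20), 0)

fit <- lm(y ~ x)

# Cook's distance summarises how much all fitted values change
# when a single observation is left out of the fit
cd <- cooks.distance(fit)

# Flag observations above the threshold of 1 quoted in the output above
influential <- which(cd > 1)
print(influential)
```

On the simulated data the added 21st point is flagged, whereas the output above indicates that no row of the HR model crosses the threshold.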

VIF values obtained for the variables included in the model

Term                        VIF   VIF_CI_low  VIF_CI_high  SE_factor  Tolerance  Tolerance_CI_low  Tolerance_CI_high
married_id                  1.08  1.02        1.37         1.04       0.928      0.73              0.984
marital_status_id           1.06  1.01        1.45         1.03       0.947      0.689             0.993
gender_id                   1.03  1           2.28         1.01       0.971      0.439             0.999
emp_status_id               11.7  9.63        14.4         3.43       0.0851     0.0695            0.104
dept_id                     2.44  2.08        2.91         1.56       0.41       0.344             0.48
perf_score_id               2.36  2.02        2.81         1.54       0.423      0.355             0.495
from_diversity_job_fair_id  1.08  1.02        1.36         1.04       0.923      0.735             0.981
termd                       11.2  9.19        13.7         3.35       0.0892     0.0729            0.109
position_id                 1.32  1.19        1.55         1.15       0.758      0.647             0.843
emp_satisfaction            1.13  1.05        1.36         1.06       0.882      0.734             0.953
special_projects_count      2.47  2.11        2.95         1.57       0.405      0.339             0.474
days_late_last30            2.38  2.04        2.83         1.54       0.42       0.353             0.491
absences                    1.04  1           1.65         1.02       0.96       0.608             0.997
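
The VIF, SE_factor and Tolerance columns are linked by simple formulas: for predictor j, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors; the tolerance is 1 / VIF and the standard-error inflation factor is sqrt(VIF). The sketch below computes these quantities by hand on simulated data. The helper vif_manual() and the toy variables x1, x2, x3 are hypothetical names introduced here; this is not the code that produced the table.

```r
set.seed(7)
n <- 200
toy <- data.frame(x1 = rnorm(n), x3 = rnorm(n))
toy$x2 <- toy$x1 + rnorm(n, sd = 0.1)   # x2 is almost a copy of x1

# Hypothetical helper: VIF of each predictor from its auxiliary regression
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    # R^2 from regressing predictor p on all the other predictors
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vif <- vif_manual(toy, c("x1", "x2", "x3"))
result <- data.frame(VIF = vif, SE_factor = sqrt(vif), Tolerance = 1 / vif)
print(round(result, 3))
```

In this toy example x1 and x2 receive very large VIF values while x3 stays close to 1, the same pattern that emp_status_id and termd show against the other predictors in the table.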

We observe that the coefficient of the status variable is NA. This means that status is perfectly collinear with another predictor, that is, it is an exact linear combination of other variables. As a result, the design matrix is singular, and neither a coefficient nor a VIF can be computed for it.

The variable that is perfectly correlated with status is: married_id
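
To make the mechanism concrete, here is a minimal sketch on simulated data in which a status column is only a relabelling of married_id. The toy data frame and its values are assumptions made for illustration; only the variable names mirror the dataset. Base R's alias() reports which terms an NA coefficient depends on.

```r
set.seed(1)
n <- 50
toy <- data.frame(
  married_id = rbinom(n, 1, 0.4),
  absences   = rpois(n, 8)
)

# status carries exactly the same information as married_id
toy$status <- ifelse(toy$married_id == 1, "married", "no_married")
toy$salary <- 60000 + 500 * toy$married_id + 100 * toy$absences +
  rnorm(n, sd = 2000)

fit <- lm(salary ~ ., data = toy)

# The redundant dummy variable gets an NA coefficient
print(coef(fit))

# alias() shows, for each NA term, the linear dependency on the other terms
aliased <- alias(fit)$Complete
print(aliased)
```

Dropping either of the two columns removes the singularity, which is why status is excluded before the model is rebuilt.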

# vif(lm_ih)

We therefore remove the status variable and the termd variable (which is highly correlated with emp_status_id; the correlation coefficient is 0.91) from the dataset and rebuild the model.

Feature engineering

Regression model with dept_id transformed
dataset2 <- dataset1 %>% 
  mutate(dept_id = as.factor(dept_id))

dataset2$gender_id <- as.numeric(dataset2$gender_id)
dataset2$married_id <- as.numeric(dataset2$married_id)
dataset2$marital_status_id <- as.numeric(dataset2$marital_status_id)
dataset2$emp_status_id <- as.numeric(dataset2$emp_status_id)
dataset2$emp_satisfaction <- as.numeric(dataset2$emp_satisfaction)
dataset2$perf_score_id <- as.numeric(dataset2$perf_score_id)

str(dataset2)
## tibble [311 × 16] (S3: tbl_df/tbl/data.frame)
##  $ married_id                : num [1:311] 0 1 1 1 0 0 0 0 0 0 ...
##  $ marital_status_id         : num [1:311] 0 1 1 1 2 0 0 4 0 2 ...
##  $ gender_id                 : num [1:311] 1 1 0 0 0 0 0 1 0 1 ...
##  $ emp_status_id             : num [1:311] 1 5 5 1 5 1 1 1 3 1 ...
##  $ dept_id                   : Factor w/ 7 levels "1","2","3","4",..: 5 3 5 5 5 5 4 5 5 3 ...
##  $ perf_score_id             : num [1:311] 4 3 3 3 3 4 3 3 3 3 ...
##  $ from_diversity_job_fair_id: num [1:311] 0 0 0 0 0 0 0 0 1 0 ...
##  $ termd                     : num [1:311] 0 1 1 0 1 0 0 0 0 0 ...
##  $ position_id               : num [1:311] 19 27 20 19 19 19 24 19 19 14 ...
##  $ emp_satisfaction          : num [1:311] 5 3 3 5 4 5 3 4 3 5 ...
##  $ special_projects_count    : num [1:311] 0 6 0 0 0 0 4 0 0 6 ...
##  $ days_late_last30          : num [1:311] 0 0 0 0 0 0 0 0 0 0 ...
##  $ absences                  : num [1:311] 1 17 3 15 2 15 19 19 4 16 ...
##  $ salary                    : num [1:311] 62506 104437 64955 64991 50825 ...
##  $ status                    : num [1:311] 0 1 1 1 0 0 0 0 0 0 ...
##  $ log10_salary              : num [1:311] 4.8 5.02 4.81 4.81 4.71 ...
library(robustbase)
lm_ih <- lmrob(salary ~ ., data = dataset2 %>% dplyr::select(-status, -termd, -log10_salary))
## Loaded glmnet 4.1-8
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine

LASSO regression model

Unlike classical linear regression, LASSO regression (Least Absolute Shrinkage and Selection Operator) introduces a regularization term based on the absolute values of the model parameters.

The regularization term is the product of two components:

  • the L1 norm, which represents the sum of the absolute values of the model coefficients;
  • the hyperparameter \(\lambda\), which controls the strength of the penalty applied to the coefficients.

The parameter \(\lambda\) introduces bias in the estimation process but helps reduce model variance.

The L1 regularization term allows insignificant coefficients to be shrunk toward zero or forced exactly to zero, which enables automatic variable selection, retaining only the predictors that have a significant influence on the response variable.

The LASSO optimization problem can be written as:

\[ RSS_{LASSO} = \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j| \]

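where \(y\) is the vector of observed responses, \(X\) is the matrix of predictors, \(\beta\) is the vector of regression coefficients, and \(\lambda \ge 0\) is the regularization hyperparameter.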

A larger value of \(\lambda\) increases the shrinkage effect on the coefficients, potentially setting some coefficients exactly to zero and thus performing feature selection.

LASSO regression optimization and objective function

The solution of the LASSO regression optimization problem is typically obtained using numerical algorithms, such as:

  • Coordinate descent, which iteratively updates the regression coefficients one at a time while keeping the others fixed.
  • Least angle regression (LARS), which adjusts the regression coefficients gradually as the value of the regularization parameter \(\lambda\) changes.
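
As a rough illustration of the first algorithm, the sketch below implements a plain cyclic coordinate descent for the objective \(\|y - X\beta\|_2^2 + \lambda \sum_j |\beta_j|\), using the soft-thresholding update discussed in this section. The function names soft() and lasso_cd(), the fixed number of sweeps and the simulated data are assumptions made for this example; it is not how glmnet is implemented internally and it omits convergence checks and standardization.

```r
# Soft-thresholding: shrink z toward zero by t, and set it to zero if |z| <= t
soft <- function(z, t) sign(z) * pmax(0, abs(z) - t)

# Hypothetical helper: cyclic coordinate descent for RSS + lambda * sum(|beta|)
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  Xc <- scale(X, center = TRUE, scale = FALSE)  # centering handles the intercept
  yc <- y - mean(y)
  beta <- rep(0, ncol(Xc))
  for (it in seq_len(n_iter)) {
    for (j in seq_along(beta)) {
      # Partial residual: what is left after removing all predictors except j
      r_j <- yc - drop(Xc[, -j, drop = FALSE] %*% beta[-j])
      # One-dimensional LASSO solution for coordinate j, others held fixed
      beta[j] <- soft(sum(Xc[, j] * r_j), lambda / 2) / sum(Xc[, j]^2)
    }
  }
  beta0 <- mean(y) - sum(colMeans(X) * beta)
  list(beta0 = beta0, beta = beta)
}

set.seed(3)
n <- 100
X <- matrix(rnorm(n * 3), n, 3)
y <- 1 + 2 * X[, 1] - 1 * X[, 2] + rnorm(n, sd = 0.5)  # third predictor is noise

fit_no_penalty <- lasso_cd(X, y, lambda = 0)    # should agree with lm()
fit_sparse     <- lasso_cd(X, y, lambda = 60)   # penalty strong enough to drop noise
print(rbind(no_penalty = fit_no_penalty$beta, sparse = fit_sparse$beta))
```

With lambda = 0 the result coincides with ordinary least squares; with a positive lambda the coefficients of the informative predictors are shrunk and the coefficient of the noise predictor is set to zero.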

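For a single predictor scaled so that \(\sum_i x_{ij}^2 = 1\) (or, more generally, for an orthonormal design with \(X'X = I\)), the explicit solution can be expressed using the soft-thresholding operator: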

\[ \widehat{\beta_j} = \text{sign}\left(\widehat{\beta_j^{OLS}}\right) \cdot \max\left(0, \left|\widehat{\beta_j^{OLS}}\right| - \frac{\lambda}{2} \right) \]

where:

  • \(\widehat{\beta_j^{OLS}}\) is the coefficient obtained from the ordinary least squares (OLS) regression, where no regularization term is included.
  • \(\text{sign}(\cdot)\) is the sign function, returning 1 for positive values, −1 for negative values, and 0 at zero.
  • \(\lambda/2\) is the shrinkage threshold determined by the regularization hyperparameter.

The hyperparameter \(\lambda\) reduces the magnitude of the coefficients and sets a coefficient exactly to zero when the absolute value of its OLS estimate is smaller than the threshold \(\lambda/2\).
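
The soft-thresholding rule can be written in one line of R. The sketch below also checks it numerically against a direct minimization of the penalized objective for a single predictor with \(\sum_i x_i^2 = 1\). The function name soft_threshold() and the small example vectors are illustrative assumptions.

```r
# Soft-thresholding operator with threshold lambda / 2
soft_threshold <- function(beta_ols, lambda) {
  sign(beta_ols) * pmax(0, abs(beta_ols) - lambda / 2)
}

# Estimates larger than the threshold are shrunk, smaller ones become zero
print(soft_threshold(c(3, -3, 0.5, -0.5), lambda = 2))

# Numerical check on a single predictor with sum(x^2) = 1
x <- c(3, 4) / 5
y <- c(2, 1)
lambda <- 1
beta_ols <- sum(x * y)          # OLS estimate, because sum(x^2) = 1

# Penalized objective for one coefficient b
objective <- function(b) sum((y - x * b)^2) + lambda * abs(b)
numeric_min <- optimize(objective, interval = c(-10, 10))$minimum

print(c(closed_form = soft_threshold(beta_ols, lambda), numeric = numeric_min))
```

The closed-form value and the numerical minimizer agree up to the tolerance of optimize(), which is what the formula predicts for this scaled single-predictor case.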


Explicit form of the LASSO objective function

The objective function of the LASSO regression model can be written as:

\[ J_{LASSO}(\beta) = \sum_{i=1}^{n} \left( y_i - \left( \beta_0 + \sum_{j=1}^{p}\beta_j x_{ij} \right) \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \]

where:

  • \(\lambda\) is the hyperparameter controlling the strength of the penalty.
  • \(|\beta_j|\) represents the absolute value of the regression coefficients.
  • \(x_i = (x_{i1},\ldots,x_{ip})\) are the predictors for observation \(i\).
  • \(y_i\) represents the observed values of the response variable.
  • \(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\) represents the predicted response value.
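
The objective function translates almost literally into R. The sketch below evaluates it for two candidate coefficient vectors on a tiny hand-made example; the function name lasso_objective() and the numbers are assumptions chosen so that the arithmetic can be followed by hand.

```r
# J(beta) = sum of squared residuals + lambda * sum of absolute coefficients
# (the intercept beta0 is not penalized)
lasso_objective <- function(beta0, beta, X, y, lambda) {
  fitted  <- beta0 + drop(X %*% beta)
  rss     <- sum((y - fitted)^2)
  penalty <- lambda * sum(abs(beta))
  rss + penalty
}

# Two predictors (columns), three observations (rows)
X <- matrix(c(1, 2, 3,
              0, 1, 0), ncol = 2)
y <- c(1, 2, 3)

# Perfect fit with one active coefficient: RSS = 0, penalty = 2 * 1
j_first <- lasso_objective(beta0 = 0, beta = c(1, 0), X = X, y = y, lambda = 2)

# Worse fit with two active coefficients: RSS = 6.5, penalty = 2 * 1.5
j_second <- lasso_objective(beta0 = 0, beta = c(0.5, -1), X = X, y = y, lambda = 2)

print(c(first = j_first, second = j_second))
```

The second candidate is penalized twice: once through its larger residuals and once through its larger sum of absolute coefficients.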

LASSO cost function

The cost function minimized in LASSO regression is:

\[ J_{LASSO}(\beta) = RSS_{OLS} + \lambda \sum_{j=1}^{p} |\beta_j| \]

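or, after dividing the squared-error term by \(2n\) (which only rescales \(\lambda\) by a constant; this is the scaling used by glmnet), as the minimization problem:

\[ \min_{\beta_0,\beta} \left[ \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right] \]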

The LASSO regression model aims to minimize this objective function.


Role of the regularization term

The regularization term introduced in the LASSO regression model controls the bias–variance trade-off as follows:

  • If \(\lambda = 0\): the LASSO regression model becomes equivalent to the classical multiple linear regression (OLS) model, where the coefficients are estimated solely based on the training data.
    In this case, the model may lead to overfitting, especially when the matrix \(X'X\) is poorly conditioned or when predictors are highly correlated.

  • If \(\lambda > 0\): the penalty term shrinks the regression coefficients toward zero.
    As a result, the model becomes more stable and less sensitive to small eigenvalues of the matrix \(X'X\).
    Some regression coefficients may be reduced to exactly zero, effectively performing automatic feature selection.

  • If \(\lambda \rightarrow \infty\): the penalty term becomes dominant relative to the squared error term.
    Consequently, the optimal solution forces the coefficients toward:

\[ \beta_j = 0, \quad \forall j \]

In this situation, the conditional mean reduces to:

\[ \hat{y} = \beta_0 \]

meaning that the predictions no longer depend on the predictor variables.

Therefore, the LASSO model collapses into a constant model, which leads to underfitting, since the model is no longer able to learn meaningful patterns from the data.
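
The three regimes can be reproduced in a few lines when the predictors are orthonormal, because the LASSO solution is then the soft-thresholded OLS estimate. The sketch below uses simulated data with true coefficients 4, 1 and 0; all object names and the chosen values of lambda are assumptions made for illustration.

```r
set.seed(42)
n <- 100

# Build three centered, orthonormal predictor columns: t(Q) %*% Q = I
raw <- scale(matrix(rnorm(n * 3), n, 3), center = TRUE, scale = FALSE)
Q <- qr.Q(qr(raw))

y <- drop(5 + Q %*% c(4, 1, 0) + rnorm(n, sd = 0.1))

# With an orthonormal design the OLS slopes are simply t(Q) %*% (y - mean(y))
beta_ols <- drop(crossprod(Q, y - mean(y)))

# LASSO solution for RSS + lambda * sum(|beta|) under an orthonormal design
lasso_orthonormal <- function(lambda) {
  sign(beta_ols) * pmax(0, abs(beta_ols) - lambda / 2)
}

for (lambda in c(0, 1, 20)) {
  cat("lambda =", lambda,
      " coefficients:", round(lasso_orthonormal(lambda), 3), "\n")
}
```

At lambda = 0 the coefficients equal the OLS estimates, at lambda = 1 the noise coefficient is removed while the others are shrunk, and at lambda = 20 every coefficient is zero, so each prediction equals the intercept (here the mean of y).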

Properties of the LASSO regression model

The LASSO regression model does not require the absence of multicollinearity among the predictor variables.
As in classical linear regression, estimating the regression coefficients does not require the error terms to follow a normal distribution; normality matters only for classical inference (t-tests, F-tests and confidence intervals).

Thus, LASSO can estimate model parameters even when the predictors are highly correlated. However, it is important to note that the other assumptions of the classical multiple linear regression model remain generally applicable.

Geometric interpretation of the LASSO optimization

Figure 4 illustrates the graphical representation of how the optimal solution is obtained by minimizing the objective function in both classical regression (OLS) and LASSO regression.

The x and y axes represent the values of two regression parameters of the model.

The solid concentric circles represent the residual sum of squares (RSS) for different combinations of the regression parameters. Each circle corresponds to a constant level of the objective function.

The dotted diamonds represent the additional constraint introduced by the L1 regularization term and the hyperparameter \(\lambda\) used in the LASSO model.

The optimal LASSO solution is obtained at the point where the smallest RSS contour first intersects the L1 constraint region. Due to the diamond-shaped constraint, the intersection often occurs on one of the axes, which forces some regression coefficients to become exactly zero. This property explains why LASSO performs automatic variable selection.
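
The diamond in this picture comes from the equivalent constrained form of the problem, sketched here with a budget \(t \ge 0\) (a symbol introduced only for this illustration):

\[ \min_{\beta_0,\beta} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t \]

For two coefficients, the region \(|\beta_1| + |\beta_2| \le t\) is a diamond with corners on the axes at \((\pm t, 0)\) and \((0, \pm t)\). A larger \(\lambda\) in the penalized form corresponds to a smaller \(t\), so the diamond shrinks toward the origin and the solution is more likely to land on a corner.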

Advantages and disadvantages of the LASSO regression model

In the following section, we briefly present the main advantages and disadvantages of the LASSO regression model.

Advantages of LASSO regression

  1. Reduces overfitting
    One of the major benefits of LASSO regression is its ability to control overfitting, which improves the model’s performance when applied to new, unseen data. This is particularly useful when the dataset contains a large number of predictors.

  2. Automatic variable selection
    The regularization term forces some regression coefficients to become exactly zero. As a result, insignificant predictors are automatically removed from the model, which reduces model complexity and improves interpretability.

  3. Efficient for high-dimensional data
    LASSO regression can be applied to datasets where the number of predictors is larger than the number of observations. This makes it particularly useful in high-dimensional settings such as machine learning or genomic data analysis.

Disadvantages of LASSO regression

  1. Sensitivity to multicollinearity
    When predictors are strongly correlated (multicollinearity), classical linear regression becomes unstable because the coefficients may be poorly estimated. Although LASSO introduces a regularization term that stabilizes coefficient estimation, it is not always suitable when there are many highly correlated predictors.
    In such cases, LASSO tends to keep only one predictor from a group of correlated variables while eliminating the others, which may lead to loss of useful information.

  2. Lower efficiency when many predictors are relevant
    In datasets where many predictors have a significant influence on the response variable but are also highly correlated, LASSO may eliminate important independent variables.

  3. Introduces bias for large coefficients
    Because the regularization term constrains the magnitude of the regression coefficients, predictors with large coefficients may be excessively penalized. This can introduce bias in coefficient estimates, potentially reducing predictive accuracy.

  4. Assumes linear relationships
    LASSO regression is a regularized version of the classical linear regression model, therefore it assumes a linear relationship between predictors and the response variable. If the data exhibit more complex nonlinear relationships, alternative modeling approaches may be required.

Lasso Regression (alpha = 1)

Regression model estimation

## 
## Call:  cv.glmnet(x = x, y = y, alpha = 1) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index   Measure        SE Nonzero
## min   1053    28 438113709 134852220       6
## 1se   9818     4 570601502 162261764       3

## 20 x 1 sparse Matrix of class "dgCMatrix"
##                                     s1
## (Intercept)                 62412.0042
## married_id                      .     
## marital_status_id               .     
## gender_id                       .     
## emp_status_id                   .     
## dept_id2                   160198.4396
## dept_id3                    23184.3082
## dept_id4                    18979.5455
## dept_id5                   -10519.3913
## dept_id6                        .     
## dept_id7                        .     
## perf_score_id                2472.5875
## from_diversity_job_fair_id      .     
## termd                           .     
## position_id                     .     
## emp_satisfaction                .     
## special_projects_count          .     
## days_late_last30                .     
## absences                      142.9105
## status                          .
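
The code that produced the output above is not shown in the post; only the call cv.glmnet(x = x, y = y, alpha = 1) is visible. The sketch below shows one plausible way to arrive at such a fit on simulated data. The toy data frame, the use of model.matrix() to build x, and the in-sample MSE at the end are assumptions made for illustration, and the block runs only if the glmnet package is installed.

```r
if (requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(123)
  n <- 200

  # Simulated stand-in for the HR data (not the real dataset)
  toy <- data.frame(
    dept_id       = factor(sample(1:4, n, replace = TRUE)),
    perf_score_id = sample(1:4, n, replace = TRUE),
    absences      = rpois(n, 10),
    noise_var     = rnorm(n)
  )
  toy$salary <- 60000 + 20000 * (toy$dept_id == "2") +
    3000 * toy$perf_score_id + rnorm(n, sd = 5000)

  # glmnet needs a numeric matrix: model.matrix() expands factors into
  # dummy columns (dept_id2, dept_id3, ...); [, -1] drops the intercept column
  x <- model.matrix(salary ~ ., data = toy)[, -1]
  y <- toy$salary

  # alpha = 1 selects the LASSO penalty; lambda is chosen by cross-validation
  cv_fit <- glmnet::cv.glmnet(x = x, y = y, alpha = 1)
  print(cv_fit)

  # Coefficients at the two lambda values reported in the output
  print(coef(cv_fit, s = "lambda.min"))
  print(coef(cv_fit, s = "lambda.1se"))

  # One possible performance measure: in-sample mean squared error
  pred <- predict(cv_fit, newx = x, s = "lambda.min")
  mse  <- mean((y - pred)^2)
  print(mse)
}
```

Because the cross-validation folds are drawn at random, lambda.min and the set of nonzero coefficients can change between runs unless the seed is fixed, so any interpretation should refer to one specific fit.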

Interpretation (coefficients from the fit shown above; a dot marks a coefficient shrunk exactly to zero):

  • dept_id (dept_id2: 160198.44, dept_id3: 23184.31, dept_id4: 18979.55, dept_id5: -10519.39) → has by far the largest effect. Because dept_id is a factor, each coefficient is measured relative to the baseline department (dept_id = 1): some departments pay much more or less than the baseline. The coefficients of dept_id6 and dept_id7 are shrunk to zero.

  • perf_score_id (2472.59) → the performance scores positively affect salary: each additional point is associated with about 2472.59 more salary units.

  • absences (142.91) → has a minor positive effect on salary (interesting, could reflect correlation with other variables).

  • special_projects_count, position_id, marital_status_id and all remaining predictors → their coefficients are shrunk to exactly zero, so they do not contribute to the predicted salary in this fit.

Metric      Value
lambda.min  1.05e+03
lambda.1se  9.82e+03
nobs        311

LASSO can eliminate variables by forcing their regression coefficients to be exactly zero.

Performance of the Lasso model

## [1] 316251360

Lasso shrinks the coefficients of unimportant variables to zero, performing variable selection while controlling multicollinearity and overfitting, so the final model keeps only the variables that help explain the variation in salary. In the fit shown above, the retained predictors are dept_id (levels 2 to 5), perf_score_id and absences, while married_id, marital_status_id, gender_id, emp_status_id, from_diversity_job_fair_id, termd, position_id, emp_satisfaction, special_projects_count, days_late_last30 and status are dropped.
