Loading the necessary packages…
In order to determine the factors that significantly influence an individual’s salary within a company, we selected the dataset entitled Human Resources Data Set, published on the Kaggle platform: https://www.kaggle.com/datasets/rhuebner/human-resources-data-set
This dataset was created by Dr. Carla Patalano and Dr. Rich. It was designed as an educational resource to help students learn how to perform exploratory data analysis (EDA). The dataset provides a wide range of features that enable both data visualization and the development of machine learning / predictive analytics models.
Within this dataset, we decided to explore and attempt to answer several research questions, such as:
Are there areas within the company where salary distribution is not equitable?
Is an individual’s salary influenced by any specific factors present in the dataset?
Can we build a scoring model capable of estimating an employee’s salary? If so, to what extent is the model accurate?
Loading the dataset
Dataset description
The structure of the dataset
glimpse(HRDataset_v14)
## Rows: 311
## Columns: 36
## $ Employee_Name <chr> "Adinolfi, Wilson K", "Ait Sidi, Karthikey…
## $ EmpID <dbl> 10026, 10084, 10196, 10088, 10069, 10002, 1…
## $ MarriedID <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ MaritalStatusID <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ GenderID <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ EmpStatusID <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ DeptID <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ PerfScoreID <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ FromDiversityJobFairID <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ Salary <dbl> 62506, 104437, 64955, 64991, 50825, 57568, …
## $ Termd <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ PositionID <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ Position <chr> "Production Technician I", "Sr. DBA", "Prod…
## $ State <chr> "MA", "MA", "MA", "MA", "MA", "MA", "MA", "…
## $ Zip <chr> "01960", "02148", "01810", "01886", "02169"…
## $ DOB <chr> "07/10/83", "05/05/75", "09/19/88", "09/27/…
## $ Sex <chr> "M", "M", "F", "F", "F", "F", "F", "M", "F"…
## $ MaritalDesc <chr> "Single", "Married", "Married", "Married", …
## $ CitizenDesc <chr> "US Citizen", "US Citizen", "US Citizen", "…
## $ HispanicLatino <chr> "No", "No", "No", "No", "No", "No", "No", "…
## $ RaceDesc <chr> "White", "White", "White", "White", "White"…
## $ DateofHire <chr> "7/5/2011", "3/30/2015", "7/5/2011", "1/7/2…
## $ DateofTermination <chr> NA, "6/16/2016", "9/24/2012", NA, "9/6/2016…
## $ TermReason <chr> "N/A-StillEmployed", "career change", "hour…
## $ EmploymentStatus <chr> "Active", "Voluntarily Terminated", "Volunt…
## $ Department <chr> "Production", "IT/IS", "Production", "Produ…
## $ ManagerName <chr> "Michael Albert", "Simon Roup", "Kissy Sull…
## $ ManagerID <dbl> 22, 4, 20, 16, 39, 11, 10, 19, 12, 7, 14, 2…
## $ RecruitmentSource <chr> "LinkedIn", "Indeed", "LinkedIn", "Indeed",…
## $ PerformanceScore <chr> "Exceeds", "Fully Meets", "Fully Meets", "F…
## $ EngagementSurvey <dbl> 4.60, 4.96, 3.02, 4.84, 5.00, 5.00, 3.04, 5…
## $ EmpSatisfaction <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ SpecialProjectsCount <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ LastPerformanceReview_Date <chr> "1/17/2019", "2/24/2016", "5/15/2012", "1/3…
## $ DaysLateLast30 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Absences <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…
Based on the exploratory data analysis, we see that:
Salary exhibits non-normal behavior and strong right skewness.
Extreme values (executive-level salaries) may influence regression results.
Transformation techniques (e.g., log transformation of salary) may improve model performance.
Categorical variables such as department, position, and employment status are likely strong predictors of salary.
Gender alone may not fully explain salary variation without controlling for position and department.
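The effect of a log transformation on a right-skewed variable can be sketched as follows. The salary values here are synthetic (evenly spaced quantiles of a log-normal distribution, chosen purely for illustration) and are not the HR data; `skewness()` is a hypothetical helper defined inside the block.

```r
# Hypothetical helper: moment-based sample skewness (0 for a symmetric sample).
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Synthetic, right-skewed "salaries": evenly spaced quantiles of a log-normal
# distribution. Illustrative values only, not the HR dataset.
salary <- qlnorm(ppoints(311), meanlog = 11, sdlog = 0.4)

skew_raw <- skewness(salary)          # clearly positive: long right tail
skew_log <- skewness(log10(salary))   # close to zero after the transformation

round(c(raw = skew_raw, log10 = skew_log), 3)
```

On real salary data the transformed variable will rarely be perfectly symmetric, but a marked drop in skewness is the pattern to look for.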
## Rows: 311
## Columns: 15
## $ married_id <fct> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ marital_status_id <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ gender_id <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ emp_status_id <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ dept_id <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ perf_score_id <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ from_diversity_job_fair_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ termd <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ position_id <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ emp_satisfaction <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ special_projects_count <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ days_late_last30 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ absences <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…
## $ status <chr> "no_married", "married", "married", "marrie…
## $ log10_salary <dbl> 4.795922, 5.018854, 4.812613, 4.812853, 4.7…
## Rows: 311
## Columns: 15
## $ married_id <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ marital_status_id <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ gender_id <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ emp_status_id <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ dept_id <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ perf_score_id <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ from_diversity_job_fair_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ termd <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ position_id <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ emp_satisfaction <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ special_projects_count <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ days_late_last30 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ absences <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…
## $ salary <int> 62506, 104437, 64955, 64991, 50825, 57568, …
## $ status <chr> "no_married", "married", "married", "marrie…
Just a regression …
lm_ih <- lm(salary ~ ., data = dataset)
lm_ih
##
## Call:
## lm(formula = salary ~ ., data = dataset)
##
## Coefficients:
## (Intercept) married_id
## 95093.0 443.8
## marital_status_id gender_id
## -1133.9 -351.1
## emp_status_id dept_id
## -1607.4 -8722.4
## perf_score_id from_diversity_job_fair_id
## 6392.3 3462.4
## termd position_id
## 3924.0 -545.7
## emp_satisfaction special_projects_count
## 432.2 2625.3
## days_late_last30 absences
## 1663.1 324.8
## status
## NA
# lm_ih <- lm(log10_salary ~ ., data = dataset1 %>% dplyr::select(-salary))
# lm_ih
summary(lm_ih)
##
## Call:
## lm(formula = salary ~ ., data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56336 -9497 -2108 5279 159658
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 95093.0 17138.2 5.549 0.0000000638 ***
## married_id 443.8 2520.6 0.176 0.8604
## marital_status_id -1133.9 1297.3 -0.874 0.3828
## gender_id -351.1 2433.9 -0.144 0.8854
## emp_status_id -1607.4 2274.8 -0.707 0.4804
## dept_id -8722.4 1932.4 -4.514 0.0000091892 ***
## perf_score_id 6392.3 3117.4 2.051 0.0412 *
## from_diversity_job_fair_id 3462.4 4256.8 0.813 0.4167
## termd 3924.0 8436.1 0.465 0.6422
## position_id -545.7 219.7 -2.483 0.0136 *
## emp_satisfaction 432.2 1394.5 0.310 0.7568
## special_projects_count 2625.3 796.4 3.296 0.0011 **
## days_late_last30 1663.1 1418.7 1.172 0.2420
## absences 324.8 207.6 1.564 0.1188
## status NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20970 on 297 degrees of freedom
## Multiple R-squared: 0.3345, Adjusted R-squared: 0.3053
## F-statistic: 11.48 on 13 and 297 DF, p-value: < 0.00000000000000022
Identifying the variable that is strongly correlated with status
| Nr.Crt. | Variable | Value |
|---|---|---|
| 1 | (Intercept) | 0 |
| 2 | married_id | 1 |
| 3 | marital_status_id | 0 |
| 4 | gender_id | 0 |
| 5 | emp_status_id | 0 |
| 6 | dept_id | 0 |
| 7 | perf_score_id | 0 |
| 8 | from_diversity_job_fair_id | 0 |
| 9 | termd | 0 |
| 10 | position_id | 0 |
| 11 | emp_satisfaction | 0 |
| 12 | special_projects_count | 0 |
| 13 | days_late_last30 | 0 |
| 14 | absences | 0 |
The importance of the variables
library(performance)
par(mfrow = c(2, 2))
plot(lm_ih, pch = 16, col = '#006EA1')
par(mfrow = c(1, 1))
The estimated values of the regression coefficients
| Nr.Crt. | term | estimate | std.error | statistic | p.value | signif |
|---|---|---|---|---|---|---|
| 1 | (Intercept) | 9.51e+04 | 1.71e+04 | 5.55 | 6.38e-08 | s |
| 2 | married_id | 444 | 2.52e+03 | 0.176 | 0.86 | ns |
| 3 | marital_status_id | -1.13e+03 | 1.3e+03 | -0.874 | 0.383 | ns |
| 4 | gender_id | -351 | 2.43e+03 | -0.144 | 0.885 | ns |
| 5 | emp_status_id | -1.61e+03 | 2.27e+03 | -0.707 | 0.48 | ns |
| 6 | dept_id | -8.72e+03 | 1.93e+03 | -4.51 | 9.19e-06 | s |
| 7 | perf_score_id | 6.39e+03 | 3.12e+03 | 2.05 | 0.0412 | s |
| 8 | from_diversity_job_fair_id | 3.46e+03 | 4.26e+03 | 0.813 | 0.417 | ns |
| 9 | termd | 3.92e+03 | 8.44e+03 | 0.465 | 0.642 | ns |
| 10 | position_id | -546 | 220 | -2.48 | 0.0136 | s |
| 11 | emp_satisfaction | 432 | 1.39e+03 | 0.31 | 0.757 | ns |
| 12 | special_projects_count | 2.63e+03 | 796 | 3.3 | 0.0011 | s |
| 13 | days_late_last30 | 1.66e+03 | 1.42e+03 | 1.17 | 0.242 | ns |
| 14 | absences | 325 | 208 | 1.56 | 0.119 | ns |
| 15 | status | | | | | |
The validation metrics of the model
| Nr.Crt. | var | values |
|---|---|---|
| 1 | r.squared | 0.334 |
| 2 | adj.r.squared | 0.305 |
| 3 | sigma | 2.1e+04 |
| 4 | statistic | 11.5 |
| 5 | p.value | 4.94e-20 |
| 6 | df | 13 |
| 7 | logLik | -3.53e+03 |
| 8 | AIC | 7.09e+03 |
| 9 | BIC | 7.14e+03 |
| 10 | deviance | 1.31e+11 |
| 11 | df.residual | 297 |
| 12 | nobs | 311 |
## OK: No outliers detected.
## - Based on the following method and threshold: cook (1).
## - For variable: (Whole model)
VIF values obtained for the variables included in the model
| Term | VIF | VIF_CI_low | VIF_CI_high | SE_factor | Tolerance | Tolerance_CI_low | Tolerance_CI_high |
|---|---|---|---|---|---|---|---|
| married_id | 1.08 | 1.02 | 1.37 | 1.04 | 0.928 | 0.73 | 0.984 |
| marital_status_id | 1.06 | 1.01 | 1.45 | 1.03 | 0.947 | 0.689 | 0.993 |
| gender_id | 1.03 | 1 | 2.28 | 1.01 | 0.971 | 0.439 | 0.999 |
| emp_status_id | 11.7 | 9.63 | 14.4 | 3.43 | 0.0851 | 0.0695 | 0.104 |
| dept_id | 2.44 | 2.08 | 2.91 | 1.56 | 0.41 | 0.344 | 0.48 |
| perf_score_id | 2.36 | 2.02 | 2.81 | 1.54 | 0.423 | 0.355 | 0.495 |
| from_diversity_job_fair_id | 1.08 | 1.02 | 1.36 | 1.04 | 0.923 | 0.735 | 0.981 |
| termd | 11.2 | 9.19 | 13.7 | 3.35 | 0.0892 | 0.0729 | 0.109 |
| position_id | 1.32 | 1.19 | 1.55 | 1.15 | 0.758 | 0.647 | 0.843 |
| emp_satisfaction | 1.13 | 1.05 | 1.36 | 1.06 | 0.882 | 0.734 | 0.953 |
| special_projects_count | 2.47 | 2.11 | 2.95 | 1.57 | 0.405 | 0.339 | 0.474 |
| days_late_last30 | 2.38 | 2.04 | 2.83 | 1.54 | 0.42 | 0.353 | 0.491 |
| absences | 1.04 | 1 | 1.65 | 1.02 | 0.96 | 0.608 | 0.997 |
We observe that the coefficient of the status variable is NA, meaning that there is perfect collinearity between this variable and another one: at least one variable is an exact linear combination of others. As a result, the design matrix is singular and the VIF cannot be computed for this term.
The variable that is perfectly correlated with status is: married_id
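One way to identify such an aliased term is the base R function `alias()`. The sketch below uses toy data (not the HR dataset) in which `status` is simply a relabelled copy of `married_id`; the column names mirror the report, while the values and the simulated salary are assumptions made for illustration.

```r
# Toy data (illustrative, not the HR dataset): `status` is just a relabelled
# copy of `married_id`.
toy <- data.frame(
  married_id = rep(c(0, 1, 1, 0, 0), times = 10),
  absences   = rep(c(1, 17, 3, 15, 2, 15, 19), length.out = 50)
)
toy$status <- ifelse(toy$married_id == 1, "married", "no_married")
set.seed(42)
toy$salary <- 60000 + 2000 * toy$married_id + 300 * toy$absences +
  rnorm(50, sd = 5000)

fit <- lm(salary ~ married_id + absences + status, data = toy)
coef(fit)                 # the status dummy is reported as NA

# alias() expresses each dropped term as a linear combination of the kept ones.
aliased <- alias(fit)$Complete
aliased
```

The non-zero entries in the `aliased` row point to the columns that reproduce the dropped term exactly; here that is the intercept and `married_id`.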
# vif(lm_ih)
We therefore remove the status variable and the termd variable (the latter is highly correlated with emp_status_id, with a correlation coefficient of 0.91) from the dataset and rebuild the model.
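The link between a high pairwise correlation and a large VIF can be illustrated with a small base R sketch. It relies on the standard definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. The data are a toy sample built by recycling the first ten `emp_status_id` and `termd` values printed earlier, so the numbers will not match the VIF table above; `vif_manual()` is a hypothetical helper.

```r
# Toy predictors (illustrative): the first ten emp_status_id and termd values
# shown in the report, recycled, plus a largely unrelated third column.
toy <- data.frame(
  emp_status_id = rep(c(1, 5, 5, 1, 5, 1, 1, 1, 3, 1), times = 10),
  termd         = rep(c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0), times = 10),
  absences      = rep(c(1, 17, 3, 15, 2, 15, 19), length.out = 100)
)

# Hypothetical helper: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from
# regressing predictor j on all the other predictors.
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(toy)
round(vifs, 2)
cor(toy$emp_status_id, toy$termd)   # the pairwise correlation behind the large VIFs
```

Dropping one of the two correlated columns and recomputing brings the remaining VIF values back towards 1.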
dataset2 <- dataset1 %>%
mutate(dept_id = as.factor(dept_id))
dataset2$gender_id <- as.numeric(dataset2$gender_id)
dataset2$married_id <- as.numeric(dataset2$married_id)
dataset2$marital_status_id <- as.numeric(dataset2$marital_status_id)
dataset2$emp_status_id <- as.numeric(dataset2$emp_status_id)
dataset2$emp_satisfaction <- as.numeric(dataset2$emp_satisfaction)
dataset2$perf_score_id <- as.numeric(dataset2$perf_score_id)
str(dataset2)
## tibble [311 × 16] (S3: tbl_df/tbl/data.frame)
## $ married_id : num [1:311] 0 1 1 1 0 0 0 0 0 0 ...
## $ marital_status_id : num [1:311] 0 1 1 1 2 0 0 4 0 2 ...
## $ gender_id : num [1:311] 1 1 0 0 0 0 0 1 0 1 ...
## $ emp_status_id : num [1:311] 1 5 5 1 5 1 1 1 3 1 ...
## $ dept_id : Factor w/ 7 levels "1","2","3","4",..: 5 3 5 5 5 5 4 5 5 3 ...
## $ perf_score_id : num [1:311] 4 3 3 3 3 4 3 3 3 3 ...
## $ from_diversity_job_fair_id: num [1:311] 0 0 0 0 0 0 0 0 1 0 ...
## $ termd : num [1:311] 0 1 1 0 1 0 0 0 0 0 ...
## $ position_id : num [1:311] 19 27 20 19 19 19 24 19 19 14 ...
## $ emp_satisfaction : num [1:311] 5 3 3 5 4 5 3 4 3 5 ...
## $ special_projects_count : num [1:311] 0 6 0 0 0 0 4 0 0 6 ...
## $ days_late_last30 : num [1:311] 0 0 0 0 0 0 0 0 0 0 ...
## $ absences : num [1:311] 1 17 3 15 2 15 19 19 4 16 ...
## $ salary : num [1:311] 62506 104437 64955 64991 50825 ...
## $ status : num [1:311] 0 1 1 1 0 0 0 0 0 0 ...
## $ log10_salary : num [1:311] 4.8 5.02 4.81 4.81 4.71 ...
library(robustbase)
lm_ih <- lmrob(salary ~ ., data = dataset2 %>% dplyr::select(-status, -termd, -log10_salary))
## Loaded glmnet 4.1-8
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
Unlike classical linear regression, LASSO regression (Least Absolute Shrinkage and Selection Operator) introduces a regularization term based on the absolute values of the model parameters.
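The regularization term is the product of two components: the hyperparameter \(\lambda\) and the L1 norm of the coefficient vector, \(\sum_{j=1}^{p} |\beta_j|\).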
The parameter \(\lambda\) introduces bias in the estimation process but helps reduce model variance.
The L1 regularization term allows insignificant coefficients to be shrunk toward zero or forced exactly to zero, which enables automatic variable selection, retaining only the predictors that have a significant influence on the response variable.
The LASSO optimization problem can be written as:
\[ RSS_{LASSO} = \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j| \]
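where \(y\) is the response vector, \(X\) is the matrix of predictors, \(\beta\) is the vector of regression coefficients, and \(\lambda \ge 0\) is the regularization hyperparameter.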
A larger value of \(\lambda\) increases the shrinkage effect on the coefficients, potentially setting some coefficients exactly to zero and thus performing feature selection.
The solution of the LASSO regression optimization problem is typically obtained using numerical algorithms, such as coordinate descent, the method implemented in the glmnet package.
For a single predictor, the explicit solution can be expressed using the soft-thresholding operator:
\[ \widehat{\beta_j} = \text{sign}\left(\widehat{\beta_j^{OLS}}\right) \cdot \max\left(0, \left|\widehat{\beta_j^{OLS}}\right| - \frac{\lambda}{2} \right) \]
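where \(\widehat{\beta_j^{OLS}}\) is the ordinary least squares estimate of the coefficient and \(\text{sign}(\cdot)\) returns its sign.
The hyperparameter \(\lambda\) reduces the magnitude of the coefficient and forces it to become exactly zero when the absolute value of the OLS estimate does not exceed the threshold \(\lambda/2\).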
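The soft-thresholding rule can be checked numerically. The sketch below assumes a single predictor without intercept, scaled so that the sum of its squares equals 1 (the setting in which the closed form with threshold \(\lambda/2\) applies directly), and compares the closed form with a direct minimization of \(RSS + \lambda|\beta|\). The data and the name `soft_threshold()` are illustrative assumptions.

```r
# Soft-thresholding operator from the formula above (threshold = lambda / 2).
soft_threshold <- function(beta_ols, lambda) {
  sign(beta_ols) * pmax(0, abs(beta_ols) - lambda / 2)
}

# Illustrative single-predictor data; x is scaled so that sum(x^2) = 1.
x <- c(-2, -1, 0, 1, 2)
x <- x / sqrt(sum(x^2))
y <- c(-3.1, -1.4, 0.2, 1.6, 2.9)

lambda   <- 3
beta_ols <- sum(x * y)                        # OLS slope (no intercept)
closed   <- soft_threshold(beta_ols, lambda)  # closed-form LASSO estimate

# Numerical check: minimise RSS + lambda * |beta| directly.
objective   <- function(b) sum((y - x * b)^2) + lambda * abs(b)
numeric_min <- optimize(objective, interval = c(-10, 10), tol = 1e-10)$minimum

c(ols = beta_ols, closed_form = closed, numerical = numeric_min)
soft_threshold(0.5, lambda)   # |0.5| <= lambda / 2, so the coefficient is set to 0
```

The closed-form and numerical estimates agree, and both are smaller in magnitude than the OLS slope by \(\lambda/2\).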
The objective function of the LASSO regression model can be written as:
\[ J_{LASSO}(\beta) = \sum_{i=1}^{n} \left( y_i - \left( \sum_{j=1}^{p}\beta_j x_{i,j} + \beta_0 \right) \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \]
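where \(y_i\) is the response for observation \(i\), \(x_{i,j}\) is the value of predictor \(j\) for observation \(i\), \(\beta_0\) is the intercept, \(\beta_j\) are the regression coefficients, \(n\) is the number of observations, and \(p\) is the number of predictors.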
The cost function minimized in LASSO regression is:
\[ J_{LASSO}(\beta) = RSS_{OLS} + \lambda \sum_{j=1}^{p} |\beta_j| \]
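or, up to a rescaling of \(\lambda\), in the form that divides the residual sum of squares by \(2n\) (the parameterization used by glmnet):
\[ \min_{\beta_0,\beta} \left[ \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{i,j} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right] \]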
The LASSO regression model aims to minimize this objective function.
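A minimal sketch of how this objective can be minimized by cyclic coordinate descent is given below. It targets the scaled form with the \(1/(2n)\) factor, so each update soft-thresholds at \(\lambda\) rather than \(\lambda/2\). The function names `soft()` and `lasso_cd()` are hypothetical, the built-in `mtcars` data stand in for the HR data, and the predictors are standardized beforehand. This is an illustration of the technique, not the implementation used by any particular package.

```r
# Soft-thresholding at level gamma.
soft <- function(z, gamma) sign(z) * pmax(0, abs(z) - gamma)

# Hypothetical helper: cyclic coordinate descent for
# (1 / (2n)) * RSS + lambda * sum(|beta_j|), with an unpenalised intercept.
lasso_cd <- function(X, y, lambda, n_iter = 500) {
  n  <- nrow(X)
  xm <- colMeans(X)
  ym <- mean(y)
  Xc <- sweep(X, 2, xm)   # centre predictors so the intercept separates out
  yc <- y - ym
  beta <- rep(0, ncol(X))
  for (iter in seq_len(n_iter)) {
    for (j in seq_along(beta)) {
      # Partial residual: remove the fit of every predictor except j.
      r_j <- yc - Xc[, -j, drop = FALSE] %*% beta[-j]
      z   <- sum(Xc[, j] * r_j) / n
      beta[j] <- soft(z, lambda) / (sum(Xc[, j]^2) / n)
    }
  }
  c(intercept = ym - sum(xm * beta), setNames(beta, colnames(X)))
}

# Stand-in data: three standardised predictors from the built-in mtcars set.
X <- scale(as.matrix(mtcars[, c("wt", "hp", "qsec")]))
y <- mtcars$mpg

fit_ols   <- lasso_cd(X, y, lambda = 0)     # no penalty: reproduces OLS
fit_mid   <- lasso_cd(X, y, lambda = 1)     # moderate shrinkage
fit_large <- lasso_cd(X, y, lambda = 100)   # all slopes are set to zero
round(rbind(fit_ols, fit_mid, fit_large), 3)
```

With `lambda = 0` the estimates coincide with those of `lm()`, while for a sufficiently large `lambda` only the intercept, equal to the mean of the response, remains.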
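The regularization term introduced in the LASSO regression model controls the bias–variance trade-off as follows: when \(\lambda = 0\), the penalty vanishes and the model reduces to OLS; as \(\lambda\) increases, the coefficients are shrunk more strongly, which raises bias and lowers variance; and when \(\lambda\) is sufficiently large, all slope coefficients are set to zero: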
\[ \beta_j = 0, \quad \forall j \]
In this situation, the conditional mean reduces to:
\[ \hat{y} = \beta_0 \]
meaning that the predictions no longer depend on the predictor variables.
Therefore, the LASSO model collapses into a constant model, which leads to underfitting, since the model is no longer able to learn meaningful patterns from the data.
The LASSO regression model is relatively flexible: it does not require the absence of multicollinearity among the predictor variables. In addition, estimating the regression coefficients does not explicitly require the error terms to follow a normal distribution.
Thus, LASSO can estimate model parameters even when the predictors are highly correlated. However, it is important to note that the other assumptions of the classical multiple linear regression model remain generally applicable.
Figure 4 illustrates the graphical representation of how the optimal solution is obtained by minimizing the objective function in both classical regression (OLS) and LASSO regression.
The x and y axes represent the values of two regression parameters of the model.
The solid concentric circles represent the residual sum of squares (RSS) for different combinations of the regression parameters. Each circle corresponds to a constant level of the objective function.
The dotted diamonds represent the additional constraint introduced by the L1 regularization term and the hyperparameter \(\lambda\) used in the LASSO model.
The optimal LASSO solution is obtained at the point where the smallest RSS contour first intersects the L1 constraint region. Due to the diamond-shaped constraint, the intersection often occurs on one of the axes, which forces some regression coefficients to become exactly zero. This property explains why LASSO performs automatic variable selection.
In the following section, we briefly present the main advantages and disadvantages of the LASSO regression model.
Reduces overfitting
One of the major benefits of LASSO regression is its ability to control
overfitting, which improves the model’s performance
when applied to new, unseen data. This is particularly useful when the
dataset contains a large number of predictors.
Automatic variable selection
The regularization term forces some regression coefficients to become
exactly zero. As a result, insignificant predictors are automatically
removed from the model, which reduces model complexity and improves
interpretability.
Efficient for high-dimensional data
LASSO regression can be applied to datasets where the number of
predictors is larger than the number of observations. This makes it
particularly useful in high-dimensional settings such as machine
learning or genomic data analysis.
Sensitivity to multicollinearity
When predictors are strongly correlated (multicollinearity), classical
linear regression becomes unstable because the coefficients may be
poorly estimated. Although LASSO introduces a regularization term that
stabilizes coefficient estimation, it is not always suitable when there
are many highly correlated predictors.
In such cases, LASSO tends to keep only one predictor from a group of
correlated variables while eliminating the others, which may lead to
loss of useful information.
Lower efficiency when many predictors are relevant
In datasets where many predictors have a significant influence on the
response variable but are also highly correlated, LASSO may eliminate
important independent variables.
Introduces bias for large coefficients
Because the regularization term constrains the magnitude of the
regression coefficients, predictors with large coefficients may be
excessively penalized. This can introduce bias in coefficient
estimates, potentially reducing predictive accuracy.
Assumes linear relationships
LASSO regression is a regularized version of the classical
linear regression model, therefore it assumes a
linear relationship between predictors and the response
variable. If the data exhibit more complex nonlinear relationships,
alternative modeling approaches may be required.
Regression model estimation
##
## Call: cv.glmnet(x = x, y = y, alpha = 1)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 1053 28 438113709 134852220 6
## 1se 9818 4 570601502 162261764 3
## 20 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 62412.0042
## married_id .
## marital_status_id .
## gender_id .
## emp_status_id .
## dept_id2 160198.4396
## dept_id3 23184.3082
## dept_id4 18979.5455
## dept_id5 -10519.3913
## dept_id6 .
## dept_id7 .
## perf_score_id 2472.5875
## from_diversity_job_fair_id .
## termd .
## position_id .
## emp_satisfaction .
## special_projects_count .
## days_late_last30 .
## absences 142.9105
## status .
Interpretation (cross-validation folds are drawn at random, so the coefficient values can differ between runs):
special_projects_count (-2265.49) → a negative predictor, though not the one with the strongest effect on salary. Each additional special project is associated with a decrease in salary of about 2265.49 units.
dept_id (169629) → has a large effect that depends on the department coding: some departments pay much more or much less than the baseline department.
perf_score_id (6789.2) → performance scores have a positive effect on salary.
absences (304.25) → has a minor positive effect on salary (interesting; this could reflect correlation with other variables).
position_id (30.526) → has a small positive effect on salary.
marital_status_id (-705.29) → has a small negative effect on salary.
| Metric | Value |
|---|---|
| lambda.min | 1.05e+03 |
| lambda.1se | 9.82e+03 |
| nobs | 311 |
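The report prints only the call `cv.glmnet(x = x, y = y, alpha = 1)`; how `x` and `y` were built is not shown. A minimal sketch of one way to set up such a fit is given below, assuming a design matrix built with `model.matrix()` so that a factor such as `dept_id` is expanded into dummy columns. The data are simulated and only borrow column names from the report, so the output will not reproduce the values above.

```r
library(glmnet)

# Toy data (illustrative, not the HR dataset); column names mirror the report.
set.seed(123)
n <- 200
toy <- data.frame(
  dept_id       = factor(sample(1:4, n, replace = TRUE)),
  perf_score_id = sample(1:4, n, replace = TRUE),
  absences      = sample(1:20, n, replace = TRUE),
  gender_id     = sample(0:1, n, replace = TRUE)
)
toy$salary <- 60000 + 15000 * (toy$dept_id == "2") +
  3000 * toy$perf_score_id + rnorm(n, sd = 8000)

# model.matrix() expands the factor into dummy columns; the intercept column
# is dropped because glmnet fits its own intercept.
x <- model.matrix(salary ~ ., data = toy)[, -1]
y <- toy$salary

cv_fit <- cv.glmnet(x = x, y = y, alpha = 1)   # alpha = 1 selects the L1 (LASSO) penalty
coef(cv_fit, s = "lambda.min")                 # "." marks coefficients shrunk to zero

# In-sample mean squared error at lambda.min (one possible performance measure).
pred <- predict(cv_fit, newx = x, s = "lambda.min")
mean((y - pred)^2)
```

Calling `set.seed()` before `cv.glmnet()` keeps the fold assignment, and therefore `lambda.min` and the selected coefficients, reproducible between runs.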
LASSO can eliminate variables by forcing their regression coefficients to be exactly zero.
Performance of the Lasso model
## [1] 316251360
LASSO shrinks the coefficients of unimportant variables to zero, performing variable selection while mitigating multicollinearity and overfitting. The final model therefore focuses only on the variables that actually explain variation in salary. In this case, the retained predictors are dept_id, perf_score_id, special_projects_count, absences, marital_status_id, and position_id, while married_id, gender_id, from_diversity_job_fair_id, emp_satisfaction, days_late_last30, and status are dropped.
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.
Kuhn et al., (2020). Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles. https://www.tidymodels.org
Waring E, Quinn M, McNamara A, Arino de la Rubia E, Zhu H, Ellis S (2022). skimr: Compact and Flexible Summaries of Data. R package version 2.1.5, https://CRAN.R-project.org/package=skimr.
Peterson BG, Carl P (2020). PerformanceAnalytics: Econometric Tools for Performance and Risk Analysis. R package version 2.0.4, https://CRAN.R-project.org/package=PerformanceAnalytics.
Wickham H, Pedersen T, Seidel D (2025). scales: Scale Functions for Visualization. R package version 1.4.0, https://CRAN.R-project.org/package=scales.
Fox J, Weisberg S (2019). An R Companion to Applied Regression, Third edition. Sage, Thousand Oaks CA. https://socialsciences.mcmaster.ca/jfox/Books/Companion/.
Schloerke B, Cook D, Larmarange J, Briatte F, Marbach M, Thoen E, Elberg A, Crowley J (2021). GGally: Extension to ‘ggplot2’. R package version 2.1.2, https://CRAN.R-project.org/package=GGally.
Taiyun Wei and Viliam Simko (2021). R package ‘corrplot’: Visualization of a Correlation Matrix (Version 0.92). Available from https://github.com/taiyun/corrplot
Rubba C (2023). htmltab: Assemble Data Frames from HTML Tables. R package version 0.8.2.9000, https://github.com/htmltab/htmltab.
Brandon M. Greenwell and Bradley C. Boehmke (2020). Variable Importance Plots—An Introduction to the vip Package. The R Journal, 12(1), 343–366. URL https://doi.org/10.32614/RJ-2020-013.
Marvin N. Wright, Andreas Ziegler (2017). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, 77(1), 1-17. doi:10.18637/jss.v077.i01
Robinson D, Hayes A, Couch S (2023). broom: Convert Statistical Objects into Tidy Tibbles. R package version 1.0.5, https://CRAN.R-project.org/package=broom.
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
Wickham H, Hester J, Bryan J (2023). readr: Read Rectangular Text Data. R package version 2.1.4, https://CRAN.R-project.org/package=readr.
Zhu H (2021). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.3.4, https://CRAN.R-project.org/package=kableExtra.
Cui B (2020). DataExplorer: Automate Data Exploration and Treatment. R package version 0.8.2, https://CRAN.R-project.org/package=DataExplorer.
Rushworth A (2022). inspectdf: Inspection, Comparison and Visualisation of Data Frames. R package version 0.0.12, https://CRAN.R-project.org/package=inspectdf.
Grosjean P, Ibanez F (2018). pastecs: Package for Analysis of Space-Time Ecological Series. R package version 1.3.21, https://CRAN.R-project.org/package=pastecs.
https://www.kaggle.com/datasets/rhuebner/human-resources-data-set
William Revelle (2023). psych: Procedures for Psychological, Psychometric, and Personality Research. Northwestern University, Evanston, Illinois. R package version 2.3.9, https://CRAN.R-project.org/package=psych.
B. Venables, Modern Applied Statistics with S, 4th edition, Springer-Verlag, 2002, DOI: 10.1007/b97626.
Friedman J, Tibshirani R, Hastie T (2010). “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software, 33(1), 1-22. doi:10.18637/jss.v033.i01 https://doi.org/10.18637/jss.v033.i01.
Martin Maechler, Peter Rousseeuw, Christophe Croux, Valentin Todorov, Andreas Ruckstuhl, Matias Salibian-Barrera, Tobias Verbeke, Manuel Koller, Eduardo L. T. Conceicao and Maria Anna di Palma (2023). robustbase: Basic Robust Statistics. R package version 0.99-1. URL http://CRAN.R-project.org/package=robustbase
https://www.linkedin.com/pulse/lasso-regression-clearly-explained-bhabani-shankear-basak/ accessed on 17.12.2024.
U. Riswanto, Ridge regression is a fantastic choice when you need a balance between flexibility and simplicity, especially in cases where you have lots of features or multicollinearity issues, https://ujangriswanto08.medium.com/a-beginners-guide-to-ridge-regression-and-regularization-in-machine-learning-4aeae6ec7680 accessed on 25.12.2024.
https://www.quora.com/What-are-the-pros-and-cons-of-lasso-regression accessed on 19.12.2024.
https://medium.com/@shruti.dhumne/what-is-lasso-regression-bd44addc448c accessed on 19.12.2024.
J. Gallier, J. Quaintance, Solving the Elastic Net and Lasso Regression Problems, 2024, https://www.cis.upenn.edu/~cis5150/elastic-net.pdf accessed on 19.12.2024.
H. Zou, T. Hastie, Regularization and Variable Selection via the Elastic Net, 2004, https://statanaly.com/wp-content/uploads/2023/04/elasticnet.pdf accessed on 19.12.2024.
E. Rodola, A. Torsello, T. Harada, Y. Kuniyoshi, D. Cremers, Elastic Net Constraints for Shape Matching, https://cvg.cit.tum.de/_media/spezial/bib/rodola-iccv13.pdf accessed on 19.12.2024.
P. Mohan, Ridge, Lasso & Elastic Net Regression, 2021, https://blog.devgenius.io/ridge-lasso-elastic-net-regression-2ea752186e51 accessed on 17.12.2024.
https://dev.to/harsimranjit_singh_0133dc/elastic-net-regularization-balancing-between-l1-and-l2-penalties-3ib7 accessed on 17.12.2024.
Arthur E. Hoerl, Robert W. Kennard, TECHNOMETRICS, 1970, 12, 1, Ridge Regression: Biased Estimation for Nonorthogonal Problems.
Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning Data Mining, Inference, and Prediction, 2017, Springer Science+Business Media, ISBN: 978-0-387-84857-0.
Xavier Bourret Sicotte, Ridge and Lasso: visualizing the optimal solutions, 2018, https://xavierbourretsicotte.github.io/ridge_lasso_visual.html.
Hui Zou and Trevor Hastie, Regularization and variable selection via the elastic net, J. R. Statist. Soc. B, 2005, 67, Part 2, pp. 301–320.