Scoring model

Analyzed data set: HRDataset_v14 available here

Loading the necesary packages…

Project Objectives

Exploratory data analysis - EDA was done in another post and available here
Identification and detailed analysis of the studied phenomenon through the development of scoring models
Interpretation of results and presentation of conclusions

Dataset description

In order to determine the factors that significantly influence an individual’s salary within a company, we selected the dataset entitled Human Resources Data Set, published on the Kaggle platform: https://www.kaggle.com/datasets/rhuebner/human-resources-data-set

This dataset was created by Dr. Carla Patalano and Dr. Rich. It was designed as an educational resource to help students learn how to perform exploratory data analysis (EDA). The dataset provides a wide range of features that enable both data visualization and the development of machine learning / predictive analytics models.

Within this dataset, we decided to explore and attempt to answer several research questions, such as:

Are there areas within the company where salary distribution is not equitable?
Is an individual’s salary influenced by any specific factors present in the dataset?
Can we build a scoring model capable of estimating an employee’s salary? If so, to what extent is the model accurate?

loading the dataset

Dataset description

The structure of the dataset

glimpse(HRDataset_v14)

## Rows: 311
## Columns: 36
## $ Employee_Name              <chr> "Adinolfi, Wilson  K", "Ait Sidi, Karthikey…
## $ EmpID                      <dbl> 10026, 10084, 10196, 10088, 10069, 10002, 1…
## $ MarriedID                  <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ MaritalStatusID            <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ GenderID                   <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ EmpStatusID                <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ DeptID                     <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ PerfScoreID                <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ FromDiversityJobFairID     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ Salary                     <dbl> 62506, 104437, 64955, 64991, 50825, 57568, …
## $ Termd                      <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ PositionID                 <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ Position                   <chr> "Production Technician I", "Sr. DBA", "Prod…
## $ State                      <chr> "MA", "MA", "MA", "MA", "MA", "MA", "MA", "…
## $ Zip                        <chr> "01960", "02148", "01810", "01886", "02169"…
## $ DOB                        <chr> "07/10/83", "05/05/75", "09/19/88", "09/27/…
## $ Sex                        <chr> "M", "M", "F", "F", "F", "F", "F", "M", "F"…
## $ MaritalDesc                <chr> "Single", "Married", "Married", "Married", …
## $ CitizenDesc                <chr> "US Citizen", "US Citizen", "US Citizen", "…
## $ HispanicLatino             <chr> "No", "No", "No", "No", "No", "No", "No", "…
## $ RaceDesc                   <chr> "White", "White", "White", "White", "White"…
## $ DateofHire                 <chr> "7/5/2011", "3/30/2015", "7/5/2011", "1/7/2…
## $ DateofTermination          <chr> NA, "6/16/2016", "9/24/2012", NA, "9/6/2016…
## $ TermReason                 <chr> "N/A-StillEmployed", "career change", "hour…
## $ EmploymentStatus           <chr> "Active", "Voluntarily Terminated", "Volunt…
## $ Department                 <chr> "Production", "IT/IS", "Production", "Produ…
## $ ManagerName                <chr> "Michael Albert", "Simon Roup", "Kissy Sull…
## $ ManagerID                  <dbl> 22, 4, 20, 16, 39, 11, 10, 19, 12, 7, 14, 2…
## $ RecruitmentSource          <chr> "LinkedIn", "Indeed", "LinkedIn", "Indeed",…
## $ PerformanceScore           <chr> "Exceeds", "Fully Meets", "Fully Meets", "F…
## $ EngagementSurvey           <dbl> 4.60, 4.96, 3.02, 4.84, 5.00, 5.00, 3.04, 5…
## $ EmpSatisfaction            <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ SpecialProjectsCount       <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ LastPerformanceReview_Date <chr> "1/17/2019", "2/24/2016", "5/15/2012", "1/3…
## $ DaysLateLast30             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Absences                   <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…

The key findings from the EDA and their implications for further modeling

Based on EDA analysis we so that:

Salary exhibits non-normal behavior and strong right skewness.
Extreme values (executive-level salaries) may influence regression results.
Transformation techniques (e.g., log transformation of salary) may improve model performance.
Categorical variables such as department, position, and employment status are likely strong predictors of salary.
Gender alone may not fully explain salary variation without controlling for position and department.

## Rows: 311
## Columns: 15
## $ married_id                 <fct> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ marital_status_id          <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ gender_id                  <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ emp_status_id              <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ dept_id                    <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ perf_score_id              <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ from_diversity_job_fair_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ termd                      <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ position_id                <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ emp_satisfaction           <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ special_projects_count     <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ days_late_last30           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ absences                   <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…
## $ status                     <chr> "no_married", "married", "married", "marrie…
## $ log10_salary               <dbl> 4.795922, 5.018854, 4.812613, 4.812853, 4.7…

## Rows: 311
## Columns: 15
## $ married_id                 <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ marital_status_id          <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ gender_id                  <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ emp_status_id              <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ dept_id                    <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ perf_score_id              <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ from_diversity_job_fair_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ termd                      <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ position_id                <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ emp_satisfaction           <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ special_projects_count     <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ days_late_last30           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ absences                   <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…
## $ salary                     <int> 62506, 104437, 64955, 64991, 50825, 57568, …
## $ status                     <chr> "no_married", "married", "married", "marrie…

Correlation matrix

MRLM - Linear regression model

Generating the regression model

Just a regression …

lm_ih <- lm(salary ~ ., data = dataset)
lm_ih

## 
## Call:
## lm(formula = salary ~ ., data = dataset)
## 
## Coefficients:
##                (Intercept)                  married_id  
##                    95093.0                       443.8  
##          marital_status_id                   gender_id  
##                    -1133.9                      -351.1  
##              emp_status_id                     dept_id  
##                    -1607.4                     -8722.4  
##              perf_score_id  from_diversity_job_fair_id  
##                     6392.3                      3462.4  
##                      termd                 position_id  
##                     3924.0                      -545.7  
##           emp_satisfaction      special_projects_count  
##                      432.2                      2625.3  
##           days_late_last30                    absences  
##                     1663.1                       324.8  
##                     status  
##                         NA

# lm_ih <- lm(log10_salary ~ ., data = dataset1 %>% dplyr::select(-salary))
# lm_ih

summary(lm_ih)

## 
## Call:
## lm(formula = salary ~ ., data = dataset)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -56336  -9497  -2108   5279 159658 
## 
## Coefficients: (1 not defined because of singularities)
##                            Estimate Std. Error t value     Pr(>|t|)    
## (Intercept)                 95093.0    17138.2   5.549 0.0000000638 ***
## married_id                    443.8     2520.6   0.176       0.8604    
## marital_status_id           -1133.9     1297.3  -0.874       0.3828    
## gender_id                    -351.1     2433.9  -0.144       0.8854    
## emp_status_id               -1607.4     2274.8  -0.707       0.4804    
## dept_id                     -8722.4     1932.4  -4.514 0.0000091892 ***
## perf_score_id                6392.3     3117.4   2.051       0.0412 *  
## from_diversity_job_fair_id   3462.4     4256.8   0.813       0.4167    
## termd                        3924.0     8436.1   0.465       0.6422    
## position_id                  -545.7      219.7  -2.483       0.0136 *  
## emp_satisfaction              432.2     1394.5   0.310       0.7568    
## special_projects_count       2625.3      796.4   3.296       0.0011 ** 
## days_late_last30             1663.1     1418.7   1.172       0.2420    
## absences                      324.8      207.6   1.564       0.1188    
## status                           NA         NA      NA           NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20970 on 297 degrees of freedom
## Multiple R-squared:  0.3345, Adjusted R-squared:  0.3053 
## F-statistic: 11.48 on 13 and 297 DF,  p-value: < 0.00000000000000022

Identifying the variable that is strongly correlated with status

Nr.Crt.	Variable	Value
1	(Intercept)	0
2	married_id	1
3	marital_status_id	0
4	gender_id	0
5	emp_status_id	0
6	dept_id	0
7	perf_score_id	0
8	from_diversity_job_fair_id	0
9	termd	0
10	position_id	0
11	emp_satisfaction	0
12	special_projects_count	0
13	days_late_last30	0
14	absences	0

The importance of the variables

Check assumptions of linear regression

library(performance)
par(mfrow = c(2, 2))
plot(lm_ih, pch = 16, col = '#006EA1')

par(mfrow = c(1, 1))

The estimated values of the regression coefficients

Nr.Crt.	term	estimate	std.error	statistic	p.value	signif
1	(Intercept)	9.51e+04	1.71e+04	5.55	6.38e-08	s
2	married_id	444	2.52e+03	0.176	0.86	ns
3	marital_status_id	-1.13e+03	1.3e+03	-0.874	0.383	ns
4	gender_id	-351	2.43e+03	-0.144	0.885	ns
5	emp_status_id	-1.61e+03	2.27e+03	-0.707	0.48	ns
6	dept_id	-8.72e+03	1.93e+03	-4.51	9.19e-06	s
7	perf_score_id	6.39e+03	3.12e+03	2.05	0.0412	s
8	from_diversity_job_fair_id	3.46e+03	4.26e+03	0.813	0.417	ns
9	termd	3.92e+03	8.44e+03	0.465	0.642	ns
10	position_id	-546	220	-2.48	0.0136	s
11	emp_satisfaction	432	1.39e+03	0.31	0.757	ns
12	special_projects_count	2.63e+03	796	3.3	0.0011	s
13	days_late_last30	1.66e+03	1.42e+03	1.17	0.242	ns
14	absences	325	208	1.56	0.119	ns
15	status

The validation metrics of the model

Nr.Crt.	var	values
1	r.squared	0.334
2	adj.r.squared	0.305
3	sigma	2.1e+04
4	statistic	11.5
5	p.value	4.94e-20
6	df	13
7	logLik	-3.53e+03
8	AIC	7.09e+03
9	BIC	7.14e+03
10	deviance	1.31e+11
11	df.residual	297
12	nobs	311

Cooks distance, predicted values, residuals and influntial points

## OK: No outliers detected.
## - Based on the following method and threshold: cook (1).
## - For variable: (Whole model)

VIF values obtained for the variables include in the model

Term	VIF	VIF_CI_low	VIF_CI_high	SE_factor	Tolerance	Tolerance_CI_low	Tolerance_CI_high
married_id	1.08	1.02	1.37	1.04	0.928	0.73	0.984
marital_status_id	1.06	1.01	1.45	1.03	0.947	0.689	0.993
gender_id	1.03	1	2.28	1.01	0.971	0.439	0.999
emp_status_id	11.7	9.63	14.4	3.43	0.0851	0.0695	0.104
dept_id	2.44	2.08	2.91	1.56	0.41	0.344	0.48
perf_score_id	2.36	2.02	2.81	1.54	0.423	0.355	0.495
from_diversity_job_fair_id	1.08	1.02	1.36	1.04	0.923	0.735	0.981
termd	11.2	9.19	13.7	3.35	0.0892	0.0729	0.109
position_id	1.32	1.19	1.55	1.15	0.758	0.647	0.843
emp_satisfaction	1.13	1.05	1.36	1.06	0.882	0.734	0.953
special_projects_count	2.47	2.11	2.95	1.57	0.405	0.339	0.474
days_late_last30	2.38	2.04	2.83	1.54	0.42	0.353	0.491
absences	1.04	1	1.65	1.02	0.96	0.608	0.997

We observe that status variable is NA meaning that between this variable and another variable there is prefect collinearity, suggesting that one or more variables are exact linear combinations of others. As a result, the design matrix becomes singular and VIF cannot be computed.

The variable that is perfect correlated with status is: married_id

# vif(lm_ih)

So we are going to eliminate the status variable and termd variable (highly correlated with emp_status_id; the value of the correlation coefficient is 0.91) from the dataset and rebuild the model.

Generating a new regression model without including the status and termd variables

Estimate regression model

lm_ih <- lm(salary ~ ., data = dataset %>% dplyr::select(-status, -termd))
lm_ih

## 
## Call:
## lm(formula = salary ~ ., data = dataset %>% dplyr::select(-status, 
##     -termd))
## 
## Coefficients:
##                (Intercept)                  married_id  
##                    93194.6                       413.6  
##          marital_status_id                   gender_id  
##                    -1132.3                      -324.3  
##              emp_status_id                     dept_id  
##                     -603.2                     -8628.1  
##              perf_score_id  from_diversity_job_fair_id  
##                     6594.0                      3185.4  
##                position_id            emp_satisfaction  
##                     -560.5                       406.1  
##     special_projects_count            days_late_last30  
##                     2664.4                      1815.1  
##                   absences  
##                      327.0

summary(lm_ih)

## 
## Call:
## lm(formula = salary ~ ., data = dataset %>% dplyr::select(-status, 
##     -termd))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -56169  -9542  -1754   5019 160081 
## 
## Coefficients:
##                            Estimate Std. Error t value     Pr(>|t|)    
## (Intercept)                 93194.6    16623.2   5.606 0.0000000471 ***
## married_id                    413.6     2516.4   0.164     0.869563    
## marital_status_id           -1132.3     1295.6  -0.874     0.382829    
## gender_id                    -324.3     2430.0  -0.133     0.893919    
## emp_status_id                -603.2      715.8  -0.843     0.400069    
## dept_id                     -8628.1     1919.2  -4.496 0.0000099409 ***
## perf_score_id                6594.0     3083.0   2.139     0.033265 *  
## from_diversity_job_fair_id   3185.4     4209.4   0.757     0.449816    
## position_id                  -560.5      217.1  -2.581     0.010318 *  
## emp_satisfaction              406.1     1391.6   0.292     0.770612    
## special_projects_count       2664.4      790.9   3.369     0.000855 ***
## days_late_last30             1815.1     1378.8   1.316     0.189032    
## absences                      327.0      207.3   1.577     0.115772    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20940 on 298 degrees of freedom
## Multiple R-squared:  0.334,  Adjusted R-squared:  0.3072 
## F-statistic: 12.45 on 12 and 298 DF,  p-value: < 0.00000000000000022

The importance of the variables

The estimated values of the regression coefficients

Nr.Crt.	term	estimate	std.error	statistic	p.value	signif
1	(Intercept)	9.32e+04	1.66e+04	5.61	4.71e-08	s
2	married_id	414	2.52e+03	0.164	0.87	ns
3	marital_status_id	-1.13e+03	1.3e+03	-0.874	0.383	ns
4	gender_id	-324	2.43e+03	-0.133	0.894	ns
5	emp_status_id	-603	716	-0.843	0.4	ns
6	dept_id	-8.63e+03	1.92e+03	-4.5	9.94e-06	s
7	perf_score_id	6.59e+03	3.08e+03	2.14	0.0333	s
8	from_diversity_job_fair_id	3.19e+03	4.21e+03	0.757	0.45	ns
9	position_id	-560	217	-2.58	0.0103	s
10	emp_satisfaction	406	1.39e+03	0.292	0.771	ns
11	special_projects_count	2.66e+03	791	3.37	0.000855	s
12	days_late_last30	1.82e+03	1.38e+03	1.32	0.189	ns
13	absences	327	207	1.58	0.116	ns

The validation metrics of the model

Nr.Crt.	var	values
1	r.squared	0.334
2	adj.r.squared	0.307
3	sigma	2.09e+04
4	statistic	12.5
5	p.value	1.5e-20
6	df	12
7	logLik	-3.53e+03
8	AIC	7.09e+03
9	BIC	7.14e+03
10	deviance	1.31e+11
11	df.residual	298
12	nobs	311

VIF values obtained for the variables include in the model

Term	VIF	VIF_CI_low	VIF_CI_high	SE_factor	Tolerance	Tolerance_CI_low	Tolerance_CI_high
married_id	1.08	1.02	1.38	1.04	0.929	0.727	0.985
marital_status_id	1.06	1.01	1.46	1.03	0.947	0.686	0.993
gender_id	1.03	1	2.38	1.01	0.972	0.42	0.999
emp_status_id	1.17	1.07	1.39	1.08	0.857	0.721	0.933
dept_id	2.41	2.06	2.88	1.55	0.415	0.347	0.485
perf_score_id	2.32	1.98	2.76	1.52	0.432	0.362	0.504
from_diversity_job_fair_id	1.06	1.01	1.42	1.03	0.941	0.705	0.991
position_id	1.29	1.16	1.52	1.14	0.775	0.659	0.859
emp_satisfaction	1.13	1.05	1.36	1.06	0.883	0.734	0.954
special_projects_count	2.44	2.08	2.92	1.56	0.41	0.343	0.48
days_late_last30	2.25	1.93	2.68	1.5	0.444	0.373	0.518
absences	1.04	1	1.67	1.02	0.961	0.598	0.998

We will rebuild the model including only the significant variables.

## 
## Call:
## lm(formula = salary ~ ., data = dataset %>% dplyr::select(-status, 
##     -termd))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -56169  -9542  -1754   5019 160081 
## 
## Coefficients:
##                            Estimate Std. Error t value     Pr(>|t|)    
## (Intercept)                 93194.6    16623.2   5.606 0.0000000471 ***
## married_id                    413.6     2516.4   0.164     0.869563    
## marital_status_id           -1132.3     1295.6  -0.874     0.382829    
## gender_id                    -324.3     2430.0  -0.133     0.893919    
## emp_status_id                -603.2      715.8  -0.843     0.400069    
## dept_id                     -8628.1     1919.2  -4.496 0.0000099409 ***
## perf_score_id                6594.0     3083.0   2.139     0.033265 *  
## from_diversity_job_fair_id   3185.4     4209.4   0.757     0.449816    
## position_id                  -560.5      217.1  -2.581     0.010318 *  
## emp_satisfaction              406.1     1391.6   0.292     0.770612    
## special_projects_count       2664.4      790.9   3.369     0.000855 ***
## days_late_last30             1815.1     1378.8   1.316     0.189032    
## absences                      327.0      207.3   1.577     0.115772    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20940 on 298 degrees of freedom
## Multiple R-squared:  0.334,  Adjusted R-squared:  0.3072 
## F-statistic: 12.45 on 12 and 298 DF,  p-value: < 0.00000000000000022

##                 married_id          marital_status_id 
##                   1.076813                   1.055875 
##                  gender_id              emp_status_id 
##                   1.028911                   1.166447 
##                    dept_id              perf_score_id 
##                   2.412074                   2.316162 
## from_diversity_job_fair_id                position_id 
##                   1.062681                   1.290889 
##           emp_satisfaction     special_projects_count 
##                   1.131887                   2.441384 
##           days_late_last30                   absences 
##                   2.252283                   1.040777

Generating a new regression model using only the significant variables

Variables that has no impact on salary are:

term	estimate	std.error	statistic	p.value
married_id	414	2.52e+03	0.164	0.87
marital_status_id	-1.13e+03	1.3e+03	-0.874	0.383
gender_id	-324	2.43e+03	-0.133	0.894
emp_status_id	-603	716	-0.843	0.4
from_diversity_job_fair_id	3.19e+03	4.21e+03	0.757	0.45
emp_satisfaction	406	1.39e+03	0.292	0.771
days_late_last30	1.82e+03	1.38e+03	1.32	0.189
absences	327	207	1.58	0.116

## [1] "married_id, marital_status_id, gender_id, emp_status_id, from_diversity_job_fair_id, emp_satisfaction, days_late_last30, absences"

lm_ih <- lm(salary ~ ., data = dataset %>% dplyr::select(-c(status, termd, married_id, marital_status_id, gender_id, emp_status_id, from_diversity_job_fair_id, emp_satisfaction, days_late_last30, absences)))
lm_ih

## 
## Call:
## lm(formula = salary ~ ., data = dataset %>% dplyr::select(-c(status, 
##     termd, married_id, marital_status_id, gender_id, emp_status_id, 
##     from_diversity_job_fair_id, emp_satisfaction, days_late_last30, 
##     absences)))
## 
## Coefficients:
##            (Intercept)                 dept_id           perf_score_id  
##               105263.2                 -8607.9                  4063.2  
##            position_id  special_projects_count  
##                 -620.5                  2682.4

# lm_ih <- lm(log10_salary ~ ., data = dataset1 %>% dplyr::select(-c(status, termd, married_id, marital_status_id, gender_id, emp_status_id, from_diversity_job_fair_id, emp_satisfaction, days_late_last30, absences, salary)))
# lm_ih

summary(lm_ih)

## 
## Call:
## lm(formula = salary ~ ., data = dataset %>% dplyr::select(-c(status, 
##     termd, married_id, marital_status_id, gender_id, emp_status_id, 
##     from_diversity_job_fair_id, emp_satisfaction, days_late_last30, 
##     absences)))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -48859  -9725  -2244   4500 159690 
## 
## Coefficients:
##                        Estimate Std. Error t value          Pr(>|t|)    
## (Intercept)            105263.2    13558.0   7.764 0.000000000000125 ***
## dept_id                 -8607.9     1895.7  -4.541 0.000008070682410 ***
## perf_score_id            4063.2     2028.2   2.003          0.046014 *  
## position_id              -620.5      212.1  -2.925          0.003700 ** 
## special_projects_count   2682.4      770.8   3.480          0.000575 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20900 on 306 degrees of freedom
## Multiple R-squared:  0.3188, Adjusted R-squared:  0.3099 
## F-statistic: 35.81 on 4 and 306 DF,  p-value: < 0.00000000000000022

The estimated values of the regression coefficients

Nr.Crt.	term	estimate	std.error	statistic	p.value	signif
1	(Intercept)	1.05e+05	1.36e+04	7.76	1.25e-13	s
2	dept_id	-8.61e+03	1.9e+03	-4.54	8.07e-06	s
3	perf_score_id	4.06e+03	2.03e+03	2	0.046	s
4	position_id	-620	212	-2.93	0.0037	s
5	special_projects_count	2.68e+03	771	3.48	0.000575	s

The validation metrics of the model

Nr.Crt.	var	values
1	r.squared	0.319
2	adj.r.squared	0.31
3	sigma	2.09e+04
4	statistic	35.8
5	p.value	1.53e-24
6	df	4
7	logLik	-3.53e+03
8	AIC	7.08e+03
9	BIC	7.1e+03
10	deviance	1.34e+11
11	df.residual	306
12	nobs	311

The importance of the variables

VIF values obtained for the variables include in the model

Term	VIF	VIF_CI_low	VIF_CI_high	SE_factor	Tolerance	Tolerance_CI_low	Tolerance_CI_high
dept_id	2.36	2.01	2.83	1.54	0.423	0.353	0.497
perf_score_id	1.01	1	1.96e+05	1	0.994	5.09e-06	1
position_id	1.24	1.12	1.47	1.11	0.808	0.682	0.892
special_projects_count	2.33	1.99	2.79	1.53	0.43	0.358	0.504

As it can be seen from the checking assumption of the model the multicollinearity effect was eliminated. The metric of the model are the same, 33% of the variation of the dependent variable, salary, is explained by the model. We can see that the model is significant (p-value < 0.05), but not all the included predictors have an significant impact on the salary.

From the checking assumptions of the model we can observe that:

The model does not capture the actual shape of the salary distribution, suggesting potential skewness or the presence of outliers that are not well handled by linear regression.
The relationship between the predictors and salary does not appear to be fully linear, indicating possible model misspecification.
There is evidence of heteroscedasticity, as the variance of the residuals is not constant across fitted values.
Several observations exhibit relatively high leverage (e.g., 132, 27, 151), indicating the presence of influential data points that may affect the estimated regression coefficients.
Additionally, high multicollinearity is detected among some predictors, which can lead to unstable coefficient estimates and inflated standard errors. After eliminating the variables with high VIF values the model was rebuild.
The residuals are not perfectly normally distributed, likely due to positive skewness in salary values. Although this is less critical in large samples, it may still affect statistical inference, including p-values and confidence intervals.

To improve the model, the following steps we should follow:

Using robust regression techniques;
Implementing regularized models such as Ridge or Lasso regression.

Using more robust regression models

Because some of the assumption of the regression model are not checked (like homoscedasticity, the influence of the outliers, deviation from the normality) we will implement more robust regression model.

Because the homoscedasticity assumption is violated

The standard errors become robust and the regression coefficients are not changing

## 
## Call:
## lm(formula = salary ~ ., data = dataset)
## 
## Coefficients:
##                (Intercept)                  married_id  
##                    95093.0                       443.8  
##          marital_status_id                   gender_id  
##                    -1133.9                      -351.1  
##              emp_status_id                     dept_id  
##                    -1607.4                     -8722.4  
##              perf_score_id  from_diversity_job_fair_id  
##                     6392.3                      3462.4  
##                      termd                 position_id  
##                     3924.0                      -545.7  
##           emp_satisfaction      special_projects_count  
##                      432.2                      2625.3  
##           days_late_last30                    absences  
##                     1663.1                       324.8  
##                     status  
##                         NA

# lm_ih <- lm(salary ~ ., data = dataset %>% dplyr::select(-c(status, termd, married_id, marital_status_id, gender_id, emp_status_id, from_diversity_job_fair_id, emp_satisfaction, days_late_last30, absences)))
# lm_ih

# lm_ih <- lm(log10_salary ~ ., data = dataset1 %>% dplyr::select(-c(status, termd, married_id, marital_status_id, gender_id, emp_status_id, from_diversity_job_fair_id, emp_satisfaction, days_late_last30, absences, salary)))
# lm_ih

library(sandwich)
library(lmtest)

coeftest(lm_ih, vcov = vcovHC(lm_ih, type = "HC1"))

## 
## t test of coefficients:
## 
##                            Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                95092.95   38575.24  2.4651  0.01426 *
## married_id                   443.76    2561.24  0.1733  0.86256  
## marital_status_id          -1133.88     874.65 -1.2964  0.19585  
## gender_id                   -351.10    2553.90 -0.1375  0.89075  
## emp_status_id              -1607.41    1759.77 -0.9134  0.36176  
## dept_id                    -8722.36    5270.46 -1.6550  0.09899 .
## perf_score_id               6392.34    3518.32  1.8169  0.07024 .
## from_diversity_job_fair_id  3462.36    4834.52  0.7162  0.47445  
## termd                       3923.97    5617.38  0.6985  0.48539  
## position_id                 -545.66     327.55 -1.6659  0.09679 .
## emp_satisfaction             432.19    1050.52  0.4114  0.68107  
## special_projects_count      2625.25    1790.43  1.4663  0.14363  
## days_late_last30            1663.11    1366.13  1.2174  0.22442  
## absences                     324.78     185.97  1.7464  0.08177 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

for the outliers and influential observations we will use more robust regression models

for the multicollinearity we will use Ridge, Lasso and ElasticNet regression models

Robust regression model from MASS package

Regression model estimation

library(MASS)
lm_ih <- MASS::rlm(salary ~ ., data = dataset %>% dplyr::select(-status, -termd))

summary(lm_ih)

## 
## Call: rlm(formula = salary ~ ., data = dataset %>% dplyr::select(-status, 
##     -termd))
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -44276.7  -7565.1    438.9   6640.3 173976.3 
## 
## Coefficients:
##                            Value      Std. Error t value   
## (Intercept)                82230.1036  9375.5766     8.7707
## married_id                  -900.1458  1419.2794    -0.6342
## marital_status_id           -326.5630   730.7229    -0.4469
## gender_id                    146.2642  1370.5508     0.1067
## emp_status_id                 50.3524   403.7250     0.1247
## dept_id                    -5264.4981  1082.4401    -4.8635
## perf_score_id               2700.8751  1738.8486     1.5533
## from_diversity_job_fair_id   103.1237  2374.1452     0.0434
## position_id                 -298.1443   122.4571    -2.4347
## emp_satisfaction             269.4655   784.8575     0.3433
## special_projects_count      3504.2525   446.0931     7.8554
## days_late_last30             651.2826   777.6269     0.8375
## absences                     135.8248   116.9227     1.1617
## 
## Residual standard error: 10310 on 298 degrees of freedom

coeftest(lm_ih, vcov = vcovHC(lm_ih, type = "HC1"))

## 
## z test of coefficients:
## 
##                             Estimate Std. Error z value    Pr(>|z|)    
## (Intercept)                82230.104  17405.612  4.7243 0.000002309 ***
## married_id                  -900.146   1477.479 -0.6092     0.54236    
## marital_status_id           -326.563    647.397 -0.5044     0.61396    
## gender_id                    146.264   1418.480  0.1031     0.91787    
## emp_status_id                 50.352    438.760  0.1148     0.90863    
## dept_id                    -5264.498   2562.226 -2.0547     0.03991 *  
## perf_score_id               2700.875   1769.581  1.5263     0.12694    
## from_diversity_job_fair_id   103.124   3283.623  0.0314     0.97495    
## position_id                 -298.144    184.308 -1.6176     0.10574    
## emp_satisfaction             269.466    744.174  0.3621     0.71728    
## special_projects_count      3504.252    880.485  3.9799 0.000068941 ***
## days_late_last30             651.283    672.159  0.9689     0.33257    
## absences                     135.825    115.144  1.1796     0.23816    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The importance of the variables

The estimated values of the regression coefficients

Crt.	term	estimate	std.error	statistic	Decision
1	(Intercept)	8.22e+04	9.38e+03	8.77	s
2	married_id	-900	1.42e+03	-0.634	ns
3	marital_status_id	-327	731	-0.447	ns
4	gender_id	146	1.37e+03	0.107	ns
5	emp_status_id	50.4	404	0.125	ns
6	dept_id	-5.26e+03	1.08e+03	-4.86	s
7	perf_score_id	2.7e+03	1.74e+03	1.55	ns
8	from_diversity_job_fair_id	103	2.37e+03	0.0434	ns
9	position_id	-298	122	-2.43	s
10	emp_satisfaction	269	785	0.343	ns
11	special_projects_count	3.5e+03	446	7.86	s
12	days_late_last30	651	778	0.838	ns
13	absences	136	117	1.16	ns

The validation metrics of the model

Crt.	var	values
1	sigma	1.03e+04
2	converged	1
3	logLik	-3.54e+03
4	AIC	7.1e+03
5	BIC	7.15e+03
6	deviance	1.36e+11
7	nobs	311

VIF values obtained for the variables include in the model

Term	VIF	VIF_CI_low	VIF_CI_high	SE_factor	Tolerance	Tolerance_CI_low	Tolerance_CI_high
married_id	1.08	1.02	1.38	1.04	0.929	0.727	0.985
marital_status_id	1.06	1.01	1.46	1.03	0.947	0.686	0.993
gender_id	1.03	1	2.38	1.01	0.972	0.42	0.999
emp_status_id	1.17	1.07	1.39	1.08	0.857	0.721	0.933
dept_id	2.41	2.06	2.88	1.55	0.415	0.347	0.485
perf_score_id	2.32	1.98	2.76	1.52	0.432	0.362	0.504
from_diversity_job_fair_id	1.06	1.01	1.42	1.03	0.941	0.705	0.991
position_id	1.29	1.16	1.52	1.14	0.775	0.659	0.859
emp_satisfaction	1.13	1.05	1.36	1.06	0.883	0.734	0.954
special_projects_count	2.44	2.08	2.92	1.56	0.41	0.343	0.48
days_late_last30	2.25	1.93	2.68	1.5	0.444	0.373	0.518
absences	1.04	1	1.67	1.02	0.961	0.598	0.998

Robust regression model from robustbase package

For modeling salary as the dependent variable all columns in dataset except status are included as predictors.

The robust regression technique (MM) is used for modeling, reducing the influence of outliers on coefficients and combines high breakdown point with high efficiency.

Regression model estimation

library(robustbase)
lm_ih <- lmrob(salary ~ ., data = dataset %>% dplyr::select(-status, -termd))

library(robustbase)
summary(lm_ih)

## 
## Call:
## lmrob(formula = salary ~ ., data = dataset %>% dplyr::select(-status, -termd))
##  \--> method = "MM"
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -61592.3  -6537.9    479.6   6375.1 218886.4 
## 
## Coefficients:
##                            Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)                 1993.00   15712.18   0.127                0.899    
## married_id                    30.57    1219.13   0.025                0.980    
## marital_status_id            -95.04     577.71  -0.165                0.869    
## gender_id                   -116.98    1291.51  -0.091                0.928    
## emp_status_id                418.79     365.10   1.147                0.252    
## dept_id                     9119.66    2311.09   3.946            0.0000992 ***
## perf_score_id               2523.75    1679.81   1.502                0.134    
## from_diversity_job_fair_id   129.86    2974.33   0.044                0.965    
## position_id                  191.53     149.16   1.284                0.200    
## emp_satisfaction            -419.97     685.17  -0.613                0.540    
## special_projects_count      8067.66     819.51   9.845 < 0.0000000000000002 ***
## days_late_last30             265.20     557.27   0.476                0.635    
## absences                     115.12     104.15   1.105                0.270    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Robust residual standard error: 10380 
## Multiple R-squared:  0.5665, Adjusted R-squared:  0.5491 
## Convergence in 31 IRWLS iterations
## 
## Robustness weights: 
##  15 observations c(30,56,77,97,132,133,151,191,228,241,245,256,260,269,309)
##   are outliers with |weight| = 0 ( < 0.00032); 
##  28 weights are ~= 1. The remaining 268 ones are summarized as
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01717 0.87990 0.96250 0.89740 0.98750 0.99900 
## Algorithmic parameters: 
##        tuning.chi                bb        tuning.psi        refine.tol 
##  1.54764000000000  0.50000000000000  4.68506100000000  0.00000010000000 
##           rel.tol         scale.tol         solve.tol          zero.tol 
##  0.00000010000000  0.00000000010000  0.00000010000000  0.00000000010000 
##       eps.outlier             eps.x warn.limit.reject warn.limit.meanrw 
##  0.00032154340836  0.00000000005457  0.50000000000000  0.50000000000000 
##      nResample         max.it       best.r.s       k.fast.s          k.max 
##            500             50              2              1            200 
##    maxit.scale      trace.lev            mts     compute.rd fast.s.large.n 
##            200              0           1000              0           2000 
##                   psi           subsampling                   cov 
##            "bisquare"         "nonsingular"         ".vcov.avar1" 
## compute.outlier.stats 
##                  "SM" 
## seed : int(0)

It can be observe that dept_id and special_projects_count are the strongest predictors of salary.

Most of the demographic variables (married, gender, etc.) are not significant.

The variable special projects (special_projects_count) have the largest positive effect on salary (~ 8078.91 increase per project).

The observations with weight = 0 are completely ignored in coefficient estimation, which means that the model will not be distorted by extreme salaries or unusual employees.

The importance of the variables

qr(model.matrix(lm_ih))$rank

## [1] 13

ncol(model.matrix(lm_ih))

## [1] 13

## Model :
## salary ~ married_id + marital_status_id + gender_id + emp_status_id + 
##     dept_id + perf_score_id + from_diversity_job_fair_id + position_id + 
##     emp_satisfaction + special_projects_count + days_late_last30 + 
##     absences

The estimated values of the regression coefficients

Nr.Crt.	term	estimate	std.error	statistic	p.value	signif
1	(Intercept)	1.99e+03	1.57e+04	0.127	0.899	ns
2	married_id	30.6	1.22e+03	0.0251	0.98	ns
3	marital_status_id	-95	578	-0.165	0.869	ns
4	gender_id	-117	1.29e+03	-0.0906	0.928	ns
5	emp_status_id	419	365	1.15	0.252	ns
6	dept_id	9.12e+03	2.31e+03	3.95	9.92e-05	s
7	perf_score_id	2.52e+03	1.68e+03	1.5	0.134	ns
8	from_diversity_job_fair_id	130	2.97e+03	0.0437	0.965	ns
9	position_id	192	149	1.28	0.2	ns
10	emp_satisfaction	-420	685	-0.613	0.54	ns
11	special_projects_count	8.07e+03	820	9.84	5.55e-20	s
12	days_late_last30	265	557	0.476	0.635	ns
13	absences	115	104	1.11	0.27	ns

The validation metrics of the model

Nr.Crt.	var	values
1	r.squared	0.567
2	sigma	1.04e+04
3	df.residual	298

The fitted, and residula values

In this case the model explains 58% of the variation in salary (adjusted R² is 0.56), which is pretty good.

VIF values obtained for the variables include in the model

Term	VIF	VIF_CI_low	VIF_CI_high	SE_factor	Tolerance	Tolerance_CI_low	Tolerance_CI_high
married_id	1.07	1.01	1.39	1.03	0.935	0.718	0.988
marital_status_id	1.3	1.17	1.53	1.14	0.767	0.653	0.852
gender_id	1.21	1.11	1.43	1.1	0.824	0.698	0.904
emp_status_id	1.32	1.19	1.55	1.15	0.758	0.645	0.843
dept_id	5.12	4.25	6.23	2.26	0.195	0.161	0.235
perf_score_id	3.8	3.18	4.59	1.95	0.263	0.218	0.314
from_diversity_job_fair_id	1.25	1.13	1.47	1.12	0.8	0.68	0.883
position_id	3.33	2.81	4.02	1.83	0.3	0.249	0.356
emp_satisfaction	1.48	1.31	1.74	1.22	0.677	0.576	0.764
special_projects_count	2.34	2	2.79	1.53	0.427	0.358	0.499
days_late_last30	3.17	2.67	3.81	1.78	0.316	0.262	0.374
absences	1.37	1.23	1.61	1.17	0.727	0.62	0.814

Both methods are robust alternatives to OLS, but MM (lmrob) tends to be more robust with high breakdown points, while M (rlm) focuses on efficient estimation for moderately contaminated data.

In both models special_projects_count is consistently highly significant.

dept_id variable shows large effects, but rlm and lmrob can differ in sign/magnitude depending on outlier treatment.

In both cases most demographic variables are not significant.

Both models account for extreme salaries, but differently (lmrob ignores extreme outliers, rlm downweights them).

Ridge and Lasso Regularization

Overall, Lasso and Ridge regression are used when we have multicollinearity problems, to many predictors and we want a stable predictive model.

library(glmnet)

## Loaded glmnet 4.1-8

x <- model.matrix(salary ~ ., dataset)[,-1]
y <- dataset$salary

Ridge Regression (alpha = 0)

Regression model estimation

ridge_model <- cv.glmnet(x, y, alpha = 0)
ridge_model

## 
## Call:  cv.glmnet(x = x, y = y, alpha = 0) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index   Measure        SE Nonzero
## min   9885    78 476230896 143918814      14
## 1se 859774    30 618809836 175693563      14

## 15 x 1 sparse Matrix of class "dgCMatrix"
##                                    s1
## (Intercept)                89160.9564
## married_id                   245.5060
## marital_status_id           -961.9268
## gender_id                    340.6729
## emp_status_id               -496.3839
## dept_id                    -6305.7042
## perf_score_id               3347.7496
## from_diversity_job_fair_id  2518.0345
## termd                       -331.6805
## position_id                 -367.1815
## emp_satisfaction             526.8653
## special_projects_count      2414.1555
## days_late_last30             418.6630
## absences                     250.1753
## status                       244.5240

Nr.Crt.	Matric	Value
1	lambda.min	9.89e+03
2	lambda.1se	8.6e+05
3	nobs	311

Reduces de value of the regression coefficients to zero, not eliminate them

Interpretation:

special_projects_count (2501.91) → has strongest positive effect on salary, which means that more special projects increases salary.
days_late_last30 (571.18), emp_satisfaction (523.16), absences (266.46) → have a moderate positive effect on salary. Very interesting is that satisfaction and days late are positive here (possibly because Ridge shrinks coefficients but keeps them in model).
position_id (-397.95) → has a negative impact on salary, meaning higher position ID reduces salary after controlling for others.
status (233.94) → has a small positive effect on salary.

Ridge regression shrinks coefficients, reduces variance due to multicollinearity, but does not perform variable selection.

Compare to lmrob: special_projects_count remains the strongest predictor across models.

Evaluating the performance of the Ridge model

## [1] 430105398

The cross-validation curve reflects the bias–variance tradeoff inherent in regularized models. At very small λ values, the model approaches OLS and exhibits signs of mild overfitting. As λ increases, prediction error decreases due to reduced variance. However, excessive regularization leads to underfitting, as indicated by the increasing error for large λ values. The optimal λ balances bias and variance, ensuring better generalization performance.

Lasso Regression (alpha = 1)

Regression model estimation

## 
## Call:  cv.glmnet(x = x, y = y, alpha = 1) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index   Measure        SE Nonzero
## min    783    31 468106080 145969732       8
## 1se  11633     2 613398133 164322653       1

## 15 x 1 sparse Matrix of class "dgCMatrix"
##                                    s1
## (Intercept)                99689.4344
## married_id                     .     
## marital_status_id           -485.7549
## gender_id                      .     
## emp_status_id               -145.4528
## dept_id                    -7593.2077
## perf_score_id               2738.4442
## from_diversity_job_fair_id   309.5148
## termd                          .     
## position_id                 -437.4469
## emp_satisfaction               .     
## special_projects_count      2710.0045
## days_late_last30               .     
## absences                     220.9998
## status                         .

Interpretation:

special_projects_count (2740.63) → is the strongest positive predictor, having a the strongest effect on salary. Each additional special project increases salary by ~2,740 units.
dept_id (-7249.22) → has a large effect depending on department coding: some departments pay much less/more than the baseline/mean.
perf_score_id (2371.70) → the performance scores positively affect salary.
absences (179.11) → has a minor positive effect on salary (interesting, could reflect correlation with other variables).
position_id (-395.62) → has a small negative effect on salary.
marital_status_id (-235.71) → has a small negative effect on salary. It appears that marital status slightly reduces salary.

Metric	Value
lambda.min	783
lambda.1se	1.16e+04
nobs	311

Can eliminate variables by forcing the value of regression coefficient to be zero

Performance of the Lasso model

## [1] 426931353

Lasso shrinks unimportant variables to zero, performing a variable selection while controlling multicollinearity and overfitting. So the final model focuses only on variables that actually explain variation in salary. In this case the important predictors are: dept_id, perf_score_id, special_projects_count, absences, marital_status_id, position_id, while married_id, gender_id, from_diversity_job_fair_id, emp_satisfaction, days_late_last30, status are dropped predictors.

Comparing the performance of the Ridge and Lasso models

Nr.Crt.	MSE	Model	lambda.min	lambda.1se	nobs
1	4.3e+08	Ridge	9.89e+03	8.6e+05	311
2	4.27e+08	Lasso	783	1.16e+04	311

Robust regression models address violations of the classical assumptions of regression, such as heteroscedasticity, non-normality of errors, multicollinearity, and the presence of influential observations. Ridge regression stabilizes the coefficients in the presence of multicollinearity, while Lasso regression performs variable selection and simultaneously controls model complexity.

The Lasso model performs variable selection by shrinking some coefficients exactly to zero. At λ = lambda.min, the model retains eight predictors, achieving the lowest cross-validated error. However, λ = lambda.1se produces a more parsimonious model with only two predictors and comparable predictive performance, suggesting improved generalization and stability.

Elastic Net regression (alpha = 0.5)

Elastic Net regression was implemented to simultaneously address multicollinearity and perform variable selection. By combining L1 and L2 regularization penalties, the method provides more stable coefficient estimates compared to Lasso, particularly in the presence of correlated predictors.

Regression model estimation

## 
## Call:  cv.glmnet(x = x, y = y, alpha = 0.5) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index   Measure        SE Nonzero
## min   1720    30 465955650 107134394       8
## 1se  17600     5 563913327 125347337       2

## 15 x 1 sparse Matrix of class "dgCMatrix"
##                                     s1
## (Intercept)                98022.18688
## married_id                     .      
## marital_status_id           -401.36371
## gender_id                      .      
## emp_status_id               -120.56475
## dept_id                    -7240.73932
## perf_score_id               2572.98929
## from_diversity_job_fair_id    17.97873
## termd                          .      
## position_id                 -404.19215
## emp_satisfaction               .      
## special_projects_count      2706.17379
## days_late_last30               .      
## absences                     203.25672
## status                         .

## [1] 428627377

Elastic Net combines L1 and L2 regularization, allowing for both shrinkage and grouped variable selection. It is particularly suitable in the presence of multicollinearity, as it stabilizes coefficient estimates while retaining correlated predictors.

Searching for the optimal hyperparameters

##  [1] 467815942 461436413 463404375 468896799 468776405 465053606 465292529
##  [8] 476683081 484718055 469561763 479238723

## [1] 0.1

## [1] 5926.088

## 15 x 1 sparse Matrix of class "dgCMatrix"
##                                     s0
## (Intercept)                94567.12167
## married_id                     .      
## marital_status_id           -594.24170
## gender_id                      .      
## emp_status_id               -299.64596
## dept_id                    -6586.05190
## perf_score_id               2626.97295
## from_diversity_job_fair_id   972.35006
## termd                          .      
## position_id                 -362.64193
## emp_satisfaction              14.27022
## special_projects_count      2545.52998
## days_late_last30               .      
## absences                     213.02951
## status                         .

## [1] 430778136

The optimal Elastic Net model was selected by performing cross-validation over a grid of alpha values (0 to 1). The combination of alpha and lambda that minimized the cross-validated mean squared error was chosen, and the final model was re-estimated using these optimal hyperparameters.

Conclusions

The analysis began with the estimation of an OLS model to explain salary variation based on employees’ demographic and organizational characteristics. Diagnostic tests revealed several issues, including the presence of influential observations, potential heteroscedasticity, deviations from residual normality, and indications of multicollinearity among predictors.

To improve the robustness of the estimates, alternative methods were applied. Robust regression techniques (rlm, lmrob) reduced the influence of extreme observations by down-weighting large residuals, resulting in more stable coefficient estimates in the presence of outliers. However, these methods do not address multicollinearity.

Subsequently, regularized models (Ridge, Lasso, and Elastic Net) were implemented. Cross-validation results showed that introducing a penalty term improves predictive performance compared to OLS.

Ridge regression stabilized coefficient estimates in the presence of correlated predictors. Lasso performed automatic variable selection, leading to a more parsimonious model without substantial loss of predictive accuracy. Elastic Net provided an effective compromise between stability and sparsity, making it particularly suitable in the presence of multicollinearity.

The cross-validation curves clearly illustrated the bias–variance tradeoff, and the optimal selection of λ allowed for the construction of a balanced model with strong predictive performance and controlled complexity.

Bibliography

Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.
Kuhn et al., (2020). Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles. https://www.tidymodels.org
Waring E, Quinn M, McNamara A, Arino de la Rubia E, Zhu H, Ellis S (2022). skimr: Compact and Flexible Summaries of Data. R package version 2.1.5, https://CRAN.R-project.org/package=skimr.
Peterson BG, Carl P (2020). PerformanceAnalytics: Econometric Tools for Performance and Risk Analysis. R package version 2.0.4, https://CRAN.R-project.org/package=PerformanceAnalytics.
Wickham H, Pedersen T, Seidel D (2025). scales: Scale Functions for Visualization. R package version 1.4.0, https://CRAN.R-project.org/package=scales.
Fox J, Weisberg S (2019). An R Companion to Applied Regression, Third edition. Sage, Thousand Oaks CA. https://socialsciences.mcmaster.ca/jfox/Books/Companion/.
Schloerke B, Cook D, Larmarange J, Briatte F, Marbach M, Thoen E, Elberg A, Crowley J (2021). GGally: Extension to ‘ggplot2’. R package version 2.1.2, https://CRAN.R-project.org/package=GGally.
Taiyun Wei and Viliam Simko (2021). R package ‘corrplot’: Visualization of a Correlation Matrix (Version 0.92). Available from https://github.com/taiyun/corrplot
Rubba C (2023). htmltab: Assemble Data Frames from HTML Tables. R package version 0.8.2.9000, https://github.com/htmltab/htmltab.
Brandon M. Greenwell and Bradley C. Boehmke (2020). Variable Importance Plots—An Introduction to the vip Package. The R Journal, 12(1), 343–366. URL https://doi.org/10.32614/RJ-2020-013.
Marvin N. Wright, Andreas Ziegler (2017). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, 77(1), 1-17. doi:10.18637/jss.v077.i01
Robinson D, Hayes A, Couch S (2023). broom: Convert Statistical Objects into Tidy Tibbles. R package version 1.0.5, https://CRAN.R-project.org/package=broom. 13.H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
Wickham H, Hester J, Bryan J (2023). readr: Read Rectangular Text Data. R package version 2.1.4, https://CRAN.R-project.org/package=readr.
Zhu H (2021). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.3.4, https://CRAN.R-project.org/package=kableExtra.
Cui B (2020). DataExplorer: Automate Data Exploration and Treatment. R package version 0.8.2, https://CRAN.R-project.org/package=DataExplorer.
Rushworth A (2022). inspectdf: Inspection, Comparison and Visualisation of Data Frames. R package version 0.0.12, https://CRAN.R-project.org/package=inspectdf.
Grosjean P, Ibanez F (2018). pastecs: Package for Analysis of Space-Time Ecological Series. R package version 1.3.21, https://CRAN.R-project.org/package=pastecs.
https://www.kaggle.com/datasets/rhuebner/human-resources-data-set
William Revelle (2023). psych: Procedures for Psychological, Psychometric, and Personality Research. Northwestern University, Evanston, Illinois. R package version 2.3.9, https://CRAN.R-project.org/package=psych.
B. Venables, Modern Applied Statistics With S, 2002, Edition: 4thPublisher: Springer-Verlag, DOI: 10.1007/b97626.
Friedman J, Tibshirani R, Hastie T (2010). “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software, 33(1), 1-22. doi:10.18637/jss.v033.i01 https://doi.org/10.18637/jss.v033.i01.
Martin Maechler, Peter Rousseeuw, Christophe Croux, Valentin Todorov, Andreas Ruckstuhl, Matias Salibian-Barrera, Tobias Verbeke, Manuel Koller, c(“Eduardo”, “L. T.”) Conceicao and Maria Anna di Palma (2023). robustbase: Basic Robust Statistics R package version 0.99-1. URL http://CRAN.R-project.org/package=robustbase

Scoring model

by Irimia Mihaela

2026-02-26

Scoring model

Analyzed data set: HRDataset_v14 available here

Project Objectives

Dataset description

The key findings from the EDA and their implications for further modeling

Correlation matrix

MRLM - Linear regression model

Generating the regression model

Check assumptions of linear regression

Cooks distance, predicted values, residuals and influntial points

Generating a new regression model without including the status and termd variables

Generating a new regression model using only the significant variables

Using more robust regression models

Robust regression model from MASS package

Robust regression model from robustbase package

Ridge and Lasso Regularization

Ridge Regression (alpha = 0)

Lasso Regression (alpha = 1)

Elastic Net regression (alpha = 0.5)

Conclusions

Bibliography