Scoring model

Analyzed data set: HRDataset_v14 available here

Loading the necessary packages…

Project Objectives

Exploratory data analysis (EDA), done in a separate post and available here
Identification and detailed analysis of the studied phenomenon through the development of scoring models
Interpretation of results and presentation of conclusions

Dataset description

In order to determine the factors that significantly influence an individual’s salary within a company, we selected the dataset entitled Human Resources Data Set, published on the Kaggle platform: https://www.kaggle.com/datasets/rhuebner/human-resources-data-set

This dataset was created by Dr. Carla Patalano and Dr. Rich. It was designed as an educational resource to help students learn how to perform exploratory data analysis (EDA). The dataset provides a wide range of features that enable both data visualization and the development of machine learning / predictive analytics models.

Within this dataset, we decided to explore and attempt to answer several research questions, such as:

Are there areas within the company where salary distribution is not equitable?
Is an individual’s salary influenced by any specific factors present in the dataset?
Can we build a scoring model capable of estimating an employee’s salary? If so, to what extent is the model accurate?

Loading the dataset

Dataset description

The structure of the dataset

glimpse(HRDataset_v14)

## Rows: 311
## Columns: 36
## $ Employee_Name              <chr> "Adinolfi, Wilson  K", "Ait Sidi, Karthikey…
## $ EmpID                      <dbl> 10026, 10084, 10196, 10088, 10069, 10002, 1…
## $ MarriedID                  <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ MaritalStatusID            <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ GenderID                   <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ EmpStatusID                <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ DeptID                     <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ PerfScoreID                <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ FromDiversityJobFairID     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ Salary                     <dbl> 62506, 104437, 64955, 64991, 50825, 57568, …
## $ Termd                      <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ PositionID                 <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ Position                   <chr> "Production Technician I", "Sr. DBA", "Prod…
## $ State                      <chr> "MA", "MA", "MA", "MA", "MA", "MA", "MA", "…
## $ Zip                        <chr> "01960", "02148", "01810", "01886", "02169"…
## $ DOB                        <chr> "07/10/83", "05/05/75", "09/19/88", "09/27/…
## $ Sex                        <chr> "M", "M", "F", "F", "F", "F", "F", "M", "F"…
## $ MaritalDesc                <chr> "Single", "Married", "Married", "Married", …
## $ CitizenDesc                <chr> "US Citizen", "US Citizen", "US Citizen", "…
## $ HispanicLatino             <chr> "No", "No", "No", "No", "No", "No", "No", "…
## $ RaceDesc                   <chr> "White", "White", "White", "White", "White"…
## $ DateofHire                 <chr> "7/5/2011", "3/30/2015", "7/5/2011", "1/7/2…
## $ DateofTermination          <chr> NA, "6/16/2016", "9/24/2012", NA, "9/6/2016…
## $ TermReason                 <chr> "N/A-StillEmployed", "career change", "hour…
## $ EmploymentStatus           <chr> "Active", "Voluntarily Terminated", "Volunt…
## $ Department                 <chr> "Production", "IT/IS", "Production", "Produ…
## $ ManagerName                <chr> "Michael Albert", "Simon Roup", "Kissy Sull…
## $ ManagerID                  <dbl> 22, 4, 20, 16, 39, 11, 10, 19, 12, 7, 14, 2…
## $ RecruitmentSource          <chr> "LinkedIn", "Indeed", "LinkedIn", "Indeed",…
## $ PerformanceScore           <chr> "Exceeds", "Fully Meets", "Fully Meets", "F…
## $ EngagementSurvey           <dbl> 4.60, 4.96, 3.02, 4.84, 5.00, 5.00, 3.04, 5…
## $ EmpSatisfaction            <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ SpecialProjectsCount       <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ LastPerformanceReview_Date <chr> "1/17/2019", "2/24/2016", "5/15/2012", "1/3…
## $ DaysLateLast30             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Absences                   <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…

The key findings from the EDA and their implications for further modeling

Based on the EDA, we saw that:

-   Salary exhibits non-normal behavior and strong right skewness.

-   Extreme values (executive-level salaries) may influence regression results.

-   Transformation techniques (e.g., log transformation of salary) may improve model performance.

-   Categorical variables such as department, position, and employment status are likely strong predictors of salary.

-   Gender alone may not fully explain salary variation without controlling for position and department.
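
The effect of a log transformation on a right-skewed variable can be sketched as follows. This is a minimal illustration on simulated data, not the preprocessing actually used in this post: the `sim_salary` vector and the `skewness()` helper are hypothetical names introduced only for this example.

```r
# Simulated right-skewed "salaries" (lognormal); NOT the HR dataset.
set.seed(42)
sim_salary <- rlnorm(311, meanlog = log(62000), sdlog = 0.35)

# Moment-based sample skewness (hypothetical helper, base R only).
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

skew_raw <- skewness(sim_salary)
skew_log <- skewness(log10(sim_salary))

# The log10-transformed values should be much closer to symmetric.
round(c(raw = skew_raw, log10 = skew_log), 2)
```

A skewness near zero after the transformation suggests that a model fitted on the log scale may be less dominated by the few very large salaries.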

## Rows: 311
## Columns: 15
## $ married_id                 <fct> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ marital_status_id          <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ gender_id                  <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ emp_status_id              <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ dept_id                    <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ perf_score_id              <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ from_diversity_job_fair_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ termd                      <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ position_id                <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ emp_satisfaction           <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ special_projects_count     <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ days_late_last30           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ absences                   <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…
## $ status                     <chr> "no_married", "married", "married", "marrie…
## $ log10_salary               <dbl> 4.795922, 5.018854, 4.812613, 4.812853, 4.7…

## Rows: 311
## Columns: 15
## $ married_id                 <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ marital_status_id          <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ gender_id                  <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ emp_status_id              <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ dept_id                    <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ perf_score_id              <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ from_diversity_job_fair_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ termd                      <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ position_id                <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ emp_satisfaction           <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ special_projects_count     <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ days_late_last30           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ absences                   <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…
## $ salary                     <int> 62506, 104437, 64955, 64991, 50825, 57568, …
## $ status                     <chr> "no_married", "married", "married", "marrie…

Correlation matrix

MRLM - Linear regression model

Generating the regression model

Just a regression …

lm_ih <- lm(salary ~ ., data = dataset)
lm_ih

## 
## Call:
## lm(formula = salary ~ ., data = dataset)
## 
## Coefficients:
##                (Intercept)                  married_id  
##                    95093.0                       443.8  
##          marital_status_id                   gender_id  
##                    -1133.9                      -351.1  
##              emp_status_id                     dept_id  
##                    -1607.4                     -8722.4  
##              perf_score_id  from_diversity_job_fair_id  
##                     6392.3                      3462.4  
##                      termd                 position_id  
##                     3924.0                      -545.7  
##           emp_satisfaction      special_projects_count  
##                      432.2                      2625.3  
##           days_late_last30                    absences  
##                     1663.1                       324.8  
##                     status  
##                         NA

summary(lm_ih)

## 
## Call:
## lm(formula = salary ~ ., data = dataset)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -56336  -9497  -2108   5279 159658 
## 
## Coefficients: (1 not defined because of singularities)
##                            Estimate Std. Error t value     Pr(>|t|)    
## (Intercept)                 95093.0    17138.2   5.549 0.0000000638 ***
## married_id                    443.8     2520.6   0.176       0.8604    
## marital_status_id           -1133.9     1297.3  -0.874       0.3828    
## gender_id                    -351.1     2433.9  -0.144       0.8854    
## emp_status_id               -1607.4     2274.8  -0.707       0.4804    
## dept_id                     -8722.4     1932.4  -4.514 0.0000091892 ***
## perf_score_id                6392.3     3117.4   2.051       0.0412 *  
## from_diversity_job_fair_id   3462.4     4256.8   0.813       0.4167    
## termd                        3924.0     8436.1   0.465       0.6422    
## position_id                  -545.7      219.7  -2.483       0.0136 *  
## emp_satisfaction              432.2     1394.5   0.310       0.7568    
## special_projects_count       2625.3      796.4   3.296       0.0011 ** 
## days_late_last30             1663.1     1418.7   1.172       0.2420    
## absences                      324.8      207.6   1.564       0.1188    
## status                           NA         NA      NA           NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20970 on 297 degrees of freedom
## Multiple R-squared:  0.3345, Adjusted R-squared:  0.3053 
## F-statistic: 11.48 on 13 and 297 DF,  p-value: < 0.00000000000000022

The importance of the variables

vip(lm_ih)

Check assumptions of linear regression

library(performance)
par(mfrow = c(2, 2))
plot(lm_ih, pch = 16, col = '#006EA1')

par(mfrow = c(1, 1))

The estimated values of the regression coefficients

tidy(lm_ih)

term	estimate	std.error	statistic	p.value
(Intercept)	9.51e+04	1.71e+04	5.55	6.38e-08
married_id	444	2.52e+03	0.176	0.86
marital_status_id	-1.13e+03	1.3e+03	-0.874	0.383
gender_id	-351	2.43e+03	-0.144	0.885
emp_status_id	-1.61e+03	2.27e+03	-0.707	0.48
dept_id	-8.72e+03	1.93e+03	-4.51	9.19e-06
perf_score_id	6.39e+03	3.12e+03	2.05	0.0412
from_diversity_job_fair_id	3.46e+03	4.26e+03	0.813	0.417
termd	3.92e+03	8.44e+03	0.465	0.642
position_id	-546	220	-2.48	0.0136
emp_satisfaction	432	1.39e+03	0.31	0.757
special_projects_count	2.63e+03	796	3.3	0.0011
days_late_last30	1.66e+03	1.42e+03	1.17	0.242
absences	325	208	1.56	0.119
status	NA	NA	NA	NA

The validation metrics of the model

glance(lm_ih) %>%
  pivot_longer(everything(), names_to = "var", values_to = "values")

var	values
r.squared	0.334
adj.r.squared	0.305
sigma	2.1e+04
statistic	11.5
p.value	4.94e-20
df	13
logLik	-3.53e+03
AIC	7.09e+03
BIC	7.14e+03
deviance	1.31e+11
df.residual	297
nobs	311

Cook's distance, predicted values, residuals and influential points

## OK: No outliers detected.
## - Based on the following method and threshold: cook (1).
## - For variable: (Whole model)
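
The report above flags observations by comparing Cook's distance with a threshold of 1. A minimal base-R sketch of the same kind of check is shown below on simulated data; the model `fit_sim` and its variables are hypothetical and unrelated to `lm_ih`.

```r
# Simulated data with one deliberately extreme point; NOT the HR dataset.
set.seed(1)
x <- c(rnorm(50), 10)
y <- c(2 + 3 * x[1:50] + rnorm(50), -40)   # the last point breaks the trend
fit_sim <- lm(y ~ x)

# Cook's distance for every observation.
cd <- cooks.distance(fit_sim)

# Flag observations above the threshold of 1 used in the report above.
flagged <- which(cd > 1)
flagged
```

Cook's distance combines leverage and residual size, so a point can be far from the bulk of the data without being flagged if it still lies close to the fitted line.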

VIF values obtained for the variables included in the model

Term	VIF	VIF_CI_low	VIF_CI_high	SE_factor	Tolerance	Tolerance_CI_low	Tolerance_CI_high
married_id	1.08	1.02	1.37	1.04	0.928	0.73	0.984
marital_status_id	1.06	1.01	1.45	1.03	0.947	0.689	0.993
gender_id	1.03	1	2.28	1.01	0.971	0.439	0.999
emp_status_id	11.7	9.63	14.4	3.43	0.0851	0.0695	0.104
dept_id	2.44	2.08	2.91	1.56	0.41	0.344	0.48
perf_score_id	2.36	2.02	2.81	1.54	0.423	0.355	0.495
from_diversity_job_fair_id	1.08	1.02	1.36	1.04	0.923	0.735	0.981
termd	11.2	9.19	13.7	3.35	0.0892	0.0729	0.109
position_id	1.32	1.19	1.55	1.15	0.758	0.647	0.843
emp_satisfaction	1.13	1.05	1.36	1.06	0.882	0.734	0.953
special_projects_count	2.47	2.11	2.95	1.57	0.405	0.339	0.474
days_late_last30	2.38	2.04	2.83	1.54	0.42	0.353	0.491
absences	1.04	1	1.65	1.02	0.96	0.608	0.997

The coefficient of the status variable is NA, which indicates perfect collinearity: status is an exact linear combination of another predictor already in the model. As a result, the design matrix is singular, and a VIF cannot be computed for the aliased term.

The variable that is perfectly collinear with status is married_id.
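
How an aliased (NA) coefficient arises can be reproduced on a toy example. The sketch below assumes, as the glimpse output suggests, that `status` is a text relabelling of `married_id`; the `toy` data frame and its values are made up for illustration.

```r
# Toy data: `status` is just a text relabelling of `married_id`.
set.seed(7)
toy <- data.frame(
  married_id = rep(c(0, 1), times = 10),
  absences   = sample(1:20, 20, replace = TRUE)
)
toy$status <- ifelse(toy$married_id == 1, "married", "no_married")
toy$salary <- 60000 + 2000 * toy$married_id + 300 * toy$absences +
  rnorm(20, sd = 500)

fit_toy <- lm(salary ~ ., data = toy)

# The redundant term gets an NA coefficient...
coef(fit_toy)

# ...and alias() shows which terms it is a linear combination of.
alias(fit_toy)$Complete
```

Because the dummy column built from `status` equals one minus `married_id`, it adds no new information, and `lm()` drops it rather than failing.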

# vif(lm_ih)

We therefore remove the status variable from the dataset, together with termd, which is highly correlated with emp_status_id (correlation coefficient 0.91; both have VIF values above 11), and rebuild the model.
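
The link between a high pairwise correlation and a high VIF can be sketched by computing the VIF by hand: for predictor j, VIF = 1 / (1 - R²), where R² comes from regressing predictor j on all other predictors (the Tolerance column in the tables is simply 1 / VIF). The example below uses simulated predictors; `vif_by_hand()` is a hypothetical helper, not a function from any package used in this post.

```r
# Simulated predictors; x2 is built to correlate strongly with x1.
set.seed(123)
n  <- 311
x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)  # correlation near 0.9
x3 <- rnorm(n)                               # unrelated predictor

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from the auxiliary regression
# of predictor j on all the other predictors.
vif_by_hand <- function(target, others) {
  d  <- data.frame(target = target, others)
  r2 <- summary(lm(target ~ ., data = d))$r.squared
  1 / (1 - r2)
}

vifs <- c(
  x1 = vif_by_hand(x1, data.frame(x2, x3)),
  x2 = vif_by_hand(x2, data.frame(x1, x3)),
  x3 = vif_by_hand(x3, data.frame(x1, x2))
)
round(vifs, 2)
```

The two correlated predictors inflate each other's VIF, while the unrelated one stays close to 1; dropping either member of the pair brings the other back down.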

Generating a new regression model without including the status and termd variables

lm_ih <- lm(salary ~ ., data = dataset %>% dplyr::select(-status, -termd))
lm_ih

## 
## Call:
## lm(formula = salary ~ ., data = dataset %>% dplyr::select(-status, 
##     -termd))
## 
## Coefficients:
##                (Intercept)                  married_id  
##                    93194.6                       413.6  
##          marital_status_id                   gender_id  
##                    -1132.3                      -324.3  
##              emp_status_id                     dept_id  
##                     -603.2                     -8628.1  
##              perf_score_id  from_diversity_job_fair_id  
##                     6594.0                      3185.4  
##                position_id            emp_satisfaction  
##                     -560.5                       406.1  
##     special_projects_count            days_late_last30  
##                     2664.4                      1815.1  
##                   absences  
##                      327.0

summary(lm_ih)

## 
## Call:
## lm(formula = salary ~ ., data = dataset %>% dplyr::select(-status, 
##     -termd))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -56169  -9542  -1754   5019 160081 
## 
## Coefficients:
##                            Estimate Std. Error t value     Pr(>|t|)    
## (Intercept)                 93194.6    16623.2   5.606 0.0000000471 ***
## married_id                    413.6     2516.4   0.164     0.869563    
## marital_status_id           -1132.3     1295.6  -0.874     0.382829    
## gender_id                    -324.3     2430.0  -0.133     0.893919    
## emp_status_id                -603.2      715.8  -0.843     0.400069    
## dept_id                     -8628.1     1919.2  -4.496 0.0000099409 ***
## perf_score_id                6594.0     3083.0   2.139     0.033265 *  
## from_diversity_job_fair_id   3185.4     4209.4   0.757     0.449816    
## position_id                  -560.5      217.1  -2.581     0.010318 *  
## emp_satisfaction              406.1     1391.6   0.292     0.770612    
## special_projects_count       2664.4      790.9   3.369     0.000855 ***
## days_late_last30             1815.1     1378.8   1.316     0.189032    
## absences                      327.0      207.3   1.577     0.115772    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20940 on 298 degrees of freedom
## Multiple R-squared:  0.334,  Adjusted R-squared:  0.3072 
## F-statistic: 12.45 on 12 and 298 DF,  p-value: < 0.00000000000000022

The estimated values of the regression coefficients

tidy(lm_ih)

term	estimate	std.error	statistic	p.value
(Intercept)	9.32e+04	1.66e+04	5.61	4.71e-08
married_id	414	2.52e+03	0.164	0.87
marital_status_id	-1.13e+03	1.3e+03	-0.874	0.383
gender_id	-324	2.43e+03	-0.133	0.894
emp_status_id	-603	716	-0.843	0.4
dept_id	-8.63e+03	1.92e+03	-4.5	9.94e-06
perf_score_id	6.59e+03	3.08e+03	2.14	0.0333
from_diversity_job_fair_id	3.19e+03	4.21e+03	0.757	0.45
position_id	-560	217	-2.58	0.0103
emp_satisfaction	406	1.39e+03	0.292	0.771
special_projects_count	2.66e+03	791	3.37	0.000855
days_late_last30	1.82e+03	1.38e+03	1.32	0.189
absences	327	207	1.58	0.116

The validation metrics of the model

glance(lm_ih) %>%
  gather(var, values)

var	values
r.squared	0.334
adj.r.squared	0.307
sigma	2.09e+04
statistic	12.5
p.value	1.5e-20
df	12
logLik	-3.53e+03
AIC	7.09e+03
BIC	7.14e+03
deviance	1.31e+11
df.residual	298
nobs	311

VIF values obtained for the variables included in the model

Term	VIF	VIF_CI_low	VIF_CI_high	SE_factor	Tolerance	Tolerance_CI_low	Tolerance_CI_high
married_id	1.08	1.02	1.38	1.04	0.929	0.727	0.985
marital_status_id	1.06	1.01	1.46	1.03	0.947	0.686	0.993
gender_id	1.03	1	2.38	1.01	0.972	0.42	0.999
emp_status_id	1.17	1.07	1.39	1.08	0.857	0.721	0.933
dept_id	2.41	2.06	2.88	1.55	0.415	0.347	0.485
perf_score_id	2.32	1.98	2.76	1.52	0.432	0.362	0.504
from_diversity_job_fair_id	1.06	1.01	1.42	1.03	0.941	0.705	0.991
position_id	1.29	1.16	1.52	1.14	0.775	0.659	0.859
emp_satisfaction	1.13	1.05	1.36	1.06	0.883	0.734	0.954
special_projects_count	2.44	2.08	2.92	1.56	0.41	0.343	0.48
days_late_last30	2.25	1.93	2.68	1.5	0.444	0.373	0.518
absences	1.04	1	1.67	1.02	0.961	0.598	0.998

We will rebuild the model including only the significant variables.

Generating a new regression model using only the significant variables

Variables with no statistically significant effect on salary (p-value ≥ 0.05) are:

term	estimate	std.error	statistic	p.value
married_id	414	2.52e+03	0.164	0.87
marital_status_id	-1.13e+03	1.3e+03	-0.874	0.383
gender_id	-324	2.43e+03	-0.133	0.894
emp_status_id	-603	716	-0.843	0.4
from_diversity_job_fair_id	3.19e+03	4.21e+03	0.757	0.45
emp_satisfaction	406	1.39e+03	0.292	0.771
days_late_last30	1.82e+03	1.38e+03	1.32	0.189
absences	327	207	1.58	0.116

## [1] "married_id, marital_status_id, gender_id, emp_status_id, from_diversity_job_fair_id, emp_satisfaction, days_late_last30, absences"
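
A list like the one above can be extracted programmatically from the coefficient table. The sketch below uses the built-in `mtcars` data as a stand-in, since the code that produced the output above is not shown; `fit_demo` and the 5% cut-off are assumptions made for this example.

```r
# Built-in mtcars data used as a stand-in; NOT the HR dataset.
fit_demo <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Coefficient table as a matrix: Estimate, Std. Error, t value, Pr(>|t|).
coefs <- summary(fit_demo)$coefficients

# Drop the intercept, then split the terms at the 5% level.
pvals <- coefs[rownames(coefs) != "(Intercept)", "Pr(>|t|)"]
not_significant <- names(pvals)[pvals >= 0.05]
significant     <- names(pvals)[pvals <  0.05]

# Comma-separated string, similar in shape to the output above.
paste(not_significant, collapse = ", ")
```

Selecting variables purely by p-value is a simple heuristic; the retained set can change when correlated predictors are added or removed.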

lm_ih <- lm(
  salary ~ .,
  data = dataset %>%
    dplyr::select(-c(
      status, termd, married_id, marital_status_id, gender_id, emp_status_id,
      from_diversity_job_fair_id, emp_satisfaction, days_late_last30, absences
    ))
)
lm_ih

## 
## Call:
## lm(formula = salary ~ ., data = dataset %>% dplyr::select(-c(status, 
##     termd, married_id, marital_status_id, gender_id, emp_status_id, 
##     from_diversity_job_fair_id, emp_satisfaction, days_late_last30, 
##     absences)))
## 
## Coefficients:
##            (Intercept)                 dept_id           perf_score_id  
##               105263.2                 -8607.9                  4063.2  
##            position_id  special_projects_count  
##                 -620.5                  2682.4

summary(lm_ih)

## 
## Call:
## lm(formula = salary ~ ., data = dataset %>% dplyr::select(-c(status, 
##     termd, married_id, marital_status_id, gender_id, emp_status_id, 
##     from_diversity_job_fair_id, emp_satisfaction, days_late_last30, 
##     absences)))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -48859  -9725  -2244   4500 159690 
## 
## Coefficients:
##                        Estimate Std. Error t value          Pr(>|t|)    
## (Intercept)            105263.2    13558.0   7.764 0.000000000000125 ***
## dept_id                 -8607.9     1895.7  -4.541 0.000008070682410 ***
## perf_score_id            4063.2     2028.2   2.003          0.046014 *  
## position_id              -620.5      212.1  -2.925          0.003700 ** 
## special_projects_count   2682.4      770.8   3.480          0.000575 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20900 on 306 degrees of freedom
## Multiple R-squared:  0.3188, Adjusted R-squared:  0.3099 
## F-statistic: 35.81 on 4 and 306 DF,  p-value: < 0.00000000000000022

The estimated values of the regression coefficients

tidy(lm_ih)

term	estimate	std.error	statistic	p.value
(Intercept)	1.05e+05	1.36e+04	7.76	1.25e-13
dept_id	-8.61e+03	1.9e+03	-4.54	8.07e-06
perf_score_id	4.06e+03	2.03e+03	2	0.046
position_id	-620	212	-2.93	0.0037
special_projects_count	2.68e+03	771	3.48	0.000575

The validation metrics of the model

VIF values obtained for the variables included in the model

Term	VIF	VIF_CI_low	VIF_CI_high	SE_factor	Tolerance	Tolerance_CI_low	Tolerance_CI_high
dept_id	2.36	2.01	2.83	1.54	0.423	0.353	0.497
perf_score_id	1.01	1	1.96e+05	1	0.994	5.09e-06	1
position_id	1.24	1.12	1.47	1.11	0.808	0.682	0.892
special_projects_count	2.33	1.99	2.79	1.53	0.43	0.358	0.504

As the collinearity check shows, the multicollinearity has been eliminated: all VIF values are now below 2.5. The fit of the model is essentially unchanged: about 32% of the variation in the dependent variable, salary, is explained by the model (R² = 0.319, adjusted R² = 0.310). The model is significant overall (F-test p-value < 0.05), and all four retained predictors are significant at the 5% level, although perf_score_id only marginally so (p = 0.046).
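
The adjusted R² reported by summary() can be reproduced by hand from R², the number of observations and the number of predictors. The short calculation below uses only values read from the outputs above.

```r
# Values read from the summary() output of the final model above.
r2 <- 0.3188   # Multiple R-squared
n  <- 311      # observations
p  <- 4        # predictors (excluding the intercept)

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)   # agrees with the reported 0.3099

# Same formula for the 12-predictor model (R^2 = 0.334).
adj_r2_full <- 1 - (1 - 0.334) * (n - 1) / (n - 12 - 1)
round(adj_r2_full, 4)   # agrees with the reported 0.3072
```

The penalty for extra predictors explains why the four-variable model has a slightly higher adjusted R² than the twelve-variable one despite a lower R².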

From the assumption checks of the model, we can observe that:

The model does not capture the actual shape of the salary distribution, suggesting potential skewness or the presence of outliers that are not well handled by linear regression.

The relationship between the predictors and salary does not appear to be fully linear, indicating possible model misspecification.

There is evidence of heteroscedasticity, as the variance of the residuals is not constant across fitted values.

Several observations exhibit relatively high leverage (e.g., 132, 309, 151), indicating the presence of influential data points that may affect the estimated regression coefficients.

Additionally, high multicollinearity was detected among some predictors in the initial model (emp_status_id and termd, with VIF values above 11), which can lead to unstable coefficient estimates and inflated standard errors. After removing termd, the model was rebuilt and all remaining VIF values fell below 2.5.

The residuals are not perfectly normally distributed, likely due to positive skewness in salary values. Although this is less critical in large samples, it may still affect statistical inference, including p-values and confidence intervals.
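
The visual impression of heteroscedasticity can be backed by a formal test. The sketch below computes Koenker's studentized Breusch-Pagan statistic by hand on simulated data; it is an illustration of the technique, not a check that was run on `lm_ih`, and all object names are hypothetical. Packages such as lmtest provide a ready-made `bptest()` for the same purpose.

```r
# Simulated data whose error spread grows with x; NOT the HR dataset.
set.seed(2024)
n <- 311
x <- runif(n, 1, 10)
y <- 5 + 2 * x + rnorm(n, sd = 0.5 * x)   # non-constant variance by design
fit_het <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the predictor(s).
aux <- lm(residuals(fit_het)^2 ~ x)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary fit,
# compared with a chi-squared distribution (df = number of predictors).
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(statistic = bp_stat, p.value = bp_p)
```

A small p-value rejects constant residual variance, in which case the usual standard errors and p-values of the regression should be read with caution.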

To improve the model, the following steps should be considered:

    Using robust regression techniques;

    Implementing regularized models such as Ridge or Lasso regression.
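
As a rough illustration of the second option, ridge regression has a closed-form solution that fits in a few lines of base R. The sketch below runs on simulated, nearly collinear predictors; `ridge_coef()` is a hypothetical helper and the penalty `lambda = 50` is arbitrary. In practice the penalty would be tuned by cross-validation, for example with the glmnet package.

```r
# Simulated data with two nearly collinear predictors; NOT the HR dataset.
set.seed(99)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
y  <- 1 + 2 * x1 + 2 * x2 + rnorm(n)

# Standardise predictors and centre the response so no intercept is penalised.
X  <- scale(cbind(x1, x2))
yc <- y - mean(y)

# Ridge estimate: (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(X, y, lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

beta_ols   <- ridge_coef(X, yc, lambda = 0)    # lambda = 0 reproduces OLS
beta_ridge <- ridge_coef(X, yc, lambda = 50)   # lambda chosen arbitrarily

rbind(ols = beta_ols, ridge = beta_ridge)
```

The penalty shrinks the coefficient vector and stabilises the estimates for correlated predictors, at the cost of some bias.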

##                dept_id          perf_score_id            position_id 
##               2.362817               1.006371               1.237009 
## special_projects_count 
##               2.328210

Bibliography

Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.
Kuhn et al., (2020). Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles. https://www.tidymodels.org
Waring E, Quinn M, McNamara A, Arino de la Rubia E, Zhu H, Ellis S (2022). skimr: Compact and Flexible Summaries of Data. R package version 2.1.5, https://CRAN.R-project.org/package=skimr.
Peterson BG, Carl P (2020). PerformanceAnalytics: Econometric Tools for Performance and Risk Analysis. R package version 2.0.4, https://CRAN.R-project.org/package=PerformanceAnalytics.
Wickham H, Pedersen T, Seidel D (2025). scales: Scale Functions for Visualization. R package version 1.4.0, https://CRAN.R-project.org/package=scales.
Fox J, Weisberg S (2019). An R Companion to Applied Regression, Third edition. Sage, Thousand Oaks CA. https://socialsciences.mcmaster.ca/jfox/Books/Companion/.
Schloerke B, Cook D, Larmarange J, Briatte F, Marbach M, Thoen E, Elberg A, Crowley J (2021). GGally: Extension to ‘ggplot2’. R package version 2.1.2, https://CRAN.R-project.org/package=GGally.
Taiyun Wei and Viliam Simko (2021). R package ‘corrplot’: Visualization of a Correlation Matrix (Version 0.92). Available from https://github.com/taiyun/corrplot
Rubba C (2023). htmltab: Assemble Data Frames from HTML Tables. R package version 0.8.2.9000, https://github.com/htmltab/htmltab.
Brandon M. Greenwell and Bradley C. Boehmke (2020). Variable Importance Plots—An Introduction to the vip Package. The R Journal, 12(1), 343–366. URL https://doi.org/10.32614/RJ-2020-013.
Marvin N. Wright, Andreas Ziegler (2017). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, 77(1), 1-17. doi:10.18637/jss.v077.i01
Robinson D, Hayes A, Couch S (2023). broom: Convert Statistical Objects into Tidy Tibbles. R package version 1.0.5, https://CRAN.R-project.org/package=broom.
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
Wickham H, Hester J, Bryan J (2023). readr: Read Rectangular Text Data. R package version 2.1.4, https://CRAN.R-project.org/package=readr.
Zhu H (2021). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.3.4, https://CRAN.R-project.org/package=kableExtra.
Cui B (2020). DataExplorer: Automate Data Exploration and Treatment. R package version 0.8.2, https://CRAN.R-project.org/package=DataExplorer.
Rushworth A (2022). inspectdf: Inspection, Comparison and Visualisation of Data Frames. R package version 0.0.12, https://CRAN.R-project.org/package=inspectdf.
Grosjean P, Ibanez F (2018). pastecs: Package for Analysis of Space-Time Ecological Series. R package version 1.3.21, https://CRAN.R-project.org/package=pastecs.
https://www.kaggle.com/datasets/rhuebner/human-resources-data-set
William Revelle (2023). psych: Procedures for Psychological, Psychometric, and Personality Research. Northwestern University, Evanston, Illinois. R package version 2.3.9, https://CRAN.R-project.org/package=psych.

Scoring model

by Irimia Mihaela

2026-02-26
