Loading the necessary packages…
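The package list itself does not appear in the rendered report. A minimal sketch, assuming the packages implied by the functions called later (`glimpse()`, `gather()`, `tidy()`, `glance()`, `vip()` and the `performance` checks), could look like this. The exact list the authors used is an assumption.

```r
# Packages inferred from the functions used later in this report
# (an assumption, not a confirmed list).
pkgs <- c("dplyr", "tidyr", "broom", "vip", "performance")

# Check which packages are installed without attaching anything yet.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Install first: ", paste(missing_pkgs, collapse = ", "))
} else {
  # Attach every package so glimpse(), tidy(), vip(), etc. are available.
  invisible(lapply(pkgs, library, character.only = TRUE))
}
```

Checking before attaching gives a readable message instead of an error halfway through the report.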
In order to determine the factors that significantly influence an individual’s salary within a company, we selected the dataset entitled Human Resources Data Set, published on the Kaggle platform: https://www.kaggle.com/datasets/rhuebner/human-resources-data-set
This dataset was created by Dr. Carla Patalano and Dr. Rich Huebner. It was designed as an educational resource to help students learn how to perform exploratory data analysis (EDA). The dataset provides a wide range of features that enable both data visualization and the development of machine learning / predictive analytics models.
Within this dataset, we decided to explore and attempt to answer several research questions, such as:
Are there areas within the company where salary distribution is not equitable?
Is an individual’s salary influenced by any specific factors present in the dataset?
Can we build a scoring model capable of estimating an employee’s salary? If so, to what extent is the model accurate?
Loading the dataset
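The loading code is not shown in the rendered report. As a hedged sketch, assuming the Kaggle CSV was saved locally under the hypothetical name `HRDataset_v14.csv`, base R can read it as follows. `Zip` is read as character so that leading zeros such as "01960" are preserved.

```r
# Hypothetical local path to the CSV downloaded from Kaggle (an assumption).
csv_path <- "HRDataset_v14.csv"

if (file.exists(csv_path)) {
  HRDataset_v14 <- read.csv(
    csv_path,
    stringsAsFactors = FALSE,
    colClasses = c(Zip = "character")  # keep leading zeros in ZIP codes
  )
  str(HRDataset_v14)
} else {
  message("File not found: ", csv_path)
}
```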
Dataset description
The structure of the dataset
glimpse(HRDataset_v14)
## Rows: 311
## Columns: 36
## $ Employee_Name <chr> "Adinolfi, Wilson K", "Ait Sidi, Karthikey…
## $ EmpID <dbl> 10026, 10084, 10196, 10088, 10069, 10002, 1…
## $ MarriedID <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ MaritalStatusID <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ GenderID <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ EmpStatusID <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ DeptID <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ PerfScoreID <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ FromDiversityJobFairID <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ Salary <dbl> 62506, 104437, 64955, 64991, 50825, 57568, …
## $ Termd <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ PositionID <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ Position <chr> "Production Technician I", "Sr. DBA", "Prod…
## $ State <chr> "MA", "MA", "MA", "MA", "MA", "MA", "MA", "…
## $ Zip <chr> "01960", "02148", "01810", "01886", "02169"…
## $ DOB <chr> "07/10/83", "05/05/75", "09/19/88", "09/27/…
## $ Sex <chr> "M", "M", "F", "F", "F", "F", "F", "M", "F"…
## $ MaritalDesc <chr> "Single", "Married", "Married", "Married", …
## $ CitizenDesc <chr> "US Citizen", "US Citizen", "US Citizen", "…
## $ HispanicLatino <chr> "No", "No", "No", "No", "No", "No", "No", "…
## $ RaceDesc <chr> "White", "White", "White", "White", "White"…
## $ DateofHire <chr> "7/5/2011", "3/30/2015", "7/5/2011", "1/7/2…
## $ DateofTermination <chr> NA, "6/16/2016", "9/24/2012", NA, "9/6/2016…
## $ TermReason <chr> "N/A-StillEmployed", "career change", "hour…
## $ EmploymentStatus <chr> "Active", "Voluntarily Terminated", "Volunt…
## $ Department <chr> "Production", "IT/IS", "Production", "Produ…
## $ ManagerName <chr> "Michael Albert", "Simon Roup", "Kissy Sull…
## $ ManagerID <dbl> 22, 4, 20, 16, 39, 11, 10, 19, 12, 7, 14, 2…
## $ RecruitmentSource <chr> "LinkedIn", "Indeed", "LinkedIn", "Indeed",…
## $ PerformanceScore <chr> "Exceeds", "Fully Meets", "Fully Meets", "F…
## $ EngagementSurvey <dbl> 4.60, 4.96, 3.02, 4.84, 5.00, 5.00, 3.04, 5…
## $ EmpSatisfaction <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ SpecialProjectsCount <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ LastPerformanceReview_Date <chr> "1/17/2019", "2/24/2016", "5/15/2012", "1/3…
## $ DaysLateLast30 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Absences <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…
Based on the exploratory data analysis, we see that:
- Salary exhibits non-normal behavior and strong right skewness.
- Extreme values (executive-level salaries) may influence regression results.
- Transformation techniques (e.g., log transformation of salary) may improve model performance.
- Categorical variables such as department, position, and employment status are likely strong predictors of salary.
- Gender alone may not fully explain salary variation without controlling for position and department.
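The effect of a log transformation on right skewness can be illustrated with a small sketch. The salaries below are simulated from a log-normal distribution, not taken from the HR dataset, and `skewness()` is a helper defined here for illustration.

```r
set.seed(42)

# Simulated right-skewed salaries (log-normal); not the real HR data.
salary <- round(exp(rnorm(311, mean = log(62000), sd = 0.35)))

# Moment-based sample skewness: the mean of the cubed z-scores.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(salary)
skew_log <- skewness(log10(salary))

cat("Skewness, raw salary:  ", round(skew_raw, 2), "\n")
cat("Skewness, log10 salary:", round(skew_log, 2), "\n")
```

A skewness close to zero after the transformation suggests that a model on the log scale is less dominated by the few very high salaries.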
## Rows: 311
## Columns: 15
## $ married_id <fct> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ marital_status_id <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ gender_id <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ emp_status_id <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ dept_id <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ perf_score_id <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ from_diversity_job_fair_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ termd <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ position_id <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ emp_satisfaction <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ special_projects_count <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ days_late_last30 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ absences <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…
## $ status <chr> "no_married", "married", "married", "marrie…
## $ log10_salary <dbl> 4.795922, 5.018854, 4.812613, 4.812853, 4.7…
## Rows: 311
## Columns: 15
## $ married_id <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ marital_status_id <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ gender_id <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ emp_status_id <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ dept_id <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ perf_score_id <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ from_diversity_job_fair_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ termd <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ position_id <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ emp_satisfaction <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ special_projects_count <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ days_late_last30 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ absences <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…
## $ salary <int> 62506, 104437, 64955, 64991, 50825, 57568, …
## $ status <chr> "no_married", "married", "married", "marrie…
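The two outputs above show snake_case column names, a `status` label and a `log10_salary` column, but not the code that produced them. The sketch below reproduces those derived columns on the first four rows shown in the output. The `to_snake()` helper and the recoding rule for `status` are assumptions inferred from the printed values.

```r
# First four rows of the relevant raw columns, copied from the glimpse() output.
raw <- data.frame(
  MarriedID = c(0, 1, 1, 1),
  Salary    = c(62506, 104437, 64955, 64991)
)

# Hand-rolled CamelCase -> snake_case conversion (hypothetical helper).
to_snake <- function(x) {
  x <- gsub("ID$", "_id", x)                    # MarriedID -> Married_id
  x <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", x)  # insert "_" at case changes
  tolower(x)
}
names(raw) <- to_snake(names(raw))

# Text label derived from married_id, and the log-transformed salary.
raw$status       <- ifelse(raw$married_id == 1, "married", "no_married")
raw$log10_salary <- log10(raw$salary)

print(raw)
```

Because `status` is a pure recoding of `married_id`, the two columns carry exactly the same information.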
Just a regression …
lm_ih <- lm(salary ~ ., data = dataset)
lm_ih
##
## Call:
## lm(formula = salary ~ ., data = dataset)
##
## Coefficients:
## (Intercept) married_id
## 95093.0 443.8
## marital_status_id gender_id
## -1133.9 -351.1
## emp_status_id dept_id
## -1607.4 -8722.4
## perf_score_id from_diversity_job_fair_id
## 6392.3 3462.4
## termd position_id
## 3924.0 -545.7
## emp_satisfaction special_projects_count
## 432.2 2625.3
## days_late_last30 absences
## 1663.1 324.8
## status
## NA
summary(lm_ih)
##
## Call:
## lm(formula = salary ~ ., data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56336 -9497 -2108 5279 159658
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 95093.0 17138.2 5.549 0.0000000638 ***
## married_id 443.8 2520.6 0.176 0.8604
## marital_status_id -1133.9 1297.3 -0.874 0.3828
## gender_id -351.1 2433.9 -0.144 0.8854
## emp_status_id -1607.4 2274.8 -0.707 0.4804
## dept_id -8722.4 1932.4 -4.514 0.0000091892 ***
## perf_score_id 6392.3 3117.4 2.051 0.0412 *
## from_diversity_job_fair_id 3462.4 4256.8 0.813 0.4167
## termd 3924.0 8436.1 0.465 0.6422
## position_id -545.7 219.7 -2.483 0.0136 *
## emp_satisfaction 432.2 1394.5 0.310 0.7568
## special_projects_count 2625.3 796.4 3.296 0.0011 **
## days_late_last30 1663.1 1418.7 1.172 0.2420
## absences 324.8 207.6 1.564 0.1188
## status NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20970 on 297 degrees of freedom
## Multiple R-squared: 0.3345, Adjusted R-squared: 0.3053
## F-statistic: 11.48 on 13 and 297 DF, p-value: < 0.00000000000000022
The importance of the variables
vip(lm_ih)
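For linear models, `vip()` is documented as ranking predictors by the absolute value of their t-statistic. Assuming that default applies here, the ranking can be reproduced by hand from the t values printed by `summary(lm_ih)`:

```r
# t values copied from the summary(lm_ih) coefficient table.
t_values <- c(
  married_id = 0.176, marital_status_id = -0.874, gender_id = -0.144,
  emp_status_id = -0.707, dept_id = -4.514, perf_score_id = 2.051,
  from_diversity_job_fair_id = 0.813, termd = 0.465, position_id = -2.483,
  emp_satisfaction = 0.310, special_projects_count = 3.296,
  days_late_last30 = 1.172, absences = 1.564
)

# Importance = |t|, sorted from most to least important.
importance <- sort(abs(t_values), decreasing = TRUE)
print(head(importance, 4))
```

The four leading variables are dept_id, special_projects_count, position_id and perf_score_id, which are also the only predictors significant at the 5% level.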
library(performance)
par(mfrow = c(2, 2))
plot(lm_ih, pch = 16, col = '#006EA1')
par(mfrow = c(1, 1))
The estimated values of the regression coefficients
tidy(lm_ih)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 9.51e+04 | 1.71e+04 | 5.55 | 6.38e-08 |
| married_id | 444 | 2.52e+03 | 0.176 | 0.86 |
| marital_status_id | -1.13e+03 | 1.3e+03 | -0.874 | 0.383 |
| gender_id | -351 | 2.43e+03 | -0.144 | 0.885 |
| emp_status_id | -1.61e+03 | 2.27e+03 | -0.707 | 0.48 |
| dept_id | -8.72e+03 | 1.93e+03 | -4.51 | 9.19e-06 |
| perf_score_id | 6.39e+03 | 3.12e+03 | 2.05 | 0.0412 |
| from_diversity_job_fair_id | 3.46e+03 | 4.26e+03 | 0.813 | 0.417 |
| termd | 3.92e+03 | 8.44e+03 | 0.465 | 0.642 |
| position_id | -546 | 220 | -2.48 | 0.0136 |
| emp_satisfaction | 432 | 1.39e+03 | 0.31 | 0.757 |
| special_projects_count | 2.63e+03 | 796 | 3.3 | 0.0011 |
| days_late_last30 | 1.66e+03 | 1.42e+03 | 1.17 | 0.242 |
| absences | 325 | 208 | 1.56 | 0.119 |
| status | NA | NA | NA | NA |
The validation metrics of the model
glance(lm_ih) %>%
gather(var, values)
| var | values |
|---|---|
| r.squared | 0.334 |
| adj.r.squared | 0.305 |
| sigma | 2.1e+04 |
| statistic | 11.5 |
| p.value | 4.94e-20 |
| df | 13 |
| logLik | -3.53e+03 |
| AIC | 7.09e+03 |
| BIC | 7.14e+03 |
| deviance | 1.31e+11 |
| df.residual | 297 |
| nobs | 311 |
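As a quick consistency check, the adjusted R-squared and the F statistic can be recomputed from the reported R-squared, the number of observations and the number of estimated slopes. This is a worked arithmetic sketch using the rounded values printed by `summary(lm_ih)`, so small rounding differences are expected.

```r
r2 <- 0.3345  # Multiple R-squared from summary(lm_ih)
n  <- 311     # nobs
p  <- 13      # estimated slopes (df); the aliased status term is not counted

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F statistic: explained vs. unexplained variation per degree of freedom.
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

cat("Adjusted R-squared:", round(adj_r2, 3), "\n")
cat("F statistic:       ", round(f_stat, 2), "\n")
```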
## OK: No outliers detected.
## - Based on the following method and threshold: cook (1).
## - For variable: (Whole model)
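The output above refers to Cook's distance with a threshold of 1. A minimal base R sketch of the same idea, on simulated data rather than the HR dataset, flags every observation whose Cook's distance exceeds that threshold:

```r
set.seed(1)

# Simulated data for illustration only.
d <- data.frame(x = rnorm(100))
d$y <- 1 + 2 * d$x + rnorm(100)

fit <- lm(y ~ x, data = d)

# Cook's distance measures how much all fitted values change
# when a single observation is left out.
cd <- cooks.distance(fit)
flagged <- which(cd > 1)

cat("Largest Cook's distance:", round(max(cd), 3), "\n")
cat("Observations above the threshold of 1:", length(flagged), "\n")
```

Note that a threshold of 1 is fairly permissive; observations can still have high leverage or large residuals without exceeding it.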
VIF values obtained for the variables included in the model
| Term | VIF | VIF_CI_low | VIF_CI_high | SE_factor | Tolerance | Tolerance_CI_low | Tolerance_CI_high |
|---|---|---|---|---|---|---|---|
| married_id | 1.08 | 1.02 | 1.37 | 1.04 | 0.928 | 0.73 | 0.984 |
| marital_status_id | 1.06 | 1.01 | 1.45 | 1.03 | 0.947 | 0.689 | 0.993 |
| gender_id | 1.03 | 1 | 2.28 | 1.01 | 0.971 | 0.439 | 0.999 |
| emp_status_id | 11.7 | 9.63 | 14.4 | 3.43 | 0.0851 | 0.0695 | 0.104 |
| dept_id | 2.44 | 2.08 | 2.91 | 1.56 | 0.41 | 0.344 | 0.48 |
| perf_score_id | 2.36 | 2.02 | 2.81 | 1.54 | 0.423 | 0.355 | 0.495 |
| from_diversity_job_fair_id | 1.08 | 1.02 | 1.36 | 1.04 | 0.923 | 0.735 | 0.981 |
| termd | 11.2 | 9.19 | 13.7 | 3.35 | 0.0892 | 0.0729 | 0.109 |
| position_id | 1.32 | 1.19 | 1.55 | 1.15 | 0.758 | 0.647 | 0.843 |
| emp_satisfaction | 1.13 | 1.05 | 1.36 | 1.06 | 0.882 | 0.734 | 0.953 |
| special_projects_count | 2.47 | 2.11 | 2.95 | 1.57 | 0.405 | 0.339 | 0.474 |
| days_late_last30 | 2.38 | 2.04 | 2.83 | 1.54 | 0.42 | 0.353 | 0.491 |
| absences | 1.04 | 1 | 1.65 | 1.02 | 0.96 | 0.608 | 0.997 |
We observe that the coefficient of the status variable is NA. This means that status is perfectly collinear with another variable, i.e. it is an exact linear combination of other predictors. As a result, the design matrix is singular, the coefficient cannot be estimated, and a VIF cannot be computed for this variable.
The variable that is perfectly collinear with status is married_id.
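This behaviour is easy to reproduce on a toy example. In the sketch below (simulated data, with column names chosen to mirror the report), `status` is a pure recoding of `married_id`, so `lm()` cannot estimate both effects and returns NA for the redundant one. `alias()` then reports the exact linear dependency.

```r
set.seed(7)

# Toy data: status is a text recoding of married_id.
toy <- data.frame(married_id = rep(c(0, 1), times = 10))
toy$status <- ifelse(toy$married_id == 1, "married", "no_married")
toy$salary <- 60000 + 2000 * toy$married_id + rnorm(20, sd = 500)

fit <- lm(salary ~ married_id + status, data = toy)

# The dummy variable created from status is aliased, so its coefficient is NA.
print(coef(fit))

# alias() lists which terms are exact linear combinations of the others.
print(alias(fit)$Complete)
```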
# vif(lm_ih)
We therefore remove the status variable from the dataset, together with the termd variable, which is highly correlated with emp_status_id (correlation coefficient 0.91), and rebuild the model.
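To make the link between correlation and VIF concrete, the sketch below computes the VIF by hand as 1 / (1 - R²), where R² comes from regressing each predictor on all the others. The data are simulated stand-ins (`x1`, `x2`, `x3` are hypothetical names), with `x2` built to be strongly correlated with `x1`.

```r
set.seed(123)
n <- 311

# Simulated predictors: x2 is almost a copy of x1, x3 is independent.
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.4)
x3 <- rnorm(n)
predictors <- data.frame(x1, x2, x3)

cor_12 <- cor(x1, x2)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the rest.
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})

cat("Correlation between x1 and x2:", round(cor_12, 2), "\n")
print(round(vif_manual, 2))
```

Dropping one variable of a highly correlated pair brings the VIF of the remaining one back towards 1, which is the pattern seen for emp_status_id once termd is removed.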
lm_ih <- lm(salary ~ ., data = dataset %>% dplyr::select(-status, -termd))
lm_ih
##
## Call:
## lm(formula = salary ~ ., data = dataset %>% dplyr::select(-status,
## -termd))
##
## Coefficients:
## (Intercept) married_id
## 93194.6 413.6
## marital_status_id gender_id
## -1132.3 -324.3
## emp_status_id dept_id
## -603.2 -8628.1
## perf_score_id from_diversity_job_fair_id
## 6594.0 3185.4
## position_id emp_satisfaction
## -560.5 406.1
## special_projects_count days_late_last30
## 2664.4 1815.1
## absences
## 327.0
summary(lm_ih)
##
## Call:
## lm(formula = salary ~ ., data = dataset %>% dplyr::select(-status,
## -termd))
##
## Residuals:
## Min 1Q Median 3Q Max
## -56169 -9542 -1754 5019 160081
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 93194.6 16623.2 5.606 0.0000000471 ***
## married_id 413.6 2516.4 0.164 0.869563
## marital_status_id -1132.3 1295.6 -0.874 0.382829
## gender_id -324.3 2430.0 -0.133 0.893919
## emp_status_id -603.2 715.8 -0.843 0.400069
## dept_id -8628.1 1919.2 -4.496 0.0000099409 ***
## perf_score_id 6594.0 3083.0 2.139 0.033265 *
## from_diversity_job_fair_id 3185.4 4209.4 0.757 0.449816
## position_id -560.5 217.1 -2.581 0.010318 *
## emp_satisfaction 406.1 1391.6 0.292 0.770612
## special_projects_count 2664.4 790.9 3.369 0.000855 ***
## days_late_last30 1815.1 1378.8 1.316 0.189032
## absences 327.0 207.3 1.577 0.115772
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20940 on 298 degrees of freedom
## Multiple R-squared: 0.334, Adjusted R-squared: 0.3072
## F-statistic: 12.45 on 12 and 298 DF, p-value: < 0.00000000000000022
The estimated values of the regression coefficients
tidy(lm_ih)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 9.32e+04 | 1.66e+04 | 5.61 | 4.71e-08 |
| married_id | 414 | 2.52e+03 | 0.164 | 0.87 |
| marital_status_id | -1.13e+03 | 1.3e+03 | -0.874 | 0.383 |
| gender_id | -324 | 2.43e+03 | -0.133 | 0.894 |
| emp_status_id | -603 | 716 | -0.843 | 0.4 |
| dept_id | -8.63e+03 | 1.92e+03 | -4.5 | 9.94e-06 |
| perf_score_id | 6.59e+03 | 3.08e+03 | 2.14 | 0.0333 |
| from_diversity_job_fair_id | 3.19e+03 | 4.21e+03 | 0.757 | 0.45 |
| position_id | -560 | 217 | -2.58 | 0.0103 |
| emp_satisfaction | 406 | 1.39e+03 | 0.292 | 0.771 |
| special_projects_count | 2.66e+03 | 791 | 3.37 | 0.000855 |
| days_late_last30 | 1.82e+03 | 1.38e+03 | 1.32 | 0.189 |
| absences | 327 | 207 | 1.58 | 0.116 |
The validation metrics of the model
glance(lm_ih) %>%
gather(var, values)
| var | values |
|---|---|
| r.squared | 0.334 |
| adj.r.squared | 0.307 |
| sigma | 2.09e+04 |
| statistic | 12.5 |
| p.value | 1.5e-20 |
| df | 12 |
| logLik | -3.53e+03 |
| AIC | 7.09e+03 |
| BIC | 7.14e+03 |
| deviance | 1.31e+11 |
| df.residual | 298 |
| nobs | 311 |
VIF values obtained for the variables included in the model
| Term | VIF | VIF_CI_low | VIF_CI_high | SE_factor | Tolerance | Tolerance_CI_low | Tolerance_CI_high |
|---|---|---|---|---|---|---|---|
| married_id | 1.08 | 1.02 | 1.38 | 1.04 | 0.929 | 0.727 | 0.985 |
| marital_status_id | 1.06 | 1.01 | 1.46 | 1.03 | 0.947 | 0.686 | 0.993 |
| gender_id | 1.03 | 1 | 2.38 | 1.01 | 0.972 | 0.42 | 0.999 |
| emp_status_id | 1.17 | 1.07 | 1.39 | 1.08 | 0.857 | 0.721 | 0.933 |
| dept_id | 2.41 | 2.06 | 2.88 | 1.55 | 0.415 | 0.347 | 0.485 |
| perf_score_id | 2.32 | 1.98 | 2.76 | 1.52 | 0.432 | 0.362 | 0.504 |
| from_diversity_job_fair_id | 1.06 | 1.01 | 1.42 | 1.03 | 0.941 | 0.705 | 0.991 |
| position_id | 1.29 | 1.16 | 1.52 | 1.14 | 0.775 | 0.659 | 0.859 |
| emp_satisfaction | 1.13 | 1.05 | 1.36 | 1.06 | 0.883 | 0.734 | 0.954 |
| special_projects_count | 2.44 | 2.08 | 2.92 | 1.56 | 0.41 | 0.343 | 0.48 |
| days_late_last30 | 2.25 | 1.93 | 2.68 | 1.5 | 0.444 | 0.373 | 0.518 |
| absences | 1.04 | 1 | 1.67 | 1.02 | 0.961 | 0.598 | 0.998 |
We will now rebuild the model using only the statistically significant variables.
The variables with no statistically significant effect on salary (p-value > 0.05) are:
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| married_id | 414 | 2.52e+03 | 0.164 | 0.87 |
| marital_status_id | -1.13e+03 | 1.3e+03 | -0.874 | 0.383 |
| gender_id | -324 | 2.43e+03 | -0.133 | 0.894 |
| emp_status_id | -603 | 716 | -0.843 | 0.4 |
| from_diversity_job_fair_id | 3.19e+03 | 4.21e+03 | 0.757 | 0.45 |
| emp_satisfaction | 406 | 1.39e+03 | 0.292 | 0.771 |
| days_late_last30 | 1.82e+03 | 1.38e+03 | 1.32 | 0.189 |
| absences | 327 | 207 | 1.58 | 0.116 |
## [1] "married_id, marital_status_id, gender_id, emp_status_id, from_diversity_job_fair_id, emp_satisfaction, days_late_last30, absences"
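The code that produced this list is not shown. One way to build it, sketched here in base R on simulated data (the variable names `x1`, `x2` and `noise` are hypothetical), is to filter the coefficient table by p-value and then refit without the selected terms. Dropping terms purely by p-value is a simple heuristic and only one of several possible selection strategies.

```r
set.seed(99)
n <- 200

# Simulated data: x1 and x2 drive y, noise has no true effect.
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
d$y <- 5 * d$x1 + 3 * d$x2 + rnorm(n)

fit <- lm(y ~ ., data = d)

# Extract p-values and keep the slope terms with p > 0.05.
coefs    <- summary(fit)$coefficients
p_values <- coefs[, "Pr(>|t|)"]
slopes   <- setdiff(rownames(coefs), "(Intercept)")
non_sig  <- slopes[p_values[slopes] > 0.05]

non_sig_str <- paste(non_sig, collapse = ", ")
print(non_sig_str)

# Refit on the remaining columns only.
reduced     <- d[, setdiff(names(d), non_sig), drop = FALSE]
fit_reduced <- lm(y ~ ., data = reduced)
print(coef(fit_reduced))
```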
lm_ih <- lm(salary ~ ., data = dataset %>% dplyr::select(-c(status, termd, married_id, marital_status_id, gender_id, emp_status_id, from_diversity_job_fair_id, emp_satisfaction, days_late_last30, absences)))
lm_ih
##
## Call:
## lm(formula = salary ~ ., data = dataset %>% dplyr::select(-c(status,
## termd, married_id, marital_status_id, gender_id, emp_status_id,
## from_diversity_job_fair_id, emp_satisfaction, days_late_last30,
## absences)))
##
## Coefficients:
## (Intercept) dept_id perf_score_id
## 105263.2 -8607.9 4063.2
## position_id special_projects_count
## -620.5 2682.4
summary(lm_ih)
##
## Call:
## lm(formula = salary ~ ., data = dataset %>% dplyr::select(-c(status,
## termd, married_id, marital_status_id, gender_id, emp_status_id,
## from_diversity_job_fair_id, emp_satisfaction, days_late_last30,
## absences)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -48859 -9725 -2244 4500 159690
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 105263.2 13558.0 7.764 0.000000000000125 ***
## dept_id -8607.9 1895.7 -4.541 0.000008070682410 ***
## perf_score_id 4063.2 2028.2 2.003 0.046014 *
## position_id -620.5 212.1 -2.925 0.003700 **
## special_projects_count 2682.4 770.8 3.480 0.000575 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20900 on 306 degrees of freedom
## Multiple R-squared: 0.3188, Adjusted R-squared: 0.3099
## F-statistic: 35.81 on 4 and 306 DF, p-value: < 0.00000000000000022
The estimated values of the regression coefficients
tidy(lm_ih)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 1.05e+05 | 1.36e+04 | 7.76 | 1.25e-13 |
| dept_id | -8.61e+03 | 1.9e+03 | -4.54 | 8.07e-06 |
| perf_score_id | 4.06e+03 | 2.03e+03 | 2 | 0.046 |
| position_id | -620 | 212 | -2.93 | 0.0037 |
| special_projects_count | 2.68e+03 | 771 | 3.48 | 0.000575 |
The validation metrics of the model
VIF values obtained for the variables included in the model
| Term | VIF | VIF_CI_low | VIF_CI_high | SE_factor | Tolerance | Tolerance_CI_low | Tolerance_CI_high |
|---|---|---|---|---|---|---|---|
| dept_id | 2.36 | 2.01 | 2.83 | 1.54 | 0.423 | 0.353 | 0.497 |
| perf_score_id | 1.01 | 1 | 1.96e+05 | 1 | 0.994 | 5.09e-06 | 1 |
| position_id | 1.24 | 1.12 | 1.47 | 1.11 | 0.808 | 0.682 | 0.892 |
| special_projects_count | 2.33 | 1.99 | 2.79 | 1.53 | 0.43 | 0.358 | 0.504 |
As the collinearity check shows, the multicollinearity has been eliminated: all VIF values are now below 3. The fit is essentially unchanged: roughly 32% of the variation in the dependent variable, salary, is explained by the model (R-squared of 0.319, compared with 0.334 before). The model as a whole is significant (p-value < 0.05), and all four retained predictors are significant at the 5% level, although perf_score_id only marginally (p = 0.046).
From the model diagnostics we can observe that:
The model does not capture the actual shape of the salary distribution, suggesting potential skewness or the presence of outliers that are not well handled by linear regression.
The relationship between the predictors and salary does not appear to be fully linear, indicating possible model misspecification.
There is evidence of heteroscedasticity, as the variance of the residuals is not constant across fitted values.
Several observations exhibit relatively high leverage (e.g., 132, 309, 151), indicating the presence of influential data points that may affect the estimated regression coefficients.
Additionally, high multicollinearity was detected among some predictors in the initial model (emp_status_id and termd had VIF values above 10), which can lead to unstable coefficient estimates and inflated standard errors. After eliminating the variables with high VIF values, the model was rebuilt.
The residuals are not perfectly normally distributed, likely due to positive skewness in salary values. Although this is less critical in large samples, it may still affect statistical inference, including p-values and confidence intervals.
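The heteroscedasticity noted above can also be checked formally. The sketch below implements the studentized Breusch-Pagan idea by hand in base R: regress the squared residuals on the predictors and compare n times the R-squared of that auxiliary regression with a chi-squared distribution. The data are simulated so that the error spread grows with `x`; this is not the test used in the report, only an illustration.

```r
set.seed(2024)
n <- 300

# Simulated data in which the error standard deviation grows with x.
x <- runif(n, 1, 10)
y <- 10 + 2 * x + rnorm(n, sd = x)

fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the predictor.
aux <- lm(resid(fit)^2 ~ x)

# LM statistic = n * R^2, chi-squared with df = number of predictors.
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

cat("LM statistic:", round(lm_stat, 2), "\n")
cat("p-value:     ", signif(p_value, 3), "\n")
```

A small p-value rejects constant error variance, in which case robust standard errors or a transformed response are common remedies.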
To improve the model, we should consider the following steps:
Using robust regression techniques;
Implementing regularized models such as Ridge or Lasso regression.
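Both suggestions can be sketched with the `MASS` package, which ships with standard R installations. The example uses simulated data with one extreme value (standing in for an executive-level salary) and a nearly collinear second predictor; it is an illustration under these assumptions, not a fit to the HR dataset. Functions are called with the `MASS::` prefix so that `MASS::select()` does not mask `dplyr::select()`. Lasso would need an additional package and is not shown.

```r
set.seed(10)

# Simulated data: true slope is 2, with one extreme response value.
x <- 1:50
y <- 2 * x + rnorm(50)
y[50] <- y[50] + 500

# Ordinary least squares vs. robust regression (Huber M-estimator by default).
ols_fit    <- lm(y ~ x)
robust_fit <- MASS::rlm(y ~ x)

slopes <- c(ols = coef(ols_fit)[["x"]], robust = coef(robust_fit)[["x"]])
print(round(slopes, 3))

# Ridge regression with a nearly collinear second predictor.
x2 <- x + rnorm(50, sd = 0.5)
ridge_fit <- MASS::lm.ridge(y ~ x + x2, lambda = seq(0, 10, by = 0.5))

# Pick the penalty with the smallest generalized cross-validation score.
best_lambda <- ridge_fit$lambda[which.min(ridge_fit$GCV)]
cat("Lambda with lowest GCV:", best_lambda, "\n")
```

The robust fit downweights the extreme observation, so its slope stays close to the true value, while the ridge penalty stabilises coefficients when predictors are strongly correlated.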
## dept_id perf_score_id position_id
## 2.362817 1.006371 1.237009
## special_projects_count
## 2.328210