Scoring model

Analyzed data set: HRDataset_v14 available here

Loading the necessary packages…

Project Objectives

  1. Exploratory data analysis - EDA was done in another post and is available here
  2. Identification and detailed analysis of the studied phenomenon through the development of scoring models
  3. Interpretation of results and presentation of conclusions

Dataset description

In order to determine the factors that significantly influence an individual’s salary within a company, we selected the dataset entitled Human Resources Data Set, published on the Kaggle platform: https://www.kaggle.com/datasets/rhuebner/human-resources-data-set

This dataset was created by Dr. Carla Patalano and Dr. Rich. It was designed as an educational resource to help students learn how to perform exploratory data analysis (EDA). The dataset provides a wide range of features that enable both data visualization and the development of machine learning / predictive analytics models.

Within this dataset, we decided to explore and attempt to answer several research questions, such as:

  1. Are there areas within the company where salary distribution is not equitable?

  2. Is an individual’s salary influenced by any specific factors present in the dataset?

  3. Can we build a scoring model capable of estimating an employee’s salary? If so, to what extent is the model accurate?

Loading the dataset

Dataset description

The structure of the dataset

glimpse(HRDataset_v14)
## Rows: 311
## Columns: 36
## $ Employee_Name              <chr> "Adinolfi, Wilson  K", "Ait Sidi, Karthikey…
## $ EmpID                      <dbl> 10026, 10084, 10196, 10088, 10069, 10002, 1…
## $ MarriedID                  <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ MaritalStatusID            <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ GenderID                   <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ EmpStatusID                <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ DeptID                     <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ PerfScoreID                <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ FromDiversityJobFairID     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ Salary                     <dbl> 62506, 104437, 64955, 64991, 50825, 57568, …
## $ Termd                      <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ PositionID                 <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ Position                   <chr> "Production Technician I", "Sr. DBA", "Prod…
## $ State                      <chr> "MA", "MA", "MA", "MA", "MA", "MA", "MA", "…
## $ Zip                        <chr> "01960", "02148", "01810", "01886", "02169"…
## $ DOB                        <chr> "07/10/83", "05/05/75", "09/19/88", "09/27/…
## $ Sex                        <chr> "M", "M", "F", "F", "F", "F", "F", "M", "F"…
## $ MaritalDesc                <chr> "Single", "Married", "Married", "Married", …
## $ CitizenDesc                <chr> "US Citizen", "US Citizen", "US Citizen", "…
## $ HispanicLatino             <chr> "No", "No", "No", "No", "No", "No", "No", "…
## $ RaceDesc                   <chr> "White", "White", "White", "White", "White"…
## $ DateofHire                 <chr> "7/5/2011", "3/30/2015", "7/5/2011", "1/7/2…
## $ DateofTermination          <chr> NA, "6/16/2016", "9/24/2012", NA, "9/6/2016…
## $ TermReason                 <chr> "N/A-StillEmployed", "career change", "hour…
## $ EmploymentStatus           <chr> "Active", "Voluntarily Terminated", "Volunt…
## $ Department                 <chr> "Production", "IT/IS", "Production", "Produ…
## $ ManagerName                <chr> "Michael Albert", "Simon Roup", "Kissy Sull…
## $ ManagerID                  <dbl> 22, 4, 20, 16, 39, 11, 10, 19, 12, 7, 14, 2…
## $ RecruitmentSource          <chr> "LinkedIn", "Indeed", "LinkedIn", "Indeed",…
## $ PerformanceScore           <chr> "Exceeds", "Fully Meets", "Fully Meets", "F…
## $ EngagementSurvey           <dbl> 4.60, 4.96, 3.02, 4.84, 5.00, 5.00, 3.04, 5…
## $ EmpSatisfaction            <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ SpecialProjectsCount       <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ LastPerformanceReview_Date <chr> "1/17/2019", "2/24/2016", "5/15/2012", "1/3…
## $ DaysLateLast30             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Absences                   <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…

The key findings from the EDA and their implications for further modeling

Based on the EDA, we saw that:

-   Salary exhibits non-normal behavior and strong right skewness.

-   Extreme values (executive-level salaries) may influence regression results.

-   Transformation techniques (e.g., log transformation of salary) may improve model performance.

-   Categorical variables such as department, position, and employment status are likely strong predictors of salary.

-   Gender alone may not fully explain salary variation without controlling for position and department.
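
Structure of the two datasets prepared for modeling (the first contains log10_salary, the second contains salary):

## Rows: 311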
## Columns: 15
## $ married_id                 <fct> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ marital_status_id          <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ gender_id                  <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ emp_status_id              <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ dept_id                    <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ perf_score_id              <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ from_diversity_job_fair_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ termd                      <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ position_id                <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ emp_satisfaction           <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ special_projects_count     <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ days_late_last30           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ absences                   <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…
## $ status                     <chr> "no_married", "married", "married", "marrie…
## $ log10_salary               <dbl> 4.795922, 5.018854, 4.812613, 4.812853, 4.7…
## Rows: 311
## Columns: 15
## $ married_id                 <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ marital_status_id          <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ gender_id                  <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ emp_status_id              <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ dept_id                    <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ perf_score_id              <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ from_diversity_job_fair_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ termd                      <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ position_id                <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ emp_satisfaction           <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ special_projects_count     <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ days_late_last30           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ absences                   <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…
## $ salary                     <int> 62506, 104437, 64955, 64991, 50825, 57568, …
## $ status                     <chr> "no_married", "married", "married", "marrie…
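
The two structures above differ mainly in the response column: one keeps log10_salary, the other keeps salary. As a minimal sketch of why the EDA pointed to a log transformation, the base R code below builds hypothetical right-skewed salaries (constructed from normal quantiles, not taken from HRDataset_v14) and compares skewness before and after log10(). The skewness() helper is written here for illustration and does not come from a package.

```r
# Hypothetical salaries: evenly spread normal quantiles on the log10 scale,
# so the raw values are right-skewed by construction (not the real HR data)
log10_salary_sim <- 4.8 + 0.2 * qnorm(ppoints(311))
salary_sim <- 10^log10_salary_sim

# Moment-based sample skewness (illustrative helper, not from a package)
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

skew_raw <- skewness(salary_sim)          # clearly positive: long right tail
skew_log <- skewness(log10(salary_sim))   # close to zero: roughly symmetric
c(raw = skew_raw, log10 = skew_log)

# The transformation is reversible, so predictions can be mapped back to salary
back_transformed <- 10^log10(salary_sim)
all.equal(back_transformed, salary_sim)
```

A model fitted on log10_salary predicts on the log scale, so its predictions have to be raised to the power of 10 before they are compared with actual salaries.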

Correlation matrix

MRLM - Linear regression model

Generating the regression model

Just a regression …

lm_ih <- lm(salary ~ ., data = dataset)
lm_ih
## 
## Call:
## lm(formula = salary ~ ., data = dataset)
## 
## Coefficients:
##                (Intercept)                  married_id  
##                    95093.0                       443.8  
##          marital_status_id                   gender_id  
##                    -1133.9                      -351.1  
##              emp_status_id                     dept_id  
##                    -1607.4                     -8722.4  
##              perf_score_id  from_diversity_job_fair_id  
##                     6392.3                      3462.4  
##                      termd                 position_id  
##                     3924.0                      -545.7  
##           emp_satisfaction      special_projects_count  
##                      432.2                      2625.3  
##           days_late_last30                    absences  
##                     1663.1                       324.8  
##                     status  
##                         NA
summary(lm_ih)
## 
## Call:
## lm(formula = salary ~ ., data = dataset)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -56336  -9497  -2108   5279 159658 
## 
## Coefficients: (1 not defined because of singularities)
##                            Estimate Std. Error t value     Pr(>|t|)    
## (Intercept)                 95093.0    17138.2   5.549 0.0000000638 ***
## married_id                    443.8     2520.6   0.176       0.8604    
## marital_status_id           -1133.9     1297.3  -0.874       0.3828    
## gender_id                    -351.1     2433.9  -0.144       0.8854    
## emp_status_id               -1607.4     2274.8  -0.707       0.4804    
## dept_id                     -8722.4     1932.4  -4.514 0.0000091892 ***
## perf_score_id                6392.3     3117.4   2.051       0.0412 *  
## from_diversity_job_fair_id   3462.4     4256.8   0.813       0.4167    
## termd                        3924.0     8436.1   0.465       0.6422    
## position_id                  -545.7      219.7  -2.483       0.0136 *  
## emp_satisfaction              432.2     1394.5   0.310       0.7568    
## special_projects_count       2625.3      796.4   3.296       0.0011 ** 
## days_late_last30             1663.1     1418.7   1.172       0.2420    
## absences                      324.8      207.6   1.564       0.1188    
## status                           NA         NA      NA           NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20970 on 297 degrees of freedom
## Multiple R-squared:  0.3345, Adjusted R-squared:  0.3053 
## F-statistic: 11.48 on 13 and 297 DF,  p-value: < 0.00000000000000022

The importance of the variables

vip(lm_ih)
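
For linear models, the vip package documents its model-specific importance score as the absolute value of each coefficient's t-statistic. The following base R sketch reproduces that ranking by hand. It uses the built-in mtcars data purely as a stand-in, so the variable names are not those of the HR dataset.

```r
# Stand-in data: mtcars ships with R; it is NOT the HR dataset
fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Coefficient table with columns: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- summary(fit)$coefficients

# Importance = |t value|, intercept excluded, sorted from most to least important
importance <- sort(abs(coef_table[-1, "t value"]), decreasing = TRUE)
importance
```

Read this way, the first model above ranks dept_id (|t| = 4.514) and special_projects_count (|t| = 3.296) as its two most important predictors.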

Check assumptions of linear regression

library(performance)
par(mfrow = c(2, 2))
plot(lm_ih, pch = 16, col = '#006EA1')

par(mfrow = c(1, 1))

The estimated values of the regression coefficients

tidy(lm_ih)
term                        estimate   std.error  statistic  p.value
(Intercept)                 9.51e+04   1.71e+04   5.55       6.38e-08
married_id                  444        2.52e+03   0.176      0.86
marital_status_id           -1.13e+03  1.3e+03    -0.874     0.383
gender_id                   -351       2.43e+03   -0.144     0.885
emp_status_id               -1.61e+03  2.27e+03   -0.707     0.48
dept_id                     -8.72e+03  1.93e+03   -4.51      9.19e-06
perf_score_id               6.39e+03   3.12e+03   2.05       0.0412
from_diversity_job_fair_id  3.46e+03   4.26e+03   0.813      0.417
termd                       3.92e+03   8.44e+03   0.465      0.642
position_id                 -546       220        -2.48      0.0136
emp_satisfaction            432        1.39e+03   0.31       0.757
special_projects_count      2.63e+03   796        3.3        0.0011
days_late_last30            1.66e+03   1.42e+03   1.17       0.242
absences                    325        208        1.56       0.119
status                      NA         NA         NA         NA

The validation metrics of the model

glance(lm_ih) %>%
  gather(var, values)
var            values
r.squared      0.334
adj.r.squared  0.305
sigma          2.1e+04
statistic      11.5
p.value        4.94e-20
df             13
logLik         -3.53e+03
AIC            7.09e+03
BIC            7.14e+03
deviance       1.31e+11
df.residual    297
nobs           311
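
The adjusted R-squared reported above follows directly from R-squared, the number of observations, and the residual degrees of freedom. A quick check in base R, using the values printed by summary() and glance() (the variable names are illustrative):

```r
r_squared   <- 0.3345  # Multiple R-squared from summary(lm_ih)
n_obs       <- 311     # nobs
df_residual <- 297     # df.residual = nobs minus the number of estimated coefficients

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / df_residual
round(adj_r_squared, 3)
# 0.305
```

The gap between 0.334 and 0.305 is the price paid for fitting 13 predictors on 311 observations.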

Cook's distance, predicted values, residuals and influential points

## OK: No outliers detected.
## - Based on the following method and threshold: cook (1).
## - For variable: (Whole model)
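
The check above can be approximated with base R alone. The sketch below is not the code that produced the message; it takes the reported threshold of 1 at face value and uses the built-in mtcars data as a stand-in for the HR model.

```r
# Stand-in model on built-in data; the fitted HR model would be used instead
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Cook's distance for every observation
cooks_d <- cooks.distance(fit)

# Flag observations above the threshold of 1 quoted in the output above
threshold <- 1
influential <- which(cooks_d > threshold)

# Collect distance, predicted values and residuals in one table for inspection
diagnostics <- data.frame(
  cooks_d  = cooks_d,
  fitted   = fitted(fit),
  residual = residuals(fit)
)

# Observations with the largest Cook's distance first
head(diagnostics[order(-diagnostics$cooks_d), ], 3)
length(influential)
```

A threshold of 1 is lenient. An observation can stay below it and still have a visible effect on the coefficients, so the sorted table is worth reading even when nothing is flagged.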

VIF values obtained for the variables included in the model

Term                        VIF    VIF_CI_low  VIF_CI_high  SE_factor  Tolerance  Tolerance_CI_low  Tolerance_CI_high
married_id                  1.08   1.02        1.37         1.04       0.928      0.73              0.984
marital_status_id           1.06   1.01        1.45         1.03       0.947      0.689             0.993
gender_id                   1.03   1           2.28         1.01       0.971      0.439             0.999
emp_status_id               11.7   9.63        14.4         3.43       0.0851     0.0695            0.104
dept_id                     2.44   2.08        2.91         1.56       0.41       0.344             0.48
perf_score_id               2.36   2.02        2.81         1.54       0.423      0.355             0.495
from_diversity_job_fair_id  1.08   1.02        1.36         1.04       0.923      0.735             0.981
termd                       11.2   9.19        13.7         3.35       0.0892     0.0729            0.109
position_id                 1.32   1.19        1.55         1.15       0.758      0.647             0.843
emp_satisfaction            1.13   1.05        1.36         1.06       0.882      0.734             0.953
special_projects_count      2.47   2.11        2.95         1.57       0.405      0.339             0.474
days_late_last30            2.38   2.04        2.83         1.54       0.42       0.353             0.491
absences                    1.04   1           1.65         1.02       0.96       0.608             0.997
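
As a reading aid for the VIF table: the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors, and tolerance is 1 / VIF. The base R sketch below computes both on simulated, hypothetical predictors. The helper name vif_manual is illustrative, and this is not the implementation used to produce the table.

```r
# Simulated predictors (hypothetical, not the HR data)
set.seed(1)
n  <- 311
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)   # strongly related to x1 on purpose
x3 <- rnorm(n)                  # unrelated predictor
predictors <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the rest
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    aux_formula <- reformulate(setdiff(names(df), v), response = v)
    r2 <- summary(lm(aux_formula, data = df))$r.squared
    1 / (1 - r2)
  })
}

vif_values <- vif_manual(predictors)
data.frame(
  VIF       = vif_values,
  SE_factor = sqrt(vif_values),   # inflation of the coefficient's standard error
  Tolerance = 1 / vif_values
)
```

In the table above the SE_factor column is consistent with the square root of the VIF. For emp_status_id, sqrt(11.7) is about 3.4, meaning its standard error is roughly 3.4 times larger than it would be without collinearity.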

We observe that the coefficient of the status variable is NA. This means there is perfect collinearity between status and another variable: one of them is an exact linear combination of the other. As a result, the design matrix is singular and the VIF cannot be computed for it.

The variable that is perfectly collinear with status is married_id.

# vif(lm_ih)
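
The effect described above can be reproduced on a small, hypothetical data frame in which status is only a text re-encoding of married_id. This is a sketch with made-up values. The alias() function from base R's stats package reports the exact dependency.

```r
# Hypothetical toy data: `status` carries exactly the same information as `married_id`
set.seed(7)
toy <- data.frame(
  married_id = rep(c(0, 1), times = 10),
  absences   = rep(1:10, each = 2)
)
toy$status <- ifelse(toy$married_id == 1, "married", "no_married")
toy$salary <- 50000 + 2000 * toy$married_id + 300 * toy$absences + rnorm(20, sd = 500)

fit_toy <- lm(salary ~ ., data = toy)

# The redundant column gets an NA coefficient
# ("not defined because of singularities" in summary())
coef(fit_toy)

# alias() shows which combination of the other terms reproduces the redundant one
alias(fit_toy)$Complete
```

Here the dummy variable created for status equals 1 - married_id, so lm() keeps married_id and leaves the status coefficient undefined.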

We therefore remove the status variable and the termd variable (highly correlated with emp_status_id; the correlation coefficient is 0.91) from the dataset and rebuild the model.
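
A pairwise correlation check of this kind can be sketched as follows. The data are simulated and hypothetical, built so that two columns carry nearly the same information, and the 0.8 cut-off is an assumption chosen for illustration.

```r
# Hypothetical toy data: the status code is high for terminated staff, low otherwise
set.seed(3)
n <- 311
termd_sim <- rbinom(n, size = 1, prob = 0.33)
emp_status_sim <- ifelse(
  termd_sim == 1,
  sample(4:5, n, replace = TRUE),
  sample(1:3, n, replace = TRUE, prob = c(0.8, 0.1, 0.1))
)
toy <- data.frame(
  termd         = termd_sim,
  emp_status_id = emp_status_sim,
  absences      = sample(1:20, n, replace = TRUE)
)

# Pairwise Pearson correlations
cor_matrix <- cor(toy)
round(cor_matrix, 2)

# List each pair once (upper triangle) when |r| exceeds the cut-off
cutoff <- 0.8
high <- which(abs(cor_matrix) > cutoff & upper.tri(cor_matrix), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_matrix)[high[, "row"]],
  var2 = colnames(cor_matrix)[high[, "col"]],
  r    = cor_matrix[high]
)
high_pairs
```

Only one variable of each flagged pair needs to be dropped. Which one to keep is a modeling choice.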

Generating a new regression model without including the status and termd variables

lm_ih <- lm(salary ~ ., data = dataset %>% dplyr::select(-status, -termd))
lm_ih
## 
## Call:
## lm(formula = salary ~ ., data = dataset %>% dplyr::select(-status, 
##     -termd))
## 
## Coefficients:
##                (Intercept)                  married_id  
##                    93194.6                       413.6  
##          marital_status_id                   gender_id  
##                    -1132.3                      -324.3  
##              emp_status_id                     dept_id  
##                     -603.2                     -8628.1  
##              perf_score_id  from_diversity_job_fair_id  
##                     6594.0                      3185.4  
##                position_id            emp_satisfaction  
##                     -560.5                       406.1  
##     special_projects_count            days_late_last30  
##                     2664.4                      1815.1  
##                   absences  
##                      327.0
summary(lm_ih)
## 
## Call:
## lm(formula = salary ~ ., data = dataset %>% dplyr::select(-status, 
##     -termd))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -56169  -9542  -1754   5019 160081 
## 
## Coefficients:
##                            Estimate Std. Error t value     Pr(>|t|)    
## (Intercept)                 93194.6    16623.2   5.606 0.0000000471 ***
## married_id                    413.6     2516.4   0.164     0.869563    
## marital_status_id           -1132.3     1295.6  -0.874     0.382829    
## gender_id                    -324.3     2430.0  -0.133     0.893919    
## emp_status_id                -603.2      715.8  -0.843     0.400069    
## dept_id                     -8628.1     1919.2  -4.496 0.0000099409 ***
## perf_score_id                6594.0     3083.0   2.139     0.033265 *  
## from_diversity_job_fair_id   3185.4     4209.4   0.757     0.449816    
## position_id                  -560.5      217.1  -2.581     0.010318 *  
## emp_satisfaction              406.1     1391.6   0.292     0.770612    
## special_projects_count       2664.4      790.9   3.369     0.000855 ***
## days_late_last30             1815.1     1378.8   1.316     0.189032    
## absences                      327.0      207.3   1.577     0.115772    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20940 on 298 degrees of freedom
## Multiple R-squared:  0.334,  Adjusted R-squared:  0.3072 
## F-statistic: 12.45 on 12 and 298 DF,  p-value: < 0.00000000000000022

The estimated values of the regression coefficients

tidy(lm_ih)
term                        estimate   std.error  statistic  p.value
(Intercept)                 9.32e+04   1.66e+04   5.61       4.71e-08
married_id                  414        2.52e+03   0.164      0.87
marital_status_id           -1.13e+03  1.3e+03    -0.874     0.383
gender_id                   -324       2.43e+03   -0.133     0.894
emp_status_id               -603       716        -0.843     0.4
dept_id                     -8.63e+03  1.92e+03   -4.5       9.94e-06
perf_score_id               6.59e+03   3.08e+03   2.14       0.0333
from_diversity_job_fair_id  3.19e+03   4.21e+03   0.757      0.45
position_id                 -560       217        -2.58      0.0103
emp_satisfaction            406        1.39e+03   0.292      0.771
special_projects_count      2.66e+03   791        3.37       0.000855
days_late_last30            1.82e+03   1.38e+03   1.32       0.189
absences                    327        207        1.58       0.116

The validation metrics of the model

glance(lm_ih) %>%
  gather(var, values)
var            values
r.squared      0.334
adj.r.squared  0.307
sigma          2.09e+04
statistic      12.5
p.value        1.5e-20
df             12
logLik         -3.53e+03
AIC            7.09e+03
BIC            7.14e+03
deviance       1.31e+11
df.residual    298
nobs           311

VIF values obtained for the variables included in the model

Term                        VIF    VIF_CI_low  VIF_CI_high  SE_factor  Tolerance  Tolerance_CI_low  Tolerance_CI_high
married_id                  1.08   1.02        1.38         1.04       0.929      0.727             0.985
marital_status_id           1.06   1.01        1.46         1.03       0.947      0.686             0.993
gender_id                   1.03   1           2.38         1.01       0.972      0.42              0.999
emp_status_id               1.17   1.07        1.39         1.08       0.857      0.721             0.933
dept_id                     2.41   2.06        2.88         1.55       0.415      0.347             0.485
perf_score_id               2.32   1.98        2.76         1.52       0.432      0.362             0.504
from_diversity_job_fair_id  1.06   1.01        1.42         1.03       0.941      0.705             0.991
position_id                 1.29   1.16        1.52         1.14       0.775      0.659             0.859
emp_satisfaction            1.13   1.05        1.36         1.06       0.883      0.734             0.954
special_projects_count      2.44   2.08        2.92         1.56       0.41       0.343             0.48
days_late_last30            2.25   1.93        2.68         1.5        0.444      0.373             0.518
absences                    1.04   1           1.67         1.02       0.961      0.598             0.998

We will rebuild the model including only the significant variables.

Generating a new regression model using only the significant variables

Variables that have no statistically significant effect on salary (p-value > 0.05) are:

term                        estimate   std.error  statistic  p.value
married_id                  414        2.52e+03   0.164      0.87
marital_status_id           -1.13e+03  1.3e+03    -0.874     0.383
gender_id                   -324       2.43e+03   -0.133     0.894
emp_status_id               -603       716        -0.843     0.4
from_diversity_job_fair_id  3.19e+03   4.21e+03   0.757      0.45
emp_satisfaction            406        1.39e+03   0.292      0.771
days_late_last30            1.82e+03   1.38e+03   1.32       0.189
absences                    327        207        1.58       0.116
## [1] "married_id, marital_status_id, gender_id, emp_status_id, from_diversity_job_fair_id, emp_satisfaction, days_late_last30, absences"
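
The code that produced the list above is not shown. One way to obtain such a list, sketched here in base R on simulated, hypothetical data, is to read the p-values from the coefficient table and keep the terms above the 5% level.

```r
# Simulated data (hypothetical): only x1 truly drives y
set.seed(11)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 10 + 3 * sim$x1 + rnorm(n)

fit <- lm(y ~ ., data = sim)

# p-values of the predictors (the intercept row is dropped)
p_values <- summary(fit)$coefficients[-1, "Pr(>|t|)"]

alpha <- 0.05
not_significant <- names(p_values)[p_values > alpha]
kept            <- names(p_values)[p_values <= alpha]
paste(not_significant, collapse = ", ")

# Refit using only the predictors that passed the threshold
fit_reduced <- lm(reformulate(kept, response = "y"), data = sim)
coef(fit_reduced)
```

Dropping every non-significant term in a single pass is a simplification. P-values change once other predictors leave the model, so the reduced model should be checked again.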
lm_ih <- lm(salary ~ ., data = dataset %>% dplyr::select(-c(status, termd, married_id, marital_status_id, gender_id, emp_status_id, from_diversity_job_fair_id, emp_satisfaction, days_late_last30, absences)))
lm_ih
## 
## Call:
## lm(formula = salary ~ ., data = dataset %>% dplyr::select(-c(status, 
##     termd, married_id, marital_status_id, gender_id, emp_status_id, 
##     from_diversity_job_fair_id, emp_satisfaction, days_late_last30, 
##     absences)))
## 
## Coefficients:
##            (Intercept)                 dept_id           perf_score_id  
##               105263.2                 -8607.9                  4063.2  
##            position_id  special_projects_count  
##                 -620.5                  2682.4
summary(lm_ih)
## 
## Call:
## lm(formula = salary ~ ., data = dataset %>% dplyr::select(-c(status, 
##     termd, married_id, marital_status_id, gender_id, emp_status_id, 
##     from_diversity_job_fair_id, emp_satisfaction, days_late_last30, 
##     absences)))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -48859  -9725  -2244   4500 159690 
## 
## Coefficients:
##                        Estimate Std. Error t value          Pr(>|t|)    
## (Intercept)            105263.2    13558.0   7.764 0.000000000000125 ***
## dept_id                 -8607.9     1895.7  -4.541 0.000008070682410 ***
## perf_score_id            4063.2     2028.2   2.003          0.046014 *  
## position_id              -620.5      212.1  -2.925          0.003700 ** 
## special_projects_count   2682.4      770.8   3.480          0.000575 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20900 on 306 degrees of freedom
## Multiple R-squared:  0.3188, Adjusted R-squared:  0.3099 
## F-statistic: 35.81 on 4 and 306 DF,  p-value: < 0.00000000000000022

The estimated values of the regression coefficients

tidy(lm_ih)
term                    estimate   std.error  statistic  p.value
(Intercept)             1.05e+05   1.36e+04   7.76       1.25e-13
dept_id                 -8.61e+03  1.9e+03    -4.54      8.07e-06
perf_score_id           4.06e+03   2.03e+03   2          0.046
position_id             -620       212        -2.93      0.0037
special_projects_count  2.68e+03   771        3.48       0.000575

The validation metrics of the model

VIF values obtained for the variables included in the model

Term                    VIF    VIF_CI_low  VIF_CI_high  SE_factor  Tolerance  Tolerance_CI_low  Tolerance_CI_high
dept_id                 2.36   2.01        2.83         1.54       0.423      0.353             0.497
perf_score_id           1.01   1           1.96e+05     1          0.994      5.09e-06          1
position_id             1.24   1.12        1.47         1.11       0.808      0.682             0.892
special_projects_count  2.33   1.99        2.79         1.53       0.43       0.358             0.504

As the collinearity check shows, the multicollinearity effect was eliminated: all VIF values in the final model are below 3. The model metrics are almost unchanged: about 32% of the variation of the dependent variable, salary, is explained by the model (R-squared = 0.3188, adjusted R-squared = 0.3099). The model as a whole is significant (p-value < 0.05), and all four retained predictors are significant at the 5% level, although perf_score_id only marginally (p-value = 0.046).

From the assumption checks of the model we can observe that:

-   The model does not capture the actual shape of the salary distribution, suggesting potential skewness or the presence of outliers that are not well handled by linear regression.

-   The relationship between the predictors and salary does not appear to be fully linear, indicating possible model misspecification.

-   There is evidence of heteroscedasticity, as the variance of the residuals is not constant across fitted values.

-   Several observations exhibit relatively high leverage (e.g., 132, 309, 151), indicating the presence of influential data points that may affect the estimated regression coefficients.

-   High multicollinearity was detected among some predictors (emp_status_id and termd had VIF values above 10), which can lead to unstable coefficient estimates and inflated standard errors. After eliminating the variables with high VIF values, the model was rebuilt.

-   The residuals are not perfectly normally distributed, likely due to positive skewness in salary values. Although this is less critical in large samples, it may still affect statistical inference, including p-values and confidence intervals.
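
The heteroscedasticity noted above was read from the diagnostic plots. A numeric check can complement them. The base R sketch below computes a studentized Breusch-Pagan statistic by hand on simulated, hypothetical data in which the noise grows with the predictor. It was not run on the HR model.

```r
# Hypothetical data: residual spread increases with x (heteroscedastic by design)
set.seed(5)
n <- 311
x <- runif(n, 1, 10)
y <- 50000 + 3000 * x + rnorm(n, sd = 1500 * x)
fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the model's predictor
aux <- lm(residuals(fit)^2 ~ x)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary regression,
# compared with a chi-squared distribution (df = number of predictors)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(statistic = bp_stat, p_value = bp_p)
```

A small p-value rejects constant residual variance, which is the pattern a funnel-shaped residuals-versus-fitted plot suggests.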

To improve the model, the following steps should be considered:

-   Using robust regression techniques;

-   Implementing regularized models such as Ridge or Lasso regression.
##                dept_id          perf_score_id            position_id 
##               2.362817               1.006371               1.237009 
## special_projects_count 
##               2.328210
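
The two improvement steps proposed above can be sketched with the MASS package, which ships with standard R installations. The data below are simulated and hypothetical, with a few injected extreme salaries, and the penalty grid is arbitrary. Lasso needs an add-on package such as glmnet and is not shown.

```r
library(MASS)  # provides rlm() and lm.ridge()

# Hypothetical data with three injected "executive-level" salaries
set.seed(9)
n <- 311
toy <- data.frame(
  projects = rpois(n, 1.2),
  perf     = sample(1:4, n, replace = TRUE)
)
toy$salary <- 55000 + 2500 * toy$projects + 3000 * toy$perf + rnorm(n, sd = 8000)
toy$salary[1:3] <- c(250000, 220000, 180000)

# Ordinary least squares versus robust regression (Huber M-estimator by default)
fit_ols    <- lm(salary ~ projects + perf, data = toy)
fit_robust <- rlm(salary ~ projects + perf, data = toy)
rbind(ols = coef(fit_ols), robust = coef(fit_robust))

# Final robustness weights: the injected outliers are down-weighted below 1
round(fit_robust$w[1:3], 3)

# Ridge regression for a small grid of penalties; lambda = 0 corresponds to OLS
fit_ridge <- lm.ridge(salary ~ projects + perf, data = toy, lambda = c(0, 10, 100))
coef(fit_ridge)   # one row of coefficients per lambda
```

Robust regression targets the extreme salaries directly. Ridge mainly stabilises coefficients under collinearity, so the two address different problems from the list above.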

Bibliography

  1. Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.
  2. Kuhn et al., (2020). Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles. https://www.tidymodels.org
  3. Waring E, Quinn M, McNamara A, Arino de la Rubia E, Zhu H, Ellis S (2022). skimr: Compact and Flexible Summaries of Data. R package version 2.1.5, https://CRAN.R-project.org/package=skimr.
  4. Peterson BG, Carl P (2020). PerformanceAnalytics: Econometric Tools for Performance and Risk Analysis. R package version 2.0.4, https://CRAN.R-project.org/package=PerformanceAnalytics.
  5. Wickham H, Pedersen T, Seidel D (2025). scales: Scale Functions for Visualization. R package version 1.4.0, https://CRAN.R-project.org/package=scales.
  6. Fox J, Weisberg S (2019). An R Companion to Applied Regression, Third edition. Sage, Thousand Oaks CA. https://socialsciences.mcmaster.ca/jfox/Books/Companion/.
  7. Schloerke B, Cook D, Larmarange J, Briatte F, Marbach M, Thoen E, Elberg A, Crowley J (2021). GGally: Extension to ‘ggplot2’. R package version 2.1.2, https://CRAN.R-project.org/package=GGally.
  8. Taiyun Wei and Viliam Simko (2021). R package ‘corrplot’: Visualization of a Correlation Matrix (Version 0.92). Available from https://github.com/taiyun/corrplot
  9. Rubba C (2023). htmltab: Assemble Data Frames from HTML Tables. R package version 0.8.2.9000, https://github.com/htmltab/htmltab.
  10. Brandon M. Greenwell and Bradley C. Boehmke (2020). Variable Importance Plots—An Introduction to the vip Package. The R Journal, 12(1), 343–366. URL https://doi.org/10.32614/RJ-2020-013.
  11. Marvin N. Wright, Andreas Ziegler (2017). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, 77(1), 1-17. doi:10.18637/jss.v077.i01
  12. Robinson D, Hayes A, Couch S (2023). broom: Convert Statistical Objects into Tidy Tibbles. R package version 1.0.5, https://CRAN.R-project.org/package=broom.
  13. H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
  14. Wickham H, Hester J, Bryan J (2023). readr: Read Rectangular Text Data. R package version 2.1.4, https://CRAN.R-project.org/package=readr.
  15. Zhu H (2021). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.3.4, https://CRAN.R-project.org/package=kableExtra.
  16. Cui B (2020). DataExplorer: Automate Data Exploration and Treatment. R package version 0.8.2, https://CRAN.R-project.org/package=DataExplorer.
  17. Rushworth A (2022). inspectdf: Inspection, Comparison and Visualisation of Data Frames. R package version 0.0.12, https://CRAN.R-project.org/package=inspectdf.
  18. Grosjean P, Ibanez F (2018). pastecs: Package for Analysis of Space-Time Ecological Series. R package version 1.3.21, https://CRAN.R-project.org/package=pastecs.
  19. https://www.kaggle.com/datasets/rhuebner/human-resources-data-set
  20. William Revelle (2023). psych: Procedures for Psychological, Psychometric, and Personality Research. Northwestern University, Evanston, Illinois. R package version 2.3.9, https://CRAN.R-project.org/package=psych.