2023-10-15

Problem Description

In this project, we aim to develop a predictive model that estimates an individual’s salary from their years of professional experience. By leveraging historical salary and experience data, the goal is to build an accurate, efficient model that can aid in salary negotiations, job market analysis, and workforce planning.

We’ll be using the Salary dataset obtained from Kaggle (https://www.kaggle.com/datasets/abhishek14398/salary-dataset-simple-linear-regression) for this project.

Loading data
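The code that produced the output below is not shown; a minimal sketch, assuming the Kaggle CSV has been saved locally as Salary_dataset.csv (file name assumed) and read into a data frame named salary, would be:

# Load the dataset from a local CSV file (file name assumed)
salary <- read.csv("Salary_dataset.csv")

# Preview the first rows and summarise each column
head(salary)
summary(salary)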

##   X YearsExperience Salary
## 1 0             1.2  39344
## 2 1             1.4  46206
## 3 2             1.6  37732
## 4 3             2.1  43526
## 5 4             2.3  39892
## 6 5             3.0  56643
##        X         YearsExperience      Salary      
##  Min.   : 0.00   Min.   : 1.200   Min.   : 37732  
##  1st Qu.: 7.25   1st Qu.: 3.300   1st Qu.: 56722  
##  Median :14.50   Median : 4.800   Median : 65238  
##  Mean   :14.50   Mean   : 5.413   Mean   : 76004  
##  3rd Qu.:21.75   3rd Qu.: 7.800   3rd Qu.:100546  
##  Max.   :29.00   Max.   :10.600   Max.   :122392

Properties of dataset

The data types of each column are

## 'data.frame':    30 obs. of  3 variables:
##  $ X              : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ YearsExperience: num  1.2 1.4 1.6 2.1 2.3 3 3.1 3.3 3.3 3.8 ...
##  $ Salary         : num  39344 46206 37732 43526 39892 ...

The number of null values in each column is as shown below :-

##               X YearsExperience          Salary 
##               0               0               0

From this we observe that no column contains any null values, so we will not need to perform imputation or drop any rows from the dataset.

Exploratory Data Analysis (EDA)
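The histogram discussed below is built from the YearsExperience column; a minimal ggplot2 sketch of how it might be generated (bin width assumed) is:

library(ggplot2)

# Histogram of years of experience (bin width assumed)
ggplot(salary, aes(x = YearsExperience)) +
  geom_histogram(binwidth = 1, fill = "steelblue", colour = "white") +
  labs(
    x = "Years of Experience",
    y = "Count",
    title = "Distribution of Years of Experience"
  )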

From this unimodal histogram we can see that most people have around 5 years of experience (the mode). Since the histogram is slightly right-skewed, the mean YOE is greater than the median YOE, which in turn is greater than the mode.

Continuation of EDA
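A similar sketch for the salary histogram discussed below (bin width assumed):

# Histogram of salaries (bin width assumed)
ggplot(salary, aes(x = Salary)) +
  geom_histogram(binwidth = 10000, fill = "steelblue", colour = "white") +
  labs(
    x = "Salary",
    y = "Count",
    title = "Distribution of Salary"
  )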

From this unimodal histogram we can see that the most common salary is about $50k (the mode). Since the histogram is slightly right-skewed, the mean salary is greater than the median salary, which in turn is greater than the mode.

Continuation of EDA

We will now plot years of experience against salary as a scatter plot to see how strong the correlation is between these two variables, with ‘YearsExperience’ as the independent variable and ‘Salary’ as the dependent variable.
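The correlation coefficient reported below can be obtained with cor(); a minimal sketch:

# Pearson correlation between years of experience and salary
corr <- cor(salary$YearsExperience, salary$Salary)
cat("Correlation coefficient between YearsExperience and Salary:", corr, "\n")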

## Correlation coefficient between YearsExperience and Salary: 0.9782416
ggplot(salary, aes(x = YearsExperience, y = Salary)) +
  geom_point() +
  labs(
    x = "Years of Experience",
    y = "Salary",
    title = "Salary vs. Years of Experience Scatter Plot"
  )

Continuation of EDA

From this plot, and given the strongly positive correlation between the two variables, we can see that the more years of experience a person has, the higher the salary they can expect.

Linear Regression

Due to the strong association between YearsExperience and Salary indicated by the highly positive correlation observed before, it makes sense to use a linear regression model to approximately predict a worker’s salary using their years of experience as input.

A linear regression model is a mathematical model used for predicting the numeric value of a quantitative dependent variable (Salary) based on historical data relating YearsExperience with Salary. It would attempt to fit a model of the form below using the historical data :- \[ \text{Salary} = \beta_0 + \beta_1 \times \text{YOE} + \epsilon \]

Continuation of Linear Regression

  • \(\text{Salary}\) represents the dependent variable, which is the individual’s salary
  • \(\beta_0\) is the intercept term, representing the expected salary when the number of years of experience is zero.
  • \(\beta_1\) is the coefficient for the independent variable \(\text{YearsExperience}\), indicating how much the salary is expected to change for each additional year of experience.
  • \(\epsilon\) represents the error term, which accounts for unexplained variance in salary that is not captured by the model.

\[
\beta_1 = \frac{\sum_{i=1}^{n}(\text{YOE}_i-\overline{\text{YOE}})(\text{Salary}_i-\overline{\text{Salary}})}{\sum_{i=1}^{n}(\text{YOE}_i-\overline{\text{YOE}})^2}
\\
\beta_0 = \overline{\text{Salary}} - (\beta_1 \times \overline{\text{YOE}})
\]
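For illustration, these closed-form estimates can be computed directly in R; a minimal sketch over the full dataset (the model fitted later uses only the training split):

# Closed-form least-squares estimates of the slope and intercept
x <- salary$YearsExperience
y <- salary$Salary

beta_1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta_0 <- mean(y) - beta_1 * mean(x)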

Coefficient of Determination

  • \(R^2\) is the coefficient of determination (R-squared).
  • \(y_i\) represents the observed values of the dependent variable.
  • \(\hat{y}_i\) represents the predicted values of the dependent variable from the regression model.
  • \(\bar{y}\) is the mean of the observed values of the dependent variable.

A higher \(R^2\) value (closer to 1) indicates that a larger proportion of the variance in the dependent variable is explained by the independent variables, while a lower \(R^2\) value (closer to 0) suggests that the model explains less of the variance.

\[ R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \]
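For illustration, a minimal sketch of how \(R^2\) could be computed by hand from a vector of observed values and a vector of predictions (helper name hypothetical):

# R-squared computed directly from observed values y and predictions y_hat
r_squared_manual <- function(y, y_hat) {
  ss_res <- sum((y - y_hat)^2)    # residual sum of squares
  ss_tot <- sum((y - mean(y))^2)  # total sum of squares
  1 - ss_res / ss_tot
}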

Creating the train and test data sets

The data set is now split into train and test sets, holding out 10 of the 30 observations (about a third) as the test set.
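The splitting code is not shown; one simple split that reproduces the rows listed below is to hold out every third row starting from the second (the exact method used is assumed):

# Hold out every third row (starting from row 2) as the test set
test_idx   <- seq(2, nrow(salary), by = 3)   # rows 2, 5, 8, ..., 29
test_data  <- salary[test_idx, ]
train_data <- salary[-test_idx, ]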

The train data set is given below :-

##     X YearsExperience Salary
## 1   0             1.2  39344
## 3   2             1.6  37732
## 4   3             2.1  43526
## 6   5             3.0  56643
## 7   6             3.1  60151
## 9   8             3.3  64446
## 10  9             3.8  57190
## 12 11             4.1  55795
## 13 12             4.1  56958
## 15 14             4.6  61112
## 16 15             5.0  67939
## 18 17             5.4  83089
## 19 18             6.0  81364
## 21 20             6.9  91739
## 22 21             7.2  98274
## 24 23             8.3 113813
## 25 24             8.8 109432
## 27 26             9.6 116970
## 28 27             9.7 112636
## 30 29            10.6 121873

Continuation

The test data set is given below :-

##     X YearsExperience Salary
## 2   1             1.4  46206
## 5   4             2.3  39892
## 8   7             3.3  54446
## 11 10             4.0  63219
## 14 13             4.2  57082
## 17 16             5.2  66030
## 20 19             6.1  93941
## 23 22             8.0 101303
## 26 25             9.1 105583
## 29 28            10.4 122392

Creation of the model and its training

lm_model <- lm(Salary ~ YearsExperience, data = train_data)
lm_summary <- summary(lm_model)

# Extract the R-squared value
r_squared <- lm_summary$r.squared

# Print the R-squared value
cat("R-squared (R²) Value:", r_squared, "\n")
## R-squared (R²) Value: 0.9627567

Since the value of \(R^2\) is high, the model is a good fit for this problem, as it explains a large proportion of the variance in Salary.

Properties of the model

## 
## Call:
## lm(formula = Salary ~ YearsExperience, data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8134.5 -4112.8    -1.6  3597.4  9882.3 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      24880.7     2688.7   9.254 2.91e-08 ***
## YearsExperience   9524.1      441.5  21.571 2.60e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5482 on 18 degrees of freedom
## Multiple R-squared:  0.9628, Adjusted R-squared:  0.9607 
## F-statistic: 465.3 on 1 and 18 DF,  p-value: 2.601e-14

Model Evaluation

\[
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
\\
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\\
\text{RMSE} = \sqrt{\text{MSE}}
\]
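The evaluation code is not shown; a minimal sketch of how the predictions and metrics reported below might have been computed:

# Predict salaries for the held-out test set
test_predictions <- predict(lm_model, newdata = test_data)

# Compute the evaluation metrics defined above
mae  <- mean(abs(test_data$Salary - test_predictions))
mse  <- mean((test_data$Salary - test_predictions)^2)
rmse <- sqrt(mse)

cat("Mean absolute Error (MAE):", mae, "\n")
cat("Mean Squared Error (MSE):", mse, "\n")
cat("Root Mean Squared Error (RMSE):", rmse, "\n")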

## Mean absolute Error (MAE): 5186.68
## Mean Squared Error (MSE): 40414455
## Root Mean Squared Error (RMSE): 6357.236

Visualizing the residuals

library(plotly)

# Calculate the residuals on the test set
residuals <- test_data$Salary - test_predictions

# Create a data frame for residuals and YearsExperience
residual_data <- data.frame(YearsExperience = test_data$YearsExperience, Residuals = residuals)

# Create a scatter plot of residuals
plot <- plot_ly(residual_data, x = ~YearsExperience, y = ~Residuals, mode = "markers", type = "scatter", name = "Residuals")

# Customize the plot layout
plot <- plot %>% layout(
  title = "Residuals Scatter Plot",
  xaxis = list(title = "YearsExperience"),
  yaxis = list(title = "Residuals"),
  showlegend = TRUE
)

# Display the plot
plot

Visualizing the residuals - II

Since the residuals appear randomly scattered with no obvious pattern, this is a good indication that the assumptions of a linear regression model, such as independence of the errors and homoscedasticity (constant variance), are met, and hence the fitted model provides a good fit for the data.

Model Viz.
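The plot referred to below is not reproduced here; a minimal ggplot2 sketch that overlays the fitted regression line on the data (plot details assumed) is:

# Scatter plot of the data with the fitted regression line overlaid
ggplot(salary, aes(x = YearsExperience, y = Salary)) +
  geom_point() +
  geom_abline(
    intercept = coef(lm_model)[1],
    slope     = coef(lm_model)[2],
    colour    = "blue"
  ) +
  labs(
    x = "Years of Experience",
    y = "Salary",
    title = "Fitted Linear Regression Model"
  )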

From this plot we can see that the trained model fits the data well; the small residuals further support this inference.