Getting Started

Importing the Data

Before we can build a prediction model, we need to load the data it will be fitted on. The dataset used in this project is called “Salary_Data” and is stored in the file Salary_Data.csv.

salary_df <- read.csv("Salary_Data.csv")

Importing the Libraries

For this project, we will use a few libraries for plotting and manipulating the data. Most notably, ggplot2 will be used to create visualizations and dplyr will be used to transform and manipulate the data. The scales library provides the comma label formatter used on the salary axis of our scatter plots.

library(ggplot2)
library(dplyr)
library(scales)

About the Dataset

Overview

This dataset provides an overview of how experience and age relate to salary. It contains columns for age, gender, education level, job title, years of experience, and salary. In this project, we use it to explore the relationship between an employee’s years of experience and their salary to understand how professional growth impacts earnings.

More information about this dataset can be found on Kaggle at https://www.kaggle.com/datasets/mubeenshehzadi/salary-dataset.

summary(salary_df)
##       Age           Gender          Education.Level     Job.Title        
##  Min.   :21.00   Length:6704        Length:6704        Length:6704       
##  1st Qu.:28.00   Class :character   Class :character   Class :character  
##  Median :32.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :33.62                                                           
##  3rd Qu.:38.00                                                           
##  Max.   :62.00                                                           
##  NA's   :2                                                               
##  Years.of.Experience     Salary      
##  Min.   : 0.000      Min.   :   350  
##  1st Qu.: 3.000      1st Qu.: 70000  
##  Median : 7.000      Median :115000  
##  Mean   : 8.095      Mean   :115327  
##  3rd Qu.:12.000      3rd Qu.:160000  
##  Max.   :34.000      Max.   :250000  
##  NA's   :3           NA's   :5
head(salary_df)

Introduction

Background

In today’s corporate job market, it is important for both employers and employees to understand the factors that drive compensation. Employees benefit from a clear idea of their earning potential and career path. Employers, meanwhile, need to offer fair and competitive salaries to retain employees and attract new talent.

Problem

The main objective of this analysis is to model the relationship between an employee’s years of experience and their annual salary. Some of the questions that we are trying to answer include:

  • Is there a statistically significant relationship between years of experience and salary?
  • Can we use the relationship to create a model that predicts an employee’s salary based on experience?
  • Is the relationship between years of experience and salary linear?

This project aims to answer these questions using statistical methods. Years of experience is typically regarded as one of the biggest deciding factors of salary. With this project, we will dive into the data to get a more in-depth understanding of how this plays out.

Cleaning the Data

Getting Started

The first step, which we have already done, is to load the data into our R project. Immediately after loading the data, we used head() and summary() to check the structure and integrity of the data, along with the type of each column.

Missing values

Before moving forward with our data, we should check whether any columns contain missing or invalid values.

colSums(is.na(salary_df))
##                 Age              Gender     Education.Level           Job.Title 
##                   2                   0                   0                   0 
## Years.of.Experience              Salary 
##                   3                   5

Using this code, we discover that we have missing values in the Age, Years.of.Experience, and Salary columns. Since these account for a very small portion of our 6,704 rows (under 0.5%), we can safely omit the affected rows. R’s lm() function cannot use rows with missing values and would drop them by default; however, it is better to handle this explicitly beforehand so that we know exactly which data the model is fitted on.

salary_df <- na.omit(salary_df)

colSums(is.na(salary_df))
##                 Age              Gender     Education.Level           Job.Title 
##                   0                   0                   0                   0 
## Years.of.Experience              Salary 
##                   0                   0
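To back up the “small portion” reasoning above, the share of incomplete rows can be computed directly. The following is a minimal sketch on a made-up data frame (toy_df and its values are hypothetical stand-ins, not the Salary_Data file); the same calls should apply to salary_df before na.omit() is run.

```r
# Hypothetical toy data standing in for salary_df (values are made up)
toy_df <- data.frame(
  Age = c(25, NA, 41, 33, 29),
  Years.of.Experience = c(2, 5, NA, 10, 6),
  Salary = c(50000, 72000, 130000, NA, 80000)
)

# complete.cases() is TRUE for rows with no NA in any column
incomplete_rows <- sum(!complete.cases(toy_df))
incomplete_share <- mean(!complete.cases(toy_df))

cat(sprintf("%d of %d rows (%.1f%%) contain at least one NA\n",
            incomplete_rows, nrow(toy_df), 100 * incomplete_share))

# na.omit() drops exactly those incomplete rows
clean_df <- na.omit(toy_df)
```

Counting incomplete rows rather than summing the per-column NA counts avoids double counting a row that is missing more than one value.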

Column Name Updates

Our dataset now contains no NA values. Next, we are going to rename the columns so they are easier to work with. When the file was imported, read.csv() replaced characters that are not valid in R names (most likely spaces in the original headers) with dots, giving names like Education.Level and Years.of.Experience. Replacing these dots with underscores makes the column names more consistent and readable.

names(salary_df) <- c("Age", "Gender", "Education_Level", "Job_Title", "Years_of_Experience", "Salary")

head(salary_df)
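The dotted names seen in the earlier summary() output come from the import step. The sketch below uses a tiny inline CSV (csv_text and its rows are made up for illustration) to show how read.csv() handles headers containing spaces, and how the dots can be replaced in one call instead of retyping every name.

```r
# A tiny inline CSV with spaces in the header (made-up rows)
csv_text <- "Age,Education Level,Years of Experience,Salary
32,PhD,5,90000
28,High School,3,65000"

# Default check.names = TRUE turns characters that are invalid in R names into dots
demo_df <- read.csv(text = csv_text)
print(names(demo_df))

# Replace every literal dot with an underscore
names(demo_df) <- gsub(".", "_", names(demo_df), fixed = TRUE)
print(names(demo_df))
```

Assigning the full vector of names, as the report does, is just as valid; the gsub() variant simply avoids having to keep the names in the right order by hand.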

Education Level Categories

Finally, our data also includes the education level of each individual; however, some education levels appear more than once under slightly different labels.

print(unique(salary_df$Education_Level))
## [1] "Bachelor's"        "Master's"          "PhD"              
## [4] "Bachelor's Degree" "Master's Degree"   ""                 
## [7] "High School"       "phD"

For example, a PhD appears as both “PhD” and “phD”, and a bachelor’s degree appears as both “Bachelor’s” and “Bachelor’s Degree”. Each pair represents the same education level, so we need to combine them into a single label so the data is grouped properly.

salary_df <- salary_df %>%
  mutate(Education_Level = case_when(
    Education_Level %in% c("Bachelor's Degree", "Bachelor's") ~ "Bachelor's",
    Education_Level %in% c("Master's Degree", "Master's") ~ "Master's",
    Education_Level %in% c("phD", "PhD") ~ "PhD",
    TRUE ~ Education_Level
  ))

print(unique(salary_df$Education_Level))
## [1] "Bachelor's"  "Master's"    "PhD"         ""            "High School"
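An alternative to listing every variant by hand is to normalize the labels first and then look them up. The following is a sketch on a made-up vector (education, key, and label_map are names introduced here for illustration). It also converts the empty label to NA, which is an assumption on our part: the report itself keeps the empty string and filters it out later.

```r
# Made-up vector with the same kinds of label variants seen above
education <- c("Bachelor's", "Bachelor's Degree", "phD", "PhD",
               "Master's Degree", "", "High School")

# Strip a trailing " Degree" and lower-case the rest so variants collapse
key <- tolower(sub(" Degree$", "", education))

# Lookup table from normalized key to the label we want to keep
label_map <- c("bachelor's" = "Bachelor's", "master's" = "Master's",
               "phd" = "PhD", "high school" = "High School")

# Keys with no match in the table, including "", come back as NA
cleaned <- unname(label_map[key])
print(table(cleaned, useNA = "ifany"))
```

The trade-off is that unexpected labels silently become NA, so it is worth checking the NA count after cleaning.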

With these steps complete, the data is ready for analysis. Note that a few rows still carry an empty education label (""); we filter these out when we plot salary by education level.

Data Exploration

Highest Salaries Overview

Years of experience is often said to be one of the biggest deciding factors of salary, but how true is this? To explore this, we group the data by job title and calculate the mean salary and mean years of experience for each job. Sorting the result by average salary gives us a better idea of how job title and years of experience may each relate to salary.

salary_by_job <- salary_df %>%
  group_by(Job_Title) %>%
  summarise(
    Average_Salary = mean(Salary),
    Average_Experience = mean(Years_of_Experience)
  ) %>%
  arrange(desc(Average_Salary)) %>%
  top_n(10)
## Selecting by Average_Experience
print(salary_by_job)
## # A tibble: 13 × 3
##    Job_Title                       Average_Salary Average_Experience
##    <chr>                                    <dbl>              <dbl>
##  1 CEO                                    250000                25  
##  2 Chief Technology Officer               250000                24  
##  3 Director                               200000                20  
##  4 Director of Human Resources            187500                22  
##  5 Director of Human Capital              180000                21  
##  6 Director of Sales and Marketing        180000                22  
##  7 Human Resources Director               180000                20  
##  8 Director of Finance                    175000                20  
##  9 Director of Product Management         175000                21  
## 10 Director of Operations                 172727.               20.6
## 11 Principal Engineer                     170000                20  
## 12 Supply Chain Analyst                   130000                22  
## 13 Operations Analyst                     110000                20

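Note that top_n(10) was called without naming a ranking column, so it ranked by the last column, Average_Experience, as the message in the output indicates; ties in that column are why 13 rows are returned instead of 10. The table therefore shows the job titles with the highest average experience, sorted by average salary. Even so, this preliminary finding suggests that while years of experience is a key contributing factor, job title and function can also be a powerful factor in determining salary: among these roles, which all average 20 or more years of experience, “Director” roles and higher average roughly $172,700 or more, while the two analyst roles average $130,000 and $110,000.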
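The “Selecting by Average_Experience” message in the output above shows that top_n(10) ranked by the last column rather than by salary. To rank by salary, the column can be named explicitly. The sketch below uses dplyr’s slice_max() on a made-up summary table (toy_summary and its values are hypothetical).

```r
library(dplyr)

# Made-up summary table (hypothetical titles and values)
toy_summary <- data.frame(
  Job_Title = c("Title A", "Title B", "Title C", "Title D"),
  Average_Salary = c(250000, 110000, 180000, 90000),
  Average_Experience = c(10, 25, 12, 22)
)

# Name the ranking column explicitly; with_ties = FALSE caps the result at n rows
top_by_salary <- toy_summary %>%
  slice_max(Average_Salary, n = 2, with_ties = FALSE)

print(top_by_salary)
```

In the report’s pipeline, the equivalent change would be to replace top_n(10) with slice_max(Average_Salary, n = 10); the rows returned would then differ from the table printed above.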

Salary vs Experience Relationship

To see the relationship more clearly, we can visualize salary against years of experience. A scatter plot is well suited to examining the relationship between two continuous variables, such as salary and years of experience.

ggplot(salary_df, aes(x = Years_of_Experience, y = Salary)) +
  geom_point(color = "darkgreen") +
  labs(title = "Salary vs. Years of Experience",
       x = "Years of Experience",
       y = "Salary (US Dollars)") +
  scale_y_continuous(labels = comma)

The resulting scatter plot reveals a clear positive, roughly linear trend in the data: as years of experience increase, salary tends to increase as well. The data points form an upward band, which suggests a positive linear relationship. Therefore, a linear model is a reasonable choice for this dataset.
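A visual impression like this can be backed by a number. One common choice is the Pearson correlation coefficient, sketched here on made-up values (experience and salary are hypothetical vectors, not columns from Salary_Data).

```r
# Made-up example values (not taken from Salary_Data)
experience <- c(1, 3, 5, 8, 12, 15, 20)
salary <- c(48000, 62000, 75000, 98000, 121000, 150000, 181000)

# Pearson correlation: +1 is a perfect increasing line, 0 is no linear trend
r <- cor(experience, salary)
cat(sprintf("Pearson r = %.3f\n", r))

# cor.test() adds a significance test and a confidence interval for r
ct <- cor.test(experience, salary)
print(ct$conf.int)
```

On the real data, the same call would be cor(salary_df$Years_of_Experience, salary_df$Salary), which works once the rows with missing values have been removed.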

Salary vs Education Level

Our dataset also includes the education level of each individual. To explore this further, we can generate box plots that compare the salary distributions across education levels: High School, Bachelor’s, Master’s, and PhD. A box plot shows the median salary, spread, and outliers for each group.

salary_df %>%
  filter(Education_Level != "") %>%
  mutate(Education_Level = factor(
    Education_Level, 
    levels = c("High School", "Bachelor's", "Master's", "PhD"))) %>%
  ggplot(aes(x = Education_Level, y = Salary, fill = Education_Level)) +
    geom_boxplot() +
    labs(
      title = "Salary Distribution by Education Level",
      x = "Education Level",
      y = "Salary (US Dollars)"
    )

Based on the box plot, education appears to be another significant factor influencing salary. Individuals with a PhD tend to have the highest median salary compared to other education levels, though there are several outliers. This suggests that a higher level of education can positively impact salary potential, as advanced degrees often provide greater expertise and experience within one’s field.
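The medians read off a box plot can also be computed directly, together with the group sizes, which helps judge how much weight each box deserves. This is a sketch on made-up salaries (toy_df and its values are hypothetical).

```r
# Made-up salaries by education level (hypothetical values)
toy_df <- data.frame(
  Education_Level = c("High School", "Bachelor's", "Bachelor's",
                      "Master's", "Master's", "PhD", "PhD", "PhD"),
  Salary = c(40000, 70000, 90000, 110000, 130000, 150000, 160000, 200000)
)

# Median salary per group: the value marked by the center line of each box
medians <- tapply(toy_df$Salary, toy_df$Education_Level, median)

# Number of observations per group
counts <- table(toy_df$Education_Level)

print(medians)
print(counts)
```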

Model Creation

Linear Regression Model Creation

To quantify the relationship between salary and experience, we can create a simple linear regression model. This statistical method models the relationship between two variables by fitting a linear equation to the observed data.

Our goal is to find a model that will represent the data points with a straight line. This is defined by the equation:

\(Salary = \beta_0 + \beta_1 \times Experience\)

where \(\beta_0\) is the intercept and \(\beta_1\) is the slope.
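For a single predictor, the least-squares estimates have a closed form: \(\hat{\beta}_1 = \mathrm{cov}(x, y) / \mathrm{var}(x)\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\). The sketch below computes both by hand on made-up values (x and y are hypothetical) and compares them with what lm() returns.

```r
# Made-up example values (not taken from Salary_Data)
x <- c(1, 3, 5, 8, 12, 15, 20)
y <- c(48000, 62000, 75000, 98000, 121000, 150000, 181000)

# Closed-form least-squares estimates for a single predictor
beta1 <- cov(x, y) / var(x)
beta0 <- mean(y) - beta1 * mean(x)

# lm() should return the same two numbers
fit <- lm(y ~ x)
print(c(beta0 = beta0, beta1 = beta1))
print(coef(fit))
```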

salary_model <- lm(Salary ~ Years_of_Experience, data = salary_df)

summary(salary_model)
## 
## Call:
## lm(formula = Salary ~ Years_of_Experience, data = salary_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -148236  -22377   -5564   21015  100576 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         58283.28     632.71   92.12   <2e-16 ***
## Years_of_Experience  7046.77      62.57  112.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31030 on 6697 degrees of freedom
## Multiple R-squared:  0.6544, Adjusted R-squared:  0.6544 
## F-statistic: 1.268e+04 on 1 and 6697 DF,  p-value: < 2.2e-16

Using R’s lm() function, we fitted the model and obtained the results. From the summary output, we can identify some key findings:

  • Coefficients: The model estimated an intercept (\(\beta_0\)) of approximately 58283.28. This is the predicted starting salary for an individual with zero years of experience. The coefficient for Years_of_Experience (\(\beta_1\)) was approximately 7046.77. In other words, for each additional year of professional experience, the model predicts an average salary increase of $7,046.77.

  • R-squared: This value (0.6544) is the proportion of the variation in Salary that can be explained by Years_of_Experience, here about 65.4%. A higher value means a better fit.

  • p-value: The p-value for Years_of_Experience (< 2e-16) indicates that there is a statistically significant relationship between experience and salary.
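The figures discussed above can also be pulled out of the model object programmatically instead of being read off the printed summary. The following is a sketch on made-up data (toy_df and toy_model are hypothetical; in the report, salary_model plays the role of toy_model). It also adds 95% confidence intervals for the coefficients, which the report does not compute.

```r
# Made-up data; in the report, salary_model plays the role of toy_model
toy_df <- data.frame(
  Years_of_Experience = c(1, 3, 5, 8, 12, 15, 20),
  Salary = c(48000, 62000, 75000, 98000, 121000, 150000, 181000)
)
toy_model <- lm(Salary ~ Years_of_Experience, data = toy_df)
model_summary <- summary(toy_model)

# Coefficient table: estimate, standard error, t value, p-value
print(coef(model_summary))

# R-squared as a single number
r_squared <- model_summary$r.squared

# 95% confidence intervals for the intercept and slope
ci <- confint(toy_model, level = 0.95)
print(ci)

# With one predictor, R-squared equals the squared Pearson correlation
r <- cor(toy_df$Years_of_Experience, toy_df$Salary)
```

Applied to the report’s R-squared of 0.6544, the last identity implies a correlation of about 0.81 between experience and salary.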

Linear Regression Model Visualization

Our linear regression model can be visualized on the same scatter plot that we generated earlier. Since the model is a straight line, we can draw it over the points to see how the model’s predictions compare with the observed data. To do so, we plot the scatter plot again, this time adding the regression line with geom_smooth(method = "lm").

ggplot(salary_df, aes(x = Years_of_Experience, y = Salary)) +
  geom_point(color = "darkgreen") +
  geom_smooth(method = "lm", color = "purple") +
  labs(title = "Salary vs. Years of Experience with Regression Line",
       x = "Years of Experience",
       y = "Salary (US Dollars)") +
  scale_y_continuous(labels = comma)
## `geom_smooth()` using formula = 'y ~ x'

Model Results

Applying the Model

Now that we have built and examined our linear regression model, we can use it for its primary purpose: making predictions. The final equation for our model is based on the coefficients we found:

\(Salary = 58283.28 + 7046.77 \times Years\_of\_Experience\)

This formula allows us to input a value for Years_of_Experience and receive an estimated Salary in return. Let’s test this by predicting the salary for a hypothetical employee with 20 years of experience.

new_employee_experience <- data.frame(Years_of_Experience = 20)
predicted_salary <- predict(salary_model, newdata = new_employee_experience)

cat(sprintf("The predicted salary for 20 years of experience is: $%s", round(predicted_salary, 2)))
## The predicted salary for 20 years of experience is: $199218.64
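As a quick check, the same prediction can be reproduced by hand from the rounded coefficients printed in the model summary. The sketch below hard-codes those two values (intercept and slope are names introduced here for illustration).

```r
# Coefficients as printed in the model summary (rounded to two decimals)
intercept <- 58283.28
slope <- 7046.77

# Plug 20 years of experience into the fitted equation by hand
manual_prediction <- intercept + slope * 20

cat(sprintf("Manual prediction: $%s\n",
            formatC(manual_prediction, format = "f", digits = 2, big.mark = ",")))
```

The hand calculation gives about $199,218.68, a few cents away from the predict() result, because predict() uses the unrounded coefficients stored in the model.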

Analyzing the Results

The model predicts that an employee with 20 years of experience will earn an annual salary of around $199,218.64.

The regression line in our scatter plot also passes through roughly this value at 20 years of experience, which is a useful visual check on the result.

By providing a simple input, years of experience, we can get a general estimate of the expected salary for an employee.
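A single predicted number hides how much individual salaries vary around the line. With a residual standard error of about $31,030, an individual’s salary can differ from the estimate by tens of thousands of dollars. One way to express this is a prediction interval, sketched here on made-up data (toy_df, toy_model, and new_employee are hypothetical; the report does not compute an interval).

```r
# Made-up data; substitute salary_model and the real data in practice
toy_df <- data.frame(
  Years_of_Experience = c(1, 3, 5, 8, 12, 15, 20),
  Salary = c(48000, 62000, 75000, 98000, 121000, 150000, 181000)
)
toy_model <- lm(Salary ~ Years_of_Experience, data = toy_df)

new_employee <- data.frame(Years_of_Experience = 10)

# interval = "prediction" returns the point estimate (fit) plus lower and upper bounds
pred <- predict(toy_model, newdata = new_employee,
                interval = "prediction", level = 0.95)
print(pred)
```

The lwr and upr columns bound the range in which a new individual’s salary is expected to fall, under the usual linear-model assumptions.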

Conclusion

This report set out to investigate the relationship between an employee’s years of experience and their annual salary. The results show a strong, positive, and statistically significant linear relationship between these two variables. For every additional year of experience, an employee’s salary is predicted to increase by approximately $7,046.77.

Our simple linear regression model demonstrated that years of experience is a powerful predictor of salary, explaining about 65% of the variation in salary in this dataset.