Before we can build a prediction model, we need to import the data that the model will be fitted on. The dataset used in this project is called “Salary_Data”.
salary_df <- read.csv("Salary_Data.csv")
This project uses a few libraries for plotting and manipulating the data. Most notably, ggplot2 is used to create visualizations of the data and dplyr is used to transform and manipulate it. The scales library is used to format the axis labels.
library(ggplot2)
library(dplyr)
library(scales)
This dataset gives an overview of how experience and age relate to salary. It contains columns for age, gender, education level, job title, years of experience, and salary. We use it to explore the relationship between an employee’s years of experience and their salary, to understand how professional growth impacts earnings.
More information about this dataset can be found on Kaggle at https://www.kaggle.com/datasets/mubeenshehzadi/salary-dataset.
summary(salary_df)
## Age Gender Education.Level Job.Title
## Min. :21.00 Length:6704 Length:6704 Length:6704
## 1st Qu.:28.00 Class :character Class :character Class :character
## Median :32.00 Mode :character Mode :character Mode :character
## Mean :33.62
## 3rd Qu.:38.00
## Max. :62.00
## NA's :2
## Years.of.Experience Salary
## Min. : 0.000 Min. : 350
## 1st Qu.: 3.000 1st Qu.: 70000
## Median : 7.000 Median :115000
## Mean : 8.095 Mean :115327
## 3rd Qu.:12.000 3rd Qu.:160000
## Max. :34.000 Max. :250000
## NA's :3 NA's :5
head(salary_df)
In today’s corporate job market, it is important for both employers and employees to understand the factors that drive compensation. Employees benefit from a clear idea of their earning potential and career path. Employers, in turn, need to offer fair and competitive salaries to retain employees and attract new talent.
The main objective of this analysis is to model the relationship between an employee’s years of experience and their annual salary. Some of the questions that we are trying to answer include:
This project aims to shed light on these questions using statistical methods. Years of experience is typically one of the biggest deciding factors of salary, and this project digs into the data to get a more in-depth understanding of that relationship.
The first step, which we have already done, is to load the data into our R project. Immediately after loading the data, we used head() and summary() to check its structure and integrity, along with the type of each column.
Before moving forward, we should check whether any columns have missing or invalid values.
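The column types can also be checked directly rather than read off the summary. A minimal sketch on a small made-up data frame (toy_df and its values are illustrative and are not rows from Salary_Data.csv):

```r
# Toy data frame with made-up values, standing in for salary_df
toy_df <- data.frame(
  Age = c(25, 32, NA),
  Gender = c("Female", "Male", "Male"),
  Years.of.Experience = c(2, 8, 15),
  Salary = c(60000, 110000, 180000)
)

str(toy_df)                          # compact view of each column's type and first values
col_types <- sapply(toy_df, class)   # named character vector of column classes
print(col_types)
```

If a numeric column such as Salary showed up as character here, that would usually point to stray text in the raw file.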
colSums(is.na(salary_df))
## Age Gender Education.Level Job.Title
## 2 0 0 0
## Years.of.Experience Salary
## 3 5
This shows that there are missing values in the Age, Years.of.Experience, and Salary columns. Since they account for a very small portion of the 6,704 rows (under 0.5%), we can safely omit the affected rows. lm() would drop rows with missing values in the model variables by default; however, handling this explicitly beforehand keeps the cleaning step visible and under our control.
salary_df <- na.omit(salary_df)
colSums(is.na(salary_df))
## Age Gender Education.Level Job.Title
## 0 0 0 0
## Years.of.Experience Salary
## 0 0
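To illustrate the point about the default behaviour, the sketch below fits the same model with and without an explicit na.omit() step. The data frame toy_df is made up for illustration; only the function calls mirror the ones used in this report.

```r
# Made-up data with one missing value in each column
toy_df <- data.frame(
  Years_of_Experience = c(1, 3, 5, NA, 9),
  Salary = c(40000, 52000, NA, 75000, 91000)
)

# lm() applies getOption("na.action"), which is na.omit unless changed,
# so incomplete rows are dropped without any message
fit_default <- lm(Salary ~ Years_of_Experience, data = toy_df)
n_used_default <- nobs(fit_default)

# Dropping the rows explicitly gives the same fit, but the row count is visible
toy_clean <- na.omit(toy_df)
fit_clean <- lm(Salary ~ Years_of_Experience, data = toy_clean)

print(c(rows_before = nrow(toy_df),
        rows_after = nrow(toy_clean),
        used_by_lm = n_used_default))
```

Both fits use the same three complete rows, so the coefficients are identical; the explicit step simply makes the number of dropped rows easy to report.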
Our dataset now contains no NA values. Next, we rename the columns to make them easier to work with. The original column headers contain spaces, which read.csv() converted to dots (for example, Years.of.Experience). Replacing these with underscores makes the names more readable and consistent.
names(salary_df) <- c("Age", "Gender", "Education_Level", "Job_Title", "Years_of_Experience", "Salary")
head(salary_df)
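By default, read.csv() replaces spaces in header names with dots. Instead of typing every new name by hand, the dots can be replaced programmatically. This is a sketch of one alternative, shown on a made-up one-row data frame (toy_df is hypothetical):

```r
# Column names in the form read.csv() produces from headers that contain spaces
toy_df <- data.frame(Age = 30, Education.Level = "PhD", Years.of.Experience = 5)

# Replace every literal dot with an underscore (fixed = TRUE disables regex matching)
names(toy_df) <- gsub(".", "_", names(toy_df), fixed = TRUE)
print(names(toy_df))
```

This avoids having to keep a hand-written vector of names in the same order as the columns.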
Finally, the data also includes the education level of each individual; however, some education levels appear more than once under slightly different labels.
print(unique(salary_df$Education_Level))
## [1] "Bachelor's" "Master's" "PhD"
## [4] "Bachelor's Degree" "Master's Degree" ""
## [7] "High School" "phD"
For example, a PhD appears as both “PhD” and “phD”. These represent the same education level, so we need to combine them into a single label so that the data is grouped properly. The same applies to the two spellings of the Bachelor's and Master's degrees.
salary_df <- salary_df %>%
  mutate(Education_Level = case_when(
    Education_Level %in% c("Bachelor's Degree", "Bachelor's") ~ "Bachelor's",
    Education_Level %in% c("Master's Degree", "Master's") ~ "Master's",
    Education_Level %in% c("phD", "PhD") ~ "PhD",
    TRUE ~ Education_Level  # leave all other labels unchanged
  ))
print(unique(salary_df$Education_Level))
## [1] "Bachelor's" "Master's" "PhD" "" "High School"
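The cleaned output still contains an empty string as an education level. One option, which is an assumption on our part and not a step taken in this report, is to convert empty strings to NA so they are treated as missing. A minimal sketch on a made-up vector:

```r
# Made-up education labels, including two empty strings
education <- c("Bachelor's", "", "PhD", "High School", "")

# Treat empty strings as missing values
education[education == ""] <- NA

print(table(education, useNA = "ifany"))
n_missing <- sum(is.na(education))
```

Whether to drop, relabel, or keep such rows depends on the analysis; this report keeps them in the data and filters them out only for the box plot.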
After these preparation steps, the data is ready to be analyzed.
One of the biggest deciding factors of salaries is years of experience, but how true is this? We can explore the average salary and average years of experience for each job title. By grouping the data by job title and calculating the mean salary and mean years of experience for each job, we can get a better idea of how job title and years of experience together relate to salary.
salary_by_job <- salary_df %>%
  group_by(Job_Title) %>%
  summarise(
    Average_Salary = mean(Salary),
    Average_Experience = mean(Years_of_Experience)
  ) %>%
  arrange(desc(Average_Salary)) %>%
  top_n(10)  # no wt given, so top_n() ranks by the last column (Average_Experience)
## Selecting by Average_Experience
print(salary_by_job)
## # A tibble: 13 × 3
## Job_Title Average_Salary Average_Experience
## <chr> <dbl> <dbl>
## 1 CEO 250000 25
## 2 Chief Technology Officer 250000 24
## 3 Director 200000 20
## 4 Director of Human Resources 187500 22
## 5 Director of Human Capital 180000 21
## 6 Director of Sales and Marketing 180000 22
## 7 Human Resources Director 180000 20
## 8 Director of Finance 175000 20
## 9 Director of Product Management 175000 21
## 10 Director of Operations 172727. 20.6
## 11 Principal Engineer 170000 20
## 12 Supply Chain Analyst 130000 22
## 13 Operations Analyst 110000 20
Note that top_n(10) was called without a wt argument, so, as the message in the output indicates, it selected rows by Average_Experience and kept ties. The table therefore shows the 13 job titles with the highest average experience, ordered by average salary. Most of them are “Director” roles or higher. However, titles with similar experience differ widely in pay: a Supply Chain Analyst with an average of 22 years earns 130,000 on average, while a CEO with 25 years earns 250,000. This preliminary finding suggests that while years of experience is a key contributing factor, job title and function can also be a powerful factor in determining salary.
To see the relationship more clearly, we can visualize salary against years of experience with a scatter plot, which is well suited to examining the relationship between two continuous variables.
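The output above carries the message “Selecting by Average_Experience”, which means top_n(10) ranked by the last column and not by salary. To rank by average salary explicitly, one option is dplyr's slice_max(). The sketch below uses made-up job titles and values (toy_df is hypothetical) and is not the code that produced the table above.

```r
library(dplyr)

# Made-up job titles and salaries, for illustration only
toy_df <- data.frame(
  Job_Title = c("A", "A", "B", "B", "C", "C", "D"),
  Salary = c(100, 120, 200, 220, 150, 170, 90),
  Years_of_Experience = c(5, 7, 10, 12, 20, 22, 3)
)

top_by_salary <- toy_df %>%
  group_by(Job_Title) %>%
  summarise(
    Average_Salary = mean(Salary),
    Average_Experience = mean(Years_of_Experience)
  ) %>%
  slice_max(Average_Salary, n = 2)  # rank explicitly by Average_Salary

print(top_by_salary)
```

Naming the ranking column removes the dependence on column order. Passing wt = Average_Salary to top_n() would have a similar effect.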
ggplot(salary_df, aes(x = Years_of_Experience, y = Salary)) +
  geom_point(color = "darkgreen") +
  labs(title = "Salary vs. Years of Experience",
       x = "Years of Experience",
       y = "Salary (US Dollars)") +
  scale_y_continuous(labels = comma)
The resulting scatter plot shows a positive, roughly linear trend: as years of experience increases, salary tends to increase as well. This suggests that a linear model is a reasonable choice for this dataset.
The dataset also includes the education level of each individual. To explore this further, we can generate box plots to compare the salary distributions across education levels: High School, Bachelor’s, Master’s, and PhD. A box plot shows the median salary, range, and outliers for each group.
salary_df %>%
  filter(Education_Level != "") %>%
  mutate(Education_Level = factor(
    Education_Level,
    levels = c("High School", "Bachelor's", "Master's", "PhD"))) %>%
  ggplot(aes(x = Education_Level, y = Salary, fill = Education_Level)) +
  geom_boxplot() +
  labs(
    title = "Salary Distribution by Education Level",
    x = "Education Level",
    y = "Salary (US Dollars)"
  )
Based on the box plot, education appears to be another factor associated with salary. Individuals with a PhD tend to have the highest median salary compared to other education levels, though there are several outliers. This suggests that a higher level of education may positively impact salary potential, possibly because advanced degrees are often associated with greater expertise within one’s field.
To quantify the relationship between salary and experience, we can create a simple linear regression model. This statistical method models the relationship between these two variables by fitting a linear equation to the observed data.
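The medians that a box plot displays can also be computed directly, which makes the comparison between groups easier to report. A minimal sketch with made-up salaries (toy_df is hypothetical and does not come from Salary_Data.csv):

```r
# Made-up salaries for three education levels
toy_df <- data.frame(
  Education_Level = c("High School", "High School",
                      "Bachelor's", "Bachelor's", "Bachelor's",
                      "PhD", "PhD"),
  Salary = c(30000, 40000, 60000, 80000, 100000, 150000, 170000)
)

# Median salary per education level
median_by_level <- tapply(toy_df$Salary, toy_df$Education_Level, median)
print(sort(median_by_level))
```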
Our goal is to find a model that will represent the data points with a straight line. This is defined by the equation:
\(Salary = \beta_0 + \beta_1 \times Experience\)
where \(\beta_0\) is the intercept and \(\beta_1\) is the slope.
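For a single predictor, the least-squares estimates have a closed form: the slope is the sum of cross-deviations of x and y divided by the sum of squared deviations of x, and the intercept follows from the means. The sketch below checks this against lm() on made-up numbers (x and y are illustrative values, not the salary data):

```r
# Made-up predictor and response values
x <- c(1, 2, 3, 4, 5)
y <- c(40, 48, 61, 65, 80)

# Closed-form least-squares estimates for simple linear regression
beta_1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta_0 <- mean(y) - beta_1 * mean(x)

# The same estimates from lm()
fit <- lm(y ~ x)

print(c(intercept = beta_0, slope = beta_1))
print(coef(fit))
```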
salary_model <- lm(Salary ~ Years_of_Experience, data = salary_df)
summary(salary_model)
##
## Call:
## lm(formula = Salary ~ Years_of_Experience, data = salary_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -148236 -22377 -5564 21015 100576
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 58283.28 632.71 92.12 <2e-16 ***
## Years_of_Experience 7046.77 62.57 112.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31030 on 6697 degrees of freedom
## Multiple R-squared: 0.6544, Adjusted R-squared: 0.6544
## F-statistic: 1.268e+04 on 1 and 6697 DF, p-value: < 2.2e-16
Using R’s lm() function, we fitted the model and obtained the results. From the summary output, we can identify some key findings:
Coefficients: The model estimated an intercept (\(\beta_0\)) of approximately 58,283.28, which is the predicted salary for an individual with zero years of experience. The coefficient for Years_of_Experience (\(\beta_1\)) is approximately 7,046.77, meaning that for each additional year of professional experience, the model predicts an average salary increase of $7,046.77.
R-squared: This value (0.6544) is the proportion of the variation in Salary that is explained by Years_of_Experience, here about 65.4%. A higher value means a better fit.
p-value: The p-value for Years_of_Experience (< 2e-16) indicates that there is a statistically significant relationship between experience and salary.
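These quantities can also be pulled out of the model object in code, without reading them from the printed summary. The sketch below does this on made-up data (x, y, and fit are illustrative) and also recomputes R-squared by hand as one minus the ratio of residual to total sum of squares.

```r
# Made-up data for illustration
x <- c(1, 2, 3, 4, 5, 6)
y <- c(42, 50, 55, 68, 71, 85)

fit <- lm(y ~ x)
fit_summary <- summary(fit)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(fit_summary)
slope_p_value <- coef_table["x", "Pr(>|t|)"]

# R-squared as reported by summary()
r_squared <- fit_summary$r.squared

# R-squared by hand: 1 - SS_residual / SS_total
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((y - mean(y))^2)
r_squared_manual <- 1 - ss_res / ss_tot

print(coef_table)
print(c(r_squared = r_squared, r_squared_manual = r_squared_manual))
```

With a single predictor, R-squared also equals the squared Pearson correlation between the two variables, cor(x, y)^2.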
The linear regression model can be visualized on the same scatter plot. Since the model is a straight line, drawing it over the points shows the salary the model expects at each level of experience. To do so, we plot the scatter plot again, this time with the addition of the regression line.
ggplot(salary_df, aes(x = Years_of_Experience, y = Salary)) +
  geom_point(color = "darkgreen") +
  geom_smooth(method = "lm", color = "purple") +
  labs(title = "Salary vs. Years of Experience with Regression Line",
       x = "Years of Experience",
       y = "Salary (US Dollars)") +
  scale_y_continuous(labels = comma)
## `geom_smooth()` using formula = 'y ~ x'
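A common additional check on a linear fit is a plot of residuals against fitted values, where a patternless band around zero supports the linearity assumption. This check is not part of the original analysis; the sketch below shows the idea on made-up data (x, y, and fit are illustrative).

```r
# Made-up data for illustration
x <- 1:10
y <- c(35, 41, 50, 52, 63, 66, 75, 79, 88, 95)

fit <- lm(y ~ x)

# Residuals against fitted values; a curve or funnel shape would signal a problem
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values",
     ylab = "Residuals",
     main = "Residuals vs. Fitted")
abline(h = 0, lty = 2)  # reference line at zero

residual_sum <- sum(residuals(fit))  # close to zero for a model with an intercept
```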
Now that we have built and examined our linear regression model, we can use it for its primary purpose: making predictions. The final equation for our model is based on the coefficients we found:
\(Salary = 58283.28 + 7046.77 \times Years\_of\_Experience\)
This formula allows us to input a value for Years_of_Experience and receive an estimated Salary in return. Let’s test this by predicting the salary for a hypothetical employee with 20 years of experience.
new_employee_experience <- data.frame(Years_of_Experience = 20)
predicted_salary <- predict(salary_model, newdata = new_employee_experience)
cat(sprintf("The predicted salary for 20 years of experience is: $%s", round(predicted_salary, 2)))
## The predicted salary for 20 years of experience is: $199218.64
The model predicts that an employee with 20 years of experience will earn an annual salary of around $199,218.64.
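The prediction can be checked by plugging 20 years into the model equation by hand, using the rounded coefficients from the summary output:

```r
# Rounded coefficients as printed in the model summary
intercept <- 58283.28
slope <- 7046.77
years <- 20

manual_prediction <- intercept + slope * years
print(manual_prediction)
```

This gives 199218.68, a few cents away from the value returned by predict(), because predict() uses the unrounded coefficients.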
We can also see that the regression line in the scatter plot passes through roughly this value at 20 years of experience, which is visually consistent with the result.
By providing a simple input, years of experience, we can get a general estimate of the expected salary for employees.
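A point estimate says nothing about how far an individual salary may fall from the line. predict() can also return a prediction interval through its interval argument. The sketch below shows the call on made-up data (toy_df and toy_model are hypothetical, and the resulting numbers do not describe the salary dataset).

```r
# Made-up data for illustration
toy_df <- data.frame(
  Years_of_Experience = c(1, 3, 5, 8, 10, 12, 15, 20),
  Salary = c(45000, 60000, 72000, 95000, 101000, 125000, 140000, 185000)
)

toy_model <- lm(Salary ~ Years_of_Experience, data = toy_df)

# 95% prediction interval for a single new employee with 10 years of experience
new_employee <- data.frame(Years_of_Experience = 10)
prediction <- predict(toy_model, newdata = new_employee,
                      interval = "prediction", level = 0.95)

print(prediction)  # columns: fit, lwr, upr
```

The lwr and upr columns bound the range in which a new individual's salary is expected to fall under the model's assumptions, which is usually much wider than the uncertainty in the fitted line itself.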
This report set out to investigate the relationship between an employee’s years of experience and their annual salary. The analysis found a positive and statistically significant linear relationship between the two: for every additional year of experience, an employee’s salary is predicted to increase by approximately $7,046.77.
Our simple linear regression model explains about 65% of the variation in salary, which shows that years of experience is a strong predictor, although job title and education level also appear to play a role.