A Study on Factors Affecting Salary in the Job Market
This project analyzes a job market dataset to understand the relationships between experience, skills, education, and salary. The goal is to identify patterns and insights using exploratory data analysis and visualization techniques.
This section presents various types of data analysis performed on the dataset to uncover patterns, relationships, and insights. Different analytical approaches such as relational analysis, categorical analysis, distribution analysis, comparative analysis, and multivariate analysis are used.
Each type of analysis is performed using appropriate visualization techniques to better understand how different variables interact and influence each other.
library(tidyverse)  # attaches ggplot2 and readr, among others
library(GGally)
library(mice)
library(missForest)
library(car)
# Load the dataset
job_market <- read_csv("job_salary_prediction_dataset.csv", show_col_types = FALSE)
head(job_market)
## # A tibble: 6 × 10
## job_title experience_years education_level skills_count industry company_size
## <chr> <dbl> <chr> <dbl> <chr> <chr>
## 1 AI Engine… 10 Bachelor 2 Healthc… Medium
## 2 Business … 15 Master 3 Technol… Small
## 3 Software … 18 Master 17 Consult… Startup
## 4 Cloud Eng… 11 Bachelor 8 Governm… Large
## 5 Cloud Eng… 2 High School 17 Media Startup
## 6 Cloud Eng… 9 PhD 13 Governm… Enterprise
## # ℹ 4 more variables: location <chr>, remote_work <chr>, certifications <dbl>,
## # salary <dbl>
# Data preprocessing: convert categorical columns to factors
job_market$job_title <- as.factor(job_market$job_title)
job_market$education_level <- as.factor(job_market$education_level)
job_market$industry <- as.factor(job_market$industry)
job_market$company_size <- as.factor(job_market$company_size)
job_market$location <- as.factor(job_market$location)
job_market$remote_work <- as.factor(job_market$remote_work)
#missing value handling
colSums(is.na(job_market))
## job_title experience_years education_level skills_count
## 0 0 0 0
## industry company_size location remote_work
## 0 0 0 0
## certifications salary
## 0 0
# colSums() above found no missing values, so this imputation step leaves the data unchanged
numeric_cols <- c("experience_years", "skills_count", "certifications", "salary")
mice_output <- mice(job_market[numeric_cols], m = 1, method = 'pmm', seed = 123)
##
## iter imp variable
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
data_clean <- complete(mice_output)
# Update the main dataset with imputed values
job_market[numeric_cols] <- data_clean
# Finalize dataset for analysis
data_final <- job_market
#checking for duplicate rows
sum(duplicated(data_final))
## [1] 0
# Outlier detection
boxplot(data_final$salary, main = "Salary Distribution")
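The boxplot flags potential outliers visually. As a rough sketch, the 1.5 × IQR rule can also be applied numerically to list the flagged values; `boxplot()` applies a closely related rule based on hinges. The `salary_example` vector below is made up for illustration and is not taken from the dataset.

```r
# Hypothetical salaries for illustration only (not from the dataset)
salary_example <- c(90000, 120000, 135000, 140000, 150000, 160000, 175000, 400000)

# Quartiles and interquartile range
q1 <- unname(quantile(salary_example, 0.25))
q3 <- unname(quantile(salary_example, 0.75))
iqr <- q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers
outliers <- salary_example[salary_example < lower_fence | salary_example > upper_fence]
outliers
```

Replacing `salary_example` with `data_final$salary` would apply the same rule to the real column.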
Relational analysis is used to study the relationship between two numerical variables and understand how one variable influences another.
In this analysis, scatter plots are used where each point represents an observation in the dataset. By plotting one variable on the x-axis and another on the y-axis, patterns and trends can be visually examined.
The relationship is assessed by observing the direction and spread of the points:
- A clear upward pattern with little spread indicates a strong positive relationship
- A widely scattered pattern indicates a weak relationship or none
This method helps identify which factors significantly impact salary, such as experience, and which factors have minimal influence, such as skills or certifications.
This section analyzes relationships between variables to understand how different factors influence salary.
ggplot(data_final, aes(x = experience_years, y = salary)) +
geom_point(color = "blue") +
labs(title = "Experience vs Salary",
x = "Experience (Years)",
y = "Salary") +
theme_minimal()
Interpretation
The scatter plot shows a positive relationship between experience and salary. As experience increases, salary tends to increase, indicating that experience is a key factor influencing earnings.
Key Insight: Among the numeric variables examined, experience shows the strongest relationship with salary.
ggplot(data_final, aes(x = skills_count, y = salary)) +
geom_point(color = "green") +
labs(title = "Skills vs Salary",
x = "Number of Skills",
y = "Salary")
Interpretation
The scatter plot shows a weak relationship between skills and salary. While higher skill counts may slightly increase salary, the effect is not strong, indicating that other factors also play a significant role.
ggplot(data_final, aes(x = certifications, y = salary)) +
geom_point(color = "red") +
labs(title = "Certifications vs Salary",
x = "Number of Certifications",
y = "Salary")
Interpretation
The scatter plot indicates a very weak relationship between certifications and salary. Salary levels vary widely regardless of the number of certifications, suggesting certifications alone do not significantly influence earnings.
ggplot(data_final, aes(x = experience_years, y = skills_count)) +
geom_point(color = "darkgreen") +
labs(title = "Experience vs Skills",
x = "Experience (Years)",
y = "Number of Skills")
Interpretation
The scatter plot shows no clear relationship between experience and number of skills. Skills appear to be independent of experience, suggesting that skill acquisition may depend on individual learning rather than years of experience.
ggplot(data_final, aes(x = skills_count, y = certifications)) +
geom_point(color = "brown") +
labs(title = "Skills vs Certifications",
x = "Number of Skills",
y = "Certifications")
Interpretation
The scatter plot shows no clear relationship between skills and certifications. Individuals with a high number of skills do not necessarily possess more certifications, indicating that practical skills and formal certifications may develop independently.
Categorical analysis examines how different categories or groups affect a numerical variable.
In this section, boxplots are used to compare salary distributions across categories such as education level, company size, job role, and industry. Each boxplot summarizes key statistical values including median, quartiles, and potential outliers.
By comparing the median and spread of values across categories, we can determine whether certain groups tend to have higher or lower salaries.
This approach helps in understanding how factors like education and company size influence salary differences between groups.
ggplot(data_final, aes(x = education_level, y = salary)) +
geom_boxplot(fill = "orange") +
labs(title = "Education Level vs Salary",
x = "Education Level",
y = "Salary") +
theme_minimal()
Interpretation
The boxplot shows that higher education levels are associated with higher salaries. Individuals with Master’s and PhD degrees tend to earn more compared to those with lower education levels, indicating a positive relationship between education and earnings.
Key Insight: Education significantly improves earning potential.
Analysis 7: Company Size vs Salary
ggplot(data_final, aes(x = company_size, y = salary)) +
geom_boxplot(fill = "purple") +
labs(title = "Company Size vs Salary",
x = "Company Size",
y = "Salary")
Interpretation
Larger companies, particularly enterprises, tend to offer higher salaries compared to startups. This suggests that company size has a significant impact on salary, likely due to greater resources and structured pay scales in larger organizations.
ggplot(data_final, aes(x = remote_work, y = salary)) +
geom_boxplot(fill = "cyan") +
labs(title = "Remote Work vs Salary",
x = "Work Mode",
y = "Salary")
Interpretation
Remote roles show a slightly higher median salary than on-site and hybrid roles, but the difference is small, indicating that work mode has a limited impact on salary.
ggplot(data_final, aes(x = industry, y = salary)) +
geom_boxplot(fill = "gold") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Industry vs Salary",
x = "Industry",
y = "Salary")
Interpretation
Salary variation across industries is relatively moderate, indicating that industry alone does not drastically influence salary levels.
ggplot(data_final, aes(x = job_title, y = salary)) +
geom_boxplot(fill = "lightgreen") +
coord_flip() +
labs(title = "Salary by Job Title",
x = "Job Title",
y = "Salary") +
theme_minimal()
Interpretation
Job role has a noticeable impact on salary. Technical and specialized roles such as AI Engineer and Machine Learning Engineer tend to offer higher salaries, while roles like Data Analyst and Business Analyst have comparatively lower salary ranges.
Distribution analysis focuses on understanding how data values are spread across a range.
Histograms are used for numerical variables such as salary and experience to visualize the frequency of values within specific intervals. Bar charts are used for categorical variables such as job title and company size to show counts of each category.
By analyzing the shape of the distribution, we can identify patterns such as:
- Symmetry or skewness
- Concentration of data
- Presence of extreme values
This helps in understanding whether the data is balanced and how values are distributed across different ranges.
ggplot(data_final, aes(x = experience_years)) +
geom_histogram(bins = 20, fill = "blue", color = "black") +
labs(title = "Distribution of Experience",
x = "Experience (Years)",
y = "Count")
Interpretation
The histogram shows that experience is fairly evenly distributed across the dataset, with a slight concentration around mid-level experience. This indicates a balanced dataset representing both junior and senior professionals.
ggplot(data_final, aes(x = salary)) +
geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "darkred", color = "black") +
geom_density(color = "blue", linewidth = 1) +
labs(title = "Distribution of Salary",
x = "Salary",
y = "Density")
Interpretation
The salary distribution is slightly right-skewed, indicating that most individuals earn mid-range salaries, while a smaller number earn significantly higher salaries.
ggplot(data_final, aes(x = job_title)) +
geom_bar(fill = "steelblue") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Job Title Distribution",
x = "Job Title",
y = "Count")
Interpretation
The distribution of job titles is relatively uniform, indicating that the dataset represents various job roles evenly without significant bias toward any particular profession.
edu_counts <- table(data_final$education_level)
pie(edu_counts,
main = "Education Level Distribution",
col = rainbow(length(edu_counts)))
Interpretation
The pie chart shows that education levels are fairly evenly distributed among individuals, with no single category dominating significantly.
ggplot(data_final, aes(x = company_size)) +
geom_bar(fill = "orange") +
labs(title = "Company Size Distribution",
x = "Company Size",
y = "Count")
Interpretation
The distribution of company sizes is relatively uniform, with a slightly higher representation of medium-sized companies. This indicates that the dataset includes a balanced mix of organizations, allowing for fair comparison across different company sizes.
Multivariate analysis examines the relationship between more than two variables simultaneously.
In this section, colored scatter plots and pair plots are used. Colored scatter plots help visualize how an additional categorical variable (such as education level) influences the relationship between two numerical variables.
Pair plots display multiple relationships between numerical variables in a single view, including scatter plots and correlation values.
This approach helps in identifying combined effects, such as how education and experience together influence salary, and provides a deeper understanding of interactions between variables.
ggplot(data_final, aes(x = experience_years, y = salary, color = education_level)) +
geom_point() +
labs(title = "Experience vs Salary by Education Level",
x = "Experience",
y = "Salary") +
theme_minimal()
Interpretation
The plot shows that salary increases with experience across all education levels. However, individuals with higher education (Master’s and PhD) consistently earn more than those with lower education at the same experience level, indicating a combined effect of education and experience on salary.
Key Insight: Education adds to the effect of experience on salary.
ggpairs(data_final[, c("experience_years", "skills_count", "certifications", "salary")])
Interpretation
The pair plot reveals that experience has the strongest positive correlation with salary, while skills and certifications show weak relationships. Additionally, there is little to no correlation between experience and skills or certifications.
Comparative analysis is used to compare different groups within the dataset to identify patterns and differences.
Bar charts with grouped categories are used to compare variables such as work mode across different locations. This allows visual comparison of how categories behave across different groups.
By analyzing differences in bar heights, we can determine whether certain categories dominate or whether the distribution is uniform across groups.
This method helps in identifying whether location or other grouping variables influence patterns such as remote work preferences.
ggplot(data_final, aes(x = location, fill = remote_work)) +
geom_bar(position = "dodge") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Remote Work Distribution by Location",
x = "Location",
y = "Count",
fill = "Work Mode")
Interpretation
The distribution of work modes across different locations appears relatively uniform. This suggests that remote, hybrid, and on-site work patterns are fairly consistent across countries, with no significant regional preference.
ggplot(data_final, aes(x = education_level)) +
geom_bar(fill = "coral") +
labs(title = "Education Level Distribution",
x = "Education Level",
y = "Count")
Interpretation
The distribution of education levels shows that Bachelor’s degree holders are the most common, while PhD holders are the least common. However, the differences are not very large, indicating a relatively balanced representation of education levels in the dataset.
Cumulative Distribution Function (CDF) analysis is used to understand the cumulative proportion of data points below a certain value.
In this analysis, the CDF curve is plotted to show how values accumulate across the dataset. The y-axis represents the cumulative probability, while the x-axis represents the variable (salary).
This allows us to determine key insights such as:
- Median value (where cumulative probability is 0.5)
- Percentage of observations below a certain threshold
CDF provides a clearer understanding of data distribution compared to histograms, especially for percentile-based interpretation.
plot(ecdf(data_final$salary),
main = "CDF of Salary",
xlab = "Salary",
ylab = "Cumulative Probability",
col = "blue")
Interpretation
The CDF reaches a cumulative probability of 0.5 at a salary of approximately 140,000 to 150,000, so about half of the individuals earn less than this amount. The steep slope in this region indicates a high concentration of salaries around the median.
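The percentile readings described for the CDF can also be computed directly rather than read off the curve. The following is a minimal sketch on a made-up vector (`salary_example` is hypothetical, not the dataset); `ecdf()` returns a function that gives the proportion of observations at or below its argument.

```r
# Hypothetical salaries for illustration only (not from the dataset)
salary_example <- c(100000, 120000, 140000, 150000, 160000, 200000)

# ecdf() returns a function giving the proportion of values <= its argument
salary_cdf <- ecdf(salary_example)

# Proportion of observations at or below a chosen threshold
share_below <- salary_cdf(145000)

# The median is the salary at which the cumulative proportion reaches 0.5
median_salary <- median(salary_example)

share_below
median_salary
```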
Analysis of Variance (ANOVA) is a statistical method used to determine whether there are significant differences between the means of three or more groups. Unlike a t-test, which compares two groups, ANOVA evaluates multiple groups simultaneously.
ANOVA works by comparing the variance between group means to the variance within the groups. If the variation between groups is significantly larger than the variation within groups, it suggests that at least one group mean is different from the others.
In this analysis, ANOVA is used to examine whether salary differs significantly across different education levels.
The following hypotheses are tested:
Null Hypothesis (H₀): There is no significant difference in salary across different education levels.
Alternative Hypothesis (H₁): At least one education level has a significantly different mean salary.
anova_model <- aov(salary ~ education_level, data = data_final)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## education_level 4 8.027e+11 2.007e+11 158.9 <2e-16 ***
## Residuals 5941 7.501e+12 1.263e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
boxplot(salary ~ education_level, data = data_final,
col = "lightblue",
main = "Salary by Education Level")
Interpretation
The ANOVA results show a highly significant effect of education level on salary (F = 158.9, p < 0.001). Since the p-value is much smaller than the significance level of 0.05, the null hypothesis is rejected: mean salary differs across education levels. The test shows only that at least one level differs, not which levels differ from one another. Education level is therefore associated with salary.
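ANOVA indicates only that at least one group mean differs. A post-hoc procedure such as Tukey's HSD is one common way to see which pairs of groups differ. The sketch below runs on simulated data; the group labels `A`, `B`, `C` and all values are hypothetical and are not drawn from the dataset.

```r
# Simulated data for illustration only: three hypothetical groups
set.seed(42)
sim <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 30)),
  value = c(rnorm(30, mean = 100, sd = 5),
            rnorm(30, mean = 100, sd = 5),
            rnorm(30, mean = 130, sd = 5))
)

# One-way ANOVA followed by Tukey's HSD for all pairwise comparisons
sim_aov <- aov(value ~ group, data = sim)
tukey <- TukeyHSD(sim_aov)

# Each row is one pair of groups: difference, confidence bounds, adjusted p-value
tukey$group
```

Applied to the fitted model in this report, the equivalent call would be `TukeyHSD(anova_model)`.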
The following hypotheses are tested:
Null Hypothesis (H₀): There is no significant difference in salary across different company sizes.
Alternative Hypothesis (H₁): At least one company size has a significantly different mean salary.
anova_model2 <- aov(salary ~ company_size, data = data_final)
summary(anova_model2)
## Df Sum Sq Mean Sq F value Pr(>F)
## company_size 4 1.437e+12 3.592e+11 310.8 <2e-16 ***
## Residuals 5941 6.867e+12 1.156e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
boxplot(salary ~ company_size, data = data_final,
col = "lightgreen",
main = "Salary by Company Size",
xlab = "Company Size",
ylab = "Salary")
Interpretation
The ANOVA results show a highly significant effect of company size on salary (F = 310.8, p < 0.001). Since the p-value is much smaller than the significance level of 0.05, the null hypothesis is rejected: mean salary differs across company sizes. The test shows only that at least one company size differs, not which sizes differ from one another. Company size is therefore associated with salary.
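With nearly six thousand observations, a very small p-value does not by itself say how large the group differences are. One common effect-size measure is eta squared, the between-group sum of squares divided by the total. The sketch below computes it from the sums of squares printed in the two ANOVA tables above; the helper name `eta_squared` is introduced here for illustration.

```r
# Sums of squares copied from the two ANOVA tables above
ss_education <- c(between = 8.027e11, within = 7.501e12)
ss_company   <- c(between = 1.437e12, within = 6.867e12)

# Eta squared: share of total variation attributable to the grouping factor
eta_squared <- function(ss) unname(ss["between"] / sum(ss))

eta_squared(ss_education)  # about 0.097
eta_squared(ss_company)    # about 0.173
```

On this measure, education level accounts for roughly 10% of the variation in salary and company size for roughly 17%.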
Correlation analysis is used to measure the strength and direction of the linear relationship between two continuous variables. The correlation coefficient ranges from -1 to +1, where values close to +1 indicate a strong positive relationship, values close to -1 indicate a strong negative relationship, and values near 0 indicate little or no linear relationship.
The following hypotheses are tested:
Null Hypothesis (H₀): There is no linear relationship between years of experience and salary.
Alternative Hypothesis (H₁): There is a significant linear relationship between years of experience and salary.
In this analysis, correlation is used to examine how experience (experience_years) is related to salary.
cor(data_final$experience_years, data_final$salary)
## [1] 0.4351317
plot(data_final$experience_years, data_final$salary,
col = "blue",
main = "Experience vs Salary",
xlab = "Years of Experience",
ylab = "Salary")
Interpretation
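The correlation coefficient between experience and salary is 0.435, indicating a moderate positive relationship: as years of experience increase, salary tends to increase, although with considerable scatter. Note that cor() returns only the coefficient and does not itself test the hypotheses; a formal test (for example, cor.test()) is needed to reject the null hypothesis. With 5,946 observations, a coefficient of this size is far larger than chance alone would plausibly produce, so the conclusion that experience has a significant positive association with salary stands.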
The following hypotheses are tested:
Null Hypothesis (H₀): There is no linear relationship between skills count and salary.
Alternative Hypothesis (H₁): There is a significant linear relationship between skills count and salary.
cor(data_final$skills_count, data_final$salary)
## [1] 0.1046777
plot(data_final$skills_count, data_final$salary,
col = "darkgreen",
main = "Skills Count vs Salary",
xlab = "Skills Count",
ylab = "Salary")
Interpretation
The correlation coefficient between skills count and salary is 0.105, indicating a very weak positive relationship: a higher number of skills is associated with only a slightly higher salary. The association is therefore weak in practical terms, even though a coefficient of this size can still be statistically distinguishable from zero in a sample this large.
The following hypotheses are tested:
Null Hypothesis (H₀): There is no linear relationship between certifications and salary.
Alternative Hypothesis (H₁): There is a significant linear relationship between certifications and salary.
cor(data_final$certifications, data_final$salary)
## [1] 0.08287942
plot(data_final$certifications, data_final$salary,
col = "purple",
main = "Certifications vs Salary",
xlab = "Number of Certifications",
ylab = "Salary")
Interpretation
The correlation coefficient between certifications and salary is 0.083, indicating a very weak positive relationship. This suggests that an increase in the number of certifications is associated with only a slight increase in salary. Therefore, certifications alone do not have a strong linear relationship with salary.
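The hypotheses stated for the three correlations can be checked from the reported coefficients alone, using the standard t statistic t = r·√(n − 2) / √(1 − r²). The sketch below assumes n = 5946, the number of rows implied by the residual degrees of freedom reported elsewhere in this report; with the data loaded, `cor.test()` performs the same test directly.

```r
# Reported correlation coefficients and sample size (5944 residual df + 2)
r <- c(experience = 0.4351317, skills = 0.1046777, certifications = 0.08287942)
n <- 5946

# t statistic for H0: no linear relationship, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-values
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

round(t_stat, 2)
p_value
```

All three p-values fall well below 0.05, so each null hypothesis is rejected, even though the skills and certifications relationships are weak in size.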
Simple linear regression is used to model the relationship between a dependent variable and one independent variable by fitting a straight line to the observed data. It helps in understanding how changes in the independent variable affect the dependent variable and can be used for prediction.
The regression model is expressed as:
y = β₀ + β₁x + ε
where:
- y: the dependent variable (the quantity being predicted)
- x: the independent variable (the factor used to make the prediction)
- β₀ (intercept): the value of y when x is zero (where the line crosses the vertical axis)
- β₁ (slope): how much y changes for each one-unit increase in x
- ε (error): the distance between an observed data point and the fitted line
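To make the roles of the intercept and slope concrete, the sketch below computes both from their closed-form least-squares expressions and confirms that `lm()` fits the same line. The `x` and `y` values are hypothetical and chosen only to keep the arithmetic small.

```r
# Hypothetical data for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Least-squares slope and intercept from their closed-form expressions
slope <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# lm() fits the same line
fit <- lm(y ~ x)
coef(fit)
c(intercept, slope)
```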
For this analysis, the model is:
Salary = β₀ + β₁ (Skills Count) + ε
The following hypotheses are tested:
Null Hypothesis (H₀): The coefficient of skills count (β₁) is equal to zero, meaning skills count has no effect on salary.
Alternative Hypothesis (H₁): The coefficient of skills count (β₁) is not equal to zero, meaning skills count has a significant effect on salary.
In this analysis, simple linear regression is used to predict salary based on the number of skills.
reg_model <- lm(salary ~ skills_count, data = data_final)
summary(reg_model)
##
## Call:
## lm(formula = salary ~ skills_count, data = data_final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103772 -26086 -2039 24096 154145
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 138550.66 1006.81 137.613 < 2e-16 ***
## skills_count 721.89 88.96 8.115 5.86e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37170 on 5944 degrees of freedom
## Multiple R-squared: 0.01096, Adjusted R-squared: 0.01079
## F-statistic: 65.85 on 1 and 5944 DF, p-value: 5.856e-16
plot(data_final$skills_count, data_final$salary,
col = "darkgreen",
main = "Skills Count vs Salary",
xlab = "Skills Count",
ylab = "Salary")
abline(reg_model, col = "red", lwd = 2)
Interpretation
The regression results show that skills count has a statistically significant effect on salary (β₁ = 721.89, p < 0.001). This indicates that for each additional skill, salary increases by approximately 722 units on average. However, the R-squared value is 0.011, suggesting that skills count explains only a small portion of the variation in salary. Therefore, while the relationship is statistically significant, it is relatively weak in terms of predictive power.
reg_model2 <- lm(salary ~ certifications, data = data_final)
summary(reg_model2)
##
## Call:
## lm(formula = salary ~ certifications, data = data_final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -101732 -26412 -2467 23447 156542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 141202.5 854.7 165.207 < 2e-16 ***
## certifications 1806.0 281.7 6.412 1.55e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37250 on 5944 degrees of freedom
## Multiple R-squared: 0.006869, Adjusted R-squared: 0.006702
## F-statistic: 41.11 on 1 and 5944 DF, p-value: 1.548e-10
plot(data_final$certifications, data_final$salary,
col = "purple",
main = "Certifications vs Salary",
xlab = "Number of Certifications",
ylab = "Salary")
abline(reg_model2, col = "red", lwd = 2)
Interpretation
The regression results show that certifications have a statistically significant effect on salary (β₁ = 1806.0, p < 0.001). This indicates that each additional certification is associated with an average increase of approximately 1806 units in salary. However, the R-squared value is 0.0069, indicating that certifications explain only a very small portion of the variation in salary. Therefore, while the association is statistically significant, its explanatory power is limited.
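A confidence interval conveys how precisely each slope is estimated. As a sketch, the 95% intervals below are computed from the estimates and standard errors printed in the two regression summaries above; with the fitted models available, `confint(reg_model)` and `confint(reg_model2)` would return the same kind of interval directly.

```r
# Estimates and standard errors copied from the two regression summaries above
estimate  <- c(skills_count = 721.89, certifications = 1806.0)
std_error <- c(skills_count = 88.96,  certifications = 281.7)
df_resid  <- 5944

# 95% confidence interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- cbind(lower = estimate - t_crit * std_error,
            upper = estimate + t_crit * std_error)
round(ci)
```

Neither interval includes zero, which agrees with the small p-values reported for both slopes.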
Multiple linear regression is used to model the relationship between a dependent variable and two or more independent variables. It allows us to understand how multiple factors simultaneously influence the dependent variable and improves predictive accuracy compared to single-variable models.
The regression model is expressed as:
y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + ε
where β₀ is the intercept, β₁, β₂, and β₃ represent the effect of each independent variable (x₁, x₂, x₃) on the dependent variable y (the one we want to predict), and ε is the error term.
The following hypotheses are tested:
Null Hypothesis (H₀): All coefficients (β₁, β₂, β₃) are equal to zero, meaning none of the independent variables significantly affect salary.
Alternative Hypothesis (H₁): At least one coefficient is not equal to zero, meaning at least one independent variable has a significant effect on salary.
In this analysis, multiple linear regression is used to examine how experience, skills count, and certifications jointly influence salary.
multi_model <- lm(salary ~ experience_years + skills_count + certifications, data = data_final)
summary(multi_model)
##
## Call:
## lm(formula = salary ~ experience_years + skills_count + certifications,
## data = data_final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -97219 -22305 -1547 20184 125447
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 107104.76 1303.28 82.181 < 2e-16 ***
## experience_years 2667.40 70.42 37.876 < 2e-16 ***
## skills_count 753.04 79.63 9.457 < 2e-16 ***
## certifications 1789.19 251.59 7.112 1.28e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33260 on 5942 degrees of freedom
## Multiple R-squared: 0.2085, Adjusted R-squared: 0.2081
## F-statistic: 521.7 on 3 and 5942 DF, p-value: < 2.2e-16
plot(fitted(multi_model), data_final$salary,
col = "darkblue",
main = "Fitted vs Actual Salary",
xlab = "Predicted Salary",
ylab = "Actual Salary")
abline(0, 1, col = "red", lwd = 2)
Interpretation
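The multiple regression results show that all three variables (experience, skills count, and certifications) have a statistically significant effect on salary (p < 0.001 for all predictors). Holding the other predictors constant, each additional year of experience is associated with about 2,667 more in salary (β₁ = 2667.40), each additional skill with about 753 (β₂ = 753.04), and each additional certification with about 1,789 (β₃ = 1789.19). Because the predictors are measured on different scales, these per-unit coefficients are not directly comparable; judged by t value, experience is by far the strongest predictor (t = 37.9), followed by skills count (t = 9.5) and certifications (t = 7.1).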
The R-squared value of 0.2085 indicates that approximately 20.85% of the variation in salary is explained by the combined model.
Overall, the model demonstrates that experience is the most important predictor of salary, while skills and certifications also contribute meaningfully when considered together.
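Raw regression coefficients depend on the units of each predictor, so ranking predictors by coefficient size can mislead. One common remedy is to compare standardized coefficients, obtained by rescaling every variable to mean 0 and standard deviation 1 before fitting. The sketch below uses simulated data; the variables `x1`, `x2`, and `y` are hypothetical and unrelated to the dataset.

```r
# Simulated data for illustration only: x1 varies widely, x2 varies little
set.seed(1)
n  <- 200
x1 <- rnorm(n, mean = 10, sd = 5)
x2 <- rnorm(n, mean = 3, sd = 1)
y  <- 2 * x1 + 5 * x2 + rnorm(n)

# Raw coefficients: change in y per one unit of each predictor
raw_coef <- coef(lm(y ~ x1 + x2))

# Standardized coefficients: change in y (in SDs) per one SD of each predictor
std_data <- data.frame(y  = as.numeric(scale(y)),
                       x1 = as.numeric(scale(x1)),
                       x2 = as.numeric(scale(x2)))
std_coef <- coef(lm(y ~ x1 + x2, data = std_data))

raw_coef[c("x1", "x2")]
std_coef[c("x1", "x2")]
```

In this simulation `x2` has the larger raw coefficient, yet `x1` has the larger standardized coefficient because it varies far more.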
Polynomial regression is an extension of linear regression that models the relationship between the independent variable and dependent variable as an nth-degree polynomial rather than a straight line. It is used when the relationship between variables is non-linear.
For a second-degree (quadratic) polynomial:
y = β₀ + β₁x + β₂x² + ε
where:
- y: dependent variable
- x: independent variable
- β₀: intercept
- β₁, β₂: coefficients
- ε: error term
In this analysis, the regression model is expressed as:
Salary = β₀ + β₁ (Experience) + β₂ (Experience²) + ε
where β₂ captures the curvature in the relationship, allowing the model to represent non-linear patterns.
The following hypotheses are tested:
Null Hypothesis (H₀): The coefficient of the squared term (β₂) is equal to zero, meaning there is no quadratic (non-linear) relationship between experience and salary.
Alternative Hypothesis (H₁): The coefficient of the squared term (β₂) is not equal to zero, meaning there is a significant non-linear relationship.
In this analysis, polynomial regression is used to examine whether salary increases with experience in a non-linear manner.
poly_model <- lm(salary ~ experience_years + I(experience_years^2), data = data_final)
summary(poly_model)
##
## Call:
## lm(formula = salary ~ experience_years + I(experience_years^2),
## data = data_final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -101021 -22371 -1476 21038 132522
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.194e+05 1.184e+03 100.832 <2e-16 ***
## experience_years 2.604e+03 2.765e+02 9.415 <2e-16 ***
## I(experience_years^2) 2.554e+00 1.333e+01 0.192 0.848
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33660 on 5943 degrees of freedom
## Multiple R-squared: 0.1893, Adjusted R-squared: 0.1891
## F-statistic: 694.1 on 2 and 5943 DF, p-value: < 2.2e-16
plot(data_final$experience_years, data_final$salary,
col = "darkorange",
main = "Polynomial Regression: Experience vs Salary",
xlab = "Experience (Years)",
ylab = "Salary")
exp_seq <- seq(min(data_final$experience_years),
max(data_final$experience_years),
length.out = 100)
pred <- predict(poly_model,
newdata = data.frame(experience_years = exp_seq))
lines(exp_seq, pred, col = "blue", lwd = 2)
Interpretation
The polynomial regression results show that the linear term for experience is statistically significant (β₁ ≈ 2604, p < 0.001), indicating that salary increases with experience. However, the squared term (β₂ ≈ 2.55) is not statistically significant (p = 0.848), suggesting that there is no strong evidence of a non-linear relationship between experience and salary.
The R-squared value of 0.189 indicates that the model explains approximately 18.9% of the variation in salary. This is essentially the same as the squared correlation between experience and salary (0.435² ≈ 0.189), which is the R-squared of a straight-line fit on experience alone, so adding the squared term does not improve the model.
Overall, the results indicate that the relationship between experience and salary is primarily linear, and a simple linear regression model is sufficient.
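Whether a squared term is worth keeping can also be checked by comparing the nested linear and quadratic fits with an F-test. The sketch below uses simulated data with a truly linear relationship; `x`, `y`, and the model names are hypothetical and not part of the original analysis.

```r
# Simulated data for illustration only: a truly linear relationship
set.seed(7)
x <- runif(150, min = 0, max = 20)
y <- 50 + 3 * x + rnorm(150, sd = 10)

linear_fit    <- lm(y ~ x)
quadratic_fit <- lm(y ~ x + I(x^2))

# F-test for whether the squared term improves on the straight line
comparison <- anova(linear_fit, quadratic_fit)
comparison

# With one added term, this F statistic equals the squared t value of that term
t_squared <- summary(quadratic_fit)$coefficients["I(x^2)", "t value"]^2
all.equal(comparison$F[2], t_squared)
```

Because only one term is added, this comparison leads to the same conclusion as the t-test on the squared term in the model summary.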
set.seed(123)
train_index <- sample(seq_len(nrow(data_final)), floor(0.8 * nrow(data_final)))
train_data <- data_final[train_index, ]
test_data <- data_final[-train_index, ]
nrow(train_data)
## [1] 4756
nrow(test_data)
## [1] 1190
train_model <- lm(salary ~ experience_years + skills_count + certifications, data = train_data)
summary(train_model)
##
## Call:
## lm(formula = salary ~ experience_years + skills_count + certifications,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -95378 -22234 -1375 20195 125149
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 106311.79 1452.41 73.197 < 2e-16 ***
## experience_years 2731.84 78.98 34.588 < 2e-16 ***
## skills_count 745.61 89.22 8.357 < 2e-16 ***
## certifications 1897.55 281.15 6.749 1.66e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33300 on 4752 degrees of freedom
## Multiple R-squared: 0.2162, Adjusted R-squared: 0.2157
## F-statistic: 437 on 3 and 4752 DF, p-value: < 2.2e-16
predictions <- predict(train_model, newdata = test_data)
rmse <- sqrt(mean((test_data$salary - predictions)^2))
rmse
## [1] 33124.14
Interpretation
The model was evaluated using a train-test split approach, where 80% of the data (4756 rows) was used for training and 20% (1190 rows) for testing. The Root Mean Square Error (RMSE) obtained on the test dataset is 33124.14.
RMSE measures the typical size of the prediction error in salary units, with larger errors weighted more heavily. An RMSE of 33124.14 is about 23% of the mean salary (145723.5) and only about 11% below the standard deviation of salary in the full dataset (37373.77), so the model predicts only modestly better than always guessing the mean.
Overall, the multiple regression model, which includes experience, skills count, and certifications, identifies three statistically significant predictors (all p < 0.001), but it explains only about 22% of the variation in salary in the training data (R-squared = 0.2162). Most of the variation remains unexplained, indicating that additional factors influence salary beyond the variables considered in this analysis.
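One way to put an RMSE value in context is to compare it with a baseline that always predicts the training-set mean. The following is a minimal sketch, assuming synthetic stand-in data: the helper `rmse()` and the objects `sim_data`, `train_sim`, `test_sim`, and `fit` are hypothetical and do not reproduce the project's results.

```r
# Helper: root mean square error between observed and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Synthetic stand-in data (hypothetical), not the project's data_final.
set.seed(123)
n <- 1000
sim_data <- data.frame(
  experience_years = sample(0:20, n, replace = TRUE),
  skills_count     = sample(1:20, n, replace = TRUE),
  certifications   = sample(0:5, n, replace = TRUE)
)
sim_data$salary <- 106000 + 2700 * sim_data$experience_years +
  750 * sim_data$skills_count + 1900 * sim_data$certifications +
  rnorm(n, sd = 33000)

# Same 80/20 split pattern as in the analysis.
train_index <- sample(seq_len(n), size = floor(0.8 * n))
train_sim <- sim_data[train_index, ]
test_sim  <- sim_data[-train_index, ]

fit <- lm(salary ~ experience_years + skills_count + certifications,
          data = train_sim)

# Model RMSE versus a baseline that always predicts the training mean.
model_rmse    <- rmse(test_sim$salary, predict(fit, newdata = test_sim))
baseline_rmse <- rmse(test_sim$salary, mean(train_sim$salary))

# Out-of-sample R-squared: share of baseline squared error removed by the model.
test_r2 <- 1 - (model_rmse / baseline_rmse)^2

c(model = model_rmse, baseline = baseline_rmse, test_r2 = test_r2)
```

If the model RMSE is only slightly below the baseline RMSE, the predictors add little beyond the average salary.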
new_candidates <- data.frame(
experience_years = c(2, 5, 10),
skills_count = c(5, 8, 12),
certifications = c(1, 3, 5)
)
new_predictions <- predict(train_model, newdata = new_candidates)
new_predictions
## 1 2 3
## 117401.1 131628.5 152065.3
Interpretation
The trained multiple regression model was used to predict salaries for new candidate profiles based on their experience, skills, and certifications.
The predicted salaries are as follows:
Candidate 1: 117401.1
Candidate 2: 131628.5
Candidate 3: 152065.3
Predicted salary increases from Candidate 1 to Candidate 3 because all three model coefficients are positive. The candidate with the highest experience, skills, and certifications therefore receives the highest predicted salary.
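The predictions can be checked by hand from the fitted coefficients. The following sketch copies the rounded estimates from the `summary(train_model)` output and applies the linear formula directly; `coefs` and `manual_predictions` are names introduced here for illustration.

```r
# Coefficients copied (rounded) from the summary(train_model) output.
coefs <- c(intercept = 106311.79, experience_years = 2731.84,
           skills_count = 745.61, certifications = 1897.55)

new_candidates <- data.frame(
  experience_years = c(2, 5, 10),
  skills_count     = c(5, 8, 12),
  certifications   = c(1, 3, 5)
)

# Linear predictor: intercept plus each coefficient times its predictor value.
manual_predictions <- coefs[["intercept"]] +
  coefs[["experience_years"]] * new_candidates$experience_years +
  coefs[["skills_count"]]     * new_candidates$skills_count +
  coefs[["certifications"]]   * new_candidates$certifications

round(manual_predictions, 1)
```

Rounded to one decimal place, the manual values agree with the `predict()` output for all three candidates.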
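This illustrates how the model can be applied to new candidate profiles. However, because the model explains only about 22% of the variation in salary and has a residual standard error of about 33300, individual predictions carry wide uncertainty. They should be treated as rough estimates rather than used on their own for decisions such as evaluating job candidates or setting compensation.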
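Point predictions alone do not convey how uncertain an individual estimate is. The sketch below shows how `predict()` can return 95% prediction intervals for a linear model. It assumes synthetic stand-in data, so `sim_data` and `fit` are hypothetical and the resulting intervals are not the project's results.

```r
# Synthetic stand-in data (hypothetical), not the project's data_final.
set.seed(123)
n <- 1000
sim_data <- data.frame(
  experience_years = sample(0:20, n, replace = TRUE),
  skills_count     = sample(1:20, n, replace = TRUE),
  certifications   = sample(0:5, n, replace = TRUE)
)
sim_data$salary <- 106000 + 2700 * sim_data$experience_years +
  750 * sim_data$skills_count + 1900 * sim_data$certifications +
  rnorm(n, sd = 33000)

fit <- lm(salary ~ experience_years + skills_count + certifications,
          data = sim_data)

new_candidates <- data.frame(
  experience_years = c(2, 5, 10),
  skills_count     = c(5, 8, 12),
  certifications   = c(1, 3, 5)
)

# interval = "prediction" adds lower (lwr) and upper (upr) bounds that
# account for both coefficient uncertainty and residual noise.
intervals <- predict(fit, newdata = new_candidates,
                     interval = "prediction", level = 0.95)
intervals
```

With a residual standard error as large as the one reported for `train_model`, such intervals would be expected to span well over 100000 in salary units for each candidate.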
Statistical analysis is used to summarize and describe the dataset using numerical measures.
In this section, key statistical metrics such as mean, median, variance, standard deviation, quartiles, and interquartile range (IQR) are calculated for salary.
The analysis is performed as follows:
- Mean and median are used to measure central tendency
- Variance and standard deviation measure the spread of data
- Quartiles and IQR describe the distribution and range of the middle 50% of data
By comparing mean and median, we can identify skewness in the data. Measures of spread help in understanding how widely salaries vary across individuals.
This analysis provides a quantitative summary that supports the insights obtained from visualizations.
# Basic statistics
min(data_final$salary)
## [1] 41276
max(data_final$salary)
## [1] 304968
mean(data_final$salary)
## [1] 145723.5
median(data_final$salary)
## [1] 143722.5
var(data_final$salary)
## [1] 1396798727
sd(data_final$salary)
## [1] 37373.77
# Quartiles & Percentiles
quantile(data_final$salary)
## 0% 25% 50% 75% 100%
## 41276.0 119171.2 143722.5 169706.8 304968.0
# IQR
IQR(data_final$salary)
## [1] 50535.5
Interpretation
The mean salary (145723.5) is slightly higher than the median (143722.5), indicating a mildly right-skewed distribution. This suggests the presence of some high salary values pulling the average upward.
The standard deviation of 37373.77 is about 26% of the mean, indicating moderate variability in earnings among individuals.
Salaries range from 41276 to 304968, a span of 263692, highlighting the gap between the lowest and highest earners.
The interquartile range of 50535.5 shows that the middle 50% of salaries fall between 119171.2 and 169706.8, so half of all individuals earn within about 26000 of the median.
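The skewness and spread described above can be quantified directly from the reported summary statistics. The following sketch computes Pearson's second skewness coefficient, the coefficient of variation, and the conventional 1.5 × IQR outlier fences. The input values are copied from the output above, and the variable names are introduced here for illustration.

```r
# Summary statistics copied from the output above.
salary_mean   <- 145723.5
salary_median <- 143722.5
salary_sd     <- 37373.77
q1            <- 119171.2
q3            <- 169706.8
salary_iqr    <- 50535.5

# Pearson's second skewness coefficient: 3 * (mean - median) / sd.
# Positive values indicate right skew.
pearson_skew <- 3 * (salary_mean - salary_median) / salary_sd

# Coefficient of variation: spread relative to the mean.
cv <- salary_sd / salary_mean

# Tukey fences: values outside these limits are conventionally
# flagged as potential outliers.
lower_fence <- q1 - 1.5 * salary_iqr
upper_fence <- q3 + 1.5 * salary_iqr

round(c(pearson_skew = pearson_skew, cv = cv), 3)
c(lower_fence = lower_fence, upper_fence = upper_fence)
```

The skewness coefficient comes out at roughly 0.16, consistent with mild right skew. Both the minimum (41276) and the maximum (304968) fall outside the fences, so the dataset contains at least one potential outlier at each end.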
This project focused on analyzing and understanding the factors influencing salary using a structured data analysis approach. The study began with data preprocessing, including handling missing values, transforming categorical variables, and ensuring data quality. Exploratory data analysis through visualization and statistical measures helped identify patterns and distributions within the dataset. Techniques such as ANOVA and correlation were used to examine relationships between variables, revealing that factors like education level and company size have a significant impact on salary, while variables such as skills and certifications show weaker individual relationships.
Further analysis using regression models quantified the associations between the predictors and salary. Simple regression models showed limited explanatory power, whereas multiple regression performed better by combining experience, skills, and certifications, although it still explained only about 22% of the variation in salary. Polynomial regression indicated that the relationship between experience and salary is primarily linear. The model was evaluated using a train-test split approach, achieving an RMSE of 33124.14, which is only modestly below the standard deviation of salary (37373.77) and therefore indicates limited predictive accuracy. Finally, the model was applied to new data to predict salaries, illustrating how it can be used in practice. Overall, the project highlights that salary is influenced by multiple factors, many of them not captured by the model, and emphasizes the importance of combining statistical and predictive techniques for effective data analysis.