A Study on Factors Affecting Salary in the Job Market


Introduction

This project analyzes a job market dataset to understand the relationships between experience, skills, education, and salary. The goal is to identify patterns and insights using exploratory data analysis and visualization techniques.

Data Analysis

This section presents the analyses performed on the dataset to uncover patterns, relationships, and insights. The exploratory part covers relational, categorical, distribution, multivariate, comparative, and cumulative distribution analysis; it is followed by statistical tests and models (ANOVA, correlation, and single, multiple, and polynomial regression).

Each type of analysis is performed using appropriate visualization techniques to better understand how different variables interact and influence each other.


Load Required Libraries and Prepare the Data

library(tidyverse)   # also attaches ggplot2 and readr
library(GGally)
library(mice)
library(missForest)
library(car)

# Load the dataset
job_market <- read_csv("job_salary_prediction_dataset.csv", show_col_types = FALSE)
head(job_market)
## # A tibble: 6 × 10
##   job_title  experience_years education_level skills_count industry company_size
##   <chr>                 <dbl> <chr>                  <dbl> <chr>    <chr>       
## 1 AI Engine…               10 Bachelor                   2 Healthc… Medium      
## 2 Business …               15 Master                     3 Technol… Small       
## 3 Software …               18 Master                    17 Consult… Startup     
## 4 Cloud Eng…               11 Bachelor                   8 Governm… Large       
## 5 Cloud Eng…                2 High School               17 Media    Startup     
## 6 Cloud Eng…                9 PhD                       13 Governm… Enterprise  
## # ℹ 4 more variables: location <chr>, remote_work <chr>, certifications <dbl>,
## #   salary <dbl>
# Data processing: convert categorical columns to factors
job_market$job_title <- as.factor(job_market$job_title)
job_market$education_level <- as.factor(job_market$education_level)
job_market$industry <- as.factor(job_market$industry)
job_market$company_size <- as.factor(job_market$company_size)
job_market$location <- as.factor(job_market$location)
job_market$remote_work <- as.factor(job_market$remote_work)

# Missing value handling: count missing values per column
colSums(is.na(job_market))
##        job_title experience_years  education_level     skills_count 
##                0                0                0                0 
##         industry     company_size         location      remote_work 
##                0                0                0                0 
##   certifications           salary 
##                0                0
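# No values are missing, so the imputation below leaves the data unchanged
# (the mice log lists no imputed variables); it only matters if NAs are present.
numeric_cols <- c("experience_years", "skills_count", "certifications", "salary")
mice_output  <- mice(job_market[numeric_cols], m = 1, method = 'pmm', seed = 123)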
## 
##  iter imp variable
##   1   1
##   2   1
##   3   1
##   4   1
##   5   1
data_clean   <- complete(mice_output)

# Update the main dataset with imputed values
job_market[numeric_cols] <- data_clean

# Finalize dataset for analysis
data_final <- job_market

# Check for duplicate rows
sum(duplicated(data_final))
## [1] 0
# Outlier detection
boxplot(data_final$salary, main = "Salary Distribution")
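
The boxplot flags outliers visually. As a rough numeric counterpart, the sketch below applies the usual 1.5 × IQR rule to a small toy vector. The vector `toy_salary` and the helper `iqr_outliers()` are hypothetical and introduced only for illustration; `boxplot()` itself uses hinges, which can differ slightly from the quartiles used here.

```r
# Toy salary vector (hypothetical values, not taken from the dataset)
toy_salary <- c(100000, 120000, 135000, 140000, 142000,
                150000, 158000, 165000, 180000, 320000)

# Return the values lying more than k * IQR beyond the quartiles
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  x[!is.na(x) & (x < q[1] - fence | x > q[2] + fence)]
}

outliers <- iqr_outliers(toy_salary)
print(outliers)
```

Applied to the real data, the same helper could be called as `iqr_outliers(data_final$salary)`.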

Relational Analysis

Relational analysis is used to study the relationship between two numerical variables and understand how one variable influences another.

In this analysis, scatter plots are used where each point represents an observation in the dataset. By plotting one variable on the x-axis and another on the y-axis, patterns and trends can be visually examined.

The strength of the relationship is determined by observing the direction and spread of the points:

- A clear upward pattern indicates a strong positive relationship
- A scattered pattern indicates a weak or no relationship

This method helps identify which factors significantly impact salary, such as experience, and which factors have minimal influence, such as skills or certifications.

This section analyzes relationships between variables to understand how different factors influence salary.
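
Visual impressions from scatter plots can be cross-checked with a correlation matrix. The following is a minimal sketch on simulated data; the data frame `toy` and its generating coefficients are assumptions made for illustration and are not the project's data.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the numeric columns (hypothetical values)
toy <- data.frame(experience_years = runif(n, 0, 20))
toy$skills_count <- sample(1:20, n, replace = TRUE)
toy$salary <- 100000 + 2500 * toy$experience_years + rnorm(n, sd = 30000)

# Pairwise Pearson correlations between all numeric columns
cor_mat <- cor(toy)
print(round(cor_mat, 2))
```

On the real data the equivalent call would be `cor(data_final[numeric_cols])`.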

Analysis 1: Experience vs Salary

ggplot(data_final, aes(x = experience_years, y = salary)) +
  geom_point(color = "blue") +
  labs(title = "Experience vs Salary",
       x = "Experience (Years)",
       y = "Salary") +
  theme_minimal()

Interpretation

The scatter plot shows a positive relationship between experience and salary. As experience increases, salary tends to increase, indicating that experience is a key factor influencing earnings.

Key Insight: Among the numeric variables examined, experience shows the clearest relationship with salary.

Analysis 2: Skills vs Salary

ggplot(data_final, aes(x = skills_count, y = salary)) +
  geom_point(color = "green") +
  labs(title = "Skills vs Salary",
       x = "Number of Skills",
       y = "Salary")

Interpretation

The scatter plot shows a weak relationship between skills and salary. While higher skill counts may slightly increase salary, the effect is not strong, indicating that other factors also play a significant role.

Analysis 3: Certifications vs Salary

ggplot(data_final, aes(x = certifications, y = salary)) +
  geom_point(color = "red") +
  labs(title = "Certifications vs Salary",
       x = "Number of Certifications",
       y = "Salary")

Interpretation

The scatter plot indicates a very weak relationship between certifications and salary. Salary levels vary widely regardless of the number of certifications, suggesting certifications alone do not significantly influence earnings.

Analysis 4: Experience vs Skills

ggplot(data_final, aes(x = experience_years, y = skills_count)) +
  geom_point(color = "darkgreen") +
  labs(title = "Experience vs Skills",
       x = "Experience (Years)",
       y = "Number of Skills")

Interpretation

The scatter plot shows no clear relationship between experience and number of skills. Skills appear to be independent of experience, suggesting that skill acquisition may depend on individual learning rather than years of experience.

Analysis 5: Skills vs Certifications

ggplot(data_final, aes(x = skills_count, y = certifications)) +
  geom_point(color = "brown") +
  labs(title = "Skills vs Certifications",
       x = "Number of Skills",
       y = "Certifications")

Interpretation

The scatter plot shows no clear relationship between skills and certifications. Individuals with a high number of skills do not necessarily possess more certifications, indicating that practical skills and formal certifications may develop independently.

Categorical Analysis

Categorical analysis examines how different categories or groups affect a numerical variable.

In this section, boxplots are used to compare salary distributions across categories such as education level, company size, job role, and industry. Each boxplot summarizes key statistical values including median, quartiles, and potential outliers.

By comparing the median and spread of values across categories, we can determine whether certain groups tend to have higher or lower salaries.

This approach helps in understanding how factors like education and company size influence salary differences between groups.
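
The values a boxplot summarizes can also be tabulated directly. The sketch below computes the median, interquartile range, and group size per category on a toy data frame; the salaries are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Toy data: three education levels with three salaries each (hypothetical)
toy <- data.frame(
  education_level = factor(rep(c("Bachelor", "Master", "PhD"), each = 3)),
  salary = c(110000, 125000, 140000,
             130000, 150000, 160000,
             150000, 170000, 200000)
)

# Split salaries by category, then summarize each group
by_level <- split(toy$salary, toy$education_level)
group_stats <- data.frame(
  median = sapply(by_level, median),
  iqr    = sapply(by_level, IQR),
  n      = sapply(by_level, length)
)
print(group_stats)
```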

Analysis 6: Education Level vs Salary

ggplot(data_final, aes(x = education_level, y = salary)) +
  geom_boxplot(fill = "orange") +
  labs(title = "Education Level vs Salary",
       x = "Education Level",
       y = "Salary") +
  theme_minimal()

Interpretation

The boxplot shows that higher education levels are associated with higher salaries. Individuals with Master’s and PhD degrees tend to earn more compared to those with lower education levels, indicating a positive relationship between education and earnings.

Key Insight: Education significantly improves earning potential.

Analysis 7: Company Size vs Salary

ggplot(data_final, aes(x = company_size, y = salary)) +
  geom_boxplot(fill = "purple") +
  labs(title = "Company Size vs Salary",
       x = "Company Size",
       y = "Salary")

Interpretation

Larger companies, particularly enterprises, tend to offer higher salaries compared to startups. This suggests that company size has a significant impact on salary, likely due to greater resources and structured pay scales in larger organizations.

Analysis 8: Remote Work vs Salary

ggplot(data_final, aes(x = remote_work, y = salary)) +
  geom_boxplot(fill = "cyan") +
  labs(title = "Remote Work vs Salary",
       x = "Work Mode",
       y = "Salary")

Interpretation

Remote work shows a slightly higher salary compared to on-site and hybrid modes, but the difference is small, indicating that work mode has a limited impact on salary.

Analysis 9: Industry vs Salary

ggplot(data_final, aes(x = industry, y = salary)) +
  geom_boxplot(fill = "gold") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Industry vs Salary",
       x = "Industry",
       y = "Salary")

Interpretation

Salary variation across industries is relatively moderate, indicating that industry alone does not drastically influence salary levels.

Analysis 10: Salary by Job Title

ggplot(data_final, aes(x = job_title, y = salary)) +
  geom_boxplot(fill = "lightgreen") +
  coord_flip() +
  labs(title = "Salary by Job Title",
       x = "Job Title",
       y = "Salary") +
  theme_minimal()

Interpretation

Job role has a noticeable impact on salary. Technical and specialized roles such as AI Engineer and Machine Learning Engineer tend to offer higher salaries, while roles like Data Analyst and Business Analyst have comparatively lower salary ranges.

Distribution Analysis

Distribution analysis focuses on understanding how data values are spread across a range.

Histograms are used for numerical variables such as salary and experience to visualize the frequency of values within specific intervals. Bar charts are used for categorical variables such as job title and company size to show counts of each category.

By analyzing the shape of the distribution, we can identify patterns such as:

- Symmetry or skewness
- Concentration of data
- Presence of extreme values

This helps in understanding whether the data is balanced and how values are distributed across different ranges.
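
Skewness can be judged numerically as well as visually. The helper `sample_skewness()` below is a hypothetical illustration using the moment-based definition (third central moment divided by the variance raised to the power 1.5); it is a sketch, not a function used elsewhere in this report.

```r
# Moment-based sample skewness: 0 for symmetric data, > 0 for a right tail
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

symmetric    <- c(1, 2, 3, 4, 5)
right_skewed <- c(1, 1, 2, 2, 3, 10)

print(sample_skewness(symmetric))
print(sample_skewness(right_skewed))

# In right-skewed data the mean is pulled above the median
print(c(mean = mean(right_skewed), median = median(right_skewed)))
```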

Analysis 11: Distribution of Experience

ggplot(data_final, aes(x = experience_years)) +
  geom_histogram(bins = 20, fill = "blue", color = "black") +
  labs(title = "Distribution of Experience",
       x = "Experience (Years)",
       y = "Count")

Interpretation

The histogram shows that experience is fairly evenly distributed across the dataset, with a slight concentration around mid-level experience. This indicates a balanced dataset representing both junior and senior professionals.

Analysis 12: Distribution of Salary

ggplot(data_final, aes(x = salary)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "darkred", color = "black") +
  geom_density(color = "blue", linewidth = 1) +
  labs(title = "Distribution of Salary",
       x = "Salary",
       y = "Density")

Interpretation

The salary distribution is slightly right-skewed, indicating that most individuals earn mid-range salaries, while a smaller number earn significantly higher salaries.

Analysis 13: Job Title Distribution

ggplot(data_final, aes(x = job_title)) +
  geom_bar(fill = "steelblue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Job Title Distribution",
       x = "Job Title",
       y = "Count")

Interpretation

The distribution of job titles is relatively uniform, indicating that the dataset represents various job roles evenly without significant bias toward any particular profession.

Analysis 14: Education Level Distribution (Pie Chart)

edu_counts <- table(data_final$education_level)

pie(edu_counts,
    main = "Education Level Distribution",
    col = rainbow(length(edu_counts)))

Interpretation

The pie chart shows that education levels are fairly evenly distributed among individuals, with no single category dominating significantly.

Analysis 15: Company Size Distribution

ggplot(data_final, aes(x = company_size)) +
  geom_bar(fill = "orange") +
  labs(title = "Company Size Distribution",
       x = "Company Size",
       y = "Count")

Interpretation

The distribution of company sizes is relatively uniform, with a slightly higher representation of medium-sized companies. This indicates that the dataset includes a balanced mix of organizations, allowing for fair comparison across different company sizes.

Multivariate Analysis

Multivariate analysis examines the relationship between more than two variables simultaneously.

In this section, colored scatter plots and pair plots are used. Colored scatter plots help visualize how an additional categorical variable (such as education level) influences the relationship between two numerical variables.

Pair plots display multiple relationships between numerical variables in a single view, including scatter plots and correlation values.

This approach helps in identifying combined effects, such as how education and experience together influence salary, and provides a deeper understanding of interactions between variables.
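
Whether two variables merely add up or actually interact can be checked by comparing an additive model with one that includes an interaction term. The sketch below uses simulated data in which education only shifts salary by a fixed amount; all names and numbers are assumptions for illustration.

```r
set.seed(42)
n <- 300

# Simulated data: education shifts salary, the experience slope is shared
toy <- data.frame(
  experience_years = runif(n, 0, 20),
  education_level  = factor(sample(c("Bachelor", "Master", "PhD"), n, replace = TRUE))
)
shift <- c(Bachelor = 0, Master = 15000, PhD = 30000)
toy$salary <- 100000 + 2500 * toy$experience_years +
  unname(shift[as.character(toy$education_level)]) + rnorm(n, sd = 20000)

# Additive model: parallel lines, one per education level
additive_fit <- lm(salary ~ experience_years + education_level, data = toy)

# Interaction model: each education level gets its own slope
interaction_fit <- lm(salary ~ experience_years * education_level, data = toy)

# F-test of whether the separate slopes improve the fit
cmp <- anova(additive_fit, interaction_fit)
print(cmp)
```

A small p-value in the comparison would suggest that the effect of experience differs by education level; a large one is consistent with parallel lines.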

Analysis 16: Experience vs Salary by Education Level

ggplot(data_final, aes(x = experience_years, y = salary, color = education_level)) +
  geom_point() +
  labs(title = "Experience vs Salary by Education Level",
       x = "Experience",
       y = "Salary") +
  theme_minimal()

Interpretation

The plot shows that salary increases with experience across all education levels. However, individuals with higher education (Master’s and PhD) consistently earn more than those with lower education at the same experience level, indicating a combined effect of education and experience on salary.

Key Insight: Education adds to the effect of experience on salary.

Analysis 17: Pair Plot of Numeric Variables

ggpairs(data_final[, c("experience_years", "skills_count", "certifications", "salary")])

Interpretation

The pair plot reveals that experience has the strongest positive correlation with salary, while skills and certifications show weak relationships. Additionally, there is little to no correlation between experience and skills or certifications.

Comparative Analysis

Comparative analysis is used to compare different groups within the dataset to identify patterns and differences.

Bar charts with grouped categories are used to compare variables such as work mode across different locations. This allows visual comparison of how categories behave across different groups.

By analyzing differences in bar heights, we can determine whether certain categories dominate or whether the distribution is uniform across groups.

This method helps in identifying whether location or other grouping variables influence patterns such as remote work preferences.
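
Differences in bar heights can be backed by a contingency table and a chi-squared test of independence. The sketch below runs on simulated data; the category labels are hypothetical placeholders, not the values in the dataset.

```r
set.seed(7)

# Simulated categorical data (labels are placeholders)
toy <- data.frame(
  location    = sample(c("USA", "India", "Germany"), 600, replace = TRUE),
  remote_work = sample(c("Yes", "No", "Hybrid"), 600, replace = TRUE)
)

# Counts per location and work mode, then row-wise proportions
counts <- table(toy$location, toy$remote_work)
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 2))

# Chi-squared test: are work mode and location independent?
test <- chisq.test(counts)
print(test$p.value)
```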

Analysis 18: Remote Work Distribution by Location

ggplot(data_final, aes(x = location, fill = remote_work)) +
  geom_bar(position = "dodge") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Remote Work Distribution by Location",
       x = "Location",
       y = "Count",
       fill = "Work Mode")

Interpretation

The distribution of work modes across different locations appears relatively uniform. This suggests that remote, hybrid, and on-site work patterns are fairly consistent across countries, with no significant regional preference.

Analysis 19: Education Level Distribution

ggplot(data_final, aes(x = education_level)) +
  geom_bar(fill = "coral") +
  labs(title = "Education Level Distribution",
       x = "Education Level",
       y = "Count")

Interpretation

The distribution of education levels shows that Bachelor’s degree holders are the most common, while PhD holders are the least common. However, the differences are not very large, indicating a relatively balanced representation of education levels in the dataset.

Cumulative Distribution Analysis (CDF)

Cumulative Distribution Function (CDF) analysis is used to understand the cumulative proportion of data points below a certain value.

In this analysis, the CDF curve is plotted to show how values accumulate across the dataset. The y-axis represents the cumulative probability, while the x-axis represents the variable (salary).

This allows us to determine key insights such as:

- Median value (where cumulative probability is 0.5)
- Percentage of observations below a certain threshold

CDF provides a clearer understanding of data distribution compared to histograms, especially for percentile-based interpretation.
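
The object returned by `ecdf()` is itself a function, so both kinds of reading can be done numerically. The sketch below uses a toy vector of hypothetical salaries.

```r
# Toy salaries (hypothetical values)
toy_salary <- c(90000, 110000, 125000, 140000, 145000,
                150000, 160000, 175000, 200000, 260000)

salary_cdf <- ecdf(toy_salary)

# Proportion of observations at or below a threshold
share_below <- salary_cdf(150000)
print(share_below)

# Percentiles, including the median (cumulative probability 0.5)
print(quantile(toy_salary, c(0.25, 0.5, 0.9)))
print(median(toy_salary))
```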

Analysis 20: CDF of Salary

plot(ecdf(data_final$salary),
     main = "CDF of Salary",
     xlab = "Salary",
     ylab = "Cumulative Probability",
     col = "blue")

Interpretation

The CDF crosses a cumulative probability of 0.5 at roughly 140,000 to 150,000, so about half of the individuals earn less than this amount. The steep slope in this region indicates a high concentration of salaries around the median.

ANOVA (Analysis of Variance)

Analysis of Variance (ANOVA) is a statistical method used to determine whether there are significant differences between the means of three or more groups. Unlike a t-test, which compares two groups, ANOVA evaluates multiple groups simultaneously.

ANOVA works by comparing the variance between group means to the variance within the groups. If the variation between groups is significantly larger than the variation within groups, it suggests that at least one group mean is different from the others.
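
The between-group and within-group comparison can be reproduced by hand. The sketch below computes the F statistic from the two sums of squares on a small toy data set (hypothetical values) and checks it against `aov()`.

```r
# Toy data: three groups of four observations (hypothetical values)
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 4)),
  y     = c(10, 12, 11, 13,  14, 15, 13, 16,  20, 19, 21, 22)
)

grand_mean  <- mean(toy$y)
group_means <- tapply(toy$y, toy$group, mean)
group_n     <- tapply(toy$y, toy$group, length)

# Variation of the group means around the grand mean
ss_between <- sum(group_n * (group_means - grand_mean)^2)
# Variation of the observations around their own group mean
ss_within  <- sum((toy$y - ave(toy$y, toy$group))^2)

df_between <- nlevels(toy$group) - 1
df_within  <- nrow(toy) - nlevels(toy$group)

f_manual <- (ss_between / df_between) / (ss_within / df_within)
f_aov    <- summary(aov(y ~ group, data = toy))[[1]][["F value"]][1]

print(c(manual = f_manual, aov = f_aov))
```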

In this analysis, ANOVA is used to examine whether salary differs significantly across different education levels.

Hypotheses

The following hypotheses are tested:

Null Hypothesis (H₀): Mean salary is the same across all education levels.

Alternative Hypothesis (H₁): At least one education level has a different mean salary.

Analysis 21: Education Level vs Salary (ANOVA)

anova_model <- aov(salary ~ education_level, data = data_final)
summary(anova_model)
##                   Df    Sum Sq   Mean Sq F value Pr(>F)    
## education_level    4 8.027e+11 2.007e+11   158.9 <2e-16 ***
## Residuals       5941 7.501e+12 1.263e+09                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
boxplot(salary ~ education_level, data = data_final,
        col = "lightblue",
        main = "Salary by Education Level")

Interpretation

The ANOVA results show a highly significant effect of education level on salary (F = 158.9, p < 0.001). Since the p-value is much smaller than the significance level of 0.05, the null hypothesis is rejected: mean salary differs for at least one education level. The test alone does not show which levels differ from each other; the boxplot suggests the ordering, and a post-hoc comparison would be needed to confirm it. Education level therefore plays a statistically significant role in determining salary.
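
One common way to see which groups differ after a significant ANOVA is Tukey's HSD, available in base R as `TukeyHSD()`. The sketch below applies it to a toy data set with hypothetical salaries; it is not output from the project's data.

```r
# Toy data: four salaries per education level (hypothetical values)
toy <- data.frame(
  education_level = factor(rep(c("Bachelor", "Master", "PhD"), each = 4)),
  salary = c(110, 120, 125, 135,
             130, 140, 150, 160,
             150, 165, 175, 190) * 1000
)

fit <- aov(salary ~ education_level, data = toy)

# Pairwise differences in means with family-wise adjusted intervals and p-values
tukey <- TukeyHSD(fit, "education_level")
print(tukey$education_level)
```

Each row is one pair of levels: `diff` is the difference in mean salary, `lwr` and `upr` bound its confidence interval, and `p adj` is the adjusted p-value.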

Analysis 22: Company Size vs Salary (ANOVA)

The following hypotheses are tested:

Null Hypothesis (H₀): There is no significant difference in salary across different company sizes.

Alternative Hypothesis (H₁): At least one company size has a significantly different mean salary.

anova_model2 <- aov(salary ~ company_size, data = data_final)
summary(anova_model2)
##                Df    Sum Sq   Mean Sq F value Pr(>F)    
## company_size    4 1.437e+12 3.592e+11   310.8 <2e-16 ***
## Residuals    5941 6.867e+12 1.156e+09                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
boxplot(salary ~ company_size, data = data_final,
        col = "lightgreen",
        main = "Salary by Company Size",
        xlab = "Company Size",
        ylab = "Salary")

Interpretation

The ANOVA results show a highly significant effect of company size on salary (F = 310.8, p < 0.001). Since the p-value is much smaller than the significance level of 0.05, the null hypothesis is rejected: mean salary differs for at least one company size. As with education level, the test alone does not show which company sizes differ from each other. Company size therefore plays a statistically significant role in determining salary.
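
With almost 6,000 observations, even modest differences produce tiny p-values, so an effect size is a useful complement. The sketch below computes eta-squared (the share of total variation attributable to the grouping factor) from the sums of squares printed in the two ANOVA tables above; the variable names are introduced here for illustration.

```r
# Sums of squares copied from the ANOVA output above
ss_education <- 8.027e11; ss_resid_education <- 7.501e12
ss_company   <- 1.437e12; ss_resid_company   <- 6.867e12

# Eta-squared = SS_effect / (SS_effect + SS_residual)
eta_sq_education <- ss_education / (ss_education + ss_resid_education)
eta_sq_company   <- ss_company / (ss_company + ss_resid_company)

print(round(c(education_level = eta_sq_education,
              company_size    = eta_sq_company), 3))
```

By this measure education level accounts for roughly 10% of the variation in salary and company size for roughly 17%.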

Correlation Analysis

Correlation analysis is used to measure the strength and direction of the linear relationship between two continuous variables. The correlation coefficient ranges from -1 to +1, where values close to +1 indicate a strong positive relationship, values close to -1 indicate a strong negative relationship, and values near 0 indicate little or no linear relationship.

Hypothesis

The following hypotheses are tested:

Null Hypothesis (H₀): There is no linear relationship between years of experience and salary.

Alternative Hypothesis (H₁): There is a significant linear relationship between years of experience and salary.

In this analysis, correlation is used to examine how experience (experience_years) is related to salary.
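
`cor()` returns only the coefficient. To test the hypotheses formally, base R provides `cor.test()`, which also reports a t-statistic, p-value, and confidence interval. The sketch below uses simulated data (the generating values are assumptions) and verifies the t-statistic by hand.

```r
set.seed(123)

# Simulated experience and salary (hypothetical generating values)
experience <- runif(200, 0, 20)
salary     <- 100000 + 2500 * experience + rnorm(200, sd = 30000)

result <- cor.test(experience, salary, method = "pearson")
print(result$estimate)   # correlation coefficient r
print(result$p.value)    # p-value for H0: true correlation is 0
print(result$conf.int)   # 95% confidence interval for r

# The test statistic follows from r and n: t = r * sqrt(n - 2) / sqrt(1 - r^2)
r <- unname(result$estimate)
n <- length(experience)
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)
print(t_manual)
```

On the real data the call would be `cor.test(data_final$experience_years, data_final$salary)`.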

Analysis 23: Correlation between Experience and Salary

cor(data_final$experience_years, data_final$salary)
## [1] 0.4351317
plot(data_final$experience_years, data_final$salary,
     col = "blue",
     main = "Experience vs Salary",
     xlab = "Years of Experience",
     ylab = "Salary")

Interpretation

The correlation coefficient between experience and salary is 0.435, indicating a moderate positive relationship: as years of experience increase, salary tends to increase as well, although the relationship is not very strong. Note that cor() reports only the coefficient, not a p-value. With 5,946 observations, a coefficient of this size corresponds to a t-statistic of about 37, so the null hypothesis is rejected and experience has a significant positive association with salary.

Analysis 24: Correlation between Skills Count and Salary

The following hypotheses are tested:

Null Hypothesis (H₀): There is no linear relationship between skills count and salary.

Alternative Hypothesis (H₁): There is a significant linear relationship between skills count and salary.

cor(data_final$skills_count, data_final$salary)
## [1] 0.1046777
plot(data_final$skills_count, data_final$salary,
     col = "darkgreen",
     main = "Skills Count vs Salary",
     xlab = "Skills Count",
     ylab = "Salary")

Interpretation

The correlation coefficient between skills count and salary is 0.105, indicating a very weak positive relationship: an increase in the number of skills is associated with only a slight increase in salary. In a sample of this size the relationship is still statistically detectable (the regression in Analysis 26 gives p < 0.001), so the null hypothesis is rejected, but the association is too weak to be practically meaningful.

Analysis 25: Correlation between Certifications and Salary

The following hypotheses are tested:

Null Hypothesis (H₀): There is no linear relationship between certifications and salary.

Alternative Hypothesis (H₁): There is a significant linear relationship between certifications and salary.

cor(data_final$certifications, data_final$salary)
## [1] 0.08287942
plot(data_final$certifications, data_final$salary,
     col = "purple",
     main = "Certifications vs Salary",
     xlab = "Number of Certifications",
     ylab = "Salary")

Interpretation

The correlation coefficient between certifications and salary is 0.083, indicating a very weak positive relationship: an increase in the number of certifications is associated with only a slight increase in salary. The relationship is statistically detectable in a sample of this size (the regression in Analysis 27 gives p < 0.001), so the null hypothesis is rejected, but certifications alone do not have a strong linear relationship with salary.

Single Linear Regression

Single (also called simple) linear regression models the relationship between a dependent variable and one independent variable by fitting a straight line to the observed data. It helps in understanding how changes in the independent variable are associated with changes in the dependent variable and can be used for prediction.

The regression model is expressed as:

y = β₀ + β₁x + ε

where:

- y: the dependent variable (the quantity being predicted)
- x: the independent variable (the factor used to make the prediction)
- β₀ (intercept): the value of y when x is zero (where the line crosses the vertical axis)
- β₁ (slope): how much y changes for every one-unit increase in x
- ε (error): the “noise”, or distance between the actual data points and the fitted line
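
For a single predictor, the least-squares slope and intercept have closed forms: the slope is cov(x, y) / var(x) and the intercept is mean(y) minus the slope times mean(x). The sketch below checks this against `lm()` on a toy data set with hypothetical values.

```r
# Toy data (hypothetical values)
skills_count <- c(2, 4, 6, 8, 10, 12)
salary       <- c(139000, 142500, 141000, 146000, 144500, 149000)

# Closed-form least-squares estimates
slope     <- cov(skills_count, salary) / var(skills_count)
intercept <- mean(salary) - slope * mean(skills_count)

# The same estimates from lm()
fit <- lm(salary ~ skills_count)

print(c(intercept = intercept, slope = slope))
print(coef(fit))
```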

Hypothesis

For this analysis, the model is:

Salary = β₀ + β₁ (Skills Count) + ε

The following hypotheses are tested:

Null Hypothesis (H₀): The coefficient of skills count (β₁) is equal to zero, meaning skills count has no linear effect on salary.

Alternative Hypothesis (H₁): The coefficient of skills count (β₁) is not equal to zero, meaning skills count has a significant linear effect on salary.

In this analysis, single linear regression is used to predict salary based on the number of skills.

Analysis 26: Single Linear Regression - Skills Count vs Salary

reg_model <- lm(salary ~ skills_count, data = data_final)

summary(reg_model)
## 
## Call:
## lm(formula = salary ~ skills_count, data = data_final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -103772  -26086   -2039   24096  154145 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  138550.66    1006.81 137.613  < 2e-16 ***
## skills_count    721.89      88.96   8.115 5.86e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37170 on 5944 degrees of freedom
## Multiple R-squared:  0.01096,    Adjusted R-squared:  0.01079 
## F-statistic: 65.85 on 1 and 5944 DF,  p-value: 5.856e-16
plot(data_final$skills_count, data_final$salary,
     col = "darkgreen",
     main = "Skills Count vs Salary",
     xlab = "Skills Count",
     ylab = "Salary")

abline(reg_model, col = "red", lwd = 2)

Interpretation

The regression results show that skills count has a statistically significant association with salary (β₁ = 721.89, p < 0.001): each additional skill is associated with an average salary increase of approximately 722 units. However, the R-squared value is 0.011, so skills count explains only about 1.1% of the variation in salary. The relationship is statistically significant but has very little predictive power on its own.

Analysis 27: Single Linear Regression - Certifications vs Salary

reg_model2 <- lm(salary ~ certifications, data = data_final)
summary(reg_model2)
## 
## Call:
## lm(formula = salary ~ certifications, data = data_final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -101732  -26412   -2467   23447  156542 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    141202.5      854.7 165.207  < 2e-16 ***
## certifications   1806.0      281.7   6.412 1.55e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37250 on 5944 degrees of freedom
## Multiple R-squared:  0.006869,   Adjusted R-squared:  0.006702 
## F-statistic: 41.11 on 1 and 5944 DF,  p-value: 1.548e-10
plot(data_final$certifications, data_final$salary,
     col = "purple",
     main = "Certifications vs Salary",
     xlab = "Number of Certifications",
     ylab = "Salary")

abline(reg_model2, col = "red", lwd = 2)

Interpretation

The regression results show that certifications have a statistically significant association with salary (β₁ = 1806.0, p < 0.001): each additional certification is associated with an average salary increase of approximately 1806 units. However, the R-squared value is 0.0069, so certifications explain only about 0.7% of the variation in salary. The relationship is statistically significant, but its explanatory power is very limited.

Multiple Linear Regression

Multiple linear regression is used to model the relationship between a dependent variable and two or more independent variables. It allows us to understand how multiple factors simultaneously influence the dependent variable and improves predictive accuracy compared to single-variable models.

The regression model is expressed as:

y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + ε

where β₀ is the intercept; β₁, β₂, and β₃ represent the effect of each independent variable (x₁, x₂, x₃) on the dependent variable y (the one we want to predict), holding the other variables constant; and ε is the error term.

Hypothesis

The following hypotheses are tested:

Null Hypothesis (H₀): All coefficients (β₁, β₂, β₃) are equal to zero, meaning none of the independent variables significantly affect salary.

Alternative Hypothesis (H₁): At least one coefficient is not equal to zero, meaning at least one independent variable has a significant effect on salary.

In this analysis, multiple linear regression is used to examine how experience, skills count, and certifications jointly influence salary.

Analysis 28: Multiple Linear Regression - Salary vs Experience, Skills, and Certifications

multi_model <- lm(salary ~ experience_years + skills_count + certifications, data = data_final)
summary(multi_model)
## 
## Call:
## lm(formula = salary ~ experience_years + skills_count + certifications, 
##     data = data_final)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -97219 -22305  -1547  20184 125447 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      107104.76    1303.28  82.181  < 2e-16 ***
## experience_years   2667.40      70.42  37.876  < 2e-16 ***
## skills_count        753.04      79.63   9.457  < 2e-16 ***
## certifications     1789.19     251.59   7.112 1.28e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33260 on 5942 degrees of freedom
## Multiple R-squared:  0.2085, Adjusted R-squared:  0.2081 
## F-statistic: 521.7 on 3 and 5942 DF,  p-value: < 2.2e-16
plot(fitted(multi_model), data_final$salary,
     col = "darkblue",
     main = "Fitted vs Actual Salary",
     xlab = "Predicted Salary",
     ylab = "Actual Salary")

abline(0, 1, col = "red", lwd = 2)

Interpretation

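The multiple regression results show that all three variables (experience, skills count, and certifications) have a statistically significant effect on salary (p < 0.001 for all predictors). Each coefficient is the expected change in salary per one-unit increase in that predictor with the others held constant: about 2,667 per year of experience (β₁ = 2667.40), 753 per additional skill (β₂ = 753.04), and 1,789 per additional certification (β₃ = 1789.19). Because the predictors are measured in different units, these magnitudes are not directly comparable; judged by t-statistics, experience is by far the strongest predictor (t = 37.9), followed by skills count (t = 9.5) and certifications (t = 7.1).

The R-squared value of 0.2085 indicates that approximately 20.85% of the variation in salary is explained by the combined model.

Overall, the model demonstrates that experience is the most important of the three predictors, while skills and certifications add smaller but statistically significant contributions. Roughly 79% of the variation in salary remains unexplained by these numeric variables.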
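
Predictors measured in different units can be put on a common footing with standardized coefficients, obtained by refitting the model after scaling every variable to mean 0 and standard deviation 1. The sketch below uses simulated data whose generating coefficients are hypothetical and only loosely resemble the estimates above.

```r
set.seed(10)
n <- 500

# Simulated predictors and salary (hypothetical generating values)
toy <- data.frame(
  experience_years = runif(n, 0, 20),
  skills_count     = sample(1:20, n, replace = TRUE),
  certifications   = sample(0:5, n, replace = TRUE)
)
toy$salary <- 100000 + 2500 * toy$experience_years + 750 * toy$skills_count +
  1800 * toy$certifications + rnorm(n, sd = 30000)

# Raw model: coefficients are in salary units per unit of each predictor
raw_fit <- lm(salary ~ ., data = toy)

# Standardized model: coefficients are in standard deviations of salary
# per standard deviation of each predictor, so they are comparable
std_fit <- lm(salary ~ ., data = as.data.frame(scale(toy)))

print(round(coef(raw_fit)[-1], 1))
print(round(coef(std_fit)[-1], 3))
```

Each standardized coefficient equals the raw coefficient multiplied by sd(x) / sd(y), which is why a predictor with a large raw coefficient but a narrow range can still rank low.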

Polynomial Regression

Polynomial regression is an extension of linear regression that models the relationship between the independent variable and dependent variable as an nth-degree polynomial rather than a straight line. It is used when the relationship between variables is non-linear.

For a second-degree (quadratic) polynomial:

y = β₀ + β₁x + β₂x² + ε

where:

- y: dependent variable
- x: independent variable
- β₀: intercept
- β₁, β₂: coefficients of the linear and squared terms
- ε: error term

In this analysis, the regression model is expressed as:

Salary = β₀ + β₁ (Experience) + β₂ (Experience²) + ε

where β₂ captures the curvature in the relationship, allowing the model to represent non-linear patterns.

Hypothesis

The following hypotheses are tested:

Null Hypothesis (H₀): The coefficient of the squared term (β₂) is equal to zero; that is, there is no non-linear (quadratic) relationship between experience and salary.

Alternative Hypothesis (H₁): The coefficient of the squared term (β₂) is not equal to zero; that is, there is a significant non-linear relationship.

In this analysis, polynomial regression is used to examine whether salary increases with experience in a non-linear manner.

Analysis 29: Salary vs experience (analyzing the non-linear relationship)

poly_model <- lm(salary ~ experience_years + I(experience_years^2), data = data_final)
summary(poly_model)
## 
## Call:
## lm(formula = salary ~ experience_years + I(experience_years^2), 
##     data = data_final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -101021  -22371   -1476   21038  132522 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.194e+05  1.184e+03 100.832   <2e-16 ***
## experience_years      2.604e+03  2.765e+02   9.415   <2e-16 ***
## I(experience_years^2) 2.554e+00  1.333e+01   0.192    0.848    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33660 on 5943 degrees of freedom
## Multiple R-squared:  0.1893, Adjusted R-squared:  0.1891 
## F-statistic: 694.1 on 2 and 5943 DF,  p-value: < 2.2e-16
plot(data_final$experience_years, data_final$salary,
     col = "darkorange",
     main = "Polynomial Regression: Experience vs Salary",
     xlab = "Experience (Years)",
     ylab = "Salary")

exp_seq <- seq(min(data_final$experience_years),
               max(data_final$experience_years),
               length.out = 100)

pred <- predict(poly_model,
                newdata = data.frame(experience_years = exp_seq))

lines(exp_seq, pred, col = "blue", lwd = 2)

Interpretation

The polynomial regression results show that the linear term for experience is statistically significant (β₁ ≈ 2604, p < 0.001), indicating that salary increases with experience. However, the squared term (β₂ ≈ 2.55) is not statistically significant (p = 0.848), suggesting that there is no strong evidence of a non-linear relationship between experience and salary.
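To see how little curvature the fitted quadratic implies, the slope of the curve can be evaluated at different experience levels. For a quadratic model the slope is β₁ + 2β₂ · Experience. The sketch below uses the rounded coefficients from the summary output above. The experience levels of 0, 10, and 20 years are illustrative choices, not values taken from the dataset.

```r
# Coefficients as reported (rounded) in the poly_model summary output
b1 <- 2604    # experience_years
b2 <- 2.554   # I(experience_years^2)

# Slope of the fitted quadratic: d(Salary)/d(Experience) = b1 + 2 * b2 * Experience
years <- c(0, 10, 20)   # illustrative experience levels
marginal_effect <- b1 + 2 * b2 * years

data.frame(years, marginal_effect)
```

The estimated gain per additional year moves only from 2604 at 0 years to about 2706 at 20 years, which is consistent with a relationship that is close to a straight line.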

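The R-squared value of 0.189 indicates that the model explains approximately 18.9% of the variation in salary, slightly less than the 20.85% explained by the multiple regression model. That comparison reflects the different predictors in the two models rather than the effect of the squared term. The evidence that the polynomial term does not improve the fit is the non-significant β₂: adding Experience² to a model that already contains Experience explains almost no additional variation.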

Overall, the results indicate that the relationship between experience and salary is primarily linear, and a simple linear regression model is sufficient.

Train-Test Split

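# Fix the random seed so the split is reproducible
set.seed(123)

# Randomly assign 80% of the rows to training; the remaining rows form the test set
train_index <- sample(seq_len(nrow(data_final)), size = floor(0.8 * nrow(data_final)))

train_data <- data_final[train_index, ]
test_data  <- data_final[-train_index, ]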

nrow(train_data)
## [1] 4756
nrow(test_data)
## [1] 1190
train_model <- lm(salary ~ experience_years + skills_count + certifications, data = train_data)
summary(train_model)
## 
## Call:
## lm(formula = salary ~ experience_years + skills_count + certifications, 
##     data = train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -95378 -22234  -1375  20195 125149 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      106311.79    1452.41  73.197  < 2e-16 ***
## experience_years   2731.84      78.98  34.588  < 2e-16 ***
## skills_count        745.61      89.22   8.357  < 2e-16 ***
## certifications     1897.55     281.15   6.749 1.66e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33300 on 4752 degrees of freedom
## Multiple R-squared:  0.2162, Adjusted R-squared:  0.2157 
## F-statistic:   437 on 3 and 4752 DF,  p-value: < 2.2e-16
predictions <- predict(train_model, newdata = test_data)

rmse <- sqrt(mean((test_data$salary - predictions)^2))
rmse
## [1] 33124.14

Interpretation

The model was evaluated using a train-test split approach, where 80% of the data was used for training and 20% for testing. The Root Mean Square Error (RMSE) obtained on the test dataset is 33124.14.

This value is the typical size of the prediction error, in the same units as salary. Relative to the mean salary of 145723.5, it corresponds to an error of roughly 23%. It is also only about 11% lower than the salary standard deviation of 37373.77 in the full dataset, so the model improves on simply predicting the average salary for everyone, but by a modest margin.

Overall, the results suggest that the multiple regression model, which includes experience, skills count, and certifications, captures a statistically significant but limited share of the variation in salary. Most of the variation remains unexplained, indicating that additional factors influence salary beyond the variables considered in this analysis.
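To judge whether an RMSE is large or small, it helps to compare it with a naive baseline that predicts the training-set mean for every test row. A minimal sketch of that comparison follows. The helper functions `rmse` and `mae` and the simulated data frame `sim` are assumptions introduced for illustration and are not part of the original analysis.

```r
# Helper functions (hypothetical names, defined here for illustration)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

# Simulated stand-in data; not the project dataset
set.seed(123)
n <- 1000
sim <- data.frame(experience_years = sample(0:20, n, replace = TRUE))
sim$salary <- 110000 + 2700 * sim$experience_years + rnorm(n, sd = 33000)

# 80/20 train-test split
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

fit  <- lm(salary ~ experience_years, data = train)
pred <- predict(fit, newdata = test)

# Baseline: predict the training mean for every test row
baseline <- rep(mean(train$salary), nrow(test))

c(model_rmse    = rmse(test$salary, pred),
  model_mae     = mae(test$salary, pred),
  baseline_rmse = rmse(test$salary, baseline))
```

The same comparison could be run on train_data and test_data. The gap between the model RMSE and the baseline RMSE shows how much the predictors actually contribute.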

Applying the Model to New Data

new_candidates <- data.frame(
  experience_years = c(2, 5, 10),
  skills_count = c(5, 8, 12),
  certifications = c(1, 3, 5)
)

new_predictions <- predict(train_model, newdata = new_candidates)
new_predictions
##        1        2        3 
## 117401.1 131628.5 152065.3

Interpretation

The trained multiple regression model was used to predict salaries for new candidate profiles based on their experience, skills, and certifications.

The predicted salaries are as follows:

Candidate 1: 117401.1
Candidate 2: 131628.5
Candidate 3: 152065.3

The results show that salary increases with higher experience, skills, and certifications. Among the three candidates, the individual with the highest experience, skills, and certifications receives the highest predicted salary.
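Each prediction is the fitted equation evaluated at the candidate's values. As a check, the sketch below reproduces the three predictions by hand from the coefficients reported in the train_model summary. The coefficients are rounded to two decimals, so small rounding differences from predict() are possible.

```r
# Coefficients as reported in the train_model summary output
coefs <- c(intercept        = 106311.79,
           experience_years = 2731.84,
           skills_count     = 745.61,
           certifications   = 1897.55)

# Same candidate profiles as in new_candidates
experience_years <- c(2, 5, 10)
skills_count     <- c(5, 8, 12)
certifications   <- c(1, 3, 5)

# Salary = b0 + b1 * Experience + b2 * Skills + b3 * Certifications
manual <- coefs[["intercept"]] +
  coefs[["experience_years"]] * experience_years +
  coefs[["skills_count"]]     * skills_count +
  coefs[["certifications"]]   * certifications

round(manual, 1)
```

For Candidate 1, for example, this is 106311.79 + 2731.84 × 2 + 745.61 × 5 + 1897.55 × 1 ≈ 117401.1.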

This demonstrates how the model can be applied to estimate compensation from a candidate's qualifications. Because the test RMSE is about 33,000, each prediction should be treated as a rough estimate rather than a precise figure when evaluating job candidates or setting compensation.
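One way to express the uncertainty around individual predictions is a prediction interval, which predict() returns for lm models when interval = "prediction" is set. The sketch below is illustrative only: it fits a model on the simulated data frame `sim` (an assumption, not the project dataset) and uses a hypothetical `candidates` data frame that mirrors the profiles above.

```r
# Simulated stand-in data; not the project dataset
set.seed(7)
n <- 800
sim <- data.frame(
  experience_years = sample(0:20, n, replace = TRUE),
  skills_count     = sample(1:19, n, replace = TRUE),
  certifications   = sample(0:5,  n, replace = TRUE)
)
sim$salary <- 106000 + 2700 * sim$experience_years +
  750 * sim$skills_count + 1900 * sim$certifications +
  rnorm(n, sd = 33000)

fit <- lm(salary ~ experience_years + skills_count + certifications,
          data = sim)

# Hypothetical candidate profiles
candidates <- data.frame(
  experience_years = c(2, 5, 10),
  skills_count     = c(5, 8, 12),
  certifications   = c(1, 3, 5)
)

# interval = "prediction" returns a matrix with columns fit, lwr, upr:
# the point prediction and the bounds for one individual's salary
pred_int <- predict(fit, newdata = candidates,
                    interval = "prediction", level = 0.95)
round(pred_int)
```

With residual variation of this size, the intervals are wide compared with the differences between candidates.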

Statistical Computation

Statistical analysis is used to summarize and describe the dataset using numerical measures.

In this section, key statistical metrics such as mean, median, variance, standard deviation, quartiles, and interquartile range (IQR) are calculated for salary.

The analysis is performed as follows:

- Mean and median are used to measure central tendency
- Variance and standard deviation measure the spread of data
- Quartiles and IQR describe the distribution and range of the middle 50% of data

By comparing mean and median, we can identify skewness in the data. Measures of spread help in understanding how widely salaries vary across individuals.

This analysis provides a quantitative summary that supports the insights obtained from visualizations.

# Basic statistics
min(data_final$salary)
## [1] 41276
max(data_final$salary)
## [1] 304968
mean(data_final$salary)
## [1] 145723.5
median(data_final$salary)
## [1] 143722.5
var(data_final$salary)
## [1] 1396798727
sd(data_final$salary)
## [1] 37373.77
# Quartiles & Percentiles
quantile(data_final$salary)
##       0%      25%      50%      75%     100% 
##  41276.0 119171.2 143722.5 169706.8 304968.0
# IQR
IQR(data_final$salary)
## [1] 50535.5

Interpretation

The mean salary (145723.5) is slightly higher than the median (143722.5), indicating a mildly right-skewed distribution. This suggests the presence of some high salary values pulling the average upward.

The standard deviation of 37373.77 is about 26% of the mean, indicating moderate variability and noticeable differences in earnings among individuals.

The range of salaries is wide, from 41276 to 304968, highlighting the gap between lower and higher earners.

The interquartile range of 50535.5 shows that the middle 50% of salaries lie between 119171.2 and 169706.8, so most individuals earn within roughly 25,000 of the median salary.
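The statements above can be checked with a few derived quantities computed from the reported summary values. The sketch below uses only numbers printed in the output above. Pearson's second skewness coefficient, 3 × (mean − median) / sd, is used here as one simple indicator of skewness and is an addition to the original analysis.

```r
# Summary values as reported in the output above
salary_mean   <- 145723.5
salary_median <- 143722.5
salary_sd     <- 37373.77
q1            <- 119171.2
q3            <- 169706.8
salary_min    <- 41276
salary_max    <- 304968

derived <- c(
  range_width      = salary_max - salary_min,   # max minus min
  iqr              = q3 - q1,                   # matches IQR() up to rounding
  coef_variation   = salary_sd / salary_mean,   # sd relative to the mean
  pearson_skewness = 3 * (salary_mean - salary_median) / salary_sd
)
round(derived, 4)
```

A small positive skewness value is consistent with a mildly right-skewed distribution.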

Conclusion

This project focused on analyzing and understanding the factors influencing salary using a structured data analysis approach. The study began with data preprocessing, including handling missing values, transforming categorical variables, and ensuring data quality. Exploratory data analysis through visualization and statistical measures helped identify patterns and distributions within the dataset. Techniques such as ANOVA and correlation were used to examine relationships between variables, revealing that factors like education level and company size have a significant impact on salary, while variables such as skills and certifications show weaker individual relationships.

Further analysis using regression models quantified how the predictors are associated with salary. Because the data are observational, these associations should not be read as cause-and-effect relationships. Simple regression models showed limited explanatory power, whereas multiple regression performed better by combining experience, skills, and certifications, although it still explains only about 21% of the variation in salary. Polynomial regression indicated that the relationship between experience and salary is primarily linear. The model was evaluated using a train-test split approach, achieving an RMSE of 33124.14 on the test set, a modest improvement over the salary standard deviation of 37373.77. Finally, the model was applied to new data to predict salaries, demonstrating its practical use for rough estimates. Overall, the project highlights that salary is influenced by multiple interacting factors and emphasizes the importance of combining statistical and predictive techniques for effective data analysis.