# Read the data
survey_data <- read.csv("survey-data.csv")SDS 324E Final Project - Predicting Salaries for Software Engineers in the United States
- Preston Cook
- Rome Vu
- Phoebe Wang
Introduction
The data source for our analysis is the 2024 Stack Overflow Developer Survey. The annual survey is well-regarded as a reliable way to gauge the state of the global tech industry. The survey often falls alongside other popular data sources for basic data science, such as the Titanic, or MNIST datasets. The goal of our analysis was to reliably predict annual salaries for full-time software engineers in the United States.
We selected this dataset because one of our team members, Preston, is graduating this semester with aspirations of becoming a software engineer. This analysis serves a dual purpose: fulfilling the requirements of our final project while also providing Preston with valuable insights into the dynamics of the American tech job market as he prepares for his career.
Before extracting relevant predictors, we filtered our dataset using Pandas to only include data points for software engineers in the United States. This process resulted in 2321 records for our analysis.
After performing some data wrangling, we extracted relevant factors and renamed columns to conduct data analysis in an R Studio environment more easily. After this process, we were left with the following predictors in no particular order:
- Age (Categorical)
- Education Level (Categorical)
- Number of Years Coding (Numeric)
- Number of Years Coding Professionally (Numeric)
- Developer Type (Categorical)
- Organization Size (Categorical)
- Work Type (Categorical)
- Years of Work Experience (Numeric)
- Industry (Categorical)
- Job Satisfaction (Numeric)
Here are breakdowns for the number of each data points in each class across our categorical predictors:
# Create a table for age range class distribution
age_counts <- table(survey_data$age)
age_df <- as.data.frame(age_counts)
colnames(age_df) <- c("Age Range", "Count")
knitr::kable(age_df, caption = "Age Range Class Counts")| Age Range | Count |
|---|---|
| 18-24 | 180 |
| 25-34 | 934 |
| 35-44 | 705 |
| 45-54 | 318 |
| 55-64 | 164 |
| 65 years or older | 20 |
# Create a table for education level class distribution
ed_counts <- table(survey_data$ed_level)
ed_df <- as.data.frame(ed_counts)
colnames(ed_df) <- c("Education Level", "Count")
knitr::kable(ed_df, caption = "Education Level Class Counts")| Education Level | Count |
|---|---|
| Associate’s degree | 98 |
| Bachelor’s degree | 1431 |
| Master’s degree | 407 |
| Primary/elementary school | 6 |
| Professional degree | 74 |
| Secondary school | 47 |
| Some college | 258 |
# Create a table for developer type class distribution
dev_type_counts <- table(survey_data$dev_type)
dev_type_df <- as.data.frame(dev_type_counts)
colnames(dev_type_df) <- c("Developer Type", "Count")
knitr::kable(dev_type_df, caption = "Developer Type Class Counts")| Developer Type | Count |
|---|---|
| Cloud infrastructure engineer | 42 |
| Data engineer | 72 |
| Data scientist or machine learning specialist | 43 |
| Developer, back-end | 497 |
| Developer, desktop or enterprise applications | 117 |
| Developer, embedded applications or devices | 71 |
| Developer, front-end | 104 |
| Developer, full-stack | 1079 |
| Developer, mobile | 59 |
| Developer, QA or test | 21 |
| DevOps specialist | 38 |
| Engineer, site reliability | 26 |
| Engineering manager | 82 |
| Research & Development role | 30 |
| Senior Executive (C-Suite, VP, etc.) | 40 |
# Create a table for organization size class distribution
org_size_counts <- table(survey_data$org_size)
org_size_df <- as.data.frame(org_size_counts)
colnames(org_size_df) <- c("Organization Size", "Count")
knitr::kable(org_size_df, caption = "Organization Size Class Counts")| Organization Size | Count |
|---|---|
| 1,000 to 4,999 | 316 |
| 10 to 19 | 152 |
| 10,000 or more | 502 |
| 100 to 499 | 492 |
| 2 to 9 | 157 |
| 20 to 99 | 385 |
| 5,000 to 9,999 | 146 |
| 500 to 999 | 171 |
# Create a table for work type class distribution
work_type_counts <- table(survey_data$work_type)
work_type_df <- as.data.frame(work_type_counts)
colnames(work_type_df) <- c("Work Type", "Count")
knitr::kable(work_type_df, caption = "Work Type Class Counts")| Work Type | Count |
|---|---|
| Hybrid | 763 |
| In-person | 270 |
| Remote | 1288 |
# Create a table for industry type class distribution
industry_counts <- table(survey_data$industry)
industry_df <- as.data.frame(industry_counts)
colnames(industry_df) <- c("Industry Type", "Count")
knitr::kable(industry_df, caption = "Industry Type Class Counts")| Industry Type | Count |
|---|---|
| Banking/Financial Services | 117 |
| Computer Systems Design and Services | 75 |
| Energy | 52 |
| Fintech | 156 |
| Government | 169 |
| Healthcare | 205 |
| Higher Education | 87 |
| Insurance | 62 |
| Internet, Telecomm or Information Services | 129 |
| Manufacturing | 141 |
| Media & Advertising Services | 85 |
| Retail and Consumer Services | 159 |
| Software Development | 786 |
| Transportation, or Supply Chain | 98 |
Formulation of Hypotheses
We would like to test the following hypotheses:
Hypothesis 1
“There is a significant relationship between salary and at least one of the predictors (age, education level, years of coding experience, years of professional experience, development type, organization size, work type, work experience, or industry).”
This is a sort of base hypothesis that mean to test in order to gauge the overall reliability of the survey data for predicting salaries for American software engineers. If our model provides evidence that this hypothesis is true, then we can more confidently approach the remaining two.
Hypothesis 2
“Including education level as a predictor will improve the performance of a model that already includes age, years of coding experience, years of professional experience, development type, organization size, work type, work experience, industry, and job satisfaction.”
Unlike many other career paths, software engineering has historically placed less value on education–more particularly college education. Or, at least, there is a widespread perception that this is the case. By testing this hypothesis, we can determine the amount of truth in this.
Hypothesis 3
“Job satisfaction and years of work experience are valuable predictors of log salary, even when accounting for age, education level, years of coding experience, years of professional experience, development type, organization size, work type, and industry.”
Lastly, we’d like to test the relationship between job satisfaction and salary predictions. Information like this could potentially rationalize choosing a company that initially pays less, but allows one to feel more satisfied in their work.
Exploring Marginal Relationships in the Data
We will begin by performing some basic exploratory analysis on our dataset. This will help us determine the underlying logic of our model and potential issues that we may need to address in the future.
Numeric Predictor Analysis
Let’s start by plotting salary against each of our numeric variables to check for general trends and potential multicollinearity.
# Set up a 2x2 grid for multiple plots
par(mfrow=c(2,2))
# Salary vs Years Coding Scatterplot
plot(survey_data$years_code, survey_data$converted_comp_yearly,
main="Salary vs. Years Coding",
xlab="Years Coding", ylab="Salary", pch=19, col='blue')
# Salary vs Years Coding Professionally Scatterplot
plot(survey_data$years_code_pro, survey_data$converted_comp_yearly,
main="Salary vs. Years Coding Professionally",
xlab="Years Coding Professionally", ylab="Salary", pch=19, col='red')
# Salary vs Years of Work Experience Scatterplot
plot(survey_data$work_exp, survey_data$converted_comp_yearly,
main="Salary vs. Years of Work Experience",
xlab="Years of Work Experience", ylab="Salary", pch=19, col='orange')
# Salary vs Job Satisfaction Scatterplot
plot(survey_data$job_sat, survey_data$converted_comp_yearly,
main="Salary vs. Job Satisfaction",
xlab="Job Satisfaction", ylab="Salary", pch=19, col='green')As one might have predicted, it seems like the distribution of our years_code, years_code_pro, and work_exp variables are very similar. This is something that we should check when developing our model to assure that our assumption of no multicollinearity is not violated. However, there does appear to be visual evidence that job satisfaction is positively related to salary.
As a second part of our numeric analysis, let’s look at the correlation coefficient of each of our numeric predictors with the target variable.
# List of predictors and target variable
predictors <- c("years_code", "years_code_pro", "work_exp", "job_sat")
target <- "converted_comp_yearly"
# Calculate correlation coefficients
correlations <- sapply(predictors, function(predictor) {
cor(survey_data[[predictor]], survey_data[[target]], use="complete.obs")
})
# Create a data frame to display the results
correlation_table <- data.frame(
Correlation_Coefficient = correlations
)
# Display the table
knitr::kable(correlation_table, caption = "Correlation Coefficients of Predictors with Salary",
col.names = c("Predictor", "Correlation Coefficient"), align = c("l", "r"))| Predictor | Correlation Coefficient |
|---|---|
| years_code | 0.2152076 |
| years_code_pro | 0.2097014 |
| work_exp | 0.1764005 |
| job_sat | 0.0690380 |
It seems like years_code has the highest correlation with the target variable, which is higher than all the other numeric predictors. This provides some preliminary evidence that the number of years someone has coding is more important than their number of years of work experience. Regarding job_sat, it seems like the correlation is very low despite the graphical evidence. This could be due to job satisfaction only relating to salary strongly at the extremes of the distribution.
Categorical Predictor Analysis
Now, let’s look at our categorical predictors and their effects on the target variable through some box plots.
Adjusting margins and plotting each boxplot
Unfortunately some of the graphs are a bit squished in markdown because of the long category names, but we can still find some insight in the data. There is visual evidence that the median salary across classes is fairly similar between categorical predictors with some small differences.
# Age
par(mar = c(10, 4, 4, 2))
boxplot(converted_comp_yearly ~ age,
xlab = "", ylab = "",
main = "Age Range Class Counts",
las = 2, data = survey_data)In the age categories, although the median salary across classes is roughly equal, there are more outliers for 25-34-year-olds and 45-54 categories.
# Education Level
par(mar = c(14, 4, 4, 2))
boxplot(converted_comp_yearly ~ ed_level,
xlab = "", ylab = "",
main = "Education Level Class Counts",
las = 2, data = survey_data)For education level, PhDs seem to have a slightly higher median in the data, but there are more positive outliers for those with just a bachelor’s degree.
# Developer Type
par(mar = c(18, 4, 4, 2))
boxplot(converted_comp_yearly ~ dev_type,
xlab = "", ylab = "",
main = "Developer Type Class Counts",
las = 2, data = survey_data)And, although the developer type chart is fairly squished, it seems like backend and full-stack engineers have a higher median salary as well as more positive outliers.
# Organization Size
par(mar = c(10, 4, 4, 2))
boxplot(converted_comp_yearly ~ org_size,
xlab = "", ylab = "",
main = "Organization Size Class Counts",
las = 2, data = survey_data)Regarding organization size, it seems like those working in organizations with 10,000+ employees tend to have higher salaries. This is predictable, as big tech is known for paying its software engineers disproportionately well.
# Work Type
par(mar = c(10, 4, 4, 2))
boxplot(converted_comp_yearly ~ work_type,
xlab = "", ylab = "",
main = "Work Type Class Counts",
las = 2, data = survey_data)Between work type, it seems like they are roughly the same, with those working remotely edging out others by a few thousand dollars.
# Industry Type
par(mar = c(18, 4, 4, 2))
boxplot(converted_comp_yearly ~ industry,
xlab = "", ylab = "",
main = "Industry Type Class Counts",
las = 2, data = survey_data)Lastly, it seems like those working directly in the tech industry itself seem to have the highest median salary as well as the most positive outliers.
Fitting the Model and Assessing Overall Model Significance
Now, let’s finally create our model and assess the significance of the predictors shown in the previous step.
# Fit the multiple linear regression model
full_model <- lm(converted_comp_yearly ~
age + ed_level + years_code + years_code_pro +
dev_type + org_size + work_type + work_exp +
industry + job_sat, data = survey_data)
summary(full_model)
Call:
lm(formula = converted_comp_yearly ~ age + ed_level + years_code +
years_code_pro + dev_type + org_size + work_type + work_exp +
industry + job_sat, data = survey_data)
Residuals:
Min 1Q Median 3Q Max
-258180 -44402 -14003 24877 1652033
Coefficients:
Estimate Std. Error
(Intercept) 99064.8 23578.2
age25-34 36929.6 8494.5
age35-44 34675.1 10132.2
age45-54 4624.7 13472.0
age55-64 -36727.6 17580.7
age65 years or older -83251.3 29673.0
ed_levelBachelor's degree 17496.6 10476.0
ed_levelMaster's degree 36453.4 11412.3
ed_levelPrimary/elementary school 43085.6 41692.3
ed_levelProfessional degree 77206.7 16023.5
ed_levelSecondary school 5576.2 17755.3
ed_levelSome college 10384.4 11900.8
years_code 1374.3 528.5
years_code_pro 2650.6 739.3
dev_typeData engineer -37500.6 19343.2
dev_typeData scientist or machine learning specialist -20373.4 21803.3
dev_typeDeveloper, back-end -8580.5 15932.2
dev_typeDeveloper, desktop or enterprise applications -40393.7 17947.9
dev_typeDeveloper, embedded applications or devices -15267.1 19554.4
dev_typeDeveloper, front-end -36406.1 18217.0
dev_typeDeveloper, full-stack -35251.9 15679.2
dev_typeDeveloper, mobile 4298.9 20025.7
dev_typeDeveloper, QA or test -65769.1 26572.0
dev_typeDevOps specialist -22044.5 22253.1
dev_typeEngineer, site reliability -4172.8 24774.1
dev_typeEngineering manager 17689.6 18868.3
dev_typeResearch & Development role -35462.9 23998.5
dev_typeSenior Executive (C-Suite, VP, etc.) 85514.5 22304.4
org_size10 to 19 -47598.9 9971.5
org_size10,000 or more 20766.9 7189.3
org_size100 to 499 -27594.2 7183.7
org_size2 to 9 -54831.4 10019.4
org_size20 to 99 -38966.8 7620.3
org_size5,000 to 9,999 -7416.4 9965.6
org_size500 to 999 -27493.0 9473.0
work_typeIn-person 10509.2 7220.7
work_typeRemote 7682.4 4780.4
work_exp -495.5 600.3
industryComputer Systems Design and Services 33614.1 14948.7
industryEnergy -26435.9 16642.8
industryFintech 18096.5 12219.6
industryGovernment -41194.2 12042.6
industryHealthcare -20158.9 11615.9
industryHigher Education -56719.3 14172.0
industryInsurance 1514.9 15727.8
industryInternet, Telecomm or Information Services 4242.9 12782.1
industryManufacturing -44991.4 12579.4
industryMedia & Advertising Services 9841.1 14266.2
industryRetail and Consumer Services -2639.1 12130.4
industrySoftware Development 4846.8 10009.7
industryTransportation, or Supply Chain -27245.1 13668.9
job_sat 2283.0 960.2
t value Pr(>|t|)
(Intercept) 4.202 2.75e-05 ***
age25-34 4.347 1.44e-05 ***
age35-44 3.422 0.000632 ***
age45-54 0.343 0.731419
age55-64 -2.089 0.036811 *
age65 years or older -2.806 0.005065 **
ed_levelBachelor's degree 1.670 0.095027 .
ed_levelMaster's degree 3.194 0.001421 **
ed_levelPrimary/elementary school 1.033 0.301518
ed_levelProfessional degree 4.818 1.54e-06 ***
ed_levelSecondary school 0.314 0.753507
ed_levelSome college 0.873 0.382986
years_code 2.600 0.009371 **
years_code_pro 3.585 0.000344 ***
dev_typeData engineer -1.939 0.052663 .
dev_typeData scientist or machine learning specialist -0.934 0.350189
dev_typeDeveloper, back-end -0.539 0.590241
dev_typeDeveloper, desktop or enterprise applications -2.251 0.024506 *
dev_typeDeveloper, embedded applications or devices -0.781 0.435030
dev_typeDeveloper, front-end -1.998 0.045785 *
dev_typeDeveloper, full-stack -2.248 0.024651 *
dev_typeDeveloper, mobile 0.215 0.830043
dev_typeDeveloper, QA or test -2.475 0.013392 *
dev_typeDevOps specialist -0.991 0.321973
dev_typeEngineer, site reliability -0.168 0.866256
dev_typeEngineering manager 0.938 0.348585
dev_typeResearch & Development role -1.478 0.139624
dev_typeSenior Executive (C-Suite, VP, etc.) 3.834 0.000130 ***
org_size10 to 19 -4.773 1.93e-06 ***
org_size10,000 or more 2.889 0.003907 **
org_size100 to 499 -3.841 0.000126 ***
org_size2 to 9 -5.472 4.93e-08 ***
org_size20 to 99 -5.114 3.43e-07 ***
org_size5,000 to 9,999 -0.744 0.456831
org_size500 to 999 -2.902 0.003741 **
work_typeIn-person 1.455 0.145686
work_typeRemote 1.607 0.108181
work_exp -0.825 0.409196
industryComputer Systems Design and Services 2.249 0.024632 *
industryEnergy -1.588 0.112328
industryFintech 1.481 0.138761
industryGovernment -3.421 0.000636 ***
industryHealthcare -1.735 0.082797 .
industryHigher Education -4.002 6.48e-05 ***
industryInsurance 0.096 0.923276
industryInternet, Telecomm or Information Services 0.332 0.739967
industryManufacturing -3.577 0.000355 ***
industryMedia & Advertising Services 0.690 0.490378
industryRetail and Consumer Services -0.218 0.827790
industrySoftware Development 0.484 0.628285
industryTransportation, or Supply Chain -1.993 0.046357 *
job_sat 2.378 0.017509 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 98640 on 2269 degrees of freedom
Multiple R-squared: 0.2358, Adjusted R-squared: 0.2186
F-statistic: 13.73 on 51 and 2269 DF, p-value: < 2.2e-16
Interpreting the intercept and coefficients
Intercept: 102,860.9
The intercept indicates that when all the categorical variables are at their base level or when the numerical variables equal zero, the annual salary is predicted to be $102,860.9.
This is the table of the coefficients for all the age classes. Every coefficient is relative to the base level of 18-24 years old:
| Terms | Estimate |
|---|---|
age25-34 |
37,131.8 |
age35-44 |
35,920.1 |
age45-54 |
5,681.7 |
age55-64 |
-35,070.0 |
age65 years or older |
-81,298.7 |
Interpretation of two coefficients as an example:
age25-34: A person between age 25 to 34 is predicted to earn $37,131.8 more than a person between 18 to 24 years old.
age55-64: A person between age 55 to 64 is predicted to earn $35,070 less than a person between 18 to 24 years old.
Table of the coefficients for all the different ed_level classes. Every coefficient is relative to the base level of receiving an Associate’s degree.
| Terms | Estimates |
|---|---|
ed_levelBachelor's degree |
17,589.3 |
ed_levelMaster's degree |
36,470.9 |
ed_levelPrimary/elementary school |
43,276.0 |
ed_levelProfessional degree |
75,970.2 |
ed_levelSecondary school |
5,316.4 |
ed_levelSome college |
10,018.2 |
Interpretation of two coefficients as an example:
ed_levelBachelor's degree: A person who has a Bachelor’s degree is predicted to earn $17,589.3 more than a person with an Associate’s degree.
ed_levelMaster's degree: A person who has a Master’s degree is predicted to earn $36,470.9 more than a person with an Associate’s degree.
Table of the coefficients for all the different dev_type classes. Every coefficient is relative to the base level of a cloud infrastructure engineer.
| Terms | Estimates |
|---|---|
dev_typeData engineer |
-42,285.5 |
dev_typeData scientist or machine learning specialist |
-18,937.9 |
dev_typeDeveloper, back-end |
-13,050.7 |
dev_typeDeveloper, desktop or enterprise applications |
-44,883.5 |
dev_typeDeveloper, embedded applications or devices |
-20,303.0 |
dev_typeDeveloper, front-end |
-41,236.1 |
dev_typeDeveloper, full-stack |
-39,978.0 |
dev_typeDeveloper, mobile |
-514.0 |
dev_typeDeveloper, QA or test |
-70,829.9 |
dev_typeDevOps specialist |
-26,766.8 |
dev_typeEngineer, site reliability |
-9,051.4 |
dev_typeEngineering manager |
13,109.5 |
dev_typeResearch & Development role |
-40,209.7 |
dev_typeSenior Executive (C-Suite, VP, etc.) |
81,090.6 |
Interpretation of two coefficients as an example:
dev_typeData engineer: A data engineer is predicted to earn $42,285.5 less than a cloud infrastructure engineer.
dev_typeData scientist or machine learning specialist: A data scientist or machine learning specialist is predicted to earn $18,937.9 less than a cloud infrastructure engineer.
Table of the coefficients for all the different org_size classes. Every coefficient is relative to the base level of an organization of 1,000 to 4,999 people.
| Terms | Estimates |
|---|---|
org_size10 to 19 |
-47,557.7 |
org_size10,000 or more |
21,602.5 |
org_size100 to 499 |
-27,499.6 |
org_size2 to 9 |
-54,576.0 |
org_size20 to 99 |
-38,365.0 |
org_size5,000 to 9,999 |
-7,291.4 |
org_size500 to 999 |
-27,632.5 |
Interpretation of two coefficients as an example:
org_size10 to 19: The annual salary of employees in an organization of 10 to 19 people is predicted to be $47,557.7 lower than that of employees in an organization of 1,000 to 4,999 people.
org_size10,000 or more: The annual salary of employees in an organization of 10,000 or more people is predicted to be $21,602.5 higher than that of employees in an organization of 1,000 to 4,999 people.
Table of the coefficients for all the different work_type classes. Every coefficient is relative to the base level of working hybrid.
| Term | Estimate |
|---|---|
work_typeIn-person |
10,517.1 |
work_typeRemote |
7,919.1 |
work_typeIn-person: A person who works in-person is predicted to earn $10,517.1 more than a person who works hybrid.
work_typeRemote: A person who works remotely is predicted to earn $7,919.1 more than a person who works hybrid.
Table of the coefficients for all the different industry classes. Every coefficient is relative to the base level of the Banking/Financial Services industry.
| Terms | Estimates |
|---|---|
industryComputer Systems Design and Services |
37,456.8 |
industryEnergy |
-26,250.4 |
industryFintech |
19,377.3 |
industryGovernment |
-41,201.2 |
industryHealthcare |
-19,990.3 |
industryHigher Education |
-56,497.7 |
industryInsurance |
1,542.9 |
industryInternet, Telecomm or Information Services |
4,192.0 |
industryManufacturing |
-44,766.9 |
industryMedia & Advertising Services |
10,026.4 |
industryRetail and Consumer Services |
-2,511.6 |
industrySoftware Development |
4,971.8 |
industryTransportation, or Supply Chain |
-24,978.4 |
Interpretation of two coefficients as an example:
industryComputer Systems Design and Services: A person working in the Computer Systems Design and Services industry is predicted to earn $37,456.8 more than someone working in the Banking/Financial Services industry.
industryEnergy: A person working in the Energy industry is predicted to earn $26,250.4 less than someone working in the Banking/Financial Services industry.
Numerical Coefficient Interpretations
years_code: As the number of years coding increases by 1, the annual salary is predicted to increase by $1,354.9.
years_code_pro: As the number of years coding professionally increases by 1, the annual salary is predicted to increase by $2,633.7.
work_exp: As the number of years someone worked increases by 1, the annual salary is predicted to decrease by $523.
job_sat: As the job satisfaction rating increases by 1, the annual salary is predicted to increase by $2,331.9.
F-Test Conclusion
The null hypothesis is that all regression coefficients (i.e., slopes) are zero. The alternate hypothesis is that at least one of the regression coefficients is non-zero. Since the p-value associated with the F-statistic (13.85, p < 2.2e-16) is less than the alpha level of 0.05, we can reject the null hypothesis and conclude at least one predictor has a non-zero relationship with annual salary.
R-squared Interpretation
The multiple R-squared is 0.2377, which means that 23.77% of the variation in annual salary is accounted for by all predictors in the model.
Residual Standard Error (RSE) Interpretation
The residual standard error of 98410 indicates that the actual salaries are expected to differ from the predicted salaries by around $98,410. Although this is a large average difference between the actual and predicted values, it may be reasonable as the data includes people who make many hundred thousand dollars.
Improving the Model
# Diagnostic plot
library(ggfortify)Loading required package: ggplot2
autoplot(full_model)In the residual vs. fitted plot, the line slopes down and deviates from the horizontal line as we move to the right, suggesting that the linearity assumption is violated. In the scale-location plot, the line curves upward, indicating that the data does not have constant variance. In the Q-Q plot, the data points on the far right of the diagonal line spikes up, suggesting that the data may be right-skewed instead of normally distributed. In general, our model violates the linearity, constant variance, and normality assumptions.
# Log transformed the dependent variable (annual salary)
survey_data$log_salary <- log(survey_data$converted_comp_yearly)
log_model <- lm(log_salary ~
age + ed_level + years_code + years_code_pro +
dev_type + org_size + work_type + work_exp +
industry + job_sat, data = survey_data)
autoplot(log_model)After applying a log transformation to the dependent variable, annual salary, the linear assumption is better met as the line is flatter towards the right of the plot compared to the previous model. The constant variance assumption has also improved, although the blue line still curves upward towards the right, indicating that the variance remains non-constant. The normality assumption, however, is still not fully met, as the data points deviate from the diagonal line on the far left and far right of the plot. We will proceed with using logarithmic annual salary as the dependent variable, as it demonstrates overall better performance than the original model.
# Replaced years_code with a spline
library(splines)
spline_model <- lm(log_salary ~
age + ed_level + bs(years_code, 3) + years_code_pro +
dev_type + org_size + work_type + work_exp +
industry + job_sat, data = survey_data)
# Comparing the original model and the model with a spline
anova(log_model, spline_model)Analysis of Variance Table
Model 1: log_salary ~ age + ed_level + years_code + years_code_pro + dev_type +
org_size + work_type + work_exp + industry + job_sat
Model 2: log_salary ~ age + ed_level + bs(years_code, 3) + years_code_pro +
dev_type + org_size + work_type + work_exp + industry + job_sat
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2269 594.09
2 2267 584.28 2 9.8118 19.035 6.338e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(log_model, spline_model) df AIC
log_model 53 3529.849
spline_model 55 3495.196
BIC(log_model, spline_model) df BIC
log_model 53 3834.586
spline_model 55 3811.433
Based on the results of the anova(), AIC(), and BIC(), including a spline on years_code makes the model perform better, yet we decided to discard the spline part in the subsequent analyses.
# Interactions between work_exp and job_sat
interaction_model <- lm(log_salary ~
age + ed_level + years_code + years_code_pro +
dev_type + org_size + work_type + industry +
work_exp * job_sat, data = survey_data)
summary(interaction_model)
Call:
lm(formula = log_salary ~ age + ed_level + years_code + years_code_pro +
dev_type + org_size + work_type + industry + work_exp * job_sat,
data = survey_data)
Residuals:
Min 1Q Median 3Q Max
-8.8149 -0.2134 -0.0009 0.2251 2.3906
Coefficients:
Estimate Std. Error
(Intercept) 11.1382291 0.1302411
age25-34 0.3276732 0.0440683
age35-44 0.2990231 0.0525832
age45-54 0.1019508 0.0698951
age55-64 -0.1238627 0.0912231
age65 years or older -0.4125160 0.1542725
ed_levelBachelor's degree 0.1080401 0.0543462
ed_levelMaster's degree 0.2110533 0.0592032
ed_levelPrimary/elementary school 0.3022907 0.2163207
ed_levelProfessional degree 0.3916871 0.0831365
ed_levelSecondary school -0.2242134 0.0921178
ed_levelSome college 0.0475316 0.0617439
years_code 0.0045048 0.0027417
years_code_pro 0.0189257 0.0038356
dev_typeData engineer 0.0091639 0.1003839
dev_typeData scientist or machine learning specialist -0.0454047 0.1131081
dev_typeDeveloper, back-end 0.1109343 0.0826878
dev_typeDeveloper, desktop or enterprise applications -0.0374935 0.0931764
dev_typeDeveloper, embedded applications or devices 0.0927870 0.1014681
dev_typeDeveloper, front-end -0.0155748 0.0945419
dev_typeDeveloper, full-stack -0.0081117 0.0813592
dev_typeDeveloper, mobile 0.2072413 0.1039056
dev_typeDeveloper, QA or test -0.1232881 0.1378535
dev_typeDevOps specialist 0.0880480 0.1154544
dev_typeEngineer, site reliability 0.1928227 0.1285218
dev_typeEngineering manager 0.2639323 0.0979176
dev_typeResearch & Development role 0.0284766 0.1245240
dev_typeSenior Executive (C-Suite, VP, etc.) 0.4524348 0.1157187
org_size10 to 19 -0.2457127 0.0518131
org_size10,000 or more 0.0515788 0.0373160
org_size100 to 499 -0.1382886 0.0372661
org_size2 to 9 -0.4712614 0.0519849
org_size20 to 99 -0.2216290 0.0395314
org_size5,000 to 9,999 -0.0443482 0.0516978
org_size500 to 999 -0.1382114 0.0491450
work_typeIn-person -0.0248187 0.0374859
work_typeRemote 0.0524046 0.0247999
industryComputer Systems Design and Services -0.0159543 0.0776642
industryEnergy -0.1131632 0.0863520
industryFintech 0.0853433 0.0634125
industryGovernment -0.1817692 0.0624905
industryHealthcare -0.0622074 0.0602657
industryHigher Education -0.4881152 0.0736292
industryInsurance 0.0373403 0.0816460
industryInternet, Telecomm or Information Services -0.0036748 0.0663399
industryManufacturing -0.2364641 0.0652905
industryMedia & Advertising Services 0.0431589 0.0740836
industryRetail and Consumer Services -0.0188821 0.0629557
industrySoftware Development 0.0349963 0.0519628
industryTransportation, or Supply Chain -0.1719717 0.0709703
work_exp 0.0047148 0.0048497
job_sat 0.0226266 0.0086020
work_exp:job_sat -0.0005113 0.0005213
t value Pr(>|t|)
(Intercept) 85.520 < 2e-16 ***
age25-34 7.436 1.47e-13 ***
age35-44 5.687 1.46e-08 ***
age45-54 1.459 0.144807
age55-64 -1.358 0.174662
age65 years or older -2.674 0.007550 **
ed_levelBachelor's degree 1.988 0.046932 *
ed_levelMaster's degree 3.565 0.000372 ***
ed_levelPrimary/elementary school 1.397 0.162424
ed_levelProfessional degree 4.711 2.61e-06 ***
ed_levelSecondary school -2.434 0.015010 *
ed_levelSome college 0.770 0.441487
years_code 1.643 0.100513
years_code_pro 4.934 8.63e-07 ***
dev_typeData engineer 0.091 0.927271
dev_typeData scientist or machine learning specialist -0.401 0.688143
dev_typeDeveloper, back-end 1.342 0.179859
dev_typeDeveloper, desktop or enterprise applications -0.402 0.687433
dev_typeDeveloper, embedded applications or devices 0.914 0.360580
dev_typeDeveloper, front-end -0.165 0.869164
dev_typeDeveloper, full-stack -0.100 0.920589
dev_typeDeveloper, mobile 1.995 0.046215 *
dev_typeDeveloper, QA or test -0.894 0.371234
dev_typeDevOps specialist 0.763 0.445769
dev_typeEngineer, site reliability 1.500 0.133673
dev_typeEngineering manager 2.695 0.007081 **
dev_typeResearch & Development role 0.229 0.819135
dev_typeSenior Executive (C-Suite, VP, etc.) 3.910 9.51e-05 ***
org_size10 to 19 -4.742 2.24e-06 ***
org_size10,000 or more 1.382 0.167042
org_size100 to 499 -3.711 0.000212 ***
org_size2 to 9 -9.065 < 2e-16 ***
org_size20 to 99 -5.606 2.32e-08 ***
org_size5,000 to 9,999 -0.858 0.391074
org_size500 to 999 -2.812 0.004961 **
work_typeIn-person -0.662 0.507986
work_typeRemote 2.113 0.034701 *
industryComputer Systems Design and Services -0.205 0.837257
industryEnergy -1.310 0.190164
industryFintech 1.346 0.178487
industryGovernment -2.909 0.003664 **
industryHealthcare -1.032 0.302080
industryHigher Education -6.629 4.20e-11 ***
industryInsurance 0.457 0.647468
industryInternet, Telecomm or Information Services -0.055 0.955830
industryManufacturing -3.622 0.000299 ***
industryMedia & Advertising Services 0.583 0.560240
industryRetail and Consumer Services -0.300 0.764260
industrySoftware Development 0.673 0.500706
industryTransportation, or Supply Chain -2.423 0.015464 *
work_exp 0.972 0.331061
job_sat 2.630 0.008586 **
work_exp:job_sat -0.981 0.326793
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5117 on 2268 degrees of freedom
Multiple R-squared: 0.2955, Adjusted R-squared: 0.2793
F-statistic: 18.29 on 52 and 2268 DF, p-value: < 2.2e-16
We hypothesize that there is an interaction between job satisfaction and work experience that affects annual salary. The coefficient of -0.0006510 indicates that as the job satisfaction rating increases by 1, the effect of work experience on annual salary decreases by around $0.001. Although this is a non-significant interaction, it indicates that the more satisfied a person is with their job, the smaller the impact work experience has on increasing their annual salary.
Testing Your Hypotheses
Hypothesis 1: There is some relationship between salary and any of the predictors. Null: All of the regression coefficients in log_model are zero. Alt: At least one of the regression coefficients in log_model is non-zero. Hypothesis 2: We would expect education level to improve a model that already includes age, years_code, years_code_pro, dev_type, org_size, work_type, work_exp, industry, and job_sat. Null: The coefficients associated with the predictor ed_level is equal to 0, given that all other predictors are in the model. Alt: The coefficients associated with the predictor ed_level is not equal to 0, given that all other predictors are in the model. Hypothesis 3: Job satisfaction and years of work experience are useful for predicting log_salary if we are already using age, ed_level, years_code, years_code_pro, dev_type, org_size, work_type, and industry. Null: The reduced model with only age, ed_level, years_code, years_code_pro, dev_type, org_size, work_type, and industry as predictors is correct. Alt: The reduced model with only age, ed_level, years_code, years_code_pro, dev_type, org_size, work_type, and industry as predictors is not correct.
# Testing hypothesis 1
summary(log_model)
Call:
lm(formula = log_salary ~ age + ed_level + years_code + years_code_pro +
dev_type + org_size + work_type + work_exp + industry + job_sat,
data = survey_data)
Residuals:
Min 1Q Median 3Q Max
-8.8107 -0.2156 -0.0021 0.2269 2.3976
Coefficients:
Estimate Std. Error
(Intercept) 11.182113 0.122314
age25-34 0.328074 0.044066
age35-44 0.300485 0.052562
age45-54 0.102963 0.069887
age55-64 -0.125796 0.091201
age65 years or older -0.422559 0.153931
ed_levelBachelor's degree 0.108279 0.054345
ed_levelMaster's degree 0.211302 0.059202
ed_levelPrimary/elementary school 0.298349 0.216282
ed_levelProfessional degree 0.390247 0.083123
ed_levelSecondary school -0.222876 0.092107
ed_levelSome college 0.046606 0.061736
years_code 0.004531 0.002742
years_code_pro 0.018892 0.003835
dev_typeData engineer 0.011900 0.100344
dev_typeData scientist or machine learning specialist -0.044977 0.113106
dev_typeDeveloper, back-end 0.113391 0.082649
dev_typeDeveloper, desktop or enterprise applications -0.033963 0.093106
dev_typeDeveloper, embedded applications or devices 0.095100 0.101440
dev_typeDeveloper, front-end -0.012904 0.094502
dev_typeDeveloper, full-stack -0.006264 0.081337
dev_typeDeveloper, mobile 0.209236 0.103885
dev_typeDeveloper, QA or test -0.124745 0.137844
dev_typeDevOps specialist 0.089825 0.115439
dev_typeEngineer, site reliability 0.193762 0.128517
dev_typeEngineering manager 0.266541 0.097881
dev_typeResearch & Development role 0.031112 0.124494
dev_typeSenior Executive (C-Suite, VP, etc.) 0.450783 0.115705
org_size10 to 19 -0.248619 0.051728
org_size10,000 or more 0.050359 0.037295
org_size100 to 499 -0.138257 0.037266
org_size2 to 9 -0.472151 0.051977
org_size20 to 99 -0.221478 0.039531
org_size5,000 to 9,999 -0.044415 0.051697
org_size500 to 999 -0.137710 0.049142
work_typeIn-person -0.026236 0.037458
work_typeRemote 0.052190 0.024799
work_exp 0.001068 0.003114
industryComputer Systems Design and Services -0.011794 0.077548
industryEnergy -0.111547 0.086336
industryFintech 0.086980 0.063390
industryGovernment -0.180291 0.062472
industryHealthcare -0.061325 0.060258
industryHigher Education -0.484163 0.073518
industryInsurance 0.040314 0.081589
industryInternet, Telecomm or Information Services -0.001685 0.066308
industryManufacturing -0.234418 0.065257
industryMedia & Advertising Services 0.046454 0.074007
industryRetail and Consumer Services -0.017039 0.062927
industrySoftware Development 0.036900 0.051926
industryTransportation, or Supply Chain -0.169075 0.070908
job_sat 0.015748 0.004981
t value Pr(>|t|)
(Intercept) 91.422 < 2e-16 ***
age25-34 7.445 1.37e-13 ***
age35-44 5.717 1.23e-08 ***
age45-54 1.473 0.140814
age55-64 -1.379 0.167930
age65 years or older -2.745 0.006097 **
ed_levelBachelor's degree 1.992 0.046444 *
ed_levelMaster's degree 3.569 0.000366 ***
ed_levelPrimary/elementary school 1.379 0.167893
ed_levelProfessional degree 4.695 2.83e-06 ***
ed_levelSecondary school -2.420 0.015609 *
ed_levelSome college 0.755 0.450378
years_code 1.653 0.098566 .
years_code_pro 4.926 9.01e-07 ***
dev_typeData engineer 0.119 0.905611
dev_typeData scientist or machine learning specialist -0.398 0.690927
dev_typeDeveloper, back-end 1.372 0.170213
dev_typeDeveloper, desktop or enterprise applications -0.365 0.715311
dev_typeDeveloper, embedded applications or devices 0.938 0.348598
dev_typeDeveloper, front-end -0.137 0.891398
dev_typeDeveloper, full-stack -0.077 0.938615
dev_typeDeveloper, mobile 2.014 0.044115 *
dev_typeDeveloper, QA or test -0.905 0.365576
dev_typeDevOps specialist 0.778 0.436581
dev_typeEngineer, site reliability 1.508 0.131778
dev_typeEngineering manager 2.723 0.006516 **
dev_typeResearch & Development role 0.250 0.802681
dev_typeSenior Executive (C-Suite, VP, etc.) 3.896 0.000101 ***
org_size10 to 19 -4.806 1.64e-06 ***
org_size10,000 or more 1.350 0.177055
org_size100 to 499 -3.710 0.000212 ***
org_size2 to 9 -9.084 < 2e-16 ***
org_size20 to 99 -5.603 2.37e-08 ***
org_size5,000 to 9,999 -0.859 0.390359
org_size500 to 999 -2.802 0.005117 **
work_typeIn-person -0.700 0.483736
work_typeRemote 2.105 0.035441 *
work_exp 0.343 0.731580
industryComputer Systems Design and Services -0.152 0.879135
industryEnergy -1.292 0.196482
industryFintech 1.372 0.170155
industryGovernment -2.886 0.003939 **
industryHealthcare -1.018 0.308932
industryHigher Education -6.586 5.61e-11 ***
industryInsurance 0.494 0.621276
industryInternet, Telecomm or Information Services -0.025 0.979730
industryManufacturing -3.592 0.000335 ***
industryMedia & Advertising Services 0.628 0.530266
industryRetail and Consumer Services -0.271 0.786593
industrySoftware Development 0.711 0.477384
industryTransportation, or Supply Chain -2.384 0.017188 *
job_sat 3.162 0.001590 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5117 on 2269 degrees of freedom
Multiple R-squared: 0.2952, Adjusted R-squared: 0.2793
F-statistic: 18.63 on 51 and 2269 DF, p-value: < 2.2e-16
For hypothesis 1, we used summary() and had a p-value less than 2.2e-16, which is much less than our cutoff value of 0.05. This means we can reject the null hypothesis for the overall significance of the model test. This also means there is an overwhelming amount of evidence that there is an obvious relationship between the predictors and log_salary or that at least one of the regression coefficients in our multiple linear regression is non-zero. Referring to the informal hypothesis, there is some relationship between salary and any of the predictors.
#Testing hypothesis 2
drop1(log_model, test = "F")Single term deletions
Model:
log_salary ~ age + ed_level + years_code + years_code_pro + dev_type +
org_size + work_type + work_exp + industry + job_sat
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 594.09 -3058.9
age 5 36.194 630.29 -2931.6 27.6466 < 2.2e-16 ***
ed_level 6 15.373 609.47 -3011.6 9.7859 1.136e-10 ***
years_code 1 0.715 594.81 -3058.1 2.7308 0.09857 .
years_code_pro 1 6.353 600.45 -3036.2 24.2631 9.009e-07 ***
dev_type 14 19.328 613.42 -3012.6 5.2726 5.895e-10 ***
org_size 7 39.509 633.60 -2923.4 21.5563 < 2.2e-16 ***
work_type 2 1.852 595.95 -3055.6 3.5359 0.02929 *
work_exp 1 0.031 594.12 -3060.7 0.1177 0.73158
industry 13 34.104 628.20 -2955.3 10.0195 < 2.2e-16 ***
job_sat 1 2.617 596.71 -3050.7 9.9956 0.00159 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For hypothesis 2, we performed a drop1() test, and ed_level had a p-value of 1.616e-11, which is much less than our cutoff value of 0.05. This means we can reject the null hypothesis and conclude that the coefficients associated with ed_level are not equal to 0, given that all other predictors are in the model. Referring back to the informal hypothesis, this means that education level significantly improves a model that already includes age, years_code, years_code_pro, dev_type, org_size, work_type, work_exp, industry, and job_sat.
#Testing hypothesis 3
library(broom)
reduced_log_model <- lm(log_salary ~
age + ed_level + years_code + years_code_pro +
dev_type + org_size + work_type +
industry, data = survey_data)
tidy(anova(reduced_log_model, log_model))# A tibble: 2 × 7
term df.residual rss df sumsq statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 log_salary ~ age + ed_level … 2271 597. NA NA NA NA
2 log_salary ~ age + ed_level … 2269 594. 2 2.67 5.09 0.00623
For hypothesis 3, we tested a nested hypothesis using anova() and had a p-value of 6.18e-5 which is less than our cutoff value of 0.05. The small p-value associated with our F-statistic suggests that the additional predictors in the full model (job_sat and work_exp) have significant explanatory power. We can reject the null hypothesis and conclude that the reduced model with only age, ed_level, years_code, years_code_pro, dev_type, org_size, work_type, and industry as predictors is not correct. Referring back to the informal hypothesis, job satisfaction and years of work experience are useful for predicting log_salary if we are already using age, ed_level, years_code, years_code_pro, dev_type, org_size, work_type, and industry.
The only limitation to the conclusion is that we didn’t use the exponentiated and actual values of salary, converted_comp_yearly; we took the log of it and made log_salary to meet more assumptions.
Robustness of Results
# Residuals vs Leverage plot
autoplot(log_model, which = 5)In the residuals vs. leverage plot, observations 1725 and 1811 have very low standardized residuals, meaning they have extreme residuals/outlying salaries. 57 is on the far right of the plot, indicating that it has high leverage.
# Cook’s Distance
library(car)Loading required package: carData
influencePlot(log_model) StudRes Hat CookD
669 -0.4550808 0.18370209 0.0008965839
875 -14.7535122 0.04016286 0.1598844716
995 1.4750644 0.18199848 0.0093048047
1027 -18.9412550 0.04331690 0.2698458007
Looking at the plot for Cook’s distance, observations 1815 and 2295 are the most influential points with low studentized residuals, followed by 58, 995, and 669, which are influential points with high leverage.
# Features of the influential points
survey_data[c(1811, 2291, 57, 992, 668), ] age ed_level years_code years_code_pro dev_type
1811 35-44 Master's degree 38 12 Developer, full-stack
2291 25-34 Bachelor's degree 10 8 Developer, full-stack
57 45-54 Associate's degree 10 8 Developer, full-stack
992 25-34 Bachelor's degree 20 12 Developer, full-stack
668 25-34 Some college 10 3 Developer, mobile
org_size work_type work_exp
1811 10,000 or more Remote 12
2291 100 to 499 Remote 8
57 100 to 499 In-person 8
992 2 to 9 Hybrid 12
668 10,000 or more Remote 20
industry job_sat converted_comp_yearly
1811 Banking/Financial Services 5 196000
2291 Healthcare 8 170000
57 Higher Education 1 80000
992 Software Development 9 150000
668 Internet, Telecomm or Information Services 7 140000
log_salary
1811 12.18587
2291 12.04355
57 11.28978
992 11.91839
668 11.84940
Observations 58, 995, and 669 are high-leverage and influential points. All three individuals have a primary/elementary school education, work as full-stack developers, and earn over $100,000. Their high leverage is likely due to the rarity of earning such a high salary among others with similar education levels. Observations 1815 and 2295, on the other hand, earn less than $10,000 annually—an uncommon occurrence in the software development industry. Even though this might be partially explained by their employment in smaller organizations, their actual annual salary is much lower than their predicted annual salary, which may mean they still make less than those with similar backgrounds.
# DFFITS and plot
salary_dffits <- dffits(log_model)
salary_rstandard <- dffits(log_model)
plot(salary_dffits, salary_rstandard)# Identify "weird" DFFITS
subset(survey_data, dffits(log_model) < -1.5) age ed_level years_code years_code_pro
46 25-34 Bachelor's degree 19 8
875 35-44 Bachelor's degree 7 7
910 35-44 Master's degree 10 7
1027 18-24 Secondary school 12 1
dev_type org_size work_type
46 Developer, back-end 20 to 99 Hybrid
875 Cloud infrastructure engineer 10,000 or more Remote
910 Data scientist or machine learning specialist 10,000 or more Remote
1027 Developer, full-stack 2 to 9 Hybrid
work_exp industry job_sat
46 8 Fintech 7
875 12 Transportation, or Supply Chain 7
910 8 Computer Systems Design and Services 8
1027 1 Higher Education 8
converted_comp_yearly log_salary
46 250 5.521461
875 115 4.744932
910 300 5.703782
1027 4 1.386294
The plot reveals one “weird” DFFITS value, observation 1815. This observation strongly influences the overall prediction and has an extremely low studentized residual. This indicates that the 1811 observation had an unusually low annual salary, and the model significantly overpredicted their earnings.
# Calculate VIF
vif(log_model) GVIF Df GVIF^(1/(2*Df))
age 5.473238 5 1.185289
ed_level 1.351843 6 1.025441
years_code 7.590940 1 2.755166
years_code_pro 11.912990 1 3.451520
dev_type 1.702494 14 1.019185
org_size 1.347328 7 1.021523
work_type 1.229707 2 1.053053
work_exp 9.105905 1 3.017599
industry 1.532221 13 1.016548
job_sat 1.049900 1 1.024646
The output shows that the variable years of coding professionally has the largest VIF, followed by years of work experience, years of coding, and age. This suggests that these variables are highly correlated with each other, indicating issues with multicollinearity in the model.
# Calculate PRESS
library(DAAG)
Attaching package: 'DAAG'
The following object is masked from 'package:car':
vif
press(log_model)[1] 628.9997
The PRESS statistic of our model is 395.594.
# LOO R-squared
loo_predictions <- function(fit) {
e <- residuals(fit)
h <- hatvalues(fit)
Y <- fitted(fit) + e
return(Y - e / (1 - h))
}
loo_pred <- loo_predictions(log_model)
cor(loo_pred, survey_data$log_salary)^2[1] 0.2552413
Our model does not overfit, as the LOO R-squared of around 0.3442 suggests that only around 34.42% of the variation in the logged annual salary is accounted for by the predictors of our model. This indicates that the model’s complexity is not excessive.
Automatic Model Selection
# Backward selection
salary_step <- step(log_model, direction = "backward")Start: AIC=-3058.86
log_salary ~ age + ed_level + years_code + years_code_pro + dev_type +
org_size + work_type + work_exp + industry + job_sat
Df Sum of Sq RSS AIC
- work_exp 1 0.031 594.12 -3060.7
<none> 594.09 -3058.9
- years_code 1 0.715 594.81 -3058.1
- work_type 2 1.852 595.95 -3055.6
- job_sat 1 2.617 596.71 -3050.7
- years_code_pro 1 6.353 600.45 -3036.2
- dev_type 14 19.328 613.42 -3012.6
- ed_level 6 15.373 609.47 -3011.6
- industry 13 34.104 628.20 -2955.3
- age 5 36.194 630.29 -2931.6
- org_size 7 39.509 633.60 -2923.4
Step: AIC=-3060.74
log_salary ~ age + ed_level + years_code + years_code_pro + dev_type +
org_size + work_type + industry + job_sat
Df Sum of Sq RSS AIC
<none> 594.12 -3060.7
- years_code 1 0.758 594.88 -3059.8
- work_type 2 1.864 595.99 -3057.5
- job_sat 1 2.635 596.76 -3052.5
- years_code_pro 1 8.945 603.07 -3028.1
- dev_type 14 19.382 613.51 -3014.2
- ed_level 6 15.380 609.50 -3013.4
- industry 13 34.106 628.23 -2957.2
- age 5 36.620 630.75 -2931.9
- org_size 7 39.496 633.62 -2925.4
The final model selected by backward selection removes work_exp, the years of work experience.
In general, automatic model selection only focuses on selecting a model that optimizes criteria such as AIC. However, this approach may include mediator variables in the final model simply because they are strongly correlated with the outcome variable. It may also lead to confounding biases by omitting important predictors that influence the outcome but were removed because they did not appear statistically significant.
The p-values reported in the model summary can be misleading, as they do not account for the fact that model selection was performed. As a result, the p-values may mostly, or even entirely, appear statistically significant, reflecting the model selection’s intent to identify a model that shows strong statistical significance with the predicted outcome.
This is the summary output from out final selected model:
summary(salary_step)
Call:
lm(formula = log_salary ~ age + ed_level + years_code + years_code_pro +
dev_type + org_size + work_type + industry + job_sat, data = survey_data)
Residuals:
Min 1Q Median 3Q Max
-8.8129 -0.2151 -0.0019 0.2252 2.3975
Coefficients:
Estimate Std. Error
(Intercept) 11.182634 0.122280
age25-34 0.330274 0.043588
age35-44 0.305929 0.050100
age45-54 0.111617 0.065163
age55-64 -0.113224 0.083499
age65 years or older -0.406548 0.146657
ed_levelBachelor's degree 0.107653 0.054304
ed_levelMaster's degree 0.210612 0.059157
ed_levelPrimary/elementary school 0.298224 0.216239
ed_levelProfessional degree 0.387834 0.082809
ed_levelSecondary school -0.222433 0.092080
ed_levelSome college 0.046628 0.061724
years_code 0.004636 0.002724
years_code_pro 0.019537 0.003342
dev_typeData engineer 0.012151 0.100322
dev_typeData scientist or machine learning specialist -0.044798 0.113083
dev_typeDeveloper, back-end 0.112856 0.082618
dev_typeDeveloper, desktop or enterprise applications -0.034132 0.093087
dev_typeDeveloper, embedded applications or devices 0.093731 0.101342
dev_typeDeveloper, front-end -0.013869 0.094442
dev_typeDeveloper, full-stack -0.006779 0.081307
dev_typeDeveloper, mobile 0.208691 0.103853
dev_typeDeveloper, QA or test -0.124954 0.137816
dev_typeDevOps specialist 0.089199 0.115402
dev_typeEngineer, site reliability 0.193572 0.128491
dev_typeEngineering manager 0.266497 0.097862
dev_typeResearch & Development role 0.030872 0.124468
dev_typeSenior Executive (C-Suite, VP, etc.) 0.451766 0.115647
org_size10 to 19 -0.248609 0.051718
org_size10,000 or more 0.050529 0.037284
org_size100 to 499 -0.138111 0.037256
org_size2 to 9 -0.471956 0.051963
org_size20 to 99 -0.221187 0.039514
org_size5,000 to 9,999 -0.043926 0.051668
org_size500 to 999 -0.137134 0.049104
work_typeIn-person -0.026155 0.037450
work_typeRemote 0.052414 0.024785
industryComputer Systems Design and Services -0.011676 0.077532
industryEnergy -0.111394 0.086318
industryFintech 0.087278 0.063372
industryGovernment -0.180344 0.062460
industryHealthcare -0.061592 0.060242
industryHigher Education -0.483785 0.073496
industryInsurance 0.039871 0.081563
industryInternet, Telecomm or Information Services -0.001763 0.066295
industryManufacturing -0.233881 0.065225
industryMedia & Advertising Services 0.047340 0.073947
industryRetail and Consumer Services -0.016760 0.062910
industrySoftware Development 0.037185 0.051909
industryTransportation, or Supply Chain -0.169097 0.070894
job_sat 0.015795 0.004978
t value Pr(>|t|)
(Intercept) 91.451 < 2e-16 ***
age25-34 7.577 5.11e-14 ***
age35-44 6.106 1.20e-09 ***
age45-54 1.713 0.086866 .
age55-64 -1.356 0.175234
age65 years or older -2.772 0.005615 **
ed_levelBachelor's degree 1.982 0.047554 *
ed_levelMaster's degree 3.560 0.000378 ***
ed_levelPrimary/elementary school 1.379 0.167987
ed_levelProfessional degree 4.683 2.99e-06 ***
ed_levelSecondary school -2.416 0.015786 *
ed_levelSome college 0.755 0.450069
years_code 1.702 0.088837 .
years_code_pro 5.846 5.76e-09 ***
dev_typeData engineer 0.121 0.903609
dev_typeData scientist or machine learning specialist -0.396 0.692029
dev_typeDeveloper, back-end 1.366 0.172075
dev_typeDeveloper, desktop or enterprise applications -0.367 0.713900
dev_typeDeveloper, embedded applications or devices 0.925 0.355118
dev_typeDeveloper, front-end -0.147 0.883259
dev_typeDeveloper, full-stack -0.083 0.933564
dev_typeDeveloper, mobile 2.009 0.044603 *
dev_typeDeveloper, QA or test -0.907 0.364675
dev_typeDevOps specialist 0.773 0.439641
dev_typeEngineer, site reliability 1.507 0.132077
dev_typeEngineering manager 2.723 0.006515 **
dev_typeResearch & Development role 0.248 0.804132
dev_typeSenior Executive (C-Suite, VP, etc.) 3.906 9.64e-05 ***
org_size10 to 19 -4.807 1.63e-06 ***
org_size10,000 or more 1.355 0.175480
org_size100 to 499 -3.707 0.000215 ***
org_size2 to 9 -9.082 < 2e-16 ***
org_size20 to 99 -5.598 2.43e-08 ***
org_size5,000 to 9,999 -0.850 0.395321
org_size500 to 999 -2.793 0.005270 **
work_typeIn-person -0.698 0.485002
work_typeRemote 2.115 0.034563 *
industryComputer Systems Design and Services -0.151 0.880308
industryEnergy -1.291 0.197005
industryFintech 1.377 0.168572
industryGovernment -2.887 0.003921 **
industryHealthcare -1.022 0.306692
industryHigher Education -6.582 5.73e-11 ***
industryInsurance 0.489 0.625005
industryInternet, Telecomm or Information Services -0.027 0.978789
industryManufacturing -3.586 0.000343 ***
industryMedia & Advertising Services 0.640 0.522118
industryRetail and Consumer Services -0.266 0.789951
industrySoftware Development 0.716 0.473853
industryTransportation, or Supply Chain -2.385 0.017152 *
job_sat 3.173 0.001530 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5116 on 2270 degrees of freedom
Multiple R-squared: 0.2951, Adjusted R-squared: 0.2796
F-statistic: 19.01 on 50 and 2270 DF, p-value: < 2.2e-16
Conclusion
In all, our multiple regression model identified several significant predictiors of salary, including age, education level, job satisfaction, job type, and organizational size. Together these predictors suggest meaningful relationships between salary. Most interesting, perhaps, is the statistically significant relationship between job satisfaction and salary, which provides evidence that software engineers in the United States can both enjoy their work and be well-paid. Education is also one of the largest statistically significant predictors, showing significant benefits to pursuing a master or doctoral degree.
Notably, however, the model’s r-squared is lower than ideal at 29.5%. Some potential areas for improvement are checking for multicollinearity between the model’s predictors, accounting for non-linear relationships between the predictors and log_salary, and searching for possible confounders that hurt the model’s predictive power.
Regardless, the model provides unique insight into the state of the software engineering job market in the United States, and highlights the impact of demographic, educational, and workplace factors on salary trends within the industry.