SDS 324E Final Project - Predicting Salaries for Software Engineers in the United States

  • Preston Cook
  • Rome Vu
  • Phoebe Wang

Introduction

The data source for our analysis is the 2024 Stack Overflow Developer Survey. The annual survey is well-regarded as a reliable way to gauge the state of the global tech industry. The survey often falls alongside other popular data sources for basic data science, such as the Titanic, or MNIST datasets. The goal of our analysis was to reliably predict annual salaries for full-time software engineers in the United States.

We selected this dataset because one of our team members, Preston, is graduating this semester with aspirations of becoming a software engineer. This analysis serves a dual purpose: fulfilling the requirements of our final project while also providing Preston with valuable insights into the dynamics of the American tech job market as he prepares for his career.

# Read the data
survey_data <- read.csv("survey-data.csv")

Before extracting relevant predictors, we filtered our dataset using Pandas to only include data points for software engineers in the United States. This process resulted in 2321 records for our analysis.

After performing some data wrangling, we extracted relevant factors and renamed columns to conduct data analysis in an R Studio environment more easily. After this process, we were left with the following predictors in no particular order:

  • Age (Categorical)
  • Education Level (Categorical)
  • Number of Years Coding (Numeric)
  • Number of Years Coding Professionally (Numeric)
  • Developer Type (Categorical)
  • Organization Size (Categorical)
  • Work Type (Categorical)
  • Years of Work Experience (Numeric)
  • Industry (Categorical)
  • Job Satisfaction (Numeric)

Here are breakdowns for the number of each data points in each class across our categorical predictors:

# Create a table for age range class distribution
age_counts <- table(survey_data$age)
age_df <- as.data.frame(age_counts)
colnames(age_df) <- c("Age Range", "Count")
knitr::kable(age_df, caption = "Age Range Class Counts")
Age Range Class Counts
Age Range Count
18-24 180
25-34 934
35-44 705
45-54 318
55-64 164
65 years or older 20
# Create a table for education level class distribution
ed_counts <- table(survey_data$ed_level)
ed_df <- as.data.frame(ed_counts)
colnames(ed_df) <- c("Education Level", "Count")
knitr::kable(ed_df, caption = "Education Level Class Counts")
Education Level Class Counts
Education Level Count
Associate’s degree 98
Bachelor’s degree 1431
Master’s degree 407
Primary/elementary school 6
Professional degree 74
Secondary school 47
Some college 258
# Create a table for developer type class distribution
dev_type_counts <- table(survey_data$dev_type)
dev_type_df <- as.data.frame(dev_type_counts)
colnames(dev_type_df) <- c("Developer Type", "Count")
knitr::kable(dev_type_df, caption = "Developer Type Class Counts")
Developer Type Class Counts
Developer Type Count
Cloud infrastructure engineer 42
Data engineer 72
Data scientist or machine learning specialist 43
Developer, back-end 497
Developer, desktop or enterprise applications 117
Developer, embedded applications or devices 71
Developer, front-end 104
Developer, full-stack 1079
Developer, mobile 59
Developer, QA or test 21
DevOps specialist 38
Engineer, site reliability 26
Engineering manager 82
Research & Development role 30
Senior Executive (C-Suite, VP, etc.) 40
# Create a table for organization size class distribution
org_size_counts <- table(survey_data$org_size)
org_size_df <- as.data.frame(org_size_counts)
colnames(org_size_df) <- c("Organization Size", "Count")
knitr::kable(org_size_df, caption = "Organization Size Class Counts")
Organization Size Class Counts
Organization Size Count
1,000 to 4,999 316
10 to 19 152
10,000 or more 502
100 to 499 492
2 to 9 157
20 to 99 385
5,000 to 9,999 146
500 to 999 171
# Create a table for work type class distribution
work_type_counts <- table(survey_data$work_type)
work_type_df <- as.data.frame(work_type_counts)
colnames(work_type_df) <- c("Work Type", "Count")
knitr::kable(work_type_df, caption = "Work Type Class Counts")
Work Type Class Counts
Work Type Count
Hybrid 763
In-person 270
Remote 1288
# Create a table for industry type class distribution
industry_counts <- table(survey_data$industry)
industry_df <- as.data.frame(industry_counts)
colnames(industry_df) <- c("Industry Type", "Count")
knitr::kable(industry_df, caption = "Industry Type Class Counts")
Industry Type Class Counts
Industry Type Count
Banking/Financial Services 117
Computer Systems Design and Services 75
Energy 52
Fintech 156
Government 169
Healthcare 205
Higher Education 87
Insurance 62
Internet, Telecomm or Information Services 129
Manufacturing 141
Media & Advertising Services 85
Retail and Consumer Services 159
Software Development 786
Transportation, or Supply Chain 98

Formulation of Hypotheses

We would like to test the following hypotheses:

Hypothesis 1

“There is a significant relationship between salary and at least one of the predictors (age, education level, years of coding experience, years of professional experience, development type, organization size, work type, work experience, or industry).”

This is a sort of base hypothesis that mean to test in order to gauge the overall reliability of the survey data for predicting salaries for American software engineers. If our model provides evidence that this hypothesis is true, then we can more confidently approach the remaining two.

Hypothesis 2

“Including education level as a predictor will improve the performance of a model that already includes age, years of coding experience, years of professional experience, development type, organization size, work type, work experience, industry, and job satisfaction.”

Unlike many other career paths, software engineering has historically placed less value on education–more particularly college education. Or, at least, there is a widespread perception that this is the case. By testing this hypothesis, we can determine the amount of truth in this.

Hypothesis 3

“Job satisfaction and years of work experience are valuable predictors of log salary, even when accounting for age, education level, years of coding experience, years of professional experience, development type, organization size, work type, and industry.”

Lastly, we’d like to test the relationship between job satisfaction and salary predictions. Information like this could potentially rationalize choosing a company that initially pays less, but allows one to feel more satisfied in their work.

Exploring Marginal Relationships in the Data

We will begin by performing some basic exploratory analysis on our dataset. This will help us determine the underlying logic of our model and potential issues that we may need to address in the future.

Numeric Predictor Analysis

Let’s start by plotting salary against each of our numeric variables to check for general trends and potential multicollinearity.

# Set up a 2x2 grid for multiple plots
par(mfrow=c(2,2))

# Salary vs Years Coding Scatterplot
plot(survey_data$years_code, survey_data$converted_comp_yearly, 
     main="Salary vs. Years Coding",
     xlab="Years Coding", ylab="Salary", pch=19, col='blue')

# Salary vs Years Coding Professionally Scatterplot
plot(survey_data$years_code_pro, survey_data$converted_comp_yearly, 
     main="Salary vs. Years Coding Professionally",
     xlab="Years Coding Professionally", ylab="Salary", pch=19, col='red')

# Salary vs Years of Work Experience Scatterplot
plot(survey_data$work_exp, survey_data$converted_comp_yearly, 
     main="Salary vs. Years of Work Experience",
     xlab="Years of Work Experience", ylab="Salary", pch=19, col='orange')

# Salary vs Job Satisfaction Scatterplot
plot(survey_data$job_sat, survey_data$converted_comp_yearly, 
     main="Salary vs. Job Satisfaction",
     xlab="Job Satisfaction", ylab="Salary", pch=19, col='green')

As one might have predicted, it seems like the distribution of our years_code, years_code_pro, and work_exp variables are very similar. This is something that we should check when developing our model to assure that our assumption of no multicollinearity is not violated. However, there does appear to be visual evidence that job satisfaction is positively related to salary.

As a second part of our numeric analysis, let’s look at the correlation coefficient of each of our numeric predictors with the target variable.

# List of predictors and target variable
predictors <- c("years_code", "years_code_pro", "work_exp", "job_sat")
target <- "converted_comp_yearly"

# Calculate correlation coefficients
correlations <- sapply(predictors, function(predictor) {
  cor(survey_data[[predictor]], survey_data[[target]], use="complete.obs")
})

# Create a data frame to display the results
correlation_table <- data.frame(
  Correlation_Coefficient = correlations
)

# Display the table
knitr::kable(correlation_table, caption = "Correlation Coefficients of Predictors with Salary",
             col.names = c("Predictor", "Correlation Coefficient"), align = c("l", "r"))
Correlation Coefficients of Predictors with Salary
Predictor Correlation Coefficient
years_code 0.2152076
years_code_pro 0.2097014
work_exp 0.1764005
job_sat 0.0690380

It seems like years_code has the highest correlation with the target variable, which is higher than all the other numeric predictors. This provides some preliminary evidence that the number of years someone has coding is more important than their number of years of work experience. Regarding job_sat, it seems like the correlation is very low despite the graphical evidence. This could be due to job satisfaction only relating to salary strongly at the extremes of the distribution.

Categorical Predictor Analysis

Now, let’s look at our categorical predictors and their effects on the target variable through some box plots.

Adjusting margins and plotting each boxplot

Unfortunately some of the graphs are a bit squished in markdown because of the long category names, but we can still find some insight in the data. There is visual evidence that the median salary across classes is fairly similar between categorical predictors with some small differences.

# Age
par(mar = c(10, 4, 4, 2))  
boxplot(converted_comp_yearly ~ age, 
        xlab = "", ylab = "", 
        main = "Age Range Class Counts", 
        las = 2, data = survey_data)

In the age categories, although the median salary across classes is roughly equal, there are more outliers for 25-34-year-olds and 45-54 categories.

# Education Level
par(mar = c(14, 4, 4, 2))  
boxplot(converted_comp_yearly ~ ed_level, 
        xlab = "", ylab = "", 
        main = "Education Level Class Counts", 
        las = 2, data = survey_data)

For education level, PhDs seem to have a slightly higher median in the data, but there are more positive outliers for those with just a bachelor’s degree.

# Developer Type
par(mar = c(18, 4, 4, 2))  
boxplot(converted_comp_yearly ~ dev_type, 
        xlab = "", ylab = "", 
        main = "Developer Type Class Counts", 
        las = 2, data = survey_data)

And, although the developer type chart is fairly squished, it seems like backend and full-stack engineers have a higher median salary as well as more positive outliers.

# Organization Size
par(mar = c(10, 4, 4, 2))  
boxplot(converted_comp_yearly ~ org_size, 
        xlab = "", ylab = "", 
        main = "Organization Size Class Counts", 
        las = 2, data = survey_data)

Regarding organization size, it seems like those working in organizations with 10,000+ employees tend to have higher salaries. This is predictable, as big tech is known for paying its software engineers disproportionately well.

# Work Type
par(mar = c(10, 4, 4, 2))  
boxplot(converted_comp_yearly ~ work_type, 
        xlab = "", ylab = "", 
        main = "Work Type Class Counts", 
        las = 2, data = survey_data)

Between work type, it seems like they are roughly the same, with those working remotely edging out others by a few thousand dollars.

# Industry Type
par(mar = c(18, 4, 4, 2))  
boxplot(converted_comp_yearly ~ industry, 
        xlab = "", ylab = "", 
        main = "Industry Type Class Counts", 
        las = 2, data = survey_data)

Lastly, it seems like those working directly in the tech industry itself seem to have the highest median salary as well as the most positive outliers.

Fitting the Model and Assessing Overall Model Significance

Now, let’s finally create our model and assess the significance of the predictors shown in the previous step.

# Fit the multiple linear regression model
full_model <- lm(converted_comp_yearly ~ 
                 age + ed_level + years_code + years_code_pro +
                 dev_type + org_size + work_type + work_exp +
                 industry + job_sat, data = survey_data)
summary(full_model)

Call:
lm(formula = converted_comp_yearly ~ age + ed_level + years_code + 
    years_code_pro + dev_type + org_size + work_type + work_exp + 
    industry + job_sat, data = survey_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-258180  -44402  -14003   24877 1652033 

Coefficients:
                                                      Estimate Std. Error
(Intercept)                                            99064.8    23578.2
age25-34                                               36929.6     8494.5
age35-44                                               34675.1    10132.2
age45-54                                                4624.7    13472.0
age55-64                                              -36727.6    17580.7
age65 years or older                                  -83251.3    29673.0
ed_levelBachelor's degree                              17496.6    10476.0
ed_levelMaster's degree                                36453.4    11412.3
ed_levelPrimary/elementary school                      43085.6    41692.3
ed_levelProfessional degree                            77206.7    16023.5
ed_levelSecondary school                                5576.2    17755.3
ed_levelSome college                                   10384.4    11900.8
years_code                                              1374.3      528.5
years_code_pro                                          2650.6      739.3
dev_typeData engineer                                 -37500.6    19343.2
dev_typeData scientist or machine learning specialist -20373.4    21803.3
dev_typeDeveloper, back-end                            -8580.5    15932.2
dev_typeDeveloper, desktop or enterprise applications -40393.7    17947.9
dev_typeDeveloper, embedded applications or devices   -15267.1    19554.4
dev_typeDeveloper, front-end                          -36406.1    18217.0
dev_typeDeveloper, full-stack                         -35251.9    15679.2
dev_typeDeveloper, mobile                               4298.9    20025.7
dev_typeDeveloper, QA or test                         -65769.1    26572.0
dev_typeDevOps specialist                             -22044.5    22253.1
dev_typeEngineer, site reliability                     -4172.8    24774.1
dev_typeEngineering manager                            17689.6    18868.3
dev_typeResearch & Development role                   -35462.9    23998.5
dev_typeSenior Executive (C-Suite, VP, etc.)           85514.5    22304.4
org_size10 to 19                                      -47598.9     9971.5
org_size10,000 or more                                 20766.9     7189.3
org_size100 to 499                                    -27594.2     7183.7
org_size2 to 9                                        -54831.4    10019.4
org_size20 to 99                                      -38966.8     7620.3
org_size5,000 to 9,999                                 -7416.4     9965.6
org_size500 to 999                                    -27493.0     9473.0
work_typeIn-person                                     10509.2     7220.7
work_typeRemote                                         7682.4     4780.4
work_exp                                                -495.5      600.3
industryComputer Systems Design and Services           33614.1    14948.7
industryEnergy                                        -26435.9    16642.8
industryFintech                                        18096.5    12219.6
industryGovernment                                    -41194.2    12042.6
industryHealthcare                                    -20158.9    11615.9
industryHigher Education                              -56719.3    14172.0
industryInsurance                                       1514.9    15727.8
industryInternet, Telecomm or Information Services      4242.9    12782.1
industryManufacturing                                 -44991.4    12579.4
industryMedia & Advertising Services                    9841.1    14266.2
industryRetail and Consumer Services                   -2639.1    12130.4
industrySoftware Development                            4846.8    10009.7
industryTransportation, or Supply Chain               -27245.1    13668.9
job_sat                                                 2283.0      960.2
                                                      t value Pr(>|t|)    
(Intercept)                                             4.202 2.75e-05 ***
age25-34                                                4.347 1.44e-05 ***
age35-44                                                3.422 0.000632 ***
age45-54                                                0.343 0.731419    
age55-64                                               -2.089 0.036811 *  
age65 years or older                                   -2.806 0.005065 ** 
ed_levelBachelor's degree                               1.670 0.095027 .  
ed_levelMaster's degree                                 3.194 0.001421 ** 
ed_levelPrimary/elementary school                       1.033 0.301518    
ed_levelProfessional degree                             4.818 1.54e-06 ***
ed_levelSecondary school                                0.314 0.753507    
ed_levelSome college                                    0.873 0.382986    
years_code                                              2.600 0.009371 ** 
years_code_pro                                          3.585 0.000344 ***
dev_typeData engineer                                  -1.939 0.052663 .  
dev_typeData scientist or machine learning specialist  -0.934 0.350189    
dev_typeDeveloper, back-end                            -0.539 0.590241    
dev_typeDeveloper, desktop or enterprise applications  -2.251 0.024506 *  
dev_typeDeveloper, embedded applications or devices    -0.781 0.435030    
dev_typeDeveloper, front-end                           -1.998 0.045785 *  
dev_typeDeveloper, full-stack                          -2.248 0.024651 *  
dev_typeDeveloper, mobile                               0.215 0.830043    
dev_typeDeveloper, QA or test                          -2.475 0.013392 *  
dev_typeDevOps specialist                              -0.991 0.321973    
dev_typeEngineer, site reliability                     -0.168 0.866256    
dev_typeEngineering manager                             0.938 0.348585    
dev_typeResearch & Development role                    -1.478 0.139624    
dev_typeSenior Executive (C-Suite, VP, etc.)            3.834 0.000130 ***
org_size10 to 19                                       -4.773 1.93e-06 ***
org_size10,000 or more                                  2.889 0.003907 ** 
org_size100 to 499                                     -3.841 0.000126 ***
org_size2 to 9                                         -5.472 4.93e-08 ***
org_size20 to 99                                       -5.114 3.43e-07 ***
org_size5,000 to 9,999                                 -0.744 0.456831    
org_size500 to 999                                     -2.902 0.003741 ** 
work_typeIn-person                                      1.455 0.145686    
work_typeRemote                                         1.607 0.108181    
work_exp                                               -0.825 0.409196    
industryComputer Systems Design and Services            2.249 0.024632 *  
industryEnergy                                         -1.588 0.112328    
industryFintech                                         1.481 0.138761    
industryGovernment                                     -3.421 0.000636 ***
industryHealthcare                                     -1.735 0.082797 .  
industryHigher Education                               -4.002 6.48e-05 ***
industryInsurance                                       0.096 0.923276    
industryInternet, Telecomm or Information Services      0.332 0.739967    
industryManufacturing                                  -3.577 0.000355 ***
industryMedia & Advertising Services                    0.690 0.490378    
industryRetail and Consumer Services                   -0.218 0.827790    
industrySoftware Development                            0.484 0.628285    
industryTransportation, or Supply Chain                -1.993 0.046357 *  
job_sat                                                 2.378 0.017509 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 98640 on 2269 degrees of freedom
Multiple R-squared:  0.2358,    Adjusted R-squared:  0.2186 
F-statistic: 13.73 on 51 and 2269 DF,  p-value: < 2.2e-16

Interpreting the intercept and coefficients

Intercept: 102,860.9
The intercept indicates that when all the categorical variables are at their base level or when the numerical variables equal zero, the annual salary is predicted to be $102,860.9.

This is the table of the coefficients for all the age classes. Every coefficient is relative to the base level of 18-24 years old:

Terms Estimate
age25-34 37,131.8
age35-44 35,920.1
age45-54 5,681.7
age55-64 -35,070.0
age65 years or older -81,298.7

Interpretation of two coefficients as an example:

age25-34: A person between age 25 to 34 is predicted to earn $37,131.8 more than a person between 18 to 24 years old.

age55-64: A person between age 55 to 64 is predicted to earn $35,070 less than a person between 18 to 24 years old.

Table of the coefficients for all the different ed_level classes. Every coefficient is relative to the base level of receiving an Associate’s degree.

Terms Estimates
ed_levelBachelor's degree 17,589.3
ed_levelMaster's degree 36,470.9
ed_levelPrimary/elementary school 43,276.0
ed_levelProfessional degree 75,970.2
ed_levelSecondary school 5,316.4
ed_levelSome college 10,018.2

Interpretation of two coefficients as an example:

ed_levelBachelor's degree: A person who has a Bachelor’s degree is predicted to earn $17,589.3 more than a person with an Associate’s degree.

ed_levelMaster's degree: A person who has a Master’s degree is predicted to earn $36,470.9 more than a person with an Associate’s degree.

Table of the coefficients for all the different dev_type classes. Every coefficient is relative to the base level of a cloud infrastructure engineer.

Terms Estimates
dev_typeData engineer -42,285.5
dev_typeData scientist or machine learning specialist -18,937.9
dev_typeDeveloper, back-end -13,050.7
dev_typeDeveloper, desktop or enterprise applications -44,883.5
dev_typeDeveloper, embedded applications or devices -20,303.0
dev_typeDeveloper, front-end -41,236.1
dev_typeDeveloper, full-stack -39,978.0
dev_typeDeveloper, mobile -514.0
dev_typeDeveloper, QA or test -70,829.9
dev_typeDevOps specialist -26,766.8
dev_typeEngineer, site reliability -9,051.4
dev_typeEngineering manager 13,109.5
dev_typeResearch & Development role -40,209.7
dev_typeSenior Executive (C-Suite, VP, etc.) 81,090.6

Interpretation of two coefficients as an example:

dev_typeData engineer: A data engineer is predicted to earn $42,285.5 less than a cloud infrastructure engineer.

dev_typeData scientist or machine learning specialist: A data scientist or machine learning specialist is predicted to earn $18,937.9 less than a cloud infrastructure engineer.

Table of the coefficients for all the different org_size classes. Every coefficient is relative to the base level of an organization of 1,000 to 4,999 people.

Terms Estimates
org_size10 to 19 -47,557.7
org_size10,000 or more 21,602.5
org_size100 to 499 -27,499.6
org_size2 to 9 -54,576.0
org_size20 to 99 -38,365.0
org_size5,000 to 9,999 -7,291.4
org_size500 to 999 -27,632.5

Interpretation of two coefficients as an example:

org_size10 to 19: The annual salary of employees in an organization of 10 to 19 people is predicted to be $47,557.7 lower than that of employees in an organization of 1,000 to 4,999 people.

org_size10,000 or more: The annual salary of employees in an organization of 10,000 or more people is predicted to be $21,602.5 higher than that of employees in an organization of 1,000 to 4,999 people.

Table of the coefficients for all the different work_type classes. Every coefficient is relative to the base level of working hybrid.

Term Estimate
work_typeIn-person 10,517.1
work_typeRemote 7,919.1

work_typeIn-person: A person who works in-person is predicted to earn $10,517.1 more than a person who works hybrid.

work_typeRemote: A person who works remotely is predicted to earn $7,919.1 more than a person who works hybrid.

Table of the coefficients for all the different industry classes. Every coefficient is relative to the base level of the Banking/Financial Services industry.

Terms Estimates
industryComputer Systems Design and Services 37,456.8
industryEnergy -26,250.4
industryFintech 19,377.3
industryGovernment -41,201.2
industryHealthcare -19,990.3
industryHigher Education -56,497.7
industryInsurance 1,542.9
industryInternet, Telecomm or Information Services 4,192.0
industryManufacturing -44,766.9
industryMedia & Advertising Services 10,026.4
industryRetail and Consumer Services -2,511.6
industrySoftware Development 4,971.8
industryTransportation, or Supply Chain -24,978.4

Interpretation of two coefficients as an example:

industryComputer Systems Design and Services: A person working in the Computer Systems Design and Services industry is predicted to earn $37,456.8 more than someone working in the Banking/Financial Services industry.

industryEnergy: A person working in the Energy industry is predicted to earn $26,250.4 less than someone working in the Banking/Financial Services industry.

Numerical Coefficient Interpretations

years_code: As the number of years coding increases by 1, the annual salary is predicted to increase by $1,354.9.

years_code_pro: As the number of years coding professionally increases by 1, the annual salary is predicted to increase by $2,633.7.

work_exp: As the number of years someone worked increases by 1, the annual salary is predicted to decrease by $523.

job_sat: As the job satisfaction rating increases by 1, the annual salary is predicted to increase by $2,331.9.

F-Test Conclusion

The null hypothesis is that all regression coefficients (i.e., slopes) are zero. The alternate hypothesis is that at least one of the regression coefficients is non-zero. Since the p-value associated with the F-statistic (13.85, p < 2.2e-16) is less than the alpha level of 0.05, we can reject the null hypothesis and conclude at least one predictor has a non-zero relationship with annual salary.

R-squared Interpretation

The multiple R-squared is 0.2377, which means that 23.77% of the variation in annual salary is accounted for by all predictors in the model.

Residual Standard Error (RSE) Interpretation

The residual standard error of 98410 indicates that the actual salaries are expected to differ from the predicted salaries by around $98,410. Although this is a large average difference between the actual and predicted values, it may be reasonable as the data includes people who make many hundred thousand dollars.

Improving the Model

# Diagnostic plot
library(ggfortify)
Loading required package: ggplot2
autoplot(full_model)

In the residual vs. fitted plot, the line slopes down and deviates from the horizontal line as we move to the right, suggesting that the linearity assumption is violated. In the scale-location plot, the line curves upward, indicating that the data does not have constant variance. In the Q-Q plot, the data points on the far right of the diagonal line spikes up, suggesting that the data may be right-skewed instead of normally distributed. In general, our model violates the linearity, constant variance, and normality assumptions.

# Log transformed the dependent variable (annual salary)
survey_data$log_salary <- log(survey_data$converted_comp_yearly)

log_model <-  lm(log_salary ~ 
                        age + ed_level + years_code + years_code_pro +
                        dev_type + org_size + work_type + work_exp +
                        industry + job_sat, data = survey_data)
autoplot(log_model)

After applying a log transformation to the dependent variable, annual salary, the linear assumption is better met as the line is flatter towards the right of the plot compared to the previous model. The constant variance assumption has also improved, although the blue line still curves upward towards the right, indicating that the variance remains non-constant. The normality assumption, however, is still not fully met, as the data points deviate from the diagonal line on the far left and far right of the plot. We will proceed with using logarithmic annual salary as the dependent variable, as it demonstrates overall better performance than the original model.

# Replaced years_code with a spline
library(splines)
spline_model <- lm(log_salary ~ 
                   age + ed_level + bs(years_code, 3) + years_code_pro +
                   dev_type + org_size + work_type + work_exp +
                   industry + job_sat, data = survey_data)

# Comparing the original model and the model with a spline
anova(log_model, spline_model)
Analysis of Variance Table

Model 1: log_salary ~ age + ed_level + years_code + years_code_pro + dev_type + 
    org_size + work_type + work_exp + industry + job_sat
Model 2: log_salary ~ age + ed_level + bs(years_code, 3) + years_code_pro + 
    dev_type + org_size + work_type + work_exp + industry + job_sat
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1   2269 594.09                                  
2   2267 584.28  2    9.8118 19.035 6.338e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(log_model, spline_model)
             df      AIC
log_model    53 3529.849
spline_model 55 3495.196
BIC(log_model, spline_model)
             df      BIC
log_model    53 3834.586
spline_model 55 3811.433

Based on the results of the anova(), AIC(), and BIC(), including a spline on years_code makes the model perform better, yet we decided to discard the spline part in the subsequent analyses.

# Interactions between work_exp and job_sat
interaction_model <- lm(log_salary ~ 
                          age + ed_level + years_code + years_code_pro +
                          dev_type + org_size + work_type + industry +
                          work_exp * job_sat, data = survey_data)
summary(interaction_model)

Call:
lm(formula = log_salary ~ age + ed_level + years_code + years_code_pro + 
    dev_type + org_size + work_type + industry + work_exp * job_sat, 
    data = survey_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.8149 -0.2134 -0.0009  0.2251  2.3906 

Coefficients:
                                                        Estimate Std. Error
(Intercept)                                           11.1382291  0.1302411
age25-34                                               0.3276732  0.0440683
age35-44                                               0.2990231  0.0525832
age45-54                                               0.1019508  0.0698951
age55-64                                              -0.1238627  0.0912231
age65 years or older                                  -0.4125160  0.1542725
ed_levelBachelor's degree                              0.1080401  0.0543462
ed_levelMaster's degree                                0.2110533  0.0592032
ed_levelPrimary/elementary school                      0.3022907  0.2163207
ed_levelProfessional degree                            0.3916871  0.0831365
ed_levelSecondary school                              -0.2242134  0.0921178
ed_levelSome college                                   0.0475316  0.0617439
years_code                                             0.0045048  0.0027417
years_code_pro                                         0.0189257  0.0038356
dev_typeData engineer                                  0.0091639  0.1003839
dev_typeData scientist or machine learning specialist -0.0454047  0.1131081
dev_typeDeveloper, back-end                            0.1109343  0.0826878
dev_typeDeveloper, desktop or enterprise applications -0.0374935  0.0931764
dev_typeDeveloper, embedded applications or devices    0.0927870  0.1014681
dev_typeDeveloper, front-end                          -0.0155748  0.0945419
dev_typeDeveloper, full-stack                         -0.0081117  0.0813592
dev_typeDeveloper, mobile                              0.2072413  0.1039056
dev_typeDeveloper, QA or test                         -0.1232881  0.1378535
dev_typeDevOps specialist                              0.0880480  0.1154544
dev_typeEngineer, site reliability                     0.1928227  0.1285218
dev_typeEngineering manager                            0.2639323  0.0979176
dev_typeResearch & Development role                    0.0284766  0.1245240
dev_typeSenior Executive (C-Suite, VP, etc.)           0.4524348  0.1157187
org_size10 to 19                                      -0.2457127  0.0518131
org_size10,000 or more                                 0.0515788  0.0373160
org_size100 to 499                                    -0.1382886  0.0372661
org_size2 to 9                                        -0.4712614  0.0519849
org_size20 to 99                                      -0.2216290  0.0395314
org_size5,000 to 9,999                                -0.0443482  0.0516978
org_size500 to 999                                    -0.1382114  0.0491450
work_typeIn-person                                    -0.0248187  0.0374859
work_typeRemote                                        0.0524046  0.0247999
industryComputer Systems Design and Services          -0.0159543  0.0776642
industryEnergy                                        -0.1131632  0.0863520
industryFintech                                        0.0853433  0.0634125
industryGovernment                                    -0.1817692  0.0624905
industryHealthcare                                    -0.0622074  0.0602657
industryHigher Education                              -0.4881152  0.0736292
industryInsurance                                      0.0373403  0.0816460
industryInternet, Telecomm or Information Services    -0.0036748  0.0663399
industryManufacturing                                 -0.2364641  0.0652905
industryMedia & Advertising Services                   0.0431589  0.0740836
industryRetail and Consumer Services                  -0.0188821  0.0629557
industrySoftware Development                           0.0349963  0.0519628
industryTransportation, or Supply Chain               -0.1719717  0.0709703
work_exp                                               0.0047148  0.0048497
job_sat                                                0.0226266  0.0086020
work_exp:job_sat                                      -0.0005113  0.0005213
                                                      t value Pr(>|t|)    
(Intercept)                                            85.520  < 2e-16 ***
age25-34                                                7.436 1.47e-13 ***
age35-44                                                5.687 1.46e-08 ***
age45-54                                                1.459 0.144807    
age55-64                                               -1.358 0.174662    
age65 years or older                                   -2.674 0.007550 ** 
ed_levelBachelor's degree                               1.988 0.046932 *  
ed_levelMaster's degree                                 3.565 0.000372 ***
ed_levelPrimary/elementary school                       1.397 0.162424    
ed_levelProfessional degree                             4.711 2.61e-06 ***
ed_levelSecondary school                               -2.434 0.015010 *  
ed_levelSome college                                    0.770 0.441487    
years_code                                              1.643 0.100513    
years_code_pro                                          4.934 8.63e-07 ***
dev_typeData engineer                                   0.091 0.927271    
dev_typeData scientist or machine learning specialist  -0.401 0.688143    
dev_typeDeveloper, back-end                             1.342 0.179859    
dev_typeDeveloper, desktop or enterprise applications  -0.402 0.687433    
dev_typeDeveloper, embedded applications or devices     0.914 0.360580    
dev_typeDeveloper, front-end                           -0.165 0.869164    
dev_typeDeveloper, full-stack                          -0.100 0.920589    
dev_typeDeveloper, mobile                               1.995 0.046215 *  
dev_typeDeveloper, QA or test                          -0.894 0.371234    
dev_typeDevOps specialist                               0.763 0.445769    
dev_typeEngineer, site reliability                      1.500 0.133673    
dev_typeEngineering manager                             2.695 0.007081 ** 
dev_typeResearch & Development role                     0.229 0.819135    
dev_typeSenior Executive (C-Suite, VP, etc.)            3.910 9.51e-05 ***
org_size10 to 19                                       -4.742 2.24e-06 ***
org_size10,000 or more                                  1.382 0.167042    
org_size100 to 499                                     -3.711 0.000212 ***
org_size2 to 9                                         -9.065  < 2e-16 ***
org_size20 to 99                                       -5.606 2.32e-08 ***
org_size5,000 to 9,999                                 -0.858 0.391074    
org_size500 to 999                                     -2.812 0.004961 ** 
work_typeIn-person                                     -0.662 0.507986    
work_typeRemote                                         2.113 0.034701 *  
industryComputer Systems Design and Services           -0.205 0.837257    
industryEnergy                                         -1.310 0.190164    
industryFintech                                         1.346 0.178487    
industryGovernment                                     -2.909 0.003664 ** 
industryHealthcare                                     -1.032 0.302080    
industryHigher Education                               -6.629 4.20e-11 ***
industryInsurance                                       0.457 0.647468    
industryInternet, Telecomm or Information Services     -0.055 0.955830    
industryManufacturing                                  -3.622 0.000299 ***
industryMedia & Advertising Services                    0.583 0.560240    
industryRetail and Consumer Services                   -0.300 0.764260    
industrySoftware Development                            0.673 0.500706    
industryTransportation, or Supply Chain                -2.423 0.015464 *  
work_exp                                                0.972 0.331061    
job_sat                                                 2.630 0.008586 ** 
work_exp:job_sat                                       -0.981 0.326793    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5117 on 2268 degrees of freedom
Multiple R-squared:  0.2955,    Adjusted R-squared:  0.2793 
F-statistic: 18.29 on 52 and 2268 DF,  p-value: < 2.2e-16

We hypothesize that there is an interaction between job satisfaction and work experience that affects annual salary. The coefficient of -0.0006510 indicates that as the job satisfaction rating increases by 1, the effect of work experience on annual salary decreases by around $0.001. Although this is a non-significant interaction, it indicates that the more satisfied a person is with their job, the smaller the impact work experience has on increasing their annual salary.

Testing Your Hypotheses

Hypothesis 1: There is some relationship between salary and any of the predictors. Null: All of the regression coefficients in log_model are zero. Alt: At least one of the regression coefficients in log_model is non-zero. Hypothesis 2: We would expect education level to improve a model that already includes age, years_code, years_code_pro, dev_type, org_size, work_type, work_exp, industry, and job_sat. Null: The coefficients associated with the predictor ed_level is equal to 0, given that all other predictors are in the model. Alt: The coefficients associated with the predictor ed_level is not equal to 0, given that all other predictors are in the model. Hypothesis 3: Job satisfaction and years of work experience are useful for predicting log_salary if we are already using age, ed_level, years_code, years_code_pro, dev_type, org_size, work_type, and industry. Null: The reduced model with only age, ed_level, years_code, years_code_pro, dev_type, org_size, work_type, and industry as predictors is correct. Alt: The reduced model with only age, ed_level, years_code, years_code_pro, dev_type, org_size, work_type, and industry as predictors is not correct.

# Testing hypothesis 1
summary(log_model)

Call:
lm(formula = log_salary ~ age + ed_level + years_code + years_code_pro + 
    dev_type + org_size + work_type + work_exp + industry + job_sat, 
    data = survey_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.8107 -0.2156 -0.0021  0.2269  2.3976 

Coefficients:
                                                       Estimate Std. Error
(Intercept)                                           11.182113   0.122314
age25-34                                               0.328074   0.044066
age35-44                                               0.300485   0.052562
age45-54                                               0.102963   0.069887
age55-64                                              -0.125796   0.091201
age65 years or older                                  -0.422559   0.153931
ed_levelBachelor's degree                              0.108279   0.054345
ed_levelMaster's degree                                0.211302   0.059202
ed_levelPrimary/elementary school                      0.298349   0.216282
ed_levelProfessional degree                            0.390247   0.083123
ed_levelSecondary school                              -0.222876   0.092107
ed_levelSome college                                   0.046606   0.061736
years_code                                             0.004531   0.002742
years_code_pro                                         0.018892   0.003835
dev_typeData engineer                                  0.011900   0.100344
dev_typeData scientist or machine learning specialist -0.044977   0.113106
dev_typeDeveloper, back-end                            0.113391   0.082649
dev_typeDeveloper, desktop or enterprise applications -0.033963   0.093106
dev_typeDeveloper, embedded applications or devices    0.095100   0.101440
dev_typeDeveloper, front-end                          -0.012904   0.094502
dev_typeDeveloper, full-stack                         -0.006264   0.081337
dev_typeDeveloper, mobile                              0.209236   0.103885
dev_typeDeveloper, QA or test                         -0.124745   0.137844
dev_typeDevOps specialist                              0.089825   0.115439
dev_typeEngineer, site reliability                     0.193762   0.128517
dev_typeEngineering manager                            0.266541   0.097881
dev_typeResearch & Development role                    0.031112   0.124494
dev_typeSenior Executive (C-Suite, VP, etc.)           0.450783   0.115705
org_size10 to 19                                      -0.248619   0.051728
org_size10,000 or more                                 0.050359   0.037295
org_size100 to 499                                    -0.138257   0.037266
org_size2 to 9                                        -0.472151   0.051977
org_size20 to 99                                      -0.221478   0.039531
org_size5,000 to 9,999                                -0.044415   0.051697
org_size500 to 999                                    -0.137710   0.049142
work_typeIn-person                                    -0.026236   0.037458
work_typeRemote                                        0.052190   0.024799
work_exp                                               0.001068   0.003114
industryComputer Systems Design and Services          -0.011794   0.077548
industryEnergy                                        -0.111547   0.086336
industryFintech                                        0.086980   0.063390
industryGovernment                                    -0.180291   0.062472
industryHealthcare                                    -0.061325   0.060258
industryHigher Education                              -0.484163   0.073518
industryInsurance                                      0.040314   0.081589
industryInternet, Telecomm or Information Services    -0.001685   0.066308
industryManufacturing                                 -0.234418   0.065257
industryMedia & Advertising Services                   0.046454   0.074007
industryRetail and Consumer Services                  -0.017039   0.062927
industrySoftware Development                           0.036900   0.051926
industryTransportation, or Supply Chain               -0.169075   0.070908
job_sat                                                0.015748   0.004981
                                                      t value Pr(>|t|)    
(Intercept)                                            91.422  < 2e-16 ***
age25-34                                                7.445 1.37e-13 ***
age35-44                                                5.717 1.23e-08 ***
age45-54                                                1.473 0.140814    
age55-64                                               -1.379 0.167930    
age65 years or older                                   -2.745 0.006097 ** 
ed_levelBachelor's degree                               1.992 0.046444 *  
ed_levelMaster's degree                                 3.569 0.000366 ***
ed_levelPrimary/elementary school                       1.379 0.167893    
ed_levelProfessional degree                             4.695 2.83e-06 ***
ed_levelSecondary school                               -2.420 0.015609 *  
ed_levelSome college                                    0.755 0.450378    
years_code                                              1.653 0.098566 .  
years_code_pro                                          4.926 9.01e-07 ***
dev_typeData engineer                                   0.119 0.905611    
dev_typeData scientist or machine learning specialist  -0.398 0.690927    
dev_typeDeveloper, back-end                             1.372 0.170213    
dev_typeDeveloper, desktop or enterprise applications  -0.365 0.715311    
dev_typeDeveloper, embedded applications or devices     0.938 0.348598    
dev_typeDeveloper, front-end                           -0.137 0.891398    
dev_typeDeveloper, full-stack                          -0.077 0.938615    
dev_typeDeveloper, mobile                               2.014 0.044115 *  
dev_typeDeveloper, QA or test                          -0.905 0.365576    
dev_typeDevOps specialist                               0.778 0.436581    
dev_typeEngineer, site reliability                      1.508 0.131778    
dev_typeEngineering manager                             2.723 0.006516 ** 
dev_typeResearch & Development role                     0.250 0.802681    
dev_typeSenior Executive (C-Suite, VP, etc.)            3.896 0.000101 ***
org_size10 to 19                                       -4.806 1.64e-06 ***
org_size10,000 or more                                  1.350 0.177055    
org_size100 to 499                                     -3.710 0.000212 ***
org_size2 to 9                                         -9.084  < 2e-16 ***
org_size20 to 99                                       -5.603 2.37e-08 ***
org_size5,000 to 9,999                                 -0.859 0.390359    
org_size500 to 999                                     -2.802 0.005117 ** 
work_typeIn-person                                     -0.700 0.483736    
work_typeRemote                                         2.105 0.035441 *  
work_exp                                                0.343 0.731580    
industryComputer Systems Design and Services           -0.152 0.879135    
industryEnergy                                         -1.292 0.196482    
industryFintech                                         1.372 0.170155    
industryGovernment                                     -2.886 0.003939 ** 
industryHealthcare                                     -1.018 0.308932    
industryHigher Education                               -6.586 5.61e-11 ***
industryInsurance                                       0.494 0.621276    
industryInternet, Telecomm or Information Services     -0.025 0.979730    
industryManufacturing                                  -3.592 0.000335 ***
industryMedia & Advertising Services                    0.628 0.530266    
industryRetail and Consumer Services                   -0.271 0.786593    
industrySoftware Development                            0.711 0.477384    
industryTransportation, or Supply Chain                -2.384 0.017188 *  
job_sat                                                 3.162 0.001590 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5117 on 2269 degrees of freedom
Multiple R-squared:  0.2952,    Adjusted R-squared:  0.2793 
F-statistic: 18.63 on 51 and 2269 DF,  p-value: < 2.2e-16

For hypothesis 1, we used summary() and had a p-value less than 2.2e-16, which is much less than our cutoff value of 0.05. This means we can reject the null hypothesis for the overall significance of the model test. This also means there is an overwhelming amount of evidence that there is an obvious relationship between the predictors and log_salary or that at least one of the regression coefficients in our multiple linear regression is non-zero. Referring to the informal hypothesis, there is some relationship between salary and any of the predictors.

#Testing hypothesis 2
drop1(log_model, test = "F")
Single term deletions

Model:
log_salary ~ age + ed_level + years_code + years_code_pro + dev_type + 
    org_size + work_type + work_exp + industry + job_sat
               Df Sum of Sq    RSS     AIC F value    Pr(>F)    
<none>                      594.09 -3058.9                      
age             5    36.194 630.29 -2931.6 27.6466 < 2.2e-16 ***
ed_level        6    15.373 609.47 -3011.6  9.7859 1.136e-10 ***
years_code      1     0.715 594.81 -3058.1  2.7308   0.09857 .  
years_code_pro  1     6.353 600.45 -3036.2 24.2631 9.009e-07 ***
dev_type       14    19.328 613.42 -3012.6  5.2726 5.895e-10 ***
org_size        7    39.509 633.60 -2923.4 21.5563 < 2.2e-16 ***
work_type       2     1.852 595.95 -3055.6  3.5359   0.02929 *  
work_exp        1     0.031 594.12 -3060.7  0.1177   0.73158    
industry       13    34.104 628.20 -2955.3 10.0195 < 2.2e-16 ***
job_sat         1     2.617 596.71 -3050.7  9.9956   0.00159 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For hypothesis 2, we performed a drop1() test, and ed_level had a p-value of 1.616e-11, which is much less than our cutoff value of 0.05. This means we can reject the null hypothesis and conclude that the coefficients associated with ed_level are not equal to 0, given that all other predictors are in the model. Referring back to the informal hypothesis, this means that education level significantly improves a model that already includes age, years_code, years_code_pro, dev_type, org_size, work_type, work_exp, industry, and job_sat.

#Testing hypothesis 3
library(broom)
reduced_log_model <- lm(log_salary ~ 
                          age + ed_level + years_code + years_code_pro +
                          dev_type + org_size + work_type +
                          industry, data = survey_data)
tidy(anova(reduced_log_model, log_model))
# A tibble: 2 × 7
  term                          df.residual   rss    df sumsq statistic  p.value
  <chr>                               <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
1 log_salary ~ age + ed_level …        2271  597.    NA NA        NA    NA      
2 log_salary ~ age + ed_level …        2269  594.     2  2.67      5.09  0.00623

For hypothesis 3, we tested a nested hypothesis using anova() and had a p-value of 6.18e-5 which is less than our cutoff value of 0.05. The small p-value associated with our F-statistic suggests that the additional predictors in the full model (job_sat and work_exp) have significant explanatory power. We can reject the null hypothesis and conclude that the reduced model with only age, ed_level, years_code, years_code_pro, dev_type, org_size, work_type, and industry as predictors is not correct. Referring back to the informal hypothesis, job satisfaction and years of work experience are useful for predicting log_salary if we are already using age, ed_level, years_code, years_code_pro, dev_type, org_size, work_type, and industry.

The only limitation to the conclusion is that we didn’t use the exponentiated and actual values of salary, converted_comp_yearly; we took the log of it and made log_salary to meet more assumptions.

Robustness of Results

# Residuals vs Leverage plot
autoplot(log_model, which = 5)

In the residuals vs. leverage plot, observations 1725 and 1811 have very low standardized residuals, meaning they have extreme residuals/outlying salaries. 57 is on the far right of the plot, indicating that it has high leverage.

# Cook’s Distance
library(car)
Loading required package: carData
influencePlot(log_model)

         StudRes        Hat        CookD
669   -0.4550808 0.18370209 0.0008965839
875  -14.7535122 0.04016286 0.1598844716
995    1.4750644 0.18199848 0.0093048047
1027 -18.9412550 0.04331690 0.2698458007

Looking at the plot for Cook’s distance, observations 1815 and 2295 are the most influential points with low studentized residuals, followed by 58, 995, and 669, which are influential points with high leverage.

# Features of the influential points
survey_data[c(1811, 2291, 57, 992, 668), ]
       age           ed_level years_code years_code_pro              dev_type
1811 35-44    Master's degree         38             12 Developer, full-stack
2291 25-34  Bachelor's degree         10              8 Developer, full-stack
57   45-54 Associate's degree         10              8 Developer, full-stack
992  25-34  Bachelor's degree         20             12 Developer, full-stack
668  25-34       Some college         10              3     Developer, mobile
           org_size work_type work_exp
1811 10,000 or more    Remote       12
2291     100 to 499    Remote        8
57       100 to 499 In-person        8
992          2 to 9    Hybrid       12
668  10,000 or more    Remote       20
                                       industry job_sat converted_comp_yearly
1811                 Banking/Financial Services       5                196000
2291                                 Healthcare       8                170000
57                             Higher Education       1                 80000
992                        Software Development       9                150000
668  Internet, Telecomm or Information Services       7                140000
     log_salary
1811   12.18587
2291   12.04355
57     11.28978
992    11.91839
668    11.84940

Observations 58, 995, and 669 are high-leverage and influential points. All three individuals have a primary/elementary school education, work as full-stack developers, and earn over $100,000. Their high leverage is likely due to the rarity of earning such a high salary among others with similar education levels. Observations 1815 and 2295, on the other hand, earn less than $10,000 annually—an uncommon occurrence in the software development industry. Even though this might be partially explained by their employment in smaller organizations, their actual annual salary is much lower than their predicted annual salary, which may mean they still make less than those with similar backgrounds.

# DFFITS and plot
salary_dffits <- dffits(log_model)
salary_rstandard <- dffits(log_model)
plot(salary_dffits, salary_rstandard)

# Identify "weird" DFFITS
subset(survey_data, dffits(log_model) < -1.5)
       age          ed_level years_code years_code_pro
46   25-34 Bachelor's degree         19              8
875  35-44 Bachelor's degree          7              7
910  35-44   Master's degree         10              7
1027 18-24  Secondary school         12              1
                                          dev_type       org_size work_type
46                             Developer, back-end       20 to 99    Hybrid
875                  Cloud infrastructure engineer 10,000 or more    Remote
910  Data scientist or machine learning specialist 10,000 or more    Remote
1027                         Developer, full-stack         2 to 9    Hybrid
     work_exp                             industry job_sat
46          8                              Fintech       7
875        12      Transportation, or Supply Chain       7
910         8 Computer Systems Design and Services       8
1027        1                     Higher Education       8
     converted_comp_yearly log_salary
46                     250   5.521461
875                    115   4.744932
910                    300   5.703782
1027                     4   1.386294

The plot reveals one “weird” DFFITS value, observation 1815. This observation strongly influences the overall prediction and has an extremely low studentized residual. This indicates that the 1811 observation had an unusually low annual salary, and the model significantly overpredicted their earnings.

# Calculate VIF
vif(log_model)
                    GVIF Df GVIF^(1/(2*Df))
age             5.473238  5        1.185289
ed_level        1.351843  6        1.025441
years_code      7.590940  1        2.755166
years_code_pro 11.912990  1        3.451520
dev_type        1.702494 14        1.019185
org_size        1.347328  7        1.021523
work_type       1.229707  2        1.053053
work_exp        9.105905  1        3.017599
industry        1.532221 13        1.016548
job_sat         1.049900  1        1.024646

The output shows that the variable years of coding professionally has the largest VIF, followed by years of work experience, years of coding, and age. This suggests that these variables are highly correlated with each other, indicating issues with multicollinearity in the model.

# Calculate PRESS
library(DAAG)

Attaching package: 'DAAG'
The following object is masked from 'package:car':

    vif
press(log_model)
[1] 628.9997

The PRESS statistic of our model is 395.594.

# LOO R-squared
loo_predictions <- function(fit) {
  e <- residuals(fit)
  h <- hatvalues(fit)
  Y <- fitted(fit) + e
  return(Y - e / (1 - h))
}

loo_pred <- loo_predictions(log_model)
cor(loo_pred, survey_data$log_salary)^2
[1] 0.2552413

Our model does not overfit, as the LOO R-squared of around 0.3442 suggests that only around 34.42% of the variation in the logged annual salary is accounted for by the predictors of our model. This indicates that the model’s complexity is not excessive.

Automatic Model Selection

# Backward selection
salary_step <- step(log_model, direction = "backward")
Start:  AIC=-3058.86
log_salary ~ age + ed_level + years_code + years_code_pro + dev_type + 
    org_size + work_type + work_exp + industry + job_sat

                 Df Sum of Sq    RSS     AIC
- work_exp        1     0.031 594.12 -3060.7
<none>                        594.09 -3058.9
- years_code      1     0.715 594.81 -3058.1
- work_type       2     1.852 595.95 -3055.6
- job_sat         1     2.617 596.71 -3050.7
- years_code_pro  1     6.353 600.45 -3036.2
- dev_type       14    19.328 613.42 -3012.6
- ed_level        6    15.373 609.47 -3011.6
- industry       13    34.104 628.20 -2955.3
- age             5    36.194 630.29 -2931.6
- org_size        7    39.509 633.60 -2923.4

Step:  AIC=-3060.74
log_salary ~ age + ed_level + years_code + years_code_pro + dev_type + 
    org_size + work_type + industry + job_sat

                 Df Sum of Sq    RSS     AIC
<none>                        594.12 -3060.7
- years_code      1     0.758 594.88 -3059.8
- work_type       2     1.864 595.99 -3057.5
- job_sat         1     2.635 596.76 -3052.5
- years_code_pro  1     8.945 603.07 -3028.1
- dev_type       14    19.382 613.51 -3014.2
- ed_level        6    15.380 609.50 -3013.4
- industry       13    34.106 628.23 -2957.2
- age             5    36.620 630.75 -2931.9
- org_size        7    39.496 633.62 -2925.4

The final model selected by backward selection removes work_exp, the years of work experience.

In general, automatic model selection only focuses on selecting a model that optimizes criteria such as AIC. However, this approach may include mediator variables in the final model simply because they are strongly correlated with the outcome variable. It may also lead to confounding biases by omitting important predictors that influence the outcome but were removed because they did not appear statistically significant.

The p-values reported in the model summary can be misleading, as they do not account for the fact that model selection was performed. As a result, the p-values may mostly, or even entirely, appear statistically significant, reflecting the model selection’s intent to identify a model that shows strong statistical significance with the predicted outcome.

This is the summary output from out final selected model:

summary(salary_step)

Call:
lm(formula = log_salary ~ age + ed_level + years_code + years_code_pro + 
    dev_type + org_size + work_type + industry + job_sat, data = survey_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.8129 -0.2151 -0.0019  0.2252  2.3975 

Coefficients:
                                                       Estimate Std. Error
(Intercept)                                           11.182634   0.122280
age25-34                                               0.330274   0.043588
age35-44                                               0.305929   0.050100
age45-54                                               0.111617   0.065163
age55-64                                              -0.113224   0.083499
age65 years or older                                  -0.406548   0.146657
ed_levelBachelor's degree                              0.107653   0.054304
ed_levelMaster's degree                                0.210612   0.059157
ed_levelPrimary/elementary school                      0.298224   0.216239
ed_levelProfessional degree                            0.387834   0.082809
ed_levelSecondary school                              -0.222433   0.092080
ed_levelSome college                                   0.046628   0.061724
years_code                                             0.004636   0.002724
years_code_pro                                         0.019537   0.003342
dev_typeData engineer                                  0.012151   0.100322
dev_typeData scientist or machine learning specialist -0.044798   0.113083
dev_typeDeveloper, back-end                            0.112856   0.082618
dev_typeDeveloper, desktop or enterprise applications -0.034132   0.093087
dev_typeDeveloper, embedded applications or devices    0.093731   0.101342
dev_typeDeveloper, front-end                          -0.013869   0.094442
dev_typeDeveloper, full-stack                         -0.006779   0.081307
dev_typeDeveloper, mobile                              0.208691   0.103853
dev_typeDeveloper, QA or test                         -0.124954   0.137816
dev_typeDevOps specialist                              0.089199   0.115402
dev_typeEngineer, site reliability                     0.193572   0.128491
dev_typeEngineering manager                            0.266497   0.097862
dev_typeResearch & Development role                    0.030872   0.124468
dev_typeSenior Executive (C-Suite, VP, etc.)           0.451766   0.115647
org_size10 to 19                                      -0.248609   0.051718
org_size10,000 or more                                 0.050529   0.037284
org_size100 to 499                                    -0.138111   0.037256
org_size2 to 9                                        -0.471956   0.051963
org_size20 to 99                                      -0.221187   0.039514
org_size5,000 to 9,999                                -0.043926   0.051668
org_size500 to 999                                    -0.137134   0.049104
work_typeIn-person                                    -0.026155   0.037450
work_typeRemote                                        0.052414   0.024785
industryComputer Systems Design and Services          -0.011676   0.077532
industryEnergy                                        -0.111394   0.086318
industryFintech                                        0.087278   0.063372
industryGovernment                                    -0.180344   0.062460
industryHealthcare                                    -0.061592   0.060242
industryHigher Education                              -0.483785   0.073496
industryInsurance                                      0.039871   0.081563
industryInternet, Telecomm or Information Services    -0.001763   0.066295
industryManufacturing                                 -0.233881   0.065225
industryMedia & Advertising Services                   0.047340   0.073947
industryRetail and Consumer Services                  -0.016760   0.062910
industrySoftware Development                           0.037185   0.051909
industryTransportation, or Supply Chain               -0.169097   0.070894
job_sat                                                0.015795   0.004978
                                                      t value Pr(>|t|)    
(Intercept)                                            91.451  < 2e-16 ***
age25-34                                                7.577 5.11e-14 ***
age35-44                                                6.106 1.20e-09 ***
age45-54                                                1.713 0.086866 .  
age55-64                                               -1.356 0.175234    
age65 years or older                                   -2.772 0.005615 ** 
ed_levelBachelor's degree                               1.982 0.047554 *  
ed_levelMaster's degree                                 3.560 0.000378 ***
ed_levelPrimary/elementary school                       1.379 0.167987    
ed_levelProfessional degree                             4.683 2.99e-06 ***
ed_levelSecondary school                               -2.416 0.015786 *  
ed_levelSome college                                    0.755 0.450069    
years_code                                              1.702 0.088837 .  
years_code_pro                                          5.846 5.76e-09 ***
dev_typeData engineer                                   0.121 0.903609    
dev_typeData scientist or machine learning specialist  -0.396 0.692029    
dev_typeDeveloper, back-end                             1.366 0.172075    
dev_typeDeveloper, desktop or enterprise applications  -0.367 0.713900    
dev_typeDeveloper, embedded applications or devices     0.925 0.355118    
dev_typeDeveloper, front-end                           -0.147 0.883259    
dev_typeDeveloper, full-stack                          -0.083 0.933564    
dev_typeDeveloper, mobile                               2.009 0.044603 *  
dev_typeDeveloper, QA or test                          -0.907 0.364675    
dev_typeDevOps specialist                               0.773 0.439641    
dev_typeEngineer, site reliability                      1.507 0.132077    
dev_typeEngineering manager                             2.723 0.006515 ** 
dev_typeResearch & Development role                     0.248 0.804132    
dev_typeSenior Executive (C-Suite, VP, etc.)            3.906 9.64e-05 ***
org_size10 to 19                                       -4.807 1.63e-06 ***
org_size10,000 or more                                  1.355 0.175480    
org_size100 to 499                                     -3.707 0.000215 ***
org_size2 to 9                                         -9.082  < 2e-16 ***
org_size20 to 99                                       -5.598 2.43e-08 ***
org_size5,000 to 9,999                                 -0.850 0.395321    
org_size500 to 999                                     -2.793 0.005270 ** 
work_typeIn-person                                     -0.698 0.485002    
work_typeRemote                                         2.115 0.034563 *  
industryComputer Systems Design and Services           -0.151 0.880308    
industryEnergy                                         -1.291 0.197005    
industryFintech                                         1.377 0.168572    
industryGovernment                                     -2.887 0.003921 ** 
industryHealthcare                                     -1.022 0.306692    
industryHigher Education                               -6.582 5.73e-11 ***
industryInsurance                                       0.489 0.625005    
industryInternet, Telecomm or Information Services     -0.027 0.978789    
industryManufacturing                                  -3.586 0.000343 ***
industryMedia & Advertising Services                    0.640 0.522118    
industryRetail and Consumer Services                   -0.266 0.789951    
industrySoftware Development                            0.716 0.473853    
industryTransportation, or Supply Chain                -2.385 0.017152 *  
job_sat                                                 3.173 0.001530 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5116 on 2270 degrees of freedom
Multiple R-squared:  0.2951,    Adjusted R-squared:  0.2796 
F-statistic: 19.01 on 50 and 2270 DF,  p-value: < 2.2e-16

Conclusion

In all, our multiple regression model identified several significant predictiors of salary, including age, education level, job satisfaction, job type, and organizational size. Together these predictors suggest meaningful relationships between salary. Most interesting, perhaps, is the statistically significant relationship between job satisfaction and salary, which provides evidence that software engineers in the United States can both enjoy their work and be well-paid. Education is also one of the largest statistically significant predictors, showing significant benefits to pursuing a master or doctoral degree.

Notably, however, the model’s r-squared is lower than ideal at 29.5%. Some potential areas for improvement are checking for multicollinearity between the model’s predictors, accounting for non-linear relationships between the predictors and log_salary, and searching for possible confounders that hurt the model’s predictive power.

Regardless, the model provides unique insight into the state of the software engineering job market in the United States, and highlights the impact of demographic, educational, and workplace factors on salary trends within the industry.