# Packages needed
library(tidyverse)
Pay equality is a prominent issue in our society right now, and the gender pay gap in particular is a talking point for proponents of feminism. That is no surprise: everyone deserves to be paid fairly for their work. However, I suspect gender might not be the only variable correlated with how much someone earns, particularly in the tech industry. Even if my impression is wrong, it is certainly a conversation worth having and exploring. In the spirit of that notion, my research question is this:
Are work experience, gender, and the ability to code full-stack (as opposed to front-end or back-end only) predictive of income for professional developers in the US?
The highly specific nature of the question is a consequence of the data I plan to use (and the limitations of my skills). I will use data from Stack Overflow’s excellent annual developer survey, specifically the 2022 survey, as it is the most recent one. The survey can be downloaded manually here, and its methodology can be found here.
The survey data is first downloaded into the working directory, and then it is read into the R environment. It is unfortunate that the data has to be downloaded. The main CSV file is inside a zipped folder, so reading it in directly does not seem to be an option. I tried to upload the file to my GitHub, but GitHub does not allow files this large to be hosted, and I couldn’t figure out how to make a temporary file work. Regardless, the data is successfully imported into R, and ultimately, that’s the main concern.
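One possible way to make the temporary-file idea work is sketched below. This is a minimal sketch, not what was run for this analysis: `read_csv_from_zip` is a hypothetical helper, and it assumes the archive contains a member with exactly the name passed in. The zip is saved to a temporary file, only the requested CSV is extracted into the session's temporary directory, and nothing lands in the working directory.

```r
library(readr)

# Hypothetical helper (not part of the original analysis):
# download a zip archive to a temporary file and read one CSV member from it
read_csv_from_zip <- function(zip_url, member) {
  zip_path <- tempfile(fileext = ".zip")
  # mode = "wb" keeps the binary archive intact, which matters on Windows
  download.file(zip_url, zip_path, mode = "wb")
  # unzip() returns the path(s) of the extracted file(s)
  csv_path <- unzip(zip_path, files = member, exdir = tempdir())
  read_csv(csv_path, show_col_types = FALSE)
}

# Possible usage with the survey archive URL used in the loading step (not run here):
# df <- read_csv_from_zip(url, "survey_results_public.csv")
```

Temporary files are cleaned up when the R session ends, so this avoids leaving `data.zip` and the extracted CSVs behind.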
# Load data
url <- 'https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2022.zip'
download.file(url, './data.zip')
unzip('./data.zip')
df <- read_csv('./survey_results_public.csv')
schema <- read_csv('./survey_results_schema.csv') # Useful for getting details on columns
glimpse(df)
## Rows: 73,268
## Columns: 79
## $ ResponseId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, …
## $ MainBranch <chr> "None of these", "I am a developer by p…
## $ Employment <chr> NA, "Employed, full-time", "Employed, f…
## $ RemoteWork <chr> NA, "Fully remote", "Hybrid (some remot…
## $ CodingActivities <chr> NA, "Hobby;Contribute to open-source pr…
## $ EdLevel <chr> NA, NA, "Master’s degree (M.A., M.S., M…
## $ LearnCode <chr> NA, NA, "Books / Physical media;Friend …
## $ LearnCodeOnline <chr> NA, NA, "Technical documentation;Blogs;…
## $ LearnCodeCoursesCert <chr> NA, NA, NA, NA, NA, NA, NA, "Coursera;U…
## $ YearsCode <chr> NA, NA, "14", "20", "8", "15", "3", "1"…
## $ YearsCodePro <chr> NA, NA, "5", "17", "3", NA, NA, NA, "6"…
## $ DevType <chr> NA, NA, "Data scientist or machine lear…
## $ OrgSize <chr> NA, NA, "20 to 99 employees", "100 to 4…
## $ PurchaseInfluence <chr> NA, NA, "I have some influence", "I hav…
## $ BuyNewTool <chr> NA, NA, NA, "Other (please specify):", …
## $ Country <chr> NA, "Canada", "United Kingdom of Great …
## $ Currency <chr> NA, "CAD\tCanadian dollar", "GBP\tPound…
## $ CompTotal <dbl> NA, NA, 32000, 60000, NA, NA, NA, NA, 4…
## $ CompFreq <chr> NA, NA, "Yearly", "Monthly", NA, NA, NA…
## $ LanguageHaveWorkedWith <chr> NA, "JavaScript;TypeScript", "C#;C++;HT…
## $ LanguageWantToWorkWith <chr> NA, "Rust;TypeScript", "C#;C++;HTML/CSS…
## $ DatabaseHaveWorkedWith <chr> NA, NA, "Microsoft SQL Server", "Micros…
## $ DatabaseWantToWorkWith <chr> NA, NA, "Microsoft SQL Server", "Micros…
## $ PlatformHaveWorkedWith <chr> NA, NA, NA, NA, "Firebase;Microsoft Azu…
## $ PlatformWantToWorkWith <chr> NA, NA, NA, NA, "Firebase;Microsoft Azu…
## $ WebframeHaveWorkedWith <chr> NA, NA, "Angular.js", "ASP.NET;ASP.NET …
## $ WebframeWantToWorkWith <chr> NA, NA, "Angular;Angular.js", "ASP.NET;…
## $ MiscTechHaveWorkedWith <chr> NA, NA, "Pandas", ".NET", ".NET", NA, N…
## $ MiscTechWantToWorkWith <chr> NA, NA, ".NET", ".NET", ".NET;Apache Ka…
## $ ToolsTechHaveWorkedWith <chr> NA, NA, NA, NA, "npm", "Homebrew", "Hom…
## $ ToolsTechWantToWorkWith <chr> NA, NA, NA, NA, "Docker;Kubernetes", "H…
## $ NEWCollabToolsHaveWorkedWith <chr> NA, NA, "Notepad++;Visual Studio", "Not…
## $ NEWCollabToolsWantToWorkWith <chr> NA, NA, "Notepad++;Visual Studio", "Not…
## $ `OpSysProfessional use` <chr> NA, "macOS", "Windows", "Windows", "Win…
## $ `OpSysPersonal use` <chr> NA, "Windows Subsystem for Linux (WSL)"…
## $ VersionControlSystem <chr> NA, "Git", "Git", "Git", "Git;Other (pl…
## $ VCInteraction <chr> NA, NA, "Code editor", "Code editor;Com…
## $ `VCHostingPersonal use` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ `VCHostingProfessional use` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ OfficeStackAsyncHaveWorkedWith <chr> NA, NA, NA, "Jira Work Management;Trell…
## $ OfficeStackAsyncWantToWorkWith <chr> NA, NA, NA, "Jira Work Management;Trell…
## $ OfficeStackSyncHaveWorkedWith <chr> NA, NA, "Microsoft Teams", "Slack;Zoom"…
## $ OfficeStackSyncWantToWorkWith <chr> NA, NA, "Microsoft Teams", "Slack;Zoom"…
## $ Blockchain <chr> NA, "Very unfavorable", "Very unfavorab…
## $ NEWSOSites <chr> NA, "Collectives on Stack Overflow;Stac…
## $ SOVisitFreq <chr> NA, "Daily or almost daily", "Multiple …
## $ SOAccount <chr> NA, "Yes", "Yes", "Yes", "Yes", "Yes", …
## $ SOPartFreq <chr> NA, "Daily or almost daily", "Multiple …
## $ SOComm <chr> NA, "Not sure", "Neutral", "Yes, defini…
## $ Age <chr> NA, NA, "25-34 years old", "35-44 years…
## $ Gender <chr> NA, NA, "Man", "Man", NA, "Or, in your …
## $ Trans <chr> NA, NA, "No", "No", NA, "Or, in your ow…
## $ Sexuality <chr> NA, NA, "Bisexual", "Straight / Heteros…
## $ Ethnicity <chr> NA, NA, "White", "White", NA, "Or, in y…
## $ Accessibility <chr> NA, NA, "None of the above", "None of t…
## $ MentalHealth <chr> NA, NA, "I have a mood or emotional dis…
## $ TBranch <chr> NA, "No", "No", "No", "No", NA, NA, NA,…
## $ ICorPM <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Indepe…
## $ WorkExp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 6, NA, …
## $ Knowledge_1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Agree"…
## $ Knowledge_2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Disagr…
## $ Knowledge_3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Agree"…
## $ Knowledge_4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Agree"…
## $ Knowledge_5 <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Agree"…
## $ Knowledge_6 <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Agree"…
## $ Knowledge_7 <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Disagr…
## $ Frequency_1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, "3-5 ti…
## $ Frequency_2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, "3-5 ti…
## $ Frequency_3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Never"…
## $ TimeSearching <chr> NA, NA, NA, NA, NA, NA, NA, NA, "15-30 …
## $ TimeAnswering <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Over 1…
## $ Onboarding <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Somewh…
## $ ProfessionalTech <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Inners…
## $ TrueFalse_1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Yes", …
## $ TrueFalse_2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Yes", …
## $ TrueFalse_3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Yes", …
## $ SurveyLength <chr> NA, "Too long", "Appropriate in length"…
## $ SurveyEase <chr> NA, "Difficult", "Neither easy nor diff…
## $ ConvertedCompYearly <dbl> NA, NA, 40205, 215232, NA, NA, NA, NA, …
# It's obvious US will be among the top, so no need to use string searches to get exact value
df %>%
count(Country) %>%
arrange(desc(n))
# Checking the exact value for the category I am looking for: professional developer
df %>%
distinct(MainBranch)
# Subsetting the data to get the relevant columns
# The schema was referenced to do this part: View(schema)
df_subset <- df %>%
filter(Country == 'United States of America', MainBranch == 'I am a developer by profession') %>%
select(WorkExp, Gender, DevType, CompTotal, CompFreq) %>%
drop_na()
I have limited the analysis to professional developers because the research question is about income in the tech industry. Considering only professionals when assessing and predicting income seems reasonable. Additionally, the analysis is limited to developers in the US for two reasons. The first is to reduce the data to a more manageable size, as the original data contains 73,268 observations. The second is that restricting the data to the US makes currency conversion unnecessary.
First, I would like to look at all the values of each variable.
# WorkExp, Gender, DevType, CompTotal, CompFreq are the columns
df_subset %>%
count(WorkExp, sort = TRUE)
df_subset %>%
count(Gender, sort = TRUE)
df_subset %>%
count(DevType, sort = TRUE)
df_subset %>%
count(CompTotal, sort = TRUE)
df_subset %>%
count(CompFreq, sort = TRUE)
Gender variable
Looking at all the values of the Gender variable, it
seems clear that some of the levels of this variable are intentionally
ambiguous, perhaps to protect privacy. Beyond that, it is not
practical to include too many levels of this variable in the final
linear model, especially considering that the vast majority of the
respondents belong to only two of the levels (“Man” and “Woman”). So, I
will collapse the rest of the responses into one category called
“Other”.
df_subset <- df_subset %>%
mutate(gender_collapsed = if_else(Gender %in% c('Man', 'Woman'), Gender, 'Other'))
DevType variable
Additionally, there seem to be many developer types in the
DevType variable, with some overlap between the types.
This is not ideal for answering the research question, which
is concerned with the income dichotomy between full-stack and
front-end/back-end developers. For example, consider the response level
“Developer, full-stack;Developer, back-end”. The respondent clearly
performs full-stack development alongside back-end development. In other
words, the respondent is a full-stack developer, and adding back-end to the
response gives no new information that is useful for my objective. So,
let’s explore DevType further to see if it’s
possible to sort the observations neatly
into one of two mutually exclusive categories: “full-stack” or “not
full-stack”. The first step is to filter out all
non-developers. Then, the other logical tests and the recoding
can be performed to get the desired result.
# Getting rid of non-developers
df_subset <- df_subset %>%
filter(str_detect(string = .$DevType,
pattern = '[Dd]evelop'))
# Categorizing the developers as either "full-stack" or "not full-stack"
# First identifying all full-stack observations
df_subset <- df_subset %>%
mutate(stack_info = if_else(
condition = str_detect(.$DevType, regex('full-stack', ignore_case = TRUE)),
true = 'full-stack',
false = 'unknown'))
# Then identifying the front/back-end devs as not full-stack and filtering out the rest
df_not_full_stack <- df_subset %>%
filter(stack_info == 'unknown') %>%
mutate(stack_info = if_else(
condition = str_detect(.$DevType, regex('(front-end|back-end)')),
true = 'not full-stack',
false = 'other')) %>%
filter(stack_info == 'not full-stack')
# Finally, filtering df_subset so that only full-stack observations remain
# And row-binding that output and df_not_full_stack
# To get all the observations with full-stack and not full-stack designations
# Ultimately, I have categorized devs as full-stack if and only if full-stack appears under DevType
# And I have categorized devs as not full-stack if and only if
# They have either front-end or back-end under DevType
df_subset <- df_subset %>%
filter(stack_info == 'full-stack') %>%
bind_rows(., df_not_full_stack)
# Checking that the long process actually worked
df_subset %>%
count(stack_info, sort = TRUE)
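The same categorization rule could also be expressed in a single pass with `case_when()`, which avoids splitting and re-binding the data. The sketch below is an alternative formulation under the same rule (full-stack wins; otherwise front-end or back-end counts as not full-stack; everything else is dropped). The `toy` data frame is made up for illustration and is not the survey data.

```r
library(dplyr)
library(stringr)

# Made-up DevType responses for illustration only
toy <- tibble(DevType = c(
  "Developer, full-stack;Developer, back-end",
  "Developer, front-end",
  "Developer, back-end;DevOps specialist",
  "Developer, mobile",
  "Data scientist or machine learning specialist"
))

classified <- toy %>%
  mutate(stack_info = case_when(
    # Conditions are checked in order, so full-stack takes priority
    str_detect(DevType, regex("full-stack", ignore_case = TRUE)) ~ "full-stack",
    str_detect(DevType, "front-end|back-end") ~ "not full-stack",
    TRUE ~ NA_character_
  )) %>%
  # Rows matching neither rule are dropped, mirroring the filtering above
  filter(!is.na(stack_info))

classified %>% count(stack_info, sort = TRUE)
```

Because the conditions are evaluated top to bottom, a response that lists both full-stack and back-end is labeled full-stack, which matches the rule described in the comments above.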
CompTotal variable by CompFreq
The next task is adjusting CompTotal by
CompFreq. In other words, if the value under CompFreq is “Monthly” or
“Weekly”, then the CompTotal
value must be multiplied by 12 or 52, respectively, to get an annual figure.
# Creating a comp_weight column: the number of pay periods per year implied by CompFreq
# Vectorized with case_when(); anything other than Monthly or Weekly gets a weight of 1
df_subset <- df_subset %>%
mutate(comp_weight = case_when(
CompFreq == 'Monthly' ~ 12,
CompFreq == 'Weekly' ~ 52,
TRUE ~ 1))
# Checking that it worked
df_subset %>% count(comp_weight, sort = TRUE)
df_subset %>% count(CompFreq, sort = TRUE)
# Modifying the CompTotal by comp_weight
df_subset <- df_subset %>%
mutate(comp_adj = CompTotal * comp_weight)
comp_adj
The last task before the data is ready for building a multiple regression model is removing observations with abnormally high income. This removes potentially false reports as well as extreme outliers. Removing outliers is not recommended under normal circumstances, but this analysis is meant to reflect typical societal trends relating income to certain other potentially significant variables. Billionaires are by definition somewhat beyond that conversation about the norm. So, I think there is a valid defense in this case for being less rigid about dropping certain observations with extreme compensation values.
summary(df_subset$comp_adj) # Max value is about 12 trillion, so clearly there are some problematic outliers
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 1.050e+05 1.450e+05 2.572e+09 1.950e+05 1.200e+13
df_subset %>%
filter(comp_adj >= 1000000000) # That one observation is the only one with over 1 billion annual income
# Dropping that one outlier
df_subset <- df_subset %>%
filter(comp_adj < 1000000000)
So, only one observation is dropped for having annual income of over
1 billion dollars. Now, the linear model can be built using the
predictor variables WorkExp, gender_collapsed,
and stack_info and the response variable
comp_adj.
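Returning briefly to the outlier step: the billion-dollar cutoff is a judgment call, and one way to probe such a choice is to inspect the upper quantiles and compare them with a conventional 1.5 × IQR fence. The sketch below uses a made-up vector `toy_comp` (not survey data), and the fence is a common rule of thumb, not the method used in this analysis.

```r
# Made-up annual compensation values for illustration only
toy_comp <- c(0, 60000, 95000, 105000, 120000, 145000, 160000, 195000,
              250000, 36000000, 1.2e13)

# Upper quantiles show where the bulk of the data ends
quantile(toy_comp, probs = c(0.5, 0.9, 0.99))

# Tukey-style upper fence: third quartile plus 1.5 times the interquartile range
q <- quantile(toy_comp, probs = c(0.25, 0.75), names = FALSE)
upper_fence <- q[2] + 1.5 * (q[2] - q[1])

# Values beyond the fence are flagged for review, not dropped automatically
flagged <- toy_comp[toy_comp > upper_fence]
flagged
```

In this toy vector, both of the two largest values fall beyond the fence, while a fixed cutoff of 1 billion would catch only the largest one.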
In this project, the process of developing a linear model will be carried out in accordance with Chapter 9 of the excellent book OpenIntro Statistics Fourth Edition. As is mentioned in the book, there are two main approaches to building a multiple regression model: the adjusted R^2 approach and the p-value approach. For both approaches, it is recommended to adopt a step-wise model selection strategy: either backward elimination or forward selection.
The choice between the adjusted R^2 approach and the p-value approach can be settled by considering the actual objective of the model. For machine learning applications, improving prediction accuracy is the primary concern, and the adjusted R^2 approach is suitable for this. However, that is not the case for this project. The goal is to determine if work experience, gender, and the ability to code full-stack (as opposed to front-end or back-end only) are predictors of income for professional developers in the US. In other words, the interest lies in understanding the statistical significance of the relationship between the predictor variables and the response variable. In such a situation, the p-value approach is more suitable.
In terms of pairing the p-value approach with one of the step-wise model selection strategies, my preference is the backward elimination technique. So, all the variables will be included in the model at first. Then the variable with the largest p-value will be removed if its p-value is greater than a significance level of alpha = 0.05. However, if the largest p-value is smaller than 0.05, then no variable needs to be removed because the model is as good as it could be.
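The backward elimination procedure described above can be sketched as a small loop. This is a minimal sketch on simulated data, not the code used in this analysis: the helper name `backward_eliminate`, the data frame `sim`, and its columns are all made up for illustration. It relies on `drop1()` with an F-test, which tests each term as a whole, so a categorical predictor with several levels gets a single p-value instead of one per level.

```r
# Hypothetical helper: repeatedly drop the term with the largest p-value
# until every remaining term has a p-value at or below alpha
backward_eliminate <- function(model, alpha = 0.05) {
  repeat {
    # Stop if only the intercept is left
    if (length(attr(terms(model), "term.labels")) == 0) break
    tests <- drop1(model, test = "F")
    p_values <- tests[["Pr(>F)"]][-1]   # first row is the full model (<none>)
    term_names <- rownames(tests)[-1]
    if (max(p_values) <= alpha) break
    worst <- term_names[which.max(p_values)]
    model <- update(model, as.formula(paste(". ~ . -", worst)))
  }
  model
}

# Simulated data: y depends on x1 only; x2 and g are pure noise
set.seed(1)
sim <- data.frame(
  x1 = rnorm(200),
  x2 = rnorm(200),
  g  = factor(sample(c("a", "b", "c"), 200, replace = TRUE))
)
sim$y <- 2 * sim$x1 + rnorm(200)

final_model <- backward_eliminate(lm(y ~ x1 + x2 + g, data = sim))
formula(final_model)
```

The loop stops as soon as the largest remaining p-value is at or below `alpha`, or when no predictors are left, which corresponds to dismissing the model entirely.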
Let’s fit all the aforementioned predictors against the response
variable of comp_adj. Then, we can summarize the model to
see the p-values associated with each variable.
model <- lm(comp_adj ~ WorkExp + gender_collapsed + stack_info, data = df_subset)
summary(model)
##
## Call:
## lm(formula = comp_adj ~ WorkExp + gender_collapsed + stack_info,
## data = df_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -412212 -271875 -212230 -140917 35978465
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 424586 33117 12.821 <2e-16 ***
## WorkExp -3051 1770 -1.724 0.0848 .
## gender_collapsedOther -144766 93617 -1.546 0.1221
## gender_collapsedWoman -135364 75070 -1.803 0.0714 .
## stack_infonot full-stack -31336 37776 -0.830 0.4068
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1216000 on 4660 degrees of freedom
## Multiple R-squared: 0.001832, Adjusted R-squared: 0.0009755
## F-statistic: 2.139 on 4 and 4660 DF, p-value: 0.07345
In accordance with our chosen methodology, the predictor variable
stack_info must be dropped as it has a p-value of about 0.41, the
highest of the bunch. Hopefully, after dropping it, the new model will
be a better fit.
model <- lm(comp_adj ~ WorkExp + gender_collapsed, data = df_subset)
summary(model)
##
## Call:
## lm(formula = comp_adj ~ WorkExp + gender_collapsed, data = df_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -401414 -269588 -212944 -143585 35989420
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 413579 30342 13.631 <2e-16 ***
## WorkExp -2999 1769 -1.696 0.0900 .
## gender_collapsedOther -146106 93600 -1.561 0.1186
## gender_collapsedWoman -137194 75035 -1.828 0.0676 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1216000 on 4661 degrees of freedom
## Multiple R-squared: 0.001685, Adjusted R-squared: 0.001042
## F-statistic: 2.622 on 3 and 4661 DF, p-value: 0.04898
After dropping stack_info, the new model is still not
good enough because the gender_collapsed predictor variable
has high p-values associated with its two levels (the third level
doesn’t show because it’s the baseline). In particular, the “Other”
level for gender_collapsed has the highest p-value in the
current model: 0.1186. So, let’s try dropping
gender_collapsed.
model <- lm(comp_adj ~ WorkExp, data = df_subset)
summary(model)
##
## Call:
## lm(formula = comp_adj ~ WorkExp, data = df_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -385191 -262429 -212562 -155191 36006659
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 396057 29415 13.465 <2e-16 ***
## WorkExp -2716 1765 -1.539 0.124
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1216000 on 4663 degrees of freedom
## Multiple R-squared: 0.0005076, Adjusted R-squared: 0.0002933
## F-statistic: 2.368 on 1 and 4663 DF, p-value: 0.1239
Well, this is unfortunate. Even the last variable
WorkExp seems to be a poor predictor of the response
variable comp_adj. Its p-value is 0.124. So, this linear
model must also be dismissed.
It would be prudent to reflect on what exactly happened. One thought that immediately comes to mind is that linearity should be checked with a plot or the correlation coefficient before modeling. I had planned to do this while checking the conditions for the model’s validity after finalizing the model, staying true to the order of things in Chapter 9. However, it should have been done before the modeling. So, let’s do it now to see what’s going on.
df_subset %>%
summarize(cor(WorkExp, comp_adj)) # Very small correlation coefficient
# Visualizing WorkExp vs comp_adj
df_subset %>%
ggplot(mapping = aes(x = WorkExp, y = comp_adj)) +
geom_jitter(color = 'purple', alpha = .2) +
labs(title = 'Compensation by experience',
x = 'WorkExp (years)',
y = 'comp_adj (dollars)')
# Visualizing distribution of comp_adj grouped by gender_collapsed
df_subset %>%
ggplot(mapping = aes(color = gender_collapsed, y = comp_adj)) +
geom_boxplot(alpha = 0.5) +
scale_x_discrete(expand = expansion()) +
labs(title = "Compensation by genders", y = 'comp_adj (dollars)')
# Visualizing distribution of comp_adj grouped by stack_info
df_subset %>%
ggplot(mapping = aes(color = stack_info, y = comp_adj)) +
geom_boxplot(alpha = 0.5) +
scale_x_discrete(expand = expansion()) +
labs(title = "Compensation by full-stack or not", y = 'comp_adj (dollars)')
The correlation coefficient pertaining to WorkExp and
comp_adj is very small, so immediately it’s obvious that
there is an issue. The scatterplot also makes it clear that there is not
a linear relationship between WorkExp and
comp_adj. The compensation for the vast majority of the
people seems to be unaffected by experience, which runs counter to
intuition. Some outliers are scattered higher up, distributed fairly
evenly across the various experience ranges.
Moving on to the next diagnostic concerning the relationship between
gender_collapsed and comp_adj, again the trend
from the previous visual holds true. There is barely any difference in
compensation data across the different genders. Outliers are more
numerous for males, but that’s to be expected given how many more males
there are in the data.
Similarly, for stack_info and comp_adj, the
groups make no visible difference in the income distribution. Outliers overshadow
the majority of the data. At this point, I’m fairly certain that the
issue is high variance: a small number of compensation values are orders
of magnitude larger than the rest. So, looking at the distribution of
comp_adj should be helpful.
df_subset %>%
ggplot(mapping = aes(x = comp_adj)) +
geom_histogram(fill = 'magenta', bins = 30) +
labs(title = 'Histogram of Compensation', x = 'comp_adj (dollars)')
The earlier assessment seems to be correct. There are 30 bins in the
histogram, yet only one is really visible. Nearly all observations fall
in the left-most bin, which indicates that the data is massively
right-skewed: most of the data is concentrated toward the lower end,
while a few values are very high and far to the right. My guess is that
a log transformation could work for this data. Before doing that, let’s
add 1 to any value in comp_adj that is 0, because the logarithm of 0 is
-Inf in R.
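As a quick illustration of why the zeros need handling: in R, `log10(0)` evaluates to `-Inf`, and a single infinite value is enough to break a correlation or regression calculation. The sketch below uses a made-up vector `toy_comp` (not survey data) and the same replace-zero-with-one idea as the step that follows.

```r
toy_comp <- c(0, 50000, 120000)   # made-up values, not survey data

log10(toy_comp)                   # first element is -Inf

# Replace exact zeros with 1 before transforming
toy_shifted <- ifelse(toy_comp == 0, 1, toy_comp)
log_comp <- log10(toy_shifted)    # first element is now 0, all values finite
log_comp
```

Replacing 0 with 1 maps those observations to 0 on the log scale, which keeps them in the data but still leaves them far from the rest of the values.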
df_subset %>%
filter(comp_adj == 0) %>%
nrow() # There are 9 values that are 0
## [1] 9
# Adding 1 to 0
df_subset <- df_subset %>%
mutate(comp_adj = if_else(comp_adj == 0, comp_adj + 1, comp_adj))
# Checking the correlation coefficient between log transformed comp_adj and WorkExp
df_subset %>%
summarize(cor(WorkExp, log10(comp_adj))) # No luck...
# Distribution of log10(comp_adj)
df_subset %>%
ggplot(mapping = aes(x = log10(comp_adj))) +
geom_histogram(fill = 'forestgreen') +
labs(title = 'Histogram of log-transformed compensation data')
# Scatterplot between log10(comp_adj) and WorkExp
df_subset %>%
ggplot(mapping = aes(x = WorkExp, y = log10(comp_adj))) +
geom_jitter(color = 'navyblue', alpha = 0.2) +
stat_smooth(method = "lm", se = FALSE) +
labs(title = 'Log-transformed compensation by experience')
# Visualizing distribution of log10(comp_adj) grouped by gender_collapsed
df_subset %>%
ggplot(mapping = aes(color = gender_collapsed, y = log10(comp_adj))) +
geom_boxplot(alpha = 0.5) +
scale_x_discrete(expand = expansion()) +
labs(title = "Compensation by Genders")
# Visualizing distribution of log10(comp_adj) grouped by stack_info
df_subset %>%
ggplot(mapping = aes(color = stack_info, y = log10(comp_adj))) +
geom_boxplot(alpha = 0.5) +
scale_x_discrete(expand = expansion()) +
labs(title = "Compensation by full-stack or not")
No luck with the log transformation. Although the
distribution of the comp_adj variable looks much less
skewed under a log transformation, there is still no semblance of
correlation between the income data and work experience,
gender, or the ability to develop full-stack. In the case of work
experience, the fitted line in the scatterplot is flat, with a slope of
basically 0. The correlation coefficient of course reflects this with a
value of about 0.04. Meanwhile, the categorical variables show no
disparity between the groups: the boxplots for each categorical variable
line up right next to each other. If anything, this exercise has been an
excellent exploration of the independence between income and the other
variables.
Still, I have learned one valuable lesson that I won’t forget. Always always always check visuals and correlation coefficients before diving into the process of creating a linear model.
Formally speaking, the conclusion is this: Work experience, gender, and the ability to code full-stack (as opposed to front-end or back-end only) are not predictive of income for professional developers in the US. In fact, income seems to be fairly independent of these other variables, at least within the context of developers taking Stack Overflow’s 2022 survey.
I don’t know if these findings are important, but they are certainly interesting. I never would have guessed that those variables are not correlated. In fact, if this analysis could be made more robust and reliable, then the findings would be very important because the implications would be polarizing to say the least.
However, therein lies the caveat. This study was performed by an amateur using a very niche dataset. There are bound to be confounders and other things I have missed. After all, how could there be no strong correlation between work experience and income, particularly for technical professionals like developers? So, the main limitation of the analysis is that it needs to be reproduced by others, perhaps in a more rigorous manner. I only looked at a handful of variables out of the 79 available in total, so the rest can be explored further to explain this strange conclusion.