# Packages needed
library(tidyverse)

Part 1 - Introduction

Pay equality is a prominent issue in our society right now. In fact, the gender pay gap is a talking point for proponents of feminism. That is no surprise: everyone deserves to be paid fairly for their work. However, I suspect gender might not be the only variable correlated with how much someone earns, particularly in the tech industry. Even if my impression is wrong, it is certainly a conversation worth having and exploring. In the spirit of that notion, my research question is this:

Are work experience, gender, and the ability to code full-stack (as opposed to front-end or back-end only) predictive of income for professional developers in the US?

The highly specific nature of the question is a consequence of the data I plan to use (and the limitations of my skills). I will use data from Stack Overflow’s excellent annual developer survey. Specifically, I will use the 2022 survey data, as it is the most recent. The survey can be downloaded manually here, and its methodology can be found here.

Part 2 - Data

The survey data is first downloaded into the working directory, and then it is read into the R environment. It is unfortunate that the data has to be downloaded. The main CSV file is inside a zipped folder, so reading it in directly does not seem to be an option. I tried to upload the file to my GitHub, but GitHub does not allow a file size this big to be hosted, and I couldn’t figure out how to make a temporary file work. Regardless, the data is successfully imported into R, and ultimately, that’s the main concern.
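One possible way to make the temporary-file route work is sketched below in base R. The helper name `read_zipped_csv` is hypothetical, and the sketch assumes the archive stores the CSV at its top level under the name passed as `csv_name`. It downloads the zip to a temporary file, reads a single member through `unz()` without extracting the archive, and deletes the temporary file afterwards.

```r
# Hypothetical helper (not part of the original analysis):
# download a zip archive to a temporary file and read one CSV from inside it.
read_zipped_csv <- function(url, csv_name) {
  tmp <- tempfile(fileext = ".zip")
  on.exit(unlink(tmp))                    # remove the temporary archive when done
  download.file(url, tmp, mode = "wb")    # binary mode so the zip is not corrupted
  read.csv(unz(tmp, csv_name), stringsAsFactors = FALSE)
}

# Intended usage (not run here because it needs network access):
# df <- read_zipped_csv(url, "survey_results_public.csv")
```

Whether this is preferable to unzipping into the working directory is a judgment call: the data is not left on disk, but it is downloaded again on every run.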

# Load data
url <- 'https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2022.zip'
download.file(url, './data.zip', mode = 'wb')  # Binary mode keeps the zip intact on Windows
unzip('./data.zip')
df <- read_csv('./survey_results_public.csv')
schema <- read_csv('./survey_results_schema.csv')  # Useful for getting details on columns
glimpse(df)

## Rows: 73,268
## Columns: 79
## $ ResponseId                     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, …
## $ MainBranch                     <chr> "None of these", "I am a developer by p…
## $ Employment                     <chr> NA, "Employed, full-time", "Employed, f…
## $ RemoteWork                     <chr> NA, "Fully remote", "Hybrid (some remot…
## $ CodingActivities               <chr> NA, "Hobby;Contribute to open-source pr…
## $ EdLevel                        <chr> NA, NA, "Master’s degree (M.A., M.S., M…
## $ LearnCode                      <chr> NA, NA, "Books / Physical media;Friend …
## $ LearnCodeOnline                <chr> NA, NA, "Technical documentation;Blogs;…
## $ LearnCodeCoursesCert           <chr> NA, NA, NA, NA, NA, NA, NA, "Coursera;U…
## $ YearsCode                      <chr> NA, NA, "14", "20", "8", "15", "3", "1"…
## $ YearsCodePro                   <chr> NA, NA, "5", "17", "3", NA, NA, NA, "6"…
## $ DevType                        <chr> NA, NA, "Data scientist or machine lear…
## $ OrgSize                        <chr> NA, NA, "20 to 99 employees", "100 to 4…
## $ PurchaseInfluence              <chr> NA, NA, "I have some influence", "I hav…
## $ BuyNewTool                     <chr> NA, NA, NA, "Other (please specify):", …
## $ Country                        <chr> NA, "Canada", "United Kingdom of Great …
## $ Currency                       <chr> NA, "CAD\tCanadian dollar", "GBP\tPound…
## $ CompTotal                      <dbl> NA, NA, 32000, 60000, NA, NA, NA, NA, 4…
## $ CompFreq                       <chr> NA, NA, "Yearly", "Monthly", NA, NA, NA…
## $ LanguageHaveWorkedWith         <chr> NA, "JavaScript;TypeScript", "C#;C++;HT…
## $ LanguageWantToWorkWith         <chr> NA, "Rust;TypeScript", "C#;C++;HTML/CSS…
## $ DatabaseHaveWorkedWith         <chr> NA, NA, "Microsoft SQL Server", "Micros…
## $ DatabaseWantToWorkWith         <chr> NA, NA, "Microsoft SQL Server", "Micros…
## $ PlatformHaveWorkedWith         <chr> NA, NA, NA, NA, "Firebase;Microsoft Azu…
## $ PlatformWantToWorkWith         <chr> NA, NA, NA, NA, "Firebase;Microsoft Azu…
## $ WebframeHaveWorkedWith         <chr> NA, NA, "Angular.js", "ASP.NET;ASP.NET …
## $ WebframeWantToWorkWith         <chr> NA, NA, "Angular;Angular.js", "ASP.NET;…
## $ MiscTechHaveWorkedWith         <chr> NA, NA, "Pandas", ".NET", ".NET", NA, N…
## $ MiscTechWantToWorkWith         <chr> NA, NA, ".NET", ".NET", ".NET;Apache Ka…
## $ ToolsTechHaveWorkedWith        <chr> NA, NA, NA, NA, "npm", "Homebrew", "Hom…
## $ ToolsTechWantToWorkWith        <chr> NA, NA, NA, NA, "Docker;Kubernetes", "H…
## $ NEWCollabToolsHaveWorkedWith   <chr> NA, NA, "Notepad++;Visual Studio", "Not…
## $ NEWCollabToolsWantToWorkWith   <chr> NA, NA, "Notepad++;Visual Studio", "Not…
## $ `OpSysProfessional use`        <chr> NA, "macOS", "Windows", "Windows", "Win…
## $ `OpSysPersonal use`            <chr> NA, "Windows Subsystem for Linux (WSL)"…
## $ VersionControlSystem           <chr> NA, "Git", "Git", "Git", "Git;Other (pl…
## $ VCInteraction                  <chr> NA, NA, "Code editor", "Code editor;Com…
## $ `VCHostingPersonal use`        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ `VCHostingProfessional use`    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ OfficeStackAsyncHaveWorkedWith <chr> NA, NA, NA, "Jira Work Management;Trell…
## $ OfficeStackAsyncWantToWorkWith <chr> NA, NA, NA, "Jira Work Management;Trell…
## $ OfficeStackSyncHaveWorkedWith  <chr> NA, NA, "Microsoft Teams", "Slack;Zoom"…
## $ OfficeStackSyncWantToWorkWith  <chr> NA, NA, "Microsoft Teams", "Slack;Zoom"…
## $ Blockchain                     <chr> NA, "Very unfavorable", "Very unfavorab…
## $ NEWSOSites                     <chr> NA, "Collectives on Stack Overflow;Stac…
## $ SOVisitFreq                    <chr> NA, "Daily or almost daily", "Multiple …
## $ SOAccount                      <chr> NA, "Yes", "Yes", "Yes", "Yes", "Yes", …
## $ SOPartFreq                     <chr> NA, "Daily or almost daily", "Multiple …
## $ SOComm                         <chr> NA, "Not sure", "Neutral", "Yes, defini…
## $ Age                            <chr> NA, NA, "25-34 years old", "35-44 years…
## $ Gender                         <chr> NA, NA, "Man", "Man", NA, "Or, in your …
## $ Trans                          <chr> NA, NA, "No", "No", NA, "Or, in your ow…
## $ Sexuality                      <chr> NA, NA, "Bisexual", "Straight / Heteros…
## $ Ethnicity                      <chr> NA, NA, "White", "White", NA, "Or, in y…
## $ Accessibility                  <chr> NA, NA, "None of the above", "None of t…
## $ MentalHealth                   <chr> NA, NA, "I have a mood or emotional dis…
## $ TBranch                        <chr> NA, "No", "No", "No", "No", NA, NA, NA,…
## $ ICorPM                         <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Indepe…
## $ WorkExp                        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 6, NA, …
## $ Knowledge_1                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Agree"…
## $ Knowledge_2                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Disagr…
## $ Knowledge_3                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Agree"…
## $ Knowledge_4                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Agree"…
## $ Knowledge_5                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Agree"…
## $ Knowledge_6                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Agree"…
## $ Knowledge_7                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Disagr…
## $ Frequency_1                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, "3-5 ti…
## $ Frequency_2                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, "3-5 ti…
## $ Frequency_3                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Never"…
## $ TimeSearching                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, "15-30 …
## $ TimeAnswering                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Over 1…
## $ Onboarding                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Somewh…
## $ ProfessionalTech               <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Inners…
## $ TrueFalse_1                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Yes", …
## $ TrueFalse_2                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Yes", …
## $ TrueFalse_3                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Yes", …
## $ SurveyLength                   <chr> NA, "Too long", "Appropriate in length"…
## $ SurveyEase                     <chr> NA, "Difficult", "Neither easy nor diff…
## $ ConvertedCompYearly            <dbl> NA, NA, 40205, 215232, NA, NA, NA, NA, …

# It's obvious US will be among the top, so no need to use string searches to get exact value
df %>% 
  count(Country) %>% 
  arrange(desc(n))

# Checking the exact value for the category I am looking for: professional developer
df %>% 
  distinct(MainBranch)

# Subsetting the data to get the relevant columns
# The schema was referenced to do this part: View(schema)
df_subset <- df %>% 
  filter(Country == 'United States of America', MainBranch == 'I am a developer by profession') %>% 
  select(WorkExp, Gender, DevType, CompTotal, CompFreq) %>% 
  drop_na()

I have limited the analysis to professional developers because the research question is about income in the tech industry. Taking only the professionals into consideration when assessing and predicting income data seems reasonable. Additionally, the analysis has been limited to developers inside the US for two reasons. The first reason is to reduce the data to a more manageable size, as the original data contains 73,268 observations. The second reason is that choosing the US makes currency conversion unnecessary.

Part 3 - Exploratory data analysis

Looking at the variables one by one

Firstly, I would like to get a look at all the values for each variable.

# WorkExp, Gender, DevType, CompTotal, CompFreq are the columns
df_subset %>% 
  count(WorkExp, sort = TRUE)

df_subset %>% 
  count(Gender, sort = TRUE)

df_subset %>% 
  count(DevType, sort = TRUE)

df_subset %>% 
  count(CompTotal, sort = TRUE)

df_subset %>% 
  count(CompFreq, sort = TRUE)

Transforming `Gender` variable

Looking at all the values for the Gender variable, it seems clear that some of the levels of this variable are intentionally ambiguous, perhaps to protect privacy. Beyond that, it is not practical to include too many levels of this variable in the final linear model, especially considering that the vast majority of the respondents belong to only two of the levels (“Man” and “Woman”). So, I will collapse the rest of the responses into one category called “Other”.

df_subset <- df_subset %>% 
  mutate(gender_collapsed = if_else(Gender %in% c('Man', 'Woman'), Gender, 'Other'))
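To make the effect of the collapse concrete, here is a small base R sketch on made-up responses (the strings are illustrative, not copied from the survey). Because `%in%` is an exact match, a multi-select answer that merely contains "Man" still ends up in "Other".

```r
# Toy responses, including a multi-select answer and a non-answer
gender <- c("Man", "Woman", "Man;Non-binary", "Prefer not to say", "Woman")

# Same rule as above: keep exact "Man"/"Woman", collapse everything else
gender_collapsed <- ifelse(gender %in% c("Man", "Woman"), gender, "Other")

table(gender_collapsed)
```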

Transforming `DevType` variable

Additionally, there seem to be many developer types in the DevType variable, with some overlap between the types. This is not an ideal situation for answering the research question, which is concerned with the income dichotomy between full-stack and front-end/back-end developers. For example, consider the response level “Developer, full-stack;Developer, back-end”. The respondent clearly performs full-stack development alongside back-end development. In other words, the respondent is a full-stack developer, and adding back-end to the response gives no new information that is useful for my objective. So, let’s explore the DevType variable further to see if it’s possible to assign every observation to one of two mutually exclusive categories: “full-stack” or “not full-stack”. The first step is to filter out all non-developers. Then, the other logical tests and the recoding can be performed to get the desired result.

# Getting rid of non-developers
df_subset <- df_subset %>% 
  filter(str_detect(DevType, '[Dd]evelop'))

# Categorizing the developers as either "full-stack" or "not full-stack"
# First identifying all full-stack observations
df_subset <- df_subset %>% 
  mutate(stack_info = if_else(
    condition = str_detect(DevType, regex('full-stack', ignore_case = TRUE)), 
    true = 'full-stack', 
    false = 'unknown'))

# Then identifying the front/back-end devs as not full-stack and filtering out the rest
df_not_full_stack <- df_subset %>% 
  filter(stack_info == 'unknown') %>% 
  mutate(stack_info = if_else(
    condition = str_detect(DevType, 'front-end|back-end'), 
    true = 'not full-stack', 
    false = 'other')) %>% 
  filter(stack_info == 'not full-stack')

# Finally, filtering df_subset so that only full-stack observations remain
# And row-binding that output and df_not_full_stack
# To get all the observations with full-stack and not full-stack designations
# Ultimately, I have categorized devs as full-stack if and only if full-stack appears under DevType
# And I have categorized devs as not full-stack if and only if
# They have either front-end or back-end under DevType
df_subset <- df_subset %>% 
  filter(stack_info == 'full-stack') %>% 
  bind_rows(., df_not_full_stack)

# Checking that the long process actually worked
df_subset %>% 
  count(stack_info, sort = TRUE)
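The same categorization rule can also be expressed in a single pass. The sketch below uses base R on toy DevType strings; `classify_stack` is a hypothetical helper, and returning `NA` for developers who are neither full-stack nor front-end/back-end is my assumption about how to mark the rows that the steps above filter out.

```r
# Toy DevType values covering each branch of the rule
dev_type <- c("Developer, full-stack;Developer, back-end",
              "Developer, front-end",
              "Developer, back-end;DevOps specialist",
              "Developer, mobile",
              "Data scientist or machine learning specialist")

# Hypothetical helper: full-stack wins, then front-end/back-end, otherwise NA
classify_stack <- function(x) {
  is_full <- grepl("full-stack", x, ignore.case = TRUE)
  is_part <- grepl("front-end|back-end", x)
  ifelse(is_full, "full-stack",
         ifelse(is_part, "not full-stack", NA_character_))
}

stack_info <- classify_stack(dev_type)
data.frame(dev_type, stack_info)
```

Dropping the `NA` rows afterwards would leave only the two mutually exclusive categories.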

Transforming `CompTotal` variable by `CompFreq`

The next task is annualizing CompTotal based on CompFreq. In other words, if the value under CompFreq is “Monthly” or “Weekly”, then the CompTotal value must be multiplied by 12 or 52, respectively. Yearly values are left unchanged.

# Creating a comp_weight column with the annualizing weight for each CompFreq value
# (12 for Monthly, 52 for Weekly, 1 for everything else)
df_subset <- df_subset %>% 
  mutate(comp_weight = case_when(
    CompFreq == 'Monthly' ~ 12, 
    CompFreq == 'Weekly' ~ 52, 
    TRUE ~ 1))

# Checking that it worked
df_subset %>% count(comp_weight, sort = TRUE)

df_subset %>% count(CompFreq, sort = TRUE)

# Modifying the CompTotal by comp_weight
df_subset <- df_subset %>% 
  mutate(comp_adj = CompTotal * comp_weight)
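As a quick worked example of the annualization, here is a base R sketch with three made-up respondents. It uses a named-vector lookup for the weights, which is an alternative way of writing the same rule rather than the code used above.

```r
# Three hypothetical respondents, one per pay frequency
comp <- data.frame(
  CompTotal = c(120000, 9000, 2000),
  CompFreq  = c("Yearly", "Monthly", "Weekly"),
  stringsAsFactors = FALSE
)

# Lookup table: how many pay periods each frequency has per year
weights <- c(Yearly = 1, Monthly = 12, Weekly = 52)

# Annualized compensation: 120000 * 1, 9000 * 12, 2000 * 52
comp$comp_adj <- comp$CompTotal * unname(weights[comp$CompFreq])
comp
```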

Removing extreme values of `comp_adj`

The last task before the data is ready for building a multiple regression model is removing observations with abnormally high income. This is done not only to remove potentially false reports but also to get rid of extreme outliers. Removing outliers is not suggested under normal circumstances, but this analysis is meant to reflect the typical societal trends regarding income and certain other potentially significant variables. Billionaires are by definition somewhat beyond that conversation about the norm. So, I think there is a valid defense in this case for being less rigid about dropping certain observations with extreme compensation values.

summary(df_subset$comp_adj)  # Max value is about 12 trillion, so clearly there are some problematic outliers

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 1.050e+05 1.450e+05 2.572e+09 1.950e+05 1.200e+13

df_subset %>% 
  filter(comp_adj >= 1000000000)  # That one observation is the only one with over 1 billion annual income

# Dropping that one outlier
df_subset <- df_subset %>% 
  filter(comp_adj < 1000000000)

So, only one observation is dropped for having annual income of over 1 billion dollars. Now, the linear model can be built using the predictor variables WorkExp, gender_collapsed, and stack_info and the response variable comp_adj.
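The summary above already hints at how much a single value can distort the mean. The sketch below reproduces that effect on toy numbers. It assumes 4,666 rows before the drop (inferred from the regression output later on: 4,660 residual degrees of freedom plus 5 coefficients gives 4,665 rows after the drop), and it uses a placeholder income of 150,000 for everyone else.

```r
# 4665 placeholder incomes plus one value of 12 trillion
incomes <- c(rep(150000, 4665), 1.2e13)

mean_with_outlier   <- mean(incomes)     # pulled up into the billions
median_with_outlier <- median(incomes)   # unaffected by the single extreme value

# After applying the same cutoff as above
mean_after_drop <- mean(incomes[incomes < 1e9])

c(mean_with_outlier = mean_with_outlier,
  median_with_outlier = median_with_outlier,
  mean_after_drop = mean_after_drop)
```

Under these assumptions the mean lands close to the 2.572e+09 reported by `summary()`, which suggests that the one dropped observation accounts for almost all of it.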

Part 4 - Inference

Getting started

In this project, the process of developing a linear model will be carried out in accordance with Chapter 9 of the excellent book OpenIntro Statistics Fourth Edition. As is mentioned in the book, there are two main approaches to building a multiple regression model. They are: the adjusted R^2 approach and the p-value approach. For both approaches, it is recommended to adopt a step-wise model selection strategy: either backward elimination technique or forward selection technique.

The initial choice between the adjusted R^2 approach and the p-value approach can be settled by considering the actual objective of the model. For machine learning applications, improving prediction accuracy is the primary concern, and the adjusted R^2 approach is suitable for this. However, that is not the case for this project. The goal is to determine if work experience, gender, and the ability to code full-stack (as opposed to front-end or back-end only) are predictors of income for professional developers in the US. In other words, the interest lies in understanding the statistical significance of the relationship between the predictor variables and the response variable. In such a situation, the p-value approach is more suitable.

In terms of pairing the p-value approach with one of the step-wise model selection strategies, my preference is the backward elimination technique. So, all the variables will be included in the model at first. Then the variable with the largest p-value will be removed if its p-value is greater than a significance level of alpha = 0.05. However, if the largest p-value is smaller than 0.05, then no variable needs to be removed because the model is as good as it could be.
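The procedure just described can be sketched as a loop. Note one deliberate variation: instead of reading per-coefficient p-values from `summary()`, this sketch uses `drop1(fit, test = "F")`, which gives one p-value per term, so a multi-level factor is kept or dropped as a whole. The data are simulated, and the names `dat`, `x1`, `x2`, `g`, and `backward_eliminate` are all hypothetical.

```r
set.seed(606)
n <- 200
dat <- data.frame(
  x1 = rnorm(n),                                           # real predictor
  x2 = rnorm(n),                                           # pure noise
  g  = factor(sample(c("a", "b", "c"), n, replace = TRUE)) # noise factor
)
dat$y <- 3 * dat$x1 + rnorm(n)

# Hypothetical helper: repeatedly drop the term with the largest p-value
# until every remaining term has a p-value at or below alpha.
backward_eliminate <- function(fit, alpha = 0.05) {
  repeat {
    if (length(attr(terms(fit), "term.labels")) == 0) break  # nothing left to drop
    tab <- drop1(fit, test = "F")
    p_values   <- tab[["Pr(>F)"]][-1]   # first row is "<none>"
    term_names <- rownames(tab)[-1]
    if (max(p_values) <= alpha) break   # all remaining terms are significant
    worst <- term_names[which.max(p_values)]
    fit <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
  fit
}

final_fit <- backward_eliminate(lm(y ~ x1 + x2 + g, data = dat))
formula(final_fit)
```

Because `x1` drives the simulated response, it survives the elimination. Whether the noise terms are dropped depends on the random draw.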

P-value approach for multiple regression using backward elimination technique

Let’s fit all the aforementioned predictors against the response variable of comp_adj. Then, we can summarize the model to see the p-values associated with each variable.

model <- lm(comp_adj ~ WorkExp + gender_collapsed + stack_info, data = df_subset)
summary(model)

## 
## Call:
## lm(formula = comp_adj ~ WorkExp + gender_collapsed + stack_info, 
##     data = df_subset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
##  -412212  -271875  -212230  -140917 35978465 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                424586      33117  12.821   <2e-16 ***
## WorkExp                     -3051       1770  -1.724   0.0848 .  
## gender_collapsedOther     -144766      93617  -1.546   0.1221    
## gender_collapsedWoman     -135364      75070  -1.803   0.0714 .  
## stack_infonot full-stack   -31336      37776  -0.830   0.4068    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1216000 on 4660 degrees of freedom
## Multiple R-squared:  0.001832,   Adjusted R-squared:  0.0009755 
## F-statistic: 2.139 on 4 and 4660 DF,  p-value: 0.07345

In accordance with our chosen methodology, the predictor variable stack_info must be dropped as it has a p-value of about 0.41, the highest of the bunch. Hopefully, after dropping it, the new model will be a better fit.

model <- lm(comp_adj ~ WorkExp + gender_collapsed, data = df_subset)
summary(model)

## 
## Call:
## lm(formula = comp_adj ~ WorkExp + gender_collapsed, data = df_subset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
##  -401414  -269588  -212944  -143585 35989420 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             413579      30342  13.631   <2e-16 ***
## WorkExp                  -2999       1769  -1.696   0.0900 .  
## gender_collapsedOther  -146106      93600  -1.561   0.1186    
## gender_collapsedWoman  -137194      75035  -1.828   0.0676 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1216000 on 4661 degrees of freedom
## Multiple R-squared:  0.001685,   Adjusted R-squared:  0.001042 
## F-statistic: 2.622 on 3 and 4661 DF,  p-value: 0.04898

After dropping stack_info, the new model is still not good enough because the gender_collapsed predictor variable has high p-values associated with its two levels (the third level doesn’t show because it’s the baseline). In particular, the “Other” level for gender_collapsed has the highest p-value in the current model: 0.1186. So, let’s try dropping gender_collapsed.

model <- lm(comp_adj ~ WorkExp, data = df_subset)
summary(model)

## 
## Call:
## lm(formula = comp_adj ~ WorkExp, data = df_subset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
##  -385191  -262429  -212562  -155191 36006659 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   396057      29415  13.465   <2e-16 ***
## WorkExp        -2716       1765  -1.539    0.124    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1216000 on 4663 degrees of freedom
## Multiple R-squared:  0.0005076,  Adjusted R-squared:  0.0002933 
## F-statistic: 2.368 on 1 and 4663 DF,  p-value: 0.1239

Well, this is unfortunate. Even the last variable WorkExp seems to be a poor predictor of the response variable comp_adj. Its p-value is 0.124. So, this linear model must also be dismissed.

The post-mortem

It would be prudent to reflect on what exactly happened. One thought that immediately comes to mind is that before modeling, linearity must always be checked using a plot or the correlation coefficient. I had planned to do this while checking the conditions for the model’s validity after finalizing the model, staying true to the order of things in Chapter 9. However, it should have been done before the modeling. So, let’s do it now to see what’s going on.

df_subset %>%
  summarize(cor(WorkExp, comp_adj))  # Very small correlation coefficient

# Visualizing WorkExp vs comp_adj
df_subset %>% 
  ggplot(mapping = aes(x = WorkExp, y = comp_adj)) + 
  geom_jitter(color = 'purple', alpha = .2) + 
  labs(title = 'Compensation by experience', 
       x = 'WorkExp (years)', 
       y = 'comp_adj (dollars)')

# Visualizing distribution of comp_adj grouped by gender_collapsed
df_subset %>% 
  ggplot(mapping = aes(color = gender_collapsed, y = comp_adj)) + 
  geom_boxplot(alpha = 0.5) + 
  scale_x_discrete(expand = expansion()) + 
  labs(title = "Compensation by genders", y = 'comp_adj (dollars)')

# Visualizing distribution of comp_adj grouped by stack_info
df_subset %>% 
  ggplot(mapping = aes(color = stack_info, y = comp_adj)) + 
  geom_boxplot(alpha = 0.5) + 
  scale_x_discrete(expand = expansion()) + 
  labs(title = "Compensation by full-stack or not", y = 'comp_adj (dollars)')

The correlation coefficient pertaining to WorkExp and comp_adj is very small, so immediately it’s obvious that there is an issue. The scatterplot also makes it clear that there is not a linear relationship between WorkExp and comp_adj. The compensation for the vast majority of the people seems to be unaffected by experience, which runs counter to intuition. Some outliers are scattered higher up, distributed fairly evenly across the various experience ranges.

Moving on to the next diagnostic, concerning the relationship between gender_collapsed and comp_adj, the trend from the previous visual holds true again. There is barely any difference in compensation across the different genders. Outliers are more numerous for men, but that’s to be expected given how many more men there are in the data.

Similarly, for stack_info and comp_adj, the groups make no difference in income distribution. Outliers overshadow the majority of the data. At this point, I’m fairly certain that the issue is high variance. The compensation values span several orders of magnitude. So, looking at the distribution of comp_adj should be helpful.
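To illustrate how a handful of extreme values can overshadow everything else, here is a sketch on simulated data. All numbers are made up and say nothing about the survey itself. A clear linear relationship between experience and income is generated first, and then five incomes are overwritten with a value of 30 million.

```r
set.seed(1)
n <- 1000
experience <- runif(n, 0, 30)

# Simulated income with a genuine linear dependence on experience
income <- 100000 + 3000 * experience + rnorm(n, sd = 20000)
cor_clean <- cor(experience, income)

# Overwrite five observations with an extreme value
income_contaminated <- income
income_contaminated[seq(100, 900, by = 200)] <- 3e7
cor_contaminated <- cor(experience, income_contaminated)

c(clean = cor_clean, contaminated = cor_contaminated)
```

In this toy setting the correlation is strong before contamination and close to zero afterwards, even though 995 of the 1,000 points are unchanged.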

df_subset %>% 
  ggplot(mapping = aes(x = comp_adj)) + 
  geom_histogram(fill = 'magenta', bins = 30) + 
  labs(title = 'Histogram of Compensation', x = 'comp_adj (dollars)')

The earlier assessment seems to be correct. There are 30 bins in the histogram, yet only one is really visible. The left-most bin indicates that the data is massively right-skewed. Most of the data is concentrated toward the lower end, while a few values are extremely high and far to the right. My guess is that a log transformation could work for this data. Before doing that, let’s add 1 to any value in comp_adj that is 0 (to avoid -Inf values, since the log of 0 is not finite).
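A short base R sketch on made-up values shows why the zeros need handling: in R, `log10(0)` returns `-Inf`, which would break the correlation and the plots. Replacing zeros with 1 keeps every value finite, at the cost of placing those observations at 0 on the log scale, far below everyone else.

```r
# Toy compensation values, including a zero
comp <- c(0, 1, 50000, 150000, 1e6)

log_raw <- log10(comp)                 # first element is -Inf

# Same fix as in the analysis: turn 0 into 1 before taking the log
shifted <- ifelse(comp == 0, 1, comp)
log_shifted <- log10(shifted)          # first element is now 0

data.frame(comp, log_raw, log_shifted)
```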

df_subset %>% 
  filter(comp_adj == 0) %>% 
  nrow()  # There are 9 values that are 0

## [1] 9

# Adding 1 to 0
df_subset <- df_subset %>% 
  mutate(comp_adj = if_else(comp_adj == 0, comp_adj + 1, comp_adj))

# Checking the correlation coefficient between log transformed comp_adj and WorkExp
df_subset %>%
  summarize(cor(WorkExp, log10(comp_adj)))  # No luck...

# Distribution of log10(comp_adj)
df_subset %>% 
  ggplot(mapping = aes(x = log10(comp_adj))) + 
  geom_histogram(fill = 'forestgreen') + 
  labs(title = 'Histogram of log-transformed compensation data')

# Scatterplot between log10(comp_adj) and WorkExp
df_subset %>% 
  ggplot(mapping = aes(x = WorkExp, y = log10(comp_adj))) + 
  geom_jitter(color = 'navyblue', alpha = 0.2) + 
  stat_smooth(method = "lm", se = FALSE) + 
  labs(title = 'Log-transformed compensation by experience')

# Visualizing distribution of log10(comp_adj) grouped by gender_collapsed
df_subset %>% 
  ggplot(mapping = aes(color = gender_collapsed, y = log10(comp_adj))) + 
  geom_boxplot(alpha = 0.5) + 
  scale_x_discrete(expand = expansion()) + 
  labs(title = "Compensation by Genders")

# Visualizing distribution of log10(comp_adj) grouped by stack_info
df_subset %>% 
  ggplot(mapping = aes(color = stack_info, y = log10(comp_adj))) + 
  geom_boxplot(alpha = 0.5) + 
  scale_x_discrete(expand = expansion()) + 
  labs(title = "Compensation by full-stack or not")

No luck with the log transformation. Despite the fact that the distribution of the comp_adj variable looks much less skewed under a log transformation, there is still no semblance of correlation between the income data and the variables work experience, gender, and ability to develop full-stack. In the case of work experience, the slope of the scatterplot is flat, basically 0. The correlation coefficient of course reflects this with a value of about 0.04. Meanwhile, the categorical variables show no disparity between the groups. The boxplots for each categorical variable are right next to each other. If anything, this exercise has been an excellent exploration of the independence between income and the other variables.

Still, I have learned one valuable lesson that I won’t forget. Always always always check visuals and correlation coefficients before diving into the process of creating a linear model.

Part 5 - Conclusion

Formally speaking, the conclusion is this: At a significance level of 0.05, work experience, gender, and the ability to code full-stack (as opposed to front-end or back-end only) are not statistically significant predictors of income for professional developers in the US. In fact, income seems to be fairly independent of these other variables, at least within the context of developers taking Stack Overflow’s 2022 survey.

I don’t know if these findings are important, but they are certainly interesting. I never would have guessed that those variables are not correlated. In fact, if this analysis could be made more robust and reliable, then the findings would be very important because the implications would be polarizing to say the least.

However, therein lies the caveat. This study was performed by an amateur using a very niche dataset. There are bound to be confounders and other things I have missed. After all, how could there be no strong correlation between work experience and income, particularly for technical professionals like developers? So, the main limitation of the analysis is that it needs to be reproduced by others, perhaps in a more rigorous manner. I only looked at a handful of variables out of the 79 available in total. The rest can be explored further to explain this strange conclusion.

References

OpenIntro Statistics Fourth Edition by David Diez, Mine Çetinkaya-Rundel, and Christopher D Barr.

DATA 606 Final Project

Prinon Mahdi

