Finding Value Within the Data Science Industry

Author

Team Krijudato (Kristin L, Julia F, David G, Tony F)

Published

October 29, 2023


Introduction

Beginning in 2011, Stack Overflow has conducted an annual survey for developers to participate in. This survey provides insights on a wide variety of topics, such as employment status, annual salary, language knowledge, location, and much more.

As aspiring data scientists, we aim to answer the following questions:

  • What are the most popular programming languages?

  • What are the highest paying programming languages?

  • Do certain developer types make more money than others?

  • Do different regions make more money than others?

  • Does age have an impact on salary?

  • Are people more likely to be employed full-time if they have a higher level degree?

  • Is there a gender wage gap in the industry?

  • Do larger companies hire more experienced people?

  • Does knowing more than one language increase your salary?

  • Does having a higher level degree impact your salary?

In this project, we will use the Stack Overflow survey data to answer the above questions, and hopefully provide more insight to our fellow aspiring data scientists.


Project overview


Data engineering overview

Our data engineering pipeline does three things.

  1. For each year, download raw survey data from our s3 cache. Unify and coalesce columns on each year, and then union together.
  2. Explode wide certain multi-value columns. For example, “PlatformWorkedWith” contains both AWS and Google Cloud.
  3. Save the CSV file in the root directory so the markdown can load it from cache.

funcitons.r contains our data engineering pipeline code.

Also, to prepare for this project, we had to do a column analysis. If you’d like to see what went into that, review and run our admin.r file.


load and prepare base data for this markdown file…

packages <- c("dplyr","tidyr", "ggplot2", "stringr", "gt", "gridExtra")
lapply(packages, library, character.only = TRUE)
source("functions.r") # we'll attach the functions.r file with our submission
options(scipen=999)

wide_stack <- get_stack_df(persist = TRUE, load_from_cache = TRUE)
language_cols <- c("python", "sql", "java", "scala", "r", "julia")

wide_stack <- wide_stack %>%
  mutate(Gender = ifelse(Gender == "Woman", "Female", ifelse(Gender == "Man", "Male", Gender))) %>%
  mutate(Age = as.numeric(as.character(Age)))

done loading…


The Programming Languages of Data Science

Language use over time

Python and SQL are the most prevalent languages in 2022. Notably, Julia appears to have exhibited consistent growth since 2017, which makes sense considering it is a relatively new language. The distribution plot is another glimpse of usage changes over time.

long_stack <- wide_stack %>%
  pivot_longer(cols = all_of(language_cols),
               names_to = "Language",
               values_to = "LanguageYesOrNo",
               names_prefix = "language_") %>%
  filter(LanguageYesOrNo == "yes") %>%
  select(Year, Age, Gender, AnnualSalary, Language, Employment, LanguageYesOrNo)

language_growth <- long_stack %>%
  filter(LanguageYesOrNo == "yes") %>%
  group_by(Language, Year) %>%
  summarise(Count = n())

p1 <- ggplot(language_growth, aes(x = Year, y= Count, color = Language)) +
  geom_line(size = 1.5) +
  ggtitle("Use Over Time") + 
  labs( x = "Year", y = "Count of Surveyors" ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

p2 <- ggplot(long_stack, aes(x = Language, fill = factor(Year))) +
  geom_bar(position = "dodge", width = 0.7) +
  ggtitle("Distribution by Year") +
  labs( x = "", y = "Count of Surveyors", fill = "Year" ) +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
  
grid.arrange(p1, p2, ncol = 2, widths = c(2, 1.5))

Languages and pay over time

Let’s now look at the distribution of salaries over the years. We will need to create a new data frame that has the following attributes:

The violin plot below illustrates the salary distributions for each language in each year. From the plot, it’s apparent that Scala has consistently held the highest average salary among the six languages since 2018.

long_stack %>%
  filter(Year %in% 2018:2022) %>%
  filter(str_detect(Employment, "full")) %>%
  filter(AnnualSalary < 500000) %>% # removal of outliers
  filter(!is.na(AnnualSalary)) %>%

  select(Year, Age, Gender, AnnualSalary, Language, Employment) %>%
  ggplot(aes(x = Language, y = AnnualSalary, fill = Language)) +
    geom_violin() +
    facet_wrap(~Year) +
    theme(
      plot.title = element_text(hjust = 0.5),
      axis.text.x = element_text(angle = 45, hjust = 1),
      legend.position = "none"
    ) +
    ggtitle("Distribution of Salaries for Each Language") +
    labs(x = "", y = "Annual Salary (USD)")

Conclusions: Languages

Based on the preceding charts, our observations indicate that Python and SQL are the dominant languages in terms of popularity. Julia stands out for its remarkable and sustained growth since 2017, which aligns with outside research as it is a relatively new language. As for compensation, Scala emerges as the consistently top-paying language since 2017, as evidenced by the salary vs. language graph.


Salaries

By Country

Monaco and Afghanistan were the countries with the two highest average annual salaries. Specifically in the U.S., blockchain developers also had the highest average annual salaries.

wide_stack %>%
  filter(!is.na(AnnualSalary)) %>%
  mutate(Country = 
    str_replace_all(Country, "United Kingdom of Great Britain and Northern Ireland", "Great Britain")) %>%
  mutate(Country = 
    str_replace_all(Country, "United States of America", "USA")) %>%
  group_by(Country) %>%
  summarise(AvgSalary = mean(AnnualSalary, na.rm = TRUE)) %>%
  arrange(-AvgSalary) %>%
  head(n = 15) %>%
  ggplot(aes(x = reorder(Country, -AvgSalary), y = AvgSalary)) + 
    geom_bar(stat = "identity", position = "dodge") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    ggtitle("Average Salaries by Country/Region") +
    xlab("Country")

By Developer Type

Blockchain developers appeared the have the highest salary, but this might be a typo or outlier. Besides blockchain developers, project managers and senior executives had the highest average annual salaries.

wide_stack %>%
  mutate(DevType = str_split_fixed(DevType, ';', 2)[, 1]) %>%
  filter(!is.na(AnnualSalary)) %>%
  group_by(DevType) %>%
  summarise(AvgSalary = mean(AnnualSalary, na.rm = TRUE)) %>%
  arrange(desc(AvgSalary)) %>%
  head(n = 20) %>%
  ggplot(aes(x = reorder(DevType, -AvgSalary), y = AvgSalary)) +
    geom_bar(stat = "identity", position = "dodge") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    ggtitle("Top 20 Average Salaries by Developer Type") +
    labs(x = "Developer Type", y = "Average Salary")

By developer type, United States Only

Like the rest of the world, blockchain developers, project managers, and senior executives had the highest annual salaries out of all the job titles in the U.S.

wide_stack %>%
  mutate(DevType = str_split_fixed(DevType, ';', 2)[, 1]) %>%
  filter(!is.na(AnnualSalary)) %>%
  group_by(DevType) %>%
  filter((Country == "United States") | (Country == "United States of America")) %>%
  summarise(AvgSalary = mean(AnnualSalary, na.rm = TRUE)) %>%
  arrange(desc(AvgSalary)) %>%
  head(n = 20) %>%
  ggplot(aes(x = reorder(DevType, -AvgSalary), y = AvgSalary)) +
    geom_bar(stat = "identity", position = "dodge") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    ggtitle("Top 20 Average Salaries in the U.S. by Developer Type") +
    labs(x = "Developer Type", y = "Average Salary") 

By developer type, United States Only, minus blockchain

Since the average salary for blockchain developers was much higher than other salaries, it was removed from the following chart. The differences between the rest of the developer salaries in the U.S. can more clearly be seen here.

wide_stack %>%
  mutate(DevType = str_split_fixed(DevType, ';', 2)[, 1]) %>%
  filter(!is.na(AnnualSalary)) %>%
  group_by(DevType) %>%
  summarise(AvgSalary = mean(AnnualSalary, na.rm = TRUE)) %>%
  arrange(desc(AvgSalary)) %>%
  head(n = 20) %>%
  filter(DevType != "Blockchain") %>%
  ggplot(aes(x = reorder(DevType, -AvgSalary), y = AvgSalary)) +
    geom_bar(stat = "identity", position = "dodge") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    ggtitle("Top 19 Average Salaries in the U.S. by Developer Type", subtitle = "Excluding Blockchain Developers") +
    labs(x = "Developer Type", y = "Average Salary")

Salary vs age

We can see from the below graph that those between ages 23 and 38 with full time jobs have a higher salary distribution than those 18 to 23 and 38 and older. This phenomenon might be attributed to the composition of the respondents in the Stack Overflow survey. It is possible that older generations of workers feel less inclined to complete the Stack Overflow survey.

wide_stack %>%
  filter(Year %in% 2018:2022) %>% # 2017 not in USD
  filter(str_detect(Employment, "full")) %>% # full timers only 
  filter(!is.na(Age)) %>% 
  mutate(Age_Group = cut(Age,
      breaks = c(18, seq(23, 65, by = 5)),  # Create breaks from 18 to 65 in increments of 5
      labels = paste(seq(18, 60, by = 5), seq(23, 65, by = 5), sep = "-"),
      right = FALSE)) %>%
  na.exclude(Age_Group) %>%
  select(Year, Age_Group, Gender, AnnualSalary) %>%
  ggplot(aes(x = Age_Group, y = AnnualSalary)) +
    geom_point() +
    ggtitle("Age vs. Annual Salary") +
    labs(x = "Age Group", y = "Annual Salary (USD)") +
    theme(plot.title = element_text(hjust = 0.5))


Status, Degrees, and Org Size

Employment Status by degrees

People are most likely to be employed full-time if they have a Bachelor’s degree, Master’s degree, or completed some college. Also, people who have a Bachelor’s degree are more likely to be self-employed than any other degree. People who only completed high school are most likely to be unemployed. Since most people who took the survey had Bachelor’s degrees based on these counts, it may have skewed the results. Also, people are more likely to take this survey if they are employed because they may use Stack Overflow to help them while working at their job.

degree_order <- c("High School", "Some college", "Associate's", "Bachelor's", "Professional", "Master's", "Doctorate")

wide_stack %>%
  mutate(
    DevType = str_split_fixed(DevType, ';', 2)[, 1],
    JobStatus = case_when(
      grepl("Employed, full-time", Employment) ~ "Employed, full-time",
      grepl("Employed, part-time", Employment) ~ "Employed, part-time",
      grepl("Not employed", Employment) ~ "Not employed",
      grepl("self-employed", Employment) ~ "Self-employed",
      grepl("I prefer not to say", Employment) ~ NA_character_,
      grepl("Retired", Employment) ~ "Retired",
      grepl("Student", Employment) ~ "Student",
      TRUE ~ Employment
    ),
    Degree = case_when(
      grepl("Associate", EdLevel) ~ "Associate's",
      grepl("Bachelor", EdLevel) ~ "Bachelor's",
      grepl("Master", EdLevel) ~ "Master's",
      grepl("doctoral", EdLevel) ~ "Doctorate",
      grepl("Some college", EdLevel) ~ "Some college",
      grepl("Secondary", EdLevel) ~ "High School",
      grepl("Professional", EdLevel) ~ "Professional",
      grepl("prefer|never|Primary|Something", EdLevel) ~ NA_character_,
      TRUE ~ EdLevel
    )
  ) %>%
  select(JobStatus, Degree) %>%
  filter(!is.na(Degree), !is.na(JobStatus)) %>%
  mutate(Degree = factor(Degree, levels = degree_order)) %>%
  ggplot(aes(Degree, ..count..)) + 
  geom_bar(aes(fill = JobStatus), position = "dodge") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Employment Status by Degree ",
       x = "Degree",
       y = "Count of People") +
  scale_fill_brewer(palette = "Set3")

Organization size to years coding

People of all coding experience work for organizations of all different sizes. Most people have coded 0 to 20 years, but people of all coding experiences work for small organizations, medium-sized organizations, and large organizations. There does not seem to be any relationship. The decision to employ workers, and the decision to work at a particular-size company is likely determined on a individual basis. Most organizations likely try to employ people of all different experience levels. Some employees need a lot of experience, while others will gain that experience on the job.

ggplot(wide_stack, aes(x = YearsCodeProAvg, y = OrgSizeAvg)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Organization Size by years coding Professionally") +
  labs(x = "Years coding", y = "Org Size")


Gender and Programming Languages

Gender, salary and language

The below plot illustrates the average salary per gender for each distinct language.

wide_stack %>%
  filter(rowSums(.[, language_cols] == "yes") <= 1) %>%
  mutate(KnownLanguage = case_when(
    python == "yes" ~ "python",
    sql == "yes" ~ "sql",
    java == "yes" ~ "java",
    scala == "yes" ~ "scala",
    r == "yes" ~ "r",
    julia == "yes" ~ "julia",
    TRUE ~ "unknown"
  )) %>%
  filter(KnownLanguage != "unknown") %>%
  filter(Year %in% 2018:2022) %>%
  filter(str_detect(Employment, "full")) %>%
  select(Year, Gender, Employment, AnnualSalary, KnownLanguage) %>%
  filter(Gender %in% c("Female", "Male")) %>%
  group_by(Gender, KnownLanguage) %>%
  summarise(AvgSalary = mean(AnnualSalary, na.rm = TRUE)) %>%
  ggplot(aes(x = KnownLanguage, y = AvgSalary, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge", color = "black") +
  ggtitle("Average Salary for Each Distinct Language") +
  labs(x = "Language", y = "Average Salary") +
  scale_fill_manual(values = c("Female" = "hotpink", "Male" = "darkblue")) +
  theme_minimal() + 
  coord_flip()

Gender and language within surveyors

From the previous graph, we see that the average female salary for Scala is an outlier, with its average salary being over $400,000. To investigate this further, we will construct a table displaying the respondent counts for each scenario.

From the table below, it’s apparent that there are only four females in this specific dataset who are proficient in Julia. When considering the actual population of females who exclusively know Julia, this dataset may not provide an accurate representation of their salaries. A similar inference could be made for Scala, as there are only 27 females who exclusively know Scala in contrast to 506 men. However, it’s important to note that this dataset accurately reflects the salaries of females who participated in the Stack Overflow survey. In other words, these average salaries are representative of the sample of females who completed the survey, but not the true population of females who know Scala and Julia.

 wide_stack %>%
  filter(rowSums(.[, language_cols] == "yes") <= 1) %>%
  mutate(KnownLanguage = case_when(
    python == "yes" ~ "python",
    sql == "yes" ~ "sql",
    java == "yes" ~ "java",
    scala == "yes" ~ "scala",
    r == "yes" ~ "r",
    julia == "yes" ~ "julia",
    TRUE ~ "unknown"
  )) %>%
  filter(KnownLanguage != "unknown") %>%
  filter(Year %in% 2018:2022) %>%
  filter(str_detect(Employment, "full")) %>%
  filter(Gender %in% c("Female", "Male")) %>%
  filter(!is.na(AnnualSalary)) %>%
  group_by(Gender, KnownLanguage) %>%
  mutate(Count = n()) %>%
  select(KnownLanguage, Gender, Count) %>%
  distinct() %>%
  arrange(KnownLanguage) %>%
  ungroup() %>%
  gt() %>%
  tab_header(
    title = md("**Count of Surveyors in Average Salary Calculation**")
  ) %>%
  cols_label(
    KnownLanguage = md("**Language**"),
    Gender = md("**Gender**"),
    Count = md("**Count**")
  )
Count of Surveyors in Average Salary Calculation
Language Gender Count
java Male 13802
java Female 891
julia Male 124
julia Female 4
python Male 18870
python Female 1040
r Male 541
r Female 86
scala Male 506
scala Female 27
sql Male 34273
sql Female 2095

Case Study: Scala and Python

Either, or both?

Is there a financial advantage to having proficiency in both Python and Scala compared to solely knowing Scala or Python? Based on the below graph, we can infer that knowing a combination of the two languages will likely result in a higher salary.

wide_stack %>%
  # filter for years 2018-2022, since 2017 is not in USD
  filter(Year %in% 2018:2022) %>%
  filter(str_detect(Employment, "full")) %>%
  filter(Gender %in% c("Female", "Male")) %>%
  mutate(status = case_when(
    python == "yes" & scala == "yes" ~ "both",
    python == "yes" & scala == "no" ~ "python",
    python == "no" & scala == "yes" ~ "scala",
    TRUE ~ NA_character_
  )) %>%
  filter(!is.na(status)) %>%
  group_by(status, Gender) %>%
  summarise(AvgSalary = mean(AnnualSalary, na.rm = TRUE), .groups = "drop") %>%
  # Plot
  ggplot(aes(x = status, y = AvgSalary, fill = Gender)) +
    geom_bar(stat = "identity", position = "dodge") +
    ggtitle("Average Salaries for\nScala/Python Combinations") +
    labs(x = "Combination", y = "Average Salary") +
  # center title
    theme(plot.title = element_text(hjust = 0.5)) +
    theme_minimal()

Both, over time

wide_stack %>%
  filter(Year %in% 2018:2022) %>%
  filter(str_detect(Employment, "full")) %>%
  mutate(status = case_when(
    python == "yes" & scala == "yes" ~ "both",
    python == "yes" & scala == "no" ~ "python",
    python == "no" & scala == "yes" ~ "scala",
    python == "no" & scala == "no" ~ "neither"
  )) %>%
  filter(status != "neither") %>%
  select(Year, AnnualSalary, status) %>%
  group_by(Year, status) %>%
  summarise(AvgSalary = mean(AnnualSalary, na.rm = TRUE), .groups = "drop") %>%
  rename(`Area of Expertise` = status) %>%
  ggplot(aes(x = Year, y = AvgSalary, color = `Area of Expertise`)) +
    geom_path(size=2) +
    ggtitle("Average Salaries for Scala/Python Combinations") +
    labs(y = "Average Salary In USD", x = "Year")

Both, by education level

EdLevels <- c("Associate degree", "Bachelor’s degree", "Some college/university study without earning a degree", "Master’s degree", "Other doctoral degree", "Professional degree")

wide_stack_no2017US <- wide_stack %>%
  filter(Year %in% 2018:2022) %>%
  mutate(EdLevel = gsub(r"{\s*\([^\)]+\)}", "", as.character(EdLevel))) %>%
  filter(str_detect(Country, "United States") & EdLevel %in% EdLevels)

# Create a new dataset with specific conditions
df_scala_python_edlevel <- bind_rows(
  wide_stack_no2017US %>%
    filter(python == "yes" & scala == "yes") %>%
    filter(str_detect(Employment, "full")) %>%
    select(Year, EdLevel, Country, AnnualSalary) %>%
    mutate(status = "both"),
  
  wide_stack_no2017US %>%
    filter(python == "yes" & scala == "no") %>%
    filter(str_detect(Employment, "full")) %>%
    select(Year, EdLevel, Country, AnnualSalary) %>%
    mutate(status = "python"),
  
  wide_stack_no2017US %>%
    filter(python == "no" & scala == "yes") %>%
    filter(str_detect(Employment, "full")) %>%
    select(Year, EdLevel, Country, AnnualSalary) %>%
    mutate(status = "scala")
) %>%
  group_by(EdLevel, Year) %>%
  summarise(AvgSalary = mean(AnnualSalary, na.rm = TRUE), .groups = "drop")

# Clean up the dataset by removing NA values
df_scala_python_edlevel <- df_scala_python_edlevel[complete.cases(df_scala_python_edlevel), ]
colnames(df_scala_python_edlevel)[colnames(df_scala_python_edlevel) == "EdLevel"] <- "Level of Education"

# Plot the data
ggplot(df_scala_python_edlevel, aes(x = Year, y = AvgSalary, color = `Level of Education`, group = `Level of Education`)) + 
  geom_line() + 
  ggtitle("Average Salaries for Scala and Python (USA) By Level of Education") +
  labs(y = "Average Salary in USD", x = "Year")

From the above graphs, we can infer that knowing a combination of the two languages will likely result in a higher salary and despite a drop in the salaries of python developers from 2019 to 2021 there is marked growth in salaries in all areas of expertise in the past 2 years in the United States. The growth of salaries is also seen across all levels of education roughly since 2021. Although the nature of this growth is not entirely clear because the survey was based on individuals self reporting their education.