608story5.knit

Story 5 - How Much Do We Get Paid?

Daniel DeBonis

Assignment

I have introduced the term “Data Practitioner” as a generic job descriptor because we have so many different job role titles for individuals whose work activities overlap including Data Scientist, Data Engineer, Data Analyst, Business Analyst, Data Architect, etc. For this story we will answer the question, “How much do we get paid?” Your analysis and data visualizations must address the variation in average salary based on role descriptor and state. Notes: You will need to identify reliable sources for salary data and assemble the data sets that you will need. Your visualization(s) must show the most salient information (variation in average salary by role and by state). For this Story you must use a code library and code that you have written in R, Python or Java Script (additional coding in other languages is allowed). Post generation enhancements to you generated visualization will be allowed (e.g. Addition of kickers and labels).

Sourcing Data

The data for current salaries across commonly used job titles within data science comes from the Bureau of Labor Statistics. Using the Standard Occupational Classification System used in this dataset, five codes were identified to be the most likely jobs to include data science practices. For example, 15-2099 other Mathematical Occupations which can include those working in machine learning or bioinformatics. Many business categories include the term analysts, but the one selected seemed to be the most relevant to the question.

library(readxl)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

excel_path <- "C:/Users/ddebo/Downloads/oesm23st/oesm23st/state_M2023_dl.xlsx"

oews_data <- read_excel(excel_path, sheet = "state_M2023_dl")

# Check the first few rows and column names
glimpse(oews_data)

## Rows: 37,676
## Columns: 32
## $ AREA         <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", "01…
## $ AREA_TITLE   <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "A…
## $ AREA_TYPE    <chr> "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2…
## $ PRIM_STATE   <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL…
## $ NAICS        <chr> "000000", "000000", "000000", "000000", "000000", "000000…
## $ NAICS_TITLE  <chr> "Cross-industry", "Cross-industry", "Cross-industry", "Cr…
## $ I_GROUP      <chr> "cross-industry", "cross-industry", "cross-industry", "cr…
## $ OWN_CODE     <chr> "1235", "1235", "1235", "1235", "1235", "1235", "1235", "…
## $ OCC_CODE     <chr> "00-0000", "11-0000", "11-1011", "11-1021", "11-1031", "1…
## $ OCC_TITLE    <chr> "All Occupations", "Management Occupations", "Chief Execu…
## $ O_GROUP      <chr> "total", "major", "detailed", "detailed", "detailed", "de…
## $ TOT_EMP      <chr> "2053090", "105580", "720", "34450", "1140", "70", "1490"…
## $ EMP_PRSE     <chr> "0", "0.8", "6.8", "2.7", "9.1", "16.5", "3.7", "3.3", "*…
## $ JOBS_1000    <chr> "1000", "51.423999999999999", "0.34799999999999998", "16.…
## $ LOC_QUOTIENT <chr> "1", "0.74", "0.25", "0.73", "2.6", "0.24", "0.3", "0.47"…
## $ PCT_TOTAL    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ PCT_RPT      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ H_MEAN       <chr> "25.67", "56.21", "106.26", "62.17", "*", "53.99", "62.94…
## $ A_MEAN       <chr> "53400", "116920", "221030", "129310", "33690", "112290",…
## $ MEAN_PRSE    <chr> "0.2", "0.5", "5.8", "1.1000000000000001", "5.09999999999…
## $ H_PCT10      <chr> "10.87", "24.38", "31.59", "23.11", "*", "36.770000000000…
## $ H_PCT25      <chr> "14.22", "35.18", "59.6", "34.74", "*", "39.8800000000000…
## $ H_MEDIAN     <chr> "19.88", "47.95", "79.48", "49.67", "*", "50.37", "54.55"…
## $ H_PCT75      <chr> "30.09", "67.22", "102.01", "78.25", "*", "64.03", "78.95…
## $ H_PCT90      <chr> "46.18", "95.44", "#", "112.54", "*", "71.209999999999994…
## $ A_PCT10      <chr> "22620", "50710", "65700", "48080", "18320", "76480", "65…
## $ A_PCT25      <chr> "29580", "73180", "123960", "72260", "19670", "82950", "8…
## $ A_MEDIAN     <chr> "41350", "99740", "165320", "103320", "24470", "104770", …
## $ A_PCT75      <chr> "62580", "139810", "212180", "162760", "45050", "133170",…
## $ A_PCT90      <chr> "96050", "198520", "#", "234080", "55070", "148110", "220…
## $ ANNUAL       <chr> NA, NA, NA, NA, "TRUE", NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ HOURLY       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

# SOC codes for data-related roles
data_soc_codes <- c(
  "15-2051", # Data Scientists
  "15-2031", # Operations Research Analysts
  "15-1243", # Database Architects
  "13-1111", # Management Analysts
  "15-2099"  # Mathematical Occupations, All Other
)

# Filter, clean, and summarize
salary_data <- oews_data |>
  filter(OCC_CODE %in% data_soc_codes) |>
  select(AREA_TITLE, OCC_CODE, OCC_TITLE, TOT_EMP, A_MEAN, A_MEDIAN, PRIM_STATE) |>
  mutate(A_MEAN = as.numeric(A_MEAN))|>
  mutate(A_MEDIAN = as.numeric(A_MEDIAN))

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `A_MEAN = as.numeric(A_MEAN)`.
## Caused by warning:
## ! NAs introduced by coercion

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `A_MEDIAN = as.numeric(A_MEDIAN)`.
## Caused by warning:
## ! NAs introduced by coercion

# Summarize by state and job title
summary_salary <- salary_data |>
  select(AREA_TITLE, OCC_TITLE, A_MEAN)

Visualization(s) 1 - States with Top and Bottom 5 Mean Salaries per Title

# Function to plot top & bottom N states on the same graph
plot_top_bottom_single_axis <- function(data, role, n = 5) {
  
  df <- data %>%
    filter(OCC_TITLE == role) |>
    # Select top N and bottom N
    slice_max(order_by = A_MEAN, n = n) |>
    mutate(Rank = "Top") |>
    bind_rows(
      data |>
        filter(OCC_TITLE == role) |>
        slice_min(order_by = A_MEAN, n = n) |>
        mutate(Rank = "Bottom")
    ) %>%
    # Reorder states by salary for plotting
    arrange(A_MEAN) |>
    mutate(AREA_TITLE = factor(AREA_TITLE, levels = AREA_TITLE))
  
  ggplot(df, aes(x = AREA_TITLE, y = A_MEAN, fill = Rank)) +
    geom_col() +
    coord_flip() +
    scale_fill_manual(values = c("Top" = "steelblue", "Bottom" = "firebrick")) +
    labs(
      title = role,
      x = "State",
      y = "Mean Salary (USD)",
      caption = "Top 5 states shown in blue, bottom 5 states in red"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

}

# List of roles
roles <- c(
  "Data Scientists",
  "Database Architects",
  "Management Analysts",
  "Operations Research Analysts",
  "Mathematical Science Occupations, All Other"
)

# Loop over roles to display plots
for (role in roles) {
  p <- plot_top_bottom_single_axis(summary_salary, role, n = 5)
  
  # Display
  print(p)

}

### Visualization(s) 2 - Box Plot for Mean and Median salaries

Salaries are a case where the mean might not be the best representation of what an average person should expect to be receiving. Most companies have some type of hierarchical structure and the highest paying jobs are the lowest in number, so I wanted to investigate the whether the median would be a more representative value in answering our question.

salary_data <- salary_data |>
    mutate(OCC_TITLE_SHORT = str_wrap(OCC_TITLE, width = 15))
ggplot(salary_data, aes(x = OCC_TITLE_SHORT, y = A_MEAN, fill = OCC_TITLE)) +
  geom_boxplot(alpha = 0.7) +
  coord_flip() +
  scale_y_continuous(labels = scales::label_comma()) +
  labs(
    title = "Salary Distribution by Role",
    x = "Role",
    y = "Mean Salary",
    caption = "Data across all states"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

ggplot(salary_data, aes(x = OCC_TITLE_SHORT, y = A_MEDIAN, fill = OCC_TITLE)) +
  geom_boxplot(alpha = 0.7) +
  coord_flip() +
  scale_y_continuous(labels = scales::label_comma()) +
  labs(
    title = "Salary Distribution by Role",
    x = "Role",
    y = "Median Salary",
    caption = "Data across all states"
  ) +
    theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

There was some right skew apparent in the distribution of the mean salaries, particularly for data scientists and database architects. Comparing the first box plot to the second does support the idea that the median would provide a more accurate answer, with all distributions shifting downwards.

Visualization 3 - Distribution of Differences between Mean and Median Salaries

I wanted a more concrete visualization of the skew before committing to using median as the primary measure.

salary_data <- salary_data %>%
  mutate(Diff = A_MEAN - A_MEDIAN)

ggplot(salary_data, aes(x = Diff)) +
  geom_histogram(binwidth = 2000, fill = "steelblue", color = "white") +
  labs(
    title = "Distribution of Mean vs Median Salary Differences Across States",
    x = "Mean - Median Salary",
    y = "Number of States",
    caption = "Positive values indicate mean > median"
  ) +
  theme_minimal()

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).

Interestingly enough, we get a nearly normal distribution of differences, but the key takeaway is the mean value around 5000. Within the pairs of mean and median salaries within states and job titles, mean salaries average $5,000 higher, confirming the skew within the mean, which is why I choose to focus on the median as the measure

Visualization 4 - Distribution of Differences between Mean and Median across Job Titles

ggplot(salary_data, aes(x = OCC_TITLE_SHORT, y = A_MEAN - A_MEDIAN, fill = OCC_TITLE)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::label_comma()) +
  coord_flip() +
  labs(
    title = "Distribution of Mean vs Median Salary Differences by Job Title",
    x = "Job Title",
    y = "Mean - Median Salary",
    caption = "Boxplots show state-to-state variation"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Management analysts get singled out here as the strongest contributor to the right skew within mean salaries, but it is a pattern that we see to some extent across all categories.

Visualization 5 - Revisiting V1 using Median

# Function to plot top & bottom N states on the same graph
plot_top_bottom_single_axis <- function(data, role, n = 5) {
  
  df <- data %>%
    filter(OCC_TITLE == role) %>%
    # Select top N and bottom N
    slice_max(order_by = A_MEDIAN, n = n) %>%
    mutate(Rank = "Top") %>%
    bind_rows(
      data %>%
        filter(OCC_TITLE == role) %>%
        slice_min(order_by = A_MEDIAN, n = n) %>%
        mutate(Rank = "Bottom")
    ) %>%
    # Reorder states by salary for plotting
    arrange(A_MEDIAN) %>%
    mutate(AREA_TITLE = factor(AREA_TITLE, levels = AREA_TITLE))
  
  ggplot(df, aes(x = AREA_TITLE, y = A_MEDIAN, fill = Rank)) +
    geom_col() +
    coord_flip() +
    scale_fill_manual(values = c("Top" = "steelblue", "Bottom" = "firebrick")) +
    labs(
      title = role,
      x = "State",
      y = "Median Salary (USD)",
      caption = "Top 5 states shown in blue, bottom 5 states in red"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

}

# List of roles
roles <- c(
  "Data Scientists",
  "Database Architects",
  "Management Analysts",
  "Operations Research Analysts",
  "Mathematical Science Occupations, All Other"
)

# Loop over roles to display/save plots
for (role in roles) {
  p <- plot_top_bottom_single_axis(salary_data, role, n = 5)
  
  # Display
  print(p)
  
  
}

Visualization 6 - Heat Map of Median Salaries by State and Title

ggplot(salary_data, aes(x = PRIM_STATE, y = OCC_TITLE_SHORT, fill = A_MEDIAN)) +
  geom_tile(color = "white") +
  scale_fill_viridis_c(option = "plasma", labels = scales::label_dollar()) +
  labs(
    title = "Median Salary by State and Job Title",
    x = "State",
    y = "Job Title",
    fill = "Median Salary"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

It is true that the state initials are barely legible, but the purpose of this heatmap is more to show the variation rather than tying that variation to specific states, which is much better done in the following visualization. However, this heatmap format allows for the data to be split at the job level rather than requiring five separate images to convey these variances.

Visualization 7 - Map of Weighted Median Salaries for Data Fields in the United States

library(usmap)       # for plotting US maps
library(scales)      # for dollar formatting

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

library(matrixStats) # for weighted median

## 
## Attaching package: 'matrixStats'

## The following object is masked from 'package:dplyr':
## 
##     count

# Compute weighted median per state
state_medians <- oews_data |>
  filter(OCC_CODE %in% data_soc_codes) |>
  mutate(TOT_EMP = as.numeric(TOT_EMP)) |>
  mutate(A_MEDIAN = as.numeric(A_MEDIAN))|>  # coerce to numeric
  replace_na(list(TOT_EMP = 0)) |>
  group_by(PRIM_STATE) |>
  summarize(
    median_salary = weightedMedian(A_MEDIAN, w = TOT_EMP, na.rm = TRUE)
  ) |>
  ungroup()

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `TOT_EMP = as.numeric(TOT_EMP)`.
## Caused by warning:
## ! NAs introduced by coercion

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `A_MEDIAN = as.numeric(A_MEDIAN)`.
## Caused by warning:
## ! NAs introduced by coercion

# Plot the map
map_df <- state_medians |>
  rename(state = PRIM_STATE) |>
  mutate(median_salary = as.numeric(median_salary))

# Plot
plot_usmap(data = map_df, values = "median_salary") +
  scale_fill_viridis_c(labels = scales::label_dollar(), na.value = "grey90") +
  labs(title = "Average Median Salary for Data Roles by State",
       fill = "Median Salary",
       caption = "Value calcuated from median salary per title weighted by number of employees having title") +
  theme(legend.position = "right")

The main difficulty in this final visualization was combining the information on state and title levels. With six job titles, I could not just simply take a median per state. The heatmap can confirm my suspicions that there are not equal numbers of employees in each category, so averaging the median values would be biased as well. However, since we have the number of employees in each category in each state in our original data, we can use these to weight the median values according to make a more accurate, weighted average. This allows the viewer a more complete picture of expected salaries rather than focusing solely on the data scientist title, capturing more possibilities in the field.