Story 4: “How Much Do We Get Paid?”

Understanding how salaries vary across job roles is essential for career planning. Because responsibilities overlap among roles like Data Scientist, Data Engineer, and Business Analyst, it is worth exploring how pay differs by occupation and location. Using data from the U.S. Bureau of Labor Statistics (BLS), this analysis compares mean annual salaries for five data-related occupations across U.S. states. By visualizing these differences, we aim to show how location influences earning potential for people working in data-driven industries.

Dataset

This dataset, sourced from the U.S. Bureau of Labor Statistics (BLS), includes average annual salaries for specific roles such as “Data Scientists,” “Database Administrators,” “Database Architects,” “Statisticians,” and “Information Security Analysts” across various U.S. states. It allows for a comparison of salary variations by role and location, highlighting regional differences in compensation.

library(ggplot2)
library(dplyr)
library(sf)
library(tigris)

# Load the state-level BLS wage data
df <- read.csv("https://raw.githubusercontent.com/suswong/DATA-608/refs/heads/main/state_M2023_dl.csv")

occupations_to_filter <- c(
  "Data Scientists",
  "Database Administrators",
  "Database Architects",
  "Statisticians",
  "Information Security Analysts"
)

filtered_df <- df %>%
  filter(OCC_TITLE %in% occupations_to_filter) %>%
  select(PRIM_STATE, AREA_TITLE, OCC_TITLE, H_MEAN, A_MEAN)

filtered_df$H_MEAN <- as.numeric(gsub(",", "", filtered_df$H_MEAN))
filtered_df$A_MEAN <- as.numeric(gsub(",", "", filtered_df$A_MEAN))
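The conversion above can silently turn some entries into NA. The sketch below illustrates this on a few made-up values, assuming the wage columns may contain non-numeric markers such as `*` or `#` alongside comma-formatted numbers; the vector `raw_wages` is hypothetical and is not taken from the dataset.

```r
# Made-up values in the formats a wage column might contain.
raw_wages <- c("112,590", "98,740", "*", "#", "131,200")

# Strip thousands separators, then convert; non-numeric markers become NA.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
clean_wages <- suppressWarnings(as.numeric(gsub(",", "", raw_wages)))

# Count the entries that could not be parsed and average the rest.
n_missing <- sum(is.na(clean_wages))
mean_wage <- mean(clean_wages, na.rm = TRUE)

print(clean_wages)
print(n_missing)
print(mean_wage)
```

Running the same count on `filtered_df$A_MEAN` would show how many state and occupation pairs have no usable salary before any plotting is done.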

Visualization

Boxplot

states <- st_as_sf(states(cb = TRUE))
merged_df <- merge(states, filtered_df, by.x = "STUSPS", by.y = "PRIM_STATE", all.x = TRUE)

ggplot(merged_df, aes(y = OCC_TITLE, x = A_MEAN,fill = OCC_TITLE)) +
  geom_boxplot() +
  scale_fill_viridis_d(option = "C")+
  labs(title = "Annual Mean Salary Distribution by Occupation", 
       y = "Occupation", 
       x = "Annual Mean Salary") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate x-axis labels for better readability
    axis.title.x = element_blank(),  # Optionally remove x-axis label
    panel.grid = element_blank(),  # Remove grid lines for cleaner look
    legend.position = "none"  # Remove the legend
  )
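Because the merge uses `all.x = TRUE`, every area in the boundary data is kept even when it has no wage rows, and those rows carry NA values into the boxplot. A minimal sketch of that behavior, using plain data frames as hypothetical stand-ins for `states` and `filtered_df` (all names and values below are made up for illustration):

```r
# Hypothetical stand-ins: `shapes` plays the role of the state boundaries,
# `wages` the role of filtered_df. "ZZ" is a placeholder area code.
shapes <- data.frame(STUSPS = c("NY", "CA", "ZZ"), stringsAsFactors = FALSE)
wages <- data.frame(
  PRIM_STATE = c("NY", "NY", "CA"),
  OCC_TITLE  = c("Statisticians", "Data Scientists", "Statisticians"),
  A_MEAN     = c(105000, 125000, 118000),
  stringsAsFactors = FALSE
)

# Left join: every row of `shapes` is kept, matched or not.
joined <- merge(shapes, wages, by.x = "STUSPS", by.y = "PRIM_STATE", all.x = TRUE)

# Areas with no wage rows come back with NA in the wage columns.
unmatched <- joined$STUSPS[is.na(joined$OCC_TITLE)]

# Dropping them gives a table suited to non-map plots such as a boxplot.
plot_ready <- joined[!is.na(joined$A_MEAN), ]

print(unmatched)
print(nrow(plot_ready))
```

For the maps, keeping unmatched areas is useful because they can be drawn in the `na.value` color; for the boxplot, filtering them out first avoids an empty NA category.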

U.S. Heatmaps of Annual Mean Salary by Occupation

Across all states, New York, California, and Texas pay the most. Database Architects are the highest paid of the five occupations in most states, and Statisticians are the lowest paid in most states.
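Claims like these can be checked numerically as well as read off the maps. The sketch below is one possible way to do that in base R, shown on a small made-up table whose column names mirror `filtered_df`; the states and salary values are hypothetical and do not reproduce the BLS figures.

```r
# Hypothetical salaries; column names mirror filtered_df, values are made up.
toy <- data.frame(
  AREA_TITLE = rep(c("State A", "State B", "State C"), each = 3),
  OCC_TITLE  = rep(c("Data Scientists", "Database Architects", "Statisticians"),
                   times = 3),
  A_MEAN     = c(120000, 140000, 100000,
                 110000, 130000, 115000,
                  90000,  95000,  85000),
  stringsAsFactors = FALSE
)

# Average salary per state across occupations, highest first.
state_avg <- aggregate(A_MEAN ~ AREA_TITLE, data = toy, FUN = mean)
state_avg <- state_avg[order(-state_avg$A_MEAN), ]

# For each state, the occupation with the highest and the lowest salary.
by_state   <- split(toy, toy$AREA_TITLE)
top_occ    <- sapply(by_state, function(d) d$OCC_TITLE[which.max(d$A_MEAN)])
bottom_occ <- sapply(by_state, function(d) d$OCC_TITLE[which.min(d$A_MEAN)])

# Number of states in which each occupation is the best or worst paid.
top_counts    <- table(top_occ)
bottom_counts <- table(bottom_occ)

print(state_avg)
print(top_counts)
print(bottom_counts)
```

Applied to `filtered_df` after dropping rows with missing `A_MEAN`, the same steps would give a ranked list of states and a count of how often each occupation leads or trails.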

occupations <- c("Statisticians","Database Administrators","Data Scientists", "Information Security Analysts","Database Architects")
global_min <- min(merged_df$A_MEAN, na.rm = TRUE)
global_max <- max(merged_df$A_MEAN, na.rm = TRUE)

for (occupation in occupations) {
  occupation_data <- merged_df %>% filter(OCC_TITLE == occupation)
  plot <- ggplot(occupation_data) +
    geom_sf(aes(fill = A_MEAN)) +
    scale_fill_viridis_c(option = "C",
                         na.value = "gray",
                         limits = c(global_min, global_max),
                         direction = -1) +
    labs(title = paste("Annual Mean Salary for", occupation)) +
    theme_minimal() +
    coord_sf(xlim = c(-125, -65), ylim = c(25, 50), expand = FALSE) +
    theme(
      axis.text = element_blank(),  
      axis.ticks = element_blank(),
      panel.grid = element_blank() 
    )
  
  #ggsave(paste0("heatmap_", gsub(" ", "_", occupation), ".png"), plot = plot, width = 7, height = 7)
  print(plot)
}

Conclusion

Across all states, New York, California, and Texas pay the most. Database Architects are the highest paid of the five occupations in most states, and Statisticians are the lowest paid in most states. Further analysis should factor in job experience, industry, and education.