Understanding how salaries vary across job roles is essential for career planning. Because responsibilities overlap among data-related roles such as Data Scientist, Data Engineer, and Business Analyst, it is worth examining how pay differs by occupation and location. Using data from the U.S. Bureau of Labor Statistics (BLS), this analysis compares mean annual salaries for several data-related occupations across U.S. states. The visualizations show how location influences earning potential for people working in data-driven industries.
This dataset, sourced from the U.S. Bureau of Labor Statistics (BLS), includes average annual salaries for specific roles such as “Data Scientists,” “Database Administrators,” “Database Architects,” “Statisticians,” and “Information Security Analysts” across various U.S. states. It allows for a comparison of salary variations by role and location, highlighting regional differences in compensation.
library(ggplot2)
library(dplyr)
library(sf)
library(tigris)
df <- read.csv("https://raw.githubusercontent.com/suswong/DATA-608/refs/heads/main/state_M2023_dl.csv")
occupations_to_filter <- c(
"Data Scientists",
"Database Administrators",
"Database Architects",
"Statisticians",
"Information Security Analysts"
)
filtered_df <- df %>%
  filter(OCC_TITLE %in% occupations_to_filter) %>%
  select(PRIM_STATE, AREA_TITLE, OCC_TITLE, H_MEAN, A_MEAN)
filtered_df$H_MEAN <- as.numeric(gsub(",", "", filtered_df$H_MEAN))
filtered_df$A_MEAN <- as.numeric(gsub(",", "", filtered_df$A_MEAN))
states <- st_as_sf(states(cb = TRUE))
merged_df <- merge(states, filtered_df, by.x = "STUSPS", by.y = "PRIM_STATE", all.x = TRUE)
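The cleaning and merge steps above can be illustrated on a small toy example. This is a minimal sketch, not the analysis itself: the `toy_states` and `toy_wages` data frames and every value in them are made up for illustration, and the `*` entry stands in for a non-numeric placeholder that the wage columns might contain (an assumption about the file, not something verified here). It shows that `as.numeric()` turns such placeholders into `NA`, and that `all.x = TRUE` keeps states with no matching wage row, which is why they are later drawn in gray on the maps.

```r
# Toy stand-ins for the real tables (all values are hypothetical).
toy_states <- data.frame(STUSPS = c("NY", "CA", "WY"))
toy_wages <- data.frame(
  PRIM_STATE = c("NY", "CA"),
  OCC_TITLE  = c("Statisticians", "Statisticians"),
  A_MEAN     = c("101,500", "*"),  # "*" mimics a non-numeric placeholder
  stringsAsFactors = FALSE
)

# Strip thousands separators, then convert; non-numeric strings become NA.
toy_wages$A_MEAN <- suppressWarnings(
  as.numeric(gsub(",", "", toy_wages$A_MEAN))
)

# Left join: every state is kept, even when it has no wage row.
toy_merged <- merge(toy_states, toy_wages,
                    by.x = "STUSPS", by.y = "PRIM_STATE", all.x = TRUE)
print(toy_merged)
```

In this toy result, the state without a wage row and the state with the placeholder both end up with `NA` in `A_MEAN`, so checking `sum(is.na(merged_df$A_MEAN))` after the real merge is a quick way to see how many state and occupation combinations have no usable salary.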
ggplot(merged_df, aes(y = OCC_TITLE, x = A_MEAN, fill = OCC_TITLE)) +
  geom_boxplot() +
  scale_fill_viridis_d(option = "C") +
  labs(
    title = "Annual Mean Salary Distribution by Occupation",
    y = "Occupation",
    x = "Annual Mean Salary"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate x-axis labels for better readability
    axis.title.x = element_blank(),                     # Optionally remove x-axis label
    panel.grid = element_blank(),                       # Remove grid lines for cleaner look
    legend.position = "none"                            # Remove the legend
  )
Across all states, New York, California, and Texas pay the most. Database Architects are the highest-paid occupation in most states, and Statisticians are the lowest-paid occupation in more states than any other role.
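One way to check a statement like this is to rank the occupations within each state and count how often each one comes out on top or at the bottom. The sketch below does this with `dplyr` on a tiny made-up table: `toy_df`, the state codes `AA` and `BB`, and all salary values are hypothetical and are not BLS figures. In the analysis, the same pipeline could be applied to `filtered_df` instead.

```r
library(dplyr)

# Hypothetical salaries for two states and three occupations (not BLS figures).
toy_df <- data.frame(
  PRIM_STATE = rep(c("AA", "BB"), each = 3),
  OCC_TITLE  = rep(c("Database Architects", "Data Scientists", "Statisticians"),
                   times = 2),
  A_MEAN     = c(130000, 120000, 90000, 110000, 115000, 95000)
)

# Drop rows without a salary so they cannot be picked as highest or lowest.
ranked <- toy_df %>%
  filter(!is.na(A_MEAN)) %>%
  group_by(PRIM_STATE)

# Best- and worst-paid occupation within each state.
top_by_state <- ranked %>%
  slice_max(A_MEAN, n = 1, with_ties = FALSE) %>%
  ungroup()
bottom_by_state <- ranked %>%
  slice_min(A_MEAN, n = 1, with_ties = FALSE) %>%
  ungroup()

# Number of states in which each occupation ranks first or last.
top_counts <- count(top_by_state, OCC_TITLE, name = "n_states_highest")
bottom_counts <- count(bottom_by_state, OCC_TITLE, name = "n_states_lowest")

print(top_counts)
print(bottom_counts)
```

Counting ranks per state, rather than comparing overall averages, matches the wording of the claim ("in most states") and is not dominated by a few states with very high salaries.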
occupations <- c("Statisticians","Database Administrators","Data Scientists", "Information Security Analysts","Database Architects")
global_min <- min(merged_df$A_MEAN, na.rm = TRUE)
global_max <- max(merged_df$A_MEAN, na.rm = TRUE)
for (occupation in occupations) {
  # Keep only the rows for the current occupation
  occupation_data <- merged_df %>% filter(OCC_TITLE == occupation)

  # Shared limits keep the color scale comparable across all maps
  plot <- ggplot(occupation_data) +
    geom_sf(aes(fill = A_MEAN)) +
    scale_fill_viridis_c(
      option = "C",
      na.value = "gray",
      limits = c(global_min, global_max),
      direction = -1
    ) +
    labs(title = paste("Annual Mean Salary for", occupation)) +
    theme_minimal() +
    coord_sf(xlim = c(-125, -65), ylim = c(25, 50), expand = FALSE) +
    theme(
      axis.text = element_blank(),
      axis.ticks = element_blank(),
      panel.grid = element_blank()
    )

  #ggsave(paste0("heatmap_", gsub(" ", "_", occupation), ".png"), plot = plot, width = 7, height = 7)
  print(plot)
}
Across all states, New York, California, and Texas pay the most. Database Architects are the highest-paid occupation in most states, and Statisticians are the lowest-paid occupation in more states than any other role. Further analysis should factor in job experience, industry, and education.