Story 4 - How Much do We Get Paid?

Intro

The term Data Practitioner is a nice catch-all phrase for work that spans data science and demands analytical and quantitative expertise as well as exceptional communication skills. Distilling data into actionable intelligence and presenting it in a way that multiple audiences can easily consume is equal parts art and science. For this Story, I pulled data from the Department of Labor OES system for all of 2022 and filtered the CSV file down to the roles that I thought fit into this data practitioner bucket. What I found was largely intuitive, with some useful insights for job seekers.

I filtered the original dataset for the following job titles:

Data Scientists
Database Administrators
Database Architects
Management Analysts

This last title, I believe, serves as a useful proxy for Business Analyst and Data Analyst roles, for which I could find no results in the dataset I pulled. Database Administrators and Database Architects might stray closer to traditional IT, but the job descriptions I found for these titles lead me to believe that at least some employers view them as close enough to data work to consider some of my classmates.
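The filtering step itself is not shown in this write-up, so the following is only a minimal sketch of how it could be done with dplyr. The data frame `oes_raw` and its rows are hypothetical stand-ins for the full OES extract and carry no wage figures; only the four target titles come from the text above.

```r
library(dplyr)

# The four occupation titles named in the text.
target_titles <- c(
  "Data Scientists",
  "Database Administrators",
  "Database Architects",
  "Management Analysts"
)

# Hypothetical stand-in for the full OES extract (illustrative rows only).
oes_raw <- data.frame(
  PRIM_STATE = c("AL", "AL", "AK", "AK"),
  OCC_TITLE  = c("Data Scientists", "Statisticians",
                 "Database Architects", "Actuaries"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose occupation title is on the target list.
practitioners <- oes_raw %>%
  filter(OCC_TITLE %in% target_titles)

print(practitioners)
```

Matching on `OCC_TITLE` with `%in%` keeps the list of titles in one place, so adding or dropping a role later is a one-line change.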

I started my analysis by looking across geographies at the salary information for each of the titles.

# Load the packages the code below relies on
library(dplyr)
library(ggplot2)
library(scales)
# Load the dataset from the URL
url <- "https://raw.githubusercontent.com/evanmclaughlin/DATA-608/Story-4/OES_2022_Data.v4.csv"
data <- read.csv(url)
head(data)

##   AREA AREA_TITLE AREA_TYPE PRIM_STATE NAICS    NAICS_TITLE        I_GROUP
## 1    1    Alabama         2         AL     0 Cross-industry cross-industry
## 2    2     Alaska         2         AK     0 Cross-industry cross-industry
## 3    4    Arizona         2         AZ     0 Cross-industry cross-industry
## 4    5   Arkansas         2         AR     0 Cross-industry cross-industry
## 5    6 California         2         CA     0 Cross-industry cross-industry
## 6    8   Colorado         2         CO     0 Cross-industry cross-industry
##   OWN_CODE OCC_CODE               OCC_TITLE  O_GROUP TOT_EMP EMP_PRSE JOBS_1000
## 1     1235  15-1242 Database Administrators detailed    1740      4.8     0.868
## 2     1235  15-1242 Database Administrators detailed     110     10.7     0.347
## 3     1235  15-1242 Database Administrators detailed    1960      5.4     0.647
## 4     1235  15-1242 Database Administrators detailed     400      9.6     0.321
## 5     1235  15-1242 Database Administrators detailed    7830      3.4     0.444
## 6     1235  15-1242 Database Administrators detailed    2200       12     0.797
##   LOC_QUOTIENT PCT_TOTAL PCT_RPT H_MEAN A_MEAN MEAN_PRSE H_PCT10 H_PCT25
## 1         1.59        NA      NA  41.87  87090       1.2   25.16   31.94
## 2         0.64        NA      NA  44.25  92040       3.9   28.35   33.39
## 3         1.19        NA      NA  47.78  99370       1.4   25.37   35.08
## 4         0.59        NA      NA  37.29  77560       1.8   21.22   27.78
## 5         0.82        NA      NA  54.92 114240       1.7   27.53   36.65
## 6         1.46        NA      NA  53.47 111210       3.5   28.12   40.44
##   H_MEDIAN H_PCT75 H_PCT90     A_PCT10     A_PCT25     A_MEDIAN      A_PCT75
## 1    39.33   50.78   62.10  52,340.00   66,440.00    81,810.00   105,630.00 
## 2    40.06   52.57   60.24  58,970.00   69,450.00    83,330.00   109,340.00 
## 3    49.84   58.24   66.07  52,770.00   72,970.00   103,670.00   121,140.00 
## 4    36.67   44.05   53.55  44,130.00   57,780.00    76,280.00    91,620.00 
## 5    52.88   68.62   82.97  57,260.00   76,230.00   109,990.00   142,720.00 
## 6    51.82   64.99   75.81  58,500.00   84,100.00   107,780.00   135,180.00 
##        A_PCT90 ANNUAL HOURLY
## 1  129,160.00      NA     NA
## 2  125,300.00      NA     NA
## 3  137,430.00      NA     NA
## 4  111,380.00      NA     NA
## 5  172,580.00      NA     NA
## 6  157,680.00      NA     NA
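The percentile columns in the output above (A_PCT10 through A_PCT90, plus A_MEDIAN) print with thousands separators and padding, which suggests read.csv() imported them as text rather than numbers. The analysis below only uses A_MEAN, so this does not affect the charts, but a sketch of one way to convert them follows. The helper name `parse_money` is an assumption for illustration and is not part of the original analysis.

```r
# Hypothetical helper: turn strings such as " 52,340.00 " into numbers.
parse_money <- function(x) {
  # Trim padding, drop thousands separators, then convert.
  as.numeric(gsub(",", "", trimws(x), fixed = TRUE))
}

# Values copied from the A_MEDIAN column in the head() output above.
a_median_raw <- c(" 81,810.00 ", " 83,330.00 ", " 103,670.00 ")
a_median <- parse_money(a_median_raw)

print(a_median)
```

Any entry that is not a plain number would come back as NA with a warning, so it is worth checking the NA count after converting a whole column.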

# let's make an attractive color palette quickly
palette_rainbow <- c("#9e0142", "#d53e4f", "#fdae61", "#ffd86b", "#66c2a6", "#3288bd", "#5e4fa2")

# Start by visualizing annual mean salaries for the occupation titles already filtered into the dataset, using one boxplot per title
title_bp <- ggplot(data, aes(x=" ", y=A_MEAN, group=OCC_TITLE)) + 
  geom_boxplot(aes(fill=OCC_TITLE)) + theme_minimal()
title_bp <- title_bp + scale_y_continuous(labels = label_comma())
title_bp <- title_bp + facet_grid(. ~ OCC_TITLE)
title_bp <- title_bp + scale_fill_manual(values=palette_rainbow)
title_bp <- title_bp + theme(legend.position="none") 
title_bp <- title_bp + theme(text = element_text(size=12), axis.title=element_text(size=12))
title_bp <- title_bp + labs(title = "Average Salary - US", x= " ", y= "Salary")

title_bp

## Warning: Removed 4 rows containing non-finite values (`stat_boxplot()`).

This is a helpful look at average salaries across the titles that we’ve distilled from the full OES dataset. There’s not a huge variance across the titles, although “Database Architects” is noticeably higher than the others. Since we’ve already extracted the state-level data from the dataset and removed a few state records that didn’t include data, we can create a similar graphic with state data across all job titles.
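The comparison across titles can also be put into numbers. The sketch below groups by title and reports the median, the interquartile range, and how many rows are missing A_MEAN (the rows behind the boxplot warning). The `toy` data frame uses placeholder titles and salaries that are NOT real OES figures; on the real data the same pipeline would start from `data`.

```r
library(dplyr)

# Placeholder rows (not OES figures), with one NA to mimic a missing state.
toy <- data.frame(
  OCC_TITLE = c("Title A", "Title A", "Title A", "Title B", "Title B"),
  A_MEAN    = c(100000, 110000, NA, 120000, 140000)
)

title_summary <- toy %>%
  group_by(OCC_TITLE) %>%
  summarise(
    n_rows        = n(),                        # rows per title
    n_missing     = sum(is.na(A_MEAN)),         # rows the boxplot would drop
    median_salary = median(A_MEAN, na.rm = TRUE),
    iqr_salary    = IQR(A_MEAN, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(median_salary))                  # highest-paying title first

print(title_summary)
```

Reporting `n_missing` alongside the medians makes it easy to see which titles account for the rows removed in the warning.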

state_bp <- ggplot(data, aes(x=PRIM_STATE, y=A_MEAN, fill=PRIM_STATE)) + 
  geom_boxplot() + theme_minimal() + coord_flip()
state_bp <- state_bp + scale_y_continuous(labels = label_comma())
state_bp <- state_bp + theme(legend.position="none") 
state_bp <- state_bp + theme(text = element_text(size=8), axis.title=element_text(size=12)) 
state_bp <- state_bp + labs(title = "US Salaries by State / Territory", x= "State or Territory", y= "Annual Average Salary")
state_bp <- state_bp + theme(plot.title = element_text(size=8))
state_bp

## Warning: Removed 4 rows containing non-finite values (`stat_boxplot()`).

ggsave(filename = "state_bp.pdf", plot = state_bp, width = 4, height = 6, dpi = 300)

## Warning: Removed 4 rows containing non-finite values (`stat_boxplot()`).

Not too many surprises here: the two leading states, WA and CA, are the main homes for the country’s tech giants. Still, we would benefit from a state breakdown of each occupation title. I like boxplots for this exercise, but they don’t make as much sense when each state has just one figure per occupation, so I’m going to use bar charts for the graphics below.
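The switch to bar charts rests on each state and title pair contributing a single figure. A quick way to confirm that is to count rows per pair, as in the sketch below. The `sample_rows` data frame reuses the six Database Administrators rows printed by head() earlier; on the full dataset the same check would start from `data`.

```r
library(dplyr)

# Six rows taken from the head() output shown earlier.
sample_rows <- data.frame(
  PRIM_STATE = c("AL", "AK", "AZ", "AR", "CA", "CO"),
  OCC_TITLE  = "Database Administrators",
  A_MEAN     = c(87090, 92040, 99370, 77560, 114240, 111210)
)

# Count how many rows each state/title pair has.
rows_per_pair <- sample_rows %>%
  count(PRIM_STATE, OCC_TITLE, name = "n_rows")

# TRUE when every pair is a single figure, so a boxplot adds nothing.
all_single <- all(rows_per_pair$n_rows == 1)
print(all_single)
```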

data_ds <- data %>%
  filter(OCC_TITLE == "Data Scientists")

data_dadmin <- data %>%
  filter(OCC_TITLE == "Database Administrators")

data_darch <- data %>%
  filter(OCC_TITLE == "Database Architects")

data_ma <- data %>%
  filter(OCC_TITLE == "Management Analysts")

#ggbarplot(data_ds ,  x= "A_MEAN", y= "PRIM_STATE", color = "PRIM_STATE" , position = position_dodge())

ggplot(data_ds) +
  geom_bar(aes(x = reorder(PRIM_STATE, -A_MEAN), y = A_MEAN, fill = A_MEAN), stat = "identity", position = "dodge", width = 1) + coord_flip() +
  theme(legend.position = "none", text = element_text(size=8)) +
  labs( title = "Data Science Average Salaries by State and Territory (No Data from Vermont)", x = "", y = "", fill = "Source")

## Warning: Removed 1 rows containing missing values (`geom_bar()`).

ggplot(data_dadmin) +
  geom_bar(aes(x = reorder(PRIM_STATE, -A_MEAN), y = A_MEAN, fill = A_MEAN), stat = "identity", position = "dodge", width = 1) + coord_flip() +
  theme(legend.position = "none", text = element_text(size=8)) +
  labs( title = "Database Administrator Average Salaries by State and Territory", x = "", y = "", fill = "Source")

ggplot(data_darch) +
  geom_bar(aes(x = reorder(PRIM_STATE, -A_MEAN), y = A_MEAN, fill = A_MEAN), stat = "identity", position = "dodge", width = 1) + coord_flip() +
  theme(legend.position = "none", text = element_text(size=8)) +
  labs( title = "Database Architects Average Salaries by State and Territory (No Data From Virginia)", x = "", y = "", fill = "Source")

## Warning: Removed 1 rows containing missing values (`geom_bar()`).

ggplot(data_ma) +
  geom_bar(aes(x = reorder(PRIM_STATE, -A_MEAN), y = A_MEAN, fill = A_MEAN), stat = "identity", position = "dodge", width = 1) + coord_flip() +
  theme(legend.position = "none", text = element_text(size=8)) +
  labs( title = "Management Analysts Average Salaries by State and Territory (No Data from Vermont and Maine)", x = "", y = "", fill = "Source")

## Warning: Removed 2 rows containing missing values (`geom_bar()`).
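The four bar charts above differ only in the title they filter on and the plot heading, so they could be produced by one small function. The sketch below is one possible refactoring, not the code used for the figures: `plot_state_salaries` is a hypothetical name, it uses geom_col() in place of geom_bar(stat = "identity"), and it drops rows with a missing A_MEAN up front to avoid the warnings. The demo data are the six Database Administrators rows from the head() output.

```r
library(dplyr)
library(ggplot2)
library(scales)

# Hypothetical helper: horizontal bar chart of A_MEAN by state for one title.
plot_state_salaries <- function(df, occ_title, plot_title = occ_title) {
  df %>%
    filter(OCC_TITLE == occ_title, !is.na(A_MEAN)) %>%   # drop missing states
    ggplot(aes(x = reorder(PRIM_STATE, -A_MEAN), y = A_MEAN, fill = A_MEAN)) +
    geom_col(width = 1) +
    coord_flip() +
    scale_y_continuous(labels = label_comma()) +
    theme(legend.position = "none", text = element_text(size = 8)) +
    labs(title = plot_title, x = "", y = "")
}

# Six rows taken from the head() output shown earlier.
sample_rows <- data.frame(
  PRIM_STATE = c("AL", "AK", "AZ", "AR", "CA", "CO"),
  OCC_TITLE  = "Database Administrators",
  A_MEAN     = c(87090, 92040, 99370, 77560, 114240, 111210)
)

p <- plot_state_salaries(
  sample_rows,
  "Database Administrators",
  "Database Administrator Average Salaries by State and Territory"
)
print(p)
```

With the full dataset loaded, a call such as `plot_state_salaries(data, "Data Scientists")` would replace the separate `data_ds` filter and plot blocks.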

These last visualizations provide the most useful breakdown. We see intuitive results for Data Scientists and Database Architects, with the nation’s tech hubs jumping out. However, the Database Administrator and Management Analyst results stray from common knowledge, which suggests that regional idiosyncrasies may contribute more to salary variance than we might have thought. Additionally, there appears to be strong regional demand for Management Analysts along the East Coast, perhaps because these roles are more in demand within financial services. For those Data Science students open to cold weather and working in finance, this could be a good option.


Evan McLaughlin

2023-10-22
