I have introduced the term “Data Practitioner” as a generic job descriptor because we have so many different job titles for individuals whose work activities overlap, including Data Scientist, Data Engineer, Data Analyst, Business Analyst, Data Architect, etc. For this story we will answer the question, “How much do we get paid?” Your analysis and data visualizations must address the variation in average salary based on role descriptor and state.
The term Data Practitioner is a nice catch-all phrase for work that spans data science and demands analytical and quantitative expertise as well as exceptional communication skills. Distilling data to actionable intelligence and presenting it in a way that is easily consumed across multiple audiences is equal parts art and science. For this Story, I pulled 2022 data from the Department of Labor OES system. I filtered the CSV file down to the roles that I thought fit into this data practitioner bucket. What I found was largely intuitive, with some useful insights for job seekers.
I filtered the original dataset for the following job titles:
Data Scientists, Database Administrators, Database Architects, Management Analysts
The last title, Management Analysts, serves I believe as a useful proxy for Business Analyst and Data Analyst roles, for which I could find no results in the dataset I pulled. Database Administrators and Database Architects might stray closer to traditional IT, but the job descriptions I found for these titles lead me to believe that at least some employers view them as close enough to data work to consider some of my classmates.
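The filtering step itself is not shown in this post. A minimal sketch of how it could look with dplyr follows, assuming the raw extract has an `OCC_TITLE` column like the filtered file does. The `oes_raw` data frame, its rows, and the name `practitioner_titles` are hypothetical stand-ins for illustration, not the actual OES download.

```r
library(dplyr)

# Hypothetical stand-in for the full OES extract; the extra titles are
# only here to show rows being dropped by the filter.
oes_raw <- data.frame(
  PRIM_STATE = c("CA", "CA", "NY", "NY", "TX", "TX"),
  OCC_TITLE  = c("Data Scientists", "Registered Nurses",
                 "Management Analysts", "Software Developers",
                 "Database Architects", "Database Administrators"),
  stringsAsFactors = FALSE
)

# The four titles treated as data practitioner roles in this post
practitioner_titles <- c("Data Scientists", "Database Administrators",
                         "Database Architects", "Management Analysts")

# Keep only rows whose occupation title is in that list
practitioners <- oes_raw %>%
  filter(OCC_TITLE %in% practitioner_titles)

print(practitioners)
```

Matching on the exact title string is simple but brittle; if the titles ever change between OES releases, matching on `OCC_CODE` might be the safer choice.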
I started my analysis by looking across geographies at the salary information for each of the titles.
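# Load the packages used below: ggplot2 for the plots, dplyr for filter() and %>%, scales for label_comma()
library(ggplot2)
library(dplyr)
library(scales)
# Load the dataset from the URL
url <- "https://raw.githubusercontent.com/evanmclaughlin/DATA-608/Story-4/OES_2022_Data.v4.csv"
data <- read.csv(url)
head(data)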
## AREA AREA_TITLE AREA_TYPE PRIM_STATE NAICS NAICS_TITLE I_GROUP
## 1 1 Alabama 2 AL 0 Cross-industry cross-industry
## 2 2 Alaska 2 AK 0 Cross-industry cross-industry
## 3 4 Arizona 2 AZ 0 Cross-industry cross-industry
## 4 5 Arkansas 2 AR 0 Cross-industry cross-industry
## 5 6 California 2 CA 0 Cross-industry cross-industry
## 6 8 Colorado 2 CO 0 Cross-industry cross-industry
## OWN_CODE OCC_CODE OCC_TITLE O_GROUP TOT_EMP EMP_PRSE JOBS_1000
## 1 1235 15-1242 Database Administrators detailed 1740 4.8 0.868
## 2 1235 15-1242 Database Administrators detailed 110 10.7 0.347
## 3 1235 15-1242 Database Administrators detailed 1960 5.4 0.647
## 4 1235 15-1242 Database Administrators detailed 400 9.6 0.321
## 5 1235 15-1242 Database Administrators detailed 7830 3.4 0.444
## 6 1235 15-1242 Database Administrators detailed 2200 12 0.797
## LOC_QUOTIENT PCT_TOTAL PCT_RPT H_MEAN A_MEAN MEAN_PRSE H_PCT10 H_PCT25
## 1 1.59 NA NA 41.87 87090 1.2 25.16 31.94
## 2 0.64 NA NA 44.25 92040 3.9 28.35 33.39
## 3 1.19 NA NA 47.78 99370 1.4 25.37 35.08
## 4 0.59 NA NA 37.29 77560 1.8 21.22 27.78
## 5 0.82 NA NA 54.92 114240 1.7 27.53 36.65
## 6 1.46 NA NA 53.47 111210 3.5 28.12 40.44
## H_MEDIAN H_PCT75 H_PCT90 A_PCT10 A_PCT25 A_MEDIAN A_PCT75
## 1 39.33 50.78 62.10 52,340.00 66,440.00 81,810.00 105,630.00
## 2 40.06 52.57 60.24 58,970.00 69,450.00 83,330.00 109,340.00
## 3 49.84 58.24 66.07 52,770.00 72,970.00 103,670.00 121,140.00
## 4 36.67 44.05 53.55 44,130.00 57,780.00 76,280.00 91,620.00
## 5 52.88 68.62 82.97 57,260.00 76,230.00 109,990.00 142,720.00
## 6 51.82 64.99 75.81 58,500.00 84,100.00 107,780.00 135,180.00
## A_PCT90 ANNUAL HOURLY
## 1 129,160.00 NA NA
## 2 125,300.00 NA NA
## 3 137,430.00 NA NA
## 4 111,380.00 NA NA
## 5 172,580.00 NA NA
## 6 157,680.00 NA NA
# let's make an attractive color palette quickly
palette_rainbow <- c("#9e0142", "#d53e4f", "#fdae61", "#ffd86b", "#66c2a6", "#3288bd", "#5e4fa2")
# we can start by visualizing the annual salaries by the occupation labels we've already filtered for in our dataset. Let's try a boxplot
title_bp <- ggplot(data, aes(x=" ", y=A_MEAN, group=OCC_TITLE)) +
geom_boxplot(aes(fill=OCC_TITLE)) + theme_minimal()
title_bp <- title_bp + scale_y_continuous(labels = label_comma())
title_bp <- title_bp + facet_grid(. ~ OCC_TITLE)
title_bp <- title_bp + scale_fill_manual(values=palette_rainbow)
title_bp <- title_bp + theme(legend.position="none")
title_bp <- title_bp + theme(text = element_text(size=12), axis.title=element_text(size=12))
title_bp <- title_bp + labs(title = "Average Salary - US", x= " ", y= "Salary")
title_bp
## Warning: Removed 4 rows containing non-finite values (`stat_boxplot()`).
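The warning above says four rows were dropped because `A_MEAN` is not a finite number. A quick way to see which state and title combinations those are is to filter for them directly. The sketch below uses a toy data frame (`toy` and `missing_rows` are hypothetical names): three rows copied from the `head()` output, plus a Data Scientists row for Vermont with no salary, which matches what the chart title later in the post reports.

```r
library(dplyr)

# Toy frame: three rows copied from the head() output above, plus a
# Data Scientists row for Vermont with no reported salary.
toy <- data.frame(
  PRIM_STATE = c("AL", "AK", "AZ", "VT"),
  OCC_TITLE  = c(rep("Database Administrators", 3), "Data Scientists"),
  A_MEAN     = c(87090, 92040, 99370, NA),
  stringsAsFactors = FALSE
)

# Rows that geom_boxplot() would drop: A_MEAN is NA, NaN, or infinite
missing_rows <- toy %>%
  filter(!is.finite(A_MEAN)) %>%
  select(PRIM_STATE, OCC_TITLE)

print(missing_rows)
```

Running the same `filter()` on the full `data` frame should list the four rows behind the warning, assuming `A_MEAN` was read in as a numeric column.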
This is a helpful look at average salaries across the titles that we’ve
distilled from the full OES dataset. There’s not a huge variance across
the titles, although “Database Architect” is noticeably higher than the
others. Since we’ve already extracted the state-level data from the
dataset and removed a few state records that didn’t include data, we can
create a helpful graphic with state data across all job titles.
state_bp <- ggplot(data, aes(x=PRIM_STATE, y=A_MEAN, fill=PRIM_STATE)) +
geom_boxplot() + theme_minimal() + coord_flip()
state_bp <- state_bp + scale_y_continuous(labels = label_comma())
state_bp <- state_bp + theme(legend.position="none")
state_bp <- state_bp + theme(text = element_text(size=8), axis.title=element_text(size=12))
state_bp <- state_bp + labs(title = "US Salaries by State / Territory", x= "State or Territory", y= "Annual Average Salary")
state_bp <- state_bp + theme(plot.title = element_text(size=8))
state_bp
## Warning: Removed 4 rows containing non-finite values (`stat_boxplot()`).
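# Name the plot explicitly so ggsave() does not depend on whichever plot was drawn last
ggsave(filename = "state_bp.pdf", plot = state_bp, width = 4, height = 6, dpi = 300)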
## Warning: Removed 4 rows containing non-finite values (`stat_boxplot()`).
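A boxplot built from only a handful of points per state is hard to read precisely, so a numeric ranking can be a useful companion to the chart. The following is a sketch only: it runs on six Database Administrators rows copied from the `head()` output, so each state has a single title here, and the names `sample_rows` and `state_rank` are hypothetical.

```r
library(dplyr)

# Six rows copied from the head() output earlier in the post
sample_rows <- data.frame(
  PRIM_STATE = c("AL", "AK", "AZ", "AR", "CA", "CO"),
  OCC_TITLE  = "Database Administrators",
  A_MEAN     = c(87090, 92040, 99370, 77560, 114240, 111210),
  stringsAsFactors = FALSE
)

# Average the reported annual means within each state, ignoring missing
# values, then sort from highest to lowest
state_rank <- sample_rows %>%
  group_by(PRIM_STATE) %>%
  summarise(mean_salary = mean(A_MEAN, na.rm = TRUE),
            n_titles    = sum(!is.na(A_MEAN))) %>%
  arrange(desc(mean_salary))

print(state_rank)
```

Applied to the full `data` frame, `n_titles` would show how many of the four titles actually report a salary in each state, which helps when comparing states with gaps.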
Not too many surprises here, particularly the two state leaders that serve as the main homes for the country’s tech giants (WA and CA), but we would benefit from a state breakdown of each occupation title. I like boxplots for this exercise, but they don’t make as much sense when each state has just one figure per occupation, so I’m going to use bar charts for the graphics below.
data_ds <- data %>%
filter(OCC_TITLE == "Data Scientists")
data_dadmin <- data %>%
filter(OCC_TITLE == "Database Administrators")
data_darch <- data %>%
filter(OCC_TITLE == "Database Architects")
data_ma <- data %>%
filter(OCC_TITLE == "Management Analysts")
#ggbarplot(data_ds , x= "A_MEAN", y= "PRIM_STATE", color = "PRIM_STATE" , position = position_dodge())
ggplot(data_ds) +
geom_bar(aes(x = reorder(PRIM_STATE, -A_MEAN), y = A_MEAN, fill = A_MEAN), stat = "identity", position = "dodge", width = 1) + coord_flip() +
theme(legend.position = "none", text = element_text(size=8)) +
labs( title = "Data Science Average Salaries by State and Territory (No Data from Vermont)", x = "", y = "", fill = "Source")
## Warning: Removed 1 rows containing missing values (`geom_bar()`).
ggplot(data_dadmin) +
geom_bar(aes(x = reorder(PRIM_STATE, -A_MEAN), y = A_MEAN, fill = A_MEAN), stat = "identity", position = "dodge", width = 1) + coord_flip() +
theme(legend.position = "none", text = element_text(size=8)) +
labs( title = "Database Administrator Average Salaries by State and Territory", x = "", y = "", fill = "Source")
ggplot(data_darch) +
geom_bar(aes(x = reorder(PRIM_STATE, -A_MEAN), y = A_MEAN, fill = A_MEAN), stat = "identity", position = "dodge", width = 1) + coord_flip() +
theme(legend.position = "none", text = element_text(size=8)) +
labs( title = "Database Architects Average Salaries by State and Territory (No Data From Virginia)", x = "", y = "", fill = "Source")
## Warning: Removed 1 rows containing missing values (`geom_bar()`).
ggplot(data_ma) +
geom_bar(aes(x = reorder(PRIM_STATE, -A_MEAN), y = A_MEAN, fill = A_MEAN), stat = "identity", position = "dodge", width = 1) + coord_flip() +
theme(legend.position = "none", text = element_text(size=8)) +
labs( title = "Management Analysts Average Salaries by State and Territory (No Data from Vermont and Maine)", x = "", y = "", fill = "Source")
## Warning: Removed 2 rows containing missing values (`geom_bar()`).
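The four bar charts above repeat the same ggplot2 recipe with a different data frame and title each time, so the pattern could be wrapped in a small function. This is one possible sketch, not the code used for the charts above: `plot_state_salaries` is a hypothetical helper, it swaps in `geom_col()` for `geom_bar(stat = "identity")`, and it drops missing salaries up front instead of leaving that to ggplot2.

```r
library(ggplot2)

# Hypothetical helper: one flipped bar chart of A_MEAN by state
plot_state_salaries <- function(df, chart_title) {
  # Drop states with no reported salary so no rows are removed with a warning
  df <- df[!is.na(df$A_MEAN), ]
  ggplot(df, aes(x = reorder(PRIM_STATE, -A_MEAN), y = A_MEAN, fill = A_MEAN)) +
    geom_col(width = 1) +
    coord_flip() +
    theme(legend.position = "none", text = element_text(size = 8)) +
    labs(title = chart_title, x = "", y = "")
}

# Six rows copied from the head() output earlier in the post
dba_sample <- data.frame(
  PRIM_STATE = c("AL", "AK", "AZ", "AR", "CA", "CO"),
  A_MEAN     = c(87090, 92040, 99370, 77560, 114240, 111210)
)

p <- plot_state_salaries(dba_sample, "Database Administrators (sample of six states)")
print(p)
```

With the per-title data frames already built, a call such as `plot_state_salaries(data_ds, "Data Scientists")` would stand in for each of the four blocks.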
These last visualizations provide the most useful breakdown. The results for Data Scientists and Database Architects are intuitive, with the nation’s tech hubs jumping out. However, the Database Administrator and Management Analyst results stray from what common knowledge would suggest, leaving us to consider the possibility that regional idiosyncrasies contribute more to salary variance than we might have thought. Additionally, there appears to be strong regional demand for Management Analysts along the East Coast, perhaps because these roles are more in demand within financial services. For those Data Science students open to cold weather and working in finance, this could be a good option.
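The point about salary variance across states can be checked numerically as well as visually. One simple measure is the coefficient of variation (standard deviation divided by mean) of `A_MEAN` within each title. The sketch below is illustrative only: it uses six Database Administrators rows copied from the `head()` output, so it yields a single row, and `sample_rows` and `spread_by_title` are hypothetical names.

```r
library(dplyr)

# Six rows copied from the head() output earlier in the post
sample_rows <- data.frame(
  PRIM_STATE = c("AL", "AK", "AZ", "AR", "CA", "CO"),
  OCC_TITLE  = "Database Administrators",
  A_MEAN     = c(87090, 92040, 99370, 77560, 114240, 111210),
  stringsAsFactors = FALSE
)

# Mean, standard deviation, and coefficient of variation of the
# state-level average salary within each occupation title
spread_by_title <- sample_rows %>%
  group_by(OCC_TITLE) %>%
  summarise(mean_salary = mean(A_MEAN, na.rm = TRUE),
            sd_salary   = sd(A_MEAN, na.rm = TRUE),
            cv          = sd_salary / mean_salary)

print(spread_by_title)
```

Run against the full `data` frame, this would give one row per title, and a noticeably larger `cv` for one title would suggest that state matters more for that role.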