Story 4: How much do we get paid?

I have introduced the term “Data Practitioner” as a generic job descriptor because we have so many different job role titles for individuals whose work activities overlap including Data Scientist, Data Engineer, Data Analyst, Business Analyst, Data Architect, etc. For this story we will answer the question, “How much do we get paid?” Your analysis and data visualizations must address the variation in average salary based on role descriptor and state.

Description

The term “Data Practitioner” covers people whose work draws on data science and calls for analytical, quantitative, and communication skills. For this project, the data comes from the Department of Labor OES system for 2022 and is limited to three occupation titles: Data Scientists, Database Administrators, and Database Architects. Although their job descriptions differ, these titles serve here as proxies for roles such as Business Analyst and Data Analyst. The analysis examines how average annual salary varies by state for each selected title.

library(tidyverse)  # read_csv(), %>%, filter(), ggplot()
library(viridis)    # scale_fill_viridis()

data <- read_csv("C:/Users/mikha/OneDrive/Desktop/Data 608/Stories/salarydata.csv")
## Rows: 37569 Columns: 32
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (26): AREA, AREA_TITLE, PRIM_STATE, NAICS, NAICS_TITLE, I_GROUP, OCC_COD...
## dbl  (2): AREA_TYPE, OWN_CODE
## lgl  (4): PCT_TOTAL, PCT_RPT, ANNUAL, HOURLY
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
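# Keep only the three occupation titles used in this story
df <- data %>% 
  filter(OCC_TITLE %in% c("Data Scientists", "Database Administrators", "Database Architects"))

# A_MEAN is read as text: drop the thousands separators, then convert to numeric.
# Entries that are not numbers become NA.
df$A_MEAN <- gsub(",", "", df$A_MEAN)
df$A_MEAN <- as.numeric(df$A_MEAN)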
## Warning: NAs introduced by coercion
sum(is.na(df$A_MEAN))
## [1] 2
title_bp <- ggplot(df, aes(x = "", y = A_MEAN, fill = OCC_TITLE)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::label_comma()) +
  facet_grid(. ~ OCC_TITLE) +
  scale_fill_manual(values = rainbow(10)) +
  theme_minimal() +
  theme(legend.position = "none",
        text = element_text(size = 12),
        axis.title = element_text(size = 12)) +
  labs(title = "Average Salary - US", x = NULL, y = "Salary")

print(title_bp)
## Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).
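
The two rows dropped in the warning above are the ones whose `A_MEAN` could not be converted to a number. A minimal sketch of how such cells could be inspected before they are coerced, using a small made-up data frame: `toy`, its state codes, and its values are illustrative only, and the `*` placeholder is an assumption about how the source file marks an unavailable estimate.

```r
library(dplyr)

# Made-up stand-in for the filtered rows; codes and values are illustrative only.
toy <- data.frame(
  PRIM_STATE = c("AA", "BB", "CC"),
  OCC_TITLE  = c("Data Scientists", "Data Scientists", "Database Architects"),
  A_MEAN     = c("101,000", "*", "125,500")  # "*" is an assumed placeholder
)

# Strip thousands separators, then flag cells that still fail to parse.
cleaned <- toy %>%
  mutate(
    A_MEAN_NUM = suppressWarnings(as.numeric(gsub(",", "", A_MEAN))),
    unparsed   = is.na(A_MEAN_NUM)
  )

# Rows whose original text could not be converted to a number.
unparsed_rows <- filter(cleaned, unparsed)
print(unparsed_rows)
```

Keeping the original text next to the parsed value makes it easy to see which state and title each missing salary belongs to.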

Across the three titles drawn from the full OES dataset, the salary distributions are broadly similar, with Database Architects notably higher than the other two. With the records that have no salary value excluded, plotting the state-level data across all three titles shows how pay varies geographically.
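
The comparison between titles can also be put into numbers. The following is a sketch under stated assumptions rather than the analysis itself: `toy` holds made-up salaries, and the choice of median and interquartile range as summary statistics is mine, picked because they match what a boxplot displays.

```r
library(dplyr)

# Made-up salaries for illustration only; not taken from the OES file.
toy <- data.frame(
  OCC_TITLE = rep(c("Data Scientists",
                    "Database Administrators",
                    "Database Architects"), each = 3),
  A_MEAN    = c(100000, 110000, 120000,
                 90000,  95000, 100000,
                120000, 130000, 140000)
)

# One row per title: count of usable values, median, and spread.
title_summary <- toy %>%
  group_by(OCC_TITLE) %>%
  summarise(
    n_values      = sum(!is.na(A_MEAN)),
    median_salary = median(A_MEAN, na.rm = TRUE),
    iqr_salary    = IQR(A_MEAN, na.rm = TRUE),
    .groups       = "drop"
  ) %>%
  arrange(desc(median_salary))

print(title_summary)
```

Run against `df` instead of `toy`, the same pipeline would give one summary row per occupation title.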

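# na.rm = TRUE keeps a state with one missing salary from being sorted to the end
state_bp <- ggplot(df, aes(x = reorder(PRIM_STATE, -A_MEAN, FUN = mean, na.rm = TRUE), y = A_MEAN, fill = PRIM_STATE)) + 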
  geom_boxplot() + 
  theme_minimal() + 
  scale_y_continuous(labels = scales::label_comma()) +
  theme(legend.position = "none",
        text = element_text(size = 8),
        axis.title = element_text(size = 12),
        plot.title = element_text(size = 8),
        axis.text.x = element_text(angle = 90, hjust = 1)) +  # Rotate x-axis labels
  labs(title = "US Salaries by State / Territory", 
       x = "State or Territory", 
       y = "Annual Average Salary")

state_bp
## Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).

There were few surprises here, given the concentration of large tech companies in leading states such as Washington and California. It is still worth breaking the results down by state for each occupation title. I usually prefer boxplots for comparisons like this, but they are a poor fit when each state contributes a single value per occupation, so the graphics below are bar charts.
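
Bar charts only read cleanly if each state really does have one row per occupation title. A minimal sketch of a check for that, assuming the same column names as `df`; the `toy` data frame and its state codes are made up for illustration, with one duplicate added on purpose.

```r
library(dplyr)

# Made-up rows; "BB" deliberately appears twice for Data Scientists.
toy <- data.frame(
  PRIM_STATE = c("AA", "AA", "BB", "BB", "BB"),
  OCC_TITLE  = c("Data Scientists", "Database Architects",
                 "Data Scientists", "Data Scientists", "Database Architects"),
  A_MEAN     = c(100000, 120000, 95000, 97000, 118000)
)

# Count rows per title and state, and keep only combinations seen more than once.
dupes <- toy %>%
  count(OCC_TITLE, PRIM_STATE, name = "n_rows") %>%
  filter(n_rows > 1)

print(dupes)
```

An empty result on the real data would confirm that every bar corresponds to exactly one record.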

Filtering by job title to plot salaries by state for each title

data_ds <- df %>%
  filter(OCC_TITLE == "Data Scientists")

data_dadmin <- df %>%
  filter(OCC_TITLE == "Database Administrators")

data_darch <- df %>%
  filter(OCC_TITLE == "Database Architects")

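# "Management Analysts" was not among the titles kept when df was built,
# so this subset is empty and is not plotted below.
data_ma <- df %>%
  filter(OCC_TITLE == "Management Analysts")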

Data Scientist salary by state

ggplot(data_ds) +
  geom_bar(aes(x = reorder(PRIM_STATE, -A_MEAN), y = A_MEAN, fill = A_MEAN), stat = "identity", position = "dodge", width = 1) + 
  coord_flip() +
  scale_fill_viridis(option = "viridis", direction = -1) + 
  theme(legend.position = "none", text = element_text(size = 8)) +
  labs(title = "Data Science Average Salaries by State and Territory (No Data from Vermont)", x = "", y = "", fill = "Source")
## Warning: Removed 1 rows containing missing values (`geom_bar()`).

Database Administrator salary by state

ggplot(data_dadmin) +
  geom_bar(aes(x = reorder(PRIM_STATE, -A_MEAN), y = A_MEAN, fill = A_MEAN), stat = "identity", position = "dodge", width = 1) + 
  coord_flip() +
  scale_fill_viridis(option = "viridis", direction = -1) + 
  theme(legend.position = "none", text = element_text(size = 8)) +
  labs(title = "Database Administrator Average Salaries by State and Territory", x = "", y = "", fill = "Source")

Database Architects salary by state

ggplot(data_darch) +
  geom_bar(aes(x = reorder(PRIM_STATE, -A_MEAN), y = A_MEAN, fill = A_MEAN), stat = "identity", position = "dodge", width = 1) + 
  coord_flip() +
  scale_fill_viridis(option = "viridis", direction = -1) + 
  theme(legend.position = "none", text = element_text(size = 8)) +
  labs(title = "Database Architects Average Salaries by State and Territory (No Data From Virginia)", x = "", y = "", fill = "Source")
## Warning: Removed 1 rows containing missing values (`geom_bar()`).
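
The three charts above differ only in the title they filter on, so they could be produced by one helper. This is a sketch, not the code used for the figures: `plot_state_salaries()` is a hypothetical name, `toy` is made-up data, and it uses ggplot2's own `geom_col()` and `scale_fill_viridis_c()` in place of `geom_bar(stat = "identity")` and `scale_fill_viridis()`.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: bar chart of average salary by state for one title.
plot_state_salaries <- function(data, occ_title) {
  data %>%
    # Drop missing salaries up front so no rows are removed silently later.
    filter(OCC_TITLE == occ_title, !is.na(A_MEAN)) %>%
    ggplot(aes(x = reorder(PRIM_STATE, -A_MEAN), y = A_MEAN, fill = A_MEAN)) +
    geom_col(width = 1) +
    coord_flip() +
    scale_fill_viridis_c(direction = -1) +
    theme(legend.position = "none", text = element_text(size = 8)) +
    labs(title = paste(occ_title, "Average Salaries by State and Territory"),
         x = "", y = "")
}

# Made-up rows for illustration only.
toy <- data.frame(
  PRIM_STATE = c("AA", "BB", "CC", "AA"),
  OCC_TITLE  = c("Data Scientists", "Data Scientists",
                 "Data Scientists", "Database Architects"),
  A_MEAN     = c(100000, 120000, NA, 130000)
)

p <- plot_state_salaries(toy, "Data Scientists")
print(p)
```

Calling the helper once per title, for example `plot_state_salaries(df, "Database Architects")`, would replace the three copied blocks.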

Conclusion

These visualizations provide a useful breakdown. The results for Data Scientists and Database Architects are as expected, with the highest salaries in the nation’s tech hubs. The findings for Database Administrators deviate from that pattern, which suggests that regional factors contribute more to salary variance than expected. There also appears to be strong regional demand for Database Administrators along the East Coast, possibly driven by specific industry needs. For Data Science students who are open to colder climates and want to work in tech, this could be a favorable option.
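
The regional observations above can be checked by listing the highest-paying states for each title. A minimal sketch, assuming the same column names as `df`; the `toy` rows and state codes are made up, and the cut-off of two states per title is arbitrary.

```r
library(dplyr)

# Made-up rows for illustration only.
toy <- data.frame(
  PRIM_STATE = rep(c("AA", "BB", "CC"), times = 2),
  OCC_TITLE  = rep(c("Data Scientists", "Database Administrators"), each = 3),
  A_MEAN     = c(120000, 100000, 110000,
                  90000, 105000,  95000)
)

# For each title, keep the two states with the highest average salary.
top_states <- toy %>%
  filter(!is.na(A_MEAN)) %>%
  group_by(OCC_TITLE) %>%
  slice_max(A_MEAN, n = 2, with_ties = FALSE) %>%
  ungroup()

print(top_states)
```

Comparing the lists across titles would show directly whether Database Administrators peak in different states than the other two roles.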