Salaries for Data Professionals

This project assesses the salary distributions of data-science-related occupations, such as data analyst, data engineer, machine learning engineer, and data scientist. The data is sourced from Glassdoor (https://www.glassdoor.com/Salary/Glassdoor-Salaries-E100431.htm).

Firstly, the data is loaded.

library(dplyr)
library(ggplot2)

wages_raw <- read.csv("/Users/Lucas/Downloads/salaries.csv")

General Distribution

To get an idea of the general distribution of salaries across the different occupations, violin plots are used, since they capture the density of the distribution as well as its range. Because the plot is drawn with trim = FALSE, the tails extend slightly beyond the observed minimum and maximum. The occupations are ordered by mean salary, from lowest to highest paid. As can be seen, data analysts and statisticians earn the least, while data engineers and quantitative analysts earn the most. However, it is also apparent that salaries for data analysts and statisticians are more predictable, varying less than those of the other occupations.

ggplot(wages_raw, aes(x = reorder(Job.Title, Annual.Salary), y = Annual.Salary, fill = Job.Title)) +
  geom_violin(trim = FALSE) +  # Tails are not cut off at the observed range
  labs(title = "Salary Distribution by Occupation",
       x = "Occupation", y = "Annual Salary") +
  scale_y_continuous(labels = scales::dollar) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +  # Apply the complete theme first so later tweaks are kept
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Angled x-axis labels
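The claim that some occupations are more predictable than others can also be checked numerically. The following is a minimal sketch in base R, not part of the original analysis: it uses a small invented data frame (toy_wages, a hypothetical name) with the same column names as wages_raw, and summarises the minimum, median, maximum, and standard deviation per occupation. The salary values are made up purely for illustration.

```r
# Toy data with the same column names as wages_raw.
# All values are invented for illustration; they are NOT the Glassdoor data.
toy_wages <- data.frame(
  Job.Title = rep(c("Data Analyst", "Data Scientist", "Data Engineer"), each = 4),
  Abbreviation = rep(c("NY", "WA", "ND", "AR"), times = 3),
  Annual.Salary = c(70000, 72000, 62000, 60000,
                    120000, 125000, 95000, 90000,
                    130000, 140000, 98000, 92000)
)

# Summarise location and spread of salaries per occupation
salary_summary <- aggregate(
  Annual.Salary ~ Job.Title,
  data = toy_wages,
  FUN = function(x) c(min = min(x), median = median(x), max = max(x), sd = sd(x))
)

# aggregate() stores the statistics in a matrix column; flatten it into
# ordinary columns named Annual.Salary.min, Annual.Salary.median, ...
salary_summary <- do.call(data.frame, salary_summary)

# Order from lowest to highest median salary
salary_summary <- salary_summary[order(salary_summary$Annual.Salary.median), ]
print(salary_summary)
```

Replacing toy_wages with wages_raw should give the corresponding table for the real data; a small standard deviation relative to the median would support the reading of a narrow violin.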

State Comparison

Because salaries vary widely by state in the US, it is important to become more granular and include state-level data. For this, a stacked bar plot and a heatmap are used.

The stacked bar plot shows the salary of each occupation, stacked on top of each other, by state. It shows which states pay the most overall, as well as the share contributed by each occupation. Interestingly, data professions in general are paid highest in Washington state, followed by New York and Alaska. The lowest-paying states are Arkansas, West Virginia, and Florida.

Additionally, the heatmap allows for even more granularity, as the exact profession and state can be traced. For example, the highest paid profession overall is seemingly a big data engineer in Washington, followed by data engineers in New York, Alaska, Massachusetts, and Oregon. This graph also shows that location matters a lot. A data analyst in New York could make as much as a data scientist in North Dakota. This is, of course, not surprising given that the cost of living in NY is much higher than in ND.

ggplot(wages_raw, aes(x = reorder(Abbreviation, -Annual.Salary), y = Annual.Salary, fill = Job.Title)) +
  geom_col() +  # Equivalent to geom_bar(stat = "identity")
  labs(title = "Stacked Bar Plot of Salaries by Occupation and State (Abbreviation)",
       x = "State", y = "Annual Salary") +
  scale_y_continuous(labels = scales::dollar) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +  # Apply the complete theme first so later tweaks are kept
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Angled x-axis labels
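The state ranking read off the bar plot can be reproduced as a table. The sketch below is an illustration in base R rather than the original method: it computes the mean salary per state, which is the statistic reorder() uses by default, and sorts it in descending order. The data frame toy_wages is a hypothetical stand-in with invented values and the same column names as wages_raw.

```r
# Toy data with the same column names as wages_raw.
# All values are invented for illustration; they are NOT the Glassdoor data.
toy_wages <- data.frame(
  Job.Title = rep(c("Data Analyst", "Data Scientist", "Data Engineer"), each = 4),
  Abbreviation = rep(c("NY", "WA", "ND", "AR"), times = 3),
  Annual.Salary = c(70000, 72000, 62000, 60000,
                    120000, 125000, 95000, 90000,
                    130000, 140000, 98000, 92000)
)

# Mean salary across all occupations, per state
state_means <- tapply(toy_wages$Annual.Salary, toy_wages$Abbreviation, mean)

# Rank states from highest to lowest mean salary
state_rank <- sort(state_means, decreasing = TRUE)

print(head(state_rank, 2))  # highest-paying states in the toy data
print(tail(state_rank, 2))  # lowest-paying states in the toy data
```

If every state lists the same set of occupations, ranking by mean gives the same order as ranking by total bar height; otherwise the two can differ, and sum could be swapped in for mean.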

wages_raw$Job.Title <- reorder(wages_raw$Job.Title, wages_raw$Annual.Salary, FUN = mean)

# Create the heatmap plot with reordered Job.Title
ggplot(wages_raw, aes(x = Job.Title, y = reorder(Abbreviation, -Annual.Salary), fill = Annual.Salary)) +
  geom_tile() +
  labs(title = "Salaries by Occupation and State",
       x = "Occupation", y = "State") +
  scale_fill_gradient(low = "lightblue", high = "darkblue", labels = scales::dollar) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
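The grid behind a heatmap like this one is simply a state-by-occupation table, which makes individual comparisons easy to look up. As a rough sketch under stated assumptions (toy_wages is a hypothetical data frame with invented values, reusing only the column names of wages_raw), the table can be built with tapply() and indexed by state and occupation:

```r
# Toy data with the same column names as wages_raw.
# All values are invented for illustration; they are NOT the Glassdoor data.
toy_wages <- data.frame(
  Job.Title = rep(c("Data Analyst", "Data Scientist", "Data Engineer"), each = 4),
  Abbreviation = rep(c("NY", "WA", "ND", "AR"), times = 3),
  Annual.Salary = c(70000, 72000, 62000, 60000,
                    120000, 125000, 95000, 90000,
                    130000, 140000, 98000, 92000)
)

# Rows are states, columns are occupations, cells are mean salaries
salary_grid <- tapply(
  toy_wages$Annual.Salary,
  list(State = toy_wages$Abbreviation, Occupation = toy_wages$Job.Title),
  mean
)
print(salary_grid)

# Look up two cells and compare them directly
analyst_ny <- salary_grid["NY", "Data Analyst"]
scientist_nd <- salary_grid["ND", "Data Scientist"]
salary_gap <- scientist_nd - analyst_ny
print(salary_gap)
```

Run against wages_raw instead of the toy data, the same lookup would quantify comparisons such as a data analyst in New York versus a data scientist in North Dakota.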

In conclusion, data professions seem to be well paid in general. Even as a data analyst in Idaho or Nebraska, one can expect to be paid 60,000 USD. This is right around the median salary of the US as a whole, which should provide a decent living, especially considering that data analyst is often an entry-level position on the way to more complex roles like data scientist or data engineer. Lastly, it is surprising to see data engineers generally making more than data scientists, because the consensus seems to be that data scientists hold the 'highest' non-leadership positions.