Instructions
I have introduced the term “Data Practitioner” as a generic job descriptor because we have so many different job role titles for individuals whose work activities overlap including Data Scientist, Data Engineer, Data Analyst, Business Analyst, Data Architect, etc.
For this story we will answer the question, “How much do we get paid?” Your analysis and data visualizations must address the variation in average salary based on role descriptor and state.
You will need to identify reliable sources for salary data and assemble the data sets that you will need.
Your visualization(s) must show the most salient information (variation in average salary by role and by state).
For this Story you must use a code library and code that you have written in R, Python or JavaScript (additional coding in other languages is allowed).
Introduction
In this analysis, we explore the salaries of data-related professions across the United States. The term Data Practitioner is used as a generic descriptor to include various overlapping roles, such as Data Scientist, Data Engineer, Data Analyst, Business Analyst, and Data Architect.
The key objective of this story is to answer the question: “How much do we get paid?” Our analysis focuses on the variation in average salaries by both job title and state. By using interactive tables, violin plots, stacked bar charts, and heatmaps, we aim to provide a clear and detailed view of how compensation differs across occupations and locations, allowing readers to quickly understand the most salient trends.
Data Set
This project assesses the salary distributions for data-science-related occupations, such as data analyst, data engineer, machine learning engineer, and data scientist.
Data source: https://www.glassdoor.com/Salary/Glassdoor-Salaries-E100431.htm
First, the data is loaded and its structure is inspected:
'data.frame': 250 obs. of 6 variables:
$ State : chr "New York" "Vermont" "California" "Maine" ...
$ Annual.Salary: chr "136,172.00" "133,828.00" "131,441.00" "127,644.00" ...
$ Monthly.Pay : chr "11,347.00" "11,152.00" "10,953.00" "10,637.00" ...
$ Weekly.Pay : chr "2,618.00" "2,573.00" "2,527.00" "2,454.00" ...
$ Hourly.Wage : num 65.5 64.3 63.2 61.4 60.7 ...
$ Job : chr "Data Scientist" "Data Scientist" "Data Scientist" "Data Scientist" ...
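The structure above shows that Annual.Salary, Monthly.Pay, and Weekly.Pay are stored as character strings with thousands separators, so they cannot be averaged directly. A minimal sketch of what the conversion has to do, using the first few Annual.Salary values printed above (`parse_pay` is a hypothetical helper name introduced only for this illustration):

```r
# parse_pay is a hypothetical helper, not part of the original analysis
parse_pay <- function(x) {
  # Drop thousands separators, then convert what is left to numeric
  as.numeric(gsub(",", "", x, fixed = TRUE))
}

# First few Annual.Salary values taken from the str() output
annual_raw <- c("136,172.00", "133,828.00", "131,441.00", "127,644.00")

# Without stripping commas, as.numeric() yields NA (and a warning)
naive <- suppressWarnings(as.numeric(annual_raw))

annual <- parse_pay(annual_raw)

print(naive)   # NA NA NA NA
print(annual)  # 136172 133828 131441 127644
```

The same pattern is applied column by column to the full data frame in the conversion step that follows.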
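# Packages used throughout the analysis
library(dplyr)    # %>%, group_by(), summarize(), arrange()
library(DT)       # datatable()
library(ggplot2)  # plots
library(scales)   # dollar(), dollar_format()
# Converting String to Numeric
df$`Annual.Salary` <- as.numeric(gsub(",", "", df$`Annual.Salary`))
df$`Monthly.Pay` <- as.numeric(gsub(",", "", df$`Monthly.Pay`))
df$`Weekly.Pay` <- as.numeric(gsub(",", "", df$`Weekly.Pay`))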
datatable(
df,
options = list(
pageLength = 10, # rows per page
lengthChange = FALSE, # hide dropdown to change page length
scrollY = "400px", # fixed table height
scrollCollapse = TRUE # collapse if fewer rows
),
caption = "State Salary Details View"
)
# Copy of df
df2 <- df
# Average Salary by Job
Avg_Job <- df2 %>%
group_by(Job) %>%
summarize(Avg_Annual_Salary = mean(Annual.Salary, na.rm = TRUE)) %>%
arrange(desc(Avg_Annual_Salary))
# Average Annual Salary By State
Avg_State <- df %>%
group_by(State) %>%
summarize(Avg_Annual_Salary = mean(Annual.Salary, na.rm = TRUE)) %>%
arrange(desc(Avg_Annual_Salary))
# Add state abbreviations for clarity
data("state")
Avg_State$Abbreviation <- state.abb[match(Avg_State$State, state.name)]
#print(Avg_Job)
datatable(
Avg_Job,
options = list(
pageLength = 10, # rows per page
lengthChange = FALSE, # hide dropdown to change page length
scrollY = "400px", # fixed table height
scrollCollapse = TRUE # collapse if fewer rows
),
caption = "Average Salary by Job Details View"
)
# print(Avg_State)
datatable(
Avg_State,
options = list(
pageLength = 10, # rows per page
lengthChange = FALSE, # hide dropdown to change page length
scrollY = "400px", # fixed table height
scrollCollapse = TRUE # collapse if fewer rows
),
caption = "List of State by Average Salary"
)
General Distribution
To get a sense of the general distribution of salaries across occupations, violin plots are used: they capture the distribution density as well as the minimum and maximum. The plot is ordered from lowest paid to highest paid. As can be seen, data analysts and statisticians earn the least, while data engineers and quantitative analysts earn the most. However, salaries for data analysts and statisticians are also more predictable, varying less than those of the other occupations.
ggplot(df, aes(x = Job, y = Annual.Salary, fill = Job)) +
geom_violin(trim = FALSE, alpha = 0.7) +
geom_boxplot(width = 0.1, color = "black", alpha = 0.6) +
labs(
title = "Earnings Depend on Role and Geography",
subtitle = "Variation in Annual Salaries Across Data-Related Job Roles",
x = "Job Role",
y = "Annual Salary ($)",
caption = "Source: Glassdoor Salary Data (2024)"
) +
scale_y_continuous(labels = dollar_format()) +
scale_fill_brewer(palette = "Set2") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 12, hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1)
) +
annotate(
"text",
x = 4,
y = 180000,
label = "Machine Learning Engineers earn the highest median salaries",
size = 3.5,
color = "black"
)
This chart shows that earnings vary significantly by role. Machine Learning Engineers and Data Scientists are consistently at the higher end, while Data Analysts typically earn less. This supports the idea that salary depends heavily on the type of role.
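The violin plot is described as ordered from lowest to highest paid. ggplot2 draws a discrete axis in factor-level order, so one way to make that ordering explicit is to reorder the Job factor by a salary summary before plotting. A minimal sketch, assuming the median is the desired ordering statistic and using made-up toy salaries rather than the Glassdoor values:

```r
# Toy data with made-up salaries, for illustration only
toy <- data.frame(
  Job = c("Data Scientist", "Data Analyst", "Data Engineer",
          "Data Scientist", "Data Analyst", "Data Engineer"),
  Annual.Salary = c(120000, 80000, 130000, 110000, 85000, 125000)
)

# reorder() sorts the factor levels by a summary of the second argument;
# here each role is ranked by its median salary, lowest first
toy$Job <- reorder(toy$Job, toy$Annual.Salary, FUN = median)

print(levels(toy$Job))
# "Data Analyst" "Data Scientist" "Data Engineer"
```

Applied to the real data, the equivalent step would be run on df$Job before the ggplot() call, so that the x-axis follows the reordered levels.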
State Comparison
Because salaries vary widely by state in the US, it is important to be more granular and include state-level data. For this, a stacked bar plot and a heat map are used.
The stacked bar plot shows the salary for each occupation, stacked on top of each other, by state. It shows which states pay the most overall and how each occupation contributes to that total. Interestingly, Washington state pays data professions the most overall, followed by New York and Alaska. The lowest-paying states are Arkansas, West Virginia, and Florida.
Additionally, the heat map allows for even more granularity, as the exact profession and state can be traced. For example, the highest-paid position overall appears to be a big data engineer in Washington, followed by data engineers in New York, Alaska, Massachusetts, and Oregon. This graph also shows that location matters a lot: a data analyst in New York could make as much as a data scientist in North Dakota. This is, of course, not surprising given that the cost of living in NY is much higher than in ND.
# Stacked Bar Plot: Salaries by Job and State (Abbreviation)
# Ensure Annual Salary is numeric (strips any remaining thousands separators)
df$Annual.Salary <- as.numeric(gsub(",", "", df$Annual.Salary))
# Bar Plot of Salaries by Job and State
ggplot(df, aes(x = reorder(State, -Annual.Salary), y = Annual.Salary, fill = Job)) +
geom_bar(stat = "identity", position = "stack") +
labs(
title = "Stacked Bar Plot of Annual Salaries by Job and State",
x = "State",
y = "Annual Salary ($)"
) +
scale_y_continuous(labels = dollar) +
scale_fill_brewer(palette = "Set2") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
axis.text.y = element_text(size = 8),
plot.title = element_text(size = 14, face = "bold")
)
The stacked bar plot shows how salaries vary by state and job role. States like Washington, New York, and Massachusetts offer the highest overall pay for data professionals, while states such as Arkansas and West Virginia rank lower. The plot also highlights that Data Engineers and Machine Learning Engineers consistently contribute most to the higher salary totals across top-paying states.
Heat Map Analysis of Salaries by Job and State
This heat map highlights how salaries for data-related roles vary across different states. It visually compares which states offer the highest and lowest pay for each job title, making it easy to spot regional trends and identify where certain roles are most valued.
# Ensure Annual.Salary is numeric
df$Annual.Salary <- as.numeric(gsub(",", "", df$Annual.Salary))
# Reorder Job by mean salary
df$Job <- reorder(df$Job, df$Annual.Salary, FUN = mean)
# Heatmap of Salaries by Job and State
ggplot(df, aes(x = Job, y = reorder(State, -Annual.Salary), fill = Annual.Salary)) +
geom_tile(color = "white") +
labs(
title = "Geographic and Role Based Salary Patterns",
subtitle = "Annual Salaries for Data Roles Across U.S. States",
x = "Job Role",
y = "State",
fill = "Salary ($)",
caption = "Source: Glassdoor Salary Data (2024)"
) +
scale_fill_gradient(low = "lightblue", high = "darkblue", labels = dollar) +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 12, hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1),
axis.text.y = element_text(size = 10)
)
States like California, New York, and Massachusetts show the darkest tiles, indicating higher salary levels for most roles. The heatmap confirms that geography significantly influences pay even within the same job category.
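Reading the top-paying state for each role off the tile colors can be cross-checked numerically. A minimal sketch, assuming one salary per state and role, that builds the same State-by-Job grid the heat map draws and picks the highest-paying state per role. The states "A", "B", "C" and all salaries are made-up placeholders, not Glassdoor values:

```r
# Toy values, made up for illustration only
toy <- data.frame(
  State = c("A", "A", "B", "B", "C", "C"),
  Job = rep(c("Data Analyst", "Data Scientist"), times = 3),
  Annual.Salary = c(80000, 120000, 90000, 110000, 85000, 130000)
)

# State x Job matrix of mean salaries: one cell per heat map tile
grid <- tapply(toy$Annual.Salary, list(toy$State, toy$Job), mean)

# For each role (column), find the row index of the highest salary
# and translate it back to the state name
top_state <- rownames(grid)[apply(grid, 2, which.max)]
names(top_state) <- colnames(grid)

print(top_state)
#   Data Analyst Data Scientist
#            "B"            "C"
```

Swapping which.max for which.min gives the lowest-paying state per role in the same way.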
Top 5 Highest Paid Roles per State Table
Top_Roles <- df %>%
group_by(State) %>%
arrange(desc(Annual.Salary)) %>%
slice(1:5)
datatable(
Top_Roles,
options = list(
pageLength = 10,
lengthChange = FALSE,
scrollY = "400px",
scrollCollapse = TRUE
),
caption = "Top 5 Highest Paid Roles per State"
)
This scatter plot shows the relationship between job roles and salaries in the top five highest paying states. It helps identify which roles command the highest pay in these regions and how compensation varies by both position and location.
library(ggrepel)
# Step 1: Find the five states with the highest average salary
Top_States <- df %>% group_by(State) %>%
summarize(Avg_Salary = mean(Annual.Salary, na.rm = TRUE)) %>%
arrange(desc(Avg_Salary)) %>% slice_head(n = 5)
# Step 2: Filter df to those five states
df_top_states <- df %>% filter(State %in% Top_States$State)
# Step 3: Keep the highest-paid row in each state for labeling
top_labels <- df_top_states %>% group_by(State) %>% slice_max(order_by = Annual.Salary, n = 1)
# Step 4: Scatter plot
ggplot(df_top_states, aes(x = State, y = Annual.Salary, color = Job)) +
geom_point(size = 3, alpha = 0.8) +
geom_smooth(method = "lm", color = "#4F3C5F", se = FALSE) +
geom_text_repel(
data = top_labels,
aes(label = Job),
size = 3.2,
color = "black"
) +
scale_color_brewer(palette = "Set2") +
scale_y_continuous(labels = dollar_format()) +
labs(
title = "Top 5 States and Highest Paid Data Roles",
subtitle = "Roles like Machine Learning Engineer and Data Scientist dominate high paying states",
x = "State",
y = "Annual Salary ($)",
color = "Job Role",
caption = "Source: Glassdoor Salary Data (2024)"
) +
theme_bw() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 12, hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1)
) +
annotate(
"text",
x = 2,
y = 190000,
label = "California and NY lead in pay for ML and Data Science roles",
size = 3.5,
color = "black"
)
This final plot ties both dimensions together: role and geography. It shows that top paying jobs like Machine Learning Engineer and Data Scientist are most concentrated in high salary states such as California and New York. The trend line reinforces that these states consistently offer better compensation across data roles.
Conclusion
Data professions command strong salaries overall, though compensation varies by both role and geography.
Role differences: Data Engineers and Quantitative Analysts tend to earn the most, while Data Analysts and Statisticians see lower but steadier pay.
Geographic variation: States such as Washington, New York, and Alaska offer the highest salaries, while Arkansas and West Virginia rank among the lowest.
Impact of location: A Data Analyst in New York can earn nearly as much as a Data Scientist in a lower paying state, showing how strongly geography shapes opportunity.
Overall, this analysis shows that understanding both the role and the location is critical for making informed career decisions in the data field.