

Instruction

I have introduced the term “Data Practitioner” as a generic job descriptor because we have so many different job role titles for individuals whose work activities overlap including Data Scientist, Data Engineer, Data Analyst, Business Analyst, Data Architect, etc.

For this story we will answer the question, “How much do we get paid?” Your analysis and data visualizations must address the variation in average salary based on role descriptor and state.

Notes:
1. You will need to identify reliable sources for salary data and assemble the data sets that you will need.
2. Your visualization(s) must show the most salient information (variation in average salary by role and by state).
3. For this Story you must use a code library and code that you have written in R, Python or JavaScript (additional coding in other languages is allowed).
4. Post-generation enhancements to your generated visualization will be allowed (e.g. addition of kickers and labels).

Introduction

The modern data profession has evolved into a diverse ecosystem of roles — including Data Scientists, Data Engineers, Data Analysts, Business Analysts, and Data Architects. While these titles share overlapping skill sets, they differ in focus, responsibility, and compensation. This project explores the question: “How much do we get paid?” — analyzing salary variation across job roles and states within the United States.

The data for this analysis was collected from ZipRecruiter.com, a widely used public source of job market compensation estimates. Salary information was compiled for all 50 states and five data-related job titles (250 state-role records in total), creating a dataset suitable for comparing both role-based and geographical salary patterns.

This story is designed with clarity and purpose in mind. The visualizations were chosen to emphasize accuracy and interpretability:

- Box plots reveal the distribution of salaries across job titles and states, providing insight into variability and outliers (Fidelity & Simplicity).
- A choropleth map illustrates the average annual salary by state, highlighting regional differences at a glance (Utility & Saliency).

Overall, this dashboard presents a cohesive and truthful representation of salary patterns in the U.S. data profession. It demonstrates how both job specialization and location significantly influence pay, offering a valuable perspective for anyone interested in understanding compensation dynamics within the data workforce.

The central issue explored in this story is the uneven distribution of pay within the U.S. data profession.
Despite similar skill overlaps, compensation often varies dramatically by role and geography.
This raises important questions for data practitioners:
- Is pay determined more by what we do or where we work?
- How can we visualize these disparities clearly and fairly?
This analysis investigates those questions using structured salary data and visual encoding principles.

Import libraries

library(tidyverse)
library(plotly)
library(datasets)

Load the data

I used ZipRecruiter.com as the source of salary data for Data Scientist, Data Engineer, Data Analyst, Business Analyst, and Data Architect. The by-state table was not available for Data Architect, so I had to look up each state individually. Everything was compiled into a Google Sheet and exported as a .csv file.

# URL for Job Salaries
jobs <- c("Data-Scientist", "Data-Engineer", "Data-Analyst", "Business-Analyst", "Data-Architect")
url_link <- 'https://www.ziprecruiter.com/Salaries/What-Is-the-Average-%s-Salary-by-State'

for (job in jobs) {
  url <- sprintf(url_link, job)
  print(url)
}

[1] "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Data-Scientist-Salary-by-State"
[1] "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Data-Engineer-Salary-by-State"
[1] "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Data-Analyst-Salary-by-State"
[1] "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Business-Analyst-Salary-by-State"
[1] "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Data-Architect-Salary-by-State"

# read and load Data from CSV
df <- read.csv("D:/Cuny_sps/DATA_608/Story-4/Job_State_Salary.csv")
str(df)

'data.frame':   250 obs. of  6 variables:
 $ State        : chr  "New York" "Vermont" "California" "Maine" ...
 $ Annual.Salary: chr  "136,172.00" "133,828.00" "131,441.00" "127,644.00" ...
 $ Monthly.Pay  : chr  "11,347.00" "11,152.00" "10,953.00" "10,637.00" ...
 $ Weekly.Pay   : chr  "2,618.00" "2,573.00" "2,527.00" "2,454.00" ...
 $ Hourly.Wage  : num  65.5 64.3 63.2 61.4 60.7 ...
 $ Job          : chr  "Data Scientist" "Data Scientist" "Data Scientist" "Data Scientist" ...

# Converting String to Numeric
df$`Annual.Salary` <- as.numeric(gsub(",", "", df$`Annual.Salary`))
df$`Monthly.Pay` <- as.numeric(gsub(",", "", df$`Monthly.Pay`))
df$`Weekly.Pay` <- as.numeric(gsub(",", "", df$`Weekly.Pay`))
head(df)

       State Annual.Salary Monthly.Pay Weekly.Pay Hourly.Wage            Job
1   New York        136172       11347       2618       65.47 Data Scientist
2    Vermont        133828       11152       2573       64.34 Data Scientist
3 California        131441       10953       2527       63.19 Data Scientist
4      Maine        127644       10637       2454       61.37 Data Scientist
5      Idaho        126275       10522       2428       60.71 Data Scientist
6 Washington        125289       10440       2409       60.24 Data Scientist
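
The three conversions above repeat one pattern: strip the thousands separators, then convert to numeric. A minimal sketch of the same step as a reusable helper follows. The name `parse_money` and the two-row `toy` data frame are assumptions for illustration; they are not part of the original analysis.

```r
# Hypothetical helper: remove thousands separators, then convert to numeric.
parse_money <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

# Toy rows mirroring the first two records shown in the str() output.
toy <- data.frame(
  State = c("New York", "Vermont"),
  Annual.Salary = c("136,172.00", "133,828.00"),
  Monthly.Pay = c("11,347.00", "11,152.00"),
  Weekly.Pay = c("2,618.00", "2,573.00"),
  stringsAsFactors = FALSE
)

# Apply the helper to every pay column in one step.
pay_cols <- c("Annual.Salary", "Monthly.Pay", "Weekly.Pay")
toy[pay_cols] <- lapply(toy[pay_cols], parse_money)

# A value that failed to parse would appear as NA, so check for that.
stopifnot(!anyNA(toy[pay_cols]))
str(toy)
```

Checking for `NA` after the conversion is a cheap way to catch cells that contain stray characters such as a currency symbol.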

# Copy of df
df2 <- df

# Average Salary by Job
Avg_Job <- df2 %>%
  group_by(Job) %>%
  summarize(Avg_Annual_Salary = mean(`Annual.Salary`))


# Average Annual Salary By State
Avg_State <- aggregate(`Annual.Salary` ~ State, data = df2, FUN = mean)
colnames(Avg_State) <- c("State", "Avg_Annual_Salary")

# State Abbreviation
data("state")
Avg_State$Abbreviation <- state.abb[match(Avg_State$State, state.name)]
Avg_State <- Avg_State[order(-Avg_State$Avg_Annual_Salary), ]

print(Avg_Job)

# A tibble: 5 × 2
  Job              Avg_Annual_Salary
  <chr>                        <dbl>
1 Business Analyst            90439.
2 Data Analyst                77605.
3 Data Architect             138570.
4 Data Engineer              121282.
5 Data Scientist             112832.

print(Avg_State)

            State Avg_Annual_Salary Abbreviation
47     Washington          126682.0           WA
32       New York          126019.2           NY
21  Massachusetts          122342.2           MA
2          Alaska          121887.2           AK
37         Oregon          120781.2           OR
34   North Dakota          120589.2           ND
45        Vermont          119791.2           VT
11         Hawaii          118263.0           HI
6        Colorado          116922.2           CO
5      California          116585.0           CA
38   Pennsylvania          115432.8           PA
28         Nevada          115248.6           NV
30     New Jersey          114326.4           NJ
41   South Dakota          113969.8           SD
19          Maine          113532.2           ME
46       Virginia          113152.2           VA
49      Wisconsin          112992.4           WI
29  New Hampshire          112388.6           NH
8        Delaware          111996.4           DE
20       Maryland          111035.8           MD
50        Wyoming          110364.0           WY
39   Rhode Island          109948.0           RI
12          Idaho          109211.4           ID
23      Minnesota          108872.0           MN
27       Nebraska          108819.0           NE
14        Indiana          108727.2           IN
31     New Mexico          108540.8           NM
13       Illinois          107957.2           IL
3         Arizona          106479.2           AZ
36       Oklahoma          105382.6           OK
26        Montana          104911.8           MT
35           Ohio          104852.0           OH
15           Iowa          103908.2           IA
24    Mississippi          103650.6           MS
1         Alabama          103565.4           AL
40 South Carolina          102972.0           SC
7     Connecticut          102900.4           CT
25       Missouri          101600.4           MO
43          Texas          101191.4           TX
42      Tennessee          101107.8           TN
33 North Carolina          100846.4           NC
44           Utah           99910.8           UT
16         Kansas           97589.2           KS
22       Michigan           96869.0           MI
10        Georgia           96479.2           GA
18      Louisiana           95055.2           LA
17       Kentucky           94945.4           KY
4        Arkansas           92122.4           AR
48  West Virginia           89149.8           WV
9         Florida           85413.0           FL
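
The abbreviation lookup relies on every value in `State` matching an entry in `state.name` exactly; a name that does not match silently becomes `NA` and would later be missing from the map. A small sketch of how such mismatches could be surfaced, using a made-up input vector (the third entry is deliberately not one of the 50 states):

```r
library(datasets)  # provides state.name and state.abb

# Hypothetical input: two valid state names and one that is not in state.name.
states_in <- c("Washington", "New York", "District of Columbia")

# Same lookup as in the analysis above.
abbr <- state.abb[match(states_in, state.name)]

# Names that failed to match come back as NA.
unmatched <- states_in[is.na(abbr)]

print(abbr)
print(unmatched)
```

In the output above every row has an abbreviation, so this check would return an empty vector for the actual data.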

Data Visualization:

This dashboard uses:
- Position and range (boxplots) to show salary variation with precision.
- Color hue and saturation (choropleth map) to depict geographic differences intuitively.
These encoding choices balance interpretability with visual appeal, ensuring that differences in pay are immediately visible and comparable.

These visuals collectively highlight how both specialization and geography shape the earning potential of data professionals in the U.S.

To showcase salary, there are four graphics: a box plot of Annual Salary Distribution by Data Role, a box plot of Annual Salary Distribution by State, a choropleth map of Average Salary by State, and a heatmap of Average Salary by Role and State for the top 10 states.

Box Plots: Annual Salary by Data Role and by State

# Box plot by Job
job_box <- plot_ly(df, x = ~Job, y = ~`Annual.Salary`, type = 'box',
                   marker = list(color = 'rgb(110, 164, 214)')) %>%
  layout(
    title = 'Annual Salary Distribution by Data Role',
    xaxis = list(title = 'Data Job Title'),
    yaxis = list(title = 'Annual Salary (USD)'),
    annotations = list(
      list(x = 0.5, y = 0.95, xref = 'paper', yref = 'paper',
           text = "Box = Interquartile Range (IQR), Line = Median Salary",
           showarrow = FALSE, font = list(size = 10, color = 'gray')))
  )


# Box plot by State
state_box <- plot_ly(df, x = ~State, y = ~`Annual.Salary`, type = 'box', 
                     marker = list(color = 'rgb(110, 164, 214)')) %>%
  layout(title = 'Annual Salary Distribution by State',
         xaxis = list(tickfont = list(size = 12), tickangle = -45),
         yaxis = list(title = 'Annual Salary (USD)'))

# Plots
job_box

This box plot compares the salary distribution across the five data-related job titles; each box summarizes that role’s salaries across the 50 states.
Data Architects and Data Engineers sit highest, with average salaries of about $138,570 and $121,282, followed by Data Scientists at about $112,832.
Business Analysts (about $90,439) and Data Analysts (about $77,605) sit lowest.
Overall, this chart suggests that more specialized, engineering-heavy roles command higher compensation within the data profession.
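
A box plot draws the median and the interquartile range (IQR), and the same quantities can be tabulated to back up a visual reading. The sketch below uses made-up salary values for two roles, not the ZipRecruiter data, and base R's default quantile method; quartile conventions differ between tools, so the numbers may not match plotly's hover values exactly.

```r
# Toy data (made-up values, not taken from the salary file).
toy <- data.frame(
  Job = rep(c("Data Analyst", "Data Architect"), each = 4),
  Annual.Salary = c(70000, 75000, 80000, 85000,
                    120000, 135000, 140000, 155000)
)

# Median and IQR per role: the quantities a box plot encodes.
meds <- tapply(toy$Annual.Salary, toy$Job, median)
iqrs <- tapply(toy$Annual.Salary, toy$Job, IQR)

role_summary <- data.frame(
  Job = names(meds),
  Median = as.vector(meds),
  IQR = as.vector(iqrs)
)
print(role_summary)
```

Replacing `toy` with the full salary data frame would give the per-role medians and spreads that the chart shows graphically.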

state_box

This box plot displays the distribution of annual salaries within each U.S. state; each box summarizes the five role-level salaries recorded for that state.
The spread across states highlights regional disparities: Washington, New York, and Massachusetts have the highest average salaries, while Arkansas, West Virginia, and Florida have the lowest.
This visualization emphasizes how geographical location influences pay levels, which likely reflects differences in cost of living, local demand, and the concentration of technology jobs.

Taken together with the role-level box plot, the picture is that Data Architects and Data Engineers command the highest average salaries (about $138,570 and $121,282 nationally), while Business Analysts and Data Analysts earn less, which may reflect their broader entry paths and varying technical requirements.
This comparison suggests that role specialization and technical depth drive earning potential within the data profession.

Choropleth Map of Average Salary by State

Together, the job-based boxplot and the state-based boxplot reveal two dimensions of salary variation — one professional, one geographic.
To connect these insights, the next visualization translates the same salary data onto a geospatial scale, revealing how economic opportunity for data practitioners clusters regionally across the U.S.

The following map provides a geographic perspective on data professional salaries, visually showing where pay levels are highest and lowest across the United States.

# Choropleth Map for Avg Salary by State
state_map <- plot_ly(
  Avg_State, 
  z = ~Avg_Annual_Salary,
  locations = ~Abbreviation,
  locationmode = 'USA-states',
  type = 'choropleth',
  colorscale = 'Viridis',
  zmin = min(Avg_State$Avg_Annual_Salary),
  zmax = max(Avg_State$Avg_Annual_Salary),
  text = ~paste('State:', State, '<br>Avg Annual Salary:', round(Avg_Annual_Salary, 2))) %>%
layout(
  title = 'Average Annual Salary in the US by State',
  geo = list(
    scope = 'usa',
    projection = list(type = 'albers usa'),
    showlakes = TRUE,
    lakecolor = 'rgb(255, 255, 255)'),
  annotations = list(
      list(
        x = 0.00,
        y = -0.05,
        xref = "paper",
        yref = "paper",
        text = "Arkansas, West Virginia, and Florida have the Lowest Average Salary.",
        showarrow = FALSE,
        font = list(size = 12)),
      list(
        x = 0.00,  # X-coordinate of the note
        y = 0.00,  # Y-coordinate of the note
        xref = "paper",
        yref = "paper",
        text = "Washington, New York and Massachusetts have the Highest Average Salary.",
        showarrow = FALSE))
)

# Plot
state_map

This choropleth map displays how the average salary for data professionals varies by state across the United States. Each state is color-coded by its average compensation; with the Viridis scale used here, lighter yellow-green shades represent higher salaries and darker purple shades represent lower ones.

The visualization highlights a clear regional disparity in pay. Washington and New York show the highest average salaries, both above $126,000, followed by Massachusetts at about $122,300; California ranks tenth at about $116,600. This is consistent with the concentration of tech hubs and large data-driven enterprises in those states.
Many southern states sit at the lower end, with Arkansas, West Virginia, and Florida lowest, which may reflect smaller tech markets and differing cost-of-living levels. The Midwest is mixed: North Dakota and South Dakota rank in the top 15, while Kansas and Michigan are near the bottom.

Overall, this map shows that location matters considerably for a data practitioner’s earning potential, although the roughly 48% gap between the highest and lowest state averages is smaller than the roughly 79% gap between the highest- and lowest-paid roles.
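
Regional labels such as "southern" can be made explicit with the `state.region` factor that ships with R's `datasets` package. The sketch below looks up the census region for only the three highest and three lowest state averages printed in the `Avg_State` output; the `extremes` data frame is an illustrative subset, not a new data source.

```r
library(datasets)  # provides state.name and state.region

# Highest and lowest three state averages, copied from the Avg_State output.
extremes <- data.frame(
  State = c("Washington", "New York", "Massachusetts",
            "Arkansas", "West Virginia", "Florida"),
  Avg_Annual_Salary = c(126682.0, 126019.2, 122342.2,
                        92122.4, 89149.8, 85413.0)
)

# Look up each state's census region by name.
extremes$Region <- as.character(
  state.region[match(extremes$State, state.name)]
)
print(extremes)
```

Six states are too few to characterize a region. Applying the same lookup to all 50 rows of `Avg_State` and averaging by `Region` would be one way to test the regional pattern more rigorously.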

Insights and Interpretation

When viewed alongside the boxplot of job roles, the state-level salary map completes the overall picture of compensation trends in the U.S. data industry.
The analysis reveals two dominant forces shaping data professionals’ earnings: job specialization and geographic location. States such as Washington, New York, and Massachusetts have the highest averages across the five roles combined, while salaries are more moderate in regions with smaller or emerging data markets.

This suggests that the highest earning potential occurs when a specialized role (for example, Data Architect or Data Engineer) is combined with a high-paying state. Both role and location matter, but in this dataset the spread between roles is wider than the spread between states.
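
One rough way to compare the two forces is to measure how far the highest average sits above the lowest, once across roles and once across states. The sketch below uses only the averages printed in the `Avg_Job` and `Avg_State` output; the helper name `spread_pct` is an assumption for illustration.

```r
# Role averages copied from the Avg_Job output.
role_avg <- c("Data Analyst" = 77605, "Business Analyst" = 90439,
              "Data Scientist" = 112832, "Data Engineer" = 121282,
              "Data Architect" = 138570)

# Lowest and highest state averages copied from the Avg_State output.
state_extremes <- c(Florida = 85413.0, Washington = 126682.0)

# Hypothetical helper: percentage by which the maximum exceeds the minimum.
spread_pct <- function(x) (max(x) / min(x) - 1) * 100

role_gap <- round(spread_pct(role_avg), 1)         # 78.6
state_gap <- round(spread_pct(state_extremes), 1)  # 48.3
print(c(role_gap = role_gap, state_gap = state_gap))
```

This max-to-min ratio is a coarse measure that ignores everything between the extremes. A fuller comparison might look at the variance explained by each factor.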

Heatmap: Average Salary by Role and State (Top 10 States)

# Compute average salary by State and Job
heat_df <- df2 %>%
  group_by(State, Job) %>%
  summarise(Avg_Salary = mean(`Annual.Salary`), .groups = "drop")

# Select Top 10 states by overall average salary
# (slice_max() is the current replacement for the superseded top_n())
top_states <- heat_df %>%
  group_by(State) %>%
  summarise(Overall = mean(Avg_Salary)) %>%
  slice_max(Overall, n = 10) %>%
  pull(State)

heat_df_top <- heat_df %>% filter(State %in% top_states)

# Plot
heat_plot <- ggplot(heat_df_top, aes(x = Job, y = reorder(State, -Avg_Salary), fill = Avg_Salary)) +
  geom_tile(color = "white", linewidth = 0.5) +
  scale_fill_viridis_c(option = "plasma", direction = -1) +
  labs(
    title = "Average Salary by Role and State (Top 10 States)",
    x = "Job Title",
    y = "State",
    fill = "Avg Salary ($)"
  ) +
  theme_minimal(base_size = 14) +  # increased font
  theme(
    axis.text.x = element_text(angle = 30, hjust = 1, size = 12, face = "bold"),
    axis.text.y = element_text(size = 12, face = "bold"),
    plot.title = element_text(face = "bold", size = 16),
    legend.title = element_text(size = 12, face = "bold"),
    legend.text = element_text(size = 11)
  )

ggplotly(heat_plot, height = 600, width = 900)

This heatmap adds a multidimensional view, connecting roles and states simultaneously.
It shows that the more specialized roles (Data Architect, Data Engineer, Data Scientist) maintain higher salaries across most states,
while regional differences remain consistent with the map and boxplots.
Including this heatmap demonstrates how tabular and spatial perspectives align — reinforcing the data story.
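
The heatmap is a colored version of a state-by-role table, and the table itself can be produced by pivoting the long data into a matrix. A minimal sketch with placeholder states and made-up salaries (not values from the dataset):

```r
# Toy long-format data with placeholder states and made-up salaries.
toy <- data.frame(
  State = rep(c("State A", "State B"), each = 2),
  Job = rep(c("Data Analyst", "Data Scientist"), times = 2),
  Avg_Salary = c(80000, 130000, 85000, 125000)
)

# Pivot to a State x Job matrix: one cell per heatmap tile.
wide <- tapply(toy$Avg_Salary, list(toy$State, toy$Job), mean)
print(wide)
```

Reading across a row compares roles within a state, and reading down a column compares states within a role. These are the same two comparisons the heatmap tiles support.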

Conclusion

How Much Do We Get Paid?
Depends on Title and Location

Comparing the five job titles (Data Scientist, Data Engineer, Data Analyst, Business Analyst, and Data Architect) shows a clear difference in pay. As the title becomes more “specialized”, salaries increase substantially. Ordering the roles by average salary, each step up is worth roughly 16% on average, with individual steps ranging from about 7% to about 25%. Data Analyst sits at the bottom at $77,605 annually and Data Architect at the top at $138,570. Because there has recently been title inflation within the data world, titles alone should be read with caution; still, in this data, roles that imply more specialization and knowledge are associated with higher annual income.
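
The size of each jump between titles can be computed directly from the role averages printed in the `Avg_Job` output. A short sketch, ordering the roles from lowest to highest average and taking the percentage change between neighbors:

```r
# Role averages copied from the Avg_Job output, sorted from lowest to highest.
role_avg <- sort(c("Business Analyst" = 90439, "Data Analyst" = 77605,
                   "Data Architect" = 138570, "Data Engineer" = 121282,
                   "Data Scientist" = 112832))

# Percentage increase from each role to the next one up.
step_pct <- (role_avg[-1] / role_avg[-length(role_avg)] - 1) * 100

print(round(step_pct, 1))  # 16.5, 24.8, 7.5, 14.3
mean_step <- round(mean(step_pct), 1)
print(mean_step)           # 15.8
```

The steps are uneven: the largest jump is from Business Analyst to Data Scientist and the smallest is from Data Scientist to Data Engineer.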

The other influence on salary is geographical location within the United States. As the map of average salary by state shows, Washington, New York, and Massachusetts have the highest salaries on average, while Arkansas, West Virginia, and Florida have the lowest. Side by side, Washington’s average of about $126,700 per year is roughly 48% higher than Florida’s $85,413. However, this is a very general comparison, since the cost of living differs widely even within a state, let alone from state to state. Where the jobs are actually located also matters: an area with few openings and many candidates can have lower salaries than an area where the situation is reversed. But overall, a more specialized title combined with a state such as Washington or New York makes a higher average salary likely.

In summary, data professionals earn significantly more when they hold specialized roles and work in high-paying states such as Washington and New York. Overall, the analysis shows that both role specialization and geography drive pay differences among U.S. data professionals. The combination of box plots, a choropleth map, and a heatmap provides a layered understanding of this issue, moving from distribution to geography to the intersection of the two. From a visual analytics perspective, this dashboard demonstrates how effective encoding choices and connected visuals can transform raw salary data into actionable insight.

Resources
https://www.ziprecruiter.com/

https://www.businessinsider.com/how-title-inflation-hurt-employees-careers-companies-morale-2022-12

https://meric.mo.gov/data/cost-living-data-series

DATA 608: Story-4: How much do we get paid?

Bikash Bhowmik - Date: 25-Oct-2025
