I have introduced the term “Data Practitioner” as a generic job descriptor because we have so many different job role titles for individuals whose work activities overlap including Data Scientist, Data Engineer, Data Analyst, Business Analyst, Data Architect, etc.
For this story we will answer the question, “How much do we get paid?” Your analysis and data visualizations must address the variation in average salary based on role descriptor and state.
library(tidyverse)
library(plotly)
library(datasets)
I used Ziprecruiter.com as my data source for the salaries of Data Scientist, Data Engineer, Data Analyst, Business Analyst, and Data Architect. Note that the by-state table wasn’t available for Data Architect, so I had to look up each state individually. Everything was compiled into a Google Sheet, which was exported as a .csv file.
# URL for Job Salaries
jobs <- c("Data-Scientist", "Data-Engineer", "Data-Analyst", "Business-Analyst", "Data-Architect")
url_link <- 'https://www.ziprecruiter.com/Salaries/What-Is-the-Average-%s-Salary-by-State'
for (job in jobs) {
url <- sprintf(url_link, job)
print(url)
}
## [1] "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Data-Scientist-Salary-by-State"
## [1] "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Data-Engineer-Salary-by-State"
## [1] "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Data-Analyst-Salary-by-State"
## [1] "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Business-Analyst-Salary-by-State"
## [1] "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Data-Architect-Salary-by-State"
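As a side note, the same URLs can be built without an explicit loop, because `sprintf()` is vectorized over its arguments. This is a minimal sketch reusing the `jobs` and `url_link` values from the chunk above; the `urls` variable name is an assumption introduced here for illustration.

```r
# Same inputs as in the chunk above
jobs <- c("Data-Scientist", "Data-Engineer", "Data-Analyst", "Business-Analyst", "Data-Architect")
url_link <- 'https://www.ziprecruiter.com/Salaries/What-Is-the-Average-%s-Salary-by-State'

# sprintf() recycles the format string over the character vector,
# returning one URL per job title
urls <- sprintf(url_link, jobs)
print(urls)
```

Keeping the URLs in a vector also makes them easy to reuse later, for example when noting the source of each table.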
# Import Data from CSV
df <- read.csv("https://raw.githubusercontent.com/Jlok17/2022MSDS/main/Source/Job_State%20Salary%20-%20Sheet1.csv")
str(df)
## 'data.frame': 250 obs. of 6 variables:
## $ State : chr "New York" "Vermont" "California" "Maine" ...
## $ Annual.Salary: chr "136,172.00" "133,828.00" "131,441.00" "127,644.00" ...
## $ Monthly.Pay : chr "11,347.00" "11,152.00" "10,953.00" "10,637.00" ...
## $ Weekly.Pay : chr "2,618.00" "2,573.00" "2,527.00" "2,454.00" ...
## $ Hourly.Wage : num 65.5 64.3 63.2 61.4 60.7 ...
## $ Job : chr "Data Scientist" "Data Scientist" "Data Scientist" "Data Scientist" ...
# Converting String to Numeric
df$`Annual.Salary` <- as.numeric(gsub(",", "", df$`Annual.Salary`))
df$`Monthly.Pay` <- as.numeric(gsub(",", "", df$`Monthly.Pay`))
df$`Weekly.Pay` <- as.numeric(gsub(",", "", df$`Weekly.Pay`))
head(df)
## State Annual.Salary Monthly.Pay Weekly.Pay Hourly.Wage Job
## 1 New York 136172 11347 2618 65.47 Data Scientist
## 2 Vermont 133828 11152 2573 64.34 Data Scientist
## 3 California 131441 10953 2527 63.19 Data Scientist
## 4 Maine 127644 10637 2454 61.37 Data Scientist
## 5 Idaho 126275 10522 2428 60.71 Data Scientist
## 6 Washington 125289 10440 2409 60.24 Data Scientist
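The three conversions above repeat the same pattern, so they could be folded into one helper that also fails loudly when a value cannot be parsed. The sketch below is one possible approach, not the method used in this analysis: `to_number`, `demo`, and `pay_cols` are hypothetical names, and `demo` is a two-row stand-in built from the first values shown in the `str(df)` output.

```r
# Hypothetical helper (not part of the original analysis):
# strip thousands separators, convert to numeric, and stop on unparseable values
to_number <- function(x) {
  cleaned <- gsub(",", "", x, fixed = TRUE)
  out <- suppressWarnings(as.numeric(cleaned))
  # A value that was present but became NA could not be parsed
  bad <- is.na(out) & !is.na(x)
  if (any(bad)) {
    stop("Could not parse: ", paste(unique(x[bad]), collapse = ", "))
  }
  out
}

# Small stand-in for df, using the first two rows shown in str(df)
demo <- data.frame(
  State = c("New York", "Vermont"),
  Annual.Salary = c("136,172.00", "133,828.00"),
  Monthly.Pay = c("11,347.00", "11,152.00"),
  Weekly.Pay = c("2,618.00", "2,573.00"),
  stringsAsFactors = FALSE
)

# Apply the helper to every comma-formatted pay column in one step
pay_cols <- c("Annual.Salary", "Monthly.Pay", "Weekly.Pay")
demo[pay_cols] <- lapply(demo[pay_cols], to_number)
str(demo)
```

Stopping on unparseable values is a design choice: a silent `NA` would otherwise be dropped or propagated by `mean()` later without any warning at the conversion step.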
# Copy of df
df2 <- df
# Average Salary by Job
Avg_Job <- df2 %>%
group_by(Job) %>%
summarize(Avg_Annual_Salary = mean(`Annual.Salary`))
# Average Annual Salary By State
Avg_State <- aggregate(`Annual.Salary` ~ State, data = df2, FUN = mean)
colnames(Avg_State) <- c("State", "Avg_Annual_Salary")
# State Abbreviation
data("state")
Avg_State$Abbreviation <- state.abb[match(Avg_State$State, state.name)]
Avg_State <- Avg_State[order(-Avg_State$Avg_Annual_Salary), ]
print(Avg_Job)
## # A tibble: 5 × 2
## Job Avg_Annual_Salary
## <chr> <dbl>
## 1 Business Analyst 90439.
## 2 Data Analyst 77605.
## 3 Data Architect 138570.
## 4 Data Engineer 121282.
## 5 Data Scientist 112832.
print(Avg_State)
## State Avg_Annual_Salary Abbreviation
## 47 Washington 126682.0 WA
## 32 New York 126019.2 NY
## 21 Massachusetts 122342.2 MA
## 2 Alaska 121887.2 AK
## 37 Oregon 120781.2 OR
## 34 North Dakota 120589.2 ND
## 45 Vermont 119791.2 VT
## 11 Hawaii 118263.0 HI
## 6 Colorado 116922.2 CO
## 5 California 116585.0 CA
## 38 Pennsylvania 115432.8 PA
## 28 Nevada 115248.6 NV
## 30 New Jersey 114326.4 NJ
## 41 South Dakota 113969.8 SD
## 19 Maine 113532.2 ME
## 46 Virginia 113152.2 VA
## 49 Wisconsin 112992.4 WI
## 29 New Hampshire 112388.6 NH
## 8 Delaware 111996.4 DE
## 20 Maryland 111035.8 MD
## 50 Wyoming 110364.0 WY
## 39 Rhode Island 109948.0 RI
## 12 Idaho 109211.4 ID
## 23 Minnesota 108872.0 MN
## 27 Nebraska 108819.0 NE
## 14 Indiana 108727.2 IN
## 31 New Mexico 108540.8 NM
## 13 Illinois 107957.2 IL
## 3 Arizona 106479.2 AZ
## 36 Oklahoma 105382.6 OK
## 26 Montana 104911.8 MT
## 35 Ohio 104852.0 OH
## 15 Iowa 103908.2 IA
## 24 Mississippi 103650.6 MS
## 1 Alabama 103565.4 AL
## 40 South Carolina 102972.0 SC
## 7 Connecticut 102900.4 CT
## 25 Missouri 101600.4 MO
## 43 Texas 101191.4 TX
## 42 Tennessee 101107.8 TN
## 33 North Carolina 100846.4 NC
## 44 Utah 99910.8 UT
## 16 Kansas 97589.2 KS
## 22 Michigan 96869.0 MI
## 10 Georgia 96479.2 GA
## 18 Louisiana 95055.2 LA
## 17 Kentucky 94945.4 KY
## 4 Arkansas 92122.4 AR
## 48 West Virginia 89149.8 WV
## 9 Florida 85413.0 FL
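One caveat with the abbreviation lookup: `match()` returns `NA` for any name that is not in `state.name`, and the choropleth would then silently skip that row. All 50 states matched here, but a quick check is cheap. The sketch below uses a hypothetical three-row data frame (`demo_states`) with one deliberately unmatched name to show what the check reports.

```r
# Hypothetical input: two real state names plus one that state.name does not contain
demo_states <- data.frame(
  State = c("Washington", "Florida", "District of Columbia"),
  stringsAsFactors = FALSE
)

# Same lookup as above: state.name and state.abb ship with the datasets package
demo_states$Abbreviation <- state.abb[match(demo_states$State, state.name)]

# Report any names that failed to match instead of letting NA pass through
unmatched <- demo_states$State[is.na(demo_states$Abbreviation)]
if (length(unmatched) > 0) {
  message("No abbreviation found for: ", paste(unmatched, collapse = ", "))
}
print(demo_states)
```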
Three graphics present the salary data: a box plot of annual salary by job title, a box plot of annual salary by state, and a choropleth map of the United States showing average salary by state.
# Box plot by Job
job_box <- plot_ly(df, x = ~Job, y = ~`Annual.Salary`, type = 'box',
marker = list(color = 'rgb(110, 164, 214)')) %>%
layout(title = 'Annual Salary Distribution by Job Title',
xaxis = list(title = 'Job Title'),
yaxis = list(title = 'Average Annual Salary($)'))
# Box plot by State
state_box <- plot_ly(df, x = ~State, y = ~`Annual.Salary`, type = 'box',
marker = list(color = 'rgb(110, 164, 214)')) %>%
layout(title = 'Annual Salary Distribution by State',
xaxis = list(tickfont = list(size = 14), tickangle = -45),
yaxis = list(title = 'Average Annual Salary($)'))
# Plots
job_box
state_box
# Choropleth Map for Avg Salary by State
state_map <- plot_ly(
Avg_State,
z = ~Avg_Annual_Salary,
locations = ~Abbreviation,
locationmode = 'USA-states',
type = 'choropleth',
colorscale = 'Viridis',
zmin = min(Avg_State$Avg_Annual_Salary),
zmax = max(Avg_State$Avg_Annual_Salary),
text = ~paste('State:', State, '<br>Avg Annual Salary:', round(Avg_Annual_Salary, 2))) %>%
layout(
title = 'Average Annual Salary in the US by State',
geo = list(
scope = 'usa',
projection = list(type = 'albers usa'),
showlakes = TRUE,
lakecolor = 'rgb(255, 255, 255)'),
annotations = list(
list(
x = 0.00,
y = -0.05,
xref = "paper",
yref = "paper",
text = "Arkansas, West Virginia, and Florida have the Lowest Average Salary.",
showarrow = FALSE,
font = list(size = 12)),
list(
x = 0.00, # X-coordinate of the note
y = 0.00, # Y-coordinate of the note
xref = "paper",
yref = "paper",
text = "Washington, New York and Massachusetts have the Highest Average Salary.",
showarrow = FALSE))
)
# Plot
state_map
Comparing the five job titles (Data Scientist, Data Engineer, Data Analyst, Business Analyst, and Data Architect) shows a clear difference in pay. As the title becomes more “specialized” in terminology, salaries increase substantially. Ordered by average salary, each step up in title adds roughly 7% to 25%, from Data Analyst at $77,605 annually to Data Architect at $138,570 annually on the top end. A more specialized title typically entails more responsibility and knowledge. Given the recent title inflation within the data world, it seems that higher annual income correlates with specialization and knowledge rather than with the title alone.
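The size of each step between titles can be checked directly from the averages in the `Avg_Job` output. This is a minimal sketch: the values are copied (rounded to whole dollars) from that table, and ordering the titles by average salary is an assumption about what counts as a "jump" in title. The names `avg_by_job`, `avg_sorted`, `step_pct`, and `overall_pct` are introduced here for illustration.

```r
# Average annual salary by job title, copied from the Avg_Job output
avg_by_job <- c(
  "Data Analyst"     = 77605,
  "Business Analyst" = 90439,
  "Data Scientist"   = 112832,
  "Data Engineer"    = 121282,
  "Data Architect"   = 138570
)

# Order titles from lowest to highest average salary
avg_sorted <- sort(avg_by_job)

# Percent increase from each title to the next one up
step_pct <- round(100 * (avg_sorted[-1] / avg_sorted[-length(avg_sorted)] - 1), 1)
print(step_pct)

# Percent difference between the lowest and highest title
overall_pct <- round(100 * (max(avg_sorted) / min(avg_sorted) - 1), 1)
print(overall_pct)
```

The steps are uneven: the smallest is from Data Scientist to Data Engineer and the largest is from Business Analyst to Data Scientist.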
The other impact on salary is geographic location within the United States. As the map of average salary by state shows, Washington, New York, and Massachusetts have the highest average salaries, while Arkansas, West Virginia, and Florida have the lowest. Side by side, Washington's average of about $126,700 per year is a staggering 48% higher than Florida's $85,413. However, this comparison is very generalized, since the cost of living differs widely even within a state, let alone from state to state. Where the jobs are actually located also matters, since areas with a lower supply of jobs and higher demand for them can have lower salaries than if the scenario were reversed. Overall, if you have a more specialized title and work in a place such as Washington or New York, there is a high chance you will have a higher salary.
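The gap between the top and bottom states can be worked out from the `Avg_State` output. This is a small arithmetic sketch: the two values are copied from that table, and the variable names are assumptions introduced here.

```r
# Average annual salary for the highest and lowest states, from the Avg_State output
washington <- 126682.0
florida    <- 85413.0

# How much higher Washington's average is, as a percentage of Florida's
pct_higher <- 100 * (washington / florida - 1)
print(round(pct_higher, 1))

# The same gap in dollars per year
print(washington - florida)
```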