Story 4 Data 608

Objective:

I have introduced the term “Data Practitioner” as a generic job descriptor because we have so many different job titles for individuals whose work activities overlap, including Data Scientist, Data Engineer, Data Analyst, Business Analyst, Data Architect, etc.

For this story we will answer the question, “How much do we get paid?” Your analysis and data visualizations must address the variation in average salary based on role descriptor and state.

Notes:

You will need to identify reliable sources for salary data and assemble the data sets that you will need.
Your visualization(s) must show the most salient information (variation in average salary by role and by state).
For this Story you must use a code library and code that you have written in R, Python or JavaScript (additional coding in other languages is allowed).
Post-generation enhancements to your generated visualization(s) will be allowed (e.g. addition of kickers and labels).

Libraries Used:

library(tidyverse)
library(plotly)
library(datasets)

Data Collection:

I used ZipRecruiter (ziprecruiter.com) as my source of salary data for Data Scientist, Data Engineer, Data Analyst, Business Analyst, and Data Architect. Note that the salary-by-state table wasn’t present for Data Architect, so I had to look up each state individually. Everything was compiled into a Google Sheet, which was exported as a CSV file.

# URL for Job Salaries
jobs <- c("Data-Scientist", "Data-Engineer", "Data-Analyst", "Business-Analyst", "Data-Architect")
url_link <- 'https://www.ziprecruiter.com/Salaries/What-Is-the-Average-%s-Salary-by-State'

for (job in jobs) {
  url <- sprintf(url_link, job)
  print(url)
}

## [1] "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Data-Scientist-Salary-by-State"
## [1] "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Data-Engineer-Salary-by-State"
## [1] "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Data-Analyst-Salary-by-State"
## [1] "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Business-Analyst-Salary-by-State"
## [1] "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Data-Architect-Salary-by-State"

# Import Data from CSV
df <- read.csv("https://raw.githubusercontent.com/Jlok17/2022MSDS/main/Source/Job_State%20Salary%20-%20Sheet1.csv")
str(df)

## 'data.frame':    250 obs. of  6 variables:
##  $ State        : chr  "New York" "Vermont" "California" "Maine" ...
##  $ Annual.Salary: chr  "136,172.00" "133,828.00" "131,441.00" "127,644.00" ...
##  $ Monthly.Pay  : chr  "11,347.00" "11,152.00" "10,953.00" "10,637.00" ...
##  $ Weekly.Pay   : chr  "2,618.00" "2,573.00" "2,527.00" "2,454.00" ...
##  $ Hourly.Wage  : num  65.5 64.3 63.2 61.4 60.7 ...
##  $ Job          : chr  "Data Scientist" "Data Scientist" "Data Scientist" "Data Scientist" ...

# Converting String to Numeric
df$`Annual.Salary` <- as.numeric(gsub(",", "", df$`Annual.Salary`))
df$`Monthly.Pay` <- as.numeric(gsub(",", "", df$`Monthly.Pay`))
df$`Weekly.Pay` <- as.numeric(gsub(",", "", df$`Weekly.Pay`))
head(df)

##        State Annual.Salary Monthly.Pay Weekly.Pay Hourly.Wage            Job
## 1   New York        136172       11347       2618       65.47 Data Scientist
## 2    Vermont        133828       11152       2573       64.34 Data Scientist
## 3 California        131441       10953       2527       63.19 Data Scientist
## 4      Maine        127644       10637       2454       61.37 Data Scientist
## 5      Idaho        126275       10522       2428       60.71 Data Scientist
## 6 Washington        125289       10440       2409       60.24 Data Scientist
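
The three conversions above repeat the same pattern, so they can also be applied to all comma-formatted columns in one pass. The following is a minimal base R sketch, not the code used for this report: `pay` is a toy data frame built from the first two rows shown above, and `to_number` is a hypothetical helper name.

```r
# Toy stand-in for the imported CSV (first two rows of the output above)
pay <- data.frame(
  State = c("New York", "Vermont"),
  Annual.Salary = c("136,172.00", "133,828.00"),
  Monthly.Pay = c("11,347.00", "11,152.00"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip thousands separators, then convert to numeric
to_number <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

# Apply the helper to every comma-formatted column at once
money_cols <- c("Annual.Salary", "Monthly.Pay")
pay[money_cols] <- lapply(pay[money_cols], to_number)

# Fail loudly if any value could not be parsed
stopifnot(!anyNA(unlist(pay[money_cols])))
str(pay)
```

Checking for `NA` right after the conversion catches stray characters (for example a `$` sign) that `as.numeric` would otherwise turn into missing values with only a warning.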

# Copy of df
df2 <- df

# Average Salary by Job
Avg_Job <- df2 %>%
  group_by(Job) %>%
  summarize(Avg_Annual_Salary = mean(`Annual.Salary`))


# Average Annual Salary By State
Avg_State <- aggregate(`Annual.Salary` ~ State, data = df2, FUN = mean)
colnames(Avg_State) <- c("State", "Avg_Annual_Salary")

# State Abbreviation
data("state")
Avg_State$Abbreviation <- state.abb[match(Avg_State$State, state.name)]
Avg_State <- Avg_State[order(-Avg_State$Avg_Annual_Salary), ]

print(Avg_Job)

## # A tibble: 5 × 2
##   Job              Avg_Annual_Salary
##   <chr>                        <dbl>
## 1 Business Analyst            90439.
## 2 Data Analyst                77605.
## 3 Data Architect             138570.
## 4 Data Engineer              121282.
## 5 Data Scientist             112832.

print(Avg_State)

##             State Avg_Annual_Salary Abbreviation
## 47     Washington          126682.0           WA
## 32       New York          126019.2           NY
## 21  Massachusetts          122342.2           MA
## 2          Alaska          121887.2           AK
## 37         Oregon          120781.2           OR
## 34   North Dakota          120589.2           ND
## 45        Vermont          119791.2           VT
## 11         Hawaii          118263.0           HI
## 6        Colorado          116922.2           CO
## 5      California          116585.0           CA
## 38   Pennsylvania          115432.8           PA
## 28         Nevada          115248.6           NV
## 30     New Jersey          114326.4           NJ
## 41   South Dakota          113969.8           SD
## 19          Maine          113532.2           ME
## 46       Virginia          113152.2           VA
## 49      Wisconsin          112992.4           WI
## 29  New Hampshire          112388.6           NH
## 8        Delaware          111996.4           DE
## 20       Maryland          111035.8           MD
## 50        Wyoming          110364.0           WY
## 39   Rhode Island          109948.0           RI
## 12          Idaho          109211.4           ID
## 23      Minnesota          108872.0           MN
## 27       Nebraska          108819.0           NE
## 14        Indiana          108727.2           IN
## 31     New Mexico          108540.8           NM
## 13       Illinois          107957.2           IL
## 3         Arizona          106479.2           AZ
## 36       Oklahoma          105382.6           OK
## 26        Montana          104911.8           MT
## 35           Ohio          104852.0           OH
## 15           Iowa          103908.2           IA
## 24    Mississippi          103650.6           MS
## 1         Alabama          103565.4           AL
## 40 South Carolina          102972.0           SC
## 7     Connecticut          102900.4           CT
## 25       Missouri          101600.4           MO
## 43          Texas          101191.4           TX
## 42      Tennessee          101107.8           TN
## 33 North Carolina          100846.4           NC
## 44           Utah           99910.8           UT
## 16         Kansas           97589.2           KS
## 22       Michigan           96869.0           MI
## 10        Georgia           96479.2           GA
## 18      Louisiana           95055.2           LA
## 17       Kentucky           94945.4           KY
## 4        Arkansas           92122.4           AR
## 48  West Virginia           89149.8           WV
## 9         Florida           85413.0           FL

Data Visualization:

Three graphics showcase the salary data: a box plot of the annual salary distribution by job title, a box plot of the annual salary distribution by state, and a choropleth (heat map) of the average salary by state across the United States.

# Box plot by Job
job_box <- plot_ly(df, x = ~Job, y = ~`Annual.Salary`, type = 'box', 
                   marker = list(color = 'rgb(110, 164, 214)')) %>%
  layout(title = 'Annual Salary Distribution by Job Title',
         xaxis = list(title = 'Job Title'),
         yaxis = list(title = 'Average Annual Salary($)'))

# Box plot by State
state_box <- plot_ly(df, x = ~State, y = ~`Annual.Salary`, type = 'box', 
                     marker = list(color = 'rgb(110, 164, 214)')) %>%
  layout(title = 'Annual Salary Distribution by State',
         xaxis = list(tickfont = list(size = 14), tickangle = -45),
         yaxis = list(title = 'Average Annual Salary($)'))

# Plots
job_box

state_box
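
With 50 states on the x-axis, the state box plot is easier to read when the states are sorted by salary rather than alphabetically. One possible approach, sketched below with base R's `reorder()`, is to turn `State` into a factor whose levels are ordered by median salary before plotting. The data frame `toy` and its values are purely illustrative and not taken from the data set; it is also an assumption here that plotly draws a categorical axis in factor-level order, which is its usual behavior.

```r
# Illustrative values only (not from the salary data)
toy <- data.frame(
  State = c("A", "A", "B", "B", "C", "C"),
  Annual.Salary = c(90, 110, 60, 80, 120, 140)
)

# Order factor levels by median salary; negating the salary
# puts the highest-paying state first
toy$State <- reorder(toy$State, -toy$Annual.Salary, FUN = median)

levels(toy$State)
```

Applying the same `reorder()` call to `df$State` before building `state_box` should place the highest-median states on the left.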

# Choropleth Map for Avg Salary by State
state_map <- plot_ly(
  Avg_State, 
  z = ~Avg_Annual_Salary,
  locations = ~Abbreviation,
  locationmode = 'USA-states',
  type = 'choropleth',
  colorscale = 'Viridis',
  zmin = min(Avg_State$Avg_Annual_Salary),
  zmax = max(Avg_State$Avg_Annual_Salary),
  text = ~paste('State:', State, '<br>Avg Annual Salary:', round(Avg_Annual_Salary, 2))) %>%
layout(
  title = 'Average Annual Salary in the US by State',
  geo = list(
    scope = 'usa',
    projection = list(type = 'albers usa'),
    showlakes = TRUE,
    lakecolor = 'rgb(255, 255, 255)'),
  annotations = list(
      list(
        x = 0.00,
        y = -0.05,
        xref = "paper",
        yref = "paper",
        text = "Arkansas, West Virginia, and Florida have the Lowest Average Salary.",
        showarrow = FALSE,
        font = list(size = 12)),
      list(
        x = 0.00,  # X-coordinate of the note
        y = 0.00,  # Y-coordinate of the note
        xref = "paper",
        yref = "paper",
        text = "Washington, New York and Massachusetts have the Highest Average Salary.",
        showarrow = FALSE))
)

# Plot
state_map

How Much Do We Get Paid?

Depends on Title and Location

Comparing the five job titles (Data Scientist, Data Engineer, Data Analyst, Business Analyst, and Data Architect) shows a clear difference in pay. As the title becomes more “specialized” in terminology, salaries increase substantially. A more specialized title typically entails more responsibility, and when the titles are ranked by average salary, each step up is about a 16% increase on average (individual steps range from about 7% to about 25%), from Data Analyst at $77,605 annually to Data Architect at $138,570 annually on the top end. Because there has been recent title inflation within the data world, it seems that specialization and knowledge, rather than the title alone, typically correlate with a higher annual income.
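
The size of each step between titles can be checked directly from the averages printed earlier. The sketch below copies the five values from the `Avg_Job` table; the names `avg_job` and `step_pct` are illustrative.

```r
# Average annual salary by job, copied from the Avg_Job output
avg_job <- c(
  "Business Analyst" = 90439,
  "Data Analyst" = 77605,
  "Data Architect" = 138570,
  "Data Engineer" = 121282,
  "Data Scientist" = 112832
)

# Rank titles from lowest to highest average salary
avg_job <- sort(avg_job)

# Percent increase from each title to the next one up
step_pct <- diff(avg_job) / head(avg_job, -1) * 100

round(step_pct, 1)
round(mean(step_pct), 1)
```

Since the inputs are the rounded averages from the table, the results are approximate.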

The other impact on salary is geographic location within the United States. As the map of average salary by state shows, Washington, New York, and Massachusetts have the highest average salaries, while Arkansas, West Virginia, and Florida have the lowest. In a side-by-side comparison, Washington's average salary of about $126,700 per year is a staggering 48% higher than Florida's (about $85,400). However, this comparison is very generalized, since the cost of living differs widely even within a state, let alone from state to state. Where the jobs are actually located also matters, since areas with a lower supply of jobs and higher demand for them can have lower salaries than areas where the situation is reversed. Overall, if you have a more specialized title and work in a place such as Washington or New York, there is a good chance you will have a higher salary.
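
The gap between the highest- and lowest-paying states can be computed from the two values in the `Avg_State` output. This is a small arithmetic sketch; `avg_state` and `pct_higher` are illustrative names.

```r
# State averages copied from the Avg_State output
avg_state <- c(Washington = 126682.0, Florida = 85413.0)

# How much higher Washington's average is than Florida's, in percent
pct_higher <- (avg_state[["Washington"]] / avg_state[["Florida"]] - 1) * 100

round(pct_higher, 1)
```

This compares nominal salaries only and does not adjust for cost of living.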

Resources

https://www.ziprecruiter.com/

https://www.businessinsider.com/how-title-inflation-hurt-employees-careers-companies-morale-2022-12

https://meric.mo.gov/data/cost-living-data-series