Story - 4 : How much do we get paid?

I have introduced the term “Data Practitioner” as a generic job descriptor because we have so many different job role titles for individuals whose work activities overlap including Data Scientist, Data Engineer, Data Analyst, Business Analyst, Data Architect, etc. For this story we will answer the question, “How much do we get paid?” Your analysis and data visualizations must address the variation in average salary based on role descriptor and state.

Notes: You will need to identify reliable sources for salary data and assemble the data sets that you will need.

Your visualization(s) must show the most salient information (variation in average salary by role and by state).

For this Story you must use a code library and code that you have written in R, Python or JavaScript (additional coding in other languages is allowed).

Post-generation enhancements to your generated visualization will be allowed (e.g., addition of kickers and labels).

Retrieving Data & Data Cleaning

The salary data comes from the following ZipRecruiter pages:

https://www.ziprecruiter.com/Salaries/What-Is-the-Average-DATA-Scientist-Salary-by-State

https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Data-Engineer-Salary-by-State

https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Data-Analyst-Salary-by-State

https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Business-Analyst-Salary-by-State

https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Data-Architect-Salary-by-State

The table from each link was copied into Excel, and the combined tables were saved as a single file, Salary2.csv.

State names are then converted to their two-letter abbreviations.

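library(dplyr)    # mutate(), across(), glimpse(), filter(), summarise()
library(ggplot2)  # box plot
library(usmap)    # plot_usmap()

df_salary <- read.csv("https://raw.githubusercontent.com/tonyCUNY/tonyCUNY/main/Salary2.csv")

# Columns 2 to 5 hold the annual, monthly, weekly and hourly pay figures
df_salary <- df_salary |>
  mutate(across(.cols = 2:5, .fns = as.numeric))

# Replace full state names with two-letter abbreviations
df_salary$State <- state.abb[match(df_salary$State, state.name)]

glimpse(df_salary)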

## Rows: 250
## Columns: 6
## $ State         <chr> "NY", "VT", "CA", "ME", "ID", "WA", "PA", "MA", "AK", "N…
## $ Annual.Salary <dbl> 136172, 133828, 131441, 127644, 126275, 125289, 124713, …
## $ Monthly.Pay   <dbl> 11347, 11152, 10953, 10637, 10522, 10440, 10392, 10265, …
## $ Weekly.Pay    <dbl> 2618, 2573, 2527, 2454, 2428, 2409, 2398, 2369, 2353, 23…
## $ Hourly.Wage   <dbl> 65.47, 64.34, 63.19, 61.37, 60.71, 60.24, 59.96, 59.23, …
## $ Job.Title     <chr> "Data Scientist", "Data Scientist", "Data Scientist", "D…
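Because match() returns NA for any name that is missing from state.name, a quick check after the conversion can catch rows that would otherwise silently drop off the maps. The following is a minimal sketch in base R, assuming a hypothetical sample of raw names rather than the actual contents of Salary2.csv (District of Columbia is not part of R's built-in state.name):

```r
# Hypothetical sample of state names as they might appear in a copied table
states_raw <- c("New York", "Vermont", "California", "District of Columbia")

# Same lookup as in the cleaning step: full name -> two-letter abbreviation
abbr <- state.abb[match(states_raw, state.name)]

# Names that could not be converted come back as NA
unmatched <- states_raw[is.na(abbr)]

print(abbr)
print(unmatched)
```

If unmatched is empty for the real data, every row has a usable abbreviation.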

Data Visualization:

As a Data Science graduate student, I find it essential to understand how data practitioners are paid. This knowledge helps us decide which career path to pursue and whether to consider relocating to a specific state for better opportunities.

Now, let’s explore which careers have the highest and lowest average salaries. Additionally, we’ll identify the top five states with the highest salaries for each career.

Which career has the highest or lowest average salary?

# Median salary per role, so that each label is drawn once instead of once per row
df_median <- df_salary |>
  group_by(Job.Title) |>
  summarise(Median_Salary = median(Annual.Salary))

# Darker blue marks a higher-paying role
role_fills <- c("Data Architect" = "#005EFF", "Data Engineer" = "#267FFF",
                "Data Scientist" = "#4DA6FF", "Business Analyst" = "#80CCFF",
                "Data Analyst" = "#B3E0FF")

ggplot(df_salary, aes(x = reorder(Job.Title, -Annual.Salary), y = Annual.Salary)) +
  geom_boxplot(aes(fill = Job.Title), show.legend = FALSE) +
  scale_fill_manual(values = role_fills) +
  # Print each role's median just above its median line
  geom_text(data = df_median,
            aes(x = Job.Title, y = Median_Salary + 1000,
                label = paste0("$", format(round(Median_Salary), big.mark = ",", trim = TRUE))),
            vjust = -0.5, color = "#F98109", size = 3.2, fontface = "bold") +
  labs(x = "Data Practitioner Career", y = "Average Annual Salary ($)", title = "Average Annual Salary by Career") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))
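The labels on the box plot are medians of the state-level figures, which can differ from the mean across states. A small base-R sketch of a per-role summary table shows both side by side; the data frame toy and its salary values are made up for illustration and are not taken from Salary2.csv:

```r
# Made-up salaries for two roles (illustration only)
toy <- data.frame(
  Job.Title = c("Data Architect", "Data Architect", "Data Architect",
                "Data Analyst", "Data Analyst", "Data Analyst"),
  Annual.Salary = c(150000, 138000, 120000, 90000, 77000, 70000),
  stringsAsFactors = FALSE
)

# Mean and median of the state-level salaries for each role
role_mean <- tapply(toy$Annual.Salary, toy$Job.Title, mean)
role_median <- tapply(toy$Annual.Salary, toy$Job.Title, median)

role_summary <- data.frame(
  Job.Title = names(role_mean),
  Mean = as.numeric(role_mean),
  Median = as.numeric(role_median),
  stringsAsFactors = FALSE
)

# Highest-paying role first, matching the order of the box plot
role_summary <- role_summary[order(-role_summary$Median), ]
print(role_summary)
```

Replacing toy with df_salary would give the same table for the real data.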

Which five states have the highest salaries for each career?

# Average salary per state for Data Architects, highest first
df_da <- df_salary |>
  filter(Job.Title == "Data Architect") |>
  rename(state = "State") |>
  group_by(state) |>
  summarise(Avg_Annual_Salary = mean(Annual.Salary)) |>
  arrange(desc(Avg_Annual_Salary))

# Fill the five highest-paying states blue and all other states light gray
df_da$is_colored <- ifelse(rank(-df_da$Avg_Annual_Salary, ties.method = "min") <= 5, "#267FFF", "#F0F0FC")

p <- plot_usmap(data = df_da,
                regions = "state",
                values = "is_colored",
                labels = TRUE)
# Shrink the state labels, which this code assumes are the plot's second layer
p$layers[[2]]$aes_params$size <- 2.5

p +
  scale_fill_identity() +
  labs(title = "Data Architect",
       subtitle = "Top 5 States with Highest Salary") +
  theme(legend.position = "right",
        plot.title = element_text(hjust = 0.5, vjust = 1, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, vjust = 0, face = "bold"))

# Average salary per state for Data Engineers, highest first
df_de <- df_salary |>
  filter(Job.Title == "Data Engineer") |>
  rename(state = "State") |>
  group_by(state) |>
  summarise(Avg_Annual_Salary = mean(Annual.Salary)) |>
  arrange(desc(Avg_Annual_Salary))

# Fill the five highest-paying states blue and all other states light gray
df_de$is_colored <- ifelse(rank(-df_de$Avg_Annual_Salary, ties.method = "min") <= 5, "#267FFF", "#F0F0FC")

p <- plot_usmap(data = df_de,
                regions = "state",
                values = "is_colored",
                labels = TRUE)
# Shrink the state labels, which this code assumes are the plot's second layer
p$layers[[2]]$aes_params$size <- 2.5

p +
  scale_fill_identity() +
  labs(title = "Data Engineer",
       subtitle = "Top 5 States with Highest Salary") +
  theme(legend.position = "right",
        plot.title = element_text(hjust = 0.5, vjust = 1, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, vjust = 0, face = "bold"))

# Average salary per state for Data Scientists, highest first
df_ds <- df_salary |>
  filter(Job.Title == "Data Scientist") |>
  rename(state = "State") |>
  group_by(state) |>
  summarise(Avg_Annual_Salary = mean(Annual.Salary)) |>
  arrange(desc(Avg_Annual_Salary))

# Fill the five highest-paying states blue and all other states light gray
df_ds$is_colored <- ifelse(rank(-df_ds$Avg_Annual_Salary, ties.method = "min") <= 5, "#267FFF", "#F0F0FC")

p <- plot_usmap(data = df_ds,
                regions = "state",
                values = "is_colored",
                labels = TRUE)
# Shrink the state labels, which this code assumes are the plot's second layer
p$layers[[2]]$aes_params$size <- 2.5

p +
  scale_fill_identity() +
  labs(title = "Data Scientist",
       subtitle = "Top 5 States with Highest Salary") +
  theme(legend.position = "right",
        plot.title = element_text(hjust = 0.5, vjust = 1, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, vjust = 0, face = "bold"))

# Average salary per state for Business Analysts, highest first
df_ba <- df_salary |>
  filter(Job.Title == "Business Analyst") |>
  rename(state = "State") |>
  group_by(state) |>
  summarise(Avg_Annual_Salary = mean(Annual.Salary)) |>
  arrange(desc(Avg_Annual_Salary))

# Fill the five highest-paying states blue and all other states light gray
df_ba$is_colored <- ifelse(rank(-df_ba$Avg_Annual_Salary, ties.method = "min") <= 5, "#267FFF", "#F0F0FC")

p <- plot_usmap(data = df_ba,
                regions = "state",
                values = "is_colored",
                labels = TRUE)
# Shrink the state labels, which this code assumes are the plot's second layer
p$layers[[2]]$aes_params$size <- 2.5

p +
  scale_fill_identity() +
  labs(title = "Business Analyst",
       subtitle = "Top 5 States with Highest Salary") +
  theme(legend.position = "right",
        plot.title = element_text(hjust = 0.5, vjust = 1, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, vjust = 0, face = "bold"))

# Average salary per state for Data Analysts, highest first
df_dan <- df_salary |>
  filter(Job.Title == "Data Analyst") |>
  rename(state = "State") |>
  group_by(state) |>
  summarise(Avg_Annual_Salary = mean(Annual.Salary)) |>
  arrange(desc(Avg_Annual_Salary))

# Fill the five highest-paying states blue and all other states light gray
df_dan$is_colored <- ifelse(rank(-df_dan$Avg_Annual_Salary, ties.method = "min") <= 5, "#267FFF", "#F0F0FC")

p <- plot_usmap(data = df_dan,
                regions = "state",
                values = "is_colored",
                labels = TRUE)
# Shrink the state labels, which this code assumes are the plot's second layer
p$layers[[2]]$aes_params$size <- 2.5

p +
  scale_fill_identity() +
  labs(title = "Data Analyst",
       subtitle = "Top 5 States with Highest Salary") +
  theme(legend.position = "right",
        plot.title = element_text(hjust = 0.5, vjust = 1, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, vjust = 0, face = "bold"))
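The five map blocks above differ only in the role name, so the ranking step they share can be written once. The following is a minimal base-R sketch, assuming a hypothetical helper named top_states (it is not part of the original analysis) and a small sample built from the first Data Scientist rows shown in the glimpse() output:

```r
# Sample rows copied from the glimpse() output above (Data Scientist only)
sample_salary <- data.frame(
  State = c("NY", "VT", "CA", "ME", "ID", "WA", "PA"),
  Annual.Salary = c(136172, 133828, 131441, 127644, 126275, 125289, 124713),
  Job.Title = "Data Scientist",
  stringsAsFactors = FALSE
)

# Hypothetical helper: the n highest-paying states for one role
top_states <- function(df, role, n = 5) {
  one_role <- df[df$Job.Title == role, ]
  ranked <- one_role[order(-one_role$Annual.Salary), ]
  head(ranked$State, n)
}

top_five <- top_states(sample_salary, "Data Scientist")
print(top_five)
```

Calling the helper once per role on the full df_salary would list the states that each map highlights, which makes the maps easier to cross-check.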

Conclusion:

Data Architect has the highest average annual salary at $138,179; the best states for a Data Architect are CA, NY, WA, MA and AK.

Data Engineer has an average annual salary of $119,560; the best states for a Data Engineer are OR, ND, MA, AK and HI.

Data Scientist has an average annual salary of $112,170; the best states for a Data Scientist are CA, NY, ID, VT and ME.

Business Analyst has an average annual salary of $90,024; the best states for a Business Analyst are NY, WA, MD, VA and DE.

Data Analyst has the lowest average annual salary at $76,811; the best states for a Data Analyst are NY, PA, NJ, NH and WY.


CHUN SHING LEUNG

2024-03-17


Overall, NY appears to be the best state for a “Data Practitioner”: it ranks among the top five states for four of the five roles (all except Data Engineer).
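The claim about NY can be checked by counting how often each state appears in the top-five lists from the conclusion. A minimal base-R sketch, in which the list top5 is simply transcribed from those five sentences:

```r
# Top-five states per role, transcribed from the conclusion
top5 <- list(
  "Data Architect"   = c("CA", "NY", "WA", "MA", "AK"),
  "Data Engineer"    = c("OR", "ND", "MA", "AK", "HI"),
  "Data Scientist"   = c("CA", "NY", "ID", "VT", "ME"),
  "Business Analyst" = c("NY", "WA", "MD", "VA", "DE"),
  "Data Analyst"     = c("NY", "PA", "NJ", "NH", "WY")
)

# Number of roles for which each state makes the top five, most frequent first
counts <- sort(table(unlist(top5)), decreasing = TRUE)
print(head(counts, 5))
```

NY appears four times, while CA, WA, MA and AK appear twice each and every other listed state appears once.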