Introduction

Our task is to determine how much data practitioners are paid, depending on job title and location.


Load Libraries

We use the tidyverse, a cohesive and consistent ecosystem of packages for data science projects.

The `readr` package lets us read text files stored locally on our computer or hosted on the internet.


Load Data

We load the data from Kaggle: https://www.kaggle.com/code/hasibalmuzdadid/data-science-jobs-salary-analysis-retro-vibe/input

We had a Goldilocks and the Nine Bears situation when looking for an appropriate dataset. The earlier candidates had issues such as too few records, no state information, or no salary information. The ninth was perfect, except that it is international rather than restricted to the 50 US states. Even so, it is just right for the nature of the assignment.



Data Preparation

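There are too many distinct job titles and locations to compare directly, so we keep only the nine most common of each and lump the rest into an Other category. That gives ten groups for job title and ten for location. For job titles, the code keeps every title with at least 8 records, which leaves nine titles plus Other.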
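The two lumping rules behave slightly differently, which a small sketch may make clearer. The toy `titles` vector below is hypothetical and is not taken from the dataset; `fct_lump_n()` and `fct_lump_min()` are the forcats functions used in the code appendix.

```r
library(forcats)

# Hypothetical toy data, for illustration only
titles <- factor(c(rep("Data Scientist", 5), rep("Data Engineer", 4),
                   rep("Data Analyst", 3), rep("ML Engineer", 2),
                   "Data Architect"))

# Rule 1: keep the 2 most frequent levels, lump everything else into "Other"
by_rank <- fct_lump_n(titles, n = 2)
table(by_rank)

# Rule 2: keep every level that appears at least 3 times
by_count <- fct_lump_min(titles, min = 3)
table(by_count)
```

Lumping by rank fixes the number of groups in advance, while lumping by minimum count lets the data decide how many groups survive.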

The main roles seem to be Data Analyst, Data Engineer and Data Scientist with Machine Learning Engineer being a strong fourth title.

## 
##          Big Data Engineer               Data Analyst 
##                          8                         97 
##             Data Architect              Data Engineer 
##                         11                        132 
##       Data Science Manager             Data Scientist 
##                         12                        143 
##  Machine Learning Engineer Machine Learning Scientist 
##                         41                          8 
##         Research Scientist                      Other 
##                         16                        139

The top locations in alphabetical order are:
CA - Canada
DE - Germany
ES - Spain
FR - France
GB - Great Britain
GR - Greece
IN - India
JP - Japan
US - USA

## 
##    CA    DE    ES    FR    GB    GR    IN    JP    US Other 
##    29    25    15    18    44    13    30     7   332    94



Average Salary by Job Title and Location


Calculation and tables

Here we calculate and display the average salary by job title.

## # A tibble: 10 × 2
##    job_title2                 avgsalarytitle
##    <fct>                               <dbl>
##  1 Big Data Engineer                  51974 
##  2 Data Analyst                       92893.
##  3 Data Architect                    177874.
##  4 Data Engineer                     112725 
##  5 Data Science Manager              158328.
##  6 Data Scientist                    108188.
##  7 Machine Learning Engineer         104880.
##  8 Machine Learning Scientist        158412.
##  9 Research Scientist                109020.
## 10 Other                             123882.

Here we calculate and display the average salary by location.

## # A tibble: 10 × 2
##    employee_residence2 avgsalaryloc
##    <fct>                      <dbl>
##  1 CA                        97085.
##  2 DE                        85553.
##  3 ES                        57593.
##  4 FR                        59887.
##  5 GB                        81403.
##  6 GR                        56331.
##  7 IN                        37322.
##  8 JP                       103538.
##  9 US                       149194.
## 10 Other                     59338.


Data visualization

Here we show the variation in average salary by job title and by location. Note that all salaries are in USD.




Postamble

Among machine learning roles, Machine Learning Scientists earn more on average than Data Scientists (about $158,000 versus $108,000), although Machine Learning Engineers earn somewhat less than Data Engineers (about $105,000 versus $113,000). Data Architects, Machine Learning Scientists, and Data Science Managers have the highest average salaries, and both Data Engineers and Data Scientists earn more than Data Analysts (about $93,000).

The USA has the highest salaries; however, this should be seen through the lens of cost-of-living differences by geography. Some of the gap would be captured by a cost-of-living adjustment, for example for health care costs. Employers in other countries may also pay for more benefits that are not reflected in the salary figure, so the real difference in pay between the USA and the other countries may not be so great.

It’s also interesting that there were so many respondents from Greece. Perhaps Athens is a data practitioner hub worth learning more about.




Self Critique

I tried to streamline the story to meet the requirements efficiently but elegantly. I put the code entirely in the appendix so there were fewer interruptions when going through the story. I tend to see a lot of complexity in everything, so this was an exercise in keeping a tight scope.

One avenue for future analysis would be to convert the salaries into cost-of-living-adjusted salaries. We could also have used fuzzy matching to move more job titles out of the Other category and into the main buckets.
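As a rough sketch of what fuzzy matching could look like, the example below uses base R's `agrepl()` to assign titles to buckets by approximate substring match. The `raw_titles` values, the `buckets` list, the `assign_bucket()` helper, and the edit threshold of 2 are all hypothetical choices for illustration; they were not part of the analysis.

```r
# Hypothetical raw titles, for illustration only
raw_titles <- c("Data Scientist", "Lead Data Scientist", "Data Scienist",
                "Data Engineer", "Principal Data Engineer", "Head of Data")

# Hypothetical main buckets to match against
buckets <- c("Data Scientist", "Data Engineer", "Data Analyst")

# Return the first bucket that approximately appears in the title,
# allowing up to max_edits insertions, deletions, or substitutions
assign_bucket <- function(title, buckets, max_edits = 2) {
  hits <- vapply(buckets, function(b) {
    agrepl(b, title, max.distance = max_edits, ignore.case = TRUE)
  }, logical(1))
  if (any(hits)) buckets[which(hits)[1]] else "Other"
}

bucketed <- vapply(raw_titles, assign_bucket, character(1),
                   buckets = buckets, USE.NAMES = FALSE)
table(bucketed)
```

A threshold this loose can also produce false matches on real data, so the resulting buckets would need a manual check before being trusted.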

The tibbles could be replaced with more attractive tables with formatted salaries, and the country codes could be replaced by full country names. I could also have formatted the salary axis labels so that they read $150,000 instead of 150000.
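A minimal sketch of those formatting changes, assuming the scales package is available. The four averages are copied from the location table above and rounded to whole dollars; the `country_names` lookup and the `loc_demo` data frame are hypothetical names introduced for this example.

```r
library(ggplot2)
library(scales)

# Subset of the location averages reported above, rounded to whole dollars
loc_demo <- data.frame(
  code = c("CA", "DE", "IN", "US"),
  avgsalaryloc = c(97085, 85553, 37322, 149194)
)

# Hypothetical lookup from country code to display name
country_names <- c(CA = "Canada", DE = "Germany", IN = "India", US = "USA")
loc_demo$country <- unname(country_names[loc_demo$code])

# Formatted salaries for a table column, e.g. "$149,194"
loc_demo$avg_label <- dollar(loc_demo$avgsalaryloc)

# Same bar chart style as the appendix, with dollar-formatted axis labels
p <- ggplot(loc_demo, aes(x = reorder(country, avgsalaryloc), y = avgsalaryloc)) +
  geom_col(fill = "#f68060", alpha = .7, width = .5) +
  coord_flip() +
  scale_y_continuous(labels = label_dollar()) +
  theme_minimal()
```

Because `coord_flip()` swaps the axes, the salary is still mapped to `y`, so the label formatting goes on `scale_y_continuous()` even though the values appear along the horizontal axis.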


References

To help with the data visualization I used the R Graph Gallery at https://r-graph-gallery.com/



Code Appendix


# Libraries used
library(tidyverse)
library(readr)

# Load Data
df <- read_csv("ds_salaries.csv")

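# Lump rare job titles and locations into "Other"
# Job titles: keep every title with at least 8 records
df$job_title2 <- fct_lump_min(df$job_title, 8)
# Locations: keep the 9 most frequent countries of residence
df$employee_residence2 <- fct_lump_n(df$employee_residence, 9)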

# Displaying the top ten job titles
table(df$job_title2)

# Displaying the top ten locations
table(df$employee_residence2)

# Calculate the average salary by job title and location
salarytitle <- group_by(df, job_title2) %>% 
  summarize(avgsalarytitle = mean(salary_in_usd))
salaryloc <- group_by(df, employee_residence2) %>% 
  summarize(avgsalaryloc = mean(salary_in_usd))

# Display the average salary by job title
salarytitle

# Display the average salary by location
salaryloc

# Graph average salary by job title
salarytitle %>%
  mutate(job_title2 = fct_reorder(job_title2, avgsalarytitle)) %>%
  ggplot( aes(x=job_title2, y=avgsalarytitle)) +
    geom_col(fill="#f68060", alpha=.7, width=.4) +
    coord_flip() +
    ggtitle("Variation in Average Salary by Job Title") +
    theme_minimal() +
    theme(axis.title.x = element_blank()) +
    theme(axis.title.y = element_blank()) +
    theme(plot.title = element_text(size=16))

# Graph average salary by location
salaryloc %>%
  mutate(employee_residence2 = fct_reorder(employee_residence2, avgsalaryloc)) %>%
  ggplot( aes(x=employee_residence2, y=avgsalaryloc)) +
    geom_col(fill="#f68060", alpha=.7, width=.5) +
    coord_flip() +
    ggtitle("Variation in Average Salary by Location") +
    theme_minimal() +
    theme(axis.title.x = element_blank()) +
    theme(axis.title.y = element_blank()) +
    theme(plot.title = element_text(size=16))