Our task is to answer how much data practitioners get paid depending on the job title and location.
We’re using the tidyverse to have a cohesive and consistent ecosystem of libraries for completing data science projects.
The `readr` package lets us read text files stored locally on our computer or hosted on the internet.
We are loading data from Kaggle: https://www.kaggle.com/code/hasibalmuzdadid/data-science-jobs-salary-analysis-retro-vibe/input
We had a Goldilocks and the Nine Bears situation when looking for an appropriate dataset. The first eight had issues such as too few records, no state information, or no salary information. The ninth was perfect, except that it was international rather than restricted to the 50 US states. However, it’s just right for the nature of the assignment.
There are too many job titles and too many locations, so we keep only the top 9 job titles and the top 9 locations and lump the rest into an "Other" category. This gives ten levels each for job title and location.
The main roles seem to be Data Analyst, Data Engineer and Data Scientist with Machine Learning Engineer being a strong fourth title.
##
##          Big Data Engineer               Data Analyst
##                          8                         97
##             Data Architect              Data Engineer
##                         11                        132
##       Data Science Manager             Data Scientist
##                         12                        143
##  Machine Learning Engineer Machine Learning Scientist
##                         41                          8
##         Research Scientist                      Other
##                         16                        139
The top locations in alphabetical order are:
CA - Canada
DE - Germany
ES - Spain
FR - France
GB - United Kingdom (Great Britain)
GR - Greece
IN - India
JP - Japan
US - USA
##
##    CA    DE    ES    FR    GB    GR    IN    JP    US Other
##    29    25    15    18    44    13    30     7   332    94
Here we calculate and display the average salary by job title.
## # A tibble: 10 × 2
##    job_title2                 avgsalarytitle
##    <fct>                               <dbl>
##  1 Big Data Engineer                  51974
##  2 Data Analyst                       92893.
##  3 Data Architect                    177874.
##  4 Data Engineer                     112725
##  5 Data Science Manager              158328.
##  6 Data Scientist                    108188.
##  7 Machine Learning Engineer         104880.
##  8 Machine Learning Scientist        158412.
##  9 Research Scientist                109020.
## 10 Other                             123882.
Here we calculate and display the average salary by location.
## # A tibble: 10 × 2
##    employee_residence2 avgsalaryloc
##    <fct>                      <dbl>
##  1 CA                        97085.
##  2 DE                        85553.
##  3 ES                        57593.
##  4 FR                        59887.
##  5 GB                        81403.
##  6 GR                        56331.
##  7 IN                        37322.
##  8 JP                       103538.
##  9 US                       149194.
## 10 Other                     59338.
Here we show variation in average salary by role and by location. Note that all salaries are in USD.
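Machine Learning Scientists (about $158,000) earn more on average than Data Scientists (about $108,000), although Machine Learning Engineers (about $105,000) earn slightly less than Data Engineers (about $113,000). Data Architects and Data Science Managers have the highest average compensation, while Data Engineers and Data Scientists earn more than Data Analysts (about $93,000). Big Data Engineers have the lowest average, but that group has only 8 records.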
The USA has the highest average salaries, but this should be seen through the lens of cost-of-living differences by geography. Some of the gap, for example health care costs, would be captured by a cost-of-living adjustment. Employers in other countries may also pay for more benefits that are not reflected in the salary package, so the real difference in pay between the USA and the other countries may be smaller than it appears.
It’s also interesting that there were so many respondents from Greece. Perhaps Athens is a data practitioner hub worth learning more about.
I tried to streamline the story to meet the requirements efficiently but elegantly. I put the code entirely in the appendix so there were fewer interruptions when going through the story. I tend to see a lot of complexity in everything, so this was an exercise in keeping a tight scope.
An avenue for future analysis would be to convert the salaries into cost-of-living adjusted salaries.
The tibbles could be replaced with more attractive tables with formatted salaries, and the country codes could be replaced by the full country names. I could also have formatted the salary axis so that it showed $150,000 instead of 150000, or used fuzzy matching to place more job titles in the main buckets instead of "Other".
To help with the data visualization I used the R Graph Gallery at https://r-graph-gallery.com/
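The formatting ideas mentioned above (dollar-formatted salaries and full country names) could be sketched roughly as follows. This is only an illustration, not part of the original analysis: the three rows are rounded values copied from the location table in this report, and `salaryloc_demo`, `country_names` and `pretty_tbl` are hypothetical names introduced for the example. It assumes the `scales` package is installed.

```r
library(tibble)
library(dplyr)
library(scales)

# Rounded values copied from the average-salary-by-location table in this report
salaryloc_demo <- tibble(
  employee_residence2 = c("CA", "IN", "US"),
  avgsalaryloc = c(97085, 37322, 149194)
)

# Hypothetical lookup from country code to display name
country_names <- c(CA = "Canada", IN = "India", US = "USA")

pretty_tbl <- salaryloc_demo %>%
  mutate(
    # Replace the two-letter code with the full name
    country = unname(country_names[employee_residence2]),
    # Format the salary with a dollar sign and thousands separators
    avg_salary = dollar(avgsalaryloc, accuracy = 1)
  ) %>%
  select(country, avg_salary)

pretty_tbl
```

For the plots, adding `scale_y_continuous(labels = scales::label_dollar())` to the ggplot chain should give the same formatting on the salary axis, since salary is mapped to the y aesthetic and only shown horizontally because of `coord_flip()`.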
# Libraries used
library(tidyverse)
library(readr)
# Load Data
df <- read_csv("ds_salaries.csv")
# Lumping job title and location
df$job_title2 <- fct_lump_min(df$job_title, 8)
df$employee_residence2 <- fct_lump_n(df$employee_residence, 9)
# Displaying the top ten job titles
table(df$job_title2)
# Displaying the top ten locations
table(df$employee_residence2)
# Calculate the average salary by job title and location
salarytitle <- group_by(df, job_title2) %>%
  summarize(avgsalarytitle = mean(salary_in_usd))
salaryloc <- group_by(df, employee_residence2) %>%
  summarize(avgsalaryloc = mean(salary_in_usd))
# Display the average salary by job title
salarytitle
# Display the average salary by location
salaryloc
# Graph average salary by job title
salarytitle %>%
  mutate(job_title2 = fct_reorder(job_title2, avgsalarytitle)) %>%
  ggplot(aes(x = job_title2, y = avgsalarytitle)) +
  geom_col(fill = "#f68060", alpha = .7, width = .4) +
  coord_flip() +
  ggtitle("Variation in Average Salary by Job Title") +
  theme_minimal() +
  theme(axis.title.x = element_blank()) +
  theme(axis.title.y = element_blank()) +
  theme(plot.title = element_text(size = 16))
# Graph average salary by location
salaryloc %>%
  mutate(employee_residence2 = fct_reorder(employee_residence2, avgsalaryloc)) %>%
  ggplot(aes(x = employee_residence2, y = avgsalaryloc)) +
  geom_col(fill = "#f68060", alpha = .7, width = .5) +
  coord_flip() +
  ggtitle("Variation in Average Salary by Location") +
  theme_minimal() +
  theme(axis.title.x = element_blank()) +
  theme(axis.title.y = element_blank()) +
  theme(plot.title = element_text(size = 16))
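The appendix lumps job titles with `fct_lump_min()` (a minimum count of 8) and locations with `fct_lump_n()` (the top 9). In this dataset the minimum count of 8 happens to keep exactly nine job titles, as the counts table shows. The difference between the two functions can be seen in a small sketch on made-up data; the `titles` vector below is invented for illustration and is not taken from the dataset.

```r
library(forcats)

# Toy job titles (made up for illustration, not taken from the dataset)
titles <- factor(c(
  rep("Data Scientist", 5),
  rep("Data Engineer", 4),
  rep("Data Analyst", 3),
  "NLP Engineer",
  "BI Developer"
))

# Keep the 2 most frequent levels; everything else becomes "Other"
by_rank <- fct_lump_n(titles, n = 2)

# Keep every level with at least 3 records; everything else becomes "Other"
by_count <- fct_lump_min(titles, min = 3)

table(by_rank)
table(by_count)
```

Here `by_rank` should keep Data Scientist and Data Engineer and lump the remaining five records, while `by_count` also keeps Data Analyst and lumps only the two single-record titles. Lumping by rank fixes the number of categories, whereas lumping by minimum count fixes the smallest group size, which matters when averages are computed per group.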