The Data Science Job Salaries Dataset was downloaded from the site https://ai-jobs.net, kept it in my Github account and loaded into R. It contains 14491 rows and 11 variables The download basically contains a single table with all salary information structured in each columns below.
work_year: The year the salary was paid.
experience_level: The experience level in the job during the year with the following possible values: EN = Entry-level / Junior MI = Mid-level / Intermediate SE = Senior-level / Expert EX = Executive-level / Director
employment_type: The type of employment for the role: PT = Part-time FT = Full-time CT = Contract FL = Freelance
job_title: the role worked in during the year.
salary: the total gross salary amount paid.
salary_currency: the currency of the salary paid as an ISO 4217 currency code.
salary_in_usd: the salary in USD (FX rate divided by avg. USD rate of respective year) via statistical data from the BIS and central banks.
employee_residence: employee’s primary country of residence in during the work year as an ISO 3166 country code.
remote_ratio: the overall amount of work done remotely, possible values are as follows: 0 = No remote work (less than 20%) 50 = Partially remote/hybird 100= Fully remote (more than 80%)
company_location: the country of the employer’s main office or contracting branch as an ISO 3166 country code.
company_size: the average number of people that worked for the company during the year: S = less than 50 employees (small) M = 50 to 250 employees (medium) L = more than 250 employees (large)
#Install required packages
library(tidyverse)
library(countrycode)
library(plotly)
# Read the salaries dataframe
salaries_data <- read.csv("https://raw.githubusercontent.com/LwinShwe/DATA-608-Story-4/main/salaries%20dataset.csv")
str(salaries_data)## 'data.frame': 14491 obs. of 11 variables:
## $ work_year : int 2024 2024 2024 2024 2024 2024 2024 2024 2024 2024 ...
## $ experience_level : chr "MI" "MI" "SE" "SE" ...
## $ employment_type : chr "FT" "FT" "FT" "FT" ...
## $ job_title : chr "Data Scientist" "Data Scientist" "AI Architect" "AI Architect" ...
## $ salary : int 86218 68975 285000 153400 101763 81410 166000 73100 170000 85000 ...
## $ salary_currency : chr "USD" "USD" "USD" "USD" ...
## $ salary_in_usd : int 86218 68975 285000 153400 101763 81410 166000 73100 170000 85000 ...
## $ employee_residence: chr "CA" "CA" "US" "US" ...
## $ remote_ratio : int 0 0 0 0 0 0 0 0 100 100 ...
## $ company_location : chr "CA" "CA" "US" "US" ...
## $ company_size : chr "M" "M" "M" "M" ...
#Drop overlapped columns
salaries_data <- salaries_data[, -c(5, 6)] # Dropping columns 'salary'and 'salary_currency'
# Print the dimensions of the dataframe
cat("Dimensions of the dataframe:", dim(salaries_data), "\n")## Dimensions of the dataframe: 14491 9
## work_year experience_level employment_type job_title salary_in_usd
## 1 2024 MI FT Data Scientist 86218
## 2 2024 MI FT Data Scientist 68975
## 3 2024 SE FT AI Architect 285000
## employee_residence remote_ratio company_location company_size
## 1 CA 0 CA M
## 2 CA 0 CA M
## 3 US 0 US M
In the data frame, there are 3 numeric columns: (1)work_year, (2)salary_in_usd, (3)remote_ratio 6 categorical columns: (1)experience_level, (2)employment_type, (3)job_title, (4)employee_residense, (5)company_location, (6)company_size
## [1] FALSE
## [1] FALSE
## work_year experience_level employment_type job_title
## Min. :2020 Length:14491 Length:14491 Length:14491
## 1st Qu.:2023 Class :character Class :character Class :character
## Median :2023 Mode :character Mode :character Mode :character
## Mean :2023
## 3rd Qu.:2024
## Max. :2024
## salary_in_usd employee_residence remote_ratio company_location
## Min. : 15000 Length:14491 Min. : 0.00 Length:14491
## 1st Qu.:102000 Class :character 1st Qu.: 0.00 Class :character
## Median :141525 Mode :character Median : 0.00 Mode :character
## Mean :149925 Mean : 32.99
## 3rd Qu.:185900 3rd Qu.:100.00
## Max. :800000 Max. :100.00
## company_size
## Length:14491
## Class :character
## Mode :character
##
##
##
to display the result of hierarchical clustering in ‘experience_level’ column There’s 4 categories in ‘Experience Level’, each are: EN = Entry-level / Junior MI = Mid-level / Intermediate SE = Senior-level / Expert EX = Executive-level / Director
library(dplyr)
library(ggplot2)
# Replace values in 'experience_level' column
salaries_data$experience_level <- ifelse(salaries_data$experience_level == 'EN', 'Entry-level/Junior',
ifelse(salaries_data$experience_level == 'MI', 'Mid-level/Intermediate',
ifelse(salaries_data$experience_level == 'SE', 'Senior-level/Expert',
ifelse(salaries_data$experience_level == 'EX', 'Executive-level/Director', salaries_data$experience_level))))
# Count the occurrences of each experience level
ex_level <- table(salaries_data$experience_level)
print(ex_level)##
## Entry-level/Junior Executive-level/Director Mid-level/Intermediate
## 1118 429 3442
## Senior-level/Expert
## 9502
# Draw a bar chart
barplot(ex_level,
main = "Four Categories of Experience Levels",
xlab = "Experience Levels",
ylab = "Count",
col = "skyblue",
ylim = c(0, max(ex_level) * 1.1), # Set y-axis limit slightly above the maximum count
border = NA, # Remove borders
cex.names = 0.75 # Set the font size of x-axis labels
)
# Add values of each count on the bar graph
text(x = 1:length(ex_level), y = ex_level, labels = ex_level, pos = 3, cex = 0.8, col = "black")and find the frequencies of top 10 jobs in the data set.
There are 150 different types of Job titles.
library(wordcloud)
# Find unique job titles
unique_job_titles <- unique(salaries_data$job_title)
# Count the number of unique job titles
num_unique_job_titles <- length(unique_job_titles)
print(num_unique_job_titles)## [1] 150
# Count the frequency of each unique job title
job_title_freq <- table(salaries_data$job_title)
# Generate word cloud
wordcloud(words = names(job_title_freq), freq = job_title_freq,
scale = c(3, 0.5), # Adjust font scale
max.words = 100,
random.order = FALSE, # Keep order from dataframe
colors = brewer.pal(9, "Set1"), # Specify color palette
random.color = FALSE, # Use consistent colors
rot.per = 0.35, # Rotate 35% of words
main = "WordCloud of Job Titles") # Main title# Calculate the top 10 job titles
top10_job_title <- sort(table(salaries_data$job_title), decreasing = TRUE)[1:10]
# Create a dataframe with top 10 job titles and their counts
top10_df <- data.frame(job_title = names(top10_job_title), count = as.numeric(top10_job_title))
print(top10_df)## job_title count
## 1 Data Engineer 3107
## 2 Data Scientist 2964
## 3 Data Analyst 2147
## 4 Machine Learning Engineer 1508
## 5 Research Scientist 467
## 6 Analytics Engineer 397
## 7 Applied Scientist 373
## 8 Data Architect 355
## 9 Research Engineer 272
## 10 Business Intelligence Engineer 220
# Create a histogram of the top 10 job titles
ggplot(top10_df, aes(x = reorder(job_title, -count), y = count)) +
geom_bar(stat = "identity", fill = "skyblue", color = "black") +
geom_text(aes(label = count), vjust = -0.5, size = 3.5, color = "black") + # Add text labels for counts
labs(title = "Top 10 Frequent Job Titles", x = "Job Titles", y = "Frequencies") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(size = 10, hjust = 0.5),
axis.text = element_text(size = 10),
axis.title = element_text(size = 10),
panel.grid = element_blank()) # Remove background grid linesThe histogram shows that data engineer, data scientist and data analyst ranked top 3 frequent job titles, but it can be easily seen that others are also related to those top 3 job titles.
The four types of employment for the roles are as follow PT = Part-time FT = Full-time CT = Contract FL = Freelance
# Load necessary libraries
library(ggplot2)
# Create a dataframe with employment type counts
type_grouped <- table(salaries_data$employment_type)
e_type <- c('Full-Time', 'Part-Time', 'Contract', 'Freelance')
df_type <- data.frame(employment_type = names(type_grouped), count = as.numeric(type_grouped))
# Create the bar plot with adjusted font size and plot dimensions
ggplot(df_type, aes(x = reorder(employment_type, -count), y = count, fill = employment_type)) +
geom_bar(stat = "identity", width = 0.5, color = "black") +
geom_text(aes(label = count), vjust = -0.5, size = 3, color = "black") + # Adjust font size
labs(title = "Employment Type Distribution", x = "Employment Type", y = "Count") +
scale_fill_brewer(palette = "PuBuGn") + # Set color palette
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10), # Adjust font size
plot.title = element_text(size = 17, hjust = 0.5, family = "Franklin Gothic"),
axis.text = element_text(size = 10), # Adjust font size
axis.title = element_text(size = 14), # Adjust font size
legend.position = "none",
plot.margin = margin(1, 1, 1, 3, "cm")) # Adjust plot dimensionsFull-time FT is the most highest type of employment.
# Create Box Plot
fig1 <- plot_ly(y = salaries_data$salary_in_usd, type = "box", name = "salary_in_usd") %>%
layout(title = "Salary Distribution in USD", font = list(size = 15, family = "Franklin Gothic"),
template = "plotly_dark")
fig1We can clearly see that salary mostly distributed between 102k and 186k.
# Filter data for each work year
w2020 <- subset(salaries_data, work_year == 2020)
w2021 <- subset(salaries_data, work_year == 2021)
w2022 <- subset(salaries_data, work_year == 2022)
w2023 <- subset(salaries_data, work_year == 2023)
# Calculate mean salary by work year
year_salary <- data.frame(
"2020" = mean(w2020$salary_in_usd),
"2021" = mean(w2021$salary_in_usd),
"2022" = mean(w2022$salary_in_usd),
"2023" = mean(w2023$salary_in_usd)
)
# Prepare data for the line graph
work_years <- c("2020", "2021", "2022", "2023")
mean_salaries <- unlist(year_salary)
# Create line graph
fig2 <- plot_ly(x = work_years, y = mean_salaries, type = "scatter", mode = "lines+markers",
marker = list(color = "red", size = 10),
line = list(color = "skyblue", width = 2)) %>%
layout(title = "Mean Salary by Work Year",
xaxis = list(title = "Work Year"),
yaxis = list(title = "Mean Salary (USD)"),
font = list(size = 15, family = "Franklin Gothic"),
template = "plotly_dark",
showlegend = FALSE) # Hide legend
fig2We got average paid 153.73k in 2023 that is higher than in working year 2022, 2021 and 2020.
# Subset data by experience level
entry_salary <- subset(salaries_data, experience_level == "Entry-level/Junior")
mid_salary <- subset(salaries_data, experience_level == "Mid-level/Intermediate")
senior_salary <- subset(salaries_data, experience_level == "Senior-level/Expert")
executive_salary <- subset(salaries_data, experience_level == "Executive-level/Director")
# Calculate mean salary by experience level
group_labels <- c("Entry-level/Junior", "Mid-level/Intermediate", "Senior-level/Expert", "Executive-level/Director")
colors <- c("green", "yellow", "blue", "red")
lst <- c(mean(entry_salary$salary_in_usd), mean(mid_salary$salary_in_usd), mean(senior_salary$salary_in_usd), mean(executive_salary$salary_in_usd))
# Create bar plot for mean salary by experience level
fig3 <- plot_ly(x = group_labels, y = lst, type = "bar",
marker = list(color = colors), text = round(lst / 1000, 2),
name = "Mean Salary in USD",
width = 0.2) %>%
layout(title = "Annual Salary Vs Experience Level",
xaxis = list(title = "Experience Level"),
yaxis = list(title = "Average Salary (USD)"),
font = list(size = 15, family = "Franklin Gothic"),
template = "plotly_dark",
showlegend = FALSE)
fig3We got annual salary of 91.96k for entry-level, 125.23k for mid-level, 163.7k for senior-level and 194.09k for executive-level.
# Group data by salary and job title
salary_job <- salaries_data %>%
group_by(salary_in_usd, job_title) %>%
tally() %>%
arrange(desc(salary_in_usd)) %>%
tail(20)
# Remove the last column
salary_job <- salary_job[, -ncol(salary_job)]
mean_salary <- salary_job %>%
group_by(job_title) %>%
summarize(mean_salary_in_usd = mean(salary_in_usd))
print(mean_salary)## # A tibble: 14 × 2
## job_title mean_salary_in_usd
## <chr> <dbl>
## 1 Big Data Engineer 16228
## 2 Business Data Analyst 17000
## 3 Business Intelligence Developer 15000
## 4 Computer Vision Engineer 15897
## 5 Data Analyst 15848.
## 6 Data Analytics Lead 17511
## 7 Data Engineer 17312.
## 8 Data Science 16666
## 9 Data Scientist 16643
## 10 ML Engineer 15966
## 11 Machine Learning Developer 15000
## 12 Product Data Analyst 16417
## 13 Research Engineer 16455
## 14 Staff Data Analyst 15000
# Plot distribution graph with values
ggplot(mean_salary, aes(x = job_title, y = mean_salary_in_usd)) +
geom_bar(stat = "identity", fill = "skyblue", color = "black") +
geom_text(aes(label = round(mean_salary_in_usd, 2)), vjust = -0.5, size = 3.0, color = "red") + # Add text labels for values
labs(title = "Annual Salary Distribution by Job Titles", x = "Job Titles", y = "Mean Salary (USD)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(size = 15, hjust = 0.5),
panel.grid = element_blank(),
axis.title = element_text(size = 12))Salaries Data Visualization of mean annual salary has been mainly analyzed depending on the experience levels, job titles and work years. As we have a lot of working experience level like an executive-level, we get the highest paid of 194.09k USD. Although the top most frequent job titles mostly vary, there is no big annual salary distribution gap for each Data Practitioners.We have got average higher paid 153.74k USD until 2023 work year but we could not say how much shall we get paid in the coming years at the time being due to ths limited data set.