DATA 608: Story 4

Story - 4 : How much do we get paid?

The Data Science Job Salaries Dataset was downloaded from the site https://ai-jobs.net, kept it in my Github account and loaded into R. It contains 14491 rows and 11 variables The download basically contains a single table with all salary information structured in each columns below.

work_year: The year the salary was paid.

experience_level: The experience level in the job during the year with the following possible values: EN = Entry-level / Junior MI = Mid-level / Intermediate SE = Senior-level / Expert EX = Executive-level / Director

employment_type: The type of employment for the role: PT = Part-time FT = Full-time CT = Contract FL = Freelance

job_title: the role worked in during the year.

salary: the total gross salary amount paid.

salary_currency: the currency of the salary paid as an ISO 4217 currency code.

salary_in_usd: the salary in USD (FX rate divided by avg. USD rate of respective year) via statistical data from the BIS and central banks.

employee_residence: employee’s primary country of residence in during the work year as an ISO 3166 country code.

remote_ratio: the overall amount of work done remotely, possible values are as follows: 0 = No remote work (less than 20%) 50 = Partially remote/hybird 100= Fully remote (more than 80%)

company_location: the country of the employer’s main office or contracting branch as an ISO 3166 country code.

company_size: the average number of people that worked for the company during the year: S = less than 50 employees (small) M = 50 to 250 employees (medium) L = more than 250 employees (large)

Import Data Set

#Install required packages 
library(tidyverse)
library(countrycode)
library(plotly)

# Read the salaries dataframe 
salaries_data <- read.csv("https://raw.githubusercontent.com/LwinShwe/DATA-608-Story-4/main/salaries%20dataset.csv")
str(salaries_data)

## 'data.frame':    14491 obs. of  11 variables:
##  $ work_year         : int  2024 2024 2024 2024 2024 2024 2024 2024 2024 2024 ...
##  $ experience_level  : chr  "MI" "MI" "SE" "SE" ...
##  $ employment_type   : chr  "FT" "FT" "FT" "FT" ...
##  $ job_title         : chr  "Data Scientist" "Data Scientist" "AI Architect" "AI Architect" ...
##  $ salary            : int  86218 68975 285000 153400 101763 81410 166000 73100 170000 85000 ...
##  $ salary_currency   : chr  "USD" "USD" "USD" "USD" ...
##  $ salary_in_usd     : int  86218 68975 285000 153400 101763 81410 166000 73100 170000 85000 ...
##  $ employee_residence: chr  "CA" "CA" "US" "US" ...
##  $ remote_ratio      : int  0 0 0 0 0 0 0 0 100 100 ...
##  $ company_location  : chr  "CA" "CA" "US" "US" ...
##  $ company_size      : chr  "M" "M" "M" "M" ...

#Drop overlapped columns
salaries_data <- salaries_data[, -c(5, 6)] # Dropping columns 'salary'and 'salary_currency'

# Print the dimensions of the dataframe
cat("Dimensions of the dataframe:", dim(salaries_data), "\n")

## Dimensions of the dataframe: 14491 9

# Display the first 3 rows of the dataframe
head(salaries_data, 3)

##   work_year experience_level employment_type      job_title salary_in_usd
## 1      2024               MI              FT Data Scientist         86218
## 2      2024               MI              FT Data Scientist         68975
## 3      2024               SE              FT   AI Architect        285000
##   employee_residence remote_ratio company_location company_size
## 1                 CA            0               CA            M
## 2                 CA            0               CA            M
## 3                 US            0               US            M

Find Missing Values

In the data frame, there are 3 numeric columns: (1)work_year, (2)salary_in_usd, (3)remote_ratio 6 categorical columns: (1)experience_level, (2)employment_type, (3)job_title, (4)employee_residense, (5)company_location, (6)company_size

# Check for missing values in the entire dataframe
any(is.na(salaries_data))

## [1] FALSE

# Check for missing values in a specific column
any(is.na(salaries_data$salary_in_usd))

## [1] FALSE

summary(salaries_data)

##    work_year    experience_level   employment_type     job_title        
##  Min.   :2020   Length:14491       Length:14491       Length:14491      
##  1st Qu.:2023   Class :character   Class :character   Class :character  
##  Median :2023   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2023                                                           
##  3rd Qu.:2024                                                           
##  Max.   :2024                                                           
##  salary_in_usd    employee_residence  remote_ratio    company_location  
##  Min.   : 15000   Length:14491       Min.   :  0.00   Length:14491      
##  1st Qu.:102000   Class :character   1st Qu.:  0.00   Class :character  
##  Median :141525   Mode  :character   Median :  0.00   Mode  :character  
##  Mean   :149925                      Mean   : 32.99                     
##  3rd Qu.:185900                      3rd Qu.:100.00                     
##  Max.   :800000                      Max.   :100.00                     
##  company_size      
##  Length:14491      
##  Class :character  
##  Mode  :character  
##                    
##                    
##

Categorize Experience Level

to display the result of hierarchical clustering in ‘experience_level’ column There’s 4 categories in ‘Experience Level’, each are: EN = Entry-level / Junior MI = Mid-level / Intermediate SE = Senior-level / Expert EX = Executive-level / Director

library(dplyr)
library(ggplot2)
# Replace values in 'experience_level' column
salaries_data$experience_level <- ifelse(salaries_data$experience_level == 'EN', 'Entry-level/Junior',
                               ifelse(salaries_data$experience_level == 'MI', 'Mid-level/Intermediate',
                                      ifelse(salaries_data$experience_level == 'SE', 'Senior-level/Expert',
                                             ifelse(salaries_data$experience_level == 'EX', 'Executive-level/Director', salaries_data$experience_level))))

# Count the occurrences of each experience level
ex_level <- table(salaries_data$experience_level)
print(ex_level)

## 
##       Entry-level/Junior Executive-level/Director   Mid-level/Intermediate 
##                     1118                      429                     3442 
##      Senior-level/Expert 
##                     9502

# Draw a bar chart
barplot(ex_level, 
        main = "Four Categories of Experience Levels",
        xlab = "Experience Levels",
        ylab = "Count",
        col = "skyblue",
        ylim = c(0, max(ex_level) * 1.1),  # Set y-axis limit slightly above the maximum count
        border = NA,  # Remove borders
        cex.names = 0.75  # Set the font size of x-axis labels
       
)
# Add values of each count on the bar graph
text(x = 1:length(ex_level), y = ex_level, labels = ex_level, pos = 3, cex = 0.8, col = "black")

Display Job Titles

and find the frequencies of top 10 jobs in the data set.

There are 150 different types of Job titles.

library(wordcloud)
# Find unique job titles
unique_job_titles <- unique(salaries_data$job_title)
# Count the number of unique job titles
num_unique_job_titles <- length(unique_job_titles)
print(num_unique_job_titles)

## [1] 150

# Count the frequency of each unique job title
job_title_freq <- table(salaries_data$job_title)

# Generate word cloud
wordcloud(words = names(job_title_freq), freq = job_title_freq,
          scale = c(3, 0.5),  # Adjust font scale
          max.words = 100, 
          random.order = FALSE,  # Keep order from dataframe
          colors = brewer.pal(9, "Set1"),  # Specify color palette
          random.color = FALSE,  # Use consistent colors
          rot.per = 0.35,  # Rotate 35% of words
          main = "WordCloud of Job Titles")  # Main title

# Calculate the top 10 job titles
top10_job_title <- sort(table(salaries_data$job_title), decreasing = TRUE)[1:10]

# Create a dataframe with top 10 job titles and their counts
top10_df <- data.frame(job_title = names(top10_job_title), count = as.numeric(top10_job_title))
print(top10_df)

##                         job_title count
## 1                   Data Engineer  3107
## 2                  Data Scientist  2964
## 3                    Data Analyst  2147
## 4       Machine Learning Engineer  1508
## 5              Research Scientist   467
## 6              Analytics Engineer   397
## 7               Applied Scientist   373
## 8                  Data Architect   355
## 9               Research Engineer   272
## 10 Business Intelligence Engineer   220

# Create a histogram of the top 10 job titles
ggplot(top10_df, aes(x = reorder(job_title, -count), y = count)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  geom_text(aes(label = count), vjust = -0.5, size = 3.5, color = "black") +  # Add text labels for counts
  labs(title = "Top 10 Frequent Job Titles", x = "Job Titles", y = "Frequencies") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(size = 10, hjust = 0.5),
        axis.text = element_text(size = 10),
        axis.title = element_text(size = 10),
        panel.grid = element_blank())  # Remove background grid lines

The histogram shows that data engineer, data scientist and data analyst ranked top 3 frequent job titles, but it can be easily seen that others are also related to those top 3 job titles.

Plot Employment Types

The four types of employment for the roles are as follow PT = Part-time FT = Full-time CT = Contract FL = Freelance

# Load necessary libraries
library(ggplot2)

# Create a dataframe with employment type counts
type_grouped <- table(salaries_data$employment_type)
e_type <- c('Full-Time', 'Part-Time', 'Contract', 'Freelance')
df_type <- data.frame(employment_type = names(type_grouped), count = as.numeric(type_grouped))

# Create the bar plot with adjusted font size and plot dimensions
ggplot(df_type, aes(x = reorder(employment_type, -count), y = count, fill = employment_type)) +
  geom_bar(stat = "identity", width = 0.5, color = "black") +
  geom_text(aes(label = count), vjust = -0.5, size = 3, color = "black") +  # Adjust font size
  labs(title = "Employment Type Distribution", x = "Employment Type", y = "Count") +
  scale_fill_brewer(palette = "PuBuGn") +  # Set color palette
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),  # Adjust font size
        plot.title = element_text(size = 17, hjust = 0.5, family = "Franklin Gothic"),
        axis.text = element_text(size = 10),  # Adjust font size
        axis.title = element_text(size = 14),  # Adjust font size
        legend.position = "none", 
        plot.margin = margin(1, 1, 1, 3, "cm"))  # Adjust plot dimensions

Full-time FT is the most highest type of employment.

Box Plot for Salary Distribution

# Create Box Plot
fig1 <- plot_ly(y = salaries_data$salary_in_usd, type = "box", name = "salary_in_usd") %>%
  layout(title = "Salary Distribution in USD", font = list(size = 15, family = "Franklin Gothic"),
         template = "plotly_dark")

fig1

We can clearly see that salary mostly distributed between 102k and 186k.

Salary Vs Work Year

# Filter data for each work year
w2020 <- subset(salaries_data, work_year == 2020)
w2021 <- subset(salaries_data, work_year == 2021)
w2022 <- subset(salaries_data, work_year == 2022)
w2023 <- subset(salaries_data, work_year == 2023)

# Calculate mean salary by work year
year_salary <- data.frame(
  "2020" = mean(w2020$salary_in_usd),
  "2021" = mean(w2021$salary_in_usd),
  "2022" = mean(w2022$salary_in_usd),
  "2023" = mean(w2023$salary_in_usd)
)


# Prepare data for the line graph
work_years <- c("2020", "2021", "2022", "2023")
mean_salaries <- unlist(year_salary)

# Create line graph
fig2 <- plot_ly(x = work_years, y = mean_salaries, type = "scatter", mode = "lines+markers", 
                marker = list(color = "red", size = 10),
                line = list(color = "skyblue", width = 2)) %>%
  layout(title = "Mean Salary by Work Year", 
         xaxis = list(title = "Work Year"),
         yaxis = list(title = "Mean Salary (USD)"),
         font = list(size = 15, family = "Franklin Gothic"),
         template = "plotly_dark",
         showlegend = FALSE)  # Hide legend

fig2

We got average paid 153.73k in 2023 that is higher than in working year 2022, 2021 and 2020.

Salary Vs Experience Level

# Subset data by experience level
entry_salary <- subset(salaries_data, experience_level == "Entry-level/Junior")
mid_salary <- subset(salaries_data, experience_level == "Mid-level/Intermediate")
senior_salary <- subset(salaries_data, experience_level == "Senior-level/Expert")
executive_salary <- subset(salaries_data, experience_level == "Executive-level/Director")


# Calculate mean salary by experience level
group_labels <- c("Entry-level/Junior", "Mid-level/Intermediate", "Senior-level/Expert", "Executive-level/Director")
colors <- c("green", "yellow", "blue", "red")
lst <- c(mean(entry_salary$salary_in_usd), mean(mid_salary$salary_in_usd), mean(senior_salary$salary_in_usd), mean(executive_salary$salary_in_usd))

# Create bar plot for mean salary by experience level
fig3 <- plot_ly(x = group_labels, y = lst, type = "bar", 
                marker = list(color = colors), text = round(lst / 1000, 2),
                name = "Mean Salary in USD",
                width = 0.2) %>%
  layout(title = "Annual Salary Vs Experience Level", 
         xaxis = list(title = "Experience Level"),
         yaxis = list(title = "Average Salary (USD)"),
         font = list(size = 15, family = "Franklin Gothic"),
         template = "plotly_dark",
         showlegend = FALSE)

fig3

We got annual salary of 91.96k for entry-level, 125.23k for mid-level, 163.7k for senior-level and 194.09k for executive-level.

Salary Vs Job Titles

# Group data by salary and job title 
salary_job <- salaries_data %>%
  group_by(salary_in_usd, job_title) %>%
  tally() %>%
  arrange(desc(salary_in_usd)) %>%
  tail(20)

# Remove the last column
salary_job <- salary_job[, -ncol(salary_job)]
mean_salary <- salary_job %>%
  group_by(job_title) %>%
  summarize(mean_salary_in_usd = mean(salary_in_usd))
print(mean_salary)

## # A tibble: 14 × 2
##    job_title                       mean_salary_in_usd
##    <chr>                                        <dbl>
##  1 Big Data Engineer                           16228 
##  2 Business Data Analyst                       17000 
##  3 Business Intelligence Developer             15000 
##  4 Computer Vision Engineer                    15897 
##  5 Data Analyst                                15848.
##  6 Data Analytics Lead                         17511 
##  7 Data Engineer                               17312.
##  8 Data Science                                16666 
##  9 Data Scientist                              16643 
## 10 ML Engineer                                 15966 
## 11 Machine Learning Developer                  15000 
## 12 Product Data Analyst                        16417 
## 13 Research Engineer                           16455 
## 14 Staff Data Analyst                          15000

# Plot distribution graph with values
ggplot(mean_salary, aes(x = job_title, y = mean_salary_in_usd)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  geom_text(aes(label = round(mean_salary_in_usd, 2)), vjust = -0.5, size = 3.0, color = "red") +  # Add text labels for values
  labs(title = "Annual Salary Distribution by Job Titles", x = "Job Titles", y = "Mean Salary (USD)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(size = 15, hjust = 0.5),
        panel.grid = element_blank(),
        axis.title = element_text(size = 12))

Conclusions

Salaries Data Visualization of mean annual salary has been mainly analyzed depending on the experience levels, job titles and work years. As we have a lot of working experience level like an executive-level, we get the highest paid of 194.09k USD. Although the top most frequent job titles mostly vary, there is no big annual salary distribution gap for each Data Practitioners.We have got average higher paid 153.74k USD until 2023 work year but we could not say how much shall we get paid in the coming years at the time being due to ths limited data set.