2024 Salary Insight by Job Roles.

This assignment explores a dataset of salary records for professionals working in a range of data-related jobs. Each record includes key attributes such as job title, salary, salary currency, experience level, employment type, and location. Each record also includes the company size and the remote-work ratio, which allows for deeper exploratory data analysis.

Libraries used in this analysis

library(tidyverse)   # data cleaning, wrangling, plotting
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(psych)       # descriptive stats, correlations
## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(car)         # regression diagnostics & ANOVA tools
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:psych':
## 
##     logit
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(MASS)        # linear models, distributions
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(stats)
library(psych)       # describe(), pairs panels, etc.
library(Hmisc)       # describe(), rcorr()   
## 
## Attaching package: 'Hmisc'
## 
## The following object is masked from 'package:psych':
## 
##     describe
## 
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(ggpubr)
library(knitr)
library(ggplot2)

Here is a summary of the dataset.

The dataset includes job records from 2020 to 2024, with most entries from 2023 and 2024 (the first quartile and the median of work_year are both 2023). The data is therefore recent, and any salary trends reflect current industry conditions. In the raw salary column, which is recorded in the currency given by salary_currency, the maximum of 30,400,000 is extremely high and suggests possible outliers, high-paying executive roles, or amounts recorded in a non-USD currency. The large gap between the minimum (14,000) and the maximum indicates that the dataset includes both entry-level and senior or executive positions across different countries. When converted to USD (salary_in_usd), salaries range from $15,000 to $800,000, which is more realistic and more comparable across records than the original salary column. Two points follow from this. First, converting to a common currency removes extreme values caused by different currencies or exchange rates. Second, the upper end ($800,000) likely represents high-level tech roles or senior leadership positions. The distribution is also likely right-skewed: the mean ($149,687) is higher than the median ($141,300), which is consistent with a few very high earners raising the average.

dataset <- read.csv("Dataset salary 2024.csv")
summary(dataset)
##    work_year    experience_level   employment_type     job_title        
##  Min.   :2020   Length:16534       Length:16534       Length:16534      
##  1st Qu.:2023   Class :character   Class :character   Class :character  
##  Median :2023   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2023                                                           
##  3rd Qu.:2024                                                           
##  Max.   :2024                                                           
##      salary         salary_currency    salary_in_usd    employee_residence
##  Min.   :   14000   Length:16534       Min.   : 15000   Length:16534      
##  1st Qu.:  101763   Class :character   1st Qu.:101125   Class :character  
##  Median :  142200   Mode  :character   Median :141300   Mode  :character  
##  Mean   :  163727                      Mean   :149687                     
##  3rd Qu.:  187200                      3rd Qu.:185900                     
##  Max.   :30400000                      Max.   :800000                     
##   remote_ratio company_location   company_size      
##  Min.   :  0   Length:16534       Length:16534      
##  1st Qu.:  0   Class :character   Class :character  
##  Median :  0   Mode  :character   Mode  :character  
##  Mean   : 32                                        
##  3rd Qu.:100                                        
##  Max.   :100
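
The skew and outlier reasoning above can be checked numerically. The following is a minimal sketch on a small hypothetical vector (`salary_usd` is made up for illustration and is not the real dataset); it compares the mean with the median and applies the common 1.5 × IQR rule to flag unusually high values.

```r
# Hypothetical salaries in USD, for illustration only (not the real dataset).
salary_usd <- c(15000, 60000, 101125, 141300, 185900, 250000, 800000)

# A mean above the median is a simple sign of right skew.
mean_salary   <- mean(salary_usd)
median_salary <- median(salary_usd)
right_skewed  <- mean_salary > median_salary

# Flag high outliers with the 1.5 * IQR rule, the same convention
# that boxplot whiskers follow.
q3          <- quantile(salary_usd, 0.75, names = FALSE)
upper_fence <- q3 + 1.5 * IQR(salary_usd)
outliers    <- salary_usd[salary_usd > upper_fence]

print(right_skewed)
print(outliers)
```

Applying the same steps to `dataset$salary_in_usd` would show whether the $800,000 maximum falls above the upper fence in the real data.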

The first 20 rows of the dataset are shown below.

library(knitr)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
head(dataset, 20) %>%
  kable("html") %>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "250px")
| work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|
| 2024 | SE | FT | AI Engineer | 202730 | USD | 202730 | US | 0 | US | M |
| 2024 | SE | FT | AI Engineer | 92118 | USD | 92118 | US | 0 | US | M |
| 2024 | SE | FT | Data Engineer | 130500 | USD | 130500 | US | 0 | US | M |
| 2024 | SE | FT | Data Engineer | 96000 | USD | 96000 | US | 0 | US | M |
| 2024 | SE | FT | Machine Learning Engineer | 190000 | USD | 190000 | US | 0 | US | M |
| 2024 | SE | FT | Machine Learning Engineer | 160000 | USD | 160000 | US | 0 | US | M |
| 2024 | MI | FT | ML Engineer | 400000 | USD | 400000 | US | 0 | US | M |
| 2024 | MI | FT | ML Engineer | 65000 | USD | 65000 | US | 0 | US | M |
| 2024 | EN | FT | Data Analyst | 101520 | USD | 101520 | US | 0 | US | M |
| 2024 | EN | FT | Data Analyst | 45864 | USD | 45864 | US | 0 | US | M |
| 2024 | SE | FT | Data Engineer | 172469 | USD | 172469 | US | 0 | US | M |
| 2024 | SE | FT | Data Engineer | 114945 | USD | 114945 | US | 0 | US | M |
| 2024 | EX | FT | NLP Engineer | 200000 | USD | 200000 | US | 0 | US | M |
| 2024 | EX | FT | NLP Engineer | 150000 | USD | 150000 | US | 0 | US | M |
| 2024 | MI | FT | Data Scientist | 156450 | USD | 156450 | US | 100 | US | M |
| 2024 | MI | FT | Data Scientist | 119200 | USD | 119200 | US | 100 | US | M |
| 2024 | SE | FT | Data Analyst | 170000 | USD | 170000 | US | 0 | US | M |
| 2024 | SE | FT | Data Analyst | 130000 | USD | 130000 | US | 0 | US | M |
| 2024 | SE | FT | Applied Scientist | 222200 | USD | 222200 | US | 0 | US | L |
| 2024 | SE | FT | Applied Scientist | 136000 | USD | 136000 | US | 0 | US | L |

Remove missing values

dataset_clean <- dataset %>% drop_na()
# Convert character categories to factors
dataset_clean <- dataset_clean %>%
  mutate(
    experience_level = as.factor(experience_level),
    employment_type   = as.factor(employment_type),
    job_title         = as.factor(job_title),   
    company_size      = as.factor(company_size)
  )
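
Before dropping rows, it can help to see how many values are actually missing, and to give the experience levels an explicit order so that plots do not fall back to alphabetical order (EN, EX, MI, SE). The sketch below uses a small hypothetical data frame (`toy`) rather than the real file, and the order EN < MI < SE < EX is an assumption about what the codes mean (entry, mid, senior, executive).

```r
# Hypothetical rows for illustration; the real data comes from read.csv().
toy <- data.frame(
  experience_level = c("SE", "EN", "EX", "MI", "SE", NA),
  salary_in_usd    = c(130500, 45864, 200000, 65000, NA, 96000),
  stringsAsFactors = FALSE
)

# Count missing values per column before dropping anything.
na_counts <- colSums(is.na(toy))

# Keep only complete rows (a base R equivalent of tidyr::drop_na()).
toy_clean <- toy[complete.cases(toy), ]

# Assumed career order: entry, mid, senior, executive.
toy_clean$experience_level <- factor(
  toy_clean$experience_level,
  levels = c("EN", "MI", "SE", "EX")
)

print(na_counts)
print(levels(toy_clean$experience_level))
```

With the levels ordered this way, a boxplot of salary by experience level would read from entry-level to executive, left to right.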

Visualization 1 - Salary by Experience Level (Boxplot)

library(ggplot2)
ggplot(dataset_clean, aes(x = experience_level, y = salary_in_usd)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Salary Distribution by Experience Level",
       x = "Experience Level",
       y = "Salary (USD)")

## (Insights)

  1. Senior employees earn noticeably higher salaries. The boxplot shows a clear upward shift in salary for Senior (SE) workers compared with Entry-level (EN) and Mid-level (MI) workers. This indicates that experience has a strong positive association with salary in this dataset.
  2. Entry-level salaries are tightly grouped. Entry-level (EN) workers have a smaller spread in their salaries, meaning most of them earn within a similar range. This suggests that companies tend to offer standardized pay for beginner roles.
  3. Senior-level salaries have the widest range. The SE box is much taller, showing that salaries vary significantly: some senior roles pay modestly while others pay extremely well. This suggests that specialized skills or job titles drive major salary differences at the senior level.
  4. A few very high salaries appear as outliers. Points far above the box are likely for senior or lead roles. These outliers likely represent top-tier, high-paying jobs such as AI Engineer, Machine Learning Engineer, and Data Scientist. This suggests that pay in advanced tech roles can be very unequal but highly rewarding.
  5. Salary growth is not equal across levels. The gap between Entry and Mid is smaller than the gap between Mid and Senior. This suggests that the biggest salary jump happens as workers develop advanced experience and specialization.

In summary

The boxplot shows that salary increases strongly with experience level. Entry-level workers earn within a tight range, indicating standardized pay, while senior and executive roles display much wider salary distributions and significantly higher median values. High-paying tech roles create visible outliers, suggesting that specialization dramatically increases earning potential.
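
The visual reading of the boxplot can be backed up with numbers. As a rough sketch, the code below computes the median and interquartile range of salary per experience level, using only the 20 rows displayed earlier in this report. That sample is far too small to confirm or contradict the insights; the same steps would need to be run on `dataset_clean`. The level order EN, MI, SE, EX is an assumption.

```r
# Salaries from the 20 rows displayed earlier in this report;
# a small sample, not the full dataset.
sample_rows <- data.frame(
  experience_level = c(rep("EN", 2), rep("MI", 4), rep("SE", 12), rep("EX", 2)),
  salary_in_usd = c(
    101520, 45864,
    400000, 65000, 156450, 119200,
    202730, 92118, 130500, 96000, 190000, 160000,
    172469, 114945, 170000, 130000, 222200, 136000,
    200000, 150000
  )
)

# Assumed career order: entry, mid, senior, executive.
sample_rows$experience_level <- factor(
  sample_rows$experience_level,
  levels = c("EN", "MI", "SE", "EX")
)

# Median and interquartile range of salary for each experience level.
by_level <- split(sample_rows$salary_in_usd, sample_rows$experience_level)
medians  <- sapply(by_level, median)
spreads  <- sapply(by_level, IQR)

print(cbind(median = medians, iqr = spreads))
```

The median corresponds to the line inside each box and the interquartile range to the box height, so a table like this states the "higher" and "wider" comparisons directly.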

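# cor() needs numeric input, so correlate the numeric columns (job titles are categorical)
Correlation_Between_Roles <- cor(dataset_clean[, c("work_year", "salary_in_usd", "remote_ratio")])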

Visualization 2 - Number of Jobs by Job Title (Bar Chart)

library(ggplot2)
dataset_clean %>%
  count(job_title) %>%
  ggplot(aes(x = reorder(job_title, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Count of Job Titles",
       x = "Job Title",
       y = "Count")

Because the dataset contains a large number of distinct job titles, the bars and labels in this chart are too crowded to read in detail. A second bar chart was therefore produced that shows only the 20 most common job titles in the data industry.

library(dplyr)
library(ggplot2)
dataset_clean %>%
  count(job_title) %>%
  arrange(desc(n)) %>%        # sort by count
  slice_head(n = 20) %>%      # keep top 20
  ggplot(aes(x = reorder(job_title, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 20 Most Common Job Titles",
    x = "Job Title",
    y = "Count"
  ) +
  theme_minimal()

The second chart clearly shows the most common job titles in the data industry. Data Engineer ranks first, which points to high demand for data processing roles. Data Scientist and Data Analyst come next.
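
An alternative to discarding everything outside the top 20 is to pool the rare titles into a single "Other" category, so the chart still accounts for every record. The following is a minimal sketch using `fct_lump_n()` and `fct_infreq()` from forcats (part of the tidyverse loaded above); the `titles` vector is hypothetical and uses the top 3 instead of the top 20 to keep the example short.

```r
library(forcats)

# Hypothetical job titles for illustration only.
titles <- factor(c(
  rep("Data Engineer", 5), rep("Data Scientist", 4), rep("Data Analyst", 3),
  "NLP Engineer", "AI Engineer"
))

# Keep the 3 most frequent titles and pool the rest into "Other".
lumped <- fct_lump_n(titles, n = 3, other_level = "Other")

# Order the levels by frequency, most common first.
lumped <- fct_infreq(lumped)

print(table(lumped))
```

On the real data, the same idea would be `mutate(job_title = fct_lump_n(job_title, n = 20))` before `count(job_title)`, which adds one "Other" bar showing how many records fall outside the top 20.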

Visualization 3 - Remote Ratio Distribution (Pie Chart)

library(ggplot2)
dataset_clean %>%
  count(remote_ratio) %>%
  ggplot(aes(x = "", y = n, fill = factor(remote_ratio))) +
  geom_col() +
  coord_polar(theta = "y") +
  labs(title = "Remote Work Distribution",
       fill = "Remote Ratio (%)")

This pie chart shows the split between remote and on-site work, where 0% is on-site, 50% is hybrid, and 100% is remote. The data, collected from 2020 to 2024, shows that the majority of jobs in the data industry are on-site, which may indicate a strong demand for in-person collaboration and for effective face-to-face communication with the business.
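
Pie slices are hard to compare by eye, so it can help to report the exact share of each work mode alongside the chart. The sketch below uses a hypothetical `remote_ratio` vector (not the real column) and the labels On-site, Hybrid, and Remote for 0, 50, and 100, as described above.

```r
# Hypothetical remote_ratio values for illustration only.
remote_ratio <- c(0, 0, 0, 0, 0, 0, 100, 100, 100, 50)

# Map the numeric codes to readable labels.
work_mode <- factor(
  remote_ratio,
  levels = c(0, 50, 100),
  labels = c("On-site", "Hybrid", "Remote")
)

# Percentage share of each work mode, rounded to one decimal place.
shares <- round(100 * prop.table(table(work_mode)), 1)

print(shares)
```

Running the same steps on `dataset_clean$remote_ratio` would give the actual percentages behind the "majority on-site" statement.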