2024 Salary Insights by Job Role

This assignment explores a dataset of salary records for professionals working in a range of data-related jobs. Each record has key attributes such as job title, salary, experience level, employment type, and location. Records also include the salary currency, the remote work ratio, and the company size, which allows for deeper exploratory data analysis.

Libraries used in this analysis

library(tidyverse)   # data cleaning, wrangling, plotting

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(psych)       # descriptive stats, correlations

## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

library(car)         # regression diagnostics & ANOVA tools

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:psych':
## 
##     logit
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(MASS)        # linear models, distributions

## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select

library(stats)
library(psych)       # describe(), pairs panels, etc.
library(Hmisc)       # describe(), rcorr()

## 
## Attaching package: 'Hmisc'
## 
## The following object is masked from 'package:psych':
## 
##     describe
## 
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(ggpubr)
library(knitr)
library(ggplot2)

Here is a summary of the dataset.

The dataset includes job records from 2020 to 2024, with most entries from 2023 and 2024 (the first quartile and the median of work_year are both 2023). The data is therefore recent, and any salary trends mostly reflect current industry conditions.

The raw salary column ranges from 14,000 to 30,400,000. This column is recorded in each job's own currency (salary_currency), so the extremely high maximum may reflect a low-valued currency as well as possible outliers or executive roles. The large gap between the minimum and maximum also indicates that the dataset includes both entry-level and senior or executive positions across different countries.

When converted to USD (salary_in_usd), salaries range from $15,000 to $800,000, which is much more realistic and standardized than the original salary column. Currency conversion removes the extreme values caused by different currencies and exchange rates. The upper end ($800,000) likely represents high-level tech roles or senior leadership positions. The mean ($149,687) is higher than the median ($141,300), which suggests a right-skewed distribution in which a few very high earners raise the average.

dataset <- read.csv("Dataset salary 2024.csv")
summary(dataset)

##    work_year    experience_level   employment_type     job_title        
##  Min.   :2020   Length:16534       Length:16534       Length:16534      
##  1st Qu.:2023   Class :character   Class :character   Class :character  
##  Median :2023   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2023                                                           
##  3rd Qu.:2024                                                           
##  Max.   :2024                                                           
##      salary         salary_currency    salary_in_usd    employee_residence
##  Min.   :   14000   Length:16534       Min.   : 15000   Length:16534      
##  1st Qu.:  101763   Class :character   1st Qu.:101125   Class :character  
##  Median :  142200   Mode  :character   Median :141300   Mode  :character  
##  Mean   :  163727                      Mean   :149687                     
##  3rd Qu.:  187200                      3rd Qu.:185900                     
##  Max.   :30400000                      Max.   :800000                     
##   remote_ratio company_location   company_size      
##  Min.   :  0   Length:16534       Length:16534      
##  1st Qu.:  0   Class :character   Class :character  
##  Median :  0   Mode  :character   Mode  :character  
##  Mean   : 32                                        
##  3rd Qu.:100                                        
##  Max.   :100
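The right-skew reading above rests on comparing the mean with the median. A minimal sketch of that check follows; `skew_check` is a hypothetical helper that is not part of the original analysis, and the toy values are made up for illustration.

```r
# Hypothetical helper, not part of the original analysis:
# a mean well above the median hints at a right-skewed distribution.
skew_check <- function(x) {
  x <- x[!is.na(x)]
  c(mean = mean(x),
    median = median(x),
    mean_minus_median = mean(x) - median(x))
}

# Made-up values for illustration only (not taken from the dataset).
toy_salaries <- c(40000, 55000, 60000, 65000, 300000)
result <- skew_check(toy_salaries)
print(result)

# With the real data this would be: skew_check(dataset$salary_in_usd)
```

A positive `mean_minus_median` is only a rough indicator; a histogram of `salary_in_usd` would show the shape more directly.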

The first 20 rows of the dataset are shown below.

library(knitr)
library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

head(dataset, 20) %>%
  kable("html") %>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "250px")

work_year	experience_level	employment_type	job_title	salary	salary_currency	salary_in_usd	employee_residence	remote_ratio	company_location	company_size
2024	SE	FT	AI Engineer	202730	USD	202730	US	0	US	M
2024	SE	FT	AI Engineer	92118	USD	92118	US	0	US	M
2024	SE	FT	Data Engineer	130500	USD	130500	US	0	US	M
2024	SE	FT	Data Engineer	96000	USD	96000	US	0	US	M
2024	SE	FT	Machine Learning Engineer	190000	USD	190000	US	0	US	M
2024	SE	FT	Machine Learning Engineer	160000	USD	160000	US	0	US	M
2024	MI	FT	ML Engineer	400000	USD	400000	US	0	US	M
2024	MI	FT	ML Engineer	65000	USD	65000	US	0	US	M
2024	EN	FT	Data Analyst	101520	USD	101520	US	0	US	M
2024	EN	FT	Data Analyst	45864	USD	45864	US	0	US	M
2024	SE	FT	Data Engineer	172469	USD	172469	US	0	US	M
2024	SE	FT	Data Engineer	114945	USD	114945	US	0	US	M
2024	EX	FT	NLP Engineer	200000	USD	200000	US	0	US	M
2024	EX	FT	NLP Engineer	150000	USD	150000	US	0	US	M
2024	MI	FT	Data Scientist	156450	USD	156450	US	100	US	M
2024	MI	FT	Data Scientist	119200	USD	119200	US	100	US	M
2024	SE	FT	Data Analyst	170000	USD	170000	US	0	US	M
2024	SE	FT	Data Analyst	130000	USD	130000	US	0	US	M
2024	SE	FT	Applied Scientist	222200	USD	222200	US	0	US	L
2024	SE	FT	Applied Scientist	136000	USD	136000	US	0	US	L

Remove missing values

dataset_clean <- dataset %>% drop_na()
# Convert character categories to factors
dataset_clean <- dataset_clean %>%
  mutate(
    experience_level = as.factor(experience_level),
    employment_type   = as.factor(employment_type),
    job_title         = as.factor(job_title),   
    company_size      = as.factor(company_size)
  )
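Before dropping rows, it can help to see how many values are actually missing. The sketch below counts NA values and blank strings per column, since read.csv() keeps blank fields in character columns as "" rather than NA and drop_na() would not remove them. `missing_report` and the toy data frame are hypothetical and only illustrate the idea.

```r
# Hypothetical helper, not part of the original analysis:
# counts NA values and blank strings in every column of a data frame.
missing_report <- function(df) {
  data.frame(
    column  = names(df),
    n_na    = vapply(df, function(col) sum(is.na(col)), integer(1)),
    n_blank = vapply(df, function(col) {
      sum(!is.na(col) & trimws(as.character(col)) == "")
    }, integer(1)),
    row.names = NULL
  )
}

# Made-up rows for illustration only.
toy <- data.frame(
  job_title     = c("Data Analyst", "", NA),
  salary_in_usd = c(50000, NA, 70000),
  stringsAsFactors = FALSE
)
report <- missing_report(toy)
print(report)

# With the real data this would be: missing_report(dataset)
```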

Visualization 1 - Salary by Experience Level (Boxplot)

library(ggplot2)
ggplot(dataset_clean, aes(x = experience_level, y = salary_in_usd)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Salary Distribution by Experience Level",
       x = "Experience Level",
       y = "Salary (USD)")
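One detail affects how this plot reads: as.factor() sorts the level codes alphabetically (EN, EX, MI, SE), so the boxes do not appear in order of seniority. A minimal sketch of setting the order explicitly is shown below; the ordering EN, MI, SE, EX is an assumption about what the codes mean, and the toy vector is made up.

```r
# Assumed seniority order of the codes (an assumption, not stated in the data).
experience_order <- c("EN", "MI", "SE", "EX")

# Made-up values for illustration only.
toy_levels <- c("SE", "EN", "EX", "MI", "SE")

default_factor <- as.factor(toy_levels)                          # alphabetical levels
ordered_factor <- factor(toy_levels, levels = experience_order)  # explicit levels

print(levels(default_factor))
print(levels(ordered_factor))

# With the real data this could be applied before plotting:
# dataset_clean$experience_level <- factor(dataset_clean$experience_level,
#                                          levels = experience_order)
```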

Insights

Senior employees earn noticeably higher salaries. The boxplot shows a clear upward shift in salary for Senior (SE) employees compared with Entry-level (EN) and Mid-level (MI) employees. This indicates that experience has a strong positive association with salary in this dataset.

Entry-level salaries are tightly grouped. Entry-level (EN) workers have a smaller spread in their salaries, meaning most of them earn within a similar range. This suggests that companies tend to offer standardized pay for beginner roles.

Senior-level salaries have the widest range. The SE box is much taller, which shows that salaries vary significantly: some senior roles pay modestly while others pay very highly. Specialized skills or job titles may drive these differences at senior levels.

A few very high salaries appear as outliers. Points far above the box are expected for the senior levels. These outliers likely represent top-tier roles such as AI Engineer, Machine Learning Engineer, and Data Scientist, although the boxplot alone does not identify job titles. Pay in advanced tech roles can therefore be very unequal but highly rewarding.

Salary growth is not equal across levels. The gap between Entry and Mid is smaller than the gap between Mid and Senior. This suggests that the biggest salary jump happens when workers develop advanced experience and specialization.

In summary

The boxplot shows that salary increases strongly with experience level. Entry-level workers earn within a tight range, indicating standardized pay, while senior and expert roles display much wider salary distributions and significantly higher median values. High-paying tech roles create visible outliers, suggesting that specialization can dramatically increase earning potential.
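The statements about medians, spread, and outliers can be backed with numbers. The sketch below computes the median, the interquartile range, and an outlier count per group. `spread_by_group` is a hypothetical helper, the toy data is made up, and the outlier count uses Tukey's 1.5 × IQR rule as implemented by boxplot.stats(), which can differ slightly from what geom_boxplot() draws.

```r
# Hypothetical helper, not part of the original analysis.
spread_by_group <- function(values, groups) {
  med <- tapply(values, groups, median)
  data.frame(
    group      = names(med),
    median     = as.numeric(med),
    iqr        = as.numeric(tapply(values, groups, IQR)),
    n_outliers = as.integer(tapply(values, groups, function(v) {
      length(boxplot.stats(v)$out)   # points beyond 1.5 x the hinge spread
    })),
    row.names = NULL
  )
}

# Made-up salaries for illustration only.
toy <- data.frame(
  level  = c(rep("EN", 4), rep("SE", 5)),
  salary = c(40, 45, 50, 55, 100, 120, 150, 200, 900) * 1000
)
spread <- spread_by_group(toy$salary, toy$level)
print(spread)

# With the real data this would be:
# spread_by_group(dataset_clean$salary_in_usd, dataset_clean$experience_level)
```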

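# cor() needs numeric input (job titles are categorical), so correlate the numeric columns
Correlation_Between_Roles <- cor(dataset_clean[, c("work_year", "salary_in_usd", "remote_ratio")])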
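Correlation needs numeric input, so a categorical column cannot be passed to cor() directly. One possible approach for an ordered category such as experience level is a Spearman rank correlation on the level's position. The sketch below assumes the order EN, MI, SE, EX and uses made-up values; it is an illustration, not the analysis the report performed.

```r
# Assumed seniority order of the codes (an assumption, not stated in the data).
experience_order <- c("EN", "MI", "SE", "EX")

# Made-up rows for illustration only.
toy <- data.frame(
  experience_level = c("EN", "EN", "MI", "MI", "SE", "SE", "EX", "EX"),
  salary_in_usd    = c(40000, 50000, 70000, 80000, 120000, 150000, 200000, 180000)
)

# Convert the ordered category to its position (1 = EN, ..., 4 = EX).
level_rank <- as.integer(factor(toy$experience_level, levels = experience_order))

# Spearman's rho measures monotonic association and tolerates ties in the ranks.
rho <- cor(level_rank, toy$salary_in_usd, method = "spearman")
print(rho)
```

A value near 1 would indicate that salary tends to rise with experience level; for unordered categories such as job title, a group comparison such as ANOVA is usually a better fit than a correlation.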

Visualization 2 - Number of Jobs by Job Title (Bar Chart)

library(ggplot2)
dataset_clean %>%
  count(job_title) %>%
  ggplot(aes(x = reorder(job_title, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Count of Job Titles",
       x = "Job Title",
       y = "Count")

Because the dataset contains so many distinct job titles, the full bar chart is too crowded to read in detail. A second bar chart therefore shows only the 20 most common job titles in the data industry.

library(dplyr)
library(ggplot2)
dataset_clean %>%
  count(job_title) %>%
  arrange(desc(n)) %>%        # sort by count
  slice_head(n = 20) %>%      # keep top 20
  ggplot(aes(x = reorder(job_title, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 20 Most Common Job Titles",
    x = "Job Title",
    y = "Count"
  ) +
  theme_minimal()

The second chart clearly shows the most common job titles in the data industry. Data Engineer ranks first, which suggests high demand for data processing roles, followed by Data Scientist and Data Analyst.
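The ranking can also be expressed as shares of all records, which makes "most common" easier to compare. The sketch below is a minimal illustration; `top_share` is a hypothetical helper and the toy titles are made up, not taken from the dataset.

```r
# Hypothetical helper, not part of the original analysis:
# returns the n most frequent values with their counts and shares.
top_share <- function(x, n = 3) {
  counts <- sort(table(x), decreasing = TRUE)
  top <- head(counts, n)
  data.frame(
    title = names(top),
    count = as.integer(top),
    share = as.numeric(top) / sum(counts),
    row.names = NULL
  )
}

# Made-up titles for illustration only.
toy_titles <- c(rep("Data Engineer", 3), rep("Data Scientist", 2), "Data Analyst")
shares <- top_share(toy_titles, n = 2)
print(shares)

# With the real data this would be: top_share(dataset_clean$job_title, n = 20)
```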

Visualization 3 - Remote Ratio Distribution (Pie Chart)

library(ggplot2)
dataset_clean %>%
  count(remote_ratio) %>%
  ggplot(aes(x = "", y = n, fill = factor(remote_ratio))) +
  geom_col() +
  coord_polar(theta = "y") +
  labs(title = "Remote Work Distribution",
       fill = "Remote Ratio (%)")

This pie chart shows the split between on-site and remote work: a remote ratio of 0% means on-site, 50% means hybrid, and 100% means fully remote. Across the records from 2020 to 2024, the majority of roles in the data industry are on-site, which is consistent with the median remote ratio of 0 in the summary. This may indicate that employers still value in-person collaboration and direct communication with the business.
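Pie slices are hard to compare by eye, so the same split can be reported as percentages. The sketch below is a minimal illustration using made-up values; the labels follow the 0%, 50%, 100% meaning described above.

```r
# Labels for the three remote ratio values described in the text.
remote_labels <- c("0" = "On-site", "50" = "Hybrid", "100" = "Remote")

# Made-up values for illustration only.
toy_remote <- c(0, 0, 0, 100, 50, 0, 100)

# Share of records in each category.
shares <- prop.table(table(toy_remote))
names(shares) <- remote_labels[names(shares)]
print(round(100 * shares, 1))

# With the real data this would start from: table(dataset_clean$remote_ratio)
```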

Assignment_2_XNIF

Trung Tai Ta