This assignment explores a dataset of salary records from professionals working in a range of data-related jobs. Each record has key attributes such as job title, salary, experience level, employment type, and location. Records also include company size and remote-work ratio, which allows for deeper exploratory data analysis.
library(tidyverse) # data cleaning, wrangling, plotting
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(psych) # descriptive stats, correlations
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(car) # regression diagnostics & ANOVA tools
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:psych':
##
## logit
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(MASS) # linear models, distributions
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(stats)
library(psych) # describe(), pairs panels, etc.
library(Hmisc) # describe(), rcorr()
##
## Attaching package: 'Hmisc'
##
## The following object is masked from 'package:psych':
##
## describe
##
## The following objects are masked from 'package:dplyr':
##
## src, summarize
##
## The following objects are masked from 'package:base':
##
## format.pval, units
library(ggpubr)
library(knitr)
library(ggplot2)
The dataset includes job records from 2020 to 2024, with most entries concentrated in 2023 and 2024. The data is therefore recent, and any salary trends reflect current industry conditions.

The raw salary column has an extremely high maximum (30,400,000 against a median of 142,200), which suggests outliers, high-level executive roles, or salaries recorded in local currencies. The large gap between the minimum and maximum indicates that the dataset includes both entry-level and senior or executive positions across different countries.

When converted to USD, salaries range from $15,000 to $800,000, which is far more realistic and standardized than the original salary column. Three points follow from this. Currency conversion removes the extreme values caused by different currencies and exchange rates. The upper end ($800,000) likely represents high-level tech roles or senior leadership positions. The distribution is likely right-skewed, with a few very high earners raising the mean ($149,687) above the median ($141,300).
dataset <- read.csv("Dataset salary 2024.csv")
summary(dataset)
## work_year experience_level employment_type job_title
## Min. :2020 Length:16534 Length:16534 Length:16534
## 1st Qu.:2023 Class :character Class :character Class :character
## Median :2023 Mode :character Mode :character Mode :character
## Mean :2023
## 3rd Qu.:2024
## Max. :2024
## salary salary_currency salary_in_usd employee_residence
## Min. : 14000 Length:16534 Min. : 15000 Length:16534
## 1st Qu.: 101763 Class :character 1st Qu.:101125 Class :character
## Median : 142200 Mode :character Median :141300 Mode :character
## Mean : 163727 Mean :149687
## 3rd Qu.: 187200 3rd Qu.:185900
## Max. :30400000 Max. :800000
## remote_ratio company_location company_size
## Min. : 0 Length:16534 Length:16534
## 1st Qu.: 0 Class :character Class :character
## Median : 0 Mode :character Mode :character
## Mean : 32
## 3rd Qu.:100
## Max. :100
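The skew and outlier remarks above can be checked numerically rather than read off the summary. The sketch below is a minimal illustration, assuming Tukey's 1.5 × IQR rule is an acceptable outlier definition for this assignment; the vector `salary_usd` holds made-up values standing in for `dataset$salary_in_usd`, so that the example runs without the CSV file.

```r
# Made-up salaries standing in for dataset$salary_in_usd (assumption, not real data)
salary_usd <- c(15000, 60000, 95000, 120000, 141000, 150000,
                186000, 210000, 800000)

# A right-skewed distribution pulls the mean above the median
mean_minus_median <- mean(salary_usd) - median(salary_usd)

# Tukey's rule: flag values more than 1.5 * IQR outside the quartiles
q <- quantile(salary_usd, c(0.25, 0.75))
fence_lo <- q[[1]] - 1.5 * IQR(salary_usd)
fence_hi <- q[[2]] + 1.5 * IQR(salary_usd)
outliers <- salary_usd[salary_usd < fence_lo | salary_usd > fence_hi]

print(mean_minus_median)
print(outliers)
```

On the real data, replacing `salary_usd` with `dataset$salary_in_usd` would give the number of flagged salaries and the size of the mean-median gap.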
library(knitr)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
head(dataset, 20) %>%
  kable("html") %>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "250px")
| work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|
| 2024 | SE | FT | AI Engineer | 202730 | USD | 202730 | US | 0 | US | M |
| 2024 | SE | FT | AI Engineer | 92118 | USD | 92118 | US | 0 | US | M |
| 2024 | SE | FT | Data Engineer | 130500 | USD | 130500 | US | 0 | US | M |
| 2024 | SE | FT | Data Engineer | 96000 | USD | 96000 | US | 0 | US | M |
| 2024 | SE | FT | Machine Learning Engineer | 190000 | USD | 190000 | US | 0 | US | M |
| 2024 | SE | FT | Machine Learning Engineer | 160000 | USD | 160000 | US | 0 | US | M |
| 2024 | MI | FT | ML Engineer | 400000 | USD | 400000 | US | 0 | US | M |
| 2024 | MI | FT | ML Engineer | 65000 | USD | 65000 | US | 0 | US | M |
| 2024 | EN | FT | Data Analyst | 101520 | USD | 101520 | US | 0 | US | M |
| 2024 | EN | FT | Data Analyst | 45864 | USD | 45864 | US | 0 | US | M |
| 2024 | SE | FT | Data Engineer | 172469 | USD | 172469 | US | 0 | US | M |
| 2024 | SE | FT | Data Engineer | 114945 | USD | 114945 | US | 0 | US | M |
| 2024 | EX | FT | NLP Engineer | 200000 | USD | 200000 | US | 0 | US | M |
| 2024 | EX | FT | NLP Engineer | 150000 | USD | 150000 | US | 0 | US | M |
| 2024 | MI | FT | Data Scientist | 156450 | USD | 156450 | US | 100 | US | M |
| 2024 | MI | FT | Data Scientist | 119200 | USD | 119200 | US | 100 | US | M |
| 2024 | SE | FT | Data Analyst | 170000 | USD | 170000 | US | 0 | US | M |
| 2024 | SE | FT | Data Analyst | 130000 | USD | 130000 | US | 0 | US | M |
| 2024 | SE | FT | Applied Scientist | 222200 | USD | 222200 | US | 0 | US | L |
| 2024 | SE | FT | Applied Scientist | 136000 | USD | 136000 | US | 0 | US | L |
# Drop rows with missing values
dataset_clean <- dataset %>% drop_na()
# Convert character categories to factors
dataset_clean <- dataset_clean %>%
  mutate(
    experience_level = as.factor(experience_level),
    employment_type = as.factor(employment_type),
    job_title = as.factor(job_title),
    company_size = as.factor(company_size)
  )
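One detail of the factor conversion is worth illustrating: `as.factor()` orders levels alphabetically, so plots would show the experience codes as EN, EX, MI, SE rather than in career order. The sketch below is a possible alternative, assuming the codes mean entry (EN), mid (MI), senior (SE), and executive (EX) and that this is the intended ranking; the `experience` vector is a hypothetical sample, not the real column.

```r
# Hypothetical sample of experience codes as they appear in the data
experience <- c("SE", "EN", "MI", "EX", "SE", "MI")

# as.factor() sorts the levels alphabetically
default_levels <- levels(as.factor(experience))

# Explicit levels put the codes in the assumed career order instead
experience_ordered <- factor(experience, levels = c("EN", "MI", "SE", "EX"))

print(default_levels)
print(levels(experience_ordered))
```

Applied to the report, the same `factor(..., levels = ...)` call could replace `as.factor(experience_level)` inside `mutate()`.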
library(ggplot2)
ggplot(dataset_clean, aes(x = experience_level, y = salary_in_usd)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Salary Distribution by Experience Level",
       x = "Experience Level",
       y = "Salary (USD)")
## (Insights)
- Senior employees earn noticeably higher salaries. The boxplot shows a clear upward shift in salary for Senior (SE) workers compared with Entry-level (EN) and Mid-level (MI) workers. This indicates that experience has a strong positive association with salary in this dataset.
- Entry-level salaries are tightly grouped. Entry-level (EN) workers have a smaller spread in their salaries, meaning most of them earn within a similar range. This suggests that companies tend to offer standardized pay for beginner roles.
- Senior-level salaries have the widest range. The SE box is much taller, which shows that salaries vary significantly: some senior roles pay modestly while others pay extremely well. Specialized skills or job titles may account for these differences at senior levels.
- A few very high salaries appear as outliers. Points far above the box are visible for the senior levels. These outliers likely represent top-tier roles such as AI Engineer, Machine Learning Engineer, and Data Scientist. Salary opportunities in advanced tech roles can therefore be very unequal but highly rewarding.
- Salary growth is not equal across levels. The gap between Entry and Mid is smaller, while the gap between Mid and Senior is much larger. This suggests that the biggest salary jump happens when workers develop advanced experience and specialization.
The boxplot shows that salary increases strongly with experience level. Entry-level workers earn within a tight range, indicating standardized pay, while senior and executive (EX) roles display much wider salary distributions and significantly higher median values. High-paying tech roles create visible outliers, suggesting that specialization can substantially increase earning potential.
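The boxplot reading above (higher medians and wider spread at senior levels) can be backed by a per-group summary. The sketch below is one way to compute it in base R, using the median and interquartile range as the measures of centre and spread; the `toy` data frame is invented for illustration and only mimics the column names of `dataset_clean`.

```r
# Invented rows that mimic two columns of dataset_clean (assumption, not real data)
toy <- data.frame(
  experience_level = c("EN", "EN", "EN", "MI", "MI", "MI", "SE", "SE", "SE"),
  salary_in_usd = c(50000, 60000, 70000, 90000, 100000, 130000,
                    120000, 160000, 250000)
)

# Median salary per experience level (the line inside each box)
medians <- tapply(toy$salary_in_usd, toy$experience_level, median)

# Interquartile range per level (the height of each box)
spreads <- tapply(toy$salary_in_usd, toy$experience_level, IQR)

print(medians)
print(spreads)
```

Running the same two `tapply()` calls on `dataset_clean` would show whether the Mid to Senior gap really exceeds the Entry to Mid gap, and by how much.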
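# cor() needs numeric input; using the three numeric columns here is an assumption
Correlation_Between_Roles <- cor(dataset_clean[, c("work_year", "salary_in_usd", "remote_ratio")])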
library(ggplot2)
dataset_clean %>%
  count(job_title) %>%
  ggplot(aes(x = reorder(job_title, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Count of Job Titles",
       x = "Job Title",
       y = "Count")
Because the dataset contains a large number of distinct job titles, this bar chart is too crowded to read in detail. A second bar chart was therefore produced that shows only the 20 most common job titles in the data industry.
library(dplyr)
library(ggplot2)
dataset_clean %>%
  count(job_title) %>%
  arrange(desc(n)) %>%    # sort by count
  slice_head(n = 20) %>%  # keep top 20
  ggplot(aes(x = reorder(job_title, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 20 Most Common Job Titles",
    x = "Job Title",
    y = "Count"
  ) +
  theme_minimal()
The second chart clearly shows the most common job titles in the data industry. Data Engineer ranks first, which points to high demand for data processing roles, followed by Data Scientist and Data Analyst.
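The ranking above can also be expressed as a share of all records, which makes "most common" concrete. The sketch below is a minimal base R illustration; the `titles` vector is a hypothetical sample whose counts are invented, so the percentages it prints say nothing about the real dataset.

```r
# Hypothetical job titles with invented counts (assumption, not real data)
titles <- c(rep("Data Engineer", 5), rep("Data Scientist", 3),
            rep("Data Analyst", 2), "NLP Engineer")

# Count each title and sort from most to least frequent
counts <- sort(table(titles), decreasing = TRUE)

# Keep the three most common titles and convert counts to percentages
top_titles <- head(counts, 3)
share <- round(100 * as.numeric(top_titles) / length(titles), 1)
names(share) <- names(top_titles)

print(share)
```

With `dataset_clean$job_title` in place of `titles`, the same steps would report what fraction of all records each leading title accounts for.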
library(ggplot2)
dataset_clean %>%
  count(remote_ratio) %>%
  ggplot(aes(x = "", y = n, fill = factor(remote_ratio))) +
  geom_col() +
  coord_polar(theta = "y") +
  labs(title = "Remote Work Distribution",
       fill = "Remote Ratio (%)")
This pie chart shows the split between remote and on-site work, where a remote ratio of 0% means on-site, 50% means hybrid, and 100% means fully remote. The data collected from 2020 to 2024 shows that the majority of roles in the data industry are on-site, which may indicate a continued preference for in-person collaboration and face-to-face communication with the business.
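Pie slices are hard to compare by eye, so the split described above can be tabulated as percentages. The sketch below is a small illustration in base R; the `remote_ratio` vector is invented, and the labels "On-site", "Hybrid", and "Remote" are the interpretation given in the text rather than values stored in the data.

```r
# Invented remote ratios standing in for dataset_clean$remote_ratio (assumption)
remote_ratio <- c(0, 0, 0, 0, 0, 0, 100, 100, 100, 50)

# Map the numeric codes to the work arrangements described in the text
arrangement <- factor(remote_ratio,
                      levels = c(0, 50, 100),
                      labels = c("On-site", "Hybrid", "Remote"))

# Percentage of records in each arrangement
tab <- prop.table(table(arrangement))
pct <- setNames(round(100 * as.numeric(tab), 1), names(tab))

print(pct)
```

Running this on the real column would give the exact on-site share behind the "majority" claim.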