| Name | Student ID |
|---|---|
| Mehran Gharooni Khoshkehbar | S2014607 |
| Ali Bin Abdul Rahman | 23062959 |
The field of Data Science is fast-moving, so it’s important that job seekers, employers, and educators alike have an understanding of the dynamics in salaries and trends in remote work.
In this paper, we analyze a dataset of Data Science salaries, studying the relationships between different job characteristics (like title, experience, company size, etc.) and:
To analyze salary and remote working availability for data science employees.
In this section, we load the necessary libraries for analysis, then load the data into R, and do some preliminary analysis of the data. This setup ensures that our environment is correctly set up and we have a complete overview of the structure of the dataset before proceeding with further analysis.
The R packages required for data manipulation, visualization, and
modeling are listed below. The use of the for loop helps in
the installation of missing packages and then loads them into the
current session. This will make sure that all the functions needed for
later analytical activities are available.
# List of packages required for the analysis
packages_needed <- c(
"dplyr", # for data manipulation
"ggplot2", # for plotting
"randomForest", # for random forest (classification)
"tidyr", # for data tidying
"caret", # for model training/tuning
"ranger", # for fast random forest
"scales" # for pretty axis labels
)
# Loop through the packages and install/load them if not already available
for (pkg in packages_needed) {
if (!require(pkg, character.only = TRUE)) {
install.packages(pkg, dependencies = TRUE)
library(pkg, character.only = TRUE)
}
}
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: ggplot2
## Loading required package: randomForest
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
## Loading required package: tidyr
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ranger
##
## Attaching package: 'ranger'
## The following object is masked from 'package:randomForest':
##
## importance
## Loading required package: scales
Here, we load the dataset containing data science salary information.
The dataset is in CSV format, and it is read into a data frame. We use
the parameter stringsAsFactors = FALSE to ensure that
string columns are not automatically converted into factor variables, as
this may hinder flexibility during data manipulation. After loading, we
inspect the structure of the dataset by displaying its dimensions and a
preview of the first few rows.
# Load the raw dataset into a data frame
df_salaries_raw <- read.csv("DataScience_salaries_2024.csv", stringsAsFactors = FALSE)
# Print the dimensions of the dataset
message("Raw Data Dimensions: ", nrow(df_salaries_raw), " rows, ", ncol(df_salaries_raw), " columns.")
## Raw Data Dimensions: 14838 rows, 11 columns.
To understand the dataset’s structure and contents, we examine its
first few rows. The head function is used to display the
initial records. This step provides insight into the columns, data
types, and potential issues (e.g., missing or inconsistent values) in
the dataset.
# Display the first six rows of the dataset
head(df_salaries_raw)
## work_year experience_level employment_type job_title
## 1 2021 MI FT Data Scientist
## 2 2021 MI FT BI Data Analyst
## 3 2020 MI FT Data Scientist
## 4 2021 MI FT ML Engineer
## 5 2022 SE FT Lead Machine Learning Engineer
## 6 2021 MI FT ML Engineer
## salary salary_currency salary_in_usd employee_residence remote_ratio
## 1 30400000 CLP 40038 CL 100
## 2 11000000 HUF 36259 HU 50
## 3 11000000 HUF 35735 HU 50
## 4 8500000 JPY 77364 JP 50
## 5 7500000 INR 95386 IN 50
## 6 7000000 JPY 63711 JP 50
## company_location company_size
## 1 CL L
## 2 US L
## 3 HU L
## 4 JP S
## 5 IN L
## 6 JP S
Before any analysis can be done on a dataset, it is always important to check for missing data. Here, we will calculate the count and percentage of missing values for each column. We also plot a bar chart to visualize the proportion of missing data across columns, which will help us spot columns that are problematic and might need imputation or removal.
# Calculate missing values for each column
missing_counts <- sapply(df_salaries_raw, function(col) sum(is.na(col) | col == ""))
missing_pct <- round(missing_counts / nrow(df_salaries_raw) * 100, 2)
# Create a data frame of missing values
df_missing_summary <- data.frame(
Columns = names(missing_counts),
MissingCount = missing_counts,
MissingPct = missing_pct
)
# Display the summary of missing values
print(df_missing_summary)
## Columns MissingCount MissingPct
## work_year work_year 1 0.01
## experience_level experience_level 6 0.04
## employment_type employment_type 7 0.05
## job_title job_title 30 0.20
## salary salary 17 0.11
## salary_currency salary_currency 15 0.10
## salary_in_usd salary_in_usd 10 0.07
## employee_residence employee_residence 16 0.11
## remote_ratio remote_ratio 83 0.56
## company_location company_location 13 0.09
## company_size company_size 8 0.05
# Visualize missing percentages with a bar chart
ggplot(df_missing_summary, aes(x = reorder(Columns, -MissingPct), y = MissingPct)) +
geom_bar(stat = "identity", fill = "skyblue") +
coord_flip() +
theme_minimal() +
labs(title = "Percentage of Missing Values per Column",
x = "Columns",
y = "Missing Percentage")
The dataset contains the following key columns:
EN = Entry-level, MI = Mid-level,
SE = Senior-level, EX = Executive-level).FT = Full-time, PT = Part-time, etc.).salary (e.g., USD, EUR).0, 50, 100).S =
Small, M = Medium, L = Large).In this section, we conduct exploratory data analysis (EDA) on the raw dataset to understand how it is structured, identify missing values, and visualize some initial trends in the data. This is an essential step in determining potential problems and uncovering insights before getting into actual data cleansing and other analyses.
Missing data is an important part of EDA. We now calculate the counts and percentages of missing values for each column. This gives a better understanding of the quality of the data and shows columns that possibly need imputation or removal. A summary of missing values is shown below for further analysis.
# Calculate missing values for each column
missing_counts <- sapply(df_salaries_raw, function(col) sum(is.na(col) | col == ""))
missing_pct <- round(missing_counts / nrow(df_salaries_raw) * 100, 2)
# Create a summary of missing values
df_missing_summary <- data.frame(
Columns = names(missing_counts),
MissingCount = missing_counts,
MissingPct = missing_pct
)
# Print the summary of missing values
message("Missing Value Summary (Absolute & %):")
## Missing Value Summary (Absolute & %):
print(df_missing_summary)
## Columns MissingCount MissingPct
## work_year work_year 1 0.01
## experience_level experience_level 6 0.04
## employment_type employment_type 7 0.05
## job_title job_title 30 0.20
## salary salary 17 0.11
## salary_currency salary_currency 15 0.10
## salary_in_usd salary_in_usd 10 0.07
## employee_residence employee_residence 16 0.11
## remote_ratio remote_ratio 83 0.56
## company_location company_location 13 0.09
## company_size company_size 8 0.05
Summarizing the numeric columns provides an overview of their distribution, including measures such as the mean, median, and range. This step helps identify outliers and understand the scale of the data.
# Display summary statistics for numeric columns
summary(df_salaries_raw)
## work_year experience_level employment_type job_title
## Min. :2020 Length:14838 Length:14838 Length:14838
## 1st Qu.:2023 Class :character Class :character Class :character
## Median :2023 Mode :character Mode :character Mode :character
## Mean :2023
## 3rd Qu.:2024
## Max. :2024
## NA's :1
## salary salary_currency salary_in_usd employee_residence
## Min. : 14000 Length:14838 Min. : 15000 Length:14838
## 1st Qu.: 102000 Class :character 1st Qu.:101715 Class :character
## Median : 142200 Mode :character Median :141300 Mode :character
## Mean : 164981 Mean :149833
## 3rd Qu.: 187415 3rd Qu.:185900
## Max. :30400000 Max. :800000
## NA's :17 NA's :10
## remote_ratio company_location company_size
## Min. : 0.00 Length:14838 Length:14838
## 1st Qu.: 0.00 Class :character Class :character
## Median : 0.00 Mode :character Mode :character
## Mean : 32.89
## 3rd Qu.:100.00
## Max. :100.00
## NA's :83
Visualizing the distribution of key columns allows us to detect
patterns, skewness, and potential issues in the data. In this step, we
create histograms for key variables such as salary_in_usd,
remote_ratio, and experience_level to better
understand their distribution.
salary_in_usdThe salary_in_usd column represents the salary in USD.
We plot a histogram to examine its distribution and detect skewness.
if ("salary_in_usd" %in% colnames(df_salaries_raw)) {
ggplot(df_salaries_raw, aes(x = salary_in_usd)) +
geom_histogram(bins = 30, fill = "blue", color = "black") +
theme_minimal() +
scale_x_continuous(labels = scales::comma) +
labs(title = "Raw Distribution of salary_in_usd",
x = "Salary in USD",
y = "Count")
}
## Warning: Removed 10 rows containing non-finite outside the scale range
## (`stat_bin()`).
Explanation: The histogram reveals a right-skewed distribution, indicating that most salaries are concentrated in the lower range, with fewer high salaries.
remote_ratioThe remote_ratio column represents the percentage of
work done remotely. A histogram is plotted to observe the distribution
of remote work percentages.
if ("remote_ratio" %in% colnames(df_salaries_raw)) {
ggplot(df_salaries_raw, aes(x = remote_ratio)) +
geom_histogram(bins = 30, fill = "purple", color = "black") +
theme_minimal() +
labs(title = "Raw Distribution of remote_ratio",
x = "Remote Ratio (%)",
y = "Count")
}
## Warning: Removed 83 rows containing non-finite outside the scale range
## (`stat_bin()`).
Explanation: The histogram indicates that the
remote_ratio column has most observations at 0% and 100%,
with very few at 50%. This suggests that most roles are either fully
remote or entirely onsite.
experience_levelThe experience_level column categorizes employees by
their seniority level. We use a bar chart to visualize the distribution
of these categories.
if ("experience_level" %in% colnames(df_salaries_raw)) {
ggplot(df_salaries_raw, aes(x = experience_level)) +
geom_bar(fill = "orange", color = "black") +
theme_minimal() +
labs(title = "Raw Distribution of experience_level",
x = "Experience Level",
y = "Count")
}
Explanation: The bar chart shows the frequency of each experience level in the dataset, providing insights into the seniority distribution of employees.
experience_level shows the distribution of
employee seniority, which can influence salaries and remote work
availability.In this section, we preprocess the raw data to make it ready for
analysis by creating a cleaned and transformed version of the data
frame, df_salaries. This includes handling missing values,
picking out specific rows, categorizing columns, and performing
transformations to ensure that the data is ready to be modeled. The
steps thus help to cleanse the data and enhance its quality for accurate
and meaningful analysis.
To reduce noise in the dataset, columns with more than 50% missing values are removed. This ensures that only relevant and sufficiently populated data remains for analysis.
# Keep columns with less than 50% missing values
df_salaries <- df_salaries_raw
cols_before <- ncol(df_salaries)
df_salaries <- df_salaries[, colSums(is.na(df_salaries) | df_salaries == "") / nrow(df_salaries) < 0.5]
cols_after <- ncol(df_salaries)
# Output the number of dropped columns
message("Number of columns dropped (>50% missing): ", cols_before - cols_after)
## Number of columns dropped (>50% missing): 0
The dataset is filtered to include only rows where
employment_type is “FT” (full-time jobs). This focuses the
analysis on the majority of the dataset, as 99% of records are full-time
positions.
# Filter rows for full-time employment only
rows_before_ft <- nrow(df_salaries)
if ("employment_type" %in% colnames(df_salaries)) {
df_salaries <- subset(df_salaries, employment_type == "FT")
}
rows_after_ft <- nrow(df_salaries)
# Output the number of rows removed
message("Filtered FT only: Rows before = ", rows_before_ft, ", after = ", rows_after_ft)
## Filtered FT only: Rows before = 14838, after = 14765
Missing values in numeric columns, such as salary_in_usd
and work_year, are replaced with the median of their
respective columns. This ensures that the data remains representative
without being overly influenced by outliers.
# Impute missing values for numeric columns
numeric_cols <- c("salary_in_usd", "work_year")
for (col_nm in numeric_cols) {
if (col_nm %in% colnames(df_salaries)) {
med_value <- median(df_salaries[[col_nm]], na.rm = TRUE)
df_salaries[[col_nm]][is.na(df_salaries[[col_nm]])] <- med_value
}
}
job_title into Broad GroupsThe job_title column is recategorized into broader
categories to simplify analysis. Titles are grouped into roles such as
Manager/Director, Analyst, Engineer, and Data Scientist based on
patterns in their names.
# Categorize job titles into broad groups
if ("job_title" %in% colnames(df_salaries)) {
df_salaries$job_title[is.na(df_salaries$job_title) | df_salaries$job_title == ""] <- "Unknown"
df_salaries <- df_salaries %>%
mutate(
job_title = case_when(
grepl("Lead|Head|Director|Manager", job_title, ignore.case = TRUE) ~ "MgrLeadDir",
grepl("BI|Business Intelligence", job_title, ignore.case = TRUE) ~ "BI",
grepl("Machine Learning|ML|AI|Deep Learning|NLP",
job_title, ignore.case = TRUE) ~ "ML_AI",
grepl("Architect", job_title, ignore.case = TRUE) ~ "Architect",
grepl("Engineer", job_title, ignore.case = TRUE) ~ "Engineer",
grepl("Analyst|Analytics", job_title, ignore.case = TRUE) ~ "Analyst",
grepl("Scientist|Science", job_title, ignore.case = TRUE) ~ "DataSci",
TRUE ~ "Other"
)
)
}
To standardize the analysis, the dataset is filtered to include only employees and companies based in the United States. This reduces variability caused by regional differences in salaries and employment practices.
# Filter for US-based employees and companies
if ("employee_residence" %in% colnames(df_salaries)) {
df_salaries <- subset(df_salaries, employee_residence == "US")
}
if ("company_location" %in% colnames(df_salaries)) {
df_salaries <- subset(df_salaries, company_location == "US")
}
remote_ratio to Numeric Binary (0 or
1)The remote_ratio column, originally representing
percentages, is converted into a binary format. Values of 50% or 100%
are grouped as 1 (remote work available), while 0% remains 0 (no remote
work). This simplifies analysis and modeling.
if ("remote_ratio" %in% colnames(df_salaries)) {
df_salaries$remote_ratio <- suppressWarnings(as.numeric(df_salaries$remote_ratio))
df_salaries$remote_ratio[is.na(df_salaries$remote_ratio)] <- 0
df_salaries <- df_salaries %>%
mutate(
remote_ratio = case_when(
remote_ratio == 50 ~ 1,
remote_ratio == 100 ~ 1,
TRUE ~ 0
)
)
}
Explanation: By reducing the
remote_ratio column to a binary indicator, we simplify the
analysis of remote work availability.
Columns that are redundant or irrelevant to the analysis are removed. This reduces noise and streamlines the dataset.
drops <- c("employment_type", "salary", "salary_currency", "employee_residence", "company_location")
for (col_to_drop in drops) {
if (col_to_drop %in% colnames(df_salaries)) {
df_salaries[[col_to_drop]] <- NULL
}
}
Explanation: - employment_type: Only
full-time jobs are considered. - salary and
salary_currency: The salary_in_usd column
already provides standardized salary data. -
employee_residence and company_location: Only
US-based records are included.
experience_level to NumericThe experience_level column is transformed into numeric
values for modeling, with 1 representing entry-level and 4 representing
executive-level positions.
if ("experience_level" %in% colnames(df_salaries)) {
df_salaries$experience_level[is.na(df_salaries$experience_level) |
df_salaries$experience_level == ""] <- "MI"
df_salaries <- df_salaries %>%
mutate(
experience_level = case_when(
grepl("EN", experience_level, ignore.case = TRUE) ~ 1,
grepl("MI", experience_level, ignore.case = TRUE) ~ 2,
grepl("SE", experience_level, ignore.case = TRUE) ~ 3,
grepl("EX", experience_level, ignore.case = TRUE) ~ 4,
TRUE ~ 2
)
)
}
company_size to NumericThe company_size column is converted into numeric
values, with 1 representing small companies, 2 for medium-sized
companies, and 3 for large companies.
if ("company_size" %in% colnames(df_salaries)) {
df_salaries$company_size[is.na(df_salaries$company_size) |
df_salaries$company_size == ""] <- "M"
df_salaries <- df_salaries %>%
mutate(
company_size = case_when(
grepl("S", company_size, ignore.case = TRUE) ~ 1,
grepl("M", company_size, ignore.case = TRUE) ~ 2,
grepl("L", company_size, ignore.case = TRUE) ~ 3,
TRUE ~ 2
)
)
}
salary_in_usd at the 99th
PercentileTo reduce the influence of extreme outliers, salaries above the 99th percentile are capped at this value.
if ("salary_in_usd" %in% colnames(df_salaries)) {
cap_value <- quantile(df_salaries$salary_in_usd, 0.99, na.rm = TRUE)
df_salaries$salary_in_usd <- ifelse(
df_salaries$salary_in_usd > cap_value,
cap_value,
df_salaries$salary_in_usd
)
}
years_since_2020 FeatureA new feature, years_since_2020, is created based on the
work_year column to quantify the time since 2020.
if (!("work_year" %in% colnames(df_salaries))) {
df_salaries$work_year <- 2023
}
df_salaries$work_year[is.na(df_salaries$work_year)] <- median(df_salaries$work_year, na.rm = TRUE)
df_salaries$years_since_2020 <- df_salaries$work_year - 2020
log_salary_in_usdRows with invalid or missing salary data are removed, and a new column for log-transformed salary is created to normalize the distribution.
df_salaries <- df_salaries %>%
filter(!is.na(salary_in_usd) & salary_in_usd > 0)
df_salaries$log_salary_in_usd <- log(df_salaries$salary_in_usd + 1)
job_titleThe job_title column is converted into one-hot encoded
variables to simplify its use in machine learning models.
if ("job_title" %in% colnames(df_salaries)) {
df_salaries$job_title <- factor(df_salaries$job_title)
dummy_obj <- caret::dummyVars(~ job_title, data = df_salaries)
jobtitle_dummies <- predict(dummy_obj, newdata = df_salaries)
df_salaries <- cbind(df_salaries, jobtitle_dummies)
df_salaries$job_title <- NULL
}
After applying all transformations, the dataset is ready for analysis and modeling. Key steps included simplifying categorical variables, handling missing data, and engineering new features to enhance predictive power.
# Display the structure of the cleaned dataset
message("Structure of the cleaned dataset:")
## Structure of the cleaned dataset:
str(df_salaries)
## 'data.frame': 12857 obs. of 15 variables:
## $ work_year : num 2024 2024 2024 2024 2023 ...
## $ experience_level : num 3 2 3 2 2 2 2 3 2 3 ...
## $ salary_in_usd : num 342580 342580 342580 342580 342580 ...
## $ remote_ratio : num 0 0 0 0 0 0 0 0 0 1 ...
## $ company_size : num 2 2 2 2 2 2 2 2 2 2 ...
## $ years_since_2020 : num 4 4 4 4 3 3 3 3 4 4 ...
## $ log_salary_in_usd : num 12.7 12.7 12.7 12.7 12.7 ...
## $ job_title.Analyst : num 0 0 1 0 0 0 0 0 0 0 ...
## $ job_title.Architect : num 0 0 0 0 0 0 0 0 0 0 ...
## $ job_title.BI : num 0 0 0 0 0 0 0 0 0 0 ...
## $ job_title.DataSci : num 0 0 0 0 0 0 0 1 0 1 ...
## $ job_title.Engineer : num 1 0 0 0 0 0 1 0 0 0 ...
## $ job_title.MgrLeadDir: num 0 0 0 0 0 0 0 0 0 0 ...
## $ job_title.ML_AI : num 0 1 0 1 1 1 0 0 1 0 ...
## $ job_title.Other : num 0 0 0 0 0 0 0 0 0 0 ...
# Summarize key numeric columns
message("Summary of key numeric columns:")
## Summary of key numeric columns:
summary(df_salaries[, c("salary_in_usd", "experience_level",
"company_size", "years_since_2020",
"log_salary_in_usd")])
## salary_in_usd experience_level company_size years_since_2020
## Min. : 24000 Min. :1.000 Min. :1.000 Min. :0.00
## 1st Qu.:112000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:3.00
## Median :148000 Median :3.000 Median :2.000 Median :3.00
## Mean :156511 Mean :2.678 Mean :2.051 Mean :3.18
## 3rd Qu.:191475 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:4.00
## Max. :342580 Max. :4.000 Max. :3.000 Max. :4.00
## log_salary_in_usd
## Min. :10.09
## 1st Qu.:11.63
## Median :11.90
## Mean :11.88
## 3rd Qu.:12.16
## Max. :12.74
Overall, in Section 4, we focused on refining the raw dataset to
ensure its suitability for analysis. We addressed missing data by
imputing numeric columns and dropping excessively incomplete columns.
The dataset was filtered to focus on full-time, US-based employees. Key
variables like job_title, remote_ratio,
experience_level, and company_size were
transformed into simplified or numeric formats, while outliers in
salary_in_usd were capped to reduce skewness. New features,
such as years_since_2020 and
log_salary_in_usd, were added to enhance predictive power.
Finally, one-hot encoding was applied to categorical variables like
job_title to make the data compatible with machine learning
models. These transformations ensure a cleaner, more structured dataset
ready for analysis.
In this section, we do an extensive exploratory data analysis (EDA) of the cleaned dataset. We focus on visualization of univariate distributions along with bivariate relationships to explain patterns, trends, and possible predictors for modeling purposes. This exercise is quite vital to understand the dynamics of important variables and their interrelations.
salary_in_usd DistributionA histogram of the salary_in_usd column is plotted to
examine the distribution of salaries after outlier capping. This
visualization provides insights into the overall spread and skewness of
salaries.
ggplot(df_salaries, aes(x = salary_in_usd)) +
geom_histogram(bins = 30, fill = "darkblue", color = "white") +
theme_minimal() +
scale_x_continuous(labels = scales::comma) +
labs(title = "Cleaned Distribution of salary_in_usd (Capped at 99th percentile)",
x = "Salary in USD",
y = "Count")
Explanation: The histogram reveals the distribution of salaries in USD after capping outliers. This ensures that extremely high salaries do not skew the data.
log_salary_in_usd DistributionLog-transformed salaries (log_salary_in_usd) are
visualized to assess the normalized distribution. Log transformation
often reduces skewness and stabilizes variance, making it more suitable
for modeling.
ggplot(df_salaries, aes(x = log_salary_in_usd)) +
geom_histogram(bins = 30, fill = "green", color = "black") +
theme_minimal() +
labs(title = "Distribution of log_salary_in_usd",
x = "Log Salary (USD)",
y = "Count")
Explanation: The histogram of log-transformed salaries demonstrates a more symmetrical distribution, which aids in predictive modeling.
experience_level DistributionA bar chart is used to display the distribution of
experience_level, showcasing the frequency of different
seniority levels in the dataset.
ggplot(df_salaries, aes(x = factor(experience_level))) +
geom_bar(fill = "cornflowerblue") +
theme_minimal() +
labs(title = "Distribution of Experience Level (Numeric Form)",
x = "Experience Level (1=EN,2=MI,3=SE,4=EX)",
y = "Count")
Explanation: This bar chart shows the proportion of employees at various experience levels, helping us understand the distribution of seniority in the dataset.
A boxplot is created to analyze the relationship between
experience_level and log_salary_in_usd. This
helps identify whether higher experience levels correspond to higher
salaries.
ggplot(df_salaries, aes(x = factor(experience_level), y = log_salary_in_usd)) +
geom_boxplot(fill = "lightblue") +
theme_minimal() +
labs(title = "Boxplot: Log Salary vs. Experience Level",
x = "Experience Level",
y = "Log Salary (USD)")
Explanation: The boxplot highlights differences in log-transformed salaries across experience levels, suggesting a positive correlation between experience and salary.
A boxplot visualizes the relationship between
company_size and salary_in_usd to determine
whether larger companies tend to pay higher salaries.
ggplot(df_salaries, aes(x = factor(company_size), y = salary_in_usd)) +
geom_boxplot(fill = "lightgreen") +
theme_minimal() +
scale_y_continuous(labels = scales::comma) +
labs(title = "Boxplot: Salary in USD vs. Company Size",
x = "Company Size (1=S,2=M,3=L)",
y = "Salary (USD)")
Explanation: This plot reveals how salaries vary with company size, with larger companies generally paying higher salaries.
The relationship between remote_ratio and
log_salary_in_usd is visualized to explore whether remote
work availability impacts salary.
ggplot(df_salaries, aes(x = factor(remote_ratio), y = log_salary_in_usd)) +
geom_boxplot(fill = "plum") +
theme_minimal() +
labs(title = "Boxplot: Log Salary vs. Remote Ratio (0/1)",
x = "Remote Ratio (0=Mostly Onsite, 1=Remote/Half/Half)",
y = "Log Salary (USD)")
Explanation: The boxplot examines salary differences between remote and onsite roles. It provides insights into how remote work availability correlates with compensation.
experience_level) and company size
(company_size) demonstrate clear trends in influencing
salaries.remote_ratio)In this section we developed a classifier that predicts whether
remote work (remote_ratio) is offered by using a Random
Forest classifier. The dependent variable remote_ratio is
categorical, which takes 1 when remote work is offered, and 0 when the
company is focused on onsite roles. The classification exercise involves
choosing predictor variables, splitting the dataset into training and
test sets, training the model in the Random Forest algorithm, and
validating its performance.
remote_ratio to FactorThe target variable remote_ratio is converted into a
factor, ensuring it is treated as a categorical variable for
classification.
if ("remote_ratio" %in% colnames(df_salaries)) {
df_salaries$remote_ratio <- factor(df_salaries$remote_ratio, levels = c(0, 1))
}
Explanation: This step prepares the target variable for classification by encoding it as a factor with two levels: 0 and 1.
We select a set of predictors for the model, including engineered
features such as years_since_2020, numeric columns like
experience_level and salary_in_usd, and any
one-hot encoded job title columns (if applicable).
if (exists("jobtitle_dummies")) {
dummy_colnames <- colnames(jobtitle_dummies)
} else {
dummy_colnames <- character(0)
}
class_predictors <- c("years_since_2020",
"experience_level",
"company_size",
"salary_in_usd",
dummy_colnames) # includes one-hot job_title columns
# Filter complete cases for classification
df_class <- df_salaries[complete.cases(df_salaries[, c(class_predictors, "remote_ratio")]), ]
Explanation: This step ensures the predictors are well-defined and that the dataset contains no missing values for selected features or the target variable.
The dataset is split into training (80%) and testing (20%) subsets to evaluate model performance on unseen data.
set.seed(123)
train_indices_class <- sample(seq_len(nrow(df_class)), size = 0.8 * nrow(df_class))
train_set_class <- df_class[train_indices_class, ]
test_set_class <- df_class[-train_indices_class, ]
Explanation: A random seed ensures reproducibility of the split. The training set is used to build the model, and the testing set evaluates its generalization ability.
The Random Forest model is trained using the selected predictors.
Hyperparameters such as ntree (number of trees) and
mtry (number of variables tried at each split) are
specified.
rf_model_class <- randomForest(
formula = as.formula(
paste("remote_ratio ~", paste(class_predictors, collapse=" + "))
),
data = train_set_class,
ntree = 200,
mtry = 3,
importance = TRUE
)
# Show model summary
rf_model_class
##
## Call:
## randomForest(formula = as.formula(paste("remote_ratio ~", paste(class_predictors, collapse = " + "))), data = train_set_class, ntree = 200, mtry = 3, importance = TRUE)
## Type of random forest: classification
## Number of trees: 200
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 29.4%
## Confusion matrix:
## 0 1 class.error
## 0 6740 314 0.04451375
## 1 2710 521 0.83874961
Explanation: Random Forest builds multiple decision trees and aggregates their outputs to classify observations. This method is robust and often achieves high accuracy on structured data.
The trained model is evaluated on the test set. Predictions are compared against actual values using a confusion matrix, and the overall accuracy is calculated.
pred_class <- predict(rf_model_class, newdata = test_set_class, type = "class")
conf_mat_class <- table(Predicted = pred_class, Actual = test_set_class$remote_ratio)
# Display confusion matrix
conf_mat_class
## Actual
## Predicted 0 1
## 0 1671 688
## 1 83 130
# Calculate classification accuracy
accuracy_class <- mean(pred_class == test_set_class$remote_ratio)
message(sprintf("Classification Accuracy: %.2f %%", accuracy_class * 100))
## Classification Accuracy: 70.02 %
Explanation: - The confusion matrix summarizes the model’s predictions against actual values, showing the number of true positives, true negatives, false positives, and false negatives. - Accuracy is computed as the proportion of correctly classified observations.
The Random Forest classifier provides an initial model for predicting remote work availability. Key results include:
While the model shows moderate performance, further optimization or
inclusion of additional predictors may improve accuracy. Potential
limitations include class imbalance or weak predictors for
remote_ratio.
log_salary_in_usd)This section develops a regression model to predict
log_salary_in_usd, the logarithmic transformation of salary
in USD. The process involves defining predictors, performing
hyperparameter tuning using 5-fold cross-validation, and evaluating the
model’s performance on a test set.
The set of predictors used for regression includes features like
years_since_2020, experience_level,
company_size, remote_ratio, and one-hot
encoded job title columns.
regress_predictors <- c("years_since_2020",
"experience_level",
"company_size",
"remote_ratio",
dummy_colnames)
Explanation: These features are selected based on their relevance to predicting salaries. They include temporal, categorical, and continuous variables to capture different aspects of salary determination.
We ensure the dataset contains no missing values for the selected
predictors or the target variable (log_salary_in_usd).
df_reg <- df_salaries[complete.cases(df_salaries[, c(regress_predictors, "log_salary_in_usd")]), ]
Explanation: Removing rows with missing values ensures that the model is trained on clean data without inconsistencies.
The dataset is split into training (80%) and testing (20%) subsets to evaluate the model’s performance on unseen data.
set.seed(123)
train_indices_reg <- sample(seq_len(nrow(df_reg)), size = 0.8 * nrow(df_reg))
train_data_reg <- df_reg[train_indices_reg, ]
test_data_reg <- df_reg[-train_indices_reg, ]
Explanation: Splitting the data allows us to validate the model’s ability to generalize beyond the training set.
Training control specifies 5-fold cross-validation to tune hyperparameters. This ensures robust model performance.
train_control <- trainControl(method = "cv", number = 5, verboseIter = FALSE)
Explanation: Cross-validation prevents overfitting by evaluating the model’s performance on multiple subsets of the training data.
We define a grid of hyperparameters for tuning the Random Forest
model, including mtry (number of predictors considered at
each split) and min.node.size (minimum node size).
tune_grid <- expand.grid(
mtry = c(2, 3, 4, 6),
splitrule = "variance",
min.node.size = c(5, 10, 20)
)
Explanation: Hyperparameter tuning allows us to optimize the model for better accuracy and performance.
The Random Forest regression model is trained using the caret package with 500 trees and the specified hyperparameters.
set.seed(123)
rf_tuned <- caret::train(
x = train_data_reg[, regress_predictors],
y = train_data_reg$log_salary_in_usd,
method = "ranger",
trControl = train_control,
tuneGrid = tune_grid,
num.trees = 500,
importance = "impurity"
)
# Print tuning results and plot the model
rf_tuned
## Random Forest
##
## 10285 samples
## 12 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 8229, 8228, 8228, 8229, 8226
## Resampling results across tuning parameters:
##
## mtry min.node.size RMSE Rsquared MAE
## 2 5 0.3445002 0.2668814 0.2738430
## 2 10 0.3443686 0.2674749 0.2737891
## 2 20 0.3443949 0.2673466 0.2738431
## 3 5 0.3414060 0.2717050 0.2712548
## 3 10 0.3413903 0.2717812 0.2712255
## 3 20 0.3415064 0.2713525 0.2713196
## 4 5 0.3405589 0.2740167 0.2703682
## 4 10 0.3404469 0.2744761 0.2703107
## 4 20 0.3404662 0.2744189 0.2703318
## 6 5 0.3406348 0.2734709 0.2702730
## 6 10 0.3405054 0.2740338 0.2701894
## 6 20 0.3404036 0.2744281 0.2701400
##
## Tuning parameter 'splitrule' was held constant at a value of variance
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 6, splitrule = variance
## and min.node.size = 20.
plot(rf_tuned)
best_rf_reg <- rf_tuned$finalModel
Explanation: The caret package simplifies model training and hyperparameter tuning. The plot provides insights into how hyperparameter choices impact model performance.
Predictions are made on the test set, and actual values are compared to evaluate model performance.
rf_preds_log <- predict(rf_tuned, newdata = test_data_reg)
actuals_log <- test_data_reg$log_salary_in_usd
Explanation: This step evaluates the model’s ability to predict unseen data.
Performance metrics such as R² and RMSE are calculated on the log-transformed scale.
rss_rf <- sum((rf_preds_log - actuals_log)^2)
tss_rf <- sum((actuals_log - mean(train_data_reg$log_salary_in_usd))^2)
r2_rf <- 1 - rss_rf/tss_rf
rmse_rf <- sqrt(mean((rf_preds_log - actuals_log)^2))
Explanation: R² measures the proportion of variance explained by the model, while RMSE quantifies the average prediction error.
The model’s predictions are exponentiated to compare errors in the original salary scale.
rf_preds_exp <- exp(rf_preds_log)
actuals_exp <- exp(actuals_log)
rmse_rf_actual <- sqrt(mean((rf_preds_exp - actuals_exp)^2))
message(sprintf("Regression (Log) R²: %.4f", r2_rf))
## Regression (Log) R²: 0.3123
message(sprintf("Regression (Log) RMSE: %.4f", rmse_rf))
## Regression (Log) RMSE: 0.3363
Explanation: Converting predictions back to the original scale helps assess the model’s real-world applicability. RMSE on the salary scale reflects the average dollar deviation from true values.
The regression model achieves an R² of approximately 0.31, indicating
that 31% of the variance in log_salary_in_usd is explained
by the predictors. The RMSE on the log scale is ~0.336, These results
suggest moderate predictive power, with room for improvement through
additional features or advanced modeling techniques.
In this section, we analyze the outputs of the classification and regression models developed in previous sections. The performance metrics for each model are evaluated, and the results are interpreted in detail. Additionally, we include visualizations to provide deeper insights into model performance.
The classification model’s performance is evaluated using the confusion matrix and accuracy score. These metrics provide insights into the model’s ability to predict remote work availability.
# Display confusion matrix and accuracy
message("\n--- Classification Analysis ---")
##
## --- Classification Analysis ---
message("Confusion Matrix:")
## Confusion Matrix:
print(conf_mat_class)
## Actual
## Predicted 0 1
## 0 1671 688
## 1 83 130
message(sprintf("Accuracy: %.2f %%", accuracy_class * 100))
## Accuracy: 70.02 %
Explanation: - The confusion matrix shows the number of correct and incorrect predictions for each class (0 and 1). - The accuracy metric reflects the overall correctness of the model.
To visualize the confusion matrix, we create a heatmap to show the distribution of predictions.
library(ggplot2)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
# Prepare confusion matrix for plotting
conf_mat_df <- as.data.frame(as.table(conf_mat_class))
colnames(conf_mat_df) <- c("Predicted", "Actual", "Count")
# Plot the confusion matrix
ggplot(conf_mat_df, aes(x = Actual, y = Predicted, fill = Count)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "lightblue", high = "darkblue") +
labs(title = "Confusion Matrix for Classification Model",
x = "Actual",
y = "Predicted") +
theme_minimal()
Explanation: The heatmap visually represents the confusion matrix, highlighting areas where the model performs well or struggles.
The regression model’s performance is evaluated using R² and RMSE metrics on both the log-transformed and original salary scales. These metrics quantify the model’s ability to predict salaries.
# Display regression metrics
message("\n--- Regression Analysis ---")
##
## --- Regression Analysis ---
message(sprintf("R² (log scale): %.4f", r2_rf))
## R² (log scale): 0.3123
message(sprintf("RMSE (log scale): %.4f", rmse_rf))
## RMSE (log scale): 0.3363
message(sprintf("RMSE (actual scale): %.2f", rmse_rf_actual))
## RMSE (actual scale): 53045.24
Explanation: - R²: Measures the proportion of variance in the target variable explained by the model. - RMSE: Indicates the average prediction error on both log-transformed and original scales.
A scatter plot is used to compare the predicted and actual salaries on the original scale, highlighting the model’s accuracy.
# Create a data frame for plotting
results_df <- data.frame(
Predicted = rf_preds_exp,
Actual = actuals_exp
)
# Scatter plot of predicted vs. actual salaries
ggplot(results_df, aes(x = Actual, y = Predicted)) +
geom_point(alpha = 0.6, color = "blue") +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
theme_minimal() +
labs(title = "Predicted vs. Actual Salaries",
x = "Actual Salary (USD)",
y = "Predicted Salary (USD)")
Explanation: - Points close to the diagonal line indicate accurate predictions. - Deviations from the line highlight underpredictions or overpredictions.
While the models show reasonable performance, additional features or advanced modeling techniques could further enhance predictive accuracy and generalizability.
In this project, we did an extensive analysis of the dataset containing data science salary information. The main goals were data cleaning and transformation, exploration of important patterns within the data, and development of predictive models that classify remote work availability and predict salaries. The first step in the process, of course, was thorough cleansing and preparation of the dataset to take care of missing values problems, standardization of features, and generating new variables. This preprocessing ensured that the data was robust but properly primed for substantive analyses. Exploratory data analysis revealed important patterns, such as the salary, years of experience, company size, along with other important variables, showing how the availability of remote work correlated with these variables. These results have generated a basis by which predictive models can be built.
Two different machine-learning models were developed. First, a Random Forest Classifier was used to predict the availability of remote work. The model showed precisely that while certain trends in the availability of remote work could be detected, not all variability in the data can be explained by predictors. Second, a Random Forest Regressor was trained for the prediction of log-transformed salaries.
These results underline the complexity of the factors driving both salaries and the availability of remote work in data science. Although our models did capture important trends, such as the effects of experience level and company size on salaries, a large portion of variability still remains unexplained. This could very well indicate that additional features are needed, or maybe more complex approaches to modeling would help explain the findings and give more precise predictions of these outcomes.
In summary, this analysis has provided deep insights into the dynamics and trends of remote work-related salaries in the domain of data science. It underlined the needs of detailed data cleaning and feature engineering for the development of strong predictive models. As the models show moderate levels of success, they simultaneously expose the intrinsic challenges within modeling complex human resources data and suggest further areas of improvement and exploration.