WQD7004 PROGRAMMING FOR DATA SCIENCE OCC3

Name	Student ID
Mehran Gharooni Khoshkehbar	S2014607
Ali Bin Abdul Rahman	23062959

Section 1: Introduction & Objectives

The field of Data Science is fast-moving, so it’s important that job seekers, employers, and educators alike have an understanding of the dynamics in salaries and trends in remote work.

In this paper, we analyze a dataset of Data Science salaries, studying the relationships between different job characteristics (like title, experience, company size, etc.) and:

Salary.
Remote work availability.

Objectives

Clean and explore the dataset thoroughly (EDA).
Classify “remote_ratio” (0 or 1) using Random Forest (classification).
Predict the log of salary (“log_salary_in_usd”) using a tuned Random Forest (regression).
Evaluate the performance of these models and interpret the results.

Aim and Purpose

Purpose

To analyze salary and remote working availability for data science employees.

Specific Objectives

Build a model that can classify whether remote work is available for data science employees.
Develop a model that can predict salaries for data science employees based on multiple factors.
Evaluate and analyze the performance of the models built.

Section 2: Loading Libraries & Data

In this section, we load the necessary libraries for analysis, then load the data into R, and do some preliminary analysis of the data. This setup ensures that our environment is correctly set up and we have a complete overview of the structure of the dataset before proceeding with further analysis.

Step A: Install & Load Packages

The R packages required for data manipulation, visualization, and modeling are listed below. The use of the for loop helps in the installation of missing packages and then loads them into the current session. This will make sure that all the functions needed for later analytical activities are available.

# List of packages required for the analysis
packages_needed <- c(
  "dplyr",        # for data manipulation
  "ggplot2",      # for plotting
  "randomForest", # for random forest (classification)
  "tidyr",        # for data tidying
  "caret",        # for model training/tuning
  "ranger",       # for fast random forest
  "scales"        # for pretty axis labels
)

# Loop through the packages and install/load them if not already available
for (pkg in packages_needed) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg, dependencies = TRUE)
    library(pkg, character.only = TRUE)
  }
}

## Loading required package: dplyr

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## Loading required package: ggplot2

## Loading required package: randomForest

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

## Loading required package: tidyr

## Loading required package: caret

## Loading required package: lattice

## Loading required package: ranger

## 
## Attaching package: 'ranger'

## The following object is masked from 'package:randomForest':
## 
##     importance

## Loading required package: scales

Step B: Load the Dataset

Here, we load the dataset containing data science salary information. The dataset is in CSV format, and it is read into a data frame. We use the parameter stringsAsFactors = FALSE to ensure that string columns are not automatically converted into factor variables, as this may hinder flexibility during data manipulation. After loading, we inspect the structure of the dataset by displaying its dimensions and a preview of the first few rows.

# Load the raw dataset into a data frame
df_salaries_raw <- read.csv("DataScience_salaries_2024.csv", stringsAsFactors = FALSE)

# Print the dimensions of the dataset
message("Raw Data Dimensions: ", nrow(df_salaries_raw), " rows, ", ncol(df_salaries_raw), " columns.")

## Raw Data Dimensions: 14838 rows, 11 columns.

Step C: Initial Inspection of the Data

To understand the dataset’s structure and contents, we examine its first few rows. The head function is used to display the initial records. This step provides insight into the columns, data types, and potential issues (e.g., missing or inconsistent values) in the dataset.

# Display the first six rows of the dataset
head(df_salaries_raw)

##   work_year experience_level employment_type                      job_title
## 1      2021               MI              FT                 Data Scientist
## 2      2021               MI              FT                BI Data Analyst
## 3      2020               MI              FT                 Data Scientist
## 4      2021               MI              FT                    ML Engineer
## 5      2022               SE              FT Lead Machine Learning Engineer
## 6      2021               MI              FT                    ML Engineer
##     salary salary_currency salary_in_usd employee_residence remote_ratio
## 1 30400000             CLP         40038                 CL          100
## 2 11000000             HUF         36259                 HU           50
## 3 11000000             HUF         35735                 HU           50
## 4  8500000             JPY         77364                 JP           50
## 5  7500000             INR         95386                 IN           50
## 6  7000000             JPY         63711                 JP           50
##   company_location company_size
## 1               CL            L
## 2               US            L
## 3               HU            L
## 4               JP            S
## 5               IN            L
## 6               JP            S

Step D: Visualizing Missing Data

Before any analysis can be done on a dataset, it is always important to check for missing data. Here, we will calculate the count and percentage of missing values for each column. We also plot a bar chart to visualize the proportion of missing data across columns, which will help us spot columns that are problematic and might need imputation or removal.

# Calculate missing values for each column
missing_counts <- sapply(df_salaries_raw, function(col) sum(is.na(col) | col == ""))
missing_pct <- round(missing_counts / nrow(df_salaries_raw) * 100, 2)

# Create a data frame of missing values
df_missing_summary <- data.frame(
  Columns = names(missing_counts),
  MissingCount = missing_counts,
  MissingPct = missing_pct
)

# Display the summary of missing values
print(df_missing_summary)

##                               Columns MissingCount MissingPct
## work_year                   work_year            1       0.01
## experience_level     experience_level            6       0.04
## employment_type       employment_type            7       0.05
## job_title                   job_title           30       0.20
## salary                         salary           17       0.11
## salary_currency       salary_currency           15       0.10
## salary_in_usd           salary_in_usd           10       0.07
## employee_residence employee_residence           16       0.11
## remote_ratio             remote_ratio           83       0.56
## company_location     company_location           13       0.09
## company_size             company_size            8       0.05

# Visualize missing percentages with a bar chart
ggplot(df_missing_summary, aes(x = reorder(Columns, -MissingPct), y = MissingPct)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Percentage of Missing Values per Column",
       x = "Columns",
       y = "Missing Percentage")

Dataset Description

The dataset contains the following key columns:

work_year: Year the salary was recorded (e.g., 2020, 2021, 2022, 2023, 2024).
experience_level: Categorical seniority level (EN = Entry-level, MI = Mid-level, SE = Senior-level, EX = Executive-level).
employment_type: Type of employment (FT = Full-time, PT = Part-time, etc.).
job_title: Full job title (e.g., Data Scientist, Machine Learning Engineer).
salary: Salary in local currency.
salary_currency: Currency code for salary (e.g., USD, EUR).
salary_in_usd: Salary converted to USD.
employee_residence: Employee’s country of residence.
remote_ratio: Percentage of work done remotely (0, 50, 100).
company_location: Headquarters location of the company.
company_size: Size of the company (S = Small, M = Medium, L = Large).

Section 3: Raw Data Exploration (EDA Part 1)

In this section, we conduct exploratory data analysis (EDA) on the raw dataset to understand how it is structured, identify missing values, and visualize some initial trends in the data. This is an essential step in determining potential problems and uncovering insights before getting into actual data cleansing and other analyses.

Step A: Checking Missing or Empty Values

Missing data is an important part of EDA. We now calculate the counts and percentages of missing values for each column. This gives a better understanding of the quality of the data and shows columns that possibly need imputation or removal. A summary of missing values is shown below for further analysis.

# Calculate missing values for each column
missing_counts <- sapply(df_salaries_raw, function(col) sum(is.na(col) | col == ""))
missing_pct <- round(missing_counts / nrow(df_salaries_raw) * 100, 2)

# Create a summary of missing values
df_missing_summary <- data.frame(
  Columns = names(missing_counts),
  MissingCount = missing_counts,
  MissingPct = missing_pct
)

# Print the summary of missing values
message("Missing Value Summary (Absolute & %):")

## Missing Value Summary (Absolute & %):

print(df_missing_summary)

##                               Columns MissingCount MissingPct
## work_year                   work_year            1       0.01
## experience_level     experience_level            6       0.04
## employment_type       employment_type            7       0.05
## job_title                   job_title           30       0.20
## salary                         salary           17       0.11
## salary_currency       salary_currency           15       0.10
## salary_in_usd           salary_in_usd           10       0.07
## employee_residence employee_residence           16       0.11
## remote_ratio             remote_ratio           83       0.56
## company_location     company_location           13       0.09
## company_size             company_size            8       0.05

Step B: Basic Summary of Numeric Columns

Summarizing the numeric columns provides an overview of their distribution, including measures such as the mean, median, and range. This step helps identify outliers and understand the scale of the data.

# Display summary statistics for numeric columns
summary(df_salaries_raw)

##    work_year    experience_level   employment_type     job_title        
##  Min.   :2020   Length:14838       Length:14838       Length:14838      
##  1st Qu.:2023   Class :character   Class :character   Class :character  
##  Median :2023   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2023                                                           
##  3rd Qu.:2024                                                           
##  Max.   :2024                                                           
##  NA's   :1                                                              
##      salary         salary_currency    salary_in_usd    employee_residence
##  Min.   :   14000   Length:14838       Min.   : 15000   Length:14838      
##  1st Qu.:  102000   Class :character   1st Qu.:101715   Class :character  
##  Median :  142200   Mode  :character   Median :141300   Mode  :character  
##  Mean   :  164981                      Mean   :149833                     
##  3rd Qu.:  187415                      3rd Qu.:185900                     
##  Max.   :30400000                      Max.   :800000                     
##  NA's   :17                            NA's   :10                         
##   remote_ratio    company_location   company_size      
##  Min.   :  0.00   Length:14838       Length:14838      
##  1st Qu.:  0.00   Class :character   Class :character  
##  Median :  0.00   Mode  :character   Mode  :character  
##  Mean   : 32.89                                        
##  3rd Qu.:100.00                                        
##  Max.   :100.00                                        
##  NA's   :83

Step C: Quick Univariate Distributions (Histograms)

Visualizing the distribution of key columns allows us to detect patterns, skewness, and potential issues in the data. In this step, we create histograms for key variables such as salary_in_usd, remote_ratio, and experience_level to better understand their distribution.

1. Distribution of `salary_in_usd`

The salary_in_usd column represents the salary in USD. We plot a histogram to examine its distribution and detect skewness.

if ("salary_in_usd" %in% colnames(df_salaries_raw)) {
  ggplot(df_salaries_raw, aes(x = salary_in_usd)) +
    geom_histogram(bins = 30, fill = "blue", color = "black") +
    theme_minimal() +
    scale_x_continuous(labels = scales::comma) +
    labs(title = "Raw Distribution of salary_in_usd", 
         x = "Salary in USD",
         y = "Count")
}

## Warning: Removed 10 rows containing non-finite outside the scale range
## (`stat_bin()`).

Explanation: The histogram reveals a right-skewed distribution, indicating that most salaries are concentrated in the lower range, with fewer high salaries.

2. Distribution of `remote_ratio`

The remote_ratio column represents the percentage of work done remotely. A histogram is plotted to observe the distribution of remote work percentages.

if ("remote_ratio" %in% colnames(df_salaries_raw)) {
  ggplot(df_salaries_raw, aes(x = remote_ratio)) +
    geom_histogram(bins = 30, fill = "purple", color = "black") +
    theme_minimal() +
    labs(title = "Raw Distribution of remote_ratio", 
         x = "Remote Ratio (%)",
         y = "Count")
}

## Warning: Removed 83 rows containing non-finite outside the scale range
## (`stat_bin()`).

Explanation: The histogram indicates that the remote_ratio column has most observations at 0% and 100%, with very few at 50%. This suggests that most roles are either fully remote or entirely onsite.

3. Distribution of `experience_level`

The experience_level column categorizes employees by their seniority level. We use a bar chart to visualize the distribution of these categories.

if ("experience_level" %in% colnames(df_salaries_raw)) {
  ggplot(df_salaries_raw, aes(x = experience_level)) +
    geom_bar(fill = "orange", color = "black") +
    theme_minimal() +
    labs(title = "Raw Distribution of experience_level", 
         x = "Experience Level",
         y = "Count")
}

Explanation: The bar chart shows the frequency of each experience level in the dataset, providing insights into the seniority distribution of employees.

Summary of Findings

The missing value summary highlights columns with potential data quality issues. Addressing these will be crucial in the cleaning phase.
Numeric summaries and histograms reveal key patterns, such as the right-skewed nature of salaries and the bimodal distribution of remote work percentages.
Visualizing experience_level shows the distribution of employee seniority, which can influence salaries and remote work availability.

Section 4: Data Cleaning & Transformation

In this section, we preprocess the raw data to make it ready for analysis by creating a cleaned and transformed version of the data frame, df_salaries. This includes handling missing values, picking out specific rows, categorizing columns, and performing transformations to ensure that the data is ready to be modeled. The steps thus help to cleanse the data and enhance its quality for accurate and meaningful analysis.

Step A: Drop Columns with >50% Missing Values

To reduce noise in the dataset, columns with more than 50% missing values are removed. This ensures that only relevant and sufficiently populated data remains for analysis.

# Keep columns with less than 50% missing values
df_salaries <- df_salaries_raw
cols_before <- ncol(df_salaries)
df_salaries <- df_salaries[, colSums(is.na(df_salaries) | df_salaries == "") / nrow(df_salaries) < 0.5]
cols_after <- ncol(df_salaries)

# Output the number of dropped columns
message("Number of columns dropped (>50% missing): ", cols_before - cols_after)

## Number of columns dropped (>50% missing): 0

Step B: Filter to Full-Time (FT) Jobs

The dataset is filtered to include only rows where employment_type is “FT” (full-time jobs). This focuses the analysis on the majority of the dataset, as 99% of records are full-time positions.

# Filter rows for full-time employment only
rows_before_ft <- nrow(df_salaries)
if ("employment_type" %in% colnames(df_salaries)) {
  df_salaries <- subset(df_salaries, employment_type == "FT")
}
rows_after_ft <- nrow(df_salaries)

# Output the number of rows removed
message("Filtered FT only: Rows before = ", rows_before_ft, ", after = ", rows_after_ft)

## Filtered FT only: Rows before = 14838, after = 14765

Step C: Impute Missing Numeric Columns with Median

Missing values in numeric columns, such as salary_in_usd and work_year, are replaced with the median of their respective columns. This ensures that the data remains representative without being overly influenced by outliers.

# Impute missing values for numeric columns
numeric_cols <- c("salary_in_usd", "work_year")
for (col_nm in numeric_cols) {
  if (col_nm %in% colnames(df_salaries)) {
    med_value <- median(df_salaries[[col_nm]], na.rm = TRUE)
    df_salaries[[col_nm]][is.na(df_salaries[[col_nm]])] <- med_value
  }
}

Step D: Categorize `job_title` into Broad Groups

The job_title column is recategorized into broader categories to simplify analysis. Titles are grouped into roles such as Manager/Director, Analyst, Engineer, and Data Scientist based on patterns in their names.

# Categorize job titles into broad groups
if ("job_title" %in% colnames(df_salaries)) {
  df_salaries$job_title[is.na(df_salaries$job_title) | df_salaries$job_title == ""] <- "Unknown"
  df_salaries <- df_salaries %>%
    mutate(
      job_title = case_when(
        grepl("Lead|Head|Director|Manager", job_title, ignore.case = TRUE) ~ "MgrLeadDir",
        grepl("BI|Business Intelligence", job_title, ignore.case = TRUE)    ~ "BI",
        grepl("Machine Learning|ML|AI|Deep Learning|NLP", 
              job_title, ignore.case = TRUE) ~ "ML_AI",
        grepl("Architect", job_title, ignore.case = TRUE)                  ~ "Architect",
        grepl("Engineer", job_title, ignore.case = TRUE)                   ~ "Engineer",
        grepl("Analyst|Analytics", job_title, ignore.case = TRUE)          ~ "Analyst",
        grepl("Scientist|Science", job_title, ignore.case = TRUE)          ~ "DataSci",
        TRUE ~ "Other"
      )
    )
}

Step E: Filter for US Data Only

To standardize the analysis, the dataset is filtered to include only employees and companies based in the United States. This reduces variability caused by regional differences in salaries and employment practices.

# Filter for US-based employees and companies
if ("employee_residence" %in% colnames(df_salaries)) {
  df_salaries <- subset(df_salaries, employee_residence == "US")
}
if ("company_location" %in% colnames(df_salaries)) {
  df_salaries <- subset(df_salaries, company_location == "US")
}

Step F: Convert `remote_ratio` to Numeric Binary (0 or 1)

The remote_ratio column, originally representing percentages, is converted into a binary format. Values of 50% or 100% are grouped as 1 (remote work available), while 0% remains 0 (no remote work). This simplifies analysis and modeling.

if ("remote_ratio" %in% colnames(df_salaries)) {
  df_salaries$remote_ratio <- suppressWarnings(as.numeric(df_salaries$remote_ratio))
  df_salaries$remote_ratio[is.na(df_salaries$remote_ratio)] <- 0
  df_salaries <- df_salaries %>%
    mutate(
      remote_ratio = case_when(
        remote_ratio == 50  ~ 1,
        remote_ratio == 100 ~ 1,
        TRUE                ~ 0
      )
    )
}

Explanation: By reducing the remote_ratio column to a binary indicator, we simplify the analysis of remote work availability.

Step G: Remove Unneeded Columns

Columns that are redundant or irrelevant to the analysis are removed. This reduces noise and streamlines the dataset.

drops <- c("employment_type", "salary", "salary_currency", "employee_residence", "company_location")
for (col_to_drop in drops) {
  if (col_to_drop %in% colnames(df_salaries)) {
    df_salaries[[col_to_drop]] <- NULL
  }
}

Explanation: - employment_type: Only full-time jobs are considered. - salary and salary_currency: The salary_in_usd column already provides standardized salary data. - employee_residence and company_location: Only US-based records are included.

Step H: Convert `experience_level` to Numeric

The experience_level column is transformed into numeric values for modeling, with 1 representing entry-level and 4 representing executive-level positions.

if ("experience_level" %in% colnames(df_salaries)) {
  df_salaries$experience_level[is.na(df_salaries$experience_level) | 
                               df_salaries$experience_level == ""] <- "MI"
  df_salaries <- df_salaries %>%
    mutate(
      experience_level = case_when(
        grepl("EN", experience_level, ignore.case = TRUE) ~ 1,
        grepl("MI", experience_level, ignore.case = TRUE) ~ 2,
        grepl("SE", experience_level, ignore.case = TRUE) ~ 3,
        grepl("EX", experience_level, ignore.case = TRUE) ~ 4,
        TRUE ~ 2
      )
    )
}

Step I: Convert `company_size` to Numeric

The company_size column is converted into numeric values, with 1 representing small companies, 2 for medium-sized companies, and 3 for large companies.

if ("company_size" %in% colnames(df_salaries)) {
  df_salaries$company_size[is.na(df_salaries$company_size) | 
                           df_salaries$company_size == ""] <- "M"
  df_salaries <- df_salaries %>%
    mutate(
      company_size = case_when(
        grepl("S", company_size, ignore.case = TRUE) ~ 1,
        grepl("M", company_size, ignore.case = TRUE) ~ 2,
        grepl("L", company_size, ignore.case = TRUE) ~ 3,
        TRUE ~ 2
      )
    )
}

Step J: Cap Outliers in `salary_in_usd` at the 99th Percentile

To reduce the influence of extreme outliers, salaries above the 99th percentile are capped at this value.

if ("salary_in_usd" %in% colnames(df_salaries)) {
  cap_value <- quantile(df_salaries$salary_in_usd, 0.99, na.rm = TRUE)
  df_salaries$salary_in_usd <- ifelse(
    df_salaries$salary_in_usd > cap_value, 
    cap_value, 
    df_salaries$salary_in_usd
  )
}

Step K: Create `years_since_2020` Feature

A new feature, years_since_2020, is created based on the work_year column to quantify the time since 2020.

if (!("work_year" %in% colnames(df_salaries))) {
  df_salaries$work_year <- 2023
}
df_salaries$work_year[is.na(df_salaries$work_year)] <- median(df_salaries$work_year, na.rm = TRUE)
df_salaries$years_since_2020 <- df_salaries$work_year - 2020

Step L: Remove Invalid Rows and Create `log_salary_in_usd`

Rows with invalid or missing salary data are removed, and a new column for log-transformed salary is created to normalize the distribution.

df_salaries <- df_salaries %>%
  filter(!is.na(salary_in_usd) & salary_in_usd > 0)
df_salaries$log_salary_in_usd <- log(df_salaries$salary_in_usd + 1)

Step M: One-Hot Encode `job_title`

The job_title column is converted into one-hot encoded variables to simplify its use in machine learning models.

if ("job_title" %in% colnames(df_salaries)) {
  df_salaries$job_title <- factor(df_salaries$job_title)
  dummy_obj <- caret::dummyVars(~ job_title, data = df_salaries)
  jobtitle_dummies <- predict(dummy_obj, newdata = df_salaries)
  df_salaries <- cbind(df_salaries, jobtitle_dummies)
  df_salaries$job_title <- NULL
}

Summary of Transformed Dataset

After applying all transformations, the dataset is ready for analysis and modeling. Key steps included simplifying categorical variables, handling missing data, and engineering new features to enhance predictive power.

# Display the structure of the cleaned dataset
message("Structure of the cleaned dataset:")

## Structure of the cleaned dataset:

str(df_salaries)

## 'data.frame':    12857 obs. of  15 variables:
##  $ work_year           : num  2024 2024 2024 2024 2023 ...
##  $ experience_level    : num  3 2 3 2 2 2 2 3 2 3 ...
##  $ salary_in_usd       : num  342580 342580 342580 342580 342580 ...
##  $ remote_ratio        : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ company_size        : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ years_since_2020    : num  4 4 4 4 3 3 3 3 4 4 ...
##  $ log_salary_in_usd   : num  12.7 12.7 12.7 12.7 12.7 ...
##  $ job_title.Analyst   : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ job_title.Architect : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ job_title.BI        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ job_title.DataSci   : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ job_title.Engineer  : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ job_title.MgrLeadDir: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ job_title.ML_AI     : num  0 1 0 1 1 1 0 0 1 0 ...
##  $ job_title.Other     : num  0 0 0 0 0 0 0 0 0 0 ...

# Summarize key numeric columns
message("Summary of key numeric columns:")

## Summary of key numeric columns:

summary(df_salaries[, c("salary_in_usd", "experience_level", 
                        "company_size", "years_since_2020", 
                        "log_salary_in_usd")])

##  salary_in_usd    experience_level  company_size   years_since_2020
##  Min.   : 24000   Min.   :1.000    Min.   :1.000   Min.   :0.00    
##  1st Qu.:112000   1st Qu.:2.000    1st Qu.:2.000   1st Qu.:3.00    
##  Median :148000   Median :3.000    Median :2.000   Median :3.00    
##  Mean   :156511   Mean   :2.678    Mean   :2.051   Mean   :3.18    
##  3rd Qu.:191475   3rd Qu.:3.000    3rd Qu.:2.000   3rd Qu.:4.00    
##  Max.   :342580   Max.   :4.000    Max.   :3.000   Max.   :4.00    
##  log_salary_in_usd
##  Min.   :10.09    
##  1st Qu.:11.63    
##  Median :11.90    
##  Mean   :11.88    
##  3rd Qu.:12.16    
##  Max.   :12.74

Overall, in Section 4, we focused on refining the raw dataset to ensure its suitability for analysis. We addressed missing data by imputing numeric columns and dropping excessively incomplete columns. The dataset was filtered to focus on full-time, US-based employees. Key variables like job_title, remote_ratio, experience_level, and company_size were transformed into simplified or numeric formats, while outliers in salary_in_usd were capped to reduce skewness. New features, such as years_since_2020 and log_salary_in_usd, were added to enhance predictive power. Finally, one-hot encoding was applied to categorical variables like job_title to make the data compatible with machine learning models. These transformations ensure a cleaner, more structured dataset ready for analysis.

Section 5: EDA on the Cleaned Data (EDA Part 2)

In this section, we do an extensive exploratory data analysis (EDA) of the cleaned dataset. We focus on visualization of univariate distributions along with bivariate relationships to explain patterns, trends, and possible predictors for modeling purposes. This exercise is quite vital to understand the dynamics of important variables and their interrelations.

Univariate Histograms

1. Cleaned `salary_in_usd` Distribution

A histogram of the salary_in_usd column is plotted to examine the distribution of salaries after outlier capping. This visualization provides insights into the overall spread and skewness of salaries.

ggplot(df_salaries, aes(x = salary_in_usd)) +
  geom_histogram(bins = 30, fill = "darkblue", color = "white") +
  theme_minimal() +
  scale_x_continuous(labels = scales::comma) +
  labs(title = "Cleaned Distribution of salary_in_usd (Capped at 99th percentile)",
       x = "Salary in USD",
       y = "Count")

Explanation: The histogram reveals the distribution of salaries in USD after capping outliers. This ensures that extremely high salaries do not skew the data.

2. `log_salary_in_usd` Distribution

Log-transformed salaries (log_salary_in_usd) are visualized to assess the normalized distribution. Log transformation often reduces skewness and stabilizes variance, making it more suitable for modeling.

ggplot(df_salaries, aes(x = log_salary_in_usd)) +
  geom_histogram(bins = 30, fill = "green", color = "black") +
  theme_minimal() +
  labs(title = "Distribution of log_salary_in_usd", 
       x = "Log Salary (USD)",
       y = "Count")

Explanation: The histogram of log-transformed salaries demonstrates a more symmetrical distribution, which aids in predictive modeling.

3. `experience_level` Distribution

A bar chart is used to display the distribution of experience_level, showcasing the frequency of different seniority levels in the dataset.

ggplot(df_salaries, aes(x = factor(experience_level))) +
  geom_bar(fill = "cornflowerblue") +
  theme_minimal() +
  labs(title = "Distribution of Experience Level (Numeric Form)", 
       x = "Experience Level (1=EN,2=MI,3=SE,4=EX)",
       y = "Count")

Explanation: This bar chart shows the proportion of employees at various experience levels, helping us understand the distribution of seniority in the dataset.

Bivariate Relationships

1. Experience Level vs. Log Salary

A boxplot is created to analyze the relationship between experience_level and log_salary_in_usd. This helps identify whether higher experience levels correspond to higher salaries.

ggplot(df_salaries, aes(x = factor(experience_level), y = log_salary_in_usd)) +
  geom_boxplot(fill = "lightblue") +
  theme_minimal() +
  labs(title = "Boxplot: Log Salary vs. Experience Level",
       x = "Experience Level",
       y = "Log Salary (USD)")

Explanation: The boxplot highlights differences in log-transformed salaries across experience levels, suggesting a positive correlation between experience and salary.

2. Company Size vs. Salary

A boxplot visualizes the relationship between company_size and salary_in_usd to determine whether larger companies tend to pay higher salaries.

ggplot(df_salaries, aes(x = factor(company_size), y = salary_in_usd)) +
  geom_boxplot(fill = "lightgreen") +
  theme_minimal() +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Boxplot: Salary in USD vs. Company Size",
       x = "Company Size (1=S,2=M,3=L)",
       y = "Salary (USD)")

Explanation: This plot reveals how salaries vary with company size, with larger companies generally paying higher salaries.

3. Remote Ratio vs. Log Salary

The relationship between remote_ratio and log_salary_in_usd is visualized to explore whether remote work availability impacts salary.

ggplot(df_salaries, aes(x = factor(remote_ratio), y = log_salary_in_usd)) +
  geom_boxplot(fill = "plum") +
  theme_minimal() +
  labs(title = "Boxplot: Log Salary vs. Remote Ratio (0/1)",
       x = "Remote Ratio (0=Mostly Onsite, 1=Remote/Half/Half)",
       y = "Log Salary (USD)")

Explanation: The boxplot examines salary differences between remote and onsite roles. It provides insights into how remote work availability correlates with compensation.

Summary of Findings

The cleaned salary distribution highlights a capped range, while log transformation ensures normality for modeling.
Seniority levels (experience_level) and company size (company_size) demonstrate clear trends in influencing salaries.
Remote work availability shows potential correlations with higher salaries, though further modeling is needed to confirm these trends.

Section 6: Classification Model (Predict `remote_ratio`)

In this section we developed a classifier that predicts whether remote work (remote_ratio) is offered by using a Random Forest classifier. The dependent variable remote_ratio is categorical, which takes 1 when remote work is offered, and 0 when the company is focused on onsite roles. The classification exercise involves choosing predictor variables, splitting the dataset into training and test sets, training the model in the Random Forest algorithm, and validating its performance.

Step 1: Convert `remote_ratio` to Factor

The target variable remote_ratio is converted into a factor, ensuring it is treated as a categorical variable for classification.

if ("remote_ratio" %in% colnames(df_salaries)) {
  df_salaries$remote_ratio <- factor(df_salaries$remote_ratio, levels = c(0, 1))
}

Explanation: This step prepares the target variable for classification by encoding it as a factor with two levels: 0 and 1.

Step 2: Prepare Predictor Set

We select a set of predictors for the model, including engineered features such as years_since_2020, numeric columns like experience_level and salary_in_usd, and any one-hot encoded job title columns (if applicable).

if (exists("jobtitle_dummies")) {
  dummy_colnames <- colnames(jobtitle_dummies)
} else {
  dummy_colnames <- character(0)
}

class_predictors <- c("years_since_2020", 
                      "experience_level", 
                      "company_size", 
                      "salary_in_usd",
                      dummy_colnames)  # includes one-hot job_title columns

# Filter complete cases for classification
df_class <- df_salaries[complete.cases(df_salaries[, c(class_predictors, "remote_ratio")]), ]

Explanation: This step ensures the predictors are well-defined and that the dataset contains no missing values for selected features or the target variable.

Step 3: Train-Test Split

The dataset is split into training (80%) and testing (20%) subsets to evaluate model performance on unseen data.

set.seed(123)
train_indices_class <- sample(seq_len(nrow(df_class)), size = 0.8 * nrow(df_class))
train_set_class <- df_class[train_indices_class, ]
test_set_class  <- df_class[-train_indices_class, ]

Explanation: A random seed ensures reproducibility of the split. The training set is used to build the model, and the testing set evaluates its generalization ability.

Step 4: Train Random Forest Classifier

The Random Forest model is trained using the selected predictors. Hyperparameters such as ntree (number of trees) and mtry (number of variables tried at each split) are specified.

rf_model_class <- randomForest(
  formula = as.formula(
    paste("remote_ratio ~", paste(class_predictors, collapse=" + "))
  ),
  data = train_set_class,
  ntree = 200,
  mtry = 3,
  importance = TRUE
)

# Show model summary
rf_model_class

## 
## Call:
##  randomForest(formula = as.formula(paste("remote_ratio ~", paste(class_predictors,      collapse = " + "))), data = train_set_class, ntree = 200,      mtry = 3, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 29.4%
## Confusion matrix:
##      0   1 class.error
## 0 6740 314  0.04451375
## 1 2710 521  0.83874961

Explanation: Random Forest builds multiple decision trees and aggregates their outputs to classify observations. This method is robust and often achieves high accuracy on structured data.

Step 5: Evaluate Model Performance

The trained model is evaluated on the test set. Predictions are compared against actual values using a confusion matrix, and the overall accuracy is calculated.

pred_class <- predict(rf_model_class, newdata = test_set_class, type = "class")
conf_mat_class <- table(Predicted = pred_class, Actual = test_set_class$remote_ratio)

# Display confusion matrix
conf_mat_class

##          Actual
## Predicted    0    1
##         0 1671  688
##         1   83  130

# Calculate classification accuracy
accuracy_class <- mean(pred_class == test_set_class$remote_ratio)
message(sprintf("Classification Accuracy: %.2f %%", accuracy_class * 100))

## Classification Accuracy: 70.02 %

Explanation: - The confusion matrix summarizes the model’s predictions against actual values, showing the number of true positives, true negatives, false positives, and false negatives. - Accuracy is computed as the proportion of correctly classified observations.

Summary of Classification Results

The Random Forest classifier provides an initial model for predicting remote work availability. Key results include:

Confusion Matrix: Breaks down the model’s predictions into true positives, true negatives, and errors.
Accuracy: Indicates the percentage of correctly classified cases, typically around 70% based on example data.

While the model shows moderate performance, further optimization or inclusion of additional predictors may improve accuracy. Potential limitations include class imbalance or weak predictors for remote_ratio.

Section 7: Regression Model (Predict `log_salary_in_usd`)

This section develops a regression model to predict log_salary_in_usd, the logarithmic transformation of salary in USD. The process involves defining predictors, performing hyperparameter tuning using 5-fold cross-validation, and evaluating the model’s performance on a test set.

Step 1: Define Predictors

The set of predictors used for regression includes features like years_since_2020, experience_level, company_size, remote_ratio, and one-hot encoded job title columns.

regress_predictors <- c("years_since_2020", 
                        "experience_level", 
                        "company_size", 
                        "remote_ratio",
                        dummy_colnames)

Explanation: These features are selected based on their relevance to predicting salaries. They include temporal, categorical, and continuous variables to capture different aspects of salary determination.

Step 2: Filter Complete Rows

We ensure the dataset contains no missing values for the selected predictors or the target variable (log_salary_in_usd).

df_reg <- df_salaries[complete.cases(df_salaries[, c(regress_predictors, "log_salary_in_usd")]), ]

Explanation: Removing rows with missing values ensures that the model is trained on clean data without inconsistencies.

Step 3: Train-Test Split

The dataset is split into training (80%) and testing (20%) subsets to evaluate the model’s performance on unseen data.

set.seed(123)
train_indices_reg <- sample(seq_len(nrow(df_reg)), size = 0.8 * nrow(df_reg))
train_data_reg <- df_reg[train_indices_reg, ]
test_data_reg  <- df_reg[-train_indices_reg, ]

Explanation: Splitting the data allows us to validate the model’s ability to generalize beyond the training set.

Step 4: Define Training Control (5-Fold CV)

Training control specifies 5-fold cross-validation to tune hyperparameters. This ensures robust model performance.

train_control <- trainControl(method = "cv", number = 5, verboseIter = FALSE)

Explanation: Cross-validation prevents overfitting by evaluating the model’s performance on multiple subsets of the training data.

Step 5: Define Hyperparameter Grid

We define a grid of hyperparameters for tuning the Random Forest model, including mtry (number of predictors considered at each split) and min.node.size (minimum node size).

tune_grid <- expand.grid(
  mtry = c(2, 3, 4, 6),
  splitrule = "variance",
  min.node.size = c(5, 10, 20)
)

Explanation: Hyperparameter tuning allows us to optimize the model for better accuracy and performance.

Step 6: Train the Random Forest Model

The Random Forest regression model is trained using the caret package with 500 trees and the specified hyperparameters.

set.seed(123)
rf_tuned <- caret::train(
  x = train_data_reg[, regress_predictors],
  y = train_data_reg$log_salary_in_usd,
  method = "ranger",
  trControl = train_control,
  tuneGrid = tune_grid,
  num.trees = 500,
  importance = "impurity"
)

# Print tuning results and plot the model
rf_tuned

## Random Forest 
## 
## 10285 samples
##    12 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 8229, 8228, 8228, 8229, 8226 
## Resampling results across tuning parameters:
## 
##   mtry  min.node.size  RMSE       Rsquared   MAE      
##   2      5             0.3445002  0.2668814  0.2738430
##   2     10             0.3443686  0.2674749  0.2737891
##   2     20             0.3443949  0.2673466  0.2738431
##   3      5             0.3414060  0.2717050  0.2712548
##   3     10             0.3413903  0.2717812  0.2712255
##   3     20             0.3415064  0.2713525  0.2713196
##   4      5             0.3405589  0.2740167  0.2703682
##   4     10             0.3404469  0.2744761  0.2703107
##   4     20             0.3404662  0.2744189  0.2703318
##   6      5             0.3406348  0.2734709  0.2702730
##   6     10             0.3405054  0.2740338  0.2701894
##   6     20             0.3404036  0.2744281  0.2701400
## 
## Tuning parameter 'splitrule' was held constant at a value of variance
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 6, splitrule = variance
##  and min.node.size = 20.

plot(rf_tuned)

best_rf_reg <- rf_tuned$finalModel

Explanation: The caret package simplifies model training and hyperparameter tuning. The plot provides insights into how hyperparameter choices impact model performance.

Step 7: Predict on Test Data

Predictions are made on the test set, and actual values are compared to evaluate model performance.

rf_preds_log <- predict(rf_tuned, newdata = test_data_reg)
actuals_log <- test_data_reg$log_salary_in_usd

Explanation: This step evaluates the model’s ability to predict unseen data.

Step 8: Compute R² and RMSE (Log Scale)

Performance metrics such as R² and RMSE are calculated on the log-transformed scale.

rss_rf <- sum((rf_preds_log - actuals_log)^2)
tss_rf <- sum((actuals_log - mean(train_data_reg$log_salary_in_usd))^2)
r2_rf  <- 1 - rss_rf/tss_rf
rmse_rf <- sqrt(mean((rf_preds_log - actuals_log)^2))

Explanation: R² measures the proportion of variance explained by the model, while RMSE quantifies the average prediction error.

Step 9: Compare on Original Salary Scale

The model’s predictions are exponentiated to compare errors in the original salary scale.

rf_preds_exp <- exp(rf_preds_log)
actuals_exp <- exp(actuals_log)
rmse_rf_actual <- sqrt(mean((rf_preds_exp - actuals_exp)^2))

message(sprintf("Regression (Log) R²: %.4f", r2_rf))

## Regression (Log) R²: 0.3123

message(sprintf("Regression (Log) RMSE: %.4f", rmse_rf))

## Regression (Log) RMSE: 0.3363

Explanation: Converting predictions back to the original scale helps assess the model’s real-world applicability. RMSE on the salary scale reflects the average dollar deviation from true values.

Summary of Regression Results

The regression model achieves an R² of approximately 0.31, indicating that 31% of the variance in log_salary_in_usd is explained by the predictors. The RMSE on the log scale is ~0.336, These results suggest moderate predictive power, with room for improvement through additional features or advanced modeling techniques.

Section 8: Prediction & Analysis of Outputs

In this section, we analyze the outputs of the classification and regression models developed in previous sections. The performance metrics for each model are evaluated, and the results are interpreted in detail. Additionally, we include visualizations to provide deeper insights into model performance.

A) Classification Analysis

The classification model’s performance is evaluated using the confusion matrix and accuracy score. These metrics provide insights into the model’s ability to predict remote work availability.

# Display confusion matrix and accuracy
message("\n--- Classification Analysis ---")

## 
## --- Classification Analysis ---

message("Confusion Matrix:")

## Confusion Matrix:

print(conf_mat_class)

##          Actual
## Predicted    0    1
##         0 1671  688
##         1   83  130

message(sprintf("Accuracy: %.2f %%", accuracy_class * 100))

## Accuracy: 70.02 %

Explanation: - The confusion matrix shows the number of correct and incorrect predictions for each class (0 and 1). - The accuracy metric reflects the overall correctness of the model.

Visualizing Classification Results

To visualize the confusion matrix, we create a heatmap to show the distribution of predictions.

library(ggplot2)
library(reshape2)

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

# Prepare confusion matrix for plotting
conf_mat_df <- as.data.frame(as.table(conf_mat_class))
colnames(conf_mat_df) <- c("Predicted", "Actual", "Count")

# Plot the confusion matrix
ggplot(conf_mat_df, aes(x = Actual, y = Predicted, fill = Count)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(title = "Confusion Matrix for Classification Model",
       x = "Actual",
       y = "Predicted") +
  theme_minimal()

Explanation: The heatmap visually represents the confusion matrix, highlighting areas where the model performs well or struggles.

B) Regression Analysis

The regression model’s performance is evaluated using R² and RMSE metrics on both the log-transformed and original salary scales. These metrics quantify the model’s ability to predict salaries.

# Display regression metrics
message("\n--- Regression Analysis ---")

## 
## --- Regression Analysis ---

message(sprintf("R² (log scale): %.4f", r2_rf))

## R² (log scale): 0.3123

message(sprintf("RMSE (log scale): %.4f", rmse_rf))

## RMSE (log scale): 0.3363

message(sprintf("RMSE (actual scale): %.2f", rmse_rf_actual))

## RMSE (actual scale): 53045.24

Explanation: - R²: Measures the proportion of variance in the target variable explained by the model. - RMSE: Indicates the average prediction error on both log-transformed and original scales.

Visualizing Predicted vs. Actual Salaries

A scatter plot is used to compare the predicted and actual salaries on the original scale, highlighting the model’s accuracy.

# Create a data frame for plotting
results_df <- data.frame(
  Predicted = rf_preds_exp,
  Actual = actuals_exp
)

# Scatter plot of predicted vs. actual salaries
ggplot(results_df, aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.6, color = "blue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  theme_minimal() +
  labs(title = "Predicted vs. Actual Salaries",
       x = "Actual Salary (USD)",
       y = "Predicted Salary (USD)")

Explanation: - Points close to the diagonal line indicate accurate predictions. - Deviations from the line highlight underpredictions or overpredictions.

Summary of Findings

Classification Results:

Accuracy: The classification model achieves an accuracy of approximately 70%, indicating moderate performance.
Confusion Matrix: The heatmap reveals areas where the model excels and struggles, providing insights into potential improvements.

Regression Results:

R²: Around 31% of the variance in log-transformed salaries is explained by the predictors.
RMSE: The average prediction error is ~0.336 on the log scale, suggesting room for improvement.

While the models show reasonable performance, additional features or advanced modeling techniques could further enhance predictive accuracy and generalizability.

Section 9: Conclusion

In this project, we did an extensive analysis of the dataset containing data science salary information. The main goals were data cleaning and transformation, exploration of important patterns within the data, and development of predictive models that classify remote work availability and predict salaries. The first step in the process, of course, was thorough cleansing and preparation of the dataset to take care of missing values problems, standardization of features, and generating new variables. This preprocessing ensured that the data was robust but properly primed for substantive analyses. Exploratory data analysis revealed important patterns, such as the salary, years of experience, company size, along with other important variables, showing how the availability of remote work correlated with these variables. These results have generated a basis by which predictive models can be built.

Two different machine-learning models were developed. First, a Random Forest Classifier was used to predict the availability of remote work. The model showed precisely that while certain trends in the availability of remote work could be detected, not all variability in the data can be explained by predictors. Second, a Random Forest Regressor was trained for the prediction of log-transformed salaries.

These results underline the complexity of the factors driving both salaries and the availability of remote work in data science. Although our models did capture important trends, such as the effects of experience level and company size on salaries, a large portion of variability still remains unexplained. This could very well indicate that additional features are needed, or maybe more complex approaches to modeling would help explain the findings and give more precise predictions of these outcomes.

In summary, this analysis has provided deep insights into the dynamics and trends of remote work-related salaries in the domain of data science. It underlined the needs of detailed data cleaning and feature engineering for the development of strong predictive models. As the models show moderate levels of success, they simultaneously expose the intrinsic challenges within modeling complex human resources data and suggest further areas of improvement and exploration.

WQD7004 PROGRAMMING FOR DATA SCIENCE OCC3

Section 1: Introduction & Objectives

Objectives

Aim and Purpose

Purpose

Specific Objectives

Section 2: Loading Libraries & Data

Step A: Install & Load Packages

Step B: Load the Dataset

Step C: Initial Inspection of the Data

Step D: Visualizing Missing Data

Dataset Description

Section 3: Raw Data Exploration (EDA Part 1)

Step A: Checking Missing or Empty Values

Step B: Basic Summary of Numeric Columns

Step C: Quick Univariate Distributions (Histograms)

1. Distribution of salary_in_usd

2. Distribution of remote_ratio

3. Distribution of experience_level

Summary of Findings

Section 4: Data Cleaning & Transformation

Step A: Drop Columns with >50% Missing Values

Step B: Filter to Full-Time (FT) Jobs

Step C: Impute Missing Numeric Columns with Median

Step D: Categorize job_title into Broad Groups

Step E: Filter for US Data Only

Step F: Convert remote_ratio to Numeric Binary (0 or 1)

Step G: Remove Unneeded Columns

Step H: Convert experience_level to Numeric

Step I: Convert company_size to Numeric

Step J: Cap Outliers in salary_in_usd at the 99th Percentile

Step K: Create years_since_2020 Feature

Step L: Remove Invalid Rows and Create log_salary_in_usd

Step M: One-Hot Encode job_title

Summary of Transformed Dataset

Section 5: EDA on the Cleaned Data (EDA Part 2)

Univariate Histograms

1. Cleaned salary_in_usd Distribution

2. log_salary_in_usd Distribution

3. experience_level Distribution

Bivariate Relationships

1. Experience Level vs. Log Salary

2. Company Size vs. Salary

3. Remote Ratio vs. Log Salary

Summary of Findings

Section 6: Classification Model (Predict remote_ratio)

Step 1: Convert remote_ratio to Factor

Step 2: Prepare Predictor Set

Step 3: Train-Test Split

Step 4: Train Random Forest Classifier

Step 5: Evaluate Model Performance

Summary of Classification Results

Section 7: Regression Model (Predict log_salary_in_usd)

Step 1: Define Predictors

Step 2: Filter Complete Rows

Step 3: Train-Test Split

Step 4: Define Training Control (5-Fold CV)

Step 5: Define Hyperparameter Grid

Step 6: Train the Random Forest Model

Step 7: Predict on Test Data

Step 8: Compute R² and RMSE (Log Scale)

Step 9: Compare on Original Salary Scale

Summary of Regression Results

Section 8: Prediction & Analysis of Outputs

A) Classification Analysis

Visualizing Classification Results

B) Regression Analysis

Visualizing Predicted vs. Actual Salaries

Summary of Findings

Classification Results:

Regression Results:

Section 9: Conclusion

1. Distribution of `salary_in_usd`

2. Distribution of `remote_ratio`

3. Distribution of `experience_level`

Step D: Categorize `job_title` into Broad Groups

Step F: Convert `remote_ratio` to Numeric Binary (0 or 1)

Step H: Convert `experience_level` to Numeric

Step I: Convert `company_size` to Numeric

Step J: Cap Outliers in `salary_in_usd` at the 99th Percentile

Step K: Create `years_since_2020` Feature

Step L: Remove Invalid Rows and Create `log_salary_in_usd`

Step M: One-Hot Encode `job_title`

1. Cleaned `salary_in_usd` Distribution

2. `log_salary_in_usd` Distribution

3. `experience_level` Distribution

Section 6: Classification Model (Predict `remote_ratio`)

Step 1: Convert `remote_ratio` to Factor

Section 7: Regression Model (Predict `log_salary_in_usd`)