Introduction

The title of the dataset chosen is ‘Data Science Salaries Dataset’. The dataset is regarding the salaries of positions in the data science field. Included in the dataset are the year when the data was collected, the work experience level, type of employment, job title, salary, currency in which the salary given, salary after conversion to USD, location of employees’ residence, location of the company, the size of the company and the availability of remote working. The dataset was last updates in October 2024 and the data was collected from 2020 to 2024.

Aim & Purpose:

The aim and purpose of this dataset is to analyse the compensation and benefit for data science workers.

Objectives:

  1. To determine how salary is distributed across different factors.

  2. To find out how does different factors affect the compensation offered.

  3. To build models to predict the employees’ benefit and compensation.

Data Importation & Raw Data Exploration

The following code snippet is to import the data first.

# path location and file name
path <- "C:/Users/User/DataScience_salaries_2024.csv"

# import data
data <- read.csv(path)

# Update
print("Data imported!")
## [1] "Data imported!"

The code below checks the dimension of the dataset.

# find size of dataset
numrow <- nrow(data)
numcol <- ncol(data)

# print size of dataset
print("Number of Rows:")
## [1] "Number of Rows:"
print(numrow)
## [1] 14838
print("Number of Columns:")
## [1] "Number of Columns:"
print(numcol)
## [1] 11

The dataset has 14838 rows and 11 columns. The columns are the features for each row data.

The code below prints out the column names.

# find all column names
colnames(data)
##  [1] "work_year"          "experience_level"   "employment_type"   
##  [4] "job_title"          "salary"             "salary_currency"   
##  [7] "salary_in_usd"      "employee_residence" "remote_ratio"      
## [10] "company_location"   "company_size"

Based on the data dictionary provided on the Kaggle webpage, the column names contain data as listed below.

work_year: The year where the data was collected. (Numerical)

experience_level: Employees’ experience level (EN: Entry/Junior, MI: Mid-level, SE: Senior, EX: Expert). (String)

employment_type: Employment type (FT: Full Time, PT: Part Time, CT: Contract). (String)

job_title: The position title. (String)

salary: The salary in the local currency. (Numerical)

salary_currency: The local currency. (String)

salary_in_usd: The salary after conversion to US Dollar. (Numerical)

employee_residence: The location of the employee’s residence (country). (String)

remote_ratio: Availability of remote working (0: No remote work, 50: Semi remote work, 100: Fully remote work). (Numerical)

company_location: The location of the company (country). (String)

company_size: The size of the company (S: Small, M: Medium, L: Large). (String)

The code below checks for any empty cell.

# find any missing cell
sapply(data, function(col) {
  sum(is.na(col) | col == "")
})
##          work_year   experience_level    employment_type          job_title 
##                  0                  0                  0                  0 
##             salary    salary_currency      salary_in_usd employee_residence 
##                  0                  0                  0                  0 
##       remote_ratio   company_location       company_size 
##                  0                  0                  0

Based on the previous code results, there is no column with missing values.

Data Cleaning & Transformation

For employment_type column, almost 100% of the data has the value of FT, meaning that almost all employments are full time employment. To ease data analysis and modelling, values which are not FT are removed.

# check number of rows before filtration
numrow <- nrow(data)
print("Number of Rows Before:")
## [1] "Number of Rows Before:"
print(numrow)
## [1] 14838
# filter data
data <- subset(data, employment_type %in% c("FT"))
print("Data filtered based on employment_type column!")
## [1] "Data filtered based on employment_type column!"
# check number of rows after filtration
numrow <- nrow(data)
print("Number of Rows After:")
## [1] "Number of Rows After:"
print(numrow)
## [1] 14772

For the job title, there are many job titles in the dataset. However, this can be grouped up into few categories to ease the data analysis. The categorisation is as such.

1: Lead, head, director and other managerial roles

2: Business intelligence

3: Machine learning and artificial intelligence

4: Data architect

5: Data engineer

6: Data analyst

7: Data scientist

0: Others

# check if dplyr is installed; if not install and then load the library
if (!require(dplyr)) install.packages("dplyr")
## Loading required package: dplyr
## Warning: package 'dplyr' was built under R version 4.4.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(dplyr)

# categorise the data
data <- data %>%
  mutate(
    job_title = case_when(
      grepl("Lead|Head|Director|Manager", job_title, ignore.case = FALSE) ~ 1,
      grepl("BI|Business Intelligence", job_title, ignore.case = FALSE) ~ 2,
      grepl("Machine Learning|ML|AI|Deep Learning|NLP", job_title, ignore.case = FALSE) ~ 3,
      grepl("Architect", job_title, ignore.case = FALSE) ~ 4,
      grepl("Engineer", job_title, ignore.case = FALSE) ~ 5,
      grepl("Analyst|Analytics", job_title, ignore.case = FALSE) ~ 6,
      grepl("Science|Scientist", job_title, ignore.case = FALSE) ~ 7,
      TRUE ~ 0
    )
  )

# count each category
value_counts <- table(data$job_title)
value_counts
## 
##    0    1    2    3    4    5    6    7 
##  284  635  628 2172  372 4035 2425 4221

When looking at the dataset information on the Kaggle webpage. Majority of the employees’ residence and company location are located in the US (around 87% for both) which may skew the data analysis and visualisation. Thus, it is decided that non-US values for both of these columns are filtered out. In the end, this data analysis will be focused in the US.

# check number of rows before filtration
numrow <- nrow(data)
print("Number of Rows Before:")
## [1] "Number of Rows Before:"
print(numrow)
## [1] 14772
# filter data
data <- subset(data, employee_residence %in% c("US"))
print("Data filtered based on employee_residence column!")
## [1] "Data filtered based on employee_residence column!"
data <- subset(data, company_location %in% c("US"))
print("Data filtered based on company_location column!")
## [1] "Data filtered based on company_location column!"
# check number of rows after filtration
numrow <- nrow(data)
print("Number of Rows After:")
## [1] "Number of Rows After:"
print(numrow)
## [1] 12893

For remote working, there are three possible values which are 0 (no remote working), 50 (semi remote working) and 100 (full remote working). However, the amount for semi remote working is very little (around 1%). To ease the data analysis, this column will be transformed as such.

0: Maintain as 0

50: Changed to 1

100: Changed to 1

For this transformation, value of 0 means that no remote working is given and value of 1 is remote working is available.

data <- data %>%
  mutate(
    remote_ratio = case_when(
      grepl(0, remote_ratio, ignore.case = FALSE) ~ 0,
      grepl(50, remote_ratio, ignore.case = FALSE) ~ 1,
      grepl(100, remote_ratio, ignore.case = FALSE) ~ 1,
    )
  )

Cleaning and transformation conducted on the data has made some of the columns redundant. Therefore, these columns can be removed to ease the data modelling. These columns are salary, salary_currency, employee_residence and company_location. The reasons are as below.

employement_type: No longer needed because only focusing on full time employee.

salary: No longer needed because it is salary in local currency.

salary_currency: No longer needed since analysis will be done in USD.

employee_residence: No longer needed because only focus on the USA.

company location: No longer needed because only focus on the USA.

# delete columns
data$employment_type <- NULL
data$salary <- NULL
data$salary_currency <- NULL
data$employee_residence <- NULL
data$company_location <- NULL

# recheck all columns
colnames(data)
## [1] "work_year"        "experience_level" "job_title"        "salary_in_usd"   
## [5] "remote_ratio"     "company_size"

Before continuing for exploratory data analysis and modelling, to ease data analysis, all string values are converted to numerical values.

# convert string to numerical for experience_level column
data <- data %>%
  mutate(
    experience_level = case_when(
      grepl("EN", experience_level, ignore.case = FALSE) ~ 1,
      grepl("MI", experience_level, ignore.case = FALSE) ~ 2,
      grepl("SE", experience_level, ignore.case = FALSE) ~ 3,
      grepl("EX", experience_level, ignore.case = FALSE) ~ 4
    )
  )

# convert string to numerical for company_size column
data <- data %>%
  mutate(
    company_size = case_when(
      grepl("S", company_size, ignore.case = FALSE) ~ 1,
      grepl("M", company_size, ignore.case = FALSE) ~ 2,
      grepl("L", company_size, ignore.case = FALSE) ~ 3
    )
  )

Exploratory Data Analysis

plot(data$work_year, data$salary_in_usd, 
     main = "Salary against Work Year", 
     xlab = "Year data was collected", 
     ylab = "Salary in USD", 
     pch = 16,
     col = "blue")

plot(data$experience_level, data$salary_in_usd, 
     main = "Salary against Experience Level", 
     xlab = "Experience Level", 
     ylab = "Salary in USD", 
     pch = 16,
     col = "blue")

plot(data$job_title, data$salary_in_usd, 
     main = "Salary against Job Title", 
     xlab = "Job Title", 
     ylab = "Salary in USD", 
     pch = 16,
     col = "blue")

plot(data$company_size, data$salary_in_usd, 
     main = "Salary against Company Size", 
     xlab = "Company Size", 
     ylab = "Salary in USD", 
     pch = 16,
     col = "blue")

# create contingency table
cont_year <- table(data$work_year, data$remote_ratio)
cont_expl <- table(data$experience_level, data$remote_ratio)
cont_jobt <- table(data$job_title, data$remote_ratio)
cont_comp <- table(data$company_size, data$remote_ratio)