This dataset contains insights into the compensation of data science professionals across various global regions and roles. Understanding the nuances of these features is critical for building a robust predictive model.
libs <- c("tidyverse", "dplyr", "gridExtra", "ggplot2", "ggfortify", "broom",
"ggcorrplot", "corrr", "corrplot", "MASS", "datarium", "randomForest",
"rpart", "Hmisc", "skimr", "stringr", "patchwork", "glmnet",
"rpart.plot", "knitr", "caret", "pROC")
for (lib in libs) {
if (!(lib %in% installed.packages())){
install.packages(lib)
}
library(lib, character.only = TRUE)
}
theme_set(theme_classic())salary <- read.csv("/home/jakes/Documents/strathmore/Modules/Module 3/linear_models/salaries/salaries.csv")
head(salary, 10)#> work_year experience_level employment_type job_title
#> 1 2025 MI FT Customer Success Manager
#> 2 2025 SE FT Engineer
#> 3 2025 SE FT Engineer
#> 4 2025 SE FT Applied Scientist
#> 5 2025 SE FT Applied Scientist
#> 6 2025 EN FT Data Analyst
#> 7 2025 EN FT Data Analyst
#> 8 2025 SE FT Software Development Engineer
#> 9 2025 SE FT Software Development Engineer
#> 10 2025 SE FT Research Scientist
#> salary salary_currency salary_in_usd employee_residence remote_ratio
#> 1 57000 EUR 60000 NL 50
#> 2 165000 USD 165000 US 0
#> 3 109000 USD 109000 US 0
#> 4 294000 USD 294000 US 0
#> 5 137600 USD 137600 US 0
#> 6 82000 USD 82000 US 0
#> 7 44000 USD 44000 US 0
#> 8 149800 USD 149800 US 0
#> 9 89700 USD 89700 US 0
#> 10 200000 USD 200000 US 0
#> company_location company_size
#> 1 NL L
#> 2 US M
#> 3 US M
#> 4 US M
#> 5 US M
#> 6 US M
#> 7 US M
#> 8 US L
#> 9 US L
#> 10 US M
Checking the structure and data types of the dataset.
#> Rows: 88,584
#> Columns: 11
#> $ work_year <int> 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 202…
#> $ experience_level <chr> "MI", "SE", "SE", "SE", "SE", "EN", "EN", "SE", "SE…
#> $ employment_type <chr> "FT", "FT", "FT", "FT", "FT", "FT", "FT", "FT", "FT…
#> $ job_title <chr> "Customer Success Manager", "Engineer", "Engineer",…
#> $ salary <int> 57000, 165000, 109000, 294000, 137600, 82000, 44000…
#> $ salary_currency <chr> "EUR", "USD", "USD", "USD", "USD", "USD", "USD", "U…
#> $ salary_in_usd <int> 60000, 165000, 109000, 294000, 137600, 82000, 44000…
#> $ employee_residence <chr> "NL", "US", "US", "US", "US", "US", "US", "US", "US…
#> $ remote_ratio <int> 50, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 100, 100, 0…
#> $ company_location <chr> "NL", "US", "US", "US", "US", "US", "US", "US", "US…
#> $ company_size <chr> "L", "M", "M", "M", "M", "M", "M", "L", "L", "M", "…
All the variables have the correct data types.
Checking for missing values and distributions of each column.
| Name | salary |
| Number of rows | 88584 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 7 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| experience_level | 0 | 1 | 2 | 2 | 0 | 4 | 0 |
| employment_type | 0 | 1 | 2 | 2 | 0 | 4 | 0 |
| job_title | 0 | 1 | 5 | 41 | 0 | 312 | 0 |
| salary_currency | 0 | 1 | 3 | 3 | 0 | 26 | 0 |
| employee_residence | 0 | 1 | 2 | 2 | 0 | 96 | 0 |
| company_location | 0 | 1 | 2 | 2 | 0 | 90 | 0 |
| company_size | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| work_year | 0 | 1 | 2024.03 | 0.62 | 2020 | 2024.0 | 2024 | 2024 | 2025 | ▁▁▁▇▂ |
| salary | 0 | 1 | 161932.29 | 196531.72 | 14000 | 106000.0 | 147000 | 199500 | 30400000 | ▇▁▁▁▁ |
| salary_in_usd | 0 | 1 | 157567.80 | 73531.37 | 15000 | 106097.2 | 146307 | 198600 | 800000 | ▇▅▁▁▁ |
| remote_ratio | 0 | 1 | 21.29 | 40.83 | 0 | 0.0 | 0 | 0 | 100 | ▇▁▁▁▂ |
Checking the summary statistics.
#> work_year experience_level employment_type job_title
#> Min. :2020 Length:88584 Length:88584 Length:88584
#> 1st Qu.:2024 Class :character Class :character Class :character
#> Median :2024 Mode :character Mode :character Mode :character
#> Mean :2024
#> 3rd Qu.:2024
#> Max. :2025
#> salary salary_currency salary_in_usd employee_residence
#> Min. : 14000 Length:88584 Min. : 15000 Length:88584
#> 1st Qu.: 106000 Class :character 1st Qu.:106097 Class :character
#> Median : 147000 Mode :character Median :146307 Mode :character
#> Mean : 161932 Mean :157568
#> 3rd Qu.: 199500 3rd Qu.:198600
#> Max. :30400000 Max. :800000
#> remote_ratio company_location company_size
#> Min. : 0.00 Length:88584 Length:88584
#> 1st Qu.: 0.00 Class :character Class :character
#> Median : 0.00 Mode :character Mode :character
#> Mean : 21.29
#> 3rd Qu.: 0.00
#> Max. :100.00
Checking the column value counts.
#> $experience_level
#>
#> EN EX MI SE
#> 8381 1859 26748 51596
#>
#> $employment_type
#>
#> CT FL FT PT
#> 224 16 88111 233
#>
#> $job_title
#>
#> Account Executive
#> 22
#> Actuarial Analyst
#> 14
#> Admin & Data Analyst
#> 6
#> AI Architect
#> 165
#> AI Data Engineer
#> 2
#> AI Data Scientist
#> 8
#> AI Developer
#> 123
#> AI Engineer
#> 850
#> AI Engineering Manager
#> 2
#> AI Governance Lead
#> 2
#> AI Lead
#> 15
#> AI Machine Learning Engineer
#> 3
#> AI Product Manager
#> 27
#> AI Product Owner
#> 18
#> AI Programmer
#> 9
#> AI Research Engineer
#> 8
#> AI Research Scientist
#> 7
#> AI Researcher
#> 54
#> AI Scientist
#> 74
#> AI Software Development Engineer
#> 1
#> AI Software Engineer
#> 11
#> AI Solution Architect
#> 10
#> AI Specialist
#> 70
#> Algorithm Developer
#> 22
#> Analyst
#> 2066
#> Analytics Analyst
#> 1
#> Analytics Engineer
#> 1374
#> Analytics Engineering Manager
#> 1
#> Analytics Lead
#> 52
#> Analytics Specialist
#> 32
#> Application Developer
#> 8
#> Applied AI ML Lead
#> 2
#> Applied Data Scientist
#> 13
#> Applied Machine Learning Engineer
#> 3
#> Applied Machine Learning Scientist
#> 14
#> Applied Research Scientist
#> 2
#> Applied Scientist
#> 1778
#> Architect
#> 208
#> Artificial Intelligence Engineer
#> 18
#> Associate
#> 1205
#> Autonomous Vehicle Technician
#> 2
#> AWS Data Architect
#> 1
#> Azure Data Engineer
#> 2
#> Backend Developer
#> 2
#> Backend Engineer
#> 189
#> Backend Software Engineer
#> 1
#> Bear Robotics
#> 10
#> BI Analyst
#> 153
#> BI Data Analyst
#> 19
#> BI Data Engineer
#> 1
#> BI Developer
#> 226
#> BI Engineer
#> 74
#> Big Data Analyst
#> 1
#> Big Data Architect
#> 2
#> Big Data Developer
#> 7
#> Big Data Engineer
#> 13
#> Bioinformatician
#> 18
#> Bioinformatics Scientist
#> 44
#> Business Analyst
#> 238
#> Business Data Analyst
#> 21
#> Business Development Manager
#> 1
#> Business Insights Manager
#> 1
#> Business Intelligence
#> 294
#> Business Intelligence Analyst
#> 502
#> Business Intelligence Consultant
#> 2
#> Business Intelligence Data Analyst
#> 2
#> Business Intelligence Developer
#> 285
#> Business Intelligence Engineer
#> 680
#> Business Intelligence Lead
#> 32
#> Business Intelligence Manager
#> 38
#> Business Intelligence Specialist
#> 16
#> Chatbot Developer
#> 4
#> Clinical Data Operator
#> 10
#> Cloud Data Architect
#> 1
#> Cloud Data Engineer
#> 4
#> Cloud Database Administrator
#> 22
#> Cloud Database Engineer
#> 37
#> Cloud Developer
#> 2
#> Cloud Engineer
#> 102
#> Compliance Data Analyst
#> 2
#> Computational Biologist
#> 40
#> Computational Scientist
#> 4
#> Computer Vision Engineer
#> 98
#> Computer Vision Software Engineer
#> 5
#> Consultant
#> 611
#> Consultant Data Engineer
#> 2
#> Controls Engineer
#> 1
#> CRM Data Analyst
#> 1
#> Customer Success Manager
#> 1
#> Data Analyst
#> 8652
#> Data Analyst Lead
#> 2
#> Data Analytics Associate
#> 7
#> Data Analytics Consultant
#> 40
#> Data Analytics Developer
#> 10
#> Data Analytics Engineer
#> 5
#> Data Analytics Lead
#> 89
#> Data Analytics Manager
#> 172
#> Data Analytics Specialist
#> 60
#> Data Analytics Team Lead
#> 2
#> Data and Reporting Analyst
#> 16
#> Data and Reporting Professional
#> 12
#> Data Architect
#> 1499
#> Data Developer
#> 100
#> Data DevOps Engineer
#> 3
#> Data Engineer
#> 10883
#> Data Governance
#> 114
#> Data Governance Analyst
#> 98
#> Data Governance Architect
#> 2
#> Data Governance Engineer
#> 14
#> Data Governance Lead
#> 64
#> Data Governance Manager
#> 31
#> Data Governance Specialist
#> 54
#> Data Infrastructure Engineer
#> 80
#> Data Integration Analyst
#> 14
#> Data Integration Coordinator
#> 4
#> Data Integration Developer
#> 12
#> Data Integration Engineer
#> 62
#> Data Integration Specialist
#> 61
#> Data Integrator
#> 10
#> Data Integrity Analyst
#> 12
#> Data Integrity Specialist
#> 2
#> Data Lead
#> 186
#> Data Management Analyst
#> 156
#> Data Management Associate
#> 10
#> Data Management Consultant
#> 12
#> Data Management Coordinator
#> 2
#> Data Management Lead
#> 28
#> Data Management Manager
#> 8
#> Data Management Specialist
#> 133
#> Data Manager
#> 727
#> Data Modeler
#> 134
#> Data Operations
#> 14
#> Data Operations Analyst
#> 80
#> Data Operations Associate
#> 18
#> Data Operations Engineer
#> 60
#> Data Operations Lead
#> 2
#> Data Operations Manager
#> 24
#> Data Operations Specialist
#> 38
#> Data Pipeline Engineer
#> 2
#> Data Platform Engineer
#> 20
#> Data Product Analyst
#> 2
#> Data Product Manager
#> 192
#> Data Product Owner
#> 92
#> Data Quality Analyst
#> 111
#> Data Quality Engineer
#> 37
#> Data Quality Lead
#> 40
#> Data Quality Manager
#> 16
#> Data Quality Specialist
#> 23
#> Data Reporter
#> 4
#> Data Reporting Analyst
#> 62
#> Data Reporting Specialist
#> 2
#> Data Science Tech Lead
#> 1
#> Data Scientist
#> 13156
#> Data Scientist Associate
#> 1
#> Data Scientist Lead
#> 3
#> Data Scientist Manager
#> 1
#> Data Specialist
#> 528
#> Data Strategist
#> 64
#> Data Strategy Lead
#> 16
#> Data Strategy Manager
#> 6
#> Data Team Lead
#> 36
#> Data Visualization Analyst
#> 37
#> Data Visualization Developer
#> 20
#> Data Visualization Engineer
#> 54
#> Data Visualization Specialist
#> 46
#> Database Administrator
#> 4
#> Databricks Engineer
#> 2
#> DataOps Engineer
#> 10
#> Decision Scientist
#> 172
#> Deep Learning Engineer
#> 12
#> Deep Learning Researcher
#> 1
#> Developer
#> 344
#> Developer Advocate
#> 4
#> DevOps Engineer
#> 286
#> Director of Business Intelligence
#> 2
#> Director of Machine Learning
#> 22
#> Elasticsearch Administrator
#> 2
#> Encounter Data Management Professional
#> 36
#> Engineer
#> 5344
#> Engineering Manager
#> 460
#> Enterprise Account Executive
#> 28
#> ETL Developer
#> 42
#> ETL Engineer
#> 7
#> Finance Data Analyst
#> 3
#> Financial Data Analyst
#> 4
#> Frontend Engineer
#> 12
#> Full Stack Developer
#> 60
#> Full Stack Engineer
#> 122
#> Fullstack Engineer
#> 24
#> GenAI Architect
#> 10
#> Head of AI
#> 26
#> Head of Business Intelligence
#> 2
#> Head of Data
#> 206
#> Head of Machine Learning
#> 11
#> Infrastructure Engineer
#> 16
#> Insight Analyst
#> 88
#> Java Developer
#> 2
#> Lead AI Engineer
#> 1
#> Lead Analyst
#> 54
#> Lead Data Analysis
#> 4
#> Lead Data Analyst
#> 6
#> Lead Data Engineer
#> 7
#> Lead Data Management
#> 2
#> Lead Data Scientist
#> 11
#> Lead Engineer
#> 48
#> Lead Machine Learning Engineer
#> 5
#> Machine Learning Architect
#> 18
#> Machine Learning Developer
#> 44
#> Machine Learning Engineer
#> 6443
#> Machine Learning Infrastructure Engineer
#> 41
#> Machine Learning Lead
#> 8
#> Machine Learning Manager
#> 14
#> Machine Learning Model Engineer
#> 16
#> Machine Learning Modeler
#> 10
#> Machine Learning Performance Engineer
#> 2
#> Machine Learning Platform Engineer
#> 16
#> Machine Learning Quality Engineer
#> 14
#> Machine Learning Research Engineer
#> 5
#> Machine Learning Researcher
#> 553
#> Machine Learning Scientist
#> 340
#> Machine Learning Software Engineer
#> 15
#> Machine Learning Specialist
#> 12
#> Machine Learning Tech Lead
#> 2
#> Machine Vision Engineer
#> 2
#> Manager
#> 3488
#> Manager Data Management
#> 2
#> Marketing Analyst
#> 8
#> Marketing Analytics Manager
#> 2
#> Marketing Data Analyst
#> 2
#> Marketing Data Engineer
#> 1
#> Marketing Data Scientist
#> 1
#> Marketing Science Partner
#> 2
#> Master Data Management
#> 20
#> Master Data Specialist
#> 2
#> Member of Technical Staff
#> 86
#> ML Infrastructure Engineer
#> 18
#> MLOps Engineer
#> 116
#> NLP Engineer
#> 23
#> People Data Analyst
#> 1
#> Platform Data Engineer
#> 1
#> Platform Engineer
#> 290
#> Postdoctoral Fellow
#> 16
#> Postdoctoral Researcher
#> 7
#> Power BI
#> 10
#> Power BI Administrator
#> 4
#> Power BI Architect
#> 2
#> Power BI Consultant
#> 2
#> Power BI Developer
#> 73
#> Power BI Specialist
#> 2
#> Pricing Analyst
#> 14
#> Principal Application Delivery Consultant
#> 40
#> Principal Data Analyst
#> 2
#> Principal Data Architect
#> 1
#> Principal Data Engineer
#> 3
#> Principal Data Scientist
#> 10
#> Principal Machine Learning Engineer
#> 3
#> Principal Researcher
#> 16
#> Principal Software Architect
#> 6
#> Principal Statistical Programmer
#> 44
#> Product Analyst
#> 110
#> Product Data Analyst
#> 6
#> Product Designer
#> 59
#> Product Manager
#> 1314
#> Product Owner
#> 19
#> Product Specialist
#> 2
#> Prompt Engineer
#> 53
#> Python Developer
#> 60
#> QA Engineer
#> 44
#> Quantitative Analyst
#> 85
#> Quantitative Developer
#> 14
#> Quantitative Research Analyst
#> 1
#> Quantitative Researcher
#> 60
#> Research Analyst
#> 380
#> Research Assistant
#> 55
#> Research Associate
#> 112
#> Research Data Manager
#> 1
#> Research Engineer
#> 1203
#> Research Scientist
#> 2553
#> Risk Analyst
#> 10
#> Robotics Engineer
#> 78
#> Robotics Software Engineer
#> 12
#> Safety Data Management Specialist
#> 1
#> Sales Data Analyst
#> 1
#> Sales Development Representative
#> 7
#> SAS Developer
#> 2
#> Scala Spark Developer
#> 4
#> Security Engineer
#> 56
#> Security Researcher
#> 8
#> Site Reliability Engineer
#> 282
#> Software Architect
#> 18
#> Software Data Engineer
#> 3
#> Software Developer
#> 424
#> Software Development Engineer
#> 451
#> Software Engineer
#> 9596
#> Solution Architect
#> 142
#> Solution Engineer
#> 2
#> Solutions Architect
#> 510
#> Solutions Engineer
#> 225
#> Staff Data Analyst
#> 3
#> Staff Data Scientist
#> 3
#> Staff Machine Learning Engineer
#> 1
#> Stage
#> 14
#> Statistical Programmer
#> 74
#> Statistician
#> 46
#> System Engineer
#> 18
#> Systems Engineer
#> 406
#> Tableau Developer
#> 2
#> Tech Lead
#> 18
#> Technical Lead
#> 112
#> Technical Specialist
#> 4
#> Technical Writer
#> 19
#> Technology Integrator
#> 12
#>
#> $salary_currency
#>
#> AUD BRL CAD CHF CLP CZK DKK EUR GBP HKD HUF ILS INR
#> 18 18 279 34 1 3 5 1559 2479 1 5 3 80
#> JPY MXN NOK NZD PHP PLN SEK SGD THB TRY TWD USD ZAR
#> 8 3 3 1 12 45 1 15 2 4 8 83994 3
#>
#> $employee_residence
#>
#> AD AE AM AR AS AT AU BA BE BG BM BO BR
#> 1 5 10 66 1 173 298 2 26 4 1 2 73
#> CA CF CH CL CN CO CR CY CZ DE DK DO DZ
#> 3203 2 38 19 1 39 7 11 9 263 7 4 1
#> EC EE EG ES FI FR GB GE GH GR HK HN HR
#> 6 11 42 203 45 226 2576 2 4 24 2 3 8
#> HU ID IE IL IN IQ IR IT JE JP KE KR KW
#> 6 3 85 15 155 1 1 41 1 20 6 8 1
#> LB LT LU LV MD MT MU MX MY NG NL NO NZ
#> 12 202 4 54 2 15 1 69 3 13 187 7 40
#> OM PE PH PK PL PR PT QA RO RS RU RW SA
#> 1 9 46 7 86 9 47 1 7 4 6 1 3
#> SE SG SI SK SV TH TN TR TW UA UG US UZ
#> 9 22 6 110 5 4 2 23 8 16 1 79705 3
#> VE VN XK ZA ZM
#> 2 7 2 61 1
#>
#> $company_location
#>
#> AD AE AM AR AS AT AU BA BE BG BR BS CA
#> 1 5 9 62 4 173 303 2 24 3 71 1 3204
#> CD CF CH CL CN CO CR CY CZ DE DK DO DZ
#> 1 2 39 18 1 39 6 10 10 272 9 3 2
#> EC EE EG ES FI FR GB GH GI GR HK HN HR
#> 6 12 41 199 46 221 2584 3 1 22 1 3 5
#> HU ID IE IL IN IQ IR IT JP KE KR LB LT
#> 6 4 85 17 134 1 1 34 20 6 8 12 202
#> LU LV MD MT MU MX MY NG NL NO NZ OM PE
#> 5 54 1 15 1 70 3 10 187 7 40 1 8
#> PH PK PL PR PT QA RO RS RU SA SE SG SI
#> 43 3 85 8 44 1 6 3 7 3 10 21 6
#> SK SV TH TR TW UA US VE VN XK ZA ZM
#> 110 5 3 21 8 15 79762 2 4 2 61 1
#>
#> $company_size
#>
#> L M S
#> 2703 85667 214
The features job_title, employee_residence,
and company_location have very many separate categories. We
will classify them based on their frequency and domain knowledge. If not
addressed, this will become a high cardinality problem during
modelling.
Restructuring the job titles into 7 broad categories instead of 312.
salary <- salary %>%
mutate(job_title = case_when(
str_detect(job_title, "Manager|Head|Director|Lead|Principal|Staff|Architect|Tech Lead") ~ "Leadership & Management",
str_detect(job_title, "Machine Learning|ML|AI|Computer Vision|Robotics|NLP|GenAI|Autonomous") ~ "ML & AI",
str_detect(job_title, "Scientist|Research|Applied|Decision Scientist|Statistician") ~ "Data Science",
str_detect(job_title, "Engineer|ETL|Database|Infrastructure|Data Platform|Pipeline") ~ "Data Engineering",
str_detect(job_title, "Analyst|Analytics|Reporting|Visualization|Quant") ~ "Data Analysis",
str_detect(job_title, "BI |Business Intelligence|Tableau|Power BI|Insight") ~ "BI & Strategy",
str_detect(job_title, "Software|Developer|Backend|Full Stack|Cloud|DevOps|Systems") ~ "Software Engineering",
TRUE ~ "Other"
))
table(salary$job_title)#>
#> BI & Strategy Data Analysis Data Engineering
#> 916 12841 31051
#> Data Science Leadership & Management ML & AI
#> 19597 10454 9128
#> Other Software Engineering
#> 3540 1057
The currency the salaries were paid in were mostly USD, EUR, and GBP; the rest were classified as ‘Other’.
salary <- salary %>%
mutate(salary_currency = case_when(
str_detect(salary_currency, "USD") ~ "USD",
str_detect(salary_currency, "EUR") ~ "EUR",
str_detect(salary_currency, "GBP") ~ "GBP",
TRUE ~ "Other"
))
table(salary$salary_currency)#>
#> EUR GBP Other USD
#> 1559 2479 552 83994
Both employees and companies were predominantly located in the US, Canada, and Great Britain; the rest were classified as ‘Other’.
salary <- salary %>%
mutate(employee_residence = case_when(
str_detect(employee_residence, "US") ~ "US",
str_detect(employee_residence, "CA") ~ "CA",
str_detect(employee_residence, "GB") ~ "GB",
TRUE ~ "Other"
))
table(salary$employee_residence)#>
#> CA GB Other US
#> 3203 2576 3100 79705
salary <- salary %>%
mutate(company_location = case_when(
str_detect(company_location, "US") ~ "US",
str_detect(company_location, "CA") ~ "CA",
str_detect(company_location, "GB") ~ "GB",
TRUE ~ "Other"
))
table(salary$company_location)#>
#> CA GB Other US
#> 3204 2584 3034 79762
Confirming all transformations.
#> $experience_level
#>
#> EN EX MI SE
#> 8381 1859 26748 51596
#>
#> $employment_type
#>
#> CT FL FT PT
#> 224 16 88111 233
#>
#> $job_title
#>
#> BI & Strategy Data Analysis Data Engineering
#> 916 12841 31051
#> Data Science Leadership & Management ML & AI
#> 19597 10454 9128
#> Other Software Engineering
#> 3540 1057
#>
#> $salary_currency
#>
#> EUR GBP Other USD
#> 1559 2479 552 83994
#>
#> $employee_residence
#>
#> CA GB Other US
#> 3203 2576 3100 79705
#>
#> $company_location
#>
#> CA GB Other US
#> 3204 2584 3034 79762
#>
#> $company_size
#>
#> L M S
#> 2703 85667 214
ggplot(salary, aes(y = salary_in_usd)) +
geom_boxplot(fill = "steelblue", outlier.color = "red", outlier.shape = 16) +
labs(title = "Identifying Salary Outliers", y = "Salary (USD)") +
theme_minimal()Q1 <- quantile(salary$salary_in_usd, 0.25)
Q3 <- quantile(salary$salary_in_usd, 0.75)
IQR_val <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_val
upper_bound <- Q3 + 1.5 * IQR_val
outliers <- salary %>% filter(salary_in_usd < lower_bound | salary_in_usd > upper_bound)
cat("Number of outliers identified:", nrow(outliers), "\n")#> Number of outliers identified: 1750
#> Upper Bound for 'Normal' Salaries: $ 337354.1
#> Lower Bound for 'Normal' Salaries: $ -32656.88
Setting the floor of minimum salaries at $1 and the ceiling at the upper bound through winsorization.
logical_min <- 1
statistical_max <- 337354.1
salary_refined <- salary %>%
filter(salary_in_usd >= logical_min & salary_in_usd <= statistical_max)
summary(salary_refined$salary_in_usd)#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 15000 105000 145000 152495 193262 337300
#> Rows after trimming outliers: 86834
We will visualize salary_in_usd using histograms and
density plots. Since salary data is right-skewed, we may need to
consider a log-transformation to meet the normality assumptions of
Linear Regression.
p1 <- ggplot(salary_refined, aes(x = salary_in_usd)) +
geom_histogram(aes(y = after_stat(density)), bins = 40, fill = "steelblue", color = "white") +
geom_density(color = "red", linewidth = 1) +
labs(title = "Distribution of Salary (USD)", x = "Salary", y = "Density") +
theme_minimal()
p2 <- ggplot(salary_refined, aes(x = log10(salary_in_usd))) +
geom_histogram(aes(y = after_stat(density)), bins = 40, fill = "darkorange", color = "white") +
geom_density(color = "black", linewidth = 1) +
labs(title = "Log10 Transformed Salary", x = "Log10(Salary)", y = "Density") +
theme_minimal()
p1 + p2We expect a positive correlation between
experience_level and salary_in_usd. We will
use boxplots to validate this.
salary_refined$experience_level <- factor(salary_refined$experience_level,
levels = c("EN", "MI", "SE", "EX"))
ggplot(salary_refined, aes(x = experience_level, y = salary_in_usd, fill = experience_level)) +
geom_boxplot(alpha = 0.7, outlier.colour = "red") +
stat_summary(fun = "mean", geom = "point", shape = 20, size = 3, color = "white") +
labs(title = "Salary Distribution by Experience Level",
subtitle = "White dots represent the mean salary",
x = "Experience Level",
y = "Salary (USD)") +
theme_minimal() +
guides(fill = "none")salary_corr_df <- salary_refined %>%
mutate(
exp_num = as.numeric(factor(experience_level, levels = c("EN", "MI", "SE", "EX"))),
size_num = as.numeric(factor(company_size, levels = c("S", "M", "L")))
) %>%
dplyr::select(salary_in_usd, salary, exp_num, size_num, remote_ratio, work_year)
cor_matrix <- cor(salary_corr_df, use = "complete.obs")
corrplot(cor_matrix,
method = "color",
type = "lower",
addCoef.col = "black",
tl.col = "black",
tl.srt = 45,
title = "Feature Correlation with Salary",
mar = c(0, 0, 2, 0))We examine if employee_residence and
company_location are identical for most records, as high
multicollinearity can destabilize linear coefficients.
salary_refined$is_local <- salary_refined$employee_residence == salary_refined$company_location
loc_check <- table(salary_refined$is_local)
prop_check <- prop.table(loc_check) * 100
print(loc_check)#>
#> FALSE TRUE
#> 85 86749
#>
#> FALSE TRUE
#> 0.09788792 99.90211208
ggplot(as.data.frame(prop_check), aes(x = Var1, y = Freq, fill = Var1)) +
geom_bar(stat = "identity") +
labs(title = "Geographic Overlap Check",
x = "Lives in Company Country?",
y = "Percentage (%)") +
scale_x_discrete(labels = c("No (Remote/International)", "Yes (Local)")) +
theme_minimal() +
guides(fill = "none")Since 99.9% of employees reside in the same country as their company
(vs. 0.09% international), we drop employee_residence to
avoid multicollinearity. We also drop the raw salary column
as it was used to derive salary_in_usd.
#> work_year experience_level employment_type job_title
#> Min. :2020 EN: 8336 Length:86834 Length:86834
#> 1st Qu.:2024 MI:26309 Class :character Class :character
#> Median :2024 SE:50435 Mode :character Mode :character
#> Mean :2024 EX: 1754
#> 3rd Qu.:2024
#> Max. :2025
#> salary_currency salary_in_usd remote_ratio company_location
#> Length:86834 Min. : 15000 Min. : 0.00 Length:86834
#> Class :character 1st Qu.:105000 1st Qu.: 0.00 Class :character
#> Mode :character Median :145000 Median : 0.00 Mode :character
#> Mean :152495 Mean : 21.54
#> 3rd Qu.:193262 3rd Qu.: 0.00
#> Max. :337300 Max. :100.00
#> company_size is_local
#> Length:86834 Mode :logical
#> Class :character FALSE:85
#> Mode :character TRUE :86749
#>
#>
#>
For our baseline predictive analysis, we implement a Multiple
Linear Regression (MLR) model. This allows us to quantify the
relationship between our categorical predictors and our continuous
target variable, salary_in_usd.
The model assumes that the target variable \(y\) is a linear function of the independent variables \(x_i\), plus a random error term:
\[y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n + \epsilon\]
| Component | Definition | Contextual Meaning |
|---|---|---|
| \(y\) | Dependent Variable | The Salary in USD we are trying to predict. |
| \(\beta_0\) | Intercept | The baseline salary when all predictors are at their reference level. |
| \(\beta_1 \dots \beta_n\) | Coefficients | The estimated change in salary for every one-unit change in a feature. |
| \(\epsilon\) | Error Term | The Residuals; the variance in salary not explained by our model. |
formula_std <- salary_in_usd ~ work_year + experience_level + employment_type +
job_title + remote_ratio + company_location + company_size
model_std <- lm(formula_std, data = salary)
summary(model_std)#>
#> Call:
#> lm(formula = formula_std, data = salary)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -170060 -39538 -7373 33093 239871
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -4.958e+06 6.258e+05 -7.921 2.38e-15 ***
#> work_year 2.478e+03 3.092e+02 8.013 1.13e-15 ***
#> experience_levelMI 1.722e+04 7.418e+02 23.211 < 2e-16 ***
#> experience_levelSE 4.346e+04 7.057e+02 61.589 < 2e-16 ***
#> experience_levelEX 6.822e+04 1.466e+03 46.536 < 2e-16 ***
#> employment_typeFL -2.180e+04 1.421e+04 -1.534 0.125032
#> employment_typeFT 2.020e+04 3.675e+03 5.497 3.87e-08 ***
#> employment_typePT -1.737e+04 5.127e+03 -3.388 0.000704 ***
#> job_titleData Analysis -4.356e+03 1.877e+03 -2.321 0.020315 *
#> job_titleData Engineering 3.559e+04 1.831e+03 19.432 < 2e-16 ***
#> job_titleData Science 3.730e+04 1.848e+03 20.183 < 2e-16 ***
#> job_titleLeadership & Management 3.735e+04 1.884e+03 19.823 < 2e-16 ***
#> job_titleML & AI 5.408e+04 1.897e+03 28.503 < 2e-16 ***
#> job_titleOther 2.009e+03 2.025e+03 0.992 0.321122
#> job_titleSoftware Engineering 7.790e+03 2.467e+03 3.157 0.001594 **
#> remote_ratio -9.657e+01 4.576e+00 -21.104 < 2e-16 ***
#> company_locationGB -3.614e+04 1.453e+03 -24.865 < 2e-16 ***
#> company_locationOther -4.247e+04 1.398e+03 -30.382 < 2e-16 ***
#> company_locationUS 2.280e+04 9.873e+02 23.097 < 2e-16 ***
#> company_sizeM -2.782e+03 1.085e+03 -2.563 0.010366 *
#> company_sizeS -1.786e+04 3.967e+03 -4.501 6.76e-06 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 54470 on 86813 degrees of freedom
#> Multiple R-squared: 0.2674, Adjusted R-squared: 0.2672
#> F-statistic: 1584 on 20 and 86813 DF, p-value: < 2.2e-16
The model assumes that the log-transformed target variable \(\log(y)\) is a linear function of the independent variables:
\[\log(y) = \beta_0 + \beta_1x_1 + \dots + \beta_nx_n + \epsilon\]
model_log <- lm(log(salary_in_usd) ~ work_year + experience_level + employment_type +
job_title + remote_ratio + company_location + company_size,
data = salary)
summary(model_log)#>
#> Call:
#> lm(formula = log(salary_in_usd) ~ work_year + experience_level +
#> employment_type + job_title + remote_ratio + company_location +
#> company_size, data = salary)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.35934 -0.24791 0.01034 0.26364 1.62803
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -2.089e+01 4.305e+00 -4.853 1.22e-06 ***
#> work_year 1.573e-02 2.127e-03 7.398 1.40e-13 ***
#> experience_levelMI 1.925e-01 5.103e-03 37.731 < 2e-16 ***
#> experience_levelSE 3.835e-01 4.854e-03 79.015 < 2e-16 ***
#> experience_levelEX 5.351e-01 1.008e-02 53.064 < 2e-16 ***
#> employment_typeFL -2.182e-01 9.776e-02 -2.232 0.0256 *
#> employment_typeFT 2.848e-01 2.528e-02 11.267 < 2e-16 ***
#> employment_typePT -1.563e-01 3.526e-02 -4.432 9.34e-06 ***
#> job_titleData Analysis -3.188e-02 1.291e-02 -2.469 0.0136 *
#> job_titleData Engineering 2.561e-01 1.260e-02 20.327 < 2e-16 ***
#> job_titleData Science 2.694e-01 1.271e-02 21.194 < 2e-16 ***
#> job_titleLeadership & Management 2.645e-01 1.296e-02 20.409 < 2e-16 ***
#> job_titleML & AI 3.758e-01 1.305e-02 28.792 < 2e-16 ***
#> job_titleOther -8.567e-03 1.393e-02 -0.615 0.5384
#> job_titleSoftware Engineering 5.020e-02 1.697e-02 2.958 0.0031 **
#> remote_ratio -5.878e-04 3.148e-05 -18.672 < 2e-16 ***
#> company_locationGB -3.882e-01 9.998e-03 -38.831 < 2e-16 ***
#> company_locationOther -5.224e-01 9.615e-03 -54.332 < 2e-16 ***
#> company_locationUS 1.595e-01 6.791e-03 23.484 < 2e-16 ***
#> company_sizeM -8.528e-03 7.465e-03 -1.142 0.2533
#> company_sizeS -1.657e-01 2.729e-02 -6.071 1.28e-09 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.3747 on 86813 degrees of freedom
#> Multiple R-squared: 0.3507, Adjusted R-squared: 0.3506
#> F-statistic: 2345 on 20 and 86813 DF, p-value: < 2.2e-16
We use Residuals vs Fitted and Normal Q-Q plots to compare model fit before and after log-transformation.
par(mfrow = c(2, 2))
plot(model_std, which = 1, main = "Standard: Residuals vs Fitted")
plot(model_std, which = 2, main = "Standard: Normal Q-Q")
plot(model_log, which = 1, main = "Log-Transformed: Residuals vs Fitted")
plot(model_log, which = 2, main = "Log-Transformed: Normal Q-Q")The increase in \(R^2\) from 0.26 to 0.35 indicates that the log-transformation allows the model to capture the nature of salary increases more effectively. By compressing the right-skewed salary distribution, we reduced the influence of high-income outliers.
To refine our model and address potential over-fitting or redundancy, we employ Stepwise Selection using the AIC (Akaike Information Criterion). AIC rewards goodness-of-fit but penalizes additional predictors, guiding us toward a simpler and more interpretable model.
full_model <- lm(log(salary_in_usd) ~ work_year + experience_level + employment_type +
job_title + remote_ratio + company_location + company_size,
data = salary)
reduced_model <- step(full_model, direction = "backward", trace = FALSE)
summary(reduced_model)#>
#> Call:
#> lm(formula = log(salary_in_usd) ~ work_year + experience_level +
#> employment_type + job_title + remote_ratio + company_location +
#> company_size, data = salary)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.35934 -0.24791 0.01034 0.26364 1.62803
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -2.089e+01 4.305e+00 -4.853 1.22e-06 ***
#> work_year 1.573e-02 2.127e-03 7.398 1.40e-13 ***
#> experience_levelMI 1.925e-01 5.103e-03 37.731 < 2e-16 ***
#> experience_levelSE 3.835e-01 4.854e-03 79.015 < 2e-16 ***
#> experience_levelEX 5.351e-01 1.008e-02 53.064 < 2e-16 ***
#> employment_typeFL -2.182e-01 9.776e-02 -2.232 0.0256 *
#> employment_typeFT 2.848e-01 2.528e-02 11.267 < 2e-16 ***
#> employment_typePT -1.563e-01 3.526e-02 -4.432 9.34e-06 ***
#> job_titleData Analysis -3.188e-02 1.291e-02 -2.469 0.0136 *
#> job_titleData Engineering 2.561e-01 1.260e-02 20.327 < 2e-16 ***
#> job_titleData Science 2.694e-01 1.271e-02 21.194 < 2e-16 ***
#> job_titleLeadership & Management 2.645e-01 1.296e-02 20.409 < 2e-16 ***
#> job_titleML & AI 3.758e-01 1.305e-02 28.792 < 2e-16 ***
#> job_titleOther -8.567e-03 1.393e-02 -0.615 0.5384
#> job_titleSoftware Engineering 5.020e-02 1.697e-02 2.958 0.0031 **
#> remote_ratio -5.878e-04 3.148e-05 -18.672 < 2e-16 ***
#> company_locationGB -3.882e-01 9.998e-03 -38.831 < 2e-16 ***
#> company_locationOther -5.224e-01 9.615e-03 -54.332 < 2e-16 ***
#> company_locationUS 1.595e-01 6.791e-03 23.484 < 2e-16 ***
#> company_sizeM -8.528e-03 7.465e-03 -1.142 0.2533
#> company_sizeS -1.657e-01 2.729e-02 -6.071 1.28e-09 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.3747 on 86813 degrees of freedom
#> Multiple R-squared: 0.3507, Adjusted R-squared: 0.3506
#> F-statistic: 2345 on 20 and 86813 DF, p-value: < 2.2e-16
Comparing the full and reduced models.
#> Full Model Adjusted R2: 0.3505958
#> Reduced Model Adjusted R2: 0.3505958
dropped_vars <- setdiff(names(coef(full_model)), names(coef(reduced_model)))
cat("\nVariables removed by Stepwise Selection:\n")#>
#> Variables removed by Stepwise Selection:
#> character(0)
No variables were dropped, meaning the full and reduced models are identical — all predictors contribute meaningfully.
To evaluate predictive accuracy, we split the dataset into a Training Set (70%) used to fit the model, and a Test Set (30%) used to evaluate performance.
set.seed(123)
n <- nrow(salary)
train_indices <- sample(1:n, size = 0.7 * n)
train_data <- salary[train_indices, ]
test_data <- salary[-train_indices, ]
final_model <- lm(log(salary_in_usd) ~ work_year + experience_level + employment_type +
job_title + remote_ratio + company_location + company_size,
data = train_data)
log_preds <- predict(final_model, newdata = test_data)
actual_usd <- test_data$salary_in_usd
predicted_usd <- exp(log_preds)
mse_val <- mean((actual_usd - predicted_usd)^2)
rmse_val <- sqrt(mse_val)
cat("Mean Squared Error (MSE):", format(mse_val, scientific = FALSE), "\n")#> Mean Squared Error (MSE): 3062656228
#> Root Mean Squared Error (RMSE): $ 55341.27
On average, the log-linear model’s salary predictions are off by approximately $55,341.27.
rf_data <- salary %>%
mutate(across(where(is.character), as.factor)) %>%
dplyr::select(-any_of(c("log_salary", "is_local")))
set.seed(123)
train_idx <- sample(1:nrow(rf_data), 0.7 * nrow(rf_data))
train_rf <- rf_data[train_idx, ]
test_rf <- rf_data[-train_idx, ]
rf_model <- randomForest(salary_in_usd ~ .,
data = train_rf,
ntree = 500,
importance = TRUE)
rf_preds <- predict(rf_model, newdata = test_rf)
rf_rmse <- sqrt(mean((test_rf$salary_in_usd - rf_preds)^2))
cat("Random Forest RMSE: $", round(rf_rmse, 2), "\n")#> Random Forest RMSE: $ 53775.49
#> Improvement over Linear Model: $ 1565.78
varImpPlot(rf_model,
main = "Feature Importance: What Predicts Salary?",
col = "darkgreen",
pch = 16)library(glmnet)
x_train <- model.matrix(salary_in_usd ~ ., train_rf)[, -1]
y_train <- train_rf$salary_in_usd
x_test <- model.matrix(salary_in_usd ~ ., test_rf)[, -1]
y_test <- test_rf$salary_in_usd
lasso_cv <- cv.glmnet(x_train, y_train, alpha = 1)
lasso_preds <- predict(lasso_cv, s = lasso_cv$lambda.min, newx = x_test)
lasso_rmse <- sqrt(mean((y_test - lasso_preds)^2))
ridge_cv <- cv.glmnet(x_train, y_train, alpha = 0)
ridge_preds <- predict(ridge_cv, s = ridge_cv$lambda.min, newx = x_test)
ridge_rmse <- sqrt(mean((y_test - ridge_preds)^2))
cat("Lasso RMSE: $", round(lasso_rmse, 2), "\n")#> Lasso RMSE: $ 54293.57
#> Ridge RMSE: $ 54355.6
tree_model <- rpart(salary_in_usd ~ ., data = train_rf, method = "anova")
tree_preds <- predict(tree_model, newdata = test_rf)
tree_rmse <- sqrt(mean((test_rf$salary_in_usd - tree_preds)^2))
rpart.plot(tree_model, type = 3, digits = 3, fallen.leaves = TRUE,
main = "Salary Decision Logic")#> Decision Tree RMSE: $ 55819.2
results_table <- data.frame(
Model = c("Random Forest", "Lasso Regression", "Ridge Regression",
"Linear Regression (Log)", "Decision Tree"),
RMSE = c("$53,775.49", "$54,448.88", "$54,526.86", "$55,341.27", "$55,981.30"),
Interpretation = c(
"Best performer: captures complex, non-linear interactions between variables.",
"Similar to Ridge, but performs automatic feature selection.",
"Linear model with a penalty to reduce the impact of multicollinearity.",
"The baseline; effective but limited by linear assumptions.",
"High interpretability but prone to over-simplification."
)
)
knitr::kable(results_table,
caption = "Table 1: Comparison of Model Performance (Ranked by RMSE)",
align = "lll")| Model | RMSE | Interpretation |
|---|---|---|
| Random Forest | $53,775.49 | Best performer: captures complex, non-linear interactions between variables. |
| Lasso Regression | $54,448.88 | Similar to Ridge, but performs automatic feature selection. |
| Ridge Regression | $54,526.86 | Linear model with a penalty to reduce the impact of multicollinearity. |
| Linear Regression (Log) | $55,341.27 | The baseline; effective but limited by linear assumptions. |
| Decision Tree | $55,981.30 | High interpretability but prone to over-simplification. |
By achieving the lowest RMSE ($53,775.49), the Random Forest model provides the most reliable salary estimates, though it remains a “black box” compared to the interpretability of the other models.
This section builds a binary logistic regression classifier to predict whether an employee’s salary falls in the top 25% (High) or below (Normal), using the AI/Data Science salary dataset covering 2020-2025.
Workflow at a glance:
We ordinal-encode the four key categorical predictors so they can enter the logistic regression as numeric inputs.
| Variable | Encoding |
|---|---|
experience_level |
EN = 1, MI = 2, SE = 3, EX = 4 |
employment_type |
PT/FL = 1, CT = 2, FT = 3 |
company_size |
S = 1, M = 2, L = 3 |
remote_ratio |
0 -> 0, 50 -> 1, 100 -> 2 |
salary <- salary %>%
mutate(
exp_num = case_when(
experience_level == "EN" ~ 1,
experience_level == "MI" ~ 2,
experience_level == "SE" ~ 3,
experience_level == "EX" ~ 4,
TRUE ~ NA_real_
),
emp_num = case_when(
employment_type == "FT" ~ 3,
employment_type == "CT" ~ 2,
employment_type == "PT" ~ 1,
employment_type == "FL" ~ 1,
TRUE ~ NA_real_
),
size_num = case_when(
company_size == "S" ~ 1,
company_size == "M" ~ 2,
company_size == "L" ~ 3,
TRUE ~ NA_real_
),
remote_num = case_when(
remote_ratio == 0 ~ 0,
remote_ratio == 50 ~ 1,
remote_ratio == 100 ~ 2,
TRUE ~ NA_real_
)
)
glimpse(salary %>% dplyr::select(exp_num, emp_num, size_num, remote_num))#> Rows: 88,584
#> Columns: 4
#> $ exp_num <dbl> 2, 3, 3, 3, 3, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2,…
#> $ emp_num <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
#> $ size_num <dbl> 3, 2, 2, 2, 2, 2, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,…
#> $ remote_num <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 2, 2, 2,…
We create several derived features before modelling.
Leakage note:
sal_vs_job_meanis computed using a temporary training split so that job-title group means are never contaminated by test data.
salary <- salary %>%
mutate(
same_country = as.integer(as.character(employee_residence) == as.character(company_location)),
us_company = as.integer(as.character(company_location) == "US"),
is_ml_ai_role = as.integer(grepl(
"machine learning|data science|ai|research|engineer|scientist|analyst",
job_title, ignore.case = TRUE)),
exp_x_remote = exp_num * remote_num,
year_c = work_year - 2022
)
set.seed(42)
temp_idx <- createDataPartition(salary$salary_in_usd, p = 0.8, list = FALSE)
job_means <- salary %>%
slice(temp_idx) %>%
group_by(job_title) %>%
summarise(job_mean_sal = mean(salary_in_usd, na.rm = TRUE), .groups = "drop")
salary <- salary %>%
left_join(job_means, by = "job_title") %>%
mutate(
sal_vs_job_mean = salary_in_usd /
coalesce(job_mean_sal, median(salary_in_usd, na.rm = TRUE))
)
salary %>%
dplyr::select(same_country, us_company, is_ml_ai_role,
exp_x_remote, year_c, sal_vs_job_mean) %>%
summary()#> same_country us_company is_ml_ai_role exp_x_remote
#> Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000
#> 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:0.000
#> Median :1.0000 Median :1.0000 Median :1.0000 Median :0.000
#> Mean :0.9983 Mean :0.9004 Mean :0.8266 Mean :1.119
#> 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.000
#> Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :8.000
#> year_c sal_vs_job_mean
#> Min. :-2.000 Min. :0.07923
#> 1st Qu.: 2.000 1st Qu.:0.71238
#> Median : 2.000 Median :0.94265
#> Mean : 2.035 Mean :1.00104
#> 3rd Qu.: 2.000 3rd Qu.:1.23038
#> Max. : 3.000 Max. :7.50863
We define High salary as any salary at or above the
75th percentile of salary_in_usd, giving a roughly 75/25
class split.
p75 <- quantile(salary$salary_in_usd, 0.75, na.rm = TRUE)
cat(sprintf("75th percentile threshold: $%s\n", format(round(p75), big.mark = ",")))#> 75th percentile threshold: $198,600
salary <- salary %>%
mutate(
high_sal = factor(
as.integer(salary_in_usd >= p75),
levels = c(0, 1),
labels = c("Normal", "High")
)
)
cat("\nClass distribution:\n")#>
#> Class distribution:
#>
#> Normal High
#> 66419 22165
#>
#> Class proportions:
#>
#> Normal High
#> 0.75 0.25
ggplot(salary, aes(x = high_sal, fill = high_sal)) +
geom_bar(width = 0.5) +
geom_text(stat = "count",
aes(label = after_stat(count)),
vjust = -0.5, size = 4) +
scale_fill_manual(values = c("Normal" = "#85B7EB", "High" = "#1D9E75")) +
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
labs(title = "Class Balance: Salary Target",
x = NULL, y = "Count") +
theme_minimal() +
theme(legend.position = "none")Class balance: High vs Normal salary
We select the 10 modelling features and perform a stratified
80/20 split on high_sal to preserve class
proportions in both sets.
cls_df <- salary %>%
dplyr::select(
high_sal,
exp_num,
emp_num,
size_num,
remote_num,
same_country,
us_company,
is_ml_ai_role,
year_c,
sal_vs_job_mean,
exp_x_remote
) %>%
drop_na()
set.seed(42)
idx_c <- createDataPartition(cls_df$high_sal, p = 0.8, list = FALSE)
train_c <- cls_df[ idx_c, ]
test_c <- cls_df[-idx_c, ]
cat(sprintf("Train: %d rows | Test: %d rows\n", nrow(train_c), nrow(test_c)))#> Train: 70868 rows | Test: 17716 rows
#>
#> Train class balance:
#>
#> Normal High
#> 0.75 0.25
With ~75% Normal and ~25% High, accuracy alone is misleading. We assign inverse-frequency weights so both classes contribute equally to the log-likelihood during fitting.
n_total <- nrow(train_c)
n_high <- sum(train_c$high_sal == "High")
n_normal <- sum(train_c$high_sal == "Normal")
class_weights <- ifelse(
train_c$high_sal == "High",
n_total / (2 * n_high),
n_total / (2 * n_normal)
)
cat(sprintf("Class weights — Normal: %.3f | High: %.3f\n",
n_total / (2 * n_normal),
n_total / (2 * n_high)))#> Class weights — Normal: 0.667 | High: 1.998
We fit a weighted logistic regression with all 10 candidate predictors.
full_fit <- glm(
high_sal ~ exp_num + emp_num + size_num + remote_num +
same_country + us_company + is_ml_ai_role +
year_c + sal_vs_job_mean + exp_x_remote,
data = train_c,
family = binomial(link = "logit"),
weights = class_weights
)
summary(full_fit)#>
#> Call:
#> glm(formula = high_sal ~ exp_num + emp_num + size_num + remote_num +
#> same_country + us_company + is_ml_ai_role + year_c + sal_vs_job_mean +
#> exp_x_remote, family = binomial(link = "logit"), data = train_c,
#> weights = class_weights)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -19.69960 1.10372 -17.848 < 2e-16 ***
#> exp_num 0.45402 0.02231 20.347 < 2e-16 ***
#> emp_num 1.65241 0.28087 5.883 4.02e-09 ***
#> size_num 0.69406 0.06931 10.014 < 2e-16 ***
#> remote_num -0.72975 0.08689 -8.399 < 2e-16 ***
#> same_country 1.46892 0.72249 2.033 0.042 *
#> us_company 0.39377 0.05905 6.669 2.58e-11 ***
#> is_ml_ai_role 0.70473 0.03431 20.538 < 2e-16 ***
#> year_c 0.23522 0.02106 11.170 < 2e-16 ***
#> sal_vs_job_mean 8.18253 0.06394 127.973 < 2e-16 ***
#> exp_x_remote 0.17684 0.03047 5.804 6.47e-09 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 98244 on 70867 degrees of freedom
#> Residual deviance: 43234 on 70857 degrees of freedom
#> AIC: 55410
#>
#> Number of Fisher Scoring iterations: 6
We use bidirectional stepwise selection (AIC criterion) to find the most parsimonious model. Variables with negligible contribution are dropped; useful variables already excluded may be re-added.
#>
#> Call:
#> glm(formula = high_sal ~ exp_num + emp_num + size_num + remote_num +
#> same_country + us_company + is_ml_ai_role + year_c + sal_vs_job_mean +
#> exp_x_remote, family = binomial(link = "logit"), data = train_c,
#> weights = class_weights)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -19.69960 1.10372 -17.848 < 2e-16 ***
#> exp_num 0.45402 0.02231 20.347 < 2e-16 ***
#> emp_num 1.65241 0.28087 5.883 4.02e-09 ***
#> size_num 0.69406 0.06931 10.014 < 2e-16 ***
#> remote_num -0.72975 0.08689 -8.399 < 2e-16 ***
#> same_country 1.46892 0.72249 2.033 0.042 *
#> us_company 0.39377 0.05905 6.669 2.58e-11 ***
#> is_ml_ai_role 0.70473 0.03431 20.538 < 2e-16 ***
#> year_c 0.23522 0.02106 11.170 < 2e-16 ***
#> sal_vs_job_mean 8.18253 0.06394 127.973 < 2e-16 ***
#> exp_x_remote 0.17684 0.03047 5.804 6.47e-09 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 98244 on 70867 degrees of freedom
#> Residual deviance: 43234 on 70857 degrees of freedom
#> AIC: 55410
#>
#> Number of Fisher Scoring iterations: 6
or_table <- data.frame(
OR = exp(coef(step_fit)),
CI_lower = exp(confint(step_fit)[, 1]),
CI_upper = exp(confint(step_fit)[, 2])
) %>%
round(3)
cat("OR > 1: predictor increases odds of High salary\n")#> OR > 1: predictor increases odds of High salary
#> OR < 1: predictor decreases odds of High salary
#> OR CI_lower CI_upper
#> (Intercept) 0.000 0.000 0.000
#> exp_num 1.575 1.507 1.645
#> emp_num 5.220 3.068 9.254
#> size_num 2.002 1.748 2.294
#> remote_num 0.482 0.406 0.571
#> same_country 4.345 1.149 19.237
#> us_company 1.483 1.321 1.665
#> is_ml_ai_role 2.023 1.892 2.164
#> year_c 1.265 1.214 1.319
#> sal_vs_job_mean 3577.910 3158.737 4058.623
#> exp_x_remote 1.193 1.125 1.267
prob_pred <- predict(step_fit, newdata = test_c, type = "response")
class_pred <- factor(ifelse(prob_pred >= 0.5, "High", "Normal"),
levels = c("Normal", "High"))
cm_05 <- confusionMatrix(class_pred, test_c$high_sal, positive = "High")
print(cm_05)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Normal High
#> Normal 11554 591
#> High 1729 3842
#>
#> Accuracy : 0.869
#> 95% CI : (0.864, 0.874)
#> No Information Rate : 0.7498
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.6785
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.8667
#> Specificity : 0.8698
#> Pos Pred Value : 0.6896
#> Neg Pred Value : 0.9513
#> Prevalence : 0.2502
#> Detection Rate : 0.2169
#> Detection Prevalence : 0.3145
#> Balanced Accuracy : 0.8683
#>
#> 'Positive' Class : High
#>
AUC is the primary evaluation metric here — it is threshold-independent and robust to class imbalance.
roc_obj <- roc(test_c$high_sal, prob_pred,
levels = c("Normal", "High"), quiet = TRUE)
auc_val <- auc(roc_obj)
cat(sprintf("AUC: %.3f\n", auc_val))#> AUC: 0.944
#> Interpretation: 0.9-1.0 excellent | 0.8-0.9 good | 0.7-0.8 fair | <0.7 poor
ggroc(roc_obj, colour = "#1D9E75", linewidth = 1) +
geom_abline(slope = 1, intercept = 1,
linetype = "dashed", colour = "gray50") +
annotate("text", x = 0.3, y = 0.1,
label = sprintf("AUC = %.3f", auc_val), size = 4.5,
colour = "#1D9E75", fontface = "bold") +
labs(title = "ROC Curve: Salary Classifier",
x = "Specificity",
y = "Sensitivity") +
theme_minimal()ROC curve for the salary classifier
The Youden index maximises
Sensitivity + Specificity - 1, balancing the true positive
and true negative rates simultaneously.
best_thresh <- coords(roc_obj, "best",
ret = "threshold",
best.method = "youden")
best_thresh <- as.numeric(best_thresh)
cat(sprintf("Youden-optimal threshold: %.3f\n", best_thresh))#> Youden-optimal threshold: 0.403
class_opt <- factor(ifelse(prob_pred >= best_thresh, "High", "Normal"),
levels = c("Normal", "High"))
cm_opt <- confusionMatrix(class_opt, test_c$high_sal, positive = "High")
print(cm_opt)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Normal High
#> Normal 11053 341
#> High 2230 4092
#>
#> Accuracy : 0.8549
#> 95% CI : (0.8496, 0.86)
#> No Information Rate : 0.7498
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.6613
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.9231
#> Specificity : 0.8321
#> Pos Pred Value : 0.6473
#> Neg Pred Value : 0.9701
#> Prevalence : 0.2502
#> Detection Rate : 0.2310
#> Detection Prevalence : 0.3569
#> Balanced Accuracy : 0.8776
#>
#> 'Positive' Class : High
#>
Note: Accuracy is misleading here — the No-Information Rate is ~75%. Use Kappa and AUC as primary metrics for this imbalanced classifier.
f1_05 <- cm_05$byClass["F1"]
f1_opt <- cm_opt$byClass["F1"]
comparison <- data.frame(
Threshold = c(0.500, round(best_thresh, 3)),
Accuracy = c(cm_05$overall["Accuracy"], cm_opt$overall["Accuracy"]),
Sensitivity = c(cm_05$byClass["Sensitivity"], cm_opt$byClass["Sensitivity"]),
Specificity = c(cm_05$byClass["Specificity"], cm_opt$byClass["Specificity"]),
Kappa = c(cm_05$overall["Kappa"], cm_opt$overall["Kappa"]),
F1 = c(f1_05, f1_opt)
) %>%
mutate(across(-Threshold, ~ round(., 3)))
knitr::kable(comparison,
caption = "Threshold Comparison: Default (0.50) vs Youden-Optimal",
align = "lrrrrr")| Threshold | Accuracy | Sensitivity | Specificity | Kappa | F1 |
|---|---|---|---|---|---|
| 0.500 | 0.869 | 0.867 | 0.870 | 0.678 | 0.768 |
| 0.403 | 0.855 | 0.923 | 0.832 | 0.661 | 0.761 |
#>
#> AUC (threshold-independent): 0.944
#> Model : Weighted logistic regression (stepwise AIC)
#> Target : High salary (salary_in_usd >= 75th percentile)
#> AUC : 0.944
#> Kappa (threshold = 0.50) : 0.678
#> F1 (threshold = 0.50) : 0.768
#> Optimal threshold (Youden) : 0.403
The classification model identifies three main factors influencing entry into the top 25% salary bracket:
apps-fileview.texmex_20260329.08_p1 salary_full.txt Displaying salary_full.txt.