Part 1: Regression

1. Dataset Overview & Attribute Information

This dataset contains insights into the compensation of data science professionals across various global regions and roles. Understanding the nuances of these features is critical for building a robust predictive model.

1.1 Decision Features

  • work_year: The year the salary was paid. This is vital for observing inflation or market demand trends over time.
  • experience_level: A categorical ordinal feature ranging from EN (Entry) to EX (Executive).
  • employment_type: Defines the nature of the contract (Part Time (PT), Full Time (FT), CT (Contract), FL (Freelance)).
  • job_title: The specific role (e.g., Data Scientist, ML Engineer).

1.2 Geographic & Corporate Features

  • employee_residence & company_location: ISO 3166 codes representing where the talent lives vs. where the company is based.
  • company_size: Categorized by number of employees into Small (<50), Medium (50-250), or Large (>250).
  • remote_ratio: An ordinal numeric value (0, 50, 100) indicating the level of remote work.

1.3 Target Variable

  • salary_in_usd (Primary Target): The standardized version of the salary, normalized to USD based on the exchange rates of the respective work_year. For modeling purposes, this eliminates the variance caused by currency fluctuations and allows for a direct comparison across global markets.

2. Data Loading & Preparation

2.1 Load Libraries

libs <- c("tidyverse", "dplyr", "gridExtra", "ggplot2", "ggfortify", "broom",
          "ggcorrplot", "corrr", "corrplot", "MASS", "datarium", "randomForest",
          "rpart", "Hmisc", "skimr", "stringr", "patchwork", "glmnet",
          "rpart.plot", "knitr", "caret", "pROC")

for (lib in libs) {
    if (!(lib %in% installed.packages())){
      install.packages(lib)
    }
    library(lib, character.only = TRUE)
}

theme_set(theme_classic())

2.2 Load Data

salary <- read.csv("/home/jakes/Documents/strathmore/Modules/Module 3/linear_models/salaries/salaries.csv")
head(salary, 10)
#>    work_year experience_level employment_type                     job_title
#> 1       2025               MI              FT      Customer Success Manager
#> 2       2025               SE              FT                      Engineer
#> 3       2025               SE              FT                      Engineer
#> 4       2025               SE              FT             Applied Scientist
#> 5       2025               SE              FT             Applied Scientist
#> 6       2025               EN              FT                  Data Analyst
#> 7       2025               EN              FT                  Data Analyst
#> 8       2025               SE              FT Software Development Engineer
#> 9       2025               SE              FT Software Development Engineer
#> 10      2025               SE              FT            Research Scientist
#>    salary salary_currency salary_in_usd employee_residence remote_ratio
#> 1   57000             EUR         60000                 NL           50
#> 2  165000             USD        165000                 US            0
#> 3  109000             USD        109000                 US            0
#> 4  294000             USD        294000                 US            0
#> 5  137600             USD        137600                 US            0
#> 6   82000             USD         82000                 US            0
#> 7   44000             USD         44000                 US            0
#> 8  149800             USD        149800                 US            0
#> 9   89700             USD         89700                 US            0
#> 10 200000             USD        200000                 US            0
#>    company_location company_size
#> 1                NL            L
#> 2                US            M
#> 3                US            M
#> 4                US            M
#> 5                US            M
#> 6                US            M
#> 7                US            M
#> 8                US            L
#> 9                US            L
#> 10               US            M

2.3 Data Understanding

Checking the structure and data types of the dataset.

glimpse(salary)
#> Rows: 88,584
#> Columns: 11
#> $ work_year          <int> 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 202…
#> $ experience_level   <chr> "MI", "SE", "SE", "SE", "SE", "EN", "EN", "SE", "SE…
#> $ employment_type    <chr> "FT", "FT", "FT", "FT", "FT", "FT", "FT", "FT", "FT…
#> $ job_title          <chr> "Customer Success Manager", "Engineer", "Engineer",…
#> $ salary             <int> 57000, 165000, 109000, 294000, 137600, 82000, 44000…
#> $ salary_currency    <chr> "EUR", "USD", "USD", "USD", "USD", "USD", "USD", "U…
#> $ salary_in_usd      <int> 60000, 165000, 109000, 294000, 137600, 82000, 44000…
#> $ employee_residence <chr> "NL", "US", "US", "US", "US", "US", "US", "US", "US…
#> $ remote_ratio       <int> 50, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 100, 100, 0…
#> $ company_location   <chr> "NL", "US", "US", "US", "US", "US", "US", "US", "US…
#> $ company_size       <chr> "L", "M", "M", "M", "M", "M", "M", "L", "L", "M", "…

All the variables have the correct data types.

Checking for missing values and distributions of each column.

skim(salary)
Data summary
Name salary
Number of rows 88584
Number of columns 11
_______________________
Column type frequency:
character 7
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
experience_level 0 1 2 2 0 4 0
employment_type 0 1 2 2 0 4 0
job_title 0 1 5 41 0 312 0
salary_currency 0 1 3 3 0 26 0
employee_residence 0 1 2 2 0 96 0
company_location 0 1 2 2 0 90 0
company_size 0 1 1 1 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
work_year 0 1 2024.03 0.62 2020 2024.0 2024 2024 2025 ▁▁▁▇▂
salary 0 1 161932.29 196531.72 14000 106000.0 147000 199500 30400000 ▇▁▁▁▁
salary_in_usd 0 1 157567.80 73531.37 15000 106097.2 146307 198600 800000 ▇▅▁▁▁
remote_ratio 0 1 21.29 40.83 0 0.0 0 0 100 ▇▁▁▁▂

Checking the summary statistics.

summary(salary)
#>    work_year    experience_level   employment_type     job_title        
#>  Min.   :2020   Length:88584       Length:88584       Length:88584      
#>  1st Qu.:2024   Class :character   Class :character   Class :character  
#>  Median :2024   Mode  :character   Mode  :character   Mode  :character  
#>  Mean   :2024                                                           
#>  3rd Qu.:2024                                                           
#>  Max.   :2025                                                           
#>      salary         salary_currency    salary_in_usd    employee_residence
#>  Min.   :   14000   Length:88584       Min.   : 15000   Length:88584      
#>  1st Qu.:  106000   Class :character   1st Qu.:106097   Class :character  
#>  Median :  147000   Mode  :character   Median :146307   Mode  :character  
#>  Mean   :  161932                      Mean   :157568                     
#>  3rd Qu.:  199500                      3rd Qu.:198600                     
#>  Max.   :30400000                      Max.   :800000                     
#>   remote_ratio    company_location   company_size      
#>  Min.   :  0.00   Length:88584       Length:88584      
#>  1st Qu.:  0.00   Class :character   Class :character  
#>  Median :  0.00   Mode  :character   Mode  :character  
#>  Mean   : 21.29                                        
#>  3rd Qu.:  0.00                                        
#>  Max.   :100.00

Checking the column value counts.

salary %>%
  select_if(~is.character(.) | is.factor(.)) %>%
  map(table)
#> $experience_level
#> 
#>    EN    EX    MI    SE 
#>  8381  1859 26748 51596 
#> 
#> $employment_type
#> 
#>    CT    FL    FT    PT 
#>   224    16 88111   233 
#> 
#> $job_title
#> 
#>                         Account Executive 
#>                                        22 
#>                         Actuarial Analyst 
#>                                        14 
#>                      Admin & Data Analyst 
#>                                         6 
#>                              AI Architect 
#>                                       165 
#>                          AI Data Engineer 
#>                                         2 
#>                         AI Data Scientist 
#>                                         8 
#>                              AI Developer 
#>                                       123 
#>                               AI Engineer 
#>                                       850 
#>                    AI Engineering Manager 
#>                                         2 
#>                        AI Governance Lead 
#>                                         2 
#>                                   AI Lead 
#>                                        15 
#>              AI Machine Learning Engineer 
#>                                         3 
#>                        AI Product Manager 
#>                                        27 
#>                          AI Product Owner 
#>                                        18 
#>                             AI Programmer 
#>                                         9 
#>                      AI Research Engineer 
#>                                         8 
#>                     AI Research Scientist 
#>                                         7 
#>                             AI Researcher 
#>                                        54 
#>                              AI Scientist 
#>                                        74 
#>          AI Software Development Engineer 
#>                                         1 
#>                      AI Software Engineer 
#>                                        11 
#>                     AI Solution Architect 
#>                                        10 
#>                             AI Specialist 
#>                                        70 
#>                       Algorithm Developer 
#>                                        22 
#>                                   Analyst 
#>                                      2066 
#>                         Analytics Analyst 
#>                                         1 
#>                        Analytics Engineer 
#>                                      1374 
#>             Analytics Engineering Manager 
#>                                         1 
#>                            Analytics Lead 
#>                                        52 
#>                      Analytics Specialist 
#>                                        32 
#>                     Application Developer 
#>                                         8 
#>                        Applied AI ML Lead 
#>                                         2 
#>                    Applied Data Scientist 
#>                                        13 
#>         Applied Machine Learning Engineer 
#>                                         3 
#>        Applied Machine Learning Scientist 
#>                                        14 
#>                Applied Research Scientist 
#>                                         2 
#>                         Applied Scientist 
#>                                      1778 
#>                                 Architect 
#>                                       208 
#>          Artificial Intelligence Engineer 
#>                                        18 
#>                                 Associate 
#>                                      1205 
#>             Autonomous Vehicle Technician 
#>                                         2 
#>                        AWS Data Architect 
#>                                         1 
#>                       Azure Data Engineer 
#>                                         2 
#>                         Backend Developer 
#>                                         2 
#>                          Backend Engineer 
#>                                       189 
#>                 Backend Software Engineer 
#>                                         1 
#>                             Bear Robotics 
#>                                        10 
#>                                BI Analyst 
#>                                       153 
#>                           BI Data Analyst 
#>                                        19 
#>                          BI Data Engineer 
#>                                         1 
#>                              BI Developer 
#>                                       226 
#>                               BI Engineer 
#>                                        74 
#>                          Big Data Analyst 
#>                                         1 
#>                        Big Data Architect 
#>                                         2 
#>                        Big Data Developer 
#>                                         7 
#>                         Big Data Engineer 
#>                                        13 
#>                          Bioinformatician 
#>                                        18 
#>                  Bioinformatics Scientist 
#>                                        44 
#>                          Business Analyst 
#>                                       238 
#>                     Business Data Analyst 
#>                                        21 
#>              Business Development Manager 
#>                                         1 
#>                 Business Insights Manager 
#>                                         1 
#>                     Business Intelligence 
#>                                       294 
#>             Business Intelligence Analyst 
#>                                       502 
#>          Business Intelligence Consultant 
#>                                         2 
#>        Business Intelligence Data Analyst 
#>                                         2 
#>           Business Intelligence Developer 
#>                                       285 
#>            Business Intelligence Engineer 
#>                                       680 
#>                Business Intelligence Lead 
#>                                        32 
#>             Business Intelligence Manager 
#>                                        38 
#>          Business Intelligence Specialist 
#>                                        16 
#>                         Chatbot Developer 
#>                                         4 
#>                    Clinical Data Operator 
#>                                        10 
#>                      Cloud Data Architect 
#>                                         1 
#>                       Cloud Data Engineer 
#>                                         4 
#>              Cloud Database Administrator 
#>                                        22 
#>                   Cloud Database Engineer 
#>                                        37 
#>                           Cloud Developer 
#>                                         2 
#>                            Cloud Engineer 
#>                                       102 
#>                   Compliance Data Analyst 
#>                                         2 
#>                   Computational Biologist 
#>                                        40 
#>                   Computational Scientist 
#>                                         4 
#>                  Computer Vision Engineer 
#>                                        98 
#>         Computer Vision Software Engineer 
#>                                         5 
#>                                Consultant 
#>                                       611 
#>                  Consultant Data Engineer 
#>                                         2 
#>                         Controls Engineer 
#>                                         1 
#>                          CRM Data Analyst 
#>                                         1 
#>                  Customer Success Manager 
#>                                         1 
#>                              Data Analyst 
#>                                      8652 
#>                         Data Analyst Lead 
#>                                         2 
#>                  Data Analytics Associate 
#>                                         7 
#>                 Data Analytics Consultant 
#>                                        40 
#>                  Data Analytics Developer 
#>                                        10 
#>                   Data Analytics Engineer 
#>                                         5 
#>                       Data Analytics Lead 
#>                                        89 
#>                    Data Analytics Manager 
#>                                       172 
#>                 Data Analytics Specialist 
#>                                        60 
#>                  Data Analytics Team Lead 
#>                                         2 
#>                Data and Reporting Analyst 
#>                                        16 
#>           Data and Reporting Professional 
#>                                        12 
#>                            Data Architect 
#>                                      1499 
#>                            Data Developer 
#>                                       100 
#>                      Data DevOps Engineer 
#>                                         3 
#>                             Data Engineer 
#>                                     10883 
#>                           Data Governance 
#>                                       114 
#>                   Data Governance Analyst 
#>                                        98 
#>                 Data Governance Architect 
#>                                         2 
#>                  Data Governance Engineer 
#>                                        14 
#>                      Data Governance Lead 
#>                                        64 
#>                   Data Governance Manager 
#>                                        31 
#>                Data Governance Specialist 
#>                                        54 
#>              Data Infrastructure Engineer 
#>                                        80 
#>                  Data Integration Analyst 
#>                                        14 
#>              Data Integration Coordinator 
#>                                         4 
#>                Data Integration Developer 
#>                                        12 
#>                 Data Integration Engineer 
#>                                        62 
#>               Data Integration Specialist 
#>                                        61 
#>                           Data Integrator 
#>                                        10 
#>                    Data Integrity Analyst 
#>                                        12 
#>                 Data Integrity Specialist 
#>                                         2 
#>                                 Data Lead 
#>                                       186 
#>                   Data Management Analyst 
#>                                       156 
#>                 Data Management Associate 
#>                                        10 
#>                Data Management Consultant 
#>                                        12 
#>               Data Management Coordinator 
#>                                         2 
#>                      Data Management Lead 
#>                                        28 
#>                   Data Management Manager 
#>                                         8 
#>                Data Management Specialist 
#>                                       133 
#>                              Data Manager 
#>                                       727 
#>                              Data Modeler 
#>                                       134 
#>                           Data Operations 
#>                                        14 
#>                   Data Operations Analyst 
#>                                        80 
#>                 Data Operations Associate 
#>                                        18 
#>                  Data Operations Engineer 
#>                                        60 
#>                      Data Operations Lead 
#>                                         2 
#>                   Data Operations Manager 
#>                                        24 
#>                Data Operations Specialist 
#>                                        38 
#>                    Data Pipeline Engineer 
#>                                         2 
#>                    Data Platform Engineer 
#>                                        20 
#>                      Data Product Analyst 
#>                                         2 
#>                      Data Product Manager 
#>                                       192 
#>                        Data Product Owner 
#>                                        92 
#>                      Data Quality Analyst 
#>                                       111 
#>                     Data Quality Engineer 
#>                                        37 
#>                         Data Quality Lead 
#>                                        40 
#>                      Data Quality Manager 
#>                                        16 
#>                   Data Quality Specialist 
#>                                        23 
#>                             Data Reporter 
#>                                         4 
#>                    Data Reporting Analyst 
#>                                        62 
#>                 Data Reporting Specialist 
#>                                         2 
#>                    Data Science Tech Lead 
#>                                         1 
#>                            Data Scientist 
#>                                     13156 
#>                  Data Scientist Associate 
#>                                         1 
#>                       Data Scientist Lead 
#>                                         3 
#>                    Data Scientist Manager 
#>                                         1 
#>                           Data Specialist 
#>                                       528 
#>                           Data Strategist 
#>                                        64 
#>                        Data Strategy Lead 
#>                                        16 
#>                     Data Strategy Manager 
#>                                         6 
#>                            Data Team Lead 
#>                                        36 
#>                Data Visualization Analyst 
#>                                        37 
#>              Data Visualization Developer 
#>                                        20 
#>               Data Visualization Engineer 
#>                                        54 
#>             Data Visualization Specialist 
#>                                        46 
#>                    Database Administrator 
#>                                         4 
#>                       Databricks Engineer 
#>                                         2 
#>                          DataOps Engineer 
#>                                        10 
#>                        Decision Scientist 
#>                                       172 
#>                    Deep Learning Engineer 
#>                                        12 
#>                  Deep Learning Researcher 
#>                                         1 
#>                                 Developer 
#>                                       344 
#>                        Developer Advocate 
#>                                         4 
#>                           DevOps Engineer 
#>                                       286 
#>         Director of Business Intelligence 
#>                                         2 
#>              Director of Machine Learning 
#>                                        22 
#>               Elasticsearch Administrator 
#>                                         2 
#>    Encounter Data Management Professional 
#>                                        36 
#>                                  Engineer 
#>                                      5344 
#>                       Engineering Manager 
#>                                       460 
#>              Enterprise Account Executive 
#>                                        28 
#>                             ETL Developer 
#>                                        42 
#>                              ETL Engineer 
#>                                         7 
#>                      Finance Data Analyst 
#>                                         3 
#>                    Financial Data Analyst 
#>                                         4 
#>                         Frontend Engineer 
#>                                        12 
#>                      Full Stack Developer 
#>                                        60 
#>                       Full Stack Engineer 
#>                                       122 
#>                        Fullstack Engineer 
#>                                        24 
#>                           GenAI Architect 
#>                                        10 
#>                                Head of AI 
#>                                        26 
#>             Head of Business Intelligence 
#>                                         2 
#>                              Head of Data 
#>                                       206 
#>                  Head of Machine Learning 
#>                                        11 
#>                   Infrastructure Engineer 
#>                                        16 
#>                           Insight Analyst 
#>                                        88 
#>                            Java Developer 
#>                                         2 
#>                          Lead AI Engineer 
#>                                         1 
#>                              Lead Analyst 
#>                                        54 
#>                        Lead Data Analysis 
#>                                         4 
#>                         Lead Data Analyst 
#>                                         6 
#>                        Lead Data Engineer 
#>                                         7 
#>                      Lead Data Management 
#>                                         2 
#>                       Lead Data Scientist 
#>                                        11 
#>                             Lead Engineer 
#>                                        48 
#>            Lead Machine Learning Engineer 
#>                                         5 
#>                Machine Learning Architect 
#>                                        18 
#>                Machine Learning Developer 
#>                                        44 
#>                 Machine Learning Engineer 
#>                                      6443 
#>  Machine Learning Infrastructure Engineer 
#>                                        41 
#>                     Machine Learning Lead 
#>                                         8 
#>                  Machine Learning Manager 
#>                                        14 
#>           Machine Learning Model Engineer 
#>                                        16 
#>                  Machine Learning Modeler 
#>                                        10 
#>     Machine Learning Performance Engineer 
#>                                         2 
#>        Machine Learning Platform Engineer 
#>                                        16 
#>         Machine Learning Quality Engineer 
#>                                        14 
#>        Machine Learning Research Engineer 
#>                                         5 
#>               Machine Learning Researcher 
#>                                       553 
#>                Machine Learning Scientist 
#>                                       340 
#>        Machine Learning Software Engineer 
#>                                        15 
#>               Machine Learning Specialist 
#>                                        12 
#>                Machine Learning Tech Lead 
#>                                         2 
#>                   Machine Vision Engineer 
#>                                         2 
#>                                   Manager 
#>                                      3488 
#>                   Manager Data Management 
#>                                         2 
#>                         Marketing Analyst 
#>                                         8 
#>               Marketing Analytics Manager 
#>                                         2 
#>                    Marketing Data Analyst 
#>                                         2 
#>                   Marketing Data Engineer 
#>                                         1 
#>                  Marketing Data Scientist 
#>                                         1 
#>                 Marketing Science Partner 
#>                                         2 
#>                    Master Data Management 
#>                                        20 
#>                    Master Data Specialist 
#>                                         2 
#>                 Member of Technical Staff 
#>                                        86 
#>                ML Infrastructure Engineer 
#>                                        18 
#>                            MLOps Engineer 
#>                                       116 
#>                              NLP Engineer 
#>                                        23 
#>                       People Data Analyst 
#>                                         1 
#>                    Platform Data Engineer 
#>                                         1 
#>                         Platform Engineer 
#>                                       290 
#>                       Postdoctoral Fellow 
#>                                        16 
#>                   Postdoctoral Researcher 
#>                                         7 
#>                                  Power BI 
#>                                        10 
#>                    Power BI Administrator 
#>                                         4 
#>                        Power BI Architect 
#>                                         2 
#>                       Power BI Consultant 
#>                                         2 
#>                        Power BI Developer 
#>                                        73 
#>                       Power BI Specialist 
#>                                         2 
#>                           Pricing Analyst 
#>                                        14 
#> Principal Application Delivery Consultant 
#>                                        40 
#>                    Principal Data Analyst 
#>                                         2 
#>                  Principal Data Architect 
#>                                         1 
#>                   Principal Data Engineer 
#>                                         3 
#>                  Principal Data Scientist 
#>                                        10 
#>       Principal Machine Learning Engineer 
#>                                         3 
#>                      Principal Researcher 
#>                                        16 
#>              Principal Software Architect 
#>                                         6 
#>          Principal Statistical Programmer 
#>                                        44 
#>                           Product Analyst 
#>                                       110 
#>                      Product Data Analyst 
#>                                         6 
#>                          Product Designer 
#>                                        59 
#>                           Product Manager 
#>                                      1314 
#>                             Product Owner 
#>                                        19 
#>                        Product Specialist 
#>                                         2 
#>                           Prompt Engineer 
#>                                        53 
#>                          Python Developer 
#>                                        60 
#>                               QA Engineer 
#>                                        44 
#>                      Quantitative Analyst 
#>                                        85 
#>                    Quantitative Developer 
#>                                        14 
#>             Quantitative Research Analyst 
#>                                         1 
#>                   Quantitative Researcher 
#>                                        60 
#>                          Research Analyst 
#>                                       380 
#>                        Research Assistant 
#>                                        55 
#>                        Research Associate 
#>                                       112 
#>                     Research Data Manager 
#>                                         1 
#>                         Research Engineer 
#>                                      1203 
#>                        Research Scientist 
#>                                      2553 
#>                              Risk Analyst 
#>                                        10 
#>                         Robotics Engineer 
#>                                        78 
#>                Robotics Software Engineer 
#>                                        12 
#>         Safety Data Management Specialist 
#>                                         1 
#>                        Sales Data Analyst 
#>                                         1 
#>          Sales Development Representative 
#>                                         7 
#>                             SAS Developer 
#>                                         2 
#>                     Scala Spark Developer 
#>                                         4 
#>                         Security Engineer 
#>                                        56 
#>                       Security Researcher 
#>                                         8 
#>                 Site Reliability Engineer 
#>                                       282 
#>                        Software Architect 
#>                                        18 
#>                    Software Data Engineer 
#>                                         3 
#>                        Software Developer 
#>                                       424 
#>             Software Development Engineer 
#>                                       451 
#>                         Software Engineer 
#>                                      9596 
#>                        Solution Architect 
#>                                       142 
#>                         Solution Engineer 
#>                                         2 
#>                       Solutions Architect 
#>                                       510 
#>                        Solutions Engineer 
#>                                       225 
#>                        Staff Data Analyst 
#>                                         3 
#>                      Staff Data Scientist 
#>                                         3 
#>           Staff Machine Learning Engineer 
#>                                         1 
#>                                     Stage 
#>                                        14 
#>                    Statistical Programmer 
#>                                        74 
#>                              Statistician 
#>                                        46 
#>                           System Engineer 
#>                                        18 
#>                          Systems Engineer 
#>                                       406 
#>                         Tableau Developer 
#>                                         2 
#>                                 Tech Lead 
#>                                        18 
#>                            Technical Lead 
#>                                       112 
#>                      Technical Specialist 
#>                                         4 
#>                          Technical Writer 
#>                                        19 
#>                     Technology Integrator 
#>                                        12 
#> 
#> $salary_currency
#> 
#>   AUD   BRL   CAD   CHF   CLP   CZK   DKK   EUR   GBP   HKD   HUF   ILS   INR 
#>    18    18   279    34     1     3     5  1559  2479     1     5     3    80 
#>   JPY   MXN   NOK   NZD   PHP   PLN   SEK   SGD   THB   TRY   TWD   USD   ZAR 
#>     8     3     3     1    12    45     1    15     2     4     8 83994     3 
#> 
#> $employee_residence
#> 
#>    AD    AE    AM    AR    AS    AT    AU    BA    BE    BG    BM    BO    BR 
#>     1     5    10    66     1   173   298     2    26     4     1     2    73 
#>    CA    CF    CH    CL    CN    CO    CR    CY    CZ    DE    DK    DO    DZ 
#>  3203     2    38    19     1    39     7    11     9   263     7     4     1 
#>    EC    EE    EG    ES    FI    FR    GB    GE    GH    GR    HK    HN    HR 
#>     6    11    42   203    45   226  2576     2     4    24     2     3     8 
#>    HU    ID    IE    IL    IN    IQ    IR    IT    JE    JP    KE    KR    KW 
#>     6     3    85    15   155     1     1    41     1    20     6     8     1 
#>    LB    LT    LU    LV    MD    MT    MU    MX    MY    NG    NL    NO    NZ 
#>    12   202     4    54     2    15     1    69     3    13   187     7    40 
#>    OM    PE    PH    PK    PL    PR    PT    QA    RO    RS    RU    RW    SA 
#>     1     9    46     7    86     9    47     1     7     4     6     1     3 
#>    SE    SG    SI    SK    SV    TH    TN    TR    TW    UA    UG    US    UZ 
#>     9    22     6   110     5     4     2    23     8    16     1 79705     3 
#>    VE    VN    XK    ZA    ZM 
#>     2     7     2    61     1 
#> 
#> $company_location
#> 
#>    AD    AE    AM    AR    AS    AT    AU    BA    BE    BG    BR    BS    CA 
#>     1     5     9    62     4   173   303     2    24     3    71     1  3204 
#>    CD    CF    CH    CL    CN    CO    CR    CY    CZ    DE    DK    DO    DZ 
#>     1     2    39    18     1    39     6    10    10   272     9     3     2 
#>    EC    EE    EG    ES    FI    FR    GB    GH    GI    GR    HK    HN    HR 
#>     6    12    41   199    46   221  2584     3     1    22     1     3     5 
#>    HU    ID    IE    IL    IN    IQ    IR    IT    JP    KE    KR    LB    LT 
#>     6     4    85    17   134     1     1    34    20     6     8    12   202 
#>    LU    LV    MD    MT    MU    MX    MY    NG    NL    NO    NZ    OM    PE 
#>     5    54     1    15     1    70     3    10   187     7    40     1     8 
#>    PH    PK    PL    PR    PT    QA    RO    RS    RU    SA    SE    SG    SI 
#>    43     3    85     8    44     1     6     3     7     3    10    21     6 
#>    SK    SV    TH    TR    TW    UA    US    VE    VN    XK    ZA    ZM 
#>   110     5     3    21     8    15 79762     2     4     2    61     1 
#> 
#> $company_size
#> 
#>     L     M     S 
#>  2703 85667   214

The features job_title, employee_residence, and company_location have very many separate categories. We will classify them based on their frequency and domain knowledge. If not addressed, this will become a high cardinality problem during modelling.

3. Data Cleaning & Transformation

3.1 Consolidate Job Titles

Restructuring the job titles into 7 broad categories instead of 312.

salary <- salary %>%
  mutate(job_title = case_when(
    str_detect(job_title, "Manager|Head|Director|Lead|Principal|Staff|Architect|Tech Lead") ~ "Leadership & Management",
    str_detect(job_title, "Machine Learning|ML|AI|Computer Vision|Robotics|NLP|GenAI|Autonomous") ~ "ML & AI",
    str_detect(job_title, "Scientist|Research|Applied|Decision Scientist|Statistician") ~ "Data Science",
    str_detect(job_title, "Engineer|ETL|Database|Infrastructure|Data Platform|Pipeline") ~ "Data Engineering",
    str_detect(job_title, "Analyst|Analytics|Reporting|Visualization|Quant") ~ "Data Analysis",
    str_detect(job_title, "BI |Business Intelligence|Tableau|Power BI|Insight") ~ "BI & Strategy",
    str_detect(job_title, "Software|Developer|Backend|Full Stack|Cloud|DevOps|Systems") ~ "Software Engineering",
    TRUE ~ "Other"
  ))

table(salary$job_title)
#> 
#>           BI & Strategy           Data Analysis        Data Engineering 
#>                     916                   12841                   31051 
#>            Data Science Leadership & Management                 ML & AI 
#>                   19597                   10454                    9128 
#>                   Other    Software Engineering 
#>                    3540                    1057

3.2 Consolidate Salary Currency

The currency the salaries were paid in were mostly USD, EUR, and GBP; the rest were classified as ‘Other’.

salary <- salary %>%
  mutate(salary_currency = case_when(
    str_detect(salary_currency, "USD") ~ "USD",
    str_detect(salary_currency, "EUR") ~ "EUR",
    str_detect(salary_currency, "GBP") ~ "GBP",
    TRUE ~ "Other"
  ))

table(salary$salary_currency)
#> 
#>   EUR   GBP Other   USD 
#>  1559  2479   552 83994

3.3 Consolidate Employee Residence & Company Location

Both employees and companies were predominantly located in the US, Canada, and Great Britain; the rest were classified as ‘Other’.

salary <- salary %>%
  mutate(employee_residence = case_when(
    str_detect(employee_residence, "US") ~ "US",
    str_detect(employee_residence, "CA") ~ "CA",
    str_detect(employee_residence, "GB") ~ "GB",
    TRUE ~ "Other"
  ))

table(salary$employee_residence)
#> 
#>    CA    GB Other    US 
#>  3203  2576  3100 79705
salary <- salary %>%
  mutate(company_location = case_when(
    str_detect(company_location, "US") ~ "US",
    str_detect(company_location, "CA") ~ "CA",
    str_detect(company_location, "GB") ~ "GB",
    TRUE ~ "Other"
  ))

table(salary$company_location)
#> 
#>    CA    GB Other    US 
#>  3204  2584  3034 79762

Confirming all transformations.

salary %>%
  select_if(~is.character(.) | is.factor(.)) %>%
  map(table)
#> $experience_level
#> 
#>    EN    EX    MI    SE 
#>  8381  1859 26748 51596 
#> 
#> $employment_type
#> 
#>    CT    FL    FT    PT 
#>   224    16 88111   233 
#> 
#> $job_title
#> 
#>           BI & Strategy           Data Analysis        Data Engineering 
#>                     916                   12841                   31051 
#>            Data Science Leadership & Management                 ML & AI 
#>                   19597                   10454                    9128 
#>                   Other    Software Engineering 
#>                    3540                    1057 
#> 
#> $salary_currency
#> 
#>   EUR   GBP Other   USD 
#>  1559  2479   552 83994 
#> 
#> $employee_residence
#> 
#>    CA    GB Other    US 
#>  3203  2576  3100 79705 
#> 
#> $company_location
#> 
#>    CA    GB Other    US 
#>  3204  2584  3034 79762 
#> 
#> $company_size
#> 
#>     L     M     S 
#>  2703 85667   214

3.4 Outlier Detection & Treatment

ggplot(salary, aes(y = salary_in_usd)) +
  geom_boxplot(fill = "steelblue", outlier.color = "red", outlier.shape = 16) +
  labs(title = "Identifying Salary Outliers", y = "Salary (USD)") +
  theme_minimal()

Q1 <- quantile(salary$salary_in_usd, 0.25)
Q3 <- quantile(salary$salary_in_usd, 0.75)
IQR_val <- Q3 - Q1

lower_bound <- Q1 - 1.5 * IQR_val
upper_bound <- Q3 + 1.5 * IQR_val

outliers <- salary %>% filter(salary_in_usd < lower_bound | salary_in_usd > upper_bound)
cat("Number of outliers identified:", nrow(outliers), "\n")
#> Number of outliers identified: 1750
cat("Upper Bound for 'Normal' Salaries: $", round(upper_bound, 2), "\n")
#> Upper Bound for 'Normal' Salaries: $ 337354.1
cat("Lower Bound for 'Normal' Salaries: $", round(lower_bound, 2))
#> Lower Bound for 'Normal' Salaries: $ -32656.88

Setting the floor of minimum salaries at $1 and the ceiling at the upper bound through winsorization.

logical_min     <- 1
statistical_max <- 337354.1

salary_refined <- salary %>%
  filter(salary_in_usd >= logical_min & salary_in_usd <= statistical_max)

summary(salary_refined$salary_in_usd)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   15000  105000  145000  152495  193262  337300
cat("Rows after trimming outliers:", nrow(salary_refined))
#> Rows after trimming outliers: 86834

4. Exploratory Data Analysis

4.1 Distribution of the Target Variable

We will visualize salary_in_usd using histograms and density plots. Since salary data is right-skewed, we may need to consider a log-transformation to meet the normality assumptions of Linear Regression.

p1 <- ggplot(salary_refined, aes(x = salary_in_usd)) +
  geom_histogram(aes(y = after_stat(density)), bins = 40, fill = "steelblue", color = "white") +
  geom_density(color = "red", linewidth = 1) +
  labs(title = "Distribution of Salary (USD)", x = "Salary", y = "Density") +
  theme_minimal()

p2 <- ggplot(salary_refined, aes(x = log10(salary_in_usd))) +
  geom_histogram(aes(y = after_stat(density)), bins = 40, fill = "darkorange", color = "white") +
  geom_density(color = "black", linewidth = 1) +
  labs(title = "Log10 Transformed Salary", x = "Log10(Salary)", y = "Density") +
  theme_minimal()

p1 + p2

4.2 Ordinal Relationships

We expect a positive correlation between experience_level and salary_in_usd. We will use boxplots to validate this.

salary_refined$experience_level <- factor(salary_refined$experience_level,
                                          levels = c("EN", "MI", "SE", "EX"))

ggplot(salary_refined, aes(x = experience_level, y = salary_in_usd, fill = experience_level)) +
  geom_boxplot(alpha = 0.7, outlier.colour = "red") +
  stat_summary(fun = "mean", geom = "point", shape = 20, size = 3, color = "white") +
  labs(title = "Salary Distribution by Experience Level",
       subtitle = "White dots represent the mean salary",
       x = "Experience Level",
       y = "Salary (USD)") +
  theme_minimal() +
  guides(fill = "none")

4.3 Correlation Analysis

salary_corr_df <- salary_refined %>%
  mutate(
    exp_num  = as.numeric(factor(experience_level, levels = c("EN", "MI", "SE", "EX"))),
    size_num = as.numeric(factor(company_size, levels = c("S", "M", "L")))
  ) %>%
  dplyr::select(salary_in_usd, salary, exp_num, size_num, remote_ratio, work_year)

cor_matrix <- cor(salary_corr_df, use = "complete.obs")

corrplot(cor_matrix,
         method = "color",
         type   = "lower",
         addCoef.col = "black",
         tl.col = "black",
         tl.srt = 45,
         title  = "Feature Correlation with Salary",
         mar    = c(0, 0, 2, 0))

4.4 Collinearity Check

We examine if employee_residence and company_location are identical for most records, as high multicollinearity can destabilize linear coefficients.

salary_refined$is_local <- salary_refined$employee_residence == salary_refined$company_location

loc_check  <- table(salary_refined$is_local)
prop_check <- prop.table(loc_check) * 100

print(loc_check)
#> 
#> FALSE  TRUE 
#>    85 86749
print(prop_check)
#> 
#>       FALSE        TRUE 
#>  0.09788792 99.90211208
ggplot(as.data.frame(prop_check), aes(x = Var1, y = Freq, fill = Var1)) +
  geom_bar(stat = "identity") +
  labs(title = "Geographic Overlap Check",
       x = "Lives in Company Country?",
       y = "Percentage (%)") +
  scale_x_discrete(labels = c("No (Remote/International)", "Yes (Local)")) +
  theme_minimal() +
  guides(fill = "none")

Since 99.9% of employees reside in the same country as their company (vs. 0.09% international), we drop employee_residence to avoid multicollinearity. We also drop the raw salary column as it was used to derive salary_in_usd.

salary <- salary_refined %>%
  dplyr::select(-c(employee_residence, salary))
summary(salary)
#>    work_year    experience_level employment_type     job_title        
#>  Min.   :2020   EN: 8336         Length:86834       Length:86834      
#>  1st Qu.:2024   MI:26309         Class :character   Class :character  
#>  Median :2024   SE:50435         Mode  :character   Mode  :character  
#>  Mean   :2024   EX: 1754                                              
#>  3rd Qu.:2024                                                         
#>  Max.   :2025                                                         
#>  salary_currency    salary_in_usd     remote_ratio    company_location  
#>  Length:86834       Min.   : 15000   Min.   :  0.00   Length:86834      
#>  Class :character   1st Qu.:105000   1st Qu.:  0.00   Class :character  
#>  Mode  :character   Median :145000   Median :  0.00   Mode  :character  
#>                     Mean   :152495   Mean   : 21.54                     
#>                     3rd Qu.:193262   3rd Qu.:  0.00                     
#>                     Max.   :337300   Max.   :100.00                     
#>  company_size        is_local      
#>  Length:86834       Mode :logical  
#>  Class :character   FALSE:85       
#>  Mode  :character   TRUE :86749    
#>                                    
#>                                    
#> 

5. Linear Modeling

For our baseline predictive analysis, we implement a Multiple Linear Regression (MLR) model. This allows us to quantify the relationship between our categorical predictors and our continuous target variable, salary_in_usd.

5.1 Standard Linear Model

The model assumes that the target variable \(y\) is a linear function of the independent variables \(x_i\), plus a random error term:

\[y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n + \epsilon\]

Component Definition Contextual Meaning
\(y\) Dependent Variable The Salary in USD we are trying to predict.
\(\beta_0\) Intercept The baseline salary when all predictors are at their reference level.
\(\beta_1 \dots \beta_n\) Coefficients The estimated change in salary for every one-unit change in a feature.
\(\epsilon\) Error Term The Residuals; the variance in salary not explained by our model.
formula_std <- salary_in_usd ~ work_year + experience_level + employment_type +
                               job_title + remote_ratio + company_location + company_size

model_std <- lm(formula_std, data = salary)
summary(model_std)
#> 
#> Call:
#> lm(formula = formula_std, data = salary)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -170060  -39538   -7373   33093  239871 
#> 
#> Coefficients:
#>                                    Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                      -4.958e+06  6.258e+05  -7.921 2.38e-15 ***
#> work_year                         2.478e+03  3.092e+02   8.013 1.13e-15 ***
#> experience_levelMI                1.722e+04  7.418e+02  23.211  < 2e-16 ***
#> experience_levelSE                4.346e+04  7.057e+02  61.589  < 2e-16 ***
#> experience_levelEX                6.822e+04  1.466e+03  46.536  < 2e-16 ***
#> employment_typeFL                -2.180e+04  1.421e+04  -1.534 0.125032    
#> employment_typeFT                 2.020e+04  3.675e+03   5.497 3.87e-08 ***
#> employment_typePT                -1.737e+04  5.127e+03  -3.388 0.000704 ***
#> job_titleData Analysis           -4.356e+03  1.877e+03  -2.321 0.020315 *  
#> job_titleData Engineering         3.559e+04  1.831e+03  19.432  < 2e-16 ***
#> job_titleData Science             3.730e+04  1.848e+03  20.183  < 2e-16 ***
#> job_titleLeadership & Management  3.735e+04  1.884e+03  19.823  < 2e-16 ***
#> job_titleML & AI                  5.408e+04  1.897e+03  28.503  < 2e-16 ***
#> job_titleOther                    2.009e+03  2.025e+03   0.992 0.321122    
#> job_titleSoftware Engineering     7.790e+03  2.467e+03   3.157 0.001594 ** 
#> remote_ratio                     -9.657e+01  4.576e+00 -21.104  < 2e-16 ***
#> company_locationGB               -3.614e+04  1.453e+03 -24.865  < 2e-16 ***
#> company_locationOther            -4.247e+04  1.398e+03 -30.382  < 2e-16 ***
#> company_locationUS                2.280e+04  9.873e+02  23.097  < 2e-16 ***
#> company_sizeM                    -2.782e+03  1.085e+03  -2.563 0.010366 *  
#> company_sizeS                    -1.786e+04  3.967e+03  -4.501 6.76e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 54470 on 86813 degrees of freedom
#> Multiple R-squared:  0.2674, Adjusted R-squared:  0.2672 
#> F-statistic:  1584 on 20 and 86813 DF,  p-value: < 2.2e-16

5.2 Log-Transformed Linear Model

The model assumes that the log-transformed target variable \(\log(y)\) is a linear function of the independent variables:

\[\log(y) = \beta_0 + \beta_1x_1 + \dots + \beta_nx_n + \epsilon\]

model_log <- lm(log(salary_in_usd) ~ work_year + experience_level + employment_type +
                                     job_title + remote_ratio + company_location + company_size,
                data = salary)

summary(model_log)
#> 
#> Call:
#> lm(formula = log(salary_in_usd) ~ work_year + experience_level + 
#>     employment_type + job_title + remote_ratio + company_location + 
#>     company_size, data = salary)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -2.35934 -0.24791  0.01034  0.26364  1.62803 
#> 
#> Coefficients:
#>                                    Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                      -2.089e+01  4.305e+00  -4.853 1.22e-06 ***
#> work_year                         1.573e-02  2.127e-03   7.398 1.40e-13 ***
#> experience_levelMI                1.925e-01  5.103e-03  37.731  < 2e-16 ***
#> experience_levelSE                3.835e-01  4.854e-03  79.015  < 2e-16 ***
#> experience_levelEX                5.351e-01  1.008e-02  53.064  < 2e-16 ***
#> employment_typeFL                -2.182e-01  9.776e-02  -2.232   0.0256 *  
#> employment_typeFT                 2.848e-01  2.528e-02  11.267  < 2e-16 ***
#> employment_typePT                -1.563e-01  3.526e-02  -4.432 9.34e-06 ***
#> job_titleData Analysis           -3.188e-02  1.291e-02  -2.469   0.0136 *  
#> job_titleData Engineering         2.561e-01  1.260e-02  20.327  < 2e-16 ***
#> job_titleData Science             2.694e-01  1.271e-02  21.194  < 2e-16 ***
#> job_titleLeadership & Management  2.645e-01  1.296e-02  20.409  < 2e-16 ***
#> job_titleML & AI                  3.758e-01  1.305e-02  28.792  < 2e-16 ***
#> job_titleOther                   -8.567e-03  1.393e-02  -0.615   0.5384    
#> job_titleSoftware Engineering     5.020e-02  1.697e-02   2.958   0.0031 ** 
#> remote_ratio                     -5.878e-04  3.148e-05 -18.672  < 2e-16 ***
#> company_locationGB               -3.882e-01  9.998e-03 -38.831  < 2e-16 ***
#> company_locationOther            -5.224e-01  9.615e-03 -54.332  < 2e-16 ***
#> company_locationUS                1.595e-01  6.791e-03  23.484  < 2e-16 ***
#> company_sizeM                    -8.528e-03  7.465e-03  -1.142   0.2533    
#> company_sizeS                    -1.657e-01  2.729e-02  -6.071 1.28e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.3747 on 86813 degrees of freedom
#> Multiple R-squared:  0.3507, Adjusted R-squared:  0.3506 
#> F-statistic:  2345 on 20 and 86813 DF,  p-value: < 2.2e-16

5.3 Model Diagnostics

We use Residuals vs Fitted and Normal Q-Q plots to compare model fit before and after log-transformation.

par(mfrow = c(2, 2))
plot(model_std, which = 1, main = "Standard: Residuals vs Fitted")
plot(model_std, which = 2, main = "Standard: Normal Q-Q")
plot(model_log, which = 1, main = "Log-Transformed: Residuals vs Fitted")
plot(model_log, which = 2, main = "Log-Transformed: Normal Q-Q")

par(mfrow = c(1, 1))
  • Residuals vs. Fitted: The standard model exhibited heteroscedasticity (a fan-shaped spread), where errors increased alongside higher salary predictions. The log-transformed model significantly stabilized this variance.
  • Normal Q-Q: The standard model showed heavy tails indicating non-normal residuals. The log-transformation brought the points much closer to the 45-degree reference line.

5.4 Model Evaluation: R-Squared

The increase in \(R^2\) from 0.26 to 0.35 indicates that the log-transformation allows the model to capture the nature of salary increases more effectively. By compressing the right-skewed salary distribution, we reduced the influence of high-income outliers.

6. Feature Engineering & Variable Selection

To refine our model and address potential over-fitting or redundancy, we employ Stepwise Selection using the AIC (Akaike Information Criterion). AIC rewards goodness-of-fit but penalizes additional predictors, guiding us toward a simpler and more interpretable model.

full_model <- lm(log(salary_in_usd) ~ work_year + experience_level + employment_type +
                 job_title + remote_ratio + company_location + company_size,
                 data = salary)

reduced_model <- step(full_model, direction = "backward", trace = FALSE)
summary(reduced_model)
#> 
#> Call:
#> lm(formula = log(salary_in_usd) ~ work_year + experience_level + 
#>     employment_type + job_title + remote_ratio + company_location + 
#>     company_size, data = salary)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -2.35934 -0.24791  0.01034  0.26364  1.62803 
#> 
#> Coefficients:
#>                                    Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                      -2.089e+01  4.305e+00  -4.853 1.22e-06 ***
#> work_year                         1.573e-02  2.127e-03   7.398 1.40e-13 ***
#> experience_levelMI                1.925e-01  5.103e-03  37.731  < 2e-16 ***
#> experience_levelSE                3.835e-01  4.854e-03  79.015  < 2e-16 ***
#> experience_levelEX                5.351e-01  1.008e-02  53.064  < 2e-16 ***
#> employment_typeFL                -2.182e-01  9.776e-02  -2.232   0.0256 *  
#> employment_typeFT                 2.848e-01  2.528e-02  11.267  < 2e-16 ***
#> employment_typePT                -1.563e-01  3.526e-02  -4.432 9.34e-06 ***
#> job_titleData Analysis           -3.188e-02  1.291e-02  -2.469   0.0136 *  
#> job_titleData Engineering         2.561e-01  1.260e-02  20.327  < 2e-16 ***
#> job_titleData Science             2.694e-01  1.271e-02  21.194  < 2e-16 ***
#> job_titleLeadership & Management  2.645e-01  1.296e-02  20.409  < 2e-16 ***
#> job_titleML & AI                  3.758e-01  1.305e-02  28.792  < 2e-16 ***
#> job_titleOther                   -8.567e-03  1.393e-02  -0.615   0.5384    
#> job_titleSoftware Engineering     5.020e-02  1.697e-02   2.958   0.0031 ** 
#> remote_ratio                     -5.878e-04  3.148e-05 -18.672  < 2e-16 ***
#> company_locationGB               -3.882e-01  9.998e-03 -38.831  < 2e-16 ***
#> company_locationOther            -5.224e-01  9.615e-03 -54.332  < 2e-16 ***
#> company_locationUS                1.595e-01  6.791e-03  23.484  < 2e-16 ***
#> company_sizeM                    -8.528e-03  7.465e-03  -1.142   0.2533    
#> company_sizeS                    -1.657e-01  2.729e-02  -6.071 1.28e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.3747 on 86813 degrees of freedom
#> Multiple R-squared:  0.3507, Adjusted R-squared:  0.3506 
#> F-statistic:  2345 on 20 and 86813 DF,  p-value: < 2.2e-16

Comparing the full and reduced models.

cat("Full Model Adjusted R2:", summary(full_model)$adj.r.squared, "\n")
#> Full Model Adjusted R2: 0.3505958
cat("Reduced Model Adjusted R2:", summary(reduced_model)$adj.r.squared, "\n")
#> Reduced Model Adjusted R2: 0.3505958
dropped_vars <- setdiff(names(coef(full_model)), names(coef(reduced_model)))
cat("\nVariables removed by Stepwise Selection:\n")
#> 
#> Variables removed by Stepwise Selection:
print(dropped_vars)
#> character(0)

No variables were dropped, meaning the full and reduced models are identical — all predictors contribute meaningfully.

7. Predictive Modeling

To evaluate predictive accuracy, we split the dataset into a Training Set (70%) used to fit the model, and a Test Set (30%) used to evaluate performance.

7.1 Log-Linear Regression

set.seed(123)

n             <- nrow(salary)
train_indices <- sample(1:n, size = 0.7 * n)
train_data    <- salary[train_indices, ]
test_data     <- salary[-train_indices, ]

final_model   <- lm(log(salary_in_usd) ~ work_year + experience_level + employment_type +
                    job_title + remote_ratio + company_location + company_size,
                    data = train_data)

log_preds     <- predict(final_model, newdata = test_data)
actual_usd    <- test_data$salary_in_usd
predicted_usd <- exp(log_preds)

mse_val  <- mean((actual_usd - predicted_usd)^2)
rmse_val <- sqrt(mse_val)

cat("Mean Squared Error (MSE):", format(mse_val, scientific = FALSE), "\n")
#> Mean Squared Error (MSE): 3062656228
cat("Root Mean Squared Error (RMSE): $", round(rmse_val, 2), "\n")
#> Root Mean Squared Error (RMSE): $ 55341.27

On average, the log-linear model’s salary predictions are off by approximately $55,341.27.

7.2 Random Forest

rf_data <- salary %>%
  mutate(across(where(is.character), as.factor)) %>%
  dplyr::select(-any_of(c("log_salary", "is_local")))

set.seed(123)
train_idx <- sample(1:nrow(rf_data), 0.7 * nrow(rf_data))
train_rf  <- rf_data[train_idx, ]
test_rf   <- rf_data[-train_idx, ]

rf_model <- randomForest(salary_in_usd ~ .,
                         data       = train_rf,
                         ntree      = 500,
                         importance = TRUE)

rf_preds <- predict(rf_model, newdata = test_rf)
rf_rmse  <- sqrt(mean((test_rf$salary_in_usd - rf_preds)^2))

cat("Random Forest RMSE: $", round(rf_rmse, 2), "\n")
#> Random Forest RMSE: $ 53775.49
cat("Improvement over Linear Model: $", round(55341.27 - rf_rmse, 2))
#> Improvement over Linear Model: $ 1565.78
varImpPlot(rf_model,
           main = "Feature Importance: What Predicts Salary?",
           col  = "darkgreen",
           pch  = 16)

7.3 Lasso & Ridge Regression

library(glmnet)

x_train <- model.matrix(salary_in_usd ~ ., train_rf)[, -1]
y_train <- train_rf$salary_in_usd
x_test  <- model.matrix(salary_in_usd ~ ., test_rf)[, -1]
y_test  <- test_rf$salary_in_usd

lasso_cv    <- cv.glmnet(x_train, y_train, alpha = 1)
lasso_preds <- predict(lasso_cv, s = lasso_cv$lambda.min, newx = x_test)
lasso_rmse  <- sqrt(mean((y_test - lasso_preds)^2))

ridge_cv    <- cv.glmnet(x_train, y_train, alpha = 0)
ridge_preds <- predict(ridge_cv, s = ridge_cv$lambda.min, newx = x_test)
ridge_rmse  <- sqrt(mean((y_test - ridge_preds)^2))

cat("Lasso RMSE: $", round(lasso_rmse, 2), "\n")
#> Lasso RMSE: $ 54293.57
cat("Ridge RMSE: $", round(ridge_rmse, 2))
#> Ridge RMSE: $ 54355.6

7.4 Decision Tree Regression

tree_model <- rpart(salary_in_usd ~ ., data = train_rf, method = "anova")

tree_preds <- predict(tree_model, newdata = test_rf)
tree_rmse  <- sqrt(mean((test_rf$salary_in_usd - tree_preds)^2))

rpart.plot(tree_model, type = 3, digits = 3, fallen.leaves = TRUE,
           main = "Salary Decision Logic")

cat("Decision Tree RMSE: $", round(tree_rmse, 2))
#> Decision Tree RMSE: $ 55819.2

7.5 Model Comparison

results_table <- data.frame(
  Model = c("Random Forest", "Lasso Regression", "Ridge Regression",
            "Linear Regression (Log)", "Decision Tree"),
  RMSE  = c("$53,775.49", "$54,448.88", "$54,526.86", "$55,341.27", "$55,981.30"),
  Interpretation = c(
    "Best performer: captures complex, non-linear interactions between variables.",
    "Similar to Ridge, but performs automatic feature selection.",
    "Linear model with a penalty to reduce the impact of multicollinearity.",
    "The baseline; effective but limited by linear assumptions.",
    "High interpretability but prone to over-simplification."
  )
)

knitr::kable(results_table,
             caption = "Table 1: Comparison of Model Performance (Ranked by RMSE)",
             align   = "lll")
Table 1: Comparison of Model Performance (Ranked by RMSE)
Model RMSE Interpretation
Random Forest $53,775.49 Best performer: captures complex, non-linear interactions between variables.
Lasso Regression $54,448.88 Similar to Ridge, but performs automatic feature selection.
Ridge Regression $54,526.86 Linear model with a penalty to reduce the impact of multicollinearity.
Linear Regression (Log) $55,341.27 The baseline; effective but limited by linear assumptions.
Decision Tree $55,981.30 High interpretability but prone to over-simplification.

By achieving the lowest RMSE ($53,775.49), the Random Forest model provides the most reliable salary estimates, though it remains a “black box” compared to the interpretability of the other models.


Part 2: Classification

This section builds a binary logistic regression classifier to predict whether an employee’s salary falls in the top 25% (High) or below (Normal), using the AI/Data Science salary dataset covering 2020-2025.

Workflow at a glance:

  1. Load data and libraries
  2. Initial cleaning and ordinal encoding
  3. Feature engineering
  4. Build binary target variable
  5. Train / test split
  6. Handle class imbalance with case weights
  7. Fit full logistic regression model
  8. Stepwise variable selection (AIC)
  9. Predictions at default threshold (0.50)
  10. ROC curve and AUC
  11. Optimal threshold via Youden Index
  12. Threshold comparison table

1. Load Data

salary <- read.csv("/home/jakes/Documents/strathmore/Modules/Module 3/linear_models/salaries/salaries.csv")

2. Initial Cleaning & Ordinal Encoding

We ordinal-encode the four key categorical predictors so they can enter the logistic regression as numeric inputs.

Variable Encoding
experience_level EN = 1, MI = 2, SE = 3, EX = 4
employment_type PT/FL = 1, CT = 2, FT = 3
company_size S = 1, M = 2, L = 3
remote_ratio 0 -> 0, 50 -> 1, 100 -> 2
salary <- salary %>%
  mutate(
    exp_num = case_when(
      experience_level == "EN" ~ 1,
      experience_level == "MI" ~ 2,
      experience_level == "SE" ~ 3,
      experience_level == "EX" ~ 4,
      TRUE ~ NA_real_
    ),
    emp_num = case_when(
      employment_type == "FT" ~ 3,
      employment_type == "CT" ~ 2,
      employment_type == "PT" ~ 1,
      employment_type == "FL" ~ 1,
      TRUE ~ NA_real_
    ),
    size_num = case_when(
      company_size == "S" ~ 1,
      company_size == "M" ~ 2,
      company_size == "L" ~ 3,
      TRUE ~ NA_real_
    ),
    remote_num = case_when(
      remote_ratio == 0   ~ 0,
      remote_ratio == 50  ~ 1,
      remote_ratio == 100 ~ 2,
      TRUE ~ NA_real_
    )
  )

glimpse(salary %>% dplyr::select(exp_num, emp_num, size_num, remote_num))
#> Rows: 88,584
#> Columns: 4
#> $ exp_num    <dbl> 2, 3, 3, 3, 3, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2,…
#> $ emp_num    <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
#> $ size_num   <dbl> 3, 2, 2, 2, 2, 2, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,…
#> $ remote_num <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 2, 2, 2,…

3. Feature Engineering

We create several derived features before modelling.

Leakage note: sal_vs_job_mean is computed using a temporary training split so that job-title group means are never contaminated by test data.

salary <- salary %>%
  mutate(
    same_country  = as.integer(as.character(employee_residence) == as.character(company_location)),
    us_company    = as.integer(as.character(company_location) == "US"),
    is_ml_ai_role = as.integer(grepl(
                      "machine learning|data science|ai|research|engineer|scientist|analyst",
                      job_title, ignore.case = TRUE)),
    exp_x_remote  = exp_num * remote_num,
    year_c        = work_year - 2022
  )

set.seed(42)
temp_idx <- createDataPartition(salary$salary_in_usd, p = 0.8, list = FALSE)

job_means <- salary %>%
  slice(temp_idx) %>%
  group_by(job_title) %>%
  summarise(job_mean_sal = mean(salary_in_usd, na.rm = TRUE), .groups = "drop")

salary <- salary %>%
  left_join(job_means, by = "job_title") %>%
  mutate(
    sal_vs_job_mean = salary_in_usd /
                      coalesce(job_mean_sal, median(salary_in_usd, na.rm = TRUE))
  )

salary %>%
  dplyr::select(same_country, us_company, is_ml_ai_role,
                exp_x_remote, year_c, sal_vs_job_mean) %>%
  summary()
#>   same_country      us_company     is_ml_ai_role     exp_x_remote  
#>  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
#>  1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:0.000  
#>  Median :1.0000   Median :1.0000   Median :1.0000   Median :0.000  
#>  Mean   :0.9983   Mean   :0.9004   Mean   :0.8266   Mean   :1.119  
#>  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.000  
#>  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :8.000  
#>      year_c       sal_vs_job_mean  
#>  Min.   :-2.000   Min.   :0.07923  
#>  1st Qu.: 2.000   1st Qu.:0.71238  
#>  Median : 2.000   Median :0.94265  
#>  Mean   : 2.035   Mean   :1.00104  
#>  3rd Qu.: 2.000   3rd Qu.:1.23038  
#>  Max.   : 3.000   Max.   :7.50863

4. Build Binary Target Variable

We define High salary as any salary at or above the 75th percentile of salary_in_usd, giving a roughly 75/25 class split.

p75 <- quantile(salary$salary_in_usd, 0.75, na.rm = TRUE)
cat(sprintf("75th percentile threshold: $%s\n", format(round(p75), big.mark = ",")))
#> 75th percentile threshold: $198,600
salary <- salary %>%
  mutate(
    high_sal = factor(
      as.integer(salary_in_usd >= p75),
      levels = c(0, 1),
      labels = c("Normal", "High")
    )
  )

cat("\nClass distribution:\n")
#> 
#> Class distribution:
print(table(salary$high_sal))
#> 
#> Normal   High 
#>  66419  22165
cat("\nClass proportions:\n")
#> 
#> Class proportions:
print(round(prop.table(table(salary$high_sal)), 3))
#> 
#> Normal   High 
#>   0.75   0.25
ggplot(salary, aes(x = high_sal, fill = high_sal)) +
  geom_bar(width = 0.5) +
  geom_text(stat = "count",
            aes(label = after_stat(count)),
            vjust = -0.5, size = 4) +
  scale_fill_manual(values = c("Normal" = "#85B7EB", "High" = "#1D9E75")) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  labs(title = "Class Balance: Salary Target",
       x = NULL, y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")
Class balance: High vs Normal salary

Class balance: High vs Normal salary

5. Train / Test Split

We select the 10 modelling features and perform a stratified 80/20 split on high_sal to preserve class proportions in both sets.

cls_df <- salary %>%
  dplyr::select(
    high_sal,
    exp_num,
    emp_num,
    size_num,
    remote_num,
    same_country,
    us_company,
    is_ml_ai_role,
    year_c,
    sal_vs_job_mean,
    exp_x_remote
  ) %>%
  drop_na()

set.seed(42)
idx_c   <- createDataPartition(cls_df$high_sal, p = 0.8, list = FALSE)
train_c <- cls_df[ idx_c, ]
test_c  <- cls_df[-idx_c, ]

cat(sprintf("Train: %d rows | Test: %d rows\n", nrow(train_c), nrow(test_c)))
#> Train: 70868 rows | Test: 17716 rows
cat("\nTrain class balance:\n")
#> 
#> Train class balance:
print(round(prop.table(table(train_c$high_sal)), 3))
#> 
#> Normal   High 
#>   0.75   0.25

6. Handle Class Imbalance

With ~75% Normal and ~25% High, accuracy alone is misleading. We assign inverse-frequency weights so both classes contribute equally to the log-likelihood during fitting.

n_total  <- nrow(train_c)
n_high   <- sum(train_c$high_sal == "High")
n_normal <- sum(train_c$high_sal == "Normal")

class_weights <- ifelse(
  train_c$high_sal == "High",
  n_total / (2 * n_high),
  n_total / (2 * n_normal)
)

cat(sprintf("Class weights — Normal: %.3f | High: %.3f\n",
            n_total / (2 * n_normal),
            n_total / (2 * n_high)))
#> Class weights — Normal: 0.667 | High: 1.998

7. Full Logistic Regression Model

We fit a weighted logistic regression with all 10 candidate predictors.

full_fit <- glm(
  high_sal ~ exp_num + emp_num + size_num + remote_num +
             same_country + us_company + is_ml_ai_role +
             year_c + sal_vs_job_mean + exp_x_remote,
  data    = train_c,
  family  = binomial(link = "logit"),
  weights = class_weights
)

summary(full_fit)
#> 
#> Call:
#> glm(formula = high_sal ~ exp_num + emp_num + size_num + remote_num + 
#>     same_country + us_company + is_ml_ai_role + year_c + sal_vs_job_mean + 
#>     exp_x_remote, family = binomial(link = "logit"), data = train_c, 
#>     weights = class_weights)
#> 
#> Coefficients:
#>                  Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)     -19.69960    1.10372 -17.848  < 2e-16 ***
#> exp_num           0.45402    0.02231  20.347  < 2e-16 ***
#> emp_num           1.65241    0.28087   5.883 4.02e-09 ***
#> size_num          0.69406    0.06931  10.014  < 2e-16 ***
#> remote_num       -0.72975    0.08689  -8.399  < 2e-16 ***
#> same_country      1.46892    0.72249   2.033    0.042 *  
#> us_company        0.39377    0.05905   6.669 2.58e-11 ***
#> is_ml_ai_role     0.70473    0.03431  20.538  < 2e-16 ***
#> year_c            0.23522    0.02106  11.170  < 2e-16 ***
#> sal_vs_job_mean   8.18253    0.06394 127.973  < 2e-16 ***
#> exp_x_remote      0.17684    0.03047   5.804 6.47e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 98244  on 70867  degrees of freedom
#> Residual deviance: 43234  on 70857  degrees of freedom
#> AIC: 55410
#> 
#> Number of Fisher Scoring iterations: 6

8. Stepwise Variable Selection (AIC)

We use bidirectional stepwise selection (AIC criterion) to find the most parsimonious model. Variables with negligible contribution are dropped; useful variables already excluded may be re-added.

step_fit <- step(full_fit, direction = "both", trace = 0)
summary(step_fit)
#> 
#> Call:
#> glm(formula = high_sal ~ exp_num + emp_num + size_num + remote_num + 
#>     same_country + us_company + is_ml_ai_role + year_c + sal_vs_job_mean + 
#>     exp_x_remote, family = binomial(link = "logit"), data = train_c, 
#>     weights = class_weights)
#> 
#> Coefficients:
#>                  Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)     -19.69960    1.10372 -17.848  < 2e-16 ***
#> exp_num           0.45402    0.02231  20.347  < 2e-16 ***
#> emp_num           1.65241    0.28087   5.883 4.02e-09 ***
#> size_num          0.69406    0.06931  10.014  < 2e-16 ***
#> remote_num       -0.72975    0.08689  -8.399  < 2e-16 ***
#> same_country      1.46892    0.72249   2.033    0.042 *  
#> us_company        0.39377    0.05905   6.669 2.58e-11 ***
#> is_ml_ai_role     0.70473    0.03431  20.538  < 2e-16 ***
#> year_c            0.23522    0.02106  11.170  < 2e-16 ***
#> sal_vs_job_mean   8.18253    0.06394 127.973  < 2e-16 ***
#> exp_x_remote      0.17684    0.03047   5.804 6.47e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 98244  on 70867  degrees of freedom
#> Residual deviance: 43234  on 70857  degrees of freedom
#> AIC: 55410
#> 
#> Number of Fisher Scoring iterations: 6

8.1 Odds Ratios

or_table <- data.frame(
  OR       = exp(coef(step_fit)),
  CI_lower = exp(confint(step_fit)[, 1]),
  CI_upper = exp(confint(step_fit)[, 2])
) %>%
  round(3)

cat("OR > 1: predictor increases odds of High salary\n")
#> OR > 1: predictor increases odds of High salary
cat("OR < 1: predictor decreases odds of High salary\n\n")
#> OR < 1: predictor decreases odds of High salary
print(or_table)
#>                       OR CI_lower CI_upper
#> (Intercept)        0.000    0.000    0.000
#> exp_num            1.575    1.507    1.645
#> emp_num            5.220    3.068    9.254
#> size_num           2.002    1.748    2.294
#> remote_num         0.482    0.406    0.571
#> same_country       4.345    1.149   19.237
#> us_company         1.483    1.321    1.665
#> is_ml_ai_role      2.023    1.892    2.164
#> year_c             1.265    1.214    1.319
#> sal_vs_job_mean 3577.910 3158.737 4058.623
#> exp_x_remote       1.193    1.125    1.267

9. Predictions at Default Threshold (0.50)

prob_pred  <- predict(step_fit, newdata = test_c, type = "response")
class_pred <- factor(ifelse(prob_pred >= 0.5, "High", "Normal"),
                     levels = c("Normal", "High"))

cm_05 <- confusionMatrix(class_pred, test_c$high_sal, positive = "High")
print(cm_05)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Normal  High
#>     Normal  11554   591
#>     High     1729  3842
#>                                         
#>                Accuracy : 0.869         
#>                  95% CI : (0.864, 0.874)
#>     No Information Rate : 0.7498        
#>     P-Value [Acc > NIR] : < 2.2e-16     
#>                                         
#>                   Kappa : 0.6785        
#>                                         
#>  Mcnemar's Test P-Value : < 2.2e-16     
#>                                         
#>             Sensitivity : 0.8667        
#>             Specificity : 0.8698        
#>          Pos Pred Value : 0.6896        
#>          Neg Pred Value : 0.9513        
#>              Prevalence : 0.2502        
#>          Detection Rate : 0.2169        
#>    Detection Prevalence : 0.3145        
#>       Balanced Accuracy : 0.8683        
#>                                         
#>        'Positive' Class : High          
#> 

10. ROC Curve & AUC

AUC is the primary evaluation metric here — it is threshold-independent and robust to class imbalance.

roc_obj <- roc(test_c$high_sal, prob_pred,
               levels = c("Normal", "High"), quiet = TRUE)
auc_val <- auc(roc_obj)

cat(sprintf("AUC: %.3f\n", auc_val))
#> AUC: 0.944
cat("Interpretation: 0.9-1.0 excellent | 0.8-0.9 good | 0.7-0.8 fair | <0.7 poor\n")
#> Interpretation: 0.9-1.0 excellent | 0.8-0.9 good | 0.7-0.8 fair | <0.7 poor
ggroc(roc_obj, colour = "#1D9E75", linewidth = 1) +
  geom_abline(slope = 1, intercept = 1,
              linetype = "dashed", colour = "gray50") +
  annotate("text", x = 0.3, y = 0.1,
           label = sprintf("AUC = %.3f", auc_val), size = 4.5,
           colour = "#1D9E75", fontface = "bold") +
  labs(title = "ROC Curve: Salary Classifier",
       x     = "Specificity",
       y     = "Sensitivity") +
  theme_minimal()
ROC curve for the salary classifier

ROC curve for the salary classifier

11. Optimal Threshold (Youden Index)

The Youden index maximises Sensitivity + Specificity - 1, balancing the true positive and true negative rates simultaneously.

best_thresh <- coords(roc_obj, "best",
                      ret         = "threshold",
                      best.method = "youden")
best_thresh <- as.numeric(best_thresh)
cat(sprintf("Youden-optimal threshold: %.3f\n", best_thresh))
#> Youden-optimal threshold: 0.403
class_opt <- factor(ifelse(prob_pred >= best_thresh, "High", "Normal"),
                    levels = c("Normal", "High"))

cm_opt <- confusionMatrix(class_opt, test_c$high_sal, positive = "High")
print(cm_opt)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Normal  High
#>     Normal  11053   341
#>     High     2230  4092
#>                                         
#>                Accuracy : 0.8549        
#>                  95% CI : (0.8496, 0.86)
#>     No Information Rate : 0.7498        
#>     P-Value [Acc > NIR] : < 2.2e-16     
#>                                         
#>                   Kappa : 0.6613        
#>                                         
#>  Mcnemar's Test P-Value : < 2.2e-16     
#>                                         
#>             Sensitivity : 0.9231        
#>             Specificity : 0.8321        
#>          Pos Pred Value : 0.6473        
#>          Neg Pred Value : 0.9701        
#>              Prevalence : 0.2502        
#>          Detection Rate : 0.2310        
#>    Detection Prevalence : 0.3569        
#>       Balanced Accuracy : 0.8776        
#>                                         
#>        'Positive' Class : High          
#> 

12. Threshold Comparison

Note: Accuracy is misleading here — the No-Information Rate is ~75%. Use Kappa and AUC as primary metrics for this imbalanced classifier.

f1_05  <- cm_05$byClass["F1"]
f1_opt <- cm_opt$byClass["F1"]

comparison <- data.frame(
  Threshold   = c(0.500, round(best_thresh, 3)),
  Accuracy    = c(cm_05$overall["Accuracy"],    cm_opt$overall["Accuracy"]),
  Sensitivity = c(cm_05$byClass["Sensitivity"], cm_opt$byClass["Sensitivity"]),
  Specificity = c(cm_05$byClass["Specificity"], cm_opt$byClass["Specificity"]),
  Kappa       = c(cm_05$overall["Kappa"],       cm_opt$overall["Kappa"]),
  F1          = c(f1_05,                        f1_opt)
) %>%
  mutate(across(-Threshold, ~ round(., 3)))

knitr::kable(comparison,
             caption = "Threshold Comparison: Default (0.50) vs Youden-Optimal",
             align   = "lrrrrr")
Threshold Comparison: Default (0.50) vs Youden-Optimal
Threshold Accuracy Sensitivity Specificity Kappa F1
0.500 0.869 0.867 0.870 0.678 0.768
0.403 0.855 0.923 0.832 0.661 0.761
cat(sprintf("\nAUC (threshold-independent): %.3f\n", auc_val))
#> 
#> AUC (threshold-independent): 0.944

13. Summary & Conclusions

#> Model                      : Weighted logistic regression (stepwise AIC)
#> Target                     : High salary (salary_in_usd >= 75th percentile)
#> AUC                        : 0.944
#> Kappa (threshold = 0.50)   : 0.678
#> F1    (threshold = 0.50)   : 0.768
#> Optimal threshold (Youden) : 0.403

Key Findings & Implications

Drivers of High Salary

The classification model identifies three main factors influencing entry into the top 25% salary bracket:

  • Experience Premium: Senior (SE) and Executive (EX) roles significantly increase the likelihood of high compensation, showing a strong non-linear effect.
  • Geographic Advantage: Working for a US-based company is one of the strongest predictors of high salary.
  • Specialization: Roles related to Machine Learning and AI have a higher probability of exceeding the 75th percentile compared to general data roles.

Model Performance Insights

  • Metric Choice: Due to class imbalance, Cohen’s Kappa (k) was used instead of accuracy to better capture true model performance, as it accounts for chance agreement.
  • AUC Strength: A high AUC indicates strong ability to distinguish between high and normal salary classes.
  • Threshold Optimization: Adjusting the classification threshold via the Youden Index improved sensitivity, making the model more effective at identifying high earners.

Practical Implications

  • Career Strategy: Advancing experience level, targeting US-based companies, and specializing in AI/ML roles are key drivers of higher earnings.
  • Model Application: The classifier can function as a salary screening tool, helping users assess the likelihood of a role meeting high-salary criteria.

apps-fileview.texmex_20260329.08_p1 salary_full.txt Displaying salary_full.txt.