Part 1: Regression

1. Dataset Overview & Attribute Information

This dataset contains insights into the compensation of data science professionals across various global regions and roles. Understanding the nuances of these features is critical for building a robust predictive model.

1.1 Decision Features

work_year: The year the salary was paid. This is vital for observing inflation or market demand trends over time.
experience_level: A categorical ordinal feature ranging from EN (Entry) to EX (Executive).
employment_type: Defines the nature of the contract (Part Time (PT), Full Time (FT), CT (Contract), FL (Freelance)).
job_title: The specific role (e.g., Data Scientist, ML Engineer).

1.2 Geographic & Corporate Features

employee_residence & company_location: ISO 3166 codes representing where the talent lives vs. where the company is based.
company_size: Categorized by number of employees into Small (<50), Medium (50-250), or Large (>250).
remote_ratio: An ordinal numeric value (0, 50, 100) indicating the level of remote work.

1.3 Target Variable

salary_in_usd (Primary Target): The standardized version of the salary, normalized to USD based on the exchange rates of the respective work_year. For modeling purposes, this eliminates the variance caused by currency fluctuations and allows for a direct comparison across global markets.

2. Data Loading & Preparation

2.1 Load Libraries

libs <- c("tidyverse", "dplyr", "gridExtra", "ggplot2", "ggfortify", "broom",
          "ggcorrplot", "corrr", "corrplot", "MASS", "datarium", "randomForest",
          "rpart", "Hmisc", "skimr", "stringr", "patchwork", "glmnet",
          "rpart.plot", "knitr", "caret", "pROC")

for (lib in libs) {
    if (!(lib %in% installed.packages())){
      install.packages(lib)
    }
    library(lib, character.only = TRUE)
}

theme_set(theme_classic())

2.2 Load Data

salary <- read.csv("/home/jakes/Documents/strathmore/Modules/Module 3/linear_models/salaries/salaries.csv")
head(salary, 10)

#>    work_year experience_level employment_type                     job_title
#> 1       2025               MI              FT      Customer Success Manager
#> 2       2025               SE              FT                      Engineer
#> 3       2025               SE              FT                      Engineer
#> 4       2025               SE              FT             Applied Scientist
#> 5       2025               SE              FT             Applied Scientist
#> 6       2025               EN              FT                  Data Analyst
#> 7       2025               EN              FT                  Data Analyst
#> 8       2025               SE              FT Software Development Engineer
#> 9       2025               SE              FT Software Development Engineer
#> 10      2025               SE              FT            Research Scientist
#>    salary salary_currency salary_in_usd employee_residence remote_ratio
#> 1   57000             EUR         60000                 NL           50
#> 2  165000             USD        165000                 US            0
#> 3  109000             USD        109000                 US            0
#> 4  294000             USD        294000                 US            0
#> 5  137600             USD        137600                 US            0
#> 6   82000             USD         82000                 US            0
#> 7   44000             USD         44000                 US            0
#> 8  149800             USD        149800                 US            0
#> 9   89700             USD         89700                 US            0
#> 10 200000             USD        200000                 US            0
#>    company_location company_size
#> 1                NL            L
#> 2                US            M
#> 3                US            M
#> 4                US            M
#> 5                US            M
#> 6                US            M
#> 7                US            M
#> 8                US            L
#> 9                US            L
#> 10               US            M

2.3 Data Understanding

Checking the structure and data types of the dataset.

glimpse(salary)

#> Rows: 88,584
#> Columns: 11
#> $ work_year          <int> 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 202…
#> $ experience_level   <chr> "MI", "SE", "SE", "SE", "SE", "EN", "EN", "SE", "SE…
#> $ employment_type    <chr> "FT", "FT", "FT", "FT", "FT", "FT", "FT", "FT", "FT…
#> $ job_title          <chr> "Customer Success Manager", "Engineer", "Engineer",…
#> $ salary             <int> 57000, 165000, 109000, 294000, 137600, 82000, 44000…
#> $ salary_currency    <chr> "EUR", "USD", "USD", "USD", "USD", "USD", "USD", "U…
#> $ salary_in_usd      <int> 60000, 165000, 109000, 294000, 137600, 82000, 44000…
#> $ employee_residence <chr> "NL", "US", "US", "US", "US", "US", "US", "US", "US…
#> $ remote_ratio       <int> 50, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 100, 100, 0…
#> $ company_location   <chr> "NL", "US", "US", "US", "US", "US", "US", "US", "US…
#> $ company_size       <chr> "L", "M", "M", "M", "M", "M", "M", "L", "L", "M", "…

All the variables have the correct data types.

Checking for missing values and distributions of each column.

skim(salary)

Data summary
Name	salary
Number of rows	88584
Number of columns	11
_______________________
Column type frequency:
character	7
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
experience_level	1	2	2	4
employment_type	1	2	2	4
job_title	1	5	41	312
salary_currency	1	3	3	26
employee_residence	1	2	2	96
company_location	1	2	2	90
company_size	1	1	1	3

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
work_year	1	2024.03	0.62	2020	2024.0	2024	2024	2025	▁▁▁▇▂
salary	1	161932.29	196531.72	14000	106000.0	147000	199500	30400000	▇▁▁▁▁
salary_in_usd	1	157567.80	73531.37	15000	106097.2	146307	198600	800000	▇▅▁▁▁
remote_ratio	1	21.29	40.83	0	0.0	0	0	100	▇▁▁▁▂

Checking the summary statistics.

summary(salary)

#>    work_year    experience_level   employment_type     job_title        
#>  Min.   :2020   Length:88584       Length:88584       Length:88584      
#>  1st Qu.:2024   Class :character   Class :character   Class :character  
#>  Median :2024   Mode  :character   Mode  :character   Mode  :character  
#>  Mean   :2024                                                           
#>  3rd Qu.:2024                                                           
#>  Max.   :2025                                                           
#>      salary         salary_currency    salary_in_usd    employee_residence
#>  Min.   :   14000   Length:88584       Min.   : 15000   Length:88584      
#>  1st Qu.:  106000   Class :character   1st Qu.:106097   Class :character  
#>  Median :  147000   Mode  :character   Median :146307   Mode  :character  
#>  Mean   :  161932                      Mean   :157568                     
#>  3rd Qu.:  199500                      3rd Qu.:198600                     
#>  Max.   :30400000                      Max.   :800000                     
#>   remote_ratio    company_location   company_size      
#>  Min.   :  0.00   Length:88584       Length:88584      
#>  1st Qu.:  0.00   Class :character   Class :character  
#>  Median :  0.00   Mode  :character   Mode  :character  
#>  Mean   : 21.29                                        
#>  3rd Qu.:  0.00                                        
#>  Max.   :100.00

Checking the column value counts.

salary %>%
  select_if(~is.character(.) | is.factor(.)) %>%
  map(table)

#> $experience_level
#> 
#>    EN    EX    MI    SE 
#>  8381  1859 26748 51596 
#> 
#> $employment_type
#> 
#>    CT    FL    FT    PT 
#>   224    16 88111   233 
#> 
#> $job_title
#> 
#>                         Account Executive 
#>                                        22 
#>                         Actuarial Analyst 
#>                                        14 
#>                      Admin & Data Analyst 
#>                                         6 
#>                              AI Architect 
#>                                       165 
#>                          AI Data Engineer 
#>                                         2 
#>                         AI Data Scientist 
#>                                         8 
#>                              AI Developer 
#>                                       123 
#>                               AI Engineer 
#>                                       850 
#>                    AI Engineering Manager 
#>                                         2 
#>                        AI Governance Lead 
#>                                         2 
#>                                   AI Lead 
#>                                        15 
#>              AI Machine Learning Engineer 
#>                                         3 
#>                        AI Product Manager 
#>                                        27 
#>                          AI Product Owner 
#>                                        18 
#>                             AI Programmer 
#>                                         9 
#>                      AI Research Engineer 
#>                                         8 
#>                     AI Research Scientist 
#>                                         7 
#>                             AI Researcher 
#>                                        54 
#>                              AI Scientist 
#>                                        74 
#>          AI Software Development Engineer 
#>                                         1 
#>                      AI Software Engineer 
#>                                        11 
#>                     AI Solution Architect 
#>                                        10 
#>                             AI Specialist 
#>                                        70 
#>                       Algorithm Developer 
#>                                        22 
#>                                   Analyst 
#>                                      2066 
#>                         Analytics Analyst 
#>                                         1 
#>                        Analytics Engineer 
#>                                      1374 
#>             Analytics Engineering Manager 
#>                                         1 
#>                            Analytics Lead 
#>                                        52 
#>                      Analytics Specialist 
#>                                        32 
#>                     Application Developer 
#>                                         8 
#>                        Applied AI ML Lead 
#>                                         2 
#>                    Applied Data Scientist 
#>                                        13 
#>         Applied Machine Learning Engineer 
#>                                         3 
#>        Applied Machine Learning Scientist 
#>                                        14 
#>                Applied Research Scientist 
#>                                         2 
#>                         Applied Scientist 
#>                                      1778 
#>                                 Architect 
#>                                       208 
#>          Artificial Intelligence Engineer 
#>                                        18 
#>                                 Associate 
#>                                      1205 
#>             Autonomous Vehicle Technician 
#>                                         2 
#>                        AWS Data Architect 
#>                                         1 
#>                       Azure Data Engineer 
#>                                         2 
#>                         Backend Developer 
#>                                         2 
#>                          Backend Engineer 
#>                                       189 
#>                 Backend Software Engineer 
#>                                         1 
#>                             Bear Robotics 
#>                                        10 
#>                                BI Analyst 
#>                                       153 
#>                           BI Data Analyst 
#>                                        19 
#>                          BI Data Engineer 
#>                                         1 
#>                              BI Developer 
#>                                       226 
#>                               BI Engineer 
#>                                        74 
#>                          Big Data Analyst 
#>                                         1 
#>                        Big Data Architect 
#>                                         2 
#>                        Big Data Developer 
#>                                         7 
#>                         Big Data Engineer 
#>                                        13 
#>                          Bioinformatician 
#>                                        18 
#>                  Bioinformatics Scientist 
#>                                        44 
#>                          Business Analyst 
#>                                       238 
#>                     Business Data Analyst 
#>                                        21 
#>              Business Development Manager 
#>                                         1 
#>                 Business Insights Manager 
#>                                         1 
#>                     Business Intelligence 
#>                                       294 
#>             Business Intelligence Analyst 
#>                                       502 
#>          Business Intelligence Consultant 
#>                                         2 
#>        Business Intelligence Data Analyst 
#>                                         2 
#>           Business Intelligence Developer 
#>                                       285 
#>            Business Intelligence Engineer 
#>                                       680 
#>                Business Intelligence Lead 
#>                                        32 
#>             Business Intelligence Manager 
#>                                        38 
#>          Business Intelligence Specialist 
#>                                        16 
#>                         Chatbot Developer 
#>                                         4 
#>                    Clinical Data Operator 
#>                                        10 
#>                      Cloud Data Architect 
#>                                         1 
#>                       Cloud Data Engineer 
#>                                         4 
#>              Cloud Database Administrator 
#>                                        22 
#>                   Cloud Database Engineer 
#>                                        37 
#>                           Cloud Developer 
#>                                         2 
#>                            Cloud Engineer 
#>                                       102 
#>                   Compliance Data Analyst 
#>                                         2 
#>                   Computational Biologist 
#>                                        40 
#>                   Computational Scientist 
#>                                         4 
#>                  Computer Vision Engineer 
#>                                        98 
#>         Computer Vision Software Engineer 
#>                                         5 
#>                                Consultant 
#>                                       611 
#>                  Consultant Data Engineer 
#>                                         2 
#>                         Controls Engineer 
#>                                         1 
#>                          CRM Data Analyst 
#>                                         1 
#>                  Customer Success Manager 
#>                                         1 
#>                              Data Analyst 
#>                                      8652 
#>                         Data Analyst Lead 
#>                                         2 
#>                  Data Analytics Associate 
#>                                         7 
#>                 Data Analytics Consultant 
#>                                        40 
#>                  Data Analytics Developer 
#>                                        10 
#>                   Data Analytics Engineer 
#>                                         5 
#>                       Data Analytics Lead 
#>                                        89 
#>                    Data Analytics Manager 
#>                                       172 
#>                 Data Analytics Specialist 
#>                                        60 
#>                  Data Analytics Team Lead 
#>                                         2 
#>                Data and Reporting Analyst 
#>                                        16 
#>           Data and Reporting Professional 
#>                                        12 
#>                            Data Architect 
#>                                      1499 
#>                            Data Developer 
#>                                       100 
#>                      Data DevOps Engineer 
#>                                         3 
#>                             Data Engineer 
#>                                     10883 
#>                           Data Governance 
#>                                       114 
#>                   Data Governance Analyst 
#>                                        98 
#>                 Data Governance Architect 
#>                                         2 
#>                  Data Governance Engineer 
#>                                        14 
#>                      Data Governance Lead 
#>                                        64 
#>                   Data Governance Manager 
#>                                        31 
#>                Data Governance Specialist 
#>                                        54 
#>              Data Infrastructure Engineer 
#>                                        80 
#>                  Data Integration Analyst 
#>                                        14 
#>              Data Integration Coordinator 
#>                                         4 
#>                Data Integration Developer 
#>                                        12 
#>                 Data Integration Engineer 
#>                                        62 
#>               Data Integration Specialist 
#>                                        61 
#>                           Data Integrator 
#>                                        10 
#>                    Data Integrity Analyst 
#>                                        12 
#>                 Data Integrity Specialist 
#>                                         2 
#>                                 Data Lead 
#>                                       186 
#>                   Data Management Analyst 
#>                                       156 
#>                 Data Management Associate 
#>                                        10 
#>                Data Management Consultant 
#>                                        12 
#>               Data Management Coordinator 
#>                                         2 
#>                      Data Management Lead 
#>                                        28 
#>                   Data Management Manager 
#>                                         8 
#>                Data Management Specialist 
#>                                       133 
#>                              Data Manager 
#>                                       727 
#>                              Data Modeler 
#>                                       134 
#>                           Data Operations 
#>                                        14 
#>                   Data Operations Analyst 
#>                                        80 
#>                 Data Operations Associate 
#>                                        18 
#>                  Data Operations Engineer 
#>                                        60 
#>                      Data Operations Lead 
#>                                         2 
#>                   Data Operations Manager 
#>                                        24 
#>                Data Operations Specialist 
#>                                        38 
#>                    Data Pipeline Engineer 
#>                                         2 
#>                    Data Platform Engineer 
#>                                        20 
#>                      Data Product Analyst 
#>                                         2 
#>                      Data Product Manager 
#>                                       192 
#>                        Data Product Owner 
#>                                        92 
#>                      Data Quality Analyst 
#>                                       111 
#>                     Data Quality Engineer 
#>                                        37 
#>                         Data Quality Lead 
#>                                        40 
#>                      Data Quality Manager 
#>                                        16 
#>                   Data Quality Specialist 
#>                                        23 
#>                             Data Reporter 
#>                                         4 
#>                    Data Reporting Analyst 
#>                                        62 
#>                 Data Reporting Specialist 
#>                                         2 
#>                    Data Science Tech Lead 
#>                                         1 
#>                            Data Scientist 
#>                                     13156 
#>                  Data Scientist Associate 
#>                                         1 
#>                       Data Scientist Lead 
#>                                         3 
#>                    Data Scientist Manager 
#>                                         1 
#>                           Data Specialist 
#>                                       528 
#>                           Data Strategist 
#>                                        64 
#>                        Data Strategy Lead 
#>                                        16 
#>                     Data Strategy Manager 
#>                                         6 
#>                            Data Team Lead 
#>                                        36 
#>                Data Visualization Analyst 
#>                                        37 
#>              Data Visualization Developer 
#>                                        20 
#>               Data Visualization Engineer 
#>                                        54 
#>             Data Visualization Specialist 
#>                                        46 
#>                    Database Administrator 
#>                                         4 
#>                       Databricks Engineer 
#>                                         2 
#>                          DataOps Engineer 
#>                                        10 
#>                        Decision Scientist 
#>                                       172 
#>                    Deep Learning Engineer 
#>                                        12 
#>                  Deep Learning Researcher 
#>                                         1 
#>                                 Developer 
#>                                       344 
#>                        Developer Advocate 
#>                                         4 
#>                           DevOps Engineer 
#>                                       286 
#>         Director of Business Intelligence 
#>                                         2 
#>              Director of Machine Learning 
#>                                        22 
#>               Elasticsearch Administrator 
#>                                         2 
#>    Encounter Data Management Professional 
#>                                        36 
#>                                  Engineer 
#>                                      5344 
#>                       Engineering Manager 
#>                                       460 
#>              Enterprise Account Executive 
#>                                        28 
#>                             ETL Developer 
#>                                        42 
#>                              ETL Engineer 
#>                                         7 
#>                      Finance Data Analyst 
#>                                         3 
#>                    Financial Data Analyst 
#>                                         4 
#>                         Frontend Engineer 
#>                                        12 
#>                      Full Stack Developer 
#>                                        60 
#>                       Full Stack Engineer 
#>                                       122 
#>                        Fullstack Engineer 
#>                                        24 
#>                           GenAI Architect 
#>                                        10 
#>                                Head of AI 
#>                                        26 
#>             Head of Business Intelligence 
#>                                         2 
#>                              Head of Data 
#>                                       206 
#>                  Head of Machine Learning 
#>                                        11 
#>                   Infrastructure Engineer 
#>                                        16 
#>                           Insight Analyst 
#>                                        88 
#>                            Java Developer 
#>                                         2 
#>                          Lead AI Engineer 
#>                                         1 
#>                              Lead Analyst 
#>                                        54 
#>                        Lead Data Analysis 
#>                                         4 
#>                         Lead Data Analyst 
#>                                         6 
#>                        Lead Data Engineer 
#>                                         7 
#>                      Lead Data Management 
#>                                         2 
#>                       Lead Data Scientist 
#>                                        11 
#>                             Lead Engineer 
#>                                        48 
#>            Lead Machine Learning Engineer 
#>                                         5 
#>                Machine Learning Architect 
#>                                        18 
#>                Machine Learning Developer 
#>                                        44 
#>                 Machine Learning Engineer 
#>                                      6443 
#>  Machine Learning Infrastructure Engineer 
#>                                        41 
#>                     Machine Learning Lead 
#>                                         8 
#>                  Machine Learning Manager 
#>                                        14 
#>           Machine Learning Model Engineer 
#>                                        16 
#>                  Machine Learning Modeler 
#>                                        10 
#>     Machine Learning Performance Engineer 
#>                                         2 
#>        Machine Learning Platform Engineer 
#>                                        16 
#>         Machine Learning Quality Engineer 
#>                                        14 
#>        Machine Learning Research Engineer 
#>                                         5 
#>               Machine Learning Researcher 
#>                                       553 
#>                Machine Learning Scientist 
#>                                       340 
#>        Machine Learning Software Engineer 
#>                                        15 
#>               Machine Learning Specialist 
#>                                        12 
#>                Machine Learning Tech Lead 
#>                                         2 
#>                   Machine Vision Engineer 
#>                                         2 
#>                                   Manager 
#>                                      3488 
#>                   Manager Data Management 
#>                                         2 
#>                         Marketing Analyst 
#>                                         8 
#>               Marketing Analytics Manager 
#>                                         2 
#>                    Marketing Data Analyst 
#>                                         2 
#>                   Marketing Data Engineer 
#>                                         1 
#>                  Marketing Data Scientist 
#>                                         1 
#>                 Marketing Science Partner 
#>                                         2 
#>                    Master Data Management 
#>                                        20 
#>                    Master Data Specialist 
#>                                         2 
#>                 Member of Technical Staff 
#>                                        86 
#>                ML Infrastructure Engineer 
#>                                        18 
#>                            MLOps Engineer 
#>                                       116 
#>                              NLP Engineer 
#>                                        23 
#>                       People Data Analyst 
#>                                         1 
#>                    Platform Data Engineer 
#>                                         1 
#>                         Platform Engineer 
#>                                       290 
#>                       Postdoctoral Fellow 
#>                                        16 
#>                   Postdoctoral Researcher 
#>                                         7 
#>                                  Power BI 
#>                                        10 
#>                    Power BI Administrator 
#>                                         4 
#>                        Power BI Architect 
#>                                         2 
#>                       Power BI Consultant 
#>                                         2 
#>                        Power BI Developer 
#>                                        73 
#>                       Power BI Specialist 
#>                                         2 
#>                           Pricing Analyst 
#>                                        14 
#> Principal Application Delivery Consultant 
#>                                        40 
#>                    Principal Data Analyst 
#>                                         2 
#>                  Principal Data Architect 
#>                                         1 
#>                   Principal Data Engineer 
#>                                         3 
#>                  Principal Data Scientist 
#>                                        10 
#>       Principal Machine Learning Engineer 
#>                                         3 
#>                      Principal Researcher 
#>                                        16 
#>              Principal Software Architect 
#>                                         6 
#>          Principal Statistical Programmer 
#>                                        44 
#>                           Product Analyst 
#>                                       110 
#>                      Product Data Analyst 
#>                                         6 
#>                          Product Designer 
#>                                        59 
#>                           Product Manager 
#>                                      1314 
#>                             Product Owner 
#>                                        19 
#>                        Product Specialist 
#>                                         2 
#>                           Prompt Engineer 
#>                                        53 
#>                          Python Developer 
#>                                        60 
#>                               QA Engineer 
#>                                        44 
#>                      Quantitative Analyst 
#>                                        85 
#>                    Quantitative Developer 
#>                                        14 
#>             Quantitative Research Analyst 
#>                                         1 
#>                   Quantitative Researcher 
#>                                        60 
#>                          Research Analyst 
#>                                       380 
#>                        Research Assistant 
#>                                        55 
#>                        Research Associate 
#>                                       112 
#>                     Research Data Manager 
#>                                         1 
#>                         Research Engineer 
#>                                      1203 
#>                        Research Scientist 
#>                                      2553 
#>                              Risk Analyst 
#>                                        10 
#>                         Robotics Engineer 
#>                                        78 
#>                Robotics Software Engineer 
#>                                        12 
#>         Safety Data Management Specialist 
#>                                         1 
#>                        Sales Data Analyst 
#>                                         1 
#>          Sales Development Representative 
#>                                         7 
#>                             SAS Developer 
#>                                         2 
#>                     Scala Spark Developer 
#>                                         4 
#>                         Security Engineer 
#>                                        56 
#>                       Security Researcher 
#>                                         8 
#>                 Site Reliability Engineer 
#>                                       282 
#>                        Software Architect 
#>                                        18 
#>                    Software Data Engineer 
#>                                         3 
#>                        Software Developer 
#>                                       424 
#>             Software Development Engineer 
#>                                       451 
#>                         Software Engineer 
#>                                      9596 
#>                        Solution Architect 
#>                                       142 
#>                         Solution Engineer 
#>                                         2 
#>                       Solutions Architect 
#>                                       510 
#>                        Solutions Engineer 
#>                                       225 
#>                        Staff Data Analyst 
#>                                         3 
#>                      Staff Data Scientist 
#>                                         3 
#>           Staff Machine Learning Engineer 
#>                                         1 
#>                                     Stage 
#>                                        14 
#>                    Statistical Programmer 
#>                                        74 
#>                              Statistician 
#>                                        46 
#>                           System Engineer 
#>                                        18 
#>                          Systems Engineer 
#>                                       406 
#>                         Tableau Developer 
#>                                         2 
#>                                 Tech Lead 
#>                                        18 
#>                            Technical Lead 
#>                                       112 
#>                      Technical Specialist 
#>                                         4 
#>                          Technical Writer 
#>                                        19 
#>                     Technology Integrator 
#>                                        12 
#> 
#> $salary_currency
#> 
#>   AUD   BRL   CAD   CHF   CLP   CZK   DKK   EUR   GBP   HKD   HUF   ILS   INR 
#>    18    18   279    34     1     3     5  1559  2479     1     5     3    80 
#>   JPY   MXN   NOK   NZD   PHP   PLN   SEK   SGD   THB   TRY   TWD   USD   ZAR 
#>     8     3     3     1    12    45     1    15     2     4     8 83994     3 
#> 
#> $employee_residence
#> 
#>    AD    AE    AM    AR    AS    AT    AU    BA    BE    BG    BM    BO    BR 
#>     1     5    10    66     1   173   298     2    26     4     1     2    73 
#>    CA    CF    CH    CL    CN    CO    CR    CY    CZ    DE    DK    DO    DZ 
#>  3203     2    38    19     1    39     7    11     9   263     7     4     1 
#>    EC    EE    EG    ES    FI    FR    GB    GE    GH    GR    HK    HN    HR 
#>     6    11    42   203    45   226  2576     2     4    24     2     3     8 
#>    HU    ID    IE    IL    IN    IQ    IR    IT    JE    JP    KE    KR    KW 
#>     6     3    85    15   155     1     1    41     1    20     6     8     1 
#>    LB    LT    LU    LV    MD    MT    MU    MX    MY    NG    NL    NO    NZ 
#>    12   202     4    54     2    15     1    69     3    13   187     7    40 
#>    OM    PE    PH    PK    PL    PR    PT    QA    RO    RS    RU    RW    SA 
#>     1     9    46     7    86     9    47     1     7     4     6     1     3 
#>    SE    SG    SI    SK    SV    TH    TN    TR    TW    UA    UG    US    UZ 
#>     9    22     6   110     5     4     2    23     8    16     1 79705     3 
#>    VE    VN    XK    ZA    ZM 
#>     2     7     2    61     1 
#> 
#> $company_location
#> 
#>    AD    AE    AM    AR    AS    AT    AU    BA    BE    BG    BR    BS    CA 
#>     1     5     9    62     4   173   303     2    24     3    71     1  3204 
#>    CD    CF    CH    CL    CN    CO    CR    CY    CZ    DE    DK    DO    DZ 
#>     1     2    39    18     1    39     6    10    10   272     9     3     2 
#>    EC    EE    EG    ES    FI    FR    GB    GH    GI    GR    HK    HN    HR 
#>     6    12    41   199    46   221  2584     3     1    22     1     3     5 
#>    HU    ID    IE    IL    IN    IQ    IR    IT    JP    KE    KR    LB    LT 
#>     6     4    85    17   134     1     1    34    20     6     8    12   202 
#>    LU    LV    MD    MT    MU    MX    MY    NG    NL    NO    NZ    OM    PE 
#>     5    54     1    15     1    70     3    10   187     7    40     1     8 
#>    PH    PK    PL    PR    PT    QA    RO    RS    RU    SA    SE    SG    SI 
#>    43     3    85     8    44     1     6     3     7     3    10    21     6 
#>    SK    SV    TH    TR    TW    UA    US    VE    VN    XK    ZA    ZM 
#>   110     5     3    21     8    15 79762     2     4     2    61     1 
#> 
#> $company_size
#> 
#>     L     M     S 
#>  2703 85667   214

The features job_title, employee_residence, and company_location have very many separate categories. We will classify them based on their frequency and domain knowledge. If not addressed, this will become a high cardinality problem during modelling.

3. Data Cleaning & Transformation

3.1 Consolidate Job Titles

Restructuring the job titles into 7 broad categories instead of 312.

salary <- salary %>%
  mutate(job_title = case_when(
    str_detect(job_title, "Manager|Head|Director|Lead|Principal|Staff|Architect|Tech Lead") ~ "Leadership & Management",
    str_detect(job_title, "Machine Learning|ML|AI|Computer Vision|Robotics|NLP|GenAI|Autonomous") ~ "ML & AI",
    str_detect(job_title, "Scientist|Research|Applied|Decision Scientist|Statistician") ~ "Data Science",
    str_detect(job_title, "Engineer|ETL|Database|Infrastructure|Data Platform|Pipeline") ~ "Data Engineering",
    str_detect(job_title, "Analyst|Analytics|Reporting|Visualization|Quant") ~ "Data Analysis",
    str_detect(job_title, "BI |Business Intelligence|Tableau|Power BI|Insight") ~ "BI & Strategy",
    str_detect(job_title, "Software|Developer|Backend|Full Stack|Cloud|DevOps|Systems") ~ "Software Engineering",
    TRUE ~ "Other"
  ))

table(salary$job_title)

#> 
#>           BI & Strategy           Data Analysis        Data Engineering 
#>                     916                   12841                   31051 
#>            Data Science Leadership & Management                 ML & AI 
#>                   19597                   10454                    9128 
#>                   Other    Software Engineering 
#>                    3540                    1057

3.2 Consolidate Salary Currency

The currency the salaries were paid in were mostly USD, EUR, and GBP; the rest were classified as ‘Other’.

salary <- salary %>%
  mutate(salary_currency = case_when(
    str_detect(salary_currency, "USD") ~ "USD",
    str_detect(salary_currency, "EUR") ~ "EUR",
    str_detect(salary_currency, "GBP") ~ "GBP",
    TRUE ~ "Other"
  ))

table(salary$salary_currency)

#> 
#>   EUR   GBP Other   USD 
#>  1559  2479   552 83994

3.3 Consolidate Employee Residence & Company Location

Both employees and companies were predominantly located in the US, Canada, and Great Britain; the rest were classified as ‘Other’.

salary <- salary %>%
  mutate(employee_residence = case_when(
    str_detect(employee_residence, "US") ~ "US",
    str_detect(employee_residence, "CA") ~ "CA",
    str_detect(employee_residence, "GB") ~ "GB",
    TRUE ~ "Other"
  ))

table(salary$employee_residence)

#> 
#>    CA    GB Other    US 
#>  3203  2576  3100 79705

salary <- salary %>%
  mutate(company_location = case_when(
    str_detect(company_location, "US") ~ "US",
    str_detect(company_location, "CA") ~ "CA",
    str_detect(company_location, "GB") ~ "GB",
    TRUE ~ "Other"
  ))

table(salary$company_location)

#> 
#>    CA    GB Other    US 
#>  3204  2584  3034 79762

Confirming all transformations.

salary %>%
  select_if(~is.character(.) | is.factor(.)) %>%
  map(table)

#> $experience_level
#> 
#>    EN    EX    MI    SE 
#>  8381  1859 26748 51596 
#> 
#> $employment_type
#> 
#>    CT    FL    FT    PT 
#>   224    16 88111   233 
#> 
#> $job_title
#> 
#>           BI & Strategy           Data Analysis        Data Engineering 
#>                     916                   12841                   31051 
#>            Data Science Leadership & Management                 ML & AI 
#>                   19597                   10454                    9128 
#>                   Other    Software Engineering 
#>                    3540                    1057 
#> 
#> $salary_currency
#> 
#>   EUR   GBP Other   USD 
#>  1559  2479   552 83994 
#> 
#> $employee_residence
#> 
#>    CA    GB Other    US 
#>  3203  2576  3100 79705 
#> 
#> $company_location
#> 
#>    CA    GB Other    US 
#>  3204  2584  3034 79762 
#> 
#> $company_size
#> 
#>     L     M     S 
#>  2703 85667   214

3.4 Outlier Detection & Treatment

ggplot(salary, aes(y = salary_in_usd)) +
  geom_boxplot(fill = "steelblue", outlier.color = "red", outlier.shape = 16) +
  labs(title = "Identifying Salary Outliers", y = "Salary (USD)") +
  theme_minimal()

Q1 <- quantile(salary$salary_in_usd, 0.25)
Q3 <- quantile(salary$salary_in_usd, 0.75)
IQR_val <- Q3 - Q1

lower_bound <- Q1 - 1.5 * IQR_val
upper_bound <- Q3 + 1.5 * IQR_val

outliers <- salary %>% filter(salary_in_usd < lower_bound | salary_in_usd > upper_bound)
cat("Number of outliers identified:", nrow(outliers), "\n")

#> Number of outliers identified: 1750

cat("Upper Bound for 'Normal' Salaries: $", round(upper_bound, 2), "\n")

#> Upper Bound for 'Normal' Salaries: $ 337354.1

cat("Lower Bound for 'Normal' Salaries: $", round(lower_bound, 2))

#> Lower Bound for 'Normal' Salaries: $ -32656.88

Setting the floor of minimum salaries at $1 and the ceiling at the upper bound through winsorization.

logical_min     <- 1
statistical_max <- 337354.1

salary_refined <- salary %>%
  filter(salary_in_usd >= logical_min & salary_in_usd <= statistical_max)

summary(salary_refined$salary_in_usd)

#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   15000  105000  145000  152495  193262  337300

cat("Rows after trimming outliers:", nrow(salary_refined))

#> Rows after trimming outliers: 86834

4. Exploratory Data Analysis

4.1 Distribution of the Target Variable

We will visualize salary_in_usd using histograms and density plots. Since salary data is right-skewed, we may need to consider a log-transformation to meet the normality assumptions of Linear Regression.

p1 <- ggplot(salary_refined, aes(x = salary_in_usd)) +
  geom_histogram(aes(y = after_stat(density)), bins = 40, fill = "steelblue", color = "white") +
  geom_density(color = "red", linewidth = 1) +
  labs(title = "Distribution of Salary (USD)", x = "Salary", y = "Density") +
  theme_minimal()

p2 <- ggplot(salary_refined, aes(x = log10(salary_in_usd))) +
  geom_histogram(aes(y = after_stat(density)), bins = 40, fill = "darkorange", color = "white") +
  geom_density(color = "black", linewidth = 1) +
  labs(title = "Log10 Transformed Salary", x = "Log10(Salary)", y = "Density") +
  theme_minimal()

p1 + p2

4.2 Ordinal Relationships

We expect a positive correlation between experience_level and salary_in_usd. We will use boxplots to validate this.

salary_refined$experience_level <- factor(salary_refined$experience_level,
                                          levels = c("EN", "MI", "SE", "EX"))

ggplot(salary_refined, aes(x = experience_level, y = salary_in_usd, fill = experience_level)) +
  geom_boxplot(alpha = 0.7, outlier.colour = "red") +
  stat_summary(fun = "mean", geom = "point", shape = 20, size = 3, color = "white") +
  labs(title = "Salary Distribution by Experience Level",
       subtitle = "White dots represent the mean salary",
       x = "Experience Level",
       y = "Salary (USD)") +
  theme_minimal() +
  guides(fill = "none")

4.3 Correlation Analysis

salary_corr_df <- salary_refined %>%
  mutate(
    exp_num  = as.numeric(factor(experience_level, levels = c("EN", "MI", "SE", "EX"))),
    size_num = as.numeric(factor(company_size, levels = c("S", "M", "L")))
  ) %>%
  dplyr::select(salary_in_usd, salary, exp_num, size_num, remote_ratio, work_year)

cor_matrix <- cor(salary_corr_df, use = "complete.obs")

corrplot(cor_matrix,
         method = "color",
         type   = "lower",
         addCoef.col = "black",
         tl.col = "black",
         tl.srt = 45,
         title  = "Feature Correlation with Salary",
         mar    = c(0, 0, 2, 0))

4.4 Collinearity Check

We examine if employee_residence and company_location are identical for most records, as high multicollinearity can destabilize linear coefficients.

salary_refined$is_local <- salary_refined$employee_residence == salary_refined$company_location

loc_check  <- table(salary_refined$is_local)
prop_check <- prop.table(loc_check) * 100

print(loc_check)

#> 
#> FALSE  TRUE 
#>    85 86749

print(prop_check)

#> 
#>       FALSE        TRUE 
#>  0.09788792 99.90211208

ggplot(as.data.frame(prop_check), aes(x = Var1, y = Freq, fill = Var1)) +
  geom_bar(stat = "identity") +
  labs(title = "Geographic Overlap Check",
       x = "Lives in Company Country?",
       y = "Percentage (%)") +
  scale_x_discrete(labels = c("No (Remote/International)", "Yes (Local)")) +
  theme_minimal() +
  guides(fill = "none")

Since 99.9% of employees reside in the same country as their company (vs. 0.09% international), we drop employee_residence to avoid multicollinearity. We also drop the raw salary column as it was used to derive salary_in_usd.

salary <- salary_refined %>%
  dplyr::select(-c(employee_residence, salary))
summary(salary)

#>    work_year    experience_level employment_type     job_title        
#>  Min.   :2020   EN: 8336         Length:86834       Length:86834      
#>  1st Qu.:2024   MI:26309         Class :character   Class :character  
#>  Median :2024   SE:50435         Mode  :character   Mode  :character  
#>  Mean   :2024   EX: 1754                                              
#>  3rd Qu.:2024                                                         
#>  Max.   :2025                                                         
#>  salary_currency    salary_in_usd     remote_ratio    company_location  
#>  Length:86834       Min.   : 15000   Min.   :  0.00   Length:86834      
#>  Class :character   1st Qu.:105000   1st Qu.:  0.00   Class :character  
#>  Mode  :character   Median :145000   Median :  0.00   Mode  :character  
#>                     Mean   :152495   Mean   : 21.54                     
#>                     3rd Qu.:193262   3rd Qu.:  0.00                     
#>                     Max.   :337300   Max.   :100.00                     
#>  company_size        is_local      
#>  Length:86834       Mode :logical  
#>  Class :character   FALSE:85       
#>  Mode  :character   TRUE :86749    
#>                                    
#>                                    
#>

5. Linear Modeling

For our baseline predictive analysis, we implement a Multiple Linear Regression (MLR) model. This allows us to quantify the relationship between our categorical predictors and our continuous target variable, salary_in_usd.

5.1 Standard Linear Model

The model assumes that the target variable $y$ is a linear function of the independent variables $x_i$, plus a random error term:

\[y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n + \epsilon\]

Component	Definition	Contextual Meaning
$y$	Dependent Variable	The Salary in USD we are trying to predict.
$\beta_0$	Intercept	The baseline salary when all predictors are at their reference level.
$\beta_1 \dots \beta_n$	Coefficients	The estimated change in salary for every one-unit change in a feature.
$\epsilon$	Error Term	The Residuals; the variance in salary not explained by our model.

formula_std <- salary_in_usd ~ work_year + experience_level + employment_type +
                               job_title + remote_ratio + company_location + company_size

model_std <- lm(formula_std, data = salary)
summary(model_std)

#> 
#> Call:
#> lm(formula = formula_std, data = salary)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -170060  -39538   -7373   33093  239871 
#> 
#> Coefficients:
#>                                    Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                      -4.958e+06  6.258e+05  -7.921 2.38e-15 ***
#> work_year                         2.478e+03  3.092e+02   8.013 1.13e-15 ***
#> experience_levelMI                1.722e+04  7.418e+02  23.211  < 2e-16 ***
#> experience_levelSE                4.346e+04  7.057e+02  61.589  < 2e-16 ***
#> experience_levelEX                6.822e+04  1.466e+03  46.536  < 2e-16 ***
#> employment_typeFL                -2.180e+04  1.421e+04  -1.534 0.125032    
#> employment_typeFT                 2.020e+04  3.675e+03   5.497 3.87e-08 ***
#> employment_typePT                -1.737e+04  5.127e+03  -3.388 0.000704 ***
#> job_titleData Analysis           -4.356e+03  1.877e+03  -2.321 0.020315 *  
#> job_titleData Engineering         3.559e+04  1.831e+03  19.432  < 2e-16 ***
#> job_titleData Science             3.730e+04  1.848e+03  20.183  < 2e-16 ***
#> job_titleLeadership & Management  3.735e+04  1.884e+03  19.823  < 2e-16 ***
#> job_titleML & AI                  5.408e+04  1.897e+03  28.503  < 2e-16 ***
#> job_titleOther                    2.009e+03  2.025e+03   0.992 0.321122    
#> job_titleSoftware Engineering     7.790e+03  2.467e+03   3.157 0.001594 ** 
#> remote_ratio                     -9.657e+01  4.576e+00 -21.104  < 2e-16 ***
#> company_locationGB               -3.614e+04  1.453e+03 -24.865  < 2e-16 ***
#> company_locationOther            -4.247e+04  1.398e+03 -30.382  < 2e-16 ***
#> company_locationUS                2.280e+04  9.873e+02  23.097  < 2e-16 ***
#> company_sizeM                    -2.782e+03  1.085e+03  -2.563 0.010366 *  
#> company_sizeS                    -1.786e+04  3.967e+03  -4.501 6.76e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 54470 on 86813 degrees of freedom
#> Multiple R-squared:  0.2674, Adjusted R-squared:  0.2672 
#> F-statistic:  1584 on 20 and 86813 DF,  p-value: < 2.2e-16

5.2 Log-Transformed Linear Model

The model assumes that the log-transformed target variable $\log(y)$ is a linear function of the independent variables:

\[\log(y) = \beta_0 + \beta_1x_1 + \dots + \beta_nx_n + \epsilon\]

model_log <- lm(log(salary_in_usd) ~ work_year + experience_level + employment_type +
                                     job_title + remote_ratio + company_location + company_size,
                data = salary)

summary(model_log)

#> 
#> Call:
#> lm(formula = log(salary_in_usd) ~ work_year + experience_level + 
#>     employment_type + job_title + remote_ratio + company_location + 
#>     company_size, data = salary)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -2.35934 -0.24791  0.01034  0.26364  1.62803 
#> 
#> Coefficients:
#>                                    Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                      -2.089e+01  4.305e+00  -4.853 1.22e-06 ***
#> work_year                         1.573e-02  2.127e-03   7.398 1.40e-13 ***
#> experience_levelMI                1.925e-01  5.103e-03  37.731  < 2e-16 ***
#> experience_levelSE                3.835e-01  4.854e-03  79.015  < 2e-16 ***
#> experience_levelEX                5.351e-01  1.008e-02  53.064  < 2e-16 ***
#> employment_typeFL                -2.182e-01  9.776e-02  -2.232   0.0256 *  
#> employment_typeFT                 2.848e-01  2.528e-02  11.267  < 2e-16 ***
#> employment_typePT                -1.563e-01  3.526e-02  -4.432 9.34e-06 ***
#> job_titleData Analysis           -3.188e-02  1.291e-02  -2.469   0.0136 *  
#> job_titleData Engineering         2.561e-01  1.260e-02  20.327  < 2e-16 ***
#> job_titleData Science             2.694e-01  1.271e-02  21.194  < 2e-16 ***
#> job_titleLeadership & Management  2.645e-01  1.296e-02  20.409  < 2e-16 ***
#> job_titleML & AI                  3.758e-01  1.305e-02  28.792  < 2e-16 ***
#> job_titleOther                   -8.567e-03  1.393e-02  -0.615   0.5384    
#> job_titleSoftware Engineering     5.020e-02  1.697e-02   2.958   0.0031 ** 
#> remote_ratio                     -5.878e-04  3.148e-05 -18.672  < 2e-16 ***
#> company_locationGB               -3.882e-01  9.998e-03 -38.831  < 2e-16 ***
#> company_locationOther            -5.224e-01  9.615e-03 -54.332  < 2e-16 ***
#> company_locationUS                1.595e-01  6.791e-03  23.484  < 2e-16 ***
#> company_sizeM                    -8.528e-03  7.465e-03  -1.142   0.2533    
#> company_sizeS                    -1.657e-01  2.729e-02  -6.071 1.28e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.3747 on 86813 degrees of freedom
#> Multiple R-squared:  0.3507, Adjusted R-squared:  0.3506 
#> F-statistic:  2345 on 20 and 86813 DF,  p-value: < 2.2e-16

5.3 Model Diagnostics

We use Residuals vs Fitted and Normal Q-Q plots to compare model fit before and after log-transformation.

par(mfrow = c(2, 2))
plot(model_std, which = 1, main = "Standard: Residuals vs Fitted")
plot(model_std, which = 2, main = "Standard: Normal Q-Q")
plot(model_log, which = 1, main = "Log-Transformed: Residuals vs Fitted")
plot(model_log, which = 2, main = "Log-Transformed: Normal Q-Q")

par(mfrow = c(1, 1))

Residuals vs. Fitted: The standard model exhibited heteroscedasticity (a fan-shaped spread), where errors increased alongside higher salary predictions. The log-transformed model significantly stabilized this variance.
Normal Q-Q: The standard model showed heavy tails indicating non-normal residuals. The log-transformation brought the points much closer to the 45-degree reference line.

5.4 Model Evaluation: R-Squared

The increase in $R^2$ from 0.26 to 0.35 indicates that the log-transformation allows the model to capture the nature of salary increases more effectively. By compressing the right-skewed salary distribution, we reduced the influence of high-income outliers.

6. Feature Engineering & Variable Selection

To refine our model and address potential over-fitting or redundancy, we employ Stepwise Selection using the AIC (Akaike Information Criterion). AIC rewards goodness-of-fit but penalizes additional predictors, guiding us toward a simpler and more interpretable model.

full_model <- lm(log(salary_in_usd) ~ work_year + experience_level + employment_type +
                 job_title + remote_ratio + company_location + company_size,
                 data = salary)

reduced_model <- step(full_model, direction = "backward", trace = FALSE)
summary(reduced_model)

#> 
#> Call:
#> lm(formula = log(salary_in_usd) ~ work_year + experience_level + 
#>     employment_type + job_title + remote_ratio + company_location + 
#>     company_size, data = salary)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -2.35934 -0.24791  0.01034  0.26364  1.62803 
#> 
#> Coefficients:
#>                                    Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                      -2.089e+01  4.305e+00  -4.853 1.22e-06 ***
#> work_year                         1.573e-02  2.127e-03   7.398 1.40e-13 ***
#> experience_levelMI                1.925e-01  5.103e-03  37.731  < 2e-16 ***
#> experience_levelSE                3.835e-01  4.854e-03  79.015  < 2e-16 ***
#> experience_levelEX                5.351e-01  1.008e-02  53.064  < 2e-16 ***
#> employment_typeFL                -2.182e-01  9.776e-02  -2.232   0.0256 *  
#> employment_typeFT                 2.848e-01  2.528e-02  11.267  < 2e-16 ***
#> employment_typePT                -1.563e-01  3.526e-02  -4.432 9.34e-06 ***
#> job_titleData Analysis           -3.188e-02  1.291e-02  -2.469   0.0136 *  
#> job_titleData Engineering         2.561e-01  1.260e-02  20.327  < 2e-16 ***
#> job_titleData Science             2.694e-01  1.271e-02  21.194  < 2e-16 ***
#> job_titleLeadership & Management  2.645e-01  1.296e-02  20.409  < 2e-16 ***
#> job_titleML & AI                  3.758e-01  1.305e-02  28.792  < 2e-16 ***
#> job_titleOther                   -8.567e-03  1.393e-02  -0.615   0.5384    
#> job_titleSoftware Engineering     5.020e-02  1.697e-02   2.958   0.0031 ** 
#> remote_ratio                     -5.878e-04  3.148e-05 -18.672  < 2e-16 ***
#> company_locationGB               -3.882e-01  9.998e-03 -38.831  < 2e-16 ***
#> company_locationOther            -5.224e-01  9.615e-03 -54.332  < 2e-16 ***
#> company_locationUS                1.595e-01  6.791e-03  23.484  < 2e-16 ***
#> company_sizeM                    -8.528e-03  7.465e-03  -1.142   0.2533    
#> company_sizeS                    -1.657e-01  2.729e-02  -6.071 1.28e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.3747 on 86813 degrees of freedom
#> Multiple R-squared:  0.3507, Adjusted R-squared:  0.3506 
#> F-statistic:  2345 on 20 and 86813 DF,  p-value: < 2.2e-16

Comparing the full and reduced models.

cat("Full Model Adjusted R2:", summary(full_model)$adj.r.squared, "\n")

#> Full Model Adjusted R2: 0.3505958

cat("Reduced Model Adjusted R2:", summary(reduced_model)$adj.r.squared, "\n")

#> Reduced Model Adjusted R2: 0.3505958

dropped_vars <- setdiff(names(coef(full_model)), names(coef(reduced_model)))
cat("\nVariables removed by Stepwise Selection:\n")

#> 
#> Variables removed by Stepwise Selection:

print(dropped_vars)

#> character(0)

No variables were dropped, meaning the full and reduced models are identical — all predictors contribute meaningfully.

7. Predictive Modeling

To evaluate predictive accuracy, we split the dataset into a Training Set (70%) used to fit the model, and a Test Set (30%) used to evaluate performance.

7.1 Log-Linear Regression

set.seed(123)

n             <- nrow(salary)
train_indices <- sample(1:n, size = 0.7 * n)
train_data    <- salary[train_indices, ]
test_data     <- salary[-train_indices, ]

final_model   <- lm(log(salary_in_usd) ~ work_year + experience_level + employment_type +
                    job_title + remote_ratio + company_location + company_size,
                    data = train_data)

log_preds     <- predict(final_model, newdata = test_data)
actual_usd    <- test_data$salary_in_usd
predicted_usd <- exp(log_preds)

mse_val  <- mean((actual_usd - predicted_usd)^2)
rmse_val <- sqrt(mse_val)

cat("Mean Squared Error (MSE):", format(mse_val, scientific = FALSE), "\n")

#> Mean Squared Error (MSE): 3062656228

cat("Root Mean Squared Error (RMSE): $", round(rmse_val, 2), "\n")

#> Root Mean Squared Error (RMSE): $ 55341.27

On average, the log-linear model’s salary predictions are off by approximately $55,341.27.

7.2 Random Forest

rf_data <- salary %>%
  mutate(across(where(is.character), as.factor)) %>%
  dplyr::select(-any_of(c("log_salary", "is_local")))

set.seed(123)
train_idx <- sample(1:nrow(rf_data), 0.7 * nrow(rf_data))
train_rf  <- rf_data[train_idx, ]
test_rf   <- rf_data[-train_idx, ]

rf_model <- randomForest(salary_in_usd ~ .,
                         data       = train_rf,
                         ntree      = 500,
                         importance = TRUE)

rf_preds <- predict(rf_model, newdata = test_rf)
rf_rmse  <- sqrt(mean((test_rf$salary_in_usd - rf_preds)^2))

cat("Random Forest RMSE: $", round(rf_rmse, 2), "\n")

#> Random Forest RMSE: $ 53775.49

cat("Improvement over Linear Model: $", round(55341.27 - rf_rmse, 2))

#> Improvement over Linear Model: $ 1565.78

varImpPlot(rf_model,
           main = "Feature Importance: What Predicts Salary?",
           col  = "darkgreen",
           pch  = 16)

7.3 Lasso & Ridge Regression

library(glmnet)

x_train <- model.matrix(salary_in_usd ~ ., train_rf)[, -1]
y_train <- train_rf$salary_in_usd
x_test  <- model.matrix(salary_in_usd ~ ., test_rf)[, -1]
y_test  <- test_rf$salary_in_usd

lasso_cv    <- cv.glmnet(x_train, y_train, alpha = 1)
lasso_preds <- predict(lasso_cv, s = lasso_cv$lambda.min, newx = x_test)
lasso_rmse  <- sqrt(mean((y_test - lasso_preds)^2))

ridge_cv    <- cv.glmnet(x_train, y_train, alpha = 0)
ridge_preds <- predict(ridge_cv, s = ridge_cv$lambda.min, newx = x_test)
ridge_rmse  <- sqrt(mean((y_test - ridge_preds)^2))

cat("Lasso RMSE: $", round(lasso_rmse, 2), "\n")

#> Lasso RMSE: $ 54293.57

cat("Ridge RMSE: $", round(ridge_rmse, 2))

#> Ridge RMSE: $ 54355.6

7.4 Decision Tree Regression

tree_model <- rpart(salary_in_usd ~ ., data = train_rf, method = "anova")

tree_preds <- predict(tree_model, newdata = test_rf)
tree_rmse  <- sqrt(mean((test_rf$salary_in_usd - tree_preds)^2))

rpart.plot(tree_model, type = 3, digits = 3, fallen.leaves = TRUE,
           main = "Salary Decision Logic")

cat("Decision Tree RMSE: $", round(tree_rmse, 2))

#> Decision Tree RMSE: $ 55819.2

7.5 Model Comparison

results_table <- data.frame(
  Model = c("Random Forest", "Lasso Regression", "Ridge Regression",
            "Linear Regression (Log)", "Decision Tree"),
  RMSE  = c("$53,775.49", "$54,448.88", "$54,526.86", "$55,341.27", "$55,981.30"),
  Interpretation = c(
    "Best performer: captures complex, non-linear interactions between variables.",
    "Similar to Ridge, but performs automatic feature selection.",
    "Linear model with a penalty to reduce the impact of multicollinearity.",
    "The baseline; effective but limited by linear assumptions.",
    "High interpretability but prone to over-simplification."
  )
)

knitr::kable(results_table,
             caption = "Table 1: Comparison of Model Performance (Ranked by RMSE)",
             align   = "lll")

Table 1: Comparison of Model Performance (Ranked by RMSE)
Model	RMSE	Interpretation
Random Forest	$53,775.49	Best performer: captures complex, non-linear interactions between variables.
Lasso Regression	$54,448.88	Similar to Ridge, but performs automatic feature selection.
Ridge Regression	$54,526.86	Linear model with a penalty to reduce the impact of multicollinearity.
Linear Regression (Log)	$55,341.27	The baseline; effective but limited by linear assumptions.
Decision Tree	$55,981.30	High interpretability but prone to over-simplification.

By achieving the lowest RMSE ($53,775.49), the Random Forest model provides the most reliable salary estimates, though it remains a “black box” compared to the interpretability of the other models.

Part 2: Classification

This section builds a binary logistic regression classifier to predict whether an employee’s salary falls in the top 25% (High) or below (Normal), using the AI/Data Science salary dataset covering 2020-2025.

Workflow at a glance:

Load data and libraries
Initial cleaning and ordinal encoding
Feature engineering
Build binary target variable
Train / test split
Handle class imbalance with case weights
Fit full logistic regression model
Stepwise variable selection (AIC)
Predictions at default threshold (0.50)
ROC curve and AUC
Optimal threshold via Youden Index
Threshold comparison table

1. Load Data

salary <- read.csv("/home/jakes/Documents/strathmore/Modules/Module 3/linear_models/salaries/salaries.csv")

2. Initial Cleaning & Ordinal Encoding

We ordinal-encode the four key categorical predictors so they can enter the logistic regression as numeric inputs.

Variable	Encoding
`experience_level`	EN = 1, MI = 2, SE = 3, EX = 4
`employment_type`	PT/FL = 1, CT = 2, FT = 3
`company_size`	S = 1, M = 2, L = 3
`remote_ratio`	0 -> 0, 50 -> 1, 100 -> 2

salary <- salary %>%
  mutate(
    exp_num = case_when(
      experience_level == "EN" ~ 1,
      experience_level == "MI" ~ 2,
      experience_level == "SE" ~ 3,
      experience_level == "EX" ~ 4,
      TRUE ~ NA_real_
    ),
    emp_num = case_when(
      employment_type == "FT" ~ 3,
      employment_type == "CT" ~ 2,
      employment_type == "PT" ~ 1,
      employment_type == "FL" ~ 1,
      TRUE ~ NA_real_
    ),
    size_num = case_when(
      company_size == "S" ~ 1,
      company_size == "M" ~ 2,
      company_size == "L" ~ 3,
      TRUE ~ NA_real_
    ),
    remote_num = case_when(
      remote_ratio == 0   ~ 0,
      remote_ratio == 50  ~ 1,
      remote_ratio == 100 ~ 2,
      TRUE ~ NA_real_
    )
  )

glimpse(salary %>% dplyr::select(exp_num, emp_num, size_num, remote_num))

#> Rows: 88,584
#> Columns: 4
#> $ exp_num    <dbl> 2, 3, 3, 3, 3, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2,…
#> $ emp_num    <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
#> $ size_num   <dbl> 3, 2, 2, 2, 2, 2, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,…
#> $ remote_num <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 2, 2, 2,…

3. Feature Engineering

We create several derived features before modelling.

Leakage note: sal_vs_job_mean is computed using a temporary training split so that job-title group means are never contaminated by test data.

salary <- salary %>%
  mutate(
    same_country  = as.integer(as.character(employee_residence) == as.character(company_location)),
    us_company    = as.integer(as.character(company_location) == "US"),
    is_ml_ai_role = as.integer(grepl(
                      "machine learning|data science|ai|research|engineer|scientist|analyst",
                      job_title, ignore.case = TRUE)),
    exp_x_remote  = exp_num * remote_num,
    year_c        = work_year - 2022
  )

set.seed(42)
temp_idx <- createDataPartition(salary$salary_in_usd, p = 0.8, list = FALSE)

job_means <- salary %>%
  slice(temp_idx) %>%
  group_by(job_title) %>%
  summarise(job_mean_sal = mean(salary_in_usd, na.rm = TRUE), .groups = "drop")

salary <- salary %>%
  left_join(job_means, by = "job_title") %>%
  mutate(
    sal_vs_job_mean = salary_in_usd /
                      coalesce(job_mean_sal, median(salary_in_usd, na.rm = TRUE))
  )

salary %>%
  dplyr::select(same_country, us_company, is_ml_ai_role,
                exp_x_remote, year_c, sal_vs_job_mean) %>%
  summary()

#>   same_country      us_company     is_ml_ai_role     exp_x_remote  
#>  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
#>  1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:0.000  
#>  Median :1.0000   Median :1.0000   Median :1.0000   Median :0.000  
#>  Mean   :0.9983   Mean   :0.9004   Mean   :0.8266   Mean   :1.119  
#>  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.000  
#>  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :8.000  
#>      year_c       sal_vs_job_mean  
#>  Min.   :-2.000   Min.   :0.07923  
#>  1st Qu.: 2.000   1st Qu.:0.71238  
#>  Median : 2.000   Median :0.94265  
#>  Mean   : 2.035   Mean   :1.00104  
#>  3rd Qu.: 2.000   3rd Qu.:1.23038  
#>  Max.   : 3.000   Max.   :7.50863

4. Build Binary Target Variable

We define High salary as any salary at or above the 75th percentile of salary_in_usd, giving a roughly 75/25 class split.

p75 <- quantile(salary$salary_in_usd, 0.75, na.rm = TRUE)
cat(sprintf("75th percentile threshold: $%s\n", format(round(p75), big.mark = ",")))

#> 75th percentile threshold: $198,600

salary <- salary %>%
  mutate(
    high_sal = factor(
      as.integer(salary_in_usd >= p75),
      levels = c(0, 1),
      labels = c("Normal", "High")
    )
  )

cat("\nClass distribution:\n")

#> 
#> Class distribution:

print(table(salary$high_sal))

#> 
#> Normal   High 
#>  66419  22165

cat("\nClass proportions:\n")

#> 
#> Class proportions:

print(round(prop.table(table(salary$high_sal)), 3))

#> 
#> Normal   High 
#>   0.75   0.25

ggplot(salary, aes(x = high_sal, fill = high_sal)) +
  geom_bar(width = 0.5) +
  geom_text(stat = "count",
            aes(label = after_stat(count)),
            vjust = -0.5, size = 4) +
  scale_fill_manual(values = c("Normal" = "#85B7EB", "High" = "#1D9E75")) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  labs(title = "Class Balance: Salary Target",
       x = NULL, y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

Class balance: High vs Normal salary

5. Train / Test Split

We select the 10 modelling features and perform a stratified 80/20 split on high_sal to preserve class proportions in both sets.

cls_df <- salary %>%
  dplyr::select(
    high_sal,
    exp_num,
    emp_num,
    size_num,
    remote_num,
    same_country,
    us_company,
    is_ml_ai_role,
    year_c,
    sal_vs_job_mean,
    exp_x_remote
  ) %>%
  drop_na()

set.seed(42)
idx_c   <- createDataPartition(cls_df$high_sal, p = 0.8, list = FALSE)
train_c <- cls_df[ idx_c, ]
test_c  <- cls_df[-idx_c, ]

cat(sprintf("Train: %d rows | Test: %d rows\n", nrow(train_c), nrow(test_c)))

#> Train: 70868 rows | Test: 17716 rows

cat("\nTrain class balance:\n")

#> 
#> Train class balance:

print(round(prop.table(table(train_c$high_sal)), 3))

#> 
#> Normal   High 
#>   0.75   0.25

6. Handle Class Imbalance

With ~75% Normal and ~25% High, accuracy alone is misleading. We assign inverse-frequency weights so both classes contribute equally to the log-likelihood during fitting.

n_total  <- nrow(train_c)
n_high   <- sum(train_c$high_sal == "High")
n_normal <- sum(train_c$high_sal == "Normal")

class_weights <- ifelse(
  train_c$high_sal == "High",
  n_total / (2 * n_high),
  n_total / (2 * n_normal)
)

cat(sprintf("Class weights — Normal: %.3f | High: %.3f\n",
            n_total / (2 * n_normal),
            n_total / (2 * n_high)))

#> Class weights — Normal: 0.667 | High: 1.998

7. Full Logistic Regression Model

We fit a weighted logistic regression with all 10 candidate predictors.

full_fit <- glm(
  high_sal ~ exp_num + emp_num + size_num + remote_num +
             same_country + us_company + is_ml_ai_role +
             year_c + sal_vs_job_mean + exp_x_remote,
  data    = train_c,
  family  = binomial(link = "logit"),
  weights = class_weights
)

summary(full_fit)

#> 
#> Call:
#> glm(formula = high_sal ~ exp_num + emp_num + size_num + remote_num + 
#>     same_country + us_company + is_ml_ai_role + year_c + sal_vs_job_mean + 
#>     exp_x_remote, family = binomial(link = "logit"), data = train_c, 
#>     weights = class_weights)
#> 
#> Coefficients:
#>                  Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)     -19.69960    1.10372 -17.848  < 2e-16 ***
#> exp_num           0.45402    0.02231  20.347  < 2e-16 ***
#> emp_num           1.65241    0.28087   5.883 4.02e-09 ***
#> size_num          0.69406    0.06931  10.014  < 2e-16 ***
#> remote_num       -0.72975    0.08689  -8.399  < 2e-16 ***
#> same_country      1.46892    0.72249   2.033    0.042 *  
#> us_company        0.39377    0.05905   6.669 2.58e-11 ***
#> is_ml_ai_role     0.70473    0.03431  20.538  < 2e-16 ***
#> year_c            0.23522    0.02106  11.170  < 2e-16 ***
#> sal_vs_job_mean   8.18253    0.06394 127.973  < 2e-16 ***
#> exp_x_remote      0.17684    0.03047   5.804 6.47e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 98244  on 70867  degrees of freedom
#> Residual deviance: 43234  on 70857  degrees of freedom
#> AIC: 55410
#> 
#> Number of Fisher Scoring iterations: 6

8. Stepwise Variable Selection (AIC)

We use bidirectional stepwise selection (AIC criterion) to find the most parsimonious model. Variables with negligible contribution are dropped; useful variables already excluded may be re-added.

step_fit <- step(full_fit, direction = "both", trace = 0)
summary(step_fit)

#> 
#> Call:
#> glm(formula = high_sal ~ exp_num + emp_num + size_num + remote_num + 
#>     same_country + us_company + is_ml_ai_role + year_c + sal_vs_job_mean + 
#>     exp_x_remote, family = binomial(link = "logit"), data = train_c, 
#>     weights = class_weights)
#> 
#> Coefficients:
#>                  Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)     -19.69960    1.10372 -17.848  < 2e-16 ***
#> exp_num           0.45402    0.02231  20.347  < 2e-16 ***
#> emp_num           1.65241    0.28087   5.883 4.02e-09 ***
#> size_num          0.69406    0.06931  10.014  < 2e-16 ***
#> remote_num       -0.72975    0.08689  -8.399  < 2e-16 ***
#> same_country      1.46892    0.72249   2.033    0.042 *  
#> us_company        0.39377    0.05905   6.669 2.58e-11 ***
#> is_ml_ai_role     0.70473    0.03431  20.538  < 2e-16 ***
#> year_c            0.23522    0.02106  11.170  < 2e-16 ***
#> sal_vs_job_mean   8.18253    0.06394 127.973  < 2e-16 ***
#> exp_x_remote      0.17684    0.03047   5.804 6.47e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 98244  on 70867  degrees of freedom
#> Residual deviance: 43234  on 70857  degrees of freedom
#> AIC: 55410
#> 
#> Number of Fisher Scoring iterations: 6

8.1 Odds Ratios

or_table <- data.frame(
  OR       = exp(coef(step_fit)),
  CI_lower = exp(confint(step_fit)[, 1]),
  CI_upper = exp(confint(step_fit)[, 2])
) %>%
  round(3)

cat("OR > 1: predictor increases odds of High salary\n")

#> OR > 1: predictor increases odds of High salary

cat("OR < 1: predictor decreases odds of High salary\n\n")

#> OR < 1: predictor decreases odds of High salary

print(or_table)

#>                       OR CI_lower CI_upper
#> (Intercept)        0.000    0.000    0.000
#> exp_num            1.575    1.507    1.645
#> emp_num            5.220    3.068    9.254
#> size_num           2.002    1.748    2.294
#> remote_num         0.482    0.406    0.571
#> same_country       4.345    1.149   19.237
#> us_company         1.483    1.321    1.665
#> is_ml_ai_role      2.023    1.892    2.164
#> year_c             1.265    1.214    1.319
#> sal_vs_job_mean 3577.910 3158.737 4058.623
#> exp_x_remote       1.193    1.125    1.267

9. Predictions at Default Threshold (0.50)

prob_pred  <- predict(step_fit, newdata = test_c, type = "response")
class_pred <- factor(ifelse(prob_pred >= 0.5, "High", "Normal"),
                     levels = c("Normal", "High"))

cm_05 <- confusionMatrix(class_pred, test_c$high_sal, positive = "High")
print(cm_05)

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Normal  High
#>     Normal  11554   591
#>     High     1729  3842
#>                                         
#>                Accuracy : 0.869         
#>                  95% CI : (0.864, 0.874)
#>     No Information Rate : 0.7498        
#>     P-Value [Acc > NIR] : < 2.2e-16     
#>                                         
#>                   Kappa : 0.6785        
#>                                         
#>  Mcnemar's Test P-Value : < 2.2e-16     
#>                                         
#>             Sensitivity : 0.8667        
#>             Specificity : 0.8698        
#>          Pos Pred Value : 0.6896        
#>          Neg Pred Value : 0.9513        
#>              Prevalence : 0.2502        
#>          Detection Rate : 0.2169        
#>    Detection Prevalence : 0.3145        
#>       Balanced Accuracy : 0.8683        
#>                                         
#>        'Positive' Class : High          
#>

10. ROC Curve & AUC

AUC is the primary evaluation metric here — it is threshold-independent and robust to class imbalance.

roc_obj <- roc(test_c$high_sal, prob_pred,
               levels = c("Normal", "High"), quiet = TRUE)
auc_val <- auc(roc_obj)

cat(sprintf("AUC: %.3f\n", auc_val))

#> AUC: 0.944

cat("Interpretation: 0.9-1.0 excellent | 0.8-0.9 good | 0.7-0.8 fair | <0.7 poor\n")

#> Interpretation: 0.9-1.0 excellent | 0.8-0.9 good | 0.7-0.8 fair | <0.7 poor

ggroc(roc_obj, colour = "#1D9E75", linewidth = 1) +
  geom_abline(slope = 1, intercept = 1,
              linetype = "dashed", colour = "gray50") +
  annotate("text", x = 0.3, y = 0.1,
           label = sprintf("AUC = %.3f", auc_val), size = 4.5,
           colour = "#1D9E75", fontface = "bold") +
  labs(title = "ROC Curve: Salary Classifier",
       x     = "Specificity",
       y     = "Sensitivity") +
  theme_minimal()

ROC curve for the salary classifier

11. Optimal Threshold (Youden Index)

The Youden index maximises Sensitivity + Specificity - 1, balancing the true positive and true negative rates simultaneously.

best_thresh <- coords(roc_obj, "best",
                      ret         = "threshold",
                      best.method = "youden")
best_thresh <- as.numeric(best_thresh)
cat(sprintf("Youden-optimal threshold: %.3f\n", best_thresh))

#> Youden-optimal threshold: 0.403

class_opt <- factor(ifelse(prob_pred >= best_thresh, "High", "Normal"),
                    levels = c("Normal", "High"))

cm_opt <- confusionMatrix(class_opt, test_c$high_sal, positive = "High")
print(cm_opt)

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Normal  High
#>     Normal  11053   341
#>     High     2230  4092
#>                                         
#>                Accuracy : 0.8549        
#>                  95% CI : (0.8496, 0.86)
#>     No Information Rate : 0.7498        
#>     P-Value [Acc > NIR] : < 2.2e-16     
#>                                         
#>                   Kappa : 0.6613        
#>                                         
#>  Mcnemar's Test P-Value : < 2.2e-16     
#>                                         
#>             Sensitivity : 0.9231        
#>             Specificity : 0.8321        
#>          Pos Pred Value : 0.6473        
#>          Neg Pred Value : 0.9701        
#>              Prevalence : 0.2502        
#>          Detection Rate : 0.2310        
#>    Detection Prevalence : 0.3569        
#>       Balanced Accuracy : 0.8776        
#>                                         
#>        'Positive' Class : High          
#>

12. Threshold Comparison

Note: Accuracy is misleading here — the No-Information Rate is ~75%. Use Kappa and AUC as primary metrics for this imbalanced classifier.

f1_05  <- cm_05$byClass["F1"]
f1_opt <- cm_opt$byClass["F1"]

comparison <- data.frame(
  Threshold   = c(0.500, round(best_thresh, 3)),
  Accuracy    = c(cm_05$overall["Accuracy"],    cm_opt$overall["Accuracy"]),
  Sensitivity = c(cm_05$byClass["Sensitivity"], cm_opt$byClass["Sensitivity"]),
  Specificity = c(cm_05$byClass["Specificity"], cm_opt$byClass["Specificity"]),
  Kappa       = c(cm_05$overall["Kappa"],       cm_opt$overall["Kappa"]),
  F1          = c(f1_05,                        f1_opt)
) %>%
  mutate(across(-Threshold, ~ round(., 3)))

knitr::kable(comparison,
             caption = "Threshold Comparison: Default (0.50) vs Youden-Optimal",
             align   = "lrrrrr")

Threshold Comparison: Default (0.50) vs Youden-Optimal
Threshold	Accuracy	Sensitivity	Specificity	Kappa	F1
0.500	0.869	0.867	0.870	0.678	0.768
0.403	0.855	0.923	0.832	0.661	0.761

cat(sprintf("\nAUC (threshold-independent): %.3f\n", auc_val))

#> 
#> AUC (threshold-independent): 0.944

13. Summary & Conclusions

#> Model                      : Weighted logistic regression (stepwise AIC)
#> Target                     : High salary (salary_in_usd >= 75th percentile)
#> AUC                        : 0.944
#> Kappa (threshold = 0.50)   : 0.678
#> F1    (threshold = 0.50)   : 0.768
#> Optimal threshold (Youden) : 0.403

Key Findings & Implications

Drivers of High Salary

The classification model identifies three main factors influencing entry into the top 25% salary bracket:

Experience Premium: Senior (SE) and Executive (EX) roles significantly increase the likelihood of high compensation, showing a strong non-linear effect.
Geographic Advantage: Working for a US-based company is one of the strongest predictors of high salary.
Specialization: Roles related to Machine Learning and AI have a higher probability of exceeding the 75th percentile compared to general data roles.

Model Performance Insights

Metric Choice: Due to class imbalance, Cohen’s Kappa (k) was used instead of accuracy to better capture true model performance, as it accounts for chance agreement.
AUC Strength: A high AUC indicates strong ability to distinguish between high and normal salary classes.
Threshold Optimization: Adjusting the classification threshold via the Youden Index improved sensitivity, making the model more effective at identifying high earners.

Practical Implications

Career Strategy: Advancing experience level, targeting US-based companies, and specializing in AI/ML roles are key drivers of higher earnings.
Model Application: The classifier can function as a salary screening tool, helping users assess the likelihood of a role meeting high-salary criteria.

apps-fileview.texmex_20260329.08_p1 salary_full.txt Displaying salary_full.txt.

Component	Definition	Contextual Meaning
\(y\)	Dependent Variable	The Salary in USD we are trying to predict.
\(\beta_0\)	Intercept	The baseline salary when all predictors are at their reference level.
\(\beta_1 \dots \beta_n\)	Coefficients	The estimated change in salary for every one-unit change in a feature.
\(\epsilon\)	Error Term	The Residuals; the variance in salary not explained by our model.

Project 2: Salary Prediction

Predicting Salary (2020 to 2025)

Group 5

2026-04-07