Introduction

In this report, we are going to build a classification model that predicts whether an individual’s salary is “higher than 50K” or “lower than or equal to 50K.” By analyzing key attributes such as education, occupation, and native country, our goal is to uncover patterns that influence income levels.

Loading libraries

library(tidyverse)
library(readr)
library(corrplot)
library(kableExtra)
library(rpart)
library(rattle)
library(gridExtra)
library(caret)
library(pROC)
library(here)
library(ROSE)
library(dummies)
library(xgboost)
library(neuralnet)
salary_df <- read_csv("c3.csv") 

salary_df <- salary_df %>% mutate_all(~ifelse(. == '?', NA, .))

Description of Data Set

  • age: continuous.
  • workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  • fnlwgt: continuous; the number of people that Census believes the entry represents.
  • education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  • education-num: continuous.
  • marital-status: marital status of an individual: Married-civ-spouse -> a civilian spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse - a spouse in the Armed Forces.
  • occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  • relationship: represents what this individual is relative to other: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  • race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
  • sex: Female, Male.
  • capital-gain: continuous.
  • capital-loss: continuous.
  • hours-per-week: continuous.
  • feat01-10: continuous.
  • native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

We will consider predicting probability of earning more than 50K (salary - <=50K or >50K) given a set of characteristics.

Let us examine the structure of the data:

salary_df %>% glimpse()
## Rows: 42,561
## Columns: 26
## $ id               <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16

## $ age              <dbl> 25, 41, 51, 57, 31, 74, 42, 35, 28, 46, 21, 32, 26, 4

## $ `capital-gain`   <dbl> 5178, 3103, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## $ `capital-loss`   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## $ education        <chr> "11th", "Bachelors", "Some-college", "HS-grad", "Bach

## $ `education-num`  <dbl> 7, 13, 10, 9, 13, 14, 9, 9, 9, 4, 10, 11, 7, 10, 13, 

## $ feat01           <dbl> 0.8448420, 1.1844558, 1.3336356, 1.1780716, 1.3148465

## $ feat02           <dbl> 0.6298227, 0.4318522, 0.3418702, 0.6165489, 0.6925617

## $ feat03           <dbl> 1.1146365, 0.6941100, 0.7709931, 1.2863809, 1.2001315

## $ feat04           <dbl> 0.4889465, 0.2861112, 0.6193634, 0.6102719, 0.6396099

## $ feat05           <dbl> 0.9520031, 0.7136972, 1.4626074, 1.5411345, 1.4679602

## $ feat06           <dbl> 0.5787523, 0.6261846, 0.2777089, 0.3670095, 0.7277344

## $ feat07           <dbl> 1.0472219, 1.4231821, 1.0282187, 0.5600741, 0.9760788

## $ feat08           <dbl> 1.3348179, 1.4226539, 0.7770539, 0.9994005, 0.8707198

## $ feat09           <dbl> 0.8129876, 0.9177271, 0.9499549, 1.0217224, 0.5442584

## $ feat10           <dbl> 0.8436999, 0.8794644, 1.1056284, 1.3840671, 1.0400557

## $ fnlwgt           <dbl> 120238, 167106, 205884, 184553, 137537, 84197, 76280,

## $ `hours-per-week` <dbl> 40, 35, 40, 50, 40, 10, 40, 40, 50, 75, 20, 40, 65, 4

## $ `marital-status` <chr> "Married-civ-spouse", "Married-civ-spouse", "Married-

## $ `native-country` <chr> "Poland", "Philippines", "United-States", "United-Sta

## $ occupation       <chr> "Transport-moving", "Adm-clerical", "Tech-support", "

## $ race             <chr> "White", "Asian-Pac-Islander", "White", "White", "Whi

## $ relationship     <chr> "Husband", "Husband", "Wife", "Husband", "Unmarried",

## $ salary           <chr> ">50K", ">50K", ">50K", ">50K", "<=50K", "<=50K", "<=

## $ sex              <chr> "Male", "Male", "Female", "Male", "Female", "Female",

## $ workclass        <chr> "Private", "Private", "Private", "Self-emp-not-inc", 

kable(table(salary_df$salary), col.names = c("Salary", "Frequency"), format = "html") %>%
  kable_styling(full_width = FALSE)
Salary Frequency
<=50K 32300
>50K 10261
kable(table(salary_df$salary)/length(salary_df$salary), format = "html",col.names = c("Salary", "Frequency")) %>%
  kable_styling(full_width = FALSE)
Salary Frequency
<=50K 0.7589107
>50K 0.2410893

Data Preparation

We divide data set into the training and testing sample.

set.seed(1618)
training_obs <- createDataPartition(salary_df$salary, 
                                    p = 0.8, 
                                    list = FALSE) 
salary.df.test  <- salary_df[-training_obs,]

salary.df.train <- salary_df[training_obs,]

There is around 7% of missing values, it is rather a significant number, therefore we will not drop whole rows.

colSums(is.na(salary.df.train))
##             id            age   capital-gain   capital-loss      education 
##              0              0              0              0              0 
##  education-num         feat01         feat02         feat03         feat04 
##              0              0              0              0              0 
##         feat05         feat06         feat07         feat08         feat09 
##              0              0              0              0              0 
##         feat10         fnlwgt hours-per-week marital-status native-country 
##              0              0              0              0            611 
##     occupation           race   relationship         salary            sex 
##           1951              0              0              0              0 
##      workclass 
##           1943
sum(!complete.cases(salary.df.train))
## [1] 2529
sum(!complete.cases(salary.df.train))/nrow(salary.df.train)
## [1] 0.07427531

For missing values at native_country we created additional category Unknown.

salary.df.train <- salary.df.train %>%
  mutate(`native-country` = ifelse(is.na(`native-country`), 'Unknown', `native-country`))

For missing occupation with workclass equals Never-worked we created category Unemployed. The rest of missing values were categorized as Unknown.

salary.df.train %>% 
  filter((is.na(workclass) & !is.na(occupation)) | (!is.na(workclass) & is.na(occupation)))
## # A tibble: 8 × 26
##      id   age `capital-gain` `capital-loss` education    `education-num` feat01
##   <dbl> <dbl>          <dbl>          <dbl> <chr>                  <dbl>  <dbl>
## 1 15885    20              0              0 Some-college              10  0.531
## 2 25363    18              0              0 10th                       6  1.78 
## 3 32685    18              0              0 11th                       7  1.23 
## 4 34067    18              0              0 11th                       7  0.787
## 5 35289    18              0              0 Some-college              10  0.992
## 6 38690    30              0              0 HS-grad                    9  0.965
## 7 39350    23              0              0 7th-8th                    4  1.03 
## 8 39564    17              0              0 10th                       6  0.849
## # â„č 19 more variables: feat02 <dbl>, feat03 <dbl>, feat04 <dbl>, feat05 <dbl>,
## #   feat06 <dbl>, feat07 <dbl>, feat08 <dbl>, feat09 <dbl>, feat10 <dbl>,
## #   fnlwgt <dbl>, `hours-per-week` <dbl>, `marital-status` <chr>,
## #   `native-country` <chr>, occupation <chr>, race <chr>, relationship <chr>,
## #   salary <chr>, sex <chr>, workclass <chr>
salary.df.train <- salary.df.train %>%
  mutate(occupation = ifelse(workclass=='Never-worked', 'Unemployed', occupation),
         occupation = ifelse(is.na(occupation), 'Unknown', occupation),
         workclass = ifelse(is.na(workclass), 'Unknown', workclass))

There are 25 independent variables. However, after careful investigation, we decided to perform some cleaning:

  • Variable education is removed since it is a textual version of variable education-num.

  • The variable fnlwgt represents the number of people that Census believes an observation represents. While it can be treated as weights, it is not used as a dependent variable.

  • capital-gain and capital-loss are combined into one variable: capital.

  • Categories of marital-status were grouped in 3: Married, Previously Married, Single to avoid sparsity.

  • The variable native_country contains 41 unique values, which may influence the model’s performance due to overfitting, complexity, and sparsity in some categories. Therefore, we grouped countries into a few regions: United-States, Central America, South America, North America, Asia, Europe, Caribbean and Other.

salary.df.train <- salary.df.train %>% 
  mutate(capital = `capital-gain`-`capital-loss`,
         marital_status_grouped = case_when(
              `marital-status` %in% c('Married-AF-spouse', 'Married-civ-spouse', 
                                    'Married-spouse-absent') ~ 'Married',
              `marital-status` == 'Never-married' ~ 'Single',
              `marital-status` %in% c('Divorced', 'Separated', 'Widowed') ~ 'Previously Married',
              TRUE ~ as.character(`marital-status`)),
         region = case_when(
              `native-country` %in% c("United-States", "Outlying-US(Guam-USVI-etc)") ~ "United-States",
              `native-country` %in% c("El-Salvador", "Guatemala", "Honduras", 
                                      "Nicaragua") ~ "Central America",
              `native-country` %in% c("Ecuador", "Peru", "Columbia") ~ "South America",
              `native-country` %in% c("Canada","Mexico") ~ "North America",
              `native-country` %in% c("Philippines", "India", "China", "Vietnam", "Japan", "Taiwan", 
                                       "Hong", "Thailand", "Cambodia", "Laos", "Iran") ~ "Asia",
              `native-country` %in% c("Germany", "Italy", "Portugal", "Greece", "Poland", "France", 
                                       "Ireland", "Hungary", "Scotland", "Yugoslavia", "England") ~ "Europe",
              `native-country` %in% c("Puerto-Rico", "Dominican-Republic", "Jamaica", "Cuba", 
                                       "Haiti", "Trinadad&Tobago") ~ "Caribbean",
              TRUE ~ "Other"  
            )
  ) %>% 
  rename(education_num = `education-num`,
         hours_per_week = `hours-per-week`,
         marital_status = marital_status_grouped,
         native_country = `native-country`) %>% 
  mutate(salary = factor(salary), 
         salary = recode_factor(salary, 
                                "<=50K" = "under_50K", 
                                ">50K" = "over_50K"),
         sex = factor(sex),
         occupation = factor(occupation),
         marital_status = factor(marital_status),
         native_country = factor(native_country),
         race = factor(race),
         relationship = factor(relationship),
         workclass = factor(workclass),
         region = factor(region),
         education_num = as.integer(education_num),
         education_ord = factor(education_num, levels=c(1:16), ordered=TRUE),
         age = as.integer(age),
         hours_per_week = as.integer(hours_per_week)) %>% 
  dplyr::select(age, capital, education_num, hours_per_week, marital_status, native_country,region,education_ord,
         salary, sex, fnlwgt, occupation, race, relationship, workclass,
         feat01,feat02,feat03,feat04,feat05,feat06,feat07,feat08,feat09,feat10)

Data Visualization

Below we presented distribution of variables. There are 4 numeric variables and 8 categorical, including salary which we will predict.

plot1 <- ggplot(salary.df.train, aes(x = age)) +
  geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
  labs(y = "Frequency") +
  theme_minimal() +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))

plot2 <- ggplot(salary.df.train, aes(x = capital)) +
  geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
  labs(y = "Frequency") +
  theme_minimal() +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))

plot3 <- ggplot(salary.df.train, aes(x = hours_per_week)) +
  geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
  labs(y = "Frequency") +
  theme_minimal() +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))

plot14 <- ggplot(salary.df.train, aes(x = education_num)) +
  geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
  labs(y = "Frequency") +
  theme_minimal() +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))

plot4 <- ggplot(salary.df.train, aes(x = feat01)) +
  geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
  labs(y = "Frequency") +
  theme_minimal() +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))

plot5 <- ggplot(salary.df.train, aes(x = feat02)) +
  geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
  labs(y = "Frequency") +
  theme_minimal() +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))

plot6 <- ggplot(salary.df.train, aes(x = feat03)) +
  geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
  labs(y = "Frequency") +
  theme_minimal() +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))

plot7 <- ggplot(salary.df.train, aes(x = feat04)) +
  geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
  labs(y = "Frequency") +
  theme_minimal() +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))


plot8 <- ggplot(salary.df.train, aes(x = feat05)) +
  geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
  labs(y = "Frequency") +
  theme_minimal() +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))


plot9 <- ggplot(salary.df.train, aes(x = feat06)) +
  geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
  labs(y = "Frequency") +
  theme_minimal() +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))


plot10 <- ggplot(salary.df.train, aes(x = feat07)) +
  geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
  labs(y = "Frequency") +
  theme_minimal() +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))


plot11 <- ggplot(salary.df.train, aes(x = feat08)) +
  geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
  labs(y = "Frequency") +
  theme_minimal() +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))


plot12 <- ggplot(salary.df.train, aes(x = feat09)) +
  geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
  labs(y = "Frequency") +
  theme_minimal() +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))


plot13 <- ggplot(salary.df.train, aes(x = feat10)) +
  geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
  labs(y = "Frequency") +
  theme_minimal() +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))


grid.arrange(plot1, plot2, plot3, plot14, plot4, plot5, plot6, plot7, plot8,
             plot9, plot10, plot11, plot12, plot13,
             ncol = 3,
             top = 'Distribution of numeric variables')

plot1 <- ggplot(salary.df.train, aes(x = workclass)) +
  geom_bar(fill = "pink", color = "black", alpha = 0.7) +
  geom_text(
    stat = "count", 
    aes(label = after_stat(count), vjust = -0.15), 
    size = 2,
    color = "black"
  ) +
  labs(y = "Frequency") +
  theme_minimal() +
  theme(
    axis.text = element_text(size = 7, angle = 45, hjust = 1)
  )

plot2 <- ggplot(salary.df.train, aes(x = marital_status)) +
  geom_bar(fill = "pink", color = "black", alpha = 0.7) +
  geom_text(
    stat = "count", 
    aes(label = after_stat(count), vjust = 1.2), 
    size = 3,
    color = "black"
  ) +
  labs(y = "Frequency") +
  theme_minimal() +
  theme(
    axis.text = element_text(size = 7, angle = 45, hjust = 1)
  )

plot3 <- ggplot(salary.df.train, aes(x = region)) +
  geom_bar(fill = "pink", color = "black", alpha = 0.7) +
  geom_text(
    stat = "count", 
    aes(label = after_stat(count), vjust = -0.15), 
    size = 2,
    color = "black"
  ) +
  labs(y = "Frequency") +
  theme_minimal() +
  theme(
    axis.text = element_text(size = 7, angle = 45, hjust = 1)
  )

plot4 <- ggplot(salary.df.train, aes(x = salary)) +
  geom_bar(fill = "pink", color = "black", alpha = 0.7) +
  geom_text(
    stat = "count", 
    aes(label = after_stat(count), vjust = 1.2), 
    size = 3,
    color = "black"
  ) +
  labs(y = "Frequency") +
  theme_minimal() +
  theme(
    axis.text = element_text(size = 7, angle = 45, hjust = 1)
  )

plot5 <- ggplot(salary.df.train, aes(x = sex)) +
  geom_bar(fill = "pink", color = "black", alpha = 0.7) +
  geom_text(
    stat = "count", 
    aes(label = after_stat(count), vjust = 1.2), 
    size = 3,
    color = "black"
  ) +
  labs(y = "Frequency") +
  theme_minimal() +
  theme(
    axis.text = element_text(size = 7, angle = 45, hjust = 1)
  )

plot6 <- ggplot(salary.df.train, aes(x = occupation)) +
  geom_bar(fill = "pink", color = "black", alpha = 0.7) +
  geom_text(
    stat = "count", 
    aes(label = after_stat(count), vjust = -0.15), 
    size = 2,
    color = "black"
  ) +
  labs(y = "Frequency") +
  theme_minimal() +
  theme(
    axis.text = element_text(size = 7, angle = 45, hjust = 1)
  )

plot7 <- ggplot(salary.df.train, aes(x = race)) +
  geom_bar(fill = "pink", color = "black", alpha = 0.7) +
  geom_text(
    stat = "count", 
    aes(label = after_stat(count), vjust = -0.2), 
    size = 2.5,
    color = "black"
  ) +
  labs(y = "Frequency") +
  theme_minimal() +
  theme(
    axis.text = element_text(size = 7, angle = 45, hjust = 1)
  )

plot8 <- ggplot(salary.df.train, aes(x = relationship)) +
  geom_bar(fill = "pink", color = "black", alpha = 0.7) +
  geom_text(
    stat = "count", 
    aes(label = after_stat(count), vjust = 1), 
    size = 3,
    color = "black"
  ) +
  labs(y = "Frequency") +
  theme_minimal() +
  theme(
    axis.text = element_text(size = 7, angle = 45, hjust = 1)
  )

grid.arrange(plot1, plot2, plot3, plot4,
             plot5, plot6, plot7, plot8,
             ncol = 2,
             top = 'Distribution of categorical variables')

There is no strong correlation between numeric variables. The strongest positive correlation is observed between salary and education which suggests than the higher the education the higher probability than salary is greater than 50K. Variables such as capital, age, hours_per_week, sex, feat02 are also positively correlated which is correct with the intuition. Variables feat04 and feat06 are only variables which are positively correlated with salary.

modified_data <- salary.df.train %>%
  mutate(
    sex_male = as.numeric(sex == "Male"),
    salary_over_50 = as.numeric(salary == "over_50K")
  )

testRes = cor.mtest(modified_data[, sapply(modified_data, is.numeric)], conf.level = 0.95, method="pearson")

corrplot(cor(modified_data[, sapply(modified_data, is.numeric)]), 
         p.mat = testRes$p,
         method = "color", 
         type = "upper", 
         order = "hclust", 
         tl.col = "black", 
         tl.srt = 45, 
         sig.level = c(0.001, 0.01, 0.05), pch.cex = 0.9,
         insig = 'label_sig', pch.col = 'grey20')

Validation and Test Data Partition

We are splitting the test portion of the data into validation and test sets. Before proceeding with this split, it’s important to preprocess the data in the same way as the training data to ensure consistency in feature engineering.

salary.df.test <- salary.df.test %>%
  mutate(`native-country` = ifelse(is.na(`native-country`), 'Unknown', `native-country`))

salary.df.test <- salary.df.test %>%
  mutate(occupation = ifelse(workclass=='Never-worked', 'Unemployed', occupation),
         occupation = ifelse(is.na(occupation), 'Unknown', occupation),
         workclass = ifelse(is.na(workclass), 'Unknown', workclass))

salary.df.test <- salary.df.test %>%
  mutate(capital = `capital-gain`-`capital-loss`,
         marital_status_grouped = case_when(
              `marital-status` %in% c('Married-AF-spouse', 'Married-civ-spouse', 
                                    'Married-spouse-absent') ~ 'Married',
              `marital-status` == 'Never-married' ~ 'Single',
              `marital-status` %in% c('Divorced', 'Separated', 'Widowed') ~ 'Previously Married',
              TRUE ~ as.character(`marital-status`)),
         region = case_when(
              `native-country` %in% c("United-States", "Outlying-US(Guam-USVI-etc)") ~ "United-States",
              `native-country` %in% c("El-Salvador", "Guatemala", "Honduras", 
                                      "Nicaragua") ~ "Central America",
              `native-country` %in% c("Ecuador", "Peru", "Columbia") ~ "South America",
              `native-country` %in% c("Canada","Mexico") ~ "North America",
              `native-country` %in% c("Philippines", "India", "China", "Vietnam", "Japan", "Taiwan", 
                                       "Hong", "Thailand", "Cambodia", "Laos", "Iran") ~ "Asia",
              `native-country` %in% c("Germany", "Italy", "Portugal", "Greece", "Poland", "France", 
                                       "Ireland", "Hungary", "Scotland", "Yugoslavia", "England") ~ "Europe",
              `native-country` %in% c("Puerto-Rico", "Dominican-Republic", "Jamaica", "Cuba", 
                                       "Haiti", "Trinadad&Tobago") ~ "Caribbean",
              TRUE ~ "Other"  
            )
  ) %>% 
  rename(education_num = `education-num`,
         hours_per_week = `hours-per-week`,
         marital_status = marital_status_grouped,
         native_country = `native-country`) %>% 
  mutate(salary = factor(salary), 
         salary = recode_factor(salary, 
                                "<=50K" = "under_50K", 
                                ">50K" = "over_50K"),
         sex = factor(sex),
         occupation = factor(occupation),
         marital_status = factor(marital_status),
         native_country = factor(native_country),
         race = factor(race),
         relationship = factor(relationship),
         workclass = factor(workclass),
         region = factor(region),
         education_num = as.integer(education_num),
         education_ord = factor(education_num, levels=c(1:16), ordered=TRUE),
         age = as.integer(age),
         hours_per_week = as.integer(hours_per_week)) %>% 
  dplyr::select(age, capital, education_num, hours_per_week, marital_status, native_country,region,education_ord,
         salary, sex, fnlwgt, occupation, race, relationship, workclass,
         feat01,feat02,feat03,feat04,feat05,feat06,feat07,feat08,feat09,feat10)
set.seed(1618)
training_obs <- createDataPartition(salary.df.test$salary, 
                                    p = 0.5, 
                                    list = FALSE) 
salary.df.valid <- salary.df.test[training_obs,]
salary.df.test  <- salary.df.test[-training_obs,]

Model Fromula

model1.formula <- salary ~ age + education_ord + capital + hours_per_week + marital_status + occupation + race + relationship + sex + workclass + region +
  feat01 + feat02 + feat03 + feat04 + feat05 + feat06 + feat07 + feat08 + feat09 + feat10

Classification Trees

A classification tree model has been employed to better understand the dataset and identify significant variables that contribute to the prediction of salary levels. The primary goal of a classification tree is to recursively partition the dataset into subsets based on predictor variables, creating a tree structure that can be interpreted to make predictions.

salary.class.tree <-   
  rpart(model1.formula,
        data = salary.df.train,
        method = "class",
        minbucket = 1500, 
        cp = -1)

opt <- which.min(salary.class.tree$cptable[, "xerror"])
cp <- salary.class.tree$cptable[opt, "CP"]

salary.class.tree <- 
  prune(salary.class.tree, cp = cp)
fancyRpartPlot(salary.class.tree)

Key findings:

  • Root Node: The root node indicates that the majority (76%) of individuals have a salary less than or equal to $50K, while 24% have a salary above $50K.
  • Node 2 (Relationship): Individuals categorized as “Not-in-family,” “Other-relative,” “Own-child,” or “Unmarried” account for a subset where 94% have a salary less than or equal to $50K.
  • Node 3 (Relationship): Individuals categorized as “Husband” or “Wife” form another subset, with 55% having a salary less than or equal to $50K.
  • Node 6 (Education Level): Among individuals in the “Husband” or “Wife” category with an education level below 13, 66% have a salary less than or equal to $50K.
  • Node 7 (Education Level): Among individuals in the “Husband” or “Wife” category with an education level equal to or above 13, 73% have a salary above $50K.
  • Node 12 (Feat04): Among individuals in the “Husband” or “Wife” category with an education level equal to or below 12 and feat04 greater or equal to 0.41, 71% have a salary under $50K.
  • Node 13 (Feat04): Among individuals in the “Husband” or “Wife” category with an education level equal to or below 12 and feat04 greater or equal to 0.41, 64% have a salary above $50K.
par(mar = c(5.1, 6.1, 4.1, 2.1))
barplot(rev(salary.class.tree$variable.importance), # vactor
        col = "blue",  # colors
        main = "imporatnce of variables in model tree4",
        horiz = T,  # horizontal type of plot
        las = 1,    # labels always horizontally 
        cex.names = 0.6)

The classification tree provides insights into the significant predictors of salary levels. In this analysis, variables such as relationship, marital_status and education_ord were identified as key factors influencing salary predictions.

pred.class.train  <- as.data.frame(predict(salary.class.tree, salary.df.train))
ROC.class.train  <- roc(
  as.numeric(salary.df.train$salary == "under_50K"), 
  pred.class.train[, 1])

pred.class.valid <- as.data.frame(predict(salary.class.tree, 
                            salary.df.valid))
ROC.class.valid  <- roc(as.numeric(salary.df.valid$salary == "under_50K"), 
                       pred.class.valid[, 1])

list(
  ROC.class.train = ROC.class.train,
  ROC.class.valid = ROC.class.valid
) %>%
  pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) + 
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), 
               color = "grey", 
               linetype = "dashed") +
  labs(subtitle = paste0("Gini TRAIN: ", 
                         round(100*(2 * auc(ROC.class.train) - 1), 1), "%, ",
                         "GINI VALID: ", 
                         round(100*(2 * auc(ROC.class.valid) - 1), 1), "% ")) +
  theme_bw() + coord_fixed() +
  # scale_color_brewer(palette = "Paired") +
  scale_color_manual(values = RColorBrewer::brewer.pal(n = 4, 
                                                       name = "Paired")[c(1, 3)])

Random Forests

Random Forests is a powerful learning technique employed for both classification and regression tasks. Our objective is to predict binary variable salary and to achieve this, we are generating three distinct models. To optimize performance, we utilize grid search with 5 resampling iterations for parameters such as mtry and min.node.size.

method = “ranger”

For method “ranger” we decided to verify two different importance metrics: permutation and impurity and two different number of trees: 100 or 250. Additionally there are parameters in grid search: mtry from 5 to 46 and minimal node size: 100,250 or 500.

if(0){
  
ctrl <- trainControl(method = "cv", 
                     number = 5, 
                     classProbs = TRUE)


parameters_ranger <- expand.grid(
  mtry = c(5:46), # 46 is number of all independent variables once encoded 
  splitrule = "gini",
  min.node.size = c(100, 250, 500)
)


set.seed(1618)  
rf_model_1 <- train(
  model1.formula,
  data = salary.df.train,
  method = "ranger",
  num.trees = 100,
  importance = "permutation",
  tuneGrid = parameters_ranger,
  trControl = ctrl,
  weights = salary.df.train$fnlwgt
)

saveRDS(rf_model_1, file = here("output", "rf_model_1.rds"))

set.seed(1618)  
rf_model_2 <- train(
  model1.formula,
  data = salary.df.train,
  method = "ranger",
  num.trees = 250,
  importance = "impurity",
  tuneGrid = parameters_ranger,
  trControl = ctrl,
  weights = salary.df.train$fnlwgt
)

saveRDS(rf_model_2, file = here("output", "rf_model_2.rds"))
}

rf_model_1 <- readRDS(here("output", "rf_model_1.rds"))
rf_model_2 <- readRDS(here("output", "rf_model_2.rds"))

method = “rf”

if(0){
ctrl <- trainControl(method = "oob",
                     classProbs = TRUE)


parameters_ranger <- expand.grid(mtry = 5:46)


set.seed(1618)  
rf_model_3 <- train(
  model1.formula,
  data = salary.df.train,
  method = "rf",
  ntree = 100,
  nodesize = 100,
  tuneGrid = parameters_ranger,
  trControl = ctrl,
  weights = salary.df.train$fnlwgt
)

saveRDS(rf_model_3, file = here("output", "rf_model_3.rds"))
}

rf_model_3 <- readRDS(here("output", "rf_model_3.rds"))

In order to verify which models perform best results we are generating ROC curves.

##############################################
pred.train.rf_1 <- predict(rf_model_1, 
                         salary.df.train, 
                         type = "prob")[, "over_50K"]
ROC.train.rf_1  <- roc(as.numeric(
  salary.df.train$salary == "under_50K"),
  pred.train.rf_1)

pred.valid.rf_1  <- predict(rf_model_1, 
                         salary.df.valid, 
                         type = "prob")[, "over_50K"]
ROC.valid.rf_1   <- roc(as.numeric(
  salary.df.valid$salary == "under_50K"), pred.valid.rf_1)
##############################################
pred.train.rf_2 <- predict(rf_model_2, 
                         salary.df.train, 
                         type = "prob")[, "under_50K"]
ROC.train.rf_2  <- roc(as.numeric(
  salary.df.train$salary == "under_50K"),
  pred.train.rf_2)

pred.valid.rf_2  <- predict(rf_model_2, 
                         salary.df.valid, 
                         type = "prob")[, "under_50K"]
ROC.valid.rf_2   <- roc(as.numeric(
  salary.df.valid$salary == "under_50K"), pred.valid.rf_2)
##############################################
pred.train.rf_3 <- predict(rf_model_3, 
                         salary.df.train, 
                         type = "prob")[, "under_50K"]
ROC.train.rf_3  <- roc(as.numeric(
  salary.df.train$salary == "under_50K"),
  pred.train.rf_3)

pred.valid.rf_3  <- predict(rf_model_3, 
                         salary.df.valid, 
                         type = "prob")[, "under_50K"]
ROC.valid.rf_3   <- roc(as.numeric(
  salary.df.valid$salary == "under_50K"), pred.valid.rf_3)
list(
  ROC.train.rf_1  = ROC.train.rf_1,
  ROC.valid.rf_1  = ROC.valid.rf_1,
  ROC.train.rf_2  = ROC.train.rf_2,
  ROC.valid.rf_2  = ROC.valid.rf_2,
  ROC.train.rf_3  = ROC.train.rf_3,
  ROC.valid.rf_3  = ROC.valid.rf_3
) %>%
  pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) + 
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), 
               color = "grey", 
               linetype = "dashed") +
  labs(title = paste0("Gini TEST: ",
                      "rf1 = ", 
                      round(100 * (2 * auc(ROC.valid.rf_1) - 1), 1), "%, ",
                      "rf2 = ", 
                      round(100 * (2 * auc(ROC.valid.rf_2) - 1), 1), "%, ",
                      "rf3 = ", 
                      round(100 * (2 * auc(ROC.valid.rf_3) - 1), 1), "%, "),
       subtitle =  paste0("Gini TRAIN: ",
                          "rf1 = ", 
                          round(100 * (2 * auc(ROC.train.rf_1) - 1), 1), "%, ",
                          "rf2 = ", 
                          round(100 * (2 * auc(ROC.train.rf_2) - 1), 1), "%, ",
                          "rf3 = ", 
                          round(100 * (2 * auc(ROC.train.rf_3) - 1), 1), "%, ")) +
  theme_bw() + coord_fixed() +
  scale_color_brewer(palette = "Paired")

All models have performed well, with accuracies above 90% for train data set and above 80% for test set. This discrepancy could be an indication of overfitting. Model 1 and Model 2, with additional tuning of min.node.size, show slightly higher accuracy and kappa values. The accuracy improvement between models is relatively small.

The random forest models seem effective in predicting the salary class, with Model 2 having a slight edge in terms of AUC.

salary.rf <- rf_model_2

Below is a plot presenting the 15 most important predictors. In comparison to the previous Classification Trees Model, the selection is entirely different. The key predictors are found to be capital, education, marital status, age, feat02, feat04, feat06 and hours_per_week.

source(here("funs", "getVarImpPlot4Ranger.R"))
getVarImpPlot4Ranger(salary.rf$finalModel,
                     sort = TRUE,
                     main = "Importance of predictors",
                     n.var = 15)

Logistic Regression

Now we apply the logistic regression model.

fiveStats <- function(...) c(twoClassSummary(...), 
                             defaultSummary(...))

ctrl <- trainControl(method = "cv",
                     number = 5, 
                     savePredictions = "final",
                     summaryFunction = fiveStats,
                     classProbs = TRUE)

set.seed(1618)
salary.logit <- 
  train(model1.formula, 
        data = salary.df.train, 
        method = "glm",
        family = "binomial", 
        preProcess = c("center", "scale"),
        trControl = ctrl)

salary.logit
## Generalized Linear Model 
## 
## 34049 samples
##    21 predictor
##     2 classes: 'under_50K', 'over_50K' 
## 
## Pre-processing: centered (70), scaled (70) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 27240, 27239, 27239, 27239, 27239 
## Resampling results:
## 
##   ROC        Sens       Spec       Accuracy   Kappa    
##   0.9230546  0.9352554  0.6634205  0.8697173  0.6270098

The model achieved consistent Gini coefficients of 85% and 84% on both the training and validation sets. In contrast to the random forests model, where a discrepancy of around 10 percentage points was observed between the training and validation Gini coefficients, the logistic regression model demonstrates better consistency.

pred.logit.train <- predict(salary.logit, 
                            salary.df.train,
                            type = "prob")
ROC.logit.train  <- roc(as.numeric(salary.df.train$salary == "over_50K"), 
                       pred.logit.train[, 1])

pred.logit.valid <- predict(salary.logit, 
                            salary.df.valid,
                            type = "prob")
ROC.logit.valid  <- roc(as.numeric(salary.df.valid$salary == "over_50K"), 
                       pred.logit.valid[, 1])

list(
  ROC.logit.train = ROC.logit.train,
  ROC.logit.valid = ROC.logit.valid
) %>%
  pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) + 
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), 
               color = "grey", 
               linetype = "dashed") +
  labs(subtitle = paste0("Gini TRAIN: ", 
                         round(100*(2 * auc(ROC.logit.train) - 1), 1), "%, ",
                         "GINI VALID: ", 
                         round(100*(2 * auc(ROC.logit.valid) - 1), 1), "% ")) +
  theme_bw() + coord_fixed() +
  # scale_color_brewer(palette = "Paired") +
  scale_color_manual(values = RColorBrewer::brewer.pal(n = 4, 
                                                       name = "Paired")[c(1, 3)])

XGBoost

Our next model is XGBoost which needs data preparation, we also implemented grid searcg procedure.

Preparing the train data

xgb.train.label <- salary.df.train %>% 
  select(salary) %>%
  mutate(salary == "under_50K") %>%
  pull() %>%
  as.integer()
tmp.train.data <- 
  dummy.data.frame(salary.df.train %>% 
                     select( 
                       age,capital,education_num,hours_per_week,
                       marital_status,region,sex,occupation,
                       race,relationship,workclass,feat01,feat02,feat03,
                       feat04,feat05,feat06,feat07,feat08,feat09,feat10
                     ) %>%
                     mutate(education_num = as.integer(education_num)) %>% 
                     as.data.frame(),
                   names = NULL, omit.constants=TRUE,
                   dummy.classes = getOption("dummy.classes"),
                   all = TRUE)
tmp.train.data %>% glimpse()
## Rows: 34,049
## Columns: 63
## $ age                                <int> 25, 51, 57, 31, 74, 42, 35, 28, 46,

## $ capital                            <dbl> 5178, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## $ education_num                      <int> 7, 10, 9, 13, 14, 9, 9, 9, 4, 10, 7

## $ hours_per_week                     <int> 40, 40, 50, 40, 10, 40, 40, 50, 75,

## $ marital_statusMarried              <int> 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0,

## $ `marital_statusPreviously Married` <int> 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1,

## $ marital_statusSingle               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,

## $ regionAsia                         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## $ regionCaribbean                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## $ `regionCentral America`            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## $ regionEurope                       <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,

## $ `regionNorth America`              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## $ regionOther                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,

## $ `regionSouth America`              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## $ `regionUnited-States`              <int> 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1,

## $ sexFemale                          <int> 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1,

## $ sexMale                            <int> 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,

## $ `occupationAdm-clerical`           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,

## $ `occupationArmed-Forces`           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## $ `occupationCraft-repair`           <int> 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,

## $ `occupationExec-managerial`        <int> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0,

## $ `occupationFarming-fishing`        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## $ `occupationHandlers-cleaners`      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## $ `occupationMachine-op-inspct`      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## $ `occupationOther-service`          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## $ `occupationPriv-house-serv`        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## $ `occupationProf-specialty`         <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,

## $ `occupationProtective-serv`        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## $ occupationSales                    <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,

## $ `occupationTech-support`           <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## $ `occupationTransport-moving`       <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## $ occupationUnemployed               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## $ occupationUnknown                  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,

## $ `raceAmer-Indian-Eskimo`           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## $ `raceAsian-Pac-Islander`           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,

## $ raceBlack                          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,

## $ raceOther                          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## $ raceWhite                          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0,

## $ relationshipHusband                <int> 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,

## $ `relationshipNot-in-family`        <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,

## $ `relationshipOther-relative`       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,

## $ `relationshipOwn-child`            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## $ relationshipUnmarried              <int> 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,

## $ relationshipWife                   <int> 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,

## $ `workclassFederal-gov`             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,

## $ `workclassLocal-gov`               <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,

## $ `workclassNever-worked`            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## $ workclassPrivate                   <int> 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0,

## $ `workclassSelf-emp-inc`            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## $ `workclassSelf-emp-not-inc`        <int> 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,

## $ `workclassState-gov`               <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,

## $ workclassUnknown                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,

## $ `workclassWithout-pay`             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## $ feat01                             <dbl> 0.8448420, 1.3336356, 1.1780716, 1.

## $ feat02                             <dbl> 0.6298227, 0.3418702, 0.6165489, 0.

## $ feat03                             <dbl> 1.1146365, 0.7709931, 1.2863809, 1.

## $ feat04                             <dbl> 0.4889465, 0.6193634, 0.6102719, 0.

## $ feat05                             <dbl> 0.9520031, 1.4626074, 1.5411345, 1.

## $ feat06                             <dbl> 0.5787523, 0.2777089, 0.3670095, 0.

## $ feat07                             <dbl> 1.0472219, 1.0282187, 0.5600741, 0.

## $ feat08                             <dbl> 1.3348179, 0.7770539, 0.9994005, 0.

## $ feat09                             <dbl> 0.8129876, 0.9499549, 1.0217224, 0.

## $ feat10                             <dbl> 0.8436999, 1.1056284, 1.3840671, 1.

xgb.train.data   <- as.matrix(tmp.train.data) 
rm(tmp.train.data)
xgb.train <- xgb.DMatrix(data = xgb.train.data, 
                         label = xgb.train.label)

Gridsearch

if (0) {
  paramsGrid <- 
    expand.grid(subsample = c(0.8, 1), 
                colsample_bynode = c(0.7, 1),
                max_depth = c(1, 2, 3),
                eta = c(0.3, 0.1, 0.05, 0.01),
                gamma = c(0, 1, 3, 10),
                early_stopping_rounds = c(5, 10, 30, 50),
                min_child_weight = c(10, 20, 30)) %>%
    mutate(ID = row_number())
  
  set.seed(1618)
  paramsGrid <- 
    paramsGrid[sample(1:nrow(paramsGrid), 100), ] 
  
  gridSearchResults <- list()
  
  for (i in 1:nrow(paramsGrid)) {
    
    # Extract Parameters to test
    thisID               <- paramsGrid[i, "ID"]
    thisSubsample        <- paramsGrid[i, "subsample"]
    thisColsample_bynode <- paramsGrid[i, "colsample_bynode"]
    thisMax_depth        <- paramsGrid[i, "max_depth"]
    thisEta              <- paramsGrid[i, "eta"]
    thisGamma            <- paramsGrid[i, "gamma"]
    thisEarly_stopping_rounds <- paramsGrid[i, "early_stopping_rounds"]
    thisMin_child_weight <- paramsGrid[i, "min_child_weight"]
    
    cat("processing combination", thisID, "...\n")

    seed_split <- 1618
    set.seed(seed_split)
    xgboostModelCV <- xgb.cv(
      data = xgb.train, 
      nfold = 10,
      booster = "gbtree",
      objective = "binary:logistic",
      eta = thisEta,
      max_depth = thisMax_depth,  
      min_child_weight = thisMin_child_weight,
      gamma = thisGamma,
      lambda = 0.5,
      subsample = thisSubsample,
      colsample = 1,
      colsample_bytree = 1,
      colsample_bylevel = 0.5,
      colsample_bynode = thisColsample_bynode,
      metrics = "auc",
      eval.metric = "auc",
      nrounds = 1000,
      nthreads = 2,
      early_stopping_rounds = thisEarly_stopping_rounds,
      watchlist = list(val1 = xgb.train),
      verbose = 1
    )
    
    xvalidationScores <- as.data.frame(xgboostModelCV$evaluation_log)
    train_auc_mean <- tail(xvalidationScores$train_auc_mean, 1)
    train_auc_std  <- tail(xvalidationScores$train_auc_std, 1)
    test_auc_mean  <- tail(xvalidationScores$test_auc_mean, 1)
    test_auc_std   <- tail(xvalidationScores$test_auc_std, 1)
    train.gini     <- 2 * tail(xvalidationScores$train_auc_mean, 1) - 1
    test.gini      <- 2 * tail(xvalidationScores$test_auc_mean,1) - 1
    gridSearchResults[[i]] <- c(train.gini, 
                                test.gini,
                                train_auc_mean,
                                train_auc_std,
                                test_auc_mean,
                                test_auc_std,
                                thisSubsample, 
                                thisColsample_bynode, 
                                thisMax_depth, 
                                thisEta,
                                thisGamma,
                                thisEarly_stopping_rounds,
                                thisMin_child_weight)
  
    
  
  }

  if (0) gridSearchResults %>% saveRDS(here("output", "gridSearchResults.rds"))
}

Loading gridsearch results form the external file

gridSearchResults <- readRDS(here("output", "gridSearchResults.rds"))
  output <- 
    as_tibble(do.call(rbind, gridSearchResults)) %>%
    mutate(id = row_number()) %>%
    select(id, everything())
  names(output) <- 
    varnames <- c("id", 
                  "train_Gini", "test_Gini", 
                  "train_AUC_mean", "train_AUC_std", 
                  "test_AUC_mean", "test_AUC_std", 
                  "subsample", "colsample_bynode", "max_depth", "eta",
                  "gamma", "early_stopping_rounds", "min_child_weight"
    )
  output <- output %>% arrange(desc(test_Gini))
  output
## # A tibble: 100 × 14
##       id train_Gini test_Gini train_AUC_mean train_AUC_std test_AUC_mean
##    <int>      <dbl>     <dbl>          <dbl>         <dbl>         <dbl>
##  1     9      0.919     0.894          0.959      0.000268         0.947
##  2    75      0.916     0.894          0.958      0.000381         0.947
##  3    69      0.914     0.893          0.957      0.000301         0.946
##  4    51      0.927     0.892          0.963      0.000406         0.946
##  5    46      0.909     0.892          0.955      0.000538         0.946
##  6    61      0.923     0.892          0.961      0.000483         0.946
##  7    81      0.914     0.892          0.957      0.000443         0.946
##  8   100      0.926     0.892          0.963      0.000463         0.946
##  9    68      0.918     0.891          0.959      0.000389         0.946
## 10    74      0.926     0.891          0.963      0.000443         0.946
## # â„č 90 more rows
## # â„č 8 more variables: test_AUC_std <dbl>, subsample <dbl>,
## #   colsample_bynode <dbl>, max_depth <dbl>, eta <dbl>, gamma <dbl>,
## #   early_stopping_rounds <dbl>, min_child_weight <dbl>

Best Params

bestParams <-
  output %>% 
  arrange(desc(test_Gini)) %>%
  mutate(diff = train_Gini - test_Gini) %>%
  filter(test_Gini > 0.891 & diff < 0.02) %>%
  select(-starts_with("train"), -starts_with("test")) %>%
  unlist()
bestParams
##                    id             subsample      colsample_bynode 
##            46.0000000             1.0000000             0.7000000 
##             max_depth                   eta                 gamma 
##             2.0000000             0.3000000             3.0000000 
## early_stopping_rounds      min_child_weight                  diff 
##            50.0000000            10.0000000             0.0169308

Retraining the model with the optimal parameters - with CV

if(0){
seed_split <- 1618
set.seed(seed_split)
xgb.fit.cv <- xgb.cv(
  nfold = 10,
  data = xgb.train, 
  booster = "gbtree",
  objective = "binary:logistic",
  eta = bestParams["eta"],
  max_depth = bestParams["max_depth"],  
  min_child_weight = bestParams["min_child_weight"],
  gamma = bestParams["gamma"],
  lambda = 0.5,
  subsample = bestParams["subsample"],
  colsample = 1,
  colsample_bytree = 1,
  colsample_bylevel = 0.5,
  colsample_bynode = bestParams["colsample_bynode"],
  metrics = "auc",
  eval.metric = "auc",
  nrounds = 1000,
  nthreads = 2,
  early_stopping_rounds = bestParams["early_stopping_rounds"],
  watchlist=list(val1 = xgb.train),
  verbose = 1
)

xgb.fit.cv %>% saveRDS(here("output", "xgb_fit_cv2.rds"))
}

Retraining the model with the optimal parameters - without CV

if(0){
  
xgb.fit.cv <- readRDS(here("output", "xgb_fit_cv2.rds"))

seed_split <- 1618
set.seed(seed_split)
salary.xgb <- xgb.train(
  data = xgb.train, 
  booster = "gbtree",
  objective = "binary:logistic",
  eta = bestParams["eta"],
  max_depth = bestParams["max_depth"],  
  min_child_weight = bestParams["min_child_weight"],
  gamma = bestParams["gamma"],
  lambda = 0.5,
  subsample = bestParams["subsample"],
  colsample = 1,
  colsample_bytree = 1,
  colsample_bylevel = 0.5,
  colsample_bynode = bestParams["colsample_bynode"],
  metrics = "auc",
  eval.metric = "auc",
  nrounds = xgb.fit.cv$best_iteration,
  nthreads = 2,
  early_stopping_rounds = bestParams["early_stopping_rounds"],
  watchlist=list(val1 = xgb.train),
  verbose = 0
)

salary.xgb %>% saveRDS(here("output", "salary.xgb.rds"))
}
salary.xgb <- readRDS(here("output", "salary.xgb.rds"))

Visualization of features importance.

The plot displays the most important features in the model, ranked in descending order based on their significance. The top features contributing to the model include capital,marital_statusMarried, education_num, and relationshipHusband among others. It’s worth mentioning that XGBoost included feat04, feat02 and feat06 which showed significant correlation with salary. However, it did not include the rest of the artificial features.

importance_matrix <- xgb.importance(model = salary.xgb)
top_variables <- head(rownames(importance_matrix), 15)

# Subset the importance matrix to include only the top variables
top_importance_matrix <- subset(importance_matrix, rownames(importance_matrix) %in% top_variables)

# Print the subset of the importance matrix
print(top_importance_matrix)
##                       Feature        Gain       Cover   Frequency
##  1:                   capital 0.213636201 0.223865235 0.185714286
##  2:     marital_statusMarried 0.152995029 0.017085835 0.024285714
##  3:             education_num 0.144249906 0.066383870 0.058571429
##  4:       relationshipHusband 0.092414305 0.011988699 0.008571429
##  5:                    feat04 0.086323277 0.098015246 0.088571429
##  6:                       age 0.073061914 0.068730727 0.065714286
##  7:                    feat06 0.050447772 0.060225356 0.060000000
##  8:                    feat02 0.042385741 0.048788107 0.057142857
##  9:            hours_per_week 0.035510798 0.052644621 0.055714286
## 10:      marital_statusSingle 0.019346410 0.004486087 0.007142857
## 11:          relationshipWife 0.016887695 0.027115997 0.024285714
## 12: occupationExec-managerial 0.013336668 0.015723442 0.014285714
## 13:   occupationOther-service 0.007447751 0.011096873 0.007142857
## 14:                 sexFemale 0.006599651 0.008816839 0.011428571
## 15: occupationFarming-fishing 0.004264967 0.011678483 0.008571429
# Plot the importance of the top variables
xgb.plot.importance(importance_matrix = top_importance_matrix)

Prediction for the train data

pred.xgb.train <- predict(salary.xgb, xgb.train, reshape = TRUE)
ROC.train <- roc(xgb.train.label, pred.xgb.train)

Preparing the test data

xgb.valid.label <- salary.df.valid %>% 
  select(salary) %>%
  mutate(salary == "under_50K") %>%
  pull() %>%
  as.integer()
tmp.valid.data <- 
  dummy.data.frame(salary.df.valid %>% 
                     select( 
                       age,capital,education_num,hours_per_week,
                       marital_status,region,sex,occupation,
                       race,relationship,workclass, feat01,feat02,feat03,
                       feat04,feat05,feat06,feat07,feat08,feat09,feat10
                     ) %>%
                     mutate(education_num = as.integer(education_num)) %>% 
                     as.data.frame(),
                   names = NULL, omit.constants=TRUE,
                   dummy.classes = getOption("dummy.classes"),
                   all = TRUE)
tmp.valid.data <- tmp.valid.data[, salary.xgb$feature_names]
xgb.valid.data   <- as.matrix(tmp.valid.data) 
rm(tmp.valid.data)
xgb.valid.data %>% glimpse()
##  num [1:4256, 1:63] 28 34 28 39 57 43 40 50 62 48 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:4256] "1" "2" "3" "4" ...
##   ..$ : chr [1:63] "age" "capital" "education_num" "hours_per_week" ...
xgb.valid <- xgb.DMatrix(data = xgb.valid.data, 
                         label = xgb.valid.label)
pred.xgb.valid <- predict(salary.xgb, xgb.valid, reshape = TRUE)
ROC.valid <- pROC::roc(xgb.valid.label, pred.xgb.valid) 

ROC Curve

Now we are processing ROC curve. The results for train dataset are slightly worse than in case of random forests,94% to 91%. However the discrepancy is lower for XGBoost, results for test data set is 88,6%. XGBoost model seems to be more consistent.

list(
  ROC.train = ROC.train,
  ROC.valid  = ROC.valid
) %>%
  pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) + 
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), 
               color = "grey", 
               linetype = "dashed") +
  labs(subtitle = paste0("Gini: ",
                         "TRAIN = ", 
                         round(100*(2 * auc(ROC.train) - 1), 1), "%, ",
                         "TEST = ", 
                         round(100*(2 * auc(ROC.valid) - 1), 1), "%"),
       caption = "source: own calculations") +
  theme_bw() + coord_fixed() 

Ensembling - Weighted Voting

We are going to implement simple ensembling based on weighted voting. First step is combining results from all models into one data frame.

preds.train <- cbind(logit = pred.logit.train$under_50K, 
                     class_tree = pred.class.train$under_50K,
                     xgb = pred.xgb.train,
                     rf = pred.train.rf_2)

preds.train %>% head()
##          logit class_tree       xgb        rf
## [1,] 0.6216153  0.7140378 0.2473250 0.1651958
## [2,] 0.3456506  0.7140378 0.3435149 0.4644596
## [3,] 0.4908892  0.7140378 0.3902369 0.4114981
## [4,] 0.9667551  0.9362605 0.9802357 0.8946082
## [5,] 0.7476965  0.9362605 0.9264451 0.8467233
## [6,] 0.6316543  0.7140378 0.6811452 0.6862005

Below there is presented correlation plot for all models results. As we know XGBoost and Random Forests gave the highest results and they are also very strong correlated. Logistic Regression results were consistent and gave around 86% however it is also highly correlated with XGBoost and Random Forests. Ensembling might not be the best solution here, nevertheless let us verify it.

cor(preds.train)
##                logit class_tree       xgb        rf
## logit      1.0000000  0.7695343 0.9184463 0.9236927
## class_tree 0.7695343  1.0000000 0.7383541 0.7690964
## xgb        0.9184463  0.7383541 1.0000000 0.9622331
## rf         0.9236927  0.7690964 0.9622331 1.0000000
corrplot::corrplot(cor(preds.train))

preds.valid <- cbind(logit = pred.logit.valid$under_50K, 
                     class_tree = pred.class.valid$under_50K,
                     xgb = pred.xgb.valid,
                     rf = pred.valid.rf_2)

preds.valid %>% head()
##          logit class_tree       xgb        rf
## [1,] 0.9831757  0.9362605 0.9848252 0.9823226
## [2,] 0.5214763  0.7140378 0.5196844 0.6072156
## [3,] 0.9931267  0.9362605 0.9892424 0.9972532
## [4,] 0.6049790  0.7140378 0.6916719 0.5637618
## [5,] 0.8554764  0.9362605 0.8422664 0.7331504
## [6,] 0.9810695  0.9362605 0.9819695 0.9707648
cor(preds.valid)
##                logit class_tree       xgb        rf
## logit      1.0000000  0.7483839 0.9213339 0.9282638
## class_tree 0.7483839  1.0000000 0.7342402 0.7724098
## xgb        0.9213339  0.7342402 1.0000000 0.9574183
## rf         0.9282638  0.7724098 0.9574183 1.0000000
corrplot::corrplot(cor(preds.valid))

models.weights <- c(
  logit = auc(ROC.logit.train),
  class_tree = auc(ROC.class.train),
  rf = auc(ROC.train.rf_2),
  xgb = auc(ROC.train))

models.weights <- models.weights/sum(models.weights)
models.weights
##      logit class_tree         rf        xgb 
##  0.2517718  0.2241443  0.2639837  0.2601002
head(preds.train) * (models.weights * 4)
##          logit class_tree       xgb        rf
## [1,] 0.6260208  0.7539773 0.2490778 0.1744360
## [2,] 0.3099025  0.7428855 0.3079876 0.4832242
## [3,] 0.5183470  0.7190982 0.4120647 0.4144145
## [4,] 1.0058128  0.8394300 1.0198380 0.8020855
## [5,] 0.7529954  0.9886301 0.9330108 0.8940846
## [6,] 0.5663269  0.7428855 0.6106993 0.7139235
preds.train$weighted.voting <- ifelse(rowSums((preds.train * models.weights*4) > 0.5) > 2,
         "under_50K", "over_50K")

preds.valid$weighted.voting <-
  ifelse(rowSums((preds.valid * models.weights*4) > 0.5) > 2,
         "under_50K", "over_50K")

As expected the results are not satisfying. The ensembling method decrease results and in the same time does not decrease the discrepancy between train and test data sets.

ROC.ensemb.train  <- roc(as.numeric(salary.df.train$salary == "under_50K"),
                 as.numeric(preds.train$weighted.voting == "under_50K"))

ROC.ensemb.valid  <- roc(as.numeric(salary.df.valid$salary == "under_50K"),
                 as.numeric(preds.valid$weighted.voting == "under_50K"))

list(
  ROC.ensemb.train = ROC.ensemb.train,
  ROC.ensemb.valid = ROC.ensemb.valid
) %>%
  pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) + 
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), 
               color = "grey", 
               linetype = "dashed") +
  labs(subtitle = paste0("Gini TRAIN: ", 
                         round(100*(2 * auc(ROC.ensemb.train) - 1), 1), "%, ",
                         "GINI VALID: ", 
                         round(100*(2 * auc(ROC.ensemb.valid) - 1), 1), "% ")) +
  theme_bw() + coord_fixed() +
  # scale_color_brewer(palette = "Paired") +
  scale_color_manual(values = RColorBrewer::brewer.pal(n = 4, 
                                                       name = "Paired")[c(1, 3)])

Neural Network

Another chosen model to implement is neural network. This model needs some data preparation which is proceed below.

Training Neural Network

Standardizing Numeric Variables

numerical_columns <- sapply(salary.df.train, is.numeric)
df_standardized <- as.data.frame(scale(salary.df.train[, numerical_columns]))

salary.df.train.std <- cbind(salary.df.train[, !numerical_columns, drop = FALSE], df_standardized)
mean_values <- colMeans(salary.df.train[, numerical_columns])
sd_values <- apply(salary.df.train[, numerical_columns], 2, sd)

df_valid_standardized <- as.data.frame(scale(salary.df.valid[, numerical_columns], center = mean_values, scale = sd_values))

salary.df.valid.std <- cbind(salary.df.valid[, !numerical_columns, drop = FALSE], df_valid_standardized)
salary.train.mtx <- 
  model.matrix(object = model1.formula, 
               data   = salary.df.train.std)
dim(salary.train.mtx)
## [1] 34049    71
colnames(salary.train.mtx) <- gsub(" ", "_",  colnames(salary.train.mtx))
colnames(salary.train.mtx) <- gsub(",", "_",  colnames(salary.train.mtx))
colnames(salary.train.mtx) <- gsub("/", "",   colnames(salary.train.mtx))
colnames(salary.train.mtx) <- gsub("-", "_",  colnames(salary.train.mtx))
colnames(salary.train.mtx) <- gsub("'", "",   colnames(salary.train.mtx))
colnames(salary.train.mtx) <- gsub("\\+", "", colnames(salary.train.mtx))
colnames(salary.train.mtx) <- gsub("\\^", "", colnames(salary.train.mtx))
col_list <- paste(c(colnames(salary.train.mtx[, -1])), collapse = "+")
col_list <- paste(c("salary_1 ~ ", col_list), collapse = "")
(model.formula2 <- formula(col_list))
## salary_1 ~ age + education_ord.L + education_ord.Q + education_ord.C + 
##     education_ord4 + education_ord5 + education_ord6 + education_ord7 + 
##     education_ord8 + education_ord9 + education_ord10 + education_ord11 + 
##     education_ord12 + education_ord13 + education_ord14 + education_ord15 + 
##     capital + hours_per_week + marital_statusPreviously_Married + 
##     marital_statusSingle + occupationArmed_Forces + occupationCraft_repair + 
##     occupationExec_managerial + occupationFarming_fishing + occupationHandlers_cleaners + 
##     occupationMachine_op_inspct + occupationOther_service + occupationPriv_house_serv + 
##     occupationProf_specialty + occupationProtective_serv + occupationSales + 
##     occupationTech_support + occupationTransport_moving + occupationUnemployed + 
##     occupationUnknown + raceAsian_Pac_Islander + raceBlack + 
##     raceOther + raceWhite + relationshipNot_in_family + relationshipOther_relative + 
##     relationshipOwn_child + relationshipUnmarried + relationshipWife + 
##     sexMale + workclassLocal_gov + workclassNever_worked + workclassPrivate + 
##     workclassSelf_emp_inc + workclassSelf_emp_not_inc + workclassState_gov + 
##     workclassUnknown + workclassWithout_pay + regionCaribbean + 
##     regionCentral_America + regionEurope + regionNorth_America + 
##     regionOther + regionSouth_America + regionUnited_States + 
##     feat01 + feat02 + feat03 + feat04 + feat05 + feat06 + feat07 + 
##     feat08 + feat09 + feat10

Hence, let us decide to use 4 hidden layers: we will use 15 neurons in the 1st layer and 8, 8, and 4 neurons in the subsequent layers, respectively. The number of layers and neurons were chosen arbitrarily and compared with a few other combinations; however, the results were not close to those from Random Forests, XGBoost, nor Logistic Regression. Therefore, we did not spend more time searching for the best parameters. The results from the train and test datasets are consistent, though.

set.seed(1618)

salary.nn <- 
  data.frame(salary.train.mtx,
             salary_1 = as.numeric(salary.df.train.std$salary == "under_50K")) %>%
  sample_n(1000) %>%
  neuralnet(model.formula2,
            data = .,
            hidden = c(15,8,8,4), # number of neurons in hidden layers
            linear.output = FALSE, # F for classification
            learningrate.limit = NULL,
            learningrate.factor = list(minus = 0.7, plus = 1.3),
            algorithm = "rprop+",
            threshold = 0.01)
plot(salary.nn, rep = "best")

Predictions and ROC

salary.valid.mtx <- 
  model.matrix(object = model1.formula, 
               data = salary.df.valid.std)

colnames(salary.valid.mtx) <- gsub(" ", "_",  colnames(salary.valid.mtx))
colnames(salary.valid.mtx) <- gsub(",", "_",  colnames(salary.valid.mtx))
colnames(salary.valid.mtx) <- gsub("/", "",   colnames(salary.valid.mtx))
colnames(salary.valid.mtx) <- gsub("-", "_",  colnames(salary.valid.mtx))
colnames(salary.valid.mtx) <- gsub("'", "",   colnames(salary.valid.mtx))
colnames(salary.valid.mtx) <- gsub("\\+", "", colnames(salary.valid.mtx))
salary.pred.nn <- compute(salary.nn, salary.valid.mtx[, -1])
salary.pred.nn$net.result %>% head(10)
##           [,1]
## 1  0.992748353
## 2  0.991538964
## 3  0.992749227
## 4  0.991950714
## 5  0.992741099
## 6  0.974188102
## 7  0.992748661
## 8  0.626344352
## 9  0.991985458
## 10 0.007918318
(confusionMatrix.nn <-
  confusionMatrix(as.numeric((salary.pred.nn$net.result > 0.5)) %>% as.factor(),
                  as.factor(ifelse(salary.df.valid.std$salary == "under_50K", 1, 0))) )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0  594  388
##          1  432 2842
##                                           
##                Accuracy : 0.8073          
##                  95% CI : (0.7952, 0.8191)
##     No Information Rate : 0.7589          
##     P-Value [Acc > NIR] : 2.205e-14       
##                                           
##                   Kappa : 0.4656          
##                                           
##  Mcnemar's Test P-Value : 0.1332          
##                                           
##             Sensitivity : 0.5789          
##             Specificity : 0.8799          
##          Pos Pred Value : 0.6049          
##          Neg Pred Value : 0.8681          
##              Prevalence : 0.2411          
##          Detection Rate : 0.1396          
##    Detection Prevalence : 0.2307          
##       Balanced Accuracy : 0.7294          
##                                           
##        'Positive' Class : 0               
## 

As usual, the final step is to draw the ROC curve. The results are not satisfying

ROC.train.nn <- 
  pROC::roc(as.numeric(salary.df.train.std$salary == "under_50K"), 
            compute(salary.nn, salary.train.mtx[, -1])$net.result[, 1])
ROC.valid.nn  <- 
  pROC::roc(as.numeric(salary.df.valid.std$salary == "under_50K"), 
            compute(salary.nn, salary.valid.mtx[, -1])$net.result[, 1])

list(
  ROC.train.nn = ROC.train.nn,
  ROC.valid.nn  = ROC.valid.nn
) %>%
  pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) + 
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), 
               color = "grey", 
               linetype = "dashed") +
  labs(subtitle = paste0("Gini TRAIN: ",
                         "nn = ", 
                         round(100*(2 * auc(ROC.train.nn) - 1), 1), "%, ",
                         "Gini TEST: ",
                         "nn = ", 
                         round(100*(2 * auc(ROC.valid.nn) - 1), 1), "%, "
                         )) +
  theme_bw() + coord_fixed() +
  scale_color_brewer(palette = "Paired")

Comparison and Summary

The XGBoost model yielded the highest results on the validation dataset, with the Random Forest model trailing slightly behind. However, as mentioned earlier, the Random Forest model exhibited significantly more overfitting. Therefore, we have concluded that XGBoost is the best model for production.

ROC.valid.xgb = ROC.valid

list(
  ROC.class.valid = ROC.class.valid,
  ROC.logit.valid = ROC.logit.valid,
  ROC.valid.rf_2 = ROC.valid.rf_2,
  ROC.valid.xgb = ROC.valid.xgb,
  ROC.ensemb.valid = ROC.ensemb.valid,
  ROC.valid.nn = ROC.valid.nn
) %>%
  pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) + 
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), 
               color = "grey", 
               linetype = "dashed") +
  labs(subtitle = paste0("Gini VALID: ",
                         "class = ", 
                         round(100*(2 * auc(ROC.class.valid) - 1), 1), "%, ",
                         "logit = ", 
                         round(100*(2 * auc(ROC.logit.valid) - 1), 1), "%, ",
                         "rf = ", 
                         round(100*(2 * auc(ROC.valid.rf_2) - 1), 1), "%, ",
                         "xgb = ", 
                         round(100*(2 * auc(ROC.valid.xgb) - 1), 1), "%, ",
                         "ensemb = ", 
                         round(100*(2 * auc(ROC.ensemb.valid) - 1), 1), "%, ",
                         "nn = ", 
                         round(100*(2 * auc(ROC.valid.nn) - 1), 1), "%, "
                         )) +
  theme_bw() + coord_fixed() +
  scale_color_brewer(palette = "Paired")

Final Results

Now we are going to verify results on our final test dataset. There is needed preparation of data.

Preparing the test data

xgb.test.label <- salary.df.test %>% 
  select(salary) %>%
  mutate(salary == "under_50K") %>%
  pull() %>%
  as.integer()
tmp.test.data <- 
  dummy.data.frame(salary.df.test %>% 
                     select( 
                        age,capital,education_num,hours_per_week,
                       marital_status,region,sex,occupation,
                       race,relationship,workclass, feat01,feat02,feat03,
                       feat04,feat05,feat06,feat07,feat08,feat09,feat10
                     ) %>%
                     mutate(education_num = as.integer(education_num)) %>% 
                     as.data.frame(),
                   names = NULL, omit.constants=TRUE,drop = FALSE,
                   dummy.classes = getOption("dummy.classes"),
                   all = TRUE)
tmp.test.data <- tmp.test.data[,salary.xgb$feature_names]
xgb.test.data   <- as.matrix(tmp.test.data) 
rm(tmp.test.data)
xgb.test <- xgb.DMatrix(data = xgb.test.data, 
                         label = xgb.test.label)
pred.xgb.test <- predict(salary.xgb, xgb.test, reshape = TRUE)
ROC.test <- pROC::roc(xgb.test.label, pred.xgb.test) 

ROC Curve

Upon reviewing the results from the Random Forest model, we suspected the presence of overfitting due to a significant discrepancy between the training and test datasets. However, in the case of XGBoost, this does not seem to be an issue. The discrepancy is only slight, and the results from the validation and test datasets are very close.

list(
  ROC.train = ROC.train,
  ROC.valid  = ROC.valid,
  ROC.test = ROC.test
) %>%
  pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) + 
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), 
               color = "grey", 
               linetype = "dashed") +
  labs(subtitle = paste0("Gini: ",
                         "TRAIN = ", 
                         round(100*(2 * auc(ROC.train) - 1), 1), "%, ",
                         "VALID = ", 
                         round(100*(2 * auc(ROC.valid) - 1), 1), "%",
                         "TEST = ", 
                         round(100*(2 * auc(ROC.test) - 1), 1), "%"),
       caption = "source: own calculations") +
  theme_bw() + coord_fixed() 

Alternative Approach

Nevertheless, remembering that correlation does not imply causation, we are excluding variables that are not interpretable and may cause overfitting in the long run. Therefore, variables ‘feat02,’ ‘feat04,’ and ‘feat06’ have been dropped. The model function is as follows:

model2.formula <- salary ~ age + education_ord + capital + hours_per_week + marital_status + occupation + race + relationship + sex + workclass + region 
if(0){
ROC.valid.xgb = ROC.valid

valid_ROC_without_feat <- list(
  ROC.class.valid = ROC.class.valid,
  ROC.logit.valid = ROC.logit.valid,
  ROC.valid.rf_2 = ROC.valid.rf_2,
  ROC.valid.xgb = ROC.valid.xgb,
  ROC.ensemb.valid = ROC.ensemb.valid,
  ROC.valid.nn = ROC.valid.nn
) %>%
  pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) + 
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), 
               color = "grey", 
               linetype = "dashed") +
  labs(subtitle = paste0("Gini VALID: ",
                         "class = ", 
                         round(100*(2 * auc(ROC.class.valid) - 1), 1), "%, ",
                         "logit = ", 
                         round(100*(2 * auc(ROC.logit.valid) - 1), 1), "%, ",
                         "rf = ", 
                         round(100*(2 * auc(ROC.valid.rf_2) - 1), 1), "%, ",
                         "xgb = ", 
                         round(100*(2 * auc(ROC.valid.xgb) - 1), 1), "%, ",
                         "ensemb = ", 
                         round(100*(2 * auc(ROC.ensemb.valid) - 1), 1), "%, ",
                         "nn = ", 
                         round(100*(2 * auc(ROC.valid.nn) - 1), 1), "%, "
                         )) +
  theme_bw() + coord_fixed() +
  scale_color_brewer(palette = "Paired")


saveRDS(valid_ROC_without_feat, file = here("output", "valid_ROC_without_feat.rds"))
}

The results for all models are worse as it seen on below ROC curve plot.

valid_ROC_without_feat <- readRDS(here("output", "valid_ROC_without_feat.rds"))

valid_ROC_without_feat

if(0){

final_ROC_without_feat <- list(
  ROC.train = ROC.train,
  ROC.valid  = ROC.valid,
  ROC.test = ROC.test
) %>%
  pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) + 
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), 
               color = "grey", 
               linetype = "dashed") +
  labs(subtitle = paste0("Gini: ",
                         "TRAIN = ", 
                         round(100*(2 * auc(ROC.train) - 1), 1), "%, ",
                         "VALID = ", 
                         round(100*(2 * auc(ROC.valid) - 1), 1), "%",
                         "TEST = ", 
                         round(100*(2 * auc(ROC.test) - 1), 1), "%"),
       caption = "source: own calculations") +
  theme_bw() + coord_fixed() 


saveRDS(final_ROC_without_feat, file = here("output", "final_ROC_without_feat.rds"))

}

Indeed, the results of the XGBoost model are slightly more consistent; however, the difference is not significant. Based solely on the ROC curve and Gini metric, it is not possible to discern the presence of artificial variables in the previous model which may cause overfitting noticed in random forests model.

final_ROC_without_feat <- readRDS(here("output", "final_ROC_without_feat.rds"))

final_ROC_without_feat