In this report, we are going to build a classification model that predicts whether an individualâs salary is âhigher than 50Kâ or âlower than or equal to 50K.â By analyzing key attributes such as education, occupation, and native country, our goal is to uncover patterns that influence income levels.
age: continuous.workclass: Private, Self-emp-not-inc, Self-emp-inc,
Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.fnlwgt: continuous; the number of people that Census
believes the entry represents.education: Bachelors, Some-college, 11th, HS-grad,
Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters,
1st-4th, 10th, Doctorate, 5th-6th, Preschool.education-num: continuous.marital-status: marital status of an individual:
Married-civ-spouse -> a civilian spouse, Divorced, Never-married,
Separated, Widowed, Married-spouse-absent, Married-AF-spouse - a spouse
in the Armed Forces.occupation: Tech-support, Craft-repair, Other-service,
Sales, Exec-managerial, Prof-specialty, Handlers-cleaners,
Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving,
Priv-house-serv, Protective-serv, Armed-Forces.relationship: represents what this individual is
relative to other: Wife, Own-child, Husband, Not-in-family,
Other-relative, Unmarried.race: White, Asian-Pac-Islander, Amer-Indian-Eskimo,
Other, Black.sex: Female, Male.capital-gain: continuous.capital-loss: continuous.hours-per-week: continuous.feat01-10: continuous.native-country: United-States, Cambodia, England,
Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan,
Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland,
Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic,
Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua,
Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru,
Hong, Holand-Netherlands.We will consider predicting probability of earning more than 50K
(salary - <=50K or >50K) given a set
of characteristics.
Let us examine the structure of the data:
## Rows: 42,561
## Columns: 26
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16âŠ
## $ age <dbl> 25, 41, 51, 57, 31, 74, 42, 35, 28, 46, 21, 32, 26, 4âŠ
## $ `capital-gain` <dbl> 5178, 3103, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,âŠ
## $ `capital-loss` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,âŠ
## $ education <chr> "11th", "Bachelors", "Some-college", "HS-grad", "BachâŠ
## $ `education-num` <dbl> 7, 13, 10, 9, 13, 14, 9, 9, 9, 4, 10, 11, 7, 10, 13, âŠ
## $ feat01 <dbl> 0.8448420, 1.1844558, 1.3336356, 1.1780716, 1.3148465âŠ
## $ feat02 <dbl> 0.6298227, 0.4318522, 0.3418702, 0.6165489, 0.6925617âŠ
## $ feat03 <dbl> 1.1146365, 0.6941100, 0.7709931, 1.2863809, 1.2001315âŠ
## $ feat04 <dbl> 0.4889465, 0.2861112, 0.6193634, 0.6102719, 0.6396099âŠ
## $ feat05 <dbl> 0.9520031, 0.7136972, 1.4626074, 1.5411345, 1.4679602âŠ
## $ feat06 <dbl> 0.5787523, 0.6261846, 0.2777089, 0.3670095, 0.7277344âŠ
## $ feat07 <dbl> 1.0472219, 1.4231821, 1.0282187, 0.5600741, 0.9760788âŠ
## $ feat08 <dbl> 1.3348179, 1.4226539, 0.7770539, 0.9994005, 0.8707198âŠ
## $ feat09 <dbl> 0.8129876, 0.9177271, 0.9499549, 1.0217224, 0.5442584âŠ
## $ feat10 <dbl> 0.8436999, 0.8794644, 1.1056284, 1.3840671, 1.0400557âŠ
## $ fnlwgt <dbl> 120238, 167106, 205884, 184553, 137537, 84197, 76280,âŠ
## $ `hours-per-week` <dbl> 40, 35, 40, 50, 40, 10, 40, 40, 50, 75, 20, 40, 65, 4âŠ
## $ `marital-status` <chr> "Married-civ-spouse", "Married-civ-spouse", "Married-âŠ
## $ `native-country` <chr> "Poland", "Philippines", "United-States", "United-StaâŠ
## $ occupation <chr> "Transport-moving", "Adm-clerical", "Tech-support", "âŠ
## $ race <chr> "White", "Asian-Pac-Islander", "White", "White", "WhiâŠ
## $ relationship <chr> "Husband", "Husband", "Wife", "Husband", "Unmarried",âŠ
## $ salary <chr> ">50K", ">50K", ">50K", ">50K", "<=50K", "<=50K", "<=âŠ
## $ sex <chr> "Male", "Male", "Female", "Male", "Female", "Female",âŠ
## $ workclass <chr> "Private", "Private", "Private", "Self-emp-not-inc", âŠ
kable(table(salary_df$salary), col.names = c("Salary", "Frequency"), format = "html") %>%
kable_styling(full_width = FALSE)| Salary | Frequency |
|---|---|
| <=50K | 32300 |
| >50K | 10261 |
kable(table(salary_df$salary)/length(salary_df$salary), format = "html",col.names = c("Salary", "Frequency")) %>%
kable_styling(full_width = FALSE)| Salary | Frequency |
|---|---|
| <=50K | 0.7589107 |
| >50K | 0.2410893 |
We divide data set into the training and testing sample.
set.seed(1618)
training_obs <- createDataPartition(salary_df$salary,
p = 0.8,
list = FALSE)
salary.df.test <- salary_df[-training_obs,]
salary.df.train <- salary_df[training_obs,]There is around 7% of missing values, it is rather a significant number, therefore we will not drop whole rows.
## id age capital-gain capital-loss education
## 0 0 0 0 0
## education-num feat01 feat02 feat03 feat04
## 0 0 0 0 0
## feat05 feat06 feat07 feat08 feat09
## 0 0 0 0 0
## feat10 fnlwgt hours-per-week marital-status native-country
## 0 0 0 0 611
## occupation race relationship salary sex
## 1951 0 0 0 0
## workclass
## 1943
## [1] 2529
## [1] 0.07427531
For missing values at native_country we created
additional category Unknown.
salary.df.train <- salary.df.train %>%
mutate(`native-country` = ifelse(is.na(`native-country`), 'Unknown', `native-country`))For missing occupation with workclass
equals Never-worked we created category Unemployed.
The rest of missing values were categorized as Unknown.
salary.df.train %>%
filter((is.na(workclass) & !is.na(occupation)) | (!is.na(workclass) & is.na(occupation)))## # A tibble: 8 Ă 26
## id age `capital-gain` `capital-loss` education `education-num` feat01
## <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 15885 20 0 0 Some-college 10 0.531
## 2 25363 18 0 0 10th 6 1.78
## 3 32685 18 0 0 11th 7 1.23
## 4 34067 18 0 0 11th 7 0.787
## 5 35289 18 0 0 Some-college 10 0.992
## 6 38690 30 0 0 HS-grad 9 0.965
## 7 39350 23 0 0 7th-8th 4 1.03
## 8 39564 17 0 0 10th 6 0.849
## # âč 19 more variables: feat02 <dbl>, feat03 <dbl>, feat04 <dbl>, feat05 <dbl>,
## # feat06 <dbl>, feat07 <dbl>, feat08 <dbl>, feat09 <dbl>, feat10 <dbl>,
## # fnlwgt <dbl>, `hours-per-week` <dbl>, `marital-status` <chr>,
## # `native-country` <chr>, occupation <chr>, race <chr>, relationship <chr>,
## # salary <chr>, sex <chr>, workclass <chr>
salary.df.train <- salary.df.train %>%
mutate(occupation = ifelse(workclass=='Never-worked', 'Unemployed', occupation),
occupation = ifelse(is.na(occupation), 'Unknown', occupation),
workclass = ifelse(is.na(workclass), 'Unknown', workclass))There are 25 independent variables. However, after careful investigation, we decided to perform some cleaning:
Variable education is removed since it is a textual
version of variable education-num.
The variable fnlwgt represents the number of people
that Census believes an observation represents. While it can be treated
as weights, it is not used as a dependent variable.
capital-gain and capital-loss are
combined into one variable: capital.
Categories of marital-status were grouped in 3:
Married, Previously Married, Single to avoid
sparsity.
The variable native_country contains 41 unique
values, which may influence the modelâs performance due to overfitting,
complexity, and sparsity in some categories. Therefore, we grouped
countries into a few regions: United-States, Central America, South
America, North America, Asia, Europe, Caribbean and Other.
salary.df.train <- salary.df.train %>%
mutate(capital = `capital-gain`-`capital-loss`,
marital_status_grouped = case_when(
`marital-status` %in% c('Married-AF-spouse', 'Married-civ-spouse',
'Married-spouse-absent') ~ 'Married',
`marital-status` == 'Never-married' ~ 'Single',
`marital-status` %in% c('Divorced', 'Separated', 'Widowed') ~ 'Previously Married',
TRUE ~ as.character(`marital-status`)),
region = case_when(
`native-country` %in% c("United-States", "Outlying-US(Guam-USVI-etc)") ~ "United-States",
`native-country` %in% c("El-Salvador", "Guatemala", "Honduras",
"Nicaragua") ~ "Central America",
`native-country` %in% c("Ecuador", "Peru", "Columbia") ~ "South America",
`native-country` %in% c("Canada","Mexico") ~ "North America",
`native-country` %in% c("Philippines", "India", "China", "Vietnam", "Japan", "Taiwan",
"Hong", "Thailand", "Cambodia", "Laos", "Iran") ~ "Asia",
`native-country` %in% c("Germany", "Italy", "Portugal", "Greece", "Poland", "France",
"Ireland", "Hungary", "Scotland", "Yugoslavia", "England") ~ "Europe",
`native-country` %in% c("Puerto-Rico", "Dominican-Republic", "Jamaica", "Cuba",
"Haiti", "Trinadad&Tobago") ~ "Caribbean",
TRUE ~ "Other"
)
) %>%
rename(education_num = `education-num`,
hours_per_week = `hours-per-week`,
marital_status = marital_status_grouped,
native_country = `native-country`) %>%
mutate(salary = factor(salary),
salary = recode_factor(salary,
"<=50K" = "under_50K",
">50K" = "over_50K"),
sex = factor(sex),
occupation = factor(occupation),
marital_status = factor(marital_status),
native_country = factor(native_country),
race = factor(race),
relationship = factor(relationship),
workclass = factor(workclass),
region = factor(region),
education_num = as.integer(education_num),
education_ord = factor(education_num, levels=c(1:16), ordered=TRUE),
age = as.integer(age),
hours_per_week = as.integer(hours_per_week)) %>%
dplyr::select(age, capital, education_num, hours_per_week, marital_status, native_country,region,education_ord,
salary, sex, fnlwgt, occupation, race, relationship, workclass,
feat01,feat02,feat03,feat04,feat05,feat06,feat07,feat08,feat09,feat10)Below we presented distribution of variables. There are 4 numeric
variables and 8 categorical, including salary which we will
predict.
plot1 <- ggplot(salary.df.train, aes(x = age)) +
geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
labs(y = "Frequency") +
theme_minimal() +
theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))
plot2 <- ggplot(salary.df.train, aes(x = capital)) +
geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
labs(y = "Frequency") +
theme_minimal() +
theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))
plot3 <- ggplot(salary.df.train, aes(x = hours_per_week)) +
geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
labs(y = "Frequency") +
theme_minimal() +
theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))
plot14 <- ggplot(salary.df.train, aes(x = education_num)) +
geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
labs(y = "Frequency") +
theme_minimal() +
theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))
plot4 <- ggplot(salary.df.train, aes(x = feat01)) +
geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
labs(y = "Frequency") +
theme_minimal() +
theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))
plot5 <- ggplot(salary.df.train, aes(x = feat02)) +
geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
labs(y = "Frequency") +
theme_minimal() +
theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))
plot6 <- ggplot(salary.df.train, aes(x = feat03)) +
geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
labs(y = "Frequency") +
theme_minimal() +
theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))
plot7 <- ggplot(salary.df.train, aes(x = feat04)) +
geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
labs(y = "Frequency") +
theme_minimal() +
theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))
plot8 <- ggplot(salary.df.train, aes(x = feat05)) +
geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
labs(y = "Frequency") +
theme_minimal() +
theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))
plot9 <- ggplot(salary.df.train, aes(x = feat06)) +
geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
labs(y = "Frequency") +
theme_minimal() +
theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))
plot10 <- ggplot(salary.df.train, aes(x = feat07)) +
geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
labs(y = "Frequency") +
theme_minimal() +
theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))
plot11 <- ggplot(salary.df.train, aes(x = feat08)) +
geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
labs(y = "Frequency") +
theme_minimal() +
theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))
plot12 <- ggplot(salary.df.train, aes(x = feat09)) +
geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
labs(y = "Frequency") +
theme_minimal() +
theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))
plot13 <- ggplot(salary.df.train, aes(x = feat10)) +
geom_histogram(fill = "skyblue", color = "black", alpha = 0.7) +
labs(y = "Frequency") +
theme_minimal() +
theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12))
grid.arrange(plot1, plot2, plot3, plot14, plot4, plot5, plot6, plot7, plot8,
plot9, plot10, plot11, plot12, plot13,
ncol = 3,
top = 'Distribution of numeric variables')plot1 <- ggplot(salary.df.train, aes(x = workclass)) +
geom_bar(fill = "pink", color = "black", alpha = 0.7) +
geom_text(
stat = "count",
aes(label = after_stat(count), vjust = -0.15),
size = 2,
color = "black"
) +
labs(y = "Frequency") +
theme_minimal() +
theme(
axis.text = element_text(size = 7, angle = 45, hjust = 1)
)
plot2 <- ggplot(salary.df.train, aes(x = marital_status)) +
geom_bar(fill = "pink", color = "black", alpha = 0.7) +
geom_text(
stat = "count",
aes(label = after_stat(count), vjust = 1.2),
size = 3,
color = "black"
) +
labs(y = "Frequency") +
theme_minimal() +
theme(
axis.text = element_text(size = 7, angle = 45, hjust = 1)
)
plot3 <- ggplot(salary.df.train, aes(x = region)) +
geom_bar(fill = "pink", color = "black", alpha = 0.7) +
geom_text(
stat = "count",
aes(label = after_stat(count), vjust = -0.15),
size = 2,
color = "black"
) +
labs(y = "Frequency") +
theme_minimal() +
theme(
axis.text = element_text(size = 7, angle = 45, hjust = 1)
)
plot4 <- ggplot(salary.df.train, aes(x = salary)) +
geom_bar(fill = "pink", color = "black", alpha = 0.7) +
geom_text(
stat = "count",
aes(label = after_stat(count), vjust = 1.2),
size = 3,
color = "black"
) +
labs(y = "Frequency") +
theme_minimal() +
theme(
axis.text = element_text(size = 7, angle = 45, hjust = 1)
)
plot5 <- ggplot(salary.df.train, aes(x = sex)) +
geom_bar(fill = "pink", color = "black", alpha = 0.7) +
geom_text(
stat = "count",
aes(label = after_stat(count), vjust = 1.2),
size = 3,
color = "black"
) +
labs(y = "Frequency") +
theme_minimal() +
theme(
axis.text = element_text(size = 7, angle = 45, hjust = 1)
)
plot6 <- ggplot(salary.df.train, aes(x = occupation)) +
geom_bar(fill = "pink", color = "black", alpha = 0.7) +
geom_text(
stat = "count",
aes(label = after_stat(count), vjust = -0.15),
size = 2,
color = "black"
) +
labs(y = "Frequency") +
theme_minimal() +
theme(
axis.text = element_text(size = 7, angle = 45, hjust = 1)
)
plot7 <- ggplot(salary.df.train, aes(x = race)) +
geom_bar(fill = "pink", color = "black", alpha = 0.7) +
geom_text(
stat = "count",
aes(label = after_stat(count), vjust = -0.2),
size = 2.5,
color = "black"
) +
labs(y = "Frequency") +
theme_minimal() +
theme(
axis.text = element_text(size = 7, angle = 45, hjust = 1)
)
plot8 <- ggplot(salary.df.train, aes(x = relationship)) +
geom_bar(fill = "pink", color = "black", alpha = 0.7) +
geom_text(
stat = "count",
aes(label = after_stat(count), vjust = 1),
size = 3,
color = "black"
) +
labs(y = "Frequency") +
theme_minimal() +
theme(
axis.text = element_text(size = 7, angle = 45, hjust = 1)
)
grid.arrange(plot1, plot2, plot3, plot4,
plot5, plot6, plot7, plot8,
ncol = 2,
top = 'Distribution of categorical variables')There is no strong correlation between numeric variables. The
strongest positive correlation is observed between salary
and education which suggests than the higher the education
the higher probability than salary is greater than 50K. Variables such
as capital, age, hours_per_week,
sex, feat02 are also positively correlated
which is correct with the intuition. Variables feat04 and
feat06 are only variables which are positively correlated
with salary.
modified_data <- salary.df.train %>%
mutate(
sex_male = as.numeric(sex == "Male"),
salary_over_50 = as.numeric(salary == "over_50K")
)
testRes = cor.mtest(modified_data[, sapply(modified_data, is.numeric)], conf.level = 0.95, method="pearson")
corrplot(cor(modified_data[, sapply(modified_data, is.numeric)]),
p.mat = testRes$p,
method = "color",
type = "upper",
order = "hclust",
tl.col = "black",
tl.srt = 45,
sig.level = c(0.001, 0.01, 0.05), pch.cex = 0.9,
insig = 'label_sig', pch.col = 'grey20')We are splitting the test portion of the data into validation and test sets. Before proceeding with this split, itâs important to preprocess the data in the same way as the training data to ensure consistency in feature engineering.
salary.df.test <- salary.df.test %>%
mutate(`native-country` = ifelse(is.na(`native-country`), 'Unknown', `native-country`))
salary.df.test <- salary.df.test %>%
mutate(occupation = ifelse(workclass=='Never-worked', 'Unemployed', occupation),
occupation = ifelse(is.na(occupation), 'Unknown', occupation),
workclass = ifelse(is.na(workclass), 'Unknown', workclass))
salary.df.test <- salary.df.test %>%
mutate(capital = `capital-gain`-`capital-loss`,
marital_status_grouped = case_when(
`marital-status` %in% c('Married-AF-spouse', 'Married-civ-spouse',
'Married-spouse-absent') ~ 'Married',
`marital-status` == 'Never-married' ~ 'Single',
`marital-status` %in% c('Divorced', 'Separated', 'Widowed') ~ 'Previously Married',
TRUE ~ as.character(`marital-status`)),
region = case_when(
`native-country` %in% c("United-States", "Outlying-US(Guam-USVI-etc)") ~ "United-States",
`native-country` %in% c("El-Salvador", "Guatemala", "Honduras",
"Nicaragua") ~ "Central America",
`native-country` %in% c("Ecuador", "Peru", "Columbia") ~ "South America",
`native-country` %in% c("Canada","Mexico") ~ "North America",
`native-country` %in% c("Philippines", "India", "China", "Vietnam", "Japan", "Taiwan",
"Hong", "Thailand", "Cambodia", "Laos", "Iran") ~ "Asia",
`native-country` %in% c("Germany", "Italy", "Portugal", "Greece", "Poland", "France",
"Ireland", "Hungary", "Scotland", "Yugoslavia", "England") ~ "Europe",
`native-country` %in% c("Puerto-Rico", "Dominican-Republic", "Jamaica", "Cuba",
"Haiti", "Trinadad&Tobago") ~ "Caribbean",
TRUE ~ "Other"
)
) %>%
rename(education_num = `education-num`,
hours_per_week = `hours-per-week`,
marital_status = marital_status_grouped,
native_country = `native-country`) %>%
mutate(salary = factor(salary),
salary = recode_factor(salary,
"<=50K" = "under_50K",
">50K" = "over_50K"),
sex = factor(sex),
occupation = factor(occupation),
marital_status = factor(marital_status),
native_country = factor(native_country),
race = factor(race),
relationship = factor(relationship),
workclass = factor(workclass),
region = factor(region),
education_num = as.integer(education_num),
education_ord = factor(education_num, levels=c(1:16), ordered=TRUE),
age = as.integer(age),
hours_per_week = as.integer(hours_per_week)) %>%
dplyr::select(age, capital, education_num, hours_per_week, marital_status, native_country,region,education_ord,
salary, sex, fnlwgt, occupation, race, relationship, workclass,
feat01,feat02,feat03,feat04,feat05,feat06,feat07,feat08,feat09,feat10)A classification tree model has been employed to better understand the dataset and identify significant variables that contribute to the prediction of salary levels. The primary goal of a classification tree is to recursively partition the dataset into subsets based on predictor variables, creating a tree structure that can be interpreted to make predictions.
salary.class.tree <-
rpart(model1.formula,
data = salary.df.train,
method = "class",
minbucket = 1500,
cp = -1)
opt <- which.min(salary.class.tree$cptable[, "xerror"])
cp <- salary.class.tree$cptable[opt, "CP"]
salary.class.tree <-
prune(salary.class.tree, cp = cp)
fancyRpartPlot(salary.class.tree)Key findings:
par(mar = c(5.1, 6.1, 4.1, 2.1))
barplot(rev(salary.class.tree$variable.importance), # vactor
col = "blue", # colors
main = "imporatnce of variables in model tree4",
horiz = T, # horizontal type of plot
las = 1, # labels always horizontally
cex.names = 0.6)The classification tree provides insights into the significant
predictors of salary levels. In this analysis, variables such as
relationship, marital_status and
education_ord were identified as key factors influencing
salary predictions.
pred.class.train <- as.data.frame(predict(salary.class.tree, salary.df.train))
ROC.class.train <- roc(
as.numeric(salary.df.train$salary == "under_50K"),
pred.class.train[, 1])
pred.class.valid <- as.data.frame(predict(salary.class.tree,
salary.df.valid))
ROC.class.valid <- roc(as.numeric(salary.df.valid$salary == "under_50K"),
pred.class.valid[, 1])
list(
ROC.class.train = ROC.class.train,
ROC.class.valid = ROC.class.valid
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini TRAIN: ",
round(100*(2 * auc(ROC.class.train) - 1), 1), "%, ",
"GINI VALID: ",
round(100*(2 * auc(ROC.class.valid) - 1), 1), "% ")) +
theme_bw() + coord_fixed() +
# scale_color_brewer(palette = "Paired") +
scale_color_manual(values = RColorBrewer::brewer.pal(n = 4,
name = "Paired")[c(1, 3)])Random Forests is a powerful learning technique employed for both
classification and regression tasks. Our objective is to predict binary
variable salary and to achieve this, we are generating
three distinct models. To optimize performance, we utilize grid search
with 5 resampling iterations for parameters such as mtry and
min.node.size.
For method ârangerâ we decided to verify two different importance metrics: permutation and impurity and two different number of trees: 100 or 250. Additionally there are parameters in grid search: mtry from 5 to 46 and minimal node size: 100,250 or 500.
if(0){
ctrl <- trainControl(method = "cv",
number = 5,
classProbs = TRUE)
parameters_ranger <- expand.grid(
mtry = c(5:46), # 46 is number of all independent variables once encoded
splitrule = "gini",
min.node.size = c(100, 250, 500)
)
set.seed(1618)
rf_model_1 <- train(
model1.formula,
data = salary.df.train,
method = "ranger",
num.trees = 100,
importance = "permutation",
tuneGrid = parameters_ranger,
trControl = ctrl,
weights = salary.df.train$fnlwgt
)
saveRDS(rf_model_1, file = here("output", "rf_model_1.rds"))
set.seed(1618)
rf_model_2 <- train(
model1.formula,
data = salary.df.train,
method = "ranger",
num.trees = 250,
importance = "impurity",
tuneGrid = parameters_ranger,
trControl = ctrl,
weights = salary.df.train$fnlwgt
)
saveRDS(rf_model_2, file = here("output", "rf_model_2.rds"))
}
rf_model_1 <- readRDS(here("output", "rf_model_1.rds"))
rf_model_2 <- readRDS(here("output", "rf_model_2.rds"))if(0){
ctrl <- trainControl(method = "oob",
classProbs = TRUE)
parameters_ranger <- expand.grid(mtry = 5:46)
set.seed(1618)
rf_model_3 <- train(
model1.formula,
data = salary.df.train,
method = "rf",
ntree = 100,
nodesize = 100,
tuneGrid = parameters_ranger,
trControl = ctrl,
weights = salary.df.train$fnlwgt
)
saveRDS(rf_model_3, file = here("output", "rf_model_3.rds"))
}
rf_model_3 <- readRDS(here("output", "rf_model_3.rds"))In order to verify which models perform best results we are generating ROC curves.
##############################################
pred.train.rf_1 <- predict(rf_model_1,
salary.df.train,
type = "prob")[, "over_50K"]
ROC.train.rf_1 <- roc(as.numeric(
salary.df.train$salary == "under_50K"),
pred.train.rf_1)
pred.valid.rf_1 <- predict(rf_model_1,
salary.df.valid,
type = "prob")[, "over_50K"]
ROC.valid.rf_1 <- roc(as.numeric(
salary.df.valid$salary == "under_50K"), pred.valid.rf_1)
##############################################
pred.train.rf_2 <- predict(rf_model_2,
salary.df.train,
type = "prob")[, "under_50K"]
ROC.train.rf_2 <- roc(as.numeric(
salary.df.train$salary == "under_50K"),
pred.train.rf_2)
pred.valid.rf_2 <- predict(rf_model_2,
salary.df.valid,
type = "prob")[, "under_50K"]
ROC.valid.rf_2 <- roc(as.numeric(
salary.df.valid$salary == "under_50K"), pred.valid.rf_2)
##############################################
pred.train.rf_3 <- predict(rf_model_3,
salary.df.train,
type = "prob")[, "under_50K"]
ROC.train.rf_3 <- roc(as.numeric(
salary.df.train$salary == "under_50K"),
pred.train.rf_3)
pred.valid.rf_3 <- predict(rf_model_3,
salary.df.valid,
type = "prob")[, "under_50K"]
ROC.valid.rf_3 <- roc(as.numeric(
salary.df.valid$salary == "under_50K"), pred.valid.rf_3)list(
ROC.train.rf_1 = ROC.train.rf_1,
ROC.valid.rf_1 = ROC.valid.rf_1,
ROC.train.rf_2 = ROC.train.rf_2,
ROC.valid.rf_2 = ROC.valid.rf_2,
ROC.train.rf_3 = ROC.train.rf_3,
ROC.valid.rf_3 = ROC.valid.rf_3
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(title = paste0("Gini TEST: ",
"rf1 = ",
round(100 * (2 * auc(ROC.valid.rf_1) - 1), 1), "%, ",
"rf2 = ",
round(100 * (2 * auc(ROC.valid.rf_2) - 1), 1), "%, ",
"rf3 = ",
round(100 * (2 * auc(ROC.valid.rf_3) - 1), 1), "%, "),
subtitle = paste0("Gini TRAIN: ",
"rf1 = ",
round(100 * (2 * auc(ROC.train.rf_1) - 1), 1), "%, ",
"rf2 = ",
round(100 * (2 * auc(ROC.train.rf_2) - 1), 1), "%, ",
"rf3 = ",
round(100 * (2 * auc(ROC.train.rf_3) - 1), 1), "%, ")) +
theme_bw() + coord_fixed() +
scale_color_brewer(palette = "Paired")All models have performed well, with accuracies above 90% for train data set and above 80% for test set. This discrepancy could be an indication of overfitting. Model 1 and Model 2, with additional tuning of min.node.size, show slightly higher accuracy and kappa values. The accuracy improvement between models is relatively small.
The random forest models seem effective in predicting the salary class, with Model 2 having a slight edge in terms of AUC.
Below is a plot presenting the 15 most important predictors. In comparison to the previous Classification Trees Model, the selection is entirely different. The key predictors are found to be capital, education, marital status, age, feat02, feat04, feat06 and hours_per_week.
source(here("funs", "getVarImpPlot4Ranger.R"))
getVarImpPlot4Ranger(salary.rf$finalModel,
sort = TRUE,
main = "Importance of predictors",
n.var = 15)Now we apply the logistic regression model.
fiveStats <- function(...) c(twoClassSummary(...),
defaultSummary(...))
ctrl <- trainControl(method = "cv",
number = 5,
savePredictions = "final",
summaryFunction = fiveStats,
classProbs = TRUE)
set.seed(1618)
salary.logit <-
train(model1.formula,
data = salary.df.train,
method = "glm",
family = "binomial",
preProcess = c("center", "scale"),
trControl = ctrl)
salary.logit## Generalized Linear Model
##
## 34049 samples
## 21 predictor
## 2 classes: 'under_50K', 'over_50K'
##
## Pre-processing: centered (70), scaled (70)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 27240, 27239, 27239, 27239, 27239
## Resampling results:
##
## ROC Sens Spec Accuracy Kappa
## 0.9230546 0.9352554 0.6634205 0.8697173 0.6270098
The model achieved consistent Gini coefficients of 85% and 84% on both the training and validation sets. In contrast to the random forests model, where a discrepancy of around 10 percentage points was observed between the training and validation Gini coefficients, the logistic regression model demonstrates better consistency.
pred.logit.train <- predict(salary.logit,
salary.df.train,
type = "prob")
ROC.logit.train <- roc(as.numeric(salary.df.train$salary == "over_50K"),
pred.logit.train[, 1])
pred.logit.valid <- predict(salary.logit,
salary.df.valid,
type = "prob")
ROC.logit.valid <- roc(as.numeric(salary.df.valid$salary == "over_50K"),
pred.logit.valid[, 1])
list(
ROC.logit.train = ROC.logit.train,
ROC.logit.valid = ROC.logit.valid
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini TRAIN: ",
round(100*(2 * auc(ROC.logit.train) - 1), 1), "%, ",
"GINI VALID: ",
round(100*(2 * auc(ROC.logit.valid) - 1), 1), "% ")) +
theme_bw() + coord_fixed() +
# scale_color_brewer(palette = "Paired") +
scale_color_manual(values = RColorBrewer::brewer.pal(n = 4,
name = "Paired")[c(1, 3)])Our next model is XGBoost which needs data preparation, we also implemented grid searcg procedure.
xgb.train.label <- salary.df.train %>%
select(salary) %>%
mutate(salary == "under_50K") %>%
pull() %>%
as.integer()tmp.train.data <-
dummy.data.frame(salary.df.train %>%
select(
age,capital,education_num,hours_per_week,
marital_status,region,sex,occupation,
race,relationship,workclass,feat01,feat02,feat03,
feat04,feat05,feat06,feat07,feat08,feat09,feat10
) %>%
mutate(education_num = as.integer(education_num)) %>%
as.data.frame(),
names = NULL, omit.constants=TRUE,
dummy.classes = getOption("dummy.classes"),
all = TRUE)
tmp.train.data %>% glimpse()## Rows: 34,049
## Columns: 63
## $ age <int> 25, 51, 57, 31, 74, 42, 35, 28, 46,âŠ
## $ capital <dbl> 5178, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,âŠ
## $ education_num <int> 7, 10, 9, 13, 14, 9, 9, 9, 4, 10, 7âŠ
## $ hours_per_week <int> 40, 40, 50, 40, 10, 40, 40, 50, 75,âŠ
## $ marital_statusMarried <int> 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0,âŠ
## $ `marital_statusPreviously Married` <int> 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1,âŠ
## $ marital_statusSingle <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,âŠ
## $ regionAsia <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,âŠ
## $ regionCaribbean <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,âŠ
## $ `regionCentral America` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,âŠ
## $ regionEurope <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,âŠ
## $ `regionNorth America` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,âŠ
## $ regionOther <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,âŠ
## $ `regionSouth America` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,âŠ
## $ `regionUnited-States` <int> 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1,âŠ
## $ sexFemale <int> 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1,âŠ
## $ sexMale <int> 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,âŠ
## $ `occupationAdm-clerical` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,âŠ
## $ `occupationArmed-Forces` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,âŠ
## $ `occupationCraft-repair` <int> 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,âŠ
## $ `occupationExec-managerial` <int> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0,âŠ
## $ `occupationFarming-fishing` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,âŠ
## $ `occupationHandlers-cleaners` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,âŠ
## $ `occupationMachine-op-inspct` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,âŠ
## $ `occupationOther-service` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,âŠ
## $ `occupationPriv-house-serv` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,âŠ
## $ `occupationProf-specialty` <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,âŠ
## $ `occupationProtective-serv` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,âŠ
## $ occupationSales <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,âŠ
## $ `occupationTech-support` <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,âŠ
## $ `occupationTransport-moving` <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,âŠ
## $ occupationUnemployed <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,âŠ
## $ occupationUnknown <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,âŠ
## $ `raceAmer-Indian-Eskimo` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,âŠ
## $ `raceAsian-Pac-Islander` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,âŠ
## $ raceBlack <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,âŠ
## $ raceOther <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,âŠ
## $ raceWhite <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0,âŠ
## $ relationshipHusband <int> 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,âŠ
## $ `relationshipNot-in-family` <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,âŠ
## $ `relationshipOther-relative` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,âŠ
## $ `relationshipOwn-child` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,âŠ
## $ relationshipUnmarried <int> 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,âŠ
## $ relationshipWife <int> 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,âŠ
## $ `workclassFederal-gov` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,âŠ
## $ `workclassLocal-gov` <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,âŠ
## $ `workclassNever-worked` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,âŠ
## $ workclassPrivate <int> 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0,âŠ
## $ `workclassSelf-emp-inc` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,âŠ
## $ `workclassSelf-emp-not-inc` <int> 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,âŠ
## $ `workclassState-gov` <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,âŠ
## $ workclassUnknown <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,âŠ
## $ `workclassWithout-pay` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,âŠ
## $ feat01 <dbl> 0.8448420, 1.3336356, 1.1780716, 1.âŠ
## $ feat02 <dbl> 0.6298227, 0.3418702, 0.6165489, 0.âŠ
## $ feat03 <dbl> 1.1146365, 0.7709931, 1.2863809, 1.âŠ
## $ feat04 <dbl> 0.4889465, 0.6193634, 0.6102719, 0.âŠ
## $ feat05 <dbl> 0.9520031, 1.4626074, 1.5411345, 1.âŠ
## $ feat06 <dbl> 0.5787523, 0.2777089, 0.3670095, 0.âŠ
## $ feat07 <dbl> 1.0472219, 1.0282187, 0.5600741, 0.âŠ
## $ feat08 <dbl> 1.3348179, 0.7770539, 0.9994005, 0.âŠ
## $ feat09 <dbl> 0.8129876, 0.9499549, 1.0217224, 0.âŠ
## $ feat10 <dbl> 0.8436999, 1.1056284, 1.3840671, 1.âŠ
if (0) {
paramsGrid <-
expand.grid(subsample = c(0.8, 1),
colsample_bynode = c(0.7, 1),
max_depth = c(1, 2, 3),
eta = c(0.3, 0.1, 0.05, 0.01),
gamma = c(0, 1, 3, 10),
early_stopping_rounds = c(5, 10, 30, 50),
min_child_weight = c(10, 20, 30)) %>%
mutate(ID = row_number())
set.seed(1618)
paramsGrid <-
paramsGrid[sample(1:nrow(paramsGrid), 100), ]
gridSearchResults <- list()
for (i in 1:nrow(paramsGrid)) {
# Extract Parameters to test
thisID <- paramsGrid[i, "ID"]
thisSubsample <- paramsGrid[i, "subsample"]
thisColsample_bynode <- paramsGrid[i, "colsample_bynode"]
thisMax_depth <- paramsGrid[i, "max_depth"]
thisEta <- paramsGrid[i, "eta"]
thisGamma <- paramsGrid[i, "gamma"]
thisEarly_stopping_rounds <- paramsGrid[i, "early_stopping_rounds"]
thisMin_child_weight <- paramsGrid[i, "min_child_weight"]
cat("processing combination", thisID, "...\n")
seed_split <- 1618
set.seed(seed_split)
xgboostModelCV <- xgb.cv(
data = xgb.train,
nfold = 10,
booster = "gbtree",
objective = "binary:logistic",
eta = thisEta,
max_depth = thisMax_depth,
min_child_weight = thisMin_child_weight,
gamma = thisGamma,
lambda = 0.5,
subsample = thisSubsample,
colsample = 1,
colsample_bytree = 1,
colsample_bylevel = 0.5,
colsample_bynode = thisColsample_bynode,
metrics = "auc",
eval.metric = "auc",
nrounds = 1000,
nthreads = 2,
early_stopping_rounds = thisEarly_stopping_rounds,
watchlist = list(val1 = xgb.train),
verbose = 1
)
xvalidationScores <- as.data.frame(xgboostModelCV$evaluation_log)
train_auc_mean <- tail(xvalidationScores$train_auc_mean, 1)
train_auc_std <- tail(xvalidationScores$train_auc_std, 1)
test_auc_mean <- tail(xvalidationScores$test_auc_mean, 1)
test_auc_std <- tail(xvalidationScores$test_auc_std, 1)
train.gini <- 2 * tail(xvalidationScores$train_auc_mean, 1) - 1
test.gini <- 2 * tail(xvalidationScores$test_auc_mean,1) - 1
gridSearchResults[[i]] <- c(train.gini,
test.gini,
train_auc_mean,
train_auc_std,
test_auc_mean,
test_auc_std,
thisSubsample,
thisColsample_bynode,
thisMax_depth,
thisEta,
thisGamma,
thisEarly_stopping_rounds,
thisMin_child_weight)
}
if (0) gridSearchResults %>% saveRDS(here("output", "gridSearchResults.rds"))
}gridSearchResults <- readRDS(here("output", "gridSearchResults.rds"))
output <-
as_tibble(do.call(rbind, gridSearchResults)) %>%
mutate(id = row_number()) %>%
select(id, everything())
names(output) <-
varnames <- c("id",
"train_Gini", "test_Gini",
"train_AUC_mean", "train_AUC_std",
"test_AUC_mean", "test_AUC_std",
"subsample", "colsample_bynode", "max_depth", "eta",
"gamma", "early_stopping_rounds", "min_child_weight"
)
output <- output %>% arrange(desc(test_Gini))
output## # A tibble: 100 Ă 14
## id train_Gini test_Gini train_AUC_mean train_AUC_std test_AUC_mean
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 9 0.919 0.894 0.959 0.000268 0.947
## 2 75 0.916 0.894 0.958 0.000381 0.947
## 3 69 0.914 0.893 0.957 0.000301 0.946
## 4 51 0.927 0.892 0.963 0.000406 0.946
## 5 46 0.909 0.892 0.955 0.000538 0.946
## 6 61 0.923 0.892 0.961 0.000483 0.946
## 7 81 0.914 0.892 0.957 0.000443 0.946
## 8 100 0.926 0.892 0.963 0.000463 0.946
## 9 68 0.918 0.891 0.959 0.000389 0.946
## 10 74 0.926 0.891 0.963 0.000443 0.946
## # âč 90 more rows
## # âč 8 more variables: test_AUC_std <dbl>, subsample <dbl>,
## # colsample_bynode <dbl>, max_depth <dbl>, eta <dbl>, gamma <dbl>,
## # early_stopping_rounds <dbl>, min_child_weight <dbl>
bestParams <-
output %>%
arrange(desc(test_Gini)) %>%
mutate(diff = train_Gini - test_Gini) %>%
filter(test_Gini > 0.891 & diff < 0.02) %>%
select(-starts_with("train"), -starts_with("test")) %>%
unlist()
bestParams## id subsample colsample_bynode
## 46.0000000 1.0000000 0.7000000
## max_depth eta gamma
## 2.0000000 0.3000000 3.0000000
## early_stopping_rounds min_child_weight diff
## 50.0000000 10.0000000 0.0169308
if(0){
seed_split <- 1618
set.seed(seed_split)
xgb.fit.cv <- xgb.cv(
nfold = 10,
data = xgb.train,
booster = "gbtree",
objective = "binary:logistic",
eta = bestParams["eta"],
max_depth = bestParams["max_depth"],
min_child_weight = bestParams["min_child_weight"],
gamma = bestParams["gamma"],
lambda = 0.5,
subsample = bestParams["subsample"],
colsample = 1,
colsample_bytree = 1,
colsample_bylevel = 0.5,
colsample_bynode = bestParams["colsample_bynode"],
metrics = "auc",
eval.metric = "auc",
nrounds = 1000,
nthreads = 2,
early_stopping_rounds = bestParams["early_stopping_rounds"],
watchlist=list(val1 = xgb.train),
verbose = 1
)
xgb.fit.cv %>% saveRDS(here("output", "xgb_fit_cv2.rds"))
}if(0){
xgb.fit.cv <- readRDS(here("output", "xgb_fit_cv2.rds"))
seed_split <- 1618
set.seed(seed_split)
salary.xgb <- xgb.train(
data = xgb.train,
booster = "gbtree",
objective = "binary:logistic",
eta = bestParams["eta"],
max_depth = bestParams["max_depth"],
min_child_weight = bestParams["min_child_weight"],
gamma = bestParams["gamma"],
lambda = 0.5,
subsample = bestParams["subsample"],
colsample = 1,
colsample_bytree = 1,
colsample_bylevel = 0.5,
colsample_bynode = bestParams["colsample_bynode"],
metrics = "auc",
eval.metric = "auc",
nrounds = xgb.fit.cv$best_iteration,
nthreads = 2,
early_stopping_rounds = bestParams["early_stopping_rounds"],
watchlist=list(val1 = xgb.train),
verbose = 0
)
salary.xgb %>% saveRDS(here("output", "salary.xgb.rds"))
}The plot displays the most important features in the model, ranked in
descending order based on their significance. The top features
contributing to the model include
capital,marital_statusMarried,
education_num, and relationshipHusband among
others. Itâs worth mentioning that XGBoost included feat04,
feat02 and feat06 which showed significant
correlation with salary. However, it did not include the
rest of the artificial features.
importance_matrix <- xgb.importance(model = salary.xgb)
top_variables <- head(rownames(importance_matrix), 15)
# Subset the importance matrix to include only the top variables
top_importance_matrix <- subset(importance_matrix, rownames(importance_matrix) %in% top_variables)
# Print the subset of the importance matrix
print(top_importance_matrix)## Feature Gain Cover Frequency
## 1: capital 0.213636201 0.223865235 0.185714286
## 2: marital_statusMarried 0.152995029 0.017085835 0.024285714
## 3: education_num 0.144249906 0.066383870 0.058571429
## 4: relationshipHusband 0.092414305 0.011988699 0.008571429
## 5: feat04 0.086323277 0.098015246 0.088571429
## 6: age 0.073061914 0.068730727 0.065714286
## 7: feat06 0.050447772 0.060225356 0.060000000
## 8: feat02 0.042385741 0.048788107 0.057142857
## 9: hours_per_week 0.035510798 0.052644621 0.055714286
## 10: marital_statusSingle 0.019346410 0.004486087 0.007142857
## 11: relationshipWife 0.016887695 0.027115997 0.024285714
## 12: occupationExec-managerial 0.013336668 0.015723442 0.014285714
## 13: occupationOther-service 0.007447751 0.011096873 0.007142857
## 14: sexFemale 0.006599651 0.008816839 0.011428571
## 15: occupationFarming-fishing 0.004264967 0.011678483 0.008571429
# Plot the importance of the top variables
xgb.plot.importance(importance_matrix = top_importance_matrix)xgb.valid.label <- salary.df.valid %>%
select(salary) %>%
mutate(salary == "under_50K") %>%
pull() %>%
as.integer()tmp.valid.data <-
dummy.data.frame(salary.df.valid %>%
select(
age,capital,education_num,hours_per_week,
marital_status,region,sex,occupation,
race,relationship,workclass, feat01,feat02,feat03,
feat04,feat05,feat06,feat07,feat08,feat09,feat10
) %>%
mutate(education_num = as.integer(education_num)) %>%
as.data.frame(),
names = NULL, omit.constants=TRUE,
dummy.classes = getOption("dummy.classes"),
all = TRUE)tmp.valid.data <- tmp.valid.data[, salary.xgb$feature_names]
xgb.valid.data <- as.matrix(tmp.valid.data)
rm(tmp.valid.data)
xgb.valid.data %>% glimpse()## num [1:4256, 1:63] 28 34 28 39 57 43 40 50 62 48 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:4256] "1" "2" "3" "4" ...
## ..$ : chr [1:63] "age" "capital" "education_num" "hours_per_week" ...
Now we are processing ROC curve. The results for train dataset are slightly worse than in case of random forests,94% to 91%. However the discrepancy is lower for XGBoost, results for test data set is 88,6%. XGBoost model seems to be more consistent.
list(
ROC.train = ROC.train,
ROC.valid = ROC.valid
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini: ",
"TRAIN = ",
round(100*(2 * auc(ROC.train) - 1), 1), "%, ",
"TEST = ",
round(100*(2 * auc(ROC.valid) - 1), 1), "%"),
caption = "source: own calculations") +
theme_bw() + coord_fixed() We are going to implement simple ensembling based on weighted voting. First step is combining results from all models into one data frame.
preds.train <- cbind(logit = pred.logit.train$under_50K,
class_tree = pred.class.train$under_50K,
xgb = pred.xgb.train,
rf = pred.train.rf_2)
preds.train %>% head()## logit class_tree xgb rf
## [1,] 0.6216153 0.7140378 0.2473250 0.1651958
## [2,] 0.3456506 0.7140378 0.3435149 0.4644596
## [3,] 0.4908892 0.7140378 0.3902369 0.4114981
## [4,] 0.9667551 0.9362605 0.9802357 0.8946082
## [5,] 0.7476965 0.9362605 0.9264451 0.8467233
## [6,] 0.6316543 0.7140378 0.6811452 0.6862005
Below there is presented correlation plot for all models results. As we know XGBoost and Random Forests gave the highest results and they are also very strong correlated. Logistic Regression results were consistent and gave around 86% however it is also highly correlated with XGBoost and Random Forests. Ensembling might not be the best solution here, nevertheless let us verify it.
## logit class_tree xgb rf
## logit 1.0000000 0.7695343 0.9184463 0.9236927
## class_tree 0.7695343 1.0000000 0.7383541 0.7690964
## xgb 0.9184463 0.7383541 1.0000000 0.9622331
## rf 0.9236927 0.7690964 0.9622331 1.0000000
preds.valid <- cbind(logit = pred.logit.valid$under_50K,
class_tree = pred.class.valid$under_50K,
xgb = pred.xgb.valid,
rf = pred.valid.rf_2)
preds.valid %>% head()## logit class_tree xgb rf
## [1,] 0.9831757 0.9362605 0.9848252 0.9823226
## [2,] 0.5214763 0.7140378 0.5196844 0.6072156
## [3,] 0.9931267 0.9362605 0.9892424 0.9972532
## [4,] 0.6049790 0.7140378 0.6916719 0.5637618
## [5,] 0.8554764 0.9362605 0.8422664 0.7331504
## [6,] 0.9810695 0.9362605 0.9819695 0.9707648
## logit class_tree xgb rf
## logit 1.0000000 0.7483839 0.9213339 0.9282638
## class_tree 0.7483839 1.0000000 0.7342402 0.7724098
## xgb 0.9213339 0.7342402 1.0000000 0.9574183
## rf 0.9282638 0.7724098 0.9574183 1.0000000
models.weights <- c(
logit = auc(ROC.logit.train),
class_tree = auc(ROC.class.train),
rf = auc(ROC.train.rf_2),
xgb = auc(ROC.train))
models.weights <- models.weights/sum(models.weights)
models.weights## logit class_tree rf xgb
## 0.2517718 0.2241443 0.2639837 0.2601002
## logit class_tree xgb rf
## [1,] 0.6260208 0.7539773 0.2490778 0.1744360
## [2,] 0.3099025 0.7428855 0.3079876 0.4832242
## [3,] 0.5183470 0.7190982 0.4120647 0.4144145
## [4,] 1.0058128 0.8394300 1.0198380 0.8020855
## [5,] 0.7529954 0.9886301 0.9330108 0.8940846
## [6,] 0.5663269 0.7428855 0.6106993 0.7139235
preds.train$weighted.voting <- ifelse(rowSums((preds.train * models.weights*4) > 0.5) > 2,
"under_50K", "over_50K")
preds.valid$weighted.voting <-
ifelse(rowSums((preds.valid * models.weights*4) > 0.5) > 2,
"under_50K", "over_50K")As expected the results are not satisfying. The ensembling method decrease results and in the same time does not decrease the discrepancy between train and test data sets.
ROC.ensemb.train <- roc(as.numeric(salary.df.train$salary == "under_50K"),
as.numeric(preds.train$weighted.voting == "under_50K"))
ROC.ensemb.valid <- roc(as.numeric(salary.df.valid$salary == "under_50K"),
as.numeric(preds.valid$weighted.voting == "under_50K"))
list(
ROC.ensemb.train = ROC.ensemb.train,
ROC.ensemb.valid = ROC.ensemb.valid
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini TRAIN: ",
round(100*(2 * auc(ROC.ensemb.train) - 1), 1), "%, ",
"GINI VALID: ",
round(100*(2 * auc(ROC.ensemb.valid) - 1), 1), "% ")) +
theme_bw() + coord_fixed() +
# scale_color_brewer(palette = "Paired") +
scale_color_manual(values = RColorBrewer::brewer.pal(n = 4,
name = "Paired")[c(1, 3)])Another chosen model to implement is neural network. This model needs some data preparation which is proceed below.
numerical_columns <- sapply(salary.df.train, is.numeric)
df_standardized <- as.data.frame(scale(salary.df.train[, numerical_columns]))
salary.df.train.std <- cbind(salary.df.train[, !numerical_columns, drop = FALSE], df_standardized)mean_values <- colMeans(salary.df.train[, numerical_columns])
sd_values <- apply(salary.df.train[, numerical_columns], 2, sd)
df_valid_standardized <- as.data.frame(scale(salary.df.valid[, numerical_columns], center = mean_values, scale = sd_values))
salary.df.valid.std <- cbind(salary.df.valid[, !numerical_columns, drop = FALSE], df_valid_standardized)salary.train.mtx <-
model.matrix(object = model1.formula,
data = salary.df.train.std)
dim(salary.train.mtx)## [1] 34049 71
colnames(salary.train.mtx) <- gsub(" ", "_", colnames(salary.train.mtx))
colnames(salary.train.mtx) <- gsub(",", "_", colnames(salary.train.mtx))
colnames(salary.train.mtx) <- gsub("/", "", colnames(salary.train.mtx))
colnames(salary.train.mtx) <- gsub("-", "_", colnames(salary.train.mtx))
colnames(salary.train.mtx) <- gsub("'", "", colnames(salary.train.mtx))
colnames(salary.train.mtx) <- gsub("\\+", "", colnames(salary.train.mtx))
colnames(salary.train.mtx) <- gsub("\\^", "", colnames(salary.train.mtx))col_list <- paste(c(colnames(salary.train.mtx[, -1])), collapse = "+")
col_list <- paste(c("salary_1 ~ ", col_list), collapse = "")
(model.formula2 <- formula(col_list))## salary_1 ~ age + education_ord.L + education_ord.Q + education_ord.C +
## education_ord4 + education_ord5 + education_ord6 + education_ord7 +
## education_ord8 + education_ord9 + education_ord10 + education_ord11 +
## education_ord12 + education_ord13 + education_ord14 + education_ord15 +
## capital + hours_per_week + marital_statusPreviously_Married +
## marital_statusSingle + occupationArmed_Forces + occupationCraft_repair +
## occupationExec_managerial + occupationFarming_fishing + occupationHandlers_cleaners +
## occupationMachine_op_inspct + occupationOther_service + occupationPriv_house_serv +
## occupationProf_specialty + occupationProtective_serv + occupationSales +
## occupationTech_support + occupationTransport_moving + occupationUnemployed +
## occupationUnknown + raceAsian_Pac_Islander + raceBlack +
## raceOther + raceWhite + relationshipNot_in_family + relationshipOther_relative +
## relationshipOwn_child + relationshipUnmarried + relationshipWife +
## sexMale + workclassLocal_gov + workclassNever_worked + workclassPrivate +
## workclassSelf_emp_inc + workclassSelf_emp_not_inc + workclassState_gov +
## workclassUnknown + workclassWithout_pay + regionCaribbean +
## regionCentral_America + regionEurope + regionNorth_America +
## regionOther + regionSouth_America + regionUnited_States +
## feat01 + feat02 + feat03 + feat04 + feat05 + feat06 + feat07 +
## feat08 + feat09 + feat10
Hence, let us decide to use 4 hidden layers: we will use 15 neurons in the 1st layer and 8, 8, and 4 neurons in the subsequent layers, respectively. The number of layers and neurons were chosen arbitrarily and compared with a few other combinations; however, the results were not close to those from Random Forests, XGBoost, nor Logistic Regression. Therefore, we did not spend more time searching for the best parameters. The results from the train and test datasets are consistent, though.
set.seed(1618)
salary.nn <-
data.frame(salary.train.mtx,
salary_1 = as.numeric(salary.df.train.std$salary == "under_50K")) %>%
sample_n(1000) %>%
neuralnet(model.formula2,
data = .,
hidden = c(15,8,8,4), # number of neurons in hidden layers
linear.output = FALSE, # F for classification
learningrate.limit = NULL,
learningrate.factor = list(minus = 0.7, plus = 1.3),
algorithm = "rprop+",
threshold = 0.01)salary.valid.mtx <-
model.matrix(object = model1.formula,
data = salary.df.valid.std)
colnames(salary.valid.mtx) <- gsub(" ", "_", colnames(salary.valid.mtx))
colnames(salary.valid.mtx) <- gsub(",", "_", colnames(salary.valid.mtx))
colnames(salary.valid.mtx) <- gsub("/", "", colnames(salary.valid.mtx))
colnames(salary.valid.mtx) <- gsub("-", "_", colnames(salary.valid.mtx))
colnames(salary.valid.mtx) <- gsub("'", "", colnames(salary.valid.mtx))
colnames(salary.valid.mtx) <- gsub("\\+", "", colnames(salary.valid.mtx))## [,1]
## 1 0.992748353
## 2 0.991538964
## 3 0.992749227
## 4 0.991950714
## 5 0.992741099
## 6 0.974188102
## 7 0.992748661
## 8 0.626344352
## 9 0.991985458
## 10 0.007918318
(confusionMatrix.nn <-
confusionMatrix(as.numeric((salary.pred.nn$net.result > 0.5)) %>% as.factor(),
as.factor(ifelse(salary.df.valid.std$salary == "under_50K", 1, 0))) )## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 594 388
## 1 432 2842
##
## Accuracy : 0.8073
## 95% CI : (0.7952, 0.8191)
## No Information Rate : 0.7589
## P-Value [Acc > NIR] : 2.205e-14
##
## Kappa : 0.4656
##
## Mcnemar's Test P-Value : 0.1332
##
## Sensitivity : 0.5789
## Specificity : 0.8799
## Pos Pred Value : 0.6049
## Neg Pred Value : 0.8681
## Prevalence : 0.2411
## Detection Rate : 0.1396
## Detection Prevalence : 0.2307
## Balanced Accuracy : 0.7294
##
## 'Positive' Class : 0
##
As usual, the final step is to draw the ROC curve. The results are not satisfying
ROC.train.nn <-
pROC::roc(as.numeric(salary.df.train.std$salary == "under_50K"),
compute(salary.nn, salary.train.mtx[, -1])$net.result[, 1])
ROC.valid.nn <-
pROC::roc(as.numeric(salary.df.valid.std$salary == "under_50K"),
compute(salary.nn, salary.valid.mtx[, -1])$net.result[, 1])
list(
ROC.train.nn = ROC.train.nn,
ROC.valid.nn = ROC.valid.nn
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini TRAIN: ",
"nn = ",
round(100*(2 * auc(ROC.train.nn) - 1), 1), "%, ",
"Gini TEST: ",
"nn = ",
round(100*(2 * auc(ROC.valid.nn) - 1), 1), "%, "
)) +
theme_bw() + coord_fixed() +
scale_color_brewer(palette = "Paired")The XGBoost model yielded the highest results on the validation dataset, with the Random Forest model trailing slightly behind. However, as mentioned earlier, the Random Forest model exhibited significantly more overfitting. Therefore, we have concluded that XGBoost is the best model for production.
ROC.valid.xgb = ROC.valid
list(
ROC.class.valid = ROC.class.valid,
ROC.logit.valid = ROC.logit.valid,
ROC.valid.rf_2 = ROC.valid.rf_2,
ROC.valid.xgb = ROC.valid.xgb,
ROC.ensemb.valid = ROC.ensemb.valid,
ROC.valid.nn = ROC.valid.nn
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini VALID: ",
"class = ",
round(100*(2 * auc(ROC.class.valid) - 1), 1), "%, ",
"logit = ",
round(100*(2 * auc(ROC.logit.valid) - 1), 1), "%, ",
"rf = ",
round(100*(2 * auc(ROC.valid.rf_2) - 1), 1), "%, ",
"xgb = ",
round(100*(2 * auc(ROC.valid.xgb) - 1), 1), "%, ",
"ensemb = ",
round(100*(2 * auc(ROC.ensemb.valid) - 1), 1), "%, ",
"nn = ",
round(100*(2 * auc(ROC.valid.nn) - 1), 1), "%, "
)) +
theme_bw() + coord_fixed() +
scale_color_brewer(palette = "Paired")Now we are going to verify results on our final test dataset. There is needed preparation of data.
xgb.test.label <- salary.df.test %>%
select(salary) %>%
mutate(salary == "under_50K") %>%
pull() %>%
as.integer()tmp.test.data <-
dummy.data.frame(salary.df.test %>%
select(
age,capital,education_num,hours_per_week,
marital_status,region,sex,occupation,
race,relationship,workclass, feat01,feat02,feat03,
feat04,feat05,feat06,feat07,feat08,feat09,feat10
) %>%
mutate(education_num = as.integer(education_num)) %>%
as.data.frame(),
names = NULL, omit.constants=TRUE,drop = FALSE,
dummy.classes = getOption("dummy.classes"),
all = TRUE)Upon reviewing the results from the Random Forest model, we suspected the presence of overfitting due to a significant discrepancy between the training and test datasets. However, in the case of XGBoost, this does not seem to be an issue. The discrepancy is only slight, and the results from the validation and test datasets are very close.
list(
ROC.train = ROC.train,
ROC.valid = ROC.valid,
ROC.test = ROC.test
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini: ",
"TRAIN = ",
round(100*(2 * auc(ROC.train) - 1), 1), "%, ",
"VALID = ",
round(100*(2 * auc(ROC.valid) - 1), 1), "%",
"TEST = ",
round(100*(2 * auc(ROC.test) - 1), 1), "%"),
caption = "source: own calculations") +
theme_bw() + coord_fixed() Nevertheless, remembering that correlation does not imply causation, we are excluding variables that are not interpretable and may cause overfitting in the long run. Therefore, variables âfeat02,â âfeat04,â and âfeat06â have been dropped. The model function is as follows:
model2.formula <- salary ~ age + education_ord + capital + hours_per_week + marital_status + occupation + race + relationship + sex + workclass + region if(0){
ROC.valid.xgb = ROC.valid
valid_ROC_without_feat <- list(
ROC.class.valid = ROC.class.valid,
ROC.logit.valid = ROC.logit.valid,
ROC.valid.rf_2 = ROC.valid.rf_2,
ROC.valid.xgb = ROC.valid.xgb,
ROC.ensemb.valid = ROC.ensemb.valid,
ROC.valid.nn = ROC.valid.nn
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini VALID: ",
"class = ",
round(100*(2 * auc(ROC.class.valid) - 1), 1), "%, ",
"logit = ",
round(100*(2 * auc(ROC.logit.valid) - 1), 1), "%, ",
"rf = ",
round(100*(2 * auc(ROC.valid.rf_2) - 1), 1), "%, ",
"xgb = ",
round(100*(2 * auc(ROC.valid.xgb) - 1), 1), "%, ",
"ensemb = ",
round(100*(2 * auc(ROC.ensemb.valid) - 1), 1), "%, ",
"nn = ",
round(100*(2 * auc(ROC.valid.nn) - 1), 1), "%, "
)) +
theme_bw() + coord_fixed() +
scale_color_brewer(palette = "Paired")
saveRDS(valid_ROC_without_feat, file = here("output", "valid_ROC_without_feat.rds"))
}The results for all models are worse as it seen on below ROC curve plot.
valid_ROC_without_feat <- readRDS(here("output", "valid_ROC_without_feat.rds"))
valid_ROC_without_featif(0){
final_ROC_without_feat <- list(
ROC.train = ROC.train,
ROC.valid = ROC.valid,
ROC.test = ROC.test
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini: ",
"TRAIN = ",
round(100*(2 * auc(ROC.train) - 1), 1), "%, ",
"VALID = ",
round(100*(2 * auc(ROC.valid) - 1), 1), "%",
"TEST = ",
round(100*(2 * auc(ROC.test) - 1), 1), "%"),
caption = "source: own calculations") +
theme_bw() + coord_fixed()
saveRDS(final_ROC_without_feat, file = here("output", "final_ROC_without_feat.rds"))
}Indeed, the results of the XGBoost model are slightly more consistent; however, the difference is not significant. Based solely on the ROC curve and Gini metric, it is not possible to discern the presence of artificial variables in the previous model which may cause overfitting noticed in random forests model.
final_ROC_without_feat <- readRDS(here("output", "final_ROC_without_feat.rds"))
final_ROC_without_feat