This project investigates the UCI Adult dataset, asking two questions: can we predict whether a person's annual income exceeds 50K from demographic and employment attributes, and can we predict a person's age from the remaining variables?
To answer these questions, we collect the data from UCI, carry out the necessary data preparation (importing, cleaning, and removing redundant variables), explore the data through EDA, and then build classification and regression models. We close with the conclusions that can be drawn from the results.
Since UCI splits the data into two files with the same structure (adult.data and adult.test), we will import each dataset and have a look at its first rows.
# Import adult.data from UCI
df_data <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", sep=",", header = F)
# Have a look at the first six rows of df_data
head(df_data)
## V1 V2 V3 V4 V5 V6
## 1 39 State-gov 77516 Bachelors 13 Never-married
## 2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse
## 3 38 Private 215646 HS-grad 9 Divorced
## 4 53 Private 234721 11th 7 Married-civ-spouse
## 5 28 Private 338409 Bachelors 13 Married-civ-spouse
## 6 37 Private 284582 Masters 14 Married-civ-spouse
## V7 V8 V9 V10 V11 V12 V13 V14
## 1 Adm-clerical Not-in-family White Male 2174 0 40 United-States
## 2 Exec-managerial Husband White Male 0 0 13 United-States
## 3 Handlers-cleaners Not-in-family White Male 0 0 40 United-States
## 4 Handlers-cleaners Husband Black Male 0 0 40 United-States
## 5 Prof-specialty Wife Black Female 0 0 40 Cuba
## 6 Exec-managerial Wife White Female 0 0 40 United-States
## V15
## 1 <=50K
## 2 <=50K
## 3 <=50K
## 4 <=50K
## 5 <=50K
## 6 <=50K
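As an aside, much of the cleanup done later could be handled at import time. A minimal sketch of an alternative we do not run here (the column names come from the adult.names file on UCI):

# Alternative import: supply column names and strip whitespace and "?" up front
df_data <- read.table(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
  sep = ",", header = FALSE, strip.white = TRUE, na.strings = "?",
  col.names = c("age", "workclass", "fnlwgt", "education", "education_num",
                "marital_status", "occupation", "relationship", "race", "sex",
                "capital_gain", "capital_loss", "hours_per_week",
                "native_country", "annual_income"))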
# Import adult.test from UCI
df_test <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test", header = F)
# Have a look at the first six rows of df_test
head(df_test)
## V1 V2 V3 V4 V5 V6
## 1 |1x3 Cross validator NA NA
## 2 25 Private 226802 11th 7 Never-married
## 3 38 Private 89814 HS-grad 9 Married-civ-spouse
## 4 28 Local-gov 336951 Assoc-acdm 12 Married-civ-spouse
## 5 44 Private 160323 Some-college 10 Married-civ-spouse
## 6 18 ? 103497 Some-college 10 Never-married
## V7 V8 V9 V10 V11 V12 V13 V14
## 1 NA NA NA
## 2 Machine-op-inspct Own-child Black Male 0 0 40 United-States
## 3 Farming-fishing Husband White Male 0 0 50 United-States
## 4 Protective-serv Husband White Male 0 0 40 United-States
## 5 Machine-op-inspct Husband Black Male 7688 0 40 United-States
## 6 ? Own-child White Female 0 0 30 United-States
## V15
## 1
## 2 <=50K.
## 3 <=50K.
## 4 >50K.
## 5 >50K.
## 6 <=50K.
As you can see, the first row of df_test is a stray header line ("|1x3 Cross validator") rather than data, so we will remove it before combining the two datasets.
# Remove the first row of df_test
df_test <- df_test[-1,]
# Integrate the datasets
df_adult <- rbind(df_data, df_test)
# Have a look at the first six rows
head(df_adult)
## V1 V2 V3 V4 V5 V6
## 1 39 State-gov 77516 Bachelors 13 Never-married
## 2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse
## 3 38 Private 215646 HS-grad 9 Divorced
## 4 53 Private 234721 11th 7 Married-civ-spouse
## 5 28 Private 338409 Bachelors 13 Married-civ-spouse
## 6 37 Private 284582 Masters 14 Married-civ-spouse
## V7 V8 V9 V10 V11 V12 V13 V14
## 1 Adm-clerical Not-in-family White Male 2174 0 40 United-States
## 2 Exec-managerial Husband White Male 0 0 13 United-States
## 3 Handlers-cleaners Not-in-family White Male 0 0 40 United-States
## 4 Handlers-cleaners Husband Black Male 0 0 40 United-States
## 5 Prof-specialty Wife Black Female 0 0 40 Cuba
## 6 Exec-managerial Wife White Female 0 0 40 United-States
## V15
## 1 <=50K
## 2 <=50K
## 3 <=50K
## 4 <=50K
## 5 <=50K
## 6 <=50K
# Have a look at its dimension
dim(df_adult)
## [1] 48842 15
As you can see, it has dimension (48842, 15) and the columns are named V1 to V15, so we will give the columns meaningful names.
# Rename the columns names
names(df_adult) <- c("age","workclass","fnlwgt","education","education_num","marital_status","occupation","relationship","race","sex","capital_gain","capital_loss","hours_per_week","native_country","annual_income")
To start with, we will look at the first six rows again to gain an understanding of the data.
# Have a look at the first six rows again
head(df_adult)
## age workclass fnlwgt education education_num marital_status
## 1 39 State-gov 77516 Bachelors 13 Never-married
## 2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse
## 3 38 Private 215646 HS-grad 9 Divorced
## 4 53 Private 234721 11th 7 Married-civ-spouse
## 5 28 Private 338409 Bachelors 13 Married-civ-spouse
## 6 37 Private 284582 Masters 14 Married-civ-spouse
## occupation relationship race sex capital_gain capital_loss
## 1 Adm-clerical Not-in-family White Male 2174 0
## 2 Exec-managerial Husband White Male 0 0
## 3 Handlers-cleaners Not-in-family White Male 0 0
## 4 Handlers-cleaners Husband Black Male 0 0
## 5 Prof-specialty Wife Black Female 0 0
## 6 Exec-managerial Wife White Female 0 0
## hours_per_week native_country annual_income
## 1 40 United-States <=50K
## 2 13 United-States <=50K
## 3 40 United-States <=50K
## 4 40 United-States <=50K
## 5 40 Cuba <=50K
## 6 40 United-States <=50K
Nothing seems wrong at first glance, so we will examine the data structure next.
# Have a look at the data structure
str(df_adult)
## 'data.frame': 48842 obs. of 15 variables:
## $ age : chr "39" "50" "38" "53" ...
## $ workclass : chr " State-gov" " Self-emp-not-inc" " Private" " Private" ...
## $ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
## $ education : chr " Bachelors" " Bachelors" " HS-grad" " 11th" ...
## $ education_num : int 13 13 9 7 13 14 5 9 14 13 ...
## $ marital_status: chr " Never-married" " Married-civ-spouse" " Divorced" " Married-civ-spouse" ...
## $ occupation : chr " Adm-clerical" " Exec-managerial" " Handlers-cleaners" " Handlers-cleaners" ...
## $ relationship : chr " Not-in-family" " Husband" " Not-in-family" " Husband" ...
## $ race : chr " White" " White" " White" " Black" ...
## $ sex : chr " Male" " Male" " Male" " Male" ...
## $ capital_gain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
## $ capital_loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hours_per_week: int 40 13 40 40 40 40 16 45 50 40 ...
## $ native_country: chr " United-States" " United-States" " United-States" " United-States" ...
## $ annual_income : chr " <=50K" " <=50K" " <=50K" " <=50K" ...
From the data structure, we can see not only that the categorical columns such as workclass carry leading whitespace, but also that some columns such as education have the wrong class (character instead of factor). So we have two goals: trim the whitespace and convert the categorical columns to factors.
# Load the libraries dplyr and stringr
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
# Remove whitespace with str_trim (via apply), then convert the columns to factors with as.factor (via lapply)
df_cat_col <- df_adult %>%
select(c("workclass", "education", "marital_status", "occupation", "relationship", "race","sex","native_country","annual_income")) %>%
apply(2, str_trim) %>%
as.data.frame() %>%
lapply(as.factor) %>%
as.data.frame()
# Assign the correct categorical columns to df_adult
df_adult[,c("workclass", "education" , "marital_status", "occupation", "relationship", "race","sex","native_country","annual_income")] <- df_cat_col
# The column age should also be integer, so we convert it to the right type
df_adult$age <- as.integer(df_adult$age)
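As an aside, the trimming and conversion above can be written as a single dplyr step. A sketch of an equivalent alternative, assuming dplyr >= 1.0 for across():

# Equivalent one-step cleanup with mutate() and across()
cat_cols <- c("workclass", "education", "marital_status", "occupation",
              "relationship", "race", "sex", "native_country", "annual_income")
df_adult <- df_adult %>%
  mutate(across(all_of(cat_cols), ~ as.factor(str_trim(.x))),
         age = as.integer(age))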
Then we will have a look at the summary information for this dataset.
# Have a look at the summary information
summary(df_adult)
## age workclass fnlwgt
## Min. :17.00 Private :33906 Min. : 12285
## 1st Qu.:28.00 Self-emp-not-inc: 3862 1st Qu.: 117551
## Median :37.00 Local-gov : 3136 Median : 178145
## Mean :38.64 ? : 2799 Mean : 189664
## 3rd Qu.:48.00 State-gov : 1981 3rd Qu.: 237642
## Max. :90.00 Self-emp-inc : 1695 Max. :1490400
## (Other) : 1463
## education education_num marital_status
## HS-grad :15784 Min. : 1.00 Divorced : 6633
## Some-college:10878 1st Qu.: 9.00 Married-AF-spouse : 37
## Bachelors : 8025 Median :10.00 Married-civ-spouse :22379
## Masters : 2657 Mean :10.08 Married-spouse-absent: 628
## Assoc-voc : 2061 3rd Qu.:12.00 Never-married :16117
## 11th : 1812 Max. :16.00 Separated : 1530
## (Other) : 7625 Widowed : 1518
## occupation relationship race
## Prof-specialty : 6172 Husband :19716 Amer-Indian-Eskimo: 470
## Craft-repair : 6112 Not-in-family :12583 Asian-Pac-Islander: 1519
## Exec-managerial: 6086 Other-relative: 1506 Black : 4685
## Adm-clerical : 5611 Own-child : 7581 Other : 406
## Sales : 5504 Unmarried : 5125 White :41762
## Other-service : 4923 Wife : 2331
## (Other) :14434
## sex capital_gain capital_loss hours_per_week
## Female:16192 Min. : 0 Min. : 0.0 Min. : 1.00
## Male :32650 1st Qu.: 0 1st Qu.: 0.0 1st Qu.:40.00
## Median : 0 Median : 0.0 Median :40.00
## Mean : 1079 Mean : 87.5 Mean :40.42
## 3rd Qu.: 0 3rd Qu.: 0.0 3rd Qu.:45.00
## Max. :99999 Max. :4356.0 Max. :99.00
##
## native_country annual_income
## United-States:43832 <=50K :24720
## Mexico : 951 <=50K.:12435
## ? : 857 >50K : 7841
## Philippines : 295 >50K. : 3846
## Germany : 206
## Puerto-Rico : 184
## (Other) : 2517
From the summary information, you can observe "?" values in columns such as workclass and native_country; these represent missing values. In addition, the column annual_income has inconsistent representations inherited from adult.test, such as <=50K versus <=50K.. So we have three goals: recode the "?" entries as NA, deal with the missing rows, and unify the annual_income labels.
Before deciding how to handle the missing values, we will first look at their percentage.
# Set the "?" to NA
df_adult[df_adult == "?"] <- NA
# Count how many missing values
sum(!complete.cases(df_adult))
## [1] 3620
# Count the percentage of missing values
sum(!complete.cases(df_adult)) / nrow(df_adult)
## [1] 0.07411654
You can see that 3,620 rows contain missing values, which accounts for merely 7.41% of the data. Thus we will simply delete them.
# Keep only the complete rows; assign to a new data frame df_ad in case we need df_adult later
df_ad <- df_adult[complete.cases(df_adult),]
# The "?" levels remain as empty factor levels (frequency 0), so we drop them
df_ad <- df_ad %>%
droplevels()
For the inconsistent representations in column annual_income, we will convert them to a single consistent form by stripping the trailing period.
df_ad$annual_income <- df_ad $annual_income %>%
as.character() %>%
str_remove("[.]") %>%
as.factor()
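As a quick sanity check, the factor should now have exactly the two clean levels:

# The levels of annual_income after removing the trailing dots
levels(df_ad$annual_income)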
Finally, we will check for redundant variables before EDA. Obviously, education_num is just the numerical encoding of education, while relationship can largely be derived from sex and marital_status. Thus we will remove both columns.
# Remove the column education_num and relationship
df_ad <- df_ad %>%
select(-c("education_num", "relationship"))
For the numerical variables, we will check redundancy by looking at their pairwise correlations. Thus we will first separate the column names by type and then plot the correlation matrix.
# get the class of each column in df_ad
col_class <- sapply(df_ad, class)
# get the column names of categorical variables
cat_names <- names(col_class[col_class == "factor"])
# get the column names of numerical variables
num_names <- names(col_class[col_class != "factor"])
# Load the library ggcorrplot
library(ggcorrplot)
## Loading required package: ggplot2
# Get the correlation matrix chart
df_ad %>%
  select(all_of(num_names)) %>%
  cor() %>%
  ggcorrplot(hc.order = TRUE, type = "lower", lab = TRUE)
As you can notice, the correlations between the numerical variables are quite weak, so we will keep all of them.
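To put a number on "weak", we can check the largest absolute pairwise correlation directly; a minimal sketch:

# Largest absolute off-diagonal correlation among the numerical variables
cor_mat <- cor(df_ad[, num_names])
max(abs(cor_mat[upper.tri(cor_mat)]))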
To gain an overall understanding of the cleaned data, we will look at the summary information again.
summary(df_ad)
## age workclass fnlwgt
## Min. :17.00 Federal-gov : 1406 Min. : 13492
## 1st Qu.:28.00 Local-gov : 3100 1st Qu.: 117388
## Median :37.00 Private :33307 Median : 178316
## Mean :38.55 Self-emp-inc : 1646 Mean : 189735
## 3rd Qu.:47.00 Self-emp-not-inc: 3796 3rd Qu.: 237926
## Max. :90.00 State-gov : 1946 Max. :1490400
## Without-pay : 21
## education marital_status occupation
## HS-grad :14783 Divorced : 6297 Craft-repair : 6020
## Some-college: 9899 Married-AF-spouse : 32 Prof-specialty : 6008
## Bachelors : 7570 Married-civ-spouse :21055 Exec-managerial: 5984
## Masters : 2514 Married-spouse-absent: 552 Adm-clerical : 5540
## Assoc-voc : 1959 Never-married :14598 Sales : 5408
## 11th : 1619 Separated : 1411 Other-service : 4808
## (Other) : 6878 Widowed : 1277 (Other) :11454
## race sex capital_gain capital_loss
## Amer-Indian-Eskimo: 435 Female:14695 Min. : 0 Min. : 0.00
## Asian-Pac-Islander: 1303 Male :30527 1st Qu.: 0 1st Qu.: 0.00
## Black : 4228 Median : 0 Median : 0.00
## Other : 353 Mean : 1101 Mean : 88.59
## White :38903 3rd Qu.: 0 3rd Qu.: 0.00
## Max. :99999 Max. :4356.00
##
## hours_per_week native_country annual_income
## Min. : 1.00 United-States:41292 <=50K:34014
## 1st Qu.:40.00 Mexico : 903 >50K :11208
## Median :40.00 Philippines : 283
## Mean :40.94 Germany : 193
## 3rd Qu.:45.00 Puerto-Rico : 175
## Max. :99.00 Canada : 163
## (Other) : 2213
The summary above gives a first quantitative picture of each variable. We will plot the data in the following order: single categorical variables, single numerical variables, then each variable against annual_income, and finally age against the other variables.
For the categorical variables, we will define a descending horizontal barplot function and call it for each variable. Then almost all of the barplots are arranged together.
# Load library ggplot2 and ggpubr
library(ggplot2)
library(ggpubr)
# define a barplot function for categorical variable
barplot_single_cat <- function(col_name) {
  col_freq <- as.data.frame(sort(table(df_ad[[col_name]]), decreasing = F))
  colnames(col_freq) <- c(col_name, "freq")
  ggplot() +
    aes(col_freq[[col_name]], col_freq[["freq"]], fill = "red") +
    geom_bar(stat = "identity") +
    coord_flip() +
    labs(x = col_name, y = "frequency", title = paste("barplot of", col_name)) +
    theme(text = element_text(size = 25), plot.title = element_text(hjust = 0.5),
          legend.position = "none")
}
# Create a barplot for each categorical variable in the form of barplot_column_name
for (col_name in cat_names) {
name <- paste("barplot",col_name, sep="_")
assign(name, barplot_single_cat(col_name))
}
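A note on style: assign() works, but collecting the plots in a named list is the more idiomatic pattern, and ggarrange() accepts such a list through its plotlist argument; a sketch:

# Idiomatic alternative: build a named list of plots instead of assign()
plots_cat <- setNames(lapply(cat_names, barplot_single_cat), cat_names)
# e.g. ggarrange(plotlist = plots_cat[names(plots_cat) != "native_country"], nrow = 4, ncol = 2)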
# Arrange all barplots apart from barplot_native_country, which has too many levels to display well alongside the others
ggarrange(barplot_workclass, barplot_education, barplot_marital_status, barplot_occupation, barplot_race, barplot_sex, barplot_annual_income, nrow=4, ncol=2)
# Have a look at barplot_native_country
barplot_native_country
From the descending barplots, we can observe some obvious patterns; for instance, a single level dominates each of workclass, race, and native_country. We can look at their specific percentages.
# Define a function that returns the counts and percentages of the top n levels of a variable
col_freq_percent <- function(col_name, top_n) {
col_freq <- as.data.frame(sort(table(df_ad[[col_name]]), decreasing = T))
col_freq[["percent"]] <- col_freq[["Freq"]] / nrow(df_ad)
names(col_freq) <- c(col_name, "freq","percent")
return(col_freq[1:top_n,])
}
# Define a function that plots the percentage of each level of a variable
col_percent_barplot <- function(df, col_name) {
  ggplot() +
    aes(df[[col_name]]) +
    geom_bar(fill = "#00AFBB") +
    geom_text(aes(label = scales::percent(..count.. / sum(..count..))),
              stat = "count", vjust = -0.2) +
    theme(plot.title = element_text(hjust = 0.5)) +
    labs(x = col_name, y = "percentage", title = paste("percentage of", col_name))
}
# The table of the top 1 percentage of column workclass
col_freq_percent("workclass", 1)
## workclass freq percent
## 1 Private 33307 0.736522
# The barplot of the percentage of column workclass
col_percent_barplot(df_ad,"workclass")
# The table of the top 1 percentage of column race
col_freq_percent("race", 1)
## race freq percent
## 1 White 38903 0.8602671
# The barplot of the percentage of column race
col_percent_barplot(df_ad,"race")
# The table of the top 1 percentage of column native_country
col_freq_percent("native_country", 1)
## native_country freq percent
## 1 United-States 41292 0.9130954
So you can see the corresponding percentages are 73.7%, 86.0%, and 91.3%, respectively.
For each single numerical variable, we will define a histogram function, call it for each variable, and then arrange all histograms together.
# Create a histogram function for numerical variable
histogram_single_num <- function(col_name) {
  ggplot() +
    aes(x = df_ad[[col_name]], fill = "red") +
    geom_histogram() +
    labs(y = "frequency", x = col_name, title = paste("histogram of", col_name)) +
    theme(plot.title = element_text(hjust = 0.5), legend.position = "none")
}
# get the column names of numerical variables
num_names <- names(col_class[col_class != "factor"])
#Create a histogram for each numerical variable in the form of histogram_column_name
for (col_name in num_names) {
name <- paste("histogram",col_name, sep="_")
assign(name, histogram_single_num(col_name))
}
# Arrange all histograms
ggarrange(histogram_age, histogram_fnlwgt, histogram_capital_gain, histogram_capital_loss, histogram_hours_per_week,nrow=2, ncol=3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Based on those histograms, the numerical variables are clearly right-skewed and concentrated on a few values; we can quantify the most extreme cases:
# The top 2 number and percentage of column capital_gain
col_freq_percent("capital_gain", 2)
## capital_gain freq percent
## 1 0 41432 0.91619123
## 2 15024 498 0.01101234
# The top 2 number and percentage of column capital_loss
col_freq_percent("capital_loss", 2)
## capital_loss freq percent
## 1 0 43082 0.95267790
## 2 1902 294 0.00650126
# The top 2 number and percentage of column hours_per_week
col_freq_percent("hours_per_week", 2)
## hours_per_week freq percent
## 1 40 21358 0.47229225
## 2 50 4094 0.09053116
We can see that 91.6% of capital_gain values are 0, 95.3% of capital_loss values are 0, and 47.2% of hours_per_week values are 40.
Considering how imbalanced the categorical variables are (as shown in the previous plots), we find it clearer to plot the percentage of each subcategory against annual_income with a filled barplot rather than raw counts.
barplot_two_var <- function(col_name) {
  ggplot() +
    aes(y = df_ad[[col_name]], fill = df_ad[["annual_income"]]) +
    geom_bar(position = "fill") +
    labs(x = "percentage", y = col_name,
         title = paste("barplot of", col_name, "by annual_income")) +
    theme(text = element_text(size = 40), plot.title = element_text(hjust = 0.5)) +
    guides(fill = guide_legend(title = "annual_income"))
}
# Create a barplot for each categorical variable in the form of barplot_column_name
for (col_name in cat_names[-length(cat_names)]) {
name <- paste("barplot",col_name, "by_annual_income",sep="_")
assign(name, barplot_two_var(col_name))
}
# Arrange all barplots
ggarrange(barplot_workclass_by_annual_income, barplot_education_by_annual_income, barplot_marital_status_by_annual_income, barplot_occupation_by_annual_income, barplot_race_by_annual_income, barplot_sex_by_annual_income, nrow=3, ncol=2)
# Have a look at the barplot_native_country_by_annual_income
barplot_native_country_by_annual_income
From those barplots, we can draw some insights, such as how markedly the share of >50K earners differs across the levels of each variable.
Since most numerical variables are also imbalanced, we will facet each histogram by annual_income.
# Suppress warning messages
options(warn=-1)
# Create a histogram function for numerical variable
histogram_two_var <- function(col_name) {
  ggplot(df_ad) +
    aes(x = df_ad[[col_name]], fill = df_ad[["annual_income"]]) +
    facet_wrap(~ df_ad[["annual_income"]]) +
    geom_histogram() +
    guides(fill = guide_legend(title = "annual_income")) +
    labs(x = col_name, title = paste(col_name, "by annual_income")) +
    theme(text = element_text(size = 30), plot.title = element_text(hjust = 0.5))
}
#Create a histogram for each numerical variable in the form of histogram_column_name
for (col_name in num_names) {
name <- paste("histogram",col_name, "by_annual_income",sep="_")
assign(name, histogram_two_var(col_name))
}
# Arrange all histograms
ggarrange(histogram_age_by_annual_income, histogram_fnlwgt_by_annual_income, histogram_capital_gain_by_annual_income, histogram_capital_loss_by_annual_income, histogram_hours_per_week_by_annual_income,nrow=3, ncol=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Based on those histograms, we can see how the distributions of the numerical variables differ between the two income groups. Next, we will look at the density of age by each categorical variable.
density_two_var <- function(col_name, font_size) {
  ggplot() +
    aes(x = df_ad[["age"]], fill = df_ad[[col_name]]) +
    geom_density(alpha = 0.2) +
    labs(x = "age", y = "density", title = paste("density of age by", col_name)) +
    theme(text = element_text(size = font_size), plot.title = element_text(hjust = 0.5)) +
    guides(fill = guide_legend(title = col_name))
}
# Create a density curve for age by each categorical variable
for (col_name in cat_names[-(length(cat_names) - 1)]) {
name <- paste("density_age_by",col_name,sep="_")
assign(name, density_two_var(col_name, 40)) }
# Arrange all density curves apart from age by native_country
ggarrange(density_age_by_workclass, density_age_by_education, density_age_by_marital_status, density_age_by_occupation,density_age_by_race,density_age_by_sex,density_age_by_annual_income,nrow=4, ncol=2)
# density of age by native_country
density_two_var("native_country", 25)
According to those density curves, we can see how the age distribution shifts across the groups of each categorical variable. Finally, we will plot age against each numerical variable.
scatter_two_var <- function(col_name) {
  ggplot() +
    aes(x = df_ad[["age"]], y = df_ad[[col_name]]) +
    geom_point(color = "red") +
    geom_smooth() +
    labs(x = "age", y = col_name, title = paste("scatter of age and", col_name)) +
    theme(text = element_text(size = 40), plot.title = element_text(hjust = 0.5))
}
# Create a scatter plot of age against each numerical variable
for (col_name in num_names[-1]) {
name <- paste("scatter_age",col_name,sep="_")
assign(name, scatter_two_var(col_name)) }
# Arrange all scatter plots
ggarrange(scatter_age_fnlwgt, scatter_age_capital_gain,scatter_age_capital_loss, scatter_age_hours_per_week,nrow=2, ncol=2)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Based on those scatter plots, there are no obvious patterns between age and the other numerical variables, so we will move on to modeling.
Here we will build two kinds of models: classification models for annual_income and regression models for age.
For the classification models, the steps are: split the data 80/20 into training and testing sets, train candidate models with 10-fold cross-validation, select the best model, validate it on the testing set, and inspect the feature importance.
# Load the library caret
library(caret)
## Loading required package: lattice
# Split the dataset into 80% and 20% as training and testing dataset respectively
# Set the random seed to ensure we get the same training and testing datasets every time we run the code
split <- 0.8
set.seed(998)
train_index <- createDataPartition(df_ad$annual_income, p=split, list = F)
data_train <- df_ad[train_index, ]
data_test <- df_ad[-train_index, ]
# 10 fold cross validation
train_control <- trainControl(method="cv", number=10)
# 1. logistic regression
set.seed(2)
model_glm <- train(annual_income~., data=data_train, trControl=train_control, method="glm", preProcess = c("center", "scale"),metric="Accuracy")
print(model_glm)
## Generalized Linear Model
##
## 36179 samples
## 12 predictor
## 2 classes: '<=50K', '>50K'
##
## Pre-processing: centered (90), scaled (90)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 32561, 32561, 32562, 32561, 32561, 32560, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8477849 0.5637562
# 2. decision tree
set.seed(2)
model_dt <- train(annual_income~., data=data_train, trControl=train_control, method="rpart", metric="Accuracy")
print(model_dt)
## CART
##
## 36179 samples
## 12 predictor
## 2 classes: '<=50K', '>50K'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 32561, 32561, 32562, 32561, 32561, 32560, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.05620609 0.8108018 0.35786489
## 0.06735809 0.7940240 0.23957743
## 0.07377049 0.7657487 0.07744025
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.05620609.
As you can see, the logistic regression model performs best, with about 84.8% mean cross-validated accuracy, so we choose it as our final model. To check for overfitting, we will validate it on the testing dataset.
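Since both models were trained with the same folds (same seed and trainControl), caret's resamples() can compare their cross-validated metrics directly; a minimal sketch:

# Collect and summarize the resampling results of both models
results <- resamples(list(glm = model_glm, rpart = model_dt))
summary(results)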
# Make predictions
x_test_cla <- data_test[, 1:(ncol(data_test) - 1)]  # all predictor columns
y_test_cla <- data_test[, ncol(data_test)]          # the annual_income column
prediction_glm <- predict(model_glm, x_test_cla)
# Have a look at the confusion matrix
confusionMatrix(prediction_glm, y_test_cla)
## Confusion Matrix and Statistics
##
## Reference
## Prediction <=50K >50K
## <=50K 6310 878
## >50K 492 1363
##
## Accuracy : 0.8485
## 95% CI : (0.8409, 0.8558)
## No Information Rate : 0.7522
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5687
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9277
## Specificity : 0.6082
## Pos Pred Value : 0.8779
## Neg Pred Value : 0.7348
## Prevalence : 0.7522
## Detection Rate : 0.6978
## Detection Prevalence : 0.7949
## Balanced Accuracy : 0.7679
##
## 'Positive' Class : <=50K
##
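Given the no-information rate of 0.7522 shown above, accuracy alone can be flattering for imbalanced classes, so an AUC check is a useful complement. A sketch, assuming the pROC package is available:

# ROC/AUC from the predicted class probabilities (pROC assumed installed)
library(pROC)
probs_glm <- predict(model_glm, x_test_cla, type = "prob")
roc(y_test_cla, probs_glm[[">50K"]])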
The prediction accuracy is around 85%, consistent with the cross-validated accuracy, which shows the logistic regression model generalizes well. Finally, we want to have a look at the important features.
# Have a look at the feature importance of logistics regression model
import_classif <- varImp(model_glm)
import_classif
## glm variable importance
##
## only 20 most important variables shown (out of 90)
##
## Overall
## `marital_statusMarried-civ-spouse` 100.00
## capital_gain 90.79
## capital_loss 54.18
## hours_per_week 52.41
## age 47.56
## `educationProf-school` 42.65
## educationMasters 40.57
## educationDoctorate 38.26
## educationBachelors 36.87
## `occupationExec-managerial` 31.81
## `workclassSelf-emp-not-inc` 28.19
## `educationAssoc-acdm` 23.70
## `occupationOther-service` 23.11
## `educationAssoc-voc` 22.95
## `educationSome-college` 22.15
## `occupationFarming-fishing` 21.38
## `occupationProf-specialty` 19.55
## `workclassState-gov` 19.25
## `workclassLocal-gov` 18.61
## `marital_statusNever-married` 16.40
We can observe that features such as marital_status (Married-civ-spouse), capital_gain, capital_loss, hours_per_week, and age matter most in determining whether annual income exceeds 50K.
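These importances can also be plotted, mirroring what we do for the regression model later:

# Plot the 20 most important features of the logistic regression model
plot(import_classif, top = 20)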
We will build regression models for age and select the best model.
# ridge regression
set.seed(3)
model_ridge <- train(age~., data=data_train, trControl=train_control,method="ridge", metric="RMSE")
model_ridge
## Ridge Regression
##
## 36179 samples
## 12 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 32562, 32561, 32561, 32562, 32562, 32560, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0e+00 10.47033 0.3778042 8.234712
## 1e-04 10.47032 0.3778057 8.234701
## 1e-01 10.49127 0.3755782 8.242094
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 1e-04.
# Least Angle Regression
set.seed(3)
model_lars <- train(age~., data=data_train, trControl=train_control,method="lars", metric="RMSE")
model_lars
## Least Angle Regression
##
## 36179 samples
## 12 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 32562, 32561, 32561, 32562, 32562, 32560, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.050 12.44730 0.2750774 10.156253
## 0.525 10.54760 0.3681961 8.319319
## 1.000 10.46517 0.3771267 8.233227
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 1.
Evidently, least angle regression has the smaller RMSE, so we will choose it as our prediction model. As before, we will check whether the model overfits.
# Make predictions on the testing dataset
x_test_reg <- data_test[, 2:ncol(data_test)]  # all columns except age
y_test_reg <- data_test[, 1]                  # the age column
prediction_lars <- predict(model_lars,x_test_reg)
RMSE(prediction_lars, y_test_reg)
## [1] 10.22163
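To put this number in context, a naive baseline that always predicts the mean training age gives a reference RMSE; a minimal sketch:

# Naive baseline: predict the mean training age for every test row
baseline_pred <- rep(mean(data_train$age), length(y_test_reg))
RMSE(baseline_pred, y_test_reg)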
As you can see, the test RMSE is about 10.2, close to the cross-validated RMSE, which suggests the model is not overfitting. Finally, we will have a look at the feature importance.
# Feature importance
import_reg <- varImp(model_lars)
# Plot the feature importance
plot(import_reg)
As you can see, the most important features are marital_status and annual_income. Since the importance of some features such as race is 0, we will attempt to retrain our regression model using fewer features.
# Train the regression model with fewer features
data_retrain <- data_train[, -c(7,6,4,12)]
data_retest <- data_test[, -c(7,6,4,12)]  # evaluate on the held-out test rows, with the same columns dropped
# set the random seed
set.seed(3)
# Train the model
model_lars_retrain <- train(age~., data=data_retrain, trControl=train_control,method="lars", metric="RMSE")
# Have a look at the model
model_lars_retrain
## Least Angle Regression
##
## 36179 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 32562, 32561, 32561, 32562, 32562, 32560, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.050 12.83088 0.2750774 10.489917
## 0.525 10.89071 0.3337148 8.676002
## 1.000 10.71457 0.3470846 8.447828
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 1.
# Have a look at the performance of the model on the testing dataset
x_test_retrain <- data_retest[, 2:length(data_retest)]
y_test_retrain <- data_retest[, 1]
prediction_lars_retrain <- predict(model_lars_retrain,x_test_retrain)
RMSE(prediction_lars_retrain, y_test_retrain)
## [1] 10.64923
We can see that both RMSEs are about 10, so the reduced model performs almost the same as the model trained on all features. Thus we could predict age with the smaller model, which is especially attractive given the size of the dataset.