Predict whether income exceeds $50K/yr and predict age based on census data

1 Student information

  • ZHONGLIANG SHI (17221268)
  • Rahil Mushtaq (S2018835)

2 Project process

2.1 Ask good questions

We want to ask the following questions:

  • How can we predict whether income exceeds $50K/yr based on census data?
  • How can we predict age based on census data?

So the objectives of this project are:

  • Predict whether income exceeds $50K/yr based on census data
  • Predict age based on census data

2.2 Data Collection

To answer the questions above, we collect the data from UCI. The details of the dataset are shown below:

  • Data source: http://archive.ics.uci.edu/ml/datasets/Adult
  • Title of the dataset: Adult Data Set
  • Year of the dataset: 1996
  • The purpose of the dataset: Predict whether income exceeds $50K/yr based on census data. Also known as “Census Income” dataset.
  • Dimension of the dataset: 48842 instances with 14 attributes
  • Area of the dataset: Social
  • Missing Values?: Yes
  • Content of the dataset:
    • annual_income: >50K, <=50K.
    • age: continuous.
    • workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
    • fnlwgt: continuous.
    • education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
    • education-num: continuous.
    • marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent,
      Married-AF-spouse.
    • occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
    • relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
    • race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
    • sex: Female, Male.
    • capital-gain: continuous.
    • capital-loss: continuous.
    • hours-per-week: continuous.
    • native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
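
Since the files are fetched from UCI over the network, a small local cache avoids repeated downloads. The following is a minimal sketch of our own (not part of the original workflow; the local file names simply mirror the remote ones):

# Optional: cache the two UCI files locally so repeated runs do not re-download them
base_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/"
for (f in c("adult.data", "adult.test")) {
  if (!file.exists(f)) download.file(paste0(base_url, f), destfile = f)
}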

2.3 Data Preparation

Based on the collected data, we carry out the following data preparation steps:

  • Rename the column names
  • Correct the classes of variables
  • Correct the content of variables, e.g. remove whitespace and fix inconsistent formats
  • Deal with the missing values
  • Drop redundant variables

2.4 EDA

We carry out EDA in the following ways:

  • Summary information
  • Plotting

2.5 Modelling

After EDA, we build classification and regression models:

  • Split the dataset into training (80%) and testing (20%) sets
  • Train two classification models based on supervised learning algorithms on the training set using 10-fold cross-validation
  • Select the best model based on its mean 10-fold cross-validation accuracy and identify the most important features
  • Make predictions on the testing set to confirm that the classification model is not overfitting
  • Similarly, train two regression models on the training set using 10-fold cross-validation, select the best one, make predictions on the testing set, confirm that it generalizes well, and identify the most important features

2.6 Conclusion

We will draw conclusions from the project and summarize what we have learned.

3 Data Preparation

3.1 Import data

Since UCI splits the data into two files with the same structure (adult.data and adult.test), we will import each file and have a look at its first rows.

# Import adult.data from UCI
df_data <-  read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", sep=",", header = F)
# Have a look at the first six rows of df_data
head(df_data)
##   V1                V2     V3         V4 V5                  V6
## 1 39         State-gov  77516  Bachelors 13       Never-married
## 2 50  Self-emp-not-inc  83311  Bachelors 13  Married-civ-spouse
## 3 38           Private 215646    HS-grad  9            Divorced
## 4 53           Private 234721       11th  7  Married-civ-spouse
## 5 28           Private 338409  Bachelors 13  Married-civ-spouse
## 6 37           Private 284582    Masters 14  Married-civ-spouse
##                   V7             V8     V9     V10  V11 V12 V13            V14
## 1       Adm-clerical  Not-in-family  White    Male 2174   0  40  United-States
## 2    Exec-managerial        Husband  White    Male    0   0  13  United-States
## 3  Handlers-cleaners  Not-in-family  White    Male    0   0  40  United-States
## 4  Handlers-cleaners        Husband  Black    Male    0   0  40  United-States
## 5     Prof-specialty           Wife  Black  Female    0   0  40           Cuba
## 6    Exec-managerial           Wife  White  Female    0   0  40  United-States
##      V15
## 1  <=50K
## 2  <=50K
## 3  <=50K
## 4  <=50K
## 5  <=50K
## 6  <=50K
# Import adult.test from UCI
df_test <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test", header = F)
# Have a look at the first six rows of df_test
head(df_test)
##                     V1         V2     V3            V4 V5                  V6
## 1 |1x3 Cross validator                NA               NA                    
## 2                   25    Private 226802          11th  7       Never-married
## 3                   38    Private  89814       HS-grad  9  Married-civ-spouse
## 4                   28  Local-gov 336951    Assoc-acdm 12  Married-civ-spouse
## 5                   44    Private 160323  Some-college 10  Married-civ-spouse
## 6                   18          ? 103497  Some-college 10       Never-married
##                   V7         V8     V9     V10  V11 V12 V13            V14
## 1                                                NA  NA  NA               
## 2  Machine-op-inspct  Own-child  Black    Male    0   0  40  United-States
## 3    Farming-fishing    Husband  White    Male    0   0  50  United-States
## 4    Protective-serv    Husband  White    Male    0   0  40  United-States
## 5  Machine-op-inspct    Husband  Black    Male 7688   0  40  United-States
## 6                  ?  Own-child  White  Female    0   0  30  United-States
##       V15
## 1        
## 2  <=50K.
## 3  <=50K.
## 4   >50K.
## 5   >50K.
## 6  <=50K.

As you can see, the first row of df_test is malformed (it contains the header comment "|1x3 Cross validator" instead of a record), so we will:

  • Remove the first row of df_test
  • Combine df_data and df_test into one data frame, df_adult
# Remove the first row of df_test
df_test <- df_test[-1,]
# Integrate the datasets
df_adult <- rbind(df_data, df_test)
# Have a look at the first six rows
head(df_adult)
##   V1                V2     V3         V4 V5                  V6
## 1 39         State-gov  77516  Bachelors 13       Never-married
## 2 50  Self-emp-not-inc  83311  Bachelors 13  Married-civ-spouse
## 3 38           Private 215646    HS-grad  9            Divorced
## 4 53           Private 234721       11th  7  Married-civ-spouse
## 5 28           Private 338409  Bachelors 13  Married-civ-spouse
## 6 37           Private 284582    Masters 14  Married-civ-spouse
##                   V7             V8     V9     V10  V11 V12 V13            V14
## 1       Adm-clerical  Not-in-family  White    Male 2174   0  40  United-States
## 2    Exec-managerial        Husband  White    Male    0   0  13  United-States
## 3  Handlers-cleaners  Not-in-family  White    Male    0   0  40  United-States
## 4  Handlers-cleaners        Husband  Black    Male    0   0  40  United-States
## 5     Prof-specialty           Wife  Black  Female    0   0  40           Cuba
## 6    Exec-managerial           Wife  White  Female    0   0  40  United-States
##      V15
## 1  <=50K
## 2  <=50K
## 3  <=50K
## 4  <=50K
## 5  <=50K
## 6  <=50K
# Have a look at its dimension
dim(df_adult)
## [1] 48842    15

3.2 Rename the columns

As you can see, the combined data frame has dimension (48842, 15) and the columns are named V1 to V15, so we will rename them.

# Rename the columns
names(df_adult) <- c("age","workclass","fnlwgt","education","education_num","marital_status","occupation","relationship","race","sex","capital_gain","capital_loss","hours_per_week","native_country","annual_income")

3.3 Have a look at the content and structure of the data

To start with, we will have a look at the first six rows again to gain an understanding of the data.

# Have a look at the first six rows again
head(df_adult)
##   age         workclass fnlwgt  education education_num      marital_status
## 1  39         State-gov  77516  Bachelors            13       Never-married
## 2  50  Self-emp-not-inc  83311  Bachelors            13  Married-civ-spouse
## 3  38           Private 215646    HS-grad             9            Divorced
## 4  53           Private 234721       11th             7  Married-civ-spouse
## 5  28           Private 338409  Bachelors            13  Married-civ-spouse
## 6  37           Private 284582    Masters            14  Married-civ-spouse
##           occupation   relationship   race     sex capital_gain capital_loss
## 1       Adm-clerical  Not-in-family  White    Male         2174            0
## 2    Exec-managerial        Husband  White    Male            0            0
## 3  Handlers-cleaners  Not-in-family  White    Male            0            0
## 4  Handlers-cleaners        Husband  Black    Male            0            0
## 5     Prof-specialty           Wife  Black  Female            0            0
## 6    Exec-managerial           Wife  White  Female            0            0
##   hours_per_week native_country annual_income
## 1             40  United-States         <=50K
## 2             13  United-States         <=50K
## 3             40  United-States         <=50K
## 4             40  United-States         <=50K
## 5             40           Cuba         <=50K
## 6             40  United-States         <=50K

Nothing seems wrong so far, so we will examine the data structure.

# Have a look at the data structure
str(df_adult)
## 'data.frame':    48842 obs. of  15 variables:
##  $ age           : chr  "39" "50" "38" "53" ...
##  $ workclass     : chr  " State-gov" " Self-emp-not-inc" " Private" " Private" ...
##  $ fnlwgt        : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
##  $ education     : chr  " Bachelors" " Bachelors" " HS-grad" " 11th" ...
##  $ education_num : int  13 13 9 7 13 14 5 9 14 13 ...
##  $ marital_status: chr  " Never-married" " Married-civ-spouse" " Divorced" " Married-civ-spouse" ...
##  $ occupation    : chr  " Adm-clerical" " Exec-managerial" " Handlers-cleaners" " Handlers-cleaners" ...
##  $ relationship  : chr  " Not-in-family" " Husband" " Not-in-family" " Husband" ...
##  $ race          : chr  " White" " White" " White" " Black" ...
##  $ sex           : chr  " Male" " Male" " Male" " Male" ...
##  $ capital_gain  : int  2174 0 0 0 0 0 0 0 14084 5178 ...
##  $ capital_loss  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hours_per_week: int  40 13 40 40 40 40 16 45 50 40 ...
##  $ native_country: chr  " United-States" " United-States" " United-States" " United-States" ...
##  $ annual_income : chr  " <=50K" " <=50K" " <=50K" " <=50K" ...

From the data structure, we notice that categorical columns such as workclass contain leading whitespace, and that some columns, like education, have the wrong class (character instead of factor). So we have two goals:

  • Remove the whitespace
  • Convert the classes of the columns

3.4 Correct the classes of variables and data format

# Load the libraries dplyr and stringr
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)
# Remove whitespace with str_trim (applied column-wise via apply), then convert each column to a factor with lapply
df_cat_col <- df_adult %>% 
  select(c("workclass", "education", "marital_status", "occupation", "relationship", "race","sex","native_country","annual_income")) %>%
  apply(2, str_trim) %>%
  as.data.frame() %>%
  lapply(as.factor) %>%
  as.data.frame() 
# Assign the correct categorical columns to df_adult
df_adult[,c("workclass", "education" , "marital_status", "occupation", "relationship", "race","sex","native_country","annual_income")] <- df_cat_col
# The column age should be integer, so we convert it to the right class
df_adult$age <- as.integer(df_adult$age)
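
As a quick sanity check (a minimal sketch of our own), we can confirm the conversions took effect before moving on:

# Sanity check: age should now be integer and the categorical columns factors
sapply(df_adult[, c("age", "workclass", "annual_income")], class)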

3.5 Have a look at the summary information

Then we will have a look at the summary information for this dataset.

# Have a look at the summary information
summary(df_adult)
##       age                   workclass         fnlwgt       
##  Min.   :17.00   Private         :33906   Min.   :  12285  
##  1st Qu.:28.00   Self-emp-not-inc: 3862   1st Qu.: 117551  
##  Median :37.00   Local-gov       : 3136   Median : 178145  
##  Mean   :38.64   ?               : 2799   Mean   : 189664  
##  3rd Qu.:48.00   State-gov       : 1981   3rd Qu.: 237642  
##  Max.   :90.00   Self-emp-inc    : 1695   Max.   :1490400  
##                  (Other)         : 1463                    
##         education     education_num                 marital_status 
##  HS-grad     :15784   Min.   : 1.00   Divorced             : 6633  
##  Some-college:10878   1st Qu.: 9.00   Married-AF-spouse    :   37  
##  Bachelors   : 8025   Median :10.00   Married-civ-spouse   :22379  
##  Masters     : 2657   Mean   :10.08   Married-spouse-absent:  628  
##  Assoc-voc   : 2061   3rd Qu.:12.00   Never-married        :16117  
##  11th        : 1812   Max.   :16.00   Separated            : 1530  
##  (Other)     : 7625                   Widowed              : 1518  
##            occupation            relationship                   race      
##  Prof-specialty : 6172   Husband       :19716   Amer-Indian-Eskimo:  470  
##  Craft-repair   : 6112   Not-in-family :12583   Asian-Pac-Islander: 1519  
##  Exec-managerial: 6086   Other-relative: 1506   Black             : 4685  
##  Adm-clerical   : 5611   Own-child     : 7581   Other             :  406  
##  Sales          : 5504   Unmarried     : 5125   White             :41762  
##  Other-service  : 4923   Wife          : 2331                             
##  (Other)        :14434                                                    
##      sex         capital_gain    capital_loss    hours_per_week 
##  Female:16192   Min.   :    0   Min.   :   0.0   Min.   : 1.00  
##  Male  :32650   1st Qu.:    0   1st Qu.:   0.0   1st Qu.:40.00  
##                 Median :    0   Median :   0.0   Median :40.00  
##                 Mean   : 1079   Mean   :  87.5   Mean   :40.42  
##                 3rd Qu.:    0   3rd Qu.:   0.0   3rd Qu.:45.00  
##                 Max.   :99999   Max.   :4356.0   Max.   :99.00  
##                                                                 
##        native_country  annual_income 
##  United-States:43832   <=50K :24720  
##  Mexico       :  951   <=50K.:12435  
##  ?            :  857   >50K  : 7841  
##  Philippines  :  295   >50K. : 3846  
##  Germany      :  206                 
##  Puerto-Rico  :  184                 
##  (Other)      : 2517

3.6 Deal with missing values and correct the data format

From the summary information, you can observe that there are "?" values in columns such as workclass and native_country, and they represent missing values. In addition, the column annual_income has inconsistent representations such as <=50K and <=50K. (the trailing period comes from adult.test). So in general we have three goals:

  • Replace "?" with NA
  • Decide how to deal with the missing values, e.g. remove them
  • Standardize the inconsistent format of the column annual_income (<=50K vs <=50K.)

Before deciding how to deal with the missing values, we will first have a look at their percentage.

# Set the "?" to NA
df_adult[df_adult == "?"] <- NA
# Count how many missing values
sum(!complete.cases(df_adult))
## [1] 3620
# Count the percentage of missing values
sum(!complete.cases(df_adult)) / nrow(df_adult)
## [1] 0.07411654

You can see that 3,620 rows contain missing values, accounting for merely 7.41% of the data. Thus we will simply delete them.

# Keep only the complete rows; assign the result to a new data frame df_ad in case we need df_adult later
df_ad <- df_adult[complete.cases(df_adult),]
# The level "?" still exists with frequency 0, so we drop unused factor levels
df_ad <- df_ad %>%
  droplevels()

For the inconsistent representations in the column annual_income, we will standardize them by stripping the trailing period.

df_ad$annual_income <- df_ad$annual_income %>%
  as.character() %>%
  str_remove("[.]") %>%
  as.factor()
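
As a final sanity check for this step (a minimal sketch), we can confirm that no missing values remain and that annual_income now has exactly the two expected levels:

# Sanity check: no NAs left and annual_income has two levels
anyNA(df_ad)                  # expected: FALSE
levels(df_ad$annual_income)   # expected: "<=50K" ">50K"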

3.7 Deal with the redundant variables

Finally, before EDA, we will check whether there are redundant variables. Obviously, education_num is just the numerical representation of education, while relationship can largely be derived from sex and marital_status. Thus we will:

  • Remove the column education_num
  • Remove the column relationship
# Remove the column education_num and relationship
df_ad <- df_ad %>%
  select(-c("education_num", "relationship"))

For the numerical variables, we will check whether any are redundant by looking at their correlations. Thus we will:

  • Have a look at the correlations of the numerical variables
  • Decide how to deal with the numerical variables
# get the class of each column in df_ad
col_class <- sapply(df_ad, class)
# get the column names of categorical variables
cat_names <- names(col_class[col_class == "factor"])
# get the column names of numerical variables
num_names <- names(col_class[col_class != "factor"])
# Load the library ggcorrplot
library(ggcorrplot)
## Loading required package: ggplot2
# Get the correlation matrix chart
df_ad %>%
  select(all_of(num_names)) %>%
  cor() %>%
  ggcorrplot(hc.order = TRUE, type = "lower", lab = TRUE)

As you can notice, the correlations between the numerical variables are pretty weak. Thus we will keep all of them.
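
To back this claim with numbers rather than colours alone, we can also print the rounded correlation matrix directly (a minimal sketch reusing num_names from above):

# Print the rounded correlation matrix of the numerical variables
df_ad %>%
  select(all_of(num_names)) %>%
  cor() %>%
  round(2)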

4 EDA

In order to gain an understanding of the data, we will

  • Have a look at its summary information like min, mean and max
  • Plot the data by variable class and the number of variables

4.1 Summary information

summary(df_ad)
##       age                   workclass         fnlwgt       
##  Min.   :17.00   Federal-gov     : 1406   Min.   :  13492  
##  1st Qu.:28.00   Local-gov       : 3100   1st Qu.: 117388  
##  Median :37.00   Private         :33307   Median : 178316  
##  Mean   :38.55   Self-emp-inc    : 1646   Mean   : 189735  
##  3rd Qu.:47.00   Self-emp-not-inc: 3796   3rd Qu.: 237926  
##  Max.   :90.00   State-gov       : 1946   Max.   :1490400  
##                  Without-pay     :   21                    
##         education                   marital_status            occupation   
##  HS-grad     :14783   Divorced             : 6297   Craft-repair   : 6020  
##  Some-college: 9899   Married-AF-spouse    :   32   Prof-specialty : 6008  
##  Bachelors   : 7570   Married-civ-spouse   :21055   Exec-managerial: 5984  
##  Masters     : 2514   Married-spouse-absent:  552   Adm-clerical   : 5540  
##  Assoc-voc   : 1959   Never-married        :14598   Sales          : 5408  
##  11th        : 1619   Separated            : 1411   Other-service  : 4808  
##  (Other)     : 6878   Widowed              : 1277   (Other)        :11454  
##                  race           sex         capital_gain    capital_loss    
##  Amer-Indian-Eskimo:  435   Female:14695   Min.   :    0   Min.   :   0.00  
##  Asian-Pac-Islander: 1303   Male  :30527   1st Qu.:    0   1st Qu.:   0.00  
##  Black             : 4228                  Median :    0   Median :   0.00  
##  Other             :  353                  Mean   : 1101   Mean   :  88.59  
##  White             :38903                  3rd Qu.:    0   3rd Qu.:   0.00  
##                                            Max.   :99999   Max.   :4356.00  
##                                                                             
##  hours_per_week        native_country  annual_income
##  Min.   : 1.00   United-States:41292   <=50K:34014  
##  1st Qu.:40.00   Mexico       :  903   >50K :11208  
##  Median :40.00   Philippines  :  283                
##  Mean   :40.94   Germany      :  193                
##  3rd Qu.:45.00   Puerto-Rico  :  175                
##  Max.   :99.00   Canada       :  163                
##                  (Other)      : 2213

From the summary information, we can extract several facts:

  • age ranges from 17 to 90 with a median of 37
  • There are 14695 female and 30527 male records
  • capital_gain ranges from 0 to 99999 with a mean of 1101
  • hours_per_week varies from 1 to 99 with a mean of 40.94
  • etc.

4.2 Plotting

We will plot the data in the following order:

  • Plot single categorical variable
  • Plot single numerical variable
  • Plot single categorical variable and target variable annual_income
  • Plot single numerical variable and target variable annual_income
  • Plot single categorical variable and target variable age
  • Plot single numerical variable and target variable age

4.2.1 Single categorical variable

We will define a descending horizontal barplot function and call it for each categorical variable. Then almost all of the barplots are arranged together.

# Load the libraries ggplot2 and ggpubr
library(ggplot2)
library(ggpubr)
# Define a barplot function for a categorical variable
barplot_single_cat <- function(col_name) {
  col_freq <- as.data.frame(sort(table(df_ad[[col_name]]), decreasing = F))
  colnames(col_freq) <- c(col_name, "freq")
  ggplot() +
    aes(col_freq[[col_name]], col_freq[["freq"]]) +
    geom_bar(stat = "identity", fill = "red") +
    coord_flip() +
    labs(x = col_name, y = "frequency", title = paste("barplot of", col_name)) +
    theme(text = element_text(size = 25), plot.title = element_text(hjust = 0.5))
}
# Create a barplot for each categorical variable in the form of barplot_column_name
for (col_name in cat_names) {
  name <- paste("barplot",col_name, sep="_")
  assign(name, barplot_single_cat(col_name))
}
# Arrange all barplots apart from barplot_native_country, which has too many levels to display well
ggarrange(barplot_workclass, barplot_education, barplot_marital_status, barplot_occupation, barplot_race, barplot_sex, barplot_annual_income, nrow=4, ncol=2)

# Have a look at barplot_native_country
barplot_native_country

From the descending barplots, we could observe some obvious patterns, for instance:

  • Most cases in the variables workclass, race and native_country are Private, White and United-States respectively.

We can have a look at their specific percentages:

# Define a function that returns the count and percentage of the top n levels of a variable
col_freq_percent <- function(col_name, top_n) {
  col_freq <- as.data.frame(sort(table(df_ad[[col_name]]), decreasing = T))
  col_freq[["percent"]] <- col_freq[["Freq"]] / nrow(df_ad)
  names(col_freq) <- c(col_name, "freq", "percent")
  return(col_freq[1:top_n, ])
}

# Define a function that plots the percentage of each level of a variable
col_percent_barplot <- function(df, col_name) {
  ggplot() +
    aes(df[[col_name]]) +
    geom_bar(fill = "#00AFBB") +
    geom_text(aes(label = scales::percent(..count.. / sum(..count..))), stat = "count", vjust = -0.2) +
    theme(plot.title = element_text(hjust = 0.5)) +
    labs(x = col_name, y = "percentage", title = paste("percentage of", col_name))
}

# The table of the top 1 percentage of column workclass
col_freq_percent("workclass", 1)
##   workclass  freq  percent
## 1   Private 33307 0.736522
# The barplot of the percentage of column workclass
col_percent_barplot(df_ad,"workclass")

# The table of the top 1 percentage of column race
col_freq_percent("race", 1)
##    race  freq   percent
## 1 White 38903 0.8602671
# The barplot of the percentage of column race
col_percent_barplot(df_ad,"race")

# The table of the top 1 percentage of column native_country
col_freq_percent("native_country", 1)
##   native_country  freq   percent
## 1  United-States 41292 0.9130954

So you can see the corresponding percentages are 73.7%, 86.0% and 91.3% respectively.

4.2.2 Single numerical variable

For each single numerical variable, we will define a histogram function, call it for each variable, and then arrange all histograms together.

# Create a histogram function for a numerical variable
histogram_single_num <- function(col_name) {
  ggplot() +
    aes(x = df_ad[[col_name]]) +
    geom_histogram(fill = "red") +
    labs(y = "frequency", x = col_name, title = paste("histogram of", col_name)) +
    theme(plot.title = element_text(hjust = 0.5))
}
# get the column names of numerical variables
num_names <- names(col_class[col_class != "factor"])
# Create a histogram for each numerical variable in the form of histogram_column_name
for (col_name in num_names) {
  name <- paste("histogram",col_name, sep="_")
  assign(name, histogram_single_num(col_name))
}
# Arrange all histograms
ggarrange(histogram_age,  histogram_fnlwgt, histogram_capital_gain, histogram_capital_loss, histogram_hours_per_week,nrow=2, ncol=3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Based on those histograms, we can see:

  • The histograms of age and fnlwgt are right skewed, with long right tails
  • The columns capital_gain, capital_loss and hours_per_week are highly imbalanced; taking capital_gain as an example, most values are 0. We can take a deeper look at those three columns.
# The top 2 number and percentage of column capital_gain
col_freq_percent("capital_gain", 2)
##   capital_gain  freq    percent
## 1            0 41432 0.91619123
## 2        15024   498 0.01101234
# The top 2 number and percentage of column capital_loss
col_freq_percent("capital_loss", 2)
##   capital_loss  freq    percent
## 1            0 43082 0.95267790
## 2         1902   294 0.00650126
# The top 2 number and percentage of column hours_per_week
col_freq_percent("hours_per_week", 2)
##   hours_per_week  freq    percent
## 1             40 21358 0.47229225
## 2             50  4094 0.09053116

We can see that 91.6% of capital_gain values are 0, 95.3% of capital_loss values are 0 and 47.2% of hours_per_week values are 40.

4.2.3 Single categorical variable and target variable annual_income

Considering the imbalanced distributions seen in the previous plots, we decide it is better to use stacked barplots that show, within each category, the percentage of each income group.

barplot_two_var <- function(col_name) {
  ggplot() +
    aes(y = df_ad[[col_name]], fill = df_ad[["annual_income"]]) +
    geom_bar(position = "fill") +
    labs(x = "percentage", y = col_name, title = paste("barplot of", col_name, "by annual_income")) +
    theme(text = element_text(size = 40), plot.title = element_text(hjust = 0.5)) +
    guides(fill = guide_legend(title = "annual_income"))
}
# Create a barplot for each categorical variable in the form of barplot_column_name_by_annual_income
for (col_name in cat_names[-length(cat_names)]) {
  name <- paste("barplot",col_name, "by_annual_income",sep="_")
  assign(name, barplot_two_var(col_name))
}
# Arrange all barplots
ggarrange(barplot_workclass_by_annual_income, barplot_education_by_annual_income, barplot_marital_status_by_annual_income, barplot_occupation_by_annual_income, barplot_race_by_annual_income, barplot_sex_by_annual_income, nrow=3, ncol=2)

# Have a look at the barplot_native_country_by_annual_income
barplot_native_country_by_annual_income

From those barplots, we can draw some insights, such as:

  • column workclass: Self-emp-inc is the only class in which the >50K group exceeds 50%
  • column education: over 50% of the Prof-school, Masters and Doctorate groups have an annual income over 50K
  • column occupation: Prof-specialty and Exec-managerial are more likely to earn an annual income over 50K
  • column sex: a higher percentage of males than females have an annual income over 50K (see the numeric check below)
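
These visual impressions can be checked numerically with a row-wise proportion table; the following minimal sketch does it for sex, and the same pattern works for any categorical column:

# Proportion of each income group within each sex (rows sum to 1)
round(prop.table(table(df_ad$sex, df_ad$annual_income), margin = 1), 3)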

4.2.4 Single numerical variable and target variable annual_income

Since most numerical variables are imbalanced, we will facet each variable by annual_income.

# Suppress warning messages
options(warn = -1)
# Create a faceted histogram function for a numerical variable by annual_income
histogram_two_var <- function(col_name) {
  ggplot(df_ad) +
    aes(x = df_ad[[col_name]], fill = df_ad[["annual_income"]]) +
    facet_wrap(~ df_ad[["annual_income"]]) +
    geom_histogram() +
    guides(fill = guide_legend(title = "annual_income")) +
    labs(x = col_name, title = paste(col_name, "by annual_income")) +
    theme(text = element_text(size = 30), plot.title = element_text(hjust = 0.5))
}
# Create a histogram for each numerical variable in the form of histogram_column_name_by_annual_income
for (col_name in num_names) {
  name <- paste("histogram",col_name, "by_annual_income",sep="_")
  assign(name, histogram_two_var(col_name))
}
# Arrange all histograms
ggarrange(histogram_age_by_annual_income,  histogram_fnlwgt_by_annual_income, histogram_capital_gain_by_annual_income, histogram_capital_loss_by_annual_income, histogram_hours_per_week_by_annual_income,nrow=3, ncol=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Based on those histograms, we can gain some insights about the data, for example:

  • column age: the distribution of the <=50K group is right skewed, with most of its mass at younger ages, while the >50K group is centred around middle age. This suggests you are more likely to earn over 50K as you grow older.
  • Other columns: the two income groups share almost the same distribution for each variable.

Next, we will examine how the predictors relate to the regression target age.

4.2.5 Single categorical variable and target variable age for regression

# Define a density-curve function for age by a categorical variable
density_two_var <- function(col_name, font_size) {
  ggplot() +
    aes(x = df_ad[["age"]], fill = df_ad[[col_name]]) +
    geom_density(alpha = 0.2) +
    labs(x = "age", y = "density", title = paste("density of age by", col_name)) +
    theme(text = element_text(size = font_size), plot.title = element_text(hjust = 0.5)) +
    guides(fill = guide_legend(title = col_name))
}
# Create a density curve for age by each categorical variable
for (col_name in cat_names[-(length(cat_names) - 1)]) {
  name <- paste("density_age_by",col_name,sep="_")
  assign(name, density_two_var(col_name, 40)) }
# Arrange all density curves apart from age by native_country
ggarrange(density_age_by_workclass,  density_age_by_education, density_age_by_marital_status, density_age_by_occupation,density_age_by_race,density_age_by_sex,density_age_by_annual_income,nrow=4, ncol=2)

# density of age by native_country
density_two_var("native_country", 25)

According to those density curves, we can get some interesting insights, for example:

  • column marital_status: compared with the other groups, the Widowed group tends to be older, with ages mainly distributed between 50 and 75
  • column sex: the age distribution of females has more mass at younger ages than that of males, i.e. the average age of females is lower than that of males in this data
  • column annual_income: similarly, the age distribution of the <=50K group leans younger, which tells us that the average age of the >50K group is higher (verified numerically below)
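
The last point can be verified with the group means of age (a minimal sketch):

# Mean age within each income group
aggregate(age ~ annual_income, data = df_ad, FUN = mean)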

4.2.6 Single numerical variable and target variable age for regression

# Define a scatter-plot function for age against a numerical variable
scatter_two_var <- function(col_name) {
  ggplot() +
    aes(x = df_ad[["age"]], y = df_ad[[col_name]]) +
    geom_point(color = "red") +
    geom_smooth() +
    labs(x = "age", y = col_name, title = paste("scatter of age and", col_name)) +
    theme(text = element_text(size = 40), plot.title = element_text(hjust = 0.5))
}
# Create a scatter plot of age against each numerical variable
for (col_name in num_names[-1]) {
  name <- paste("scatter_age",col_name,sep="_")
  assign(name, scatter_two_var(col_name)) }
# Arrange all scatter plots
ggarrange(scatter_age_fnlwgt,  scatter_age_capital_gain,scatter_age_capital_loss, scatter_age_hours_per_week,nrow=2, ncol=2)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Based on those scatter plots, we can see no obvious patterns between age and the other numerical variables; a quick correlation check (below) confirms this, after which we will go ahead with modelling.
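
The following minimal sketch computes those correlations explicitly (reusing num_names from above):

# Correlation of age with each of the other numerical variables
sapply(num_names[-1], function(v) cor(df_ad$age, df_ad[[v]]))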

5 Modeling

Here we will build two models:

  • A classification model to predict whether the annual income is over 50K
  • A regression model to predict age

5.1 Classification

For the classification models, the steps are as follows:

  • Split the dataset into training (80%) and testing (20%) sets
  • Train two classification models, logistic regression and a decision tree, on the training set using 10-fold cross-validation
  • Select the best model and evaluate its performance on the testing set to confirm it is not overfitting.
# Load the library caret
library(caret)
## Loading required package: lattice
# Split the dataset into 80% and 20% as training and testing datasets respectively
# Set the random seed so that we get the same training and testing datasets every time we run the code
split <- 0.8
set.seed(998)
train_index <- createDataPartition(df_ad$annual_income, p=split, list = F)
data_train <- df_ad[train_index, ]
data_test <- df_ad[-train_index, ]
# 10 fold cross validation
train_control <- trainControl(method="cv", number=10)
# 1. Logistic regression
set.seed(2)
model_glm <- train(annual_income~., data=data_train, trControl=train_control, method="glm", preProcess = c("center", "scale"),metric="Accuracy")
print(model_glm)
## Generalized Linear Model 
## 
## 36179 samples
##    12 predictor
##     2 classes: '<=50K', '>50K' 
## 
## Pre-processing: centered (90), scaled (90) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 32561, 32561, 32562, 32561, 32561, 32560, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8477849  0.5637562
# 2. Decision tree
set.seed(2)
model_dt <- train(annual_income~., data=data_train, trControl=train_control, method="rpart", metric="Accuracy")
print(model_dt)
## CART 
## 
## 36179 samples
##    12 predictor
##     2 classes: '<=50K', '>50K' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 32561, 32561, 32562, 32561, 32561, 32560, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa     
##   0.05620609  0.8108018  0.35786489
##   0.06735809  0.7940240  0.23957743
##   0.07377049  0.7657487  0.07744025
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.05620609.

As you can see, the logistic regression model performs best, with about 84.8% mean cross-validation accuracy, so we choose it as our final model.
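
As a side-by-side check of that choice, caret's resamples() helper can summarize the cross-validation results of both models at once (a minimal sketch; it applies here because both models were trained with the same seed and resampling scheme):

# Compare the cross-validation results of the two models side by side
cv_results <- resamples(list(glm = model_glm, rpart = model_dt))
summary(cv_results)

To confirm the chosen model is not overfitting, we will now validate it on the testing dataset.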

# Make predictions
x_test_cla <- data_test[, 1:(length(data_train) - 1)]
y_test_cla <- data_test[, length(data_train)]
prediction_glm <- predict(model_glm, x_test_cla)
# Have a look at the confusion matrix
confusionMatrix(prediction_glm, y_test_cla)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction <=50K >50K
##      <=50K  6310  878
##      >50K    492 1363
##                                           
##                Accuracy : 0.8485          
##                  95% CI : (0.8409, 0.8558)
##     No Information Rate : 0.7522          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5687          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9277          
##             Specificity : 0.6082          
##          Pos Pred Value : 0.8779          
##          Neg Pred Value : 0.7348          
##              Prevalence : 0.7522          
##          Detection Rate : 0.6978          
##    Detection Prevalence : 0.7949          
##       Balanced Accuracy : 0.7679          
##                                           
##        'Positive' Class : <=50K           
## 

The prediction accuracy on the testing set is about 85%, in line with the cross-validation accuracy, which shows the logistic regression model is not overfitting. Finally, we want to have a look at the important features.

# Have a look at the feature importance of the logistic regression model
import_classif <- varImp(model_glm)
import_classif
## glm variable importance
## 
##   only 20 most important variables shown (out of 90)
## 
##                                    Overall
## `marital_statusMarried-civ-spouse`  100.00
## capital_gain                         90.79
## capital_loss                         54.18
## hours_per_week                       52.41
## age                                  47.56
## `educationProf-school`               42.65
## educationMasters                     40.57
## educationDoctorate                   38.26
## educationBachelors                   36.87
## `occupationExec-managerial`          31.81
## `workclassSelf-emp-not-inc`          28.19
## `educationAssoc-acdm`                23.70
## `occupationOther-service`            23.11
## `educationAssoc-voc`                 22.95
## `educationSome-college`              22.15
## `occupationFarming-fishing`          21.38
## `occupationProf-specialty`           19.55
## `workclassState-gov`                 19.25
## `workclassLocal-gov`                 18.61
## `marital_statusNever-married`        16.40

We can observe that features such as marital_status (Married-civ-spouse), capital_gain, capital_loss, hours_per_week and age matter most for determining whether the annual income is over 50K.
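
For a visual version of this table, the importance object can also be plotted directly (a minimal sketch):

# Plot the 20 most important features of the logistic regression model
plot(import_classif, top = 20)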

5.2 Regression

We will build two regression models for age, ridge regression and least angle regression, and select the better one.

# 1. Ridge regression
set.seed(3)
model_ridge <- train(age~., data=data_train, trControl=train_control,method="ridge", metric="RMSE")
model_ridge
## Ridge Regression 
## 
## 36179 samples
##    12 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 32562, 32561, 32561, 32562, 32562, 32560, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared   MAE     
##   0e+00   10.47033  0.3778042  8.234712
##   1e-04   10.47032  0.3778057  8.234701
##   1e-01   10.49127  0.3755782  8.242094
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 1e-04.
# 2. Least Angle Regression
set.seed(3)
model_lars <- train(age~., data=data_train, trControl=train_control,method="lars", metric="RMSE")
model_lars
## Least Angle Regression 
## 
## 36179 samples
##    12 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 32562, 32561, 32561, 32562, 32562, 32560, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE      Rsquared   MAE      
##   0.050     12.44730  0.2750774  10.156253
##   0.525     10.54760  0.3681961   8.319319
##   1.000     10.46517  0.3771267   8.233227
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 1.

Evidently, least angle regression has the smaller cross-validation RMSE, so we will choose it as our prediction model. Similarly, we will confirm whether the model is overfitting.

# Make predictions on the testing dataset
x_test_reg <- data_test[, 2:length(data_train)]
y_test_reg <- data_test[, 1]
prediction_lars <- predict(model_lars,x_test_reg)
RMSE(prediction_lars, y_test_reg)
## [1] 10.22163

As you can notice, the RMSE on the testing set is about 10.2, close to the cross-validation RMSE, which suggests the model generalizes well. Finally, we will have a look at the feature importance.

# Feature importance
import_reg <- varImp(model_lars)
# Plot the feature importance
plot(import_reg)

As you can see, the most important features are marital_status and annual_income. Since the importance of some features, like race, is 0, we will attempt to train our regression model using fewer features.

# Train the regression model with fewer features: drop education, occupation, race and native_country
data_retrain <- data_train[, -c(4, 6, 7, 12)]
data_retest <- data_test[, -c(4, 6, 7, 12)]
# set the random seed
set.seed(3)
# Train the model
model_lars_retrain <- train(age~., data=data_retrain, trControl=train_control,method="lars", metric="RMSE")
# Have a look at the model
model_lars_retrain
## Least Angle Regression 
## 
## 36179 samples
##     8 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 32562, 32561, 32561, 32562, 32562, 32560, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE      Rsquared   MAE      
##   0.050     12.83088  0.2750774  10.489917
##   0.525     10.89071  0.3337148   8.676002
##   1.000     10.71457  0.3470846   8.447828
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 1.
# Have a look at the performance of the model on the testing dataset
x_test_retrain <- data_retest[, 2:length(data_retest)]
y_test_retrain <- data_retest[, 1]
prediction_lars_retrain <- predict(model_lars_retrain,x_test_retrain)
RMSE(prediction_lars_retrain, y_test_retrain)
## [1] 10.64923

We can see that both RMSEs are about 10, so the reduced model performs almost as well as the model trained on all features. Thus we can predict age with this smaller model, which is especially attractive when handling a vast amount of data.
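
To illustrate how the reduced model would be used in practice, here is a minimal sketch predicting the ages of a few held-out records (the exact numbers will vary with the seed):

# Predict the ages of the first three records in the testing dataset
data.frame(predicted = round(predict(model_lars_retrain, x_test_retrain[1:3, ])),
           actual = y_test_retrain[1:3])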

6 Conclusion

  • Conclusion 1: We built a classification model with about 85% accuracy to predict whether income exceeds $50K/yr
  • Conclusion 2: We built a regression model with an RMSE of about 10 to predict age
  • Conclusion 3: The regression model trained on 8 features performs as well as the one trained on all 12, so we can use the smaller model, especially when dealing with a huge amount of data
  • Conclusion 4: We also identified the most important features for our models, such as capital_gain, capital_loss and marital_status for the classification model, and marital_status and annual_income for the regression model