The source of the data is the University of California Irvine Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Adult. The dataset contains 14 attributes. We'll use most of those attributes to predict whether an individual makes 50 thousand dollars a year or less, or whether their income exceeds that amount.
We start by loading the required libraries and doing some data preparation. That includes reading in the data, naming the columns according to the information given on the source site, and checking for missing values, NAs, and other values that might be problematic later in the analysis (a quick check of that kind is sketched after the code below). Please refer to the source site for more details about the dataset or any of the specific attributes.
# loading libraries
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(e1071))
suppressPackageStartupMessages(library(dplyr))
# reading in dataset
data <- fread("~/Documents/datasets/adult.data")
# data preparation
col_names <- c("age", "workclass", "fnlwgt", "education", "education_num",
"marital_status", "occupation", "relationship", "race", "sex",
"capital_gain", "capital_loss", "hours_per_week", "native_country",
"income")
names(data) <- col_names
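The source file marks unknown entries with "?" rather than NA, so checking for true NAs alone can miss them. Below is a minimal sketch of the kind of check mentioned above; it simply counts NAs and "?" placeholders per column.
# checking for missing values: true NAs and the "?" placeholders used in the source file
sum(is.na(data))
sapply(data, function(x) sum(x == "?"))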
Let's take a look at the data.
head(data)
## age workclass fnlwgt education education_num marital_status
## 1: 39 State-gov 77516 Bachelors 13 Never-married
## 2: 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse
## 3: 38 Private 215646 HS-grad 9 Divorced
## 4: 53 Private 234721 11th 7 Married-civ-spouse
## 5: 28 Private 338409 Bachelors 13 Married-civ-spouse
## 6: 37 Private 284582 Masters 14 Married-civ-spouse
## occupation relationship race sex capital_gain capital_loss
## 1: Adm-clerical Not-in-family White Male 2174 0
## 2: Exec-managerial Husband White Male 0 0
## 3: Handlers-cleaners Not-in-family White Male 0 0
## 4: Handlers-cleaners Husband Black Male 0 0
## 5: Prof-specialty Wife Black Female 0 0
## 6: Exec-managerial Wife White Female 0 0
## hours_per_week native_country income
## 1: 40 United-States <=50K
## 2: 13 United-States <=50K
## 3: 40 United-States <=50K
## 4: 40 United-States <=50K
## 5: 40 Cuba <=50K
## 6: 40 United-States <=50K
Most attributes will be used as is, some will be dropped, and others will be modified before being used. For example, the variable marital_status has seven different levels, and we can observe some patterns and similarities among them. We will create a new variable, lives_w_spouse, which divides the observations into those who live with their spouse and those who do not. You can think of it as simplifying that variable. I will use the terms variable and attribute interchangeably in this write-up.
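One way to see those similarities is to cross-tabulate marital_status against income; a quick sketch of that kind of table (output omitted here):
# cross-tabulate marital status against income to see which levels behave alike
table(data$marital_status, data$income)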
For the variable occupation, which has 15 levels, we will also collapse the different levels into a new binary variable, white_collar, which takes the value 1 for those who hold white-collar professions and 0 for everyone else. While it is important for the Census Bureau (where the data originates) to have detailed information about each participant in order to keep a precise record of the population, we are only interested in predicting income. For instance, an observation with "Exec-managerial" or "Protective-serv" under occupation is more likely to be high income. There is no need to keep all 15 levels when they can be simplified.
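The share of high earners within each occupation makes that grouping fairly clear; a short sketch of the kind of summary that motivates the white-collar grouping (output omitted):
# proportion of each occupation earning more than 50K (row-wise proportions)
prop.table(table(data$occupation, data$income), margin = 1)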
native_country has 42 levels. We've collapsed it to two under the variable dev_country, which assigns a value of 1 to those who come from India, China, or developed countries. While China and India are not developed countries, we have included them in that group mainly because most of the immigrants coming from those nations to the United States are highly educated, and their incomes typically surpass those of the average American.
The attribute race has been modified. Observations labeled White keep that label, and observations with any other label are now grouped under the label nonwhite. The reason is that there were not enough observations in each of the other groups, so it is best to combine them into a larger group.
education, with 16 levels, has been collapsed into educ_level, which has only five levels.
workclass has been simplified into worktype, which has four levels: gov, self, private, and other for government jobs, self-employed individuals, those who work in the private sector, and everything else.
The continuous attributes capital_gain and capital_loss have been combined into extra_income, a three-level factor (pos, neg, none) based on whether capital_gain minus capital_loss is positive, negative, or zero.
The attributes age, education_num, sex, and hours_per_week have not been modified.
We changed the attribute income into high_income, which also has two levels: 1 for income over 50K and 0 for income of 50K or less. The only difference this change makes is that a 0/1 coding is easier to tell apart at a glance, and it arguably takes slightly less storage space.
The attributes relationship and fnlwgt were dropped.
Below is the code for all of the feature engineering. A disclaimer: before settling on the new groupings I spent some time poking around the data, making tables, and plotting bar charts and histograms. For the sake of simplicity, all of that exploration was left out of this write-up.
# Feature engineering code
# creating new labels
wcollar_prof <- c("Exec-managerial", "Prof-specialty", "Protective-serv", "Sales", "Tech-support")
dev_nations <- c("Canada", "England", "France", "Germany", "Italy", "Ireland",
"Japan", "Portugal", "Taiwan", "India", "Holand-Netherlands", "China", "United-Sates")
gov <- c("Federal-gov", "Local-gov", "State-gov")
self <- c("Self-emp-inc", "Self-emp-not-inc")
noHS <- c("10th", "11th", "12th", "1st-4th", "5th-6th", "7th-8th", "9th", "Preschool")
HS <- c("HS-grad", "Some-college")
AS <- c("Prof-school", "Assoc-acdm", "Assoc-voc")
GS <- c("Masters", "Doctorate")
# creating new variables/updating some existing ones
data$white_collar <- ifelse(data$occupation %in% wcollar_prof, 1, 0)
data$dev_country <- ifelse(data$native_country %in% dev_nations, 1, 0)
data$high_income <- ifelse(data$income == ">50K", 1, 0)
data$lives_w_spouse <- ifelse(data$marital_status %in%
c("Married-AF-spouse", "Married-civ-spouse"), 1, 0)
data$worktype <- ifelse(data$workclass %in% gov, "gov",
ifelse(data$workclass %in% self, "self",
ifelse(data$workclass == "Private", "private", "other")))
data$educ_level <- ifelse(data$education %in% HS, "HS-grad",
ifelse(data$education %in% noHS, "no-HS",
ifelse(data$education %in% AS, "CC-grad",
ifelse(data$education %in% GS, "grad-school", "college"))))
data$extra_income <- data$capital_gain - data$capital_loss
data$extra_income <- ifelse(data$extra_income == 0, "none",
ifelse(data$extra_income < 0 , "neg", "pos"))
data$race <- ifelse(data$race == "White", "white", "nonwhite")
data$white_collar <- as.factor(data$white_collar)
data$dev_country <- as.factor(data$dev_country)
data$high_income <- as.factor(data$high_income)
data$lives_w_spouse <- as.factor(data$lives_w_spouse)
data$worktype <- as.factor(data$worktype)
data$educ_level <- as.factor(data$educ_level)
data$extra_income <- as.factor(data$extra_income)
# dropping previously modified attributes and attributes that will not be used
data <- data %>%
select(-occupation, -income, -native_country, -marital_status, -workclass,
-education, -fnlwgt, -capital_gain, -capital_loss, -relationship)
#rearranging columns so that high income is the last column
high_income <- data$high_income
data <- data %>% select(-high_income)
data <- cbind(data, high_income)
Let’s take another look at our data.
head(data)
## age education_num race sex hours_per_week white_collar
## 1: 39 13 white Male 40 0
## 2: 50 13 white Male 13 1
## 3: 38 9 white Male 40 0
## 4: 53 7 nonwhite Male 40 0
## 5: 28 13 nonwhite Female 40 1
## 6: 37 14 white Female 40 1
## dev_country lives_w_spouse worktype educ_level extra_income
## 1: 0 0 gov college pos
## 2: 0 1 self college none
## 3: 0 0 private HS-grad none
## 4: 0 1 private no-HS none
## 5: 0 1 private college none
## 6: 0 1 private grad-school none
## high_income
## 1: 0
## 2: 0
## 3: 0
## 4: 0
## 5: 0
## 6: 0
Now that we have spent some time creating new attributes, updating existing ones, and dropping others, we move on to the model-building part of our analysis. The code below is a function that splits our dataset into a training set and a test set. We'll fit three different classification models: a support vector machine, a naive Bayes classifier, and a logistic regression. For a better description of each of those models, check out this site: https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R
# function to split data frame
splitdf <- function(df, seed = NULL, train_fraction = 0.8) {
  if (train_fraction <= 0 | train_fraction >= 1) stop("Training fraction must be strictly between 0 and 1")
  if (!is.null(seed)) set.seed(seed)
  index <- 1:nrow(df)
  trainindex <- sample(index, trunc(length(index) * train_fraction))
  trainset <- df[trainindex, ]
  testset <- df[-trainindex, ]
  list(trainset = trainset, testset = testset)
}
# using the function to create the partition
splits <- splitdf(data, seed = 26587, train_fraction = 0.8)
Train <- as.data.frame(splits$trainset)
Test <- as.data.frame(splits$testset)
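A quick, optional sanity check one might run on the split, confirming the partition sizes and that the class balance is similar in both sets:
# optional sanity check on the 80/20 split
nrow(Train); nrow(Test)
prop.table(table(Train$high_income))
prop.table(table(Test$high_income))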
The SVM model achieves an overall accuracy of 83.3%, meaning that it correctly predicted 83.3% of the observations in the Test set. Let's dig deeper into those predictions (a sketch of how to read these rates off the confusion matrix follows the table below):
94.4% of those with low income were predicted correctly, so 5.6% of low-income people were incorrectly classified as high income.
49.8% of high-income individuals were correctly predicted as high income, while the remaining 50.2% were incorrectly predicted as low income.
Weakness: the model does poorly when predicting observations in the high-income category.
#SVM
svm_model <- svm(high_income ~ ., data = Train)
svm_p <- predict(svm_model, Test[,-12]) # column 12 is the response, high_income
accuracy <- sum(svm_p==Test$high_income)/length(svm_p)
accuracy
## [1] 0.833103
table(svm_p, Test$high_income)
##
## svm_p 0 1
## 0 4620 813
## 1 274 806
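The per-class rates quoted above can be read directly off this confusion matrix: column-wise proportions give the share of each true class that received each prediction. A short sketch (the same computation applies to the later models' confusion matrices):
# column-wise proportions: each column is a true class, each row a predicted class
prop.table(table(svm_p, Test$high_income), margin = 2)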
The naive Bayes model achieves an overall accuracy of 83.0%. Details:
88.6% of those with low income were classified correctly, while 11.4% of low-income people were incorrectly classified as high income.
65.9% of high-income individuals were correctly predicted as high income, and 34.1% were incorrectly classified as low income.
It performs better than the SVM when predicting high income, but still lags.
#naive Bayes
NB_model <- naiveBayes(high_income~., data=Train)
NB_p <- predict(NB_model, Test[,-12])
## Warning in data.matrix(newdata): NAs introduced by coercion
## Warning in data.matrix(newdata): NAs introduced by coercion
accuracy <- sum(NB_p==Test$high_income)/length(NB_p)
accuracy
## [1] 0.8298787
table(NB_p, Test$high_income)
##
## NB_p 0 1
## 0 4338 552
## 1 556 1067
The logistic regression model in R works slightly differently. Although the variable high_income is binary and was declared as a factor, the predict function outputs a probability for each observation. It is up to us to choose a threshold that determines the cutoff between what we consider low income and high income.
We use three different thresholds, all of which give 82.0%-83.6% overall accuracy, so the accuracy is comparable to that of the previous two models. For the sake of simplicity, we include the analysis for only one of those three thresholds (0.3). That value means any observation for which predict returns a probability of 0.3 or less is classified as low income (high_income = 0); the rest are classified as high income (high_income = 1). Details:
84.1% of low-income individuals were classified correctly, and 15.9% of low-income observations were classified as high income.
76.0% of high-income individuals were classified correctly, while only 24.0% were classified as low income.
This model performs the best of the three tested and shown in this write-up.
Please note that the other thresholds used were 0.5 and 0.57. Their accuracy when predicting high income was similar to that of the SVM and naive Bayes models (a compact way to compare all three cutoffs is sketched after the code below).
lr_model <- glm(high_income ~ ., data = Train, family=binomial)
lr_p <- predict(lr_model, Test[,-12], type="response") # predicted probabilities
lr_p1 <- ifelse(lr_p <= 0.3, 0, 1) # apply the 0.3 cutoff
accuracy <- sum(lr_p1==Test$high_income)/length(lr_p1)
accuracy
## [1] 0.8208199
table(lr_p1, Test$high_income)
##
## lr_p1 0 1
## 0 4116 389
## 1 778 1230
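As mentioned above, the other cutoffs tried were 0.5 and 0.57. A compact way to compare all three at once, using the probabilities already stored in lr_p:
# overall accuracy for each of the three cutoffs discussed in the text
cutoffs <- c(0.3, 0.5, 0.57)
sapply(cutoffs, function(t) mean(ifelse(lr_p <= t, 0, 1) == Test$high_income))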
Although the overall accuracy is virtually the same across the three classification models, there is a significant difference in how that accuracy is distributed across the two income levels. If I were to choose one model, I would choose the last one, the logistic regression model. Its accuracy is more evenly balanced between the two classes than that of the SVM and naive Bayes models. When choosing a model we should look closely at performance in every scenario rather than making a hasty decision based on overall accuracy alone.
Thanks for reading!