Introduction

The Problem statement

Classification is a supervised machine learning algorithm of assigning a class label based on known examples. There are many different types of classification tasks requiring machine learning approaches such as those to be discussed below. In this document, we will try to predict a person’s income based on census data.

The predictions will be either above 50K or below 50K. Since the dependent variable is of a categorical type, we have a classification problem at hand.

While there are numerous techniques of classification, we will apply the following three and compare the results:

1. Logistic regression

logistic regression is a regression with a qualitative dependent variable. The regression models the probability that the qualitative dependent variable will take a specific value. It uses a link function to link the regression over variables to natural probability range, [0,1].

2. Naive Bayes

Naive bayes is a supervised learning model used in classification that is based on assumption that all features of a class are independent of each other. This model is mainly used for problems with multiple classes and used in text classification.

3. Random Forest

Random forest is a supervised learning algorithm/model that uses a large number of ensembled decision trees on data samples. Each tree in the forest makes a prediction and the best solution is selected based on voting. This model is common for its simplicity and diversity hence, the combination of the multiple decision trees is likely to get a more accurate and stable prediction.

Data source

The source data comes from Census-Income (KDD) Data Set available Here The data set is divided into a train and test sets, each with 22 columns, some of which are categorical.

Data description

Exploratory data analysis

Before we get to the modeling itself, we will analyse the data at hand.

Data preparation

Installing the necessary packages

requiredPackages = c("e1071","randomForest","janitor","verification","olsrr","DescTools","caret","tibble","purrr","corrplot","corrplot","dbplyr","dplyr","readr", "ggplot2", "glmnet","class","kernlab","verification")

for(i in requiredPackages){if(!require(i,character.only = TRUE)) install.packages(i)}
for(i in requiredPackages){if(!require(i,character.only = TRUE)) library(i,character.only = TRUE)}

Load the data

data_census <- read.csv("CensusData_Train.csv")

Data summary

summary(data_census)
##       AAGE         ACLSWKR             AHRSPAY          AHGA          
##  Min.   : 0.00   Length:32723       Min.   : 0.00   Length:32723      
##  1st Qu.:15.00   Class :character   1st Qu.: 0.00   Class :character  
##  Median :33.00   Mode  :character   Median : 0.00   Mode  :character  
##  Mean   :34.45                      Mean   :15.44                     
##  3rd Qu.:49.00                      3rd Qu.:33.00                     
##  Max.   :90.00                      Max.   :51.00                     
##     AHSCOL            AMARITL             AMJOCC             ARACE          
##  Length:32723       Length:32723       Length:32723       Length:32723      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    AREORGN              ASEX              AUNMEM            AWKSTAT         
##  Length:32723       Length:32723       Length:32723       Length:32723      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     CAPGAIN           CAPLOSS            DIVVAL          FILESTAT        
##  Min.   :    0.0   Min.   :   0.00   Min.   :    0.0   Length:32723      
##  1st Qu.:    0.0   1st Qu.:   0.00   1st Qu.:    0.0   Class :character  
##  Median :    0.0   Median :   0.00   Median :    0.0   Mode  :character  
##  Mean   :  402.2   Mean   :  36.42   Mean   :  183.5                     
##  3rd Qu.:    0.0   3rd Qu.:   0.00   3rd Qu.:    0.0                     
##  Max.   :99999.0   Max.   :4608.00   Max.   :99999.0                     
##    PENATVTY           PRCITSHP             SEOTR           WKSWORK     
##  Length:32723       Length:32723       Min.   :0.0000   Min.   : 0.00  
##  Class :character   Class :character   1st Qu.:0.0000   1st Qu.: 0.00  
##  Mode  :character   Mode  :character   Median :0.0000   Median : 8.00  
##                                        Mean   :0.1768   Mean   :23.23  
##                                        3rd Qu.:0.0000   3rd Qu.:52.00  
##                                        Max.   :2.0000   Max.   :52.00  
##    PTOTVAL         
##  Length:32723      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Let’s check our target variable,PTOTVAL

summary(data_census$PTOTVAL)
##    Length     Class      Mode 
##     32723 character character

PTOTVAL is a ordinal catogorical variable with two levels (-50000,50000+).

# is.factor(data_census$PTOTVAL)
data_census$PTOTVAL <- factor(data_census$PTOTVAL,
                           levels = c(" 50000+.","-50000"), #levels
                           ordered = TRUE) # ordinal

levels(data_census$PTOTVAL)
## [1] " 50000+." "-50000"
table(data_census$PTOTVAL)
## 
##  50000+.   -50000 
##     1956    30767

Missing values and outliers

Let’s find if any entries are missing

sum(is.na(data_census))
## [1] 0
sapply(data_census, function(x) sum(is.na(x)))
##     AAGE  ACLSWKR  AHRSPAY     AHGA   AHSCOL  AMARITL   AMJOCC    ARACE 
##        0        0        0        0        0        0        0        0 
##  AREORGN     ASEX   AUNMEM  AWKSTAT  CAPGAIN  CAPLOSS   DIVVAL FILESTAT 
##        0        0        0        0        0        0        0        0 
## PENATVTY PRCITSHP    SEOTR  WKSWORK  PTOTVAL 
##        0        0        0        0        0

This is not quite true, becasue we know the missing values were represented by a question mark.

Let’s check by variable.

sapply(data_census, function(x) sum(x==" ?"))
##     AAGE  ACLSWKR  AHRSPAY     AHGA   AHSCOL  AMARITL   AMJOCC    ARACE 
##        0        0        0        0        0        0        0        0 
##  AREORGN     ASEX   AUNMEM  AWKSTAT  CAPGAIN  CAPLOSS   DIVVAL FILESTAT 
##        0        0        0        0        0        0        0        0 
## PENATVTY PRCITSHP    SEOTR  WKSWORK  PTOTVAL 
##      525        0        0        0        0

Variable PENATVTY has missing values. We will will remove those observations.

data_census <- data_census[!data_census$PENATVTY == " ?",]

Categorical and numeric variables

  1. Categorical variables
categorical_vars <- 
  sapply(data_census, is.character) %>% 
  which() %>% 
  names()

categorical_vars
##  [1] "ACLSWKR"  "AHGA"     "AHSCOL"   "AMARITL"  "AMJOCC"   "ARACE"   
##  [7] "AREORGN"  "ASEX"     "AUNMEM"   "AWKSTAT"  "FILESTAT" "PENATVTY"
## [13] "PRCITSHP"

Let’s change all categorical variables into factors.

data_census[categorical_vars] <- lapply(data_census[categorical_vars],factor)

Now we can check the list of levels for each variable.The output is hidden since too long.

lapply(data_census[categorical_vars],levels)

Setting contrast

options(contrasts = c("contr.treatment",  # for non-ordinal factors
                      "contr.treatment")) # for ordinal factors
  1. Numeric variables
numeric_vars <- 
  sapply(data_census, is.numeric) %>% 
  which() %>% 
  names()

numeric_vars
## [1] "AAGE"    "AHRSPAY" "CAPGAIN" "CAPLOSS" "DIVVAL"  "SEOTR"   "WKSWORK"

Classification methods

In this stage, we will apply the 3 classification techniques discussed in the introduction. Before that, however, we will need to select from the variables.

Variable selection

Before applying the classification techniques, we will once again go through the variables and select only those that will positively contribute to our model accuracy. Large size of variables can be beneficial to find the true causality relationship, however, it may make the calculations very expensive or sometimes overfitting may happen.

Selecting from numeric variables

Generally, it is advisable to remove variables with a single level (zero variance), variables which are linear combinations of each other and variables with very large correlation with others. Let’s do that one by one.

1. Removing variables with 1 level

sapply(data_census[, numeric_vars], 
        function(x) 
          unique(x) %>% 
          length()) %>% 
  sort()
##   SEOTR AHRSPAY WKSWORK CAPLOSS    AAGE CAPGAIN  DIVVAL 
##       3      52      53      88      91     120     649

No variables with 1 level.

2. Removing highly correlated variables

If we have two highly correlated variables, we remove one from the model. Because they supply redundant information, removing one of the correlated factors usually doesn’t drastically reduce the R-squared.

correlations <- cor(data_census[, numeric_vars],
    use = "pairwise.complete.obs")

Plotting the correlations

# using the 30 most correlated variables
corrplot.mixed(correlations,
               upper = "number",
               lower = "circle",
               tl.col = "black",
               tl.pos = "lt")

We see that there are variables WKSWORK ( weeks worked in a year) and AHRSPAY (wage per hour) have the highest correlation. We can remove one of them, however, since the correlation is only 75%, we keep both of them

  1. Removing linear combinations
( findLinearCombos(data_census[,numeric_vars] ) ->
    linearCombos )
## $linearCombos
## list()
## 
## $remove
## NULL

No linear combinations.

Selecting from categorical variables

We can use general regression model to apply logistic regression. Then we will compare the p - values.

We will need to prepare the dependent variable for this.

data_census$PTOTVAL <- ifelse(data_census$PTOTVAL=="-50000",0,1)
categorical_vars_all <- c(categorical_vars,"PTOTVAL") # add the dependent variable to the catergorical variables list. 
model <- lm(PTOTVAL ~ ., 
                 data = data_census[categorical_vars_all])

# summary(model)

Each level of a categorical variable is treated as a separate variable. However, all variables are removed together if insignificant.

Automated elimination

Backward elimination is general to specific approaching, where the model starts with all available variables, and then runs the model again and again, in each step removing 1 least significant variable. The process stops when a model with all variables significant is achieved. Other automated options are step-wise elimination and forward elimination.

ols_step_backward_p(model,
                    prem = 0.05,
                    # show progress
                    progress = FALSE) -> backward_selector

List of removed variables.

backward_selector$removed
## [1] "ARACE"    "PRCITSHP" "AREORGN"

Variables PRCITSHP, AREORGN,ARACE are selected for removal.

The remaining categorical variables

categorical_vars_all <- categorical_vars_all[!categorical_vars_all %in% backward_selector$removed]

Let’s combine the selected variables.

selectedvars <- c(categorical_vars_all ,numeric_vars)

Now that we have selected the variables, we can modeling.

Logistic Regression

  1. logit model
model_logit <- glm(PTOTVAL ~ ., 
                    family =  binomial(link = "logit"),
                    data = data_census[selectedvars])
  1. probit model
model_probit <- glm(PTOTVAL ~ ., 
                    family =  binomial(link = "probit"),
                    data = data_census[selectedvars])

Naive Bayes

The function trainControl generates parameters that further control how models are created, with possible values.

model_nb = naiveBayes(as.factor(PTOTVAL) ~ .,
        data = data_census[selectedvars])

Random Forest

model_rf = randomForest(PTOTVAL~., data_train, ntree=100, proximity=T)

Prediction on test data

Now we will bring on the test data and test each of the trained models on the test data. The prediction results will be compared to the actual data and also to the results of other models.

Let’s load the data.

data_test <- read.csv("CensusData_Test.csv")

Preparing the test data

We have to apply all the changes we made to the train data to the test data too.

  1. Remove rows with question mark for variable **
data_test <- data_test[!data_test$PENATVTY == " ?",]
  1. Change all categorical variables to factors.
data_test[categorical_vars] <- lapply(data_test[categorical_vars],factor)
  1. Change the dependent variable to binary levels (0,1)
data_test$PTOTVAL <- ifelse(data_test$PTOTVAL=="-50000",0,1)

Prediction

Logistic Regression

a.logit

predict_logit <- predict(model_logit,data_test[selectedvars],
                          type = "response")

b.probit

predict_probit <- predict(model_probit,data_test[selectedvars],
                          type = "response")

Naive Bayes

predict_nb <- predict(model_nb,data_test[selectedvars],
                          type = "raw")
predict_nb <- predict_nb[,2]

Random Forest

Let’s first get the train and test data to have the same class.

xtest <- rbind(data_census[1, ] , data_test)
xtest <- xtest[-1,]
predict_rf <- predict(model_rf,xtest,
                       type = "response")

Evaluation of model accuracy

The following function from lecture materials decodes the predicted probability values to binary levels,(0,1), and then calculates the model accuracy statistics from the confusion matrix it creates.

summary_binary <- function(predicted_probs,
                           real,
                           cutoff = 0.5,
                           level_positive = "1",
                           level_negative = "0") {

  ctable <- confusionMatrix(as.factor(ifelse(predicted_probs > cutoff, 
                                             level_positive, 
                                             level_negative)), 
                            real, 
                            level_positive) 

  stats <- round(c(ctable$overall[1],
                   ctable$byClass[c(1:4, 7, 11)]),5)
  return(stats)
}

We analyse the predictions using the function above. Let’s first put the predictions all into dataframe.

predictions <- data.frame(logit = predict_logit,probit = predict_probit,
                             nb = predict_nb,rf = predict_rf)

The summary_binary() requires real values in factor format for confusionMatrix calculation. The following code chunk does that.

data_test$PTOTVAL <- factor(data_test$PTOTVAL)

Here we call the summary_binary() function for every pair of prediction and real value. The results will be compiled into a single dataframe for easier comparison.

stats_logit <- summary_binary(predictions$logit,data_test$PTOTVAL) #1. logit
stats_probit <- summary_binary(predictions$probit,data_test$PTOTVAL) #2. probit
stats_nb <- summary_binary(predictions$nb,data_test$PTOTVAL) #3. KNN
stats_rf <- summary_binary(predictions$rf,data_test$PTOTVAL) #4. SVM

To compare the results in a table,

metrics <- rbind(stats_logit,stats_probit,stats_nb,stats_rf)
row.names(metrics) <- c("Logit","Probit","Naive Bayes","Random Forest")
metrics
##               Accuracy Sensitivity Specificity Pos Pred Value Neg Pred Value
## Logit          0.83566     0.34480     0.86847        0.14912        0.95199
## Probit         0.83685     0.35747     0.86890        0.15418        0.95289
## Naive Bayes    0.83441     0.82534     0.83502        0.25062        0.98621
## Random Forest  0.94216     0.10769     0.99794        0.77778        0.94360
##                    F1 Balanced Accuracy
## Logit         0.20820           0.60663
## Probit        0.21543           0.61318
## Naive Bayes   0.38449           0.83018
## Random Forest 0.18919           0.55282

Comparison using ROC curve

An ROC curve plots TPR vs. FPR at different classification thresholds. The area under ROC curve,AUC,can be interpreted as the probability that the model ranks a random positive example more highly than a random negative example. The larger AUC values indicate better models.

Logit

data_test$PTOTVAL <- ifelse(data_test$PTOTVAL=="0",0,1)
roc.plot(data_test$PTOTVAL,
         predictions$logit)

Area under ROC

roc.area(data_test$PTOTVAL,
         predictions$logit)$A
## [1] 0.8168358

Probit

roc.plot(data_test$PTOTVAL,
         predictions$probit)

Area under ROC

roc.area(data_test$PTOTVAL,
         predictions$probit)$A
## [1] 0.819343

Naive Bayes

roc.plot(data_test$PTOTVAL,
         predictions$nb)

Area under ROC

roc.area(data_test$PTOTVAL,
         predictions$nb)$A
## [1] 0.8995487

Random Forest

roc.plot(data_test$PTOTVAL,
         predictions$rf)

Area under ROC

roc.area(data_test$PTOTVAL,
         predictions$rf)$A
## [1] 0.8325236

Conclusion

After training and running 3 different classification models, we can depict from the metrics that the Random Forest model has the highest accuracy rate at 94% with the rest at an average of 84%. Naive Bayes however, has the lowest ranking accuracy rate at 83% but is the fastest running model amongst the 3 classifying models used.

Logistic regression models (logit and probit) perform the worst, however, they run fast. The random forest algorithm achieves the highest classification accuracy, but takes the longest time and memory space. Naive Bayes, therefore works the best for this dataset.