The Problem statement
Classification is a supervised machine learning algorithm of assigning a class label based on known examples. There are many different types of classification tasks requiring machine learning approaches such as those to be discussed below. In this document, we will try to predict a person’s income based on census data.
The predictions will be either above 50K or below 50K. Since the dependent variable is of a categorical type, we have a classification problem at hand.
While there are numerous techniques of classification, we will apply the following three and compare the results:
1. Logistic regression
logistic regression is a regression with a qualitative dependent variable. The regression models the probability that the qualitative dependent variable will take a specific value. It uses a link function to link the regression over variables to natural probability range, [0,1].
2. Naive Bayes
Naive bayes is a supervised learning model used in classification that is based on assumption that all features of a class are independent of each other. This model is mainly used for problems with multiple classes and used in text classification.
3. Random Forest
Random forest is a supervised learning algorithm/model that uses a large number of ensembled decision trees on data samples. Each tree in the forest makes a prediction and the best solution is selected based on voting. This model is common for its simplicity and diversity hence, the combination of the multiple decision trees is likely to get a more accurate and stable prediction.
Data source
The source data comes from Census-Income (KDD) Data Set available Here The data set is divided into a train and test sets, each with 22 columns, some of which are categorical.
Data description
Before we get to the modeling itself, we will analyse the data at hand.
Installing the necessary packages
requiredPackages = c("e1071","randomForest","janitor","verification","olsrr","DescTools","caret","tibble","purrr","corrplot","corrplot","dbplyr","dplyr","readr", "ggplot2", "glmnet","class","kernlab","verification")
for(i in requiredPackages){if(!require(i,character.only = TRUE)) install.packages(i)}
for(i in requiredPackages){if(!require(i,character.only = TRUE)) library(i,character.only = TRUE)}
Load the data
data_census <- read.csv("CensusData_Train.csv")
Data summary
summary(data_census)
## AAGE ACLSWKR AHRSPAY AHGA
## Min. : 0.00 Length:32723 Min. : 0.00 Length:32723
## 1st Qu.:15.00 Class :character 1st Qu.: 0.00 Class :character
## Median :33.00 Mode :character Median : 0.00 Mode :character
## Mean :34.45 Mean :15.44
## 3rd Qu.:49.00 3rd Qu.:33.00
## Max. :90.00 Max. :51.00
## AHSCOL AMARITL AMJOCC ARACE
## Length:32723 Length:32723 Length:32723 Length:32723
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## AREORGN ASEX AUNMEM AWKSTAT
## Length:32723 Length:32723 Length:32723 Length:32723
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## CAPGAIN CAPLOSS DIVVAL FILESTAT
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Length:32723
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 0.0 Class :character
## Median : 0.0 Median : 0.00 Median : 0.0 Mode :character
## Mean : 402.2 Mean : 36.42 Mean : 183.5
## 3rd Qu.: 0.0 3rd Qu.: 0.00 3rd Qu.: 0.0
## Max. :99999.0 Max. :4608.00 Max. :99999.0
## PENATVTY PRCITSHP SEOTR WKSWORK
## Length:32723 Length:32723 Min. :0.0000 Min. : 0.00
## Class :character Class :character 1st Qu.:0.0000 1st Qu.: 0.00
## Mode :character Mode :character Median :0.0000 Median : 8.00
## Mean :0.1768 Mean :23.23
## 3rd Qu.:0.0000 3rd Qu.:52.00
## Max. :2.0000 Max. :52.00
## PTOTVAL
## Length:32723
## Class :character
## Mode :character
##
##
##
Let’s check our target variable,PTOTVAL
summary(data_census$PTOTVAL)
## Length Class Mode
## 32723 character character
PTOTVAL is a ordinal catogorical variable with two levels (-50000,50000+).
# is.factor(data_census$PTOTVAL)
data_census$PTOTVAL <- factor(data_census$PTOTVAL,
levels = c(" 50000+.","-50000"), #levels
ordered = TRUE) # ordinal
levels(data_census$PTOTVAL)
## [1] " 50000+." "-50000"
table(data_census$PTOTVAL)
##
## 50000+. -50000
## 1956 30767
Let’s find if any entries are missing
sum(is.na(data_census))
## [1] 0
sapply(data_census, function(x) sum(is.na(x)))
## AAGE ACLSWKR AHRSPAY AHGA AHSCOL AMARITL AMJOCC ARACE
## 0 0 0 0 0 0 0 0
## AREORGN ASEX AUNMEM AWKSTAT CAPGAIN CAPLOSS DIVVAL FILESTAT
## 0 0 0 0 0 0 0 0
## PENATVTY PRCITSHP SEOTR WKSWORK PTOTVAL
## 0 0 0 0 0
This is not quite true, becasue we know the missing values were represented by a question mark.
Let’s check by variable.
sapply(data_census, function(x) sum(x==" ?"))
## AAGE ACLSWKR AHRSPAY AHGA AHSCOL AMARITL AMJOCC ARACE
## 0 0 0 0 0 0 0 0
## AREORGN ASEX AUNMEM AWKSTAT CAPGAIN CAPLOSS DIVVAL FILESTAT
## 0 0 0 0 0 0 0 0
## PENATVTY PRCITSHP SEOTR WKSWORK PTOTVAL
## 525 0 0 0 0
Variable PENATVTY has missing values. We will will remove those observations.
data_census <- data_census[!data_census$PENATVTY == " ?",]
Categorical and numeric variables
categorical_vars <-
sapply(data_census, is.character) %>%
which() %>%
names()
categorical_vars
## [1] "ACLSWKR" "AHGA" "AHSCOL" "AMARITL" "AMJOCC" "ARACE"
## [7] "AREORGN" "ASEX" "AUNMEM" "AWKSTAT" "FILESTAT" "PENATVTY"
## [13] "PRCITSHP"
Let’s change all categorical variables into factors.
data_census[categorical_vars] <- lapply(data_census[categorical_vars],factor)
Now we can check the list of levels for each variable.The output is hidden since too long.
lapply(data_census[categorical_vars],levels)
Setting contrast
options(contrasts = c("contr.treatment", # for non-ordinal factors
"contr.treatment")) # for ordinal factors
numeric_vars <-
sapply(data_census, is.numeric) %>%
which() %>%
names()
numeric_vars
## [1] "AAGE" "AHRSPAY" "CAPGAIN" "CAPLOSS" "DIVVAL" "SEOTR" "WKSWORK"
In this stage, we will apply the 3 classification techniques discussed in the introduction. Before that, however, we will need to select from the variables.
Before applying the classification techniques, we will once again go through the variables and select only those that will positively contribute to our model accuracy. Large size of variables can be beneficial to find the true causality relationship, however, it may make the calculations very expensive or sometimes overfitting may happen.
Generally, it is advisable to remove variables with a single level (zero variance), variables which are linear combinations of each other and variables with very large correlation with others. Let’s do that one by one.
1. Removing variables with 1 level
sapply(data_census[, numeric_vars],
function(x)
unique(x) %>%
length()) %>%
sort()
## SEOTR AHRSPAY WKSWORK CAPLOSS AAGE CAPGAIN DIVVAL
## 3 52 53 88 91 120 649
No variables with 1 level.
2. Removing highly correlated variables
If we have two highly correlated variables, we remove one from the model. Because they supply redundant information, removing one of the correlated factors usually doesn’t drastically reduce the R-squared.
correlations <- cor(data_census[, numeric_vars],
use = "pairwise.complete.obs")
Plotting the correlations
# using the 30 most correlated variables
corrplot.mixed(correlations,
upper = "number",
lower = "circle",
tl.col = "black",
tl.pos = "lt")
We see that there are variables WKSWORK ( weeks worked in a year) and AHRSPAY (wage per hour) have the highest correlation. We can remove one of them, however, since the correlation is only 75%, we keep both of them
( findLinearCombos(data_census[,numeric_vars] ) ->
linearCombos )
## $linearCombos
## list()
##
## $remove
## NULL
No linear combinations.
We can use general regression model to apply logistic regression. Then we will compare the p - values.
We will need to prepare the dependent variable for this.
data_census$PTOTVAL <- ifelse(data_census$PTOTVAL=="-50000",0,1)
categorical_vars_all <- c(categorical_vars,"PTOTVAL") # add the dependent variable to the catergorical variables list.
model <- lm(PTOTVAL ~ .,
data = data_census[categorical_vars_all])
# summary(model)
Each level of a categorical variable is treated as a separate variable. However, all variables are removed together if insignificant.
Automated elimination
Backward elimination is general to specific approaching, where the model starts with all available variables, and then runs the model again and again, in each step removing 1 least significant variable. The process stops when a model with all variables significant is achieved. Other automated options are step-wise elimination and forward elimination.
ols_step_backward_p(model,
prem = 0.05,
# show progress
progress = FALSE) -> backward_selector
List of removed variables.
backward_selector$removed
## [1] "ARACE" "PRCITSHP" "AREORGN"
Variables PRCITSHP, AREORGN,ARACE are selected for removal.
The remaining categorical variables
categorical_vars_all <- categorical_vars_all[!categorical_vars_all %in% backward_selector$removed]
Let’s combine the selected variables.
selectedvars <- c(categorical_vars_all ,numeric_vars)
Now that we have selected the variables, we can modeling.
model_logit <- glm(PTOTVAL ~ .,
family = binomial(link = "logit"),
data = data_census[selectedvars])
model_probit <- glm(PTOTVAL ~ .,
family = binomial(link = "probit"),
data = data_census[selectedvars])
The function trainControl generates parameters that further control how models are created, with possible values.
model_nb = naiveBayes(as.factor(PTOTVAL) ~ .,
data = data_census[selectedvars])
model_rf = randomForest(PTOTVAL~., data_train, ntree=100, proximity=T)
Now we will bring on the test data and test each of the trained models on the test data. The prediction results will be compared to the actual data and also to the results of other models.
Let’s load the data.
data_test <- read.csv("CensusData_Test.csv")
We have to apply all the changes we made to the train data to the test data too.
data_test <- data_test[!data_test$PENATVTY == " ?",]
data_test[categorical_vars] <- lapply(data_test[categorical_vars],factor)
data_test$PTOTVAL <- ifelse(data_test$PTOTVAL=="-50000",0,1)
Logistic Regression
a.logit
predict_logit <- predict(model_logit,data_test[selectedvars],
type = "response")
b.probit
predict_probit <- predict(model_probit,data_test[selectedvars],
type = "response")
Naive Bayes
predict_nb <- predict(model_nb,data_test[selectedvars],
type = "raw")
predict_nb <- predict_nb[,2]
Random Forest
Let’s first get the train and test data to have the same class.
xtest <- rbind(data_census[1, ] , data_test)
xtest <- xtest[-1,]
predict_rf <- predict(model_rf,xtest,
type = "response")
The following function from lecture materials decodes the predicted probability values to binary levels,(0,1), and then calculates the model accuracy statistics from the confusion matrix it creates.
summary_binary <- function(predicted_probs,
real,
cutoff = 0.5,
level_positive = "1",
level_negative = "0") {
ctable <- confusionMatrix(as.factor(ifelse(predicted_probs > cutoff,
level_positive,
level_negative)),
real,
level_positive)
stats <- round(c(ctable$overall[1],
ctable$byClass[c(1:4, 7, 11)]),5)
return(stats)
}
We analyse the predictions using the function above. Let’s first put the predictions all into dataframe.
predictions <- data.frame(logit = predict_logit,probit = predict_probit,
nb = predict_nb,rf = predict_rf)
The summary_binary() requires real values in factor format for confusionMatrix calculation. The following code chunk does that.
data_test$PTOTVAL <- factor(data_test$PTOTVAL)
Here we call the summary_binary() function for every pair of prediction and real value. The results will be compiled into a single dataframe for easier comparison.
stats_logit <- summary_binary(predictions$logit,data_test$PTOTVAL) #1. logit
stats_probit <- summary_binary(predictions$probit,data_test$PTOTVAL) #2. probit
stats_nb <- summary_binary(predictions$nb,data_test$PTOTVAL) #3. KNN
stats_rf <- summary_binary(predictions$rf,data_test$PTOTVAL) #4. SVM
To compare the results in a table,
metrics <- rbind(stats_logit,stats_probit,stats_nb,stats_rf)
row.names(metrics) <- c("Logit","Probit","Naive Bayes","Random Forest")
metrics
## Accuracy Sensitivity Specificity Pos Pred Value Neg Pred Value
## Logit 0.83566 0.34480 0.86847 0.14912 0.95199
## Probit 0.83685 0.35747 0.86890 0.15418 0.95289
## Naive Bayes 0.83441 0.82534 0.83502 0.25062 0.98621
## Random Forest 0.94216 0.10769 0.99794 0.77778 0.94360
## F1 Balanced Accuracy
## Logit 0.20820 0.60663
## Probit 0.21543 0.61318
## Naive Bayes 0.38449 0.83018
## Random Forest 0.18919 0.55282
An ROC curve plots TPR vs. FPR at different classification thresholds. The area under ROC curve,AUC,can be interpreted as the probability that the model ranks a random positive example more highly than a random negative example. The larger AUC values indicate better models.
data_test$PTOTVAL <- ifelse(data_test$PTOTVAL=="0",0,1)
roc.plot(data_test$PTOTVAL,
predictions$logit)
Area under ROC
roc.area(data_test$PTOTVAL,
predictions$logit)$A
## [1] 0.8168358
roc.plot(data_test$PTOTVAL,
predictions$probit)
Area under ROC
roc.area(data_test$PTOTVAL,
predictions$probit)$A
## [1] 0.819343
roc.plot(data_test$PTOTVAL,
predictions$nb)
Area under ROC
roc.area(data_test$PTOTVAL,
predictions$nb)$A
## [1] 0.8995487
roc.plot(data_test$PTOTVAL,
predictions$rf)
Area under ROC
roc.area(data_test$PTOTVAL,
predictions$rf)$A
## [1] 0.8325236
After training and running 3 different classification models, we can depict from the metrics that the Random Forest model has the highest accuracy rate at 94% with the rest at an average of 84%. Naive Bayes however, has the lowest ranking accuracy rate at 83% but is the fastest running model amongst the 3 classifying models used.
Logistic regression models (logit and probit) perform the worst, however, they run fast. The random forest algorithm achieves the highest classification accuracy, but takes the longest time and memory space. Naive Bayes, therefore works the best for this dataset.