Predicting Income With Classification: NB, Decision Tree & Random Forest

Introduction

Welcome to the Income Data Analysis and Classification Project! In this project, our main goal is to analyze income data and build a predictive model using various classification algorithms. Our aim is to explore different approaches and identify potential improvements in predicting income levels.

The dataset we will be working with contains information about individuals, including their age, education, occupation, marital status, and more. We have a total of 32,561 observations and 15 variables, including the target variable “income,” which indicates whether an individual’s income is less than or equal to $50,000 per year or more.

To achieve our objective, we will be utilizing three classification algorithms: Naive Bayes, Decision Tree, and Random Forest. Each of these algorithms has its own unique characteristics and assumptions, which we will discuss in detail.

Let’s dive into the analysis, understand the intricacies of each algorithm, and work towards building an accurate and robust predictive model.

You can find the dataset here.

Objectives

By analyzing the income data and building predictive models using Naive Bayes, Decision Tree, and Random Forest, we aim to identify the best-performing algorithm and potentially uncover areas for improvement in predicting income levels. This project will not only enhance our understanding of classification algorithms but also provide valuable insights into the factors influencing income. Here are the bullet-point objectives for the project:

  • Analyze the income data to gain insights into the distribution and characteristics of the variables.
  • Preprocess the data by handling missing values, encoding categorical variables, and scaling numerical variables as required.
  • Split the dataset into training and testing sets to evaluate the performance of the classification algorithms.
  • Implement the Naive Bayes algorithm and train a model to predict income levels.
    • Evaluate the performance of the Naive Bayes model using appropriate evaluation metrics such as accuracy, precision, recall, and F1 score.
    • Understand the underlying assumptions of Naive Bayes and discuss its strengths and limitations.
  • Implement the Decision Tree algorithm and train a model to predict income levels.
    • Explore different decision tree criteria, such as Gini impurity or information gain, and fine-tune the model using appropriate hyperparameters.
    • Evaluate the performance of the Decision Tree model and compare it with the Naive Bayes model.
    • Understand the interpretability of decision trees and discuss their advantages and disadvantages.
  • Implement the Random Forest algorithm and train an ensemble of decision trees to predict income levels.
    • Discuss the concept of ensemble learning and the benefits of using Random Forest.
    • Evaluate the performance of the Random Forest model and compare it with the previous models.
  • Identify the best-performing classification algorithm for predicting income levels based on the evaluation results.
  • Discuss the overall findings, including the strengths and weaknesses of each model, and suggest potential areas for improvement.
  • Provide recommendations for further enhancements or feature engineering techniques that can improve the accuracy of the predictive models.
These objectives will guide us throughout the project and help us achieve our goals of analyzing the income data and building predictive models using the Naive Bayes, Decision Tree, and Random Forest algorithms.

Dataset Overview

Here's a brief explanation of each variable in the dataset:

  • age: Represents the age of the individual (numeric variable).
  • workclass: Indicates the type of work class or employment status of the individual - (categorical variable).
  • fnlwgt: Stands for “final weight” and represents the sampling weight assigned to each observation (numeric variable).
  • education: Denotes the highest level of education completed by the individual (categorical variable).
  • education.num: Corresponds to the numerical representation of the education variable (numeric variable).
  • marital.status: Indicates the marital status of the individual (categorical variable).
  • occupation: Represents the occupation of the individual (categorical variable).
  • relationship: Describes the individual’s role in the family (categorical variable).
  • race: Specifies the race of the individual (categorical variable).
  • sex: Denotes the gender of the individual (categorical variable).
  • capital.gain: Represents the capital gains of the individual (numeric variable).
  • capital.loss: Represents the capital losses of the individual (numeric variable).
  • hours.per.week: Indicates the number of hours worked per week by the individual (numeric variable).
  • native.country: Specifies the native country of the individual (categorical variable).
  • income: Indicates whether the individual’s annual income is at most $50,000 (“<=50K”) or more than $50,000 (“>50K”) (categorical variable).

These explanations should give you a general understanding of each variable in the dataset.


Data Preparation

Import Packages & Dataset

Import Packages:

library(dplyr) # for data wrangling
library(ggplot2) # to visualize data
library(gridExtra) # to display multiple graphs
library(tidymodels) # to build tidy models
library(caret) # to pre-process data
library(tibble) # for creating and manipulating tabular data structures
library(animation) # for creating animated visualizations
library(GGally)      # Extension to ggplot2 for exploratory data analysis

#Naive Bayes
library(e1071) # for implementing Naive Bayes algorithm
library(pROC) # for computing ROC curves and other evaluation metrics
library(ROSE) # for handling imbalanced datasets using oversampling techniques

#Decision Tree
library(partykit) # for constructing and visualizing decision trees
library(rpart) # for building decision tree models
library(rpart.plot) # for visualizing decision trees using plots

#Random Forest
library(randomForest) # for building random forest models and ensembles

Import Dataset

Ds_income <- read.csv("data_input/adult.csv")
rmarkdown::paged_table(Ds_income)

Data Wrangling

glimpse(Ds_income)
#> Rows: 32,561
#> Columns: 15
#> $ age            <int> 90, 82, 66, 54, 41, 34, 38, 74, 68, 41, 45, 38, 52, 32,…
#> $ workclass      <chr> "?", "Private", "?", "Private", "Private", "Private", "…
#> $ fnlwgt         <int> 77053, 132870, 186061, 140359, 264663, 216864, 150601, …
#> $ education      <chr> "HS-grad", "HS-grad", "Some-college", "7th-8th", "Some-…
#> $ education.num  <int> 9, 9, 10, 4, 10, 9, 6, 16, 9, 10, 16, 15, 13, 14, 16, 1…
#> $ marital.status <chr> "Widowed", "Widowed", "Widowed", "Divorced", "Separated…
#> $ occupation     <chr> "?", "Exec-managerial", "?", "Machine-op-inspct", "Prof…
#> $ relationship   <chr> "Not-in-family", "Not-in-family", "Unmarried", "Unmarri…
#> $ race           <chr> "White", "White", "Black", "White", "White", "White", "…
#> $ sex            <chr> "Female", "Female", "Female", "Female", "Female", "Fema…
#> $ capital.gain   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ capital.loss   <int> 4356, 4356, 4356, 3900, 3900, 3770, 3770, 3683, 3683, 3…
#> $ hours.per.week <int> 40, 18, 40, 40, 40, 45, 40, 20, 40, 60, 35, 45, 20, 55,…
#> $ native.country <chr> "United-States", "United-States", "United-States", "Uni…
#> $ income         <chr> "<=50K", "<=50K", "<=50K", "<=50K", "<=50K", "<=50K", "…

Removing Redundant Variables

Removing redundant variables is an important step in data preprocessing to ensure the accuracy and efficiency of our analysis. In this dataset, we have identified three variables that can be considered redundant: “fnlwgt,” “education,” and “relationship.”

  • fnlwgt: The variable "fnlwgt" represents the final weight, but its calculation and exact meaning are not provided in the dataset documentation. Without a clear understanding of how it was derived or its significance in predicting income levels, it becomes difficult to incorporate this variable into our analysis.

  • education: The dataset includes both "education" and "education.num" variables. Upon examination, it becomes apparent that "education.num" already captures the numerical representation of the education level, rendering the “education” variable redundant.

  • relationship: The variable “relationship” represents the role of an individual within the family structure. However, the “marital.status” variable already provides information about the person’s marital status, which inherently includes their relationship to others. By removing these redundant variables, we can streamline our dataset and focus on the key variables that contribute significantly to predicting income levels.

# Remove redundant variables from the dataset
Ds_income <- subset(Ds_income, select = c(-fnlwgt, -education, -relationship))

Finding NA Values

Replacing “?” with “NA” is an essential step in data preprocessing to ensure consistency and facilitate data analysis. In the given dataset, “?” is used to represent missing or unknown values. However, many data analysis and modeling techniques in R treat “NA” as the standard symbol to denote missing values.

By replacing “?” with “NA” , we achieve uniformity and ensure compatibility with R’s built-in functions and packages specifically designed to handle missing values. This consistency allows us to utilize powerful tools for imputation, statistical analysis, and predictive modeling, as well as avoiding potential errors or misinterpretation when working with missing data.

# Replace "?" with NA in the dataset
Ds_income[Ds_income == "?"] <- NA
# Calculate the total and percentage of missing values in each column
missing_data <- data.frame(
  Column = names(Ds_income),
  Total = colSums(is.na(Ds_income)),
  Percent = colMeans(is.na(Ds_income)) * 100
)

# Create a tibble with the missing data information
missing_table <- as_tibble(missing_data)

# Print the missing data table
print(missing_table)
#> # A tibble: 12 × 3
#>    Column         Total Percent
#>    <chr>          <dbl>   <dbl>
#>  1 age                0    0   
#>  2 workclass       1836    5.64
#>  3 education.num      0    0   
#>  4 marital.status     0    0   
#>  5 occupation      1843    5.66
#>  6 race               0    0   
#>  7 sex                0    0   
#>  8 capital.gain       0    0   
#>  9 capital.loss       0    0   
#> 10 hours.per.week     0    0   
#> 11 native.country   583    1.79
#> 12 income             0    0

Because the percentage of missing values is small, I prefer to simply drop the rows that contain NA in the “workclass” and “occupation” variables.

# Drop rows with missing values in "workclass" and "occupation" variables
Ds_income <- Ds_income[complete.cases(Ds_income[c("workclass", "occupation")]), ]

Changing Data Types

glimpse(x = Ds_income)
#> Rows: 30,718
#> Columns: 12
#> $ age            <int> 82, 54, 41, 34, 38, 74, 68, 41, 45, 38, 52, 32, 46, 45,…
#> $ workclass      <chr> "Private", "Private", "Private", "Private", "Private", …
#> $ education.num  <int> 9, 4, 10, 9, 6, 16, 9, 10, 16, 15, 13, 14, 15, 7, 14, 1…
#> $ marital.status <chr> "Widowed", "Divorced", "Separated", "Divorced", "Separa…
#> $ occupation     <chr> "Exec-managerial", "Machine-op-inspct", "Prof-specialty…
#> $ race           <chr> "White", "White", "White", "White", "White", "White", "…
#> $ sex            <chr> "Female", "Female", "Female", "Female", "Male", "Female…
#> $ capital.gain   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ capital.loss   <int> 4356, 3900, 3900, 3770, 3770, 3683, 3683, 3004, 3004, 2…
#> $ hours.per.week <int> 18, 40, 40, 45, 40, 20, 40, 60, 35, 45, 20, 55, 40, 76,…
#> $ native.country <chr> "United-States", "United-States", "United-States", "Uni…
#> $ income         <chr> "<=50K", "<=50K", "<=50K", "<=50K", "<=50K", ">50K", "<…

After analysing the data types, I decided to convert several variables to factors: "race", "sex", "income", "workclass", "marital.status", "native.country", and "occupation".

Ds_income <- Ds_income %>% 
  mutate_at(c("race", "sex", "income", "workclass", "marital.status", "native.country", "occupation"), as.factor)

# Recode the "income" column as 0/1 (">50K" = 1, "<=50K" = 0); note this turns the column numeric again
Ds_income$income <- ifelse(Ds_income$income == ">50K", 1, 0)

Exploratory Data Analysis

Check distribution

# explore with summary
summary(Ds_income)
#>       age                   workclass     education.num  
#>  Min.   :17.00   Federal-gov     :  960   Min.   : 1.00  
#>  1st Qu.:28.00   Local-gov       : 2093   1st Qu.: 9.00  
#>  Median :37.00   Private         :22696   Median :10.00  
#>  Mean   :38.44   Self-emp-inc    : 1116   Mean   :10.13  
#>  3rd Qu.:47.00   Self-emp-not-inc: 2541   3rd Qu.:13.00  
#>  Max.   :90.00   State-gov       : 1298   Max.   :16.00  
#>                  Without-pay     :   14                  
#>                marital.status            occupation  
#>  Divorced             : 4258   Prof-specialty :4140  
#>  Married-AF-spouse    :   21   Craft-repair   :4099  
#>  Married-civ-spouse   :14339   Exec-managerial:4066  
#>  Married-spouse-absent:  389   Adm-clerical   :3770  
#>  Never-married        : 9912   Sales          :3650  
#>  Separated            :  959   Other-service  :3295  
#>  Widowed              :  840   (Other)        :7698  
#>                  race           sex         capital.gain    capital.loss    
#>  Amer-Indian-Eskimo:  286   Female: 9930   Min.   :    0   Min.   :   0.00  
#>  Asian-Pac-Islander:  974   Male  :20788   1st Qu.:    0   1st Qu.:   0.00  
#>  Black             : 2909                  Median :    0   Median :   0.00  
#>  Other             :  248                  Mean   : 1106   Mean   :  88.91  
#>  White             :26301                  3rd Qu.:    0   3rd Qu.:   0.00  
#>                                            Max.   :99999   Max.   :4356.00  
#>                                                                             
#>  hours.per.week        native.country      income     
#>  Min.   : 1.00   United-States:27504   Min.   :0.000  
#>  1st Qu.:40.00   Mexico       :  610   1st Qu.:0.000  
#>  Median :40.00   Philippines  :  188   Median :0.000  
#>  Mean   :40.95   Germany      :  128   Mean   :0.249  
#>  3rd Qu.:45.00   Puerto-Rico  :  109   3rd Qu.:0.000  
#>  Max.   :99.00   (Other)      : 1623   Max.   :1.000  
#>                  NA's         :  556

Insights:

  • The age distribution of the individuals in the Ds_income dataset ranges from 17 to 90, with a median age of 37 and a mean age of 38.44.

  • The most common workclass is “Private” with 22,696 individuals, followed by “Local-gov” with 2,093 individuals, among other categories.

  • The education.num variable represents the level of education, ranging from 1 to 16, with a median of 10 and a mean of 10.13.

  • The dataset includes various marital statuses, with “Married-civ-spouse” being the most common, followed by “Never-married,” among other categories.

  • The dataset contains individuals with different occupations, with “Prof-specialty” and “Craft-repair” being the most common occupations.

  • The dataset comprises individuals from different racial backgrounds, with “White” and “Black” being the most common races.

  • The dataset consists of both male (20,788) and female (9,930) individuals.

  • The capital.gain variable ranges from 0 to 99,999, with a median of 0 and a mean of approximately 1,106.

  • The capital.loss variable ranges from 0 to 4,356, with a median of 0 and a mean of approximately 88.91.

  • The hours.per.week variable ranges from 1 to 99, with a median and mean of approximately 40 hours per week.

  • The dataset includes individuals from various countries, with “United-States” being the most common.

  • Approximately 75.1% of individuals in the dataset have an income below the threshold, while approximately 24.9% have an income above the threshold.

Checking outlier

# Load the necessary libraries
library(ggplot2)
library(dplyr)
library(tidyr)

# Select numerical variables
numerical_vars <- Ds_income %>%
  select_if(is.numeric) %>%
  names()

# Create box plot for numerical variables
gathered_data <- Ds_income %>%
  select(all_of(numerical_vars)) %>%
  gather(variable, value)

ggplot(gathered_data, aes(x = variable, y = value)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  labs(title = "Box Plot of Numerical Variables",
       x = "Variable",
       y = "Value") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Nothing here seems crucial to me. Because the overall number of observations is very large and capital.gain is the only variable with notable outliers, I don’t think the outliers will have much of an impact, so I am going to leave them as they are.

Checking Correlation

ggcorr(Ds_income, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

As we can see, no pair of variables has a strong correlation (above 0.5) with each other.

Check class-imbalance

prop.table(table(Ds_income$income))
#> 
#>         0         1 
#> 0.7509603 0.2490397

Based on this analysis and the visualization, the classes are imbalanced, so we will need a resampling technique to balance them. In the modelling stage I will try an upsampling technique.


Modelling

Cross Validation

Cross-validation is a resampling technique used to assess the performance of a predictive model. It involves dividing the available data into multiple subsets, typically a training set and a validation set (or test set). The model is trained on the training set and then evaluated on the validation set to estimate its performance.

Now, let’s discuss why we use 70:30 train-test split cross-validation. The 70:30 split refers to allocating 70% of the data for training the model and reserving the remaining 30% for evaluating its performance. This split is a commonly used practice, but it is not a hard rule and can vary depending on the dataset size, complexity, and specific requirements of the problem at hand.

The rationale behind the 70:30 split is to strike a balance between having enough data for training the model effectively and having a sufficient amount of data for evaluating its performance. With 70% of the data used for training, the model can learn from a substantial portion of the available information and capture the underlying patterns in the data.

# Set seed for reproducibility
set.seed(123)

# Perform train-test splitting
train_index <- createDataPartition(Ds_income$income, p = 0.7, list = FALSE)
Income_train <- Ds_income[train_index, ]
Income_test <- Ds_income[-train_index, ]

# Check the class distribution in the training data
prop.table(table(Income_train$income))
#> 
#>         0         1 
#> 0.7512905 0.2487095
# Calculate the proportions of each income category
proportions <- table(Income_train$income) / length(Income_train$income) * 100

# Create a data frame for plotting
plot_data <- data.frame(Income_Category = names(proportions),
                        Proportion = as.numeric(proportions))

# Create the bar plot
ggplot(plot_data, aes(x = Income_Category, y = Proportion)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Proportion of Income Categories",
       x = "Income Category",
       y = "Proportion (%)") +
  theme_minimal()

In the context of our dataset, the proportion of the income variable is not balanced: approximately 75% of the observations belong to one class (0) and only about 25% belong to the other class (1). This imbalance can pose challenges when building predictive models.

Upsampling is a technique used to address imbalanced data. It increases the number of instances in the minority class so that the class distribution becomes more balanced, giving the model more balanced training data and helping it learn from both classes effectively. The caret package provides an upSample() function for simple random over-sampling (see the sketch below), but in this project we will use the ROSE package.
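
For comparison, here is a small sketch (not used further in this project) of what simple random over-sampling with caret’s upSample() could look like; the object name up_train is purely illustrative:

# caret alternative: randomly duplicate minority-class rows until both classes are equal in size
up_train <- upSample(x = Income_train[, setdiff(names(Income_train), "income")],
                     y = factor(Income_train$income),
                     yname = "income")

# Check the class distribution after this simple over-sampling
prop.table(table(up_train$income))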

The ROSE package in R (Random Over-Sampling Examples) provides functionality for rebalancing imbalanced data. Its ROSE() function generates synthetic examples for the minority class by perturbing existing minority-class instances, while ovun.sample() performs random over- and/or under-sampling with replacement. Here we will use ovun.sample() with method = "both", which over-samples the minority class and under-samples the majority class at the same time.

Performing Upsampling Using the ROSE Package

# Rebalance the training data with ovun.sample() (combined over- and under-sampling)
upsampled_data <- ovun.sample(income ~ ., 
                              data = Income_train, 
                              method = "both", 
                              p = 0.5, 
                              seed = 123)

# Check the class distribution after upsampling
prop.table(table(upsampled_data$data$income))
#> 
#>         0         1 
#> 0.5080102 0.4919898

Then we store the target labels in the Ds_train_labels and Ds_test_labels variables.

Ds_train_labels <- Ds_income[train_index, ]$income
Ds_test_labels <- Ds_income[-train_index, ]$income

Naive Bayes

Naive Bayes is a popular classification algorithm used in machine learning and data science. It is based on Bayes’ theorem, which describes the relationship between conditional probabilities of events. Naive Bayes is particularly well-suited for certain types of problems and offers several advantages:

  • Simplicity and Efficiency: Naive Bayes is known for its simplicity and efficiency. It is relatively easy to understand and implement, making it a good choice for both beginners and experienced practitioners.
  • Interpretability: Naive Bayes provides interpretability by estimating the conditional probabilities of each feature given the class variable. This allows us to understand the contribution and importance of each feature in the classification decision.

Model Fitting

Explanation of Assumptions

Naive Bayes is called “naive” because it assumes the predictors are independent events given the class, i.e., each predictor is treated as having no relationship with the other predictors. Under this assumption we can directly multiply the conditional probabilities:

\[P(A\ \cap\ B\ \cap\ C\ |\ Income) = P(A\ |\ Income) \times P(B\ |\ Income) \times P(C\ |\ Income)\]

So that :

\[P(Income| A\ \cap\ B\ \cap\ C) = \frac{P(Income) \ P(A\ |\ Income)\ P(B\ |\ Income)\ P(C\ | \ Income)}{P(A \cap B \cap C)}\] However, \(P(A \cap B \cap C)\) cannot simply be factored as a product of the individual probabilities; it needs to be expanded further.

Advanced Description

Because the independence assumption applies only to the conditional probabilities, we first need to expand \(P(A \cap B \cap C)\) as a sum using the law of total probability:

\[P(A \cap B \cap C) = P(A \cap B \cap C\ |\ Income)\ P(Income) + P(A \cap B \cap C\ |\ \neg Income)\ P(\neg Income) \]

So that:

\[P(Income| A\ \cap\ B\ \cap\ C) = \frac{P(Income) \ P(A\ |\ Income)\ P(B\ |\ Income)\ P(C\ | \ Income)}{P(A \cap B \cap C\ |\ Income)\ P(Income) + P(A \cap B \cap C\ |\ \neg Income)\ P(\neg Income)}\] Note: the negation \(\neg Income\) denotes the event we are not interested in, in this case income = no (i.e., ≤ $50K).

Notes:

Something to remember: Naive Bayes assumes that the predictors are not related to each other (even when relationships do exist in reality, Naive Bayes treats them as if there were none).

  • Naive Bayes uses the assumption of independent probability on its conditional probabilities to simplify its calculations.

# Convert variables to factors
upsampled_data$data$income <- factor(upsampled_data$data$income)
Income_test$income <- factor(Income_test$income)

# Train the Naive Bayes model
nb_model <- naiveBayes(income ~ ., data = upsampled_data$data)

# Make predictions on the test data
predictions <- predict(nb_model, newdata = Income_test)
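
Before evaluating the model, it can help to peek at what naiveBayes() actually estimated. The fitted object stores the class priors and, for every predictor, the conditional distributions given the class; the two predictors shown below are just examples:

# Class counts used to estimate the priors P(income)
nb_model$apriori

# Conditional probability table P(sex | income)
nb_model$tables$sex

# Mean and standard deviation of age within each income class
nb_model$tables$age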

Model Evaluation

In this section we are going to evaluate the naiveBayes() model. It estimates conditional probabilities by assuming independence between features given the class variable and uses these probabilities to classify new instances. Naive Bayes is a popular and efficient algorithm that works well on a wide range of classification problems.

Confusion Matrix

Confusion Matrix Test

# Evaluate the performance of the model
confusion_matrix <- confusionMatrix(predictions, Income_test$income, positive = "0")
confusion_matrix
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 6436 1371
#>          1  477  931
#>                                                
#>                Accuracy : 0.7995               
#>                  95% CI : (0.7911, 0.8076)     
#>     No Information Rate : 0.7502               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.3853               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.9310               
#>             Specificity : 0.4044               
#>          Pos Pred Value : 0.8244               
#>          Neg Pred Value : 0.6612               
#>              Prevalence : 0.7502               
#>          Detection Rate : 0.6984               
#>    Detection Prevalence : 0.8472               
#>       Balanced Accuracy : 0.6677               
#>                                                
#>        'Positive' Class : 0                    
#> 

The evaluation of the naiveBayes() model is as follows:

Accuracy: The accuracy of the model is 0.7995, indicating that approximately 79.95% of the predictions made by the model are correct.

Sensitivity (True Positive Rate): The sensitivity of the model is 0.9310, indicating that it correctly identifies 93.10% of the positive instances (class "0", i.e., <=50K).

Specificity (True Negative Rate): The specificity of the model is 0.4044, indicating that it correctly identifies only 40.44% of the negative instances (class "1", i.e., >50K).

Positive Predictive Value (Precision): The positive predictive value, also known as precision, is 0.8244. It represents the proportion of correctly predicted positive instances out of all instances predicted as positive.

Negative Predictive Value: The negative predictive value is 0.6612. It represents the proportion of correctly predicted negative instances out of all instances predicted as negative.

In summary, the naiveBayes() model achieves an accuracy of approximately 79.95%, which is better than the no-information rate (75.02%). It exhibits a fair agreement between predicted and actual classes (kappa value of 0.3853). The model demonstrates high sensitivity (93.10%) but low specificity (40.44%). Precision (positive predictive value) is 82.44%, indicating a relatively high proportion of correctly predicted positive instances. The negative predictive value is 66.12%. The prevalence of the positive class is 75.02%. The detection rate is 69.84% and the detection prevalence is 84.72%. The balanced accuracy, which considers both sensitivity and specificity, is 66.77%.

It’s important to interpret these evaluation metrics in the context of the specific problem and domain knowledge. Further analysis and comparison with other models or evaluation metrics may be necessary to gain a comprehensive understanding of the model’s performance.
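
For convenience, the precision, recall, and F1 score discussed above can also be read straight from the caret confusionMatrix object (computed for the class defined as positive, here "0"):

# Extract precision, recall, F1, and balanced accuracy from the confusion matrix object
confusion_matrix$byClass[c("Precision", "Recall", "F1", "Balanced Accuracy")]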

Confusion Matrix Train

# Make predictions on the training data using the Naive Bayes model
train_predictions_nb <- predict(nb_model, 
                             newdata = upsampled_data$data, 
                             type = "class")

# Create the confusion matrix for the training data
confusionMatrix(train_predictions_nb, upsampled_data$data$income)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 9983 5917
#>          1  735 4463
#>                                                
#>                Accuracy : 0.6847               
#>                  95% CI : (0.6784, 0.691)      
#>     No Information Rate : 0.508                
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.3643               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.9314               
#>             Specificity : 0.4300               
#>          Pos Pred Value : 0.6279               
#>          Neg Pred Value : 0.8586               
#>              Prevalence : 0.5080               
#>          Detection Rate : 0.4732               
#>    Detection Prevalence : 0.7536               
#>       Balanced Accuracy : 0.6807               
#>                                                
#>        'Positive' Class : 0                    
#> 

Model Condition

  • Overfitting: the model performs well on the training data but much worse on the test data. This happens when the model is too complex and captures patterns in too much detail, so it fails to generalize.
    • Performance on train data: 90%
    • Performance on test data: 40%
  • Underfitting: the model performs poorly on both the training data and the test data. This happens when the model is too simple for data that is not that simple (for example, using a linear model for non-linear data).
    • Performance on train data: 20%
    • Performance on test data: 15%
  • Just right: the model performs well on the training data and the performance changes only slightly on the test data (difference < 0.1).
    • Performance on train data: 89%
    • Performance on test data: 91%

Based on these criteria, our model is a just-right model, with a recall of 93.14% on the training data and 93.10% on the test data.

ROC and AUC

When using accuracy alone, we don’t know whether the model can properly separate the positive and negative classes or not. Therefore, we will examine two other evaluation metrics: ROC and AUC.

ROC : ROC (Receiver-Operating Curve) is a curve that describes the relationship between the True Positive Rate (Sensitivity or Recall) and the False Positive Rate (1-Specificity) at each threshold. A good model should ideally have a high True Positive Rate and a low False Positive Rate.

AUC : AUC (Area Under ROC Curve) shows the area under the ROC curve. The closer to 1, the better the model’s performance in separating positive and negative classes.

# Make predictions on the test data
prediction_probs <- predict(nb_model, newdata = Income_test, type = "raw")

# Extract predicted probabilities for the positive class
positive_probs <- prediction_probs[, "1"]

# Compute the ROC curve
roc_curve <- roc(Income_test$income, positive_probs)

# Calculate the AUC
auc <- auc(roc_curve)

# Print the AUC and plot the ROC curve
plot(roc_curve, main = "ROC Curve", xlab = "False Positive Rate", ylab = "True Positive Rate")

cat("AUC:", auc, "\n")
#> AUC: 0.8545437

Insights from our ROC and AUC values:

  • ROC: In a typical ROC curve, the FPR starts at 0 on the x-axis and increases gradually as the TPR increases on the y-axis. However, in some cases the curve may appear to start from a value greater than 0 on the x-axis, as in our ROC plot. This situation can occur when the classification threshold is set in such a way that the model predicts all instances as positive; the FPR then becomes 1.0 because there are no true negatives among the predictions.

  • AUC: Our AUC value is 0.8545. Since this number is close to 1, the evaluation result is very good: the model succeeds in distinguishing the positive and negative classes well.
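
As a possible follow-up (a sketch, not used in the rest of the analysis), pROC’s coords() can suggest a classification threshold that balances sensitivity and specificity instead of the default 0.5 cut-off:

# Find the threshold on the ROC curve that best balances sensitivity and specificity
coords(roc_curve, x = "best", ret = c("threshold", "sensitivity", "specificity"))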

Decision Tree

Decision Trees offer interpretability, handle both numerical and categorical features, capture complex relationships, rank feature importance, handle missing values, and form the basis for powerful ensemble methods. They are particularly useful when interpretability is desired, nonlinear relationships need to be captured, and feature importance is important. However, careful consideration should be given to overfitting and the potential need for regularization techniques to optimize their performance.

Model Fitting

The rpart package is a powerful and flexible tool for creating decision trees in R. It offers robust and efficient algorithms, handles different types of features, provides mechanisms to prevent overfitting, and produces interpretable trees. Its integration with other packages and frameworks enhances its usability and allows for further model improvements.

# Train the decision tree model
dt_model <- rpart(income ~ ., data = upsampled_data$data)

# Plot the decision tree
rpart.plot(dt_model, extra = 1, box.palette = "Blues")
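
As mentioned in the objectives, the split criterion and tree complexity can also be set explicitly. The sketch below (the model name dt_model_info and the parameter values are illustrative, not tuned) switches from the default Gini index to information gain and limits tree growth with rpart.control():

# Decision tree using information gain and explicit growth constraints
dt_model_info <- rpart(income ~ .,
                       data = upsampled_data$data,
                       parms = list(split = "information"),
                       control = rpart.control(cp = 0.01, minsplit = 20, maxdepth = 5))

# Cross-validated error for each value of the complexity parameter (useful for pruning)
printcp(dt_model_info)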

Model Evaluation

Confusion Matrix

Confusion Matrix Test

# Make predictions on the test data using the decision tree model
test_predictions <- predict(dt_model, newdata = Income_test, type = "class")

# Create the confusion matrix for the test data
confusionMatrix(test_predictions, Income_test$income)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 5085  303
#>          1 1828 1999
#>                                                
#>                Accuracy : 0.7687               
#>                  95% CI : (0.76, 0.7773)       
#>     No Information Rate : 0.7502               
#>     P-Value [Acc > NIR] : 0.00001767           
#>                                                
#>                   Kappa : 0.4947               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.7356               
#>             Specificity : 0.8684               
#>          Pos Pred Value : 0.9438               
#>          Neg Pred Value : 0.5223               
#>              Prevalence : 0.7502               
#>          Detection Rate : 0.5518               
#>    Detection Prevalence : 0.5847               
#>       Balanced Accuracy : 0.8020               
#>                                                
#>        'Positive' Class : 0                    
#> 

Accuracy: The accuracy of the model is 0.7687, indicating that approximately 76.87% of the predictions made by the model are correct.

Sensitivity (True Positive Rate): The sensitivity of the model is 0.7356, indicating that it correctly identifies 73.56% of the positive instances.

Specificity (True Negative Rate): The specificity of the model is 0.8684, indicating that it correctly identifies 86.84% of the negative instances.

Positive Predictive Value (Precision): The positive predictive value, also known as precision, is 0.9438. It represents the proportion of correctly predicted positive instances out of all instances predicted as positive.

Negative Predictive Value: The negative predictive value is 0.5223. It represents the proportion of correctly predicted negative instances out of all instances predicted as negative.

In summary, the rpart() model achieves an accuracy of approximately 76.87%, which is better than the no-information rate (75.02%). It exhibits a moderate agreement between predicted and actual classes (kappa value of 0.4947). The model demonstrates a reasonable sensitivity (73.56%) and a high specificity (86.84%). Precision (positive predictive value) is 94.38%, indicating a high proportion of correctly predicted positive instances. The negative predictive value is 52.23%. The prevalence of the positive class is 75.02%. The detection rate is 55.18%, and the detection prevalence is 58.47%. The balanced accuracy, which considers both sensitivity and specificity, is 80.20%.

Confusion Matrix Train

# Make predictions on the training data using the decision tree model
train_predictions <- predict(dt_model, 
                             newdata = upsampled_data$data, 
                             type = "class")

# Create the confusion matrix for the training data
confusionMatrix(train_predictions, upsampled_data$data$income)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 8046 1465
#>          1 2672 8915
#>                                                
#>                Accuracy : 0.8039               
#>                  95% CI : (0.7985, 0.8093)     
#>     No Information Rate : 0.508                
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.6084               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.7507               
#>             Specificity : 0.8589               
#>          Pos Pred Value : 0.8460               
#>          Neg Pred Value : 0.7694               
#>              Prevalence : 0.5080               
#>          Detection Rate : 0.3814               
#>    Detection Prevalence : 0.4508               
#>       Balanced Accuracy : 0.8048               
#>                                                
#>        'Positive' Class : 0                    
#> 

Model Condition

  • Overfitting: the model performs well on the training data but much worse on the test data. This happens when the model is too complex and captures patterns in too much detail, so it fails to generalize.
    • Performance on train data: 90%
    • Performance on test data: 40%
  • Underfitting: the model performs poorly on both the training data and the test data. This happens when the model is too simple for data that is not that simple (for example, using a linear model for non-linear data).
    • Performance on train data: 20%
    • Performance on test data: 15%
  • Just right: the model performs well on the training data and the performance changes only slightly on the test data (difference < 0.1).
    • Performance on train data: 89%
    • Performance on test data: 91%

Based on these criteria, our model is a just-right model, with a recall of 75.07% on the training data and 73.56% on the test data.

Random Forest

Random Forest is a type of ensemble method that consists of many decision trees. Each decision tree is built independently, with its own characteristics, and the trees are not related to one another. Random Forest makes use of the Bagging (Bootstrap and Aggregation) concept in its construction; a minimal sketch with randomForest() follows the numbered steps below. Here is the process:

  1. Bootstrap sampling: generate new datasets by random sampling (with replacement) from the entire data, allowing duplicate rows.
  2. Build one decision tree for each bootstrap sample. The mtry parameter determines how many predictor candidates are randomly selected at each split (automatic feature selection).
  3. Make a prediction for each new observation with every decision tree.
  4. Aggregation: combine the individual tree predictions into a single prediction.
    • Classification case: majority voting
    • Regression case: average of the target values
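
The following minimal sketch illustrates these steps by calling randomForest() directly (this is not the caret workflow used below; ntree, mtry, and the object name rf_sketch are illustrative, and the code is left commented out because training takes a while):

# rf_sketch <- randomForest(income ~ .,
#                           data = upsampled_data$data,
#                           ntree = 500,           # number of bootstrapped trees
#                           mtry = 3,              # predictors sampled at each split
#                           na.action = na.omit,   # native.country still contains NAs
#                           importance = TRUE)     # keep variable importance
# rf_sketch                                        # prints the OOB error estimate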

Model Fitting

For building this model we will use the caret library. Various models are available in caret (the list of models available in caret can be accessed at the following link).

ani.options(interval = 1, nmax = 15)
cv.ani(main = "Demonstration of the k-fold Cross Validation", bty = "l")

Using the upsampled_data$data we created, we will now build a random forest model with repeated k-fold cross-validation (k = 5), repeating the k-fold split 3 times:

# set.seed(417)
# control <- trainControl(method = "repeatedcv", number = 5, repeats = 3, verboseIter = TRUE)

# Model Fitting Random Forest (commented out here because training takes a long time)
# rf_model <- train(income ~ ., 
#                   data = upsampled_data$data,
#                   method = "rf",
#                   trControl = control)

The longer execution time for Random Forest model fitting can be attributed to the multiple decision trees, bagging and random sampling, feature subsetting, the potential lack of parallelization, and the size and complexity of the dataset. Although it can require more computational resources, the benefits of Random Forest, such as improved accuracy and robustness, make it a widely used and powerful algorithm for many machine learning tasks.
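
One way to reduce that execution time (a sketch, assuming the doParallel package is installed) is to register a parallel backend so caret can fit the resamples on several cores; it is shown commented out, like the training code above:

# library(doParallel)
# cl <- makePSOCKcluster(4)       # start four worker processes
# registerDoParallel(cl)          # let caret/foreach use them
# rf_model <- train(income ~ ., data = upsampled_data$data,
#                   method = "rf", trControl = control)
# stopCluster(cl)                 # release the workers when done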

A good practice after completing training is to save the model as an RDS file with the saveRDS() function, so the model can be reused immediately without retraining from scratch.

# save the model
# saveRDS(rf_model, "rf_forest.RDS")

The call is saveRDS(model_object, "file_name.RDS").

# read model from RDS file
income_Rforest <- readRDS("model/rf_forest.RDS")
income_Rforest
#> Random Forest 
#> 
#> 21135 samples
#>    11 predictor
#>     2 classes: '0', '1' 
#> 
#> No pre-processing
#> Resampling: Bootstrapped (25 reps) 
#> Summary of sample sizes: 21135, 21135, 21135, 21135, 21135, 21135, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy   Kappa    
#>    2    0.8033148  0.6073829
#>   38    0.9037367  0.8075650
#>   75    0.9004418  0.8009666
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 38.

Out-of-Bag Error

Even though we performed data splitting earlier and can evaluate the model on the test data, cross-validation is not strictly necessary when using a random forest. Because of bootstrap sampling, some observations are not used to build each tree; these observations are the out-of-bag data and serve as a built-in test set for the model. The model makes predictions on them and computes the resulting error, which is referred to as the out-of-bag (OOB) error.

# final models
income_Rforest$finalModel
#> 
#> Call:
#>  randomForest(x = x, y = y, mtry = param$mtry, trainControl = ..1) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 38
#> 
#>         OOB estimate of  error rate: 7.04%
#> Confusion matrix:
#>      0    1 class.error
#> 0 9719 1015  0.09455934
#> 1  473 9928  0.04547640

Explanation of the income_Rforest$finalModel summary:

  1. Number of trees: 500 –> the number of trees in the forest
  2. No. of variables tried at each split: 38 –> mtry
  3. OOB estimate of error rate: 7.04% –> the error computed on the out-of-bag samples (the samples not selected during bootstrap sampling)
  4. Confusion matrix
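
Optionally, the OOB error can also be inspected as a function of the number of trees: plot() on a randomForest object draws the OOB and per-class error curves, and err.rate stores the underlying values.

# OOB and per-class error rates versus the number of trees
plot(income_Rforest$finalModel, main = "OOB error vs. number of trees")

# Final OOB and per-class error rates (last row of the error-rate matrix)
tail(income_Rforest$finalModel$err.rate, 1)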

Model Interpretation

Even though a random forest is usually labeled a non-interpretable model, we can at least see which predictors are used most (i.e., are most important) in building it. We can use the varImp() function:

# Variable importance from the fitted random forest model
varImp(income_Rforest)
#> rf variable importance
#> 
#>   only 20 most important variables shown (out of 75)
#> 
#>                                  Overall
#> marital.statusMarried-civ-spouse 100.000
#> age                               79.180
#> education.num                     60.160
#> capital.gain                      44.653
#> hours.per.week                    42.059
#> marital.statusNever-married       23.925
#> capital.loss                      12.136
#> sexMale                            8.954
#> workclassPrivate                   6.997
#> occupationExec-managerial          6.741
#> occupationProf-specialty           5.816
#> occupationCraft-repair             5.317
#> occupationSales                    5.088
#> workclassSelf-emp-not-inc          4.705
#> occupationOther-service            3.942
#> workclassLocal-gov                 3.746
#> occupationTransport-moving         3.670
#> raceWhite                          3.521
#> occupationMachine-op-inspct        3.136
#> workclassSelf-emp-inc              2.971

Model Prediction & Evaluation

Confusion Matrix

Confusion Matrix Train

# Make predictions on the training data using the random forest model
train_predictions_rf <- predict(income_Rforest, 
                             newdata = upsampled_data$data, 
                             type = "raw")

# Create the confusion matrix for the training data
confusionMatrix(train_predictions_rf, upsampled_data$data$income)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 9469  988
#>          1 1249 9392
#>                                                
#>                Accuracy : 0.894                
#>                  95% CI : (0.8897, 0.8981)     
#>     No Information Rate : 0.508                
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.788                
#>                                                
#>  Mcnemar's Test P-Value : 0.00000003859        
#>                                                
#>             Sensitivity : 0.8835               
#>             Specificity : 0.9048               
#>          Pos Pred Value : 0.9055               
#>          Neg Pred Value : 0.8826               
#>              Prevalence : 0.5080               
#>          Detection Rate : 0.4488               
#>    Detection Prevalence : 0.4956               
#>       Balanced Accuracy : 0.8941               
#>                                                
#>        'Positive' Class : 0                    
#> 

The evaluation of the Random Forest model (trained with trainControl()) on the upsampled training data is as follows:

Accuracy: The accuracy of the model is 0.894, indicating that approximately 89.40% of the predictions made by the model are correct.

Sensitivity (True Positive Rate): The sensitivity of the model is 0.8835, indicating that it correctly identifies 88.35% of the positive instances.

Specificity (True Negative Rate): The specificity of the model is 0.9048, indicating that it correctly identifies 90.48% of the negative instances.

Positive Predictive Value (Precision): The positive predictive value, also known as precision, is 0.9055. It represents the proportion of correctly predicted positive instances out of all instances predicted as positive.

Negative Predictive Value: The negative predictive value is 0.8826. It represents the proportion of correctly predicted negative instances out of all instances predicted as negative.

In summary, the Random Forest model achieves an accuracy of approximately 89.40%, which is significantly better than the no-information rate (50.80%). It exhibits a high agreement between predicted and actual classes (kappa value of 0.788). The model demonstrates high sensitivity (88.35%) and specificity (90.48%). Precision (positive predictive value) is 90.55%, indicating a high proportion of correctly predicted positive instances. The negative predictive value is 88.26%. The prevalence of the positive class is 50.80%. The detection rate is 44.88%, and the detection prevalence is 49.56%. The balanced accuracy, which considers both sensitivity and specificity, is 89.41%.
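
For a like-for-like comparison with the Naive Bayes and Decision Tree results above, the random forest could also be scored on the held-out test set; a sketch is shown below (commented out, output not included; rows with remaining NAs, e.g. in native.country, are dropped first):

# Income_test_cc <- na.omit(Income_test)
# test_predictions_rf <- predict(income_Rforest, newdata = Income_test_cc)
# confusionMatrix(test_predictions_rf, Income_test_cc$income)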

Evaluation

After making predictions using the model, there are still wrong predictions. In classification, we evaluate the model based on the confusion matrix:

  • Contents of the confusion matrix (the standard metric formulas are sketched after this list):

    • True Positive (TP): predicted positive and correct (positive prediction; actual positive)
    • True Negative (TN): predicted negative and correct (negative prediction; actual negative)
    • False Positive (FP): predicted positive but wrong (positive prediction; actual negative)
    • False Negative (FN): predicted negative but wrong (negative prediction; actual positive)
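
The standard evaluation metrics follow directly from these four counts; the formulas below are a generic sketch with placeholder names TP, TN, FP, and FN:

# accuracy  <- (TP + TN) / (TP + TN + FP + FN)
# precision <- TP / (TP + FP)                          # positive predictive value
# recall    <- TP / (TP + FN)                          # sensitivity
# f1        <- 2 * precision * recall / (precision + recall)
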
# Create the dataframe
FinalJoin_confusion_matrix <- data.frame(
  Model = c("Naive Bayes", "Decision Tree", "Random Forest"),
  Accuracy = c(0.7995, 0.7687, 0.8940),
  Sensitivity = c(0.9310, 0.7356, 0.8835),
  Specificity = c(0.4044, 0.8684, 0.9048),
  Precision = c(0.8244, 0.9438, 0.9055),
  Negative_Predictive_Value = c(0.6612, 0.5223, 0.8826)
)

# Print the dataframe
rmarkdown::paged_table(FinalJoin_confusion_matrix)

Creating Visualization

# Convert the dataframe to long format
confusion_matrix_long <- FinalJoin_confusion_matrix %>%
  pivot_longer(cols = -Model, names_to = "Metric", values_to = "Value")

# Create the line plot comparing the metrics across models
ggplot(confusion_matrix_long, aes(x = Model, y = Value, group = Metric, color = Metric)) +
  geom_line() +
  geom_point() +
  labs(x = "Model", y = "Value", title = "Model Evaluation Metrics") +
  theme_minimal() +
  theme(legend.position = "bottom")


Conclusion

After comparing three different classification machine learning models (Naive Bayes, Decision Tree, and Random Forest) and prioritizing Random Forest due to its superior performance, we can draw the following conclusion:

Among the three models evaluated, Random Forest exhibited the highest performance across multiple evaluation metrics. It achieved an accuracy of 89.40%, a precision of 90.55%, a recall of 88.35%, and a balanced accuracy of 89.41%. These results indicate that Random Forest outperformed both Naive Bayes and Decision Tree in terms of overall classification accuracy.

Naive Bayes, although achieving a respectable accuracy of 79.95%, had a very low specificity (40.44%) compared to Random Forest. This suggests that Naive Bayes struggled to correctly identify the >50K instances and produced many false positives for the <=50K class.

Decision Tree, while achieving an accuracy of 76.87%, showed a lower recall (73.56%) and lower overall accuracy compared to Random Forest. Decision tree models tend to exhibit a higher bias and may not generalize as well to unseen data, resulting in lower accuracy.

Considering these results, it is recommended to prioritize the Random Forest model due to its superior performance in classification accuracy and the ability to capture complex relationships within the data. However, further analysis and experimentation should be conducted to ensure the generalizability and robustness of the chosen model in different scenarios or datasets.