Predicting Income With Classification: NB, Decision Tree & Random Forest
Introduction
Welcome to the Income Data Analysis and Classification Project! In this project, our main goal is to analyze income data and build a predictive model using various classification algorithms. Our aim is to explore different approaches and identify potential improvements in predicting income levels.
The dataset we will be working with contains information about individuals, including their age, education, occupation, marital status, and more. We have a total of 32,561 observations and 15 variables, including the target variable "income," which indicates whether an individual earns more than $50,000 per year or at most $50,000.
To achieve our objective, we will be utilizing three classification algorithms: Naive Bayes, Decision Tree, and Random Forest. Each of these algorithms has its own unique characteristics and assumptions, which we will discuss in detail.
Let’s dive into the analysis, understand the intricacies of each algorithm, and work towards building an accurate and robust predictive model.
You can find the dataset here.
Objectives
By analyzing the income data and building predictive models using Naive Bayes, Decision Tree, and Random Forest, we aim to identify the best-performing algorithm and potentially uncover areas for improvement in predicting income levels. This project will not only enhance our understanding of classification algorithms but also provide valuable insights into the factors influencing income. Here are the bullet-point objectives for the project:
- Analyze the income data to gain insights into the distribution and characteristics of the variables.
- Preprocess the data by handling missing values, encoding categorical variables, and scaling numerical variables as required.
- Split the dataset into training and testing sets to evaluate the performance of the classification algorithms.
- Implement the Naive Bayes algorithm and train a model to predict
income levels.
- Evaluate the performance of the Naive Bayes model using appropriate evaluation metrics such as accuracy, precision, recall, and F1 score.
- Understand the underlying assumptions of Naive Bayes and discuss its strengths and limitations.
- Implement the Decision Tree algorithm and train a model to predict
income levels.
- Explore different decision tree criteria, such as Gini impurity or information gain, and fine-tune the model using appropriate hyperparameters.
- Evaluate the performance of the Decision Tree model and compare it with the Naive Bayes model.
- Understand the interpretability of decision trees and discuss their advantages and disadvantages.
- Implement the Random Forest algorithm and train an ensemble of
decision trees to predict income levels.
- Discuss the concept of ensemble learning and the benefits of using Random Forest.
- Evaluate the performance of the Random Forest model and compare it with the previous models.
- Identify the best-performing classification algorithm for predicting income levels based on the evaluation results.
- Discuss the overall findings, including the strengths and weaknesses of each model, and suggest potential areas for improvement.
- Provide recommendations for further enhancements or feature engineering techniques that can improve the accuracy of the predictive models.
These objectives will guide us throughout the project and help us achieve our goals of analyzing the income data and building predictive models using the Naive Bayes, Decision Tree, and Random Forest algorithms.
DataSet Overview
Here's a brief explanation of each variable in the dataset:
- age: Represents the age of the individual (numeric variable).
- workclass: Indicates the type of work class or employment status of the individual (categorical variable).
- fnlwgt: Stands for "final weight" and represents the sampling weight assigned to each observation (numeric variable).
- education: Denotes the highest level of education completed by the individual (categorical variable).
- education.num: Corresponds to the numerical representation of the education variable (numeric variable).
- marital.status: Indicates the marital status of the individual (categorical variable).
- occupation: Represents the occupation of the individual (categorical variable).
- relationship: Describes the individual's role in the family (categorical variable).
- race: Specifies the race of the individual (categorical variable).
- sex: Denotes the gender of the individual (categorical variable).
- capital.gain: Represents the capital gains of the individual (numeric variable).
- capital.loss: Represents the capital losses of the individual (numeric variable).
- hours.per.week: Indicates the number of hours worked per week by the individual (numeric variable).
- native.country: Specifies the native country of the individual (categorical variable).
- income: Indicates whether the individual's income exceeds $50,000 per year (categorical variable).
These explanations should give you a general understanding of each variable in the dataset.
Data Preparation
Import package & Dataset
Import Package :
library(dplyr) # for data wrangling
library(ggplot2) # to visualize data
library(gridExtra) # to display multiple graph
library(tidymodels) # to build tidy models
library(caret) # to pre-process data
library(tibble) # for creating and manipulating tabular data structures
library(animation) # for creating animated visualizations
library(GGally) # Extension to ggplot2 for exploratory data analysis
#Naive Bayes
library(e1071) # for implementing Naive Bayes algorithm
library(pROC) # for computing ROC curves and other evaluation metrics
library(ROSE) # for handling imbalanced datasets using oversampling techniques
#Decision Tree
library(partykit) # for constructing and visualizing decision trees
library(rpart) # for building decision tree models
library(rpart.plot) # for visualizing decision trees using plots
#Random Forest
library(randomForest) # for building random forest models and ensembles
Import Dataset
Ds_income <- read.csv("data_input/adult.csv")
rmarkdown::paged_table(Ds_income)
Data Wrangling
glimpse(Ds_income)
#> Rows: 32,561
#> Columns: 15
#> $ age <int> 90, 82, 66, 54, 41, 34, 38, 74, 68, 41, 45, 38, 52, 32,…
#> $ workclass <chr> "?", "Private", "?", "Private", "Private", "Private", "…
#> $ fnlwgt <int> 77053, 132870, 186061, 140359, 264663, 216864, 150601, …
#> $ education <chr> "HS-grad", "HS-grad", "Some-college", "7th-8th", "Some-…
#> $ education.num <int> 9, 9, 10, 4, 10, 9, 6, 16, 9, 10, 16, 15, 13, 14, 16, 1…
#> $ marital.status <chr> "Widowed", "Widowed", "Widowed", "Divorced", "Separated…
#> $ occupation <chr> "?", "Exec-managerial", "?", "Machine-op-inspct", "Prof…
#> $ relationship <chr> "Not-in-family", "Not-in-family", "Unmarried", "Unmarri…
#> $ race <chr> "White", "White", "Black", "White", "White", "White", "…
#> $ sex <chr> "Female", "Female", "Female", "Female", "Female", "Fema…
#> $ capital.gain <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ capital.loss <int> 4356, 4356, 4356, 3900, 3900, 3770, 3770, 3683, 3683, 3…
#> $ hours.per.week <int> 40, 18, 40, 40, 40, 45, 40, 20, 40, 60, 35, 45, 20, 55,…
#> $ native.country <chr> "United-States", "United-States", "United-States", "Uni…
#> $ income <chr> "<=50K", "<=50K", "<=50K", "<=50K", "<=50K", "<=50K", "…
Removing Redundant Variables
Removing redundant variables is an important step in data preprocessing to ensure the accuracy and efficiency of our analysis. In this dataset, we have identified three variables that can be considered redundant: “fnlwgt,” “education,” and “relationship.”
- fnlwgt: The variable "fnlwgt" represents the final weight, but its calculation and exact meaning are not provided in the dataset documentation. Without a clear understanding of how it was derived or its significance in predicting income levels, it is difficult to incorporate this variable into our analysis.
- education: The dataset includes both "education" and "education.num" variables. Upon examination, it becomes apparent that "education.num" already captures the numerical representation of the education level, rendering the "education" variable redundant.
- relationship: The variable "relationship" represents the role of an individual within the family structure. However, the "marital.status" variable already provides information about the person's marital status, which inherently includes their relationship to others.
By removing these redundant variables, we can streamline our dataset and focus on the key variables that contribute significantly to predicting income levels.
# Remove redundant variables from the dataset
Ds_income <- subset(Ds_income, select = c(-fnlwgt, -education, -relationship))
Finding NA Values
Replacing “?” with “NA” is an essential step in data preprocessing to ensure consistency and facilitate data analysis. In the given dataset, “?” is used to represent missing or unknown values. However, many data analysis and modeling techniques in R treat “NA” as the standard symbol to denote missing values.
By replacing "?" with "NA", we achieve uniformity and ensure compatibility with R's built-in functions and packages specifically designed to handle missing values. This consistency allows us to utilize powerful tools for imputation, statistical analysis, and predictive modeling, as well as avoiding potential errors or misinterpretation when working with missing data.
# Replace "?" with NA in the dataset
Ds_income[Ds_income == "?"] <- NA
# Calculate the total and percentage of missing values in each column
missing_data <- data.frame(
Column = names(Ds_income),
Total = colSums(is.na(Ds_income)),
Percent = colMeans(is.na(Ds_income)) * 100
)
# Create a tibble with the missing data information
missing_table <- as_tibble(missing_data)
# Print the missing data table
print(missing_table)
#> # A tibble: 12 × 3
#> Column Total Percent
#> <chr> <dbl> <dbl>
#> 1 age 0 0
#> 2 workclass 1836 5.64
#> 3 education.num 0 0
#> 4 marital.status 0 0
#> 5 occupation 1843 5.66
#> 6 race 0 0
#> 7 sex 0 0
#> 8 capital.gain 0 0
#> 9 capital.loss 0 0
#> 10 hours.per.week 0 0
#> 11 native.country 583 1.79
#> 12 income 0 0
Because the percentage of missing values in the "workclass" and "occupation" variables is small, I prefer to simply drop the rows that contain them (the missing values in "native.country" are left as they are for now).
# Drop rows with missing values in "workclass" and "occupation" variables
Ds_income <- Ds_income[complete.cases(Ds_income[c("workclass", "occupation")]), ]
Changing Data Type
glimpse(x = Ds_income)
#> Rows: 30,718
#> Columns: 12
#> $ age <int> 82, 54, 41, 34, 38, 74, 68, 41, 45, 38, 52, 32, 46, 45,…
#> $ workclass <chr> "Private", "Private", "Private", "Private", "Private", …
#> $ education.num <int> 9, 4, 10, 9, 6, 16, 9, 10, 16, 15, 13, 14, 15, 7, 14, 1…
#> $ marital.status <chr> "Widowed", "Divorced", "Separated", "Divorced", "Separa…
#> $ occupation <chr> "Exec-managerial", "Machine-op-inspct", "Prof-specialty…
#> $ race <chr> "White", "White", "White", "White", "White", "White", "…
#> $ sex <chr> "Female", "Female", "Female", "Female", "Male", "Female…
#> $ capital.gain <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ capital.loss <int> 4356, 3900, 3900, 3770, 3770, 3683, 3683, 3004, 3004, 2…
#> $ hours.per.week <int> 18, 40, 40, 45, 40, 20, 40, 60, 35, 45, 20, 55, 40, 76,…
#> $ native.country <chr> "United-States", "United-States", "United-States", "Uni…
#> $ income <chr> "<=50K", "<=50K", "<=50K", "<=50K", "<=50K", ">50K", "<…
After analysing the data types, I decided to convert several variables to factors: "race", "sex", "income", "workclass", "marital.status", "native.country", and "occupation".
Ds_income <- Ds_income %>%
mutate_at(c("race", "sex", "income", "workclass", "marital.status", "native.country", "occupation"), as.factor)
# Change values in the "income" column
Ds_income$income <- ifelse(Ds_income$income == ">50K", 1, 0)
Exploratory Data Analysis
Check distribution
# explore with summary
summary(Ds_income)
#> age workclass education.num
#> Min. :17.00 Federal-gov : 960 Min. : 1.00
#> 1st Qu.:28.00 Local-gov : 2093 1st Qu.: 9.00
#> Median :37.00 Private :22696 Median :10.00
#> Mean :38.44 Self-emp-inc : 1116 Mean :10.13
#> 3rd Qu.:47.00 Self-emp-not-inc: 2541 3rd Qu.:13.00
#> Max. :90.00 State-gov : 1298 Max. :16.00
#> Without-pay : 14
#> marital.status occupation
#> Divorced : 4258 Prof-specialty :4140
#> Married-AF-spouse : 21 Craft-repair :4099
#> Married-civ-spouse :14339 Exec-managerial:4066
#> Married-spouse-absent: 389 Adm-clerical :3770
#> Never-married : 9912 Sales :3650
#> Separated : 959 Other-service :3295
#> Widowed : 840 (Other) :7698
#> race sex capital.gain capital.loss
#> Amer-Indian-Eskimo: 286 Female: 9930 Min. : 0 Min. : 0.00
#> Asian-Pac-Islander: 974 Male :20788 1st Qu.: 0 1st Qu.: 0.00
#> Black : 2909 Median : 0 Median : 0.00
#> Other : 248 Mean : 1106 Mean : 88.91
#> White :26301 3rd Qu.: 0 3rd Qu.: 0.00
#> Max. :99999 Max. :4356.00
#>
#> hours.per.week native.country income
#> Min. : 1.00 United-States:27504 Min. :0.000
#> 1st Qu.:40.00 Mexico : 610 1st Qu.:0.000
#> Median :40.00 Philippines : 188 Median :0.000
#> Mean :40.95 Germany : 128 Mean :0.249
#> 3rd Qu.:45.00 Puerto-Rico : 109 3rd Qu.:0.000
#> Max. :99.00 (Other) : 1623 Max. :1.000
#> NA's : 556
Insight :
The age distribution of the individuals in the Ds_income dataset ranges from 17 to 90, with a median age of 37 and a mean age of 38.44.
The most common workclass is “Private” with 22,696 individuals, followed by “Local-gov” with 2,093 individuals, among other categories.
The education.num variable represents the level of education, ranging from 1 to 16, with a median of 10 and a mean of 10.13.
The dataset includes various marital statuses, with “Married-civ-spouse” being the most common, followed by “Never-married,” among other categories.
The dataset contains individuals with different occupations, with “Prof-specialty” and “Craft-repair” being the most common occupations.
The dataset comprises individuals from different racial backgrounds, with “White” and “Black” being the most common races.
The dataset consists of both male (20,788) and female (9,930) individuals.
The capital.gain variable ranges from 0 to 99,999, with a median of 0 and a mean of approximately 1,106.
The capital.loss variable ranges from 0 to 4,356, with a median of 0 and a mean of approximately 88.91.
The hours.per.week variable ranges from 1 to 99, with a median and mean of approximately 40 hours per week.
The dataset includes individuals from various countries, with “United-States” being the most common.
Approximately 75.1% of individuals in the dataset have an income at or below the $50K threshold, while approximately 24.9% have an income above it.
Checking outlier
# Load the necessary libraries
library(ggplot2)
library(dplyr)
library(tidyr)
# Select numerical variables
numerical_vars <- Ds_income %>%
select_if(is.numeric) %>%
names()
# Create box plot for numerical variables
gathered_data <- Ds_income %>%
select(all_of(numerical_vars)) %>%
gather(variable, value)
ggplot(gathered_data, aes(x = variable, y = value)) +
geom_boxplot(fill = "lightblue", color = "black") +
labs(title = "Box Plot of Numerical Variables",
x = "Variable",
y = "Value") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Nothing here looks critical to me. Given how large the dataset is, and since the only variable with extreme outliers is capital.gain, I don't think the outliers will have much of an impact, so I will leave them as they are.
Checking Correlation
ggcorr(Ds_income, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
As we can see, no pair of variables has a strong correlation (|r| > 0.5) with each other.
Check class-imbalance
prop.table(table(Ds_income$income))
#>
#> 0 1
#> 0.7509603 0.2490397
Based on this analysis and the visualization, the income classes are clearly imbalanced, so we will need a resampling technique to balance them. In the modelling stage I will try upsampling.
Modelling
Cross Validation
Cross-validation is a resampling technique used to assess the performance of a predictive model. It involves dividing the available data into multiple subsets, typically a training set and a validation set (or test set). The model is trained on the training set and then evaluated on the validation set to estimate its performance.
Now, let’s discuss why we use 70:30 train-test split cross-validation. The 70:30 split refers to allocating 70% of the data for training the model and reserving the remaining 30% for evaluating its performance. This split is a commonly used practice, but it is not a hard rule and can vary depending on the dataset size, complexity, and specific requirements of the problem at hand.
The rationale behind the 70:30 split is to strike a balance between having enough data for training the model effectively and having a sufficient amount of data for evaluating its performance. With 70% of the data used for training, the model can learn from a substantial portion of the available information and capture the underlying patterns in the data.
# Set seed for reproducibility
set.seed(123)
# Perform train-test splitting
train_index <- createDataPartition(Ds_income$income, p = 0.7, list = FALSE)
Income_train <- Ds_income[train_index, ]
Income_test <- Ds_income[-train_index, ]
# Check the class distribution in the training data
prop.table(table(Income_train$income))
#>
#> 0 1
#> 0.7512905 0.2487095
# Calculate the proportions of each income category
proportions <- table(Income_train$income) / length(Income_train$income) * 100
# Create a data frame for plotting
plot_data <- data.frame(Income_Category = names(proportions),
Proportion = as.numeric(proportions))
# Create the bar plot
ggplot(plot_data, aes(x = Income_Category, y = Proportion)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Proportion of Income Categories",
x = "Income Category",
y = "Proportion (%)") +
theme_minimal()
In the context of our dataset, the proportion of the income variable is not balanced, with approximately
75% of the data belonging to one class (0) and only about 25% belonging to the other class (1). This imbalance can pose challenges when building predictive models.
For upsampling we could use the upSample() function from the caret library (a short sketch is shown below). Upsampling is a technique used to address the issue of imbalanced data: it artificially increases the number of instances in the minority class so that the class distribution becomes more balanced, giving the model more balanced training data and helping it learn from both classes more effectively.
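As a point of comparison, here is a minimal sketch of caret's upSample(), which simply duplicates minority-class rows, applied to our training split (illustrative only; the analysis below uses the ROSE package instead):
# Minimal sketch: balance the classes by duplicating minority-class rows (caret)
up_train <- upSample(x = Income_train[, setdiff(names(Income_train), "income")],
y = factor(Income_train$income),
yname = "income")
# The resulting class distribution should be roughly 50:50
prop.table(table(up_train$income))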
The ROSE package in R (Random Over-Sampling Examples) provides functionality for this kind of resampling. Its ovun.sample() function can oversample the minority class, undersample the majority class, or do both, using random sampling with replacement, while the ROSE() function generates synthetic examples by perturbing existing minority-class instances, thereby increasing their representation in the dataset.
Performing Upsampling using the ROSE package
# Perform combined over- and under-sampling with ovun.sample() from ROSE
upsampled_data <- ovun.sample(income ~ .,
data = Income_train,
method = "both",
p = 0.5,
seed = 123)
# Check the class distribution after upsampling
prop.table(table(upsampled_data$data$income))
#>
#> 0 1
#> 0.5080102 0.4919898
Then we store the target labels in the train and test label variables.
Ds_train_labels <- Ds_income[train_index, ]$income
Ds_test_labels <- Ds_income[-train_index, ]$income
Naive Bayes
Naive Bayes is a popular classification algorithm used in machine learning and data science. It is based on Bayes’ theorem, which describes the relationship between conditional probabilities of events. Naive Bayes is particularly well-suited for certain types of problems and offers several advantages:
- Simplicity and Efficiency: Naive Bayes is known for its simplicity and efficiency. It is relatively easy to understand and implement, making it a good choice for both beginners and experienced practitioners.
- Interpretability: Naive Bayes provides interpretability by estimating the conditional probabilities of each feature given the class variable. This allows us to understand the contribution and importance of each feature in the classification decision.
Model Fitting
Explanation of Assumptions
Naive Bayes is called "naive" because it assumes that the predictors are conditionally independent given the class, i.e. that each predictor has no relationship with the other predictors. This allows us to simply multiply the conditional probabilities.
\[P(A\ \cap\ B\ \cap\ C\ |\ Income) = P(A\ |\ Income) \times P(B\ |\ Income) \times P(C\ |\ Income)\]
So that :
\[P(Income\ |\ A\ \cap\ B\ \cap\ C) = \frac{P(Income) \ P(A\ |\ Income)\ P(B\ |\ Income)\ P(C\ | \ Income)}{P(A \cap B \cap C)}\]
The denominator \(P(A \cap B \cap C)\) cannot simply be factored in the same way; it needs to be expanded further.
–
Advanced Description
Because the independence assumption applies only to the conditional probabilities, we first need to expand \(P(A \cap B \cap C)\) into a sum using the law of total probability:
\[P(A \cap B \cap C) = P(A \cap B \cap C\ |\ Income)\, P(Income) + P(A \cap B \cap C\ |\ \neg Income)\, P(\neg Income)\]
So that:
\[P(Income\ |\ A\ \cap\ B\ \cap\ C) = \frac{P(Income)\ P(A\ |\ Income)\ P(B\ |\ Income)\ P(C\ |\ Income)}{P(A \cap B \cap C\ |\ Income)\, P(Income) + P(A \cap B \cap C\ |\ \neg Income)\, P(\neg Income)}\]
Notes:
The negation \(\neg Income\) is the event we are not interested in; in this case it is income <= 50K (the class coded 0).
Something to remember:
- Naive Bayes assumes that the predictors are unrelated (even when, in reality, they are related, Naive Bayes treats them as if they were not).
- Naive Bayes applies this independence assumption to its conditional probabilities to simplify the calculation.
–
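As a toy numeric illustration of this independence assumption (hypothetical probabilities, not estimated from our data): if \(P(A\ |\ Income) = 0.6\), \(P(B\ |\ Income) = 0.5\), and \(P(C\ |\ Income) = 0.3\), Naive Bayes treats the joint conditional probability as
\[P(A \cap B \cap C\ |\ Income) = 0.6 \times 0.5 \times 0.3 = 0.09\]
regardless of any real dependence between A, B, and C.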
# Convert variables to factors
upsampled_data$data$income <- factor(upsampled_data$data$income)
Income_test$income <- factor(Income_test$income)
# Train the Naive Bayes model
nb_model <- naiveBayes(income ~ ., data = upsampled_data$data)
# Make predictions on the test data
predictions <- predict(nb_model, newdata = Income_test)
Model Evaluation
In this section we will evaluate the model fitted with naiveBayes(). It estimates conditional probabilities by assuming independence between features given the class variable and uses these probabilities to classify new instances. Naive Bayes is a popular and efficient algorithm that works well in a variety of classification tasks.
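To see these estimated quantities directly, we can inspect the fitted object; a short sketch (the naiveBayes() object from e1071 stores the class counts in apriori and the per-predictor conditional tables in tables):
# Class counts and two of the conditional probability tables from the fitted model
nb_model$apriori # counts of each income class in the (upsampled) training data
nb_model$tables$sex # P(sex | income) estimated from the training data
nb_model$tables$hours.per.week # per-class mean and standard deviation for a numeric predictor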
Confusion Matrix
Confusion Matrix Test
# Evaluate the performance of the model
confusion_matrix <- confusionMatrix(predictions, Income_test$income, positive = "0")
confusion_matrix
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 6436 1371
#> 1 477 931
#>
#> Accuracy : 0.7995
#> 95% CI : (0.7911, 0.8076)
#> No Information Rate : 0.7502
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.3853
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.9310
#> Specificity : 0.4044
#> Pos Pred Value : 0.8244
#> Neg Pred Value : 0.6612
#> Prevalence : 0.7502
#> Detection Rate : 0.6984
#> Detection Prevalence : 0.8472
#> Balanced Accuracy : 0.6677
#>
#> 'Positive' Class : 0
#>
The evaluation of the naiveBayes() model is as
follows:
Accuracy: The accuracy of the model is 0.7995, indicating that approximately 79.95% of the predictions made by the model are correct.
Sensitivity (True Positive Rate): The sensitivity of the model is 0.9310, indicating that it correctly identifies 93.10% of the positive-class instances (here the "0", or <=50K, class).
Specificity (True Negative Rate): The specificity of the model is 0.4044, indicating that it correctly identifies only 40.44% of the negative-class instances.
Positive Predictive Value (Precision): The positive predictive value, also known as precision, is 0.8244. It represents the proportion of correctly predicted positive instances out of all instances predicted as positive.
Negative Predictive Value: The negative predictive value is 0.6612. It represents the proportion of correctly predicted negative instances out of all instances predicted as negative.
In summary, the naiveBayes() model achieves an accuracy of approximately 79.95%, which is better than the no information rate. It exhibits a fair agreement between predicted and actual classes (kappa value of 0.3853). The model demonstrates high sensitivity (93.10%) but low specificity (40.44%). Precision (positive predictive value) is 82.44%, indicating a relatively high proportion of correctly predicted positive instances. The negative predictive value is 66.12%. The prevalence of the positive class is 75.02%. The detection rate is 69.84% and the detection prevalence is 84.72%. The balanced accuracy, which considers both sensitivity and specificity, is 66.77%.
It’s important to interpret these evaluation metrics in the context of the specific problem and domain knowledge. Further analysis and comparison with other models or evaluation metrics may be necessary to gain a comprehensive understanding of the model’s performance.
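The F1 score listed in the objectives is not shown in the default confusionMatrix() printout above, but in recent versions of caret it is available in the byClass slot, and it can also be computed by hand from the test-set counts (a quick sketch using the matrix above):
# Precision, recall and F1 for the Naive Bayes model (positive class = "0")
confusion_matrix$byClass[c("Precision", "Recall", "F1")]
# Equivalent manual calculation from the test-set confusion matrix counts
tp <- 6436; fp <- 1371; fn <- 477
c(precision = tp / (tp + fp),
recall = tp / (tp + fn),
F1 = 2 * tp / (2 * tp + fp + fn))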
Confusion Matrix Train
# Make predictions on the training data using the Naive Bayes model
train_predictions_nb <- predict(nb_model,
newdata = upsampled_data$data,
type = "class")
# Create the confusion matrix for the training data
confusionMatrix(train_predictions_nb, upsampled_data$data$income)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 9983 5917
#> 1 735 4463
#>
#> Accuracy : 0.6847
#> 95% CI : (0.6784, 0.691)
#> No Information Rate : 0.508
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.3643
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.9314
#> Specificity : 0.4300
#> Pos Pred Value : 0.6279
#> Neg Pred Value : 0.8586
#> Prevalence : 0.5080
#> Detection Rate : 0.4732
#> Detection Prevalence : 0.7536
#> Balanced Accuracy : 0.6807
#>
#> 'Positive' Class : 0
#>
Model Condition
- Overfitting: the model performs well on the train data but very poorly on the test data. This happens when the model is too complex (it captures patterns in too much detail), so it fails on the test data. For example: train performance 90%, test performance 40%.
- Underfitting: the model performs poorly on both the train data and the test data. This happens when the model is too simple for data that is not that simple (for example, using a linear model for non-linear data). For example: train performance 20%, test performance 15%.
- Just right: the model performs well on the train data and its performance changes only slightly (difference < 0.1) on the test data. For example: train performance 89%, test performance 91%.
Based on these conditions, our model is a just-right model, with recall (sensitivity) of 93.14% on the train data and 93.10% on the test data.
ROC and AUC
When using accuracy alone, we don't know whether the model can properly separate the positive and negative classes or not. Therefore, we will examine two other evaluation metrics, ROC and AUC.
ROC : ROC (Receiver-Operating Curve) is a curve that describes the relationship between the True Positive Rate (Sensitivity or Recall) and the False Positive Rate (1-Specificity) at each threshold. A good model should ideally have a high True Positive Rate and a low False Positive Rate.
AUC : AUC (Area Under ROC Curve) shows the area under the ROC curve. The closer to 1, the better the model’s performance in separating positive and negative classes.
# Make predictions on the test data
prediction_probs <- predict(nb_model, newdata = Income_test, type = "raw")
# Extract predicted probabilities for the positive class
positive_probs <- prediction_probs[, "1"]
# Compute the ROC curve
roc_curve <- roc(Income_test$income, positive_probs)
# Calculate the AUC
auc <- auc(roc_curve)
# Print the AUC and plot the ROC curve
plot(roc_curve, main = "ROC Curve", xlab = "False Positive Rate", ylab = "True Positive Rate")
cat("AUC:", auc, "\n")
#> AUC: 0.8545437
Insight from our ROC and AUC values:
- ROC: in a typical ROC curve, the false positive rate starts at 0 on the x-axis and increases gradually as the true positive rate increases on the y-axis. Our plot instead appears to start at 1.0 on the x-axis; this is most likely because pROC's plot() draws specificity on the x-axis running from 1 down to 0 by default (the legacy.axes argument switches to the conventional 1 - specificity axis), rather than a sign that the model predicts every instance as positive.
- AUC: our AUC value is 0.8545. Because this number is close to 1, the model separates the positive and negative classes well.
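To draw the curve with the conventional false-positive-rate axis running from 0 to 1, pROC's legacy.axes argument can be used (a small sketch reusing the roc_curve object above):
# Re-plot the ROC curve with 1 - specificity (FPR) increasing from 0 to 1
plot(roc_curve, legacy.axes = TRUE,
main = "ROC Curve", xlab = "False Positive Rate", ylab = "True Positive Rate")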
Decision Tree
Decision Trees offer interpretability, handle both numerical and categorical features, capture complex relationships, rank feature importance, handle missing values, and form the basis for powerful ensemble methods. They are particularly useful when interpretability is desired, nonlinear relationships need to be captured, and feature importance is important. However, careful consideration should be given to overfitting and the potential need for regularization techniques to optimize their performance.
Model Fitting
The rpart package is a powerful and flexible tool for
creating Decision Trees in R. It offers robust and efficient algorithms,
handles different types of features, provides mechanisms to prevent
overfitting, and produces interpretable trees. Its integration with
other packages and frameworks enhances its usability and allows for
further model improvements.
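The objectives also mention tuning the tree. rpart exposes its main hyperparameters through rpart.control(); the sketch below uses illustrative values only and is not the configuration used for dt_model fitted next:
# Illustrative hyperparameter settings for rpart (dt_model below uses the defaults)
dt_tuned <- rpart(income ~ .,
data = upsampled_data$data,
method = "class",
control = rpart.control(minsplit = 20, # minimum observations needed to attempt a split
cp = 0.001, # complexity parameter controlling pruning
maxdepth = 10)) # maximum depth of the tree
printcp(dt_tuned) # cross-validated error for each candidate cp value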
# Train the decision tree model
dt_model <- rpart(income ~ ., data = upsampled_data$data)
# Plot the decision tree
rpart.plot(dt_model, extra = 1, box.palette = "Blues")
Model Evaluation
Confusion Matrix
Confusion Matrix Test
# Make predictions on the test data using the decision tree model
test_predictions <- predict(dt_model, newdata = Income_test, type = "class")
# Create the confusion matrix for the test data
confusionMatrix(test_predictions, Income_test$income)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 5085 303
#> 1 1828 1999
#>
#> Accuracy : 0.7687
#> 95% CI : (0.76, 0.7773)
#> No Information Rate : 0.7502
#> P-Value [Acc > NIR] : 0.00001767
#>
#> Kappa : 0.4947
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.7356
#> Specificity : 0.8684
#> Pos Pred Value : 0.9438
#> Neg Pred Value : 0.5223
#> Prevalence : 0.7502
#> Detection Rate : 0.5518
#> Detection Prevalence : 0.5847
#> Balanced Accuracy : 0.8020
#>
#> 'Positive' Class : 0
#>
Accuracy: The accuracy of the model is 0.7687, indicating that approximately 76.87% of the predictions made by the model are correct.
Sensitivity (True Positive Rate): The sensitivity of the model is 0.7356, indicating that it correctly identifies 73.56% of the positive-class instances.
Specificity (True Negative Rate): The specificity of the model is 0.8684, indicating that it correctly identifies 86.84% of the negative-class instances.
Positive Predictive Value (Precision): The positive predictive value, also known as precision, is 0.9438. It represents the proportion of correctly predicted positive instances out of all instances predicted as positive.
Negative Predictive Value: The negative predictive value is 0.5223. It represents the proportion of correctly predicted negative instances out of all instances predicted as negative.
In summary, the rpart() model achieves an accuracy of approximately 76.87%, which is better than the no information rate. It exhibits a moderate agreement between predicted and actual classes (kappa value of 0.4947). The model demonstrates a reasonable sensitivity (73.56%) and a higher specificity (86.84%). Precision (positive predictive value) is 94.38%, indicating a high proportion of correctly predicted positive instances. The negative predictive value is 52.23%. The prevalence of the positive class is 75.02%. The detection rate is 55.18%, and the detection prevalence is 58.47%. The balanced accuracy, which considers both sensitivity and specificity, is 80.20%.
Confusion Matrix Train
# Make predictions on the training data using the decision tree model
train_predictions <- predict(dt_model,
newdata = upsampled_data$data,
type = "class")
# Create the confusion matrix for the training data
confusionMatrix(train_predictions, upsampled_data$data$income)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 8046 1465
#> 1 2672 8915
#>
#> Accuracy : 0.8039
#> 95% CI : (0.7985, 0.8093)
#> No Information Rate : 0.508
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.6084
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.7507
#> Specificity : 0.8589
#> Pos Pred Value : 0.8460
#> Neg Pred Value : 0.7694
#> Prevalence : 0.5080
#> Detection Rate : 0.3814
#> Detection Prevalence : 0.4508
#> Balanced Accuracy : 0.8048
#>
#> 'Positive' Class : 0
#>
Model Condition
- Overfitting: the model performs well on the train data but very poorly on the test data. This happens when the model is too complex (it captures patterns in too much detail), so it fails on the test data. For example: train performance 90%, test performance 40%.
- Underfitting: the model performs poorly on both the train data and the test data. This happens when the model is too simple for data that is not that simple (for example, using a linear model for non-linear data). For example: train performance 20%, test performance 15%.
- Just right: the model performs well on the train data and its performance changes only slightly (difference < 0.1) on the test data. For example: train performance 89%, test performance 91%.
Based on these conditions, our model is a just-right model, with recall (sensitivity) of 75.07% on the train data and 73.56% on the test data.
Random Forest
Random Forest is a type of ensemble method which consists of many decision trees. Each decision tree has its own characteristics and is not related to the others. Random Forest makes use of the bagging (bootstrap and aggregation) concept in its creation. Here is the process (a short illustrative sketch follows the list):
- Bootstrap sampling: generate datasets by random sampling (with replacement) from the full data, which allows duplicate rows.
- Build one decision tree for each bootstrap sample. The mtry parameter controls how many predictor candidates are randomly selected at each split (automatic feature selection).
- Make predictions for new observations with each decision tree.
- Aggregation: combine the individual predictions into a single prediction.
- Classification case: majority voting
- Regression case: average of the target values
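A minimal sketch of these ideas using randomForest() called directly (illustrative parameter values; the model actually used below is trained through caret):
# Illustrative direct call to randomForest (not the caret-trained model used below)
# ntree = number of bootstrap samples / trees, mtry = predictors sampled at each split
rf_sketch <- randomForest(income ~ .,
data = upsampled_data$data,
ntree = 100,
mtry = 3,
importance = TRUE)
rf_sketch # prints the OOB error estimate and the OOB confusion matrix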
Model Fitting
For building this model we will use the caret library.
In the caret library, many models are available (the list of models supported by caret can be accessed at the following link).
ani.options(interval = 1, nmax = 15)
cv.ani(main = "Demonstration of the k-fold Cross Validation", bty = "l")
From the upsampled_data$data we created, we will build a random forest model with k-fold cross-validation (k = 5), repeating the k-fold split 3 times:
# set.seed(417)
# control <- trainControl(method = "repeatedcv", number = 5, repeats = 3, verboseIter = TRUE)
# Model Fitting Random Forest
# rf_model <- train(income ~ .,
#                   data = upsampled_data$data,
#                   method = "rf",
#                   trControl = control)
# Note: caret's argument is trControl. The saved model loaded below was trained with
# "trainControl = control", which train() does not recognise and simply passes on, so
# caret fell back to its default bootstrap resampling (25 reps), as the summary below shows.
The longer execution time for Random Forest model fitting can be attributed to multiple decision trees, bagging and random sampling, feature subsetting, potential lack of parallelization, and dataset size and complexity. Although it can require more computational resources, the benefits of Random Forest, such as improved accuracy and robustness, make it a widely used and powerful algorithm for many machine learning tasks.
A good practice after completing training is to save the model in RDS
file form with the saveRDS() function so that the model can
be used immediately without training from the start.
# save the model
# saveRDS(rf_model, "rf_forest.RDS")
The saveRDS() function takes the model object and the output file name (in .RDS format).
# read model from RDS file
income_Rforest <- readRDS("model/rf_forest.RDS")
income_Rforest
#> Random Forest
#>
#> 21135 samples
#> 11 predictor
#> 2 classes: '0', '1'
#>
#> No pre-processing
#> Resampling: Bootstrapped (25 reps)
#> Summary of sample sizes: 21135, 21135, 21135, 21135, 21135, 21135, ...
#> Resampling results across tuning parameters:
#>
#> mtry Accuracy Kappa
#> 2 0.8033148 0.6073829
#> 38 0.9037367 0.8075650
#> 75 0.9004418 0.8009666
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 38.
Out of bag Error
Even though we previously split the data and can evaluate on the test set, we don't strictly need separate cross-validation for a random forest. Because of bootstrap sampling, some observations are not used when building each tree; these out-of-bag observations act as a built-in test set. The model makes predictions for them and calculates the resulting error, which is referred to as the out-of-bag (OOB) error.
# final models
income_Rforest$finalModel
#>
#> Call:
#> randomForest(x = x, y = y, mtry = param$mtry, trainControl = ..1)
#> Type of random forest: classification
#> Number of trees: 500
#> No. of variables tried at each split: 38
#>
#> OOB estimate of error rate: 7.04%
#> Confusion matrix:
#> 0 1 class.error
#> 0 9719 1015 0.09455934
#> 1 473 9928 0.04547640
Explanation of the model$finalModel summary:
- Number of trees: 500 –> the number of trees in the forest
- No. of variables tried at each split: 38 –> mtry
- OOB estimate of error rate: 7.04% (the error on out-of-bag samples, i.e. samples not selected during bootstrap sampling)
- Confusion matrix
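The OOB error can also be inspected as a function of the number of trees; a quick sketch (plot() on a randomForest object draws the OOB and per-class error curves):
# OOB and per-class error rates as trees are added to the forest
plot(income_Rforest$finalModel, main = "OOB error vs. number of trees")
legend("topright", legend = colnames(income_Rforest$finalModel$err.rate),
col = 1:3, lty = 1:3)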
Model Interpretation
Even though the random forest is often labeled a non-interpretable model, we can at least see which predictors are used most (i.e. are most important) when building the forest. We can use the varImp() function:
# Variable importance from the fitted random forest
varImp(income_Rforest)
#> rf variable importance
#>
#> only 20 most important variables shown (out of 75)
#>
#> Overall
#> marital.statusMarried-civ-spouse 100.000
#> age 79.180
#> education.num 60.160
#> capital.gain 44.653
#> hours.per.week 42.059
#> marital.statusNever-married 23.925
#> capital.loss 12.136
#> sexMale 8.954
#> workclassPrivate 6.997
#> occupationExec-managerial 6.741
#> occupationProf-specialty 5.816
#> occupationCraft-repair 5.317
#> occupationSales 5.088
#> workclassSelf-emp-not-inc 4.705
#> occupationOther-service 3.942
#> workclassLocal-gov 3.746
#> occupationTransport-moving 3.670
#> raceWhite 3.521
#> occupationMachine-op-inspct 3.136
#> workclassSelf-emp-inc 2.971
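These importances can also be visualized with caret's plot method for varImp objects (a one-line sketch):
# Plot the 10 most important predictors from the random forest
plot(varImp(income_Rforest), top = 10)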
Model Prediction & Evaluation
Confusion Matrix
Confusion Matrix Train
# Make predictions on the training data using the random forest model
train_predictions_rf <- predict(income_Rforest,
newdata = upsampled_data$data,
type = "raw")
# Create the confusion matrix for the training data
confusionMatrix(train_predictions_rf, upsampled_data$data$income)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 9469 988
#> 1 1249 9392
#>
#> Accuracy : 0.894
#> 95% CI : (0.8897, 0.8981)
#> No Information Rate : 0.508
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.788
#>
#> Mcnemar's Test P-Value : 0.00000003859
#>
#> Sensitivity : 0.8835
#> Specificity : 0.9048
#> Pos Pred Value : 0.9055
#> Neg Pred Value : 0.8826
#> Prevalence : 0.5080
#> Detection Rate : 0.4488
#> Detection Prevalence : 0.4956
#> Balanced Accuracy : 0.8941
#>
#> 'Positive' Class : 0
#>
The evaluation of the Random Forest model (trained with caret's train()) on the upsampled training data is as follows:
Accuracy: The accuracy of the model is 0.8940, indicating that approximately 89.40% of the predictions made by the model are correct.
Sensitivity (True Positive Rate): The sensitivity of the model is 0.8835, indicating that it correctly identifies 88.35% of the positive-class instances.
Specificity (True Negative Rate): The specificity of the model is 0.9048, indicating that it correctly identifies 90.48% of the negative-class instances.
Positive Predictive Value (Precision): The positive predictive value, also known as precision, is 0.9055. It represents the proportion of correctly predicted positive instances out of all instances predicted as positive.
Negative Predictive Value: The negative predictive value is 0.8826. It represents the proportion of correctly predicted negative instances out of all instances predicted as negative.
In summary, the Random Forest model achieves an accuracy of approximately 89.40% on the upsampled training data, which is significantly better than the no information rate. It exhibits a strong agreement between predicted and actual classes (kappa value of 0.788). The model demonstrates high sensitivity (88.35%) and specificity (90.48%). Precision (positive predictive value) is 90.55%, and the negative predictive value is 88.26%. The prevalence of the positive class is 50.80%. The detection rate is 44.88%, and the detection prevalence is 49.56%. The balanced accuracy, which considers both sensitivity and specificity, is 89.41%. Note that this evaluation uses the (upsampled) training data; the out-of-bag error of 7.04% reported earlier gives an estimate of performance on unseen data.
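To put the random forest on the same footing as the other two models, it could also be scored on the held-out test set; a short sketch (results not shown here):
# Score the random forest on the held-out test set for a like-for-like comparison
test_predictions_rf <- predict(income_Rforest, newdata = Income_test, type = "raw")
confusionMatrix(test_predictions_rf, Income_test$income, positive = "0")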
Evaluation
After making predictions using the model, there are still wrong predictions. In classification, we evaluate the model based on the confusion matrix:
Contents of the Confusion Matrix:
- True Positive (TP): predicted positive and true (positive prediction; actual positive)
- True Negative (TN): predicted negative and true (negative prediction; negative actual)
- False Positive (FP): predicted positive but wrong (predictive positive; actual negative)
- False Negative (FN): predicted negative but wrong (negative prediction; positive actual)
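From these four counts, the metrics compared in the table below follow the standard definitions:
\[Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad Precision = \frac{TP}{TP + FP}\]
\[Recall\ (Sensitivity) = \frac{TP}{TP + FN} \qquad Specificity = \frac{TN}{TN + FP}\]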
# Create the dataframe of evaluation metrics
# (Naive Bayes and Decision Tree are evaluated on the test set;
# the Random Forest values come from the training-data confusion matrix above)
FinalJoin_confusion_matrix <- data.frame(
Model = c("Naive Bayes", "Decision Tree", "Random Forest"),
Accuracy = c(0.7995, 0.7687, 0.8940),
Sensitivity = c(0.9310, 0.7356, 0.8835),
Specificity = c(0.4044, 0.8684, 0.9048),
Precision = c(0.8244, 0.9438, 0.9055),
Negative_Predictive_Value = c(0.6612, 0.5223, 0.8826)
)
# Print the dataframe
rmarkdown::paged_table(FinalJoin_confusion_matrix)
Creating Visualization
# Convert the dataframe to long format
confusion_matrix_long <- FinalJoin_confusion_matrix %>%
pivot_longer(cols = -Model, names_to = "Metric", values_to = "Value")
# Create the stacked line plot
ggplot(confusion_matrix_long, aes(x = Model, y = Value, group = Metric, color = Metric)) +
geom_line() +
geom_point() +
labs(x = "Model", y = "Value", title = "Final Confusion Matrix") +
theme_minimal() +
theme(legend.position = "bottom")
Conclusion
After comparing three different classification machine learning models (Naive Bayes, Decision Tree, and Random Forest) and prioritizing Random Forest due to its superior performance, we can draw the following conclusion:
Among the three models evaluated, Random Forest exhibited the highest overall performance across the evaluation metrics. On the upsampled training data it achieved an accuracy of 89.40%, precision of 90.55%, and recall of 88.35%, with an out-of-bag error estimate of only 7.04%. These results indicate that Random Forest outperformed both Naive Bayes and Decision Tree in overall classification accuracy and in balancing sensitivity and specificity.
Naive Bayes, although achieving a respectable test-set accuracy of 79.95% and a high recall (93.10%), had very low specificity (40.44%). This suggests that Naive Bayes struggled to correctly identify the >50K class and produced a comparatively large number of false positives for the majority class.
Decision Tree achieved a test-set accuracy of 76.87% with the highest precision (94.38%) but lower recall (73.56%) than Random Forest. Single decision trees can also be less stable and may not generalize as well to unseen data.
Considering these results, it is recommended to prioritize the Random Forest model due to its superior performance and its ability to capture complex relationships within the data. However, because the Random Forest figures above come from the training data, scoring it on the held-out test set, along with further analysis and experimentation on other scenarios or datasets, is recommended to confirm the generalizability and robustness of the chosen model.