Brain Stroke Prediction using Machine Learning

Introduction

Main

Previously I built models for Brain Stroke Prediction based on Logistic Regression and K-NN. However, the accuracy of both was not satisfying; I expect the accuracy of the model to be above 85%. The processing of those models can be seen here. With that in mind, we are going to use three different models: Naive Bayes Classifier, Decision Tree and Random Forest.

Objectives

A stroke is an interruption of the blood supply to any part of the brain. If blood flow is stopped for longer than a few seconds and the brain cannot get blood and oxygen, brain cells can die, and the abilities controlled by that area of the brain are lost. In this R Markdown we will use some features to see whether we are able to predict a stroke or not.

Library and Setup

library(tidyverse)
library(dplyr)
library(rsample)
library(caret)

Data Preparation

Read Data

brain <- read.csv("full_data.csv")

glimpse(brain)

#> Rows: 4,981
#> Columns: 11
#> $ gender            <chr> "Male", "Male", "Female", "Female", "Male", "Male", …
#> $ age               <dbl> 67, 80, 49, 79, 81, 74, 69, 78, 81, 61, 54, 79, 50, …
#> $ hypertension      <int> 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1…
#> $ heart_disease     <int> 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0…
#> $ ever_married      <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes…
#> $ work_type         <chr> "Private", "Private", "Private", "Self-employed", "P…
#> $ Residence_type    <chr> "Urban", "Rural", "Urban", "Rural", "Urban", "Rural"…
#> $ avg_glucose_level <dbl> 228.69, 105.92, 171.23, 174.12, 186.21, 70.09, 94.39…
#> $ bmi               <dbl> 36.6, 32.5, 34.4, 24.0, 29.0, 27.4, 22.8, 24.2, 29.7…
#> $ smoking_status    <chr> "formerly smoked", "never smoked", "smokes", "never …
#> $ stroke            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…

Attribute Information

gender: “Male”, “Female” or “Other”
age: age of the patient
hypertension: 0 if the patient doesn’t have hypertension, 1 if the patient has hypertension
heart_disease: 0 if the patient doesn’t have any heart diseases, 1 if the patient has a heart disease
ever_married: “No” or “Yes”
work_type: “children”, “Govt_job”, “Never_worked”, “Private” or “Self-employed”
Residence_type: “Rural” or “Urban”
avg_glucose_level: average glucose level in blood
bmi: body mass index
smoking_status: “formerly smoked”, “never smoked”, “smokes” or “Unknown”*
stroke: 0 if not or 1 if the patient had a stroke

*Note: “Unknown” in smoking_status means that the information is unavailable for this patient

This is a bigger picture of our dataset:

rmarkdown::paged_table(brain)

Data Wrangling

Some of the variables have an unsuitable data type, so we need to convert them: the character columns become factors, and the 0/1 columns hypertension, heart_disease and stroke become factors as well.

brain <- brain %>%
  mutate_if(is.character, as.factor) %>%
  mutate(hypertension = factor(hypertension, levels = c(0, 1)),
         heart_disease = factor(heart_disease, levels = c(0, 1)),
         stroke = factor(stroke, levels = c(0, 1))
         )
glimpse(brain)

#> Rows: 4,981
#> Columns: 11
#> $ gender            <fct> Male, Male, Female, Female, Male, Male, Female, Fema…
#> $ age               <dbl> 67, 80, 49, 79, 81, 74, 69, 78, 81, 61, 54, 79, 50, …
#> $ hypertension      <fct> 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1…
#> $ heart_disease     <fct> 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0…
#> $ ever_married      <fct> Yes, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, Yes, Yes…
#> $ work_type         <fct> Private, Private, Private, Self-employed, Private, P…
#> $ Residence_type    <fct> Urban, Rural, Urban, Rural, Urban, Rural, Urban, Urb…
#> $ avg_glucose_level <dbl> 228.69, 105.92, 171.23, 174.12, 186.21, 70.09, 94.39…
#> $ bmi               <dbl> 36.6, 32.5, 34.4, 24.0, 29.0, 27.4, 22.8, 24.2, 29.7…
#> $ smoking_status    <fct> formerly smoked, never smoked, smokes, never smoked,…
#> $ stroke            <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…

Exploratory Data Analysis

Check missing value

colSums(is.na(brain))

#>            gender               age      hypertension     heart_disease 
#>                 0                 0                 0                 0 
#>      ever_married         work_type    Residence_type avg_glucose_level 
#>                 0                 0                 0                 0 
#>               bmi    smoking_status            stroke 
#>                 0                 0                 0

Check summaries of the data

summary(brain)

#>     gender          age        hypertension heart_disease ever_married
#>  Female:2907   Min.   : 0.08   0:4502       0:4706        No :1701    
#>  Male  :2074   1st Qu.:25.00   1: 479       1: 275        Yes:3280    
#>                Median :45.00                                          
#>                Mean   :43.42                                          
#>                3rd Qu.:61.00                                          
#>                Max.   :82.00                                          
#>          work_type    Residence_type avg_glucose_level      bmi      
#>  children     : 673   Rural:2449     Min.   : 55.12    Min.   :14.0  
#>  Govt_job     : 644   Urban:2532     1st Qu.: 77.23    1st Qu.:23.7  
#>  Private      :2860                  Median : 91.85    Median :28.1  
#>  Self-employed: 804                  Mean   :105.94    Mean   :28.5  
#>                                      3rd Qu.:113.86    3rd Qu.:32.6  
#>                                      Max.   :271.74    Max.   :48.9  
#>          smoking_status stroke  
#>  formerly smoked: 867   0:4733  
#>  never smoked   :1838   1: 248  
#>  smokes         : 776           
#>  Unknown        :1500           
#>                                 
#>

Check Class Imbalance

prop.table(table(brain$stroke))

#> 
#>         0         1 
#> 0.9502108 0.0497892

Cross Validation 1

To evaluate the model and see its ability to predict new data, our data is divided into two parts: train data and test data. Strictly speaking this is a hold-out split rather than k-fold cross-validation. Our data will be split into 80% train data and 20% test data, using stroke as the stratification variable.

RNGkind(sample.kind = "Rounding")
set.seed(100)

index <- initial_split(data = brain, prop = 0.80, strata = stroke)

brain_train <- training(index)
brain_test <- testing(index)

prop.table(table(brain_train$stroke))

#> 
#>          0          1 
#> 0.94879518 0.05120482

table(brain_train$stroke)

#> 
#>    0    1 
#> 3780  204

If you pay attention, the class proportion of brain_train is imbalanced. There are two simple techniques for dealing with imbalanced data: upsampling and downsampling.

Upsampling: this method increases the size of the minority class by sampling with replacement so that the classes will have the same size.
Downsampling: in contrast to the above method, this one decreases the size of the majority class to be the same or closer to the minority class size by just taking out a random sample.
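The two techniques above can be sketched in base R on a toy data frame. This is only an illustration of the idea, not how caret implements downSample() or upSample(); the data frame toy and its columns are made up for the example.

```r
# Toy data: 8 majority rows (class 0) and 2 minority rows (class 1).
set.seed(1)
toy <- data.frame(
  x     = 1:10,
  class = factor(c(rep(0, 8), rep(1, 2)))
)

idx_major <- which(toy$class == "0")
idx_minor <- which(toy$class == "1")

# Downsampling: keep every minority row and draw an equally sized
# random subset of the majority rows (without replacement).
down_idx <- c(idx_minor,
              idx_major[sample.int(length(idx_major), length(idx_minor))])
toy_down <- toy[down_idx, ]

# Upsampling: keep every majority row and draw minority rows with
# replacement until both classes have the same size.
up_idx <- c(idx_major,
            idx_minor[sample.int(length(idx_minor), length(idx_major),
                                 replace = TRUE)])
toy_up <- toy[up_idx, ]

table(toy_down$class)  # 2 rows per class
table(toy_up$class)    # 8 rows per class
```

Downsampling throws away majority rows, while upsampling repeats the same minority rows several times.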

Here we are going to use the downsampling method for this imbalanced data.

# downsampling
RNGkind(sample.kind = "Rounding")
set.seed(100)
library(caret)

brain_train <- downSample(x = brain_train %>% select(-stroke), 
                          y = brain_train$stroke,
                          yname = "stroke")

prop.table(table(brain_train$stroke))

#> 
#>   0   1 
#> 0.5 0.5

table(brain_train$stroke)

#> 
#>   0   1 
#> 204 204

Model

Naive Bayes Classifier

Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. It is not a single algorithm but a family of algorithms that share a common principle: every feature is assumed to be independent of the others, given the class.
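To make the principle concrete, here is a small hand calculation in base R. It is a sketch on made-up data with two binary features, without Laplace smoothing, and it is not the internal code of e1071::naiveBayes(); the names toy and nb_score are only for this illustration.

```r
# Toy data: two binary features and a binary class (values are made up).
toy <- data.frame(
  hyper  = c(1, 1, 0, 0, 1, 0, 0, 0),
  smoker = c(1, 0, 1, 0, 1, 0, 0, 1),
  stroke = c(1, 1, 1, 0, 0, 0, 0, 0)
)

# Score of one class for a new observation:
# prior * product of the per-feature likelihoods within that class.
nb_score <- function(data, class, hyper, smoker) {
  rows  <- data[data$stroke == class, ]
  prior <- nrow(rows) / nrow(data)
  prior * mean(rows$hyper == hyper) * mean(rows$smoker == smoker)
}

# New observation: hypertension = 1, smoker = 1.
score_1 <- nb_score(toy, 1, hyper = 1, smoker = 1)
score_0 <- nb_score(toy, 0, hyper = 1, smoker = 1)

# Normalise so that the two class scores sum to 1.
posterior_1 <- score_1 / (score_1 + score_0)
posterior_1  # 10/13, about 0.769
```

Each feature contributes its own likelihood, and the likelihoods are simply multiplied, which is exactly where the independence assumption enters.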

Build Model

# create model
library(e1071)
model_nb_brain <- naiveBayes(formula = stroke~., data = brain_train)

Predict

# predict on data test
brain_test$pred_label <- predict(object = model_nb_brain, newdata = brain_test, type = "class")

Evaluation

confusionMatrix(data = brain_test$pred_label, reference = brain_test$stroke, positive = "1")

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 652   6
#>          1 301  38
#>                                              
#>                Accuracy : 0.6921             
#>                  95% CI : (0.6624, 0.7206)   
#>     No Information Rate : 0.9559             
#>     P-Value [Acc > NIR] : 1                  
#>                                              
#>                   Kappa : 0.1305             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.86364            
#>             Specificity : 0.68416            
#>          Pos Pred Value : 0.11209            
#>          Neg Pred Value : 0.99088            
#>              Prevalence : 0.04413            
#>          Detection Rate : 0.03811            
#>    Detection Prevalence : 0.34002            
#>       Balanced Accuracy : 0.77390            
#>                                              
#>        'Positive' Class : 1                  
#>

Decision Tree

Decision Tree is a popular and easy to interpret tool for classification and prediction. A Decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.

Build Model

# create model
library(partykit)
model_dt_brain <- ctree(formula = stroke~., data = brain_train)

# visualize decision tree
plot(model_dt_brain, type = "simple")

Predict

# predict data test
pred_dt_brain <- predict(object = model_dt_brain, newdata = brain_test, type = "response")

Evaluation

# confusion matrix data test
confusionMatrix(data = pred_dt_brain, reference = brain_test$stroke, positive = "1")

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 647   5
#>          1 306  39
#>                                              
#>                Accuracy : 0.6881             
#>                  95% CI : (0.6583, 0.7167)   
#>     No Information Rate : 0.9559             
#>     P-Value [Acc > NIR] : 1                  
#>                                              
#>                   Kappa : 0.1326             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.88636            
#>             Specificity : 0.67891            
#>          Pos Pred Value : 0.11304            
#>          Neg Pred Value : 0.99233            
#>              Prevalence : 0.04413            
#>          Detection Rate : 0.03912            
#>    Detection Prevalence : 0.34604            
#>       Balanced Accuracy : 0.78264            
#>                                              
#>        'Positive' Class : 1                  
#>

Random Forest

Random Forest is an ensemble technique capable of performing both regression and classification tasks with the use of multiple decision trees and a technique called Bootstrap and Aggregation, commonly known as bagging. The basic idea behind this is to combine multiple decision trees in determining the final output rather than relying on individual decision trees. Random Forest has multiple decision trees as base learning models. We randomly perform row sampling and feature sampling from the dataset forming sample datasets for every model. This part is called Bootstrap.

Build Model

#set.seed(417)
#ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

#model_rf_brain <- train(stroke~., data = brain_train, method="rf", trControl = ctrl)

## Save model
#saveRDS(model_rf_brain, file = "model_rf_brain.RDS")

# Read model
model_rf_brain <- readRDS("model_rf_brain.RDS")

# Summary model
model_rf_brain

#> Random Forest 
#> 
#> 408 samples
#>  10 predictor
#>   2 classes: '0', '1' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 3 times) 
#> Summary of sample sizes: 326, 326, 327, 326, 327, 327, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy   Kappa    
#>    2    0.7679615  0.5356479
#>    8    0.7655525  0.5309945
#>   14    0.7508180  0.5015343
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 2.

From the RF results, the best mtry is mtry = 2, although with mtry = 8 or 14 the accuracy obtained is almost the same. mtry is the number of variables randomly tried at each split when building a tree, and by default train() tries several mtry values and keeps the one with the highest cross-validated accuracy. The values go up to 14 although there are only 10 predictors, because the formula interface expands the factor predictors into dummy variables.

Out of Bag Error

When using random forest, we are not strictly required to split our dataset into train and test sets, because random forest already provides out-of-bag (OOB) estimates, which act as a reasonable estimate of the accuracy on unseen examples. It is still possible to hold out a regular train-test split as well. For example, the OOB estimate in the summary below was generated from our brain_train dataset.
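Where the out-of-bag rows come from can be sketched with one bootstrap sample in base R. This is a simplified illustration for a single tree, assuming the same number of rows as the downsampled training set (408); it does not call randomForest itself.

```r
set.seed(100)
n <- 408  # same size as the downsampled training set

# One bootstrap sample: draw n row indices with replacement.
in_bag <- sample.int(n, n, replace = TRUE)

# Rows that were never drawn are "out of bag" for this tree.
oob <- setdiff(seq_len(n), in_bag)

# For large n the out-of-bag share approaches exp(-1), about 0.368.
oob_share <- length(oob) / n
oob_share
```

Each tree is evaluated on the rows it never saw, and the OOB error aggregates those predictions over all trees.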

library(randomForest)
model_rf_brain$finalModel

#> 
#> Call:
#>  randomForest(x = x, y = y, mtry = param$mtry) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 2
#> 
#>         OOB estimate of  error rate: 22.55%
#> Confusion matrix:
#>     0   1 class.error
#> 0 146  58   0.2843137
#> 1  34 170   0.1666667

In the model_rf_brain model, the Out of Bag Error value is 22.55%. In other words, the accuracy of the model on the out-of-bag data (not the test data) is 77.45%.

# Accuracy 
accuracy <- 100 - 22.55
accuracy

#> [1] 77.45
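The same number can be recomputed from the OOB confusion matrix printed above instead of typing in the error rate by hand. The sketch below only re-enters the four counts from that output; oob_cm is a name chosen for this example.

```r
# OOB confusion matrix from the randomForest output (rows = actual class).
oob_cm <- matrix(c(146, 34, 58, 170), nrow = 2,
                 dimnames = list(actual = c("0", "1"),
                                 predicted = c("0", "1")))

# Overall OOB error: share of rows off the diagonal.
oob_error <- 1 - sum(diag(oob_cm)) / sum(oob_cm)

# Per-class error, as shown in the class.error column.
class_error <- 1 - diag(oob_cm) / rowSums(oob_cm)

round(100 * oob_error, 2)        # 22.55
round(100 * (1 - oob_error), 2)  # 77.45
class_error
```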

Interpretation

Even though the random forest is labeled as a non-interpretable model, at least we can see what predictors are most used (important) in making random forest:

varImp(model_rf_brain) %>% plot()

Here we find that the most important predictor in the random forest model is age.

Predict

pred_rf_brain <- predict(object = model_rf_brain, newdata = brain_test, type = "raw")

Evaluation

confusionMatrix(data = pred_rf_brain, reference = brain_test$stroke, positive = "1")

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 666   8
#>          1 287  36
#>                                              
#>                Accuracy : 0.7041             
#>                  95% CI : (0.6747, 0.7323)   
#>     No Information Rate : 0.9559             
#>     P-Value [Acc > NIR] : 1                  
#>                                              
#>                   Kappa : 0.1285             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.81818            
#>             Specificity : 0.69885            
#>          Pos Pred Value : 0.11146            
#>          Neg Pred Value : 0.98813            
#>              Prevalence : 0.04413            
#>          Detection Rate : 0.03611            
#>    Detection Prevalence : 0.32397            
#>       Balanced Accuracy : 0.75851            
#>                                              
#>        'Positive' Class : 1                  
#>

Discussion

The evaluation of the confusion matrix of the Naive Bayes method on the test data is as follows:

Accuracy is 0.6921, meaning that 69.21% of the test data is correctly classified.
Sensitivity/Recall is 0.86364, meaning that 86.364% of the actual stroke cases are correctly classified; the rest are false negatives (FN).
Pos Pred Value/Precision is 0.11209, meaning that only 11.209% of our positive predictions are correct; the rest are false positives (FP).
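These three metrics can be recomputed by hand from the Naive Bayes confusion matrix shown earlier. The sketch below just re-enters the four cell counts from that output, with "1" as the positive class.

```r
# Cell counts from the Naive Bayes confusion matrix (positive class = 1).
tp <- 38   # predicted 1, actual 1
fn <- 6    # predicted 0, actual 1
fp <- 301  # predicted 1, actual 0
tn <- 652  # predicted 0, actual 0

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)  # recall
precision   <- tp / (tp + fp)  # positive predictive value

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        precision = precision), 5)
```

The same formulas apply to the Decision Tree and Random Forest matrices by swapping in their counts.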

The evaluation of the confusion matrix of the Decision Tree method on the test data is as follows:

Accuracy is 0.6881, meaning that 68.81% of the test data is correctly classified.
Sensitivity/Recall is 0.88636, meaning that 88.636% of the actual stroke cases are correctly classified; the rest are false negatives (FN).
Pos Pred Value/Precision is 0.11304, meaning that only 11.304% of our positive predictions are correct; the rest are false positives (FP).

The evaluation of the confusion matrix of the Random Forest method on the test data is as follows:

Accuracy is 0.7041, meaning that 70.41% of the test data is correctly classified.
Sensitivity/Recall is 0.81818, meaning that 81.818% of the actual stroke cases are correctly classified; the rest are false negatives (FN).
Pos Pred Value/Precision is 0.11146, meaning that only 11.146% of our positive predictions are correct; the rest are false positives (FP).

Cross Validation 2

The previous three models give an accuracy below 80%, probably because a lot of data is lost by downsampling. Here we will try to improve the result by using the upsampling method on the training data, so that we do not lose a large part of the dataset.

RNGkind(sample.kind = "Rounding")
set.seed(100)

index <- initial_split(data = brain, prop = 0.80, strata = stroke)

brain_train2 <- training(index)
brain_test2 <- testing(index)

prop.table(table(brain_train2$stroke))

#> 
#>          0          1 
#> 0.94879518 0.05120482

Here we are going to use the upsampling method for this imbalanced data.

# upsampling
RNGkind(sample.kind = "Rounding")
set.seed(100)
library(caret)

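# Note: [, -1] drops the first column (gender), not stroke, so stroke
# is still part of x here. The outputs below were produced with this call;
# brain_train2 %>% select(-stroke), as in the downsampling step, would
# keep gender and remove the outcome from the predictors.
brain_train2 <- upSample(x = brain_train2[, -1],
                        y = brain_train2$stroke,
                        yname = "stroke")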

prop.table(table(brain_train2$stroke))

#> 
#>   0   1 
#> 0.5 0.5

table(brain_train2$stroke)

#> 
#>    0    1 
#> 3780 3780
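After resampling it is worth checking that the outcome column is not among the predictors. The sketch below uses a made-up three-row data frame (toy) to compare dropping a column by position with dropping it by name; it does not call caret.

```r
# Toy data with the same column order idea: gender first, stroke last.
toy <- data.frame(
  gender = c("Male", "Female", "Female"),
  age    = c(67, 49, 79),
  stroke = factor(c(1, 0, 0))
)

# Dropping by position removes the first column (gender) and keeps stroke.
x_by_position <- toy[, -1]

# Dropping by name removes exactly the outcome column.
x_by_name <- toy[, setdiff(names(toy), "stroke"), drop = FALSE]

names(x_by_position)  # "age" "stroke"
names(x_by_name)      # "gender" "age"

# Simple guard before passing x to a resampling function.
stopifnot(!"stroke" %in% names(x_by_name))
```

If the outcome stays in x, a model can end up predicting stroke from stroke itself, which shows up as a suspiciously perfect score.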

Model Improvement

Naive Bayes Classifier

Build Model

# create model
library(e1071)
model_nb_brain2 <- naiveBayes(formula = stroke~., data = brain_train2)

Predict

# predict on data test
brain_test2$pred_label <- predict(object = model_nb_brain2, newdata = brain_test2, type = "class")

Evaluation

confusionMatrix(data = brain_test2$pred_label, reference = brain_test2$stroke, positive = "1")

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 953   0
#>          1   0  44
#>                                                
#>                Accuracy : 1                    
#>                  95% CI : (0.9963, 1)          
#>     No Information Rate : 0.9559               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 1                    
#>                                                
#>  Mcnemar's Test P-Value : NA                   
#>                                                
#>             Sensitivity : 1.00000              
#>             Specificity : 1.00000              
#>          Pos Pred Value : 1.00000              
#>          Neg Pred Value : 1.00000              
#>              Prevalence : 0.04413              
#>          Detection Rate : 0.04413              
#>    Detection Prevalence : 0.04413              
#>       Balanced Accuracy : 1.00000              
#>                                                
#>        'Positive' Class : 1                    
#>

Decision Tree

Build Model

# create model
library(partykit)
model_dt_brain2 <- ctree(formula = stroke~., data = brain_train2)

# visualize decision tree
plot(model_dt_brain2, type = "simple")

Predict

# predict data test
pred_dt_brain2 <- predict(object = model_dt_brain2, newdata = brain_test2, type = "response")

Evaluation

# confusion matrix data test
confusionMatrix(data = pred_dt_brain2, reference = brain_test2$stroke, positive = "1")

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 953   0
#>          1   0  44
#>                                                
#>                Accuracy : 1                    
#>                  95% CI : (0.9963, 1)          
#>     No Information Rate : 0.9559               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 1                    
#>                                                
#>  Mcnemar's Test P-Value : NA                   
#>                                                
#>             Sensitivity : 1.00000              
#>             Specificity : 1.00000              
#>          Pos Pred Value : 1.00000              
#>          Neg Pred Value : 1.00000              
#>              Prevalence : 0.04413              
#>          Detection Rate : 0.04413              
#>    Detection Prevalence : 0.04413              
#>       Balanced Accuracy : 1.00000              
#>                                                
#>        'Positive' Class : 1                    
#>

Random Forest

Build Model

#set.seed(417)
#ctrl2 <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

#model_rf_brain2 <- train(stroke~., data = brain_train2, method="rf", trControl = ctrl2)

## Save model
#saveRDS(model_rf_brain2, file = "model_rf_brain2.RDS")

# Read model
model_rf_brain2 <- readRDS("model_rf_brain2.RDS")

# Summary model
model_rf_brain2

#> Random Forest 
#> 
#> 7560 samples
#>    9 predictor
#>    2 classes: '0', '1' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 3 times) 
#> Summary of sample sizes: 6048, 6048, 6048, 6048, 6048, 6048, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy   Kappa    
#>    2    0.8892416  0.7784832
#>    7    0.9893739  0.9787478
#>   13    0.9844356  0.9688713
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 7.

From the RF results, the best mtry is mtry = 7. With mtry = 13 the accuracy is almost the same (0.984 against 0.989), while with mtry = 2 it is clearly lower (0.889). This also gives us a new insight compared to the previous model, where the best mtry was 2.

Out of Bag Error

library(randomForest)
model_rf_brain2$finalModel

#> 
#> Call:
#>  randomForest(x = x, y = y, mtry = param$mtry) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 7
#> 
#>         OOB estimate of  error rate: 0.73%
#> Confusion matrix:
#>      0    1 class.error
#> 0 3725   55  0.01455026
#> 1    0 3780  0.00000000

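In the model_rf_brain2 model, the Out of Bag Error value is 0.73%. In other words, the accuracy of the model on the out-of-bag data (not the test data) is 99.27%. Keep in mind that after upsampling, copies of the same stroke case can appear both inside and outside a tree's bootstrap sample, so this estimate is optimistic.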

# Accuracy 
accuracy2 <- 100 - 0.73
accuracy2

#> [1] 99.27
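Why an OOB estimate can look too good after upsampling can be sketched in base R. The setup is hypothetical: 20 distinct minority patients are upsampled to 400 rows, and we check how many out-of-bag rows are copies of a patient that is also in the bootstrap sample. The numbers 20 and 400 are chosen only for illustration.

```r
set.seed(100)

# Hypothetical setup: 20 distinct minority patients, upsampled to 400 rows.
patient_id <- sample.int(20, 400, replace = TRUE)

# One bootstrap sample over the 400 upsampled rows.
in_bag_rows <- sample.int(400, 400, replace = TRUE)
oob_rows    <- setdiff(seq_len(400), in_bag_rows)

# Share of out-of-bag rows whose patient also appears among the in-bag rows.
seen_share <- mean(patient_id[oob_rows] %in% patient_id[in_bag_rows])
seen_share
```

When almost every out-of-bag minority row has an identical copy inside the bag, the tree is effectively tested on rows it has already seen.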

Interpretation

varImp(model_rf_brain2) %>% plot()

Here we find that the most important predictor in the random forest model is age, the same as in the previous model.

Predict

pred_rf_brain2 <- predict(object = model_rf_brain2, newdata = brain_test2, type = "raw")

Evaluation

confusionMatrix(data = pred_rf_brain2, reference = brain_test2$stroke, positive = "1")

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 936  43
#>          1  17   1
#>                                           
#>                Accuracy : 0.9398          
#>                  95% CI : (0.9232, 0.9538)
#>     No Information Rate : 0.9559          
#>     P-Value [Acc > NIR] : 0.992546        
#>                                           
#>                   Kappa : 0.0068          
#>                                           
#>  Mcnemar's Test P-Value : 0.001249        
#>                                           
#>             Sensitivity : 0.022727        
#>             Specificity : 0.982162        
#>          Pos Pred Value : 0.055556        
#>          Neg Pred Value : 0.956078        
#>              Prevalence : 0.044132        
#>          Detection Rate : 0.001003        
#>    Detection Prevalence : 0.018054        
#>       Balanced Accuracy : 0.502444        
#>                                           
#>        'Positive' Class : 1               
#>

Discussion

The evaluation of the confusion matrix of the Naive Bayes method on the test data is as follows:

Accuracy is 1, meaning that 100% of the test data is correctly classified.
Sensitivity/Recall is 1.00, meaning that 100% of the actual stroke cases are correctly classified; there are no false negatives (FN).
Pos Pred Value/Precision is 1.00, meaning that 100% of our positive predictions are correct; there are no false positives (FP).

The evaluation of the confusion matrix of the Decision Tree method on the test data is as follows:

Accuracy is 1, meaning that 100% of the test data is correctly classified.
Sensitivity/Recall is 1.00, meaning that 100% of the actual stroke cases are correctly classified; there are no false negatives (FN).
Pos Pred Value/Precision is 1.00, meaning that 100% of our positive predictions are correct; there are no false positives (FP).

The evaluation of the confusion matrix of the Random Forest method on the test data is as follows:

Accuracy is 0.9398, meaning that 93.98% of the test data is correctly classified.
Sensitivity/Recall is 0.022727, meaning that only 2.27% of the actual stroke cases are correctly classified; the rest are false negatives (FN).
Pos Pred Value/Precision is 0.055556, meaning that only 5.56% of our positive predictions are correct; the rest are false positives (FP).
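With a class as rare as stroke, accuracy alone can be misleading. The sketch below recomputes the Random Forest numbers from the confusion matrix above and compares them with a baseline that always predicts "0" (the No Information Rate reported by confusionMatrix).

```r
# Cell counts from the upsampled Random Forest confusion matrix.
tp <- 1; fn <- 43; fp <- 17; tn <- 936
n  <- tp + fn + fp + tn

accuracy     <- (tp + tn) / n
no_info_rate <- (tn + fp) / n  # accuracy of always predicting "0"

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy,
        no_info_rate = no_info_rate,
        sensitivity = sensitivity,
        balanced_accuracy = balanced_accuracy), 4)
```

The accuracy is below the baseline and the balanced accuracy is close to 0.5, so this model is barely better than guessing on the stroke class.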

Summary

In conclusion, by creating two sets of models with different methods of handling the imbalanced dataset, we got different results too. With upsampling, all three models (Naive Bayes, Decision Tree and Random Forest) reach an accuracy of more than 90%, and Naive Bayes and Decision Tree reach 100%. These numbers should not be taken at face value. A perfect score on the test data is a warning sign: the upsampling call uses brain_train2[, -1], which removes gender rather than stroke, so the target very likely leaked into the predictors of these two models. For Random Forest, the accuracy of 93.98% is below the No Information Rate of 95.59% and the sensitivity is only 2.27%, so the model misses almost every stroke case.

In contrast, for the first set of models (downsampling), the highest accuracy that we got is about 70%, from the Random Forest model. Recall/Sensitivity there is between 82% and 89%, but precision is only about 11%, so most positive predictions are false alarms.

The dataset is affecting our models; this imbalanced dataset needs to be extended with new data, especially stroke cases. Even though we get higher accuracy with the upsampling method, it does not make the model satisfying, because upsampling makes the model overfit.