Picture taken from Kaggle.

Intro

1.1. Greetings

Hi Everyone :)

Welcome to my Rmd.

This is my HTML document, which contains Rice Type Classification 2.

Hope you enjoy it!

1.2. What We Will Do

We will learn to use Naive Bayes, Decision Tree, and Random Forest models on the rice type dataset. We want to understand the relationships among the variables, and we want to classify new rice samples (the test data) with models fitted on the training data.

Data Source: https://www.kaggle.com/datasets/mssmartypants/rice-type-classification

1.3. About Dataset

Context

This dataset was created for rice classification. I recommend using it for educational purposes, for practice, and to acquire the necessary knowledge. It is a modified dataset from this resource: link. The class labels are Jasmine = 1 and Gonen = 0.

Content

What's inside is more than just rows and columns. The rice details are listed as column names.

Description

All attributes are numeric variables, and they are listed below:

-. id

-. Area

-. MajorAxisLength

-. MinorAxisLength

-. Eccentricity

-. ConvexArea

-. EquivDiameter

-. Extent

-. Perimeter

-. Roundness

-. AspectRation

-. Class

1.4. Business Goal

We want to know:

-. How the accuracy of Naive Bayes, Decision Tree, and Random Forest compares.

-. How to classify the type of a new rice sample with models fitted on the existing data.

Import Library

library(dplyr) # for data wrangling
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2) # to visualize data
library(gridExtra) # to display multiple graph
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(inspectdf) # for EDA
library(tidymodels) # to build tidy models
## -- Attaching packages -------------------------------------- tidymodels 0.2.0 --
## v broom        0.8.0     v rsample      0.1.1
## v dials        0.1.1     v tibble       3.1.6
## v infer        1.0.0     v tidyr        1.2.0
## v modeldata    0.1.1     v tune         0.2.0
## v parsnip      0.2.1     v workflows    0.2.6
## v purrr        0.3.4     v workflowsets 0.2.1
## v recipes      0.2.0     v yardstick    0.0.9
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x gridExtra::combine() masks dplyr::combine()
## x purrr::discard()     masks scales::discard()
## x dplyr::filter()      masks stats::filter()
## x dplyr::lag()         masks stats::lag()
## x recipes::step()      masks stats::step()
## * Dig deeper into tidy modeling with R at https://www.tmwr.org
library(caret) # to pre-process data
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following objects are masked from 'package:yardstick':
## 
##     precision, recall, sensitivity, specificity
## The following object is masked from 'package:purrr':
## 
##     lift
library(mlbench)
library(caTools)
library(mice) 
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(e1071)
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:tune':
## 
##     tune
## The following object is masked from 'package:rsample':
## 
##     permutations
## The following object is masked from 'package:parsnip':
## 
##     tune
library(rpart) 
## 
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
## 
##     prune
library(randomForest) 
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
library(animation)

Read the dataset

rice <- read.csv("data_input/rice.csv")
str(rice)
## 'data.frame':    18185 obs. of  12 variables:
##  $ id             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Area           : int  4537 2872 3048 3073 3693 2990 3556 3788 2629 5719 ...
##  $ MajorAxisLength: num  92.2 74.7 76.3 77 85.1 ...
##  $ MinorAxisLength: num  64 51.4 52 51.9 56.4 ...
##  $ Eccentricity   : num  0.72 0.726 0.731 0.739 0.749 ...
##  $ ConvexArea     : int  4677 3015 3132 3157 3802 3080 3636 3866 2790 5819 ...
##  $ EquivDiameter  : num  76 60.5 62.3 62.6 68.6 ...
##  $ Extent         : num  0.658 0.713 0.759 0.784 0.769 ...
##  $ Perimeter      : num  273 208 210 211 230 ...
##  $ Roundness      : num  0.765 0.832 0.868 0.87 0.875 ...
##  $ AspectRation   : num  1.44 1.45 1.47 1.48 1.51 ...
##  $ Class          : int  1 1 1 1 1 1 1 1 1 1 ...

Data Cleaning

There is one variable we don't need, id, so we can drop it by subsetting.

rice <- rice[,-c(1)]
str(rice)
## 'data.frame':    18185 obs. of  11 variables:
##  $ Area           : int  4537 2872 3048 3073 3693 2990 3556 3788 2629 5719 ...
##  $ MajorAxisLength: num  92.2 74.7 76.3 77 85.1 ...
##  $ MinorAxisLength: num  64 51.4 52 51.9 56.4 ...
##  $ Eccentricity   : num  0.72 0.726 0.731 0.739 0.749 ...
##  $ ConvexArea     : int  4677 3015 3132 3157 3802 3080 3636 3866 2790 5819 ...
##  $ EquivDiameter  : num  76 60.5 62.3 62.6 68.6 ...
##  $ Extent         : num  0.658 0.713 0.759 0.784 0.769 ...
##  $ Perimeter      : num  273 208 210 211 230 ...
##  $ Roundness      : num  0.765 0.832 0.868 0.87 0.875 ...
##  $ AspectRation   : num  1.44 1.45 1.47 1.48 1.51 ...
##  $ Class          : int  1 1 1 1 1 1 1 1 1 1 ...

It is very important to check for missing values in the dataset.

anyNA(rice)
## [1] FALSE

Great! The data is complete and ready to be processed.

Data Manipulation

rice$Class <- as.factor(rice$Class)
rice <- rice %>% 
  mutate(Class = factor(Class, levels = c(1,0), 
                        labels = c("Jasmine", "Gonen")))
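
To see what this recoding does, here is a minimal sketch on a made-up vector (the values in `raw` are illustrative and not taken from the dataset). Listing level 1 first makes Jasmine the first factor level, which is why caret later reports Jasmine as the 'Positive' class.

```r
# Hypothetical 0/1 codes, standing in for rice$Class
raw <- c(1, 0, 0, 1)

# levels = c(1, 0) fixes the order; labels are matched position by position
cls <- factor(raw, levels = c(1, 0), labels = c("Jasmine", "Gonen"))

levels(cls)        # level order: Jasmine first, then Gonen
as.character(cls)  # 1 -> Jasmine, 0 -> Gonen
table(cls)
```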

Splitting Dataset

The next step is to split the data into training and test sets. The training data is used to build the model; the test data is used to evaluate it on unseen data, which tells us how well the model generalizes.

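set.seed(150)

# Split into training data and test data.
# Note: sample.split() expects a label vector such as rice$Class. Passing the
# whole data frame returns an 11-element mask (one per column) that subset()
# recycles over the rows, so the split is about 64/36 rather than 70/30
# (the dimensions printed below reflect this).
split=sample.split(rice, SplitRatio = 0.7)

# Training dataset
training_set=subset(rice,split==TRUE)

# Test dataset
test_set=subset(rice,split==FALSE)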
table(rice$Class)
## 
## Jasmine   Gonen 
##    9985    8200
dim(training_set) 
## [1] 11573    11
dim(test_set)
## [1] 6612   11
topredict_set<-test_set[1:10] 

dim(topredict_set)
## [1] 6612   10
str(topredict_set)
## 'data.frame':    6612 obs. of  10 variables:
##  $ Area           : int  3048 3073 5719 2665 3265 4301 3852 3163 2647 3606 ...
##  $ MajorAxisLength: num  76.3 77 106.7 74.4 83.4 ...
##  $ MinorAxisLength: num  52 51.9 69 48.1 52.8 ...
##  $ Eccentricity   : num  0.731 0.739 0.763 0.763 0.774 ...
##  $ ConvexArea     : int  3132 3157 5819 2777 3420 4427 4023 3232 2710 3658 ...
##  $ EquivDiameter  : num  62.3 62.6 85.3 58.3 64.5 ...
##  $ Extent         : num  0.759 0.784 0.755 0.597 0.576 ...
##  $ Perimeter      : num  210 211 282 202 228 ...
##  $ Roundness      : num  0.868 0.87 0.905 0.817 0.79 ...
##  $ AspectRation   : num  1.47 1.48 1.55 1.55 1.58 ...
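
For comparison, a 70/30 split stratified by class can be sketched in base R as follows. The `toy` data frame and its column `x` are made up for illustration; with the real data, `sample.split(rice$Class, SplitRatio = 0.7)` would be the analogous caTools call, and its results would differ from the outputs shown in this document.

```r
set.seed(150)

# Hypothetical toy data standing in for the rice data frame
toy <- data.frame(
  x     = rnorm(1000),
  Class = factor(rep(c("Jasmine", "Gonen"), times = c(550, 450)))
)

# Stratified split: sample 70% of the row indices within each class
idx_by_class <- split(seq_len(nrow(toy)), toy$Class)
train_idx <- unlist(
  lapply(idx_by_class, function(i) sample(i, size = round(0.7 * length(i)))),
  use.names = FALSE
)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class proportions are preserved in both parts
table(toy_train$Class)
table(toy_test$Class)
```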

Create Model

Naive Bayes

#Create NaiveBayes Model  
model_naive<- naiveBayes(Class ~ ., data = training_set)  

# Predict Class for the held-out test features (topredict_set)
preds_naive <- predict(model_naive, newdata = topredict_set) 

(conf_matrix_naive <- table(preds_naive, test_set$Class)) 
##            
## preds_naive Jasmine Gonen
##     Jasmine    3622    97
##     Gonen        12  2881
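
As background, `e1071::naiveBayes` models each numeric predictor as Gaussian within each class. The sketch below reimplements that idea in base R on made-up data; `fit_gnb`, `predict_gnb`, and the `toy` data frame are hypothetical names used only for illustration, and the sketch leaves out details of the real implementation.

```r
set.seed(1)

# Made-up two-class data with two numeric features
toy <- data.frame(
  len = c(rnorm(50, mean = 10, sd = 1), rnorm(50, mean = 6, sd = 1)),
  wid = c(rnorm(50, mean = 2, sd = 0.3), rnorm(50, mean = 3, sd = 0.3))
)
toy_class <- factor(rep(c("A", "B"), each = 50))

# Per class: feature means, feature standard deviations, and the class prior
fit_gnb <- function(x, y) {
  lapply(split(x, y), function(d) {
    list(mean = colMeans(d), sd = apply(d, 2, sd), prior = nrow(d) / nrow(x))
  })
}

# Score each class by log prior + sum of per-feature Gaussian log densities
predict_gnb <- function(fit, newx) {
  log_post <- sapply(fit, function(p) {
    ll <- rep(log(p$prior), nrow(newx))
    for (v in names(p$mean)) {
      ll <- ll + dnorm(newx[[v]], mean = p$mean[[v]], sd = p$sd[[v]], log = TRUE)
    }
    ll
  })
  # Force a rows-by-classes matrix, also when newx has a single row
  log_post <- matrix(log_post, nrow = nrow(newx),
                     dimnames = list(NULL, names(fit)))
  factor(names(fit)[max.col(log_post, ties.method = "first")],
         levels = names(fit))
}

fit  <- fit_gnb(toy, toy_class)
pred <- predict_gnb(fit, toy)
mean(pred == toy_class)  # training accuracy on the toy data
```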

In the Naive Bayes confusion matrix (rows are predictions, columns are actual classes), the model correctly classifies 3622 Jasmine grains and mislabels 12 Jasmine grains as Gonen. It correctly classifies 2881 Gonen grains and mislabels 97 Gonen grains as Jasmine. How about the accuracy? Let's check below.

confusionMatrix(conf_matrix_naive) 
## Confusion Matrix and Statistics
## 
##            
## preds_naive Jasmine Gonen
##     Jasmine    3622    97
##     Gonen        12  2881
##                                           
##                Accuracy : 0.9835          
##                  95% CI : (0.9801, 0.9864)
##     No Information Rate : 0.5496          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9666          
##                                           
##  Mcnemar's Test P-Value : 8.573e-16       
##                                           
##             Sensitivity : 0.9967          
##             Specificity : 0.9674          
##          Pos Pred Value : 0.9739          
##          Neg Pred Value : 0.9959          
##              Prevalence : 0.5496          
##          Detection Rate : 0.5478          
##    Detection Prevalence : 0.5625          
##       Balanced Accuracy : 0.9821          
##                                           
##        'Positive' Class : Jasmine         
## 

From the output, Naive Bayes has an accuracy of 98.35%.
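
The headline numbers in the `confusionMatrix()` output can be recomputed by hand from the four cell counts. A minimal sketch in base R, with the counts copied from the Naive Bayes table (the variable names are my own):

```r
# Naive Bayes confusion matrix (rows = predictions, columns = actual classes)
cm <- matrix(c(3622, 12, 97, 2881), nrow = 2,
             dimnames = list(Prediction = c("Jasmine", "Gonen"),
                             Reference  = c("Jasmine", "Gonen")))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["Jasmine", "Jasmine"] / sum(cm[, "Jasmine"])  # Jasmine recall
specificity <- cm["Gonen", "Gonen"] / sum(cm[, "Gonen"])        # Gonen recall
ppv         <- cm["Jasmine", "Jasmine"] / sum(cm["Jasmine", ])  # precision

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv), 4)
```

Because Jasmine is the positive class, sensitivity is the share of actual Jasmine grains that were recognized, and specificity is the same share for Gonen.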

Decision Tree

dt_model <- ctree(Class~ .,training_set)
plot(dt_model)

plot(dt_model, type="simple")

dt_model
## 
## Model formula:
## Class ~ Area + MajorAxisLength + MinorAxisLength + Eccentricity + 
##     ConvexArea + EquivDiameter + Extent + Perimeter + Roundness + 
##     AspectRation
## 
## Fitted party:
## [1] root
## |   [2] MinorAxisLength <= 58.57826
## |   |   [3] Roundness <= 0.72365
## |   |   |   [4] Eccentricity <= 0.92796
## |   |   |   |   [5] ConvexArea <= 6477: Jasmine (n = 476, err = 1.5%)
## |   |   |   |   [6] ConvexArea > 6477: Jasmine (n = 140, err = 12.1%)
## |   |   |   [7] Eccentricity > 0.92796
## |   |   |   |   [8] Roundness <= 0.70488
## |   |   |   |   |   [9] Eccentricity <= 0.9415: Jasmine (n = 2479, err = 0.4%)
## |   |   |   |   |   [10] Eccentricity > 0.9415: Jasmine (n = 2861, err = 0.0%)
## |   |   |   |   [11] Roundness > 0.70488: Jasmine (n = 54, err = 5.6%)
## |   |   [12] Roundness > 0.72365
## |   |   |   [13] ConvexArea <= 5769: Jasmine (n = 263, err = 0.4%)
## |   |   |   [14] ConvexArea > 5769
## |   |   |   |   [15] MinorAxisLength <= 56.89267: Jasmine (n = 42, err = 26.2%)
## |   |   |   |   [16] MinorAxisLength > 56.89267: Gonen (n = 53, err = 30.2%)
## |   [17] MinorAxisLength > 58.57826
## |   |   [18] Roundness <= 0.70251
## |   |   |   [19] MinorAxisLength <= 62.36241: Jasmine (n = 56, err = 28.6%)
## |   |   |   [20] MinorAxisLength > 62.36241: Gonen (n = 62, err = 0.0%)
## |   |   [21] Roundness > 0.70251
## |   |   |   [22] MinorAxisLength <= 60.41631: Gonen (n = 104, err = 21.2%)
## |   |   |   [23] MinorAxisLength > 60.41631
## |   |   |   |   [24] MajorAxisLength <= 122.1913
## |   |   |   |   |   [25] EquivDiameter <= 84.58329: Gonen (n = 7, err = 42.9%)
## |   |   |   |   |   [26] EquivDiameter > 84.58329: Gonen (n = 63, err = 1.6%)
## |   |   |   |   [27] MajorAxisLength > 122.1913
## |   |   |   |   |   [28] EquivDiameter <= 92.94585: Gonen (n = 243, err = 1.2%)
## |   |   |   |   |   [29] EquivDiameter > 92.94585: Gonen (n = 4670, err = 0.0%)
## 
## Number of inner nodes:    14
## Number of terminal nodes: 15
width(dt_model)
## [1] 15
depth(dt_model)
## [1] 5

After fitting the model on the training data, we can use it on the test data.

predict(dt_model, head(test_set[,-11]))
##       3       4      10      11      14      15 
## Jasmine Jasmine   Gonen Jasmine Jasmine   Gonen 
## Levels: Jasmine Gonen
pred <- predict(dt_model, test_set[,-11])
(conf_matrix_dtree <- table(pred, test_set$Class))
##          
## pred      Jasmine Gonen
##   Jasmine    3595    44
##   Gonen        39  2934

The Decision Tree correctly classifies 3595 Jasmine grains and mislabels 39 Jasmine grains as Gonen. It correctly classifies 2934 Gonen grains and mislabels 44 Gonen grains as Jasmine.

Check the probabilities.

predict(dt_model, head(test_set[,-11]), type="prob")
##       Jasmine       Gonen
## 3  0.99619772 0.003802281
## 4  0.99619772 0.003802281
## 10 0.01587302 0.984126984
## 11 0.99619772 0.003802281
## 14 0.99619772 0.003802281
## 15 0.21153846 0.788461538

From the probabilities, we know that rice samples 3 and 4 each have a 99.62% probability of being Jasmine, while rice sample 10 has a 98.41% probability of being Gonen.
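
These probabilities are the class proportions among the training rows in the terminal node that each sample falls into. A rough check against the printed tree is sketched below; the minority counts are inferred from the rounded `err` percentages and are therefore an assumption.

```r
# Terminal nodes 13, 26 and 22 from the printed tree
leaves <- data.frame(
  node       = c(13, 26, 22),
  n          = c(263, 63, 104),
  majority   = c("Jasmine", "Gonen", "Gonen"),
  minority_n = c(1, 1, 22)  # inferred from err = 0.4%, 1.6%, 21.2%
)

# Share of the majority and minority class within each node
leaves$p_majority <- (leaves$n - leaves$minority_n) / leaves$n
leaves$p_minority <- leaves$minority_n / leaves$n
leaves
```

The majority shares line up with the predicted probabilities for samples 3 and 4 (node 13), sample 10 (node 26), and sample 15 (node 22).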

caret::confusionMatrix(pred, test_set[,11])
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Jasmine Gonen
##    Jasmine    3595    44
##    Gonen        39  2934
##                                         
##                Accuracy : 0.9874        
##                  95% CI : (0.9845, 0.99)
##     No Information Rate : 0.5496        
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.9746        
##                                         
##  Mcnemar's Test P-Value : 0.6606        
##                                         
##             Sensitivity : 0.9893        
##             Specificity : 0.9852        
##          Pos Pred Value : 0.9879        
##          Neg Pred Value : 0.9869        
##              Prevalence : 0.5496        
##          Detection Rate : 0.5437        
##    Detection Prevalence : 0.5504        
##       Balanced Accuracy : 0.9872        
##                                         
##        'Positive' Class : Jasmine       
## 

From the confusion matrix, the accuracy of the Decision Tree is 98.74%, slightly higher than the accuracy of Naive Bayes (98.35%). The difference is small: the two 95% confidence intervals, (0.9845, 0.99) and (0.9801, 0.9864), overlap.
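
One rough way to judge whether such a gap matters is to compare exact binomial confidence intervals for the two accuracies, which is, as far as I know, how caret derives the 95% CI it prints. The counts below are read off the confusion matrices above; overlapping intervals are only an informal signal, not a formal test of the difference.

```r
n_test     <- 6612
correct_nb <- 3622 + 2881  # Naive Bayes diagonal
correct_dt <- 3595 + 2934  # Decision Tree diagonal

# Exact (Clopper-Pearson) 95% intervals for each accuracy
ci_nb <- binom.test(correct_nb, n_test)$conf.int
ci_dt <- binom.test(correct_dt, n_test)$conf.int

round(rbind(naive_bayes = ci_nb, decision_tree = ci_dt), 4)

# TRUE when the two intervals share some range
intervals_overlap <- ci_nb[2] >= ci_dt[1] && ci_dt[2] >= ci_nb[1]
intervals_overlap
```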

Random Forest

set.seed(150)
n0_var <- nearZeroVar(rice[,1:10])
# nearZeroVar is applied to columns 1 through 10
# Guard the empty case: rice[, -integer(0)] would select zero columns
bc <- if (length(n0_var) > 0) rice[,-n0_var] else rice
ani.options(interval = 1, nmax = 15)

cv.ani(main = "Demonstration of the k-fold Cross Validation", bty = "l") 

Now we create a model with 5-fold cross-validation repeated 3 times.

set.seed(150)
# The code below is commented out because training takes a long time on every run. To rerun the document, load the RDS file that was saved the first time the model was trained.

#ctrl <- trainControl(method="repeatedcv", number=5, repeats=3) 

#model_rforest <- train(Class~ ., data=training_set, method="rf", trControl = ctrl)

#saveRDS(model_rforest, file = "model_rforest_rice.rds")
model_rforest <- readRDS("model_rforest_rice.rds")
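
To make the resampling scheme concrete, here is a small base-R sketch of how 5-fold cross-validation repeated 3 times partitions row indices. The helper `make_folds` and the toy size `n_rows = 20` are made up for illustration; caret builds its own folds internally, so this is not its exact procedure.

```r
set.seed(150)
n_rows <- 20  # toy size; the real training set is much larger

# Shuffle the row indices and deal them into k roughly equal folds
make_folds <- function(n, k) {
  split(sample(seq_len(n)), rep(seq_len(k), length.out = n))
}

# 3 repeats, each with its own random 5-fold partition
repeats <- lapply(1:3, function(r) make_folds(n_rows, k = 5))

# Each fold is held out once while the other four are used for fitting,
# giving 3 * 5 = 15 fits per candidate parameter value
n_fits <- sum(lengths(repeats))
n_fits

# Within one repeat, every row is held out exactly once
sort(unlist(repeats[[1]], use.names = FALSE))
```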

Let's plot it.

plot(model_rforest)

sum(predict(model_rforest,test_set[,-11])==test_set[,11])
## [1] 6529
varImp(model_rforest)
## rf variable importance
## 
##                  Overall
## MinorAxisLength 100.0000
## Eccentricity     64.2599
## AspectRation     58.5896
## Roundness        52.3971
## EquivDiameter    38.1436
## Area             31.9368
## ConvexArea       20.3825
## Perimeter         5.5170
## MajorAxisLength   0.2724
## Extent            0.0000

From the result, MinorAxisLength has the highest variable importance in the Random Forest model.
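
Note that the importances printed by `varImp()` are, by default, rescaled so that the largest is 100 and the smallest is 0. A score of 0 for Extent therefore means it is the least important predictor here, not necessarily that it contributes nothing. The sketch below shows this kind of min-max rescaling on made-up raw scores; the numbers in `raw_imp` are hypothetical, and the formula is my reading of the default `scale = TRUE` behaviour.

```r
# Hypothetical unscaled importance scores (not from the fitted model)
raw_imp <- c(MinorAxisLength = 900, Eccentricity = 600, Extent = 40)

# Min-max rescaling to the 0-100 range
scaled_imp <- (raw_imp - min(raw_imp)) / (max(raw_imp) - min(raw_imp)) * 100
round(scaled_imp, 2)
```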

plot(model_rforest$finalModel)
legend("topright", colnames(model_rforest$finalModel$err.rate),
       col=1:6,cex=0.8,fill=1:6)

model_rforest$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x))) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 0.93%
## Confusion matrix:
##         Jasmine Gonen class.error
## Jasmine    6312    39 0.006140765
## Gonen        69  5153 0.013213328

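With mtry = 2 (two variables tried at each split), the out-of-bag confusion matrix on the training data (rows are actual classes) shows 6312 Jasmine grains classified correctly and 39 mislabeled as Gonen, and 5153 Gonen grains classified correctly and 69 mislabeled as Jasmine.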
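
The OOB error rate and the `class.error` column follow directly from these four counts. A minimal sketch, with the counts copied from the output above (the variable names are my own):

```r
# OOB confusion matrix (rows = actual classes, columns = predictions)
oob <- matrix(c(6312, 69, 39, 5153), nrow = 2,
              dimnames = list(actual    = c("Jasmine", "Gonen"),
                              predicted = c("Jasmine", "Gonen")))

# Per-class error: misclassified share within each actual class
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall OOB error: all off-diagonal counts over all training rows
oob_error <- 1 - sum(diag(oob)) / sum(oob)

round(class_error, 6)
round(100 * oob_error, 2)  # in percent
```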

predict_forest <- predict(model_rforest, topredict_set)
(conf_matrix_forestI <- table(predict_forest, test_set$Class)) 
##               
## predict_forest Jasmine Gonen
##        Jasmine    3596    45
##        Gonen        38  2933
confusionMatrix(conf_matrix_forestI) 
## Confusion Matrix and Statistics
## 
##               
## predict_forest Jasmine Gonen
##        Jasmine    3596    45
##        Gonen        38  2933
##                                         
##                Accuracy : 0.9874        
##                  95% CI : (0.9845, 0.99)
##     No Information Rate : 0.5496        
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.9746        
##                                         
##  Mcnemar's Test P-Value : 0.5102        
##                                         
##             Sensitivity : 0.9895        
##             Specificity : 0.9849        
##          Pos Pred Value : 0.9876        
##          Neg Pred Value : 0.9872        
##              Prevalence : 0.5496        
##          Detection Rate : 0.5439        
##    Detection Prevalence : 0.5507        
##       Balanced Accuracy : 0.9872        
##                                         
##        'Positive' Class : Jasmine       
## 

Random Forest (Original)

model_rf <- randomForest(Class ~ ., data = training_set, importance=TRUE, 
                         ntree = 500)

model_rf
## 
## Call:
##  randomForest(formula = Class ~ ., data = training_set, importance = TRUE,      ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 0.99%
## Confusion matrix:
##         Jasmine Gonen class.error
## Jasmine    6310    41 0.006455676
## Gonen        74  5148 0.014170816
preds_rf <- predict(model_rf, topredict_set)
plot(model_rf)

(conf_matrix_forestII <- table(preds_rf, test_set$Class))
##          
## preds_rf  Jasmine Gonen
##   Jasmine    3596    44
##   Gonen        38  2934
confusionMatrix(conf_matrix_forestII) 
## Confusion Matrix and Statistics
## 
##          
## preds_rf  Jasmine Gonen
##   Jasmine    3596    44
##   Gonen        38  2934
##                                           
##                Accuracy : 0.9876          
##                  95% CI : (0.9846, 0.9901)
##     No Information Rate : 0.5496          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9749          
##                                           
##  Mcnemar's Test P-Value : 0.5808          
##                                           
##             Sensitivity : 0.9895          
##             Specificity : 0.9852          
##          Pos Pred Value : 0.9879          
##          Neg Pred Value : 0.9872          
##              Prevalence : 0.5496          
##          Detection Rate : 0.5439          
##    Detection Prevalence : 0.5505          
##       Balanced Accuracy : 0.9874          
##                                           
##        'Positive' Class : Jasmine         
## 

Conclusion

confusionMatrix(conf_matrix_naive) 
## Confusion Matrix and Statistics
## 
##            
## preds_naive Jasmine Gonen
##     Jasmine    3622    97
##     Gonen        12  2881
##                                           
##                Accuracy : 0.9835          
##                  95% CI : (0.9801, 0.9864)
##     No Information Rate : 0.5496          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9666          
##                                           
##  Mcnemar's Test P-Value : 8.573e-16       
##                                           
##             Sensitivity : 0.9967          
##             Specificity : 0.9674          
##          Pos Pred Value : 0.9739          
##          Neg Pred Value : 0.9959          
##              Prevalence : 0.5496          
##          Detection Rate : 0.5478          
##    Detection Prevalence : 0.5625          
##       Balanced Accuracy : 0.9821          
##                                           
##        'Positive' Class : Jasmine         
## 
confusionMatrix(conf_matrix_dtree)
## Confusion Matrix and Statistics
## 
##          
## pred      Jasmine Gonen
##   Jasmine    3595    44
##   Gonen        39  2934
##                                         
##                Accuracy : 0.9874        
##                  95% CI : (0.9845, 0.99)
##     No Information Rate : 0.5496        
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.9746        
##                                         
##  Mcnemar's Test P-Value : 0.6606        
##                                         
##             Sensitivity : 0.9893        
##             Specificity : 0.9852        
##          Pos Pred Value : 0.9879        
##          Neg Pred Value : 0.9869        
##              Prevalence : 0.5496        
##          Detection Rate : 0.5437        
##    Detection Prevalence : 0.5504        
##       Balanced Accuracy : 0.9872        
##                                         
##        'Positive' Class : Jasmine       
## 
confusionMatrix(conf_matrix_forestII) 
## Confusion Matrix and Statistics
## 
##          
## preds_rf  Jasmine Gonen
##   Jasmine    3596    44
##   Gonen        38  2934
##                                           
##                Accuracy : 0.9876          
##                  95% CI : (0.9846, 0.9901)
##     No Information Rate : 0.5496          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9749          
##                                           
##  Mcnemar's Test P-Value : 0.5808          
##                                           
##             Sensitivity : 0.9895          
##             Specificity : 0.9852          
##          Pos Pred Value : 0.9879          
##          Neg Pred Value : 0.9872          
##              Prevalence : 0.5496          
##          Detection Rate : 0.5439          
##    Detection Prevalence : 0.5505          
##       Balanced Accuracy : 0.9874          
##                                           
##        'Positive' Class : Jasmine         
## 

-. The highest accuracy comes from the Random Forest model with 98.76%, followed by the Decision Tree with 98.74% and Naive Bayes with 98.35%.

-. All three models perform well: every accuracy is above 98% on the test data.

-. Several variables in the dataset are strong predictors of the rice type. This can be seen in each model's summary, for example MinorAxisLength in the decision tree splits and in the random forest variable importance.
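
The ranking in the first point can be reproduced directly from the three test-set confusion matrices shown above. The helper name `acc` is made up for this sketch.

```r
# Accuracy = correctly classified rows / all test rows
acc <- function(cm) sum(diag(cm)) / sum(cm)

# Confusion matrices copied from the outputs (rows = predictions)
cms <- list(
  naive_bayes   = matrix(c(3622, 12, 97, 2881), nrow = 2),
  decision_tree = matrix(c(3595, 39, 44, 2934), nrow = 2),
  random_forest = matrix(c(3596, 38, 44, 2934), nrow = 2)
)

accuracy <- sort(sapply(cms, acc), decreasing = TRUE)
round(100 * accuracy, 2)

# Gap between the best and second-best model, in correctly classified rows
(accuracy[1] - accuracy[2]) * 6612
```

On this test set the Random Forest and the Decision Tree differ by a single grain, so the top two places are effectively a tie.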