The picture is taken from Kaggle.
Hi Everyone :)
Welcome to my Rmd.
This HTML document contains Rice Type Classification 2.
I hope you enjoy it!
We will learn to use Naive Bayes, Decision Tree, and Random Forest models on the Rice type dataset. We want to understand the relationship among the variables, and we want to classify the type of new rice (the test data) using models trained on the training data.
Data Source: https://www.kaggle.com/datasets/mssmartypants/rice-type-classification
This dataset was created for rice classification. I recommend using it for educational purposes, for practice, and to acquire the necessary knowledge. It is a modified dataset from this resource: link. The classes are encoded as Jasmine = 1 and Gonen = 0.
What is inside is more than just rows and columns. The rice details are listed as column names.
All attributes are numeric variables, and they are listed below:
-. id
-. Area
-. MajorAxisLength
-. MinorAxisLength
-. Eccentricity
-. ConvexArea
-. EquivDiameter
-. Extent
-. Perimeter
-. Roundness
-. AspectRation
-. Class
We want to know:
-. How the accuracy compares across Naive Bayes, Decision Tree, and Random Forest.
-. How to classify the type of new rice using the models we have trained.
library(dplyr) # for data wrangling
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2) # to visualize data
library(gridExtra) # to display multiple graph
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(inspectdf) # for EDA
library(tidymodels) # to build tidy models
## -- Attaching packages -------------------------------------- tidymodels 0.2.0 --
## v broom 0.8.0 v rsample 0.1.1
## v dials 0.1.1 v tibble 3.1.6
## v infer 1.0.0 v tidyr 1.2.0
## v modeldata 0.1.1 v tune 0.2.0
## v parsnip 0.2.1 v workflows 0.2.6
## v purrr 0.3.4 v workflowsets 0.2.1
## v recipes 0.2.0 v yardstick 0.0.9
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x gridExtra::combine() masks dplyr::combine()
## x purrr::discard() masks scales::discard()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x recipes::step() masks stats::step()
## * Dig deeper into tidy modeling with R at https://www.tmwr.org
library(caret) # to pre-process data
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following objects are masked from 'package:yardstick':
##
## precision, recall, sensitivity, specificity
## The following object is masked from 'package:purrr':
##
## lift
library(mlbench)
library(caTools)
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(e1071)
##
## Attaching package: 'e1071'
## The following object is masked from 'package:tune':
##
## tune
## The following object is masked from 'package:rsample':
##
## permutations
## The following object is masked from 'package:parsnip':
##
## tune
library(rpart)
##
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
##
## prune
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
library(animation)
rice <- read.csv("data_input/rice.csv")
str(rice)
## 'data.frame': 18185 obs. of 12 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Area : int 4537 2872 3048 3073 3693 2990 3556 3788 2629 5719 ...
## $ MajorAxisLength: num 92.2 74.7 76.3 77 85.1 ...
## $ MinorAxisLength: num 64 51.4 52 51.9 56.4 ...
## $ Eccentricity : num 0.72 0.726 0.731 0.739 0.749 ...
## $ ConvexArea : int 4677 3015 3132 3157 3802 3080 3636 3866 2790 5819 ...
## $ EquivDiameter : num 76 60.5 62.3 62.6 68.6 ...
## $ Extent : num 0.658 0.713 0.759 0.784 0.769 ...
## $ Perimeter : num 273 208 210 211 230 ...
## $ Roundness : num 0.765 0.832 0.868 0.87 0.875 ...
## $ AspectRation : num 1.44 1.45 1.47 1.48 1.51 ...
## $ Class : int 1 1 1 1 1 1 1 1 1 1 ...
There is one variable that we do not need, id, so we drop it by subsetting.
rice <- rice[,-c(1)]
str(rice)
## 'data.frame': 18185 obs. of 11 variables:
## $ Area : int 4537 2872 3048 3073 3693 2990 3556 3788 2629 5719 ...
## $ MajorAxisLength: num 92.2 74.7 76.3 77 85.1 ...
## $ MinorAxisLength: num 64 51.4 52 51.9 56.4 ...
## $ Eccentricity : num 0.72 0.726 0.731 0.739 0.749 ...
## $ ConvexArea : int 4677 3015 3132 3157 3802 3080 3636 3866 2790 5819 ...
## $ EquivDiameter : num 76 60.5 62.3 62.6 68.6 ...
## $ Extent : num 0.658 0.713 0.759 0.784 0.769 ...
## $ Perimeter : num 273 208 210 211 230 ...
## $ Roundness : num 0.765 0.832 0.868 0.87 0.875 ...
## $ AspectRation : num 1.44 1.45 1.47 1.48 1.51 ...
## $ Class : int 1 1 1 1 1 1 1 1 1 1 ...
It is very important to check for missing values in the dataset.
anyNA(rice)
## [1] FALSE
Great! The data is complete and ready to be processed.
rice$Class <- as.factor(rice$Class)
rice <- rice %>%
  mutate(Class = factor(Class, levels = c(1, 0),
                        labels = c("Jasmine", "Gonen")))
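The order of the factor levels matters later on, because `confusionMatrix` from caret treats the first level as the positive class by default, which is why the outputs in this report show Jasmine as the positive class. The following is a minimal sketch of the recoding on a toy vector (`class_raw` is a made-up example, not the real column):

```r
# Toy vector using the dataset's encoding: 1 = Jasmine, 0 = Gonen
class_raw <- c(1, 0, 0, 1, 1)

# levels = c(1, 0) puts Jasmine first; labels are matched to levels by position
class_fct <- factor(class_raw, levels = c(1, 0), labels = c("Jasmine", "Gonen"))

levels(class_fct)      # "Jasmine" "Gonen"
as.integer(class_fct)  # 1 2 2 1 1 (internal codes, not the original 0/1 values)
table(class_fct)
```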
The next step is to split the data into training and test sets. The training data is used to build the model, and the test data is used to evaluate the model on data it has not seen, which tells us how well the model generalizes.
set.seed(150)
# Split into training data and test data
split=sample.split(rice, SplitRatio = 0.7)
#Train dataset
training_set=subset(rice,split==TRUE)
#Test dataset
test_set=subset(rice,split==FALSE)
table(rice$Class)
##
## Jasmine Gonen
## 9985 8200
dim(training_set)
## [1] 11573 11
dim(test_set)
## [1] 6612 11
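A note on the split sizes: 11573 of 18185 rows is about 63.6%, not 70%. The caTools documentation describes the first argument of `sample.split` as a vector of labels, so passing the whole data frame appears to produce a mask with one entry per column (11), which is then recycled down the rows. The sketch below reproduces the reported sizes under that assumption, and then shows one way to draw a 70% split within each class using base R only. The mask positions are inferred from the test-set row names printed further down (3, 4, 10, 11, 14, 15), and the `labels` vector is a stand-in built from the class counts, not the real data.

```r
n_rows <- 18185   # rows in rice, from str(rice)
n_cols <- 11      # columns in rice after dropping id

# Assumed length-11 mask with FALSE at positions 3, 4, 10 and 11
# (inferred from the printed test-set row names, not taken from the real run)
mask <- rep(TRUE, n_cols)
mask[c(3, 4, 10, 11)] <- FALSE

# A logical row index shorter than the data is recycled down the rows
recycled <- rep_len(mask, n_rows)
c(train = sum(recycled), test = sum(!recycled))   # 11573 and 6612

# Alternative sketch: sample 70% of the row indices within each class
set.seed(150)
labels <- factor(rep(c("Jasmine", "Gonen"), times = c(9985, 8200)))  # stand-in labels
train_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))
is_train <- seq_along(labels) %in% train_idx
table(labels, is_train)
```

With the real data, the same effect would be expected from passing `rice$Class` rather than `rice` to `sample.split`, although the numbers reported in the rest of this document come from the split as originally run.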
topredict_set<-test_set[1:10]
dim(topredict_set)
## [1] 6612 10
str(topredict_set)
## 'data.frame': 6612 obs. of 10 variables:
## $ Area : int 3048 3073 5719 2665 3265 4301 3852 3163 2647 3606 ...
## $ MajorAxisLength: num 76.3 77 106.7 74.4 83.4 ...
## $ MinorAxisLength: num 52 51.9 69 48.1 52.8 ...
## $ Eccentricity : num 0.731 0.739 0.763 0.763 0.774 ...
## $ ConvexArea : int 3132 3157 5819 2777 3420 4427 4023 3232 2710 3658 ...
## $ EquivDiameter : num 62.3 62.6 85.3 58.3 64.5 ...
## $ Extent : num 0.759 0.784 0.755 0.597 0.576 ...
## $ Perimeter : num 210 211 282 202 228 ...
## $ Roundness : num 0.868 0.87 0.905 0.817 0.79 ...
## $ AspectRation : num 1.47 1.48 1.55 1.55 1.58 ...
#Create NaiveBayes Model
model_naive<- naiveBayes(Class ~ ., data = training_set)
# Predict Class for the held-out predictors (topredict_set)
preds_naive <- predict(model_naive, newdata = topredict_set)
(conf_matrix_naive <- table(preds_naive, test_set$Class))
##
## preds_naive Jasmine Gonen
## Jasmine 3622 97
## Gonen 12 2881
The Naive Bayes model correctly classifies 3622 Jasmine grains and 2881 Gonen grains in the test set. It misclassifies 97 Gonen grains as Jasmine and 12 Jasmine grains as Gonen. How about the accuracy? Let's check below.
confusionMatrix(conf_matrix_naive)
## Confusion Matrix and Statistics
##
##
## preds_naive Jasmine Gonen
## Jasmine 3622 97
## Gonen 12 2881
##
## Accuracy : 0.9835
## 95% CI : (0.9801, 0.9864)
## No Information Rate : 0.5496
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9666
##
## Mcnemar's Test P-Value : 8.573e-16
##
## Sensitivity : 0.9967
## Specificity : 0.9674
## Pos Pred Value : 0.9739
## Neg Pred Value : 0.9959
## Prevalence : 0.5496
## Detection Rate : 0.5478
## Detection Prevalence : 0.5625
## Balanced Accuracy : 0.9821
##
## 'Positive' Class : Jasmine
##
From the output, Naive Bayes has an accuracy of 98.35%.
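To see where these figures come from, the main statistics can be recomputed by hand from the four counts in the Naive Bayes confusion matrix. This is a small sketch in base R; the counts are copied from the table above, and Jasmine is taken as the positive class, as in the caret output.

```r
# Counts copied from the Naive Bayes confusion matrix (rows = predicted, columns = actual)
cm <- matrix(c(3622, 12, 97, 2881), nrow = 2,
             dimnames = list(predicted = c("Jasmine", "Gonen"),
                             actual    = c("Jasmine", "Gonen")))

tp <- cm["Jasmine", "Jasmine"]  # Jasmine predicted as Jasmine
fn <- cm["Gonen",   "Jasmine"]  # Jasmine predicted as Gonen
fp <- cm["Jasmine", "Gonen"]    # Gonen predicted as Jasmine
tn <- cm["Gonen",   "Gonen"]    # Gonen predicted as Gonen

metrics <- c(
  accuracy       = (tp + tn) / sum(cm),
  sensitivity    = tp / (tp + fn),
  specificity    = tn / (tn + fp),
  pos_pred_value = tp / (tp + fp)
)
round(metrics, 4)  # 0.9835 0.9967 0.9674 0.9739
```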
dt_model <- ctree(Class~ .,training_set)
plot(dt_model)
plot(dt_model, type="simple")
dt_model
##
## Model formula:
## Class ~ Area + MajorAxisLength + MinorAxisLength + Eccentricity +
## ConvexArea + EquivDiameter + Extent + Perimeter + Roundness +
## AspectRation
##
## Fitted party:
## [1] root
## | [2] MinorAxisLength <= 58.57826
## | | [3] Roundness <= 0.72365
## | | | [4] Eccentricity <= 0.92796
## | | | | [5] ConvexArea <= 6477: Jasmine (n = 476, err = 1.5%)
## | | | | [6] ConvexArea > 6477: Jasmine (n = 140, err = 12.1%)
## | | | [7] Eccentricity > 0.92796
## | | | | [8] Roundness <= 0.70488
## | | | | | [9] Eccentricity <= 0.9415: Jasmine (n = 2479, err = 0.4%)
## | | | | | [10] Eccentricity > 0.9415: Jasmine (n = 2861, err = 0.0%)
## | | | | [11] Roundness > 0.70488: Jasmine (n = 54, err = 5.6%)
## | | [12] Roundness > 0.72365
## | | | [13] ConvexArea <= 5769: Jasmine (n = 263, err = 0.4%)
## | | | [14] ConvexArea > 5769
## | | | | [15] MinorAxisLength <= 56.89267: Jasmine (n = 42, err = 26.2%)
## | | | | [16] MinorAxisLength > 56.89267: Gonen (n = 53, err = 30.2%)
## | [17] MinorAxisLength > 58.57826
## | | [18] Roundness <= 0.70251
## | | | [19] MinorAxisLength <= 62.36241: Jasmine (n = 56, err = 28.6%)
## | | | [20] MinorAxisLength > 62.36241: Gonen (n = 62, err = 0.0%)
## | | [21] Roundness > 0.70251
## | | | [22] MinorAxisLength <= 60.41631: Gonen (n = 104, err = 21.2%)
## | | | [23] MinorAxisLength > 60.41631
## | | | | [24] MajorAxisLength <= 122.1913
## | | | | | [25] EquivDiameter <= 84.58329: Gonen (n = 7, err = 42.9%)
## | | | | | [26] EquivDiameter > 84.58329: Gonen (n = 63, err = 1.6%)
## | | | | [27] MajorAxisLength > 122.1913
## | | | | | [28] EquivDiameter <= 92.94585: Gonen (n = 243, err = 1.2%)
## | | | | | [29] EquivDiameter > 92.94585: Gonen (n = 4670, err = 0.0%)
##
## Number of inner nodes: 14
## Number of terminal nodes: 15
width(dt_model)
## [1] 15
depth(dt_model)
## [1] 5
After training the model on the training data, we can apply it to the test data.
predict(dt_model, head(test_set[,-11]))
## 3 4 10 11 14 15
## Jasmine Jasmine Gonen Jasmine Jasmine Gonen
## Levels: Jasmine Gonen
pred <- predict(dt_model, test_set[,-11])
(conf_matrix_dtree <- table(pred, test_set$Class))
##
## pred Jasmine Gonen
## Jasmine 3595 44
## Gonen 39 2934
The Decision Tree model correctly classifies 3595 Jasmine grains and 2934 Gonen grains in the test set. It misclassifies 44 Gonen grains as Jasmine and 39 Jasmine grains as Gonen.
Check the probabilities.
predict(dt_model, head(test_set[,-11]), type="prob")
## Jasmine Gonen
## 3 0.99619772 0.003802281
## 4 0.99619772 0.003802281
## 10 0.01587302 0.984126984
## 11 0.99619772 0.003802281
## 14 0.99619772 0.003802281
## 15 0.21153846 0.788461538
From the probabilities, we see that rice number 3 and rice number 4 each have a 99.62% probability of being Jasmine, while rice number 10 has a 98.41% probability of being Gonen.
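These probabilities are simply the class proportions of the training observations in the terminal node that each grain falls into. The sketch below reconstructs three of them from the printed tree. The minority counts are assumptions inferred from the `n` and `err` values of nodes 13, 26 and 22, and the mapping of rows to nodes is inferred by matching the printed probabilities.

```r
# Terminal nodes taken from the printed tree; minority_n is an assumed count
# inferred from n and err (e.g. node 13: err = 0.4% of 263 is about 1 grain)
leaves <- data.frame(
  node       = c(13, 26, 22),
  n          = c(263, 63, 104),
  majority   = c("Jasmine", "Gonen", "Gonen"),
  minority_n = c(1, 1, 22)
)

# Predicted probability = share of each class among the training rows in the leaf
leaves$p_majority <- (leaves$n - leaves$minority_n) / leaves$n
leaves$p_minority <- leaves$minority_n / leaves$n
leaves
```

Under these assumptions, node 13 gives 262/263 for Jasmine (rows 3, 4, 11 and 14), node 26 gives 62/63 for Gonen (row 10), and node 22 gives 82/104 for Gonen (row 15).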
caret::confusionMatrix(pred, test_set[,11])
## Confusion Matrix and Statistics
##
## Reference
## Prediction Jasmine Gonen
## Jasmine 3595 44
## Gonen 39 2934
##
## Accuracy : 0.9874
## 95% CI : (0.9845, 0.99)
## No Information Rate : 0.5496
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9746
##
## Mcnemar's Test P-Value : 0.6606
##
## Sensitivity : 0.9893
## Specificity : 0.9852
## Pos Pred Value : 0.9879
## Neg Pred Value : 0.9869
## Prevalence : 0.5496
## Detection Rate : 0.5437
## Detection Prevalence : 0.5504
## Balanced Accuracy : 0.9872
##
## 'Positive' Class : Jasmine
##
According to the confusion matrix, the accuracy of the Decision Tree is 98.74%, a little higher than the 98.35% of Naive Bayes. The difference is small: about 0.4 percentage points, and the two 95% confidence intervals, (0.9801, 0.9864) and (0.9845, 0.99), overlap.
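The confidence intervals reported by `confusionMatrix` can be reproduced with an exact binomial test on the number of correct predictions, which makes the comparison between the two models easy to check. This is a rough sketch: overlapping intervals are only an informal indication, and a paired comparison of the two models' predictions on the same grains would be a more careful test.

```r
n_test <- 6612  # size of the test set

# Correct predictions = diagonal of each confusion matrix
nb <- binom.test(3622 + 2881, n_test)  # Naive Bayes
dt <- binom.test(3595 + 2934, n_test)  # Decision Tree

ci <- rbind(
  naive_bayes   = c(accuracy = unname(nb$estimate),
                    lower = nb$conf.int[1], upper = nb$conf.int[2]),
  decision_tree = c(accuracy = unname(dt$estimate),
                    lower = dt$conf.int[1], upper = dt$conf.int[2])
)
round(ci, 4)

# Two intervals overlap when each one starts before the other ends
intervals_overlap <- ci["naive_bayes", "upper"] >= ci["decision_tree", "lower"] &&
  ci["decision_tree", "upper"] >= ci["naive_bayes", "lower"]
intervals_overlap
```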
set.seed(150)
n0_var <- nearZeroVar(rice[,1:10])
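# nearZeroVar is applied to columns 1 through 10 (the predictors)
# Drop near-zero-variance columns only if any were found; rice[, -integer(0)] would return zero columns
bc <- if (length(n0_var) > 0) rice[, -n0_var] else rice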
ani.options(interval = 1, nmax = 15)
cv.ani(main = "Demonstration of the k-fold Cross Validation", bty = "l")
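Since the animation does not survive in a static page, here is a small base R sketch of what repeated k-fold cross-validation does with the row indices. It uses 5 folds and 3 repeats on 11573 rows (the size of the training set), matching the settings used for the Random Forest in this section. The helper `make_folds` is a hypothetical name, and unlike caret's own fold creation this sketch does not stratify by class.

```r
set.seed(150)
n       <- 11573  # rows in the training set
k       <- 5      # folds per repeat
repeats <- 3

# Hypothetical helper: assign every row to one of k folds at random,
# keeping the fold sizes as equal as possible
make_folds <- function(n, k) sample(rep_len(seq_len(k), n))

# One column of fold labels per repeat (n x repeats matrix)
fold_ids <- replicate(repeats, make_folds(n, k))

# Fold sizes for each repeat (k x repeats matrix)
fold_sizes <- apply(fold_ids, 2, tabulate, nbins = k)
fold_sizes

# Repeat 1, fold 1: these rows are held out, the rest are used for fitting
holdout <- which(fold_ids[, 1] == 1)
fit_on  <- which(fold_ids[, 1] != 1)
c(holdout = length(holdout), fit_on = length(fit_on))
```

Each row is held out exactly once per repeat, so a model is fitted 15 times in total (5 folds times 3 repeats) for every candidate tuning value.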
Now we create a model using 5-fold cross-validation with 3 repeats.
set.seed(150)
# The code below is commented out because it takes a long time to run. To rerun the analysis, load the RDS file that was saved during the first run.
#ctrl <- trainControl(method="repeatedcv", number=5, repeats=3)
#model_rforest <- train(Class~ ., data=training_set, method="rf", trControl = ctrl)
#saveRDS(model_rforest, file = "model_rforest_rice.rds")
model_rforest <- readRDS("model_rforest_rice.rds")
Let's plot it.
plot(model_rforest)
sum(predict(model_rforest,test_set[,-11])==test_set[,11])
## [1] 6529
varImp(model_rforest)
## rf variable importance
##
## Overall
## MinorAxisLength 100.0000
## Eccentricity 64.2599
## AspectRation 58.5896
## Roundness 52.3971
## EquivDiameter 38.1436
## Area 31.9368
## ConvexArea 20.3825
## Perimeter 5.5170
## MajorAxisLength 0.2724
## Extent 0.0000
From the result, MinorAxisLength is the most important variable for prediction in the Random Forest model, followed by Eccentricity and AspectRation.
plot(model_rforest$finalModel)
legend("topright", colnames(model_rforest$finalModel$err.rate),
col=1:6,cex=0.8,fill=1:6)
model_rforest$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 0.93%
## Confusion matrix:
## Jasmine Gonen class.error
## Jasmine 6312 39 0.006140765
## Gonen 69 5153 0.013213328
With mtry = 2 (two variables tried at each split), the out-of-bag confusion matrix on the training data shows 6312 Jasmine grains classified correctly and 39 misclassified as Gonen, and 5153 Gonen grains classified correctly and 69 misclassified as Jasmine.
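The out-of-bag (OOB) error rate and the per-class errors in the printout follow directly from these four counts. A short sketch in base R, with the counts copied from the output above (in the randomForest printout, rows are the actual classes and columns the predicted ones):

```r
# OOB confusion matrix copied from the printed final model
oob <- matrix(c(6312, 69, 39, 5153), nrow = 2,
              dimnames = list(actual    = c("Jasmine", "Gonen"),
                              predicted = c("Jasmine", "Gonen")))

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(oob) / rowSums(oob)
class_error

# Overall OOB error: share of all training rows that were misclassified
oob_error <- 1 - sum(diag(oob)) / sum(oob)
round(100 * oob_error, 2)  # 0.93 (percent)
```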
predict_forest <- predict(model_rforest, topredict_set)
(conf_matrix_forestI <- table(predict_forest, test_set$Class))
##
## predict_forest Jasmine Gonen
## Jasmine 3596 45
## Gonen 38 2933
confusionMatrix(conf_matrix_forestI)
## Confusion Matrix and Statistics
##
##
## predict_forest Jasmine Gonen
## Jasmine 3596 45
## Gonen 38 2933
##
## Accuracy : 0.9874
## 95% CI : (0.9845, 0.99)
## No Information Rate : 0.5496
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9746
##
## Mcnemar's Test P-Value : 0.5102
##
## Sensitivity : 0.9895
## Specificity : 0.9849
## Pos Pred Value : 0.9876
## Neg Pred Value : 0.9872
## Prevalence : 0.5496
## Detection Rate : 0.5439
## Detection Prevalence : 0.5507
## Balanced Accuracy : 0.9872
##
## 'Positive' Class : Jasmine
##
model_rf <- randomForest(Class ~ ., data = training_set, importance=TRUE,
ntree = 500)
model_rf
##
## Call:
## randomForest(formula = Class ~ ., data = training_set, importance = TRUE, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 0.99%
## Confusion matrix:
## Jasmine Gonen class.error
## Jasmine 6310 41 0.006455676
## Gonen 74 5148 0.014170816
preds_rf <- predict(model_rf, topredict_set)
plot(model_rf)
(conf_matrix_forestII <- table(preds_rf, test_set$Class))
##
## preds_rf Jasmine Gonen
## Jasmine 3596 44
## Gonen 38 2934
confusionMatrix(conf_matrix_forestII)
## Confusion Matrix and Statistics
##
##
## preds_rf Jasmine Gonen
## Jasmine 3596 44
## Gonen 38 2934
##
## Accuracy : 0.9876
## 95% CI : (0.9846, 0.9901)
## No Information Rate : 0.5496
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9749
##
## Mcnemar's Test P-Value : 0.5808
##
## Sensitivity : 0.9895
## Specificity : 0.9852
## Pos Pred Value : 0.9879
## Neg Pred Value : 0.9872
## Prevalence : 0.5496
## Detection Rate : 0.5439
## Detection Prevalence : 0.5505
## Balanced Accuracy : 0.9874
##
## 'Positive' Class : Jasmine
##
confusionMatrix(conf_matrix_naive)
## Confusion Matrix and Statistics
##
##
## preds_naive Jasmine Gonen
## Jasmine 3622 97
## Gonen 12 2881
##
## Accuracy : 0.9835
## 95% CI : (0.9801, 0.9864)
## No Information Rate : 0.5496
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9666
##
## Mcnemar's Test P-Value : 8.573e-16
##
## Sensitivity : 0.9967
## Specificity : 0.9674
## Pos Pred Value : 0.9739
## Neg Pred Value : 0.9959
## Prevalence : 0.5496
## Detection Rate : 0.5478
## Detection Prevalence : 0.5625
## Balanced Accuracy : 0.9821
##
## 'Positive' Class : Jasmine
##
confusionMatrix(conf_matrix_dtree)
## Confusion Matrix and Statistics
##
##
## pred Jasmine Gonen
## Jasmine 3595 44
## Gonen 39 2934
##
## Accuracy : 0.9874
## 95% CI : (0.9845, 0.99)
## No Information Rate : 0.5496
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9746
##
## Mcnemar's Test P-Value : 0.6606
##
## Sensitivity : 0.9893
## Specificity : 0.9852
## Pos Pred Value : 0.9879
## Neg Pred Value : 0.9869
## Prevalence : 0.5496
## Detection Rate : 0.5437
## Detection Prevalence : 0.5504
## Balanced Accuracy : 0.9872
##
## 'Positive' Class : Jasmine
##
confusionMatrix(conf_matrix_forestII)
## Confusion Matrix and Statistics
##
##
## preds_rf Jasmine Gonen
## Jasmine 3596 44
## Gonen 38 2934
##
## Accuracy : 0.9876
## 95% CI : (0.9846, 0.9901)
## No Information Rate : 0.5496
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9749
##
## Mcnemar's Test P-Value : 0.5808
##
## Sensitivity : 0.9895
## Specificity : 0.9852
## Pos Pred Value : 0.9879
## Neg Pred Value : 0.9872
## Prevalence : 0.5496
## Detection Rate : 0.5439
## Detection Prevalence : 0.5505
## Balanced Accuracy : 0.9874
##
## 'Positive' Class : Jasmine
##
-. The highest accuracy comes from the Random Forest model with 98.76%, followed by the Decision Tree with 98.74% and Naive Bayes with 98.35%.
-. All three models perform well: every accuracy is above 98%, far beyond the 90% mark.
-. Several variables in the dataset contribute strongly to classifying the test data, as can be seen from the summary of each model.
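The ranking above can be checked side by side by recomputing the accuracies from the three test-set confusion matrices. In this sketch the counts are copied from the outputs in this document, stored in the order correct Jasmine, Jasmine predicted as Gonen, Gonen predicted as Jasmine, correct Gonen; the Random Forest row uses the second forest (`model_rf`).

```r
# Test-set counts per model: c(correct Jasmine, Jasmine as Gonen, Gonen as Jasmine, correct Gonen)
counts <- list(
  naive_bayes   = c(3622, 12, 97, 2881),
  decision_tree = c(3595, 39, 44, 2934),
  random_forest = c(3596, 38, 44, 2934)
)

# Accuracy = correct predictions / all predictions
correct  <- sapply(counts, function(x) x[1] + x[4])
accuracy <- sapply(counts, function(x) (x[1] + x[4]) / sum(x))

ranking <- sort(round(accuracy, 4), decreasing = TRUE)
ranking

# Gap between the two best models, in number of test grains
correct[["random_forest"]] - correct[["decision_tree"]]  # 1
```

The Random Forest and the Decision Tree differ by a single correctly classified grain out of 6612, so the two are practically tied on this test set.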