Data Structure
- data dimension: (898, 35)
- no missing values
- only the target variable (Class) is character type; the rest are numeric or integer
## 'data.frame': 898 obs. of 35 variables:
## $ AREA : int 422163 338136 526843 416063 347562 408953 451414 382636 546063 420044 ...
## $ PERIMETER : num 2379 2085 2647 2351 2160 ...
## $ MAJOR_AXIS : num 838 724 941 828 764 ...
## $ MINOR_AXIS : num 646 595 715 645 583 ...
## $ ECCENTRICITY : num 0.637 0.569 0.649 0.627 0.646 ...
## $ EQDIASQ : num 733 656 819 728 665 ...
## $ SOLIDITY : num 0.995 0.997 0.996 0.995 0.991 ...
## $ CONVEX_AREA : int 424428 339014 528876 418255 350797 410036 452755 385277 552598 423531 ...
## $ EXTENT : num 0.783 0.779 0.766 0.776 0.757 ...
## $ ASPECT_RATIO : num 1.3 1.22 1.31 1.28 1.31 ...
## $ ROUNDNESS : num 0.937 0.977 0.945 0.946 0.936 ...
## $ COMPACTNESS : num 0.875 0.906 0.871 0.879 0.871 ...
## $ SHAPEFACTOR_1: num 0.002 0.0021 0.0018 0.002 0.0022 0.0021 0.002 0.0021 0.0017 0.002 ...
## $ SHAPEFACTOR_2: num 0.0015 0.0018 0.0014 0.0016 0.0017 0.0015 0.0014 0.0016 0.0014 0.0015 ...
## $ SHAPEFACTOR_3: num 0.766 0.822 0.758 0.773 0.758 ...
## $ SHAPEFACTOR_4: num 0.994 0.999 0.997 0.992 0.994 ...
## $ MeanRR : num 117.4 100.1 131 86.8 105.5 ...
## $ MeanRG : num 109.9 105.6 118.6 88.3 101.8 ...
## $ MeanRB : num 95.7 95.7 103.9 82.4 85.3 ...
## $ StdDevRR : num 26.5 27.3 29.7 28.7 30.3 ...
## $ StdDevRG : num 23.1 23.5 24.6 24.5 25 ...
## $ StdDevRB : num 30.1 28.1 33.9 30.4 27.2 ...
## $ SkewRR : num -0.566 -0.233 -0.715 0.458 -0.355 ...
## $ SkewRG : num -0.0114 0.1349 -0.1059 1.2917 0.2101 ...
## $ SkewRB : num 0.602 0.413 0.918 1.803 0.886 ...
## $ KurtosisRR : num 3.24 2.62 3.75 5.04 2.7 ...
## $ KurtosisRG : num 2.96 2.63 3.86 8.61 2.98 ...
## $ KurtosisRB : num 4.23 3.17 4.72 8.26 4.41 ...
## $ EntropyRR : num -5.92e+10 -3.42e+10 -9.39e+10 -3.21e+10 -4.00e+10 ...
## $ EntropyRG : num -5.07e+10 -3.75e+10 -7.47e+10 -3.21e+10 -3.60e+10 ...
## $ EntropyRB : num -3.99e+10 -3.15e+10 -6.03e+10 -2.96e+10 -2.56e+10 ...
## $ ALLdaub4RR : num 58.7 50 65.5 43.4 52.8 ...
## $ ALLdaub4RG : num 55 52.8 59.3 44.1 50.9 ...
## $ ALLdaub4RB : num 47.8 47.8 51.9 41.2 42.7 ...
## $ Class : chr "BERHI" "BERHI" "BERHI" "BERHI" ...
## [1] 0
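The checks summarized above (dimensions, column types, missing-value count) can be reproduced with base R. The following is a minimal sketch on a toy stand-in table, not the original loading code: the data frame name `dates` is hypothetical, and the few values are copied from the first rows of the `str()` output as displayed.

```r
# Toy stand-in for the date-fruit table: two numeric features and the
# character target. `dates` is a hypothetical name; values are taken from
# the first rows of the str() output above (as displayed, i.e. rounded).
dates <- data.frame(
  AREA      = c(422163L, 338136L, 526843L, 416063L),
  PERIMETER = c(2379, 2085, 2647, 2351),
  Class     = c("BERHI", "BERHI", "BERHI", "BERHI"),
  stringsAsFactors = FALSE
)

dim(dates)                       # number of rows and columns
str(dates)                       # type and first values of every column
n_missing <- sum(is.na(dates))   # total count of NA cells
col_types <- vapply(dates, function(x) class(x)[1], character(1))
table(col_types)                 # how many columns of each type
```

On the full data set the same calls would give the `(898, 35)` dimension, the `str()` listing, and the `0` missing-value count shown above.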
EDA
Numeric variables
Multi-Collinearity with full features
Highly positively correlated pairs:
- AREA: PERIMETER/MAJOR_AXIS/MINOR_AXIS/EQDIASQ/CONVEX_AREA
- SOLIDITY: SHAPEFACTOR_4
- ASPECT_RATIO: SHAPEFACTOR_1
- ROUNDNESS: COMPACTNESS
- COMPACTNESS: SHAPEFACTOR_3
- MeanRR: MeanRG/MeanRB/ALLdaub4RR/ALLdaub4RB/ALLdaub4RG
- StdDevRR: StdDevRG
- SkewRR: SkewRG/KurtosisRR
- SkewRG: KurtosisRG
- SkewRB: KurtosisRB
- EntropyRR: EntropyRG/EntropyRB
Highly negatively correlated pairs:
- COMPACTNESS: ECCENTRICITY
- SHAPEFACTOR_2: AREA/PERIMETER/MAJOR_AXIS/EQDIASQ/CONVEX_AREA
- SHAPEFACTOR_3: ECCENTRICITY
- SkewRR: MeanRR/MeanRG/MeanRB
- KurtosisRG: MeanRR
- ALLdaub4RR: SkewRR/SkewRG/KurtosisRG
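Pair lists like the two above can be extracted from a correlation matrix programmatically. The sketch below is illustrative only: it uses the numeric columns of the built-in `iris` data as a stand-in for the date-fruit features, and the cutoff of 0.8 is an assumed threshold, since the source does not state which value was used.

```r
# Stand-in data: numeric columns of the built-in iris data set.
num <- iris[, sapply(iris, is.numeric)]

cor_mat <- cor(num)   # Pearson correlation matrix
cutoff  <- 0.8        # assumed threshold for "highly correlated"

# Look only at the upper triangle so each pair is reported once.
idx <- which(upper.tri(cor_mat) & abs(cor_mat) >= cutoff, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx],
  stringsAsFactors = FALSE
)

# Strongest pairs first; the sign of r separates positive from negative pairs.
high_pairs <- high_pairs[order(-abs(high_pairs$r)), ]
high_pairs
```

Splitting `high_pairs` by `r > 0` and `r < 0` would give the positive and negative groupings used in the lists.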
Multi-Collinearity with reduced features
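One simple way to arrive at a reduced feature set is a greedy correlation filter. The source does not say how its features were selected, so the sketch below is an assumption, not the author's procedure: `drop_correlated` is a hypothetical helper, the 0.9 cutoff is arbitrary, and `iris` stands in for the real data.

```r
# Greedy filter: walk through the columns in order and keep a column only if
# its absolute correlation with every column already kept is below the cutoff.
# The result depends on column order.
drop_correlated <- function(df, cutoff = 0.9) {
  cm   <- abs(cor(df))
  keep <- character(0)
  for (v in colnames(cm)) {
    if (length(keep) == 0 || all(cm[v, keep] < cutoff)) {
      keep <- c(keep, v)
    }
  }
  df[, keep, drop = FALSE]
}

num     <- iris[, sapply(iris, is.numeric)]   # stand-in numeric features
reduced <- drop_correlated(num, cutoff = 0.9)

colnames(reduced)                             # features that survive the filter
rc <- abs(cor(reduced))
max_remaining <- max(rc[upper.tri(rc)])       # largest remaining |r|
max_remaining
```

Re-plotting the correlation matrix of `reduced` is then a quick check that no pair above the cutoff remains.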
Distribution
Categorical Variables
Create Partition
## [1] 672 12
## [1] 226 11
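The 672/226 row split is close to 75/25. A stratified split of that kind can be sketched in base R as follows; the 0.75 fraction, the seed, and the use of `iris` (with `Species` in the role of `Class`) are all assumptions, and the source may well have used a package helper instead.

```r
set.seed(42)   # arbitrary seed, for reproducibility only

# Stand-in data: iris, with Species playing the role of Class.
dat <- iris
p   <- 0.75    # assumed training fraction

# Sample row indices within each class so class proportions are preserved.
train_idx <- unlist(lapply(
  split(seq_len(nrow(dat)), dat$Species),
  function(rows) rows[sample.int(length(rows), ceiling(p * length(rows)))]
))

train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

dim(train)
dim(test)
prop.table(table(train$Species))   # class shares in the training set
```

Sampling within each class keeps rare varieties represented in both partitions, which matters here because class prevalence ranges from about 7% to 23%.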
Modeling
Decision Tree: 82%
## Confusion Matrix and Statistics
##
## Reference
## Prediction BERHI DEGLET DOKOL IRAQI ROTANA SAFAVI SOGAY
## BERHI 9 0 0 4 0 0 4
## DEGLET 0 17 4 0 1 0 6
## DOKOL 0 6 47 0 0 0 0
## IRAQI 3 0 0 12 0 0 0
## ROTANA 4 0 0 1 39 1 3
## SAFAVI 0 0 0 1 0 50 0
## SOGAY 0 1 1 0 1 0 11
##
## Overall Statistics
##
## Accuracy : 0.8186
## 95% CI : (0.762, 0.8666)
## No Information Rate : 0.2301
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7804
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: BERHI Class: DEGLET Class: DOKOL Class: IRAQI
## Sensitivity 0.56250 0.70833 0.9038 0.66667
## Specificity 0.96190 0.94554 0.9655 0.98558
## Pos Pred Value 0.52941 0.60714 0.8868 0.80000
## Neg Pred Value 0.96651 0.96465 0.9711 0.97156
## Prevalence 0.07080 0.10619 0.2301 0.07965
## Detection Rate 0.03982 0.07522 0.2080 0.05310
## Detection Prevalence 0.07522 0.12389 0.2345 0.06637
## Balanced Accuracy 0.76220 0.82694 0.9347 0.82612
## Class: ROTANA Class: SAFAVI Class: SOGAY
## Sensitivity 0.9512 0.9804 0.45833
## Specificity 0.9514 0.9943 0.98515
## Pos Pred Value 0.8125 0.9804 0.78571
## Neg Pred Value 0.9888 0.9943 0.93868
## Prevalence 0.1814 0.2257 0.10619
## Detection Rate 0.1726 0.2212 0.04867
## Detection Prevalence 0.2124 0.2257 0.06195
## Balanced Accuracy 0.9513 0.9873 0.72174
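The summary statistics in this output follow directly from the confusion matrix. As a worked check, the sketch below re-enters the decision-tree matrix by hand and recomputes accuracy, the no-information rate, Cohen's kappa, and per-class sensitivity and positive predictive value using the standard definitions; variable names are illustrative.

```r
classes <- c("BERHI", "DEGLET", "DOKOL", "IRAQI", "ROTANA", "SAFAVI", "SOGAY")

# Decision-tree confusion matrix copied from the output above
# (rows = predicted class, columns = reference class).
cm <- matrix(
  c(9,  0,  0,  4,  0,  0,  4,
    0, 17,  4,  0,  1,  0,  6,
    0,  6, 47,  0,  0,  0,  0,
    3,  0,  0, 12,  0,  0,  0,
    4,  0,  0,  1, 39,  1,  3,
    0,  0,  0,  1,  0, 50,  0,
    0,  1,  1,  0,  1,  0, 11),
  nrow = 7, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

n           <- sum(cm)
accuracy    <- sum(diag(cm)) / n                     # share of correct predictions
nir         <- max(colSums(cm)) / n                  # no-information rate: largest class share
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
cohen_kappa <- (accuracy - expected) / (1 - expected)

sensitivity <- diag(cm) / colSums(cm)   # recall per reference class
ppv         <- diag(cm) / rowSums(cm)   # precision per predicted class

round(c(accuracy = accuracy, nir = nir, kappa = cohen_kappa), 4)
round(rbind(sensitivity, ppv), 4)
```

The recomputed values match the printed ones (accuracy 0.8186, kappa 0.7804), and the low sensitivity for SOGAY and BERHI traces back to their columns, where many reference cases are predicted as other varieties.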
SVM: 87%
## Confusion Matrix and Statistics
##
## Reference
## Prediction BERHI DEGLET DOKOL IRAQI ROTANA SAFAVI SOGAY
## BERHI 13 0 0 1 0 0 0
## DEGLET 0 14 0 0 1 1 3
## DOKOL 0 9 51 0 0 0 0
## IRAQI 1 0 0 17 0 0 0
## ROTANA 2 0 0 0 36 0 4
## SAFAVI 0 0 1 0 0 50 2
## SOGAY 0 1 0 0 4 0 15
##
## Overall Statistics
##
## Accuracy : 0.8673
## 95% CI : (0.816, 0.9086)
## No Information Rate : 0.2301
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8388
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: BERHI Class: DEGLET Class: DOKOL Class: IRAQI
## Sensitivity 0.81250 0.58333 0.9808 0.94444
## Specificity 0.99524 0.97525 0.9483 0.99519
## Pos Pred Value 0.92857 0.73684 0.8500 0.94444
## Neg Pred Value 0.98585 0.95169 0.9940 0.99519
## Prevalence 0.07080 0.10619 0.2301 0.07965
## Detection Rate 0.05752 0.06195 0.2257 0.07522
## Detection Prevalence 0.06195 0.08407 0.2655 0.07965
## Balanced Accuracy 0.90387 0.77929 0.9645 0.96982
## Class: ROTANA Class: SAFAVI Class: SOGAY
## Sensitivity 0.8780 0.9804 0.62500
## Specificity 0.9676 0.9829 0.97525
## Pos Pred Value 0.8571 0.9434 0.75000
## Neg Pred Value 0.9728 0.9942 0.95631
## Prevalence 0.1814 0.2257 0.10619
## Detection Rate 0.1593 0.2212 0.06637
## Detection Prevalence 0.1858 0.2345 0.08850
## Balanced Accuracy 0.9228 0.9816 0.80012
Naive Bayes: 85%
## Confusion Matrix and Statistics
##
## Reference
## Prediction BERHI DEGLET DOKOL IRAQI ROTANA SAFAVI SOGAY
## BERHI 13 0 0 2 0 0 1
## DEGLET 0 12 2 0 0 0 7
## DOKOL 0 7 48 0 0 0 0
## IRAQI 2 0 0 16 0 0 0
## ROTANA 1 0 0 0 38 0 1
## SAFAVI 0 1 2 0 0 51 0
## SOGAY 0 4 0 0 3 0 15
##
## Overall Statistics
##
## Accuracy : 0.854
## 95% CI : (0.8011, 0.8973)
## No Information Rate : 0.2301
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8233
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: BERHI Class: DEGLET Class: DOKOL Class: IRAQI
## Sensitivity 0.81250 0.50000 0.9231 0.88889
## Specificity 0.98571 0.95545 0.9598 0.99038
## Pos Pred Value 0.81250 0.57143 0.8727 0.88889
## Neg Pred Value 0.98571 0.94146 0.9766 0.99038
## Prevalence 0.07080 0.10619 0.2301 0.07965
## Detection Rate 0.05752 0.05310 0.2124 0.07080
## Detection Prevalence 0.07080 0.09292 0.2434 0.07965
## Balanced Accuracy 0.89911 0.72772 0.9414 0.93964
## Class: ROTANA Class: SAFAVI Class: SOGAY
## Sensitivity 0.9268 1.0000 0.62500
## Specificity 0.9892 0.9829 0.96535
## Pos Pred Value 0.9500 0.9444 0.68182
## Neg Pred Value 0.9839 1.0000 0.95588
## Prevalence 0.1814 0.2257 0.10619
## Detection Rate 0.1681 0.2257 0.06637
## Detection Prevalence 0.1770 0.2389 0.09735
## Balanced Accuracy 0.9580 0.9914 0.79517
Conclusion
Among the three classifiers, SVM has the highest test-set accuracy (86.7%), followed by Naive Bayes (85.4%) and the decision tree (81.9%). The 95% confidence intervals reported above overlap, so the gap between SVM and Naive Bayes is small. Beyond these three methods, others worth trying include a CNN built with keras/tensorflow, gradient boosting, and XGBoost.
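The accuracy comparison can be recomputed from the diagonals of the three confusion matrices. The sketch below sums the correct predictions per model (185, 196, and 193 out of 226, read off the outputs above) and attaches an exact binomial 95% interval with `binom.test`; the table layout and names are illustrative.

```r
n_test  <- 226   # test-set size from the outputs above

# Correct predictions per model: the sum of each confusion-matrix diagonal.
correct <- c(DecisionTree = 185, SVM = 196, NaiveBayes = 193)

# Exact (Clopper-Pearson) 95% interval for each accuracy.
summary_tbl <- t(sapply(correct, function(k) {
  ci <- binom.test(k, n_test)$conf.int
  c(accuracy = k / n_test, lower = ci[1], upper = ci[2])
}))

# Models ordered from highest to lowest accuracy.
round(summary_tbl[order(-summary_tbl[, "accuracy"]), ], 4)
```

Because all three models are scored on the same 226 test rows and the intervals overlap, a single split gives only weak evidence for ranking them; repeated cross-validation would make the comparison more reliable.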