Blog4: Date Classification

Jie Zou

2022-05-19

Data Structure

  • data dimensions: (898, 35)
  • no missing values
  • only the target variable (Class) is character type; the rest are numeric or integer
## 'data.frame':    898 obs. of  35 variables:
##  $ AREA         : int  422163 338136 526843 416063 347562 408953 451414 382636 546063 420044 ...
##  $ PERIMETER    : num  2379 2085 2647 2351 2160 ...
##  $ MAJOR_AXIS   : num  838 724 941 828 764 ...
##  $ MINOR_AXIS   : num  646 595 715 645 583 ...
##  $ ECCENTRICITY : num  0.637 0.569 0.649 0.627 0.646 ...
##  $ EQDIASQ      : num  733 656 819 728 665 ...
##  $ SOLIDITY     : num  0.995 0.997 0.996 0.995 0.991 ...
##  $ CONVEX_AREA  : int  424428 339014 528876 418255 350797 410036 452755 385277 552598 423531 ...
##  $ EXTENT       : num  0.783 0.779 0.766 0.776 0.757 ...
##  $ ASPECT_RATIO : num  1.3 1.22 1.31 1.28 1.31 ...
##  $ ROUNDNESS    : num  0.937 0.977 0.945 0.946 0.936 ...
##  $ COMPACTNESS  : num  0.875 0.906 0.871 0.879 0.871 ...
##  $ SHAPEFACTOR_1: num  0.002 0.0021 0.0018 0.002 0.0022 0.0021 0.002 0.0021 0.0017 0.002 ...
##  $ SHAPEFACTOR_2: num  0.0015 0.0018 0.0014 0.0016 0.0017 0.0015 0.0014 0.0016 0.0014 0.0015 ...
##  $ SHAPEFACTOR_3: num  0.766 0.822 0.758 0.773 0.758 ...
##  $ SHAPEFACTOR_4: num  0.994 0.999 0.997 0.992 0.994 ...
##  $ MeanRR       : num  117.4 100.1 131 86.8 105.5 ...
##  $ MeanRG       : num  109.9 105.6 118.6 88.3 101.8 ...
##  $ MeanRB       : num  95.7 95.7 103.9 82.4 85.3 ...
##  $ StdDevRR     : num  26.5 27.3 29.7 28.7 30.3 ...
##  $ StdDevRG     : num  23.1 23.5 24.6 24.5 25 ...
##  $ StdDevRB     : num  30.1 28.1 33.9 30.4 27.2 ...
##  $ SkewRR       : num  -0.566 -0.233 -0.715 0.458 -0.355 ...
##  $ SkewRG       : num  -0.0114 0.1349 -0.1059 1.2917 0.2101 ...
##  $ SkewRB       : num  0.602 0.413 0.918 1.803 0.886 ...
##  $ KurtosisRR   : num  3.24 2.62 3.75 5.04 2.7 ...
##  $ KurtosisRG   : num  2.96 2.63 3.86 8.61 2.98 ...
##  $ KurtosisRB   : num  4.23 3.17 4.72 8.26 4.41 ...
##  $ EntropyRR    : num  -5.92e+10 -3.42e+10 -9.39e+10 -3.21e+10 -4.00e+10 ...
##  $ EntropyRG    : num  -5.07e+10 -3.75e+10 -7.47e+10 -3.21e+10 -3.60e+10 ...
##  $ EntropyRB    : num  -3.99e+10 -3.15e+10 -6.03e+10 -2.96e+10 -2.56e+10 ...
##  $ ALLdaub4RR   : num  58.7 50 65.5 43.4 52.8 ...
##  $ ALLdaub4RG   : num  55 52.8 59.3 44.1 50.9 ...
##  $ ALLdaub4RB   : num  47.8 47.8 51.9 41.2 42.7 ...
##  $ Class        : chr  "BERHI" "BERHI" "BERHI" "BERHI" ...
## [1] 0
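
The three checks listed above (dimensions, missing values, column types) can be sketched in a few lines. The output above comes from R's str(); the snippet below is only an illustrative plain-Python version, and the helper name summarize is hypothetical. It uses two rows and four columns copied from the str() output rather than the full data set.

```python
# Two rows copied from the str() output above (a subset of columns, for illustration).
rows = [
    {"AREA": 422163, "PERIMETER": 2379.0, "ECCENTRICITY": 0.637, "Class": "BERHI"},
    {"AREA": 338136, "PERIMETER": 2085.0, "ECCENTRICITY": 0.569, "Class": "BERHI"},
]


def summarize(rows):
    """Return (shape, missing_count, non_numeric_columns) for a list of dict rows."""
    columns = list(rows[0].keys())
    shape = (len(rows), len(columns))
    # Count cells that are None (the stand-in for NA in this sketch).
    missing = sum(1 for row in rows for col in columns if row.get(col) is None)
    # A column counts as numeric if every non-missing value is an int or a float.
    non_numeric = [
        col
        for col in columns
        if not all(
            isinstance(row[col], (int, float))
            for row in rows
            if row.get(col) is not None
        )
    ]
    return shape, missing, non_numeric


shape, missing, non_numeric = summarize(rows)
print(shape, missing, non_numeric)
```

On the full data the same checks would report 898 rows, 35 columns, zero missing cells, and Class as the only non-numeric column, as in the output above.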

EDA

Numeric variables

Multi-Collinearity with full features

Highly positively correlated pairs:

  • AREA: PERIMETER/MAJOR_AXIS/MINOR_AXIS/EQDIASQ/CONVEX_AREA
  • SOLIDITY: SHAPEFACTOR_4
  • ASPECT_RATIO: SHAPEFACTOR_1
  • ROUNDNESS: COMPACTNESS
  • COMPACTNESS: SHAPEFACTOR_3
  • MeanRR: MeanRG/MeanRB/ALLdaub4RR/ALLdaub4RB/ALLdaub4RG
  • StdDevRR: StdDevRG
  • SkewRR: SkewRG/KurtosisRR
  • SkewRG: KurtosisRG
  • SkewRB: KurtosisRB
  • EntropyRR: EntropyRG/EntropyRB

Highly negatively correlated pairs:

  • COMPACTNESS: ECCENTRICITY
  • SHAPEFACTOR_2: AREA/PERIMETER/MAJOR_AXIS/EQDIASQ/CONVEX_AREA
  • SHAPEFACTOR_3: ECCENTRICITY
  • SkewRR: MeanRR/MeanRG/MeanRB
  • KurtosisRG: MeanRR
  • ALLdaub4RR: SkewRR/SkewRG/KurtosisRG

Multi-Collinearity with reduced features
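
One common way to go from the correlated pairs listed above to a reduced feature set is to compute pairwise Pearson correlations and keep a feature only if it is not strongly correlated with one already kept. The post does not say which procedure or cutoff was used, so the greedy filter and the 0.8 cutoff below are assumptions, and the function names are hypothetical. The values are the first ten entries of three columns from the str() output, so the correlations are illustrative rather than the full-data figures.

```python
from math import sqrt

# First ten values of three columns, copied from the str() output above.
columns = {
    "AREA": [422163, 338136, 526843, 416063, 347562,
             408953, 451414, 382636, 546063, 420044],
    "CONVEX_AREA": [424428, 339014, 528876, 418255, 350797,
                    410036, 452755, 385277, 552598, 423531],
    "SHAPEFACTOR_2": [0.0015, 0.0018, 0.0014, 0.0016, 0.0017,
                      0.0015, 0.0014, 0.0016, 0.0014, 0.0015],
}

CUTOFF = 0.8  # assumed threshold for "highly correlated"; not stated in the post


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def reduce_features(columns, cutoff):
    """Greedy filter: keep a column only if |r| < cutoff against every kept column."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < cutoff for k in kept):
            kept.append(name)
    return kept


r_pos = pearson(columns["AREA"], columns["CONVEX_AREA"])
r_neg = pearson(columns["AREA"], columns["SHAPEFACTOR_2"])
kept = reduce_features(columns, CUTOFF)
print(round(r_pos, 3), round(r_neg, 3), kept)
```

On this sample AREA and CONVEX_AREA are strongly positively correlated and AREA and SHAPEFACTOR_2 strongly negatively correlated, matching the lists above, so only AREA survives the filter. The result of a greedy filter depends on column order, which is worth keeping in mind when applying it to all 34 predictors.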

Distribution

Categorical Variables

Create Partition

## [1] 672  12
## [1] 226  11
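
The two row counts above add up to 898, which is roughly a 75/25 train/test split. A minimal sketch of such a split, sampling each class separately so both sets keep similar class proportions, is shown below. Whether the original split was stratified is not stated, so stratification is an assumption here, as are the function name, the seed, and the class counts, which are made up for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction, seed):
    """Return (train_idx, test_idx), taking train_fraction of each class for training."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label][:]
        rng.shuffle(idx)  # random order within the class
        n_train = round(len(idx) * train_fraction)
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical class counts, for illustration only (not the real data set's counts).
labels = ["BERHI"] * 8 + ["DOKOL"] * 20 + ["SAFAVI"] * 12
train_idx, test_idx = stratified_split(labels, train_fraction=0.75, seed=42)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the partition reproducible, so the accuracy figures of different models are compared on the same test rows.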

Modeling

Decision Tree: 82%

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction BERHI DEGLET DOKOL IRAQI ROTANA SAFAVI SOGAY
##     BERHI      9      0     0     4      0      0     4
##     DEGLET     0     17     4     0      1      0     6
##     DOKOL      0      6    47     0      0      0     0
##     IRAQI      3      0     0    12      0      0     0
##     ROTANA     4      0     0     1     39      1     3
##     SAFAVI     0      0     0     1      0     50     0
##     SOGAY      0      1     1     0      1      0    11
## 
## Overall Statistics
##                                          
##                Accuracy : 0.8186         
##                  95% CI : (0.762, 0.8666)
##     No Information Rate : 0.2301         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7804         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: BERHI Class: DEGLET Class: DOKOL Class: IRAQI
## Sensitivity               0.56250       0.70833       0.9038      0.66667
## Specificity               0.96190       0.94554       0.9655      0.98558
## Pos Pred Value            0.52941       0.60714       0.8868      0.80000
## Neg Pred Value            0.96651       0.96465       0.9711      0.97156
## Prevalence                0.07080       0.10619       0.2301      0.07965
## Detection Rate            0.03982       0.07522       0.2080      0.05310
## Detection Prevalence      0.07522       0.12389       0.2345      0.06637
## Balanced Accuracy         0.76220       0.82694       0.9347      0.82612
##                      Class: ROTANA Class: SAFAVI Class: SOGAY
## Sensitivity                 0.9512        0.9804      0.45833
## Specificity                 0.9514        0.9943      0.98515
## Pos Pred Value              0.8125        0.9804      0.78571
## Neg Pred Value              0.9888        0.9943      0.93868
## Prevalence                  0.1814        0.2257      0.10619
## Detection Rate              0.1726        0.2212      0.04867
## Detection Prevalence        0.2124        0.2257      0.06195
## Balanced Accuracy           0.9513        0.9873      0.72174
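
The headline numbers in the output above follow directly from the confusion matrix. As a sketch in plain Python (function names are hypothetical), accuracy is the diagonal sum over the total, sensitivity divides a diagonal cell by its column (reference) total, and positive predictive value divides it by its row (prediction) total.

```python
CLASSES = ["BERHI", "DEGLET", "DOKOL", "IRAQI", "ROTANA", "SAFAVI", "SOGAY"]

# Decision-tree confusion matrix from the output above.
# Rows are predictions, columns are reference (true) classes.
MATRIX = [
    [9, 0, 0, 4, 0, 0, 4],
    [0, 17, 4, 0, 1, 0, 6],
    [0, 6, 47, 0, 0, 0, 0],
    [3, 0, 0, 12, 0, 0, 0],
    [4, 0, 0, 1, 39, 1, 3],
    [0, 0, 0, 1, 0, 50, 0],
    [0, 1, 1, 0, 1, 0, 11],
]


def accuracy(matrix):
    """Share of test rows on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def sensitivity(matrix, k):
    """True positives for class k divided by its reference (column) total."""
    column_total = sum(row[k] for row in matrix)
    return matrix[k][k] / column_total


def pos_pred_value(matrix, k):
    """True positives for class k divided by its prediction (row) total."""
    return matrix[k][k] / sum(matrix[k])


berhi = CLASSES.index("BERHI")
print(round(accuracy(MATRIX), 4))              # 0.8186
print(round(sensitivity(MATRIX, berhi), 4))    # 0.5625
print(round(pos_pred_value(MATRIX, berhi), 4)) # 0.5294
```

The three printed values reproduce Accuracy, and the BERHI Sensitivity and Pos Pred Value, from the output above.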

SVM: 87%

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction BERHI DEGLET DOKOL IRAQI ROTANA SAFAVI SOGAY
##     BERHI     13      0     0     1      0      0     0
##     DEGLET     0     14     0     0      1      1     3
##     DOKOL      0      9    51     0      0      0     0
##     IRAQI      1      0     0    17      0      0     0
##     ROTANA     2      0     0     0     36      0     4
##     SAFAVI     0      0     1     0      0     50     2
##     SOGAY      0      1     0     0      4      0    15
## 
## Overall Statistics
##                                          
##                Accuracy : 0.8673         
##                  95% CI : (0.816, 0.9086)
##     No Information Rate : 0.2301         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.8388         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: BERHI Class: DEGLET Class: DOKOL Class: IRAQI
## Sensitivity               0.81250       0.58333       0.9808      0.94444
## Specificity               0.99524       0.97525       0.9483      0.99519
## Pos Pred Value            0.92857       0.73684       0.8500      0.94444
## Neg Pred Value            0.98585       0.95169       0.9940      0.99519
## Prevalence                0.07080       0.10619       0.2301      0.07965
## Detection Rate            0.05752       0.06195       0.2257      0.07522
## Detection Prevalence      0.06195       0.08407       0.2655      0.07965
## Balanced Accuracy         0.90387       0.77929       0.9645      0.96982
##                      Class: ROTANA Class: SAFAVI Class: SOGAY
## Sensitivity                 0.8780        0.9804      0.62500
## Specificity                 0.9676        0.9829      0.97525
## Pos Pred Value              0.8571        0.9434      0.75000
## Neg Pred Value              0.9728        0.9942      0.95631
## Prevalence                  0.1814        0.2257      0.10619
## Detection Rate              0.1593        0.2212      0.06637
## Detection Prevalence        0.1858        0.2345      0.08850
## Balanced Accuracy           0.9228        0.9816      0.80012
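
The Kappa value in the output above corrects accuracy for agreement expected by chance. A short plain-Python sketch (the function name is hypothetical) computes Cohen's kappa as (p_o - p_e) / (1 - p_e), where p_o is the observed accuracy and p_e is the sum over classes of row share times column share.

```python
# SVM confusion matrix from the output above.
# Rows are predictions, columns are reference (true) classes, in the order
# BERHI, DEGLET, DOKOL, IRAQI, ROTANA, SAFAVI, SOGAY.
SVM_MATRIX = [
    [13, 0, 0, 1, 0, 0, 0],
    [0, 14, 0, 0, 1, 1, 3],
    [0, 9, 51, 0, 0, 0, 0],
    [1, 0, 0, 17, 0, 0, 0],
    [2, 0, 0, 0, 36, 0, 4],
    [0, 0, 1, 0, 0, 50, 2],
    [0, 1, 0, 0, 4, 0, 15],
]


def cohen_kappa(matrix):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(size)) for j in range(size)]

    observed = sum(matrix[i][i] for i in range(size)) / total
    # Chance agreement: product of marginal shares, summed over classes.
    expected = sum(row_totals[k] * col_totals[k] for k in range(size)) / total**2
    return (observed - expected) / (1 - expected)


kappa = cohen_kappa(SVM_MATRIX)
print(round(kappa, 4))  # 0.8388
```

Because the classes are unbalanced (DOKOL and SAFAVI together make up almost half of the test set), kappa is a useful companion to raw accuracy.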

Naive Bayes: 85%

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction BERHI DEGLET DOKOL IRAQI ROTANA SAFAVI SOGAY
##     BERHI     13      0     0     2      0      0     1
##     DEGLET     0     12     2     0      0      0     7
##     DOKOL      0      7    48     0      0      0     0
##     IRAQI      2      0     0    16      0      0     0
##     ROTANA     1      0     0     0     38      0     1
##     SAFAVI     0      1     2     0      0     51     0
##     SOGAY      0      4     0     0      3      0    15
## 
## Overall Statistics
##                                           
##                Accuracy : 0.854           
##                  95% CI : (0.8011, 0.8973)
##     No Information Rate : 0.2301          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8233          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: BERHI Class: DEGLET Class: DOKOL Class: IRAQI
## Sensitivity               0.81250       0.50000       0.9231      0.88889
## Specificity               0.98571       0.95545       0.9598      0.99038
## Pos Pred Value            0.81250       0.57143       0.8727      0.88889
## Neg Pred Value            0.98571       0.94146       0.9766      0.99038
## Prevalence                0.07080       0.10619       0.2301      0.07965
## Detection Rate            0.05752       0.05310       0.2124      0.07080
## Detection Prevalence      0.07080       0.09292       0.2434      0.07965
## Balanced Accuracy         0.89911       0.72772       0.9414      0.93964
##                      Class: ROTANA Class: SAFAVI Class: SOGAY
## Sensitivity                 0.9268        1.0000      0.62500
## Specificity                 0.9892        0.9829      0.96535
## Pos Pred Value              0.9500        0.9444      0.68182
## Neg Pred Value              0.9839        1.0000      0.95588
## Prevalence                  0.1814        0.2257      0.10619
## Detection Rate              0.1681        0.2257      0.06637
## Detection Prevalence        0.1770        0.2389      0.09735
## Balanced Accuracy           0.9580        0.9914      0.79517
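
Two more entries in the output above can be recomputed by hand: the No Information Rate, which is the share of the largest reference class, and the per-class Balanced Accuracy, which is the mean of sensitivity and specificity. The plain-Python sketch below (function names are hypothetical) does this for the Naive Bayes matrix, using DEGLET, the weakest class, as the example.

```python
CLASS_NAMES = ["BERHI", "DEGLET", "DOKOL", "IRAQI", "ROTANA", "SAFAVI", "SOGAY"]

# Naive Bayes confusion matrix from the output above.
# Rows are predictions, columns are reference (true) classes.
NB_MATRIX = [
    [13, 0, 0, 2, 0, 0, 1],
    [0, 12, 2, 0, 0, 0, 7],
    [0, 7, 48, 0, 0, 0, 0],
    [2, 0, 0, 16, 0, 0, 0],
    [1, 0, 0, 0, 38, 0, 1],
    [0, 1, 2, 0, 0, 51, 0],
    [0, 4, 0, 0, 3, 0, 15],
]


def no_information_rate(matrix):
    """Accuracy of always predicting the most frequent reference class."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    col_totals = [sum(matrix[i][j] for i in range(size)) for j in range(size)]
    return max(col_totals) / total


def balanced_accuracy(matrix, k):
    """Mean of sensitivity and specificity for class k (one-vs-rest)."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fp = sum(matrix[k]) - tp                     # predicted k, actually another class
    fn = sum(row[k] for row in matrix) - tp      # actually k, predicted another class
    tn = total - tp - fp - fn
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return (sens + spec) / 2


deglet = CLASS_NAMES.index("DEGLET")
nir = no_information_rate(NB_MATRIX)
deglet_balanced = balanced_accuracy(NB_MATRIX, deglet)
print(round(nir, 4), round(deglet_balanced, 5))  # 0.2301 0.72772
```

DEGLET's low balanced accuracy comes from its sensitivity of 0.5: half of the true DEGLET samples are predicted as DOKOL, SOGAY, or SAFAVI.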

Conclusion

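Among the three classifiers, SVM has the highest test accuracy in the output above (0.867), narrowly ahead of Naive Bayes (0.854), with the decision tree at 0.819. The 95% confidence intervals overlap, so the gap between SVM and Naive Bayes is small. Beyond these three classifiers, other methods are worth trying, such as a CNN built with Keras/TensorFlow, gradient boosting, and XGBoost.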