data visualization & data mining

using the ggplot2 & caret R packages

Author

kirit ved

Published

December 14, 2023

ggplot2

setting R environment

Loading required package: pacman
 [1] "NeuralNetTools" "nnet"           "kernlab"        "caret"         
 [5] "lattice"        "kbv"            "janitor"        "lubridate"     
 [9] "forcats"        "stringr"        "dplyr"          "purrr"         
[13] "readr"          "tidyr"          "tibble"         "ggplot2"       
[17] "tidyverse"      "pacman"        
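
The setup code itself is not shown above, only its output. A minimal sketch of how such an environment could be prepared, assuming pacman is used to load (and, if needed, install) the packages; the exact package list is an assumption taken from the output, and `kbv` is left out because it does not appear to be a public package:

```r
# Load pacman, installing it first if it is missing (assumed workflow).
if (!require("pacman")) install.packages("pacman")

# p_load() attaches each package and installs any that are missing.
pacman::p_load(tidyverse, janitor, caret)

# List the packages currently attached to the session.
loaded <- (.packages())
print(loaded)
```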

loading iris dataset & viewing it

Code
d <- iris |> janitor::clean_names()
head(d)
  sepal_length sepal_width petal_length petal_width species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
Code
tail(d)
    sepal_length sepal_width petal_length petal_width   species
145          6.7         3.3          5.7         2.5 virginica
146          6.7         3.0          5.2         2.3 virginica
147          6.3         2.5          5.0         1.9 virginica
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica
Code
str(d)
'data.frame':   150 obs. of  5 variables:
 $ sepal_length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ sepal_width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ petal_length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ petal_width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Code
summary(d)
  sepal_length    sepal_width     petal_length    petal_width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

viewing histogram
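
The plot and its code are not reproduced here. A minimal sketch of a histogram with ggplot2, assuming `sepal_length` is the variable of interest and that the bars are coloured by species; both choices, and the bin count, are illustrative assumptions:

```r
library(ggplot2)

d <- janitor::clean_names(iris)

# Overlaid histograms of sepal length, one colour per species.
p_hist <- ggplot(d, aes(x = sepal_length, fill = species)) +
  geom_histogram(bins = 20, alpha = 0.6, position = "identity") +
  labs(x = "sepal length (cm)", y = "count")
print(p_hist)
```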

viewing density plot
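
The plot and its code are not reproduced here. A minimal sketch of a density plot, again assuming `sepal_length` split by species (an illustrative choice, not necessarily the author's):

```r
library(ggplot2)

d <- janitor::clean_names(iris)

# Kernel density estimate of sepal length for each species.
p_dens <- ggplot(d, aes(x = sepal_length, fill = species)) +
  geom_density(alpha = 0.5) +
  labs(x = "sepal length (cm)", y = "density")
print(p_dens)
```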

viewing scatter plots
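
The plot and its code are not reproduced here. A minimal sketch of a scatter plot, assuming petal length against petal width coloured by species; the variable pair is an assumption chosen for illustration:

```r
library(ggplot2)

d <- janitor::clean_names(iris)

# One point per flower; colour separates the three species.
p_scatter <- ggplot(d, aes(x = petal_length, y = petal_width, colour = species)) +
  geom_point(size = 2, alpha = 0.8) +
  labs(x = "petal length (cm)", y = "petal width (cm)")
print(p_scatter)
```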

viewing box plots
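
The plot and its code are not reproduced here. A minimal sketch of box plots by species, assuming `sepal_length` as the measured variable (an illustrative choice):

```r
library(ggplot2)

d <- janitor::clean_names(iris)

# One box per species showing median, quartiles and outliers.
p_box <- ggplot(d, aes(x = species, y = sepal_length, fill = species)) +
  geom_boxplot() +
  labs(y = "sepal length (cm)")
print(p_box)
```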

viewing bar plots

# A tibble: 3 × 4
  species      cnt     m     s
  <fct>      <dbl> <dbl> <dbl>
1 setosa        50  5.01  0.35
2 versicolor    50  5.94  0.52
3 virginica     50  6.59  0.64
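
The values in the table above match the per-species count, mean and standard deviation of `sepal_length`, rounded to two decimals. A sketch of how such a table and a bar plot with error bars could be produced; the original code is not shown, so the dplyr calls and the plot styling are assumptions (note that `n()` returns an integer column, whereas the table above shows `cnt` as a double):

```r
library(dplyr)
library(ggplot2)

d <- janitor::clean_names(iris)

# Per-species count, mean and standard deviation of sepal length.
stats <- d |>
  group_by(species) |>
  summarise(
    cnt = n(),
    m   = round(mean(sepal_length), 2),
    s   = round(sd(sepal_length), 2)
  )
print(stats)

# Bars show the mean; error bars span one standard deviation either side.
p_bar <- ggplot(stats, aes(x = species, y = m)) +
  geom_col(fill = "steelblue") +
  geom_errorbar(aes(ymin = m - s, ymax = m + s), width = 0.2) +
  labs(y = "mean sepal length (cm)")
print(p_bar)
```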

caret R package for data mining

create data partition for training & testing

[1] 100
[1] 50
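
The two numbers above are the row counts of the training and test sets. The call that produced the split is not shown; a sketch using `caret::createDataPartition()` is given below. The seed and `p = 2/3` are assumptions, and because this function samples within each species, the resulting counts will be close to, but not necessarily exactly, 100 and 50:

```r
library(caret)

set.seed(123)  # hypothetical seed; the source does not show one
d <- janitor::clean_names(iris)

# Row indices for the training set, sampled within each species.
idx <- createDataPartition(d$species, p = 2/3, list = FALSE)
train_d <- d[idx, ]
test_d  <- d[-idx, ]

nrow(train_d)
nrow(test_d)
```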

random forest using caret

Random Forest 

100 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 89, 91, 91, 89, 90, 90, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
  2     0.9488889  0.9231796
  3     0.9488889  0.9231796
  4     0.9488889  0.9231796

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         16          0         0
  versicolor      0         18         1
  virginica       0          1        14

Overall Statistics
                                          
               Accuracy : 0.96            
                 95% CI : (0.8629, 0.9951)
    No Information Rate : 0.38            
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9397          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                   1.00            0.9474           0.9333
Specificity                   1.00            0.9677           0.9714
Pos Pred Value                1.00            0.9474           0.9333
Neg Pred Value                1.00            0.9677           0.9714
Prevalence                    0.32            0.3800           0.3000
Detection Rate                0.32            0.3600           0.2800
Detection Prevalence          0.32            0.3800           0.3000
Balanced Accuracy             1.00            0.9576           0.9524
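
Output of this kind can be produced with `caret::train()` and `caret::confusionMatrix()`. A minimal sketch, assuming 10-fold cross-validation as reported above and caret's `"rf"` method (which needs the randomForest package installed); the seed and split are assumptions, so the numbers will not match the output exactly:

```r
library(caret)

set.seed(123)  # hypothetical seed
d <- janitor::clean_names(iris)
idx <- createDataPartition(d$species, p = 2/3, list = FALSE)
train_d <- d[idx, ]
test_d  <- d[-idx, ]

# 10-fold cross-validation, as in the resampling summary.
ctrl <- trainControl(method = "cv", number = 10)

# Random forest; caret tunes mtry over its default grid.
rf_fit <- train(species ~ ., data = train_d, method = "rf", trControl = ctrl)
print(rf_fit)

# Evaluate on the held-out rows.
rf_cm <- confusionMatrix(predict(rf_fit, newdata = test_d), test_d$species)
print(rf_cm)
```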

decision tree model

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         16          0         0
  versicolor      0         18         1
  virginica       0          1        14

Overall Statistics
                                          
               Accuracy : 0.96            
                 95% CI : (0.8629, 0.9951)
    No Information Rate : 0.38            
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9397          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                   1.00            0.9474           0.9333
Specificity                   1.00            0.9677           0.9714
Pos Pred Value                1.00            0.9474           0.9333
Neg Pred Value                1.00            0.9677           0.9714
Prevalence                    0.32            0.3800           0.3000
Detection Rate                0.32            0.3600           0.2800
Detection Prevalence          0.32            0.3800           0.3000
Balanced Accuracy             1.00            0.9576           0.9524
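
The model code is not shown. A sketch of a decision tree fitted through caret, assuming the `"rpart"` method (the source does not say which tree implementation was used) and the same hypothetical seed and split as elsewhere:

```r
library(caret)

set.seed(123)  # hypothetical seed
d <- janitor::clean_names(iris)
idx <- createDataPartition(d$species, p = 2/3, list = FALSE)
train_d <- d[idx, ]
test_d  <- d[-idx, ]

ctrl <- trainControl(method = "cv", number = 10)

# CART decision tree; caret tunes the complexity parameter cp.
tree_fit <- train(species ~ ., data = train_d, method = "rpart", trControl = ctrl)

tree_cm <- confusionMatrix(predict(tree_fit, newdata = test_d), test_d$species)
print(tree_cm)
```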

svm model

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         16          0         0
  versicolor      0         18         1
  virginica       0          1        14

Overall Statistics
                                          
               Accuracy : 0.96            
                 95% CI : (0.8629, 0.9951)
    No Information Rate : 0.38            
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9397          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                   1.00            0.9474           0.9333
Specificity                   1.00            0.9677           0.9714
Pos Pred Value                1.00            0.9474           0.9333
Neg Pred Value                1.00            0.9677           0.9714
Prevalence                    0.32            0.3800           0.3000
Detection Rate                0.32            0.3600           0.2800
Detection Prevalence          0.32            0.3800           0.3000
Balanced Accuracy             1.00            0.9576           0.9524
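
The model code is not shown. A sketch of a support vector machine fitted through caret, assuming a radial kernel via the `"svmRadial"` method from kernlab (kernlab appears in the loaded packages, but the kernel is an assumption):

```r
library(caret)

set.seed(123)  # hypothetical seed
d <- janitor::clean_names(iris)
idx <- createDataPartition(d$species, p = 2/3, list = FALSE)
train_d <- d[idx, ]
test_d  <- d[-idx, ]

ctrl <- trainControl(method = "cv", number = 10)

# Radial-kernel SVM; caret tunes the cost parameter C.
svm_fit <- train(species ~ ., data = train_d, method = "svmRadial", trControl = ctrl)

svm_cm <- confusionMatrix(predict(svm_fit, newdata = test_d), test_d$species)
print(svm_cm)
```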

neural network model

Neural Network 

100 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 91, 90, 90, 91, 90, 91, ... 
Resampling results across tuning parameters:

  size  decay  Accuracy   Kappa    
  1     0e+00  0.8221212  0.7265152
  1     1e-04  0.9705556  0.9556818
  1     1e-01  0.9538889  0.9306818
  3     0e+00  0.8850000  0.8247294
  3     1e-04  0.9483333  0.9223485
  3     1e-01  0.9705556  0.9556818
  5     0e+00  0.8747980  0.8103781
  5     1e-04  0.9503535  0.9254349
  5     1e-01  0.9705556  0.9556818

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were size = 1 and decay = 1e-04.

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         16          0         0
  versicolor      0         19         0
  virginica       0          0        15

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.9289, 1)
    No Information Rate : 0.38       
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                   1.00              1.00              1.0
Specificity                   1.00              1.00              1.0
Pos Pred Value                1.00              1.00              1.0
Neg Pred Value                1.00              1.00              1.0
Prevalence                    0.32              0.38              0.3
Detection Rate                0.32              0.38              0.3
Detection Prevalence          0.32              0.38              0.3
Balanced Accuracy             1.00              1.00              1.0
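
The tuning grid above (size 1, 3, 5 and decay 0, 1e-04, 1e-01) is what caret's `"nnet"` method tries by default. A sketch of such a fit, assuming default tuning and a hypothetical seed; `trace = FALSE` is passed through to `nnet()` only to silence the optimiser log:

```r
library(caret)

set.seed(123)  # hypothetical seed
d <- janitor::clean_names(iris)
idx <- createDataPartition(d$species, p = 2/3, list = FALSE)
train_d <- d[idx, ]
test_d  <- d[-idx, ]

ctrl <- trainControl(method = "cv", number = 10)

# Single-hidden-layer network; size and decay are tuned by cross-validation.
nn_fit <- train(species ~ ., data = train_d, method = "nnet",
                trControl = ctrl, trace = FALSE)
print(nn_fit)

nn_cm <- confusionMatrix(predict(nn_fit, newdata = test_d), test_d$species)
print(nn_cm)
```

Results for small networks can vary noticeably between runs, as the spread of accuracies in the tuning table suggests.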

knn model

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         16          0         0
  versicolor      0         19         0
  virginica       0          0        15

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.9289, 1)
    No Information Rate : 0.38       
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                   1.00              1.00              1.0
Specificity                   1.00              1.00              1.0
Pos Pred Value                1.00              1.00              1.0
Neg Pred Value                1.00              1.00              1.0
Prevalence                    0.32              0.38              0.3
Detection Rate                0.32              0.38              0.3
Detection Prevalence          0.32              0.38              0.3
Balanced Accuracy             1.00              1.00              1.0
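
The model code is not shown. A sketch of a k-nearest-neighbours classifier fitted through caret's `"knn"` method, with the seed and split again being assumptions:

```r
library(caret)

set.seed(123)  # hypothetical seed
d <- janitor::clean_names(iris)
idx <- createDataPartition(d$species, p = 2/3, list = FALSE)
train_d <- d[idx, ]
test_d  <- d[-idx, ]

ctrl <- trainControl(method = "cv", number = 10)

# k-nearest neighbours; caret tunes the number of neighbours k.
knn_fit <- train(species ~ ., data = train_d, method = "knn", trControl = ctrl)

knn_cm <- confusionMatrix(predict(knn_fit, newdata = test_d), test_d$species)
print(knn_cm)
```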

variable selection/importance

Boruta performed 9 iterations in 0.254668 secs.
 4 attributes confirmed important: petal_length, petal_width,
sepal_length, sepal_width;
 No attributes deemed unimportant.
'data.frame':   4 obs. of  6 variables:
 $ mean_imp  : num  15.8 10.9 31.3 31.7
 $ median_imp: num  15.8 11 31.2 31.4
 $ min_imp   : num  15.03 9.64 29.88 29.44
 $ max_imp   : num  16.7 11.5 32.9 33.8
 $ norm_hits : num  1 1 1 1
 $ decision  : Factor w/ 3 levels "Tentative","Confirmed",..: 2 2 2 2
             mean_imp median_imp  min_imp  max_imp norm_hits  decision
petal_width  31.66109   31.37492 29.43654 33.82919         1 Confirmed
petal_length 31.32477   31.21240 29.87999 32.86677         1 Confirmed
sepal_length 15.79728   15.80730 15.02592 16.72509         1 Confirmed
sepal_width  10.86794   11.00223  9.64369 11.52468         1 Confirmed
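
The output above comes from the Boruta package. A sketch of how it could be reproduced, assuming the full dataset is used, that `janitor::clean_names()` renames the `attStats()` columns to snake case, and that rows are sorted by mean importance; the seed is hypothetical and iteration counts and timings will differ:

```r
library(Boruta)

set.seed(123)  # hypothetical seed
d <- janitor::clean_names(iris)

# Boruta compares each predictor against shuffled "shadow" copies.
b <- Boruta(species ~ ., data = d)
print(b)

# Importance statistics per attribute, with snake_case column names.
imp <- janitor::clean_names(attStats(b))
str(imp)

# Sort from most to least important by mean importance.
imp_sorted <- imp[order(-imp$mean_imp), ]
print(imp_sorted)
```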