Developing ML solutions on an aarch64 Raspberry Pi CM4

The original RPubs was posted 6 years ago and, at the time, was developed on R 3.2.2 running on a quad-core 64-bit Windows 8 laptop. Today this slightly refreshed and amended code runs on a 64-bit quad-core Raspberry Pi CM4. This is now possible with the open-source RStudio IDE and R 4.2.0, natively built and running smoothly at 2.00 GHz on the Ubuntu MATE 22.04 LTS aarch64 desktop!

It is even faster and, impressively, runs side by side with a Python 3.10 Jupyter notebook!

Actual Screenshot

Classification Challenges are Everywhere…

If you develop pharmaceutical, cosmetic, food, industrial or civil-engineered products, you are often confronted with the challenge of sorting and classifying to meet process or performance properties. Traditional Research and Development approaches the problem through experimentation, but this generally involves experimental designs under time and resource constraints, and can be slow, expensive and often redundant, quickly forgotten or even obsolete.
Consider the alternative that Machine Learning tools offer today. We will show that this approach is quick and efficient and, in our view, ultimately the way the Front End of Innovation should proceed. We will also show how it is particularly suited to classification, an essential step in reducing complexity, optimizing product segmentation, supporting Lean Innovation and establishing robust supply networks.

Today, we will explain how Machine Learning can shed new light on this generic and very persistent classification and clustering challenge. Using modern algorithms, we will derive simple (we prefer fewer rules) and accurate (ideally perfect) classifications on a complete dataset.

If you have not read about the other important aspect, formulation optimization, please consult Machine Learning, Key to Your Formulation Challenges.

Step 1: Retrieve Existing Data

We will mirror the approach used in the formulation challenge and use another dataset hosted on the UCI Machine Learning Repository to classify the edibility of… mushrooms, based on attributes described in The Audubon Society Field Guide to North American Mushrooms (1981). The challenge we tackle today is to correctly classify a go/no-go attribute, a task scientists, engineers and business professionals must address daily. Any established R&D organization would certainly have similar, and sometimes hidden, knowledge in its archives…

Again, we will use R to quickly demonstrate the approach on this dataset and its full description. As a general practice, we continue to maintain the reproducibility of the analysis: the analysis tool and platform are documented, all libraries are clearly listed, and the data is retrieved programmatically from the repository and date-stamped.
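
The date-stamped, programmatic retrieval described above can be sketched as a small caching helper. This is only an illustration: fetch_once, its fetch argument and the ".downloaded" stamp file are hypothetical names that do not appear in the original analysis.

```r
# Hypothetical helper (not part of the original analysis): download a file
# only if it is not already cached, and record when it was retrieved.
fetch_once <- function(url, dest, fetch = utils::download.file) {
  stamp_file <- paste0(dest, ".downloaded")
  if (!file.exists(dest)) {
    dir.create(dirname(dest), showWarnings = FALSE, recursive = TRUE)
    fetch(url, destfile = dest)
    # Store the retrieval time next to the data so it survives the session
    writeLines(format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"), stamp_file)
  }
  # Return the recorded date stamp (NA if the stamp file is missing)
  if (file.exists(stamp_file)) readLines(stamp_file, n = 1) else NA_character_
}
```

With this helper, a call such as dateDownloaded <- fetch_once(fileUrl, "./data/Mushrooms_Data.csv") would hit the repository only on the first run and return the same date stamp afterwards.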

We will display the structure of the mushrooms dataset and build the corresponding dictionary to translate the property factors.

Sys.info()[1:5]
##                                              sysname 
##                                              "Linux" 
##                                              release 
##                                  "5.15.0-1006-raspi" 
##                                              version 
## "#6-Ubuntu SMP PREEMPT Mon Apr 25 12:50:48 UTC 2022" 
##                                             nodename 
##                                          "cruncher2" 
##                                              machine 
##                                            "aarch64"
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: aarch64-unknown-linux-gnu (64-bit)
## Running under: Ubuntu 22.04 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib/aarch64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.29   R6_2.5.1        jsonlite_1.8.0  magrittr_2.0.3 
##  [5] evaluate_0.15   stringi_1.7.6   rlang_1.0.2     cli_3.3.0      
##  [9] rstudioapi_0.13 jquerylib_0.1.4 bslib_0.3.1     rmarkdown_2.14 
## [13] tools_4.2.0     stringr_1.4.0   xfun_0.30       yaml_2.3.5     
## [17] fastmap_1.1.0   compiler_4.2.0  htmltools_0.5.2 knitr_1.39     
## [21] sass_0.4.1
library(stringr)
library(RWeka)
library(C50)
library(rpart)
library(rattle)
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
userdir <- getwd()
datadir <- "./data"
if (!file.exists("data")){dir.create("data")}
fileUrl <- "http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data?accessType=DOWNLOAD"
download.file(fileUrl,destfile="./data/Mushrooms_Data.csv")
dateDownloaded <- date()
mushrooms <- read.csv("./data/Mushrooms_Data.csv",header=FALSE,stringsAsFactors=TRUE)
fileUrl <- "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names?accessType=DOWNLOAD"
download.file(fileUrl,destfile="./data/Names.txt")
txt <- readLines("./data/Names.txt")
lns <- data.frame(beg=which(grepl("P_1) odor=",txt)),end=which(grepl("on the whole dataset.",txt)))
# we now capture all lines of text between beg and end from txt
res <- lapply(seq_along(lns$beg),function(l){paste(txt[seq(from=lns$beg[l],to=lns$end[l],by=1)],collapse=" ")})
res <- gsub("\t", "", res, fixed = TRUE)
res <- gsub("( {2,})"," ",res, fixed=FALSE)
res <- gsub("P_","\n",res,fixed=TRUE)
writeLines(res,"./data/parsed_res.csv")
res <- readLines("./data/parsed_res.csv")
res<-res[-1]
lns <- data.frame(beg=which(grepl("7. Attribute Information:",txt)),end=which(grepl("urban=u,waste=w,woods=d",txt)))
txt <- lapply(seq_along(lns$beg),function(l){paste(txt[seq(from=lns$beg[l],to=lns$end[l],by=1)],collapse=" ")})
txt <- gsub(" ", "", txt, fixed = TRUE)
txt <- gsub("(\\d+\\.)","\\\n",txt, fixed=FALSE)
txt <- gsub("\nAttributeInformation:\\(","",txt,fixed=FALSE)
txt <- gsub("\\)","",txt,fixed=FALSE)
txt <- gsub(":",",",txt,fixed=TRUE)
txt <- gsub("?","",txt,fixed=TRUE)
txt <- gsub("-","_",txt,fixed=TRUE)
writeLines(txt,"./data/parsed.csv")
attrib <- readLines("./data/parsed.csv")
attrib <- sapply (1:length(attrib),function(i) {gsub(","," ",attrib[i],fixed=TRUE)})
dictionary <- sapply (1:length(attrib),function(i) {strsplit(attrib[i],' ')})
colnames(mushrooms)<-sapply(1:length(attrib),function(i) {colnames(mushrooms)[i]<-dictionary[[i]][1]})
dictionary<-sapply (1:length(attrib),function(i) {dictionary[[i]][-1]}) # contains the levels strings
dictionary<-sapply(1:length(attrib),function(i){sapply(1:lengths(dictionary[i]),function(j){p1<-strsplit(dictionary[[i]][j],"=")[[1]][1];p2<-strsplit(dictionary[[i]][j],"=")[[1]][2];dictionary[[i]][j]<-paste0(p2,',',p1)})})
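
For readers who find the dictionary construction above dense, here is a minimal sketch of the same idea on a single attribute line: split off the attribute name, then turn the name=code pairs into a lookup indexed by code. The parse_attribute helper is hypothetical; the example line uses the odor codes as listed in the dataset documentation, in the shape the clean-up steps above should produce.

```r
# One attribute line after the clean-up steps above: attribute name first,
# then name=code pairs (odor codes as listed in the dataset documentation).
line <- "odor,almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s"

# Hypothetical helper: return the attribute name and a code -> label lookup
parse_attribute <- function(line) {
  parts  <- strsplit(line, ",", fixed = TRUE)[[1]]
  pairs  <- strsplit(parts[-1], "=", fixed = TRUE)
  labels <- vapply(pairs, `[`, character(1), 1)
  codes  <- vapply(pairs, `[`, character(1), 2)
  list(name = parts[1], lookup = setNames(labels, codes))
}

odor <- parse_attribute(line)
odor$name            # "odor"
odor$lookup[["l"]]   # "anise"
```

Keeping the lookup indexed by code means a label can always be retrieved from the code itself, whatever order the documentation lists the pairs in.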

Step 2: Clean the Data

We notice that the stalk_root property has a missing level, indicated with ‘?’. We can attempt two analyses. First, we keep the missing data as coded and proceed with the classification models. Second, we re-code ‘?’ as missing with the NA value, omit all incomplete cases, and drop the corresponding level, giving a new dataset, mushrooms_complete.

mushrooms_complete<-mushrooms
mushrooms_complete$stalk_root[mushrooms_complete$stalk_root=='?']<-NA
mushrooms_complete<-mushrooms_complete[complete.cases(mushrooms_complete),]
mushrooms_complete$stalk_root<-droplevels(mushrooms_complete$stalk_root)
str(mushrooms_complete)
## 'data.frame':    5644 obs. of  23 variables:
##  $ classes                 : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
##  $ cap_shape               : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
##  $ cap_surface             : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
##  $ cap_color               : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
##  $ bruises                 : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ...
##  $ odor                    : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
##  $ gill_attachment         : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
##  $ gill_spacing            : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
##  $ gill_size               : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
##  $ gill_color              : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
##  $ stalk_shape             : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
##  $ stalk_root              : Factor w/ 4 levels "b","c","e","r": 3 2 2 3 3 2 2 2 3 2 ...
##  $ stalk_surface_above_ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ stalk_surface_below_ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ stalk_color_above_ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ stalk_color_below_ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ veil_type               : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
##  $ veil_color              : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ ring_number             : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ring_type               : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
##  $ spore_print_color       : Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ...
##  $ population              : Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4 ...
##  $ habitat                 : Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4 ...
knitr::kable(table(mushrooms_complete$classes),caption="Cleaned Dataset")
Cleaned Dataset
Var1 Freq
e 3488
p 2156
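
As an alternative sketch of the same cleaning step, read.csv can treat ‘?’ as missing at import time through its na.strings argument, after which na.omit and droplevels do the rest. The toy rows and column names below are made up for illustration and are not taken from the dataset.

```r
# Toy rows in the same shape as the raw file: "?" marks a missing stalk_root.
raw <- "p,x,e\ne,b,?\ne,x,c\np,x,?\ne,b,b"
toy <- read.csv(text = raw, header = FALSE, stringsAsFactors = TRUE,
                na.strings = "?",
                col.names = c("classes", "cap_shape", "stalk_root"))

# Drop incomplete rows, then drop factor levels that are no longer used
toy_complete <- droplevels(na.omit(toy))
nrow(toy_complete)                # 3
levels(toy_complete$stalk_root)   # "b" "c" "e"
```

Either route should lead to the same complete-case dataset; the import-time option simply avoids ever creating a ‘?’ level.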

We can now reassign translated levels to both the original and the complete mushroom datasets.

m <- sapply (1:length(attrib),function(i){levels(mushrooms[[i]]) <- sapply(1:length (levels(mushrooms[[i]])),function(j){
  a<-strsplit(dictionary[[i]][[j]],",")[[1]][1]
  b<-strsplit(dictionary[[i]][[j]],",")[[1]][2]
  levels(mushrooms[[i]])[levels(mushrooms[[i]])==a] <- b } ) 
mushrooms[[i]]} )
m <- as.data.frame(m)
colnames(m) <- colnames(mushrooms)
mushrooms <- m
# convert chr to factors
mushrooms[sapply(mushrooms, is.character)] <- lapply(mushrooms[sapply(mushrooms, is.character)],as.factor)

m <- sapply (1:length(attrib),function(i){levels(mushrooms_complete[[i]]) <- sapply(1:length (levels(mushrooms_complete[[i]])),function(j){
  a<-strsplit(dictionary[[i]][[j]],",")[[1]][1]
  b<-strsplit(dictionary[[i]][[j]],",")[[1]][2]
  levels(mushrooms_complete[[i]])[levels(mushrooms_complete[[i]])==a] <- b } ) 
mushrooms_complete[[i]]} )
m <- as.data.frame(m)
colnames(m) <- colnames(mushrooms_complete)
mushrooms_complete <- m
# convert chr to factors
mushrooms_complete[sapply(mushrooms_complete, is.character)] <- lapply(mushrooms_complete[sapply(mushrooms_complete, is.character)],as.factor)
rm(m,lns,attrib,txt,dictionary)# cleanup
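
One point worth checking in the relabelling above: the translated names appear to be assigned by position, which is only safe when the dictionary lists labels in the same order as the factor's alphabetically sorted codes. For odor, the documentation order (almond=a, anise=l, creosote=c, ...) differs from the sorted codes (a, c, f, l, ...), and the documentation rules quoted in Step 8 name almond, anise and none as the non-poisonous odors. The following is a minimal sketch of a lookup by code instead; relabel and the toy factor are hypothetical and not part of the original analysis.

```r
# Toy factor using the single-letter odor codes from the dataset documentation
odor <- factor(c("a", "l", "n", "f", "c", "a"))

# code -> label lookup, in documentation order (not alphabetical by code)
lookup <- c(a = "almond", l = "anise", c = "creosote", y = "fishy", f = "foul",
            m = "musty", n = "none", p = "pungent", s = "spicy")

# Hypothetical helper: rename each level by looking up its own code
relabel <- function(f, lookup) {
  old <- levels(f)
  new <- unname(lookup[old])
  new[is.na(new)] <- old[is.na(new)]   # leave unknown codes untouched
  levels(f) <- new
  f
}

as.character(relabel(odor, lookup))
# "almond" "anise" "none" "foul" "creosote" "almond"

# For contrast, assigning the labels by position shifts them
positional <- odor
levels(positional) <- unname(lookup)[seq_along(levels(positional))]
as.character(positional)
# "almond" "fishy" "foul" "creosote" "anise" "almond"
```

If the two results differ on the real data, the label names printed in the rule outputs should be read with that shift in mind, while the counts and accuracies are unaffected.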

Since the veil_type feature has a single level common to every mushroom, we can exclude it from further analysis and examine the remaining 22 variables. We observe a fairly balanced classification set, with 4208 edible and 3916 poisonous mushrooms in the original set.

# now automatically discard factors with unique level
mushrooms_complete <- mushrooms_complete[, sapply(mushrooms_complete, function(col) length(unique(col))) > 1]
# notice we lost the veil_type because that factor had a unique level
mushrooms <- mushrooms[, sapply(mushrooms, function(col) length(unique(col))) > 1]

str(mushrooms)
## 'data.frame':    8124 obs. of  22 variables:
##  $ classes                 : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...
##  $ cap_shape               : Factor w/ 6 levels "bell","conical",..: 6 6 1 6 6 6 1 1 6 1 ...
##  $ cap_surface             : Factor w/ 4 levels "fibrous","grooves",..: 3 3 3 4 3 4 3 4 4 3 ...
##  $ cap_color               : Factor w/ 10 levels "brown","buff",..: 5 10 9 9 4 10 9 9 9 10 ...
##  $ bruises                 : Factor w/ 2 levels "bruises","no": 2 2 2 2 1 2 2 2 2 2 ...
##  $ odor                    : Factor w/ 9 levels "almond","anise",..: 7 1 4 7 6 1 1 4 7 1 ...
##  $ gill_attachment         : Factor w/ 2 levels "attached","descending": 2 2 2 2 2 2 2 2 2 2 ...
##  $ gill_spacing            : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ...
##  $ gill_size               : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
##  $ gill_color              : Factor w/ 12 levels "black","brown",..: 5 5 6 6 5 6 3 6 8 3 ...
##  $ stalk_shape             : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ...
##  $ stalk_root              : Factor w/ 5 levels "bulbous","club",..: 4 3 3 4 4 3 3 3 4 3 ...
##  $ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ stalk_color_above_ring  : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ stalk_color_below_ring  : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ veil_color              : Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ ring_number             : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ring_type               : Factor w/ 5 levels "cobwebby","evanescent",..: 5 5 5 5 1 5 5 5 5 5 ...
##  $ spore_print_color       : Factor w/ 9 levels "black","brown",..: 3 4 4 3 4 3 3 4 3 3 ...
##  $ population              : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
##  $ habitat                 : Factor w/ 7 levels "grasses","leaves",..: 6 2 4 6 2 2 4 4 2 4 ...
knitr::kable(table(mushrooms$classes),caption="Original Dataset")
Original Dataset
Var1 Freq
edible 4208
poisonous 3916
str(mushrooms_complete)
## 'data.frame':    5644 obs. of  22 variables:
##  $ classes                 : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...
##  $ cap_shape               : Factor w/ 6 levels "bell","conical",..: 6 6 1 6 6 6 1 1 6 1 ...
##  $ cap_surface             : Factor w/ 4 levels "fibrous","grooves",..: 3 3 3 4 3 4 3 4 4 3 ...
##  $ cap_color               : Factor w/ 8 levels "brown","buff",..: 5 8 7 7 4 8 7 7 7 8 ...
##  $ bruises                 : Factor w/ 2 levels "bruises","no": 2 2 2 2 1 2 2 2 2 2 ...
##  $ odor                    : Factor w/ 7 levels "almond","anise",..: 7 1 4 7 6 1 1 4 7 1 ...
##  $ gill_attachment         : Factor w/ 2 levels "attached","descending": 2 2 2 2 2 2 2 2 2 2 ...
##  $ gill_spacing            : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ...
##  $ gill_size               : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
##  $ gill_color              : Factor w/ 9 levels "buff","chocolate",..: 3 3 4 4 3 4 1 4 5 1 ...
##  $ stalk_shape             : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ...
##  $ stalk_root              : Factor w/ 4 levels "bulbous","club",..: 3 2 2 3 3 2 2 2 3 2 ...
##  $ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ stalk_color_above_ring  : Factor w/ 7 levels "brown","buff",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ stalk_color_below_ring  : Factor w/ 7 levels "brown","buff",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ veil_color              : Factor w/ 2 levels "white","yellow": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ring_number             : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ring_type               : Factor w/ 4 levels "cobwebby","flaring",..: 4 4 4 4 1 4 4 4 4 4 ...
##  $ spore_print_color       : Factor w/ 6 levels "brown","buff",..: 2 3 3 2 3 2 2 3 2 2 ...
##  $ population              : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
##  $ habitat                 : Factor w/ 6 levels "grasses","leaves",..: 6 2 4 6 2 2 4 4 2 4 ...
knitr::kable(table(mushrooms_complete$classes),caption="Complete Cases")
Complete Cases
Var1 Freq
edible 3488
poisonous 2156

However, the complete set is not only smaller but also somewhat more imbalanced after removal of the missing data, with 3488 edible and 2156 poisonous mushrooms.
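
To put a number on that imbalance, the class shares can be computed directly from the counts reported in the tables above. This is a small worked example, not output from the original analysis.

```r
# Class counts as reported in the two tables above
counts <- rbind(original = c(edible = 4208, poisonous = 3916),
                complete = c(edible = 3488, poisonous = 2156))

# Row-wise shares, in percent
shares <- round(100 * prop.table(counts, margin = 1), 1)
shares
#          edible poisonous
# original   51.8      48.2
# complete   61.8      38.2
```

The edible share rises from about 52% to about 62% once the incomplete cases are removed.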

We will now conduct two parallel analysis streams, one per dataset, to compare classification performance across several approaches and attempt a perfect classification.

Let’s start with OneR classification, from the RWeka package.

Step 3: Classifying Data with OneR

We classify the original and complete mushrooms datasets.

mushroom_1R <- OneR(classes ~ .,data = mushrooms)
mushroomc_1R <- OneR(classes ~ .,data = mushrooms_complete)
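
To make the idea behind OneR concrete, here is a simplified base R sketch: for each predictor, predict the majority class of each level, then keep the predictor whose rule makes the fewest training errors. This is an illustration under simplifying assumptions (categorical predictors only, no special tie-breaking), not the RWeka implementation; one_r and the toy data are hypothetical.

```r
# Simplified OneR: one rule per predictor, keep the one with fewest errors
one_r <- function(data, target) {
  predictors <- setdiff(names(data), target)
  errors <- vapply(predictors, function(p) {
    tab <- table(data[[p]], data[[target]])
    # cases that do not belong to the majority class of their level
    as.numeric(sum(tab) - sum(apply(tab, 1, max)))
  }, numeric(1))
  best <- names(errors)[which.min(errors)]
  tab  <- table(data[[best]], data[[target]])
  rule <- colnames(tab)[apply(tab, 1, which.max)]
  names(rule) <- rownames(tab)
  list(attribute = best, rule = rule, errors = errors[[best]])
}

# Toy data (hypothetical values, not the mushroom set)
toy <- data.frame(
  classes = c("edible", "edible", "poisonous", "poisonous", "edible", "poisonous"),
  odor    = c("none",   "none",   "foul",      "foul",      "none",   "none"),
  cap     = c("flat",   "bell",   "flat",      "bell",      "flat",   "bell")
)
fit <- one_r(toy, "classes")
fit$attribute   # "odor"
fit$rule        # foul -> "poisonous", none -> "edible"
fit$errors      # 1
```

On the toy data, odor misclassifies one case and cap misclassifies two, so odor is retained as the single rule.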

Step 4: Evaluating OneR Performance

mushroom_1R
## odor:
##  almond  -> edible
##  anise   -> poisonous
##  creosote    -> poisonous
##  fishy   -> edible
##  foul    -> poisonous
##  musty   -> edible
##  none    -> poisonous
##  pungent -> poisonous
##  spicy   -> poisonous
## (8004/8124 instances correct)
summary(mushroom_1R)
## 
## === Summary ===
## 
## Correctly Classified Instances        8004               98.5229 %
## Incorrectly Classified Instances       120                1.4771 %
## Kappa statistic                          0.9704
## Mean absolute error                      0.0148
## Root mean squared error                  0.1215
## Relative absolute error                  2.958  %
## Root relative squared error             24.323  %
## Total Number of Instances             8124     
## 
## === Confusion Matrix ===
## 
##     a    b   <-- classified as
##  4208    0 |    a = edible
##   120 3796 |    b = poisonous
mushroomc_1R
## odor:
##  almond  -> edible
##  anise   -> poisonous
##  creosote    -> poisonous
##  fishy   -> edible
##  foul    -> poisonous
##  musty   -> edible
##  none    -> poisonous
## (5556/5644 instances correct)
summary(mushroomc_1R)
## 
## === Summary ===
## 
## Correctly Classified Instances        5556               98.4408 %
## Incorrectly Classified Instances        88                1.5592 %
## Kappa statistic                          0.9667
## Mean absolute error                      0.0156
## Root mean squared error                  0.1249
## Relative absolute error                  3.3022 %
## Root relative squared error             25.6994 %
## Total Number of Instances             5644     
## 
## === Confusion Matrix ===
## 
##     a    b   <-- classified as
##  3488    0 |    a = edible
##    88 2068 |    b = poisonous

We observe that the OneR model classifies 98.52% of the original set correctly using odor as the only criterion, and 98.44% of the complete set. However, the confusion matrix reveals that 120 poisonous mushrooms were classified as edible in the original dataset, and 88 in the complete dataset.
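
The summary figures can be rebuilt from the confusion matrix alone, which also makes the cost of the errors explicit. The sketch below uses the standard definitions of accuracy and Cohen's kappa with the counts copied from the original-set output above; the variable names are ours.

```r
# Confusion matrix for OneR on the original set, copied from the summary above
# (rows = actual class, columns = predicted class)
cm <- matrix(c(4208,    0,
                120, 3796),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual    = c("edible", "poisonous"),
                             predicted = c("edible", "poisonous")))

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n
# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)
# Share of poisonous mushrooms wrongly labelled edible
missed <- cm["poisonous", "edible"] / sum(cm["poisonous", ])

round(100 * accuracy, 4)   # 98.5229
round(cohen_kappa, 4)      # 0.9704
round(100 * missed, 2)     # 3.06
```

All 120 errors fall in the same cell: about 3% of the poisonous mushrooms are labelled edible, which is the kind of error a go/no-go decision can least afford.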

Let’s try to improve on the OneR model, using JRip.

Step 5: Improving Model with JRip

mushroom_JRip <- JRip(classes ~ ., data = mushrooms)
mushroom_JRip
## JRIP rules:
## ===========
## 
## (odor = creosote) => classes=poisonous (2160.0/0.0)
## (gill_size = narrow) and (gill_color = black) => classes=poisonous (1152.0/0.0)
## (gill_size = narrow) and (odor = none) => classes=poisonous (256.0/0.0)
## (odor = anise) => classes=poisonous (192.0/0.0)
## (spore_print_color = orange) => classes=poisonous (72.0/0.0)
## (stalk_surface_below_ring = smooth) and (stalk_surface_above_ring = scaly) => classes=poisonous (68.0/0.0)
## (habitat = meadows) and (cap_color = white) => classes=poisonous (8.0/0.0)
## (stalk_color_above_ring = yellow) => classes=poisonous (8.0/0.0)
##  => classes=edible (4208.0/0.0)
## 
## Number of Rules : 9
summary(mushroom_JRip)
## 
## === Summary ===
## 
## Correctly Classified Instances        8124              100      %
## Incorrectly Classified Instances         0                0      %
## Kappa statistic                          1     
## Mean absolute error                      0     
## Root mean squared error                  0     
## Relative absolute error                  0      %
## Root relative squared error              0      %
## Total Number of Instances             8124     
## 
## === Confusion Matrix ===
## 
##     a    b   <-- classified as
##  4208    0 |    a = edible
##     0 3916 |    b = poisonous
mushroomc_JRip <- JRip(classes ~ ., data = mushrooms_complete)
mushroomc_JRip
## JRIP rules:
## ===========
## 
## (odor = creosote) => classes=poisonous (1584.0/0.0)
## (gill_size = narrow) and (odor = none) => classes=poisonous (256.0/0.0)
## (odor = anise) => classes=poisonous (192.0/0.0)
## (spore_print_color = orange) => classes=poisonous (72.0/0.0)
## (population = clustered) => classes=poisonous (52.0/0.0)
##  => classes=edible (3488.0/0.0)
## 
## Number of Rules : 6
summary(mushroomc_JRip)
## 
## === Summary ===
## 
## Correctly Classified Instances        5644              100      %
## Incorrectly Classified Instances         0                0      %
## Kappa statistic                          1     
## Mean absolute error                      0     
## Root mean squared error                  0     
## Relative absolute error                  0      %
## Root relative squared error              0      %
## Total Number of Instances             5644     
## 
## === Confusion Matrix ===
## 
##     a    b   <-- classified as
##  3488    0 |    a = edible
##     0 2156 |    b = poisonous

We observe that JRip, given all 22 variables, derives 9 rules and classifies the original set perfectly. On the complete set, only 6 rules are needed to reach the same perfect classification.
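
A JRip model is an ordered rule list: rules are tried from top to bottom, the first one whose conditions all match decides the class, and the last rule is the default. The sketch below shows how such a list could be applied to a new observation in base R. The rules are transcribed from the printed output for the complete set, with labels exactly as printed; apply_rules and the example observation are hypothetical.

```r
# Rules transcribed from the printed JRip output for the complete set
# (labels exactly as printed above); each rule is a set of attribute = value tests.
rules <- list(
  list(when = c(odor = "creosote"),                   then = "poisonous"),
  list(when = c(gill_size = "narrow", odor = "none"), then = "poisonous"),
  list(when = c(odor = "anise"),                      then = "poisonous"),
  list(when = c(spore_print_color = "orange"),        then = "poisonous"),
  list(when = c(population = "clustered"),            then = "poisonous")
)
default_class <- "edible"

# Hypothetical helper: return the class of the first rule whose tests all match
apply_rules <- function(obs, rules, default) {
  for (r in rules) {
    if (isTRUE(all(obs[names(r$when)] == r$when))) return(r$then)
  }
  default
}

obs <- c(odor = "none", gill_size = "narrow",
         spore_print_color = "brown", population = "several")
apply_rules(obs, rules, default_class)   # "poisonous" (second rule fires)
```

An observation that matches none of the five rules falls through to the default class, edible, which is why the order and completeness of the rules matter.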

Step 6: Improving Model with C5.0

In the next step, we’ll attempt to improve classification performance using the C5.0 algorithm from the C50 package. We first apply it using odor and gill_size (the two most influential factor variables), and then compare with all 22 variables.

mushroom_c5rules <- C5.0(classes ~ odor + gill_size, data = mushrooms, rules = TRUE)
summary(mushroom_c5rules)
## 
## Call:
## C5.0.formula(formula = classes ~ odor + gill_size, data = mushrooms, rules
##  = TRUE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Thu May 12 17:50:37 2022
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 8124 cases (3 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (4328/120, lift 1.9)
##  odor in {almond, fishy, musty}
##  ->  class edible  [0.972]
## 
## Rule 2: (3796, lift 2.1)
##  odor in {anise, creosote, foul, none, pungent, spicy}
##  ->  class poisonous  [1.000]
## 
## Default class: edible
## 
## 
## Evaluation on training data (8124 cases):
## 
##          Rules     
##    ----------------
##      No      Errors
## 
##       2  120( 1.5%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##    4208          (a): class edible
##     120  3796    (b): class poisonous
## 
## 
##  Attribute usage:
## 
##  100.00% odor
## 
## 
## Time: 0.0 secs

On the original dataset, we observe that C5.0, applied to the two most influential factor variables, yields results similar to OneR: it classifies 98.52% of the mushrooms correctly, leaving 120 misclassified, and in fact uses only odor. On the complete set (fitted at the start of Step 7), the results again match OneR: 98.44% of the mushrooms are classified correctly, leaving 88 misclassified.
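
The numbers attached to each C5.0 rule can be reproduced by hand. As a worked example, assuming the Laplace ratio (n - m + 1) / (n + 2) for the bracketed confidence and defining lift as that confidence divided by the share of the predicted class in the training data, Rule 1 above works out as follows.

```r
# Rule 1 above: (4328/120, lift 1.9) -> class edible [0.972]
n_covered <- 4328   # training cases covered by the rule
n_wrong   <- 120    # covered cases that are not edible
n_total   <- 8124   # all training cases
n_edible  <- 4208   # edible cases in the training data

# Laplace-corrected confidence, (n - m + 1) / (n + 2)
confidence <- (n_covered - n_wrong + 1) / (n_covered + 2)
# Lift: confidence relative to the share of the predicted class
lift <- confidence / (n_edible / n_total)

round(confidence, 3)   # 0.972
round(lift, 1)         # 1.9
```

The same arithmetic applied to Rule 2 (3796 cases, none wrong, 3916 poisonous overall) gives a confidence that rounds to 1.000 and a lift of about 2.1, in line with the printed values.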

Let’s apply C5.0 on the 22 variables.

Step 7: Improving C5.0 using all variables

mushroomc_c5rules <- C5.0(classes ~ odor + gill_size, data = mushrooms_complete, rules = TRUE)
summary(mushroomc_c5rules)
## 
## Call:
## C5.0.formula(formula = classes ~ odor + gill_size, data =
##  mushrooms_complete, rules = TRUE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Thu May 12 17:50:37 2022
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 5644 cases (3 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (3576/88, lift 1.6)
##  odor in {almond, fishy, musty}
##  ->  class edible  [0.975]
## 
## Rule 2: (2068, lift 2.6)
##  odor in {anise, creosote, foul, none}
##  ->  class poisonous  [1.000]
## 
## Default class: edible
## 
## 
## Evaluation on training data (5644 cases):
## 
##          Rules     
##    ----------------
##      No      Errors
## 
##       2   88( 1.6%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##    3488          (a): class edible
##      88  2068    (b): class poisonous
## 
## 
##  Attribute usage:
## 
##  100.00% odor
## 
## 
## Time: 0.0 secs
mushroom_c5improved_rules <- C5.0(classes ~ ., data = mushrooms, rules = TRUE)
summary(mushroom_c5improved_rules)
## 
## Call:
## C5.0.formula(formula = classes ~ ., data = mushrooms, rules = TRUE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Thu May 12 17:50:38 2022
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 8124 cases (22 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (4148/4, lift 1.9)
##  cap_surface in {fibrous, scaly, smooth}
##  odor in {almond, fishy, musty}
##  stalk_color_below_ring in {cinnamon, gray, pink, red, white}
##  spore_print_color in {black, brown, buff, chocolate, green, purple,
##                               white, yellow}
##  ->  class edible  [0.999]
## 
## Rule 2: (3500/12, lift 1.9)
##  cap_surface in {fibrous, scaly, smooth}
##  odor in {almond, fishy, musty}
##  stalk_root in {club, cup, equal, rhizomorphs}
##  spore_print_color in {buff, chocolate, purple, white}
##  ->  class edible  [0.996]
## 
## Rule 3: (3796, lift 2.1)
##  odor in {anise, creosote, foul, none, pungent, spicy}
##  ->  class poisonous  [1.000]
## 
## Rule 4: (72, lift 2.0)
##  spore_print_color = orange
##  ->  class poisonous  [0.986]
## 
## Rule 5: (24, lift 2.0)
##  stalk_color_below_ring = yellow
##  ->  class poisonous  [0.962]
## 
## Rule 6: (16, lift 2.0)
##  stalk_root = bulbous
##  stalk_color_below_ring = orange
##  ->  class poisonous  [0.944]
## 
## Rule 7: (4, lift 1.7)
##  cap_surface = grooves
##  ->  class poisonous  [0.833]
## 
## Default class: edible
## 
## 
## Evaluation on training data (8124 cases):
## 
##          Rules     
##    ----------------
##      No      Errors
## 
##       7   12( 0.1%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##    4208          (a): class edible
##      12  3904    (b): class poisonous
## 
## 
##  Attribute usage:
## 
##   98.67% odor
##   52.83% spore_print_color
##   51.99% cap_surface
##   51.55% stalk_color_below_ring
##   43.28% stalk_root
## 
## 
## Time: 0.1 secs
mushroomc_c5improved_rules <- C5.0(classes ~ ., data = mushrooms_complete, rules = TRUE)
summary(mushroomc_c5improved_rules)
## 
## Call:
## C5.0.formula(formula = classes ~ ., data = mushrooms_complete, rules = TRUE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Thu May 12 17:50:39 2022
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 5644 cases (22 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (3488, lift 1.6)
##  odor in {almond, fishy, musty}
##  spore_print_color in {buff, chocolate, purple, white}
##  population in {abundant, numerous, scattered, several, solitary}
##  ->  class edible  [1.000]
## 
## Rule 2: (2068, lift 2.6)
##  odor in {anise, creosote, foul, none}
##  ->  class poisonous  [1.000]
## 
## Rule 3: (72, lift 2.6)
##  spore_print_color = orange
##  ->  class poisonous  [0.986]
## 
## Rule 4: (52, lift 2.6)
##  population = clustered
##  ->  class poisonous  [0.981]
## 
## Default class: edible
## 
## 
## Evaluation on training data (5644 cases):
## 
##          Rules     
##    ----------------
##      No      Errors
## 
##       4    0( 0.0%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##    3488          (a): class edible
##          2156    (b): class poisonous
## 
## 
##  Attribute usage:
## 
##   98.44% odor
##   63.08% spore_print_color
##   62.72% population
## 
## 
## Time: 0.0 secs

Using all 22 variables on the original dataset, C5.0 derives 7 rules and classifies all but 12 mushrooms correctly. On the complete dataset, C5.0 derives 4 rules and classifies all mushrooms correctly! We can also chart a decision tree for each dataset, fitted with rpart on the same formula and drawn with fancyRpartPlot from the rattle package.

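# Fit an rpart tree on the same formula as the C5.0 model (class against all attributes);
# cp=0 disables cost-complexity pruning so the tree is grown in full
tree <- rpart(classes ~ .,
              data=mushrooms,
              control=rpart.control(minsplit=20,
                                    cp=0)
              )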
fancyRpartPlot(tree, 
               palettes=c("Greys", "Oranges"),
               cex=0.75, 
               main="Original Mushroom Dataset",
               sub="")

# Same fit on the complete-case dataset
treec <- rpart(classes ~ .,
               data=mushrooms_complete,
               control=rpart.control(minsplit=20,
                                     cp=0)
               )
fancyRpartPlot(treec,
               palettes=c("Greys", "Oranges"),
               cex=0.75, 
               main="Complete Mushroom Dataset",
               sub="")

Finally, we will use PART to classify, and compare the results.

mushroom_PART_rules <- PART(classes ~ ., data = mushrooms)
mushroom_PART_rules
## PART decision list
## ------------------
## 
## odor = creosote: poisonous (2160.0)
## 
## gill_size = broad AND
## ring_number = one: edible (3392.0)
## 
## ring_number = two AND
## spore_print_color = white: edible (528.0)
## 
## odor = pungent: poisonous (576.0)
## 
## odor = spicy: poisonous (576.0)
## 
## stalk_shape = enlarging AND
## stalk_surface_below_ring = silky AND
## odor = none: poisonous (256.0)
## 
## stalk_shape = enlarging AND
## odor = anise: poisonous (192.0)
## 
## gill_size = narrow AND
## stalk_surface_above_ring = silky AND
## population = several: edible (192.0)
## 
## gill_size = broad: poisonous (108.0)
## 
## stalk_surface_below_ring = silky AND
## bruises = bruises: edible (60.0)
## 
## stalk_surface_below_ring = smooth: poisonous (40.0)
## 
## bruises = bruises: edible (36.0)
## 
## : poisonous (8.0)
## 
## Number of Rules  :   13
mushroomc_PART_rules <- PART(classes ~ ., data = mushrooms_complete)
mushroomc_PART_rules
## PART decision list
## ------------------
## 
## odor = musty AND
## ring_number = one AND
## veil_color = white AND
## gill_size = broad: edible (2496.0)
## 
## odor = creosote: poisonous (1584.0)
## 
## odor = almond: edible (400.0)
## 
## odor = fishy: edible (400.0)
## 
## odor = none: poisonous (256.0)
## 
## odor = anise: poisonous (192.0)
## 
## stalk_root = cup: edible (96.0)
## 
## spore_print_color = orange: poisonous (72.0)
## 
## stalk_root = bulbous AND
## population = several: edible (64.0)
## 
## population = clustered: poisonous (52.0)
## 
## : edible (32.0)
## 
## Number of Rules  :   11

On the original mushrooms dataset, PART classifies all mushrooms correctly but relies on 13 rules to do so. On the complete set, PART achieves the same outcome with 11 rules.

Step 8: Comparisons with Original Rules in Reference Material

It is always interesting to compare a solution to alternatives. In this case, we can refer to the original rules derived in 1997 and extracted from the documentation. The four rules for poisonous mushrooms reach 100% accuracy, while the example rule for edible mushrooms gives 48 errors, or 99.41% accuracy, on the whole dataset:

res
## [1] "1) odor=NOT(almond.OR.anise.OR.none) 120 poisonous cases missed, 98.52% accuracy "
## [2] "2) spore-print-color=green 48 cases missed, 99.41% accuracy "
## [3] "3) odor=none.AND.stalk-surface-below-ring=scaly.AND. (stalk-color-above-ring=NOT.brown) 8 cases missed, 99.90% accuracy "
## [4] "4) habitat=leaves.AND.cap-color=white 100% accuracy Rule "
## [5] "4) may also be "
## [6] "4') population=clustered.AND.cap_color=white These rule involve 6 attributes (out of 22). Rules for edible mushrooms are obtained as negation of the rules given above, for example the rule: odor=(almond.OR.anise.OR.none).AND.spore-print-color=NOT.green gives 48 errors, or 99.41% accuracy on the whole dataset."

Conclusions

The C5.0 algorithm, applied to all 22 variables of the complete mushroom set, classifies every mushroom correctly with 4 rules. This is the best performance we achieved: the minimum number of rules derived and the most accurate (perfect) outcome, obtained on the complete dataset. It also selected only 3 variables out of the 22 provided: odor, spore_print_color and population. By comparison, the referenced documentation reports 4 rules involving 6 attributes to reach 100% accuracy on the whole dataset, with its two-attribute example rule giving 99.41%.
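
The comparison can be summarized in a small table built from the rule counts and training errors reported in the sections above. This is a sketch for convenience rather than output from the original analysis; OneR's rule count is left as NA because it is a single-attribute rule rather than a rule list, and the PART error counts follow the statement that it classifies all mushrooms correctly.

```r
# Rule counts and training errors as reported in the sections above
results <- data.frame(
  model   = rep(c("OneR", "JRip", "C5.0 (all)", "PART"), times = 2),
  dataset = rep(c("original", "complete"), each = 4),
  rules   = c(NA, 9, 7, 13,   NA, 6, 4, 11),
  errors  = c(120, 0, 12, 0,  88, 0, 0, 0),
  n       = rep(c(8124, 5644), each = 4)
)
results$accuracy <- round(100 * (1 - results$errors / results$n), 2)

# Fewest errors first, then fewest rules
ranked <- results[order(results$errors, results$rules), ]
ranked
```

Sorted this way, C5.0 on the complete set comes first, with zero errors and 4 rules.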

We hope this typical example demonstrates that Machine Learning algorithms are well positioned to help resolve classification challenges, offering a fast, efficient and economical alternative to tedious experimentation. It is easy to imagine how similar questions can be resolved in all types of R&D, in materials, cosmetics, food or any scientific area. This second tool is certainly as useful as the formulation tool we reviewed previously.

Classifying rubber properties to meet rolling resistance and emissions targets, modern composites for renewable energy sources, lightweight transportation vehicles and next-generation public transit, or innovative UV-shield ointments and tasty snacks and drinks: all present similar challenges where only the nature of the inputs and outputs varies. Therefore, this method too can and should be applied broadly!

Why not try implementing Machine Learning in your scientific or technical area of expertise and boost innovation with improved Data Analytics?

References

The following sources are referenced as they provided significant help and information to develop this Machine Learning analysis applied to classification:

  1. UCI Machine Learning Repository
  2. mushroom documentation
  3. stringr
  4. RWeka
  5. C50
  6. rpart
  7. rpart.plot
  8. rattle
  9. Machine Learning, Key to Your Classification Challenges
  10. Building R-Studio for aarm64 on Ubuntu 22.04