The original RPubs was posted 6 years ago and, at the time, was developed on R 3.2.2 running on a quad-core 64-bit Windows 8 laptop. Today this slightly rejuvenated and amended code runs on a 64-bit quad-core Raspberry Pi CM4. This is now possible with the open-source RStudio IDE and R 4.2.0, natively built and running smoothly at 2.00 GHz on the Ubuntu MATE 22.04 LTS aarch64 desktop!
It is even faster, and impressively runs side by side with a Python 3.10 Jupyter notebook!
Actual Screenshot
You develop pharmaceutical, cosmetic, food, industrial or civil
engineered products, and are often confronted with the challenge of
sorting and classifying to meet process or performance properties. While
traditional Research and Development does approach the problem with
experimentation, it generally involves designs, time and resource
constraints, and can be considered slow, expensive and oftentimes
redundant, fast forgotten or perhaps obsolete.
Consider the alternative that Machine Learning tools offer today. We will
show that it is not only quick and efficient, and ultimately the only way Front
End of Innovation should proceed, but also particularly well suited for
classification, an essential step used to reduce complexity and optimize
product segmentation, Lean Innovation and the establishment of robust source of
supply networks.
Today, we will explain how Machine Learning can shed new light on this generic and very persistent classification and clustering challenge. Using modern algorithms, we will derive simple (we prefer fewer rules) and accurate (perfect) classifications on a complete dataset.
If you have not yet read about the other important aspect, formulation optimization, please consult Machine Learning, Key to Your Formulation Challenges.
We will mirror the approach used in the formulation challenge and use another dataset hosted on the UCI Machine Learning Repository to classify the edibility of… mushrooms, based on attributes described in The Audubon Society Field Guide to North American Mushrooms (1981). The challenge we tackle today is to properly classify a go/no-go attribute, a task scientists, engineers and business professionals must address daily. Any established R&D organization would certainly have similar, and sometimes hidden, knowledge in its archives…
Again, we will use R to quickly demonstrate the approach on this dataset and its full description. We continue to maintain reproducibility of the analysis as a general practice. The analysis tool and platform are documented and all libraries are clearly listed, while data is retrieved programmatically from the repository and date stamped.
We will display a structure of the mushrooms dataset and the corresponding dictionary to translate the property factors.
Sys.info()[1:5]
## sysname
## "Linux"
## release
## "5.15.0-1006-raspi"
## version
## "#6-Ubuntu SMP PREEMPT Mon Apr 25 12:50:48 UTC 2022"
## nodename
## "cruncher2"
## machine
## "aarch64"
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: aarch64-unknown-linux-gnu (64-bit)
## Running under: Ubuntu 22.04 LTS
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib/aarch64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.29 R6_2.5.1 jsonlite_1.8.0 magrittr_2.0.3
## [5] evaluate_0.15 stringi_1.7.6 rlang_1.0.2 cli_3.3.0
## [9] rstudioapi_0.13 jquerylib_0.1.4 bslib_0.3.1 rmarkdown_2.14
## [13] tools_4.2.0 stringr_1.4.0 xfun_0.30 yaml_2.3.5
## [17] fastmap_1.1.0 compiler_4.2.0 htmltools_0.5.2 knitr_1.39
## [21] sass_0.4.1
library(stringr)
library(RWeka)
library(C50)
library(rpart)
library(rattle)
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
userdir <- getwd()
datadir <- "./data"
# create the data directory on first run
if (!dir.exists(datadir)) dir.create(datadir)
# download the dataset and record the retrieval date
fileUrl <- "http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./data/Mushrooms_Data.csv")
dateDownloaded <- date()
# the file has no header row; read every column as a factor
mushrooms <- read.csv("./data/Mushrooms_Data.csv", header = FALSE, stringsAsFactors = TRUE)
# download the attribute description that accompanies the dataset
fileUrl <- "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./data/Names.txt")
txt <- readLines("./data/Names.txt")
lns <- data.frame(beg=which(grepl("P_1) odor=",txt)),end=which(grepl("on the whole dataset.",txt)))
# we now capture all lines of text between beg and end from txt
res <- lapply(seq_along(lns$beg),function(l){paste(txt[seq(from=lns$beg[l],to=lns$end[l],by=1)],collapse=" ")})
res <- gsub("\t", "", res, fixed = TRUE)
res <- gsub("( {2,})"," ",res, fixed=FALSE)
res <- gsub("P_","\n",res,fixed=TRUE)
writeLines(res,"./data/parsed_res.csv")
res <- readLines("./data/parsed_res.csv")
res<-res[-1]
lns <- data.frame(beg=which(grepl("7. Attribute Information:",txt)),end=which(grepl("urban=u,waste=w,woods=d",txt)))
txt <- lapply(seq_along(lns$beg),function(l){paste(txt[seq(from=lns$beg[l],to=lns$end[l],by=1)],collapse=" ")})
txt <- gsub(" ", "", txt, fixed = TRUE)
txt <- gsub("(\\d+\\.)","\\\n",txt, fixed=FALSE)
txt <- gsub("\nAttributeInformation:\\(","",txt,fixed=FALSE)
txt <- gsub("\\)","",txt,fixed=FALSE)
txt <- gsub(":",",",txt,fixed=TRUE)
txt <- gsub("?","",txt,fixed=TRUE)
txt <- gsub("-","_",txt,fixed=TRUE)
writeLines(txt,"./data/parsed.csv")
attrib <- readLines("./data/parsed.csv")
attrib <- sapply (1:length(attrib),function(i) {gsub(","," ",attrib[i],fixed=TRUE)})
dictionary <- sapply (1:length(attrib),function(i) {strsplit(attrib[i],' ')})
colnames(mushrooms)<-sapply(1:length(attrib),function(i) {colnames(mushrooms)[i]<-dictionary[[i]][1]})
dictionary<-sapply (1:length(attrib),function(i) {dictionary[[i]][-1]}) # contains the levels strings
dictionary<-sapply(1:length(attrib),function(i){sapply(1:lengths(dictionary[i]),function(j){p1<-strsplit(dictionary[[i]][j],"=")[[1]][1];p2<-strsplit(dictionary[[i]][j],"=")[[1]][2];dictionary[[i]][j]<-paste0(p2,',',p1)})})
We notice that the stalk_root property has a missing level, indicated with ‘?’. We can attempt two analyses. First, we keep the missing data as coded and proceed with the classification models. Second, we re-code the ‘?’ value as missing (NA), drop the corresponding level, and omit all incomplete cases in a new dataset, mushrooms_complete.
mushrooms_complete<-mushrooms
mushrooms_complete$stalk_root[mushrooms_complete$stalk_root=='?']<-NA
mushrooms_complete<-mushrooms_complete[complete.cases(mushrooms_complete),]
mushrooms_complete$stalk_root<-droplevels(mushrooms_complete$stalk_root)
str(mushrooms_complete)
## 'data.frame': 5644 obs. of 23 variables:
## $ classes : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
## $ cap_shape : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
## $ cap_surface : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
## $ cap_color : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
## $ bruises : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ...
## $ odor : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
## $ gill_attachment : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
## $ gill_spacing : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
## $ gill_size : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
## $ gill_color : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
## $ stalk_shape : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
## $ stalk_root : Factor w/ 4 levels "b","c","e","r": 3 2 2 3 3 2 2 2 3 2 ...
## $ stalk_surface_above_ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk_surface_below_ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk_color_above_ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ stalk_color_below_ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ veil_type : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
## $ veil_color : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ ring_number : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
## $ ring_type : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
## $ spore_print_color : Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ...
## $ population : Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4 ...
## $ habitat : Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4 ...
knitr::kable(table(mushrooms_complete$classes),caption="Cleaned Dataset")
| Var1 | Freq |
|---|---|
| e | 3488 |
| p | 2156 |
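The recoding above can also be done at import time. The following is a minimal sketch, assuming the `na.strings` argument of `read.csv` is used to declare ‘?’ as the missing marker; the three sample rows and the object name `toy` are made up for illustration and are not taken from the UCI file.

```r
# Minimal sketch: declare "?" as the missing-value marker at import time.
# The three rows below are made-up sample data, not rows of the UCI file.
sample_csv <- "classes,odor,stalk_root
p,p,e
e,a,?
e,l,c"

toy <- read.csv(text = sample_csv, na.strings = "?", stringsAsFactors = TRUE)

# "?" never becomes a factor level, so no droplevels() is needed afterwards
levels(toy$stalk_root)

# Keep only complete rows, as done for mushrooms_complete
toy_complete <- toy[complete.cases(toy), ]
nrow(toy_complete)
```

With this approach the missing marker never appears among the factor levels, so the separate re-coding and level-dropping steps are not required.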
We can now reassign translated levels to both the original and the complete mushroom datasets.
m <- sapply (1:length(attrib),function(i){levels(mushrooms[[i]]) <- sapply(1:length (levels(mushrooms[[i]])),function(j){
a<-strsplit(dictionary[[i]][[j]],",")[[1]][1]
b<-strsplit(dictionary[[i]][[j]],",")[[1]][2]
levels(mushrooms[[i]])[levels(mushrooms[[i]])==a] <- b } )
mushrooms[[i]]} )
m <- as.data.frame(m)
colnames(m) <- colnames(mushrooms)
mushrooms <- m
# convert chr to factors
mushrooms[sapply(mushrooms, is.character)] <- lapply(mushrooms[sapply(mushrooms, is.character)],as.factor)
m <- sapply (1:length(attrib),function(i){levels(mushrooms_complete[[i]]) <- sapply(1:length (levels(mushrooms_complete[[i]])),function(j){
a<-strsplit(dictionary[[i]][[j]],",")[[1]][1]
b<-strsplit(dictionary[[i]][[j]],",")[[1]][2]
levels(mushrooms_complete[[i]])[levels(mushrooms_complete[[i]])==a] <- b } )
mushrooms_complete[[i]]} )
m <- as.data.frame(m)
colnames(m) <- colnames(mushrooms_complete)
mushrooms_complete <- m
# convert chr to factors
mushrooms_complete[sapply(mushrooms_complete, is.character)] <- lapply(mushrooms_complete[sapply(mushrooms_complete, is.character)],as.factor)
rm(m,lns,attrib,txt,dictionary)# cleanup
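The relabeling above appears to assign the translated levels by position, which relies on the dictionary entries and the factor's sorted codes sharing the same order. A minimal sketch of a lookup keyed by code, which does not depend on that ordering, might look as follows. The names `lookup`, `odor_codes` and `odor_labels` are illustrative, the sample values are made up, and the code/label pairs are those listed for odor in the UCI attribute description.

```r
# Sketch: translate single-letter codes into labels by matching on the code,
# not on position. Pairs follow the odor entry of the attribute description.
lookup <- c(a = "almond", l = "anise", c = "creosote", y = "fishy",
            f = "foul", m = "musty", n = "none", p = "pungent", s = "spicy")

# Hypothetical sample of coded values
odor_codes <- factor(c("p", "a", "l", "n", "f", "a"))

# levels() are sorted by code ("a", "f", "l", "n", "p"); indexing the named
# vector by those codes returns the matching label for each level
odor_labels <- odor_codes
levels(odor_labels) <- unname(lookup[levels(odor_codes)])

# Cross-tabulate codes against labels to check the mapping
table(code = odor_codes, label = odor_labels)
```

Cross-tabulating the coded and translated columns, as in the last line, is a quick way to confirm that each code ended up with its intended label.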
As the veil_type feature is constant, with a single level, we can exclude it from further analysis and examine the remaining 22 properties: we’ll observe a fairly balanced classification set with 4208 edible and 3916 poisonous mushrooms on the original set.
# now automatically discard factors with unique level
mushrooms_complete <- mushrooms_complete[, sapply(mushrooms_complete, function(col) length(unique(col))) > 1]
# notice we lost the veil_type because that factor had a unique level
mushrooms <- mushrooms[, sapply(mushrooms, function(col) length(unique(col))) > 1]
str(mushrooms)
## 'data.frame': 8124 obs. of 22 variables:
## $ classes : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...
## $ cap_shape : Factor w/ 6 levels "bell","conical",..: 6 6 1 6 6 6 1 1 6 1 ...
## $ cap_surface : Factor w/ 4 levels "fibrous","grooves",..: 3 3 3 4 3 4 3 4 4 3 ...
## $ cap_color : Factor w/ 10 levels "brown","buff",..: 5 10 9 9 4 10 9 9 9 10 ...
## $ bruises : Factor w/ 2 levels "bruises","no": 2 2 2 2 1 2 2 2 2 2 ...
## $ odor : Factor w/ 9 levels "almond","anise",..: 7 1 4 7 6 1 1 4 7 1 ...
## $ gill_attachment : Factor w/ 2 levels "attached","descending": 2 2 2 2 2 2 2 2 2 2 ...
## $ gill_spacing : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ...
## $ gill_size : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
## $ gill_color : Factor w/ 12 levels "black","brown",..: 5 5 6 6 5 6 3 6 8 3 ...
## $ stalk_shape : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ...
## $ stalk_root : Factor w/ 5 levels "bulbous","club",..: 4 3 3 4 4 3 3 3 4 3 ...
## $ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk_color_above_ring : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ stalk_color_below_ring : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ veil_color : Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ ring_number : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ...
## $ ring_type : Factor w/ 5 levels "cobwebby","evanescent",..: 5 5 5 5 1 5 5 5 5 5 ...
## $ spore_print_color : Factor w/ 9 levels "black","brown",..: 3 4 4 3 4 3 3 4 3 3 ...
## $ population : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
## $ habitat : Factor w/ 7 levels "grasses","leaves",..: 6 2 4 6 2 2 4 4 2 4 ...
knitr::kable(table(mushrooms$classes),caption="Original Dataset")
| Var1 | Freq |
|---|---|
| edible | 4208 |
| poisonous | 3916 |
str(mushrooms_complete)
## 'data.frame': 5644 obs. of 22 variables:
## $ classes : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...
## $ cap_shape : Factor w/ 6 levels "bell","conical",..: 6 6 1 6 6 6 1 1 6 1 ...
## $ cap_surface : Factor w/ 4 levels "fibrous","grooves",..: 3 3 3 4 3 4 3 4 4 3 ...
## $ cap_color : Factor w/ 8 levels "brown","buff",..: 5 8 7 7 4 8 7 7 7 8 ...
## $ bruises : Factor w/ 2 levels "bruises","no": 2 2 2 2 1 2 2 2 2 2 ...
## $ odor : Factor w/ 7 levels "almond","anise",..: 7 1 4 7 6 1 1 4 7 1 ...
## $ gill_attachment : Factor w/ 2 levels "attached","descending": 2 2 2 2 2 2 2 2 2 2 ...
## $ gill_spacing : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ...
## $ gill_size : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
## $ gill_color : Factor w/ 9 levels "buff","chocolate",..: 3 3 4 4 3 4 1 4 5 1 ...
## $ stalk_shape : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ...
## $ stalk_root : Factor w/ 4 levels "bulbous","club",..: 3 2 2 3 3 2 2 2 3 2 ...
## $ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk_color_above_ring : Factor w/ 7 levels "brown","buff",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ stalk_color_below_ring : Factor w/ 7 levels "brown","buff",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ veil_color : Factor w/ 2 levels "white","yellow": 1 1 1 1 1 1 1 1 1 1 ...
## $ ring_number : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ...
## $ ring_type : Factor w/ 4 levels "cobwebby","flaring",..: 4 4 4 4 1 4 4 4 4 4 ...
## $ spore_print_color : Factor w/ 6 levels "brown","buff",..: 2 3 3 2 3 2 2 3 2 2 ...
## $ population : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
## $ habitat : Factor w/ 6 levels "grasses","leaves",..: 6 2 4 6 2 2 4 4 2 4 ...
knitr::kable(table(mushrooms_complete$classes),caption="Complete Cases")
| Var1 | Freq |
|---|---|
| edible | 3488 |
| poisonous | 2156 |
However, the complete set is not only smaller but also a bit more imbalanced after removal of the missing data, with 3488 edible and 2156 poisonous mushrooms.
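To put a number on the imbalance, the class counts reported in the tables above can be converted to shares. This is a small sketch using base R's `prop.table`; the object names `counts` and `shares` are illustrative.

```r
# Class counts as reported in the tables above
counts <- rbind(
  original = c(edible = 4208, poisonous = 3916),
  complete = c(edible = 3488, poisonous = 2156)
)

# Row-wise shares: each row sums to 1
shares <- prop.table(counts, margin = 1)
round(100 * shares, 1)
```

The edible share rises from about 51.8% on the original set to about 61.8% on the complete set.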
We will now conduct two parallel analysis streams to compare classification performance and explore multiple approaches, in an attempt to reach a perfect classification.
Let’s start with OneR classification, from the RWeka package.
We classify the original and complete mushrooms datasets.
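For readers unfamiliar with OneR, the idea can be sketched in a few lines of base R: for each attribute, predict the majority class of every level, count the errors, and keep the single attribute with the fewest errors. This is an illustration of the principle only, not RWeka's implementation; the function `one_r` and the toy data are hypothetical.

```r
# Sketch of the OneR idea in base R (not RWeka's implementation):
# for each attribute, predict the majority class per level, count errors,
# and keep the attribute with the fewest errors.
one_r <- function(data, target) {
  predictors <- setdiff(names(data), target)
  errors <- sapply(predictors, function(p) {
    tab <- table(data[[p]], data[[target]])
    sum(tab) - sum(apply(tab, 1, max))   # cases outside each level's majority
  })
  best <- names(which.min(errors))
  tab <- table(data[[best]], data[[target]])
  rule <- colnames(tab)[apply(tab, 1, which.max)]
  names(rule) <- rownames(tab)
  list(attribute = best, rule = rule, errors = errors)
}

# Made-up toy data: odor separates the classes better than gill_size
toy <- data.frame(
  classes   = c("e", "e", "e", "p", "p", "p", "p", "e"),
  odor      = c("a", "a", "l", "f", "f", "p", "p", "l"),
  gill_size = c("b", "n", "b", "n", "b", "n", "n", "b")
)

fit <- one_r(toy, "classes")
fit$attribute   # attribute chosen for the single rule
fit$rule        # predicted class for each level of that attribute
fit$errors      # training errors per candidate attribute
```

On the toy data, odor yields no training errors while gill_size yields two, so odor is selected.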
mushroom_1R <- OneR(classes ~ .,data = mushrooms)
mushroomc_1R <- OneR(classes ~ .,data = mushrooms_complete)
mushroom_1R
## odor:
## almond -> edible
## anise -> poisonous
## creosote -> poisonous
## fishy -> edible
## foul -> poisonous
## musty -> edible
## none -> poisonous
## pungent -> poisonous
## spicy -> poisonous
## (8004/8124 instances correct)
summary(mushroom_1R)
##
## === Summary ===
##
## Correctly Classified Instances 8004 98.5229 %
## Incorrectly Classified Instances 120 1.4771 %
## Kappa statistic 0.9704
## Mean absolute error 0.0148
## Root mean squared error 0.1215
## Relative absolute error 2.958 %
## Root relative squared error 24.323 %
## Total Number of Instances 8124
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 4208 0 | a = edible
## 120 3796 | b = poisonous
mushroomc_1R
## odor:
## almond -> edible
## anise -> poisonous
## creosote -> poisonous
## fishy -> edible
## foul -> poisonous
## musty -> edible
## none -> poisonous
## (5556/5644 instances correct)
summary(mushroomc_1R)
##
## === Summary ===
##
## Correctly Classified Instances 5556 98.4408 %
## Incorrectly Classified Instances 88 1.5592 %
## Kappa statistic 0.9667
## Mean absolute error 0.0156
## Root mean squared error 0.1249
## Relative absolute error 3.3022 %
## Root relative squared error 25.6994 %
## Total Number of Instances 5644
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 3488 0 | a = edible
## 88 2068 | b = poisonous
We observe that the OneR model provides 98.52% correct classification on the original set, using odor as the only criterion, and 98.44% on the complete set. However, the confusion matrix reveals that 120 poisonous mushrooms were classified as edible in the original dataset, and 88 in the complete dataset.
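The summary statistics can be reproduced from the confusion matrix alone. The sketch below recomputes the accuracy, the share of poisonous mushrooms missed, and Cohen's kappa for OneR on the original dataset, using the counts printed above; the variable names are illustrative.

```r
# Confusion matrix for OneR on the original dataset, copied from the summary
# above (rows = actual class, columns = predicted class)
cm <- matrix(c(4208,    0,
                120, 3796),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("edible", "poisonous"),
                             predicted = c("edible", "poisonous")))

n <- sum(cm)
accuracy <- sum(diag(cm)) / n

# Share of poisonous mushrooms wrongly labelled edible (the costly error)
missed_poisonous <- cm["poisonous", "edible"] / sum(cm["poisonous", ])

# Cohen's kappa: agreement beyond what the marginal totals alone would give
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy,
        missed_poisonous = missed_poisonous,
        kappa = cohen_kappa), 4)
```

The accuracy (0.9852) and kappa (0.9704) agree with the summary, while the missed-poisonous share of about 3.1% shows that every error falls on the dangerous side.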
Let’s try to improve on the OneR model, using JRip.
mushroom_JRip <- JRip(classes ~ ., data = mushrooms)
mushroom_JRip
## JRIP rules:
## ===========
##
## (odor = creosote) => classes=poisonous (2160.0/0.0)
## (gill_size = narrow) and (gill_color = black) => classes=poisonous (1152.0/0.0)
## (gill_size = narrow) and (odor = none) => classes=poisonous (256.0/0.0)
## (odor = anise) => classes=poisonous (192.0/0.0)
## (spore_print_color = orange) => classes=poisonous (72.0/0.0)
## (stalk_surface_below_ring = smooth) and (stalk_surface_above_ring = scaly) => classes=poisonous (68.0/0.0)
## (habitat = meadows) and (cap_color = white) => classes=poisonous (8.0/0.0)
## (stalk_color_above_ring = yellow) => classes=poisonous (8.0/0.0)
## => classes=edible (4208.0/0.0)
##
## Number of Rules : 9
summary(mushroom_JRip)
##
## === Summary ===
##
## Correctly Classified Instances 8124 100 %
## Incorrectly Classified Instances 0 0 %
## Kappa statistic 1
## Mean absolute error 0
## Root mean squared error 0
## Relative absolute error 0 %
## Root relative squared error 0 %
## Total Number of Instances 8124
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 4208 0 | a = edible
## 0 3916 | b = poisonous
mushroomc_JRip <- JRip(classes ~ ., data = mushrooms_complete)
mushroomc_JRip
## JRIP rules:
## ===========
##
## (odor = creosote) => classes=poisonous (1584.0/0.0)
## (gill_size = narrow) and (odor = none) => classes=poisonous (256.0/0.0)
## (odor = anise) => classes=poisonous (192.0/0.0)
## (spore_print_color = orange) => classes=poisonous (72.0/0.0)
## (population = clustered) => classes=poisonous (52.0/0.0)
## => classes=edible (3488.0/0.0)
##
## Number of Rules : 6
summary(mushroomc_JRip)
##
## === Summary ===
##
## Correctly Classified Instances 5644 100 %
## Incorrectly Classified Instances 0 0 %
## Kappa statistic 1
## Mean absolute error 0
## Root mean squared error 0
## Relative absolute error 0 %
## Root relative squared error 0 %
## Total Number of Instances 5644
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 3488 0 | a = edible
## 0 2156 | b = poisonous
We observe that JRip, given all 22 variables, derives 9 rules and classifies the original set correctly. On the complete set, only 6 rules are needed to reach the same perfect classification.
In the next step, we’ll attempt to improve selection performance using the C5.0 algorithm from the C50 package, which we’ll first apply using odor and gill_size (the two most influential factor variables), and then compare with all 22 variables selected.
mushroom_c5rules <- C5.0(classes ~ odor + gill_size, data = mushrooms, rules = TRUE)
summary(mushroom_c5rules)
##
## Call:
## C5.0.formula(formula = classes ~ odor + gill_size, data = mushrooms, rules
## = TRUE)
##
##
## C5.0 [Release 2.07 GPL Edition] Thu May 12 17:50:37 2022
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 8124 cases (3 attributes) from undefined.data
##
## Rules:
##
## Rule 1: (4328/120, lift 1.9)
## odor in {almond, fishy, musty}
## -> class edible [0.972]
##
## Rule 2: (3796, lift 2.1)
## odor in {anise, creosote, foul, none, pungent, spicy}
## -> class poisonous [1.000]
##
## Default class: edible
##
##
## Evaluation on training data (8124 cases):
##
## Rules
## ----------------
## No Errors
##
## 2 120( 1.5%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 4208 (a): class edible
## 120 3796 (b): class poisonous
##
##
## Attribute usage:
##
## 100.00% odor
##
##
## Time: 0.0 secs
On the original dataset, we observe that C5.0, applied to the two most influential factor variables, yields results similar to OneR and classifies 98.52% of the mushrooms correctly, leaving 120 misclassified! On the complete set, C5.0 results are also similar to OneR: 98.44% of the mushrooms are classified correctly, leaving 88 mushrooms misclassified.
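The figures attached to each C5.0 rule can be checked by hand. As we understand the C5.0 documentation, `(n/m, lift x)` gives the number of training cases covered by the rule (n) and, if shown, the number of them misclassified (m); the bracketed confidence is the Laplace ratio (n - m + 1)/(n + 2); and the lift is that confidence divided by the relative frequency of the predicted class in the training data. The helper `rule_stats` below is a hypothetical name used only for this sketch.

```r
# Sketch: reproduce the confidence and lift printed next to a C5.0 rule.
# n = cases covered, m = cases misclassified, class_count = training cases
# of the predicted class, total = all training cases.
rule_stats <- function(n, m, class_count, total) {
  confidence <- (n - m + 1) / (n + 2)   # Laplace ratio
  prior <- class_count / total          # relative frequency of the class
  c(confidence = round(confidence, 3), lift = round(confidence / prior, 1))
}

# Rule 1 on the original set: 4328 cases covered, 120 wrong, predicts edible
r1 <- rule_stats(n = 4328, m = 120, class_count = 4208, total = 8124)
r1

# Rule 2 on the original set: 3796 cases covered, none wrong, predicts poisonous
r2 <- rule_stats(n = 3796, m = 0, class_count = 3916, total = 8124)
r2
```

Both results match the listing above: 0.972 with lift 1.9 for the edible rule, and 1.000 with lift 2.1 for the poisonous rule.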
The two-variable output for the complete set follows. We then apply C5.0 to all 22 variables.
mushroomc_c5rules <- C5.0(classes ~ odor + gill_size, data = mushrooms_complete, rules = TRUE)
summary(mushroomc_c5rules)
##
## Call:
## C5.0.formula(formula = classes ~ odor + gill_size, data =
## mushrooms_complete, rules = TRUE)
##
##
## C5.0 [Release 2.07 GPL Edition] Thu May 12 17:50:37 2022
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 5644 cases (3 attributes) from undefined.data
##
## Rules:
##
## Rule 1: (3576/88, lift 1.6)
## odor in {almond, fishy, musty}
## -> class edible [0.975]
##
## Rule 2: (2068, lift 2.6)
## odor in {anise, creosote, foul, none}
## -> class poisonous [1.000]
##
## Default class: edible
##
##
## Evaluation on training data (5644 cases):
##
## Rules
## ----------------
## No Errors
##
## 2 88( 1.6%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 3488 (a): class edible
## 88 2068 (b): class poisonous
##
##
## Attribute usage:
##
## 100.00% odor
##
##
## Time: 0.0 secs
mushroom_c5improved_rules <- C5.0(classes ~ ., data = mushrooms, rules = TRUE)
summary(mushroom_c5improved_rules)
##
## Call:
## C5.0.formula(formula = classes ~ ., data = mushrooms, rules = TRUE)
##
##
## C5.0 [Release 2.07 GPL Edition] Thu May 12 17:50:38 2022
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 8124 cases (22 attributes) from undefined.data
##
## Rules:
##
## Rule 1: (4148/4, lift 1.9)
## cap_surface in {fibrous, scaly, smooth}
## odor in {almond, fishy, musty}
## stalk_color_below_ring in {cinnamon, gray, pink, red, white}
## spore_print_color in {black, brown, buff, chocolate, green, purple,
## white, yellow}
## -> class edible [0.999]
##
## Rule 2: (3500/12, lift 1.9)
## cap_surface in {fibrous, scaly, smooth}
## odor in {almond, fishy, musty}
## stalk_root in {club, cup, equal, rhizomorphs}
## spore_print_color in {buff, chocolate, purple, white}
## -> class edible [0.996]
##
## Rule 3: (3796, lift 2.1)
## odor in {anise, creosote, foul, none, pungent, spicy}
## -> class poisonous [1.000]
##
## Rule 4: (72, lift 2.0)
## spore_print_color = orange
## -> class poisonous [0.986]
##
## Rule 5: (24, lift 2.0)
## stalk_color_below_ring = yellow
## -> class poisonous [0.962]
##
## Rule 6: (16, lift 2.0)
## stalk_root = bulbous
## stalk_color_below_ring = orange
## -> class poisonous [0.944]
##
## Rule 7: (4, lift 1.7)
## cap_surface = grooves
## -> class poisonous [0.833]
##
## Default class: edible
##
##
## Evaluation on training data (8124 cases):
##
## Rules
## ----------------
## No Errors
##
## 7 12( 0.1%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 4208 (a): class edible
## 12 3904 (b): class poisonous
##
##
## Attribute usage:
##
## 98.67% odor
## 52.83% spore_print_color
## 51.99% cap_surface
## 51.55% stalk_color_below_ring
## 43.28% stalk_root
##
##
## Time: 0.1 secs
mushroomc_c5improved_rules <- C5.0(classes ~ ., data = mushrooms_complete, rules = TRUE)
summary(mushroomc_c5improved_rules)
##
## Call:
## C5.0.formula(formula = classes ~ ., data = mushrooms_complete, rules = TRUE)
##
##
## C5.0 [Release 2.07 GPL Edition] Thu May 12 17:50:39 2022
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 5644 cases (22 attributes) from undefined.data
##
## Rules:
##
## Rule 1: (3488, lift 1.6)
## odor in {almond, fishy, musty}
## spore_print_color in {buff, chocolate, purple, white}
## population in {abundant, numerous, scattered, several, solitary}
## -> class edible [1.000]
##
## Rule 2: (2068, lift 2.6)
## odor in {anise, creosote, foul, none}
## -> class poisonous [1.000]
##
## Rule 3: (72, lift 2.6)
## spore_print_color = orange
## -> class poisonous [0.986]
##
## Rule 4: (52, lift 2.6)
## population = clustered
## -> class poisonous [0.981]
##
## Default class: edible
##
##
## Evaluation on training data (5644 cases):
##
## Rules
## ----------------
## No Errors
##
## 4 0( 0.0%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 3488 (a): class edible
## 2156 (b): class poisonous
##
##
## Attribute usage:
##
## 98.44% odor
## 63.08% spore_print_color
## 62.72% population
##
##
## Time: 0.0 secs
Using all 22 variables on the original dataset, C5.0 derives 7 rules and classifies all but 12 mushrooms correctly. On the complete dataset, C5.0 derives 4 rules and classifies all mushrooms correctly! We can easily chart a tree for the same model formula, using the rpart and rattle packages.
# rpart fits its own classification tree on the same formula and data
tree <- rpart(classes ~ .,
              data = mushrooms,
              method = "class",
              control = rpart.control(minsplit = 20, cp = 0))
fancyRpartPlot(tree,
               palettes = c("Greys", "Oranges"),
               cex = 0.75,
               main = "Original Mushroom Dataset",
               sub = "")
# same tree settings, applied to the complete cases
treec <- rpart(classes ~ .,
               data = mushrooms_complete,
               method = "class",
               control = rpart.control(minsplit = 20, cp = 0))
fancyRpartPlot(treec,
               palettes = c("Greys", "Oranges"),
               cex = 0.75,
               main = "Complete Mushroom Dataset",
               sub = "")
Finally, we will use PART to classify, and compare the results.
mushroom_PART_rules <- PART(classes ~ ., data = mushrooms)
mushroom_PART_rules
## PART decision list
## ------------------
##
## odor = creosote: poisonous (2160.0)
##
## gill_size = broad AND
## ring_number = one: edible (3392.0)
##
## ring_number = two AND
## spore_print_color = white: edible (528.0)
##
## odor = pungent: poisonous (576.0)
##
## odor = spicy: poisonous (576.0)
##
## stalk_shape = enlarging AND
## stalk_surface_below_ring = silky AND
## odor = none: poisonous (256.0)
##
## stalk_shape = enlarging AND
## odor = anise: poisonous (192.0)
##
## gill_size = narrow AND
## stalk_surface_above_ring = silky AND
## population = several: edible (192.0)
##
## gill_size = broad: poisonous (108.0)
##
## stalk_surface_below_ring = silky AND
## bruises = bruises: edible (60.0)
##
## stalk_surface_below_ring = smooth: poisonous (40.0)
##
## bruises = bruises: edible (36.0)
##
## : poisonous (8.0)
##
## Number of Rules : 13
mushroomc_PART_rules <- PART(classes ~ ., data = mushrooms_complete)
mushroomc_PART_rules
## PART decision list
## ------------------
##
## odor = musty AND
## ring_number = one AND
## veil_color = white AND
## gill_size = broad: edible (2496.0)
##
## odor = creosote: poisonous (1584.0)
##
## odor = almond: edible (400.0)
##
## odor = fishy: edible (400.0)
##
## odor = none: poisonous (256.0)
##
## odor = anise: poisonous (192.0)
##
## stalk_root = cup: edible (96.0)
##
## spore_print_color = orange: poisonous (72.0)
##
## stalk_root = bulbous AND
## population = several: edible (64.0)
##
## population = clustered: poisonous (52.0)
##
## : edible (32.0)
##
## Number of Rules : 11
On the original mushrooms dataset, PART classifies all mushrooms correctly but relies on 13 rules to do so. On the complete set, PART achieves the same outcome with 11 rules.
It is always interesting to compare a solution to alternatives. In this case we can refer to the original rules derived in 1997 and extracted from the documentation, whose example rule for edible mushrooms results in 48 errors, or 99.41% accuracy on the whole dataset:
res
## [1] "1) odor=NOT(almond.OR.anise.OR.none) 120 poisonous cases missed, 98.52% accuracy "
## [2] "2) spore-print-color=green 48 cases missed, 99.41% accuracy "
## [3] "3) odor=none.AND.stalk-surface-below-ring=scaly.AND. (stalk-color-above-ring=NOT.brown) 8 cases missed, 99.90% accuracy "
## [4] "4) habitat=leaves.AND.cap-color=white 100% accuracy Rule "
## [5] "4) may also be "
## [6] "4') population=clustered.AND.cap_color=white These rule involve 6 attributes (out of 22). Rules for edible mushrooms are obtained as negation of the rules given above, for example the rule: odor=(almond.OR.anise.OR.none).AND.spore-print-color=NOT.green gives 48 errors, or 99.41% accuracy on the whole dataset."
The C5.0 algorithm, applied to all 22 variables of the complete mushroom set, classifies every mushroom correctly with 4 rules. This is the best performance we achieved, with the minimum number of rules derived and the most accurate (perfect) outcome obtained on the complete dataset. It also selected only 3 variables (odor, spore_print_color and population) out of the 22 provided, compared to the referenced document, where the 4 rules involve 6 attributes and the example two-attribute rule for edible mushrooms reaches 99.41% accuracy.
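The comparison can be gathered in one table. The sketch below transcribes the rule counts and training-set errors from the outputs above into a data frame (the object names are illustrative, and OneR is counted as a single rule on one attribute), then lists the perfect classifiers with the fewest rules first.

```r
# Results transcribed from the outputs above (errors on the training data)
results <- data.frame(
  algorithm = c("OneR", "OneR", "JRip", "JRip",
                "C5.0 (odor + gill_size)", "C5.0 (odor + gill_size)",
                "C5.0 (all)", "C5.0 (all)", "PART", "PART"),
  dataset   = rep(c("original", "complete"), times = 5),
  rules     = c(1, 1, 9, 6, 2, 2, 7, 4, 13, 11),  # OneR: one rule on odor
  errors    = c(120, 88, 0, 0, 120, 88, 12, 0, 0, 0)
)

# Perfect classifiers, fewest rules first
perfect <- results[results$errors == 0, ]
ranked <- perfect[order(perfect$rules), ]
ranked
```

C5.0 on the complete set comes first with 4 rules, followed by JRip on the complete set with 6.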
We hope this typical example demonstrates that Machine Learning algorithms are well positioned to help resolve classification challenges, offering a fast, efficient and economical alternative to tedious experimentation. It is easy to imagine how similar questions can be resolved in all types of R&D, in materials, cosmetics, food or any scientific area. This second tool is certainly as useful as the formulation tool we reviewed previously.
Classifying rubber properties to meet rolling resistance and emissions targets, modern composites to build renewable energy sources or lightweight transportation vehicles and next-generation public transit, or innovative UV-shield ointments and tasty snacks and drinks… all present similar challenges, where only the nature of the inputs and outputs varies. Therefore, this method too can and should be applied broadly!
Why not try to implement Machine Learning in your scientific or technical area of expertise and boost innovation with improved Data Analytics?
The following sources are referenced as they provided significant help and information to develop this Machine Learning analysis applied to formulations: