Import the file as CSV, check the variables, and label the column headers. The first column is the class (edible or poisonous); it is followed by the 22 attributes.
The attributes will be renamed after loading:
library(devtools)
## Loading required package: usethis
library(RCurl)
## Loading required package: bitops
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
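# Read the CSV into R. The file has no header row, so with the default
# header = TRUE the first record is used as the column names (p, x, s, ...).
mydata <- read.csv(url('https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data'))
# Store the raw data in a data frame (read.csv() already returns one;
# header and sep are not as.data.frame() arguments, so they are dropped here)
df <- as.data.frame(mydata)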
summary(df)
## p x s n t p.1
## e:4208 b: 452 f:2320 n :2283 f:4748 n :3528
## p:3915 c: 4 g: 4 g :1840 t:3375 f :2160
## f:3152 s:2555 e :1500 s : 576
## k: 828 y:3244 y :1072 y : 576
## s: 32 w :1040 a : 400
## x:3655 b : 168 l : 400
## (Other): 220 (Other): 483
## f c n.1 k e e.1 s.1
## a: 210 c:6811 b:5612 b :1728 e:3515 ?:2480 f: 552
## f:7913 w:1312 n:2511 p :1492 t:4608 b:3776 k:2372
## w :1202 c: 556 s:5175
## n :1048 e:1119 y: 24
## g : 752 r: 192
## h : 732
## (Other):1169
## s.2 w w.1 p.2 w.2 o
## f: 600 w :4463 w :4383 p:8123 n: 96 n: 36
## k:2304 p :1872 p :1872 o: 96 o:7487
## s:4935 g : 576 g : 576 w:7923 t: 600
## y: 284 n : 448 n : 512 y: 8
## b : 432 b : 432
## o : 192 o : 192
## (Other): 140 (Other): 156
## p.3 k.1 s.3 u
## e:2776 w :2388 a: 384 d:3148
## f: 48 n :1968 c: 340 g:2148
## l:1296 k :1871 n: 400 l: 832
## n: 36 h :1632 s:1247 m: 292
## p:3967 r : 72 v:4040 p:1144
## b : 48 y:1712 u: 367
## (Other): 144 w: 192
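The column names in the summary above (p, x, s, n, ...) and the row count of 8123 reported later are what read.csv() produces when a file without a header row is read with the default header = TRUE. A minimal sketch of the effect, using a three-record inline string as a stand-in for the real file (the string and the variable names are illustrative assumptions):

```r
# Inline stand-in for a header-less file: three records, three fields
csv_text <- "p,x,s\ne,x,s\ne,b,s"

default_read  <- read.csv(text = csv_text)                  # header = TRUE by default
explicit_read <- read.csv(text = csv_text, header = FALSE)  # keep every record

nrow(default_read)    # 2: the first record became the column names
names(default_read)   # "p" "x" "s"
nrow(explicit_read)   # 3: all records are kept
names(explicit_read)  # "V1" "V2" "V3"
```

Passing header = FALSE to the original read.csv() call would keep all 8124 records, but the outputs in this report were produced without it.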
The data: The data set contains 8124 observations of 22 attributes (8123 rows appear in the outputs here because the first record was read as the header row). Each row represents a gilled mushroom sample from the Agaricus and Lepiota family in the United States, described by shape, color, odor, population, ring, and habitat, to name a few.
All 22 attributes are nominal (categorical), which lends the data well to classification questions.
We will perform some simple exploratory analysis and pick a few attributes to see how they relate to the response variable, edible vs. poisonous. Specifically, we will check whether certain attributes are associated with the response strongly enough to classify a mushroom as edible or poisonous.
The first task is some data munging: labelling the class column as “edible” or “poisonous” and giving each attribute (explanatory variable) column a descriptive name.
## class cap-shape cap-surface cap-color bruises ? odor gill-attachment
## 1 edible x s y t a f
## 2 edible b s w t l f
## 3 poisonous x y w t p f
## 4 edible x s g f n f
## 5 edible x y y t a f
## 6 edible b s w t a f
## gill-spacing gill-size gill-color Stalk-shape Stalk-root
## 1 c b k e c
## 2 c b n e c
## 3 c n n e e
## 4 w b k t e
## 5 c b n e c
## 6 c b g e c
## Stalk-surface-above-ring Stalk-surface-below-ring Stalk-color-above-ring
## 1 s s w
## 2 s s w
## 3 s s w
## 4 s s w
## 5 s s w
## 6 s s w
## Stalk-color-below-ring veil-type veil-color ring-number ring-type
## 1 w p w o p
## 2 w p w o p
## 3 w p w o p
## 4 w p w o e
## 5 w p w o p
## 6 w p w o p
## spore-print-color population habitat
## 1 n n g
## 2 n n m
## 3 k s u
## 4 n a g
## 5 k n g
## 6 k n m
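The code for this labelling step is not shown in the report. One way it could be done is sketched below, assuming the column labels printed above and using the first three printed rows as a stand-in for the full data (the names raw and col_names are illustrative, and reading without a header is an assumption):

```r
# Column labels exactly as they appear in the printed output above
col_names <- c("class", "cap-shape", "cap-surface", "cap-color", "bruises ?",
               "odor", "gill-attachment", "gill-spacing", "gill-size",
               "gill-color", "Stalk-shape", "Stalk-root",
               "Stalk-surface-above-ring", "Stalk-surface-below-ring",
               "Stalk-color-above-ring", "Stalk-color-below-ring",
               "veil-type", "veil-color", "ring-number", "ring-type",
               "spore-print-color", "population", "habitat")

# Stand-in for the raw file: the first three rows shown above, in coded form
raw <- read.csv(text = "e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u",
                header = FALSE, colClasses = "factor")

# Attach the descriptive column names
names(raw) <- col_names

# Relabel the class codes: e -> edible, p -> poisonous
raw$class <- factor(raw$class, levels = c("e", "p"),
                    labels = c("edible", "poisonous"))

head(raw[, 1:6])
```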
Check whether the data is imbalanced:
count_edible <- table(df$"class")
any(is.na(df)) # Check for NA values; the "?" entries (e.g. 2480 in Stalk-root) are not counted as NA
## [1] FALSE
count_edible
##
## edible poisonous
## 4208 3915
This is quite a balanced set, with almost equal numbers of edible and poisonous mushrooms.
Now some more exploratory analysis to look at the structure of the data:
dim(df)
## [1] 8123 23
str(df)
## 'data.frame': 8123 obs. of 23 variables:
## $ class : Factor w/ 2 levels "edible","poisonous": 1 1 2 1 1 1 1 2 1 1 ...
## $ cap-shape : Factor w/ 6 levels "b","c","f","k",..: 6 1 6 6 6 1 1 6 1 6 ...
## $ cap-surface : Factor w/ 4 levels "f","g","s","y": 3 3 4 3 4 3 4 4 3 4 ...
## $ cap-color : Factor w/ 10 levels "b","c","e","g",..: 10 9 9 4 10 9 9 9 10 10 ...
## $ bruises ? : Factor w/ 2 levels "f","t": 2 2 2 1 2 2 2 2 2 2 ...
## $ odor : Factor w/ 9 levels "a","c","f","l",..: 1 4 7 6 1 1 4 7 1 4 ...
## $ gill-attachment : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
## $ gill-spacing : Factor w/ 2 levels "c","w": 1 1 1 2 1 1 1 1 1 1 ...
## $ gill-size : Factor w/ 2 levels "b","n": 1 1 2 1 1 1 1 2 1 1 ...
## $ gill-color : Factor w/ 12 levels "b","e","g","h",..: 5 6 6 5 6 3 6 8 3 3 ...
## $ Stalk-shape : Factor w/ 2 levels "e","t": 1 1 1 2 1 1 1 1 1 1 ...
## $ Stalk-root : Factor w/ 5 levels "?","b","c","e",..: 3 3 4 4 3 3 3 4 3 3 ...
## $ Stalk-surface-above-ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ Stalk-surface-below-ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ Stalk-color-above-ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ Stalk-color-below-ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ veil-type : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
## $ veil-color : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ ring-number : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
## $ ring-type : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 1 5 5 5 5 5 5 ...
## $ spore-print-color : Factor w/ 9 levels "b","h","k","n",..: 4 4 3 4 3 3 4 3 3 4 ...
## $ population : Factor w/ 6 levels "a","c","n","s",..: 3 3 4 1 3 3 4 5 4 3 ...
## $ habitat : Factor w/ 7 levels "d","g","l","m",..: 2 4 6 2 2 4 4 2 4 2 ...
classtable <- table(df$"class")
classtable
##
## edible poisonous
## 4208 3915
classfreqs <- classtable/sum(classtable)
classfreqs
##
## edible poisonous
## 0.5180352 0.4819648
The pie chart below shows that the response variable is very well balanced. This is convenient, because many classification data sets are naturally imbalanced and need extra work to compensate, such as the modified synthetic minority oversampling technique (MSMOTE) or a similar resampling method.
slices <- c(52, 48) # rounded class percentages from classfreqs
lbls <- c("Edible", "Poisonous")
pct <- round(slices/sum(slices)*100)
lbls <- paste(lbls, pct) # add the percentages to the labels
lbls <- paste(lbls, "%", sep="") # add the % sign
pie(slices, labels = lbls, col = rainbow(length(lbls)), main = "Pie Chart of Edible vs. Poisonous Mushrooms")
Let’s use habitat as one of the explanatory variables to see how it relates to the response variable (edible vs. poisonous), giving its levels meaningful labels along the way.
habitat <- subset(df, select=c("class","habitat"))
habitat <- table(habitat)
head(habitat)
## habitat
## class d g l m p u w
## edible 1880 1408 240 256 136 96 192
## poisonous 1268 740 592 36 1008 271 0
# Rename the columns to something more meaningful
colnames(habitat)[colnames(habitat)=="d"] <- "woods"
colnames(habitat)[colnames(habitat)=="g"] <- "grasses"
colnames(habitat)[colnames(habitat)=="l"] <- "leaves"
colnames(habitat)[colnames(habitat)=="m"] <- "meadows"
colnames(habitat)[colnames(habitat)=="p"] <- "paths"
colnames(habitat)[colnames(habitat)=="u"] <- "urban"
colnames(habitat)[colnames(habitat)=="w"] <- "waste"
habitat
## habitat
## class woods grasses leaves meadows paths urban waste
## edible 1880 1408 240 256 136 96 192
## poisonous 1268 740 592 36 1008 271 0
summary(habitat)
## Number of cases in table: 8123
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 1573, df = 6, p-value = 0
chisq.test(habitat)
##
## Pearson's Chi-squared test
##
## data: habitat
## X-squared = 1573, df = 6, p-value < 2.2e-16
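With more than 8000 samples, almost any difference between the classes yields a tiny p-value, so the test alone says little about how strong the association is. As a complement that is not part of the original analysis, Cramér's V rescales the chi-squared statistic to the range 0 to 1. A sketch using the habitat counts printed above (the variable names are illustrative):

```r
# Habitat counts copied from the contingency table printed above
habitat_counts <- matrix(
  c(1880, 1408, 240, 256,  136,  96, 192,
    1268,  740, 592,  36, 1008, 271,   0),
  nrow = 2, byrow = TRUE,
  dimnames = list(class   = c("edible", "poisonous"),
                  habitat = c("woods", "grasses", "leaves", "meadows",
                              "paths", "urban", "waste"))
)

test <- chisq.test(habitat_counts)
n <- sum(habitat_counts)            # total number of samples
k <- min(dim(habitat_counts)) - 1   # min(rows - 1, columns - 1)

# Cramer's V = sqrt(X^2 / (n * k))
cramers_v <- sqrt(unname(test$statistic) / (n * k))
round(cramers_v, 2)   # about 0.44
```

A value near 0 would indicate almost no association and a value near 1 an almost deterministic one, so habitat on its own sits in between.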
Let’s create a bar plot of the habitat attribute to see its relationship to the response variable:
barplot(habitat,
        main = "Habitat of Each Class",
        ylab = "Count",
        xlab = "Habitats",
        col = c("green","red"),   # rows of the table: edible, poisonous
        ylim = c(0, 2000),
        beside = TRUE
)
legend("topright",
       c("poisonous","edible"),
       fill = c("red","green")
)
Observation 1a: Habitat is strongly indicative of edibility when the mushroom comes from meadows (256 edible vs. 36 poisonous) or waste (192 edible, none poisonous).
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
head(summary(df))
## class cap-shape cap-surface cap-color bruises ?
## edible :4208 b: 452 f:2320 n :2283 f:4748
## poisonous:3915 c: 4 g: 4 g :1840 t:3375
## f:3152 s:2555 e :1500
## k: 828 y:3244 y :1072
## s: 32 w :1040
## x:3655 b : 168
## odor gill-attachment gill-spacing gill-size gill-color
## n :3528 a: 210 c:6811 b:5612 b :1728
## f :2160 f:7913 w:1312 n:2511 p :1492
## s : 576 w :1202
## y : 576 n :1048
## a : 400 g : 752
## l : 400 h : 732
## Stalk-shape Stalk-root Stalk-surface-above-ring Stalk-surface-below-ring
## e:3515 ?:2480 f: 552 f: 600
## t:4608 b:3776 k:2372 k:2304
## c: 556 s:5175 s:4935
## e:1119 y: 24 y: 284
## r: 192
##
## Stalk-color-above-ring Stalk-color-below-ring veil-type veil-color
## w :4463 w :4383 p:8123 n: 96
## p :1872 p :1872 o: 96
## g : 576 g : 576 w:7923
## n : 448 n : 512 y: 8
## b : 432 b : 432
## o : 192 o : 192
## ring-number ring-type spore-print-color population habitat
## n: 36 e:2776 w :2388 a: 384 d:3148
## o:7487 f: 48 n :1968 c: 340 g:2148
## t: 600 l:1296 k :1871 n: 400 l: 832
## n: 36 h :1632 s:1247 m: 292
## p:3967 r : 72 v:4040 p:1144
## b : 48 y:1712 u: 367
habitat1 <- subset(df, select=c("class","habitat"))
head(habitat1)
## class habitat
## 1 edible g
## 2 edible m
## 3 poisonous u
## 4 edible g
## 5 edible g
## 6 edible m
# Rename the habitat levels to descriptive labels
habitat1$"habitat" <- revalue(habitat1$"habitat",
    c("d"="woods", "g"="grasses", "l"="leaves", "m"="meadows",
      "p"="paths", "u"="urban", "w"="waste"))
head(habitat1)
## class habitat
## 1 edible grasses
## 2 edible meadows
## 3 poisonous urban
## 4 edible grasses
## 5 edible grasses
## 6 edible meadows
ggplot(data = habitat1, aes(x = habitat)) +
  geom_bar(color = "darkblue", fill = "lightblue") +   # bar chart of counts per level
  facet_wrap(~class) +
  xlab("Habitats")
Observation 1b: Only one habitat level is exclusive to a single class: if the habitat is w (waste), the mushroom is edible (192 edible samples, no poisonous ones).
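The habitat observations can also be read off numerically. A small sketch, assuming the counts from the habitat table printed earlier, that computes the share of each class within every habitat and lists the habitats in which only one class occurs (variable names are illustrative):

```r
# Habitat counts copied from the contingency table printed earlier
habitat_counts <- matrix(
  c(1880, 1408, 240, 256,  136,  96, 192,
    1268,  740, 592,  36, 1008, 271,   0),
  nrow = 2, byrow = TRUE,
  dimnames = list(class   = c("edible", "poisonous"),
                  habitat = c("woods", "grasses", "leaves", "meadows",
                              "paths", "urban", "waste"))
)

# Share of each class within a habitat (each column sums to 1)
share <- prop.table(habitat_counts, margin = 2)
round(100 * share, 1)

# Habitats where one of the two classes never occurs
exclusive <- colnames(habitat_counts)[apply(habitat_counts == 0, 2, any)]
exclusive   # "waste"
```

In this table meadows come out at about 87.7% edible and waste at 100% edible; every other habitat contains a substantial share of both classes.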
Let’s pick another attribute, cap shape, and repeat the steps above to see whether it shows the same kind of exclusivity.
cs <-subset(df, select=c("class","cap-shape"))
cs <-table(cs)
head(cs)
## cap-shape
## class b c f k s x
## edible 404 0 1596 228 32 1948
## poisonous 48 4 1556 600 0 1707
# Rename a column in R
colnames(cs)[colnames(cs)=="b"] <- "bell"
colnames(cs)[colnames(cs)=="c"] <- "conical"
colnames(cs)[colnames(cs)=="x"] <- "convex"
colnames(cs)[colnames(cs)=="f"] <- "flat"
colnames(cs)[colnames(cs)=="k"] <- "knobbed"
colnames(cs)[colnames(cs)=="s"] <- "sunken"
cs
## cap-shape
## class bell conical flat knobbed sunken convex
## edible 404 0 1596 228 32 1948
## poisonous 48 4 1556 600 0 1707
summary(cs)
## Number of cases in table: 8123
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 490, df = 5, p-value = 1.157e-103
## Chi-squared approximation may be incorrect
chisq.test(cs)
## Warning in chisq.test(cs): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: cs
## X-squared = 489.99, df = 5, p-value < 2.2e-16
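The warning appears because the chi-squared approximation relies on reasonably large expected counts, and the conical shape has only 4 samples in total. A sketch that inspects the expected counts for the cap-shape table above and, as one possible alternative that is not used in the original analysis, requests a simulated p-value (variable names are illustrative):

```r
# Cap-shape counts copied from the contingency table printed above
cap_counts <- matrix(
  c(404, 0, 1596, 228, 32, 1948,
     48, 4, 1556, 600,  0, 1707),
  nrow = 2, byrow = TRUE,
  dimnames = list(class     = c("edible", "poisonous"),
                  cap_shape = c("bell", "conical", "flat", "knobbed",
                                "sunken", "convex"))
)

# Expected counts under independence; the warning is suppressed on purpose
test <- suppressWarnings(chisq.test(cap_counts))
expected <- test$expected
round(expected, 1)

# Cap shapes with an expected count below 5 in at least one class
low_cols <- colnames(expected)[apply(expected < 5, 2, any)]
low_cols   # "conical"

# Monte Carlo p-value, which does not rely on the large-sample approximation
set.seed(1)
chisq.test(cap_counts, simulate.p.value = TRUE, B = 2000)
```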
Again, plot a bar chart to visualize the relationship to the response variable:
barplot(cs,
        main = "Cap Shape of Each Class",
        ylab = "Count",
        xlab = "Cap-Shape",
        col = c("orange","blue"),   # rows of the table: edible, poisonous
        ylim = c(0, 2000),
        beside = TRUE
)
legend("topleft",
       c("poisonous","edible"),
       fill = c("blue","orange")
)
Observation 2a: The information of interest is which attribute values have counts in only one class, edible or poisonous. The bell and sunken cap shapes look like possible indicators of edibility.
cs <-subset(df, select=c("class","cap-shape"))
head(cs)
## class cap-shape
## 1 edible x
## 2 edible b
## 3 poisonous x
## 4 edible x
## 5 edible x
## 6 edible b
# Rename the cap-shape levels to descriptive labels
cs$"cap-shape" <- revalue(cs$"cap-shape",
    c("b"="bell", "c"="conical", "x"="convex", "f"="flat",
      "k"="knobbed", "s"="sunken"))
head(cs)
## class cap-shape
## 1 edible convex
## 2 edible bell
## 3 poisonous convex
## 4 edible convex
## 5 edible convex
## 6 edible bell
ggplot(data = cs, aes(x = `cap-shape`)) +
  geom_bar(color = "darkblue", fill = "lightblue") +   # bar chart of counts per level
  facet_wrap(~class) +
  xlab("Cap Shape")
Observation 2b:
The conical shape is present only in poisonous mushrooms, whereas sunken is present only in edible ones, confirming the earlier bar plot; these two levels can serve as possible indicators. Both are rare (4 and 32 samples), so on their own they classify only a small part of the data.
In general, one should go through all of the attributes to find those that help decide whether a given mushroom is edible or poisonous.
So far, we have not found a “slam dunk” attribute whose levels separate the two classes cleanly.
Let’s try a third attribute, odor, and see what kind of classification power it has.
od <-subset(df, select=c("class","odor"))
head(od)
## class odor
## 1 edible a
## 2 edible l
## 3 poisonous p
## 4 edible n
## 5 edible a
## 6 edible a
# Rename the odor levels to descriptive labels
od$"odor" <- revalue(od$"odor",
    c("a"="almond", "l"="anise", "c"="creosote", "y"="fishy", "f"="foul",
      "m"="musty", "n"="none", "p"="pungent", "s"="spicy"))
head(od)
## class odor
## 1 edible almond
## 2 edible anise
## 3 poisonous pungent
## 4 edible none
## 5 edible almond
## 6 edible almond
ggplot(data = od, aes(x = odor)) +
  geom_bar(color = "darkblue", fill = "lightblue") +   # bar chart of counts per level
  facet_wrap(~class) +
  xlab("Odor")
Observation 3: The odor attribute shows more exclusiveness than any other attribute examined so far. If the odor is almond or anise, the mushroom is edible; if the odor is creosote, foul, musty, pungent, spicy, or fishy, the mushroom belongs to the poisonous class. This is the best example yet of what this analysis is looking for: of the attributes examined, odor appears to have the strongest association with the class.
The “decoding” factor is to find attribute values that belong to one class (edible) and not the other (poisonous). This exclusiveness can be used as a guidepost for classifying the mushroom.
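The search for exclusive attribute values can be automated instead of read off plots one attribute at a time. The sketch below is an illustration rather than the method used in this report: it runs on a small made-up data frame (toy), and exclusive_levels is a hypothetical helper that reports, for each level of an attribute, the single class it occurs in, or NA when it occurs in both.

```r
# Toy data, invented for illustration only (not the mushroom data)
toy <- data.frame(
  class     = c("edible", "edible", "edible", "poisonous", "poisonous", "poisonous"),
  odor      = c("almond", "none", "anise", "foul", "none", "pungent"),
  veil_type = c("p", "p", "p", "p", "p", "p"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: class label for levels seen in exactly one class
exclusive_levels <- function(attribute, class) {
  tab <- table(attribute, class)
  only_one <- rowSums(tab > 0) == 1                 # level occurs in one class only
  winner <- colnames(tab)[apply(tab, 1, which.max)] # class with the most samples
  setNames(ifelse(only_one, winner, NA_character_), rownames(tab))
}

# Apply the helper to every attribute except the class itself
res <- lapply(toy[setdiff(names(toy), "class")], exclusive_levels, class = toy$class)
res
```

On the toy data, almond and anise map to edible, foul and pungent map to poisonous, and none and the single veil_type level come back as NA because they occur in both classes.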
summary(df)
## class cap-shape cap-surface cap-color bruises ?
## edible :4208 b: 452 f:2320 n :2283 f:4748
## poisonous:3915 c: 4 g: 4 g :1840 t:3375
## f:3152 s:2555 e :1500
## k: 828 y:3244 y :1072
## s: 32 w :1040
## x:3655 b : 168
## (Other): 220
## odor gill-attachment gill-spacing gill-size gill-color
## n :3528 a: 210 c:6811 b:5612 b :1728
## f :2160 f:7913 w:1312 n:2511 p :1492
## s : 576 w :1202
## y : 576 n :1048
## a : 400 g : 752
## l : 400 h : 732
## (Other): 483 (Other):1169
## Stalk-shape Stalk-root Stalk-surface-above-ring Stalk-surface-below-ring
## e:3515 ?:2480 f: 552 f: 600
## t:4608 b:3776 k:2372 k:2304
## c: 556 s:5175 s:4935
## e:1119 y: 24 y: 284
## r: 192
##
##
## Stalk-color-above-ring Stalk-color-below-ring veil-type veil-color
## w :4463 w :4383 p:8123 n: 96
## p :1872 p :1872 o: 96
## g : 576 g : 576 w:7923
## n : 448 n : 512 y: 8
## b : 432 b : 432
## o : 192 o : 192
## (Other): 140 (Other): 156
## ring-number ring-type spore-print-color population habitat
## n: 36 e:2776 w :2388 a: 384 d:3148
## o:7487 f: 48 n :1968 c: 340 g:2148
## t: 600 l:1296 k :1871 n: 400 l: 832
## n: 36 h :1632 s:1247 m: 292
## p:3967 r : 72 v:4040 p:1144
## b : 48 y:1712 u: 367
## (Other): 144 w: 192
m1 <- ggplot(df, aes(x = `veil-type`)) +
  geom_bar(color = "darkblue", fill = "lightblue") +   # bar chart of counts per level
  facet_wrap(~class) +
  xlab("Veil Type")
m2 <- ggplot(df, aes(x = `veil-color`)) +
  geom_bar(color = "darkblue", fill = "lightblue") +
  facet_wrap(~class) +
  xlab("Veil Color")
grid.arrange(m1, m2, ncol = 2)
Observation 4a:
The veil type contributes nothing to deciding the class of the mushroom, because every sample has the same value (p). It is a good example of data that is present but useless for classification, and such attributes are better removed. Removing uninformative attributes (a simple form of dimensionality reduction) speeds up the analysis when a data set has many attributes or observations, and it reduces noise that is unrelated to the question.
Observation 4b: The veil colors n (brown) and o (orange) suggest that the mushroom is edible, and y (yellow) suggests that it is poisonous.
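Attributes with a single level, like veil-type, can be detected and dropped programmatically. A minimal sketch on a made-up three-row data frame (toy and its column names are assumptions for illustration, not the report's data):

```r
# Toy data, invented for illustration only
toy <- data.frame(
  class      = c("edible", "poisonous", "edible"),
  veil_type  = c("p", "p", "p"),    # constant: carries no information
  veil_color = c("w", "y", "n"),
  stringsAsFactors = TRUE
)

# TRUE for every column that takes fewer than two distinct values
is_constant <- vapply(toy, function(col) length(unique(col)) < 2, logical(1))
names(toy)[is_constant]   # "veil_type"

# Keep only the columns that vary
reduced <- toy[, !is_constant, drop = FALSE]
names(reduced)            # "class" "veil_color"
```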
Finally, the next plots look for combined exclusivity: whether a pair of attributes (from odor, cap shape, and habitat) decides the edibility of a mushroom. Each plot takes two attributes together with the class, and the observations are noted after each plot. Let’s see…
cor1 <-subset(df, select=c("class","odor","cap-shape","habitat"))
head(cor1)
## class odor cap-shape habitat
## 1 edible a x g
## 2 edible l b m
## 3 poisonous p x u
## 4 edible n x g
## 5 edible a x g
## 6 edible a b m
# Rename the attribute levels to descriptive labels
cor1$"odor" <- revalue(cor1$"odor",
    c("a"="almond", "l"="anise", "c"="creosote", "y"="fishy", "f"="foul",
      "m"="musty", "n"="none", "p"="pungent", "s"="spicy"))
cor1$"habitat" <- revalue(cor1$"habitat",
    c("d"="woods", "g"="grasses", "l"="leaves", "m"="meadows",
      "p"="paths", "u"="urban", "w"="waste"))
cor1$"cap-shape" <- revalue(cor1$"cap-shape",
    c("b"="bell", "c"="conical", "x"="convex", "f"="flat",
      "k"="knobbed", "s"="sunken"))
head(cor1)
## class odor cap-shape habitat
## 1 edible almond convex grasses
## 2 edible anise bell meadows
## 3 poisonous pungent convex urban
## 4 edible none convex grasses
## 5 edible almond convex grasses
## 6 edible almond bell meadows
ggplot(cor1, aes(x = odor, y = `cap-shape`)) +
  geom_point(aes(shape = class, color = class), size = 4.5) +
  scale_shape_manual(values = c('+', 'x')) +
  scale_colour_manual(values = c("green", "red"))
Observation 5:
For edibility: the odor is almond or anise and the cap shape is convex, flat, or bell.
For poisonous: the odor is foul, pungent, spicy, or fishy and the cap shape is convex, knobbed, or flat.
Ambiguous: when the mushroom has no odor and the cap shape is convex, sunken, knobbed, flat, conical, or bell, it is really hard to tell edible from poisonous.
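The plot shows which classes occur for each pair of values but not how many samples are behind each point. As a complement, the same question can be answered with a cross-tabulation that labels every observed combination as edible only, poisonous only, or ambiguous. A sketch on a made-up data frame (toy, cap_shape, and verdict are illustrative names, and the rows are invented):

```r
# Toy data, invented for illustration only
toy <- data.frame(
  class     = c("edible", "edible", "poisonous", "edible", "poisonous", "poisonous"),
  odor      = c("almond", "none", "none", "anise", "foul", "foul"),
  cap_shape = c("convex", "flat", "flat", "bell", "convex", "flat"),
  stringsAsFactors = TRUE
)

# One factor level per observed (odor, cap shape) combination
combo <- interaction(toy$odor, toy$cap_shape, sep = " / ", drop = TRUE)

# Counts of each class within every combination
tab <- table(combo, class = toy$class)

# Label each combination by the classes that occur in it
verdict <- ifelse(tab[, "edible"] > 0 & tab[, "poisonous"] > 0, "ambiguous",
           ifelse(tab[, "edible"] > 0, "edible only", "poisonous only"))
verdict
```

On the toy rows, the combination none / flat is labelled ambiguous, while almond / convex is edible only and foul / convex is poisonous only.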
ggplot(cor1, aes(x = `cap-shape`, y = habitat)) +
  geom_point(aes(shape = class, color = class), size = 4.5) +
  scale_shape_manual(values = c('+', 'x')) +
  scale_colour_manual(values = c("green", "red"))
Observation 6: For edibility: the habitat is waste (surprising, to say the least) and the cap shape is flat, knobbed, or convex.
All other combinations are ambiguous; they show no pattern.
Lastly, let’s see whether habitat and odor together can pinpoint whether a mushroom is edible.
ggplot(cor1, aes(x = habitat, y = odor)) +
  geom_point(aes(shape = class, color = class), size = 4.5) +
  scale_shape_manual(values = c('+', 'x')) +
  scale_colour_manual(values = c("green", "red"))
Observation 7: No clear relationship emerges from this plot, so odor and habitat together show no clear association with the class.
In summary: The Guide was correct in stating that there is no simple, fast rule for classifying the edibility of mushrooms. There are more attributes left to test, but across the three examined here (odor, cap shape, and habitat) no clear-cut rule emerged. Odor comes closest, since several of its levels are exclusive to one class. Finally, veil-type has no classification power at all and should be removed as an attribute, which reduces unnecessary noise.