The analysis presented in this article uses the mushroom dataset obtained from https://www.kaggle.com/uciml/mushroom-classification/data. The data consists of 23 columns: the edibility class of a mushroom and 22 attributes describing it. The aim is to find which of these attributes are most relevant, so that the dimensionality can be reduced. The analysis proceeds in three steps: finding how exclusive each attribute's categories are to one class, measuring the association between each attribute and the edibility, and finally estimating how much each attribute contributes to the edibility of a mushroom.
The dataset consists of 8124 observations with 23 columns. The analysis studies columns 2 to 23 and estimates the weight of each attribute. Every entry in the dataset is a single character with a fixed meaning. The table below lists each attribute with its possible categories and their single-letter codes.
| Name of the attribute | Categories and codes |
|---|---|
| class | edible = e , poisonous = p |
| cap shape | bell = b , conical = c , convex = x, flat = f, knobbed = k, sunken = s |
| cap surface | fibrous = f, grooves = g, scaly = y, smooth = s |
| cap color | brown = n, buff = b, cinnamon = c, gray = g, green = r, pink = p, purple = u, red = e, white = w, yellow = y |
| bruises | bruises = t, no = f |
| odor | almond = a, anise = l, creosote = c, fishy = y, foul = f, musty = m, none = n, pungent = p, spicy = s |
| gill attachment | attached = a, descending = d, free = f, notched = n |
| gill spacing | close = c, crowded = w, distant = d |
| gill size | broad = b, narrow = n |
| gill color | black = k, brown = n, buff = b, chocolate = h, gray = g, green = r, orange = o, pink = p, purple = u, red = e, white = w, yellow = y |
| stalk shape | enlarging = e, tapering = t |
| stalk root | bulbous = b, club = c, cup = u, equal = e, rhizomorphs = z, rooted = r, missing = ? |
| stalk surface above ring | fibrous = f, scaly = y, silky = k, smooth = s |
| stalk surface below ring | fibrous = f, scaly = y, silky = k, smooth = s |
| stalk color above ring | brown = n, buff = b, cinnamon = c, gray = g, orange = o, pink = p, red = e, white = w, yellow = y |
| stalk color below ring | brown = n, buff = b, cinnamon = c, gray = g, orange = o, pink = p, red = e, white = w, yellow = y |
| veil type | partial = p, universal = u |
| veil color | brown = n, orange = o, white = w, yellow = y |
| ring number | none = n, one = o, two = t |
| ring type | cobwebby = c, evanescent = e, flaring = f, large = l, none = n, pendant = p, sheathing = s, zone = z |
| spore print color | black = k, brown = n, buff = b, chocolate = h, green = r, orange = o, purple = u, white = w, yellow = y |
| population | abundant = a, clustered = c, numerous = n, scattered = s, several = v, solitary = y |
| habitat | grasses = g, leaves = l, meadows = m, paths = p, urban = u, waste = w, woods = d |
After loading the dataset, the names of the columns (the attributes) can be listed.
# stringsAsFactors = TRUE keeps every column a factor (the default changed in R 4.0.0)
mush <- read.csv('~/Documents/Rdataset/mushrooms.csv', stringsAsFactors = TRUE)
names(mush)
## [1] "class" "cap.shape"
## [3] "cap.surface" "cap.color"
## [5] "bruises" "odor"
## [7] "gill.attachment" "gill.spacing"
## [9] "gill.size" "gill.color"
## [11] "stalk.shape" "stalk.root"
## [13] "stalk.surface.above.ring" "stalk.surface.below.ring"
## [15] "stalk.color.above.ring" "stalk.color.below.ring"
## [17] "veil.type" "veil.color"
## [19] "ring.number" "ring.type"
## [21] "spore.print.color" "population"
## [23] "habitat"
Next, the summary of the dataset shows each attribute together with the count of every category.
summary(mush)
## class cap.shape cap.surface cap.color bruises odor
## e:4208 b: 452 f:2320 n :2284 f:4748 n :3528
## p:3916 c: 4 g: 4 g :1840 t:3376 f :2160
## f:3152 s:2556 e :1500 s : 576
## k: 828 y:3244 y :1072 y : 576
## s: 32 w :1040 a : 400
## x:3656 b : 168 l : 400
## (Other): 220 (Other): 484
## gill.attachment gill.spacing gill.size gill.color stalk.shape
## a: 210 c:6812 b:5612 b :1728 e:3516
## f:7914 w:1312 n:2512 p :1492 t:4608
## w :1202
## n :1048
## g : 752
## h : 732
## (Other):1170
## stalk.root stalk.surface.above.ring stalk.surface.below.ring
## ?:2480 f: 552 f: 600
## b:3776 k:2372 k:2304
## c: 556 s:5176 s:4936
## e:1120 y: 24 y: 284
## r: 192
##
##
## stalk.color.above.ring stalk.color.below.ring veil.type veil.color
## w :4464 w :4384 p:8124 n: 96
## p :1872 p :1872 o: 96
## g : 576 g : 576 w:7924
## n : 448 n : 512 y: 8
## b : 432 b : 432
## o : 192 o : 192
## (Other): 140 (Other): 156
## ring.number ring.type spore.print.color population habitat
## n: 36 e:2776 w :2388 a: 384 d:3148
## o:7488 f: 48 n :1968 c: 340 g:2148
## t: 600 l:1296 k :1872 n: 400 l: 832
## n: 36 h :1632 s:1248 m: 292
## p:3968 r : 72 v:4040 p:1144
## b : 48 y:1712 u: 368
## (Other): 144 w: 192
Next, I have plotted histograms (bar charts of counts) of each attribute, split into two panels according to edibility. The objective of doing so is to find the categories that occur in only one class. Greater exclusiveness hints at a stronger association between the attribute and the edibility of the mushroom. The first three attributes, cap shape, cap surface and cap color, are plotted below.
library(ggplot2)
library(gridExtra)
m1 <- ggplot(aes(x = cap.shape), data = mush) +
geom_histogram(stat = "count") +
facet_wrap(~class) +
xlab("Cap Shape")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
m2 <- ggplot(aes(x = cap.surface), data = mush) +
geom_histogram(stat = "count") +
facet_wrap(~class) +
xlab("Cap Surface")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
m3 <- ggplot(aes(x = cap.color), data = mush) +
geom_histogram(stat = "count") +
facet_wrap(~class) +
xlab("Cap Color")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
grid.arrange(m1, m2, m3, ncol = 2)
The information of interest is which categories have counts in only one of the two panels, edible or poisonous. In Cap Shape, c (conical) is present only in poisonous whereas s (sunken) is present only in edible. Similarly, for Cap Surface, g (grooves) is present only in poisonous. In Cap Color, r (green) and u (purple) are only present in poisonous. This exclusiveness means, for example, that if the cap shape is conical the mushroom is poisonous. In summary, the aim is to find the attributes that play a role in deciding whether a given mushroom is edible or poisonous.
m4 <- ggplot(aes(x = bruises), data = mush) +
geom_histogram(stat = "count") +
facet_wrap(~class) +
xlab("Bruises")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
m5 <- ggplot(aes(x = odor), data = mush) +
geom_histogram(stat = "count") +
facet_wrap(~class) +
xlab("Odor")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
grid.arrange(m4, m5, ncol = 2)
The bruises attribute is distributed across both classes and does not decide the class on its own. The odor attribute shows more exclusiveness than any other attribute encountered so far. If the odor is a (almond) or l (anise) the mushroom is edible, and if the odor is c (creosote), f (foul), m (musty), p (pungent), s (spicy) or y (fishy) the mushroom belongs to the poisonous class. Only n (none) appears in both classes. This is the best example of what this analysis is looking for, and odor appears to carry the most weight in deciding the class of the mushroom.
m6 <- ggplot(aes(x = gill.attachment), data = mush) +
geom_histogram(stat = "count") +
facet_wrap(~class) +
xlab("Gill Attachment")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
m7 <- ggplot(aes(x = gill.spacing), data = mush) +
geom_histogram(stat = "count") +
facet_wrap(~class) +
xlab("Gill Spacing")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
m8 <- ggplot(aes(x = gill.size), data = mush) +
geom_histogram(stat = "count") +
facet_wrap(~class) +
xlab("Gill Size")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
m9 <- ggplot(aes(x = gill.color), data = mush) +
geom_histogram(stat = "count") +
facet_wrap(~class) +
xlab("Gill Color")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
grid.arrange(m6, m7, m8, m9, ncol = 2)
Gill attachment, gill spacing and gill size are distributed across both classes with no exclusiveness at all. Gill color does show exclusiveness: b (buff), e (red) and o (orange) imply edible and r (green) implies poisonous. The remaining colors occur in both classes, but the contribution of gill color cannot be ignored.
m10 <- ggplot(aes(x = stalk.shape), data = mush) +
geom_histogram(stat = "count") +
facet_wrap(~class) +
xlab("Stalk Shape")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
m11 <- ggplot(aes(x = stalk.root), data = mush) +
geom_histogram(stat = "count") +
facet_wrap(~class) +
xlab("Stalk Root")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
grid.arrange(m10, m11, ncol = 2)
The stalk shape does not seem to help in deciding the class of the mushroom. On the other hand, a stalk root of r (rooted) implies a poisonous mushroom.
m12 <- ggplot(aes(x = stalk.surface.above.ring), data = mush) +
geom_histogram(stat = "count") +
facet_wrap(~class) +
xlab("Stalk Surface Above Ring")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
m13 <- ggplot(aes(x = stalk.surface.below.ring), data = mush) +
geom_histogram(stat = "count") +
facet_wrap(~class) +
xlab("Stalk Surface Below Ring")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
m14 <- ggplot(aes(x = stalk.color.above.ring), data = mush) +
geom_histogram(stat = "count") +
facet_wrap(~class) +
xlab("Stalk Color Above Ring")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
m15 <- ggplot(aes(x = stalk.color.below.ring), data = mush) +
geom_histogram(stat = "count") +
facet_wrap(~class) +
xlab("Stalk Color Below Ring")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
grid.arrange(m12, m13, m14, m15, ncol = 2)
The attributes related to stalk surface, above or below the ring, are distributed across both classes, and neither shows any exclusiveness. The stalk colors above and below the ring do show exclusiveness, as follows.
1. A stalk color above the ring of e (red), g (gray) or o (orange) implies an edible mushroom, whereas b (buff), c (cinnamon) or y (yellow) implies that the mushroom is poisonous.
2. A stalk color below the ring of e (red), g (gray) or o (orange) implies an edible mushroom, whereas b (buff), c (cinnamon) or y (yellow) implies that the mushroom is poisonous.
m16 <- ggplot(aes(x = veil.type), data = mush) +
geom_histogram(stat = "count") +
xlab("Veil Type")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
m17 <- ggplot(aes(x = veil.color), data = mush) +
geom_histogram(stat = "count") +
facet_wrap(~class) +
xlab("Veil Color")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
m18 <- ggplot(aes(x = ring.number), data = mush) +
geom_histogram(stat = "count") +
facet_wrap(~class) +
xlab("Ring Number")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
m19 <- ggplot(aes(x = ring.type), data = mush) +
geom_histogram(stat = "count") +
facet_wrap(~class) +
xlab("Ring Type")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
grid.arrange(m16, m17, m18, m19, ncol = 2)
The veil type is partial for all mushrooms, as confirmed later, and hence it makes no contribution at all in deciding the class of the mushroom. Such an attribute is an excellent example of data that is available but of no use in analyzing the dataset. It is better to drop such attributes, which is a simple form of dimensionality reduction. Dimensionality reduction plays an important role in speeding up the analysis when a dataset has a large number of attributes and/or observations.
The veil colors n (brown) and o (orange) suggest that the mushroom belongs to the edible class, and y (yellow) suggests that it belongs to the poisonous class.
The ring number has only one category that helps in deciding the class.
m20 <- ggplot(aes(x = spore.print.color), data = mush) +
geom_histogram(stat = "count") +
facet_wrap(~class) +
xlab("Spore Print Color")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
m21 <- ggplot(aes(x = population), data = mush) +
geom_histogram(stat = "count") +
facet_wrap(~class) +
xlab("Population")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
m22 <- ggplot(aes(x = habitat), data = mush) +
geom_histogram(stat = "count") +
facet_wrap(~class) +
xlab("Habitat")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
grid.arrange(m20, m21, m22, ncol = 2)
The spore print color shows exclusiveness. If the spore print color is b (buff), o (orange), u (purple) or y (yellow), the mushroom belongs to the edible class. The rest of the spore print colors occur in both classes.
The population attribute also shows exclusiveness: if the population is a (abundant) or n (numerous), the mushroom is edible.
The last attribute, habitat, has only one exclusive category: if the habitat is w (waste), the mushroom is poisonous.
The observations made from these histograms should be treated with caution because of the limitation of the histogram itself. A histogram alone cannot guarantee that a particular bar is absent: the counts on the y-axis run into the thousands, so a category with only a handful of observations produces a bar too small to see. The analysis here therefore assumes that a category with no visible bar in a panel has no observations at all in that class.
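The limitation described above can be sidestepped by counting instead of reading bars. The following is a minimal sketch and not part of the original analysis: it builds a small made-up data frame (`toy`, with hypothetical values) and uses a hypothetical helper, `exclusive_levels()`, to list the categories of an attribute that occur in exactly one class.

```r
# Toy stand-in for the mushroom data (made-up values, not the real counts)
toy <- data.frame(
  class     = factor(c("e", "e", "e", "p", "p", "p", "p", "e")),
  cap.shape = factor(c("x", "s", "b", "c", "x", "b", "c", "s"))
)

# For one attribute, return the categories that occur in exactly one class,
# named by category, with the owning class as the value
exclusive_levels <- function(df, attr, class_col = "class") {
  counts <- table(df[[attr]], df[[class_col]])   # rows: categories, cols: classes
  one_class <- rowSums(counts > 0) == 1          # category seen in a single class only
  owner_idx <- apply(counts[one_class, , drop = FALSE], 1, which.max)
  setNames(colnames(counts)[owner_idx], rownames(counts)[one_class])
}

print(table(toy$cap.shape, toy$class))     # the full cross-tabulation
print(exclusive_levels(toy, "cap.shape"))  # only the exclusive categories
```

In the toy data, c occurs only in class p and s only in class e, while b and x occur in both. On the real data the equivalent call would be `exclusive_levels(mush, "cap.shape")`, assuming `mush` has been loaded as shown earlier.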
summary(mush$veil.type)
## p
## 8124
This confirms that veil type has a single value, p (partial), for every observation in the dataset.
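The same check can be run for every column at once. The sketch below is an illustration on a made-up data frame (`toy`, hypothetical values): it counts the distinct values per column and drops the columns that have only one, which is the simplest form of the dimensionality reduction mentioned earlier.

```r
# Toy data frame; column names mirror the article, values are made up
toy <- data.frame(
  class     = factor(c("e", "p", "e", "p")),
  veil.type = factor(c("p", "p", "p", "p")),
  odor      = factor(c("n", "f", "a", "f"))
)

# A column with a single distinct value carries no information about the class
n_distinct <- vapply(toy, function(col) length(unique(col)), integer(1))
constant_cols <- names(n_distinct)[n_distinct == 1]
print(constant_cols)

# Drop the constant columns before any modelling
toy_reduced <- toy[, setdiff(names(toy), constant_cols), drop = FALSE]
print(names(toy_reduced))
```

Here only `veil.type` is reported as constant, and the reduced data frame keeps `class` and `odor`.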
After this preliminary analysis of the mushroom dataset, I have tried to find how closely each attribute is related to the class of the mushroom. The association between two categorical variables can be tested with the Chi-squared test of independence; the X-squared statistic is used here as a rough measure of how strongly an attribute is related to the class. The code below runs the test for each attribute against the class of the mushroom.
library(MASS)
tbl1 <- table(mush$class, mush$cap.shape)
chisq.test(tbl1)
## Warning in chisq.test(tbl1): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: tbl1
## X-squared = 489.92, df = 5, p-value < 2.2e-16
tbl2 <- table(mush$class, mush$cap.surface)
chisq.test(tbl2)
## Warning in chisq.test(tbl2): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: tbl2
## X-squared = 315.04, df = 3, p-value < 2.2e-16
tbl3 <- table(mush$class, mush$cap.color)
chisq.test(tbl3)
##
## Pearson's Chi-squared test
##
## data: tbl3
## X-squared = 387.6, df = 9, p-value < 2.2e-16
tbl4 <- table(mush$class, mush$bruises)
chisq.test(tbl4)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tbl4
## X-squared = 2041.4, df = 1, p-value < 2.2e-16
tbl5 <- table(mush$class, mush$odor)
chisq.test(tbl5)
##
## Pearson's Chi-squared test
##
## data: tbl5
## X-squared = 7659.7, df = 8, p-value < 2.2e-16
tbl6 <- table(mush$class, mush$gill.attachment)
chisq.test(tbl6)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tbl6
## X-squared = 133.99, df = 1, p-value < 2.2e-16
tbl7 <- table(mush$class, mush$gill.spacing)
chisq.test(tbl7)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tbl7
## X-squared = 984.14, df = 1, p-value < 2.2e-16
tbl8 <- table(mush$class, mush$gill.size)
chisq.test(tbl8)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tbl8
## X-squared = 2366.8, df = 1, p-value < 2.2e-16
tbl9 <- table(mush$class, mush$gill.color)
chisq.test(tbl9)
##
## Pearson's Chi-squared test
##
## data: tbl9
## X-squared = 3765.7, df = 11, p-value < 2.2e-16
tbl10 <- table(mush$class, mush$stalk.shape)
chisq.test(tbl10)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tbl10
## X-squared = 84.142, df = 1, p-value < 2.2e-16
tbl11 <- table(mush$class, mush$stalk.root)
chisq.test(tbl11)
##
## Pearson's Chi-squared test
##
## data: tbl11
## X-squared = 1344.4, df = 4, p-value < 2.2e-16
tbl12 <- table(mush$class, mush$stalk.surface.above.ring)
chisq.test(tbl12)
##
## Pearson's Chi-squared test
##
## data: tbl12
## X-squared = 2808.3, df = 3, p-value < 2.2e-16
tbl13 <- table(mush$class, mush$stalk.surface.below.ring)
chisq.test(tbl13)
##
## Pearson's Chi-squared test
##
## data: tbl13
## X-squared = 2684.5, df = 3, p-value < 2.2e-16
tbl14 <- table(mush$class, mush$stalk.color.above.ring)
chisq.test(tbl14)
## Warning in chisq.test(tbl14): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: tbl14
## X-squared = 2237.9, df = 8, p-value < 2.2e-16
tbl15 <- table(mush$class, mush$stalk.color.below.ring)
chisq.test(tbl15)
##
## Pearson's Chi-squared test
##
## data: tbl15
## X-squared = 2152.4, df = 8, p-value < 2.2e-16
tbl16 <- table(mush$class, mush$veil.type)
chisq.test(tbl16)
##
## Chi-squared test for given probabilities
##
## data: tbl16
## X-squared = 10.495, df = 1, p-value = 0.001197
tbl17 <- table(mush$class, mush$veil.color)
chisq.test(tbl17)
## Warning in chisq.test(tbl17): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: tbl17
## X-squared = 191.22, df = 3, p-value < 2.2e-16
tbl18 <- table(mush$class, mush$ring.number)
chisq.test(tbl18)
##
## Pearson's Chi-squared test
##
## data: tbl18
## X-squared = 374.74, df = 2, p-value < 2.2e-16
tbl19 <- table(mush$class, mush$ring.type)
chisq.test(tbl19)
##
## Pearson's Chi-squared test
##
## data: tbl19
## X-squared = 2956.6, df = 4, p-value < 2.2e-16
tbl20 <- table(mush$class, mush$spore.print.color)
chisq.test(tbl20)
##
## Pearson's Chi-squared test
##
## data: tbl20
## X-squared = 4602, df = 8, p-value < 2.2e-16
tbl21 <- table(mush$class, mush$population)
chisq.test(tbl21)
##
## Pearson's Chi-squared test
##
## data: tbl21
## X-squared = 1929.7, df = 5, p-value < 2.2e-16
tbl22 <- table(mush$class, mush$habitat)
chisq.test(tbl22)
##
## Pearson's Chi-squared test
##
## data: tbl22
## X-squared = 1573.8, df = 6, p-value < 2.2e-16
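The association between the class of the mushroom and the other attributes based on the Chi-squared test, in descending order of X-squared, is as follows: (1) odor, (2) spore print color, (3) gill color, (4) ring type, (5) stalk surface above ring, (6) stalk surface below ring, (7) gill size, (8) stalk color above ring, (9) stalk color below ring, (10) bruises, (11) population, (12) habitat, (13) stalk root, (14) gill spacing, (15) cap shape, (16) cap color, (17) ring number, (18) cap surface, (19) veil color, (20) gill attachment, (21) stalk shape, (22) veil type. Because the attributes have different numbers of categories, and hence different degrees of freedom, a ranking by raw X-squared is only approximate. For veil type the table has a single column, so chisq.test falls back to a test for given probabilities on the class counts alone; that value says nothing about the relationship between veil type and class.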
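The 22 tests above repeat the same two lines, and raw X-squared values are not directly comparable across attributes with different numbers of categories. One commonly used normalisation is Cramér's V, V = sqrt(X² / (n · min(r − 1, c − 1))), which ranges from 0 (no association) to 1 (perfect association). The sketch below is not what the article computed: it uses made-up data (`toy`) and a hypothetical helper `cramers_v()` to show how the loop and the normalisation could look.

```r
# Toy data (made-up values): odor separates the classes perfectly, bruises not at all
toy <- data.frame(
  class   = factor(c("e", "e", "e", "e", "p", "p", "p", "p")),
  odor    = factor(c("a", "a", "n", "n", "f", "f", "f", "f")),
  bruises = factor(c("t", "f", "t", "f", "t", "f", "t", "f"))
)

# Cramer's V from the uncorrected Pearson chi-squared statistic.
# Constant attributes (a single category) should be dropped first,
# otherwise min(r, c) - 1 is zero.
cramers_v <- function(x, y) {
  tbl  <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tbl, correct = FALSE))$statistic
  k    <- min(nrow(tbl), ncol(tbl)) - 1
  unname(sqrt(chi2 / (sum(tbl) * k)))
}

# One value per attribute, computed in a loop instead of 22 copies of the code
attrs <- setdiff(names(toy), "class")
v <- sapply(attrs, function(a) cramers_v(toy[[a]], toy$class))
print(sort(v, decreasing = TRUE))
```

Since the class has only two levels, the formula reduces to V = sqrt(X² / n) for every non-constant attribute, so with the article's figures odor would come out at roughly sqrt(7659.7 / 8124).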
I have then taken into account two attributes at a time, along with the class, to see how they jointly contribute to the edibility of the mushrooms. The first graph is of odor and spore print color. The choice of these two attributes is based on the Chi-squared test values seen earlier. The second graph is of ring type and gill color. The third graph is of stalk surface above ring and stalk surface below ring. The fourth graph is of gill size and stalk color above ring. The last graph is plotted for bruises and stalk color below ring.
The purpose of these plots is to find the combined exclusivity of two attributes in deciding the edibility of a mushroom. The observations are noted after each graph.
ggplot(mush, aes(odor, spore.print.color, class)) +
geom_point(aes(shape = factor(class), color = factor(class)), size = 4.5) +
scale_shape_manual(values = c('+', 'x')) +
scale_colour_manual(values = c("green", "red"))
It is difficult to decide the edibility of a mushroom when the odor is n (none) and the spore print color is w (white). In the rest of the combinations it is clear whether the mushroom is edible or poisonous.
ggplot(mush, aes(ring.type, gill.color, class)) +
geom_point(aes(shape = factor(class), color = factor(class)), size = 4.5) +
scale_shape_manual(values = c('+', 'x')) +
scale_colour_manual(values = c("green", "red"))
In this graph the ambiguity is revealed when the ring type is p(pendant).
ggplot(mush, aes(stalk.surface.above.ring, stalk.surface.below.ring, class)) +
geom_point(aes(shape = factor(class), color = factor(class)), size = 4.5) +
scale_shape_manual(values = c('+', 'x')) +
scale_colour_manual(values = c("green", "red"))
For stalk surface above ring and stalk surface below ring, the entire chart is ambiguous except at two places, so this pair is of little use.
ggplot(mush, aes(gill.size, stalk.color.above.ring, class)) +
geom_point(aes(shape = factor(class), color = factor(class)), size = 4.5) +
scale_shape_manual(values = c('+', 'x')) +
scale_colour_manual(values = c("green", "red"))
Out of 11 points, only 4 are ambiguous. The ambiguity is greatest when the stalk color above ring is w (white) or n (brown). There is also ambiguity when the stalk color above ring is p (pink).
ggplot(mush, aes(bruises, stalk.color.below.ring, class)) +
geom_point(aes(shape = factor(class), color = factor(class)), size = 4.5) +
scale_shape_manual(values = c('+', 'x')) +
scale_colour_manual(values = c("green", "red"))
The ambiguity arises when the stalk color below ring is w (white) or n (brown). In the rest of the cases, the class of the mushroom can be predicted from the combination of the two attributes.
Pairs of attributes can be combined in many more ways, so it is difficult to find the importance of every attribute in this manner. The importance of attributes is discussed later.
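The ambiguous combinations can also be listed without a plot. The sketch below is an illustration on made-up data (`toy`, hypothetical values): it forms every observed pair of categories with `interaction()` and flags the pairs that contain both edible and poisonous observations.

```r
# Toy data (made-up values) with two attributes and the class
toy <- data.frame(
  class             = factor(c("e", "e", "p", "p", "e", "p")),
  odor              = factor(c("n", "n", "f", "n", "a", "n")),
  spore.print.color = factor(c("k", "w", "h", "w", "k", "w"))
)

# One label per observed pair of categories, e.g. "n.w" for odor n with spore print w
pair <- interaction(toy$odor, toy$spore.print.color, drop = TRUE)

# Rows: pairs, columns: classes
counts <- table(pair, toy$class)
print(counts)

# A pair is ambiguous when it occurs in more than one class
ambiguous <- rownames(counts)[rowSums(counts > 0) > 1]
print(ambiguous)
```

In the toy data only the pair `n.w` is ambiguous. The same three steps apply to any two columns of `mush`, so all pairs of interest can be checked in a loop rather than one plot at a time.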
The main objective of this analysis is to predict whether a given mushroom is edible or poisonous. To achieve that I have split the entire dataset into three parts: the training data, which is 20% of the dataset, the validation data, which is 30%, and the test data, which is the remaining 50%. I plan to train the model on the training dataset and to check its accuracy on the validation dataset before revising the model. Finally, the accuracy on the test dataset shows how the model performs on data it has not seen.
df <- mush
fractionTraining <- 0.20
fractionValidation <- 0.30
fractionTest <- 0.50
sampleSizeTraining <- floor(fractionTraining * nrow(df))
sampleSizeValidation <- floor(fractionValidation * nrow(df))
sampleSizeTest <- floor(fractionTest * nrow(df))
indicesTraining <- sort(sample(seq_len(nrow(df)), size=sampleSizeTraining))
indicesNotTraining <- setdiff(seq_len(nrow(df)), indicesTraining)
indicesValidation <- sort(sample(indicesNotTraining, size=sampleSizeValidation))
indicesTest <- setdiff(indicesNotTraining, indicesValidation)
dfTraining <- df[indicesTraining, ]
dfValidation <- df[indicesValidation, ]
dfTest <- df[indicesTest, ]
#head(dfTraining)
#head(dfValidation)
#head(dfTest)
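The split above draws new random indices on every run because no seed is set. The following sketch shows one way to make the split repeatable. The helper `split_indices()` and the seed value 42 are assumptions made for illustration, not part of the original code; the fractions are the article's.

```r
# Return row indices for a train / validation / test split.
# The seed is an arbitrary choice; fixing it makes the split repeatable.
split_indices <- function(n, frac_train = 0.20, frac_valid = 0.30, seed = 42) {
  set.seed(seed)
  shuffled <- sample(seq_len(n))          # one random permutation of the row numbers
  n_train  <- floor(frac_train * n)
  n_valid  <- floor(frac_valid * n)
  list(
    train = sort(shuffled[seq_len(n_train)]),
    valid = sort(shuffled[n_train + seq_len(n_valid)]),
    test  = sort(tail(shuffled, n - n_train - n_valid))   # whatever remains
  )
}

idx <- split_indices(8124)
print(lengths(idx))   # sizes of the three parts
```

For 8124 rows this gives 1624, 2437 and 4063 indices, the same sizes as in the article. The three sets can then be taken as `df[idx$train, ]`, `df[idx$valid, ]` and `df[idx$test, ]`.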
I have used a random forest to train the model. A random forest is an ensemble of decision trees: each tree splits the data on one attribute and, where the decision is still not clear, splits it further on another attribute, so the data is partitioned repeatedly. The following graph shows the error against the number of trees.
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
rf = randomForest(class ~ .,
ntree = 100,
data = dfTraining)
plot(rf)
print(rf)
##
## Call:
## randomForest(formula = class ~ ., data = dfTraining, ntree = 100)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 0.06%
## Confusion matrix:
## e p class.error
## e 829 0 0.000000000
## p 1 794 0.001257862
It is necessary to analyse the importance of each attribute in deciding the class of the mushroom. The graph lists the attributes in descending order of importance. Note that this order is not the same as the one obtained from the Chi-squared test. The reason is simple. The Chi-squared test takes into account the contribution of a single attribute to the edibility of the mushroom, in isolation. The random forest, on the other hand, evaluates an attribute within a tree: after one split, the data is split further on another attribute, keeping the previous splits in mind. An attribute is therefore credited only for what it adds beyond the splits above it, which changes the order.
varImpPlot(rf,
           sort = TRUE,
           main = "Variable Importance")
The importance of each attribute according to Mean Decrease Gini is listed below.
var.imp <- data.frame(importance(rf, type = 2))
# make row names as columns
var.imp$Variables <- row.names(var.imp)
print(var.imp[order(var.imp$MeanDecreaseGini, decreasing = TRUE), ])
## MeanDecreaseGini Variables
## odor 304.3597093 odor
## spore.print.color 107.0664728 spore.print.color
## stalk.surface.above.ring 49.7497656 stalk.surface.above.ring
## gill.size 43.6077644 gill.size
## gill.color 42.8564167 gill.color
## ring.type 41.0620021 ring.type
## stalk.surface.below.ring 31.4287810 stalk.surface.below.ring
## stalk.root 25.4136081 stalk.root
## population 25.1353222 population
## habitat 21.9881497 habitat
## gill.spacing 19.7624958 gill.spacing
## stalk.color.above.ring 19.3069898 stalk.color.above.ring
## ring.number 16.3125141 ring.number
## cap.color 14.9109354 cap.color
## bruises 14.8959091 bruises
## stalk.shape 13.1189841 stalk.shape
## stalk.color.below.ring 9.8058711 stalk.color.below.ring
## cap.surface 4.2256780 cap.surface
## cap.shape 3.1265147 cap.shape
## gill.attachment 0.6677469 gill.attachment
## veil.color 0.6554684 veil.color
## veil.type 0.0000000 veil.type
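Mean Decrease Gini is the total decrease in node impurity from splits on an attribute, averaged over all trees, with impurity measured by the Gini index. The sketch below illustrates the quantity for a single split on made-up labels. The helpers `gini()` and `gini_decrease()` are hypothetical names, and this is a hand calculation for intuition, not the internals of `randomForest`.

```r
# Gini impurity of a set of class labels: 1 - sum of squared class proportions
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Impurity decrease when a node is split in two by a logical condition
gini_decrease <- function(labels, goes_left) {
  n     <- length(labels)
  left  <- labels[goes_left]
  right <- labels[!goes_left]
  gini(labels) - (length(left) / n) * gini(left) - (length(right) / n) * gini(right)
}

# Made-up node with four edible and four poisonous mushrooms
cls  <- c("e", "e", "e", "e", "p", "p", "p", "p")
odor <- c("a", "n", "n", "n", "f", "f", "f", "n")

print(gini(cls))                        # impurity before the split
print(gini_decrease(cls, odor == "f"))  # decrease from splitting on "odor is foul"
```

The parent node has impurity 0.5. Splitting on odor == "f" leaves a pure node of three poisonous mushrooms and a node of five with impurity 0.32, so the decrease is 0.5 − (5/8) · 0.32 = 0.3. An attribute that is constant, like veil type, can never produce a split, which is why its importance is exactly zero.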
library(caret)
## Loading required package: lattice
The code below uses the training dataset to predict the class of the mushroom.
dfTraining$predicted.response = predict(rf , dfTraining)
dfTraining$predicted.response <- as.factor(dfTraining$predicted.response)
print(
confusionMatrix(data = dfTraining$predicted.response,
reference = dfTraining$class,
positive = 'e'))
## Confusion Matrix and Statistics
##
## Reference
## Prediction e p
## e 829 0
## p 0 795
##
## Accuracy : 1
## 95% CI : (0.9977, 1)
## No Information Rate : 0.5105
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.5105
## Detection Rate : 0.5105
## Detection Prevalence : 0.5105
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : e
##
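The confusion matrix shows that the model classifies all 1624 training observations correctly. This figure is optimistic because the model has already seen these observations; the out-of-bag (OOB) estimate printed earlier, an error rate of 0.06% (an accuracy of 99.94%), is a fairer measure. The validation and test datasets, which the model has not seen, are evaluated next.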
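The accuracy figures can be recomputed by hand from a confusion matrix. The sketch below uses the out-of-bag confusion matrix printed by `print(rf)` earlier (rows are the actual classes, columns the predicted ones). The variable names are chosen for illustration.

```r
# OOB confusion matrix as printed by print(rf): rows actual, columns predicted
oob <- matrix(c(829,   0,
                  1, 794),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("e", "p"), predicted = c("e", "p")))

accuracy    <- sum(diag(oob)) / sum(oob)          # correct predictions / all predictions
sensitivity <- oob["e", "e"] / sum(oob["e", ])    # edible mushrooms identified as edible
specificity <- oob["p", "p"] / sum(oob["p", ])    # poisonous mushrooms identified as poisonous

print(round(100 * accuracy, 2))
print(c(sensitivity = sensitivity, specificity = specificity))
```

The accuracy is 1623 / 1624, about 99.94%, which matches the OOB error rate of 0.06%. The single error is a poisonous mushroom predicted as edible, the costlier kind of mistake for this problem, and it shows up in the specificity (794 / 795) rather than the sensitivity.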
dfValidation$predicted.response = predict(rf , dfValidation)
dfValidation$predicted.response <- as.factor(dfValidation$predicted.response)
print(
confusionMatrix(data = dfValidation$predicted.response,
reference = dfValidation$class,
positive = 'e'))
## Confusion Matrix and Statistics
##
## Reference
## Prediction e p
## e 1256 0
## p 0 1181
##
## Accuracy : 1
## 95% CI : (0.9985, 1)
## No Information Rate : 0.5154
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.5154
## Detection Rate : 0.5154
## Detection Prevalence : 0.5154
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : e
##
In the run shown above, the model classifies all 2437 observations of the validation dataset correctly.
dfTest$predicted.response = predict(rf , dfTest)
dfTest$predicted.response <- as.factor(dfTest$predicted.response)
print(
confusionMatrix(data = dfTest$predicted.response,
reference = dfTest$class,
positive = 'e'))
## Confusion Matrix and Statistics
##
## Reference
## Prediction e p
## e 2123 0
## p 0 1940
##
## Accuracy : 1
## 95% CI : (0.9991, 1)
## No Information Rate : 0.5225
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.5225
## Detection Rate : 0.5225
## Detection Prevalence : 0.5225
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : e
##
The accuracy on the test dataset is of utmost importance since it is completely new data for the model. The success of the predictive model depends on a high accuracy on the test dataset. In the run shown above, the model classifies all 4063 test observations correctly. No random seed is set before sampling, so the split and the exact figures can differ slightly from run to run.
The number of splits required to accurately predict the class of the mushroom is calculated and displayed below.
library(rpart)
model_tree <- rpart(class ~ ., data = mush,
method = "class", cp = 0.00001)
printcp(model_tree)
##
## Classification tree:
## rpart(formula = class ~ ., data = mush, method = "class", cp = 1e-05)
##
## Variables actually used in tree construction:
## [1] cap.surface habitat odor
## [4] spore.print.color stalk.color.below.ring stalk.root
##
## Root node error: 3916/8124 = 0.48203
##
## n= 8124
##
## CP nsplit rel error xerror xstd
## 1 0.9693565 0 1.0000000 1.0000000 0.01150089
## 2 0.0183861 1 0.0306435 0.0306435 0.00277662
## 3 0.0061287 2 0.0122574 0.0122574 0.00176397
## 4 0.0020429 3 0.0061287 0.0061287 0.00124917
## 5 0.0010215 5 0.0020429 0.0020429 0.00072192
## 6 0.0000100 7 0.0000000 0.0010215 0.00051060
plotcp(model_tree)
Based on the importance of the attributes, a decision tree can be created that leads to deciding the class of the mushroom. The decision tree that is used in this data analysis is plotted below.
library(rpart.plot)
model_tree$cptable[which.min(model_tree$cptable[, "xerror"]), "CP"]
## [1] 1e-05
# round() is dropped here: rounding 1e-05 to 4 decimals would turn the cp into 0
bestcp <- model_tree$cptable[which.min(model_tree$cptable[, "xerror"]), "CP"]
model_tree_pruned <- prune(model_tree, cp = bestcp)
rpart.plot(model_tree_pruned, extra = 104, box.palette = "GnBu",
branch.lty = 3, shadow.col = "gray", nn = TRUE)
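The choice of the complexity parameter can be traced by hand from the table printed by `printcp()`. The sketch below copies that table into a data frame and picks the row with the smallest cross-validated error, as the code above does. It then applies the one-standard-error rule, a common heuristic that the article itself does not use, for comparison.

```r
# The cp table as printed by printcp(model_tree) in this article
cp_table <- data.frame(
  CP     = c(0.9693565, 0.0183861, 0.0061287, 0.0020429, 0.0010215, 0.0000100),
  nsplit = c(0, 1, 2, 3, 5, 7),
  xerror = c(1.0000000, 0.0306435, 0.0122574, 0.0061287, 0.0020429, 0.0010215),
  xstd   = c(0.01150089, 0.00277662, 0.00176397, 0.00124917, 0.00072192, 0.00051060)
)

# Row with the smallest cross-validated error (what which.min() selects above)
best <- which.min(cp_table$xerror)

# One-standard-error rule: the simplest tree whose xerror is within
# one standard error of the minimum
threshold <- cp_table$xerror[best] + cp_table$xstd[best]
one_se    <- min(which(cp_table$xerror <= threshold))

print(cp_table[best, ])
print(cp_table[one_se, ])
```

For this table both criteria select the last row, the tree with 7 splits, so pruning at that cp leaves the fitted tree unchanged.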
The mushroom dataset is analysed in three ways.
The first involved plotting histograms to explore the contribution of a single attribute in deciding the edibility of the mushroom. Each histogram relates one attribute to the class, which makes this a bi-variate analysis. After analyzing the histograms, the following observations were made:
1. There does not exist a single attribute which can sufficiently serve as a deciding factor.
2. Higher exclusiveness implies a larger contribution of that attribute towards decision making.
3. Some attributes play absolutely no role in decision making and hence they can be completely ignored. This is an example of dimensionality reduction for achieving better efficiency.
The second way is calculation oriented, based on the contribution of a single attribute towards the class of the mushroom. The dataset has only categorical variables for all attributes. I have used the Chi-squared test to determine the association between a given attribute and the class of the mushroom. This is again a bi-variate analysis. The test helped in establishing the relationship between each attribute and the class (edibility) of the mushroom. A higher X-squared is read as a stronger association. The attributes were then ranked by this number.
The third way was drawing plots that investigate the exclusiveness of two attributes taken together in classifying the mushroom. I have used simple but effective plots with the intention of narrowing down the deciding factors suggested by the Chi-squared test. These graphs present the decisiveness and the ambiguities when two attributes are considered simultaneously. There could be numerous combinations, but I have kept it simple by following the Chi-squared ranking.
The limitation of these approaches was that only one or two attributes were considered at a time while understanding their contribution to the decision. It is necessary to get a holistic view to understand the actual importance of each attribute. The data therefore needs to be split into training, validation and test sets so that the predictive model gets completely new observations to work on. This in turn is necessary to evaluate the accuracy of the model.
I have used a random forest and the Gini index to calculate the importance of the attributes in deciding the class that a given mushroom belongs to. The importance of the variables is calculated, plotted and displayed in descending order. After understanding the importance of the variables, it is only logical to develop a decision tree which helps in deciding the edibility of a mushroom. The decision tree is plotted with the deciding factor at every split in the tree.