“This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one.”
This data set is an example of categorical data, which uses numbers or characters to represent the different categories of answers. Data of this type cannot be manipulated arithmetically.
The best way to summarize categorical data is with frequencies. Frequencies can also be converted into percentages to show what share of the total sample falls in each category.
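As a minimal sketch of that idea, the snippet below counts the categories of a small made-up vector and converts the counts to percentages. The vector `answers` and the data frame `summary_df` are hypothetical names used only for illustration; they are not part of the mushroom data.

```r
# Hypothetical answers to a categorical question (not taken from the mushroom data)
answers <- c("yes", "no", "yes", "maybe", "yes", "no", "yes", "no")

freq <- table(answers)           # absolute frequency of each category
pct  <- 100 * prop.table(freq)   # the same counts as a percentage of the sample

# Collect both summaries side by side
summary_df <- data.frame(
  category = names(freq),
  n        = as.vector(freq),
  percent  = as.vector(pct)
)
print(summary_df)
```

`table()` orders the categories alphabetically, so the rows of `summary_df` follow that order rather than the order in which the values first appear.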
Mushrooms <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"), header = FALSE)
head(Mushrooms, 6L)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 1 p x s n t p f c n k e e s s w w p w o p
## 2 e x s y t a f c b k e c s s w w p w o p
## 3 e b s w t l f c b n e c s s w w p w o p
## 4 p x y w t p f c n n e e s s w w p w o p
## 5 e x s g f n f w b k t e s s w w p w o e
## 6 e x y y t a f c b n e c s s w w p w o p
## V21 V22 V23
## 1 k s u
## 2 n n g
## 3 n n m
## 4 k s u
## 5 n a g
## 6 k n g
class(Mushrooms)
## [1] "data.frame"
dim(Mushrooms)
## [1] 8124 23
Mushrooms1 <- Mushrooms[,c(1,6,14,21,22,23)]
head(Mushrooms1, 6L)
## V1 V6 V14 V21 V22 V23
## 1 p p s k s u
## 2 e a s n n g
## 3 e l s n n m
## 4 p p s k s u
## 5 e n s n a g
## 6 e a s k n g
Renaming columns
names(Mushrooms1) = c("classification", "odor", "stalk-sbr", "spore-pr-c", "population", "habitat")
names(Mushrooms1)
## [1] "classification" "odor" "stalk-sbr" "spore-pr-c"
## [5] "population" "habitat"
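Two of the new names contain hyphens, which are not valid in a bare R name, so those columns need quoting when they are accessed. The sketch below shows this on a tiny stand-in data frame; `demo`, `via_backticks`, `via_string` and `syntactic` are hypothetical names and the values are illustrative.

```r
# Tiny stand-in for Mushrooms1 with two of the renamed columns (illustrative values)
demo <- data.frame(V1 = c("e", "p"), V14 = c("s", "k"))
names(demo) <- c("classification", "stalk-sbr")

# A hyphenated column name has to be quoted when it is accessed
via_backticks <- demo$`stalk-sbr`      # backticks with $
via_string    <- demo[["stalk-sbr"]]   # or a string with [[ ]]

# make.names() shows the syntactic alternative R would generate
syntactic <- make.names(names(demo))
print(syntactic)
```

Names such as `stalk_sbr` would avoid the quoting altogether, but that is a style choice rather than a requirement.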
Subsetting data
Data set of Edible Mushrooms - 4208 instances
edible_mushrooms <- subset(Mushrooms1, classification == "e")
head(edible_mushrooms, 6L)
## classification odor stalk-sbr spore-pr-c population habitat
## 2 e a s n n g
## 3 e l s n n m
## 5 e n s n a g
## 6 e a s k n g
## 7 e a s k n m
## 8 e l s n s m
Data set of Poisonous Mushrooms - 3916 instances
poisonous_mushrooms <- subset(Mushrooms1, classification == "p")
head(poisonous_mushrooms, 6L)
## classification odor stalk-sbr spore-pr-c population habitat
## 1 p p s k s u
## 4 p p s k s u
## 9 p p s k v g
## 14 p p s n v u
## 18 p p s k s g
## 19 p p s n s u
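The two `subset()` calls could also be written as a single `split()`, which returns one data frame per class. This is only a sketch on made-up rows shaped like `Mushrooms1`; `demo`, `by_class`, `edible_demo` and `poisonous_demo` are hypothetical names.

```r
# Illustrative rows shaped like Mushrooms1 (not the real data)
demo <- data.frame(
  classification = c("p", "e", "e", "p", "e"),
  odor           = c("p", "a", "l", "p", "n"),
  stringsAsFactors = FALSE
)

# split() returns a named list with one data frame per class
by_class <- split(demo, demo$classification)

edible_demo    <- by_class[["e"]]
poisonous_demo <- by_class[["p"]]

# Sanity check: the pieces should add up to the original number of rows
stopifnot(nrow(edible_demo) + nrow(poisonous_demo) == nrow(demo))
```

The same `nrow()` check on the real subsets is a quick way to confirm the instance counts quoted for each class.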
Updating the subset value ‘e’ to ‘edible’
edible_mushrooms$classification <- ifelse(edible_mushrooms$classification== 'e', 'edible', edible_mushrooms$classification)
head(edible_mushrooms, 3L)
## classification odor stalk-sbr spore-pr-c population habitat
## 2 edible a s n n g
## 3 edible l s n n m
## 5 edible n s n a g
The attribute with the most predictive value might be habitat, which can tell us where poisonous mushrooms are found as opposed to edible ones.
Frequency distribution analysis using bar graphs, with classification on the X axis and percentage on the Y axis, could answer questions like:
How many edible or poisonous mushrooms are in the data set? Results in frequencies and/or percentages
count = table(Mushrooms1$classification)
t = as.data.frame(count)
names(t)[1] = 'classification'
t
## classification Freq
## 1 e 4208
## 2 p 3916
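The bar graph described earlier, with classification on the X axis and percentage on the Y axis, could be drawn along these lines. This is a sketch using base graphics, with the two counts copied from the table above so that it runs without downloading the data; `counts` and `pct` are hypothetical names.

```r
# Counts copied from the frequency table above
counts <- c(e = 4208, p = 3916)

# Share of each class in percent
pct <- 100 * counts / sum(counts)
print(round(pct, 1))

# Base-graphics bar graph: classification on X, percentage on Y
barplot(
  pct,
  names.arg = c("edible", "poisonous"),
  xlab = "classification",
  ylab = "percentage",
  ylim = c(0, 100)
)
```

With the full data loaded, `100 * prop.table(table(Mushrooms1$classification))` should give the same percentages directly.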
Which habitat is most frequent for poisonous mushrooms and for edible mushrooms? Possible habitat values: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d
count = table(Mushrooms1$classification, Mushrooms1$habitat)
t = as.data.frame(count)
names(t)[1] = 'classification'
names(t)[2] = 'habitat'
t
## classification habitat Freq
## 1 e d 1880
## 2 p d 1268
## 3 e g 1408
## 4 p g 740
## 5 e l 240
## 6 p l 592
## 7 e m 256
## 8 p m 36
## 9 e p 136
## 10 p p 1008
## 11 e u 96
## 12 p u 272
## 13 e w 192
## 14 p w 0
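Raw counts are hard to compare across habitats of very different sizes, so one option is to look at the share of poisonous samples within each habitat. The sketch below rebuilds the table above by hand so that it is self-contained; `counts`, `share` and `poisonous_pct` are hypothetical names.

```r
# Counts copied from the classification-by-habitat table above
habitat <- c("d", "g", "l", "m", "p", "u", "w")
counts <- rbind(
  e = c(1880, 1408, 240, 256, 136, 96, 192),
  p = c(1268, 740, 592, 36, 1008, 272, 0)
)
colnames(counts) <- habitat

# Column-wise proportions: within each habitat, the share of each class
share <- prop.table(counts, margin = 2)

# Percentage of poisonous samples per habitat, highest first
poisonous_pct <- sort(100 * share["p", ], decreasing = TRUE)
print(round(poisonous_pct, 1))
```

On the full data frame, `prop.table(table(Mushrooms1$classification, Mushrooms1$habitat), margin = 2)` should produce the same proportions.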
Which odor is most common in poisonous mushrooms and in edible mushrooms? Possible odor values: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
count1 = table(Mushrooms1$classification, Mushrooms1$odor)
t = as.data.frame(count1)
names(t)[1] = 'classification'
names(t)[2] = 'odor'
t
## classification odor Freq
## 1 e a 400
## 2 p a 0
## 3 e c 0
## 4 p c 192
## 5 e f 0
## 6 p f 2160
## 7 e l 400
## 8 p l 0
## 9 e m 0
## 10 p m 36
## 11 e n 3408
## 12 p n 120
## 13 e p 0
## 14 p p 256
## 15 e s 0
## 16 p s 576
## 17 e y 0
## 18 p y 576
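One way to gauge how informative odor is on its own is to guess the majority class for each odor value and count how many samples that guess gets right. The sketch below does this from the counts in the table above. It is a descriptive summary of this table, not a validated classifier, since it is scored on the same data it was derived from. `counts`, `majority_class`, `correct` and `accuracy` are hypothetical names.

```r
# Counts copied from the classification-by-odor table above
odor <- c("a", "c", "f", "l", "m", "n", "p", "s", "y")
counts <- rbind(
  e = c(400, 0, 0, 400, 0, 3408, 0, 0, 0),
  p = c(0, 192, 2160, 0, 36, 120, 256, 576, 576)
)
colnames(counts) <- odor

# For each odor, the class with the larger count
majority_class <- rownames(counts)[apply(counts, 2, which.max)]
names(majority_class) <- odor

# Samples that the majority guess classifies correctly, and their share
correct  <- sum(apply(counts, 2, max))
accuracy <- correct / sum(counts)

print(majority_class)
print(round(accuracy, 4))
```

The only odor value shared by both classes in this table is none (n), so every sample the guess gets wrong is a poisonous mushroom with no odor.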
After studying the data set, I found two relevant attributes, habitat and odor, that can provide additional information to help determine whether a particular mushroom is poisonous or edible.