knitr::opts_chunk$set(echo = TRUE)
library(plyr)  # provides revalue(), used below to expand the abbreviations
## Warning: package 'plyr' was built under R version 3.6.1
From the Data Dictionary: This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like “leaflets three, let it be” for Poisonous Oak and Ivy.
mushroomURL <- "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
# Read the mushroom data as a headerless CSV (read.csv() already returns a data frame).
# stringsAsFactors = TRUE keeps the columns as factors; the default became FALSE in R 4.0.0.
mushroomData <- read.csv(mushroomURL, header = FALSE, sep = ",", stringsAsFactors = TRUE)
Now that the data is loaded, let’s take a preliminary look at it.
ncol(mushroomData)
## [1] 23
nrow(mushroomData)
## [1] 8124
head(mushroomData)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 1 p x s n t p f c n k e e s s w w p w o p
## 2 e x s y t a f c b k e c s s w w p w o p
## 3 e b s w t l f c b n e c s s w w p w o p
## 4 p x y w t p f c n n e e s s w w p w o p
## 5 e x s g f n f w b k t e s s w w p w o e
## 6 e x y y t a f c b n e c s s w w p w o p
## V21 V22 V23
## 1 k s u
## 2 n n g
## 3 n n m
## 4 k s u
## 5 n a g
## 6 k n g
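As a side note, `dim()` reports rows and columns in one call, and `str()` lists each column with its type. A minimal sketch on a made-up three-row data frame (the `toy` object and its values are illustrative assumptions, not the mushroom data):

```r
# Hypothetical stand-in for the loaded data: three rows, unnamed columns V1..V3.
toy <- data.frame(V1 = c("p", "e", "e"),
                  V2 = c("x", "x", "b"),
                  V3 = c("s", "s", "y"),
                  stringsAsFactors = TRUE)

# dim() returns c(rows, columns), combining nrow() and ncol().
toyDim <- dim(toy)
print(toyDim)

# str() shows every column with its type and first few values.
str(toy)
```

On the real data, the same two calls would replace the separate `ncol()`, `nrow()`, and `head()` checks above.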
The dataset has 23 columns, but the UCI page lists only 22 attributes. A quick check of column 1 shows just two distinct values: e and p.
unique(mushroomData$V1)
## [1] p e
## Levels: e p
Since the goal of this project was to determine whether a particular mushroom is likely to be edible, my initial guess is that this first column is the classification. Perhaps p = poisonous and e = edible? A deeper look into the data dictionary confirms that column 1 is indeed a classification!
Now that we have a very high-level overview of the data, let’s do a little bit of cleaning.
Since we don’t have any column names defined, let’s create some now. We’ll use the information provided by UCI:
# Named columnNames rather than names, so the vector does not shadow base R's names()
columnNames <- c('CLASSIFICATION','CAP_SHAPE','CAP_SURFACE','CAP_COLOR','BRUISES','ODOR','GILL_ATTACHMENT','GILL_SPACING','GILL_SIZE','GILL_COLOR','STALK_SHAPE','STALK_ROOT','STALK_SURFACE_ABOVE_RING','STALK_SURFACE_BELOW_RING','STALK_COLOR_ABOVE_RING','STALK_COLOR_BELOW_RING','VEIL_TYPE','VEIL_COLOR','RING_NUMBER','RING_TYPE','SPORE_PRINT_COLOR','POPULATION','HABITAT')
colnames(mushroomData) <- columnNames
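Because the 23 names are typed by hand, it can help to compare their count against the number of columns before assigning them. A minimal sketch on a made-up three-column data frame (`toy` and `toyNames` are hypothetical stand-ins):

```r
# Hypothetical three-column stand-in for the raw data.
toy <- data.frame(V1 = c("p", "e"), V2 = c("x", "b"), V3 = c("s", "y"))
toyNames <- c("CLASSIFICATION", "CAP_SHAPE", "CAP_SURFACE")

# Fail fast if the name vector and the column count disagree.
stopifnot(length(toyNames) == ncol(toy))

colnames(toy) <- toyNames
print(colnames(toy))
```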
I love data (the more the better!), but since the goal of this project is to create a subset of the original data set, let’s keep only the color characteristics and the classification.
mushroomDataSubset <- mushroomData[,c(grep("CLASSIFICATION",names(mushroomData)),grep("COLOR",names(mushroomData)))]
head(mushroomDataSubset)
## CLASSIFICATION CAP_COLOR GILL_COLOR STALK_COLOR_ABOVE_RING
## 1 p n k w
## 2 e y k w
## 3 e w n w
## 4 p w n w
## 5 e g k w
## 6 e y n w
## STALK_COLOR_BELOW_RING VEIL_COLOR SPORE_PRINT_COLOR
## 1 w w k
## 2 w w n
## 3 w w n
## 4 w w k
## 5 w w n
## 6 w w k
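The same selection can also be written with a single pattern. The sketch below uses `grepl()` with regex alternation on a made-up data frame (`toy` is a hypothetical stand-in). Unlike the two-`grep()` version, it keeps the matched columns in their original order rather than putting the classification first.

```r
# Hypothetical stand-in with one non-color column to be dropped.
toy <- data.frame(CLASSIFICATION = c("p", "e"),
                  CAP_SHAPE      = c("x", "b"),
                  CAP_COLOR      = c("n", "y"),
                  GILL_COLOR     = c("k", "n"))

# grepl() returns one TRUE/FALSE per column name; "|" matches either word.
keep <- grepl("CLASSIFICATION|COLOR", names(toy))

# drop = FALSE guarantees a data frame even if only one column matches.
toySubset <- toy[, keep, drop = FALSE]
print(names(toySubset))
```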
Looks good! Let’s take a look at the data in a little more detail. Are there certain colors that seem to be more prominent than others?
summary(mushroomDataSubset)
## CLASSIFICATION CAP_COLOR GILL_COLOR STALK_COLOR_ABOVE_RING
## e:4208 n :2284 b :1728 w :4464
## p:3916 g :1840 p :1492 p :1872
## e :1500 w :1202 g : 576
## y :1072 n :1048 n : 448
## w :1040 g : 752 b : 432
## b : 168 h : 732 o : 192
## (Other): 220 (Other):1170 (Other): 140
## STALK_COLOR_BELOW_RING VEIL_COLOR SPORE_PRINT_COLOR
## w :4384 n: 96 w :2388
## p :1872 o: 96 n :1968
## g : 576 w:7924 k :1872
## n : 512 y: 8 h :1632
## b : 432 r : 72
## o : 192 b : 48
## (Other): 156 (Other): 144
The summary is great, but the abbreviations aren’t super intuitive. Is g green or gray? Is b brown or blue? Let’s replace these values so we have a better idea.
classReplacements <- c("p" = "poisonous", "e" = "edible")
mushroomDataSubset$CLASSIFICATION <- revalue(mushroomDataSubset$CLASSIFICATION, classReplacements)
colorReplacements <- c("n"="brown", "g"= "gray", "e"= "red", "y"= "yellow", "w" = "white", "b"= "buff", "c" = "cinnamon", "r" = "green", "p" = "pink", "u" = "purple", "k" = "black", "h" = "chocolate", "o" = "orange")
# Revalue only the color columns. Assigning into [colorColumns] keeps the data frame (sapply() would
# return a character matrix); warn_missing = FALSE silences messages about codes a column never uses.
colorColumns <- grep("COLOR", names(mushroomDataSubset))
mushroomDataSubset[colorColumns] <- lapply(mushroomDataSubset[colorColumns], function(x) revalue(x, colorReplacements, warn_missing = FALSE))
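For readers curious how this kind of replacement works, or who would rather avoid the plyr dependency, the same idea can be sketched in base R by indexing a named vector. The `codes` and `lookup` objects below are hypothetical and cover only three colors.

```r
# Hypothetical short vector of color codes.
codes <- c("n", "w", "g", "w")

# Named vector: the names are the codes, the values are the readable labels.
lookup <- c("n" = "brown", "g" = "gray", "w" = "white")

# Indexing by name returns the label for each code; unname() drops the code names.
labels <- unname(lookup[codes])
print(labels)
```

Two caveats with this sketch: a factor column would need `as.character()` before indexing, and any code missing from `lookup` becomes `NA`, whereas `revalue()` leaves unmatched values unchanged.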
Let’s take one last look at the data:
summary(mushroomDataSubset)
## CLASSIFICATION CAP_COLOR GILL_COLOR STALK_COLOR_ABOVE_RING
## edible :4208 brown :2284 buff :1728 white :4464
## poisonous:3916 gray :1840 pink :1492 pink :1872
## red :1500 white :1202 gray : 576
## yellow :1072 brown :1048 brown : 448
## white :1040 gray : 752 buff : 432
## buff : 168 chocolate: 732 orange : 192
## (Other): 220 (Other) :1170 (Other): 140
## STALK_COLOR_BELOW_RING VEIL_COLOR SPORE_PRINT_COLOR
## white :4384 brown : 96 white :2388
## pink :1872 orange: 96 brown :1968
## gray : 576 white :7924 black :1872
## brown : 512 yellow: 8 chocolate:1632
## buff : 432 green : 72
## orange : 192 buff : 48
## (Other): 156 (Other) : 144
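With readable labels in place, one possible next step is to cross-tabulate a color column against the classification. The sketch below uses `table()` and `prop.table()` on five made-up rows (the `toy` data is purely illustrative and says nothing about real mushrooms):

```r
# Hypothetical rows, invented only to show the mechanics.
toy <- data.frame(
  CLASSIFICATION = c("edible", "poisonous", "edible", "poisonous", "edible"),
  CAP_COLOR      = c("brown", "brown", "white", "red", "white")
)

# Counts of each cap color (rows) by classification (columns).
counts <- table(toy$CAP_COLOR, toy$CLASSIFICATION)
print(counts)

# margin = 1 converts each row to proportions, i.e. the share of each class within a color.
shares <- prop.table(counts, margin = 1)
print(shares)
```

Applied to `mushroomDataSubset`, the same two calls would show how each color splits between the edible and poisonous classes.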