Assignment 1 - Loading Data into a Data Frame

knitr::opts_chunk$set(echo = TRUE)

Load Libraries

## Loading required package: bitops

## Warning: package 'plyr' was built under R version 3.6.1
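
The rendered output shows only the messages from this chunk, not its source. A minimal sketch of what it may have contained, assuming plyr (named in the warning, and the home of the revalue() function used later) and RCurl (an assumption: it is one of several packages that load bitops):

```r
# Assumed reconstruction; the original chunk source is not shown.
library(plyr)   # confirmed by the warning above; provides revalue()
library(RCurl)  # assumption: any bitops-dependent package prints the same message
```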

Load Dataset

From the Data Dictionary: This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like “leaflets three, let it be” for Poisonous Oak and Ivy.

mushroomURL <- "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"

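# Read the mushroom data as comma-separated values with no header row.
# stringsAsFactors = TRUE reproduces the factor columns shown below on R >= 4.0,
# where the default changed to FALSE. read.csv() already returns a data frame,
# so no as.data.frame() conversion is needed.
mushroomData <- read.csv(mushroomURL, header = FALSE, sep = ",", stringsAsFactors = TRUE)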
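
Because the file has no header row, read.csv() names the columns V1, V2, and so on. The sketch below shows this on two made-up rows (not real records from the file), using the text argument so that it runs without a network connection; sampleText and sampleData are illustrative names.

```r
# Two made-up rows in the same comma-separated layout as the mushroom file.
sampleText <- "p,x,s\ne,b,y"
sampleData <- read.csv(text = sampleText, header = FALSE, stringsAsFactors = TRUE)

names(sampleData)   # auto-generated column names
dim(sampleData)     # number of rows and columns in one call
```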

Initial Exploration

Preliminary Look

Now that the data is loaded, let’s take a preliminary look at it.

ncol(mushroomData)

## [1] 23

nrow(mushroomData)

## [1] 8124

head(mushroomData)

##   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 1  p  x  s  n  t  p  f  c  n   k   e   e   s   s   w   w   p   w   o   p
## 2  e  x  s  y  t  a  f  c  b   k   e   c   s   s   w   w   p   w   o   p
## 3  e  b  s  w  t  l  f  c  b   n   e   c   s   s   w   w   p   w   o   p
## 4  p  x  y  w  t  p  f  c  n   n   e   e   s   s   w   w   p   w   o   p
## 5  e  x  s  g  f  n  f  w  b   k   t   e   s   s   w   w   p   w   o   e
## 6  e  x  y  y  t  a  f  c  b   n   e   c   s   s   w   w   p   w   o   p
##   V21 V22 V23
## 1   k   s   u
## 2   n   n   g
## 3   n   n   m
## 4   k   s   u
## 5   n   a   g
## 6   k   n   g

The dataset has 23 columns, but the UCI page lists 22 attributes. A quick check of column 1 shows only two possible values: e and p.

unique(mushroomData$V1)

## [1] p e
## Levels: e p

Since the data set is meant to support deciding whether a particular mushroom is likely to be edible, my initial guess is that this first column is the classification, perhaps with p = poisonous and e = edible. A closer look at the data dictionary confirms that column 1 is indeed the classification!
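
A frequency table makes the same check a little more informative than unique(). The sketch below uses the six V1 values shown by head() above as a stand-in for the full column; firstClasses and classCounts are illustrative names.

```r
# V1 for the first six rows, copied from the head() output.
firstClasses <- factor(c("p", "e", "e", "p", "e", "e"))

levels(firstClasses)                 # the distinct labels
classCounts <- table(firstClasses)   # rows per label
classCounts
prop.table(classCounts)              # share of rows per label
```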

Data Cleanup

Now that we have a very high-level overview of the data, let’s do a little bit of cleaning.

Redefine Column Names

Since we don’t have any column names defined, let’s create some now. We’ll use the information provided by UCI:

# Column names from the UCI attribute list. The vector is called columnNames
# rather than names so that it does not mask base R's names() function.
columnNames <- c('CLASSIFICATION','CAP_SHAPE','CAP_SURFACE','CAP_COLOR','BRUISES','ODOR','GILL_ATTACHMENT','GILL_SPACING','GILL_SIZE','GILL_COLOR','STALK_SHAPE','STALK_ROOT','STALK_SURFACE_ABOVE_RING','STALK_SURFACE_BELOW_RING','STALK_COLOR_ABOVE_RING','STALK_COLOR_BELOW_RING','VEIL_TYPE','VEIL_COLOR','RING_NUMBER','RING_TYPE','SPORE_PRINT_COLOR','POPULATION','HABITAT')

colnames(mushroomData) <- columnNames
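
With 23 labels typed by hand, a length check before the assignment guards against a miscounted vector. A minimal sketch on a made-up three-column frame; toyData and newNames are illustrative names, and the labels are the first three from the list above.

```r
# Made-up frame with auto-generated names, standing in for mushroomData.
toyData <- data.frame(V1 = c("p", "e"), V2 = c("x", "b"), V3 = c("s", "y"))
newNames <- c("CLASSIFICATION", "CAP_SHAPE", "CAP_SURFACE")

# Stop early if the number of labels does not match the number of columns.
stopifnot(length(newNames) == ncol(toyData))
colnames(toyData) <- newNames
colnames(toyData)
```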

Subset Data

I love data (the more the better!), but since the goal of this project is to create a subset of the original data set, let’s limit our pull to only color characteristics & the classification.

# Keep the classification column plus every column whose name contains "COLOR".
mushroomDataSubset <- mushroomData[, grepl("CLASSIFICATION|COLOR", names(mushroomData))]

head(mushroomDataSubset)

##   CLASSIFICATION CAP_COLOR GILL_COLOR STALK_COLOR_ABOVE_RING
## 1              p         n          k                      w
## 2              e         y          k                      w
## 3              e         w          n                      w
## 4              p         w          n                      w
## 5              e         g          k                      w
## 6              e         y          n                      w
##   STALK_COLOR_BELOW_RING VEIL_COLOR SPORE_PRINT_COLOR
## 1                      w          w                 k
## 2                      w          w                 n
## 3                      w          w                 n
## 4                      w          w                 k
## 5                      w          w                 n
## 6                      w          w                 k

Looks good! Let’s take a look at the data in a little more detail. Are there certain colors that seem to be more prominent than others?

summary(mushroomDataSubset)

##  CLASSIFICATION   CAP_COLOR      GILL_COLOR   STALK_COLOR_ABOVE_RING
##  e:4208         n      :2284   b      :1728   w      :4464          
##  p:3916         g      :1840   p      :1492   p      :1872          
##                 e      :1500   w      :1202   g      : 576          
##                 y      :1072   n      :1048   n      : 448          
##                 w      :1040   g      : 752   b      : 432          
##                 b      : 168   h      : 732   o      : 192          
##                 (Other): 220   (Other):1170   (Other): 140          
##  STALK_COLOR_BELOW_RING VEIL_COLOR SPORE_PRINT_COLOR
##  w      :4384           n:  96     w      :2388     
##  p      :1872           o:  96     n      :1968     
##  g      : 576           w:7924     k      :1872     
##  n      : 512           y:   8     h      :1632     
##  b      : 432                      r      :  72     
##  o      : 192                      b      :  48     
##  (Other): 156                      (Other): 144
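
The (Other) rows appear because summary() on a data frame uses maxsum = 7 by default: it shows the six most frequent levels of each factor and lumps the rest together. A small sketch with a made-up factor (capColors and toyFrame are illustrative, not the real column) shows how the maxsum argument lifts that limit:

```r
# Made-up factor with eight levels; "n" is the most frequent.
capColors <- factor(c("n", "n", "n", "g", "e", "y", "w", "b", "c", "r"))
toyFrame <- data.frame(CAP_COLOR = capColors)

summary(toyFrame)                # default maxsum = 7: six levels plus (Other)
summary(toyFrame, maxsum = 13)   # room for every level, so no (Other) row
```

For the mushroom subset, 13 would be enough, since colorReplacements below lists 13 color codes.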

Replace Abbreviations

The summary is great, but the abbreviations aren’t super intuitive. Is g green or gray? Is b brown or blue? Let’s replace these values so we have a better idea.

classReplacements <- c("p" = "poisonous", "e" = "edible")
mushroomDataSubset$CLASSIFICATION <- revalue(mushroomDataSubset$CLASSIFICATION, classReplacements)


colorReplacements <- c("n" = "brown", "g" = "gray", "e" = "red", "y" = "yellow", "w" = "white", "b" = "buff", "c" = "cinnamon", "r" = "green", "p" = "pink", "u" = "purple", "k" = "black", "h" = "chocolate", "o" = "orange")

# Recode only the color columns; assigning lapply() output back keeps a data
# frame of factors (sapply() would return a character matrix).
colorColumns <- grep("COLOR", names(mushroomDataSubset))
mushroomDataSubset[colorColumns] <- lapply(mushroomDataSubset[colorColumns], revalue, replace = colorReplacements)

## The following `from` values were not present in `x`: k, h, o

## The following `from` values were not present in `x`: c

## The following `from` values were not present in `x`: r, u, k, h
## The following `from` values were not present in `x`: r, u, k, h

## The following `from` values were not present in `x`: g, e, b, c, r, p, u, k, h

## The following `from` values were not present in `x`: g, e, c, p

These messages are harmless: they only note that not every color code occurs in every column.
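
The same recoding can be done without plyr by renaming factor levels through a named lookup vector. This is a sketch in base R on a made-up column, not the approach used above; lookup and veil are illustrative names, and the codes are a subset of colorReplacements.

```r
# Lookup table: abbreviation -> full color name.
lookup <- c(n = "brown", w = "white", y = "yellow")
veil <- factor(c("w", "n", "w", "y"))

# Every level needs an entry; the lookup would return NA for a missing code.
stopifnot(all(levels(veil) %in% names(lookup)))

# Rename the levels in place; the underlying data keeps its order.
levels(veil) <- unname(lookup[levels(veil)])
veil
```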

Let’s take one last look at the data:

summary(mushroomDataSubset)

##    CLASSIFICATION   CAP_COLOR        GILL_COLOR   STALK_COLOR_ABOVE_RING
##  edible   :4208   brown  :2284   buff     :1728   white  :4464          
##  poisonous:3916   gray   :1840   pink     :1492   pink   :1872          
##                   red    :1500   white    :1202   gray   : 576          
##                   yellow :1072   brown    :1048   brown  : 448          
##                   white  :1040   gray     : 752   buff   : 432          
##                   buff   : 168   chocolate: 732   orange : 192          
##                   (Other): 220   (Other)  :1170   (Other): 140          
##  STALK_COLOR_BELOW_RING  VEIL_COLOR   SPORE_PRINT_COLOR
##  white  :4384           brown :  96   white    :2388   
##  pink   :1872           orange:  96   brown    :1968   
##  gray   : 576           white :7924   black    :1872   
##  brown  : 512           yellow:   8   chocolate:1632   
##  buff   : 432                         green    :  72   
##  orange : 192                         buff     :  48   
##  (Other): 156                         (Other)  : 144
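
To relate color to edibility rather than look at each column on its own, a cross-tabulation is one option. The sketch below uses only the six rows shown by head() earlier, with their codes spelled out, so the counts say nothing about the full data set; firstRows and capByClass are illustrative names.

```r
# CLASSIFICATION and CAP_COLOR for the first six rows, taken from head().
firstRows <- data.frame(
  CLASSIFICATION = c("poisonous", "edible", "edible", "poisonous", "edible", "edible"),
  CAP_COLOR      = c("brown", "yellow", "white", "white", "gray", "yellow"),
  stringsAsFactors = TRUE
)

# Rows are cap colors, columns are classes, cells are row counts.
capByClass <- table(firstRows$CAP_COLOR, firstRows$CLASSIFICATION)
capByClass
```

On the full data, the same call would take mushroomDataSubset[, "CAP_COLOR"] and mushroomDataSubset[, "CLASSIFICATION"] as its two arguments.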