Your task (should you choose to accept it; sorry, had the Mission Impossible theme playing as soon as I started reading) is to study the dataset and the associated description of the data (i.e. “data dictionary”). You may need to look around a bit, but it’s there! You should take the data and create a data frame with a subset of the columns in the dataset. You should include the column that indicates edible or poisonous and three or four other columns. You should also add meaningful column names and replace the abbreviations used in the data; for example, in the appropriate column, “e” might become “edible.” Your deliverable is the R code to perform these transformation tasks.
The data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family. Data is courtesy of UCI, Jeff Schlimmer and The Audubon Society.
# Load packages
library(RCurl)
# Load data file
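# stringsAsFactors = TRUE keeps the columns as factors on R >= 4.0.0, matching the summaries shown below
shrooms <- read.csv(text=getURL("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"), header = FALSE, sep = ",", stringsAsFactors = TRUE)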
# Quick look at the data
head(shrooms)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 1 p x s n t p f c n k e e s s w w p w o p
## 2 e x s y t a f c b k e c s s w w p w o p
## 3 e b s w t l f c b n e c s s w w p w o p
## 4 p x y w t p f c n n e e s s w w p w o p
## 5 e x s g f n f w b k t e s s w w p w o e
## 6 e x y y t a f c b n e c s s w w p w o p
## V21 V22 V23
## 1 k s u
## 2 n n g
## 3 n n m
## 4 k s u
## 5 n a g
## 6 k n g
summary(shrooms)
## V1 V2 V3 V4 V5 V6
## e:4208 b: 452 f:2320 n :2284 f:4748 n :3528
## p:3916 c: 4 g: 4 g :1840 t:3376 f :2160
## f:3152 s:2556 e :1500 s : 576
## k: 828 y:3244 y :1072 y : 576
## s: 32 w :1040 a : 400
## x:3656 b : 168 l : 400
## (Other): 220 (Other): 484
## V7 V8 V9 V10 V11 V12 V13
## a: 210 c:6812 b:5612 b :1728 e:3516 ?:2480 f: 552
## f:7914 w:1312 n:2512 p :1492 t:4608 b:3776 k:2372
## w :1202 c: 556 s:5176
## n :1048 e:1120 y: 24
## g : 752 r: 192
## h : 732
## (Other):1170
## V14 V15 V16 V17 V18 V19
## f: 600 w :4464 w :4384 p:8124 n: 96 n: 36
## k:2304 p :1872 p :1872 o: 96 o:7488
## s:4936 g : 576 g : 576 w:7924 t: 600
## y: 284 n : 448 n : 512 y: 8
## b : 432 b : 432
## o : 192 o : 192
## (Other): 140 (Other): 156
## V20 V21 V22 V23
## e:2776 w :2388 a: 384 d:3148
## f: 48 n :1968 c: 340 g:2148
## l:1296 k :1872 n: 400 l: 832
## n: 36 h :1632 s:1248 m: 292
## p:3968 r : 72 v:4040 p:1144
## b : 48 y:1712 u: 368
## (Other): 144 w: 192
The first column describes class (edible=e, poisonous=p) and the data set includes 22 variables:
cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s
cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
bruises?: bruises=t, no=f
odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
gill-attachment: attached=a, descending=d, free=f, notched=n
gill-spacing: close=c, crowded=w, distant=d
gill-size: broad=b, narrow=n
gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
stalk-shape: enlarging=e, tapering=t
stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s
stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s
stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
veil-type: partial=p, universal=u
veil-color: brown=n, orange=o, white=w, yellow=y
ring-number: none=n, one=o, two=t
ring-type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
spore-print-color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d
# Rename columns
names(shrooms) <- c('class','capshape','capsurface','capcolor','bruises','odor',
                    'gillattachment','gillspacing','gillsize','gillcolor',
                    'stalkshape','stalkroot','stalksurfaceabovering','stalksurfacebelowring',
                    'stalkcolorabovering','stalkcolorbelowring',
                    'veiltype','veilcolor','ringnumber','ringtype',
                    'sporeprintcolor','population','habitat')
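The renaming step can be tried on a small in-memory sample, with a guard against a name vector of the wrong length. This is only a sketch: the three records are copied from the head() output above, `sample_text`, `sample_shrooms` and `col_names` are illustrative names that are not part of the original script, and `colClasses = "character"` is an assumption made here to keep the single-letter codes as plain strings.

```r
# First three records, copied from the head() output above
sample_text <- "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m"

# Read every column as character (assumption: no type guessing wanted)
sample_shrooms <- read.csv(text = sample_text, header = FALSE,
                           colClasses = "character")

col_names <- c('class','capshape','capsurface','capcolor','bruises','odor',
               'gillattachment','gillspacing','gillsize','gillcolor',
               'stalkshape','stalkroot','stalksurfaceabovering',
               'stalksurfacebelowring','stalkcolorabovering',
               'stalkcolorbelowring','veiltype','veilcolor','ringnumber',
               'ringtype','sporeprintcolor','population','habitat')

# Fail early if the name vector does not match the column count
stopifnot(length(col_names) == ncol(sample_shrooms))
names(sample_shrooms) <- col_names

# Inspect a few of the renamed columns
sample_shrooms[, c("class", "odor", "habitat")]
```

The length check is cheap insurance: assigning too few names leaves the remaining columns named NA, which is easy to miss with 23 columns.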
Select class, odor, ringnumber, ringtype and population for all mushrooms that grow in the woods (habitat code 'd'), to transform and review further.
sub.shrooms <- subset(shrooms, habitat == 'd', select = c(class, odor, ringnumber, ringtype, population))
summary(sub.shrooms)
## class odor ringnumber ringtype population
## e:1880 n :1816 n: 36 e: 608 a: 0
## p:1268 f : 624 o:3104 f: 48 c: 36
## c : 192 t: 8 l: 432 n: 0
## s : 192 n: 36 s: 96
## y : 192 p:2024 v:1904
## a : 48 y:1112
## (Other): 84
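The subset() call above filters rows and picks columns in one step. As a rough sketch of what it does, the same selection can be written with bracket indexing. The `toy` data frame below uses made-up rows with the same single-letter codes, and `keep_cols`, `by_subset` and `by_bracket` are illustrative names.

```r
# Toy data frame (hypothetical values, same coding scheme as the data set)
toy <- data.frame(
  class      = c("e", "p", "e", "p"),
  odor       = c("n", "f", "a", "n"),
  ringnumber = c("o", "o", "t", "n"),
  ringtype   = c("p", "l", "p", "n"),
  population = c("v", "y", "s", "c"),
  habitat    = c("d", "g", "d", "d"),
  stringsAsFactors = FALSE
)

# subset(): rows where habitat is woods ('d'), selected columns only
by_subset <- subset(toy, habitat == "d",
                    select = c(class, odor, ringnumber, ringtype, population))

# Bracket form: logical row index, character column index
keep_cols <- c("class", "odor", "ringnumber", "ringtype", "population")
by_bracket <- toy[toy$habitat == "d", keep_cols]

# Both approaches give the same data frame for this toy input
identical(by_subset, by_bracket)
```

One difference worth knowing: subset() drops rows where the condition evaluates to NA, while bracket indexing returns those rows filled with NA. The habitat column here has no missing values, so the two forms agree.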
Update variables with more meaningful values.
# Update class
sub.shrooms$class <- as.character(sub.shrooms$class)
sub.shrooms$class[sub.shrooms$class == 'e'] <- 'Edible'
sub.shrooms$class[sub.shrooms$class == 'p'] <- 'Poisonous'
sub.shrooms$class <- as.factor(sub.shrooms$class)
# Update odor
sub.shrooms$odor <- as.character(sub.shrooms$odor)
sub.shrooms$odor[sub.shrooms$odor == 'a'] <- 'Almond'
sub.shrooms$odor[sub.shrooms$odor == 'l'] <- 'Anise'
sub.shrooms$odor[sub.shrooms$odor == 'c'] <- 'Creosote'
sub.shrooms$odor[sub.shrooms$odor == 'y'] <- 'Fishy'
sub.shrooms$odor[sub.shrooms$odor == 'f'] <- 'Foul'
sub.shrooms$odor[sub.shrooms$odor == 'm'] <- 'Musty'
sub.shrooms$odor[sub.shrooms$odor == 'n'] <- 'None'
sub.shrooms$odor[sub.shrooms$odor == 'p'] <- 'Pungent'
sub.shrooms$odor[sub.shrooms$odor == 's'] <- 'Spicy'
sub.shrooms$odor <- as.factor(sub.shrooms$odor)
# Update ring type; Replace with NA if no rings exist
sub.shrooms$ringtype <- as.character(sub.shrooms$ringtype)
sub.shrooms$ringtype[sub.shrooms$ringtype == 'p'] <- 'Pendant'
sub.shrooms$ringtype[sub.shrooms$ringtype == 'c'] <- 'Cobwebby'
sub.shrooms$ringtype[sub.shrooms$ringtype == 'e'] <- 'Evanescent'
sub.shrooms$ringtype[sub.shrooms$ringtype == 'f'] <- 'Flaring'
sub.shrooms$ringtype[sub.shrooms$ringtype == 'l'] <- 'Large'
sub.shrooms$ringtype[sub.shrooms$ringtype == 's'] <- 'Sheathing'
sub.shrooms$ringtype[sub.shrooms$ringtype == 'z'] <- 'Zone'
# Assign NA last so the comparisons above never index with NA values
sub.shrooms$ringtype[sub.shrooms$ringtype == 'n'] <- NA
sub.shrooms$ringtype <- as.factor(sub.shrooms$ringtype)
# Update population
sub.shrooms$population <- as.character(sub.shrooms$population)
sub.shrooms$population[sub.shrooms$population == 'a'] <- 'Abundant'
sub.shrooms$population[sub.shrooms$population == 'c'] <- 'Clustered'
sub.shrooms$population[sub.shrooms$population == 'n'] <- 'Numerous'
sub.shrooms$population[sub.shrooms$population == 's'] <- 'Scattered'
sub.shrooms$population[sub.shrooms$population == 'v'] <- 'Several'
sub.shrooms$population[sub.shrooms$population == 'y'] <- 'Solitary'
sub.shrooms$population <- as.factor(sub.shrooms$population)
# Update ring number and convert to numeric
sub.shrooms$ringnumber <- as.character(sub.shrooms$ringnumber)
sub.shrooms$ringnumber[sub.shrooms$ringnumber == 'n'] <- 0
sub.shrooms$ringnumber[sub.shrooms$ringnumber == 'o'] <- 1
sub.shrooms$ringnumber[sub.shrooms$ringnumber == 't'] <- 2
sub.shrooms$ringnumber <- as.numeric(sub.shrooms$ringnumber)
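The one-assignment-per-code pattern above works, but the same recoding can be sketched more compactly with named lookup vectors. This is an alternative, not the approach the script uses: `recode_factor()` is a hypothetical helper, `odor_labels` and `ring_counts` are illustrative names built from the data dictionary, and `toy` is made-up input.

```r
# Lookup tables taken from the data dictionary (code -> label)
odor_labels <- c(a = "Almond", l = "Anise", c = "Creosote", y = "Fishy",
                 f = "Foul", m = "Musty", n = "None", p = "Pungent",
                 s = "Spicy")
ring_counts <- c(n = 0, o = 1, t = 2)

# Hypothetical helper: translate codes through a named vector, return a factor
recode_factor <- function(codes, labels) {
  codes <- as.character(codes)
  # Indexing a named vector by code returns the matching label (NA if unknown)
  factor(unname(labels[codes]))
}

# Toy input using the same single-letter codes
toy <- data.frame(odor = c("n", "f", "a", "n"),
                  ringnumber = c("o", "o", "t", "n"),
                  stringsAsFactors = FALSE)

toy$odor <- recode_factor(toy$odor, odor_labels)
# Numeric lookup gives the ring count directly, no as.numeric() needed
toy$ringnumber <- unname(ring_counts[toy$ringnumber])

toy
```

A code that is missing from the lookup vector becomes NA instead of being silently left as a raw letter, which makes gaps in the mapping easier to spot.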
# Display 25 random rows to check the transformations
set.seed(125)
sub.shrooms[sample(1:nrow(sub.shrooms), 25), ]
## class odor ringnumber ringtype population
## 6035 Poisonous Spicy 1 Evanescent Several
## 2152 Edible None 1 Pendant Solitary
## 2770 Edible None 1 Pendant Solitary
## 2957 Edible None 1 Pendant Solitary
## 7595 Poisonous Fishy 1 Evanescent Several
## 7636 Poisonous Musty 0 <NA> Clustered
## 3587 Edible None 1 Pendant Solitary
## 2874 Edible None 1 Pendant Solitary
## 4093 Edible None 1 Pendant Solitary
## 3880 Poisonous Creosote 1 Pendant Scattered
## 479 Edible Almond 1 Pendant Several
## 5019 Poisonous Foul 1 Large Several
## 2962 Edible None 1 Pendant Solitary
## 2088 Edible None 1 Pendant Solitary
## 3674 Edible None 1 Pendant Solitary
## 3196 Edible None 1 Pendant Several
## 3177 Edible None 1 Pendant Solitary
## 7113 Poisonous Fishy 1 Evanescent Several
## 4191 Poisonous Foul 1 Large Several
## 4958 Poisonous Foul 1 Large Several
## 3007 Edible None 1 Pendant Several
## 4664 Edible None 1 Flaring Solitary
## 4287 Poisonous Foul 1 Large Solitary
## 2749 Edible None 1 Pendant Solitary
## 3743 Poisonous Creosote 1 Pendant Several
summary(sub.shrooms)
## class odor ringnumber ringtype
## Edible :1880 None :1816 Min. :0.0000 Evanescent: 608
## Poisonous:1268 Foul : 624 1st Qu.:1.0000 Flaring : 48
## Creosote: 192 Median :1.0000 Large : 432
## Fishy : 192 Mean :0.9911 Pendant :2024
## Spicy : 192 3rd Qu.:1.0000 NA's : 36
## Almond : 48 Max. :2.0000
## (Other) : 84
## population
## Clustered: 36
## Scattered: 96
## Several :1904
## Solitary :1112
##
##
##
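The summary shows 36 mushrooms with zero rings and 36 missing ring types, which suggests the two transformations agree but does not prove it row by row. A minimal sketch of such a check follows, assuming that zero rings should always coincide with a missing ring type. The `checked` data frame holds made-up rows in the transformed layout, and `mismatch` is an illustrative name.

```r
# Toy rows in the transformed layout (hypothetical values)
checked <- data.frame(
  ringnumber = c(1, 1, 0, 2),
  ringtype   = factor(c("Pendant", "Large", NA, "Pendant"))
)

# Assumption: a row is inconsistent when exactly one of the two conditions holds
mismatch <- xor(checked$ringnumber == 0, is.na(checked$ringtype))

# Cross-tabulate ring count against missingness of ring type
table(rings = checked$ringnumber, type_missing = is.na(checked$ringtype))

# Rows that break the assumption, if any
checked[mismatch, ]
```

Run against sub.shrooms instead of the toy data, an empty result from the last line would confirm what the random sample of 25 rows can only hint at.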
Adding a few rudimentary graphs just for a bit of practice.
plot(sub.shrooms$class ~ sub.shrooms$ringtype, xlab = "Ring Type", ylab = "Class", main = "Class by Ring Type (Woods Only)")
# One colour per odor level (one per bar), not one per row
plot(sub.shrooms$odor, xlab = "Odor", ylab = "Frequency", col = seq_len(nlevels(sub.shrooms$odor)))