Very often, we’re tasked with taking data in one form and transforming it for easier downstream analysis. We will spend several weeks in this course on tidying and transformation operations. Some of this work could be done in SQL or R (or Python or…). Here, you are asked to use R—you may use base functions or packages as you like.
Mushrooms Dataset
A famous (if slightly moldy) dataset about mushrooms can be found in the UCI repository here: https://archive.ics.uci.edu/ml/datasets/Mushroom. Because it is so well known in the data science community, it is a good dataset for comparative benchmarking. For example, if someone were working to build a better decision tree algorithm (or other predictive classifier) for categorical data, this dataset could be useful. A typical problem (which is beyond the scope of this assignment!) is to answer the question, “Which other attribute or attributes are the best predictors of whether a particular mushroom is poisonous or edible?”
Your task is to study the dataset and the associated description of the data (i.e. “data dictionary”). You may need to look around a bit, but it’s there! You should take the data, and create a data frame with a subset of the columns in the dataset. You should include the column that indicates edible or poisonous and three or four other columns. You should also add meaningful column names and replace the abbreviations used in the data—for example, in the appropriate column, “e” might become “edible.” Your deliverable is the R code to perform these transformation tasks.
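Read the data from my GitHub repo into an R data frame and display the first few rows:
theUrl <- "https://raw.githubusercontent.com/forhadakbar/data607fall2019/master/agaricus-lepiota.data"
# Note: the file has no header row, so header = TRUE turns the first record into
# column names (see the names printed below) and leaves 8123 of the 8124 records.
mushroom <- read.csv(url(theUrl), header = TRUE, sep = ",")
head(mushroom)
## [1] "data.frame"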
## [1] 8123 23
## [1] "p" "x" "s" "n" "t" "p.1" "f" "c" "n.1" "k" "e"
## [12] "e.1" "s.1" "s.2" "w" "w.1" "p.2" "w.2" "o" "p.3" "k.1" "s.3"
## [23] "u"
## [1] FALSE
The data frame is not NULL.
## [1] FALSE
No NA values were found in the data frame. Note that the Stalk-Root column marks missing values with "?", which an NA check does not detect.
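The names printed above ("p", "x", "s", ...) look like data values rather than headers, which suggests the file has no header row. A minimal sketch of reading such a file with header = FALSE follows. It uses a three-record inline sample typed in the file's format instead of the URL, and `sample_text` and `sample_df` are illustrative names, not part of the original code.

```r
# Three records typed in the file's format (illustrative sample, not a download).
sample_text <- "p,x,s,n
e,x,s,y
e,b,s,w"

# header = FALSE keeps the first record as data; col.names supplies the names,
# and check.names = FALSE keeps the hyphens in them.
sample_df <- read.csv(text = sample_text, header = FALSE,
                      col.names = c("Class", "Cap-Shape", "Cap-Surface", "Cap-Color"),
                      check.names = FALSE)

dim(sample_df)             # all three records are kept
colSums(is.na(sample_df))  # per-column count of NA values
```

Applied to the full file, the same arguments would presumably keep every record and make the later names() assignment unnecessary.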
names(mushroom) <- c("Class", "Cap-Shape", "Cap-Surface", "Cap-Color", "Bruises", "Odor", "Gill-Attachment", "Gill-Spacing", "Gill-Size", "Gill-Color", "Stalk-Shape", "Stalk-Root", "Stalk-Surface-Above-Ring", "Stalk-Surface-Below-Ring", "Stalk-Color-Above-Ring", "Stalk-Color-Below-Ring", "Veil-Type", "Veil-Color", "Ring-Number", "Ring-Type", "Spore-Print-Color", "Population", "Habitat")
head(mushroom)
Replace the abbreviations in each column with full labels:
library(dplyr)  # recode() comes from dplyr
mushroom$Class <- recode(mushroom$Class, e = "edible", p = "poisonous")
mushroom$`Cap-Shape` <- recode(mushroom$`Cap-Shape`,
b = "bell", c = "conical", x = "convex", f = "flat", k = "knobbed", s = "sunken")
mushroom$`Cap-Surface` <- recode(mushroom$`Cap-Surface`, f="fibrous", g="grooves", y="scaly", s="smooth")
mushroom$`Cap-Color` <- recode(mushroom$`Cap-Color`, n = "brown", b = "buff", c = "cinnamon", g = "gray", r = "green", p = "pink", u = "purple", e = "red", w = "white", y = "yellow")
mushroom$Bruises <- recode(mushroom$Bruises, t = "bruises", f = "no-bruises")
mushroom$Odor <- recode(mushroom$Odor, a = 'almond', l = 'anise', c = 'creosote', y = 'fishy', f = 'foul', m = 'musty', n = 'none', p = 'pungent', s = 'spicy')
mushroom$`Gill-Attachment` <- recode(mushroom$`Gill-Attachment`, a = 'attached', d = 'descending', f = 'free', n = 'notched')
mushroom$`Gill-Spacing` <- recode(mushroom$`Gill-Spacing`, c = 'close', w = 'crowded', d = 'distant')
mushroom$`Gill-Size` <- recode(mushroom$`Gill-Size`, b = 'broad', n = 'narrow' )
mushroom$`Gill-Color` <- recode(mushroom$`Gill-Color`, k = 'black', n = 'brown', b = 'buff', h = 'chocolate', g = 'gray', r = 'green', o = 'orange', p = 'pink', u = 'purple', e = 'red', w = 'white', y = 'yellow')
mushroom$`Stalk-Shape` <- recode(mushroom$`Stalk-Shape`, e = 'enlarging', t = 'tapering')
mushroom$`Stalk-Root` <- recode(mushroom$`Stalk-Root`, b = 'bulbous', c = 'club', u = 'cup', e = 'equal', z = 'rhizomorphs', r = 'rooted')
mushroom$`Stalk-Surface-Above-Ring` <- recode(mushroom$`Stalk-Surface-Above-Ring`, f = 'fibrous', y = 'scaly', k = 'silky', s = 'smooth')
mushroom$`Stalk-Surface-Below-Ring` <- recode(mushroom$`Stalk-Surface-Below-Ring`, f = 'fibrous', y = 'scaly', k = 'silky', s = 'smooth')
mushroom$`Stalk-Color-Above-Ring` <- recode(mushroom$`Stalk-Color-Above-Ring`, n = 'brown', b = 'buff', c = 'cinnamon', g = 'gray', o = 'orange', p = 'pink', e = 'red', w = 'white', y = 'yellow')
mushroom$`Stalk-Color-Below-Ring` <- recode(mushroom$`Stalk-Color-Below-Ring`, n = 'brown', b = 'buff', c = 'cinnamon', g = 'gray', o = 'orange', p = 'pink', e = 'red', w = 'white', y = 'yellow')
mushroom$`Veil-Type` <- recode(mushroom$`Veil-Type`, p = 'partial', u = 'universal')
mushroom$`Veil-Color` <- recode(mushroom$`Veil-Color`, n = 'brown', o = 'orange', w = 'white', y = 'yellow')
mushroom$`Ring-Number` <- recode(mushroom$`Ring-Number`, n = 'none', o = 'one', t = 'two')
mushroom$`Ring-Type` <- recode(mushroom$`Ring-Type`, c = 'cobwebby', e = 'evanescent', f = 'flaring', l = 'large', n = 'none', p = 'pendant', s = 'sheathing', z = 'zone')
mushroom$`Spore-Print-Color` <- recode(mushroom$`Spore-Print-Color`, k = 'black', n = 'brown', b = 'buff', h = 'chocolate', r = 'green', o = 'orange', u = 'purple',
w = 'white', y = 'yellow')
mushroom$Population <- recode(mushroom$Population, a = 'abundant', c = 'clustered', n = 'numerous', s = 'scattered', v = 'several', y = 'solitary')
mushroom$Habitat <- recode(mushroom$Habitat, g = 'grasses', l = 'leaves', m = 'meadows', p = 'paths', u = 'urban', w = 'waste', d = 'woods')
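The per-column recode() calls above could also be written in base R with named lookup vectors, which keeps every mapping in one list and replaces the repeated calls with a loop. This is only a sketch on a made-up two-column data frame; `toy`, `lookups`, and `col` are hypothetical names.

```r
# Tiny illustrative data frame using the dataset's abbreviations (made-up rows).
toy <- data.frame(Class = c("p", "e", "e"), Bruises = c("t", "f", "t"))

# One named vector per column: names are the codes, values the full labels.
lookups <- list(
  Class   = c(e = "edible", p = "poisonous"),
  Bruises = c(t = "bruises", f = "no-bruises")
)

# Index each lookup by the column's codes; as.character() guards against
# factor columns, whose integer codes would otherwise be used as positions.
for (col in names(lookups)) {
  toy[[col]] <- unname(lookups[[col]][as.character(toy[[col]])])
}

toy
```

A code that is missing from a lookup comes back as NA, which makes unmapped abbreviations easy to spot.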
head(mushroom)
##  Class            Cap-Shape      Cap-Surface    Cap-Color
##  edible   :4208   bell   : 452   fibrous:2320   brown  :2283
##  poisonous:3915   conical:   4   grooves:   4   gray   :1840
##                   flat   :3152   smooth :2555   red    :1500
##                   knobbed: 828   scaly  :3244   yellow :1072
##                   sunken :  32                  white  :1040
##                   convex :3655                  buff   : 168
##                                                 (Other): 220
##  Bruises           Odor           Gill-Attachment   Gill-Spacing
##  no-bruises:4748   none   :3528   attached: 210     close  :6811
##  bruises   :3375   foul   :2160   free    :7913     crowded:1312
##                    spicy  : 576
##                    fishy  : 576
##                    almond : 400
##                    anise  : 400
##                    (Other): 483
##  Gill-Size     Gill-Color       Stalk-Shape      Stalk-Root
##  broad :5612   buff     :1728   enlarging:3515   ?      :2480
##  narrow:2511   pink     :1492   tapering :4608   bulbous:3776
##                white    :1202                    club   : 556
##                brown    :1048                    equal  :1119
##                gray     : 752                    rooted : 192
##                chocolate: 732
##                (Other)  :1169
##  Stalk-Surface-Above-Ring   Stalk-Surface-Below-Ring   Stalk-Color-Above-Ring
##  fibrous: 552               fibrous: 600               white  :4463
##  silky  :2372               silky  :2304               pink   :1872
##  smooth :5175               smooth :4935               gray   : 576
##  scaly  :  24               scaly  : 284               brown  : 448
##                                                        buff   : 432
##                                                        orange : 192
##                                                        (Other): 140
##  Stalk-Color-Below-Ring   Veil-Type      Veil-Color    Ring-Number
##  white  :4383             partial:8123   brown :  96   none:  36
##  pink   :1872                            orange:  96   one :7487
##  gray   : 576                            white :7923   two : 600
##  brown  : 512                            yellow:   8
##  buff   : 432
##  orange : 192
##  (Other): 156
##  Ring-Type         Spore-Print-Color   Population       Habitat
##  evanescent:2776   white    :2388      abundant : 384   woods  :3148
##  flaring   :  48   brown    :1968      clustered: 340   grasses:2148
##  large     :1296   black    :1871      numerous : 400   leaves : 832
##  none      :  36   chocolate:1632      scattered:1247   meadows: 292
##  pendant   :3967   green    :  72      several  :4040   paths  :1144
##                    buff     :  48      solitary :1712   urban  : 367
##                    (Other)  : 144                       waste  : 192
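The Stalk-Root counts above include 2480 rows labelled "?", which is how this dataset marks a missing value. One way to make R treat that marker as NA is the na.strings argument of read.csv. The sketch below uses a small inline sample rather than the real file; `sample_text` and `with_na` are illustrative names.

```r
# Two-column sample in the file's format; "?" stands for a missing value.
sample_text <- "e,b
p,?
e,c"

# na.strings = "?" converts the marker to NA while reading.
with_na <- read.csv(text = sample_text, header = FALSE, na.strings = "?")

# Per-column count of missing values; only the second column has one.
colSums(is.na(with_na))
```

With the marker converted, checks such as is.na() report the gap instead of counting "?" as just another category.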
Create a data frame with a subset of the columns:
According to the documentation, a small number of columns predict well which mushrooms are poisonous: odor, spore-print-color, stalk-surface-below-ring, and stalk-color-above-ring. The subset below keeps odor, spore-print-color, and stalk-color-above-ring, uses stalk-color-below-ring in place of stalk-surface-below-ring, and adds cap-color and habitat.
mushroomsubset<- mushroom[,c("Class","Odor","Spore-Print-Color","Stalk-Color-Above-Ring","Stalk-Color-Below-Ring","Cap-Color","Habitat")]
head(mushroomsubset)
# poison_class vs. Odor
poison_by_odor <- table(mushroomsubset[c(1, 2)])
barplot(poison_by_odor, legend.text = TRUE, beside = TRUE, col = c("light blue", "dark orange"), xlab = "Odor", ylab = "Species Count", main = "Frequency of Poisonous Mushrooms by Odor", cex.names = .75)
# poison_class vs. Spore-Print-Color
poison_by_spore_print <- table(mushroomsubset[c(1, 3)])
barplot(poison_by_spore_print, legend.text = TRUE, beside = TRUE, col = c("light blue", "dark orange"), xlab = "Spore-Print-Color", ylab = "Species Count", main = "Frequency of Poisonous Mushrooms by Spore-Print-Color", cex.names = .75)
# poison_class vs. Stalk-Color-Above-Ring
poison_by_stalk_above <- table(mushroomsubset[c(1, 4)])
barplot(poison_by_stalk_above, legend.text = TRUE, beside = TRUE, col = c("light blue", "dark orange"), xlab = "Stalk-Color-Above-Ring", ylab = "Species Count", main = "Frequency of Poisonous Mushrooms by Stalk-Color-Above-Ring", cex.names = .75)
# poison_class vs. Stalk-Color-Below-Ring
poison_by_stalk_below <- table(mushroomsubset[c(1, 5)])
barplot(poison_by_stalk_below, legend.text = TRUE, beside = TRUE, col = c("light blue", "dark orange"), xlab = "Stalk-Color-Below-Ring", ylab = "Species Count", main = "Frequency of Poisonous Mushrooms by Stalk-Color-Below-Ring", cex.names = .75)
# poison_class vs. Cap-Color
poison_by_cap_color <- table(mushroomsubset[c(1, 6)])
barplot(poison_by_cap_color, legend.text = TRUE, beside = TRUE, col = c("light blue", "dark orange"), xlab = "Cap-Color", ylab = "Species Count", main = "Frequency of Poisonous Mushrooms by Cap-Color", cex.names = .75)
# poison_class vs. Habitat
poison_by_habitat <- table(mushroomsubset[c(1, 7)])
barplot(poison_by_habitat, legend.text = TRUE, beside = TRUE, col = c("light blue", "dark orange"), xlab = "Habitat", ylab = "Species Count", main = "Frequency of Poisonous Mushrooms by Habitat", cex.names = .75)
The majority of poisonous mushrooms smell foul, while the majority of edible mushrooms don’t smell at all.
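A statement about "the majority" within each class can be checked numerically with row proportions, and the six near-identical barplot calls could be folded into one helper. The sketch below runs on a made-up six-row data frame, so its numbers are not the real counts; `toy`, `odor_share`, `class_tables`, and `plot_class_by` are hypothetical names.

```r
# Small illustrative data frame (made-up rows, not the real data).
toy <- data.frame(
  Class   = c("edible", "edible", "edible", "poisonous", "poisonous", "poisonous"),
  Odor    = c("none", "none", "almond", "foul", "foul", "none"),
  Habitat = c("woods", "grasses", "woods", "woods", "paths", "paths")
)

# Share of each odor within each class; margin = 1 makes every row sum to 1.
odor_share <- prop.table(table(toy$Class, toy$Odor), margin = 1)
round(odor_share, 2)

# Class-by-predictor count tables for several predictors in one pass.
predictors <- c("Odor", "Habitat")
class_tables <- lapply(predictors, function(p) table(toy[["Class"]], toy[[p]]))
names(class_tables) <- predictors

# Helper that draws the grouped bar chart for any predictor column.
plot_class_by <- function(df, predictor) {
  counts <- table(df[["Class"]], df[[predictor]])
  barplot(counts, beside = TRUE, legend.text = TRUE,
          col = c("lightblue", "darkorange"),
          xlab = predictor, ylab = "Count",
          main = paste("Edible vs. Poisonous by", predictor),
          cex.names = 0.75)
}
```

Calling plot_class_by(toy, "Odor") draws one chart. Looping over the predictor names with the real subset would reproduce the six plots above, with axis labels and titles generated from the column name.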