Very often, we’re tasked with taking data in one form and transforming it for easier downstream analysis. We will spend several weeks in this course on tidying and transformation operations. Some of this work could be done in SQL or R (or Python or…). Here, you are asked to use R; you may use base functions or packages as you like.

Mushrooms Dataset

A famous (if slightly moldy) dataset about mushrooms can be found in the UCI repository here: https://archive.ics.uci.edu/ml/datasets/Mushroom. The fact that this is such a well-known dataset in the data science community makes it a good dataset to use for comparative benchmarking. For example, if someone was working to build a better decision tree algorithm (or other predictive classifier) to analyze categorical data, this dataset could be useful. A typical problem (which is beyond the scope of this assignment!) is to answer the question, “Which other attribute or attributes are the best predictors of whether a particular mushroom is poisonous or edible?”

Your task is to study the dataset and the associated description of the data (i.e. “data dictionary”). You may need to look around a bit, but it’s there! You should take the data, and create a data frame with a subset of the columns in the dataset. You should include the column that indicates edible or poisonous and three or four other columns. You should also add meaningful column names and replace the abbreviations used in the data; for example, in the appropriate column, “e” might become “edible.” Your deliverable is the R code to perform these transformation tasks.
library(knitr)
df <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data")
# read.csv() already returns a data frame; the conversion is kept only to make that explicit.
df <- as.data.frame(df)
kable(head(df))
| p | x | s | n | t | p.1 | f | c | n.1 | k | e | e.1 | s.1 | s.2 | w | w.1 | p.2 | w.2 | o | p.3 | k.1 | s.3 | u |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| e | x | s | y | t | a | f | c | b | k | e | c | s | s | w | w | p | w | o | p | n | n | g |
| e | b | s | w | t | l | f | c | b | n | e | c | s | s | w | w | p | w | o | p | n | n | m |
| p | x | y | w | t | p | f | c | n | n | e | e | s | s | w | w | p | w | o | p | k | s | u |
| e | x | s | g | f | n | f | w | b | k | t | e | s | s | w | w | p | w | o | e | n | a | g |
| e | x | y | y | t | a | f | c | b | n | e | c | s | s | w | w | p | w | o | p | k | n | g |
| e | b | s | w | t | a | f | c | b | g | e | c | s | s | w | w | p | w | o | p | k | n | m |
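This doesn’t look like much yet. Note that the column names in the table above (p, x, s, …) are really the first record of the file: the data has no header line, and read.csv() assumes one unless header = FALSE is passed, so that record is lost once the columns are renamed. Let’s move on to naming the variables/features.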
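The column names p, x, s, … (with suffixes such as p.1) in the table above look like a data row that has been promoted to a header. A minimal sketch of the difference, using three inline records in the same layout rather than the real download (the `raw` string and the two object names are illustrative only):

```r
# Three comma-separated records with no header line, as in the mushroom file
raw <- "p,x,s,n\ne,x,s,y\ne,b,s,w"

# Default behaviour: the first record is consumed as the header
with_default <- read.csv(text = raw)

# header = FALSE keeps every record as data and names the columns V1, V2, ...
with_header_false <- read.csv(text = raw, header = FALSE)

nrow(with_default)
nrow(with_header_false)
names(with_header_false)
```

With header = FALSE the first record stays in the data, at the cost of placeholder column names that have to be replaced anyway.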
The homework listed its tasks in a certain order, but I think it makes sense to assign names to the columns first.
In a quick search I found this RPub, which figured out a concise way to pull the data from the dictionary and name the columns. It also replaced the coded values with their proper names. The value I am adding in this assignment is to automate the naming of the categorical values, so that every column is correctly decoded.
file <- 'https://raw.githubusercontent.com/dvillalobos/MSDA/master/607/Homework/Villalobos-Homework1-dictionary.txt'
mushroomsdict <- read.table(file, sep="|", header=TRUE, stringsAsFactors = FALSE)
mushroomsdict
## Index Attribute
## 1 0 class
## 2 1 cap-shape
## 3 2 cap-surface
## 4 3 cap-color
## 5 4 bruises?
## 6 5 odor
## 7 6 gill-attachment
## 8 7 gill-spacing
## 9 8 gill-size
## 10 9 gill-color
## 11 10 stalk-shape
## 12 11 stalk-root
## 13 12 stalk-surface-above-ring
## 14 13 stalk-surface-below-ring
## 15 14 stalk-color-above-ring
## 16 15 stalk-color-below-ring
## 17 16 veil-type
## 18 17 veil-color
## 19 18 ring-number
## 20 19 ring-type
## 21 20 spore-print-color
## 22 21 population
## 23 22 habitat
## Information
## 1 edible=e,poisonous=p
## 2 bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s
## 3 fibrous=f,grooves=g,scaly=y,smooth=s
## 4 brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y
## 5 bruises=t,no=f
## 6 almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s
## 7 attached=a,descending=d,free=f,notched=n
## 8 close=c,crowded=w,distant=d
## 9 broad=b,narrow=n
## 10 black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y
## 11 enlarging=e,tapering=t
## 12 bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?
## 13 fibrous=f,scaly=y,silky=k,smooth=s
## 14 fibrous=f,scaly=y,silky=k,smooth=s
## 15 brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
## 16 brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
## 17 partial=p,universal=u
## 18 brown=n,orange=o,white=w,yellow=y
## 19 none=n,one=o,two=t
## 20 cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z
## 21 black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y
## 22 abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y
## 23 grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d
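Each Information entry is a comma-separated list of name=code pairs. As a rough sketch of how one such entry can be turned into a code-to-name lookup (the variable names here are illustrative, and the string is copied from the cap-shape row above):

```r
# One Information entry, copied from the cap-shape row of the dictionary
info <- "bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s"

# Split on "," to get the pairs, then split each pair on "="
pairs <- strsplit(strsplit(info, ",", fixed = TRUE)[[1]], "=", fixed = TRUE)

# Build a named vector: names are the one-letter codes, values are the full names
full_names <- vapply(pairs, `[`, character(1), 1)
codes <- vapply(pairs, `[`, character(1), 2)
lookup <- setNames(full_names, codes)

# Decode a few codes by indexing with them
decoded <- unname(lookup[c("x", "b", "s")])
decoded
```

Indexing the lookup by code, rather than by position, keeps each code tied to its own name regardless of the order in which the dictionary lists them.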
The dictionary’s Attribute column lets us name the columns:
colnames(df) <- mushroomsdict$Attribute
head(df)
## class cap-shape cap-surface cap-color bruises? odor gill-attachment
## 1 e x s y t a f
## 2 e b s w t l f
## 3 p x y w t p f
## 4 e x s g f n f
## 5 e x y y t a f
## 6 e b s w t a f
## gill-spacing gill-size gill-color stalk-shape stalk-root
## 1 c b k e c
## 2 c b n e c
## 3 c n n e e
## 4 w b k t e
## 5 c b n e c
## 6 c b g e c
## stalk-surface-above-ring stalk-surface-below-ring stalk-color-above-ring
## 1 s s w
## 2 s s w
## 3 s s w
## 4 s s w
## 5 s s w
## 6 s s w
## stalk-color-below-ring veil-type veil-color ring-number ring-type
## 1 w p w o p
## 2 w p w o p
## 3 w p w o p
## 4 w p w o e
## 5 w p w o p
## 6 w p w o p
## spore-print-color population habitat
## 1 n n g
## 2 n n m
## 3 k s u
## 4 n a g
## 5 k n g
## 6 k n m
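Names such as cap-shape and bruises? are not syntactic R names, so they need backticks or string indexing. A small sketch on a toy data frame (the `toy` and `new_names` objects are made up for illustration) that also guards the renaming with a length check:

```r
# Toy two-column data frame standing in for the mushroom data
toy <- data.frame(c("e", "p"), c("x", "b"), stringsAsFactors = FALSE)
new_names <- c("class", "cap-shape")

# Guard: the number of names must match the number of columns
stopifnot(length(new_names) == ncol(toy))
colnames(toy) <- new_names

# Non-syntactic names need backticks with $, or a string with [[ ]]
shape_by_backtick <- toy$`cap-shape`
shape_by_name <- toy[["cap-shape"]]

# make.names() is one way to get syntactic names, if those are preferred
syntactic <- make.names(c("cap-shape", "bruises?"))
syntactic
```

Keeping the dictionary's names makes the output match the data description; converting them with make.names() trades that for easier typing.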
Again, thanks to Duubar Villalobos Jimenez.
transMush <- function(headcols){
  # Read the "name=code" pairs for this column from the dictionary:
  mushHeadVals <- mushroomsdict$Information[headcols]
  # as.character() is needed so the entry can be split on "," into strings:
  mushHeadVals <- strsplit(as.character(mushHeadVals), ",", fixed = TRUE)[[1]]
  # Separate each pair on "=" and convert to a two-column data frame (name, code):
  mushHeadVals <- data.frame(do.call("rbind",
                                     strsplit(mushHeadVals, "=", fixed = TRUE)),
                             stringsAsFactors = FALSE)
  # Rename the columns: the attribute name holds the full names, "values" the codes:
  colnames(mushHeadVals) <- c(mushroomsdict$Attribute[headcols], "values")
  # Recode the column as a factor. Passing levels and labels together keeps each
  # code paired with its own name; relying on factor()'s default alphabetical
  # level order would not.
  mush[, headcols] <- factor(mush[, headcols],
                             levels = mushHeadVals$values,
                             labels = mushHeadVals[, 1])
  return(mush)
}
# A separate copy is, as far as I can tell, not strictly necessary, but it seems worthwhile to preserve the integrity of the original.
mush <- df
head(mush)
## class cap-shape cap-surface cap-color bruises? odor gill-attachment
## 1 e x s y t a f
## 2 e b s w t l f
## 3 p x y w t p f
## 4 e x s g f n f
## 5 e x y y t a f
## 6 e b s w t a f
## gill-spacing gill-size gill-color stalk-shape stalk-root
## 1 c b k e c
## 2 c b n e c
## 3 c n n e e
## 4 w b k t e
## 5 c b n e c
## 6 c b g e c
## stalk-surface-above-ring stalk-surface-below-ring stalk-color-above-ring
## 1 s s w
## 2 s s w
## 3 s s w
## 4 s s w
## 5 s s w
## 6 s s w
## stalk-color-below-ring veil-type veil-color ring-number ring-type
## 1 w p w o p
## 2 w p w o p
## 3 w p w o p
## 4 w p w o e
## 5 w p w o p
## 6 w p w o p
## spore-print-color population habitat
## 1 n n g
## 2 n n m
## 3 k s u
## 4 n a g
## 5 k n g
## 6 k n m
To distinguish this analysis, I’m going to use a for loop to iterate over all the columns and replace every coded value with its full name. (It would be nice to do this by applying the function over a vector of columns instead, but I’m not good enough yet.)
for (i in seq_along(mush)) {
  mush <- transMush(i)
}
kable(head(mush, 20))
| class | cap-shape | cap-surface | cap-color | bruises? | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| edible | convex | smooth | yellow | bruises | almond | free | close | broad | black | enlarging | club | smooth | smooth | white | white | partial | white | one | pendant | brown | numerous | grasses |
| edible | bell | smooth | white | bruises | anise | free | close | broad | brown | enlarging | club | smooth | smooth | white | white | partial | white | one | pendant | brown | numerous | meadows |
| poisonous | convex | scaly | white | bruises | pungent | free | close | narrow | brown | enlarging | equal | smooth | smooth | white | white | partial | white | one | pendant | black | scattered | urban |
| edible | convex | smooth | gray | no | none | free | crowded | broad | black | tapering | equal | smooth | smooth | white | white | partial | white | one | evanescent | brown | abundant | grasses |
| edible | convex | scaly | yellow | bruises | almond | free | close | broad | brown | enlarging | club | smooth | smooth | white | white | partial | white | one | pendant | black | numerous | grasses |
| edible | bell | smooth | white | bruises | almond | free | close | broad | gray | enlarging | club | smooth | smooth | white | white | partial | white | one | pendant | black | numerous | meadows |
| edible | bell | scaly | white | bruises | anise | free | close | broad | brown | enlarging | club | smooth | smooth | white | white | partial | white | one | pendant | brown | scattered | meadows |
| poisonous | convex | scaly | white | bruises | pungent | free | close | narrow | pink | enlarging | equal | smooth | smooth | white | white | partial | white | one | pendant | black | several | grasses |
| edible | bell | smooth | yellow | bruises | almond | free | close | broad | gray | enlarging | club | smooth | smooth | white | white | partial | white | one | pendant | black | scattered | meadows |
| edible | convex | scaly | yellow | bruises | anise | free | close | broad | gray | enlarging | club | smooth | smooth | white | white | partial | white | one | pendant | brown | numerous | grasses |
| edible | convex | scaly | yellow | bruises | almond | free | close | broad | brown | enlarging | club | smooth | smooth | white | white | partial | white | one | pendant | black | scattered | meadows |
| edible | bell | smooth | yellow | bruises | almond | free | close | broad | white | enlarging | club | smooth | smooth | white | white | partial | white | one | pendant | brown | scattered | grasses |
| poisonous | convex | scaly | white | bruises | pungent | free | close | narrow | black | enlarging | equal | smooth | smooth | white | white | partial | white | one | pendant | brown | several | urban |
| edible | convex | fibrous | brown | no | none | free | crowded | broad | brown | tapering | equal | smooth | fibrous | white | white | partial | white | one | evanescent | black | abundant | grasses |
| edible | sunken | fibrous | gray | no | none | free | close | narrow | black | enlarging | equal | smooth | smooth | white | white | partial | white | one | pendant | brown | solitary | urban |
| edible | flat | fibrous | white | no | none | free | crowded | broad | black | tapering | equal | smooth | smooth | white | white | partial | white | one | evanescent | brown | abundant | grasses |
| poisonous | convex | smooth | brown | bruises | pungent | free | close | narrow | brown | enlarging | equal | smooth | smooth | white | white | partial | white | one | pendant | black | scattered | grasses |
| poisonous | convex | scaly | white | bruises | pungent | free | close | narrow | brown | enlarging | equal | smooth | smooth | white | white | partial | white | one | pendant | brown | scattered | urban |
| poisonous | convex | smooth | brown | bruises | pungent | free | close | narrow | black | enlarging | equal | smooth | smooth | white | white | partial | white | one | pendant | brown | scattered | urban |
| edible | bell | smooth | yellow | bruises | almond | free | close | broad | black | enlarging | club | smooth | smooth | white | white | partial | white | one | pendant | brown | scattered | meadows |
At this point, subsetting is easy.
names(mush)
## [1] "class" "cap-shape"
## [3] "cap-surface" "cap-color"
## [5] "bruises?" "odor"
## [7] "gill-attachment" "gill-spacing"
## [9] "gill-size" "gill-color"
## [11] "stalk-shape" "stalk-root"
## [13] "stalk-surface-above-ring" "stalk-surface-below-ring"
## [15] "stalk-color-above-ring" "stalk-color-below-ring"
## [17] "veil-type" "veil-color"
## [19] "ring-number" "ring-type"
## [21] "spore-print-color" "population"
## [23] "habitat"
subMush <- subset(mush, select = c("class", "cap-shape", "habitat", "ring-number"))
kable(head(subMush, 20))
| class | cap-shape | habitat | ring-number |
|---|---|---|---|
| edible | convex | grasses | one |
| edible | bell | meadows | one |
| poisonous | convex | urban | one |
| edible | convex | grasses | one |
| edible | convex | grasses | one |
| edible | bell | meadows | one |
| edible | bell | meadows | one |
| poisonous | convex | grasses | one |
| edible | bell | meadows | one |
| edible | convex | grasses | one |
| edible | convex | meadows | one |
| edible | bell | grasses | one |
| poisonous | convex | urban | one |
| edible | convex | grasses | one |
| edible | sunken | urban | one |
| edible | flat | grasses | one |
| poisonous | convex | grasses | one |
| poisonous | convex | urban | one |
| poisonous | convex | urban | one |
| edible | bell | meadows | one |
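The column-by-column loop could also be written without an explicit index, as wished for earlier. The following is only a sketch on toy data, not the code used for the tables above: `dict`, `toy` and `recode_column` are hypothetical stand-ins for the dictionary, the mushroom data frame and the recoding step.

```r
# Toy dictionary with the same two columns as the real one
dict <- data.frame(
  Attribute = c("class", "cap-shape"),
  Information = c("edible=e,poisonous=p",
                  "bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s"),
  stringsAsFactors = FALSE
)

# Toy data: one column of codes per dictionary row
toy <- data.frame(c("e", "p", "e"), c("x", "b", "s"), stringsAsFactors = FALSE)
colnames(toy) <- dict$Attribute

# Hypothetical helper: decode one column of codes using one Information string
recode_column <- function(codes, info) {
  pairs <- do.call(rbind,
                   strsplit(strsplit(info, ",", fixed = TRUE)[[1]], "=", fixed = TRUE))
  # Column 1 holds the full names, column 2 the one-letter codes
  factor(codes, levels = pairs[, 2], labels = pairs[, 1])
}

# Map() walks the columns and the dictionary entries in parallel;
# assigning into toy[] keeps the data frame structure and column names
toy[] <- Map(recode_column, toy, dict$Information)
toy
```

Because the helper takes the codes and the dictionary entry as arguments instead of reading global objects, it can be tested on small inputs like these before being applied to all 23 columns.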