First we fetch the names file and concatenate it into a single string. Then we use an overcomplicated regex to extract the attribute entries, prepend the missing class attribute, and use a double dose of regex (because of uncertainty about how capture groups work with str_extract) to extract the headers.
names= readLines(url("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names"))
named = paste(names,collapse="\n")
attributes = str_extract_all( named, "\\ {4,5}[0-9]{1,2}\\..*:.*\n(\\s{15,}[a-z].*\n){0,6}\n{0,1}")[[1]]
attributes = prepend(attributes, " 0. class: poisonous=p,edible=e")
headers = str_extract(str_extract(attributes,"[0-9]{1,2}\\.\\s([a-z\\-\\?]*):"),"[a-z\\-\\?]+")
kable(head(headers[1:6]))
| x |
|---|
| class |
| cap-shape |
| cap-surface |
| cap-color |
| bruises? |
| odor |
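If capture groups are the sticking point, stringr's str_match may be worth a look: it returns each capture group as an extra matrix column, so a single pattern is enough. A minimal sketch on a few made-up entries in the same shape as attributes (the names sample_attributes and sample_headers are ours, purely for illustration):

```r
library(stringr)

# Hypothetical sample entries shaped like the elements of `attributes`
sample_attributes <- c(
  " 0. class: poisonous=p,edible=e",
  "     1. cap-shape:                bell=b,conical=c,convex=x,flat=f,\n                                  knobbed=k,sunken=s\n",
  "     4. bruises?:                 bruises=t,no=f\n"
)

# str_match() returns a character matrix:
# column 1 is the full match, column 2 is the first capture group
sample_headers <- str_match(
  sample_attributes,
  "[0-9]{1,2}\\.\\s([a-z\\-\\?]+):"
)[, 2]

print(sample_headers)
```

Taking column 2 picks out just the header name, which is what the nested str_extract calls were doing in two passes.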
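Once we have the headers extracted we can fetch the data and name its columns. We also subset it to show we can. Note that read.csv treats the first line of the file as a header by default, so the first record is dropped here; passing header=FALSE would keep it.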
mushDat <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"), col.names=headers)
subMush = mushDat[,c(1,2,3,6,22,23)]
kable(head(subMush))
| class | cap.shape | cap.surface | odor | population | habitat |
|---|---|---|---|---|---|
| e | x | s | a | n | g |
| e | b | s | l | n | m |
| p | x | y | p | s | u |
| e | x | s | n | a | g |
| e | x | y | a | n | g |
| e | b | s | a | n | m |
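Two read.csv defaults shape the table above: header=TRUE consumes the first line of the file, and check.names=TRUE rewrites names like cap-shape to cap.shape. A small sketch on two made-up records (raw, cols, with_default and keep_all are illustrative names, not part of the original script) shows the difference:

```r
# Two made-up records in the same comma-separated shape as the data file
raw  <- "p,x,s\ne,b,y"
cols <- c("class", "cap-shape", "bruises?")

# Defaults: the first record is consumed as a header line,
# and the supplied names are made syntactically valid
with_default <- read.csv(text = raw, col.names = cols)

# header = FALSE keeps every record; check.names = FALSE keeps names verbatim
keep_all <- read.csv(text = raw, header = FALSE, col.names = cols,
                     check.names = FALSE)

print(nrow(with_default))
print(names(with_default))
print(nrow(keep_all))
print(names(keep_all))
```

Keeping the original names means columns have to be referenced with backticks or [[ ]], so the mangled names can be the more convenient choice.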
The data is coming along but is unreadable without the key. So we use a much simpler regex to extract the translation keys. These aren’t in a very useful format, so we bind them up into a list of mapping tables, one per column. We then use mapvalues from plyr to remap each column using its table. Finally we have human-readable data.
mapping = sapply(str_extract_all(attributes,"[a-z]+=[a-z]+"),strsplit,"=")
kable(head(mapping[1:2][1:2]))
mapTables = list()
for (i in seq_along(mapping))
  mapTables[[i]] = do.call(rbind, mapping[[i]])
kable(head(mapTables[[1]]))
|  |  |
|---|---|
| poisonous | p |
| edible | e |
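The same two-column table can probably be built in one step with str_match_all, which returns the capture groups of every match as matrix columns. A sketch on the class entry only (entry, pairs and map_table are illustrative names; the label and code column names are our own addition):

```r
library(stringr)

# One attribute entry, here the prepended class line
entry <- " 0. class: poisonous=p,edible=e"

# One row per match: full text, then the two capture groups
pairs <- str_match_all(entry, "([a-z]+)=([a-z]+)")[[1]]

# Keep only the captured label and code
map_table <- pairs[, 2:3, drop = FALSE]
colnames(map_table) <- c("label", "code")

print(map_table)
```

As with the original pattern, a code that is not a lowercase letter (a ? for a missing value, say) would not be matched and so would stay untranslated.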
for (i in seq_along(mapping))
  mushDat[[i]] <- mapvalues(mushDat[[i]], mapTables[[i]][,2], mapTables[[i]][,1])
subMush = mushDat[,c(1,5,7,8,22,23)]
kable(head(subMush))
| class | bruises. | gill.attachment | gill.spacing | population | habitat |
|---|---|---|---|---|---|
| edible | bruises | free | close | numerous | grasses |
| edible | bruises | free | close | numerous | meadows |
| poisonous | bruises | free | close | scattered | urban |
| edible | no | free | crowded | abundant | grasses |
| edible | bruises | free | close | numerous | grasses |
| edible | bruises | free | close | numerous | meadows |
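To make the remapping step less of a black box, here is roughly what it amounts to in base R: look each code up in the table and leave anything unknown as it was. This is a sketch of the idea, not plyr's actual implementation, and codes, idx and decoded are made-up names:

```r
# A mapping table in the same layout as mapTables[[1]]: label, then code
map_table <- rbind(c("poisonous", "p"), c("edible", "e"))

# Made-up column of codes, including one that has no entry in the table
codes <- c("e", "p", "e", "?")

# Position of each code in the table (NA when the code is unknown)
idx <- match(codes, map_table[, 2])

# Replace known codes with their label, keep unknown codes unchanged
decoded <- ifelse(is.na(idx), codes, map_table[idx, 1])

print(decoded)
```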
Let’s first see how safe it is to eat a mushroom picked at random from the species in the dataset (as in a museum or archive; a random pick out in the world would obviously depend on population, etc.).
ggplot(mushDat, aes(x=class)) + geom_bar(aes(fill=class))
That doesn’t look terribly reassuring. Of course it depends on where we are in the world.
ggplot(mushDat, aes(x=class,y=habitat)) +geom_bin2d()
Well there aren’t a lot of wasteland mushrooms, but they are all edible. Paths are apparently more dangerous than you’d expect though.
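The heatmap can be backed up with numbers using a plain cross-tabulation. A sketch on a made-up sample (sample_mush and its six rows are invented for illustration and are not taken from the dataset):

```r
# Made-up sample in the shape of mushDat after remapping
sample_mush <- data.frame(
  class   = c("edible", "poisonous", "edible", "poisonous", "edible", "edible"),
  habitat = c("waste", "paths", "waste", "paths", "paths", "grasses")
)

# Counts of each class within each habitat
counts <- table(sample_mush$habitat, sample_mush$class)

# Share of each class within a habitat (each row sums to 1)
shares <- prop.table(counts, margin = 1)

print(counts)
print(round(shares, 2))
```

Running the same two calls on mushDat$habitat and mushDat$class would give the real proportions behind the plot.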
Finally we will finish with some clustering. Since we don’t have any numeric data we will use k-modes (kmodes, available in the klaR package) to try to find some relations between the attributes.
kable(kmodes(mushDat[1:4],10)$modes)
| class | cap.shape | cap.surface | cap.color |
|---|---|---|---|
| edible | flat | scaly | gray |
| poisonous | knobbed | scaly | brown |
| poisonous | convex | smooth | red |
| edible | convex | fibrous | brown |
| edible | convex | scaly | brown |
| poisonous | convex | scaly | brown |
| edible | bell | smooth | white |
| edible | convex | scaly | yellow |
| poisonous | convex | fibrous | gray |
| poisonous | flat | scaly | yellow |
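For intuition about what k-modes does with categorical data, its two ingredients can be sketched in base R: a simple matching distance (how many attributes differ) and a per-column mode as the cluster centre. This is an illustration of the idea, not klaR's code, and all names below are made up. Since kmodes picks its starting modes at random when given a number of clusters, the modes above will likely differ between runs unless a seed is set.

```r
# Simple matching dissimilarity: number of attributes on which two rows differ
matching_distance <- function(a, b) sum(a != b)

# Mode of a categorical vector (on ties, the alphabetically first level)
column_mode <- function(x) names(which.max(table(x)))

# Made-up members of a single cluster
cluster_rows <- data.frame(
  cap.shape   = c("convex", "convex", "flat"),
  cap.surface = c("scaly", "smooth", "scaly"),
  stringsAsFactors = FALSE
)

# The cluster centre is the most common value in each column
centre <- vapply(cluster_rows, column_mode, character(1))

# Distance of every member to the centre
distances <- apply(cluster_rows, 1, matching_distance, b = centre)

print(centre)
print(distances)
```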
Going further, we could probably do some decomposition to find the property that is the best indicator of a safe mushroom.