Gather the names file

First we get names file and concat it into a single string. Then we use an over complicated regex to extract the attributes, we then append the missing class attribute, and then use a double dose of regex (because of uncertainty in how capture groups work with str_extract) to extract the headers

names= readLines(url("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names"))
named = paste(names,collapse="\n")
attributes = str_extract_all( named, "\\ {4,5}[0-9]{1,2}\\..*:.*\n(\\s{15,}[a-z].*\n){0,6}\n{0,1}")[[1]]
attributes = prepend(attributes, "     0. class:                       poisonous=p,edible=e")
headers = str_extract(str_extract(attributes,"[0-9]{1,2}\\.\\s([a-z\\-\\?]*):"),"[a-z\\-\\?]+")
kable(head(headers[1:6]))
x
class
cap-shape
cap-surface
cap-color
bruises?
odor

The actual data

Once we have the headers extracted we can get the data and give it names. We also subset it to show we can.

 mushDat <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"), col.names=headers)
subMush = mushDat[,c(1,2,3,6,22,23)]
kable(head(subMush),  )
class cap.shape cap.surface odor population habitat
e x s a n g
e b s l n m
p x y p s u
e x s n a g
e x y a n g
e b s a n m

Changing the contents to be human readable

The data is coming along but is unreadable without the key. So we use a much simpler regex to extract the translation keys. These aren’t in a very useful format so we bind them up into a list of mapping tables. We then use mapvalues from plpyr to do the remapping using the table lists for each column. Finally we have human readable data.

mapping = sapply(str_extract_all(attributes,"[a-z]+=[a-z]+"),strsplit,"=")
kable(head(mapping[1:2][1:2]))
x
poisonous
p
x
edible
e
x
bell
b
x
conical
c
x
convex
x
x
flat
f
x
knobbed
k
x
sunken
s
mapTables=list()
for(i in 1:length(mapping))
  mapTables[[i]] = do.call(rbind,mapping[[i]])
kable(head(mapTables[[1]]))
poisonous p
edible e
for(i in 1:length(mapping))
  mushDat[[i]] <- mapvalues(mushDat[[i]],mapTables[[i]][,2],mapTables[[i]][,1])
subMush = mushDat[,c(1,5,7,8,22,23)]
kable(head(subMush))
class bruises. gill.attachment gill.spacing population habitat
edible bruises free close numerous grasses
edible bruises free close numerous meadows
poisonous bruises free close scattered urban
edible no free crowded abundant grasses
edible bruises free close numerous grasses
edible bruises free close numerous meadows

Some basic data exploration

Let’s first see if eating in a random selection of species (as in a museum or archive, obviously random in the world depends on population, etc)

ggplot(mushDat, aes(x=class))+geom_bar(aes(y=stat(count),fill=class))

That doesn’t look terribly reassuring. Of course it depends on where we are in the world.

ggplot(mushDat, aes(x=class,y=habitat)) +geom_bin2d()

Well there aren’t a lot of wasteland mushrooms, but they are all edible. Paths are apparently more dangerous than you’d expect though.

Finally we will finish with some clustering. Since we don’t have any numeric data we will use kmodes to try to find some relations.

kable(kmodes(mushDat[1:4],10)$modes)
class cap.shape cap.surface cap.color
edible flat scaly gray
poisonous knobbed scaly brown
poisonous convex smooth red
edible convex fibrous brown
edible convex scaly brown
poisonous convex scaly brown
edible bell smooth white
edible convex scaly yellow
poisonous convex fibrous gray
poisonous flat scaly yellow

Going further we could probably do some decomposition to find the property that is the safest.