A description of the data is available here: https://archive.ics.uci.edu/ml/datasets/Mushroom
The data file does not contain a header row, so set header=FALSE when reading it. Details here: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html
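The load step itself is not shown here. As a minimal sketch, assuming the data.table package is installed, the call might look like the following. The two sample rows are copied from the first two records printed in this post so the snippet runs without the file; the name `df.sample` and the file name in the commented line are assumptions, not part of the original code.

```r
library(data.table)

# Two raw records, copied from the first two rows shown in this post
sample_lines <- c(
  "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u",
  "e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g"
)

# header = FALSE stops fread from treating the first record as column names,
# so the columns receive the default names V1..V23
df.sample <- fread(text = sample_lines, header = FALSE)
dim(df.sample)

# For the full dataset (hypothetical local path, adjust as needed):
# df.mushroom <- fread("agaricus-lepiota.data", header = FALSE)
```

Since fread returns character columns rather than factors, string comparisons and assignments on the result should behave as expected.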
Check the data:
head(df.mushroom)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 1: p x s n t p f c n k e e s s w w p w o p
## 2: e x s y t a f c b k e c s s w w p w o p
## 3: e b s w t l f c b n e c s s w w p w o p
## 4: p x y w t p f c n n e e s s w w p w o p
## 5: e x s g f n f w b k t e s s w w p w o e
## 6: e x y y t a f c b n e c s s w w p w o p
## V21 V22 V23
## 1: k s u
## 2: n n g
## 3: n n m
## 4: k s u
## 5: n a g
## 6: k n g
Now we need to use the provided documentation to give the columns their correct names. Let's create a vector that holds all the column names; we can then assign this vector to the data frame's column names. This handy tutorial details the process I used: http://rprogramming.net/rename-columns-in-r/
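The vector itself is not shown in this post. A possible version is sketched below, following the attribute order in the UCI documentation. Only the four names used later (class, capshape, odor, spore_print_color) come from the original code; the spelling of the other names is an assumption.

```r
# One name per column, in the order the attributes appear in the data file.
# Only "class", "capshape", "odor" and "spore_print_color" are taken from the
# original code; the remaining spellings are assumed for illustration.
mushroom.col <- c(
  "class", "capshape", "cap_surface", "cap_color", "bruises", "odor",
  "gill_attachment", "gill_spacing", "gill_size", "gill_color",
  "stalk_shape", "stalk_root",
  "stalk_surface_above_ring", "stalk_surface_below_ring",
  "stalk_color_above_ring", "stalk_color_below_ring",
  "veil_type", "veil_color", "ring_number", "ring_type",
  "spore_print_color", "population", "habitat"
)

# The data has 23 columns (V1..V23), so the vector must have 23 entries
stopifnot(length(mushroom.col) == 23)
```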
Assign the column names to the data frame:
colnames(df.mushroom) <- mushroom.col
# Keep only the four columns needed for the analysis
df.mushroomb <- subset(df.mushroom, select = c('class', 'capshape', 'odor', 'spore_print_color'))
The single-letter codes in the class column need to be replaced with their full names. We could do this manually, but a loop that iterates over each entry might be easier. Let's convert the entries in the other selected columns to their full names as well. I used this handy tutorial to write the following loops: https://www.datamentor.io/r-programming/if-else-statement
# Recode class: p = poisonous, e = edible
i <- 1
for (x in df.mushroomb$class) {
  if (x == 'p') {
    df.mushroomb$class[i] <- "poisonous"
  } else if (x == 'e') {
    df.mushroomb$class[i] <- "edible"
  }
  i <- i + 1
}
# Recode cap shape
i <- 1
for (x in df.mushroomb$capshape) {
  if (x == 'b') {
    df.mushroomb$capshape[i] <- "bell"
  } else if (x == 'c') {
    df.mushroomb$capshape[i] <- "conical"
  } else if (x == 'x') {
    df.mushroomb$capshape[i] <- "convex"
  } else if (x == 'f') {
    df.mushroomb$capshape[i] <- "flat"
  } else if (x == 'k') {
    df.mushroomb$capshape[i] <- "knobbed"
  } else if (x == 's') {
    df.mushroomb$capshape[i] <- "sunken"
  }
  i <- i + 1
}
# Recode odor
i <- 1
for (x in df.mushroomb$odor) {
  if (x == 'a') {
    df.mushroomb$odor[i] <- "almond"
  } else if (x == 'l') {
    df.mushroomb$odor[i] <- "anise"
  } else if (x == 'c') {
    df.mushroomb$odor[i] <- "creosote"
  } else if (x == 'y') {
    df.mushroomb$odor[i] <- "fishy"
  } else if (x == 'f') {
    df.mushroomb$odor[i] <- "foul"
  } else if (x == 'm') {
    df.mushroomb$odor[i] <- "musty"
  } else if (x == 'n') {
    df.mushroomb$odor[i] <- "none"
  } else if (x == 'p') {
    df.mushroomb$odor[i] <- "pungent"
  } else if (x == 's') {
    df.mushroomb$odor[i] <- "spicy"
  }
  i <- i + 1
}
# Recode spore print color
i <- 1
for (x in df.mushroomb$spore_print_color) {
  if (x == 'k') {
    df.mushroomb$spore_print_color[i] <- "black"
  } else if (x == 'n') {
    df.mushroomb$spore_print_color[i] <- "brown"
  } else if (x == 'b') {
    df.mushroomb$spore_print_color[i] <- "buff"
  } else if (x == 'h') {
    df.mushroomb$spore_print_color[i] <- "chocolate"
  } else if (x == 'r') {
    df.mushroomb$spore_print_color[i] <- "green"
  } else if (x == 'o') {
    df.mushroomb$spore_print_color[i] <- "orange"
  } else if (x == 'u') {
    df.mushroomb$spore_print_color[i] <- "purple"
  } else if (x == 'w') {
    df.mushroomb$spore_print_color[i] <- "white"
  } else if (x == 'y') {
    df.mushroomb$spore_print_color[i] <- "yellow"
  }
  i <- i + 1
}
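As an alternative to the loops, the same recoding can be sketched with named lookup vectors, which avoids the manual counter and works on a whole column at once. The helper `recode_codes` and the toy data frame `toy` are hypothetical names introduced for illustration; the code-to-name mappings are the ones used in the loops above. This is a sketch of one possible approach, not the method used in this post.

```r
# Lookup tables: names are the single-letter codes, values are the full names
class_map    <- c(p = "poisonous", e = "edible")
capshape_map <- c(b = "bell", c = "conical", x = "convex",
                  f = "flat", k = "knobbed", s = "sunken")
odor_map     <- c(a = "almond", l = "anise", c = "creosote", y = "fishy",
                  f = "foul", m = "musty", n = "none", p = "pungent",
                  s = "spicy")
spore_map    <- c(k = "black", n = "brown", b = "buff", h = "chocolate",
                  r = "green", o = "orange", u = "purple", w = "white",
                  y = "yellow")

# Hypothetical helper: replace known codes, leave anything else unchanged
# (the same behaviour as the if/else chains, which skip unmatched values)
recode_codes <- function(x, map) {
  x <- as.character(x)          # also handles columns loaded as factors
  hit <- x %in% names(map)
  x[hit] <- unname(map[x[hit]])
  x
}

# Toy data: the raw codes of the first three records in this post
toy <- data.frame(
  class = c("p", "e", "e"),
  capshape = c("x", "x", "b"),
  odor = c("p", "a", "l"),
  spore_print_color = c("k", "n", "n"),
  stringsAsFactors = FALSE
)

toy$class             <- recode_codes(toy$class, class_map)
toy$capshape          <- recode_codes(toy$capshape, capshape_map)
toy$odor              <- recode_codes(toy$odor, odor_map)
toy$spore_print_color <- recode_codes(toy$spore_print_color, spore_map)
toy
```

Because the helper converts its input with as.character() first, it should give the same result whether a column was read as character or as factor.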
Troubleshooting: you may need to keep the columns as character strings rather than factors when you load the data, otherwise the loops will not work. Whether this is necessary depends on how the data is loaded; reading it with fread seemed to fix it.
How does our data look so far?
head(df.mushroomb)
## class capshape odor spore_print_color
## 1: poisonous convex pungent black
## 2: edible convex almond brown
## 3: edible bell anise brown
## 4: poisonous convex pungent black
## 5: edible convex none brown
## 6: edible convex almond black
The data looks good, so let's explore it. What are the relative frequencies for the class variable? Let's try a few things to answer this question.
Get the counts of edible and poisonous mushrooms:
table(df.mushroomb$class)
##
## edible poisonous
## 4208 3916
Convert the counts to proportions instead:
class_counts <- table(df.mushroomb$class)
class_counts / sum(class_counts)
##
## edible poisonous
## 0.5179714 0.4820286
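Base R also provides prop.table(), which performs the same division by the total. The sketch below uses the two counts reported above as a plain named vector (the names `counts_example` and `props_example` are hypothetical), so it runs without the dataset.

```r
# Counts copied from the table() output above
counts_example <- c(edible = 4208, poisonous = 3916)

# prop.table() divides each entry by the sum of all entries
props_example <- prop.table(counts_example)
round(props_example, 7)

# On the real data the equivalent call would be:
# prop.table(table(df.mushroomb$class))
```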
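Let's visually summarize the data using a standard bar plot. We will use ggplot2, then call summary() on the plot object to see its data, mapping, and layer details. Note that summary() only prints this description; printing bar_plt itself draws the plot.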
library(ggplot2)
bar_plt <- ggplot(df.mushroomb, aes(x = class))
bar_plt <- bar_plt + geom_bar()
summary(bar_plt)
## data: class, capshape, odor, spore_print_color [8124x4]
## mapping: x = class
## faceting: <ggproto object: Class FacetNull, Facet>
## compute_layout: function
## draw_back: function
## draw_front: function
## draw_labels: function
## draw_panels: function
## finish_data: function
## init_scales: function
## map: function
## map_data: function
## params: list
## render_back: function
## render_front: function
## render_panels: function
## setup_data: function
## setup_params: function
## shrink: TRUE
## train: function
## train_positions: function
## train_scales: function
## vars: function
## super: <ggproto object: Class FacetNull, Facet>
## -----------------------------------
## geom_bar: width = NULL, na.rm = FALSE
## stat_count: width = NULL, na.rm = FALSE
## position_stack
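To show relative frequencies rather than raw counts on the plot, one option is to rescale the bar heights inside geom_bar(). The following is a minimal sketch, assuming ggplot2 3.3.0 or later (for after_stat()); the toy data frame `class_df` and the object name `prop_plt` are hypothetical.

```r
library(ggplot2)

# Toy stand-in for df.mushroomb with a single class column
class_df <- data.frame(
  class = c("edible", "poisonous", "edible", "edible"),
  stringsAsFactors = FALSE
)

# after_stat(count / sum(count)) rescales each bar to its share of all rows
prop_plt <- ggplot(class_df, aes(x = class)) +
  geom_bar(aes(y = after_stat(count / sum(count)))) +
  labs(x = "Class", y = "Proportion")

# Printing the object is what actually draws the plot
print(prop_plt)
```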
There are many more ways to explore the data, but for the task at hand the data has been subset and recoded for downstream analysis.
One last check:
head(df.mushroomb)
## class capshape odor spore_print_color
## 1: poisonous convex pungent black
## 2: edible convex almond brown
## 3: edible bell anise brown
## 4: poisonous convex pungent black
## 5: edible convex none brown
## 6: edible convex almond black
tail(df.mushroomb)
## class capshape odor spore_print_color
## 1: poisonous knobbed foul white
## 2: edible knobbed none buff
## 3: edible convex none buff
## 4: edible flat none buff
## 5: poisonous knobbed fishy white
## 6: edible convex none orange