First, I load the data in using read.table:
week3 <- read.table("week3.data", sep = ",", header = FALSE)
Next, I'll take a look at the data structure:
str(week3)
## 'data.frame': 8124 obs. of 23 variables:
## $ V1 : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
## $ V2 : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
## $ V3 : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
## $ V4 : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
## $ V5 : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ...
## $ V6 : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
## $ V7 : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
## $ V8 : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
## $ V9 : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
## $ V10: Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
## $ V11: Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
## $ V12: Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3 ...
## $ V13: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ V14: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ V15: Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ V16: Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ V17: Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
## $ V18: Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ V19: Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
## $ V20: Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
## $ V21: Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ...
## $ V22: Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4 ...
## $ V23: Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4 ...
Now, I'll rename the columns according to the dataset description found here.
colnames(week3) <- c("edibility",
"cap-shape",
"cap-surface",
"cap-color",
"bruises?",
"odor",
"gill-attachment",
"gill-spacing",
"gill-size",
"gill-color",
"stalk-shape",
"stalk-root",
"stalk-surface-above-ring",
"stalk-surface-below-ring",
"stalk-color-above-ring",
"stalk-color-below-ring",
"veil-type",
"veil-color",
"ring-number",
"ring-type",
"spore-print-color",
"population",
"habitat")
Next, I'll subset the data on gill attributes and edibility:
week3_gills <- subset(week3, select=c("edibility",
"gill-attachment",
"gill-spacing",
"gill-size",
"gill-color"))
I'm going to use the plyr package to rename the factor levels. First, I'll load the package:
library(plyr)
I'll start with the mapvalues() function:
week3_gills$edibility <- mapvalues(week3_gills$edibility,
from = c("e", "p"),
to = c("edible","poisonous"))
For the attribute/column names, I had to start subsetting by brackets because I used hypens in the column names. I did that to stay consistent with the dataset description, but in the future I would probably convert the hyphens to underscores to make the dataframe easier to work with.
I also decided to include factor levels mentioned in the description, but had no instances in the dataset, in case I added data that had those factors in the future.
week3_gills[,"gill-attachment"] <- mapvalues(week3_gills[,"gill-attachment"],
from = c("a","d","f","n"),
to = c("attached", "descending",
"free", "notched"))
## The following `from` values were not present in `x`: d, n
week3_gills[,"gill-spacing"] <- mapvalues(week3_gills[,"gill-spacing"],
from = c("c", "w", "d"),
to = c("crowded", "wild", "distant"))
## The following `from` values were not present in `x`: d
Here I tried revalue() (also from plyr) just to switch things up.
week3_gills[,"gill-size"] <- revalue(week3_gills[,"gill-size"],
c("b"="broad","n"="narrow"))
Here we have the opposite situation as above. There was a factor level, “u”, that was not mentioned in the data set description for gill-color. I guessed that it meant “unknown” and named it as such, but couldn't find any information in the description that would indicate one way or another.
week3_gills[,"gill-color"] <- revalue(week3_gills[,"gill-color"],
c("k"="black",
"n"="brown",
"b"="buff",
"h"="chocolate",
"g"="gray",
"c"="cinnamon",
"o"="orange",
"p"="pink",
"e"="red",
"w"="white",
"y"="yellow",
"u"="unknown"))
## The following `from` values were not present in `x`: c
Finally, I take a look at the structure of my subset to make sure everything looks good.
str(week3_gills)
## 'data.frame': 8124 obs. of 5 variables:
## $ edibility : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...
## $ gill-attachment: Factor w/ 2 levels "attached","free": 2 2 2 2 2 2 2 2 2 2 ...
## $ gill-spacing : Factor w/ 2 levels "crowded","wild": 1 1 1 1 2 1 1 1 1 1 ...
## $ gill-size : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
## $ gill-color : Factor w/ 12 levels "buff","red","gray",..: 5 5 6 6 5 6 3 6 8 3 ...
fin