Week 1 Assignment

By Brian Weinfeld

January 30, 2018

I began by reading in two CSV files from my GitHub page. The first contains the unaltered raw mushroom data, while the second contains a lightly hand-edited list of the attribute names, together with each attribute's values and their abbreviations, to be used with the mushroom data.

mushrooms_raw <- read.csv('https://raw.githubusercontent.com/brian-cuny/607assignment1/master/agaricus-lepiota.csv', header=FALSE, stringsAsFactors=FALSE)
names <- read.csv('https://raw.githubusercontent.com/brian-cuny/607assignment1/master/names.csv', header=FALSE, stringsAsFactors=FALSE, row.names=1, col.names=0:12)

I transposed the names data frame so that each attribute becomes a column (the result is a character matrix), and then used its column names to name the columns of mushrooms_raw.

names <- t(names)
names(mushrooms_raw) <- colnames(names)
head(mushrooms_raw[,1:5], 10)

##    classes cap-shape cap-surface Cap-color bruises?
## 1        p         x           s         n        t
## 2        e         x           s         y        t
## 3        e         b           s         w        t
## 4        p         x           y         w        t
## 5        e         x           s         g        f
## 6        e         x           y         y        t
## 7        e         b           s         w        t
## 8        e         b           y         w        t
## 9        p         x           y         w        t
## 10       e         b           s         y        t

head(names[,1:5], 10)

##     classes       cap-shape   cap-surface Cap-color    bruises?   
## X1  "edible=e"    "bell=b"    "fibrous=f" "brown=n"    "bruises=t"
## X2  "poisonous=p" "conical=c" "grooves=g" "buff=b"     "no=f"     
## X3  ""            "convex=x"  "scaly=y"   "cinnamon=c" ""         
## X4  ""            "flat=f"    "smooth=s"  "gray=g"     ""         
## X5  ""            "knobbed=k" ""          "green=r"    ""         
## X6  ""            "sunken=s"  ""          "pink=p"     ""         
## X7  ""            ""          ""          "purple=u"   ""         
## X8  ""            ""          ""          "red=e"      ""         
## X9  ""            ""          ""          "white=w"    ""         
## X10 ""            ""          ""          "yellow=y"   ""
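To make the effect of the transpose concrete, here is a small sketch on made-up data. The object toy_names is a hypothetical stand-in for the names file, not the real one; it only assumes the same layout of one row per attribute with the attribute as the row name.

```r
# Toy stand-in for names.csv: one row per attribute, row names are the attributes.
toy_names <- data.frame(
  X1 = c("edible=e", "bruises=t"),
  X2 = c("poisonous=p", "no=f"),
  row.names = c("classes", "bruises?"),
  stringsAsFactors = FALSE
)

# t() converts the data frame to a character matrix with attributes as columns.
toy_t <- t(toy_names)

colnames(toy_t)       # attribute names, reusable as column names elsewhere
toy_t[, "classes"]    # every "full=abbreviation" pair for one attribute
```

After the transpose, a single attribute's value pairs can be pulled out by column name, which is what the later processing relies on.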

I subset the raw mushroom data by selecting ‘classes’ and three other columns (‘odor’, ‘population’ and ‘habitat’). I stored this subset in mushrooms_reduced and stored the column names in my_columns for future use.

my_columns <- c('classes', 'odor', 'population', 'habitat')
mushrooms_reduced <- subset(mushrooms_raw, select=my_columns)
head(mushrooms_reduced, 10)

##    classes odor population habitat
## 1        p    p          s       u
## 2        e    a          n       g
## 3        e    l          n       m
## 4        p    p          s       u
## 5        e    n          a       g
## 6        e    a          n       g
## 7        e    a          n       m
## 8        e    l          s       m
## 9        p    p          v       g
## 10       e    a          s       m

The function below accepts the names and mushrooms_reduced data structures and the name of the column that I wish to process. It reads the names matrix and creates two vectors that contain the matching abbreviations and full names for the given column. Each abbreviation in the column is then replaced by its full name, and the modified column is returned.

#"unlist" code adapted from "Henry" on Stackoverflow
library(plyr)  # provides mapvalues()
process_names <- function(names, mushrooms_reduced, column){
  # Split each "full=abbreviation" entry; empty padding cells split to character(0)
  split_names <- strsplit(names[,column], '=')
  split_names[sapply(split_names, length)==0] <- NULL
  # After unlist(), full names sit at odd positions and abbreviations at even ones
  abbreviations <- unlist(split_names)[2*seq_along(split_names)]
  full <- unlist(split_names)[2*seq_along(split_names)-1]
  return(mapvalues(mushrooms_reduced[,column], from=abbreviations, to=full))
}
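The splitting step is easier to follow on a single toy column. The sketch below uses a hypothetical vector pairs, shaped like one column of the names matrix (including the empty padding cells), and walks through the same split, drop, and odd/even extraction in base R only.

```r
# Hypothetical column from the transposed names matrix, padded with empty strings.
pairs <- c("almond=a", "anise=l", "none=n", "", "")

split_pairs <- strsplit(pairs, "=")
# Empty strings split to character(0); drop them before flattening.
split_pairs <- split_pairs[lengths(split_pairs) > 0]

# Flattening interleaves the values: full name, code, full name, code, ...
flat <- unlist(split_pairs)
full <- flat[seq(1, length(flat), by = 2)]           # odd positions: full names
abbreviations <- flat[seq(2, length(flat), by = 2)]  # even positions: codes

full
abbreviations
```

The two vectors line up position by position, which is the property the substitution depends on.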

I then looped over the selected columns, calling the process_names function on each one so that every abbreviation in the data frame is replaced by its full name.

for(column in my_columns){
  mushrooms_reduced[,column] <- process_names(names, mushrooms_reduced, column)
}
head(mushrooms_reduced, 10)

##      classes    odor population habitat
## 1  poisonous pungent  scattered   urban
## 2     edible  almond   numerous grasses
## 3     edible   anise   numerous meadows
## 4  poisonous pungent  scattered   urban
## 5     edible    none   abundant grasses
## 6     edible  almond   numerous grasses
## 7     edible  almond   numerous meadows
## 8     edible   anise  scattered meadows
## 9  poisonous pungent    several grasses
## 10    edible  almond  scattered meadows
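For comparison, the same kind of recoding can be sketched in base R with a named vector used as a lookup table. This is an alternative technique, not the approach used above; odor_lookup and odor are made-up illustrations, and the code "z" is included only to show what happens to a value with no match.

```r
# Hypothetical lookup: names are the abbreviations, values are the full names.
odor_lookup <- c(a = "almond", l = "anise", n = "none", p = "pungent")

# Toy column of abbreviations; "z" has no entry in the lookup.
odor <- c("p", "a", "l", "z")

# Indexing by name recodes each value; unmatched codes come back as NA.
recoded <- unname(odor_lookup[odor])

# Keep the original code where no full name was found.
missing_code <- is.na(recoded)
recoded[missing_code] <- odor[missing_code]
recoded
```

Leaving unmatched codes unchanged keeps the result comparable to the substitution above, where values without a listed abbreviation also pass through as they are.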

In conclusion, the original mushroom data has been read in, subset, and updated with more readable values, and it is ready to be saved or processed further.