I began by reading in two CSV files from my GitHub page. The first contains the unaltered raw mushroom data; the second contains a lightly hand-edited version of the attribute names, abbreviations, and associated values to be used with the mushroom data.
mushrooms_raw <- read.csv('https://raw.githubusercontent.com/brian-cuny/607assignment1/master/agaricus-lepiota.csv', header=FALSE, stringsAsFactors=FALSE)
names <- read.csv('https://raw.githubusercontent.com/brian-cuny/607assignment1/master/names.csv', header=FALSE, stringsAsFactors=FALSE, row.names=1, col.names=0:12)
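To illustrate what the row.names and col.names arguments do in the second call, here is a minimal sketch on a made-up two-line stand-in for names.csv. The csv_text and toy_names objects are hypothetical and exist only for this example; the real file has more rows and columns.

```r
# Hypothetical stand-in for names.csv: the attribute name comes first,
# followed by its "full=abbreviation" values, padded with empty fields.
csv_text <- paste("classes,edible=e,poisonous=p,",
                  "cap-surface,fibrous=f,grooves=g,scaly=y",
                  sep = "\n")

# row.names = 1 turns the first field into the row name;
# col.names = 0:3 becomes X0..X3, and X0 is consumed by row.names.
toy_names <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE,
                      row.names = 1, col.names = 0:3)

rownames(toy_names)  # attribute names
colnames(toy_names)  # X1, X2, X3
```

Empty trailing fields are read as empty strings, which is why the names table shown later contains "" entries for attributes with fewer values.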
I transposed the names data frame for easier processing, so that each attribute becomes a column, and then used its column names to name the columns of mushrooms_raw.
names <- t(names)
names(mushrooms_raw) <- colnames(names)
head(mushrooms_raw[,1:5], 10)
## classes cap-shape cap-surface Cap-color bruises?
## 1 p x s n t
## 2 e x s y t
## 3 e b s w t
## 4 p x y w t
## 5 e x s g f
## 6 e x y y t
## 7 e b s w t
## 8 e b y w t
## 9 p x y w t
## 10 e b s y t
head(names[,1:5], 10)
## classes cap-shape cap-surface Cap-color bruises?
## X1 "edible=e" "bell=b" "fibrous=f" "brown=n" "bruises=t"
## X2 "poisonous=p" "conical=c" "grooves=g" "buff=b" "no=f"
## X3 "" "convex=x" "scaly=y" "cinnamon=c" ""
## X4 "" "flat=f" "smooth=s" "gray=g" ""
## X5 "" "knobbed=k" "" "green=r" ""
## X6 "" "sunken=s" "" "pink=p" ""
## X7 "" "" "" "purple=u" ""
## X8 "" "" "" "red=e" ""
## X9 "" "" "" "white=w" ""
## X10 "" "" "" "yellow=y" ""
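The effect of the transpose step can be seen on a small example. This is only a sketch with made-up data (toy_names, toy_matrix and toy_raw are hypothetical names); it shows that t() turns a data frame into a character matrix whose column names are the original row names.

```r
# Hypothetical two-attribute version of the names table
toy_names <- data.frame(X1 = c("edible=e", "bell=b"),
                        X2 = c("poisonous=p", "conical=c"),
                        row.names = c("classes", "cap-shape"),
                        stringsAsFactors = FALSE)

# t() coerces the data frame to a matrix and swaps rows and columns
toy_matrix <- t(toy_names)
is.matrix(toy_matrix)
colnames(toy_matrix)

# The attribute names can now label the columns of the raw data
toy_raw <- data.frame(V1 = c("p", "e"), V2 = c("x", "b"),
                      stringsAsFactors = FALSE)
names(toy_raw) <- colnames(toy_matrix)
```

After the transpose, a whole attribute can be pulled out by name, for example toy_matrix[, "cap-shape"].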
I subset the raw mushroom data by selecting ‘classes’ and three other columns (‘odor’, ‘population’ and ‘habitat’). I stored this subset in mushrooms_reduced and the column names in my_columns for future use.
my_columns <- c('classes', 'odor', 'population', 'habitat')
mushrooms_reduced <- subset(mushrooms_raw, select=my_columns)
head(mushrooms_reduced, 10)
## classes odor population habitat
## 1 p p s u
## 2 e a n g
## 3 e l n m
## 4 p p s u
## 5 e n a g
## 6 e a n g
## 7 e a n m
## 8 e l s m
## 9 p p v g
## 10 e a s m
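For comparison, the same column selection can be written with bracket indexing. The sketch below uses a small made-up data frame (toy_raw and keep are hypothetical) and is an alternative, not a correction.

```r
# Hypothetical miniature of the raw data
toy_raw <- data.frame(classes = c("p", "e", "e"),
                      odor    = c("p", "a", "l"),
                      bruises = c("t", "t", "f"),
                      habitat = c("u", "g", "m"),
                      stringsAsFactors = FALSE)
keep <- c("classes", "odor", "habitat")

# Two ways to keep only the named columns
by_subset <- subset(toy_raw, select = keep)
by_index  <- toy_raw[, keep]

# With a single column, drop = FALSE keeps the result a data frame
one_column <- toy_raw[, "odor", drop = FALSE]
```

Bracket indexing is often preferred inside functions, because subset() uses non-standard evaluation when interpreting its arguments.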
The function below accepts the names matrix, the mushrooms_reduced data frame, and the name of the column to process. It reads that column of the names matrix and builds two parallel vectors holding the abbreviations and the matching full names for that attribute. Each abbreviation in the data column is then replaced with its full name, and the modified column is returned.
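#"unlist" code adapted from "Henry" on Stackoverflow
library(plyr)  # mapvalues() comes from the plyr package
process_names <- function(names, mushrooms_reduced, column){
  # Split each "full=abbreviation" entry; empty cells give zero-length results
  split_names <- strsplit(names[,column], '=')
  split_names[sapply(split_names, length)==0] <- NULL
  # After unlist(), odd positions hold full names and even positions abbreviations
  abbreviations <- unlist(split_names)[2*(1:length(split_names))]
  full <- unlist(split_names)[2*(1:length(split_names))-1]
  return(mapvalues(mushrooms_reduced[,column], from=abbreviations, to=full))
}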
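The split-and-substitute logic can be traced on a single made-up attribute. The sketch below swaps in a base R named-vector lookup in place of mapvalues, so it runs without extra packages; values, observed, lookup and decoded are hypothetical names used only here.

```r
# Hypothetical column of the names matrix, padded with empty cells
values <- c("edible=e", "poisonous=p", "", "")

# Split on "=" and drop the empty cells, which split to zero-length vectors
split_values <- strsplit(values, "=", fixed = TRUE)
split_values <- split_values[lengths(split_values) > 0]

# First piece is the full name, second piece is the abbreviation
full          <- vapply(split_values, `[`, character(1), 1)
abbreviations <- vapply(split_values, `[`, character(1), 2)

# Named vector: names are abbreviations, values are full names
lookup <- setNames(full, abbreviations)

# Hypothetical data column holding abbreviations
observed <- c("p", "e", "e", "p")
decoded  <- unname(lookup[observed])
decoded
```

One difference worth noting: an abbreviation missing from the lookup becomes NA here, whereas mapvalues leaves unmatched values unchanged.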
I looped over the selected columns, calling process_names on each to replace every abbreviation in the data frame with its full name.
for(column in my_columns){
  mushrooms_reduced[,column] <- process_names(names, mushrooms_reduced, column)
}
head(mushrooms_reduced, 10)
## classes odor population habitat
## 1 poisonous pungent scattered urban
## 2 edible almond numerous grasses
## 3 edible anise numerous meadows
## 4 poisonous pungent scattered urban
## 5 edible none abundant grasses
## 6 edible almond numerous grasses
## 7 edible almond numerous meadows
## 8 edible anise scattered meadows
## 9 poisonous pungent several grasses
## 10 edible almond scattered meadows
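The column-by-column replacement can also be expressed without an explicit loop. The following is a rough sketch on made-up data, assuming the lookup tables have already been built as named vectors; toy, lookups and cols are hypothetical and not part of the original analysis.

```r
# Hypothetical two-column data set holding abbreviations
toy <- data.frame(classes = c("p", "e"),
                  odor    = c("p", "a"),
                  stringsAsFactors = FALSE)

# Hypothetical lookup tables, one per column (abbreviation -> full name)
lookups <- list(classes = c(e = "edible", p = "poisonous"),
                odor    = c(a = "almond", p = "pungent"))

# Decode every listed column in one step
cols <- names(lookups)
toy[cols] <- lapply(cols, function(col) unname(lookups[[col]][toy[[col]]]))
toy
```

The for loop in the text and this lapply form do the same job; the loop is arguably easier to read when there are only a handful of columns.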
In conclusion, the original mushrooms data has been read in, subset and updated with more readable information and is ready to be saved or processed further.