R Assignment - Basic Data Loading and Transformations

Assignment - Loading Data into a Data Frame

Very often, we’re tasked with taking data in one form and transforming it for easier downstream analysis. We will spend several weeks in this course on tidying and transformation operations. Some of this work could be done in SQL or R (or Python or ...). Here, you are asked to use R; you may use base functions or packages as you like.

Mushrooms Dataset. A famous, if slightly moldy, dataset about mushrooms can be found in the UCI repository here: https://archive.ics.uci.edu/ml/datasets/Mushroom. Because this dataset is so well known in the data science community, it is a good choice for comparative benchmarking. For example, if someone were working to build a better decision tree algorithm (or other predictive classifier) to analyze categorical data, this dataset could be useful. A typical problem (which is beyond the scope of this assignment!) is to answer the question, “Which other attribute or attributes are the best predictors of whether a particular mushroom is poisonous or edible?”

Your task is to study the dataset and the associated description of the data (i.e., the “data dictionary”). You may need to look around a bit, but it’s there! You should take the data and create a data frame with a subset of the columns in the dataset. You should include the column that indicates edible or poisonous and three or four other columns. You should also add meaningful column names and replace the abbreviations used in the data; for example, in the appropriate column, “e” might become “edible.” Your deliverable is the R code to perform these transformation tasks.
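The three steps in the task (subset, rename, replace abbreviations) can be sketched in base R on a few toy rows. This is only an illustration, not the solution used in this report: the data frame `raw`, the result `shrooms`, and the lookup vectors `class_labels` and `odor_labels` are names made up for the example, and only a handful of the dictionary's codes are included.

```r
# Toy rows standing in for two columns of the raw file (class and odor codes)
raw <- data.frame(V1 = c("p", "e", "e"),
                  V6 = c("p", "a", "l"),
                  stringsAsFactors = FALSE)

# Step 1 and 2: keep a subset of columns and give them meaningful names
shrooms <- raw[, c("V1", "V6")]
names(shrooms) <- c("class", "odor")

# Step 3: named vectors act as lookup tables from code to full label
class_labels <- c(e = "edible", p = "poisonous")
odor_labels  <- c(a = "almond", l = "anise", p = "pungent")

shrooms$class <- unname(class_labels[shrooms$class])
shrooms$odor  <- unname(odor_labels[shrooms$odor])
shrooms
```

Indexing a named vector by the code column returns one label per row; any code missing from the lookup vector would come back as `NA`, which makes unmapped values easy to spot.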

If you are working in a group, you also have the option of replacing the mushroom dataset in the assignment with a different dataset that your group members might find more interesting. Please place your solution into a single R Markdown (.Rmd) file and publish your solution to rpubs.com. You should post the .Rmd file in your GitHub repository, and provide the appropriate URLs to your GitHub repository and your rpubs.com file in your assignment link. You should also have the original data file accessible through your code, for example, stored in a GitHub repository and referenced in your code. We’ll look together at some of the most interesting student solutions in next week’s meetup.

mushrooms <- read.table('https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data', sep=",", header=FALSE, stringsAsFactors = FALSE)
head(mushrooms)
##   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 1  p  x  s  n  t  p  f  c  n   k   e   e   s   s   w   w   p   w   o   p
## 2  e  x  s  y  t  a  f  c  b   k   e   c   s   s   w   w   p   w   o   p
## 3  e  b  s  w  t  l  f  c  b   n   e   c   s   s   w   w   p   w   o   p
## 4  p  x  y  w  t  p  f  c  n   n   e   e   s   s   w   w   p   w   o   p
## 5  e  x  s  g  f  n  f  w  b   k   t   e   s   s   w   w   p   w   o   e
## 6  e  x  y  y  t  a  f  c  b   n   e   c   s   s   w   w   p   w   o   p
##   V21 V22 V23
## 1   k   s   u
## 2   n   n   g
## 3   n   n   m
## 4   k   s   u
## 5   n   a   g
## 6   k   n   g

I have uploaded the data dictionary for this dataset, taken from the UCI website, to GitHub. It is read in below.

mushroomsdictionary <- read.table('https://raw.githubusercontent.com/maharjansudhan/DATA607/master/homework1_dictionary.txt', sep="|", header=TRUE, stringsAsFactors = FALSE)
mushroomsdictionary
##    Index                Attribute
## 1      0                    class
## 2      1                cap-shape
## 3      2              cap-surface
## 4      3                cap-color
## 5      4                 bruises?
## 6      5                     odor
## 7      6          gill-attachment
## 8      7             gill-spacing
## 9      8                gill-size
## 10     9               gill-color
## 11    10              stalk-shape
## 12    11               stalk-root
## 13    12 stalk-surface-above-ring
## 14    13 stalk-surface-below-ring
## 15    14   stalk-color-above-ring
## 16    15   stalk-color-below-ring
## 17    16                veil-type
## 18    17               veil-color
## 19    18              ring-number
## 20    19                ring-type
## 21    20        spore-print-color
## 22    21               population
## 23    22                  habitat
##                                                                                          Information
## 1                                                                               edible=e,poisonous=p
## 2                                                bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s
## 3                                                               fibrous=f,grooves=g,scaly=y,smooth=s
## 4                    brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y
## 5                                                                                     bruises=t,no=f
## 6                        almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s
## 7                                                           attached=a,descending=d,free=f,notched=n
## 8                                                                        close=c,crowded=w,distant=d
## 9                                                                                   broad=b,narrow=n
## 10 black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y
## 11                                                                            enlarging=e,tapering=t
## 12                                   bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?
## 13                                                                fibrous=f,scaly=y,silky=k,smooth=s
## 14                                                                fibrous=f,scaly=y,silky=k,smooth=s
## 15                           brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
## 16                           brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
## 17                                                                             partial=p,universal=u
## 18                                                                 brown=n,orange=o,white=w,yellow=y
## 19                                                                                none=n,one=o,two=t
## 20                     cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z
## 21                     black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y
## 22                                abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y
## 23                                      grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d
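Each entry in the Information column packs several "name=code" pairs into one string. A minimal sketch of how one such string could be unpacked into a lookup table is shown here; `parse_codes` is a hypothetical helper written for this illustration, not a function from any package.

```r
# Parse one "name=code,name=code" dictionary string into a lookup table.
# parse_codes is a hypothetical helper for illustration only.
parse_codes <- function(info) {
  # split the string into "name=code" pairs
  pairs <- strsplit(trimws(info), ",", fixed = TRUE)[[1]]
  # split each pair on "=" and stack the pieces into a two-column matrix
  parts <- do.call(rbind, strsplit(pairs, "=", fixed = TRUE))
  data.frame(label = parts[, 1], code = parts[, 2], stringsAsFactors = FALSE)
}

parse_codes("close=c,crowded=w,distant=d")
```

Keeping the label and the code side by side in one table makes it possible to match on the code explicitly later, rather than relying on the order in which the pairs are listed.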

Now we will rename the columns of the data frame using the attribute names from the data dictionary.

colnames(mushrooms) <- mushroomsdictionary$Attribute
head(mushrooms)
##   class cap-shape cap-surface cap-color bruises? odor gill-attachment
## 1     p         x           s         n        t    p               f
## 2     e         x           s         y        t    a               f
## 3     e         b           s         w        t    l               f
## 4     p         x           y         w        t    p               f
## 5     e         x           s         g        f    n               f
## 6     e         x           y         y        t    a               f
##   gill-spacing gill-size gill-color stalk-shape stalk-root
## 1            c         n          k           e          e
## 2            c         b          k           e          c
## 3            c         b          n           e          c
## 4            c         n          n           e          e
## 5            w         b          k           t          e
## 6            c         b          n           e          c
##   stalk-surface-above-ring stalk-surface-below-ring stalk-color-above-ring
## 1                        s                        s                      w
## 2                        s                        s                      w
## 3                        s                        s                      w
## 4                        s                        s                      w
## 5                        s                        s                      w
## 6                        s                        s                      w
##   stalk-color-below-ring veil-type veil-color ring-number ring-type
## 1                      w         p          w           o         p
## 2                      w         p          w           o         p
## 3                      w         p          w           o         p
## 4                      w         p          w           o         p
## 5                      w         p          w           o         e
## 6                      w         p          w           o         p
##   spore-print-color population habitat
## 1                 k          s       u
## 2                 n          n       g
## 3                 n          n       m
## 4                 k          s       u
## 5                 n          a       g
## 6                 k          n       g
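Names such as cap-shape and bruises? are not syntactic R names, so they have to be quoted with backticks when used with `$` (for example mushrooms$`cap-shape`). If that becomes inconvenient, one option is to clean the names first. The sketch below is an optional variation, not something done in this report; the vectors `attrs` and `clean` are illustrative names.

```r
# A few attribute names as they appear in the data dictionary
attrs <- c("class", "cap-shape", "bruises?", "stalk-surface-above-ring")

# Drop anything that is not a letter, digit or hyphen, then turn hyphens
# into underscores so the names can be used without backticks
clean <- gsub("-", "_", gsub("[^A-Za-z0-9-]", "", attrs))
clean
```

Applied to the full data frame, this would amount to assigning the cleaned vector with `colnames()` in place of the raw attribute names.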

Automatic Transformation

# Automatic transformation for any column, driven by the data dictionary.
# Note: reads the global data frame `mush` and returns a modified copy.
transformmushrooms <- function(headercolumn)
  {
  # read the "name=code" pairs for this column, e.g. "edible=e,poisonous=p"
  mheaderValues <- mushroomsdictionary$Information[headercolumn]
  # split into individual "name=code" pairs on the "," separators
  mheaderValues <- strsplit(trimws(as.character(mheaderValues)), ',', fixed=TRUE)[[1]]
  # separate each pair on "=" into two columns: full name and code
  mheaderValues <- data.frame(do.call('rbind', strsplit(mheaderValues, '=', fixed=TRUE)), stringsAsFactors = FALSE)
  # name the columns after the attribute and its code
  colnames(mheaderValues) <- c(mushroomsdictionary$Attribute[headercolumn], "Value")

  # Recode with an explicit code -> name mapping. Assigning levels() by
  # position only works when the sorted codes happen to match the
  # dictionary order, which fails for columns such as cap-shape.
  mush[,headercolumn] <- factor(mush[,headercolumn], levels = mheaderValues$Value, labels = mheaderValues[,1])
  return(mush)
}
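When recoding through factor levels, it matters whether labels are attached by position or by an explicit code-to-label mapping. By default `factor()` sorts the codes it finds, so positional assignment with `levels<-` lines up with the dictionary only when the sorted codes follow the dictionary order. A small sketch on toy cap-shape codes, with the variable names `codes`, `dict_codes` and `dict_labels` chosen for this illustration:

```r
# Toy cap-shape codes as they might appear in the data
codes <- c("x", "b", "s", "f")

# Dictionary order: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
dict_codes  <- c("b", "c", "x", "f", "k", "s")
dict_labels <- c("bell", "conical", "convex", "flat", "knobbed", "sunken")

# Positional: the factor's levels are the sorted codes present (b, f, s, x),
# so the first four dictionary labels get attached to the wrong codes
positional <- factor(codes)
levels(positional) <- dict_labels
as.character(positional)

# Explicit: each code is matched to its own label, regardless of order
explicit <- factor(codes, levels = dict_codes, labels = dict_labels)
as.character(explicit)
```

For class, gill-spacing and gill-size the two approaches give the same labels, because the sorted codes happen to follow the dictionary order; for cap-shape they do not.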

# Work on a copy of the full data frame so the original `mushrooms` stays untouched
mush <- mushrooms
head(mush)
##   class cap-shape cap-surface cap-color bruises? odor gill-attachment
## 1     p         x           s         n        t    p               f
## 2     e         x           s         y        t    a               f
## 3     e         b           s         w        t    l               f
## 4     p         x           y         w        t    p               f
## 5     e         x           s         g        f    n               f
## 6     e         x           y         y        t    a               f
##   gill-spacing gill-size gill-color stalk-shape stalk-root
## 1            c         n          k           e          e
## 2            c         b          k           e          c
## 3            c         b          n           e          c
## 4            c         n          n           e          e
## 5            w         b          k           t          e
## 6            c         b          n           e          c
##   stalk-surface-above-ring stalk-surface-below-ring stalk-color-above-ring
## 1                        s                        s                      w
## 2                        s                        s                      w
## 3                        s                        s                      w
## 4                        s                        s                      w
## 5                        s                        s                      w
## 6                        s                        s                      w
##   stalk-color-below-ring veil-type veil-color ring-number ring-type
## 1                      w         p          w           o         p
## 2                      w         p          w           o         p
## 3                      w         p          w           o         p
## 4                      w         p          w           o         p
## 5                      w         p          w           o         e
## 6                      w         p          w           o         p
##   spore-print-color population habitat
## 1                 k          s       u
## 2                 n          n       g
## 3                 n          n       m
## 4                 k          s       u
## 5                 n          a       g
## 6                 k          n       g
# Call the function once for each column to recode
mush <- transformmushrooms(1) # column 1: class
mush <- transformmushrooms(8) # column 8: gill-spacing
mush <- transformmushrooms(9) # column 9: gill-size
head(mush,25)
##        class cap-shape cap-surface cap-color bruises? odor gill-attachment
## 1  poisonous         x           s         n        t    p               f
## 2     edible         x           s         y        t    a               f
## 3     edible         b           s         w        t    l               f
## 4  poisonous         x           y         w        t    p               f
## 5     edible         x           s         g        f    n               f
## 6     edible         x           y         y        t    a               f
## 7     edible         b           s         w        t    a               f
## 8     edible         b           y         w        t    l               f
## 9  poisonous         x           y         w        t    p               f
## 10    edible         b           s         y        t    a               f
## 11    edible         x           y         y        t    l               f
## 12    edible         x           y         y        t    a               f
## 13    edible         b           s         y        t    a               f
## 14 poisonous         x           y         w        t    p               f
## 15    edible         x           f         n        f    n               f
## 16    edible         s           f         g        f    n               f
## 17    edible         f           f         w        f    n               f
## 18 poisonous         x           s         n        t    p               f
## 19 poisonous         x           y         w        t    p               f
## 20 poisonous         x           s         n        t    p               f
## 21    edible         b           s         y        t    a               f
## 22 poisonous         x           y         n        t    p               f
## 23    edible         b           y         y        t    l               f
## 24    edible         b           y         w        t    a               f
## 25    edible         b           s         w        t    l               f
##    gill-spacing gill-size gill-color stalk-shape stalk-root
## 1         close    narrow          k           e          e
## 2         close     broad          k           e          c
## 3         close     broad          n           e          c
## 4         close    narrow          n           e          e
## 5       crowded     broad          k           t          e
## 6         close     broad          n           e          c
## 7         close     broad          g           e          c
## 8         close     broad          n           e          c
## 9         close    narrow          p           e          e
## 10        close     broad          g           e          c
## 11        close     broad          g           e          c
## 12        close     broad          n           e          c
## 13        close     broad          w           e          c
## 14        close    narrow          k           e          e
## 15      crowded     broad          n           t          e
## 16        close    narrow          k           e          e
## 17      crowded     broad          k           t          e
## 18        close    narrow          n           e          e
## 19        close    narrow          n           e          e
## 20        close    narrow          k           e          e
## 21        close     broad          k           e          c
## 22        close    narrow          n           e          e
## 23        close     broad          k           e          c
## 24        close     broad          w           e          c
## 25        close     broad          g           e          c
##    stalk-surface-above-ring stalk-surface-below-ring
## 1                         s                        s
## 2                         s                        s
## 3                         s                        s
## 4                         s                        s
## 5                         s                        s
## 6                         s                        s
## 7                         s                        s
## 8                         s                        s
## 9                         s                        s
## 10                        s                        s
## 11                        s                        s
## 12                        s                        s
## 13                        s                        s
## 14                        s                        s
## 15                        s                        f
## 16                        s                        s
## 17                        s                        s
## 18                        s                        s
## 19                        s                        s
## 20                        s                        s
## 21                        s                        s
## 22                        s                        s
## 23                        s                        s
## 24                        s                        s
## 25                        s                        s
##    stalk-color-above-ring stalk-color-below-ring veil-type veil-color
## 1                       w                      w         p          w
## 2                       w                      w         p          w
## 3                       w                      w         p          w
## 4                       w                      w         p          w
## 5                       w                      w         p          w
## 6                       w                      w         p          w
## 7                       w                      w         p          w
## 8                       w                      w         p          w
## 9                       w                      w         p          w
## 10                      w                      w         p          w
## 11                      w                      w         p          w
## 12                      w                      w         p          w
## 13                      w                      w         p          w
## 14                      w                      w         p          w
## 15                      w                      w         p          w
## 16                      w                      w         p          w
## 17                      w                      w         p          w
## 18                      w                      w         p          w
## 19                      w                      w         p          w
## 20                      w                      w         p          w
## 21                      w                      w         p          w
## 22                      w                      w         p          w
## 23                      w                      w         p          w
## 24                      w                      w         p          w
## 25                      w                      w         p          w
##    ring-number ring-type spore-print-color population habitat
## 1            o         p                 k          s       u
## 2            o         p                 n          n       g
## 3            o         p                 n          n       m
## 4            o         p                 k          s       u
## 5            o         e                 n          a       g
## 6            o         p                 k          n       g
## 7            o         p                 k          n       m
## 8            o         p                 n          s       m
## 9            o         p                 k          v       g
## 10           o         p                 k          s       m
## 11           o         p                 n          n       g
## 12           o         p                 k          s       m
## 13           o         p                 n          s       g
## 14           o         p                 n          v       u
## 15           o         e                 k          a       g
## 16           o         p                 n          y       u
## 17           o         e                 n          a       g
## 18           o         p                 k          s       g
## 19           o         p                 n          s       u
## 20           o         p                 n          s       u
## 21           o         p                 n          s       m
## 22           o         p                 n          v       g
## 23           o         p                 n          s       m
## 24           o         p                 n          n       m
## 25           o         p                 k          s       m
# Keep only the three recoded columns: class, gill-spacing and gill-size
mush <- subset(mush, select = c(1,8,9))
head(mush,25)
##        class gill-spacing gill-size
## 1  poisonous        close    narrow
## 2     edible        close     broad
## 3     edible        close     broad
## 4  poisonous        close    narrow
## 5     edible      crowded     broad
## 6     edible        close     broad
## 7     edible        close     broad
## 8     edible        close     broad
## 9  poisonous        close    narrow
## 10    edible        close     broad
## 11    edible        close     broad
## 12    edible        close     broad
## 13    edible        close     broad
## 14 poisonous        close    narrow
## 15    edible      crowded     broad
## 16    edible        close    narrow
## 17    edible      crowded     broad
## 18 poisonous        close    narrow
## 19 poisonous        close    narrow
## 20 poisonous        close    narrow
## 21    edible        close     broad
## 22 poisonous        close    narrow
## 23    edible        close     broad
## 24    edible        close     broad
## 25    edible        close     broad
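A quick sanity check on the recoded data is to confirm that no value was left unmapped and to cross-tabulate the class against one of the other columns. The sketch below uses a toy data frame, `mush_check`, that copies the first five rows of the output above so that it runs on its own; on the real data the same calls would be applied to `mush`.

```r
# Toy stand-in for the first five rows of the final data frame
mush_check <- data.frame(
  class       = c("poisonous", "edible", "edible", "poisonous", "edible"),
  `gill-size` = c("narrow", "broad", "broad", "narrow", "broad"),
  check.names = FALSE, stringsAsFactors = FALSE)

# A code missing from the dictionary would show up as NA after recoding
stopifnot(!anyNA(mush_check))

# Count how often each class occurs with each gill size
counts <- table(mush_check$class, mush_check$`gill-size`)
counts
```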