Assignment Brief

In this assignment, we have been asked to use R to take a data set in a given form and transform it for easier downstream analysis.

The full details of the assignment can be found here

As per the full assignment brief, the task can be broken down into the sub-tasks below:

  1. Study the Dataset and the Associated Description Data (i.e., the “data dictionary”).
  2. Import the data and create a data frame with a subset of the columns. Include the column that indicates whether the mushroom is edible or poisonous.
  3. Add meaningful column names and replace the abbreviations used in the data.

About the Data

The data set and its full details can be found in the UCI repository here. To make it convenient to import the data set into our R Markdown document, I have added the raw data file to my GitHub repo, which can be found here.

As per the UCI repository, the attribute information is as below:

Attribute Information:

  1. cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
  2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
  3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y
  4. bruises?: bruises=t,no=f
  5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s
  6. gill-attachment: attached=a,descending=d,free=f,notched=n
  7. gill-spacing: close=c,crowded=w,distant=d
  8. gill-size: broad=b,narrow=n
  9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y
  10. stalk-shape: enlarging=e,tapering=t
  11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=?
  12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
  13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
  14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
  15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
  16. veil-type: partial=p,universal=u
  17. veil-color: brown=n,orange=o,white=w,yellow=y
  18. ring-number: none=n,one=o,two=t
  19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z
  20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y
  21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y
  22. habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d
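
Attribute 11 (stalk-root) uses `?` to mark missing values. As a minimal sketch, assuming a few made-up inline rows (`sample_rows` and `demo` are hypothetical names, and the values are not taken from the real file), this is how the `na.strings` argument of `read.csv` could turn that marker into `NA` at import time:

```r
# Three made-up rows with two columns each (class, stalk-root).
# These are illustrative values, not rows from the real file.
sample_rows <- "p,e\ne,?\ne,c"

demo <- read.csv(text = sample_rows, header = FALSE,
                 na.strings = "?", stringsAsFactors = TRUE)

# The "?" entry becomes NA instead of its own factor level
is.na(demo$V2)
levels(demo$V2)
```

Whether to treat `?` as `NA` or keep it as a "missing" category depends on the downstream analysis; the import in this write-up keeps it as a regular value.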

Import the Data and Create a Data Frame

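# import the data set using the read.csv function; the file has no header row.
# stringsAsFactors = TRUE keeps the columns as factors on R >= 4.0, which the level replacement further down relies on.

mushrooms <- read.csv('https://raw.githubusercontent.com/anilak1978/data-acquisition-management-mushroom-data-set/master/agaricus-lepiota.data', header = FALSE, sep = ',', stringsAsFactors = TRUE)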

#overview of the data
head(mushrooms)
##   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 1  p  x  s  n  t  p  f  c  n   k   e   e   s   s   w   w   p   w   o   p
## 2  e  x  s  y  t  a  f  c  b   k   e   c   s   s   w   w   p   w   o   p
## 3  e  b  s  w  t  l  f  c  b   n   e   c   s   s   w   w   p   w   o   p
## 4  p  x  y  w  t  p  f  c  n   n   e   e   s   s   w   w   p   w   o   p
## 5  e  x  s  g  f  n  f  w  b   k   t   e   s   s   w   w   p   w   o   e
## 6  e  x  y  y  t  a  f  c  b   n   e   c   s   s   w   w   p   w   o   p
##   V21 V22 V23
## 1   k   s   u
## 2   n   n   g
## 3   n   n   m
## 4   k   s   u
## 5   n   a   g
## 6   k   n   g

When we look at the “mushrooms” data frame, we see 23 columns, whereas the data set information lists 22 attributes. The extra column is the first one (V1), which records whether the mushroom is poisonous or edible: the value is “p” for a poisonous mushroom and “e” for an edible one. Let’s assign column names to our data frame.
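
The column count and the contents of the first column can also be checked in code. The sketch below re-types the six rows printed by `head(mushrooms)` as inline CSV so that it runs without a download; `first_six` and `demo` are names made up for this illustration.

```r
# The six rows printed by head(mushrooms), re-typed as inline CSV
first_six <- paste(
  "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u",
  "e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g",
  "e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m",
  "p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u",
  "e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g",
  "e,x,y,y,t,a,f,c,b,n,e,c,s,s,w,w,p,w,o,p,k,n,g",
  sep = "\n"
)

# Read everything as plain character data for this quick check
demo <- read.csv(text = first_six, header = FALSE, colClasses = "character")

ncol(demo)      # one class column plus 22 attributes
table(demo$V1)  # counts of "e" and "p" in the class column
```

On the full data frame, the same two calls (`ncol(mushrooms)` and `table(mushrooms$V1)`) would give the overall column count and class balance.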

Assign Column Names

# adding column names defined in the data set information provided by UCI Repository
colnames(mushrooms)<- c('edibility', 
                        'cap-shape', 
                        'cap-surface', 
                        'cap-color', 
                        'bruises', 
                        'odor', 
                        'gill-attachment', 
                        'gill-spacing', 
                        'gill-size', 
                        'gill-color', 
                        'stalk-shape', 
                        'stalk-root', 
                        'stalk-surface-above-ring', 
                        'stalk-surface-below-ring', 
                        'stalk-color-above-ring', 
                        'stalk-color-below-ring', 
                        'veil-type', 
                        'veil-color', 
                        'ring-number', 
                        'ring-type', 
                        'spore-print-color', 
                        'population', 
                        'habitat')

# overview of the data
head(mushrooms)
##   edibility cap-shape cap-surface cap-color bruises odor gill-attachment
## 1         p         x           s         n       t    p               f
## 2         e         x           s         y       t    a               f
## 3         e         b           s         w       t    l               f
## 4         p         x           y         w       t    p               f
## 5         e         x           s         g       f    n               f
## 6         e         x           y         y       t    a               f
##   gill-spacing gill-size gill-color stalk-shape stalk-root
## 1            c         n          k           e          e
## 2            c         b          k           e          c
## 3            c         b          n           e          c
## 4            c         n          n           e          e
## 5            w         b          k           t          e
## 6            c         b          n           e          c
##   stalk-surface-above-ring stalk-surface-below-ring stalk-color-above-ring
## 1                        s                        s                      w
## 2                        s                        s                      w
## 3                        s                        s                      w
## 4                        s                        s                      w
## 5                        s                        s                      w
## 6                        s                        s                      w
##   stalk-color-below-ring veil-type veil-color ring-number ring-type
## 1                      w         p          w           o         p
## 2                      w         p          w           o         p
## 3                      w         p          w           o         p
## 4                      w         p          w           o         p
## 5                      w         p          w           o         e
## 6                      w         p          w           o         p
##   spore-print-color population habitat
## 1                 k          s       u
## 2                 n          n       g
## 3                 n          n       m
## 4                 k          s       u
## 5                 n          a       g
## 6                 k          n       g
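
A side note on the hyphenated names: a hyphen is not valid in a bare R name, so columns such as cap-shape have to be wrapped in backticks when accessed with `$`. The sketch below uses a small stand-in data frame (`demo`, a hypothetical name with made-up contents) to show the backtick access, and one optional way to switch to underscores instead. This write-up keeps the hyphenated names.

```r
# A tiny stand-in data frame with two of the hyphenated names
demo <- data.frame(a = c("x", "b"), b = c("s", "y"))
colnames(demo) <- c("cap-shape", "cap-surface")

# Hyphenated names need backticks with `$`
demo$`cap-shape`

# Optional: replace hyphens with underscores to drop the backticks
colnames(demo) <- gsub("-", "_", colnames(demo), fixed = TRUE)
demo$cap_shape
```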

I have used the attribute names from the data set information as column names, and “edibility” for the first column since it states whether the mushroom is poisonous or edible. As our next step, we will subset the columns, keeping edibility plus four attributes (features).

Subset the data frame

# subset data frame mushrooms and create a new data frame called mushrooms_subset

mushrooms_subset <- subset(mushrooms, select = c(1:5))
head(mushrooms_subset)
##   edibility cap-shape cap-surface cap-color bruises
## 1         p         x           s         n       t
## 2         e         x           s         y       t
## 3         e         b           s         w       t
## 4         p         x           y         w       t
## 5         e         x           s         g       f
## 6         e         x           y         y       t

Following the assignment, and for no particular reason beyond that, I chose to subset the first 5 columns of the “mushrooms” data frame: the column that indicates whether the mushroom is edible or poisonous (“edibility”) and the first four attributes.
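
Selecting by position works here, but it silently picks different columns if the column order ever changes. As a sketch of an alternative, the same subset can be taken by name. The stand-in data frame `demo` and the vector `keep` are hypothetical names; the values are the first two rows shown earlier.

```r
# Stand-in for the full data frame: two rows, six named columns
demo <- data.frame(
  edibility     = c("p", "e"),
  `cap-shape`   = c("x", "x"),
  `cap-surface` = c("s", "s"),
  `cap-color`   = c("n", "y"),
  bruises       = c("t", "t"),
  odor          = c("p", "a"),
  check.names = FALSE  # keep the hyphenated names as they are
)

# Select by name, so the result does not depend on column order
keep <- c("edibility", "cap-shape", "cap-surface", "cap-color", "bruises")
demo_subset <- demo[, keep]
colnames(demo_subset)
```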

Replace the abbreviations

# first look at the structure of mushrooms_subset data frame to see the levels of each attribute
str(mushrooms_subset)
## 'data.frame':    8124 obs. of  5 variables:
##  $ edibility  : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
##  $ cap-shape  : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
##  $ cap-surface: Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
##  $ cap-color  : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
##  $ bruises    : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ...

We can see that edibility has 2 levels, cap-shape 6, cap-surface 4, cap-color 10 and bruises 2. We can use the attribute information to replace the abbreviations in the data frame.
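
The level counts can also be read off programmatically instead of from the `str` output. A minimal sketch, assuming every column is a factor, on a small stand-in data frame (`demo` is a hypothetical name with made-up rows):

```r
# Stand-in data frame with two factor columns
demo <- data.frame(
  edibility = factor(c("p", "e", "e")),
  bruises   = factor(c("t", "t", "f"))
)

# Number of levels per column
sapply(demo, nlevels)

# The level codes themselves, per column
lapply(demo, levels)
```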

# target each level value and replace the old abbreviation with the full name.

levels(mushrooms_subset$edibility)[levels(mushrooms_subset$edibility)=='e']<-'edible'
levels(mushrooms_subset$edibility)[levels(mushrooms_subset$edibility)=='p']<-'poisonous'
levels(mushrooms_subset$`cap-shape`)[levels(mushrooms_subset$`cap-shape`)=='b']<-'bell'
levels(mushrooms_subset$`cap-shape`)[levels(mushrooms_subset$`cap-shape`)=='c']<-'conical'
levels(mushrooms_subset$`cap-shape`)[levels(mushrooms_subset$`cap-shape`)=='x']<-'convex'
levels(mushrooms_subset$`cap-shape`)[levels(mushrooms_subset$`cap-shape`)=='f']<-'flat'
levels(mushrooms_subset$`cap-shape`)[levels(mushrooms_subset$`cap-shape`)=='k']<-'knobbed'
levels(mushrooms_subset$`cap-shape`)[levels(mushrooms_subset$`cap-shape`)=='s']<-'sunken'
levels(mushrooms_subset$`cap-surface`)[levels(mushrooms_subset$`cap-surface`)=='f']<-'fibrous'
levels(mushrooms_subset$`cap-surface`)[levels(mushrooms_subset$`cap-surface`)=='g']<-'grooves'
levels(mushrooms_subset$`cap-surface`)[levels(mushrooms_subset$`cap-surface`)=='y']<-'scaly'
levels(mushrooms_subset$`cap-surface`)[levels(mushrooms_subset$`cap-surface`)=='s']<-'smooth'
levels(mushrooms_subset$`cap-color`)[levels(mushrooms_subset$`cap-color`)=='n']<-'brown'
levels(mushrooms_subset$`cap-color`)[levels(mushrooms_subset$`cap-color`)=='b']<-'buff'
levels(mushrooms_subset$`cap-color`)[levels(mushrooms_subset$`cap-color`)=='c']<-'cinnamon'
levels(mushrooms_subset$`cap-color`)[levels(mushrooms_subset$`cap-color`)=='g']<-'gray'
levels(mushrooms_subset$`cap-color`)[levels(mushrooms_subset$`cap-color`)=='r']<-'green'
levels(mushrooms_subset$`cap-color`)[levels(mushrooms_subset$`cap-color`)=='p']<-'pink'
levels(mushrooms_subset$`cap-color`)[levels(mushrooms_subset$`cap-color`)=='u']<-'purple'
levels(mushrooms_subset$`cap-color`)[levels(mushrooms_subset$`cap-color`)=='e']<-'red'
levels(mushrooms_subset$`cap-color`)[levels(mushrooms_subset$`cap-color`)=='w']<-'white'
levels(mushrooms_subset$`cap-color`)[levels(mushrooms_subset$`cap-color`)=='y']<-'yellow'
levels(mushrooms_subset$bruises)[levels(mushrooms_subset$bruises)=='t']<-'yes'
levels(mushrooms_subset$bruises)[levels(mushrooms_subset$bruises)=='f']<-'no'

#overview of the mushrooms_subset data frame with new row values.

head(mushrooms_subset)
##   edibility cap-shape cap-surface cap-color bruises
## 1 poisonous    convex      smooth     brown     yes
## 2    edible    convex      smooth    yellow     yes
## 3    edible      bell      smooth     white     yes
## 4 poisonous    convex       scaly     white     yes
## 5    edible    convex      smooth      gray      no
## 6    edible    convex       scaly    yellow     yes
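
The line-by-line replacement above is explicit but repetitive. One possible alternative, sketched here under the assumption that the columns are factors, is to keep each code-to-label mapping in a named vector and apply it to the levels in one step. The names `demo`, `edibility_map` and `bruises_map` are made up for this illustration; the values are the six rows shown above.

```r
# Stand-in for two columns of mushrooms_subset (first six rows)
demo <- data.frame(
  edibility = factor(c("p", "e", "e", "p", "e", "e")),
  bruises   = factor(c("t", "t", "t", "t", "f", "t"))
)

# Named vectors: names are the old codes, values are the new labels
edibility_map <- c(e = "edible", p = "poisonous")
bruises_map   <- c(t = "yes", f = "no")

# Index each map by the current levels, so level order does not matter
levels(demo$edibility) <- unname(edibility_map[levels(demo$edibility)])
levels(demo$bruises)   <- unname(bruises_map[levels(demo$bruises)])

demo
```

The mapping then sits in one place per attribute, which makes it easier to compare against the data dictionary, at the cost of being slightly less obvious than one assignment per code.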

This completes the Loading Data into a Data Frame assignment for the Data Acquisition and Management (DATA 607) class.