Assignment 1 - Loading Data into a Data Frame

Data Set Information:

“This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one.”

This data set is an example of categorical data, which uses numbers or characters to represent different categories. Data of this type cannot be manipulated arithmetically.

The best way to summarize categorical data is with frequencies. Frequencies can also be converted into percentages to show what share of the total sample falls in each category.
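As a minimal sketch of turning frequencies into percentages, the snippet below uses base R's `table()` and `prop.table()`. The vector `classification` is a stand-in built from the class counts reported in this document (4208 edible, 3916 poisonous); it is an assumption for illustration, not the loaded data frame.

```r
# Stand-in vector: class labels repeated according to the counts in this data set
classification <- rep(c("e", "p"), times = c(4208, 3916))

# Absolute frequencies per category
freq <- table(classification)

# Relative frequencies expressed as percentages, rounded to one decimal place
pct <- round(100 * prop.table(freq), 1)

print(freq)
print(pct)
```

With these counts, roughly 51.8% of the samples are edible and 48.2% are poisonous.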

Mushrooms <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"), header = FALSE)
head(Mushrooms, 6)

##   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 1  p  x  s  n  t  p  f  c  n   k   e   e   s   s   w   w   p   w   o   p
## 2  e  x  s  y  t  a  f  c  b   k   e   c   s   s   w   w   p   w   o   p
## 3  e  b  s  w  t  l  f  c  b   n   e   c   s   s   w   w   p   w   o   p
## 4  p  x  y  w  t  p  f  c  n   n   e   e   s   s   w   w   p   w   o   p
## 5  e  x  s  g  f  n  f  w  b   k   t   e   s   s   w   w   p   w   o   e
## 6  e  x  y  y  t  a  f  c  b   n   e   c   s   s   w   w   p   w   o   p
##   V21 V22 V23
## 1   k   s   u
## 2   n   n   g
## 3   n   n   m
## 4   k   s   u
## 5   n   a   g
## 6   k   n   g

class(Mushrooms)

## [1] "data.frame"

dim(Mushrooms)

## [1] 8124   23

Mushrooms1 <- Mushrooms[,c(1,6,14,21,22,23)]
head(Mushrooms1, 6)

##   V1 V6 V14 V21 V22 V23
## 1  p  p   s   k   s   u
## 2  e  a   s   n   n   g
## 3  e  l   s   n   n   m
## 4  p  p   s   k   s   u
## 5  e  n   s   n   a   g
## 6  e  a   s   k   n   g

Renaming columns

names(Mushrooms1) = c("classification", "odor", "stalk-sbr", "spore-pr-c", "population", "habitat")
names(Mushrooms1)

## [1] "classification" "odor"           "stalk-sbr"      "spore-pr-c"    
## [5] "population"     "habitat"
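Names containing hyphens, such as `stalk-sbr`, are not syntactic R names, so they must be wrapped in backticks when used with `$`. The sketch below, on a small made-up data frame `df` (hypothetical, not the mushroom data), shows the backtick access and one possible alternative: replacing hyphens with underscores using `gsub()`.

```r
# Small made-up data frame (illustration only)
df <- data.frame(a = c("s", "k"), b = c("n", "k"), stringsAsFactors = FALSE)
names(df) <- c("stalk-sbr", "spore-pr-c")

# Hyphenated names need backticks when accessed with `$` ...
first_value <- df$`stalk-sbr`[1]

# ... or can be replaced by syntactic names, here by swapping "-" for "_"
names(df) <- gsub("-", "_", names(df), fixed = TRUE)
print(names(df))
```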

Subsetting data

Data set of Edible Mushrooms - 4208 instances

# Keep only the rows classified as edible
edible_mushrooms <- subset(Mushrooms1, classification == "e")
head(edible_mushrooms, 6)

##   classification odor stalk-sbr spore-pr-c population habitat
## 2              e    a         s          n          n       g
## 3              e    l         s          n          n       m
## 5              e    n         s          n          a       g
## 6              e    a         s          k          n       g
## 7              e    a         s          k          n       m
## 8              e    l         s          n          s       m

Data set of Poisonous Mushrooms - 3916 instances

# Keep only the rows classified as poisonous
poisonous_mushrooms <- subset(Mushrooms1, classification == "p")
head(poisonous_mushrooms, 6)

##    classification odor stalk-sbr spore-pr-c population habitat
## 1               p    p         s          k          s       u
## 4               p    p         s          k          s       u
## 9               p    p         s          k          v       g
## 14              p    p         s          n          v       u
## 18              p    p         s          k          s       g
## 19              p    p         s          n          s       u
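As an alternative sketch, `split()` produces both subsets in one call, and `nrow()` confirms how many instances each contains. The data frame `toy` below is a tiny made-up sample, assumed only for illustration.

```r
# Tiny made-up sample (illustration only, not the mushroom data)
toy <- data.frame(
  classification = c("p", "e", "e", "p", "e"),
  habitat        = c("u", "g", "m", "u", "g"),
  stringsAsFactors = FALSE
)

# split() returns a named list with one data frame per class
by_class <- split(toy, toy$classification)

edible_toy    <- by_class[["e"]]
poisonous_toy <- by_class[["p"]]

# Number of instances in each subset
print(sapply(by_class, nrow))
```

Applied to `Mushrooms1`, the same call would be expected to yield the two subsets built above, which is a quick way to check the instance counts in the headings.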

Updating the subset value 'e' to 'edible'

# as.character() keeps the fallback branch as text even if the column is a factor
edible_mushrooms$classification <- ifelse(edible_mushrooms$classification == "e", "edible", as.character(edible_mushrooms$classification))
head(edible_mushrooms, 3)

##   classification odor stalk-sbr spore-pr-c population habitat
## 2         edible    a         s          n          n       g
## 3         edible    l         s          n          n       m
## 5         edible    n         s          n          a       g
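A named lookup vector is another way to recode values, and it handles several codes at once. This is a sketch on a made-up vector `codes`; the label 'poisonous' is an assumed label chosen by analogy with 'edible'.

```r
# Made-up vector of class codes (illustration only)
codes <- c("e", "p", "e")

# Lookup table: names are the original codes, values are the readable labels
class_labels <- c(e = "edible", p = "poisonous")

# Index the lookup by code; unname() drops the code names from the result
recoded <- unname(class_labels[codes])
print(recoded)
```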

The attribute with the most predictive value might be habitat, which can tell us where poisonous mushrooms are found compared with edible ones.

A frequency distribution analysis using bar graphs, with classification on the x-axis and percentage on the y-axis, could answer questions like:

How many edible and poisonous mushrooms are in the data set? Results can be given as frequencies and/or percentages.

count = table(Mushrooms1$classification)
t = as.data.frame(count)
names(t)[1] = 'classification'
t

##   classification Freq
## 1              e 4208
## 2              p 3916
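The bar graph described above could be sketched with base R's `barplot()`. The counts are typed in from the frequency table rather than read from `Mushrooms1`, so the snippet runs on its own; the axis labels and the y-axis range are assumptions.

```r
# Class counts copied from the frequency table
counts <- c(edible = 4208, poisonous = 3916)

# Convert counts to percentages of the total sample
pct <- 100 * counts / sum(counts)

# Bar graph: classification on the x-axis, percentage on the y-axis
bar_mids <- barplot(pct,
                    xlab = "classification",
                    ylab = "percentage",
                    ylim = c(0, 100))
```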

Which habitat is most frequent for poisonous mushrooms and for edible mushrooms? Possible habitat values: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d

count = table(Mushrooms1$classification, Mushrooms1$habitat)
t = as.data.frame(count)
names(t)[1] = 'classification'
names(t)[2] = 'habitat'
t

##    classification habitat Freq
## 1               e       d 1880
## 2               p       d 1268
## 3               e       g 1408
## 4               p       g  740
## 5               e       l  240
## 6               p       l  592
## 7               e       m  256
## 8               p       m   36
## 9               e       p  136
## 10              p       p 1008
## 11              e       u   96
## 12              p       u  272
## 13              e       w  192
## 14              p       w    0
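To read the answer directly off the counts, the sketch below rebuilds the cross-tabulation as a matrix (values copied from the output above) and uses `which.max()` per class and `prop.table()` per habitat. The object names are hypothetical.

```r
# Counts copied from the table above: rows are classes, columns are habitats
habitat_counts <- rbind(
  e = c(d = 1880, g = 1408, l = 240, m = 256, p = 136,  u = 96,  w = 192),
  p = c(d = 1268, g = 740,  l = 592, m = 36,  p = 1008, u = 272, w = 0)
)

# Most frequent habitat within each class
top_habitat <- apply(habitat_counts, 1, function(row) names(which.max(row)))
print(top_habitat)

# Share of each class within every habitat (each column sums to 100)
pct_by_habitat <- round(100 * prop.table(habitat_counts, margin = 2), 1)
print(pct_by_habitat)
```

On these counts, woods (`d`) is the most frequent habitat for both classes, while every waste (`w`) sample is edible.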

Which odor is most common in poisonous mushrooms and in edible mushrooms? Possible odor values: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s

count1 = table(Mushrooms1$classification, Mushrooms1$odor)
t = as.data.frame(count1)
names(t)[1] = 'classification'
names(t)[2] = 'odor'
t

##    classification odor Freq
## 1               e    a  400
## 2               p    a    0
## 3               e    c    0
## 4               p    c  192
## 5               e    f    0
## 6               p    f 2160
## 7               e    l  400
## 8               p    l    0
## 9               e    m    0
## 10              p    m   36
## 11              e    n 3408
## 12              p    n  120
## 13              e    p    0
## 14              p    p  256
## 15              e    s    0
## 16              p    s  576
## 17              e    y    0
## 18              p    y  576
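One way to see how strongly odor separates the classes is to list the odors that occur in only one of them. This is a sketch with the counts copied from the table above; the object names are hypothetical.

```r
# Counts copied from the table above: rows are classes, columns are odors
odor_counts <- rbind(
  e = c(a = 400, c = 0,   f = 0,    l = 400, m = 0,  n = 3408, p = 0,   s = 0,   y = 0),
  p = c(a = 0,   c = 192, f = 2160, l = 0,   m = 36, n = 120,  p = 256, s = 576, y = 576)
)

# Odors that occur in only one class
only_edible    <- colnames(odor_counts)[odor_counts["p", ] == 0]
only_poisonous <- colnames(odor_counts)[odor_counts["e", ] == 0]

print(only_edible)
print(only_poisonous)
```

Only the odor code `n` (none) appears in both classes.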

After studying the data set, I found two relevant attributes, habitat and odor, that can provide additional information to help determine whether a particular mushroom is poisonous or edible.


Durley Torres-Marin

August 31, 2017
