Very often, we’re tasked with taking data in one form and transforming it for easier downstream analysis. We will spend several weeks in this course on tidying and transformation operations. Some of this work could be done in SQL or R (or Python or…). Here, you are asked to use R; you may use base functions or packages as you like.

Mushrooms Dataset

A famous (if slightly moldy) dataset about mushrooms can be found in the UCI repository here: https://archive.ics.uci.edu/ml/datasets/Mushroom. The fact that this is such a well-known dataset in the data science community makes it a good dataset to use for comparative benchmarking. For example, if someone was working to build a better decision tree algorithm (or other predictive classifier) to analyze categorical data, this dataset could be useful. A typical problem (which is beyond the scope of this assignment!) is to answer the question, “Which other attribute or attributes are the best predictors of whether a particular mushroom is poisonous or edible?”

Your task is to study the dataset and the associated description of the data (i.e. “data dictionary”). You may need to look around a bit, but it’s there! You should take the data, and create a data frame with a subset of the columns in the dataset. You should include the column that indicates edible or poisonous and three or four other columns. You should also add meaningful column names and replace the abbreviations used in the data; for example, in the appropriate column, “e” might become “edible.” Your deliverable is the R code to perform these transformation tasks.
## Loading required package: bitops
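The loading step itself is not shown in this write-up. Below is a minimal sketch, assuming the raw file is the comma-separated `agaricus-lepiota.data` from the UCI repository, which has no header row. The column names in the preview that follows (`p`, `x`, `s`, ...) look like the first observation was read as a header, and the class counts further down sum to 8,123, one short of the 8,124 records the UCI description reports. The sketch therefore passes `header = FALSE`. To stay runnable offline it reads a few rows from inline text; the object name `mushrooms` and the commented-out URL are assumptions.

```r
# First rows of the raw file, copied from the preview in this document.
# The real file would be read from a path or URL instead (assumed location):
# "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
raw_lines <- c(
  "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u",
  "e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g",
  "e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m",
  "p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u"
)

# header = FALSE keeps the first observation as data;
# stringsAsFactors = TRUE is needed in R >= 4.0 to get factor columns.
mushrooms <- read.csv(text = raw_lines, header = FALSE,
                      stringsAsFactors = TRUE)

# For comparison: the default header = TRUE turns the first row into names.
with_header <- read.csv(text = raw_lines, stringsAsFactors = TRUE)

dim(mushrooms)      # 4 rows, 23 columns
dim(with_header)    # 3 rows, 23 columns
names(with_header)  # p, x, s, n, t, p.1, ... as in the preview
```

With `header = FALSE` the columns get the default names `V1` to `V23`, which are replaced anyway in the renaming step.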
p | x | s | n | t | p.1 | f | c | n.1 | k | e | e.1 | s.1 | s.2 | w | w.1 | p.2 | w.2 | o | p.3 | k.1 | s.3 | u |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
e | x | s | y | t | a | f | c | b | k | e | c | s | s | w | w | p | w | o | p | n | n | g |
e | b | s | w | t | l | f | c | b | n | e | c | s | s | w | w | p | w | o | p | n | n | m |
p | x | y | w | t | p | f | c | n | n | e | e | s | s | w | w | p | w | o | p | k | s | u |
e | x | s | g | f | n | f | w | b | k | t | e | s | s | w | w | p | w | o | e | n | a | g |
e | x | y | y | t | a | f | c | b | n | e | c | s | s | w | w | p | w | o | p | k | n | g |
e | b | s | w | t | a | f | c | b | g | e | c | s | s | w | w | p | w | o | p | k | n | m |

Let’s check out the summary of the mushroom dataset.

| p | x | s | n | t | p.1 |
|---|---|---|---|---|---|
| e:4208 | b:452 | f:2320 | n:2283 | f:4748 | n:3528 |
| p:3915 | c:4 | g:4 | g:1840 | t:3375 | f:2160 |
|  | f:3152 | s:2555 | e:1500 |  | s:576 |
|  | k:828 | y:3244 | y:1072 |  | y:576 |
|  | s:32 |  | w:1040 |  | a:400 |
|  | x:3655 |  | b:168 |  | l:400 |
|  |  |  | (Other):220 |  | (Other):483 |

| f | c | n.1 | k | e | e.1 | s.1 |
|---|---|---|---|---|---|---|
| a:210 | c:6811 | b:5612 | b:1728 | e:3515 | ?:2480 | f:552 |
| f:7913 | w:1312 | n:2511 | p:1492 | t:4608 | b:3776 | k:2372 |
|  |  |  | w:1202 |  | c:556 | s:5175 |
|  |  |  | n:1048 |  | e:1119 | y:24 |
|  |  |  | g:752 |  | r:192 |  |
|  |  |  | h:732 |  |  |  |
|  |  |  | (Other):1169 |  |  |  |

| s.2 | w | w.1 | p.2 | w.2 | o |
|---|---|---|---|---|---|
| f:600 | w:4463 | w:4383 | p:8123 | n:96 | n:36 |
| k:2304 | p:1872 | p:1872 |  | o:96 | o:7487 |
| s:4935 | g:576 | g:576 |  | w:7923 | t:600 |
| y:284 | n:448 | n:512 |  | y:8 |  |
|  | b:432 | b:432 |  |  |  |
|  | o:192 | o:192 |  |  |  |
|  | (Other):140 | (Other):156 |  |  |  |

| p.3 | k.1 | s.3 | u |
|---|---|---|---|
| e:2776 | w:2388 | a:384 | d:3148 |
| f:48 | n:1968 | c:340 | g:2148 |
| l:1296 | k:1871 | n:400 | l:832 |
| n:36 | h:1632 | s:1247 | m:292 |
| p:3967 | r:72 | v:4040 | p:1144 |
|  | b:48 | y:1712 | u:367 |
|  | (Other):144 |  | w:192 |

The summary shows that all of the variables are categorical, so the numbers above are the counts of observations in each category of each variable.
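`summary()` prints per-level counts like these only when the columns are factors. A small sketch of how that could be checked, using a made-up toy column rather than the real data (the name `toy` is hypothetical):

```r
# Toy column standing in for one mushroom attribute (made-up values).
toy <- data.frame(class = c("e", "p", "e", "e"), stringsAsFactors = TRUE)

# TRUE when every column is a factor, i.e. categorical.
all(sapply(toy, is.factor))

# table() gives the same per-level counts that summary() prints.
class_counts <- table(toy$class)
class_counts
```

In R 4.0 and later, `read.csv()` no longer converts strings to factors by default, so `stringsAsFactors = TRUE` (or an explicit `factor()` call) is needed to reproduce this kind of summary.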
To read the data dictionary, I converted it into a text file, saved it in my root folder, and fetched it from there.
Index | Attribute | Information |
---|---|---|
0 | class | edible=e,poisonous=p |
1 | cap-shape | bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s |
2 | cap-surface | fibrous=f,grooves=g,scaly=y,smooth=s |
3 | cap-color | brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y |
4 | bruises? | bruises=t,no=f |
5 | odor | almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s |
6 | gill-attachment | attached=a,descending=d,free=f,notched=n |
7 | gill-spacing | close=c,crowded=w,distant=d |
8 | gill-size | broad=b,narrow=n |
9 | gill-color | black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y |
10 | stalk-shape | enlarging=e,tapering=t |
11 | stalk-root | bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=? |
12 | stalk-surface-above-ring | fibrous=f,scaly=y,silky=k,smooth=s |
13 | stalk-surface-below-ring | fibrous=f,scaly=y,silky=k,smooth=s |
14 | stalk-color-above-ring | brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y |
15 | stalk-color-below-ring | brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y |
16 | veil-type | partial=p,universal=u |
17 | veil-color | brown=n,orange=o,white=w,yellow=y |
18 | ring-number | none=n,one=o,two=t |
19 | ring-type | cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z |
20 | spore-print-color | black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y |
21 | population | abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y |
22 | habitat | grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d |
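Each "Information" entry in the table above is a comma-separated list of `label=code` pairs, so it can be turned into a lookup from code to label. The following is a sketch of one way to do that in base R, not necessarily how the dictionary was processed here; the names `info` and `lookup` are hypothetical.

```r
# One "Information" entry, copied from the cap-surface row of the table.
info <- "fibrous=f,grooves=g,scaly=y,smooth=s"

# Split into "label=code" pairs, then split each pair on "=".
pairs <- strsplit(strsplit(info, ",", fixed = TRUE)[[1]], "=", fixed = TRUE)
labels <- vapply(pairs, `[`, character(1), 1)
codes  <- vapply(pairs, `[`, character(1), 2)

# Named vector: names are the codes, values are the readable labels.
lookup <- setNames(labels, codes)
lookup[["y"]]
```

A lookup like this can feed the `levels` and `labels` arguments of `factor()` when the codes are replaced later on.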
## p x s n t p.1 f c n.1 k e e.1 s.1 s.2 w w.1 p.2 w.2 o p.3 k.1 s.3 u
## 1 e x s y t a f c b k e c s s w w p w o p n n g
## 2 e b s w t l f c b n e c s s w w p w o p n n m
## 3 p x y w t p f c n n e e s s w w p w o p k s u
## 4 e x s g f n f w b k t e s s w w p w o e n a g
## 5 e x y y t a f c b n e c s s w w p w o p k n g
## 6 e b s w t a f c b g e c s s w w p w o p k n m
Now let’s rename the columns. The first column is the class (e = edible, p = poisonous); the other 22 columns take their names from the attribute list given above.
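A minimal sketch of the renaming step, assuming the names are derived from the attribute list by dropping hyphens and the question mark, which matches the names printed in the check that follows. The one-row data frame is only a stand-in so the sketch runs on its own.

```r
# Attribute names in file order, copied from the data dictionary table.
attribute_names <- c(
  "class", "cap-shape", "cap-surface", "cap-color", "bruises?", "odor",
  "gill-attachment", "gill-spacing", "gill-size", "gill-color",
  "stalk-shape", "stalk-root", "stalk-surface-above-ring",
  "stalk-surface-below-ring", "stalk-color-above-ring",
  "stalk-color-below-ring", "veil-type", "veil-color", "ring-number",
  "ring-type", "spore-print-color", "population", "habitat"
)

# Keep lowercase letters only: "cap-shape" -> "capshape", "bruises?" -> "bruises".
clean_names <- gsub("[^a-z]", "", attribute_names)

# Stand-in data frame with 23 columns (one row taken from the raw file).
mushrooms <- read.csv(
  text = "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u",
  header = FALSE, stringsAsFactors = TRUE
)

# Assign all 23 names at once; the lengths must match.
names(mushrooms) <- clean_names
```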
Now let’s check the column names to see whether they have changed.
## [1] "class" "capshape"
## [3] "capsurface" "capcolor"
## [5] "bruises" "odor"
## [7] "gillattachment" "gillspacing"
## [9] "gillsize" "gillcolor"
## [11] "stalkshape" "stalkroot"
## [13] "stalksurfaceabovering" "stalksurfacebelowring"
## [15] "stalkcolorabovering" "stalkcolorbelowring"
## [17] "veiltype" "veilcolor"
## [19] "ringnumber" "ringtype"
## [21] "sporeprintcolor" "population"
## [23] "habitat"
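The summary that follows covers only five of the 23 columns. A minimal sketch of selecting them by name with base R indexing, using a made-up two-row data frame; the object name `mushrooms_subset` is an assumption.

```r
# Toy data frame with a few of the renamed columns (made-up rows).
mushrooms <- data.frame(
  class = c("e", "p"), capshape = c("x", "b"), capsurface = c("s", "y"),
  bruises = c("t", "f"), ringnumber = c("o", "o"), ringtype = c("p", "e"),
  stringsAsFactors = TRUE
)

# Columns to keep, in the order they should appear in the result.
keep <- c("class", "ringtype", "ringnumber", "capsurface", "bruises")
mushrooms_subset <- mushrooms[, keep]

names(mushrooms_subset)
```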

| class | ringtype | ringnumber | capsurface | bruises |
|---|---|---|---|---|
| e:4208 | e:2776 | n:36 | f:2320 | f:4748 |
| p:3915 | f:48 | o:7487 | g:4 | t:3375 |
|  | l:1296 | t:600 | s:2555 |  |
|  | n:36 |  | y:3244 |  |
|  | p:3967 |  |  |  |

Now that the columns are renamed, let’s replace the category codes with readable labels, since the single-letter codes do not make it clear what each value represents. We will recode the four selected variables, along with the class column, using clear and meaningful names.
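One way to do this recoding in base R is `factor()` with explicit `levels` and `labels`, taking the code-to-label pairs from the data dictionary. The sketch below uses made-up rows and only two of the columns; it is an illustration of the approach, not necessarily the code behind the output that follows.

```r
# Toy subset with the coded values (made-up rows).
mushrooms_subset <- data.frame(
  class = c("e", "p", "e"),
  bruises = c("t", "f", "f"),
  stringsAsFactors = TRUE
)

# factor() with explicit levels and labels maps each code to its label.
mushrooms_subset$class <- factor(mushrooms_subset$class,
                                 levels = c("e", "p"),
                                 labels = c("Edible", "Poisonous"))
mushrooms_subset$bruises <- factor(mushrooms_subset$bruises,
                                   levels = c("t", "f"),
                                   labels = c("bruises", "no"))

summary(mushrooms_subset)
```

Listing the levels explicitly makes the mapping visible in the code, and any value missing from `levels` becomes `NA` rather than being silently mislabeled.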

| class | ringtype | ringnumber | capsurface | bruises |
|---|---|---|---|---|
| Edible:4208 | evanescent:2776 | none:36 | fibrous:2320 | bruises:3375 |
| Poisonous:3915 | flaring:48 | one:7487 | grooves:4 | no:4748 |
|  | large:1296 | two:600 | scaly:3244 |  |
|  | none:36 |  | smooth:2555 |  |
|  | pendant:3967 |  |  |  |

## class ringtype ringnumber capsurface bruises
## 6620 Poisonous evanescent one scaly no
## 3148 Poisonous pendant one fibrous no
## 6238 Poisonous evanescent one smooth no
## 7479 Edible pendant two fibrous no
## 3077 Edible pendant one fibrous bruises
## 5958 Poisonous large one scaly no
## 1386 Edible evanescent one fibrous no
## 5976 Edible evanescent two smooth bruises
## 2929 Edible pendant one fibrous bruises
## 5452 Poisonous large one scaly no
## 7350 Edible pendant two smooth no
## 7579 Poisonous evanescent one scaly no
## 3653 Poisonous large one fibrous no
## 8040 Edible pendant one smooth no
## 5651 Poisonous large one scaly no
## 851 Edible pendant one scaly bruises
## 6061 Poisonous evanescent one scaly no
## 797 Edible evanescent one smooth no
## 2605 Poisonous large one fibrous no
## 7388 Edible pendant one smooth no
## 3888 Edible pendant one scaly bruises
## 7360 Edible pendant two smooth bruises
## 6141 Poisonous evanescent one scaly no
## 7102 Poisonous evanescent one scaly no
## 5740 Poisonous pendant one smooth bruises
## 603 Edible pendant one scaly bruises
## 6414 Poisonous evanescent one smooth no
## 7636 Edible pendant one smooth no
## 1874 Edible pendant one scaly bruises
## 2224 Edible pendant one fibrous bruises
## 4008 Edible pendant one fibrous bruises
## 1751 Edible pendant one scaly bruises
## 1298 Edible pendant one scaly bruises
## 2653 Edible pendant one scaly bruises
## 3073 Poisonous large one fibrous no
## 2621 Edible pendant one scaly bruises
## 4555 Poisonous large one fibrous no
## 2209 Edible pendant one scaly bruises
## 4040 Poisonous large one fibrous no
## 4406 Poisonous large one scaly no
## 2029 Edible pendant one scaly bruises
## 2926 Poisonous large one fibrous no
## 8049 Poisonous evanescent one scaly no
## 5107 Poisonous pendant one grooves bruises
## 7896 Poisonous evanescent one scaly no
## 3219 Edible pendant one scaly bruises
## 7442 Poisonous evanescent one smooth no
## 5571 Poisonous pendant two scaly bruises
## 5872 Poisonous evanescent one scaly no
## 1466 Edible pendant one fibrous bruises