We have seen some basic functions built in the base package of R (which means you don’t have to install anything). However, there are several packages that allows you to clean/reorganize/munge the data in a more elegant and efficient fashion. They are reshape2, plyr, dplyr, or tidyr packages. There even exist cheat sheet for dplyr and tidyr packages.
We’ve seen order, subset, r/cbind, and merge before. Let’s explore some more handy functions in the reshape2 and plyr packages.
library(reshape2)
library(plyr)
library(magrittr)
RY=read.csv(text=getURL("https://raw.githubusercontent.com/OscarFHC/NRE538_GSRA/master/Labs/NRE538_DataWrangling/RY.csv"), sep=",", header=T,comment.char="#")
id=read.csv(text=getURL("https://raw.githubusercontent.com/OscarFHC/NRE538_GSRA/master/Labs/NRE538_DataWrangling/sp_id.csv"), sep=",", header=T,comment.char="#")
trait=read.csv(text=getURL("https://raw.githubusercontent.com/OscarFHC/NRE538_GSRA/master/Labs/NRE538_DataWrangling/sp_traits.csv"), sep=",", header=T,comment.char="#")
goal=read.csv(text=getURL("https://raw.githubusercontent.com/OscarFHC/NRE538_GSRA/master/Labs/NRE538_DataWrangling/goal.csv"), sep=",", header=T,comment.char="#")
Our goal is to generate the goal data frame by using the following functions.
head(goal)
## sp2 sp1 RY1 RY2 avgRY C.N.x C.P.x N.P.x
## 1 2 1 0.004167904 1.1385059 0.5713369 18.57258 435.5213 23.4497
## 2 3 1 0.007112204 0.9607456 0.4839289 18.57258 435.5213 23.4497
## 3 4 1 0.519281008 1.6451093 1.0821952 18.57258 435.5213 23.4497
## 4 5 1 0.004955600 0.9128606 0.4589081 18.57258 435.5213 23.4497
## 5 7 1 0.449101255 1.3225352 0.8858182 18.57258 435.5213 23.4497
## 6 8 1 0.883576608 0.1783871 0.5309819 18.57258 435.5213 23.4497
## C.N.y C.P.y N.P.y
## 1 16.437039 365.5173 22.23742
## 2 12.359818 259.7180 21.01310
## 3 10.639614 161.1780 15.14885
## 4 7.129078 146.6461 20.57014
## 5 18.936076 363.5510 19.19885
## 6 10.896064 219.1580 20.11350
Given a data frame, mutate modifies, adds, or removes variables.
head(RY)
## sp1 sp2 RY1 RY2 avgRY
## 1 1 2 0.004167904 1.1385059 0.5713369
## 2 1 3 0.007112204 0.9607456 0.4839289
## 3 1 4 0.519281008 1.6451093 1.0821952
## 4 1 5 0.004955600 0.9128606 0.4589081
## 5 1 7 0.449101255 1.3225352 0.8858182
## 6 1 8 0.883576608 0.1783871 0.5309819
RY = RY %>%
mutate(avgry = (RY1+RY2)/2)
head(RY)
## sp1 sp2 RY1 RY2 avgRY avgry
## 1 1 2 0.004167904 1.1385059 0.5713369 0.5713369
## 2 1 3 0.007112204 0.9607456 0.4839289 0.4839289
## 3 1 4 0.519281008 1.6451093 1.0821952 1.0821952
## 4 1 5 0.004955600 0.9128606 0.4589081 0.4589081
## 5 1 7 0.449101255 1.3225352 0.8858182 0.8858182
## 6 1 8 0.883576608 0.1783871 0.5309819 0.5309819
RY = RY %>%
mutate(avgry = avgry/2)
head(RY)
## sp1 sp2 RY1 RY2 avgRY avgry
## 1 1 2 0.004167904 1.1385059 0.5713369 0.2856684
## 2 1 3 0.007112204 0.9607456 0.4839289 0.2419644
## 3 1 4 0.519281008 1.6451093 1.0821952 0.5410976
## 4 1 5 0.004955600 0.9128606 0.4589081 0.2294540
## 5 1 7 0.449101255 1.3225352 0.8858182 0.4429091
## 6 1 8 0.883576608 0.1783871 0.5309819 0.2654909
This allows you to subset rows or select(this is NOT a function here!) columns.
sp1 = RY %>%
subset(sp1==1, select=c(RY1))
head(sp1)
## RY1
## 1 0.004167904
## 2 0.007112204
## 3 0.519281008
## 4 0.004955600
## 5 0.449101255
## 6 0.883576608
This can also be achieved by the filter (for rows) and select (for columns) functions.
These two, by its name, do the join operation. The merge belongs to the base package, and the join belongs to plyr.
RY.m = RY %>%
merge(id, by.x="sp1", by.y="id") %>%
merge(id, by.x="sp2", by.y="id") %>%
arrange(sp1)
head(RY.m)
## sp2 sp1 RY1 RY2 avgRY avgry
## 1 2 1 0.004167904 1.1385059 0.5713369 0.2856684
## 2 3 1 0.007112204 0.9607456 0.4839289 0.2419644
## 3 4 1 0.519281008 1.6451093 1.0821952 0.5410976
## 4 5 1 0.004955600 0.9128606 0.4589081 0.2294540
## 5 7 1 0.449101255 1.3225352 0.8858182 0.4429091
## 6 8 1 0.883576608 0.1783871 0.5309819 0.2654909
## sp.x sp.y
## 1 Ankistrodesmus falcatus Botryococcus sudeticus
## 2 Ankistrodesmus falcatus Chlamydocapsa ampla
## 3 Ankistrodesmus falcatus Chlamydomonas moewusii
## 4 Ankistrodesmus falcatus Chlorella sorokiniana
## 5 Ankistrodesmus falcatus Closteriopsis acicularis
## 6 Ankistrodesmus falcatus Coelastrum microporum
In the join, we can specify the type. Try ? join to understand it.
colnames(trait)[3] = "sp"
trait.1 = trait %>%
join(id, by="sp", type="left") %>%
subset(id!="NA", select=c(id, C.N, C.P, N.P))
head(trait.1)
## id C.N C.P N.P
## 2 1 18.572578 435.5213 23.44970
## 6 2 16.437039 365.5173 22.23742
## 7 3 12.359818 259.7180 21.01310
## 8 4 10.639614 161.1780 15.14885
## 10 5 7.129078 146.6461 20.57014
## 12 7 18.936076 363.5510 19.19885
trait.2 = trait %>%
join(id, by="sp", type="inner") %>%
subset(select=c(id, C.N, C.P, N.P))
head(trait.2)
## id C.N C.P N.P
## 1 1 18.572578 435.5213 23.44970
## 2 2 16.437039 365.5173 22.23742
## 3 3 12.359818 259.7180 21.01310
## 4 4 10.639614 161.1780 15.14885
## 5 5 7.129078 146.6461 20.57014
## 6 7 18.936076 363.5510 19.19885
This is a very useful function that allows you to apply a function to a data frame and reture a data frame back.
sum = RY.m %>%
ddply(~sp1, summarize,
mean=mean(RY1),
med=quantile(RY1, 0.5),
lo=quantile(RY1, 0.05),
hi=quantile(RY1, 0.95))
sum
## sp1 mean med lo hi
## 1 1 0.2898778 0.17751278 0.004916215 0.9034076
## 2 2 0.9827241 0.93150681 0.718204608 1.4055848
## 3 3 0.8541042 0.90856213 0.442758034 1.1683561
## 4 4 0.7166386 0.70336081 0.054399857 1.2965328
## 5 5 0.2924596 0.08802885 0.028070055 0.8381911
## 6 7 0.6433982 0.27215109 0.017227364 1.9947887
## 7 8 0.1618722 0.06555914 0.013586022 0.5179500
## 8 9 0.6887810 0.57784979 0.114829614 1.4363793
## 9 10 0.6938790 0.58021739 0.160945652 1.3804556
## 10 11 0.7865976 0.81150458 0.213351056 1.4043605
## 11 12 0.9732537 0.96997228 0.648571429 1.3483598
## 12 13 0.4190639 0.19459916 0.011743253 1.1569805
## 13 14 0.6739349 0.65758755 0.156050176 1.3208471
## 14 16 0.6139744 0.64942358 0.138329708 1.2701913
## 15 17 0.9132902 0.94729067 0.683282582 1.1070749
## 16 18 0.5401754 0.55012667 0.218512486 0.7728944
## 17 19 0.3379909 0.33799089 0.047359639 0.6286221
## 18 20 1.5665162 1.56651623 1.063076587 2.0699559
## 19 21 0.9117685 0.91176848 0.760318363 1.0632186
## 20 22 1.2263210 1.22632104 1.219591227 1.2330508
## 21 23 0.9479769 0.94797688 0.947976879 0.9479769
Let’s use what we’ve learned to generate the goal data frame.
## sp2 sp1 RY1 RY2 avgRY C.N.x C.P.x N.P.x
## 1 2 1 0.004167904 1.1385059 0.5713369 18.57258 435.5213 23.4497
## 2 3 1 0.007112204 0.9607456 0.4839289 18.57258 435.5213 23.4497
## 3 4 1 0.519281008 1.6451093 1.0821952 18.57258 435.5213 23.4497
## 4 5 1 0.004955600 0.9128606 0.4589081 18.57258 435.5213 23.4497
## 5 7 1 0.449101255 1.3225352 0.8858182 18.57258 435.5213 23.4497
## 6 8 1 0.883576608 0.1783871 0.5309819 18.57258 435.5213 23.4497
## C.N.y C.P.y N.P.y
## 1 16.437039 365.5173 22.23742
## 2 12.359818 259.7180 21.01310
## 3 10.639614 161.1780 15.14885
## 4 7.129078 146.6461 20.57014
## 5 18.936076 363.5510 19.19885
## 6 10.896064 219.1580 20.11350
In addition, if you are dealing with large dataset, data.table package will be your good friend!
melt function can make a data frame from “wide” form into a “long” form.
RY.long = RY %>%
melt(id.vars=c("sp1", "sp2"), measure.vars=c("RY1", "RY2"), variable.name="RY_type", value.name="value")
head(RY.long)
## sp1 sp2 RY_type value
## 1 1 2 RY1 0.004167904
## 2 1 3 RY1 0.007112204
## 3 1 4 RY1 0.519281008
## 4 1 5 RY1 0.004955600
## 5 1 7 RY1 0.449101255
## 6 1 8 RY1 0.883576608
cast function can do the reverse of melt
RY.wide = RY.long %>%
dcast(sp2+sp1~RY_type)
head(RY.wide)
## sp2 sp1 RY1 RY2
## 1 2 1 0.004167904 1.13850585
## 2 3 1 0.007112204 0.96074558
## 3 3 2 0.739350695 0.11359761
## 4 4 1 0.519281008 1.64510930
## 5 5 1 0.004955600 0.91286058
## 6 5 3 0.898026283 0.02287871