In my surroundings at work I still see quite a few people managing their data in spreadsheet software like Excel or Calc. These programs will do the job, but I tend to do as little data manipulation in them as possible: I turn my spreadsheets into csv files as soon as I can and bring the data into R, where every single manipulation I do on them is recorded by default in the history (if you use RStudio) or in scripts if you are documenting your work (which should always be the way to go). The aim of this post is to show how to do some manipulations often done on data (i.e. subsetting, summarising, ordering…) in R. As always there are a thousand ways to do an operation; I will go through the basic way to do these manipulations using the vector-based approach of R, and at the end show how new libraries let you do these manipulations on data frames using code easily understandable for those not grasping (yet) the magic of vector-based operations.

#################
#Data management#
#################


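#in practice one would first read in the csv file exported from the spreadsheet,
#for example (the file name here is hypothetical):
#data<-read.csv("mydata.csv",header=TRUE)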
#the data frame I will use
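#(the values are drawn at random, so calling set.seed() beforehand would make the example reproducible; the outputs below come from one particular draw)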
data<-data.frame(Factor1=rep(LETTERS[1:5],each=20),Factor2=sample(letters[1:10],100,replace=TRUE),Var1=rnorm(100,2,4),Var2=rpois(100,2))
#some simple summary
summary(data)
##  Factor1    Factor2        Var1             Var2     
##  A:20    g      :13   Min.   :-6.533   Min.   :0.00  
##  B:20    h      :13   1st Qu.:-0.746   1st Qu.:1.00  
##  C:20    b      :11   Median : 1.285   Median :2.00  
##  D:20    f      :11   Mean   : 1.264   Mean   :2.02  
##  E:20    a      :10   3rd Qu.: 3.441   3rd Qu.:3.00  
##          d      :10   Max.   : 9.678   Max.   :7.00  
##          (Other):32
table(data$Factor1)
## 
##  A  B  C  D  E 
## 20 20 20 20 20
table(data$Factor2)
## 
##  a  b  c  d  e  f  g  h  i  j 
## 10 11  6 10  9 11 13 13  9  8
#####basic way using vectors######
#subsetting
#only keep observations with Factor1 equal to A
sub1<-subset(data,Factor1=="A")
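#this is equivalent to indexing with a logical vector: data[data$Factor1=="A",]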
#only keep observations with Factor1 equal to A and Var2 lower than 4
#a comparison like the one below returns a logical vector that can then be used to index the rows
data$Factor1=="A"
##   [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [12]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
##  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE
sub2<-data[data$Factor1=="A" & data$Var2<4,]
summary(sub2)
##  Factor1    Factor2       Var1             Var2     
##  A:19    b      :3   Min.   :-3.882   Min.   :0.00  
##  B: 0    c      :3   1st Qu.:-0.733   1st Qu.:1.00  
##  C: 0    j      :3   Median : 1.076   Median :2.00  
##  D: 0    a      :2   Mean   : 0.917   Mean   :1.63  
##  E: 0    e      :2   3rd Qu.: 3.011   3rd Qu.:2.00  
##          i      :2   Max.   : 5.128   Max.   :3.00  
##          (Other):4
#only keep every third row
head(data[seq(1,nrow(data),3),])
##    Factor1 Factor2    Var1 Var2
## 1        A       i -2.6481    1
## 4        A       f  2.9360    1
## 7        A       e -0.4924    3
## 10       A       h -1.7991    2
## 13       A       a  3.4397    1
## 16       A       i  3.7078    2
#only keep rows 2, 6, 13 and 22 from columns 1 and 4
data[c(2,6,13,22),c(1,4)] #when the numbers are consecutive you can use :, e.g. 1:10
##    Factor1 Var2
## 2        A    4
## 6        A    1
## 13       A    1
## 22       B    0
#summarising
library(plyr)
#get the mean value and standard error of Var1 for each level of Factor1
rbind.fill(by(data,data$Factor1,function(x) return(data.frame(Factor1=unique(x$Factor1),Mean=mean(x$Var1),SE=sd(x$Var1)/sqrt(length(x$Var1))))))
##   Factor1    Mean     SE
## 1       A  1.0826 0.5989
## 2       B  1.7220 0.9099
## 3       C  3.3090 0.8064
## 4       D -0.2736 0.7218
## 5       E  0.4780 0.5889
#get the 25% and 75% quantile for Var2 for each level of Factor2
rbind.fill(by(data,data$Factor2,function(x) return(data.frame(Factor2=unique(x$Factor2),Q_25=quantile(x$Var2,prob=0.25),Q_75=quantile(x$Var2,prob=0.75)))))
##    Factor2 Q_25 Q_75
## 1        a 1.00 1.75
## 2        b 0.50 2.50
## 3        c 1.25 3.00
## 4        d 0.25 2.00
## 5        e 1.00 2.00
## 6        f 1.50 3.50
## 7        g 1.00 3.00
## 8        h 1.00 2.00
## 9        i 2.00 3.00
## 10       j 1.00 2.00

Wow, these two last calls can seem rather intimidating at first, but as always you need to start at the center and then walk outwards to understand what is happening; let’s look at the first one for example. We call an unnamed function on the data frame data and apply this function to each level of data$Factor1 separately: these chunks of data are passed to the function and called x. The function then returns a data frame made of three columns: the first one, named Factor1, takes the unique value present in the Factor1 column of the x chunk; the second one takes the mean of the Var1 values; the third one divides the standard deviation of the Var1 values by the square root of the number of observations (giving the standard error around the mean). As the by function returns a series of data frames, we can combine them together into one data frame using rbind.fill. These are rather long lines of code; keep them in mind, as at the end of the post you will see how to do this in a different way.
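If this one-liner still feels dense, here is a sketch of the same computation broken into named steps (same functions as above, just spread out):

#a step-by-step version of the first summarising call
mean_se<-function(x){
  data.frame(Factor1=unique(x$Factor1),Mean=mean(x$Var1),SE=sd(x$Var1)/sqrt(length(x$Var1)))
}
#by() splits the data frame into one chunk per level of Factor1 and applies the function to each chunk
res_list<-by(data,data$Factor1,mean_se)
#rbind.fill() then stacks the returned data frames into a single data frame
rbind.fill(res_list)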

#changing column order
data<-data[,c(1,4,3,2)]
head(data)
##   Factor1 Var2   Var1 Factor2
## 1       A    1 -2.648       i
## 2       A    4  4.232       h
## 3       A    3  1.344       d
## 4       A    1  2.936       f
## 5       A    1  5.128       b
## 6       A    1  1.076       c
#also works with column names
data<-data[,c("Factor1","Var1","Factor2","Var2")]
head(data)
##   Factor1   Var1 Factor2 Var2
## 1       A -2.648       i    1
## 2       A  4.232       h    4
## 3       A  1.344       d    3
## 4       A  2.936       f    1
## 5       A  5.128       b    1
## 6       A  1.076       c    1
#sorting the rows first by Factor1 then by Factor2
data<-data[do.call(order,list(data$Factor1,data$Factor2)),]
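#note: this is equivalent to the simpler data[order(data$Factor1,data$Factor2),]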

######increasing complexity, switching from long to wide format########
library(reshape2)
#the long format keeps the info on a grouping variable (eg Sex) in one column instead of making a separate column for each level
#the object data is for example in a long format; we may want to make a separate column for each level of Factor1, storing Var1 in the rows
data$Observation<-rep(1:20,times=5)
data_wide<-dcast(Observation~Factor1,data = data,value.var = "Var1") #the left-hand side of the formula gives the variable making up the rows, the right-hand side the one making up the columns
#if certain combinations are missing one can use the fill argument
data_wide<-dcast(Factor2~Factor1,data=data,fun.aggregate = length,fill=0) #here we count how many observations there are for each level of Factor2 and Factor1
## Using Observation as value column: use value.var to override.
#other functions can be provided if more than one value is present in each cell
data_wide<-dcast(Factor2~Factor1,data=data,fun.aggregate = sum,value.var="Var2",fill=0)
#turning back the data to a long format
data_long<-melt(data_wide,value.name = "Sum_Var2",id.vars="Factor2",variable.name = "Factor1") #melt the data frame; id.vars gives the column(s) containing the factor info
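#data_long now has one row per Factor2 by Factor1 combination, with the summed Var2 values in the Sum_Var2 column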
#long formats are then pretty handy for plotting
library(ggplot2)
ggplot(data_long,aes(x=Factor2,y=Sum_Var2,colour=Factor1))+geom_point()

(plot: Sum_Var2 against Factor2, coloured by Factor1)

#but it is also the way the data should be structured for data analysis:
lm(Sum_Var2~Factor2+Factor1,data_long)
## 
## Call:
## lm(formula = Sum_Var2 ~ Factor2 + Factor1, data = data_long)
## 
## Coefficients:
## (Intercept)     Factor2b     Factor2c     Factor2d     Factor2e  
##    2.46e+00    -8.67e-16    -2.00e-01     2.00e-01    -1.02e-15  
##    Factor2f     Factor2g     Factor2h     Factor2i     Factor2j  
##    2.60e+00     4.00e+00     2.20e+00     1.60e+00     8.92e-16  
##    Factor1B     Factor1C     Factor1D     Factor1E  
##    1.00e-01     6.00e-01     1.20e+00     8.00e-01

For more about long and wide formats you can also look at the great article on this in the R cookbook: http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/. Now let’s turn to a new library that came to my attention recently and that is extremely elegant and easy to use. More info on this library: http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

#####using dplyr to make all data manipulation easy######
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:plyr':
## 
##     arrange, desc, failwith, id, mutate, summarise, summarize
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#the five main functions of dplyr; dplyr works with data frames instead of vectors, which makes data frame manipulation much more straightforward
filter(data,Factor1%in%c("A","D"),Var1>=0) #similar to subset
##    Factor1   Var1 Factor2 Var2 Observation
## 1        A 3.0852       a    2           1
## 2        A 3.4397       a    1           2
## 3        A 5.1282       b    1           3
## 4        A 2.1778       b    1           4
## 5        A 0.8335       b    0           5
## 6        A 1.0763       c    1           6
## 7        A 4.7738       c    3           7
## 8        A 1.3436       d    3           9
## 9        A 2.9360       f    1          12
## 10       A 4.2322       h    4          14
## 11       A 3.7078       i    2          17
## 12       A 0.5562       j    2          18
## 13       A 1.8503       j    1          20
## 14       D 0.1532       a    1           1
## 15       D 1.5194       d    2           4
## 16       D 4.4320       e    0           8
## 17       D 4.9094       f    3           9
## 18       D 7.2959       f    2          12
## 19       D 1.0685       h    2          15
## 20       D 0.5544       i    3          17
## 21       D 1.2271       j    5          20
head(select(data,contains("factor",ignore.case=TRUE))) #only return specific columns; see ?select for more possibilities
##    Factor1 Factor2
## 8        A       a
## 13       A       a
## 5        A       b
## 14       A       b
## 17       A       b
## 6        A       c
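#other helpers such as starts_with() or ends_with() can be used in the same way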
head(arrange(data,Factor1,Var1))
##   Factor1    Var1 Factor2 Var2 Observation
## 1       A -3.8821       c    3           8
## 2       A -3.1996       e    2          11
## 3       A -2.6481       i    1          16
## 4       A -1.7991       h    2          15
## 5       A -0.7349       g    2          13
## 6       A -0.7319       j    0          19
head(mutate(data,Var3=Var1+Var2,M_1=(Observation+Var2)/length(Var2)))
##   Factor1   Var1 Factor2 Var2 Observation   Var3  M_1
## 1       A 3.0852       a    2           1 5.0852 0.03
## 2       A 3.4397       a    1           2 4.4397 0.03
## 3       A 5.1282       b    1           3 6.1282 0.04
## 4       A 2.1778       b    1           4 3.1778 0.05
## 5       A 0.8335       b    0           5 0.8335 0.05
## 6       A 1.0763       c    1           6 2.0763 0.07
summarise(data,sum(Var2))
##   sum(Var2)
## 1       202
#summarise becomes extremely handy when used with group_by
data_d<-summarise(group_by(data,Factor1),Mean=mean(Var1),SE=sd(Var1)/sqrt(n())) #remember the huge by function needed to get the same results
#the n() function is built into dplyr and counts how many elements there are in each group
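#note: with recent versions of dplyr the same call can be written with the pipe operator
data_d<-data %>% group_by(Factor1) %>% summarise(Mean=mean(Var1),SE=sd(Var1)/sqrt(n()))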
#going from the full dataset to a graph summarising mean differences between factors is swift and painless using these functions
ggplot(data_d,aes(x=Factor1,y=Mean))+geom_point(colour="red",size=3,show_guide=FALSE)+geom_errorbar(aes(ymin=Mean-2*SE,ymax=Mean+2*SE),width=.1)

(plot: mean of Var1 for each level of Factor1, with error bars of ±2 SE)

As always, as you delve deeper into these topics you will see that the options are extremely numerous, which makes R extremely enjoyable for data manipulation once the basics are understood. As R is nowadays used for most data analysis (in my field of work at least), I find it natural to bring the data into R as soon as possible to really play with it and grasp its structure, instead of just fitting linear models in R and then using other software to make plots or observe basic patterns in the data. Enjoy your data manips’!