These are notes for the first session of the MRRI/Drexel R Workgroup. In this session we’ll cover some intermediate-level R topics that I think help turn R beginners into R experts. We’ll be talking about my Top 5 Most Useful R Functions:

1. subset()
2. merge()
3. melt()
4. dcast()
5. ddply()

That Top 5 doesn’t include functions for basic mechanics (reading data, computing means, etc.) or statistical methods (multilevel regression) or ggplot, which I think deserves a special jury prize for being the best graphing function ever invented and will be the topic of discussion at our next meeting.

## Data Input/Output

For one reason or another, many of us have data in Excel. The easiest way to get Excel data into a R-friendly format is to use “Save As” and save your Excel worksheet as “Text (Tab Delimited) (*.txt)“:

then you can read it into R using read.delim("filename.txt"). You’ll need to use the correct path for the filename and you can even read data directly from the internet:

WordLearnEx <- read.delim("http://www.danmirman.org/gca/WordLearnEx.txt")

Ideally, you’d skip Excel entirely and read raw data files. If your data collection software does not produce standard tab-delimited files, you’ll need to use the more general function read.table(). The basic syntax is the same, but you’ll need to specify some details about the structure of the data file. The most important ones are:

• Whether or not your data file has a header with column (variable) names.
• If yes, you’ll need to use header = TRUE
• If no, you might want to use col.names = c("V1", "V2", ...) to specify the variable names (V1, V2, etc. are the default names, you could use something more informative).
• The field separator: The most common ones are tab-delimited sep = "\t" and comma-separated sep = ",", but you can specify anything there. By default read.table will use all white space as separators, which means that spaces, tabs, newlines, etc. will be treated as separators.
• Preliminary lines to skip: Sometimes raw data files have a header with some general or preliminary information that is not part of the tabular data. You can skip these lines using the skip = n argument, where n is an integer indicating the number of rows to skip.
• Missing data codes: the R default is that missing data are indicated by NA, but you can override this default using na.strings = 999 to specify 999 (or whatever) as the code for missing data. This is very useful and important because otherwise those 999 values would become part of the computations for means, etc. whereas NA can be easily excluded from computations and can be used to catch when something went wrong.
• You might also want to use strip.white = TRUE to remove sneaky whitespace from beginnings or ends of letter strings.

Here is how you’d read a tab-delimited raw data file with a 4-line preliminary header, no variable names, and “nan” used to indicate missing data:

#not run
raw_data <- read.table("Raw_Data", sep = "\t", skip = 4,
col.names = c("ID", "Trial", "Condition", "Accuracy", "RT"),
na.strings = "nan")

The customization arguments can also be used with read.delim() if you need to customize anything within the context of a standard tab-delimited file.

I usually separate the initial reading and organizing of raw data from the main analysis. So after those initial pre-processing steps I save the data set into a R data file, which I can later load for analysis without having to re-do the pre-processing steps. R data files can hold multiple R objects, so I can save the raw (read but unprocessed) and processed data into a single file (and any intermediate data as well), which is nice if I might need to tweak some later pre-processing steps.

As an example, I can save the word learning data and the affect data from the psych package into a single data file:

library(psych)
save(WordLearnEx, affect, file="Part1_Data.RData")

Now I can load those data:

load("Part1_Data.RData")

## Subsetting and Merging

### Subset

The subset() function does just what you think it would do: allow you to pull out a subset of a data frame. There are many different ways to accomplish this goal using fancy indexing, but the subset() function is a one-stop-shop that does pretty much everything with an easy-to-use syntax. In general, the subset() function takes a data frame and a set of conditions, then returns the subset of those original data where those conditions are met.

I think the most common kind of subset is a subset of rows based on particular values in key columns. In this case, the conditions are specified by logical operations. For example, to get just the “High TP” condition data from the WordLearnEx data:

WL.hi <- subset(WordLearnEx, TP == "High")
summary(WL.hi)
##     Subject         TP          Block         Accuracy
##  Min.   :303.0   High:280   Min.   : 1.0   Min.   :0.0000
##  1st Qu.:325.5   Low :  0   1st Qu.: 3.0   1st Qu.:0.7915
##  Median :349.5              Median : 5.5   Median :0.8330
##  Mean   :350.0              Mean   : 5.5   Mean   :0.8315
##  3rd Qu.:379.2              3rd Qu.: 8.0   3rd Qu.:1.0000
##  Max.   :395.0              Max.   :10.0   Max.   :1.0000
##     Correct        Incorrect
##  Min.   :0.000   Min.   :0.000
##  1st Qu.:4.750   1st Qu.:0.000
##  Median :5.000   Median :1.000
##  Mean   :4.989   Mean   :1.011
##  3rd Qu.:6.000   3rd Qu.:1.250
##  Max.   :6.000   Max.   :6.000

Or to exclude data from a specific participant:

WL.excl <- subset(WordLearnEx, Subject != 378)
summary(WL.excl)
##     Subject         TP          Block         Accuracy
##  Min.   :244.0   High:270   Min.   : 1.0   Min.   :0.0000
##  1st Qu.:318.0   Low :280   1st Qu.: 3.0   1st Qu.:0.6670
##  Median :345.0              Median : 5.5   Median :0.8330
##  Mean   :343.8              Mean   : 5.5   Mean   :0.8096
##  3rd Qu.:370.0              3rd Qu.: 8.0   3rd Qu.:1.0000
##  Max.   :395.0              Max.   :10.0   Max.   :1.0000
##     Correct        Incorrect
##  Min.   :0.000   Min.   :0.000
##  1st Qu.:4.000   1st Qu.:0.000
##  Median :5.000   Median :1.000
##  Mean   :4.858   Mean   :1.142
##  3rd Qu.:6.000   3rd Qu.:2.000
##  Max.   :6.000   Max.   :6.000

You can also use multiple joint constraints. For example, excluding all floor and ceiling accuracy blocks:

WL.mid <- subset(WordLearnEx, Accuracy > 0 & Accuracy < 1)
summary(WL.mid)
##     Subject       TP          Block           Accuracy
##  Min.   :244   High:149   Min.   : 1.000   Min.   :0.1670
##  1st Qu.:323   Low :180   1st Qu.: 2.000   1st Qu.:0.5000
##  Median :351              Median : 4.000   Median :0.6670
##  Mean   :347              Mean   : 4.681   Mean   :0.6742
##  3rd Qu.:376              3rd Qu.: 7.000   3rd Qu.:0.8330
##  Max.   :395              Max.   :10.000   Max.   :0.8330
##     Correct        Incorrect
##  Min.   :1.000   Min.   :1.000
##  1st Qu.:3.000   1st Qu.:1.000
##  Median :4.000   Median :2.000
##  Mean   :4.046   Mean   :1.954
##  3rd Qu.:5.000   3rd Qu.:3.000
##  Max.   :5.000   Max.   :5.000

As far as I know, there is no limit on the complexity of those logical conditions, but be careful with building very complicated conditions.

You can also define a subset of columns by using the select argument. For example, to pull out a handful of key columns from the affect data:

affect.neg <- subset(affect, select=c(Study, Film, NA1, NA2))
summary(affect.neg)
##   Study          Film            NA1              NA2
##  maps:160   Min.   :1.000   Min.   : 0.000   Min.   : 0.000
##  flat:170   1st Qu.:1.250   1st Qu.: 1.000   1st Qu.: 0.000
##             Median :3.000   Median : 2.000   Median : 3.000
##             Mean   :2.515   Mean   : 3.693   Mean   : 4.649
##             3rd Qu.:4.000   3rd Qu.: 6.000   3rd Qu.: 7.000
##             Max.   :4.000   Max.   :28.000   Max.   :30.000

Or to drop an unnecessary column:

WL.nacc <- subset(WordLearnEx, select = -Accuracy)
summary(WL.nacc)
##     Subject         TP          Block         Correct       Incorrect
##  Min.   :244.0   High:280   Min.   : 1.0   Min.   :0.00   Min.   :0.00
##  1st Qu.:319.5   Low :280   1st Qu.: 3.0   1st Qu.:4.00   1st Qu.:0.00
##  Median :346.5              Median : 5.5   Median :5.00   Median :1.00
##  Mean   :344.4              Mean   : 5.5   Mean   :4.83   Mean   :1.17
##  3rd Qu.:370.5              3rd Qu.: 8.0   3rd Qu.:6.00   3rd Qu.:2.00
##  Max.   :395.0              Max.   :10.0   Max.   :6.00   Max.   :6.00

And you can combine these different kinds of conditions:

affect.neg.maps <- subset(affect, Study == "maps", select=c(Study, Film, NA1, NA2))
summary(affect.neg.maps)
##   Study          Film            NA1              NA2
##  maps:160   Min.   :1.000   Min.   : 0.000   Min.   : 0.00
##  flat:  0   1st Qu.:1.000   1st Qu.: 0.000   1st Qu.: 0.00
##             Median :3.000   Median : 2.000   Median : 3.00
##             Mean   :2.481   Mean   : 3.506   Mean   : 4.90
##             3rd Qu.:3.000   3rd Qu.: 5.000   3rd Qu.: 7.25
##             Max.   :4.000   Max.   :28.000   Max.   :30.00

Notice that the Study factor in the subset still contains both levels (maps and flat) even though one of them has no observations. This is mostly innocuous, but can occasionally cause confusion and/or errors. You can clean this up using the droplevels() function, which will drop any factor levels that have no observations.

### Merge

The merge() function is one of the smartest functions in R and provides a simple one-line way to accomplish something that would be fairly complex otherwise. In general, merge() will merge two data frames using columns with matching variable names to align matching values. One scenario that comes up a lot in my research is that a group of participants does a study with multiple trials, then I want to combine these multi-trial data with some individual-subject data like aphasia severity or performance on some other test.

I’ll use the Word Learning data as an example and make up a severity score

sev <- data.frame(Subject = unique(WordLearnEx$Subject), Severity = rnorm(length(unique(WordLearnEx$Subject))))
summary(sev)
##     Subject         Severity
##  Min.   :244.0   Min.   :-2.21650
##  1st Qu.:319.5   1st Qu.:-0.36478
##  Median :346.5   Median : 0.18600
##  Mean   :344.4   Mean   : 0.07171
##  3rd Qu.:370.5   3rd Qu.: 0.74850
##  Max.   :395.0   Max.   : 2.54057
dim(sev)
## [1] 56  2
dim(WordLearnEx)
## [1] 560   6

Notice that the sev data frame has only 56 rows (one for each subject) but the WordLearnEx data frame has 560 rows because there are 10 blocks of word learning data per subject, so I can’t just stick the Severity column onto the WordLearnEx data because I need to make sure each severity score lines up with their corresponding subject. This is exactly what merge() does:

WL.sev <- merge(WordLearnEx, sev)
summary(WL.sev)
##     Subject         TP          Block         Accuracy        Correct
##  Min.   :244.0   High:280   Min.   : 1.0   Min.   :0.000   Min.   :0.00
##  1st Qu.:319.5   Low :280   1st Qu.: 3.0   1st Qu.:0.667   1st Qu.:4.00
##  Median :346.5              Median : 5.5   Median :0.833   Median :5.00
##  Mean   :344.4              Mean   : 5.5   Mean   :0.805   Mean   :4.83
##  3rd Qu.:370.5              3rd Qu.: 8.0   3rd Qu.:1.000   3rd Qu.:6.00
##  Max.   :395.0              Max.   :10.0   Max.   :1.000   Max.   :6.00
##    Incorrect       Severity
##  Min.   :0.00   Min.   :-2.21650
##  1st Qu.:0.00   1st Qu.:-0.36478
##  Median :1.00   Median : 0.18600
##  Mean   :1.17   Mean   : 0.07171
##  3rd Qu.:2.00   3rd Qu.: 0.74850
##  Max.   :6.00   Max.   : 2.54057

The new Severity column got added, and we can check that it worked correctly by looking at the first (and last) few rows in each data frame:

#first few rows
head(sev)
##   Subject   Severity
## 1     244  0.7364737
## 2     253 -0.7187716
## 3     302  1.0764167
## 4     303 -0.1527374
## 5     305  1.0161045
## 6     306  0.1524591
head(WL.sev, 20)
##    Subject  TP Block Accuracy Correct Incorrect   Severity
## 1      244 Low     1    1.000       6         0  0.7364737
## 2      244 Low     2    0.333       2         4  0.7364737
## 3      244 Low     3    0.667       4         2  0.7364737
## 4      244 Low     4    0.667       4         2  0.7364737
## 5      244 Low     5    0.667       4         2  0.7364737
## 6      244 Low     6    0.833       5         1  0.7364737
## 7      244 Low     7    0.500       3         3  0.7364737
## 8      244 Low     8    0.833       5         1  0.7364737
## 9      244 Low     9    1.000       6         0  0.7364737
## 10     244 Low    10    0.667       4         2  0.7364737
## 11     253 Low     1    0.500       3         3 -0.7187716
## 12     253 Low     2    1.000       6         0 -0.7187716
## 13     253 Low     3    0.500       3         3 -0.7187716
## 14     253 Low     4    1.000       6         0 -0.7187716
## 15     253 Low     5    0.500       3         3 -0.7187716
## 16     253 Low     6    0.500       3         3 -0.7187716
## 17     253 Low     7    0.667       4         2 -0.7187716
## 18     253 Low     8    0.500       3         3 -0.7187716
## 19     253 Low     9    0.167       1         5 -0.7187716
## 20     253 Low    10    0.833       5         1 -0.7187716
#last few rows
tail(sev)
##    Subject   Severity
## 51     387 -0.6791419
## 52     388 -2.1101675
## 53     390 -0.6550895
## 54     391  0.4821140
## 55     393  0.2343186
## 56     395  0.2812411
tail(WL.sev, 20)
##     Subject   TP Block Accuracy Correct Incorrect  Severity
## 541     393 High     1    0.167       1         5 0.2343186
## 542     393 High     2    0.333       2         4 0.2343186
## 543     393 High     3    0.333       2         4 0.2343186
## 544     393 High     4    0.167       1         5 0.2343186
## 545     393 High     5    0.500       3         3 0.2343186
## 546     393 High     6    0.667       4         2 0.2343186
## 547     393 High     7    0.833       5         1 0.2343186
## 548     393 High     8    1.000       6         0 0.2343186
## 549     393 High     9    1.000       6         0 0.2343186
## 550     393 High    10    0.667       4         2 0.2343186
## 551     395 High     1    0.167       1         5 0.2812411
## 552     395 High     2    0.000       0         6 0.2812411
## 553     395 High     3    0.833       5         1 0.2812411
## 554     395 High     4    0.833       5         1 0.2812411
## 555     395 High     5    0.833       5         1 0.2812411
## 556     395 High     6    1.000       6         0 0.2812411
## 557     395 High     7    0.500       3         3 0.2812411
## 558     395 High     8    1.000       6         0 0.2812411
## 559     395 High     9    0.667       4         2 0.2812411
## 560     395 High    10    1.000       6         0 0.2812411

This worked very smoothly because we had a matching Subject column in both data frames and each subject was in both. Things are not always this easy, so let’s make up a more tricky case:

sev2 <- data.frame(Participant = unique(WordLearnEx$Subject), Severity = rnorm(length(unique(WordLearnEx$Subject))))

So now I’ve created a similar severity data set, except the subject id is called Participant. It is certainly possible to rename that variable so it matches, but in a real situation I’d have read those data from an external file and it would be nice to remain consistent with that file. Instead, I can explicitly tell merge which columns are supposed to match across the two data frames:

WL.sev2 <- merge(WordLearnEx, sev2, by.x = "Subject", by.y = "Participant")
summary(WL.sev2)
##     Subject         TP          Block         Accuracy        Correct
##  Min.   :244.0   High:280   Min.   : 1.0   Min.   :0.000   Min.   :0.00
##  1st Qu.:319.5   Low :280   1st Qu.: 3.0   1st Qu.:0.667   1st Qu.:4.00
##  Median :346.5              Median : 5.5   Median :0.833   Median :5.00
##  Mean   :344.4              Mean   : 5.5   Mean   :0.805   Mean   :4.83
##  3rd Qu.:370.5              3rd Qu.: 8.0   3rd Qu.:1.000   3rd Qu.:6.00
##  Max.   :395.0              Max.   :10.0   Max.   :1.000   Max.   :6.00
##    Incorrect       Severity
##  Min.   :0.00   Min.   :-1.9212
##  1st Qu.:0.00   1st Qu.:-0.4752
##  Median :1.00   Median : 0.1147
##  Mean   :1.17   Mean   : 0.0787
##  3rd Qu.:2.00   3rd Qu.: 0.7497
##  Max.   :6.00   Max.   : 2.1281

Now imagine the more recently recruited participants haven’t had their severity assessed yet, so those scores are missing. We can exclude them from our made-up data using subset():

sev2.s <- subset(sev2, Participant < 390)
dim(sev2.s)
## [1] 52  2

So now we have severity scores for only 52 participants (4 participants were excluded). Now there are going to be subjects in the word learning data set that don’t have severity scores. The default is that only fully matching data are included in the merged output:

WL.sev.s <- merge(WordLearnEx, sev2.s, by.x = "Subject", by.y = "Participant")
summary(WL.sev.s)
##     Subject         TP          Block         Accuracy
##  Min.   :244.0   High:260   Min.   : 1.0   Min.   :0.0000
##  1st Qu.:317.8   Low :260   1st Qu.: 3.0   1st Qu.:0.6670
##  Median :341.5              Median : 5.5   Median :0.8330
##  Mean   :340.7              Mean   : 5.5   Mean   :0.8096
##  3rd Qu.:365.5              3rd Qu.: 8.0   3rd Qu.:1.0000
##  Max.   :388.0              Max.   :10.0   Max.   :1.0000
##     Correct        Incorrect        Severity
##  Min.   :0.000   Min.   :0.000   Min.   :-1.9212
##  1st Qu.:4.000   1st Qu.:0.000   1st Qu.:-0.4642
##  Median :5.000   Median :1.000   Median : 0.1194
##  Mean   :4.858   Mean   :1.142   Mean   : 0.1191
##  3rd Qu.:6.000   3rd Qu.:2.000   3rd Qu.: 0.7858
##  Max.   :6.000   Max.   :6.000   Max.   : 2.1281

You can override that default and force it to include all of the values, in which case the missing data will be (quite reasonably) filled in with NA:

WL.sev.s <- merge(WordLearnEx, sev2.s, by.x = "Subject", by.y = "Participant", all.x = TRUE)
summary(WL.sev.s)
##     Subject         TP          Block         Accuracy        Correct
##  Min.   :244.0   High:280   Min.   : 1.0   Min.   :0.000   Min.   :0.00
##  1st Qu.:319.5   Low :280   1st Qu.: 3.0   1st Qu.:0.667   1st Qu.:4.00
##  Median :346.5              Median : 5.5   Median :0.833   Median :5.00
##  Mean   :344.4              Mean   : 5.5   Mean   :0.805   Mean   :4.83
##  3rd Qu.:370.5              3rd Qu.: 8.0   3rd Qu.:1.000   3rd Qu.:6.00
##  Max.   :395.0              Max.   :10.0   Max.   :1.000   Max.   :6.00
##
##    Incorrect       Severity
##  Min.   :0.00   Min.   :-1.9212
##  1st Qu.:0.00   1st Qu.:-0.4642
##  Median :1.00   Median : 0.1195
##  Mean   :1.17   Mean   : 0.1191
##  3rd Qu.:2.00   3rd Qu.: 0.7858
##  Max.   :6.00   Max.   : 2.1281
##                 NA's   :40

Now all subject numbers are there, but notice that the Severity column has 40 NA values (10 for each of the 4 participants with missing Severity data).

## Reshape

This section is about converting between “wide” and “long” data formats. A typical version of a wide data format is when each row corresponds to a participant and each observation is in a separate column. For a longitudinal study, that would mean having rows for Subject1, Subject2, Subject3, etc. and columns for Time1, Time2, Time3, etc. and the element in cell (N, M) would be the observed value for Subject N at Time M. The same data in a long format would have just three columns: subject ID, time point, and observed value. That is, each row corresponds to an observation and the columns provide information about that observation (Subject N, Time M, etc.). Converting between wide and long formats can be a laborious and error-prone sequence of copy-and-paste operations or require writing a clever script. Thereshape2 package makes these conversions very easy.

### Wide-to-long: melt()

First, I’m going to clean up my workspace, load the reshape2 package, and make a wide data example from the affect data:

rm(list=ls())
library(reshape2)
#pull out just the negative affect data
affect.neg <- subset(affect, select=c(Film, NA1, NA2))
summary(affect.neg)
##       Film            NA1              NA2
##  Min.   :1.000   Min.   : 0.000   Min.   : 0.000
##  1st Qu.:1.250   1st Qu.: 1.000   1st Qu.: 0.000
##  Median :3.000   Median : 2.000   Median : 3.000
##  Mean   :2.515   Mean   : 3.693   Mean   : 4.649
##  3rd Qu.:4.000   3rd Qu.: 6.000   3rd Qu.: 7.000
##  Max.   :4.000   Max.   :28.000   Max.   :30.000

This is a typical wide data set: each row is a subject, the Film column defines what kind of film the subject watched, and the two negative affect observations are the pretest and postest (before and after the film). There is no subject id, but we can make one based on the row number:

affect.neg$Subject <- factor(1:nrow(affect.neg)) Also, let’s label the film types: affect.neg <- within(affect.neg, Film <- factor(Film, levels=c("1","2","3","4"), labels=c("Sad", "Threat", "Neutral", "Happy"))) summary(affect.neg) ## Film NA1 NA2 Subject ## Sad :83 Min. : 0.000 Min. : 0.000 1 : 1 ## Threat :78 1st Qu.: 1.000 1st Qu.: 0.000 2 : 1 ## Neutral:85 Median : 2.000 Median : 3.000 3 : 1 ## Happy :84 Mean : 3.693 Mean : 4.649 4 : 1 ## 3rd Qu.: 6.000 3rd Qu.: 7.000 5 : 1 ## Max. :28.000 Max. :30.000 6 : 1 ## (Other):324 Now we can go about converting this data frame from wide to long format using the melt() function from the reshape2 package. To understand the melt syntax, we need a little data format terminology: • measure variables are the observations, the things that were measured. In our case, that is negative affect and is in the NA1 and NA2 variables. • id variables are those that describe the circumstances of the observation(s). In our case, that would be Subject and Film. When we convert from wide to long, the id variables are left in place and the measure variables are combined into two columns: a new id variable column that contains the names of the original measure variables (NA1 and NA2) and a value column that contains the values of the original measure variables. To use melt, we just need to tell it the id variables and the measure variables, and it does the rest. Here’s an example: affect.melt <- melt(affect.neg, id=c("Subject","Film"), measure=c("NA1", "NA2")) summary(affect.melt) ## Subject Film variable value ## 1 : 2 Sad :166 NA1:330 Min. : 0.000 ## 2 : 2 Threat :156 NA2:330 1st Qu.: 0.000 ## 3 : 2 Neutral:170 Median : 2.000 ## 4 : 2 Happy :168 Mean : 4.171 ## 5 : 2 3rd Qu.: 6.000 ## 6 : 2 Max. :30.000 ## (Other):648 Notice that each subject now has two observations instead of one (pretest and posttest) and the new columns were given default names of “variable” and “value”. We can override those defaults to produce more informative variable names: affect.melt <- melt(affect.neg, id=c("Subject","Film"), measure=c("NA1", "NA2"), variable.name="Test", value.name="Negative_Affect") summary(affect.melt) ## Subject Film Test Negative_Affect ## 1 : 2 Sad :166 NA1:330 Min. : 0.000 ## 2 : 2 Threat :156 NA2:330 1st Qu.: 0.000 ## 3 : 2 Neutral:170 Median : 2.000 ## 4 : 2 Happy :168 Mean : 4.171 ## 5 : 2 3rd Qu.: 6.000 ## 6 : 2 Max. :30.000 ## (Other):648 Also, by default if only id or measure is specified, then all of the unspecified variables will be treated as the other one, so either of these slightly more compact forms would produce the same result affect.melt <- melt(affect.neg, id=c("Subject","Film"), variable.name="Test", value.name="Negative_Affect") affect.melt <- melt(affect.neg, measure=c("NA1", "NA2"), variable.name="Test", value.name="Negative_Affect") We can also improve the names of the levels for the “Test” variable: levels(affect.melt$Test) <- c("Pre", "Post")
summary(affect.melt)
##     Subject         Film       Test     Negative_Affect
##  1      :  2   Sad    :166   Pre :330   Min.   : 0.000
##  2      :  2   Threat :156   Post:330   1st Qu.: 0.000
##  3      :  2   Neutral:170              Median : 2.000
##  4      :  2   Happy  :168              Mean   : 4.171
##  5      :  2                            3rd Qu.: 6.000
##  6      :  2                            Max.   :30.000
##  (Other):648

### Long-to-wide: dcast()

The dcast function performs the opposite transformation: long-to-wide (there’s also an acast function that works the same way but produces array output instead of data frame output). The syntax here is a little different, it uses formula syntax where the right-hand side of the formula are the columns that should remain columns and the left-hand side are the columns whose values should become separate columns. For example, here’s how we could convert our new long-format affect data back to a wide format:

affect.cast <- dcast(affect.melt, Subject + Film ~ Test)
## Using Negative_Affect as value column: use value.var to override.
summary(affect.cast)
##     Subject         Film         Pre              Post
##  1      :  1   Sad    :83   Min.   : 0.000   Min.   : 0.000
##  2      :  1   Threat :78   1st Qu.: 1.000   1st Qu.: 0.000
##  3      :  1   Neutral:85   Median : 2.000   Median : 3.000
##  4      :  1   Happy  :84   Mean   : 3.693   Mean   : 4.649
##  5      :  1                3rd Qu.: 6.000   3rd Qu.: 7.000
##  6      :  1                Max.   :28.000   Max.   :30.000
##  (Other):324

As the warning note indicated, dcast made a guess about which column to use as the value column. It does a fairly good job guessing, but you might want to specify it explicitly:

affect.cast <- dcast(affect.melt, Subject + Film ~ Test,
value.var = "Negative_Affect")

### Long-to-summary: dcast()

In the above example each of the values in the long-format affect.melt data frame had a place in the wide-format affect.cast data frame. But if we use a reduced formula, then this function can (quickly and easily) produce summary tables. For example:

dcast(affect.melt, Film ~ Test,
value.var = "Negative_Affect")
## Aggregation function missing: defaulting to length
##      Film Pre Post
## 2  Threat  78   78
## 3 Neutral  85   85
## 4   Happy  84   84

The default summary function is length, which just gives us a count of observations. We can specify other functions:

dcast(affect.melt, Film ~ Test,
value.var = "Negative_Affect",
fun.aggregate = mean)
##      Film      Pre     Post
## 2  Threat 3.364103 5.371795
## 3 Neutral 3.921176 2.894118
## 4   Happy 3.955952 1.785714
dcast(affect.melt, Film ~ Test,
value.var = "Negative_Affect",
fun.aggregate = sd)
##      Film      Pre     Post
## 2  Threat 4.449164 5.177647
## 3 Neutral 4.655209 4.127387
## 4   Happy 4.395846 2.883570

We can also get row and/or column summaries (“margins”):

dcast(affect.melt, Film ~ Test,
value.var = "Negative_Affect",
fun.aggregate = mean,
margins = "Film")
##      Film      Pre     Post
## 2  Threat 3.364103 5.371795
## 3 Neutral 3.921176 2.894118
## 4   Happy 3.955952 1.785714
## 5   (all) 3.693333 4.649394
dcast(affect.melt, Film ~ Test,
value.var = "Negative_Affect",
fun.aggregate = mean,
margins = "Test")
##      Film      Pre     Post    (all)
## 1     Sad 3.503614 8.666265 6.084940
## 2  Threat 3.364103 5.371795 4.367949
## 3 Neutral 3.921176 2.894118 3.407647
## 4   Happy 3.955952 1.785714 2.870833
#all margins
dcast(affect.melt, Film ~ Test,
value.var = "Negative_Affect",
fun.aggregate = mean,
margins = TRUE)
##      Film      Pre     Post    (all)
## 1     Sad 3.503614 8.666265 6.084940
## 2  Threat 3.364103 5.371795 4.367949
## 3 Neutral 3.921176 2.894118 3.407647
## 4   Happy 3.955952 1.785714 2.870833
## 5   (all) 3.693333 4.649394 4.171364

## Split-Apply-Combine with plyr

Many data analysis and processing tasks can be accomplished by a 3-step process:

1. Split the data set into systematic sub-groups
2. Apply some function or transformation to each sub-group
3. Combine the transformed data to re-form a single data set

For example, computing subject RT means is splitting the data by subject, applying the mean function to the RT data, and re-combining the data. The same thing goes for more complex summary statistics (like MEAN + 3SD) or tranformations like centering RT’s relative to the subject’s mean RT.

This can be accomplished using for loops, but that involves more programming, more opportunities to make errors, and will be slower to run. Base R has split and apply functions that can be used for this, but their syntax can get tricky.

The plyr package offers easy-to-use tools for carrying out split-apply-combine procedures. The key functions have names like this: [][]ply. Where the first slot is filled by a letter indicating the kind of object you’re starting with – d for data frame, l for list, or a for array – and the second slot is filled by a letter indicating the kind of object you want returned. I usually use ddply because I am starting from a data frame and want to end up with a data frame.

You’ll need to give ddply three things: the starting data frame, the list of variables that define the splitting into subsets, and the function that should be applied to each subset. The splits are defined by a list of variables such that the data frame will be split up into subsets based on unique combinations of values of those subsets (this will be more clear with an example in a minute). There’s a fair amount of flexibility in the syntax for these components, which can be good, but can also be somewhat confusing. Which is to say that the following examples will use versions of the syntax that I find clearest and most intuitive, but these are certainly not the only ways to use this function.

As an example, let’s look at some reaction time data from a spoken-to-written word matching study:

density <- read.delim("http://www.danmirman.org/gca/Density.txt")
summary(density)
##     Subject         Group           Item      Density           RT
##  S103   : 115   Older  :1494   bag    :  30   High:1397   Min.   :  440
##  S122   : 109   Younger:1370   blonde :  30   Low :1467   1st Qu.: 1161
##  S105   : 107                  chain  :  30               Median : 1536
##  S102   : 105                  chair  :  30               Mean   : 1730
##  S108   : 102                  child  :  30               3rd Qu.: 2040
##  S110   : 102                  couch  :  30               Max.   :12753
##  (Other):2224                  (Other):2684
head(density)
##   Subject Group   Item Density   RT
## 1    S101 Older   wool     Low 3366
## 2    S101 Older   chop    High 1584
## 3    S101 Older    bag    High 2008
## 4    S101 Older    tap    High 3089
## 5    S101 Older  lodge     Low 1831
## 6    S101 Older throne     Low 2863

Remember how we can use dcast to make summary tables:

dcast(density, Group ~ Density, mean, value.var = "RT")
##     Group     High      Low
## 1   Older 2108.388 1941.361
## 2 Younger 1433.184 1387.575

We can produce a similar summary table using ddply:

library(plyr)
ddply(density, .(Group, Density), summarize, meanRT = mean(RT))
##     Group Density   meanRT
## 1   Older    High 2108.388
## 2   Older     Low 1941.361
## 3 Younger    High 1433.184
## 4 Younger     Low 1387.575

Let’s unpack that syntax. The first argument is just the starting data frame. The second argument is a list of variables – we gave Group and Density, which means that the data will be split up into subsets based on unique combinations of those two variables; that is, the 4 combinations that we get from the 2x2 design. Then we say that the function we want to apply is summarize, which means we’re going to create a new data frame that contains some kind of summary or transformation of data from the original data frame. The “summary” can also have multiple components, which can be handy if you want to build a complex summary table:

ddply(density, .(Group, Density), summarize, meanRT = mean(RT), SD = sd(RT), minRT = min(RT), maxRT = max(RT))
##     Group Density   meanRT        SD minRT maxRT
## 1   Older    High 2108.388 1093.6589   440  9170
## 2   Older     Low 1941.361 1034.2270   596 12753
## 3 Younger    High 1433.184  622.1910   480  6459
## 4 Younger     Low 1387.575  632.1079   457  6381

Summary tables are nice, but this function really shines when it comes to pre-processing data for analysis. For example, the data frame contains trial-level data. If we want to compute subject (or item) means, we just add subject (or item) ID to the list of subset-defining variables:

#by-subject means
density.subj <- ddply(density, .(Subject, Group, Density), summarize, RT = mean(RT))
summary(density.subj)
##     Subject       Group    Density         RT
##  S101   : 2   Older  :32   High:30   Min.   :1027
##  S102   : 2   Younger:28   Low :30   1st Qu.:1383
##  S103   : 2                          Median :1603
##  S104   : 2                          Mean   :1743
##  S105   : 2                          3rd Qu.:1960
##  S106   : 2                          Max.   :3601
##  (Other):48
#by-item means
density.item <- ddply(density, .(Item, Density, Group), summarize, RT = mean(RT))
summary(density.item)
##       Item     Density        Group           RT
##  bag    :  2   High:120   Older  :120   Min.   : 942.2
##  ball   :  2   Low :120   Younger:120   1st Qu.:1366.0
##  band   :  2                            Median :1686.4
##  bean   :  2                            Mean   :1767.4
##  bear   :  2                            3rd Qu.:2040.0
##  bell   :  2                            Max.   :3866.8
##  (Other):228

These are the kind of data we’d use for by-subject and by-item ANOVAs. We can also get more complex, like computing by-subject RT means, standard deviations, and means excluding outliers (RTs that are more than 3SD above the subject’s condition mean):

density.subj <- ddply(density, .(Subject, Group, Density), summarize,
meanRT = mean(RT), sdRT = sd(RT),
meanRTexcl = mean(RT[RT < (meanRT + 3*sdRT)]))
summary(density.subj)
##     Subject       Group    Density       meanRT          sdRT
##  S101   : 2   Older  :32   High:30   Min.   :1027   Min.   : 354.3
##  S102   : 2   Younger:28   Low :30   1st Qu.:1383   1st Qu.: 491.5
##  S103   : 2                          Median :1603   Median : 594.7
##  S104   : 2                          Mean   :1743   Mean   : 696.3
##  S105   : 2                          3rd Qu.:1960   3rd Qu.: 735.8
##  S106   : 2                          Max.   :3601   Max.   :2466.1
##  (Other):48
##    meanRTexcl
##  Min.   :1001
##  1st Qu.:1362
##  Median :1545
##  Mean   :1700
##  3rd Qu.:1940
##  Max.   :3601
## 

For completeness, we might also want to keep track of the number of trials so we’ll know how many trials were in the original means and how many were excluded as outliers:

density.subj <- ddply(density, .(Subject, Group, Density), summarize,
N = length(RT), meanRT = mean(RT), sdRT = sd(RT),
Nexcl = sum(RT < (meanRT + 3*sdRT)),
meanRTexcl = mean(RT[RT < (meanRT + 3*sdRT)]))
head(density.subj)
##   Subject   Group Density  N   meanRT     sdRT Nexcl meanRTexcl
## 1    S101   Older    High 45 2482.044 920.7922    44   2407.886
## 2    S101   Older     Low 48 2122.312 694.8509    47   2058.553
## 3    S102 Younger    High 50 1318.060 485.7474    49   1286.449
## 4    S102 Younger     Low 55 1306.836 554.0681    53   1233.358
## 5    S103 Younger    High 60 1712.467 736.6167    58   1629.414
## 6    S103 Younger     Low 55 1738.200 955.1870    54   1652.222

All of these examples have been based on computing different kinds of summaries, but what if we want to transform the existing data without compressing it into a summary? We’ll need to use a different function for that, instead of summarize we’ll use transform or, better yet, mutate. For example, let’s say we want to transform (or mutate) the raw RT into subject-level z-scores:

density.z <- ddply(density, .(Subject), mutate, zRT = scale(RT))
summary(density.z)
##     Subject         Group           Item      Density           RT
##  S103   : 115   Older  :1494   bag    :  30   High:1397   Min.   :  440
##  S122   : 109   Younger:1370   blonde :  30   Low :1467   1st Qu.: 1161
##  S105   : 107                  chain  :  30               Median : 1536
##  S102   : 105                  chair  :  30               Mean   : 1730
##  S108   : 102                  child  :  30               3rd Qu.: 2040
##  S110   : 102                  couch  :  30               Max.   :12753
##  (Other):2224                  (Other):2684
##        zRT.V1
##  Min.   :-2.050589
##  1st Qu.:-0.672325
##  Median :-0.168882
##  Mean   : 0.000000
##  3rd Qu.: 0.450302
##  Max.   : 5.851678
## 

Notice that the number of observations in the data frame stayed exactly the same and that the new data frame contains all of the columns from the original, plus the new “mutated” variable. And that we did this using one line of code that was executed in a tiny fraction of a second.

For a somewhat different plyr tutorial, check out Sean Anderson’s plyr: Split-Apply-Combine for Mortals.