These are notes for the first session of the MRRI/Drexel R Workgroup. In this session we’ll cover some intermediate-level R topics that I think help turn R beginners into R experts. We’ll be talking about my Top 5 Most Useful R Functions:
subset()
merge()
melt()
dcast()
ddply()
That Top 5 doesn’t include functions for basic mechanics (reading data, computing means, etc.), statistical methods (multilevel regression), or ggplot, which I think deserves a special jury prize for being the best graphing function ever invented and will be the topic of discussion at our next meeting.
For one reason or another, many of us have data in Excel. The easiest way to get Excel data into an R-friendly format is to use “Save As” and save your Excel worksheet as “Text (Tab Delimited) (*.txt)”. Then you can read it into R using read.delim("filename.txt"). You’ll need to use the correct path for the filename, and you can even read data directly from the internet:
WordLearnEx <- read.delim("http://www.danmirman.org/gca/WordLearnEx.txt")
Ideally, you’d skip Excel entirely and read raw data files. If your data collection software does not produce standard tab-delimited files, you’ll need to use the more general function read.table(). The basic syntax is the same, but you’ll need to specify some details about the structure of the data file. The most important ones are:
header = TRUE if the first row of the data file contains the variable names.
col.names = c("V1", "V2", ...) to specify the variable names (V1, V2, etc. are the default names; you could use something more informative).
sep = "\t" for tab-delimited files and sep = "," for comma-separated files, but you can specify anything there. By default read.table will use all white space as separators, which means that spaces, tabs, newlines, etc. will be treated as separators.
If the file starts with some preliminary header lines, you can skip them using the skip = n argument, where n is an integer indicating the number of rows to skip.
By default, missing data should be coded as NA, but you can override this default using na.strings = 999 to specify 999 (or whatever) as the code for missing data. This is very useful and important because otherwise those 999 values would become part of the computations for means, etc., whereas NA can be easily excluded from computations and can be used to catch when something went wrong.
strip.white = TRUE to remove sneaky whitespace from the beginnings or ends of character strings.
Here is how you’d read a tab-delimited raw data file with a 4-line preliminary header, no variable names, and “nan” used to indicate missing data:
#not run
raw_data <- read.table("Raw_Data", sep = "\t", skip = 4,
col.names = c("ID", "Trial", "Condition", "Accuracy", "RT"),
na.strings = "nan")
The customization arguments can also be used with read.delim() if you need to customize anything within the context of a standard tab-delimited file.
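For example, here’s a quick sketch (the file name is hypothetical) of reading a tab-delimited file that codes missing data as 999 and may contain stray whitespace:
#not run
my_data <- read.delim("my_data.txt", na.strings = "999", strip.white = TRUE)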
I usually separate the initial reading and organizing of raw data from the main analysis. So after those initial pre-processing steps, I save the data set into an R data file, which I can later load for analysis without having to re-do the pre-processing steps. R data files can hold multiple R objects, so I can save the raw (read but unprocessed) and processed data into a single file (and any intermediate data as well), which is nice if I might need to tweak some of the later pre-processing steps.
As an example, I can save the word learning data and the affect data from the psych package into a single data file:
library(psych)
save(WordLearnEx, affect, file="Part1_Data.RData")
Now I can load those data:
load("Part1_Data.RData")
The subset() function does just what you think it would do: allow you to pull out a subset of a data frame. There are many different ways to accomplish this goal using fancy indexing, but the subset() function is a one-stop shop that does pretty much everything with an easy-to-use syntax. In general, the subset() function takes a data frame and a set of conditions, then returns the subset of the original data where those conditions are met.
I think the most common kind of subset is a subset of rows based on particular values in key columns. In this case, the conditions are specified by logical operations. For example, to get just the “High TP” condition data from the WordLearnEx data:
WL.hi <- subset(WordLearnEx, TP == "High")
summary(WL.hi)
## Subject TP Block Accuracy
## Min. :303.0 High:280 Min. : 1.0 Min. :0.0000
## 1st Qu.:325.5 Low : 0 1st Qu.: 3.0 1st Qu.:0.7915
## Median :349.5 Median : 5.5 Median :0.8330
## Mean :350.0 Mean : 5.5 Mean :0.8315
## 3rd Qu.:379.2 3rd Qu.: 8.0 3rd Qu.:1.0000
## Max. :395.0 Max. :10.0 Max. :1.0000
## Correct Incorrect
## Min. :0.000 Min. :0.000
## 1st Qu.:4.750 1st Qu.:0.000
## Median :5.000 Median :1.000
## Mean :4.989 Mean :1.011
## 3rd Qu.:6.000 3rd Qu.:1.250
## Max. :6.000 Max. :6.000
Or to exclude data from a specific participant:
WL.excl <- subset(WordLearnEx, Subject != 378)
summary(WL.excl)
## Subject TP Block Accuracy
## Min. :244.0 High:270 Min. : 1.0 Min. :0.0000
## 1st Qu.:318.0 Low :280 1st Qu.: 3.0 1st Qu.:0.6670
## Median :345.0 Median : 5.5 Median :0.8330
## Mean :343.8 Mean : 5.5 Mean :0.8096
## 3rd Qu.:370.0 3rd Qu.: 8.0 3rd Qu.:1.0000
## Max. :395.0 Max. :10.0 Max. :1.0000
## Correct Incorrect
## Min. :0.000 Min. :0.000
## 1st Qu.:4.000 1st Qu.:0.000
## Median :5.000 Median :1.000
## Mean :4.858 Mean :1.142
## 3rd Qu.:6.000 3rd Qu.:2.000
## Max. :6.000 Max. :6.000
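To exclude several participants at once, the %in% operator works nicely inside subset(); here’s a quick sketch with made-up subject numbers:
#exclude subjects 378 and 380 (hypothetical exclusion list)
WL.excl2 <- subset(WordLearnEx, !(Subject %in% c(378, 380)))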
You can also use multiple joint constraints. For example, excluding all floor and ceiling accuracy blocks:
WL.mid <- subset(WordLearnEx, Accuracy > 0 & Accuracy < 1)
summary(WL.mid)
## Subject TP Block Accuracy
## Min. :244 High:149 Min. : 1.000 Min. :0.1670
## 1st Qu.:323 Low :180 1st Qu.: 2.000 1st Qu.:0.5000
## Median :351 Median : 4.000 Median :0.6670
## Mean :347 Mean : 4.681 Mean :0.6742
## 3rd Qu.:376 3rd Qu.: 7.000 3rd Qu.:0.8330
## Max. :395 Max. :10.000 Max. :0.8330
## Correct Incorrect
## Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:1.000
## Median :4.000 Median :2.000
## Mean :4.046 Mean :1.954
## 3rd Qu.:5.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000
As far as I know, there is no limit on the complexity of those logical conditions, but be careful when building very complicated conditions: they quickly become hard to read and easy to get wrong.
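For example, here’s a sketch (not from the original notes) combining & and | with parentheses to control how the conditions group:
#keep late blocks, with a different (arbitrary) cutoff for each condition
WL.late <- subset(WordLearnEx, (TP == "High" & Block > 5) |
                               (TP == "Low" & Block > 8))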
You can also define a subset of columns by using the select argument. For example, to pull out a handful of key columns from the affect data:
affect.neg <- subset(affect, select=c(Study, Film, NA1, NA2))
summary(affect.neg)
## Study Film NA1 NA2
## maps:160 Min. :1.000 Min. : 0.000 Min. : 0.000
## flat:170 1st Qu.:1.250 1st Qu.: 1.000 1st Qu.: 0.000
## Median :3.000 Median : 2.000 Median : 3.000
## Mean :2.515 Mean : 3.693 Mean : 4.649
## 3rd Qu.:4.000 3rd Qu.: 6.000 3rd Qu.: 7.000
## Max. :4.000 Max. :28.000 Max. :30.000
Or to drop an unnecessary column:
WL.nacc <- subset(WordLearnEx, select = -Accuracy)
summary(WL.nacc)
## Subject TP Block Correct Incorrect
## Min. :244.0 High:280 Min. : 1.0 Min. :0.00 Min. :0.00
## 1st Qu.:319.5 Low :280 1st Qu.: 3.0 1st Qu.:4.00 1st Qu.:0.00
## Median :346.5 Median : 5.5 Median :5.00 Median :1.00
## Mean :344.4 Mean : 5.5 Mean :4.83 Mean :1.17
## 3rd Qu.:370.5 3rd Qu.: 8.0 3rd Qu.:6.00 3rd Qu.:2.00
## Max. :395.0 Max. :10.0 Max. :6.00 Max. :6.00
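The select argument will also drop several columns at once if you give it a negated vector; for example (a sketch, keeping Accuracy but dropping the raw counts):
WL.acc <- subset(WordLearnEx, select = -c(Correct, Incorrect))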
And you can combine these different kinds of conditions:
affect.neg.maps <- subset(affect, Study == "maps", select=c(Study, Film, NA1, NA2))
summary(affect.neg.maps)
## Study Film NA1 NA2
## maps:160 Min. :1.000 Min. : 0.000 Min. : 0.00
## flat: 0 1st Qu.:1.000 1st Qu.: 0.000 1st Qu.: 0.00
## Median :3.000 Median : 2.000 Median : 3.00
## Mean :2.481 Mean : 3.506 Mean : 4.90
## 3rd Qu.:3.000 3rd Qu.: 5.000 3rd Qu.: 7.25
## Max. :4.000 Max. :28.000 Max. :30.00
Notice that the Study factor in the subset still contains both levels (maps and flat) even though one of them has no observations. This is mostly innocuous, but it can occasionally cause confusion and/or errors. You can clean this up using the droplevels() function, which will drop any factor levels that have no observations.
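A quick sketch of that cleanup (droplevels() is base R and will happily take a whole data frame):
affect.neg.maps <- droplevels(affect.neg.maps)
summary(affect.neg.maps$Study) #the unused flat level is now gone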
The merge() function is one of the smartest functions in R and provides a simple one-line way to accomplish something that would be fairly complex otherwise. In general, merge() will merge two data frames, using columns with matching variable names to align matching values. One scenario that comes up a lot in my research is that a group of participants does a study with multiple trials, then I want to combine these multi-trial data with some individual-subject data like aphasia severity or performance on some other test.
I’ll use the Word Learning data as an example and make up a severity score:
sev <- data.frame(Subject = unique(WordLearnEx$Subject),
Severity = rnorm(length(unique(WordLearnEx$Subject))))
summary(sev)
## Subject Severity
## Min. :244.0 Min. :-2.21650
## 1st Qu.:319.5 1st Qu.:-0.36478
## Median :346.5 Median : 0.18600
## Mean :344.4 Mean : 0.07171
## 3rd Qu.:370.5 3rd Qu.: 0.74850
## Max. :395.0 Max. : 2.54057
dim(sev)
## [1] 56 2
dim(WordLearnEx)
## [1] 560 6
Notice that the sev data frame has only 56 rows (one for each subject) but the WordLearnEx data frame has 560 rows, because there are 10 blocks of word learning data per subject. So I can’t just stick the Severity column onto the WordLearnEx data; I need to make sure each severity score lines up with its corresponding subject. This is exactly what merge() does:
WL.sev <- merge(WordLearnEx, sev)
summary(WL.sev)
## Subject TP Block Accuracy Correct
## Min. :244.0 High:280 Min. : 1.0 Min. :0.000 Min. :0.00
## 1st Qu.:319.5 Low :280 1st Qu.: 3.0 1st Qu.:0.667 1st Qu.:4.00
## Median :346.5 Median : 5.5 Median :0.833 Median :5.00
## Mean :344.4 Mean : 5.5 Mean :0.805 Mean :4.83
## 3rd Qu.:370.5 3rd Qu.: 8.0 3rd Qu.:1.000 3rd Qu.:6.00
## Max. :395.0 Max. :10.0 Max. :1.000 Max. :6.00
## Incorrect Severity
## Min. :0.00 Min. :-2.21650
## 1st Qu.:0.00 1st Qu.:-0.36478
## Median :1.00 Median : 0.18600
## Mean :1.17 Mean : 0.07171
## 3rd Qu.:2.00 3rd Qu.: 0.74850
## Max. :6.00 Max. : 2.54057
The new Severity column got added, and we can check that it worked correctly by looking at the first (and last) few rows in each data frame:
#first few rows
head(sev)
## Subject Severity
## 1 244 0.7364737
## 2 253 -0.7187716
## 3 302 1.0764167
## 4 303 -0.1527374
## 5 305 1.0161045
## 6 306 0.1524591
head(WL.sev, 20)
## Subject TP Block Accuracy Correct Incorrect Severity
## 1 244 Low 1 1.000 6 0 0.7364737
## 2 244 Low 2 0.333 2 4 0.7364737
## 3 244 Low 3 0.667 4 2 0.7364737
## 4 244 Low 4 0.667 4 2 0.7364737
## 5 244 Low 5 0.667 4 2 0.7364737
## 6 244 Low 6 0.833 5 1 0.7364737
## 7 244 Low 7 0.500 3 3 0.7364737
## 8 244 Low 8 0.833 5 1 0.7364737
## 9 244 Low 9 1.000 6 0 0.7364737
## 10 244 Low 10 0.667 4 2 0.7364737
## 11 253 Low 1 0.500 3 3 -0.7187716
## 12 253 Low 2 1.000 6 0 -0.7187716
## 13 253 Low 3 0.500 3 3 -0.7187716
## 14 253 Low 4 1.000 6 0 -0.7187716
## 15 253 Low 5 0.500 3 3 -0.7187716
## 16 253 Low 6 0.500 3 3 -0.7187716
## 17 253 Low 7 0.667 4 2 -0.7187716
## 18 253 Low 8 0.500 3 3 -0.7187716
## 19 253 Low 9 0.167 1 5 -0.7187716
## 20 253 Low 10 0.833 5 1 -0.7187716
#last few rows
tail(sev)
## Subject Severity
## 51 387 -0.6791419
## 52 388 -2.1101675
## 53 390 -0.6550895
## 54 391 0.4821140
## 55 393 0.2343186
## 56 395 0.2812411
tail(WL.sev, 20)
## Subject TP Block Accuracy Correct Incorrect Severity
## 541 393 High 1 0.167 1 5 0.2343186
## 542 393 High 2 0.333 2 4 0.2343186
## 543 393 High 3 0.333 2 4 0.2343186
## 544 393 High 4 0.167 1 5 0.2343186
## 545 393 High 5 0.500 3 3 0.2343186
## 546 393 High 6 0.667 4 2 0.2343186
## 547 393 High 7 0.833 5 1 0.2343186
## 548 393 High 8 1.000 6 0 0.2343186
## 549 393 High 9 1.000 6 0 0.2343186
## 550 393 High 10 0.667 4 2 0.2343186
## 551 395 High 1 0.167 1 5 0.2812411
## 552 395 High 2 0.000 0 6 0.2812411
## 553 395 High 3 0.833 5 1 0.2812411
## 554 395 High 4 0.833 5 1 0.2812411
## 555 395 High 5 0.833 5 1 0.2812411
## 556 395 High 6 1.000 6 0 0.2812411
## 557 395 High 7 0.500 3 3 0.2812411
## 558 395 High 8 1.000 6 0 0.2812411
## 559 395 High 9 0.667 4 2 0.2812411
## 560 395 High 10 1.000 6 0 0.2812411
This worked very smoothly because we had a matching Subject column in both data frames and every subject appeared in both. Things are not always this easy, so let’s make up a trickier case:
sev2 <- data.frame(Participant = unique(WordLearnEx$Subject),
Severity = rnorm(length(unique(WordLearnEx$Subject))))
So now I’ve created a similar severity data set, except the subject ID is called Participant. It is certainly possible to rename that variable so it matches, but in a real situation I’d have read those data from an external file and it would be nice to remain consistent with that file. Instead, I can explicitly tell merge() which columns are supposed to match across the two data frames:
WL.sev2 <- merge(WordLearnEx, sev2, by.x = "Subject", by.y = "Participant")
summary(WL.sev2)
## Subject TP Block Accuracy Correct
## Min. :244.0 High:280 Min. : 1.0 Min. :0.000 Min. :0.00
## 1st Qu.:319.5 Low :280 1st Qu.: 3.0 1st Qu.:0.667 1st Qu.:4.00
## Median :346.5 Median : 5.5 Median :0.833 Median :5.00
## Mean :344.4 Mean : 5.5 Mean :0.805 Mean :4.83
## 3rd Qu.:370.5 3rd Qu.: 8.0 3rd Qu.:1.000 3rd Qu.:6.00
## Max. :395.0 Max. :10.0 Max. :1.000 Max. :6.00
## Incorrect Severity
## Min. :0.00 Min. :-1.9212
## 1st Qu.:0.00 1st Qu.:-0.4752
## Median :1.00 Median : 0.1147
## Mean :1.17 Mean : 0.0787
## 3rd Qu.:2.00 3rd Qu.: 0.7497
## Max. :6.00 Max. : 2.1281
Now imagine the more recently recruited participants haven’t had their severity assessed yet, so those scores are missing. We can exclude them from our made-up data using subset():
sev2.s <- subset(sev2, Participant < 390)
dim(sev2.s)
## [1] 52 2
So now we have severity scores for only 52 participants (4 participants were excluded). Now there are going to be subjects in the word learning data set that don’t have severity scores. The default is that only fully matching data are included in the merged output:
WL.sev.s <- merge(WordLearnEx, sev2.s, by.x = "Subject", by.y = "Participant")
summary(WL.sev.s)
## Subject TP Block Accuracy
## Min. :244.0 High:260 Min. : 1.0 Min. :0.0000
## 1st Qu.:317.8 Low :260 1st Qu.: 3.0 1st Qu.:0.6670
## Median :341.5 Median : 5.5 Median :0.8330
## Mean :340.7 Mean : 5.5 Mean :0.8096
## 3rd Qu.:365.5 3rd Qu.: 8.0 3rd Qu.:1.0000
## Max. :388.0 Max. :10.0 Max. :1.0000
## Correct Incorrect Severity
## Min. :0.000 Min. :0.000 Min. :-1.9212
## 1st Qu.:4.000 1st Qu.:0.000 1st Qu.:-0.4642
## Median :5.000 Median :1.000 Median : 0.1194
## Mean :4.858 Mean :1.142 Mean : 0.1191
## 3rd Qu.:6.000 3rd Qu.:2.000 3rd Qu.: 0.7858
## Max. :6.000 Max. :6.000 Max. : 2.1281
You can override that default and force it to include all of the values, in which case the missing data will be (quite reasonably) filled in with NA:
WL.sev.s <- merge(WordLearnEx, sev2.s, by.x = "Subject", by.y = "Participant", all.x = TRUE)
summary(WL.sev.s)
## Subject TP Block Accuracy Correct
## Min. :244.0 High:280 Min. : 1.0 Min. :0.000 Min. :0.00
## 1st Qu.:319.5 Low :280 1st Qu.: 3.0 1st Qu.:0.667 1st Qu.:4.00
## Median :346.5 Median : 5.5 Median :0.833 Median :5.00
## Mean :344.4 Mean : 5.5 Mean :0.805 Mean :4.83
## 3rd Qu.:370.5 3rd Qu.: 8.0 3rd Qu.:1.000 3rd Qu.:6.00
## Max. :395.0 Max. :10.0 Max. :1.000 Max. :6.00
##
## Incorrect Severity
## Min. :0.00 Min. :-1.9212
## 1st Qu.:0.00 1st Qu.:-0.4642
## Median :1.00 Median : 0.1195
## Mean :1.17 Mean : 0.1191
## 3rd Qu.:2.00 3rd Qu.: 0.7858
## Max. :6.00 Max. : 2.1281
## NA's :40
Now all subject numbers are there, but notice that the Severity column has 40 NA values (10 for each of the 4 participants with missing Severity data).
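For completeness: all.x = TRUE keeps every row of the first (left) data frame, all.y = TRUE does the same for the second, and all = TRUE keeps non-matching rows from both sides. A sketch of the fully outer version:
#keep unmatched rows from both data frames, filling gaps with NA
WL.sev.all <- merge(WordLearnEx, sev2.s, by.x = "Subject", by.y = "Participant",
                    all = TRUE)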
This section is about converting between “wide” and “long” data formats. A typical version of a wide data format is when each row corresponds to a participant and each observation is in a separate column. For a longitudinal study, that would mean having rows for Subject1, Subject2, Subject3, etc. and columns for Time1, Time2, Time3, etc., and the element in cell (N, M) would be the observed value for Subject N at Time M. The same data in a long format would have just three columns: subject ID, time point, and observed value. That is, each row corresponds to an observation and the columns provide information about that observation (Subject N, Time M, etc.). Converting between wide and long formats can be a laborious and error-prone sequence of copy-and-paste operations, or require writing a clever script. The reshape2 package makes these conversions very easy.
First, I’m going to clean up my workspace, load the reshape2 package, and make a wide data example from the affect data:
rm(list=ls())
library(reshape2)
load("Part1_Data.RData")
#pull out just the negative affect data
affect.neg <- subset(affect, select=c(Film, NA1, NA2))
summary(affect.neg)
## Film NA1 NA2
## Min. :1.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:1.250 1st Qu.: 1.000 1st Qu.: 0.000
## Median :3.000 Median : 2.000 Median : 3.000
## Mean :2.515 Mean : 3.693 Mean : 4.649
## 3rd Qu.:4.000 3rd Qu.: 6.000 3rd Qu.: 7.000
## Max. :4.000 Max. :28.000 Max. :30.000
This is a typical wide data set: each row is a subject, the Film column defines what kind of film the subject watched, and the two negative affect observations are the pretest and posttest (before and after the film). There is no subject ID, but we can make one based on the row number:
affect.neg$Subject <- factor(1:nrow(affect.neg))
Also, let’s label the film types:
affect.neg <- within(affect.neg,
Film <- factor(Film, levels=c("1","2","3","4"),
labels=c("Sad", "Threat", "Neutral", "Happy")))
summary(affect.neg)
## Film NA1 NA2 Subject
## Sad :83 Min. : 0.000 Min. : 0.000 1 : 1
## Threat :78 1st Qu.: 1.000 1st Qu.: 0.000 2 : 1
## Neutral:85 Median : 2.000 Median : 3.000 3 : 1
## Happy :84 Mean : 3.693 Mean : 4.649 4 : 1
## 3rd Qu.: 6.000 3rd Qu.: 7.000 5 : 1
## Max. :28.000 Max. :30.000 6 : 1
## (Other):324
Now we can go about converting this data frame from wide to long format using the melt() function from the reshape2 package. To understand the melt syntax, we need a little data format terminology:
Measure variables contain the actual measurements or observations; in our example, the NA1 and NA2 variables.
Id variables identify and describe each observation; in our example, Subject and Film.
When we convert from wide to long, the id variables are left in place and the measure variables are combined into two columns: a new id variable column that contains the names of the original measure variables (NA1 and NA2) and a value column that contains the values of the original measure variables. To use melt, we just need to tell it the id variables and the measure variables, and it does the rest. Here’s an example:
affect.melt <- melt(affect.neg,
id=c("Subject","Film"),
measure=c("NA1", "NA2"))
summary(affect.melt)
## Subject Film variable value
## 1 : 2 Sad :166 NA1:330 Min. : 0.000
## 2 : 2 Threat :156 NA2:330 1st Qu.: 0.000
## 3 : 2 Neutral:170 Median : 2.000
## 4 : 2 Happy :168 Mean : 4.171
## 5 : 2 3rd Qu.: 6.000
## 6 : 2 Max. :30.000
## (Other):648
Notice that each subject now has two rows instead of one (one for the pretest, one for the posttest), and the new columns were given the default names “variable” and “value”. We can override those defaults to produce more informative variable names:
affect.melt <- melt(affect.neg,
id=c("Subject","Film"),
measure=c("NA1", "NA2"),
variable.name="Test",
value.name="Negative_Affect")
summary(affect.melt)
## Subject Film Test Negative_Affect
## 1 : 2 Sad :166 NA1:330 Min. : 0.000
## 2 : 2 Threat :156 NA2:330 1st Qu.: 0.000
## 3 : 2 Neutral:170 Median : 2.000
## 4 : 2 Happy :168 Mean : 4.171
## 5 : 2 3rd Qu.: 6.000
## 6 : 2 Max. :30.000
## (Other):648
Also, by default, if only id or measure is specified, then all of the unspecified variables will be treated as the other kind, so either of these slightly more compact forms would produce the same result:
affect.melt <- melt(affect.neg,
id=c("Subject","Film"),
variable.name="Test",
value.name="Negative_Affect")
affect.melt <- melt(affect.neg,
measure=c("NA1", "NA2"),
variable.name="Test",
value.name="Negative_Affect")
We can also improve the names of the levels for the “Test” variable:
levels(affect.melt$Test) <- c("Pre", "Post")
summary(affect.melt)
## Subject Film Test Negative_Affect
## 1 : 2 Sad :166 Pre :330 Min. : 0.000
## 2 : 2 Threat :156 Post:330 1st Qu.: 0.000
## 3 : 2 Neutral:170 Median : 2.000
## 4 : 2 Happy :168 Mean : 4.171
## 5 : 2 3rd Qu.: 6.000
## 6 : 2 Max. :30.000
## (Other):648
The dcast() function performs the opposite transformation: long-to-wide (there’s also an acast() function that works the same way but produces array output instead of data frame output). The syntax here is a little different: it uses a formula where the left-hand side lists the columns that should remain columns (each unique combination of their values gets a row) and the right-hand side lists the columns whose values should become separate columns. For example, here’s how we could convert our new long-format affect data back to a wide format:
affect.cast <- dcast(affect.melt, Subject + Film ~ Test)
## Using Negative_Affect as value column: use value.var to override.
summary(affect.cast)
## Subject Film Pre Post
## 1 : 1 Sad :83 Min. : 0.000 Min. : 0.000
## 2 : 1 Threat :78 1st Qu.: 1.000 1st Qu.: 0.000
## 3 : 1 Neutral:85 Median : 2.000 Median : 3.000
## 4 : 1 Happy :84 Mean : 3.693 Mean : 4.649
## 5 : 1 3rd Qu.: 6.000 3rd Qu.: 7.000
## 6 : 1 Max. :28.000 Max. :30.000
## (Other):324
As the message indicated, dcast made a guess about which column to use as the value column. It does a fairly good job guessing, but you might want to specify it explicitly:
affect.cast <- dcast(affect.melt, Subject + Film ~ Test,
value.var = "Negative_Affect")
In the above example, each of the values in the long-format affect.melt data frame had a place in the wide-format affect.cast data frame. But if we use a reduced formula, then this function can (quickly and easily) produce summary tables. For example:
dcast(affect.melt, Film ~ Test,
value.var = "Negative_Affect")
## Aggregation function missing: defaulting to length
## Film Pre Post
## 1 Sad 83 83
## 2 Threat 78 78
## 3 Neutral 85 85
## 4 Happy 84 84
The default aggregation function is length, which just gives us a count of observations. We can specify other functions:
dcast(affect.melt, Film ~ Test,
value.var = "Negative_Affect",
fun.aggregate = mean)
## Film Pre Post
## 1 Sad 3.503614 8.666265
## 2 Threat 3.364103 5.371795
## 3 Neutral 3.921176 2.894118
## 4 Happy 3.955952 1.785714
dcast(affect.melt, Film ~ Test,
value.var = "Negative_Affect",
fun.aggregate = sd)
## Film Pre Post
## 1 Sad 3.726289 5.352735
## 2 Threat 4.449164 5.177647
## 3 Neutral 4.655209 4.127387
## 4 Happy 4.395846 2.883570
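Any function that collapses a vector of values down to a single value can go in fun.aggregate; for example, here’s a sketch (my addition, not in the original notes) using an anonymous function to compute a 10% trimmed mean:
dcast(affect.melt, Film ~ Test,
      value.var = "Negative_Affect",
      fun.aggregate = function(x) mean(x, trim = 0.1))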
We can also get row and/or column summaries (“margins”):
dcast(affect.melt, Film ~ Test,
value.var = "Negative_Affect",
fun.aggregate = mean,
margins = "Film")
## Film Pre Post
## 1 Sad 3.503614 8.666265
## 2 Threat 3.364103 5.371795
## 3 Neutral 3.921176 2.894118
## 4 Happy 3.955952 1.785714
## 5 (all) 3.693333 4.649394
dcast(affect.melt, Film ~ Test,
value.var = "Negative_Affect",
fun.aggregate = mean,
margins = "Test")
## Film Pre Post (all)
## 1 Sad 3.503614 8.666265 6.084940
## 2 Threat 3.364103 5.371795 4.367949
## 3 Neutral 3.921176 2.894118 3.407647
## 4 Happy 3.955952 1.785714 2.870833
#all margins
dcast(affect.melt, Film ~ Test,
value.var = "Negative_Affect",
fun.aggregate = mean,
margins = TRUE)
## Film Pre Post (all)
## 1 Sad 3.503614 8.666265 6.084940
## 2 Threat 3.364103 5.371795 4.367949
## 3 Neutral 3.921176 2.894118 3.407647
## 4 Happy 3.955952 1.785714 2.870833
## 5 (all) 3.693333 4.649394 4.171364
Many data analysis and processing tasks can be accomplished by a 3-step process:
1. Split the data into subsets
2. Apply some function to each subset
3. Combine the results back into a single data object
For example, computing subject RT means is splitting the data by subject, applying the mean function to the RT data, and re-combining the data. The same goes for more complex summary statistics (like mean + 3SD) or transformations like centering RTs relative to the subject’s mean RT.
This can be accomplished using for loops, but that involves more programming, more opportunities to make errors, and will be slower to run. Base R has split and apply functions that can be used for this, but their syntax can get tricky.
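For a taste of the base R approach, here’s a one-line sketch (my addition, not from the notes) using tapply to get mean accuracy by TP condition in the word learning data; it works, but the result is a named array rather than a tidy data frame:
with(WordLearnEx, tapply(Accuracy, TP, mean))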
The plyr package offers easy-to-use tools for carrying out split-apply-combine procedures. The key functions have names of the form [][]ply, where the first slot is filled by a letter indicating the kind of object you’re starting with (d for data frame, l for list, or a for array) and the second slot is filled by a letter indicating the kind of object you want returned. I usually use ddply because I am starting from a data frame and want to end up with a data frame.
You’ll need to give ddply three things: the starting data frame, the list of variables that define the splitting into subsets, and the function that should be applied to each subset. The data frame will be split up into subsets based on unique combinations of values of the splitting variables (this will become clearer with an example in a minute). There’s a fair amount of flexibility in the syntax for these components, which can be good but can also be somewhat confusing. Which is to say that the following examples will use versions of the syntax that I find clearest and most intuitive, but these are certainly not the only ways to use this function.
As an example, let’s look at some reaction time data from a spoken-to-written word matching study:
density <- read.delim("http://www.danmirman.org/gca/Density.txt")
summary(density)
## Subject Group Item Density RT
## S103 : 115 Older :1494 bag : 30 High:1397 Min. : 440
## S122 : 109 Younger:1370 blonde : 30 Low :1467 1st Qu.: 1161
## S105 : 107 chain : 30 Median : 1536
## S102 : 105 chair : 30 Mean : 1730
## S108 : 102 child : 30 3rd Qu.: 2040
## S110 : 102 couch : 30 Max. :12753
## (Other):2224 (Other):2684
head(density)
## Subject Group Item Density RT
## 1 S101 Older wool Low 3366
## 2 S101 Older chop High 1584
## 3 S101 Older bag High 2008
## 4 S101 Older tap High 3089
## 5 S101 Older lodge Low 1831
## 6 S101 Older throne Low 2863
Remember how we can use dcast to make summary tables:
dcast(density, Group ~ Density, mean, value.var = "RT")
## Group High Low
## 1 Older 2108.388 1941.361
## 2 Younger 1433.184 1387.575
We can produce a similar summary table using ddply:
library(plyr)
ddply(density, .(Group, Density), summarize, meanRT = mean(RT))
## Group Density meanRT
## 1 Older High 2108.388
## 2 Older Low 1941.361
## 3 Younger High 1433.184
## 4 Younger Low 1387.575
Let’s unpack that syntax. The first argument is just the starting data frame. The second argument is the list of splitting variables: we gave Group and Density, which means that the data will be split up into subsets based on unique combinations of those two variables, that is, the 4 combinations we get from the 2x2 design. Then we say that the function we want to apply is summarize, which means we’re going to create a new data frame that contains some kind of summary or transformation of data from the original data frame. The “summary” can also have multiple components, which can be handy if you want to build a complex summary table:
ddply(density, .(Group, Density), summarize,
      meanRT = mean(RT), SD = sd(RT),
      minRT = min(RT), maxRT = max(RT))
## Group Density meanRT SD minRT maxRT
## 1 Older High 2108.388 1093.6589 440 9170
## 2 Older Low 1941.361 1034.2270 596 12753
## 3 Younger High 1433.184 622.1910 480 6459
## 4 Younger Low 1387.575 632.1079 457 6381
Summary tables are nice, but this function really shines when it comes to pre-processing data for analysis. For example, the data frame contains trial-level data. If we want to compute subject (or item) means, we just add subject (or item) ID to the list of subset-defining variables:
#by-subject means
density.subj <- ddply(density, .(Subject, Group, Density), summarize, RT = mean(RT))
summary(density.subj)
## Subject Group Density RT
## S101 : 2 Older :32 High:30 Min. :1027
## S102 : 2 Younger:28 Low :30 1st Qu.:1383
## S103 : 2 Median :1603
## S104 : 2 Mean :1743
## S105 : 2 3rd Qu.:1960
## S106 : 2 Max. :3601
## (Other):48
#by-item means
density.item <- ddply(density, .(Item, Density, Group), summarize, RT = mean(RT))
summary(density.item)
## Item Density Group RT
## bag : 2 High:120 Older :120 Min. : 942.2
## ball : 2 Low :120 Younger:120 1st Qu.:1366.0
## band : 2 Median :1686.4
## bean : 2 Mean :1767.4
## bear : 2 3rd Qu.:2040.0
## bell : 2 Max. :3866.8
## (Other):228
These are the kind of data we’d use for by-subject and by-item ANOVAs. We can also get more complex, like computing by-subject RT means, standard deviations, and means excluding outliers (RTs that are more than 3SD above the subject’s condition mean):
density.subj <- ddply(density, .(Subject, Group, Density), summarize,
meanRT = mean(RT), sdRT = sd(RT),
meanRTexcl = mean(RT[RT < (meanRT + 3*sdRT)]))
summary(density.subj)
## Subject Group Density meanRT sdRT
## S101 : 2 Older :32 High:30 Min. :1027 Min. : 354.3
## S102 : 2 Younger:28 Low :30 1st Qu.:1383 1st Qu.: 491.5
## S103 : 2 Median :1603 Median : 594.7
## S104 : 2 Mean :1743 Mean : 696.3
## S105 : 2 3rd Qu.:1960 3rd Qu.: 735.8
## S106 : 2 Max. :3601 Max. :2466.1
## (Other):48
## meanRTexcl
## Min. :1001
## 1st Qu.:1362
## Median :1545
## Mean :1700
## 3rd Qu.:1940
## Max. :3601
##
For completeness, we might also want to keep track of the number of trials so we’ll know how many trials were in the original means and how many were excluded as outliers:
density.subj <- ddply(density, .(Subject, Group, Density), summarize,
N = length(RT), meanRT = mean(RT), sdRT = sd(RT),
Nexcl = sum(RT < (meanRT + 3*sdRT)),
meanRTexcl = mean(RT[RT < (meanRT + 3*sdRT)]))
head(density.subj)
## Subject Group Density N meanRT sdRT Nexcl meanRTexcl
## 1 S101 Older High 45 2482.044 920.7922 44 2407.886
## 2 S101 Older Low 48 2122.312 694.8509 47 2058.553
## 3 S102 Younger High 50 1318.060 485.7474 49 1286.449
## 4 S102 Younger Low 55 1306.836 554.0681 53 1233.358
## 5 S103 Younger High 60 1712.467 736.6167 58 1629.414
## 6 S103 Younger Low 55 1738.200 955.1870 54 1652.222
All of these examples have been based on computing different kinds of summaries, but what if we want to transform the existing data without compressing it into a summary? We’ll need a different function for that: instead of summarize we’ll use transform or, better yet, mutate. For example, let’s say we want to transform (or mutate) the raw RT into subject-level z-scores:
density.z <- ddply(density, .(Subject), mutate, zRT = scale(RT))
summary(density.z)
## Subject Group Item Density RT
## S103 : 115 Older :1494 bag : 30 High:1397 Min. : 440
## S122 : 109 Younger:1370 blonde : 30 Low :1467 1st Qu.: 1161
## S105 : 107 chain : 30 Median : 1536
## S102 : 105 chair : 30 Mean : 1730
## S108 : 102 child : 30 3rd Qu.: 2040
## S110 : 102 couch : 30 Max. :12753
## (Other):2224 (Other):2684
## zRT.V1
## Min. :-2.050589
## 1st Qu.:-0.672325
## Median :-0.168882
## Mean : 0.000000
## 3rd Qu.: 0.450302
## Max. : 5.851678
##
Notice that the number of observations in the data frame stayed exactly the same and that the new data frame contains all of the columns from the original, plus the new “mutated” variable. And we did all of this with one line of code that ran in a tiny fraction of a second.
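One small wrinkle, which explains the zRT.V1 name in the summary above: scale() returns a one-column matrix rather than a plain vector. If that ever trips up downstream code, a sketch of the fix is to wrap it in as.numeric():
density.z <- ddply(density, .(Subject), mutate, zRT = as.numeric(scale(RT)))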
For a somewhat different plyr tutorial, check out Sean Anderson’s plyr: Split-Apply-Combine for Mortals.