
Load some R packages

You may need to install these packages before loading them, for example with install.packages(c("ggplot2", "dplyr")).

library(ggplot2)
library(dplyr)

Download the data from Learn

Download the data file, woodward_data.csv, from this week’s Learn resource. Then create an R script in the same directory as the data file, make sure R's working directory is set to that directory, and read the file in using the code below. Process and graph the data by following along with this guide.

Load and check the data

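# stringsAsFactors = TRUE reads text columns as factors, so summary() shows
# counts per level as below (this was the default before R 4.0)
woodward_data = read.csv("woodward_data.csv", stringsAsFactors = TRUE)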
summary(woodward_data)
##        n               TestEvent  Familiarization       LT          
##  Min.   : 1.00   NewGoal    :96   Hand:96         Min.   : 0.03849  
##  1st Qu.: 8.75   NewLocation:96   Rod :96         1st Qu.: 5.23771  
##  Median :16.50                                    Median : 9.07272  
##  Mean   :16.50                                    Mean   :11.52830  
##  3rd Qu.:24.25                                    3rd Qu.:15.83204  
##  Max.   :32.00                                    Max.   :30.00000  
##   TrialNumber
##  Min.   :1   
##  1st Qu.:1   
##  Median :2   
##  Mean   :2   
##  3rd Qu.:3   
##  Max.   :3

Here, the different columns mean:

  • n = subject number (i.e., which participant is which)
  • TestEvent = Whether the test trial involves a new goal or a new location
  • Familiarization = Whether infants saw a hand reaching, or a rod
  • LT = Looking Time on each trial
  • TrialNumber = Which of the three trials in each condition the row comes from (1 to 3)
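
Before averaging, it can help to confirm that every subject really contributed three trials per test event. The sketch below does this on a small made-up data frame (`toy`, with invented looking times) that only mimics the column layout of woodward_data; with the real file you would cross-tabulate the corresponding columns of woodward_data instead.

```r
# Made-up stand-in for woodward_data: 2 subjects x 2 test events x 3 trials.
toy <- data.frame(
  n = rep(c(1, 2), each = 6),
  TestEvent = rep(rep(c("NewGoal", "NewLocation"), each = 3), times = 2),
  Familiarization = rep(c("Hand", "Rod"), each = 6),
  TrialNumber = rep(1:3, times = 4),
  LT = c(2, 4, 6, 1, 2, 3, 10, 20, 30, 5, 5, 5)
)

# Cross-tabulate subject by test event: each cell counts rows (trials).
trial_counts <- table(subject = toy$n, TestEvent = toy$TestEvent)
print(trial_counts)

# TRUE if every subject has exactly three trials per test event.
all(trial_counts == 3)
```

If any cell differed from 3, some trials would be missing or duplicated, and the by-subject averages computed later would rest on unequal numbers of trials.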

Examine the distribution of looking times

Create a histogram of looking times

hist(woodward_data$PutTheRightColumnNameHere)

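You can see that the data are right-skewed: most looking times are short, with a long tail stretching toward the maximum of 30. The summary above shows the same thing, since the mean (11.5) is larger than the median (9.1).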
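
As a rough illustration of what right skew looks like numerically, here is a minimal sketch using ten made-up looking times (`toy_LT` is hypothetical, not taken from the real data). A few long looks pull the mean above the median, just as in the summary of the real LT column.

```r
# Made-up looking times (not the real data): mostly short, a few long.
toy_LT <- c(1, 2, 2, 3, 3, 4, 5, 8, 15, 30)

# Histogram of the toy values; the bars pile up on the left.
hist(toy_LT, main = "Toy looking times", xlab = "Looking time")

# In a right-skewed sample the long tail pulls the mean above the median.
mean(toy_LT)    # 7.3
median(toy_LT)  # 3.5
```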

Process the data for graphing

The graphs that you see in papers typically display by-subject averages. That means you first compute each subject's mean in each condition, and then average those subject means to get the overall mean in each condition. Remember that each subject here completed 3 trials per condition, so the first step is to average across those three trials, for each subject and condition.

To do that we will use some dplyr functions that make grouping and averaging easy.

subject_average = woodward_data %>%
  group_by(COLUMN_FOR_SUBJECT_NUMBER, TEST_EVENT_COLUMN, FAMILIARIZATION_COLUMN) %>% #<-- FILL THIS IN
  summarise(LT.mean = FUNCTION_FOR_MEAN(LT))

summary(subject_average)
hist(subject_average$COLUMN_FOR_LOOKING_TIME_MEAN)

There are four things to note here.

  • The pipe operator %>% takes the woodward_data data frame and passes it as the first argument to the function on the next line, group_by.
  • The group_by function tells R to group the data by which subject produced it and which condition it came from, but not by other variables such as TrialNumber. Trial number is therefore ignored on the next line, which is what we want, because we are averaging over it.
  • The summarise function creates a new variable, LT.mean, equal to the mean of LT.
  • Because of the group_by call above, the mean is calculated separately for each subject and condition, across the trials that subject completed.
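
The points above can be seen at work on a tiny made-up data frame. This is only a sketch: `toy` and its looking times are invented, and `.groups = "drop"` is an optional dplyr argument (not used in the exercise code) that returns an ungrouped result and silences the regrouping message.

```r
library(dplyr)

# Made-up data: 2 subjects x 2 test events x 3 trials each (12 rows).
toy <- data.frame(
  n = rep(c(1, 2), each = 6),
  TestEvent = rep(rep(c("NewGoal", "NewLocation"), each = 3), times = 2),
  Familiarization = rep(c("Hand", "Rod"), each = 6),
  TrialNumber = rep(1:3, times = 4),
  LT = c(2, 4, 6, 1, 2, 3, 10, 20, 30, 5, 5, 5)
)

# Group by subject and condition (not by TrialNumber), then average LT.
# The three trials in each group collapse into a single row.
toy_subject_average <- toy %>%
  group_by(n, TestEvent, Familiarization) %>%
  summarise(LT.mean = mean(LT), .groups = "drop")

print(toy_subject_average)
nrow(toy_subject_average)  # 4 rows: one per subject and test event
```

Because TrialNumber is not in the group_by call, it disappears from the result: twelve trial-level rows become four subject-level means (4, 2, 20 and 5).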

Your result should look like this:

##        n               TestEvent  Familiarization    LT.mean      
##  Min.   : 1.00   NewGoal    :32   Hand:32         Min.   : 1.057  
##  1st Qu.: 8.75   NewLocation:32   Rod :32         1st Qu.: 5.222  
##  Median :16.50                                    Median : 8.590  
##  Mean   :16.50                                    Mean   :11.528  
##  3rd Qu.:24.25                                    3rd Qu.:16.147  
##  Max.   :32.00                                    Max.   :30.000

## # A tibble: 64 x 4
## # Groups:   n, TestEvent [?]
##        n TestEvent   Familiarization LT.mean
##    <int> <fct>       <fct>             <dbl>
##  1     1 NewGoal     Rod                6.03
##  2     1 NewLocation Rod                4.33
##  3     2 NewGoal     Rod               30   
##  4     2 NewLocation Rod               30   
##  5     3 NewGoal     Rod                8.09
##  6     3 NewLocation Rod                6.38
##  7     4 NewGoal     Rod               11.2 
##  8     4 NewLocation Rod                7.01
##  9     5 NewGoal     Rod               13.7 
## 10     5 NewLocation Rod               12.8 
## # ... with 54 more rows

Now, let’s get the average for each condition

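condition_average = subject_average %>%
  group_by(FAMILIARIZATION_COLUMN, TEST_EVENT_COLUMN) %>% #<-- FILL THIS IN
  # Give the grand mean its own name: reusing LT.mean would overwrite it
  # before the standard deviation on the next line is computed
  summarise(LT.Grand.Mean = FUNCTION_FOR_MEAN(LT.mean),
            LT.sd = FUNCTION_FOR_STANDARD_DEVIATION(LT.mean))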

condition_average
## # A tibble: 4 x 4
## # Groups:   Familiarization [?]
##   Familiarization TestEvent   LT.Grand.Mean LT.sd
##   <fct>           <fct>               <dbl> <dbl>
## 1 Hand            NewGoal              9.57  8.79
## 2 Hand            NewLocation         12.4   8.41
## 3 Rod             NewGoal             13.0   8.03
## 4 Rod             NewLocation         11.1   8.37
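
Note that the printed table names the grand mean LT.Grand.Mean rather than LT.mean. The name matters: summarise evaluates its arguments in order, so a column created earlier in the call replaces the original for later arguments. The sketch below shows the difference on made-up subject means (`toy_subject_average` is hypothetical, and `.groups = "drop"` is an optional extra).

```r
library(dplyr)

# Made-up by-subject means: two subjects in each of the four cells.
toy_subject_average <- data.frame(
  Familiarization = rep(c("Hand", "Rod"), each = 4),
  TestEvent = rep(rep(c("NewGoal", "NewLocation"), each = 2), times = 2),
  LT.mean = c(4, 6, 10, 14, 8, 12, 3, 5)
)

# Pitfall: LT.mean is replaced by the group mean first, so sd() then
# sees a single number per group and returns NA.
overwritten <- toy_subject_average %>%
  group_by(Familiarization, TestEvent) %>%
  summarise(LT.mean = mean(LT.mean),
            LT.sd = sd(LT.mean),
            .groups = "drop")

# Safer: a new name leaves LT.mean intact for the sd() call.
renamed <- toy_subject_average %>%
  group_by(Familiarization, TestEvent) %>%
  summarise(LT.Grand.Mean = mean(LT.mean),
            LT.sd = sd(LT.mean),
            .groups = "drop")

print(overwritten)  # LT.sd is NA in every row
print(renamed)      # LT.sd holds the spread of the subject means
```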

Graph the data

Now let’s graph the data, using ggplot2 as we did last time. Look back at the last exercise to see how to add the error bars to your chart.

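ggplot(condition_average, aes(x = WHAT_VARIABLE_ON_X_AXIS,
                              y = WHAT_VARIABLE_ON_Y_AXIS,
                              fill = WHAT_VARIABLE_DECIDES_COLORS_OF_BARS)) +
  geom_col(position = position_dodge(width = 0.9)) +
  # Use the same dodge width for the error bars so they sit centred on their bars
  geom_errorbar(aes(ymin = WHAT_VALUE, ymax = WHAT_VALUE),
                position = position_dodge(width = 0.9), width = 0.2)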
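
For reference, here is one possible way the finished chart could look. It is a sketch under stated assumptions, not the only valid answer: the data frame is typed in by hand from the condition_average table printed above, Familiarization is placed on the x axis with TestEvent as the fill colour, and the error bars show the mean plus and minus one standard deviation. The previous exercise may have used a different error measure, such as the standard error.

```r
library(ggplot2)

# Condition means and SDs copied by hand from the printed table above.
toy_condition_average <- data.frame(
  Familiarization = c("Hand", "Hand", "Rod", "Rod"),
  TestEvent = c("NewGoal", "NewLocation", "NewGoal", "NewLocation"),
  LT.Grand.Mean = c(9.57, 12.4, 13.0, 11.1),
  LT.sd = c(8.79, 8.41, 8.03, 8.37)
)

# One shared dodge object keeps bars and error bars aligned.
dodge <- position_dodge(width = 0.9)

p <- ggplot(toy_condition_average,
            aes(x = Familiarization, y = LT.Grand.Mean, fill = TestEvent)) +
  geom_col(position = dodge) +
  # Assumption: error bars span mean - 1 SD to mean + 1 SD.
  geom_errorbar(aes(ymin = LT.Grand.Mean - LT.sd,
                    ymax = LT.Grand.Mean + LT.sd),
                position = dodge, width = 0.2)

print(p)
```

Swapping the x and fill variables gives an equally valid chart that groups the bars by test event instead; which one to use depends on the comparison you want readers to make first.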