Overview

The work done so far on this module has focussed primarily on organising and filtering your raw data and producing some simple visualisations. In science, at some point you will want to run statistical tests (or “models”) to make inferences from your data. R and its associated packages are an incredibly powerful environment in which to do very sophisticated analysis. It caters for everything from the very basic (e.g., correlations, t-tests etc.) all the way to the advanced machine learning algorithms used in data science. In this week’s class, we are going to focus on correlations: a statistical test to ascertain whether there is an association between two variables in your data set (e.g., is height associated with weight?).

Reading

There is no external reading this week. Below I walk through a single example of how to conduct a correlation in R which you can then use to solve the workshop exercises.

Wide data??

The annoying thing about simple statistical tests in R is that you need to have your data in an ‘untidy’, wide format. Each variable that you want to correlate needs to be in its own column. In Week 6, we learned that tidy data observes the following three rules:

  • Variables are represented in columns
  • Observations are represented as rows
  • Values are contained within cells

In Week 6, I showed some data from a fictitious study where researchers were interested in the effect of time of day (morning vs. evening) and stimulus type (words vs. images) on memory performance, which I have extended here. In this fictitious study, 50 participants were tested in both the morning and the evening, and had to learn lists of words and images. Participants then had to recall what they had learned, and the percent correct was used as the measure of memory accuracy.

Here’s how this data is represented in long format:

This obeys the three rules:

  • Variables are represented in columns: TRUE. The columns time_of_day and stimulus_type are variables in our data.
  • Observations are represented as rows: TRUE. Each row now represents an individual observation, and the columns record which level of each variable that observation had. For example, in row 1 we see the observation for participant 1 in the morning when learning words.
  • Values are contained within cells: TRUE. We now have the column memory_score which tracks the value for each observation.
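As a minimal sketch of what this long format looks like in code, the block below builds a small tibble by hand for participants 1 and 2 only. The name long_example is made up for illustration, and the scores are the rounded values printed for these two participants in the wide table later in this section, so treat it as a picture of the structure rather than the real data set.

```r
library(tidyverse)

# Hypothetical hand-built example: one row per observation (long format).
# Scores are the rounded values shown for participants 1 and 2 later on.
long_example <- tribble(
  ~participant, ~time_of_day, ~stimulus_type, ~memory_score,
  1,            "morning",    "words",        96.6,
  1,            "morning",    "images",       73.3,
  1,            "evening",    "words",        85.0,
  1,            "evening",    "images",       88.8,
  2,            "morning",    "words",        97.5,
  2,            "morning",    "images",       73.9,
  2,            "evening",    "words",        68.7,
  2,            "evening",    "images",       73.0
)

# Each participant contributes four rows: 2 times of day x 2 stimulus types
long_example
```

Note that each participant appears on several rows, which is exactly what the wide format used for correlations avoids.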

Long to wide format

You can download this data from the OSF and work along with the following correlation code; it’s the memory_data.csv file.

Correlations (and other simple tests like t-tests) require the data to be in wide format. This is a little frustrating, as in Week 6 I explained many of the advantages of having data in long format. Whilst it’s true that data need to be in wide format for very simple statistical tests, much of R’s statistical machinery requires your data to be in long format, which is why long is the preferred format in the tidyverse landscape.

Using the skills learned in Week 6, let’s first get our data into wide format:

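library(tidyverse)
long_data <- read_csv("memory_data.csv")  # assumes the file is in your working directory

# give each time_of_day/stimulus_type combination its own column
wide_data <- long_data %>% 
  pivot_wider(names_from = c(time_of_day, stimulus_type), 
              values_from = memory_score)
wide_data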
## # A tibble: 50 × 5
##    participant morning_words morning_images evening_words evening_images
##          <int>         <dbl>          <dbl>         <dbl>          <dbl>
##  1           1          96.6           73.3          85.0           88.8
##  2           2          97.5           73.9          68.7           73.0
##  3           3          71.4           75.9          68.7           91.2
##  4           4          93.2           91.4          75.6           75.8
##  5           5          85.7           61.6          97.7           87.1
##  6           6          80.8           90.0          98.5           91.0
##  7           7          89.5           87.1          89.6           67.5
##  8           8          65.4           66.8          89.3           61.2
##  9           9          86.3           70.4          81.4           65.4
## 10          10          88.2           80.6          60.1           87.2
## # ℹ 40 more rows
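To see how the new column names are built, here is a small sketch on a hand-made tibble (long_example, wide_example and back_to_long are hypothetical names, and the scores are the rounded values for participants 1 and 2 from the output above). It also pivots the data back to long format with pivot_longer(), on the assumption that you may want to return to the tidy layout after running your correlations.

```r
library(tidyverse)

# Hypothetical two-participant subset in long format
long_example <- tribble(
  ~participant, ~time_of_day, ~stimulus_type, ~memory_score,
  1,            "morning",    "words",        96.6,
  1,            "morning",    "images",       73.3,
  1,            "evening",    "words",        85.0,
  1,            "evening",    "images",       88.8,
  2,            "morning",    "words",        97.5,
  2,            "morning",    "images",       73.9,
  2,            "evening",    "words",        68.7,
  2,            "evening",    "images",       73.0
)

# Long to wide: the two names_from columns are joined with "_"
wide_example <- long_example %>%
  pivot_wider(names_from = c(time_of_day, stimulus_type),
              values_from = memory_score)
names(wide_example)

# Wide back to long: split the column names on "_" again
back_to_long <- wide_example %>%
  pivot_longer(cols = -participant,
               names_to = c("time_of_day", "stimulus_type"),
               names_sep = "_",
               values_to = "memory_score")
back_to_long
```

The wide version has one row per participant, and the round trip recovers the same eight observations we started with.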

The cor() function

Correlation in R is conducted using the cor() function. Have a look at the help file for cor().

?cor

Let’s say we are interested in the correlation between memory accuracy for morning_words and evening_words. (This isn’t a great question to ask of your data, but it allows us to show how correlations are conducted.) According to the help file, we need to pass to the argument x “…a numeric vector, matrix, or data frame”. The argument y is set to NULL by default.
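The help file also lists a method argument, which selects the type of coefficient: "pearson" (the default), "kendall", or "spearman". As a brief sketch, the block below uses two made-up vectors (heights and weights are hypothetical values, not part of the memory data) to show that leaving method out gives the same answer as asking for Pearson explicitly.

```r
# Hypothetical values, invented purely to illustrate the arguments of cor()
heights <- c(150, 160, 170, 180, 190)
weights <- c(52, 61, 68, 80, 85)

# The default method is Pearson's correlation coefficient
r_default <- cor(x = heights, y = weights)
r_pearson <- cor(x = heights, y = weights, method = "pearson")
identical(r_default, r_pearson)

# Spearman's coefficient is based on the ranks of the values instead
r_spearman <- cor(x = heights, y = weights, method = "spearman")
r_spearman
```

Everything else in this section uses the default, so you only need method if you specifically want a rank-based coefficient.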

There are two ways to do this analysis. In the first approach, we pass morning_words in as the x argument, and evening_words in as the y argument:

cor(x = wide_data$morning_words, y = wide_data$evening_words)

Before we look at the result, what’s this $ business?? As the data are in wide format, we need to tell R which column of the data tibble to look in for each of our two variables. Therefore, we use the R operator $ to tell R which column we are referring to. The general structure of this is: data_name$column_name. So, we are referring to the wide_data data tibble, and the morning_words column. We can see what the result of just this command is:

wide_data$morning_words
##  [1] 96.59 97.48 71.45 93.22 85.67 80.76 89.46 65.39 86.28 88.20 78.31 88.76
## [13] 97.39 70.22 78.49 97.60 99.13 64.70 79.00 82.41 96.16 65.55 99.56 97.87
## [25] 63.30 80.57 75.61 96.23 77.88 93.44 89.50 92.44 75.52 87.41 60.16 93.32
## [37] 60.29 68.31 96.26 84.47 75.18 77.43 61.50 98.94 77.27 98.30 95.51 85.60
## [49] 98.84 84.75

It goes in to the morning_words column of the wide_data data tibble, pulls out all of the values in that column, and returns them as a vector. So, when the help file of the cor() function says that x and y can be passed in as numeric vectors, this is what we are doing (shown here in a long-winded way):

x <- wide_data$morning_words
y <- wide_data$evening_words
cor(x = x, y = y)
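The $ operator is not the only way to pull a column out as a vector. As a small sketch on a hand-made tibble (wide_example is a hypothetical name, and the scores are the rounded values for participants 1 to 3 printed earlier), the same vector can also be extracted with double square brackets or with pull() from dplyr:

```r
library(tidyverse)

# Hypothetical three-participant subset in wide format
wide_example <- tibble(
  participant   = 1:3,
  morning_words = c(96.6, 97.5, 71.4),
  evening_words = c(85.0, 68.7, 68.7)
)

# Three ways of extracting one column as a plain numeric vector
via_dollar   <- wide_example$morning_words
via_brackets <- wide_example[["morning_words"]]
via_pull     <- wide_example %>% pull(morning_words)

# All three give exactly the same vector
identical(via_dollar, via_brackets)
identical(via_dollar, via_pull)
```

Which one you use is a matter of taste; pull() is convenient at the end of a pipe, whereas $ is the shortest to type.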

Going back to the previous code, let’s run it and find out what the correlation coefficient is:

cor(x = wide_data$morning_words, y = wide_data$evening_words)
## [1] 0.01877779

The correlation coefficient is r = 0.019, which suggests there is basically no correlation between the two variables.
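If you are curious what cor() is doing under the hood, Pearson's r is the covariance of the two variables divided by the product of their standard deviations. The sketch below checks this by hand, assuming only the first ten participants and using the rounded scores printed in the wide table above, so the value it returns will not match the r = 0.019 obtained from all 50 participants.

```r
# Rounded scores for participants 1 to 10, copied from the printed wide table
morning_words <- c(96.6, 97.5, 71.4, 93.2, 85.7, 80.8, 89.5, 65.4, 86.3, 88.2)
evening_words <- c(85.0, 68.7, 68.7, 75.6, 97.7, 98.5, 89.6, 89.3, 81.4, 60.1)

# Built-in Pearson correlation
r_builtin <- cor(x = morning_words, y = evening_words)

# The same quantity by hand: covariance / (sd of x * sd of y)
r_by_hand <- cov(morning_words, evening_words) /
  (sd(morning_words) * sd(evening_words))

# The two agree (up to floating point precision)
all.equal(r_builtin, r_by_hand)
```

Dividing by the standard deviations is what keeps r between -1 and 1, regardless of the units the variables are measured in.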

The second approach is when you want to conduct all possible pair-wise correlations in your data tibble. It’s best shown rather than described:

cor(x = wide_data)
##                participant morning_words morning_images evening_words
## participant     1.00000000   -0.03706475     0.01192209   -0.01418588
## morning_words  -0.03706475    1.00000000     0.05725083    0.01877779
## morning_images  0.01192209    0.05725083     1.00000000    0.10350219
## evening_words  -0.01418588    0.01877779     0.10350219    1.00000000
## evening_images -0.05820505   -0.01126688    -0.17550695   -0.11006258
##                evening_images
## participant       -0.05820505
## morning_words     -0.01126688
## morning_images    -0.17550695
## evening_words     -0.11006258
## evening_images     1.00000000

Here we have passed the whole data tibble in to the x argument of the cor() function. This computes every possible pair-wise correlation in your data, and returns the correlation coefficients as a matrix. To make it print more nicely, I will pipe the result of this function call to the round() function, which will round the results to the desired number of decimal places (in this case, 3):

cor(x = wide_data) %>% 
  round(digits = 3)
##                participant morning_words morning_images evening_words
## participant          1.000        -0.037          0.012        -0.014
## morning_words       -0.037         1.000          0.057         0.019
## morning_images       0.012         0.057          1.000         0.104
## evening_words       -0.014         0.019          0.104         1.000
## evening_images      -0.058        -0.011         -0.176        -0.110
##                evening_images
## participant            -0.058
## morning_words          -0.011
## morning_images         -0.176
## evening_words          -0.110
## evening_images          1.000

This is sometimes a good approach if you are interested in every correlation, but not all of the correlations make sense. For example, it has computed the correlation coefficient between participant number and memory accuracy, which is not an informative correlation.
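One way around this, sketched below, is to drop the participant column with select() before calling cor(). The tibble wide_example is a hypothetical stand-in for wide_data, built from the rounded scores of the first ten participants printed earlier, so the coefficients will differ from those in the full matrix above.

```r
library(tidyverse)

# Hypothetical ten-participant stand-in for wide_data (rounded scores)
wide_example <- tibble(
  participant    = 1:10,
  morning_words  = c(96.6, 97.5, 71.4, 93.2, 85.7, 80.8, 89.5, 65.4, 86.3, 88.2),
  morning_images = c(73.3, 73.9, 75.9, 91.4, 61.6, 90.0, 87.1, 66.8, 70.4, 80.6),
  evening_words  = c(85.0, 68.7, 68.7, 75.6, 97.7, 98.5, 89.6, 89.3, 81.4, 60.1),
  evening_images = c(88.8, 73.0, 91.2, 75.8, 87.1, 91.0, 67.5, 61.2, 65.4, 87.2)
)

# Remove the participant identifier, then correlate what is left
cor_matrix <- wide_example %>%
  select(-participant) %>%
  cor() %>%
  round(digits = 3)
cor_matrix

# A single coefficient can be looked up by row and column name
cor_matrix["morning_words", "evening_words"]
```

The result is a 4 × 4 matrix containing only the memory scores, and indexing by name saves you from having to scan the printed output for the pair you care about.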

Workshop Exercises

To come…

Further Reading

If you are keen to learn more about statistical analysis in R, I recommend the following. Note that these are not required reading for this module! This is just for interest if you want to explore further: