The work done so far on this module has focussed primarily on organising and filtering your raw data and producing some simple visualisations. In science, at some point you will want to run some statistical tests (or “models”) to make inferences from your data. R and its associated packages are an incredibly powerful environment within which to do some very sophisticated analysis. They cater for everything from the very basic (e.g., correlations, t-tests) all the way to advanced machine learning algorithms used in data science. In this week’s class, we are going to focus on correlations: a statistical test to ascertain whether there is an association between two variables in your data set (e.g., is height associated with weight?).
There is no external reading this week. Below I walk through a single example of how to conduct a correlation in R which you can then use to solve the workshop exercises.
The annoying thing about simple statistical tests in R is that you need to have your data in an ‘untidy’, wide format. Variables that you want to correlate need to be in their own column. In Week 6, we learned that tidy data observes three rules: each variable has its own column, each observation has its own row, and each value has its own cell.
In Week 6, I showed some data from a fictitious study where researchers were interested in the effect of time of day (morning vs. evening) and stimulus type (words vs. images) on memory performance, which I have extended here. In this fictitious study, 50 participants were tested in both the morning and the evening, and had to learn lists of words and images. Participants then had to recall what they had learned, and the percent correct was used as the measure of memory accuracy.
Here’s how this data is represented in long format:
This obeys the three rules:
Each variable has its own column: time_of_day and stimulus_type are variables in our data.
Each observation has its own row: for example, participant 1 in the morning when learning words.
Each value has its own cell: memory_score tracks the value for each observation.
You can download this data from the OSF and work along with the following correlation code; it’s the memory_data.csv file.
Correlations (and other simple tests like t-tests) require the data
to be in wide format. This is a little frustrating, as in Week 6 I
explained many of the advantages of having data in long format. Whilst
it’s true that data need to be in wide format for very simple
statistical tests, much of R’s statistical machinery requires your data
to be in long format, which is why it’s the preferred format in the
tidyverse landscape.
Using the skills learned in Week 6, let’s first get our data into wide format:
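library(tidyverse)  # provides the %>% pipe and pivot_wider()

# one column for each time_of_day/stimulus_type combination
wide_data <- long_data %>%
  pivot_wider(names_from = c(time_of_day, stimulus_type),
              values_from = memory_score)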
## # A tibble: 50 × 5
## participant morning_words morning_images evening_words evening_images
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 96.6 73.3 85.0 88.8
## 2 2 97.5 73.9 68.7 73.0
## 3 3 71.4 75.9 68.7 91.2
## 4 4 93.2 91.4 75.6 75.8
## 5 5 85.7 61.6 97.7 87.1
## 6 6 80.8 90.0 98.5 91.0
## 7 7 89.5 87.1 89.6 67.5
## 8 8 65.4 66.8 89.3 61.2
## 9 9 86.3 70.4 81.4 65.4
## 10 10 88.2 80.6 60.1 87.2
## # ℹ 40 more rows
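If you want to see what pivot_wider() is doing without downloading anything, here is a minimal sketch on a tiny made-up data set. Note that toy_long and its scores are invented purely for illustration (they are not the memory_data.csv values); only the column names match the ones used above.

```r
library(tidyr)  # provides pivot_wider() and re-exports the %>% pipe

# Made-up long-format data: 2 participants x 2 times of day x 2 stimulus types
toy_long <- data.frame(
  participant   = rep(1:2, each = 4),
  time_of_day   = rep(c("morning", "morning", "evening", "evening"), times = 2),
  stimulus_type = rep(c("words", "images"), times = 4),
  memory_score  = c(90, 75, 80, 85, 70, 65, 95, 60)
)

# One row per participant, one column per time_of_day/stimulus_type combination
toy_wide <- toy_long %>%
  pivot_wider(names_from = c(time_of_day, stimulus_type),
              values_from = memory_score)

toy_wide
```

The eight rows of toy_long collapse into two rows (one per participant), and each memory score moves into the column named after its time of day and stimulus type.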
Correlation in R is conducted using the cor() function.
Have a look at the help file for cor().
?cor
Let’s say we are interested in the correlation between memory
accuracy for morning_words and evening_words.
(This isn’t a great question to ask of your data, but it allows us to
show how correlations are conducted.) According to the help file, the
x argument should be “…a numeric vector, matrix,
or data frame”. The y argument is set to NULL
by default.
There are two ways to do this analysis. In the first approach, we
pass morning_words in as the x argument, and
evening_words in as the y argument:
cor(x = wide_data$morning_words, y = wide_data$evening_words)
Before we look at the result, what’s this $ business?
As the data are in wide format, we need to tell R which column of the
data tibble to look in for each of our two variables. Therefore, we use
the R operator $ to tell R which column we are referring
to. The general structure of this is:
data_name$column_name. So, we are referring to the
wide_data data tibble, and the morning_words
column. We can see what the result of just this command is:
wide_data$morning_words
## [1] 96.59 97.48 71.45 93.22 85.67 80.76 89.46 65.39 86.28 88.20 78.31 88.76
## [13] 97.39 70.22 78.49 97.60 99.13 64.70 79.00 82.41 96.16 65.55 99.56 97.87
## [25] 63.30 80.57 75.61 96.23 77.88 93.44 89.50 92.44 75.52 87.41 60.16 93.32
## [37] 60.29 68.31 96.26 84.47 75.18 77.43 61.50 98.94 77.27 98.30 95.51 85.60
## [49] 98.84 84.75
This goes in to the morning_words column
of the wide_data data tibble, pulls out all of the
associated data, and returns it as a vector. So, when the help file of the
cor() function says that x and y
can be passed in as a numeric vector, this is what we are doing (shown
here as a long-winded approach):
x <- wide_data$morning_words
y <- wide_data$evening_words
cor(x = x, y = y)
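To convince yourself that $ really does hand back a plain numeric vector, here is a small sketch. The toy_data data frame is a stand-in I have built by hand from the first five rows printed above (using the rounded values), so it is not the full data set. It also shows the [[ ]] operator, which is a base R alternative that does the same job.

```r
# Hand-built stand-in: first five rows of the wide data (rounded values)
toy_data <- data.frame(
  participant   = 1:5,
  morning_words = c(96.6, 97.5, 71.4, 93.2, 85.7),
  evening_words = c(85.0, 68.7, 68.7, 75.6, 97.7)
)

# $ pulls one column out of the data frame as a vector
morning <- toy_data$morning_words
is.numeric(morning)   # checks that the result is a numeric vector
length(morning)       # one value per row of toy_data

# [[ ]] with the column name in quotes extracts the same vector
identical(morning, toy_data[["morning_words"]])
```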
Going back to the previous code, let’s run it and find out what the correlation coefficient is:
cor(x = wide_data$morning_words, y = wide_data$evening_words)
## [1] 0.01877779
The correlation coefficient is r = 0.019, which suggests there is basically no correlation between the two variables.
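By default, cor() computes Pearson's correlation coefficient (method = "pearson"), which is the covariance of the two variables divided by the product of their standard deviations. The sketch below checks this on simulated data. The sim_morning and sim_evening vectors are randomly generated stand-ins, not the real memory scores, so the coefficient will not match the one above.

```r
set.seed(42)  # make the simulated numbers reproducible

# Simulated stand-ins for two columns of memory scores (assumption: not real data)
sim_morning <- runif(50, min = 60, max = 100)
sim_evening <- runif(50, min = 60, max = 100)

# Pearson's r via cor() (the default method)
r_builtin <- cor(x = sim_morning, y = sim_evening)

# The same quantity computed by hand: covariance / (sd of x * sd of y)
r_by_hand <- cov(sim_morning, sim_evening) /
  (sd(sim_morning) * sd(sim_evening))

# Compare the two, allowing for tiny floating-point differences
all.equal(r_builtin, r_by_hand)
```

Because the covariance is scaled by both standard deviations, the coefficient always lies between -1 and 1, whatever units the original variables were measured in.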
The second approach is useful when you want to compute all possible pair-wise correlations in your data tibble. It’s best shown rather than described:
cor(x = wide_data)
## participant morning_words morning_images evening_words
## participant 1.00000000 -0.03706475 0.01192209 -0.01418588
## morning_words -0.03706475 1.00000000 0.05725083 0.01877779
## morning_images 0.01192209 0.05725083 1.00000000 0.10350219
## evening_words -0.01418588 0.01877779 0.10350219 1.00000000
## evening_images -0.05820505 -0.01126688 -0.17550695 -0.11006258
## evening_images
## participant -0.05820505
## morning_words -0.01126688
## morning_images -0.17550695
## evening_words -0.11006258
## evening_images 1.00000000
Here we have passed the whole data tibble in to the x
argument of the cor() function. This computes every
possible pair-wise correlation in your data, and returns the
correlation coefficient for each pair of columns as a matrix. To make it print more
nicely, I will pipe the result of this function call to the
round() function which will round the results to the
desired number of decimal places (in this case to 3 decimal places):
cor(x = wide_data) %>%
  round(digits = 3)
## participant morning_words morning_images evening_words
## participant 1.000 -0.037 0.012 -0.014
## morning_words -0.037 1.000 0.057 0.019
## morning_images 0.012 0.057 1.000 0.104
## evening_words -0.014 0.019 0.104 1.000
## evening_images -0.058 -0.011 -0.176 -0.110
## evening_images
## participant -0.058
## morning_words -0.011
## morning_images -0.176
## evening_words -0.110
## evening_images 1.000
This is sometimes a good approach if you are interested in every correlation, but not all of them make sense. For example, it has computed the correlation coefficient between participant number and memory accuracy, which is not an informative correlation.
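One way round this is to drop the columns you do not want before calling cor(). Here is a minimal sketch using select() from dplyr. The sim_wide data frame is a randomly generated stand-in for wide_data (an assumption made so the example runs on its own), so the coefficients themselves are meaningless.

```r
library(dplyr)  # provides select() and the %>% pipe

set.seed(123)  # make the simulated numbers reproducible

# Simulated stand-in for wide_data (assumption: not the real memory scores)
sim_wide <- data.frame(
  participant    = 1:50,
  morning_words  = runif(50, min = 60, max = 100),
  morning_images = runif(50, min = 60, max = 100),
  evening_words  = runif(50, min = 60, max = 100),
  evening_images = runif(50, min = 60, max = 100)
)

# Remove the participant column, then correlate what is left
cor_matrix <- sim_wide %>%
  select(-participant) %>%
  cor() %>%
  round(digits = 3)

cor_matrix
```

The resulting matrix has four rows and four columns, one for each memory score, and no longer contains the uninformative participant correlations.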
To come…
If you are keen to learn more about statistical analysis in R, I recommend the following. Note that these are not required reading for this module! This is just for interest if you want to explore further: