Welcome to the pre-lab homework for your second Psychology lab. In the second lab, we will be using many of the same functions you used in your first lab. This pre-lab activity is designed to give you a refresher on many of the skills you have already learned, and to introduce a new skill for lab 2: calculating inferential statistics in R.
Your pre-lab task is to work through these instructions. The skills you’ll cover in this homework are the same skills we’ll use in the second week of the labs, with the data we’ll collect in the lab.
This pre-lab activity has an accompanying Multiple Choice Quiz (MCQ), with questions you should answer to check you have correctly analysed your data. You can either complete the MCQ as you work through this sheet, or note down the values you calculate as you go and complete the MCQ afterwards.
The first thing we need to do is download the folder containing the data we’ll be practising with in this homework task. Download the folder from Moodle and save it on your computer.
Next, we need to set the working directory in R. This is an instruction to tell R where the folder with the data we want to use is saved on our computers.
To set the working directory follow these steps:
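One way to do this with code is sketched below. The folder path is hypothetical (an assumption for illustration only), so replace it with the location where you saved the folder from Moodle.

```r
# Hypothetical folder path: replace with where you saved the Moodle folder
data_folder <- "~/Documents/prelab2_data"

# Only change the working directory if that folder actually exists
if (dir.exists(data_folder)) {
  setwd(data_folder)
}

# Print the current working directory so you can check it
current_folder <- getwd()
print(current_folder)
```

If the folder printed at the end is not the one containing your data, the path needs correcting before you continue.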
We load a package using the R command library(), with the name of the package between the parentheses ().
So, in this case (where our package is called “tidyverse”) we give R the command library(tidyverse)
library(tidyverse)
IF YOU SEE THE ERROR MESSAGE: Error in library(tidyverse) : there is no package called ‘tidyverse’
You should have tidyverse installed already, but if you have changed your computer you may need to install this again.
install.packages("tidyverse")
After installation is complete, return to the above step: library(tidyverse)
Now R knows where to find our data (in our working directory) and has loaded in a package (tidyverse) with all the functions we need. Next, we need to load in our data set. To do this we type in the command dat <- read_csv('Access_practice_data.csv')
Remember, this will only work if you have correctly set your working directory to the folder where you saved the downloaded data set.
dat <- read_csv('Access_practice_data.csv')
The command read_csv is telling R we want to read in a ‘csv’ file (csv is a file type, like .doc or .docx). We have to make sure the file name is typed in R exactly as it is saved on our computer. The arrow <- assigns the data we are reading in to a new name for our data, dat.
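If R reports that the file does not exist, a quick check like the sketch below may help. It uses only base R functions and assumes the file name given in this worksheet.

```r
# File name exactly as used in this worksheet
file_name <- "Access_practice_data.csv"

# TRUE means R can see the file in the current working directory
file_found <- file.exists(file_name)
print(file_found)

# List every csv file R can see in the current working directory
csv_files <- list.files(pattern = "\\.csv$")
print(csv_files)
```

If file_found is FALSE, the working directory or the spelling of the file name is the most likely cause.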
For this practice session we’ll be looking at some made-up data. In this hypothetical example, we have 20 participants who have taken part in an experiment on memory. In the experiment, our participants were given 30 words to try to remember. Participants’ memory was tested immediately after learning (timepoint 1) and a week later (timepoint 2). We are interested to see if there is a significant difference in how many words our participants can remember between the two timepoints.
Let’s start by looking at our data by typing View(dat)
View(dat)
A new screen will appear showing us our data set. Our data has 20 rows (1 row for each participant) and 3 columns. The first column id gives the participant number. The second column Timepoint_1 tells us how many words (out of 30) the participant remembered immediately after learning, and our third column Timepoint_2 tells us how many words (out of 30) the participant remembered 1 week later.
Descriptive statistics are numbers that help to describe our data (for example, the average memory score at a particular timepoint). In this exercise, we want to calculate the average memory score at Timepoint 1 and Timepoint 2.
The average can also be referred to as the mean. The mean score is calculated by adding all the scores together, and then dividing the total by the number of participants. Want to learn more about calculating means? Have a look at this website
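You can also check the size and column names of a data set without opening the viewer. The sketch below uses a small made-up data frame called toy_dat (hypothetical values, not the practice data) with the same column names as our data.

```r
# Small made-up data frame for illustration only (not the practice data)
toy_dat <- data.frame(
  id = 1:4,
  Timepoint_1 = c(20, 18, 25, 22),
  Timepoint_2 = c(15, 16, 19, 14)
)

n_rows <- nrow(toy_dat)      # number of rows (one per participant)
n_cols <- ncol(toy_dat)      # number of columns
col_names <- names(toy_dat)  # column names

print(n_rows)
print(n_cols)
print(col_names)
head(toy_dat)                # prints the first few rows in the console
```

Running nrow(dat) and ncol(dat) on the practice data should agree with the 20 rows and 3 columns described above.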
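As a small worked example of that calculation, the sketch below uses three made-up scores (an assumption for illustration) and compares the by-hand result with R's built-in mean() function, which is introduced in the next step.

```r
# Three made-up scores for illustration
scores <- c(20, 25, 15)

# Add all the scores together, then divide by how many scores there are
mean_by_hand <- sum(scores) / length(scores)

# R's built-in mean() gives the same answer
mean_builtin <- mean(scores)

print(mean_by_hand)  # 20
print(mean_builtin)  # 20
```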
We can calculate the mean (i.e. average) memory score for each timepoint by using the mean() command. We need to tell R what data set we want to look at (dat) and which column we want to look at. We use the dollar sign $ to tell R which column we want to extract or look at in our data set. So putting dat$Timepoint_1 inside mean() means we want to calculate the mean for this column only.
Timepoint1_mean <- mean(dat$Timepoint_1)
Timepoint2_mean <- mean(dat$Timepoint_2)
The command mean(dat$Timepoint_1) can be read as: calculate the mean of the column called “Timepoint_1” in our data set “dat”.
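To see what the dollar sign does on its own, the sketch below uses a small made-up data frame (toy_dat, with hypothetical values) rather than the practice data.

```r
# Small made-up data frame for illustration only
toy_dat <- data.frame(
  id = 1:3,
  Timepoint_1 = c(20, 25, 15),
  Timepoint_2 = c(12, 18, 9)
)

# The dollar sign pulls out one column as a plain list of numbers
toy_column <- toy_dat$Timepoint_1
print(toy_column)

# mean() then works on just that column
toy_mean_1 <- mean(toy_dat$Timepoint_1)
toy_mean_2 <- mean(toy_dat$Timepoint_2)
print(toy_mean_1)  # 20
print(toy_mean_2)  # 13
```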
The next type of descriptive statistic we want to calculate is known as the standard deviation. The standard deviation measures the amount of spread in our data (i.e. how close participants’ scores are to the mean). The standard deviation is a more difficult concept than the mean, so have a look at this website (https://www.mathsisfun.com/data/standard-deviation.html) for more information about how the standard deviation is calculated.
It is important you understand why the standard deviation is important. Have a look at this website (https://www.dummies.com/education/math/statistics/why-standard-deviation-is-an-important-statistic/) for more information.
We calculate the standard deviation by using the function sd():
Timepoint1_sd <- sd(dat$Timepoint_1)
Timepoint2_sd <- sd(dat$Timepoint_2)
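To make the calculation behind sd() more concrete, the sketch below works it out step by step for a few made-up scores (an assumption for illustration). Note that sd() calculates the sample standard deviation, which divides by the number of scores minus 1.

```r
# Made-up scores for illustration
scores <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Step 1: how far is each score from the mean?
deviations <- scores - mean(scores)

# Step 2: square the deviations, add them up, divide by (n - 1),
# then take the square root
sd_by_hand <- sqrt(sum(deviations^2) / (length(scores) - 1))

# R's built-in sd() gives the same answer
sd_builtin <- sd(scores)

print(sd_by_hand)
print(sd_builtin)
```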
Often in psychology we want to make a visual representation of our data. In this case, we want to make one of the simplest types of visual representations - a bar graph.
There are 2 stages to making our bar graph. First, we need to tell R what values we want to graph (our group means).
Our group means are stored in Timepoint1_mean and Timepoint2_mean, but we now want these numbers to be stored together under one variable name. We can do this by creating a new variable called graph_means and storing our means in this variable.
We use the concatenate function c() to link together the numbers of interest. Our numbers of interest are stored in the variables Timepoint1_mean and Timepoint2_mean, so we tell R to link these together with c(Timepoint1_mean, Timepoint2_mean) and store them in a new variable called graph_means.
graph_means <- c(Timepoint1_mean, Timepoint2_mean)
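As a small illustration of what c() produces, the sketch below links two made-up numbers (hypothetical values, not the real means) and then looks inside the result.

```r
# Two made-up values for illustration (not the real means)
first_value <- 21.5
second_value <- 16.25

# c() links the two numbers into one variable
toy_means <- c(first_value, second_value)

print(toy_means)          # both numbers, in the order we gave them
print(length(toy_means))  # 2
print(toy_means[2])       # the second number: 16.25
```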
We can then create a barplot() of our means:
barplot(graph_means)
Next, we need to give our barplot() a title using the argument main="title", as well as an x-axis label (horizontal, along the bottom of the graph) using xlab="x-label" and a y-axis label (vertical, along the side of the graph) using ylab="y-label".
The argument names.arg=c() allows us to name our two bars.
The last argument ylim=c(ymin, ymax) tells R the smallest and largest values to show on the y-axis.
Try running the code below first to see where the labels are placed. Then, edit this code by replacing the title and axis labels with your own appropriate title and axis labels.
barplot(graph_means, main="Type your title in here", xlab="Type your x axis label here", ylab = "Type your y axis label here", names.arg=c("Timepoint 1", "Timepoint 2"), ylim=c(0, 25))
Inferential statistics allow us to decide if there are any statistically significant differences between groups or levels of our independent variable. In this experiment, we want to know if there are any significant differences between the memory score at timepoint 1 and timepoint 2.
There are lots of different types of inferential tests. In lab 2 we will look at one of these types of tests known as a t-test.
There are two main types of t-tests we can use - a paired samples t-test and an independent samples t-test. We use these t-tests on different occasions, depending on whether we have a within or between participants design.
If we have a between participants design, our participants take part in only one condition of the experiment, so we conduct an independent samples t-test.
If we have a within participants design, our participants take part in all conditions of the experiment, so we conduct a paired samples t-test.
To conduct either t-test in R we use the function t.test. We need to tell R what data we are comparing (our memory scores at timepoint 1 and timepoint 2) and whether we want to conduct an independent or paired samples t-test.
In the example below, in the argument paired = NULL, replace NULL with either TRUE if you want to run a paired samples t-test or FALSE if you want to run an independent samples t-test.
t.test(dat$Timepoint_1, dat$Timepoint_2, paired = NULL, alternative = "two.sided")
There is lots of information given when you run this test, which we are going to discuss in much more detail in class. The 3 aspects we are looking for are the values for t, df and the p-value.
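If you would like to pull those three values out of the output directly, one possible approach is sketched below. It uses made-up scores in two generic conditions (an assumption for illustration, not the practice data) and runs both kinds of t-test so you can compare them.

```r
# Made-up scores for five participants (illustration only)
condition_a <- c(12, 15, 11, 14, 13)
condition_b <- c(10, 12, 11, 11, 10)

# Store the output of each test in a variable
paired_result <- t.test(condition_a, condition_b,
                        paired = TRUE, alternative = "two.sided")
independent_result <- t.test(condition_a, condition_b,
                             paired = FALSE, alternative = "two.sided")

# Pull out t, df and p from the paired test
paired_t <- paired_result$statistic
paired_df <- paired_result$parameter
paired_p <- paired_result$p.value

print(paired_t)
print(paired_df)  # number of pairs minus 1, so 4 here
print(paired_p)

# The independent test gives a different df for the same numbers
independent_df <- independent_result$parameter
print(independent_df)
```

With paired = FALSE, t.test() by default runs Welch's version of the independent samples t-test, which is why its df may not be a whole number.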
You might notice that the p-value has some strange-looking scientific notation at the end: e-06. This is known as exponential notation and is used when we have very big or very small numbers. To remove this and see the actual p-value more clearly, run the code below:
options(scipen = 999)
t.test(dat$Timepoint_1, dat$Timepoint_2, paired = NULL, alternative = "two.sided")
You should now see that p is a very small number. Our p-value is used to tell us if we have a statistically significant difference between the memory test scores at timepoint 1 and timepoint 2. If our p-value is less than 0.05 (in mathematical notation this is written as p<0.05) then we can say we have evidence of a statistically significant difference.
When we report our t-test, we do this in a special way - the values for t and df are reported within the formula below. At the end of the formula we report if p is greater than 0.05 (written as p>0.05) or p is less than 0.05 (written as p<0.05)
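The sketch below shows how exponential notation and the 0.05 cut-off work, using a made-up p-value (an assumption for illustration, not the result of our t-test). The ending e-06 means "move the decimal point six places to the left".

```r
# Made-up p-value written in exponential notation (illustration only)
small_p <- 1.234e-06

# Show the same number written out in full, without the e-06 ending
plain_p <- format(small_p, scientific = FALSE)
print(plain_p)

# Check whether the p-value is below the 0.05 cut-off
is_significant <- small_p < 0.05
print(is_significant)  # TRUE
```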
t(df) = t, p<0.05 OR p>0.05
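As an illustration of filling in this formula, the sketch below builds the report from three made-up values (hypothetical numbers, not the answers for the practice data). Rounding t to two decimal places is a choice made for this example.

```r
# Made-up values for illustration (not the practice data results)
t_value <- 2.5
df_value <- 19
p_value <- 0.02

# Decide which p statement to report
p_text <- if (p_value < 0.05) "p<0.05" else "p>0.05"

# Slot the values into the formula t(df) = t, p
report <- sprintf("t(%.0f) = %.2f, %s", df_value, t_value, p_text)
print(report)  # "t(19) = 2.50, p<0.05"
```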
If you have not done so already, you should now complete the multiple choice quiz on Moodle.