Summer School Data Handling

This document contains a copy of the R code used in the quantitative practical class for the Psychology Summer School. You can work through this code independently to develop some experience of what it is like working with R Studio.

For reference, this document has been created using an ‘R Markdown’ file (something you will learn about if you are continuing on to Psychology). These files are written in R, but can be exported as a PDF, Word document or HTML web page.

R Markdown allows you to embed R code in “chunks”.

An embedded R code chunk looks like this:

summary(cars)

From now on, everywhere you see an R code ‘chunk’ you should type the R code into R Studio. Please do not copy and paste R code chunks: copying can introduce subtle formatting differences that stop the R code from working. It is good practice to type out the R code, so you can begin to get a feel for the way R code is written.

A few reminders before we get started

R is CaSe SeNsItIvE - this means you should make sure you are typing in your R code exactly. View is not the same as view for instance - if the command does not have the correct case, you will encounter code errors and the code will not work.
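A quick way to see case sensitivity in action is the short sketch below. The variable name my_score is made up for illustration and is not part of the class data.

```r
# Create a variable with a lower-case name (the name is made up for this example)
my_score <- 10

# exists() reports whether R can find an object with exactly this name
found_lower <- exists("my_score")   # same spelling and same case as above
found_mixed <- exists("My_Score")   # same letters, but different case

print(found_lower)
print(found_mixed)
```

The first check should report TRUE and the second FALSE, because R treats my_score and My_Score as two different names.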

You should also be aware of the difference between a quotation mark " and an apostrophe ’

R accepts either around text, but the opening and closing marks must match, and the code will not work if you mix them.

R often works in pairs, so where you are using a quotation mark ", an apostrophe ’ or a parenthesis ( these should come as a pair, at the beginning and end of the content in the middle.

For example:

(hello)

"hello"

'hello'
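As a small sketch of the pairing rule in real code, the chunk below uses one pair of quotation marks and one pair of parentheses. The variable names greeting and shout are made up for this example.

```r
# Each opening mark has a matching closing mark
greeting <- "hello"          # a pair of quotation marks around the text
shout <- toupper(greeting)   # a pair of parentheses around the input

print(shout)
```

If you leave off a closing mark, the console will usually show a + and wait for you to finish the line. Pressing the Esc key cancels the unfinished line so you can try again.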

Incorrect formatting is the most common coding error you will encounter with R, followed by incorrect spelling!

Set your working directory

Before starting any R session you will need to set your working directory - i.e. choose the folder on your computer you want to work in. This is done manually using the “Session” menu at the top of R Studio.

You need to choose: Session -> Set working directory -> choose directory.

You will then see the option to choose the folder you wish to work from. You can set up a new folder for storing your R Studio work from Summer School if you wish, but do not call this folder R as this will be confused with the names of the folders created when you installed R. Instead call the folder something like ‘Summer School’ for instance.

You should ensure that the data (downloaded from Moodle) is in the folder you set as your working directory.
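If you are not sure whether this has worked, the sketch below is one way to check. It assumes the data file has the same name that is used later in this document; the variable names current_folder and data_found are made up for this example.

```r
# Ask R which folder it is currently working in
current_folder <- getwd()
print(current_folder)

# Check whether the class data file is in that folder (TRUE means it was found)
data_found <- file.exists("Summer School Data.csv")
print(data_found)
```

If the second result is FALSE, the file is probably in a different folder, or the file name is spelled slightly differently.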

Install Tidyverse

R works using functions (a function is a named section of code that can be reused). For example, the function used for calculating the average, or mean, of a set of numbers is mean().
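Here is a minimal sketch of mean() in use. The numbers are made up, and c() is a built-in function (not introduced elsewhere in this document) that combines several numbers into one set.

```r
# Combine three made-up numbers into one set
scores <- c(2, 4, 6)

# Give the set to the function mean() and store the answer
average <- mean(scores)

print(average)   # the mean of 2, 4 and 6 is 4
```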

Tidyverse is a package containing a group of widely used R functions. The first time you use R you will need to install this package. You only need to install the package once - in the future if you use R the package will be there to use.

To install a package in R we use the function install.packages("package_name").

So, to install the package named tidyverse (all lower case, because R is case sensitive), we type into R:

install.packages("tidyverse")

It may take a few minutes for the package to install. Do not be concerned by jargon messages appearing in the console, or by red writing. This is all normal.
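Once the messages have stopped, one possible way to check that the install worked is sketched below. The variable names installed_names and tidyverse_installed are made up for this example.

```r
# installed.packages() lists every package installed on this computer
installed_names <- rownames(installed.packages())

# TRUE means tidyverse is installed; FALSE means it is not installed yet
tidyverse_installed <- "tidyverse" %in% installed_names

print(tidyverse_installed)
```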

Opening tidyverse

Every package installed in R is kept in R’s library.

Every time you want to work in R and use functions which are part of the package Tidyverse, you will need to tell R you want to open the package.

This is a bit like needing to go to the library, to borrow a book you want to work with. At the end of each R session, the book/package goes back to the library, so we have to go collect it again at the beginning of each session.

We collect a package from the library by using the function library(package_name).

So, to collect tidyverse we type:

library(tidyverse)

Loading in data

Next we want to load in our data. You should download the dataset from Moodle and ensure your data is in the folder that you have set as your working directory.

Most of the time when working with R your data will be stored in a csv file. This may be a new file type to you. Similar to an Excel .xls file, a csv file is useful for storing a spreadsheet of data.

There are several ways to read in data in R. However, we are going to use the tidyverse function read_csv('file_name.csv') to read in our data.

When we read in our data, we also want to assign this to a new variable in R. A variable in R has a different definition than a variable in e.g. an experiment.

In R, the term ‘variable’ means a word that identifies and stores the value of some data for later use.

We are going to use the variable name “dat” (short for data). However, we could call our variable anything we like.

We assign values to a variable by using an arrow <-
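As a small sketch of the arrow in use, the chunk below stores a number and then reuses it. The names x and doubled are made up for this example.

```r
# Store the number 5 in a variable called x
x <- 5

# The stored value can now be reused by typing its name
doubled <- x * 2

print(doubled)   # 5 multiplied by 2 is 10
```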

To read in our data, and assign it to a variable named “dat” we type:

dat <- read_csv('Summer School Data.csv')

This will create a new variable in our R Studio Environment.

To view the data type:

View(dat)
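View() opens the data in a spreadsheet-style tab. If you would rather check the data in the console, the sketch below shows a few built-in alternatives. It uses a small made-up data set called toy_dat, which borrows the column names ‘Group’ and ‘Recall’ from the class data; the values are invented for illustration.

```r
# A small made-up data set with the same column names as the class data
toy_dat <- data.frame(
  Group  = c("Repetition", "Repetition", "Visualisation", "Visualisation"),
  Recall = c(4, 6, 8, 10)
)

head(toy_dat)   # prints the first few rows in the console
str(toy_dat)    # lists each column and the type of data it holds
nrow(toy_dat)   # counts the number of rows
```

The same three functions should work on dat once it has been read in.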

Descriptive Statistics

Descriptive statistics are used to describe the general patterns in our data. We can calculate, for example, the mean, mode or median, which are all measures of central tendency - they describe the values around which our data cluster.

We can also use measures of spread, to calculate how spread out our scores are. The typical way of doing this is to measure standard deviation.
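To make these ideas concrete, here is a minimal sketch using five made-up scores. mean(), median() and sd() are all built into R; the variable names are invented for this example.

```r
# Five made-up scores
scores <- c(3, 5, 5, 6, 11)

score_mean   <- mean(scores)     # central tendency: the average, 6
score_median <- median(scores)   # central tendency: the middle value, 5
score_sd     <- sd(scores)       # spread: the standard deviation, 3

print(score_mean)
print(score_median)
print(score_sd)
```

Notice that the one high score (11) pulls the mean above the median, which is one reason the best choice of statistic depends on the data.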

The most appropriate descriptive statistics to use will depend on your particular data set. This is something you will learn more about if you continue on to Psychology at undergraduate level.

Because we have two groups in our experiment, we want to calculate our descriptive statistics for each group separately.

To do this, we need to tell R we want to group our data, using the function group_by()

group_by() requires two inputs - we first need to say what data we want to group (our data is stored in the variable ‘dat’) and then we need to tell R on what basis we want to group our data. We could, for example, group our data based on gender, age or recall accuracy. In this example, however, we want to group our data based on whether participants were in the repetition or visualisation condition, which is information stored in the column titled ‘Group’.

We want to save our grouped data to a new variable called “dat_grouped”:

dat_grouped <- group_by(dat, Group)

This will create a new variable in our environment. The ‘grouping’ aspect of the data is hidden (it won’t look like anything has changed).
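If you want to confirm that the grouping has been stored, one option is the function group_vars(), which comes from dplyr (one of the packages that tidyverse loads). The sketch below uses a small made-up data set, toy_dat, with invented values rather than the class data.

```r
library(dplyr)   # dplyr is the tidyverse package that provides group_by()

# A small made-up data set with the same column names as the class data
toy_dat <- data.frame(
  Group  = c("Repetition", "Repetition", "Visualisation", "Visualisation"),
  Recall = c(4, 6, 8, 10)
)

toy_grouped <- group_by(toy_dat, Group)

# group_vars() reports which column(s) the data is grouped by
grouping_columns <- group_vars(toy_grouped)
print(grouping_columns)
```

This should print "Group", showing that the grouping is there even though the table itself looks the same.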

Now we can go ahead and produce our descriptive statistics. We’re going to calculate the mean and standard deviation, using the function summarise().

Similar to group_by(), when we use the summarise() function we need 2 inputs - where our data is stored (it’s now the ‘dat_grouped’ variable) and what statistics we want.

We’re going to calculate the mean first.

We want to calculate the mean number of words recalled, so we tell R to calculate the mean of the values in the column named ‘Recall’.

Because we have grouped the data, R will calculate the mean recall for each group separately.

We’re going to save this in a new variable called ‘dat_mean’

dat_mean <- summarise(dat_grouped, mean_score = mean(Recall))

To view the result we type:

View(dat_mean)

You will see a new table, which shows the mean recall score for each group.

We are now going to do the same, calculating the standard deviation.

dat_sd <- summarise(dat_grouped, sd_score = sd(Recall))

Again we can view the output by typing:

View(dat_sd)
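summarise() can also calculate more than one statistic in a single step, which puts the mean and standard deviation side by side in one table. The sketch below is one possible way of doing this, using a made-up data set (toy_dat) with invented values rather than the class data.

```r
library(dplyr)   # provides group_by() and summarise()

# A small made-up data set with the same column names as the class data
toy_dat <- data.frame(
  Group  = c("Repetition", "Repetition", "Repetition",
             "Visualisation", "Visualisation", "Visualisation"),
  Recall = c(4, 5, 6, 8, 9, 10)
)

toy_grouped <- group_by(toy_dat, Group)

# Ask for both statistics at once, separated by a comma
toy_summary <- summarise(
  toy_grouped,
  mean_score = mean(Recall),
  sd_score   = sd(Recall)
)

print(toy_summary)
```

With these made-up numbers the table should have one row per group, with means of 5 and 9 and a standard deviation of 1 in each group.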

Visualising the data

It is important in research to show visualisations of the data. One way of doing this is a bar plot. There are many better ways of visualising data that you will be taught if you continue on to learn about Psychology. However, the bar plot is one you may be more familiar with, so this is the one we will be looking at.

One of the most widely used functions for visualising data is called ggplot()

This function is the basis of many different types of data visualisations. There are lots of different inputs that can be used with ggplot() so we will be looking at one of the less complex versions today.

To begin we need to tell ggplot what variable to use (our data is in the variable “dat_mean”) and what we want to plot on the x and y axes of the graph.

We use the function geom_bar() to tell ggplot we want to make a bar graph.

The final argument stat= "identity" tells ggplot that we want the heights of the bars to represent values in the data.

ggplot(dat_mean, aes(x=Group, y=mean_score))+
               geom_bar(stat="identity")
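Graphs are easier to read with clear axis labels. As a sketch of one way to add them, the chunk below uses the ggplot2 function labs(). It uses a made-up table of means (toy_mean) so that it can be run on its own; the label text and the name recall_plot are also invented for this example.

```r
library(ggplot2)   # ggplot2 is the tidyverse package that provides ggplot()

# A made-up table of group means, laid out like dat_mean
toy_mean <- data.frame(
  Group      = c("Repetition", "Visualisation"),
  mean_score = c(5, 9)
)

# Build the bar graph, then add axis labels with labs()
recall_plot <- ggplot(toy_mean, aes(x = Group, y = mean_score)) +
  geom_bar(stat = "identity") +
  labs(x = "Group", y = "Mean words recalled")

print(recall_plot)
```

Saving the graph in a variable is optional; it simply lets you show the graph again later without retyping the code.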

Inferential statistics

Inferential statistics are used to infer something about the population, based on a sample.

There are lots of different types of inferential statistics that can be used in Psychology - you’ll be introduced to several of these statistical tools if you continue to undergraduate level Psychology.

For now, we are going to look at null hypothesis significance testing (NHST). Remember from class that in psychology we test against the null (not experimental) hypothesis.

There are different types of NHST. In this example though, we want to compare the means of two groups.

We use different types of tests depending on whether our groups are related (i.e. the same participants, such as in a within-groups design) or unrelated (i.e. there are different participants in each group, such as in a between-groups design).

As we have a between-groups design, we’ll be conducting something called an “independent t-test”.

We need to tell R what data we want to compare. To do this, we need to ask R to find all the ‘recall’ values for every participant in the ‘repetition’ group, and compare these to the ‘recall’ values of the participants in the ‘visualisation’ group.

To do this, we are going to use the ‘dat’ data set, filter the data by finding all the participants in the named group, and then pull out their recall scores from the data set. We then need to do this again for the other group.

This gives us the code below:

dat_t <- t.test(
  dat %>% filter(Group == "Repetition") %>% pull(Recall),
  dat %>% filter(Group == "Visualisation") %>% pull(Recall)
)
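If that line feels like a lot to take in at once, the same steps can be done one at a time. The sketch below is an illustration only: it uses a small made-up data set (toy_dat) with invented values, and the variable names repetition_scores, visualisation_scores and toy_t are hypothetical.

```r
library(dplyr)   # provides filter(), pull() and the %>% symbol

# A small made-up data set with the same column names as the class data
toy_dat <- data.frame(
  Group  = c("Repetition", "Repetition", "Repetition",
             "Visualisation", "Visualisation", "Visualisation"),
  Recall = c(4, 5, 6, 8, 9, 10)
)

# Step 1: keep only the rows for one group, then pull out the Recall column
repetition_scores <- toy_dat %>% filter(Group == "Repetition") %>% pull(Recall)

# Step 2: do the same for the other group
visualisation_scores <- toy_dat %>% filter(Group == "Visualisation") %>% pull(Recall)

print(repetition_scores)
print(visualisation_scores)

# Step 3: compare the two sets of scores
toy_t <- t.test(repetition_scores, visualisation_scores)
```

The %>% symbol can be read as “and then”: take the data, and then filter it, and then pull out the scores.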

To see the result, we can type:

dat_t
## 
##  Welch Two Sample t-test
## 
## data:  dat %>% filter(Group == "Repetition") %>% pull(Recall) and dat %>% filter(Group == "Visualisation") %>% pull(Recall)
## t = -6.6079, df = 19.62, p-value = 2.151e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.154951 -3.198584
## sample estimates:
## mean of x mean of y 
##  4.545455  9.222222
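In the output above, the p-value is written in scientific notation: 2.151e-06 means 0.000002151. If you want to pick out individual numbers from a t-test result rather than reading the whole printout, the result stores them by name. The sketch below shows this with two made-up sets of scores; the variable names are invented for this example.

```r
# Two made-up sets of recall scores
repetition_scores    <- c(4, 5, 6, 5)
visualisation_scores <- c(8, 9, 10, 9)

toy_t <- t.test(repetition_scores, visualisation_scores)

# Individual parts of the result can be picked out with the $ symbol
toy_t$statistic   # the t value
toy_t$parameter   # the degrees of freedom (df)
toy_t$p.value     # the p-value
toy_t$conf.int    # the 95 percent confidence interval
toy_t$estimate    # the mean of each group
```

The same names should work on dat_t, for example dat_t$p.value.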