knitr::opts_chunk$set(echo = TRUE)

require(tidyverse)

This lab is about graphing data using the package ggplot2. ggplot2 is a package for making beautiful graphs in a logical fashion. You will also learn to use it in your statistics courses, but it is never too early to start practicing.

In order to graph data properly, we will need to process that data a little. To accomplish this, we will learn to use the package dplyr. dplyr is a wonderful package for making data analysis more logical and human-friendly. We will only use a few of the package’s capabilities this week, but we will do more next week.

Both ggplot2 and dplyr were created by the same group of people, the group that also created RStudio. These two packages are part of the tidyverse, which is a suite of packages that use similar syntax to facilitate the practice of data science.

We’ll be working in RMarkdown again today! Wahoo! No JavaScript! Don’t worry if you have forgotten your Markdown skills; you can find a cheat sheet here.

1 Create a Markdown Document

From the Learn page:

  1. Download the Template Markdown Document for the Lab, save it in your lab folder for the week, and open it in RStudio.
  2. Download the dataset for this week’s lab and save it in your lab folder. It is the same dataset that we went over during class.

In the setup section of your RMarkdown document, we will add some code to load packages like ggplot2 and dplyr. To do this, we will load one single package, called tidyverse, that contains all of the different packages. Use the function require() to do this, as you can see I’ve done at the top of this document. Save and Knit your Rmd, and ensure those packages are installed. If they aren’t installed already, you will have to install them from the Packages menu.

2  Load your data

Once that is done, you should move to the next chunk of code, where you should load the lab data using the tidyverse function read_csv(), as below. Call the loaded dataset lab_data.

lab_data = read_csv(fill_in_the_datafile_here)

When you knit the document, you should see a short message like this:

This message handily tells you all about the different columns in your datafile. For instance, the column Ptcpt is made up of characters (i.e., combinations of letters and numbers). You can also learn about your data by adding an additional line to your code chunk that prints a summary of your newly created datafile, using the function summary().
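For instance, a minimal version of that chunk might look like the sketch below (the filename discourse.csv is the one mentioned later in this lab; adjust it if your file is named or located differently).

# Read in the lab data and print a summary of every column
lab_data <- read_csv("discourse.csv")
summary(lab_data)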

##     Ptcpt               DOB                DOT           
##  Length:1472        Length:1472        Length:1472       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    AgeYears             Sex              LangEng         
##  Length:1472        Length:1472        Length:1472       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##   LangOther         List_PresSide          Item          
##  Length:1472        Length:1472        Length:1472       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    Sentence          Condition           Response        
##  Length:1472        Length:1472        Length:1472       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    AgeGroup        
##  Length:1472       
##  Class :character  
##  Mode  :character

Ok, now I am going to tell you to do something a little bit strange. Open your datafile (discourse.csv) using Microsoft Excel. Match up the columns in the Excel file to the summary information you just printed. One thing may stand out for you: in the Excel file, the column Response is numeric (i.e., 1s and 0s). But in the file loaded into R, it is being treated as a column full of characters. What gives?

In this experiment, Response was either correct or incorrect, which was coded as 1 or 0. Response really should thus be numeric. One possibility is that the data in Response has been corrupted – maybe there is an incorrect value in there? To examine this, look at what unique values are present in Response, by applying the function unique() to the column Response in the table lab_data.
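For example, a one-line sketch of that check is:

# List every distinct value that appears in the Response column
unique(lab_data$Response)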

## [1] "1"               "0"               "didn't response"

Uh oh! It looks like we have an incorrect entry in the column! One of our wonderful research assistants must have entered that the participant didn’t respond, rather than simply leaving the relevant cells in the CSV file blank. We will use some functions from the package dplyr to fix this.

In the next chunk, enter the following code:

lab_data <- dplyr::mutate(lab_data, 
                          Response_cleaned = replace(
                            Response, 
                            Response == "didn't response",
                            ""))

unique(lab_data$Response_cleaned)
## [1] "1" "0" ""

Here, we are combining the dplyr function mutate with another function called replace that is part of the “base” build of R (i.e., you don’t need to install a package for it). Let’s go through what’s happening here.

2.0.1 Function “replace”

The function replace has three arguments. The first argument is a vector that contains a set of values (in this case, the column Response), the second argument says which values should be replaced (i.e., values where Response == "didn't response"), and the final argument says what the replacement should be.

You can see that in action again here; test this code in your next chunk, and modify the different elements so that you create and replace different variables.

test = c(1,2,3)
test = replace(test, test == 2,"a")
print(test)
## [1] "1" "a" "3"

2.0.2 Function “mutate”

mutate is a dplyr function that allows us to modify tables and dataframes.

The first argument of mutate should be the dataframe that you want to modify. The second argument is more complex. That argument is an instruction to create a new column in your dataframe. The new column’s values are given by the right-hand side of the = sign. So, we have created a new column called Response_cleaned in which the value “didn’t response” is replaced with an empty string ("").
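To see mutate on its own, here is a tiny toy sketch (the data frame and column names here are made up purely for illustration):

# Create a toy table, then add a new column derived from an existing one
toy <- tibble(score = c(10, 20, 30))
dplyr::mutate(toy, score_doubled = score * 2)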

Importantly, we don’t need to create an entirely new column. We can simply overwrite an existing column, as below; do this in the next chunk.

lab_data <- dplyr::mutate(lab_data, 
                          Response = replace(
                            Response, 
                            Response == "didn't response",
                            NA))

unique(lab_data$Response)
## [1] "1" "0" NA
summary(lab_data$Response)
##    Length     Class      Mode 
##      1472 character character

So, now we have replaced the bad entry with an NA. But! Something is still wrong. Remember that when we imported this file, the column was a character vector. That won’t have changed, so now we need to convert it to a numeric vector. We can do that by adding an extra expression to the mutate call, as below. The beauty of mutate is that it allows us to perform multiple operations within a single function call, so that our code can be more readable. For instance, in the call below, we first replace the bad data in Response, and then, having done that, we convert Response into a vector of numbers using the function as.numeric(). Do this in your chunk.

lab_data <- dplyr::mutate(lab_data, 
                          Response = replace(
                            Response, 
                            Response == "didn't response",
                            NA),
                          Response = as.numeric(Response))

summary(lab_data$Response)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  1.0000  1.0000  0.7695  1.0000  1.0000       1

Important. When you call a dplyr function, you don’t need to write the dplyr:: part first, as in dplyr::mutate(). You could also just write mutate(). But it is helpful to see which package your functions come from.

We will leave dplyr for a little while now, but we will come back to it soon. Make sure that, by this point in your RMarkdown document, you have read in your datafile, printed it as a summary, and edited the column Response appropriately, so that we can start to build some graphs.

3  Explore your data

We want to see how participants’ responses in this task vary across all the different conditions, age groups, etc. We will use ggplot2 to do this.

Move to the new code chunk entitled exploratory_plot.

ggplot graphs are built in layers. First, we build a layer that contains some core information about the plot (but does not itself create an image). Then, we add new layers to the plot that create and modify the image.

ggplot(your_data, 
       aes(x = name_of_your_x_axis_column,
           y = name_of_your_y_axis_column)) +
geom_point()

Create a graph based on the schematic above. This first graph won’t win any prizes for either beauty or informativeness, but it is a start. Incidentally, why do you think this graph only plots values at 0 or 1?
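If you are unsure how to fill in the schematic, one possibility (assuming AgeGroup on the x axis and Response on the y axis, which is how the later examples are set up) is:

# A first, very plain scatterplot of raw responses by age group
ggplot(lab_data,
       aes(x = AgeGroup,
           y = Response)) +
  geom_point()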

If you go to the ggplot2 website you will find hundreds of functions for prettifying your graph.

3.1 Assign different colours to different conditions.

In your original ggplot call, there is another function called aes(). aes() stands for “aesthetic”, and the information here defines the basic design of your graph. x is your x axis, y is your y axis, and so on.

There are additional aesthetics that you can also use. For example, when plotting points, you can set the colour of the points to vary based on a condition in your experiment. In the next code chunk, reuse the code from your graph above, but edit it so that the colour of the points varies by the column Condition.

ggplot(your_data, 
       aes(x = name_of_your_x_axis_column,
           y = name_of_your_y_axis_column,
           colour = name_of_your_condition_column)
       ) +
  geom_point()

It still looks a bit stupid, right? There are points of different colours, but they are all plotted on top of one another.

One way to account for this is to not simply plot points, but to plot jittered points. That is, points whose location is randomly offset from their true location. You can do that in the next code chunk, by editing your code to use the function geom_jitter() rather than geom_point().

ggplot(your_data, 
       aes(x = name_of_your_x_axis_column,
           y = name_of_your_y_axis_column,
           colour = name_of_your_condition_column)) +
  geom_jitter()

This should look much better! It is still a little silly, but now you can see all the points, and see how they differ across the different conditions.

Still, it would be good if the points didn’t all overlap, right?

3.2 Split by age

To do this, we will turn this one plot into lots and lots of little plots, using a technique called faceting. We will add an extra layer defined by the function facet_grid(). This function has the syntax facet_grid(rows ~ columns). For instance, if you wanted to make a grid of plots in which each row of graphs is a different condition, and each column is a different age group, you would do that by writing facet_grid(Condition ~ AgeGroup). But you don’t want to do that here, because that wouldn’t be very informative. We only want one row of graphs, so we replace the first argument with ., as in the code below, which you should use in your next chunk.

ggplot(your_data, 
       aes(x = name_of_your_x_axis_column,
           y = name_of_your_y_axis_column,
           colour = name_of_your_condition_column)) +
  geom_jitter()+
  facet_grid(.~name_of_your_Age_column)

If done properly, this should create something weird!

The problem is that we used AgeGroup for both our x axis and our faceting. When the facet is for 5-year-olds, there isn’t any data on the x axis for 7-year-olds. So, we need to change our x axis, and the best way to do so is to put our most important condition on the x axis. Do this in the next code chunk.

ggplot(your_data, 
       aes(x = name_of_your_condition_column,
           y = name_of_your_y_axis_column,
           colour = name_of_your_condition_column)) +
  geom_jitter()+
  facet_grid(.~name_of_your_Age_column)

That’s better!

Now, in the next code chunk, try creating a graph that uses an extra element in the facet_grid. For instance, try creating a graph that splits by Sex.
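One possible sketch (assuming you want Sex as the rows of the facet grid and AgeGroup as the columns) is:

# Facet by Sex (rows) as well as AgeGroup (columns)
ggplot(lab_data,
       aes(x = Condition,
           y = Response,
           colour = Condition)) +
  geom_jitter() +
  facet_grid(Sex ~ AgeGroup)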

(it looks like there’s no big difference between genders here - phew!)

3.3 Add some summaries.

Plotting all the points in a dataset is fun, but not necessarily that informative. Let’s use dplyr to create a new dataframe that allows us to plot each participant’s mean score. To do this, we will take advantage of a cool feature of dplyr, known as piping, and indicated by the %>% symbol. We will introduce this feature by showing you how to create a data summary that calculates the standard deviation for each AgeGroup and Condition.

lab_data_summary = lab_data %>%
  dplyr::select(Condition,AgeGroup,Response) %>% 
  dplyr::group_by(Condition,AgeGroup) %>%
  dplyr::summarise(ResponseStandardDeviation = sd(Response, na.rm =T))

lab_data_summary

Here, the first line says two things

lab_data_summary = lab_data %>%

It says that we want to create a new variable called lab_data_summary, which will be created from lab_data. However, the pipe %>% at the end of the line says that we are going to do some things to lab_data before we assign the variable.

The pipe passes lab_data to the next line, where the dplyr function select is operating.

  dplyr::select(Condition,AgeGroup,Response) %>% 

Because we have used the pipe %>%, we automatically pass the output of the previous line as the first argument of the new function. So, select takes lab_data, and selects only the columns that we want to work with. In this case, that’s Condition, AgeGroup, and Response. If we ended the code now, then we would create a variable called lab_data_summary that is composed of those three columns from the variable lab_data. But first, we are going to do more. We pipe %>% those three columns down to the next line.

Note. Because we used the pipe %>%, we have automatically passed lab_data to the function select. If we weren’t using the pipe, we could do the same thing by writing dplyr::select(lab_data, Condition,AgeGroup,Response), i.e., when we don’t pipe, we have to explicitly state the first argument of the function select(). When we do use the pipe, the data from the previous line gets used as the first argument.

  dplyr::group_by(Condition,AgeGroup) %>%

This next line says that, within those columns, we are going to group the data by Condition and AgeGroup, in preparation for doing some summary statistics (like the mean, standard deviation, etc.). Then, we pipe %>% this information to the next line.

  dplyr::summarise(ResponseStandardDeviation = sd(Response, na.rm =T))

The function summarise is a bit like the function mutate from before. However, while mutate performs an action on every value in a column, summarise collapses those values together, according to the grouping set up by group_by. In this case, we create a new column called ResponseStandardDeviation, which contains the standard deviation of Response within each combination of the two columns we grouped by. The figure below illustrates the difference between mutate and summarise.
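If the figure isn’t to hand, the same point can be made with a tiny toy example (the data here are made up purely for illustration):

# mutate keeps one row per original row, adding a new column...
toy_data <- tibble(group = c("a", "a", "b", "b"),
                   value = c(1, 2, 3, 5))
dplyr::mutate(toy_data, doubled = value * 2)

# ...whereas summarise collapses each group down to a single row
toy_data %>%
  dplyr::group_by(group) %>%
  dplyr::summarise(mean_value = mean(value))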

Now, create your own code chunk that modifies the code above, to produce a table containing the mean response accuracy for each participant in each condition.

Assign its output to a variable called lab_data_ptcpt. The output should look like this.

## # A tibble: 192 x 4
## # Groups:   Ptcpt, Condition [?]
##    Ptcpt Condition AgeGroup ResponseMean
##    <chr> <chr>     <chr>           <dbl>
##  1 AF1   but       Adults              1
##  2 AF1   practice  Adults              1
##  3 AF1   so        Adults              1
##  4 AF2   but       Adults              1
##  5 AF2   practice  Adults              1
##  6 AF2   so        Adults              1
##  7 AF3   but       Adults              1
##  8 AF3   practice  Adults              1
##  9 AF3   so        Adults              1
## 10 AF4   but       Adults              1
## # ... with 182 more rows
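If you get stuck, one possible shape for this code is sketched below (the column name ResponseMean matches the expected output above; the exact columns you select and group by are up to you):

lab_data_ptcpt = lab_data %>%
  dplyr::select(Ptcpt, Condition, AgeGroup, Response) %>%
  dplyr::group_by(Ptcpt, Condition, AgeGroup) %>%
  dplyr::summarise(ResponseMean = mean(Response, na.rm = T))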

3.4 Plot the resulting data

In the next code chunk, create a ggplot using your new variable lab_data_ptcpt, in which mean response accuracy is the y axis, Condition is the x axis, and AgeGroup is the facet. Use geom_jitter() to plot the points.
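A rough sketch of what that might look like (the column name ResponseMean is the one used in the participant summary above):

ggplot(lab_data_ptcpt,
       aes(x = Condition,
           y = ResponseMean,
           colour = Condition)) +
  geom_jitter() +
  facet_grid(. ~ AgeGroup)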

3.5  Play with your new graph.

In the next few chunks, create a new set of plots that replace geom_jitter() with some other interesting plotting devices, such as geom_violin() and geom_boxplot(). Remember that you can find other plotting devices at the ggplot2 website here, under Layers: geoms.
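For example, one variant (a sketch; the variable and column names follow the previous graph) swaps in a boxplot:

ggplot(lab_data_ptcpt,
       aes(x = Condition,
           y = ResponseMean,
           fill = Condition)) +
  geom_boxplot() +
  facet_grid(. ~ AgeGroup)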

4 Calculate group means and standard errors

Let’s go back to dplyr and create a new dataframe with the group means, i.e., the mean collapsing across subjects for each Age/Condition cell, plus the standard error for each cell. Reuse and modify the code in the image below for your purposes (you’ll have to zoom in to see it properly).
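In case the image is hard to read, here is a sketch of code along those lines (the variable name lab_data_group is an assumption; the column names match the output that follows):

lab_data_group = lab_data_ptcpt %>%
  dplyr::group_by(Condition, AgeGroup) %>%
  dplyr::summarise(GrandMean = mean(ResponseMean, na.rm = T),
                   ResponseSD = sd(ResponseMean, na.rm = T),
                   n = n(),
                   SE = ResponseSD / sqrt(n))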

## # A tibble: 12 x 6
## # Groups:   Condition [?]
##    Condition AgeGroup GrandMean ResponseSD     n      SE
##    <chr>     <chr>        <dbl>      <dbl> <int>   <dbl>
##  1 but       3            0.4       0.234     16 0.0585 
##  2 but       5            0.169     0.166     16 0.0416 
##  3 but       7            0.775     0.330     16 0.0824 
##  4 but       Adults       0.888     0.193     16 0.0482 
##  5 practice  3            0.917     0.149     16 0.0373 
##  6 practice  5            0.979     0.0833    16 0.0208 
##  7 practice  7            1         0         16 0      
##  8 practice  Adults       1         0         16 0      
##  9 so        3            0.731     0.182     16 0.0454 
## 10 so        5            0.956     0.0512    16 0.0128 
## 11 so        7            1         0         16 0      
## 12 so        Adults       0.993     0.0278    16 0.00694

4.1 Make a graph of means plus standard errors as error bars

In the next chunk, reuse the code from a prior graph, such that you take this new dataframe and plot the grand means. Rather than using geom_jitter, you can use geom_point(), because there is no need to jitter your points.

Then, you can add error bars using the function geom_linerange(), adding an aes() object inside that function with the key:value pairs ymin = YourMeanColumn - YourSEColumn and ymax = YourMeanColumn + YourSEColumn.
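Putting that together, a possible sketch (reusing the assumed lab_data_group summary from above) is:

ggplot(lab_data_group,
       aes(x = Condition,
           y = GrandMean,
           colour = Condition)) +
  geom_point() +
  geom_linerange(aes(ymin = GrandMean - SE,
                     ymax = GrandMean + SE)) +
  facet_grid(. ~ AgeGroup)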

In another chunk, you can also create a barplot by subbing geom_bar(stat = "identity") for geom_point(). You’ll notice that the colours will seem a bit odd. You can fix this by adding an extra key:value pair to the aes() call in your ggplot() call, which reads fill = Condition.
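For instance, a barplot version of the sketch above might look like this (again, lab_data_group and the column names are assumptions carried over from the earlier sketches):

ggplot(lab_data_group,
       aes(x = Condition,
           y = GrandMean,
           fill = Condition)) +
  geom_bar(stat = "identity") +
  geom_linerange(aes(ymin = GrandMean - SE,
                     ymax = GrandMean + SE)) +
  facet_grid(. ~ AgeGroup)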

Finally, try making your own funky bar graphs, for example by changing around some of the axes. For instance, it might be easier to compare conditions if you facet the graph by Condition rather than by Age Group, and put Age Group on the x axis.