knitr::opts_chunk$set(echo = TRUE)
require(tidyverse)

This lab is about graphing data using the package ggplot2. ggplot2 is a package for making beautiful graphs in a logical fashion. You will also learn to use it in your statistics courses, but it is never too early to start practicing.
In order to graph data properly, we will need to process that data a little. To accomplish this, we will learn to use the package dplyr. dplyr is a wonderful package for making data analysis more logical and human-friendly. We will only use a few of the package’s capabilities this week, but we will do more next week.
Both ggplot2 and dplyr were created by the same group of people, the group that also created RStudio. These two packages are part of the tidyverse, which is a suite of packages that use similar syntax to facilitate the practice of data science.
We’ll be working in RMarkdown again today! Wahoo! No JavaScript! Don’t worry if you have forgotten your Markdown skills; you can find a cheat sheet here.
From the Learn page:
In the setup section of your RMarkdown document, we will add some code to load packages like ggplot2 and dplyr. To do this, we will load one single package, called tidyverse, that contains all of the different packages. Use the function require() to do this, as you can see I’ve done at the top of this document. Save and Knit your Rmd, and ensure those packages are installed. If they aren’t installed already, you will have to install them from the Packages menu.
Once that is done, you should move to the next chunk of code, where you should load the lab data using the tidyverse function read_csv(), as below. Call the loaded dataset lab_data.
lab_data = read_csv(fill_in_the_datafile_here)

When you knit the document, you should see a short message like this:
This message handily tells you all about the different columns in your datafile. For instance, the column Ptcpt is made up of characters (i.e., combinations of letters and numbers). You can also learn about your data by adding an additional line to your code chunk that prints a summary of your newly created datafile, using the function summary().
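A minimal version of that extra line, assuming your dataset is called lab_data as above, is:

summary(lab_data)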
## Ptcpt DOB DOT
## Length:1472 Length:1472 Length:1472
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## AgeYears Sex LangEng
## Length:1472 Length:1472 Length:1472
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## LangOther List_PresSide Item
## Length:1472 Length:1472 Length:1472
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## Sentence Condition Response
## Length:1472 Length:1472 Length:1472
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## AgeGroup
## Length:1472
## Class :character
## Mode :character
Ok, now I am going to tell you to do something a little bit strange. Open your datafile (discourse.csv) using Microsoft Excel. Match up the columns in the Excel file to the summary information you just printed. One thing may stand out for you: in the Excel file, the column Response is numeric (i.e., 1s and 0s), but in the file loaded into R, it is being treated as a column full of characters. What gives?
In this experiment, Response was either correct or incorrect, which was coded as 1 or 0. Response really should thus be numeric. One possibility is that the data in Response has been corrupted – maybe there is an incorrect value in there? To examine this, look at what unique values are present in Response, by applying the function unique() to the column Response in the table lab_data.
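For example, a one-line sketch using the lab_data table we loaded above:

unique(lab_data$Response)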
## [1] "1" "0" "didn't response"
Uh oh! It looks like we have an incorrect entry in the column! One of our wonderful research assistants must have entered that the participant didn’t respond, rather than simply leaving the relevant cells in the CSV file blank. We will use some functions from the package dplyr to fix this.
In the next chunk, enter the following code:
lab_data <- dplyr::mutate(lab_data,
                          Response_cleaned = replace(Response,
                                                     Response == "didn't response",
                                                     ""))
unique(lab_data$Response_cleaned)
## [1] "1" "0" ""
Here, we are combining the dplyr function mutate with another function called replace that is part of the “base” build of R (i.e., you don’t need to install a package for it). Let’s go through what’s happening here.
The function replace has three arguments. The first argument is a vector that contains a set of values (in this case, the column Response), the second argument says which values should be replaced (i.e., values where Response == "didn't response"), and the final argument says what the replacement should be.
You can see that in action again here; test this code in your next chunk, and modify the different elements so that you create and replace different variables.
test = c(1, 2, 3)
test = replace(test, test == 2, "a")
print(test)
## [1] "1" "a" "3"
mutate is a dplyr function that allows us to modify tables and dataframes.
The first argument of mutate should be the dataframe that you want to modify. The second argument is more complex: it is an instruction to create a new column in your dataframe. The new column’s name goes on the left-hand side of the = sign, and its values are defined on the right-hand side. So, we have created a new column called Response_cleaned in which the value “didn’t response” is replaced with an empty string ("").
Importantly, we don’t need to create an entirely new column. You can simply replace an old column, as below; do this in the next chunk. This time, we will replace the bad value with NA, R’s marker for missing data, rather than with an empty string.
lab_data <- dplyr::mutate(lab_data,
                          Response = replace(Response,
                                             Response == "didn't response",
                                             NA))
unique(lab_data$Response)
## [1] "1" "0" NA
summary(lab_data$Response)
##    Length     Class      Mode
##      1472 character character
So, now we have replaced the character with an NA. But! Something is still wrong. Remember that when we imported this file, the column was a character vector. That won’t have changed, so now we need to convert it to a numeric vector. We can do that by adding an extra expression to the mutate call, as below. The beauty of mutate is that it allows us to perform multiple operations within a single function, so that our code can be more readable. For instance, in the function below, we first replace the bad data in Response, and then, having done that, we convert Response into a vector of numbers using the function as.numeric(). Do this in your chunk.
lab_data <- dplyr::mutate(lab_data,
                          Response = replace(Response,
                                             Response == "didn't response",
                                             NA),
                          Response = as.numeric(Response))
summary(lab_data$Response)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##  0.0000  1.0000  1.0000  0.7695  1.0000  1.0000       1
Important. When you call a dplyr function, you don’t need to write the dplyr:: part first, as in dplyr::mutate(). You could also just write mutate(). But it is helpful to see which package your functions come from.
We will leave dplyr for a little while now, but we will come back to it soon. Make sure that, by this point in your RMarkdown document, you have read in your datafile, printed it as a summary, and edited the column Response appropriately, so that we can start to build some graphs.
We want to see how participants’ responses in this task vary across all the different conditions, age groups, etc. We will use ggplot2 to do this.
Move to the new code chunk entitled exploratory_plot.
ggplot graphs are built in layers. First, we build a layer that contains some core information about the plot (but does not itself create an image). Then, we add new layers to the plot that create and modify the image.
ggplot(your_data,
       aes(x = name_of_your_x_axis_column,
           y = name_of_your_y_axis_column)) +
  geom_point()

Create a graph based on the schematic above. This first graph won’t win any prizes for either beauty or informativeness, but it is a start. Incidentally, why do you think this graph only plots values at 0 or 1?
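If you are unsure how to fill in the placeholders, a minimal sketch (assuming you put AgeGroup on the x axis and the cleaned, numeric Response column on the y axis) might look like this:

ggplot(lab_data,
       aes(x = AgeGroup,
           y = Response)) +
  geom_point()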
If you go to the ggplot2 website you will find hundreds of functions for prettifying your graph.
In your original ggplot call, there is another function called aes(). aes() stands for “aesthetic”, and the information here defines the basic design of your graph. x is your x axis, y is your y axis, and so on.
There are additional aesthetics that you can also use. For example, when plotting points, you can set the colour of the points to vary based on a condition in your experiment. In the next code chunk, reuse the code from your graph above, edited so that the colour of the points varies by the column Condition.
ggplot(your_data,
       aes(x = name_of_your_x_axis_column,
           y = name_of_your_y_axis_column,
           colour = name_of_your_condition_column)) +
  geom_point()

It still looks a bit stupid, right? There are points of different colours, but they are all plotted on top of one another.
One way to account for this is to not simply plot points, but to plot jittered points. That is, points whose location is randomly offset from their true location. You can do that in the next code chunk, by editing your code to use the function geom_jitter() rather than geom_point().
ggplot(your_data,
       aes(x = name_of_your_x_axis_column,
           y = name_of_your_y_axis_column,
           colour = name_of_your_condition_column)) +
  geom_jitter()

This should look much better! It is still a little silly, but now you can see all the points, and see how they differ across the different conditions.
Still, it would be good if the points didn’t all overlap, right?
To do this, we will turn this one plot into lots and lots of little plots, using a technique called faceting. We will add an extra layer defined by the function facet_grid(). This function has the syntax facet_grid(rows ~ columns). For instance, if you wanted to make a 2D grid of plots in which each row of graphs is a different condition, and each column is a different age group, you would write facet_grid(Condition ~ AgeGroup). But you don’t want to do that here, because that wouldn’t be very informative. We only want one row of graphs, so we replace the first argument with ., as in the code below, which you should use in your next chunk.
ggplot(your_data,
       aes(x = name_of_your_x_axis_column,
           y = name_of_your_y_axis_column,
           colour = name_of_your_condition_column)) +
  geom_jitter() +
  facet_grid(. ~ name_of_your_Age_column)

If done properly, this should create something weird!
The problem is that we are using AgeGroup both for our x axis and for our faceting. When the facet is for 5-year-olds, there isn’t any data on the x axis for 7-year-olds. So, we need to change our x axis, and the best way to do so is to put our most important condition on the x axis. Do this in the next code chunk.
ggplot(your_data,
       aes(x = name_of_your_condition_column,
           y = name_of_your_y_axis_column,
           colour = name_of_your_condition_column)) +
  geom_jitter() +
  facet_grid(. ~ name_of_your_Age_column)

That’s better!
Now, in the next code chunk, try creating a graph that uses an extra element in the facet_grid. For instance, try creating a graph that splits by Sex.
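If you get stuck, one possible sketch (assuming you keep Condition on the x axis and add Sex as the row of the facet grid) is:

ggplot(lab_data,
       aes(x = Condition,
           y = Response,
           colour = Condition)) +
  geom_jitter() +
  facet_grid(Sex ~ AgeGroup)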
(it looks like there’s no big difference between genders here - phew!)
Plotting all the points in a dataset is fun, but not necessarily that informative. Let’s use dplyr to create a new dataframe that allows us to plot each participant’s mean score. To do this, we will take advantage of a cool feature of dplyr, known as piping, and indicated by the %>% symbol. We will introduce this feature by showing you how to create a data summary that calculates the standard deviation for each AgeGroup and Condition.
lab_data_summary = lab_data %>%
  dplyr::select(Condition, AgeGroup, Response) %>%
  dplyr::group_by(Condition, AgeGroup) %>%
  dplyr::summarise(ResponseStandardDeviation = sd(Response, na.rm = T))
lab_data_summary

Here, the first line says two things:
lab_data_summary = lab_data %>%

It says that we want to create a new variable called lab_data_summary, which will be created from lab_data. However, the pipe %>% at the end of the line says that we are going to do some things to lab_data before we assign the variable.
The pipe passes lab_data to the next line, where the dplyr function select is operating.
dplyr::select(Condition, AgeGroup, Response) %>%

Because we have used the pipe %>%, we automatically pass the output of the previous line as the first argument of the new function. So, select takes lab_data, and selects only the columns that we want to work with. In this case, that’s Condition, AgeGroup, and Response. If we ended the code now, then we would create a variable called lab_data_summary that is composed of just those three columns from the variable lab_data. But first, we are going to do more. We pipe %>% those three columns down to the next line.
Note. Because we used the pipe %>%, we have automatically passed lab_data to the function select. If we weren’t using the pipe, we could do the same thing by writing dplyr::select(lab_data, Condition, AgeGroup, Response), i.e., when we don’t pipe, we have to explicitly state the first argument of the function select(). When we do use the pipe, the data from the previous line gets used as the first argument.
dplyr::group_by(Condition, AgeGroup) %>%

This next line says that, within those columns, we are going to group the data by Condition and AgeGroup, in preparation for computing some summary statistics (like the mean, standard deviation, etc.). Then, we pipe %>% this information to the next line.
dplyr::summarise(ResponseStandardDeviation = sd(Response, na.rm = T))

The function summarise is a bit like the function mutate from before. However, while mutate performs an action on every value in a column, summarise collapses those values together, according to the grouping information from group_by. In this case, we create a new column called ResponseStandardDeviation, which is the standard deviation of Response within each combination of the two columns we grouped by before. The figure below illustrates the difference between mutate and summarise.
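If it helps to see that difference in code rather than in a picture, here is a small sketch using the same columns. With group_by() plus mutate, the group mean is simply repeated on every row of the original table; with group_by() plus summarise, each group is collapsed down to a single row.

# mutate: one row per observation, with the group mean added as a new column
lab_data %>%
  dplyr::group_by(Condition, AgeGroup) %>%
  dplyr::mutate(GroupMean = mean(Response, na.rm = T))

# summarise: one row per Condition/AgeGroup combination
lab_data %>%
  dplyr::group_by(Condition, AgeGroup) %>%
  dplyr::summarise(GroupMean = mean(Response, na.rm = T))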
Now, create your own code chunk that modifies the code above, to produce a table in which you create the mean response accuracy for each participant in each condition.
Assign its output to a variable called lab_data_ptcpt. The output should look like this.
## # A tibble: 192 x 4
## # Groups: Ptcpt, Condition [?]
## Ptcpt Condition AgeGroup ResponseMean
## <chr> <chr> <chr> <dbl>
## 1 AF1 but Adults 1
## 2 AF1 practice Adults 1
## 3 AF1 so Adults 1
## 4 AF2 but Adults 1
## 5 AF2 practice Adults 1
## 6 AF2 so Adults 1
## 7 AF3 but Adults 1
## 8 AF3 practice Adults 1
## 9 AF3 so Adults 1
## 10 AF4 but Adults 1
## # ... with 182 more rows
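If you get stuck, one possible sketch (keeping AgeGroup in the grouping so that it is carried through to the output, and calling the new column ResponseMean) is:

lab_data_ptcpt = lab_data %>%
  dplyr::select(Ptcpt, Condition, AgeGroup, Response) %>%
  dplyr::group_by(Ptcpt, Condition, AgeGroup) %>%
  dplyr::summarise(ResponseMean = mean(Response, na.rm = T))
lab_data_ptcpt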
In the next code chunk, create a ggplot using your new variable lab_data_ptcpt, in which mean response accuracy is the y axis, Condition is the x axis, and AgeGroup is the facet. Use geom_jitter() to plot points.
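A sketch of what that might look like, keeping the colour aesthetic from the earlier graphs and assuming the mean column is called ResponseMean as above:

ggplot(lab_data_ptcpt,
       aes(x = Condition,
           y = ResponseMean,
           colour = Condition)) +
  geom_jitter() +
  facet_grid(. ~ AgeGroup)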
In the next few chunks, create a new set of plots that replace geom_jitter() with some other interesting plotting devices, such as geom_violin and geom_boxplot. Remember that you can find other plotting devices at the ggplot2 website here, under Layer:geoms.
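For example, swapping in a boxplot is just a matter of replacing the geom layer (a sketch, reusing the graph above):

ggplot(lab_data_ptcpt,
       aes(x = Condition,
           y = ResponseMean,
           colour = Condition)) +
  geom_boxplot() +
  facet_grid(. ~ AgeGroup)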
Let’s go back to dplyr, and create a new dataframe with the group means, i.e., the mean across subjects for each Age/Condition cell, plus the standard error for that cell. Reuse and modify the code in the image below for your purposes (you’ll have to zoom in to see it properly).
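In case the image is hard to read, here is a sketch of the kind of code involved, assuming the group means are calculated from the per-participant means in lab_data_ptcpt (so that n is the number of participants per cell) and the result is assigned to a name of your choosing (lab_data_group is used here):

lab_data_group = lab_data_ptcpt %>%
  dplyr::group_by(Condition, AgeGroup) %>%
  dplyr::summarise(GrandMean = mean(ResponseMean, na.rm = T),
                   ResponseSD = sd(ResponseMean, na.rm = T),
                   n = dplyr::n(),
                   SE = ResponseSD / sqrt(n))
lab_data_group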
## # A tibble: 12 x 6
## # Groups: Condition [?]
## Condition AgeGroup GrandMean ResponseSD n SE
## <chr> <chr> <dbl> <dbl> <int> <dbl>
## 1 but 3 0.4 0.234 16 0.0585
## 2 but 5 0.169 0.166 16 0.0416
## 3 but 7 0.775 0.330 16 0.0824
## 4 but Adults 0.888 0.193 16 0.0482
## 5 practice 3 0.917 0.149 16 0.0373
## 6 practice 5 0.979 0.0833 16 0.0208
## 7 practice 7 1 0 16 0
## 8 practice Adults 1 0 16 0
## 9 so 3 0.731 0.182 16 0.0454
## 10 so 5 0.956 0.0512 16 0.0128
## 11 so 7 1 0 16 0
## 12 so Adults 0.993 0.0278 16 0.00694
In the next chunk, reuse the code from a prior graph, such that you take this new dataframe and plot the grand means. Rather than using geom_jitter, you can use geom_point(), because there is no need to jitter your points.
Then, you can add error bars using the function geom_linerange(), adding an aes() call inside that function with the key:value pairs ymin = YourMeanColumn - YourSEColumn and ymax = YourMeanColumn + YourSEColumn.
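Putting those pieces together, a sketch of the points-plus-error-bars graph (assuming the summary table and column names from the sketch above, i.e., lab_data_group with GrandMean and SE) might be:

ggplot(lab_data_group,
       aes(x = Condition,
           y = GrandMean,
           colour = Condition)) +
  geom_point() +
  geom_linerange(aes(ymin = GrandMean - SE,
                     ymax = GrandMean + SE)) +
  facet_grid(. ~ AgeGroup)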
In another chunk, you can also create a barplot by subbing geom_bar(stat="identity") for geom_point(). You’ll notice that the colours will seem a bit odd. You can fix this by adding an extra key:value pair to the aes() call in your ggplot() call, which reads fill = Condition.
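A sketch of the bar version, again assuming the lab_data_group summary table from above:

ggplot(lab_data_group,
       aes(x = Condition,
           y = GrandMean,
           colour = Condition,
           fill = Condition)) +
  geom_bar(stat = "identity") +
  facet_grid(. ~ AgeGroup)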
Finally, try making your own funky bar graphs by changing around some of the axes. For instance, it might be easier to compare conditions if you facet the graph by Condition rather than by AgeGroup, and put AgeGroup on the x axis.