Introduction to Visualising Data in RStudio

Why?

This section has two parts: 1) why use RStudio and 2) why visualise data? First, RStudio is great. Learning R will give you skills in a widely sought-after and transferable area (i.e., data programming), let you use free statistical software, and put more powerful analysis and data visualisation at your fingertips. Second, why should we visualise our data, and what does it even mean to visualise data? Here, I take visualising data to mean showing information in a way that enables communication, analysis, discovery and understanding (Cairo, 2016). Data visualisation is important for understanding the true distribution of data (i.e., whether measures of central tendency reflect the sample), for spotting outliers, for understanding trends, and for presenting the results of data analyses clearly.

What do I want from you here?

In a word, nothing. You don’t have to do this at all. You can pass your lab report to an excellent level without any of this visualisation. But I think we’re here to gain skills and to show what we are capable of. So, this document is designed to help you go from the basic bar plots you could include to more desirable and informative raincloud plots.

What this document will not do

This worksheet will not provide a detailed background on the layout of RStudio. This can be found online if you wish (or email me). This worksheet alone will not cover all of the details of what we are actually doing with each line of code (this would make the document even longer and potentially more off-putting).

What do you need?

-This RMarkdown document

-The accompanying datafile (called A2-QuantDataR.csv) available here on Github: https://github.com/brimmelljack/RaincloudsMSc-SEP7006

-To install R and RStudio (https://posit.co/download/rstudio-desktop/) - this is free

-Probably something to make notes

-Drinks & snacks (coffee and maybe something stronger depending on your coping mechanisms)

Final notes

To get some nice visualisations of variables from the data set for your lab report, you could essentially just copy and paste the code blocks below into RStudio and press “run” - do that if you want. The examples below do not cover every variable, though. So, to be thorough you will need to edit the code ever so slightly (e.g., change the WM accuracy variable to WM efficiency).

Let’s get going…

The first thing we have to do is open a new script, install/load packages (Google what is an R package), and import the MSc data file. This sounds very simple, and some of it is (new script), but to be honest, importing data files is something I found tough to start with.

Let’s start by creating a new script (the place where you write all your code, run your analyses and get your visualisations). To do this click: File > New File > R Script. You can also press CONTROL+SHIFT+N.

Quick organisational comments

When you open a new script you’ll see a bunch of blank boxes on your screen that all look similar. We’re going to split the screen into quarters in our head.

The top right is the “environment”. This is where any object, dataframe, tibble or function will go once you create it. You can access things from here by clicking them.

Below that (bottom right) is where you can do a number of things. You have “files”, where you can look at files on your computer; “plots”, where any plots you create will live (this will become populated later); “packages”, which shows all the packages you have installed and is a way of loading packages into the current session (more on this later); and “help”, which is your space for finding answers. Unsure what an argument within an R function or call does? Check help (it’s also useful to run ?nameoffunction if you’re stuck). Finally, you have “viewer” and “presentation” - we will not use these today.

The bottom left is the “Console”. You can type code directly into here, but I use it more as a space for viewing the output of tests, and I think of it as R’s space to work out the code we have just written.

The top left is the new script. This is where you will write all of your code. If the console is R’s working space, this is yours. This is where you will type and copy the relevant info from this document.

Now you have a blank canvas to work with. What you would do in the real world here is write your own code. But for this example you can copy the chunks of code below into your RStudio script and run them.

You can mess around with the blank script before carrying on, if you’d like. To write a note in the script, you start the line with a hashtag (#).

#This is the start of my script - Jack

If you were to write something on the script without a hashtag, R would think it’s a line of code and try to run it (and likely fail).
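As a small illustration of the difference between a note and a line of code, here is a sketch you could paste into a blank script. The object name `result` is just made up for this example.

```r
# This whole line is a note, so R ignores it when you press run

result <- 2 + 2   # R runs the code before the hashtag and ignores the note after it
result            # typing the name of an object prints its value in the console
```

If you delete the hashtag from the first line and run it, R will try to treat the sentence as code and give you an error, which is exactly the behaviour described above.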

The next thing to do is load our packages. Very briefly, packages are the bread & butter of R/RStudio and they actually do most of the work for us. A package normally serves a particular purpose and is made up of operators, functions, and arguments that we must satisfy in our code to get an output that is interpretable.

For this example we will require the tidyverse package (which contains a number of useful packages), base R (built in, doesn’t need installing), the cowplot package and a package called ggrain.

To install these, copy the below code into your RStudio and press run. NOTE because I already have the packages, I have used #’s here. When you copy the code, remove the #’s before you click run!

Running code

To run code there are a couple of options. Put your cursor at the end of a single line you’d like to run and press CONTROL+ENTER or click the “Run” button (located in the top right of the script writing window). You can also run multiple lines of code in one go by highlighting the lines you want to run (e.g., the install.packages lines below) and pressing either CONTROL+ENTER or the “Run” button. You should run the code as you go (in my opinion), so remember to run the code after each new bit that you add.

#install.packages("tidyverse")
#install.packages("cowplot")

Just like that, RStudio has downloaded lovely little packages from the internet. But we ain’t done there. Now we have to load them from our library into the current session. To do this, copy and run the below code.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(cowplot)

## 
## Attaching package: 'cowplot'
## 
## The following object is masked from 'package:lubridate':
## 
##     stamp
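If you are not sure which of the packages are already on your computer, one option is to ask R before installing anything. The sketch below is not part of the original worksheet. It assumes the three package names used in this document and uses the base R function `requireNamespace()`, which returns TRUE when a package is installed. The install line is left behind a hashtag, like the ones earlier, so nothing is downloaded until you remove it.

```r
# Packages this worksheet relies on
needed <- c("tidyverse", "cowplot", "ggrain")

# TRUE/FALSE for each package: is it already installed?
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need installing
missing_pkgs <- needed[!is_installed]
missing_pkgs

# Remove the hashtag below to install only the missing ones
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```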

OK, one last hurdle before we can start working with our data to create some lovely plots: loading in the data file itself. Now, I did mention that this can be kinda tricky, but it can also be fine (let’s pray).

The first thing to ensure is that the data file is saved to your computer. It’s a good idea to know the pathway to the data file as this will be required. This is a part of the code that you will have to write yourself - but you can do it!

For me, this code looks like the following:

MSc_data <- read.csv(file = "C:/Users/jb28/OneDrive - University of Bolton/R/A2-QuantDataR.csv")
head(MSc_data)

##   id age gender    level anx.intervention WMaccuracy1 WMefficiency1
## 1  1  28      M 1st Team          control          31          3.54
## 2  2  25      M 1st Team          control          35          3.92
## 3  3  26      M 1st Team          control          34          4.18
## 4  4  30      M 1st Team          control          31          6.18
## 5  5  27      M 1st Team          control          31          3.49
## 6  6  28      M 1st Team          control          33          4.12
##   PassingSuccess.1 ShotCreatingActions1 WMaccuracy2 WMefficiency2
## 1               62                   70          30          3.98
## 2               85                   99          33          4.21
## 3               88                  103          29          4.53
## 4               65                   73          38          5.54
## 5               60                   68          30          4.02
## 6               70                   78          33          4.00
##   PassingSuccess.2 ShotCreatingActions2
## 1               63                   73
## 2               80                  104
## 3               87                  100
## 4               67                   74
## 5               59                   66
## 6               70                   74

Let’s talk about what we’ve done above. It looks like an easy line of code and in some respects it was. But it won’t work for you as written. Reading the file in properly is all about making sure the text in red (the part inside the quotation marks) is right. The red text is basically a pathway to the file you want to open. In my example above I have the data file (called A2-QuantDataR.csv) saved in a folder called “R” on my University of Bolton OneDrive account. So, the key thing here is knowing where you have saved the document you want to read in. I like to use my computer desktop or OneDrive. Wherever it is, make sure you know it.
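If the read.csv() line gives you an error, it is usually because R cannot find the file at the path you typed. Here is a hedged sketch of how you could check this before reading the file in. The path `"A2-QuantDataR.csv"` is a placeholder that assumes the file sits in your current working folder, so swap it for wherever you actually saved it.

```r
# Placeholder path: replace with the location of the file on your own computer
data_path <- "A2-QuantDataR.csv"

# Show the folder R is currently looking in
getwd()

# Only try to read the file if R can actually see it
if (file.exists(data_path)) {
  MSc_data <- read.csv(file = data_path)
  head(MSc_data)
} else {
  message("File not found at: ", normalizePath(data_path, mustWork = FALSE))
}
```

Another option is `read.csv(file.choose())`, which opens a window so you can click on the file instead of typing the path.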

The first part of the first line of the code (MSc_data <-) is quite important. The MSc_data part is me naming the object that holds the file I am reading into RStudio. You can call it anything you want, though I suggest sticking to the same name as me for now. The little arrow part <- is sometimes referred to as the “left assign”: left because it points left, and assign because we’re assigning the information on the right of the arrow to the name on the left of the arrow.
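To see the arrow at work on something smaller than a data file, here is a tiny sketch. The names `my_number` and `my_double` are made up for this example.

```r
# Assign the value on the right (5) to the name on the left (my_number)
my_number <- 5

# The name can now be used in place of the value
my_double <- my_number * 2
my_double
```

Both objects will appear in the “environment” once the lines have been run.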

When you have read in the data file you should see the newly created object appear in the “environment” - this is located in the top right (by default) of the screen and contains all the objects, variables, lists & functions (Google these if you want).

To see the newly created object, you can just click on it in the environment. The other code we see here is head(MSc_data). The call head() shows us a preview of the data we have just read into R: the variables in columns, with the first few rows of data (6, in this case).
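head() is one of several base R functions for getting a quick look at a dataframe. The sketch below uses a small toy dataframe (`toy_data`, made up for this example rather than the real data file) so you can run it without importing anything.

```r
# Toy dataframe for illustration only
toy_data <- data.frame(
  id  = 1:10,
  age = c(28, 25, 26, 30, 27, 28, 33, 31, 26, 27)
)

head(toy_data, n = 3)   # first 3 rows instead of the default 6
tail(toy_data, n = 2)   # last 2 rows
dim(toy_data)           # number of rows, then number of columns
names(toy_data)         # the column (variable) names
summary(toy_data$age)   # quick descriptive statistics for one column
```

The same calls should work on MSc_data once you have read it in.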

NOW WE HAVE SOME DATA TO WORK WITH WOOOOO

The next thing for us to do is pivot this bad boy. What on Earth does that mean?

Well, the data file (remember you can click on it from the environment to see the full thing) is currently in what we refer to as wide format. This means that all the information we have about a participant is contained within a single row of the dataframe. But when we’re working with repeated measures data, or longitudinal data, we normally want the data in what is called long format. A key tenet of long format is that we create a timepoint column. In this example we currently have two columns for each of Working Memory (WM) accuracy, WM efficiency, Passing Success and Shot Creating Actions. By pivoting to long format we will create a new column (the “Timepoint” column) and reduce the number of WM accuracy, WM efficiency, Passing Success and Shot Creating Actions columns to just one each.
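To make the wide versus long idea concrete, here is a minimal sketch on a toy dataframe with one measure at two timepoints. The names `toy_wide`, `toy_long`, `score1` and `score2` are made up for this example. It uses pivot_longer() from the tidyr package (part of the tidyverse), with the `names_prefix` argument to strip the text from the old column names so only the timepoint number is left.

```r
library(tidyr)

# Wide format: one row per participant, one column per timepoint
toy_wide <- data.frame(
  id     = c(1, 2),
  score1 = c(31, 35),
  score2 = c(30, 33)
)

# Long format: one row per participant per timepoint
toy_long <- pivot_longer(
  toy_wide,
  cols         = starts_with("score"),
  names_to     = "Timepoint",
  names_prefix = "score",
  values_to    = "score"
)
toy_long
```

The two participants now take up four rows, two each, and the two score columns have become a single `score` column plus a `Timepoint` column.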

Let’s do this now. The good thing (maybe…) about this part is that you can just copy my code below to your own RStudio and run it. There are a couple of things to check first. One, make sure your newly created dataframe is called ‘MSc_data’ like mine is. Two, make sure you read in the same file I did (the one available on the GitHub page: https://github.com/brimmelljack/RaincloudsMSc-SEP7006).

MSc_data$gender <- as.numeric(factor(MSc_data$gender))
MSc_data$level <- as.numeric(factor(MSc_data$level))
MSc_data$anx.intervention <- as.numeric(factor(MSc_data$anx.intervention))
str(MSc_data)

## 'data.frame':    90 obs. of  13 variables:
##  $ id                  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age                 : int  28 25 26 30 27 28 33 31 26 27 ...
##  $ gender              : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ level               : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ anx.intervention    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ WMaccuracy1         : int  31 35 34 31 31 33 28 34 30 30 ...
##  $ WMefficiency1       : num  3.54 3.92 4.18 6.18 3.49 4.12 3.74 3.83 3.88 3.24 ...
##  $ PassingSuccess.1    : int  62 85 88 65 60 70 55 78 59 62 ...
##  $ ShotCreatingActions1: int  70 99 103 73 68 78 67 95 69 73 ...
##  $ WMaccuracy2         : int  30 33 29 38 30 33 30 32 34 33 ...
##  $ WMefficiency2       : num  3.98 4.21 4.53 5.54 4.02 4 3.55 4.1 4.13 3.53 ...
##  $ PassingSuccess.2    : int  63 80 87 67 59 70 58 77 57 64 ...
##  $ ShotCreatingActions2: int  73 104 100 74 66 74 70 96 70 70 ...
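The three conversion lines above turn text labels into numbers by going through factor(). By default, R numbers the labels in sorted order, which can be hard to predict. As a hedged alternative (not what the code above does), you can spell out the order yourself with the `levels` argument. The vector `groups` below is a made-up example using the three group names from this study.

```r
# Made-up example labels, in no particular order
groups <- c("control", "self-talk", "REBT", "control")

# State the order explicitly: control = 1, REBT = 2, self-talk = 3
groups_factor <- factor(groups, levels = c("control", "REBT", "self-talk"))

levels(groups_factor)                      # shows which label gets which number
groups_numeric <- as.numeric(groups_factor)
groups_numeric                             # 1 3 2 1
```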

No pivoting has happened yet. But in the above code you have changed a few character variables (gender, level, anx.intervention) to numeric. The str() call shows this. The next block of code will be quite long, so feel free to just copy and paste it into R and run it. I will explain what has happened on the other side of the code…see you there.

#pivot out working memory accuracy
WM_acc <- MSc_data %>% pivot_longer(cols = starts_with("WMaccuracy"),
          names_to = "Timepoint", values_to = "WM_Accuracy") %>%
          mutate(Timepoint = gsub("WMaccuracy", "", Timepoint))
#pivot out working memory efficiency
WM_eff <- MSc_data %>% pivot_longer(cols = starts_with("WMefficiency"),
          names_to = "Timepoint", values_to = "WM_Efficiency") %>%
          mutate(Timepoint = gsub("WMefficiency", "", Timepoint))
#pivot out Passing Success
Pass <-   MSc_data %>% pivot_longer(cols = starts_with("Passing"),
          names_to = "Timepoint", values_to = "PassingSuccess") %>%
          mutate(Timepoint = gsub("PassingSuccess.", "", Timepoint, fixed = TRUE))
#pivot out Shot Creating Actions
Shot <-   MSc_data %>% pivot_longer(cols = starts_with("Shot"),
          names_to = "Timepoint", values_to = "ShotCreatingActions") %>%
          mutate(Timepoint = gsub("ShotCreatingActions", "", Timepoint)) %>%
          select(-(WMaccuracy1:PassingSuccess.2))

The above has created four new data frames in the environment (Pass, Shot, WM_acc and WM_eff) to represent the four dependent variables (DVs) in this study. You can click into any of them and you will see that, for the most part, not a lot has changed. The key thing is that each of them now has a “Timepoint” variable, and one variable of interest has been pivoted so that each participant has two rows. But the job of pivoting longer is not quite done. If you look at the above code, each pivot_longer() call is focused on only one variable. What we now need to do is merge these into one.

#create a name for the new dataframe that has all the variables. I have 
#gone with MSc_long
MSc_long <- Shot %>% 
  mutate(PassSuccess = Pass$PassingSuccess) %>%
  mutate(WM_Accuracy = WM_acc$WM_Accuracy) %>%
  mutate(WM_Efficiency = WM_eff$WM_Efficiency)
head(MSc_long)

## # A tibble: 6 × 10
##      id   age gender level anx.intervention Timepoint ShotCreatingActions
##   <int> <int>  <dbl> <dbl>            <dbl> <chr>                   <int>
## 1     1    28      1     1                1 1                          70
## 2     1    28      1     1                1 2                          73
## 3     2    25      1     1                1 1                          99
## 4     2    25      1     1                1 2                         104
## 5     3    26      1     1                1 1                         103
## 6     3    26      1     1                1 2                         100
## # ℹ 3 more variables: PassSuccess <int>, WM_Accuracy <int>, WM_Efficiency <dbl>

We now have our final dataframe and we can begin to create some plots that are great to include in our lab reports. Remember you do not have to do this (but it will be nice).

Before we do this, the head(MSc_long) call above will allow you to see the layout of the new dataframe. You should see that each id now has two rows. In each of these, age, gender, level, and anx.intervention don’t change. But the “Timepoint” column shows the two timepoints (1 = pre intervention and 2 = post intervention). We then see the DV (WM accuracy, WM efficiency, Passing Success and Shot Creating Actions) variables.
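The approach above pivots each variable separately and then sticks the columns together, which relies on all four data frames having their rows in the same order. For the curious, here is a hedged sketch of a one-step alternative on toy data. It is not the method used in this worksheet. It assumes column names that follow the same pattern as the real file (a name, an optional full stop, then the timepoint number) and uses the special `".value"` option of pivot_longer() together with `names_pattern`. The dataframe names `toy_wide2` and `toy_long2` and the values in them are made up.

```r
library(tidyr)

# Toy wide data with two measures at two timepoints
toy_wide2 <- data.frame(
  id               = 1:2,
  WMaccuracy1      = c(31, 35),
  WMaccuracy2      = c(30, 33),
  PassingSuccess.1 = c(62, 85),
  PassingSuccess.2 = c(63, 80)
)

# ".value" keeps the first captured part as the new column name;
# the second captured part (the digit) goes into Timepoint
toy_long2 <- pivot_longer(
  toy_wide2,
  cols          = -id,
  names_to      = c(".value", "Timepoint"),
  names_pattern = "^(.*?)\\.?(\\d)$"
)
toy_long2
```

Each id ends up with two rows and one column per measure, the same shape as MSc_long.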

Let’s start with a simple bar plot. I am only going to do this for one of the variables in each example. You can copy the code to your own RStudio to get the plot the first time and then just make small amendments to get the plot for a new variable. I will cover how to do this briefly, but part of the fun will be learning to do this yourself.

#first, let's create a simple bar plot. 
p1 <- ggplot(MSc_long, aes(x=anx.intervention, y=WM_Accuracy, fill=Timepoint)) +
  geom_bar(stat = "summary", fun = "mean", position = "dodge") +
  theme_cowplot() +
  ggtitle("WM Accuracy Barplot") +
  ylab("Mean WM Accuracy") +
  xlab("Anxiety Group")
#for the record anxiety group 1 = control, anxiety group 2 = REBT, and 
#anxiety group 3 = self-talk
p1

Above is our bar plot. You can see that the anxiety group makes up the x-axis (1 = control, 2 = REBT, 3 = self-talk), mean WM accuracy is up the y-axis, and the coloured bars denote the timepoints. However, from this we cannot see the distribution of the data, nor the error or deviation around the measure of central tendency (i.e., the mean). Now let’s get a boxplot (below).
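Before moving on, one partial fix for a bar plot is to add error bars. The sketch below is an optional extra rather than part of the worksheet’s own plots. It uses a toy dataframe (`toy_bar`, with made-up values) and adds standard-error bars with stat_summary() and ggplot2’s built-in mean_se function.

```r
library(ggplot2)

# Toy long-format data with made-up values
toy_bar <- data.frame(
  anx.intervention = rep(c(1, 2, 3), each = 4),
  Timepoint        = rep(c("1", "2"), times = 6),
  WM_Accuracy      = c(31, 30, 35, 33, 34, 29, 31, 38, 31, 30, 33, 33)
)

p_err <- ggplot(toy_bar, aes(x = factor(anx.intervention), y = WM_Accuracy,
                             fill = Timepoint)) +
  # bars show the mean of each group at each timepoint
  geom_bar(stat = "summary", fun = "mean",
           position = position_dodge(width = 0.9)) +
  # error bars show the mean plus and minus one standard error
  stat_summary(fun.data = mean_se, geom = "errorbar",
               position = position_dodge(width = 0.9), width = 0.2) +
  ylab("Mean WM Accuracy") +
  xlab("Anxiety Group")
p_err
```

Error bars still hide the shape of the distribution, though, which is where the boxplot code that follows comes in.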

p2 <- ggplot(MSc_long, aes(y=WM_Accuracy, fill=Timepoint)) +
  geom_boxplot() +
  theme_cowplot() +
  ggtitle("WM Accuracy Boxplot") +
  ylab("WM Accuracy") +
  xlab("Anxiety Group") +
  facet_grid(. ~ anx.intervention) +
  scale_x_discrete(labels=c('', '', ''))
p2

The above is a box plot. It’s a little better than a bar plot because we can see more information. The thick black horizontal line is the median, and the box itself spans the lower (lower on the plot) and upper (higher on the plot) quartiles. The black lines (referred to as “whiskers”) extend to the lowest and highest scores that fall within 1.5 times the interquartile range of the box, and any black dots beyond the whiskers indicate outliers (extreme scores). If there are no outliers, the whiskers simply end at the minimum and maximum scores.
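If you want to see the numbers behind a box plot, base R can calculate them for you. The sketch below uses a made-up vector of scores (`scores`) and the base R function boxplot.stats(). Note that boxplot.stats() uses a slightly different quartile calculation (“hinges”) from ggplot2, so on real data the values may differ a little from what geom_boxplot() draws.

```r
# Made-up scores with one extreme value
scores <- c(28, 30, 30, 31, 31, 33, 34, 35, 50)

box_numbers <- boxplot.stats(scores)

# Lower whisker, lower quartile, median, upper quartile, upper whisker
box_numbers$stats   # 28 30 31 34 35

# Scores beyond the whiskers, which a box plot would draw as dots
box_numbers$out     # 50
```

Here the box runs from 30 to 34, so anything more than 6 points (1.5 times 4) above the box counts as an outlier. That is why the upper whisker stops at 35 and the score of 50 is flagged separately.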

But we can level up our visualisation one more time. To do this, we’re going to build raincloud plots with ggplot2 & ggrain. To improve data visualisation, some researchers have suggested plotting the raw data points, using a “split-half” violin aesthetic, and including a standardised visualisation of central tendency (e.g., mean/median) and error, as in a boxplot (Allen et al., 2019).

Let’s create a raincloud plot for our data set below. You can copy the code below to get an example. Remember, just like before I will not do all of the plots for you. You will have to mess around with the code to get all of the plots (i.e., plots for all of the four key variables).

Raincloud with boxplots

#install and load a new package - remove the hashtag on the code below
#install.packages("ggrain")
library(ggrain)

p3 <- ggplot(MSc_long, aes(x = anx.intervention, y = WM_Accuracy, fill = Timepoint)) +
  geom_rain(cov = "Timepoint", alpha = .5, boxplot.args.pos = list(width = 0.1, 
  position = position_nudge(x = .26)), violin.args.pos = list(side = "r",
  width = 1.4, position = position_nudge(x = 0.4))) +
  theme_cowplot() +
  facet_grid(. ~ anx.intervention) +
  coord_flip() +
  ggtitle("Raincloud plot for WM accuracy") +
  ylab("WM Accuracy") +
  xlab("Anxiety Group") +
  scale_x_discrete(labels=c('', '', ''))
p3

And there we have it, a raincloud plot for three anxiety intervention groups (1 = control, 2 = REBT, 3 = self-talk) and two timepoints (pre and post intervention) for the outcome variable WM accuracy. This plot has some of the best features of the plots earlier (e.g., the box plot) alongside some of the other key features of a good visualisation (e.g., density plots and the raw data points). Is the plot perfect? No. But it’s a step in the right direction and may be something you edit further to make even better. Finally, you can copy this code to your own RStudio and edit it accordingly (probably with help from the internet - the references below should help as well).

NOTE my last real thing to say for now is a tip for editing these plots to cover the other key variables (i.e., I’ve only ever worked with the WM Accuracy variable and you have three others). The only thing you should need to change to get the same three plots for the other variables is the old variable’s name, wherever you see it (including in the title and axis label). Do let me know if you need help with that.
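If you find yourself copying the same plot code four times, one optional shortcut is to wrap it in a small function so that the variable name is the only thing you change. The sketch below is an assumption-laden example, not part of the worksheet: `make_boxplot()` is a made-up helper, `toy_dvs` is toy data with made-up values, and the plot is a simplified boxplot rather than an exact copy of the ones above. It uses `.data[[dv]]` so the column can be given as text.

```r
library(ggplot2)

# Toy long-format data with made-up values (not the real MSc_long)
toy_dvs <- data.frame(
  anx.intervention = rep(c(1, 2, 3), each = 4),
  Timepoint        = rep(c("1", "2"), times = 6),
  WM_Accuracy      = c(31, 30, 35, 33, 34, 29, 31, 38, 31, 30, 33, 33),
  WM_Efficiency    = c(3.5, 4.0, 3.9, 4.2, 4.2, 4.5, 6.2, 5.5, 3.5, 4.0, 4.1, 4.0)
)

# Hypothetical helper: dv is the column name as text, dv_label is the axis label
make_boxplot <- function(data, dv, dv_label) {
  ggplot(data, aes(x = Timepoint, y = .data[[dv]], fill = Timepoint)) +
    geom_boxplot() +
    facet_grid(. ~ anx.intervention) +
    ggtitle(paste(dv_label, "Boxplot")) +
    ylab(dv_label)
}

# Same plot, two different variables
p_acc <- make_boxplot(toy_dvs, "WM_Accuracy", "WM Accuracy")
p_eff <- make_boxplot(toy_dvs, "WM_Efficiency", "WM Efficiency")
p_eff
```

With the real data you would pass MSc_long instead of the toy dataframe.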

Final word

Thanks for taking the time to read through this. I hope it’s been enjoyable learning a new skill. If you have any questions, reach out to me at J.Brimmell@bolton.ac.uk or on Twitter @JACKBRIMMELL.

Dr. Jack Brimmell. Feel free to share with credit to the original GitHub page: https://github.com/brimmelljack/RaincloudsMSc-SEP7006

References

Allen, M., Poggiali, D., Whitaker, K., Marshall, T. R., & Kievit, R. A. (2019). Raincloud plots: a multi-platform tool for robust data visualization. Wellcome Open Research, 4.

Cairo, A. (2016). The truthful art: Data, charts, and maps for communication. New Riders.

Dr. Jack Brimmell

2023-04-04