Some background information

For this practical, we’ll be using open data from the following source: Wright, D. and London, K. (2011). Modern Regression Techniques Using R, London: Sage. (Chapter 8). This is available at http://dx.doi.org/10.4135/9780857024497.

This dataset contains information about a randomised controlled trial, in which primary school pupils were asked how much time they spent doing exercises each week. Their classes (within schools) were then assigned to one of 4 groups, and the pupils in each class received the corresponding treatment:

  1. Control (no treatment at all)
  2. Leaflet (information about exercising)
  3. Leaflet+plan (the leaflet and a plan of exercise)
  4. Leaflet+plan+quiz (leaflet, plan and a quiz to do about exercising)

Pupils were asked again about the time they spent doing exercises (weekly) some time after the treatments were given.

The variable sqw2 is the outcome (time spent doing exercises after treatment, standardised). The variable sqw1 is the same measure before the treatment. The variable wcond is the group or condition to which the pupil’s class was assigned.

1. Define a working directory

You can use any directory on your computer, as in the example below:

setwd("C:/myfolder")
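If you are unsure where R is currently reading and writing files, a minimal sketch like the following may help. It uses the temporary folder returned by tempdir() purely as a stand-in for your own directory, so that it runs on any machine; in practice you would substitute your own path.

```r
# Remember the current working directory so it can be restored afterwards
old_wd <- getwd()

# tempdir() is only a stand-in here; replace it with your own folder
target <- tempdir()

# Only change directory if the folder actually exists
if (dir.exists(target)) {
  setwd(target)
}

# Confirm where R will now read and write files
current_wd <- getwd()
print(current_wd)

# Restore the original working directory
setwd(old_wd)
```

Checking with getwd() after setwd() is a quick way to catch a mistyped path before you try to read or save anything.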

2. Read in data

The data come from the companion website of Wright and London’s book.

mydata<-read.delim("https://us.sagepub.com/sites/default/files/upm-binaries/26934_exercise.dat")

Note: You can safely ignore the warning when reading in the data.
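After reading in any file, it is worth checking that it arrived as expected. The sketch below does this on a tiny made-up tab-separated table (not the real exercise data), so the values and the object name toy are assumptions for illustration only. The same three checks can be run on mydata.

```r
# Made-up, tab-separated toy data (NOT the real exercise file)
toy_text <- "class\twcond\tsqw1\tsqw2\n1\t1\t1.2\t1.5\n1\t1\t1.8\t2.0\n2\t3\t2.1\tNA\n2\t3\t1.6\t1.9\n"

# read.delim() expects tab-separated text with a header row by default
toy <- read.delim(text = toy_text)

str(toy)             # variable names and types
head(toy)            # first rows of the data
colSums(is.na(toy))  # number of missing values per variable
```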

3. Descriptive statistics

R is very intuitive for certain basic tasks. For example, you can obtain the mean and standard deviation of a variable with the functions mean and sd:

mean(mydata$sqw2, na.rm = TRUE)
## [1] 1.78919
mean(mydata$sqw1, na.rm = TRUE)
## [1] 1.680725
sd(mydata$sqw2, na.rm = TRUE)
## [1] 0.5633153
sd(mydata$sqw1, na.rm = TRUE)

For other basic functions, you can consult the R reference card (https://cran.r-project.org/doc/contrib/Short-refcard.pdf). On page 2, under the “Math” heading, you will find some descriptive statistics of interest.
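To see why na.rm = TRUE matters, and to try a few of the other descriptive functions, here is a small sketch on a made-up vector. The values in x are invented for illustration and are not taken from the exercise data.

```r
# Made-up values with one missing observation
x <- c(1.2, 1.8, NA, 2.1, 1.6)

mean(x)                  # NA, because one value is missing
mean(x, na.rm = TRUE)    # mean of the four observed values
median(x, na.rm = TRUE)  # middle of the observed values
range(x, na.rm = TRUE)   # smallest and largest observed values
summary(x)               # quartiles, mean and a count of NAs
```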

However, since this is a cluster randomised controlled trial, you might also want to know the means and standard deviations by class.

Task 1: Statistics by group

Create a summary dataset containing the mean and standard deviation of the outcome, grouped by class.

For this, you’ll need to load the dplyr package:

library(dplyr)

summary_data <- mydata %>%
  group_by(class) %>%
  summarise(mean_sqw2=mean(sqw2),
            stdev_sqw2=sd(sqw2))

You can browse your newly created dataset by clicking on it in the environment tab, or you can type summary_data to print it on the console.

summary_data 
## # A tibble: 22 x 3
##    class mean_sqw2 stdev_sqw2
##    <int>     <dbl>      <dbl>
##  1     1      1.54      0.685
##  2     2      2.05      0.547
##  3     3      1.79      0.542
##  4     4      1.90      0.470
##  5     5      1.78      0.583
##  6     6      2.07      0.389
##  7     7      1.68      0.602
##  8     8      1.80      0.546
##  9     9      1.53      0.572
## 10    10      1.87      0.537
## # ... with 12 more rows
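A grouped summary can hold more than two statistics. The sketch below, on made-up toy data, also counts the pupils and the missing outcomes per class and passes na.rm = TRUE so that a single missing value does not turn a class mean into NA. Whether the real data contain missing outcomes in any class is not shown above, so treat this as a precaution rather than a required change.

```r
library(dplyr)

# Made-up toy data: two classes, one missing outcome
toy <- data.frame(
  class = c(1, 1, 1, 2, 2),
  sqw2  = c(1.5, 2.0, NA, 1.9, 2.3)
)

toy_summary <- toy %>%
  group_by(class) %>%
  summarise(n_pupils   = n(),               # rows per class
            n_missing  = sum(is.na(sqw2)),  # missing outcomes per class
            mean_sqw2  = mean(sqw2, na.rm = TRUE),
            stdev_sqw2 = sd(sqw2, na.rm = TRUE))

toy_summary
```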

If you want to export this dataset and use it in another software package, like Excel, you can use the write family of functions.

write.csv(summary_data, "summary_data.csv")

That will save a csv file (which you can open in Excel) in your working directory.
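By default write.csv also writes the row numbers as an extra unnamed column, which can be confusing in Excel. A minimal sketch of avoiding that is shown here, using a made-up table and a temporary file (both are assumptions for illustration, so nothing is written to your working directory).

```r
# Made-up summary table standing in for summary_data
toy_summary <- data.frame(class = 1:3,
                          mean_sqw2 = c(1.5, 2.0, 1.8))

# tempfile() keeps this sketch out of your working directory
out_file <- tempfile(fileext = ".csv")

# row.names = FALSE leaves out the column of row numbers
write.csv(toy_summary, out_file, row.names = FALSE)

# Read the file back in to check that it round-trips
check <- read.csv(out_file)
check
```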

Task 2: Relationship between two variables

You can obtain the Pearson correlation by typing:

cor(mydata$sqw2, mydata$sqw1, use="complete.obs") # use only pupils observed on both variables
## [1] 0.7648664
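The sketch below illustrates what restricting the correlation to complete observations does, using two made-up vectors (before and after are invented names and values, not the real data). It also shows cor.test, which reports a confidence interval and p-value alongside the estimate.

```r
# Made-up paired measurements with missing values
before <- c(1.2, 1.8, 2.1, NA, 1.6, 2.4)
after  <- c(1.5, 2.0, NA, 1.9, 1.7, 2.6)

cor(before, after)  # NA: by default missing values are not dropped

# Keep only the pairs where both values are observed
r <- cor(before, after, use = "complete.obs")
r

# The same value computed by hand on the complete pairs
ok <- complete.cases(before, after)
cor(before[ok], after[ok])

# cor.test() adds a confidence interval and a p-value
cor.test(before, after)
```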

A basic bivariate plot (using ggplot) specifies the name of the dataset, a y and an x variable, and the geom, that is, the figure/symbol used to represent the values.

First, we need to load the ggplot2 package. The plot then has the following form:

library(ggplot2)

plot1<-ggplot(mydata, 
              aes(y=sqw2, x=sqw1)) +
  geom_point()

plot1

Saving a plot to an object lets you redraw it, or add layers to it, without retyping the code. It is good practice, although not essential.

You can modify plot1 by including a colour argument inside the geom_point() call, for example: geom_point(colour="red")
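As a minimal sketch of this, the code below draws red points on a made-up data frame (toy and its values are assumptions standing in for mydata). The size argument is an extra illustration, not part of the original example.

```r
library(ggplot2)

# Made-up toy data standing in for mydata
toy <- data.frame(sqw1 = c(1.2, 1.8, 2.1, 1.6, 2.4),
                  sqw2 = c(1.5, 2.0, 2.2, 1.7, 2.6))

# A fixed colour goes outside aes(): every point is drawn in red
plot_red <- ggplot(toy, aes(y = sqw2, x = sqw1)) +
  geom_point(colour = "red", size = 2)

plot_red
```

Placing colour inside aes() instead would map it to a variable, which is what the per-class plots later in this practical do.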

For more tips and example code, you can visit the following link: http://www.cookbook-r.com/Graphs/

ggplot2 works with “layers”. The basic plot above can be enhanced by adding a different geom (layer) that gives further instructions to ggplot. Below we’ll see some examples:

Task 3: Fitted line

Add a single fitted line to plot1.

plot2 <- ggplot(mydata, aes(y=sqw2, x=sqw1)) +
  geom_point() +
  geom_smooth(method="lm", se=TRUE)  # linear fit with a confidence band; inherits y and x from ggplot()

plot2
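The line drawn by geom_smooth(method="lm") is an ordinary linear regression of y on x. If you want the numbers behind the line, a sketch along these lines may help; it uses a made-up data frame (toy) rather than the real data.

```r
# Made-up toy data standing in for mydata
toy <- data.frame(sqw1 = c(1.2, 1.8, 2.1, 1.6, 2.4),
                  sqw2 = c(1.5, 2.0, 2.2, 1.7, 2.6))

# The same kind of straight-line fit that geom_smooth(method = "lm") draws
fit <- lm(sqw2 ~ sqw1, data = toy)

coef(fit)     # intercept and slope of the fitted line
confint(fit)  # 95% confidence intervals for both coefficients
```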

Task 4: Fitted line per class

Add fitted lines per class. To do this, you don’t need to add a different geom; instead, insert a group argument inside the aes() call.

plot3 <- ggplot(mydata, aes(y=sqw2, x=sqw1, group=factor(class))) +
  geom_point() +
  geom_smooth(method="lm", aes(colour=factor(class)), se=FALSE)  # one coloured line per class, no confidence band

plot3
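Each line in a plot like this is a separate regression fitted within one class. The sketch below reproduces that idea numerically on made-up data with two classes (toy and its values are invented for illustration), returning one intercept and one slope per class.

```r
# Made-up toy data: two classes with four pupils each
toy <- data.frame(
  class = rep(c(1, 2), each = 4),
  sqw1  = c(1.0, 1.5, 2.0, 2.5, 1.0, 1.5, 2.0, 2.5),
  sqw2  = c(1.2, 1.6, 2.1, 2.4, 1.5, 1.6, 1.8, 1.9)
)

# Split the data by class and fit a separate regression in each subset
fits <- lapply(split(toy, toy$class),
               function(d) coef(lm(sqw2 ~ sqw1, data = d)))

# One row per class, with its own intercept and slope
class_lines <- do.call(rbind, fits)
class_lines
```

Comparing the slopes across rows gives a rough sense of how much the before/after relationship varies between classes.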

Bonus track

It’s easy to produce more elaborate graphics with ggplot2. Look at the example below. Try to identify the different layers that were added to plot3.


plot4 <- ggplot(mydata, aes(y=sqw2, x=sqw1, group=factor(class))) +
  geom_point(aes(colour=factor(class))) +
  geom_smooth(method="lm", se=FALSE) +
  facet_wrap(~class) +
  ylab("After") + xlab("Before") +
  theme_minimal() +
  theme(legend.position = "none") +  # placed after theme_minimal(), which would otherwise reset it
  ggtitle("Relationship between exercise before and after treatment by class")

plot4
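One way to see how a plot like this is assembled is to build it up step by step and inspect what each step adds. The sketch below does so on made-up data (toy and the intermediate object names are assumptions for illustration). Note that, strictly speaking, ggplot2 stores only geoms and stats as layers; facets, labels and themes are separate components of the plot.

```r
library(ggplot2)

# Made-up toy data: two classes with four pupils each
toy <- data.frame(
  class = rep(c(1, 2), each = 4),
  sqw1  = c(1.0, 1.5, 2.0, 2.5, 1.0, 1.5, 2.0, 2.5),
  sqw2  = c(1.2, 1.6, 2.1, 2.4, 1.5, 1.6, 1.8, 1.9)
)

# Step 1: points only
base_plot <- ggplot(toy, aes(y = sqw2, x = sqw1)) + geom_point()

# Step 2: add a fitted line
with_line <- base_plot + geom_smooth(method = "lm", se = FALSE)

# Step 3: split into one panel per class (a facet, not a layer)
with_facets <- with_line + facet_wrap(~class)

length(base_plot$layers)    # number of layers after step 1
length(with_facets$layers)  # number of layers after step 3

with_facets
```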

Data manipulation and graphics in R
Latest update: February 2020

Patricio Troncoso

patricio.troncoso@manchester.ac.uk