Today we will have a cursory look at some of the packages within the so-called tidyverse, and demonstrate how to create, use and publish on the web an R Notebook, using R Markdown.
For more information, see:
These two texts written by Dylan Childs for courses at Sheffield University are very useful:
as is this by Winston Chang on how to make plots using the ggplot2 package, which is part of the tidyverse.
It is normal to load packages at the start of a script. They extend the capabilities of R, often into quite niche areas. These two are so useful that I include them at the beginning of every script, usually along with a few others.
If you have never used them before on your machine, then you need to install them using the install.packages() function:
install.packages("tidyverse")
install.packages("here")
Doing this is like installing an app on your phone: you do it once. So you do not include these lines in a script, because you will run that script many times. Instead, run them in the console pane (bottom left in RStudio).
Afterwards, you need to make them available to your R session, so in your script you would include these lines:
library(tidyverse)
library(here)
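If you are unsure whether the packages are already installed on a particular machine, a check along these lines can be run in the console first. This is only a sketch: missing_packages() is a hypothetical helper written for this illustration, not a function from any package, and it reports what is missing rather than installing anything.

```r
# Hypothetical helper: return the names of any packages that are not installed.
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE if a package can be loaded, FALSE if not
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "here"))

if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
} else {
  message("Both packages are already installed.")
}
```

Anything reported as missing can then be installed once with install.packages(), as described above.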
Here we use the incredibly useful package here to tell R where to find the data file we want, then the read_csv() function from the readr package within tidyverse to read our data file into a so-called data frame, which we have called iris. You could call it anything you want, but be sensible.
Once we have done this we should see this new ‘object’ listed in RStudio’s Environment pane, top-right.
filepath <- here("data", "iris.csv")
iris <- read_csv(filepath)
glimpse(iris)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species <chr> "setosa", "setosa", "setosa", "setosa", "setosa", "setosa…
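To try the import step without having a data/iris.csv file to hand, one option is to write R's built-in copy of the iris data to a temporary file and read that back in. This is a sketch under that assumption: the temporary file and the name iris_from_csv are stand-ins chosen for this example, and the built-in data are assumed to match the file used above.

```r
library(readr)

# Write R's built-in iris data to a temporary csv file
# (a stand-in for data/iris.csv, which this sketch does not assume exists).
tmp <- tempfile(fileext = ".csv")
write_csv(datasets::iris, tmp)

# Read it back in with read_csv(); show_col_types = FALSE only
# silences the message about the column types that were guessed.
iris_from_csv <- read_csv(tmp, show_col_types = FALSE)

dim(iris_from_csv)            # number of rows, number of columns
class(iris_from_csv$Species)  # read_csv() reads text columns as character
```

This also shows why Species appears as <chr> in the glimpse() output: read_csv() reads text columns as character and does not convert them to factors.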
Before we do any kind of statistical analysis of our data we typically have to bash it into shape in some way and just have a look at it, either by summarising it or by plotting it. This ‘wrangling’, followed by exploratory analysis of the data, often uses functions from the dplyr package within tidyverse. This package is extremely powerful and just about the main workhorse of any substantive wrangling of data.
All functions within dplyr act on data frames and produce data frames as output. Their first argument is always the data frame on which they are acting. Hence if we wanted to consider the shape of the petals, meaning by that the ratio of the length to the width of each, then we could do this:
petals.shape <- mutate(iris, Petal.Shape = Petal.Length / Petal.Width)
This would create a new object called petals.shape, a new, modified version of the original data frame iris, but now with an additional column called Petal.Shape, which has been calculated from the existing columns Petal.Length and Petal.Width.
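A small check can make this concrete: mutate() returns a new data frame and leaves the original untouched. The sketch below uses R's built-in copy of the iris data (stored under the hypothetical name iris_df) so that it runs without the csv file.

```r
library(dplyr)

iris_df <- datasets::iris  # built-in copy of the data, used as a stand-in

petals.shape <- mutate(iris_df, Petal.Shape = Petal.Length / Petal.Width)

ncol(iris_df)       # the original data frame still has its 5 columns
ncol(petals.shape)  # the new data frame has one more

# The same ratio calculated 'by hand' in base R, for comparison
by_hand <- iris_df$Petal.Length / iris_df$Petal.Width
all.equal(petals.shape$Petal.Shape, by_hand)
```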
This would be fine, but there is a better way. We can exploit the fact that the dplyr functions we use here take a data frame as their first argument, the data frame on which they are going to act, and produce a data frame as output. Like all functions in R, they may have further arguments.
This means that they can be piped together, using the output of one line as the input to the next. The glue for this is the ‘pipe’ operator %>%. This is from the magrittr package, which gets loaded as part of tidyverse. The name is a joke, geddit? I didn’t.
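To see what the pipe is doing, it can help to write the same steps with and without it. This sketch again uses the built-in iris data under the hypothetical name iris_df; the column name mean_PL is likewise just chosen for the example.

```r
library(dplyr)

iris_df <- datasets::iris

# Without the pipe, calls are nested and have to be read inside-out
nested <- summarise(group_by(iris_df, Species), mean_PL = mean(Petal.Length))

# With the pipe, the left-hand side becomes the first argument of the
# next function, so the steps read from top to bottom
piped <- iris_df %>%
  group_by(Species) %>%
  summarise(mean_PL = mean(Petal.Length))

all.equal(nested, piped)  # both routes give the same result
```

From R 4.1 onwards there is also a built-in pipe, |>, which behaves in much the same way for simple cases like this one.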
A very common sequence of code lines is something like the following. In this, we do stuff to the data if we have to, such as using mutate() to create new columns of values, perhaps using values from others, as here, then group_by() followed by summarise() to calculate summary statistics, such as a mean and standard error for groups of the data, grouped for example by site, or species and so on.
iris %>%
  mutate(Petal.Shape = Petal.Length / Petal.Width) %>% # create a new column Petal.Shape
  group_by(Species) %>%
  summarise(
    mean_PS = mean(Petal.Shape),
    se_PS = sqrt(var(Petal.Shape) / n())
  )
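The standard error here is calculated as the square root of the variance divided by the group size, which is the same thing as the standard deviation divided by the square root of the group size. A sketch that stores the summary in an object and checks this is shown below; the names petal_summary, n_flowers and se_alt are hypothetical, and the built-in iris data stand in for the csv file.

```r
library(dplyr)

petal_summary <- datasets::iris %>%
  mutate(Petal.Shape = Petal.Length / Petal.Width) %>%
  group_by(Species) %>%
  summarise(
    n_flowers = n(),                       # number of rows in each group
    mean_PS = mean(Petal.Shape),
    se_PS = sqrt(var(Petal.Shape) / n()),  # as in the code above
    se_alt = sd(Petal.Shape) / sqrt(n())   # the equivalent sd / sqrt(n) form
  )

petal_summary  # one row per species

all.equal(petal_summary$se_PS, petal_summary$se_alt)
```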
And now we can plot the data:
A really powerful package to use for this is ggplot2. Along with dplyr which we used above, this is probably the part of tidyverse that you would use most.
Here is an example of how we could do this:
iris %>%
  ggplot(aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  geom_point() +
  theme_classic()
And here is an example of how we could first create some new variables, using mutate(), then ‘pipe’ the amended version of our data into ggplot, and this time also change the labels on the axes:
iris %>%
  mutate(Petal.Shape = Petal.Length / Petal.Width) %>% # create a new column Petal.Shape
  mutate(Sepal.Shape = Sepal.Length / Sepal.Width) %>% # create a new column Sepal.Shape
  ggplot(aes(x = Petal.Shape, y = Sepal.Shape, colour = Species)) +
  geom_point() +
  labs(x = "Petal Shape",
       y = "Sepal Shape") +
  theme_classic()
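One thing that often trips people up in code like this is that the data steps are joined with %>% while the plot layers are joined with +. The sketch below separates the two stages to make that visible, and also shows that a plot can be stored as an object and added to later. The names shaped, p and p_final are hypothetical, and the built-in iris data are used as a stand-in.

```r
library(dplyr)
library(ggplot2)

# Stage 1: data steps, joined with %>%
shaped <- datasets::iris %>%
  mutate(
    Petal.Shape = Petal.Length / Petal.Width,
    Sepal.Shape = Sepal.Length / Sepal.Width
  )

# Stage 2: plot layers, joined with +
p <- ggplot(shaped, aes(x = Petal.Shape, y = Sepal.Shape, colour = Species)) +
  geom_point()

# A stored plot can be extended later on
p_final <- p +
  labs(x = "Petal Shape", y = "Sepal Shape") +
  theme_classic()

# Typing p_final on its own line in the console or a notebook chunk displays it
```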
or we could do a bar plot with error bars, for example of the Sepal lengths of the different species:
iris %>%
  group_by(Species) %>%
  summarise(mean_vals = mean(Sepal.Length), se_vals = sqrt(var(Sepal.Length) / n())) %>%
  ggplot(aes(x = Species, y = mean_vals, fill = Species)) +
  geom_col() +
  geom_errorbar(aes(ymin = mean_vals - se_vals, ymax = mean_vals + se_vals, colour = Species), width = 0.2) +
  scale_colour_brewer(palette = "Blues") +
  scale_fill_brewer(palette = "Blues") +
  labs(x = "Iris species",
       y = "Mean Sepal length (cm)") + # the iris measurements are in centimetres
  theme_classic()
Rendering the notebook is easy. Just press the ‘Knit’ button at the top of the script pane. A rendered HTML, PDF or Word version of your script will appear, and will also be stored in the same folder as the R Markdown script that generated it. If you choose the HTML option, as I always do, you can view the result in a browser.
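The same rendering can also be started from code rather than from the button, using the render() function from the rmarkdown package. This is only a sketch: render_notebook() is a hypothetical wrapper and "my_notebook.Rmd" is a placeholder file name, so the example call is left commented out.

```r
# Hypothetical wrapper around rmarkdown::render()
render_notebook <- function(rmd_file, format = "html_document") {
  # Renders the given R Markdown file in the chosen output format
  rmarkdown::render(input = rmd_file, output_format = format)
}

# Example call, commented out because "my_notebook.Rmd" is a placeholder:
# render_notebook("my_notebook.Rmd")
```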
Publishing is even easier. Just press the Publish button at the top-right of the notebook. This will make a copy of the notebook appear on the RPubs website, for which you first need to register. Caution: anything on there is visible to the public.
This notebook can be found here
If I change anything in the script, I can just as easily republish the amended notebook.
Here are some graphical ways to check whether a data set is normally distributed
Here is something on the vegan package, based on the chapter on that from Mark Gardener’s book.
That last one included some fancy maths, which you can easily include in a notebook using elementary LaTeX, like this:
\[ H=-\sum {p_i\log_b{p_i}} \]
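The formula above can be turned into a few lines of R to see how it behaves. This is a minimal sketch rather than the vegan implementation: shannon() is a hypothetical function written for this example, it takes a vector of counts per species, and it assumes that species with a count of zero contribute nothing to the sum.

```r
# Hypothetical function: Shannon index H = -sum(p_i * log_b(p_i))
shannon <- function(counts, base = exp(1)) {
  p <- counts / sum(counts)  # proportions p_i
  p <- p[p > 0]              # drop zero proportions, since log(0) is -Inf
  -sum(p * log(p, base = base))
}

shannon(c(10, 10), base = 2)  # two equally common species: H = 1 in base 2
shannon(c(20, 0, 0))          # only one species present: H = 0
shannon(c(5, 3, 2))           # natural logs are used by default
```

In practice the diversity() function in the vegan package calculates this index for you, so a hand-written version like this is mainly useful for checking your understanding of the formula.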