Introduction

This next section of the practical will introduce you to the programming language R, the lingua franca of statistical computing. Unlike point and click software, everything you do in R you do from the command line. Although this can seem a little daunting at first, it’s actually not that hard to learn, and the benefits of doing so are enormous.If you’re very new to R I’d suggest checking out this link that introduces you to the interface and some basic stats

Why R?

It’s free and open source
It’s flexible: ecology and environmental science data are notoriously messy, which typically necessitates complex ad hoc approaches to analyses
It’s relatively easy to learn
It’s supported by a large community of developers and users
There are lots of online materials/help/tutorials etc
There’s a package (collection of functions) for everything
It’s flexible (worth reiterating)
You will use it increasingly if you do upper levels of environmental science and ecology

For those of you who want to develop your skills further, the School of Biological, Earth and Environ- mental Sciences and the School of Mathematics and Statistics has set up this very useful web site with many tools from simple manipulating data (getting started with R), to the complex statistical analyses of data. Also, the UNSW codeRs is a student and staff run community dedicated for ‘R’ users. They set up events that cater to learning R. They meet fortnightly, alternating between workshops, drop-in help sessions, and other social events. You can visit them here

Aim

In this section we will be walking you through the process of data analysis (using R and R studio) for Assessment 2 - Biodiversity Report. We will:

Do exploratory analysis of the inverts abundance data.
Create graphs.
Calculate summarizing statistic (mean, standard deviation).

Meeting R & R Studio

R is a free statistical program used commonly by (but not exclusive to) biologists and psychologists. R studio is the user friendly extension that makes the R console easier to use.

Think of R like a car engine and RStudio like the body of the car, RStudio will not run without R.

To find out more about R and RStudio, like how to customize your RStudio, you can visit this link.

In the default RStudio layout you’ll see four panes in RStudio:

The code editor (script window) in the top left pane where you write your code before ‘sending’ it to R;
The R console in the bottom left pane where R works its magic;
The workspace in the top right pane where you can view all R objects (e.g. data files); and
A pane to view plots, help files, available packages (‘add ons’) etc. in the bottom right pane.

Let’s code!

Preparing the workspace

Step 1: Create a folder on your computer and name it “Rdata”

Step 2: Download your class data from moodle: Week 5 > Class Data > Select your class folder > download the inverts.csv file. Save this file in the “Rdata” folder you just made

You must use your scheduled practicals data

Step 3: Create a new R Project File: File > new project > existing directory > browse and select your “Rdata” folder > create project

When you revisit your data next time, simply go into your Rdata folder and select the file with the suffix ‘.Rproj’. Setting up Rprojects is the best practice when it comes to organising your data files and scripts. It will save you from the headache of manually setting your working directory every session.

Step 4: Start a new Script. File > new file > R script.

Save this script now in the Rdata file

Step 5: Clear R’s memory - just in case it is holding any old data from a previous analysis.

Hint: press the “copy code” button in the left corner of the chunk below and paste it into your own script

rm(list=ls())

To run the code on any line, place your cursor on the line you wish to run and press ‘Ctrl + Enter’ to send the command to the R console. Alternatively, you can click the Run icon in the script panel

Step 6: Load & see your data into R

inverts_dat <- read.csv(file = "inverts.csv", header = TRUE)

In the code above, you’ve assigned the name of this file to inverts_dat. From now on, your data will be referred to as inverts_dat.

See the structure of your data:

str(inverts_dat)

## 'data.frame':    152 obs. of  4 variables:
##  $ site  : chr  "CP" "CP" "CP" "CP" ...
##  $ group : int  1 2 5 6 6 1 5 1 1 2 ...
##  $ invert: chr  "Amphipod A" "Amphipod A" "Amphipod A" "Amphipod A" ...
##  $ count : int  1 1 1 1 2 3 8 6 24 30 ...

You can also view your data in a table:

This data will appear as a new tab in your RStudio

View(inverts_dat)

There are 5 variables (shown as columns) in your data:

site, with two possible values reflecting collection location Centennial Park (CP) and Randwick Environment Park (REP).
invert, with several possible values reflecting the type of invertebrate (e.g. ant A).
count, with values reflecting the abundance of each insect in each observation.
group, with values reflecting the identity of the group (1, 2, 3, etc) from each class.

Step 7: Install & Load packages

Packages are like mods for R. These may take a minute or so to load. Look for the small stop sign located in the right-hand corner of the R-consol window.

You can press “No” if a window pops up asking you if you want to re-start R.

Installing packages:

install.packages("tidyverse")

install.packages("ggthemes")

Loading packages:

library("tidyverse")

library("ggthemes")

Exploring the Data

Step 1: We will create two barplots that show the abundance if leaf litter intervebrates at each site.

Code for Centennial Park:

See Question 1 in the workbook

You can use the zoom button in the graphing panel to view the figure in more detail.

inverts_dat %>%
filter(site == "CP") %>% 
  group_by(invert) %>%
  ggplot(aes(x = reorder(invert, -count), y = count, fill = site)) +
  geom_bar(stat = "identity", width = .7) +
  labs(x = "Invertebrate species", y = "Abundance") +
  theme_bw() +
  ggtitle("Abundance of leaf litter invertebrates at Centennial Park") +
  theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = -45,hjust = 0))

Code for Randwick Environment Park:

inverts_dat %>%
filter(site == "REP") %>%
  group_by(invert) %>%
  ggplot(aes(x = reorder(invert, -count), y = count, fill = site)) +
  geom_bar(stat = "identity", width = .7) +
  labs(x = "Invertebrate Species", y = "Abundance") +
  theme_bw() +
  ggtitle("Abundance of leaf litter invertebrates at REP") +
  theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = -45, hjust = 0, size = 6))

DO NOT INCLUDE THESE GRAPHS IN YOUR REPORT - THIS IS RAW DATA

Analysing the Data

There are several ways to measure biodiversity. The most common is to assess species richness, which is the total number of distinct species within a local community.

Having many species generally coincides with having a diverse and healthy ecosystem. However, the relative abundance, the number of individuals per species, also needs to be considered.

An informative way to measure biodiversity that combines richness and abundance is the Shannon biodiversity index (H). The Shannon index measures how similar the abundances of different species are in the community. The Shannon diversity index provides more information about community composition than simply species richness (i.e. the number of species present) by accounting for the relative abundance of the different species. The higher the value of H, the higher the diversity of species in a particular community. The lower the value of H, the lower the diversity. A value of H = 0 indicates a community that only has one species.

Measuring species abundance

Step 1: Creating an abundance table

abund_group <- summarise(group_by(inverts_dat, site, group),
abundance = sum(count))

We can see which group found the most invertebrates:

abund_group

## # A tibble: 12 × 3
## # Groups:   site [2]
##    site  group abundance
##    <chr> <int>     <int>
##  1 CP        1        40
##  2 CP        2       157
##  3 CP        3        16
##  4 CP        4       113
##  5 CP        5       379
##  6 CP        6        38
##  7 REP       1        33
##  8 REP       2        84
##  9 REP       3        32
## 10 REP       4        36
## 11 REP       5        55
## 12 REP       6        43

Step 2: Making boxplot figures

If you would like some help understanding boxplots I suggest you read this blogpost.

To make a boxplot we’ll need to install another package:

install.packages("patchwork")

When the installation is done, load the package:

library(patchwork)

Now we are ready to make our plot:

See Question 2 in the workbook

abund_graph <- 
  abund_group %>%
  ggplot(aes(x = site, y = abundance, fill = site)) +
  geom_boxplot(width = 0.15, position = position_dodge(width = 10)) +
  geom_jitter(shape = 1, position = position_jitter(0.05)) + 
  labs(x = "Site", y = "Number of invertebrates") + 
  theme_bw() +
  theme(panel.border= element_blank(), axis.line.x = element_line(colour = "black",size = 1), axis.line.y = element_line(colour ="black", size = 1), legend.position = "none")

Let’s see the graph:

abund_graph

Let’s export and save this graph. You can find this graph in the “Rdata” file you created in the very beginning.

ggsave("Invertebrate_Abundance.png", plot=abund_graph, width=6, height=4, dpi=300)

Step 3: Calculating the mean and standard error for abundance

R doesn’t have a code for calculating the standard error, so we will make our own function:

standard_error <- function(x) sd(x) / sqrt(length(x))

Now we can run this function to calculate mean and standard error

abund_all <- summarise(group_by(abund_group, site),
Mean = mean(abundance),
SE = standard_error(abundance))

View our results:

See Question 3 in the workbook

abund_all

## # A tibble: 2 × 3
##   site   Mean    SE
##   <chr> <dbl> <dbl>
## 1 CP    124.  55.5 
## 2 REP    47.2  8.15

Now Let’s export these stats:

write.csv(abund_all, "Mean&SE_abundance.csv")

Mean versus Median: Both are measures of where the center of a data set lies (called “Central Tendency” in stats), but they are usually different numbers. For example, take this list of numbers: 10, 10, 20, 40, 70. The mean (informally, the “average“) is found by adding all of the numbers together and dividing by the number of items in the set: 10 + 10 + 20 + 40 + 70 / 5 = 30. The median is found by ordering the set from lowest to highest and finding the exact middle. The median is just the middle number: 20.

Measuring species Richness

Step 1: Creating a richness table

rich_group <- summarise(group_by(inverts_dat, site, group),
richness = length(unique(invert)))

Let’s view it:

rich_group

## # A tibble: 12 × 3
## # Groups:   site [2]
##    site  group richness
##    <chr> <int>    <int>
##  1 CP        1        9
##  2 CP        2       11
##  3 CP        3        5
##  4 CP        4       20
##  5 CP        5       19
##  6 CP        6       13
##  7 REP       1       12
##  8 REP       2       10
##  9 REP       3       13
## 10 REP       4       13
## 11 REP       5       14
## 12 REP       6       13

Step 2: Making boxplot figures

See Question 4 in the workbook

rich_graph <- rich_group %>%
  ggplot(aes(x = site, y = richness, fill = site)) +
  geom_boxplot(width = 0.15, position = position_dodge(width = 10)) +
  geom_jitter(shape = 1, position = position_jitter(0.05)) +
  labs(x = "Site", y = "Species richness") +
  theme_bw() +
  theme(panel.border = element_blank(), axis.line.x = element_line(colour = "black", size = 1), axis.line.y = element_line(colour = "black", size = 1), legend.position = "none")

Let’s see the graph:

rich_graph

Let’s export and save this graph. You can find this graph in the “Rdata” file you created in the very beginning.

ggsave("Invertebrate_Richness.png", plot=rich_graph, width=6, height=4, dpi=300)

Step 3: Calculating the mean and standard error

rich_all <- summarize(group_by(rich_group, site),
Mean = mean(richness),
SE = standard_error(richness))

View our results:

See Question 5 in the workbook

rich_all

## # A tibble: 2 × 3
##   site   Mean    SE
##   <chr> <dbl> <dbl>
## 1 CP     12.8 2.37 
## 2 REP    12.5 0.563

Now Let’s export these stats:

write.csv(rich_all, "Mean&SE_richness.csv")

Measuring Shannon diversity index

Finally, calculate the Shannon diversity index (H) for each group in each habitat type. Remember that the higher the value of H, the higher the diversity of species in a particular community. The lower the value of H, the lower the diversity. A value of H = 0 indicates a community that only has one species.

See Question 6 in the workbook

Step 1: We need to load the package that has the Shannon index formula

install.packages("vegan")

When the package has installed, load it in:

library(vegan)

Step 2: Calculating Shannon index

div_group <- summarise(group_by(inverts_dat, site, group),
diversity = diversity(count, index = "shannon"))

To view it:

div_group

## # A tibble: 12 × 3
## # Groups:   site [2]
##    site  group diversity
##    <chr> <int>     <dbl>
##  1 CP        1      1.41
##  2 CP        2      1.17
##  3 CP        3      1.13
##  4 CP        4      1.79
##  5 CP        5      1.27
##  6 CP        6      2.21
##  7 REP       1      2.16
##  8 REP       2      1.38
##  9 REP       3      1.79
## 10 REP       4      1.88
## 11 REP       5      1.93
## 12 REP       6      2.13

Step 3: Making a boxplot figure

shannon_graph <- div_group %>%
  ggplot(aes(x = site, y = diversity, fill = site)) +
  geom_boxplot(width = 0.15, position = position_dodge(width = 10)) +
  geom_jitter(shape = 1, position = position_jitter(0.05)) +
  labs(x = "Site", y = "Shannon index (H)")+
  theme_bw() +
  theme(panel.border = element_blank(), axis.line.x = element_line(colour = "black", size = 1), axis.line.y = element_line(colour = "black", size = 1), legend.position = "none")

Let’s see the graph:

shannon_graph

Let’s export and save this graph. You can find this graph in the “Rdata” file you created in the very beginning.

ggsave("ShannonIndex.png", plot=shannon_graph, width=6, height=4, dpi=300)

Step 4: Calculate the mean and standard error

div_all <- summarise(group_by(div_group, site),
Mean = mean(diversity),
SE = standard_error(diversity))

View our results:

div_all

## # A tibble: 2 × 3
##   site   Mean    SE
##   <chr> <dbl> <dbl>
## 1 CP     1.50 0.173
## 2 REP    1.88 0.116

Now Let’s export these stats:

write.csv(div_all, "Mean&SE_Shannon.csv")

See Question 7 in the workbook

You have now finished your analysis, congratulations!

All we have left to do is cite the packages and software we used.

Citing R (we do not cite RStudio as it’s just an interface, it doesn’t actually do any of the analysis)

citation()

## To cite R in publications use:
## 
##   R Core Team (2024). _R: A Language and Environment for Statistical
##   Computing_. R Foundation for Statistical Computing, Vienna, Austria.
##   <https://www.R-project.org/>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {R: A Language and Environment for Statistical Computing},
##     author = {{R Core Team}},
##     organization = {R Foundation for Statistical Computing},
##     address = {Vienna, Austria},
##     year = {2024},
##     url = {https://www.R-project.org/},
##   }
## 
## We have invested a lot of time and effort in creating R, please cite it
## when using it for data analysis. See also 'citation("pkgname")' for
## citing R packages.

Citing all of the packages we used today:

citation("tidyverse")

## To cite package 'tidyverse' in publications use:
## 
##   Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R,
##   Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller
##   E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V,
##   Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to
##   the tidyverse." _Journal of Open Source Software_, *4*(43), 1686.
##   doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {Welcome to the {tidyverse}},
##     author = {Hadley Wickham and Mara Averick and Jennifer Bryan and Winston Chang and Lucy D'Agostino McGowan and Romain François and Garrett Grolemund and Alex Hayes and Lionel Henry and Jim Hester and Max Kuhn and Thomas Lin Pedersen and Evan Miller and Stephan Milton Bache and Kirill Müller and Jeroen Ooms and David Robinson and Dana Paige Seidel and Vitalie Spinu and Kohske Takahashi and Davis Vaughan and Claus Wilke and Kara Woo and Hiroaki Yutani},
##     year = {2019},
##     journal = {Journal of Open Source Software},
##     volume = {4},
##     number = {43},
##     pages = {1686},
##     doi = {10.21105/joss.01686},
##   }

citation("ggthemes")

## To cite package 'ggthemes' in publications use:
## 
##   Arnold J (2024). _ggthemes: Extra Themes, Scales and Geoms for
##   'ggplot2'_. R package version 5.1.0,
##   <https://CRAN.R-project.org/package=ggthemes>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {ggthemes: Extra Themes, Scales and Geoms for 'ggplot2'},
##     author = {Jeffrey B. Arnold},
##     year = {2024},
##     note = {R package version 5.1.0},
##     url = {https://CRAN.R-project.org/package=ggthemes},
##   }

citation("vegan")

## To cite package 'vegan' in publications use:
## 
##   Oksanen J, Simpson G, Blanchet F, Kindt R, Legendre P, Minchin P,
##   O'Hara R, Solymos P, Stevens M, Szoecs E, Wagner H, Barbour M,
##   Bedward M, Bolker B, Borcard D, Carvalho G, Chirico M, De Caceres M,
##   Durand S, Evangelista H, FitzJohn R, Friendly M, Furneaux B, Hannigan
##   G, Hill M, Lahti L, McGlinn D, Ouellette M, Ribeiro Cunha E, Smith T,
##   Stier A, Ter Braak C, Weedon J, Borman T (2025). _vegan: Community
##   Ecology Package_. R package version 2.6-10,
##   <https://CRAN.R-project.org/package=vegan>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {vegan: Community Ecology Package},
##     author = {Jari Oksanen and Gavin L. Simpson and F. Guillaume Blanchet and Roeland Kindt and Pierre Legendre and Peter R. Minchin and R.B. O'Hara and Peter Solymos and M. Henry H. Stevens and Eduard Szoecs and Helene Wagner and Matt Barbour and Michael Bedward and Ben Bolker and Daniel Borcard and Gustavo Carvalho and Michael Chirico and Miquel {De Caceres} and Sebastien Durand and Heloisa Beatriz Antoniazi Evangelista and Rich FitzJohn and Michael Friendly and Brendan Furneaux and Geoffrey Hannigan and Mark O. Hill and Leo Lahti and Dan McGlinn and Marie-Helene Ouellette and Eduardo {Ribeiro Cunha} and Tyler Smith and Adrian Stier and Cajo J.F. {Ter Braak} and James Weedon and Tuomas Borman},
##     year = {2025},
##     note = {R package version 2.6-10},
##     url = {https://CRAN.R-project.org/package=vegan},
##   }

BIOS1301 Practical 5: Data Analysis in R

Dr. Anastasia Shavrova

Introduction

Why R?

Aim

Meeting R & R Studio

Let’s code!

Preparing the workspace

Exploring the Data

DO NOT INCLUDE THESE GRAPHS IN YOUR REPORT - THIS IS RAW DATA

Analysing the Data

Measuring species abundance

Measuring species Richness

Measuring Shannon diversity index

You have now finished your analysis, congratulations!

Don’t forget to save your script in case you want to revisit it and avoid redoing the entire practical!