Introduction

This next section of the practical will introduce you to the programming language R, the lingua franca of statistical computing. Unlike point and click software, everything you do in R you do from the command line. Although this can seem a little daunting at first, it’s actually not that hard to learn, and the benefits of doing so are enormous.If you’re very new to R, I’d suggest checking out this link that introduces you to the interface and some basic stats

Why R?

For those of you who want to develop your skills further, the School of Biological, Earth and Environmental Sciences and the School of Mathematics and Statistics has set up this very useful web site with many tools from simple manipulating data (getting started with R), to the complex statistical analyses of data. Also, the UNSW codeRs is a student and staff run community dedicated for ‘R’ users. They set up events that cater to learning R. They meet fortnightly, alternating between workshops, drop-in help sessions, and other social events. You can visit them here

Aim

In this section we will be walking you through the process of data analysis (using R and R studio) for Assessment 2 - Biodiversity Report. We will:

Meeting R & R Studio

R is a free statistical program used commonly (but not exclusively) by biologists and psychologists. R studio is a software extension that makes the R console easier to use by adding additional user-friendly interfaces.

Think of R like a car engine and RStudio like the body of the car, RStudio will not run without R.

To find out more about R and RStudio, like how to customize your RStudio, you can visit this link.

In the default RStudio layout, you’ll see four panes in RStudio:

  1. The code editor (script window) in the top left pane where you write your code before ‘sending’ it to R;
  2. The R console in the bottom left pane where R works its magic;
  3. The workspace in the top right pane where you can view all R objects (e.g., data files); and
  4. A pane to view plots, help files, available packages (‘add ons’), etc. in the bottom right pane.

Throughout this practical, you will encounter questions to answer. Taking notes on these as you go will help you write the Results section of your Report (Assessment 2).

Let’s code!

Preparing the workspace

Step 1: Create a folder on your computer and name it “Rdata”

Step 2: Download your class data from moodle: Week 5 > Class Data > Select your class folder > download the inverts.csv file. Save this file in the “Rdata” folder you just made.

You must use your scheduled practical’s data

Step 3: Create a new R Project File: File > new project > existing directory > browse and select your “Rdata” folder > create project

  • When you revisit your code next time, simply go into your Rdata folder and select the file with the suffix ‘.Rproj’. Setting up Rprojects is the best practice when it comes to organising your data files and code scripts. It will save you from the headache of manually setting your working directory every session.

Step 4: Start a new Script. File > new file > R script.

Save this script now in the Rdata file

Step 5: Clear R’s memory - just in case it is holding any old data from a previous analysis.

Hint: press the “copy code” button in the left corner of the chunk below and paste it into your own script that you just made

rm(list=ls())

To run the code on any line, place your cursor on the line you wish to run and press ‘Ctrl + Enter’ to send the command to the R console. Alternatively, you can click the Run icon in the top right of the script pane.

Step 6: Load & see your data in R

inverts_dat <- read.csv(file = "inverts.csv", header = TRUE)

In the code above, you’ve assigned the name of this file as inverts_dat. From now on, your data will be referred to as inverts_dat.

See the structure of your data:

str(inverts_dat)
## 'data.frame':    152 obs. of  4 variables:
##  $ site  : chr  "CP" "CP" "CP" "CP" ...
##  $ group : int  1 2 5 6 6 1 5 1 1 2 ...
##  $ invert: chr  "Amphipod A" "Amphipod A" "Amphipod A" "Amphipod A" ...
##  $ count : int  1 1 1 1 2 3 8 6 24 30 ...

You can also view your data in a table:

This data will appear in a new tab in RStudio

View(inverts_dat)

There are 5 variables (shown as columns in the table you made) in your data:

  • site, with two possible values reflecting collection location Centennial Park (CP) or Randwick Environment Park (REP).
  • invert, with several possible values reflecting the type of invertebrate (e.g., Ant 1).
  • count, with values reflecting the abundance of each insect observed.
  • group, with values reflecting the identity of the group (1, 2, 3, etc.) from each class.

Step 7: Install & load packages

Packages are like mods for R; they make the experience of coding complex requests much simpler.

Packages may take a minute or so to load. Look for the small stop sign located in the top right-hand corner of the R-console pane to check if R is still working on your command. Alternatively, wait for the ‘>’ icon to appear at the bottom of the R console - this indicates that the console is ready for a new command.

You can press “No” if a window pops up asking if you want to restart R.

While packages only need to be installed once for every version of R (you may need to re-install packages if you update to a new version), you need to load packages every session (i.e., if you close RStudio and re-open your project from the Rdata folder, you need to re-run the lines to load in your packages). Check if you package is loaded by navigating to the “Packages” tab in the bottom-right pane. Loaded packages (sorted in alphabetical order) have a check mark next to them.

Installing packages:

install.packages("tidyverse")
install.packages("ggthemes")

Loading packages:

library("tidyverse")
library("ggthemes")

Exploring the data

Step 1: We will create two bar graphs that show the abundance of leaf litter invertebrates at each site. You can use the zoom button in the graphing panel to view a figure in more detail.

Code for Centennial Park:

# The '#' indicate a comment in the code. These allow you to annotate what each line does without R thinking that your notes are code! We've gone through to break down what each line does for you. This is a helpful practice when your code gets really complex.

inverts_dat %>% # load in your prac's invertebrate data (remember we named it 'inverts_dat' above!).
filter(site == "CP") %>%  # ask R to only look at data from Centennial Park.
  group_by(invert) %>% # set invertebrates as the variable of interest - all functions will now be applied to each different invertebrate variable.
  ggplot(aes(x = reorder(invert, -count), y = count, fill = site)) + # create a plot where the x axis has the inverts in order from highest count/abundance to lowest and the y axis shows the abundance itself and colour the bars by what site they came from (since they all came from Centennial Park as per the above line, they should all be the same colour!).
  geom_bar(stat = "identity", width = .7) + # specify the type of plot you want - in this case, a simple bar plot.
  labs(x = "Invertebrate species", y = "Abundance") + # set what you want the labels of your graph axes to be.
  theme_bw() + # there are many themes in the ggthemes package to choose from. The black and white "theme_bw" one is most commonly used as it's the clearest for most data.
  ggtitle("Abundance of leaf litter invertebrates at Centennial Park") + # set what you want the title of your graph to be. This will appear in the top left of the figure.
  theme(legend.position = "none") + # set the legend position to "none" to remove a legend from your graph. If you don't remove it, it will automatically appear on the right.
  theme(axis.text.x = element_text(angle = -45,hjust = 0)) # set aesthetics of your graph like angling the text on the x axis to make it easier to read.

Code for Randwick Environment Park:

# notice this code is nearly identical to the above since we're making a near identical bar graph.

inverts_dat %>%
filter(site == "REP") %>% # here we changed "CP" in the above code to "REP" to indicate that we only want to look at the data from Randwick Environment Park instead.
  group_by(invert) %>%
  ggplot(aes(x = reorder(invert, -count), y = count, fill = site)) +
  geom_bar(stat = "identity", width = .7) +
  labs(x = "Invertebrate Species", y = "Abundance") +
  theme_bw() +
  ggtitle("Abundance of leaf litter invertebrates at REP") +
  theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = -45, hjust = 0))

Question 1: Abundance refers to the amount of a given species in an area. Which three invertebrates have the highest abundance in Centennial Park and Randwick Environment Park?

DO NOT INCLUDE THESE GRAPHS IN YOUR REPORT - THIS IS RAW DATA

Analysing the data

There are several ways to measure biodiversity. The most common is to assess species richness, which is the total number of distinct species within a local community.

Having many species generally coincides with having a diverse and healthy ecosystem. However, the relative abundance, the number of individuals per species, also needs to be considered.

An informative way to measure biodiversity that combines richness and abundance is the Shannon Diversity Index (H) also referred to simply as the H index. This measures how similar the abundances of different species are in the community. The Shannon Diversity Index provides more information about community composition than species richness by accounting for the relative abundance of the different species. The higher the value of H, the higher the biodiversity in a particular community. The lower the value of H, the lower the biodiversity. A value of H = 0 indicates a community that only has one species.

Measuring species abundance

Step 1: Creating an abundance table

abund_group <- summarise(group_by(inverts_dat, site, group), # we've set this abundance table to have a name that we can use to call it up at any time. This is useful for when you need to revisit something multiple times. We then want R to create a summary of the data, but to focus the invert data first by which site it came from and then by which group input the data. We group by site first since there are two "group 1"s - REP group 1 and CP group 1. If we grouped by group number first, we'd be adding together the counts from CP and REP which are two different communities.
abundance = sum(count)) # since our "count" column is the abundance of each invertebrate type, if we sum these counts together for each group at each site, we can find that group's total abundance of invertebrates they identified.

We can see which group found the most invertebrates:

abund_group # now we can use the name we gave the code above to call this table forward as much as we like!
## # A tibble: 12 × 3
## # Groups:   site [2]
##    site  group abundance
##    <chr> <int>     <int>
##  1 CP        1        40
##  2 CP        2       157
##  3 CP        3        16
##  4 CP        4       113
##  5 CP        5       379
##  6 CP        6        38
##  7 REP       1        33
##  8 REP       2        84
##  9 REP       3        32
## 10 REP       4        36
## 11 REP       5        55
## 12 REP       6        43

Step 2: Making boxplot figures

If you would like some help understanding boxplots, I suggest you read this blog post.

To make a boxplot, we’ll need to install another package:

install.packages("patchwork")

If asked in a pop-up window if you want to install from sources, select “Yes”.

Then, load the package:

library(patchwork)

Now we are ready to make our plot:

abund_graph <- # see again how you can name your code to call upon this graph.
  abund_group %>% # we've organised the raw inverts_dat data to be summed into abundance data for each group. This means we have less work for R to do in this code chunk because we've already done it above and gave this new data table a name.
  ggplot(aes(x = site, y = abundance, fill = site)) + # now we create a plot where the x axis indicates site and y axis indicates abundance. We're colouring the plot by which site the abundance is for.
  geom_boxplot(width = 0.15, position = position_dodge(width = 10)) + # we specify that we want a boxplot and set some general aesthetic parameters for sizing.
  geom_jitter(shape = 1, position = position_jitter(0.05)) +  # we can further specify that if two points were to sit on the same spot (i.e., they have the same site and the same abundance values), they are slightly offset so both can be seen in the plot. We can also set a shape value from the given shapes in the package - try '2' instead of '1' if you're curious.
  labs(x = "Site", y = "Number of invertebrates") + # once again, we can set the labels for our axes.
  theme_bw() + # and once again, set the style of the plot.
  ggtitle("Invertebrate Abundance") + # don't forget to give your figure a title!
  theme(panel.border= element_blank(), axis.line.x = element_line(colour = "black",size = 1), axis.line.y = element_line(colour ="black", size = 1), legend.position = "none") # this final line indicates a variety of aesthetic choices such as if you want a border around the boxplot, what colour and how thick the lines should be, etc.

Let’s see the graph:

abund_graph

Let’s export and save this graph. You can find this graph in the “Rdata” file you created in the very beginning.

ggsave("Invertebrate_Abundance.png", plot=abund_graph, width=6, height=4, dpi=300) # we can save the graph with ggsave and set aesthetics like size and resolution of the image.

Question 2: Based on your graph “Invertebrate Abundance” which park had the greater abundance?

Step 3: Calculating the mean and standard error for abundance

R doesn’t have a code for calculating the standard error, so we will make our own function:

standard_error <- function(x) sd(x) / sqrt(length(x)) # here we can create a function named "standard_error" where we ask it to look for a dataset represented by x and calculate the standard deviation of the dataset and divide that by the square root of how many data points we have in that dataset (this is how to calculate standard error).

Now we can run this function to calculate mean and standard error

abund_all <- summarise(group_by(abund_group, site), # we're now creating a table of summary statistics (standard error and mean) but first grouping our abundance dataset by site, so we apply the functions to each site separately.
Mean = mean(abundance), # R already has a function to calculate mean, and we've set the table header for this column as "Mean"
SE = standard_error(abundance)) # and we use the function we just made to calculate standard error and set the table header for this column as "SE".

View our results:

abund_all
## # A tibble: 2 × 3
##   site   Mean    SE
##   <chr> <dbl> <dbl>
## 1 CP    124.  55.5 
## 2 REP    47.2  8.15

Now Let’s export these stats:

write.csv(abund_all, "Mean&SE_abundance.csv") # We now create a .csv file for the table we just made which will be saved in the Rdata folder with this R project. A csv file is like an excel file that can't be edited. This is also one of the easiest formats that R can read, so it's a great file type for processing data in R.

Question 3: Do the summary statistics for “Invertebrate Abundance” match what you see in the graph? Why might they not match?

Mean versus Median: Both are measures of where the center of a data set lies (called “Central Tendency” in stats), but they are usually different numbers. For example, take this list of numbers: 10, 10, 20, 40, 70. The mean (informally, the “average“) is found by adding all of the numbers together and dividing by the number of items in the set: 10 + 10 + 20 + 40 + 70 / 5 = 30. The median is found by ordering the set from lowest to highest and finding the exact middle. The median is just the middle number: 20.

Measuring species Richness

Step 1: Creating a richness table

rich_group <- summarise(group_by(inverts_dat, site, group), # note the similarities in the code for the above abundance table.
richness = length(unique(invert))) # but now we are creating a column "richness" which will tell us how many unique invertebrates each group found at each site.

Let’s view it:

rich_group
## # A tibble: 12 × 3
## # Groups:   site [2]
##    site  group richness
##    <chr> <int>    <int>
##  1 CP        1        9
##  2 CP        2       11
##  3 CP        3        5
##  4 CP        4       20
##  5 CP        5       19
##  6 CP        6       13
##  7 REP       1       12
##  8 REP       2       10
##  9 REP       3       13
## 10 REP       4       13
## 11 REP       5       14
## 12 REP       6       13

Step 2: Make a richness boxplot

rich_graph <- rich_group %>% # note the similarities to our previous boxplot.
  ggplot(aes(x = site, y = richness, fill = site)) + # but now we want to work with richness on the y axis instead of abundance.
  geom_boxplot(width = 0.15, position = position_dodge(width = 10)) +
  geom_jitter(shape = 1, position = position_jitter(0.05)) +
  labs(x = "Site", y = "Species richness") + # and we need to change the axis label.
  theme_bw() +
  ggtitle("Invertebrate Richness") + # and change the title to match.
  theme(panel.border = element_blank(), axis.line.x = element_line(colour = "black", size = 1), axis.line.y = element_line(colour = "black", size = 1), legend.position = "none")

Let’s see the graph:

rich_graph

Let’s export and save this graph. You can find this graph in the “Rdata” file you created in the very beginning.

ggsave("Invertebrate_Richness.png", plot=rich_graph, width=6, height=4, dpi=300) # as above, we can save the figure and set aesthetics.

Question 4: Based on your graph “Invertebrate Richness” which park had the greater richness?

Step 3: Calculating the mean and standard error

rich_all <- summarize(group_by(rich_group, site), # as before, we calculate the mean and standard error but now for our species richness dataset.
Mean = mean(richness),
SE = standard_error(richness)) # see how we can use the same function we created, but apply it to our richness dataset. Creating functions like this can save a lot of time.

View our results:

rich_all
## # A tibble: 2 × 3
##   site   Mean    SE
##   <chr> <dbl> <dbl>
## 1 CP     12.8 2.37 
## 2 REP    12.5 0.563

Now Let’s export these stats:

write.csv(rich_all, "Mean&SE_richness.csv")

Question 5: Do the summary statistics for “Invertebrate Richness” match what you see in the graph?

Measuring Shannon Diversity Index

Finally, calculate the Shannon Diversity Index (H) for each group in each habitat type. Remember that the higher the value of H, the higher the biodiversity of a particular community. The lower the value of H, the lower the biodiversity. A value of H = 0 indicates a community that only has one species.

Based on that understanding of the H index, let’s make a prediction before calculating.

Question 6: Now that you know the Abundance and Richness, which park do you predict will have the higher Shannon Diversity Index (H)?

An R programmer has already created a package that contains the Shannon Diversity Index formula. Step 1: Load the package

install.packages("vegan") # since we've never used this package before, we need to install it first.

When the package has installed, load it in:

library(vegan)

Step 2: Calculating Shannon Diversity Index

div_group <- summarise(group_by(inverts_dat, site, group), # we're creating another table to calculate H from our class data for each group at each site.
diversity = diversity(count, index = "shannon")) # the package contains the function "diversity" which contains many types of diversity indices. We're only interested in the Shannon index so we specify the index to be "Shannon". To better understand how this function (or any function!) works, type ?diversity (replace diversity with the function name) in the R console.

Let’s view it:

div_group
## # A tibble: 12 × 3
## # Groups:   site [2]
##    site  group diversity
##    <chr> <int>     <dbl>
##  1 CP        1      1.41
##  2 CP        2      1.17
##  3 CP        3      1.13
##  4 CP        4      1.79
##  5 CP        5      1.27
##  6 CP        6      2.21
##  7 REP       1      2.16
##  8 REP       2      1.38
##  9 REP       3      1.79
## 10 REP       4      1.88
## 11 REP       5      1.93
## 12 REP       6      2.13

Step 3: Make a boxplot figure

shannon_graph <- div_group %>%
  ggplot(aes(x = site, y = diversity, fill = site)) + # we use a similar code to the other boxplots, but now we want the y axis to reflect the H index.
  geom_boxplot(width = 0.15, position = position_dodge(width = 10)) +
  geom_jitter(shape = 1, position = position_jitter(0.05)) +
  labs(x = "Site", y = "Shannon Diversity Index (H)") + # change your labels to match.
  theme_bw() +
  ggtitle("Shannon Diversity Index for Centennial Park vs Randwick Environment Park") + # change your title to match.
  theme(panel.border = element_blank(), axis.line.x = element_line(colour = "black", size = 1), axis.line.y = element_line(colour = "black", size = 1), legend.position = "none")

Let’s see the graph:

shannon_graph

Let’s export and save this graph. You can find this graph in the “Rdata” file you created in the very beginning.

ggsave("ShannonIndex.png", plot=shannon_graph, width=6, height=4, dpi=300) # saving our new graph and setting the aesthetics.

Step 4: Calculate the mean and standard error for the Shannon Diversity Index

div_all <- summarise(group_by(div_group, site), # we can use a similar code to the other times we've calculated mean and SE, but now applied to our diversity variable.
Mean = mean(diversity),
SE = standard_error(diversity))

View our results:

div_all
## # A tibble: 2 × 3
##   site   Mean    SE
##   <chr> <dbl> <dbl>
## 1 CP     1.50 0.173
## 2 REP    1.88 0.116

Now Let’s export these stats:

write.csv(div_all, "Mean&SE_Shannon.csv") # once again we create a csv file to save the data to the Rdata folder.

Question 7: After graphing and calculating the mean and standard error of the Shannon Diversity Index (H), which park has the higher biodiversity? Was your prediction from Question 6 correct?

You have now finished your analysis. Congratulations!

All we have left to do is cite the packages and software we used.

Citing R (we do not cite RStudio. As it’s just an interface, it doesn’t actually do any of the analysis):

citation() # this function automatically cites R. Notice there's no value specified in the brackets. The output in the console should contain a citation you can copy into word in APA format. To convert to other formats, you can copy the BibTex entry (raw data for the reference) into a reference managing software like Zotero (free and created by researchers!) which can give you alternate formats like Harvard, Chicago, or MLA.
## To cite R in publications use:
## 
##   R Core Team (2025). _R: A Language and Environment for Statistical
##   Computing_. R Foundation for Statistical Computing, Vienna, Austria.
##   <https://www.R-project.org/>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {R: A Language and Environment for Statistical Computing},
##     author = {{R Core Team}},
##     organization = {R Foundation for Statistical Computing},
##     address = {Vienna, Austria},
##     year = {2025},
##     url = {https://www.R-project.org/},
##   }
## 
## We have invested a lot of time and effort in creating R, please cite it
## when using it for data analysis. See also 'citation("pkgname")' for
## citing R packages.

Citing all of the packages we used today:

citation("tidyverse") # when you include a package name in this function, it'll give you the citation for the specified package.
## To cite package 'tidyverse' in publications use:
## 
##   Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R,
##   Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller
##   E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V,
##   Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to
##   the tidyverse." _Journal of Open Source Software_, *4*(43), 1686.
##   doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {Welcome to the {tidyverse}},
##     author = {Hadley Wickham and Mara Averick and Jennifer Bryan and Winston Chang and Lucy D'Agostino McGowan and Romain François and Garrett Grolemund and Alex Hayes and Lionel Henry and Jim Hester and Max Kuhn and Thomas Lin Pedersen and Evan Miller and Stephan Milton Bache and Kirill Müller and Jeroen Ooms and David Robinson and Dana Paige Seidel and Vitalie Spinu and Kohske Takahashi and Davis Vaughan and Claus Wilke and Kara Woo and Hiroaki Yutani},
##     year = {2019},
##     journal = {Journal of Open Source Software},
##     volume = {4},
##     number = {43},
##     pages = {1686},
##     doi = {10.21105/joss.01686},
##   }
citation("ggthemes")
## To cite package 'ggthemes' in publications use:
## 
##   Arnold J (2025). _ggthemes: Extra Themes, Scales and Geoms for
##   'ggplot2'_. doi:10.32614/CRAN.package.ggthemes
##   <https://doi.org/10.32614/CRAN.package.ggthemes>, R package version
##   5.2.0, <https://CRAN.R-project.org/package=ggthemes>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {ggthemes: Extra Themes, Scales and Geoms for 'ggplot2'},
##     author = {Jeffrey B. Arnold},
##     year = {2025},
##     note = {R package version 5.2.0},
##     url = {https://CRAN.R-project.org/package=ggthemes},
##     doi = {10.32614/CRAN.package.ggthemes},
##   }
citation("vegan")
## To cite package 'vegan' in publications use:
## 
##   Oksanen J, Simpson G, Blanchet F, Kindt R, Legendre P, Minchin P,
##   O'Hara R, Solymos P, Stevens M, Szoecs E, Wagner H, Barbour M,
##   Bedward M, Bolker B, Borcard D, Borman T, Carvalho G, Chirico M, De
##   Caceres M, Durand S, Evangelista H, FitzJohn R, Friendly M, Furneaux
##   B, Hannigan G, Hill M, Lahti L, Martino C, McGlinn D, Ouellette M,
##   Ribeiro Cunha E, Smith T, Stier A, Ter Braak C, Weedon J (2026).
##   _vegan: Community Ecology Package_. doi:10.32614/CRAN.package.vegan
##   <https://doi.org/10.32614/CRAN.package.vegan>, R package version
##   2.7-3, <https://CRAN.R-project.org/package=vegan>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {vegan: Community Ecology Package},
##     author = {Jari Oksanen and Gavin L. Simpson and F. Guillaume Blanchet and Roeland Kindt and Pierre Legendre and Peter R. Minchin and R.B. O'Hara and Peter Solymos and M. Henry H. Stevens and Eduard Szoecs and Helene Wagner and Matt Barbour and Michael Bedward and Ben Bolker and Daniel Borcard and Tuomas Borman and Gustavo Carvalho and Michael Chirico and Miquel {De Caceres} and Sebastien Durand and Heloisa Beatriz Antoniazi Evangelista and Rich FitzJohn and Michael Friendly and Brendan Furneaux and Geoffrey Hannigan and Mark O. Hill and Leo Lahti and Cameron Martino and Dan McGlinn and Marie-Helene Ouellette and Eduardo {Ribeiro Cunha} and Tyler Smith and Adrian Stier and Cajo J.F. {Ter Braak} and James Weedon},
##     year = {2026},
##     note = {R package version 2.7-3},
##     url = {https://CRAN.R-project.org/package=vegan},
##     doi = {10.32614/CRAN.package.vegan},
##   }

Don’t forget to save your script in case you want to revisit it and avoid redoing the entire practical!