Practical 1 - AESC40180 25/26

Author

Virginia Morera-Pujol

Introduction

The aim of this practical is to practice using R, and carrying out summary statistics on a provided data set.

The data associated with these practical are in the Fallow.csv file. Information related to these data are in the paper Fallow Birds_Bird Study.pdf

Data were collected in several sites (~~column~~ variable site ) in 4 different regions (variable region ) over several years (variable year)

Other variable names

STOCU: Stone curlew
SHTLA: Short-toed lark
CALLA: Calandra lark
LITBS: Little bustard
natbuf: % of area set aside with natural vegetation
area: field size (in ha)
ae: ratio of edge (m) to area (m²) set aside

Housekeeping

Do this at the beginning of every script to start with a clean environment. Remember, “Rule 5: adopt good practices and be consistent”

rm(list=ls()) # this clears the environment from any leftover objects

library(tidyverse) #this loads the packages we'll need for easy coding

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# setwd("Yourpath") #this tells R what folder you'll be working out of

Load data

With the line of code below you’ll load the dataset. This means R has it now stored in the environment under the name “fallow” (or whatever you want to name)

fallow <- read_csv("Data/Fallow.csv")

Rows: 210 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): site, region
dbl (8): year, STOCU, SHTLA, CALLA, LITBU, natbuf, area, ea

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We use the symbol <- for “assignment”. That means, for giving things (data, numbers, words) that we want to store in the environment a name that we can later use to find them or perform operations and analyses on them. Try the little exercise below to understand assignment and numerical operations in R.

x <- 10        # assign the number 10 to an object called x
y <- 4         # assign the number 4 to an object called y
sum_xy <- x + y
product_xy <- x * y
x_squared <- x^2
sum_xy + product_xy

[1] 54

Back to our fallow birds exercise, let’s explore the data and display some summary statistics.

Data exploration

We can see what the first rows of data look like using the function head() , and explore the details about column types and dataset structure with the function str() . Additionally, you can ask R to give you a summary of the whole dataset with the function summary()

head(fallow)

# A tibble: 6 × 10
  site  region  year STOCU SHTLA CALLA LITBU natbuf   area     ea
  <chr> <chr>  <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
1 A006  A       2004     5     0     0     2  0.320 35070. 0.0265
2 A007  A       2004     3     0     0     0  0.332 42620. 0.0253
3 A015  A       2004     0     0     0     0  0.534  3463. 0.0805
4 A034  A       2004     0     0     1     1  0.755 11739. 0.0399
5 A048  A       2004     6     0     0     0  0.329 13826. 0.0497
6 A111  A       2004     0     0     0     0  0.474 19352. 0.0282

str(fallow)

spc_tbl_ [210 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ site  : chr [1:210] "A006" "A007" "A015" "A034" ...
 $ region: chr [1:210] "A" "A" "A" "A" ...
 $ year  : num [1:210] 2004 2004 2004 2004 2004 ...
 $ STOCU : num [1:210] 5 3 0 0 6 0 0 0 0 1 ...
 $ SHTLA : num [1:210] 0 0 0 0 0 0 0 0 0 0 ...
 $ CALLA : num [1:210] 0 0 0 1 0 0 0 0 0 0 ...
 $ LITBU : num [1:210] 2 0 0 1 0 0 1 0 0 0 ...
 $ natbuf: num [1:210] 0.32 0.332 0.534 0.755 0.329 ...
 $ area  : num [1:210] 35070 42620 3463 11739 13826 ...
 $ ea    : num [1:210] 0.0265 0.0253 0.0805 0.0399 0.0497 ...
 - attr(*, "spec")=
  .. cols(
  ..   site = col_character(),
  ..   region = col_character(),
  ..   year = col_double(),
  ..   STOCU = col_double(),
  ..   SHTLA = col_double(),
  ..   CALLA = col_double(),
  ..   LITBU = col_double(),
  ..   natbuf = col_double(),
  ..   area = col_double(),
  ..   ea = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

summary(fallow)

     site              region               year          STOCU       
 Length:210         Length:210         Min.   :2004   Min.   :0.0000  
 Class :character   Class :character   1st Qu.:2004   1st Qu.:0.0000  
 Mode  :character   Mode  :character   Median :2005   Median :0.0000  
                                       Mean   :2005   Mean   :0.8143  
                                       3rd Qu.:2005   3rd Qu.:1.0000  
                                       Max.   :2005   Max.   :8.0000  
     SHTLA             CALLA            LITBU         natbuf      
 Min.   : 0.0000   Min.   : 0.000   Min.   :0.0   Min.   :0.0000  
 1st Qu.: 0.0000   1st Qu.: 0.000   1st Qu.:0.0   1st Qu.:0.1024  
 Median : 0.0000   Median : 0.000   Median :0.0   Median :0.2525  
 Mean   : 0.4857   Mean   : 1.152   Mean   :0.5   Mean   :0.2745  
 3rd Qu.: 0.0000   3rd Qu.: 1.000   3rd Qu.:1.0   3rd Qu.:0.4099  
 Max.   :15.0000   Max.   :17.000   Max.   :6.0   Max.   :0.9370  
      area                ea         
 Min.   :   535.2   Min.   :0.01178  
 1st Qu.:  8576.6   1st Qu.:0.02659  
 Median : 19005.0   Median :0.03700  
 Mean   : 24790.0   Mean   :0.04249  
 3rd Qu.: 34447.3   3rd Qu.:0.05015  
 Max.   :129500.2   Max.   :0.18299

We can also access individual variables in the dataframe by using the $ operator. And we can use functions like mean(), median() or sd() to obtain summary statistics from each.

Try a few of these below with different variables of the fallow dataframe

mean(fallow$area)

[1] 24789.99

median(fallow$area)

[1] 19004.96

sd(fallow$area)

[1] 21708.7

var(fallow$area)

[1] 471267726

quantile(fallow$area, 0.25)

     25% 
8576.638

This last function, quantiles(), introduces the idea that a function can have more than 1 argument. In this case, the function quantiles() has two arguments, the first one is the object you want to calculate the quantile of (the area variable of the fallow dataframe) and the second argument (they are separated by commas) is the level of quantile you want to calculate (in this case 25%). You can calculate several quantiles at once by using a vector as the second argument. In R, vectors are a list of numbers, and we create them with c(). So, for example, if we want to calculate the 25% and 75% quantiles, we substitute the 0.25 above with a c(0.25, 0.75). Note that inside a vector, the different elements are also separated by commas. Now try calculating different quantiles of different variables below.

quantile(fallow$area, c(0.25, 0.75))

      25%       75% 
 8576.638 34447.286

So far we’re only displaying this results in the console, we are not storing them in the environment, because we are not assigning them to any object (remember, the left facing arrow?). But we can assign all of these to objects and then you’ll see the name you’ve chosen for them appear in your environment as a stored object. For example, below I am going to demonstrate that the 50% quantile is the same as the median.

q50 <- quantile(fallow$area, 0.5)
med <- median(fallow$area)
q50

     50% 
19004.96

med

[1] 19004.96

med == q50

 50% 
TRUE

This introduces another idea. Whenever you want to tell R that two things are equal (or, like in this case, ask R if two things are equal), you need to use two equal signs, not just one. If you’re curious, ask me why, but it might confuse you :)

Can you think of a way of similarly proving that the standard deviation (sd()) is the square root of the variance (var())

Data exploration 2

So far we’ve used very basic functions for data exploration of single variables. However, what if we wanted to know the mean of the area by year? Or by region? Now we need to do a bit more coding (but don’t panic) and introduce a couple of extra concepts. First of all, if you’ve used R before, you might be used to doing this in a different way. There’s no wrong way, but I find this one I will show you (tidy coding) a bit more intuitive for people that are just getting started.

Tidy coding uses words that are “verbs” as functions (words like summarise, mutate, filter, select) that will give you a clue as to what the function is doing, and uses something called “pipes” that look like this %>% to string these verbs into a coherent sentence of what you want to do. So, for example, if you had a dataframe called example and you wanted to select some variables, then filter some rows out, then mutate one of the variables so it loos a bit different, it would look something like this:

example %>% 
  select(...) %>% 
  filter(...) %>% 
  mutate(...)

Tip: you can automatically insert a pipe symbol by holding the Shift and Ctrl keys at the same time and typing m.

Now, let’s use tidy functions to produce a mean and standard deviation value of area and edge to area ratio per year.

fallow %>% 
  group_by(year) %>% 
  summarise(mean_area = mean(area), 
            sd_area = sd(area), 
            mean_ea = mean(ea), 
            sd_ea = sd(ea))

# A tibble: 2 × 5
   year mean_area sd_area mean_ea  sd_ea
  <dbl>     <dbl>   <dbl>   <dbl>  <dbl>
1  2004    23162.  17769.  0.0436 0.0246
2  2005    26084.  24390.  0.0416 0.0200

Note several things:

We use group_by() to tell R what categorical variable we want to group our summary statistics by (does what it says in the tin, doesn’t it?).
Inside the function summarise() we can do as many summary statistics as you want, of as many variables from the original dataset as you want. You just need to separate them by commas (I also put every calculation in a new line as it looks tidier and easier to read, but that’s just a style choice).
Inside the function summarise() we don’t need to specify the name of the dataset and use the $ operator to specify the column (so no fallow$area, just area). This is because in the top line we’ve already told R we’ll be performing all of these operations onto the fallow dataset by using the pipe.

Can you figure out a way of asking R to give you summary statistics of each region now? What about for each region and year combination?

Another useful tidy function (that in reality is just a specific case of summarise()) is the function tally(). It will give you a count of how many observations per group. So, for example, if we want the number of observations per year and region, you would use the group_by() function as before, and then follow it by tally() instead of summarise().

fallow %>% 
  group_by(year, region) %>% 
  tally()

# A tibble: 8 × 3
# Groups:   year [2]
   year region     n
  <dbl> <chr>  <int>
1  2004 A         25
2  2004 E         22
3  2004 G         21
4  2004 S         25
5  2005 A         33
6  2005 E         23
7  2005 G         30
8  2005 S         31

Plotting

Now onto the last difficult thing you’ll learn today (that wasn’t so bad, was it?). I’m going to teach you how to make beautiful plots using the package ggplot, which has a similar syntax to tidy coding, as I also thing it’s more intuitive for newbies than the base R way, but if you’ve learned any other way, again, that’s also valid.

To create plots with ggplot you use a similar structure as what we’ve seen for grouping by and summarising with tidy coding. First you tell R which dataset you will be plotting from, and then you add more and more layers of complexity as you want them. One confusing thing (and I agree it’s silly) is that instead of pipes (%>%) ggplot uses plus signs (+) to concatenate different layers.

Each layer of a plot you want to add is called a geom, and it can be points (for a scatterplot) lines (for a line plot) bars (for a barplot or histogram), etc. So, to plot points you would use geom_point() and to plot a barplot you’d use geom_bar().

Inside the geom you’d have an argument (remember, they’re the bits inside a function and are separated by commas?) that holds the information of what you want to plot (area on the x axis and ratio on the y axis for example), and this argument uses the name aes() (comes from aesthetics, don’t ask me why). Inside the aes() argument you’d put everything you want to map to a variable in your dataset, for example, as we said, area will be mapped to the x axis, and ratio will be mapped to the y axis, but we can also map year to colour (that means the points will be coloured by year), or number of stone curlews (STOCU) to the size of the points (that means, the higher the value in the column STOCU for that observation, the larger the point).

ggplot(fallow) + 
  geom_point(aes(x = area, y = ea, col = year, size = STOCU))

You see that the legend for year looks a bit weird? That is because R thinks the variable year is numeric, same as the ea or area. However, we know it is a factor (it has value for 2004 and 2005 but nothing in between), so I’m now going to introduce you to another tidy verb, mutate(), to fix it.

fallow_fixed <- fallow %>%  
  mutate(year_factor = factor(year))

And now we can perform the same plotting operation, but changing the name of the dataset from fallow to fallow_fixed and the variable year to year_factor

ggplot(fallow_fixed) + 
  geom_point(aes(x = area, y = ea, col = year_factor, size = STOCU))

There are many arguments you can add to a ggplot to make the plot prettier: you can add a title, customise axes names and tick labels, change colours, etc. You can check out this link for more details or even ask ChatGPT! (promt example: using R, can you show me how to add a title to a ggplot plot?). For now, I’m only going to show you one example, which is adding a theme (automatic formatting to the background and colours). I don’t really like the gray background of the plot area, so I’m going to use a b&w theme (theme_bw()) to get rid of it.

ggplot(fallow_fixed) + 
  geom_point(aes(x = area, y = ea, col = year_factor, size = STOCU)) + 
  theme_bw()

Another very useful summary plot is a boxplot (geom_boxplot()). That will give you the median, 25% and 75% quantiles, maximum, minimum, and whether there are any outliers, for any data group you specify. To plot them in ggplot you only need to specify the grouping variable as the x argument, and the variable you want to summarise as the y argument inside the aes() (please don’t panic it’s simple even if it sounds complicated).

ggplot(fallow) + 
  geom_boxplot(aes(x = region, y = STOCU)) +
  theme_bw()

Can you think of a way to make each box a different colour?
If you wanted to see stone curlews, which region would you visit for a better chance of encountering one?

Practice

Now is your turn to try and plot de data in as many ways as you can think of! Try also adding titles to the plots, or changing axes labels!