The aim of this practical is to practice using R, and carrying out summary statistics on a provided data set.
The data associated with these practical are in the Fallow.csv file. Information related to these data are in the paper Fallow Birds_Bird Study.pdf
Data were collected in several sites (column variable site ) in 4 different regions (variable region ) over several years (variable year)
Other variable names
STOCU: Stone curlew
SHTLA: Short-toed lark
CALLA: Calandra lark
LITBS: Little bustard
natbuf: % of area set aside with natural vegetation
area: field size (in ha)
ae: ratio of edge (m) to area (m2) set aside
Housekeeping
Do this at the beginning of every script to start with a clean environment. Remember, “Rule 5: adopt good practices and be consistent”
rm(list=ls()) # this clears the environment from any leftover objectslibrary(tidyverse) #this loads the packages we'll need for easy coding
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# setwd("Yourpath") #this tells R what folder you'll be working out of
Load data
With the line of code below you’ll load the dataset. This means R has it now stored in the environment under the name “fallow” (or whatever you want to name)
fallow <-read_csv("Data/Fallow.csv")
Rows: 210 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): site, region
dbl (8): year, STOCU, SHTLA, CALLA, LITBU, natbuf, area, ea
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We use the symbol <- for “assignment”. That means, for giving things (data, numbers, words) that we want to store in the environment a name that we can later use to find them or perform operations and analyses on them. Try the little exercise below to understand assignment and numerical operations in R.
x <-10# assign the number 10 to an object called xy <-4# assign the number 4 to an object called ysum_xy <- x + yproduct_xy <- x * yx_squared <- x^2sum_xy + product_xy
[1] 54
Back to our fallow birds exercise, let’s explore the data and display some summary statistics.
Data exploration
We can see what the first rows of data look like using the function head() , and explore the details about column types and dataset structure with the function str() . Additionally, you can ask R to give you a summary of the whole dataset with the function summary()
head(fallow)
# A tibble: 6 × 10
site region year STOCU SHTLA CALLA LITBU natbuf area ea
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A006 A 2004 5 0 0 2 0.320 35070. 0.0265
2 A007 A 2004 3 0 0 0 0.332 42620. 0.0253
3 A015 A 2004 0 0 0 0 0.534 3463. 0.0805
4 A034 A 2004 0 0 1 1 0.755 11739. 0.0399
5 A048 A 2004 6 0 0 0 0.329 13826. 0.0497
6 A111 A 2004 0 0 0 0 0.474 19352. 0.0282
str(fallow)
spc_tbl_ [210 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ site : chr [1:210] "A006" "A007" "A015" "A034" ...
$ region: chr [1:210] "A" "A" "A" "A" ...
$ year : num [1:210] 2004 2004 2004 2004 2004 ...
$ STOCU : num [1:210] 5 3 0 0 6 0 0 0 0 1 ...
$ SHTLA : num [1:210] 0 0 0 0 0 0 0 0 0 0 ...
$ CALLA : num [1:210] 0 0 0 1 0 0 0 0 0 0 ...
$ LITBU : num [1:210] 2 0 0 1 0 0 1 0 0 0 ...
$ natbuf: num [1:210] 0.32 0.332 0.534 0.755 0.329 ...
$ area : num [1:210] 35070 42620 3463 11739 13826 ...
$ ea : num [1:210] 0.0265 0.0253 0.0805 0.0399 0.0497 ...
- attr(*, "spec")=
.. cols(
.. site = col_character(),
.. region = col_character(),
.. year = col_double(),
.. STOCU = col_double(),
.. SHTLA = col_double(),
.. CALLA = col_double(),
.. LITBU = col_double(),
.. natbuf = col_double(),
.. area = col_double(),
.. ea = col_double()
.. )
- attr(*, "problems")=<externalptr>
summary(fallow)
site region year STOCU
Length:210 Length:210 Min. :2004 Min. :0.0000
Class :character Class :character 1st Qu.:2004 1st Qu.:0.0000
Mode :character Mode :character Median :2005 Median :0.0000
Mean :2005 Mean :0.8143
3rd Qu.:2005 3rd Qu.:1.0000
Max. :2005 Max. :8.0000
SHTLA CALLA LITBU natbuf
Min. : 0.0000 Min. : 0.000 Min. :0.0 Min. :0.0000
1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.:0.0 1st Qu.:0.1024
Median : 0.0000 Median : 0.000 Median :0.0 Median :0.2525
Mean : 0.4857 Mean : 1.152 Mean :0.5 Mean :0.2745
3rd Qu.: 0.0000 3rd Qu.: 1.000 3rd Qu.:1.0 3rd Qu.:0.4099
Max. :15.0000 Max. :17.000 Max. :6.0 Max. :0.9370
area ea
Min. : 535.2 Min. :0.01178
1st Qu.: 8576.6 1st Qu.:0.02659
Median : 19005.0 Median :0.03700
Mean : 24790.0 Mean :0.04249
3rd Qu.: 34447.3 3rd Qu.:0.05015
Max. :129500.2 Max. :0.18299
We can also access individual variables in the dataframe by using the $ operator. And we can use functions like mean(), median() or sd() to obtain summary statistics from each.
Try a few of these below with different variables of the fallow dataframe
mean(fallow$area)
[1] 24789.99
median(fallow$area)
[1] 19004.96
sd(fallow$area)
[1] 21708.7
var(fallow$area)
[1] 471267726
quantile(fallow$area, 0.25)
25%
8576.638
This last function, quantiles(), introduces the idea that a function can have more than 1 argument. In this case, the function quantiles() has two arguments, the first one is the object you want to calculate the quantile of (the area variable of the fallow dataframe) and the second argument (they are separated by commas) is the level of quantile you want to calculate (in this case 25%). You can calculate several quantiles at once by using a vector as the second argument. In R, vectors are a list of numbers, and we create them with c(). So, for example, if we want to calculate the 25% and 75% quantiles, we substitute the 0.25 above with a c(0.25, 0.75). Note that inside a vector, the different elements are also separated by commas. Now try calculating different quantiles of different variables below.
quantile(fallow$area, c(0.25, 0.75))
25% 75%
8576.638 34447.286
So far we’re only displaying this results in the console, we are not storing them in the environment, because we are not assigning them to any object (remember, the left facing arrow?). But we can assign all of these to objects and then you’ll see the name you’ve chosen for them appear in your environment as a stored object. For example, below I am going to demonstrate that the 50% quantile is the same as the median.
This introduces another idea. Whenever you want to tell R that two things are equal (or, like in this case, ask R if two things are equal), you need to use two equal signs, not just one. If you’re curious, ask me why, but it might confuse you :)
Can you think of a way of similarly proving that the standard deviation (sd()) is the square root of the variance (var())
Data exploration 2
So far we’ve used very basic functions for data exploration of single variables. However, what if we wanted to know the mean of the area by year? Or by region? Now we need to do a bit more coding (but don’t panic) and introduce a couple of extra concepts. First of all, if you’ve used R before, you might be used to doing this in a different way. There’s no wrong way, but I find this one I will show you (tidy coding) a bit more intuitive for people that are just getting started.
Tidy coding uses words that are “verbs” as functions (words like summarise, mutate, filter, select) that will give you a clue as to what the function is doing, and uses something called “pipes” that look like this %>% to string these verbs into a coherent sentence of what you want to do. So, for example, if you had a dataframe called example and you wanted to select some variables, then filter some rows out, then mutate one of the variables so it loos a bit different, it would look something like this:
example %>%select(...) %>%filter(...) %>%mutate(...)
Tip: you can automatically insert a pipe symbol by holding the Shift and Ctrl keys at the same time and typing m.
Now, let’s use tidy functions to produce a mean and standard deviation value of area and edge to area ratio per year.
We use group_by() to tell R what categorical variable we want to group our summary statistics by (does what it says in the tin, doesn’t it?).
Inside the function summarise() we can do as many summary statistics as you want, of as many variables from the original dataset as you want. You just need to separate them by commas (I also put every calculation in a new line as it looks tidier and easier to read, but that’s just a style choice).
Inside the function summarise() we don’t need to specify the name of the dataset and use the $ operator to specify the column (so no fallow$area, just area). This is because in the top line we’ve already told R we’ll be performing all of these operations onto the fallow dataset by using the pipe.
Can you figure out a way of asking R to give you summary statistics of each region now? What about for each region and year combination?
Another useful tidy function (that in reality is just a specific case of summarise()) is the function tally(). It will give you a count of how many observations per group. So, for example, if we want the number of observations per year and region, you would use the group_by() function as before, and then follow it by tally() instead of summarise().
fallow %>%group_by(year, region) %>%tally()
# A tibble: 8 × 3
# Groups: year [2]
year region n
<dbl> <chr> <int>
1 2004 A 25
2 2004 E 22
3 2004 G 21
4 2004 S 25
5 2005 A 33
6 2005 E 23
7 2005 G 30
8 2005 S 31
Plotting
Now onto the last difficult thing you’ll learn today (that wasn’t so bad, was it?). I’m going to teach you how to make beautiful plots using the package ggplot, which has a similar syntax to tidy coding, as I also thing it’s more intuitive for newbies than the base R way, but if you’ve learned any other way, again, that’s also valid.
To create plots with ggplot you use a similar structure as what we’ve seen for grouping by and summarising with tidy coding. First you tell R which dataset you will be plotting from, and then you add more and more layers of complexity as you want them. One confusing thing (and I agree it’s silly) is that instead of pipes (%>%) ggplot uses plus signs (+) to concatenate different layers.
Each layer of a plot you want to add is called a geom, and it can be points (for a scatterplot) lines (for a line plot) bars (for a barplot or histogram), etc. So, to plot points you would use geom_point() and to plot a barplot you’d use geom_bar().
Inside the geom you’d have an argument (remember, they’re the bits inside a function and are separated by commas?) that holds the information of what you want to plot (area on the x axis and ratio on the y axis for example), and this argument uses the name aes() (comes from aesthetics, don’t ask me why). Inside the aes() argument you’d put everything you want to map to a variable in your dataset, for example, as we said, area will be mapped to the x axis, and ratio will be mapped to the y axis, but we can also map year to colour (that means the points will be coloured by year), or number of stone curlews (STOCU) to the size of the points (that means, the higher the value in the column STOCU for that observation, the larger the point).
ggplot(fallow) +geom_point(aes(x = area, y = ea, col = year, size = STOCU))
You see that the legend for year looks a bit weird? That is because R thinks the variable year is numeric, same as the ea or area. However, we know it is a factor (it has value for 2004 and 2005 but nothing in between), so I’m now going to introduce you to another tidy verb, mutate(), to fix it.
And now we can perform the same plotting operation, but changing the name of the dataset from fallow to fallow_fixed and the variable year to year_factor
ggplot(fallow_fixed) +geom_point(aes(x = area, y = ea, col = year_factor, size = STOCU))
There are many arguments you can add to a ggplot to make the plot prettier: you can add a title, customise axes names and tick labels, change colours, etc. You can check out this link for more details or even ask ChatGPT! (promt example: using R, can you show me how to add a title to a ggplot plot?). For now, I’m only going to show you one example, which is adding a theme (automatic formatting to the background and colours). I don’t really like the gray background of the plot area, so I’m going to use a b&w theme (theme_bw()) to get rid of it.
ggplot(fallow_fixed) +geom_point(aes(x = area, y = ea, col = year_factor, size = STOCU)) +theme_bw()
Another very useful summary plot is a boxplot (geom_boxplot()). That will give you the median, 25% and 75% quantiles, maximum, minimum, and whether there are any outliers, for any data group you specify. To plot them in ggplot you only need to specify the grouping variable as the x argument, and the variable you want to summarise as the y argument inside the aes() (please don’t panic it’s simple even if it sounds complicated).
ggplot(fallow) +geom_boxplot(aes(x = region, y = STOCU)) +theme_bw()
Can you think of a way to make each box a different colour?
If you wanted to see stone curlews, which region would you visit for a better chance of encountering one?
Practice
Now is your turn to try and plot de data in as many ways as you can think of! Try also adding titles to the plots, or changing axes labels!