Practical 2 - AESC40180 25/26

Author

Virginia Morera-Pujol

Introduction

In this very short practical we will explore the variables in the Fallow dataset we already used last week, and try and figure out what distribution would be more appropriate to use with each of them.

Reminder of what is contained in the dataset:

Data were collected in several sites (column variable site ) in 4 different regions (variable region ) over several years (variable year)

Other variable names

  • STOCU: Stone curlew

  • SHTLA: Short-toed lark

  • CALLA: Calandra lark

  • LITBS: Little bustard

  • natbuf: % of area set aside with natural vegetation

  • area: field size (in ha)

  • ae: ratio of edge (m) to area (m2) set aside

Housekeeping

We already explained this last week, but remember to do this at the beginning of every script to start with a clean environment. Remember, “Rule 5: adopt good practices and be consistent”

rm(list=ls()) # this clears the environment from any leftover objects

# setwd("Yourpath") #this tells R what folder you'll be working out of

library(tidyverse) #this loads the packages we'll need for easy coding
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load data

With the line of code below you’ll load the dataset.

fallow <- read_csv("Data/Fallow.csv")
Rows: 210 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): site, region
dbl (8): year, STOCU, SHTLA, CALLA, LITBU, natbuf, area, ea

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Explore data

I will display here an example with the variable STOCU, and you can then practice with all other variables. First of all, remember you can always check what type of variable we are dealing with by displaying the structure of the whole dataset with the function str(), or you can call on the class of one specific variable using the function class() and the $ operator that lets you select individual varibales from a dataset

str(fallow)
spc_tbl_ [210 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ site  : chr [1:210] "A006" "A007" "A015" "A034" ...
 $ region: chr [1:210] "A" "A" "A" "A" ...
 $ year  : num [1:210] 2004 2004 2004 2004 2004 ...
 $ STOCU : num [1:210] 5 3 0 0 6 0 0 0 0 1 ...
 $ SHTLA : num [1:210] 0 0 0 0 0 0 0 0 0 0 ...
 $ CALLA : num [1:210] 0 0 0 1 0 0 0 0 0 0 ...
 $ LITBU : num [1:210] 2 0 0 1 0 0 1 0 0 0 ...
 $ natbuf: num [1:210] 0.32 0.332 0.534 0.755 0.329 ...
 $ area  : num [1:210] 35070 42620 3463 11739 13826 ...
 $ ea    : num [1:210] 0.0265 0.0253 0.0805 0.0399 0.0497 ...
 - attr(*, "spec")=
  .. cols(
  ..   site = col_character(),
  ..   region = col_character(),
  ..   year = col_double(),
  ..   STOCU = col_double(),
  ..   SHTLA = col_double(),
  ..   CALLA = col_double(),
  ..   LITBU = col_double(),
  ..   natbuf = col_double(),
  ..   area = col_double(),
  ..   ea = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
class(fallow$STOCU)
[1] "numeric"

We can see it is a numeric variable, and looking at the numbers it appears to be a discrete variable (no decimals) which makes sense as we know these are counts of observed stone curlews.

Histogram

One of the best ways of checking the distribution of a variable is to plot a histogram. Go back to last weeks tutorial or to the resources on brightspace for a more detailed explanation on ggplot2, but here below is the code to plot a histogram

ggplot(fallow) + 
  geom_histogram(aes(x = STOCU), bins = 10) + 
  theme_bw()

Do you remember what the bins argument does? Play around with it a little bit to see what it does to the resulting plot.

QQ plot

You can also create a qqplot to explore how well your data adjust to a normal distribution. In order to do so, you need to “stack” two different geoms in ggplot2. This will be good practice on how to build more complex plots.

ggplot(fallow) +
  stat_qq(aes(sample = STOCU), col = "darkred") + 
  stat_qq_line(aes(sample = STOCU)) + 
  theme_bw()

  • Which part of the plot do you think stat_qq() is responsible for, and which part corresponds to stat_qq_line()?

  • Do you think this variable is normal, or could be considered “normal enough”?

  • Why do you think the data (points) are organised in such a structure?

Normality test

In R, you can also perform a Saphiro-Wilk normality test. The function is very simply shapiro.test(), and it only takes one argument, the variable you want to test for normality, which you have to specify but naming the dataset it belongs, and then selecting it with the $ operator. If you ever have doubts about what arguments a function takes, or what they are, you can always ask r for help using the function help(), see below

shapiro.test(fallow$STOCU)

    Shapiro-Wilk normality test

data:  fallow$STOCU
W = 0.66868, p-value < 2.2e-16
help(shapiro.test)
starting httpd help server ... done

Do you remember how to interpret the Shapiro-Wilk test?. Go back to the lecture to remind yourself what the W is, and under which p-value we reject the null hypothesis (and even what the null hypothesis is). Hypothesis testing is not easy to wrap your head around, so look it up as many times as necessary (I still do to this day!)

Now it’s your turn!

Follow the above structure to plot histograms, qq-plots, and test the normality of all the other variables in the dataset. A few questions before you start.

  • Are there any variables you expect, beforehand, not to conform with normality?

  • What distribution do you think would be more appropriate for variables like STOCU? Why?