In this very short practical we will explore the variables in the Fallow dataset we already used last week, and try and figure out what distribution would be more appropriate to use with each of them.
Reminder of what is contained in the dataset:
Data were collected in several sites (column variable site ) in 4 different regions (variable region ) over several years (variable year)
Other variable names
STOCU: Stone curlew
SHTLA: Short-toed lark
CALLA: Calandra lark
LITBS: Little bustard
natbuf: % of area set aside with natural vegetation
area: field size (in ha)
ae: ratio of edge (m) to area (m2) set aside
Housekeeping
We already explained this last week, but remember to do this at the beginning of every script to start with a clean environment. Remember, “Rule 5: adopt good practices and be consistent”
rm(list=ls()) # this clears the environment from any leftover objects# setwd("Yourpath") #this tells R what folder you'll be working out oflibrary(tidyverse) #this loads the packages we'll need for easy coding
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Load data
With the line of code below you’ll load the dataset.
fallow <-read_csv("Data/Fallow.csv")
Rows: 210 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): site, region
dbl (8): year, STOCU, SHTLA, CALLA, LITBU, natbuf, area, ea
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Explore data
I will display here an example with the variable STOCU, and you can then practice with all other variables. First of all, remember you can always check what type of variable we are dealing with by displaying the structure of the whole dataset with the function str(), or you can call on the class of one specific variable using the function class() and the $ operator that lets you select individual varibales from a dataset
str(fallow)
spc_tbl_ [210 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ site : chr [1:210] "A006" "A007" "A015" "A034" ...
$ region: chr [1:210] "A" "A" "A" "A" ...
$ year : num [1:210] 2004 2004 2004 2004 2004 ...
$ STOCU : num [1:210] 5 3 0 0 6 0 0 0 0 1 ...
$ SHTLA : num [1:210] 0 0 0 0 0 0 0 0 0 0 ...
$ CALLA : num [1:210] 0 0 0 1 0 0 0 0 0 0 ...
$ LITBU : num [1:210] 2 0 0 1 0 0 1 0 0 0 ...
$ natbuf: num [1:210] 0.32 0.332 0.534 0.755 0.329 ...
$ area : num [1:210] 35070 42620 3463 11739 13826 ...
$ ea : num [1:210] 0.0265 0.0253 0.0805 0.0399 0.0497 ...
- attr(*, "spec")=
.. cols(
.. site = col_character(),
.. region = col_character(),
.. year = col_double(),
.. STOCU = col_double(),
.. SHTLA = col_double(),
.. CALLA = col_double(),
.. LITBU = col_double(),
.. natbuf = col_double(),
.. area = col_double(),
.. ea = col_double()
.. )
- attr(*, "problems")=<externalptr>
class(fallow$STOCU)
[1] "numeric"
We can see it is a numeric variable, and looking at the numbers it appears to be a discrete variable (no decimals) which makes sense as we know these are counts of observed stone curlews.
Histogram
One of the best ways of checking the distribution of a variable is to plot a histogram. Go back to last weeks tutorial or to the resources on brightspace for a more detailed explanation on ggplot2, but here below is the code to plot a histogram
Do you remember what the bins argument does? Play around with it a little bit to see what it does to the resulting plot.
QQ plot
You can also create a qqplot to explore how well your data adjust to a normal distribution. In order to do so, you need to “stack” two different geoms in ggplot2. This will be good practice on how to build more complex plots.
ggplot(fallow) +stat_qq(aes(sample = STOCU), col ="darkred") +stat_qq_line(aes(sample = STOCU)) +theme_bw()
Which part of the plot do you think stat_qq() is responsible for, and which part corresponds to stat_qq_line()?
Do you think this variable is normal, or could be considered “normal enough”?
Why do you think the data (points) are organised in such a structure?
Normality test
In R, you can also perform a Saphiro-Wilk normality test. The function is very simply shapiro.test(), and it only takes one argument, the variable you want to test for normality, which you have to specify but naming the dataset it belongs, and then selecting it with the $ operator. If you ever have doubts about what arguments a function takes, or what they are, you can always ask r for help using the function help(), see below
shapiro.test(fallow$STOCU)
Shapiro-Wilk normality test
data: fallow$STOCU
W = 0.66868, p-value < 2.2e-16
help(shapiro.test)
starting httpd help server ... done
Do you remember how to interpret the Shapiro-Wilk test?. Go back to the lecture to remind yourself what the W is, and under which p-value we reject the null hypothesis (and even what the null hypothesis is). Hypothesis testing is not easy to wrap your head around, so look it up as many times as necessary (I still do to this day!)
Now it’s your turn!
Follow the above structure to plot histograms, qq-plots, and test the normality of all the other variables in the dataset. A few questions before you start.
Are there any variables you expect, beforehand, not to conform with normality?
What distribution do you think would be more appropriate for variables like STOCU? Why?