In this exercise we will analyse the WEBS Porth data that you are using in your Zoology in Practice module.
We will use this as an example data set that will give you practice in using R to answer whatever questons you might have of the data.
You can use the code you write here as examples that you can adapt whenever you want to do something similar again.
Now work through the exercises below. In your script, either copy the code you see below as you get to it, or follow instructions and write your own.
Remember to head your script as follows:
## Your name
## The date
## Sensible title
Make sure you annotate your script using comments, each one preceded by a # symbol, so that someone else or, more likely, you later on, will be able to understand what the script is doing.
We don’t need to do this, but it is good practice to do. See what happens in the environment pane (top-right) of RStudio when you implement this line. Vamoosh! All the things (they are called objects) in R’s ‘brain’ are gone. It’s neater this way.
rm(list=ls())
In this course, we will always use the tidyverse() package. Among other things, this contains the dyplr() and ggplot() packages, used for data manipulation and plotting respectively. We will also use the lubridate() package which is handy for doing things with dates.
library(tidyverse)
library(lubridate)
Note that if you have never installed these packages before on the machone you are working on, then you will need to do install.packages('tidyverse') in the console window, and maybe the same again for lubridate().
We will load the data into a data frame (sometimes called a tibble) called ‘porth’ and the read_me file into a data frame called ‘readme’.
porth<-read_csv('../data/WEBS_Porth.csv')
## Parsed with column specification:
## cols(
## SPECIES = col_character(),
## COUNTUNIT_NAME = col_character(),
## SITE_NAME = col_character(),
## DATE = col_character(),
## COUNT = col_double(),
## COUNT_ACCURACY = col_character(),
## COUNT_COMPLETENESS = col_logical(),
## SITE_GRIDREF = col_character(),
## COUNTY = col_character(),
## LO_REGION = col_character(),
## MONTH = col_double(),
## CALENDAR_YEAR = col_double(),
## WEBS_YEAR = col_double(),
## COMMENT = col_character(),
## DUPLICATE_CORE_VISIT = col_logical(),
## WEBS_METHOD = col_character()
## )
readme<-read_csv('../data/WEBS_Porth_readme.csv')
## Parsed with column specification:
## cols(
## `Column Header` = col_character(),
## Description = col_character(),
## `Given for count unit level` = col_character(),
## `Given for site level` = col_character(),
## `Notes on interpretation` = col_character()
## )
Then we check what we have got using glimpse().
glimpse(porth)
glimpse(readme)
In the environment pane of RStudio, you should see that both the porth and readme objects have appeared. Have they? If so, click on them. This should give you an Excel-like view of them in the script window, which you may find useful.
In the output of glimpse() you should see that some columns of data in porth are labelled <chr>, which means R thinks they are text, some are labelled <dbl>, which means R thinks they are numerical data, while a few are labelled <lgl>, which means that R thinks they are ‘logical’, that is, Yes/No, True/False, 1/0 type-things. Note that these columns in fact only contain NAs, which is R’s way of depicting missing data.
Often, data sets need to be manipulated in some way before you do any further analysis.They might need to be tidied, for example, or have missing data dealt with in some way.
A common thing to have to do is to sort out any dates and times, if you have any in your data.
This data set does have them, so as an example of data preparation, let’s deal with that.
We want to study the population trends of the different species of birds with time. This means we might need to tell R that the DATE column contains not just any old text but dates. Just in case, let’s do it. We use the lubridate() and dplyr() packages.
porth<-mutate(porth,DATE=dmy_hm(DATE))
What did that do?
glimpse(porth)
mutate() is one of the key commands from dplyr(). It operates on data frames (like porth). We use it to alter data in columns or to create new columns. Mutate! Gedddit?
The dmy_hm() command is from lubridate(), which is a package for playing around with dates. Geddit? We use this particular command becuase we can see from looking at our data that the dates in it came to us in the form “14/01/2001 15:45” that is, day-month-year hour-minute.
Can you guess what command you would have used had the dates or times been written in some other form, such as “01/14/2001 15:45”?
Dealing with date and time data can be very tricky, since we have to deal not only with different formats, but also with time zones, leap years and the changing of the clock for summer and winter time. lubridate() makes this business about as easy as it gets.
Note that the date column is no longer labelled as <chr> but as <dttm>. Huzzah! That tells us that R now recognises the data in it as what they are - dates!
Before we do any fancy statistical analysis, let us simply extract a few key summary statistics from the data, and then plot the data. Doing this first with data is almost always a good idea. More often than not, it will indicate to you whether your hunches about the data are correct, suggest likely answers to any questions you already have in mind about it (hopefully, if you took the data, you do have some!) and perhaps also suggest new questions.
To answer this, we use the very powerful group_by() then summarise() combination from dplyr().
In the following code, we first group the data by SPECIES, then for each species we count the number that have been seen. We choose the name count, which will appear as the column heading in a table that we will get as output, and to get the actual count, we use n() which tells dplyr() that we want to count how many there are of each species. Finally, we use arrange(-count) to sort the date in descending order of the count number.
porth %>%
group_by(SPECIES) %>%
summarise(count=n()) %>%
arrange(-count)
Note the use of the ‘pipe’ symbol %>%. This gives you a way of grouping a sequence of commands together in a very neat way. Each stage in the sequence acts as the input to the next stage.
How many species have been seen altogether? Which have been the three most commonly seen?
Let us take those three and see how the counts of observations have varied over time.
Let us consider the Mallard duck. First let’s see how many of these have been seen in each year.
Study the code below and satisfy yourself that you can see that it is a simple variation of the previous chunk of code.
porth %>%
filter(SPECIES=='Mallard') %>%
group_by(CALENDAR_YEAR) %>%
summarise(count=n())
There are data going from 1960 until 2020. That is a lot of years. Maybe a good way to study the population trend with time is to group the data by decade and do a box plot. Let’s try that.
First, let us create a new column which lists the decade of any observation. This requires a bit of maths, and for the plot we need to tell R that this column will be a ‘factor’, which is R-speak for categorical data.
porth<-mutate(porth,decade=as.factor(as.integer((CALENDAR_YEAR-1960)/10))) # the maths bit
table(porth$decade)
##
## 0 1 2 3 4 5 6
## 78 136 290 283 436 397 15
Almost there, let us relabel the decades as something more meaningful:
levels(porth$decade)<-c("1960s","1970s","1980s","1990s","2000s","2010s","2020s")
table(porth$decade)
##
## 1960s 1970s 1980s 1990s 2000s 2010s 2020s
## 78 136 290 283 436 397 15
That’s better!
Now let us plot this data. One type of plot that we could use is the box and whisker plot.
First, let us filter from the porth data all the rows about mallards, and put them in another data frame call, er, mallard:
mallard<-filter(porth,SPECIES=='Mallard')
To plot the data, we will use the command ggplot(), part of the ggplot2() package within tidyverse() to do this. ggplot() is not the only way to do plots in R, but it is very powerful, can produce supberb plots, and is what we will use in this course.
The code chunk below is an example of how to use ggplot(). Don’t worry if you don’t understand it at the moment. As you use and adapt it again and again, the purposes of the different lines will become clearer.
ggplot(mallard,aes(x=decade,y=COUNT))+
geom_boxplot()+
xlab('Decade')+
ylab('Count per year')+
ggtitle('Mallard observations at Porth, by decade')+
theme_bw()
Now produce similar plots, but for Coots and for Tufted Ducks.
See if you can adapt the code in the chunks above to produce a bar plot that shows how observed numbers of Mallards at Porth have varied by month, from 1960 to the present.
Hint: You can adapt two chunks of code you have already used to do this.
First manipulate porth using group_by() and summarise(), just like you did a short while ago, but this time group by MONTH, and this time save the fruits of your labour in an object called, say, mallard_by_month.
So your code for this part could start:
mallard_by_month<-porth %>%
filter(SPECIES=='Mallard') %>%
group_by(MONTH) %>%
summarise(count=n())
mallard_by_month
Then, to plot this, we could adapt other code we have used for other charts:
ggplot(mallard,aes(x=MONTH))+
geom_bar()+
xlab('Month')+
ylab('Count')+
ggtitle('Mallard observations at Porth, by month')+
theme_bw()
Notice how here we have used geom_bar() to do a bar chart, rather than geom_boxplot() for a box and whisker plot, as we did before.
This chart is a start and is good enough to tell you how counts have varied with month, but if it were to be shown to other people, it could be improved in a number of ways. For example we might want the x-axis to show the names of the months, and perhaps we would like that axis to start in July and go on until June, since that would capture a full ‘season’ of observations.
But let’s leave it for now……
target <- c('Mallard','Coot','Tufted Duck')
topthree<-filter(porth,SPECIES %in% target)
ggplot(topthree,aes(x=decade,y=COUNT,fill=SPECIES,colour=SPECIES))+
geom_boxplot()+
xlab('Decade')+
ylab('Count per year')+
ggtitle('Duck observations at Porth, by decade')+
theme_bw()
Find this plot too busy? We could instead do a separate plot for each of these species, but place the plots side by side. ggplot() has a very useful feature called facet_wrap() which you can use for this.
target <- c('Mallard','Coot','Tufted Duck')
topthree<-filter(porth,SPECIES %in% target)
ggplot(topthree,aes(x=decade,y=COUNT))+
geom_boxplot()+
xlab('Decade')+
ylab('Count per year')+
ggtitle('Observations at Porth, by decade')+
facet_wrap(~SPECIES)+
theme_bw()+
theme(axis.text.x = element_text(angle = 90)) # rotate the tick labels on the x-axis
Now we have looked at our data in a few ways, we might probe a little deeper, ask some specific questions of the data, and try to establish how confident we can be of what it is telling us about the variation with time of the population of this or that species of bird at Porth.
Note that the box plots show the median and interquartile range values for the data that is plotted.
These are example of robust data summaries in that they are not influenced by outliers in the data.
Means and standard deviations, on the other hand, are influenced by outliers.
So if we wish to describe the central tendency of a data set, the median can be a better way to do that if the data has outliers. In that case, a median value gives a better idea of a typical value in the data set than does the mean.
Consider the Mallard data. Let us calculate mean and median annual observations in each decade. We will exclude the 2020s decade since we have only got one year of data so far in that decade.
mallard_numbers<-porth %>%
filter(SPECIES=='Mallard',decade != '2020s') %>%
group_by(decade) %>%
summarise (count=sum(COUNT),mean=mean(COUNT),st_dev.=sd(COUNT),median=median(COUNT))
mallard_numbers
We saw from the box plot that there were some years in the 2000-2010 decade with extremely high observed numbers. These will skew the mean, but have less impact on the median. We note that the mean in the 2000s is very high, but so is the standard deviation (ie, the spread of numbers in that decade)
Still, it does look as though perhaps the Mallard numbers have been increasing. Let us investigate further using a Chi Square test on the total counts in each decade.
chisq.test(mallard_numbers$count)
The chi square test is appropriate for count data across different categories (in our case, decades) where the data in each category are independent of the data in the other categories (which isn’t quite true in our case, most likely, but moving on…) and where we have a minimum of around 5 counts in each category. We are OK there. The default null hypothesis is that the count numbers among the population are the same in every category. The p-value is the likelihood that we would have got count numbers as different or more different from the null hypothesis as we have in fact in got.
Our p-value is tiny. This tells us that, very likely, Mallard numbers at Porth have fluctuated across the decades.
You might say, well, I know that, I can see it from the data. Yes, but remember that the data is just a sample drawn from a wider population. We want to draw inferences from the sample about the population as a whole.
Let us postulate that the population of Mallards at Porth has been static over the decades: it has neither increased nor decreased. Let us call this the Null Hypothesis - the name generally given to the ‘nothing going on’ scenario.
Given that hypothesis, how likely are our data? Further, if we have sufficient evidence to reject the null hypothesis, is there an upward or downward trend in the data?
To look for a trend, we can plot the population data with an added fitted straight line
g<-ggplot(filter(porth,SPECIES=='Mallard'),aes(x=CALENDAR_YEAR,y=COUNT))+
geom_point()+theme_bw()+
geom_smooth(method='lm',se=FALSE)+
xlab('Decade')+
ylab('Count per year')+
ggtitle('Observations of Mallards at Porth, by year')+
theme_bw()
g
The line points upwards, which suggest a rising population over the years, but there is a lot of scatter in the data. Perhaps the gradient of the line reflects this scatter rather than any underyling trend.
To test this, we can find the gradient of this line using the command lm() , which stands for linear model. lm() will not only tell us the gradient, but the also a p-value: the chance that we could have got a gradient as far from zero as ours is, for our sample data, if in fact the null hypothesis is true. Under this hypothesis, remember, the population has not changed over the years, and so the gradient of the line would be zero, if our sample were the entire population of Mallards at Porth.
fitMallard<-lm(COUNT~CALENDAR_YEAR,data=filter(porth,SPECIES=='Mallard'))
summary(fitMallard)
In the summary(), look at the coefficients section. This gives you the intercept and gradient of the fitted line. The gradient is 0.49. In the right hand column of the table, labelled Pr(>|t|), you get the p-values. The p-value for the gradient is 0.0001. This is telling you that there is a 0.0001 (ie 1 in ten thouasand) chance that your sample data would have shown the steadily rising trend it has if in fact the null hypothesis, that the population as a while has been static, were true.
What do you think is a reasonable conclusion? Accept the null hypothesis, or reject it?