Simple statistics (1)

A data set containing information about the sizes of Norway rat skulls in the pellets of Scandinavian eagle-owls is available in the ratskull.csv file (you may have come across this before). The data comprise a column of rat skull sizes (measured in grams) and a column of codes indicating the season when a particular skull sample was taken. These data were collected in order to evaluate whether there is a difference between sizes of rats eaten in summer and winter. That is, we want to know if there is a statistically significant difference between the mean rat skull sizes in the winter and summer samples.

Download the ratskull.csv file from Google Classroom and place it in your working directory (this is the location you set at the beginning of each R session). Read the data in ratskull.csv into R.

rats <- read.csv("ratskull.csv")

Start by looking at the data — both visually and in terms of its descriptive statistics:

Inspection. Use the View function and the dplyr function glimpse to visually inspect the raw data. What are the names given to the rat skull size variable and the season indicator variable? What values does the season indicator variable take?

# we can't use View on this web page so we'll print all the data
rats
##    Weight Season
## 1   2.173 Winter
## 2   2.284 Winter
## 3   2.300 Winter
## 4   2.331 Winter
## 5   2.639 Winter
## 6   1.889 Winter
## 7   1.952 Winter
## 8   2.738 Winter
## 9   2.439 Winter
## 10  2.357 Winter
## 11  2.895 Winter
## 12  2.534 Winter
## 13  2.394 Winter
## 14  2.307 Winter
## 15  2.334 Winter
## 16  2.288 Winter
## 17  2.512 Winter
## 18  2.056 Winter
## 19  2.579 Winter
## 20  1.729 Winter
## 21  2.109 Winter
## 22  2.433 Winter
## 23  2.418 Winter
## 24  2.052 Winter
## 25  2.341 Winter
## 26  2.189 Summer
## 27  2.615 Summer
## 28  2.564 Summer
## 29  2.245 Summer
## 30  2.528 Summer
## 31  2.998 Summer
## 32  2.655 Summer
## 33  2.349 Summer
## 34  2.955 Summer
## 35  2.465 Summer
## 36  2.685 Summer
## 37  2.522 Summer
## 38  2.413 Summer
## 39  2.525 Summer
## 40  2.018 Summer
## 41  2.393 Summer
## 42  2.766 Summer
## 43  2.648 Summer
## 44  2.063 Summer
## 45  2.939 Summer
## 46  2.590 Summer
## 47  2.428 Summer
## 48  2.855 Summer
## 49  2.384 Summer
## 50  2.296 Summer
glimpse(rats)
## Observations: 50
## Variables: 2
## $ Weight <dbl> 2.173, 2.284, 2.300, 2.331, 2.639, 1.889, 1.952, 2.738,...
## $ Season <fctr> Winter, Winter, Winter, Winter, Winter, Winter, Winter...

So we have two variables, Weight and Season. The Weight variable is obviously the weight of each rat skull. The Season variable is a categorical variable (a ‘factor’ in R-speak) that designates the season when the rat skull was collected.

Descriptive statistics. Use the appropriate dplyr functions (group_by and summarise) to calculate the sample size, sample mean and standard deviation of each sample.

rats %>% 
  group_by(Season) %>%
  summarise(
    sizeW = n(),
    meanW = mean(Weight),
    stdvW = sd(Weight)
  )
## # A tibble: 2 × 4
##   Season sizeW   meanW     stdvW
##   <fctr> <int>   <dbl>     <dbl>
## 1 Summer    25 2.52352 0.2603290
## 2 Winter    25 2.32332 0.2647103

So we have 25 observations in each season, rat skulls are larger on average in summer, and the standard deviation is very similar in each group.
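For readers who do not have ratskull.csv to hand, the same summary can be reproduced in base R from the values printed above. This is only a sketch: the data frame is typed in from the printout, and the names rats_check and by_season are made up for illustration.

```r
# Weights copied from the printed data above (rows 1-25 winter, 26-50 summer)
winter <- c(2.173, 2.284, 2.300, 2.331, 2.639, 1.889, 1.952, 2.738, 2.439,
            2.357, 2.895, 2.534, 2.394, 2.307, 2.334, 2.288, 2.512, 2.056,
            2.579, 1.729, 2.109, 2.433, 2.418, 2.052, 2.341)
summer <- c(2.189, 2.615, 2.564, 2.245, 2.528, 2.998, 2.655, 2.349, 2.955,
            2.465, 2.685, 2.522, 2.413, 2.525, 2.018, 2.393, 2.766, 2.648,
            2.063, 2.939, 2.590, 2.428, 2.855, 2.384, 2.296)

rats_check <- data.frame(
  Weight = c(winter, summer),
  Season = factor(rep(c("Winter", "Summer"), each = 25))
)

# Base R counterpart of group_by + summarise: split the weights by season,
# then compute the sample size, mean and standard deviation of each piece
by_season <- t(sapply(
  split(rats_check$Weight, rats_check$Season),
  function(w) c(sizeW = length(w), meanW = mean(w), stdvW = sd(w))
))
by_season
```

The rows of by_season should agree with the dplyr summary above, which is a quick way to confirm that the data were read in correctly.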

Graphs. Use ggplot2 to construct a pair of dot plots, one above the other, to summarise the winter and summer skull size distributions. HINT: you will need to use geom_dotplot and the facet_wrap functions to do this.

ggplot(rats, aes(x = Weight)) +
  # pick a sensible bin width, given the range of your data
  geom_dotplot(binwidth = 0.1) +
  # make a separate panel for each season 
  facet_wrap(~ Season)

Question. Using the dot plots, and the descriptive statistics, conduct an informal evaluation of the assumptions of the t-test. Do you feel the data conform acceptably to the assumptions? If not, make sure you can explain why.

It looks like these data are well-suited to a two-sample t-test. The dot plots indicate that rat skull sizes in each season are roughly normally distributed. The two samples also have similar variances (though this matters less than one might think, because the t.test function does not assume equal variances by default).
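One rough way to put a number on the similarity of the variances is the ratio of the two sample variances, computed here from the standard deviations reported in the summary table. This is an informal check rather than a formal test, and the variable names are illustrative only.

```r
# Standard deviations and group size taken from the summary table
sd_summer <- 0.2603290
sd_winter <- 0.2647103
n <- 25

# A ratio close to 1 means the two spreads are alike
var_ratio <- sd_winter^2 / sd_summer^2
var_ratio

# With equal group sizes, the pooled (equal-variance) and Welch
# standard errors of the difference between means are identical
se_welch  <- sqrt(sd_summer^2 / n + sd_winter^2 / n)
sd_pooled <- sqrt((sd_summer^2 + sd_winter^2) / 2)
se_pooled <- sd_pooled * sqrt(1 / n + 1 / n)
c(welch = se_welch, pooled = se_pooled)
```

Because both samples contain 25 skulls, the two versions of the test share the same standard error here and differ only in their degrees of freedom.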

Statistical test. Use the R t.test function to compare the skull sizes in each season.

t.test(Weight ~ Season, data = rats)
## 
##  Welch Two Sample t-test
## 
## data:  Weight by Season
## t = 2.6961, df = 47.987, p-value = 0.009645
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.05090054 0.34949946
## sample estimates:
## mean in group Summer mean in group Winter 
##              2.52352              2.32332

The first part of the output reminds us what we did. The first line tells us what kind of t-test we used. This says: Welch two-sample t-test, so we know that we have used the Welch version of the two-sample t-test which accounts for the possibility of unequal variance in the samples. The next line reminds us about the data. This says: data: Weight by Season, which is R-speak for ‘we compared the means of the Weight variable, where the sample membership is defined by the values of the Season variable’.

The third line of text is the most important. This says: t = 2.6961, df = 47.987, p-value = 0.009645. The first part of this, t = 2.6961, is the test statistic (i.e. the value of the t-statistic). The second part, df = 47.987, gives the ‘degrees of freedom’. The third part, p-value = 0.009645, is the all-important p-value. This says there is a statistically significant difference in mean rat skull weight between the two seasons, because p < 0.01.
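To see where these three numbers come from, they can be recomputed by hand from the summary statistics. The sketch below uses the standard Welch formulas with the means and standard deviations reported earlier. The variable names are illustrative, and small rounding differences from the t.test output are to be expected.

```r
# Summary statistics from the descriptive statistics step
mean_summer <- 2.52352; sd_summer <- 0.2603290
mean_winter <- 2.32332; sd_winter <- 0.2647103
n <- 25  # skulls per season

# Squared standard error of each sample mean
v_summer <- sd_summer^2 / n
v_winter <- sd_winter^2 / n

# t-statistic: difference between means divided by its standard error
t_stat <- (mean_summer - mean_winter) / sqrt(v_summer + v_winter)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_summer + v_winter)^2 /
  (v_summer^2 / (n - 1) + v_winter^2 / (n - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

c(t = t_stat, df = df_welch, p = p_value)
```

The non-integer degrees of freedom are a signature of the Welch version of the test. The classic equal-variance test would use 25 + 25 - 2 = 48.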

The fourth line of text (alternative hypothesis: true difference in means is not equal to 0) just reminds us what the alternative to the null hypothesis is (H1). The next two lines show us the ‘95% confidence interval’ for the difference between the means. We don’t really need this information, but we can think of this interval as a summary of the likely values of the true difference (again, a confidence interval is more complicated than that in reality). The last few lines just summarise the sample means of each group. This is only useful if we did not bother to calculate these already.
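The confidence interval can also be rebuilt from the pieces already seen: the difference between the sample means, plus or minus a critical t value times the standard error. This is a minimal sketch using the reported summary statistics and the degrees of freedom from the output above. The variable names are illustrative.

```r
# Difference between the sample means (summer minus winter)
diff_means <- 2.52352 - 2.32332

# Standard error of the difference, from the reported standard deviations
se_diff <- sqrt(0.2603290^2 / 25 + 0.2647103^2 / 25)

# Degrees of freedom as reported by t.test
df_welch <- 47.987

# Critical value leaving 2.5% in each tail of the t distribution
t_crit <- qt(0.975, df = df_welch)

# Lower and upper limits of the 95% confidence interval
ci <- diff_means + c(-1, 1) * t_crit * se_diff
ci
```

The interval excludes zero, which is consistent with a p-value below 0.05.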

Question. Prepare a concise but complete conclusion summarising the results of the test. Is this what you expected from looking at the distributions of data in the two samples?

The mean weight of rat skulls in summer (2.52 grams) is significantly greater than the mean skull weight in winter (2.32 grams) (Welch’s t = 2.70, d.f. = 48, p < 0.01).

Notice that we should indicate which mean is the larger. It is sometimes also useful to give the values of the means in the conclusion, as we did here.

Question. Suggest two possible biological reasons for the result you observe.

The obvious explanation is that eagle owls are more selective in the summer, preferring to hunt larger rats. Presumably, they are less selective in winter when prey are more scarce.

Simple statistics (2)

We are now going to look at prey choice between male and female eagle owls. You have seen that the prey of eagle owls can be established by examination of the pellets containing the undigested remains of their prey. In the eagle owl study the diets of the male and female of a pair were studied by examination of the pellets collected from beneath their roosts (fortunately, an individual tends to use the same roosting site, and individuals tend not to roost together). The numbers of all prey types found in the pellets were recorded.

These data are in the file eagles.csv. Read these data into R and inspect them to ensure you understand how they are organised. Once you understand the data, make a bar plot to summarise the important patterns.

Start by reading in the data…

eagle_diet <- read.csv("eagles.csv")

…and then look at it:

glimpse(eagle_diet)
## Observations: 12
## Variables: 3
## $ Sex   <fctr> Female, Female, Female, Female, Female, Female, Male, M...
## $ Prey  <fctr> Hare     , Squirrel , Vole     , Rat      , Waterbird, ...
## $ Count <int> 16, 5, 24, 12, 36, 5, 7, 6, 35, 23, 23, 8
eagle_diet
##       Sex      Prey Count
## 1  Female Hare         16
## 2  Female Squirrel      5
## 3  Female Vole         24
## 4  Female Rat          12
## 5  Female Waterbird    36
## 6  Female Bird          5
## 7    Male Hare          7
## 8    Male Squirrel      6
## 9    Male Vole         35
## 10   Male Rat          23
## 11   Male Waterbird    23
## 12   Male Bird          8

This is a pretty simple data set showing the frequency of each combination of two categorical variables (Sex and Prey). Here’s how to make a stacked bar plot to visualise them:

ggplot(eagle_diet, aes(x = Prey, fill = Sex, y = Count)) +
  geom_bar(stat = "identity")

There is definitely some variation between males and females in terms of the prey they take. Both males and females seem to have a preference for larger prey such as water birds and hares, but the preference is stronger in females – it looks like there may be an association between prey choice and eagle owl sex…

Statistical test. Determine whether there is any evidence of differences in the diets of the male and female eagle owls. What do you conclude? If there is an effect, what might account for the result?

We need to carry out a chi-square contingency table test to assess whether the apparent association between prey choice and eagle owl sex is statistically significant. We do this in two steps…

First we need to convert our eagle_diet data set (which is a data frame) into a ‘table’. We can do this with xtabs:

eagle_diet_tb <- xtabs(Count ~ Prey + Sex, data = eagle_diet)
eagle_diet_tb
##            Sex
## Prey        Female Male
##   Bird           5    8
##   Hare          16    7
##   Rat           12   23
##   Squirrel       5    6
##   Vole          24   35
##   Waterbird     36   23
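Since the two sexes have slightly different totals, it can help to look at proportions as well as raw counts. The sketch below retypes the table as a plain matrix (the names counts and prop_by_sex are illustrative) and uses prop.table to express each prey type as a fraction of that sex's total.

```r
# Counts copied from the contingency table above
counts <- matrix(
  c(5, 16, 12, 5, 24, 36,    # Female column
    8,  7, 23, 6, 35, 23),   # Male column
  ncol = 2,
  dimnames = list(
    Prey = c("Bird", "Hare", "Rat", "Squirrel", "Vole", "Waterbird"),
    Sex  = c("Female", "Male")
  )
)

# margin = 2 divides each column by its own total,
# so every column of the result sums to 1
prop_by_sex <- prop.table(counts, margin = 2)
round(prop_by_sex, 2)
```

If diet did not depend on sex, the two columns of proportions would be expected to look alike, apart from sampling variation.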

Once the data is in the correct format the chi-square contingency table test is easy:

chisq.test(eagle_diet_tb)
## 
##  Pearson's Chi-squared test
## 
## data:  eagle_diet_tb
## X-squared = 12.602, df = 5, p-value = 0.0274

R first prints a reminder of the test employed (Pearson's Chi-squared test) and the data used (data: eagle_diet_tb). R then summarises the chi-square value, the degrees of freedom, and the p-value: X-squared = 12.602, df = 5, p-value = 0.0274. The p-value is significant (p < 0.05), indicating that prey choice differs between the two sexes.
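The chi-square value can be checked by hand: work out the counts expected if prey choice were independent of sex, then sum the squared differences between observed and expected counts, each divided by the expected count. This is a sketch of that textbook calculation with the table retyped as a matrix. The variable names are illustrative.

```r
# Observed counts copied from the contingency table
observed <- matrix(
  c(5, 16, 12, 5, 24, 36,    # Female column
    8,  7, 23, 6, 35, 23),   # Male column
  ncol = 2,
  dimnames = list(
    Prey = c("Bird", "Hare", "Rat", "Squirrel", "Vole", "Waterbird"),
    Sex  = c("Female", "Male")
  )
)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic and its degrees of freedom
chi_sq <- sum((observed - expected)^2 / expected)
df_chi <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability from the chi-square distribution
p_value <- pchisq(chi_sq, df = df_chi, lower.tail = FALSE)

c(X_squared = chi_sq, df = df_chi, p = p_value)
```

Looking at observed - expected cell by cell shows which prey types contribute most to the overall statistic.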

There is a significant association between diet choice and the sex of eagle owls (chi-square = 12.6, d.f. = 5, p < 0.05).