Research Methods Test

14-03-2025

HSOC5100

Henry Shaw


Question 1

A friend working on a project overseas has been asked to analyse radiotelemetry data for a small terrestrial rodent. Radiotelemetry allows the researcher to determine the position of the animal wearing a VHF beacon by triangulation. They have sent you the data below (Table 1) and they want you to help them test whether the animal uses all of the habitats in proportion to their occurrence in the environment.

Null Hypothesis

H0 The rodent uses all habitat types in proportion to their occurrence in the environment.

H1 The rodent shows preference and therefore does not use all habitats in proportion to their occurrence in the environment.

Reviewing the data

Table 1: Rodent habitat use frequency
Habitat Type Area (km2) Number of rodent locations
Primary Forest 4 17
Secondary growth forest 2 2
Natural meadow 3 15
Recent clear cut 1 4
Recently burned 1 8
Alpine tundra 2 5
Agricultural land 2 4

The table titles must be without spaces to be read by R. Secondly it must be made sure that the habitat is defined as a factor and the area and observed locations (expected and observed) are numeric:

Rodent.Data <- Rodent.Data |> 
  rename(
    habitat = `Habitat Type`,
    area_km2 = `Area (km2)`,
    observed_locations = `Number of rodent locations`
  )
Rodent.Data$habitat <- as.factor(Rodent.Data$habitat)

Now the data can be visualised (Figure 1). This helps to get an idea of what the test might show as one can see broad differences between the observed and expected frequencies:

ggplot(Rodent.Data, aes(x = habitat, y = observed_locations)) +  
  geom_bar(stat = "identity", fill = "forestgreen", alpha = 0.6) +
  geom_point(aes(y = area_km2), colour = "darkorange3", size = 3) +
  labs(y = "Location Frequency", x = "Habitat Type") +
  theme_light() +
  coord_flip()
__Figure 1__ - The observed vs expected rodent habitat use. The orange points are the area of each habitat site and therefore the expected

Figure 1 - The observed vs expected rodent habitat use. The orange points are the area of each habitat site and therefore the expected

Statistical test(s) required

This is a test comparing observed vs. expected frequencies of habitat use relative to area. Specifically, is usage proportional to availability. These are categorical variables. To determine whether there is a relationship between two categorical, the Chi-Square test is appropriate. ### Chi-square Formula

The Chi-square statistic is calculated using:

\[ \chi^2 = \sum \frac{(O - E)^2}{E} \] The test statistic (x2) is calculated by working out the sum of the terms (Σ) (Dytham, 2011). The expected frequency (E) is how many times the rodent should be found in each habitat according to proportional abundance. Subtracting the expected from the observed frequency (O), in this case, number of rodent locations, determines by how much the observed differs from the expected. To remove negative values, the result of this is squared. The squared differences for each habitat type are then divided by the expected frequency This makes the test account for variation in the expected values and thus tells us how much the observed values deviate from the expected values.

The data should be correctly formatted but, to check the columns are correctly categorised as factors and numbers one can print the data:

print(Rodent.Data)
## # A tibble: 7 × 3
##   habitat                 area_km2 observed_locations
##   <fct>                      <dbl>              <dbl>
## 1 Primary Forest                 4                 17
## 2 Secondary growth forest        2                  2
## 3 Natural meadow                 3                 15
## 4 Recent clear cut               1                  4
## 5 Recently burned                1                  8
## 6 Alpine tundra                  2                  5
## 7 Agricultural land              2                  4


As they are categorised appropriately the data can be interpreted; a Chi-square test is possible:

chi_result <- chisq.test(Rodent.Data$observed_locations, 
                         p=Rodent.Data$area_km2 / sum(Rodent.Data$area_km2))
print(chi_result)
## 
##  Chi-squared test for given probabilities
## 
## data:  Rodent.Data$observed_locations
## X-squared = 13.114, df = 6, p-value = 0.04127

Results

Interpretation

The results of the test demonstrate that the rodent shows clear preference with high significance over some habitat types. For example, recently burned habitat is visited frequently (8 times) in proportion to its size (1km2). The rodent shows preference and therefore does not use all habitats in proportion to their occurrence in the environment.

Rodent habitat use is influenced by food availability, predation risk and factors related to the microclimate (exposure and temperature).

Further study into habitat characteristics and rodent behaviour could provide more detail as to the reasons for preference.




Question 2

How does habitat quality affect the population size of a species, and what impact might improving habitat quality have on conservation of this species? Analyse the information in Table 2 to answer this question.

Null Hypothesis


H0 There is no correlation between habitat quality and population size.

H1 There is a correlation between habitat quality and population size.


Reviewing the data

Table 2: Species population size relating to habitat quality index
Habitat Quality Index (0-1) Species Population Size (Individuals)
0.60 450
0.55 350
0.80 750
0.85 850
0.95 1000
0.25 150
0.70 600
0.80 750
0.40 200
0.90 950

The data need restructuring to analyse in R:

Habitat.Data <- Habitat.Data |> 
  rename(
    habitat_quality = `Habitat Quality Index (0-1)`,
    population_size = `Species Population Size (Individuals)`)
Habitat.Data$habitat_quality <- as.numeric(Habitat.Data$habitat_quality)
Habitat.Data$population_size <- as.numeric(Habitat.Data$population_size)

Now we can check the structure:

str(Habitat.Data)  
## tibble [10 × 2] (S3: tbl_df/tbl/data.frame)
##  $ habitat_quality: num [1:10] 0.6 0.55 0.8 0.85 0.95 0.25 0.7 0.8 0.4 0.9
##  $ population_size: num [1:10] 450 350 750 850 1000 150 600 750 200 950

The data is appropriately categorised and structured, we can analyse it.

It is important to run a normality test to determine the correct test.

Test of Normality


Q-Q plots are an effective visual check of normality:

qqnorm(Habitat.Data$population_size, main = "");
qqline(Habitat.Data$population_size, col = "#00A08A") +
  theme_light()
__Figure 2__ Q-Q plot for species population size (Individuals)

Figure 2 Q-Q plot for species population size (Individuals)

qqnorm(Habitat.Data$habitat_quality, main = "");
qqline(Habitat.Data$habitat_quality, col = "#FF0000") +
  theme_light()
__Figure 3__ Q-Q plot for Habitat Quality Index

Figure 3 Q-Q plot for Habitat Quality Index


The data in both Figures 2 and 3 do not deviate too much from the line of best fit, so they are likely normally distributed. We can confirm this assumption statistically with a Shapiro-Wilk test:

shapiro.test(Habitat.Data$population_size)
## 
##  Shapiro-Wilk normality test
## 
## data:  Habitat.Data$population_size
## W = 0.93423, p-value = 0.4907
shapiro.test(Habitat.Data$habitat_quality)
## 
##  Shapiro-Wilk normality test
## 
## data:  Habitat.Data$habitat_quality
## W = 0.93181, p-value = 0.4659

Statistical Test(s) Required

With a W values of 0.93423, and 0.93181, both variables are close to perfectly normal (1). The p values 0.4097 and 0.4659 means that we should use a parametric test of correlation.

A test of correlation is necessary. The number of individuals can be any value, as can the Habitat Quality Index (HQI) as long as it is between 0 and 1; therefore, they are both continuous.

Pearson’s correlation coefficient test can be run on this continuous, normally distributed data with a simple line of code:

cor.test(Habitat.Data$population_size, Habitat.Data$habitat_quality, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  Habitat.Data$population_size and Habitat.Data$habitat_quality
## t = 14.784, df = 8, p-value = 4.312e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9239259 0.9959234
## sample estimates:
##       cor 
## 0.9821867

To assess and quantify what impact improving habitat quality might have on conservation of this species, a test of regression is needed.

The equation for linear regression is:

\[Y = \beta_0 + \beta_1X + \varepsilon\]

Where:

Running a linear model (lm) will aid understanding of the relationship between the independent HQI (x) and dependent population (y).

A visual representation of this allows us to understand the relationship between variables and identify outliers in the data:

ggplot(data=Habitat.Data,aes(x=habitat_quality,y=population_size))+
  geom_point(colour = "#0C1707")+geom_smooth(method='lm', colour = "orchid4", 
                                             fill = "seashell4") + 
  labs(x = "Habitat Quality Index", y = "Population Size (Individuals)") +
  theme_light()
__Figure 4__: Linear regression model for the influence of habitat quality on population size.

Figure 4: Linear regression model for the influence of habitat quality on population size.

We must also run the model to quantify the relationship between HQI and population size:

lm_model <- lm(population_size ~ habitat_quality, data = Habitat.Data)
summary(lm_model)
## 
## Call:
## lm(formula = population_size ~ habitat_quality, data = Habitat.Data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -83.85 -35.11 -12.98  34.95 111.11 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -290.24      63.53  -4.568  0.00183 ** 
## habitat_quality  1316.52      89.05  14.784 4.31e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.79 on 8 degrees of freedom
## Multiple R-squared:  0.9647, Adjusted R-squared:  0.9603 
## F-statistic: 218.6 on 1 and 8 DF,  p-value: 4.312e-07

Interpretation

The result of the Pearson correlation test suggests that habitat quality has a substantial influence on population size so we can reject the null hypothesis. Linear regression analysis showed that 96.5% of the variation in population size is explained by habitat quality and, importantly, that the impact of improving habitat quality is ~13.2 individuals per 0.01 increase in the HQI.

Although there is a clearly a strong link between HQI and population size, regression shows associations, but not necessarily causation. Further regression analysis of other variables and population size could provide useful insight.


Question 3

Does the temperature in July differ among the three weather stations at locations (Helsinki, Hyytiälä, and Kittilä) in different parts of Finland?

Null Hypothesis


H0 There was no difference between the temperatures in July 2024 at the three Finnish weather stations.

H1 There was a difference between the temperatures in July 2024 at the three Finnish weather stations.


Reviewing the data

This question requires a large dataset but the method for importing and organising it remains the same. As long as it is saved into the same folder as your working directory, read.xl will work to import it:

Temperature.Comparison <- read_excel("TemperatureComparison.xlsx")
str(Temperature.Comparison)
## tibble [26,308 × 8] (S3: tbl_df/tbl/data.frame)
##  $ Station      : chr [1:26308] "Helsinki Kumpula" "Helsinki Kumpula" "Helsinki Kumpula" "Helsinki Kumpula" ...
##  $ Date         : num [1:26308] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year         : num [1:26308] 2024 2024 2024 2024 2024 ...
##  $ Month        : num [1:26308] 3 3 3 3 3 3 3 3 3 3 ...
##  $ Day          : num [1:26308] 7 7 7 7 7 7 7 7 7 7 ...
##  $ Time         : chr [1:26308] "02:00" "03:00" "04:00" "05:00" ...
##  $ Temperature  : num [1:26308] -2.7 -3.2 -3.3 -3.5 -3.9 -4.2 -4.5 -4.9 -5 -4.6 ...
##  $ Precipitation: num [1:26308] 0 0 0 0 0 0 0 0 0 0 ...
head(Temperature.Comparison)
## # A tibble: 6 × 8
##   Station           Date  Year Month   Day Time  Temperature Precipitation
##   <chr>            <dbl> <dbl> <dbl> <dbl> <chr>       <dbl>         <dbl>
## 1 Helsinki Kumpula     1  2024     3     7 02:00        -2.7             0
## 2 Helsinki Kumpula     1  2024     3     7 03:00        -3.2             0
## 3 Helsinki Kumpula     1  2024     3     7 04:00        -3.3             0
## 4 Helsinki Kumpula     1  2024     3     7 05:00        -3.5             0
## 5 Helsinki Kumpula     1  2024     3     7 06:00        -3.9             0
## 6 Helsinki Kumpula     1  2024     3     7 07:00        -4.2             0
tail(Temperature.Comparison)
## # A tibble: 6 × 8
##   Station             Date  Year Month   Day Time  Temperature Precipitation
##   <chr>              <dbl> <dbl> <dbl> <dbl> <chr>       <dbl>         <dbl>
## 1 Juupajoki Hyytiälä    NA  2025     3     7 09:00        -1.6             0
## 2 Juupajoki Hyytiälä    NA  2025     3     7 10:00        -0.6             0
## 3 Juupajoki Hyytiälä    NA  2025     3     7 11:00         0.8             0
## 4 Juupajoki Hyytiälä    NA  2025     3     7 12:00         2               0
## 5 Juupajoki Hyytiälä    NA  2025     3     7 13:00         3.2             0
## 6 Juupajoki Hyytiälä    NA  2025     3     7 14:00         4.4             0

It is worth changing the time and Station column from character to factor format and switching to data.table format as it is better optimised for large datasets such as this. Most importantly, we only need data for July to compare the three locations so the rest can be filtered out:

Temperature.Comparison <- data.table(Temperature.Comparison)
Temperature.Comparison$Time <- as.factor(Temperature.Comparison$Time)
Temperature.Comparison$Station <- as.factor(Temperature.Comparison$Station)
Temperature.Comparison <- Temperature.Comparison |> filter(Month == 7)

The date column is NA and, although it would be lovely to fill it in, that is probably a mission for another time.


Tests for Normality

We can confirm normality with a histogram and Shapiro-Wilk test.

Run geom_histogram using ggplot:

ggplot(Temperature.Comparison, aes(x = Temperature, fill = Station)) +
  geom_histogram(bins = 30, alpha = 0.7) +
  scale_fill_manual(values = wes_palette("Chevalier1", n=3, type = "discrete")) +
  labs(x = "Temperature (°C)", y = "Count") +
  theme_light()
__Figure 5__ - A histogram of the temperature distribution at all three Finnish weather stations over July 2024. The histogram is roughly bell shaped, particularly in Helsinki however there is some skewness and bimodality.

Figure 5 - A histogram of the temperature distribution at all three Finnish weather stations over July 2024. The histogram is roughly bell shaped, particularly in Helsinki however there is some skewness and bimodality.


A Shapiro-Wilk test will quantify the normality for each station:

shapiro.test(Temperature.Comparison$Temperature[Temperature.Comparison$Station == "Helsinki Kumpula"])
## 
##  Shapiro-Wilk normality test
## 
## data:  Temperature.Comparison$Temperature[Temperature.Comparison$Station == "Helsinki Kumpula"]
## W = 0.99528, p-value = 0.02216
shapiro.test(Temperature.Comparison$Temperature[Temperature.Comparison$Station == "Kittilä Pokka"])
## 
##  Shapiro-Wilk normality test
## 
## data:  Temperature.Comparison$Temperature[Temperature.Comparison$Station == "Kittilä Pokka"]
## W = 0.9719, p-value = 9.187e-11
shapiro.test(Temperature.Comparison$Temperature[Temperature.Comparison$Station == "Juupajoki Hyytiälä"])
## 
##  Shapiro-Wilk normality test
## 
## data:  Temperature.Comparison$Temperature[Temperature.Comparison$Station == "Juupajoki Hyytiälä"]
## W = 0.99075, p-value = 0.0001351

Statistical Test(s) Required

With Shapiro-Wilk p values of <0.05 we can say that the data is not normally distributed and therefore a non-parametric Kruskal-Wallis test should be used to compare the three independent groups.

kruskal.test(Temperature ~ Station, data = Temperature.Comparison)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Temperature by Station
## Kruskal-Wallis chi-squared = 278.26, df = 2, p-value < 2.2e-16

Post Hoc Test

The Dunn’s test can be used to compare stations pairwise. Bonferroni-adjusted p values (P.adj) will correct for multiple comparisons, reducing the risk of false positives:

library(FSA)
dunnTest(Temperature ~ Station, data = Temperature.Comparison, method = "bonferroni")
##                              Comparison         Z      P.unadj        P.adj
## 1 Helsinki Kumpula - Juupajoki Hyytiälä 11.061544 1.927502e-28 5.782507e-28
## 2      Helsinki Kumpula - Kittilä Pokka 16.341434 5.006163e-60 1.501849e-59
## 3    Juupajoki Hyytiälä - Kittilä Pokka  5.257851 1.457483e-07 4.372448e-07

All three comparisons are <0.05 and therefore significant. The largest difference is that of Helsinki Kumpula and Kittilä Pokka at 1.501849e-59.


To visualise the differences clearly between station temperatures over July we can produce a box plot:

ggplot(Temperature.Comparison, aes(x = Station, y = Temperature, 
                                   fill = Station)) +
  geom_boxplot(alpha = 0.7, outlier.colour = "#C7B19C", 
               outlier.shape = 16, outlier.size = 4) +
  labs(x = "Weather Station", y = "Temperature (°C)") +
  theme_minimal() +
  scale_fill_manual(values = wes_palette("Chevalier1"))
__Figure 6__ - A box and whisker plot visualising differing temperature distribution at the 3 Finnish weather stations. One can see that Haelsinki has the highest median temperature and smallest range. The other two stations recorded lower temperatures with larger variation. Outliers are shown in beige.

Figure 6 - A box and whisker plot visualising differing temperature distribution at the 3 Finnish weather stations. One can see that Haelsinki has the highest median temperature and smallest range. The other two stations recorded lower temperatures with larger variation. Outliers are shown in beige.


Interpretation

The Kruskal-Wallis test confirmed that the temperature distrbutions significantly differ across the three weather stations and Dunn’s post hoc test showed that difference of each is pair of stations is also statistically significant.

The latitudes can primarily explain the different temperature distributions over July. Kittilä Pokka is 6 degrees further north than Juupajoki Hyytiälä and 7 further than Helsinki Kumpula. Urban heat island effect also goes some way to explaining the higher temperatures in Helsinki.



References

Dytham, C. (2011) Choosing and using statistics: a biologist’s guide. 3. ed. Oxford: Wiley-Blackwell.

Peterken, G.F. (2001) Natural woodland: ecology and conservation in northern temperate regions. Reprinted. New York: Cambridge Univ. Press.

Sharp Bowman, T.R., McMillan, B.R. and St. Clair, S.B. (2017) ‘A comparison of the effects of fire on rodent abundance and diversity in the Great Basin and Mojave Deserts’, PLOS ONE. Edited by A.W. Reed, 12(11), p. e0187740. Available at: https://doi.org/10.1371/journal.pone.0187740.