Research Methods Test
14-03-2025
Question 1
A friend working on a project overseas has been asked to analyse radiotelemetry data for a small terrestrial rodent. Radiotelemetry allows the researcher to determine the position of the animal wearing a VHF beacon by triangulation. They have sent you the data below (Table 1) and they want you to help them test whether the animal uses all of the habitats in proportion to their occurrence in the environment.
Null Hypothesis
H0 The rodent uses all habitat types in proportion to their occurrence in the environment.
H1 The rodent shows preference and therefore does not use all habitats in proportion to their occurrence in the environment.
Reviewing the data
Habitat Type | Area (km2) | Number of rodent locations |
---|---|---|
Primary Forest | 4 | 17 |
Secondary growth forest | 2 | 2 |
Natural meadow | 3 | 15 |
Recent clear cut | 1 | 4 |
Recently burned | 1 | 8 |
Alpine tundra | 2 | 5 |
Agricultural land | 2 | 4 |
The table titles must be without spaces to be read by R. Secondly it must be made sure that the habitat is defined as a factor and the area and observed locations (expected and observed) are numeric:
<- Rodent.Data |>
Rodent.Data rename(
habitat = `Habitat Type`,
area_km2 = `Area (km2)`,
observed_locations = `Number of rodent locations`
)$habitat <- as.factor(Rodent.Data$habitat) Rodent.Data
Now the data can be visualised (Figure 1). This helps to get an idea of what the test might show as one can see broad differences between the observed and expected frequencies:
ggplot(Rodent.Data, aes(x = habitat, y = observed_locations)) +
geom_bar(stat = "identity", fill = "forestgreen", alpha = 0.6) +
geom_point(aes(y = area_km2), colour = "darkorange3", size = 3) +
labs(y = "Location Frequency", x = "Habitat Type") +
theme_light() +
coord_flip()
Figure 1 - The observed vs expected rodent habitat use. The orange points are the area of each habitat site and therefore the expected
Statistical test(s) required
This is a test comparing observed vs. expected frequencies of habitat use relative to area. Specifically, is usage proportional to availability. These are categorical variables. To determine whether there is a relationship between two categorical, the Chi-Square test is appropriate. ### Chi-square Formula
The Chi-square statistic is calculated using:
\[ \chi^2 = \sum \frac{(O - E)^2}{E} \] The test statistic (x2) is calculated by working out the sum of the terms (Σ) (Dytham, 2011). The expected frequency (E) is how many times the rodent should be found in each habitat according to proportional abundance. Subtracting the expected from the observed frequency (O), in this case, number of rodent locations, determines by how much the observed differs from the expected. To remove negative values, the result of this is squared. The squared differences for each habitat type are then divided by the expected frequency This makes the test account for variation in the expected values and thus tells us how much the observed values deviate from the expected values.
The data should be correctly formatted but, to check the columns are correctly categorised as factors and numbers one can print the data:
print(Rodent.Data)
## # A tibble: 7 × 3
## habitat area_km2 observed_locations
## <fct> <dbl> <dbl>
## 1 Primary Forest 4 17
## 2 Secondary growth forest 2 2
## 3 Natural meadow 3 15
## 4 Recent clear cut 1 4
## 5 Recently burned 1 8
## 6 Alpine tundra 2 5
## 7 Agricultural land 2 4
As they are categorised appropriately the data can be
interpreted; a Chi-square test is possible:
<- chisq.test(Rodent.Data$observed_locations,
chi_result p=Rodent.Data$area_km2 / sum(Rodent.Data$area_km2))
print(chi_result)
##
## Chi-squared test for given probabilities
##
## data: Rodent.Data$observed_locations
## X-squared = 13.114, df = 6, p-value = 0.04127
Results
- The x2 is value is 13.114 with 6 degrees of freedom.
- This gives a p value, 0.04127, below the threshold 0.05.
- For this reason we can reject the null hypothesis (H0) and accept the alternative hypothesis (H1).
Interpretation
The results of the test demonstrate that the rodent shows clear preference with high significance over some habitat types. For example, recently burned habitat is visited frequently (8 times) in proportion to its size (1km2). The rodent shows preference and therefore does not use all habitats in proportion to their occurrence in the environment.
Rodent habitat use is influenced by food availability, predation risk and factors related to the microclimate (exposure and temperature).
- Below is an explanation of the likely reasons for preference or lack
thereof:
- Primary Forest - Complex structure due to later seral stage, high food species diversity and abundant cover.
- Secondary growth forest - Exhibits limited structural and species diversity (Peterken, 2001) therefore, less cover and fewer food sources for the rodent.
- Natural meadow - Likely high plant diversity so abundant in seeds and insects.
- Recent clear cut - Vegetation will be spares with little food source and lack of cover will increase risk of predation.
- Recently burned - Early successional, fire adapted plants may attract the rodent. Bipedal rodents such as kangaroo rats in the Mojave desert increase in population after burning and play an important part in plant regrowth through seed caching (Sharp Bowman, McMillan and St. Clair, 2017).
- Alpine tundra - Harsh environments with low
vegetation cover and temperatures are likely to deter rodents unless
they are well adapted.
- Agricultural land - Regularly disturbed land with low species diversity and potentially pesticides will not provide a good environment for rodents.
Further study into habitat characteristics and rodent behaviour could provide more detail as to the reasons for preference.
Question 2
How does habitat quality affect the population size of a species, and what impact might improving habitat quality have on conservation of this species? Analyse the information in Table 2 to answer this question.
Null Hypothesis
H0 There is no correlation between habitat quality
and population size.
H1 There is a correlation between habitat quality and population size.
Reviewing the data
Habitat Quality Index (0-1) | Species Population Size (Individuals) |
---|---|
0.60 | 450 |
0.55 | 350 |
0.80 | 750 |
0.85 | 850 |
0.95 | 1000 |
0.25 | 150 |
0.70 | 600 |
0.80 | 750 |
0.40 | 200 |
0.90 | 950 |
The data need restructuring to analyse in R:
<- Habitat.Data |>
Habitat.Data rename(
habitat_quality = `Habitat Quality Index (0-1)`,
population_size = `Species Population Size (Individuals)`)
$habitat_quality <- as.numeric(Habitat.Data$habitat_quality)
Habitat.Data$population_size <- as.numeric(Habitat.Data$population_size) Habitat.Data
Now we can check the structure:
str(Habitat.Data)
## tibble [10 × 2] (S3: tbl_df/tbl/data.frame)
## $ habitat_quality: num [1:10] 0.6 0.55 0.8 0.85 0.95 0.25 0.7 0.8 0.4 0.9
## $ population_size: num [1:10] 450 350 750 850 1000 150 600 750 200 950
The data is appropriately categorised and structured, we can analyse it.
It is important to run a normality test to determine the correct test.
Test of Normality
Q-Q plots are an effective visual check of normality:
qqnorm(Habitat.Data$population_size, main = "");
qqline(Habitat.Data$population_size, col = "#00A08A") +
theme_light()
Figure 2 Q-Q plot for species population size (Individuals)
qqnorm(Habitat.Data$habitat_quality, main = "");
qqline(Habitat.Data$habitat_quality, col = "#FF0000") +
theme_light()
Figure 3 Q-Q plot for Habitat Quality Index
The data in both Figures 2 and 3 do not deviate too much from the line of best fit, so they are likely normally distributed. We can confirm this assumption statistically with a Shapiro-Wilk test:
shapiro.test(Habitat.Data$population_size)
##
## Shapiro-Wilk normality test
##
## data: Habitat.Data$population_size
## W = 0.93423, p-value = 0.4907
shapiro.test(Habitat.Data$habitat_quality)
##
## Shapiro-Wilk normality test
##
## data: Habitat.Data$habitat_quality
## W = 0.93181, p-value = 0.4659
Statistical Test(s) Required
With a W values of 0.93423, and 0.93181, both variables are close to perfectly normal (1). The p values 0.4097 and 0.4659 means that we should use a parametric test of correlation.
A test of correlation is necessary. The number of individuals can be any value, as can the Habitat Quality Index (HQI) as long as it is between 0 and 1; therefore, they are both continuous.
Pearson’s correlation coefficient test can be run on this continuous, normally distributed data with a simple line of code:
cor.test(Habitat.Data$population_size, Habitat.Data$habitat_quality, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: Habitat.Data$population_size and Habitat.Data$habitat_quality
## t = 14.784, df = 8, p-value = 4.312e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9239259 0.9959234
## sample estimates:
## cor
## 0.9821867
- The test statistic is 14.784 with 8 degrees of freedom sugesting a strong relationship.
- This gave a p value of 4.12e-07 (p < 0.001), thus demonstrating a very statistically significant effect at the 0.05 threshold.
- A 95% confidence interval ranging from 0.924 to 0.996 means the result is very reliable.
- The correlation coefficient (r) is 0.982 indicating a significantly strong and positive correlation between population size and HQI.
To assess and quantify what impact improving habitat quality might have on conservation of this species, a test of regression is needed.
The equation for linear regression is:
\[Y = \beta_0 + \beta_1X + \varepsilon\]
Where:
- y = Population size
- x = HQI
- 𝛽0 = y when x = 0 (intercept)
- 𝛽1 = The change in y for a change in x (slope)
- 𝜖 = Unexplained variation (Error term)
Running a linear model (lm) will aid understanding of the relationship between the independent HQI (x) and dependent population (y).
A visual representation of this allows us to understand the relationship between variables and identify outliers in the data:
ggplot(data=Habitat.Data,aes(x=habitat_quality,y=population_size))+
geom_point(colour = "#0C1707")+geom_smooth(method='lm', colour = "orchid4",
fill = "seashell4") +
labs(x = "Habitat Quality Index", y = "Population Size (Individuals)") +
theme_light()
Figure 4: Linear regression model for the influence of habitat quality on population size.
We must also run the model to quantify the relationship between HQI and population size:
<- lm(population_size ~ habitat_quality, data = Habitat.Data)
lm_model summary(lm_model)
##
## Call:
## lm(formula = population_size ~ habitat_quality, data = Habitat.Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -83.85 -35.11 -12.98 34.95 111.11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -290.24 63.53 -4.568 0.00183 **
## habitat_quality 1316.52 89.05 14.784 4.31e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.79 on 8 degrees of freedom
## Multiple R-squared: 0.9647, Adjusted R-squared: 0.9603
## F-statistic: 218.6 on 1 and 8 DF, p-value: 4.312e-07
- With an R2 value of 0.965 we can say that 96.65% of the variance in population size can be explained by habitat quality.
- The adjusted R2 value is 0.960 so we can be very confident that no unnecessary variables were added to the model and that habitat quality is a strong predictor of population size.
- For every 0.01 increase in HQI, population size can be predicted to increase by 13.2 individuals.
- The F statistic of the linear regression is 218.6 with 8 degrees of freedom.
- This gave a p value of 4.312e-7, signifying a highly significant relationship.
Interpretation
The result of the Pearson correlation test suggests that habitat quality has a substantial influence on population size so we can reject the null hypothesis. Linear regression analysis showed that 96.5% of the variation in population size is explained by habitat quality and, importantly, that the impact of improving habitat quality is ~13.2 individuals per 0.01 increase in the HQI.
Although there is a clearly a strong link between HQI and population size, regression shows associations, but not necessarily causation. Further regression analysis of other variables and population size could provide useful insight.
Question 3
Does the temperature in July differ among the three weather stations at locations (Helsinki, Hyytiälä, and Kittilä) in different parts of Finland?
Null Hypothesis
H0 There was no difference between the temperatures
in July 2024 at the three Finnish weather stations.
H1 There was a difference between the temperatures in July 2024 at the three Finnish weather stations.
Reviewing the data
This question requires a large dataset but the method for importing and organising it remains the same. As long as it is saved into the same folder as your working directory, read.xl will work to import it:
<- read_excel("TemperatureComparison.xlsx") Temperature.Comparison
str(Temperature.Comparison)
## tibble [26,308 × 8] (S3: tbl_df/tbl/data.frame)
## $ Station : chr [1:26308] "Helsinki Kumpula" "Helsinki Kumpula" "Helsinki Kumpula" "Helsinki Kumpula" ...
## $ Date : num [1:26308] 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : num [1:26308] 2024 2024 2024 2024 2024 ...
## $ Month : num [1:26308] 3 3 3 3 3 3 3 3 3 3 ...
## $ Day : num [1:26308] 7 7 7 7 7 7 7 7 7 7 ...
## $ Time : chr [1:26308] "02:00" "03:00" "04:00" "05:00" ...
## $ Temperature : num [1:26308] -2.7 -3.2 -3.3 -3.5 -3.9 -4.2 -4.5 -4.9 -5 -4.6 ...
## $ Precipitation: num [1:26308] 0 0 0 0 0 0 0 0 0 0 ...
head(Temperature.Comparison)
## # A tibble: 6 × 8
## Station Date Year Month Day Time Temperature Precipitation
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Helsinki Kumpula 1 2024 3 7 02:00 -2.7 0
## 2 Helsinki Kumpula 1 2024 3 7 03:00 -3.2 0
## 3 Helsinki Kumpula 1 2024 3 7 04:00 -3.3 0
## 4 Helsinki Kumpula 1 2024 3 7 05:00 -3.5 0
## 5 Helsinki Kumpula 1 2024 3 7 06:00 -3.9 0
## 6 Helsinki Kumpula 1 2024 3 7 07:00 -4.2 0
tail(Temperature.Comparison)
## # A tibble: 6 × 8
## Station Date Year Month Day Time Temperature Precipitation
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Juupajoki Hyytiälä NA 2025 3 7 09:00 -1.6 0
## 2 Juupajoki Hyytiälä NA 2025 3 7 10:00 -0.6 0
## 3 Juupajoki Hyytiälä NA 2025 3 7 11:00 0.8 0
## 4 Juupajoki Hyytiälä NA 2025 3 7 12:00 2 0
## 5 Juupajoki Hyytiälä NA 2025 3 7 13:00 3.2 0
## 6 Juupajoki Hyytiälä NA 2025 3 7 14:00 4.4 0
It is worth changing the time and Station column from character to factor format and switching to data.table format as it is better optimised for large datasets such as this. Most importantly, we only need data for July to compare the three locations so the rest can be filtered out:
<- data.table(Temperature.Comparison)
Temperature.Comparison $Time <- as.factor(Temperature.Comparison$Time)
Temperature.Comparison$Station <- as.factor(Temperature.Comparison$Station)
Temperature.Comparison<- Temperature.Comparison |> filter(Month == 7) Temperature.Comparison
The date column is NA and, although it would be lovely to fill it in, that is probably a mission for another time.
Tests for Normality
We can confirm normality with a histogram and Shapiro-Wilk test.
Run geom_histogram using ggplot:
ggplot(Temperature.Comparison, aes(x = Temperature, fill = Station)) +
geom_histogram(bins = 30, alpha = 0.7) +
scale_fill_manual(values = wes_palette("Chevalier1", n=3, type = "discrete")) +
labs(x = "Temperature (°C)", y = "Count") +
theme_light()
Figure 5 - A histogram of the temperature distribution at all three Finnish weather stations over July 2024. The histogram is roughly bell shaped, particularly in Helsinki however there is some skewness and bimodality.
A Shapiro-Wilk test will quantify the normality for each station:
shapiro.test(Temperature.Comparison$Temperature[Temperature.Comparison$Station == "Helsinki Kumpula"])
##
## Shapiro-Wilk normality test
##
## data: Temperature.Comparison$Temperature[Temperature.Comparison$Station == "Helsinki Kumpula"]
## W = 0.99528, p-value = 0.02216
shapiro.test(Temperature.Comparison$Temperature[Temperature.Comparison$Station == "Kittilä Pokka"])
##
## Shapiro-Wilk normality test
##
## data: Temperature.Comparison$Temperature[Temperature.Comparison$Station == "Kittilä Pokka"]
## W = 0.9719, p-value = 9.187e-11
shapiro.test(Temperature.Comparison$Temperature[Temperature.Comparison$Station == "Juupajoki Hyytiälä"])
##
## Shapiro-Wilk normality test
##
## data: Temperature.Comparison$Temperature[Temperature.Comparison$Station == "Juupajoki Hyytiälä"]
## W = 0.99075, p-value = 0.0001351
Statistical Test(s) Required
With Shapiro-Wilk p values of <0.05 we can say that the data is not normally distributed and therefore a non-parametric Kruskal-Wallis test should be used to compare the three independent groups.
kruskal.test(Temperature ~ Station, data = Temperature.Comparison)
##
## Kruskal-Wallis rank sum test
##
## data: Temperature by Station
## Kruskal-Wallis chi-squared = 278.26, df = 2, p-value < 2.2e-16
- The chi-square statistic with 2 degrees of freedom is 278.26 and indicates a strong difference between the weather stations.
- This resulted in a highly significant p-value of 2.2e-16.
- We can therefore reject the null hypothesis
Post Hoc Test
The Dunn’s test can be used to compare stations pairwise. Bonferroni-adjusted p values (P.adj) will correct for multiple comparisons, reducing the risk of false positives:
library(FSA)
dunnTest(Temperature ~ Station, data = Temperature.Comparison, method = "bonferroni")
## Comparison Z P.unadj P.adj
## 1 Helsinki Kumpula - Juupajoki Hyytiälä 11.061544 1.927502e-28 5.782507e-28
## 2 Helsinki Kumpula - Kittilä Pokka 16.341434 5.006163e-60 1.501849e-59
## 3 Juupajoki Hyytiälä - Kittilä Pokka 5.257851 1.457483e-07 4.372448e-07
All three comparisons are <0.05 and therefore significant. The largest difference is that of Helsinki Kumpula and Kittilä Pokka at 1.501849e-59.
To visualise the differences clearly between station temperatures over July we can produce a box plot:
ggplot(Temperature.Comparison, aes(x = Station, y = Temperature,
fill = Station)) +
geom_boxplot(alpha = 0.7, outlier.colour = "#C7B19C",
outlier.shape = 16, outlier.size = 4) +
labs(x = "Weather Station", y = "Temperature (°C)") +
theme_minimal() +
scale_fill_manual(values = wes_palette("Chevalier1"))
Figure 6 - A box and whisker plot visualising differing temperature distribution at the 3 Finnish weather stations. One can see that Haelsinki has the highest median temperature and smallest range. The other two stations recorded lower temperatures with larger variation. Outliers are shown in beige.
Interpretation
The Kruskal-Wallis test confirmed that the temperature distrbutions significantly differ across the three weather stations and Dunn’s post hoc test showed that difference of each is pair of stations is also statistically significant.
The latitudes can primarily explain the different temperature distributions over July. Kittilä Pokka is 6 degrees further north than Juupajoki Hyytiälä and 7 further than Helsinki Kumpula. Urban heat island effect also goes some way to explaining the higher temperatures in Helsinki.
References
Dytham, C. (2011) Choosing and using statistics: a biologist’s guide. 3. ed. Oxford: Wiley-Blackwell.
Peterken, G.F. (2001) Natural woodland: ecology and conservation in northern temperate regions. Reprinted. New York: Cambridge Univ. Press.
Sharp Bowman, T.R., McMillan, B.R. and St. Clair, S.B. (2017) ‘A comparison of the effects of fire on rodent abundance and diversity in the Great Basin and Mojave Deserts’, PLOS ONE. Edited by A.W. Reed, 12(11), p. e0187740. Available at: https://doi.org/10.1371/journal.pone.0187740.