Research Methods Test

14-03-2025

HSOC5100

Henry Shaw

Question 1

A friend working on a project overseas has been asked to analyse radiotelemetry data for a small terrestrial rodent. Radiotelemetry allows the researcher to determine the position of the animal wearing a VHF beacon by triangulation. They have sent you the data below (Table 1) and they want you to help them test whether the animal uses all of the habitats in proportion to their occurrence in the environment.

Null Hypothesis

H₀ The rodent uses all habitat types in proportion to their occurrence in the environment.

H₁ The rodent shows preference and therefore does not use all habitats in proportion to their occurrence in the environment.

Reviewing the data

**Table 1: Rodent habitat use frequency**
Habitat Type	Area (km2)	Number of rodent locations
Primary Forest	4	17
Secondary growth forest	2	2
Natural meadow	3	15
Recent clear cut	1	4
Recently burned	1	8
Alpine tundra	2	5
Agricultural land	2	4

The table titles must be without spaces to be read by R. Secondly it must be made sure that the habitat is defined as a factor and the area and observed locations (expected and observed) are numeric:

Rodent.Data <- Rodent.Data |> 
  rename(
    habitat = `Habitat Type`,
    area_km2 = `Area (km2)`,
    observed_locations = `Number of rodent locations`
  )
Rodent.Data$habitat <- as.factor(Rodent.Data$habitat)

Now the data can be visualised (Figure 1). This helps to get an idea of what the test might show as one can see broad differences between the observed and expected frequencies:

ggplot(Rodent.Data, aes(x = habitat, y = observed_locations)) +  
  geom_bar(stat = "identity", fill = "forestgreen", alpha = 0.6) +
  geom_point(aes(y = area_km2), colour = "darkorange3", size = 3) +
  labs(y = "Location Frequency", x = "Habitat Type") +
  theme_light() +
  coord_flip()

Figure 1 - The observed vs expected rodent habitat use. The orange points are the area of each habitat site and therefore the expected

Statistical test(s) required

This is a test comparing observed vs. expected frequencies of habitat use relative to area. Specifically, is usage proportional to availability. These are categorical variables. To determine whether there is a relationship between two categorical, the Chi-Square test is appropriate. ### Chi-square Formula

The Chi-square statistic is calculated using:

\[ \chi^2 = \sum \frac{(O - E)^2}{E} \] The test statistic (x²) is calculated by working out the sum of the terms (Σ) (Dytham, 2011). The expected frequency (E) is how many times the rodent should be found in each habitat according to proportional abundance. Subtracting the expected from the observed frequency (O), in this case, number of rodent locations, determines by how much the observed differs from the expected. To remove negative values, the result of this is squared. The squared differences for each habitat type are then divided by the expected frequency This makes the test account for variation in the expected values and thus tells us how much the observed values deviate from the expected values.

The data should be correctly formatted but, to check the columns are correctly categorised as factors and numbers one can print the data:

print(Rodent.Data)

## # A tibble: 7 × 3
##   habitat                 area_km2 observed_locations
##   <fct>                      <dbl>              <dbl>
## 1 Primary Forest                 4                 17
## 2 Secondary growth forest        2                  2
## 3 Natural meadow                 3                 15
## 4 Recent clear cut               1                  4
## 5 Recently burned                1                  8
## 6 Alpine tundra                  2                  5
## 7 Agricultural land              2                  4

As they are categorised appropriately the data can be interpreted; a Chi-square test is possible:

chi_result <- chisq.test(Rodent.Data$observed_locations, 
                         p=Rodent.Data$area_km2 / sum(Rodent.Data$area_km2))
print(chi_result)

## 
##  Chi-squared test for given probabilities
## 
## data:  Rodent.Data$observed_locations
## X-squared = 13.114, df = 6, p-value = 0.04127

Results

The x² is value is 13.114 with 6 degrees of freedom.
This gives a p value, 0.04127, below the threshold 0.05.
For this reason we can reject the null hypothesis (H₀) and accept the alternative hypothesis (H₁).

Interpretation

The results of the test demonstrate that the rodent shows clear preference with high significance over some habitat types. For example, recently burned habitat is visited frequently (8 times) in proportion to its size (1km²). The rodent shows preference and therefore does not use all habitats in proportion to their occurrence in the environment.

Rodent habitat use is influenced by food availability, predation risk and factors related to the microclimate (exposure and temperature).

Below is an explanation of the likely reasons for preference or lack thereof:
- Primary Forest - Complex structure due to later seral stage, high food species diversity and abundant cover.
- Secondary growth forest - Exhibits limited structural and species diversity (Peterken, 2001) therefore, less cover and fewer food sources for the rodent.
- Natural meadow - Likely high plant diversity so abundant in seeds and insects.
- Recent clear cut - Vegetation will be spares with little food source and lack of cover will increase risk of predation.
- Recently burned - Early successional, fire adapted plants may attract the rodent. Bipedal rodents such as kangaroo rats in the Mojave desert increase in population after burning and play an important part in plant regrowth through seed caching (Sharp Bowman, McMillan and St. Clair, 2017).
- Alpine tundra - Harsh environments with low vegetation cover and temperatures are likely to deter rodents unless they are well adapted.
- Agricultural land - Regularly disturbed land with low species diversity and potentially pesticides will not provide a good environment for rodents.

Further study into habitat characteristics and rodent behaviour could provide more detail as to the reasons for preference.

Question 2

How does habitat quality affect the population size of a species, and what impact might improving habitat quality have on conservation of this species? Analyse the information in Table 2 to answer this question.

Null Hypothesis

H₀ There is no correlation between habitat quality and population size.

H₁ There is a correlation between habitat quality and population size.

Reviewing the data

**Table 2: Species population size relating to habitat quality index**
Habitat Quality Index (0-1)	Species Population Size (Individuals)
0.60	450
0.55	350
0.80	750
0.85	850
0.95	1000
0.25	150
0.70	600
0.80	750
0.40	200
0.90	950

The data need restructuring to analyse in R:

Habitat.Data <- Habitat.Data |> 
  rename(
    habitat_quality = `Habitat Quality Index (0-1)`,
    population_size = `Species Population Size (Individuals)`)

Habitat.Data$habitat_quality <- as.numeric(Habitat.Data$habitat_quality)
Habitat.Data$population_size <- as.numeric(Habitat.Data$population_size)

Now we can check the structure:

str(Habitat.Data)

## tibble [10 × 2] (S3: tbl_df/tbl/data.frame)
##  $ habitat_quality: num [1:10] 0.6 0.55 0.8 0.85 0.95 0.25 0.7 0.8 0.4 0.9
##  $ population_size: num [1:10] 450 350 750 850 1000 150 600 750 200 950

The data is appropriately categorised and structured, we can analyse it.

It is important to run a normality test to determine the correct test.

Test of Normality

Q-Q plots are an effective visual check of normality:

qqnorm(Habitat.Data$population_size, main = "");
qqline(Habitat.Data$population_size, col = "#00A08A") +
  theme_light()

Figure 2 Q-Q plot for species population size (Individuals)

qqnorm(Habitat.Data$habitat_quality, main = "");
qqline(Habitat.Data$habitat_quality, col = "#FF0000") +
  theme_light()

Figure 3 Q-Q plot for Habitat Quality Index

The data in both Figures 2 and 3 do not deviate too much from the line of best fit, so they are likely normally distributed. We can confirm this assumption statistically with a Shapiro-Wilk test:

shapiro.test(Habitat.Data$population_size)

## 
##  Shapiro-Wilk normality test
## 
## data:  Habitat.Data$population_size
## W = 0.93423, p-value = 0.4907

shapiro.test(Habitat.Data$habitat_quality)

## 
##  Shapiro-Wilk normality test
## 
## data:  Habitat.Data$habitat_quality
## W = 0.93181, p-value = 0.4659

Statistical Test(s) Required

With a W values of 0.93423, and 0.93181, both variables are close to perfectly normal (1). The p values 0.4097 and 0.4659 means that we should use a parametric test of correlation.

A test of correlation is necessary. The number of individuals can be any value, as can the Habitat Quality Index (HQI) as long as it is between 0 and 1; therefore, they are both continuous.

Pearson’s correlation coefficient test can be run on this continuous, normally distributed data with a simple line of code:

cor.test(Habitat.Data$population_size, Habitat.Data$habitat_quality, method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  Habitat.Data$population_size and Habitat.Data$habitat_quality
## t = 14.784, df = 8, p-value = 4.312e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9239259 0.9959234
## sample estimates:
##       cor 
## 0.9821867

The test statistic is 14.784 with 8 degrees of freedom sugesting a strong relationship.
This gave a p value of 4.12e-07 (p < 0.001), thus demonstrating a very statistically significant effect at the 0.05 threshold.
A 95% confidence interval ranging from 0.924 to 0.996 means the result is very reliable.
The correlation coefficient (r) is 0.982 indicating a significantly strong and positive correlation between population size and HQI.

To assess and quantify what impact improving habitat quality might have on conservation of this species, a test of regression is needed.

The equation for linear regression is:

\[Y = \beta_0 + \beta_1X + \varepsilon\]

Where:

y = Population size
x = HQI
𝛽₀ = y when x = 0 (intercept)
𝛽₁ = The change in y for a change in x (slope)
𝜖 = Unexplained variation (Error term)

Running a linear model (lm) will aid understanding of the relationship between the independent HQI (x) and dependent population (y).

A visual representation of this allows us to understand the relationship between variables and identify outliers in the data:

ggplot(data=Habitat.Data,aes(x=habitat_quality,y=population_size))+
  geom_point(colour = "#0C1707")+geom_smooth(method='lm', colour = "orchid4", 
                                             fill = "seashell4") + 
  labs(x = "Habitat Quality Index", y = "Population Size (Individuals)") +
  theme_light()

Figure 4: Linear regression model for the influence of habitat quality on population size.

We must also run the model to quantify the relationship between HQI and population size:

lm_model <- lm(population_size ~ habitat_quality, data = Habitat.Data)
summary(lm_model)

## 
## Call:
## lm(formula = population_size ~ habitat_quality, data = Habitat.Data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -83.85 -35.11 -12.98  34.95 111.11 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -290.24      63.53  -4.568  0.00183 ** 
## habitat_quality  1316.52      89.05  14.784 4.31e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.79 on 8 degrees of freedom
## Multiple R-squared:  0.9647, Adjusted R-squared:  0.9603 
## F-statistic: 218.6 on 1 and 8 DF,  p-value: 4.312e-07

With an R² value of 0.965 we can say that 96.65% of the variance in population size can be explained by habitat quality.
The adjusted R² value is 0.960 so we can be very confident that no unnecessary variables were added to the model and that habitat quality is a strong predictor of population size.
For every 0.01 increase in HQI, population size can be predicted to increase by 13.2 individuals.
The F statistic of the linear regression is 218.6 with 8 degrees of freedom.
This gave a p value of 4.312e-7, signifying a highly significant relationship.

Interpretation

The result of the Pearson correlation test suggests that habitat quality has a substantial influence on population size so we can reject the null hypothesis. Linear regression analysis showed that 96.5% of the variation in population size is explained by habitat quality and, importantly, that the impact of improving habitat quality is ~13.2 individuals per 0.01 increase in the HQI.

Although there is a clearly a strong link between HQI and population size, regression shows associations, but not necessarily causation. Further regression analysis of other variables and population size could provide useful insight.

Question 3

Does the temperature in July differ among the three weather stations at locations (Helsinki, Hyytiälä, and Kittilä) in different parts of Finland?

Null Hypothesis

H₀ There was no difference between the temperatures in July 2024 at the three Finnish weather stations.

H₁ There was a difference between the temperatures in July 2024 at the three Finnish weather stations.

Reviewing the data

This question requires a large dataset but the method for importing and organising it remains the same. As long as it is saved into the same folder as your working directory, read.xl will work to import it:

Temperature.Comparison <- read_excel("TemperatureComparison.xlsx")

str(Temperature.Comparison)

## tibble [26,308 × 8] (S3: tbl_df/tbl/data.frame)
##  $ Station      : chr [1:26308] "Helsinki Kumpula" "Helsinki Kumpula" "Helsinki Kumpula" "Helsinki Kumpula" ...
##  $ Date         : num [1:26308] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year         : num [1:26308] 2024 2024 2024 2024 2024 ...
##  $ Month        : num [1:26308] 3 3 3 3 3 3 3 3 3 3 ...
##  $ Day          : num [1:26308] 7 7 7 7 7 7 7 7 7 7 ...
##  $ Time         : chr [1:26308] "02:00" "03:00" "04:00" "05:00" ...
##  $ Temperature  : num [1:26308] -2.7 -3.2 -3.3 -3.5 -3.9 -4.2 -4.5 -4.9 -5 -4.6 ...
##  $ Precipitation: num [1:26308] 0 0 0 0 0 0 0 0 0 0 ...

head(Temperature.Comparison)

## # A tibble: 6 × 8
##   Station           Date  Year Month   Day Time  Temperature Precipitation
##   <chr>            <dbl> <dbl> <dbl> <dbl> <chr>       <dbl>         <dbl>
## 1 Helsinki Kumpula     1  2024     3     7 02:00        -2.7             0
## 2 Helsinki Kumpula     1  2024     3     7 03:00        -3.2             0
## 3 Helsinki Kumpula     1  2024     3     7 04:00        -3.3             0
## 4 Helsinki Kumpula     1  2024     3     7 05:00        -3.5             0
## 5 Helsinki Kumpula     1  2024     3     7 06:00        -3.9             0
## 6 Helsinki Kumpula     1  2024     3     7 07:00        -4.2             0

tail(Temperature.Comparison)

## # A tibble: 6 × 8
##   Station             Date  Year Month   Day Time  Temperature Precipitation
##   <chr>              <dbl> <dbl> <dbl> <dbl> <chr>       <dbl>         <dbl>
## 1 Juupajoki Hyytiälä    NA  2025     3     7 09:00        -1.6             0
## 2 Juupajoki Hyytiälä    NA  2025     3     7 10:00        -0.6             0
## 3 Juupajoki Hyytiälä    NA  2025     3     7 11:00         0.8             0
## 4 Juupajoki Hyytiälä    NA  2025     3     7 12:00         2               0
## 5 Juupajoki Hyytiälä    NA  2025     3     7 13:00         3.2             0
## 6 Juupajoki Hyytiälä    NA  2025     3     7 14:00         4.4             0

It is worth changing the time and Station column from character to factor format and switching to data.table format as it is better optimised for large datasets such as this. Most importantly, we only need data for July to compare the three locations so the rest can be filtered out:

Temperature.Comparison <- data.table(Temperature.Comparison)
Temperature.Comparison$Time <- as.factor(Temperature.Comparison$Time)
Temperature.Comparison$Station <- as.factor(Temperature.Comparison$Station)
Temperature.Comparison <- Temperature.Comparison |> filter(Month == 7)

The date column is NA and, although it would be lovely to fill it in, that is probably a mission for another time.

Tests for Normality

We can confirm normality with a histogram and Shapiro-Wilk test.

Run geom_histogram using ggplot:

ggplot(Temperature.Comparison, aes(x = Temperature, fill = Station)) +
  geom_histogram(bins = 30, alpha = 0.7) +
  scale_fill_manual(values = wes_palette("Chevalier1", n=3, type = "discrete")) +
  labs(x = "Temperature (°C)", y = "Count") +
  theme_light()

Figure 5 - A histogram of the temperature distribution at all three Finnish weather stations over July 2024. The histogram is roughly bell shaped, particularly in Helsinki however there is some skewness and bimodality.

A Shapiro-Wilk test will quantify the normality for each station:

shapiro.test(Temperature.Comparison$Temperature[Temperature.Comparison$Station == "Helsinki Kumpula"])

## 
##  Shapiro-Wilk normality test
## 
## data:  Temperature.Comparison$Temperature[Temperature.Comparison$Station == "Helsinki Kumpula"]
## W = 0.99528, p-value = 0.02216

shapiro.test(Temperature.Comparison$Temperature[Temperature.Comparison$Station == "Kittilä Pokka"])

## 
##  Shapiro-Wilk normality test
## 
## data:  Temperature.Comparison$Temperature[Temperature.Comparison$Station == "Kittilä Pokka"]
## W = 0.9719, p-value = 9.187e-11

shapiro.test(Temperature.Comparison$Temperature[Temperature.Comparison$Station == "Juupajoki Hyytiälä"])

## 
##  Shapiro-Wilk normality test
## 
## data:  Temperature.Comparison$Temperature[Temperature.Comparison$Station == "Juupajoki Hyytiälä"]
## W = 0.99075, p-value = 0.0001351

Statistical Test(s) Required

With Shapiro-Wilk p values of <0.05 we can say that the data is not normally distributed and therefore a non-parametric Kruskal-Wallis test should be used to compare the three independent groups.

kruskal.test(Temperature ~ Station, data = Temperature.Comparison)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Temperature by Station
## Kruskal-Wallis chi-squared = 278.26, df = 2, p-value < 2.2e-16

The chi-square statistic with 2 degrees of freedom is 278.26 and indicates a strong difference between the weather stations.
This resulted in a highly significant p-value of 2.2e-16.
We can therefore reject the null hypothesis

Post Hoc Test

The Dunn’s test can be used to compare stations pairwise. Bonferroni-adjusted p values (P.adj) will correct for multiple comparisons, reducing the risk of false positives:

library(FSA)
dunnTest(Temperature ~ Station, data = Temperature.Comparison, method = "bonferroni")

##                              Comparison         Z      P.unadj        P.adj
## 1 Helsinki Kumpula - Juupajoki Hyytiälä 11.061544 1.927502e-28 5.782507e-28
## 2      Helsinki Kumpula - Kittilä Pokka 16.341434 5.006163e-60 1.501849e-59
## 3    Juupajoki Hyytiälä - Kittilä Pokka  5.257851 1.457483e-07 4.372448e-07

All three comparisons are <0.05 and therefore significant. The largest difference is that of Helsinki Kumpula and Kittilä Pokka at 1.501849e-59.

To visualise the differences clearly between station temperatures over July we can produce a box plot:

ggplot(Temperature.Comparison, aes(x = Station, y = Temperature, 
                                   fill = Station)) +
  geom_boxplot(alpha = 0.7, outlier.colour = "#C7B19C", 
               outlier.shape = 16, outlier.size = 4) +
  labs(x = "Weather Station", y = "Temperature (°C)") +
  theme_minimal() +
  scale_fill_manual(values = wes_palette("Chevalier1"))

Figure 6 - A box and whisker plot visualising differing temperature distribution at the 3 Finnish weather stations. One can see that Haelsinki has the highest median temperature and smallest range. The other two stations recorded lower temperatures with larger variation. Outliers are shown in beige.

Interpretation

The Kruskal-Wallis test confirmed that the temperature distrbutions significantly differ across the three weather stations and Dunn’s post hoc test showed that difference of each is pair of stations is also statistically significant.

The latitudes can primarily explain the different temperature distributions over July. Kittilä Pokka is 6 degrees further north than Juupajoki Hyytiälä and 7 further than Helsinki Kumpula. Urban heat island effect also goes some way to explaining the higher temperatures in Helsinki.

References

Dytham, C. (2011) Choosing and using statistics: a biologist’s guide. 3. ed. Oxford: Wiley-Blackwell.

Peterken, G.F. (2001) Natural woodland: ecology and conservation in northern temperate regions. Reprinted. New York: Cambridge Univ. Press.

Sharp Bowman, T.R., McMillan, B.R. and St. Clair, S.B. (2017) ‘A comparison of the effects of fire on rodent abundance and diversity in the Great Basin and Mojave Deserts’, PLOS ONE. Edited by A.W. Reed, 12(11), p. e0187740. Available at: https://doi.org/10.1371/journal.pone.0187740.