Research methods - Assignment 1

2025-03-18

Dan Palmer

Study 1 - Habitat use by small rodents

H₀

The rodent uses habitats in proportion to their availability within the environment.

H₁

The rodent shows a preference in habitat selection rather than using habitats in proportion to their availability within the environment.

Suitable statistical test

The data collected meet the assumptions of a Chi-Square goodness-of-fit test. The data are:
- Categorical (each rodent location falls into exactly one habitat type)
- Independent (rodent locations are independent of each other)
- Counts that can each be assigned to one of two or more mutually exclusive categories of a single variable, habitat type (Dytham, 2011).
Since we are testing whether the observed frequencies of rodent locations match the frequencies expected from habitat area, the Chi-Square goodness-of-fit test is suitable.
Lastly, the Chi-Square test is non-parametric, so the counts do not need to follow a normal distribution.

Equation used for the Chi-Square test, where O_i is the observed and E_i the expected number of rodent locations in habitat i: \[ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \]

Chi-Square test: Step by step

#load necessary packages

library(readxl)

library(knitr)

#load excel sheet containing rodent data

Book1 <- read_excel("Book1.xlsx")

#format and load table

Table 1: Overview of the area (km²) of each habitat type and number of rodent locations recorded within them

kable(Book1)

Habitat Type	Area (km2)	Number of rodent locations
Primary Forest	4	17
Secondary growth forest	2	2
Natural meadow	3	15
Recent clear cut	1	4
Recently burned	1	8
Alpine tundra	2	5
Agricultural land	2	4

#re-read the sheet for the analysis; skip = 1 skips the first row of the sheet (a row, not a column), so all three columns, including habitat type, are kept

data <- read_excel("Book1.xlsx", sheet = "Sheet1", skip = 1)

#set new column headers

colnames(data) <- c("Habitat_Type", "Area_km2", "Rodent_Locations")

#calculate expected frequencies

data$Expected <- (data$Area_km2 / sum(data$Area_km2)) * sum(data$Rodent_Locations)

#set observed and expected variables

observed <- data$Rodent_Locations

expected <- data$Expected

#perform chisq test and print output

chisq_test <- chisq.test(x = observed, p = expected / sum(expected))

print(chisq_test)

## 
##  Chi-squared test for given probabilities
## 
## data:  observed
## X-squared = 13.114, df = 6, p-value = 0.04127
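The reported statistic can be cross-checked by applying the Chi-Square formula directly to the values in Table 1. The following is a minimal sketch that types the table values in by hand rather than reading `Book1.xlsx`; the variable names are illustrative only.

```r
# Values transcribed from Table 1 (same habitat order as the table)
area     <- c(4, 2, 3, 1, 1, 2, 2)      # habitat area (km2)
observed <- c(17, 2, 15, 4, 8, 5, 4)    # number of rodent locations

# Expected counts if habitats are used in proportion to their area
expected <- area / sum(area) * sum(observed)

# Chi-square statistic: sum over habitats of (O - E)^2 / E
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom = number of categories - 1
df <- length(observed) - 1

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

round(c(chi_sq = chi_sq, df = df, p_value = p_value), 4)
```

The hand calculation should agree with the `chisq.test()` output above to the precision shown.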

Chi square test output evaluation

The X-squared statistic of the Chi-square test is 13.114 with 6 degrees of freedom.

This gave a p-value of 0.04127.

At the p < 0.05 threshold, this indicates a statistically significant departure from habitat use in proportion to availability.

Since the result is statistically significant, the null hypothesis can be rejected, although the p-value lies close to the threshold, so the evidence is moderate rather than overwhelming.

It is likely that the rodent shows preference in habitat selection, rather than using habitats in proportion to their availability within the environment.
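The overall test does not say which habitats drive the result, and it relies on an approximation that can be unreliable when expected counts are small. The sketch below, again using hand-typed Table 1 values, computes Pearson residuals per habitat and counts expected values below 5, a commonly quoted rule of thumb. It also shows a Monte Carlo p-value as one possible alternative; the seed and the number of replicates are arbitrary choices, not part of the original analysis.

```r
# Values transcribed from Table 1
habitat  <- c("Primary Forest", "Secondary growth forest", "Natural meadow",
              "Recent clear cut", "Recently burned", "Alpine tundra",
              "Agricultural land")
area     <- c(4, 2, 3, 1, 1, 2, 2)
observed <- c(17, 2, 15, 4, 8, 5, 4)
expected <- area / sum(area) * sum(observed)

# Pearson residuals: positive = used more than expected, negative = used less
pearson_res <- (observed - expected) / sqrt(expected)
names(pearson_res) <- habitat
round(sort(pearson_res), 2)

# Number of habitats with an expected count below 5
n_small <- sum(expected < 5)
n_small

# Monte Carlo p-value, which does not rely on the chi-square approximation
set.seed(1)  # arbitrary seed so the simulation is repeatable
sim <- chisq.test(x = observed, p = area / sum(area),
                  simulate.p.value = TRUE, B = 10000)
sim$p.value
```

If the simulated p-value is close to the one reported above, the small expected counts are unlikely to have changed the conclusion.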

Discussion

The rodent may have chosen habitats due to multiple factors such as:
- Predation
- Food availability
- Shelter availability
Location counts of the rodents suggest that the successional stage of each habitat influences rodent preference. For example:
- The primary forest is likely to have a better-developed understorey, providing the rodent with cover from predators, shelter within dense vegetation and feeding opportunities.
- Conversely, secondary growth forests are in earlier stages of succession, offering less complex vegetation and increased predation risk (Morales-Diaz et al., 2019).
Other habitats with sparse or early successional vegetation also received fewer visits by the rodent. Alpine tundra (5), recent clear cut (4) and agricultural land (4) all provide limited vegetation for feeding and shelter from predators.

Figure 1: Recently burned forest, despite its small area (1 km²), had a high number of rodent locations (8), making it the most heavily used habitat in proportion to its area.
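The proportional use described in the caption can be expressed as rodent locations per km². A short sketch using the Table 1 values, typed in by hand:

```r
# Values transcribed from Table 1
habitat  <- c("Primary Forest", "Secondary growth forest", "Natural meadow",
              "Recent clear cut", "Recently burned", "Alpine tundra",
              "Agricultural land")
area     <- c(4, 2, 3, 1, 1, 2, 2)
observed <- c(17, 2, 15, 4, 8, 5, 4)

# Rodent locations per km2 of each habitat, highest first
density <- observed / area
names(density) <- habitat
sort(density, decreasing = TRUE)
```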

Recently burned forest may encourage seed dispersal. This is likely to attract rodents, perhaps explaining its use by the rodent in the study (Puig-Gironès, 2023). Furthermore, invertebrates that are normally hidden by leaf litter and dense vegetation become exposed after fire and are available for foraging by small rodents.

References

Dytham, C. (2011) Choosing and Using Statistics: A Biologist’s Guide. 3rd edn., pp. 199-210.

Morales-Diaz, S.P., Alvarez-Anorve, M.Y., Zamora-Espinoza, M.E., Dirzo, R., Oyama, K. and Avila-Cabadilla, L.D. (2019) ‘Rodent community responses to vegetation and landscape changes in early successional stages of tropical dry forest’, Forest Ecology and Management, 433, pp. 633–644.

Puig-Gironès, R. (2023) ‘Can predators influence small rodent foraging activity rates immediately after wildfires?’, International Journal of Wildland Fire, 32, pp. 1391-1403.

_______________________________________________________________________________

Study 2 - How does habitat quality affect the population size of a species?

H₀

Habitat quality has no effect on the population size of a species.

H₁

Habitat quality does have an effect on the population size of a species.

Check for normality

To choose an appropriate statistical analysis for the provided data, a Shapiro-Wilk normality test was performed on each variable.

#load and view dataset from Excel

quality <- read_excel("quality.xlsx")

Table 2: Overview of habitat quality indices (Quality_index) and population size of species (Species_size)

kable(quality)

Quality_index	Species_size
0.60	450
0.55	350
0.80	750
0.85	850
0.95	1000
0.25	150
0.70	600
0.80	750
0.40	200
0.90	950

#inspect the structure and summary of the data set

str(quality)

## tibble [10 × 2] (S3: tbl_df/tbl/data.frame)
##  $ Quality_index: num [1:10] 0.6 0.55 0.8 0.85 0.95 0.25 0.7 0.8 0.4 0.9
##  $ Species_size : num [1:10] 450 350 750 850 1000 150 600 750 200 950

summary(quality)

##  Quality_index     Species_size 
##  Min.   :0.2500   Min.   : 150  
##  1st Qu.:0.5625   1st Qu.: 375  
##  Median :0.7500   Median : 675  
##  Mean   :0.6800   Mean   : 605  
##  3rd Qu.:0.8375   3rd Qu.: 825  
##  Max.   :0.9500   Max.   :1000

#perform Shapiro-Wilk test for normality

shapiro.test(quality$`Quality_index`)

## 
##  Shapiro-Wilk normality test
## 
## data:  quality$Quality_index
## W = 0.93181, p-value = 0.4659

shapiro.test(quality$`Species_size`)

## 
##  Shapiro-Wilk normality test
## 
## data:  quality$Species_size
## W = 0.93423, p-value = 0.4907

#check for normality using Q-Q plots

qqnorm(quality$Quality_index)
qqline(quality$Quality_index, col = "red")

Figure 2: Normal Q-Q plot of habitat quality index showing the quantiles of the data plotted against the quantiles of a normal distribution. Points close to the diagonal line suggest a normal distribution

qqnorm(quality$Species_size)
qqline(quality$Species_size, col = "red")

Figure 3: Normal Q-Q plot of population sizes showing the quantiles of the data plotted against the quantiles of a normal distribution. Points close to the diagonal line suggest a normal distribution

Choosing a statistical test

Pearson’s correlation coefficient can be used to test the relationship between habitat quality and population size because the data meet the following assumptions:
- The data are approximately normally distributed (Shapiro-Wilk p = 0.4659 and 0.4907, both above 0.05)
- The data are continuous
- There are no significant outliers
- Observations are independent
Linear regression can subsequently be performed to better answer what impact improving habitat quality might have on the conservation of the species.
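The outlier assumption listed above is stated rather than demonstrated. One common way to check it, though not the only one, is the 1.5 × IQR rule that `boxplot.stats()` applies. The sketch below rebuilds the data frame from the values in Table 2 instead of reading `quality.xlsx`.

```r
# Values transcribed from Table 2
quality <- data.frame(
  Quality_index = c(0.60, 0.55, 0.80, 0.85, 0.95, 0.25, 0.70, 0.80, 0.40, 0.90),
  Species_size  = c(450, 350, 750, 850, 1000, 150, 600, 750, 200, 950)
)

# For each variable, list the values lying more than 1.5 x IQR beyond the hinges
outliers <- lapply(quality, function(v) boxplot.stats(v)$out)
outliers
```

An empty result for both variables means no value is flagged under this rule.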

The Pearson correlation coefficient: Output

output <- cor.test(x = quality$Quality_index, 
                   y = quality$Species_size, 
                   alternative = "two.sided")

print(output)

## 
##  Pearson's product-moment correlation
## 
## data:  quality$Quality_index and quality$Species_size
## t = 14.784, df = 8, p-value = 4.312e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9239259 0.9959234
## sample estimates:
##       cor 
## 0.9821867

Pearson’s correlation coefficient output evaluation

The t statistic of Pearson’s test is 14.784 with 8 degrees of freedom.
This gave a p-value of 4.312 x 10^-7.
At the p < 0.05 threshold, this indicates very high statistical significance.
Pearson’s correlation coefficient (r) was 0.982 (95% confidence interval 0.924 to 0.996). This shows a strong positive relationship between quality of habitat and population size of species.
Since the test indicated statistical significance, the null hypothesis can be rejected.
Therefore, we can accept our alternative hypothesis that the quality of habitat has an effect on the population size of species, with the caveat that a correlation alone does not establish cause.
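The link between r, the t statistic and the p-value can be made explicit with t = r·sqrt(n - 2) / sqrt(1 - r²). A minimal sketch using the Table 2 values typed in by hand:

```r
# Values transcribed from Table 2
quality <- data.frame(
  Quality_index = c(0.60, 0.55, 0.80, 0.85, 0.95, 0.25, 0.70, 0.80, 0.40, 0.90),
  Species_size  = c(450, 350, 750, 850, 1000, 150, 600, 750, 200, 950)
)

n <- nrow(quality)

# Pearson correlation coefficient
r <- cor(quality$Quality_index, quality$Species_size)

# t statistic for testing r = 0, on n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

c(r = r, t = t_stat, df = n - 2, p_value = p_value)
```

These values should match the `cor.test()` output above.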

Linear regression analysis

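Figure 4: Regression analysis observing the relationship between habitat quality and population size of species.

The following model is being used: A=β0+β1⋅B+ϵ, where A is the population size (Species_size), B is the habitat quality index (Quality_index), β0 is the intercept, β1 is the slope and ϵ is the error term.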

output<-lm(Species_size ~ Quality_index, data=quality)
mysummary<-summary(output)
mysummary

## 
## Call:
## lm(formula = Species_size ~ Quality_index, data = quality)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -83.85 -35.11 -12.98  34.95 111.11 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -290.24      63.53  -4.568  0.00183 ** 
## Quality_index  1316.52      89.05  14.784 4.31e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.79 on 8 degrees of freedom
## Multiple R-squared:  0.9647, Adjusted R-squared:  0.9603 
## F-statistic: 218.6 on 1 and 8 DF,  p-value: 4.312e-07

Residuals are roughly symmetrically dispersed around 0 (median -12.98, range -83.85 to 111.11), which is consistent with a reasonable fit.
The residual standard error on 8 degrees of freedom is 60.79, meaning observed population sizes typically deviate from the fitted line by about 61 individuals.
The multiple R-squared value is 0.9647, indicating a strong relationship between habitat quality and population size.
When adjusted for the number of predictors, the R-squared is 0.9603, which still indicates a strong relationship.
The F statistic of 218.6 is large and, together with the small p-value (4.312 x 10^-7), shows that the model is highly significant.
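To avoid copying figures from the printed summary by eye, the same quantities can be pulled out of the summary object. This is a sketch using the Table 2 values typed in by hand; `fit` and `model_stats` are illustrative names, not objects from the original script.

```r
# Values transcribed from Table 2
quality <- data.frame(
  Quality_index = c(0.60, 0.55, 0.80, 0.85, 0.95, 0.25, 0.70, 0.80, 0.40, 0.90),
  Species_size  = c(450, 350, 750, 850, 1000, 150, 600, 750, 200, 950)
)

fit <- lm(Species_size ~ Quality_index, data = quality)
s   <- summary(fit)

# Key model statistics taken directly from the summary object
model_stats <- c(
  residual_se   = s$sigma,
  r_squared     = s$r.squared,
  adj_r_squared = s$adj.r.squared,
  f_statistic   = unname(s$fstatistic["value"])
)
round(model_stats, 4)
```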

Discussion

Pearson’s correlation coefficient was 0.982, suggesting a strong, positive relationship between habitat quality and species population size.
The p-value (4.312 × 10⁻⁷) is much smaller than the significance threshold of 0.05, confirming that this relationship is statistically significant. Therefore, we can reject the null hypothesis and accept our alternative hypothesis that habitat quality does have an effect on the population size of the species.
Linear regression analysis was performed; its results further support the strong relationship between habitat quality and population size. The R-squared value of 0.9647 indicates that 96.47% of the variance in population size can be explained by habitat quality.
The remaining 3.53% of variation is due to unknown variables or random error. These could be investigated in future research.
The linear regression analysis helps to explain what impact improving habitat quality might have on species conservation. The fitted slope of 1316.52 implies that each 0.1 increase in the quality index is associated with roughly 132 more individuals. Since a strong relationship between the two has been found, emphasis should be placed on restoring habitat quality in the future, as this is likely to support an increase in species population size.
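As an illustration of how the fitted line could inform conservation planning, the sketch below predicts population size at two hypothetical quality index values (0.50 and 0.60, chosen only as examples inside the observed range of 0.25 to 0.95). It assumes the linear relationship holds within that range. Predictions outside it, such as the negative intercept at a quality index of 0, should not be interpreted.

```r
# Values transcribed from Table 2
quality <- data.frame(
  Quality_index = c(0.60, 0.55, 0.80, 0.85, 0.95, 0.25, 0.70, 0.80, 0.40, 0.90),
  Species_size  = c(450, 350, 750, 850, 1000, 150, 600, 750, 200, 950)
)

fit <- lm(Species_size ~ Quality_index, data = quality)

# Hypothetical sites before and after a 0.1 improvement in habitat quality
new_sites <- data.frame(Quality_index = c(0.50, 0.60))
predicted <- predict(fit, newdata = new_sites)
predicted

# Expected change in population size per 0.1 increase in the quality index
gain_per_tenth <- unname(coef(fit)["Quality_index"]) * 0.1
gain_per_tenth
```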

_______________________________________________________________________________

Study 3 - Does the temperature in July differ among the three weather stations at locations (Helsinki, Hyytiälä, and Kittilä) in different parts of Finland?

H₀

The temperature in July does not differ between the 3 weather stations.

H₁

There is a difference between the temperature in July at the 3 weather stations.

Testing for normality

I performed a Shapiro-Wilk test on the July temperatures from each station to determine whether the data are normally distributed, followed by a histogram to visualize the distributions.

#load dplyr for %>% and filter(), then keep only the data recorded in July

library(dplyr)

TemperatureComparison_1_ <- TemperatureComparison_1_ %>% filter(Month == 7)

shapiro.test(TemperatureComparison_1_$Temperature[TemperatureComparison_1_$Station == "Helsinki Kumpula"])

## 
##  Shapiro-Wilk normality test
## 
## data:  TemperatureComparison_1_$Temperature[TemperatureComparison_1_$Station == "Helsinki Kumpula"]
## W = 0.99528, p-value = 0.02216

shapiro.test(TemperatureComparison_1_$Temperature[TemperatureComparison_1_$Station == "Juupajoki Hyytiälä"])

## 
##  Shapiro-Wilk normality test
## 
## data:  TemperatureComparison_1_$Temperature[TemperatureComparison_1_$Station == "Juupajoki Hyytiälä"]
## W = 0.99075, p-value = 0.0001351

shapiro.test(TemperatureComparison_1_$Temperature[TemperatureComparison_1_$Station == "Kittilä Pokka"])

## 
##  Shapiro-Wilk normality test
## 
## data:  TemperatureComparison_1_$Temperature[TemperatureComparison_1_$Station == "Kittilä Pokka"]
## W = 0.9719, p-value = 9.187e-11
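The three separate calls can be written as one grouped call. Because the temperature dataset itself is not included in this report, the sketch below uses simulated placeholder values (`temps`, with made-up means and spreads) purely to show the pattern. The numbers it produces are not the study results.

```r
# Simulated placeholder data, NOT the real July temperatures
set.seed(42)  # arbitrary seed so the placeholder data are repeatable
temps <- data.frame(
  Station     = rep(c("Helsinki Kumpula", "Juupajoki Hyytiälä", "Kittilä Pokka"),
                    each = 50),
  Temperature = c(rnorm(50, mean = 18, sd = 3),
                  rnorm(50, mean = 16, sd = 3),
                  rnorm(50, mean = 13, sd = 4))
)

# Run the Shapiro-Wilk test once per station and collect W and the p-value
shapiro_by_station <- sapply(
  split(temps$Temperature, temps$Station),
  function(x) {
    res <- shapiro.test(x)
    c(W = unname(res$statistic), p_value = res$p.value)
  }
)
round(t(shapiro_by_station), 4)
```

With the real data, `temps` would be replaced by the filtered `TemperatureComparison_1_` data frame.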

library(ggplot2)
library(wesanderson)

ggplot(TemperatureComparison_1_, aes(x = Temperature, fill = Station)) +
    geom_histogram(bins = 30, alpha = 0.7) +
    scale_fill_manual(values = wes_palette("Darjeeling1", n = 3, type = "discrete")) +
    labs(title = "Temperature variability", x = "Temperature", y = "Observation") +
    theme_classic()

Figure 5: Histogram showing variability in temperature across three weather stations: Helsinki Kumpula (red), Juupajoki Hyytiälä (green), and Kittilä Pokka (yellow)

Outputs of the Shapiro-Wilk test indicate non-normal distribution of the data.
- W ranged from 0.9719 to 0.99528 across the three stations. Values this close to 1 mean the departures from normality are small in absolute terms.
- However, p-values of 0.022, 1.351 x 10^-4 and 9.187 x 10^-11 were obtained, all below 0.05. This suggests a statistically significant deviation from normality at every station (Kim and Park, 2019).
The histogram also suggests departures from normality:
- There are obvious tails.
- The data for Juupajoki Hyytiälä and Kittilä Pokka are skewed to the left.

Choosing a statistical test

A Kruskal-Wallis test is suitable to investigate whether temperatures differ across the three weather stations.

A Kruskal-Wallis test is appropriate here because:
- I am testing differences between more than two independent groups
- The data is continuous
- The null hypothesis being tested is that each group has the same median
- The data is not normally distributed, so a parametric test is less appropriate (Dytham, 2011)

Performing the Kruskal-Wallis test

kruskal_result <- kruskal.test(Temperature ~ Station, data = TemperatureComparison_1_)
print(kruskal_result)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Temperature by Station
## Kruskal-Wallis chi-squared = 278.26, df = 2, p-value < 2.2e-16

The Kruskal-Wallis test shows that at least one weather station differs significantly in temperature from the others.
- The p-value at 2 degrees of freedom is less than 2.2 x 10^-16, so I can firmly reject the null hypothesis in this case.
- The chi-squared value is equal to 278.26, indicating that the observed differences are very unlikely to be due to chance.
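The Kruskal-Wallis statistic is built from ranks: H = 12 / (N(N + 1)) · Σ(R_j² / n_j) - 3(N + 1), where R_j is the rank sum and n_j the sample size of group j. The sketch below works through this on a small made-up data set (`toy`, with no tied values) and compares it with `kruskal.test()`. The values are hypothetical and unrelated to the weather station data.

```r
# Small made-up example with three groups and no tied values
toy <- data.frame(
  group = rep(c("A", "B", "C"), each = 4),
  value = c(18.1, 17.4, 19.0, 16.8,
            15.2, 16.1, 14.7, 15.9,
            12.3, 13.5, 11.8, 14.0)
)

N         <- nrow(toy)
ranks     <- rank(toy$value)                  # rank all values together
rank_sums <- tapply(ranks, toy$group, sum)    # R_j for each group
group_n   <- tapply(ranks, toy$group, length) # n_j for each group

# Kruskal-Wallis H statistic (no tie correction needed here)
H_manual <- 12 / (N * (N + 1)) * sum(rank_sums^2 / group_n) - 3 * (N + 1)

# Same statistic from the built-in test
H_builtin <- unname(kruskal.test(value ~ group, data = toy)$statistic)

all.equal(H_manual, H_builtin)
```

When ties are present, `kruskal.test()` applies a correction, so the simple formula would differ slightly from the built-in result.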

Since a significant effect was found, I performed Dunn’s post-hoc test with Bonferroni correction to assess specific differences between each pair of weather stations.

#performing Dunn’s Post-hoc test

library(dunn.test)
dunn_result <- dunn.test(TemperatureComparison_1_$Temperature, TemperatureComparison_1_$Station, method = "bonferroni")

##   Kruskal-Wallis rank sum test
## 
## data: x and group
## Kruskal-Wallis chi-squared = 278.26, df = 2, p-value = 0
## 
## 
##                            Comparison of x by group                            
##                                  (Bonferroni)                                  
## Col Mean-|
## Row Mean |   Helsinki   Juupajok
## ---------+----------------------
## Juupajok |   11.06154
##          |    0.0000*
##          |
## Kittilä  |   16.34143   5.257851
##          |    0.0000*    0.0000*
## 
## alpha = 0.05
## Reject Ho if p <= alpha/2

print(dunn_result)

## $chi2
## [1] 278.26
## 
## $Z
## [1] 11.061544 16.341434  5.257851
## 
## $P
## [1] 9.637511e-29 2.503082e-60 7.287413e-08
## 
## $P.adjusted
## [1] 2.891253e-28 7.509245e-60 2.186224e-07
## 
## $comparisons
## [1] "Helsinki Kumpula - Juupajoki Hyytiälä"
## [2] "Helsinki Kumpula - Kittilä Pokka"     
## [3] "Juupajoki Hyytiälä - Kittilä Pokka"
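The `$P.adjusted` values are the raw `$P` values multiplied by the number of comparisons (three), which is what the Bonferroni method does. A short check using base R's `p.adjust()` on the values printed above:

```r
# Raw and adjusted p-values copied from the dunn.test output
p_raw      <- c(9.637511e-29, 2.503082e-60, 7.287413e-08)
p_reported <- c(2.891253e-28, 7.509245e-60, 2.186224e-07)

# Bonferroni adjustment: multiply by the number of comparisons, capped at 1
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Ratio of recomputed to reported values (should all be very close to 1)
p_bonf / p_reported
```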

#visualising post-hoc test

ggplot(TemperatureComparison_1_, aes(x = Station, y = Temperature, fill = Station)) +
    geom_boxplot(alpha = 0.7) +  # Adjust alpha for transparency
    scale_fill_manual(values = wes_palette("Darjeeling1", n = 3, type = "discrete"))

Figure 6: Boxplot showing differences in the distribution of temperature (median and interquartile range) across three weather stations: Helsinki Kumpula (red), Juupajoki Hyytiälä (green), and Kittilä Pokka (yellow)

Dunn’s test evaluation

Helsinki Kumpula vs Juupajoki Hyytiälä has a z-value of 11.062, indicating the mean rank of temperatures in Helsinki Kumpula is much higher than in Juupajoki Hyytiälä.
Similarly, the z-value of Helsinki Kumpula vs Kittilä Pokka is 16.341, showing the difference in mean rank is even larger when Helsinki Kumpula is compared with Kittilä Pokka.
Juupajoki Hyytiälä vs Kittilä Pokka gives a z-value of 5.258; Juupajoki Hyytiälä is significantly warmer than Kittilä Pokka. The difference is less distinct than that between Helsinki Kumpula and the two other weather stations.
Therefore, Kittilä Pokka is the weather station with the lowest mean rank of temperature, and Helsinki Kumpula the highest.
All three comparisons remain significant after Bonferroni adjustment (adjusted p-values of 2.89 x 10^-28, 7.51 x 10^-60 and 2.19 x 10^-7), tested at alpha = 0.05.

Discussion

Since the p-value was less than 0.05, the null hypothesis is rejected and I can accept the alternative hypothesis that there is a significant difference in July temperature between the 3 weather stations.
This difference could be explained by the latitude of the 3 weather stations: the further north the station, the lower its July temperatures ranked.

Table 3: The latitude of the 3 weather stations, which may explain the differences found by Dunn’s post-hoc test

Weather station	Latitude (°N)
Helsinki Kumpula	60.20456
Juupajoki Hyytiälä	61.84534
Kittilä Pokka	68.15895

References

Dytham, C. (2011) Choosing and Using Statistics: A Biologist’s Guide. 3rd edn., pp. 199-210.

Kim, T.K. and Park, J.H. (2019) ‘More about the basic assumptions of t-test: normality and sample size’, Korean Journal of Anesthesiology, 72(4), pp. 331–335.
