Research methods - Assignment 1
2025-03-18
Study 1 - Habitat use by small rodents
H0
The rodent uses habitats in proportion to their availability within the environment.
H1
The rodent shows preference in habitat selection, rather than using habitats in proportion to their availability within the environment.
Suitable statistical test
- The data collected fit the assumptions of a Chi-Square goodness of fit test. The data are:
- Categorical
- Independent (rodent locations are independent of each other)
- Made up of observations that can each be assigned to exactly one of two or more categories (Dytham, 2011).
- Since we are testing whether the observed frequencies of rodent locations match the frequencies expected from habitat area, the Chi-Square goodness of fit test is suitable.
- Lastly, the Chi-Square test is non-parametric, so the counts do not need to follow a normal distribution.
Equation used for Chi-Square test \[ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} \]
Chi-Square test: Step by step
#load necessary packages
library(readxl)
library(knitr)
#load excel sheet containing rodent data
Book1 <- read_excel("Book1.xlsx")
#format and load table
Table 1: Overview of the area (km2) of each habitat type and number of rodent locations recorded within them
kable(Book1)
Habitat Type | Area (km2) | Number of rodent locations |
---|---|---|
Primary Forest | 4 | 17 |
Secondary growth forest | 2 | 2 |
Natural meadow | 3 | 15 |
Recent clear cut | 1 | 4 |
Recently burned | 1 | 8 |
Alpine tundra | 2 | 5 |
Agricultural land | 2 | 4 |
#load Excel data
Book1 <- read_excel("Book1.xlsx")
#create data table, skipping the first row of the sheet (skip = 1 drops rows, not columns)
data <- read_excel("Book1.xlsx", sheet = "Sheet1", skip = 1)
#set new column headers
colnames(data) <- c("Habitat_Type", "Area_km2", "Rodent_Locations")
#calculate expected frequencies (share of total area multiplied by total number of locations)
data$Expected <- (data$Area_km2 / sum(data$Area_km2)) * sum(data$Rodent_Locations)
#set observed and expected variables
observed <- data$Rodent_Locations
expected <- data$Expected
#perform chisq test and print output
chisq_test <- chisq.test(x = observed, p = expected / sum(expected))
print(chisq_test)
##
## Chi-squared test for given probabilities
##
## data: observed
## X-squared = 13.114, df = 6, p-value = 0.04127
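The statistic reported above can be reproduced by hand from the equation given earlier. The following is a minimal sketch that assumes the values in Table 1 are typed in directly rather than read from Book1.xlsx; the variable names (area, chi_sq, deg_free) are illustrative and not part of the original analysis.

```r
# Values copied from Table 1 (area in km2, number of rodent locations)
area <- c(4, 2, 3, 1, 1, 2, 2)
observed <- c(17, 2, 15, 4, 8, 5, 4)

# Expected counts if habitat use is proportional to habitat area
expected <- area / sum(area) * sum(observed)

# Chi-square statistic: sum of (O - E)^2 / E over all habitat types
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: number of habitat categories minus one
deg_free <- length(observed) - 1

# Upper-tail probability from the chi-square distribution
p_value <- pchisq(chi_sq, df = deg_free, lower.tail = FALSE)

round(c(chi_sq = chi_sq, df = deg_free, p_value = p_value), 4)
```

Working through the sum term by term also shows which habitats contribute most to the statistic: secondary growth forest and recently burned land account for most of the total.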
Chi square test output evaluation
The X-squared statistic of the Chi-square test is 13.114 with 6 degrees of freedom.
This gave a p-value of 0.04127.
At the p < 0.05 threshold, this indicates a statistically significant departure from the expected frequencies.
Since the result is significant, the null hypothesis can be rejected at the 5% level, although the p-value is close to the threshold.
It is likely that the rodent shows preference in habitat selection, rather than using habitats in proportion to their availability within the environment.
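Because the p-value is close to 0.05, it may be worth checking how well the chi-square approximation holds. A common rule of thumb is that expected counts should be at least 5. The sketch below, which again assumes the Table 1 values are entered by hand, counts how many expected frequencies fall below 5 and re-runs the test with a Monte Carlo p-value using the simulate.p.value argument of chisq.test. The seed and the number of replicates (B = 10000) are arbitrary choices for illustration.

```r
# Values copied from Table 1
area <- c(4, 2, 3, 1, 1, 2, 2)
observed <- c(17, 2, 15, 4, 8, 5, 4)

# Expected counts under proportional habitat use
expected <- area / sum(area) * sum(observed)

# How many habitat types have an expected count below 5?
n_small <- sum(expected < 5)

# Monte Carlo version of the goodness of fit test
# (seed fixed only so that the sketch is repeatable)
set.seed(42)
mc_test <- chisq.test(x = observed, p = area / sum(area),
                      simulate.p.value = TRUE, B = 10000)

n_small
mc_test$p.value
```

If the simulated p-value is similar to 0.04127, the conclusion above is unlikely to depend on the approximation.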
Discussion
- The rodent may have chosen habitats due to multiple factors such as:
- Predation
- Food availability
- Shelter availability
- Location counts of the rodents suggest that the successional stage of each habitat determines rodent preference. For example:
- The primary forest is likely to have a better developed understorey, providing the rodent with cover from predators, shelter within dense vegetation and food opportunities.
- Conversely, secondary growth forests are in earlier stages of succession, offering less complex vegetation and increased predation risk (Morales-Diaz et al., 2019).
- Other habitats in earlier stages of succession also received fewer visits by the rodent. Alpine tundra (5), recently clear cut land (4) and agricultural land (4) all provide limited vegetation for feeding and shelter from predators.
Figure 1: Recently burned forests, due to their small size and high number of visitations (8), were the most used, proportionally.
Recently burned forest may encourage seed dispersal. This is likely to attract rodents, perhaps explaining its use by the rodent in the study (Puig-Gironès, 2023). Furthermore, invertebrates that are normally hidden by leaf litter and dense vegetation become available for foraging by small rodents.
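The claim that recently burned land was the most used habitat relative to its size can be checked with a short calculation. This is a sketch that assumes the Table 1 values are entered by hand; the column names density_per_km2 and obs_to_exp are illustrative.

```r
# Values copied from Table 1
use <- data.frame(
  habitat = c("Primary Forest", "Secondary growth forest", "Natural meadow",
              "Recent clear cut", "Recently burned", "Alpine tundra",
              "Agricultural land"),
  area_km2 = c(4, 2, 3, 1, 1, 2, 2),
  locations = c(17, 2, 15, 4, 8, 5, 4)
)

# Rodent locations per km2 of each habitat
use$density_per_km2 <- use$locations / use$area_km2

# Ratio of observed to expected locations (1 = used in proportion to area)
expected <- use$area_km2 / sum(use$area_km2) * sum(use$locations)
use$obs_to_exp <- use$locations / expected

# Sort from most to least used relative to availability
use[order(-use$obs_to_exp), ]
```

Ratios above 1 indicate habitats used more than their area would predict, and ratios below 1 indicate habitats used less.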
References
Dytham, C. (2011) Choosing and Using Statistics: A Biologist’s Guide. 3rd edn., pp. 199–210.
Morales-Diaz, S.P., Alvarez-Anorve, M.Y., Zamora-Espinoza, M.E., Dirzo, R., Oyama, K. and Avila-Cabadilla, L.D. (2019) ‘Rodent community responses to vegetation and landscape changes in early successional stages of tropical dry forest’, Forest Ecology and Management, 433, pp. 633–644.
Puig-Gironès, R. (2023) ‘Can predators influence small rodent foraging activity rates immediately after wildfires?’, International Journal of Wildland Fire, 32, pp. 1391–1403.
_______________________________________________________________________________-
Study 2 - How does habitat quality affect the population size of a species?
H0
Habitat quality has no effect on the population size of a species.
H1
Habitat quality does have an effect on the population size of a species.
Check for normality
To determine which statistical analysis is appropriate for the provided data, a Shapiro-Wilk normality test was performed on each variable.
#load and view dataset from Excel
quality <- read_excel("quality.xlsx")
Table 2: Overview of habitat quality indices (Quality_index) and population size of species (Species_size)
kable(quality)
Quality_index | Species_size |
---|---|
0.60 | 450 |
0.55 | 350 |
0.80 | 750 |
0.85 | 850 |
0.95 | 1000 |
0.25 | 150 |
0.70 | 600 |
0.80 | 750 |
0.40 | 200 |
0.90 | 950 |
#inspect the structure and summary statistics of the data set
str(quality)
## tibble [10 × 2] (S3: tbl_df/tbl/data.frame)
## $ Quality_index: num [1:10] 0.6 0.55 0.8 0.85 0.95 0.25 0.7 0.8 0.4 0.9
## $ Species_size : num [1:10] 450 350 750 850 1000 150 600 750 200 950
summary(quality)
## Quality_index Species_size
## Min. :0.2500 Min. : 150
## 1st Qu.:0.5625 1st Qu.: 375
## Median :0.7500 Median : 675
## Mean :0.6800 Mean : 605
## 3rd Qu.:0.8375 3rd Qu.: 825
## Max. :0.9500 Max. :1000
#perform Shapiro-Wilk test for normality
shapiro.test(quality$`Quality_index`)
##
## Shapiro-Wilk normality test
##
## data: quality$Quality_index
## W = 0.93181, p-value = 0.4659
shapiro.test(quality$`Species_size`)
##
## Shapiro-Wilk normality test
##
## data: quality$Species_size
## W = 0.93423, p-value = 0.4907
#check for normality using Q-Q plots
qqnorm(quality$Quality_index)
qqline(quality$Quality_index, col = "red")
Figure 2: Normal Q-Q plot of habitat quality index showing the
quantiles of the data plotted against the quantiles of a normal
distribution. Points close to the diagonal line suggest a normal
distribution
qqnorm(quality$Species_size)
qqline(quality$Species_size, col = "red")
Figure 3: Normal Q-Q plot of population sizes showing the
quantiles of the data plotted against the quantiles of a normal
distribution. Points close to the diagonal line suggest a normal
distribution
Choosing a statistical test
- Pearson’s Correlation Coefficient can be used to test the relationship between habitat quality and population size, as the data meet the following assumptions:
- Both variables are approximately normally distributed (Shapiro-Wilk p > 0.05 for both)
- The data are continuous
- There are no significant outliers
- Observations are independent
- Linear regression can subsequently be performed to better answer what impact improving habitat quality might have on conservation of the species.
The Pearson correlation coefficient: Output
output <- cor.test(x = quality$Quality_index,
                   y = quality$Species_size,
                   alternative = "two.sided")
print(output)
##
## Pearson's product-moment correlation
##
## data: quality$Quality_index and quality$Species_size
## t = 14.784, df = 8, p-value = 4.312e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9239259 0.9959234
## sample estimates:
## cor
## 0.9821867
Pearson’s correlation coefficient output evaluation
The t statistic of Pearson’s test is 14.784 with 8 degrees of freedom.
This gave a p-value of 4.312 × 10⁻⁷.
At the p < 0.05 threshold, this indicates very high statistical significance.
Pearson’s correlation coefficient (r) was 0.982. This shows a strong positive relationship between quality of habitat and population size of species.
Since the test indicated statistical significance, the null hypothesis can be rejected.
Therefore, we can accept our alternative hypothesis that habitat quality has an effect on the population size of species, with the caveat that a correlation alone does not establish causation.
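The t statistic and p-value follow directly from r and the sample size, via t = r·sqrt(n − 2) / sqrt(1 − r²). The sketch below recomputes them, assuming the values from Table 2 are entered by hand instead of being read from quality.xlsx; t_stat and p_value are illustrative names.

```r
# Values copied from Table 2
quality <- data.frame(
  Quality_index = c(0.60, 0.55, 0.80, 0.85, 0.95, 0.25, 0.70, 0.80, 0.40, 0.90),
  Species_size = c(450, 350, 750, 850, 1000, 150, 600, 750, 200, 950)
)

# Pearson's correlation coefficient
r <- cor(quality$Quality_index, quality$Species_size)

# Test statistic for H0: true correlation equals 0
n <- nrow(quality)
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

c(r = r, t = t_stat, df = n - 2, p_value = p_value)
```

These values should agree with the cor.test output shown above.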
Linear regression analysis
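Figure 4: Regression analysis observing the relationship between habitat quality and population size of species.
The following model is being used: \[ A = \beta_0 + \beta_1 B + \epsilon \] where A is the population size (Species_size), B is the habitat quality index (Quality_index), β0 is the intercept, β1 is the slope and ϵ is the error term.
#fit the linear model and print its summary
output <- lm(Species_size ~ Quality_index, data = quality)
mysummary <- summary(output)
mysummary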
##
## Call:
## lm(formula = Species_size ~ Quality_index, data = quality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -83.85 -35.11 -12.98 34.95 111.11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -290.24 63.53 -4.568 0.00183 **
## Quality_index 1316.52 89.05 14.784 4.31e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.79 on 8 degrees of freedom
## Multiple R-squared: 0.9647, Adjusted R-squared: 0.9603
## F-statistic: 218.6 on 1 and 8 DF, p-value: 4.312e-07
Residuals are roughly symmetrically dispersed around 0 (median -12.98), indicating a reasonable fit.
The residual standard error on 8 degrees of freedom is 60.79, meaning that observed population sizes typically deviate from the fitted line by about 61 individuals.
The multiple R-squared value is 0.9647, indicating a strong relationship between habitat quality and population size.
When adjusted for the number of predictors, the R-squared is 0.9603, which still indicates a strong relationship.
The F statistic at 218.6 is large and, together with the small p-value (4.312 × 10⁻⁷), shows that the model is highly significant.
Discussion
The Pearson’s correlation coefficient was 0.982, suggesting a strong, positive relationship between habitat quality and species population size.
The p-value (4.312 × 10⁻⁷) is much smaller than the significance threshold of 0.05, confirming that this relationship is statistically significant. Therefore, we can reject the null hypothesis and accept our alternative hypothesis that habitat quality does have an effect on the population size of the species.
Linear regression analysis was performed; its results further support the strong relationship between habitat quality and population size. The R-squared value of 0.9647 indicates that 96.47% of the variance in population size can be explained by differences in habitat quality.
The remaining 3.53% of variation is due to unknown variables. These could be accounted for with future research.
The linear regression analysis helps to explain what impact improving habitat quality might have on species conservation. The slope of 1316.52 means that an increase of 0.1 in the quality index is associated with roughly 132 more individuals. Since a strong relationship between the two has been found, emphasis should be placed on restoring habitat quality in the future, as this is likely to be accompanied by an increase in species population size.
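To make the conservation implication concrete, the fitted model can be used to predict population size at chosen quality indices. The sketch below assumes the Table 2 values are entered by hand; the two example sites (quality index 0.5 and 0.6) are hypothetical and were chosen to lie inside the observed range of 0.25 to 0.95, since predictions outside that range would be extrapolation.

```r
# Values copied from Table 2
quality <- data.frame(
  Quality_index = c(0.60, 0.55, 0.80, 0.85, 0.95, 0.25, 0.70, 0.80, 0.40, 0.90),
  Species_size = c(450, 350, 750, 850, 1000, 150, 600, 750, 200, 950)
)

# Same model as in the regression analysis above
model <- lm(Species_size ~ Quality_index, data = quality)

# Hypothetical sites: before and after a 0.1 improvement in habitat quality
new_sites <- data.frame(Quality_index = c(0.5, 0.6))

# Predicted population sizes with 95% confidence intervals for the mean
predicted <- predict(model, newdata = new_sites, interval = "confidence")

# Predicted gain in population size for the 0.1 improvement
gain <- unname(predicted[2, "fit"] - predicted[1, "fit"])

predicted
gain
```

The gain equals the slope multiplied by 0.1, so it is the same wherever in the observed range the improvement is made. This is a property of the linear model rather than evidence that real populations respond linearly.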
_______________________________________________________________________________-
Study 3 - Does the temperature in July differ among the three weather stations at locations (Helsinki, Hyytiälä, and Kittilä) in different parts of Finland?
H0
The temperature in July does not differ between the 3 weather stations.
H1
There is a difference between the temperature in July at the 3 weather stations.
Testing for normality
I performed a Shapiro-Wilk test for each weather station to determine whether the July temperatures are normally distributed, followed by a histogram to visualize the distributions.
#load dplyr for %>% and filter()
library(dplyr)
#filter for all data recorded in July
TemperatureComparison_1_ <- TemperatureComparison_1_ %>% filter(Month == 7)
shapiro.test(TemperatureComparison_1_$Temperature[TemperatureComparison_1_$Station == "Helsinki Kumpula"])
##
## Shapiro-Wilk normality test
##
## data: TemperatureComparison_1_$Temperature[TemperatureComparison_1_$Station == "Helsinki Kumpula"]
## W = 0.99528, p-value = 0.02216
shapiro.test(TemperatureComparison_1_$Temperature[TemperatureComparison_1_$Station == "Juupajoki Hyytiälä"])
##
## Shapiro-Wilk normality test
##
## data: TemperatureComparison_1_$Temperature[TemperatureComparison_1_$Station == "Juupajoki Hyytiälä"]
## W = 0.99075, p-value = 0.0001351
shapiro.test(TemperatureComparison_1_$Temperature[TemperatureComparison_1_$Station == "Kittilä Pokka"])
##
## Shapiro-Wilk normality test
##
## data: TemperatureComparison_1_$Temperature[TemperatureComparison_1_$Station == "Kittilä Pokka"]
## W = 0.9719, p-value = 9.187e-11
#load plotting packages
library(ggplot2)
library(wesanderson)
#histogram of July temperatures, coloured by station
ggplot(TemperatureComparison_1_, aes(x = Temperature, fill = Station)) +
  geom_histogram(bins = 30, alpha = 0.7) +
  scale_fill_manual(values = wes_palette("Darjeeling1", n = 3, type = "discrete")) +
  labs(title = "Temperature variability", x = "Temperature", y = "Observation") +
  theme_classic()
Figure 5: Histogram showing variability in temperature across
three weather stations: Helsinki Kumpula (red), Juupajoki Hyytiälä
(green), and Kittilä Pokka (yellow)
- Outputs of the Shapiro-Wilk tests indicate non-normal distribution of the data.
- W ranged from 0.9719 to 0.99528 across the three stations; on its own, a W this close to 1 would suggest an approximately normal distribution.
- However, p-values of 0.022, 1.351 × 10⁻⁴ and 9.187 × 10⁻¹¹ were obtained. As all are below 0.05, this suggests that there is significant deviation from normality (Kim and Park, 2019).
- The histogram also gives no clear indication of normality:
- There are obvious tails.
- The data for Juupajoki Hyytiälä and Kittilä Pokka are skewed to the left.
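The three separate shapiro.test calls can be collapsed into one step by splitting the temperatures by station. The sketch below shows the pattern on simulated data, because the original data set is not reproduced here: the data frame july, its sample sizes and its temperature values are invented for illustration and do not correspond to the real measurements.

```r
# Simulated stand-in for the July data (values are NOT the real measurements)
set.seed(1)
july <- data.frame(
  Station = rep(c("Helsinki Kumpula", "Juupajoki Hyytiälä", "Kittilä Pokka"),
                each = 200),
  Temperature = c(rnorm(200, mean = 18, sd = 3),
                  rnorm(200, mean = 16, sd = 3),
                  rnorm(200, mean = 13, sd = 4))
)

# Run the Shapiro-Wilk test once per station
shapiro_by_station <- lapply(split(july$Temperature, july$Station), shapiro.test)

# Collect W and the p-value for each station into one table
shapiro_table <- data.frame(
  Station = names(shapiro_by_station),
  W = sapply(shapiro_by_station, function(res) unname(res$statistic)),
  p_value = sapply(shapiro_by_station, function(res) res$p.value),
  row.names = NULL
)

shapiro_table
```

With the real data, july would be replaced by the filtered TemperatureComparison_1_ object, and the resulting table would hold the three W statistics and p-values reported above.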
Choosing a statistical test
A Kruskal-Wallis test is suitable to investigate whether temperatures differ across the three weather stations.
- The data meets the assumptions of a Kruskal-Wallis test because:
- I am testing differences between more than two groups
- The data is continuous
- The null hypothesis being tested is that each group has the same median
- The data is not normally distributed (Dytham, 2011)
Performing the Kruskal-Wallis test
kruskal_result <- kruskal.test(Temperature ~ Station, data = TemperatureComparison_1_)
print(kruskal_result)
##
## Kruskal-Wallis rank sum test
##
## data: Temperature by Station
## Kruskal-Wallis chi-squared = 278.26, df = 2, p-value < 2.2e-16
- The Kruskal-Wallis test shows that there is at least one significant difference between the temperatures at the 3 weather stations.
- The p-value at 2 degrees of freedom is less than 2.2 × 10⁻¹⁶, so I can firmly reject the null hypothesis in this case.
- The chi-squared value is equal to 278.26, indicating that the observed differences are unlikely to be due to chance.
Since a significant effect was found I performed Dunn’s Post-hoc test to assess specific differences between the three weather stations.
#performing Dunn’s Post-hoc test
library(dunn.test)
dunn_result <- dunn.test(TemperatureComparison_1_$Temperature, TemperatureComparison_1_$Station, method = "bonferroni")
## Kruskal-Wallis rank sum test
##
## data: x and group
## Kruskal-Wallis chi-squared = 278.26, df = 2, p-value = 0
##
##
## Comparison of x by group
## (Bonferroni)
## Col Mean-|
## Row Mean | Helsinki Juupajok
## ---------+----------------------
## Juupajok | 11.06154
## | 0.0000*
## |
## Kittilä | 16.34143 5.257851
## | 0.0000* 0.0000*
##
## alpha = 0.05
## Reject Ho if p <= alpha/2
print(dunn_result)
## $chi2
## [1] 278.26
##
## $Z
## [1] 11.061544 16.341434 5.257851
##
## $P
## [1] 9.637511e-29 2.503082e-60 7.287413e-08
##
## $P.adjusted
## [1] 2.891253e-28 7.509245e-60 2.186224e-07
##
## $comparisons
## [1] "Helsinki Kumpula - Juupajoki Hyytiälä"
## [2] "Helsinki Kumpula - Kittilä Pokka"
## [3] "Juupajoki Hyytiälä - Kittilä Pokka"
#visualising post-hoc test
ggplot(TemperatureComparison_1_, aes(x = Station, y = Temperature, fill = Station)) +
geom_boxplot(alpha = 0.7) + # Adjust alpha for transparency
scale_fill_manual(values = wes_palette("Darjeeling1", n = 3, type = "discrete"))
Figure 6: Boxplot showing the distribution of July temperatures (median and interquartile range) at three weather stations: Helsinki Kumpula (red), Juupajoki Hyytiälä (green), and Kittilä Pokka (yellow)
Dunn’s test evaluation
- Helsinki Kumpula vs Juupajoki Hyytiälä has a z-value of 11.062, indicating the mean rank of temperatures in Helsinki Kumpula is much higher than in Juupajoki Hyytiälä.
- Similarly, the z-value of Helsinki Kumpula vs Kittilä Pokka is 16.341, showing the mean rank of temperatures in Helsinki Kumpula is even higher when compared to those in Kittilä Pokka.
- Juupajoki Hyytiälä vs Kittilä Pokka gives a z-value of 5.258; Juupajoki Hyytiälä is significantly warmer than Kittilä Pokka. The difference is less distinct than that between Helsinki and the two other weather stations.
- Therefore Kittilä Pokka is the weather station with the lowest temperatures (lowest mean rank).
- All three comparisons are significant at the 5% level after Bonferroni correction (all adjusted p-values are below 0.001).
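The adjusted p-values in the Dunn's test output can be reproduced from the raw ones. With three pairwise comparisons, the Bonferroni correction multiplies each raw p-value by 3 (capped at 1). The sketch below uses base R's p.adjust on the $P values printed above; the vector names are illustrative.

```r
# Raw p-values copied from the $P element of the Dunn's test output
raw_p <- c(
  "Helsinki Kumpula - Juupajoki Hyytiälä" = 9.637511e-29,
  "Helsinki Kumpula - Kittilä Pokka" = 2.503082e-60,
  "Juupajoki Hyytiälä - Kittilä Pokka" = 7.287413e-08
)

# Bonferroni correction: multiply by the number of comparisons, cap at 1
adjusted_p <- p.adjust(raw_p, method = "bonferroni")

# Which comparisons remain significant at the 5% level?
significant <- adjusted_p < 0.05

data.frame(raw_p, adjusted_p, significant)
```

The adjusted values should match the $P.adjusted element of the output, and all three comparisons remain significant.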
Discussion
- Since the p-value was less than 0.05, the null hypothesis is rejected and I can accept the alternative hypothesis that there is a significant difference between the temperature in July at the 3 weather stations.
- This difference could be explained by the latitude of the 3 weather stations: the ranking of stations from warmest to coldest matches their order from south to north.
Table 3: The latitude of the 3 weather stations may explain the results of Dunn’s post-hoc test
Weather station | Latitude |
---|---|
Helsinki Kumpula | 60.20456 |
Juupajoki Hyytiälä | 61.84534 |
Kittilä Pokka | 68.15895 |
References
Dytham, C. (2011) Choosing and Using Statistics: A Biologist’s Guide. 3rd edn., pp. 199–210.
Kim, T.K. and Park, J.H. (2019) ‘More about the basic assumptions of t-test: normality and sample size’, Korean Journal of Anesthesiology, 72(4), pp. 331–335.