Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see https://quarto.org.
When you click the Render button a document will be generated that includes both content and the output of embedded code. You can embed code like this:
1 + 1
[1] 2
You can add options to executable code like this:
[1] 4
The echo: false option disables the printing of code (only output is displayed).
Before we do any sort of data work, let's make sure the packages we need are loaded:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(psych)
Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':
%+%, alpha
Next we load the data set. Once it is loaded, we can use view() to open the entire data set in a separate tab:
library(readxl)
lynxd <- read_excel("~/Library/CloudStorage/OneDrive-NottinghamTrentUniversity/Lynx Dataset (Formative).xlsx")
view(lynxd)

Question: What trends and comparisons can be seen in lynx populations across the 19th and 20th century?
Background: Lynx (a genus containing 4 distinct species) are medium-sized wild cats found in forested terrain across Europe, North America and Asia. This study looked at population changes by sampling population sizes across 70 sites in the 19th and 20th centuries, to monitor growth and/or decline in these regions.
Understanding the data: This data set has 3 variables: “id”, the study site ID; “lynx”, the total number of lynx captured at that site; and “century”, the century in which the lynx were captured. id and century are categorical, as their values fall into a fixed set of categories (although century is stored as a number in the file), whereas lynx is a discrete numerical variable that takes whole-number counts. Each variable has a total of 70 values.
We can use the glimpse() command to get a quick overview of the data set:
glimpse(lynxd)
Rows: 70
Columns: 3
$ id <chr> "A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8", "A9", "A10", "…
$ lynx <dbl> 3311, 6721, 4254, 687, 255, 473, 358, 784, 1594, 1676, 2251, 1…
$ century <dbl> 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19…
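The glimpse() output shows century stored as <dbl>, even though it is treated as a categorical variable. A minimal sketch of converting it to a factor once, up front, is given here. The tibble toy_lynx and its values are made up for illustration and are not taken from the real data set.

```r
library(dplyr)

# Hypothetical stand-in for lynxd: same column names, invented values
toy_lynx <- tibble(
  id      = c("A1", "A2", "A3", "A4"),
  lynx    = c(120, 340, 560, 780),
  century = c(19, 19, 20, 20)
)

# Store century as a factor so R treats it as a grouping variable
toy_lynx <- toy_lynx %>%
  mutate(century = factor(century))

glimpse(toy_lynx)
levels(toy_lynx$century)
```

With century stored as a factor, plotting and testing functions treat it as two groups without needing factor() or as.factor() inside each call.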
Using describe (found within the psych package) we can quickly get some descriptive stats on a chosen variable. Since lynx is the main variable we are investigating, we can use this command like so:
describe(lynxd$lynx)
   vars  n    mean     sd median trimmed     mad min  max range skew kurtosis     se
X1    1 70 1668.07 1689.3    904 1423.89 1087.49  39 6991  6952 1.24     1.03 201.91
We can now see useful information on the mean, median, range, minimum, maximum and so on for all values within lynx.
Since we want to compare the 19th and 20th century data, it’ll be useful to separate the values into two new columns.
One way we can attempt this is with the mutate() command:
lynxdnew <- lynxd %>%
  mutate(
    cent19 = ifelse(century == 19, lynx, NA),
    cent20 = ifelse(century == 20, lynx, NA)
  )
view(lynxdnew)
describe(lynxdnew)
        vars  n    mean      sd median trimmed     mad min  max range skew kurtosis     se
id*        1 70   35.50   20.35   35.5   35.50   25.95   1   70    69 0.00    -1.25   2.43
lynx       2 70 1668.07 1689.30  904.0 1423.89 1087.49  39 6991  6952 1.24     1.03 201.91
century    3 70   19.50    0.50   19.5   19.50    0.74  19   20     1 0.00    -2.03   0.06
cent19     4 35 1403.20 1623.82  687.0 1157.21  896.97  39 6721  6682 1.41     1.40 274.48
cent20     5 35 1932.94 1734.99 1388.0 1733.79 1547.83  80 6991  6911 1.07     0.68 293.27
We can use this mutated version to compare the 19th and 20th century captures more easily. Opening the tibble in a new window makes it easier to read.
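As an alternative to adding two mostly empty columns, the per-century statistics can also be computed by grouping. This is only a sketch of that approach: the tibble toy_lynx, its values and the name century_summary are invented for illustration, so the numbers do not correspond to the real data set.

```r
library(dplyr)

# Hypothetical stand-in for lynxd: same column names, invented values
toy_lynx <- tibble(
  id      = c("A1", "A2", "A3", "B1", "B2", "B3"),
  lynx    = c(100, 200, 300, 400, 500, 900),
  century = c(19, 19, 19, 20, 20, 20)
)

# One row of summary statistics per century
century_summary <- toy_lynx %>%
  group_by(century) %>%
  summarise(
    n           = n(),
    mean_lynx   = mean(lynx),
    median_lynx = median(lynx),
    sd_lynx     = sd(lynx)
  )

print(century_summary)
```

Grouping keeps the data in its original long format, which is also the format that ggplot() and the formula interface of t.test() expect.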
# Note: this standalone factor is not used by the plots below,
# which convert century themselves inside aes()
century <- as.factor(lynxd$century)

# Bar plot: one bar per site, coloured by century
ggplot(lynxd, aes(x = id, y = lynx, fill = factor(century))) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("19" = "light pink", "20" = "light blue")) +
  labs(x = "Site ID", y = "Lynx Count", fill = "Century") +
  theme_minimal()

# Line graph: one line per century across site IDs
ggplot(lynxd, aes(x = id, y = lynx, group = century, color = as.factor(century))) +
  geom_line() +
  geom_point() +
  labs(title = "Line graph", x = "ID", y = "Lynx Count", color = "Century") +
  theme_minimal()

The bar plot and line graph show the same data in slightly different ways.
They allow us to see the count at each individual site in both centuries, with every lynx value plotted individually. Things like the minimum and maximum values are easy to read off, but other descriptive statistics such as the mean are harder to spot without the aid of additional graphs or analysis.
ggplot(lynxd, aes(x = factor(century), y = lynx, fill = factor(century))) +
  geom_boxplot() +
  scale_fill_manual(values = c("19" = "light pink", "20" = "light blue")) +
  labs(title = "Lynx Captures Across Centuries", x = "Century", y = "Number of Lynx Captures", fill = "Century") +
  theme_minimal()

This box plot helps us compare the number of lynx caught in the 19th and 20th centuries by their spread and central tendency. Note that the line inside each box is the median, not the mean. The 20th century has the larger range (6911 versus 6682), so counts varied more from site to site, while its higher median (1388 versus 687) and mean (1932.94 versus 1403.20) suggest that more lynx were captured overall. The 19th century's lower averages show that the typical count was smaller, and its slightly narrower range could indicate that counts were more consistent across sites.
ggplot(lynxd, aes(x = lynx, fill = factor(century))) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = c("19" = "light pink", "20" = "light blue")) +
  labs(title = "Lynx Captures by Century",
       x = "Number of Lynx Captures",
       y = "Density",
       fill = "Century") +
  theme_minimal()

t-test: This is a two-sample t-test comparing the lynx counts from the 19th and 20th centuries. The t-test was chosen because it is designed to compare two groups, using their means to evaluate whether there is a significant difference. The two groups (captures in the 19th and in the 20th century) are treated as independent of one another, which is an assumption of this test. R's t.test() runs the Welch version by default, which does not assume equal variances in the two groups.
t.test(lynx ~ century, data = lynxd)
Welch Two Sample t-test
data: lynx by century
t = -1.3188, df = 67.704, p-value = 0.1917
alternative hypothesis: true difference in means between group 19 and group 20 is not equal to 0
95 percent confidence interval:
-1331.3348 271.8491
sample estimates:
mean in group 19 mean in group 20
1403.200 1932.943
The p-value (0.1917) is above the 0.05 significance level, so the difference between the two means is not statistically significant; the 95% confidence interval for the difference also includes 0. The observed difference could plausibly be due to random chance rather than a real change in population.
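To make the t-test output less of a black box, the Welch statistic can be recomputed by hand from the per-century means, standard deviations and sample sizes reported by describe() earlier. This is a sketch using those rounded values, so the results should agree with the t.test() output only up to rounding. The variable names are chosen here for illustration.

```r
# Summary statistics copied from the describe() output (rounded to 2 d.p.)
m19 <- 1403.20; s19 <- 1623.82; n19 <- 35
m20 <- 1932.94; s20 <- 1734.99; n20 <- 35

# Squared standard error of each group mean
v19 <- s19^2 / n19
v20 <- s20^2 / n20

# Welch t statistic: difference in means over its standard error
se_diff <- sqrt(v19 + v20)
t_stat  <- (m19 - m20) / se_diff

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v19 + v20)^2 / (v19^2 / (n19 - 1) + v20^2 / (n20 - 1))

# Two-sided p-value and 95% confidence interval for the difference
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
ci      <- (m19 - m20) + c(-1, 1) * qt(0.975, df = df_welch) * se_diff

print(round(c(t = t_stat, df = df_welch, p = p_value), 4))
print(round(ci, 1))
```

Seeing the calculation laid out shows why the result is not significant: the difference in means (about 530 lynx) is only around 1.3 times its standard error (about 400).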
The Wilcoxon rank-sum test compares the distributions of two independent groups without assuming normality, which suits the right-skewed lynx counts (skew = 1.24):
wilcox.test(lynx ~ century, data = lynxd, exact = FALSE)
Wilcoxon rank sum test with continuity correction
data: lynx by century
W = 465.5, p-value = 0.08528
alternative hypothesis: true location shift is not equal to 0
This test gave p = 0.08528, which is also above 0.05, so again we find no statistically significant difference between the two centuries.
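For readers unfamiliar with where the W statistic comes from, the following sketch computes it by hand on a tiny example and checks it against wilcox.test(). The vectors x and y are invented for illustration and are not lynx data.

```r
# Two small made-up samples with no tied values
x <- c(1, 3, 5)
y <- c(2, 4, 6, 8)

# Rank all observations together, then sum the ranks belonging to x
all_ranks  <- rank(c(x, y))
rank_sum_x <- sum(all_ranks[seq_along(x)])

# W is the rank sum of the first group minus its smallest possible value
n_x      <- length(x)
w_manual <- rank_sum_x - n_x * (n_x + 1) / 2

# The same statistic as reported by R
w_builtin <- unname(wilcox.test(x, y)$statistic)

print(c(manual = w_manual, builtin = w_builtin))
```

Because the test uses only ranks, a few very large counts (such as sites with over 6000 lynx) influence it much less than they influence the t-test.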
We can conclude that there is no strong evidence that lynx populations changed significantly between the 19th and 20th centuries. The study could be improved by increasing the number of samples, either by sampling more areas or by sampling the same areas repeatedly over time, which would give a larger and more representative sample.
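As a rough guide to how many more samples might be needed, base R's power.t.test() can estimate the sample size per group. This is a sketch under strong assumptions that are not part of the original analysis: that the observed difference in means is the true effect, that both groups share the pooled standard deviation, that the data are roughly normal, and that 80% power is the target.

```r
# Values taken from the describe() output earlier in the document (rounded)
observed_diff <- 1932.94 - 1403.20
pooled_sd     <- sqrt((1623.82^2 + 1734.99^2) / 2)

# Assumption: the observed difference is the true effect and 80% power is wanted
pw <- power.t.test(
  delta     = observed_diff,
  sd        = pooled_sd,
  sig.level = 0.05,
  power     = 0.80
)

# Round up to whole sites per century
n_per_group <- ceiling(pw$n)
print(n_per_group)
```

Under these assumptions the required number of sites per century is considerably larger than the 35 available, which is consistent with the non-significant results above.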