Shared Workbook
Group Work (Lynx)
Question: What trends and comparisons can be seen in lynx populations across the 19th and 20th centuries?
Background: Lynx (a genus containing 4 distinct species) are medium-sized wild cats found in forest terrain across Europe, North America and Asia. This study looked at population changes by sampling population sizes across 70 sites within the 19th and 20th centuries, to monitor their growth and/or decline in these regions.
Understanding the data: This data set has 3 variables: “id”, giving the study site ID; “lynx”, the total number of lynx captured in that area; and “century”, the century in which the lynx were captured. ID and century are categorical, as their values fall into a defined set of categories, whereas lynx is a discrete numerical variable, taking whole-number counts. Each variable has a total of 70 values.
Load data
Change the path below to match your own file structure
Lynx_Dataset_Formative_ <- read.delim("~/Downloads/linx")
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::%||%() masks base::%||%()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#install.packages("psych")
library(psych)
Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':
    %+%, alpha
describe(Lynx_Dataset_Formative_) # note: this describes the whole dataset; it doesn't compare the 19th and 20th centuries at all
        vars  n    mean      sd median trimmed     mad min  max range skew kurtosis     se
id*        1 70   35.50   20.35   35.5   35.50   25.95   1   70    69 0.00    -1.25   2.43
lynx       2 70 1668.07 1689.30  904.0 1423.89 1087.49  39 6991  6952 1.24     1.03 201.91
century    3 70   19.50    0.50   19.5   19.50    0.74  19   20     1 0.00    -2.03   0.06
glimpse(Lynx_Dataset_Formative_)
Rows: 70
Columns: 3
$ id <chr> "A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8", "A9", "A10", "…
$ lynx <int> 3311, 6721, 4254, 687, 255, 473, 358, 784, 1594, 1676, 2251, 1…
$ century <int> 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19…
The table below shows a descriptive summary of all the values in the lynx variable. From this we can gather basic information about the lynx populations:
describe(Lynx_Dataset_Formative_$lynx)
   vars  n    mean     sd median trimmed     mad min  max range skew kurtosis     se
X1    1 70 1668.07 1689.3    904 1423.89 1087.49  39 6991  6952 1.24     1.03 201.91
To view this data more clearly, it seems sensible to separate the 19th and 20th century counts into two columns. This will allow us to better summarise and describe the data, compare values and spot trends.
Lynx_Dataset_Formative_ %>%
  mutate(
    cent19 = ifelse(century == 19, lynx, NA),
    cent20 = ifelse(century == 20, lynx, NA)
  )
   id lynx century cent19 cent20
1 A1 3311 19 3311 NA
2 A2 6721 19 6721 NA
3 A3 4254 19 4254 NA
4 A4 687 19 687 NA
5 A5 255 19 255 NA
6 A6 473 19 473 NA
7 A7 358 19 358 NA
8 A8 784 19 784 NA
9 A9 1594 19 1594 NA
10 A10 1676 19 1676 NA
11 A11 2251 19 2251 NA
12 A12 1426 19 1426 NA
13 A13 756 19 756 NA
14 A14 299 19 299 NA
15 A15 201 19 201 NA
16 A16 229 19 229 NA
17 A17 469 19 469 NA
18 A18 736 19 736 NA
19 A19 2042 19 2042 NA
20 A20 2811 19 2811 NA
21 A21 4431 19 4431 NA
22 A22 2511 19 2511 NA
23 A23 389 19 389 NA
24 A24 73 19 73 NA
25 A25 39 19 39 NA
26 A26 49 19 49 NA
27 A27 59 19 59 NA
28 A28 188 19 188 NA
29 A29 377 19 377 NA
30 A30 1292 19 1292 NA
31 A31 4031 19 4031 NA
32 A32 3495 19 3495 NA
33 A33 587 19 587 NA
34 A34 105 19 105 NA
35 A35 153 19 153 NA
36 B1 387 20 NA 387
37 B2 758 20 NA 758
38 B3 1307 20 NA 1307
39 B4 3465 20 NA 3465
40 B5 6991 20 NA 6991
41 B6 6313 20 NA 6313
42 B7 3794 20 NA 3794
43 B8 1836 20 NA 1836
44 B9 345 20 NA 345
45 B10 382 20 NA 382
46 B11 808 20 NA 808
47 B12 1388 20 NA 1388
48 B13 2713 20 NA 2713
49 B14 3800 20 NA 3800
50 B15 3091 20 NA 3091
51 B16 2985 20 NA 2985
52 B17 3790 20 NA 3790
53 B18 674 20 NA 674
54 B19 81 20 NA 81
55 B20 80 20 NA 80
56 B21 108 20 NA 108
57 B22 229 20 NA 229
58 B23 399 20 NA 399
59 B24 1132 20 NA 1132
60 B25 2432 20 NA 2432
61 B26 3574 20 NA 3574
62 B27 2935 20 NA 2935
63 B28 1537 20 NA 1537
64 B29 529 20 NA 529
65 B30 485 20 NA 485
66 B31 662 20 NA 662
67 B32 1000 20 NA 1000
68 B33 1590 20 NA 1590
69 B34 2657 20 NA 2657
70 B35 3396 20 NA 3396
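With the counts split by century, each group can be summarised on its own. The sketch below uses base R only and re-enters the counts from the table above so that it runs by itself. The names `lynx_19`, `lynx_20`, `summarise_counts` and `century_summary` are illustrative and are not part of the original workbook.

```r
# Counts copied from the table above (35 sites per century).
lynx_19 <- c(3311, 6721, 4254, 687, 255, 473, 358, 784, 1594, 1676, 2251, 1426,
             756, 299, 201, 229, 469, 736, 2042, 2811, 4431, 2511, 389, 73,
             39, 49, 59, 188, 377, 1292, 4031, 3495, 587, 105, 153)
lynx_20 <- c(387, 758, 1307, 3465, 6991, 6313, 3794, 1836, 345, 382, 808, 1388,
             2713, 3800, 3091, 2985, 3790, 674, 81, 80, 108, 229, 399, 1132,
             2432, 3574, 2935, 1537, 529, 485, 662, 1000, 1590, 2657, 3396)

# Hypothetical helper: basic descriptive statistics for one vector of counts.
summarise_counts <- function(x) {
  c(n = length(x), mean = mean(x), median = median(x), sd = sd(x),
    min = min(x), max = max(x))
}

# One row per century, so the two groups can be compared side by side.
century_summary <- rbind("19" = summarise_counts(lynx_19),
                         "20" = summarise_counts(lynx_20))
print(round(century_summary, 2))
```

A similar table could be produced from the original data frame with dplyr's `group_by()` and `summarise()`.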
Visualizing the data
Boxplot
Box plot, first attempt (the colour has not been added yet)
ggplot(Lynx_Dataset_Formative_,
       aes(x = factor(century), y = lynx)) +
  geom_boxplot() +
  labs(title = "Lynx Captures Across Centuries", x = "Century", y = "Number of Lynx Captures") +
  theme_minimal()
Density graph
ggplot(Lynx_Dataset_Formative_, aes(x = lynx, fill = factor(century))) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = c("19" = "cyan", "20" = "red")) +
  labs(title = "Lynx Captures by Century",
       x = "Number of Lynx Captures",
       y = "Density",
       fill = "Century") +
  theme_minimal()
Line graph
ggplot(Lynx_Dataset_Formative_, aes(x = id, y = lynx, group = century, color = as.factor(century))) +
  geom_line() +
  geom_point() +
  labs(title = "Line graph", x = "ID", y = "Lynx Count", color = "Century") +
  theme_minimal()
Bar graph
# Convert century to factor if it's not already
Lynx_Dataset_Formative_$century <- as.factor(Lynx_Dataset_Formative_$century)
ggplot(Lynx_Dataset_Formative_, aes(x = id, y = lynx, fill = century)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("19" = "lightblue", "20" = "lightgreen")) +
  labs(x = "ID", y = "Lynx Count", fill = "Century") +
  theme_minimal()
# Display the first few rows and a summary of the dataset
head(Lynx_Dataset_Formative_)
  id lynx century
1 A1 3311 19
2 A2 6721 19
3 A3 4254 19
4 A4 687 19
5 A5 255 19
6 A6 473 19
summary(Lynx_Dataset_Formative_)
      id                 lynx         century
 Length:70          Min.   :  39.0   19:35
 Class :character   1st Qu.: 378.2   20:35
 Mode  :character   Median : 904.0
                    Mean   :1668.1
                    3rd Qu.:2786.5
                    Max.   :6991.0
# Check column names
names(Lynx_Dataset_Formative_)
[1] "id" "lynx" "century"
# Load ggplot2 library for plotting
library(ggplot2)
# Create a scatter plot of lynx population by century
ggplot(Lynx_Dataset_Formative_, aes(x = century, y = lynx)) +
  geom_point(size = 3, color = "blue") +
  labs(
    title = "Lynx Population by Century",
    x = "Century",
    y = "Lynx Population"
  ) +
  theme_minimal()
Stats testing
t-test: This is a two-sample t-test comparing the lynx capture counts from the 19th and 20th centuries. The t-test was chosen because it is designed to compare two groups, using their means to evaluate whether there is a significant difference. The two groups we are comparing (captures in the 19th and in the 20th century) are independent of one another, which is an assumption of the two-sample t-test.
t.test(lynx ~ century, data = Lynx_Dataset_Formative_)
Welch Two Sample t-test
data: lynx by century
t = -1.3188, df = 67.704, p-value = 0.1917
alternative hypothesis: true difference in means between group 19 and group 20 is not equal to 0
95 percent confidence interval:
-1331.3348 271.8491
sample estimates:
mean in group 19 mean in group 20
1403.200 1932.943
The p-value (0.1917) is above the 0.05 significance level, so we cannot reject the null hypothesis that the two centuries have equal means. The observed difference is therefore not statistically significant and could plausibly be due to random chance rather than a real change in lynx numbers.
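To make the output above less of a black box, the Welch statistic, its degrees of freedom and the p-value can be rebuilt from the group means and variances. This is a minimal sketch in base R that re-enters the counts from the data table so that it runs by itself. The variable names are illustrative, and the result should agree with the `t.test()` output reported above.

```r
# Counts copied from the data table (35 sites per century).
lynx_19 <- c(3311, 6721, 4254, 687, 255, 473, 358, 784, 1594, 1676, 2251, 1426,
             756, 299, 201, 229, 469, 736, 2042, 2811, 4431, 2511, 389, 73,
             39, 49, 59, 188, 377, 1292, 4031, 3495, 587, 105, 153)
lynx_20 <- c(387, 758, 1307, 3465, 6991, 6313, 3794, 1836, 345, 382, 808, 1388,
             2713, 3800, 3091, 2985, 3790, 674, 81, 80, 108, 229, 399, 1132,
             2432, 3574, 2935, 1537, 529, 485, 662, 1000, 1590, 2657, 3396)

n_19 <- length(lynx_19)
n_20 <- length(lynx_20)

# Squared standard error of each group mean (variance / sample size).
se2_19 <- var(lynx_19) / n_19
se2_20 <- var(lynx_20) / n_20

# Welch t statistic: difference in means over the combined standard error.
t_stat <- (mean(lynx_19) - mean(lynx_20)) / sqrt(se2_19 + se2_20)

# Welch-Satterthwaite approximation to the degrees of freedom.
welch_df <- (se2_19 + se2_20)^2 /
  (se2_19^2 / (n_19 - 1) + se2_20^2 / (n_20 - 1))

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = welch_df)

print(round(c(t = t_stat, df = welch_df, p = p_value), 4))

# Cross-check against R's built-in test (Welch is the default).
builtin <- t.test(lynx_19, lynx_20)
```

Because the statistic depends only on the two means, variances and sample sizes, a large spread within each century relative to the gap between the means keeps t small.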
# Perform Wilcoxon Rank-Sum Test to compare lynx populations between centuries
wilcox.test(lynx ~ century, data = Lynx_Dataset_Formative_, exact = FALSE)
Wilcoxon rank sum test with continuity correction
data: lynx by century
W = 465.5, p-value = 0.08528
alternative hypothesis: true location shift is not equal to 0
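The W statistic reported above is based on ranks rather than raw counts, which makes it less sensitive to the few very large captures. The sketch below is an illustration in base R with illustrative variable names. It pools the counts copied from the data table, ranks them, and rebuilds W for the 19th century group, then compares the result with `wilcox.test()`.

```r
# Counts copied from the data table (35 sites per century).
lynx_19 <- c(3311, 6721, 4254, 687, 255, 473, 358, 784, 1594, 1676, 2251, 1426,
             756, 299, 201, 229, 469, 736, 2042, 2811, 4431, 2511, 389, 73,
             39, 49, 59, 188, 377, 1292, 4031, 3495, 587, 105, 153)
lynx_20 <- c(387, 758, 1307, 3465, 6991, 6313, 3794, 1836, 345, 382, 808, 1388,
             2713, 3800, 3091, 2985, 3790, 674, 81, 80, 108, 229, 399, 1132,
             2432, 3574, 2935, 1537, 529, 485, 662, 1000, 1590, 2657, 3396)

n_19 <- length(lynx_19)

# Rank all 70 counts together; tied values receive the average of their ranks.
pooled_ranks <- rank(c(lynx_19, lynx_20))

# W = rank sum of the first group minus its smallest possible rank sum.
w_stat <- sum(pooled_ranks[seq_len(n_19)]) - n_19 * (n_19 + 1) / 2
print(w_stat)

# Cross-check against R's built-in test, with the same settings as above.
builtin <- wilcox.test(lynx_19, lynx_20, exact = FALSE)
```

A W well below half of 35 × 35 indicates that the 19th century counts tend to rank lower than the 20th century counts.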
## Chi-squared test: see if there is an association between century and lynx counts
contingency_table <- table(Lynx_Dataset_Formative_$century, cut(Lynx_Dataset_Formative_$lynx, breaks = 5)) # Adjust breaks as needed
print(contingency_table)
     (32,1.43e+03] (1.43e+03,2.82e+03] (2.82e+03,4.21e+03] (4.21e+03,5.6e+03] (5.6e+03,7e+03]
  19            23                   6                   3                  2               1
  20            18                   6                   9                  0               2
chi_squared_result <- chisq.test(contingency_table)
Warning in chisq.test(contingency_table): Chi-squared approximation may be incorrect
print(chi_squared_result)
Pearson's Chi-squared test
data: contingency_table
X-squared = 5.9431, df = 4, p-value = 0.2034
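The warning from `chisq.test()` is usually a sign that some expected cell counts are small. As a rough check, the sketch below re-enters the contingency table printed above and inspects the expected counts. The object names `contingency_counts`, `chi_check` and `cells_below_5` are illustrative, and the threshold of 5 is the common rule of thumb rather than anything stated in the original analysis.

```r
# Contingency table re-entered from the printed output above.
contingency_counts <- matrix(
  c(23, 6, 3, 2, 1,
    18, 6, 9, 0, 2),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    century  = c("19", "20"),
    lynx_bin = c("(32,1.43e+03]", "(1.43e+03,2.82e+03]", "(2.82e+03,4.21e+03]",
                 "(4.21e+03,5.6e+03]", "(5.6e+03,7e+03]")
  )
)

# Re-run the test quietly; the warning is what we are investigating.
chi_check <- suppressWarnings(chisq.test(contingency_counts))

# Expected counts under independence of century and lynx bin.
print(round(chi_check$expected, 2))

# Number of cells whose expected count falls below the rule-of-thumb value of 5.
cells_below_5 <- sum(chi_check$expected < 5)
print(cells_below_5)
```

If several cells fall below the threshold, one option (not used in the original analysis) is to merge the sparse upper bins before testing.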
Hypothetical next steps:
We can conclude that there is no strong evidence that lynx capture numbers differed significantly between the 19th and 20th centuries.
To improve the study, the number of samples could be increased, either by sampling more areas or by sampling the same areas repeatedly over time. A larger sample would represent the populations better and make a real difference easier to detect.
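How many more samples might be needed can be estimated with a power calculation. The sketch below uses `power.t.test()` from base R. It treats the observed difference in century means and the overall standard deviation reported earlier as planning values, and it assumes an equal-variance two-sample t-test with 80% power. These are simplifying assumptions for illustration, not results from the study.

```r
# Planning values taken from the outputs reported above (assumptions for this sketch).
observed_diff <- 1932.943 - 1403.200  # difference in century means from the t-test
overall_sd    <- 1689.30              # SD of all 70 counts from describe()

# Sample size per century needed to detect a difference of this size
# with 80% power at the 5% significance level (two-sided).
power_plan <- power.t.test(delta = observed_diff,
                           sd = overall_sd,
                           sig.level = 0.05,
                           power = 0.80,
                           type = "two.sample",
                           alternative = "two.sided")

# Round up to a whole number of sites per century.
sites_per_century <- ceiling(power_plan$n)
print(sites_per_century)
```

Comparing `sites_per_century` with the 35 sites per century in this dataset gives a sense of how much additional sampling such a study would need under these assumptions.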