F1 task

F1 Group Task

Before we do any data work, let's make sure the packages we need are loaded:

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(psych)


Attaching package: 'psych'

The following objects are masked from 'package:ggplot2':

    %+%, alpha

Next, we need to load the data set. Once it is loaded, we can use view() to open the entire data set in a separate tab.

library(readxl)
lynxd <- read_excel("~/Library/CloudStorage/OneDrive-NottinghamTrentUniversity/Lynx Dataset (Formative).xlsx")

view(lynxd)

Introducing the data

Question: What trends and comparisons can be seen in lynx populations across the 19th and 20th centuries?

Background: Lynx (a genus containing 4 distinct species) are medium-sized wild cats found in forested terrain across Europe, North America and Asia. This study looked at population changes by sampling population sizes across 70 sites within the 19th and 20th centuries, to monitor their growth and/or decline in these regions.

Understanding the data: This data set has 3 variables: “id”, the study site ID; “lynx”, the total number of lynx captured at that site; and “century”, the century in which the lynx were captured. ID and century are categorical, as their values fall into a fixed set of categories (although century is stored as a number in the file), whereas lynx is a discrete numerical variable: a count that can only take whole-number values. Each variable has 70 values.

We can use the glimpse() command to get a quick overview of the data set:

glimpse(lynxd)

Rows: 70
Columns: 3
$ id      <chr> "A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8", "A9", "A10", "…
$ lynx    <dbl> 3311, 6721, 4254, 687, 255, 473, 358, 784, 1594, 1676, 2251, 1…
$ century <dbl> 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19…
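
The glimpse output shows century stored as a number (<dbl>) even though we treat it as a category. A minimal sketch of how this could be checked and converted is given here. Because the spreadsheet is not bundled with this document, the sketch uses a stand-in data frame, demo (a hypothetical name), built from R's built-in lynx series of annual Canadian trappings for 1821 to 1934. It is not the study data, so the counts it prints will differ from ours.

```r
# Stand-in data (assumption): R's built-in `lynx` time series, NOT the
# study spreadsheet. `demo` is a hypothetical name used for illustration.
years <- as.numeric(time(datasets::lynx))
demo <- data.frame(
  id      = paste0("Y", years),
  lynx    = as.numeric(datasets::lynx),
  century = ifelse(years <= 1900, 19, 20)
)

# century starts out numeric, just as in the glimpse output
class(demo$century)

# Convert it to a factor so R treats it as a category
demo$century <- factor(demo$century, levels = c(19, 20),
                       labels = c("19th", "20th"))

# Count the rows in each category
table(demo$century)
```

The same factor() call could be applied to lynxd$century, after which the factor(century) wrappers in the plotting code would no longer be needed.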

Using describe (found within the psych package) we can quickly get some descriptive stats on a chosen variable. Since lynx is the main variable we are investigating, we can use this command like so:

describe(lynxd$lynx)

   vars  n    mean     sd median trimmed     mad min  max range skew kurtosis
X1    1 70 1668.07 1689.3    904 1423.89 1087.49  39 6991  6952 1.24     1.03
       se
X1 201.91

We can now see useful information on the mean, median, range, minimum, maximum and so on for all values of lynx.

Since we want to compare the 19th and 20th century data, it will be useful to separate the values into two new columns.

One way to do this is with the mutate() command:

lynxdnew <- lynxd %>% 
  mutate(
    cent19 = ifelse(century == 19, lynx, NA),
    cent20 = ifelse(century == 20, lynx, NA)
  )
view(lynxdnew)

describe(lynxdnew)

        vars  n    mean      sd median trimmed     mad min  max range skew
id*        1 70   35.50   20.35   35.5   35.50   25.95   1   70    69 0.00
lynx       2 70 1668.07 1689.30  904.0 1423.89 1087.49  39 6991  6952 1.24
century    3 70   19.50    0.50   19.5   19.50    0.74  19   20     1 0.00
cent19     4 35 1403.20 1623.82  687.0 1157.21  896.97  39 6721  6682 1.41
cent20     5 35 1932.94 1734.99 1388.0 1733.79 1547.83  80 6991  6911 1.07
        kurtosis     se
id*        -1.25   2.43
lynx        1.03 201.91
century    -2.03   0.06
cent19      1.40 274.48
cent20      0.68 293.27

We can use this mutated version to compare the 19th and 20th century captures more easily. Opening the tibble in a new window with view() makes it easier to read.
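
An alternative to splitting the values into separate columns is to keep the data in its long form and summarise it by group. The sketch below shows one way this might look with group_by() and summarise(). It again uses a stand-in data frame, demo (a hypothetical name), built from R's built-in lynx series rather than the study spreadsheet, and the summary column names are our own choices.

```r
library(dplyr)

# Stand-in data (assumption): R's built-in `lynx` series, NOT the study data
years <- as.numeric(time(datasets::lynx))
demo <- data.frame(
  id      = paste0("Y", years),
  lynx    = as.numeric(datasets::lynx),
  century = ifelse(years <= 1900, 19, 20)
)

# One row of descriptive statistics per century
summary_by_century <- demo %>%
  group_by(century) %>%
  summarise(
    n           = n(),
    mean_lynx   = mean(lynx),
    median_lynx = median(lynx),
    sd_lynx     = sd(lynx),
    min_lynx    = min(lynx),
    max_lynx    = max(lynx)
  )

summary_by_century
```

Replacing demo with lynxd should give the same per-century figures that describe() reports for cent19 and cent20, without creating columns that are half NA.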

Visualizing the data

century <- as.factor(lynxd$century)

Bar plot

ggplot(lynxd, aes(x = id, y = lynx, fill = factor(century))) +
  geom_col() + 
  scale_fill_manual(values = c("19" = "light pink", "20" = "light blue")) +
  labs(x = "Site ID", y = "Lynx Count", fill = "Century") +
  theme_minimal()

Line graph

ggplot(lynxd, aes(x = id, y = lynx, group = century, color = as.factor(century))) +
  geom_line() +   
  geom_point() +
  labs(title = "Line graph", x = "ID", y = "Lynx Count", color = "Century") +
  theme_minimal()

The bar plot and line graph show the same data in slightly different ways.

They let us see how the counts vary from site to site in both centuries, with every lynx count plotted individually. The minimum and maximum values are easy to spot, but other descriptive statistics such as the mean are harder to judge without additional graphs or analysis.

Boxplot

ggplot(lynxd, aes(x = factor(century), y = lynx, fill = factor(century))) +
   geom_boxplot() +
   scale_fill_manual(values = c("19" = "light pink", "20" = "light blue")) +
   labs(title = "Lynx Captures Across Centuries",  x = "Century", y = "Number of Lynx Captures", fill = "Century") +
   theme_minimal()

This box plot helps us compare the number of lynx caught in the 19th and 20th centuries by showing the median and spread of each group. Note that the centre line of each box marks the median, not the mean. The 20th century has the higher median (1388 versus 687 in the describe output above) and a slightly larger range (6911 versus 6682), so typical captures were higher but varied a little more between sites. The 19th century's lower median shows that typical captures were smaller; its slightly smaller range could indicate that the populations were more stable and evenly distributed, although the difference in range is small.
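
Because a box plot draws the median rather than the mean, it can help to overlay the group means as well. The sketch below shows one way this might be done with stat_summary(). It uses a stand-in data frame, demo (a hypothetical name), built from R's built-in lynx series rather than the study spreadsheet, and the cross marker is our own choice.

```r
library(ggplot2)

# Stand-in data (assumption): R's built-in `lynx` series, NOT the study data
years <- as.numeric(time(datasets::lynx))
demo <- data.frame(
  lynx    = as.numeric(datasets::lynx),
  century = ifelse(years <= 1900, 19, 20)
)

p <- ggplot(demo, aes(x = factor(century), y = lynx, fill = factor(century))) +
  geom_boxplot() +
  # Add a cross at each group's mean; the box line itself is the median
  stat_summary(fun = mean, geom = "point", shape = 4, size = 3) +
  labs(x = "Century", y = "Number of Lynx Captures", fill = "Century") +
  theme_minimal()

p
```

With right-skewed counts like these, the mean marker would be expected to sit above the median line in each box.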

Density graph

ggplot(lynxd, aes(x = lynx, fill = factor(century))) +
   geom_density(alpha = 0.5) +
   scale_fill_manual(values = c("19" = "light pink", "20" = "light blue")) +
   labs(title = "Lynx Captures by Century",
        x = "Number of Lynx Captures",
        y = "Density",
        fill = "Century") +
   theme_minimal()

Stats testing

t-test

This is a two-sample t-test comparing the lynx captures from the 19th and 20th centuries. The t-test was chosen because it is designed to compare two groups, using their means to evaluate whether there is a significant difference. The two groups (captures in the 19th and 20th centuries) are independent of one another, which is a key assumption of this test. By default, t.test() runs Welch's version, which does not assume equal variances in the two groups.

t.test(lynx ~ century, data = lynxd)


    Welch Two Sample t-test

data:  lynx by century
t = -1.3188, df = 67.704, p-value = 0.1917
alternative hypothesis: true difference in means between group 19 and group 20 is not equal to 0
95 percent confidence interval:
 -1331.3348   271.8491
sample estimates:
mean in group 19 mean in group 20 
        1403.200         1932.943

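The p-value (0.1917) is above the significance level of 0.05, so we cannot reject the null hypothesis of equal means: the difference between the two centuries is not statistically significant. The 95% confidence interval for the difference (-1331.3 to 271.8) includes 0, which is consistent with this. The difference seen could plausibly be due to random chance rather than a real change.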
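
The t-test also works best when the values in each group are roughly normally distributed, and the skew reported by describe() suggests this is worth checking. A minimal sketch of one possible check, a Shapiro-Wilk test per century, is shown here. It uses a stand-in data frame, demo (a hypothetical name), built from R's built-in lynx series rather than the study spreadsheet, so its p-values say nothing about our data.

```r
# Stand-in data (assumption): R's built-in `lynx` series, NOT the study data
years <- as.numeric(time(datasets::lynx))
demo <- data.frame(
  lynx    = as.numeric(datasets::lynx),
  century = ifelse(years <= 1900, 19, 20)
)

# Split the counts by century and run a Shapiro-Wilk test on each group
by_century <- split(demo$lynx, demo$century)
normality  <- lapply(by_century, shapiro.test)

# Pull out the p-values; a small value suggests a departure from normality
normality_p <- sapply(normality, function(res) res$p.value)
normality_p
```

If either group looks clearly non-normal, a rank-based test is a reasonable companion to the t-test.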

Wilcoxon rank-sum test

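The Wilcoxon rank-sum test compares the distributions of two independent groups without assuming the data are normally distributed. This makes it a useful check here, as the describe output above shows the lynx counts are positively skewed (skew = 1.24).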

wilcox.test(lynx ~ century, data = lynxd, exact = FALSE)


    Wilcoxon rank sum test with continuity correction

data:  lynx by century
W = 465.5, p-value = 0.08528
alternative hypothesis: true location shift is not equal to 0

This test gave p = 0.08528, again above 0.05, so it also finds no statistically significant difference between the two groups.
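
Both tests return objects whose values can be extracted rather than read off the printed output, which makes it easier to report them side by side. The sketch below shows one way this might be done. It uses a stand-in data frame, demo (a hypothetical name), built from R's built-in lynx series rather than the study spreadsheet, so the numbers it produces are not our results.

```r
# Stand-in data (assumption): R's built-in `lynx` series, NOT the study data
years <- as.numeric(time(datasets::lynx))
demo <- data.frame(
  lynx    = as.numeric(datasets::lynx),
  century = ifelse(years <= 1900, 19, 20)
)

# Run both tests and keep the result objects
t_res <- t.test(lynx ~ century, data = demo)
w_res <- wilcox.test(lynx ~ century, data = demo, exact = FALSE)

# Collect the test statistics and p-values into one small table
results <- data.frame(
  test      = c("Welch t-test", "Wilcoxon rank-sum"),
  statistic = c(unname(t_res$statistic), unname(w_res$statistic)),
  p_value   = c(t_res$p.value, w_res$p.value)
)

results
```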

Hypothetical next steps:

We can conclude that there is no strong evidence that lynx populations changed significantly between the 19th and 20th centuries. To improve the study, the number of samples could be increased, either by sampling more areas or by sampling the same areas repeatedly over time. A larger sample would give the tests more power to detect a real difference if one exists.
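
To get a rough idea of how many sites a follow-up study might need, power.t.test() can be used. The sketch below plugs in the group means and standard deviations from the describe output above. It assumes the observed difference is the true effect, that 80% power is the target, and that the two groups are the same size; all three are our own assumptions, so the answer should be read as a ballpark figure only.

```r
# Group statistics copied from the describe() output for cent19 and cent20
mean_19 <- 1403.20
mean_20 <- 1932.94
sd_19   <- 1623.82
sd_20   <- 1734.99

# Pooled standard deviation for two equal-sized groups
pooled_sd <- sqrt((sd_19^2 + sd_20^2) / 2)

# Sites needed per century for 80% power at the 0.05 level (assumed targets)
plan <- power.t.test(
  delta     = mean_20 - mean_19,
  sd        = pooled_sd,
  sig.level = 0.05,
  power     = 0.80
)

# Round up to whole sites per group
ceiling(plan$n)
```

The result is the number of sites per century, which can be compared with the 35 per century in the current data set.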