ARES40011!Formative_1

Author

Erica Bass, Sophie Packer, Jacob Walton, Martin Cooper

For all analysis - load libraries:

library(tidyverse) # loads the tidyverse packages, including ggplot2 and dplyr, needed for manipulating data and producing graphs.
Warning: package 'tidyverse' was built under R version 4.1.3
Warning: package 'tibble' was built under R version 4.1.3
Warning: package 'forcats' was built under R version 4.1.3
-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr     1.1.4     v readr     2.1.5
v forcats   1.0.0     v stringr   1.5.1
v ggplot2   3.5.1     v tibble    3.2.1
v lubridate 1.9.3     v tidyr     1.3.1
v purrr     1.0.2     
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Remember to set the session’s working directory before attempting to import data.

Lynx

First steps

  • set working directory
  • load packages
  • import data
[1] "D:/ARES40011 Rsrch Methods & Data Analysis"
# A tibble: 70 x 3
   id     lynx century
   <chr> <int>   <int>
 1 A1     3311      19
 2 A2     6721      19
 3 A3     4254      19
 4 A4      687      19
 5 A5      255      19
 6 A6      473      19
 7 A7      358      19
 8 A8      784      19
 9 A9     1594      19
10 A10    1676      19
# i 60 more rows

Data exploration

  • categorise data - in this case we have 2 numerical variables and 1 character variable
  • summarise data
str(lynx) # Tells you what types of variables you have
'data.frame':   70 obs. of  3 variables:
 $ id     : chr  "A1" "A2" "A3" "A4" ...
 $ lynx   : int  3311 6721 4254 687 255 473 358 784 1594 1676 ...
 $ century: int  19 19 19 19 19 19 19 19 19 19 ...
summary(lynx) # Tells you basic descriptive statistics
      id                 lynx           century    
 Length:70          Min.   :  39.0   Min.   :19.0  
 Class :character   1st Qu.: 378.2   1st Qu.:19.0  
 Mode  :character   Median : 904.0   Median :19.5  
                    Mean   :1668.1   Mean   :19.5  
                    3rd Qu.:2786.5   3rd Qu.:20.0  
                    Max.   :6991.0   Max.   :20.0  
aggregate(lynx ~ century, data = lynx, mean) # tells you the mean lynx population in each century
  century     lynx
1      19 1403.200
2      20 1932.943

Data visualisation

Box plot

Next we can visualise the data, choosing the plot according to the types of variables we have. Century is categorical and lynx population is quantitative, so a box plot is a good way to visualise this data. The summary above shows that the mean lynx population differs between the two centuries, and the box plot lets you compare the medians and spread. This may form the basis of your question.

ggplot(lynx, aes(x = as.factor(century), y = lynx)) + # sets the data and maps century to the x-axis and lynx to the y-axis
  geom_boxplot(fill = "mediumorchid", color = "black") +  # draws the data as a box plot
  labs(title = "Lynx Population by Century", x = "Century", y = "Lynx Population") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) # Puts title in the centre

Line plot

You may be able to explore this data further by searching for trends. Assuming that the ID column contains observations in time you could also plot the lynx pop against the id column and see if a trend appears. You can see that the population peaks and troughs over both centuries which can be explored further and form the basis of your question.

library(ggplot2)

ggplot(lynx, aes(x = id, y = lynx, color = factor(century))) + # separates lines by color.
  geom_line(aes(group = factor(century)), linewidth = 1) +  # Group lines by century
  geom_point(size = 2) +                                    # Points colored by century

  labs(title = "Lynx Population by Century",   # Titles and axis labels
       x = "ID",
       y = "Lynx Count",
       color = NULL) +              
  theme_minimal() + # choose relevant theme
  theme(plot.title = element_text(hjust = 0.5)) + # puts title in centre
  scale_color_manual(values = c("19" = "blue", "20" = "darkgreen"))  # Specify colors for centuries
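
One thing to check with the line plot: id is a character variable, so ggplot2 places its values in alphabetical order (A1, A10, A2, ...) rather than in the order they appear in the file. The sketch below uses only the first ten rows shown earlier (the data frame name lynx_head is made up for illustration) and assumes the rows are already in time order. It fixes the order by turning id into a factor whose levels follow the row order.

```r
# First ten rows of the lynx data, copied from the printed tibble above.
# lynx_head is a hypothetical name used only for this illustration.
lynx_head <- data.frame(
  id   = paste0("A", 1:10),
  lynx = c(3311, 6721, 4254, 687, 255, 473, 358, 784, 1594, 1676)
)

# Alphabetical sorting puts "A10" straight after "A1"
sort(lynx_head$id)

# Assumption: rows are stored in time order, so keep that order as the factor levels
lynx_head$id_ordered <- factor(lynx_head$id, levels = unique(lynx_head$id))
levels(lynx_head$id_ordered)
```

Mapping x = id_ordered instead of x = id in aes() would then draw the points in file order.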

Mosquitos

Importing the data into R:

  • This chunk imports the data into RStudio
mosquito <- read.table("D:\\ARES40011 Rsrch Methods & Data Analysis\\mosquitos.txt", header = TRUE, dec = ".") 

#Imports the data from its download location and stores it as mosquito

Loading Packages:

library(tidyverse)
library(ggplot2)

This code chunk loads the packages needed in RStudio to reproduce the data analysis on the mosquito data.

Exploring the data:

Overview of the data:

str(mosquito) #This code gives a basic overview of the data
'data.frame':   100 obs. of  3 variables:
 $ ID  : int  1 2 3 4 5 6 7 8 9 10 ...
 $ wing: num  37.8 50.6 39.3 38.1 25.2 ...
 $ sex : chr  "f" "f" "f" "f" ...

The output above shows that the mosquito data set holds 3 variables: the mosquito number (ID), the wing span (wing) and the sex of each mosquito (sex).

Checking for NA values in the data:

colSums(is.na(mosquito)) #This code counts the missing (NA) values in each column.
  ID wing  sex 
   0    0    0 

This check shows that there are 0 NA values in every column, so we shouldn’t need to use na.omit() in our code.

Summary of the mosquito data:

summary(mosquito) #This code provides some basic data analysis on the data
       ID              wing           sex           
 Min.   :  1.00   Min.   :25.16   Length:100        
 1st Qu.: 25.75   1st Qu.:41.42   Class :character  
 Median : 50.50   Median :48.42   Mode  :character  
 Mean   : 50.50   Mean   :48.78                     
 3rd Qu.: 75.25   3rd Qu.:56.24                     
 Max.   :100.00   Max.   :69.82                     

This output shows the minimum and maximum wing length values, as well as the mean and median, which will be used to explore the data further later on.

Graphs and statistical tests for them:

:::{.callout-note}
This next section is split into several subsections and graphs, depending on the variables on the x- and y-axes. This style of analysis helps in choosing the correct statistical test and in exploring the data with questions.
:::

The style of these headings is as follows: x-axis variable - y-axis variable.

:::{.callout-tip}
### Key
- Categorical = Cat
- Numerical (quantitative) = Num
:::

Cat-Cat

mosquito %>% # This line starts every code chunk below, identifying mosquito as the data set to use
  ggplot(aes( 
    x = sex,   # specifying sex on the x-axis
    fill = sex  # colouring the bars by sex
  )) +
  geom_bar(show.legend = FALSE) + # plots a bar graph without showing the legend
  labs(x = "Sex", # labelling of the x-axis
       y = "Number of Mosquitos") # labelling of the y-axis

Number of Male and Female Mosquitoes

The bar chart above shows the number of male and female mosquitoes: there are 50 males and 50 females. Because the groups are equal in size, the chart does not raise any questions about the data. If a statistical test were to be applied to counts like these, a chi-square test would be the most applicable.
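
As a minimal sketch of that chi-square test, the counts reported above (50 females and 50 males) can be passed to chisq.test(), which by default tests whether the categories are equally frequent. The object names sex_counts and gof are made up for this illustration. With the real data frame loaded, table(mosquito$sex) would supply the same counts.

```r
# Counts taken from the bar chart description above (50 of each sex)
sex_counts <- c(f = 50, m = 50)

# Chi-square goodness-of-fit test against equal expected proportions
gof <- chisq.test(sex_counts)
gof
```

Because the observed counts match the expected counts exactly, the test statistic is 0 and the p-value is 1, which agrees with the chart raising no questions.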

Cat-Num

mosquito %>%
  ggplot(aes(
    x = sex,    #Assigning the x-axis
    y = wing,   #Assigning the y-axis
    fill = sex  #applies a colour to the graph using the sex variable
  )) +
  geom_boxplot(show.legend = FALSE) + #creates a boxplot of the above axis
  labs(x = "Sex",  #labeling of the axis
       y = "Wing Span")

Mosquito Box plot

This box plot shows the difference in wing span between the male and female mosquitoes. The graph suggests that the male mosquitoes have larger wings than the females. If a statistical test were to be performed on the data, a two-sample t-test would be the most applicable.
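
A hedged sketch of how such a t-test could be run is given below. The real mosquitos.txt file is not reproduced here, so the example uses simulated values. The data frame sim, its group means and its standard deviation are assumptions chosen purely for illustration and say nothing about the real result. With the real data, the same call would be t.test(wing ~ sex, data = mosquito).

```r
set.seed(1)  # makes the simulated example reproducible

# SIMULATED data only, NOT the real mosquito measurements
sim <- data.frame(
  sex  = rep(c("f", "m"), each = 50),
  wing = c(rnorm(50, mean = 45, sd = 8),   # hypothetical female wing spans
           rnorm(50, mean = 52, sd = 8))   # hypothetical male wing spans
)

# Two-sample t-test of wing span by sex (R uses the Welch version by default)
res <- t.test(wing ~ sex, data = sim)
res
```

The printed output gives the two group means, the t statistic, the degrees of freedom and the p-value.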

Num-Num

mosquito %>%
  ggplot(aes(
    x = wing,    #Assigning the x-axis
    fill = sex   #Draws a separate, differently coloured density curve for each sex
  )) +
  geom_density(alpha = 0.5) +  #Creating a density plot
  labs(x = "Wing Span", y = "Density")

Mosquito density plot

The density plot shows similar results to the box plot: comparing the peaks of the two curves helps to confirm that the male mosquitoes have a greater wing span than the females. If a statistical test were to be performed on this data, a t-test would again be used.

Num-Cat

mosquito %>%
  ggplot(aes(
    x = wing,  #Assigning the x-axis
    y = sex,   #Assigning the y-axis
    colour = sex  #in this plot, this only colours the points by sex
  )) +
  geom_point(show.legend = FALSE) + #Creates a scatter plot of the above data
  labs(x = "Wing Span", #labelling of the x-axis
       y = "Sex")       #labelling of the y-axis

Mosquito Scatter plot

This scatter plot extends the previous graphs of this data set. While the box plot and density plot showed that males had a greater wing span on average, the scatter plot shows that one female mosquito had the largest wing span and another had the smallest. With sex as the response and wing span as the predictor, a logistic regression model would be the most applicable statistical method.
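
A minimal sketch of such a logistic regression is shown below, again on simulated values because the real file is not reproduced here. The data frame sim and its parameters are assumptions for illustration only. In glm() the response must be a factor (or 0/1), and the model estimates the probability of the second factor level, here "m".

```r
set.seed(1)  # makes the simulated example reproducible

# SIMULATED data only, NOT the real mosquito measurements
sim <- data.frame(
  sex  = factor(rep(c("f", "m"), each = 50)),  # levels: "f" then "m"
  wing = c(rnorm(50, mean = 45, sd = 8),
           rnorm(50, mean = 52, sd = 8))
)

# Logistic regression: probability of being male as a function of wing span
fit <- glm(sex ~ wing, data = sim, family = binomial)
summary(fit)$coefficients

# Predicted probability of "m" for a hypothetical wing span of 50
p50 <- predict(fit, newdata = data.frame(wing = 50), type = "response")
p50
```

With the real data, sex would first need converting with factor(mosquito$sex), because it is stored as a character variable.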

Question about the Data

Scientific Hypothesis

  • Observations of male mosquitoes will have greater average wing spans compared with those of female mosquitoes.

Statistical Hypothesis

  • The mean wing span of male mosquitoes is greater than the mean wing span of female mosquitoes (null hypothesis: the two means are equal).

Deer Data

Loading the necessary packages

library(tidyverse) # loads the tidyverse packages, including ggplot2 and dplyr, needed for manipulating data and producing graphs.

Importing the data

Note

Remember to set the session’s working directory before attempting to import data.

deer.data <- read.table("D:\\ARES40011 Rsrch Methods & Data Analysis\\roe_sika.txt") # imports text file as a dataframe

tibble(deer.data)
# A tibble: 33 x 3
   V1       V2    V3   
   <chr>    <chr> <chr>
 1 woodland roe   sika 
 2 120      832   1082 
 3 153      1010  1212 
 4 171      1032  1548 
 5 295      1001  1301 
 6 307      947   1136 
 7 325      1006  1509 
 8 336      928   1206 
 9 422      1015  1218 
10 498      840   1260 
# i 23 more rows

Adjusting the data table

deer.data <- deer.data %>% # overwrites the data set with the adjusted version
  rename(Woodland = V1, Roe = V2, Sika = V3) %>% # renames column headers
  mutate(Row = row_number()) %>% # adds new variable of row numbers
  filter(row_number() %in% c(2:33)) # keeps rows 2 to 33, dropping row 1 (the header text read in as data)
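
An alternative worth knowing: the header row ends up inside the data because read.table() was called without header = TRUE. The sketch below writes the first few lines shown in the printed table to a temporary file (the names tmp and deer_demo are made up for this illustration) and reads them back with header = TRUE. The columns are then named from the file and the counts are read as numbers straight away.

```r
# Recreate the first lines of the file from the printed table above.
# tmp and deer_demo are hypothetical names used only in this illustration.
tmp <- tempfile(fileext = ".txt")
writeLines(c("woodland roe sika",
             "120 832 1082",
             "153 1010 1212",
             "171 1032 1548"), tmp)

# header = TRUE uses the first line as column names instead of as a data row
deer_demo <- read.table(tmp, header = TRUE)
str(deer_demo)
```

In this small subset the woodland column is read as integer. In the full file it may still be read as character if any woodland label is non-numeric.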

Exploring the data

deer.data %>%
  str() # tells you which types of variables you have
'data.frame':   32 obs. of  4 variables:
 $ Woodland: chr  "120" "153" "171" "295" ...
 $ Roe     : chr  "832" "1010" "1032" "1001" ...
 $ Sika    : chr  "1082" "1212" "1548" "1301" ...
 $ Row     : int  2 3 4 5 6 7 8 9 10 11 ...

Above we can see that there are 3 character (or string) variables, and 1 integer variable.

Exploring the data will be easier if the variables for numbers of Roe and Sika deer are converted from character to numeric.

deer.data <- deer.data %>% # creates new (altered) data set
  mutate(across(c(Roe, Sika), as.numeric)) # converts character variables to numeric

deer.data %>%
  str()
'data.frame':   32 obs. of  4 variables:
 $ Woodland: chr  "120" "153" "171" "295" ...
 $ Roe     : num  832 1010 1032 1001 947 ...
 $ Sika    : num  1082 1212 1548 1301 1136 ...
 $ Row     : int  2 3 4 5 6 7 8 9 10 11 ...

Now a statistical summary of the numbers of Roe and Sika deer can be produced.

deer.data %>%
  summary()
   Woodland              Roe              Sika           Row       
 Length:32          Min.   : 701.0   Min.   : 841   Min.   : 2.00  
 Class :character   1st Qu.: 840.0   1st Qu.:1076   1st Qu.: 9.75  
 Mode  :character   Median : 916.0   Median :1210   Median :17.50  
                    Mean   : 905.5   Mean   :1203   Mean   :17.50  
                    3rd Qu.:1002.2   3rd Qu.:1303   3rd Qu.:25.25  
                    Max.   :1062.0   Max.   :1593   Max.   :33.00  

Analysing the data

As both of the variables of interest are quantitative, I would use a linear regression to analyse the data. This would estimate how the number of Sika deer changes with the number of Roe deer at a site, and how much of the variation that relationship explains.

A scatter plot, showing a regression line with its confidence band, can be produced to visualise this.

ggplot(deer.data, aes(x = Roe,
                      y = Sika)) + # determines position for each variable
  geom_point() + # produces scatter plot
  geom_smooth(method = "lm", # adds regression line
              se = TRUE) + # adds a confidence band (95% by default) around the regression line
  labs(x = "Number of Roe Deer",
       y = "Number of Sika Deer",
       caption = "Figure 1. A comparison between the number of Sika and Roe deer across 32 woodland sites.")
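
The regression itself can be fitted with lm(). The sketch below uses only the nine sites visible in the printed table earlier (the name deer_head is made up for this illustration), so its numbers will differ from a fit to all 32 sites. With the full data the call would be lm(Sika ~ Roe, data = deer.data).

```r
# First nine sites only, copied from the printed table above.
# deer_head is a hypothetical name used only for this illustration.
deer_head <- data.frame(
  Roe  = c(832, 1010, 1032, 1001, 947, 1006, 928, 1015, 840),
  Sika = c(1082, 1212, 1548, 1301, 1136, 1509, 1206, 1218, 1260)
)

# Linear regression of Sika numbers on Roe numbers
fit <- lm(Sika ~ Roe, data = deer_head)

coef(fit)               # intercept and slope of the fitted line
summary(fit)$r.squared  # proportion of variation in Sika explained by Roe
```

The slope gives the estimated change in Sika numbers per additional Roe deer, and R-squared indicates how strong the linear relationship is.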

Asking questions

Example statistical hypotheses:

  • There is a higher abundance of Sika deer in woodland habitats with large Roe deer populations.
  • Woodland habitats contain larger numbers of Sika deer than Roe deer.

Example scientific hypothesis:

  • Greater food availability in larger woodlands increases the total abundance of Roe and Sika deer.

Further information

The following additional information would be useful for exploring the data further:

  • Size of each woodland
  • Age / sex distributions within species populations
  • Sampling at different time points, e.g. seasons

Prey_Predator

prey.pred <- read.table("D:\\ARES40011 Rsrch Methods & Data Analysis\\prey_predator.txt", header=TRUE) # reads the text file into a data frame, using the first row as column names

tibble(prey.pred) #This is to view the data in the document
# A tibble: 32 x 3
   area   prey predator
   <chr> <int>    <int>
 1 120     832      793
 2 153    1010      752
 3 171     768      913
 4 295     971     1005
 5 307     947      803
 6 325    1006      886
 7 336     928      738
 8 422     885      914
 9 498     669      886
10 542     745      912
# i 22 more rows

Next, we check what our data contains:

glimpse(prey.pred) #This gives us an overview of the dataset and allows us to decide what we want to do next
Rows: 32
Columns: 3
$ area     <chr> "120", "153", "171", "295", "307", "325", "336", "422", "498"~
$ prey     <int> 832, 1010, 768, 971, 947, 1006, 928, 885, 669, 745, 776, 866,~
$ predator <int> 793, 752, 913, 1005, 803, 886, 738, 914, 886, 912, 737, 849, ~

From this we can see we have 2 integer variables and 1 character variable. Great! We would like to investigate how the numbers of prey and predators differ across the different areas. For example, does the number of prey predict the number of predators? This gives us 2 quantitative variables.

For two quantitative variables we want a scatterplot.

ggplot(prey.pred,aes(y=predator, x=prey))+ #call the data and set aesthetics
  geom_point() #type of graph we want

Asking Questions

The scatter plot shows no clear relationship between prey and predator numbers, so prey abundance alone does not appear to predict predator abundance. This suggests that other factors are more important in explaining the abundance of predators.
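
A visual impression like this can be backed up with a correlation test. The sketch below uses only the ten rows visible in the printed table earlier (the name pp_head is made up for this illustration), so it shows the method rather than the result for all 32 areas. With the full data the call would be cor.test(prey.pred$prey, prey.pred$predator).

```r
# First ten areas only, copied from the printed table above.
# pp_head is a hypothetical name used only for this illustration.
pp_head <- data.frame(
  prey     = c(832, 1010, 768, 971, 947, 1006, 928, 885, 669, 745),
  predator = c(793, 752, 913, 1005, 803, 886, 738, 914, 886, 912)
)

# Pearson correlation between prey and predator numbers, with a significance test
ct <- cor.test(pp_head$prey, pp_head$predator)
ct
```

A correlation estimate close to 0 with a large p-value would support the reading that prey numbers do not predict predator numbers.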

To further investigate this data, you could look at:

  • Area of habitats

  • Quality of habitat

  • The prey preferences of the predator: the abundance of a specific preferred prey type may be a predictor even though overall prey number is not