Erica Bass, Sophie Packer, Jacob Walton, Martin Cooper
For all analysis - load libraries:
library(tidyverse) # loads the tidyverse, which includes ggplot2 and dplyr, needed for manipulating data and producing graphs.
Warning: package 'tidyverse' was built under R version 4.1.3
Warning: package 'tibble' was built under R version 4.1.3
Warning: package 'forcats' was built under R version 4.1.3
-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr 1.1.4 v readr 2.1.5
v forcats 1.0.0 v stringr 1.5.1
v ggplot2 3.5.1 v tibble 3.2.1
v lubridate 1.9.3 v tidyr 1.3.1
v purrr 1.0.2
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Remember to set the session’s working directory before attempting to import data.
Lynx
First steps
set working directory
load packages
import data
[1] "D:/ARES40011 Rsrch Methods & Data Analysis"
# A tibble: 70 x 3
id lynx century
<chr> <int> <int>
1 A1 3311 19
2 A2 6721 19
3 A3 4254 19
4 A4 687 19
5 A5 255 19
6 A6 473 19
7 A7 358 19
8 A8 784 19
9 A9 1594 19
10 A10 1676 19
# i 60 more rows
Data exploration
Categorise the data - in this case we have 2 numerical variables and 1 character variable
summary(lynx) # gives basic descriptive statistics
id lynx century
Length:70 Min. : 39.0 Min. :19.0
Class :character 1st Qu.: 378.2 1st Qu.:19.0
Mode :character Median : 904.0 Median :19.5
Mean :1668.1 Mean :19.5
3rd Qu.:2786.5 3rd Qu.:20.0
Max. :6991.0 Max. :20.0
aggregate(lynx ~ century, data = lynx, mean) # gives the mean lynx count for each century
century lynx
1 19 1403.200
2 20 1932.943
Data visualisation
Box plot
Next we can try to visualise the data depending on the types of variables we have. Looking at century and lynx population, we have one categorical and one quantitative variable, so a box plot may be the best way to visualise this data. The box plot shows that lynx counts differ between the two centuries (note that a box plot marks medians, while the means were calculated above). This may form the basis of your question.
ggplot(lynx, aes(x = as.factor(century), y = lynx)) + # sets the data, with century (as a factor) on x and lynx count on y
  geom_boxplot(fill = "mediumorchid", color = "black") + # draws the box plot
  labs(title = "Lynx Population by Century", x = "Century", y = "Lynx Population") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) # puts the title in the centre
Line plot
You may be able to explore this data further by searching for trends. Assuming that the id column records observations in time order, you could plot the lynx population against id and see if a trend appears. The population peaks and troughs repeatedly in both centuries, which can be explored further and form the basis of your question.
library(ggplot2)
ggplot(lynx, aes(x = id, y = lynx, color = factor(century))) + # colours lines and points by century
  geom_line(aes(group = factor(century)), linewidth = 1) + # draws one line per century
  geom_point(size = 2) + # points coloured by century
  labs(title = "Lynx Population by Century", # title and axis labels
       x = "ID",
       y = "Lynx Count",
       color = NULL) +
  theme_minimal() + # choose a relevant theme
  theme(plot.title = element_text(hjust = 0.5)) + # puts the title in the centre
  scale_color_manual(values = c("19" = "blue", "20" = "darkgreen")) # specifies colours for the centuries
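One caveat with the line plot: id is a character column, and ggplot2 places character values on a discrete axis in sorted order, so "A10" would be drawn before "A2". The sketch below shows one possible fix, converting id to a factor whose levels follow the order of appearance. The small data frame lynx_demo is a hypothetical stand-in built from four rows of the printed tibble, not the full data set.

```r
# Hypothetical stand-in using four rows shown in the printed tibble
lynx_demo <- data.frame(
  id = c("A1", "A2", "A9", "A10"),
  lynx = c(3311, 6721, 1594, 1676),
  stringsAsFactors = FALSE
)

# Character sorting does not follow the numeric part of the id
sorted_ids <- sort(lynx_demo$id)
print(sorted_ids)

# Fix the level order to the order in which the ids appear in the data
lynx_demo$id <- factor(lynx_demo$id, levels = unique(lynx_demo$id))
print(levels(lynx_demo$id))
```

Applying the same factor() call to lynx$id before plotting should keep the x-axis in the order of the rows, provided the rows are already in time order.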
Mosquitos
Importing the data into R:
This chunk imports the data into RStudio.
mosquito <- read.table("D:\\ARES40011 Rsrch Methods & Data Analysis\\mosquitos.txt", header = TRUE, dec = ".") # imports the data from my download folder and stores it as mosquito
Loading Packages:
library(tidyverse)
library(ggplot2)
This code chunk loads the packages needed to reproduce the analysis of the mosquito data in RStudio.
Exploring the data:
Overview of the data:
str(mosquito) #This code gives a basic overview of the data
'data.frame': 100 obs. of 3 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ wing: num 37.8 50.6 39.3 38.1 25.2 ...
$ sex : chr "f" "f" "f" "f" ...
The str() output above shows that the mosquito data set holds 100 observations of 3 variables: the mosquito number (ID), the wing span (wing) and the sex (sex).
Checking for NA values in the data:
colSums(is.na(mosquito)) # counts the gaps (NA values) in each column of the data
ID wing sex
0 0 0
This check shows that there are 0 NA values in every column, so we shouldn’t need to use na.omit() in our code during the analysis.
Summary of the mosquito data:
summary(mosquito) #This code provides some basic data analysis on the data
ID wing sex
Min. : 1.00 Min. :25.16 Length:100
1st Qu.: 25.75 1st Qu.:41.42 Class :character
Median : 50.50 Median :48.42 Mode :character
Mean : 50.50 Mean :48.78
3rd Qu.: 75.25 3rd Qu.:56.24
Max. :100.00 Max. :69.82
This code shows us the minimum and maximum wing span values, as well as the mean and median, which will be used to explore the data further later on.
Graphs and statistical tests for them:
:::{.callout-note}
This next section is split into several sections and graphs depending on the variables on the x- and y-axes. This analysis style will aid in choosing the correct statistical test and exploring the data with questions.
:::

The style of these headings is as follows: x-axis variable - y-axis variable.

:::{.callout-tip}
### Key
- Categorical = Cat
- Numerical (quantitative) = Num
:::
Cat-Cat
mosquito %>% # every plotting chunk starts by piping in the mosquito data set
  ggplot(aes(
    x = sex, # specifies sex on the x-axis
    fill = sex # specifies the bar colour by sex
  )) +
  geom_bar(show.legend = FALSE) + # plots a bar graph, without showing the legend
  labs(x = "Sex", # labels the x-axis
       y = "Number of Mosquitos") # labels the y-axis
Number of Male and Female Mosquitoes
The bar chart above shows the number of male and female mosquitoes: there are 50 males and 50 females. The results of this chart do not raise any questions about the data. If a statistical test were to be applied to this kind of data, a chi-square test would be the most applicable.
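As a minimal sketch of the chi-square test mentioned above, the counts reported for the bar chart (50 females, 50 males) can be passed to chisq.test(). With a single categorical variable this is a goodness-of-fit test against equal proportions, which is an assumption about the question being asked; with two categorical variables the same function would test independence on a contingency table.

```r
# Counts taken from the bar chart description: 50 females and 50 males
sex_counts <- c(f = 50, m = 50)

# Goodness-of-fit test; by default the null hypothesis is equal proportions
gof <- chisq.test(sex_counts)

print(gof$statistic) # observed counts equal the expected counts, so the statistic is 0
print(gof$p.value)   # and the p-value is 1
```

With the real data frame, table(mosquito$sex) would supply the same counts.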
Cat-Num
mosquito %>%
  ggplot(aes(
    x = sex, # assigns the x-axis
    y = wing, # assigns the y-axis
    fill = sex # colours the boxes using the sex variable
  )) +
  geom_boxplot(show.legend = FALSE) + # creates a box plot of the variables above
  labs(x = "Sex", # labels the axes
       y = "Wing Span")
Mosquito Box plot
This box plot shows the difference in wing span between the male and female mosquitoes. The graph suggests that the male mosquitoes have larger wings than the females. If a statistical test were to be performed on the data, a t-test would be the most applicable.
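The t-test suggested above could be run along the following lines. This is a sketch on simulated data: mosq_demo and its group means and spread are invented for illustration and are not the real mosquito measurements. The one-sided alternative is an assumption chosen to match the hypothesis that males have the greater wing span.

```r
set.seed(1) # makes the simulated data reproducible

# Simulated stand-in for the mosquito data (values are NOT the real measurements)
mosq_demo <- data.frame(
  sex = rep(c("f", "m"), each = 50),
  wing = c(rnorm(50, mean = 45, sd = 8), rnorm(50, mean = 52, sd = 8))
)

# Welch two-sample t-test (the default in t.test).
# Groups are ordered f then m, so "less" tests whether mean(f) - mean(m) < 0.
tt <- t.test(wing ~ sex, data = mosq_demo, alternative = "less")

print(tt$estimate) # group means
print(tt$p.value)
```

On the real data the call would be t.test(wing ~ sex, data = mosquito, alternative = "less"), after checking that wing span is roughly normally distributed within each sex.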
Num-Num
mosquito %>%
  ggplot(aes(
    x = wing, # assigns the x-axis
    fill = sex # draws a separate density curve for males and females
  )) +
  geom_density(alpha = 0.5) + # creates a density plot
  labs(x = "Wing Span", y = "Density")
Mosquito density plot
The density graph shows similar results to the box plot and, when comparing the peaks of the two density curves, helps confirm that the male mosquitoes have a greater wing span than the females. If a statistical test were to be performed on this data, a t-test would again be used.
Num-Cat
mosquito %>%
  ggplot(aes(
    x = wing, # assigns the x-axis
    y = sex, # assigns the y-axis
    colour = sex # in this plot, this only adds colour to the points
  )) +
  geom_point(show.legend = FALSE) + # creates a scatter plot of the variables above
  labs(x = "Wing Span", # labels the x-axis
       y = "Sex") # labels the y-axis
Mosquito Scatter plot
This scatter plot extends the previous graphs of this data set. While the previous box plot and density plot showed that males had a greater wing span on average, a female mosquito had the largest wing span and another had the smallest. For this graph, a logistic regression model would be the most applicable statistical method.
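A logistic regression of sex on wing span, as suggested above, might look like the sketch below. The data are simulated (mosq_demo is a hypothetical stand-in, not the real measurements). Note that glm() needs the response as a factor (or 0/1), and with a factor it models the probability of the second level, here "m".

```r
set.seed(1) # makes the simulated data reproducible

# Simulated stand-in for the mosquito data (values are NOT the real measurements)
mosq_demo <- data.frame(
  sex = factor(rep(c("f", "m"), each = 50)), # levels: "f" (reference), "m"
  wing = c(rnorm(50, mean = 45, sd = 8), rnorm(50, mean = 52, sd = 8))
)

# Logistic regression: probability of being male as a function of wing span
fit <- glm(sex ~ wing, data = mosq_demo, family = binomial)
print(coef(summary(fit)))

# Predicted probability of being male for a hypothetical wing span of 50
p_male <- predict(fit, newdata = data.frame(wing = 50), type = "response")
print(p_male)
```

For the real data, sex is stored as character, so it would first need converting, for example with mosquito$sex <- factor(mosquito$sex).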
Question about the Data
Scientific Hypothesis
Observations of male mosquitoes will have greater average wing spans compared with those of female mosquitoes.
Statistical Hypothesis
Male mosquitoes have a greater average wing span compared with females.
Deer Data
# Importing deer data
deer.data <- read.table("D:\\ARES40011 Rsrch Methods & Data Analysis\\roe_sika.txt") # imports the text file as a data frame
tibble(deer.data)
Above we can see that there are 3 character (or string) variables, and 1 integer variable.
Exploring the data will be easier if the variables for the numbers of Roe and Sika deer are converted from character to numeric.
deer.data <- deer.data %>% # overwrites deer.data with the altered data set
  mutate_at(c('Roe', 'Sika'), as.numeric) # converts the character variables to numeric
deer.data %>% str()
Now a statistical summary of the numbers of Roe and Sika deer can be produced.
deer.data %>% summary()
Woodland Roe Sika Row
Length:32 Min. : 701.0 Min. : 841 Min. : 2.00
Class :character 1st Qu.: 840.0 1st Qu.:1076 1st Qu.: 9.75
Mode :character Median : 916.0 Median :1210 Median :17.50
Mean : 905.5 Mean :1203 Mean :17.50
3rd Qu.:1002.2 3rd Qu.:1303 3rd Qu.:25.25
Max. :1062.0 Max. :1593 Max. :33.00
Analysing the data
As both of the variables of interest are quantitative, I would use a linear regression to analyse the data. This would show how strong the relationship is between the number of Roe and Sika deer at any given site.
A scatter plot showing a regression line and its confidence band can be produced to visualise this.
ggplot(deer.data, aes(x = Roe, y = Sika)) + # determines the position of each variable
  geom_point() + # produces the scatter plot
  geom_smooth(method = "lm", # adds a regression line
              se = TRUE) + # shades the confidence band around the regression line
  labs(x = "Number of Roe Deer",
       y = "Number of Sika Deer",
       caption = "Figure 1. A comparison between the number of Sika and Roe deer across 32 woodland sites.")
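The linear regression described above can also be fitted numerically with lm(). The sketch below uses simulated counts: deer_demo, its slope and its noise level are assumptions made only so the example runs, and they do not reproduce the real roe_sika.txt values.

```r
set.seed(1) # makes the simulated data reproducible

# Simulated stand-in for 32 woodland sites (values are NOT the real counts)
deer_demo <- data.frame(Roe = round(runif(32, min = 700, max = 1060)))
deer_demo$Sika <- round(300 + 1.0 * deer_demo$Roe + rnorm(32, sd = 120))

# Linear regression of Sika numbers on Roe numbers
deer_fit <- lm(Sika ~ Roe, data = deer_demo)

print(coef(deer_fit))                 # intercept and slope
r_squared <- summary(deer_fit)$r.squared
print(r_squared)                      # proportion of variance explained
```

With the real data the model would be lm(Sika ~ Roe, data = deer.data); the slope describes the direction and size of the relationship, and R-squared gives an idea of how strong it is.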
Asking questions
Example statistical hypotheses:
There is a higher abundance of Sika deer in woodland habitats with large Roe deer populations.
Woodland habitats contain larger numbers of Sika deer than Roe deer.
Example scientific hypothesis:
Greater food availability in larger woodlands increases the total abundance of Roe and Sika deer.
Further information
The following additional information would be useful for exploring the data further:
Size of each woodland
Age / sex distributions within species populations
Sampling at different time points, e.g. seasons
Prey_Predator
prey.pred <- read.table("D:\\ARES40011 Rsrch Methods & Data Analysis\\prey_predator.txt", header = TRUE) # reads the text file into a data frame, using the first row as column names
tibble(prey.pred) # views the data in the document
From this we can see we have 2 integer variables and 1 character variable. Great! We would like to investigate how the numbers of prey and predators differ across the different areas. For example, does the number of prey predict the number of predators? This gives us 2 quantitative variables.
For two quantitative variables we want a scatterplot.
ggplot(prey.pred, aes(y = predator, x = prey)) + # call the data and set the aesthetics
  geom_point() # type of graph we want
Asking Questions
We can see from this plot that the abundance of prey does not appear to predict the abundance of predators. This suggests that other factors may be more important in explaining the abundance of predators.
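The visual impression above could be checked with a correlation test. This is a sketch on simulated counts: pp_demo, its 30 rows and its Poisson means are assumptions for illustration only, and prey and predator are generated independently, so no relationship is built in.

```r
set.seed(1) # makes the simulated data reproducible

# Simulated stand-in for the prey and predator counts (NOT the real data)
pp_demo <- data.frame(
  prey = rpois(30, lambda = 40),
  predator = rpois(30, lambda = 8)
)

# Pearson correlation test between prey and predator numbers
ct <- cor.test(pp_demo$prey, pp_demo$predator)

print(ct$estimate) # correlation coefficient, between -1 and 1
print(ct$p.value)
```

On the real data the call would be cor.test(prey.pred$prey, prey.pred$predator); a coefficient near 0 with a large p-value would support the reading of the scatter plot.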
To further investigate this data, you could look at:
Area of habitats
Quality of habitat
Prey preferences of the predator: the abundance of a specific preferred prey type may predict predator numbers even if overall prey numbers do not