Formative Assessment
Loading the necessary packages
library(tidyverse) # loads the tidyverse, which includes ggplot2 and dplyr, needed for manipulating data and producing graphs
Importing the data
Remember to set the session’s working directory before attempting to import data.
deer.data <- read.table("C:\\Users\\erica\\Documents\\Deer Data Set.txt") # imports text file as a dataframe
tibble(deer.data)
# A tibble: 33 × 3
   V1       V2    V3
   <chr>    <chr> <chr>
 1 woodland roe   sika
 2 120      832   1082
 3 153      1010  1212
 4 171      1032  1548
 5 295      1001  1301
 6 307      947   1136
 7 325      1006  1509
 8 336      928   1206
 9 422      1015  1218
10 498      840   1260
# ℹ 23 more rows
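The output above shows the column names (woodland, roe, sika) read in as the first row of data, which is why every column is stored as character. One possible alternative, sketched here in base R on the first three rows shown above, is to pass header = TRUE to read.table(). The object names deer_text and deer_example are illustrative only, and the text is inlined so that the example runs without the original file.

```r
# First rows of the data set shown above, inlined so the example is self-contained
deer_text <- "woodland roe sika
120 832 1082
153 1010 1212
171 1032 1548"

# header = TRUE treats the first line as column names rather than as data
deer_example <- read.table(text = deer_text, header = TRUE)

# The columns are now read as numbers, so no renaming or row filtering is needed
str(deer_example)
```

With this approach the woodland column is also read as a number; if it is meant as a site label, it could be converted afterwards with as.character().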
Adjusting the data table
deer.data <- deer.data %>% # creates a new (altered) data set
  rename(Woodland = V1, Roe = V2, Sika = V3) %>% # renames column headers
  mutate(Row = row_number()) %>% # adds new variable of row numbers
  filter(row_number() %in% c(2:33)) # removes row 1 of data
Exploring the data
deer.data %>%
  str() # tells you which types of variables you have
'data.frame': 32 obs. of 4 variables:
 $ Woodland: chr "120" "153" "171" "295" ...
 $ Roe     : chr "832" "1010" "1032" "1001" ...
 $ Sika    : chr "1082" "1212" "1548" "1301" ...
 $ Row     : int 2 3 4 5 6 7 8 9 10 11 ...
Above we can see that there are 3 character (or string) variables, and 1 integer variable.
Exploring the data will be easier if the variables for the numbers of Roe and Sika deer are converted from character to numeric.
deer.data <- deer.data %>% # creates a new (altered) data set
  mutate(across(c(Roe, Sika), as.numeric)) # converts character variables to numeric
deer.data %>%
  str()
'data.frame': 32 obs. of 4 variables:
 $ Woodland: chr "120" "153" "171" "295" ...
 $ Roe     : num 832 1010 1032 1001 947 ...
 $ Sika    : num 1082 1212 1548 1301 1136 ...
 $ Row     : int 2 3 4 5 6 7 8 9 10 11 ...
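Because as.numeric() returns NA (with a warning) for any string it cannot interpret as a number, it may be worth checking the converted columns for missing values. A minimal sketch in base R, using the first few Roe values shown above; roe_chr, roe_num and bad_value are illustrative names only.

```r
# First few Roe counts as they appear before conversion (character strings)
roe_chr <- c("832", "1010", "1032", "1001")

# Convert to numeric, as done for the Roe and Sika columns above
roe_num <- as.numeric(roe_chr)

# TRUE would indicate that at least one value failed to convert
anyNA(roe_num)

# A non-numeric string such as a stray header becomes NA;
# suppressWarnings() hides the coercion warning for this demonstration
bad_value <- suppressWarnings(as.numeric("roe"))
is.na(bad_value)
```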
Now a statistical summary of the numbers of Roe and Sika deer can be produced.
deer.data %>%
  summary()
   Woodland              Roe              Sika           Row
 Length:32          Min.   : 701.0   Min.   : 841   Min.   : 2.00
 Class :character   1st Qu.: 840.0   1st Qu.:1076   1st Qu.: 9.75
 Mode  :character   Median : 916.0   Median :1210   Median :17.50
                    Mean   : 905.5   Mean   :1203   Mean   :17.50
                    3rd Qu.:1002.2   3rd Qu.:1303   3rd Qu.:25.25
                    Max.   :1062.0   Max.   :1593   Max.   :33.00
Analysing the data
As both of the variables of interest are quantitative, I would use a linear regression to analyse the data. This would show the direction and strength of the relationship between the numbers of Roe and Sika deer across the woodland sites.
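As a rough sketch of how such a regression might be fitted, the base R code below uses only the nine Roe and Sika counts displayed in the tibble output earlier, so its results describe that excerpt and not the full set of 32 sites. The names deer_example and fit are illustrative, and Sika is treated as the response purely to match the axes of the plot that follows.

```r
# The nine Roe/Sika pairs shown in the tibble output earlier in this document
deer_example <- data.frame(
  Roe  = c(832, 1010, 1032, 1001, 947, 1006, 928, 1015, 840),
  Sika = c(1082, 1212, 1548, 1301, 1136, 1509, 1206, 1218, 1260)
)

# Fit a simple linear regression of Sika counts on Roe counts
fit <- lm(Sika ~ Roe, data = deer_example)

# Intercept and slope: the slope is the expected change in Sika per extra Roe deer
coef(fit)

# Proportion of variance in Sika counts explained by Roe counts
r_squared <- summary(fit)$r.squared
r_squared

# Pearson correlation test for the same pair of variables
cor.test(deer_example$Roe, deer_example$Sika)
```

On the full data set, the same calls could be applied to deer.data once Roe and Sika have been converted to numeric.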
A scatter plot with a fitted regression line and its 95% confidence band can be produced to visualise this.
ggplot(deer.data, aes(x = Roe,
                      y = Sika)) + # maps each variable to an axis
  geom_point() + # produces scatter plot
  geom_smooth(method = "lm", # adds regression line
              se = TRUE) + # shades the 95% confidence interval around the line
  labs(x = "Number of Roe Deer",
       y = "Number of Sika Deer",
       caption = "Figure 1. A comparison between the number of Sika and Roe deer across 32 woodland sites.")
Asking questions
Example statistical hypotheses:
- There is a higher abundance of Sika deer in woodland habitats with large Roe deer populations.
- Woodland habitats contain larger numbers of Sika deer than Roe deer.
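The second hypothesis compares two counts taken at the same sites, so one possible way to test it is a paired, one-sided t-test. The sketch below is in base R and again uses only the nine rows displayed earlier in this document; it assumes the site-level differences are roughly normally distributed, and the names deer_example and paired_test are illustrative.

```r
# The nine Roe/Sika pairs shown in the tibble output earlier in this document
deer_example <- data.frame(
  Roe  = c(832, 1010, 1032, 1001, 947, 1006, 928, 1015, 840),
  Sika = c(1082, 1212, 1548, 1301, 1136, 1509, 1206, 1218, 1260)
)

# Paired test: each woodland contributes one Sika count and one Roe count.
# alternative = "greater" tests whether Sika counts exceed Roe counts on average.
paired_test <- t.test(deer_example$Sika, deer_example$Roe,
                      paired = TRUE, alternative = "greater")

# Mean of the site-level differences (Sika minus Roe) and the p-value
paired_test$estimate
paired_test$p.value
```

If the differences looked clearly non-normal, wilcox.test() with the same paired and alternative arguments would be a non-parametric option.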
Example scientific hypothesis:
- Greater food availability in larger woodlands increases the total abundance of Roe and Sika deer.
Further information
The following additional information would help to explore the data further:
- Size of each woodland
- Age / sex distributions within species populations
- Sampling at different time points, e.g. seasons