Make sure you have the latest R and RStudio installed before starting this process. These are the R packages that are required to complete the data cleaning and documentation using RStudio.
Note: The above packages do not come with the RStudio installation and need to be installed explicitly. Use the Packages tab or type install.packages("package_name").
Next, load the R packages:
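A minimal sketch of installing a package only when it is missing and then loading it. The helper name `load_or_install` is hypothetical, and `stats` is only a stand-in, since the actual package list is not reproduced here.

```r
# Hypothetical helper: install each package only if it is missing, then load it.
load_or_install <- function(pkgs) {
  for (pkg in pkgs) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)
    }
    library(pkg, character.only = TRUE)
  }
  invisible(TRUE)
}

# "stats" ships with R and is only a placeholder for the required packages.
load_or_install(c("stats"))
```

Replacing the placeholder vector with the project's package names would load them all in one call.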
Reshaped data of Avia Monitoring
We reshape the data and create some new columns that are convenient for the following analysis.
| Monitoring_Route | Date | eBird | eachCount | totalCount | H_S | variable | value | Weeks | WeekNoYear | Year | Month | MonthWithYear | season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Purple Trail | 2016-08-31 | N | 17 | 13 |  |  |  | 2016-W35 | W35 | 2016 | 8 | 2016-08 | summer |
| Green Trail | 2017-04-03 | N | 5 | 4 | Heard | CAGO | 2 | 2017-W14 | W14 | 2017 | 4 | 2017-04 | spring |
| Red Trail | 2018-07-11 | N | 8 | 5 |  |  |  | 2018-W28 | W28 | 2018 | 7 | 2018-07 | summer |
| Green Trail | 2016-07-19 | N | 5 | 5 |  |  |  | 2016-W29 | W29 | 2016 | 7 | 2016-07 | summer |
| Purple Trail | 2018-06-07 | N | 5 | 4 |  |  |  | 2018-W23 | W23 | 2018 | 6 | 2018-06 | summer |
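The derived date columns in the table above could be produced along these lines. This is a sketch, not the original reshaping code: the data frame `obs_dates` and the helper `season_of_month` are hypothetical, the week number is assumed to be the ISO 8601 week (`%V`), and the seasons are assumed to be meteorological seasons for the northern hemisphere.

```r
# Toy dates taken from the sample rows above.
obs_dates <- data.frame(
  Date = as.Date(c("2016-08-31", "2017-04-03", "2018-07-11"))
)

obs_dates$Year          <- as.integer(format(obs_dates$Date, "%Y"))
obs_dates$Month         <- as.integer(format(obs_dates$Date, "%m"))
obs_dates$MonthWithYear <- format(obs_dates$Date, "%Y-%m")
# Assumption: ISO 8601 week number.
obs_dates$WeekNoYear    <- paste0("W", format(obs_dates$Date, "%V"))
obs_dates$Weeks         <- paste(obs_dates$Year, obs_dates$WeekNoYear, sep = "-")

# Assumption: March-May is spring, June-August summer,
# September-November autumn, December-February winter.
season_of_month <- function(m) {
  ifelse(m %in% 3:5, "spring",
  ifelse(m %in% 6:8, "summer",
  ifelse(m %in% 9:11, "autumn", "winter")))
}
obs_dates$season <- season_of_month(obs_dates$Month)

print(obs_dates)
```

Pasting the calendar year onto an ISO week can mislabel dates in the first or last days of a year, so that edge case would need checking on the full dataset.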
Data from eBird
With limited access to the eBird dataset, we cannot compare this example data with our data directly. We take this portion of example data from the eBird website as a reference, which will make the following analysis easier.
| Week_starting_on | species | Frequency | Total_checklists_submitted | Abundance | Birds_Per_Party_Hour | checklists_reporting_species | High_Count | checklists_reporting_species__1 | Totals | checklists_reporting_species__2 | Average_Count | checklists_reporting_species__3 | variable |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 02-07 | Common Yellowthroat | 0.0000000 | 7679 | 0.0000000 | 0.000000 | 0 | 0 | 0 | 0 | 0 | 0.000000 | 0 | Common Yellowthroat |
| 01-07 | Marsh Wren | 0.0529101 | 5670 | 0.0005291 | 2.351564 | 3 | 1 | 3 | 3 | 3 | 1.000000 | 3 | MAWR |
| 08-14 | Canada Goose | 28.3185841 | 3503 | 7.3011704 | 43.862224 | 994 | 350 | 1029 | 26305 | 1000 | 26.305000 | 1000 | CAGO |
| 01-21 | Great Blue Heron | 8.4825117 | 7262 | 0.1817681 | 5.347221 | 633 | 24 | 665 | 1382 | 662 | 2.087613 | 662 | GBHE |
| 08-31 | Marsh Wren | 0.9322974 | 4505 | 0.0142064 | 1.220736 | 43 | 4 | 44 | 65 | 43 | 1.511628 | 43 | MAWR |
| variable | min | median | mean | max |
|---|---|---|---|---|
| MAWR | 1 | 4 | 5.000000 | 11 |
| GBHE | 1 | 3 | 8.000000 | 30 |
| WOTH | 1 | 1 | 2.000000 | 6 |
| WODU | 1 | 3 | 3.428571 | 5 |
| CAGO | 1 | 2 | 8.357143 | 29 |
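A per-species summary like the one above can be computed with base R. The sketch below uses made-up counts in a hypothetical data frame `counts`; only the column names `variable` and `value` follow the reshaped data.

```r
# Made-up long-format counts: `variable` is the species code, `value` the count.
counts <- data.frame(
  variable = c("CAGO", "CAGO", "CAGO", "GBHE", "GBHE", "GBHE"),
  value    = c(1, 2, 9, 1, 3, 5)
)

# Split the counts by species and compute the four statistics for each group.
stats_list <- lapply(split(counts$value, counts$variable), function(v) {
  data.frame(min = min(v), median = median(v), mean = mean(v), max = max(v))
})
stats_by_species <- do.call(rbind, stats_list)

# Move the species codes from the row names into a regular column.
stats_by_species <- cbind(variable = rownames(stats_by_species), stats_by_species)
rownames(stats_by_species) <- NULL

print(stats_by_species)
```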
With season
First, we compare these five species by season: autumn, spring, summer and winter.
| variable | autumn | spring | summer | winter |
|---|---|---|---|---|
| CAGO | 22 | 63 | 1 | 31 |
| MAWR | 0 | 1 | 19 | 0 |
| GBHE | 1 | 18 | 85 | 0 |
| WODU | 5 | 3 | 16 | 0 |
| WOTH | 0 | 0 | 14 | 0 |
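A cross-tabulation of this shape can be built with `xtabs`, which sums a count column over two grouping columns. The data frame `season_obs` below is made up for illustration, and declaring all four seasons as factor levels is an assumption that keeps empty seasons in the table as zeros.

```r
# Made-up observations; all four seasons are declared as factor levels.
season_obs <- data.frame(
  variable = c("CAGO", "CAGO", "GBHE", "GBHE", "GBHE"),
  season   = factor(c("spring", "winter", "summer", "summer", "spring"),
                    levels = c("autumn", "spring", "summer", "winter")),
  value    = c(3, 2, 4, 1, 2)
)

# Sum `value` for every species/season combination.
season_totals <- xtabs(value ~ variable + season, data = season_obs)
print(season_totals)
```

The same call with a different column on the right-hand side of the formula gives the totals for any other grouping variable.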
With Monitoring Route
Secondly, we compare these five species based on Monitoring_Route, which includes Blue Trail, Green Trail, Purple Trail and Red Trail.
| variable | Blue Trail | Green Trail | Purple Trail | Red Trail |
|---|---|---|---|---|
| CAGO | 105 | 6 | 2 | 4 |
| MAWR | 20 | 0 | 0 | 0 |
| GBHE | 103 | 1 | 0 | 0 |
| WODU | 24 | 0 | 0 | 0 |
| WOTH | 0 | 1 | 1 | 12 |
With H_S
Thirdly, we compare these five species based on H_S, which records whether a species was both heard and seen (H&S), only heard, or only seen.
| variable | H&S | Heard | Seen |
|---|---|---|---|
| CAGO | 2 | 12 | 103 |
| MAWR | 0 | 14 | 6 |
| GBHE | 10 | 10 | 84 |
| WODU | 6 | 0 | 18 |
| WOTH | 1 | 10 | 3 |
With Year
Fourthly, we compare these five species by year, focusing on 2016, 2017 and 2018.
| variable | 2016 | 2017 | 2018 |
|---|---|---|---|
| CAGO | 24 | 72 | 21 |
| MAWR | 11 | 1 | 8 |
| GBHE | 30 | 33 | 41 |
| WODU | 13 | 9 | 2 |
| WOTH | 7 | 2 | 5 |
According to the descriptive summaries, the counts for each species are not homogeneous across the levels of each variable. Each of the five species shows its own pattern on each variable, and we may explore further relationships among the variables.
The total number of observations for MAWR, WOTH and WODU is too small to estimate differences between variable levels reliably. We therefore only attempt an ANOVA test of group differences for two species (CAGO and GBHE).
#Randomized Block Design
fit <- aov(value ~ H_S + Year + Monitoring_Route + season, aviaAllu[variable == "CAGO",])
anova(fit)
## Analysis of Variance Table
##
## Response: value
## Df Sum Sq Mean Sq F value Pr(>F)
## H_S 2 565.79 282.893 2.0096 0.2488
## Year 1 7.49 7.494 0.0532 0.8288
## Monitoring_Route 3 301.97 100.657 0.7150 0.5924
## season 3 86.87 28.956 0.2057 0.8876
## Residuals 4 563.10 140.774
As the result shows, we cannot reject the null hypothesis at the 5% level: CAGO's observation counts show no significant difference across the levels of any of the four variables.
#Randomized Block Design
fit <- aov(value ~ H_S + season + Year + Monitoring_Route, aviaAllu[variable == "GBHE",])
anova(fit)
## Analysis of Variance Table
##
## Response: value
## Df Sum Sq Mean Sq F value Pr(>F)
## H_S 2 414.00 207.000 3.7619 0.08733 .
## season 2 356.36 178.182 3.2382 0.11122
## Year 1 15.99 15.988 0.2906 0.60926
## Monitoring_Route 1 61.50 61.499 1.1177 0.33112
## Residuals 6 330.15 55.025
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For GBHE, still no variable has a p-value below 5%. However, the fit and the alluvial diagram suggest that the independent variables may interact.
# Two Way Factorial Design
fit <- aov(value ~ season * H_S, aviaAllu[variable == "GBHE",])
anova(fit)
## Analysis of Variance Table
##
## Response: value
## Df Sum Sq Mean Sq F value Pr(>F)
## season 2 265.94 132.971 3.2609 0.11002
## H_S 2 504.42 252.210 6.1850 0.03484 *
## season:H_S 2 162.97 81.485 1.9983 0.21622
## Residuals 6 244.67 40.778
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
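With the interaction term included, the H_S main effect is significant at the 5% level (p = 0.035), while the season:H_S interaction itself is not (p = 0.216). This suggests that the observed counts differ by detection type (heard, seen, or both), but the data do not show that this difference changes with season.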
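To make the distinction between main effects and the interaction concrete, here is a sketch of a two-way factorial ANOVA on simulated data. The data frame `toy` and its values are made up and are unrelated to the monitoring data; only the formula mirrors the model above.

```r
set.seed(1)

# Simulated balanced design: 2 seasons x 2 detection types x 3 replicates.
toy <- expand.grid(
  season = c("spring", "summer"),
  H_S    = c("Heard", "Seen"),
  rep    = 1:3
)
# Simulated counts with a built-in shift for "Seen" and no interaction.
toy$value <- rpois(nrow(toy), lambda = 5) + ifelse(toy$H_S == "Seen", 4, 0)

fit_toy <- aov(value ~ season * H_S, data = toy)
tab <- anova(fit_toy)
print(tab)

# Each main effect and the interaction is tested on its own row,
# so every p-value is read from the matching row of the table.
p_values <- setNames(tab[["Pr(>F)"]], rownames(tab))
print(round(p_values[c("season", "H_S", "season:H_S")], 4))
```

Reading the `season:H_S` row separately from the `H_S` row is what distinguishes "counts differ by detection type" from "that difference depends on the season".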
## 'data.frame': 395 obs. of 21 variables:
## $ Date : Date, format: "2016-07-10" "2016-07-10" ...
## $ Month: int 7 7 7 7 7 7 7 7 7 7 ...
## $ Day : int 10 10 10 10 10 10 10 10 10 10 ...
## $ Year : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
## $ EABL : int NA NA NA NA NA NA NA NA NA NA ...
## $ WBNU : int NA NA NA NA NA NA NA NA NA NA ...
## $ COYE : int NA NA NA NA NA NA NA NA NA NA ...
## $ EAKI : int NA NA NA NA NA NA NA NA NA NA ...
## $ MALL : int NA NA NA NA NA NA NA NA NA 1 ...
## $ CAGO : int NA NA NA NA NA NA NA NA NA NA ...
## $ MODO : int NA NA NA NA NA NA 1 NA 1 NA ...
## $ MAWR : int 1 NA NA NA 1 NA NA NA NA NA ...
## $ CATE : int NA NA NA NA NA NA NA NA NA NA ...
## $ GREG : int 2 NA 1 1 NA NA NA NA NA NA ...
## $ WODU : int NA NA NA NA NA NA NA NA NA NA ...
## $ GBHE : int 1 1 NA NA NA 1 NA NA NA NA ...
## $ RBWO : int NA NA NA NA NA NA NA NA NA NA ...
## $ WOTH : int NA NA NA NA NA NA NA NA NA NA ...
## $ GRFL : int NA NA NA NA NA NA NA 1 NA NA ...
## $ AMRE : int NA NA NA NA NA NA NA NA NA NA ...
## $ EATO : int NA NA NA NA NA NA NA NA NA NA ...
## Date Month Day Year
## Min. :2016-07-10 Min. : 1.000 Min. : 1.00 Min. :2016
## 1st Qu.:2016-07-29 1st Qu.: 6.000 1st Qu.: 9.00 1st Qu.:2016
## Median :2017-04-03 Median : 7.000 Median :17.00 Median :2017
## Mean :2017-05-01 Mean : 6.706 Mean :16.37 Mean :2017
## 3rd Qu.:2018-04-08 3rd Qu.: 8.000 3rd Qu.:23.00 3rd Qu.:2018
## Max. :2018-08-08 Max. :12.000 Max. :31.00 Max. :2018
##
## EABL WBNU COYE EAKI
## Min. :1.0 Min. :1.000 Min. :1 Min. :1.00
## 1st Qu.:1.0 1st Qu.:1.000 1st Qu.:1 1st Qu.:1.00
## Median :1.0 Median :1.000 Median :1 Median :1.00
## Mean :1.8 Mean :1.094 Mean :1 Mean :1.25
## 3rd Qu.:2.0 3rd Qu.:1.000 3rd Qu.:1 3rd Qu.:1.25
## Max. :5.0 Max. :2.000 Max. :1 Max. :2.00
## NA's :385 NA's :363 NA's :381 NA's :391
## MALL CAGO MODO MAWR
## Min. :1.000 Min. : 1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.: 1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median : 2.000 Median :1.000 Median :1.000
## Mean :1.632 Mean : 3.297 Mean :1.273 Mean :1.261
## 3rd Qu.:1.500 3rd Qu.: 3.000 3rd Qu.:1.750 3rd Qu.:1.000
## Max. :6.000 Max. :20.000 Max. :2.000 Max. :3.000
## NA's :376 NA's :358 NA's :373 NA's :372
## CATE GREG WODU GBHE
## Min. :1 Min. :1.00 Min. :1.000 Min. :1.000
## 1st Qu.:1 1st Qu.:1.00 1st Qu.:1.000 1st Qu.:1.000
## Median :1 Median :1.00 Median :1.000 Median :1.000
## Mean :1 Mean :1.25 Mean :2.053 Mean :1.265
## 3rd Qu.:1 3rd Qu.:1.00 3rd Qu.:3.000 3rd Qu.:1.000
## Max. :1 Max. :5.00 Max. :5.000 Max. :3.000
## NA's :392 NA's :331 NA's :376 NA's :293
## RBWO WOTH GRFL AMRE
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :2
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:2
## Median :1.000 Median :1.000 Median :1.000 Median :2
## Mean :1.082 Mean :1.077 Mean :1.077 Mean :2
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:2
## Max. :3.000 Max. :2.000 Max. :2.000 Max. :2
## NA's :346 NA's :382 NA's :382 NA's :394
## EATO
## Min. :1.000
## 1st Qu.:1.000
## Median :1.000
## Mean :1.222
## 3rd Qu.:1.000
## Max. :2.000
## NA's :386
Based on our research questions, we have identified the following variables as useful for deriving insights and conclusions from the data. The scientists and volunteers recorded two types of data in the same dataset: quantitative data and qualitative data.
The following data values are descriptive and non-numeric.
Setting the working directory and reading the cleaned data:
setwd("C:/Users/indra/Desktop/Week12/OldWomanCreek/Deliverables/RScript") # setting the working dir
eagle_raw_data <- read.csv("BaldEagle.csv") # reading the csv data
‘eagle_raw_data’ stores the whole dataset from the CSV file. We perform our data manipulations on the variable ‘eagle_raw_data’.
NestStatus = eagle_raw_data$NestStatus # select NestStatus from raw data
NestStatus.freq = table(NestStatus) # apply the table function
NestStatus.freq stores the frequency of each category. Let's view the frequency distribution:
| NestStatus | Freq |
|---|---|
|  | 87 |
| A | 4 |
| B | 59 |
| BR | 33 |
| Branching | 82 |
| Building | 5 |
| Building/Guarding | 4 |
| F | 1 |
| Fledged | 2 |
| Guarding | 4 |
| Guarding | 1 |
| H | 665 |
| Hatched | 85 |
| I | 521 |
| Incubating | 22 |
| Incubating/Hatched | 12 |
| Old Nest | 2 |
| Protecting | 3 |
| R | 86 |
Similarly, we can get the frequency distribution for the other qualitative variables. We are working on combining these frequencies with the quantitative data to produce plots.
The relative frequency distribution of a data variable is a summary of the frequency proportion in a collection of non-overlapping categories. [2]
The relationship between frequency and relative frequency is: relative frequency = frequency / sample size.
Rounding the relative frequencies to 3 decimal places:
| NestStatus | Freq |
|---|---|
|  | 0.052 |
| A | 0.002 |
| B | 0.035 |
| BR | 0.020 |
| Branching | 0.049 |
| Building | 0.003 |
| Building/Guarding | 0.002 |
| F | 0.001 |
| Fledged | 0.001 |
| Guarding | 0.002 |
| Guarding | 0.001 |
| H | 0.396 |
| Hatched | 0.051 |
| I | 0.310 |
| Incubating | 0.013 |
| Incubating/Hatched | 0.007 |
| Old Nest | 0.001 |
| Protecting | 0.002 |
| R | 0.051 |
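The relative frequencies above follow from dividing each count by the sample size. A minimal sketch on a made-up vector (`nest_status` is hypothetical, not the BaldEagle.csv column):

```r
# Made-up nest status codes standing in for the real column.
nest_status <- c("H", "H", "I", "H", "B", "I", "H", "R")

freq     <- table(nest_status)                     # counts per category
rel_freq <- round(freq / length(nest_status), 3)   # proportions, 3 decimals

print(cbind(Freq = freq, RelFreq = rel_freq))
```

`prop.table(freq)` computes the same proportions before rounding.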
colors = c("red", "yellow", "green", "violet", "orange", "blue", "pink", "cyan")
pie(NestStatus.freq, col=colors)
H stands for the Hatching nest status. Next, we find the mean Temperature(F) for NestStatus = H.
Create logical vector for NestStatus = H:
Now, find the mean Temperature(F) of NestStatus = H:
## [1] 59.84
The mean temperature for NestStatus = H (Hatching) is 59.84.
Similarly, summary will show the quartiles, median, minimum, maximum and mean of the temperature when NestStatus = H:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.00 51.00 55.00 59.84 70.00 90.00
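The two steps above (building a logical vector, then averaging the matching temperatures) can be sketched as follows. The data frame `eagle_toy` and the column name `Temperature` are assumptions for illustration; the real column name in BaldEagle.csv is not shown here.

```r
# Made-up rows; `Temperature` is an assumed column name.
eagle_toy <- data.frame(
  NestStatus  = c("H", "I", "H", "B", "H"),
  Temperature = c(55, 40, 70, 62, NA)
)

# Logical vector: TRUE where the nest status is H.
is_hatching <- eagle_toy$NestStatus == "H"

# Keep only the temperatures of the matching rows.
hatching_temp <- eagle_toy$Temperature[is_hatching]

# na.rm = TRUE skips missing temperature readings.
mean_temp <- mean(hatching_temp, na.rm = TRUE)
print(mean_temp)
print(summary(hatching_temp))
```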
The following data values are non-descriptive and numeric.
Indra - Worked on Bald Eagle data, GitHub.
Sun - Worked on eBird data, GitHub.
Kalpana - Worked on indicator species, GitHub and proofreading.