Make sure you have latest R and Rstudio installed before starting this process. These are the R packages that are required to complete the data cleaning and documentation using Rstudio.
Note: The above packages do not come with Rstudio installation, they need to be installed explictly, use the packages tab or just type install.packages(“package_name”).
Next load the R packages:
library("knitr")
library("tidyverse")Image Source: www.hiveminer.com
We received data from Emily Kuzmick, the coastal program coordinator at Old Woman Creek National Estuarine Research Reserve. The data was posted in our college Canvas under our course file section. It was posted by our professor Dr. Andrea Wiggins under our course file section. We received zipped file where we had 10 other files in there. On those files we had Avian monitoring data in an excel spreadsheet. The spreadsheet contains over 50 columns including avian monitoring route, date, wind direction, temperature, etc. There are around 1435 rows containing data for these columns. Then there is Eagle Master spreadsheet which includes information regarding Bald Eagle Data. The spreadsheet has around 22 columns which contains information such as date when data is recorded, nest location of bald eagle, observer’s location to nest, nest status, nest location wind, etc. There are also word documents which contains information related to bald eagle fact sheet, Avian activity monitoring sheet, Bald eagle monitoring sheet, phenological species monitoring at Old Woman Creek, a pdf document on identification of birds.
Citizen scientists or volunteers collect the data at OWC. They come to the point and stand there 15 minutes for observation, and record everything they see and hear in that 15 minutes. Then, they are required to fill out the form which contains the critical attributes for the dataset.
Variable included:
Image Source: Wikipedia
Staff, interns and volunteers collect the data at OWC. They collect data in two parts, firstly they collect weather conditions and secondly they make Eagle related observations. They stand at the observation points for minimum 30 minutes or more and make all these observations. They make their observations on a standard monitoring sheet, provided by OWC. They collect weather conditions along with Date, start time and end time of their observation such as Temperature, Cloud cover, Wind velocity, precipitation on the same sheet along with observer and the equipment used.
Secondly, OWC collects Bald Eagle Activity data such as their Nest status(building, incubating, hatched), Nest conditions(good, poor, fair), Nest locations, Number of Birds in Nest, Approximate number of eggs, chicks if any. Additional comments are added to the sheet if any. OWC provided us with data from ‘4/2/1996’ to ‘6/26/1999’ and ‘3/7/2016’ to ‘6/25/2018’. In short they gave two time periods data from 1996-99 and 2016-18.
Reading the CSV file with Bald Eagle data, there are 1678 rows and 21 columns.
raw_eagle_data <- read.csv("BaldEagle.csv", header = TRUE, na.strings = "")
dim(raw_eagle_data)## [1] 1678 21
Total rows:
nrow(raw_eagle_data)## [1] 1678
Total Coloumns:
ncol(raw_eagle_data)## [1] 21
Our team is using data that is intellectual property of Old Woman Creek, which is part of Ohio’s State Department of Natural Resources. There is no license specified on their website, however the data we use is provided by OWC under agreement to use for Non-Commercial and Educational purpose only. Any party without any afflifiation with OWC needs to acquire license or consent from OWC to use this data.
The metadata contains the information related to temperature, species, dates, eBird, etc which helps us to compare our data based on our research questions. Also the Bald Eagle data and the information such as nest location, number of birds present and plumages,etc. will help to work on our research question related to bald eagle nesting.
colnames(raw_eagle_data)## [1] "Date" "Nest.Location"
## [3] "Observer.s.location.to.nest" "Nest.Status"
## [5] "Nest.Condition" "Observer"
## [7] "Temp" "Wind.Velocity..mph."
## [9] "Wind.Direction" "Precipatation"
## [11] "Cloud.Cover" "Equipment"
## [13] "Start.Time" "End.Time"
## [15] "Time.of.observation" "X..Birds.Present...Plumages"
## [17] "X..Birds.on.Nest" "Behavior.of.Adults"
## [19] "Approx....of.Eggs" "Approx....of.Chicks"
## [21] "Additional.Comments"
| Field | Description | Sample/Remarks |
|---|---|---|
| Date | Date of observation | 04/28/2018 |
| Nest.Location | Location of nest in OWC | West Nest 1 |
| Observer.location.to.Nest | Observer standing location to Nest | West |
| Nest.Status | H-hatched, B-building, A-abandoned, I- Incubating | H |
| Nest.Condition | Good/Fair/Poor | |
| Observer | name of the person who took the observation | |
| Temp | temperature in F | 60 |
| Wind.Velocity.mph | wind velocity in Mph | 13 |
| Wind.Direction | direction of wind | ESE |
| Precipatation | precipatation such as rain | Raining/0% |
| Cloud.cover | cloud cover in percentage | 5% |
| Equipment | Equipment used by observer | Nikon 741-044 |
| Start.Time | start time | 10:30 |
| End.Time | end time | 12:30 |
| Time.of.observation | time of observation | 10:45 |
| # Birds.Present.Plumages | number of birds present including adult, juvenille | 1A, 2J |
| # Birds.on.Nest | number of birds present in nest | 1A |
| Behavior.of.Adults | what exactly adult is doing | adults eating |
| Approx # of Eggs | number of eggs approximation | 2 |
| Approx # of Chicks | number of chicks approximation | 2 |
| Additional Comments | If any comments observer adds them here | Head up very vigilant Calling out with voice |
The Avian monitoring excel sheet has no values for column weather. So, we will remove this column
The Additional notes column has only four values from row 582 to 585.
There are misspelled names as “Trai” from row 933 to 940 and “Traiil” from row 1079 to 1087 in the column named monitoring route. So we will have to correct those.
Dates are formatted incorrectly as there double forward slashes from row 567 to 573, rows 915 to 919, 923 to 925 in the column name Date
There are observer’s name missing in the column observer in rows 920 to 922, rows 971 to 973, row 996, observer R Kovacs last name misspelled “Kovaks” in rows 1429 to 1435.
There are missing wind data in column Wind in rows 31 to 38.
The direction column for wind has NE, ENE, SE, ESE, so replacing ENE with NE and ESE with SE.
Missing Values for column cloud cover from row 664 to 681, 912 to 919, 923 to 925, 992 to 995, 1011 to 1012, 1354 to 1364
Data entry was handtyped into excel by the person. So there are few missing values and inconsistencies in data. But overall data quality is moderate. They followed propoer standard through the document.
There are many missing values such as:
Few values of Start time and End time are having time format as 1300, whereas most of the values are having 13:00 format. Multiple records exist for the same date.
paragraph1 for avian
There are misspelling and duplicate rows in Monitoring Route variables. Find misspelling words and replace with the correct words. Find the duplicate rows and remove it.
There are different format in Date variable Formulas: PROPER(TRIM(CLEAN())) Correct xx/xx//xxxx to xx/xx/xxxx
Define the measurement for Temperature variable. Insert Fahrenheit degree symbol.
There are ambiguous direction in Direction variable. Find ambiguous direction and replace with the clear direction.
Deleting the columns which are less critical for our data analysis such as Observer, Wind, Precipitation, Cloud Cover, Equipment, and Vegetation Cover
raw_eagle_data %>%
filter(is.na(Nest.Location))## Date Nest.Location Observer.s.location.to.nest Nest.Status
## 1 4/2/1996 <NA> <NA> I
## 2 5/10/1998 <NA> <NA> H
## Nest.Condition Observer Temp Wind.Velocity..mph. Wind.Direction
## 1 <NA> Feix 40 <NA> <NA>
## 2 <NA> Wright and Feix <NA> <NA> <NA>
## Precipatation Cloud.Cover Equipment Start.Time End.Time
## 1 <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA>
## Time.of.observation X..Birds.Present...Plumages X..Birds.on.Nest
## 1 <NA> <NA> <NA>
## 2 <NA> <NA> <NA>
## Behavior.of.Adults Approx....of.Eggs Approx....of.Chicks
## 1 <NA> NA NA
## 2 <NA> NA NA
## Additional.Comments
## 1 <NA>
## 2 <NA>
Omitting these values:
clean_data <- raw_eagle_data[!is.na(raw_eagle_data$Nest.Location),]nrow(is.na(clean_data$Nest.Condition))## NULL
clean_data <- subset(clean_data, select = -c(Observer,Equipment,Wind.Direction))
colnames(clean_data)## [1] "Date" "Nest.Location"
## [3] "Observer.s.location.to.nest" "Nest.Status"
## [5] "Nest.Condition" "Temp"
## [7] "Wind.Velocity..mph." "Precipatation"
## [9] "Cloud.Cover" "Start.Time"
## [11] "End.Time" "Time.of.observation"
## [13] "X..Birds.Present...Plumages" "X..Birds.on.Nest"
## [15] "Behavior.of.Adults" "Approx....of.Eggs"
## [17] "Approx....of.Chicks" "Additional.Comments"
Note: In interest of time, we did following steps using excel, but not in Rstudio.
Fields with # birds, # eggs, # chicks we replaced NA with ‘0’ int, since there is no data we assume count to be zero for easy analysis.
Nest condition and Observer’s location to nest are blank before 2017. We leave them blank as of now. And use only data from 2016-2018 for data analysis.
If there are multiple records on same date, then merge them if the other values are same such as observer, time, # birds, etc. Take max if there is a conflict.
Blank comments are left unchanged.
As part of this assignment, I worked on cleaning and documenting the Bald Eagle data. I have also worked on preparing the Rmarkdown, colloborating the team’s work, pushing the files to github and publishing the presentable document online.
In this assignment, I worked on cleaning, documenting the Avian Monitoring data and collaborating the team’s work.
In this assignment I worked on documentation part for the Avian Data monitoring.