This is an Rmarkdown document demonstrating how to find and flag weight outliers in a salmon ASL dataset and flag them for further review. 1 I’m going to try and provide a fairly comprehensive commentary in this document explaining what I am doing in plain English as I am doing it.
Let’s get started by loading the packages that we will use for this document. 2 After loading the libraries I will use I then set the working directory to the folder where I want to pull input from and where I want to write output.
# load packages
library(tidyverse)
library(ggformula)
library(StatMeasures)
# set working directory
setwd("S:/REG2/BBsalmon/Run Reconstruction/2020")
# read a csv file into memory as a dataframe and name it
master.dat <- read.csv("FDMS2020.csv")
Next I want to clean up my data frame. I do this with a series of ordered dataframe manipulations such that I end up with a working dataframe named ‘dat’. To do this I’m using the dplyr package which uses ‘piping’ syntax (%>%) to string together sequential commands for dataframe manipulations. 3
The dplyer manuipulations I will apply to my dataframe are:
format (mutate) catch date variable into true date format 4
format (mutate) catch and sample date variables into true date format
filter by sockeye species code (going to just analyze sockeye here)
filter by major age classes (i.e. 12,13,22,23)
create (mutate) an Ocean age variable (‘o.age’)
# execute my dataframe modifications outlined above
dat <- master.dat %>%
mutate(catch_date = as.Date(master.dat$catch_date, format = "%m/%d/%Y")) %>%
mutate(sample_date = as.Date(master.dat$sample_date, format = "%m/%d/%Y")) %>%
filter(species_code == 420) %>%
filter(age %in% c(12,13,22,23)) %>%
mutate(o.age = ifelse((age == 12 | age == 22),2,3))
# execute a frequency table to check whether the above command worked out as I wanted
table(dat$o.age,dat$age)
##
## 12 13 22 23
## 2 13438 0 1046 0
## 3 0 19839 0 234
The basic frequency table I have R execute shows that we have 13,438 records with age = 12 and o.age = 2 and 19,839 records with age = 13 and o.age = 3 and so on which all looks ok. Now lets do a boxplot of Weight by o.age and visually examine for outliers. 5 Note that you have to force the formula to view o.age as a factor here. Normally since o.age is an integer it would be viewed as a continuous variable which the formula would not know how to interpret in a boxplot but if you force the formula to view o.age as a factor then you can get it to work.
# plot weight as a function of o.age
gf_boxplot(weight~as.factor(o.age),data = dat)
Yikes!. There is an obvious erroneous weight value > 20 in the o.age 3 fish. Probably a displaced decimal—should be 2 (something) kg not 20 (something) kg. To drill down and identify records that I think need further review I will create a new variable called ‘weight_flag’ and set it = 1 for every record initially and define flag = 1 as records that need no further review. I can then switch the flags = 2 as I identify records that I think need further review starting with that one record where weight > 20.6
after defining the initial weight_flag for all records I can regraph filtering for weight_flag = 1 by changing the data to be graphed from:
to:
# define weight flag = 1 for all records
dat$weight_flag <- 1
# set weight_flag = 2 where weight > 10
dat$weight_flag[dat$weight > 10] <- 2
# regraph filtering weight flag = 1
gf_boxplot(weight~as.factor(o.age),data = dat[dat$weight_flag == 1,])
This is much better. There still appear to be some outliers. One package that I have found very helpful in identifying outliers is the StatMeasures package which has a function nameed outliers which when applied to a vector (or something that can be viewed as a vector such as a column in a dataframe) returns a ‘list’ type object. 7
The outliers command returns a list of two objects. The first object is an integer (numoutliers) which gives the number of statistical outliers. The second object returned is a vector (idxOutliers) of the position of each outlier in the input. So if we are applying outliers to a column in a dataframe, then idxOutliers identifies the rows having outliers.
You can see where I’m going with this. Let’s apply this function to the weight column of our dataframe ‘dat’ having rows where o.age = 2 and just see what we get…
outliers(dat[dat$o.age == 2,"weight"])
## $numOutliers
## [1] 7
##
## $idxOutliers
## [1] 1370 3156 6622 6901 9084 11805 11811
So it looks like o.age = 2 has 7 outliers, and those outliers exist at rows 1370, 3156, 6622….etc. This is very useful output. Let’s use it to flag the outliers in our ‘dat’ dataframe.
# run outliers on o.age 2 data and store as 'a'
a <- outliers(dat[dat$o.age == 2,"weight"])
# use that output to identify rows where weight_flag needs to be redefined = 2
dat[a$idxOutliers,"weight_flag"] <- 2
….then let’s do the same thing with the o.age = 3 fish…
a <- outliers(dat[dat$o.age == 3,"weight"])
dat[a$idxOutliers,"weight_flag"] <- 2
A quick frequency table shows that we….
table(dat$weight_flag,dat$o.age)
##
## 2 3
## 1 14474 20062
## 2 10 11
flagged 10 2-Ocean and 11 3-Ocean records as outliers. Very clean dataset! Golf clap for Stacy! At this point it is probably time to hit pause so we can circle back to these records in FDMS and try and decide if there is anything we want to change. To assist with that that I can write the records with weight outliers into a dataframe named ‘temp’ then write ‘temp’ into a .csv file named ‘records_for_review’ in the working directory. Wash, rinse, repeat until satisfied that data are high quality. If time allows, minor age classes and non-sockeye records could be checked as well.
temp <- dat[dat$weight_flag == 2,]
write.csv(temp,"records_for_review.csv")
Before I wrap this up, notice that I repeat a bit of code above, specifically I use the outliers command twice, once to flag o.age = 2 outliers and then again to flag o.age = 3 fish. This is exactly the sort of scenario for which we might use something called a loop. Loops are just structures where code loops back on inself. Loops can be executed a fixed number of times or be conditional. We are going to rewite the above code where we want the code to repeat twice. Here we go…
# define vector 'ocean.age' as the numbers two and three
ocean.age <- c(2,3)
# for loop wherein the variable i will hold the value of one on the first pass and two on the
# second pass
for (i in 1:2){
a <- outliers(dat[dat$o.age == ocean.age[i],"weight"])
dat[a$idxOutliers,"weight_flag"] <- 2
}
The way this works is that the first pass i = 1 so ocean.age[i] equals the first value in vector ‘ocean.age’ e.g. ocean.age[1] = 2 and on the second pass i = 2 so ocean.age[i] = 3. This little ‘for’ loop didn’t save a lot of code writing but this concept is extremely powerful when doing any sort of data management or analysis as you can essentially eliminate ALL sorts of repetitive tasks which can be VERY powerful.
THE END
Mardown documents are, in a nutshell, a mix of text of R script (called ‘chunks’ in markdown lingo) and text. In the knitted output the text looks like, well, text and code chunks can have various attributes assigned to them such as being visible or hidden in the knitted output. Here for instructional purposes I’m making all code chunks visible while running and showing output of all chunks. Code chunks are shown in greyed out boxes. Every line in the greyed out chunks are executable R code except lines that start with #. These are unexecuted comment lines. I use these comments to explain in plain english what the line of code below is doing. Rmarkdown documents combine text and ‘chunks’ of R code that can be executed when the markdown document is run Markdown documents can produce various output formats such as word formats, .pdf, HTML, and an HTML presentation output that has a very similar look and feel as Power Point.↩
Packages need to be downloaded to your local library before they are available for an R session.↩
dplyr and ggplot2 are two packages that I use heavily of for data manipulation ( dplyr ) and graphics ( ggplot2 ). These packages are part of the tidyverse which I would consider a pretty core set of packages for R at this point.↩
Incompatable default date and time formats between Excel and R are one of the most annoying aspects of working extensively in both platforms. Excel uses the American standard and R uses the English standard. In Excel 6/1/2020 is understood as 1 June 2020. R will not make that association and read that variable as a text string. As a matter of routine when I am reading data into R from Excel I check to see if I need to force a format on a date. I ended up not using these variables but kept the conversion in as an example of an often necessary and frustrating task. Here I use the ‘as.Date’ function in base R but I also make a lot of use of the lubridate which is one of the more intuitive and helpful packages I’ve found for working with dates and times in R.↩
I’m using a rather obscure package here called ggformula which usess the ggplot2 syntax to define graphics using formulas. In ggplot2 you would normally define x and y directly (e.g. x = o.age, y = weight) but for some reason I keep tripping up boxplots using this syntax, so I’ve gotten in the habit of using ggformula as a work around which allows you to define the plot in formula fashion, so instead of doing a boxplot where x=o.age and y=weight you can restate it as ‘weight~age’ in the ggformula command ‘gf_boxplot’ which reads as ‘weight as a funciton of (~) o.age’.↩
Note that I’m using square bracket notation here. In R, square bracket notation is a very important syntax used to extract, subset or perform operation on parts of an object. For a basic dataframe (df) df[rows,colums] notation refers to things you are defining about rows of dataframe ‘df’ before the comma and colums after the comma.↩
While most R commands return objects that are fairly intuitive to understand like dataframes, vectors, matrices or integers, a ‘list’ object can be a bit difficult to understand. Basically a list is an ordered list of things. The things in the list don’t have to be the same type of object. This outliers command would be a good simple example of a command that returns a list and how we can use parts of that list to do other things. Other R commands that commonly return lists are statistical tests like a t-test or model output like linnear regressions.↩