The R code within this document is designed to be used with comma separated data files (.csv) of flume and soil moisture variables captured at a fine-scale temporal resolution of 15 minutes that can be downloaded from the North Wyke Farm Platform (NWFP) data portal: https://nwfp.rothamsted.ac.uk/.
The code was developed using R version 4.2.1 (“Funny-Looking
Kid”) and the following R packages:
tidyverse, fmsb, reshape2, padr, janitor, writexl, lares, stringr.
Data users may prefer to convert the 15 minute data to that of a daily median or mean of the values. However, it is important to note that the number of within day missing values will impact the reliability of the statistic. The influence of ‘missingness’ on the statistic varies depending on the variable and the variation in its value throughout the day. For example, the variance in pH values throughout a day is relatively small and so the number of missing values will have less of an effect on the statistical result compared with nitrate, which may have a greater variation between values depending on conditions.
Users may wish to adjust the number of data values processed based on the Quality Control (QC) flag that is assigned to each data point. In the code, these are defined as ‘Good’, ‘Acceptable’ and ‘Outlier’ but these can be altered by the user as appropriate. Data that do not have the defined QC flags are set to ‘NA’. The types of QC flag and their description are given in Table 1 below.
An example of the difference in number of missing values for means between the original data and the effect of adjusting the data by the QC flags ‘Good’, ‘Acceptable’ and ’Outlier for a single year (2020) is shown in a graphic here.
Table 1: Quality control flags and description
| Flag | Description |
|---|---|
| Not set | No information on quality available |
| Good | Data were checked and deemed good |
| Acceptable | Data were checked and no issues were found |
| Suspicious | Data were checked and might have been affected by an event |
| Highly Suspicious | Data were checked and have definitely been affected by an event |
| Reject | Data were rejected |
| High Sensor Drift | Instrument calibration values were high over the time period |
| Missing Sensor Drift | Missing instrument calibration information and level of instrument drift during the period is unknown |
| Outlier | The value falls outside ‘regular’ limits but within the extreme limits and therefore could still be fine |
| Level Reset | Level pressure sensors were reset indicating this could result in a step in flow |
| Calibration | Calibration ‘Datetime’ of the instrument |
The daily means or medians (depending on the user requirements) are calculated alongside an accompanying missing values count (up to 96) for each variable are output as a csv file. The count acts as a quality index, and depending upon the volatility of the variable, we propose a set of conditions as guidance for the selection of observations based on the number of missing values, which are outlined in the Recommendations section in the Appendix. Recommended threshold limits for missing values for each variable are included in Table 2 and which are based on expert opinion following scrutiny of the box and whisker graphics as described above. The code adjusts the data based on the threshold limits suggested and the results are output as a separate csv file. A csv file is also created containing the total number of missing values for each variable both before and after threshold adjustment.
To help visualise the data with regards to the potential impact of missing values on the statistic, four threshold values of 0, 24, 48 and 72 missing values/day (representing 0%, 25%, 50%, 75% of 96) are defined. Box and whisker graphs of variables by individual catchments and years, and by individual catchments across all the years are produced for each threshold subset of the data. Catchments in the graphs are coloured by the farmlet system to which they belong i.e. red, green, blue.
The graphs are sub-titled by the number of missing value threshold, alongside the number of rows of data in the subset, and the percentage they represent of the whole data set. Tukey fences [Tukey, 1977] for numbers of true (T) or false (F) outliers for each threshold are included as text on the graphs. The box and whisker graphs help illustrate when missing values are greater than or equal to the threshold and how the chosen statistic is influenced.
Examples of box and whisker graphs for pH and nitrate by individual catchment and year, and by catchments (combined years) and by year (combined catchments) are given in the appendix.
rm(list=ls())
##required packages:
packlist <- c("tidyverse","fmsb","reshape2","padr","janitor","writexl","lares","stringr")
instpak <- function(pkg){
new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
if (length(new.pkg))
suppressWarnings(install.packages(new.pkg, dependencies = TRUE))
sapply(pkg, require, character.only = TRUE)
}
invisible(instpak(packlist))
Before reading in the data and writing output files the user should:
data_dir <- " " ##data directory e.g. "C:/Users/my_data_folder/"
output_dir <- " " ##output directory e.g. "C:/Users/my_output_folder/"
single_file <- FALSE ##select single file (TRUE, FALSE)
fname <- "" ##single file name e.g, "measurements_2020-01-01_2020-12-31.csv"
stat_txt <- "mean" ##choose statistic - "mean" or "median".
QCadjvals <- FALSE ##set values to NA unless quality flag = "Good", "Acceptable", "Outlier" (TRUE, FALSE)
write2file <- FALSE ##write output file (TRUE, FALSE)
graphs2file <- FALSE ##output graph files (TRUE, FALSE)
##read in a single .csv file or multiple .csv files
if(single_file){
csv_file <- paste0(data_dir,fname) ##name of single .csv file
}else{
csv_file <- list.files(data_dir,pattern="\\.csv$",full.names=T) ##list .csv files from data directory
}
##use purrr's map_df() from the tidyverse package to read in file (or combine multiple .csv files in chronological order) and create data frame
data_1.df <- map_df(csv_file, read.csv)
Clean up data by:
data_2.df <- data_1.df %>% select(-(contains("Quality.Last.Modified")|contains("Precipitation..mm...Site")))
##clean up column names - remove successive number of periods, replace with underscore, remove last underscore
cols <- colnames(data_2.df) ##get column names
new_cols <- gsub("\\.{6,}", "_", cols)
new_cols <- gsub("\\.{3,}", "_", new_cols)
new_cols <- gsub("\\.{2,}", "_", new_cols)
new_cols <- gsub("\\.{1,}", "_", new_cols)
new_cols <- sub("_$", "", new_cols)
colnames(data_2.df) <- new_cols #replace with new column names
##convert Datetime column from character class to POSIX class
data_2.df$Datetime <- as.POSIXct(data_2.df$Datetime,format = "%Y/%m/%d %H:%M:%S",tz="UTC")
##adjust data by quality flags if selected
if(QCadjvals){
pretxt <- "QCadjusted_"
##get data columns as data frame
data_3.df <- data_2.df %>% select(-(contains("Datetime")|contains("Quality")))
##get quality info columns as a new data frame
qualcols.df <- data_2.df[grep("Quality",colnames(data_2.df))]
names(qualcols.df) <- sub("_Quality", "", names(qualcols.df))
data_3.df <- data_3.df %>%
mutate(across(everything(), ~ ifelse(
is.na(qualcols.df[[cur_column()]]) |
qualcols.df[[cur_column()]] == "" |
!qualcols.df[[cur_column()]] %in% c("Good", "Acceptable", "Outlier"),
NA, .)))
}else{ ##get data columns as new data frame
pretxt <- ""
##get data columns as data frame
data_3.df <- data_2.df %>% select(-(contains("Datetime")|contains("Quality")))
}
##add Datetime column back
data_3.df$Datetime <- data_2.df$Datetime
##adjust Datetime back by 15 minutes
data_3.df$Datetime <- as.POSIXct(data_3.df$Datetime - (15*60),format = "%Y-%m-%d %H:%M:%S",tz="UTC")
##create Date column without time
data_3.df$Date <- as.Date(data_3.df$Datetime, format = "%Y-%m-%d %H:%M:%S",tz="UTC")
##make Date first column
data_3.df <- data_3.df %>%
select(Date, everything())
##make Datetime second column
data_3.df <- data_3.df %>%
relocate(Datetime, .after = Date)
##get variable names without catchment info
orig_var_names <- unique(gsub(paste0('_Catchment.*'), '', colnames(data_3.df)))
##get start and end dates
start_date <- min(data_3.df$Date,na.rm=T)
end_date <- max(data_3.df$Date,na.rm=T)
##create sequence of complete dates
complete_dates <- data.frame(Date = seq(start_date,end_date,by="day"))##complete sequence of dates
The following code chunk loops through each Catchment (1-15) and processes each variable in turn after some preparatory data manipulation:
In the case of Catchment 4, there are 2 columns suffixed with ‘Prior_2013_08_13’ and ‘After_2013_08_13’ for each variable and so the data are amalgamated into a single column for each variable.
In order to balance the number of columns between catchment data frames (especially if the data span several years), where some variables are not measured in all catchments or were not measured at the beginning of the data collection, columns of that variable are created and infilled with ‘NA’ at each time step. For example:
Total oxidisable N (TotN_mg_l) is calculated for each 15 minute time step from the nitrate/nitrite, ammonia and ammonium variables according to the following conditions:
For each variable, the chosen daily statistic (mean or median) and daily sum of missing values (variable names suffixed with “_nmissing”) are calculated and merged in a new data frame as a list element. Columns for Year, Month, Day, Farmlet and Catchment number are added to the new data frame and the position of the columns is reordered to improve the layout.
The list of individual catchment data frames are then combined into a single data frame, an example of which is given below.
##function to move a named column after a specified column:
move_col_after <- function(obj, move_col, after_col) {
col_names <- colnames(obj)
move_idx <- which(col_names == move_col)
after_idx <- which(col_names == after_col)
##remove the column to be moved
col_names <- col_names[-move_idx]
##insert the column at the desired position
new_order <- append(col_names, move_col, after = after_idx)
##reorder the columns
obj <- obj[, new_order]
return(obj)
}
##catchment numbers within farmlets
green <- c(4,5,6,12,13)
blue <- c(9,8,7,11,14)
red <- c(2,3,1,10,15)
##farmlet colours in order of catchments 1:15
farmcols = c("red","red","red",
"green","green","green",
"blue","blue","blue",
"red","blue","green",
"green","blue","red")
##define missing value thresholds
Ammonia_mg_l <- 10
Ammonium_mg_l <- 10
Total_Phosphorus_mg_l <- 10
Ortho_Phosphorus_mg_l <- 10
Nitrite_and_Nitrate_mg_l <- 50
TotN_mg_l <- 50
Fluorescent_Dissolved_Organic_Matter_ug_l_QSU <- 50
Turbidity_FNU <- 50
Precipitation_mm <- 50
Soil_Temperature_15cm_Depth_oC <- 50
Soil_Moisture_10cm_Depth <- 50
Water_Temperature_Flume_oC <- 75
Water_Temperature_Flow_cell_oC <- 75
Flow_l_s <- 75
Pump_On_Off <- 75
pH <- 75
Conductivity_uS_cm <- 75
Dissolved_Oxygen <- 75
##create a data frame of missing value thresholds
thresh.df <- data.frame(
Ammonia_mg_l = Ammonia_mg_l,
Ammonium_mg_l = Ammonium_mg_l,
Total_Phosphorus_mg_l = Total_Phosphorus_mg_l,
Ortho_Phosphorus_mg_l = Ortho_Phosphorus_mg_l,
Nitrite_and_Nitrate_mg_l = Nitrite_and_Nitrate_mg_l,
TotN_mg_l = TotN_mg_l,
Fluorescent_Dissolved_Organic_Matter_ug_l_QSU = Fluorescent_Dissolved_Organic_Matter_ug_l_QSU,
Turbidity_FNU = Turbidity_FNU,
Precipitation_mm = Precipitation_mm,
Soil_Temperature_15cm_Depth_oC = Soil_Temperature_15cm_Depth_oC,
Soil_Moisture_10cm_Depth = Soil_Moisture_10cm_Depth,
Water_Temperature_Flume_oC = Water_Temperature_Flume_oC,
Water_Temperature_Flow_cell_oC = Water_Temperature_Flow_cell_oC,
Flow_l_s = Flow_l_s,
Pump_On_Off = Pump_On_Off,
pH = pH,
Conductivity_uS_cm = Conductivity_uS_cm,
Dissolved_Oxygen = Dissolved_Oxygen,
stringsAsFactors = FALSE) # Prevents automatic conversion to factors
##catchment loop
catch.lst <- list() ##create list to collect catchment data frames
catch_mvadj.lst <- list()##create list to collect catchment adjusted by missing value data frames
for(i in 1:15){
##Catchment 4 manipulation: if columns contain 'Prior_2013_08_13' and 'After_2013_08_13' for each variable
##by amalgamating them into a single column
if(i == 4){
catch_txt <- paste0("Catchment_",i,"\\S") ##search text for Catchment 4 columns
catch <- data_3.df[,grep(catch_txt,colnames(data_3.df))] ##get catchment columns
var_names <- unique(gsub(paste0('_Catchment_.*'), '', colnames(catch))) ##remove everything after 'Catchment'
catch <- cbind(Date=data_3.df$Date,catch)
for(v in var_names){ ##loop though variables
indx <- grep(paste0(v,"\\S"),colnames(catch)) ##get column indices for each 'Prior_2013_08_13' and 'After_2013_08_13' variable
if(length(indx)==2){
x <- catch[,indx[1]]
y <- catch[,indx[2]]
colname_x <- colnames(catch)[indx[1]]
colname_y <- colnames(catch)[indx[2]]
catch$coljoin <- dplyr::coalesce(x,y)
names(catch)[names(catch)=="coljoin"] <- v
catch[,colname_x] <- NULL #delete 'Prior' column
catch[,colname_y] <- NULL #delete 'After' column
}else{
names(catch)[indx] <- v
}
}
}else{
catch_txt <- paste0("Catchment_",i,"$")
catch <- data_3.df[,grep(catch_txt,colnames(data_3.df))] ##get catchment columns
var_names <- gsub(paste0('_Catchment_.*'), '', colnames(catch)) ##remove 'Catchment' and number from variable names
colnames(catch) <- var_names ##rename catch columns
##add Date column back in
catch <- cbind(Date=data_3.df$Date,catch)
} #end if(i == 4)
##to balance the number of columns between catchment data frames create columns of variables with missing values if they don't exist
##check if 'Total_Phosphorus_mg_l_Catchment_X' column (TP) exists & create if not
tp <- paste0("Total_Phosphorus_mg_l")
if (!(tp %in% colnames(catch))) {
#if the column does not exist, create it with NA values
catch[[tp]] <- NA
}
##check if 'Ortho_Phosphorus_mg_l_Catchment_X' column (OP) exists & create if not
op <- paste0("Ortho_Phosphorus_mg_l")
if (!(op %in% colnames(catch))) {
#if the column does not exist, create it with NA values
catch[[op]] <- NA
}
##check if 'Fluorescent_Dissolved_Organic_Matter_ug_l_QSU_Catchment' (FDOM) column exists & create if not
fdom <- paste0("Fluorescent_Dissolved_Organic_Matter_ug_l_QSU")
if (!(fdom %in% colnames(catch))) {
#if the column does not exist, create it with NA values
catch[[fdom]] <- NA
}
##calculate total oxidisable N (TotN) based on conditions
##get column numbers for Nitrite_and_Nitrate, Ammonia & Ammonium,
nox_col <- grep(paste0("^Nitrite_and_Nitrate_mg_l"),colnames(catch))##get column number of Nitrite_and_Nitrate
nh3_col <- grep(paste0("^Ammonia_mg_l"),colnames(catch))##get column number of Ammonia
nh4_col <- grep(paste0("^Ammonium_mg_l"),colnames(catch))##get column number of Ammonium
catch$TotN_mg_l <- ifelse(is.na(catch[,nox_col]) & is.na(catch[,nh3_col]) & is.na(catch[,nh4_col]),NA, ##if all NA return NA
ifelse(!is.na(catch[,nox_col]) & is.na(catch[,nh3_col]) & is.na(catch[,nh4_col]),catch[,nox_col], ##if NOX_N is not NA but Ammonia and Ammonium are NA return NOX_N value
ifelse(is.na(catch[,nox_col]) & !is.na(catch[,nh3_col]) & !is.na(catch[,nh4_col]),NA, #if NOX_N is NA but Ammonia and Ammonium are not NA return NA
rowSums(catch[,c(nox_col,nh3_col,nh4_col)],na.rm = TRUE))))
##use 'move_col_after' function to move TOTN_mg_l to after Ammonium column
catch <- move_col_after(catch, "TotN_mg_l", names(catch[grep(paste0("^Ammonium_mg_l"),colnames(catch))]))
##use 'move_col_after' function to reorder columns
catch <- move_col_after(catch, names(catch[grep(paste0("^Fluorescent_Dissolved"),colnames(catch))]),
names(catch[grep(paste0("^Turbidity"),colnames(catch))]))
catch <- move_col_after(catch, names(catch[grep(paste0("^Ortho_Phosphorus"),colnames(catch))]),
names(catch[grep(paste0("^Turbidity"),colnames(catch))]))
catch <- move_col_after(catch, names(catch[grep(paste0("^Total_Phosphorus"),colnames(catch))]),
names(catch[grep(paste0("^Turbidity"),colnames(catch))]))
##get variable names except Date & Datetime
catch_names <- colnames(catch[, !names(catch) %in% c("Date")])
##calculate daily stat according to user choice
if (stat_txt == "median"){
catch_stats <- catch %>%
group_by(Date) %>%
summarise(across(all_of(catch_names), \(x) median(x, na.rm = TRUE)))
}else if(stat_txt == "mean"){
catch_stats <- catch %>%
group_by(Date) %>%
summarise(across(all_of(catch_names), \(x) mean(x, na.rm = TRUE)))
}else{
stop("statistic not recognised")
}
##calculate daily missing values
catch_miss <- catch %>%
group_by(Date) %>%
summarize(across(all_of(catch_names),~ sum(is.na(.))))
##rename columns
colnames(catch_miss) <- c("Date",paste(colnames(catch_miss[2:ncol(catch_miss)]),"nmissing",sep = "_"))
##merge daily stats and missing values
catch_merge <- merge(catch_stats,catch_miss,by="Date")
##reorder columns so missing values next to variable
for(j in 1:length(catch_names)){
indxs <- grep(paste0("^",catch_names[j]),colnames(catch_merge)) ##get column indices for variable
##use move_col_after function on each variable
catch_merge <- move_col_after(catch_merge, colnames(catch_merge)[indxs[2]], colnames(catch_merge)[indxs[1]])
}
##fill in missing dates with complete dates
complete_catch_merge <- complete_dates %>%
left_join(catch_merge, by = "Date") %>%
arrange(Date)
##create year, month & day columns
complete_catch_merge$Day <- as.numeric(format(as.Date(complete_catch_merge$Date),"%d"))
complete_catch_merge$Year <- as.numeric(format(as.Date(complete_catch_merge$Date),"%Y"))
complete_catch_merge$Month <- as.numeric(format(as.Date(complete_catch_merge$Date),"%m"))
##add Catchment number & Farmlet columns
complete_catch_merge$Catchment <- i
complete_catch_merge$Farmlet <- ifelse(i %in% red, "Red", ifelse(i %in% green, "Green", "Blue"))
##reorder Catchment & Farmlet columns to front
complete_catch_merge <- complete_catch_merge[, c("Farmlet", setdiff(colnames(complete_catch_merge), "Farmlet"))]
complete_catch_merge <- complete_catch_merge[, c("Catchment", setdiff(colnames(complete_catch_merge), "Catchment"))]
##reorder Month & Year columns to front after the Date column
complete_catch_merge <- complete_catch_merge[, c("Day", setdiff(colnames(complete_catch_merge), "Day"))]
complete_catch_merge <- complete_catch_merge[, c("Month", setdiff(colnames(complete_catch_merge), "Month"))]
complete_catch_merge <- complete_catch_merge[, c("Year", setdiff(colnames(complete_catch_merge), "Year"))]
complete_catch_merge <- complete_catch_merge[, c("Date", setdiff(colnames(complete_catch_merge), "Date"))]
##add data frame to list
catch.lst[[i]] <- complete_catch_merge
##add catchment name to catchment data set list
names(catch.lst)[i] <- paste0("Catchment_",i)
##adjust data by missing value thresholds
##create empty dataframe to collect variables
adjmv.df <- data.frame(matrix(ncol = 0, nrow = nrow(complete_catch_merge)))
##loop through variables
for (av in catch_names){
##get threshold value for variable
thmv <- thresh.df[grep(av, colnames(thresh.df))]
adjmvvals <- complete_catch_merge[grep(av, colnames(complete_catch_merge))]
# Update values based on threshold for missing values
adjmvvals[1] <- apply(adjmvvals, 1, function(row) {
if (!is.na(row[2]) && row[2] >= thmv) { #check missing values against threshold
row[1] <- NA #adjust values
} else {
row[1]
}
})
##bind variables into data frames
adjmv.df <- cbind(adjmv.df,adjmvvals)
}#end for (v in var_names)
##add adjusted values to list
catch_mvadj.lst[[i]] <- adjmv.df
##add catchment name to adjusted data set list
names(catch_mvadj.lst)[i] <- paste0("Catchment_",i)
}##end of for(i in 1:15)
##bind lists into data frames
complete_stat.df <- do.call("rbind", catch.lst)
##bind adjusted lists into data frames
mvadj_complete_stat.df <- do.call("rbind", catch_mvadj.lst)
##add other columns e.g. Date, Year, Month, Day, Catchment, Farmlet
mvadj_complete_stat.df <- cbind(complete_stat.df[,1:6],mvadj_complete_stat.df)
##renames rows
row.names(complete_stat.df) <- as.numeric(seq(1:nrow(complete_stat.df)))
##renames rows in adjusted data
row.names(mvadj_complete_stat.df) <- as.numeric(seq(1:nrow(mvadj_complete_stat.df)))
##get total missing values
A <- complete_stat.df %>% select(-contains('_nmissing')) ##drop '_missing' columns
B <- mvadj_complete_stat.df %>% select(-contains('_nmissing'))##drop '_missing' columns
original <- data.frame(original=colSums(is.na(A[7:ncol(A)]))) #total missing values
adjusted <- data.frame(adjusted=colSums(is.na(B[7:ncol(B)]))) #total missing values
##bind sums of missing values for original and adjusted values
summv.df <- cbind(original,adjusted)
summv.df$difference <- summv.df$adjusted-summv.df$original
Example of final data frame (showing 7 columns and 10 rows only)
| Date | Year | Month | Day | Catchment | Farmlet | Flow_l_s | Flow_l_s_nmissing |
|---|---|---|---|---|---|---|---|
| 2012-01-01 | 2012 | 1 | 1 | 1 | Red | NaN | 96 |
| 2012-01-02 | 2012 | 1 | 2 | 1 | Red | NaN | 96 |
| 2012-01-03 | 2012 | 1 | 3 | 1 | Red | NaN | 96 |
| 2012-01-04 | 2012 | 1 | 4 | 1 | Red | NaN | 96 |
| 2012-01-05 | 2012 | 1 | 5 | 1 | Red | NaN | 96 |
| 2012-01-06 | 2012 | 1 | 6 | 1 | Red | NaN | 96 |
| 2012-01-07 | 2012 | 1 | 7 | 1 | Red | NaN | 96 |
| 2012-01-08 | 2012 | 1 | 8 | 1 | Red | NaN | 96 |
| 2012-01-09 | 2012 | 1 | 9 | 1 | Red | NaN | 96 |
| 2012-01-10 | 2012 | 1 | 10 | 1 | Red | NaN | 96 |
Write results to a .csv file if selected above:
if(write2file){
write.csv(complete_stat.df,paste0(output_dir,pretxt,"Daily_",str_to_title(stat_txt),"s","_",start_date,"_",end_date,".csv"),row.names = F)
write.csv(mvadj_complete_stat.df,paste0(output_dir,pretxt,"Daily_",str_to_title(stat_txt),"s","_",start_date,"_",end_date,"_mvadjusted",".csv"),row.names = F)
write.csv(summv.df,paste0(output_dir,pretxt,"Summary_",str_to_title(stat_txt),"s","_MissVals_Adjusted_by_Threshold_",start_date,"_",end_date,".csv"),row.names = T)
}
If selected above, the following sections produce box and whisker graphs of the chosen statistic for variables by individual catchments and years, and by individual catchments across all the years, and saves them as .png files. The graphs are multi-plots showing the influence of each of the 4 different thresholds for number of missing values on the statistic. Text is added to each graph giving the number of missing values threshold, the number of rows of non-missing data and the percentage of these of the data set. Examples of the box and whisker plots for ‘pH’ and ‘Nitrate_and_Nitrite_mg_l’ are given below.
if(graphs2file){
##get years
yrs <- unique(complete_stat.df$Year)
for (yr in yrs) {
##subset data by year
subcomplete.df <- complete_stat.df[complete_stat.df$Date >= paste0(yr, "-01-01")
& complete_stat.df$Date <= paste0(yr, "-12-31"),]
##get variable names excluding Date, Catchment, Year, Month
varcols <- colnames(subcomplete.df[7:ncol(subcomplete.df)])
##remove "_nmissing and make names unique
varnames <- unique(sub("_nmissing", "", varcols))
##remove "Pump_On_Off"
varnames <- varnames[!varnames %in% c("Pump_On_Off")]
##get Date, Catchment, Year, Month, Day columns
firstcols <- subcomplete.df[, 1:5]
#set threshold(s) for missing values
thresh <- c(0, 24, 48, 72)
for (varnam in varnames) {
##get variable column including missing values colums
vari <-subcomplete.df[, grep(varnam, colnames(subcomplete.df))]
##join Date, Catchment, Year, Month, Day columns
vari <- cbind(firstcols, vari)
n_vari <- nrow(vari) ##number of rows in variable
##open graphics device and name of file
png(paste0(output_dir,pretxt,str_to_title(stat_txt),"s_Box_", varnam, ".png"),width = 1500,height = 900)
##define multiple plot windows
par(mfrow = c(4, 1))
par(mar= c(4, 4, 4, 2) + 0.1) ##setting for plot margin; default = c(5, 4, 4, 2) + 0.1
par(mgp = c(2, 1, 0)) # setting for axis title, axis labels, axis line; default = c(3, 1, 0)
##subset the data by chosen missing values threshold
for (thr in thresh) {
if (thr == 0) {
df <- subset(vari, vari[paste0(varnam, '_nmissing')] == thr)
miss_txt <- paste("n miss. vals. = ")
} else{
df <- subset(vari, vari[paste0(varnam, '_nmissing')] >= thr)
miss_txt <- paste("n miss. vals. = ")
}
n_df <- nrow(df) ##number of rows in subset
if (n_df != 0 &&
any(!is.na(df[[varnam]]))) {
## if data present and not all NA
tuk <- table(outlier_turkey(df[[varnam]], k = 1.5))##calculate Tukey fences
maxy <- max(df[[varnam]],na.rm=t)
maxy_1 <- maxy*0.99
maxy_2 <- maxy_1-maxy_1*0.07
perc <- round((n_df / n_vari) * 100, 0) ##percentage of original
##pre-plot to get details of box and whisker graph
b <- boxplot(df[[varnam]] ~ df$Catchment,plot=F)
catchnums <- as.numeric(b$names) ##get catchment numbers to define farmlet colours
boxplot(df[[varnam]] ~ df$Catchment,
xlab = "catchment",ylab = "value",col = farmcols[catchnums],notch = F,plot = T,na.omit = F,cex.lab = 1.5)
text(x = 1,y = maxy_1,labels = paste0("F = ",tuk[1]),
col = "darkgreen",cex = 1.2)
text(x = 1,y = maxy_2,labels = paste0("T = ",tuk[2]),
col = "darkblue",cex = 1.2)
mtext(side= 3,paste0(miss_txt,thr,"; n rows data = ",nrow(df),"; ",perc,"%"),line=0.4,cex = 0.9)
} else{
perc <- round((n_df / n_vari) * 100, 0) ##percentage of original
boxplot(NA, xlab = "",ylab = "",xlim = c(0, 1),ylim = c(0, 1))
text(x = 0.5,y = 0.5,substitute(paste(bold("No Data"))),col = "black",cex = 1.5)
mtext(side= 3,paste0(miss_txt,thr,"; n rows data = ",nrow(df),"; ",perc,"%"),line=0.4,cex = 0.9)
}
}# end for(thr in thresh)
###add main title
mtext(paste0(yr,": ",varnam), side = 3, line = - 2, outer = T,cex = 1.3,adj=0)
}# end for(varnam in varnames)
}# end for(yr in yrs)
}# end if(graphs2file)
if(graphs2file){
##get variable names excluding Date, Catchment, Year, Month
varcols <- colnames(complete_stat.df[7:ncol(complete_stat.df)])
##remove "_nmissing and make names unique
varnames <- unique(sub("_nmissing", "", varcols))
##remove "Pump_On_Off"
varnames <- varnames[!varnames %in% c("Pump_On_Off")]
##get Date, Catchment, Year, Month, Day columns
firstcols <- complete_stat.df[, 1:5]
#set numbers of missing values at 0%, 25%, 50%, 75% of 96
thresh <- c(0, 24, 48, 72)
for (varnam in varnames) {
##get variable column including missing values colums
vari <- complete_stat.df[, grep(varnam, colnames(complete_stat.df))]
##join Date, Catchment, Year, Month, Day columns
vari <- cbind(firstcols, vari)
n_vari <- nrow(vari) ##number of rows in variable
##open graphics device and name of file
png(paste0(output_dir,pretxt,str_to_title(stat_txt),"s_Box_", varnam, "_", yr, ".png"),width = 1500,height = 900)
##define multiple plot windows
par(mfrow = c(4, 2))
par(mar= c(4, 4, 4, 2) + 0.1) ##setting for plot margin; default = c(5, 4, 4, 2) + 0.1
par(mgp = c(2, 1, 0)) # setting for axis title, axis labels, axis line; default = c(3, 1, 0)
##subset the data by chosen missing values threshold
for (thr in thresh) {
if (thr == 0) {
df <- subset(vari, vari[paste0(varnam, '_nmissing')] == thr)
miss_txt <- paste("n miss. vals. = ")
} else{
df <- subset(vari, vari[paste0(varnam, '_nmissing')] >= thr)
miss_txt <- paste("n miss. vals. >= ")
}
n_df <- nrow(df) ##number of rows in subset
if (n_df != 0 && any(!is.na(df[[varnam]]))) {
## if data present and not all NA
tuk <- table(outlier_turkey(df[[varnam]], k = 1.5)) ##calculate Tukey fences
perc <- round((n_df / n_vari) * 100, 0) ##percentage of original
maxy <- max(df[[varnam]],na.rm=t)
maxy_1 <- maxy*0.99
maxy_2 <- maxy_1-maxy_1*0.07
xticks <- unique(format(df$Date,"%Y"))
xat <- 1:length(xticks)
##pre-plot to get details of box and whisker graph
b <- boxplot(df[[varnam]] ~ df$Catchment,plot=F)
catchnums <- as.numeric(b$names) ##get catchment numbers to define farmlet colours
##by catchment
boxplot(df[[varnam]] ~ df$Catchment,
xlab = "catchment",ylab = "value", col = farmcols[catchnums],notch = F,plot = T,na.omit = F,cex.lab = 1.5)
text(x = 1,y = maxy_1,labels = paste0("F = ",tuk[1]),
col = "darkgreen",cex = 1.2)
text(x = 1,y = maxy_2,labels = paste0("T = ",tuk[2]),
col = "darkblue",cex = 1.2)
mtext(side= 3,paste0(miss_txt,thr,"; n rows data = ",nrow(df),"; ",perc,"%"),line=0.4,cex = 0.9)
}else{
perc <- round((n_df / n_vari) * 100, 0) ##percentage of original
boxplot(NA, xlab = "",ylab = "",xlim = c(0, 1),ylim = c(0, 1))
text(x = 0.5,y = 0.5,labels = "No Data",col = "black",cex = 1.5)
mtext(side= 3,paste0(miss_txt,thr,"; n rows data = ",nrow(df),"; ",perc,"%"),line=0.4,cex = 0.9)
}
##by year
if (n_df != 0 && any(!is.na(df[[varnam]]))) {
boxplot(df[[varnam]] ~ df$Year,
xlab = "year",ylab = "value",col = "lightblue",notch = F,plot = T,na.omit = F,cex.lab = 1.5)
axis(1, at=xat,labels=xticks, las=1)
mtext(side= 3,paste0(miss_txt,thr,"; n rows data = ",nrow(df),"; ",perc,"%"),line=0.4,cex = 0.9)
}else{
perc <- round((n_df / n_vari) * 100, 0) ##percentage of original
boxplot(NA, xlab = "",ylab = "",xlim = c(0, 1),ylim = c(0, 1))
text(x = 0.5,y = 0.5,labels = "No Data",col = "black",cex = 1.5)
mtext(side= 3,paste0(miss_txt,thr,"; n rows data = ",nrow(df),"; ",perc,"%"),line=0.4,cex = 0.9)
}
}#end for (thr in thresh)
###add main title
mtext(paste0(varnam," - by Catchment and by Year"), side = 3, line = - 2, outer = T,cex = 1.3,adj=0)
dev.off()##close graphics device
}# end for(varnam in varnames)
} # end if(graphs2file)
An explanation for the suggested acceptable number of missing values for each variable is as follows and is further itemised in Table 2 below.
For variables where a few aberrant observations may considerably skew the scale we recommend the number of missing values for that variable is less than 10.
For variables that may shift significantly in value across time but are not as ephemeral as those above we recommend the maximum number of missing values of 50.
For variables that are the most stable we recommend that a maximum number of 75 missing values is acceptable.
Tukey, J. W., 1977. Exploratory Data Analysis. Addison-Wesley Publishing Co., Boston, Massachusetts, USA.