1 Introduction

 

The R code within this document is designed to be used with comma separated data files (.csv) of flume and soil moisture variables captured at a fine-scale temporal resolution of 15 minutes that can be downloaded from the North Wyke Farm Platform (NWFP) data portal: https://nwfp.rothamsted.ac.uk/.

The code was developed using R version 4.2.1 (“Funny-Looking Kid”) and the following R packages:
tidyverse, fmsb, reshape2, padr, janitor, writexl, lares, stringr.

Data users may prefer to convert the 15 minute data to that of a daily median or mean of the values. However, it is important to note that the number of within day missing values will impact the reliability of the statistic. The influence of ‘missingness’ on the statistic varies depending on the variable and the variation in its value throughout the day. For example, the variance in pH values throughout a day is relatively small and so the number of missing values will have less of an effect on the statistical result compared with nitrate, which may have a greater variation between values depending on conditions.

Users may wish to adjust the number of data values processed based on the Quality Control (QC) flag that is assigned to each data point. In the code, these are defined as ‘Good’, ‘Acceptable’ and ‘Outlier’ but these can be altered by the user as appropriate. Data that do not have the defined QC flags are set to ‘NA’. The types of QC flag and their description are given in Table 1 below.

An example of the difference in number of missing values for means between the original data and the effect of adjusting the data by the QC flags ‘Good’, ‘Acceptable’ and ’Outlier for a single year (2020) is shown in a graphic here.

 

Table 1: Quality control flags and description

Flag Description
Not set No information on quality available
Good Data were checked and deemed good
Acceptable Data were checked and no issues were found
Suspicious Data were checked and might have been affected by an event
Highly Suspicious Data were checked and have definitely been affected by an event
Reject Data were rejected
High Sensor Drift Instrument calibration values were high over the time period
Missing Sensor Drift Missing instrument calibration information and level of instrument drift during the period is unknown
Outlier The value falls outside ‘regular’ limits but within the extreme limits and therefore could still be fine
Level Reset Level pressure sensors were reset indicating this could result in a step in flow
Calibration Calibration ‘Datetime’ of the instrument

 

2 Overview of the code

 

The daily means or medians (depending on the user requirements) are calculated alongside an accompanying missing values count (up to 96) for each variable are output as a csv file. The count acts as a quality index, and depending upon the volatility of the variable, we propose a set of conditions as guidance for the selection of observations based on the number of missing values, which are outlined in the Recommendations section in the Appendix. Recommended threshold limits for missing values for each variable are included in Table 2 and which are based on expert opinion following scrutiny of the box and whisker graphics as described above. The code adjusts the data based on the threshold limits suggested and the results are output as a separate csv file. A csv file is also created containing the total number of missing values for each variable both before and after threshold adjustment.

To help visualise the data with regards to the potential impact of missing values on the statistic, four threshold values of 0, 24, 48 and 72 missing values/day (representing 0%, 25%, 50%, 75% of 96) are defined. Box and whisker graphs of variables by individual catchments and years, and by individual catchments across all the years are produced for each threshold subset of the data. Catchments in the graphs are coloured by the farmlet system to which they belong i.e. red, green, blue.

The graphs are sub-titled by the number of missing value threshold, alongside the number of rows of data in the subset, and the percentage they represent of the whole data set. Tukey fences [Tukey, 1977] for numbers of true (T) or false (F) outliers for each threshold are included as text on the graphs. The box and whisker graphs help illustrate when missing values are greater than or equal to the threshold and how the chosen statistic is influenced.

Examples of box and whisker graphs for pH and nitrate by individual catchment and year, and by catchments (combined years) and by year (combined catchments) are given in the appendix.

   

3 Preliminary steps

 

  • Before starting the processing of the data, it is a good idea to clear the ‘Environment’ from any variables that may remain from a previous session:
rm(list=ls())

 

  • A number of different packages are used in the script and the function below will check whether they are installed. If not it will install them alongside their dependencies i.e. supplementary required packages, and then load the packages into the R session:
##required packages:
packlist <- c("tidyverse","fmsb","reshape2","padr","janitor","writexl","lares","stringr")

instpak <- function(pkg){
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg)) 
    suppressWarnings(install.packages(new.pkg, dependencies = TRUE))
  sapply(pkg, require, character.only = TRUE)
}

invisible(instpak(packlist))

   

4 Reading & Cleaning up the Data

   

  • Before reading in the data and writing output files the user should:

    • Edit the directories and folders for where the data are kept, and the output directory for results.
data_dir <- " " ##data directory e.g. "C:/Users/my_data_folder/"
output_dir <- " " ##output directory e.g. "C:/Users/my_output_folder/"
  • Choose (TRUE, FALSE) if only a single .csv file is to be processed, and add the name of the file if ‘TRUE’.
    If ‘FALSE’ the script will read in any .csv data files in the data directory in chronological order and combined them into a single data frame.
single_file <- FALSE ##select single file (TRUE, FALSE)
fname <- "" ##single file name e.g, "measurements_2020-01-01_2020-12-31.csv"
  • Choose (“mean”, “median”) the required daily statistic. Any other text will result in an error.
stat_txt <- "mean" ##choose statistic - "mean" or "median". 
  • Choose (TRUE, FALSE) whether to set values to ‘NA’ unless the QC quality flag is set to “Good”, Acceptable” and “Outlier” or not. An example graph illustrating the difference in number of missing values between original and adjust data is given in the appendix.
QCadjvals <- FALSE ##set values to NA unless quality flag = "Good", "Acceptable", "Outlier" (TRUE, FALSE)
  • Choose (TRUE, FALSE) whether to output the processed data and associated graphics files or not.
write2file <- FALSE ##write output file (TRUE, FALSE)
graphs2file <- FALSE ##output graph files (TRUE, FALSE)

   

  • Read the .csv data file(s) and create a data frame.
##read in a single .csv file or multiple .csv files
if(single_file){
csv_file <- paste0(data_dir,fname) ##name of single .csv file
}else{
csv_file <- list.files(data_dir,pattern="\\.csv$",full.names=T) ##list .csv files from data directory
}
##use purrr's map_df() from the tidyverse package to read in file (or combine multiple .csv files in chronological order) and create data frame
data_1.df <- map_df(csv_file, read.csv)

   

  • Clean up data by:

    • Dropping unrequired columns (i.e. “Quality”,“Precipitation..mm…Site.”) and tidying up column names by removing periods (.) and replacing with underscores (_).
data_2.df <- data_1.df %>% select(-(contains("Quality.Last.Modified")|contains("Precipitation..mm...Site")))

##clean up column names - remove successive number of periods, replace with underscore, remove last underscore 
cols <- colnames(data_2.df) ##get column names
new_cols <- gsub("\\.{6,}", "_", cols)
new_cols <- gsub("\\.{3,}", "_", new_cols)
new_cols <- gsub("\\.{2,}", "_", new_cols)
new_cols <- gsub("\\.{1,}", "_", new_cols)
new_cols <- sub("_$", "", new_cols)

colnames(data_2.df) <- new_cols #replace with new column names
  • Converting ‘Datetime’ from character to POSIXct.
##convert Datetime column from character class to POSIX class
data_2.df$Datetime <- as.POSIXct(data_2.df$Datetime,format = "%Y/%m/%d %H:%M:%S",tz="UTC")
  • If selected above, adjusting the data according to the QC flags: currently any flag values not ‘Good’, ‘Acceptable’ or ‘Outlier’ are adjusted to ‘NA’. The list of flags could be edited according to user requirements.
##adjust data by quality flags if selected
if(QCadjvals){
  pretxt <- "QCadjusted_"
##get data columns as data frame
data_3.df <- data_2.df %>% select(-(contains("Datetime")|contains("Quality")))

##get quality info columns as a new data frame
qualcols.df <- data_2.df[grep("Quality",colnames(data_2.df))]
names(qualcols.df) <- sub("_Quality", "", names(qualcols.df))

data_3.df <- data_3.df %>%
  mutate(across(everything(), ~ ifelse(
    is.na(qualcols.df[[cur_column()]]) | 
      qualcols.df[[cur_column()]] == "" | 
      !qualcols.df[[cur_column()]] %in% c("Good", "Acceptable", "Outlier"), 
    NA, .)))

}else{ ##get data columns as new data frame
  pretxt <- ""
  ##get data columns as data frame
  data_3.df <- data_2.df %>% select(-(contains("Datetime")|contains("Quality")))
}
##add Datetime column back 
data_3.df$Datetime <- data_2.df$Datetime
  • Adjusting ‘Datetime’ back by 15 minutes so that the day starts at 00:00:00 (not 00:15:00) and finishes at 23:45:00 (not the next day at 00:00:00).
##adjust Datetime back by 15 minutes
data_3.df$Datetime <- as.POSIXct(data_3.df$Datetime - (15*60),format = "%Y-%m-%d %H:%M:%S",tz="UTC")
  • Creating a ‘Date’ column (without the time). Move the ‘Date’ and ‘Datetime’ columns to the first position and second position, respectively.
##create Date column without time
data_3.df$Date <- as.Date(data_3.df$Datetime, format = "%Y-%m-%d %H:%M:%S",tz="UTC")

##make Date first column
data_3.df <- data_3.df %>%
  select(Date, everything())
##make Datetime second column
data_3.df <- data_3.df %>% 
  relocate(Datetime, .after = Date)

   

  • Get variable names without catchment info:
##get variable names without catchment info
orig_var_names <- unique(gsub(paste0('_Catchment.*'), '', colnames(data_3.df)))

   

  • Get start and end dates and create sequence of complete dates:
##get start and end dates
start_date <- min(data_3.df$Date,na.rm=T) 
end_date <- max(data_3.df$Date,na.rm=T) 

##create sequence of complete dates
complete_dates <- data.frame(Date = seq(start_date,end_date,by="day"))##complete sequence of dates

   

5 Processing the Data

 

The following code chunk loops through each Catchment (1-15) and processes each variable in turn after some preparatory data manipulation:

  • In the case of Catchment 4, there are 2 columns suffixed with ‘Prior_2013_08_13’ and ‘After_2013_08_13’ for each variable and so the data are amalgamated into a single column for each variable.

  • In order to balance the number of columns between catchment data frames (especially if the data span several years), where some variables are not measured in all catchments or were not measured at the beginning of the data collection, columns of that variable are created and infilled with ‘NA’ at each time step. For example:

    • ‘Total_Phosphorus_mg_l’ was only measured in catchments 2, 5 and 8 so columns of this variable are created for the remaining other catchments
    • ‘Fluorescent_Dissolved_Organic_Matter_ug_l_QSU’ was not measured before May 2016.
       
  • Total oxidisable N (TotN_mg_l) is calculated for each 15 minute time step from the nitrate/nitrite, ammonia and ammonium variables according to the following conditions:

    • If a Nitrite_and_Nitrate_mg_l value is present but Ammonia_mg_l and Ammonium_mg_l are missing (‘NA’), return the Nitrite_and_Nitrate_mg_l value for TotN_mg_l.
    • If Nitrite_and_Nitrate_mg_l is ‘NA’ but Ammonia_mg_l and Ammonium_mg_l are not ‘NA’, return ‘NA’ for TotN_mg_l.
    • Otherwise, return the sum of Nitrite_and_Nitrate_mg_l, Ammonia_mg_l and Ammonium_mg_l.

 

For each variable, the chosen daily statistic (mean or median) and daily sum of missing values (variable names suffixed with “_nmissing”) are calculated and merged in a new data frame as a list element. Columns for Year, Month, Day, Farmlet and Catchment number are added to the new data frame and the position of the columns is reordered to improve the layout.

The list of individual catchment data frames are then combined into a single data frame, an example of which is given below.

##function to move a named column after a specified column:  
move_col_after <- function(obj, move_col, after_col) {
  col_names <- colnames(obj)
  move_idx <- which(col_names == move_col)
  after_idx <- which(col_names == after_col)
  
  ##remove the column to be moved
  col_names <- col_names[-move_idx]
  
  ##insert the column at the desired position
  new_order <- append(col_names, move_col, after = after_idx)
  
  ##reorder the columns
  obj <- obj[, new_order]
  return(obj)
}

##catchment numbers within farmlets  
green <- c(4,5,6,12,13)
blue <- c(9,8,7,11,14)
red <- c(2,3,1,10,15)

##farmlet colours in order of catchments 1:15
farmcols = c("red","red","red",
             "green","green","green",
             "blue","blue","blue",
             "red","blue","green",
             "green","blue","red")

##define missing value thresholds
Ammonia_mg_l <- 10
Ammonium_mg_l <- 10
Total_Phosphorus_mg_l <- 10
Ortho_Phosphorus_mg_l <- 10
Nitrite_and_Nitrate_mg_l <- 50
TotN_mg_l <- 50
Fluorescent_Dissolved_Organic_Matter_ug_l_QSU <- 50
Turbidity_FNU <- 50
Precipitation_mm <- 50
Soil_Temperature_15cm_Depth_oC <- 50
Soil_Moisture_10cm_Depth <- 50
Water_Temperature_Flume_oC <- 75
Water_Temperature_Flow_cell_oC <- 75
Flow_l_s <- 75
Pump_On_Off <- 75
pH <- 75
Conductivity_uS_cm <- 75
Dissolved_Oxygen <- 75

##create a data frame of missing value thresholds
thresh.df <- data.frame(
  Ammonia_mg_l = Ammonia_mg_l,
  Ammonium_mg_l = Ammonium_mg_l,
  Total_Phosphorus_mg_l = Total_Phosphorus_mg_l, 
  Ortho_Phosphorus_mg_l = Ortho_Phosphorus_mg_l,
  Nitrite_and_Nitrate_mg_l = Nitrite_and_Nitrate_mg_l,
  TotN_mg_l = TotN_mg_l,
  Fluorescent_Dissolved_Organic_Matter_ug_l_QSU = Fluorescent_Dissolved_Organic_Matter_ug_l_QSU,
  Turbidity_FNU = Turbidity_FNU,
  Precipitation_mm = Precipitation_mm,
  Soil_Temperature_15cm_Depth_oC = Soil_Temperature_15cm_Depth_oC,
  Soil_Moisture_10cm_Depth = Soil_Moisture_10cm_Depth, 
  Water_Temperature_Flume_oC = Water_Temperature_Flume_oC,
  Water_Temperature_Flow_cell_oC = Water_Temperature_Flow_cell_oC,
  Flow_l_s = Flow_l_s,
  Pump_On_Off = Pump_On_Off,
  pH = pH,
  Conductivity_uS_cm = Conductivity_uS_cm,
  Dissolved_Oxygen = Dissolved_Oxygen,
  stringsAsFactors = FALSE) # Prevents automatic conversion to factors

##catchment loop
catch.lst <- list() ##create list to collect catchment data frames
catch_mvadj.lst <- list()##create list to collect catchment adjusted by missing value data frames
for(i in 1:15){
    ##Catchment 4 manipulation: if columns contain 'Prior_2013_08_13' and 'After_2013_08_13' for each variable
    ##by amalgamating them into a single column
  if(i == 4){
    catch_txt <- paste0("Catchment_",i,"\\S") ##search text for Catchment 4 columns
    catch <- data_3.df[,grep(catch_txt,colnames(data_3.df))] ##get catchment columns
    var_names <- unique(gsub(paste0('_Catchment_.*'), '', colnames(catch))) ##remove everything after 'Catchment'
    catch <- cbind(Date=data_3.df$Date,catch)
    
    for(v in var_names){ ##loop though variables
      indx <- grep(paste0(v,"\\S"),colnames(catch)) ##get column indices for each 'Prior_2013_08_13' and 'After_2013_08_13' variable
      if(length(indx)==2){
        x <- catch[,indx[1]]
        y <- catch[,indx[2]]
        colname_x <- colnames(catch)[indx[1]]
        colname_y <- colnames(catch)[indx[2]]
        catch$coljoin <- dplyr::coalesce(x,y)
        names(catch)[names(catch)=="coljoin"] <- v
        catch[,colname_x] <- NULL #delete 'Prior' column
        catch[,colname_y] <- NULL #delete 'After' column
      }else{
        names(catch)[indx] <- v      
      }
    }
  }else{
    catch_txt <- paste0("Catchment_",i,"$") 
    catch <- data_3.df[,grep(catch_txt,colnames(data_3.df))] ##get catchment columns
    var_names <- gsub(paste0('_Catchment_.*'), '', colnames(catch)) ##remove 'Catchment' and number from variable names
    colnames(catch) <- var_names ##rename catch columns
    ##add Date column back in
    catch <- cbind(Date=data_3.df$Date,catch) 
  } #end if(i == 4)

  ##to balance the number of columns between catchment data frames create columns of variables with missing values if they don't exist 
  ##check if 'Total_Phosphorus_mg_l_Catchment_X' column (TP) exists & create if not
  tp <- paste0("Total_Phosphorus_mg_l")
  if (!(tp %in% colnames(catch))) {
    #if the column does not exist, create it with NA values
    catch[[tp]] <- NA
  }
  ##check if 'Ortho_Phosphorus_mg_l_Catchment_X' column (OP) exists & create if not
  op <- paste0("Ortho_Phosphorus_mg_l")
  if (!(op %in% colnames(catch))) {
    #if the column does not exist, create it with NA values
    catch[[op]] <- NA
  }
  
  ##check if 'Fluorescent_Dissolved_Organic_Matter_ug_l_QSU_Catchment' (FDOM) column exists & create if not
  fdom <- paste0("Fluorescent_Dissolved_Organic_Matter_ug_l_QSU")
  if (!(fdom %in% colnames(catch))) {
    #if the column does not exist, create it with NA values
    catch[[fdom]] <- NA
  }
  
  ##calculate total oxidisable N (TotN) based on conditions
  ##get column numbers for Nitrite_and_Nitrate, Ammonia & Ammonium, 
  nox_col <- grep(paste0("^Nitrite_and_Nitrate_mg_l"),colnames(catch))##get column number of Nitrite_and_Nitrate
  nh3_col <- grep(paste0("^Ammonia_mg_l"),colnames(catch))##get column number of Ammonia
  nh4_col <- grep(paste0("^Ammonium_mg_l"),colnames(catch))##get column number of Ammonium
  catch$TotN_mg_l <- ifelse(is.na(catch[,nox_col]) & is.na(catch[,nh3_col]) & is.na(catch[,nh4_col]),NA, ##if all NA return NA
                                 ifelse(!is.na(catch[,nox_col]) & is.na(catch[,nh3_col]) & is.na(catch[,nh4_col]),catch[,nox_col], ##if NOX_N is not NA but Ammonia and Ammonium are NA return NOX_N value
                                        ifelse(is.na(catch[,nox_col]) & !is.na(catch[,nh3_col]) & !is.na(catch[,nh4_col]),NA, #if NOX_N is NA but Ammonia and Ammonium are not NA return NA     
                                               rowSums(catch[,c(nox_col,nh3_col,nh4_col)],na.rm = TRUE))))
  ##use 'move_col_after' function to move TOTN_mg_l to after Ammonium column
  catch <- move_col_after(catch, "TotN_mg_l", names(catch[grep(paste0("^Ammonium_mg_l"),colnames(catch))]))
  ##use 'move_col_after' function to reorder columns
  catch <- move_col_after(catch, names(catch[grep(paste0("^Fluorescent_Dissolved"),colnames(catch))]), 
                          names(catch[grep(paste0("^Turbidity"),colnames(catch))]))
  catch <- move_col_after(catch, names(catch[grep(paste0("^Ortho_Phosphorus"),colnames(catch))]), 
                          names(catch[grep(paste0("^Turbidity"),colnames(catch))]))
  catch <- move_col_after(catch, names(catch[grep(paste0("^Total_Phosphorus"),colnames(catch))]), 
                          names(catch[grep(paste0("^Turbidity"),colnames(catch))]))
  
  ##get variable names except Date & Datetime
  catch_names <- colnames(catch[, !names(catch) %in% c("Date")]) 
  
  ##calculate daily stat according to user choice
  if (stat_txt == "median"){
  catch_stats <- catch %>% 
    group_by(Date) %>% 
    summarise(across(all_of(catch_names), \(x) median(x, na.rm = TRUE)))
  }else if(stat_txt == "mean"){
    catch_stats <- catch %>% 
      group_by(Date) %>% 
      summarise(across(all_of(catch_names), \(x) mean(x, na.rm = TRUE)))
  }else{
    stop("statistic not recognised")
  }
  
  ##calculate daily missing values
  catch_miss <- catch %>%
    group_by(Date) %>%
    summarize(across(all_of(catch_names),~ sum(is.na(.))))

  ##rename columns
  colnames(catch_miss) <- c("Date",paste(colnames(catch_miss[2:ncol(catch_miss)]),"nmissing",sep = "_"))
  
  ##merge daily stats and missing values
  catch_merge <- merge(catch_stats,catch_miss,by="Date")
  
  ##reorder columns so missing values next to variable
  for(j in 1:length(catch_names)){
    indxs <- grep(paste0("^",catch_names[j]),colnames(catch_merge)) ##get column indices for variable
    ##use move_col_after function on each variable
    catch_merge <- move_col_after(catch_merge, colnames(catch_merge)[indxs[2]], colnames(catch_merge)[indxs[1]]) 
  }
  
  ##fill in missing dates with complete dates
  complete_catch_merge <- complete_dates %>%
    left_join(catch_merge, by = "Date") %>%
    arrange(Date)
  
  ##create year, month & day columns
  complete_catch_merge$Day <- as.numeric(format(as.Date(complete_catch_merge$Date),"%d"))
  complete_catch_merge$Year <- as.numeric(format(as.Date(complete_catch_merge$Date),"%Y"))
  complete_catch_merge$Month <- as.numeric(format(as.Date(complete_catch_merge$Date),"%m"))
 
  ##add Catchment number & Farmlet columns
  complete_catch_merge$Catchment <- i
  complete_catch_merge$Farmlet <- ifelse(i %in% red, "Red", ifelse(i %in% green, "Green", "Blue"))
  
  ##reorder Catchment & Farmlet columns to front
  complete_catch_merge <- complete_catch_merge[, c("Farmlet", setdiff(colnames(complete_catch_merge), "Farmlet"))]
  complete_catch_merge <- complete_catch_merge[, c("Catchment", setdiff(colnames(complete_catch_merge), "Catchment"))]
  
  ##reorder Month & Year columns to front after the Date column
  complete_catch_merge <- complete_catch_merge[, c("Day", setdiff(colnames(complete_catch_merge), "Day"))]
  complete_catch_merge <- complete_catch_merge[, c("Month", setdiff(colnames(complete_catch_merge), "Month"))]
  complete_catch_merge <- complete_catch_merge[, c("Year", setdiff(colnames(complete_catch_merge), "Year"))]
  complete_catch_merge <- complete_catch_merge[, c("Date", setdiff(colnames(complete_catch_merge), "Date"))]
 
  ##add data frame to list
  catch.lst[[i]] <- complete_catch_merge
  ##add catchment name to catchment data set list
  names(catch.lst)[i] <- paste0("Catchment_",i)
  
##adjust data by missing value thresholds
##create empty dataframe to collect variables
adjmv.df <- data.frame(matrix(ncol = 0, nrow = nrow(complete_catch_merge)))

##loop through variables
for (av in catch_names){
  ##get threshold value for variable
  thmv <- thresh.df[grep(av, colnames(thresh.df))]
  adjmvvals <- complete_catch_merge[grep(av, colnames(complete_catch_merge))]
  # Update values based on threshold for missing values
  adjmvvals[1] <- apply(adjmvvals, 1, function(row) {
    if (!is.na(row[2]) && row[2] >= thmv) {  #check missing values against threshold
      row[1] <- NA      #adjust values
    } else {
      row[1]
    }
  })
  ##bind variables into data frames
  adjmv.df <- cbind(adjmv.df,adjmvvals)
}#end for (v in var_names)

##add adjusted values to list
catch_mvadj.lst[[i]] <- adjmv.df
##add catchment name to adjusted data set list
names(catch_mvadj.lst)[i] <- paste0("Catchment_",i)
}##end of for(i in 1:15)

##bind lists into data frames
complete_stat.df <- do.call("rbind", catch.lst)

##bind adjusted lists into data frames
mvadj_complete_stat.df <- do.call("rbind", catch_mvadj.lst) 

##add other columns e.g. Date, Year, Month, Day, Catchment, Farmlet
mvadj_complete_stat.df <- cbind(complete_stat.df[,1:6],mvadj_complete_stat.df)

##renames rows
row.names(complete_stat.df) <- as.numeric(seq(1:nrow(complete_stat.df)))

##renames rows in adjusted data
row.names(mvadj_complete_stat.df) <- as.numeric(seq(1:nrow(mvadj_complete_stat.df)))

##get total missing values
A <- complete_stat.df %>% select(-contains('_nmissing')) ##drop '_missing' columns
B <- mvadj_complete_stat.df %>% select(-contains('_nmissing'))##drop '_missing' columns

original <- data.frame(original=colSums(is.na(A[7:ncol(A)]))) #total missing values
adjusted <- data.frame(adjusted=colSums(is.na(B[7:ncol(B)]))) #total missing values

##bind sums of missing values for original and adjusted values
summv.df <- cbind(original,adjusted)
summv.df$difference <- summv.df$adjusted-summv.df$original

   

Example of final data frame (showing 7 columns and 10 rows only)

Date Year Month Day Catchment Farmlet Flow_l_s Flow_l_s_nmissing
2012-01-01 2012 1 1 1 Red NaN 96
2012-01-02 2012 1 2 1 Red NaN 96
2012-01-03 2012 1 3 1 Red NaN 96
2012-01-04 2012 1 4 1 Red NaN 96
2012-01-05 2012 1 5 1 Red NaN 96
2012-01-06 2012 1 6 1 Red NaN 96
2012-01-07 2012 1 7 1 Red NaN 96
2012-01-08 2012 1 8 1 Red NaN 96
2012-01-09 2012 1 9 1 Red NaN 96
2012-01-10 2012 1 10 1 Red NaN 96

   

Write results to a .csv file if selected above:

if(write2file){
  write.csv(complete_stat.df,paste0(output_dir,pretxt,"Daily_",str_to_title(stat_txt),"s","_",start_date,"_",end_date,".csv"),row.names = F)
  write.csv(mvadj_complete_stat.df,paste0(output_dir,pretxt,"Daily_",str_to_title(stat_txt),"s","_",start_date,"_",end_date,"_mvadjusted",".csv"),row.names = F)
  write.csv(summv.df,paste0(output_dir,pretxt,"Summary_",str_to_title(stat_txt),"s","_MissVals_Adjusted_by_Threshold_",start_date,"_",end_date,".csv"),row.names = T)
}

   

6 Box & Whisker Graphs

 

If selected above, the following sections produce box and whisker graphs of the chosen statistic for variables by individual catchments and years, and by individual catchments across all the years, and saves them as .png files. The graphs are multi-plots showing the influence of each of the 4 different thresholds for number of missing values on the statistic. Text is added to each graph giving the number of missing values threshold, the number of rows of non-missing data and the percentage of these of the data set. Examples of the box and whisker plots for ‘pH’ and ‘Nitrate_and_Nitrite_mg_l’ are given below.

 
 

6.1 Individual catchments / year

if(graphs2file){

##get years
yrs <- unique(complete_stat.df$Year)

for (yr in yrs) {
  ##subset data by year
  subcomplete.df <- complete_stat.df[complete_stat.df$Date >= paste0(yr, "-01-01") 
                                        & complete_stat.df$Date <= paste0(yr, "-12-31"),]
  
  ##get variable names excluding Date, Catchment, Year, Month
  varcols <- colnames(subcomplete.df[7:ncol(subcomplete.df)])
  ##remove "_nmissing and make names unique
  varnames <- unique(sub("_nmissing", "", varcols))
  ##remove "Pump_On_Off"
  varnames <- varnames[!varnames %in% c("Pump_On_Off")]

  ##get Date, Catchment, Year, Month, Day columns
  firstcols <- subcomplete.df[, 1:5]
  
  #set threshold(s) for missing values
  thresh <- c(0, 24, 48, 72)
  
  for (varnam in varnames) {
    ##get variable column including missing values colums
    vari <-subcomplete.df[, grep(varnam, colnames(subcomplete.df))]
    
    ##join Date, Catchment, Year, Month, Day columns
    vari <- cbind(firstcols, vari)
    n_vari <- nrow(vari) ##number of rows in variable
    
    ##open graphics device and name of file
    png(paste0(output_dir,pretxt,str_to_title(stat_txt),"s_Box_", varnam, ".png"),width = 1500,height = 900)
    
    ##define multiple plot windows
    par(mfrow = c(4, 1))
    par(mar= c(4, 4, 4, 2) + 0.1) ##setting for plot margin; default = c(5, 4, 4, 2) + 0.1
    par(mgp = c(2, 1, 0)) # setting for axis title, axis labels, axis line; default = c(3, 1, 0)
    
    ##subset the data by chosen missing values threshold
    for (thr in thresh) {
      if (thr == 0) {
        df <- subset(vari, vari[paste0(varnam, '_nmissing')] == thr)
        miss_txt <- paste("n miss. vals. = ")
      } else{
        df <- subset(vari, vari[paste0(varnam, '_nmissing')] >= thr)
        miss_txt <- paste("n miss. vals. = ")
        }
      n_df <- nrow(df) ##number of rows in subset
      
      if (n_df != 0 &&
          any(!is.na(df[[varnam]]))) {
        ## if data present and not all NA
        tuk <- table(outlier_turkey(df[[varnam]], k = 1.5))##calculate Tukey fences
        maxy <- max(df[[varnam]],na.rm=t)
        maxy_1 <- maxy*0.99
        maxy_2 <- maxy_1-maxy_1*0.07
        perc <- round((n_df / n_vari) * 100, 0) ##percentage of original
        ##pre-plot to get details of box and whisker graph
        b <- boxplot(df[[varnam]] ~ df$Catchment,plot=F)
        catchnums <- as.numeric(b$names) ##get catchment numbers to define farmlet colours
        
        boxplot(df[[varnam]] ~ df$Catchment,
        xlab = "catchment",ylab = "value",col = farmcols[catchnums],notch = F,plot = T,na.omit = F,cex.lab = 1.5)
        text(x = 1,y = maxy_1,labels = paste0("F = ",tuk[1]),
             col = "darkgreen",cex = 1.2)
        text(x = 1,y = maxy_2,labels = paste0("T = ",tuk[2]),
             col = "darkblue",cex = 1.2)
        mtext(side= 3,paste0(miss_txt,thr,"; n rows data = ",nrow(df),"; ",perc,"%"),line=0.4,cex = 0.9)
      } else{
        perc <- round((n_df / n_vari) * 100, 0) ##percentage of original
        boxplot(NA, xlab = "",ylab = "",xlim = c(0, 1),ylim = c(0, 1))
        text(x = 0.5,y = 0.5,substitute(paste(bold("No Data"))),col = "black",cex = 1.5)
        mtext(side= 3,paste0(miss_txt,thr,"; n rows data = ",nrow(df),"; ",perc,"%"),line=0.4,cex = 0.9)
      }
      
    }# end for(thr in thresh)
    ###add main title
    mtext(paste0(yr,": ",varnam), side = 3, line = - 2, outer = T,cex = 1.3,adj=0)
  }# end for(varnam in varnames)
  
}# end for(yr in yrs)
}# end if(graphs2file)

   

6.2 Combined years & catchments

if(graphs2file){
##get variable names excluding Date, Catchment, Year, Month
varcols <- colnames(complete_stat.df[7:ncol(complete_stat.df)])
##remove "_nmissing and make names unique
varnames <- unique(sub("_nmissing", "", varcols))
##remove "Pump_On_Off"
varnames <- varnames[!varnames %in% c("Pump_On_Off")]


##get Date, Catchment, Year, Month, Day columns
firstcols <- complete_stat.df[, 1:5]

#set numbers of missing values at 0%, 25%, 50%, 75% of 96
thresh <- c(0, 24, 48, 72)

for (varnam in varnames) {
  ##get variable column including missing values colums
  vari <- complete_stat.df[, grep(varnam, colnames(complete_stat.df))]
  
  ##join Date, Catchment, Year, Month, Day columns
  vari <- cbind(firstcols, vari)
  n_vari <- nrow(vari) ##number of rows in variable
  
  ##open graphics device and name of file
  png(paste0(output_dir,pretxt,str_to_title(stat_txt),"s_Box_", varnam, "_", yr, ".png"),width = 1500,height = 900)
  
  ##define multiple plot windows
  par(mfrow = c(4, 2))
  par(mar= c(4, 4, 4, 2) + 0.1) ##setting for plot margin; default = c(5, 4, 4, 2) + 0.1
  par(mgp = c(2, 1, 0)) # setting for axis title, axis labels, axis line; default = c(3, 1, 0)
  
  ##subset the data by chosen missing values threshold
  for (thr in thresh) {
    if (thr == 0) {
      df <- subset(vari, vari[paste0(varnam, '_nmissing')] == thr)
      miss_txt <- paste("n miss. vals. = ")
    } else{
      df <- subset(vari, vari[paste0(varnam, '_nmissing')] >= thr)
      miss_txt <- paste("n miss. vals. >= ")
    }
    n_df <- nrow(df) ##number of rows in subset
    
    if (n_df != 0 && any(!is.na(df[[varnam]]))) {
      
      ## if data present and not all NA
      tuk <- table(outlier_turkey(df[[varnam]], k = 1.5))  ##calculate Tukey fences
      perc <- round((n_df / n_vari) * 100, 0) ##percentage of original
      maxy <- max(df[[varnam]],na.rm=t)
      maxy_1 <- maxy*0.99
      maxy_2 <- maxy_1-maxy_1*0.07
      xticks <- unique(format(df$Date,"%Y"))
      xat <- 1:length(xticks)
      ##pre-plot to get details of box and whisker graph
      b <- boxplot(df[[varnam]] ~ df$Catchment,plot=F)
      catchnums <- as.numeric(b$names) ##get catchment numbers to define farmlet colours
      ##by catchment
      boxplot(df[[varnam]] ~ df$Catchment,
              xlab = "catchment",ylab = "value", col = farmcols[catchnums],notch = F,plot = T,na.omit = F,cex.lab = 1.5)
      text(x = 1,y = maxy_1,labels = paste0("F = ",tuk[1]),
           col = "darkgreen",cex = 1.2)
      text(x = 1,y = maxy_2,labels = paste0("T = ",tuk[2]),
           col = "darkblue",cex = 1.2)
      mtext(side= 3,paste0(miss_txt,thr,"; n rows data = ",nrow(df),"; ",perc,"%"),line=0.4,cex = 0.9)
    }else{
      perc <- round((n_df / n_vari) * 100, 0) ##percentage of original
      boxplot(NA, xlab = "",ylab = "",xlim = c(0, 1),ylim = c(0, 1))
      text(x = 0.5,y = 0.5,labels = "No Data",col = "black",cex = 1.5)
      mtext(side= 3,paste0(miss_txt,thr,"; n rows data = ",nrow(df),"; ",perc,"%"),line=0.4,cex = 0.9)
    }
    ##by year
    if (n_df != 0 && any(!is.na(df[[varnam]]))) {
      boxplot(df[[varnam]] ~ df$Year,
              xlab = "year",ylab = "value",col = "lightblue",notch = F,plot = T,na.omit = F,cex.lab = 1.5)
      axis(1, at=xat,labels=xticks, las=1)
      mtext(side= 3,paste0(miss_txt,thr,"; n rows data = ",nrow(df),"; ",perc,"%"),line=0.4,cex = 0.9)
    }else{
      perc <- round((n_df / n_vari) * 100, 0) ##percentage of original
      boxplot(NA, xlab = "",ylab = "",xlim = c(0, 1),ylim = c(0, 1))
      text(x = 0.5,y = 0.5,labels = "No Data",col = "black",cex = 1.5)
      mtext(side= 3,paste0(miss_txt,thr,"; n rows data = ",nrow(df),"; ",perc,"%"),line=0.4,cex = 0.9)
    }
    
  }#end for (thr in thresh)
  ###add main title
  mtext(paste0(varnam," - by Catchment and by Year"), side = 3, line = - 2, outer = T,cex = 1.3,adj=0)
  dev.off()##close graphics device
}# end for(varnam in varnames)
} # end if(graphs2file)

   

 
 

8 References

 

Tukey, J. W., 1977. Exploratory Data Analysis. Addison-Wesley Publishing Co., Boston, Massachusetts, USA.