1. Introduction

Sensors and instruments used to measure hydrometeorological variables sometimes malfunction, resulting in data sets with missing values. In addition, incomplete data series can be caused by human error, the sensor calibration process itself and raw data management errors (e.g. software issues). The number of missing values is crucial, since a large amount can reduce the quality of the data, especially if critical information is missing (e.g. suspected peaks or troughs), and can consequently bias subsequent analysis of the data when key research hypotheses are being tested. The pattern of the missing values is also hugely significant (Tabachnick and Fidell, 2012): randomly scattered missing data is less likely to significantly affect the data set's quality and can be handled more easily, whereas long sequences of missing data, especially over quite variable periods, are much more difficult to deal with. A missing value proportion of approximately 5% or less is generally accepted to have an insignificant impact (Schafer, 1999). Unlike other scientific fields, where experiments can be repeated to solve the missing data problem, this approach cannot be applied to time series data due to its nature. Therefore, effective missing value imputation is a necessary data pre-processing step, since many of the statistical approaches subsequently applied in a research context require complete data series. Numerous imputation techniques exist, and their application depends on the missing data patterns, the nature of the study data and the scope of the analysis (Cheema, 2014). In this report, we are primarily dealing with multivariate time series data that also has a limited spatial extent. As introductory examples of missing value imputation, a local least squares imputation method (LLS impute) and a regularized iterative Principal Components Analysis (PCA impute) model are now demonstrated.

For this, a number of different packages will be used and they need to be installed. The code below will check whether they are installed and if they are not, will install them and their dependencies (i.e. other packages that are needed):

if (!is.element("BiocManager", installed.packages()))
    install.packages("BiocManager", dep = T)
if (!is.element("pcaMethods", installed.packages()))
    BiocManager::install("pcaMethods", dep = T)
if (!is.element("matrixStats", installed.packages()))
    install.packages("matrixStats", dep = T)
if (!is.element("hydroGOF", installed.packages()))
    install.packages("hydroGOF", dep = T)
if (!is.element("VIM", installed.packages()))
    install.packages("VIM", dep = T)
if (!is.element("missMDA", installed.packages()))
    install.packages("missMDA", dep = T)
if (!is.element("corrgram", installed.packages()))
    install.packages("corrgram", dep = T)
if (!is.element("naniar", installed.packages()))
    install.packages("naniar", dep = T)

The packages then need to be loaded into the R session:

library(pcaMethods)
library(matrixStats)
library(hydroGOF)
library(VIM)
library(missMDA)
library(corrgram)
library(naniar)

2. Data

The data set used in this report consists of measurements obtained from the North Wyke Farm Platform (NWFP) recorded at a fine temporal resolution of 15 minutes. The data set (sm_2012_2018.csv) can be downloaded directly from here: https://nwfp.rothamsted.ac.uk/fpdownload/Example_datasets/

On the NWFP, such temporal data collections include water quality and water run-off variables (at 15 sub-catchment flumes, located at the corner of each sub-catchment), soil moisture and rainfall measurements (at 15 sub-catchment monitoring stations, located at the centre of each sub-catchment) and meteorological data (at a single site). Details of the NWFP experiment can be found in Orr et al. (2016) and on the website (http://www.rothamsted.ac.uk/farmplatform).

For this report, we demonstrate missing value imputation on soil moisture (%) data recorded at a depth of 10 cm at the 15 NWFP sub-catchments. We focus on the period from the start of the 2012 water year (October 2012) to April 2018. Much of this data is baseline data, although sub-catchments 2, 8, 14 and 15 entered a transitional phase leading to a post-baseline status, as they were all reseeded in the summer of 2013. Furthermore, sub-catchment 4 was resized (made smaller) in the autumn of 2013, as part of the on-going development of the NWFP. For this report, all such (hugely important) contextual data are noted, but are not directly accounted for in the analyses; this is left for future report revisions.

Before any reading of data and writing of output, the working directory needs to be defined:

#setwd("C:/Users/my_folder")

Before reading the data, it is a good idea to clear the Environment of any variables that may remain from a previous session:

rm(list=ls())

Read the data from a csv file:

my.data <- read.csv("sm_2012_2018.csv", sep = ",", header = TRUE)
sm10 <- as.matrix(my.data[, 3:17])  # soil moisture at 10 cm; one column per sub-catchment
colnames(sm10) <- paste("Catchment", 1:15)  # column names are used for indexing below
Dates <- as.Date(my.data[, 1], "%d/%m/%Y")

Plot the data to get an initial idea of how it looks:

par(mfrow = c(3, 5))  # divide into subplots
for (i in 1:15) {
  par(mar = c(3, 5, 3, 2) + 0.5)
  plot(sm10[, i] ~ Dates, ylim = range(sm10[, i], na.rm = TRUE), xaxt = 'n', type = "l",
       ylab = "Soil moisture at 10cm (%)", xlab = "", main = paste("Sub-catchment", i),
       cex.lab = 1.2, cex.main = 1)
  axis(1, Dates, format(Dates, "%b %Y"), tck = 0, cex.axis = 1.1)
}

There are some obvious outliers, especially in sub-catchments 9, 12 and 15. They will be removed, substituted with NAs, and then infilled:

# Inspect the largest values in each affected sub-catchment, then replace values
# above a visually chosen threshold with NA
c9 <- which(colnames(sm10) == "Catchment 9")
sort(sm10[, c9], decreasing = TRUE)[1:15]
sm10[which(sm10[, c9] >= 44.91), c9] <- NA
c12 <- which(colnames(sm10) == "Catchment 12")
sort(sm10[, c12], decreasing = TRUE)[1:10]
sm10[which(sm10[, c12] >= 45.61), c12] <- NA
c15 <- which(colnames(sm10) == "Catchment 15")
sort(sm10[, c15], decreasing = TRUE)[1:200]
sm10[which(sm10[, c15] >= 45.61), c15] <- NA

Plot the data again to check if the outliers have been successfully removed:
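
par(mfrow = c(3, 5))  # same plotting loop as above
for (i in 1:15) {
  par(mar = c(3, 5, 3, 2) + 0.5)
  plot(sm10[, i] ~ Dates, ylim = range(sm10[, i], na.rm = TRUE), xaxt = 'n', type = "l",
       ylab = "Soil moisture at 10cm (%)", xlab = "", main = paste("Sub-catchment", i),
       cex.lab = 1.2, cex.main = 1)
  axis(1, Dates, format(Dates, "%b %Y"), tck = 0, cex.axis = 1.1)
}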

There seem to be some more irregularities in the measured data. A more advanced method for outlier detection and removal would probably be required; this could be the subject of future work.

Initially, the total percentage of missing values can be calculated:

na.perc <- round(sum(is.na(sm10))/(nrow(sm10)*ncol(sm10))*100, 2)
print(paste0("Percentage of missing values: ", na.perc,"%")) 
## [1] "Percentage of missing values: 9.06%"

A missing data percentage of 9.06% is quite high, well above the 5% threshold noted earlier. The naniar package contains various useful tools that help depict missing data patterns. One of them shows the percentage of missing values for each variable (sub-catchment in this case):

gg_miss_var(as.data.frame(sm10), show_pct = TRUE) 

The aggr function (from the VIM package) provides more detail on the patterns of the missing data. It produces a figure consisting of two sub-plots: a histogram of missing values per catchment, similar to the previous function, and a plot showing the observed combinations of missing and non-missing values.

frac <- 0.00001  # fraction of all cells, giving a small random subset of rows for plotting
n.rows <- round(nrow(sm10) * ncol(sm10) * frac)
aggr(sm10[sample(1:nrow(sm10), n.rows, replace = FALSE), ], col = c("skyblue", "red"),
     bars = FALSE, numbers = TRUE, prop = TRUE, combined = FALSE, sortVars = TRUE,
     labels = colnames(sm10), cex.lab = 1, cex.axis = 0.75, cex.numbers = 1, gap = 2,
     digits = 2, ylab = c("Histogram of missing soil moisture data",
                          "Missing values combinations"))

## 
##  Variables sorted by number of missings: 
##      Variable      Count
##   Catchment 4 0.44827586
##   Catchment 1 0.31034483
##  Catchment 14 0.20689655
##   Catchment 5 0.17241379
##   Catchment 7 0.17241379
##  Catchment 15 0.17241379
##   Catchment 2 0.13793103
##   Catchment 8 0.10344828
##  Catchment 11 0.06896552
##   Catchment 9 0.03448276
##  Catchment 10 0.03448276
##   Catchment 3 0.00000000
##   Catchment 6 0.00000000
##  Catchment 12 0.00000000
##  Catchment 13 0.00000000

The above plots show that almost half of the data in sub-catchment 4 is missing. Sub-catchment 1 also has a high percentage of missing values (around 25%), whereas half of the sub-catchments have a percentage of missing data lower than 5%. The aggregation plot on the right describes the combinations of missing (red) and non-missing (blue) values. Please note that in this case only a small subset of the data was used for demonstration purposes, as there are too many possible combinations to plot. For example, the last row of the figure, where all the boxes are blue, shows the percentage of time points that have measurements in all the sub-catchments. The rows above show the percentage of time points where some sub-catchments have missing measurements, whilst there are measurements in all the other sub-catchments.

3. Methodology

Researchers have been interested in methods for handling missing data for many decades, as far back as the 1930s (Wilks, 1932). The increase in computational power and the development of statistical packages in the 1990s allowed many new techniques to be developed and applied. There are two main approaches to dealing with missing values: the first is to ignore or delete the part of the data where information is missing; the second is to estimate the missing values in order to obtain a complete data set. Listwise and pairwise deletion are the two most frequently used methods of discarding missing values; their results differ only when applied to multivariate data with more than two variables. Listwise deletion considers only the cases for which measurements are available for all variables: if at least one variable has a missing value, the case is dropped. In contrast, pairwise deletion discards cases only for the variables used in each statistical procedure, in an attempt to minimize data loss.
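
As a small illustration on hypothetical three-variable data, listwise deletion drops any row containing an NA, whereas pairwise deletion uses, for each calculation, all the cases available for that particular pair of variables:

x <- data.frame(a = c(1, 2, NA, 4, 5, 6), b = c(2, NA, 3, 4, 5, 7), c = c(1, 2, 3, NA, 5, 6))
na.omit(x)  # listwise deletion: only the complete rows remain
cor(x, use = "complete.obs")  # correlations computed from the complete rows only
cor(x, use = "pairwise.complete.obs")  # each pair of variables uses its own available cases
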
Mean imputation is probably the most popular method of missing value estimation. However, this technique has been extensively criticized because, while it preserves the mean of the variable, it decreases its variance and alters its cross-correlation with the other variables. Regression imputation consists of using the relationships between variables and exploiting the information they provide. The drawbacks of regression imputation include changes in the marginal and joint distributions of the variables (Josse and Husson, 2013) and biased standard errors of parameter estimates (Allison, 2001). According to various studies (e.g. Salkind and Rasmussen, 2007; Rubin, 1987), the latter is tackled by using more sophisticated techniques, such as maximum likelihood expectation maximization (EM) and multiple imputation (MI) approaches. MI consists of estimating several values, which are then pooled by calculating their mean or by using weights, and thus simulates the variance of the missing values satisfactorily. Numerous machine learning based techniques, such as Singular Value Decomposition (Troyanskaya et al., 2001), Genetic Algorithms and Neural Networks, can be found in the literature; many of these are combined with EM, MI and regression methods. In most studies, multivariate information is used, since exogenous variables, even if not used in the main analysis, can provide useful information and improve imputation performance (Allison, 2001).
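
The variance reduction caused by mean imputation is easy to demonstrate on a small hypothetical example:

v <- c(1, 2, NA, NA, 9)
v.imp <- ifelse(is.na(v), mean(v, na.rm = TRUE), v)  # replace NAs with the observed mean
mean(v, na.rm = TRUE) == mean(v.imp)                 # TRUE: the mean is preserved
c(var(v, na.rm = TRUE), var(v.imp))                  # 19.0 vs 9.5: the variance shrinks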

In this report, we use local multivariate information, since the measurements at each sub-catchment are strongly spatially correlated with the observations at all the other sub-catchments, as the correlogram below shows:

sm_correl <- round(cor(sm10, use = "complete.obs"), 2)
corrgram(sm_correl, type = 'cor', order = FALSE, main = "Soil Moisture at 10 cm (%)",
         lower.panel = panel.pie, upper.panel = panel.cor, diag.panel = panel.minmax,
         text.panel = panel.txt, cex.labels = 0.8, cex = 0.8)

3.1 LLS impute

This method was developed by Kim et al. (2005) for the estimation of missing values in gene expression data. First, variables that are similar to the target variable (the one containing the missing values to be imputed) are selected; for this reason the method is referred to as local, i.e. instead of using all the available variables, only the similar ones are used. Similarity is determined either by calculating the Euclidean distance between variables and applying a k-nearest neighbour method, or by a correlation coefficient (Pearson, Spearman or Kendall). The missing values are then estimated by representing the target variable as a linear combination of the similar variables; in other words, the variable containing missing values is regressed against the closest variables. This method was applied using the pcaMethods package in R (Stacklies et al., 2007).
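
To make the idea concrete, below is a minimal sketch of the local least squares step for a single target column. This is only an illustration (the function lls.one and its arguments are hypothetical, not the pcaMethods implementation), and it assumes the selected neighbour columns are observed wherever the target is missing, which the real algorithm does not require:

lls.one <- function(X, target, k = 2) {
  # X: numeric matrix with column names; target: column index of the variable to impute
  r <- abs(cor(X, use = "pairwise.complete.obs"))[target, -target]  # similarity to target
  nb <- names(sort(r, decreasing = TRUE))[1:k]  # the k most similar variables
  fit <- lm(X[, target] ~ X[, nb])  # regress the target on its nearest neighbours
  miss <- is.na(X[, target])
  X[miss, target] <- cbind(1, X[miss, nb, drop = FALSE]) %*% coef(fit)  # predict the NAs
  X
}

Here k plays the same role as the k argument of llsImpute used later in this report.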

3.2 PCA impute

A detailed description of this approach can be found in Josse and Husson (2013). In general, PCA is used to reduce the redundancy of the data by reducing its dimensions. This is achieved by projecting the data onto the principal axes, which are orthogonal (a linear transformation), in a way that maximizes the variance, which is assumed to contain most of the information. The first principal component has the largest variance, the second component the second largest, and so on. Selecting the first few principal components results in data that are linearly uncorrelated and contain most of the initial information in fewer dimensions. Based on this principle, the imputation procedure works as follows. The missing values are initially imputed as the mean of the observed values, or drawn from a Gaussian distribution with the mean and standard deviation of the observed values. The second step is to perform PCA on the completed data set and estimate the mean of each variable, the principal components and the eigenvectors of the covariance matrix, i.e. the principal axes. The fitted values of this classical fixed effect PCA model are then used to update the estimates of the missing values, whereas the observed values are kept unchanged. This process is iterated until convergence. The number of dimensions used in the PCA model is estimated by cross-validation in order to avoid overfitting. This approach was implemented using the missMDA package in R (Josse and Husson, 2016).
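
A minimal sketch of this iterative scheme (a hypothetical function, not the missMDA implementation) is given below: mean initialization, PCA via the singular value decomposition, and an update of the missing cells only, repeated until convergence. The regularized algorithm in missMDA additionally shrinks the singular values to reduce overfitting:

impute.pca.sketch <- function(X, ncp = 2, tol = 1e-6, max.iter = 1000) {
  miss <- is.na(X)
  X.imp <- X
  for (j in 1:ncol(X)) X.imp[miss[, j], j] <- mean(X[, j], na.rm = TRUE)  # step 1: mean-impute
  for (iter in 1:max.iter) {
    mu <- colMeans(X.imp)
    s <- svd(sweep(X.imp, 2, mu), nu = ncp, nv = ncp)  # step 2: PCA via the SVD
    fitted <- sweep(s$u %*% diag(s$d[1:ncp], ncp) %*% t(s$v), 2, mu, "+")
    X.new <- X.imp
    X.new[miss] <- fitted[miss]  # step 3: update the missing cells only
    delta <- sum((X.new - X.imp)^2)
    X.imp <- X.new
    if (delta < tol) break  # iterate until convergence
  }
  X.imp
}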

3.3 Evaluation procedure

The concept of imputing missing values is similar to data prediction. In both cases, evaluation consists of excluding a subset of the observed data, fitting the models to the remaining data, and then comparing the excluded observations with their predicted values. Observe that we are not (in general) forecasting or extrapolating data; instead we are (in general) interpolating data. Thus, whereas the last values of a time series would be excluded when evaluating a forecast, the data subsets excluded for evaluating missing data imputation should reflect the fact that missing data can occur in any part of a time series.

In this report, the performance of the two techniques was evaluated using two data sub-setting methods. The first consists of randomly selecting approximately 6% of the data cells across all 15 sub-catchments; after discarding cells that are already missing, about 5% of the data remains as observed test values.

percentage <- 0.06
which.sample <- sample(length(sm10), (length(sm10)*percentage), replace=FALSE)
sm.rand.sampl <- sm10[which.sample]
print(paste0("Percentage of missing values in the random sample: ", round(sum(is.na(sm.rand.sampl))/(length(sm.rand.sampl))*100, 2), "%"))
## [1] "Percentage of missing values in the random sample: 9.17%"

The sample has a similar percentage of missing values and is therefore representative of the data set. Next, we keep only the non-NA measurements of the random sample; these constitute the test data set for the application of the imputation techniques.

which.rand.test <- which.sample[which(!is.na(sm.rand.sampl))]
sm.test1 <- sm10; sm.test1[which.rand.test] <- NA

Before applying the imputation techniques, we need to delete any data rows that consist entirely of NAs, as these cause the imputation functions to fail. Since removing rows shifts the cell indices, a matching reduced copy of the observed data is kept and the test indices are recalculated:

keep <- rowSums(is.na(sm.test1)) < ncol(sm.test1)  # rows with at least one observation
sm.test1 <- sm.test1[keep, ]
sm10.red1 <- sm10[keep, ]  # observed data, aligned with the reduced test matrix
which.rand.test <- which(is.na(sm.test1) & !is.na(sm10.red1))  # test cells in the reduced matrices

Then we apply the two methods to estimate the values that were set to NA. The LLS impute method requires the number of similar variables, k, to be specified, with similarity determined by a correlation coefficient (Pearson, Kendall or Spearman). Similarly, the number of dimensions for PCA impute must be predefined; it can be estimated automatically by cross-validation (generalised cross-validation, leave-one-out or K-fold).

#n.pcs <- estim_ncpPCA(sm.test1, ncp.min = 1, ncp.max = (ncol(sm.test1) - 1), method.cv = "Kfold", nbsim = 100)  # estimate the number of components; time consuming for large data sets, so a value can also be set manually
n.pcs <- 2
pca.test1 <- imputePCA(sm.test1, ncp = n.pcs, method = "Regularized")  # apply PCA impute
pca.est.test1 <- pca.test1$completeObs[which.rand.test]  # extract the PCA-imputed test values
lls.test1 <- llsImpute(sm.test1, k = n.pcs, center = TRUE, completeObs = TRUE, correlation = "pearson", allVariables = TRUE)  # apply LLS impute
lls.est.test1 <- completeObs(lls.test1)[which.rand.test]  # extract the LLS-imputed test values

The second method involves leaving out a large part of the measured data from a single, specific sub-catchment. The method was applied iteratively for each sub-catchment, as well as for different seasonal subsets of the data (winter and summer). This was done in order to capture the different temporal patterns of missing data. For example, sub-catchment 1 has long periods of missing data during both winter and summer, there are no measurements in sub-catchment 3 for most of the summer of 2014, and there is a large data gap in sub-catchment 13 during the 2016/2017 winter.

start.date <- "2015-12-01"
end.date <- "2016-02-28"  # the dates can be chosen by the user
which.smpl <- min(which(Dates == start.date)):max(which(Dates == end.date))
which.catch <- 10
which.test2 <- which.smpl[which(!is.na(sm10[which.smpl, which.catch]))]  # observed cells in the window
sm.test2 <- sm10; sm.test2[which.test2, which.catch] <- NA

keep <- rowSums(is.na(sm.test2)) < ncol(sm.test2)  # delete rows that are all NAs, as above
sm.test2 <- sm.test2[keep, ]
sm10.red2 <- sm10[keep, ]  # observed data, aligned with the reduced test matrix
which.test2 <- which(is.na(sm.test2[, which.catch]) & !is.na(sm10.red2[, which.catch]))  # recompute test rows

pca.test2 <- imputePCA(sm.test2, ncp = n.pcs, method = "Regularized")  # apply imputation methods as above
pca.est.test2 <- pca.test2$completeObs[which.test2, which.catch]
lls.test2 <- llsImpute(sm.test2, k = n.pcs, completeObs = TRUE, correlation = "spearman", allVariables = TRUE)  # apply LLS impute
lls.est.test2 <- completeObs(lls.test2)[which.test2, which.catch]

The imputed and the observed values were compared by calculating a set of quantitative statistical indices that can be divided into two main categories. The first category includes error indices, such as the Mean Error (ME), Mean Absolute Error (MAE), Root Mean Square Error (RMSE) and Percent Bias (PBIAS). Low values of these indices indicate good model performance, and zero values indicate identical behaviour between observed and imputed values. The second category contains agreement indices, which take values from \(-\infty\) or 0 (no agreement) to 1 (total agreement); examples are the Nash-Sutcliffe Efficiency (NSE), the Coefficient of Determination (R2) and the Index of Agreement (d). The hydroGOF package in R was used for the graphical and numerical comparisons (Zambrano-Bigiarini, 2014).
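
For example, the full set of indices for the LLS imputation of the random test cells (sub-setting method 1) can be printed with the gof function from hydroGOF:

gof(sim = lls.est.test1, obs = sm10.red1[which.rand.test])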

4. Results

The accuracy of imputing the randomly removed data (sub-setting method 1) using the PCA and LLS impute methods is depicted below. The graphical comparison between the observed and the imputed data, as well as the performance indices, show that both methods predict the missing values reasonably well. PCA impute performs marginally better than LLS impute, as it gives lower error indices (e.g. ME, MAE and NRMSE) and almost identical agreement indices (e.g. NSE, R2 and d).

ggof(lls.est.test1, sm10.red1[which.rand.test], leg.cex = 1.3, main = "LLS", legend = c("Imputed", "Measured"), cex.main = 1.5)
## [ Note: You did not provide dates, so only a numeric index will be used in the time axis ]

ggof(pca.est.test1, sm10.red1[which.rand.test], leg.cex = 1.3, main = "PCA", legend = c("Imputed", "Measured"), cex.main = 1.5)
## [ Note: You did not provide dates, so only a numeric index will be used in the time axis ]

As expected, the imputation methods perform worse when imputing a long sequence of data (sub-setting method 2) than when imputing randomly missing values (sub-setting method 1). For sub-setting method 2, the figures below show the results of imputing soil moisture during the winter of 2015/2016 in sub-catchment 10. For this example, LLS provides considerably more accurate results, capturing the trends and general patterns of the data; this is evident from the graphical comparison, coupled with the calculated error and agreement indices. The results differ depending on the selected sub-catchment and the time of year the sample refers to, and the user is encouraged to experiment with different sub-catchments and with leaving out samples from various seasons, as in the snippet after the figures below.

ggof(lls.est.test2, sm10.red2[which.test2, which.catch], legend = c("Imputed", "Measured"), leg.cex = 1.3, main = "LLS", cex.main = 1.5)
## [ Note: You did not provide dates, so only a numeric index will be used in the time axis ]

ggof(pca.est.test2, sm10.red2[which.test2, which.catch], leg.cex = 1.3, legend = c("Imputed", "Measured"), main = "PCA", cex.main = 1.5)
## [ Note: You did not provide dates, so only a numeric index will be used in the time axis ]
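
For example, to evaluate a hypothetical summer gap in sub-catchment 5, the leave-out steps above can be re-run with:

start.date <- "2015-06-01"  # a hypothetical summer window
end.date <- "2015-08-31"
which.catch <- 5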

Based on the above results, both PCA and LLS impute were used to estimate the missing values for all of the sub-catchments. Both methods appear to perform well, capturing the underlying process satisfactorily. The figures below provide the final imputation results, which are hugely promising.

keep <- rowSums(is.na(sm10)) < ncol(sm10)  # delete rows that are all NAs, as above
sm.10 <- sm10[keep, ]
Dates <- Dates[keep]

pca.imp <- imputePCA(sm.10, ncp = n.pcs, method = "Regularized")  # apply imputation methods as above
pca.est <- pca.imp$completeObs
lls.imp <- llsImpute(sm.10, k = n.pcs, completeObs = TRUE, correlation = "spearman", allVariables = TRUE)  # apply LLS impute
lls.est <- completeObs(lls.imp)

# Plot the PCA-imputed series (red) along with the original data (black)
par(mfrow = c(3, 5))  # divide into subplots
for (i in 1:15) {
  par(mar = c(3, 5, 3, 2) + 0.5)
  plot(pca.est[, i] ~ Dates, ylim = range(pca.est[, i], na.rm = TRUE), xaxt = 'n', type = "l",
       col = "red", ylab = "Soil moisture at 10cm (%)", xlab = "",
       main = paste("Sub-catchment", i), cex.lab = 1.2, cex.main = 1)
  lines(sm.10[, i] ~ Dates, col = "black")
  axis(1, Dates, format(Dates, "%b %Y"), tck = 0, cex.axis = 1.1)
}

# Plot the LLS-imputed series (red) along with the original data (black)
par(mfrow = c(3, 5))  # divide into subplots
for (i in 1:15) {
  par(mar = c(3, 5, 3, 2) + 0.5)
  plot(lls.est[, i] ~ Dates, ylim = range(lls.est[, i], na.rm = TRUE), xaxt = 'n', type = "l",
       col = "red", ylab = "Soil moisture at 10cm (%)", xlab = "",
       main = paste("Sub-catchment", i), cex.lab = 1.2, cex.main = 1)
  lines(sm.10[, i] ~ Dates, col = "black")
  axis(1, Dates, format(Dates, "%b %Y"), tck = 0, cex.axis = 1.1)
}

5. Conclusions

In this report, LLS and PCA based imputation techniques were applied to predict missing values in soil moisture time series data recorded at 15 sites of the NWFP. Both methods provided similarly accurate results when predicting missing values that occurred randomly throughout the time series, and differing results when imputing a long sequence of missing measurements, depending on the sub-catchment and season. Initial findings are hugely promising, and future reports will expand on what has been presented here.

References

Allison, P. D. 2001. Missing Data. Thousand Oaks, CA: Sage Publications.

Cheema, J. R. 2014. A Review of Missing Data Handling Methods in Education Research. Review of Educational Research 84 (4): 487-508.

Josse, J. and F. Husson. 2013. Handling Missing Values in Exploratory Multivariate Data Analysis Methods. Journal de La Société Française de Statistique 153 (2): 79-99.

Josse J. and F. Husson. 2016. missMDA: A Package for Handling Missing Values in Multivariate Data Analysis. Journal of Statistical Software, 70(1), 1-31.

Kim, H., G. H. Golub and H. Park. 2005. Missing Value Estimation for DNA Microarray Gene Expression Data: Local Least Squares Imputation. Bioinformatics 21 (2): 187-198.

Orr, R. J., P. J. Murray, C. J. Eyles, M. S. A. Blackwell, L. M. Cardenas, A. L. Collins, J. A. J. Dungait, K. W. T. Goulding, B. A. Griffith, S. J. Gurr, P. Harris, J. M. B. Hawkins, T. H. Misselbrook, C. Rawlings, A. Shepherd, H. Sint, T. Takahashi, K. N. Tozer, A. P. Whitmore, L. Wu and M. R. F. Lee. 2016. The North Wyke Farm Platform: Effect of Temperate Grassland Farming Systems on Soil Moisture Contents, Runoff and Associated Water Quality Dynamics: North Wyke Farm Platform. European Journal of Soil Science 67 (4): 374-85.

R Core Team. 2016. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Rubin, D. B. 1987. Multiple Imputation for Nonresponse in Surveys. Wiley Series in Probability and Mathematical Statistics. New York: Wiley.

Salkind, N. J. and K. Rasmussen. 2007. Encyclopedia of Measurement and Statistics. Thousand Oaks, CA, Sage Publications.

Schafer, J. L. 1999. Multiple Imputation: A Primer. Statistical Methods in Medical Research 8 (1): 3-15.

Stacklies, W., H. Redestig, M. Scholz, D. Walther and J. Selbig. 2007. pcaMethods – a Bioconductor Package Providing PCA Methods for Incomplete Data. Bioinformatics 23: 1164-1167.

Tabachnick, B. G. and L. S. Fidell. 2012. Using Multivariate Statistics. 6th ed. Needham Heights, MA: Allyn and Bacon.

Troyanskaya, O., M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein and R. B. Altman. 2001. Missing Value Estimation Methods for DNA Microarrays. Bioinformatics 17 (6): 520-525.

Wilks, S. 1932. Moments and Distributions of Estimates of Population Parameters from Fragmentary Samples. The Annals of Mathematical Statistics 3 (3): 163-95.

Zambrano-Bigiarini, M. 2014. hydroGOF: Goodness-of-fit Functions for Comparison of Simulated and Observed Hydrological Time Series. R package version 0.3-8. https://CRAN.R-project.org/package=hydroGOF