Using data-quality flags

Dan Kelley

2018-05-03

Abstract. This vignette explains the basics of working with data-quality flags in oce, with particular application to hydrographic data. The goal is to sketch the workflow, on the expectation that readers will consult the package documentation to learn details of the various functions mentioned, and that they will consult the main oce vignette for general information about oce. As in the related vignette about working with adp data, the material is organized to reflect typical workflow.

Colophon. The sample code provided in this vignette all relies on the oce package, so the first step in following along is to load it:

library(oce)
#> Loading required package: gsw
#> Loading required package: testthat

The lines starting with #> below the call to the library() function are the output of that call. Output is shown for many of the code chunks provided in this vignette, but not in cases of voluminous output, or for functions that use data not provided with oce.

1 Overview

1.1 Flag conventions

Hydrographic and some other data may be accompanied by data-quality flags. For example, salinity is subject to a variety of instrumental and other errors, and so archives of its value commonly contain quality flags. These flags may derive from statistical checks, from the judgement of a human analyst, or a combination of the two.

Some flags are in the form of numerical values (typically integers) but others may be expressed as character strings. A frustrating aspect of oceanographic analysis is that different data archiving agencies employ different systems for flags. For example, the World Hydrographic Programme designates good bottle/CTD data with a flag value of 2, whereas the World Ocean Database uses 0 for good data. It is also common to indicate bad data by setting values to non-physical values, e.g. -999, -99.99, or similar, and sometimes these coded values may contradict more formal flags, when both are present.

1.2 Flag storage

Data-quality flags are stored in the flags entry in the metadata slot of oce objects. This entry is a list, containing items with names corresponding to names of data elements that are stored in the data slot.

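It is possible to inspect flags using e.g. x@metadata$flags$salinity to see flags corresponding to x@data$salinity, but a simpler notation is also provided, with e.g. x[["salinityFlag"]] returning the flags for the salinity data (which may in turn be retrieved with x[["salinity"]]). Here, x stands for any oce object that holds salinity data.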
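As a minimal sketch of the two access styles, the following uses the section dataset supplied with oce, which is assumed here to carry salinity flags set by the archiving agency. The choice of the first station is arbitrary, and the names stn, f1 and f2 are only for illustration.

```r
library(oce)
data(section)
# Extract one ctd object from the section (station 1 is an arbitrary choice)
stn <- section[["station", 1]]
# Direct access to the flags list in the metadata slot
f1 <- stn@metadata$flags$salinity
# Simpler notation, through the [[ accessor
f2 <- stn[["salinityFlag"]]
# Check whether the two approaches return the same values
identical(f1, f2)
# Tabulate the flag codes that occur at this station
table(f2)
```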

Although it is possible to set and alter flags directly, it is much better to use setFlags, in order to avoid making mistakes, and also to record processing steps. It is also recommended to use setFlagScheme first, which makes it possible to refer to flags by name, and not just number.

1.3 Handling flags

As with setting flags, analysts are free to extract flag values and use them for any purpose that comes to mind. Simple cases may be handled with the oce function named handleFlags, which is set up to respond to particular flag values with particular actions. It has reasonable defaults for different data types, and it can detect a flag scheme that has been set up with setFlagScheme. The default action of setting the "bad" flagged data to NA may suffice in many plotting or analysis situations.

2 Sample working procedure

2.1 CTD profile data: range checks

(This section is an expansion of Example 1 in the setFlags,ctd-method documentation.)

The oce package provides a dataset, named ctdRaw, that contains some clearly anomalous values that are revealed in a summary plot.
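A minimal sketch of loading the dataset and drawing its summary plot, assuming only that oce is installed:

```r
library(oce)
# Load the raw CTD cast that is supplied with oce
data(ctdRaw)
# Summary plot; the anomalous values stand out in the profile and TS panels
plot(ctdRaw)
```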

Although ctdTrim removes most of the anomalous data by examining the variation of the pressure signal over time, it may also be of interest to see how well simple range checks can perform in cleaning up the data. Salinity certainly cannot be negative, but in an oceanographic setting it is common to tighten that criterion somewhat, perhaps insisting that Absolute Salinity \(S_A\) exceed \(25\) g/kg. This value might work in other situations as well, and the same could be said of an upper limit of \(40\) g/kg. Similarly, it might make sense to bound temperature between, say, \(-2^\circ\)C and \(40^\circ\)C for application throughout much of the world ocean.

These criteria can be supplied to setFlags in various ways, but the simplest is to create logical vectors, e.g.
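The range checks might be sketched as follows. The thresholds are those suggested in the text; as an assumption made for simplicity, they are applied to the salinity stored in the object (a practical salinity) rather than to a computed Absolute Salinity.

```r
library(oce)
data(ctdRaw)
# TRUE marks values that fall outside the ranges suggested in the text
badS <- with(ctdRaw[["data"]], salinity < 25 | 40 < salinity)
badT <- with(ctdRaw[["data"]], temperature < -2 | 40 < temperature)
# Count the values that fail each check
sum(badS, na.rm=TRUE)
sum(badT, na.rm=TRUE)
```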

In the above, the with() function has been used to avoid creating separate salinity and temperature variables in the workspace, but it would also be common to use e.g.
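A sketch of this alternative, in which the data are first extracted into ordinary variables (the names S and temperature are illustrative):

```r
library(oce)
data(ctdRaw)
# Extract the data columns with the [[ accessor, then apply the range checks
S <- ctdRaw[["salinity"]]
badS <- S < 25 | 40 < S
temperature <- ctdRaw[["temperature"]]
badT <- temperature < -2 | 40 < temperature
```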

Since the goal here is to illustrate setting multiple flags, the badS and badT values will be used. The first step is to copy the original data into a new object, named qc in what follows, so that the flag operations will not alter ctdRaw.

Workflow is best documented if a flag scheme is established, and the "WHP CTD exchange" scheme is a reasonable choice. With the scheme in place, the bad salinities are flagged with setFlags.
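These three steps might be sketched as below. This assumes a recent oce release, in which the scheme appears to be established with initializeFlagScheme() (the step the text refers to as setFlagScheme) and in which flags are created with initializeFlags() before setFlags() is called. The flag values 2 and 4 are the "acceptable" and "bad" codes of the WHP CTD scheme.

```r
library(oce)
data(ctdRaw)
badS <- with(ctdRaw[["data"]], salinity < 25 | 40 < salinity)
# Work on a copy, so that ctdRaw is left untouched
qc <- ctdRaw
# Establish the flag scheme (function name assumed from recent oce releases)
qc <- initializeFlagScheme(qc, "WHP CTD")
# Start with every salinity marked acceptable (2) ...
qc <- initializeFlags(qc, "salinity", 2)
# ... then mark the range-check failures as bad (4)
qc <- setFlags(qc, "salinity", badS, value=4)
```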

We can see that the flag got inserted by using summary(qc), but for brevity here another method is:
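One brief check, sketched here under the same assumptions about initializeFlagScheme() and initializeFlags() in recent oce releases, is to list the names of the data items that now have flags:

```r
library(oce)
data(ctdRaw)
# Setup repeated from the earlier steps, so that this chunk runs on its own
badS <- with(ctdRaw[["data"]], salinity < 25 | 40 < salinity)
qc <- initializeFlagScheme(ctdRaw, "WHP CTD")
qc <- initializeFlags(qc, "salinity", 2)
qc <- setFlags(qc, "salinity", badS, value=4)
# List the data items for which flags exist
names(qc[["flags"]])
# Tabulate the salinity flag codes
table(qc[["salinityFlag"]])
```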

Now, temperature flags may be inserted with
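The temperature step mirrors the salinity step. The sketch below again assumes the initializeFlagScheme() and initializeFlags() functions of recent oce releases.

```r
library(oce)
data(ctdRaw)
# Setup repeated from the earlier steps, so that this chunk runs on its own
badS <- with(ctdRaw[["data"]], salinity < 25 | 40 < salinity)
badT <- with(ctdRaw[["data"]], temperature < -2 | 40 < temperature)
qc <- initializeFlagScheme(ctdRaw, "WHP CTD")
qc <- initializeFlags(qc, "salinity", 2)
qc <- setFlags(qc, "salinity", badS, value=4)
# Temperature: mark everything acceptable (2), then mark range failures bad (4)
qc <- initializeFlags(qc, "temperature", 2)
qc <- setFlags(qc, "temperature", badT, value=4)
```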

Readers ought to use summary(qc) to get more details of how flags were handled, and how many bad salinities and temperatures were flagged.

If qc is plotted with plot(qc), the results will match those of plot(ctdRaw). This is because setting flags alters the flags but not the data, and so it has no effect on plots. One more step is required to test whether this procedure has cleaned up the data significantly: we must "handle" the flags, using handleFlags, with the result stored here as qch.
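A sketch of the handling step follows, with the whole sequence repeated so that the chunk runs on its own. The flags argument lists every WHP CTD code other than 2 ("acceptable"), and the default action of handleFlags is relied upon to replace the flagged data with NA. As before, initializeFlagScheme() and initializeFlags() are assumed from recent oce releases.

```r
library(oce)
data(ctdRaw)
badS <- with(ctdRaw[["data"]], salinity < 25 | 40 < salinity)
badT <- with(ctdRaw[["data"]], temperature < -2 | 40 < temperature)
qc <- initializeFlagScheme(ctdRaw, "WHP CTD")
qc <- initializeFlags(qc, "salinity", 2)
qc <- setFlags(qc, "salinity", badS, value=4)
qc <- initializeFlags(qc, "temperature", 2)
qc <- setFlags(qc, "temperature", badT, value=4)
# Act on all flag codes except 2 (acceptable); flagged data become NA
qch <- handleFlags(qc, flags=list(c(1, 3:9)))
# Summary plot of the cleaned data, for comparison with plot(ctdRaw)
plot(qch)
```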

Comparing the summary plot for qch, constructed with plot(qch), with the original summary plot for ctdRaw shows significant improvements. The downcast and the upcast can be seen quite clearly now, although there appears to be an issue of low salinity at the turnaround point. Setting a flag for pressure increase with time will isolate the downcast somewhat, although some smoothing will be required. Another issue related to the path of the instrument is that it may have been held below the surface for a while to equilibrate. Again, a flag could be set up to remove such data. However, it ought to be noted that ctdTrim can be used to address issues relating to instrument movement.

2.2 CTD profile data: interactive editing

(This section is an expansion of Example 2 in the setFlags,ctd-method documentation.)

The ctd dataset provided by the oce package is similar to ctdRaw, except that only downcast data are provided. Even so, there are still some points that might be considered suspicious. A common way to find such points is to plot TS diagrams, looking for decreases in density with depth.

Running the following code in an interactive session will demonstrate a simple way to use a TS diagram to identify suspicious data. Pasting this into an R console will show a plot with lines between measurements made at successive depths. Clicking on any point will flag it, and so the point will then disappear on the plot. Clicking to the right of the plot frame will exit the procedure, after which qc is a ctd object with flags as set, and data set to NA where these flags indicate bad data.
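A sketch of such an interactive procedure is given below. It assumes recent oce releases (for initializeFlagScheme() and initializeFlags()), and the rule for matching a click to a measurement, which scales distances by the data spans Sspan and Tspan, is just one reasonable choice. The clicking loop is wrapped in a test of interactive() so that nothing happens when the code is run in a batch session.

```r
library(oce)
data(ctd)
qc <- ctd
qc <- initializeFlagScheme(qc, "WHP CTD")
# Start with all salinities marked acceptable (2)
qc <- initializeFlags(qc, "salinity", 2)
# Spans of the data, used to scale the distance from a click to each point
Sspan <- diff(range(qc[["SA"]], na.rm=TRUE))
Tspan <- diff(range(qc[["CT"]], na.rm=TRUE))
n <- length(qc[["SA"]])
if (interactive()) {
    par(mfrow=c(1, 1))
    plotTS(qc, eos="gsw", type="o")
    message("Click on bad points; quit by clicking to right of plot")
    for (count in seq_len(n)) {
        xy <- locator(1)
        # Stop if locator() was cancelled, or if the click is right of the frame
        if (is.null(xy) || xy$x > par("usr")[2])
            break
        # Find the measurement nearest the click, in scaled TS space
        i <- which.min(abs(qc[["SA"]] - xy$x)/Sspan + abs(qc[["CT"]] - xy$y)/Tspan)
        # Flag that point as bad (4), set it to NA, and redraw
        qc <- setFlags(qc, "salinity", i=i, value=4)
        qc <- handleFlags(qc, flags=list(salinity=4))
        plotTS(qc, eos="gsw", type="o")
    }
}
```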

It would be a simple matter to extend this simple example to a shiny application that displays other data. For example, there could be panels for profiles, as well as the TS plot. The system could track clicks in any of the panels, taking appropriate actions. It would be sensible to have a staged procedure, in which clicking (or brushing) in one panel would cause a replot of all panels, with the selected data indicated in some way, so that the analyst could then choose whether to go to the next stage, of clicking a button to indicate bad data. Another button might be provided to undo such operations, or to show the original data for comparison. The point is that a wide variety of flag operations are handled very easily in R, with oce.

2.3 Section data

The flag-handling functions work for a variety of oce objects. For example, sections, which are built up from a sequence of ctd profiles, are handled easily with functions of the same names as for the ctd case.

As a simple example, the following shows how to clean up the "A03" Atlantic section that is provided with oce.
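A sketch of this cleanup, drawing TS diagrams before and after the flags are handled, is as follows. The flags supplied by the archiving agency are used as they are, and every code other than 2 (good, in the World Hydrographic Programme convention) is treated as a reason to set the data to NA.

```r
library(oce)
data(section)
# Set data to NA wherever the archived flags are anything other than 2
s <- handleFlags(section, flags=list(c(1, 3:9)))
# Compare TS diagrams: original on the left, cleaned on the right
par(mfrow=c(1, 2))
plotTS(section)
plotTS(s)
```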

Note, in the above, that the archiving agency had evidently flagged not just wild data (e.g. the salinity near \(26\)g/kg) but also data that were anomalous in more subtle ways (e.g. cleaning up several points that stood out from the data cloud, below \(10^\circ\)C). In fact, the flags in this data set, as in most archived hydrographic data sets, are the result of a multifaceted inspection scheme that is more demanding and useful than simple range checks.

2.4 Acoustic-Doppler profiler data

By now, the reader should be able to understand the following use of data-quality flags for the adp dataset. Note that there is only a small difference, visible near 8h and 20h, between the (gray) bad data as identified by the instrument and those identified by the scheme illustrated here. However, altering the values for G and V4 will reveal that the schemes do differ.
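A sketch of such a scheme is given below. Here G is a minimum acceptable "percent good" and V4 is a maximum acceptable error velocity. The particular numbers (25 and 0.45) and the flag codes (2 for good, 3 for bad) are illustrative assumptions, as is the use of initializeFlags() from recent oce releases.

```r
library(oce)
data(adp)
v <- adp[["v"]]
# Percent-good values, converted from raw bytes to numbers
g <- adp[["g", "numeric"]]
G <- 25    # assumed minimum percent good
V4 <- 0.45 # assumed maximum error velocity
# Mark velocities in beams 1 to 3 as bad if too few pings were good
# or if the error velocity (stored in the fourth slice) is too large
i2 <- array(FALSE, dim=dim(v))
for (k in 1:3)
    i2[, , k] <- ((g[, , k] + g[, , 4]) < G) | (v[, , 4] > V4)
adpQC <- initializeFlags(adp, "v", 2)
adpQC <- setFlags(adpQC, "v", i2, 3)
adpClean <- handleFlags(adpQC, flags=list(3), actions=list("NA"))
# Compare the first velocity component before and after cleaning
par(mfcol=c(2, 1))
plot(adp, which="u1")
plot(adpClean, which="u1")
```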

Note, in the above, that setFlagScheme has not been used, because the author is unaware of any common notation for such data. The values used for G and V4 were provided by colleagues at the Bedford Institute of Oceanography, and may be in use by USGS also.