Types of errors in data files obtained from instruments measuring atmospheric variables

When working with data files from instruments measuring atmospheric variables, several types of errors are commonly encountered.

1. Instrumental Errors


2. Data Recording and Transmission Errors


3. Environmental and Sampling Errors


4. Data Processing and Human Errors


5. Outlier and Anomaly Issues


Summary Table

| Error Type | Example | Impact |
| --- | --- | --- |
| Instrumental | Sensor drift, calibration error | Systematic bias |
| Recording/Transmission | Missing data, timestamp errors | Data gaps, misalignment |
| Environmental | Humidity interference, contamination | False highs/lows |
| Processing/Human | Entry errors, unit mix-ups | Misinterpretation, wrong stats |
| Outliers/Anomalies | Spikes, flatlines | Skewed mean/variance |

Detecting common data errors in atmospheric datasets using R.

1. Missing Data

Detection:
- Use is.na() to find missing values.
- Use summary() or skimr::skim() for an overview.

Code:

```r
# Basic count of missing values per column
colSums(is.na(data))

# Visualize missingness
library(naniar)
vis_miss(data)
```
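Counts per column do not show whether missing values are scattered or clustered into long gaps, which matters when deciding between interpolation and removal. The sketch below is one way to list runs of consecutive NAs using base R only; the vector `pm10` is toy data standing in for `data$PM10`, and the name `na_gaps` is illustrative.

```r
# Toy series standing in for data$PM10 (values are illustrative only)
pm10 <- c(12, NA, NA, 15, 18, NA, 20, NA, NA, NA, 22)

# Run-length encode the missingness indicator
runs <- rle(is.na(pm10))
ends <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1

# One row per run of consecutive NAs
na_gaps <- data.frame(start = starts[runs$values],
                      end = ends[runs$values],
                      length = runs$lengths[runs$values])
na_gaps
```

Sorting `na_gaps` by `length` gives a quick view of the longest outages.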

2. Duplicated Records

Detection:
- Use duplicated() to find repeated rows or timestamps.

Code:

```r
# Find duplicated rows
sum(duplicated(data))

# Find duplicated timestamps (assuming 'timestamp' column)
sum(duplicated(data$timestamp))
```
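A repeated timestamp is harmless when the rows agree but needs attention when they carry different readings. As a minimal sketch, assuming the same `timestamp` and `PM10` column names as above and using made-up values, the following lists timestamps whose repeated rows disagree:

```r
# Toy data; column names mirror the ones used in this guide
data <- data.frame(
  timestamp = c("2024-01-01 00:00:00", "2024-01-01 01:00:00",
                "2024-01-01 01:00:00", "2024-01-01 02:00:00",
                "2024-01-01 02:00:00"),
  PM10 = c(10, 12, 12, 15, 40)
)

# All rows whose timestamp occurs more than once (first occurrences included)
dup_ts <- duplicated(data$timestamp) |
  duplicated(data$timestamp, fromLast = TRUE)

# Number of distinct PM10 values recorded for each repeated timestamp
n_values <- tapply(data$PM10[dup_ts], data$timestamp[dup_ts],
                   function(x) length(unique(x)))

# Timestamps whose repeated rows disagree
conflicting <- names(n_values)[n_values > 1]
conflicting
```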

3. Timestamp Errors

Detection:
- Check for non-sequential or irregular time intervals.
- Use dplyr and lubridate for time operations.

Code:

```r
library(dplyr)
library(lubridate)

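# Convert to POSIXct if needed (ymd_hms() assumes UTC unless tz is given)
if (!inherits(data$timestamp, "POSIXct")) {
  data$timestamp <- ymd_hms(data$timestamp)
}

# Sort, then tabulate time differences in seconds;
# a regular series shows a single value
data <- data %>% arrange(timestamp)
diffs <- as.numeric(diff(data$timestamp), units = "secs")
table(diffs)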
```
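A table of time differences shows that gaps exist but not where they are. The following base R sketch locates them, assuming a nominal one-hour sampling interval (`expected_secs` is an assumed parameter, and the timestamps are toy values):

```r
# Toy hourly timestamps with one missing stretch (illustrative values)
timestamp <- as.POSIXct(c("2024-01-01 00:00:00", "2024-01-01 01:00:00",
                          "2024-01-01 02:00:00", "2024-01-01 05:00:00",
                          "2024-01-01 06:00:00"), tz = "UTC")

expected_secs <- 3600  # assumed nominal sampling interval (1 hour)

# Differences in seconds, independent of the units difftime would pick
step_secs <- as.numeric(diff(timestamp), units = "secs")

# Index i means the gap lies between timestamp[i] and timestamp[i + 1]
gap_idx <- which(step_secs > expected_secs)
gaps <- data.frame(gap_start = timestamp[gap_idx],
                   gap_end = timestamp[gap_idx + 1],
                   missing_steps = step_secs[gap_idx] / expected_secs - 1)
gaps
```

Steps shorter than `expected_secs` can be found the same way by reversing the comparison; they often point to duplicated or mis-stamped records.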

4. Outliers and Spikes

Detection:
- Use boxplots, z-scores, or the tsoutliers package for time series.

Code:

```r
# Visualize with boxplot
boxplot(data$PM10, main="PM10 Boxplot")

# Identify values beyond 3 standard deviations
outliers <- abs(scale(data$PM10)) > 3
which(outliers)

# For time series outlier detection
library(tsoutliers)
ts_data <- ts(data$PM10, frequency = 24) # e.g., hourly data
outlier_results <- tso(ts_data)
outlier_results$outliers
```
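The 3-standard-deviation rule relies on the mean and standard deviation, both of which are pulled by the very spikes it is meant to find. A common alternative, shown here as a sketch and not as a replacement for the methods above, scores each value against the median and the median absolute deviation (MAD). The cutoff of 3 and the toy values are assumptions.

```r
# Toy series with one obvious spike (illustrative values)
pm10 <- c(20, 22, 19, 21, 23, 20, 500, 22, 21, 18, 24, 20)

# Robust score: distance from the median in units of MAD.
# mad() scales by 1.4826 by default so it is comparable to the SD
# for normally distributed data.
robust_z <- (pm10 - median(pm10, na.rm = TRUE)) / mad(pm10, na.rm = TRUE)

# Positions flagged as outliers under the assumed cutoff of 3
robust_outliers <- which(abs(robust_z) > 3)
robust_outliers
```

If more than half of the values are identical, `mad()` returns 0 and the scores become infinite or undefined, so check for flatlines first.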

5. Flatlines (Sensor Stuck)

Detection:
- Detect long sequences of identical values.

Code:

```r
# Run-length encoding to find flatlines
rle_PM10 <- rle(data$PM10)
flatlines <- which(rle_PM10$lengths > 10) # e.g., more than 10 identical readings

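# Get the start positions of flatlines
# (cumsum() gives the end of each run, so step back by the run length)
flatline_ends <- cumsum(rle_PM10$lengths)[flatlines]
flatline_positions <- flatline_ends - rle_PM10$lengths[flatlines] + 1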
```

6. Unit Inconsistencies

Detection:
- Check for impossible values (e.g., PM10 > 1000 μg/m³).
- Use summary statistics.

Code:

```r
summary(data$PM10)
# Set a threshold for plausible values
impossible <- which(data$PM10 > 1000 | data$PM10 < 0)
data[impossible, ]
```
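The same range check can be applied to several variables at once by keeping the limits in a lookup list. This is a sketch only: the column names `temp_c` and `rh` and all limits below are hypothetical and should be replaced with values suited to the instrument and site.

```r
# Toy data with hypothetical column names
data <- data.frame(PM10 = c(25, -3, 1500, 40),
                   temp_c = c(12.5, 285.1, 14.0, -120),
                   rh = c(55, 101, 80, 63))

# Hypothetical plausibility limits: c(lower, upper) per variable
limits <- list(PM10 = c(0, 1000), temp_c = c(-90, 60), rh = c(0, 100))

# Logical matrix: TRUE where a non-missing value falls outside its limits
flags <- sapply(names(limits), function(v) {
  x <- data[[v]]
  !is.na(x) & (x < limits[[v]][1] | x > limits[[v]][2])
})

# Number of implausible values per variable
colSums(flags)
```

A temperature near 285 in a column expected to hold degrees Celsius is a typical sign that some records were logged in Kelvin.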

R code snippets for repairing common data errors in atmospheric datasets.

1. Missing Data

Common Fixes:
- Impute using mean/median, linear interpolation, or remove rows.

Code:

```r
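# The three options below are alternatives; choose one.

# Option 1: remove rows with missing PM10
clean_data <- data[!is.na(data$PM10), ]

# Option 2: impute with the median
data$PM10[is.na(data$PM10)] <- median(data$PM10, na.rm = TRUE)

# Option 3: linear interpolation (leading/trailing NAs are kept)
library(zoo)
data$PM10 <- na.approx(data$PM10, na.rm = FALSE)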
```
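Linear interpolation is reasonable across one or two missing readings but can invent long stretches of data when a sensor was offline for hours. The base R sketch below interpolates only short gaps; `max_gap` is an assumed threshold and `pm10` is toy data. zoo's `na.approx()` offers a `maxgap` argument for the same purpose.

```r
# Toy series: one short gap and one long gap (illustrative values)
pm10 <- c(10, NA, 14, NA, NA, NA, NA, 30, 32)
max_gap <- 2  # assumed longest gap (in readings) that is safe to interpolate

idx <- seq_along(pm10)
ok <- !is.na(pm10)

# Linear interpolation over all interior gaps
filled <- approx(idx[ok], pm10[ok], xout = idx)$y

# Restore NA wherever the run of missing values exceeds max_gap
runs <- rle(is.na(pm10))
too_long <- rep(runs$values & runs$lengths > max_gap, times = runs$lengths)
filled[too_long] <- NA
filled
```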

2. Duplicated Records

Common Fix:
- Remove duplicate rows or duplicate timestamps.

Code:

```r
# Remove exact duplicate rows
data <- data[!duplicated(data), ]

# Remove duplicate timestamps, keeping the first occurrence
data <- data[!duplicated(data$timestamp), ]
```
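Keeping the first occurrence silently discards the other readings. When repeated timestamps carry different values and there is no reason to prefer one, averaging them is another option. A minimal sketch with toy data, assuming the same column names:

```r
# Toy data with one conflicting duplicate timestamp
data <- data.frame(
  timestamp = c("2024-01-01 00:00:00", "2024-01-01 01:00:00",
                "2024-01-01 01:00:00", "2024-01-01 02:00:00"),
  PM10 = c(10, 12, 16, 15)
)

# One row per timestamp, PM10 averaged across repeats.
# Rows with missing PM10 are dropped by the formula interface.
averaged <- aggregate(PM10 ~ timestamp, data = data, FUN = mean)
averaged
```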

3. Timestamp Errors

Common Fixes:
- Standardize timestamp format, fill missing timestamps, or interpolate missing time points.

Code:

```r
library(lubridate)
library(dplyr)

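# Standardize timestamp (skip parsing if it is already POSIXct)
if (!inherits(data$timestamp, "POSIXct")) {
  data$timestamp <- ymd_hms(data$timestamp)
}

# Create complete time sequence (e.g., hourly)
full_times <- data.frame(timestamp = seq(min(data$timestamp, na.rm = TRUE),
                                         max(data$timestamp, na.rm = TRUE),
                                         by = "hour"))

# Merge so missing times appear as NA rows, then restore time order
data <- full_join(full_times, data, by = "timestamp") %>%
  arrange(timestamp)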
```

4. Outliers and Spikes

Common Fixes:
- Replace outliers with NA, or impute using nearby values.

Code:

```r
# Identify outliers (e.g., >3 SD from mean)
# scale() returns a one-column matrix, so convert it to a plain vector
z <- as.numeric(scale(data$PM10))
data$PM10[which(abs(z) > 3)] <- NA

# Interpolate after removing outliers
library(zoo)
data$PM10 <- na.approx(data$PM10, na.rm = FALSE)
```

5. Flatlines (Sensor Stuck)

Common Fix:
- Replace long identical runs with NA, then interpolate.

Code:

```r
library(zoo)

rle_PM10 <- rle(data$PM10)
# Flag every reading that belongs to a run longer than 10 identical values;
# rep() expands the per-run flag back to one flag per reading
flat_idx <- rep(rle_PM10$lengths > 10, times = rle_PM10$lengths)
data$PM10[flat_idx] <- NA

# Interpolate
data$PM10 <- na.approx(data$PM10, na.rm = FALSE)
```

6. Unit Inconsistencies

Common Fix:
- Convert units or set implausible values to NA.

Code:

```r
# Set implausible values (e.g., PM10 > 1000) to NA
data$PM10[data$PM10 > 1000 | data$PM10 < 0] <- NA
```
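When the affected records can be identified, converting them preserves the data instead of discarding it. The sketch below assumes a hypothetical `unit` column that records the unit of each row; in practice the affected rows might instead be identified by date range or instrument ID.

```r
# Toy data; the `unit` column is hypothetical
data <- data.frame(PM10 = c(25, 0.030, 0.041, 38),
                   unit = c("ug/m3", "mg/m3", "mg/m3", "ug/m3"))

# Convert rows recorded in mg/m3 to ug/m3 (1 mg = 1000 ug)
in_mg <- data$unit == "mg/m3"
data$PM10[in_mg] <- data$PM10[in_mg] * 1000
data$unit[in_mg] <- "ug/m3"
data
```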

Summary Table

| Error Type | Repair Strategy | Key Function(s) |
| --- | --- | --- |
| Missing Data | Remove, impute, interpolate | na.approx(), median() |
| Duplicates | Remove duplicates | duplicated() |
| Timestamp Errors | Standardize, fill, merge | ymd_hms(), full_join() |
| Outliers/Spikes | Set to NA, interpolate | scale(), na.approx() |
| Flatlines | Set to NA, interpolate | rle(), na.approx() |
| Unit Issues | Convert, set to NA | Logical indexing |

Tip:
Always visualize before and after cleaning (e.g., with plot() or ggplot2) to verify repairs.


References on Environmental Data Errors


R Packages and Data Cleaning References


Practical Guides and Case Studies


Summary Table

| Topic | Reference/Resource |
| --- | --- |
| Environmental Data QC | WHO |
| R Data Cleaning | Wickham & Grolemund (R4DS), Peng, Van der Loo & de Jonge |
| R Packages | naniar, zoo, lubridate, tsoutliers, dplyr, skimr (CRAN documentation) |
| Air Quality Sensors | Holstius et al., Spinelle et al. |
| Outlier/Missing Data | Hyndman & Athanasopoulos, Van der Loo & de Jonge |