When working with data files from instruments measuring atmospheric variables, several types of errors are commonly encountered.
Calibration Errors: Incorrect or infrequent calibration can cause systematic bias.
Response Time Lag: Delay in sensor response to rapid concentration changes.
Timestamp Errors: Incorrect, missing, or non-uniform timestamps (e.g., due to daylight saving time changes).
Corrupted Files: Partial or unreadable data files due to transmission/storage errors.
| Error Type | Example | Impact |
|---|---|---|
| Instrumental | Sensor drift, calibration error | Systematic bias |
| Recording/Transmission | Missing data, timestamp errors | Data gaps, misalignment |
| Environmental | Humidity interference, contamination | False highs/lows |
| Processing/Human | Entry errors, unit mix-ups | Misinterpretation, wrong stats |
| Outliers/Anomalies | Spikes, flatlines | Skewed mean/variance |
Detection (missing data):
- Use is.na() to find missing values.
- Use summary() or skimr::skim() for an overview.
Code:
```r
# Basic count of missing values per column
colSums(is.na(data))
# Visualize missingness
library(naniar)
vis_miss(data)
```
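Per-column counts do not show whether missing values are scattered or clustered into long outages, which matters when choosing between interpolation and removal. The following is a minimal base-R sketch of that check; the data frame `aq` and its values are hypothetical stand-ins for a real instrument file.

```r
# Toy data standing in for a real instrument file (hypothetical values)
aq <- data.frame(
  PM10 = c(12, NA, 15, NA, NA, NA, 20, 22, NA, 18),
  NO2  = c(30, 31, NA, 29, 28, 27, 26, NA, 25, 24)
)

# Percentage of missing values per column
pct_missing <- round(100 * colMeans(is.na(aq)), 1)
pct_missing

# Lengths of consecutive-NA runs in PM10 (one entry per gap)
na_runs <- rle(is.na(aq$PM10))
gap_lengths <- na_runs$lengths[na_runs$values]
gap_lengths
max(gap_lengths)  # longest gap, in readings
```

A single long gap usually calls for a different repair than many isolated missing readings, so it can help to look at `gap_lengths` before picking a fix.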
Detection (duplicates):
- Use duplicated() to find repeated rows or timestamps.
Code:
```r
# Find duplicated rows
sum(duplicated(data))
# Find duplicated timestamps (assuming 'timestamp' column)
sum(duplicated(data$timestamp))
```
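A count of duplicates does not reveal whether rows sharing a timestamp actually agree with each other. As a rough sketch (the data frame `aq` and its values are made up for illustration), every row involved in a timestamp collision can be listed for inspection:

```r
# Toy data with one repeated timestamp carrying two different readings
aq <- data.frame(
  timestamp = as.POSIXct(c("2024-01-01 00:00:00", "2024-01-01 01:00:00",
                           "2024-01-01 01:00:00", "2024-01-01 02:00:00"),
                         tz = "UTC"),
  PM10 = c(10, 12, 15, 11)
)

# Flag every row whose timestamp occurs more than once
# (both the first copy and the later copies)
is_dup <- duplicated(aq$timestamp) | duplicated(aq$timestamp, fromLast = TRUE)
conflicts <- aq[is_dup, ]
conflicts
```

If the conflicting rows hold different values, dropping all but the first occurrence silently discards data, so the choice of which row to keep deserves a look.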
Detection (timestamp errors):
- Check for non-sequential or irregular time intervals.
- Use dplyr and lubridate for time operations.
Code:
```r
library(dplyr)
library(lubridate)
# Convert to POSIXct if needed
data$timestamp <- ymd_hms(data$timestamp)
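# Sort, then tabulate the intervals between consecutive readings.
# Units are set explicitly so the result does not depend on the
# unit that difftime would otherwise pick automatically.
data <- data %>% arrange(timestamp)
diffs <- as.numeric(diff(data$timestamp), units = "mins")
table(diffs)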
```
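A table of intervals shows which step sizes occur but not where the gaps are. The sketch below locates them in base R, assuming a nominal hourly sampling interval; `aq`, `expected_mins`, and the sample timestamps are illustrative assumptions.

```r
# Toy hourly series with readings missing at 02:00 and 03:00
aq <- data.frame(
  timestamp = as.POSIXct(c("2024-01-01 00:00:00", "2024-01-01 01:00:00",
                           "2024-01-01 04:00:00", "2024-01-01 05:00:00"),
                         tz = "UTC")
)
aq <- aq[order(aq$timestamp), , drop = FALSE]

expected_mins <- 60  # assumed nominal sampling interval

# Step between consecutive readings, in minutes
step_mins <- as.numeric(diff(aq$timestamp), units = "mins")

# Index of the reading that precedes each gap
gap_after <- which(step_mins > expected_mins)
gaps <- data.frame(
  gap_start = aq$timestamp[gap_after],
  gap_end = aq$timestamp[gap_after + 1],
  missing_readings = step_mins[gap_after] / expected_mins - 1
)
gaps
```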
Detection (outliers and spikes):
- Use boxplots, z-scores, or the tsoutliers package for time series.
Code:
```r
# Visualize with boxplot
boxplot(data$PM10, main="PM10 Boxplot")
# Identify values beyond 3 standard deviations
outliers <- abs(scale(data$PM10)) > 3
which(outliers)
# For time series outlier detection
library(tsoutliers)
ts_data <- ts(data$PM10, frequency = 24) # e.g., hourly data
outlier_results <- tso(ts_data)
outlier_results$outliers
```
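One caveat with the 3-standard-deviation rule is that a large spike inflates the mean and standard deviation it is judged against, so it can go unflagged in short series. A robust variant based on the median and MAD is sketched below; this is a substituted technique rather than the method used above, and the cutoff of 3.5 and the sample values are assumptions.

```r
# Toy series with one obvious spike (hypothetical values)
pm10 <- c(20, 22, 19, 21, 23, 20, 400, 22, 21, 19)

# Classic z-score: the spike inflates both mean and sd
z <- (pm10 - mean(pm10)) / sd(pm10)

# Robust z-score: median and MAD are barely affected by the spike
robust_z <- (pm10 - median(pm10)) / mad(pm10)

classic_hits <- which(abs(z) > 3)
robust_hits <- which(abs(robust_z) > 3.5)
classic_hits
robust_hits
```

In this toy series the classic rule flags nothing, while the robust score isolates the spike at position 7.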
Detection (flatlines):
- Detect long sequences of identical values.
Code:
```r
# Run-length encoding to find flatlines
rle_PM10 <- rle(data$PM10)
flatlines <- which(rle_PM10$lengths > 10) # e.g., more than 10 identical readings
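# Get the start positions of flatlines
# (cumsum() gives each run's end index, so step back by the run length)
run_ends <- cumsum(rle_PM10$lengths)
flatline_positions <- run_ends[flatlines] - rle_PM10$lengths[flatlines] + 1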
```
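For review it is often convenient to have the start, end, and stuck value of each flatline in one table. A minimal sketch, assuming a shorter threshold (`min_run = 4`) than the 10 readings used above so the toy vector stays small:

```r
# Toy series with two stuck-sensor episodes (hypothetical values)
pm10 <- c(5, 7, 7, 7, 7, 9, 3, 3, 3, 3, 3, 8)
min_run <- 4  # assumed threshold for this toy example

runs <- rle(pm10)
ends <- cumsum(runs$lengths)       # last index of each run
starts <- ends - runs$lengths + 1  # first index of each run
is_flat <- runs$lengths >= min_run

flat_table <- data.frame(
  start = starts[is_flat],
  end = ends[is_flat],
  value = runs$values[is_flat]
)
flat_table
```

Note that rle() treats each NA as its own run, so missing values never count toward a flatline.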
Detection (unit issues and implausible values):
- Check for impossible values (e.g., PM10 > 1000 μg/m³).
- Use summary statistics.
Code:
```r
summary(data$PM10)
# Set a threshold for plausible values
impossible <- which(data$PM10 > 1000 | data$PM10 < 0)
data[impossible, ]
```
| Error Type | Detection Tool/Function | Example Package |
|---|---|---|
| Missing Data | is.na(), vis_miss() | naniar |
| Duplicates | duplicated() | base |
| Timestamp Errors | diff(), lubridate | lubridate |
| Outliers/Spikes | boxplot(), tsoutliers | tsoutliers |
| Flatlines | rle() | base |
| Unit Issues | summary(), logical checks | base |
Common Fixes (missing data):
- Remove the affected rows, impute with the mean or median, or fill gaps by linear interpolation.
Code:
```r
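# The three options below are alternatives; apply only one.
# (After median imputation no NAs remain, so interpolation would do nothing.)
# Option 1: remove rows with missing PM10
clean_data <- data[!is.na(data$PM10), ]
# Option 2: impute with the median
data$PM10[is.na(data$PM10)] <- median(data$PM10, na.rm = TRUE)
# Option 3: linear interpolation (na.rm = FALSE keeps leading/trailing NAs)
library(zoo)
data$PM10 <- na.approx(data$PM10, na.rm = FALSE)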
```
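Interpolating across a long outage effectively invents data. zoo's na.approx() accepts a `maxgap` argument that limits how many consecutive NAs are filled; the sketch below uses it on a made-up vector, with the limit of 2 chosen only for illustration.

```r
library(zoo)

# Toy series: one single-reading gap and one four-reading gap
pm10 <- c(10, NA, 14, NA, NA, NA, NA, 24, 26)

# Fill gaps of at most 2 consecutive NAs; longer gaps stay NA
filled <- na.approx(pm10, maxgap = 2, na.rm = FALSE)
filled
```

A sensible limit depends on the sampling interval and on how quickly the measured variable can change, so it is best treated as a per-dataset decision.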
Common Fix (duplicates):
- Remove duplicate rows or duplicate timestamps.
Code:
```r
# Remove exact duplicate rows
data <- data[!duplicated(data), ]
# Remove duplicate timestamps, keeping the first occurrence
data <- data[!duplicated(data$timestamp), ]
```
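Keeping the first occurrence is arbitrary when duplicate timestamps carry different readings. One alternative, sketched here with base R's aggregate() on hypothetical data, is to average the readings that share a timestamp:

```r
# Toy data with two different readings logged at 01:00
aq <- data.frame(
  timestamp = as.POSIXct(c("2024-01-01 00:00:00", "2024-01-01 01:00:00",
                           "2024-01-01 01:00:00", "2024-01-01 02:00:00"),
                         tz = "UTC"),
  PM10 = c(10, 12, 16, 11)
)

# One row per timestamp, averaging readings that share a timestamp.
# Note: the formula interface drops rows where PM10 is NA.
deduped <- aggregate(PM10 ~ timestamp, data = aq, FUN = mean)
deduped
```

Whether averaging is appropriate depends on why the duplicates exist; for re-transmitted records, keeping a single copy may be the better choice.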
Common Fixes (timestamp errors):
- Standardize the timestamp format, fill in missing timestamps, or interpolate missing time points.
Code:
```r
library(lubridate)
library(dplyr)
# Standardize timestamp
data$timestamp <- ymd_hms(data$timestamp)
# Create complete time sequence (e.g., hourly)
full_times <- data.frame(timestamp = seq(min(data$timestamp), max(data$timestamp), by = "hour"))
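# Join so that missing times appear as rows with NA values.
# full_join() also keeps off-grid timestamps from data, so re-sort afterwards.
data <- full_join(full_times, data, by = "timestamp") %>% arrange(timestamp)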
```
Common Fixes (outliers and spikes):
- Replace outliers with NA, or impute using nearby values.
Code:
```r
# Identify outliers (e.g., >3 SD from mean)
z <- scale(data$PM10)
data$PM10[abs(z) > 3] <- NA
# Interpolate after removing outliers
library(zoo)
data$PM10 <- na.approx(data$PM10, na.rm = FALSE)
```
Common Fix (flatlines):
- Replace long identical runs with NA, then interpolate.
Code:
```r
rle_PM10 <- rle(data$PM10)
# Identify flatlines longer than 10 readings:
# expand each run's TRUE/FALSE flag back to one value per reading
flat_idx <- rep(rle_PM10$lengths > 10, times = rle_PM10$lengths)
data$PM10[flat_idx] <- NA
# Interpolate
library(zoo)
data$PM10 <- na.approx(data$PM10, na.rm = FALSE)
```
Common Fix (unit issues):
- Convert units or set implausible values to NA.
Code:
```r
# Set implausible values (e.g., PM10 > 1000) to NA
data$PM10[data$PM10 > 1000 | data$PM10 < 0] <- NA
```
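The conversion half of this fix could look like the sketch below. It assumes a hypothetical `unit` column that records whether each reading was logged in mg/m³ or µg/m³; real files may not carry such a column, in which case the mix-up has to be inferred from the values themselves.

```r
# Toy data with mixed units (the 'unit' column is an assumption)
aq <- data.frame(
  PM10 = c(25, 0.030, 40, 0.018),
  unit = c("ug/m3", "mg/m3", "ug/m3", "mg/m3"),
  stringsAsFactors = FALSE
)

# Convert mg/m3 readings to ug/m3 (1 mg/m3 = 1000 ug/m3)
in_mg <- aq$unit == "mg/m3"
aq$PM10[in_mg] <- aq$PM10[in_mg] * 1000
aq$unit[in_mg] <- "ug/m3"
aq
```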
| Error Type | Repair Strategy | Key Function(s) |
|---|---|---|
| Missing Data | Remove, impute, interpolate | na.approx(), median() |
| Duplicates | Remove duplicates | duplicated() |
| Timestamp Errors | Standardize, fill, merge | ymd_hms(), full_join() |
| Outliers/Spikes | Set to NA, interpolate | scale(), na.approx() |
| Flatlines | Set to NA, interpolate | rle(), na.approx() |
| Unit Issues | Convert, set to NA | Logical indexing |
Tip:
Always visualize before and after cleaning (e.g., with plot() or ggplot2) to verify repairs.
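A before-and-after comparison can be as simple as two stacked base-R plots. The sketch below uses made-up readings and the plausibility limits from the earlier checks (below 0 or above 1000); the variable names are illustrative.

```r
# Toy series with one impossible high and one negative reading
pm10_raw <- c(20, 22, 21, 1500, 23, 22, -5, 24, 25, 23)

# Apply the plausibility limits to a copy so the raw data is preserved
pm10_clean <- pm10_raw
pm10_clean[pm10_clean > 1000 | pm10_clean < 0] <- NA

# Two stacked panels: raw series on top, cleaned series below
old_par <- par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))
plot(pm10_raw, type = "b", main = "PM10 before cleaning",
     xlab = "Reading", ylab = "PM10")
plot(pm10_clean, type = "b", main = "PM10 after cleaning",
     xlab = "Reading", ylab = "PM10")
par(old_par)

# Number of readings the cleaning step set to NA
n_changed <- sum(is.na(pm10_clean) & !is.na(pm10_raw))
n_changed
```

Keeping the raw vector alongside the cleaned one makes it easy to report how many readings each repair step touched.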
| Topic | Reference/Resource |
|---|---|
| Environmental Data QC | WHO |
| R Data Cleaning | Wickham & Grolemund (R4DS), Peng, Van der Loo & de Jonge |
| R Packages | naniar, zoo, lubridate, tsoutliers, dplyr, skimr (CRAN documentation) |
| Air Quality Sensors | Holstius et al., Spinelle et al. |
| Outlier/Missing Data | Hyndman & Athanasopoulos, Van der Loo & de Jonge |