When working with data files from instruments measuring atmospheric variables, several types of errors are commonly encountered.
- Calibration Errors: Incorrect or infrequent calibration can introduce systematic bias.
- Response Time Lag: Delayed sensor response to rapid concentration changes.
- Timestamp Errors: Incorrect, missing, or non-uniform timestamps (e.g., due to daylight saving time changes).
- Corrupted Files: Partially written or unreadable data files caused by transmission or storage errors (a read-check sketch follows the table below).
| Error Type | Example | Impact |
|---|---|---|
| Instrumental | Sensor drift, calibration error | Systematic bias |
| Recording/Transmission | Missing data, timestamp errors | Data gaps, misalignment |
| Environmental | Humidity interference, contamination | False highs/lows |
| Processing/Human | Entry errors, unit mix-ups | Misinterpretation, wrong statistics |
| Outliers/Anomalies | Spikes, flatlines | Skewed mean/variance |
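Corrupted or partially written files are easiest to catch at import time, before any of the checks below. A minimal sketch, assuming the raw data arrive as CSV files in a raw_data/ directory (both the path and the plain read.csv() call are assumptions to adapt to your export format):

```r
# Attempt to read each raw file; flag files that error out or come back empty
# (the raw_data/ path and read.csv() settings are assumptions -- adjust as needed)
files <- list.files("raw_data", pattern = "\\.csv$", full.names = TRUE)

read_status <- sapply(files, function(f) {
  out <- tryCatch(read.csv(f), error = function(e) NULL, warning = function(w) NULL)
  if (is.null(out) || nrow(out) == 0) "failed/empty" else "ok"
})

# Files to re-download, re-export, or exclude from the analysis
names(read_status)[read_status != "ok"]
```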
Detection (missing data):
- Use `is.na()` to find missing values.
- Use `summary()` or `skimr::skim()` for an overview (a short sketch follows the code block below).
Code:
```r
# Basic count of missing values per column
colSums(is.na(data))
# Visualize missingness
library(naniar)
vis_miss(data)
```
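As mentioned in the bullets above, a column-wise overview complements the missing-value counts; a minimal sketch, assuming the skimr package is installed:

```r
# Column-wise overview, including completeness and missing-value counts
summary(data)
library(skimr)
skim(data)
```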
Detection (duplicates):
- Use `duplicated()` to find repeated rows or timestamps.
Code:
```r
# Find duplicated rows
sum(duplicated(data))
# Find duplicated timestamps (assuming 'timestamp' column)
sum(duplicated(data$timestamp))
```
Detection (timestamp errors):
- Check for non-sequential or irregular time intervals.
- Use `dplyr` and `lubridate` for time operations.
Code:
```r
library(dplyr)
library(lubridate)
# Convert to POSIXct if needed
data$timestamp <- ymd_hms(data$timestamp)
# Check time differences
data <- data %>% arrange(timestamp)
diffs <- diff(data$timestamp)
table(diffs)
```
Detection (outliers and spikes):
- Use boxplots, z-scores, or the `tsoutliers` package for time series.
Code:
```r
# Visualize with boxplot
boxplot(data$PM10, main="PM10 Boxplot")
# Identify values beyond 3 standard deviations
outliers <- abs(scale(data$PM10)) > 3
which(outliers)
# For time series outlier detection
library(tsoutliers)
ts_data <- ts(data$PM10, frequency = 24) # e.g., hourly data
outlier_results <- tso(ts_data)
outlier_results$outliers
```
Detection (flatlines):
- Detect long runs of identical values.
Code:
```r
# Run-length encoding to find flatlines
rle_PM10 <- rle(data$PM10)
flatlines <- which(rle_PM10$lengths > 10) # e.g., more than 10 identical readings
# Get the start positions of the flatline runs
flatline_positions <- (cumsum(rle_PM10$lengths) - rle_PM10$lengths + 1)[flatlines]
```
Detection (unit issues and impossible values):
- Check for physically impossible values (e.g., PM10 > 1000 μg/m³ or negative concentrations).
- Use summary statistics.
Code:
```r
summary(data$PM10)
# Set a threshold for plausible values
impossible <- which(data$PM10 > 1000 | data$PM10 < 0)
data[impossible, ]
```
| Error Type | Detection Tool/Function | Example Package |
|---|---|---|
| Missing Data | `is.na()`, `vis_miss()` | naniar |
| Duplicates | `duplicated()` | base |
| Timestamp Errors | `diff()`, `lubridate` | lubridate |
| Outliers/Spikes | `boxplot()`, tsoutliers | tsoutliers |
| Flatlines | `rle()` | base |
| Unit Issues | `summary()`, logical checks | base |
Common Fixes (missing data):
- Impute using the mean or median, interpolate linearly, or remove affected rows.
Code:
```r
# Remove rows with missing PM10
clean_data <- data[!is.na(data$PM10), ]
# Impute with median
data$PM10[is.na(data$PM10)] <- median(data$PM10, na.rm = TRUE)
# Linear interpolation
library(zoo)
data$PM10 <- na.approx(data$PM10, na.rm = FALSE)
```
Common Fix (duplicates):
- Remove duplicate rows or duplicate timestamps.
Code:
```r
# Remove exact duplicate rows
data <- data[!duplicated(data), ]
# Remove duplicate timestamps, keeping the first occurrence
data <- data[!duplicated(data$timestamp), ]
```
Common Fixes (timestamp errors):
- Standardize the timestamp format, fill in missing timestamps, or interpolate missing time points (a sketch of the interpolation step follows the code below).
Code:
```r
library(lubridate)
library(dplyr)
# Standardize timestamp
data$timestamp <- ymd_hms(data$timestamp)
# Create complete time sequence (e.g., hourly)
full_times <- data.frame(timestamp = seq(min(data$timestamp), max(data$timestamp), by = "hour"))
# Merge and fill missing times with NA
data <- full_join(full_times, data, by = "timestamp")
```
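To cover the "interpolate missing time points" option mentioned above, the NA gaps introduced by the join can then be filled; a minimal sketch, assuming PM10 is the variable of interest:

```r
library(dplyr)
library(zoo)
# Order by time, then linearly interpolate the NAs created by the join
data <- data %>% arrange(timestamp)
data$PM10 <- na.approx(data$PM10, na.rm = FALSE)
```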
Common Fixes (outliers and spikes):
- Replace outliers with NA, or impute them using nearby values.
Code:
```r
# Identify outliers (e.g., >3 SD from mean)
z <- scale(data$PM10)
data$PM10[abs(z) > 3] <- NA
# Interpolate after removing outliers
library(zoo)
data$PM10 <- na.approx(data$PM10, na.rm = FALSE)
```
Common Fix (flatlines):
- Replace long runs of identical values with NA, then interpolate.
Code:
```r
library(zoo)
rle_PM10 <- rle(data$PM10)
# Flag positions that belong to runs longer than 10 identical readings
flat_idx <- inverse.rle(list(lengths = rle_PM10$lengths,
                             values = rle_PM10$lengths > 10))
data$PM10[flat_idx] <- NA
# Interpolate across the removed flatlines
data$PM10 <- na.approx(data$PM10, na.rm = FALSE)
```
Common Fix (unit issues):
- Convert units or set implausible values to NA (a conversion sketch follows the code below).
Code:
```r
# Set implausible values (e.g., PM10 > 1000) to NA
data$PM10[data$PM10 > 1000 | data$PM10 < 0] <- NA
```
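For the unit-conversion case, a minimal sketch; the `unit` column and the mg/m³ readings are hypothetical and only illustrate the pattern:

```r
# Hypothetical 'unit' column: convert mg/m3 readings to ug/m3 (1 mg = 1000 ug)
mg_rows <- which(data$unit == "mg/m3")
data$PM10[mg_rows] <- data$PM10[mg_rows] * 1000
data$unit[mg_rows] <- "ug/m3"
```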
| Error Type | Repair Strategy | Key Function(s) |
|---|---|---|
| Missing Data | Remove, impute, interpolate | `na.approx()`, `median()` |
| Duplicates | Remove duplicates | `duplicated()` |
| Timestamp Errors | Standardize, fill, merge | `ymd_hms()`, `full_join()` |
| Outliers/Spikes | Set to NA, interpolate | `scale()`, `na.approx()` |
| Flatlines | Set to NA, interpolate | `rle()`, `na.approx()` |
| Unit Issues | Convert, set to NA | Logical indexing |
Tip: Always visualize the data before and after cleaning (e.g., with `plot()` or `ggplot2`) to verify the repairs.
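A minimal sketch of such a check, assuming a copy of the raw series was kept before cleaning (raw_PM10 is a hypothetical column):

```r
library(ggplot2)
# Overlay the raw and cleaned series to confirm the repairs look sensible
ggplot(data, aes(x = timestamp)) +
  geom_line(aes(y = raw_PM10), colour = "grey70") +
  geom_line(aes(y = PM10), colour = "steelblue") +
  labs(x = "Time", y = "PM10 (µg/m³)",
       title = "PM10 before (grey) and after (blue) cleaning")
```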
| Topic | Reference/Resource |
|---|---|
| Environmental Data QC | WHO |
| R Data Cleaning | Wickham & Grolemund (R4DS); Peng; Van der Loo & de Jonge |
| R Packages | naniar, zoo, lubridate, tsoutliers, dplyr, skimr (CRAN documentation) |
| Air Quality Sensors | Holstius et al.; Spinelle et al. |
| Outlier/Missing Data | Hyndman & Athanasopoulos; Van der Loo & de Jonge |