Low-Cost Air Quality Sensor Calibration Techniques in R

Author

Joel Duah

Published

September 24, 2024

Introduction

This document presents several calibration techniques for low-cost sensors: multiple linear regression (MLR), polynomial regression (PR), random forests (RF), XGBoost, support vector regression (SVR) and Gaussian mixture regression (GMR). We also compare model performance and discuss what matters when choosing a calibration technique for low-cost sensors.

Key Processes Involved in Calibration of Low-Cost Air Quality Sensors:

Reference Instrument Comparison:
Low-cost sensors are calibrated by comparing their measurements with a higher-accuracy reference instrument, such as those used by regulatory agencies (e.g., Federal Reference Methods, FRM, or Federal Equivalent Methods, FEM). The reference instrument provides a baseline, and adjustments are made to the low-cost sensor’s output.
Environmental Factors:
Environmental factors such as temperature, humidity, and atmospheric pressure can affect sensor readings. Calibration often involves adjusting for these variables, sometimes through mathematical models or algorithms that account for their impact on the sensor’s measurements.
Co-location:
In the co-location process, the low-cost sensors are placed next to a reference monitor for a period of time to gather data under the same conditions. This comparison helps identify biases or drifts in the sensor’s measurements, which can then be corrected.
Field Calibration vs. Laboratory Calibration:
- Laboratory Calibration: Performed in controlled environments where specific variables (e.g., gas concentrations) are introduced, allowing precise adjustments.
- Field Calibration: Done in the actual deployment environment, where real-world conditions influence sensor behavior. This is crucial for low-cost sensors because they often react differently in the field compared to lab settings.
Statistical or Machine Learning Models:
Sometimes, data from low-cost sensors are corrected using statistical or machine learning models. These models are trained on data from reference instruments and then used to adjust the sensor’s readings in real-time or during post-processing.
Routine Calibration:
Low-cost sensors can drift over time, requiring periodic re-calibration to ensure that they maintain accuracy. This can be done manually or through automated processes if the infrastructure supports it.
Zero Calibration and Span Calibration:
- Zero Calibration: Ensures the sensor reads zero when no pollutants are present.
- Span Calibration: Ensures the sensor accurately reads known pollutant concentrations above zero.
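
As a rough illustration of the zero and span idea, a two-point correction can be written in a few lines of R. This is only a sketch: the function name `zero_span_correct` and all readings below are hypothetical and are not taken from the co-location data used later.

```r
# Two-point (zero/span) correction: a minimal sketch with made-up numbers.
# raw       : raw sensor readings
# zero_raw  : sensor reading in pollutant-free air
# span_raw  : sensor reading at a known concentration
# span_true : that known concentration
zero_span_correct <- function(raw, zero_raw, span_raw, span_true) {
  gain <- span_true / (span_raw - zero_raw)  # slope mapping raw units to true units
  (raw - zero_raw) * gain                    # remove the offset, then rescale
}

# Hypothetical check values: the sensor reads 1.5 in clean air
# and 22.5 when the true concentration is 30
corrected <- zero_span_correct(c(1.5, 12, 22.5),
                               zero_raw = 1.5, span_raw = 22.5, span_true = 30)
print(corrected)
```

A two-point correction assumes the sensor response is linear between the zero and span points; the regression and machine learning methods below relax that assumption.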

Let us proceed with our sensor calibration. The process starts with loading the data and applying some pre-processing steps to ensure the data are clean and ready for modeling:

Data Upload and Preprocessing

Data Frame Summary

df

Dimensions: 650 x 6
Duplicates: 0

Variable	Mean (SD)	Min ≤ Median ≤ Max	IQR (CV)	Distinct values	Valid	Missing
ENE00960_PM2.5 [numeric]	12.1 (3.7)	3.4 ≤ 11.7 ≤ 26.6	4.9 (0.3)	612	642 (98.8%)	8 (1.2%)
ENE00950_PM2.5 [numeric]	17.7 (5.2)	4 ≤ 17.4 ≤ 37.6	6.7 (0.3)	510	630 (96.9%)	20 (3.1%)
ENE00933_PM2.5 [numeric]	19.6 (5.3)	8.6 ≤ 19.1 ≤ 43.8	6.9 (0.3)	180	288 (44.3%)	362 (55.7%)
Teledyne_PM2.5 [numeric]	16.8 (3.6)	8.1 ≤ 16.1 ≤ 33	4.5 (0.2)	624	624 (96.0%)	26 (4.0%)
RH [numeric]	98.7 (3.6)	85.4 ≤ 100.5 ≤ 100.7	0.8 (0)	240	619 (95.2%)	31 (4.8%)
Temp [numeric]	25.7 (1.2)	23.5 ≤ 25.3 ≤ 29	1.6 (0)	606	619 (95.2%)	31 (4.8%)

We also present descriptive statistics for the numeric variables in our data. The table below reports basic and advanced statistics for the columns in the dataset, including the mean, standard deviation, minimum and maximum values, and the number and percentage of valid observations, among others.

Descriptive Statistics

df

N: 650

	ENE00933_PM2.5	ENE00950_PM2.5	ENE00960_PM2.5	RH	Teledyne_PM2.5	Temp
Mean	19.57	17.70	12.07	98.74	16.79	25.68
Std.Dev	5.30	5.20	3.69	3.56	3.58	1.22
Min	8.57	4.00	3.40	85.42	8.05	23.51
Q1	15.54	14.00	9.39	99.67	14.18	24.80
Median	19.08	17.42	11.75	100.50	16.10	25.30
Q3	22.50	20.71	14.28	100.50	18.68	26.36
Max	43.80	37.61	26.64	100.66	32.97	28.98
MAD	5.19	5.06	3.62	0.06	3.25	0.94
IQR	6.91	6.71	4.88	0.81	4.50	1.56
CV	0.27	0.29	0.31	0.04	0.21	0.05
Skewness	0.83	0.53	0.61	-2.01	0.97	0.79
SE.Skewness	0.14	0.10	0.10	0.10	0.10	0.10
Kurtosis	1.22	0.43	0.50	2.83	1.38	-0.24
N.Valid	288	630	642	619	624	619
Pct.Valid	44.31	96.92	98.77	95.23	96.00	95.23

Sensor Data Summary
650 rows x 6 cols
Column	Missing	Mean	Median	SD
ENE00960_PM2.5	1.2%	12.1	11.7	3.7
ENE00950_PM2.5	3.1%	17.7	17.4	5.2
ENE00933_PM2.5	55.7%	19.6	19.1	5.3
Teledyne_PM2.5	4.0%	16.8	16.1	3.6
RH	4.8%	98.7	100.5	3.6
Temp	4.8%	25.7	25.3	1.2

The summary tables show no obvious outliers in the data. However, ENE00933 has 55.7% missing data (i.e. only 44.3% valid observations) and is therefore excluded from the calibration. Since the remaining data require no further cleaning, we can proceed with the calibrations.
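
The missingness check and the exclusion step described above could be sketched as follows. The toy data frame only reuses the study's column names; its values, the 50% threshold and the object names (`pct_missing`, `df_clean`, `df_model`) are assumptions made for illustration.

```r
# Toy data frame with the same column names as the study data (values are made up)
df <- data.frame(
  ENE00960_PM2.5 = c(10.2, 11.5, NA, 13.1, 9.8, 12.4),
  ENE00933_PM2.5 = c(NA, 18.0, NA, NA, 20.1, NA),
  Teledyne_PM2.5 = c(15.0, 16.2, 17.1, NA, 14.8, 16.9),
  RH             = c(99.1, 100.5, 100.5, 98.7, NA, 100.2),
  Temp           = c(25.1, 25.4, 26.0, 24.9, 25.3, 25.8)
)

# Percentage of missing values per column
pct_missing <- round(colMeans(is.na(df)) * 100, 1)
print(pct_missing)

# Drop columns above a hypothetical 50% missingness threshold
keep     <- names(pct_missing)[pct_missing <= 50]
df_clean <- df[, keep]

# Keep only rows where every variable used in modelling is observed
model_vars <- c("Teledyne_PM2.5", "ENE00960_PM2.5", "RH")
df_model   <- df_clean[complete.cases(df_clean[, model_vars]), ]
print(nrow(df_model))
```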

Parametric Methods

1. Multiple Linear Regression (MLR)

MLR is useful when you need to account for multiple factors, such as temperature, humidity, and pressure, that may influence sensor readings. Here we regress Teledyne PM2.5 (the dependent variable) on the raw sensor PM2.5 and relative humidity (the explanatory variables), a specification widely used in practice and in the literature.

Calibration formula: Teledyne_PM2.5 = 29.43179 + 0.7804852 * ENE00960_PM2.5 - 0.2248289 * RH

RMSE (Multiple Linear):  2.587385 
R² (Multiple Linear):  0.4991407

Multiple Linear Regression Plot: Raw data vs Calibrated data
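
A fit of this kind can be reproduced with `lm()`. The sketch below runs on simulated data, since the original data file is not part of this document; the simulation coefficients are only loosely inspired by the reported formula and the results will not match the figures above.

```r
set.seed(42)

# Simulated stand-in for the co-location data (not the study data)
n <- 200
sim <- data.frame(
  ENE00960_PM2.5 = rnorm(n, mean = 12, sd = 3.7),
  RH             = runif(n, min = 85, max = 100)
)
sim$Teledyne_PM2.5 <- 29 + 0.78 * sim$ENE00960_PM2.5 - 0.22 * sim$RH +
  rnorm(n, sd = 2.5)

# Regress the reference on the raw sensor reading and relative humidity
mlr_fit <- lm(Teledyne_PM2.5 ~ ENE00960_PM2.5 + RH, data = sim)
print(coef(mlr_fit))   # intercept and slopes make up the calibration formula

# Calibrated values are the model's predictions
sim$calibrated <- predict(mlr_fit, newdata = sim)

rmse <- sqrt(mean((sim$Teledyne_PM2.5 - sim$calibrated)^2))
r2   <- summary(mlr_fit)$r.squared
cat("RMSE (Multiple Linear):", rmse, "\n")
cat("R-squared (Multiple Linear):", r2, "\n")
```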

2. Quadratic Regression using polynomial terms

Sometimes the relationship between the sensor data and the reference data is nonlinear but can be captured with a polynomial function (e.g. quadratic or cubic). In this example, we use quadratic regression with polynomial terms to calibrate the low-cost sensor and derive a correction factor from the model.

RMSE (Polynomial):  2.558854 
R² (Polynomial):  0.5101258

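Polynomial calibration formula: Teledyne_PM2.5 = 42.51667 + 68.2095 * ENE00960_PM2.5 - 9.534164 * (ENE00960_PM2.5)^2 - 0.259714 * RH

Note: the magnitudes of the linear and quadratic coefficients suggest they were estimated on the orthogonal polynomial basis that R's poly() produces by default, not on the raw readings. If so, calibrated values should be obtained with predict() on the fitted model instead of substituting raw readings into this formula.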

Polynomial Regression plot: Raw data vs Calibrated data
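
In R, a quadratic calibration is commonly fitted with `poly()`. By default `poly()` builds an orthogonal basis, so its coefficients cannot be multiplied directly with raw readings, whereas `raw = TRUE` gives coefficients on x and x². The sketch below, on simulated data that is not the study data, shows that both parameterizations give the same fitted values but different coefficients.

```r
set.seed(1)

# Simulated stand-in for the co-location data (not the study data)
n <- 200
sim <- data.frame(
  ENE00960_PM2.5 = runif(n, 3, 27),
  RH             = runif(n, 85, 100)
)
sim$Teledyne_PM2.5 <- 30 + 1.1 * sim$ENE00960_PM2.5 -
  0.012 * sim$ENE00960_PM2.5^2 - 0.25 * sim$RH + rnorm(n, sd = 2)

# Orthogonal basis (the default of poly())
fit_orth <- lm(Teledyne_PM2.5 ~ poly(ENE00960_PM2.5, 2) + RH, data = sim)

# Raw basis: coefficients multiply x and x^2 directly
fit_raw <- lm(Teledyne_PM2.5 ~ poly(ENE00960_PM2.5, 2, raw = TRUE) + RH,
              data = sim)

print(coef(fit_orth))
print(coef(fit_raw))

# Same model, different parameterization: fitted values agree
max_diff <- max(abs(fitted(fit_orth) - fitted(fit_raw)))
print(max_diff)

# Only the raw coefficients can be applied by hand to raw readings
b <- unname(coef(fit_raw))
manual <- b[1] + b[2] * sim$ENE00960_PM2.5 +
  b[3] * sim$ENE00960_PM2.5^2 + b[4] * sim$RH
```

When a closed-form calibration equation is needed for deployment, the raw parameterization is the one to report; for prediction inside R, `predict()` works with either fit.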

Parametric Models Performance Comparison

Model	R	R²	MAE	RMSE
MLR	0.71	0.50	1.93	2.59
Polynomial (Quadratic)	0.71	0.51	1.89	2.56

The table above compares the two parametric calibration methods using R, R², MAE and RMSE. Although the Pearson correlation coefficients of the two models are equal to two decimal places (R = 0.71), the quadratic model has a slightly higher R² (the share of variance in the reference PM2.5 explained by the model). It also has a lower RMSE (2.56) and MAE (1.89). MAE is the average absolute difference between the predicted and actual reference PM2.5 values, while RMSE penalizes large errors more heavily.
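
The four metrics can be computed with a small helper in base R. The function name `calibration_metrics` and the toy vectors are hypothetical. Here R² is computed as 1 - SSE/SST; the squared Pearson correlation is an alternative definition that coincides with it for least-squares fits evaluated on their own training data.

```r
# Compute R, R-squared, MAE and RMSE for a set of calibrated values
calibration_metrics <- function(observed, predicted) {
  ok        <- complete.cases(observed, predicted)  # ignore pairs with NA
  observed  <- observed[ok]
  predicted <- predicted[ok]
  err       <- predicted - observed
  c(
    R    = cor(observed, predicted),
    R2   = 1 - sum(err^2) / sum((observed - mean(observed))^2),
    MAE  = mean(abs(err)),
    RMSE = sqrt(mean(err^2))
  )
}

# Toy example: reference values and calibrated sensor values (made up)
m <- calibration_metrics(observed  = c(10, 12, 14, 16),
                         predicted = c(11, 12, 13, 17))
print(round(m, 3))
```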

Non-parametric Methods

1. Random Forest

Random forest models capture nonlinear relationships between sensor data and environmental factors. Random forest is an ensemble method based on decision trees and is particularly effective when many features, such as atmospheric pressure, relative humidity and temperature, influence the readings.

Steps to Perform Random Forest Calibration:

Load the Data: We’ll use our dataset, which includes PM2.5 values for low-cost sensors, the reference monitor, and environmental variables like RH.
Preprocessing: We need to ensure the data is clean, handle missing values (if any), and split the data into training and testing sets.
Fit Random Forest Model: We train a Random Forest model using the low-cost sensor data and RH to predict the reference sensor data.
Evaluate Model Performance: We’ll calculate metrics like Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared to assess the model.
Extract the Calibration Formula: Random Forest doesn’t provide a straightforward equation like linear models, but you can use it to predict new data (i.e., calibrate future sensor readings).
Visualize the Results: We’ll compare the predicted and true values of the reference sensor data.

We split the data into training and testing sets (80% train, 20% test), train a random forest model on the training set, and then use the fitted model to make predictions on the test set.


Call:
 randomForest(formula = Teledyne_PM2.5 ~ ENE00960_PM2.5 + RH,      data = train_data, ntree = 500, importance = TRUE) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 1

          Mean of squared residuals: 5.759695
                    % Var explained: 56.74

Random Forest Performance Metrics:

Mean Absolute Error (MAE): 1.934813

Root Mean Squared Error (RMSE): 2.576356

R-squared: 0.5127997

Correlation coefficient: 0.7161004

Random Forest Calibration: Teledyne vs ENE00960 (Predicted)
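
The workflow above can be sketched with the `randomForest` package, using the same call that appears in the model output. The data are simulated stand-ins, so the metrics will differ from those reported; the seed and the simulation formula are assumptions.

```r
library(randomForest)

set.seed(123)

# Simulated stand-in for the co-location data (not the study data)
n <- 300
sim <- data.frame(
  ENE00960_PM2.5 = runif(n, 3, 27),
  RH             = runif(n, 85, 100)
)
sim$Teledyne_PM2.5 <- 8 + 6 * sqrt(sim$ENE00960_PM2.5) - 0.15 * sim$RH +
  rnorm(n, sd = 1.5)

# 80/20 train-test split
train_idx  <- sample(seq_len(n), size = floor(0.8 * n))
train_data <- sim[train_idx, ]
test_data  <- sim[-train_idx, ]

# Fit on the training set only
rf_fit <- randomForest(Teledyne_PM2.5 ~ ENE00960_PM2.5 + RH,
                       data = train_data, ntree = 500, importance = TRUE)

# Calibrate (predict) the held-out readings
rf_pred <- predict(rf_fit, newdata = test_data)

mae  <- mean(abs(test_data$Teledyne_PM2.5 - rf_pred))
rmse <- sqrt(mean((test_data$Teledyne_PM2.5 - rf_pred)^2))
r    <- cor(test_data$Teledyne_PM2.5, rf_pred)
cat("MAE:", mae, " RMSE:", rmse, " R:", r, "\n")

# Variable importance stands in for the coefficients of a linear model
print(importance(rf_fit))
```

Because there is no closed-form equation, the fitted object itself (for example saved with `saveRDS()`) acts as the calibration and is applied to new readings with `predict()`.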

2. Gaussian Mixture Models (GMMs/GMRs)

Gaussian Mixture Models (GMMs) are a powerful tool for modeling sensor data, especially when the data follow complex, multimodal distributions that may result from different operating conditions. GMMs fit multiple Gaussian distributions to the data, which makes them useful for capturing distinct regimes or states in sensor behavior. Gaussian mixture regression (GMR) then predicts the reference value from the fitted mixture. Some studies report that, even with missing inputs, GMR achieves better correlation than MLR and RF fitted on complete data.

---------------------------------------------------- 
Gaussian finite mixture model fitted by EM algorithm 
---------------------------------------------------- 

Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 2
components: 

 log-likelihood   n df       BIC       ICL
      -1640.231 568 11 -3350.226 -3356.213

Clustering table:
  1   2 
400 168

Performance metrics for the fitted GMM calibration:

RMSE: 2.343286 
MAE: 1.735203 
Correlation: 0.7675928 
R-squared: 0.5891871

Gaussian Mixture Regression Calibration for ENE00960_PM2.5
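
Gaussian mixture regression predicts the reference value as the conditional mean E[y | x] of a mixture fitted jointly on the reference and the predictors. The base-R sketch below implements that conditional mean under stated assumptions: the helper names `dmvn` and `gmr_predict` are hypothetical, and the two-component parameters are invented for illustration, not the values estimated in this study.

```r
# Multivariate normal density for the rows of a matrix x (base R only)
dmvn <- function(x, mu, sigma) {
  xc <- sweep(x, 2, mu)                        # centre each row
  q  <- rowSums((xc %*% solve(sigma)) * xc)    # squared Mahalanobis distances
  exp(-0.5 * q) / sqrt((2 * pi)^ncol(x) * det(sigma))
}

# Gaussian mixture regression: E[y | x] for a mixture fitted on (y, x1, ..., xp)
# pro   : mixing proportions, length G
# mu    : d x G matrix of component means, row 1 = y
# sigma : d x d x G array of component covariances
gmr_predict <- function(x, pro, mu, sigma) {
  x <- as.matrix(x)
  p <- ncol(x)
  G <- length(pro)
  w <- matrix(0, nrow(x), G)   # component weights given x (unnormalised)
  m <- matrix(0, nrow(x), G)   # per-component conditional means of y
  for (k in seq_len(G)) {
    mu_x <- mu[-1, k]
    s_xx <- matrix(sigma[-1, -1, k], nrow = p)
    s_yx <- matrix(sigma[1, -1, k], nrow = 1)
    w[, k] <- pro[k] * dmvn(x, mu_x, s_xx)
    m[, k] <- mu[1, k] + sweep(x, 2, mu_x) %*% solve(s_xx) %*% t(s_yx)
  }
  rowSums(w * m) / rowSums(w)  # responsibility-weighted average
}

# Hypothetical parameters for (Teledyne, sensor, RH); not fitted values
pro <- c(0.7, 0.3)
mu  <- cbind(c(16, 11, 100), c(19, 15, 92))
sigma <- array(0, dim = c(3, 3, 2))
sigma[, , 1] <- matrix(c(9, 6, 0,   6, 10, 0,    0, 0, 0.5), 3, 3)
sigma[, , 2] <- matrix(c(12, 8, -2, 8, 14, -1,  -2, -1, 9), 3, 3)

new_x <- cbind(ENE00960_PM2.5 = c(8, 12, 18), RH = c(100.4, 99, 90))
pred  <- gmr_predict(new_x, pro, mu, sigma)
print(pred)
```

With the `mclust` package, a joint fit such as `Mclust(cbind(y, x1, x2), G = 2, modelNames = "VVV")` stores the corresponding quantities in `fit$parameters$pro`, `fit$parameters$mean` and `fit$parameters$variance$sigma`, which could be passed to a helper like this one, provided the reference variable is the first column.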

3. XGBoost Calibration Model

XGBoost is a powerful machine learning model that performs well in regression tasks. It builds an ensemble of decision trees in a sequential manner, where each new tree corrects errors made by the previous ones. This iterative approach allows the model to learn complex relationships between input features and target variables, making it highly effective for regression, classification, and ranking tasks.

Why is XGBoost Used for Sensor Calibration?

XGBoost is used for sensor calibration due to its robustness and ability to handle non-linear relationships, which are often present in environmental data, especially when calibrating low-cost sensors. Here are some reasons why XGBoost is a good fit for this task:

Handling Non-linearity: Low-cost sensors often exhibit non-linear relationships between their measurements and the reference values (e.g., from a high-precision sensor). XGBoost can model these non-linearities more effectively than linear models.
Feature Interactions: XGBoost can automatically capture complex interactions between input variables, such as sensor measurements and environmental factors (e.g., temperature, humidity), which might affect the sensor’s performance.
Regularization: XGBoost includes regularization techniques (L1 and L2 penalties), which help prevent overfitting and improve generalization on unseen data. This is especially useful when the dataset is noisy, as is often the case in sensor networks.
Handling Missing Data: XGBoost can natively handle missing data, which is common in sensor networks due to sensor malfunctions or network issues. The algorithm automatically learns which path to take in the decision trees when data is missing.
Scalability: XGBoost is designed to be fast and scalable, making it ideal for large datasets or when dealing with sensor networks that generate vast amounts of data.
Outlier Handling: Low-cost sensors can be prone to outliers, especially under extreme environmental conditions. Because tree splits depend only on the ordering of feature values, XGBoost is less sensitive to outliers in the inputs than traditional regression methods. Outliers in the target can still influence a squared-error fit, in which case a more robust loss function may help.

After fitting the XGBoost model, the model performance is presented below

XGBoost Performance:

RMSE: 1.85

MAE: 1.41

Correlation: 0.86

R-squared: 0.75

XGBoost Calibration Model for ENE00960_PM2.5
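
A calibration of this kind might be fitted with the `xgboost` package as sketched below. The source does not list the hyperparameters or the exact feature set used, so `eta`, `max_depth`, `lambda`, `alpha`, `nrounds` and the choice of predictors are all assumptions, and the data are simulated. Some RH values are set to missing to show that the model accepts `NA` inputs directly.

```r
library(xgboost)

set.seed(7)

# Simulated stand-in for the co-location data (not the study data)
n <- 400
sim <- data.frame(
  ENE00960_PM2.5 = runif(n, 3, 27),
  RH             = runif(n, 85, 100)
)
sim$Teledyne_PM2.5 <- 5 + 5 * sqrt(sim$ENE00960_PM2.5) -
  0.1 * (sim$RH - 85) + rnorm(n, sd = 1.5)

# Knock out some RH readings; XGBoost handles NA features natively
sim$RH[sample(n, 20)] <- NA

features  <- c("ENE00960_PM2.5", "RH")
train_idx <- sample(seq_len(n), size = floor(0.8 * n))

dtrain <- xgb.DMatrix(data  = as.matrix(sim[train_idx, features]),
                      label = sim$Teledyne_PM2.5[train_idx])
dtest  <- xgb.DMatrix(data  = as.matrix(sim[-train_idx, features]),
                      label = sim$Teledyne_PM2.5[-train_idx])

# Assumed hyperparameters: small learning rate, shallow trees, L2 penalty
params <- list(objective = "reg:squarederror",
               eta = 0.1, max_depth = 3, lambda = 1, alpha = 0)

xgb_fit  <- xgb.train(params = params, data = dtrain, nrounds = 200, verbose = 0)
xgb_pred <- predict(xgb_fit, dtest)

obs  <- sim$Teledyne_PM2.5[-train_idx]
rmse <- sqrt(mean((obs - xgb_pred)^2))
mae  <- mean(abs(obs - xgb_pred))
cat("RMSE:", rmse, " MAE:", mae, " R:", cor(obs, xgb_pred), "\n")
```

In practice the number of rounds and the tree depth would normally be chosen by cross-validation instead of being fixed in advance.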

4. Support Vector Machine

Support Vector Regression (SVR) is a machine learning technique adapted from Support Vector Machines (SVM) and is particularly useful in sensor calibration due to its ability to handle nonlinear relationships between sensor measurements and reference data (e.g., from a reference monitor like Teledyne). SVR is widely used for calibrating low-cost sensors as it provides robust predictions, especially when the data exhibits noise or outliers.

Key Principles of SVR:

Kernel Trick for Nonlinearity:
- SVR uses kernel functions to transform the input data into a higher-dimensional space. This allows it to capture nonlinear relationships between the sensor readings (e.g., PM2.5 measurements) and the reference data.
- Commonly used kernels include:
  - Linear Kernel: Assumes a linear relationship.
  - Polynomial Kernel: Captures polynomial relationships.
  - Radial Basis Function (RBF) Kernel: Can handle complex, nonlinear relationships in the data.
Margin of Tolerance (ε-insensitive):
- SVR introduces a margin of tolerance called the ε-tube, within which errors are ignored. The objective is to find a regression function that is as flat as possible while penalizing only the prediction errors that fall outside this margin.
- It balances underfitting and overfitting by controlling the margin width and regularization.
Robustness to Outliers:
- SVR is less sensitive to outliers compared to traditional linear regression techniques, making it a good candidate for sensor calibration where noisy or unreliable sensor data is common.
Optimization Problem:
- SVR solves an optimization problem that seeks to minimize prediction errors while keeping model complexity low (through regularization).
- It aims to minimize a cost function that combines the prediction error and the model complexity (L2-norm of the coefficients).

SVR Performance:

RMSE: 2.41

MAE: 1.72

Correlation: 0.76

R-squared: 0.57

SVR Calibration model for ENE00960_PM2.5
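
The principles above can be sketched with `svm()` from the `e1071` package. The kernel, `cost` and `epsilon` values are assumptions, since the source does not report the settings used, and the data are simulated. The small helper `eps_insensitive` is hypothetical and only illustrates the ε-insensitive loss on toy errors.

```r
library(e1071)

# Epsilon-insensitive loss: errors inside the tube cost nothing
eps_insensitive <- function(err, eps) pmax(abs(err) - eps, 0)
toy_loss <- eps_insensitive(c(-0.05, 0.3, -1), eps = 0.1)
print(toy_loss)

set.seed(99)

# Simulated stand-in for the co-location data (not the study data)
n <- 300
sim <- data.frame(
  ENE00960_PM2.5 = runif(n, 3, 27),
  RH             = runif(n, 85, 100)
)
sim$Teledyne_PM2.5 <- 8 + 6 * sqrt(sim$ENE00960_PM2.5) - 0.15 * sim$RH +
  rnorm(n, sd = 1.5)

train_idx  <- sample(seq_len(n), size = floor(0.8 * n))
train_data <- sim[train_idx, ]
test_data  <- sim[-train_idx, ]

# RBF-kernel SVR; cost controls regularization, epsilon the tube width
# (svm() scales the variables by default, so epsilon acts on the scaled response)
svr_fit <- svm(Teledyne_PM2.5 ~ ENE00960_PM2.5 + RH, data = train_data,
               type = "eps-regression", kernel = "radial",
               cost = 1, epsilon = 0.1)

svr_pred <- predict(svr_fit, newdata = test_data)

rmse <- sqrt(mean((test_data$Teledyne_PM2.5 - svr_pred)^2))
mae  <- mean(abs(test_data$Teledyne_PM2.5 - svr_pred))
cat("RMSE:", rmse, " MAE:", mae,
    " R:", cor(test_data$Teledyne_PM2.5, svr_pred), "\n")
```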

Non-parametric Models Performance Comparison

Model	R	R²	MAE	RMSE
Random Forest	0.72	0.51	1.93	2.58
Gaussian Mixture Models	0.77	0.59	1.74	2.34
XGBoost	0.86	0.75	1.41	1.85
Support Vector Machines	0.76	0.57	1.72	2.41
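
To compare all six models side by side, the reported metrics can be collected in one data frame and ranked. The values below are copied from the two comparison tables, with the MLR R² taken from the reported 0.4991 rounded to two decimals. Note that the source does not state whether every model was evaluated on the same held-out split, so the ranking is indicative only.

```r
# Metrics as reported in this document
results <- data.frame(
  Model = c("MLR", "Polynomial (Quadratic)", "Random Forest",
            "Gaussian Mixture Models", "XGBoost", "Support Vector Machines"),
  R     = c(0.71, 0.71, 0.72, 0.77, 0.86, 0.76),
  R2    = c(0.50, 0.51, 0.51, 0.59, 0.75, 0.57),
  MAE   = c(1.93, 1.89, 1.93, 1.74, 1.41, 1.72),
  RMSE  = c(2.59, 2.56, 2.58, 2.34, 1.85, 2.41)
)

# Rank from lowest to highest RMSE
ranked <- results[order(results$RMSE), ]
rownames(ranked) <- NULL
print(ranked)
```

On these figures XGBoost has the lowest RMSE and MAE and the highest correlation, followed by the Gaussian mixture and support vector models, with the parametric models and random forest close together at the bottom.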