Regression Model
Investigate the influence of meteorological factors on PM2.5 and PM10 concentrations in Beijing's air
1. Data Loading and Initial Setup
Load the necessary libraries and the air quality dataset. The dataset
is cleaned and includes measurements of air pollutants and weather
variables over time.
# Clear environment
rm(list = ls())
# Load required packages
library(tidyverse)
library(caret)
library(corrplot)
library(car)
library(MASS)
library(glmnet)
library(broom)
# Load the air quality dataset
air_data <- read_csv("data/beijing_air_quality_cleaned.csv", show_col_types = FALSE)
air_data$datetime <- as.POSIXct(air_data$datetime)
# Display dataset dimensions and column names
#cat("Data loaded successfully. Dimensions:", nrow(air_data), "rows x", ncol(air_data), "columns\n")
#cat("Column names in dataset:\n")
#print(names(air_data))
2. Data Preparation for Regression
Extract the relevant columns for regression analysis, handle missing
values, and summarize the data. The dataset is divided into pollutant
variables and weather variables for analysis.
# Extract relevant columns for regression
# PM2.5=7, PM10=8, SO2=9, NO2=10, CO=11, O3=12, TEMP=13, PRES=14, DEWP=15, RAIN=16, WSPM=18
regression_data <- air_data[, c(7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 18)]
names(regression_data) <- c("PM25", "PM10", "SO2", "NO2", "CO", "O3", "TEMP", "PRES", "DEWP", "RAIN", "WSPM")
# Remove rows with missing values
regression_data <- na.omit(regression_data)
#cat("Regression dataset dimensions:", nrow(regression_data), "rows x", ncol(regression_data), "columns\n")
# Categorize variables
pollutant_vars <- c("PM25", "PM10", "SO2", "NO2", "CO", "O3")
weather_vars <- c("TEMP", "PRES", "DEWP", "RAIN", "WSPM")
# Calculate summary statistics
summary_stats <- data.frame(
variable = names(regression_data),
mean = round(sapply(regression_data, mean), 2),
sd = round(sapply(regression_data, sd), 2),
min = round(sapply(regression_data, min), 2),
max = round(sapply(regression_data, max), 2)
)
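The summary_stats table built above is not printed in the chunk; a simple way to render it in the knitted report (knitr::kable here is an assumption about the output toolchain, print() works equally well):
# Display the summary statistics table
knitr::kable(summary_stats, row.names = FALSE)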
3. Correlation Analysis
Analyze the correlation between air pollutants and weather variables.
This helps identify which weather factors are most related to pollutant
concentrations.
# Calculate correlation matrix
correlation_matrix <- cor(regression_data)
# Extract correlations between PM2.5/PM10 and weather variables
pm25_weather_cor <- correlation_matrix[1, 7:11]
pm10_weather_cor <- correlation_matrix[2, 7:11]
PM2.5 correlations with weather variables:
## TEMP PRES DEWP RAIN WSPM
## -0.097 -0.031 0.159 -0.014 -0.297
PM10 correlations with weather variables:
## TEMP PRES DEWP RAIN WSPM
## -0.092 -0.051 0.091 -0.030 -0.212
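The corrplot package is loaded in section 1, and the figure below is presumably drawn from the full correlation matrix; a minimal sketch of the plotting call (the styling options are assumptions, as the original plotting code is not echoed):
# Visualize the correlation matrix
corrplot(correlation_matrix, method = "color", type = "upper",
addCoef.col = "black", number.cex = 0.7, tl.col = "black")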
Variable Correlation Matrix

4. Simple Linear Regression Analysis
Explore the relationship between PM2.5 and each weather variable individually. This provides a baseline for understanding each variable's effect on its own.
# Prepare data for PM2.5 regression
pm25_data <- regression_data[, c("PM25", weather_vars)]
# Container for simple regression results
simple_results <- data.frame()
# Perform simple linear regression for each weather variable
for(var in weather_vars) {
formula_str <- paste("PM25 ~", var)
model <- lm(as.formula(formula_str), data = pm25_data)
model_summary <- summary(model)
simple_results <- rbind(simple_results, data.frame(
predictor = var,
slope = round(coef(model)[2], 4),
r_squared = round(model_summary$r.squared, 4),
adj_r_squared = round(model_summary$adj.r.squared, 4),
p_value = round(model_summary$coefficients[2,4], 6),
significance = ifelse(model_summary$coefficients[2,4] < 0.001, "***",
ifelse(model_summary$coefficients[2,4] < 0.01, "**",
ifelse(model_summary$coefficients[2,4] < 0.05, "*", "")))
))
}
Simple Linear Regression Results:
## predictor slope r_squared adj_r_squared p_value significance
## TEMP TEMP -0.5860 0.0095 0.0095 0.00000 ***
## PRES PRES -0.2040 0.0010 0.0009 0.00000 ***
## DEWP DEWP 0.7958 0.0253 0.0252 0.00000 ***
## RAIN RAIN -1.0239 0.0002 0.0002 0.01093 *
## WSPM WSPM -16.9140 0.0883 0.0883 0.00000 ***
5. Multiple Linear Regression Models
Build and compare several multiple regression models for PM2.5 concentration: a full model with all five weather variables, forward stepwise selection, and a polynomial model with a quadratic temperature term. (A ridge regression fit with glmnet is sketched after the model summaries; PM10 models follow in section 7.)
# Build full multiple linear regression model
full_model <- lm(PM25 ~ TEMP + PRES + DEWP + RAIN + WSPM, data = pm25_data)
full_summary <- summary(full_model)
# Build stepwise regression model
null_model <- lm(PM25 ~ 1, data = pm25_data)
stepwise_model <- step(null_model,
scope = list(lower = null_model, upper = full_model),
direction = "forward", trace = FALSE)
# Build polynomial regression model with temperature squared
poly_model <- lm(PM25 ~ poly(TEMP, 2) + PRES + DEWP + RAIN + WSPM, data = pm25_data)
poly_summary <- summary(poly_model)
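The car package is loaded in section 1 but not otherwise used; a natural companion check for the full model, given the correlated weather predictors, is the variance inflation factor (a sketch; its output is not part of the original report):
# Multicollinearity check: VIF values well above ~5 indicate problematic collinearity
vif(full_model)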
Full Multiple Linear Regression
##
## Call:
## lm(formula = PM25 ~ TEMP + PRES + DEWP + RAIN + WSPM, data = pm25_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -119.66 -44.35 -13.20 32.46 356.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1090.92323 59.96780 18.192 <2e-16 ***
## TEMP -4.45207 0.06803 -65.440 <2e-16 ***
## PRES -0.94438 0.05880 -16.061 <2e-16 ***
## DEWP 3.21415 0.05432 59.170 <2e-16 ***
## RAIN -3.43575 0.36377 -9.445 <2e-16 ***
## WSPM -3.53182 0.33970 -10.397 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61.39 on 35058 degrees of freedom
## Multiple R-squared: 0.1977, Adjusted R-squared: 0.1976
## F-statistic: 1728 on 5 and 35058 DF, p-value: < 2.2e-16
Stepwise Regression
##
## Call:
## lm(formula = PM25 ~ WSPM + TEMP + DEWP + PRES + RAIN, data = pm25_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -119.66 -44.35 -13.20 32.46 356.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1090.92323 59.96780 18.192 <2e-16 ***
## WSPM -3.53182 0.33970 -10.397 <2e-16 ***
## TEMP -4.45207 0.06803 -65.440 <2e-16 ***
## DEWP 3.21415 0.05432 59.170 <2e-16 ***
## PRES -0.94438 0.05880 -16.061 <2e-16 ***
## RAIN -3.43575 0.36377 -9.445 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61.39 on 35058 degrees of freedom
## Multiple R-squared: 0.1977, Adjusted R-squared: 0.1976
## F-statistic: 1728 on 5 and 35058 DF, p-value: < 2.2e-16
Polynomial Regression
##
## Call:
## lm(formula = PM25 ~ poly(TEMP, 2) + PRES + DEWP + RAIN + WSPM,
## data = pm25_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -120.11 -44.36 -13.27 32.48 357.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.032e+03 5.955e+01 17.335 <2e-16 ***
## poly(TEMP, 2)1 -9.497e+03 1.453e+02 -65.357 <2e-16 ***
## poly(TEMP, 2)2 -8.697e+01 6.166e+01 -1.410 0.158
## PRES -9.462e-01 5.881e-02 -16.088 <2e-16 ***
## DEWP 3.210e+00 5.442e-02 58.980 <2e-16 ***
## RAIN -3.453e+00 3.640e-01 -9.487 <2e-16 ***
## WSPM -3.525e+00 3.397e-01 -10.375 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61.39 on 35057 degrees of freedom
## Multiple R-squared: 0.1978, Adjusted R-squared: 0.1976
## F-statistic: 1440 on 6 and 35057 DF, p-value: < 2.2e-16
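The section introduction mentions ridge regression and glmnet is loaded in section 1, but no ridge fit appears in the chunks above. A minimal sketch of how one could be added (the cross-validated lambda choice and the rough R-squared are assumptions, not part of the original analysis):
# Ridge regression (alpha = 0) on the same five weather predictors
x <- as.matrix(pm25_data[, weather_vars])
y <- pm25_data$PM25
set.seed(123)
ridge_cv <- cv.glmnet(x, y, alpha = 0)           # lambda chosen by 10-fold CV
ridge_coef <- coef(ridge_cv, s = "lambda.min")   # coefficients at the best lambda
ridge_r2 <- 1 - min(ridge_cv$cvm) / var(y)       # rough cross-validated R-squared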
6. Model Comparison
Compare the performance of different regression models using AIC,
BIC, and adjusted R-squared values.
# Create model comparison table
model_comparison <- data.frame(
model = c("Linear", "Stepwise", "Polynomial"),
AIC = c(AIC(full_model), AIC(stepwise_model), AIC(poly_model)),
BIC = c(BIC(full_model), BIC(stepwise_model), BIC(poly_model)),
adj_r_squared = c(full_summary$adj.r.squared,
summary(stepwise_model)$adj.r.squared,
poly_summary$adj.r.squared)
)
#cat("\nModel Comparison:\n")
print(model_comparison)
## model AIC BIC adj_r_squared
## 1 Linear 388255.2 388314.5 0.1976122
## 2 Stepwise 388255.2 388314.5 0.1976122
## 3 Polynomial 388255.2 388322.9 0.1976348
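As a sanity check on the table, AIC can be reproduced from the model log-likelihood; a small illustrative sketch:
# AIC = -2*logLik + 2*df, where df counts the regression coefficients plus the residual variance
df_full <- length(coef(full_model)) + 1
manual_aic <- as.numeric(-2 * logLik(full_model) + 2 * df_full)
all.equal(manual_aic, AIC(full_model))   # should be TRUE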
7. PM10 Regression Analysis
Extend the analysis to PM10 concentration, building and evaluating
regression models similar to those for PM2.5.
# Prepare data for PM10 regression
pm10_data <- regression_data[, c("PM10", weather_vars)]
# Build full multiple linear regression model for PM10
pm10_model <- lm(PM10 ~ TEMP + PRES + DEWP + RAIN + WSPM, data = pm10_data)
pm10_summary <- summary(pm10_model)
# Build stepwise regression model for PM10
pm10_null <- lm(PM10 ~ 1, data = pm10_data)
pm10_stepwise <- step(pm10_null,
scope = list(lower = pm10_null, upper = pm10_model),
direction = "forward", trace = FALSE)
PM10 Regression Results
##
## Call:
## lm(formula = PM10 ~ TEMP + PRES + DEWP + RAIN + WSPM, data = pm10_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -151.08 -56.44 -15.38 44.15 368.62
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2294.48253 73.38364 31.267 < 2e-16 ***
## TEMP -4.52266 0.08325 -54.324 < 2e-16 ***
## PRES -2.10425 0.07196 -29.244 < 2e-16 ***
## DEWP 2.34024 0.06647 35.206 < 2e-16 ***
## RAIN -4.79982 0.44515 -10.782 < 2e-16 ***
## WSPM -2.96692 0.41570 -7.137 9.72e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 75.13 on 35058 degrees of freedom
## Multiple R-squared: 0.1209, Adjusted R-squared: 0.1208
## F-statistic: 964.2 on 5 and 35058 DF, p-value: < 2.2e-16
PM10 Stepwise Regression Results:
##
## Call:
## lm(formula = PM10 ~ WSPM + TEMP + DEWP + PRES + RAIN, data = pm10_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -151.08 -56.44 -15.38 44.15 368.62
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2294.48253 73.38364 31.267 < 2e-16 ***
## WSPM -2.96692 0.41570 -7.137 9.72e-13 ***
## TEMP -4.52266 0.08325 -54.324 < 2e-16 ***
## DEWP 2.34024 0.06647 35.206 < 2e-16 ***
## PRES -2.10425 0.07196 -29.244 < 2e-16 ***
## RAIN -4.79982 0.44515 -10.782 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 75.13 on 35058 degrees of freedom
## Multiple R-squared: 0.1209, Adjusted R-squared: 0.1208
## F-statistic: 964.2 on 5 and 35058 DF, p-value: < 2.2e-16
8. Cross-Validation
Perform 10-fold cross-validation to assess the generalizability of
the regression models.
# Set up cross-validation control
set.seed(123)
train_control <- trainControl(method = "cv", number = 10)
# Perform cross-validation for PM2.5 model
pm25_cv <- train(PM25 ~ TEMP + PRES + DEWP + RAIN + WSPM,
data = pm25_data, method = "lm", trControl = train_control)
# Perform cross-validation for PM10 model
pm10_cv <- train(PM10 ~ TEMP + PRES + DEWP + RAIN + WSPM,
data = pm10_data, method = "lm", trControl = train_control)
PM2.5 Cross-Validation Results
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 61.40598 0.1973379 48.5495 0.6801677 0.009047112 0.5120558
PM10 Cross-Validation Results
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 75.15198 0.120542 60.70289 0.751539 0.01083148 0.5677504
9. Results Visualization
Create visualizations to help interpret the regression results and
model performance.
# Get predictions from best PM2.5 model (stepwise)
best_pm25_pred <- predict(stepwise_model, pm25_data)
# Create scatter plot of actual vs predicted PM2.5
p1 <- ggplot(data.frame(actual = pm25_data$PM25, predicted = best_pm25_pred),
aes(x = actual, y = predicted)) +
geom_point(alpha = 0.6, color = "steelblue") +
geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
geom_smooth(method = "lm", color = "orange", se = FALSE) +
labs(title = "PM2.5 Regression: Actual vs Predicted",
subtitle = paste("R² =", round(cor(pm25_data$PM25, best_pm25_pred)^2, 3)),
x = "Actual PM2.5 (μg/m³)", y = "Predicted PM2.5 (μg/m³)") +
theme_minimal()
#print(p1)
residuals_pm25 <- pm25_data$PM25 - best_pm25_pred
# Create residual plot
p2 <- ggplot(data.frame(predicted = best_pm25_pred, residuals = residuals_pm25),
aes(x = predicted, y = residuals)) +
geom_point(alpha = 0.6, color = "steelblue") +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
geom_smooth(method = "loess", color = "orange", se = FALSE) +
labs(title = "PM2.5 Regression Residuals",
x = "Predicted Values (μg/m³)", y = "Residuals (μg/m³)") +
theme_minimal()
#print(p2)
# Calculate variable importance from standardized coefficients
# (|beta| * sd(x) / sd(y), so predictors measured on different scales are comparable)
raw_coef <- coef(stepwise_model)[-1]
standardized_coef <- abs(raw_coef * sapply(pm25_data[names(raw_coef)], sd) / sd(pm25_data$PM25))
var_importance <- data.frame(
variable = names(standardized_coef),
importance = standardized_coef
) %>%
arrange(desc(importance))
# Create bar plot of variable importance
p3 <- ggplot(var_importance, aes(x = reorder(variable, importance), y = importance)) +
geom_col(fill = "steelblue", alpha = 0.8) +
coord_flip() +
labs(title = "PM2.5 Regression Variable Importance",
x = "Variable", y = "Standardized Coefficient (Absolute Value)") +
theme_minimal()
#print(p3)
# Predict PM2.5 across temperature range with other variables held at mean
temp_range <- seq(min(pm25_data$TEMP), max(pm25_data$TEMP), length.out = 100)
temp_pred_data <- data.frame(
TEMP = temp_range,
PRES = mean(pm25_data$PRES),
DEWP = mean(pm25_data$DEWP),
RAIN = mean(pm25_data$RAIN),
WSPM = mean(pm25_data$WSPM)
)
temp_predictions <- predict(poly_model, temp_pred_data, interval = "confidence")
# Create plot of temperature effect on PM2.5
p4 <- ggplot() +
geom_point(data = pm25_data, aes(x = TEMP, y = PM25), alpha = 0.3, color = "gray") +
geom_line(aes(x = temp_range, y = temp_predictions[,1]), color = "red", size = 1) +
geom_ribbon(aes(x = temp_range, ymin = temp_predictions[,2], ymax = temp_predictions[,3]),
alpha = 0.2, fill = "red") +
labs(title = "Temperature Effect on PM2.5 (Polynomial Fit)",
x = "Temperature (°C)", y = "PM2.5 Concentration (μg/m³)") +
theme_minimal()
#print(p4)
9.1 Actual vs Predicted Values

9.2 Residual Analysis

9.3 Variable Importance

9.4 Temperature Effect on PM2.5

10. Conclusion
Among the meteorological factors, wind speed (WSPM) and temperature (TEMP) have the greatest impact on PM2.5 and PM10 concentrations, and both are negatively related to them (the stronger the wind and the higher the temperature, the lower the pollutant concentration).
The multiple linear regression model explains more of the variation in PM2.5 (R² ≈ 0.198) than in PM10 (R² ≈ 0.121), indicating that the relationship between PM2.5 and the meteorological variables is stronger.
Adding a quadratic temperature term in the polynomial model did not meaningfully improve performance, suggesting that a linear term already captures the relationship between temperature and pollutant levels.
The cross-validation results show that the models generalize stably and can be used to predict pollutant concentrations from meteorological data.
Classification Model
Classify Beijing's air quality level (Good, Moderate, Light Pollution, etc.) from meteorological factors
1. Data Loading
# Clear workspace
rm(list = ls())
# Load required packages
required_packages <- c("tidyverse", "caret", "randomForest", "rpart", "e1071", "class", "MLmetrics")
for (package in required_packages) {
if (!require(package, character.only = TRUE, quietly = TRUE)) {
install.packages(package, quiet = TRUE)
library(package, character.only = TRUE)
}
}
# Load dataset
air_data <- read_csv("data/beijing_air_quality_cleaned.csv", show_col_types = FALSE)
air_data$datetime <- as.POSIXct(air_data$datetime)
2. Data Preparation
Clean the data and create AQI level classifications.
# Function to categorize AQI levels based on PM2.5 values
create_aqi_level <- function(pm25) {
case_when(
pm25 <= 35 ~ "Good",
pm25 <= 75 ~ "Moderate",
pm25 <= 115 ~ "Light_Pollution",
pm25 <= 150 ~ "Unhealthy",
pm25 <= 250 ~ "Very_Unhealthy",
TRUE ~ "Hazardous"
)
}
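A quick check of the cutoffs on a few illustrative PM2.5 values (the inputs are arbitrary examples):
# Expected mapping: Good, Moderate, Light_Pollution, Unhealthy, Very_Unhealthy, Hazardous
create_aqi_level(c(20, 60, 100, 140, 200, 300))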
# Clean and prepare data for classification
air_data_clean <- air_data %>%
dplyr::filter(!is.na(`PM2.5`), !is.na(TEMP), !is.na(PRES), !is.na(DEWP), !is.na(RAIN), !is.na(WSPM)) %>%
dplyr::mutate(AQI_Level = create_aqi_level(`PM2.5`)) %>%
dplyr::mutate(AQI_Level = factor(AQI_Level,
levels = c("Good", "Moderate", "Light_Pollution",
"Unhealthy", "Very_Unhealthy", "Hazardous"),
ordered = TRUE))
# Check AQI level distribution
AQI Level distribution
##
## Good Moderate Light_Pollution Unhealthy Very_Unhealthy
## 12445 8439 5554 3013 5613
## Hazardous
## 0
3. Prepare Classification Dataset
Select relevant features and split data into training and test
sets.
# Prepare dataset for classification
classification_data <- air_data_clean %>%
dplyr::select(AQI_Level, TEMP, PRES, DEWP, RAIN, WSPM) %>%
na.omit() %>%
dplyr::filter(AQI_Level != "Hazardous") %>%
dplyr::mutate(AQI_Level = droplevels(AQI_Level))
# Inspect classification dataset
#cat("Classification dataset dimensions:", nrow(classification_data), "rows x", ncol(classification_data), "columns\n")
#cat("Final AQI Level distribution:\n")
#print(table(classification_data$AQI_Level))
# Split data into training and test sets
set.seed(123)
train_index <- createDataPartition(classification_data$AQI_Level, p = 0.8, list = FALSE)
train_data <- classification_data[train_index, ]
test_data <- classification_data[-train_index, ]
# Sample training data for performance
sample_size <- min(5000, nrow(train_data))
train_sample_index <- sample(nrow(train_data), sample_size)
train_data_sample <- train_data[train_sample_index, ]
#cat("Training set:", nrow(train_data_sample), "samples (sampled)\n")
#cat("Test set:", nrow(test_data), "samples\n")
4. Model Training
Train multiple classification models using cross-validation.
# Set up cross-validation
train_control <- trainControl(method = "cv", number = 3,
classProbs = TRUE)
# Train Random Forest model
#cat("Training Random Forest model...\n")
set.seed(123)
rf_model <- train(AQI_Level ~ ., data = train_data_sample, method = "rf",
trControl = train_control, ntree = 100)
# Train Decision Tree model
#cat("Training Decision Tree model...\n")
set.seed(123)
dt_model <- train(AQI_Level ~ ., data = train_data_sample, method = "rpart",
trControl = train_control)
# Train Logistic Regression model
#cat("Training Logistic Regression model...\n")
set.seed(123)
logistic_model <- train(AQI_Level ~ ., data = train_data_sample, method = "multinom",
trControl = train_control, trace = FALSE)
#cat("All models trained successfully!\n")
# Store models in a list
models <- list(
"Random Forest" = rf_model,
"Decision Tree" = dt_model,
"Logistic Regression" = logistic_model
)
5. Model Evaluation
Evaluate model performance on the test set.
# Function to evaluate model performance
evaluate_model <- function(model, test_data) {
predictions <- predict(model, test_data)
cm <- confusionMatrix(predictions, test_data$AQI_Level)
return(list(
accuracy = round(cm$overall['Accuracy'], 4),
kappa = round(cm$overall['Kappa'], 4),
sensitivity = round(mean(cm$byClass[,'Sensitivity'], na.rm = TRUE), 4),
specificity = round(mean(cm$byClass[,'Specificity'], na.rm = TRUE), 4)
))
}
# Evaluate all models
#cat("Evaluating models on test set...\n")
performance_results <- data.frame()
for(model_name in names(models)) {
#cat("Evaluating", model_name, "...\n")
model <- models[[model_name]]
metrics <- evaluate_model(model, test_data)
performance_results <- rbind(performance_results, data.frame(
Model = model_name,
Accuracy = metrics$accuracy,
Kappa = metrics$kappa,
Sensitivity = metrics$sensitivity,
Specificity = metrics$specificity
))
}
# Identify best model
best_model_name <- performance_results[which.max(performance_results$Accuracy), "Model"]
best_model <- models[[best_model_name]]
Classification Model Performance
## Model Accuracy Kappa Sensitivity Specificity
## Accuracy Random Forest 0.4643 0.2723 0.3743 0.8557
## Accuracy1 Decision Tree 0.4277 0.1800 0.3061 0.8351
## Accuracy2 Logistic Regression 0.4227 0.1984 0.3161 0.8403
Best model
## [1] "Random Forest"
6. Model Analysis
Analyze the performance of the best model.
# Generate confusion matrix for best model
final_predictions <- predict(best_model, test_data)
final_cm <- confusionMatrix(final_predictions, test_data$AQI_Level)
# Feature importance analysis for Random Forest model
if(best_model_name == "Random Forest") {
feature_importance <- varImp(best_model)$importance
feature_ranking <- data.frame(
Feature = rownames(feature_importance),
Importance = feature_importance$Overall
) %>%
arrange(desc(Importance))
#write_csv(feature_ranking, "data/feature_importance_ranking.csv")
}
Model Confusion Matrix
## Reference
## Prediction Good Moderate Light_Pollution Unhealthy Very_Unhealthy
## Good 1838 662 297 162 189
## Moderate 403 574 324 148 210
## Light_Pollution 123 230 246 110 118
## Unhealthy 28 46 59 53 58
## Very_Unhealthy 97 175 184 129 547
Feature Importance Ranking
## Feature Importance
## 1 DEWP 100.00000
## 2 PRES 88.70055
## 3 TEMP 87.38510
## 4 WSPM 65.95241
## 5 RAIN 0.00000
7. Conclusion
This project classifies the air quality levels in Beijing based on meteorological factors. Three models (random forest, decision tree, logistic regression) were trained and evaluated.
The random forest model performed better than the other models, with a test-set accuracy of 0.4643.
The feature importance analysis showed that dew point (DEWP) was the most influential factor, followed by pressure (PRES), temperature (TEMP), and wind speed (WSPM).
Although the accuracy indicates room for improvement, the model provides insight into the key meteorological drivers of air quality.
Visualization
1. Load Data and Preprocessing
# Load packages (lubridate supplies year(), month(), and floor_date()) and the cleaned dataset
library(tidyverse)
library(lubridate)
air_data <- read_csv("data/beijing_air_quality_cleaned.csv", show_col_types = FALSE)
air_data$datetime <- as.POSIXct(air_data$datetime)
create_aqi_level <- function(pm25) {
case_when(
pm25 <= 35 ~ "Good",
pm25 <= 75 ~ "Moderate",
pm25 <= 115 ~ "Unhealthy for Sensitive Groups",
pm25 <= 150 ~ "Unhealthy",
pm25 <= 250 ~ "Very Unhealthy",
TRUE ~ "Hazardous"
)
}
air_data <- air_data %>%
mutate(
AQI_Level = create_aqi_level(PM2.5),
AQI_Level = factor(AQI_Level,
levels = c("Good", "Moderate", "Unhealthy for Sensitive Groups",
"Unhealthy", "Very Unhealthy", "Hazardous"),
ordered = TRUE),
Year = year(datetime),
Month = month(datetime)
)
2. Monthly Pollutant Trends
monthly_trends <- air_data %>%
mutate(YearMonth = floor_date(datetime, "month")) %>%
group_by(YearMonth) %>%
summarise(
PM2.5_avg = mean(PM2.5, na.rm = TRUE),
PM10_avg = mean(PM10, na.rm = TRUE),
.groups = "drop"
) %>%
pivot_longer(cols = c(PM2.5_avg, PM10_avg), names_to = "Pollutant", values_to = "Concentration")
ggplot(monthly_trends, aes(x = YearMonth, y = Concentration, color = Pollutant)) +
geom_line(size = 1) +
scale_color_manual(values = c("PM2.5_avg" = "red", "PM10_avg" = "blue")) +
labs(title = "Monthly Average Pollutant Concentrations (2013-2017)",
x = "Time", y = "Concentration (μg/m³)") +
theme_minimal()

Insight: PM2.5 and PM10 both show recurring peaks
during the winter months and troughs in the summer. This consistent
seasonal pattern suggests that pollution is heavily influenced by
winter-specific human activities such as heating, and by meteorological
conditions that trap pollutants closer to the surface during cold
months.
3. AQI Level Distribution
aqi_distribution <- air_data %>%
filter(!is.na(AQI_Level)) %>%
count(AQI_Level) %>%
mutate(percentage = round(n / sum(n) * 100, 1))
ggplot(aqi_distribution, aes(x = AQI_Level, y = n, fill = AQI_Level)) +
geom_col() +
geom_text(aes(label = paste0(n, "\n(", percentage, "%)")), vjust = -0.5, size = 3) +
labs(title = "Air Quality Level Distribution",
x = "AQI Level", y = "Hours") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")

Insight: Over half of the observations fall into the
“Moderate” to “Unhealthy” categories, indicating that such air quality
conditions are routine rather than exceptional. This highlights the
persistent nature of mild to moderate air pollution in Beijing and
emphasizes the need for long-term emission management.
4. Weather Factors vs PM2.5
air_data <- read_csv("data/beijing_air_quality_cleaned.csv", show_col_types = FALSE)
air_data$datetime <- as.POSIXct(air_data$datetime)
names(air_data)[names(air_data) == "PM2.5"] <- "PM25"
weather_vars <- c("TEMP", "PRES", "DEWP", "RAIN", "WSPM")
analysis_cols <- c("PM25", weather_vars)
temp_data <- air_data[, analysis_cols]
clean_data <- temp_data[complete.cases(temp_data), ]
set.seed(123)
n_sample <- min(5000, nrow(clean_data))
sample_indices <- sample(nrow(clean_data), n_sample)
sample_data <- clean_data[sample_indices, ]
cor_results <- data.frame(
Variable = weather_vars,
Correlation = sapply(weather_vars, function(var) {
cor(sample_data$PM25, sample_data[[var]], use = "complete.obs")
})
)
plot_data <- data.frame()
for(var in weather_vars) {
temp_df <- data.frame(
PM25 = sample_data$PM25,
Variable = var,
Value = sample_data[[var]],
Correlation = cor_results$Correlation[cor_results$Variable == var]
)
plot_data <- rbind(plot_data, temp_df)
}
ggplot(plot_data, aes(x = Value, y = PM25)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_smooth(method = "lm", color = "red") +
geom_text(aes(x = Inf, y = Inf, label = paste("r =", round(Correlation, 3))),
hjust = 1.1, vjust = 1.5, size = 3, color = "red") +
labs(title = "Weather Factors vs PM2.5 Concentration",
x = "Weather Variable Value", y = "PM2.5 Concentration (μg/m³)") +
facet_wrap(~Variable, scales = "free_x") +
theme_minimal()

Insight: Temperature shows a weak negative correlation with PM2.5 and dew point a weak positive one, while wind speed has the strongest negative relationship, consistent with the regression analysis. Wind speed has a mitigating effect, but it is usually not strong enough to disperse pollution effectively; together with the seasonal pattern above, this supports the idea that cold, stagnant conditions favor pollutant buildup.
5. Seasonal AQI Distribution
if(!"PM25" %in% names(air_data)) {
names(air_data)[names(air_data) == "PM2.5"] <- "PM25"
}
create_aqi_level <- function(pm25) {
case_when(
pm25 <= 35 ~ "Good",
pm25 <= 75 ~ "Moderate",
pm25 <= 115 ~ "Unhealthy for Sensitive Groups",
pm25 <= 150 ~ "Unhealthy",
pm25 <= 250 ~ "Very Unhealthy",
TRUE ~ "Hazardous"
)
}
seasonal_data <- air_data %>%
filter(!is.na(PM25)) %>%
mutate(
AQI_Level = create_aqi_level(PM25),
AQI_Level = factor(AQI_Level,
levels = c("Good", "Moderate", "Unhealthy for Sensitive Groups",
"Unhealthy", "Very Unhealthy", "Hazardous"),
ordered = TRUE),
Month = month(datetime),
Season = case_when(
Month %in% c(12, 1, 2) ~ "Winter",
Month %in% c(3, 4, 5) ~ "Spring",
Month %in% c(6, 7, 8) ~ "Summer",
Month %in% c(9, 10, 11) ~ "Fall"
)
) %>%
group_by(Season, AQI_Level) %>%
summarise(Hours = n(), .groups = "drop") %>%
group_by(Season) %>%
mutate(Percentage = Hours / sum(Hours) * 100)
ggplot(seasonal_data, aes(x = Season, y = Percentage, fill = AQI_Level)) +
geom_col(position = "stack") +
labs(title = "Seasonal Air Quality Distribution",
x = "Season", y = "Percentage (%)", fill = "AQI Level") +
theme_minimal()

Insight: Winter accounts for the highest share of unhealthy air quality hours, confirming that it is the most polluted season, while summer has the cleanest air. This suggests that pollution control policies should be adaptive and season-specific, targeting wintertime emissions most aggressively.
6. Conclusion
The analysis of Beijing air quality data from 2013 to 2017 reveals
several important insights:
- Seasonal Patterns: Pollution spikes in winter due
to heating and stagnant air.
- Air Quality Status: Moderate to unhealthy levels
are common and persistent.
- Weather Impact: Cold, dry, and calm conditions
favor pollutant accumulation.
- Policy Implication: Emission controls should
intensify in winter, and meteorology should be integrated into real-time
pollution warnings.
These data-driven patterns provide actionable knowledge for
environmental planning and reinforce the need for dynamic, targeted air
quality management.
Conclusion
This report analyzes Beijing air quality data from 2013 to 2017, based on the PRSA dataset and supplementary sources such as the China Meteorological Data Network. It systematically examines the influence of meteorological factors on PM2.5/PM10 and builds prediction models. After data cleaning (imputation of missing values, handling of outliers) and feature engineering (time features, lag variables, rolling-window statistics), the data is divided into training, validation, and test sets at a ratio of 70%, 15%, and 15%.
Regression analysis shows that wind speed (WSPM) and temperature (TEMP) have the strongest negative relationship with pollutant concentrations. The multiple linear model explains 19.77% and 12.09% of the variance in PM2.5 and PM10, respectively, and wind speed and dew point (DEWP) are key influencing variables. In the classification model, random forest achieves an accuracy of 46.43% for air quality levels (Good, Moderate, Light Pollution, etc.), with dew point as the most important feature. The visualization results reveal pronounced seasonality in pollutant concentrations: they peak in winter due to heating and stable atmospheric conditions and decrease in summer under more favorable dispersion conditions. The study provides data support and a modeling basis for air quality prediction and seasonal pollution control in Beijing.