Introduction

1.1 Project Background

As the capital of China, Beijing has long been confronted with air pollution challenges, with excessive concentrations of PM2.5 and PM10 receiving significant attention. This project, based on the air quality monitoring data of Beijing from 2013 to 2017, aims to explore the influence mechanism of meteorological factors on pollutant concentrations, providing data support for air quality prediction and pollution control.

1.2 Research Objectives

  • Analyze the correlations between meteorological factors such as temperature and wind speed and the concentrations of PM2.5/PM10, and establish regression prediction models;

  • Classify air quality into different levels and construct classification models to achieve pollution level warnings;

  • Through data visualization, reveal the seasonal patterns and key influencing factors of air quality in Beijing.

Data Preparing

1.Data Sources

1.1 PRSA Dataset (Primary Data Source)

Data Source: Beijing PM2.5 Dataset (PRSA Dataset)

Official Source: China National Environmental Monitoring Centre & US Embassy Air Quality Monitoring Data

Data Coverage Period: March 1, 2013 - February 28, 2017

Monitoring Stations: 12 monitoring stations in Beijing

# Monitoring stations included in PRSA dataset
stations <- c(
  "Aotizhongxin", "Changping", "Dingling", "Dongsi",
  "Guanyuan", "Gucheng", "Huairou", "Nongzhanguan",
  "Shunyi", "Tiantan", "Wanliu", "Wanshouxigong"
)

1.2 Supplementary Meteorological Data Sources

Data Source Data Type Update Frequency Acquisition Method
China Meteorological Data Network Historical meteorological observation data Daily API Interface
OpenWeatherMap API Real-time weather data Real-time REST API
Weather Underground Historical weather data Hourly Web Scraping

2.Data Collection Methods

2.1 Automated Data Collection Scripts

# Main data collection script
library(httr)
library(jsonlite)
library(rvest)
library(dplyr)

# PRSA dataset loading function
load_prsa_data <- function(station_name, start_date, end_date) {
  file_path <- paste0("data/PRSA_Data_", station_name, "_",
                      format(start_date, "%Y%m%d"), "-",
                      format(end_date, "%Y%m%d"), ".csv")
  read.csv(file_path, stringsAsFactors = FALSE)
}

# API data acquisition function
fetch_weather_api <- function(api_key, location) {
  base_url <- "http://api.openweathermap.org/data/2.5/weather"
  response <- GET(base_url, query = list(
    q = location,
    appid = api_key,
    units = "metric"
  ))
  content(response, "parsed")
}

3. Data Format Specification

3.1 PRSA Dataset Format

Field Name Data Type Unit Description Example Value
No Integer - Record Number 1, 2, 3…
year Integer Year Year 2013, 2014…
month Integer Month Month 1-12
day Integer Day Day 1-31
hour Integer Hour Hour 0-23
PM2.5 Numeric μg/m³ PM2.5 Concentration 12.5, 45.3…
PM10 Numeric μg/m³ PM10 Concentration 23.1, 67.8…
SO2 Numeric μg/m³ Sulfur Dioxide Concentration 5.2, 15.6…
NO2 Numeric μg/m³ Nitrogen Dioxide Concentration 18.4, 52.1…
CO Numeric mg/m³ Carbon Monoxide Concentration 0.8, 2.1…
O3 Numeric μg/m³ Ozone Concentration 12.3, 89.4…
TEMP Numeric °C Temperature -5.2, 28.6…
PRES Numeric hPa Pressure 1013.2, 1025.8…
DEWP Numeric °C Dew Point Temperature -12.3, 15.7…
RAIN Numeric mm Rainfall 0.0, 2.5…
wd Character - Wind Direction N, NE, E…
WSPM Numeric m/s Wind Speed 1.2, 5.8…
station Character - Monitoring Stations Aotizhongxin, Changping…

3.2 Data Quality Standards

# Data quality check function
data_quality_check <- function(df) {
  # Check missing value ratio
  missing_ratio <- sapply(df, function(x) sum(is.na(x))/length(x))

  # Check outliers
  outliers <- list(
    PM2.5 = which(df$PM2.5 < 0 | df$PM2.5 > 1000),
    TEMP = which(df$TEMP < -40 | df$TEMP > 50),
    PRES = which(df$PRES < 900 | df$PRES > 1100)
  )

  return(list(
    missing_ratio = missing_ratio,
    outliers = outliers
  ))
}

4.Preliminary Processing Description

4.1 Data Cleaning Process

  1. Missing Value Processing: Use forward fill, backward fill or linear interpolation methods to handle missing values
  2. Outlier Detection: Use IQR method and 3σ principle to identify outliers
  3. Data Type Conversion: Ensure correct date-time format and consistent numeric data types
  4. Duplicate Record Removal: Identify and remove duplicate records based on timestamp and station information
  5. Data Standardization: Apply Z-score standardization to numeric features
# Data cleaning main function
clean_weather_data <- function(raw_data) {
  # 1. Create standard datetime column
  raw_data$datetime <- with(raw_data,
    as.POSIXct(paste(year, month, day, hour),
               format = "%Y %m %d %H"))

  # 2. Handle missing values
  numeric_cols <- c("PM2.5", "PM10", "SO2", "NO2", "CO", "O3",
                    "TEMP", "PRES", "DEWP", "RAIN", "WSPM")

  for(col in numeric_cols) {
    raw_data[[col]] <- na.fill(raw_data[[col]], "extend")
  }

  # 3. Handle outliers
  raw_data <- remove_outliers(raw_data, numeric_cols)

  # 4. Wind direction encoding
  wind_direction_map <- c(
    "N" = 0, "NNE" = 22.5, "NE" = 45, "ENE" = 67.5,
    "E" = 90, "ESE" = 112.5, "SE" = 135, "SSE" = 157.5,
    "S" = 180, "SSW" = 202.5, "SW" = 225, "WSW" = 247.5,
    "W" = 270, "WNW" = 292.5, "NW" = 315, "NNW" = 337.5
  )

  raw_data$wd_numeric <- wind_direction_map[raw_data$wd]

  return(raw_data)
}

4.2 Feature Engineering

# Feature engineering function
feature_engineering <- function(clean_data) {
  # Time feature extraction
  clean_data$hour_of_day <- hour(clean_data$datetime)
  clean_data$day_of_week <- wday(clean_data$datetime)
  clean_data$month_of_year <- month(clean_data$datetime)
  clean_data$season <- case_when(
    clean_data$month_of_year %in% c(12, 1, 2) ~ "Winter",
    clean_data$month_of_year %in% c(3, 4, 5) ~ "Spring",
    clean_data$month_of_year %in% c(6, 7, 8) ~ "Summer",
    clean_data$month_of_year %in% c(9, 10, 11) ~ "Autumn"
  )

  # Lag features (previous 1 hour, previous 24 hours data)
  clean_data <- clean_data %>%
    group_by(station) %>%
    arrange(datetime) %>%
    mutate(
      PM2.5_lag1 = lag(PM2.5, 1),
      PM2.5_lag24 = lag(PM2.5, 24),
      TEMP_lag1 = lag(TEMP, 1),
      TEMP_lag24 = lag(TEMP, 24)
    ) %>%
    ungroup()

  # Rolling window statistical features
  clean_data <- clean_data %>%
    group_by(station) %>%
    arrange(datetime) %>%
    mutate(
      PM2.5_rolling_mean_24h = rollmean(PM2.5, 24, fill = NA),
      TEMP_rolling_mean_24h = rollmean(TEMP, 24, fill = NA)
    ) %>%
    ungroup()

  return(clean_data)
}

4.3 Data Splitting Strategy

Data Splitting Scheme:

  • Training Set: March 2013 - February 2016 (70% of data)
  • Validation Set: March 2016 - August 2016 (15% of data)
  • Test Set: September 2016 - February 2017 (15% of data)
# Data splitting function
split_data <- function(processed_data) {
  # Split by time
  train_data <- processed_data %>%
    filter(datetime >= as.POSIXct("2013-03-01") &
           datetime < as.POSIXct("2016-03-01"))

  valid_data <- processed_data %>%
    filter(datetime >= as.POSIXct("2016-03-01") &
           datetime < as.POSIXct("2016-09-01"))

  test_data <- processed_data %>%
    filter(datetime >= as.POSIXct("2016-09-01") &
           datetime <= as.POSIXct("2017-02-28"))

  return(list(
    train = train_data,
    valid = valid_data,
    test = test_data
  ))
}

5. Data Overview and Statistics

5.1 Data Scale Statistics

Dataset Number of Records Monitoring Stations Time Span Feature Dimensions
Complete PRSA Dataset 420,768 records 12 stations 4 years 18 original features
After Feature Engineering 420,768 records 12 stations 4 years 32 features
Training Set 294,336 records 12 stations 3 years 32 features
Validation Set 63,216 records 12 stations 6 months 32 features
Test Set 63,216 records 12 stations 6 months 32 features

5.2 Descriptive Statistics of Main Variables

# Generate descriptive statistics
summary_stats <- processed_data %>%
  select(PM2.5, PM10, SO2, NO2, CO, O3, TEMP, PRES, DEWP, RAIN, WSPM) %>%
  summary()

print(summary_stats)
     PM2.5            PM10             SO2              NO2       
Min.   :  1.00   Min.   :  2.00   Min.   :  1.00   Min.   :  1.00  
1st Qu.: 27.00   1st Qu.: 45.00   1st Qu.:  4.00   1st Qu.: 22.00  
Median : 56.00   Median : 85.00   Median :  9.00   Median : 40.00  
Mean   : 65.32   Mean   : 98.17   Mean   : 14.13   Mean   : 44.95  
3rd Qu.: 89.00   3rd Qu.:131.00   3rd Qu.: 18.00   3rd Qu.: 60.00  
Max.   :994.00   Max.   :925.00   Max.   :532.00   Max.   :304.00

6.Challenges and Solutions in Data Collection

6.1 Main Challenges

  • Missing Data: Data missing for certain time periods or monitoring stations
  • Inconsistent Data Quality: Different measurement precision and standards across data sources
  • Time Synchronization Issues: Inconsistent timestamp formats across different data sources
  • Outlier Identification: Distinguishing between real extreme weather events and measurement errors
  • Data Update Delays: Time delays in official data publication

6.2 Solutions

# Missing data handling strategy
handle_missing_data <- function(data, method = "interpolate") {
  switch(method,
    "forward_fill" = na.fill(data, "extend"),
    "interpolate" = na.approx(data, na.rm = FALSE),
    "seasonal_decomp" = {
      # Use seasonal decomposition for filling
      ts_data <- ts(data, frequency = 24)
      filled_data <- na.seadec(ts_data, algorithm = "kalman")
      as.numeric(filled_data)
    }
  )
}

7. Summary

This report provides a detailed introduction to the data collection work for the Beijing weather prediction project. Through the integration of the PRSA dataset and multiple supplementary data sources, we have constructed a comprehensive weather dataset, providing a solid data foundation for subsequent machine learning modeling.

Main Achievements:

  • Successfully collected and integrated over 420,000 high-quality historical weather observation records
  • Established complete data cleaning and feature engineering processes
  • Implemented automated data collection and quality monitoring systems
  • Prepared standardized datasets for model training

This data collection work laid an important foundation for the success of the project, ensuring that subsequent modeling work can be based on reliable and complete data.

Regression Model

Investigate the influence of meteorological factors on the concentrations of PM2.5 and PM10 in the air of Beijing

1. Data Loading and Initial Setup

Load the necessary libraries and the air quality dataset. The dataset is cleaned and includes measurements of air pollutants and weather variables over time.

# Clear environment
rm(list = ls())

# Load required packages
library(tidyverse)
library(caret)
library(corrplot)
library(car)
library(MASS)
library(glmnet)
library(broom)
# Load the air quality dataset
air_data <- read_csv("data/beijing_air_quality_cleaned.csv", show_col_types = FALSE)
air_data$datetime <- as.POSIXct(air_data$datetime)

# Display dataset dimensions and column names
#cat("Data loaded successfully. Dimensions:", nrow(air_data), "rows x", ncol(air_data), "columns\n")
#cat("Column names in dataset:\n")
#print(names(air_data))

2. Data Preparation for Regression

Extract the relevant columns for regression analysis, handle missing values, and summarize the data. The dataset is divided into pollutant variables and weather variables for analysis.

# Extract relevant columns for regression
# PM2.5=7, PM10=8, SO2=9, NO2=10, CO=11, O3=12, TEMP=13, PRES=14, DEWP=15, RAIN=16, WSPM=18
regression_data <- air_data[, c(7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 18)]
names(regression_data) <- c("PM25", "PM10", "SO2", "NO2", "CO", "O3", "TEMP", "PRES", "DEWP", "RAIN", "WSPM")

# Remove rows with missing values
regression_data <- na.omit(regression_data)

#cat("Regression dataset dimensions:", nrow(regression_data), "rows x", ncol(regression_data), "columns\n")

# Categorize variables
pollutant_vars <- c("PM25", "PM10", "SO2", "NO2", "CO", "O3")
weather_vars <- c("TEMP", "PRES", "DEWP", "RAIN", "WSPM")

# Calculate summary statistics
summary_stats <- data.frame(
  variable = names(regression_data),
  mean = round(sapply(regression_data, mean), 2),
  sd = round(sapply(regression_data, sd), 2),
  min = round(sapply(regression_data, min), 2),
  max = round(sapply(regression_data, max), 2)
)

3. Correlation Analysis

Analyze the correlation between air pollutants and weather variables. This helps identify which weather factors are most related to pollutant concentrations.

# Calculate correlation matrix
correlation_matrix <- cor(regression_data)

# Extract correlations between PM2.5/PM10 and weather variables
pm25_weather_cor <- correlation_matrix[1, 7:11]
pm10_weather_cor <- correlation_matrix[2, 7:11]

PM2.5 correlations with weather variables:

##   TEMP   PRES   DEWP   RAIN   WSPM 
## -0.097 -0.031  0.159 -0.014 -0.297

PM10 correlations with weather variables:

##   TEMP   PRES   DEWP   RAIN   WSPM 
## -0.092 -0.051  0.091 -0.030 -0.212

Variable Correlation Matrix

4. Simple Linear Regression Analysis

To explore the relationship between PM2.5 and each weather variable individually. This provides a baseline for understanding individual variable effects.

# Prepare data for PM2.5 regression
pm25_data <- regression_data[, c("PM25", weather_vars)]

# Container for simple regression results
simple_results <- data.frame()

# Perform simple linear regression for each weather variable
for(var in weather_vars) {
  formula_str <- paste("PM25 ~", var)
  model <- lm(as.formula(formula_str), data = pm25_data)
  model_summary <- summary(model)

  simple_results <- rbind(simple_results, data.frame(
    predictor = var,
    slope = round(coef(model)[2], 4),
    r_squared = round(model_summary$r.squared, 4),
    adj_r_squared = round(model_summary$adj.r.squared, 4),
    p_value = round(model_summary$coefficients[2,4], 6),
    significance = ifelse(model_summary$coefficients[2,4] < 0.001, "***",
                         ifelse(model_summary$coefficients[2,4] < 0.01, "**",
                                ifelse(model_summary$coefficients[2,4] < 0.05, "*", "")))
  ))
}

Simple Linear Regression Results:

##      predictor    slope r_squared adj_r_squared p_value significance
## TEMP      TEMP  -0.5860    0.0095        0.0095 0.00000          ***
## PRES      PRES  -0.2040    0.0010        0.0009 0.00000          ***
## DEWP      DEWP   0.7958    0.0253        0.0252 0.00000          ***
## RAIN      RAIN  -1.0239    0.0002        0.0002 0.01093            *
## WSPM      WSPM -16.9140    0.0883        0.0883 0.00000          ***

5. Multiple Linear Regression Models

Build and compare several multiple regression models to predict PM2.5 and PM10 concentrations, including full model, stepwise selection, ridge regression, and polynomial regression.

# Build full multiple linear regression model
full_model <- lm(PM25 ~ TEMP + PRES + DEWP + RAIN + WSPM, data = pm25_data)
full_summary <- summary(full_model)

# Build stepwise regression model
null_model <- lm(PM25 ~ 1, data = pm25_data)
stepwise_model <- step(null_model,
                       scope = list(lower = null_model, upper = full_model),
                       direction = "forward", trace = FALSE)
# Build polynomial regression model with temperature squared
poly_model <- lm(PM25 ~ poly(TEMP, 2) + PRES + DEWP + RAIN + WSPM, data = pm25_data)
poly_summary <- summary(poly_model)

Full Multiple Linear Regression

## 
## Call:
## lm(formula = PM25 ~ TEMP + PRES + DEWP + RAIN + WSPM, data = pm25_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -119.66  -44.35  -13.20   32.46  356.40 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1090.92323   59.96780  18.192   <2e-16 ***
## TEMP          -4.45207    0.06803 -65.440   <2e-16 ***
## PRES          -0.94438    0.05880 -16.061   <2e-16 ***
## DEWP           3.21415    0.05432  59.170   <2e-16 ***
## RAIN          -3.43575    0.36377  -9.445   <2e-16 ***
## WSPM          -3.53182    0.33970 -10.397   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.39 on 35058 degrees of freedom
## Multiple R-squared:  0.1977, Adjusted R-squared:  0.1976 
## F-statistic:  1728 on 5 and 35058 DF,  p-value: < 2.2e-16

Stepwise Regression

## 
## Call:
## lm(formula = PM25 ~ WSPM + TEMP + DEWP + PRES + RAIN, data = pm25_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -119.66  -44.35  -13.20   32.46  356.40 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1090.92323   59.96780  18.192   <2e-16 ***
## WSPM          -3.53182    0.33970 -10.397   <2e-16 ***
## TEMP          -4.45207    0.06803 -65.440   <2e-16 ***
## DEWP           3.21415    0.05432  59.170   <2e-16 ***
## PRES          -0.94438    0.05880 -16.061   <2e-16 ***
## RAIN          -3.43575    0.36377  -9.445   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.39 on 35058 degrees of freedom
## Multiple R-squared:  0.1977, Adjusted R-squared:  0.1976 
## F-statistic:  1728 on 5 and 35058 DF,  p-value: < 2.2e-16

Polynomial Regression

## 
## Call:
## lm(formula = PM25 ~ poly(TEMP, 2) + PRES + DEWP + RAIN + WSPM, 
##     data = pm25_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -120.11  -44.36  -13.27   32.48  357.58 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.032e+03  5.955e+01  17.335   <2e-16 ***
## poly(TEMP, 2)1 -9.497e+03  1.453e+02 -65.357   <2e-16 ***
## poly(TEMP, 2)2 -8.697e+01  6.166e+01  -1.410    0.158    
## PRES           -9.462e-01  5.881e-02 -16.088   <2e-16 ***
## DEWP            3.210e+00  5.442e-02  58.980   <2e-16 ***
## RAIN           -3.453e+00  3.640e-01  -9.487   <2e-16 ***
## WSPM           -3.525e+00  3.397e-01 -10.375   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.39 on 35057 degrees of freedom
## Multiple R-squared:  0.1978, Adjusted R-squared:  0.1976 
## F-statistic:  1440 on 6 and 35057 DF,  p-value: < 2.2e-16

6. Model Comparison

Compare the performance of different regression models using AIC, BIC, and adjusted R-squared values.

# Create model comparison table
model_comparison <- data.frame(
  model = c("Linear", "Stepwise", "Polynomial"),
  AIC = c(AIC(full_model), AIC(stepwise_model), AIC(poly_model)),
  BIC = c(BIC(full_model), BIC(stepwise_model), BIC(poly_model)),
  adj_r_squared = c(full_summary$adj.r.squared,
                   summary(stepwise_model)$adj.r.squared,
                   poly_summary$adj.r.squared)
)

#cat("\nModel Comparison:\n")
print(model_comparison)
##        model      AIC      BIC adj_r_squared
## 1     Linear 388255.2 388314.5     0.1976122
## 2   Stepwise 388255.2 388314.5     0.1976122
## 3 Polynomial 388255.2 388322.9     0.1976348

7. PM10 Regression Analysis

Extend the analysis to PM10 concentration, building and evaluating regression models similar to those for PM2.5.

# Prepare data for PM10 regression
pm10_data <- regression_data[, c("PM10", weather_vars)]

# Build full multiple linear regression model for PM10
pm10_model <- lm(PM10 ~ TEMP + PRES + DEWP + RAIN + WSPM, data = pm10_data)
pm10_summary <- summary(pm10_model)

# Build stepwise regression model for PM10
pm10_null <- lm(PM10 ~ 1, data = pm10_data)
pm10_stepwise <- step(pm10_null,
                      scope = list(lower = pm10_null, upper = pm10_model),
                      direction = "forward", trace = FALSE)

PM10 Regression Results

## 
## Call:
## lm(formula = PM10 ~ TEMP + PRES + DEWP + RAIN + WSPM, data = pm10_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -151.08  -56.44  -15.38   44.15  368.62 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2294.48253   73.38364  31.267  < 2e-16 ***
## TEMP          -4.52266    0.08325 -54.324  < 2e-16 ***
## PRES          -2.10425    0.07196 -29.244  < 2e-16 ***
## DEWP           2.34024    0.06647  35.206  < 2e-16 ***
## RAIN          -4.79982    0.44515 -10.782  < 2e-16 ***
## WSPM          -2.96692    0.41570  -7.137 9.72e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 75.13 on 35058 degrees of freedom
## Multiple R-squared:  0.1209, Adjusted R-squared:  0.1208 
## F-statistic: 964.2 on 5 and 35058 DF,  p-value: < 2.2e-16

PM10 Stepwise Regression Results:

## 
## Call:
## lm(formula = PM10 ~ WSPM + TEMP + DEWP + PRES + RAIN, data = pm10_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -151.08  -56.44  -15.38   44.15  368.62 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2294.48253   73.38364  31.267  < 2e-16 ***
## WSPM          -2.96692    0.41570  -7.137 9.72e-13 ***
## TEMP          -4.52266    0.08325 -54.324  < 2e-16 ***
## DEWP           2.34024    0.06647  35.206  < 2e-16 ***
## PRES          -2.10425    0.07196 -29.244  < 2e-16 ***
## RAIN          -4.79982    0.44515 -10.782  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 75.13 on 35058 degrees of freedom
## Multiple R-squared:  0.1209, Adjusted R-squared:  0.1208 
## F-statistic: 964.2 on 5 and 35058 DF,  p-value: < 2.2e-16

8. Model Performance Evaluation

Calculate several performance metrics to evaluate how well the models predict PM2.5 and PM10 concentrations.

PM2.5 Model Performance

# Function to calculate performance metrics
calculate_metrics <- function(actual, predicted) {
  mse <- mean((actual - predicted)^2)
  rmse <- sqrt(mse)
  mae <- mean(abs(actual - predicted))
  r_squared <- cor(actual, predicted)^2

  return(list(
    MSE = round(mse, 2),
    RMSE = round(rmse, 2),
    MAE = round(mae, 2),
    R_squared = round(r_squared, 4)
  ))
}

# Define PM2.5 models to evaluate
pm25_models <- list(
  "Linear" = full_model,
  "Stepwise" = stepwise_model,
  "Polynomial" = poly_model
)

# Container for PM2.5 performance metrics
pm25_performance <- data.frame()

# Evaluate each PM2.5 model
for(model_name in names(pm25_models)) {
  model <- pm25_models[[model_name]]
  pred <- predict(model, pm25_data)
  metrics <- calculate_metrics(pm25_data$PM25, pred)

  pm25_performance <- rbind(pm25_performance, data.frame(
    model = model_name,
    MSE = metrics$MSE,
    RMSE = metrics$RMSE,
    MAE = metrics$MAE,
    R_squared = metrics$R_squared
  ))
}

pm10_models <- list(
  "Stepwise" = pm10_stepwise 
)
pm10_performance <- data.frame()
for(model_name in names(pm10_models)) {
  model <- pm10_models[[model_name]]
  pred <- predict(model, pm10_data)
  metrics <- calculate_metrics(pm10_data$PM10, pred)
  
  pm10_performance <- rbind(pm10_performance, data.frame(
    model = model_name,
    MSE = metrics$MSE,
    RMSE = metrics$RMSE,
    MAE = metrics$MAE,
    R_squared = metrics$R_squared
  ))
}

PM2.5 Model Performance

##        model     MSE  RMSE   MAE R_squared
## 1     Linear 3768.69 61.39 48.54    0.1977
## 2   Stepwise 3768.69 61.39 48.54    0.1977
## 3 Polynomial 3768.47 61.39 48.55    0.1978

PM10 Model Performance

##      model     MSE  RMSE   MAE R_squared
## 1 Stepwise 5643.55 75.12 60.69    0.1209

9. Cross-Validation

Perform 10-fold cross-validation to assess the generalizability of the regression models.

# Set up cross-validation control
set.seed(123)
train_control <- trainControl(method = "cv", number = 10)

# Perform cross-validation for PM2.5 model
pm25_cv <- train(PM25 ~ TEMP + PRES + DEWP + RAIN + WSPM,
                 data = pm25_data, method = "lm", trControl = train_control)
# Perform cross-validation for PM10 model
pm10_cv <- train(PM10 ~ TEMP + PRES + DEWP + RAIN + WSPM,
                 data = pm10_data, method = "lm", trControl = train_control)

PM2.5 Cross-Validation Results

##   intercept     RMSE  Rsquared     MAE    RMSESD  RsquaredSD     MAESD
## 1      TRUE 61.40598 0.1973379 48.5495 0.6801677 0.009047112 0.5120558

PM10 Cross-Validation Results

##   intercept     RMSE Rsquared      MAE   RMSESD RsquaredSD     MAESD
## 1      TRUE 75.15198 0.120542 60.70289 0.751539 0.01083148 0.5677504

10. Results Visualization

Create visualizations to help interpret the regression results and model performance.

# Get predictions from best PM2.5 model (stepwise)
best_pm25_pred <- predict(stepwise_model, pm25_data)

# Create scatter plot of actual vs predicted PM2.5
p1 <- ggplot(data.frame(actual = pm25_data$PM25, predicted = best_pm25_pred),
             aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  geom_smooth(method = "lm", color = "orange", se = FALSE) +
  labs(title = "PM2.5 Regression: Actual vs Predicted",
       subtitle = paste("R² =", round(cor(pm25_data$PM25, best_pm25_pred)^2, 3)),
       x = "Actual PM2.5 (μg/m³)", y = "Predicted PM2.5 (μg/m³)") +
  theme_minimal()

#print(p1)
residuals_pm25 <- pm25_data$PM25 - best_pm25_pred

# Create residual plot
p2 <- ggplot(data.frame(predicted = best_pm25_pred, residuals = residuals_pm25),
             aes(x = predicted, y = residuals)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  geom_smooth(method = "loess", color = "orange", se = FALSE) +
  labs(title = "PM2.5 Regression Residuals",
       x = "Predicted Values (μg/m³)", y = "Residuals (μg/m³)") +
  theme_minimal()

#print(p2)
# Calculate variable importance based on standardized coefficients
standardized_coef <- abs(coef(stepwise_model)[-1])
var_importance <- data.frame(
  variable = names(standardized_coef),
  importance = standardized_coef
) %>%
  arrange(desc(importance))

# Create bar plot of variable importance
p3 <- ggplot(var_importance, aes(x = reorder(variable, importance), y = importance)) +
  geom_col(fill = "steelblue", alpha = 0.8) +
  coord_flip() +
  labs(title = "PM2.5 Regression Variable Importance",
       x = "Variable", y = "Standardized Coefficient (Absolute Value)") +
  theme_minimal()

#print(p3)
# Predict PM2.5 across temperature range with other variables held at mean
temp_range <- seq(min(pm25_data$TEMP), max(pm25_data$TEMP), length.out = 100)
temp_pred_data <- data.frame(
  TEMP = temp_range,
  PRES = mean(pm25_data$PRES),
  DEWP = mean(pm25_data$DEWP),
  RAIN = mean(pm25_data$RAIN),
  WSPM = mean(pm25_data$WSPM)
)

temp_predictions <- predict(poly_model, temp_pred_data, interval = "confidence")

# Create plot of temperature effect on PM2.5
p4 <- ggplot() +
  geom_point(data = pm25_data, aes(x = TEMP, y = PM25), alpha = 0.3, color = "gray") +
  geom_line(aes(x = temp_range, y = temp_predictions[,1]), color = "red", size = 1) +
  geom_ribbon(aes(x = temp_range, ymin = temp_predictions[,2], ymax = temp_predictions[,3]),
              alpha = 0.2, fill = "red") +
  labs(title = "Temperature Effect on PM2.5 (Polynomial Fit)",
       x = "Temperature (°C)", y = "PM2.5 Concentration (μg/m³)") +
  theme_minimal()

#print(p4)

10.1 Actual vs Predicted Values

10.2 Residual Analysis

10.3 Variable Importance

10.4 Temperature Effect on PM2.5

11 Conclusion

Among the meteorological factors, wind speed (WSPM) and temperature (TEMP) have the greatest impact on the concentrations of PM2.5 and PM10, and they show a negative correlation (the greater the wind speed and the higher the temperature, the lower the pollutant concentration).

The multiple linear regression model has a higher explanatory power for PM2.5 (R² ≈ 0.198) than for PM10 (R² ≈ 0.121), indicating that the relationship between PM2.5 and meteorological variables is more significant.

After introducing the quadratic term of temperature into the polynomial model, it did not significantly improve the model performance, suggesting that the linear model can already fit the relationship between temperature and pollutants well.

The cross-validation results show that the model’s generalization ability is stable and can be used for pollutant concentration prediction based on meteorological data.

Classification Model

Classify the air quality grade (excellent, good, mild pollution, etc.) of Beijing based on meteorological factors

1. Data Loading

# Clear workspace
rm(list = ls())

# Load required packages
required_packages <- c("tidyverse", "caret", "randomForest", "rpart", "e1071", "class", "MLmetrics")

for (package in required_packages) {
  if (!require(package, character.only = TRUE, quietly = TRUE)) {
    install.packages(package, quiet = TRUE)
    library(package, character.only = TRUE)
  }
}

# Load dataset
air_data <- read_csv("data/beijing_air_quality_cleaned.csv", show_col_types = FALSE)
air_data$datetime <- as.POSIXct(air_data$datetime)

2. Data Preparation

Clean the data and create AQI level classifications.

# Function to categorize AQI levels based on PM2.5 values
create_aqi_level <- function(pm25) {
  case_when(
    pm25 <= 35 ~ "Good",
    pm25 <= 75 ~ "Moderate",
    pm25 <= 115 ~ "Light_Pollution",
    pm25 <= 150 ~ "Unhealthy",
    pm25 <= 250 ~ "Very_Unhealthy",
    TRUE ~ "Hazardous"
  )
}

# Clean and prepare data for classification
air_data_clean <- air_data %>%
  dplyr::filter(!is.na(`PM2.5`), !is.na(TEMP), !is.na(PRES), !is.na(DEWP), !is.na(RAIN), !is.na(WSPM)) %>%
  dplyr::mutate(AQI_Level = create_aqi_level(`PM2.5`)) %>%
  dplyr::mutate(AQI_Level = factor(AQI_Level,
                           levels = c("Good", "Moderate", "Light_Pollution",
                                    "Unhealthy", "Very_Unhealthy", "Hazardous"),
                           ordered = TRUE))

# Check AQI level distribution

AQI Level distribution

## 
##            Good        Moderate Light_Pollution       Unhealthy  Very_Unhealthy 
##           12445            8439            5554            3013            5613 
##       Hazardous 
##               0

3. Prepare Classification Dataset

Select relevant features and split data into training and test sets.

# Prepare dataset for classification
classification_data <- air_data_clean %>%
  dplyr::select(AQI_Level, TEMP, PRES, DEWP, RAIN, WSPM) %>%
  na.omit() %>%
  dplyr::filter(AQI_Level != "Hazardous") %>%
  dplyr::mutate(AQI_Level = droplevels(AQI_Level))

# Inspect classification dataset
#cat("Classification dataset dimensions:", nrow(classification_data), "rows x", ncol(classification_data), "columns\n")
#cat("Final AQI Level distribution:\n")
#print(table(classification_data$AQI_Level))

# Split data into training and test sets
set.seed(123)
train_index <- createDataPartition(classification_data$AQI_Level, p = 0.8, list = FALSE)
train_data <- classification_data[train_index, ]
test_data <- classification_data[-train_index, ]

# Sample training data for performance
sample_size <- min(5000, nrow(train_data))
train_sample_index <- sample(nrow(train_data), sample_size)
train_data_sample <- train_data[train_sample_index, ]

#cat("Training set:", nrow(train_data_sample), "samples (sampled)\n")
#cat("Test set:", nrow(test_data), "samples\n")

4. Model Training

Train multiple classification models using cross-validation.

# Set up cross-validation
train_control <- trainControl(method = "cv", number = 3,
                             classProbs = TRUE)

# Train Random Forest model
#cat("Training Random Forest model...\n")
set.seed(123)
rf_model <- train(AQI_Level ~ ., data = train_data_sample, method = "rf",
                  trControl = train_control, ntree = 100)

# Train Decision Tree model
#cat("Training Decision Tree model...\n")
set.seed(123)
dt_model <- train(AQI_Level ~ ., data = train_data_sample, method = "rpart",
                  trControl = train_control)

# Train Logistic Regression model
#cat("Training Logistic Regression model...\n")
set.seed(123)
logistic_model <- train(AQI_Level ~ ., data = train_data_sample, method = "multinom",
                        trControl = train_control, trace = FALSE)

#cat("All models trained successfully!\n")

# Store models in a list
models <- list(
  "Random Forest" = rf_model,
  "Decision Tree" = dt_model,
  "Logistic Regression" = logistic_model
)

5. Model Evaluation

Evaluate model performance on the test set.

# Function to evaluate model performance
evaluate_model <- function(model, test_data) {
  predictions <- predict(model, test_data)
  cm <- confusionMatrix(predictions, test_data$AQI_Level)

  return(list(
    accuracy = round(cm$overall['Accuracy'], 4),
    kappa = round(cm$overall['Kappa'], 4),
    sensitivity = round(mean(cm$byClass[,'Sensitivity'], na.rm = TRUE), 4),
    specificity = round(mean(cm$byClass[,'Specificity'], na.rm = TRUE), 4)
  ))
}

# Evaluate all models
#cat("Evaluating models on test set...\n")
performance_results <- data.frame()

for(model_name in names(models)) {
  #cat("Evaluating", model_name, "...\n")
  model <- models[[model_name]]
  metrics <- evaluate_model(model, test_data)

  performance_results <- rbind(performance_results, data.frame(
    Model = model_name,
    Accuracy = metrics$accuracy,
    Kappa = metrics$kappa,
    Sensitivity = metrics$sensitivity,
    Specificity = metrics$specificity
  ))
}
# Identify best model
best_model_name <- performance_results[which.max(performance_results$Accuracy), "Model"]
best_model <- models[[best_model_name]]

Classification Model Performance

##                         Model Accuracy  Kappa Sensitivity Specificity
## Accuracy        Random Forest   0.4643 0.2723      0.3743      0.8557
## Accuracy1       Decision Tree   0.4277 0.1800      0.3061      0.8351
## Accuracy2 Logistic Regression   0.4227 0.1984      0.3161      0.8403

Best model

## [1] "Random Forest"

6. Model Analysis

Analyze the performance of the best model.

# Generate confusion matrix for best model
final_predictions <- predict(best_model, test_data)
final_cm <- confusionMatrix(final_predictions, test_data$AQI_Level)



# Feature importance analysis for Random Forest model
if(best_model_name == "Random Forest") {
  feature_importance <- varImp(best_model)$importance
  feature_ranking <- data.frame(
    Feature = rownames(feature_importance),
    Importance = feature_importance$Overall
  ) %>%
    arrange(desc(Importance))



  #write_csv(feature_ranking, "data/feature_importance_ranking.csv")
}

Model Confusion Matrix

##                  Reference
## Prediction        Good Moderate Light_Pollution Unhealthy Very_Unhealthy
##   Good            1838      662             297       162            189
##   Moderate         403      574             324       148            210
##   Light_Pollution  123      230             246       110            118
##   Unhealthy         28       46              59        53             58
##   Very_Unhealthy    97      175             184       129            547

Feature Importance Ranking

##   Feature Importance
## 1    DEWP  100.00000
## 2    PRES   88.70055
## 3    TEMP   87.38510
## 4    WSPM   65.95241
## 5    RAIN    0.00000

7. Conclusion

This project aims to classify the air quality levels in Beijing based on meteorological factors. Multiple models (random forest, decision tree, logistic regression) were trained and evaluated.

The random forest model performed better than the other models, with an accuracy rate of 0.4643.

The analysis of feature importance showed that dew point (DEWP) was the most influential factor, followed by temperature (TEMP), pressure (PRES), and wind speed (WSPM).

Although the accuracy rate indicates room for improvement, this model provides insights into the key meteorological drivers of air quality.

Visualization

1. Load Data and Preprocessing

air_data <- read_csv("data/beijing_air_quality_cleaned.csv", show_col_types = FALSE)
air_data$datetime <- as.POSIXct(air_data$datetime)

create_aqi_level <- function(pm25) {
  case_when(
    pm25 <= 35 ~ "Good",
    pm25 <= 75 ~ "Moderate", 
    pm25 <= 115 ~ "Unhealthy for Sensitive Groups",
    pm25 <= 150 ~ "Unhealthy",
    pm25 <= 250 ~ "Very Unhealthy",
    TRUE ~ "Hazardous"
  )
}

air_data <- air_data %>%
  mutate(
    AQI_Level = create_aqi_level(PM2.5),
    AQI_Level = factor(AQI_Level, 
                       levels = c("Good", "Moderate", "Unhealthy for Sensitive Groups", 
                                  "Unhealthy", "Very Unhealthy", "Hazardous"),
                       ordered = TRUE),
    Year = year(datetime),
    Month = month(datetime)
  )

3. AQI Level Distribution

aqi_distribution <- air_data %>%
  filter(!is.na(AQI_Level)) %>%
  count(AQI_Level) %>%
  mutate(percentage = round(n / sum(n) * 100, 1))

ggplot(aqi_distribution, aes(x = AQI_Level, y = n, fill = AQI_Level)) +
  geom_col() +
  geom_text(aes(label = paste0(n, "\n(", percentage, "%)")), vjust = -0.5, size = 3) +
  labs(title = "Air Quality Level Distribution",
       x = "AQI Level", y = "Hours") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")

Insight: Over half of the observations fall into the “Moderate” to “Unhealthy” categories, indicating that such air quality conditions are routine rather than exceptional. This highlights the persistent nature of mild to moderate air pollution in Beijing and emphasizes the need for long-term emission management.

4. Weather Factors vs PM2.5

air_data <- read_csv("data/beijing_air_quality_cleaned.csv", show_col_types = FALSE)
air_data$datetime <- as.POSIXct(air_data$datetime)

names(air_data)[names(air_data) == "PM2.5"] <- "PM25"

weather_vars <- c("TEMP", "PRES", "DEWP", "RAIN", "WSPM")

analysis_cols <- c("PM25", weather_vars)
temp_data <- air_data[, analysis_cols]
clean_data <- temp_data[complete.cases(temp_data), ]

set.seed(123)
n_sample <- min(5000, nrow(clean_data))
sample_indices <- sample(nrow(clean_data), n_sample)
sample_data <- clean_data[sample_indices, ]


cor_results <- data.frame(
  Variable = weather_vars,
  Correlation = sapply(weather_vars, function(var) {
    cor(sample_data$PM25, sample_data[[var]], use = "complete.obs")
  })
)



plot_data <- data.frame()
for(var in weather_vars) {
  temp_df <- data.frame(
    PM25 = sample_data$PM25,
    Variable = var,
    Value = sample_data[[var]],
    Correlation = cor_results$Correlation[cor_results$Variable == var]
  )
  plot_data <- rbind(plot_data, temp_df)
}

ggplot(plot_data, aes(x = Value, y = PM25)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", color = "red") +
  geom_text(aes(x = Inf, y = Inf, label = paste("r =", round(Correlation, 3))),
            hjust = 1.1, vjust = 1.5, size = 3, color = "red") +
  labs(title = "Weather Factors vs PM2.5 Concentration",
       x = "Weather Variable Value", y = "PM2.5 Concentration (μg/m³)") +
  facet_wrap(~Variable, scales = "free_x") +
  theme_minimal()

Insight: Temperature and dew point show a moderate negative correlation with PM2.5, reinforcing the idea that cold weather contributes to pollution buildup due to limited atmospheric mixing. Wind speed has a slight mitigating effect, but not strong enough to disperse pollution effectively in most cases.

5. Seasonal AQI Distribution

if(!"PM25" %in% names(air_data)) {
  names(air_data)[names(air_data) == "PM2.5"] <- "PM25"
}

create_aqi_level <- function(pm25) {
  case_when(
    pm25 <= 35 ~ "Good",
    pm25 <= 75 ~ "Moderate", 
    pm25 <= 115 ~ "Unhealthy for Sensitive Groups",
    pm25 <= 150 ~ "Unhealthy",
    pm25 <= 250 ~ "Very Unhealthy",
    TRUE ~ "Hazardous"
  )
}

seasonal_data <- air_data %>%
  filter(!is.na(PM25)) %>%
  mutate(
    AQI_Level = create_aqi_level(PM25),
    AQI_Level = factor(AQI_Level, 
                       levels = c("Good", "Moderate", "Unhealthy for Sensitive Groups", 
                                  "Unhealthy", "Very Unhealthy", "Hazardous"),
                       ordered = TRUE),
    Month = month(datetime),
    Season = case_when(
      Month %in% c(12, 1, 2) ~ "Winter",
      Month %in% c(3, 4, 5) ~ "Spring", 
      Month %in% c(6, 7, 8) ~ "Summer",
      Month %in% c(9, 10, 11) ~ "Fall"
    )
  ) %>%
  group_by(Season, AQI_Level) %>%
  summarise(Hours = n(), .groups = "drop") %>%
  group_by(Season) %>%
  mutate(Percentage = Hours / sum(Hours) * 100)

ggplot(seasonal_data, aes(x = Season, y = Percentage, fill = AQI_Level)) +
  geom_col(position = "stack") +
  labs(title = "Seasonal Air Quality Distribution",
       x = "Season", y = "Percentage (%)", fill = "AQI Level") +
  theme_minimal()

Insight: Winter accounts for the highest share of unhealthy air quality, confirming that this season is the most vulnerable. In contrast, summer has the cleanest air. This suggests that pollution control policies should be adaptive and season-specific, targeting wintertime emissions most aggressively.

6. Conclusion

The analysis of Beijing air quality data from 2013 to 2017 reveals several important insights:

  1. Seasonal Patterns: Pollution spikes in winter due to heating and stagnant air.
  2. Air Quality Status: Moderate to unhealthy levels are common and persistent.
  3. Weather Impact: Cold, dry, and calm conditions favor pollutant accumulation.
  4. Policy Implication: Emission controls should intensify in winter, and meteorology should be integrated into real-time pollution warnings.

These data-driven patterns provide actionable knowledge for environmental planning and reinforce the need for dynamic, targeted air quality management.

Conclusion

This report analyzes the air quality data of Beijing from 2013 to 2017, based on the PRSA dataset and supplementary sources such as the China Meteorological Data Network. It systematically examines the influence of meteorological factors on PM2.5/PM10 and builds a prediction model. After data cleaning (filling of missing values, handling of outliers) and feature engineering (time features, lag variables, rolling window statistics), the data is divided into training, validation and test sets at a ratio of 70%, 15%, and 15%. Regression analysis shows that wind speed (WSPM) and temperature (TEMP) have the strongest negative correlation with pollutant concentrations. The multiple linear model explains 19.77% and 12.09% of PM2.5 and PM10, respectively, and wind speed and dew point (DEWP) are key influencing variables. In the classification model, random forest has a classification accuracy of 46.43% for air quality grades (good / mild pollution, etc.), and dew point is the most important feature. The visualization results reveal that pollutant concentrations have significant seasonality. They peak in winter due to heating and stable atmospheric conditions, and decrease in summer due to favorable meteorological diffusion conditions. The study provides data support and model basis for air quality prediction and seasonal pollution control in Beijing.