Introduction

In the realm of data science and statistical modeling, handling high-dimensional data with multicollinearity among predictor variables is a common challenge. Traditional regression techniques often struggle in such scenarios, leading to unstable estimates and poor predictive performance. To address these issues, dimensionality reduction techniques such as Principal Component Regression (PCR) and Partial Least Squares Regression (PLS) are employed. This project aims to compare and contrast PCR and PLS, highlighting their strengths and weaknesses through a practical application on the Air Quality Dataset.

Principal Component Regression (PCR):

Principal Component Regression is a technique that combines Principal Component Analysis (PCA) and linear regression to address multicollinearity and reduce dimensionality in high-dimensional datasets. The key steps in PCR are:

  • Principal Component Analysis (PCA): PCA transforms the original predictor variables into a set of new, uncorrelated variables called principal components. These components are linear combinations of the original variables and are ordered by the amount of variance they explain in the data. Each principal component captures the maximum possible variance while being orthogonal to the previous components.
  • Regression: A subset of the principal components (those explaining the most variance) is selected and used as predictors in a linear regression model to predict the response variable. By focusing on the principal components that capture the most variance, PCR aims to build a more stable and interpretable regression model.

Partial Least Squares Regression (PLS):

Partial Least Squares Regression is a technique that, unlike PCR, considers both the predictor variables and the response variable during the dimensionality reduction process. The key steps in PLS are:

  • Latent Variable Extraction: PLS extracts a set of latent variables (components) that maximize the covariance between the predictors and the response variable. These components are linear combinations of the original variables, chosen in such a way that they capture as much of the relevant information as possible for predicting the response variable. This ensures that the extracted components are directly related to the outcome of interest.
  • Regression: The latent variables are then used as predictors in a linear regression model to predict the response variable. By incorporating the response variable in the component extraction process, PLS aims to improve the predictive accuracy of the regression model.

Project Significance:

This project will provide valuable insights into the practical application of dimensionality reduction techniques in regression modeling. By comparing PCR and PLS, data scientists and researchers can make informed decisions about which method to employ based on the characteristics of their data and the specific requirements of their analysis. Moreover, the findings from this project can contribute to the broader understanding of how to effectively handle multicollinearity and high-dimensional data in various fields, including environmental science, finance, and bioinformatics.

Dataset Overview:

The Air Quality Dataset, sourced from the UCI Machine Learning Repository, contains measurements of various air pollutants and meteorological variables collected at an Italian monitoring station. The dataset comprises daily measurements of pollutants such as ozone, nitrogen dioxide, and carbon monoxide, as well as meteorological variables like temperature, wind speed, and humidity. With over 9,000 instances, this dataset provides a rich source of data for analysis. The target variable for this project will be one of the pollutant concentrations, allowing us to explore how well PCR and PLS can predict air quality based on the available predictors.

Target Variable: C6H6 (benzene) concentration, denoted as C6H6(GT), has been chosen as the target variable due to its significant implications for public health and environmental monitoring. Benzene is a major pollutant, classified as a carcinogen, and its presence in the atmosphere is closely linked to a variety of health risks, including increased cancer rates. Additionally, benzene levels serve as an indicator of vehicular and industrial emissions, which are primary concerns in urban pollution management.

Columns:
  • Date: The date when the measurements were taken.
  • Time: The time when the measurements were taken.
  • CO(GT): Concentration of carbon monoxide in the air (measured in mg/m^3).
  • PT08.S1(CO): Sensor response (PPM) of a non-dispersive infrared (NDIR) sensor for carbon monoxide.
  • NMHC(GT): Concentration of non-methane hydrocarbons in the air (measured in microg/m^3).
  • C6H6(GT): Concentration of benzene in the air (measured in microg/m^3).
  • PT08.S2(NMHC): Sensor response (PPM) of an NDIR sensor for non-methane hydrocarbons.
  • NOx(GT): Concentration of nitrogen oxides in the air (measured in PPB).
  • PT08.S3(NOx): Sensor response (PPM) of a chemiluminescence sensor for nitrogen oxides.
  • NO2(GT): Concentration of nitrogen dioxide in the air (measured in microg/m^3).
  • PT08.S4(NO2): Sensor response (PPM) of an electrochemical sensor for nitrogen dioxide.
  • PT08.S5(O3): Sensor response (PPM) of a UV photometric sensor for ozone.
  • T: Temperature in Celsius.
  • RH: Relative humidity in percentage.
  • AH: Absolute humidity in grams per cubic meter.


Libraries

We will start by loading the necessary libraries for our analysis.

# Loading the dplyr package for data manipulation
library(dplyr)
library(tidyr)

# Loading the ggplot2 and GGally packages for data visualization
library(ggplot2)
library(GGally)

# Loading the pls package for Partial Least Squares Regression
library(pls)

# Loading the caret package for model training and evaluation, including Principal Component Regression
library(caret)

# Loading the readr package for reading the dataset
library(readr)


Data Loading and Cleaning

# Loading the dataset 
air_quality <- read_delim("C:/Users/david/Downloads/AirQualityUCI.csv", delim = ";")

# Printing the first few rows of the dataset to inspect it
head(air_quality)
# Removing the columns with parsing issues
air_quality <- air_quality %>% select(-c(...16, ...17))
# Convert character columns to numeric, except for Date and Time
cols_to_convert <- setdiff(names(air_quality), c("Date", "Time"))

# Use mutate to convert columns to numeric
air_quality <- air_quality %>%
  mutate(across(all_of(cols_to_convert), ~ as.numeric(gsub(",", ".", .))))

# Correct the RH column by dividing by 10
air_quality$RH <- air_quality$RH / 10

# Convert Date to Date type
air_quality$Date <- as.Date(air_quality$Date, format = "%d/%m/%Y")

# Load 'hms' library to convert Time to hms
library(hms)

# Replace dots with colons in the Time column
air_quality$Time <- gsub("\\.", ":", air_quality$Time)

# Convert Time to hms type
air_quality$Time <- as_hms(air_quality$Time)

# Check the first few rows to confirm the changes
head(air_quality)
# Check for missing values
colSums(is.na(air_quality))
##          Date          Time        CO(GT)   PT08.S1(CO)      NMHC(GT) 
##           114           114           114           114           114 
##      C6H6(GT) PT08.S2(NMHC)       NOx(GT)  PT08.S3(NOx)       NO2(GT) 
##           114           114           114           114           114 
##  PT08.S4(NO2)   PT08.S5(O3)             T            RH            AH 
##           114           114           114           114           114
# Replacing invalid values (-200) with NA
air_quality[air_quality == -200] <- NA

Lets handle the missing values by replacing them with the mean of each column:

# Handle missing values by replacing them with the mean of each column
air_quality <- air_quality %>%
  mutate(across(all_of(cols_to_convert), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))
# Check for missing values
colSums(is.na(air_quality))
##          Date          Time        CO(GT)   PT08.S1(CO)      NMHC(GT) 
##           114           114             0             0             0 
##      C6H6(GT) PT08.S2(NMHC)       NOx(GT)  PT08.S3(NOx)       NO2(GT) 
##             0             0             0             0             0 
##  PT08.S4(NO2)   PT08.S5(O3)             T            RH            AH 
##             0             0             0             0             0

The missing values in the Date and Time columns remain, but all the other columns have been successfully cleaned. Since Date and Time are not critical for the regression models we are planning to build (PCR and PLS), we can drop these rows with missing Date and Time values.

# Remove rows with missing values in Date and Time columns
air_quality <- air_quality %>%
  filter(!is.na(Date) & !is.na(Time))

# Summary of the cleaned dataset
summary(air_quality)
##       Date                Time              CO(GT)        PT08.S1(CO)  
##  Min.   :2004-03-10   Length:9357       Min.   : 0.100   Min.   : 647  
##  1st Qu.:2004-06-16   Class1:hms        1st Qu.: 1.200   1st Qu.: 941  
##  Median :2004-09-21   Class2:difftime   Median : 2.153   Median :1075  
##  Mean   :2004-09-21   Mode  :numeric    Mean   : 2.153   Mean   :1100  
##  3rd Qu.:2004-12-28                     3rd Qu.: 2.600   3rd Qu.:1221  
##  Max.   :2005-04-04                     Max.   :11.900   Max.   :2040  
##     NMHC(GT)         C6H6(GT)     PT08.S2(NMHC)       NOx(GT)      
##  Min.   :   7.0   Min.   : 0.10   Min.   : 383.0   Min.   :   2.0  
##  1st Qu.: 218.8   1st Qu.: 4.60   1st Qu.: 743.0   1st Qu.: 112.0  
##  Median : 218.8   Median : 8.60   Median : 923.0   Median : 229.0  
##  Mean   : 218.8   Mean   :10.08   Mean   : 939.2   Mean   : 246.9  
##  3rd Qu.: 218.8   3rd Qu.:13.60   3rd Qu.:1105.0   3rd Qu.: 284.0  
##  Max.   :1189.0   Max.   :63.70   Max.   :2214.0   Max.   :1479.0  
##   PT08.S3(NOx)       NO2(GT)       PT08.S4(NO2)   PT08.S5(O3)         T        
##  Min.   : 322.0   Min.   :  2.0   Min.   : 551   Min.   : 221   Min.   :-1.90  
##  1st Qu.: 666.0   1st Qu.: 86.0   1st Qu.:1242   1st Qu.: 742   1st Qu.:12.00  
##  Median : 818.0   Median :113.1   Median :1456   Median : 983   Median :18.30  
##  Mean   : 835.5   Mean   :113.1   Mean   :1456   Mean   :1023   Mean   :18.32  
##  3rd Qu.: 960.0   3rd Qu.:133.0   3rd Qu.:1662   3rd Qu.:1255   3rd Qu.:24.10  
##  Max.   :2683.0   Max.   :340.0   Max.   :2775   Max.   :2523   Max.   :44.60  
##        RH               AH        
##  Min.   :-20.00   Min.   :0.1847  
##  1st Qu.: 34.10   1st Qu.:0.7461  
##  Median : 48.60   Median :1.0154  
##  Mean   : 46.53   Mean   :1.0255  
##  3rd Qu.: 61.90   3rd Qu.:1.2962  
##  Max.   : 88.70   Max.   :2.2310
# Checking the structure of the cleaned dataset
str(air_quality)
## tibble [9,357 × 15] (S3: tbl_df/tbl/data.frame)
##  $ Date         : Date[1:9357], format: "2004-03-10" "2004-03-10" ...
##  $ Time         : 'hms' num [1:9357] 18:00:00 19:00:00 20:00:00 21:00:00 ...
##   ..- attr(*, "units")= chr "secs"
##  $ CO(GT)       : num [1:9357] 2.6 2 2.2 2.2 1.6 1.2 1.2 1 0.9 0.6 ...
##  $ PT08.S1(CO)  : num [1:9357] 1360 1292 1402 1376 1272 ...
##  $ NMHC(GT)     : num [1:9357] 150 112 88 80 51 38 31 31 24 19 ...
##  $ C6H6(GT)     : num [1:9357] 11.9 9.4 9 9.2 6.5 4.7 3.6 3.3 2.3 1.7 ...
##  $ PT08.S2(NMHC): num [1:9357] 1046 955 939 948 836 ...
##  $ NOx(GT)      : num [1:9357] 166 103 131 172 131 ...
##  $ PT08.S3(NOx) : num [1:9357] 1056 1174 1140 1092 1205 ...
##  $ NO2(GT)      : num [1:9357] 113 92 114 122 116 ...
##  $ PT08.S4(NO2) : num [1:9357] 1692 1559 1555 1584 1490 ...
##  $ PT08.S5(O3)  : num [1:9357] 1268 972 1074 1203 1110 ...
##  $ T            : num [1:9357] 13.6 13.3 11.9 11 11.2 11.2 11.3 10.7 10.7 10.3 ...
##  $ RH           : num [1:9357] 48.9 47.7 54 60 59.6 59.2 56.8 60 59.7 60.2 ...
##  $ AH           : num [1:9357] 0.758 0.726 0.75 0.787 0.789 ...
# Check for missing values
colSums(is.na(air_quality))
##          Date          Time        CO(GT)   PT08.S1(CO)      NMHC(GT) 
##             0             0             0             0             0 
##      C6H6(GT) PT08.S2(NMHC)       NOx(GT)  PT08.S3(NOx)       NO2(GT) 
##             0             0             0             0             0 
##  PT08.S4(NO2)   PT08.S5(O3)             T            RH            AH 
##             0             0             0             0             0

Both Principal Component Regression (PCR) and Partial Least Squares Regression (PLS) are sensitive to outliers, but their sensitivities arise in different ways. In PCR, the sensitivity to outliers stems from the initial Principal Component Analysis (PCA) step, where the method aims to capture the maximum variance in the data. Outliers can significantly affect this variance, distorting the principal components and subsequently leading to poor regression performance. Similarly, PLS involves the construction of latent variables using both predictors and response variables, which means that outliers in either set can bias the results. To mitigate these issues, lets remove the outliers from the data set.

# Identify outliers using the IQR method
Q1 <- quantile(air_quality$'C6H6(GT)', 0.25)
Q3 <- quantile(air_quality$'C6H6(GT)', 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outliers <- which(air_quality$'C6H6(GT)' < lower_bound | air_quality$'C6H6(GT)' > upper_bound)

# Remove outliers from the dataset
air_quality <- air_quality[-outliers, ]
# Checking the structure of the cleaned dataset
str(air_quality)
## tibble [9,071 × 15] (S3: tbl_df/tbl/data.frame)
##  $ Date         : Date[1:9071], format: "2004-03-10" "2004-03-10" ...
##  $ Time         : 'hms' num [1:9071] 18:00:00 19:00:00 20:00:00 21:00:00 ...
##   ..- attr(*, "units")= chr "secs"
##  $ CO(GT)       : num [1:9071] 2.6 2 2.2 2.2 1.6 1.2 1.2 1 0.9 0.6 ...
##  $ PT08.S1(CO)  : num [1:9071] 1360 1292 1402 1376 1272 ...
##  $ NMHC(GT)     : num [1:9071] 150 112 88 80 51 38 31 31 24 19 ...
##  $ C6H6(GT)     : num [1:9071] 11.9 9.4 9 9.2 6.5 4.7 3.6 3.3 2.3 1.7 ...
##  $ PT08.S2(NMHC): num [1:9071] 1046 955 939 948 836 ...
##  $ NOx(GT)      : num [1:9071] 166 103 131 172 131 ...
##  $ PT08.S3(NOx) : num [1:9071] 1056 1174 1140 1092 1205 ...
##  $ NO2(GT)      : num [1:9071] 113 92 114 122 116 ...
##  $ PT08.S4(NO2) : num [1:9071] 1692 1559 1555 1584 1490 ...
##  $ PT08.S5(O3)  : num [1:9071] 1268 972 1074 1203 1110 ...
##  $ T            : num [1:9071] 13.6 13.3 11.9 11 11.2 11.2 11.3 10.7 10.7 10.3 ...
##  $ RH           : num [1:9071] 48.9 47.7 54 60 59.6 59.2 56.8 60 59.7 60.2 ...
##  $ AH           : num [1:9071] 0.758 0.726 0.75 0.787 0.789 ...


Data Exploration and Visualization

# Summary of the dataset
summary(air_quality)
##       Date                Time              CO(GT)       PT08.S1(CO)  
##  Min.   :2004-03-10   Length:9071       Min.   :0.100   Min.   : 647  
##  1st Qu.:2004-06-15   Class1:hms        1st Qu.:1.200   1st Qu.: 938  
##  Median :2004-09-19   Class2:difftime   Median :2.100   Median :1066  
##  Mean   :2004-09-21   Mode  :numeric    Mean   :2.046   Mean   :1084  
##  3rd Qu.:2004-12-29                     3rd Qu.:2.500   3rd Qu.:1202  
##  Max.   :2005-04-04                     Max.   :9.300   Max.   :1898  
##     NMHC(GT)        C6H6(GT)      PT08.S2(NMHC)       NOx(GT)      
##  Min.   :  7.0   Min.   : 0.100   Min.   : 383.0   Min.   :   2.0  
##  1st Qu.:218.8   1st Qu.: 4.500   1st Qu.: 736.0   1st Qu.: 110.0  
##  Median :218.8   Median : 8.300   Median : 912.0   Median : 220.0  
##  Mean   :217.2   Mean   : 9.358   Mean   : 917.8   Mean   : 235.5  
##  3rd Qu.:218.8   3rd Qu.:13.000   3rd Qu.:1083.0   3rd Qu.: 270.0  
##  Max.   :872.0   Max.   :27.100   Max.   :1483.0   Max.   :1310.0  
##   PT08.S3(NOx)       NO2(GT)       PT08.S4(NO2)   PT08.S5(O3)    
##  Min.   : 360.0   Min.   :  2.0   Min.   : 551   Min.   : 221.0  
##  1st Qu.: 681.0   1st Qu.: 84.0   1st Qu.:1231   1st Qu.: 734.0  
##  Median : 826.0   Median :113.1   Median :1456   Median : 968.0  
##  Mean   : 847.2   Mean   :111.6   Mean   :1433   Mean   : 995.5  
##  3rd Qu.: 968.0   3rd Qu.:130.0   3rd Qu.:1640   3rd Qu.:1217.0  
##  Max.   :2683.0   Max.   :340.0   Max.   :2404   Max.   :2519.0  
##        T               RH               AH        
##  Min.   :-1.90   Min.   :-20.00   Min.   :0.1847  
##  1st Qu.:12.00   1st Qu.: 33.80   1st Qu.:0.7423  
##  Median :18.20   Median : 48.20   Median :1.0115  
##  Mean   :18.29   Mean   : 46.31   Mean   :1.0207  
##  3rd Qu.:24.10   3rd Qu.: 61.80   3rd Qu.:1.2874  
##  Max.   :44.60   Max.   : 88.70   Max.   :2.2310

Brief summary of the key statistics:

Plotting the distributions:

# Converting data to long format and scaling the values
air_quality_long <- air_quality %>%
  pivot_longer(cols = -c(Date, Time), names_to = "variable", values_to = "value") %>%
  mutate(value = scale(value))  # Scale the data

# 1. Histograms of Scaled Data
ggplot(air_quality_long, aes(x = value)) +
  geom_histogram(bins = 30, fill = "blue", color = "black") +
  facet_wrap(~ variable, scales = "free_x") +
  theme_minimal() +
  labs(title = "Histograms of Scaled Air Quality Variables",
       x = "Scaled Value",
       y = "Frequency")

# 2. Box Plots of Scaled Data
ggplot(air_quality_long, aes(x = variable, y = value)) +
  geom_boxplot(fill = "blue", color = "black") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Box Plots of Scaled Air Quality Variables",
       x = "Variable",
       y = "Scaled Value")

Checking for correlations:

# Correlation matrix
cor_matrix <- cor(air_quality %>% select(-Date, -Time))
print(cor_matrix)
##                    CO(GT) PT08.S1(CO)    NMHC(GT)   C6H6(GT) PT08.S2(NMHC)
## CO(GT)         1.00000000  0.74107980  0.28802081  0.7854718     0.7705184
## PT08.S1(CO)    0.74107980  1.00000000  0.25690917  0.8691725     0.8730848
## NMHC(GT)       0.28802081  0.25690917  1.00000000  0.2900620     0.2811599
## C6H6(GT)       0.78547184  0.86917245  0.29006196  1.0000000     0.9868651
## PT08.S2(NMHC)  0.77051842  0.87308482  0.28115989  0.9868651     1.0000000
## NOx(GT)        0.73012163  0.57549353  0.11028635  0.5704788     0.5563549
## PT08.S3(NOx)  -0.59893390 -0.76177400 -0.27650361 -0.7435827    -0.7917584
## NO2(GT)        0.64805859  0.54970920  0.16286911  0.5368447     0.5507541
## PT08.S4(NO2)   0.49614634  0.62777660  0.20643591  0.7305175     0.7403152
## PT08.S5(O3)    0.72914332  0.88369349  0.23142528  0.8527851     0.8597888
## T              0.02837898  0.04990897  0.07773880  0.2358569     0.2679064
## RH            -0.04862459  0.07059802 -0.05011934 -0.1000756    -0.1111914
## AH             0.02460750  0.11919439  0.04785249  0.1591985     0.1775017
##                  NOx(GT) PT08.S3(NOx)     NO2(GT) PT08.S4(NO2) PT08.S5(O3)
## CO(GT)         0.7301216  -0.59893390  0.64805859   0.49614634  0.72914332
## PT08.S1(CO)    0.5754935  -0.76177400  0.54970920   0.62777660  0.88369349
## NMHC(GT)       0.1102863  -0.27650361  0.16286911   0.20643591  0.23142528
## C6H6(GT)       0.5704788  -0.74358273  0.53684467   0.73051750  0.85278509
## PT08.S2(NMHC)  0.5563549  -0.79175843  0.55075413   0.74031517  0.85978883
## NOx(GT)        1.0000000  -0.54172655  0.76305566   0.11515636  0.65763272
## PT08.S3(NOx)  -0.5417266   1.00000000 -0.55632548  -0.49780942 -0.78521645
## NO2(GT)        0.7630557  -0.55632548  1.00000000   0.08865727  0.61462286
## PT08.S4(NO2)   0.1151564  -0.49780942  0.08865727   1.00000000  0.52943990
## PT08.S5(O3)    0.6576327  -0.78521645  0.61462286   0.52943990  1.00000000
## T             -0.2484538  -0.14887505 -0.16720962   0.60371512 -0.03192633
## RH             0.0666332  -0.02886568 -0.13114017  -0.05263753  0.08362246
## AH            -0.1540311  -0.22352921 -0.30481516   0.65619468  0.05416537
##                         T          RH          AH
## CO(GT)         0.02837898 -0.04862459  0.02460750
## PT08.S1(CO)    0.04990897  0.07059802  0.11919439
## NMHC(GT)       0.07773880 -0.05011934  0.04785249
## C6H6(GT)       0.23585690 -0.10007556  0.15919849
## PT08.S2(NMHC)  0.26790636 -0.11119136  0.17750170
## NOx(GT)       -0.24845384  0.06663320 -0.15403109
## PT08.S3(NOx)  -0.14887505 -0.02886568 -0.22352921
## NO2(GT)       -0.16720962 -0.13114017 -0.30481516
## PT08.S4(NO2)   0.60371512 -0.05263753  0.65619468
## PT08.S5(O3)   -0.03192633  0.08362246  0.05416537
## T              1.00000000 -0.45484120  0.65625807
## RH            -0.45484120  1.00000000  0.12715370
## AH             0.65625807  0.12715370  1.00000000

Based on the correlation matrix provided, it appears that there is a significant level of multicollinearity among the predictor variables. Key observations:

These high correlation values indicate multicollinearity, where predictor variables are highly correlated with each other. Multicollinearity can be problematic in regression models because it can inflate the variance of the coefficient estimates and make the model coefficients unstable and difficult to interpret. Both PCR and PLS are designed to handle multicollinearity by reducing the dimensionality of the data and summarizing the information from correlated predictors into a smaller set of uncorrelated components


Data Preparation for Modeling

# Removing Date and Time columns as they will not be used for modeling
air_quality <- air_quality %>% select(-Date, -Time) 
# Splitting the data into training and testing sets
set.seed(123) # For reproducibility
trainIndex <- createDataPartition(air_quality$`C6H6(GT)`, p = .75, list = FALSE, times = 1) # Create an index for 75% training data
air_quality_train <- air_quality[trainIndex, ]
air_quality_test <- air_quality[-trainIndex, ]

When fitting the model we are going to standardize the data. This is crucial because it ensures that all predictor variables contribute equally to the model, irrespective of their original scales or units. Without scaling, variables with larger magnitudes could dominate the model, potentially skewing the results and leading to misleading interpretations. By standardizing variables (giving them zero mean and unit variance), scaling removes these discrepancies, allowing the model to focus on the underlying patterns and relationships in the data. This leads to more reliable and interpretable models, particularly important when dealing with variables of different types and scales.


Modeling

PCR

# Fitting the PCR model on the training data
pcr_model <- pcr(`C6H6(GT)` ~ ., data = air_quality_train, scale = TRUE, validation = "CV") # Fit PCR model with cross-validation

The argument validation = “CV” specifies that cross-validation (CV) should be used to validate the model. Cross-validation is a robust method for assessing the predictive performance of a model. It involves partitioning the data into subsets, training the model on some subsets (training sets), and validating it on the remaining subsets (validation sets). This process is repeated multiple times to ensure the model’s performance is consistent and not dependent on a particular partitioning of the data.

By using cross-validation, the model is less likely to overfit to the training data. Overfitting occurs when the model captures noise and specific patterns in the training data that do not generalize to new, unseen data. Cross-validation helps in detecting and mitigating overfitting by testing the model on different subsets of the data.

For PCR and PLS, cross-validation helps in determining the optimal number of components to retain.

summary(pcr_model)
## Data:    X dimension: 6805 12 
##  Y dimension: 6805 1
## Fit method: svdpc
## Number of components considered: 12
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           6.062    2.213    2.081    2.065    2.054    1.752    1.601
## adjCV        6.062    2.213    2.081    2.065    2.054    1.752    1.601
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## CV       1.602    1.603    1.333     1.287     1.157    0.8308
## adjCV    1.601    1.603    1.333     1.287     1.157    0.8307
## 
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X           46.81    67.53    78.23    85.85    90.19    93.72    95.56
## C6H6(GT)    86.69    88.23    88.42    88.54    91.67    93.05    93.05
##           8 comps  9 comps  10 comps  11 comps  12 comps
## X           96.92    98.04     98.92     99.73    100.00
## C6H6(GT)    93.05    95.19     95.52     96.38     98.13
# Plotting the RMSEP (Root Mean Squared Error of Prediction) to find the optimal number of components
validationplot(pcr_model, val.type = "RMSEP", main = "RMSEP (PCR Model)")

Determining the Optimal Number of Principal Components:

To determine the optimal number of components, we need to balance between the model complexity (number of components) and the RMSEP (Root Mean Squared Error of Prediction). We are looking for the point where adding more components doesn’t significantly improve the prediction accuracy.

Key Points to Consider from the Summary:

RMSEP Values:
  • There is a significant drop in RMSEP from the intercept to 1 component.
  • The RMSEP continues to decrease gradually until around 3-5 components.
  • After 5 components, the RMSEP starts to plateau, with smaller improvements as more components are added.
  • The RMSEP reaches its minimum at 12 components (0.8308), but we should consider the trade-off between complexity and performance improvement.
Variance Explained:
  • The variance explained in the predictors (X) increases rapidly up to 5 components, covering about 90.19% of the variance.
  • The variance explained in the response variable (C6H6(GT)) also shows substantial improvement up to 5 components (91.67%).
  • Beyond 5 components, the additional variance explained in C6H6(GT) increases at a diminishing rate.

Given these points, let’s evaluate:

RMSEP Values:
  • 1 component: 2.213
  • 2 components: 2.081
  • 3 components: 2.065
  • 4 components: 2.054
  • 5 components: 1.752
  • 6 components: 1.601
  • 7-12 components: small improvements
Variance Explained:
  • 5 components: 91.67%
  • 6 components: 93.05% (small improvement)

Based on both RMSEP and variance explained, it seems reasonable to choose around 5-6 components. There is a significant improvement in RMSEP up to 5 components and variance explained in the response variable increases notably up to this point. Beyond 5-6 components, the improvements are minimal.

Therefore, the optimal number of components is 5 or 6. We might choose 5 for a simpler model or 6 if we prefer slightly better performance at the cost of adding one more component. Let’s choose 6 components.

# Predict using the model and evaluate on the test set with optimal number of components
optimal_number_of_components <- 6  # Optimal number of components based on the RMSEP plot and summary
predictions <- predict(pcr_model, ncomp = optimal_number_of_components, newdata = air_quality_test)  
# Compare predictions with actual values
plot(air_quality_test$`C6H6(GT)`, predictions, xlab = "Actual", ylab = "Predicted", main = "Predicted vs Actual C6H6(GT) Values (PCR Model)")  # Plot actual vs predicted values
abline(0, 1)  # Add a diagonal line for reference

Evaluating PCR model

The square root of the Mean Squared Error provides an indication of the average magnitude of the errors in the same units as the response variable. Lower RMSE values indicate better model performance.

R-squared represents the proportion of the variance in the dependent variable that is predictable from the independent variables. Values closer to 1 indicate better model performance.

# Calculate and print the Root Mean Squared Error (RMSE)
rmse <- sqrt(mean((air_quality_test$`C6H6(GT)` - predictions)^2))  # Calculate RMSE between actual and predicted values
print(paste("RMSE: ", rmse)) 
## [1] "RMSE:  1.57211808907846"
# Calculate the sum of squares of residuals
ss_res <- sum((air_quality_test$`C6H6(GT)` - predictions)^2)

# Calculate the total sum of squares
ss_tot <- sum((air_quality_test$`C6H6(GT)` - mean(air_quality_test$`C6H6(GT)`))^2)

# Calculate R-squared
r_squared <- 1 - (ss_res / ss_tot)

# Print R-squared
print(paste("R-squared: ", r_squared))
## [1] "R-squared:  0.933087937624178"

An R-squared value of 0.9331 means that approximately 93.31% of the variance in the dependent variable (C6H6 concentration) can be explained by the independent variables in the model. This indicates a very strong relationship between the predictors and the response variable. This value suggests that the model fits the data very well and that the predictors used in the model are good at explaining the variability in the response variable.

PLS

# Fitting the PLS model on the training data
pls_model <- plsr(`C6H6(GT)` ~ ., data = air_quality_train, scale = TRUE, validation = "CV")

summary(pls_model)
## Data:    X dimension: 6805 12 
##  Y dimension: 6805 1
## Fit method: kernelpls
## Number of components considered: 12
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           6.062    2.032    1.711    1.359    1.225    1.014   0.9673
## adjCV        6.062    2.031    1.711    1.359    1.225    1.013   0.9672
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## CV      0.9205   0.8420   0.8305    0.8303    0.8303    0.8303
## adjCV   0.9204   0.8419   0.8304    0.8302    0.8302    0.8302
## 
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X           46.70    63.02    71.89    79.98    83.87    90.29    93.89
## C6H6(GT)    88.79    92.05    94.99    95.94    97.22    97.46    97.70
##           8 comps  9 comps  10 comps  11 comps  12 comps
## X           95.16    95.95     97.67     98.64    100.00
## C6H6(GT)    98.08    98.13     98.13     98.13     98.13
# Plotting the RMSEP (Root Mean Squared Error of Prediction) to find the optimal number of components
validationplot(pls_model, val.type = "RMSEP", main = "RMSEP (PLS Model)")

Determining the Optimal Number of Components:

To determine the optimal number of components, we need to balance between the model complexity (number of components) and the RMSEP (Root Mean Squared Error of Prediction). We are looking for the point where adding more components doesn’t significantly improve the prediction accuracy.

Key Points to Consider from the Summary:

RMSEP Values:
  • There is a significant drop in RMSEP from the intercept to 1 component.
  • The RMSEP continues to decrease gradually until around 3-5 components.
  • After 5 components, the RMSEP starts to plateau, with smaller improvements as more components are added.
  • The RMSEP reaches its minimum at 12 components (0.8303), but we should consider the trade-off between complexity and performance improvement.
Variance Explained:
  • The variance explained in the predictors (X) increases rapidly up to 5 components, covering about 83.87% of the variance.
  • The variance explained in the response variable (C6H6(GT)) also shows substantial improvement up to 5 components (97.22%).
  • Beyond 5 components, the additional variance explained in C6H6(GT) increases at a diminishing rate.

Given these points, let’s evaluate:

RMSEP Values:
  • 1 component: 2.032
  • 2 components: 1.711
  • 3 components: 1.359
  • 4 components: 1.225
  • 5 components: 1.014
  • 6 components: 0.967
  • 7-12 components: small improvements
Variance Explained:
  • 5 components: 97.22%
  • 6 components: 97.46% (small improvement)

Based on both RMSEP and variance explained, it seems reasonable to choose around 5-6 components. There is a significant improvement in RMSEP up to 5 components and variance explained in the response variable increases notably up to this point. Beyond 5-6 components, the improvements are minimal.

Therefore, the optimal number of components is 5 or 6. We might choose 5 for a simpler model or 6 if we prefer slightly better performance at the cost of adding one more component. Let’s choose 6 components.

# Predict using the model and evaluate on the test set with optimal number of components
optimal_number_of_components <- 6  # Optimal number of components based on the RMSEP plot and summary
predictions2 <- predict(pls_model, ncomp = optimal_number_of_components, newdata = air_quality_test)  
# Compare predictions with actual values
plot(air_quality_test$`C6H6(GT)`, predictions2, xlab = "Actual", ylab = "Predicted", main = "Predicted vs Actual C6H6(GT) Values (PLS Model)")  # Plot actual vs predicted values
abline(0, 1)  # Add a diagonal line for reference

Evaluating PLS model

# Calculate and print the Root Mean Squared Error (RMSE)
rmse2 <- sqrt(mean((air_quality_test$`C6H6(GT)` - predictions2)^2))  # Calculate RMSE between actual and predicted values
print(paste("RMSE: ", rmse2)) 
## [1] "RMSE:  0.972380060198749"
# Calculate the sum of squares of residuals
ss_res2 <- sum((air_quality_test$`C6H6(GT)` - predictions2)^2)

# Calculate the total sum of squares
ss_tot2 <- sum((air_quality_test$`C6H6(GT)` - mean(air_quality_test$`C6H6(GT)`))^2)

# Calculate R-squared
r_squared2 <- 1 - (ss_res2 / ss_tot2)

# Print R-squared
print(paste("R-squared: ", r_squared2))
## [1] "R-squared:  0.97440199170449"

An R-squared value of 0.9744 means that approximately 97.44% of the variance in the C6H6(GT) values can be explained by the predictors included in the Partial Least Squares (PLS) model. This value is very high, indicating an excellent fit to the data.


Comparing the Models

Model Performance Metrics

Root Mean Squared Error (RMSE)
  • PCR RMSE: 1.572
  • PLS RMSE: 0.972

The RMSE evaluates the average magnitude of the errors between predicted and actual values. The significantly lower RMSE of the PLS model indicates its superior accuracy in predicting the concentrations of C6H6(GT), highlighting its effectiveness in practical applications where precise predictions are vital.

R-squared
  • PCR R-squared: 0.9331
  • PLS R-squared: 0.9744

R-squared measures the proportion of variance in the dependent variable that is predictable from the independent variables. The higher R-squared of the PLS model shows that it captures a greater extent of the variability in benzene levels, suggesting a better overall fit to the data.

Model Complexity and Component Analysis

Both models show optimal performance with about 5-6 components based on RMSEP and variance explained. This similarity in dimensionality masks significant differences in how each model processes and utilizes these components:

  • PCR: Focuses on capturing the maximum variance within the predictors. It is effective for dimensionality reduction and uncovering the structure in high-dimensional data. However, the main limitation is that the components with the highest variance in predictors might not be the most relevant for predicting the response variable.
  • PLS: Aims to maximize the covariance between the predictors and the response variable. This method not only reduces dimensionality but ensures that the retained components are directly relevant to the outcome of interest. The strength of PLS lies in its ability to identify the components that are most predictive of the response, making it highly suitable for prediction-focused studies.

Strategic Evaluation of Model Choice

PCR: Strengths and Weaknesses
  • Strengths:
    • Excellent for exploratory data analysis and understanding the underlying structure of the data.
    • Effective in situations where understanding the variance structure within the predictors is more important than predicting a specific outcome.
  • Weaknesses:
    • May overlook the relationship between predictors and the response variable.
    • Not ideal for predictive modeling when the goal is to minimize prediction error.
PLS: Strengths and Weaknesses
  • Strengths:
    • Directly targets variance most predictive of the response variable, enhancing predictive accuracy.
    • Highly effective in handling collinear data, making it ideal for complex datasets where predictors are interrelated.
  • Weaknesses:
    • Can be more complex to interpret compared to PCR because the components are optimized for prediction rather than variance explanation.
    • Less intuitive for purely exploratory analysis as it focuses on the prediction.


Conclusions and Recommendations

Our comprehensive analysis clearly demonstrates that PLS generally provides superior predictive performance for our dataset, particularly because it focuses on the covariance between predictors and the response variable. This makes PLS exceptionally valuable for our goal of accurate and reliable air quality prediction, crucial for formulating effective public health responses.

When to Prefer PLS

When to Prefer PCR

In summary, for tasks where forecasting is critical, such as policy adjustments for air quality management, PLS is the more suitable choice. However, PCR remains advantageous for exploratory tasks or when the primary goal is to understand the internal variance structure without a direct link to a specific response.

This project has demonstrated the practical application of PCR and PLS in handling high-dimensional, multicollinear datasets for environmental monitoring. By providing a clear comparison of their strengths and weaknesses, it equips data scientists and researchers with the knowledge to make informed decisions based on their specific analytical goals. The insights gained from this analysis underscore the importance of selecting the right modeling technique to achieve accurate, reliable, and interpretable results in predictive modeling.