In the realm of data science and statistical modeling, handling high-dimensional data with multicollinearity among predictor variables is a common challenge. Traditional regression techniques often struggle in such scenarios, leading to unstable estimates and poor predictive performance. To address these issues, dimensionality reduction techniques such as Principal Component Regression (PCR) and Partial Least Squares Regression (PLS) are employed. This project aims to compare and contrast PCR and PLS, highlighting their strengths and weaknesses through a practical application on the Air Quality Dataset.
Principal Component Regression is a technique that combines Principal Component Analysis (PCA) and linear regression to address multicollinearity and reduce dimensionality in high-dimensional datasets. The key steps in PCR are:
Partial Least Squares Regression is a technique that, unlike PCR, considers both the predictor variables and the response variable during the dimensionality reduction process. The key steps in PLS are:
This project will provide valuable insights into the practical
application of dimensionality reduction techniques in regression
modeling. By comparing PCR and PLS, data scientists and researchers can
make informed decisions about which method to employ based on the
characteristics of their data and the specific requirements of their
analysis. Moreover, the findings from this project can contribute to the
broader understanding of how to effectively handle multicollinearity and
high-dimensional data in various fields, including environmental
science, finance, and bioinformatics.
The Air Quality Dataset, sourced from the UCI Machine Learning Repository, contains measurements of various air pollutants and meteorological variables collected at an Italian monitoring station. The dataset comprises daily measurements of pollutants such as ozone, nitrogen dioxide, and carbon monoxide, as well as meteorological variables like temperature, wind speed, and humidity. With over 9,000 instances, this dataset provides a rich source of data for analysis. The target variable for this project will be one of the pollutant concentrations, allowing us to explore how well PCR and PLS can predict air quality based on the available predictors.
Target Variable: C6H6 (benzene) concentration, denoted as C6H6(GT), has been chosen as the target variable due to its significant implications for public health and environmental monitoring. Benzene is a major pollutant, classified as a carcinogen, and its presence in the atmosphere is closely linked to a variety of health risks, including increased cancer rates. Additionally, benzene levels serve as an indicator of vehicular and industrial emissions, which are primary concerns in urban pollution management.
We will start by loading the necessary libraries for our analysis.
# Loading the dplyr package for data manipulation
library(dplyr)
library(tidyr)
# Loading the ggplot2 and GGally packages for data visualization
library(ggplot2)
library(GGally)
# Loading the pls package for Partial Least Squares Regression
library(pls)
# Loading the caret package for model training and evaluation, including Principal Component Regression
library(caret)
# Loading the readr package for reading the dataset
library(readr)
# Loading the dataset
air_quality <- read_delim("C:/Users/david/Downloads/AirQualityUCI.csv", delim = ";")
# Printing the first few rows of the dataset to inspect it
head(air_quality)
# Removing the columns with parsing issues
air_quality <- air_quality %>% select(-c(...16, ...17))
# Convert character columns to numeric, except for Date and Time
cols_to_convert <- setdiff(names(air_quality), c("Date", "Time"))
# Use mutate to convert columns to numeric
air_quality <- air_quality %>%
mutate(across(all_of(cols_to_convert), ~ as.numeric(gsub(",", ".", .))))
# Correct the RH column by dividing by 10
air_quality$RH <- air_quality$RH / 10
# Convert Date to Date type
air_quality$Date <- as.Date(air_quality$Date, format = "%d/%m/%Y")
# Load 'hms' library to convert Time to hms
library(hms)
# Replace dots with colons in the Time column
air_quality$Time <- gsub("\\.", ":", air_quality$Time)
# Convert Time to hms type
air_quality$Time <- as_hms(air_quality$Time)
# Check the first few rows to confirm the changes
head(air_quality)
# Check for missing values
colSums(is.na(air_quality))
## Date Time CO(GT) PT08.S1(CO) NMHC(GT)
## 114 114 114 114 114
## C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT)
## 114 114 114 114 114
## PT08.S4(NO2) PT08.S5(O3) T RH AH
## 114 114 114 114 114
# Replacing invalid values (-200) with NA
air_quality[air_quality == -200] <- NA
Lets handle the missing values by replacing them with the mean of each column:
# Handle missing values by replacing them with the mean of each column
air_quality <- air_quality %>%
mutate(across(all_of(cols_to_convert), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))
# Check for missing values
colSums(is.na(air_quality))
## Date Time CO(GT) PT08.S1(CO) NMHC(GT)
## 114 114 0 0 0
## C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT)
## 0 0 0 0 0
## PT08.S4(NO2) PT08.S5(O3) T RH AH
## 0 0 0 0 0
The missing values in the Date and Time columns remain, but all the other columns have been successfully cleaned. Since Date and Time are not critical for the regression models we are planning to build (PCR and PLS), we can drop these rows with missing Date and Time values.
# Remove rows with missing values in Date and Time columns
air_quality <- air_quality %>%
filter(!is.na(Date) & !is.na(Time))
# Summary of the cleaned dataset
summary(air_quality)
## Date Time CO(GT) PT08.S1(CO)
## Min. :2004-03-10 Length:9357 Min. : 0.100 Min. : 647
## 1st Qu.:2004-06-16 Class1:hms 1st Qu.: 1.200 1st Qu.: 941
## Median :2004-09-21 Class2:difftime Median : 2.153 Median :1075
## Mean :2004-09-21 Mode :numeric Mean : 2.153 Mean :1100
## 3rd Qu.:2004-12-28 3rd Qu.: 2.600 3rd Qu.:1221
## Max. :2005-04-04 Max. :11.900 Max. :2040
## NMHC(GT) C6H6(GT) PT08.S2(NMHC) NOx(GT)
## Min. : 7.0 Min. : 0.10 Min. : 383.0 Min. : 2.0
## 1st Qu.: 218.8 1st Qu.: 4.60 1st Qu.: 743.0 1st Qu.: 112.0
## Median : 218.8 Median : 8.60 Median : 923.0 Median : 229.0
## Mean : 218.8 Mean :10.08 Mean : 939.2 Mean : 246.9
## 3rd Qu.: 218.8 3rd Qu.:13.60 3rd Qu.:1105.0 3rd Qu.: 284.0
## Max. :1189.0 Max. :63.70 Max. :2214.0 Max. :1479.0
## PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T
## Min. : 322.0 Min. : 2.0 Min. : 551 Min. : 221 Min. :-1.90
## 1st Qu.: 666.0 1st Qu.: 86.0 1st Qu.:1242 1st Qu.: 742 1st Qu.:12.00
## Median : 818.0 Median :113.1 Median :1456 Median : 983 Median :18.30
## Mean : 835.5 Mean :113.1 Mean :1456 Mean :1023 Mean :18.32
## 3rd Qu.: 960.0 3rd Qu.:133.0 3rd Qu.:1662 3rd Qu.:1255 3rd Qu.:24.10
## Max. :2683.0 Max. :340.0 Max. :2775 Max. :2523 Max. :44.60
## RH AH
## Min. :-20.00 Min. :0.1847
## 1st Qu.: 34.10 1st Qu.:0.7461
## Median : 48.60 Median :1.0154
## Mean : 46.53 Mean :1.0255
## 3rd Qu.: 61.90 3rd Qu.:1.2962
## Max. : 88.70 Max. :2.2310
# Checking the structure of the cleaned dataset
str(air_quality)
## tibble [9,357 × 15] (S3: tbl_df/tbl/data.frame)
## $ Date : Date[1:9357], format: "2004-03-10" "2004-03-10" ...
## $ Time : 'hms' num [1:9357] 18:00:00 19:00:00 20:00:00 21:00:00 ...
## ..- attr(*, "units")= chr "secs"
## $ CO(GT) : num [1:9357] 2.6 2 2.2 2.2 1.6 1.2 1.2 1 0.9 0.6 ...
## $ PT08.S1(CO) : num [1:9357] 1360 1292 1402 1376 1272 ...
## $ NMHC(GT) : num [1:9357] 150 112 88 80 51 38 31 31 24 19 ...
## $ C6H6(GT) : num [1:9357] 11.9 9.4 9 9.2 6.5 4.7 3.6 3.3 2.3 1.7 ...
## $ PT08.S2(NMHC): num [1:9357] 1046 955 939 948 836 ...
## $ NOx(GT) : num [1:9357] 166 103 131 172 131 ...
## $ PT08.S3(NOx) : num [1:9357] 1056 1174 1140 1092 1205 ...
## $ NO2(GT) : num [1:9357] 113 92 114 122 116 ...
## $ PT08.S4(NO2) : num [1:9357] 1692 1559 1555 1584 1490 ...
## $ PT08.S5(O3) : num [1:9357] 1268 972 1074 1203 1110 ...
## $ T : num [1:9357] 13.6 13.3 11.9 11 11.2 11.2 11.3 10.7 10.7 10.3 ...
## $ RH : num [1:9357] 48.9 47.7 54 60 59.6 59.2 56.8 60 59.7 60.2 ...
## $ AH : num [1:9357] 0.758 0.726 0.75 0.787 0.789 ...
# Check for missing values
colSums(is.na(air_quality))
## Date Time CO(GT) PT08.S1(CO) NMHC(GT)
## 0 0 0 0 0
## C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT)
## 0 0 0 0 0
## PT08.S4(NO2) PT08.S5(O3) T RH AH
## 0 0 0 0 0
Both Principal Component Regression (PCR) and Partial Least Squares Regression (PLS) are sensitive to outliers, but their sensitivities arise in different ways. In PCR, the sensitivity to outliers stems from the initial Principal Component Analysis (PCA) step, where the method aims to capture the maximum variance in the data. Outliers can significantly affect this variance, distorting the principal components and subsequently leading to poor regression performance. Similarly, PLS involves the construction of latent variables using both predictors and response variables, which means that outliers in either set can bias the results. To mitigate these issues, lets remove the outliers from the data set.
# Identify outliers using the IQR method
Q1 <- quantile(air_quality$'C6H6(GT)', 0.25)
Q3 <- quantile(air_quality$'C6H6(GT)', 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outliers <- which(air_quality$'C6H6(GT)' < lower_bound | air_quality$'C6H6(GT)' > upper_bound)
# Remove outliers from the dataset
air_quality <- air_quality[-outliers, ]
# Checking the structure of the cleaned dataset
str(air_quality)
## tibble [9,071 × 15] (S3: tbl_df/tbl/data.frame)
## $ Date : Date[1:9071], format: "2004-03-10" "2004-03-10" ...
## $ Time : 'hms' num [1:9071] 18:00:00 19:00:00 20:00:00 21:00:00 ...
## ..- attr(*, "units")= chr "secs"
## $ CO(GT) : num [1:9071] 2.6 2 2.2 2.2 1.6 1.2 1.2 1 0.9 0.6 ...
## $ PT08.S1(CO) : num [1:9071] 1360 1292 1402 1376 1272 ...
## $ NMHC(GT) : num [1:9071] 150 112 88 80 51 38 31 31 24 19 ...
## $ C6H6(GT) : num [1:9071] 11.9 9.4 9 9.2 6.5 4.7 3.6 3.3 2.3 1.7 ...
## $ PT08.S2(NMHC): num [1:9071] 1046 955 939 948 836 ...
## $ NOx(GT) : num [1:9071] 166 103 131 172 131 ...
## $ PT08.S3(NOx) : num [1:9071] 1056 1174 1140 1092 1205 ...
## $ NO2(GT) : num [1:9071] 113 92 114 122 116 ...
## $ PT08.S4(NO2) : num [1:9071] 1692 1559 1555 1584 1490 ...
## $ PT08.S5(O3) : num [1:9071] 1268 972 1074 1203 1110 ...
## $ T : num [1:9071] 13.6 13.3 11.9 11 11.2 11.2 11.3 10.7 10.7 10.3 ...
## $ RH : num [1:9071] 48.9 47.7 54 60 59.6 59.2 56.8 60 59.7 60.2 ...
## $ AH : num [1:9071] 0.758 0.726 0.75 0.787 0.789 ...
# Summary of the dataset
summary(air_quality)
## Date Time CO(GT) PT08.S1(CO)
## Min. :2004-03-10 Length:9071 Min. :0.100 Min. : 647
## 1st Qu.:2004-06-15 Class1:hms 1st Qu.:1.200 1st Qu.: 938
## Median :2004-09-19 Class2:difftime Median :2.100 Median :1066
## Mean :2004-09-21 Mode :numeric Mean :2.046 Mean :1084
## 3rd Qu.:2004-12-29 3rd Qu.:2.500 3rd Qu.:1202
## Max. :2005-04-04 Max. :9.300 Max. :1898
## NMHC(GT) C6H6(GT) PT08.S2(NMHC) NOx(GT)
## Min. : 7.0 Min. : 0.100 Min. : 383.0 Min. : 2.0
## 1st Qu.:218.8 1st Qu.: 4.500 1st Qu.: 736.0 1st Qu.: 110.0
## Median :218.8 Median : 8.300 Median : 912.0 Median : 220.0
## Mean :217.2 Mean : 9.358 Mean : 917.8 Mean : 235.5
## 3rd Qu.:218.8 3rd Qu.:13.000 3rd Qu.:1083.0 3rd Qu.: 270.0
## Max. :872.0 Max. :27.100 Max. :1483.0 Max. :1310.0
## PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3)
## Min. : 360.0 Min. : 2.0 Min. : 551 Min. : 221.0
## 1st Qu.: 681.0 1st Qu.: 84.0 1st Qu.:1231 1st Qu.: 734.0
## Median : 826.0 Median :113.1 Median :1456 Median : 968.0
## Mean : 847.2 Mean :111.6 Mean :1433 Mean : 995.5
## 3rd Qu.: 968.0 3rd Qu.:130.0 3rd Qu.:1640 3rd Qu.:1217.0
## Max. :2683.0 Max. :340.0 Max. :2404 Max. :2519.0
## T RH AH
## Min. :-1.90 Min. :-20.00 Min. :0.1847
## 1st Qu.:12.00 1st Qu.: 33.80 1st Qu.:0.7423
## Median :18.20 Median : 48.20 Median :1.0115
## Mean :18.29 Mean : 46.31 Mean :1.0207
## 3rd Qu.:24.10 3rd Qu.: 61.80 3rd Qu.:1.2874
## Max. :44.60 Max. : 88.70 Max. :2.2310
Brief summary of the key statistics:
hms
format.
Plotting the distributions:
# Converting data to long format and scaling the values
air_quality_long <- air_quality %>%
pivot_longer(cols = -c(Date, Time), names_to = "variable", values_to = "value") %>%
mutate(value = scale(value)) # Scale the data
# 1. Histograms of Scaled Data
ggplot(air_quality_long, aes(x = value)) +
geom_histogram(bins = 30, fill = "blue", color = "black") +
facet_wrap(~ variable, scales = "free_x") +
theme_minimal() +
labs(title = "Histograms of Scaled Air Quality Variables",
x = "Scaled Value",
y = "Frequency")
# 2. Box Plots of Scaled Data
ggplot(air_quality_long, aes(x = variable, y = value)) +
geom_boxplot(fill = "blue", color = "black") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Box Plots of Scaled Air Quality Variables",
x = "Variable",
y = "Scaled Value")
Checking for correlations:
# Correlation matrix
cor_matrix <- cor(air_quality %>% select(-Date, -Time))
print(cor_matrix)
## CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC)
## CO(GT) 1.00000000 0.74107980 0.28802081 0.7854718 0.7705184
## PT08.S1(CO) 0.74107980 1.00000000 0.25690917 0.8691725 0.8730848
## NMHC(GT) 0.28802081 0.25690917 1.00000000 0.2900620 0.2811599
## C6H6(GT) 0.78547184 0.86917245 0.29006196 1.0000000 0.9868651
## PT08.S2(NMHC) 0.77051842 0.87308482 0.28115989 0.9868651 1.0000000
## NOx(GT) 0.73012163 0.57549353 0.11028635 0.5704788 0.5563549
## PT08.S3(NOx) -0.59893390 -0.76177400 -0.27650361 -0.7435827 -0.7917584
## NO2(GT) 0.64805859 0.54970920 0.16286911 0.5368447 0.5507541
## PT08.S4(NO2) 0.49614634 0.62777660 0.20643591 0.7305175 0.7403152
## PT08.S5(O3) 0.72914332 0.88369349 0.23142528 0.8527851 0.8597888
## T 0.02837898 0.04990897 0.07773880 0.2358569 0.2679064
## RH -0.04862459 0.07059802 -0.05011934 -0.1000756 -0.1111914
## AH 0.02460750 0.11919439 0.04785249 0.1591985 0.1775017
## NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3)
## CO(GT) 0.7301216 -0.59893390 0.64805859 0.49614634 0.72914332
## PT08.S1(CO) 0.5754935 -0.76177400 0.54970920 0.62777660 0.88369349
## NMHC(GT) 0.1102863 -0.27650361 0.16286911 0.20643591 0.23142528
## C6H6(GT) 0.5704788 -0.74358273 0.53684467 0.73051750 0.85278509
## PT08.S2(NMHC) 0.5563549 -0.79175843 0.55075413 0.74031517 0.85978883
## NOx(GT) 1.0000000 -0.54172655 0.76305566 0.11515636 0.65763272
## PT08.S3(NOx) -0.5417266 1.00000000 -0.55632548 -0.49780942 -0.78521645
## NO2(GT) 0.7630557 -0.55632548 1.00000000 0.08865727 0.61462286
## PT08.S4(NO2) 0.1151564 -0.49780942 0.08865727 1.00000000 0.52943990
## PT08.S5(O3) 0.6576327 -0.78521645 0.61462286 0.52943990 1.00000000
## T -0.2484538 -0.14887505 -0.16720962 0.60371512 -0.03192633
## RH 0.0666332 -0.02886568 -0.13114017 -0.05263753 0.08362246
## AH -0.1540311 -0.22352921 -0.30481516 0.65619468 0.05416537
## T RH AH
## CO(GT) 0.02837898 -0.04862459 0.02460750
## PT08.S1(CO) 0.04990897 0.07059802 0.11919439
## NMHC(GT) 0.07773880 -0.05011934 0.04785249
## C6H6(GT) 0.23585690 -0.10007556 0.15919849
## PT08.S2(NMHC) 0.26790636 -0.11119136 0.17750170
## NOx(GT) -0.24845384 0.06663320 -0.15403109
## PT08.S3(NOx) -0.14887505 -0.02886568 -0.22352921
## NO2(GT) -0.16720962 -0.13114017 -0.30481516
## PT08.S4(NO2) 0.60371512 -0.05263753 0.65619468
## PT08.S5(O3) -0.03192633 0.08362246 0.05416537
## T 1.00000000 -0.45484120 0.65625807
## RH -0.45484120 1.00000000 0.12715370
## AH 0.65625807 0.12715370 1.00000000
Based on the correlation matrix provided, it appears that there is a significant level of multicollinearity among the predictor variables. Key observations:
PT08.S1(CO)
and PT08.S5(O3)
have a correlation
of 0.884.
C6H6(GT)
and PT08.S2(NMHC)
have a correlation
of 0.987.
C6H6(GT)
and PT08.S1(CO)
have a correlation of
0.869.
PT08.S2(NMHC)
and PT08.S1(CO)
have a
correlation of 0.873.
C6H6(GT)
and PT08.S5(O3)
have a correlation of
0.853.
PT08.S2(NMHC)
and PT08.S5(O3)
have a
correlation of 0.860.
PT08.S4(NO2)
and T
have a correlation of
0.604.
AH
and T
have a correlation of 0.656.
PT08.S3(NOx)
and several variables such as
PT08.S2(NMHC)
, PT08.S1(CO)
, and
PT08.S5(O3)
, indicating a significant negative
relationship.
These high correlation values indicate multicollinearity, where predictor variables are highly correlated with each other. Multicollinearity can be problematic in regression models because it can inflate the variance of the coefficient estimates and make the model coefficients unstable and difficult to interpret. Both PCR and PLS are designed to handle multicollinearity by reducing the dimensionality of the data and summarizing the information from correlated predictors into a smaller set of uncorrelated components
# Removing Date and Time columns as they will not be used for modeling
air_quality <- air_quality %>% select(-Date, -Time)
# Splitting the data into training and testing sets
set.seed(123) # For reproducibility
trainIndex <- createDataPartition(air_quality$`C6H6(GT)`, p = .75, list = FALSE, times = 1) # Create an index for 75% training data
air_quality_train <- air_quality[trainIndex, ]
air_quality_test <- air_quality[-trainIndex, ]
When fitting the model we are going to standardize the data. This is crucial because it ensures that all predictor variables contribute equally to the model, irrespective of their original scales or units. Without scaling, variables with larger magnitudes could dominate the model, potentially skewing the results and leading to misleading interpretations. By standardizing variables (giving them zero mean and unit variance), scaling removes these discrepancies, allowing the model to focus on the underlying patterns and relationships in the data. This leads to more reliable and interpretable models, particularly important when dealing with variables of different types and scales.
# Fitting the PCR model on the training data
pcr_model <- pcr(`C6H6(GT)` ~ ., data = air_quality_train, scale = TRUE, validation = "CV") # Fit PCR model with cross-validation
The argument validation = “CV” specifies that cross-validation (CV) should be used to validate the model. Cross-validation is a robust method for assessing the predictive performance of a model. It involves partitioning the data into subsets, training the model on some subsets (training sets), and validating it on the remaining subsets (validation sets). This process is repeated multiple times to ensure the model’s performance is consistent and not dependent on a particular partitioning of the data.
By using cross-validation, the model is less likely to overfit to the training data. Overfitting occurs when the model captures noise and specific patterns in the training data that do not generalize to new, unseen data. Cross-validation helps in detecting and mitigating overfitting by testing the model on different subsets of the data.
For PCR and PLS, cross-validation helps in determining the optimal number of components to retain.
summary(pcr_model)
## Data: X dimension: 6805 12
## Y dimension: 6805 1
## Fit method: svdpc
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 6.062 2.213 2.081 2.065 2.054 1.752 1.601
## adjCV 6.062 2.213 2.081 2.065 2.054 1.752 1.601
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## CV 1.602 1.603 1.333 1.287 1.157 0.8308
## adjCV 1.601 1.603 1.333 1.287 1.157 0.8307
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 46.81 67.53 78.23 85.85 90.19 93.72 95.56
## C6H6(GT) 86.69 88.23 88.42 88.54 91.67 93.05 93.05
## 8 comps 9 comps 10 comps 11 comps 12 comps
## X 96.92 98.04 98.92 99.73 100.00
## C6H6(GT) 93.05 95.19 95.52 96.38 98.13
# Plotting the RMSEP (Root Mean Squared Error of Prediction) to find the optimal number of components
validationplot(pcr_model, val.type = "RMSEP", main = "RMSEP (PCR Model)")
Determining the Optimal Number of Principal Components:
To determine the optimal number of components, we need to balance between the model complexity (number of components) and the RMSEP (Root Mean Squared Error of Prediction). We are looking for the point where adding more components doesn’t significantly improve the prediction accuracy.
Key Points to Consider from the Summary:
RMSEP Values:Given these points, let’s evaluate:
RMSEP Values:Based on both RMSEP and variance explained, it seems reasonable to choose around 5-6 components. There is a significant improvement in RMSEP up to 5 components and variance explained in the response variable increases notably up to this point. Beyond 5-6 components, the improvements are minimal.
Therefore, the optimal number of components is 5 or 6. We might choose 5 for a simpler model or 6 if we prefer slightly better performance at the cost of adding one more component. Let’s choose 6 components.
# Predict using the model and evaluate on the test set with optimal number of components
optimal_number_of_components <- 6 # Optimal number of components based on the RMSEP plot and summary
predictions <- predict(pcr_model, ncomp = optimal_number_of_components, newdata = air_quality_test)
# Compare predictions with actual values
plot(air_quality_test$`C6H6(GT)`, predictions, xlab = "Actual", ylab = "Predicted", main = "Predicted vs Actual C6H6(GT) Values (PCR Model)") # Plot actual vs predicted values
abline(0, 1) # Add a diagonal line for reference
The square root of the Mean Squared Error provides an indication of the average magnitude of the errors in the same units as the response variable. Lower RMSE values indicate better model performance.
R-squared represents the proportion of the variance in the dependent variable that is predictable from the independent variables. Values closer to 1 indicate better model performance.
# Calculate and print the Root Mean Squared Error (RMSE)
rmse <- sqrt(mean((air_quality_test$`C6H6(GT)` - predictions)^2)) # Calculate RMSE between actual and predicted values
print(paste("RMSE: ", rmse))
## [1] "RMSE: 1.57211808907846"
# Calculate the sum of squares of residuals
ss_res <- sum((air_quality_test$`C6H6(GT)` - predictions)^2)
# Calculate the total sum of squares
ss_tot <- sum((air_quality_test$`C6H6(GT)` - mean(air_quality_test$`C6H6(GT)`))^2)
# Calculate R-squared
r_squared <- 1 - (ss_res / ss_tot)
# Print R-squared
print(paste("R-squared: ", r_squared))
## [1] "R-squared: 0.933087937624178"
An R-squared value of 0.9331 means that approximately 93.31% of the variance in the dependent variable (C6H6 concentration) can be explained by the independent variables in the model. This indicates a very strong relationship between the predictors and the response variable. This value suggests that the model fits the data very well and that the predictors used in the model are good at explaining the variability in the response variable.
# Fitting the PLS model on the training data
pls_model <- plsr(`C6H6(GT)` ~ ., data = air_quality_train, scale = TRUE, validation = "CV")
summary(pls_model)
## Data: X dimension: 6805 12
## Y dimension: 6805 1
## Fit method: kernelpls
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 6.062 2.032 1.711 1.359 1.225 1.014 0.9673
## adjCV 6.062 2.031 1.711 1.359 1.225 1.013 0.9672
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## CV 0.9205 0.8420 0.8305 0.8303 0.8303 0.8303
## adjCV 0.9204 0.8419 0.8304 0.8302 0.8302 0.8302
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 46.70 63.02 71.89 79.98 83.87 90.29 93.89
## C6H6(GT) 88.79 92.05 94.99 95.94 97.22 97.46 97.70
## 8 comps 9 comps 10 comps 11 comps 12 comps
## X 95.16 95.95 97.67 98.64 100.00
## C6H6(GT) 98.08 98.13 98.13 98.13 98.13
# Plotting the RMSEP (Root Mean Squared Error of Prediction) to find the optimal number of components
validationplot(pls_model, val.type = "RMSEP", main = "RMSEP (PLS Model)")
Determining the Optimal Number of Components:
To determine the optimal number of components, we need to balance between the model complexity (number of components) and the RMSEP (Root Mean Squared Error of Prediction). We are looking for the point where adding more components doesn’t significantly improve the prediction accuracy.
Key Points to Consider from the Summary:
RMSEP Values:Given these points, let’s evaluate:
RMSEP Values:Based on both RMSEP and variance explained, it seems reasonable to choose around 5-6 components. There is a significant improvement in RMSEP up to 5 components and variance explained in the response variable increases notably up to this point. Beyond 5-6 components, the improvements are minimal.
Therefore, the optimal number of components is 5 or 6. We might choose 5 for a simpler model or 6 if we prefer slightly better performance at the cost of adding one more component. Let’s choose 6 components.
# Predict using the model and evaluate on the test set with optimal number of components
optimal_number_of_components <- 6 # Optimal number of components based on the RMSEP plot and summary
predictions2 <- predict(pls_model, ncomp = optimal_number_of_components, newdata = air_quality_test)
# Compare predictions with actual values
plot(air_quality_test$`C6H6(GT)`, predictions2, xlab = "Actual", ylab = "Predicted", main = "Predicted vs Actual C6H6(GT) Values (PLS Model)") # Plot actual vs predicted values
abline(0, 1) # Add a diagonal line for reference
# Calculate and print the Root Mean Squared Error (RMSE)
rmse2 <- sqrt(mean((air_quality_test$`C6H6(GT)` - predictions2)^2)) # Calculate RMSE between actual and predicted values
print(paste("RMSE: ", rmse2))
## [1] "RMSE: 0.972380060198749"
# Calculate the sum of squares of residuals
ss_res2 <- sum((air_quality_test$`C6H6(GT)` - predictions2)^2)
# Calculate the total sum of squares
ss_tot2 <- sum((air_quality_test$`C6H6(GT)` - mean(air_quality_test$`C6H6(GT)`))^2)
# Calculate R-squared
r_squared2 <- 1 - (ss_res2 / ss_tot2)
# Print R-squared
print(paste("R-squared: ", r_squared2))
## [1] "R-squared: 0.97440199170449"
An R-squared value of 0.9744 means that approximately 97.44% of the variance in the C6H6(GT) values can be explained by the predictors included in the Partial Least Squares (PLS) model. This value is very high, indicating an excellent fit to the data.
The RMSE evaluates the average magnitude of the errors between predicted and actual values. The significantly lower RMSE of the PLS model indicates its superior accuracy in predicting the concentrations of C6H6(GT), highlighting its effectiveness in practical applications where precise predictions are vital.
R-squared measures the proportion of variance in the dependent variable that is predictable from the independent variables. The higher R-squared of the PLS model shows that it captures a greater extent of the variability in benzene levels, suggesting a better overall fit to the data.
Both models show optimal performance with about 5-6 components based on RMSEP and variance explained. This similarity in dimensionality masks significant differences in how each model processes and utilizes these components:
Our comprehensive analysis clearly demonstrates that PLS generally provides superior predictive performance for our dataset, particularly because it focuses on the covariance between predictors and the response variable. This makes PLS exceptionally valuable for our goal of accurate and reliable air quality prediction, crucial for formulating effective public health responses.
When to Prefer PLS
When to Prefer PCR
In summary, for tasks where forecasting is critical, such as policy adjustments for air quality management, PLS is the more suitable choice. However, PCR remains advantageous for exploratory tasks or when the primary goal is to understand the internal variance structure without a direct link to a specific response.
This project has demonstrated the practical application of PCR and PLS in handling high-dimensional, multicollinear datasets for environmental monitoring. By providing a clear comparison of their strengths and weaknesses, it equips data scientists and researchers with the knowledge to make informed decisions based on their specific analytical goals. The insights gained from this analysis underscore the importance of selecting the right modeling technique to achieve accurate, reliable, and interpretable results in predictive modeling.