New regulations require ABC Beverage to provide a report outlining our manufacturing process, together with a predictive model of PH and an explanation of its predictive factors.
Our data science team is tasked with developing the predictive model from the provided historical data and using that model to predict PH on the test data.
# Checking out packages
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
library(corrplot)
library(reshape2) # for melt
library(ggplot2) # for ggplot
library(dplyr)
library(knitr)
library(magrittr)
library(tidyverse) # for code with missing values
library(caret) # for models
library(RANN) # for better kNN imputation
library(gridExtra) # for Outliers
library(car) # VIF
library(earth) # MARS model
library(kernlab) # SVM model
library(xgboost) # XGBoost model
library(glmnet) # Appendix Ridge Regression
library(factoextra) # For advanced plotting
library(pls) # Appendix 2 PLS
Here we import our train and test data, `student_train` and `student_eval`, evaluate them for missing data, and take additional exploratory steps.
Here we can preview the data structure:
# Read data
student_train = read.csv('https://raw.githubusercontent.com/deepasharma06/Data-624/refs/heads/main/StudentData_training.csv')
student_eval = read.csv('https://raw.githubusercontent.com/deepasharma06/Data-624/refs/heads/main/StudentEvaluation_test.csv')
# Show data
head(student_train) %>% kable()
Brand.Code | Carb.Volume | Fill.Ounces | PC.Volume | Carb.Pressure | Carb.Temp | PSC | PSC.Fill | PSC.CO2 | Mnf.Flow | Carb.Pressure1 | Fill.Pressure | Hyd.Pressure1 | Hyd.Pressure2 | Hyd.Pressure3 | Hyd.Pressure4 | Filler.Level | Filler.Speed | Temperature | Usage.cont | Carb.Flow | Density | MFR | Balling | Pressure.Vacuum | PH | Oxygen.Filler | Bowl.Setpoint | Pressure.Setpoint | Air.Pressurer | Alch.Rel | Carb.Rel | Balling.Lvl |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
B | 5.340000 | 23.96667 | 0.2633333 | 68.2 | 141.2 | 0.104 | 0.26 | 0.04 | -100 | 118.8 | 46.0 | 0 | NA | NA | 118 | 121.2 | 4002 | 66.0 | 16.18 | 2932 | 0.88 | 725.0 | 1.398 | -4.0 | 8.36 | 0.022 | 120 | 46.4 | 142.6 | 6.58 | 5.32 | 1.48 |
A | 5.426667 | 24.00667 | 0.2386667 | 68.4 | 139.6 | 0.124 | 0.22 | 0.04 | -100 | 121.6 | 46.0 | 0 | NA | NA | 106 | 118.6 | 3986 | 67.6 | 19.90 | 3144 | 0.92 | 726.8 | 1.498 | -4.0 | 8.26 | 0.026 | 120 | 46.8 | 143.0 | 6.56 | 5.30 | 1.56 |
B | 5.286667 | 24.06000 | 0.2633333 | 70.8 | 144.8 | 0.090 | 0.34 | 0.16 | -100 | 120.2 | 46.0 | 0 | NA | NA | 82 | 120.0 | 4020 | 67.0 | 17.76 | 2914 | 1.58 | 735.0 | 3.142 | -3.8 | 8.94 | 0.024 | 120 | 46.6 | 142.0 | 7.66 | 5.84 | 3.28 |
A | 5.440000 | 24.00667 | 0.2933333 | 63.0 | 132.6 | NA | 0.42 | 0.04 | -100 | 115.2 | 46.4 | 0 | 0 | 0 | 92 | 117.8 | 4012 | 65.6 | 17.42 | 3062 | 1.54 | 730.6 | 3.042 | -4.4 | 8.24 | 0.030 | 120 | 46.0 | 146.2 | 7.14 | 5.42 | 3.04 |
A | 5.486667 | 24.31333 | 0.1113333 | 67.2 | 136.8 | 0.026 | 0.16 | 0.12 | -100 | 118.4 | 45.8 | 0 | 0 | 0 | 92 | 118.6 | 4010 | 65.6 | 17.68 | 3054 | 1.54 | 722.8 | 3.042 | -4.4 | 8.26 | 0.030 | 120 | 46.0 | 146.2 | 7.14 | 5.44 | 3.04 |
A | 5.380000 | 23.92667 | 0.2693333 | 66.6 | 138.4 | 0.090 | 0.24 | 0.04 | -100 | 119.6 | 45.6 | 0 | 0 | 0 | 116 | 120.2 | 4014 | 66.2 | 23.82 | 2948 | 1.52 | 738.8 | 2.992 | -4.4 | 8.32 | 0.024 | 120 | 46.0 | 146.6 | 7.16 | 5.44 | 3.02 |
We sought a general understanding of bottling manufacturing and learned a few things that potentially apply to our data; what follows are our assumptions and educated guesses.
This data describes a continuous bottling process, like a conveyor belt, so `Mnf.Flow` would be the percentage of intended speed the process is running at.
`Fill.Ounces` is how full the bottles are being filled, with a target of 24 oz.
`Alch.Rel` sounds like it could be a relative measure of alcohol. We checked, and Modelo produces 24-ounce cans of spiked sparkling water, so perhaps this is a similar product.
`Brand.Code` could be in label only, but if it reflects different formulas and flavors, with specific ratios of acids (citric, phosphoric, malic) creating each brand's flavor profile, then it would definitely affect PH.
`PH` is the pH of our bottled beverage. Curiously, we would expect carbonated beverages to be acidic, because carbon dioxide in water makes carbonic acid; however, all of the PH values are in the 8 range, which is basic (7 is neutral and below 7 is acidic).
`Mnf.Flow` seems to be the speed of the conveyor belt or manufacturing flow, represented as a percentage of the ideal speed. `MFR` could be the actual speed.
`Carb.Volume`, `Carb.Pressure`, and `Carb.Temp` could potentially be combined with the formula P*V/T to estimate the amount of carbon dioxide being injected into the liquid.
`Bowl.Setpoint` is consistently 120 and we believe it is how high the bottle is expected to be filled. `Filler.Level` is a variable close to 120, presumably the actual amount filled. Dividing `Bowl.Setpoint` by `Filler.Level` could aggregate the two predictors into a measure of whether too much or too little was added, as sketched below.
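Since the ratio idea is simple to express, here is a minimal sketch of it in R; `FillRatio` is a hypothetical name and not a feature we carry forward in this report:
# Sketch: ratio of setpoint to actual fill level; values above 1 would
# suggest underfilling relative to the setpoint (FillRatio is hypothetical)
student_train %>%
  mutate(FillRatio = Bowl.Setpoint / Filler.Level) %>%
  select(Bowl.Setpoint, Filler.Level, FillRatio) %>%
  head()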
`Filler.Speed` seems to be the speed at which the fill liquid is dispensed, and `Carb.Flow` seems to be the speed at which the carbonation is injected.
`Carb.Pressure1` and `Fill.Pressure` look to be the pressure at which the carbonation is injected (~120) and the fill liquid's pressure (~46). Those values make sense because if the CO2's pressure is not higher than the liquid's, the liquid won't carbonate.
`Pressure.Setpoint` seems to be the setting for `Fill.Pressure` that the manufacturing process is trying to match.
`Balling` refers to the Balling scale, a way to determine how much sugar is in a syrup based on the density of the liquid. `Balling.Lvl` is related.
`Temperature` is in Fahrenheit (~66), but it isn't clear whether that is the ambient temperature and should be aggregated with `Air.Pressurer` (~144) to get a sense of the surrounding pressure, which could influence how much carbonation stays in the bottle.
`PSC.CO2` could be the purity of the carbon dioxide, variations in which would have a big impact on the final PH. `PSC.Fill` could be the purity of the water, and `PSC` a combined purity metric. But this is just a guess.
This leaves a lot of variables without a clear picture of how they fit into the manufacturing process:
- `PC.Volume`
- `Hyd.Pressure1`
- `Hyd.Pressure2`
- `Hyd.Pressure3`
- `Hyd.Pressure4`
- `Usage.cont`
- `Density`
- `Pressure.Vacuum`
- `Oxygen.Filler`
- `Carb.Rel`
In total there are 724 missing values across 442 rows.
Since 17.2% of our rows (442/2571) have missing data, we won't simply drop all rows with missing values.
Here we show columns with the highest number of missing values:
# There are 724 NA values
#sum(is.na(student_train))
# 442 rows have missing data
#student_train %>%
# filter(if_any(everything(), is.na)) %>%
# nrow()
# Display columns by missingness
student_train %>%
summarise(across(everything(), ~ sum(is.na(.)), .names = "missing_{.col}")) %>%
pivot_longer(everything(), names_to = "Column", values_to = "Missing_Count") %>%
arrange(desc(Missing_Count))
## # A tibble: 33 × 2
## Column Missing_Count
## <chr> <int>
## 1 missing_MFR 212
## 2 missing_Filler.Speed 57
## 3 missing_PC.Volume 39
## 4 missing_PSC.CO2 39
## 5 missing_Fill.Ounces 38
## 6 missing_PSC 33
## 7 missing_Carb.Pressure1 32
## 8 missing_Hyd.Pressure4 30
## 9 missing_Carb.Pressure 27
## 10 missing_Carb.Temp 26
## # ℹ 23 more rows
Here we impute the missing values using kNN and then check that the 724 missing values in `student_train` and the 366 missing values in `student_eval` are filled in.
# Produce the model that can impute values using kNN
# (note: caret's knnImpute also centers and scales the numeric predictors)
imputeModel <- preProcess(student_train, method = c("knnImpute"))
# Impute the missing values for the training and test data
student_train <- predict(imputeModel, student_train)
student_eval <- predict(imputeModel, student_eval)
# There are now zero NA values in our train and test data
sum(is.na(student_train))
## [1] 0
#sum(is.na(student_eval))
In our correlation plot we are comparing values in the second-to-last column, `PH`. We note that `Filler.Level` and `Bowl.Setpoint`, which are highly collinear with each other, have some positive correlation with `PH`. Also, a number of variables have negative correlation with `PH`, most notably `Mnf.Flow`, `Hyd.Pressure3`, and `Pressure.Setpoint`.
# Select only numeric columns
numeric_data <- student_train %>% select(where(is.numeric))
# Calculate the correlation matrix ("pairwise.complete.obs" ensures any missing values won't affect the correlation calcs for remaining complete pairs)
correlation_matrix <- cor(numeric_data, use = "pairwise.complete.obs")
# Create the correlation plot
corrplot(correlation_matrix, tl.col = "black", tl.cex = 0.6, order = 'AOE') # color (tl.col) and size (tl.cex) help style the plot.
Here we show the distributions of the variables as histograms. Note, `Brand.Code` is excluded because it is not numeric.
# Melt the data into long format for faceted plotting
mlt.train <- student_train
mlt.train$ID <- rownames(mlt.train) # Assign row names to ID
mlt.train <- melt(mlt.train, id.vars = "ID")
# Convert the value column to numeric (non-numeric Brand.Code entries become NA and are dropped)
mlt.train$value <- as.numeric(mlt.train$value)
# Create histograms of the predictors
ggplot(data = mlt.train, aes(x = value)) +
geom_histogram(binwidth = 6, fill = "skyblue", color = "black", alpha = 0.8) + # Adjust binwidth as needed
facet_wrap(~ variable, scales = "free") +
labs(title = "Distributions of Predictors", x = "Predictors", y = "Frequency") +
theme_minimal(base_size = 9) + # Use a minimal theme for better clarity
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) # Clean up grid lines
We are seeing a lot of outliers in the boxplots, specifically for variables where the majority of their values are zero. It’s possible we will use near-zero variance filtering to resolve these. Below only the first 9 of 32 are shown for display purposes but the pattern is the same.
# Create boxplots for each numeric column
numeric_columns <- names(student_train)[sapply(student_train, is.numeric)]
plot_list <- lapply(numeric_columns[1:9], function(col) {
ggplot(student_train, aes(y = .data[[col]])) + # .data pronoun replaces the deprecated aes_string()
geom_boxplot(outlier.color = "red", fill = "lightblue") +
ggtitle(paste("Boxplot of", col)) +
theme_minimal()
})
# Display all boxplots in a grid
do.call(grid.arrange, c(plot_list, ncol = 3))
Here we use feature engineering and regression techniques to evaluate our feature set.
Here we build the initial linear regression model.
# Setting up the model
model <- lm(PH ~ ., data = student_train)
summary(model)
##
## Call:
## lm(formula = PH ~ ., data = student_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0652 -0.4500 0.0545 0.5017 4.3209
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.258937 0.085135 -3.042 0.00238 **
## Brand.CodeA 0.045707 0.141827 0.322 0.74727
## Brand.CodeB 0.447021 0.078984 5.660 1.69e-08 ***
## Brand.CodeC -0.381388 0.085887 -4.441 9.36e-06 ***
## Brand.CodeD 0.338218 0.161402 2.095 0.03623 *
## Carb.Volume -0.046235 0.042321 -1.092 0.27473
## Fill.Ounces -0.035288 0.016279 -2.168 0.03028 *
## PC.Volume -0.038051 0.018876 -2.016 0.04392 *
## Carb.Pressure -0.015593 0.059234 -0.263 0.79238
## Carb.Temp 0.036313 0.053963 0.673 0.50107
## PSC -0.024380 0.016269 -1.499 0.13410
## PSC.Fill -0.023195 0.015868 -1.462 0.14393
## PSC.CO2 -0.032079 0.015670 -2.047 0.04074 *
## Mnf.Flow -0.492621 0.031570 -15.604 < 2e-16 ***
## Carb.Pressure1 0.194895 0.019094 10.207 < 2e-16 ***
## Fill.Pressure 0.033001 0.022371 1.475 0.14030
## Hyd.Pressure1 -0.004829 0.026303 -0.184 0.85436
## Hyd.Pressure2 -0.100778 0.050088 -2.012 0.04432 *
## Hyd.Pressure3 0.315571 0.053337 5.917 3.73e-09 ***
## Hyd.Pressure4 -0.016382 0.023264 -0.704 0.48138
## Filler.Level -0.093419 0.049450 -1.889 0.05899 .
## Filler.Speed 0.029192 0.033531 0.871 0.38406
## Temperature -0.109919 0.018231 -6.029 1.89e-09 ***
## Usage.cont -0.123601 0.019583 -6.312 3.25e-10 ***
## Carb.Flow 0.071620 0.022218 3.223 0.00128 **
## Density -0.272023 0.060842 -4.471 8.13e-06 ***
## MFR -0.036383 0.024352 -1.494 0.13529
## Balling -0.342804 0.122457 -2.799 0.00516 **
## Pressure.Vacuum -0.053160 0.024621 -2.159 0.03093 *
## Oxygen.Filler -0.070359 0.018451 -3.813 0.00014 ***
## Bowl.Setpoint 0.290053 0.050216 5.776 8.58e-09 ***
## Pressure.Setpoint -0.100544 0.022901 -4.390 1.18e-05 ***
## Air.Pressurer -0.019757 0.016531 -1.195 0.23215
## Alch.Rel 0.192473 0.061961 3.106 0.00192 **
## Carb.Rel 0.037817 0.034485 1.097 0.27292
## Balling.Lvl 0.475401 0.106823 4.450 8.94e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7644 on 2535 degrees of freedom
## Multiple R-squared: 0.4233, Adjusted R-squared: 0.4153
## F-statistic: 53.16 on 35 and 2535 DF, p-value: < 2.2e-16
With the above initial linear regression model we can assess the Variance Inflation Factor (VIF) to detect multicollinearity in our regression model and help us select features.
Note, we see that `Brand.Code` is highly collinear.
# Calculating VIF
vif_values <- vif(model)
vif_values
## GVIF Df GVIF^(1/(2*Df))
## Brand.Code 54.107705 4 1.646863
## Carb.Volume 7.882378 1 2.807557
## Fill.Ounces 1.152505 1 1.073548
## PC.Volume 1.581700 1 1.257657
## Carb.Pressure 15.485791 1 3.935199
## Carb.Temp 12.828711 1 3.581719
## PSC 1.154528 1 1.074490
## PSC.Fill 1.102981 1 1.050229
## PSC.CO2 1.068829 1 1.033842
## Mnf.Flow 4.384391 1 2.093894
## Carb.Pressure1 1.600618 1 1.265155
## Fill.Pressure 2.200066 1 1.483262
## Hyd.Pressure1 3.035406 1 1.742242
## Hyd.Pressure2 11.015223 1 3.318919
## Hyd.Pressure3 12.478420 1 3.532481
## Hyd.Pressure4 2.369556 1 1.539336
## Filler.Level 10.723928 1 3.274741
## Filler.Speed 4.903247 1 2.214328
## Temperature 1.455792 1 1.206562
## Usage.cont 1.686112 1 1.298504
## Carb.Flow 2.170155 1 1.473145
## Density 16.280359 1 4.034893
## MFR 4.069385 1 2.017272
## Balling 65.953990 1 8.121206
## Pressure.Vacuum 2.666455 1 1.632928
## Oxygen.Filler 1.494970 1 1.222690
## Bowl.Setpoint 11.117253 1 3.334254
## Pressure.Setpoint 2.301930 1 1.517211
## Air.Pressurer 1.202058 1 1.096384
## Alch.Rel 16.860616 1 4.106168
## Carb.Rel 5.220788 1 2.284904
## Balling.Lvl 50.184296 1 7.084088
Next we check whether any of the variables have near-zero variance; surprisingly, only one variable, `Hyd.Pressure1`, is identified by default. We changed the parameters `freqCut` and `uniqueCut`, but it required large changes to pick up the other variables, so we kept to removing only `Hyd.Pressure1`.
# Remove problematic predictors from train and test data
student_train_x <- student_train[, -nearZeroVar(student_train)]
student_eval_y <- student_eval[, -13] # column 13 is Hyd.Pressure1
# Display extended list of NZV predictors
nearZeroVar(student_train, freqCut = 60/40, uniqueCut = 40, names = TRUE)
## [1] "Brand.Code" "Fill.Pressure" "Hyd.Pressure1" "Hyd.Pressure2"
## [5] "Hyd.Pressure3" "Carb.Flow" "Bowl.Setpoint"
Three features, `Brand.Code`, `Balling`, and `Balling.Lvl`, have high multicollinearity, which we will address in this section.
Here we replace `Brand.Code` with `BCB`, which indicates whether or not the record is brand code B. When we reran the regression for individual values of `Brand.Code`, we saw that Brand B appears to have a different relationship than the other brands: Brand B has an R-squared of ~0.7 while all the other brands have R-squareds of ~0.3. This suggests we can replace all the brands with a binary variable for whether the record is Brand B or not (not pictured; a sketch of the check appears below).
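Here is a minimal sketch of that per-brand check; the exact form of our original (unpictured) run may have differed, and the smaller brand subsets fit fewer rows, which can inflate R-squared:
# Sketch: fit a separate linear model per brand and compare R-squared
sapply(sort(unique(student_train_x$Brand.Code)), function(bc) {
  sub <- student_train_x %>%
    filter(Brand.Code == bc) %>%
    select(-Brand.Code)
  summary(lm(PH ~ ., data = sub))$r.squared
})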
From this we identify that `Balling` and `Balling.Lvl` represent much of the remaining multicollinearity to be reduced.
# Replacing Brand.Code with BCB
student_train_x1 <- student_train_x |> mutate(BCB = as.numeric(Brand.Code =='B')) |> select(-Brand.Code)
student_eval_y1 <- student_eval_y |> mutate(BCB = as.numeric(Brand.Code =='B')) |> select(-Brand.Code)
# Refit on the data without Hyd.Pressure1 (Brand.Code is still included here; the BCB version is used in later models)
model <- lm(PH ~ ., data = student_train_x)
#summary(model)
# Calculating VIF
vif_values <- vif(model)
vif_values
## GVIF Df GVIF^(1/(2*Df))
## Brand.Code 53.942578 4 1.646234
## Carb.Volume 7.872617 1 2.805818
## Fill.Ounces 1.151544 1 1.073100
## PC.Volume 1.500187 1 1.224821
## Carb.Pressure 15.485772 1 3.935197
## Carb.Temp 12.827035 1 3.581485
## PSC 1.152323 1 1.073463
## PSC.Fill 1.102569 1 1.050033
## PSC.CO2 1.068812 1 1.033834
## Mnf.Flow 4.382860 1 2.093528
## Carb.Pressure1 1.579213 1 1.256668
## Fill.Pressure 2.180153 1 1.476534
## Hyd.Pressure2 8.964442 1 2.994068
## Hyd.Pressure3 12.433205 1 3.526075
## Hyd.Pressure4 2.368986 1 1.539151
## Filler.Level 10.684198 1 3.268669
## Filler.Speed 4.897880 1 2.213116
## Temperature 1.455792 1 1.206562
## Usage.cont 1.684034 1 1.297703
## Carb.Flow 2.168168 1 1.472470
## Density 16.274585 1 4.034177
## MFR 4.048062 1 2.011980
## Balling 65.945084 1 8.120658
## Pressure.Vacuum 2.582057 1 1.606878
## Oxygen.Filler 1.494535 1 1.222512
## Bowl.Setpoint 11.085276 1 3.329456
## Pressure.Setpoint 2.281597 1 1.510495
## Air.Pressurer 1.195910 1 1.093577
## Alch.Rel 16.848866 1 4.104737
## Carb.Rel 5.209240 1 2.282376
## Balling.Lvl 50.180230 1 7.083801
In order to reduce the multicollinearity we are going to create a new predictor by dividing `Balling.Lvl` by `Balling`.
The division was arrived at after trying different operations; division resulted in a better correlation to `PH` than removing either variable individually.
For a guess at domain relevance, `Balling.Lvl` could be the target value for the syrup's specific gravity and `Balling` the actual value, so dividing the two would result in a value similar to `Mnf.Flow`, which is a percentage of the ideal manufacturing conveyor belt speed.
# Create new predictor PT
student_train_x2 <- student_train_x1 |> mutate(PT = Balling.Lvl/Balling) |> select(-c(Balling, Balling.Lvl))
student_eval_y2 <- student_eval_y1 |> mutate(PT = Balling.Lvl/Balling) |> select(-c(Balling, Balling.Lvl))
We also considered creating a `Carbon` column by multiplying `Carb.Pressure` and `Carb.Volume` and dividing by `Carb.Temp`. This follows the Ideal Gas Law and would tell us how much carbon dioxide is being injected. However, there are two values for pressure, `Carb.Pressure` and `Carb.Pressure1`, and we don't have the units to combine them according to the law. From inspection, the values look like they've already been centered and scaled, and so cannot be combined.
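For illustration only, here is a minimal sketch of that combination on the raw (unscaled) data; `Carbon` is a hypothetical column name, and since `Carb.Temp` is not in absolute units the result is a rough proxy at best:
# Sketch: P*V/T proxy on the raw training data (not used in our models)
raw_train <- read.csv('https://raw.githubusercontent.com/deepasharma06/Data-624/refs/heads/main/StudentData_training.csv')
raw_train$Carbon <- with(raw_train, Carb.Pressure * Carb.Volume / Carb.Temp)
summary(raw_train$Carbon)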
Since we've somewhat resolved our multicollinearity issue and have enough data, we're going to try a step-wise reduction model. This reduces our features to just the ones that are relevant for this model. Since we're not relying on this to produce our final model, but rather to evaluate our features, step-wise reduction aligns with our goals and we don't need to try regularization instead.
Note, our model has weak performance with an R-squared of ~0.41, but we've acquainted ourselves with the features and figured out how to resolve the multicollinearity.
# Fit a step-wise linear regression model
model <- lm(PH ~ ., data = student_train_x2)
model <- stats::step(model, trace = 0)
summary(model)
##
## Call:
## lm(formula = PH ~ Fill.Ounces + PC.Volume + Carb.Temp + PSC +
## PSC.Fill + PSC.CO2 + Mnf.Flow + Carb.Pressure1 + Fill.Pressure +
## Hyd.Pressure2 + Hyd.Pressure3 + Filler.Level + Temperature +
## Usage.cont + Carb.Flow + Density + MFR + Oxygen.Filler +
## Bowl.Setpoint + Pressure.Setpoint + Alch.Rel + Carb.Rel +
## BCB, data = student_train_x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0495 -0.4675 0.0790 0.5056 4.1328
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.32033 0.02654 -12.069 < 2e-16 ***
## Fill.Ounces -0.03776 0.01602 -2.357 0.018477 *
## PC.Volume -0.04383 0.01823 -2.405 0.016256 *
## Carb.Temp 0.02880 0.01532 1.880 0.060215 .
## PSC -0.02715 0.01628 -1.668 0.095464 .
## PSC.Fill -0.02523 0.01600 -1.576 0.115077
## PSC.CO2 -0.03140 0.01578 -1.989 0.046772 *
## Mnf.Flow -0.48772 0.03167 -15.399 < 2e-16 ***
## Carb.Pressure1 0.19540 0.01883 10.377 < 2e-16 ***
## Fill.Pressure 0.03893 0.02228 1.748 0.080641 .
## Hyd.Pressure2 -0.11549 0.04318 -2.675 0.007525 **
## Hyd.Pressure3 0.34060 0.05059 6.733 2.05e-11 ***
## Filler.Level -0.09331 0.04930 -1.893 0.058505 .
## Temperature -0.11328 0.01751 -6.471 1.16e-10 ***
## Usage.cont -0.11910 0.01927 -6.181 7.40e-10 ***
## Carb.Flow 0.07222 0.02024 3.568 0.000366 ***
## Density -0.14542 0.04050 -3.590 0.000336 ***
## MFR -0.03824 0.01421 -2.691 0.007173 **
## Oxygen.Filler -0.06682 0.01844 -3.623 0.000297 ***
## Bowl.Setpoint 0.30697 0.04956 6.194 6.82e-10 ***
## Pressure.Setpoint -0.11768 0.02276 -5.170 2.52e-07 ***
## Alch.Rel 0.38026 0.04093 9.291 < 2e-16 ***
## Carb.Rel 0.05187 0.03074 1.687 0.091682 .
## BCB 0.65887 0.04477 14.716 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7724 on 2547 degrees of freedom
## Multiple R-squared: 0.4083, Adjusted R-squared: 0.4029
## F-statistic: 76.4 on 23 and 2547 DF, p-value: < 2.2e-16
Here we produce a Multivariate Adaptive Regression Splines (MARS) model to capture some of the non-linear interactions in our data.
We're using the features selected by the step-wise linear regression model.
# Features used in the previous step-wise linear regression model
student_train_x3 <- student_train_x2 |>
select(PH, Fill.Ounces, PC.Volume, Carb.Temp, PSC,
PSC.Fill, PSC.CO2, Mnf.Flow, Carb.Pressure1, Fill.Pressure,
Hyd.Pressure2, Hyd.Pressure3, Filler.Level, Temperature,
Usage.cont, Carb.Flow, Density, MFR, Oxygen.Filler,
Bowl.Setpoint, Pressure.Setpoint, Alch.Rel, Carb.Rel, BCB)
student_eval_y3 <- student_eval_y2 |>
select(PH, Fill.Ounces, PC.Volume, Carb.Temp, PSC,
PSC.Fill, PSC.CO2, Mnf.Flow, Carb.Pressure1, Fill.Pressure,
Hyd.Pressure2, Hyd.Pressure3, Filler.Level, Temperature,
Usage.cont, Carb.Flow, Density, MFR, Oxygen.Filler,
Bowl.Setpoint, Pressure.Setpoint, Alch.Rel, Carb.Rel, BCB)
Here we run the initial MARS Model with only first order relationships.
# Basic MARS Model
y = student_train_x3$PH
x = student_train_x3 |> select(-PH)
marsFit <- earth(x, y)
summary(marsFit)
## Call: earth(x=x, y=y)
##
## coefficients
## (Intercept) 5.3418732
## BCB 0.5096711
## h(-0.203956-Mnf.Flow) 0.9800807
## h(Carb.Pressure1- -1.55736) 0.1959255
## h(1.53618-Hyd.Pressure3) -0.2270999
## h(Hyd.Pressure3-1.53618) 3.7237258
## h(Temperature- -0.699707) -0.7786390
## h(1.3252-Temperature) -0.4898665
## h(Temperature-1.3252) 0.8280311
## h(Usage.cont-0.365031) -0.5993647
## h(Usage.cont-1.11723) 3.4859750
## h(-2.23001-Carb.Flow) 12.3573997
## h(Carb.Flow- -2.23001) 0.0991432
## h(-1.78438-Density) 2.7642866
## h(Density- -1.78438) 0.5163772
## h(Density- -1.09568) -0.9627448
## h(-3.02915-MFR) 0.1377848
## h(Oxygen.Filler- -0.738421) -1.4493047
## h(3.58372-Oxygen.Filler) -1.3142975
## h(Oxygen.Filler-3.58372) 1.5844172
## h(Bowl.Setpoint- -1.91638) -1.3437708
## h(-1.26292-Bowl.Setpoint) -0.6962323
## h(Bowl.Setpoint- -1.26292) 1.7803895
## h(Bowl.Setpoint-0.0440049) -0.4550215
## h(1.16947-Pressure.Setpoint) 0.0674138
## h(Alch.Rel- -0.509457) 2.5734980
## h(Alch.Rel-0.361355) -5.2860253
## h(0.519685-Alch.Rel) 0.8855864
## h(Alch.Rel-0.519685) 3.8218088
## h(Alch.Rel-1.46966) -1.5882955
## h(-0.441138-Carb.Rel) -0.1615024
## h(Carb.Rel- -0.441138) -0.0935618
##
## Selected 32 of 37 terms, and 14 of 23 predictors
## Termination condition: Reached nk 47
## Importance: Mnf.Flow, Alch.Rel, BCB, Usage.cont, Carb.Pressure1, ...
## Number of terms at each degree of interaction: 1 31 (additive model)
## GCV 0.5514965 RSS 1349.26 GRSq 0.4483104 RSq 0.4746078
And here we show the partial dependence plots of the first order MARS model.
# Plot partial dependencies
plotmo(marsFit)
## plotmo grid: Fill.Ounces PC.Volume Carb.Temp PSC PSC.Fill
## -0.01623723 -0.09531893 -0.0730481 -0.1334244 -0.1304864
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure2 Hyd.Pressure3
## -0.3813757 0.3367148 0.1293786 -0.4790381 0.4661799 0.4345061
## Filler.Level Temperature Usage.cont Carb.Flow Density MFR
## 0.58271 -0.2657983 0.2710059 0.5212328 -0.512943 0.2456178
## Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Alch.Rel Carb.Rel BCB
## -0.2881982 0.697465 -0.792231 -0.6677866 -0.2857598 0
Here we run the MARS model with second-order relationships. This increased the R-squared from 0.475 to 0.489.
y = student_train_x3$PH
x = student_train_x3 |> select(-PH)
y_eval = student_eval_y3$PH
x_eval = student_eval_y3 |> select(-PH)
marsFit2 <- earth(x, y, degree = 2)
summary(marsFit2)
## Call: earth(x=x, y=y, degree=2)
##
## coefficients
## (Intercept) -0.2741453
## BCB 0.2559212
## h(-0.203956-Mnf.Flow) 0.4940847
## h(0.50962-Hyd.Pressure3) -0.4641113
## h(Hyd.Pressure3-0.50962) 0.5696218
## h(1.3252-Temperature) 0.3307546
## h(-0.785587-Oxygen.Filler) * BCB -2.7498870
## h(Oxygen.Filler- -0.785587) * BCB -0.1458478
## h(-1.26292-Bowl.Setpoint) * BCB 0.4057510
## h(Bowl.Setpoint- -1.26292) * BCB 0.3179062
## h(0.800226-PSC) * h(-0.203956-Mnf.Flow) 0.1075006
## h(-0.203956-Mnf.Flow) * h(Carb.Pressure1- -1.51519) 0.2101897
## h(-0.203956-Mnf.Flow) * h(Hyd.Pressure3-0.0213793) -0.7230425
## h(-0.203956-Mnf.Flow) * h(Usage.cont-0.754562) 2.1408975
## h(-0.203956-Mnf.Flow) * h(-0.724848-Density) 1.2703030
## h(-0.203956-Mnf.Flow) * h(Density-1.50016) -4.5082899
## h(Mnf.Flow- -0.203956) * h(Bowl.Setpoint-0.0440049) -0.2170369
## h(Mnf.Flow- -0.203956) * h(0.0440049-Bowl.Setpoint) -0.2066971
## h(Mnf.Flow- -0.203956) * h(Alch.Rel-0.519685) 0.4830633
## h(Mnf.Flow- -0.203956) * h(0.519685-Alch.Rel) -0.3692237
## h(-0.203956-Mnf.Flow) * h(1.26802-Carb.Rel) -0.2740200
## h(Mnf.Flow- -0.203956) * h(Carb.Rel- -1.21803) -0.1579294
## h(Mnf.Flow- -0.203956) * h(-1.21803-Carb.Rel) -0.6507008
## h(0.50962-Hyd.Pressure3) * h(Alch.Rel- -0.509457) 0.1808274
## h(0.50962-Hyd.Pressure3) * h(-0.509457-Alch.Rel) 0.5649332
## h(1.3252-Temperature) * h(Usage.cont-0.358315) -0.5300740
##
## Selected 26 of 35 terms, and 12 of 23 predictors
## Termination condition: RSq changed by less than 0.001 at 35 terms
## Importance: Mnf.Flow, Temperature, Usage.cont, Bowl.Setpoint, Alch.Rel, ...
## Number of terms at each degree of interaction: 1 5 20
## GCV 0.5362348 RSS 1311.399 GRSq 0.4635775 RSq 0.4893508
# Partial dependence plots
plotmo(marsFit2)
## plotmo grid: Fill.Ounces PC.Volume Carb.Temp PSC PSC.Fill
## -0.01623723 -0.09531893 -0.0730481 -0.1334244 -0.1304864
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure2 Hyd.Pressure3
## -0.3813757 0.3367148 0.1293786 -0.4790381 0.4661799 0.4345061
## Filler.Level Temperature Usage.cont Carb.Flow Density MFR
## 0.58271 -0.2657983 0.2710059 0.5212328 -0.512943 0.2456178
## Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Alch.Rel Carb.Rel BCB
## -0.2881982 0.697465 -0.792231 -0.6677866 -0.2857598 0
When we originally completed the above model we got an R-squared of 0.89; however, upon subsequently running it with 10-fold cross-validation, we observed that the initially high result was likely due to overfitting. MARS models, especially with high orders of interaction, are susceptible to overfitting.
We could duplicate that result by replacing the imputation of missing values with zeros, not removing the one feature with near-zero variance, and doing the Balling vs. Balling.Lvl feature combination after the step-wise reduction.
Instead, we're keeping our current model and will take a different tack starting in the next section. A sketch of the cross-validation check follows.
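Here is a minimal sketch of that cross-validation check; fixing `nprune` at the 26 terms selected above is our assumption, not a tuned choice:
# Sketch: re-estimate the second-order MARS fit with 10-fold CV so the
# R-squared reflects held-out data rather than the training fit
set.seed(175175327)
marsCV <- train(x = x, y = y,
                method = "earth",
                tuneGrid = expand.grid(.degree = 2, .nprune = 26),
                trControl = trainControl(method = "cv", number = 10))
marsCV$results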
Here we’re doing a range of models to get a sense of what works.
We’re starting over on the data after the imputation step and removing the one near-zero variance predictor.
Here we split up the data so that `PH` is our `trainy` and the remaining data is our `trainx`. Similarly, for our test data we have `testx`; however, there is no `PH` data for the test set.
# Split up available data into train and test versions of x and y
trainx <- student_train_x |> select(-PH)
trainy <- student_train_x$PH
testx <- student_eval_y |> select(-PH)
We have one character variable, `Brand.Code`, that is categorical with no inherent order:
Blank - 120, A - 293, B - 1239, C - 304, D - 615
Here we set the blanks to brand code `M` for miscellaneous and use one-hot encoding to split the codes into five variables, with a `1` in the column corresponding to that row's brand code.
# Assign blanks as "M"
trainx$Brand.Code[trainx$Brand.Code == ""] <- "M"
testx$Brand.Code[testx$Brand.Code == ""] <- "M"
# Convert Brand.Code to a factor
trainx$Brand.Code <- as.factor(trainx$Brand.Code)
testx$Brand.Code <- as.factor(testx$Brand.Code)
# One-hot encoding
dummy_model <- dummyVars(~ ., data = trainx)
trainx <- predict(dummy_model, trainx)
testx <- predict(dummy_model, testx)
Here we look at the kNN, SVM, MARS, and XGBoost models.
k-Nearest Neighbors results in a weak model with an R-squared of 0.482 when run with 13 nearest neighbors.
# Train the kNN model
knnModel <- train(x = trainx,
y = trainy,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 2571 samples
## 35 predictor
##
## Pre-processing: centered (35), scaled (35)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2571, 2571, 2571, 2571, 2571, 2571, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.7553386 0.4600991 0.5516203
## 7 0.7385944 0.4740297 0.5456211
## 9 0.7296617 0.4825545 0.5435731
## 11 0.7291528 0.4819977 0.5452328
## 13 0.7285882 0.4821892 0.5474974
## 15 0.7307512 0.4789545 0.5504139
## 17 0.7325872 0.4764009 0.5534532
## 19 0.7335725 0.4749351 0.5557323
## 21 0.7354178 0.4725328 0.5579163
## 23 0.7373076 0.4699529 0.5604149
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 13.
Support Vector Machines results in a weak model with an R-squared of 0.586 when C = 8. A relatively high C like this means errors are penalized more heavily, resulting in a more complex decision function.
# Train the SVM model
svmModel <- train(x = trainx,
y = trainy,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmModel
## Support Vector Machines with Radial Basis Function Kernel
##
## 2571 samples
## 35 predictor
##
## Pre-processing: centered (35), scaled (35)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2314, 2315, 2314, 2313, 2313, 2314, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.7163030 0.4959287 0.5314450
## 0.50 0.6962312 0.5208256 0.5124825
## 1.00 0.6785863 0.5430775 0.4968808
## 2.00 0.6655149 0.5589021 0.4862203
## 4.00 0.6521532 0.5763907 0.4774957
## 8.00 0.6451258 0.5864822 0.4732103
## 16.00 0.6497834 0.5845911 0.4770923
## 32.00 0.6675068 0.5704638 0.4910538
## 64.00 0.6978723 0.5456811 0.5151205
## 128.00 0.7378712 0.5134602 0.5439043
## 256.00 0.7800723 0.4855438 0.5742860
## 512.00 0.8204879 0.4610548 0.6044797
## 1024.00 0.8500779 0.4416762 0.6245598
## 2048.00 0.8529043 0.4393422 0.6270423
##
## Tuning parameter 'sigma' was held constant at a value of 0.02016894
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.02016894 and C = 8.
Multivariate Adaptive Regression Splines results in a weak model with an R-squared of 0.522 with nprune = 39 and degree = 6. This means 39 terms with up to six-way interactions were retained. This may suit highly nonlinear and interactive data but significantly risks overfitting.
# Train the MARS model
marsGrid <- expand.grid(.degree = 6, .nprune = 30:40)
set.seed(175175327)
marsModel <- train(x = trainx,
y = trainy,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
marsModel
## Multivariate Adaptive Regression Spline
##
## 2571 samples
## 35 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2314, 2313, 2313, 2315, 2314, 2314, ...
## Resampling results across tuning parameters:
##
## nprune RMSE Rsquared MAE
## 30 0.7056510 0.5085210 0.5221312
## 31 0.7048986 0.5098481 0.5215941
## 32 0.7047143 0.5099580 0.5213002
## 33 0.7000219 0.5158733 0.5199846
## 34 0.6983310 0.5180715 0.5194377
## 35 0.6960146 0.5210178 0.5176798
## 36 0.6983852 0.5183097 0.5189310
## 37 0.6979122 0.5190971 0.5184589
## 38 0.6971574 0.5203998 0.5175004
## 39 0.6958715 0.5222221 0.5162749
## 40 0.6959166 0.5228376 0.5153354
##
## Tuning parameter 'degree' was held constant at a value of 6
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 39 and degree = 6.
eXtreme Gradient Boosting also produced similarly weak results, with an R-squared of 0.453.
library(caret)
# Train XGBoost model using caret
xgb_model <- train(
x = trainx,
y = trainy,
method = "xgbTree",
tuneLength = 1, # Minimal automatic tuning grid
trControl = trainControl(method = "cv", number = 10) # Cross-validation
)
# Summary of the model
print(xgb_model)
## eXtreme Gradient Boosting
##
## 2571 samples
## 35 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2314, 2314, 2315, 2314, 2313, 2315, ...
## Resampling results across tuning parameters:
##
## eta colsample_bytree RMSE Rsquared MAE
## 0.3 0.6 0.7574076 0.4312850 0.5953737
## 0.3 0.8 0.7512717 0.4397941 0.5889623
## 0.4 0.6 0.7421128 0.4533415 0.5791337
## 0.4 0.8 0.7486802 0.4414322 0.5831434
##
## Tuning parameter 'nrounds' was held constant at a value of 50
## Tuning
## parameter 'min_child_weight' was held constant at a value of 1
##
## Tuning parameter 'subsample' was held constant at a value of 0.5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 50, max_depth = 1, eta
## = 0.4, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1 and subsample
## = 0.5.
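As a quick side-by-side, we can collect each model's best resampled performance into one table. This is a sketch only; the kNN model was tuned with bootstrap resampling while the others used 10-fold CV, so the numbers are indicative rather than strictly comparable:
# Sketch: best resampled metrics for each trained caret model
model_list <- list(kNN = knnModel, SVM = svmModel,
                   MARS = marsModel, XGBoost = xgb_model)
do.call(rbind, lapply(model_list, getTrainPerf))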
Here we capture some related group work that followed a variant path to explore ridge regression and lasso regression models.
Here we remove highly collinear variables and handle missing values and categorical predictors with one-hot encoding.
We remove variables with high VIF values (such as `Brand.Code` and `Balling`) to reduce multicollinearity.
# Install and load the 'car' package
#install.packages("car")
library(car)
# Check VIF for the current model
vif(model)
## Fill.Ounces PC.Volume Carb.Temp PSC
## 1.092529 1.444174 1.012578 1.132060
## PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1
## 1.098658 1.061950 4.321046 1.524239
## Fill.Pressure Hyd.Pressure2 Hyd.Pressure3 Filler.Level
## 2.136003 8.014966 10.993119 10.437462
## Temperature Usage.cont Carb.Flow Density
## 1.314374 1.598617 1.763477 7.064779
## MFR Oxygen.Filler Bowl.Setpoint Pressure.Setpoint
## 1.356792 1.462939 10.604306 2.226677
## Alch.Rel Carb.Rel BCB
## 7.203861 4.063375 2.156663
# Refine the model by removing variables with high VIF (e.g., Brand.Code, Balling)
model_refined <- lm(PH ~ . -Brand.Code -Balling, data = student_train)
model_refined
# Check VIF again for the refined model
vif_refine<-vif(model_refined)
vif_refine
Here we handle missing values with kNN imputation and check that there are zero remaining missing values:
# Check if there are missing values in the entire dataset (none)
#sum(is.na(student_train))
# Impute missing values using kNN
imputeModel <- preProcess(student_train, method = "knnImpute")
# Apply the imputation to both training and test data
student_train_imputed <- predict(imputeModel, student_train)
student_eval_imputed <- predict(imputeModel, student_eval)
# Check if there are any missing values after imputation (none)
#sum(is.na(student_train_imputed)) # Should return 0 if imputation was successful
#sum(is.na(student_eval_imputed)) # Should also return 0 for the test data
# Prepare the predictor matrix (without the target variable 'PH')
x <- as.matrix(student_train_imputed[, -which(names(student_train_imputed) == "PH")])
# Check if there are missing values in the predictor matrix
sum(is.na(x)) # This should return 0 if there are no missing values
## [1] 0
Here we use one-hot encoding to get dummy variables.
# Remove PH column for dummy encoding
student_train_predictors <- student_train_imputed[, -which(names(student_train_imputed) == "PH")]
# Apply one-hot encoding to the predictors
dummy_model <- dummyVars(~ ., data = student_train_predictors)
# Apply the encoding to the dataset
x <- predict(dummy_model, student_train_predictors)
# Convert to matrix format
x <- as.matrix(x)
# Should return 0 if there are no missing values (correct, none)
#sum(is.na(x))
Here we fit two models, ridge regression and lasso regression.
Here we run the ridge regression model.
Note, the resulting coefficient of -0.0316 for `Carb.Volume` means that as `Carb.Volume` increases by one unit (here one standard deviation, since the predictors were standardized during imputation), PH decreases by 0.0316, assuming all other predictors are held constant.
Negative coefficients: features like `Brand.Code`, `Carb.Volume`, and `Temperature` are negatively related to PH, meaning as these features increase, the predicted PH tends to decrease.
Positive coefficients: features like `Balling.Lvl` and `Oxygen.Filler` are positively related to PH, meaning as these features increase, the predicted PH increases as well.
# Impute missing values using kNN
imputeModel <- preProcess(student_train, method = "knnImpute")
# Apply the imputation to both training and test data
student_train_imputed <- predict(imputeModel, student_train)
student_eval_imputed <- predict(imputeModel, student_eval)
# Remove PH column for dummy encoding (exclude target variable)
student_train_predictors <- student_train_imputed[, -which(names(student_train_imputed) == "PH")]
# Apply one-hot encoding to the predictors
dummy_model <- dummyVars(~ ., data = student_train_predictors)
x <- predict(dummy_model, student_train_predictors)
# Convert to matrix format (glmnet requires a matrix)
x <- as.matrix(x)
# Check if there are any missing values in the predictor matrix
sum(is.na(x)) # Should return 0 if no missing values
## [1] 0
# Define target variable 'y'
y <- student_train_imputed$PH
# Fit Ridge Regression model (alpha = 0 for Ridge)
ridge_model <- glmnet(x, y, alpha = 0)
# View the coefficients
#print(coef(ridge_model))
# Perform cross-validation to find the best lambda
cv_ridge <- cv.glmnet(x, y, alpha = 0) #Performs cross-validation to select the best regularization parameter (lambda).
# Plot the cross-validation results
plot(cv_ridge)
# Get the best lambda from the cross-validation
best_lambda <- cv_ridge$lambda.min # The lambda value that gives the minimum mean cross-validation error.
cat("Best Lambda for Ridge:", best_lambda, "\n")
## Best Lambda for Ridge: 0.04472021
# Refit the Ridge regression model with the best lambda
ridge_model_best <- glmnet(x, y, alpha = 0, lambda = best_lambda)
# Get the coefficients for the best model
#coef(ridge_model_best) # Extracts the coefficients of the final Ridge model.
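Since the full coefficient printout above is suppressed, here is a minimal sketch of how to view the largest ridge coefficients by absolute size:
# Sketch: show the ten largest Ridge coefficients in absolute value
ridge_coefs <- as.matrix(coef(ridge_model_best))
head(ridge_coefs[order(abs(ridge_coefs[, 1]), decreasing = TRUE), , drop = FALSE], 10)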
Here we seek to improve the lambda tuning using cross-validation.
Note, this resulted in a Mean Squared Error (MSE) of 0.5648 for the best ridge regression model, a significant improvement compared to the previous MSE of 153.4756, suggesting the model now predicts more accurately.
The reduction in MSE after applying cross-validation and tuning lambda implies that the regularization parameter has been effectively optimized.
# Use cross-validation to find the best lambda for Ridge regression
cv_ridge <- cv.glmnet(x, y, alpha = 0)
# Plot cross-validation results
plot(cv_ridge)
# Get the best lambda
best_lambda <- cv_ridge$lambda.min
cat("Best Lambda for Ridge:", best_lambda, "\n")
## Best Lambda for Ridge: 0.04472021
# Refit the model with the best lambda
ridge_model_best <- glmnet(x, y, alpha = 0, lambda = best_lambda)
# Get the coefficients for the best Ridge model
#coef(ridge_model_best)
Here we run the lasso regression model.
This results in some variables having coefficients of exactly zero.
Non-zero coefficients: these are the features Lasso selected; for example, `Brand.Code` and `Carb.Pressure` retain non-zero coefficients, suggesting a relationship with PH.
Zero coefficients: features such as `Filler.Level` have been shrunk to zero, indicating that Lasso determined they were not important in predicting PH (the sketch below lists them).
# Fit Lasso Regression model (alpha = 1 for Lasso)
lasso_model <- glmnet(x, y, alpha = 1)
# Use cross-validation to select the best lambda
cv_lasso <- cv.glmnet(x, y, alpha = 1)
# Plot the cross-validation results
plot(cv_lasso)
# Best lambda for Lasso
best_lambda_lasso <- cv_lasso$lambda.min
cat("Best Lambda for Lasso:", best_lambda_lasso, "\n")
## Best Lambda for Lasso: 0.0006640809
# Refit the model with the best lambda
lasso_model_best <- glmnet(x, y, alpha = 1, lambda = best_lambda_lasso)
# Get the coefficients for the best Lasso model
#coef(lasso_model_best)
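Since the full coefficient printout is again suppressed, here is a minimal sketch listing which predictors Lasso shrank exactly to zero at the best lambda:
# Sketch: predictors with coefficients shrunk exactly to zero by Lasso
lasso_coefs <- as.matrix(coef(lasso_model_best))
rownames(lasso_coefs)[lasso_coefs[, 1] == 0]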
Here we capture related group work that followed the same path as our initial data setup: kNN imputation, removing the near-zero-variance feature `Hyd.Pressure1`, replacing `Brand.Code` with `BCB` (brand code B or not), and creating a new variable `PT` from `Balling.Lvl`/`Balling`. It then picks up right before the stepwise reduction that led to our initial MARS model.
Here we will use PCR/PLS for further feature reduction.
Here we produce the principal components.
# Separate predictors and response
X_train <- student_train_x2 %>% select(-PH)
y_train <- student_train_x2$PH
X_test <- student_eval_y2 %>% select(-PH)
#y_test <- student_eval_y2$PH # test PH scores not available.
# Scale the training and test sets
# We'll use caret's preProcess for scaling
preProcValues <- preProcess(X_train, method = c("center", "scale"))
X_train_scaled <- predict(preProcValues, X_train)
X_test_scaled <- predict(preProcValues, X_test)
# Perform PCA on the scaled training predictors
pca_result <- prcomp(X_train_scaled, center = FALSE, scale. = FALSE)
# Summary of PCA variance
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.2811 2.1735 1.68172 1.32321 1.2562 1.19853 1.16009
## Proportion of Variance 0.1734 0.1575 0.09427 0.05836 0.0526 0.04788 0.04486
## Cumulative Proportion 0.1734 0.3309 0.42520 0.48356 0.5362 0.58404 0.62890
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.04011 1.01262 0.99666 0.94693 0.91894 0.87965 0.86579
## Proportion of Variance 0.03606 0.03418 0.03311 0.02989 0.02815 0.02579 0.02499
## Cumulative Proportion 0.66496 0.69914 0.73225 0.76214 0.79029 0.81608 0.84107
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.84955 0.77508 0.71546 0.69902 0.68471 0.64256 0.62167
## Proportion of Variance 0.02406 0.02002 0.01706 0.01629 0.01563 0.01376 0.01288
## Cumulative Proportion 0.86513 0.88515 0.90222 0.91850 0.93413 0.94789 0.96078
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.52727 0.47220 0.4414 0.37281 0.35244 0.29545 0.22780
## Proportion of Variance 0.00927 0.00743 0.0065 0.00463 0.00414 0.00291 0.00173
## Cumulative Proportion 0.97004 0.97748 0.9840 0.98861 0.99275 0.99566 0.99738
## PC29 PC30
## Standard deviation 0.21788 0.17601
## Proportion of Variance 0.00158 0.00103
## Cumulative Proportion 0.99897 1.00000
The principal components are chosen to maximize variability. We can crack open PC1 and PC2 to see which variables cover the widest range of variability and get a sense of how PCR would reduce multicollinearity:
# Contributions to PC1
fviz_contrib(pca_result, choice = "var", axes = 1, top = 10) +
labs(title = "Contributions of Variables to PC1")
# Contributions to PC2
fviz_contrib(pca_result, choice = "var", axes = 2, top = 10) +
labs(title = "Contributions of Variables to PC2")
On to PCR:
# Set a seed for reproducibility
set.seed(123)
# Using the already processed data 'student_train_x2'
# Ensure that PH is the response and the rest are predictors
pcr_model <- pcr(PH ~ ., data = student_train_x2, scale = TRUE, validation = "CV")
# Summary of the PCR model
summary(pcr_model)
## Data: X dimension: 2571 30
## Y dimension: 2571 1
## Fit method: svdpc
## Number of components considered: 30
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 0.9998 0.9577 0.9589 0.9506 0.9499 0.9461 0.9611
## adjCV 0.9998 0.9542 0.9546 0.9467 0.9461 0.9427 0.9564
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 0.9508 1.013 1.13 1.849 1.840 1.143 1.280
## adjCV 0.9459 1.003 1.11 1.778 1.767 1.120 1.244
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps 20 comps
## CV 1.253 1.158 1.279 1.183 1.025 1.029 0.9195
## adjCV 1.219 1.131 1.242 1.152 1.008 1.011 0.9108
## 21 comps 22 comps 23 comps 24 comps 25 comps 26 comps 27 comps
## CV 0.9870 0.9840 1.0152 1.021 1.035 1.016 0.9434
## adjCV 0.9703 0.9676 0.9958 1.001 1.013 0.996 0.9284
## 28 comps 29 comps 30 comps
## CV 1.0018 1.0000 0.9993
## adjCV 0.9815 0.9797 0.9791
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 17.35 33.09 42.52 48.36 53.62 58.40 62.89 66.50
## PH 15.05 16.53 17.27 17.29 17.32 17.33 19.34 19.34
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 69.91 73.23 76.21 79.03 81.61 84.11 86.51
## PH 19.62 19.76 23.33 23.41 26.33 27.28 28.56
## 16 comps 17 comps 18 comps 19 comps 20 comps 21 comps 22 comps
## X 88.52 90.22 91.85 93.41 94.79 96.08 97.0
## PH 29.86 31.45 31.57 32.82 32.87 35.79 35.8
## 23 comps 24 comps 25 comps 26 comps 27 comps 28 comps 29 comps
## X 97.75 98.40 98.86 99.27 99.57 99.74 99.90
## PH 36.74 37.65 37.80 38.05 39.93 40.80 40.89
## 30 comps
## X 100.00
## PH 40.91
# Plot the cross-validation results using validationplot()
# val.type options: "MSEP", "RMSEP", "R2"
validationplot(pcr_model, val.type = "RMSEP", main = "PCR Cross-Validation (RMSEP)")
# Print RMSEP for each number of components
rmsep_values <- RMSEP(pcr_model, estimate = "CV")
rmsep_values
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
## 0.9998 0.9577 0.9589 0.9506 0.9499 0.9461
## 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps
## 0.9611 0.9508 1.0134 1.1304 1.8490 1.8405
## 12 comps 13 comps 14 comps 15 comps 16 comps 17 comps
## 1.1432 1.2803 1.2528 1.1584 1.2790 1.1828
## 18 comps 19 comps 20 comps 21 comps 22 comps 23 comps
## 1.0253 1.0294 0.9195 0.9870 0.9840 1.0152
## 24 comps 25 comps 26 comps 27 comps 28 comps 29 comps
## 1.0210 1.0348 1.0162 0.9434 1.0018 1.0000
## 30 comps
## 0.9993
# Also check R² to see how well the model explains the variance
validationplot(pcr_model, val.type = "R2", main = "PCR Cross-Validation (R2)")
It looks like 5 components give us both the highest R2 and the lowest RMSEP before the model's performance drops sharply. It does improve again around 20 components, but for the sake of a simpler model we will stick with 5.
# Print out R2 for each number of components
r2_values <- R2(pcr_model, estimate = "CV")
r2_values
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
## -0.0007784 0.0817204 0.0795193 0.0953255 0.0966418 0.1037929
## 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps
## 0.0753052 0.0948758 -0.0280725 -0.2791882 -2.4227091 -2.3911300
## 12 comps 13 comps 14 comps 15 comps 16 comps 17 comps
## -0.3082899 -0.6410022 -0.5713197 -0.3432995 -0.6376337 -0.4005158
## 18 comps 19 comps 20 comps 21 comps 22 comps 23 comps
## -0.0524146 -0.0608081 0.1535729 0.0247478 0.0307351 -0.0317707
## 24 comps 25 comps 26 comps 27 comps 28 comps 29 comps
## -0.0436857 -0.0720937 -0.0337501 0.1089018 -0.0047271 -0.0010823
## 30 comps
## 0.0002551
The results with PCR are lackluster.
library(pls)
# Set a seed for reproducibility
set.seed(123)
# Fit the PCR model with exactly 5 components
pcr_model_5 <- pcr(PH ~ .,
data = student_train_x2,
scale = TRUE,
ncomp = 5,
validation = "CV")
# Print summary of the model
summary(pcr_model_5)
## Data: X dimension: 2571 30
## Y dimension: 2571 1
## Fit method: svdpc
## Number of components considered: 5
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
## CV 0.9998 0.9577 0.9589 0.9506 0.9499 0.9461
## adjCV 0.9998 0.9542 0.9546 0.9467 0.9461 0.9427
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 17.35 33.09 42.52 48.36 53.62
## PH 15.05 16.53 17.27 17.29 17.32
# Extract R2 and RMSEP for cross-validation
r2_5 <- R2(pcr_model_5, estimate = "CV")
rmsep_5 <- RMSEP(pcr_model_5, estimate = "CV")
# Make predictions on the eval set (student_eval_y2); note its PH values were
# filled in by imputation earlier, so these metrics are indicative only
predictions_5 <- predict(pcr_model_5, newdata = student_eval_y2, ncomp = 5)
# Calculate prediction metrics on the test set
test_res <- postResample(predictions_5, student_eval_y2$PH)
cat("\nTest set performance (using 5 components):\n")
##
## Test set performance (using 5 components):
print(test_res)
## RMSE Rsquared MAE
## 0.6155150 0.2832964 0.5087567
On to PLS:
library(pls)
set.seed(123) # For reproducibility
# 1. Make the PLS model using cross-validation
pls_model <- plsr(PH ~ .,
data = student_train_x2,
scale = TRUE, # Scale predictors
validation = "CV") # Use cross-validation
# Print a summary of the model to see initial info
summary(pls_model)
## Data: X dimension: 2571 30
## Y dimension: 2571 1
## Fit method: kernelpls
## Number of components considered: 30
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 0.9998 0.9918 1.201 1.191 1.084 0.9236 0.9112
## adjCV 0.9998 0.9813 1.168 1.158 1.059 0.9108 0.8990
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1.094 0.9879 0.9589 0.9967 1.0059 0.9694 0.9848
## adjCV 1.066 0.9688 0.9421 0.9767 0.9852 0.9517 0.9658
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps 20 comps
## CV 1.0151 1.0006 0.9967 0.9999 0.9993 1.0001 1.0004
## adjCV 0.9936 0.9803 0.9767 0.9797 0.9791 0.9798 0.9801
## 21 comps 22 comps 23 comps 24 comps 25 comps 26 comps 27 comps
## CV 0.9988 0.9991 0.9994 0.9992 0.9993 0.9992 0.9993
## adjCV 0.9787 0.9789 0.9792 0.9790 0.9791 0.9790 0.9791
## 28 comps 29 comps 30 comps
## CV 0.9993 0.9993 0.9993
## adjCV 0.9791 0.9791 0.9791
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 16.76 21.44 31.84 41.84 47.84 50.65 52.86 56.85
## PH 22.63 34.56 36.74 37.89 38.94 39.80 40.40 40.60
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 59.57 62.18 65.80 68.10 70.52 72.8 75.89
## PH 40.75 40.83 40.86 40.88 40.89 40.9 40.90
## 16 comps 17 comps 18 comps 19 comps 20 comps 21 comps 22 comps
## X 77.86 79.26 81.43 83.71 85.10 87.22 89.29
## PH 40.90 40.90 40.90 40.91 40.91 40.91 40.91
## 23 comps 24 comps 25 comps 26 comps 27 comps 28 comps 29 comps
## X 90.37 92.79 93.90 95.58 96.89 97.89 98.51
## PH 40.91 40.91 40.91 40.91 40.91 40.91 40.91
## 30 comps
## X 100.00
## PH 40.91
# Using validationplot to check RMSEP across components
validationplot(pls_model, val.type = "RMSEP", main = "PLS: CV RMSEP by Number of Components")
# This plot helps visualize how RMSEP changes as we increase the number of components.
validationplot(pls_model, val.type = "R2", main = "PLS: CV R2 by Number of Components")
We arrive at a PLS Model with an R2 of 43.43%.
# Check R² via R2() function and RMSEP via RMSEP() function
r2_values_pls <- R2(pls_model, estimate = "CV")
rmsep_values_pls <- RMSEP(pls_model, estimate = "CV")
r2_values <- r2_values_pls$val["CV", "PH", ]
r2_comps <- r2_values[-1]
rmse_values <- rmsep_values_pls$val["CV", "PH", ]
rmsep_comps <- rmse_values[-1]
# Identify the best number of components based on minimum RMSEP
opt_comp_rmsep <- which.min(rmsep_comps)
min_rmsep <- rmsep_comps[opt_comp_rmsep]
# Identify the best number of components based on max R²
opt_comp_r2 <- which.max(r2_comps)
max_r2 <- r2_comps[opt_comp_r2]
cat("\nOptimal number of components based on RMSEP:", opt_comp_rmsep, "with RMSEP =", min_rmsep, "\n")
##
## Optimal number of components based on RMSEP: 6 with RMSEP = 0.9111578
cat("Optimal number of components based on R²:", opt_comp_r2, "with R² =", max_r2, "\n")
## Optimal number of components based on R²: 6 with R² = 0.1688543
# Let's choose the optimal number of components. In practice, you might consider a balance
# between minimal RMSEP and complexity. Here, let's use the one chosen by RMSEP for demonstration.
final_ncomp <- opt_comp_rmsep
cat("\nFinal chosen number of components:", final_ncomp, "\n")
##
## Final chosen number of components: 6
# 4. Make a final PLS model with the chosen number of components and evaluate it
pls_model_final <- plsr(PH ~ .,
data = student_train_x2,
scale = TRUE,
ncomp = final_ncomp,
validation = "none") # no need for CV here since we fixed ncomp
# Summary of the final model
summary(pls_model_final)
## Data: X dimension: 2571 30
## Y dimension: 2571 1
## Fit method: kernelpls
## Number of components considered: 6
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## X 16.76 21.44 31.84 41.84 47.84 50.65
## PH 22.63 34.56 36.74 37.89 38.94 39.80
preds_pls <- predict(pls_model_final, newdata = student_eval_y2, ncomp = final_ncomp)
results_test <- postResample(preds_pls, student_eval_y2$PH)
cat("\nTest set performance with", final_ncomp, "components:\n")
##
## Test set performance with 6 components:
print(results_test)
## RMSE Rsquared MAE
## 0.5593341 0.4342728 0.4397250
X-Loadings indicate how the original predictors are combined linearly to form the PLS components. A higher absolute loading value means the predictor strongly influences that component.
# View the X-loadings
x_loadings <- pls_model_final$loadings
cat("X-loadings:\n")
## X-loadings:
# print(x_loadings)
# View the X-loading weights
x_loading_weights <- pls_model_final$loading.weights
cat("\nX-loading weights:\n")
##
## X-loading weights:
# print(x_loading_weights)
# Convert to data frames for easier handling
x_loadings_matrix <- unclass(pls_model_final$loadings)
x_loadings_df <- as.data.frame(x_loadings_matrix)
x_loadings_weights_matrix <- unclass(pls_model_final$loading.weights)
x_loading_weights_df <- as.data.frame(x_loadings_weights_matrix)
# Rename columns for clarity: each column corresponds to a component
colnames(x_loadings_df) <- paste0("Comp", 1:ncol(x_loadings_df))
colnames(x_loading_weights_df) <- paste0("Comp", 1:ncol(x_loading_weights_df))
# Let's also see the coefficients for each number of components
coefficients_array <- pls_model_final$coefficients
# `coefficients_array` is a multidimensional array: [predictor, response, component]
# For a single response model, we can simplify it:
coefficients_matrix <- coefficients_array[,1,] # Extract for the single response variable
colnames(coefficients_matrix) <- paste0("Comp", 1:ncol(coefficients_matrix))
cat("\nCoefficients for each component:\n")
##
## Coefficients for each component:
#print(coefficients_matrix)
# Visualizing Loadings
# As with PCA, you can plot the loadings to see which predictors have strong influence on each component.
library(reshape2)
library(ggplot2)
# Add predictor names as a column, then melt the loadings for plotting
x_loadings_df$Variable <- rownames(x_loadings_df)
x_loadings_long <- melt(x_loadings_df, id.vars = "Variable",
                        variable.name = "Component", value.name = "Loading")
ggplot(x_loadings_long, aes(x = reorder(Variable, Loading), y = Loading, fill = Component)) +
geom_bar(stat="identity", position="dodge") +
coord_flip() +
facet_wrap(~ Component, scales = "free_y") +
theme_minimal() +
labs(title = "PLS X-Loadings by Component", x = "Predictors", y = "Loading")
Coefficients show how each predictor contributes to predicting the response variable when using a given number of components. Higher absolute coefficients indicate greater influence of a predictor on the predicted outcome.
# Extract the coefficients matrix from the PLS model
coefficients_array <- pls_model_final$coefficients
coefficients_matrix <- coefficients_array[, 1, ] # For a single-response model
colnames(coefficients_matrix) <- paste0("Comp", 1:ncol(coefficients_matrix))
# Convert to a data frame
coefficients_df <- as.data.frame(coefficients_matrix)
coefficients_df$Variable <- rownames(coefficients_df)
# Reshape to long format for plotting
library(reshape2)
coefficients_long <- melt(coefficients_df, id.vars = "Variable",
variable.name = "Component", value.name = "Coefficient")
# Plot using ggplot2, faceting by Component
library(ggplot2)
ggplot(coefficients_long, aes(x = reorder(Variable, Coefficient), y = Coefficient, fill = Component)) +
geom_bar(stat="identity", position="dodge") +
coord_flip() +
facet_wrap(~ Component, scales = "free_y") +
theme_minimal() +
labs(title = "PLS Coefficients by Component",
x = "Predictor Variables",
y = "Coefficient") +
theme(legend.position = "none")
A key benefit of PLS is that multicollinearity is addressed alongside predictive power. The components (latent variables) are chosen to maximize the covariance between the predictors and the response, and each component is orthogonal to the ones before it, so the components themselves are uncorrelated.
Each predictor is also weighted differently in each component based on its effect on the target variable. As a next step, we can look at which predictors have consistently low coefficients and loadings and remove them; this could improve the performance of the model (see the sketch below).
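As a hedged sketch of that next step (assuming the `pls_model_final` object fit above), we can verify that the component scores are essentially uncorrelated and rank predictors by their mean absolute coefficient across components; the smallest values are candidates for removal:
# Scores of the fitted PLS components should be (near) uncorrelated
round(cor(pls::scores(pls_model_final))[1:3, 1:3], 3)
# Rank predictors by mean absolute coefficient across components
coef_mat <- pls_model_final$coefficients[, 1, ]
head(sort(rowMeans(abs(coef_mat))), 5)  # smallest = candidates for removal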
PT: coefficients run from roughly ±0.001 to ±0.007 across all components, never exceeding about 0.0075 in absolute value. These are very small relative to other variables, whose coefficients fall in the 0.05–0.4 range. PT also does not load strongly on any component (it is absent or near zero in the loadings output).
Air.Pressurer: coefficients are about -0.003 to -0.013 across all components, also quite small compared to the more influential variables. It appears only once in the loadings, at about -0.114, which is not large; the most important variables have loadings above ±0.2–0.3 on at least one component.
Carb.Temp: coefficients range from about 0.0059 to 0.0158, slightly larger than PT or Air.Pressurer but still small next to variables with coefficients well above 0.05 or even 0.1. Carb.Temp does not appear with a noticeable loading (it is blank in the output), suggesting it is not significantly influencing the latent structure.
Here we run PLS with fewer predictors based on the analysis above.
library(pls)
# Remove the chosen predictors from the dataset
predictors_to_remove <- c("PT", "Air.Pressurer", "Carb.Temp")
student_train_reduced <- student_train_x2[ , !(names(student_train_x2) %in% predictors_to_remove)]
# Fit a PLS model again with cross-validation
set.seed(123)
pls_model_reduced <- plsr(PH ~ .,
data = student_train_reduced,
scale = TRUE,
validation = "CV")
# Check summary
summary(pls_model_reduced)
## Data: X dimension: 2571 27
## Y dimension: 2571 1
## Fit method: kernelpls
## Number of components considered: 27
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 0.9998 0.8815 0.8157 0.8027 0.7964 0.7906 0.7858
## adjCV 0.9998 0.8815 0.8153 0.8024 0.7960 0.7901 0.7852
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 0.7819 0.7806 0.7799 0.7793 0.7791 0.7790 0.7791
## adjCV 0.7813 0.7801 0.7794 0.7788 0.7786 0.7785 0.7786
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps 20 comps
## CV 0.7790 0.7790 0.7790 0.7789 0.7790 0.7790 0.7790
## adjCV 0.7784 0.7784 0.7784 0.7784 0.7784 0.7784 0.7784
## 21 comps 22 comps 23 comps 24 comps 25 comps 26 comps 27 comps
## CV 0.7790 0.7790 0.7790 0.7790 0.7790 0.7790 0.7790
## adjCV 0.7784 0.7784 0.7784 0.7784 0.7784 0.7784 0.7784
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 18.62 23.82 35.37 46.40 53.02 56.15 58.19 62.01
## PH 22.61 34.54 36.73 37.88 38.93 39.78 40.40 40.60
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 65.07 66.83 69.00 72.50 74.75 76.60 79.41
## PH 40.72 40.80 40.83 40.84 40.85 40.85 40.85
## 16 comps 17 comps 18 comps 19 comps 20 comps 21 comps 22 comps
## X 80.73 82.77 84.62 86.89 89.27 90.91 92.52
## PH 40.85 40.85 40.85 40.85 40.85 40.85 40.85
## 23 comps 24 comps 25 comps 26 comps 27 comps
## X 93.75 95.61 97.40 98.30 100.00
## PH 40.85 40.85 40.85 40.85 40.85
r2_values_reduced <- R2(pls_model_reduced, estimate = "CV")
rmsep_values_reduced <- RMSEP(pls_model_reduced, estimate = "CV")
cat("\nCross-validated R² values after removal:\n")
##
## Cross-validated R² values after removal:
cat("\nCross-validated RMSEP values after removal:\n")
##
## Cross-validated RMSEP values after removal:
# Visualize MSEP to see if there's an improvement
validationplot(pls_model_reduced, val.type = "MSEP", main = "PLS with Reduced Predictors: CV MSEP by #Components")
# Identify optimal number of components in the reduced model based on RMSEP
r2_values_reduced_mod <- r2_values_reduced$val["CV", "PH", ]
r2_comps_reduced <- r2_values_reduced_mod[-1]
rmse_values_reduced_mod <- rmsep_values_reduced$val["CV", "PH", ]
rmsep_comps_reduced <- rmse_values_reduced_mod[-1]
opt_comp_rmsep_reduced <- which.min(rmsep_comps_reduced)
min_rmsep_reduced <- rmsep_comps_reduced[opt_comp_rmsep_reduced]
opt_comp_r2_reduced <- which.max(r2_comps_reduced)
max_r2_reduced <- r2_comps_reduced[opt_comp_r2_reduced]
cat("\nAfter predictor removal:\n")
##
## After predictor removal:
cat("Optimal #Components by RMSEP:", opt_comp_rmsep_reduced, "with RMSEP =", min_rmsep_reduced, "\n")
## Optimal #Components by RMSEP: 17 with RMSEP = 0.7789427
cat("Optimal #Components by R²:", opt_comp_r2_reduced, "with R² =", max_r2_reduced, "\n")
## Optimal #Components by R²: 17 with R² = 0.3925633
There was a small lift in test-set R-squared after removing some of the less relevant predictors:

Model | Test R-squared
---|---
PCR | 0.2832964
PLS | 0.4342728
PLS with predictors removed | 0.4452060
# Fit the final model with the chosen number of components
final_pls_model <- plsr(PH ~ .,
data = student_train_reduced,
scale = TRUE,
ncomp = opt_comp_rmsep_reduced,
validation = "none")
# Check summary
summary(final_pls_model)
## Data: X dimension: 2571 27
## Y dimension: 2571 1
## Fit method: kernelpls
## Number of components considered: 17
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 18.62 23.82 35.37 46.40 53.02 56.15 58.19 62.01
## PH 22.61 34.54 36.73 37.88 38.93 39.78 40.40 40.60
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 65.07 66.83 69.00 72.50 74.75 76.60 79.41
## PH 40.72 40.80 40.83 40.84 40.85 40.85 40.85
## 16 comps 17 comps
## X 80.73 82.77
## PH 40.85 40.85
# Predict on the test set
preds_final <- predict(final_pls_model, newdata = student_eval_y2, ncomp = opt_comp_rmsep_reduced)
test_performance <- postResample(preds_final, student_eval_y2$PH)
cat("\nTest set performance (RMSE, R2, MAE):\n")
##
## Test set performance (RMSE, R2, MAE):
print(test_performance)
## RMSE Rsquared MAE
## 0.5562690 0.4452060 0.4302488
Here we revisit the MARS model in an attempt to address the over-fitting that occurred in the prior iteration. There is some additional feature generation around the ‘Hyd.Pressure’ variables based on their distributions. We also take a different approach to imputation, replacing missing values with the column median. Lastly, we control for outliers in our model creation by ‘Winsorizing’ the data set, clipping each variable’s extreme values at lower and upper bounds (the 5th and 95th percentiles with the defaults we use).
First we address the missing Brand.Code values by creating a new level ‘Unk’. We also temporarily split off the Brand.Code column so that in the next step we can impute missing values and fix outliers in all the remaining numeric columns.
library(tidyverse)
library(DescTools)
student_train$Brand.Code[student_train$Brand.Code == ''] <- 'Unk'
BC <- student_train |> select(Brand.Code)
student_train <- student_train |> select(-Brand.Code)
student_eval$Brand.Code[student_eval$Brand.Code == ''] <- 'Unk'
BC_eval <- student_eval |> select(Brand.Code)
student_eval <- student_eval |> select(-Brand.Code)
Here we replace all missing values with the column median and create some new features based on the distributions of the ‘Hyd.Pressure’ columns. If you look at the distributions of these columns, you see two populations of values: a spike at zero and a roughly Gaussian distribution of non-zero values. We create an additional binary feature for each of these columns that captures whether the hydraulic pressure is greater than zero.
hist(student_train$Hyd.Pressure1, breaks = 20, main = "Histogram of Hyd.Pressure1", xlab = 'Hyd.Pressure1')
#Replace missing values in numeric columns with the column median.
student_train <- student_train %>% mutate(across(everything(), ~replace_na(.x, median(na.omit(.)))))
student_eval <- student_eval %>% mutate(across(everything(), ~replace_na(.x, median(na.omit(.)))))
#Create new features to identify if the hydraulic pressure value is greater than zero
student_train <- student_train |> mutate(Hyd.Pressure10 = as.numeric(Hyd.Pressure1 > 0), Hyd.Pressure20 = as.numeric(Hyd.Pressure2 > 0), Hyd.Pressure30 = as.numeric(Hyd.Pressure3 > 0))
student_eval <- student_eval |> mutate(Hyd.Pressure10 = as.numeric(Hyd.Pressure1 > 0), Hyd.Pressure20 = as.numeric(Hyd.Pressure2 > 0), Hyd.Pressure30 = as.numeric(Hyd.Pressure3 > 0))
Here we Winsorize the data, a process that defines outliers as values falling outside lower and upper bounds on each variable and replaces them with the bounds themselves. With the defaults we use, the bounds are the 5th and 95th percentiles, which compresses the tails of the distributions in our data set. (A 2-standard-deviation variant is sketched after the code.) Then we join the non-numeric Brand.Code column back to the data.
#Winsorize the data (DescTools::Winsorize defaults to the 5th/95th percentiles)
student_train <- lapply(student_train, Winsorize)
student_train <- bind_cols(BC, student_train)
student_eval <- lapply(student_eval, Winsorize)
student_eval <- bind_cols(BC_eval, student_eval)
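If the 2-standard-deviation clipping is preferred instead of the percentile default, a minimal hand-rolled helper could look like the sketch below (hypothetical `winsorize_sd`, shown for illustration and not used in the results that follow):
# Hypothetical helper: clip numeric values more than k SDs from the mean
winsorize_sd <- function(x, k = 2) {
  if (!is.numeric(x)) return(x)  # pass non-numeric columns through unchanged
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  pmin(pmax(x, m - k * s), m + k * s)
}
# Example usage: student_train_sd <- as.data.frame(lapply(student_train, winsorize_sd))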
library(earth)
y = student_train$PH
x = student_train |> select(-PH)
#y_eval = student_eval$PH
x_eval = student_eval |> select(-PH)
Here we perform a grid search to identify the optimal hyperparameters for our MARS model.
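The search itself is not shown; below is a sketch of how such a search could be run with caret. The grid values are an illustrative assumption, not the exact grid we used; a search of this type is what pointed us to degree = 3 and nprune = 42.
# Sketch of a caret grid search over MARS hyperparameters (illustrative grid)
mars_grid <- expand.grid(degree = 1:3, nprune = seq(10, 50, by = 8))
set.seed(175175327)
mars_tune <- train(x = x, y = y, method = "earth",
                   tuneGrid = mars_grid,
                   trControl = trainControl(method = "cv", number = 5))
mars_tune$bestTune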
Lastly we create our optimal MARS model and perform a 10-fold cross-validation to evaluate its performance. We also shorten the tail of the Oxygen.Filler distribution by capping values greater than 0.25 at 0.25.
We output the model summary for the last of the 10 cross-validation fold models built.
library(caret)
set.seed(175175327)
x$Oxygen.Filler[x$Oxygen.Filler > 0.25] <- 0.25
#y[y==0] <- 8.5
folds <- createFolds(y = y, k = 10)
model_results <- data.frame(R_squared_train = numeric(10), RMSE = numeric(10),
                            R_squared_test = numeric(10))
for (i in 1:10) {
x_data <- x[(-folds[[i]]), ] # Train Predictor data for current fold
y_data <- y[(-folds[[i]])] # Train Target data for current fold
mars_model <- earth(x = x_data, y = y_data, degree = 3, nprune = 42)
#print(summary(mars_model))
model_results[i, 'R_squared_train'] <- summary(mars_model)$rsq
test_data <- x[(folds[[i]]), ] # Test Predictor data for current fold
target_data <- y[(folds[[i]])] # Test Target data for current fold
predictions <- predict(mars_model, newdata = test_data)
model_results[i, "RMSE"] <- sqrt(mean((predictions - target_data)^2))
res <- caret::postResample(predictions, target_data)
model_results[i, "R_squared_test"] <- res[2]
}
summary(mars_model)
## Call: earth(x=x_data, y=y_data, degree=3, nprune=42)
##
## coefficients
## (Intercept) 0.171936
## h(-0.203956-Mnf.Flow) 1.227742
## h(0.910228-Hyd.Pressure3) -0.235784
## h(Temperature- -0.699707) -0.271336
## h(-0.671872-Density) 0.992686
## h(Density- -0.671872) -0.437461
## h(0.0440049-Bowl.Setpoint) -0.366527
## h(Air.Pressurer- -0.193078) 0.189984
## h(-0.549039-Alch.Rel) 2.104751
## h(Alch.Rel- -0.549039) 0.381484
## Brand.CodeC * h(Mnf.Flow-1.06152) -6.381409
## Brand.CodeC * h(1.06152-Mnf.Flow) -0.559842
## h(Mnf.Flow- -0.203956) * h(Usage.cont-0.0560919) -0.334615
## h(-0.203956-Mnf.Flow) * h(0.379132-Pressure.Vacuum) -0.809624
## h(-0.203956-Mnf.Flow) * h(Air.Pressurer-0.962117) -0.727589
## h(-0.203956-Mnf.Flow) * h(0.962117-Air.Pressurer) 0.155671
## h(Mnf.Flow- -0.203956) * h(Alch.Rel-0.519685) 0.418455
## h(Mnf.Flow- -0.203956) * h(0.519685-Alch.Rel) -0.335221
## h(-0.334474-Carb.Pressure1) * h(Hyd.Pressure1-0.688671) -0.278304
## h(0.910228-Hyd.Pressure3) * h(Pressure.Vacuum-0.0282507) -0.100393
## h(0.268287-Filler.Speed) * h(Density- -0.671872) 0.074865
## h(Filler.Speed-0.268287) * h(Density- -0.671872) 1.489211
## h(-0.322631-Pressure.Vacuum) * h(0.0440049-Bowl.Setpoint) 0.272839
## h(Pressure.Vacuum- -0.322631) * h(0.0440049-Bowl.Setpoint) 0.208009
## h(-0.746997-Oxygen.Filler) * h(-0.549039-Alch.Rel) -12.650368
## Brand.CodeC * h(1.06152-Mnf.Flow) * h(0.0450417-Carb.Pressure1) 0.183953
## h(Carb.Volume- -1.34916) * h(Mnf.Flow- -0.203956) * h(Usage.cont-0.0560919) -0.128589
## h(Mnf.Flow- -0.203956) * h(Hyd.Pressure2-1.02762) * h(0.519685-Alch.Rel) 2.639698
## h(-1.04258-Mnf.Flow) * h(0.910228-Hyd.Pressure3) * h(Pressure.Vacuum-0.0282507) 95.969965
## h(Mnf.Flow- -0.203956) * h(0.60202-Temperature) * h(Usage.cont-0.0560919) -0.245408
## h(Mnf.Flow- -0.203956) * h(Usage.cont-0.0560919) * h(Bowl.Setpoint- -1.26292) 0.243339
## h(Mnf.Flow- -0.203956) * h(Usage.cont-0.0560919) * h(-1.26292-Bowl.Setpoint) 0.647840
## h(-0.203956-Mnf.Flow) * h(0.379132-Pressure.Vacuum) * h(1.62799-Alch.Rel) 0.219838
## h(Mnf.Flow- -0.203956) * h(Bowl.Setpoint-0.0440049) * h(Alch.Rel-0.519685) -0.664349
## h(Mnf.Flow- -0.203956) * h(0.0440049-Bowl.Setpoint) * h(Alch.Rel-0.519685) 0.212207
## h(Mnf.Flow- -0.203956) * h(0.519685-Alch.Rel) * h(Balling.Lvl- -0.494086) 1.367322
## h(Mnf.Flow- -0.203956) * h(0.519685-Alch.Rel) * h(-0.494086-Balling.Lvl) 0.871681
## h(1.14142-Carb.Pressure1) * h(Oxygen.Filler- -0.746997) * h(-0.549039-Alch.Rel) -1.568930
## h(0.910228-Hyd.Pressure3) * h(Filler.Speed-0.408398) * h(Pressure.Vacuum-0.0282507) 40.297109
## h(0.910228-Hyd.Pressure3) * h(Filler.Speed-0.379857) * h(Pressure.Vacuum-0.0282507) -9.411090
## h(0.910228-Hyd.Pressure3) * h(Filler.Speed-0.418776) * h(Pressure.Vacuum-0.0282507) -63.437797
## h(-0.753706-Balling) * h(Oxygen.Filler- -0.746997) * h(-0.549039-Alch.Rel) 10.785083
##
## Selected 42 of 66 terms, and 18 of 38 predictors (nprune=42)
## Termination condition: RSq changed by less than 0.001 at 66 terms
## Importance: Mnf.Flow, Brand.CodeC, Alch.Rel, Pressure.Vacuum, ...
## Number of terms at each degree of interaction: 1 9 15 17
## GCV 0.3113796 RSS 657.5183 GRSq 0.620869 RSq 0.6537267
We get a mean training-set R-squared of 0.635, a mean test-set R-squared of 0.556, and a max test-set R-squared of 0.66. The drop-off of about 0.08 in R-squared shows that some overfitting is still occurring during model creation, but less than in the first iteration of the MARS model. Looking at the model summary, some subgroup analysis could be a useful next step in model exploration (see the sketch after the summary below).
summary(model_results)
##  R_squared_train       RMSE        R_squared_test
##  Min.   :0.6083   Min.   :0.5319   Min.   :0.4226
##  1st Qu.:0.6271   1st Qu.:0.5922   1st Qu.:0.5380
##  Median :0.6313   Median :0.6041   Median :0.5599
##  Mean   :0.6356   Mean   :0.6120   Mean   :0.5568
##  3rd Qu.:0.6500   3rd Qu.:0.6150   3rd Qu.:0.5874
##  Max.   :0.6642   Max.   :0.7721   Max.   :0.6564
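As a sketch of the subgroup analysis suggested above (assuming the `mars_model`, `x`, and `y` objects from the fold loop), we could compare fit quality by brand:
# Per-brand fit quality for the last fold's MARS model (sketch)
fits <- as.numeric(predict(mars_model, newdata = x))
data.frame(Brand = x$Brand.Code, resid = y - fits) %>%
  group_by(Brand) %>%
  summarise(RMSE = sqrt(mean(resid^2)), n = n())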
# Predict PH for the evaluation set with the final MARS model
preds <- stats::predict(mars_model, newdata = student_eval)
#write.csv(preds, "Data624_Model_Preds.csv")