Project Introduction:

New regulations are requiring ABC Beverage to provide a report with an outline of our manufacturing process, and a predictive model of PH including an explanation of predictive factors.

Our data science team is tasked with developing the predictive model from provided historical data and using that model to predict PH on test data.

# Checking out packages
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE) 

library(corrplot)
library(reshape2)  # for melt
library(ggplot2)   # for ggplot
library(dplyr)
library(knitr)
library(magrittr)
library(tidyverse) # for code with missing values
library(caret)     # for models
library(RANN)      # for better kNN imputation
library(gridExtra) # for Outliers
library(car)       # VIF
library(earth)     # MARS model
library(kernlab)   # SVM model
library(xgboost)   # XGBoost model
library(glmnet)    # Appendix Ridge Regression
library(factoextra) # For advanced plotting
library(pls)       # Appendix 2 PLS



Data

Here we import our train and test data, student_train and student_eval, and evaluate for missing data and additional exploratory steps.


Data Acquisition

Here we can preview the data structure:

# Read data
student_train = read.csv('https://raw.githubusercontent.com/deepasharma06/Data-624/refs/heads/main/StudentData_training.csv')
student_eval = read.csv('https://raw.githubusercontent.com/deepasharma06/Data-624/refs/heads/main/StudentEvaluation_test.csv')

# Show data
head(student_train) %>% kable()
Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp PSC PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed Temperature Usage.cont Carb.Flow Density MFR Balling Pressure.Vacuum PH Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel Balling.Lvl
B 5.340000 23.96667 0.2633333 68.2 141.2 0.104 0.26 0.04 -100 118.8 46.0 0 NA NA 118 121.2 4002 66.0 16.18 2932 0.88 725.0 1.398 -4.0 8.36 0.022 120 46.4 142.6 6.58 5.32 1.48
A 5.426667 24.00667 0.2386667 68.4 139.6 0.124 0.22 0.04 -100 121.6 46.0 0 NA NA 106 118.6 3986 67.6 19.90 3144 0.92 726.8 1.498 -4.0 8.26 0.026 120 46.8 143.0 6.56 5.30 1.56
B 5.286667 24.06000 0.2633333 70.8 144.8 0.090 0.34 0.16 -100 120.2 46.0 0 NA NA 82 120.0 4020 67.0 17.76 2914 1.58 735.0 3.142 -3.8 8.94 0.024 120 46.6 142.0 7.66 5.84 3.28
A 5.440000 24.00667 0.2933333 63.0 132.6 NA 0.42 0.04 -100 115.2 46.4 0 0 0 92 117.8 4012 65.6 17.42 3062 1.54 730.6 3.042 -4.4 8.24 0.030 120 46.0 146.2 7.14 5.42 3.04
A 5.486667 24.31333 0.1113333 67.2 136.8 0.026 0.16 0.12 -100 118.4 45.8 0 0 0 92 118.6 4010 65.6 17.68 3054 1.54 722.8 3.042 -4.4 8.26 0.030 120 46.0 146.2 7.14 5.44 3.04
A 5.380000 23.92667 0.2693333 66.6 138.4 0.090 0.24 0.04 -100 119.6 45.6 0 0 0 116 120.2 4014 66.2 23.82 2948 1.52 738.8 2.992 -4.4 8.32 0.024 120 46.0 146.6 7.16 5.44 3.02


Domain Dive

We looked for a general understanding of bottling manufacturing and learned a few things potentially about our data, so these are our assumptions and guesses.

This data describes a continuous bottling process, like a conveyor belt, and so Mnf Flow would be the percentage of intended speed the process is running at.

Fill.Ounces is how full the bottles bottles are being filled to with a target of 24 oz.

Alch.Rel sounds like this could be an relative value of alcohol. I checked and Modelo produces cans of 24 ounce spiked sparkling water, so maybe it’s a similar product.

Brand.Code could be in label only but if it’s different formulas and flavors, the specific ratios of acids (citric, phosphoric, malic) to create the brand’s flavor profile, then it would definitely affect PH.

PH is the PH of our bottled beverage. Curiously we would expect carbonated beverages to be acidic because carbon in water makes carbonic acid however all of the PH values are in the 8 range which is basic. 7 is neutral and below 7 would be acidic.

Mnf.Flow seems to be the speed of the conveyor belt or manufacturing flow represented as a percentage of the ideal speed. MFR could be the actual speed.

Carb.Volume, Carb.Pressure, Carb.Temp we could potentially combine with the formula P*V/T to obtain the amount of carbon being injected into the liquid.

Bowl.Setpoint is consistently 120 and we believe it’s how high the bottle is expected to be filled. Filler.Level is a variable close to 120 which is presumably the actual amount filled. Dividing Bowl.Setpoint by Filler Level could agregate the predictors into whether too much or too little was added.

Filler.Speed seems to be the speed at which the fill liquid is filled. And Carb.Flow seems to be the speed at which the carbonation is injected.

Carb.Pressure1 and Fill.Pressure look to be the pressure at which the carbonation (~120) is injected into the fill liquid and the fill liquid’s pressure (~46). Those values make sense because if the C02’s pressure is not higher than the liquid, the liquid won’t carbonate.

Pressure.Setpoint seems to be the setting for Fill.Pressure that the manufacturing process is trying to match.

Balling is a reference to the Balling scale which was a way to determine how much sugar is in a syrup based on the density of the liquid. Balling.Lvl is related.

Temperature is in fahrenheit (~66) but isn’t clear if that’s the ambient temperature and should be aggregated with Air.Pressurer (~144) to get a sense of what the surrounding pressure is, which could influence how much carbonation stays in the bottle.

PSC.CO2 could be the purity of the carbon dioxide, variations in which would have a big impact on the final PH. PSC.Fill could be the purity of the water. and PSC a combined metric of purity. But this is just a guess.

This leaves a lot variables without a clear picture of how they fit into the manufacturing process:

PC.Volume
Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4
Usage.cont Density Pressure.Vacuum Oxygen.Filler Carb.Rel


Missing values

In total there are 724 missing values across 442 rows.

17.2% of our rows have missing data (442/2571) so we won’t just drop all rows with missing data.

Here we show columns with the highest number of missing values:

# There are 724 NA values
#sum(is.na(student_train))

# 442 rows have missing data
#student_train %>%
#  filter(if_any(everything(), is.na)) %>%
#  nrow()

# Display columns by missingness
student_train %>%
  summarise(across(everything(), ~ sum(is.na(.)), .names = "missing_{.col}")) %>%
  pivot_longer(everything(), names_to = "Column", values_to = "Missing_Count") %>%
  arrange(desc(Missing_Count))
## # A tibble: 33 × 2
##    Column                 Missing_Count
##    <chr>                          <int>
##  1 missing_MFR                      212
##  2 missing_Filler.Speed              57
##  3 missing_PC.Volume                 39
##  4 missing_PSC.CO2                   39
##  5 missing_Fill.Ounces               38
##  6 missing_PSC                       33
##  7 missing_Carb.Pressure1            32
##  8 missing_Hyd.Pressure4             30
##  9 missing_Carb.Pressure             27
## 10 missing_Carb.Temp                 26
## # ℹ 23 more rows


Impute missing values (NAs)

Here we impute the missing values using kNN and then check that the 724 missing values in student_train and the 366 missing values in student_eval are filled in.

# Produce the model that can impute values using kNN
imputeModel <- preProcess(student_train, method = c("knnImpute"))

# Impute the missing values for the training and test data
student_train <- predict(imputeModel, student_train)
student_eval <- predict(imputeModel, student_eval)

# There are now zero NA values in our train and test data
sum(is.na(student_train))
## [1] 0
#sum(is.na(student_eval))


Correlation Plot

In our correlation plot we are comparing values in the second to last PH column. We note Filler.Level and Bowl.Setpoint which are highly collinear have some positive correlation with PH. Also, a number of variables have negative correlation with PH, most notably Mnf.Flow, Hyd.Pressure3, and Pressure.Setpoint.

# Select only numeric columns
numeric_data <- student_train %>% select(where(is.numeric))

# Calculate the correlation matrix ("pairwise.complete.obs" ensures any missing values won't affect the correlation calcs for remaining complete pairs)
correlation_matrix <- cor(numeric_data, use = "pairwise.complete.obs") 

# Create the correlation plot
corrplot(correlation_matrix, tl.col = "black", tl.cex = 0.6, order = 'AOE') # color (tl.col) and size (tl.cex) help style the plot.


Distribution Visualization

Here we show distributions of the variables as histograms. Note, Brand.Code is excluded because it doesn’t have the numeric data.

# Melt the data
mlt.train <- student_train  # Use your actual dataframe name
mlt.train$ID <- rownames(mlt.train)  # Assign row names to ID
mlt.train <- melt(mlt.train, id.vars = "ID")

# Convert the value column to numeric
mlt.train$value <- as.numeric(mlt.train$value)

# Create histograms of the predictors
ggplot(data = mlt.train, aes(x = value)) +
  geom_histogram(binwidth = 6, fill = "skyblue", color = "black", alpha = 0.8) +  # Adjust binwidth as needed
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Distributions of Predictors", x = "Predictors", y = "Frequency") +
  theme_minimal(base_size = 9) +  # Use a minimal theme for better clarity
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())  # Clean up grid lines


Outliers

We are seeing a lot of outliers in the boxplots, specifically for variables where the majority of their values are zero. It’s possible we will use near-zero variance filtering to resolve these. Below only the first 9 of 32 are shown for display purposes but the pattern is the same.

# Create boxplots for each numeric column
numeric_columns <- names(student_train)[sapply(student_train, is.numeric)]
plot_list <- lapply(numeric_columns[1:9], function(col) {
  ggplot(student_train, aes_string(y = col)) +
    geom_boxplot(outlier.color = "red", fill = "lightblue") +
    ggtitle(paste("Boxplot of", col)) +
    theme_minimal()
})

# Display all boxplots in a grid
do.call(grid.arrange, c(plot_list, ncol = 3))




Feature Engineering

Here we use feature engineering and regression techniques to evaluate our feature set.


Initial Linear Regression Model

Here we build the initial linear regression model.

# Setting up the model
model <- lm(PH ~ ., data = student_train)
summary(model)
## 
## Call:
## lm(formula = PH ~ ., data = student_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0652 -0.4500  0.0545  0.5017  4.3209 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -0.258937   0.085135  -3.042  0.00238 ** 
## Brand.CodeA        0.045707   0.141827   0.322  0.74727    
## Brand.CodeB        0.447021   0.078984   5.660 1.69e-08 ***
## Brand.CodeC       -0.381388   0.085887  -4.441 9.36e-06 ***
## Brand.CodeD        0.338218   0.161402   2.095  0.03623 *  
## Carb.Volume       -0.046235   0.042321  -1.092  0.27473    
## Fill.Ounces       -0.035288   0.016279  -2.168  0.03028 *  
## PC.Volume         -0.038051   0.018876  -2.016  0.04392 *  
## Carb.Pressure     -0.015593   0.059234  -0.263  0.79238    
## Carb.Temp          0.036313   0.053963   0.673  0.50107    
## PSC               -0.024380   0.016269  -1.499  0.13410    
## PSC.Fill          -0.023195   0.015868  -1.462  0.14393    
## PSC.CO2           -0.032079   0.015670  -2.047  0.04074 *  
## Mnf.Flow          -0.492621   0.031570 -15.604  < 2e-16 ***
## Carb.Pressure1     0.194895   0.019094  10.207  < 2e-16 ***
## Fill.Pressure      0.033001   0.022371   1.475  0.14030    
## Hyd.Pressure1     -0.004829   0.026303  -0.184  0.85436    
## Hyd.Pressure2     -0.100778   0.050088  -2.012  0.04432 *  
## Hyd.Pressure3      0.315571   0.053337   5.917 3.73e-09 ***
## Hyd.Pressure4     -0.016382   0.023264  -0.704  0.48138    
## Filler.Level      -0.093419   0.049450  -1.889  0.05899 .  
## Filler.Speed       0.029192   0.033531   0.871  0.38406    
## Temperature       -0.109919   0.018231  -6.029 1.89e-09 ***
## Usage.cont        -0.123601   0.019583  -6.312 3.25e-10 ***
## Carb.Flow          0.071620   0.022218   3.223  0.00128 ** 
## Density           -0.272023   0.060842  -4.471 8.13e-06 ***
## MFR               -0.036383   0.024352  -1.494  0.13529    
## Balling           -0.342804   0.122457  -2.799  0.00516 ** 
## Pressure.Vacuum   -0.053160   0.024621  -2.159  0.03093 *  
## Oxygen.Filler     -0.070359   0.018451  -3.813  0.00014 ***
## Bowl.Setpoint      0.290053   0.050216   5.776 8.58e-09 ***
## Pressure.Setpoint -0.100544   0.022901  -4.390 1.18e-05 ***
## Air.Pressurer     -0.019757   0.016531  -1.195  0.23215    
## Alch.Rel           0.192473   0.061961   3.106  0.00192 ** 
## Carb.Rel           0.037817   0.034485   1.097  0.27292    
## Balling.Lvl        0.475401   0.106823   4.450 8.94e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7644 on 2535 degrees of freedom
## Multiple R-squared:  0.4233, Adjusted R-squared:  0.4153 
## F-statistic: 53.16 on 35 and 2535 DF,  p-value: < 2.2e-16


Variance Inflation Factor (VIF)

With the above initial linear regression model we can access the Variance Inflation Factor (VIF) to detect multicollinearity in our regression model to help us select features.

Note, we see that Brand.Code is highly collinear.

# Calculating VIF
vif_values <- vif(model)
vif_values
##                        GVIF Df GVIF^(1/(2*Df))
## Brand.Code        54.107705  4        1.646863
## Carb.Volume        7.882378  1        2.807557
## Fill.Ounces        1.152505  1        1.073548
## PC.Volume          1.581700  1        1.257657
## Carb.Pressure     15.485791  1        3.935199
## Carb.Temp         12.828711  1        3.581719
## PSC                1.154528  1        1.074490
## PSC.Fill           1.102981  1        1.050229
## PSC.CO2            1.068829  1        1.033842
## Mnf.Flow           4.384391  1        2.093894
## Carb.Pressure1     1.600618  1        1.265155
## Fill.Pressure      2.200066  1        1.483262
## Hyd.Pressure1      3.035406  1        1.742242
## Hyd.Pressure2     11.015223  1        3.318919
## Hyd.Pressure3     12.478420  1        3.532481
## Hyd.Pressure4      2.369556  1        1.539336
## Filler.Level      10.723928  1        3.274741
## Filler.Speed       4.903247  1        2.214328
## Temperature        1.455792  1        1.206562
## Usage.cont         1.686112  1        1.298504
## Carb.Flow          2.170155  1        1.473145
## Density           16.280359  1        4.034893
## MFR                4.069385  1        2.017272
## Balling           65.953990  1        8.121206
## Pressure.Vacuum    2.666455  1        1.632928
## Oxygen.Filler      1.494970  1        1.222690
## Bowl.Setpoint     11.117253  1        3.334254
## Pressure.Setpoint  2.301930  1        1.517211
## Air.Pressurer      1.202058  1        1.096384
## Alch.Rel          16.860616  1        4.106168
## Carb.Rel           5.220788  1        2.284904
## Balling.Lvl       50.184296  1        7.084088


Near-Zero Variance

Next we check if any of the variables have a near zero variance and surprisingly only one variable, Hyd.Pressure1, is identified by default. We changed the parameters ‘freqCut’ and ‘uniqueCut’ but it required large changes to pick up the other variables, so we kept to only removing Hyd.Pressure1.

# Remove problematic predictors from train and test data
student_train_x <- student_train[, -nearZeroVar(student_train)]
student_eval_y <- student_eval[, -13]

# Display extended list of NZV predictors
nearZeroVar(student_train, freqCut = 60/40, uniqueCut = 40, names = TRUE)
## [1] "Brand.Code"    "Fill.Pressure" "Hyd.Pressure1" "Hyd.Pressure2"
## [5] "Hyd.Pressure3" "Carb.Flow"     "Bowl.Setpoint"


Reduce Multicollinearity

Three features, Brand.Code, Balling, and Balling.Lvl, have high multicollinearity which we will address in this section.


Brand.Code

Here we replace Brand.Code with BCB which will show if the record is brand code B or not. When we reran for individual values of Brand.Code, we see that Brand B appears to have a different relationship than the other brands, Brand B has a R-squared of ~0.7 while all the other brands have a R-squared of ~0.3. This suggests, we can replace all the brands with a binary variable of whether it is Brand B or not. (not pictured)

From this we identify that Balling and Balling.Lvl represent a lot of the remaining multicollinearity to be reduced.

# Replacing Brand.Code with BCB
student_train_x1 <- student_train_x |> mutate(BCB = as.numeric(Brand.Code =='B')) |> select(-Brand.Code)
student_eval_y1 <- student_eval_y |> mutate(BCB = as.numeric(Brand.Code =='B')) |> select(-Brand.Code)

# New model with BCB instead of Brand.Code
model <- lm(PH ~ ., data = student_train_x)
#summary(model)

# Calculating VIF
vif_values <- vif(model)
vif_values
##                        GVIF Df GVIF^(1/(2*Df))
## Brand.Code        53.942578  4        1.646234
## Carb.Volume        7.872617  1        2.805818
## Fill.Ounces        1.151544  1        1.073100
## PC.Volume          1.500187  1        1.224821
## Carb.Pressure     15.485772  1        3.935197
## Carb.Temp         12.827035  1        3.581485
## PSC                1.152323  1        1.073463
## PSC.Fill           1.102569  1        1.050033
## PSC.CO2            1.068812  1        1.033834
## Mnf.Flow           4.382860  1        2.093528
## Carb.Pressure1     1.579213  1        1.256668
## Fill.Pressure      2.180153  1        1.476534
## Hyd.Pressure2      8.964442  1        2.994068
## Hyd.Pressure3     12.433205  1        3.526075
## Hyd.Pressure4      2.368986  1        1.539151
## Filler.Level      10.684198  1        3.268669
## Filler.Speed       4.897880  1        2.213116
## Temperature        1.455792  1        1.206562
## Usage.cont         1.684034  1        1.297703
## Carb.Flow          2.168168  1        1.472470
## Density           16.274585  1        4.034177
## MFR                4.048062  1        2.011980
## Balling           65.945084  1        8.120658
## Pressure.Vacuum    2.582057  1        1.606878
## Oxygen.Filler      1.494535  1        1.222512
## Bowl.Setpoint     11.085276  1        3.329456
## Pressure.Setpoint  2.281597  1        1.510495
## Air.Pressurer      1.195910  1        1.093577
## Alch.Rel          16.848866  1        4.104737
## Carb.Rel           5.209240  1        2.282376
## Balling.Lvl       50.180230  1        7.083801


Balling vs. Balling.Lvl

In order to reduce the multicollinearity we are going to create a new predictor by dividing Balling.Lvl with Balling.

The division was arrived at after trying different operations and division resulted in the best correlation to PH then removing either individually.

For a guess at domain relevance, Balling.Lvl could be the target best value for the syrup’s specific gravity and Balling could be the actual, so dividing the two would result in a value similar to Mnf.Flow which is a percentage of the ideal manufacturing conveyor belt speed.

# Create new predictor PT
student_train_x2 <- student_train_x1 |> mutate(PT = Balling.Lvl/Balling) |> select(-c(Balling, Balling.Lvl))
student_eval_y2 <- student_eval_y1 |> mutate(PT = Balling.Lvl/Balling) |> select(-c(Balling, Balling.Lvl))


Interaction Terms

We also considered creating a Carbon column by multiplying Carb.Pressure and Carb.Volume and dividing by Carb.Temp. This is the Ideal Gas Law and would tell us how much carbon dioxide is being injected. However there are two values for pressure, Carb.Pressure and Carb.Pressure1 and we don’t have the units to combine them according to the law. From inspection they look like they’ve already been centered and scaled and so cannot be combined.


Step-wise Reduction

Since we’ve somewhat resolved our multicollinearity issue and have enough data we’re going to try a step-wise reduction model. This reduces our features to just the ones that have relevance for this model. Since we’re not relying on this to produce our final model but rather evaluating our features, step-wise reduction aligns with our goals and we don’t need to try regularization instead.

Note, our model has terrible performance with an R-Squared of ~0.41, but we’ve acquainted ourselves with the features and figured out how to resolve multicollinearity.

# Fit a step-wise linear regression model
model <- lm(PH ~ ., data = student_train_x2)
model <- stats::step(model, trace = 0)
summary(model)
## 
## Call:
## lm(formula = PH ~ Fill.Ounces + PC.Volume + Carb.Temp + PSC + 
##     PSC.Fill + PSC.CO2 + Mnf.Flow + Carb.Pressure1 + Fill.Pressure + 
##     Hyd.Pressure2 + Hyd.Pressure3 + Filler.Level + Temperature + 
##     Usage.cont + Carb.Flow + Density + MFR + Oxygen.Filler + 
##     Bowl.Setpoint + Pressure.Setpoint + Alch.Rel + Carb.Rel + 
##     BCB, data = student_train_x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0495 -0.4675  0.0790  0.5056  4.1328 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -0.32033    0.02654 -12.069  < 2e-16 ***
## Fill.Ounces       -0.03776    0.01602  -2.357 0.018477 *  
## PC.Volume         -0.04383    0.01823  -2.405 0.016256 *  
## Carb.Temp          0.02880    0.01532   1.880 0.060215 .  
## PSC               -0.02715    0.01628  -1.668 0.095464 .  
## PSC.Fill          -0.02523    0.01600  -1.576 0.115077    
## PSC.CO2           -0.03140    0.01578  -1.989 0.046772 *  
## Mnf.Flow          -0.48772    0.03167 -15.399  < 2e-16 ***
## Carb.Pressure1     0.19540    0.01883  10.377  < 2e-16 ***
## Fill.Pressure      0.03893    0.02228   1.748 0.080641 .  
## Hyd.Pressure2     -0.11549    0.04318  -2.675 0.007525 ** 
## Hyd.Pressure3      0.34060    0.05059   6.733 2.05e-11 ***
## Filler.Level      -0.09331    0.04930  -1.893 0.058505 .  
## Temperature       -0.11328    0.01751  -6.471 1.16e-10 ***
## Usage.cont        -0.11910    0.01927  -6.181 7.40e-10 ***
## Carb.Flow          0.07222    0.02024   3.568 0.000366 ***
## Density           -0.14542    0.04050  -3.590 0.000336 ***
## MFR               -0.03824    0.01421  -2.691 0.007173 ** 
## Oxygen.Filler     -0.06682    0.01844  -3.623 0.000297 ***
## Bowl.Setpoint      0.30697    0.04956   6.194 6.82e-10 ***
## Pressure.Setpoint -0.11768    0.02276  -5.170 2.52e-07 ***
## Alch.Rel           0.38026    0.04093   9.291  < 2e-16 ***
## Carb.Rel           0.05187    0.03074   1.687 0.091682 .  
## BCB                0.65887    0.04477  14.716  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7724 on 2547 degrees of freedom
## Multiple R-squared:  0.4083, Adjusted R-squared:  0.4029 
## F-statistic:  76.4 on 23 and 2547 DF,  p-value: < 2.2e-16



Initial MARS Model

Here we produce a Multivariate Adaptive Regression Splines (MARS) model to capture some of these non-linear interactions in our data.


Feature Selection

We’re using the features selected in the step-wise linear regression model.

# Features used in the previous step-wise linear regression model
student_train_x3 <- student_train_x2 |>
    select(PH, Fill.Ounces, PC.Volume, Carb.Temp, PSC, 
    PSC.Fill, PSC.CO2, Mnf.Flow, Carb.Pressure1, Fill.Pressure, 
    Hyd.Pressure2, Hyd.Pressure3, Filler.Level, Temperature, 
    Usage.cont, Carb.Flow, Density, MFR, Oxygen.Filler, 
    Bowl.Setpoint, Pressure.Setpoint, Alch.Rel, Carb.Rel, BCB)

student_eval_y3 <- student_eval_y2 |> 
    select(PH, Fill.Ounces, PC.Volume, Carb.Temp, PSC, 
    PSC.Fill, PSC.CO2, Mnf.Flow, Carb.Pressure1, Fill.Pressure, 
    Hyd.Pressure2, Hyd.Pressure3, Filler.Level, Temperature, 
    Usage.cont, Carb.Flow, Density, MFR, Oxygen.Filler, 
    Bowl.Setpoint, Pressure.Setpoint, Alch.Rel, Carb.Rel, BCB)


1st Order MARS Model

Here we run the initial MARS Model with only first order relationships.

# Basic MARS Model
y = student_train_x3$PH
x = student_train_x3 |> select(-PH)
marsFit <- earth(x, y)
summary(marsFit)
## Call: earth(x=x, y=y)
## 
##                              coefficients
## (Intercept)                     5.3418732
## BCB                             0.5096711
## h(-0.203956-Mnf.Flow)           0.9800807
## h(Carb.Pressure1- -1.55736)     0.1959255
## h(1.53618-Hyd.Pressure3)       -0.2270999
## h(Hyd.Pressure3-1.53618)        3.7237258
## h(Temperature- -0.699707)      -0.7786390
## h(1.3252-Temperature)          -0.4898665
## h(Temperature-1.3252)           0.8280311
## h(Usage.cont-0.365031)         -0.5993647
## h(Usage.cont-1.11723)           3.4859750
## h(-2.23001-Carb.Flow)          12.3573997
## h(Carb.Flow- -2.23001)          0.0991432
## h(-1.78438-Density)             2.7642866
## h(Density- -1.78438)            0.5163772
## h(Density- -1.09568)           -0.9627448
## h(-3.02915-MFR)                 0.1377848
## h(Oxygen.Filler- -0.738421)    -1.4493047
## h(3.58372-Oxygen.Filler)       -1.3142975
## h(Oxygen.Filler-3.58372)        1.5844172
## h(Bowl.Setpoint- -1.91638)     -1.3437708
## h(-1.26292-Bowl.Setpoint)      -0.6962323
## h(Bowl.Setpoint- -1.26292)      1.7803895
## h(Bowl.Setpoint-0.0440049)     -0.4550215
## h(1.16947-Pressure.Setpoint)    0.0674138
## h(Alch.Rel- -0.509457)          2.5734980
## h(Alch.Rel-0.361355)           -5.2860253
## h(0.519685-Alch.Rel)            0.8855864
## h(Alch.Rel-0.519685)            3.8218088
## h(Alch.Rel-1.46966)            -1.5882955
## h(-0.441138-Carb.Rel)          -0.1615024
## h(Carb.Rel- -0.441138)         -0.0935618
## 
## Selected 32 of 37 terms, and 14 of 23 predictors
## Termination condition: Reached nk 47
## Importance: Mnf.Flow, Alch.Rel, BCB, Usage.cont, Carb.Pressure1, ...
## Number of terms at each degree of interaction: 1 31 (additive model)
## GCV 0.5514965    RSS 1349.26    GRSq 0.4483104    RSq 0.4746078


Show Plots

And here we show the partial dependence plots of the first order MARS model.

# Plot partial dependencies
plotmo(marsFit)
##  plotmo grid:    Fill.Ounces   PC.Volume  Carb.Temp        PSC   PSC.Fill
##                  -0.01623723 -0.09531893 -0.0730481 -0.1334244 -0.1304864
##     PSC.CO2  Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure2 Hyd.Pressure3
##  -0.3813757 0.3367148      0.1293786    -0.4790381     0.4661799     0.4345061
##  Filler.Level Temperature Usage.cont Carb.Flow   Density       MFR
##       0.58271  -0.2657983  0.2710059 0.5212328 -0.512943 0.2456178
##  Oxygen.Filler Bowl.Setpoint Pressure.Setpoint   Alch.Rel   Carb.Rel BCB
##     -0.2881982      0.697465         -0.792231 -0.6677866 -0.2857598   0


2nd Order MARS Model

Here we run the initial MARS Model with second order relationships. This increased the R-squared from 0.475 to .489.

y = student_train_x3$PH
x = student_train_x3 |> select(-PH)

y_eval = student_eval_y3$PH
x_eval = student_eval_y3 |> select(-PH)

marsFit2 <- earth(x, y, degree = 2)
summary(marsFit2)
## Call: earth(x=x, y=y, degree=2)
## 
##                                                     coefficients
## (Intercept)                                           -0.2741453
## BCB                                                    0.2559212
## h(-0.203956-Mnf.Flow)                                  0.4940847
## h(0.50962-Hyd.Pressure3)                              -0.4641113
## h(Hyd.Pressure3-0.50962)                               0.5696218
## h(1.3252-Temperature)                                  0.3307546
## h(-0.785587-Oxygen.Filler) * BCB                      -2.7498870
## h(Oxygen.Filler- -0.785587) * BCB                     -0.1458478
## h(-1.26292-Bowl.Setpoint) * BCB                        0.4057510
## h(Bowl.Setpoint- -1.26292) * BCB                       0.3179062
## h(0.800226-PSC) * h(-0.203956-Mnf.Flow)                0.1075006
## h(-0.203956-Mnf.Flow) * h(Carb.Pressure1- -1.51519)    0.2101897
## h(-0.203956-Mnf.Flow) * h(Hyd.Pressure3-0.0213793)    -0.7230425
## h(-0.203956-Mnf.Flow) * h(Usage.cont-0.754562)         2.1408975
## h(-0.203956-Mnf.Flow) * h(-0.724848-Density)           1.2703030
## h(-0.203956-Mnf.Flow) * h(Density-1.50016)            -4.5082899
## h(Mnf.Flow- -0.203956) * h(Bowl.Setpoint-0.0440049)   -0.2170369
## h(Mnf.Flow- -0.203956) * h(0.0440049-Bowl.Setpoint)   -0.2066971
## h(Mnf.Flow- -0.203956) * h(Alch.Rel-0.519685)          0.4830633
## h(Mnf.Flow- -0.203956) * h(0.519685-Alch.Rel)         -0.3692237
## h(-0.203956-Mnf.Flow) * h(1.26802-Carb.Rel)           -0.2740200
## h(Mnf.Flow- -0.203956) * h(Carb.Rel- -1.21803)        -0.1579294
## h(Mnf.Flow- -0.203956) * h(-1.21803-Carb.Rel)         -0.6507008
## h(0.50962-Hyd.Pressure3) * h(Alch.Rel- -0.509457)      0.1808274
## h(0.50962-Hyd.Pressure3) * h(-0.509457-Alch.Rel)       0.5649332
## h(1.3252-Temperature) * h(Usage.cont-0.358315)        -0.5300740
## 
## Selected 26 of 35 terms, and 12 of 23 predictors
## Termination condition: RSq changed by less than 0.001 at 35 terms
## Importance: Mnf.Flow, Temperature, Usage.cont, Bowl.Setpoint, Alch.Rel, ...
## Number of terms at each degree of interaction: 1 5 20
## GCV 0.5362348    RSS 1311.399    GRSq 0.4635775    RSq 0.4893508


Show Plots

# Partial dependence plots
plotmo(marsFit2)
##  plotmo grid:    Fill.Ounces   PC.Volume  Carb.Temp        PSC   PSC.Fill
##                  -0.01623723 -0.09531893 -0.0730481 -0.1334244 -0.1304864
##     PSC.CO2  Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure2 Hyd.Pressure3
##  -0.3813757 0.3367148      0.1293786    -0.4790381     0.4661799     0.4345061
##  Filler.Level Temperature Usage.cont Carb.Flow   Density       MFR
##       0.58271  -0.2657983  0.2710059 0.5212328 -0.512943 0.2456178
##  Oxygen.Filler Bowl.Setpoint Pressure.Setpoint   Alch.Rel   Carb.Rel BCB
##     -0.2881982      0.697465         -0.792231 -0.6677866 -0.2857598   0


Overfitting

When we originally completed the above model we got an R-Squared of 0.89 however upon subsequently running it with 10-fold cross validation we observed the initially high result was likely due to overfitting. MARS models, especially with high orders of interactions, are susceptible to overfitting.

We could duplicate that result by replacing the imputation of missing values with zeros, not removing the one feature with near-zero variance, and doing the balling vs. balling.lvl feature combination after the step-wise reduction.

Instead, we’re using our current model and will take a different tack starting in the next section.




Non-Linear Models

Here we’re doing a range of models to get a sense of what works.


Data

We’re starting over on the data after the imputation step and removing the one near-zero variance predictor.


Split up data

Here we split up the data so that PH is our trainy and the remaining data is our trainx. Similarly for our test data we have testx however there is no PH data for test.

# Split up available data into train and test versions of x and y
trainx <- student_train_x |> select(-PH)
trainy <- student_train_x$PH
testx <- student_eval_y |> select(-PH)


One-hot encoding

We have one character variable Brand.Code that is categorial with no inherent order:

Blank - 120 A - 293 B - 1239 C - 304 D - 615

Here we set the Blanks to brand code M for miscellaneous and use one-hot encoding to split them up into five variables with a 1 in the column corresponding to which Brand Code that row is.

# Assign blanks as "M"
trainx$Brand.Code[trainx$Brand.Code == ""] <- "M"
testx$Brand.Code[testx$Brand.Code == ""] <- "M"

# Convert Brand.Code to a factor
trainx$Brand.Code <- as.factor(trainx$Brand.Code)
testx$Brand.Code <- as.factor(testx$Brand.Code)

# One-hot encoding
dummy_model <- dummyVars(~ ., data = trainx)
trainx <- predict(dummy_model, trainx)
testx <- predict(dummy_model, testx)


Modeling

Here we look at the kNN, SVM, MARS, and XGBoost models.


kNN

k-Nearest Neighbors results in a weak model with an R-Squared of 0.481 when run with 13 as the number of nearest neighbors.

# Train the kNN model
knnModel <- train(x = trainx,
                  y = trainy,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)
knnModel
## k-Nearest Neighbors 
## 
## 2571 samples
##   35 predictor
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 2571, 2571, 2571, 2571, 2571, 2571, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE      
##    5  0.7553386  0.4600991  0.5516203
##    7  0.7385944  0.4740297  0.5456211
##    9  0.7296617  0.4825545  0.5435731
##   11  0.7291528  0.4819977  0.5452328
##   13  0.7285882  0.4821892  0.5474974
##   15  0.7307512  0.4789545  0.5504139
##   17  0.7325872  0.4764009  0.5534532
##   19  0.7335725  0.4749351  0.5557323
##   21  0.7354178  0.4725328  0.5579163
##   23  0.7373076  0.4699529  0.5604149
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 13.


Support Vector Machines

Support Vector Machines results in a weak model with an Rsquared of 0.586 when C = 8. The C being high like this means that misclassifications are penalized more heavily resulting in a more complex decision boundary.

# Train the SVM model
svmModel <- train(x = trainx,
                  y = trainy,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneLength = 14,
                  trControl = trainControl(method = "cv"))
svmModel
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 2571 samples
##   35 predictor
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2314, 2315, 2314, 2313, 2313, 2314, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE       Rsquared   MAE      
##      0.25  0.7163030  0.4959287  0.5314450
##      0.50  0.6962312  0.5208256  0.5124825
##      1.00  0.6785863  0.5430775  0.4968808
##      2.00  0.6655149  0.5589021  0.4862203
##      4.00  0.6521532  0.5763907  0.4774957
##      8.00  0.6451258  0.5864822  0.4732103
##     16.00  0.6497834  0.5845911  0.4770923
##     32.00  0.6675068  0.5704638  0.4910538
##     64.00  0.6978723  0.5456811  0.5151205
##    128.00  0.7378712  0.5134602  0.5439043
##    256.00  0.7800723  0.4855438  0.5742860
##    512.00  0.8204879  0.4610548  0.6044797
##   1024.00  0.8500779  0.4416762  0.6245598
##   2048.00  0.8529043  0.4393422  0.6270423
## 
## Tuning parameter 'sigma' was held constant at a value of 0.02016894
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.02016894 and C = 8.


MARS

Multivariate Adaptive Regression Splines results in a weak model with an Rsquared of 0.517 with nprune = 33 and degree = 6. This means there are 33 basis functions with six-way interactions. This may be suitable to highly nonlinear and interactive data but significantly risks overfitting.

# Train the MARS model
marsGrid <- expand.grid(.degree = 6, .nprune = 30:40)
set.seed(175175327)
marsModel <- train(x = trainx,
                   y = trainy,
                   method = "earth",
                   tuneGrid = marsGrid,
                   trControl = trainControl(method = "cv"))
marsModel
## Multivariate Adaptive Regression Spline 
## 
## 2571 samples
##   35 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2314, 2313, 2313, 2315, 2314, 2314, ... 
## Resampling results across tuning parameters:
## 
##   nprune  RMSE       Rsquared   MAE      
##   30      0.7056510  0.5085210  0.5221312
##   31      0.7048986  0.5098481  0.5215941
##   32      0.7047143  0.5099580  0.5213002
##   33      0.7000219  0.5158733  0.5199846
##   34      0.6983310  0.5180715  0.5194377
##   35      0.6960146  0.5210178  0.5176798
##   36      0.6983852  0.5183097  0.5189310
##   37      0.6979122  0.5190971  0.5184589
##   38      0.6971574  0.5203998  0.5175004
##   39      0.6958715  0.5222221  0.5162749
##   40      0.6959166  0.5228376  0.5153354
## 
## Tuning parameter 'degree' was held constant at a value of 6
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 39 and degree = 6.


XGBoost

eXtreme Gradient Boosting also produced similarly weak results with an R-Squared of 0.455.

library(caret)

# Train XGBoost model using caret
xgb_model <- train(
  x = trainx,
  y = trainy,
  method = "xgbTree",
  tuneLength = 1,  # Automatic hyperparameter tuning
  trControl = trainControl(method = "cv", number = 10)  # Cross-validation
)

# Summary of the model
print(xgb_model)
## eXtreme Gradient Boosting 
## 
## 2571 samples
##   35 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2314, 2314, 2315, 2314, 2313, 2315, ... 
## Resampling results across tuning parameters:
## 
##   eta  colsample_bytree  RMSE       Rsquared   MAE      
##   0.3  0.6               0.7574076  0.4312850  0.5953737
##   0.3  0.8               0.7512717  0.4397941  0.5889623
##   0.4  0.6               0.7421128  0.4533415  0.5791337
##   0.4  0.8               0.7486802  0.4414322  0.5831434
## 
## Tuning parameter 'nrounds' was held constant at a value of 50
## Tuning
##  parameter 'min_child_weight' was held constant at a value of 1
## 
## Tuning parameter 'subsample' was held constant at a value of 0.5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 50, max_depth = 1, eta
##  = 0.4, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1 and subsample
##  = 0.5.



Appendix 1

Here we capture some related group work that followed a variant path to explore a ridge regression and a lasso regression model.


Preprocessing

Here we remove highly colinear variables, handle missing values and categorical predictors with one-hot encoding.


Reduce Multicollinearity

Removing variables with high VIF values (such as Brand.Code, Balling, and Density) to reduce multicollinearity.

# Install and load the 'car' package
#install.packages("car")
library(car)

# Check VIF for the current model
vif(model)
##       Fill.Ounces         PC.Volume         Carb.Temp               PSC 
##          1.092529          1.444174          1.012578          1.132060 
##          PSC.Fill           PSC.CO2          Mnf.Flow    Carb.Pressure1 
##          1.098658          1.061950          4.321046          1.524239 
##     Fill.Pressure     Hyd.Pressure2     Hyd.Pressure3      Filler.Level 
##          2.136003          8.014966         10.993119         10.437462 
##       Temperature        Usage.cont         Carb.Flow           Density 
##          1.314374          1.598617          1.763477          7.064779 
##               MFR     Oxygen.Filler     Bowl.Setpoint Pressure.Setpoint 
##          1.356792          1.462939         10.604306          2.226677 
##          Alch.Rel          Carb.Rel               BCB 
##          7.203861          4.063375          2.156663
# Refine the model by removing variables with high VIF (e.g., Brand.Code, Balling)
model_refined <- lm(PH ~ . -Brand.Code -Balling, data = student_train)
model_refined 
# Check VIF again for the refined model
vif_refine<-vif(model_refined)
vif_refine


Missing Values

Here we handle missing values with kNN imputation and check that there are zero remaining missing values:

# Check if there are missing values in the entire dataset (none)
#sum(is.na(student_train))

# Impute missing values using kNN
imputeModel <- preProcess(student_train, method = "knnImpute")

# Apply the imputation to both training and test data
student_train_imputed <- predict(imputeModel, student_train)
student_eval_imputed <- predict(imputeModel, student_eval)

# Check if there are any missing values after imputation (none)
#sum(is.na(student_train_imputed))  # Should return 0 if imputation was successful
#sum(is.na(student_eval_imputed))   # Should also return 0 for the test data

# Prepare the predictor matrix (without the target variable 'PH')
x <- as.matrix(student_train_imputed[, -which(names(student_train_imputed) == "PH")])

# Check if there are missing values in the predictor matrix
sum(is.na(x))  # This should return 0 if there are no missing values
## [1] 0


One-Hot Encoding

Here we use one-hot encoding to get dummy variables.

# Remove PH column for dummy encoding
student_train_predictors <- student_train_imputed[, -which(names(student_train_imputed) == "PH")]

# Apply one-hot encoding to the predictors
dummy_model <- dummyVars(~ ., data = student_train_predictors)

# Apply the encoding to the dataset
x <- predict(dummy_model, student_train_predictors)

# Convert to matrix format
x <- as.matrix(x)  

# Should return 0 if there are no missing values (correct, none)
#sum(is.na(x))  


Modeling

Here we fit two models, ridge regression and lasso regression.


Ridge Regression

Here we run the ridge regression model.

Note, the resulting coefficient of -0.031565321 for Carb.Volume means that as Carb.Volume increases by 1 unit, the PH decreases by 0.031565321, assuming all other predictors are held constant.

Negative Coefficients: Features like Brand.Code, Carb.Volume, and Temperature are negatively correlated with PH, meaning as these features increase, the value of PH tends to decrease.

Positive Coefficients: Features like Balling.Lvl and Oxygen.Filler are positively correlated with PH, meaning as these features increase, PH is predicted to increase as well.

# Impute missing values using kNN
imputeModel <- preProcess(student_train, method = "knnImpute")

# Apply the imputation to both training and test data
student_train_imputed <- predict(imputeModel, student_train)
student_eval_imputed <- predict(imputeModel, student_eval)

# Remove PH column for dummy encoding (exclude target variable)
student_train_predictors <- student_train_imputed[, -which(names(student_train_imputed) == "PH")]

# Apply one-hot encoding to the predictors
dummy_model <- dummyVars(~ ., data = student_train_predictors)
x <- predict(dummy_model, student_train_predictors)

# Convert to matrix format (glmnet requires a matrix)
x <- as.matrix(x)

# Check if there are any missing values in the predictor matrix
sum(is.na(x))  # Should return 0 if no missing values
## [1] 0
# Define target variable 'y'
y <- student_train_imputed$PH

# Fit Ridge Regression model (alpha = 0 for Ridge)
ridge_model <- glmnet(x, y, alpha = 0)

# View the coefficients
#print(coef(ridge_model))

# Perform cross-validation to find the best lambda
cv_ridge <- cv.glmnet(x, y, alpha = 0) #Performs cross-validation to select the best regularization parameter (lambda).

# Plot the cross-validation results
plot(cv_ridge)

# Get the best lambda from the cross-validation
best_lambda <- cv_ridge$lambda.min # The lambda value that gives the minimum mean cross-validation error.
cat("Best Lambda for Ridge:", best_lambda, "\n")
## Best Lambda for Ridge: 0.04472021
# Refit the Ridge regression model with the best lambda
ridge_model_best <- glmnet(x, y, alpha = 0, lambda = best_lambda)

# Get the coefficients for the best model
#coef(ridge_model_best) # Extracts the coefficients of the final Ridge model.


Ridge with Cross-Validation

Here we seek to improve lambda tuning using cross-validation.

Note, this resulted in a Mean Squared Error (MSE) of 0.5647799 for the best Ridge regression model which is a significant improvement compared to the previous MSE of 153.4756. This suggests that the model is now performing better in terms of prediction accuracy.

The reduction in MSE after applying cross-validation and tuning the lambda parameter implies that the regularization parameter has been effectively optimized.

# Use cross-validation to find the best lambda for Ridge regression
cv_ridge <- cv.glmnet(x, y, alpha = 0)

# Plot cross-validation results
plot(cv_ridge)

# Get the best lambda
best_lambda <- cv_ridge$lambda.min
cat("Best Lambda for Ridge:", best_lambda, "\n")
## Best Lambda for Ridge: 0.04472021
# Refit the model with the best lambda
ridge_model_best <- glmnet(x, y, alpha = 0, lambda = best_lambda)

# Get the coefficients for the best Ridge model
#coef(ridge_model_best)


Lasso Regression

Here we run the lasso regression model.

This results in some variables with a coefficient of zero.

Non-zero Coefficients: These are the features that have been selected by Lasso. For example, Brand.Code, Carb.Volume, Carb.Pressure, etc., have non-zero coefficients, suggesting they have a relationship with PH.

Zero Coefficients: Features like Carb.Volume and Filler.Level have been shrunk to zero, indicating that Lasso determined these features were not important in predicting PH.

# Fit Lasso Regression model (alpha = 1 for Lasso)
lasso_model <- glmnet(x, y, alpha = 1)

# Use cross-validation to select the best lambda
cv_lasso <- cv.glmnet(x, y, alpha = 1)

# Plot the cross-validation results
plot(cv_lasso)

# Best lambda for Lasso
best_lambda_lasso <- cv_lasso$lambda.min
cat("Best Lambda for Lasso:", best_lambda_lasso, "\n")
## Best Lambda for Lasso: 0.0006640809
# Refit the model with the best lambda
lasso_model_best <- glmnet(x, y, alpha = 1, lambda = best_lambda_lasso)

# Get the coefficients for the best Lasso model
#coef(lasso_model_best)



Appendix 2

Here we capture related group work that followed the same path of our initial data setup with kNN imputation, removing the near zero-variance feature Hyd.Pressure1, replacing Brand.Code with BCB for Brand code B or not, and creating a new variable PT out of Balling.Lvl/Balling. Then picking up right before the stepwise reduction leading to our initial MARS model.

Here we will use PCR/PLS for further feature reduction.

  1. Do PCR and analyze (rsquared)
  2. Do PLS, get optimal number of components using RMSEP, do PCR and analyze (rsquared)


PCA

Here we produce the principal components.

# Separate predictors and response
X_train <- student_train_x2 %>% select(-PH)
y_train <- student_train_x2$PH

X_test <- student_eval_y2 %>% select(-PH)
#y_test <- student_eval_y2$PH # test PH scores not available.

# Scale the training and test sets
# We'll use caret's preProcess for scaling
preProcValues <- preProcess(X_train, method = c("center", "scale"))
X_train_scaled <- predict(preProcValues, X_train)
X_test_scaled  <- predict(preProcValues, X_test)

# Perform PCA on the scaled training predictors
pca_result <- prcomp(X_train_scaled, center = FALSE, scale. = FALSE)

# Summary of PCA variance
summary(pca_result)
## Importance of components:
##                           PC1    PC2     PC3     PC4    PC5     PC6     PC7
## Standard deviation     2.2811 2.1735 1.68172 1.32321 1.2562 1.19853 1.16009
## Proportion of Variance 0.1734 0.1575 0.09427 0.05836 0.0526 0.04788 0.04486
## Cumulative Proportion  0.1734 0.3309 0.42520 0.48356 0.5362 0.58404 0.62890
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.04011 1.01262 0.99666 0.94693 0.91894 0.87965 0.86579
## Proportion of Variance 0.03606 0.03418 0.03311 0.02989 0.02815 0.02579 0.02499
## Cumulative Proportion  0.66496 0.69914 0.73225 0.76214 0.79029 0.81608 0.84107
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.84955 0.77508 0.71546 0.69902 0.68471 0.64256 0.62167
## Proportion of Variance 0.02406 0.02002 0.01706 0.01629 0.01563 0.01376 0.01288
## Cumulative Proportion  0.86513 0.88515 0.90222 0.91850 0.93413 0.94789 0.96078
##                           PC22    PC23   PC24    PC25    PC26    PC27    PC28
## Standard deviation     0.52727 0.47220 0.4414 0.37281 0.35244 0.29545 0.22780
## Proportion of Variance 0.00927 0.00743 0.0065 0.00463 0.00414 0.00291 0.00173
## Cumulative Proportion  0.97004 0.97748 0.9840 0.98861 0.99275 0.99566 0.99738
##                           PC29    PC30
## Standard deviation     0.21788 0.17601
## Proportion of Variance 0.00158 0.00103
## Cumulative Proportion  0.99897 1.00000


Contributions of Variables

The principal components are chosen based on max variability, we can crack open PC1 and PC2 to see which variables cover the widest range of variability and get a sense for how PCR would reduce multi-collinearity:

# Contributions to PC1
fviz_contrib(pca_result, choice = "var", axes = 1, top = 10) +
  labs(title = "Contributions of Variables to PC1")

# Contributions to PC1
fviz_contrib(pca_result, choice = "var", axes = 2, top = 10) +
  labs(title = "Contributions of Variables to PC2")


PCR

On to PCR:

# Set a seed for reproducibility
set.seed(123)

# Using the already processed data 'student_train_x2'
# Ensure that PH is the response and the rest are predictors
pcr_model <- pcr(PH ~ ., data = student_train_x2, scale = TRUE,  validation = "CV")   

# Summary of the PCR model
summary(pcr_model)
## Data:    X dimension: 2571 30 
##  Y dimension: 2571 1
## Fit method: svdpc
## Number of components considered: 30
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV          0.9998   0.9577   0.9589   0.9506   0.9499   0.9461   0.9611
## adjCV       0.9998   0.9542   0.9546   0.9467   0.9461   0.9427   0.9564
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV      0.9508    1.013     1.13     1.849     1.840     1.143     1.280
## adjCV   0.9459    1.003     1.11     1.778     1.767     1.120     1.244
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps  20 comps
## CV        1.253     1.158     1.279     1.183     1.025     1.029    0.9195
## adjCV     1.219     1.131     1.242     1.152     1.008     1.011    0.9108
##        21 comps  22 comps  23 comps  24 comps  25 comps  26 comps  27 comps
## CV       0.9870    0.9840    1.0152     1.021     1.035     1.016    0.9434
## adjCV    0.9703    0.9676    0.9958     1.001     1.013     0.996    0.9284
##        28 comps  29 comps  30 comps
## CV       1.0018    1.0000    0.9993
## adjCV    0.9815    0.9797    0.9791
## 
## TRAINING: % variance explained
##     1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X     17.35    33.09    42.52    48.36    53.62    58.40    62.89    66.50
## PH    15.05    16.53    17.27    17.29    17.32    17.33    19.34    19.34
##     9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X     69.91     73.23     76.21     79.03     81.61     84.11     86.51
## PH    19.62     19.76     23.33     23.41     26.33     27.28     28.56
##     16 comps  17 comps  18 comps  19 comps  20 comps  21 comps  22 comps
## X      88.52     90.22     91.85     93.41     94.79     96.08      97.0
## PH     29.86     31.45     31.57     32.82     32.87     35.79      35.8
##     23 comps  24 comps  25 comps  26 comps  27 comps  28 comps  29 comps
## X      97.75     98.40     98.86     99.27     99.57     99.74     99.90
## PH     36.74     37.65     37.80     38.05     39.93     40.80     40.89
##     30 comps
## X     100.00
## PH     40.91


PCR Plot

# Plot the cross-validation results using validationplot()

# val.type options: "MSEP", "RMSEP", "R2"
validationplot(pcr_model, val.type = "RMSEP", main = "PCR Cross-Validation (RMSEP)")


Show PCR RMSEP

# Print RMSEP for each number of components
rmsep_values <- RMSEP(pcr_model, estimate = "CV")
rmsep_values
## (Intercept)      1 comps      2 comps      3 comps      4 comps      5 comps  
##      0.9998       0.9577       0.9589       0.9506       0.9499       0.9461  
##     6 comps      7 comps      8 comps      9 comps     10 comps     11 comps  
##      0.9611       0.9508       1.0134       1.1304       1.8490       1.8405  
##    12 comps     13 comps     14 comps     15 comps     16 comps     17 comps  
##      1.1432       1.2803       1.2528       1.1584       1.2790       1.1828  
##    18 comps     19 comps     20 comps     21 comps     22 comps     23 comps  
##      1.0253       1.0294       0.9195       0.9870       0.9840       1.0152  
##    24 comps     25 comps     26 comps     27 comps     28 comps     29 comps  
##      1.0210       1.0348       1.0162       0.9434       1.0018       1.0000  
##    30 comps  
##      0.9993


PCR R2 Plot

# Also check R² to see how well the model explains the variance
validationplot(pcr_model, val.type = "R2", main = "PCR Cross-Validation (R2)")


PCR R2 Table

It looks like 5 components gives us both the highest R2 and lowest RMSEP before the model sharply drops in performance. It does improve again at 20 components, but for the sake of having a simpler model we will stick to 5.

# Print out R2 for each number of components
r2_values <- R2(pcr_model, estimate = "CV")
r2_values
## (Intercept)      1 comps      2 comps      3 comps      4 comps      5 comps  
##  -0.0007784    0.0817204    0.0795193    0.0953255    0.0966418    0.1037929  
##     6 comps      7 comps      8 comps      9 comps     10 comps     11 comps  
##   0.0753052    0.0948758   -0.0280725   -0.2791882   -2.4227091   -2.3911300  
##    12 comps     13 comps     14 comps     15 comps     16 comps     17 comps  
##  -0.3082899   -0.6410022   -0.5713197   -0.3432995   -0.6376337   -0.4005158  
##    18 comps     19 comps     20 comps     21 comps     22 comps     23 comps  
##  -0.0524146   -0.0608081    0.1535729    0.0247478    0.0307351   -0.0317707  
##    24 comps     25 comps     26 comps     27 comps     28 comps     29 comps  
##  -0.0436857   -0.0720937   -0.0337501    0.1089018   -0.0047271   -0.0010823  
##    30 comps  
##   0.0002551


PCR Model

The results with PCR are lackluster.

library(pls)

# Set a seed for reproducibility
set.seed(123)

# Fit the PCR model with exactly 5 components
pcr_model_5 <- pcr(PH ~ ., 
                   data = student_train_x2, 
                   scale = TRUE, 
                   ncomp = 5, 
                   validation = "CV")

# Print summary of the model
summary(pcr_model_5)
## Data:    X dimension: 2571 30 
##  Y dimension: 2571 1
## Fit method: svdpc
## Number of components considered: 5
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps
## CV          0.9998   0.9577   0.9589   0.9506   0.9499   0.9461
## adjCV       0.9998   0.9542   0.9546   0.9467   0.9461   0.9427
## 
## TRAINING: % variance explained
##     1 comps  2 comps  3 comps  4 comps  5 comps
## X     17.35    33.09    42.52    48.36    53.62
## PH    15.05    16.53    17.27    17.29    17.32
# Extract R2 and RMSEP for cross-validation
r2_5 <- R2(pcr_model_5, estimate = "CV")
rmsep_5 <- RMSEP(pcr_model_5, estimate = "CV")


# If you have a test set (student_eval_y2) and want to make predictions:

predictions_5 <- predict(pcr_model_5, newdata = student_eval_y2, ncomp = 5)
  
# Calculate prediction metrics on the test set
test_res <- postResample(predictions_5, student_eval_y2$PH)
cat("\nTest set performance (using 5 components):\n")
## 
## Test set performance (using 5 components):
print(test_res)
##      RMSE  Rsquared       MAE 
## 0.6155150 0.2832964 0.5087567
    # RMSE       0.6155150 
    # Rsquared   0.2832964
    # MAE        0.5087567 


PLS

on to PLS:

library(pls)

set.seed(123) # For reproducibility

# 1. Make the PLS model using cross-validation
pls_model <- plsr(PH ~ ., 
                  data = student_train_x2,
                  scale = TRUE,          # Scale predictors
                  validation = "CV")     # Use cross-validation

# Print a summary of the model to see initial info
summary(pls_model)
## Data:    X dimension: 2571 30 
##  Y dimension: 2571 1
## Fit method: kernelpls
## Number of components considered: 30
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV          0.9998   0.9918    1.201    1.191    1.084   0.9236   0.9112
## adjCV       0.9998   0.9813    1.168    1.158    1.059   0.9108   0.8990
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       1.094   0.9879   0.9589    0.9967    1.0059    0.9694    0.9848
## adjCV    1.066   0.9688   0.9421    0.9767    0.9852    0.9517    0.9658
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps  20 comps
## CV       1.0151    1.0006    0.9967    0.9999    0.9993    1.0001    1.0004
## adjCV    0.9936    0.9803    0.9767    0.9797    0.9791    0.9798    0.9801
##        21 comps  22 comps  23 comps  24 comps  25 comps  26 comps  27 comps
## CV       0.9988    0.9991    0.9994    0.9992    0.9993    0.9992    0.9993
## adjCV    0.9787    0.9789    0.9792    0.9790    0.9791    0.9790    0.9791
##        28 comps  29 comps  30 comps
## CV       0.9993    0.9993    0.9993
## adjCV    0.9791    0.9791    0.9791
## 
## TRAINING: % variance explained
##     1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X     16.76    21.44    31.84    41.84    47.84    50.65    52.86    56.85
## PH    22.63    34.56    36.74    37.89    38.94    39.80    40.40    40.60
##     9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X     59.57     62.18     65.80     68.10     70.52      72.8     75.89
## PH    40.75     40.83     40.86     40.88     40.89      40.9     40.90
##     16 comps  17 comps  18 comps  19 comps  20 comps  21 comps  22 comps
## X      77.86     79.26     81.43     83.71     85.10     87.22     89.29
## PH     40.90     40.90     40.90     40.91     40.91     40.91     40.91
##     23 comps  24 comps  25 comps  26 comps  27 comps  28 comps  29 comps
## X      90.37     92.79     93.90     95.58     96.89     97.89     98.51
## PH     40.91     40.91     40.91     40.91     40.91     40.91     40.91
##     30 comps
## X     100.00
## PH     40.91


PLS RMSEP Plot

# Using validationplot to check RMSEP across components
validationplot(pls_model, val.type = "RMSEP", main = "PLS: CV RMSEP by Number of Components")

# This plot helps visualize how RMSEP changes as we increase the number of components.


PLS R2 Plot

validationplot(pls_model, val.type = "R2", main = "PLS: CV R2 by Number of Components")


PLS Model

We arrive at a PLS Model with an R2 of 43.43%.

# Check R² via R2() function and RMSEP via RMSEP() function
r2_values_pls <- R2(pls_model, estimate = "CV")
rmsep_values_pls <- RMSEP(pls_model, estimate = "CV")
r2_values <- r2_values_pls$val["CV", "PH", ]
r2_comps <- r2_values[-1]

rmse_values <- rmsep_values_pls$val["CV", "PH", ]
rmsep_comps <- rmse_values[-1]
# Identify the best number of components based on minimum RMSEP
opt_comp_rmsep <- which.min(rmsep_comps)
min_rmsep <- rmsep_comps[opt_comp_rmsep]

# Identify the best number of components based on max R²
opt_comp_r2 <- which.max(r2_comps)
max_r2 <- r2_comps[opt_comp_r2]

cat("\nOptimal number of components based on RMSEP:", opt_comp_rmsep, "with RMSEP =", min_rmsep, "\n")
## 
## Optimal number of components based on RMSEP: 6 with RMSEP = 0.9111578
cat("Optimal number of components based on R²:", opt_comp_r2, "with R² =", max_r2, "\n")
## Optimal number of components based on R²: 6 with R² = 0.1688543
# Let's choose the optimal number of components. In practice, you might consider a balance 
# between minimal RMSEP and complexity. Here, let's use the one chosen by RMSEP for demonstration.
final_ncomp <- opt_comp_rmsep

cat("\nFinal chosen number of components:", final_ncomp, "\n")
## 
## Final chosen number of components: 6
# 4. Make a final PLS model with the chosen number of components and evaluate it
pls_model_final <- plsr(PH ~ ., 
                        data = student_train_x2,
                        scale = TRUE,
                        ncomp = final_ncomp,
                        validation = "none") # no need for CV here since we fixed ncomp

# Summary of the final model
summary(pls_model_final)
## Data:    X dimension: 2571 30 
##  Y dimension: 2571 1
## Fit method: kernelpls
## Number of components considered: 6
## TRAINING: % variance explained
##     1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## X     16.76    21.44    31.84    41.84    47.84    50.65
## PH    22.63    34.56    36.74    37.89    38.94    39.80
preds_pls <- predict(pls_model_final, newdata = student_eval_y2, ncomp = final_ncomp)
  
results_test <- postResample(preds_pls, student_eval_y2$PH)
cat("\nTest set performance with", final_ncomp, "components:\n")
## 
## Test set performance with 6 components:
print(results_test)
##      RMSE  Rsquared       MAE 
## 0.5593341 0.4342728 0.4397250
# RMSE     0.5593341
# Rsquared 0.4342728       
# MAE      0.4397250


X-Loadings

X-Loadings indicate how the original predictors are combined linearly to form the PLS components. A higher absolute loading value means the predictor strongly influences that component.

# View the X-loadings
x_loadings <- pls_model_final$loadings
cat("X-loadings:\n")
## X-loadings:
# print(x_loadings)

# View the X-loading weights
x_loading_weights <- pls_model_final$loading.weights
cat("\nX-loading weights:\n")
## 
## X-loading weights:
# print(x_loading_weights)

# Convert to data frames for easier handling
x_loadings_matrix <- unclass(pls_model_final$loadings)
x_loadings_df <- as.data.frame(x_loadings_matrix)

x_loadings_weights_matrix <- unclass(pls_model_final$loading.weights)
x_loading_weights_df <- as.data.frame(x_loadings_matrix)

# Rename columns for clarity: each column corresponds to a component
colnames(x_loadings_df) <- paste0("Comp", 1:ncol(x_loadings_df))
colnames(x_loading_weights_df) <- paste0("Comp", 1:ncol(x_loading_weights_df))

# Let's also see the coefficients for each number of components
coefficients_array <- pls_model_final$coefficients
# `coefficients_array` is a multidimensional array: [predictor, response, component]
# For a single response model, we can simplify it:
coefficients_matrix <- coefficients_array[,1,] # Extract for the single response variable
colnames(coefficients_matrix) <- paste0("Comp", 1:ncol(coefficients_matrix))

cat("\nCoefficients for each component:\n")
## 
## Coefficients for each component:
#print(coefficients_matrix)

# Visualizing Loadings
# As with PCA, you can plot the loadings to see which predictors have strong influence on each component.
library(reshape2)
library(ggplot2)

# Melt the loadings data frame for plotting
x_loadings_long <- melt(x_loadings_df, variable.name = "Component", value.name = "Loading")
x_loadings_long$Variable <- rownames(x_loadings_df)

ggplot(x_loadings_long, aes(x = reorder(Variable, Loading), y = Loading, fill = Component)) +
  geom_bar(stat="identity", position="dodge") +
  coord_flip() +
  facet_wrap(~ Component, scales = "free_y") +
  theme_minimal() +
  labs(title = "PLS X-Loadings by Component", x = "Predictors", y = "Loading")


PLS Coefficients

Coefficients show how each predictor contributes to predicting the response variable when using a given number of components. Higher absolute coefficients indicate greater influence of a predictor on the predicted outcome.

# Extract the coefficients matrix from the PLS model
coefficients_array <- pls_model_final$coefficients
coefficients_matrix <- coefficients_array[, 1, ] # For a single-response model
colnames(coefficients_matrix) <- paste0("Comp", 1:ncol(coefficients_matrix))

# Convert to a data frame
coefficients_df <- as.data.frame(coefficients_matrix)
coefficients_df$Variable <- rownames(coefficients_df)

# Reshape to long format for plotting
library(reshape2)
coefficients_long <- melt(coefficients_df, id.vars = "Variable", 
                          variable.name = "Component", value.name = "Coefficient")

# Plot using ggplot2, faceting by Component
library(ggplot2)
ggplot(coefficients_long, aes(x = reorder(Variable, Coefficient), y = Coefficient, fill = Component)) +
  geom_bar(stat="identity", position="dodge") +
  coord_flip() +
  facet_wrap(~ Component, scales = "free_y") +
  theme_minimal() +
  labs(title = "PLS Coefficients by Component",
       x = "Predictor Variables",
       y = "Coefficient") +
  theme(legend.position = "none")


Analysis

The good thing about using PLS is that the multi-colinearity is addressed along with predictive power. The components or latent variables are chosen such that the covariance between predictors is maximized, and each component is orthogonal to the ones prior so the correlation between components is also minimized.

Each predictor is also weighed differently in each component it appears in based on its effect on the target variable. Some next steps we can take are to look at which predictors have consistently low coefficients/loadings and remove them- this could possibly improve the performance of the model.

PT: Coefficients: Around ±0.001 to ±0.007 across all components, never exceeding about 0.0075 in absolute value. These are very small relative to other variables that have coefficients in the 0.05–0.4 range. Loadings: PT does not appear to strongly load on any component (not listed or near zero in the given snippet).

Air.Pressurer: Coefficients: About -0.003 to -0.013 across all components. These values are also quite small compared to other more influential variables. Loadings: Only appears once at about -0.114, which is not large. Most important variables have loadings at least above ±0.2–0.3 somewhere.

Carb.Temp: Coefficients: Ranging around 0.0059 to 0.0158, slightly larger than PT or Air.Pressurer but still relatively small. Some variables show coefficients well above 0.05 or even 0.1. Loadings: Carb.Temp does not appear with a noticeable loading in the snippet (it’s blank), suggesting it may not be significantly influencing the latent structure.


Refined PLS

Here we run PLS with fewer predictors based on the analysis above.

library(pls)

# Remove the chosen predictors from the dataset
predictors_to_remove <- c("PT", "Air.Pressurer", "Carb.Temp")

student_train_reduced <- student_train_x2[ , !(names(student_train_x2) %in% predictors_to_remove)]

# Fit a PLS model again with cross-validation
set.seed(123)
pls_model_reduced <- plsr(PH ~ ., 
                          data = student_train_reduced,
                          scale = TRUE,
                          validation = "CV")

# Check summary
summary(pls_model_reduced)
## Data:    X dimension: 2571 27 
##  Y dimension: 2571 1
## Fit method: kernelpls
## Number of components considered: 27
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV          0.9998   0.8815   0.8157   0.8027   0.7964   0.7906   0.7858
## adjCV       0.9998   0.8815   0.8153   0.8024   0.7960   0.7901   0.7852
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV      0.7819   0.7806   0.7799    0.7793    0.7791    0.7790    0.7791
## adjCV   0.7813   0.7801   0.7794    0.7788    0.7786    0.7785    0.7786
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps  20 comps
## CV       0.7790    0.7790    0.7790    0.7789    0.7790    0.7790    0.7790
## adjCV    0.7784    0.7784    0.7784    0.7784    0.7784    0.7784    0.7784
##        21 comps  22 comps  23 comps  24 comps  25 comps  26 comps  27 comps
## CV       0.7790    0.7790    0.7790    0.7790    0.7790    0.7790    0.7790
## adjCV    0.7784    0.7784    0.7784    0.7784    0.7784    0.7784    0.7784
## 
## TRAINING: % variance explained
##     1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X     18.62    23.82    35.37    46.40    53.02    56.15    58.19    62.01
## PH    22.61    34.54    36.73    37.88    38.93    39.78    40.40    40.60
##     9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X     65.07     66.83     69.00     72.50     74.75     76.60     79.41
## PH    40.72     40.80     40.83     40.84     40.85     40.85     40.85
##     16 comps  17 comps  18 comps  19 comps  20 comps  21 comps  22 comps
## X      80.73     82.77     84.62     86.89     89.27     90.91     92.52
## PH     40.85     40.85     40.85     40.85     40.85     40.85     40.85
##     23 comps  24 comps  25 comps  26 comps  27 comps
## X      93.75     95.61     97.40     98.30    100.00
## PH     40.85     40.85     40.85     40.85     40.85
r2_values_reduced <- R2(pls_model_reduced, estimate = "CV")
rmsep_values_reduced <- RMSEP(pls_model_reduced, estimate = "CV")



cat("\nCross-validated R² values after removal:\n")
## 
## Cross-validated R² values after removal:
cat("\nCross-validated RMSEP values after removal:\n")
## 
## Cross-validated RMSEP values after removal:
# Visualize MSEP to see if there's an improvement
validationplot(pls_model_reduced, val.type = "MSEP", main = "PLS with Reduced Predictors: CV MSEP by #Components")

# Identify optimal number of components in the reduced model based on RMSEP

r2_values_reduced_mod <- r2_values_reduced$val["CV", "PH", ]
r2_comps_reduced <- r2_values_reduced_mod[-1]

rmse_values_reduced_mod <- rmsep_values_reduced$val["CV", "PH", ]
rmsep_comps_reduced <- rmse_values_reduced_mod[-1]

opt_comp_rmsep_reduced <- which.min(rmsep_comps_reduced)
min_rmsep_reduced <- rmsep_comps_reduced[opt_comp_rmsep_reduced]
opt_comp_r2_reduced <- which.max(r2_comps_reduced)
max_r2_reduced <- r2_comps_reduced[opt_comp_r2_reduced]

cat("\nAfter predictor removal:\n")
## 
## After predictor removal:
cat("Optimal #Components by RMSEP:", opt_comp_rmsep_reduced, "with RMSEP =", min_rmsep_reduced, "\n")
## Optimal #Components by RMSEP: 17 with RMSEP = 0.7789427
cat("Optimal #Components by R²:", opt_comp_r2_reduced, "with R² =", max_r2_reduced, "\n")
## Optimal #Components by R²: 17 with R² = 0.3925633


Results

There was a small lift in Rsquared after removing some of the less relevant predictors.

Rsquared:

PCR: 0.2832964 PLS: 0.4342728 PLS with some predictors removed: 0.4452060

# Fit the final model with the chosen number of components
final_pls_model <- plsr(PH ~ ., 
                        data = student_train_reduced,
                        scale = TRUE,
                        ncomp = opt_comp_rmsep_reduced,
                        validation = "none")

# Check summary
summary(final_pls_model)
## Data:    X dimension: 2571 27 
##  Y dimension: 2571 1
## Fit method: kernelpls
## Number of components considered: 17
## TRAINING: % variance explained
##     1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X     18.62    23.82    35.37    46.40    53.02    56.15    58.19    62.01
## PH    22.61    34.54    36.73    37.88    38.93    39.78    40.40    40.60
##     9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X     65.07     66.83     69.00     72.50     74.75     76.60     79.41
## PH    40.72     40.80     40.83     40.84     40.85     40.85     40.85
##     16 comps  17 comps
## X      80.73     82.77
## PH     40.85     40.85
# Predict on the test set
preds_final <- predict(final_pls_model, newdata = student_eval_y2, ncomp = opt_comp_rmsep_reduced)
  
test_performance <- postResample(preds_final, student_eval_y2$PH)
cat("\nTest set performance (RMSE, R2, MAE):\n")
## 
## Test set performance (RMSE, R2, MAE):
print(test_performance)
##      RMSE  Rsquared       MAE 
## 0.5562690 0.4452060 0.4302488
# RMSE      0.5562690
# Rsquared  0.4452060
# MAE        0.4302488 



Appendix 3


MARS Revisted

Here we are revisiting the MARS model with an attempt to address the over-fitting that occurred in the prior iteration. There is some additional feature generation around the ‘Hydrolic.Pressure’ variables based on their distributions. We are using a different approach to imputation by replacing missing values with the column mean. Lastly we are controlling for outliers in our model creation by ‘Winsorizing’ the data set, seting values that are more than 2 standard deviations from the mean of the variable to a value 2SDs from the mean.


Preprocessing

First we are addressing the missing Brand.Code values by creating a new level ‘Unk’ for the missing values. We are also temporarily splitting off the Brand.Code column so in the next step we can impute missing values and fix outliers in all the remaining numeric columns.

library(tidyverse)
library(DescTools)

student_train$Brand.Code[student_train$Brand.Code == ''] <- 'Unk'
BC <- student_train |> select(Brand.Code)
student_train <- student_train |> select(-Brand.Code)

student_eval$Brand.Code[student_eval$Brand.Code == ''] <- 'Unk'
BC_eval <- student_eval |> select(Brand.Code)
student_eval <- student_eval |> select(-Brand.Code)

Here we are replacing all missing values with the column mean and creating some new features based on distributions of the ‘Hyd.Pressure’ columns. If you look at the distributions of these columns you see that there are two distributions of values, zero values and a Gaussian distribution of non-zero values. We are creating and addition binary feature for each of these columns which captures if the hydraulic pressure is greater than zero.

hist(student_train$Hyd.Pressure1, breaks = 20, main = "Histogram of Hyd.Pressure1", xlab = 'Hyd.Pressure1')

#Replace missing values in numeric columns with the column mean.
student_train <- student_train %>% mutate(across(everything(), ~replace_na(.x, median(na.omit(.)))))

student_eval <- student_eval %>% mutate(across(everything(), ~replace_na(.x, median(na.omit(.)))))

#Create new features to identify if the hydraulic pressure value is greater than zero
student_train <- student_train |> mutate(Hyd.Pressure10 = as.numeric(Hyd.Pressure1 > 0), Hyd.Pressure20 = as.numeric(Hyd.Pressure2 > 0), Hyd.Pressure30 = as.numeric(Hyd.Pressure3 > 0))

student_eval <- student_eval |> mutate(Hyd.Pressure10 = as.numeric(Hyd.Pressure1 > 0), Hyd.Pressure20 = as.numeric(Hyd.Pressure2 > 0), Hyd.Pressure30 = as.numeric(Hyd.Pressure3 > 0)) 

Here we are Winsorizing the data, a process by which you define outliers as values falling beyond a number of distributions from the mean (2 distributions by default). We are replacing those outlier values with values 2 standard deviations from the mean, compressing the distributions in our data set. Then we are joining the non-numeric Brand.Code column back to the data.

#Winsorize the data
student_train <-lapply(student_train, Winsorize)
student_train <- bind_cols(BC, student_train)

student_eval <-lapply(student_eval, Winsorize)
student_eval <- bind_cols(BC_eval, student_eval)


Model Hyperparameter selection

library(earth)
y = student_train$PH
x = student_train |> select(-PH)

#y_eval = student_eval$PH
x_eval = student_eval |> select(-PH)

Here we are performing a grid search to identify the optimal hyper parameters for our MARS model.


Model Building with Validation

Lastly we are doing a creating our optimal MARS model and performing a 10-fold cross validation to evaluate our models performance. We are also shorting the tail of the Oxygen.Filler distribution by replacing values greater than 0.25 with 0.25.

We are outputting the model summary for the 1 of the 10 cross fold models built

library(caret)
set.seed(175175327)

x$Oxygen.Filler[x$Oxygen.Filler > 0.25] <- 0.25
#y[y==0] <- 8.5

folds <-  createFolds(y = y, k = 10)#, type = "random")

    model_results <-  data.frame(R_squared = numeric(10))
 

    for (i in 1:10) {

        x_data <- x[(-folds[[i]]), ]  # Train Predictor data for current fold

        y_data <- y[(-folds[[i]])]   # Train Target data for current fold

        

        mars_model <- earth(x = x_data, y = y_data, degree = 3, nprune = 42)
 
        #print(summary(mars_model))

        

        model_results[i, 'R_squared_train'] <- summary(mars_model)$rsq
        
        test_data <- x[(folds[[i]]), ] # Test Predictor data for current fold
        target_data <- y[(folds[[i]])] # Test Target data for current fold
        
        predictions <- predict(mars_model, newdata = test_data)

        model_results[i, "RMSE"] <-  sqrt(mean((predictions - target_data)^2))

        res <- caret::postResample(predictions, target_data)
        model_results[i, "R_squared_test"] <-  res[2]

    }
    
    summary(mars_model)
## Call: earth(x=x_data, y=y_data, degree=3, nprune=42)
## 
##                                                                                     coefficients
## (Intercept)                                                                             0.171936
## h(-0.203956-Mnf.Flow)                                                                   1.227742
## h(0.910228-Hyd.Pressure3)                                                              -0.235784
## h(Temperature- -0.699707)                                                              -0.271336
## h(-0.671872-Density)                                                                    0.992686
## h(Density- -0.671872)                                                                  -0.437461
## h(0.0440049-Bowl.Setpoint)                                                             -0.366527
## h(Air.Pressurer- -0.193078)                                                             0.189984
## h(-0.549039-Alch.Rel)                                                                   2.104751
## h(Alch.Rel- -0.549039)                                                                  0.381484
## Brand.CodeC * h(Mnf.Flow-1.06152)                                                      -6.381409
## Brand.CodeC * h(1.06152-Mnf.Flow)                                                      -0.559842
## h(Mnf.Flow- -0.203956) * h(Usage.cont-0.0560919)                                       -0.334615
## h(-0.203956-Mnf.Flow) * h(0.379132-Pressure.Vacuum)                                    -0.809624
## h(-0.203956-Mnf.Flow) * h(Air.Pressurer-0.962117)                                      -0.727589
## h(-0.203956-Mnf.Flow) * h(0.962117-Air.Pressurer)                                       0.155671
## h(Mnf.Flow- -0.203956) * h(Alch.Rel-0.519685)                                           0.418455
## h(Mnf.Flow- -0.203956) * h(0.519685-Alch.Rel)                                          -0.335221
## h(-0.334474-Carb.Pressure1) * h(Hyd.Pressure1-0.688671)                                -0.278304
## h(0.910228-Hyd.Pressure3) * h(Pressure.Vacuum-0.0282507)                               -0.100393
## h(0.268287-Filler.Speed) * h(Density- -0.671872)                                        0.074865
## h(Filler.Speed-0.268287) * h(Density- -0.671872)                                        1.489211
## h(-0.322631-Pressure.Vacuum) * h(0.0440049-Bowl.Setpoint)                               0.272839
## h(Pressure.Vacuum- -0.322631) * h(0.0440049-Bowl.Setpoint)                              0.208009
## h(-0.746997-Oxygen.Filler) * h(-0.549039-Alch.Rel)                                    -12.650368
## Brand.CodeC * h(1.06152-Mnf.Flow) * h(0.0450417-Carb.Pressure1)                         0.183953
## h(Carb.Volume- -1.34916) * h(Mnf.Flow- -0.203956) * h(Usage.cont-0.0560919)            -0.128589
## h(Mnf.Flow- -0.203956) * h(Hyd.Pressure2-1.02762) * h(0.519685-Alch.Rel)                2.639698
## h(-1.04258-Mnf.Flow) * h(0.910228-Hyd.Pressure3) * h(Pressure.Vacuum-0.0282507)        95.969965
## h(Mnf.Flow- -0.203956) * h(0.60202-Temperature) * h(Usage.cont-0.0560919)              -0.245408
## h(Mnf.Flow- -0.203956) * h(Usage.cont-0.0560919) * h(Bowl.Setpoint- -1.26292)           0.243339
## h(Mnf.Flow- -0.203956) * h(Usage.cont-0.0560919) * h(-1.26292-Bowl.Setpoint)            0.647840
## h(-0.203956-Mnf.Flow) * h(0.379132-Pressure.Vacuum) * h(1.62799-Alch.Rel)               0.219838
## h(Mnf.Flow- -0.203956) * h(Bowl.Setpoint-0.0440049) * h(Alch.Rel-0.519685)             -0.664349
## h(Mnf.Flow- -0.203956) * h(0.0440049-Bowl.Setpoint) * h(Alch.Rel-0.519685)              0.212207
## h(Mnf.Flow- -0.203956) * h(0.519685-Alch.Rel) * h(Balling.Lvl- -0.494086)               1.367322
## h(Mnf.Flow- -0.203956) * h(0.519685-Alch.Rel) * h(-0.494086-Balling.Lvl)                0.871681
## h(1.14142-Carb.Pressure1) * h(Oxygen.Filler- -0.746997) * h(-0.549039-Alch.Rel)        -1.568930
## h(0.910228-Hyd.Pressure3) * h(Filler.Speed-0.408398) * h(Pressure.Vacuum-0.0282507)    40.297109
## h(0.910228-Hyd.Pressure3) * h(Filler.Speed-0.379857) * h(Pressure.Vacuum-0.0282507)    -9.411090
## h(0.910228-Hyd.Pressure3) * h(Filler.Speed-0.418776) * h(Pressure.Vacuum-0.0282507)   -63.437797
## h(-0.753706-Balling) * h(Oxygen.Filler- -0.746997) * h(-0.549039-Alch.Rel)             10.785083
## 
## Selected 42 of 66 terms, and 18 of 38 predictors (nprune=42)
## Termination condition: RSq changed by less than 0.001 at 66 terms
## Importance: Mnf.Flow, Brand.CodeC, Alch.Rel, Pressure.Vacuum, ...
## Number of terms at each degree of interaction: 1 9 15 17
## GCV 0.3113796    RSS 657.5183    GRSq 0.620869    RSq 0.6537267


MARS Results

We get an mean training set R-squared of 0.635 and a mean test set R-squared of 0.556 and a max test set R-squared of 0.65. The drop of in the R-square of 0.08 shows that there is some over fitting occurring in the in the model creation but not as much as the first iteration of the MARS model creation. Looking at the model summary I think there could be some subgroup analysis as a next step in model exploration.

    summary(model_results)
##    R_squared R_squared_train       RMSE        R_squared_test  
##  Min.   :0   Min.   :0.6083   Min.   :0.5319   Min.   :0.4226  
##  1st Qu.:0   1st Qu.:0.6271   1st Qu.:0.5922   1st Qu.:0.5380  
##  Median :0   Median :0.6313   Median :0.6041   Median :0.5599  
##  Mean   :0   Mean   :0.6356   Mean   :0.6120   Mean   :0.5568  
##  3rd Qu.:0   3rd Qu.:0.6500   3rd Qu.:0.6150   3rd Qu.:0.5874  
##  Max.   :0   Max.   :0.6642   Max.   :0.7721   Max.   :0.6564
preds <- stats::predict(mars_model, student_eval)
#write.csv(preds, "Data624_Model_Preds.csv")