library(readxl)
library(caret)
library(tidyverse)
library(VIM)
library(ggplot2)
library(GGally)
library(car)
library(glmnet)

Abstract

We are tasked with analyzing the manufacturing process of a beverage company, ABC Beverage. Specifically, we investigate which factors in the process can help us predict the pH of a beverage.

We train a variety of machine learning models, including linear models, nonlinear models, and regression tree/rule-based models. We find that a random forest model best explains the variance in pH based on the predictive factors, and we use it to predict pH for a subsequent evaluation data set that was excluded from training and testing.

Import Excel tables

Below we import the Excel data from GitHub:

stu_data_url <- 'https://github.com/amedina613/Data624-Project2/raw/refs/heads/main/StudentData.xlsx'

dest_file_stu_data <- 'StudentData.xlsx'

stu_eval_url <- 'https://github.com/amedina613/Data624-Project2/raw/refs/heads/main/StudentEvaluation.xlsx'
  
dest_file_stu_eval <- 'StudentEvaluation.xlsx'

download.file(stu_data_url, dest_file_stu_data, mode = 'wb')
download.file(stu_eval_url, dest_file_stu_eval, mode = 'wb')

stu_data_raw <- read_excel('StudentData.xlsx',col_names = T)
stu_eval_raw <- read_excel('StudentEvaluation.xlsx',col_names = T)


stu_data_raw <- as.data.frame(stu_data_raw)
stu_eval_raw <- as.data.frame(stu_eval_raw)

stu_data_raw is the data that we will use to train and test models; stu_eval_raw is the data set for which we will predict pH values using our chosen model.

Data exploration

First, we check the structure of the data set we will use to train and test the model.

str(stu_data_raw)
## 'data.frame':    2571 obs. of  33 variables:
##  $ Brand Code       : chr  "B" "A" "B" "A" ...
##  $ Carb Volume      : num  5.34 5.43 5.29 5.44 5.49 ...
##  $ Fill Ounces      : num  24 24 24.1 24 24.3 ...
##  $ PC Volume        : num  0.263 0.239 0.263 0.293 0.111 ...
##  $ Carb Pressure    : num  68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
##  $ Carb Temp        : num  141 140 145 133 137 ...
##  $ PSC              : num  0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
##  $ PSC Fill         : num  0.26 0.22 0.34 0.42 0.16 ...
##  $ PSC CO2          : num  0.04 0.04 0.16 0.04 0.12 ...
##  $ Mnf Flow         : num  -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
##  $ Carb Pressure1   : num  119 122 120 115 118 ...
##  $ Fill Pressure    : num  46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
##  $ Hyd Pressure1    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure2    : num  NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure3    : num  NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure4    : num  118 106 82 92 92 116 124 132 90 108 ...
##  $ Filler Level     : num  121 119 120 118 119 ...
##  $ Filler Speed     : num  4002 3986 4020 4012 4010 ...
##  $ Temperature      : num  66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
##  $ Usage cont       : num  16.2 19.9 17.8 17.4 17.7 ...
##  $ Carb Flow        : num  2932 3144 2914 3062 3054 ...
##  $ Density          : num  0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
##  $ MFR              : num  725 727 735 731 723 ...
##  $ Balling          : num  1.4 1.5 3.14 3.04 3.04 ...
##  $ Pressure Vacuum  : num  -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
##  $ PH               : num  8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
##  $ Oxygen Filler    : num  0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
##  $ Bowl Setpoint    : num  120 120 120 120 120 120 120 120 120 120 ...
##  $ Pressure Setpoint: num  46.4 46.8 46.6 46 46 46 46 46 46 46 ...
##  $ Air Pressurer    : num  143 143 142 146 146 ...
##  $ Alch Rel         : num  6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
##  $ Carb Rel         : num  5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
##  $ Balling Lvl      : num  1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...

Next, we check the number of missing values in each column.

missing_values <- colSums(is.na(stu_data_raw))

print(missing_values)
##        Brand Code       Carb Volume       Fill Ounces         PC Volume 
##               120                10                38                39 
##     Carb Pressure         Carb Temp               PSC          PSC Fill 
##                27                26                33                23 
##           PSC CO2          Mnf Flow    Carb Pressure1     Fill Pressure 
##                39                 2                32                22 
##     Hyd Pressure1     Hyd Pressure2     Hyd Pressure3     Hyd Pressure4 
##                11                15                15                30 
##      Filler Level      Filler Speed       Temperature        Usage cont 
##                20                57                14                 5 
##         Carb Flow           Density               MFR           Balling 
##                 2                 1               212                 1 
##   Pressure Vacuum                PH     Oxygen Filler     Bowl Setpoint 
##                 0                 4                12                 2 
## Pressure Setpoint     Air Pressurer          Alch Rel          Carb Rel 
##                12                 0                 9                10 
##       Balling Lvl 
##                 1

We also visualize missing data:

library(naniar)
gg_miss_var(stu_data_raw)

vis_miss(stu_data_raw)

Data preparation

Dummy variables

The Brand Code column is a categorical variable. Before the numeric preprocessing (centering, scaling, and KNN imputation) can be applied, we impute its missing values and convert it to dummy variables.

First, we convert the column to a factor.

stu_data_raw$`Brand Code` <- as.factor(stu_data_raw$`Brand Code`)

Next, we impute the Brand Code to replace missing values:

library(VIM)


stu_data_raw <- kNN(data = stu_data_raw, variable = 'Brand Code', k = 5, imp_var = F)

Next, we convert the Brand Code to dummy variables:

dummy_variables <- dummyVars(~., data = stu_data_raw, fullRank = T)
training_data <- predict(dummy_variables, newdata = stu_data_raw)
head(training_data)
##   `Brand Code`B `Brand Code`C `Brand Code`D `Carb Volume` `Fill Ounces`
## 1             1             0             0      5.340000      23.96667
## 2             0             0             0      5.426667      24.00667
## 3             1             0             0      5.286667      24.06000
## 4             0             0             0      5.440000      24.00667
## 5             0             0             0      5.486667      24.31333
## 6             0             0             0      5.380000      23.92667
##   `PC Volume` `Carb Pressure` `Carb Temp`   PSC `PSC Fill` `PSC CO2` `Mnf Flow`
## 1   0.2633333            68.2       141.2 0.104       0.26      0.04       -100
## 2   0.2386667            68.4       139.6 0.124       0.22      0.04       -100
## 3   0.2633333            70.8       144.8 0.090       0.34      0.16       -100
## 4   0.2933333            63.0       132.6    NA       0.42      0.04       -100
## 5   0.1113333            67.2       136.8 0.026       0.16      0.12       -100
## 6   0.2693333            66.6       138.4 0.090       0.24      0.04       -100
##   `Carb Pressure1` `Fill Pressure` `Hyd Pressure1` `Hyd Pressure2`
## 1            118.8            46.0               0              NA
## 2            121.6            46.0               0              NA
## 3            120.2            46.0               0              NA
## 4            115.2            46.4               0               0
## 5            118.4            45.8               0               0
## 6            119.6            45.6               0               0
##   `Hyd Pressure3` `Hyd Pressure4` `Filler Level` `Filler Speed` Temperature
## 1              NA             118          121.2           4002        66.0
## 2              NA             106          118.6           3986        67.6
## 3              NA              82          120.0           4020        67.0
## 4               0              92          117.8           4012        65.6
## 5               0              92          118.6           4010        65.6
## 6               0             116          120.2           4014        66.2
##   `Usage cont` `Carb Flow` Density   MFR Balling `Pressure Vacuum`   PH
## 1        16.18        2932    0.88 725.0   1.398              -4.0 8.36
## 2        19.90        3144    0.92 726.8   1.498              -4.0 8.26
## 3        17.76        2914    1.58 735.0   3.142              -3.8 8.94
## 4        17.42        3062    1.54 730.6   3.042              -4.4 8.24
## 5        17.68        3054    1.54 722.8   3.042              -4.4 8.26
## 6        23.82        2948    1.52 738.8   2.992              -4.4 8.32
##   `Oxygen Filler` `Bowl Setpoint` `Pressure Setpoint` `Air Pressurer`
## 1           0.022             120                46.4           142.6
## 2           0.026             120                46.8           143.0
## 3           0.024             120                46.6           142.0
## 4           0.030             120                46.0           146.2
## 5           0.030             120                46.0           146.2
## 6           0.024             120                46.0           146.6
##   `Alch Rel` `Carb Rel` `Balling Lvl`
## 1       6.58       5.32          1.48
## 2       6.56       5.30          1.56
## 3       7.66       5.84          3.28
## 4       7.14       5.42          3.04
## 5       7.14       5.44          3.04
## 6       7.16       5.44          3.02

We now have three dummy variable columns of zeros and ones: Brand Code B, Brand Code C, and Brand Code D. If all three columns are zero, the brand is A.
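As a quick sanity check (a sketch using the objects created above), each row's dummy columns should sum to 1, except brand A rows, which should sum to 0:

# Cross-tab the imputed brand against the row sums of the three dummy columns;
# brand A rows should sum to 0, all other brands to 1
table(stu_data_raw$`Brand Code`, rowSums(training_data[, 1:3]))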

Center, Scale, KNN

Next, we center and scale the data and perform KNN imputation for missing values. We exclude the dummy variable columns from centering, scaling, and imputation (Brand Code was already imputed).

preProcess_model <- preProcess(training_data[,4:35], method = c("center", "scale", "knnImpute"))
training_data_imputed <- training_data
training_data_imputed[,4:35] <- predict(preProcess_model, training_data[,4:35])
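As a quick sanity check (a sketch using the objects above), the standardized columns should now have means near 0 and standard deviations near 1 (imputation nudges them slightly away from exact values):

# Means and standard deviations of the preprocessed numeric columns
summary(colMeans(training_data_imputed[, 4:35]))
summary(apply(training_data_imputed[, 4:35], 2, sd))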

We now have zero missing values:

colSums(is.na(training_data_imputed))
##       `Brand Code`B       `Brand Code`C       `Brand Code`D       `Carb Volume` 
##                   0                   0                   0                   0 
##       `Fill Ounces`         `PC Volume`     `Carb Pressure`         `Carb Temp` 
##                   0                   0                   0                   0 
##                 PSC          `PSC Fill`           `PSC CO2`          `Mnf Flow` 
##                   0                   0                   0                   0 
##    `Carb Pressure1`     `Fill Pressure`     `Hyd Pressure1`     `Hyd Pressure2` 
##                   0                   0                   0                   0 
##     `Hyd Pressure3`     `Hyd Pressure4`      `Filler Level`      `Filler Speed` 
##                   0                   0                   0                   0 
##         Temperature        `Usage cont`         `Carb Flow`             Density 
##                   0                   0                   0                   0 
##                 MFR             Balling   `Pressure Vacuum`                  PH 
##                   0                   0                   0                   0 
##     `Oxygen Filler`     `Bowl Setpoint` `Pressure Setpoint`     `Air Pressurer` 
##                   0                   0                   0                   0 
##          `Alch Rel`          `Carb Rel`       `Balling Lvl` 
##                   0                   0                   0

Near Zero Variance predictors

Near zero variance predictors are identified and removed from the imputed numeric data set. These predictors have little variability and do not contribute meaningfully to analysis or modeling.

There is only one variable with near zero variance:

nzv_cols <- nearZeroVar(training_data_imputed)
colnames(training_data_imputed)[nzv_cols]
## [1] "`Hyd Pressure1`"
training_data_imputed <- training_data_imputed[, -nzv_cols]
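For more detail, nearZeroVar() can also return the diagnostics behind the flag via its saveMetrics argument (a quick check; after the removal above, no remaining column should be flagged):

# freqRatio: frequency of the most common value over the second most common
# percentUnique: unique values as a percentage of the number of rows
nzv_metrics <- nearZeroVar(training_data_imputed, saveMetrics = TRUE)
head(nzv_metrics[order(-nzv_metrics$freqRatio), ])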

Our working training data set is training_data_imputed:

head(training_data_imputed)
##   `Brand Code`B `Brand Code`C `Brand Code`D `Carb Volume` `Fill Ounces`
## 1             1             0             0   -0.28385369   -0.09240162
## 2             0             0             0    0.53079575    0.36458499
## 3             1             0             0   -0.78517642    0.97390047
## 4             0             0             0    0.65612644    0.36458499
## 5             0             0             0    1.09478383    3.86814899
## 6             0             0             0    0.09213836   -0.54938823
##   `PC Volume` `Carb Pressure` `Carb Temp`        PSC `PSC Fill`  `PSC CO2`
## 1  -0.2271248     0.002946277   0.0260259  0.3942913  0.5487360 -0.3813757
## 2  -0.6335262     0.059472131  -0.3702701  0.8002265  0.2091248 -0.3813757
## 3  -0.2271248     0.737782381   0.9176919  0.1101367  1.2279585  2.4068150
## 4   0.2671472    -1.466725932  -2.1040652 -0.1415431  1.9071809 -0.3813757
## 5  -2.7314362    -0.279682994  -1.0637881 -1.1888558 -0.3002920  1.4774181
## 6  -0.1282704    -0.449260557  -0.6674921  0.1101367  0.3789304 -0.3813757
##   `Mnf Flow` `Carb Pressure1` `Fill Pressure` `Hyd Pressure2` `Hyd Pressure3`
## 1  -1.042583       -0.7983274      -0.6049215        -1.27918       -1.280596
## 2  -1.042583       -0.2079690      -0.6049215        -1.27918       -1.280596
## 3  -1.042583       -0.5031482      -0.6049215        -1.27918       -1.280596
## 4  -1.042583       -1.5573596      -0.4790381        -1.27918       -1.280596
## 5  -1.042583       -0.8826643      -0.6678631        -1.27918       -1.280596
## 6  -1.042583       -0.6296536      -0.7308048        -1.27918       -1.280596
##   `Hyd Pressure4` `Filler Level` `Filler Speed` Temperature `Usage cont`
## 1       1.6544895      0.7610718      0.4083977  0.02347442   -1.6162070
## 2       0.7400338      0.5954501      0.3876406  1.18056518   -0.3670199
## 3      -1.0888777      0.6846310      0.4317494  0.74665614   -1.0856383
## 4      -0.3268313      0.5444896      0.4213709 -0.26579827   -1.1998113
## 5      -0.3268313      0.5954501      0.4187762 -0.26579827   -1.1125025
## 6       1.5020803      0.6973712      0.4239655  0.16811076    0.9493279
##   `Carb Flow`    Density       MFR    Balling `Pressure Vacuum`        PH
## 1   0.4318220 -0.7778248 0.2835077 -0.8589593          2.133539 -1.076123
## 2   0.6292707 -0.6718721 0.3078655 -0.7515585          2.133539 -1.655778
## 3   0.4150575  1.0763476 0.4188288  1.0141113          2.484420  2.285880
## 4   0.5528991  0.9703949 0.3592875  0.9067105          1.431776 -1.771709
## 5   0.5454482  0.9703949 0.2537371  0.9067105          1.431776 -1.655778
## 6   0.4467238  0.9174185 0.4702508  0.8530101          1.431776 -1.307985
##   `Oxygen Filler` `Bowl Setpoint` `Pressure Setpoint` `Air Pressurer`
## 1      -0.5326049        0.697465           -0.596061      -0.1930780
## 2      -0.4468482        0.697465           -0.399891       0.1369776
## 3      -0.4897265        0.697465           -0.497976      -0.6881615
## 4      -0.3610914        0.697465           -0.792231       2.7774225
## 5      -0.3610914        0.697465           -0.792231       2.7774225
## 6      -0.4897265        0.697465           -0.792231       3.1074781
##   `Alch Rel` `Carb Rel` `Balling Lvl`
## 1 -0.6282042 -0.9072722    -0.6549488
## 2 -0.6677866 -1.0626503    -0.5630274
## 3  1.5092444  3.1325583     1.4132824
## 4  0.4801025 -0.1303817     1.1375182
## 5  0.4801025  0.0249964     1.1375182
## 6  0.5196849  0.0249964     1.1145379

Before further analysis continues, we convert the data set into a data frame:

training_data_imputed <- as.data.frame(training_data_imputed)

Train-Test Split

The data is split into 80% training and 20% testing to train and validate models.

set.seed(321)  

index <- createDataPartition(training_data_imputed$PH, p = 0.8, list = FALSE)

train_x <- training_data_imputed[index, -which(names(training_data_imputed) == "PH")]
train_y <- training_data_imputed[index, "PH"]
test_x <- training_data_imputed[-index, -which(names(training_data_imputed) == "PH")]
test_y <- training_data_imputed[-index, "PH"]
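As a quick check (using the objects above), the partition should put roughly 80% of the rows in training:

# Proportion of rows in the training and test partitions
length(train_y) / nrow(training_data_imputed)
length(test_y) / nrow(training_data_imputed)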

Correlation matrix: PH vs predictors

We evaluate the relationships between pH (our target) and all available predictors to gain a preliminary understanding of which factors might drive our outcome. Using correlation and visualization we identify the variables that are most strongly associated with pH.

The table of correlations ranks predictors by the strength of their correlation with the target variable, pH. Mnf Flow shows the strongest relationship with pH (negative correlation of -0.447), while Bowl Setpoint has the strongest positive correlation (0.348) and may play a role in influencing pH, although these correlations are moderate at best.

# Compute correlations between 'PH' and all other predictors
correlation_values <- training_data_imputed %>%
    summarise(across(.cols = everything(), 
                     .fns = ~ cor(., training_data_imputed$PH, use = "complete.obs"), 
                     .names = "cor_{col}")) %>%
    pivot_longer(cols = everything(), names_to = "Predictor", values_to = "Correlation") %>%
    mutate(Predictor = gsub("cor_", "", Predictor)) %>%
    filter(Predictor != "PH") %>% 
    arrange(desc(abs(Correlation)))

# Output sorted list of correlations
print(correlation_values)
## # A tibble: 33 × 2
##    Predictor           Correlation
##    <chr>                     <dbl>
##  1 `Mnf Flow`               -0.447
##  2 `Bowl Setpoint`           0.348
##  3 `Filler Level`            0.324
##  4 `Usage cont`             -0.318
##  5 `Pressure Setpoint`      -0.307
##  6 `Brand Code`C            -0.281
##  7 `Hyd Pressure3`          -0.240
##  8 `Pressure Vacuum`         0.220
##  9 `Fill Pressure`          -0.215
## 10 `Hyd Pressure2`          -0.201
## # ℹ 23 more rows
# Create scatterplot matrix for PH against other predictors
correlation_matrix <- ggpairs(
  data = training_data_imputed,
  columns = which(names(training_data_imputed) == "PH"):ncol(training_data_imputed),
  upper = list(continuous = wrap("cor", size = 3)),
  title = "Correlation Scatterplot Matrix for PH") +
  theme(
    axis.text.x = element_text(size = 7),
    axis.text.y = element_text(size = 7),
    strip.text = element_text(size = 7),
    plot.title = element_text(size = 12))

# Display the scatterplot matrix
print(correlation_matrix)

Baseline: Ordinary Linear Regression with Highest Correlated Predictors

We begin our modeling exploration with a simple linear regression, focused on the top predictors identified in the correlation analysis. We fit and visualize straightforward models to see how well a single predictor explains pH variance. The baseline helps us understand whether simple relationships exist and serves as a reference point for improvement. Although we do not expect this simple model to be the best, it provides a clear benchmark for gauging progress.

Bowl Setpoint shows a slight upward trend in pH as the setpoint increases, with an adjusted R-squared of 0.1215; the model explains approximately 12% of the variance in pH. Mnf Flow has a negative slope, showing that pH decreases as Mnf Flow increases, with a higher adjusted R-squared of 0.1995.

Mnf Flow is a stronger predictor of pH than Bowl Setpoint, but both are relatively weak on their own. Their low R-squared values motivate using additional predictors or more complex models to capture the remaining variance.

# Scatterplot with regression line for PH vs. Bowl Setpoint
plot_bowl <- ggplot(stu_data_raw, aes(x = `Bowl Setpoint`, y = PH)) +
  geom_point(color = "blue", alpha = 0.6) +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  labs(
    title = "Scatterplot of PH vs. Bowl Setpoint",
    x = "Bowl Setpoint",
    y = "PH") +
  theme_minimal()

# Scatterplot with regression line for PH vs. Mnf Flow
plot_flow <- ggplot(stu_data_raw, aes(x = `Mnf Flow`, y = PH)) +
  geom_point(color = "green", alpha = 0.6) +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  labs(
    title = "Scatterplot of PH vs. Mnf.Flow",
    x = "Mnf Flow",
    y = "PH") +
  theme_minimal()

# Display the plots
print(plot_bowl)

print(plot_flow)

# Linear model for PH and Bowl Setpoint
model_bowl <- lm(PH ~ `Bowl Setpoint`, data = stu_data_raw)
summary(model_bowl)
## 
## Call:
## lm(formula = PH ~ `Bowl Setpoint`, data = stu_data_raw)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6682 -0.1076  0.0118  0.1294  0.7724 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     8.1147179  0.0230680  351.77   <2e-16 ***
## `Bowl Setpoint` 0.0039408  0.0002089   18.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1618 on 2563 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.1219, Adjusted R-squared:  0.1215 
## F-statistic: 355.8 on 1 and 2563 DF,  p-value: < 2.2e-16
# Linear model for PH and Mnf Flow
model_flow <- lm(PH ~ `Mnf Flow`, data = stu_data_raw)
summary(model_flow)
## 
## Call:
## lm(formula = PH ~ `Mnf Flow`, data = stu_data_raw)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5783 -0.0862  0.0138  0.1138  0.7339 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.5615396  0.0031106 2752.41   <2e-16 ***
## `Mnf Flow`  -0.0006453  0.0000255  -25.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1544 on 2565 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.1998, Adjusted R-squared:  0.1995 
## F-statistic: 640.4 on 1 and 2565 DF,  p-value: < 2.2e-16
Baseline: Multiple Linear Regression

We expand from a single predictor to a full multiple linear regression model that incorporates all potential factors. We examine model summaries and diagnostic plots to assess fit and identify potential issues like non-linearity or heteroskedasticity.

The multiple linear regression model yields an adjusted R-squared of 0.435; the model explains approximately 44% of the variance in pH. Significant predictors include Mnf Flow (negative association), Carb Pressure1 (positive association), Hyd Pressure3, Temperature, Usage cont, Balling, and others, as indicated by their extremely low p-values. Several predictors, such as PC Volume and Hyd Pressure4, are not significant and may add unnecessary complexity to the model.

The Residuals vs Fitted plot shows a reasonably random spread, suggesting linearity and homoscedasticity. The Q-Q plot suggests most residuals are normally distributed, with some deviation in the tails. The Scale-Location plot supports a relatively even variance across fitted values. The Residuals vs Leverage plot highlights a few influential points (e.g., observation 172), which may warrant further treatment.

# Multiple linear regression model
model <- lm(PH ~ ., data = stu_data_raw)
summary(model)
## 
## Call:
## lm(formula = PH ~ ., data = stu_data_raw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54423 -0.07424  0.00863  0.08459  0.50277 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.182e+01  1.278e+00   9.252  < 2e-16 ***
## `Brand Code`B        1.202e-01  3.144e-02   3.824 0.000135 ***
## `Brand Code`C       -2.465e-02  3.060e-02  -0.805 0.420651    
## `Brand Code`D        5.539e-02  2.087e-02   2.655 0.008000 ** 
## `Carb Volume`       -2.157e-01  1.143e-01  -1.888 0.059147 .  
## `Fill Ounces`       -1.157e-01  3.560e-02  -3.250 0.001174 ** 
## `PC Volume`          3.338e-03  6.442e-02   0.052 0.958680    
## `Carb Pressure`      7.206e-03  5.357e-03   1.345 0.178763    
## `Carb Temp`         -4.905e-03  4.197e-03  -1.169 0.242613    
## PSC                 -3.422e-02  6.233e-02  -0.549 0.583051    
## `PSC Fill`          -1.756e-02  2.486e-02  -0.706 0.480061    
## `PSC CO2`           -1.124e-01  6.947e-02  -1.617 0.105937    
## `Mnf Flow`          -7.061e-04  5.095e-05 -13.859  < 2e-16 ***
## `Carb Pressure1`     6.013e-03  7.587e-04   7.925 3.68e-15 ***
## `Fill Pressure`     -4.035e-03  1.898e-03  -2.126 0.033618 *  
## `Hyd Pressure1`      1.691e-04  3.935e-04   0.430 0.667401    
## `Hyd Pressure2`     -1.570e-03  5.649e-04  -2.779 0.005510 ** 
## `Hyd Pressure3`      3.391e-03  6.419e-04   5.283 1.40e-07 ***
## `Hyd Pressure4`      3.107e-04  4.198e-04   0.740 0.459362    
## `Filler Level`      -8.919e-04  9.024e-04  -0.988 0.323083    
## `Filler Speed`       1.714e-05  2.717e-05   0.631 0.528216    
## Temperature         -1.964e-02  3.086e-03  -6.364 2.41e-10 ***
## `Usage cont`        -7.069e-03  1.276e-03  -5.539 3.42e-08 ***
## `Carb Flow`          1.042e-05  4.657e-06   2.237 0.025418 *  
## Density             -1.289e-01  3.017e-02  -4.272 2.02e-05 ***
## MFR                 -9.184e-05  1.465e-04  -0.627 0.530846    
## Balling             -1.959e-01  3.903e-02  -5.020 5.60e-07 ***
## `Pressure Vacuum`   -4.927e-02  1.006e-02  -4.899 1.04e-06 ***
## `Oxygen Filler`     -4.319e-01  8.508e-02  -5.077 4.18e-07 ***
## `Bowl Setpoint`      2.695e-03  9.314e-04   2.894 0.003847 ** 
## `Pressure Setpoint` -2.482e-03  2.542e-03  -0.976 0.329083    
## `Air Pressurer`     -4.915e-04  2.532e-03  -0.194 0.846099    
## `Alch Rel`           6.808e-02  3.209e-02   2.121 0.034007 *  
## `Carb Rel`           1.795e-01  6.080e-02   2.953 0.003184 ** 
## `Balling Lvl`        2.514e-01  4.511e-02   5.574 2.82e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1287 on 2094 degrees of freedom
##   (442 observations deleted due to missingness)
## Multiple R-squared:  0.444,  Adjusted R-squared:  0.4349 
## F-statistic: 49.18 on 34 and 2094 DF,  p-value: < 2.2e-16
# Diagnostic Plots
par(mfrow = c(2, 2))
plot(model)

Baseline: MLR without Multicollinear Predictors

Due to the potential issues introduced by correlated predictors, we use the Variance Inflation Factor (VIF) to detect and remove variables that cause multicollinearity. After pruning these collinear variables, we refit the model. Mitigating multicollinearity ensures that each predictor contributes distinct information about pH. After removal, the remaining predictors all have VIF values below 5.
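For reference, the VIF of predictor \(j\) is \(1 / (1 - R_j^2)\), where \(R_j^2\) is the R-squared from regressing predictor \(j\) on all of the other predictors. Below is a minimal sketch of that definition for a single predictor (using the objects above; the result is approximate because the auxiliary regression drops a slightly different set of incomplete rows than the full model):

# VIF by hand: regress `Mnf Flow` on the remaining predictors
aux_model <- lm(`Mnf Flow` ~ . - PH, data = stu_data_raw)
1 / (1 - summary(aux_model)$r.squared)  # roughly matches the car::vif() value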

This refined model has an adjusted R-squared of 0.31, lower than the full MLR model. Some variables remain non-significant and could potentially be excluded, but the drop in R-squared indicates that the excluded predictors did offer explanatory power.

The diagnostic plots show similar trends to the previous MLR model, with the Residuals vs Fitted plot displaying a random spread, suggesting linearity. The Q-Q plot indicates a relatively normal distribution of residuals with some deviations in the tails. The Scale-Location plot suggests variance homogeneity. The Residuals vs Leverage plot identifies a few influential points (e.g., observation 1094), which may need attention.

# Compute VIF values
vif_df <- as.data.frame(vif(model))
names(vif_df) <- "VIF"
vif_df$Predictor <- rownames(vif_df)

# Filter predictors with VIF <= 5 
predictors_to_keep <- vif_df$Predictor[vif_df$VIF <= 5]

# Create formula with the filtered predictors
formula <- as.formula(paste("PH ~", paste(predictors_to_keep, collapse = " + ")))

# Refit model using selected predictors
final_model <- lm(formula, data = stu_data_raw)

# Compute VIF values for the final model
final_vif_df <- as.data.frame(vif(final_model))
names(final_vif_df) <- "VIF"
final_vif_df$Predictor <- rownames(final_vif_df)

# Display VIF values for kept predictors
print(final_vif_df)
##                          VIF           Predictor
## `Fill Ounces`       1.061003       `Fill Ounces`
## `PC Volume`         1.490547         `PC Volume`
## PSC                 1.139143                 PSC
## `PSC Fill`          1.100271          `PSC Fill`
## `PSC CO2`           1.053022           `PSC CO2`
## `Mnf Flow`          2.984454          `Mnf Flow`
## `Carb Pressure1`    1.414876    `Carb Pressure1`
## `Fill Pressure`     2.232085     `Fill Pressure`
## `Hyd Pressure1`     1.419480     `Hyd Pressure1`
## `Hyd Pressure4`     1.259835     `Hyd Pressure4`
## Temperature         1.155792         Temperature
## `Usage cont`        1.589792        `Usage cont`
## `Carb Flow`         1.465947         `Carb Flow`
## `Pressure Vacuum`   1.606165   `Pressure Vacuum`
## `Oxygen Filler`     1.470567     `Oxygen Filler`
## `Pressure Setpoint` 2.353896 `Pressure Setpoint`
## `Air Pressurer`     1.128596     `Air Pressurer`
# Display the summary of the final model
print(summary(final_model))
## 
## Call:
## lm(formula = formula, data = stu_data_raw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55405 -0.08235  0.01256  0.09720  0.44625 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.413e+01  9.521e-01  14.841  < 2e-16 ***
## `Fill Ounces`       -1.654e-01  3.638e-02  -4.547 5.71e-06 ***
## `PC Volume`         -4.420e-02  6.344e-02  -0.697 0.486058    
## PSC                 -1.045e-01  6.557e-02  -1.594 0.111170    
## `PSC Fill`          -2.054e-02  2.656e-02  -0.773 0.439441    
## `PSC CO2`           -1.606e-01  7.308e-02  -2.198 0.028045 *  
## `Mnf Flow`          -6.943e-04  4.263e-05 -16.287  < 2e-16 ***
## `Carb Pressure1`     5.660e-03  7.593e-04   7.454 1.28e-13 ***
## `Fill Pressure`      1.461e-03  1.432e-03   1.021 0.307533    
## `Hyd Pressure1`      1.511e-03  2.888e-04   5.231 1.84e-07 ***
## `Hyd Pressure4`     -1.079e-03  2.636e-04  -4.092 4.43e-05 ***
## Temperature         -2.351e-02  2.575e-03  -9.127  < 2e-16 ***
## `Usage cont`        -9.331e-03  1.266e-03  -7.369 2.39e-13 ***
## `Carb Flow`          1.337e-06  3.456e-06   0.387 0.698851    
## `Pressure Vacuum`   -5.112e-03  6.592e-03  -0.776 0.438087    
## `Oxygen Filler`     -3.332e-01  8.300e-02  -4.015 6.13e-05 ***
## `Pressure Setpoint` -8.491e-03  2.232e-03  -3.804 0.000146 ***
## `Air Pressurer`     -7.443e-04  2.610e-03  -0.285 0.775518    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1431 on 2302 degrees of freedom
##   (251 observations deleted due to missingness)
## Multiple R-squared:  0.313,  Adjusted R-squared:  0.3079 
## F-statistic: 61.68 on 17 and 2302 DF,  p-value: < 2.2e-16
# Diagnostic plots for the final model
par(mfrow = c(2, 2))
plot(final_model)

Model Training

Given the relatively low R-squared values from the linear regression models, we fit a variety of models below to see whether they better explain the variance in pH.

Support Vector Machine (SVM)

Training a non-linear SVM model with a radial basis function kernel to predict pH values.

control <- trainControl(method = "cv", number = 5)

set.seed(123)

svm_model <- train(
  x = train_x, 
  y = train_y, 
  method = "svmRadial", 
  tuneLength = 10,  
  trControl = control  
)
plot(svm_model)

plot(varImp(svm_model))

Predicting PH values using the trained SVM model on the test set.

svm_predictions <- predict(svm_model, newdata = test_x)

We compare the predictions with the actual test set values to evaluate the model:

postResample(pred = svm_predictions, obs = test_y)
##      RMSE  Rsquared       MAE 
## 0.7061447 0.5214108 0.5095292

The RMSE is high at 0.71 and the R² is 0.52, indicating room for improvement. Note that pH was centered and scaled during preprocessing, so this RMSE is expressed in standard-deviation units rather than raw pH units.

Ridge Regression

Training a Ridge Regression model with cross-validation to find the optimal λ (lambda) value and predict pH values.
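For reference, ridge regression chooses coefficients by minimizing a penalized least-squares objective,

\[
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2,
\]

where larger values of \(\lambda\) shrink the coefficients toward (but never exactly to) zero. Cross-validation selects the \(\lambda\) with the lowest RMSE.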

set.seed(345)

ridge_model <- train(
  x = train_x, 
  y = train_y, 
  method = "glmnet", 
  tuneGrid = expand.grid(alpha = 0, lambda = seq(0.001, 1, length = 20)), 
  metric = "RMSE",  
  trControl = control,    
  preProc = c("center", "scale")  
)
plot(ridge_model)

plot(varImp(ridge_model))

Using the trained Ridge model to predict pH values for the test set.

#  Predict and evaluate the Ridge Regression model
ridge_predictions <- predict(ridge_model, newdata = test_x)

Evaluate the model:

postResample(pred = ridge_predictions, obs = test_y)
##      RMSE  Rsquared       MAE 
## 0.8096543 0.3672160 0.6248417

The low R² value indicates there’s still a large amount of unexplained variance.

Lasso

We fit a Lasso regression model to leverage its feature selection and complexity reduction properties. Lasso applies a penalty that shrinks less important coefficients toward zero, resulting in a smaller set of non-zero features. By focusing on the most critical variables, we may obtain a parsimonious model that outperforms our linear regression baselines.
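For reference, the lasso replaces ridge's squared penalty with an \(L_1\) penalty,

\[
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|,
\]

which can set coefficients exactly to zero and thereby performs the feature selection described above. glmnet's elastic net blends the two penalties through its mixing parameter \(\alpha\) (\(\alpha = 1\) is the lasso, \(\alpha = 0\) is ridge).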

The tuning results reveal an optimal model at alpha = 0.2 and lambda = 0.004; note that because caret tuned alpha as well as lambda, the selected fit is strictly an elastic net rather than a pure lasso. The variable importance rankings highlight Oxygen Filler as the most influential predictor, followed by Carb Rel, PC Volume, and Density. These top features identify key predictors of pH.

set.seed(321)

#Define cross-validation method
cross_val_10 <- trainControl(method = "cv", number = 10)

lasso_model <- train(
  x = train_x,
  y = train_y,
  method = "glmnet",
  trControl = cross_val_10,
  tuneLength = 20)
plot(lasso_model)

plot(varImp(lasso_model))

Next, we predict the test set with the Lasso model:

lasso_predictions <- predict(lasso_model, newdata = test_x)
postResample(lasso_predictions, test_y)
##      RMSE  Rsquared       MAE 
## 0.8074707 0.3708608 0.6216203

The Lasso regression achieves effective dimensionality reduction, but the relatively modest R-squared suggests further exploration is necessary. Its R-squared of 0.37 is lower than the adjusted R-squared of 0.44 from the multiple linear regression (MLR) model, so it may not be a suitable model for this data set.

Neural Network

To capture non-linear and potentially complex relationships, we train a neural network model.

library(nnet)

# Define a tuning grid
nnetGrid <- expand.grid(
  size = 1:10,
  decay = c(0.001, 0.01, 0.1, 0.5))
library(foreach)
library(doParallel)

We will use parallel computing to decrease the run time, creating a cluster with two fewer workers than the number of system cores.

cluster0 <- makeCluster(detectCores()-2)

Next, we start the cluster, train the model, and close the cluster.

registerDoParallel(cluster0)

# Train neural network
set.seed(321)
nnetTuned <- train(
  x = train_x,
  y = train_y,
  method = "nnet",
  tuneGrid = nnetGrid,
  preProcess = c("center", "scale"),
  trControl = trainControl(method = "cv", number = 10),
  linout = TRUE,
  trace = FALSE,
  maxit = 500,
  MaxNWts = 10 * (ncol(train_x) + 1) + 10 + 1)

stopCluster(cluster0)
plot(nnetTuned)

plot(varImp(nnetTuned))

The tuning results graph shows that the combination of decay and number of hidden units significantly impacts model performance, with RMSE improving as decay and size are tuned. The variable importance plot highlights Carb Flow, Mnf Flow, and Alch Rel as key predictors.

Next, we try predicting the test set with the neural network:

nnet_predictions <- predict(nnetTuned, newdata = test_x)
postResample(nnet_predictions, test_y)
##      RMSE  Rsquared       MAE 
## 0.7333025 0.4902206 0.5611491

The neural network model achieves a test-set R-squared of 0.49. The optimal parameters are size = 9 (number of hidden units) and decay = 0.5 (regularization parameter). Overall, the neural network yields moderate results.

Random Forest model

Below we try fitting a random forest model to the data.

One of the tuning parameters for random forests is the number of predictors randomly considered at each split, referred to as \(m_{try}\); for regression, it is recommended to start at one-third of the number of predictors. The number of trees is an additional parameter, and it is recommended to set it to at least 1000.
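Under that rule of thumb, a natural starting value for this data set is one-third of the 33 predictors:

# Recommended default m_try for regression forests
floor(ncol(train_x) / 3)  # 11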

Below, we create a tuning grid for the \(m_{try}\) parameter.

mtry <- seq(from = 1, to = (ncol(train_x)), by = 4)
mtry <- as.data.frame(mtry)

We use parallel computing to reduce the run time for training the model.

We create a cluster with two fewer workers than the number of system cores.

cluster <- makeCluster(detectCores()-2)

Next, we start the cluster, train the model, and close the cluster.

registerDoParallel(cluster)

set.seed(50)

rf_model <- train(x = train_x, y = train_y, method = 'rf',
                  tuneGrid = mtry,
                  trControl = trainControl(method = 'cv', number = 10, allowParallel = T))

stopCluster(cluster)
rf_model
## Random Forest 
## 
## 2058 samples
##   33 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1851, 1853, 1852, 1851, 1852, 1854, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE      
##    1    0.7122044  0.5650578  0.5554539
##    5    0.5901434  0.6771089  0.4383431
##    9    0.5730367  0.6878813  0.4208781
##   13    0.5631583  0.6953799  0.4119006
##   17    0.5601668  0.6960001  0.4085768
##   21    0.5583132  0.6957522  0.4050681
##   25    0.5551583  0.6974658  0.4027724
##   29    0.5592206  0.6907548  0.4049418
##   33    0.5612781  0.6866355  0.4059200
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 25.

Below is a plot of the cross-validation RMSE:

plot(rf_model)

Below is a plot of the variable importance:

plot(varImp(rf_model))

Next, we try predicting the test set with the random forest model

rf_predictions <- predict(rf_model, newdata = test_x)

We compare the predictions with the actual values:

postResample(pred = rf_predictions, obs = test_y)
##      RMSE  Rsquared       MAE 
## 0.6085109 0.6500000 0.4368886

Gradient boosted model

Next, we fit a gradient boosted regression tree model to the data. The tuning parameters include tree depth, number of trees, and the shrinkage parameter, which controls how much of each new tree's prediction is added to the running fit (values below 0.01 are often recommended). The bagging fraction is kept constant at 0.5.
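Concretely, each boosting iteration fits a tree \(h_m\) to the current residuals and adds a shrunken copy of it to the running model,

\[
f_m(x) = f_{m-1}(x) + \nu \, h_m(x),
\]

where \(\nu\) is the shrinkage (learning rate): smaller values require more trees but typically generalize better.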

Below, we create the tuning grid:

gbmGrid <- expand.grid(interaction.depth = seq(1,7, by = 2),
                       shrinkage = c(0.01,0.1),
                       n.trees = seq(100,1000, by = 50),
                       n.minobsinnode = seq(5,30, by = 5))

We will use parallel computing to reduce the training time. Below we start the cluster, train the model, and close the cluster.

cluster2 <- makeCluster(detectCores()-2)
registerDoParallel(cluster2)

set.seed(100)

gbm_model <- train(x = train_x, y = train_y, method = 'gbm',
                  tuneGrid = gbmGrid,
                  trControl = trainControl(method = 'cv', number = 10, allowParallel = T),
                  verbose = F)

stopCluster(cluster2)
plot(gbm_model)

Below are the parameters for the final model:

gbm_model$bestTune
##     n.trees interaction.depth shrinkage n.minobsinnode
## 862     400                 7       0.1             20

Next, we try predicting the test set with the gradient boosted model

gbm_predictions <- predict(gbm_model, newdata = test_x)

We compare the predictions with the actual values:

postResample(pred = gbm_predictions, obs = test_y)
##      RMSE  Rsquared       MAE 
## 0.6596248 0.5803806 0.4926660

Model Performance Metrics

Below are the performance metrics on the test set for the various models

svm_metrics   <- as.data.frame(postResample(pred = svm_predictions, obs = test_y))
ridge_metrics <- as.data.frame(postResample(pred = ridge_predictions, obs = test_y))
lasso_metrics <- as.data.frame(postResample(pred = lasso_predictions, obs = test_y))
nnet_metrics  <- as.data.frame(postResample(pred = nnet_predictions, obs = test_y))
rf_metrics    <- as.data.frame(postResample(pred = rf_predictions, obs = test_y))
gbm_metrics   <- as.data.frame(postResample(pred = gbm_predictions, obs = test_y))
model_performance <- cbind(svm_metrics, ridge_metrics, lasso_metrics, nnet_metrics,
                           rf_metrics, gbm_metrics)
colnames(model_performance) <- c('SVM', 'Ridge', 'Lasso', 'NNet', 'RF', 'GBM')
model_performance <- as.data.frame(t(model_performance))
model_performance
##            RMSE  Rsquared       MAE
## SVM   0.7061447 0.5214108 0.5095292
## Ridge 0.8096543 0.3672160 0.6248417
## Lasso 0.8074707 0.3708608 0.6216203
## NNet  0.7333025 0.4902206 0.5611491
## RF    0.6085109 0.6500000 0.4368886
## GBM   0.6596248 0.5803806 0.4926660

Model Selection - Random Forest

With the lowest RMSE and MAE and the highest R-squared of the models we fit, we select the random forest for predicting pH in the unseen Student Evaluation data.

In general, the better performance of Random Forest and Gradient Boosting Machine models suggests that the underlying data contains non-linear relationships and complex feature interactions that linear models like Ridge and Lasso cannot adequately model.
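The same selection can be read directly off the metrics table:

# Model with the lowest test-set RMSE
rownames(model_performance)[which.min(model_performance$RMSE)]  # "RF"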

Predict Student Evaluation Data

Below we prepare the prediction data set and predict values using our chosen model.

stu_eval_raw$`Brand Code` <- as.factor(stu_eval_raw$`Brand Code`)
stu_eval_raw <- kNN(data = stu_eval_raw, variable = 'Brand Code', k = 5, imp_var = F)

# Reuse the dummyVars object fit on the training data so the evaluation set
# receives an identical encoding
stu_eval_transformed <- predict(dummy_variables, newdata = stu_eval_raw)
stu_eval_transformed <- as.data.frame(stu_eval_transformed)

We drop the empty PH column:

stu_eval_transformed$PHTRUE <- NULL

Apply pre-processing (centering, scaling, and imputing):

preProcess_model_eval <- preProcess(stu_eval_transformed[,4:34], method = c("center", "scale", "knnImpute"))
stu_eval_transformed_imputed <- stu_eval_transformed
stu_eval_transformed_imputed[,4:34] <- predict(preProcess_model_eval, stu_eval_transformed_imputed[,4:34])

Check for near-zero variance predictors

nearZeroVar(stu_eval_transformed_imputed)
## integer(0)

There are no near-zero variance predictors to remove.

Predict PH

# Use the trained random forest model to predict PH
predictions <- predict(rf_model, newdata = stu_eval_transformed_imputed)

# Add predictions to the original data
stu_eval_raw$Predicted_PH <- predictions

Reverse centering and scaling

ph_center <- preProcess_model$mean["PH"]
ph_scale <- preProcess_model$std["PH"]

# Transform predictions back to the original scale
stu_eval_raw$Predicted_PH <- (stu_eval_raw$Predicted_PH * ph_scale) + ph_center  
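As a sanity check (a quick sketch using the objects above), the back-transformed predictions should fall within the range of pH values observed in the raw training data:

# Compare the predicted pH range to the observed training pH range
range(stu_eval_raw$Predicted_PH)
range(stu_data_raw$PH, na.rm = TRUE)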

View the data:

head(stu_eval_raw["Predicted_PH"])
##   Predicted_PH
## 1     8.584179
## 2     8.429087
## 3     8.491033
## 4     8.494210
## 5     8.487493
## 6     8.514038
hist(stu_eval_raw$Predicted_PH)

Writing to Excel

Below we write the evaluation set with predicted values to an Excel file.

library(openxlsx)

# Drop the empty PH column; Predicted_PH carries the model's predictions
stu_eval_raw$PH <- NULL

write.xlsx(stu_eval_raw, file = "stu_eval_predicted_values.xlsx")