Problem Statement

My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.

Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.


Load Data
# Read PH Training Data
studentdata <- read.csv("https://raw.githubusercontent.com/johnnydrodriguez/Data624/refs/heads/main/StudentData.csv", header = TRUE, sep = ',', na.strings="", fill = TRUE)
str(studentdata)
## 'data.frame':    2571 obs. of  33 variables:
##  $ Brand.Code       : chr  "B" "A" "B" "A" ...
##  $ Carb.Volume      : num  5.34 5.43 5.29 5.44 5.49 ...
##  $ Fill.Ounces      : num  24 24 24.1 24 24.3 ...
##  $ PC.Volume        : num  0.263 0.239 0.263 0.293 0.111 ...
##  $ Carb.Pressure    : num  68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
##  $ Carb.Temp        : num  141 140 145 133 137 ...
##  $ PSC              : num  0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
##  $ PSC.Fill         : num  0.26 0.22 0.34 0.42 0.16 0.24 0.4 0.34 0.12 0.24 ...
##  $ PSC.CO2          : num  0.04 0.04 0.16 0.04 0.12 0.04 0.04 0.04 0.14 0.06 ...
##  $ Mnf.Flow         : num  -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
##  $ Carb.Pressure1   : num  119 122 120 115 118 ...
##  $ Fill.Pressure    : num  46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
##  $ Hyd.Pressure1    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure2    : num  NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure3    : num  NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure4    : int  118 106 82 92 92 116 124 132 90 108 ...
##  $ Filler.Level     : num  121 119 120 118 119 ...
##  $ Filler.Speed     : int  4002 3986 4020 4012 4010 4014 NA 1004 4014 4028 ...
##  $ Temperature      : num  66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
##  $ Usage.cont       : num  16.2 19.9 17.8 17.4 17.7 ...
##  $ Carb.Flow        : int  2932 3144 2914 3062 3054 2948 30 684 2902 3038 ...
##  $ Density          : num  0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
##  $ MFR              : num  725 727 735 731 723 ...
##  $ Balling          : num  1.4 1.5 3.14 3.04 3.04 ...
##  $ Pressure.Vacuum  : num  -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
##  $ PH               : num  8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
##  $ Oxygen.Filler    : num  0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
##  $ Bowl.Setpoint    : int  120 120 120 120 120 120 120 120 120 120 ...
##  $ Pressure.Setpoint: num  46.4 46.8 46.6 46 46 46 46 46 46 46 ...
##  $ Air.Pressurer    : num  143 143 142 146 146 ...
##  $ Alch.Rel         : num  6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
##  $ Carb.Rel         : num  5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
##  $ Balling.Lvl      : num  1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...


Check Missing data

The primary goal at this stage is to identify whether our dataset contains missing observations that could distort model training. We visualize patterns of missingness to gain understanding of the scale and structure of missing data. Detecting these gaps allows us to address them before moving to modeling steps and to reduce the risk of biased models.

The missingness heatmap and the ranked bar chart show MFR has a notable amount of missing entries – where MFR stands out with the highest number of missing values. It could significantly affect our analysis. Brand.Code and Filler.Speed, also exhibit noticeable levels of missingness. As these variables have a high percentage of missing data, they may influence model performance and predictions.

library(naniar)
vis_miss(studentdata)

gg_miss_var(studentdata)


MICE Imputation

We apply Multiple Imputation by Chained Equations (MICE), to fill in missing values. MICE ensures that our final modeling dataset retains the relationships between features and yields better generalization performance.

The output for MICE Imputation specifies the imputation methods assigned to each variable. Brand.Code is imputed using “polyreg”(polytomous regression) to handle its categorical value, while most numeric variables (e.g., use “pmm” (Predictive Mean Matching). This method selection tailors the imputation process to the type and characteristics of each variable, preserving the relationships within the data.

# Load library
library(mice)

# Convert Brand.Code for imputation
studentdata$Brand.Code <- as.factor(studentdata$Brand.Code)

# Generate method vector for the dataset
imputation_methods <- make.method(studentdata)

# Specify method for Brand.Code
imputation_methods["Brand.Code"] <- "polyreg" 

# Specify method for numerical columns (Predictive Mean Matching)
numeric_columns <- names(studentdata)[sapply(studentdata, is.numeric)]
imputation_methods[numeric_columns] <- "pmm"

# Verify the methods vector
print(imputation_methods)
##        Brand.Code       Carb.Volume       Fill.Ounces         PC.Volume 
##         "polyreg"             "pmm"             "pmm"             "pmm" 
##     Carb.Pressure         Carb.Temp               PSC          PSC.Fill 
##             "pmm"             "pmm"             "pmm"             "pmm" 
##           PSC.CO2          Mnf.Flow    Carb.Pressure1     Fill.Pressure 
##             "pmm"             "pmm"             "pmm"             "pmm" 
##     Hyd.Pressure1     Hyd.Pressure2     Hyd.Pressure3     Hyd.Pressure4 
##             "pmm"             "pmm"             "pmm"             "pmm" 
##      Filler.Level      Filler.Speed       Temperature        Usage.cont 
##             "pmm"             "pmm"             "pmm"             "pmm" 
##         Carb.Flow           Density               MFR           Balling 
##             "pmm"             "pmm"             "pmm"             "pmm" 
##   Pressure.Vacuum                PH     Oxygen.Filler     Bowl.Setpoint 
##             "pmm"             "pmm"             "pmm"             "pmm" 
## Pressure.Setpoint     Air.Pressurer          Alch.Rel          Carb.Rel 
##             "pmm"             "pmm"             "pmm"             "pmm" 
##       Balling.Lvl 
##             "pmm"
# Perform multiple imputations
imputed_data <- mice(studentdata, method = imputation_methods, m = 5, seed = 321)
## 
##  iter imp variable
##   1   1  Brand.Code  Carb.Volume  Fill.Ounces  PC.Volume  Carb.Pressure  Carb.Temp  PSC  PSC.Fill  PSC.CO2  Mnf.Flow  Carb.Pressure1  Fill.Pressure  Hyd.Pressure1  Hyd.Pressure2  Hyd.Pressure3  Hyd.Pressure4  Filler.Level  Filler.Speed  Temperature  Usage.cont  Carb.Flow  Density  MFR  Balling  PH  Oxygen.Filler  Bowl.Setpoint  Pressure.Setpoint  Alch.Rel  Carb.Rel  Balling.Lvl
##   1   2  Brand.Code  Carb.Volume  Fill.Ounces  PC.Volume  Carb.Pressure  Carb.Temp  PSC  PSC.Fill  PSC.CO2  Mnf.Flow  Carb.Pressure1  Fill.Pressure  Hyd.Pressure1  Hyd.Pressure2  Hyd.Pressure3  Hyd.Pressure4  Filler.Level  Filler.Speed  Temperature  Usage.cont  Carb.Flow  Density  MFR  Balling  PH  Oxygen.Filler  Bowl.Setpoint  Pressure.Setpoint  Alch.Rel  Carb.Rel  Balling.Lvl
##   1   3  Brand.Code  Carb.Volume  Fill.Ounces  PC.Volume  Carb.Pressure  Carb.Temp  PSC  PSC.Fill  PSC.CO2  Mnf.Flow  Carb.Pressure1  Fill.Pressure  Hyd.Pressure1  Hyd.Pressure2  Hyd.Pressure3  Hyd.Pressure4  Filler.Level  Filler.Speed  Temperature  Usage.cont  Carb.Flow  Density  MFR  Balling  PH  Oxygen.Filler  Bowl.Setpoint  Pressure.Setpoint  Alch.Rel  Carb.Rel  Balling.Lvl
##   1   4  Brand.Code  Carb.Volume  Fill.Ounces  PC.Volume  Carb.Pressure  Carb.Temp  PSC  PSC.Fill  PSC.CO2  Mnf.Flow  Carb.Pressure1  Fill.Pressure  Hyd.Pressure1  Hyd.Pressure2  Hyd.Pressure3  Hyd.Pressure4  Filler.Level  Filler.Speed  Temperature  Usage.cont  Carb.Flow  Density  MFR  Balling  PH  Oxygen.Filler  Bowl.Setpoint  Pressure.Setpoint  Alch.Rel  Carb.Rel  Balling.Lvl
##   1   5  Brand.Code  Carb.Volume  Fill.Ounces  PC.Volume  Carb.Pressure  Carb.Temp  PSC  PSC.Fill  PSC.CO2  Mnf.Flow  Carb.Pressure1  Fill.Pressure  Hyd.Pressure1  Hyd.Pressure2  Hyd.Pressure3  Hyd.Pressure4  Filler.Level  Filler.Speed  Temperature  Usage.cont  Carb.Flow  Density  MFR  Balling  PH  Oxygen.Filler  Bowl.Setpoint  Pressure.Setpoint  Alch.Rel  Carb.Rel  Balling.Lvl
##   2   1  Brand.Code  Carb.Volume  Fill.Ounces  PC.Volume  Carb.Pressure  Carb.Temp  PSC  PSC.Fill  PSC.CO2  Mnf.Flow  Carb.Pressure1  Fill.Pressure  Hyd.Pressure1  Hyd.Pressure2  Hyd.Pressure3  Hyd.Pressure4  Filler.Level  Filler.Speed  Temperature  Usage.cont  Carb.Flow  Density  MFR  Balling  PH  Oxygen.Filler  Bowl.Setpoint  Pressure.Setpoint  Alch.Rel  Carb.Rel  Balling.Lvl
##   2   2  Brand.Code  Carb.Volume  Fill.Ounces  PC.Volume  Carb.Pressure  Carb.Temp  PSC  PSC.Fill  PSC.CO2  Mnf.Flow  Carb.Pressure1  Fill.Pressure  Hyd.Pressure1  Hyd.Pressure2  Hyd.Pressure3  Hyd.Pressure4  Filler.Level  Filler.Speed  Temperature  Usage.cont  Carb.Flow  Density  MFR  Balling  PH  Oxygen.Filler  Bowl.Setpoint  Pressure.Setpoint  Alch.Rel  Carb.Rel  Balling.Lvl
##   2   3  Brand.Code  Carb.Volume  Fill.Ounces  PC.Volume  Carb.Pressure  Carb.Temp  PSC  PSC.Fill  PSC.CO2  Mnf.Flow  Carb.Pressure1  Fill.Pressure  Hyd.Pressure1  Hyd.Pressure2  Hyd.Pressure3  Hyd.Pressure4  Filler.Level  Filler.Speed  Temperature  Usage.cont  Carb.Flow  Density  MFR  Balling  PH  Oxygen.Filler  Bowl.Setpoint  Pressure.Setpoint  Alch.Rel  Carb.Rel  Balling.Lvl
##   2   4  Brand.Code  Carb.Volume  Fill.Ounces  PC.Volume  Carb.Pressure  Carb.Temp  PSC  PSC.Fill  PSC.CO2  Mnf.Flow  Carb.Pressure1  Fill.Pressure  Hyd.Pressure1  Hyd.Pressure2  Hyd.Pressure3  Hyd.Pressure4  Filler.Level  Filler.Speed  Temperature  Usage.cont  Carb.Flow  Density  MFR  Balling  PH  Oxygen.Filler  Bowl.Setpoint  Pressure.Setpoint  Alch.Rel  Carb.Rel  Balling.Lvl
##   2   5  Brand.Code  Carb.Volume  Fill.Ounces  PC.Volume  Carb.Pressure  Carb.Temp  PSC  PSC.Fill  PSC.CO2  Mnf.Flow  Carb.Pressure1  Fill.Pressure  Hyd.Pressure1  Hyd.Pressure2  Hyd.Pressure3  Hyd.Pressure4  Filler.Level  Filler.Speed  Temperature  Usage.cont  Carb.Flow  Density  MFR  Balling  PH  Oxygen.Filler  Bowl.Setpoint  Pressure.Setpoint  Alch.Rel  Carb.Rel  Balling.Lvl
##   3   1  Brand.Code  Carb.Volume  Fill.Ounces  PC.Volume  Carb.Pressure  Carb.Temp  PSC  PSC.Fill  PSC.CO2  Mnf.Flow  Carb.Pressure1  Fill.Pressure  Hyd.Pressure1  Hyd.Pressure2  Hyd.Pressure3  Hyd.Pressure4  Filler.Level  Filler.Speed  Temperature  Usage.cont  Carb.Flow  Density  MFR  Balling  PH  Oxygen.Filler  Bowl.Setpoint  Pressure.Setpoint  Alch.Rel  Carb.Rel  Balling.Lvl
##   3   2  Brand.Code  Carb.Volume  Fill.Ounces  PC.Volume  Carb.Pressure  Carb.Temp  PSC  PSC.Fill  PSC.CO2  Mnf.Flow  Carb.Pressure1  Fill.Pressure  Hyd.Pressure1  Hyd.Pressure2  Hyd.Pressure3  Hyd.Pressure4  Filler.Level  Filler.Speed  Temperature  Usage.cont  Carb.Flow  Density  MFR  Balling  PH  Oxygen.Filler  Bowl.Setpoint  Pressure.Setpoint  Alch.Rel  Carb.Rel  Balling.Lvl
##   3   3  Brand.Code  Carb.Volume  Fill.Ounces  PC.Volume  Carb.Pressure  Carb.Temp  PSC  PSC.Fill  PSC.CO2  Mnf.Flow  Carb.Pressure1  Fill.Pressure  Hyd.Pressure1  Hyd.Pressure2  Hyd.Pressure3  Hyd.Pressure4  Filler.Level  Filler.Speed  Temperature  Usage.cont  Carb.Flow  Density  MFR  Balling  PH  Oxygen.Filler  Bowl.Setpoint  Pressure.Setpoint  Alch.Rel  Carb.Rel  Balling.Lvl
##   3   4  Brand.Code  Carb.Volume  Fill.Ounces  PC.Volume  Carb.Pressure  Carb.Temp  PSC  PSC.Fill  PSC.CO2  Mnf.Flow  Carb.Pressure1  Fill.Pressure  Hyd.Pressure1  Hyd.Pressure2  Hyd.Pressure3  Hyd.Pressure4  Filler.Level  Filler.Speed  Temperature  Usage.cont  Carb.Flow  Density  MFR  Balling  PH  Oxygen.Filler  Bowl.Setpoint  Pressure.Setpoint  Alch.Rel  Carb.Rel  Balling.Lvl
##   3   5  Brand.Code  Carb.Volume  Fill.Ounces  PC.Volume  Carb.Pressure  Carb.Temp  PSC  PSC.Fill  PSC.CO2  Mnf.Flow  Carb.Pressure1  Fill.Pressure  Hyd.Pressure1  Hyd.Pressure2  Hyd.Pressure3  Hyd.Pressure4  Filler.Level  Filler.Speed  Temperature  Usage.cont  Carb.Flow  Density  MFR  Balling  PH  Oxygen.Filler  Bowl.Setpoint  Pressure.Setpoint  Alch.Rel  Carb.Rel  Balling.Lvl
##   4   1  Brand.Code  Carb.Volume  Fill.Ounces  PC.Volume  Carb.Pressure  Carb.Temp  PSC  PSC.Fill  PSC.CO2  Mnf.Flow  Carb.Pressure1  Fill.Pressure  Hyd.Pressure1  Hyd.Pressure2  Hyd.Pressure3  Hyd.Pressure4  Filler.Level  Filler.Speed  Temperature  Usage.cont  Carb.Flow  Density  MFR  Balling  PH  Oxygen.Filler  Bowl.Setpoint  Pressure.Setpoint  Alch.Rel  Carb.Rel  Balling.Lvl
##   4   2  Brand.Code  Carb.Volume  Fill.Ounces  PC.Volume  Carb.Pressure  Carb.Temp  PSC  PSC.Fill  PSC.CO2  Mnf.Flow  Carb.Pressure1  Fill.Pressure  Hyd.Pressure1  Hyd.Pressure2  Hyd.Pressure3  Hyd.Pressure4  Filler.Level  Filler.Speed  Temperature  Usage.cont  Carb.Flow  Density  MFR  Balling  PH  Oxygen.Filler  Bowl.Setpoint  Pressure.Setpoint  Alch.Rel  Carb.Rel  Balling.Lvl
##   4   3  Brand.Code  Carb.Volume  Fill.Ounces  PC.Volume  Carb.Pressure  Carb.Temp  PSC  PSC.Fill  PSC.CO2  Mnf.Flow  Carb.Pressure1  Fill.Pressure  Hyd.Pressure1  Hyd.Pressure2  Hyd.Pressure3  Hyd.Pressure4  Filler.Level  Filler.Speed  Temperature  Usage.cont  Carb.Flow  Density  MFR  Balling  PH  Oxygen.Filler  Bowl.Setpoint  Pressure.Setpoint  Alch.Rel  Carb.Rel  Balling.Lvl
##   4   4  Brand.Code  Carb.Volume  Fill.Ounces  PC.Volume  Carb.Pressure  Carb.Temp  PSC  PSC.Fill  PSC.CO2  Mnf.Flow  Carb.Pressure1  Fill.Pressure  Hyd.Pressure1  Hyd.Pressure2  Hyd.Pressure3  Hyd.Pressure4  Filler.Level  Filler.Speed  Temperature  Usage.cont  Carb.Flow  Density  MFR  Balling  PH  Oxygen.Filler  Bowl.Setpoint  Pressure.Setpoint  Alch.Rel  Carb.Rel  Balling.Lvl
##   4   5  Brand.Code  Carb.Volume  Fill.Ounces  PC.Volume  Carb.Pressure  Carb.Temp  PSC  PSC.Fill  PSC.CO2  Mnf.Flow  Carb.Pressure1  Fill.Pressure  Hyd.Pressure1  Hyd.Pressure2  Hyd.Pressure3  Hyd.Pressure4  Filler.Level  Filler.Speed  Temperature  Usage.cont  Carb.Flow  Density  MFR  Balling  PH  Oxygen.Filler  Bowl.Setpoint  Pressure.Setpoint  Alch.Rel  Carb.Rel  Balling.Lvl
##   5   1  Brand.Code  Carb.Volume  Fill.Ounces  PC.Volume  Carb.Pressure  Carb.Temp  PSC  PSC.Fill  PSC.CO2  Mnf.Flow  Carb.Pressure1  Fill.Pressure  Hyd.Pressure1  Hyd.Pressure2  Hyd.Pressure3  Hyd.Pressure4  Filler.Level  Filler.Speed  Temperature  Usage.cont  Carb.Flow  Density  MFR  Balling  PH  Oxygen.Filler  Bowl.Setpoint  Pressure.Setpoint  Alch.Rel  Carb.Rel  Balling.Lvl
##   5   2  Brand.Code  Carb.Volume  Fill.Ounces  PC.Volume  Carb.Pressure  Carb.Temp  PSC  PSC.Fill  PSC.CO2  Mnf.Flow  Carb.Pressure1  Fill.Pressure  Hyd.Pressure1  Hyd.Pressure2  Hyd.Pressure3  Hyd.Pressure4  Filler.Level  Filler.Speed  Temperature  Usage.cont  Carb.Flow  Density  MFR  Balling  PH  Oxygen.Filler  Bowl.Setpoint  Pressure.Setpoint  Alch.Rel  Carb.Rel  Balling.Lvl
##   5   3  Brand.Code  Carb.Volume  Fill.Ounces  PC.Volume  Carb.Pressure  Carb.Temp  PSC  PSC.Fill  PSC.CO2  Mnf.Flow  Carb.Pressure1  Fill.Pressure  Hyd.Pressure1  Hyd.Pressure2  Hyd.Pressure3  Hyd.Pressure4  Filler.Level  Filler.Speed  Temperature  Usage.cont  Carb.Flow  Density  MFR  Balling  PH  Oxygen.Filler  Bowl.Setpoint  Pressure.Setpoint  Alch.Rel  Carb.Rel  Balling.Lvl
##   5   4  Brand.Code  Carb.Volume  Fill.Ounces  PC.Volume  Carb.Pressure  Carb.Temp  PSC  PSC.Fill  PSC.CO2  Mnf.Flow  Carb.Pressure1  Fill.Pressure  Hyd.Pressure1  Hyd.Pressure2  Hyd.Pressure3  Hyd.Pressure4  Filler.Level  Filler.Speed  Temperature  Usage.cont  Carb.Flow  Density  MFR  Balling  PH  Oxygen.Filler  Bowl.Setpoint  Pressure.Setpoint  Alch.Rel  Carb.Rel  Balling.Lvl
##   5   5  Brand.Code  Carb.Volume  Fill.Ounces  PC.Volume  Carb.Pressure  Carb.Temp  PSC  PSC.Fill  PSC.CO2  Mnf.Flow  Carb.Pressure1  Fill.Pressure  Hyd.Pressure1  Hyd.Pressure2  Hyd.Pressure3  Hyd.Pressure4  Filler.Level  Filler.Speed  Temperature  Usage.cont  Carb.Flow  Density  MFR  Balling  PH  Oxygen.Filler  Bowl.Setpoint  Pressure.Setpoint  Alch.Rel  Carb.Rel  Balling.Lvl
# Extract the completed dataset
studentdata_mice <- complete(imputed_data)


Missing Data Check

We verify that the dataset is now fully populated by checking for any lingering NA values. With complete data, we can move forward with modeling. The output confirms that all missing values have been successfully addressed.

colSums(is.na(studentdata_mice))
##        Brand.Code       Carb.Volume       Fill.Ounces         PC.Volume 
##                 0                 0                 0                 0 
##     Carb.Pressure         Carb.Temp               PSC          PSC.Fill 
##                 0                 0                 0                 0 
##           PSC.CO2          Mnf.Flow    Carb.Pressure1     Fill.Pressure 
##                 0                 0                 0                 0 
##     Hyd.Pressure1     Hyd.Pressure2     Hyd.Pressure3     Hyd.Pressure4 
##                 0                 0                 0                 0 
##      Filler.Level      Filler.Speed       Temperature        Usage.cont 
##                 0                 0                 0                 0 
##         Carb.Flow           Density               MFR           Balling 
##                 0                 0                 0                 0 
##   Pressure.Vacuum                PH     Oxygen.Filler     Bowl.Setpoint 
##                 0                 0                 0                 0 
## Pressure.Setpoint     Air.Pressurer          Alch.Rel          Carb.Rel 
##                 0                 0                 0                 0 
##       Balling.Lvl 
##                 0


Correlation Matrix: PH vs Predictors

We evaluate the relationships between pH (our target) and all available predictors to gain a preliminary understanding of which factors might drive our outcome. Using correlation and visualization we identify the variables are most strongly associated with pH.

The table of correlations ranks predictors based on their correlation with the target variable, pH. Mnf.Flow shows the strongest relationship with pH (negative correlation of -0.445), Bowl.Setpoint (positive correlation of 0.346) is has the strongest positive correlation and may indicate a role in influencing pH – although these correlations are moderate at best.

# Load libraries
library(GGally)
library(ggplot2)
library(dplyr)
library(tidyverse)

# Prepare the data for correlation analysis
cor_data <- studentdata_mice %>%
    select_if(is.numeric)

# Compute correlations between 'PH' and all other predictors
correlation_values <- cor_data %>%
    summarise(across(.cols = everything(), 
                     .fns = ~ cor(., cor_data$PH, use = "complete.obs"), 
                     .names = "cor_{col}")) %>%
    pivot_longer(cols = everything(), names_to = "Predictor", values_to = "Correlation") %>%
    mutate(Predictor = gsub("cor_", "", Predictor)) %>%
    filter(Predictor != "PH") %>% 
    arrange(desc(abs(Correlation)))

# Output sorted list of correlations
print(correlation_values)
## # A tibble: 31 × 2
##    Predictor         Correlation
##    <chr>                   <dbl>
##  1 Mnf.Flow               -0.445
##  2 Bowl.Setpoint           0.346
##  3 Filler.Level            0.323
##  4 Usage.cont             -0.319
##  5 Pressure.Setpoint      -0.304
##  6 Hyd.Pressure3          -0.238
##  7 Pressure.Vacuum         0.220
##  8 Fill.Pressure          -0.211
##  9 Hyd.Pressure2          -0.200
## 10 Oxygen.Filler           0.165
## # ℹ 21 more rows
# Create scatterplot matrix for PH against other predictors
correlation_matrix <- ggpairs(
  data = cor_data,
  columns = which(names(cor_data) == "PH"):ncol(cor_data),
  upper = list(continuous = wrap("cor", size = 3)),
  title = "Correlation Scatterplot Matrix for PH") +
  theme(
    axis.text.x = element_text(size = 7),
    axis.text.y = element_text(size = 7),
    strip.text = element_text(size = 7),
    plot.title = element_text(size = 12))

# Display the scatterplot matrix
print(correlation_matrix)


Baseline: Ordinary Linear Regression with Highest Correlated Predictors

We begin our modeling exploration with a simple linear regression, focused on the top predictors identified in the correlation analysis. We fit and visualize straightforward models to see how well a single predictor explains pH variance. The baseline helps us understand whether simple relationships exist and serves as a reference point for improvement. Although we do not expect this simple model to be the best, it provides a clear benchmark for gauging progress.

Bowl.Setpoint has slight upward trend in pH as the setpoint increases, with an adjusted R-squared of 0.119. This suggests that Bowl.Setpoint explains approximately 11.9% of the variance in pH. Mnf.Flow has a negative slope association, showing that pH decreases as Mnf.Flow increases, with a higher adjusted R-squared of 0.198.

Mnf.Flow is a stronger predictor of pH than Bowl.Setpoint but both are relatives weak predictors for pH. Their low R-squared values point to using additional predictors or more complex models to capture the remaining variance.

# Load library
library(ggplot2)

# Scatterplot with regression line for PH vs. Bowl.Setpoint
plot_bowl <- ggplot(studentdata_mice, aes(x = Bowl.Setpoint, y = PH)) +
  geom_point(color = "blue", alpha = 0.6) +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  labs(
    title = "Scatterplot of PH vs. Bowl.Setpoint",
    x = "Bowl Setpoint",
    y = "PH") +
  theme_minimal()

# Scatterplot with regression line for PH vs. Mnf.Flow
plot_flow <- ggplot(studentdata_mice, aes(x = Mnf.Flow, y = PH)) +
  geom_point(color = "green", alpha = 0.6) +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  labs(
    title = "Scatterplot of PH vs. Mnf.Flow",
    x = "Mnf.Flow",
    y = "PH") +
  theme_minimal()

# Display the plots
print(plot_bowl)

print(plot_flow)

# Linear model for PH and Bowl.Setpoint
model_bowl <- lm(PH ~ Bowl.Setpoint, data = studentdata_mice)
summary(model_bowl)
## 
## Call:
## lm(formula = PH ~ Bowl.Setpoint, data = studentdata_mice)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6685 -0.1074  0.0115  0.1283  0.7726 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.1203354  0.0230139  352.84   <2e-16 ***
## Bowl.Setpoint 0.0038924  0.0002085   18.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1621 on 2569 degrees of freedom
## Multiple R-squared:  0.1194, Adjusted R-squared:  0.1191 
## F-statistic: 348.4 on 1 and 2569 DF,  p-value: < 2.2e-16
# Linear model for PH and Pressure.Setpoint
model_flow <- lm(PH ~ Mnf.Flow, data = studentdata_mice)
summary(model_flow)
## 
## Call:
## lm(formula = PH ~ Mnf.Flow, data = studentdata_mice)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.57861 -0.08608  0.01392  0.11392  0.73405 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.562e+00  3.115e-03 2748.71   <2e-16 ***
## Mnf.Flow    -6.436e-04  2.553e-05  -25.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1547 on 2569 degrees of freedom
## Multiple R-squared:  0.1983, Adjusted R-squared:  0.198 
## F-statistic: 635.5 on 1 and 2569 DF,  p-value: < 2.2e-16


Baseline: Multiple Linear Regression

We expand from a single predictor to a full multiple linear regression model that incorporates all potential factors. We examine model summaries and diagnostic plots to assess fit and identify potential issues like non-linearity or heteroskedasticity.

The multiple linear regression model results an adjusted R-squared value of 0.4072; the model explains approximately 40.7% of the variance in pH. Significant predictors include Mnf.Flow (negative association), Carb.Pressure1 (positive association), Hyd.Pressure3, Temperature, Usage.cont, Balling and other as indicated by p values, Several predictors, such as Carb.Volume and Hyd.Pressure4, are not significant and may add unnecessary complexity to the model.

The Residuals vs Fitted plot shows a reasonably random spread, suggesting linearity and homoscedasticity. The Q-Q plot suggests most residuals are normally distributed, with some deviation in the tails. The Scale-Location plot supports a relatively even variance across fitted values. The Residuals vs Leverage plot highlights a few influential points (e.g., observation 1094), which may warrant further treatment.

# Load libraries
library(ggplot2)
library(car)
library(lmtest)
library(GGally)

# Multiple linear regression model
model <- lm(PH ~ ., data = studentdata_mice)
summary(model)
## 
## Call:
## lm(formula = PH ~ ., data = studentdata_mice)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52814 -0.07883  0.01056  0.08650  0.85018 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.142e+01  1.083e+00  10.543  < 2e-16 ***
## Brand.CodeB        7.884e-02  2.093e-02   3.766 0.000169 ***
## Brand.CodeC       -5.817e-02  2.075e-02  -2.803 0.005104 ** 
## Brand.CodeD        6.130e-02  1.638e-02   3.742 0.000186 ***
## Carb.Volume       -1.316e-01  9.039e-02  -1.456 0.145491    
## Fill.Ounces       -8.063e-02  3.211e-02  -2.511 0.012116 *  
## PC.Volume         -9.943e-02  5.415e-02  -1.836 0.066454 .  
## Carb.Pressure      4.089e-03  4.243e-03   0.964 0.335332    
## Carb.Temp         -2.310e-03  3.342e-03  -0.691 0.489442    
## PSC               -1.069e-01  5.725e-02  -1.868 0.061911 .  
## PSC.Fill          -3.392e-02  2.345e-02  -1.446 0.148178    
## PSC.CO2           -1.031e-01  6.294e-02  -1.638 0.101514    
## Mnf.Flow          -7.249e-04  4.609e-05 -15.726  < 2e-16 ***
## Carb.Pressure1     6.968e-03  6.936e-04  10.045  < 2e-16 ***
## Fill.Pressure      2.140e-03  1.210e-03   1.768 0.077131 .  
## Hyd.Pressure1      3.432e-05  3.684e-04   0.093 0.925787    
## Hyd.Pressure2     -1.218e-03  5.305e-04  -2.296 0.021763 *  
## Hyd.Pressure3      3.558e-03  5.823e-04   6.110 1.15e-09 ***
## Hyd.Pressure4     -1.330e-04  3.127e-04  -0.425 0.670742    
## Filler.Level      -1.067e-03  5.702e-04  -1.872 0.061365 .  
## Filler.Speed       1.692e-05  1.123e-05   1.507 0.131912    
## Temperature       -1.259e-02  2.277e-03  -5.528 3.56e-08 ***
## Usage.cont        -7.394e-03  1.141e-03  -6.478 1.11e-10 ***
## Carb.Flow          1.063e-05  3.737e-06   2.844 0.004491 ** 
## Density           -1.146e-01  2.798e-02  -4.095 4.35e-05 ***
## MFR               -1.069e-04  6.198e-05  -1.724 0.084835 .  
## Balling           -7.430e-02  2.314e-02  -3.210 0.001342 ** 
## Pressure.Vacuum   -1.833e-02  7.628e-03  -2.402 0.016356 *  
## Oxygen.Filler     -2.684e-01  6.823e-02  -3.934 8.57e-05 ***
## Bowl.Setpoint      3.221e-03  5.987e-04   5.381 8.11e-08 ***
## Pressure.Setpoint -9.089e-03  1.965e-03  -4.626 3.91e-06 ***
## Air.Pressurer     -2.665e-03  2.371e-03  -1.124 0.261096    
## Alch.Rel           4.957e-02  2.264e-02   2.190 0.028632 *  
## Carb.Rel           3.967e-03  4.741e-02   0.084 0.933321    
## Balling.Lvl        1.162e-01  2.142e-02   5.425 6.34e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.133 on 2536 degrees of freedom
## Multiple R-squared:  0.415,  Adjusted R-squared:  0.4072 
## F-statistic: 52.92 on 34 and 2536 DF,  p-value: < 2.2e-16
#Diagnostic Plots
par(mfrow = c(2, 2))
plot(model)


Baseline: MLR without Multicolinear Predictors

Due to the potential issues introduced by multiple predictors, we use the Variance Inflation Factor to detect and remove variables that cause multicollinearity. After pruning these collinear variables, we refit the model. By mitigating multicollinearity, we ensure that each predictor contributes distinct information about pH. After removing predictors with high multicollinearity, the remaining predictors have low VIF values - all less than 5.

This refined model has an adjusted R-squared of 0.3003,lower than the full MLR model. The results suggest that some variables remain non-significant and could potentially be excluded but the drop in R Square value indicate the the excluded predictors offer explanatory power of the variance.

The diagnostic plots show similar trends to the previous MLR model, with the Residuals vs Fitted plot displaying a random spread, suggesting linearity. The Q-Q plot indicates a relatively normal distribution of residuals with some deviations in the tails. The Scale-Location plot suggests variance homogeneity. Residuals vs Leverage plot identifies a few influential points (e.g., observation 1094), which may need attention.

# Load library
library(car)

# Compute VIF values
vif_df <- as.data.frame(vif(model))
names(vif_df) <- "VIF"
vif_df$Predictor <- rownames(vif_df)

# Filter predictors with VIF <= 5 
predictors_to_keep <- vif_df$Predictor[vif_df$VIF <= 5]

# Create a new formula with the filtered predictors
formula <- as.formula(paste("PH ~", paste(predictors_to_keep, collapse = " + ")))

# Refit the model using only the selected predictors
final_model <- lm(formula, data = studentdata_mice)

# Compute VIF values for the final model and store them in a dataframe
final_vif_df <- as.data.frame(vif(final_model))
names(final_vif_df) <- "VIF"
final_vif_df$Predictor <- rownames(final_vif_df)

# Display VIF values for all predictors
print(vif_df)
##                         VIF NA       NA         Predictor
## Brand.Code        49.317117  3 1.914989        Brand.Code
## Carb.Volume       13.496459  1 3.673753       Carb.Volume
## Fill.Ounces        1.150115  1 1.072434       Fill.Ounces
## PC.Volume          1.589004  1 1.260557         PC.Volume
## Carb.Pressure     33.422705  1 5.781237     Carb.Pressure
## Carb.Temp         27.907420  1 5.282747         Carb.Temp
## PSC                1.161185  1 1.077583               PSC
## PSC.Fill           1.114727  1 1.055806          PSC.Fill
## PSC.CO2            1.078937  1 1.038719           PSC.CO2
## Mnf.Flow           4.408606  1 2.099668          Mnf.Flow
## Carb.Pressure1     1.579910  1 1.256945    Carb.Pressure1
## Fill.Pressure      2.178546  1 1.475990     Fill.Pressure
## Hyd.Pressure1      3.047057  1 1.745582     Hyd.Pressure1
## Hyd.Pressure2     10.979155  1 3.313481     Hyd.Pressure2
## Hyd.Pressure3     12.552731  1 3.542983     Hyd.Pressure3
## Hyd.Pressure4      2.556701  1 1.598969     Hyd.Pressure4
## Filler.Level      11.699337  1 3.420429      Filler.Level
## Filler.Speed      12.877293  1 3.588494      Filler.Speed
## Temperature        1.454516  1 1.206033       Temperature
## Usage.cont         1.678740  1 1.295662        Usage.cont
## Carb.Flow          2.339185  1 1.529440         Carb.Flow
## Density           16.204921  1 4.025534           Density
## MFR               11.070038  1 3.327167               MFR
## Balling           67.441560  1 8.212281           Balling
## Pressure.Vacuum    2.746893  1 1.657375   Pressure.Vacuum
## Oxygen.Filler      1.486183  1 1.219091     Oxygen.Filler
## Bowl.Setpoint     12.248414  1 3.499773     Bowl.Setpoint
## Pressure.Setpoint  2.335773  1 1.528324 Pressure.Setpoint
## Air.Pressurer      1.199729  1 1.095322     Air.Pressurer
## Alch.Rel          18.972738  1 4.355771          Alch.Rel
## Carb.Rel           5.410081  1 2.325958          Carb.Rel
## Balling.Lvl       50.458243  1 7.103397       Balling.Lvl
# Display VIF values for kept predictors
print(final_vif_df)
##                        VIF         Predictor
## Fill.Ounces       1.060853       Fill.Ounces
## PC.Volume         1.479512         PC.Volume
## PSC               1.152751               PSC
## PSC.Fill          1.104370          PSC.Fill
## PSC.CO2           1.062424           PSC.CO2
## Mnf.Flow          2.792677          Mnf.Flow
## Carb.Pressure1    1.427103    Carb.Pressure1
## Fill.Pressure     2.013339     Fill.Pressure
## Hyd.Pressure1     1.455314     Hyd.Pressure1
## Hyd.Pressure4     1.304884     Hyd.Pressure4
## Temperature       1.207110       Temperature
## Usage.cont        1.568474        Usage.cont
## Carb.Flow         1.470612         Carb.Flow
## Pressure.Vacuum   1.545823   Pressure.Vacuum
## Oxygen.Filler     1.434200     Oxygen.Filler
## Pressure.Setpoint 2.169753 Pressure.Setpoint
## Air.Pressurer     1.124044     Air.Pressurer
# Display the summary of the final model
print(summary(final_model))
## 
## Call:
## lm(formula = formula, data = studentdata_mice)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54684 -0.08428  0.01034  0.09910  0.76961 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.337e+01  8.727e-01  15.325  < 2e-16 ***
## Fill.Ounces       -1.431e-01  3.351e-02  -4.269 2.03e-05 ***
## PC.Volume         -7.972e-02  5.677e-02  -1.404  0.16034    
## PSC               -1.235e-01  6.198e-02  -1.993  0.04638 *  
## PSC.Fill          -4.330e-02  2.536e-02  -1.708  0.08780 .  
## PSC.CO2           -1.153e-01  6.786e-02  -1.700  0.08931 .  
## Mnf.Flow          -7.025e-04  3.985e-05 -17.627  < 2e-16 ***
## Carb.Pressure1     6.326e-03  7.162e-04   8.833  < 2e-16 ***
## Fill.Pressure      3.477e-03  1.264e-03   2.751  0.00599 ** 
## Hyd.Pressure1      1.360e-03  2.766e-04   4.917 9.36e-07 ***
## Hyd.Pressure4     -1.154e-03  2.427e-04  -4.755 2.10e-06 ***
## Temperature       -1.920e-02  2.254e-03  -8.522  < 2e-16 ***
## Usage.cont        -9.126e-03  1.199e-03  -7.614 3.71e-14 ***
## Carb.Flow         -1.428e-06  3.219e-06  -0.444  0.65738    
## Pressure.Vacuum   -1.066e-03  6.217e-03  -0.172  0.86384    
## Oxygen.Filler     -2.821e-01  7.282e-02  -3.875  0.00011 ***
## Pressure.Setpoint -1.084e-02  2.057e-03  -5.271 1.47e-07 ***
## Air.Pressurer     -1.366e-03  2.493e-03  -0.548  0.58376    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1445 on 2553 degrees of freedom
## Multiple R-squared:  0.3049, Adjusted R-squared:  0.3003 
## F-statistic: 65.88 on 17 and 2553 DF,  p-value: < 2.2e-16
# Generate diagnostic plots for the final model
par(mfrow = c(2, 2))
plot(final_model)


Lasso Regression

We model a Lasso regression to leverage its feature selection, model complexity reduction attributes. Lasso applies a penalty to shrink less important coefficients toward zero resulting in smaller set non-zero important features . By focusing on the most critical variables, we may be able to generate a parsimonious model that may outperform our linear regression baselines.

The Lasso regression results reveal a tuned model with optimal parameters: alpha = 0.1 and lambda = 0.0012, The variable importance rankings highlight Oxygen.Filler as the most influential predictor, followed by Carb.Rel, PC.Volume, and Density. These top features identify key predictors of pH.

The Lasso regression achieves effective dimensionality reduction but te relatively modest R-squared value suggests that further exploration may be necessary. The R-squared of 0.354 is lower than the adjusted R-squared of 0.4072 from the multiple linear regression (MLR) model and may not be a suitable model for this dataset.

# Load libraries
library(caret)
library(glmnet)

# Set the seed
set.seed(321)

# Split the data into training and test sets
set.seed(123)
train_indices <- createDataPartition(studentdata_mice$PH, p = 0.8, list = FALSE)
train_data <- studentdata_mice[train_indices, ]
test_data <- studentdata_mice[-train_indices, ]

# Prepare the data for Lasso regression
x_train <- as.matrix(train_data[, -which(names(train_data) == "PH")])
y_train <- train_data$PH
x_test <- as.matrix(test_data[, -which(names(test_data) == "PH")])
y_test <- test_data$PH

# Define cross-validation method
cross_val_10 <- trainControl(method = "cv", number = 10)

# Train the Lasso regression model
lasso_model <- train(
  x = x_train,
  y = y_train,
  method = "glmnet",
  trControl = cross_val_10,
  preProcess = c("center", "scale", "nzv"),
  tuneLength = 20)

# Extract optimal lambda from cross-validation
lasso_best_lambda <- lasso_model$bestTune
print(lasso_best_lambda)
##    alpha      lambda
## 10   0.1 0.001216886
# Compute the best cross-validated R-squared
lasso_best_r2 <- max(lasso_model$results$Rsquared, na.rm = TRUE)
print(lasso_best_r2)
## [1] 0.3383953
# Predict on the test set and compute R-squared
lasso_pred <- predict(lasso_model, newdata = x_test)
test_r2 <- postResample(pred = lasso_pred, obs = y_test)[2]
print(test_r2)
##  Rsquared 
## 0.3542227
# Display variable importance
lasso_var_importance <- varImp(lasso_model)
print(lasso_var_importance)
## glmnet variable importance
## 
##   only 20 most important variables shown (out of 32)
## 
##                    Overall
## Oxygen.Filler     100.0000
## Carb.Rel           37.8637
## PC.Volume          32.2475
## Density            31.0408
## PSC                31.0301
## PSC.CO2            29.0048
## Alch.Rel           24.9745
## Fill.Ounces        23.7366
## PSC.Fill           12.3386
## Balling            12.1294
## Balling.Lvl         7.4211
## Temperature         6.3005
## Pressure.Setpoint   3.0438
## Carb.Volume         2.7216
## Usage.cont          2.0451
## Carb.Pressure1      1.9458
## Pressure.Vacuum     1.5986
## Hyd.Pressure3       0.8888
## Bowl.Setpoint       0.6928
## Fill.Pressure       0.5551
# Plot variable importance
plot(
  lasso_var_importance,
  main = "Lasso Variable Importance",
  cex.main = 0.7,
  cex.axis = 0.7, 
  cex.lab = 0.7)

# Plot tuning results
plot(
  lasso_model,
  main = "Lasso Tuning Results (Lambda vs R2)",
  cex.main = 0.7,
  cex.axis = 0.7,
  cex.lab = 0.7)


Neural Network

To capture non-linear and potentially complex relationships, we train a neural network model.

The neural network model achieves an R-squared of 0.477, surpassing the performance of the Lasso regression (R-squared = 0.354) and coming close to the multiple linear regression (MLR) model’s R-squared of 0.407. The optimal parameters are size = 9 (number of hidden units) and decay = 0.5 (regularization parameter). Overall, the neural network model has a moderate results.

The tuning results graph shows that the combination of decay and hidden units significantly impacts model performance, with RMSE values improving as the decay and size are fine-tuned. The variable importance plot highlights Carb.Flow, Hyd.Pressure1, and Mnf.Flow as key predictors.

The residual plot suggests minimal systematic errors, with residuals centered around zero across predicted values.

# Load libraries
library(caret)
library(nnet)
library(dplyr)

# Convert 'Brand,Code' to dummy variable
studentdata_processed <- studentdata_mice %>%
  mutate(across(where(is.character), as.factor)) %>%
  dummyVars(" ~ .", data = .) %>%
  predict(., studentdata_mice) %>%
  as.data.frame()

# Split data into training and test sets
set.seed(321) 
trainIndex <- createDataPartition(studentdata_processed$PH, p = 0.80, list = FALSE)
trainData <- studentdata_processed[trainIndex, ]
testData <- studentdata_processed[-trainIndex, ]

# Separate predictors and outcome for training and test sets
x_train <- trainData[, !names(trainData) %in% "PH"]
y_train <- trainData$PH
x_test <- testData[, !names(testData) %in% "PH"]
y_test <- testData$PH

# Define a tuning grid
nnetGrid <- expand.grid(
  size = 1:10,
  decay = c(0.001, 0.01, 0.1, 0.5))

# Train neural network
set.seed(321)
nnetTuned <- train(
  x = x_train,
  y = y_train,
  method = "nnet",
  tuneGrid = nnetGrid,
  preProcess = c("center", "scale"),
  trControl = trainControl(method = "cv", number = 10),
  linout = TRUE,
  trace = FALSE,
  maxit = 500,
  MaxNWts = 10 * (ncol(x_train) + 1) + 10 + 1)

# View model results
print(nnetTuned)
## Neural Network 
## 
## 2058 samples
##   35 predictor
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1852, 1852, 1852, 1852, 1852, 1852, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  RMSE       Rsquared   MAE       
##    1    0.001  0.1314319  0.4138695  0.10196380
##    1    0.010  0.1313793  0.4146213  0.10144177
##    1    0.100  0.1336383  0.3968065  0.10354521
##    1    0.500  0.1344240  0.3894365  0.10443959
##    2    0.001  0.1259923  0.4630454  0.09675634
##    2    0.010  0.1311952  0.4257799  0.09945994
##    2    0.100  0.1276595  0.4497412  0.09733289
##    2    0.500  0.1289450  0.4375297  0.09834390
##    3    0.001  0.1241064  0.4797454  0.09392113
##    3    0.010  0.1264619  0.4657074  0.09376243
##    3    0.100  0.1281223  0.4494815  0.09564114
##    3    0.500  0.1273465  0.4552452  0.09517264
##    4    0.001  0.1298719  0.4508411  0.09490488
##    4    0.010  0.1281195  0.4629114  0.09440000
##    4    0.100  0.1258940  0.4711368  0.09335261
##    4    0.500  0.1271084  0.4588074  0.09436937
##    5    0.001  0.1297184  0.4494131  0.09642799
##    5    0.010  0.1263351  0.4791060  0.09406880
##    5    0.100  0.1274081  0.4672405  0.09386175
##    5    0.500  0.1256988  0.4731223  0.09343496
##    6    0.001  0.1337957  0.4354460  0.09682491
##    6    0.010  0.1322501  0.4554357  0.09461203
##    6    0.100  0.1272241  0.4705839  0.09415077
##    6    0.500  0.1281386  0.4601989  0.09438853
##    7    0.001  0.1306319  0.4627089  0.09726355
##    7    0.010  0.1348173  0.4362562  0.09727713
##    7    0.100  0.1288899  0.4708749  0.09351349
##    7    0.500  0.1238270  0.4912216  0.09226578
##    8    0.001  0.1320434  0.4563607  0.09900138
##    8    0.010  0.1330398  0.4511652  0.09647422
##    8    0.100  0.1305544  0.4581832  0.09431613
##    8    0.500  0.1251436  0.4824894  0.09355498
##    9    0.001  0.1390460  0.4219372  0.10053340
##    9    0.010  0.1337140  0.4482196  0.09754615
##    9    0.100  0.1326855  0.4554070  0.09508132
##    9    0.500  0.1234339  0.4956042  0.09221320
##   10    0.001  0.1343660  0.4504271  0.09976991
##   10    0.010  0.1339430  0.4584989  0.09626510
##   10    0.100  0.1285464  0.4767257  0.09463327
##   10    0.500  0.1235827  0.4907654  0.09188635
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 9 and decay = 0.5.
print(nnetTuned$bestTune)
##    size decay
## 36    9   0.5
# Plot tuning results
plot(nnetTuned, main = "Tuning Results for Neural Network")

# Variable importance
varImpPlot <- varImp(nnetTuned)
plot(varImpPlot, main = "Variable Importance from Neural Network")

# Predict on the test set
nnetPred <- predict(nnetTuned, newdata = x_test)

# Evaluate model performance
performance <- postResample(pred = nnetPred, obs = y_test)
print(performance)
##       RMSE   Rsquared        MAE 
## 0.12969633 0.47670903 0.09457135
# Residual diagnostic plot
residuals <- y_test - nnetPred
plot(nnetPred, residuals, xlab = "Predicted Values", ylab = "Residuals",
     main = "Residual Plot", col = "blue", pch = 19)
abline(h = 0, col = "red", lwd = 2)


XGBoost Tree

We train a gradient-boosted decision tree model using the xgboost package. We also extract feature importance rankings.

The XGBoost model achieves an R-squared of 0.601, significantly outperforming the previous models, including the neural network (R-squared = 0.477) and multiple linear regression (R-squared = 0.407).

The residual plot shows a good distribution of residuals around zero across the predicted values, indicating a well-fitted model with minimal systematic errors. The feature importance chart indicates Mnf.Flow as the most influential predictor, followed by Oxygen.Filler, Filler.Speed, and Alch.Rel, as predictors of PH.

# Load libraries
library(xgboost)
library(dplyr)

# Preprocess the data (convert categorical to dummy)
studentdata_processed <- studentdata_mice %>%
  mutate(across(where(is.character), as.factor)) %>%
  dummyVars(" ~ .", data = .) %>%
  predict(., studentdata_mice) %>%
  as.data.frame()

# Split into training and testing sets
set.seed(123)
trainIndex <- createDataPartition(studentdata_processed$PH, p = 0.80, list = FALSE)
trainData <- studentdata_processed[trainIndex, ]
testData <- studentdata_processed[-trainIndex, ]

# Separate predictors and outcome
x_train <- as.matrix(trainData[, !names(trainData) %in% "PH"])
y_train <- trainData$PH
x_test <- as.matrix(testData[, !names(testData) %in% "PH"])
y_test <- testData$PH

# Convert training data to DMatrix for XGBoost
dtrain <- xgb.DMatrix(data = x_train, label = y_train)
dtest <- xgb.DMatrix(data = x_test, label = y_test)

# Set parameters for XGBoost
params <- list(
  booster = "gbtree",
  objective = "reg:squarederror",
  eta = 0.1,
  max_depth = 6,
  subsample = 0.8,
  colsample_bytree = 0.8,
  gamma = 0)

# Train XGBoost model
set.seed(123)
xgb_model <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 200,
  watchlist = list(train = dtrain, test = dtest),
  early_stopping_rounds = 10,
  print_every_n = 10)
## [1]  train-rmse:7.243151 test-rmse:7.245872 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [11] train-rmse:2.532389 test-rmse:2.535015 
## [21] train-rmse:0.894530 test-rmse:0.894218 
## [31] train-rmse:0.332541 test-rmse:0.336355 
## [41] train-rmse:0.145981 test-rmse:0.163109 
## [51] train-rmse:0.092131 test-rmse:0.123164 
## [61] train-rmse:0.076973 test-rmse:0.116931 
## [71] train-rmse:0.069305 test-rmse:0.115053 
## [81] train-rmse:0.063289 test-rmse:0.114220 
## [91] train-rmse:0.057172 test-rmse:0.113406 
## [101]    train-rmse:0.051773 test-rmse:0.113434 
## [111]    train-rmse:0.047308 test-rmse:0.112561 
## [121]    train-rmse:0.043808 test-rmse:0.112121 
## [131]    train-rmse:0.040077 test-rmse:0.111577 
## [141]    train-rmse:0.036986 test-rmse:0.111569 
## [151]    train-rmse:0.034563 test-rmse:0.111293 
## [161]    train-rmse:0.031591 test-rmse:0.111015 
## [171]    train-rmse:0.029441 test-rmse:0.110850 
## [181]    train-rmse:0.027203 test-rmse:0.110943 
## [191]    train-rmse:0.024747 test-rmse:0.110770 
## Stopping. Best iteration:
## [185]    train-rmse:0.026203 test-rmse:0.110632
# Predict on test data
predictions <- predict(xgb_model, newdata = dtest, iteration_range = c(1, xgb_model$best_iteration))

# Evaluate performance
rsq <- cor(y_test, predictions)^2
cat("R-squared:", rsq, "\n")
## R-squared: 0.6011627
# Residual plot
residuals <- y_test - predictions
plot(predictions, residuals, xlab = "Predicted", ylab = "Residuals",
     main = "Residual Plot for XGBoost", col = "blue", pch = 19)
abline(h = 0, col = "red", lwd = 2)

# Variable importance plot
importance <- xgb.importance(feature_names = colnames(x_train), model = xgb_model)
xgb.plot.importance(importance, main = "Feature Importance for XGBoost")


Final Model Results and Model Selection

Based on the final RSquared results, the Random Forest model is the best predictor with Rsq = 0.7130620. This model will be used to predict the Student Evaluation dataset. The Random Forest model provides the highest explanatory power, capturing a significant proportion of the variance in the target variable (pH). Its ensemble nature enables it to handle non-linear relationships and complex interactions between predictors effectively in this dataset.

Model R-squared
OLR 0.198
MLR 0.4072
LASSO 0.3542227
NNET 0.47670903
XGB 0.6011627
RIDGE 0.3650728
SVM 0.514247
RFOREST 0.7130620
BOOSTED 0.6468105