This section is the introduction to the R script. The code block uses the pacman package to load a comprehensive set of R libraries. These libraries are essential for various tasks that will be performed in the subsequent sections, including data manipulation (tidyverse, janitor), machine learning model building and evaluation (caret, glmnet, pls, earth, kernlab, randomForest, Cubist, xgboost, nnet, e1071), time series analysis (forecast, fpp3, tsibble), association rule mining (arules, arulesViz), clustering (cluster), and multivariate data analysis (FactoMineR, factoextra), as well as data exploration and visualization (skimr, DataExplorer, corrplot).
pacman::p_load(MASS, tidyverse, caret, glmnet, pls, earth, kernlab,
randomForest, Cubist, xgboost, nnet, e1071,
forecast, fpp3, tsibble, arules, arulesViz, cluster,
FactoMineR, factoextra, skimr, DataExplorer, janitor, corrplot)
Output: Executing this code block will load all the specified R packages into the current R session. There will be no direct output printed to the console unless a package must first be installed (pacman::p_load installs missing packages automatically) or fails to load, in which case installation messages or an error would be displayed.
This section focuses on loading the dataset and performing initial exploratory data analysis (EDA). It begins by loading two CSV files, “StudentData.csv” and “StudentEvaluation.csv”, into R data frames named df and eval, respectively, using read_csv() from the readr package. The clean_names() function from the janitor package is then applied to both data frames to ensure that the column names are consistent, readable, and R-friendly (e.g., converting spaces to underscores and making them lowercase). Following the data loading and cleaning, several functions from the skimr and DataExplorer packages are used to get a preliminary understanding of the df data frame. skim(df) provides a concise summary of the dataset, including the data types of columns, the number of missing values, and basic descriptive statistics for numerical columns. The DataExplorer functions (plot_missing(), plot_intro(), plot_histogram(), plot_density()) generate visualizations to further explore the data’s characteristics, such as the patterns of missing values, an overview of the dataset structure and variable types, the distributions of numerical variables through histograms, and their probability density distributions. Finally, the code selects only the numerical columns from df using select(df, where(is.numeric)) and calculates the pairwise correlation matrix using cor() with the use = “pairwise.complete.obs” argument to handle missing values. This correlation matrix is then visualized using corrplot() from the corrplot package, displaying the strength and direction of linear relationships between the numerical variables. The tl.cex = 0.6 argument shrinks the text labels on the plot for better readability.
library(readr)
df <- read_csv("StudentData.csv") %>% clean_names()
## Rows: 2571 Columns: 33
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Brand Code
## dbl (32): Carb Volume, Fill Ounces, PC Volume, Carb Pressure, Carb Temp, PSC...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
eval <- read_csv("StudentEvaluation.csv") %>% clean_names()
## Rows: 267 Columns: 33
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Brand Code
## dbl (31): Carb Volume, Fill Ounces, PC Volume, Carb Pressure, Carb Temp, PSC...
## lgl (1): PH
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(df)
Name | df |
Number of rows | 2571 |
Number of columns | 33 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 32 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
brand_code | 120 | 0.95 | 1 | 1 | 0 | 4 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
carb_volume | 10 | 1.00 | 5.37 | 0.11 | 5.04 | 5.29 | 5.35 | 5.45 | 5.70 | ▁▆▇▅▁ |
fill_ounces | 38 | 0.99 | 23.97 | 0.09 | 23.63 | 23.92 | 23.97 | 24.03 | 24.32 | ▁▂▇▂▁ |
pc_volume | 39 | 0.98 | 0.28 | 0.06 | 0.08 | 0.24 | 0.27 | 0.31 | 0.48 | ▁▃▇▂▁ |
carb_pressure | 27 | 0.99 | 68.19 | 3.54 | 57.00 | 65.60 | 68.20 | 70.60 | 79.40 | ▁▅▇▃▁ |
carb_temp | 26 | 0.99 | 141.09 | 4.04 | 128.60 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▅▇▃▁ |
psc | 33 | 0.99 | 0.08 | 0.05 | 0.00 | 0.05 | 0.08 | 0.11 | 0.27 | ▆▇▃▁▁ |
psc_fill | 23 | 0.99 | 0.20 | 0.12 | 0.00 | 0.10 | 0.18 | 0.26 | 0.62 | ▆▇▃▁▁ |
psc_co2 | 39 | 0.98 | 0.06 | 0.04 | 0.00 | 0.02 | 0.04 | 0.08 | 0.24 | ▇▅▂▁▁ |
mnf_flow | 2 | 1.00 | 24.57 | 119.48 | -100.20 | -100.00 | 65.20 | 140.80 | 229.40 | ▇▁▁▇▂ |
carb_pressure1 | 32 | 0.99 | 122.59 | 4.74 | 105.60 | 119.00 | 123.20 | 125.40 | 140.20 | ▁▃▇▂▁ |
fill_pressure | 22 | 0.99 | 47.92 | 3.18 | 34.60 | 46.00 | 46.40 | 50.00 | 60.40 | ▁▁▇▂▁ |
hyd_pressure1 | 11 | 1.00 | 12.44 | 12.43 | -0.80 | 0.00 | 11.40 | 20.20 | 58.00 | ▇▅▂▁▁ |
hyd_pressure2 | 15 | 0.99 | 20.96 | 16.39 | 0.00 | 0.00 | 28.60 | 34.60 | 59.40 | ▇▂▇▅▁ |
hyd_pressure3 | 15 | 0.99 | 20.46 | 15.98 | -1.20 | 0.00 | 27.60 | 33.40 | 50.00 | ▇▁▃▇▁ |
hyd_pressure4 | 30 | 0.99 | 96.29 | 13.12 | 52.00 | 86.00 | 96.00 | 102.00 | 142.00 | ▁▃▇▂▁ |
filler_level | 20 | 0.99 | 109.25 | 15.70 | 55.80 | 98.30 | 118.40 | 120.00 | 161.20 | ▁▃▅▇▁ |
filler_speed | 57 | 0.98 | 3687.20 | 770.82 | 998.00 | 3888.00 | 3982.00 | 3998.00 | 4030.00 | ▁▁▁▁▇ |
temperature | 14 | 0.99 | 65.97 | 1.38 | 63.60 | 65.20 | 65.60 | 66.40 | 76.20 | ▇▃▁▁▁ |
usage_cont | 5 | 1.00 | 20.99 | 2.98 | 12.08 | 18.36 | 21.79 | 23.75 | 25.90 | ▁▃▅▃▇ |
carb_flow | 2 | 1.00 | 2468.35 | 1073.70 | 26.00 | 1144.00 | 3028.00 | 3186.00 | 5104.00 | ▂▅▆▇▁ |
density | 1 | 1.00 | 1.17 | 0.38 | 0.24 | 0.90 | 0.98 | 1.62 | 1.92 | ▁▅▇▂▆ |
mfr | 212 | 0.92 | 704.05 | 73.90 | 31.40 | 706.30 | 724.00 | 731.00 | 868.60 | ▁▁▁▂▇ |
balling | 1 | 1.00 | 2.20 | 0.93 | -0.17 | 1.50 | 1.65 | 3.29 | 4.01 | ▁▇▇▁▇ |
pressure_vacuum | 0 | 1.00 | -5.22 | 0.57 | -6.60 | -5.60 | -5.40 | -5.00 | -3.60 | ▂▇▆▂▁ |
ph | 4 | 1.00 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▂▁ |
oxygen_filler | 12 | 1.00 | 0.05 | 0.05 | 0.00 | 0.02 | 0.03 | 0.06 | 0.40 | ▇▁▁▁▁ |
bowl_setpoint | 2 | 1.00 | 109.33 | 15.30 | 70.00 | 100.00 | 120.00 | 120.00 | 140.00 | ▁▂▃▇▁ |
pressure_setpoint | 12 | 1.00 | 47.62 | 2.04 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
air_pressurer | 0 | 1.00 | 142.83 | 1.21 | 140.80 | 142.20 | 142.60 | 143.00 | 148.20 | ▅▇▁▁▁ |
alch_rel | 9 | 1.00 | 6.90 | 0.51 | 5.28 | 6.54 | 6.56 | 7.24 | 8.62 | ▁▇▂▃▁ |
carb_rel | 10 | 1.00 | 5.44 | 0.13 | 4.96 | 5.34 | 5.40 | 5.54 | 6.06 | ▁▇▇▂▁ |
balling_lvl | 1 | 1.00 | 2.05 | 0.87 | 0.00 | 1.38 | 1.48 | 3.14 | 3.66 | ▁▇▂▁▆ |
plot_missing(df)
plot_intro(df)
plot_histogram(df)
plot_density(df)
num_df <- select(df, where(is.numeric))
corrplot(cor(num_df, use = "pairwise.complete.obs"), method = "color", tl.cex = 0.6)
Output:
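The corrplot gives a visual overview; as a small companion (not in the original script, and assuming num_df from above is still in scope), the strongest pairwise correlations can also be listed numerically, which helps when deciding which collinear predictors to watch:
# List the strongest pairwise correlations behind the corrplot (the 0.8 threshold is illustrative)
cor_mat <- cor(num_df, use = "pairwise.complete.obs")
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA   # keep each pair once
top_pairs <- as.data.frame(as.table(cor_mat)) %>%
  filter(!is.na(Freq), abs(Freq) > 0.8) %>%
  arrange(desc(abs(Freq)))
head(top_pairs)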
Description: This section focuses on cleaning and preparing the df data frame for modeling. It starts by filtering out rows where the ph variable (likely the target variable) has a missing value using filter(!is.na(ph)). Then, the brand_code column is explicitly converted to a factor data type using mutate(brand_code = as.factor(brand_code)), which is necessary for many statistical models to treat it as a categorical variable. Next, the code identifies and potentially removes near-zero variance predictors. nearZeroVar(df, saveMetrics = TRUE) from the caret package calculates various statistics for each predictor to assess its variance and whether it is near zero. The saveMetrics = TRUE argument ensures that these metrics are stored. The subsequent line df <- df[, !nzv$nzv] subsets the df data frame, keeping only the columns where the nzv component of the nzv object is FALSE, effectively removing the near-zero variance predictors. The code then separates the df data frame into numerical (df_num) and categorical (df_cat) columns using select(where(is.numeric)) and select(where(is.factor)), respectively. For the numerical data, preprocessing steps are applied using preProcess() from caret. The method = c(“medianImpute”, “center”, “scale”) argument specifies that missing values in the numerical columns should be imputed using the median, and then the columns should be centered (mean subtracted) and scaled (divided by standard deviation). The predict() function is used to apply these preprocessing transformations to the df_num data, resulting in a cleaned numerical data frame df_num_clean. Finally, the cleaned numerical data and the original categorical data are combined back into a single data frame df using bind_cols().
df <- df %>% filter(!is.na(ph))
df <- df %>% mutate(brand_code = as.factor(brand_code))
nzv <- nearZeroVar(df, saveMetrics = TRUE)
df <- df[, !nzv$nzv]
df_num <- df %>% select(where(is.numeric))
df_cat <- df %>% select(where(is.factor))
preproc <- preProcess(df_num, method = c("medianImpute", "center", "scale"))
df_num_clean <- predict(preproc, df_num)
df <- bind_cols(df_num_clean, df_cat)
Output:
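As a quick sanity check (illustrative, not part of the original script), one can confirm that the numeric block is now complete and standardized before moving on:
stopifnot(!anyNA(df_num_clean))            # medianImpute should leave no NAs
round(range(colMeans(df_num_clean)), 3)    # means should be ~0 after centering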
Description: This section visually explores the distribution of the target variable, ph, and its relationship with the categorical predictor brand_code. The first part uses ggplot() from the ggplot2 package to create a histogram of the ph variable. geom_histogram(bins = 30, fill = “skyblue”, color = “black”) generates the histogram with 30 bins, filled with sky blue color and black borders for the bars. The labs() function adds a title (“Distribution of pH”) and labels for the x-axis (“pH”) and y-axis (“Count”). This visualization helps to understand the overall distribution of the ph values, such as its central tendency, spread, and any potential skewness or multimodality. The second part of the section also uses ggplot() to create boxplots of ph for each level of the brand_code factor. geom_boxplot(fill = “lightgreen”) generates the boxplots, with each box representing the distribution of ph for a specific brand code, and the boxes filled with light green color. The labs() function adds a title (“pH by Brand”). These boxplots allow for a comparison of the ph values across different brands, showing the median, quartiles, and potential outliers for each brand. This helps to assess if there is a relationship between the brand of the beverage and its pH level.
ggplot(df, aes(ph)) + geom_histogram(bins = 30, fill = "skyblue", color = "black") +
labs(title = "Distribution of pH", x = "pH", y = "Count")
ggplot(df, aes(x = brand_code, y = ph)) + geom_boxplot(fill = "lightgreen") +
labs(title = "pH by Brand")
Output:
This section performs advanced feature engineering by creating new predictor variables from existing ones and by encoding the categorical variable brand_code into a numerical format suitable for many machine learning models. First, several new numerical features are created using the mutate() function from the dplyr package. These new features are derived by combining existing numerical features in potentially meaningful ways: carb_vol_fill is the product of carb_volume and fill_ounces, carb_press_temp is the product of carb_pressure and carb_temp, mnf_flow_carb_flow is the product of mnf_flow and carb_flow, fill_ounces_sq is the square of fill_ounces, and carb_pressure_sq is the square of carb_pressure. The rationale behind creating these interaction and polynomial terms is to capture potential non-linear relationships and interactions between the original predictors that might have a stronger influence on the target variable (ph). Next, the categorical variable brand_code is converted into a set of numerical dummy variables (also known as one-hot encoding). This is done using the dummyVars() function from the caret package, which creates a model matrix specification for converting brand_code. The predict() function then applies this specification to the df data frame to generate the actual dummy variables. The resulting df_encoded object is a matrix containing the numerical dummy variables for each level of brand_code. Finally, these new dummy variables are added to the original df data frame using cbind(), while the original brand_code column is removed using select(-brand_code). This results in a df data frame where the categorical brand_code has been replaced by a set of numerical columns, one for each of the four brands (dummyVars defaults to fullRank = FALSE, so no reference level is dropped). Note that rows with a missing brand_code receive NA in every dummy column (120 rows, as the later skim output shows), which is why a further imputation pass is needed before modeling.
df <- df %>%
mutate(carb_vol_fill = carb_volume * fill_ounces,
carb_press_temp = carb_pressure * carb_temp,
mnf_flow_carb_flow = mnf_flow * carb_flow,
fill_ounces_sq = fill_ounces^2,
carb_pressure_sq = carb_pressure^2)
dummy_model <- dummyVars(" ~ brand_code", data = df)
df_encoded <- predict(dummy_model, newdata = df)
df <- cbind(df %>% select(-brand_code), df_encoded)
Output:
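Because dummyVars() defaults to fullRank = FALSE, all four brand columns are kept and they sum to one, which is fine for penalized and tree-based models but creates perfect collinearity for a plain lm. Had a reference level been wanted, a hedged alternative (run in place of the dummyVars() call above, while brand_code is still in df) would be:
# Illustrative only: fullRank = TRUE drops one level to avoid the dummy-variable trap
dummy_model_fr <- dummyVars(~ brand_code, data = df, fullRank = TRUE)
df_encoded_fr <- predict(dummy_model_fr, newdata = df)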
Description: This section divides the prepared df data frame into training and testing datasets. The createDataPartition() function from the caret package is used to create stratified random samples for splitting the data. set.seed(123) ensures that the random partitioning is reproducible. The first argument, df$ph, specifies that the partitioning should be based on the distribution of the ph variable, aiming to have similar distributions of the target variable in both the training and testing sets. The p = 0.8 argument indicates that 80% of the data should be allocated to the training set. list = FALSE ensures that the output is a matrix of row indices rather than a list. These indices are then used to subset the df data frame to create train_data (the 80% subset) and test_data (the remaining 20% subset). Finally, the summary() function is applied to the ph variable in both train_data and test_data to provide descriptive statistics (like mean, median, min, max, quartiles) for the target variable in each split. This step is crucial to verify that the splitting process resulted in reasonably similar distributions of the target variable in both datasets, which is important for training a model on a representative subset of the data and evaluating its performance on unseen data.
set.seed(123)
train_index <- createDataPartition(df$ph, p = 0.8, list = FALSE)
train_data <- df[train_index, ]
test_data <- df[-train_index, ]
summary(train_data$ph)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.858470 -0.612398 -0.032743 -0.000192 0.778775 2.285880
summary(test_data$ph)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.278814 -0.728329 -0.032743 0.000769 0.778775 4.720434
Output:
This section trains a standard linear regression model to predict the ph value based on all other predictor variables in the train_data. First, it checks for any remaining missing values in the train_data. Missing values do remain at this point: rows with a missing brand_code carry NA in every dummy column created earlier, so median imputation is applied using preProcess and predict from the caret package. This ensures that the linear regression model, which cannot handle missing values, is trained on a complete dataset. Then, the train() function from caret is used to train the linear regression model. The formula ph ~ . specifies that ph is the target variable and all other columns in train_data should be used as predictors. The method = “lm” argument indicates that a linear regression model should be used. The trControl = trainControl(method = “cv”, number = 10) argument sets up 10-fold cross-validation. This means the training data is divided into 10 parts, and the model is trained and evaluated 10 times, each time using a different part as the validation set and the remaining 9 parts as the training set. This process provides a more robust estimate of the model’s performance on unseen data compared to a single train-test split evaluation. The results of the cross-validation, including performance metrics like R-squared, RMSE, and MAE, are stored in the linear_model object. The skim(df) output that follows documents the fully engineered data frame, confirming the standardized numeric columns and the 120 rows with NA brand dummies.
if(any(is.na(train_data))) {
preproc_again <- preProcess(train_data, method = "medianImpute")
train_data <- predict(preproc_again, train_data)
}
linear_model <- train(ph ~ ., data = train_data, method = "lm",
trControl = trainControl(method = "cv", number = 10))
linear_model$results
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 0.7730769 0.3980851 0.6060421 0.04667772 0.03859452 0.03243217
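For reference (not shown in the original output), the per-fold metrics behind these averages are stored on the train object:
head(linear_model$resample)   # RMSE / Rsquared / MAE for each of the 10 folds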
skim(df)
Name | df |
Number of rows | 2567 |
Number of columns | 40 |
_______________________ | |
Column type frequency: | |
numeric | 40 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
carb_volume | 0 | 1.00 | 0.00 | 1.00 | -3.10 | -0.72 | -0.22 | 0.78 | 3.10 | ▁▆▇▅▁ |
fill_ounces | 0 | 1.00 | 0.00 | 0.99 | -3.91 | -0.63 | -0.02 | 0.59 | 3.95 | ▁▂▇▂▁ |
pc_volume | 0 | 1.00 | 0.00 | 0.99 | -3.27 | -0.63 | -0.10 | 0.55 | 3.31 | ▁▃▇▂▁ |
carb_pressure | 0 | 1.00 | 0.00 | 0.99 | -3.16 | -0.73 | 0.00 | 0.68 | 3.17 | ▁▅▇▃▁ |
carb_temp | 0 | 1.00 | 0.00 | 0.99 | -3.10 | -0.67 | -0.07 | 0.67 | 3.20 | ▁▃▇▃▁ |
psc | 0 | 1.00 | 0.00 | 0.99 | -1.68 | -0.70 | -0.18 | 0.56 | 3.76 | ▆▇▃▁▁ |
psc_fill | 0 | 1.00 | 0.00 | 1.00 | -1.66 | -0.81 | -0.13 | 0.55 | 3.61 | ▆▇▃▁▁ |
psc_co2 | 0 | 1.00 | -0.01 | 0.99 | -1.31 | -0.85 | -0.38 | 0.55 | 4.26 | ▇▅▂▁▁ |
mnf_flow | 0 | 1.00 | 0.00 | 1.00 | -1.04 | -1.04 | 0.38 | 0.97 | 1.71 | ▇▁▁▇▂ |
carb_pressure1 | 0 | 1.00 | 0.00 | 0.99 | -3.59 | -0.76 | 0.13 | 0.60 | 3.73 | ▁▃▇▂▁ |
fill_pressure | 0 | 1.00 | 0.00 | 1.00 | -4.19 | -0.60 | -0.48 | 0.65 | 3.93 | ▁▁▇▂▁ |
hyd_pressure2 | 0 | 1.00 | 0.00 | 1.00 | -1.28 | -1.28 | 0.46 | 0.83 | 2.34 | ▇▂▇▃▁ |
hyd_pressure3 | 0 | 1.00 | 0.00 | 1.00 | -1.36 | -1.28 | 0.45 | 0.80 | 1.85 | ▇▁▃▇▁ |
hyd_pressure4 | 0 | 1.00 | 0.00 | 0.99 | -2.62 | -0.79 | -0.02 | 0.43 | 3.49 | ▂▆▇▂▁ |
filler_level | 0 | 1.00 | 0.00 | 1.00 | -3.40 | -0.67 | 0.58 | 0.68 | 3.31 | ▁▃▅▇▁ |
filler_speed | 0 | 1.00 | 0.01 | 0.99 | -3.50 | 0.26 | 0.38 | 0.40 | 0.44 | ▁▁▁▁▇ |
temperature | 0 | 1.00 | 0.00 | 1.00 | -1.71 | -0.55 | -0.26 | 0.32 | 7.42 | ▇▃▁▁▁ |
usage_cont | 0 | 1.00 | 0.00 | 1.00 | -3.00 | -0.88 | 0.27 | 0.92 | 1.65 | ▁▃▅▃▇ |
carb_flow | 0 | 1.00 | 0.00 | 1.00 | -2.29 | -1.22 | 0.52 | 0.67 | 2.46 | ▂▅▆▇▁ |
density | 0 | 1.00 | 0.00 | 1.00 | -2.48 | -0.73 | -0.52 | 1.18 | 1.98 | ▁▅▇▂▆ |
mfr | 0 | 1.00 | 0.02 | 0.96 | -9.10 | 0.06 | 0.27 | 0.36 | 2.23 | ▁▁▁▂▇ |
balling | 0 | 1.00 | 0.00 | 1.00 | -2.19 | -0.76 | -0.59 | 1.17 | 1.95 | ▁▇▁▁▅ |
pressure_vacuum | 0 | 1.00 | 0.00 | 1.00 | -2.43 | -0.67 | -0.32 | 0.38 | 2.83 | ▂▇▅▃▁ |
ph | 0 | 1.00 | 0.00 | 1.00 | -3.86 | -0.61 | -0.03 | 0.78 | 4.72 | ▁▅▇▂▁ |
oxygen_filler | 0 | 1.00 | 0.00 | 1.00 | -0.98 | -0.54 | -0.29 | 0.27 | 7.84 | ▇▁▁▁▁ |
bowl_setpoint | 0 | 1.00 | 0.00 | 1.00 | -2.57 | -0.61 | 0.70 | 0.70 | 2.01 | ▁▂▃▇▁ |
pressure_setpoint | 0 | 1.00 | 0.00 | 1.00 | -1.77 | -0.79 | -0.79 | 1.17 | 2.15 | ▁▇▁▆▁ |
air_pressurer | 0 | 1.00 | 0.00 | 1.00 | -1.68 | -0.52 | -0.19 | 0.14 | 4.42 | ▅▇▁▁▁ |
alch_rel | 0 | 1.00 | 0.00 | 1.00 | -3.20 | -0.71 | -0.67 | 0.66 | 3.41 | ▁▇▂▃▁ |
carb_rel | 0 | 1.00 | 0.00 | 1.00 | -3.70 | -0.75 | -0.29 | 0.80 | 4.84 | ▁▇▇▂▁ |
balling_lvl | 0 | 1.00 | 0.00 | 1.00 | -2.36 | -0.77 | -0.66 | 1.25 | 1.85 | ▁▇▂▁▆ |
carb_vol_fill | 0 | 1.00 | 0.02 | 1.06 | -11.77 | -0.34 | 0.01 | 0.37 | 8.40 | ▁▁▇▃▁ |
carb_press_temp | 0 | 1.00 | 0.78 | 1.26 | -0.84 | 0.02 | 0.31 | 1.02 | 9.79 | ▇▂▁▁▁ |
mnf_flow_carb_flow | 0 | 1.00 | -0.28 | 0.90 | -1.95 | -0.80 | -0.51 | 0.53 | 2.38 | ▅▇▃▂▁ |
fill_ounces_sq | 0 | 1.00 | 0.98 | 1.69 | 0.00 | 0.06 | 0.35 | 1.18 | 15.57 | ▇▁▁▁▁ |
carb_pressure_sq | 0 | 1.00 | 0.99 | 1.41 | 0.00 | 0.11 | 0.46 | 1.28 | 10.04 | ▇▁▁▁▁ |
brand_code.A | 120 | 0.95 | 0.12 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
brand_code.B | 120 | 0.95 | 0.50 | 0.50 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
brand_code.C | 120 | 0.95 | 0.12 | 0.33 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
brand_code.D | 120 | 0.95 | 0.25 | 0.43 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▃ |
This section trains a Ridge Regression model, which is a linear regression technique that adds an L2 penalty to the loss function to prevent overfitting, especially when there are many correlated predictors. The train() function from caret is used again, with the same formula ph ~ . and training data train_data. The method = “glmnet” specifies that a generalized linear model with elastic net regularization should be used. Ridge regression is a special case of elastic net where the alpha parameter is set to 0. The tuneGrid argument provides a set of hyperparameters to be tested during training. Here, expand.grid(alpha = 0, lambda = seq(0.001, 1, length = 10)) creates a grid where alpha is fixed at 0 (for Ridge regression) and lambda (the regularization parameter) varies across 10 values evenly spaced between 0.001 and 1. The trControl argument is the same as in the linear regression section, setting up 10-fold cross-validation to evaluate the model’s performance for each combination of hyperparameters in the tuneGrid. The results of the cross-validation for each lambda value are stored in the ridge_model object. Finally, plot(ridge_model) generates a plot showing how the model’s performance (RMSE, the default regression metric) changes across the different values of lambda.
ridge_model <- train(ph ~ ., data = train_data, method = "glmnet",
tuneGrid = expand.grid(alpha = 0, lambda = seq(0.001, 1, length = 10)),
trControl = trainControl(method = "cv", number = 10))
ridge_model$results
## alpha lambda RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 0 0.001 0.7797860 0.3898136 0.6168869 0.03748620 0.05633981 0.02899993
## 2 0 0.112 0.7857738 0.3815535 0.6248781 0.03521212 0.05442652 0.02667958
## 3 0 0.223 0.7934747 0.3719849 0.6340438 0.03305038 0.05246266 0.02404994
## 4 0 0.334 0.7998654 0.3646837 0.6405313 0.03176378 0.05096088 0.02225276
## 5 0 0.445 0.8054528 0.3586596 0.6454842 0.03095276 0.04968823 0.02093480
## 6 0 0.556 0.8104605 0.3535036 0.6497483 0.03043646 0.04858805 0.02001792
## 7 0 0.667 0.8150393 0.3489476 0.6536019 0.03010482 0.04760360 0.01937037
## 8 0 0.778 0.8192627 0.3448741 0.6570851 0.02989204 0.04670600 0.01898592
## 9 0 0.889 0.8231802 0.3411974 0.6602888 0.02977686 0.04592916 0.01886669
## 10 0 1.000 0.8268465 0.3378318 0.6632201 0.02971622 0.04520925 0.01878920
plot(ridge_model)
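The tuning curve can also be read off numerically; caret stores the CV-selected hyperparameters in bestTune (an illustrative follow-up, not original output):
ridge_model$bestTune                 # lambda with the lowest cross-validated RMSE
min(ridge_model$results$RMSE)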
This section trains a polynomial regression model. Polynomial regression extends the linear regression model by adding polynomial terms of the predictors to the model. The formula ph ~ .^2 in the train() function specifies that the model should include all main effects of the predictors (represented by .) as well as all two-way interaction terms; note that in R formula syntax ^2 expands to pairwise interactions only, so true quadratic terms would have to be added explicitly with I(x^2) or poly(). This greatly expanded feature set allows the model to capture interactions between the predictors and the target variable ph, at the cost of many additional coefficients. Similar to the previous models, 10-fold cross-validation is used to evaluate the model’s performance on the training data. The method = “lm” argument indicates that the underlying modeling technique is still linear regression, but applied to the expanded set of interaction features. The poly_model object will store the trained model and the results of the cross-validation.
poly_model <- train(ph ~ .^2, data = train_data, method = "lm",
trControl = trainControl(method = "cv", number = 10))
poly_model$results
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 1.278568 0.2322552 0.7677897 0.3020665 0.05767391 0.07960624
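Since .^2 contributes only interactions, genuine quadratic terms would need to be added explicitly. A minimal sketch (an assumption, not the author’s code; the engineered *_sq columns above already play this role for fill_ounces and carb_pressure, so two other predictors are squared here):
# Illustrative: pairwise interactions plus explicit squared terms via I()
poly_model2 <- train(ph ~ .^2 + I(carb_volume^2) + I(carb_temp^2),
                     data = train_data, method = "lm",
                     trControl = trainControl(method = "cv", number = 10))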
This section trains a Support Vector Machine (SVM) model for regression. SVM is a powerful and versatile algorithm that can handle both linear and non-linear relationships. The method = “svmRadial” specifies that an SVM with a radial basis function (RBF) kernel should be used. The RBF kernel allows the model to map the input features into a higher-dimensional space where non-linear relationships might become linear, enabling the fitting of complex functions. The trControl = trainControl(method = “cv”, number = 10) argument again sets up 10-fold cross-validation for evaluating the model’s performance. The tuneLength = 5 argument tells the train() function to try 5 values of the cost parameter C; for svmRadial, caret holds the kernel parameter sigma fixed at an analytic estimate (computed via kernlab’s sigest), which is why sigma is constant across the results below. The svm_model object will store the trained SVM model and the results of the cross-validation across the different hyperparameter settings.
svm_model <- train(ph ~ ., data = train_data, method = "svmRadial",
trControl = trainControl(method = "cv", number = 10),
tuneLength = 5)
svm_model$results
## sigma C RMSE Rsquared MAE RMSESD RsquaredSD
## 1 0.01786384 0.25 0.7323438 0.4707333 0.5526845 0.05374517 0.05751801
## 2 0.01786384 0.50 0.7134170 0.4942992 0.5335184 0.05530162 0.05824701
## 3 0.01786384 1.00 0.6964656 0.5161072 0.5189073 0.05511532 0.05755817
## 4 0.01786384 2.00 0.6799551 0.5372298 0.5080382 0.05475099 0.05854430
## 5 0.01786384 4.00 0.6658459 0.5562757 0.5007060 0.05213279 0.05647661
## MAESD
## 1 0.03640937
## 2 0.03839854
## 3 0.03975891
## 4 0.04023068
## 5 0.03830984
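A wider search over both kernel parameters is possible by passing an explicit grid instead of tuneLength; a hedged sketch (the values are illustrative, and the joint grid is considerably slower):
svm_grid <- expand.grid(sigma = c(0.01, 0.02, 0.05), C = c(1, 2, 4, 8))
svm_model_grid <- train(ph ~ ., data = train_data, method = "svmRadial",
                        trControl = trainControl(method = "cv", number = 10),
                        tuneGrid = svm_grid)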
This section trains a Lasso Regression model. Lasso (Least Absolute Shrinkage and Selection Operator) is another type of regularized linear regression that adds an L1 penalty to the loss function. Unlike Ridge regression (which uses an L2 penalty), the L1 penalty has the effect of not only shrinking the coefficients of less important predictors towards zero but also potentially setting them exactly to zero, thus performing feature selection. The method = “glmnet” is used, and the tuneGrid is set to expand.grid(alpha = 1, lambda = seq(0.001, 1, length = 10)). Here, alpha = 1 specifically selects the Lasso penalty. The lambda parameter, which controls the strength of the L1 penalty, is varied across 10 values between 0.001 and 1. 10-fold cross-validation is used for model evaluation across these lambda values. The lasso_model object will store the trained Lasso model and the cross-validation results for each lambda. Finally, plot(lasso_model) generates a plot showing how the model’s performance changes as the lambda parameter is varied. This plot typically helps in identifying the optimal value of lambda that balances model complexity and predictive accuracy. Note that at lambda values of roughly 0.556 and above every coefficient is shrunk to exactly zero, so the model predicts a constant; R-squared is then undefined, which produces the NaN entries and the warning about missing resampled performance measures below.
lasso_model <- train(ph ~ ., data = train_data, method = "glmnet",
tuneGrid = expand.grid(alpha = 1, lambda = seq(0.001, 1, length = 10)),
trControl = trainControl(method = "cv", number = 10))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
lasso_model$results
## alpha lambda RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 0.001 0.7774768 0.3951011 0.6096906 0.02843212 0.03721477 0.02760436
## 2 1 0.112 0.8432443 0.3159366 0.6726200 0.04024085 0.05531783 0.03568876
## 3 1 0.223 0.9020683 0.2590761 0.7218679 0.04123153 0.05888488 0.03566157
## 4 1 0.334 0.9525190 0.2008888 0.7658706 0.04146564 0.05353586 0.03356292
## 5 1 0.445 0.9945488 0.1397918 0.8037642 0.04111184 0.01497529 0.03180424
## 6 1 0.556 0.9949324 NaN 0.8040998 0.04091062 NA 0.03170912
## 7 1 0.667 0.9949324 NaN 0.8040998 0.04091062 NA 0.03170912
## 8 1 0.778 0.9949324 NaN 0.8040998 0.04091062 NA 0.03170912
## 9 1 0.889 0.9949324 NaN 0.8040998 0.04091062 NA 0.03170912
## 10 1 1.000 0.9949324 NaN 0.8040998 0.04091062 NA 0.03170912
plot(lasso_model)
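Because the L1 penalty zeroes coefficients, it is worth inspecting which predictors survive at the CV-chosen lambda (an illustrative follow-up using coef() on the underlying glmnet fit):
coef(lasso_model$finalModel, s = lasso_model$bestTune$lambda)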
This section trains a Random Forest model, which is an ensemble learning method based on constructing multiple decision trees during training and outputting the mean prediction (for regression) of the individual trees. Random Forests are known for their high predictive accuracy and robustness to overfitting. The method = “rf” specifies the use of the Random Forest algorithm. The tuneGrid = expand.grid(mtry = c(5, 10, 15)) defines the hyperparameter to be tuned: mtry, which is the number of predictors randomly sampled at each split of the trees. The model will be trained and evaluated using 10-fold cross-validation for each of these mtry values (5, 10, and 15). The ntree = 300 argument specifies that 300 trees should be grown in each random forest. The rf_model object will store the trained Random Forest model with the best performing mtry value. Finally, varImpPlot(rf_model$finalModel) generates a variable importance plot, which visually ranks the predictors based on their importance in the final Random Forest model (the one trained on the entire training set with the optimal mtry).
rf_model <- train(ph ~ ., data = train_data, method = "rf",
tuneGrid = expand.grid(mtry = c(5, 10, 15)),
trControl = trainControl(method = "cv", number = 10),
ntree = 300)
rf_model$results
## mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 5 0.5967015 0.6746871 0.4516547 0.03591958 0.01886300 0.02591891
## 2 10 0.5764395 0.6877232 0.4310166 0.03253120 0.01793114 0.02622959
## 3 15 0.5655720 0.6957507 0.4213965 0.02901081 0.02010508 0.02585887
varImpPlot(rf_model$finalModel)
This section trains a Gradient Boosting Machine (GBM) model, which is another powerful ensemble learning technique that builds multiple weak prediction models (typically decision trees) sequentially, with each new model trying to correct the errors made by the previous ones. The method = “gbm” specifies the use of the Gradient Boosting Machine algorithm. The trControl = trainControl(method = “cv”, number = 10) sets up 10-fold cross-validation for model evaluation. The verbose = FALSE argument suppresses the printing of detailed output during the training process. The gbm_model object will store the trained GBM model and the results of the cross-validation; caret’s default grid varies the number of trees (50, 100, 150) and the interaction depth (1, 2, 3) while holding the shrinkage at 0.1 and n.minobsinnode at 10, as the results below show. Finally, summary(gbm_model) provides a textual summary of the trained GBM model, reporting the relative influence of the predictor variables.
gbm_model <- train(ph ~ ., data = train_data, method = "gbm",
trControl = trainControl(method = "cv", number = 10),
verbose = FALSE)
gbm_model$results
## shrinkage interaction.depth n.minobsinnode n.trees RMSE Rsquared
## 1 0.1 1 10 50 0.7866012 0.3927961
## 4 0.1 2 10 50 0.7388255 0.4671643
## 7 0.1 3 10 50 0.7099874 0.5061733
## 2 0.1 1 10 100 0.7611657 0.4258190
## 5 0.1 2 10 100 0.7083733 0.5031650
## 8 0.1 3 10 100 0.6814072 0.5384789
## 3 0.1 1 10 150 0.7455101 0.4476241
## 6 0.1 2 10 150 0.6921097 0.5229075
## 9 0.1 3 10 150 0.6662645 0.5560738
## MAE RMSESD RsquaredSD MAESD
## 1 0.6255692 0.02419101 0.04556844 0.02042862
## 4 0.5857397 0.02727334 0.05370083 0.02086196
## 7 0.5584523 0.03305241 0.05851863 0.02382923
## 2 0.6040470 0.02775426 0.04989844 0.02279434
## 5 0.5570477 0.03389878 0.05867027 0.02398646
## 8 0.5319858 0.03444561 0.05496661 0.02199860
## 3 0.5889089 0.03108715 0.05215706 0.02502431
## 6 0.5407981 0.03917662 0.06305557 0.02997029
## 9 0.5186422 0.03857824 0.05779837 0.02445151
summary(gbm_model)
## var rel.inf
## mnf_flow mnf_flow 26.22803757
## brand_code.C brand_code.C 9.04252809
## usage_cont usage_cont 7.41095912
## temperature temperature 5.77562660
## alch_rel alch_rel 5.40716401
## oxygen_filler oxygen_filler 3.84974474
## bowl_setpoint bowl_setpoint 3.24699914
## pressure_vacuum pressure_vacuum 3.24592002
## balling balling 3.03622993
## carb_pressure1 carb_pressure1 2.99995620
## carb_rel carb_rel 2.66363486
## mnf_flow_carb_flow mnf_flow_carb_flow 2.65761731
## balling_lvl balling_lvl 2.50334279
## density density 2.33538476
## filler_speed filler_speed 2.08638201
## air_pressurer air_pressurer 1.90272940
## carb_flow carb_flow 1.53860571
## brand_code.A brand_code.A 1.27660916
## hyd_pressure2 hyd_pressure2 1.02265005
## carb_volume carb_volume 0.97875122
## carb_vol_fill carb_vol_fill 0.94604646
## mfr mfr 0.90158304
## pc_volume pc_volume 0.87672218
## fill_ounces fill_ounces 0.87572828
## psc psc 0.86744802
## hyd_pressure3 hyd_pressure3 0.84111807
## filler_level filler_level 0.80087262
## pressure_setpoint pressure_setpoint 0.76771341
## psc_fill psc_fill 0.76081219
## carb_pressure carb_pressure 0.70853120
## fill_pressure fill_pressure 0.63498609
## hyd_pressure4 hyd_pressure4 0.40736233
## psc_co2 psc_co2 0.31225877
## carb_press_temp carb_press_temp 0.29184386
## carb_pressure_sq carb_pressure_sq 0.28365640
## carb_temp carb_temp 0.24716425
## fill_ounces_sq fill_ounces_sq 0.17561659
## brand_code.B brand_code.B 0.09166355
## brand_code.D brand_code.D 0.00000000
This section trains an XGBoost model, which stands for Extreme Gradient Boosting. XGBoost is an optimized and highly efficient implementation of the gradient boosting algorithm, known for its speed and performance. The method = “xgbTree” specifies the use of the tree-based XGBoost model. A tuneGrid named xgb_grid is defined to specify the hyperparameter combinations to be tested during training. This grid includes a fixed number of boosting rounds (nrounds = 100), a range of maximum tree depths (max_depth = 3:5), a learning rate (eta = 0.1), regularization parameters (gamma = 0), subsampling ratio of columns (colsample_bytree = 0.8), minimum child weight (min_child_weight = 1), and subsampling ratio of the training instances (subsample = 0.8). The trainControl is set to 5-fold cross-validation (number = 5) for evaluating the model’s performance across these hyperparameter settings. The verbose = FALSE argument suppresses detailed training output. The xgb_model object will store the trained XGBoost model with the best performing hyperparameters found through the cross-validation process.
xgb_grid <- expand.grid(
nrounds = 100, # LIMIT to 100 trees max
max_depth = 3:5,
eta = c(0.1), # learning rate
gamma = 0,
colsample_bytree = 0.8,
min_child_weight = 1,
subsample = 0.8
)
xgb_model <- train(
ph ~ .,
data = train_data,
method = "xgbTree",
trControl = trainControl(method = "cv", number = 5), # reduce folds to 5 for now
tuneGrid = xgb_grid,
verbose = FALSE
)
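No results table is printed for this model in the original; the later comparison reads xgb_model$results directly. To inspect it here (illustrative):
xgb_model$results[order(xgb_model$results$RMSE), ]   # best max_depth first
xgb_model$bestTune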
This section trains a single-layer feedforward neural network model using the nnet package via the train() function. Neural networks are powerful non-linear models that can learn complex relationships in the data. The method = “nnet” specifies the use of the neural network algorithm. The trControl = trainControl(method = “cv”, number = 10) sets up 10-fold cross-validation for evaluating the model. The tuneGrid defines the hyperparameters to be tuned: size (the number of hidden units in the network, ranging from 1 to 5) and decay (the weight decay parameter, with values 0.01 and 0.1, used for regularization). The linout = TRUE argument indicates that the output layer should be linear, which is appropriate for regression tasks. trace = FALSE suppresses the printing of training progress, and maxit = 500 sets the maximum number of iterations for training each network. The nnet_model object will store the trained neural network model with the best performing combination of size and decay found during cross-validation. Finally, varImp(nnet_model) calculates and potentially prints or returns the variable importance scores for the predictors in the trained neural network.
nnet_model <- train(ph ~ ., data = train_data, method = "nnet",
trControl = trainControl(method = "cv", number = 10),
tuneGrid = expand.grid(size = 1:5, decay = c(0.01, 0.1)),
linout = TRUE, trace = FALSE, maxit = 500)
nnet_model$results
## size decay RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 0.01 0.7603836 0.4190783 0.6004884 0.03989665 0.04539775 0.03515183
## 2 1 0.10 0.7618784 0.4171404 0.5993337 0.03635099 0.04555699 0.03384447
## 3 2 0.01 0.7393173 0.4549772 0.5716739 0.03655419 0.05320002 0.02609113
## 4 2 0.10 0.7293959 0.4683184 0.5647277 0.04071222 0.05480748 0.03077519
## 5 3 0.01 0.7313314 0.4698138 0.5720023 0.03584205 0.05694286 0.02829870
## 6 3 0.10 0.7226353 0.4832520 0.5605036 0.04728550 0.06342039 0.03467665
## 7 4 0.01 0.7428421 0.4574285 0.5735596 0.03360771 0.04396771 0.03229127
## 8 4 0.10 0.7130668 0.5003511 0.5440974 0.05332985 0.05947581 0.03808167
## 9 5 0.01 0.7369358 0.4684357 0.5706867 0.04828142 0.06415200 0.03783661
## 10 5 0.10 0.7155254 0.5027768 0.5482771 0.04843817 0.05209672 0.03282014
varImp(nnet_model)
## nnet variable importance
##
## only 20 most important variables shown (out of 39)
##
## Overall
## mnf_flow 100.00
## density 85.96
## balling 58.68
## brand_code.D 54.17
## balling_lvl 48.93
## carb_flow 43.68
## brand_code.B 41.75
## alch_rel 41.00
## hyd_pressure2 40.47
## brand_code.A 39.07
## brand_code.C 37.30
## mnf_flow_carb_flow 35.55
## oxygen_filler 29.45
## bowl_setpoint 28.07
## filler_level 21.84
## hyd_pressure3 19.25
## pressure_setpoint 18.55
## carb_volume 18.45
## temperature 17.28
## pressure_vacuum 16.57
This section trains a Multivariate Adaptive Regression Splines (MARS) model using the earth package via the train() function. MARS is a non-parametric regression technique that builds a model from piecewise linear segments (splines). It can capture non-linear relationships and interactions between variables. The method = “earth” specifies the use of the MARS algorithm. The trControl = trainControl(method = “cv”, number = 10) sets up 10-fold cross-validation for model evaluation. The tuneLength = 10 argument tells the train() function to try 10 values of nprune, the number of retained basis functions (related to the number of knots in the splines); in caret’s default grid the interaction degree is held at 1, as the results below show. The mars_model object will store the trained MARS model with the best performing tuning parameters found during cross-validation.
mars_model <- train(ph ~ ., data = train_data, method = "earth",
trControl = trainControl(method = "cv", number = 10),
tuneLength = 10)
mars_model$results
## degree nprune RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 2 0.8781740 0.2251214 0.6919336 0.05758684 0.03192169 0.03688711
## 2 1 3 0.8307085 0.3077105 0.6598020 0.05823249 0.05494291 0.03869580
## 3 1 5 0.8103530 0.3413276 0.6444139 0.05911968 0.05005255 0.04342685
## 4 1 7 0.7875605 0.3770401 0.6172165 0.05850881 0.05397455 0.03962293
## 5 1 9 0.7764927 0.3945731 0.6071310 0.05779201 0.05175938 0.03904156
## 6 1 10 0.7677263 0.4077604 0.6024201 0.05888260 0.05193588 0.03747759
## 7 1 12 0.7653535 0.4117888 0.5971624 0.05702658 0.05143433 0.03703806
## 8 1 14 0.7553028 0.4275468 0.5867319 0.05581461 0.05443256 0.03467552
## 9 1 16 0.7546861 0.4289504 0.5854903 0.05296640 0.05498614 0.03271273
## 10 1 18 0.7540480 0.4299548 0.5843352 0.05265508 0.05482900 0.03204137
This section trains a Partial Least Squares (PLS) regression model. PLS is a dimensionality reduction technique combined with regression, particularly useful when dealing with a large number of predictors that might be highly collinear. It aims to find a set of components (latent variables) that are linear combinations of the original predictors and that have maximum covariance with the target variable. The method = “pls” specifies the use of the PLS algorithm. The trControl = trainControl(method = “cv”, number = 10) sets up 10-fold cross-validation for model evaluation. The tuneLength = 10 argument tells the train() function to automatically try 10 different numbers of components to include in the PLS model. The pls_model object will store the trained PLS model with the optimal number of components found during cross-validation.
pls_model <- train(ph ~ ., data = train_data, method = "pls",
trControl = trainControl(method = "cv", number = 10),
tuneLength = 10)
pls_model$results
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 0.8825098 0.2152488 0.7041138 0.03894952 0.04174819 0.02525132
## 2 2 0.8388528 0.2907108 0.6608659 0.04258645 0.04480946 0.02476208
## 3 3 0.8187834 0.3243745 0.6468874 0.04491843 0.04430925 0.02995808
## 4 4 0.8047854 0.3476178 0.6343189 0.03848834 0.05175847 0.02854119
## 5 5 0.7929626 0.3663910 0.6257152 0.03761353 0.05650281 0.02855068
## 6 6 0.7861867 0.3774394 0.6205960 0.03601113 0.05628874 0.02638577
## 7 7 0.7830283 0.3824923 0.6185556 0.03590250 0.05784490 0.02799494
## 8 8 0.7794704 0.3878348 0.6157866 0.03749249 0.05835838 0.02840743
## 9 9 0.7786635 0.3892946 0.6136235 0.03823235 0.06047623 0.02965956
## 10 10 0.7783311 0.3896411 0.6127985 0.03971278 0.05988073 0.03086197
This section trains a Cubist model, which is a rule-based regression algorithm. Cubist builds a set of rules, where each rule has a linear model associated with it. Predictions are made by finding the rule that applies to a new instance and then using the linear model within that rule. Cubist is known for producing models that are often accurate and relatively easy to interpret. The method = “cubist” specifies the use of the Cubist algorithm. The trControl = trainControl(method = “cv”, number = 10) sets up 10-fold cross-validation for model evaluation. The cubist_model object will store the trained Cubist model and the results of the cross-validation, which tunes Cubist’s hyperparameters: the number of committees and the number of neighbors used for instance-based adjustment, as the results below show.
cubist_model <- train(ph ~ ., data = train_data, method = "cubist",
trControl = trainControl(method = "cv", number = 10))
cubist_model$results
## committees neighbors RMSE Rsquared MAE RMSESD RsquaredSD
## 1 1 0 0.7102485 0.5256522 0.4975421 0.04364406 0.03752760
## 2 1 5 0.7019808 0.5517790 0.4858000 0.04897795 0.03377260
## 3 1 9 0.6933138 0.5553105 0.4821711 0.04908501 0.03543315
## 4 10 0 0.5878623 0.6561850 0.4372710 0.02505345 0.02925400
## 5 10 5 0.5707070 0.6746163 0.4223708 0.03370850 0.03701112
## 6 10 9 0.5655634 0.6798202 0.4213567 0.02712086 0.02954652
## 7 20 0 0.5806602 0.6664208 0.4346325 0.02805494 0.03167844
## 8 20 5 0.5643438 0.6810963 0.4172180 0.03923392 0.04353696
## 9 20 9 0.5580712 0.6881898 0.4164169 0.03199413 0.03595721
## MAESD
## 1 0.02363708
## 2 0.02576124
## 3 0.02322408
## 4 0.02138978
## 5 0.02601130
## 6 0.01924250
## 7 0.02344180
## 8 0.03018595
## 9 0.02440129
This section compares the performance of several of the trained regression models using the resamples() function from the caret package. First, a preprocessing pipeline (centering, scaling, and k-nearest-neighbor imputation) is fit on train_data with preProcess() and then applied to test_data with predict(), producing test_data_clean; any values still missing afterwards are filled with 0 as a fallback so that prediction cannot fail on incomplete rows. The Support Vector Machine (svm_model) is then used to predict on this cleaned test data, and, if the ph column is present (it is, since test_data is the held-out split), the actual and predicted values are stored in a data frame test_results. The primary focus of the section, however, is the comparison of cross-validation results. The resamples() function takes a list of trained models (Linear Regression, Ridge Regression, Polynomial Regression, and SVM) and collects their cross-validation performance metrics. summary(resamps) then provides a statistical summary of these metrics (minimum, quartiles, mean, maximum) for each model across the ten folds, allowing a direct comparison of estimated out-of-sample performance. dotplot(resamps) visualizes the same comparison, making it easy to see which models tend to perform better and how variable their performance is across folds. Finally, Random Forest predictions on the cleaned test data are generated and exported to a CSV file.
library(caret)
# Step 0: Train preprocessing pipeline on training data
preprocess_model <- preProcess(train_data, method = c("center", "scale", "knnImpute"))
# ---- STEP 1: Preprocess test data using training pipeline ----
# Ensure test_data has been preloaded
test_data_clean <- predict(preprocess_model, newdata = test_data)
test_data_clean[is.na(test_data_clean)] <- 0 # Optional but safe fallback
# ---- STEP 2: SVM predictions on cleaned test data ----
svm_predictions <- predict(svm_model, newdata = test_data_clean)
# ---- STEP 3: Compare Actual vs Predicted if test has labels (only for validation sets) ----
# test_data is the held-out split from df, so ph is present; the guard below
# keeps this step safe for unlabeled data as well
if ("ph" %in% names(test_data_clean)) {
test_results <- data.frame(
Actual = test_data_clean$ph,
Predicted = svm_predictions
)
}
# ---- STEP 4: Cross-validation model comparison ----
resamps <- resamples(list(
LM = linear_model,
Ridge = ridge_model,
Poly = poly_model,
SVM = svm_model
))
summary(resamps)
##
## Call:
## summary.resamples(object = resamps)
##
## Models: LM, Ridge, Poly, SVM
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LM 0.5541460 0.5902700 0.6007180 0.6060421 0.6195031 0.6589668 0
## Ridge 0.5756473 0.5930864 0.6256791 0.6168869 0.6271031 0.6741715 0
## Poly 0.6537365 0.7241301 0.7683594 0.7677897 0.8135614 0.8934389 0
## SVM 0.4380342 0.4801021 0.4878602 0.5007060 0.5323212 0.5655219 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LM 0.7185870 0.7387778 0.7584026 0.7730769 0.8007085 0.8622259 0
## Ridge 0.7387157 0.7545030 0.7764182 0.7797860 0.7883051 0.8633429 0
## Poly 0.9690045 1.0605886 1.1986217 1.2785676 1.4498312 1.9025118 0
## SVM 0.5825130 0.6357883 0.6625921 0.6658459 0.6980400 0.7636657 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LM 0.32793164 0.3770407 0.4047008 0.3980851 0.4233150 0.4505619 0
## Ridge 0.27587143 0.3660323 0.3853042 0.3898136 0.4333979 0.4648637 0
## Poly 0.09869629 0.2295165 0.2420271 0.2322552 0.2589805 0.3130543 0
## SVM 0.43808338 0.5423333 0.5661380 0.5562757 0.5902952 0.6233509 0
dotplot(resamps)
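Beyond the resampling comparison, the held-out split can score the SVM directly. A hedged sketch using caret’s postResample(), assuming test_results was populated in Step 3 (the 20% split does retain ph):
postResample(pred = test_results$Predicted, obs = test_results$Actual)   # RMSE, Rsquared, MAE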
# ---- STEP 5: Final model predictions (e.g., Random Forest) for output ----
rf_predictions <- predict(rf_model, newdata = test_data_clean)
# ---- STEP 6: Export to CSV (single write; the ID column keeps rows traceable) ----
write.csv(data.frame(ID = seq_along(rf_predictions), predicted_ph = rf_predictions),
          "FinalPredictionnnnn.csv", row.names = FALSE)
Output:
EXTENDED MODEL COMPARISON
This section extends the model comparison by calculating and displaying the Root Mean Squared Error (RMSE) and R-squared values obtained from the cross-validation for all the trained models. It first creates a character vector model_names containing the names assigned to each of the trained models. Then, it creates a data frame named results. This data frame has three columns: Model (containing the names of the models), RMSE, and Rsquared. The values for the RMSE and Rsquared columns are extracted from the $results component of each model object. Specifically, it appears to be taking the first row’s RMSE and Rsquared value from the $results data frame of each model. It’s important to note that the $results object in caret often contains multiple rows if the model was tuned over a grid of hyperparameters; taking just the first row might not represent the best performance of a tuned model (it might be the performance at the first tested hyperparameter setting). After creating this results data frame, it is printed to the console using the print() function.
model_names <- c("LM", "Ridge", "Lasso", "Poly", "SVM", "RF", "GBM", "XGB", "NNet", "MARS", "PLS", "Cubist")
results <- data.frame(
Model = model_names,
RMSE = c(
linear_model$results$RMSE[1],
ridge_model$results$RMSE[1],
lasso_model$results$RMSE[1],
poly_model$results$RMSE[1],
svm_model$results$RMSE[1],
rf_model$results$RMSE[1],
gbm_model$results$RMSE[1],
xgb_model$results$RMSE[1],
nnet_model$results$RMSE[1],
mars_model$results$RMSE[1],
pls_model$results$RMSE[1],
cubist_model$results$RMSE[1]
),
Rsquared = c(
linear_model$results$Rsquared[1],
ridge_model$results$Rsquared[1],
lasso_model$results$Rsquared[1],
poly_model$results$Rsquared[1],
svm_model$results$Rsquared[1],
rf_model$results$Rsquared[1],
gbm_model$results$Rsquared[1],
xgb_model$results$Rsquared[1],
nnet_model$results$Rsquared[1],
mars_model$results$Rsquared[1],
pls_model$results$Rsquared[1],
cubist_model$results$Rsquared[1]
)
)
print(results)
## Model RMSE Rsquared
## 1 LM 0.7730769 0.3980851
## 2 Ridge 0.7797860 0.3898136
## 3 Lasso 0.7774768 0.3951011
## 4 Poly 1.2785676 0.2322552
## 5 SVM 0.7323438 0.4707333
## 6 RF 0.5967015 0.6746871
## 7 GBM 0.7866012 0.3927961
## 8 XGB 0.6408614 0.5937333
## 9 NNet 0.7603836 0.4190783
## 10 MARS 0.8781740 0.2251214
## 11 PLS 0.8825098 0.2152488
## 12 Cubist 0.7102485 0.5256522
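As the description notes, indexing $results$RMSE[1] reports the first grid row, not necessarily the best one for tuned models. A sketch (not in the original) using caret’s getTrainPerf(), which returns each model’s metrics at its CV-selected tuning parameters:
model_list <- list(LM = linear_model, Ridge = ridge_model, Lasso = lasso_model,
                   Poly = poly_model, SVM = svm_model, RF = rf_model,
                   GBM = gbm_model, XGB = xgb_model, NNet = nnet_model,
                   MARS = mars_model, PLS = pls_model, Cubist = cubist_model)
best_results <- purrr::map_dfr(model_list, getTrainPerf, .id = "Model")
best_results[order(best_results$TrainRMSE), ]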
This section focuses on making predictions on the test dataset using the Support Vector Machine (svm_model), which is chosen as the final model for prediction. First, it identifies and keeps only the rows in the test_data that have complete data for all the predictor variables, applying complete.cases() to every column except ph. This ensures that the final model receives input data without any missing values, which could cause prediction errors. The cleaned test data is stored in test_data_clean. Then, the predict() function is used to generate predictions on this cleaned test data using the final_model (which is set to svm_model). The resulting predictions are stored in a variable named predictions. A new data frame test_results is created, containing two columns: Actual (the true ph values from test_data_clean) and Predicted (the ph values predicted by the SVM model). Finally, a scatter plot is generated using ggplot() to visualize the relationship between the actual and predicted ph values. geom_point(color = “blue”) creates the scatter plot with blue points. geom_abline(slope = 1, intercept = 0, linetype = “dashed”, color = “red”) adds a dashed red line with a slope of 1 and intercept of 0, representing perfect predictions (where actual equals predicted). labs() adds a title (“Predicted vs Actual pH Values”) and labels for the x-axis (“Actual pH”) and y-axis (“Predicted pH”). This plot helps to visually assess the performance of the final model on the test data; points closer to the red line indicate better prediction accuracy.
# Identify rows with complete predictor data (every column except ph)
predictor_cols <- setdiff(names(test_data), "ph")
valid_index <- complete.cases(test_data[, predictor_cols])
test_data_clean <- test_data[valid_index, ]
# Make predictions using the final model
final_model <- svm_model
predictions <- predict(final_model, newdata = test_data_clean)
# Create results dataframe
test_results <- data.frame(Actual = test_data_clean$ph, Predicted = predictions)
# Plot Predicted vs Actual
ggplot(test_results, aes(x = Actual, y = Predicted)) +
geom_point(color = "blue") +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
labs(title = "Predicted vs Actual pH Values", x = "Actual pH", y = "Predicted pH")
This section prepares an external dataset (eval) for prediction using the final trained model (svm_model) and then saves the predictions. It involves several steps to ensure compatibility between the external dataset and the model, which was trained on train_data. First, it extracts the names of the dummy variables created for brand_code during the training phase. It then saves record IDs from the eval dataset before any modifications (falling back to row numbers, since the file has no record_id column). The brand_code in eval is then converted to a factor with levels matching those used during training. To handle potential issues with factor levels not present in the external data (which can cause errors during dummy variable creation), it includes a step to add a dummy row if brand_code in eval has fewer than two observed levels. This dummy row uses a missing level from the training data. Feature engineering is then applied to eval, creating the same interaction and polynomial terms as were created in the training data. One-hot encoding is performed on brand_code in eval using the dummyVars model fit on the training data, so the dummy column names match the training columns exactly. If a dummy row was added earlier, it is now removed. The engineered features and the brand dummy variables are combined into eval_final, excluding the original brand_code. It then checks for any dummy columns that were present in the training data but are missing in eval_final and adds them, filling with 0s. The columns in eval_final are reordered to match the order of the predictor columns in the training data. Finally, predictions are made on the prepared eval_final using the final_model. These predictions are combined with the original record IDs into a data frame eval_results, which is then saved to a CSV file named “FinalPredictionnnnn.csv”.
# Step 0: Extract dummy variable names from training data
brand_dummy_names <- colnames(df_encoded)
# Save record IDs before any data modifications; the evaluation file has no
# record_id column, so fall back to row numbers
record_ids <- if ("record_id" %in% names(eval)) eval$record_id else seq_len(nrow(eval))
# Step 1: Force brand_code levels to match training ("brand_code.A" -> "A";
# fixed = TRUE strips the separator dot as well, so levels align with the data)
eval$brand_code <- factor(eval$brand_code,
                          levels = gsub("brand_code.", "", brand_dummy_names, fixed = TRUE))
# Step 2: Add dummy row if brand_code has fewer than 2 observed levels
observed_levels <- unique(eval$brand_code[!is.na(eval$brand_code)])
missing_levels <- setdiff(levels(eval$brand_code), observed_levels)
dummy_added <- FALSE
if (length(observed_levels) < 2 && length(missing_levels) >= 1) {
dummy_row <- eval[1, ]
dummy_row$brand_code <- missing_levels[1]
eval <- bind_rows(eval, dummy_row)
# Add dummy record_id for the dummy row (NA or any placeholder)
record_ids <- c(record_ids, NA)
dummy_added <- TRUE
}
# Step 3: Feature engineering
eval <- eval %>%
mutate(
carb_vol_fill = carb_volume * fill_ounces,
carb_press_temp = carb_pressure * carb_temp,
mnf_flow_carb_flow = mnf_flow * carb_flow,
fill_ounces_sq = fill_ounces^2,
carb_pressure_sq = carb_pressure^2
)
# Step 4: One-hot encode brand_code with the dummyVars model fit on the training
# data so the column names ("brand_code.A", ...) match the training columns;
# predict.dummyVars defaults to na.action = na.pass, so rows with a missing
# brand_code are kept (as all-NA dummies) and row order is preserved
brand_dummies <- predict(dummy_model, newdata = eval) %>% as.data.frame()
# Step 5: Remove dummy row if added
if (dummy_added) {
eval <- eval[1:(nrow(eval) - 1), ]
brand_dummies <- brand_dummies[1:(nrow(brand_dummies) - 1), ]
record_ids <- record_ids[1:(length(record_ids) - 1)] # remove dummy record_id too
}
# Step 6: Combine data
eval_final <- eval %>%
select(-brand_code) %>%
bind_cols(brand_dummies)
# Step 6.5: Add any missing dummy columns that exist in training but not in eval
required_cols <- setdiff(colnames(train_data), "ph")
missing_cols <- setdiff(required_cols, colnames(eval_final))
for (col in missing_cols) {
  eval_final[[col]] <- 0
}
# Rows with a missing brand_code carry NA dummies; zero them so predict() runs
eval_final[brand_dummy_names][is.na(eval_final[brand_dummy_names])] <- 0
# Step 6.6: Reorder columns to match training data
eval_final <- eval_final %>% select(all_of(required_cols))
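One caveat the script leaves implicit: the SVM was trained on median-imputed, centered, and scaled numeric columns, while eval_final is still on the raw scale. A hedged sketch of reusing the training-time preproc object (assuming it is still in scope; a placeholder ph column is added because preproc was fit with ph present, and the engineered interaction/square columns would still need rebuilding from scaled inputs to match training exactly):
eval_scaled <- eval_final
eval_scaled$ph <- NA_real_                    # preproc was fit with ph included
eval_scaled <- predict(preproc, eval_scaled)  # medianImpute + center + scale
eval_scaled$ph <- NULL
# eval_preds <- predict(final_model, newdata = eval_scaled)  # scaled variant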
# Step 7: Predict
eval_preds <- predict(final_model, newdata = eval_final)
# Step 8: Save predictions with aligned record_ids
eval_results <- data.frame(ID = record_ids, Predicted_pH = eval_preds)
write.csv(eval_results, "FinalPredictionnnnn.csv", row.names = FALSE)
EXPORT TO EXCEL ADDITIONAL
This section uses the openxlsx package to export the eval_results data frame, which contains the record IDs and their corresponding predicted pH values for the external dataset, to an Excel file. The write.xlsx() function is used for this purpose. The first argument is the data frame to be exported (eval_results). The file argument specifies the name of the Excel file to be created (“Evaluation_Result.xlsx”). The sheetName argument sets the name of the sheet within the Excel file to “SVM”. The rowNames = FALSE argument ensures that the row names of the eval_results data frame are not included in the exported Excel file. This provides a way to share the prediction results in a commonly used format.
library(openxlsx)
## Warning: package 'openxlsx' was built under R version 4.4.3
write.xlsx(eval_results, file = "Evaluation_Result.xlsx", sheetName = "SVM", rowNames = FALSE)
OPTIONAL: VARIABLE IMPORTANCE
This optional section calculates and displays the variable importance for three of the models: Random Forest (rf_model), XGBoost (xgb_model), and the neural network (nnet_model). The varImp() function from the caret package is a generic function that can extract variable importance scores from various trained models. The specific method of calculating importance depends on the type of model. For tree-based models like Random Forest and XGBoost, importance is often based on how much each variable contributes to reducing impurity (e.g., variance in regression) across all the trees. For the neural network, the calculation is more complex and is derived from combinations of the network’s connection weights. The output of varImp() typically provides a ranking of the predictors based on their estimated importance in the model. This information can be valuable for understanding which variables are most influential in predicting the target variable.
varImp(rf_model)
## rf variable importance
##
## only 20 most important variables shown (out of 39)
##
## Overall
## mnf_flow 100.000
## usage_cont 47.766
## brand_code.C 37.025
## oxygen_filler 29.568
## temperature 26.987
## bowl_setpoint 26.636
## alch_rel 25.306
## filler_level 25.206
## carb_rel 24.678
## pressure_vacuum 23.270
## balling_lvl 21.209
## carb_pressure1 16.957
## balling 16.275
## mnf_flow_carb_flow 15.994
## air_pressurer 15.676
## filler_speed 14.366
## carb_flow 12.217
## hyd_pressure3 11.924
## density 11.539
## mfr 9.499
varImp(xgb_model)
## xgbTree variable importance
##
## only 20 most important variables shown (out of 39)
##
## Overall
## mnf_flow 100.000
## usage_cont 62.084
## brand_code.C 38.226
## oxygen_filler 30.216
## temperature 28.704
## pressure_vacuum 27.437
## alch_rel 23.994
## carb_rel 23.911
## balling_lvl 20.560
## bowl_setpoint 19.393
## air_pressurer 18.964
## carb_pressure1 17.342
## filler_speed 14.865
## balling 14.705
## mnf_flow_carb_flow 13.329
## density 13.255
## hyd_pressure3 12.707
## pc_volume 10.215
## carb_flow 9.028
## filler_level 8.815
varImp(nnet_model)
## nnet variable importance
##
## only 20 most important variables shown (out of 39)
##
## Overall
## mnf_flow 100.00
## density 85.96
## balling 58.68
## brand_code.D 54.17
## balling_lvl 48.93
## carb_flow 43.68
## brand_code.B 41.75
## alch_rel 41.00
## hyd_pressure2 40.47
## brand_code.A 39.07
## brand_code.C 37.30
## mnf_flow_carb_flow 35.55
## oxygen_filler 29.45
## bowl_setpoint 28.07
## filler_level 21.84
## hyd_pressure3 19.25
## pressure_setpoint 18.55
## carb_volume 18.45
## temperature 17.28
## pressure_vacuum 16.57
This project demonstrated a systematic approach to predicting beverage pH using machine learning. After extensive data cleaning, feature engineering, and experimentation with multiple regression models, the SVM model was selected as the final model for prediction, while the tree-based ensembles (Random Forest and Cubist) achieved the lowest cross-validated RMSE. Future work could explore additional ensemble techniques and more domain-driven feature engineering to improve accuracy.