This section is the introduction to the R script. The code block uses the pacman package to load a comprehensive set of R libraries. These libraries are essential for various tasks that will be performed in the subsequent sections, including data manipulation (tidyverse, janitor), machine learning model building and evaluation (caret, glmnet, pls, earth, kernlab, randomForest, Cubist, xgboost, nnet, e1071), time series analysis (forecast, fpp3, tsibble), association rule mining (arules, arulesViz), clustering (cluster), and multivariate data analysis (FactoMineR, factoextra), as well as data exploration and visualization (skimr, DataExplorer, corrplot).
pacman::p_load(MASS, tidyverse, caret, glmnet, pls, earth, kernlab,
randomForest, Cubist, xgboost, nnet, e1071,
forecast, fpp3, tsibble, arules, arulesViz, cluster,
FactoMineR, factoextra, skimr, DataExplorer, janitor, corrplot)
Output: Executing this code block will load all the specified R packages into the current R session. There will be no direct output printed to the console unless a package must first be installed (pacman::p_load installs missing packages automatically) or fails to load, in which case installation messages or an error would be displayed.
This section focuses on loading the dataset and performing initial exploratory data analysis (EDA). It begins by loading two CSV files, “StudentData.csv” and “StudentEvaluation.csv”, into R data frames named df and eval, respectively, using read_csv() from the readr package. The clean_names() function from the janitor package is then applied to both data frames to ensure that the column names are consistent, readable, and R-friendly (e.g., converting spaces to underscores and making them lowercase). Following the data loading and cleaning, several functions from the skimr and DataExplorer packages are used to get a preliminary understanding of the df data frame. skim(df) provides a concise summary of the dataset, including the data types of columns, the number of missing values, and basic descriptive statistics for numerical columns. The DataExplorer functions (plot_missing(), plot_intro(), plot_histogram(), plot_density()) generate visualizations to further explore the data’s characteristics, such as the patterns of missing values, an overview of the dataset structure and variable types, the distributions of numerical variables through histograms, and their probability density distributions. Finally, the code selects only the numerical columns from df using select(df, where(is.numeric)) and calculates the pairwise correlation matrix using cor() with the use = “pairwise.complete.obs” argument to handle missing values. This correlation matrix is then visualized using corrplot() from the corrplot package, displaying the strength and direction of linear relationships between the numerical variables. The tl.cex = 0.6 argument shrinks the text labels on the plot for better readability.
library(readr)
df <- read_csv("StudentData.csv") %>% clean_names()
## Rows: 2571 Columns: 33
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Brand Code
## dbl (32): Carb Volume, Fill Ounces, PC Volume, Carb Pressure, Carb Temp, PSC...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
eval <- read_csv("StudentEvaluation.csv") %>% clean_names()
## Rows: 267 Columns: 33
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Brand Code
## dbl (31): Carb Volume, Fill Ounces, PC Volume, Carb Pressure, Carb Temp, PSC...
## lgl (1): PH
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(df)
Name | df |
Number of rows | 2571 |
Number of columns | 33 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 32 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
brand_code | 120 | 0.95 | 1 | 1 | 0 | 4 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
carb_volume | 10 | 1.00 | 5.37 | 0.11 | 5.04 | 5.29 | 5.35 | 5.45 | 5.70 | ▁▆▇▅▁ |
fill_ounces | 38 | 0.99 | 23.97 | 0.09 | 23.63 | 23.92 | 23.97 | 24.03 | 24.32 | ▁▂▇▂▁ |
pc_volume | 39 | 0.98 | 0.28 | 0.06 | 0.08 | 0.24 | 0.27 | 0.31 | 0.48 | ▁▃▇▂▁ |
carb_pressure | 27 | 0.99 | 68.19 | 3.54 | 57.00 | 65.60 | 68.20 | 70.60 | 79.40 | ▁▅▇▃▁ |
carb_temp | 26 | 0.99 | 141.09 | 4.04 | 128.60 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▅▇▃▁ |
psc | 33 | 0.99 | 0.08 | 0.05 | 0.00 | 0.05 | 0.08 | 0.11 | 0.27 | ▆▇▃▁▁ |
psc_fill | 23 | 0.99 | 0.20 | 0.12 | 0.00 | 0.10 | 0.18 | 0.26 | 0.62 | ▆▇▃▁▁ |
psc_co2 | 39 | 0.98 | 0.06 | 0.04 | 0.00 | 0.02 | 0.04 | 0.08 | 0.24 | ▇▅▂▁▁ |
mnf_flow | 2 | 1.00 | 24.57 | 119.48 | -100.20 | -100.00 | 65.20 | 140.80 | 229.40 | ▇▁▁▇▂ |
carb_pressure1 | 32 | 0.99 | 122.59 | 4.74 | 105.60 | 119.00 | 123.20 | 125.40 | 140.20 | ▁▃▇▂▁ |
fill_pressure | 22 | 0.99 | 47.92 | 3.18 | 34.60 | 46.00 | 46.40 | 50.00 | 60.40 | ▁▁▇▂▁ |
hyd_pressure1 | 11 | 1.00 | 12.44 | 12.43 | -0.80 | 0.00 | 11.40 | 20.20 | 58.00 | ▇▅▂▁▁ |
hyd_pressure2 | 15 | 0.99 | 20.96 | 16.39 | 0.00 | 0.00 | 28.60 | 34.60 | 59.40 | ▇▂▇▅▁ |
hyd_pressure3 | 15 | 0.99 | 20.46 | 15.98 | -1.20 | 0.00 | 27.60 | 33.40 | 50.00 | ▇▁▃▇▁ |
hyd_pressure4 | 30 | 0.99 | 96.29 | 13.12 | 52.00 | 86.00 | 96.00 | 102.00 | 142.00 | ▁▃▇▂▁ |
filler_level | 20 | 0.99 | 109.25 | 15.70 | 55.80 | 98.30 | 118.40 | 120.00 | 161.20 | ▁▃▅▇▁ |
filler_speed | 57 | 0.98 | 3687.20 | 770.82 | 998.00 | 3888.00 | 3982.00 | 3998.00 | 4030.00 | ▁▁▁▁▇ |
temperature | 14 | 0.99 | 65.97 | 1.38 | 63.60 | 65.20 | 65.60 | 66.40 | 76.20 | ▇▃▁▁▁ |
usage_cont | 5 | 1.00 | 20.99 | 2.98 | 12.08 | 18.36 | 21.79 | 23.75 | 25.90 | ▁▃▅▃▇ |
carb_flow | 2 | 1.00 | 2468.35 | 1073.70 | 26.00 | 1144.00 | 3028.00 | 3186.00 | 5104.00 | ▂▅▆▇▁ |
density | 1 | 1.00 | 1.17 | 0.38 | 0.24 | 0.90 | 0.98 | 1.62 | 1.92 | ▁▅▇▂▆ |
mfr | 212 | 0.92 | 704.05 | 73.90 | 31.40 | 706.30 | 724.00 | 731.00 | 868.60 | ▁▁▁▂▇ |
balling | 1 | 1.00 | 2.20 | 0.93 | -0.17 | 1.50 | 1.65 | 3.29 | 4.01 | ▁▇▇▁▇ |
pressure_vacuum | 0 | 1.00 | -5.22 | 0.57 | -6.60 | -5.60 | -5.40 | -5.00 | -3.60 | ▂▇▆▂▁ |
ph | 4 | 1.00 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▂▁ |
oxygen_filler | 12 | 1.00 | 0.05 | 0.05 | 0.00 | 0.02 | 0.03 | 0.06 | 0.40 | ▇▁▁▁▁ |
bowl_setpoint | 2 | 1.00 | 109.33 | 15.30 | 70.00 | 100.00 | 120.00 | 120.00 | 140.00 | ▁▂▃▇▁ |
pressure_setpoint | 12 | 1.00 | 47.62 | 2.04 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
air_pressurer | 0 | 1.00 | 142.83 | 1.21 | 140.80 | 142.20 | 142.60 | 143.00 | 148.20 | ▅▇▁▁▁ |
alch_rel | 9 | 1.00 | 6.90 | 0.51 | 5.28 | 6.54 | 6.56 | 7.24 | 8.62 | ▁▇▂▃▁ |
carb_rel | 10 | 1.00 | 5.44 | 0.13 | 4.96 | 5.34 | 5.40 | 5.54 | 6.06 | ▁▇▇▂▁ |
balling_lvl | 1 | 1.00 | 2.05 | 0.87 | 0.00 | 1.38 | 1.48 | 3.14 | 3.66 | ▁▇▂▁▆ |
plot_missing(df)
plot_intro(df)
plot_histogram(df)
plot_density(df)
num_df <- select(df, where(is.numeric))
corrplot(cor(num_df, use = "pairwise.complete.obs"), method = "color", tl.cex = 0.6)
Output:
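The corrplot gives a visual overview; as a small companion (not in the original script, and assuming num_df from above is still in scope), the strongest pairwise correlations can also be listed numerically, which helps when deciding which collinear predictors to watch:
# List the strongest pairwise correlations behind the corrplot (the 0.8 threshold is illustrative)
cor_mat <- cor(num_df, use = "pairwise.complete.obs")
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA   # keep each pair once
top_pairs <- as.data.frame(as.table(cor_mat)) %>%
  filter(!is.na(Freq), abs(Freq) > 0.8) %>%
  arrange(desc(abs(Freq)))
head(top_pairs)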
Description: This section focuses on cleaning and preparing the df data frame for modeling. It starts by filtering out rows where the ph variable (likely the target variable) has a missing value using filter(!is.na(ph)). Then, the brand_code column is explicitly converted to a factor data type using mutate(brand_code = as.factor(brand_code)), which is necessary for many statistical models to treat it as a categorical variable. Next, the code identifies and potentially removes near-zero variance predictors. nearZeroVar(df, saveMetrics = TRUE) from the caret package calculates various statistics for each predictor to assess its variance and whether it is near zero. The saveMetrics = TRUE argument ensures that these metrics are stored. The subsequent line df <- df[, !nzv$nzv] subsets the df data frame, keeping only the columns where the nzv component of the nzv object is FALSE, effectively removing the near-zero variance predictors. The code then separates the df data frame into numerical (df_num) and categorical (df_cat) columns using select(where(is.numeric)) and select(where(is.factor)), respectively. For the numerical data, preprocessing steps are applied using preProcess() from caret. The method = c(“medianImpute”, “center”, “scale”) argument specifies that missing values in the numerical columns should be imputed using the median, and then the columns should be centered (mean subtracted) and scaled (divided by standard deviation). The predict() function is used to apply these preprocessing transformations to the df_num data, resulting in a cleaned numerical data frame df_num_clean. Finally, the cleaned numerical data and the original categorical data are combined back into a single data frame df using bind_cols().
df <- df %>% filter(!is.na(ph))
df <- df %>% mutate(brand_code = as.factor(brand_code))
nzv <- nearZeroVar(df, saveMetrics = TRUE)
df <- df[, !nzv$nzv]
df_num <- df %>% select(where(is.numeric))
df_cat <- df %>% select(where(is.factor))
preproc <- preProcess(df_num, method = c("medianImpute", "center", "scale"))
df_num_clean <- predict(preproc, df_num)
df <- bind_cols(df_num_clean, df_cat)
Output:
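As a quick sanity check (illustrative, not part of the original script), one can confirm that the numeric block is now complete and standardized before moving on:
stopifnot(!anyNA(df_num_clean))            # medianImpute should leave no NAs
round(range(colMeans(df_num_clean)), 3)    # means should be ~0 after centering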
Description: This section visually explores the distribution of the target variable, ph, and its relationship with the categorical predictor brand_code. The first part uses ggplot() from the ggplot2 package to create a histogram of the ph variable. geom_histogram(bins = 30, fill = “skyblue”, color = “black”) generates the histogram with 30 bins, filled with sky blue color and black borders for the bars. The labs() function adds a title (“Distribution of pH”) and labels for the x-axis (“pH”) and y-axis (“Count”). This visualization helps to understand the overall distribution of the ph values, such as its central tendency, spread, and any potential skewness or multimodality. The second part of the section also uses ggplot() to create boxplots of ph for each level of the brand_code factor. geom_boxplot(fill = “lightgreen”) generates the boxplots, with each box representing the distribution of ph for a specific brand code, and the boxes filled with light green color. The labs() function adds a title (“pH by Brand”). These boxplots allow for a comparison of the ph values across different brands, showing the median, quartiles, and potential outliers for each brand. This helps to assess if there is a relationship between the brand of the beverage and its pH level.
ggplot(df, aes(ph)) + geom_histogram(bins = 30, fill = "skyblue", color = "black") +
labs(title = "Distribution of pH", x = "pH", y = "Count")
ggplot(df, aes(x = brand_code, y = ph)) + geom_boxplot(fill = "lightgreen") +
labs(title = "pH by Brand")
Output:
This section performs advanced feature engineering by creating new predictor variables from existing ones and by encoding the categorical variable brand_code into a numerical format suitable for many machine learning models. First, several new numerical features are created using the mutate() function from the dplyr package. These new features are derived by combining existing numerical features in potentially meaningful ways: carb_vol_fill is the product of carb_volume and fill_ounces, carb_press_temp is the product of carb_pressure and carb_temp, mnf_flow_carb_flow is the product of mnf_flow and carb_flow, fill_ounces_sq is the square of fill_ounces, and carb_pressure_sq is the square of carb_pressure. The rationale behind creating these interaction and polynomial terms is to capture potential non-linear relationships and interactions between the original predictors that might have a stronger influence on the target variable (ph). Next, the categorical variable brand_code is converted into a set of numerical dummy variables (also known as one-hot encoding). This is done using the dummyVars() function from the caret package, which creates a model matrix specification for converting brand_code. The predict() function then applies this specification to the df data frame to generate the actual dummy variables. The resulting df_encoded object is a matrix containing the numerical dummy variables for each level of brand_code. Finally, these new dummy variables are added to the original df data frame using cbind(), while the original brand_code column is removed using select(-brand_code). This results in a df data frame where the categorical brand_code has been replaced by a set of numerical columns, one for each of the four brands (dummyVars defaults to fullRank = FALSE, so no reference level is dropped). Note that rows with a missing brand_code receive NA in every dummy column (120 rows, as the later skim output shows), which is why a further imputation pass is needed before modeling.
df <- df %>%
mutate(carb_vol_fill = carb_volume * fill_ounces,
carb_press_temp = carb_pressure * carb_temp,
mnf_flow_carb_flow = mnf_flow * carb_flow,
fill_ounces_sq = fill_ounces^2,
carb_pressure_sq = carb_pressure^2)
dummy_model <- dummyVars(" ~ brand_code", data = df)
df_encoded <- predict(dummy_model, newdata = df)
df <- cbind(df %>% select(-brand_code), df_encoded)
Output:
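Because dummyVars() defaults to fullRank = FALSE, all four brand columns are kept and they sum to one, which is fine for penalized and tree-based models but creates perfect collinearity for a plain lm. Had a reference level been wanted, a hedged alternative (run in place of the dummyVars() call above, while brand_code is still in df) would be:
# Illustrative only: fullRank = TRUE drops one level to avoid the dummy-variable trap
dummy_model_fr <- dummyVars(~ brand_code, data = df, fullRank = TRUE)
df_encoded_fr <- predict(dummy_model_fr, newdata = df)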
Description: This section divides the prepared df data frame into training and testing datasets. The createDataPartition() function from the caret package is used to create stratified random samples for splitting the data. set.seed(123) ensures that the random partitioning is reproducible. The first argument, df$ph, specifies that the partitioning should be based on the distribution of the ph variable, aiming to have similar distributions of the target variable in both the training and testing sets. The p = 0.8 argument indicates that 80% of the data should be allocated to the training set. list = FALSE ensures that the output is a matrix of row indices rather than a list. These indices are then used to subset the df data frame to create train_data (the 80% subset) and test_data (the remaining 20% subset). Finally, the summary() function is applied to the ph variable in both train_data and test_data to provide descriptive statistics (like mean, median, min, max, quartiles) for the target variable in each split. This step is crucial to verify that the splitting process resulted in reasonably similar distributions of the target variable in both datasets, which is important for training a model on a representative subset of the data and evaluating its performance on unseen data.
set.seed(123)
train_index <- createDataPartition(df$ph, p = 0.8, list = FALSE)
train_data <- df[train_index, ]
test_data <- df[-train_index, ]
summary(train_data$ph)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.858470 -0.612398 -0.032743 -0.000192 0.778775 2.285880
summary(test_data$ph)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.278814 -0.728329 -0.032743 0.000769 0.778775 4.720434
Output:
This section trains a standard linear regression model to predict the ph value based on all other predictor variables in the train_data. First, it checks for any remaining missing values in the train_data. Missing values do remain at this point: rows with a missing brand_code carry NA in every dummy column created earlier, so median imputation is applied using preProcess and predict from the caret package. This ensures that the linear regression model, which cannot handle missing values, is trained on a complete dataset. Then, the train() function from caret is used to train the linear regression model. The formula ph ~ . specifies that ph is the target variable and all other columns in train_data should be used as predictors. The method = “lm” argument indicates that a linear regression model should be used. The trControl = trainControl(method = “cv”, number = 10) argument sets up 10-fold cross-validation. This means the training data is divided into 10 parts, and the model is trained and evaluated 10 times, each time using a different part as the validation set and the remaining 9 parts as the training set. This process provides a more robust estimate of the model’s performance on unseen data compared to a single train-test split evaluation. The results of the cross-validation, including performance metrics like R-squared, RMSE, and MAE, are stored in the linear_model object. The skim(df) output that follows documents the fully engineered data frame, confirming the standardized numeric columns and the 120 rows with NA brand dummies.
if(any(is.na(train_data))) {
preproc_again <- preProcess(train_data, method = "medianImpute")
train_data <- predict(preproc_again, train_data)
}
linear_model <- train(ph ~ ., data = train_data, method = "lm",
trControl = trainControl(method = "cv", number = 10))
linear_model$results
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 0.7730769 0.3980851 0.6060421 0.04667772 0.03859452 0.03243217
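For reference (not shown in the original output), the per-fold metrics behind these averages are stored on the train object:
head(linear_model$resample)   # RMSE / Rsquared / MAE for each of the 10 folds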
skim(df)
Name | df |
Number of rows | 2567 |
Number of columns | 40 |
_______________________ | |
Column type frequency: | |
numeric | 40 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
carb_volume | 0 | 1.00 | 0.00 | 1.00 | -3.10 | -0.72 | -0.22 | 0.78 | 3.10 | ▁▆▇▅▁ |
fill_ounces | 0 | 1.00 | 0.00 | 0.99 | -3.91 | -0.63 | -0.02 | 0.59 | 3.95 | ▁▂▇▂▁ |
pc_volume | 0 | 1.00 | 0.00 | 0.99 | -3.27 | -0.63 | -0.10 | 0.55 | 3.31 | ▁▃▇▂▁ |
carb_pressure | 0 | 1.00 | 0.00 | 0.99 | -3.16 | -0.73 | 0.00 | 0.68 | 3.17 | ▁▅▇▃▁ |
carb_temp | 0 | 1.00 | 0.00 | 0.99 | -3.10 | -0.67 | -0.07 | 0.67 | 3.20 | ▁▃▇▃▁ |
psc | 0 | 1.00 | 0.00 | 0.99 | -1.68 | -0.70 | -0.18 | 0.56 | 3.76 | ▆▇▃▁▁ |
psc_fill | 0 | 1.00 | 0.00 | 1.00 | -1.66 | -0.81 | -0.13 | 0.55 | 3.61 | ▆▇▃▁▁ |
psc_co2 | 0 | 1.00 | -0.01 | 0.99 | -1.31 | -0.85 | -0.38 | 0.55 | 4.26 | ▇▅▂▁▁ |
mnf_flow | 0 | 1.00 | 0.00 | 1.00 | -1.04 | -1.04 | 0.38 | 0.97 | 1.71 | ▇▁▁▇▂ |
carb_pressure1 | 0 | 1.00 | 0.00 | 0.99 | -3.59 | -0.76 | 0.13 | 0.60 | 3.73 | ▁▃▇▂▁ |
fill_pressure | 0 | 1.00 | 0.00 | 1.00 | -4.19 | -0.60 | -0.48 | 0.65 | 3.93 | ▁▁▇▂▁ |
hyd_pressure2 | 0 | 1.00 | 0.00 | 1.00 | -1.28 | -1.28 | 0.46 | 0.83 | 2.34 | ▇▂▇▃▁ |
hyd_pressure3 | 0 | 1.00 | 0.00 | 1.00 | -1.36 | -1.28 | 0.45 | 0.80 | 1.85 | ▇▁▃▇▁ |
hyd_pressure4 | 0 | 1.00 | 0.00 | 0.99 | -2.62 | -0.79 | -0.02 | 0.43 | 3.49 | ▂▆▇▂▁ |
filler_level | 0 | 1.00 | 0.00 | 1.00 | -3.40 | -0.67 | 0.58 | 0.68 | 3.31 | ▁▃▅▇▁ |
filler_speed | 0 | 1.00 | 0.01 | 0.99 | -3.50 | 0.26 | 0.38 | 0.40 | 0.44 | ▁▁▁▁▇ |
temperature | 0 | 1.00 | 0.00 | 1.00 | -1.71 | -0.55 | -0.26 | 0.32 | 7.42 | ▇▃▁▁▁ |
usage_cont | 0 | 1.00 | 0.00 | 1.00 | -3.00 | -0.88 | 0.27 | 0.92 | 1.65 | ▁▃▅▃▇ |
carb_flow | 0 | 1.00 | 0.00 | 1.00 | -2.29 | -1.22 | 0.52 | 0.67 | 2.46 | ▂▅▆▇▁ |
density | 0 | 1.00 | 0.00 | 1.00 | -2.48 | -0.73 | -0.52 | 1.18 | 1.98 | ▁▅▇▂▆ |
mfr | 0 | 1.00 | 0.02 | 0.96 | -9.10 | 0.06 | 0.27 | 0.36 | 2.23 | ▁▁▁▂▇ |
balling | 0 | 1.00 | 0.00 | 1.00 | -2.19 | -0.76 | -0.59 | 1.17 | 1.95 | ▁▇▁▁▅ |
pressure_vacuum | 0 | 1.00 | 0.00 | 1.00 | -2.43 | -0.67 | -0.32 | 0.38 | 2.83 | ▂▇▅▃▁ |
ph | 0 | 1.00 | 0.00 | 1.00 | -3.86 | -0.61 | -0.03 | 0.78 | 4.72 | ▁▅▇▂▁ |
oxygen_filler | 0 | 1.00 | 0.00 | 1.00 | -0.98 | -0.54 | -0.29 | 0.27 | 7.84 | ▇▁▁▁▁ |
bowl_setpoint | 0 | 1.00 | 0.00 | 1.00 | -2.57 | -0.61 | 0.70 | 0.70 | 2.01 | ▁▂▃▇▁ |
pressure_setpoint | 0 | 1.00 | 0.00 | 1.00 | -1.77 | -0.79 | -0.79 | 1.17 | 2.15 | ▁▇▁▆▁ |
air_pressurer | 0 | 1.00 | 0.00 | 1.00 | -1.68 | -0.52 | -0.19 | 0.14 | 4.42 | ▅▇▁▁▁ |
alch_rel | 0 | 1.00 | 0.00 | 1.00 | -3.20 | -0.71 | -0.67 | 0.66 | 3.41 | ▁▇▂▃▁ |
carb_rel | 0 | 1.00 | 0.00 | 1.00 | -3.70 | -0.75 | -0.29 | 0.80 | 4.84 | ▁▇▇▂▁ |
balling_lvl | 0 | 1.00 | 0.00 | 1.00 | -2.36 | -0.77 | -0.66 | 1.25 | 1.85 | ▁▇▂▁▆ |
carb_vol_fill | 0 | 1.00 | 0.02 | 1.06 | -11.77 | -0.34 | 0.01 | 0.37 | 8.40 | ▁▁▇▃▁ |
carb_press_temp | 0 | 1.00 | 0.78 | 1.26 | -0.84 | 0.02 | 0.31 | 1.02 | 9.79 | ▇▂▁▁▁ |
mnf_flow_carb_flow | 0 | 1.00 | -0.28 | 0.90 | -1.95 | -0.80 | -0.51 | 0.53 | 2.38 | ▅▇▃▂▁ |
fill_ounces_sq | 0 | 1.00 | 0.98 | 1.69 | 0.00 | 0.06 | 0.35 | 1.18 | 15.57 | ▇▁▁▁▁ |
carb_pressure_sq | 0 | 1.00 | 0.99 | 1.41 | 0.00 | 0.11 | 0.46 | 1.28 | 10.04 | ▇▁▁▁▁ |
brand_code.A | 120 | 0.95 | 0.12 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
brand_code.B | 120 | 0.95 | 0.50 | 0.50 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
brand_code.C | 120 | 0.95 | 0.12 | 0.33 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
brand_code.D | 120 | 0.95 | 0.25 | 0.43 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▃ |
This section trains a Ridge Regression model, which is a linear regression technique that adds an L2 penalty to the loss function to prevent overfitting, especially when there are many correlated predictors. The train() function from caret is used again, with the same formula ph ~ . and training data train_data. The method = “glmnet” specifies that a generalized linear model with elastic net regularization should be used. Ridge regression is a special case of elastic net where the alpha parameter is set to 0. The tuneGrid argument provides a set of hyperparameters to be tested during training. Here, expand.grid(alpha = 0, lambda = seq(0.001, 1, length = 10)) creates a grid where alpha is fixed at 0 (for Ridge regression) and lambda (the regularization parameter) varies across 10 values evenly spaced between 0.001 and 1. The trControl argument is the same as in the linear regression section, setting up 10-fold cross-validation to evaluate the model’s performance for each combination of hyperparameters in the tuneGrid. The results of the cross-validation for each lambda value are stored in the ridge_model object. Finally, plot(ridge_model) generates a plot showing how the model’s performance (RMSE, the default regression metric) changes across the different values of lambda.
ridge_model <- train(ph ~ ., data = train_data, method = "glmnet",
tuneGrid = expand.grid(alpha = 0, lambda = seq(0.001, 1, length = 10)),
trControl = trainControl(method = "cv", number = 10))
ridge_model$results
## alpha lambda RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 0 0.001 0.7797860 0.3898136 0.6168869 0.03748620 0.05633981 0.02899993
## 2 0 0.112 0.7857738 0.3815535 0.6248781 0.03521212 0.05442652 0.02667958
## 3 0 0.223 0.7934747 0.3719849 0.6340438 0.03305038 0.05246266 0.02404994
## 4 0 0.334 0.7998654 0.3646837 0.6405313 0.03176378 0.05096088 0.02225276
## 5 0 0.445 0.8054528 0.3586596 0.6454842 0.03095276 0.04968823 0.02093480
## 6 0 0.556 0.8104605 0.3535036 0.6497483 0.03043646 0.04858805 0.02001792
## 7 0 0.667 0.8150393 0.3489476 0.6536019 0.03010482 0.04760360 0.01937037
## 8 0 0.778 0.8192627 0.3448741 0.6570851 0.02989204 0.04670600 0.01898592
## 9 0 0.889 0.8231802 0.3411974 0.6602888 0.02977686 0.04592916 0.01886669
## 10 0 1.000 0.8268465 0.3378318 0.6632201 0.02971622 0.04520925 0.01878920
plot(ridge_model)
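The tuning curve can also be read off numerically; caret stores the CV-selected hyperparameters in bestTune (an illustrative follow-up, not original output):
ridge_model$bestTune                 # lambda with the lowest cross-validated RMSE
min(ridge_model$results$RMSE)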
This section trains a polynomial regression model. Polynomial regression extends the linear regression model by adding polynomial terms of the predictors to the model. The formula ph ~ .^2 in the train() function specifies that the model should include all main effects of the predictors (represented by .) as well as all two-way interaction terms; note that in R formula syntax ^2 expands to pairwise interactions only, so true quadratic terms would have to be added explicitly with I(x^2) or poly(). This greatly expanded feature set allows the model to capture interactions between the predictors and the target variable ph, at the cost of many additional coefficients. Similar to the previous models, 10-fold cross-validation is used to evaluate the model’s performance on the training data. The method = “lm” argument indicates that the underlying modeling technique is still linear regression, but applied to the expanded set of interaction features. The poly_model object will store the trained model and the results of the cross-validation.
poly_model <- train(ph ~ .^2, data = train_data, method = "lm",
trControl = trainControl(method = "cv", number = 10))
poly_model$results
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 1.278568 0.2322552 0.7677897 0.3020665 0.05767391 0.07960624
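Since .^2 contributes only interactions, genuine quadratic terms would need to be added explicitly. A minimal sketch (an assumption, not the author’s code; the engineered *_sq columns above already play this role for fill_ounces and carb_pressure, so two other predictors are squared here):
# Illustrative: pairwise interactions plus explicit squared terms via I()
poly_model2 <- train(ph ~ .^2 + I(carb_volume^2) + I(carb_temp^2),
                     data = train_data, method = "lm",
                     trControl = trainControl(method = "cv", number = 10))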
This section trains a Support Vector Machine (SVM) model for regression. SVM is a powerful and versatile algorithm that can handle both linear and non-linear relationships. The method = “svmRadial” specifies that an SVM with a radial basis function (RBF) kernel should be used. The RBF kernel allows the model to map the input features into a higher-dimensional space where non-linear relationships might become linear, enabling the fitting of complex functions. The trControl = trainControl(method = “cv”, number = 10) argument again sets up 10-fold cross-validation for evaluating the model’s performance. The tuneLength = 5 argument tells the train() function to try 5 values of the cost parameter C; for svmRadial, caret holds the kernel parameter sigma fixed at an analytic estimate (computed via kernlab’s sigest), which is why sigma is constant across the results below. The svm_model object will store the trained SVM model and the results of the cross-validation across the different hyperparameter settings.
svm_model <- train(ph ~ ., data = train_data, method = "svmRadial",
trControl = trainControl(method = "cv", number = 10),
tuneLength = 5)
svm_model$results
## sigma C RMSE Rsquared MAE RMSESD RsquaredSD
## 1 0.01786384 0.25 0.7323438 0.4707333 0.5526845 0.05374517 0.05751801
## 2 0.01786384 0.50 0.7134170 0.4942992 0.5335184 0.05530162 0.05824701
## 3 0.01786384 1.00 0.6964656 0.5161072 0.5189073 0.05511532 0.05755817
## 4 0.01786384 2.00 0.6799551 0.5372298 0.5080382 0.05475099 0.05854430
## 5 0.01786384 4.00 0.6658459 0.5562757 0.5007060 0.05213279 0.05647661
## MAESD
## 1 0.03640937
## 2 0.03839854
## 3 0.03975891
## 4 0.04023068
## 5 0.03830984
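A wider search over both kernel parameters is possible by passing an explicit grid instead of tuneLength; a hedged sketch (the values are illustrative, and the joint grid is considerably slower):
svm_grid <- expand.grid(sigma = c(0.01, 0.02, 0.05), C = c(1, 2, 4, 8))
svm_model_grid <- train(ph ~ ., data = train_data, method = "svmRadial",
                        trControl = trainControl(method = "cv", number = 10),
                        tuneGrid = svm_grid)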
This section trains a Lasso Regression model. Lasso (Least Absolute Shrinkage and Selection Operator) is another type of regularized linear regression that adds an L1 penalty to the loss function. Unlike Ridge regression (which uses an L2 penalty), the L1 penalty has the effect of not only shrinking the coefficients of less important predictors towards zero but also potentially setting them exactly to zero, thus performing feature selection. The method = “glmnet” is used, and the tuneGrid is set to expand.grid(alpha = 1, lambda = seq(0.001, 1, length = 10)). Here, alpha = 1 specifically selects the Lasso penalty. The lambda parameter, which controls the strength of the L1 penalty, is varied across 10 values between 0.001 and 1. 10-fold cross-validation is used for model evaluation across these lambda values. The lasso_model object will store the trained Lasso model and the cross-validation results for each lambda. Finally, plot(lasso_model) generates a plot showing how the model’s performance changes as the lambda parameter is varied. This plot typically helps in identifying the optimal value of lambda that balances model complexity and predictive accuracy. Note that at lambda values of roughly 0.556 and above every coefficient is shrunk to exactly zero, so the model predicts a constant; R-squared is then undefined, which produces the NaN entries and the warning about missing resampled performance measures below.
lasso_model <- train(ph ~ ., data = train_data, method = "glmnet",
tuneGrid = expand.grid(alpha = 1, lambda = seq(0.001, 1, length = 10)),
trControl = trainControl(method = "cv", number = 10))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
lasso_model$results
## alpha lambda RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 0.001 0.7774768 0.3951011 0.6096906 0.02843212 0.03721477 0.02760436
## 2 1 0.112 0.8432443 0.3159366 0.6726200 0.04024085 0.05531783 0.03568876
## 3 1 0.223 0.9020683 0.2590761 0.7218679 0.04123153 0.05888488 0.03566157
## 4 1 0.334 0.9525190 0.2008888 0.7658706 0.04146564 0.05353586 0.03356292
## 5 1 0.445 0.9945488 0.1397918 0.8037642 0.04111184 0.01497529 0.03180424
## 6 1 0.556 0.9949324 NaN 0.8040998 0.04091062 NA 0.03170912
## 7 1 0.667 0.9949324 NaN 0.8040998 0.04091062 NA 0.03170912
## 8 1 0.778 0.9949324 NaN 0.8040998 0.04091062 NA 0.03170912
## 9 1 0.889 0.9949324 NaN 0.8040998 0.04091062 NA 0.03170912
## 10 1 1.000 0.9949324 NaN 0.8040998 0.04091062 NA 0.03170912
plot(lasso_model)
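Because the L1 penalty zeroes coefficients, it is worth inspecting which predictors survive at the CV-chosen lambda (an illustrative follow-up using coef() on the underlying glmnet fit):
coef(lasso_model$finalModel, s = lasso_model$bestTune$lambda)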
This section trains a Random Forest model, which is an ensemble learning method based on constructing multiple decision trees during training and outputting the mean prediction (for regression) of the individual trees. Random Forests are known for their high predictive accuracy and robustness to overfitting. The method = “rf” specifies the use of the Random Forest algorithm. The tuneGrid = expand.grid(mtry = c(5, 10, 15)) defines the hyperparameter to be tuned: mtry, which is the number of predictors randomly sampled at each split of the trees. The model will be trained and evaluated using 10-fold cross-validation for each of these mtry values (5, 10, and 15). The ntree = 300 argument specifies that 300 trees should be grown in each random forest. The rf_model object will store the trained Random Forest model with the best performing mtry value. Finally, varImpPlot(rf_model$finalModel) generates a variable importance plot, which visually ranks the predictors based on their importance in the final Random Forest model (the one trained on the entire training set with the optimal mtry).
rf_model <- train(ph ~ ., data = train_data, method = "rf",
tuneGrid = expand.grid(mtry = c(5, 10, 15)),
trControl = trainControl(method = "cv", number = 10),
ntree = 300)
rf_model$results
## mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 5 0.5967015 0.6746871 0.4516547 0.03591958 0.01886300 0.02591891
## 2 10 0.5764395 0.6877232 0.4310166 0.03253120 0.01793114 0.02622959
## 3 15 0.5655720 0.6957507 0.4213965 0.02901081 0.02010508 0.02585887
varImpPlot(rf_model$finalModel)
This section trains a Gradient Boosting Machine (GBM) model, which is another powerful ensemble learning technique that builds multiple weak prediction models (typically decision trees) sequentially, with each new model trying to correct the errors made by the previous ones. The method = “gbm” specifies the use of the Gradient Boosting Machine algorithm. The trControl = trainControl(method = “cv”, number = 10) sets up 10-fold cross-validation for model evaluation. The verbose = FALSE argument suppresses the printing of detailed output during the training process. The gbm_model object will store the trained GBM model and the results of the cross-validation; caret’s default grid varies the number of trees (50, 100, 150) and the interaction depth (1, 2, 3) while holding the shrinkage at 0.1 and n.minobsinnode at 10, as the results below show. Finally, summary(gbm_model) provides a textual summary of the trained GBM model, reporting the relative influence of the predictor variables.
gbm_model <- train(ph ~ ., data = train_data, method = "gbm",
trControl = trainControl(method = "cv", number = 10),
verbose = FALSE)
gbm_model$results
## shrinkage interaction.depth n.minobsinnode n.trees RMSE Rsquared
## 1 0.1 1 10 50 0.7866012 0.3927961
## 4 0.1 2 10 50 0.7388255 0.4671643
## 7 0.1 3 10 50 0.7099874 0.5061733
## 2 0.1 1 10 100 0.7611657 0.4258190
## 5 0.1 2 10 100 0.7083733 0.5031650
## 8 0.1 3 10 100 0.6814072 0.5384789
## 3 0.1 1 10 150 0.7455101 0.4476241
## 6 0.1 2 10 150 0.6921097 0.5229075
## 9 0.1 3 10 150 0.6662645 0.5560738
## MAE RMSESD RsquaredSD MAESD
## 1 0.6255692 0.02419101 0.04556844 0.02042862
## 4 0.5857397 0.02727334 0.05370083 0.02086196
## 7 0.5584523 0.03305241 0.05851863 0.02382923
## 2 0.6040470 0.02775426 0.04989844 0.02279434
## 5 0.5570477 0.03389878 0.05867027 0.02398646
## 8 0.5319858 0.03444561 0.05496661 0.02199860
## 3 0.5889089 0.03108715 0.05215706 0.02502431
## 6 0.5407981 0.03917662 0.06305557 0.02997029
## 9 0.5186422 0.03857824 0.05779837 0.02445151
summary(gbm_model)
## var rel.inf
## mnf_flow mnf_flow 26.22803757
## brand_code.C brand_code.C 9.04252809
## usage_cont usage_cont 7.41095912
## temperature temperature 5.77562660
## alch_rel alch_rel 5.40716401
## oxygen_filler oxygen_filler 3.84974474
## bowl_setpoint bowl_setpoint 3.24699914
## pressure_vacuum pressure_vacuum 3.24592002
## balling balling 3.03622993
## carb_pressure1 carb_pressure1 2.99995620
## carb_rel carb_rel 2.66363486
## mnf_flow_carb_flow mnf_flow_carb_flow 2.65761731
## balling_lvl balling_lvl 2.50334279
## density density 2.33538476
## filler_speed filler_speed 2.08638201
## air_pressurer air_pressurer 1.90272940
## carb_flow carb_flow 1.53860571
## brand_code.A brand_code.A 1.27660916
## hyd_pressure2 hyd_pressure2 1.02265005
## carb_volume carb_volume 0.97875122
## carb_vol_fill carb_vol_fill 0.94604646
## mfr mfr 0.90158304
## pc_volume pc_volume 0.87672218
## fill_ounces fill_ounces 0.87572828
## psc psc 0.86744802
## hyd_pressure3 hyd_pressure3 0.84111807
## filler_level filler_level 0.80087262
## pressure_setpoint pressure_setpoint 0.76771341
## psc_fill psc_fill 0.76081219
## carb_pressure carb_pressure 0.70853120
## fill_pressure fill_pressure 0.63498609
## hyd_pressure4 hyd_pressure4 0.40736233
## psc_co2 psc_co2 0.31225877
## carb_press_temp carb_press_temp 0.29184386
## carb_pressure_sq carb_pressure_sq 0.28365640
## carb_temp carb_temp 0.24716425
## fill_ounces_sq fill_ounces_sq 0.17561659
## brand_code.B brand_code.B 0.09166355
## brand_code.D brand_code.D 0.00000000
This section trains an XGBoost model, which stands for Extreme Gradient Boosting. XGBoost is an optimized and highly efficient implementation of the gradient boosting algorithm, known for its speed and performance. The method = “xgbTree” specifies the use of the tree-based XGBoost model. A tuneGrid named xgb_grid is defined to specify the hyperparameter combinations to be tested during training. This grid includes a fixed number of boosting rounds (nrounds = 100), a range of maximum tree depths (max_depth = 3:5), a learning rate (eta = 0.1), regularization parameters (gamma = 0), subsampling ratio of columns (colsample_bytree = 0.8), minimum child weight (min_child_weight = 1), and subsampling ratio of the training instances (subsample = 0.8). The trainControl is set to 5-fold cross-validation (number = 5) for evaluating the model’s performance across these hyperparameter settings. The verbose = FALSE argument suppresses detailed training output. The xgb_model object will store the trained XGBoost model with the best performing hyperparameters found through the cross-validation process.
xgb_grid <- expand.grid(
nrounds = 100, # LIMIT to 100 trees max
max_depth = 3:5,
eta = c(0.1), # learning rate
gamma = 0,
colsample_bytree = 0.8,
min_child_weight = 1,
subsample = 0.8
)
xgb_model <- train(
ph ~ .,
data = train_data,
method = "xgbTree",
trControl = trainControl(method = "cv", number = 5), # reduce folds to 5 for now
tuneGrid = xgb_grid,
verbose = FALSE
)
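No results table is printed for this model in the original; the later comparison reads xgb_model$results directly. To inspect it here (illustrative):
xgb_model$results[order(xgb_model$results$RMSE), ]   # best max_depth first
xgb_model$bestTune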
This section trains a single-layer feedforward neural network model using the nnet package via the train() function. Neural networks are powerful non-linear models that can learn complex relationships in the data. The method = “nnet” specifies the use of the neural network algorithm. The trControl = trainControl(method = “cv”, number = 10) sets up 10-fold cross-validation for evaluating the model. The tuneGrid defines the hyperparameters to be tuned: size (the number of hidden units in the network, ranging from 1 to 5) and decay (the weight decay parameter, with values 0.01 and 0.1, used for regularization). The linout = TRUE argument indicates that the output layer should be linear, which is appropriate for regression tasks. trace = FALSE suppresses the printing of training progress, and maxit = 500 sets the maximum number of iterations for training each network. The nnet_model object will store the trained neural network model with the best performing combination of size and decay found during cross-validation. Finally, varImp(nnet_model) calculates and potentially prints or returns the variable importance scores for the predictors in the trained neural network.
nnet_model <- train(ph ~ ., data = train_data, method = "nnet",
trControl = trainControl(method = "cv", number = 10),
tuneGrid = expand.grid(size = 1:5, decay = c(0.01, 0.1)),
linout = TRUE, trace = FALSE, maxit = 500)
nnet_model$results
## size decay RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 0.01 0.7603836 0.4190783 0.6004884 0.03989665 0.04539775 0.03515183
## 2 1 0.10 0.7618784 0.4171404 0.5993337 0.03635099 0.04555699 0.03384447
## 3 2 0.01 0.7393173 0.4549772 0.5716739 0.03655419 0.05320002 0.02609113
## 4 2 0.10 0.7293959 0.4683184 0.5647277 0.04071222 0.05480748 0.03077519
## 5 3 0.01 0.7313314 0.4698138 0.5720023 0.03584205 0.05694286 0.02829870
## 6 3 0.10 0.7226353 0.4832520 0.5605036 0.04728550 0.06342039 0.03467665
## 7 4 0.01 0.7428421 0.4574285 0.5735596 0.03360771 0.04396771 0.03229127
## 8 4 0.10 0.7130668 0.5003511 0.5440974 0.05332985 0.05947581 0.03808167
## 9 5 0.01 0.7369358 0.4684357 0.5706867 0.04828142 0.06415200 0.03783661
## 10 5 0.10 0.7155254 0.5027768 0.5482771 0.04843817 0.05209672 0.03282014
varImp(nnet_model)
## nnet variable importance
##
## only 20 most important variables shown (out of 39)
##
## Overall
## mnf_flow 100.00
## density 85.96
## balling 58.68
## brand_code.D 54.17
## balling_lvl 48.93
## carb_flow 43.68
## brand_code.B 41.75
## alch_rel 41.00
## hyd_pressure2 40.47
## brand_code.A 39.07
## brand_code.C 37.30
## mnf_flow_carb_flow 35.55
## oxygen_filler 29.45
## bowl_setpoint 28.07
## filler_level 21.84
## hyd_pressure3 19.25
## pressure_setpoint 18.55
## carb_volume 18.45
## temperature 17.28
## pressure_vacuum 16.57
This section trains a Multivariate Adaptive Regression Splines (MARS) model using the earth package via the train() function. MARS is a non-parametric regression technique that builds a model from piecewise linear segments (splines). It can capture non-linear relationships and interactions between variables. The method = “earth” specifies the use of the MARS algorithm. The trControl = trainControl(method = “cv”, number = 10) sets up 10-fold cross-validation for model evaluation. The tuneLength = 10 argument tells the train() function to try 10 values of nprune, the number of retained basis functions (related to the number of knots in the splines); in caret’s default grid the interaction degree is held at 1, as the results below show. The mars_model object will store the trained MARS model with the best performing tuning parameters found during cross-validation.
mars_model <- train(ph ~ ., data = train_data, method = "earth",
trControl = trainControl(method = "cv", number = 10),
tuneLength = 10)
mars_model$results
## degree nprune RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 2 0.8781740 0.2251214 0.6919336 0.05758684 0.03192169 0.03688711
## 2 1 3 0.8307085 0.3077105 0.6598020 0.05823249 0.05494291 0.03869580
## 3 1 5 0.8103530 0.3413276 0.6444139 0.05911968 0.05005255 0.04342685
## 4 1 7 0.7875605 0.3770401 0.6172165 0.05850881 0.05397455 0.03962293
## 5 1 9 0.7764927 0.3945731 0.6071310 0.05779201 0.05175938 0.03904156
## 6 1 10 0.7677263 0.4077604 0.6024201 0.05888260 0.05193588 0.03747759
## 7 1 12 0.7653535 0.4117888 0.5971624 0.05702658 0.05143433 0.03703806
## 8 1 14 0.7553028 0.4275468 0.5867319 0.05581461 0.05443256 0.03467552
## 9 1 16 0.7546861 0.4289504 0.5854903 0.05296640 0.05498614 0.03271273
## 10 1 18 0.7540480 0.4299548 0.5843352 0.05265508 0.05482900 0.03204137
This section trains a Partial Least Squares (PLS) regression model. PLS is a dimensionality reduction technique combined with regression, particularly useful when dealing with a large number of predictors that might be highly collinear. It aims to find a set of components (latent variables) that are linear combinations of the original predictors and that have maximum covariance with the target variable. The method = “pls” specifies the use of the PLS algorithm. The trControl = trainControl(method = “cv”, number = 10) sets up 10-fold cross-validation for model evaluation. The tuneLength = 10 argument tells the train() function to automatically try 10 different numbers of components to include in the PLS model. The pls_model object will store the trained PLS model with the optimal number of components found during cross-validation.
pls_model <- train(ph ~ ., data = train_data, method = "pls",
trControl = trainControl(method = "cv", number = 10),
tuneLength = 10)
pls_model$results
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 0.8825098 0.2152488 0.7041138 0.03894952 0.04174819 0.02525132
## 2 2 0.8388528 0.2907108 0.6608659 0.04258645 0.04480946 0.02476208
## 3 3 0.8187834 0.3243745 0.6468874 0.04491843 0.04430925 0.02995808
## 4 4 0.8047854 0.3476178 0.6343189 0.03848834 0.05175847 0.02854119
## 5 5 0.7929626 0.3663910 0.6257152 0.03761353 0.05650281 0.02855068
## 6 6 0.7861867 0.3774394 0.6205960 0.03601113 0.05628874 0.02638577
## 7 7 0.7830283 0.3824923 0.6185556 0.03590250 0.05784490 0.02799494
## 8 8 0.7794704 0.3878348 0.6157866 0.03749249 0.05835838 0.02840743
## 9 9 0.7786635 0.3892946 0.6136235 0.03823235 0.06047623 0.02965956
## 10 10 0.7783311 0.3896411 0.6127985 0.03971278 0.05988073 0.03086197
This section trains a Cubist model, which is a rule-based regression algorithm. Cubist builds a set of rules, where each rule has a linear model associated with it. Predictions are made by finding the rule that applies to a new instance and then using the linear model within that rule. Cubist is known for producing models that are often accurate and relatively easy to interpret. The method = “cubist” specifies the use of the Cubist algorithm. The trControl = trainControl(method = “cv”, number = 10) sets up 10-fold cross-validation for model evaluation. The cubist_model object will store the trained Cubist model and the results of the cross-validation, which tunes Cubist’s hyperparameters: the number of committees and the number of neighbors used for instance-based adjustment, as the results below show.
cubist_model <- train(ph ~ ., data = train_data, method = "cubist",
trControl = trainControl(method = "cv", number = 10))
cubist_model$results
## committees neighbors RMSE Rsquared MAE RMSESD RsquaredSD
## 1 1 0 0.7102485 0.5256522 0.4975421 0.04364406 0.03752760
## 2 1 5 0.7019808 0.5517790 0.4858000 0.04897795 0.03377260
## 3 1 9 0.6933138 0.5553105 0.4821711 0.04908501 0.03543315
## 4 10 0 0.5878623 0.6561850 0.4372710 0.02505345 0.02925400
## 5 10 5 0.5707070 0.6746163 0.4223708 0.03370850 0.03701112
## 6 10 9 0.5655634 0.6798202 0.4213567 0.02712086 0.02954652
## 7 20 0 0.5806602 0.6664208 0.4346325 0.02805494 0.03167844
## 8 20 5 0.5643438 0.6810963 0.4172180 0.03923392 0.04353696
## 9 20 9 0.5580712 0.6881898 0.4164169 0.03199413 0.03595721
## MAESD
## 1 0.02363708
## 2 0.02576124
## 3 0.02322408
## 4 0.02138978
## 5 0.02601130
## 6 0.01924250
## 7 0.02344180
## 8 0.03018595
## 9 0.02440129
This section compares the performance of several of the trained regression models using the resamples() function from the caret package. First, a preprocessing pipeline (centering, scaling, and k-nearest-neighbor imputation) is fit on train_data with preProcess() and then applied to test_data with predict(), producing test_data_clean; any values still missing afterwards are filled with 0 as a fallback so that prediction cannot fail on incomplete rows. The Support Vector Machine (svm_model) is then used to predict on this cleaned test data, and, if the ph column is present (it is, since test_data is the held-out split), the actual and predicted values are stored in a data frame test_results. The primary focus of the section, however, is the comparison of cross-validation results. The resamples() function takes a list of trained models (Linear Regression, Ridge Regression, Polynomial Regression, and SVM) and collects their cross-validation performance metrics. summary(resamps) then provides a statistical summary of these metrics (minimum, quartiles, mean, maximum) for each model across the ten folds, allowing a direct comparison of estimated out-of-sample performance. dotplot(resamps) visualizes the same comparison, making it easy to see which models tend to perform better and how variable their performance is across folds. Finally, Random Forest predictions on the cleaned test data are generated and exported to a CSV file.
library(caret)
# Step 0: Train preprocessing pipeline on training data
preprocess_model <- preProcess(train_data, method = c("center", "scale", "knnImpute"))
# ---- STEP 1: Preprocess test data using training pipeline ----
# Ensure test_data has been preloaded
test_data_clean <- predict(preprocess_model, newdata = test_data)
test_data_clean[is.na(test_data_clean)] <- 0 # Optional but safe fallback
# ---- STEP 2: SVM predictions on cleaned test data ----
svm_predictions <- predict(svm_model, newdata = test_data_clean)
# ---- STEP 3: Compare Actual vs Predicted if test has labels (only for validation sets) ----
# test_data is the held-out split from df, so ph is present; the guard below
# keeps this step safe for unlabeled data as well
if ("ph" %in% names(test_data_clean)) {
test_results <- data.frame(
Actual = test_data_clean$ph,
Predicted = svm_predictions
)
}
# ---- STEP 4: Cross-validation model comparison ----
resamps <- resamples(list(
LM = linear_model,
Ridge = ridge_model,
Poly = poly_model,
SVM = svm_model
))
summary(resamps)
##
## Call:
## summary.resamples(object = resamps)
##
## Models: LM, Ridge, Poly, SVM
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LM 0.5541460 0.5902700 0.6007180 0.6060421 0.6195031 0.6589668 0
## Ridge 0.5756473 0.5930864 0.6256791 0.6168869 0.6271031 0.6741715 0
## Poly 0.6537365 0.7241301 0.7683594 0.7677897 0.8135614 0.8934389 0
## SVM 0.4380342 0.4801021 0.4878602 0.5007060 0.5323212 0.5655219 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LM 0.7185870 0.7387778 0.7584026 0.7730769 0.8007085 0.8622259 0
## Ridge 0.7387157 0.7545030 0.7764182 0.7797860 0.7883051 0.8633429 0
## Poly 0.9690045 1.0605886 1.1986217 1.2785676 1.4498312 1.9025118 0
## SVM 0.5825130 0.6357883 0.6625921 0.6658459 0.6980400 0.7636657 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LM 0.32793164 0.3770407 0.4047008 0.3980851 0.4233150 0.4505619 0
## Ridge 0.27587143 0.3660323 0.3853042 0.3898136 0.4333979 0.4648637 0
## Poly 0.09869629 0.2295165 0.2420271 0.2322552 0.2589805 0.3130543 0
## SVM 0.43808338 0.5423333 0.5661380 0.5562757 0.5902952 0.6233509 0
dotplot(resamps)
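Beyond the resampling comparison, the held-out split can score the SVM directly. A hedged sketch using caret’s postResample(), assuming test_results was populated in Step 3 (the 20% split does retain ph):
postResample(pred = test_results$Predicted, obs = test_results$Actual)   # RMSE, Rsquared, MAE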
# ---- STEP 5: Final model predictions (e.g., Random Forest) for output ----
rf_predictions <- predict(rf_model, newdata = test_data_clean)
# ---- STEP 6: Export to CSV (single write; the ID column keeps rows traceable) ----
write.csv(data.frame(ID = seq_along(rf_predictions), predicted_ph = rf_predictions),
          "FinalPredictionnnnn.csv", row.names = FALSE)
Output:
EXTENDED MODEL COMPARISON
This section extends the model comparison by calculating and displaying the Root Mean Squared Error (RMSE) and R-squared values obtained from the cross-validation for all the trained models. It first creates a character vector model_names containing the names assigned to each of the trained models. Then, it creates a data frame named results. This data frame has three columns: Model (containing the names of the models), RMSE, and Rsquared. The values for the RMSE and Rsquared columns are extracted from the $results component of each model object. Specifically, it appears to be taking the first row’s RMSE and Rsquared value from the $results data frame of each model. It’s important to note that the $results object in caret often contains multiple rows if the model was tuned over a grid of hyperparameters; taking just the first row might not represent the best performance of a tuned model (it might be the performance at the first tested hyperparameter setting). After creating this results data frame, it is printed to the console using the print() function.
model_names <- c("LM", "Ridge", "Lasso", "Poly", "SVM", "RF", "GBM", "XGB", "NNet", "MARS", "PLS", "Cubist")
results <- data.frame(
Model = model_names,
RMSE = c(
linear_model$results$RMSE[1],
ridge_model$results$RMSE[1],
lasso_model$results$RMSE[1],
poly_model$results$RMSE[1],
svm_model$results$RMSE[1],
rf_model$results$RMSE[1],
gbm_model$results$RMSE[1],
xgb_model$results$RMSE[1],
nnet_model$results$RMSE[1],
mars_model$results$RMSE[1],
pls_model$results$RMSE[1],
cubist_model$results$RMSE[1]
),
Rsquared = c(
linear_model$results$Rsquared[1],
ridge_model$results$Rsquared[1],
lasso_model$results$Rsquared[1],
poly_model$results$Rsquared[1],
svm_model$results$Rsquared[1],
rf_model$results$Rsquared[1],
gbm_model$results$Rsquared[1],
xgb_model$results$Rsquared[1],
nnet_model$results$Rsquared[1],
mars_model$results$Rsquared[1],
pls_model$results$Rsquared[1],
cubist_model$results$Rsquared[1]
)
)
print(results)
## Model RMSE Rsquared
## 1 LM 0.7730769 0.3980851
## 2 Ridge 0.7797860 0.3898136
## 3 Lasso 0.7774768 0.3951011
## 4 Poly 1.2785676 0.2322552
## 5 SVM 0.7323438 0.4707333
## 6 RF 0.5967015 0.6746871
## 7 GBM 0.7866012 0.3927961
## 8 XGB 0.6408614 0.5937333
## 9 NNet 0.7603836 0.4190783
## 10 MARS 0.8781740 0.2251214
## 11 PLS 0.8825098 0.2152488
## 12 Cubist 0.7102485 0.5256522
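As the description notes, indexing $results$RMSE[1] reports the first grid row, not necessarily the best one for tuned models. A sketch (not in the original) using caret’s getTrainPerf(), which returns each model’s metrics at its CV-selected tuning parameters:
model_list <- list(LM = linear_model, Ridge = ridge_model, Lasso = lasso_model,
                   Poly = poly_model, SVM = svm_model, RF = rf_model,
                   GBM = gbm_model, XGB = xgb_model, NNet = nnet_model,
                   MARS = mars_model, PLS = pls_model, Cubist = cubist_model)
best_results <- purrr::map_dfr(model_list, getTrainPerf, .id = "Model")
best_results[order(best_results$TrainRMSE), ]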
This section focuses on making predictions on the test dataset using the Support Vector Machine (svm_model), which is chosen as the final model for prediction. First, it identifies and keeps only the rows in the test_data that have complete data for all the predictor variables, applying complete.cases() to every column except ph. This ensures that the final model receives input data without any missing values, which could cause prediction errors. The cleaned test data is stored in test_data_clean. Then, the predict() function is used to generate predictions on this cleaned test data using the final_model (which is set to svm_model). The resulting predictions are stored in a variable named predictions. A new data frame test_results is created, containing two columns: Actual (the true ph values from test_data_clean) and Predicted (the ph values predicted by the SVM model). Finally, a scatter plot is generated using ggplot() to visualize the relationship between the actual and predicted ph values. geom_point(color = “blue”) creates the scatter plot with blue points. geom_abline(slope = 1, intercept = 0, linetype = “dashed”, color = “red”) adds a dashed red line with a slope of 1 and intercept of 0, representing perfect predictions (where actual equals predicted). labs() adds a title (“Predicted vs Actual pH Values”) and labels for the x-axis (“Actual pH”) and y-axis (“Predicted pH”). This plot helps to visually assess the performance of the final model on the test data; points closer to the red line indicate better prediction accuracy.
# Identify rows with complete predictor data (every column except ph)
predictor_cols <- setdiff(names(test_data), "ph")
valid_index <- complete.cases(test_data[, predictor_cols])
test_data_clean <- test_data[valid_index, ]
# Make predictions using the final model
final_model <- svm_model
predictions <- predict(final_model, newdata = test_data_clean)
# Create results dataframe
test_results <- data.frame(Actual = test_data_clean$ph, Predicted = predictions)
# Plot Predicted vs Actual
ggplot(test_results, aes(x = Actual, y = Predicted)) +
geom_point(color = "blue") +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
labs(title = "Predicted vs Actual pH Values", x = "Actual pH", y = "Predicted pH")
This section prepares an external dataset (eval) for prediction using the final trained model (svm_model) and then saves the predictions. It involves several steps to ensure compatibility between the external dataset and the model, which was trained on train_data. First, it extracts the names of the dummy variables created for brand_code during the training phase. It then saves record IDs from the eval dataset before any modifications (falling back to row numbers, since the file has no record_id column). The brand_code in eval is then converted to a factor with levels matching those used during training. To handle potential issues with factor levels not present in the external data (which can cause errors during dummy variable creation), it includes a step to add a dummy row if brand_code in eval has fewer than two observed levels. This dummy row uses a missing level from the training data. Feature engineering is then applied to eval, creating the same interaction and polynomial terms as were created in the training data. One-hot encoding is performed on brand_code in eval using the dummyVars model fit on the training data, so the dummy column names match the training columns exactly. If a dummy row was added earlier, it is now removed. The engineered features and the brand dummy variables are combined into eval_final, excluding the original brand_code. It then checks for any dummy columns that were present in the training data but are missing in eval_final and adds them, filling with 0s. The columns in eval_final are reordered to match the order of the predictor columns in the training data. Finally, predictions are made on the prepared eval_final using the final_model. These predictions are combined with the original record IDs into a data frame eval_results, which is then saved to a CSV file named “FinalPredictionnnnn.csv”.
# Step 0: Extract dummy variable names from training data
brand_dummy_names <- colnames(df_encoded)
# Save record IDs before any data modifications; the evaluation file has no
# record_id column, so fall back to row numbers
record_ids <- if ("record_id" %in% names(eval)) eval$record_id else seq_len(nrow(eval))
# Step 1: Force brand_code levels to match training ("brand_code.A" -> "A";
# fixed = TRUE strips the separator dot as well, so levels align with the data)
eval$brand_code <- factor(eval$brand_code,
                          levels = gsub("brand_code.", "", brand_dummy_names, fixed = TRUE))
# Step 2: Add dummy row if brand_code has fewer than 2 observed levels
observed_levels <- unique(eval$brand_code[!is.na(eval$brand_code)])
missing_levels <- setdiff(levels(eval$brand_code), observed_levels)
dummy_added <- FALSE
if (length(observed_levels) < 2 && length(missing_levels) >= 1) {
dummy_row <- eval[1, ]
dummy_row$brand_code <- missing_levels[1]
eval <- bind_rows(eval, dummy_row)
# Add dummy record_id for the dummy row (NA or any placeholder)
record_ids <- c(record_ids, NA)
dummy_added <- TRUE
}
# Step 3: Feature engineering
eval <- eval %>%
mutate(
carb_vol_fill = carb_volume * fill_ounces,
carb_press_temp = carb_pressure * carb_temp,
mnf_flow_carb_flow = mnf_flow * carb_flow,
fill_ounces_sq = fill_ounces^2,
carb_pressure_sq = carb_pressure^2
)
# Step 4: One-hot encode brand_code with the dummyVars model fit on the training
# data so the column names ("brand_code.A", ...) match the training columns;
# predict.dummyVars defaults to na.action = na.pass, so rows with a missing
# brand_code are kept (as all-NA dummies) and row order is preserved
brand_dummies <- predict(dummy_model, newdata = eval) %>% as.data.frame()
# Step 5: Remove dummy row if added
if (dummy_added) {
eval <- eval[1:(nrow(eval) - 1), ]
brand_dummies <- brand_dummies[1:(nrow(brand_dummies) - 1), ]
record_ids <- record_ids[1:(length(record_ids) - 1)] # remove dummy record_id too
}
# Step 6: Combine data
eval_final <- eval %>%
select(-brand_code) %>%
bind_cols(brand_dummies)
# Step 6.5: Add any missing dummy columns that exist in training but not in eval
required_cols <- setdiff(colnames(train_data), "ph")
missing_cols <- setdiff(required_cols, colnames(eval_final))
for (col in missing_cols) {
  eval_final[[col]] <- 0
}
# Rows with a missing brand_code carry NA dummies; zero them so predict() runs
eval_final[brand_dummy_names][is.na(eval_final[brand_dummy_names])] <- 0
# Step 6.6: Reorder columns to match training data
eval_final <- eval_final %>% select(all_of(required_cols))
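One caveat the script leaves implicit: the SVM was trained on median-imputed, centered, and scaled numeric columns, while eval_final is still on the raw scale. A hedged sketch of reusing the training-time preproc object (assuming it is still in scope; a placeholder ph column is added because preproc was fit with ph present, and the engineered interaction/square columns would still need rebuilding from scaled inputs to match training exactly):
eval_scaled <- eval_final
eval_scaled$ph <- NA_real_                    # preproc was fit with ph included
eval_scaled <- predict(preproc, eval_scaled)  # medianImpute + center + scale
eval_scaled$ph <- NULL
# eval_preds <- predict(final_model, newdata = eval_scaled)  # scaled variant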
# Step 7: Predict
eval_preds <- predict(final_model, newdata = eval_final)
# Step 8: Save predictions with aligned record_ids
eval_results <- data.frame(ID = record_ids, Predicted_pH = eval_preds)
write.csv(eval_results, "FinalPredictionnnnn.csv", row.names = FALSE)
EXPORT TO EXCEL ADDITIONAL
This section uses the openxlsx package to export the eval_results data frame, which contains the record IDs and their corresponding predicted pH values for the external dataset, to an Excel file. The write.xlsx() function is used for this purpose. The first argument is the data frame to be exported (eval_results). The file argument specifies the name of the Excel file to be created (“Evaluation_Result.xlsx”). The sheetName argument sets the name of the sheet within the Excel file to “SVM”. The rowNames = FALSE argument ensures that the row names of the eval_results data frame are not included in the exported Excel file. This provides a way to share the prediction results in a commonly used format.
library(openxlsx)
## Warning: package 'openxlsx' was built under R version 4.4.3
write.xlsx(eval_results, file = "Evaluation_Result.xlsx", sheetName = "SVM", rowNames = FALSE)
OPTIONAL: VARIABLE IMPORTANCE
This optional section calculates and displays the variable importance for three of the models: Random Forest (rf_model), XGBoost (xgb_model), and the neural network (nnet_model). The varImp() function from the caret package is a generic function that can extract variable importance scores from various trained models. The specific method of calculating importance depends on the type of model. For tree-based models like Random Forest and XGBoost, importance is often based on how much each variable contributes to reducing impurity (e.g., variance in regression) across all the trees. For the neural network, the calculation is more complex and is derived from combinations of the network’s connection weights. The output of varImp() typically provides a ranking of the predictors based on their estimated importance in the model. This information can be valuable for understanding which variables are most influential in predicting the target variable.
varImp(rf_model)
## rf variable importance
##
## only 20 most important variables shown (out of 39)
##
## Overall
## mnf_flow 100.000
## usage_cont 47.766
## brand_code.C 37.025
## oxygen_filler 29.568
## temperature 26.987
## bowl_setpoint 26.636
## alch_rel 25.306
## filler_level 25.206
## carb_rel 24.678
## pressure_vacuum 23.270
## balling_lvl 21.209
## carb_pressure1 16.957
## balling 16.275
## mnf_flow_carb_flow 15.994
## air_pressurer 15.676
## filler_speed 14.366
## carb_flow 12.217
## hyd_pressure3 11.924
## density 11.539
## mfr 9.499
varImp(xgb_model)
## xgbTree variable importance
##
## only 20 most important variables shown (out of 39)
##
## Overall
## mnf_flow 100.000
## usage_cont 62.084
## brand_code.C 38.226
## oxygen_filler 30.216
## temperature 28.704
## pressure_vacuum 27.437
## alch_rel 23.994
## carb_rel 23.911
## balling_lvl 20.560
## bowl_setpoint 19.393
## air_pressurer 18.964
## carb_pressure1 17.342
## filler_speed 14.865
## balling 14.705
## mnf_flow_carb_flow 13.329
## density 13.255
## hyd_pressure3 12.707
## pc_volume 10.215
## carb_flow 9.028
## filler_level 8.815
varImp(nnet_model)
## nnet variable importance
##
## only 20 most important variables shown (out of 39)
##
## Overall
## mnf_flow 100.00
## density 85.96
## balling 58.68
## brand_code.D 54.17
## balling_lvl 48.93
## carb_flow 43.68
## brand_code.B 41.75
## alch_rel 41.00
## hyd_pressure2 40.47
## brand_code.A 39.07
## brand_code.C 37.30
## mnf_flow_carb_flow 35.55
## oxygen_filler 29.45
## bowl_setpoint 28.07
## filler_level 21.84
## hyd_pressure3 19.25
## pressure_setpoint 18.55
## carb_volume 18.45
## temperature 17.28
## pressure_vacuum 16.57
This project demonstrated a systematic approach to predicting beverage pH using machine learning. After extensive data cleaning, feature engineering, and experimentation with multiple regression models, the SVM model was selected as the final model for prediction, while the tree-based ensembles (Random Forest and Cubist) achieved the lowest cross-validated RMSE. Future work could explore additional ensemble techniques and more domain-driven feature engineering to improve accuracy.