This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.
Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.
We start by loading relevant libraries for data manipulation, visualization, imputation, and modeling.
# Load required libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(mice)
##
## Attaching package: 'mice'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(kableExtra)
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(corrplot)
## corrplot 0.95 loaded
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(nnet)
library(Cubist)
library(openxlsx)
library(ggpubr)
library(viridis)
## Loading required package: viridisLite
library(hrbrthemes)
library(e1071)
library(DT)
library(kernlab)
##
## Attaching package: 'kernlab'
##
## The following object is masked from 'package:mice':
##
## convergence
##
## The following object is masked from 'package:purrr':
##
## cross
##
## The following object is masked from 'package:ggplot2':
##
## alpha
library(earth)
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
Load datasets and substitute any empty values with NA values to facilitate the imputation of missing data in the future.
df_StudentData <- read.csv('https://raw.githubusercontent.com/uplotnik/DATA-624/refs/heads/main/StudentData.csv', na.strings = c("", "NA"))
df_EvalData <- read.csv('https://raw.githubusercontent.com/uplotnik/DATA-624/refs/heads/main/StudentEvaluation.csv', na.strings = c("", "NA"))
#Check first rows of beverage data
DT::datatable(
df_StudentData[1:10,],
options = list(scrollX = TRUE,
deferRender = TRUE,
dom = 'lBfrtip',
fixedColumns = TRUE,
info = FALSE,
paging=FALSE,
searching = FALSE),
rownames = FALSE,
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
'Table 1: First 10 Rows of Beverage Data'
))
DT::datatable(
df_EvalData[1:10,],
options = list(scrollX = TRUE,
deferRender = TRUE,
dom = 'lBfrtip',
fixedColumns = TRUE,
info = FALSE,
paging=FALSE,
searching = FALSE),
rownames = FALSE,
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
'Table 2: First 10 Rows of Evaluation Data'
))
# Finding data dimensions.
dims <- data.frame("Train" = dim(df_StudentData),
"Eval" = dim(df_EvalData))
rownames(dims) <- c("Observations","Predictors")
dims
## Train Eval
## Observations 2571 267
## Predictors 33 33
The training set contains 2,571 observations and 33 columns: 32 predictors plus the response PH, as shown in the table above. The evaluation set contains 267 observations, also with 33 columns, including PH.
glimpse(df_StudentData)
## Rows: 2,571
## Columns: 33
## $ Brand.Code <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", "B…
## $ Carb.Volume <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, 5.…
## $ Fill.Ounces <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, 23…
## $ PC.Volume <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.111333…
## $ Carb.Pressure <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64.2…
## $ Carb.Temp <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 141…
## $ PSC <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.154,…
## $ PSC.Fill <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.12…
## $ PSC.CO2 <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.14…
## $ Mnf.Flow <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100…
## $ Carb.Pressure1 <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 124…
## $ Fill.Pressure <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46.0…
## $ Hyd.Pressure1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure2 <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure3 <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure4 <int> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, 86…
## $ Filler.Level <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 118…
## $ Filler.Speed <int> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014, …
## $ Temperature <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65.4…
## $ Usage.cont <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 18.…
## $ Carb.Flow <int> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902, 3…
## $ Density <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.90…
## $ MFR <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, 74…
## $ Balling <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1.2…
## $ Pressure.Vacuum <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4.4…
## $ PH <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.38…
## $ Oxygen.Filler <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0.0…
## $ Bowl.Setpoint <int> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12…
## $ Pressure.Setpoint <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46.0…
## $ Air.Pressurer <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 146…
## $ Alch.Rel <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.52…
## $ Carb.Rel <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.34…
## $ Balling.Lvl <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.44…
# Exploratory Data Analysis
Visualizations like histograms and boxplots will be used to explore the distributions of numeric variables.
# Histograms for numeric variables
df_StudentData %>%
select_if(is.numeric) %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
ggplot(aes(x = value)) +
geom_histogram(aes(y = after_stat(density)), bins = 15, fill = "skyblue", alpha = 0.7, color = "black") +
geom_density(color = "red", linewidth = 1) +
facet_wrap(~variable, scales = "free") +
labs(title = "Histograms and Density Plots of Numeric Variables", x = "Value", y = "Density") +
theme_minimal()
## Warning: Removed 724 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 724 rows containing non-finite outside the scale range
## (`stat_density()`).
The training data shows several distribution patterns:
Relatively normal distributions: Carb.Pressure, Carb.Temp, Fill.Ounces, PC.Volume, PH (response variable)
Left-skewed distributions: Carb.Flow, Filler.Speed, Mnf.Flow, MFR, Bowl.Setpoint, Filler.Level, Hyd.Pressure2, Hyd.Pressure3, Usage.cont, Carb.Pressure1
Right-skewed distributions: Pressure.Setpoint, Fill.Pressure, Hyd.Pressure1, Temperature, Carb.Volume, PSC, PSC.CO2, PSC.Fill, Balling, Density, Hyd.Pressure4, Air.Pressurer, Alch.Rel, Carb.Rel, Oxygen.Filler, Balling.Lvl, Pressure.Vacuum
# Boxplots for numeric variables
df_StudentData %>%
select_if(is.numeric) %>%
gather(key = "variable", value = "value") %>%
ggplot(aes(x = variable, y = value)) +
geom_boxplot(fill = "orange", alpha = 0.7) +
coord_flip() +
ggtitle("Boxplots of Numerical Predictors")
## Warning: Removed 724 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
# Data Preparation
Convert Brand.Code to factor
We convert the categorical variable Brand.Code to a factor in both data sets and plot the proportion of observations in each brand.
# Convert Brand.Code to factor
df_StudentData$Brand.Code <- as.factor(df_StudentData$Brand.Code)
df_EvalData$Brand.Code <- as.factor(df_EvalData$Brand.Code)
# Distribution of Brand.Code
df_StudentData %>%
count(Brand.Code) %>%
mutate(prop = n / sum(n)) %>%
ggplot(aes(x = Brand.Code, y = prop, fill = Brand.Code)) +
geom_col(show.legend = FALSE) +
geom_text(aes(label = scales::percent(prop, accuracy = 0.1)), vjust = -0.5) +
labs(title = "Proportion Distribution of Brand.Code", y = "Proportion", x = "Brand.Code") +
scale_y_continuous(labels = scales::percent_format()) +
theme_minimal()
The next phase tackles missing data and data quality issues. To address missing data, the Multiple Imputation by Chained Equations (MICE) method will be applied, filling in absent entries with statistically plausible values while maintaining the dataset’s integrity. After imputation, variables with near-zero variance will be removed to reduce noise.
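To make the idea behind predictive mean matching concrete, here is a minimal base-R sketch for a single numeric variable. It is a simplification, not what `mice` does internally: the helper `pmm_impute`, the toy data, and the choice of `k = 5` donors are assumptions made for illustration, and the sketch omits the parameter-uncertainty draw that `mice` adds. The key property is that every imputed value is borrowed from an observed case, so imputations always stay within the range of real measurements.

```r
# Simplified predictive mean matching (PMM) for one numeric variable.
# Assumption: pmm_impute() is a hypothetical helper, not part of mice.
pmm_impute <- function(y, x, k = 5) {
  obs <- !is.na(y)
  fit <- lm(y ~ x, subset = obs)                      # fit on observed rows only
  pred <- predict(fit, newdata = data.frame(x = x))   # predicted mean for every row
  y_imp <- y
  for (i in which(!obs)) {
    # Observed cases whose predicted means are closest to this row's prediction
    dist <- abs(pred[obs] - pred[i])
    donors <- y[obs][order(dist)[seq_len(min(k, sum(obs)))]]
    # Borrow the observed value of one randomly chosen donor
    y_imp[i] <- donors[sample.int(length(donors), 1)]
  }
  y_imp
}

# Toy data: a pH-like response with 20 values removed at random
set.seed(100)
x <- rnorm(200)
y <- 8.5 + 0.1 * x + rnorm(200, sd = 0.05)
y_missing <- y
y_missing[sample.int(200, 20)] <- NA

y_filled <- pmm_impute(y_missing, x)
sum(is.na(y_filled))                                   # number of gaps left
all(y_filled %in% y_missing[!is.na(y_missing)])        # imputed values are observed values
```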
# Impute missing values using mice with predictive mean matching (pmm)
df_StudentData_imp <- mice(df_StudentData, m = 1, method = 'pmm', printFlag = FALSE) %>% complete()
# Check if any missing values left
df_StudentData_imp %>%
summarise_all(~sum(is.na(.))) %>%
gather(variable, missing) %>%
filter(missing != 0) %>%
kable() %>%
kable_styling()
variable | missing |
---|---|
# Remove near-zero variance variables
nzv_vars <- nearZeroVar(df_StudentData_imp)
if(length(nzv_vars) > 0) {
df_StudentData_imp <- df_StudentData_imp[, -nzv_vars]
}
Calculate skewness for numeric variables
Now we will identify highly skewed features and apply Box-Cox transformation to improve normality and model performance.
# Identify numeric columns (this includes the target PH; factors are excluded)
numeric_vars <- df_StudentData_imp %>%
select_if(is.numeric) %>%
colnames()
# Calculate skewness for numeric variables
skew_vals <- sapply(df_StudentData_imp[, numeric_vars], skewness)
skew_vals
## Carb.Volume Fill.Ounces PC.Volume Carb.Pressure
## 0.390377114 -0.050540954 0.343582272 0.207113870
## Carb.Temp PSC PSC.Fill PSC.CO2
## 0.289945873 0.849791181 0.938787385 1.722826884
## Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure2
## 0.004147084 0.062603884 0.540495167 -0.301745830
## Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed
## -0.317105849 0.560999094 -0.846203532 -2.540090321
## Temperature Usage.cont Carb.Flow Density
## 2.419156401 -0.534935470 -0.986408441 0.524602364
## MFR Balling Pressure.Vacuum PH
## -2.817531037 0.594504160 0.525660793 -0.290411448
## Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer
## 2.699352751 -0.974431771 0.201056748 2.252105286
## Alch.Rel Carb.Rel Balling.Lvl
## 0.887713045 0.506410479 0.586215866
# Threshold for high skewness
skew_threshold <- 1
# Variables to transform (highly skewed)
vars_to_transform <- names(skew_vals[abs(skew_vals) > skew_threshold])
print("Highly skewed variables:")
## [1] "Highly skewed variables:"
print(vars_to_transform)
## [1] "PSC.CO2" "Filler.Speed" "Temperature" "MFR"
## [5] "Oxygen.Filler" "Air.Pressurer"
# Apply Box-Cox transformation to variables with high skewness
for (var in vars_to_transform) {
# Extract variable vector
x <- df_StudentData_imp[[var]]
# Shift data if zero or negatives exist (Box-Cox requires positive values)
min_x <- min(x, na.rm = TRUE)
shift <- 0
if (min_x <= 0) {
shift <- abs(min_x) + 1
x <- x + shift
message(paste("Shifted", var, "by", shift, "to make positive for Box-Cox"))
}
# Estimate Box-Cox transformation
bc_obj <- BoxCoxTrans(x)
# Transform the data using estimated lambda
x_bc <- predict(bc_obj, x)
# Replace the original variable with the transformed values (computed on the shifted data when a shift was applied)
df_StudentData_imp[[var]] <- x_bc
}
## Shifted PSC.CO2 by 1 to make positive for Box-Cox
skew_vals_after_bc <- sapply(df_StudentData_imp[, vars_to_transform], skewness)
print("Skewness after Box-Cox transformation:")
## [1] "Skewness after Box-Cox transformation:"
print(skew_vals_after_bc)
## PSC.CO2 Filler.Speed Temperature MFR Oxygen.Filler
## 1.2811065 -2.3613223 1.8910674 -2.3710667 -0.1128059
## Air.Pressurer
## 2.1883574
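The Box-Cox step above can be sketched in base R to show what the transformation does and why it helps some variables more than others. The helper names (`box_cox`, `skew`, `estimate_lambda`), the lambda grid, and the log-normal toy data are assumptions for illustration; `caret::BoxCoxTrans` has its own estimation details. Box-Cox is a power transform for strictly positive data and mainly corrects right skew, which may be part of why strongly left-skewed variables such as Filler.Speed and MFR remain skewed in the output above.

```r
# Box-Cox transform for strictly positive x; lambda = 0 is the log transform.
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Moment-based sample skewness (no small-sample correction).
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Pick lambda on a grid by maximising the Box-Cox profile log-likelihood.
estimate_lambda <- function(x, grid = seq(-2, 2, by = 0.1)) {
  n <- length(x)
  loglik <- sapply(grid, function(l) {
    z <- box_cox(x, l)
    -n / 2 * log(mean((z - mean(z))^2)) + (l - 1) * sum(log(x))
  })
  grid[which.max(loglik)]
}

set.seed(100)
train <- rlnorm(500)   # right-skewed toy "training" data
new   <- rlnorm(50)    # toy "evaluation" data

lambda <- estimate_lambda(train)
skew(train)                      # skewness before the transform
skew(box_cox(train, lambda))     # skewness after the transform

# Reuse the lambda estimated on the training data for new data
new_bc <- box_cox(new, lambda)
```

Storing the estimated lambda (and any shift) is what lets the same transformation be applied later to data the model has not seen.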
Cap outliers at 1st and 99th percentiles in all numeric variables
In this step, outliers in all numeric variables (including PH) are capped at the 1st and 99th percentiles to mitigate their impact.
for (var in numeric_vars) {
lower_bound <- quantile(df_StudentData_imp[[var]], 0.01, na.rm = TRUE)
upper_bound <- quantile(df_StudentData_imp[[var]], 0.99, na.rm = TRUE)
df_StudentData_imp[[var]] <- ifelse(df_StudentData_imp[[var]] < lower_bound, lower_bound, df_StudentData_imp[[var]])
df_StudentData_imp[[var]] <- ifelse(df_StudentData_imp[[var]] > upper_bound, upper_bound, df_StudentData_imp[[var]])
}
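The capping loop above can also be written as a pair of small functions that separate computing the bounds from applying them. This is a sketch under assumptions: `cap_bounds` and `apply_cap` are hypothetical helper names and the data are simulated. Keeping the bounds as a stored object makes it possible to apply the limits learned from one data set to another.

```r
# Percentile bounds for capping; names = FALSE returns a plain numeric vector.
cap_bounds <- function(x, probs = c(0.01, 0.99)) {
  quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
}

# Clamp values to [lower, upper]; pmin/pmax are vectorised and keep NA as NA.
apply_cap <- function(x, bounds) pmin(pmax(x, bounds[1]), bounds[2])

set.seed(100)
train <- c(rnorm(500), 15, -12)   # toy data with two extreme values
new   <- c(-20, 0, 20)            # toy new observations

b <- cap_bounds(train)
train_capped <- apply_cap(train, b)
new_capped   <- apply_cap(new, b)  # same bounds reused on new data

range(train_capped)
new_capped
```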
# Prepare the data for correlation analysis
cor_data <- df_StudentData_imp %>%
select_if(is.numeric)
# Compute correlations between 'PH' and all other predictors
corr_values <- cor_data %>%
summarise(across(.cols = everything(),
.fns = ~ cor(., cor_data$PH, use = "complete.obs"),
.names = "cor_{col}")) %>%
pivot_longer(cols = everything(), names_to = "Predictor", values_to = "Correlation") %>%
mutate(Predictor = gsub("cor_", "", Predictor)) %>%
filter(Predictor != "PH") %>%
arrange(desc(abs(Correlation)))
print(corr_values)
## # A tibble: 30 × 2
## Predictor Correlation
## <chr> <dbl>
## 1 Mnf.Flow -0.448
## 2 Bowl.Setpoint 0.354
## 3 Filler.Level 0.332
## 4 Usage.cont -0.322
## 5 Pressure.Setpoint -0.315
## 6 Hyd.Pressure3 -0.240
## 7 Pressure.Vacuum 0.219
## 8 Fill.Pressure -0.216
## 9 Hyd.Pressure2 -0.205
## 10 Oxygen.Filler 0.204
## # ℹ 20 more rows
We examined how PH relates to each predictor to get a first idea of what might affect it. The correlation table ranks the predictors by the absolute strength of their linear relationship with PH. Mnf.Flow has the strongest correlation, a negative one of -0.45, while Bowl.Setpoint and Filler.Level show positive correlations of 0.35 and 0.33 respectively. These correlations are only moderate, so they suggest, rather than establish, that these variables influence PH.
Finally, the data is split into training and testing sets with stratification maintained for the target variable PH. Numeric predictors except the target are scaled using centering and scaling methods to prepare them for machine learning algorithms sensitive to feature scaling.
set.seed(100)
index <- createDataPartition(df_StudentData_imp$PH, p = 0.8, list = FALSE)
train_data <- df_StudentData_imp[index, ]
test_data <- df_StudentData_imp[-index, ]
# Separate predictors and target
train_x <- train_data %>% select(-PH)
train_y <- train_data$PH
test_x <- test_data %>% select(-PH)
test_y <- test_data$PH
# Distribution of target variable in train/test sets to check stratification
train_y_df <- data.frame(PH = train_y)
test_y_df <- data.frame(PH = test_y)
p1 <- ggplot(train_y_df, aes(x=PH)) +
geom_histogram(bins = 20, fill = "steelblue", alpha=0.7) +
labs(title = "Distribution of Target Variable PH in Training Set", x = "PH", y = "Count") +
theme_minimal()
p2 <- ggplot(test_y_df, aes(x=PH)) +
geom_histogram(bins = 20, fill = "tomato", alpha=0.7) +
labs(title = "Distribution of Target Variable PH in Test Set", x = "PH", y = "Count") +
theme_minimal()
ggarrange(p1, p2, ncol = 2, nrow = 1)
# Scale numeric predictors for algorithms that are sensitive to feature scale (e.g., regularized linear regression, neural networks, k-NN, SVMs); tree-based methods such as gradient boosting are largely unaffected
scale_vars <- numeric_vars[numeric_vars != "PH"]
preProcValues <- preProcess(train_x[, scale_vars], method = c("center", "scale"))
train_x_scaled <- train_x
test_x_scaled <- test_x
train_x_scaled[, scale_vars] <- predict(preProcValues, train_x[, scale_vars])
test_x_scaled[, scale_vars] <- predict(preProcValues, test_x[, scale_vars])
The data sets are ready for model training and further analysis, ensuring that the challenges of missing data, skewness, outliers, and feature scaling have been addressed.
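The scaling step relies on one rule: the centering and scaling constants are estimated on the training set only and then reused on the test set. A minimal base-R sketch of that rule, using simulated data and made-up column names `a` and `b`:

```r
set.seed(100)
train <- data.frame(a = rnorm(80, 50, 10), b = runif(80, 0, 5))
test  <- data.frame(a = rnorm(20, 50, 10), b = runif(20, 0, 5))

# Constants come from the training data only
center <- colMeans(train)
spread <- apply(train, 2, sd)

# The same constants are applied to both sets
train_scaled <- scale(train, center = center, scale = spread)
test_scaled  <- scale(test,  center = center, scale = spread)

colMeans(train_scaled)       # approximately 0 by construction
apply(train_scaled, 2, sd)   # 1 by construction
colMeans(test_scaled)        # close to, but not exactly, 0
```

The test-set means are not exactly zero, which is expected: the test set is treated as unseen data and contributes nothing to the constants.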
# Cross-validation settings for the training control parameter
ctrl = trainControl(method = "cv", number = 10)
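What `trainControl(method = "cv", number = 10)` asks caret to do can be sketched by hand: split the rows into 10 folds, fit on nine, score on the held-out one, and average. The sketch below uses simulated data and a hypothetical helper `cv_rmse`; it also draws the fold assignment once and reuses it for both models, so the two are scored on identical splits. In caret, a shared split can be requested through the `index` argument of `trainControl`, for example with folds from `createFolds(..., returnTrain = TRUE)`. The fold-size summaries printed for each model further down differ, which suggests the folds there were drawn separately for each `train()` call.

```r
set.seed(100)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 8.5 + 0.2 * toy$x1 - 0.1 * toy$x2 + rnorm(n, sd = 0.1)

# Assign every row to exactly one of k folds, once, for all models
k <- 10
fold_id <- sample(rep(seq_len(k), length.out = n))

# Per-fold RMSE of a linear model; assumes the response column is named "y"
cv_rmse <- function(formula, data, fold_id) {
  sapply(sort(unique(fold_id)), function(f) {
    fit <- lm(formula, data = data[fold_id != f, ])
    held_out <- data[fold_id == f, ]
    sqrt(mean((held_out$y - predict(fit, newdata = held_out))^2))
  })
}

rmse_full  <- cv_rmse(y ~ x1 + x2, toy, fold_id)
rmse_small <- cv_rmse(y ~ x1, toy, fold_id)

c(full = mean(rmse_full), small = mean(rmse_small))
```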
Linear Regression Models
# Partial Least Squares Model
pls_model = train(
x = train_x_scaled,
y = train_y,
method = "pls", # partial least squares
preProcess = c("center", "scale"), # center and scale the predictors before fitting
tuneLength = 10, # try 1 to 10 components
# each component is a latent variable: a combination of predictors chosen to explain PH
trControl = ctrl, # 10-fold cross-validation
metric = "Rsquared" # choose the number of components with the highest R-squared
)
print(pls_model)
## Partial Least Squares
##
## 2058 samples
## 31 predictor
##
## Pre-processing: centered (30), scaled (30), ignore (1)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1853, 1852, 1853, 1852, 1852, 1853, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.1498634 0.2254310 0.1201949
## 2 0.1412350 0.3106655 0.1112673
## 3 0.1383217 0.3401537 0.1094628
## 4 0.1362765 0.3592078 0.1077145
## 5 0.1342514 0.3787206 0.1054413
## 6 0.1327641 0.3927892 0.1047454
## 7 0.1320949 0.3990078 0.1040261
## 8 0.1315967 0.4031792 0.1037089
## 9 0.1313985 0.4049133 0.1033991
## 10 0.1312831 0.4060984 0.1032898
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 10.
plot(pls_model)
pls_preds = predict(pls_model, newdata = test_x_scaled)
# postResample compares the predictions with the observed response values
# and reports how well the model predicts the target on the test set
postResample(pls_preds,test_y )
## RMSE Rsquared MAE
## 0.1293843 0.4228368 0.1021428
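The three numbers reported by `postResample` can be reproduced by hand. The sketch below assumes, following caret's documentation, that its R-squared is the squared correlation between predictions and observations; the helper `regression_metrics` and the four toy values are made up for illustration.

```r
regression_metrics <- function(pred, obs) {
  c(RMSE     = sqrt(mean((pred - obs)^2)),   # root mean squared error
    Rsquared = cor(pred, obs)^2,             # squared correlation
    MAE      = mean(abs(pred - obs)))        # mean absolute error
}

obs  <- c(8.2, 8.4, 8.6, 8.8)
pred <- c(8.3, 8.3, 8.7, 8.7)   # every prediction is off by 0.1

m <- regression_metrics(pred, obs)
m
```

Because every error here has the same magnitude, RMSE and MAE coincide at 0.1; on real data RMSE is at least as large as MAE and grows faster when a few errors are large.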
# Linear Regression Model
linear_model = train(
x = train_x_scaled,
y = train_y,
method = "lm", # linear regression
preProcess = c("center", "scale"),
trControl = ctrl
)
linear_preds = predict(linear_model, newdata = test_x_scaled)
# postResample compares the predictions with the observed response values
# and reports how well the model predicts the target on the test set
postResample(linear_preds,test_y )
## RMSE Rsquared MAE
## 0.1279695 0.4352398 0.1012329
ridge_model = train(
x = train_x_scaled,
y = train_y,
method = "glmnet",
preProcess = c("center", "scale"),
# lambda controls how strongly large coefficients are penalized
# the tuning grid tries 10 lambda values between 0.0001 and 1; alpha = 0 selects the ridge penalty
tuneGrid = expand.grid(alpha = 0, lambda = seq(0.0001, 1, length = 10)),
trControl = ctrl
)
## Warning in storage.mode(xd) <- "double": NAs introduced by coercion
## Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
ridge_preds = predict(ridge_model, newdata = test_x_scaled)
## Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
# postResample compares the predictions with the observed response values
# and reports how well the model predicts the target on the test set
postResample(ridge_preds,test_y )
## RMSE Rsquared MAE
## 0.1375180 0.3478321 0.1073900
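The `NAs introduced by coercion` warnings raised during the ridge fit most likely come from the Brand.Code factor: glmnet works on a numeric matrix, and turning a data frame that contains a factor into numbers converts the brand letters to `NA`. This is an interpretation of the warnings, not something the output confirms. One common remedy, sketched here on a made-up six-row data frame, is to expand the factor into indicator columns with `model.matrix()` before fitting.

```r
toy <- data.frame(
  Brand.Code  = factor(c("A", "B", "C", "B", "A", "D")),
  Carb.Volume = c(5.34, 5.43, 5.29, 5.44, 5.49, 5.38)
)

# Direct coercion: as.matrix() gives a character matrix, and converting it to
# numeric turns the brand letters into NA (warnings suppressed for the demo).
direct <- suppressWarnings(apply(as.matrix(toy), 2, as.numeric))
anyNA(direct)

# model.matrix() expands the factor into 0/1 indicator columns instead.
x_mat <- model.matrix(~ ., data = toy)[, -1]   # drop the intercept column
colnames(x_mat)
anyNA(x_mat)
```

The first factor level (here "A") becomes the reference category, so it has no column of its own.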
models = list(
"PLS" = pls_model,
"Linear Regression" = linear_model,
"Ridge Regression" = ridge_model
)
results = resamples(models)
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: PLS, Linear Regression, Ridge Regression
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## PLS 0.09596233 0.09964873 0.1041974 0.1032898 0.1066283 0.1117705
## Linear Regression 0.09383398 0.09880470 0.1010602 0.1025219 0.1030801 0.1150961
## Ridge Regression 0.09980285 0.10326123 0.1050143 0.1075617 0.1136776 0.1155916
## NA's
## PLS 0
## Linear Regression 0
## Ridge Regression 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## PLS 0.1195806 0.1281439 0.1313040 0.1312831 0.1352353 0.1418655
## Linear Regression 0.1174589 0.1273310 0.1304426 0.1309830 0.1332729 0.1438194
## Ridge Regression 0.1283873 0.1336802 0.1351041 0.1371915 0.1412437 0.1504781
## NA's
## PLS 0
## Linear Regression 0
## Ridge Regression 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## PLS 0.2955436 0.3587672 0.4186561 0.4060984 0.4588952 0.4987505
## Linear Regression 0.3299976 0.3744299 0.4274312 0.4075440 0.4366480 0.4599645
## Ridge Regression 0.2679103 0.3190085 0.3547621 0.3506702 0.3839959 0.4268088
## NA's
## PLS 0
## Linear Regression 0
## Ridge Regression 0
print(models)
## $PLS
## Partial Least Squares
##
## 2058 samples
## 31 predictor
##
## Pre-processing: centered (30), scaled (30), ignore (1)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1853, 1852, 1853, 1852, 1852, 1853, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.1498634 0.2254310 0.1201949
## 2 0.1412350 0.3106655 0.1112673
## 3 0.1383217 0.3401537 0.1094628
## 4 0.1362765 0.3592078 0.1077145
## 5 0.1342514 0.3787206 0.1054413
## 6 0.1327641 0.3927892 0.1047454
## 7 0.1320949 0.3990078 0.1040261
## 8 0.1315967 0.4031792 0.1037089
## 9 0.1313985 0.4049133 0.1033991
## 10 0.1312831 0.4060984 0.1032898
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 10.
##
## $`Linear Regression`
## Linear Regression
##
## 2058 samples
## 31 predictor
##
## Pre-processing: centered (30), scaled (30), ignore (1)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1852, 1852, 1853, 1853, 1852, 1851, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.130983 0.407544 0.1025219
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
##
## $`Ridge Regression`
## glmnet
##
## 2058 samples
## 31 predictor
##
## Pre-processing: centered (30), scaled (30), ignore (1)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1852, 1853, 1852, 1851, 1853, 1852, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0001 0.1371915 0.3506702 0.1075617
## 0.1112 0.1434858 0.3031049 0.1145599
## 0.2223 0.1467973 0.2805796 0.1176064
## 0.3334 0.1489864 0.2675742 0.1195684
## 0.4445 0.1506437 0.2590528 0.1210098
## 0.5556 0.1519905 0.2530191 0.1221470
## 0.6667 0.1531232 0.2485587 0.1231216
## 0.7778 0.1540961 0.2451550 0.1239672
## 0.8889 0.1549577 0.2423970 0.1247113
## 1.0000 0.1557170 0.2401874 0.1253688
##
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 1e-04.
bwplot(results)
Linear regression is the best-performing of the three models, with partial least squares a close second; ridge regression trails both. Lower RMSE and MAE and an R-squared closer to 1 indicate a better model. All three models were validated with 10-fold cross-validation so that they are compared on the same basis. Linear regression has the lowest mean cross-validated RMSE (0.1310) and MAE (0.1025) and the highest mean R-squared (0.408). Partial least squares follows closely (0.1313, 0.1033, 0.406), and ridge regression is the least accurate (0.1372, 0.1076, 0.351). The test-set results show the same ordering. Linear models are a reasonable baseline for predicting PH because they are less prone than flexible nonlinear models to overfitting, that is, to memorizing patterns specific to the training data.