The purpose of our second project is to work as a team to apply concepts from the 2nd half of our Predictive Analytics course to a beverage data set. More specifically, to explore the data, determine whether we might use a linear regression, non-linear regression or tree-based model, to then build, compare, select our optimal model, and support why we made the selection that we did.
This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.
# read in data (as CSVs)
<- read.csv("https://raw.githubusercontent.com/Magnus-PS/DATA-624/main/624P2TrainData.csv")
train <- read.csv("https://raw.githubusercontent.com/Magnus-PS/DATA-624/main/624P2EvalData.csv")
test
# rename i..Brand.Code
names(train)[names(train) == "ï..Brand.Code"] <- "Brand.Code"
names(test)[names(test) == "ï..Brand.Code"] <- "Brand.Code"
# label datasets
$Dataset <- 'train'
train$Dataset <- 'test'
test
# merge datasets
<- rbind(train, test) final_df
We will explore, analyze and model a data set containing approximately 2,600 training and 300 test records representing various brands of a beverage. The variables appear to be related to beverage chemical and manufacturing properties as well as the brand of beverage created.
The response variable is PH
which represents the pH level of each beverage. We’ll generate multiple models and then select the best performing model for our training data prior to casting predictions for our test data’s PH
variable (which is currently all NAs).
We’ll follow the general, high-level approach that is common in Data Science:
Prior to casting our predictions.
The purpose of exploring our data is to enhance the precision of the questions we’re asking while building a firm understanding of the data. The aim is to familiarize ourselves with the data’s structure, the status of missing values, outliers, predictive strength, and correlation to then take the actions necessary to prepare our data for model building.
We get to know the structure, value ranges, proportion of missing values, distribution of values and correlation with the target variable. After this point, we should have enough insight to prepare our data and build our model(s).
We utilize the built-in glimpse
method to gain insight into dimensions, variable characteristics, and value ranges for our training data:
glimpse(train)
## Rows: 2,571
## Columns: 34
## $ Brand.Code <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", "B~
## $ Carb.Volume <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, 5.~
## $ Fill.Ounces <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, 23~
## $ PC.Volume <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.111333~
## $ Carb.Pressure <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64.2~
## $ Carb.Temp <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 141~
## $ PSC <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.154,~
## $ PSC.Fill <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.12~
## $ PSC.CO2 <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.14~
## $ Mnf.Flow <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100~
## $ Carb.Pressure1 <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 124~
## $ Fill.Pressure <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46.0~
## $ Hyd.Pressure1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Hyd.Pressure2 <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Hyd.Pressure3 <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Hyd.Pressure4 <int> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, 86~
## $ Filler.Level <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 118~
## $ Filler.Speed <int> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014, ~
## $ Temperature <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65.4~
## $ Usage.cont <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 18.~
## $ Carb.Flow <int> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902, 3~
## $ Density <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.90~
## $ MFR <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, 74~
## $ Balling <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1.2~
## $ Pressure.Vacuum <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4.4~
## $ PH <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.38~
## $ Oxygen.Filler <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0.0~
## $ Bowl.Setpoint <int> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12~
## $ Pressure.Setpoint <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46.0~
## $ Air.Pressurer <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 146~
## $ Alch.Rel <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.52~
## $ Carb.Rel <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.34~
## $ Balling.Lvl <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.44~
## $ Dataset <chr> "train", "train", "train", "train", "train", "train"~
From above, we gather that:
pH
of type double,Brand Code
categorical variable,Hyd.Pressure1
with all 0s),Filler.Speed
v. Oxygen.Filler
), andLet’s get a high level look at our distributions:
summary(train)
## Brand.Code Carb.Volume Fill.Ounces PC.Volume
## Length:2571 Min. :5.040 Min. :23.63 Min. :0.07933
## Class :character 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23917
## Mode :character Median :5.347 Median :23.97 Median :0.27133
## Mean :5.370 Mean :23.97 Mean :0.27712
## 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200
## Max. :5.700 Max. :24.32 Max. :0.47800
## NA's :10 NA's :38 NA's :39
## Carb.Pressure Carb.Temp PSC PSC.Fill
## Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000
## 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000
## Median :68.20 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.19 Mean :141.1 Mean :0.08457 Mean :0.1954
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200
## NA's :27 NA's :26 NA's :33 NA's :23
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## Min. :0.00000 Min. :-100.20 Min. :105.6 Min. :34.60
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00
## Median :0.04000 Median : 65.20 Median :123.2 Median :46.40
## Mean :0.05641 Mean : 24.57 Mean :122.6 Mean :47.92
## 3rd Qu.:0.08000 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00
## Max. :0.24000 Max. : 229.40 Max. :140.2 Max. :60.40
## NA's :39 NA's :2 NA's :32 NA's :22
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4
## Min. :-0.80 Min. : 0.00 Min. :-1.20 Min. : 52.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00
## Median :11.40 Median :28.60 Median :27.60 Median : 96.00
## Mean :12.44 Mean :20.96 Mean :20.46 Mean : 96.29
## 3rd Qu.:20.20 3rd Qu.:34.60 3rd Qu.:33.40 3rd Qu.:102.00
## Max. :58.00 Max. :59.40 Max. :50.00 Max. :142.00
## NA's :11 NA's :15 NA's :15 NA's :30
## Filler.Level Filler.Speed Temperature Usage.cont Carb.Flow
## Min. : 55.8 Min. : 998 Min. :63.60 Min. :12.08 Min. : 26
## 1st Qu.: 98.3 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1144
## Median :118.4 Median :3982 Median :65.60 Median :21.79 Median :3028
## Mean :109.3 Mean :3687 Mean :65.97 Mean :20.99 Mean :2468
## 3rd Qu.:120.0 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75 3rd Qu.:3186
## Max. :161.2 Max. :4030 Max. :76.20 Max. :25.90 Max. :5104
## NA's :20 NA's :57 NA's :14 NA's :5 NA's :2
## Density MFR Balling Pressure.Vacuum
## Min. :0.240 Min. : 31.4 Min. :-0.170 Min. :-6.600
## 1st Qu.:0.900 1st Qu.:706.3 1st Qu.: 1.496 1st Qu.:-5.600
## Median :0.980 Median :724.0 Median : 1.648 Median :-5.400
## Mean :1.174 Mean :704.0 Mean : 2.198 Mean :-5.216
## 3rd Qu.:1.620 3rd Qu.:731.0 3rd Qu.: 3.292 3rd Qu.:-5.000
## Max. :1.920 Max. :868.6 Max. : 4.012 Max. :-3.600
## NA's :1 NA's :212 NA's :1
## PH Oxygen.Filler Bowl.Setpoint Pressure.Setpoint
## Min. :7.880 Min. :0.00240 Min. : 70.0 Min. :44.00
## 1st Qu.:8.440 1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00
## Median :8.540 Median :0.03340 Median :120.0 Median :46.00
## Mean :8.546 Mean :0.04684 Mean :109.3 Mean :47.62
## 3rd Qu.:8.680 3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :9.360 Max. :0.40000 Max. :140.0 Max. :52.00
## NA's :4 NA's :12 NA's :2 NA's :12
## Air.Pressurer Alch.Rel Carb.Rel Balling.Lvl
## Min. :140.8 Min. :5.280 Min. :4.960 Min. :0.00
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.38
## Median :142.6 Median :6.560 Median :5.400 Median :1.48
## Mean :142.8 Mean :6.897 Mean :5.437 Mean :2.05
## 3rd Qu.:143.0 3rd Qu.:7.240 3rd Qu.:5.540 3rd Qu.:3.14
## Max. :148.2 Max. :8.620 Max. :6.060 Max. :3.66
## NA's :9 NA's :10 NA's :1
## Dataset
## Length:2571
## Class :character
## Mode :character
##
##
##
##
We note presence of negative values (ie. Pressure.Vacuum
), difference of scales and distribution, and confirm the presence of NA values (including 212 values in MFR
) across numerous variables.
Next, we investigate missing values in greater depth:
::aggr(final_df %>% filter(Dataset == 'train'), col=c('green','red'), numbers=T, sortVars=T,
VIMcex.axis = .7,
ylab=c("Proportion of Data", "Combinations and Percentiles"))
From the proportion chart above on the left we can see that:
MFR
has ~8% missing values (the most of any variable),Filler.Speed
),PSC.Fill
), andBrand.Code
) have no missing data.When we shift our attention to the Combinations and Percentiles chart (on the right), it appears that while there may be instances of related patterns between variables (ie. MFR
and Fill.Pressure
), the data is primarily missing at random.
We’re primarily dealing with numeric variables but in order to properly utilize our categorical variable, we convert it to type factor. We re-label empty strings as “Unlabelled” and then convert our Brand.Code
variable to be of type factor:
#Relabel missing brand codes to be "Unlabelled"
$Brand.Code[final_df$Brand.Code == ""] <- NA
final_df$Brand.Code <- final_df$Brand.Code %>% tidyr::replace_na('Unlabelled')
final_df
#convert brand code to factor
<- final_df %>% dplyr::mutate(Brand.Code = factor(Brand.Code, levels = c('Unlabelled', 'A','B','C','D'), ordered = TRUE)) final_df
Earlier we’d noted the vast difference in ranges between numeric variables. To explore this point further and gain greater insight as to the distribution of each variable, we utilize inspectdf’s inspect_num()
function:
#Variable distributions
::inspect_num(final_df %>% filter(Dataset == 'train')) %>%
inspectdfshow_plot()
We observe a range of distributions:
Carb.Pressure
, Carb.Temp
, Carb,Volume
, Fill.Ounces
, PC.Volume
, PH
, Pressure.Vacuum
, PSC
, PSC.Fill
, and Temperature
have relatively normal spreads where just a few of the variables are skewed and would require centering / normalization,Alch.Rel
, Bowl.Setpoint
, and Pressure.Setpoint
appear to resemble distinct numerical variables with distributions centered about a few values, andFrom above, there may be a place for introducing dummy / flag variables (ie. Alch.Rel
= 0.6) and that centering and scaling will prove essential prior to model-building. Fortunately, R has built in functions to center and scale automatically and en masse.
Next, we create a contingency table for our categorical variable to better understand the relationship between Brand.Code
and PH
:
table(final_df$Brand.Code, final_df$PH)
##
## 7.88 7.9 7.98 8 8.02 8.04 8.06 8.08 8.1 8.12 8.14 8.16 8.18 8.2
## Unlabelled 0 0 0 0 0 0 0 0 0 1 0 0 1 2
## A 0 0 0 0 0 0 1 0 1 2 1 4 3 6
## B 0 0 0 1 1 3 4 3 0 3 1 0 5 1
## C 1 1 1 1 2 0 1 3 4 5 7 7 4 3
## D 0 0 0 0 0 0 0 0 0 0 0 1 0 0
##
## 8.22 8.24 8.26 8.28 8.3 8.32 8.34 8.36 8.38 8.4 8.42 8.44 8.46
## Unlabelled 4 4 8 5 5 2 2 0 0 1 2 1 7
## A 6 8 6 4 6 3 4 5 3 10 13 10 24
## B 6 7 12 13 13 11 13 37 42 54 47 47 56
## C 3 4 11 9 19 6 8 12 12 16 15 14 16
## D 0 1 0 2 1 3 2 12 10 28 17 17 25
##
## 8.48 8.5 8.52 8.54 8.56 8.58 8.6 8.62 8.64 8.66 8.68 8.7 8.72 8.74
## Unlabelled 7 8 10 9 6 7 5 1 4 2 1 1 2 1
## A 13 11 6 18 13 16 29 17 10 6 7 3 3 4
## B 48 53 47 74 38 51 45 27 37 24 37 68 58 55
## C 13 18 10 13 17 9 13 6 2 1 2 4 3 1
## D 20 36 20 24 26 26 35 34 30 37 30 38 34 20
##
## 8.76 8.78 8.8 8.82 8.84 8.86 8.88 8.9 8.92 8.94 9.36
## Unlabelled 2 3 0 0 2 2 2 0 0 0 0
## A 6 4 1 2 2 2 0 0 0 0 0
## B 44 33 38 27 18 15 12 2 2 2 0
## C 0 4 2 0 0 0 0 0 0 0 1
## D 15 18 18 13 9 5 2 1 5 0 0
From above we might extend that:
With a basic understanding of the distribution of each of our features, we turn to the boxplot of each numeric distribution in order to visualize the magnitude of outliers:
# Utilize customized box plot generation function
<- final_df %>% filter(Dataset == 'train')
df_train boxplot_depend_vs_independ(df_train, df_train$PH)
What we gather from the output is:
PH
) has a range of ~8.4-8.7 with outlier values > 9.0 and ~ 8.0,Mnf.Flow
, Hyd.Pressure2
, Hyd.Pressure3
Usage.cont
, Carb.Flow
, Density
, Balling
, Bowl.Setpoint
, Pressure.Setpoint
, and Balling.Lvl
have no outliers,Carb.Volume
, Carb.Pressure
, PSC.CO2
, Hyd.Pressure1
, Filler.Level
, Pressure.Vacuum
, and Alch.Rel
have minimal outliers, andAs such, either outlier-handling will be essential in properly predicting our continuous, numeric response variable (PH
) or we’re dealing with a non-linear relationship. Based on the high presence of “outliers” we feel that this is more-likely-than-not the case.
Having reviewed the structure, distributions, and presence of outliers for our variables, we turn our attention to exploring the relationship these variables have with one another via correlation matrix. We consider only variables with a correlation significant > 0.1 in our plot:
#Utilize custom-built correlation matrix generation function
plot_corr_matrix(final_df %>% filter(Dataset == 'train'), -1)
It does not appear that multicollinearity is a concern. From this we might extend that feature exclusion based on multicollinearity will not be a concern.
As a final step in the exploration of our data, we shoot to understand which variables have the strongest v. weakest relationship with PH
level. This information may prove useful later during feature exclusion / selection:
# Perform Boruta search
<- Boruta(PH ~ ., data=na.omit(train), doTrace=0, maxRuns = 1000)
boruta_output
# Get significant variables including tentatives
<- getSelectedAttributes(boruta_output, withTentative = TRUE)
boruta_signif
#print(boruta_signif)
# Plot variable importance
plot(boruta_output, cex.axis=.7, las=2, xlab="", main="Variable Importance")
Our Boruta output indicates that 4 of our features may be insignificant (PSC.CO2
, PSC
, PSC.Fill
and Carb.Temp
) while the remainder appear to be significant. With Mnf.Flow
carrying the greatest variable importance (by a narrow margin).
The training dataset has 33 variables and 2571 observations. The remediation actions became apparent over the course of our exploratory data analysis:
PH
. While on the surface we might consider dropping these variables (ie. PSC.CO2
), first we should see whether their importance increases once we consider other features during our regression / model-building.PH
).The recommendations above provide a “starting line” for our data preparation. Being that, the majority of our features appear to have a relatively strong relationship with PH
, we’d anticipate being able to pull much signal from the variables at hand and developing a strong model.
With insights from exploratory data analysis, we set out to prepare our data for modeling. Our aim is to “optimize” our data for modeling. We’ll impute missing values, normalize our data, handle outliers via BoxCox transformation and then proceed to model building.
We utilize the preProcess
function from the caret package to center, scale, impute, and apply a Box-Cox transformation. This will optimize the data set for the different linear and nonlinear models we’re going to build. It is important to note that Random Forest does not need the data to be normalized.
By scaling our data, it can smooth any nonlinear relationships in the model. If there are nonlinear relationships, by transforming the data, these non-linearities are not going to be reflected in the predictions.
It is important to normalize data for linear / non-linear regression as well as neural network models. Being that we’re going to consider a few different linear and nonlinear regression models in addition to a random forest model, we elected to created two data sets: one normalized and one not.
For normalization, we drop the Brand.Code
, Dataset
, and PH
columns from consideration and utilize the preProcess()
function to center, scale, impute via kNN imputation, and handle outliers via BoxCox transformation. Additionally, we ensure that we proceed with only complete PH values and then train-test split our training data so that we can verify its performance before later casting a prediction on the test set.
set.seed(150)
#center, scale, and impute
<- preProcess(df_train2, method = c("knnImpute", "center", "scale", "BoxCox"))
Processed
#fill missing pH values and ensure we don't proceed with missing PH values for training set
<- predict(Processed, df_train2)
complete_ph $PH = df_train$PH
complete_ph<- complete_ph %>% na.omit() complete_ph
In this section we’re going to consider linear, non-linear, and tree-based models to select first the strongest model within each sub-group and then the over-all top performing model.
Our EDA led us to believe that we’re dealing with non-linear data where the heavy presence of outliers may create problems for our linear and non-linear regression models. We anticipate the linear model will have the poorest performance because of the non-linear nature of our variables and the heavy presence of outliers while the non-linear models may struggle a bit because of the heavy skew that our outliers will have created within our models. We felt that the magnitude of outliers in addition to the apparent non-linearity of the data would mean that linear regression models would not perform well.
From this, we believe a tree-based model might be best but will consider linear and non-linear regression models for sake of consistency and in order to properly rule them out.
With our data prepared, we consider a number of linear regression models: multi-linear regression, AIC optimized, and partial least squares. We utilize the same train()
function for all three models, feeding the same datasets for X and Y, and specifying the proper model-building technique via the “method” variable:
<- train(X.train.cs, Y.train.cs,
linear_model method='lm',
trControl=trainControl(method = "cv"))
<- train(X.train.cs, Y.train.cs,
pls_model method='pls',
metric='Rsquared',
tuneLength=10,
trControl=trainControl(method = "cv"))
With all three multi-linear regression models built, we proceed to casting our predictions and then verifying performance against the test portion of our train-test split data. Recall: it’s all been derived from our training set.
set.seed(150)
<- predict(linear_model, newdata = X.test.cs)
lmPred <- postResample(pred=lmPred, obs = Y.test.cs)
lmResample
<-predict(aic_model, newdata=X.test.cs)
aicPred <- postResample(pred=aicPred, obs=Y.test.cs)
aicResample
<-predict(pls_model, newdata=X.test.cs)
plsPred <- postResample(pred=plsPred, obs = Y.test.cs) plsReSample
The purpose of this section is to verify model performance and identify the strongest performing model in our multi linear regression subset. Based on what we observed during EDA (exploratory data analysis) we did not anticipate the linear regression models to perform well:
<- rbind(
display "Linear Regression" = lmResample,
"Stepwise AIC" = aicResample,
"Partial Least Squares" = plsReSample
)
%>% kable() %>% kable_paper() display
RMSE | Rsquared | MAE | |
---|---|---|---|
Linear Regression | 0.1438062 | 0.3493612 | 0.1103118 |
Stepwise AIC | 0.1439355 | 0.3482182 | 0.1100548 |
Partial Least Squares | 0.1437415 | 0.3499468 | 0.1102135 |
Using Rsquared and MAE as the principal criteria for selecting the best model, the Linear Model produces the best metrics. However, the Rsquared value is very low and only ~35% of the variation in the output variable can be explained by the input variables. This is not good at all. All of the linear models produced very similar results, suggesting that the relationship between our target variable and the predictors is non-linear.
As a final measure we investigate the top ten influential predictors:
We see that Mnf.Flow
, Temperature
, Carb.Pressure1
, Usage.cont
, Hyd.Pressure2
and Filler.Level
contribute the most to the pH levels according to the linear regression model. The linear model places heavy emphasis on the Mnf.Flow
and equal importance to Temperature
and Carb.Pressure1
.
As a next natural step, we proceed to exploring the efficacy of non-linear models. We consider Multi-Adaptive Regression Spline (MARS), Support Vector Machine (SVM), and K-Nearest Neighbors (KNN) models and anticipate that our non-linear regression models will outperform their linear counterparts for the simple fact that it appears we’re dealing with non-linear variables and non-linear relationships between variables.
We utilize the train()
function for all three models, feeding the same datasets for X and Y, tuning the grid (where applicable, ie. MARS) and specifying the proper model-building technique via “method” variable:
<- expand.grid(.degree = 1:2, .nprune = 2:38)
marsGrid set.seed(100)
<- train(X.train.cs, Y.train.cs,
marsM method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
<- train(X.train.cs, Y.train.cs,
supModel method = "svmRadial",
tuneLength = 14,
trControl = trainControl(method = "cv"))
<- train(x = X.train.cs,
knnModel y = Y.train.cs,
method = "knn",
tuneLength = 10)
With all three non-linear regression models built, we proceed to casting our predictions and then verifying performance against the test portion of our train-test split data. Recall: it’s all been derived from our training set.
set.seed(150)
<- predict(marsM, newdata = X.test.cs)
marsPred <- postResample(pred=marsPred, obs = Y.test.cs)
marsResample
<-predict(supModel, newdata=X.test.cs)
svmPred <- postResample(pred=svmPred, obs=Y.test.cs)
svmResample
<-predict(knnModel, newdata=X.test.cs)
knnPred <- postResample(pred=knnPred, obs=Y.test.cs) knnResample
The purpose of this section is to verify model performance and identify the strongest performing model in our non linear regression subset. Based on what we observed during EDA (exploratory data analysis) we anticipate our nonlinear models to perform better than our linear regression models although they may still have trouble with the fact that we were dealing with a high magnitude of outliers:
<- rbind(
display "MARS" = marsResample,
"Support Vector Machine" = svmResample,
"KNN" = knnResample)
%>% kable() %>% kable_paper() display
RMSE | Rsquared | MAE | |
---|---|---|---|
MARS | 0.1366654 | 0.4136367 | 0.1036735 |
Support Vector Machine | 0.1280496 | 0.4854961 | 0.0921707 |
KNN | 0.1362174 | 0.4216192 | 0.1019014 |
Using Rsquared and MAE as the principal criteria for selecting the best model, the Support Vector Machine (SVM) performed the best. However, the R-squared value is still below 0.50 which means we’re explaining less than 50% of the variation in the output variable with our input variables. This leaves a bit to be desired and so we look to the next section (tree based models) with optimism …
As a final measure we investigate the top ten influential predictors:
The non-linear model places high importance of Mnf.Flow
just as the linear model, but there is a difference in the other variables. Usage.cont
and Filler.Level
are secondary and tertiary top contributors.
The tree-based models will use non-normalized data in which outliers have not been handled because it’s within their capacity to handle both. As such, we repeat many of the pre-processing steps we’d done in preparing for the regression models: removing highly correlated data and variables that contain near zero variance, kNN imputation for missing values, ensuring non-NULL PH
observations, and train-test splitting our data:
Once our data’s been brought into a proper form, we consider Random Forest, Cubist, and Gradient Boosting models in order to compare and contrast which may be the best tree-based model. As mentioned earlier, we had a hunch that one of the tree-based models would be our strongest performing model being that we were dealing with a high magnitude of outliers and what appeared to be non-linear data.
<- randomForest(X.train, Y.train, importance=TRUE, ntree = 500) rfModel
set.seed(100)
<- train(X.train, Y.train,
cubistTune method = "cubist",
verbose = FALSE)
<- expand.grid(
gbmGrid interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10
)set.seed(100)
<- train(X.train, Y.train,
gbmTune method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE)
With all three non-linear regression models built, we proceed to casting our predictions and then verifying performance against the test portion of our train-test split data. Recall: it’s all been derived from our training set.
set.seed(150)
<- predict(rfModel, newdata = X.test)
randomPred <- postResample(pred=randomPred, obs = Y.test)
randomResample
<-predict(gbmTune, newdata=X.test)
gbmPred <- postResample(pred=gbmPred, obs=Y.test)
gbmResample
<-predict(cubistTune, newdata=X.test)
cubistPred <- postResample(pred=cubistPred, obs = Y.test) cubistSample
The purpose of this section is to verify model performance and identify the strongest performing model in our tree based subset. Based on what we observed during EDA (exploratory data analysis) we anticipate these models performing quite well:
<- rbind(
display "Random Forest" = randomResample,
"Gradient Boosted Tree" = gbmResample,
"Cubist" = cubistSample)
%>% kable() %>% kable_paper() display
RMSE | Rsquared | MAE | |
---|---|---|---|
Random Forest | 0.1120980 | 0.6268775 | 0.0810325 |
Gradient Boosted Tree | 0.1164534 | 0.5745456 | 0.0859115 |
Cubist | 0.1032232 | 0.6655173 | 0.0753867 |
Using Rsquared and MAE as the principal criteria for selecting the best model, the Cubist model edges the others and very narrowly overshadows the performance of the Random Forest model. All of our tree-based models performed better than the linear and non-linear model and so our hypotheses were proven.
The R-squared for our Cubist model provides indication the model is capable of predicting over 65% of the variance of our data. This is close to the range at which we’d consider it a great model (0.90 >= R-squared >= 0.70) and so we feel relatively comfortable proceeding with the Cubist model as our optimal predictive model.
As a final measure we investigate the top ten influential predictors:
plot(varImp(cubistTune), top = 10)
This model places highest importance on Mnf.Flow
and Alch.Rel
followed by Density
, Temperature
and so on down the line. The plot highlights which variabls are most important to our model and thus in predicting pH levels.
From a chemical standpoint, it makes sense that the model would place a high importance on Alch.Rel
and Carb.Rel
when predicting PH
levels, since those variables deal with oxygen and hydrogen (which either increase or decrease PH).
We considered 9 different models of linear and non-linear regression type in addition to those of a tree-based type. We’d headed into our model building with the hypothesis that tree-based models would outperform nonlinear models would outperform linear models, and it was proven true.
The data we were to process had a relatively high magnitude of outliers in addition to nonlinear relationships between variables, and it was for this reason that we felt one of the tree-based models would be our best. While the Random Forest model is worth honorable mention for a strong predictive performance in addition to easy setup, the Cubist model outperformed the Random Forest model in both R-squared and mean absolute error (MAE).
When we consider the strongest linear v. nonlinear v. tree-based model we see:
<- rbind(
display "Partial Least Squares" = plsReSample,
"Support Vector Machine" = svmResample,
"Cubist" = cubistSample
)
%>% kable() %>% kable_paper() display
RMSE | Rsquared | MAE | |
---|---|---|---|
Partial Least Squares | 0.1437415 | 0.3499468 | 0.1102135 |
Support Vector Machine | 0.1280496 | 0.4854961 | 0.0921707 |
Cubist | 0.1032232 | 0.6655173 | 0.0753867 |
The Cubist (tree-based) model by far outperforms the strongest linear and nonlinear models.
We selected the Cubist model first based on performance and second based on ease-of-use (setup). The Cubist model performed the best on test data with regard to R-squared and mean absolute error (MAE). Additionally, it required less data pre-processing than our linear and nonlinear regression models and provided our best accuracy “out of the box” (no variable tweaking).
Cubist models are a powerful, rule-based model that balance the call for predictive accuracy with that of intelligibility and it’s for this reason we’re confident in its selection and capability.
We proceed to predicting PH
values and returning the provided test set with the NA values replaced by our Cubist model’s predictions:
set.seed(150)
#Prep data set: select test then drop dataset variable
<- final_df %>%
final_test filter(Dataset == 'test') %>%
::select(-Dataset)
dplyr
#Cast predictions using model_8
<- round(predict(cubistTune, newdata=final_test),2)
predict
#verify that we've cast predictions
#head(predict)
#summary(predict)
#send to Excel / CSV file
$PH <- predict
final_testhead(final_test)
## Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp PSC
## 1 D 5.480000 24.03333 0.2700000 65.4 134.6 0.236
## 2 A 5.393333 23.95333 0.2266667 63.2 135.0 0.042
## 3 B 5.293333 23.92000 0.3033333 66.4 140.4 0.068
## 4 B 5.266667 23.94000 0.1860000 64.8 139.0 0.004
## 5 B 5.406667 24.20000 0.1600000 69.4 142.2 0.040
## 6 B 5.286667 24.10667 0.2120000 73.4 147.2 0.078
## PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## 1 0.40 0.04 -100 116.6 46.0 0
## 2 0.22 0.08 -100 118.8 46.2 0
## 3 0.10 0.02 -100 120.2 45.8 0
## 4 0.20 0.02 -100 124.8 40.0 0
## 5 0.30 0.06 -100 115.0 51.4 0
## 6 0.22 NA -100 118.6 46.4 0
## Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed
## 1 NA NA 96 129.4 3986
## 2 0 0 112 120.0 4012
## 3 0 0 98 119.4 4010
## 4 0 0 132 120.2 NA
## 5 0 0 94 116.0 4018
## 6 0 0 94 120.4 4010
## Temperature Usage.cont Carb.Flow Density MFR Balling Pressure.Vacuum PH
## 1 66.0 21.66 2950 0.88 727.6 1.398 -3.8 9.02
## 2 65.6 17.60 2916 1.50 735.8 2.942 -4.4 9.02
## 3 65.6 24.18 3056 0.90 734.8 1.448 -4.2 9.02
## 4 74.4 18.12 28 0.74 NA 1.056 -4.0 9.18
## 5 66.4 21.32 3214 0.88 752.0 1.398 -4.0 9.02
## 6 66.6 18.00 3064 0.84 732.0 1.298 -3.8 9.02
## Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel
## 1 0.022 130 45.2 142.6 6.56 5.34
## 2 0.030 120 46.0 147.2 7.14 5.58
## 3 0.046 120 46.0 146.6 6.52 5.34
## 4 NA 120 46.0 146.4 6.48 5.50
## 5 0.082 120 50.0 145.8 6.50 5.38
## 6 0.064 120 46.0 146.0 6.50 5.42
## Balling.Lvl
## 1 1.48
## 2 3.04
## 3 1.46
## 4 1.48
## 5 1.46
## 6 1.44
#write to csv
write.csv(final_test,"C:/Users/magnu/Documents/DATA-624//DATA624Proj2_PH_prediction.csv", row.names = FALSE)
We can see the first 6 predictions above, with PH
predictions present in the 26th column. We’ve provided PH
values of the same magnitude and value format as we were provided in the training data.
The most important processes in impacting pH, based on our elected model and its overlap with the earlier explored Boruta function are: Mnf.Flow
, Alch.Rel
, Carb.Rel
, and Temperature
. Density
was not as important in our general exploration of variable importance but it was important to our model. Putting an emphasis on these processes in the beverage making process would have the largest perceived impact on the pH levels of our beverages.
For example, we know that pH decreases when we increase temperature and so to decrease the pH level of a beverage leveraging this one process may reduce the pH level of our beverage (all else held constant). The powerful, and more complex measure, would be the interplay of these variables and their impact on others.
As a final measure, we consider visualizations of the relationship between our top predictors and PH
:
<- ggplot(train, aes(Mnf.Flow, PH)) + geom_point()
p1 <- ggplot(train, aes(Alch.Rel, PH)) + geom_point()
p2 <- ggplot(train, aes(Carb.Rel, PH)) + geom_point()
p3 <- ggplot(train, aes(Temperature, PH)) + geom_point()
p4
grid.arrange(p1, p2, p3, p4, nrow = 2)
We observe that our data are concentrated between pH values of 8.0 and 9.0 with Mnf.Flow
(minimum night flow) and Alch.Rel
showing clustering characteristics along certain bands / levels.
While a 1:1 relationship between top predictors and PH
is not clear (as highlighted in the above visualizations), what is clear is that they are the most important variables for controlling pH levels, they are heavily concentrated within certain ranges, (ie. Temperature
between 64 and 68 degrees F) and all beverage brands sit at a pH level greater than 8, resulting in alkaline beverages.
Provided that the relationships between predictors and response are not easily understood, using a more complex algorithm like Cubist is warranted and favorable. We selected the Cubist model for its balance of predictive prowess, interpret-ability, as well as ease of use. It was our strongest performing model and the one we felt most confident in as highlighted in this closing section.