INSTRUCTIONS

Hide Assignment Information Instructions Project #2 (Team) Assignment

This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.

Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.

Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.

Due on Dec 15, 2024 11:59 PM

Load Packages

The following code below loops through the list of necessary packages and checks to determine if each is installed. If the package is not found it is installed and loaded.

## Warning: package 'summarytools' was built under R version 4.3.3
## Warning: package 'caret' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'randomForest' was built under R version 4.3.3
## Warning: package 'glmnet' was built under R version 4.3.3
## Warning: package 'reshape2' was built under R version 4.3.2
## Warning in check_dep_version(): ABI version mismatch: 
## lme4 was built with Matrix ABI version 1
## Current Matrix ABI version is 0
## Please re-install lme4 from source or restore original 'Matrix' package
## Warning: package 'mice' was built under R version 4.3.3

DATA EXPLORATION

Load The Data

The script retrieves training (StudentData.xlsx) and test datasets (StudentEvaluation.xlsx) from GitHub, reads them into data frames, and removes the temporary files to maintain a clean workspace.

View The Data

Performs analysis of the Student_Train dataset and display a summary of the structure of the dataset, helping to understand the data’s composition.

## There are  2571  observations/cases in the Student Training dataset.
## There are  33  columns/elements in the Student Training dataset
## There are  1  Categorical Variables in the Student Training dataset.
## There are  32  Predictor Variables in the Student Training dataset.

Display both a structural overview and a statistical summary of the training dataset.

## Rows: 2,571
## Columns: 33
## $ `Brand Code`        <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", …
## $ `Carb Volume`       <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, …
## $ `Fill Ounces`       <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, …
## $ `PC Volume`         <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.1113…
## $ `Carb Pressure`     <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64…
## $ `Carb Temp`         <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 1…
## $ PSC                 <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.15…
## $ `PSC Fill`          <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.…
## $ `PSC CO2`           <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.…
## $ `Mnf Flow`          <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -1…
## $ `Carb Pressure1`    <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 1…
## $ `Fill Pressure`     <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46…
## $ `Hyd Pressure1`     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure2`     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure3`     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure4`     <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, …
## $ `Filler Level`      <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 1…
## $ `Filler Speed`      <dbl> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014…
## $ Temperature         <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65…
## $ `Usage cont`        <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 1…
## $ `Carb Flow`         <dbl> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902,…
## $ Density             <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.…
## $ MFR                 <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, …
## $ Balling             <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1…
## $ `Pressure Vacuum`   <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4…
## $ PH                  <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.…
## $ `Oxygen Filler`     <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0…
## $ `Bowl Setpoint`     <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, …
## $ `Pressure Setpoint` <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46…
## $ `Air Pressurer`     <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 1…
## $ `Alch Rel`          <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.…
## $ `Carb Rel`          <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.…
## $ `Balling Lvl`       <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.…
Brand Code Carb Volume Fill Ounces PC Volume Carb Pressure Carb Temp PSC PSC Fill PSC CO2 Mnf Flow Carb Pressure1 Fill Pressure Hyd Pressure1 Hyd Pressure2 Hyd Pressure3 Hyd Pressure4 Filler Level Filler Speed Temperature Usage cont Carb Flow Density MFR Balling Pressure Vacuum PH Oxygen Filler Bowl Setpoint Pressure Setpoint Air Pressurer Alch Rel Carb Rel Balling Lvl
Length:2571 Min. :5.040 Min. :23.63 Min. :0.07933 Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000 Min. :0.00000 Min. :-100.20 Min. :105.6 Min. :34.60 Min. :-0.80 Min. : 0.00 Min. :-1.20 Min. : 52.00 Min. : 55.8 Min. : 998 Min. :63.60 Min. :12.08 Min. : 26 Min. :0.240 Min. : 31.4 Min. :-0.170 Min. :-6.600 Min. :7.880 Min. :0.00240 Min. : 70.0 Min. :44.00 Min. :140.8 Min. :5.280 Min. :4.960 Min. :0.00
Class :character 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23917 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00 1st Qu.: 98.3 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1144 1st Qu.:0.900 1st Qu.:706.3 1st Qu.: 1.496 1st Qu.:-5.600 1st Qu.:8.440 1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.38
Mode :character Median :5.347 Median :23.97 Median :0.27133 Median :68.20 Median :140.8 Median :0.07600 Median :0.1800 Median :0.04000 Median : 65.20 Median :123.2 Median :46.40 Median :11.40 Median :28.60 Median :27.60 Median : 96.00 Median :118.4 Median :3982 Median :65.60 Median :21.79 Median :3028 Median :0.980 Median :724.0 Median : 1.648 Median :-5.400 Median :8.540 Median :0.03340 Median :120.0 Median :46.00 Median :142.6 Median :6.560 Median :5.400 Median :1.48
NA Mean :5.370 Mean :23.97 Mean :0.27712 Mean :68.19 Mean :141.1 Mean :0.08457 Mean :0.1954 Mean :0.05641 Mean : 24.57 Mean :122.6 Mean :47.92 Mean :12.44 Mean :20.96 Mean :20.46 Mean : 96.29 Mean :109.3 Mean :3687 Mean :65.97 Mean :20.99 Mean :2468 Mean :1.174 Mean :704.0 Mean : 2.198 Mean :-5.216 Mean :8.546 Mean :0.04684 Mean :109.3 Mean :47.62 Mean :142.8 Mean :6.897 Mean :5.437 Mean :2.05
NA 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600 3rd Qu.:0.08000 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00 3rd Qu.:20.20 3rd Qu.:34.60 3rd Qu.:33.40 3rd Qu.:102.00 3rd Qu.:120.0 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75 3rd Qu.:3186 3rd Qu.:1.620 3rd Qu.:731.0 3rd Qu.: 3.292 3rd Qu.:-5.000 3rd Qu.:8.680 3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00 3rd Qu.:143.0 3rd Qu.:7.240 3rd Qu.:5.540 3rd Qu.:3.14
NA Max. :5.700 Max. :24.32 Max. :0.47800 Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200 Max. :0.24000 Max. : 229.40 Max. :140.2 Max. :60.40 Max. :58.00 Max. :59.40 Max. :50.00 Max. :142.00 Max. :161.2 Max. :4030 Max. :76.20 Max. :25.90 Max. :5104 Max. :1.920 Max. :868.6 Max. : 4.012 Max. :-3.600 Max. :9.360 Max. :0.40000 Max. :140.0 Max. :52.00 Max. :148.2 Max. :8.620 Max. :6.060 Max. :3.66
NA NA’s :10 NA’s :38 NA’s :39 NA’s :27 NA’s :26 NA’s :33 NA’s :23 NA’s :39 NA’s :2 NA’s :32 NA’s :22 NA’s :11 NA’s :15 NA’s :15 NA’s :30 NA’s :20 NA’s :57 NA’s :14 NA’s :5 NA’s :2 NA’s :1 NA’s :212 NA’s :1 NA NA’s :4 NA’s :12 NA’s :2 NA’s :12 NA NA’s :9 NA’s :10 NA’s :1

Display the first 10 observations/cases in the Student Training dataset.

Brand Code Carb Volume Fill Ounces PC Volume Carb Pressure Carb Temp PSC PSC Fill PSC CO2 Mnf Flow Carb Pressure1 Fill Pressure Hyd Pressure1 Hyd Pressure2 Hyd Pressure3 Hyd Pressure4 Filler Level Filler Speed Temperature Usage cont Carb Flow Density MFR Balling Pressure Vacuum PH Oxygen Filler Bowl Setpoint Pressure Setpoint Air Pressurer Alch Rel Carb Rel Balling Lvl
B 5.340000 23.96667 0.2633333 68.2 141.2 0.104 0.26 0.04 -100 118.8 46.0 0 NA NA 118 121.2 4002 66.0 16.18 2932 0.88 725.0 1.398 -4.0 8.36 0.022 120 46.4 142.6 6.58 5.32 1.48
A 5.426667 24.00667 0.2386667 68.4 139.6 0.124 0.22 0.04 -100 121.6 46.0 0 NA NA 106 118.6 3986 67.6 19.90 3144 0.92 726.8 1.498 -4.0 8.26 0.026 120 46.8 143.0 6.56 5.30 1.56
B 5.286667 24.06000 0.2633333 70.8 144.8 0.090 0.34 0.16 -100 120.2 46.0 0 NA NA 82 120.0 4020 67.0 17.76 2914 1.58 735.0 3.142 -3.8 8.94 0.024 120 46.6 142.0 7.66 5.84 3.28
A 5.440000 24.00667 0.2933333 63.0 132.6 NA 0.42 0.04 -100 115.2 46.4 0 0 0 92 117.8 4012 65.6 17.42 3062 1.54 730.6 3.042 -4.4 8.24 0.030 120 46.0 146.2 7.14 5.42 3.04
A 5.486667 24.31333 0.1113333 67.2 136.8 0.026 0.16 0.12 -100 118.4 45.8 0 0 0 92 118.6 4010 65.6 17.68 3054 1.54 722.8 3.042 -4.4 8.26 0.030 120 46.0 146.2 7.14 5.44 3.04
A 5.380000 23.92667 0.2693333 66.6 138.4 0.090 0.24 0.04 -100 119.6 45.6 0 0 0 116 120.2 4014 66.2 23.82 2948 1.52 738.8 2.992 -4.4 8.32 0.024 120 46.0 146.6 7.16 5.44 3.02
A 5.313333 23.88667 0.2680000 64.2 136.8 0.128 0.40 0.04 -100 122.2 51.8 0 0 0 124 123.4 NA 65.8 20.74 30 0.84 NA 1.298 -4.4 8.40 0.066 120 46.0 146.2 6.54 5.38 1.44
B 5.320000 24.17333 0.2206667 67.6 141.4 0.154 0.34 0.04 -100 124.2 46.8 0 0 0 132 118.6 1004 65.2 18.96 684 0.84 NA 1.298 -4.4 8.38 0.046 120 46.0 146.4 6.52 5.34 1.44
B 5.246667 23.98000 0.2626667 64.2 140.2 0.132 0.12 0.14 -100 120.8 46.0 0 0 0 90 120.2 4014 65.4 18.40 2902 0.90 740.4 1.446 -4.4 8.38 0.064 120 46.0 147.2 6.52 5.34 1.44
B 5.266667 24.00667 0.2313333 72.0 147.4 0.014 0.24 0.06 -100 119.8 45.2 0 0 0 108 120.8 4028 66.6 13.50 3038 0.90 692.4 1.448 -4.4 8.50 0.022 120 46.0 146.2 6.54 5.34 1.38

DATA WRANGLING

The following functions help clean and preprocess the dataset by handling missing values, outliers, feature scaling, and constant columns.

The following provides a comprehensive analysis of missing data by both displaying a table of missing values per column and generating a bar plot to visualize the distribution of missing values across variables. The table allows for detailed inspection, while the plot gives a quick overview of the missing data across the dataset.

The MFR has the highest percentage of missing data (8.2%), while Density has the least (0%). Many of the variables have relatively low missing data (around 0.1% to 0.5%). While some variables, particularly MFR, Brand Code, Filler Speed, and others, have missing values in the range of 1% to 8%, which will be imputed in the data cleaning process.

## Missing values per column:
variable n percent
MFR 212 8.2%
Brand Code 120 4.7%
Filler Speed 57 2.2%
PC Volume 39 1.5%
PSC CO2 39 1.5%
Fill Ounces 38 1.5%
PSC 33 1.3%
Carb Pressure1 32 1.2%
Hyd Pressure4 30 1.2%
Carb Pressure 27 1.1%
Carb Temp 26 1%
PSC Fill 23 0.9%
Fill Pressure 22 0.9%
Filler Level 20 0.8%
Hyd Pressure2 15 0.6%
Hyd Pressure3 15 0.6%
Temperature 14 0.5%
Oxygen Filler 12 0.5%
Pressure Setpoint 12 0.5%
Hyd Pressure1 11 0.4%
Carb Rel 10 0.4%
Carb Volume 10 0.4%
Alch Rel 9 0.4%
Usage cont 5 0.2%
PH 4 0.2%
Bowl Setpoint 2 0.1%
Carb Flow 2 0.1%
Mnf Flow 2 0.1%
Balling 1 0%
Balling Lvl 1 0%
Density 1 0%

The following calculates and visualizes the correlations between numeric variables in the Student_Train dataset using a heatmap. It highlights the strength and direction of the relationships between variables, with negative correlations shown in blue, positive correlations in red, and zero correlations in gray. The heatmap provides an intuitive way to identify strong or weak correlations in the dataset.

The following boxplots represents the distribution of values for a different variable in the Student_Train dataset (excluding the first column). Each plot is labeled with the variable name, and the boxplots are horizontally oriented with a custom color scheme. This visualization helps to identify the spread, central tendency, and potential outliers for each variable.

Variables that are highly Right-Skewed are: PSC, PSC Fill, PSC CO2,Hyd Pressure1-3, Oxygen Filler, Air Pressurer. A suggested Transformation would be to apply a logarithmic or a square root transformation to reduce skewness.

The clean_data function is a comprehensive data preprocessing pipeline that imputes missing values, removes outliers, scales features, and removes constant columns from the dataset. It is applied to the Student_Train dataset to produce a cleaned version, Student_Cleaned, ready for further analysis or modeling. The function uses various techniques like Predictive Mean Matching for imputation, Z-score normalization for scaling, and IQR for outlier detection.

## Warning: package 'janitor' was built under R version 4.3.3

Validation of Cleaning

Checks to see if any missing values remain in the cleaned dataset.

## Missing values per column after cleaning data:
variable n percent
brand_code 119 4.7%

After cleaning the data we need to validate the dataset to check for remaining missing values as well as visualize the distributions to ensure scaling and outlier handling.The imputation process is visualized with a plot, and the density plot for the variable Carb_Volume compares the distribution of imputed and observed values.

Transformations For the right skewed variables we will use square root transformation

## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following object is masked from 'package:dplyr':
## 
##     combine
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Model Building

Linear Models: Split the Data 80-20 in training and testing data

Simple Linear Regression

##      RMSE  Rsquared       MAE 
## 0.7812662 0.3468832 0.6011851

Robust Linear Model

##      RMSE  Rsquared       MAE 
## 0.8129427 0.2915512 0.6439884

Partial Linear Squares

##      RMSE  Rsquared       MAE 
## 0.7812578 0.3470570 0.6038144

Enet Model

## Loading required package: lars
## Loaded lars 1.3
##      RMSE  Rsquared       MAE 
## 0.7784749 0.3502429 0.6047150

Ridge Model

##      RMSE  Rsquared       MAE 
## 0.7795994 0.3490617 0.6028884

LARS

##      RMSE  Rsquared       MAE 
## 0.7784348 0.3504221 0.6036631

Non-Linear Models:

Split the Data 80-20 in training and testing data

KNN

##      RMSE  Rsquared       MAE 
## 0.7943361 0.3678533 0.6059853

SVM

##      RMSE  Rsquared       MAE 
## 0.7384741 0.4564360 0.5392765

MARS

## Loading required package: earth
## Warning: package 'earth' was built under R version 4.3.3
## Loading required package: Formula
## Loading required package: plotmo
## Warning: package 'plotmo' was built under R version 4.3.3
## Loading required package: plotrix
## Warning: package 'plotrix' was built under R version 4.3.2
##      RMSE  Rsquared       MAE 
## 0.7628534 0.4136125 0.5601422

Random Forest

##      RMSE  Rsquared       MAE 
## 0.6101989 0.6445020 0.4355393

Model Evaluation

##                   RMSE  Rsquared       MAE
## KNN          0.7943361 0.3678533 0.6059853
## SVM          0.7384741 0.4564360 0.5392765
## MARS         0.7628534 0.4136125 0.5601422
## RandomForest 0.6101989 0.6445020 0.4355393
## SLR          0.9201597 0.1763704 0.7047426
## RLM          0.9087564 0.1771359 0.7090285
## PLS          0.9200291 0.1770999 0.7069576
## ENET         0.9092491 0.1847612 0.6967332
## Ridge        0.9154253 0.1805659 0.7018082
## LARS         0.9111737 0.1835543 0.6975926

The model with the best opimal resampling and test set performance of all the models is the Random Forest model. It has the lowest RMSE (a lower RMSE predicts a better performing model) of all the models at: 0.61. It also has the highest Rsquared (a higher Rsquared indicates a better fit) of all the models at 0.64.

Variable Importance

For our optimal Random Forest model, Mnf Flow enjoys the greatest impact (54.42), above others like Carb Pressure1 (27.02) and Carb Volume (21.26), all of which contribute meaningfully to predicting PH. Minimal contributors such as PSC Fill and PSC CO2 can be diesregarded in follow-up analysis, as we should focus optimization strategies on top predictors. We next employed our Random Forest model to form a prediction.

Forecast

## Warning: package 'DT' was built under R version 4.3.3
## Warning: package 'DataExplorer' was built under R version 4.3.3

The output would indicate concentration of the predictions around around 0, largely symmetric in disttribution and concentration toward the mean, all of which suggests that our model is accurate in predicting PH values within a fairly narrow range.