Goal

You are given a simple data set from a beverage manufacturing company. It consists of 2,571 rows/cases of data and 33 columns / variables. Your goal is to use this data to predict PH (a column in the set). Potential for hydrogen (pH) is a measure of acidity/alkalinity, it must conform in a critical range and therefore it is important to understand its influence and predict its values. This is production data. pH is a KPI, Key Performance Indicator.

Data Exploration

data2<-read.csv("StudentData - TO MODEL.csv")
data2_eval<-read.csv("StudentEvaluation- TO PREDICT.csv")

head(data2)

##   Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp   PSC
## 1          B    5.340000    23.96667 0.2633333          68.2     141.2 0.104
## 2          A    5.426667    24.00667 0.2386667          68.4     139.6 0.124
## 3          B    5.286667    24.06000 0.2633333          70.8     144.8 0.090
## 4          A    5.440000    24.00667 0.2933333          63.0     132.6    NA
## 5          A    5.486667    24.31333 0.1113333          67.2     136.8 0.026
## 6          A    5.380000    23.92667 0.2693333          66.6     138.4 0.090
##   PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## 1     0.26    0.04     -100          118.8          46.0             0
## 2     0.22    0.04     -100          121.6          46.0             0
## 3     0.34    0.16     -100          120.2          46.0             0
## 4     0.42    0.04     -100          115.2          46.4             0
## 5     0.16    0.12     -100          118.4          45.8             0
## 6     0.24    0.04     -100          119.6          45.6             0
##   Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed
## 1            NA            NA           118        121.2         4002
## 2            NA            NA           106        118.6         3986
## 3            NA            NA            82        120.0         4020
## 4             0             0            92        117.8         4012
## 5             0             0            92        118.6         4010
## 6             0             0           116        120.2         4014
##   Temperature Usage.cont Carb.Flow Density   MFR Balling Pressure.Vacuum   PH
## 1        66.0      16.18      2932    0.88 725.0   1.398            -4.0 8.36
## 2        67.6      19.90      3144    0.92 726.8   1.498            -4.0 8.26
## 3        67.0      17.76      2914    1.58 735.0   3.142            -3.8 8.94
## 4        65.6      17.42      3062    1.54 730.6   3.042            -4.4 8.24
## 5        65.6      17.68      3054    1.54 722.8   3.042            -4.4 8.26
## 6        66.2      23.82      2948    1.52 738.8   2.992            -4.4 8.32
##   Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel
## 1         0.022           120              46.4         142.6     6.58     5.32
## 2         0.026           120              46.8         143.0     6.56     5.30
## 3         0.024           120              46.6         142.0     7.66     5.84
## 4         0.030           120              46.0         146.2     7.14     5.42
## 5         0.030           120              46.0         146.2     7.14     5.44
## 6         0.024           120              46.0         146.6     7.16     5.44
##   Balling.Lvl
## 1        1.48
## 2        1.56
## 3        3.28
## 4        3.04
## 5        3.04
## 6        3.02

Exploratory Analysis

In the dataset 32 variables are numeric, only Brand Code’ is charecter. I am transforming `Brand Code’ to a factor for the convinience of our model.

glimpse(data2)

## Rows: 2,571
## Columns: 33
## $ Brand.Code        <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", "B…
## $ Carb.Volume       <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, 5.…
## $ Fill.Ounces       <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, 23…
## $ PC.Volume         <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.111333…
## $ Carb.Pressure     <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64.2…
## $ Carb.Temp         <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 141…
## $ PSC               <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.154,…
## $ PSC.Fill          <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.12…
## $ PSC.CO2           <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.14…
## $ Mnf.Flow          <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100…
## $ Carb.Pressure1    <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 124…
## $ Fill.Pressure     <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46.0…
## $ Hyd.Pressure1     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure2     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure3     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure4     <int> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, 86…
## $ Filler.Level      <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 118…
## $ Filler.Speed      <int> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014, …
## $ Temperature       <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65.4…
## $ Usage.cont        <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 18.…
## $ Carb.Flow         <int> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902, 3…
## $ Density           <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.90…
## $ MFR               <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, 74…
## $ Balling           <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1.2…
## $ Pressure.Vacuum   <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4.4…
## $ PH                <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.38…
## $ Oxygen.Filler     <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0.0…
## $ Bowl.Setpoint     <int> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12…
## $ Pressure.Setpoint <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46.0…
## $ Air.Pressurer     <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 146…
## $ Alch.Rel          <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.52…
## $ Carb.Rel          <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.34…
## $ Balling.Lvl       <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.44…

data2$`Brand.Code`<-as.factor(data2$`Brand.Code`)

summary(data2)

##  Brand.Code  Carb.Volume     Fill.Ounces      PC.Volume       Carb.Pressure  
##   : 120     Min.   :5.040   Min.   :23.63   Min.   :0.07933   Min.   :57.00  
##  A: 293     1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23917   1st Qu.:65.60  
##  B:1239     Median :5.347   Median :23.97   Median :0.27133   Median :68.20  
##  C: 304     Mean   :5.370   Mean   :23.97   Mean   :0.27712   Mean   :68.19  
##  D: 615     3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31200   3rd Qu.:70.60  
##             Max.   :5.700   Max.   :24.32   Max.   :0.47800   Max.   :79.40  
##             NA's   :10      NA's   :38      NA's   :39        NA's   :27     
##    Carb.Temp          PSC             PSC.Fill         PSC.CO2       
##  Min.   :128.6   Min.   :0.00200   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:138.4   1st Qu.:0.04800   1st Qu.:0.1000   1st Qu.:0.02000  
##  Median :140.8   Median :0.07600   Median :0.1800   Median :0.04000  
##  Mean   :141.1   Mean   :0.08457   Mean   :0.1954   Mean   :0.05641  
##  3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600   3rd Qu.:0.08000  
##  Max.   :154.0   Max.   :0.27000   Max.   :0.6200   Max.   :0.24000  
##  NA's   :26      NA's   :33        NA's   :23       NA's   :39       
##     Mnf.Flow       Carb.Pressure1  Fill.Pressure   Hyd.Pressure1  
##  Min.   :-100.20   Min.   :105.6   Min.   :34.60   Min.   :-0.80  
##  1st Qu.:-100.00   1st Qu.:119.0   1st Qu.:46.00   1st Qu.: 0.00  
##  Median :  65.20   Median :123.2   Median :46.40   Median :11.40  
##  Mean   :  24.57   Mean   :122.6   Mean   :47.92   Mean   :12.44  
##  3rd Qu.: 140.80   3rd Qu.:125.4   3rd Qu.:50.00   3rd Qu.:20.20  
##  Max.   : 229.40   Max.   :140.2   Max.   :60.40   Max.   :58.00  
##  NA's   :2         NA's   :32      NA's   :22      NA's   :11     
##  Hyd.Pressure2   Hyd.Pressure3   Hyd.Pressure4     Filler.Level  
##  Min.   : 0.00   Min.   :-1.20   Min.   : 52.00   Min.   : 55.8  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 86.00   1st Qu.: 98.3  
##  Median :28.60   Median :27.60   Median : 96.00   Median :118.4  
##  Mean   :20.96   Mean   :20.46   Mean   : 96.29   Mean   :109.3  
##  3rd Qu.:34.60   3rd Qu.:33.40   3rd Qu.:102.00   3rd Qu.:120.0  
##  Max.   :59.40   Max.   :50.00   Max.   :142.00   Max.   :161.2  
##  NA's   :15      NA's   :15      NA's   :30       NA's   :20     
##   Filler.Speed   Temperature      Usage.cont      Carb.Flow       Density     
##  Min.   : 998   Min.   :63.60   Min.   :12.08   Min.   :  26   Min.   :0.240  
##  1st Qu.:3888   1st Qu.:65.20   1st Qu.:18.36   1st Qu.:1144   1st Qu.:0.900  
##  Median :3982   Median :65.60   Median :21.79   Median :3028   Median :0.980  
##  Mean   :3687   Mean   :65.97   Mean   :20.99   Mean   :2468   Mean   :1.174  
##  3rd Qu.:3998   3rd Qu.:66.40   3rd Qu.:23.75   3rd Qu.:3186   3rd Qu.:1.620  
##  Max.   :4030   Max.   :76.20   Max.   :25.90   Max.   :5104   Max.   :1.920  
##  NA's   :57     NA's   :14      NA's   :5       NA's   :2      NA's   :1      
##       MFR           Balling       Pressure.Vacuum        PH       
##  Min.   : 31.4   Min.   :-0.170   Min.   :-6.600   Min.   :7.880  
##  1st Qu.:706.3   1st Qu.: 1.496   1st Qu.:-5.600   1st Qu.:8.440  
##  Median :724.0   Median : 1.648   Median :-5.400   Median :8.540  
##  Mean   :704.0   Mean   : 2.198   Mean   :-5.216   Mean   :8.546  
##  3rd Qu.:731.0   3rd Qu.: 3.292   3rd Qu.:-5.000   3rd Qu.:8.680  
##  Max.   :868.6   Max.   : 4.012   Max.   :-3.600   Max.   :9.360  
##  NA's   :212     NA's   :1                         NA's   :4      
##  Oxygen.Filler     Bowl.Setpoint   Pressure.Setpoint Air.Pressurer  
##  Min.   :0.00240   Min.   : 70.0   Min.   :44.00     Min.   :140.8  
##  1st Qu.:0.02200   1st Qu.:100.0   1st Qu.:46.00     1st Qu.:142.2  
##  Median :0.03340   Median :120.0   Median :46.00     Median :142.6  
##  Mean   :0.04684   Mean   :109.3   Mean   :47.62     Mean   :142.8  
##  3rd Qu.:0.06000   3rd Qu.:120.0   3rd Qu.:50.00     3rd Qu.:143.0  
##  Max.   :0.40000   Max.   :140.0   Max.   :52.00     Max.   :148.2  
##  NA's   :12        NA's   :2       NA's   :12                       
##     Alch.Rel        Carb.Rel      Balling.Lvl  
##  Min.   :5.280   Min.   :4.960   Min.   :0.00  
##  1st Qu.:6.540   1st Qu.:5.340   1st Qu.:1.38  
##  Median :6.560   Median :5.400   Median :1.48  
##  Mean   :6.897   Mean   :5.437   Mean   :2.05  
##  3rd Qu.:7.240   3rd Qu.:5.540   3rd Qu.:3.14  
##  Max.   :8.620   Max.   :6.060   Max.   :3.66  
##  NA's   :9       NA's   :10      NA's   :1

Exploring Missing Values:

vis_miss(data2)

Only 1% values fro the whole dataset is missing. However, MFR variable has the most missing value. 8.25% data from the MFR column is missing

Histogram

We can observe that some variables have more than one modes.

Carb.Pressure, Carb.Temp, Carb,Volume, Fill.Ounces, PC.Volume, PH, Pressure.Vacuum, PSC, PSC.Fill, and Temperature have relatively normal distribution.

plot_histogram(data2)

Boxplot:

The distribution of the variables are really diverse.For instance Filler Level, Carb Temp has narrow range of data spread, whereas carb flow has wide range of data spread. Most of the values have outliers, however, speed and MFR have distinct amount of outliers.

Therefore outliers handling will be really crucial to improve the performance of our model. Scaling will be applied to handle outliers.

ggplot(data = reshape2::melt(data2) , aes(x=variable, y=value)) + 
  geom_boxplot(outlier.colour="red", outlier.shape=3, outlier.size=5,aes(fill=variable)) +
  coord_flip() + theme(legend.position = "none")

## Warning: Removed 724 rows containing non-finite values (stat_boxplot).

Correlation

corrplot(cor(data2[,-1], use = "na.or.complete"),order="alphabet",
         col=brewer.pal(8,"Set3"))

The graph displays that most of the variables have weak correlations. However,multicollinearity can be observed among Balling, Balling.Lvl.

Since this model is going to predict PH, we are going to explore which variables have strong relationships with PH.This may help with our variable selection. Variable selection is an important aspect of model building.It helps in building predictive models free from correlated variables, biases and unwanted noise.

# Perform Boruta search
boruta_output <- Boruta(PH ~ ., data=na.omit(data2), doTrace=0, maxRuns = 1000)

# Get significant variables including tentatives
boruta_signif <- getSelectedAttributes(boruta_output, withTentative = TRUE)

#print(boruta_signif)
# Plot variable importance
plot(boruta_output, cex.axis=.7, las=2, xlab="", main="Variable Importance")

From the graph we can see that PSC CO2, PSC, PSC Fill and Carb Temp are insignificant. Carb Ref appears to be really significant.

Data Preparation:

Next we are going to utilize preprocess function from caret package to central, scale, impute,Box Cox and reduce multi collinearity in our dataset. We are going to exclude Brand Code and PH columns.

#dataprocess <- data.matrix(subset(data2, select = -c(Brand.Code))) #convert to mumeric matix. Exlude character column (brand code)
preprocessing <- preProcess(subset(data2, select = -c(Brand.Code, PH)), method = c("center", "scale", "knnImpute", "corr", "BoxCox")) 
preprocessing

## Created from 2134 samples and 31 variables
## 
## Pre-processing:
##   - Box-Cox transformation (19)
##   - centered (25)
##   - ignored (0)
##   - 5 nearest neighbor imputation (25)
##   - removed (6)
##   - scaled (25)
## 
## Lambda estimates for Box-Cox transformation:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.0000 -2.0000 -0.2000 -0.1842  1.0000  2.0000

processe2 <-  predict(preprocessing, subset(data2, select = -c(Brand.Code))) 
processe2 <- processe2 %>% drop_na(PH)

Checking for missing value

vis_miss(as.data.frame(processe2))

Feature-Target Correlations

c <- ncol(processe2) - 1

stack(sort(cor(processe2[, c + 1], processe2[,1:c])[,], decreasing=TRUE))

##           values               ind
## 1   0.8397435768          Alch.Rel
## 2   0.7846690832       Carb.Volume
## 3   0.4295227573     Carb.Pressure
## 4   0.1655308292                PH
## 5   0.1262737242     Bowl.Setpoint
## 6   0.0269777802     Hyd.Pressure1
## 7   0.0230715756    Carb.Pressure1
## 8   0.0167511558     Hyd.Pressure2
## 9   0.0038708424         Carb.Temp
## 10  0.0032691401               MFR
## 11  0.0003110381   Pressure.Vacuum
## 12 -0.0152131729          PSC.Fill
## 13 -0.0198118459     Oxygen.Filler
## 14 -0.0312594022        Usage.cont
## 15 -0.0358325321          Mnf.Flow
## 16 -0.0531924694         Carb.Flow
## 17 -0.0643355562           PSC.CO2
## 18 -0.0682015477               PSC
## 19 -0.1001586534     Air.Pressurer
## 20 -0.1089324787       Temperature
## 21 -0.1187077889       Fill.Ounces
## 22 -0.1329121923         PC.Volume
## 23 -0.1656156064     Fill.Pressure
## 24 -0.2393783404 Pressure.Setpoint
## 25 -0.4390215899     Hyd.Pressure4


# Feature Creation
processe2 <- data.frame(processe2)

processe2 <- processe2 %>%
  mutate(Mnf.Flow        = if_else(Mnf.Flow      <     0, 1, 0)) %>% 
  mutate(Hyd.Pressure1  = if_else(Hyd.Pressure1 <=    0 ,1, 0)) %>% 
  mutate(Hyd.Pressure2  = if_else(Hyd.Pressure2 <=    0, 1, 0)) %>% 
  mutate(Carb.Flow    = if_else(Carb.Flow     <  2000, 1, 0))

Data splitting

The data is split in the ratio of 80:20 for train and test sets respectively. A random seed is set for reproducibility.

ph <- data.matrix(processe2$PH) #ph --target
set.seed(100)
s <- ph %>% createDataPartition(p = 0.8, list = FALSE, times = 1)
train  <- processe2[s, ] #student train
train <- subset(train, select = -c(PH)) # exlude target column
test <- processe2[-s, ] #student validation set
test <- subset(test, select = -c(PH)) #exclude target column
train_ph  <- ph[s, ] # ph 
test_ph <- ph[-s] # ph

Modeling

ctrl <- trainControl(method = "cv", number = 10)

1. Linear Model

After preprocessing our data a linear model shows us that about half of the predictors are statistically significant. Our first attempt at a model gives us an RSME of 0.144 and an Rsquared of 0.297. Let’s see if we can find a model that’s better fit. Lastly, our MAPE is 0.01317.

lm <- train(train, train_ph, method = "lm", trControl = ctrl) 
summary(lm)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54082 -0.08425  0.01217  0.09318  0.64248 
## 
## Coefficients:
##                     Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)        8.5460901  0.0031169 2741.898  < 2e-16 ***
## Carb.Volume       -0.0041917  0.0082489   -0.508  0.61141    
## Fill.Ounces       -0.0085440  0.0033060   -2.584  0.00982 ** 
## PC.Volume         -0.0060449  0.0038927   -1.553  0.12061    
## Carb.Pressure      0.0003508  0.0125407    0.028  0.97768    
## Carb.Temp          0.0060987  0.0113774    0.536  0.59199    
## PSC               -0.0051752  0.0033749   -1.533  0.12533    
## PSC.Fill          -0.0050991  0.0032476   -1.570  0.11655    
## PSC.CO2           -0.0058976  0.0031977   -1.844  0.06528 .  
## Mnf.Flow          -0.0795196  0.0062685  -12.686  < 2e-16 ***
## Carb.Pressure1     0.0370769  0.0038514    9.627  < 2e-16 ***
## Fill.Pressure      0.0057435  0.0045865    1.252  0.21062    
## Hyd.Pressure1      0.0029024  0.0052597    0.552  0.58114    
## Hyd.Pressure2      0.0241311  0.0067796    3.559  0.00038 ***
## Hyd.Pressure4     -0.0051637  0.0043919   -1.176  0.23984    
## Temperature       -0.0293192  0.0034670   -8.457  < 2e-16 ***
## Usage.cont        -0.0275841  0.0040617   -6.791 1.45e-11 ***
## Carb.Flow          0.0091266  0.0041978    2.174  0.02981 *  
## MFR               -0.0076758  0.0035765   -2.146  0.03198 *  
## Pressure.Vacuum    0.0048421  0.0042225    1.147  0.25162    
## Oxygen.Filler     -0.0128752  0.0043916   -2.932  0.00341 ** 
## Bowl.Setpoint      0.0258166  0.0048404    5.334 1.07e-07 ***
## Pressure.Setpoint -0.0106848  0.0046342   -2.306  0.02123 *  
## Air.Pressurer     -0.0001223  0.0033295   -0.037  0.97071    
## Alch.Rel           0.0083527  0.0065589    1.273  0.20299    
## Carb.Rel           0.0094487  0.0065765    1.437  0.15095    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1409 on 2029 degrees of freedom
## Multiple R-squared:  0.3413, Adjusted R-squared:  0.3332 
## F-statistic: 42.05 on 25 and 2029 DF,  p-value: < 2.2e-16

lm_pred <- predict(lm,newdata = test)
table <- data.frame(lm = postResample(pred = lm_pred , obs = test_ph))
table

##                 lm
## RMSE     0.1444703
## Rsquared 0.2972350
## MAE      0.1119767

MAPE(lm_pred, test_ph)

## [1] 0.01317228

2. PLS Model

Out next attempt is a Partial Least Squares model. Plotting the components shows us that the model is best tuned at the 7th component. The RSME is slightly lower and the Rsquared marginally higher, but nothing significant. The variables that influence this model the most are Mnf.Flow at the top with significant weight given to Bowl.Setpoint and Usage.Cont as well. Residuals look randomly distributed, which is what we expect.

pls_model <- train(
  train,
  train_ph,
  method = 'pls',
  trControl = ctrl,
  tuneLength = 20
)

summary(pls_model)

## Data:    X dimension: 2055 25 
##  Y dimension: 2055 1
## Fit method: oscorespls
## Number of components considered: 10
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X           16.12    22.41    32.53    40.69    45.33    48.66    52.07
## .outcome    23.29    30.47    32.45    33.19    33.71    34.01    34.09
##           8 comps  9 comps  10 comps
## X           56.50    60.33     63.64
## .outcome    34.12    34.12     34.13

plot(pls_model)

plot(varImp(pls_model), top = 10)

pred <- predict(pls_model, test)
postResample(pred=pred, obs=test_ph)

##      RMSE  Rsquared       MAE 
## 0.1444667 0.2972877 0.1119500

residuals <- test_ph - pred
plot(residuals, 
     main="Residuals for PLS model",
     xlab = 'Observation Number in Test',
     ylab = 'Residuals')

MAPE(pred, test_ph)

## [1] 0.01316916

KNN

The next model we use is K-nearest neighbor model. The inputs are pretty straightforward, and after running it we see it’s best tuned when k=7. Using the our new KNN model against our test set produces an RSME of 0.131 and an R squared of 0.43. This is our most performant model so far because it has a higher R squared and lower RSME than our previous attempts. MAPE is also lower then the previous models at 0.01138.

knn_model <- train(
  train,
  train_ph,
  method = 'knn',
  trControl = ctrl,
  tuneLength = 10
)

knn_model

## k-Nearest Neighbors 
## 
## 2055 samples
##   25 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1849, 1850, 1849, 1850, 1850, 1849, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE       
##    5  0.1300976  0.4424617  0.09600541
##    7  0.1292228  0.4474741  0.09652588
##    9  0.1295887  0.4428510  0.09712231
##   11  0.1294215  0.4447368  0.09756081
##   13  0.1301966  0.4391178  0.09805839
##   15  0.1301364  0.4402089  0.09838328
##   17  0.1310205  0.4322555  0.09940137
##   19  0.1321252  0.4224370  0.10030266
##   21  0.1326929  0.4179984  0.10081663
##   23  0.1328843  0.4160379  0.10137886
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.

plot(knn_model)

pred <- predict(knn_model, test)
postResample(pred=pred, obs=test_ph)

##       RMSE   Rsquared        MAE 
## 0.13110804 0.43201195 0.09655134

MAPE(pred, test_ph)

## [1] 0.01138328

Residuals

residuals <- test_ph - pred
plot(residuals, 
     main="Residuals for KNN model",
     xlab = 'Observation Number in Test',
     ylab = 'Residuals')

Mars

The optimal MARS model is achieved at degree=3 and nprune = 40. The RSME is 0.13207 and Rsqaured is 0.4183. This is marginally worse than the KNN model, but not enough to make a difference. MAPE is 0.01178.

set.seed(100)
parameter_grid <- floor(expand.grid(degree=1:4, nprune=seq(5, 50, by=5)))

mars_model <- train(
  train,
  train_ph,
  method = 'earth',
  metric='RMSE',
  trControl = ctrl,
  tuneGrid = parameter_grid
)

plot(mars_model)

pred <- predict(mars_model$finalModel, test)
postResample(pred=pred, obs=test_ph)

##      RMSE  Rsquared       MAE 
## 0.1353495 0.3883327 0.1024484

MAPE(pred, test_ph)

## [1] 0.01206694

Residuals

residuals <- test_ph - pred
plot(residuals, 
     main="Residuals for MARS model",
     xlab = 'Observation Number in Test',
     ylab = 'Residuals')

Random Forest

Next model is Random forest which gives us the smallest error so far with an RSME of 0.10738 and a MAPE equal to 0.009109. The Rsquared is also the highest at 0.61584. The final value used for the model was mtry = 25.

rf_model <- train(
  train,
  train_ph,
  method = 'rf',
  trControl = ctrl,
  tunelength=10
)

rf_model

## Random Forest 
## 
## 2055 samples
##   25 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1850, 1850, 1849, 1849, 1848, 1849, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE       
##    2    0.1189895  0.5647016  0.09077114
##   13    0.1088697  0.6141481  0.07951260
##   25    0.1081666  0.6119878  0.07809214
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 25.

plot(rf_model)

pred <- predict(rf_model, test)
postResample(pred=pred, obs=test_ph)

##       RMSE   Rsquared        MAE 
## 0.10754407 0.61468327 0.07746621

plot(varImp(rf_model), top = 10)

MAPE(pred, test_ph)

## [1] 0.009118084

Residuals

residuals <- test_ph - pred
plot(residuals, 
     main="Residuals for Random Forrest model",
     xlab = 'Observation Number in Test',
     ylab = 'Residuals')

Cubist

Last model we performed was cubist which produced slightly better results compared to random forrest.

parameter_grid <- expand.grid(committees = c(1, 10, 50, 100), neighbors = c(0, 1, 5, 9))

cubist_model <- train(
  train,
  train_ph,
  method = 'cubist',
  trControl = ctrl,
  tuneGrid=parameter_grid
)

plot(cubist_model)

plot(varImp(cubist_model), top = 10)

pred <- predict(cubist_model, test)
postResample(pred=pred, obs=test_ph)

##       RMSE   Rsquared        MAE 
## 0.10377080 0.64094431 0.07603397

MAPE(pred, test_ph)

## [1] 0.008956511

Residuals

residuals <- test_ph - pred
plot(residuals, 
     main="Residuals for Cubist model",
     xlab = 'Observation Number in Test',
     ylab = 'Residuals')

Predict using Evaluation set

For our final prediction we will use the cubist and random forest model since they gave use the smallest error with the highest Rsquared.

data2_temp <- data2_eval %>% select(Carb.Volume , Fill.Ounces , PC.Volume ,  Carb.Pressure ,  Carb.Temp, PSC,  PSC.Fill ,  PSC.CO2 ,  Mnf.Flow , Carb.Pressure1 ,  Fill.Pressure , Hyd.Pressure1, Hyd.Pressure2 , Hyd.Pressure4 , Temperature , Usage.cont , Carb.Flow , MFR, Pressure.Vacuum ,  Oxygen.Filler , Bowl.Setpoint , Pressure.Setpoint , Air.Pressurer, Alch.Rel, Carb.Rel)

preprocessing_eval <- preProcess(data2_temp, method = c("center", "scale", "knnImpute", "corr", "BoxCox")) 
process_eval <-  predict(preprocessing_eval, data2_temp)
vis_miss(as.data.frame(process_eval))

Cubist

cubist_ph <- predict(cubist_model, process_eval)

cubist_eval <- data2_eval
cubist_eval$PH <- cubist_ph
write.csv(cubist_eval, "cubist_evaluation.csv")

Random Forest

rf_ph <- predict(rf_model, process_eval)

rf_eval <- data2_eval
rf_eval$PH <- cubist_ph
write.csv(rf_eval, "random_forest_evaluation.csv")

Data 624 Project 2

Goal

Data Exploration

Exploratory Analysis

Exploring Missing Values:

Histogram

Boxplot:

Correlation

Data Preparation:

Checking for missing value

Feature-Target Correlations

Data splitting

Modeling

1. Linear Model

2. PLS Model

KNN

Mars

Random Forest

Cubist

Predict using Evaluation set