Load Data

First, we load the data from GitHub. We had some trouble reading the Excel files, so we converted them to CSV.

# Load training data (m=modeling)
dfm_raw <- read.csv('https://raw.githubusercontent.com/klgriffen96/summer23_data624/main/project_2/StudentData%20-%20TO%20MODEL.csv')

# Load evaluation data
dfe_raw <- read.csv('https://github.com/klgriffen96/summer23_data624/raw/main/project_2/StudentEvaluation-%20TO%20PREDICT.csv')
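
For reference, the original Excel workbooks could also be read directly rather than converting to CSV. The sketch below uses the readxl package and assumes locally downloaded copies; the file names are hypothetical, not files in the repository.

# Alternative (sketch only): read the original Excel files with readxl.
# The file names below are assumed local copies and are illustrative only.
library(readxl)
dfm_raw_xlsx <- read_excel('StudentData - TO MODEL.xlsx')
dfe_raw_xlsx <- read_excel('StudentEvaluation - TO PREDICT.xlsx')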

Preliminary view

A first look at the data is as follows.

# Move outcome variable pH to front for easier access
dfm <- dfm_raw %>%
    dplyr::select(PH, !matches('Brand.Code'), Brand.Code)
str(dfm)
## 'data.frame':    2571 obs. of  33 variables:
##  $ PH               : num  8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
##  $ Carb.Volume      : num  5.34 5.43 5.29 5.44 5.49 ...
##  $ Fill.Ounces      : num  24 24 24.1 24 24.3 ...
##  $ PC.Volume        : num  0.263 0.239 0.263 0.293 0.111 ...
##  $ Carb.Pressure    : num  68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
##  $ Carb.Temp        : num  141 140 145 133 137 ...
##  $ PSC              : num  0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
##  $ PSC.Fill         : num  0.26 0.22 0.34 0.42 0.16 0.24 0.4 0.34 0.12 0.24 ...
##  $ PSC.CO2          : num  0.04 0.04 0.16 0.04 0.12 0.04 0.04 0.04 0.14 0.06 ...
##  $ Mnf.Flow         : num  -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
##  $ Carb.Pressure1   : num  119 122 120 115 118 ...
##  $ Fill.Pressure    : num  46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
##  $ Hyd.Pressure1    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure2    : num  NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure3    : num  NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure4    : int  118 106 82 92 92 116 124 132 90 108 ...
##  $ Filler.Level     : num  121 119 120 118 119 ...
##  $ Filler.Speed     : int  4002 3986 4020 4012 4010 4014 NA 1004 4014 4028 ...
##  $ Temperature      : num  66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
##  $ Usage.cont       : num  16.2 19.9 17.8 17.4 17.7 ...
##  $ Carb.Flow        : int  2932 3144 2914 3062 3054 2948 30 684 2902 3038 ...
##  $ Density          : num  0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
##  $ MFR              : num  725 727 735 731 723 ...
##  $ Balling          : num  1.4 1.5 3.14 3.04 3.04 ...
##  $ Pressure.Vacuum  : num  -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
##  $ Oxygen.Filler    : num  0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
##  $ Bowl.Setpoint    : int  120 120 120 120 120 120 120 120 120 120 ...
##  $ Pressure.Setpoint: num  46.4 46.8 46.6 46 46 46 46 46 46 46 ...
##  $ Air.Pressurer    : num  143 143 142 146 146 ...
##  $ Alch.Rel         : num  6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
##  $ Carb.Rel         : num  5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
##  $ Balling.Lvl      : num  1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...
##  $ Brand.Code       : chr  "B" "A" "B" "A" ...
summary(dfm)
##        PH         Carb.Volume     Fill.Ounces      PC.Volume      
##  Min.   :7.880   Min.   :5.040   Min.   :23.63   Min.   :0.07933  
##  1st Qu.:8.440   1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23917  
##  Median :8.540   Median :5.347   Median :23.97   Median :0.27133  
##  Mean   :8.546   Mean   :5.370   Mean   :23.97   Mean   :0.27712  
##  3rd Qu.:8.680   3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31200  
##  Max.   :9.360   Max.   :5.700   Max.   :24.32   Max.   :0.47800  
##  NA's   :4       NA's   :10      NA's   :38      NA's   :39       
##  Carb.Pressure     Carb.Temp          PSC             PSC.Fill     
##  Min.   :57.00   Min.   :128.6   Min.   :0.00200   Min.   :0.0000  
##  1st Qu.:65.60   1st Qu.:138.4   1st Qu.:0.04800   1st Qu.:0.1000  
##  Median :68.20   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.19   Mean   :141.1   Mean   :0.08457   Mean   :0.1954  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :79.40   Max.   :154.0   Max.   :0.27000   Max.   :0.6200  
##  NA's   :27      NA's   :26      NA's   :33        NA's   :23      
##     PSC.CO2           Mnf.Flow       Carb.Pressure1  Fill.Pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :105.6   Min.   :34.60  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:119.0   1st Qu.:46.00  
##  Median :0.04000   Median :  65.20   Median :123.2   Median :46.40  
##  Mean   :0.05641   Mean   :  24.57   Mean   :122.6   Mean   :47.92  
##  3rd Qu.:0.08000   3rd Qu.: 140.80   3rd Qu.:125.4   3rd Qu.:50.00  
##  Max.   :0.24000   Max.   : 229.40   Max.   :140.2   Max.   :60.40  
##  NA's   :39        NA's   :2         NA's   :32      NA's   :22     
##  Hyd.Pressure1   Hyd.Pressure2   Hyd.Pressure3   Hyd.Pressure4   
##  Min.   :-0.80   Min.   : 0.00   Min.   :-1.20   Min.   : 52.00  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 86.00  
##  Median :11.40   Median :28.60   Median :27.60   Median : 96.00  
##  Mean   :12.44   Mean   :20.96   Mean   :20.46   Mean   : 96.29  
##  3rd Qu.:20.20   3rd Qu.:34.60   3rd Qu.:33.40   3rd Qu.:102.00  
##  Max.   :58.00   Max.   :59.40   Max.   :50.00   Max.   :142.00  
##  NA's   :11      NA's   :15      NA's   :15      NA's   :30      
##   Filler.Level    Filler.Speed   Temperature      Usage.cont      Carb.Flow   
##  Min.   : 55.8   Min.   : 998   Min.   :63.60   Min.   :12.08   Min.   :  26  
##  1st Qu.: 98.3   1st Qu.:3888   1st Qu.:65.20   1st Qu.:18.36   1st Qu.:1144  
##  Median :118.4   Median :3982   Median :65.60   Median :21.79   Median :3028  
##  Mean   :109.3   Mean   :3687   Mean   :65.97   Mean   :20.99   Mean   :2468  
##  3rd Qu.:120.0   3rd Qu.:3998   3rd Qu.:66.40   3rd Qu.:23.75   3rd Qu.:3186  
##  Max.   :161.2   Max.   :4030   Max.   :76.20   Max.   :25.90   Max.   :5104  
##  NA's   :20      NA's   :57     NA's   :14      NA's   :5       NA's   :2     
##     Density           MFR           Balling       Pressure.Vacuum 
##  Min.   :0.240   Min.   : 31.4   Min.   :-0.170   Min.   :-6.600  
##  1st Qu.:0.900   1st Qu.:706.3   1st Qu.: 1.496   1st Qu.:-5.600  
##  Median :0.980   Median :724.0   Median : 1.648   Median :-5.400  
##  Mean   :1.174   Mean   :704.0   Mean   : 2.198   Mean   :-5.216  
##  3rd Qu.:1.620   3rd Qu.:731.0   3rd Qu.: 3.292   3rd Qu.:-5.000  
##  Max.   :1.920   Max.   :868.6   Max.   : 4.012   Max.   :-3.600  
##  NA's   :1       NA's   :212     NA's   :1                        
##  Oxygen.Filler     Bowl.Setpoint   Pressure.Setpoint Air.Pressurer  
##  Min.   :0.00240   Min.   : 70.0   Min.   :44.00     Min.   :140.8  
##  1st Qu.:0.02200   1st Qu.:100.0   1st Qu.:46.00     1st Qu.:142.2  
##  Median :0.03340   Median :120.0   Median :46.00     Median :142.6  
##  Mean   :0.04684   Mean   :109.3   Mean   :47.62     Mean   :142.8  
##  3rd Qu.:0.06000   3rd Qu.:120.0   3rd Qu.:50.00     3rd Qu.:143.0  
##  Max.   :0.40000   Max.   :140.0   Max.   :52.00     Max.   :148.2  
##  NA's   :12        NA's   :2       NA's   :12                       
##     Alch.Rel        Carb.Rel      Balling.Lvl    Brand.Code       
##  Min.   :5.280   Min.   :4.960   Min.   :0.00   Length:2571       
##  1st Qu.:6.540   1st Qu.:5.340   1st Qu.:1.38   Class :character  
##  Median :6.560   Median :5.400   Median :1.48   Mode  :character  
##  Mean   :6.897   Mean   :5.437   Mean   :2.05                     
##  3rd Qu.:7.240   3rd Qu.:5.540   3rd Qu.:3.14                     
##  Max.   :8.620   Max.   :6.060   Max.   :3.66                     
##  NA's   :9       NA's   :10      NA's   :1
kable(table(sapply(dfm_raw, class)), col.names=c("type","count")) |> kable_styling()
type        count
character       1
integer         4
numeric        28

We note several things from a cursory first glance:

  1. There are 2571 observations in 33 variables.
  2. The outcome variable, pH, is continuous, so this is a regression problem.
  3. Of the 32 predictors, 31 are quantitative variables, and there is a single nominal categorical variable (Brand.Code).
  4. There are a number of missing values, but not at a rate that would prevent us from fitting a model to the data. We will investigate further to determine how best to handle these data points.
  5. Little information was given about the data, so we can’t infer units for the variables (for example, it is unknown whether the temperature and pressure variables are in English or metric units).

Let’s take a look at the distribution of the numerical predictors.

dfm |>
  select(-c(PH,Brand.Code)) |>
  gather(key = "predictor", value = "value") |>
  ggplot(aes(x = value)) +
  geom_density(fill="lightblue") +
  facet_wrap(~ predictor, scales = "free") +
  theme_minimal() + 
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank())

We can observe that most of the variables exhibit either bimodality or skew, suggesting that a nonlinear model may be appropriate. Because of the bimodality, violin plots are preferable to boxplots for observing outliers (see the Outliers section below).

We’ll need to do a small amount of data preparation first: the categorical variable Brand.Code will need to be converted to a factor with numerical levels, as some functions we’ll use later require it (for example, the near-zero variance calculations seen later). This will only be useful for initial data exploration; for modeling purposes, we’ll need to handle this variable differently (see Data Preparation below).

dfm |>
  select(where(is.character)) |>
  table() |>
  prop.table() |>
  barplot(main="Distribution of Brand Codes", col="lightblue", ylab="Freq")

# Factor categorical variable Brand.Code
dfm$Brand.Code <- factor(dfm$Brand.Code)

Feature plots

To visually evaluate our data set, we’ll generate feature plots of the continuous variables and a boxplot of the single categorical variable (Brand.Code). These plots show the relationship of each continuous variable against the outcome variable (pH) and can be used to detect how individual predictors influence the outcome.

# Feature plots - exclude the last column, which is Brand.Code (a categorical variable)
featurePlot(x=dfm[,2:(ncol(dfm)-1)], y=dfm$PH, plot='scatter', main='Feature plots against PH', type=c('p', 'smooth'), span=0.5, layout=c(8, 4), col.line='red')

# Boxplot of Brand.Code against PH
dfm %>%
    filter(!is.na(Brand.Code) & !is.na(PH)) %>%
    ggplot(aes(x=Brand.Code, y=PH)) +
    geom_boxplot()

At first glance there doesn’t appear to be any strong relationship between pH and any of the predictors. There is a slightly positive relationship between pH and some variables like pressure vacuum, but in general no strong trends are evident.

We also note that some of the observations have a blank Brand.Code, which we’ll have to handle during data preparation.

Collinearity

We’ll also look for collinearity among variables. Collinearity is a problem for linear regression-based models, as these models can generate severely inflated coefficients for predictors that are strongly collinear. In these cases, we have several options:

  1. Retain only a single predictor among those that are collinear relative to each other.
  2. Transform the predictors such that collinearity is minimized, for example using principal component analysis (PCA), partial least squares (PLS), or linear discriminant analysis (LDA).
  3. Employ a model that is insensitive (or less sensitive) to collinear predictors, for example ridge, lasso, PLS, or multivariate adaptive regression splines (MARS).

To examine collinearity in our data set, we’ll use the corrplot() function from the package of the same name to generate a correlation plot. This shows strongly correlated variables in darker colors and with larger squares. Positive correlations are shown in blue, while negative correlations are in red. Correlation values of 1.0 indicate perfect positive correlation between variables, while values of -1.0 indicate a perfect inverse correlation. The plot is clustered by variables that exhibit strong correlations with each other, which can provide some indication of multicollinearity among groups of variables.

# Generate corr plot for pairwise correlations
corr1 <- cor(dfm[,1:(ncol(dfm)-1)], use='complete')
corrplot::corrplot(corr1, method='square', order='hclust', type='full', diag=T, tl.cex=0.75, cl.cex=0.75)

As shown in the correlation plot, there are some strong correlations among the following six variables:

  • Carb.Volume
  • Carb.Rel
  • Alch.Rel
  • Density
  • Balling
  • Balling.Lvl

It may be advantageous to remove one or more of these variables, transform them to remove collinearity, or choose a model that is robust to collinearity. Interestingly, these same six variables exhibit a relatively strong inverse relationship with Hyd.Pressure4.

Multicollinearity

In addition to collinearity, which evaluates relationships between pairs of predictors, we’ll look at multicollinearity, which indicates how strongly each predictor can be explained by the remaining predictors taken together. To evaluate multicollinearity, we’ll fit a simple linear model to the data using the full set of predictors. Then, we’ll use the vif() function in the car package to estimate the variance inflation factor (VIF), a statistic purpose-built to measure multicollinearity among variables. Values less than or equal to 1 indicate no multicollinearity, values between 1 and 5 indicate moderate multicollinearity, and values greater than 5 indicate strong multicollinearity (Glen, 2023).

# Examine multicollinearity by generating a simple linear model of the data,
# then looking at the variance inflation factor (VIF) of each variable.
#     VIF <= 1:      low multicollinearity
#     1 < VIF <= 5:  moderate multicollinearity
#     VIF > 5:       high multicollinearity
lmod <- lm(PH ~ ., data=dfm)
dfvif <- data.frame(vif(lmod))
dfvif <- dfvif %>%
    mutate(Predictor=row.names(dfvif)) %>%
    rename(VIF=GVIF) %>%
    dplyr::select(Predictor, VIF) %>%
    arrange(desc(VIF)) %>%
    mutate(Rank=rank(-VIF))
dfvif %>%
    flextable() %>%
    width(width = 2) %>%
    fontsize(size = 10) %>%
    line_spacing(space = 1) %>%
    hline(part = "all") %>%
    set_caption('Variance Inflation Factor')
Variance Inflation Factor

Predictor            VIF          Rank
Balling.Lvl          196.699873    1
Balling              168.721067    2
Brand.Code           161.462006    3
Carb.Pressure         43.629727    4
Carb.Temp             35.744235    5
Alch.Rel              33.466584    6
Bowl.Setpoint         26.051364    7
Filler.Level          25.385854    8
Carb.Volume           17.895899    9
Density               16.438808   10
Hyd.Pressure3         12.964503   11
MFR                   11.435430   12
Filler.Speed          11.224377   13
Hyd.Pressure2         10.635457   14
Carb.Rel               7.483881   15
Mnf.Flow               4.979607   16
Pressure.Vacuum        4.289511   17
Fill.Pressure          3.741073   18
Pressure.Setpoint      3.486682   19
Hyd.Pressure1          3.000519   20
Hyd.Pressure4          2.686600   21
Carb.Flow              2.256335   22
Usage.cont             1.773788   23
PC.Volume              1.712473   24
Oxygen.Filler          1.579977   25
Carb.Pressure1         1.467316   26
Temperature            1.390864   27
Air.Pressurer          1.234829   28
Fill.Ounces            1.181213   29
PSC                    1.163257   30
PSC.Fill               1.112048   31
PSC.CO2                1.067962   32

As shown in the table, almost half (15) of the variables have a VIF greater than 5, indicating that they exhibit strong multicollinearity. This will inform our decision on the type of model to employ.

It should be noted that, while collinearity and multicollinearity adversely affect a linear regression-based model’s capacity to generate useful information about the contribution of each predictor to the outcome, these conditions don’t necessarily impede the model’s predictive performance. Therefore, the presence of collinear or multicollinear predictors doesn’t automatically preclude us from using these types of models. [need citation]

Near-zero variance

We’ll look for variables having near-zero variance (NZV) since some models are sensitive to this condition. A variable exhibiting NZV is characterized as having only a single value (or a very small number of values) across the entire range of observations. These types of “degenerate” distributions can be problematic for models such as those based on linear regression. If NZV variables are observed, either they should be removed or a model that is immune to NZV variables should be used (e.g. tree-based models).

To calculate NZV, we’ll use the nearZeroVar() function from the caret package. This calculates the ratio of the frequency of the most common value in each variable to that of the second most common value. If the ratio is high, this indicates NZV conditions, and the variable may need special handling if a model that is sensitive to NZV variables is being used (Kuhn, 2022).

The following table shows the frequency ratios of variables in the data set in descending order.

# Look for NZV variables
nzv <- nearZeroVar(dfm, saveMetrics=T)
nzv$Variable <- row.names(nzv)
nzv %>% dplyr::select(Variable, everything()) %>%
    arrange(desc(freqRatio)) %>%
    flextable() %>%
    width(width = 1) %>%
    fontsize(size = 10) %>%
    line_spacing(space = 1) %>%
    hline(part = "all") %>%
    set_caption('Frequency ratios of variables')
Frequency ratios of variables

Variable            freqRatio   percentUnique   zeroVar   nzv
Hyd.Pressure1       31.111111     9.5293660     FALSE     TRUE
Hyd.Pressure3       11.450704     7.4679113     FALSE     FALSE
Hyd.Pressure2        7.271028     8.0513419     FALSE     FALSE
Bowl.Setpoint        2.990847     0.4278491     FALSE     FALSE
Brand.Code           2.014634     0.1944769     FALSE     FALSE
Fill.Pressure        1.762931     4.2007001     FALSE     FALSE
Carb.Flow            1.586207    20.7312330     FALSE     FALSE
Pressure.Vacuum      1.389728     0.6223259     FALSE     FALSE
Pressure.Setpoint    1.319361     0.3111630     FALSE     FALSE
Balling.Lvl          1.294872     3.1894205     FALSE     FALSE
Oxygen.Filler        1.266667    13.1466356     FALSE     FALSE
MFR                  1.222222    22.8315830     FALSE     FALSE
PSC                  1.211538     5.0175029     FALSE     FALSE
Balling              1.193182     8.4402956     FALSE     FALSE
Fill.Ounces          1.163043     3.5783742     FALSE     FALSE
Density              1.108374     3.0338390     FALSE     FALSE
Usage.cont           1.105263    18.7086737     FALSE     FALSE
PC.Volume            1.100000    17.6584986     FALSE     FALSE
Alch.Rel             1.098315     2.0614547     FALSE     FALSE
Air.Pressurer        1.093407     1.2446519     FALSE     FALSE
PSC.Fill             1.091892     1.2446519     FALSE     FALSE
Filler.Level         1.086957    11.2018670     FALSE     FALSE
PSC.CO2              1.078303     0.5056398     FALSE     FALSE
PH                   1.078125     2.0225593     FALSE     FALSE
Carb.Volume          1.055556     3.9284325     FALSE     FALSE
Mnf.Flow             1.048443    18.9420459     FALSE     FALSE
Filler.Speed         1.045161     9.4904706     FALSE     FALSE
Temperature          1.040000     2.1781408     FALSE     FALSE
Carb.Pressure1       1.031746     5.4453520     FALSE     FALSE
Carb.Temp            1.016667     4.7841307     FALSE     FALSE
Carb.Pressure        1.016393     4.1229094     FALSE     FALSE
Carb.Rel             1.011583     1.6336056     FALSE     FALSE
Hyd.Pressure4        1.008264     1.5558149     FALSE     FALSE

As shown, only a single variable (Hyd.Pressure1) exhibited a high enough frequency ratio to be classified as an NZV variable. We’ll examine this variable in a bit more detail.

# Generate table of top unique values of Hyd.Pressure1
hpfreq <- dfm %>%
    group_by(Hyd.Pressure1) %>%
    summarize(Unique.count=n()) %>%
    arrange(desc(Unique.count))

# Tally up values other than the top 2
others <- hpfreq %>%
    filter(Unique.count < hpfreq[[2,'Unique.count']]) %>%
    summarize(Unique.count=sum(Unique.count))

# Add the "other values" row to the table
hpfreq <- hpfreq %>%
    head(2) %>%
    rbind(data.frame(Hyd.Pressure1='other values', Unique.count=as.numeric(others))) %>%
    rename(Value=Hyd.Pressure1) %>%
    mutate(Percent=round(100 * Unique.count / nrow(dfm), 2))

# Display the table
hpfreq %>%
    flextable() %>%
    width(width = 2) %>%
    fontsize(size = 10) %>%
    line_spacing(space = 1) %>%
    hline(part = "all") %>%
    set_caption('Hyd.Pressure1')
Hyd.Pressure1

Value          Unique.count   Percent
0                       840     32.67
12.4                     27      1.05
other values          1,704     66.28

As indicated in the table above, a large proportion of the observations have zero values (32.7%). Without knowing anything about how the data was collected or what the variable indicates, it is difficult to know the best way to treat the variable in terms of modeling. The best we can do is assume that zero values are correctly entered, structurally sound, and meaningful in the context of the data. Based on these assumptions, this variable is a candidate to be dropped if a linear regression-based model is used.

Missing values

Now we’ll examine missing values in the data set. As noted earlier, there aren’t that many, but some models are sensitive to missing values. As such, the following options are available:

  1. Discard observations that have missing values for many of the predictors.
  2. Discard entire predictors that have many values missing across a great number of observations.
  3. Impute (fill in) the missing values using any of a number of statistical methods available (e.g. mean, median, or K nearest neighbor modeling).

We’ll take a first look at missing values on a column-by-column basis.

# Count NAs per column
missing_values <- data.frame(colSums(is.na(dfm)))
missing_values$Variable <- row.names(missing_values)
colnames(missing_values) <- c('Missing.values', 'Variable')
missing_values %>%
    dplyr::select(Variable, Missing.values) %>%
    arrange(desc(Missing.values)) %>%
    flextable() %>%
    width(width = 2) %>%
    fontsize(size = 10) %>%
    line_spacing(space = 1) %>%
    hline(part = "all") %>%
    set_caption('Missing values')
Missing values

Variable            Missing.values
MFR                            212
Filler.Speed                    57
PC.Volume                       39
PSC.CO2                         39
Fill.Ounces                     38
PSC                             33
Carb.Pressure1                  32
Hyd.Pressure4                   30
Carb.Pressure                   27
Carb.Temp                       26
PSC.Fill                        23
Fill.Pressure                   22
Filler.Level                    20
Hyd.Pressure2                   15
Hyd.Pressure3                   15
Temperature                     14
Oxygen.Filler                   12
Pressure.Setpoint               12
Hyd.Pressure1                   11
Carb.Volume                     10
Carb.Rel                        10
Alch.Rel                         9
Usage.cont                       5
PH                               4
Mnf.Flow                         2
Carb.Flow                        2
Bowl.Setpoint                    2
Density                          1
Balling                          1
Balling.Lvl                      1
Pressure.Vacuum                  0
Air.Pressurer                    0
Brand.Code                       0

A good way to visualize missing values is to use the md.pattern() function from the mice package (van Buuren, 2011). This function produces a plot that shows patterns of missing data in order of increasing missing information. The numbers along the left indicate the number of observations that correspond to the pattern indicated by that row of squares. Blue squares indicate no missing values, while red squares indicate missing values. The numbers on the far right indicate the number of variables with missing data for that set of observations.

# Missing value patterns
invisible(md.pattern(dfm, rotate.names=T, plot=T))

As shown, there are 2129 observations for which no values are missing, or 82.8% of the data. We note that there are only a few rows with many red squares, indicating that few observations are missing a significant number of variables. Only a single observation is missing a double-digit number of predictor values (12); this row is likely a good candidate for removal. In addition, one of the predictors (MFR) contained a significant number of missing values across all observations (212 values, or 8.2% of the data) and should likely be removed. The remaining data can be imputed (filled in) using one of the statistical methods listed above.

Another way to see this clearly is with a column chart.

nas <- colMeans(is.na(dfm))

data.frame(variable = names(nas), missing = nas, row.names = NULL) |>
  ggplot(aes(y=reorder(variable,missing), x=missing)) +
  geom_col(fill="lightblue") +
  ggtitle('proportion of missing values') + xlab('') + ylab('')

An additional option is to use multivariate imputation by chained equations (MICE) (van Buuren et al., 2023). MICE uses fully conditional specification (FCS), which imputes each variable using its own method. The method can either be specified by the modeler or chosen from a set of defaults depending on the kind of variable; for example, there are default methods for continuous variables and others for categorical variables, with the latter depending on the number of factor levels.
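
As an illustration only (not run in this report), a MICE call with explicit per-variable methods might look like the sketch below. The method choices ('pmm' for numeric columns, 'polyreg' for the categorical Brand.Code) are assumptions for demonstration rather than settings we actually used.

# Sketch only: MICE imputation with explicit per-variable methods (illustrative)
library(mice)
meth <- make.method(dfm)                  # default imputation method per column
meth[sapply(dfm, is.numeric)] <- 'pmm'    # predictive mean matching for numeric columns
meth['Brand.Code'] <- 'polyreg'           # polytomous regression for the categorical variable
imp <- mice(dfm, m = 5, maxit = 5, method = meth, seed = 77)
dfm_mice <- complete(imp)                 # extract a single completed data set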

Skewness

Many types of models are sensitive to variables that exhibit skewed distributions, i.e. those that do not exhibit a typical bell-shaped pattern but instead have a disproportionate number of high or low values. Linear regression-based models are particularly susceptible to skewness.

To reduce skewness, skewed variables can often be mathematically transformed, for example by taking the natural logarithm or using a method such as the Box-Cox transformation, which empirically estimates an appropriate power transformation from the data (Box and Cox, 1964).

Box-Cox transformations only work on positive values, so if zero or negative values are encountered a different method must be used. If only zero values are observed, a constant (typically 1) can be added to the value before applying the Box-Cox transformation; however, there is debate on what constant to use in these cases. Alternatively, if both zero and negative values are present, a method such as one proposed by Yeo and Johnson (Yeo & Johnson, 2000) can be employed to transform the variable. The Yeo-Johnson method can accept positive, negative, or zero values and, thus, is more robust when values aren’t guaranteed to be positive.
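
For reference, both transformations are available in the caret package. The minimal sketch below is illustrative only (using Filler.Speed and PSC.CO2 as example columns) and is not part of our final pipeline, where the transformation is applied inside train().

# Sketch only: estimating power transformations with caret (illustrative)
library(caret)

# Box-Cox requires strictly positive values, so Filler.Speed (min ~998) qualifies
bc <- BoxCoxTrans(na.omit(dfm$Filler.Speed))          # empirically estimates lambda
filler_speed_bc <- predict(bc, na.omit(dfm$Filler.Speed))

# Yeo-Johnson also accepts zero and negative values, e.g. PSC.CO2 (min = 0)
yj <- preProcess(dfm['PSC.CO2'], method = 'YeoJohnson')
psc_co2_yj <- predict(yj, dfm['PSC.CO2'])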

To visually examine skewness, we’ll generate histograms of the continuous variables.

# Generate histograms of continuous variables
dfm %>%
    dplyr::select(c(1:(ncol(dfm)-1))) %>%
    gather() %>%
    ggplot(aes(x=value)) +
    geom_histogram(bins=20) +
    facet_wrap(~key, scales = 'free_x') +
    ggtitle('Histograms of continuous variables')

Some variables exhibit obviously skewed distributions (e.g. Filler.Speed, PSC, and Usage.cont), while others appear to be normally distributed (i.e., approximately symmetric), for example PH and PSC.Fill. Others, like Balling and Balling.Lvl, appear to be bimodal.

To gain a better understanding of which variables are skewed and to what degree, we can calculate skewness quantitatively using the e1071 package (Meyer, 2022). Variables whose skewness has an absolute value greater than 1 are considered highly skewed, those with absolute values between 0.5 and 1 are moderately skewed, and those with absolute values less than 0.5 can be considered approximately symmetric (Joanes and Gill, 1998).

The sign of the skewness value indicates in which direction the distribution is skewed: Left-skewed distributions will exhibit negative skewness values, while the sign of right-skewed distributions will be positive.

The following plot quantitatively enumerates the skewness of each variable, ordered by how heavily skewed the variable is.

# Look at skewness quantitatively
# As a general rule of thumb:
#     highly skewed:            |skewness| > 1
#     moderately skewed:        0.5 < |skewness| < 1
#     approximately symmetric:  0 < |skewness| < 0.5
# Negative values indicate left-skewed distributions; positive values indicate right-skewed distributions.

# Create function to calculate skewness (e1071::skewness() returns NA when missing values are present, so we drop them first)
calcSkewness <- function(x, type) {
    return(e1071::skewness(x[!is.na(x)], type=type))
}

# Calculate skewness on columns
colskewness <- data.frame(apply(dfm[,1:(ncol(dfm)-1)], 2, calcSkewness, type=1))
colskewness$Variable <- row.names(colskewness)
colnames(colskewness) <- c('Skewness', 'Variable')

# Graph skewness values
colskewness %>%
    dplyr::select(Variable, Skewness) %>%
    arrange(desc(abs(Skewness))) %>%
    ggplot(aes(x=reorder(Variable, desc(Skewness)), y=Skewness)) +
    geom_bar(stat='identity') +
    coord_flip() +
    ggtitle('Variable skewness') +
    xlab('')

As seen in the plot, six variables are heavily skewed, with the first two skewed left and the remainder skewed right:

  • MFR
  • Filler.Speed
  • Oxygen.Filler
  • Temperature
  • Air.Pressurer
  • PSC.CO2

This reinforces the idea that some type of transformation will be needed if a model is used that is sensitive to skewness (e.g. linear regression-based models). Alternatively, models that aren’t sensitive to skewness can be employed (e.g. tree-based models).

Outliers

Like collinearity and skewness, the presence of outliers in a data set can also negatively impact some models. Outliers are typically defined as those points which lie beyond a certain number of standard deviations from the mean (typically 2, 2.5, or 3 times the standard deviation). We’ll examine our data set for outlying values that fall under this definition, using a cutoff of 3 standard deviations.

# Create function to count the number of outliers outside of (mult) standard deviations from the mean
getOutliers <- function(x, mult) {
    mean_val <- mean(x[!is.na(x)])
    loval <- mean_val - (sd(x[!is.na(x)]) * mult)
    hival <- mean_val + (sd(x[!is.na(x)]) * mult)
    #return(vals[vals < loval | vals > hival])
    return(!is.na(x) & (x < loval | x > hival))
}

# Set multiplier; outliers are considered as such when they lie outside of (multiplier) standard deviations from the mean
mult <- 3
outliers <- apply(dfm[,1:(ncol(dfm)-1)], 2, getOutliers, mult=mult)
dfout <- data.frame(outliers)
dfout <- data.frame(colSums(outliers))
dfout$Variable <- row.names(dfout)
colnames(dfout) <- c('Outlier.count', 'Variable')

# Filter just those variables with outliers and sort by outlier count
dfout <- dfout %>%
    dplyr::select(Variable, Outlier.count) %>%
    arrange(desc(Outlier.count)) %>%
    filter(Outlier.count > 0)

# Generate bar graph of outlier counts
dfout %>%
    ggplot(aes(x=reorder(Variable, Outlier.count), y=Outlier.count, label=Outlier.count)) +
    geom_bar(stat='identity') +
    coord_flip() +
    geom_text(check_overlap=T, hjust=-0.1, nudge_x=0.05) +
    ggtitle('Outliers (only variables with outliers shown)') +
    xlab('') + ylab('Outlier count')

The graph above illustrates that Filler.Speed has the largest number of outliers (142, or 5.5% of the data). If models that are sensitive to outliers are used (e.g. linear regression-based models), the outliers must be handled in some way, either by removal or by an appropriate transformation.

To better understand the outliers, we plotted variables having outliers against the outcome variable, pH. In the plots below, outliers beyond three times the standard deviation appear in red.

# Create a function to filter and graph the outliers for a specific column
graphOutliers <- function(x, mult) {

    # Get an array of T/F values indicating whether each observation is an outlier
    outlier_array <- getOutliers(dfm[,x], mult)
    
    # Create data set consisting only of outliers
    df_outliers <- dfm[outlier_array,] %>%
        dplyr::select(PH, any_of(x)) %>%
        mutate(outlier=T)
    
    # Create data set consisting only of non-outliers
    df_non_outliers <- dfm[!outlier_array,] %>%
        dplyr::select(PH, any_of(x)) %>%
        mutate(outlier=F)
    
    # Bind the outlier data set to the non-outlier data set
    dftmp <- df_non_outliers %>%
        rbind(df_outliers)
    
    # Graph the outliers
    plt <- dftmp %>%
        filter(!is.na(PH) & !is.na(!!sym(x))) %>%
        ggplot(aes(x=!!sym(x), y=PH, color=outlier)) +
        geom_point() +
        scale_colour_manual(values=c('#e0e0e0', '#c02222')) +
        theme(legend.position='none')
    return(plt)

}

# Initialize list of plots
plts <- list()

# Iterate through all variables with outliers
for (i in 1:nrow(dfout)) {
    plts[[i]] <- graphOutliers(dfout[i, 'Variable'], 3)
}
do.call("grid.arrange", c(plts, ncol=3))

Based on the plot above, there don’t appear to be any outliers that are structurally unsound (i.e., those with impossible values, such as negative pH values). There are some values that are somewhat suspect (e.g. the six values on the far left and right sides of the Alch.Rel plot), but without a description of the data set or knowing how the data was collected, one can only hazard a guess as to which are valid and which are true outliers. Besides the suspect Alch.Rel data points, the remaining points appear to be valid (though extreme) values when compared to the main body of data points.

As mentioned earlier, because of the bimodality, violin plots are preferable to boxplots for observing outliers.

dfm |>
  select(-c(Brand.Code)) |>
  gather(key = "predictor", value = "value") |>
  ggplot(aes(x = predictor, y = value)) +
  geom_violin(fill="lightblue") +
  facet_wrap(~ predictor, scales = "free") +
  coord_flip() + 
  theme_minimal() + 
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank())

Modeling Approach

At this point we’ll stop to consider several significant factors that will influence our approach to modeling even before we begin cleaning and transforming the data. These factors are as follows:

  1. There do not appear to be overtly linear relationships between the outcome variable and the vast majority of predictors, as evidenced by the feature plots. This indicates that a linear regression-based approach may not yield optimal results.
  2. As shown in the correlation plot, there is significant collinearity between pairs of variables.
  3. About half of the variables exhibit heavy multicollinearity (see the table of VIFs above).
  4. Only one variable can be considered NZV, so this won’t be a strong influence on our modeling approach.
  5. Likewise, missing values are present but not in significant enough quantities to strongly influence how we’ll choose to model the data.
  6. Six variables are heavily skewed (see the histograms and skewness plot above) and would likely need to be transformed for any models that are sensitive to this condition.
  7. There are a significant number of outliers, but we can’t be certain if these are valid data points.

Considering these factors, we decided to use an approach to modeling similar to that described in Chapter 10 of our text (Kuhn and Johnson, 2013). Namely, we will create two data sets to be used for two purposes, as follows:

  1. Full data set: With minimal modification, we’ll use the full data set for those models which are robust to the confounding factors listed above (e.g. collinearity, skewness, and outliers).
  2. Reduced data set: For models such as those based on linear regression, we’ll prepare another data set that handles the above confounding factors. For example, skewed predictors will be transformed, variables exhibiting multicollinearity will be addressed, and, if feasible, outliers will be excluded.

Using the above approach will give us insights into the best way to model the data using all available models, while generally adhering to the recommended best practices for each model type.

Although we will create two data sets (full and reduced), there is nothing preventing us from running both data sets through all models. We expect linear regression-based models to be unstable and to perform poorly on the full set, but there is no inherent disadvantage to doing so.

Since the outcome variable (pH) is continuous, this is a regression problem, so we’ll naturally choose regression-based models. We’ll start by performing linear regression-based modeling, then move to non-linear regression models, followed by tree-based regression models.

Data Preparation

As stated, we’ll prepare two data sets: the full data set to use with more robust models, and the reduced data set to use with more sensitive models. But while some models are robust to many types of irregularities such as skewed variables and outliers, missing values can pose a problem even for these models and for many of the functions involved in preparing them. Therefore, we’ll need to handle missing values for both the full set and the reduced set.

Imputation

As previously noted, there is a single observation that has 12 missing predictor values. We’ll remove this observation as it likely would contribute little information to the model. In addition, we’ll remove the predictor (MFR) that contains a large number of missing values relative to the total number of observations; there were 212 missing values for this predictor, or 8.2% of the data.

# Init the full data set
dfm_full <- dfm

# Remove observation with 12 missing values
dfm_full <- dfm_full[(rowSums(is.na(dfm_full)) < 12),]

# Remove the MFR predictor because it has 212 missing values (8.2% of the data)
dfm_full <- dfm_full %>% select(-MFR)

For the remaining missing values, there are multiple imputation methods that can be used to fill in missing values. The MICE method is robust but is computationally intensive. The number of missing values in our data set is relatively small, so we’ll choose a simpler method from the DMwR2 package, the knnImputation() function (Torgo, 2016). knnImputation() uses a well-known statistical algorithm to find the “k” most closely related neighbors to observations having missing values. The value of k is selected by the modeler and for imputation purposes is quite arbitrary. In general, smaller values of k lead to faster but less stable predictions, while larger values of k are more computationally intensive but yield smoother results. We chose a value of 9 for k, which is within the typical range for these applications and provides a good balance of performance versus speed.

# Change blank brand codes to NA so they will be imputed
dfm_full <- dfm_full %>%
    mutate(Brand.Code=ifelse(Brand.Code=='', NA, as.character(Brand.Code)))

# Factor Brand.Code (needed for imputation)
dfm_full$Brand.Code <- factor(dfm_full$Brand.Code)

# MICE - not used due to computational intensiveness
#imp <- mice(dfm_full, maxit=5, m=5, seed=77)
#complete(imp)

# Use knnImputation to impute values
dfm_full <- knnImputation(dfm_full, k=9)

Dummy variables

One additional step we’ll need to take is to recode the categorical variable Brand.Code into so-called “dummy” variables. This creates a series of variables that take either a 0 or 1 as a value; each variable indicates the presence (value of 1) or absence (value of 0) of that particular Brand.Code. We’ll use the fastDummies package (Kaplan, 2020), which automatically creates dummy variables for factor- or character-type variables.

# Create dummies for factor and character variables
dfm_full <- dummy_cols(dfm_full, select_columns='Brand.Code', remove_first_dummy=F, remove_selected_columns=T)

It is noted that after the dummy columns are created, the data set now has 34 predictors instead of 31.

Full data set

Now that we’ve filled in the missing values, we have a full data set to use with models that are robust to the confounding problems previously discussed. Besides preparing the full and reduced data sets, we’ll further split each set into two additional sets: one set to train the model and another with which to test the results of the model. This is a standard practice that provides an honest estimate of how the model will perform on new data it hasn’t yet seen. It is accomplished by training the model with only a portion of the data (typically 75 or 80%) while withholding the rest to gauge the results. The training data typically further undergoes “cross-validation” (CV), another common technique that trains the model over multiple iterations, at each iteration withholding a different proportion of the data to tune the model. After the final iteration, the model with the optimal tuning parameters is selected, and that model is then used to evaluate the previously withheld test data.

We have chosen to use 80% of the data for training, withholding the remaining 20% for testing. And we’ll use 10-fold CV, a commonly accepted practice in modeling. We’ll use the createDataPartition() function from the caret package (Kuhn, 2022) which partitions the data evenly based on the outcome variable. This avoids a common problem when modeling data having a strong imbalance in values of the outcome variable.

# Create vector of training rows
set.seed(77)
train_rows <- createDataPartition(dfm_full$PH, p=0.8, list=F)

# Separate training from test
dfm_full_train <- dfm_full[train_rows,]
dfm_full_test <- dfm_full[-train_rows,]

# Separate outcome from predictors
trainx_full <- dfm_full_train[,2:ncol(dfm_full_train)]
trainy_full <- dfm_full_train[,1]
testx_full <- dfm_full_test[,2:ncol(dfm_full_test)]
testy_full <- dfm_full_test[,1]

# Generate a pretty table for the report
data.frame(Set=c('Training', 'Test'), Observations=c(nrow(dfm_full_train), nrow(dfm_full_test))) %>%
    flextable() %>%
    width(width = 2) %>%
    fontsize(size = 10) %>%
    line_spacing(space = 1) %>%
    hline(part = "all") %>%
    set_caption('Data partitioning (full data set)')
Data partitioning (full data set)

Set        Observations
Training          2,057
Test                513

It is important to note that some models are sensitive to the number of observations relative to the number of parameters. Models that rely on matrix inversion (such as ordinary least squares) cannot be solved if the number of parameters exceeds the number of observations. Cross-validation must also be taken into account when evaluating the observation-to-parameter ratio; for example, ten-fold CV only uses 90% of the observations for training. In our data set, we have 34 predictor variables and 513 observations in the smaller of the two partitions. Even applying the 90% factor to that smaller count leaves roughly 461 observations, so the observation-to-parameter ratio isn’t something we have to worry about in this situation.
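
A minimal sketch of that arithmetic is shown below (the counts are those reported above).

# Rough check of the observation-to-parameter ratio under 10-fold CV,
# using the smaller partition (513 rows) as a conservative bound
n_obs  <- 513                        # observations in the smaller partition
n_pred <- 34                         # predictors in the full data set
per_fold <- floor(n_obs * 9 / 10)    # ~461 observations used per CV fold
per_fold > n_pred                    # TRUE: far more observations than predictors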

Reduced data set

Now that we have a full data set with missing values handled, we can generate a second, reduced set to use with models that are sensitive to irregularities. We’ll first make a copy of the full set. Then we’ll need to remove one of the Brand.Code dummy columns, as including all categories of a categorical variable is problematic for linear-based models that use an intercept.

# First, make a copy the full set to the reduced set
dfm_red <- dfm_full

# Remove the first dummy Brand.Code from the reduced set, since this will be problematic for linear models which use a y-intercept
dfm_red <- dfm_red %>%
    select(-Brand.Code_A)

Collinearity

As we noted earlier, the data set exhibits some collinearity and multicollinearity. Based on the VIFs calculated previously, almost half the predictors exhibited strong multicollinearity. Since removing that many predictors would likely cause a detrimental loss of information, a different approach is warranted. We opted to use the approach presented by Kuhn and Johnson (2013), in which predictors are iteratively compared pairwise and the predictor with the largest overall correlation is removed from each highly correlated pair, until no pairwise correlations exceed a chosen threshold. While this procedure only addresses collinearity rather than multicollinearity, it can nonetheless yield significant improvements to model performance and stability.

We’ll use the findCorrelation() function to find candidate variables for removal due to collinearity.

# Get correlation table
corr2 <- cor(dfm_red[,2:ncol(dfm_red)], use='complete')
corr_vars <- findCorrelation(corr2, names=T, cutoff=0.9)
data.frame(Variable=corr_vars) %>%
    arrange(Variable) %>%
    flextable() %>%
    width(width = 2) %>%
    fontsize(size = 10) %>%
    line_spacing(space = 1) %>%
    hline(part = "all") %>%
    set_caption('Highly correlated variables') %>%
    delete_part( part = "header")
Highly correlated variables

Alch.Rel
Balling
Balling.Lvl
Filler.Level
Hyd.Pressure3

This is a manageable number of variables, so we’ll proceed with removing them from our data set.

# Remove highly correlated variables from the reduced data set
dfm_red <- dfm_red[,-match(corr_vars, colnames(dfm_red))]

Near-zero variance

There is only a single predictor with NZV (Hyd.Pressure1). Since this condition can negatively impact linear regression-based models, we’ll remove that from the reduced data set.

# Remove NZV variables from the reduced data set
dfm_red <- dfm_red[,-match('Hyd.Pressure1', colnames(dfm_red))]

Skewness

We’ll first check for zero or negative values for the variables that were earlier identified as skewed. If there are any such values, the Box-Cox transformation will be infeasible and we’ll have to use a different method.

# Define variables id'ed as skewed from earlier
skewed_vars <- c('Filler.Speed', 'Oxygen.Filler', 'Temperature', 'Air.Pressurer', 'PSC.CO2')

# Generate summary to look at min/max values - we're looking for zero or negative values
summary(dfm_red[,skewed_vars])
##   Filler.Speed  Oxygen.Filler     Temperature    Air.Pressurer  
##  Min.   : 998   Min.   :0.0024   Min.   :63.60   Min.   :140.8  
##  1st Qu.:3824   1st Qu.:0.0220   1st Qu.:65.20   1st Qu.:142.2  
##  Median :3980   Median :0.0334   Median :65.60   Median :142.6  
##  Mean   :3649   Mean   :0.0469   Mean   :65.97   Mean   :142.8  
##  3rd Qu.:3996   3rd Qu.:0.0600   3rd Qu.:66.40   3rd Qu.:143.0  
##  Max.   :4030   Max.   :0.4000   Max.   :76.20   Max.   :148.2  
##     PSC.CO2       
##  Min.   :0.00000  
##  1st Qu.:0.02000  
##  Median :0.04000  
##  Mean   :0.05649  
##  3rd Qu.:0.08000  
##  Max.   :0.24000

Since PSC.CO2 contains zero values, we’ll use the Yeo-Johnson transformation instead. Because the caret package supports Yeo-Johnson natively, we’ll perform the transformation as part of the preprocessing step during modeling. This has the advantage that the same transformation is applied automatically to new data at prediction time, rather than requiring a manual transformation.

Outliers

Based on the outlier plots from earlier, most of the data points beyond the three-standard-deviation cutoff appear to be valid, though located on the extreme outer edges of the main body of data. One exception is Alch.Rel, which includes six somewhat suspect points located along the left- and right-hand margins of the plot. The other possible exception is a data point exhibiting a PH value of 9.36. It is unclear whether these are truly outliers that should be excluded; in any case, Alch.Rel was already identified as exhibiting high collinearity and has already been removed from the reduced data set.

# Remove suspect PH data point
dfm_red <- dfm_red[dfm_red$PH < 9.36,]

# Alch.Rel has already been removed, so no need to do anything there with outliers

# Find cutoff values for Alch.Rel (three sd's)
#loval <- mean(dfm_red$Alch.Rel) - (3 * sd(dfm_red$Alch.Rel))
#hival <- mean(dfm_red$Alch.Rel) + (3 * sd(dfm_red$Alch.Rel))

# Remove suspect Alch.Rel data points
#dfm_red <- dfm_red[dfm_red$Alch.Rel >= loval & dfm_red$Alch.Rel <= hival,]

After all modifications to the reduced data set have been made, our final version of the reduced set contains 2569 observations of 27 predictors, down from 2570 observations of 34 predictors in the full set. As with the full data set, we’ll split the reduced set into training and test sets and verify that the number of predictors doesn’t exceed the number of observations.

# Create vector of training rows
set.seed(77)
train_rows <- createDataPartition(dfm_red$PH, p=0.8, list=F)

# Separate training from test
dfm_red_train <- dfm_red[train_rows,]
dfm_red_test <- dfm_red[-train_rows,]

# Separate outcome from predictors
trainx_red <- dfm_red_train[,2:ncol(dfm_red_train)]
trainy_red <- dfm_red_train[,1]

# Separate outcome from predictors
testx_red <- dfm_red_test[,2:ncol(dfm_red_test)]
testy_red <- dfm_red_test[,1]

# Generate a pretty table for the report
data.frame(Set=c('Training', 'Test'), Observations=c(nrow(dfm_red_train), nrow(dfm_red_test))) %>%
    flextable() %>%
    width(width = 2) %>%
    fontsize(size = 10) %>%
    line_spacing(space = 1) %>%
    hline(part = "all") %>%
    set_caption('Data partitioning (reduced data set)')
Data partitioning (reduced data set)

Set        Observations
Training          2,057
Test                512

The table above shows that the test set contains 512 observations, far exceeding the number of predictors (27), so we won’t need to worry about models that perform poorly when there are more predictors than observations.

Modeling

Now that we have a full and a reduced data set, each split into training and test sets, we can begin modeling. As we mentioned, we’ll run both data sets through all models, as there is no inherent disadvantage to doing so, while recognizing that some models will be unstable and perform poorly.

We’ll start by performing linear regression-based modeling, then move to non-linear regression models, followed by tree-based regression models.

# Create a data frame to store the results
dfr <- data.frame(matrix(nrow=0, ncol=8))
colnames(dfr) <- c('Data.Set', 'Model', 'Model.Class', 'Tuning.Parameters', 'RMSE.Train', 'RMSE.Test', 'MAPE.Test', 'Train.Time')

# specify 10x cross-validation
ctrl <- trainControl(method='cv', number=10)

# Create function to calculate MAPE
mape <- function(predicted, actual) {
    mape <- mean(abs((actual - predicted) / actual)) * 100
    return (mape)
}

# Parallel processing; detect number of cores and start a parallel socket cluster;
# all subsequent calls to caret::train() will use this cluster
num_cores <- detectCores()
cl <- makePSOCKcluster(num_cores)
registerDoParallel(cl)

Linear regression

# Reduced model

# Linear model
tstart <- Sys.time()
set.seed(77)
fitlm1 <- train(x=trainx_red, y=trainy_red, method='lm', trControl=ctrl, preProcess=c('center', 'scale', 'YeoJohnson'))
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitlm1
predlm1 <- predict(fitlm1, testx_red)
dfr[1,] <- data.frame(
    Data.Set='Reduced', 
    Model='Linear model', 
    Model.Class='Linear',
    Tuning.Parameters='', 
    RMSE.Train=min(fitlm1$results[['RMSE']]),
    RMSE.Test=postResample(predlm1, testy_red)[['RMSE']],
    MAPE.Test=mape(predlm1, testy_red),
    Train.Time=round(telapsed, 3)
)


# Stepwise linear model using the leaps package via caret
stepGrid <- data.frame(.nvmax=seq(1, round(ncol(trainx_red) / 2, 0)))  # Max number of parameters to use
tstart <- Sys.time()
set.seed(77)
fitlm2 <- train(x=trainx_red, y=trainy_red, method='leapSeq', trControl=ctrl, tuneGrid=stepGrid, preProcess=c('center', 'scale', 'YeoJohnson'))
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitlm2
predlm2 <- predict(fitlm2, testx_red)
dfr[2,] <- data.frame(
    Data.Set='Reduced', 
    Model='Stepwise linear model', 
    Model.Class='Linear',
    Tuning.Parameters=paste0('nvmax=', fitlm2$bestTune[['nvmax']]), 
    RMSE.Train=min(fitlm2$results[['RMSE']]),
    RMSE.Test=postResample(predlm2, testy_red)[['RMSE']],
    MAPE.Test=mape(predlm2, testy_red),
    Train.Time=round(telapsed, 3)
)
# Full model

# Linear model
tstart <- Sys.time()
set.seed(77)
fitlm1_full <- train(x=trainx_full, y=trainy_full, method='lm', trControl=ctrl, preProcess=c('center', 'scale', 'YeoJohnson'))
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitlm1_full
predlm1_full <- predict(fitlm1_full, testx_full)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
dfr[19,] <- data.frame(
    Data.Set='Full', 
    Model='Linear model', 
    Model.Class='Linear',
    Tuning.Parameters='', 
    RMSE.Train=min(fitlm1_full$results[['RMSE']]),
    RMSE.Test=postResample(predlm1_full, testy_full)[['RMSE']],
    MAPE.Test=mape(predlm1_full, testy_full),
    Train.Time=round(telapsed, 3)
)

# Stepwise linear model using the leaps package via caret
stepGrid <- data.frame(.nvmax=seq(1, round(ncol(trainx_full) / 2, 0)))  # Max number of parameters to use
tstart <- Sys.time()
set.seed(77)
fitlm2_full <- train(x=trainx_full, y=trainy_full, method='leapSeq', trControl=ctrl, tuneGrid=stepGrid, preProcess=c('center', 'scale', 'YeoJohnson'))
## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 1
## linear dependencies found
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitlm2_full
predlm2_full <- predict(fitlm2_full, testx_full)
dfr[20,] <- data.frame(
    Data.Set='Full', 
    Model='Stepwise linear model', 
    Model.Class='Linear',
    Tuning.Parameters=paste0('nvmax=', fitlm2_full$bestTune[['nvmax']]), 
    RMSE.Train=min(fitlm2_full$results[['RMSE']]),
    RMSE.Test=postResample(predlm2_full, testy_full)[['RMSE']],
    MAPE.Test=mape(predlm2_full, testy_full),
    Train.Time=round(telapsed, 3)
)

Robust linear regression

# Reduced model

# Robust linear regression
tstart <- Sys.time()
set.seed(77)
fitrlm <- train(x=trainx_red, y=trainy_red, method='rlm', preProcess=c('center', 'scale', 'pca'), trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitrlm
predrlm <- predict(fitrlm, testx_red)
dfr[3,] <- data.frame(
    Data.Set='Reduced', 
    Model='Robust linear regression', 
    Model.Class='Linear',
    Tuning.Parameters='', 
    RMSE.Train=min(fitrlm$results[['RMSE']]),
    RMSE.Test=postResample(predrlm, testy_red)[['RMSE']],
    MAPE.Test=mape(predrlm, testy_red),
    Train.Time=round(telapsed, 3)
)
# Full model

# Robust linear regression
tstart <- Sys.time()
set.seed(77)
fitrlm_full <- train(x=trainx_full, y=trainy_full, method='rlm', preProcess=c('center', 'scale', 'pca'), trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitrlm_full
predrlm_full <- predict(fitrlm_full, testx_full)
dfr[21,] <- data.frame(
    Data.Set='Full', 
    Model='Robust linear regression', 
    Model.Class='Linear',
    Tuning.Parameters='', 
    RMSE.Train=min(fitrlm_full$results[['RMSE']]),
    RMSE.Test=postResample(predrlm_full, testy_full)[['RMSE']],
    MAPE.Test=mape(predrlm_full, testy_full),
    Train.Time=round(telapsed, 3)
)

Partial least squares

# Reduced model

# PLS
tstart <- Sys.time()
set.seed(77)
fitpls <- train(x=trainx_red, y=trainy_red, method='pls', preProcess=c('center', 'scale'), trControl=ctrl, tuneLength=nrow(trainx_red) / 2)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitpls
predpls <- predict(fitpls, testx_red)
dfr[4,] <- data.frame(
    Data.Set='Reduced', 
    Model='Partial least squares', 
    Model.Class='Linear',
    Tuning.Parameters=paste0('ncomp=', fitpls$bestTune), 
    RMSE.Train=min(fitpls$results[['RMSE']]),
    RMSE.Test=postResample(predpls, testy_red)[['RMSE']],
    MAPE.Test=mape(predpls, testy_red),
    Train.Time=round(telapsed, 3)
)
# Full model

# PLS
tstart <- Sys.time()
set.seed(77)
fitpls_full <- train(x=trainx_full, y=trainy_full, method='pls', preProcess=c('center', 'scale'), trControl=ctrl, tuneLength=nrow(trainx_full) / 2)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitpls_full
predpls_full <- predict(fitpls_full, testx_full)
dfr[22,] <- data.frame(
    Data.Set='Full', 
    Model='Partial least squares', 
    Model.Class='Linear',
    Tuning.Parameters=paste0('ncomp=', fitpls_full$bestTune), 
    RMSE.Train=min(fitpls_full$results[['RMSE']]),
    RMSE.Test=postResample(predpls_full, testy_full)[['RMSE']],
    MAPE.Test=mape(predpls_full, testy_full),
    Train.Time=round(telapsed, 3)
)

Ridge regression

# Reduced model

# Ridge regression
tstart <- Sys.time()
set.seed(77)
ridgeGrid <- data.frame(.lambda=seq(0, 0.1, length=15))
fitridge <- train(x=trainx_red, y=trainy_red, method='ridge', preProcess=c('center', 'scale'), trControl=ctrl, tuneGrid=ridgeGrid)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitridge
predridge <- predict(fitridge, testx_red)
dfr[5,] <- data.frame(
    Data.Set='Reduced', 
    Model='Ridge regression', 
    Model.Class='Linear',
    Tuning.Parameters=paste0('lambda=', round(fitridge$bestTune[['lambda']], 4)), 
    RMSE.Train=min(fitridge$results[['RMSE']]),
    RMSE.Test=postResample(predridge, testy_red)[['RMSE']],
    MAPE.Test=mape(predridge, testy_red),
    Train.Time=round(telapsed, 3)
)
# Full model

# Ridge regression
tstart <- Sys.time()
set.seed(77)
ridgeGrid <- data.frame(.lambda=seq(0, 0.1, length=15))
fitridge_full <- train(x=trainx_full, y=trainy_full, method='ridge', preProcess=c('center', 'scale'), trControl=ctrl, tuneGrid=ridgeGrid)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitridge_full
predridge_full <- predict(fitridge_full, testx_full)
dfr[23,] <- data.frame(
    Data.Set='Full', 
    Model='Ridge regression', 
    Model.Class='Linear',
    Tuning.Parameters=paste0('lambda=', round(fitridge_full$bestTune[['lambda']], 4)), 
    RMSE.Train=min(fitridge_full$results[['RMSE']]),
    RMSE.Test=postResample(predridge_full, testy_full)[['RMSE']],
    MAPE.Test=mape(predridge_full, testy_full),
    Train.Time=round(telapsed, 3)
)

Lasso regression

# Reduced model

# Lasso regression
enetGrid <- expand.grid(.lambda=c(0, 0.01, 0.1), .fraction=seq(0.05, 1, length=20))
tstart <- Sys.time()
set.seed(77)
fitlasso <- train(x=trainx_red, y=trainy_red, method='enet', preProcess=c('center', 'scale'), trControl=ctrl, tuneGrid=enetGrid)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitlasso
predlasso <- predict(fitlasso, testx_red)
dfr[6,] <- data.frame(
    Data.Set='Reduced',
    Model='Lasso (elastic net)', 
    Model.Class='Linear',
    Tuning.Parameters=paste0('lambda=', fitlasso$bestTune[['lambda']], ', fraction=', fitlasso$bestTune[['fraction']]), 
    RMSE.Train=min(fitlasso$results[['RMSE']]),
    RMSE.Test=postResample(predlasso, testy_red)[['RMSE']],
    MAPE.Test=mape(predlasso, testy_red),
    Train.Time=round(telapsed, 3)
)
# Full model

# Lasso regression
enetGrid <- expand.grid(.lambda=c(0, 0.01, 0.1), .fraction=seq(0.05, 1, length=20))
tstart <- Sys.time()
set.seed(77)
fitlasso_full <- train(x=trainx_full, y=trainy_full, method='enet', preProcess=c('center', 'scale'), trControl=ctrl, tuneGrid=enetGrid)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitlasso_full
predlasso_full <- predict(fitlasso_full, testx_full)
dfr[24,] <- data.frame(
    Data.Set='Full',
    Model='Lasso (elastic net)', 
    Model.Class='Linear',
    Tuning.Parameters=paste0('lambda=', fitlasso_full$bestTune[['lambda']], ', fraction=', fitlasso_full$bestTune[['fraction']]), 
    RMSE.Train=min(fitlasso_full$results[['RMSE']]),
    RMSE.Test=postResample(predlasso_full, testy_full)[['RMSE']],
    MAPE.Test=mape(predlasso_full, testy_full),
    Train.Time=round(telapsed, 3)
)

K nearest neighbors

# Reduced model

# KNN model
tstart <- Sys.time()
set.seed(77)
fitknn <- train(x=trainx_red, y=trainy_red, method='knn', preProc=c("center", "scale"), tuneLength=10)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitknn
predknn <- predict(fitknn, testx_red)
dfr[7,] <- data.frame(
    Data.Set='Reduced',
    Model='knn', 
    Model.Class='Nonlinear',
    Tuning.Parameters=paste0('k=', fitknn$bestTune[['k']]), 
    RMSE.Train=min(fitknn$results[['RMSE']]),
    RMSE.Test=postResample(predknn, testy_red)[['RMSE']],
    MAPE.Test=mape(predknn, testy_red),
    Train.Time=round(telapsed, 3)
)
# Full model

# KNN model
tstart <- Sys.time()
set.seed(77)
fitknn_full <- train(x=trainx_full, y=trainy_full, method='knn', preProc=c("center", "scale"), tuneLength=10)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitknn_full
predknn_full <- predict(fitknn_full, testx_full)
dfr[25,] <- data.frame(
    Data.Set='Full',
    Model='knn', 
    Model.Class='Nonlinear',
    Tuning.Parameters=paste0('k=', fitknn_full$bestTune[['k']]), 
    RMSE.Train=min(fitknn_full$results[['RMSE']]),
    RMSE.Test=postResample(predknn_full, testy_full)[['RMSE']],
    MAPE.Test=mape(predknn_full, testy_full),
    Train.Time=round(telapsed, 3)
)

Multivariate adaptive regression splines

# Reduced model

# MARS model
marsGrid <- expand.grid(.degree=1:2, .nprune=2:10)  # set tuning parameters
tstart <- Sys.time()
set.seed(77)
fitmars <- train(x=trainx_red, y=trainy_red, method='earth', tuneGrid=marsGrid, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitmars
predmars <- predict(fitmars, testx_red)
dfr[8,] <- data.frame(
    Data.Set='Reduced',
    Model='MARS', 
    Model.Class='Nonlinear',
    Tuning.Parameters=paste0('degree=', fitmars$bestTune[['degree']], ', nprune=', fitmars$bestTune[['nprune']]), 
    RMSE.Train=min(fitmars$results[['RMSE']]),
    RMSE.Test=postResample(predmars, testy_red)[['RMSE']],
    MAPE.Test=mape(predmars, testy_red),
    Train.Time=round(telapsed, 3)
)
# Full model

# MARS model
marsGrid <- expand.grid(.degree=1:2, .nprune=2:10)  # set tuning parameters
tstart <- Sys.time()
set.seed(77)
fitmars_full <- train(x=trainx_full, y=trainy_full, method='earth', tuneGrid=marsGrid, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitmars_full
predmars_full <- predict(fitmars_full, testx_full)
dfr[26,] <- data.frame(
    Data.Set='Full',
    Model='MARS', 
    Model.Class='Nonlinear',
    Tuning.Parameters=paste0('degree=', fitmars_full$bestTune[['degree']], ', nprune=', fitmars_full$bestTune[['nprune']]), 
    RMSE.Train=min(fitmars_full$results[['RMSE']]),
    RMSE.Test=postResample(predmars_full, testy_full)[['RMSE']],
    MAPE.Test=mape(predmars_full, testy_full),
    Train.Time=round(telapsed, 3)
)

Support vector machine

# Reduced model

# SVM model
tstart <- Sys.time()
set.seed(77)
fitsvm <- train(x=trainx_red, y=trainy_red, method='svmRadial', preProc=c('center', 'scale'), tuneLength=14, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitsvm
predsvm <- predict(fitsvm, testx_red)
dfr[9,] <- data.frame(
    Data.Set='Reduced',
    Model='SVM', 
    Model.Class='Nonlinear',
    Tuning.Parameters=paste0('C=', fitsvm$bestTune[['C']], ', sigma=', round(fitsvm$bestTune[['sigma']], 3)), 
    RMSE.Train=min(fitsvm$results[['RMSE']]),
    RMSE.Test=postResample(predsvm, testy_red)[['RMSE']],
    MAPE.Test=mape(predsvm, testy_red),
    Train.Time=round(telapsed, 3)
)
# Full model

# SVM model
tstart <- Sys.time()
set.seed(77)
fitsvm_full <- train(x=trainx_full, y=trainy_full, method='svmRadial', preProc=c('center', 'scale'), tuneLength=14, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitsvm_full
predsvm_full <- predict(fitsvm_full, testx_full)
dfr[27,] <- data.frame(
    Data.Set='Full',
    Model='SVM', 
    Model.Class='Nonlinear',
    Tuning.Parameters=paste0('C=', fitsvm_full$bestTune[['C']], ', sigma=', round(fitsvm_full$bestTune[['sigma']], 3)), 
    RMSE.Train=min(fitsvm_full$results[['RMSE']]),
    RMSE.Test=postResample(predsvm_full, testy_full)[['RMSE']],
    MAPE.Test=mape(predsvm_full, testy_full),
    Train.Time=round(telapsed, 3)
)

Neural net

# Reduced model

# nnet model using model averaging
nnetGrid <- expand.grid(.decay=c(0, 0.01, 0.1), .size=c(1:10), .bag=F)  # set tuning parameters
tstart <- Sys.time()
set.seed(77)
fitnnet <- train(x=trainx_red, y=trainy_red, method='avNNet', preProc=c('center', 'scale'), tuneGrid=nnetGrid, trControl=ctrl,
             linout=T, trace=F, MaxNWts=10 * (ncol(trainx_red) + 1) + 10 + 1, maxit=500)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitnnet
prednnet <- predict(fitnnet, testx_red)
dfr[10,] <- data.frame(
    Data.Set='Reduced',
    Model='Neural net', 
    Model.Class='Nonlinear',
    Tuning.Parameters=paste0('decay=', fitnnet$bestTune[['decay']], ', size=', fitnnet$bestTune[['size']], ', bag=False'), 
    RMSE.Train=min(fitnnet$results[['RMSE']]),
    RMSE.Test=postResample(prednnet, testy_red)[['RMSE']],
    MAPE.Test=mape(prednnet, testy_red),
    Train.Time=round(telapsed, 3)
)
# Full model

# nnet model using model averaging
nnetGrid <- expand.grid(.decay=c(0, 0.01, 0.1), .size=c(1:10), .bag=F)  # set tuning parameters
tstart <- Sys.time()
set.seed(77)
fitnnet_full <- train(x=trainx_full, y=trainy_full, method='avNNet', preProc=c('center', 'scale'), tuneGrid=nnetGrid, trControl=ctrl,
             linout=T, trace=F, MaxNWts=10 * (ncol(trainx_full) + 1) + 10 + 1, maxit=500)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitnnet_full
prednnet_full <- predict(fitnnet_full, testx_full)
dfr[28,] <- data.frame(
    Data.Set='Full',
    Model='Neural net', 
    Model.Class='Nonlinear',
    Tuning.Parameters=paste0('decay=', fitnnet_full$bestTune[['decay']], ', size=', fitnnet_full$bestTune[['size']], ', bag=False'), 
    RMSE.Train=min(fitnnet_full$results[['RMSE']]),
    RMSE.Test=postResample(prednnet_full, testy_full)[['RMSE']],
    MAPE.Test=mape(prednnet_full, testy_full),
    Train.Time=round(telapsed, 3)
)

Basic CART (tuned with complexity parameter)

# Reduced model

# Basic regression tree
tstart <- Sys.time()
set.seed(77)
fitcart1 <- train(trainx_red, trainy_red, method='rpart', tuneLength=10, trControl=ctrl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitcart1
predcart1 <- predict(fitcart1, testx_red)
dfr[11,] <- data.frame(
    Data.Set='Reduced',
    Model='Basic CART (tuned w/complexity parameter)', 
    Model.Class='Tree',
    Tuning.Parameters=paste0('cp=', round(fitcart1$bestTune[['cp']], 4)), 
    RMSE.Train=min(fitcart1$results[['RMSE']]),
    RMSE.Test=postResample(predcart1, testy_red)[['RMSE']],
    MAPE.Test=mape(predcart1, testy_red),
    Train.Time=round(telapsed, 3)
)
# Full model

# Basic regression tree
tstart <- Sys.time()
set.seed(77)
fitcart1_full <- train(trainx_full, trainy_full, method='rpart', tuneLength=10, trControl=ctrl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitcart1_full
predcart1_full <- predict(fitcart1_full, testx_full)
dfr[29,] <- data.frame(
    Data.Set='Full',
    Model='Basic CART (tuned w/complexity parameter)', 
    Model.Class='Tree',
    Tuning.Parameters=paste0('cp=', round(fitcart1_full$bestTune[['cp']], 4)), 
    RMSE.Train=min(fitcart1_full$results[['RMSE']]),
    RMSE.Test=postResample(predcart1_full, testy_full)[['RMSE']],
    MAPE.Test=mape(predcart1_full, testy_full),
    Train.Time=round(telapsed, 3)
)

Basic CART (tuned with node depth)

# Reduced model

# Basic CART model - tuned using node depth
tstart <- Sys.time()
set.seed(77)
fitcart2 <- train(trainx_red, trainy_red, method='rpart2', tuneLength=10, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitcart2
predcart2 <- predict(fitcart2, testx_red)
dfr[12,] <- data.frame(
    Data.Set='Reduced',
    Model='Basic CART (tuned w/node depth)', 
    Model.Class='Tree',
    Tuning.Parameters=paste0('maxdepth=', fitcart2$bestTune[['maxdepth']]), 
    RMSE.Train=min(fitcart2$results[['RMSE']]),
    RMSE.Test=postResample(predcart2, testy_red)[['RMSE']],
    MAPE.Test=mape(predcart2, testy_red),
    Train.Time=round(telapsed, 3)
)
# Full model

# Basic CART model - tuned using node depth
tstart <- Sys.time()
set.seed(77)
fitcart2_full <- train(trainx_full, trainy_full, method='rpart2', tuneLength=10, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitcart2_full
predcart2_full <- predict(fitcart2_full, testx_full)
dfr[30,] <- data.frame(
    Data.Set='Full',
    Model='Basic CART (tuned w/node depth)', 
    Model.Class='Tree',
    Tuning.Parameters=paste0('maxdepth=', fitcart2_full$bestTune[['maxdepth']]), 
    RMSE.Train=min(fitcart2_full$results[['RMSE']]),
    RMSE.Test=postResample(predcart2_full, testy_full)[['RMSE']],
    MAPE.Test=mape(predcart2_full, testy_full),
    Train.Time=round(telapsed, 3)
)

Regression model tree

# Reduced model

# Model tree
mtreeGrid1 <- expand.grid(.pruned=c('Yes', 'No'), .smoothed=c('Yes', 'No'), .rules=c('Yes', 'No'))
tstart <- Sys.time()
set.seed(77)
fitmtree1 <- train(trainx_red, trainy_red, method='M5', trControl=ctrl, control=Weka_control(M=10), tuneGrid=mtreeGrid1)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitmtree1
predmtree1 <- predict(fitmtree1, testx_red)
dfr[13,] <- data.frame(
    Data.Set='Reduced',
    Model='Regression model tree', 
    Model.Class='Tree',
    Tuning.Parameters=paste0('pruned=', fitmtree1$bestTune[['pruned']],
                             ', smoothed=', fitmtree1$bestTune[['smoothed']],
                             ', rules=', fitmtree1$bestTune[['rules']]), 
    RMSE.Train=min(fitmtree1$results[['RMSE']]),
    RMSE.Test=postResample(predmtree1, testy_red)[['RMSE']],
    MAPE.Test=mape(predmtree1, testy_red),
    Train.Time=round(telapsed, 3)
)
# Full model

# Model tree
mtreeGrid1 <- expand.grid(.pruned=c('Yes', 'No'), .smoothed=c('Yes', 'No'), .rules=c('Yes', 'No'))
tstart <- Sys.time()
set.seed(77)
fitmtree1_full <- train(trainx_full, trainy_full, method='M5', trControl=ctrl, control=Weka_control(M=10), tuneGrid=mtreeGrid1)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitmtree1_full
predmtree1_full <- predict(fitmtree1_full, testx_full)
dfr[31,] <- data.frame(
    Data.Set='Full',
    Model='Regression model tree', 
    Model.Class='Tree',
    Tuning.Parameters=paste0('pruned=', fitmtree1_full$bestTune[['pruned']],
                             ', smoothed=', fitmtree1_full$bestTune[['smoothed']],
                             ', rules=', fitmtree1_full$bestTune[['rules']]), 
    RMSE.Train=min(fitmtree1_full$results[['RMSE']]),
    RMSE.Test=postResample(predmtree1_full, testy_full)[['RMSE']],
    MAPE.Test=mape(predmtree1_full, testy_full),
    Train.Time=round(telapsed, 3)
)

Regression model tree (rule-based)

# Reduced model

# Model tree (rule-based)
mtreeGrid2 <- expand.grid(.pruned=c('Yes', 'No'), .smoothed=c('Yes', 'No'))
tstart <- Sys.time()
set.seed(77)
fitmtree2 <- train(trainx_red, trainy_red, method='M5Rules', trControl=ctrl, control=Weka_control(M=10), tuneGrid=mtreeGrid2)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitmtree2
predmtree2 <- predict(fitmtree2, testx_red)
dfr[14,] <- data.frame(
    Data.Set='Reduced',
    Model='Regression model tree (rule-based)', 
    Model.Class='Tree',
    Tuning.Parameters=paste0('pruned=', fitmtree2$bestTune[['pruned']],
                             ', smoothed=', fitmtree2$bestTune[['smoothed']]), 
    RMSE.Train=min(fitmtree2$results[['RMSE']]),
    RMSE.Test=postResample(predmtree2, testy_red)[['RMSE']],
    MAPE.Test=mape(predmtree2, testy_red),
    Train.Time=round(telapsed, 3)
)
# Full model

# Model tree (rule-based)
mtreeGrid2 <- expand.grid(.pruned=c('Yes', 'No'), .smoothed=c('Yes', 'No'))
tstart <- Sys.time()
set.seed(77)
fitmtree2_full <- train(trainx_full, trainy_full, method='M5Rules', trControl=ctrl, control=Weka_control(M=10), tuneGrid=mtreeGrid2)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitmtree2_full
predmtree2_full <- predict(fitmtree2_full, testx_full)
dfr[32,] <- data.frame(
    Data.Set='Full',
    Model='Regression model tree (rule-based)', 
    Model.Class='Tree',
    Tuning.Parameters=paste0('pruned=', fitmtree2_full$bestTune[['pruned']],
                             ', smoothed=', fitmtree2_full$bestTune[['smoothed']]), 
    RMSE.Train=min(fitmtree2_full$results[['RMSE']]),
    RMSE.Test=postResample(predmtree2_full, testy_full)[['RMSE']],
    MAPE.Test=mape(predmtree2_full, testy_full),
    Train.Time=round(telapsed, 3)
)

Bagged tree

# Reduced model

# Bagged tree
tstart <- Sys.time()
set.seed(77)
fitbag <- train(trainx_red, trainy_red, method='treebag', trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitbag
predbag <- predict(fitbag, testx_red)
dfr[15,] <- data.frame(
    Data.Set='Reduced',
    Model='Bagged tree', 
    Model.Class='Tree',
    Tuning.Parameters='', 
    RMSE.Train=min(fitbag$results[['RMSE']]),
    RMSE.Test=postResample(predbag, testy_red)[['RMSE']],
    MAPE.Test=mape(predbag, testy_red),
    Train.Time=round(telapsed, 3)
)
# Full model

# Bagged tree
tstart <- Sys.time()
set.seed(77)
fitbag_full <- train(trainx_full, trainy_full, method='treebag', trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitbag_full
predbag_full <- predict(fitbag_full, testx_full)
dfr[33,] <- data.frame(
    Data.Set='Full',
    Model='Bagged tree', 
    Model.Class='Tree',
    Tuning.Parameters='', 
    RMSE.Train=min(fitbag_full$results[['RMSE']]),
    RMSE.Test=postResample(predbag_full, testy_full)[['RMSE']],
    MAPE.Test=mape(predbag_full, testy_full),
    Train.Time=round(telapsed, 3)
)

Random Forest

# Reduced model

# Random Forest
tstart <- Sys.time()
set.seed(77)
fitrf <- train(trainx_red, trainy_red, method='rf', tuneLength=10, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitrf
predrf <- predict(fitrf, testx_red)
dfr[16,] <- data.frame(
    Data.Set='Reduced',
    Model='Random Forest', 
    Model.Class='Tree',
    Tuning.Parameters=paste0('mtry=', fitrf$bestTune[['mtry']]), 
    RMSE.Train=min(fitrf$results[['RMSE']]),
    RMSE.Test=postResample(predrf, testy_red)[['RMSE']],
    MAPE.Test=mape(predrf, testy_red),
    Train.Time=round(telapsed, 3)
)
# Full model

# Random Forest
tstart <- Sys.time()
set.seed(77)
fitrf_full <- train(trainx_full, trainy_full, method='rf', tuneLength=10, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitrf_full
predrf_full <- predict(fitrf_full, testx_full)
dfr[34,] <- data.frame(
    Data.Set='Full',
    Model='Random Forest', 
    Model.Class='Tree',
    Tuning.Parameters=paste0('mtry=', fitrf_full$bestTune[['mtry']]), 
    RMSE.Train=min(fitrf_full$results[['RMSE']]),
    RMSE.Test=postResample(predrf_full, testy_full)[['RMSE']],
    MAPE.Test=mape(predrf_full, testy_full),
    Train.Time=round(telapsed, 3)
)

Stochastic gradient boosted tree

# Had to set message=FALSE to prevent thousands of trace messages

# Reduced model

# Stochastic gradient boosting
gbmGrid <- expand.grid(.interaction.depth=seq(1, 7, by=2),
                       .n.trees=seq(100, 1000, by=50),
                       .shrinkage=c(0.01, 0.1),
                       .n.minobsinnode=10)
tstart <- Sys.time()
set.seed(77)
fitgbm <- train(trainx_red, trainy_red, method='gbm', tuneGrid=gbmGrid, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# Reduced model, cont'd
# fitgbm
predgbm <- predict(fitgbm, testx_red)
dfr[17,] <- data.frame(
    Data.Set='Reduced',
    Model='Stochastic gradient boosted tree', 
    Model.Class='Tree',
    Tuning.Parameters=paste0('interaction.depth=', fitgbm$bestTune[['interaction.depth']], 
                             ', n.trees=', fitgbm$bestTune[['n.trees']], 
                             ', shrinkage=', fitgbm$bestTune[['shrinkage']],
                             ', n.minobsinnode=10'), 
    RMSE.Train=min(fitgbm$results[['RMSE']]),
    RMSE.Test=postResample(predgbm, testy_red)[['RMSE']],
    MAPE.Test=mape(predgbm, testy_red),
    Train.Time=round(telapsed, 3)
)
# Had to set message=FALSE to prevent thousands of trace messages

# Full model

# Stochastic gradient boosting
gbmGrid <- expand.grid(.interaction.depth=seq(1, 7, by=2),
                       .n.trees=seq(100, 1000, by=50),
                       .shrinkage=c(0.01, 0.1),
                       .n.minobsinnode=10)
tstart <- Sys.time()
set.seed(77)
fitgbm_full <- train(trainx_full, trainy_full, method='gbm', tuneGrid=gbmGrid, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# Full model, cont'd
# fitgbm_full
predgbm_full <- predict(fitgbm_full, testx_full)
dfr[35,] <- data.frame(
    Data.Set='Full',
    Model='Stochastic gradient boosted tree', 
    Model.Class='Tree',
    Tuning.Parameters=paste0('interaction.depth=', fitgbm_full$bestTune[['interaction.depth']], 
                             ', n.trees=', fitgbm_full$bestTune[['n.trees']], 
                             ', shrinkage=', fitgbm_full$bestTune[['shrinkage']],
                             ', n.minobsinnode=10'), 
    RMSE.Train=min(fitgbm_full$results[['RMSE']]),
    RMSE.Test=postResample(predgbm_full, testy_full)[['RMSE']],
    MAPE.Test=mape(predgbm_full, testy_full),
    Train.Time=round(telapsed, 3)
)

Cubist

# Reduced model

# Cubist
cubGrid <- expand.grid(.committees=c(seq(1, 10), seq(20, 100, by=10)), .neighbors=c(0, 1, 5, 9))
tstart <- Sys.time()
set.seed(77)
fitcub <- train(trainx_red, trainy_red, method='cubist', tuneGrid=cubGrid, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitcub
predcub <- predict(fitcub, testx_red)
dfr[18,] <- data.frame(
    Data.Set='Reduced',
    Model='Cubist', 
    Model.Class='Tree',
    Tuning.Parameters=paste0('committees=', fitcub$bestTune[['committees']], 
                             ', neighbors=', fitcub$bestTune[['neighbors']]), 
    RMSE.Train=min(fitcub$results[['RMSE']]),
    RMSE.Test=postResample(predcub, testy_red)[['RMSE']],
    MAPE.Test=mape(predcub, testy_red),
    Train.Time=round(telapsed, 3)
)
# Full model

# Cubist
cubGrid <- expand.grid(.committees=c(seq(1, 10), seq(20, 100, by=10)), .neighbors=c(0, 1, 5, 9))
tstart <- Sys.time()
set.seed(77)
fitcub_full <- train(trainx_full, trainy_full, method='cubist', tuneGrid=cubGrid, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitcub_full
predcub_full <- predict(fitcub_full, testx_full)
dfr[36,] <- data.frame(
    Data.Set='Full',
    Model='Cubist', 
    Model.Class='Tree',
    Tuning.Parameters=paste0('committees=', fitcub_full$bestTune[['committees']], 
                             ', neighbors=', fitcub_full$bestTune[['neighbors']]), 
    RMSE.Train=min(fitcub_full$results[['RMSE']]),
    RMSE.Test=postResample(predcub_full, testy_full)[['RMSE']],
    MAPE.Test=mape(predcub_full, testy_full),
    Train.Time=round(telapsed, 3)
)
# Stop the parallel processing cluster
stopCluster(cl)

Modeling Results

Model performance can be compared using a variety of metrics. One commonly used metric is root mean square error (RMSE), which measures the difference between the actual outcome values (pH in this case) and those predicted by the model. Because RMSE is expressed in the same units as the outcome variable, it is easy to interpret relative to the original pH scale.

Another useful metric is the mean absolute percentage error (MAPE), which measures the average percentage difference between predicted and actual values of the outcome. A brief illustration of how the two metrics are computed is shown below, followed by a summary of RMSE and MAPE values along with the tuning parameters of the best-performing models:
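As a point of reference only (the mape() helper used throughout this report was defined earlier and may formulate or scale the percentage error slightly differently), the two metrics can be computed from a vector of predictions as follows:

# Illustration: RMSE and MAPE given a vector of predictions and actual pH values
rmse_example <- function(pred, actual) sqrt(mean((actual - pred)^2))
mape_example <- function(pred, actual) 100 * mean(abs((actual - pred) / actual))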

# Results table
dfr %>%
    arrange(MAPE.Test) %>%
    select(Model, Data.Set, Tuning.Parameters, Train.Time, RMSE.Train, RMSE.Test, MAPE.Test) %>%
    flextable() %>%
    width(width = 1) %>%
    fontsize(size = 10) %>%
    line_spacing(space = 1) %>%
    hline(part = "all") %>%
    set_caption('Modeling Results')
Modeling Results

Model | Data.Set | Tuning.Parameters | Train.Time | RMSE.Train | RMSE.Test | MAPE.Test
Cubist | Full | committees=100, neighbors=5 | 230.949 | 0.09489731 | 0.09467403 | 0.8069488
Random Forest | Full | mtry=26 | 291.638 | 0.09573711 | 0.09518857 | 0.8278027
Cubist | Reduced | committees=30, neighbors=9 | 185.834 | 0.09590937 | 0.09602074 | 0.8406549
Random Forest | Reduced | mtry=18 | 241.865 | 0.09756142 | 0.10008600 | 0.8564273
Stochastic gradient boosted tree | Full | interaction.depth=7, n.trees=950, shrinkage=0.1, n.minobsinnode=10 | 63.292 | 0.10471437 | 0.10147753 | 0.8862444
Stochastic gradient boosted tree | Reduced | interaction.depth=5, n.trees=650, shrinkage=0.1, n.minobsinnode=10 | 48.788 | 0.10722935 | 0.10653321 | 0.9463547
Regression model tree | Full | pruned=Yes, smoothed=Yes, rules=No | 98.277 | 0.12150091 | 0.11406403 | 0.9841826
SVM | Full | C=16, sigma=0.019 | 47.931 | 0.11328172 | 0.11362331 | 1.0158661
SVM | Reduced | C=8, sigma=0.024 | 49.163 | 0.11517041 | 0.11670584 | 1.0427894
Neural net | Full | decay=0.1, size=5, bag=False | 87.481 | 0.11565164 | 0.11789163 | 1.0522013
Neural net | Reduced | decay=0.1, size=5, bag=False | 76.165 | 0.11713042 | 0.11587517 | 1.0535517
Regression model tree (rule-based) | Full | pruned=Yes, smoothed=No | 96.790 | 0.12672150 | 0.13331952 | 1.1181864
knn | Reduced | k=11 | 5.201 | 0.12570259 | 0.12448787 | 1.1280221
Bagged tree | Full |  | 2.052 | 0.11960417 | 0.12225904 | 1.1282624
Regression model tree | Reduced | pruned=Yes, smoothed=No, rules=Yes | 92.994 | 0.12342576 | 0.13725888 | 1.1370836
Regression model tree (rule-based) | Reduced | pruned=Yes, smoothed=No | 80.521 | 0.12342576 | 0.13725888 | 1.1370836
knn | Full | k=13 | 6.429 | 0.12546626 | 0.12635171 | 1.1488225
Bagged tree | Reduced |  | 1.716 | 0.12080801 | 0.12566115 | 1.1556535
MARS | Reduced | degree=2, nprune=10 | 10.313 | 0.13075924 | 0.13523319 | 1.1976969
Basic CART (tuned w/node depth) | Reduced | maxdepth=10 | 0.714 | 0.12801291 | 0.13305594 | 1.2089476
Basic CART (tuned w/complexity parameter) | Reduced | cp=0.0136 | 0.464 | 0.12948167 | 0.13469959 | 1.2183462
Basic CART (tuned w/node depth) | Full | maxdepth=11 | 0.383 | 0.12911119 | 0.13328713 | 1.2259351
MARS | Full | degree=1, nprune=10 | 12.286 | 0.13164275 | 0.13438371 | 1.2379820
Partial least squares | Reduced | ncomp=11 | 0.756 | 0.13440134 | 0.13419731 | 1.2442610
Linear model | Reduced |  | 14.263 | 0.13418504 | 0.13426328 | 1.2442876
Ridge regression | Reduced | lambda=0.0214 | 2.693 | 0.13440744 | 0.13416671 | 1.2445097
Lasso (elastic net) | Reduced | lambda=0, fraction=0.9 | 0.860 | 0.13435146 | 0.13448619 | 1.2471906
Ridge regression | Full | lambda=0.0071 | 3.576 | 0.13214741 | 0.13687962 | 1.2653748
Lasso (elastic net) | Full | lambda=0, fraction=0.9 | 0.934 | 0.13204044 | 0.13763948 | 1.2671424
Partial least squares | Full | ncomp=29 | 0.602 | 0.13214800 | 0.13763492 | 1.2671462
Linear model | Full |  | 1.123 | 0.13305920 | 0.13768953 | 1.2735829
Basic CART (tuned w/complexity parameter) | Full | cp=0.0127 | 0.717 | 0.13031164 | 0.13833394 | 1.2770764
Stepwise linear model | Full | nvmax=12 | 0.767 | 0.13465862 | 0.13991498 | 1.2830543
Robust linear regression | Reduced |  | 1.090 | 0.13929776 | 0.13920669 | 1.2980118
Robust linear regression | Full |  | 1.919 | 0.13748209 | 0.13930736 | 1.3025277
Stepwise linear model | Reduced | nvmax=14 | 0.988 | 0.13537079 | 0.15091506 | 1.3898956

dfr %>%
    ggplot(aes(x=reorder(Model, desc(MAPE.Test)), y=MAPE.Test, fill=Data.Set)) +
    geom_bar(stat='identity', position='dodge', width=0.75) +
    coord_flip() +
    xlab('') +
    ylab('MAPE (Test set)') +
    scale_fill_manual(values=c('#c03333', '#808080')) +
    ggtitle('Modeling Results')

Our preliminary runs show that tree-based models were the best performers by a significant margin. Nonlinear regression-based models outperformed the linear ones but were still outclassed by the tree-based models. The Cubist and random forest models were the top performers, followed by the stochastic gradient boosted tree and the regression model tree. It is not surprising that the tree-based models performed better when fed the full data set, as the richer feature set gave them more information about the outcome. It was surprising, however, that several tree-based models posted lower error on the test set than on the training set; models typically perform slightly worse on data they have not encountered before, which was indeed the case for the linear and nonlinear regression-based models.
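Beyond single-number test metrics, the leading tree-based fits can also be compared on their cross-validated resampling distributions with caret's resamples(); a minimal sketch, assuming the fitted objects are still in memory and share the same resampling folds (they were all trained with the same ctrl object and seed):

# Compare cross-validated RMSE distributions of the leading tree-based models
res_comp <- resamples(list(Cubist=fitcub_full, RandomForest=fitrf_full, GBM=fitgbm_full))
summary(res_comp)
bwplot(res_comp, metric='RMSE')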

We also gauged accuracy against how long it took to train the various models (see plot below). As expected, the more accurate models generally took much longer to run than those that were less accurate. The model with the best MAPE score was also one of the longest to train.

# Performance vs accuracy
dfr %>% 
    ggplot(aes(x=MAPE.Test, y=Train.Time, color=Model.Class)) +
    geom_point() +
    xlab('Accuracy % (MAPE)') +
    ylab('Training time (seconds)') +
    ggtitle('Accuracy and training time by model class')

It should be noted that if this model is to be used in a real-time, production environment, a model with faster runtimes would be preferred over those that perform better but are slower to train.
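For a rough sense of scoring latency, as distinct from training time, prediction on the held-out set can be timed directly; a minimal sketch, assuming the fitted objects from above are still in memory:

# Rough scoring-latency check on the held-out test set
system.time(predict(fitcub_full, testx_full))    # top performer, slow to train
system.time(predict(fitcart2_full, testx_full))  # much faster, less accurate tree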

Model selection

While our modeling yielded very good results from the top performers, we'll take a closer look at the top two models to see whether they can be improved before selecting a final model. As indicated above, the top two performers were the Cubist model run against the full data set with 100 committees and five neighbors, and the random forest model with mtry=26 (26 candidate predictors at each split), also run against the full data set.

Because it was unclear whether the missing values in the categorical variable (Brand.Code) were intentionally left uncoded or were genuinely missing, we decided to rerun the two best-performing models after adjusting this variable. On the first run, we dropped the variable altogether. On the second run, we recoded missing Brand.Code values as “U” to represent unbranded products.

# Rerun top 2 models - full set, but drop Brand.Code
dfm_nobrand <- dfm_full %>%
    select(-starts_with('Brand.Code'))

# Create vector of training rows
set.seed(77)
train_rows <- createDataPartition(dfm_nobrand$PH, p=0.8, list=F)

# Separate training from test
dfm_nobrand_train <- dfm_nobrand[train_rows,]
dfm_nobrand_test <- dfm_nobrand[-train_rows,]

# Separate outcome from predictors
trainx_nobrand <- dfm_nobrand_train[,2:ncol(dfm_nobrand_train)]
trainy_nobrand <- dfm_nobrand_train[,1]
testx_nobrand <- dfm_nobrand_test[,2:ncol(dfm_nobrand_test)]
testy_nobrand <- dfm_nobrand_test[,1]

# Parallel processing
cl <- makePSOCKcluster(num_cores)
registerDoParallel(cl)

# Cubist
cubGrid <- expand.grid(.committees=c(seq(1, 10), seq(20, 100, by=10)), .neighbors=c(0, 1, 5, 9))
tstart <- Sys.time()
set.seed(77)
fitcub_nobrand <- train(trainx_nobrand, trainy_nobrand, method='cubist', tuneGrid=cubGrid, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitcub_nobrand
predcub_nobrand <- predict(fitcub_nobrand, testx_nobrand)
dfr[37,] <- data.frame(
    Data.Set='Rerun (no Brand.Code)',
    Model='Cubist', 
    Model.Class='Tree',
    Tuning.Parameters=paste0('committees=', fitcub_nobrand$bestTune[['committees']], 
                             ', neighbors=', fitcub_nobrand$bestTune[['neighbors']]), 
    RMSE.Train=min(fitcub_nobrand$results[['RMSE']]),
    RMSE.Test=postResample(predcub_nobrand, testy_nobrand)[['RMSE']],
    MAPE.Test=mape(predcub_nobrand, testy_nobrand),
    Train.Time=round(telapsed, 3)
)

# Random Forest
tstart <- Sys.time()
set.seed(77)
fitrf_nobrand <- train(trainx_nobrand, trainy_nobrand, method='rf',  tuneLength=10, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitrf_nobrand
predrf_nobrand <- predict(fitrf_nobrand, testx_nobrand)
dfr[39,] <- data.frame(
    Data.Set='Rerun (no Brand.Code)',
    Model='Random forest', 
    Model.Class='Tree',
    Tuning.Parameters=paste0('mtry=', fitrf_nobrand$bestTune[['mtry']]), 
    RMSE.Train=min(fitrf_nobrand$results[['RMSE']]),
    RMSE.Test=postResample(predrf_nobrand, testy_nobrand)[['RMSE']],
    MAPE.Test=mape(predrf_nobrand, testy_nobrand),
    Train.Time=round(telapsed, 3)
)

# Stop the parallel processing cluster
stopCluster(cl)
# Rerun top 2 models - full set, but add uncoded Brand.Code values as U

# Init the full data set
dfm_unbrand <- dfm

# Remove observation with 12 missing values
dfm_unbrand <- dfm_unbrand[(rowSums(is.na(dfm_unbrand)) < 12),]

# Remove the MFR predictor because it has 212 missing values (8.2% of the data)
dfm_unbrand <- dfm_unbrand %>% select(-MFR)

# Recode blank brand codes as "U"
dfm_unbrand <- dfm_unbrand %>%
    mutate(Brand.Code=ifelse(Brand.Code=='', 'U', as.character(Brand.Code)))

# Factor Brand.Code (needed for imputation)
dfm_unbrand$Brand.Code <- factor(dfm_unbrand$Brand.Code)

# Use knnImputation to impute values
dfm_unbrand <- knnImputation(dfm_unbrand, k=9)

# Create dummies for factor and character variables
dfm_unbrand <- dummy_cols(dfm_unbrand, select_columns='Brand.Code', remove_first_dummy=F, remove_selected_columns=T)

# Create vector of training rows
set.seed(77)
train_rows <- createDataPartition(dfm_unbrand$PH, p=0.8, list=F)

# Separate training from test
dfm_unbrand_train <- dfm_unbrand[train_rows,]
dfm_unbrand_test <- dfm_unbrand[-train_rows,]

# Separate outcome from predictors
trainx_unbrand <- dfm_unbrand_train[,2:ncol(dfm_unbrand_train)]
trainy_unbrand <- dfm_unbrand_train[,1]
testx_unbrand <- dfm_unbrand_test[,2:ncol(dfm_unbrand_test)]
testy_unbrand <- dfm_unbrand_test[,1]

# Parallel processing
cl <- makePSOCKcluster(num_cores)
registerDoParallel(cl)

# Cubist - full set - wider range of neighbors and committees
cubGrid <- expand.grid(.committees=c(seq(1, 10), seq(20, 100, by=10)), .neighbors=c(0, 1, 5, 9))
tstart <- Sys.time()
set.seed(77)
fitcub_unbrand <- train(trainx_unbrand, trainy_unbrand, method='cubist', tuneGrid=cubGrid, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitcub_unbrand
predcub_unbrand <- predict(fitcub_unbrand, testx_unbrand)
dfr[38,] <- data.frame(
    Data.Set='Rerun (keep unbranded)',
    Model='Cubist', 
    Model.Class='Tree',
    Tuning.Parameters=paste0('committees=', fitcub_unbrand$bestTune[['committees']], 
                             ', neighbors=', fitcub_unbrand$bestTune[['neighbors']]), 
    RMSE.Train=min(fitcub_unbrand$results[['RMSE']]),
    RMSE.Test=postResample(predcub_unbrand, testy_unbrand)[['RMSE']],
    MAPE.Test=mape(predcub_unbrand, testy_unbrand),
    Train.Time=round(telapsed, 3)
)

# Random Forest
tstart <- Sys.time()
set.seed(77)
fitrf_unbrand <- train(trainx_unbrand, trainy_unbrand, method='rf',  tuneLength=10, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitrf_unbrand
predrf_unbrand <- predict(fitrf_unbrand, testx_unbrand)
dfr[40,] <- data.frame(
    Data.Set='Rerun (keep unbranded)',
    Model='Random forest', 
    Model.Class='Tree',
    Tuning.Parameters=paste0('mtry=', fitrf_unbrand$bestTune[['mtry']]), 
    RMSE.Train=min(fitrf_unbrand$results[['RMSE']]),
    RMSE.Test=postResample(predrf_unbrand, testy_unbrand)[['RMSE']],
    MAPE.Test=mape(predrf_unbrand, testy_unbrand),
    Train.Time=round(telapsed, 3)
)

# Stop the parallel processing cluster
stopCluster(cl)
# Reprint top results
dfr[c(16,18,34,36,37:40),] %>%
    arrange(MAPE.Test) %>%
    select(Model, Data.Set, Tuning.Parameters, Train.Time, RMSE.Train, RMSE.Test, MAPE.Test) %>%
    flextable() %>%
    width(width = 1) %>%
    fontsize(size = 10) %>%
    line_spacing(space = 1) %>%
    hline(part = "all") %>%
    set_caption('Cubist/Random forest results (including reruns)')
Cubist/Random forest results (including reruns)

Model | Data.Set | Tuning.Parameters | Train.Time | RMSE.Train | RMSE.Test | MAPE.Test
Cubist | Full | committees=100, neighbors=5 | 230.949 | 0.09489731 | 0.09467403 | 0.8069488
Cubist | Rerun (keep unbranded) | committees=50, neighbors=5 | 240.451 | 0.09558272 | 0.09502358 | 0.8090312
Random Forest | Full | mtry=26 | 291.638 | 0.09573711 | 0.09518857 | 0.8278027
Random forest | Rerun (keep unbranded) | mtry=31 | 278.027 | 0.09582768 | 0.09605555 | 0.8292539
Cubist | Reduced | committees=30, neighbors=9 | 185.834 | 0.09590937 | 0.09602074 | 0.8406549
Cubist | Rerun (no Brand.Code) | committees=70, neighbors=5 | 244.527 | 0.09882926 | 0.10185535 | 0.8444153
Random Forest | Reduced | mtry=18 | 241.865 | 0.09756142 | 0.10008600 | 0.8564273
Random forest | Rerun (no Brand.Code) | mtry=23 | 253.217 | 0.10338676 | 0.10653838 | 0.9037640

As shown, rerunning the models with the alternative Brand.Code treatments did not improve on the MAPE of the models run against the full data set. As such, the top-performing model remains the Cubist model run against the full data set (with missing Brand.Code values imputed), using 100 committees and five neighbors. As shown on the plot below, cross-validated error drops sharply as the number of committees increases from 1 to around 5, followed by a more gradual decline thereafter. Because more committees mean longer runtimes, the number of committees could be reduced from 100 to 20 and still yield nearly equivalent performance. But if runtime isn't a concern, we recommend 100 committees.

# Plot
plot(fitcub_full)
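As a quick check of that trade-off, the cross-validated error at 20 versus 100 committees can be read from the tuning results stored in the fitted object; a sketch, assuming fitcub_full is still in memory and caret's results table has its usual columns:

# Cross-validated RMSE at 20 vs. 100 committees for the selected neighbor count
fitcub_full$results %>%
    filter(neighbors == fitcub_full$bestTune[['neighbors']],
           committees %in% c(20, 100)) %>%
    select(committees, neighbors, RMSE, MAE)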

Variable importance

Having selected the optimal model, we’ll examine the importance of individual predictors within the model.

# Variable importance - cubist full set
tmpdf <- varImp(fitcub_full)$importance
tmpdf$Variable <- row.names(tmpdf)
tmpdf %>%
    select(Variable, Overall) %>%
    arrange(desc(Overall)) %>%
    head(10) %>%
    flextable() %>%
    width(width = 2) %>%
    fontsize(size = 10) %>%
    line_spacing(space = 1) %>%
    hline(part = "all") %>%
    set_caption('Top 10 variables - Cubist model (full set)')
Top 10 variables - Cubist model (full set)

Variable | Overall
Mnf.Flow | 100.00000
Balling.Lvl | 64.81481
Alch.Rel | 56.17284
Balling | 54.93827
Pressure.Vacuum | 51.23457
Density | 51.23457
Oxygen.Filler | 50.00000
Bowl.Setpoint | 43.82716
Air.Pressurer | 41.97531
Temperature | 41.97531

As shown, Mnf.Flow was the most informative predictor in the model, followed by Balling.Lvl and Alch.Rel.

# mnf.flow scatter 
dfm |> 
  select(Mnf.Flow, PH) |>
  plot()

We can see that there appear to be three groups of data: values around -100, values around 0, and values greater than zero. Upon further inspection, however, we discover there are no exact zero values in the Mnf.Flow column, only values of 0.2. We discretize the variable and plot violin diagrams of the distributions:

# mnf.flow violin 
dfm |>
  select(PH, Mnf.Flow) |>
  na.omit() |>
  mutate(Mnf.Flow = ifelse(Mnf.Flow < 0, "Less than Zero",
                           "Greater than Zero")) |>
  ggplot(aes(x=Mnf.Flow, y=PH)) +
  geom_violin(fill="lightblue") +
  theme_minimal()

We observe that observations with Mnf.Flow values greater than zero are likelier to fall within the critical pH range, while the observations with values less than zero contain the outliers above 9.0 pH.

dfm |>
  select(PH, Mnf.Flow) |>
  na.omit() |>
  mutate(Mnf.Flow = ifelse(Mnf.Flow <= 0, "Less than Zero",
                           "Greater than Zero")) |>
  group_by(Mnf.Flow) |>
  summarize(mean.ph = mean(PH),
            median.ph = median(PH),
            min.ph = min(PH),
            max.ph = max(PH),
            sd.ph = sd(PH)) |>
  as.data.frame() |>
  t() |>
  kable() |>
  kable_styling()
Mnf.Flow | Greater than Zero | Less than Zero
mean.ph | 8.470983 | 8.633001
median.ph | 8.48 | 8.66
min.ph | 7.88 | 8.08
max.ph | 8.88 | 9.36
sd.ph | 0.1449628 | 0.1608029

We can see that the mean, median, standard deviation, minimum, and maximum pH are all higher for Mnf.Flow values below zero, providing further evidence that observations with Mnf.Flow values greater than zero are likelier to fall within the critical pH range.

From the Cubist documentation:

Cubist: The Cubist output contains variable usage statistics. It gives the percentage of times where each variable was used in a condition and/or a linear model. Note that this output will probably be inconsistent with the rules shown in the output from summary.cubist. At each split of the tree, Cubist saves a linear model (after feature selection) that is allowed to have terms for each variable used in the current split or any split above it. Quinlan (1992) discusses a smoothing algorithm where each model prediction is a linear combination of the parent and child model along the tree. As such, the final prediction is a function of all the linear models from the initial node to the terminal node. The percentages shown in the Cubist output reflects all the models involved in prediction (as opposed to the terminal models shown in the output). The variable importance used here is a linear combination of the usage in the rule conditions and the model.
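To see the raw usage statistics that passage describes, the usage component of the underlying Cubist object can be inspected directly; a sketch, assuming the fitcub_full object from above:

# Percentage of conditions and linear models in which each predictor appears
head(fitcub_full$finalModel$usage, 10)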

Evaluation data

Before we generate predictions on the evaluation data set, we'll do some cursory exploratory data analysis to verify that the evaluation data set isn't vastly different from the training set.

Exploratory data analysis

# Move outcome variable pH to front for easier access
dfe <- dfe_raw %>%
    dplyr::select(PH, !matches('Brand.Code'), Brand.Code)
summary(dfe)
##     PH           Carb.Volume     Fill.Ounces      PC.Volume      
##  Mode:logical   Min.   :5.147   Min.   :23.75   Min.   :0.09867  
##  NA's:267       1st Qu.:5.287   1st Qu.:23.92   1st Qu.:0.23333  
##                 Median :5.340   Median :23.97   Median :0.27533  
##                 Mean   :5.369   Mean   :23.97   Mean   :0.27769  
##                 3rd Qu.:5.465   3rd Qu.:24.01   3rd Qu.:0.32200  
##                 Max.   :5.667   Max.   :24.20   Max.   :0.46400  
##                 NA's   :1       NA's   :6       NA's   :4        
##  Carb.Pressure     Carb.Temp          PSC             PSC.Fill     
##  Min.   :60.20   Min.   :130.0   Min.   :0.00400   Min.   :0.0200  
##  1st Qu.:65.30   1st Qu.:138.4   1st Qu.:0.04450   1st Qu.:0.1000  
##  Median :68.00   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.25   Mean   :141.2   Mean   :0.08545   Mean   :0.1903  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :77.60   Max.   :154.0   Max.   :0.24600   Max.   :0.6200  
##                  NA's   :1       NA's   :5         NA's   :3       
##     PSC.CO2           Mnf.Flow       Carb.Pressure1  Fill.Pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :113.0   Min.   :37.80  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:120.2   1st Qu.:46.00  
##  Median :0.04000   Median :   0.20   Median :123.4   Median :47.80  
##  Mean   :0.05107   Mean   :  21.03   Mean   :123.0   Mean   :48.14  
##  3rd Qu.:0.06000   3rd Qu.: 141.30   3rd Qu.:125.5   3rd Qu.:50.20  
##  Max.   :0.24000   Max.   : 220.40   Max.   :136.0   Max.   :60.20  
##  NA's   :5                           NA's   :4       NA's   :2      
##  Hyd.Pressure1    Hyd.Pressure2    Hyd.Pressure3    Hyd.Pressure4   
##  Min.   :-50.00   Min.   :-50.00   Min.   :-50.00   Min.   : 68.00  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.: 90.00  
##  Median : 10.40   Median : 26.80   Median : 27.70   Median : 98.00  
##  Mean   : 12.01   Mean   : 20.11   Mean   : 19.61   Mean   : 97.84  
##  3rd Qu.: 20.40   3rd Qu.: 34.80   3rd Qu.: 33.00   3rd Qu.:104.00  
##  Max.   : 50.00   Max.   : 61.40   Max.   : 49.20   Max.   :140.00  
##                   NA's   :1        NA's   :1        NA's   :4       
##   Filler.Level    Filler.Speed   Temperature      Usage.cont      Carb.Flow   
##  Min.   : 69.2   Min.   :1006   Min.   :63.80   Min.   :12.90   Min.   :   0  
##  1st Qu.:100.6   1st Qu.:3812   1st Qu.:65.40   1st Qu.:18.12   1st Qu.:1083  
##  Median :118.6   Median :3978   Median :65.80   Median :21.44   Median :3038  
##  Mean   :110.3   Mean   :3581   Mean   :66.23   Mean   :20.90   Mean   :2409  
##  3rd Qu.:120.2   3rd Qu.:3996   3rd Qu.:66.60   3rd Qu.:23.74   3rd Qu.:3215  
##  Max.   :153.2   Max.   :4020   Max.   :75.40   Max.   :24.60   Max.   :3858  
##  NA's   :2       NA's   :10     NA's   :2       NA's   :2                     
##     Density           MFR           Balling      Pressure.Vacuum 
##  Min.   :0.060   Min.   : 15.6   Min.   :0.902   Min.   :-6.400  
##  1st Qu.:0.920   1st Qu.:707.0   1st Qu.:1.498   1st Qu.:-5.600  
##  Median :0.980   Median :724.6   Median :1.648   Median :-5.200  
##  Mean   :1.177   Mean   :697.8   Mean   :2.203   Mean   :-5.174  
##  3rd Qu.:1.600   3rd Qu.:731.5   3rd Qu.:3.242   3rd Qu.:-4.800  
##  Max.   :1.840   Max.   :784.8   Max.   :3.788   Max.   :-3.600  
##  NA's   :1       NA's   :31      NA's   :1       NA's   :1       
##  Oxygen.Filler     Bowl.Setpoint   Pressure.Setpoint Air.Pressurer  
##  Min.   :0.00240   Min.   : 70.0   Min.   :44.00     Min.   :141.2  
##  1st Qu.:0.01960   1st Qu.:100.0   1st Qu.:46.00     1st Qu.:142.2  
##  Median :0.03370   Median :120.0   Median :46.00     Median :142.6  
##  Mean   :0.04666   Mean   :109.6   Mean   :47.73     Mean   :142.8  
##  3rd Qu.:0.05440   3rd Qu.:120.0   3rd Qu.:50.00     3rd Qu.:142.8  
##  Max.   :0.39800   Max.   :130.0   Max.   :52.00     Max.   :147.2  
##  NA's   :3         NA's   :1       NA's   :2         NA's   :1      
##     Alch.Rel        Carb.Rel     Balling.Lvl     Brand.Code       
##  Min.   :6.400   Min.   :5.18   Min.   :0.000   Length:267        
##  1st Qu.:6.540   1st Qu.:5.34   1st Qu.:1.380   Class :character  
##  Median :6.580   Median :5.40   Median :1.480   Mode  :character  
##  Mean   :6.907   Mean   :5.44   Mean   :2.051                     
##  3rd Qu.:7.180   3rd Qu.:5.56   3rd Qu.:3.080                     
##  Max.   :7.820   Max.   :5.74   Max.   :3.420                     
##  NA's   :3       NA's   :2

At first glance the evaluation set appears to be roughly on par with the training set. A number of missing values exist, but not in vast proportions.

# Factor categorical variable Brand.Code
dfe$Brand.Code <- factor(dfe$Brand.Code)

# Generate corr plot for pairwise correlations
corr1 <- cor(dfe[,2:(ncol(dfe)-1)], use='complete')
corrplot::corrplot(corr1, method='square', order='hclust', type='full', diag=T, tl.cex=0.75, cl.cex=0.75)

Like the training set data, some collinearity exists between pairs of variables. The same variables appear to be correlated as in the training set:

  • Carb.Volume
  • Carb.Rel
  • Alch.Rel
  • Density
  • Balling
  • Balling.Lvl
# Count NAs per column
missing_values <- data.frame(colSums(is.na(dfe[,-1])))
missing_values$Variable <- row.names(missing_values)
colnames(missing_values) <- c('Missing.values', 'Variable')
missing_values %>%
    dplyr::select(Variable, Missing.values) %>%
    arrange(desc(Missing.values)) %>%
    flextable() %>%
    width(width = 2) %>%
    fontsize(size = 10) %>%
    line_spacing(space = 1) %>%
    hline(part = "all") %>%
    set_caption('Missing values')
Missing values

Variable | Missing.values
MFR | 31
Filler.Speed | 10
Fill.Ounces | 6
PSC | 5
PSC.CO2 | 5
PC.Volume | 4
Carb.Pressure1 | 4
Hyd.Pressure4 | 4
PSC.Fill | 3
Oxygen.Filler | 3
Alch.Rel | 3
Fill.Pressure | 2
Filler.Level | 2
Temperature | 2
Usage.cont | 2
Pressure.Setpoint | 2
Carb.Rel | 2
Carb.Volume | 1
Carb.Temp | 1
Hyd.Pressure2 | 1
Hyd.Pressure3 | 1
Density | 1
Balling | 1
Pressure.Vacuum | 1
Bowl.Setpoint | 1
Air.Pressurer | 1
Carb.Pressure | 0
Mnf.Flow | 0
Hyd.Pressure1 | 0
Carb.Flow | 0
Balling.Lvl | 0
Brand.Code | 0

# Missing value patterns
invisible(md.pattern(dfe[,-1], rotate.names=T, plot=T))

Missing values also appear to be on par with the training set, with MFR having the most missing values. There is one observation that contains 14 missing values, but since we’ll need to predict against every row in the evaluation set, we won’t remove it.
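That observation can be located by counting missing values per row; a small sketch:

# Identify evaluation rows with an unusually large number of missing predictors
which(rowSums(is.na(dfe[,-1])) >= 14)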

# Generate histograms of continuous variables
dfe %>%
    dplyr::select(c(2:(ncol(dfe)-1))) %>%
    gather() %>%
    ggplot(aes(x=value)) +
    geom_histogram(bins=20) +
    facet_wrap(~key, scales = 'free_x') +
    ggtitle('Histograms of continuous variables')

# Calculate skewness on columns
colskewness <- data.frame(apply(dfe[,2:(ncol(dfe)-1)], 2, calcSkewness, type=1))
colskewness$Variable <- row.names(colskewness)
colnames(colskewness) <- c('Skewness', 'Variable')

# Graph skewness values
colskewness %>%
    dplyr::select(Variable, Skewness) %>%
    arrange(desc(abs(Skewness))) %>%
    ggplot(aes(x=reorder(Variable, desc(Skewness)), y=Skewness)) +
    geom_bar(stat='identity') +
    coord_flip() +
    ggtitle('Variable skewness') +
    xlab('')

Roughly the same variables appear to be skewed.

# Set multiplier; outliers are considered as such when they lie outside of (multiplier) standard deviations from the mean
mult <- 3
outliers <- apply(dfe[,2:(ncol(dfe)-1)], 2, getOutliers, mult=mult)
dfout <- data.frame(outliers)
dfout <- data.frame(colSums(outliers))
dfout$Variable <- row.names(dfout)
colnames(dfout) <- c('Outlier.count', 'Variable')

# Filter just those variables with outliers and sort by outlier count
dfout <- dfout %>%
    dplyr::select(Variable, Outlier.count) %>%
    arrange(desc(Outlier.count)) %>%
    filter(Outlier.count > 0)

# Generate bar graph of outlier counts
dfout %>%
    ggplot(aes(x=reorder(Variable, Outlier.count), y=Outlier.count, label=Outlier.count)) +
    geom_bar(stat='identity') +
    coord_flip() +
    geom_text(check_overlap=T, hjust=-0.1, nudge_x=0.05) +
    ggtitle('Outliers (only variables with outliers shown)') +
    xlab('') + ylab('Outlier count')

The proportion of outliers also seems to be similar to that of the training set.

Data preparation

Now that we’ve verified the evaluation data is on par with the training data, we’ll proceed with data preparation. Since our best-performing model used the full data set, we’ll perform the same steps on the evaluation set that we used earlier on the full training set, namely the following:

  1. Remove the MFR predictor because it has a high proportion of missing values (31 values, or 11.6% of the data).
  2. Change the blank brand codes to NA so that they will be imputed.
  3. Impute the missing values using KNN imputation.
  4. Create dummy variables for the Brand.Code predictor.
# Init data frame, removing outcome variable, PH
dfe_eval <- dfe[,-1]

# Remove the MFR predictor because it has a high proportion of missing data
dfe_eval <- dfe_eval %>% select(-MFR)

# Change blank brand codes to NA so they will be imputed
dfe_eval <- dfe_eval %>%
    mutate(Brand.Code=ifelse(Brand.Code=='', NA, as.character(Brand.Code)))

# Factor Brand.Code (needed for imputation)
dfe_eval$Brand.Code <- factor(dfe_eval$Brand.Code)

# Use knnImputation to impute values
dfe_eval <- knnImputation(dfe_eval, k=9)

# Create dummies for factor and character variables
dfe_eval <- dummy_cols(dfe_eval, select_columns='Brand.Code', remove_first_dummy=F, remove_selected_columns=T)

After preparing the data, we have 266 observations in 34 variables. The data is now ready for prediction.
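A quick dimension check confirms the shape of the prepared evaluation set (a sketch):

# Rows and columns of the prepared evaluation predictors
dim(dfe_eval)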

Prediction

To predict pH in the evaluation data set, we’ll use the modeling results from the top performing model (Cubist run against the full data set). The first few predicted values are included below.

# Predict using the Cubist full model
pred_eval <- predict(fitcub_full, dfe_eval)
df_ph <- data.frame(pred_eval)
colnames(df_ph) <- c('PH')

# Show first few values
data.frame(pred_eval) %>%
    head(10) %>%
    flextable() %>%
    width(width = 2) %>%
    fontsize(size = 10) %>%
    line_spacing(space = 1) %>%
    hline(part = "all") %>%
    set_caption('Predicted pH (first few values)') %>%
    delete_part( part = "header")
Predicted pH (first few values)

8.628850, 8.376658, 8.519270, 8.634902, 8.452495, 8.577147, 8.453360, 8.550534, 8.566530, 8.643949

# Save to csv
write.csv(df_ph, 'predicted_ph.csv', row.names=F)

We’ll also generate some summary statistics for comparison purposes.

# Generate summary stats for predicted pH values
tmpdf <- data.frame(as.matrix(summary(pred_eval)))
tmpdf$Statistic <- row.names(tmpdf)
colnames(tmpdf) <- c('Evaluation', 'Statistic')
tmpdf <- tmpdf %>%
    rbind(data.frame(Evaluation=sd(pred_eval), Statistic='Std dev'))

# Generate summary stats for training pH values
tmpdfm <- data.frame(as.matrix(summary(dfm_full$PH)))
tmpdfm$Statistic <- row.names(tmpdfm)
colnames(tmpdfm) <- c('Training', 'Statistic')
tmpdfm <- tmpdfm %>%
    rbind(data.frame(Training=sd(dfm_full$PH), Statistic='Std dev'))

# Merge the summary stats to display in a table
tmpdf2 <- tmpdfm %>%
    merge(tmpdf, by=c('Statistic'))

# Display summary
tmpdf2 %>%
    select(Statistic, Training, Evaluation) %>%
    flextable() %>%
    width(width = 2) %>%
    fontsize(size = 10) %>%
    line_spacing(space = 1) %>%
    hline(part = "all") %>%
    set_caption('Comparison of pH statistics')
Comparison of pH statistics

Statistic    Training     Evaluation
1st Qu.      8.4400000    8.4597397
3rd Qu.      8.6800000    8.6659360
Max.         9.3600000    8.9317131
Mean         8.5457724    8.5525592
Median       8.5400000    8.5318241
Min.         7.8800000    8.1460953
Std dev      0.1724898    0.1459234

# Generate count histograms of training vs predicted PH
tmpdf3 <- df_ph %>%
    mutate(Set='Predicted') %>%
    rbind(data.frame(PH=dfm_full$PH) %>% mutate(Set='Training'))
tmpdf3 %>% 
    ggplot(aes(x=PH)) +
    geom_histogram(bins=30) +
    facet_wrap('Set')

As shown in the table above, the mean, median, and first and third quartiles are roughly equal between the training pH values and the predicted evaluation values. The training data span a wider range (lower minimum, higher maximum) and have a larger standard deviation than the predictions, which is natural given the much larger number of samples in the training set.

# Generate density-scaled histograms of training vs predicted PH
# (reuses tmpdf3 built above)
tmpdf3 %>% 
    ggplot(aes(x=PH, y=after_stat(density))) +
    geom_histogram(bins=30) +
    facet_wrap('Set')

When scaled to density, the two distributions look similar.
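As a rough numerical check on that visual impression (a sketch, not part of the original analysis), a two-sample Kolmogorov-Smirnov test compares the predicted and training pH distributions; ties in the pH values make the resulting p-value approximate.

# Informal check: compare predicted vs. training pH distributions
ks.test(pred_eval, dfm_full$PH)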

Conclusion

The evaluation data closely resemble the training data in structure, skewness, and outlier behavior, so the preparation steps used for the training set carried over directly: removing MFR, recoding blank brand codes as NA, imputing with KNN, and creating dummy variables for Brand.Code. Using the top-performing model, a Cubist model fit to the full training set, we predicted pH for the 266 evaluation observations and saved the results to predicted_ph.csv. The predicted values have a mean of roughly 8.55 and a distribution similar to the training pH, though with a somewhat narrower range and smaller standard deviation, as expected given the smaller number of samples.

References

Box, G. and Cox, D. (1964). An Analysis of Transformations. Journal of the Royal Statistical Society: Series B (Methodological), Volume 26, Issue 2, June 1964, Pages 211-243. https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.2517-6161.1964.tb00553.x

Glen, S. (2023). “Variance Inflation Factor” From StatisticsHowTo.com: Elementary Statistics for the rest of us! https://www.statisticshowto.com/variance-inflation-factor/

Joanes, D. and Gill, C. (1998). Comparing measures of sample skewness and kurtosis. The Statistician, 47, pages 183-189.

Kaplan, J. (2020). fastDummies: Fast Creation of Dummy (Binary) Columns and Rows from Categorical Variables. R package version 1.6.3, https://CRAN.R-project.org/package=fastDummies.

Kuhn, M. (2022). caret: Classification and Regression Training. R package version 6.0-93, https://CRAN.R-project.org/package=caret.

Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer Science+Business Media.

Meyer, D. et al. (2022). e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R package version 1.7-12, https://CRAN.R-project.org/package=e1071.

Petrie, A. (2020). regclass: Tools for an Introductory Class in Regression and Modeling. R package version 1.6, https://CRAN.R-project.org/package=regclass.

Torgo, L. (2016). Data Mining with R: Learning with Case Studies (2nd ed.). Chapman and Hall/CRC. http://ltorgo.github.io/DMwR2.

van Buuren, S. (2018). Flexible Imputation of Missing Data (2nd ed.). Chapman & Hall/CRC Interdisciplinary Statistics Series. https://stefvanbuuren.name/fimd/sec-FCS.html

van Buuren, S. and Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. DOI 10.18637/jss.v045.i03.

van Buuren, S., et al. (2023). Multivariate Imputation by Chained Equations. https://cran.r-project.org/web/packages/mice/mice.pdf.

Yeo, I. and Johnson, R. (2000). A new family of power transformations to improve normality or symmetry. Biometrika, Volume 87, Issue 4, December 2000, pages 954–959. https://academic.oup.com/biomet/article-abstract/87/4/954/232908.