First, we load the data from GitHub. We had some trouble reading the Excel files, so we converted them to CSV.
# Load training data (m=modeling)
dfm_raw <- read.csv('https://raw.githubusercontent.com/klgriffen96/summer23_data624/main/project_2/StudentData%20-%20TO%20MODEL.csv')
# Load evaluation data
dfe_raw <- read.csv('https://github.com/klgriffen96/summer23_data624/raw/main/project_2/StudentEvaluation-%20TO%20PREDICT.csv')
A first look at the data is as follows.
# Move outcome variable pH to front for easier access
dfm <- dfm_raw %>%
dplyr::select(PH, !matches('Brand.Code'), Brand.Code)
str(dfm)
## 'data.frame': 2571 obs. of 33 variables:
## $ PH : num 8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
## $ Carb.Volume : num 5.34 5.43 5.29 5.44 5.49 ...
## $ Fill.Ounces : num 24 24 24.1 24 24.3 ...
## $ PC.Volume : num 0.263 0.239 0.263 0.293 0.111 ...
## $ Carb.Pressure : num 68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
## $ Carb.Temp : num 141 140 145 133 137 ...
## $ PSC : num 0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
## $ PSC.Fill : num 0.26 0.22 0.34 0.42 0.16 0.24 0.4 0.34 0.12 0.24 ...
## $ PSC.CO2 : num 0.04 0.04 0.16 0.04 0.12 0.04 0.04 0.04 0.14 0.06 ...
## $ Mnf.Flow : num -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
## $ Carb.Pressure1 : num 119 122 120 115 118 ...
## $ Fill.Pressure : num 46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
## $ Hyd.Pressure1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure2 : num NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure3 : num NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure4 : int 118 106 82 92 92 116 124 132 90 108 ...
## $ Filler.Level : num 121 119 120 118 119 ...
## $ Filler.Speed : int 4002 3986 4020 4012 4010 4014 NA 1004 4014 4028 ...
## $ Temperature : num 66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
## $ Usage.cont : num 16.2 19.9 17.8 17.4 17.7 ...
## $ Carb.Flow : int 2932 3144 2914 3062 3054 2948 30 684 2902 3038 ...
## $ Density : num 0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
## $ MFR : num 725 727 735 731 723 ...
## $ Balling : num 1.4 1.5 3.14 3.04 3.04 ...
## $ Pressure.Vacuum : num -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
## $ Oxygen.Filler : num 0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
## $ Bowl.Setpoint : int 120 120 120 120 120 120 120 120 120 120 ...
## $ Pressure.Setpoint: num 46.4 46.8 46.6 46 46 46 46 46 46 46 ...
## $ Air.Pressurer : num 143 143 142 146 146 ...
## $ Alch.Rel : num 6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
## $ Carb.Rel : num 5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
## $ Balling.Lvl : num 1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...
## $ Brand.Code : chr "B" "A" "B" "A" ...
summary(dfm)
## PH Carb.Volume Fill.Ounces PC.Volume
## Min. :7.880 Min. :5.040 Min. :23.63 Min. :0.07933
## 1st Qu.:8.440 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23917
## Median :8.540 Median :5.347 Median :23.97 Median :0.27133
## Mean :8.546 Mean :5.370 Mean :23.97 Mean :0.27712
## 3rd Qu.:8.680 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200
## Max. :9.360 Max. :5.700 Max. :24.32 Max. :0.47800
## NA's :4 NA's :10 NA's :38 NA's :39
## Carb.Pressure Carb.Temp PSC PSC.Fill
## Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000
## 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000
## Median :68.20 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.19 Mean :141.1 Mean :0.08457 Mean :0.1954
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200
## NA's :27 NA's :26 NA's :33 NA's :23
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## Min. :0.00000 Min. :-100.20 Min. :105.6 Min. :34.60
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00
## Median :0.04000 Median : 65.20 Median :123.2 Median :46.40
## Mean :0.05641 Mean : 24.57 Mean :122.6 Mean :47.92
## 3rd Qu.:0.08000 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00
## Max. :0.24000 Max. : 229.40 Max. :140.2 Max. :60.40
## NA's :39 NA's :2 NA's :32 NA's :22
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4
## Min. :-0.80 Min. : 0.00 Min. :-1.20 Min. : 52.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00
## Median :11.40 Median :28.60 Median :27.60 Median : 96.00
## Mean :12.44 Mean :20.96 Mean :20.46 Mean : 96.29
## 3rd Qu.:20.20 3rd Qu.:34.60 3rd Qu.:33.40 3rd Qu.:102.00
## Max. :58.00 Max. :59.40 Max. :50.00 Max. :142.00
## NA's :11 NA's :15 NA's :15 NA's :30
## Filler.Level Filler.Speed Temperature Usage.cont Carb.Flow
## Min. : 55.8 Min. : 998 Min. :63.60 Min. :12.08 Min. : 26
## 1st Qu.: 98.3 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1144
## Median :118.4 Median :3982 Median :65.60 Median :21.79 Median :3028
## Mean :109.3 Mean :3687 Mean :65.97 Mean :20.99 Mean :2468
## 3rd Qu.:120.0 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75 3rd Qu.:3186
## Max. :161.2 Max. :4030 Max. :76.20 Max. :25.90 Max. :5104
## NA's :20 NA's :57 NA's :14 NA's :5 NA's :2
## Density MFR Balling Pressure.Vacuum
## Min. :0.240 Min. : 31.4 Min. :-0.170 Min. :-6.600
## 1st Qu.:0.900 1st Qu.:706.3 1st Qu.: 1.496 1st Qu.:-5.600
## Median :0.980 Median :724.0 Median : 1.648 Median :-5.400
## Mean :1.174 Mean :704.0 Mean : 2.198 Mean :-5.216
## 3rd Qu.:1.620 3rd Qu.:731.0 3rd Qu.: 3.292 3rd Qu.:-5.000
## Max. :1.920 Max. :868.6 Max. : 4.012 Max. :-3.600
## NA's :1 NA's :212 NA's :1
## Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer
## Min. :0.00240 Min. : 70.0 Min. :44.00 Min. :140.8
## 1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00 1st Qu.:142.2
## Median :0.03340 Median :120.0 Median :46.00 Median :142.6
## Mean :0.04684 Mean :109.3 Mean :47.62 Mean :142.8
## 3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00 3rd Qu.:143.0
## Max. :0.40000 Max. :140.0 Max. :52.00 Max. :148.2
## NA's :12 NA's :2 NA's :12
## Alch.Rel Carb.Rel Balling.Lvl Brand.Code
## Min. :5.280 Min. :4.960 Min. :0.00 Length:2571
## 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.38 Class :character
## Median :6.560 Median :5.400 Median :1.48 Mode :character
## Mean :6.897 Mean :5.437 Mean :2.05
## 3rd Qu.:7.240 3rd Qu.:5.540 3rd Qu.:3.14
## Max. :8.620 Max. :6.060 Max. :3.66
## NA's :9 NA's :10 NA's :1
kable(table(sapply(dfm_raw, class)), col.names=c("type","count")) |> kable_styling()
| type | count |
|---|---|
| character | 1 |
| integer | 4 |
| numeric | 28 |
We note several things from a cursory first glance: the data set contains 2,571 observations of 33 variables, all numeric except the categorical Brand.Code; nearly every variable contains at least a few missing values, with MFR missing by far the most (212); and some variables, such as Mnf.Flow and the Hyd.Pressure series, contain a large number of repeated values such as -100 or 0.
Let’s take a look at the distribution of the numerical predictors.
dfm |>
select(-c(PH,Brand.Code)) |>
gather(key = "predictor", value = "value") |>
ggplot(aes(x = value)) +
geom_density(fill="lightblue") +
facet_wrap(~ predictor, scales = "free") +
theme_minimal() +
theme(axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank())
We can observe either bimodality or skew in most of the variables, suggesting a nonlinear model may be appropriate. Because of the bimodality, violin plots are preferable to boxplots for observing outliers.
We'll need to do a small amount of data preparation first: the categorical variable Brand.Code will need to be converted to a factor, as some functions we'll use later require it (for example, the near-zero variance calculations seen below). This will only be useful for initial data exploration; for modeling purposes, we'll need to handle this variable differently (see Data Preparation below).
dfm |>
select(where(is.character)) |>
table() |>
prop.table() |>
barplot(main="Distribution of Brand Codes", col="lightblue", ylab="Freq")
# Factor categorical variable Brand.Code
dfm$Brand.Code <- factor(dfm$Brand.Code)
To visually evaluate our data set, we'll generate feature plots of the continuous variables and a boxplot of the single categorical variable (Brand.Code). These plots show the relationship of each continuous variable against the outcome variable (pH) and can be used to detect how individual predictors influence the outcome.
# Feature plots - exclude the last column, which is Brand.Code (a categorical variable)
featurePlot(x=dfm[,2:(ncol(dfm)-1)], y=dfm$PH, plot='scatter', main='Feature plots against PH', type=c('p', 'smooth'), span=0.5, layout=c(8, 4), col.line='red')
# Boxplot of Brand.Code against PH
dfm %>%
filter(!is.na(Brand.Code) & !is.na(PH)) %>%
ggplot(aes(x=Brand.Code, y=PH)) +
geom_boxplot()
At first glance, there doesn't appear to be any strong relationship between pH and any of the predictors. There is a slightly positive relationship between pH and a few variables such as Pressure.Vacuum, but in general no strong trends are evident.
We also note that some of the observations have a blank Brand.Code, which we’ll have to handle during data preparation.
We'll also look for collinearity among variables. Collinearity is a problem for linear regression-based models, as these models can generate severely inflated coefficients for predictors that are strongly collinear. In these cases, we have several options: remove one or more of the collinear variables, transform them to remove the collinearity, or choose a model that is robust to collinearity.
To examine collinearity in our data set, we’ll use the corrplot() function from the package of the same name to generate a correlation plot. This shows strongly correlated variables in darker colors and with larger squares. Positive correlations are shown in blue, while negative correlations are in red. Correlation values of 1.0 indicate perfect positive correlation between variables, while values of -1.0 indicate a perfect inverse correlation. The plot is clustered by variables that exhibit strong correlations with each other, which can provide some indication of multicollinearity among groups of variables.
# Generate corr plot for pairwise correlations
corr1 <- cor(dfm[,1:(ncol(dfm)-1)], use='complete')
corrplot::corrplot(corr1, method='square', order='hclust', type='full', diag=T, tl.cex=0.75, cl.cex=0.75)
As shown in the correlation plot, there are strong correlations among a cluster of six variables.
It may be advantageous to remove one or more of these variables, transform them to remove collinearity, or choose a model that is robust to collinearity. Interestingly, these same six variables exhibit a relatively strong inverse relationship with Hyd.Pressure4.
In addition to collinearity, which evaluates relationships between pairs of predictors, we'll look at multicollinearity, which indicates how strongly each predictor is linearly related to the remaining predictors as a group. To evaluate multicollinearity, we'll fit a simple linear model to the data using the full set of predictors. Then, we'll use the vif() function in the car package to estimate the variance inflation factor (VIF), a statistic purpose-built to measure multicollinearity among variables. Values less than or equal to 1 indicate no multicollinearity, values between 1 and 5 indicate moderate multicollinearity, and values greater than 5 indicate strong multicollinearity (Glen, 2023).
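To make the definition concrete, the VIF of a single predictor j equals 1 / (1 - R²_j), where R²_j is obtained by regressing predictor j on all of the other predictors. Below is a minimal sketch of that calculation for one predictor (the predictor choice and the complete-case filtering are illustrative assumptions); the result will differ somewhat from the car::vif() output below because Brand.Code is excluded and rows with missing values are dropped here.
# Illustrative sketch only: hand-compute the VIF for one predictor (Balling.Lvl)
# as 1 / (1 - R^2), where R^2 comes from regressing that predictor on the other
# numeric predictors.
num_preds <- dfm %>%
dplyr::select(-PH, -Brand.Code) %>%
na.omit()
r2_bl <- summary(lm(Balling.Lvl ~ ., data=num_preds))$r.squared
1 / (1 - r2_bl)   # hand-computed VIF for Balling.Lvl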
# Examine multicollinearity by generating a simple linear model of the data,
# then looking at the variance inflation factor (VIF) of each variable.
# VIF <= 1: low multicollinearity
# 1 < VIF <= 5: moderate multicollinearity
# VIF > 5: high multicollinearity
lmod <- lm(PH ~ ., data=dfm)
dfvif <- data.frame(vif(lmod))
dfvif <- dfvif %>%
mutate(Predictor=row.names(dfvif)) %>%
rename(VIF=GVIF) %>%
dplyr::select(Predictor, VIF) %>%
arrange(desc(VIF)) %>%
mutate(Rank=rank(-VIF))
dfvif %>%
flextable() %>%
width(width = 2) %>%
fontsize(size = 10) %>%
line_spacing(space = 1) %>%
hline(part = "all") %>%
set_caption('Variance Inflation Factor')
Predictor | VIF | Rank |
|---|---|---|
Balling.Lvl | 196.699873 | 1 |
Balling | 168.721067 | 2 |
Brand.Code | 161.462006 | 3 |
Carb.Pressure | 43.629727 | 4 |
Carb.Temp | 35.744235 | 5 |
Alch.Rel | 33.466584 | 6 |
Bowl.Setpoint | 26.051364 | 7 |
Filler.Level | 25.385854 | 8 |
Carb.Volume | 17.895899 | 9 |
Density | 16.438808 | 10 |
Hyd.Pressure3 | 12.964503 | 11 |
MFR | 11.435430 | 12 |
Filler.Speed | 11.224377 | 13 |
Hyd.Pressure2 | 10.635457 | 14 |
Carb.Rel | 7.483881 | 15 |
Mnf.Flow | 4.979607 | 16 |
Pressure.Vacuum | 4.289511 | 17 |
Fill.Pressure | 3.741073 | 18 |
Pressure.Setpoint | 3.486682 | 19 |
Hyd.Pressure1 | 3.000519 | 20 |
Hyd.Pressure4 | 2.686600 | 21 |
Carb.Flow | 2.256335 | 22 |
Usage.cont | 1.773788 | 23 |
PC.Volume | 1.712473 | 24 |
Oxygen.Filler | 1.579977 | 25 |
Carb.Pressure1 | 1.467316 | 26 |
Temperature | 1.390864 | 27 |
Air.Pressurer | 1.234829 | 28 |
Fill.Ounces | 1.181213 | 29 |
PSC | 1.163257 | 30 |
PSC.Fill | 1.112048 | 31 |
PSC.CO2 | 1.067962 | 32 |
As shown in the table, almost half of the variables (15) have a VIF greater than 5, indicating strong multicollinearity. This will inform our decision on the type of model to employ.
It should be noted that, while collinearity and multicollinearity adversely affect a linear regression-based model's capacity to generate useful information about the contribution of each predictor to the outcome, these conditions don't necessarily impair the model's predictive performance. Therefore, the presence of collinear or multicollinear predictors doesn't automatically preclude us from using these types of models. [need citation]
We’ll look for variables having near-zero variance (NZV) since some models are sensitive to this condition. A variable exhibiting NZV is characterized as having only a single value (or a very small number of values) across the entire range of observations. These types of “degenerate” distributions can be problematic for models such as those based on linear regression. If NZV variables are observed, either they should be removed or a model that is immune to NZV variables should be used (e.g. tree-based models).
To calculate NZV, we'll use the nearZeroVar() function from the caret package. This calculates the ratio of the frequency of the most common value in each variable to that of the next most common value. A high ratio indicates an NZV condition, and the variable may need special handling if a model that is sensitive to NZV variables is being used (Kuhn, 2022).
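As a quick illustration (a sketch only; nearZeroVar() remains the authoritative calculation), the frequency ratio for a single column can be reproduced by hand:
# Illustrative sketch: the freqRatio for Hyd.Pressure1 is the count of its most
# common value divided by the count of its second most common value.
hp_counts <- sort(table(dfm$Hyd.Pressure1), decreasing=TRUE)
hp_counts[1] / hp_counts[2]   # ~31.1, matching the freqRatio in the table below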
The following table shows the frequency ratios of variables in the data set in descending order.
# Look for NZV variables
nzv <- nearZeroVar(dfm, saveMetrics=T)
nzv$Variable <- row.names(nzv)
nzv %>% dplyr::select(Variable, everything()) %>%
arrange(desc(freqRatio)) %>%
flextable() %>%
width(width = 1) %>%
fontsize(size = 10) %>%
line_spacing(space = 1) %>%
hline(part = "all") %>%
set_caption('Frequency ratios of variables')
Variable | freqRatio | percentUnique | zeroVar | nzv |
|---|---|---|---|---|
Hyd.Pressure1 | 31.111111 | 9.5293660 | FALSE | TRUE |
Hyd.Pressure3 | 11.450704 | 7.4679113 | FALSE | FALSE |
Hyd.Pressure2 | 7.271028 | 8.0513419 | FALSE | FALSE |
Bowl.Setpoint | 2.990847 | 0.4278491 | FALSE | FALSE |
Brand.Code | 2.014634 | 0.1944769 | FALSE | FALSE |
Fill.Pressure | 1.762931 | 4.2007001 | FALSE | FALSE |
Carb.Flow | 1.586207 | 20.7312330 | FALSE | FALSE |
Pressure.Vacuum | 1.389728 | 0.6223259 | FALSE | FALSE |
Pressure.Setpoint | 1.319361 | 0.3111630 | FALSE | FALSE |
Balling.Lvl | 1.294872 | 3.1894205 | FALSE | FALSE |
Oxygen.Filler | 1.266667 | 13.1466356 | FALSE | FALSE |
MFR | 1.222222 | 22.8315830 | FALSE | FALSE |
PSC | 1.211538 | 5.0175029 | FALSE | FALSE |
Balling | 1.193182 | 8.4402956 | FALSE | FALSE |
Fill.Ounces | 1.163043 | 3.5783742 | FALSE | FALSE |
Density | 1.108374 | 3.0338390 | FALSE | FALSE |
Usage.cont | 1.105263 | 18.7086737 | FALSE | FALSE |
PC.Volume | 1.100000 | 17.6584986 | FALSE | FALSE |
Alch.Rel | 1.098315 | 2.0614547 | FALSE | FALSE |
Air.Pressurer | 1.093407 | 1.2446519 | FALSE | FALSE |
PSC.Fill | 1.091892 | 1.2446519 | FALSE | FALSE |
Filler.Level | 1.086957 | 11.2018670 | FALSE | FALSE |
PSC.CO2 | 1.078303 | 0.5056398 | FALSE | FALSE |
PH | 1.078125 | 2.0225593 | FALSE | FALSE |
Carb.Volume | 1.055556 | 3.9284325 | FALSE | FALSE |
Mnf.Flow | 1.048443 | 18.9420459 | FALSE | FALSE |
Filler.Speed | 1.045161 | 9.4904706 | FALSE | FALSE |
Temperature | 1.040000 | 2.1781408 | FALSE | FALSE |
Carb.Pressure1 | 1.031746 | 5.4453520 | FALSE | FALSE |
Carb.Temp | 1.016667 | 4.7841307 | FALSE | FALSE |
Carb.Pressure | 1.016393 | 4.1229094 | FALSE | FALSE |
Carb.Rel | 1.011583 | 1.6336056 | FALSE | FALSE |
Hyd.Pressure4 | 1.008264 | 1.5558149 | FALSE | FALSE |
As shown, only a single variable (Hyd.Pressure1) exhibited a high enough frequency ratio to be classified as an NZV variable. We’ll examine this variable in a bit more detail.
# Generate table of top unique values of Hyd.Pressure1
hpfreq <- dfm %>%
group_by(Hyd.Pressure1) %>%
summarize(Unique.count=n()) %>%
arrange(desc(Unique.count))
# Tally up values other than the top 2
others <- hpfreq %>%
filter(Unique.count < hpfreq[[2,'Unique.count']]) %>%
summarize(Unique.count=sum(Unique.count))
# Add the "other values" row to the table
hpfreq <- hpfreq %>%
head(2) %>%
rbind(data.frame(Hyd.Pressure1='other values', Unique.count=as.numeric(others))) %>%
rename(Value=Hyd.Pressure1) %>%
mutate(Percent=round(100 * Unique.count / nrow(dfm), 2))
# Display the table
hpfreq %>%
flextable() %>%
width(width = 2) %>%
fontsize(size = 10) %>%
line_spacing(space = 1) %>%
hline(part = "all") %>%
set_caption('Hyd.Pressure1')
Value | Unique.count | Percent |
|---|---|---|
0 | 840 | 32.67 |
12.4 | 27 | 1.05 |
other values | 1,704 | 66.28 |
As indicated in the table above, a large proportion of the observations (32.7%) have zero values. Without knowing how the data was collected or what the variable indicates, it is difficult to know the best way to treat the variable for modeling. The best we can do is assume that zero values are correctly entered, structurally sound, and have meaning in the context of the data. Based on these assumptions, this variable is a candidate to be dropped if a linear regression-based model is used.
Now we'll examine missing values in the data set. As noted earlier, there aren't that many, but some models are sensitive to missing values. As such, the following options are available: remove the affected observations, remove predictors with a large proportion of missing values, or impute (fill in) the missing values using a statistical method.
We’ll take a first look at missing values on a column-by-column basis.
# Count NAs per column
missing_values <- data.frame(colSums(is.na(dfm)))
missing_values$Variable <- row.names(missing_values)
colnames(missing_values) <- c('Missing.values', 'Variable')
missing_values %>%
dplyr::select(Variable, Missing.values) %>%
arrange(desc(Missing.values)) %>%
flextable() %>%
width(width = 2) %>%
fontsize(size = 10) %>%
line_spacing(space = 1) %>%
hline(part = "all") %>%
set_caption('Missing values')
Variable | Missing.values |
|---|---|
MFR | 212 |
Filler.Speed | 57 |
PC.Volume | 39 |
PSC.CO2 | 39 |
Fill.Ounces | 38 |
PSC | 33 |
Carb.Pressure1 | 32 |
Hyd.Pressure4 | 30 |
Carb.Pressure | 27 |
Carb.Temp | 26 |
PSC.Fill | 23 |
Fill.Pressure | 22 |
Filler.Level | 20 |
Hyd.Pressure2 | 15 |
Hyd.Pressure3 | 15 |
Temperature | 14 |
Oxygen.Filler | 12 |
Pressure.Setpoint | 12 |
Hyd.Pressure1 | 11 |
Carb.Volume | 10 |
Carb.Rel | 10 |
Alch.Rel | 9 |
Usage.cont | 5 |
PH | 4 |
Mnf.Flow | 2 |
Carb.Flow | 2 |
Bowl.Setpoint | 2 |
Density | 1 |
Balling | 1 |
Balling.Lvl | 1 |
Pressure.Vacuum | 0 |
Air.Pressurer | 0 |
Brand.Code | 0 |
A good way to visualize missing values is to use the md.pattern() function from the mice package (van Buuren, 2011). This function produces a plot that shows patterns of missing data in order of increasing missing information. The numbers along the left indicate the number of observations that correspond to the pattern indicated by that row of squares. Blue squares indicate no missing values, while red squares indicate missing values. The numbers on the far right indicate the number of variables with missing data for that set of observations.
# Missing value patterns
invisible(md.pattern(dfm, rotate.names=T, plot=T))
As shown, there are 2129 observations for which no values are missing, or 82.8% of the data. There are only a few rows with many red squares, indicating that few observations are missing a significant number of variables. Only a single observation is missing a double-digit number of predictor values (12); this row is likely a good candidate for removal. In addition, one predictor (MFR) contains a significant number of missing values (212, or 8.2% of observations) and should likely be removed. The remaining missing values can be imputed (filled in) using one of the statistical methods listed above.
Another way to see this clearly is with a column chart.
nas <- colMeans(is.na(dfm))
data.frame(variable = names(nas), missing = nas, row.names = NULL) |>
ggplot(aes(y=reorder(variable,missing), x=missing)) +
geom_col(fill="lightblue") +
ggtitle('proportion of missing values') + xlab('') + ylab('')
An additional option is to use multivariate imputation by chained equations (MICE) (van Buuren et al., 2023). MICE uses the fully conditional specification (FCS), which imputes each variable using a separate method. The method can either be specified by the modeler or chosen from a set of defaults, depending on the kind of variable. For example, there are default methods for continuous variables and others for categorical variables, with the choice for categorical variables depending on the number of factor levels.
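Although we ultimately use a simpler imputation method (see Data Preparation below), a minimal sketch of how MICE could be applied here is shown next; the settings (m, maxit, and the seed) are illustrative rather than tuned choices, and this can be slow on a data set of this size.
# Illustrative sketch only: let mice() choose its default method per variable
# (predictive mean matching for numeric columns, logistic/polytomous regression
# for factors), then extract one of the m completed data sets.
dfm_mice <- dfm
dfm_mice$Brand.Code <- factor(ifelse(dfm_mice$Brand.Code == '', NA, as.character(dfm_mice$Brand.Code)))
imp <- mice(dfm_mice, m=5, maxit=5, seed=77, printFlag=FALSE)
dfm_completed <- complete(imp, 1)   # first of the m completed data sets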
Many types of models are sensitive to variables that exhibit skewed distributions, i.e. those that do not exhibit a typical bell-shaped pattern but instead have a disproportionate number of high or low values. Linear regression-based models are particularly susceptible to skewness.
To reduce skewness, skewed variables can often be mathematically transformed, for example by taking the natural logarithm or using a method such as the Box-Cox transformation. This method uses an empirical means to find a transformation of the data (Box and Cox, 1964).
Box-Cox transformations only work on positive values, so if zero or negative values are encountered a different method must be used. If the variable contains zeros but no negative values, a constant (typically 1) can be added before applying the Box-Cox transformation; however, there is debate about what constant to use in these cases. If negative values are also present, a method such as the one proposed by Yeo and Johnson (Yeo & Johnson, 2000) can be employed to transform the variable. The Yeo-Johnson method accepts positive, negative, or zero values and is thus more robust when values aren't guaranteed to be positive.
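A minimal sketch of both transformations using caret's preProcess() follows (the two example columns are illustrative assumptions); Box-Cox is only estimated for strictly positive columns, so it will skip PSC.CO2, which contains zeros, while Yeo-Johnson handles it.
# Illustrative sketch: estimate Box-Cox and Yeo-Johnson transformations with caret.
example_cols <- na.omit(dfm[, c('PSC', 'PSC.CO2')])
pp_bc <- preProcess(example_cols, method='BoxCox')      # skips PSC.CO2 (zeros present)
pp_yj <- preProcess(example_cols, method='YeoJohnson')  # handles zero values
transformed <- predict(pp_yj, example_cols)             # apply the estimated transformations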
To visually examine skewness, we’ll generate histograms of the continuous variables.
# Generate histograms of continuous variables
dfm %>%
dplyr::select(c(1:(ncol(dfm)-1))) %>%
gather() %>%
ggplot(aes(x=value)) +
geom_histogram(bins=20) +
facet_wrap(~key, scales = 'free_x') +
ggtitle('Histograms of continuous variables')
Some variables exhibit obviously skewed distributions (e.g. Filler.Speed, PSC, and Usage.cont), while others, such as PH and PSC.Fill, appear approximately symmetric. Others, like Balling and Balling.Lvl, appear bimodal.
To gain a better understanding of which variables are skewed and to what degree, we can calculate skewness quantitatively using the e1071 package (Meyer, 2022). Skewness values with an absolute value greater than 1 are considered highly skewed, values between 0.5 and 1 are moderately skewed, and absolute values less than 0.5 can be considered approximately symmetric (Joanes and Gill, 1998).
The sign of the skewness value indicates in which direction the distribution is skewed: Left-skewed distributions will exhibit negative skewness values, while the sign of right-skewed distributions will be positive.
The following plot quantitatively enumerates the skewness of each variable, ordered by how heavily skewed the variable is.
# Look at skewness quantitatively
# As a general rule of thumb:
# highly skewed: |skewness| > 1
# moderately skewed: 0.5 < |skewness| < 1
# approximately symmetric: 0 < |skewness| < 0.5
# Negative values indicate left-skewed distributions; positive values indicate right-skewed distributions.
# Helper function to calculate skewness after dropping NAs (e1071::skewness returns NA when missing values are present)
calcSkewness <- function(x, type) {
return(e1071::skewness(x[!is.na(x)], type=type))
}
# Calculate skewness on columns
colskewness <- data.frame(apply(dfm[,1:(ncol(dfm)-1)], 2, calcSkewness, type=1))
colskewness$Variable <- row.names(colskewness)
colnames(colskewness) <- c('Skewness', 'Variable')
# Graph skewness values
colskewness %>%
dplyr::select(Variable, Skewness) %>%
arrange(desc(abs(Skewness))) %>%
ggplot(aes(x=reorder(Variable, desc(Skewness)), y=Skewness)) +
geom_bar(stat='identity') +
coord_flip() +
ggtitle('Variable skewness') +
xlab('')
As seen in the plot, six variables are heavily skewed; the first two are skewed left and the remainder are skewed right.
This reinforces the idea that some type of transformation will be needed if a model is used that is sensitive to skewness (e.g. linear regression-based models). Alternatively, models that aren’t sensitive to skewness can be employed (e.g. tree-based models).
Like collinearity and skewness, the presence of outliers in a data set can also negatively impact some models. Outliers are typically defined as points lying more than a certain number of standard deviations from the mean (commonly 2, 2.5, or 3). We'll examine our data set for outlying values under this definition, using a cutoff of 3 standard deviations.
# Create function to count the number of outliers outside of (mult) standard deviations from the mean
getOutliers <- function(x, mult) {
mean_val <- mean(x[!is.na(x)])
loval <- mean_val - (sd(x[!is.na(x)]) * mult)
hival <- mean_val + (sd(x[!is.na(x)]) * mult)
#return(vals[vals < loval | vals > hival])
return(!is.na(x) & (x < loval | x > hival))
}
# Set multiplier; outliers are considered as such when they lie outside of (multiplier) standard deviations from the mean
mult <- 3
outliers <- apply(dfm[,1:(ncol(dfm)-1)], 2, getOutliers, mult=mult)
dfout <- data.frame(outliers)
dfout <- data.frame(colSums(outliers))
dfout$Variable <- row.names(dfout)
colnames(dfout) <- c('Outlier.count', 'Variable')
# Filter just those variables with outliers and sort by outlier count
dfout <- dfout %>%
dplyr::select(Variable, Outlier.count) %>%
arrange(desc(Outlier.count)) %>%
filter(Outlier.count > 0)
# Generate bar graph of outlier counts
dfout %>%
ggplot(aes(x=reorder(Variable, Outlier.count), y=Outlier.count, label=Outlier.count)) +
geom_bar(stat='identity') +
coord_flip() +
geom_text(check_overlap=T, hjust=-0.1, nudge_x=0.05) +
ggtitle('Outliers (only variables with outliers shown)') +
xlab('') + ylab('Outlier count')
The graph above shows that Filler.Speed has the largest number of outliers (142, or 5.5% of the data). If models that are sensitive to outliers are used (e.g. linear regression-based models), the outliers must be handled in some way, either by removal or by an appropriate transformation.
To better understand the outliers, we plotted variables having outliers against the outcome variable, pH. In the plots below, outliers beyond three times the standard deviation appear in red.
# Create a function to filter and graph the outliers for a specific column
graphOutliers <- function(x, mult) {
# Get an array of T/F values indicating whether each observation is an outlier
outlier_array <- getOutliers(dfm[,x], 3)
# Create data set consisting only of outliers
df_outliers <- dfm[outlier_array,] %>%
dplyr::select(PH, any_of(x)) %>%
mutate(outlier=T)
# Create data set consisting only of non-outliers
df_non_outliers <- dfm[!outlier_array,] %>%
dplyr::select(PH, any_of(x)) %>%
mutate(outlier=F)
# Bind the outlier data set to the non-outlier data set
dftmp <- df_non_outliers %>%
rbind(df_outliers)
# Graph the outliers
plt <- dftmp %>%
filter(!is.na(PH) & !is.na(!!sym(x))) %>%
ggplot(aes(x=!!sym(x), y=PH, color=outlier)) +
geom_point() +
scale_colour_manual(values=c('#e0e0e0', '#c02222')) +
theme(legend.position='none')
return(plt)
}
# Initialize list of plots
plts <- list()
# Iterate through all variables with outliers
for (i in 1:nrow(dfout)) {
plts[[i]] <- graphOutliers(dfout[i, 'Variable'], 3)
}
do.call("grid.arrange", c(plts, ncol=3))Based on the plot above, there don’t appear to be any outliers that are structurally unsound (i.e., those with impossible values, such as negative pH values). There are some values that are somewhat suspect (e.g. the six values on the far left and right sides of the Alch.Rel plot), but without a description of the data set or knowing how the data was collected, one can only hazard a guess as to which are valid and which are true outliers. Besides the suspect Alch.Rel data points, the remaining points appear to be valid (though extreme) values when compared to the main body of data points.
As mentioned previously, because of the bimodality, violin plots are preferable to boxplots for observing outliers.
dfm |>
select(-c(Brand.Code)) |>
gather(key = "predictor", value = "value") |>
ggplot(aes(x = predictor, y = value)) +
geom_violin(fill="lightblue") +
facet_wrap(~ predictor, scales = "free") +
coord_flip() +
theme_minimal() +
theme(axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank())
At this point we'll pause to consider several significant factors that will influence our approach to modeling even before we begin cleaning and transforming the data: the collinearity and multicollinearity, near-zero variance, skewness, outliers, and missing values identified above, as well as the bimodal distributions of several predictors.
Considering these factors, we decided to use an approach to modeling similar to that described in Chapter 10 of our text (Kuhn and Johnson, 2013). Namely, we will create two data sets: a full data set to use with models that are robust to these irregularities, and a reduced data set to use with models that are sensitive to them.
Using the above approach will give us insights into the best way to model the data using all available models, while generally adhering to the recommended best practices for each model type.
Although we will create two data sets (full and reduced), there is nothing preventing us from running both data sets through all models. We expect linear regression-based models to be unstable and to perform poorly on the full set, but there is no inherent disadvantage to doing so.
Since the outcome variable (pH) is continuous, this is a regression problem, so we’ll naturally choose regression-based models. We’ll start by performing linear regression-based modeling, then move to non-linear regression models, followed by tree-based regression models.
As stated, we'll prepare two data sets: the full data set to use with more robust models, and the reduced data set to use with more sensitive models. While some models are robust to many types of irregularities such as skewed variables and outliers, missing values can pose a problem even for these models and for many of the functions involved in preparing them. Therefore, we'll need to handle missing values for both the full set and the reduced set.
As previously noted, there is a single observation that has 12 missing predictor values. We’ll remove this observation as it likely would contribute little information to the model. In addition, we’ll remove the predictor (MFR) that contains a large number of missing values relative to the total number of observations; there were 212 missing values for this predictor, or 8.2% of the data.
# Init the full data set
dfm_full <- dfm
# Remove observation with 12 missing values
dfm_full <- dfm_full[(rowSums(is.na(dfm_full)) < 12),]
# Remove the MFR predictor because it has 212 missing values (8.2% of the data)
dfm_full <- dfm_full %>% select(-MFR)
For the remaining missing values, there are multiple imputation methods available. The MICE method is robust but computationally intensive. Since the number of missing values in our data set is relatively small, we'll choose a simpler method from the DMwR2 package, the knnImputation() function (Torgo, 2016). knnImputation() uses the well-known k-nearest-neighbors algorithm to find the k observations most similar to each observation with missing values. The value of k is selected by the modeler and, for imputation purposes, is fairly arbitrary: smaller values of k lead to faster but less stable imputations, while larger values are more computationally intensive but yield smoother results. We chose a value of 9 for k, which is within the typical range for these applications and provides a good balance of performance versus speed.
# Change blank brand codes to NA so they will be imputed
dfm_full <- dfm_full %>%
mutate(Brand.Code=ifelse(Brand.Code=='', NA, as.character(Brand.Code)))
# Factor Brand.Code (needed for imputation)
dfm_full$Brand.Code <- factor(dfm_full$Brand.Code)
# MICE - not used due to computational intensiveness
#imp <- mice(dfm_full, maxit=5, m=5, seed=77)
#complete(imp)
# Use knnImputation to impute values
dfm_full <- knnImputation(dfm_full, k=9)
One additional step we'll need to take is to recode the categorical variable Brand.Code into so-called "dummy" variables. This creates a series of indicator variables that take a value of 0 or 1; each indicates the presence (1) or absence (0) of that particular Brand.Code. We'll use the fastDummies package (Kaplan, 2020), which automatically creates dummy variables for factor- or character-type variables.
# Create dummies for factor and character variables
dfm_full <- dummy_cols(dfm_full, select_columns='Brand.Code', remove_first_dummy=F, remove_selected_columns=T)
It is noted that after the dummy columns are created, the data set now has 34 predictors instead of 31.
Now that we’ve filled in the missing values, we have a full data set to use with models that are robust to the confounding problems previously discussed. Besides preparing the full and reduced data sets, we’ll further split each set into two additional sets: one set to train the model and another with which to test the results of the model. This is a standard practice that aims to generate a model that is more accurate when it is confronted with new data it hasn’t yet seen. This is accomplished by training the model with only a portion of the data (typically 75 or 80%) while withholding the rest to gauge the results. The training data typically further undergoes “cross-validation” (CV), another common technique that trains the model over multiple iterations, at each iteration withholding a different proportion of the data to tune the model. After the final iteration, the model with the optimal tuning parameters is selected, and that model is then used to evaluate the previously withheld test data.
We have chosen to use 80% of the data for training, withholding the remaining 20% for testing. And we’ll use 10-fold CV, a commonly accepted practice in modeling. We’ll use the createDataPartition() function from the caret package (Kuhn, 2022) which partitions the data evenly based on the outcome variable. This avoids a common problem when modeling data having a strong imbalance in values of the outcome variable.
# Create vector of training rows
set.seed(77)
train_rows <- createDataPartition(dfm_full$PH, p=0.8, list=F)
# Separate training from test
dfm_full_train <- dfm_full[train_rows,]
dfm_full_test <- dfm_full[-train_rows,]
# Separate outcome from predictors
trainx_full <- dfm_full_train[,2:ncol(dfm_full_train)]
trainy_full <- dfm_full_train[,1]
testx_full <- dfm_full_test[,2:ncol(dfm_full_test)]
testy_full <- dfm_full_test[,1]
# Generate a pretty table for the report
data.frame(Set=c('Training', 'Test'), Observations=c(nrow(dfm_full_train), nrow(dfm_full_test))) %>%
flextable() %>%
width(width = 2) %>%
fontsize(size = 10) %>%
line_spacing(space = 1) %>%
hline(part = "all") %>%
set_caption('Data partitioning (full data set)')
Set | Observations |
|---|---|
Training | 2,057 |
Test | 513 |
It is important to note that some models are sensitive to the ratio of observations to parameters. Models that must invert matrices are mathematically unsolvable if the number of parameters exceeds the number of observations. Cross-validation must also be taken into account; for example, ten-fold CV uses only 90% of the rows it is given in each training fold. Our data set has 34 predictor variables, and even the smaller of the two partitions contains 513 observations; 90% of that is roughly 461, well above the number of predictors, so the observation-to-parameter ratio isn't something we have to worry about in this situation.
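As a quick sanity check (a sketch only), we can confirm the arithmetic:
# Even the smaller partition comfortably exceeds the number of predictors,
# and a 10-fold CV fold retains ~90% of the rows it is given.
ncol(trainx_full)                  # 34 predictors
floor(0.9 * nrow(dfm_full_test))   # ~461 rows, well above 34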
Now that we have a full data set with missing values handled, we can generate a second, reduced set to use with models that are sensitive to irregularities. We'll first make a copy of the full set. Then we'll need to remove one of the Brand.Code columns, as having all categories of a categorical variable will be problematic for linear-based models.
# First, make a copy the full set to the reduced set
dfm_red <- dfm_full
# Remove the first dummy Brand.Code from the reduced set, since this will be problematic for linear models which use a y-intercept
dfm_red <- dfm_red %>%
select(-Brand.Code_A)
As we noted earlier, the data set exhibits some collinearity and multicollinearity. Based on the VIFs calculated previously, almost half the predictors exhibit strong multicollinearity. Since removing that many predictors would likely cause a detrimental loss of information, a different approach is warranted. We opted to use the approach presented by Kuhn and Johnson (2013), in which predictors are iteratively compared pairwise, removing those with the highest correlation values until no pairwise correlation exceeds a chosen threshold. While this procedure only addresses collinearity rather than multicollinearity, it can nonetheless yield significant improvements to model performance and stability.
We’ll use the findCorrelation() function to find candidate variables for removal due to collinearity.
# Get correlation table
corr2 <- cor(dfm_red[,2:ncol(dfm_red)], use='complete')
corr_vars <- findCorrelation(corr2, names=T, cutoff=0.9)
data.frame(Variable=corr_vars) %>%
arrange(Variable) %>%
flextable() %>%
width(width = 2) %>%
fontsize(size = 10) %>%
line_spacing(space = 1) %>%
hline(part = "all") %>%
set_caption('Highly correlated variables') %>%
delete_part( part = "header")
Alch.Rel |
Balling |
Balling.Lvl |
Filler.Level |
Hyd.Pressure3 |
This is a manageable number of variables, so we’ll proceed with removing them from our data set.
# Remove highly correlated variables from the reduced data set
dfm_red <- dfm_red[,-match(corr_vars, colnames(dfm_red))]
There is only a single predictor with NZV (Hyd.Pressure1). Since this condition can negatively impact linear regression-based models, we'll remove it from the reduced data set.
# Remove NZV variables from the reduced data set
dfm_red <- dfm_red[,-match('Hyd.Pressure1', colnames(dfm_red))]
We'll first check for zero or negative values among the variables identified earlier as skewed. If there are any such values, the Box-Cox transformation will be infeasible and we'll have to use a different method.
# Define variables id'ed as skewed from earlier
skewed_vars <- c('Filler.Speed', 'Oxygen.Filler', 'Temperature', 'Air.Pressurer', 'PSC.CO2')
# Generate summary to look at min/max values - we're looking for zero or negative values
summary(dfm_red[,skewed_vars])
## Filler.Speed Oxygen.Filler Temperature Air.Pressurer
## Min. : 998 Min. :0.0024 Min. :63.60 Min. :140.8
## 1st Qu.:3824 1st Qu.:0.0220 1st Qu.:65.20 1st Qu.:142.2
## Median :3980 Median :0.0334 Median :65.60 Median :142.6
## Mean :3649 Mean :0.0469 Mean :65.97 Mean :142.8
## 3rd Qu.:3996 3rd Qu.:0.0600 3rd Qu.:66.40 3rd Qu.:143.0
## Max. :4030 Max. :0.4000 Max. :76.20 Max. :148.2
## PSC.CO2
## Min. :0.00000
## 1st Qu.:0.02000
## Median :0.04000
## Mean :0.05649
## 3rd Qu.:0.08000
## Max. :0.24000
Since PSC.CO2 contains zero values, we'll use the Yeo-Johnson transformation instead. Because the caret package supports Yeo-Johnson natively, we'll perform the transformation as part of the preprocessing step during modeling. This has the advantage that the same transformation is automatically applied to new data at prediction time rather than requiring manual transformation.
Based on the outlier plots from earlier, most of the data points beyond the three-standard-deviation cutoff appear to be valid, though located on the extreme outer edges of the main body of data. One exception is Alch.Rel, which includes six somewhat suspect points located along the left- and right-hand margins of the plot. The other possible exception is the data point exhibiting a pH value of 9.36. It is unclear whether these are truly outliers that should be excluded, but in any case Alch.Rel was already identified as exhibiting high collinearity and has already been removed from the reduced data set.
# Remove suspect PH data point
dfm_red <- dfm_red[dfm_red$PH < 9.36,]
# Alch.Rel has already been removed, so no need to do anything there with outliers
# Find cutoff value of Alch.Rel (three sd's)
#loval <- mean(dfm_red$Alch.Rel) - (3 * sd(dfm_red$Alch.Rel))
#hival <- mean(dfm_red$Alch.Rel) + (3 * sd(dfm_red$Alch.Rel))
# Remove suspect Alch.Rel data points
#dfm_red <- dfm_red[dfm_red$Alch.Rel >= loval & dfm_red$Alch.Rel <= hival,]
After all modifications to the reduced data set have been made, our final version of the reduced set contains 2569 observations of 27 predictors, down from the full set of 2570 observations of 34 predictors. As with the full data set, we'll split the reduced set into training and test sets and verify that the number of predictors doesn't exceed the number of observations.
# Create vector of training rows
set.seed(77)
train_rows <- createDataPartition(dfm_red$PH, p=0.8, list=F)
# Separate training from test
dfm_red_train <- dfm_red[train_rows,]
dfm_red_test <- dfm_red[-train_rows,]
# Separate outcome from predictors
trainx_red <- dfm_red_train[,2:ncol(dfm_red_train)]
trainy_red <- dfm_red_train[,1]
# Separate outcome from predictors
testx_red <- dfm_red_test[,2:ncol(dfm_red_test)]
testy_red <- dfm_red_test[,1]
# Generate a pretty table for the report
data.frame(Set=c('Training', 'Test'), Observations=c(nrow(dfm_red_train), nrow(dfm_red_test))) %>%
flextable() %>%
width(width = 2) %>%
fontsize(size = 10) %>%
line_spacing(space = 1) %>%
hline(part = "all") %>%
set_caption('Data partitioning (reduced data set)')
Set | Observations |
|---|---|
Training | 2,057 |
Test | 512 |
The table above shows that the test set contains 512 observations, far exceeding the number of predictors (27), so we won’t need to worry about models that perform poorly when there are more predictors than observations.
Now that we have full and reduced data sets, each split into training and test sets, we can begin modeling. As we mentioned, we'll run both data sets through all models, as there is no inherent disadvantage to doing so, while recognizing that some models will be unstable and perform poorly.
We’ll start by performing linear regression-based modeling, then move to non-linear regression models, followed by tree-based regression models.
# Create a data frame to store the results
dfr <- data.frame(matrix(nrow=0, ncol=8))
colnames(dfr) <- c('Data.Set', 'Model', 'Model.Class', 'Tuning.Parameters', 'RMSE.Train', 'RMSE.Test', 'MAPE.Test', 'Train.Time')
# specify 10x cross-validation
ctrl <- trainControl(method='cv', number=10)
# Create function to calculate MAPE
mape <- function(predicted, actual) {
mape <- mean(abs((actual - predicted) / actual)) * 100
return (mape)
}
# Parallel processing; detect number of cores and start a parallel socket cluster;
# all subsequent calls to caret::train() will use this cluster
num_cores <- detectCores()
cl <- makePSOCKcluster(num_cores)
registerDoParallel(cl)
# Reduced model
# Linear model
tstart <- Sys.time()
set.seed(77)
fitlm1 <- train(x=trainx_red, y=trainy_red, method='lm', trControl=ctrl, preProcess=c('center', 'scale', 'YeoJohnson'))
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitlm1
predlm1 <- predict(fitlm1, testx_red)
dfr[1,] <- data.frame(
Data.Set='Reduced',
Model='Linear model',
Model.Class='Linear',
Tuning.Parameters='',
RMSE.Train=min(fitlm1$results[['RMSE']]),
RMSE.Test=postResample(predlm1, testy_red)[['RMSE']],
MAPE.Test=mape(predlm1, testy_red),
Train.Time=round(telapsed, 3)
)
# Stepwise linear model using MASS package/caret
stepGrid <- data.frame(.nvmax=seq(1, round(ncol(trainx_red) / 2, 0))) # Max number of parameters to use
tstart <- Sys.time()
set.seed(77)
fitlm2 <- train(x=trainx_red, y=trainy_red, method='leapSeq', trControl=ctrl, tuneGrid=stepGrid, preProcess=c('center', 'scale', 'YeoJohnson'))
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitlm2
predlm2 <- predict(fitlm2, testx_red)
dfr[2,] <- data.frame(
Data.Set='Reduced',
Model='Stepwise linear model',
Model.Class='Linear',
Tuning.Parameters=paste0('nvmax=', fitlm2$bestTune[['nvmax']]),
RMSE.Train=min(fitlm2$results[['RMSE']]),
RMSE.Test=postResample(predlm2, testy_red)[['RMSE']],
MAPE.Test=mape(predlm2, testy_red),
Train.Time=round(telapsed, 3)
)
# Full model
# Linear model
tstart <- Sys.time()
set.seed(77)
fitlm1_full <- train(x=trainx_full, y=trainy_full, method='lm', trControl=ctrl, preProcess=c('center', 'scale', 'YeoJohnson'))
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitlm1_full
predlm1_full <- predict(fitlm1_full, testx_full)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
dfr[19,] <- data.frame(
Data.Set='Full',
Model='Linear model',
Model.Class='Linear',
Tuning.Parameters='',
RMSE.Train=min(fitlm1_full$results[['RMSE']]),
RMSE.Test=postResample(predlm1_full, testy_full)[['RMSE']],
MAPE.Test=mape(predlm1_full, testy_full),
Train.Time=round(telapsed, 3)
)
# Stepwise linear model using MASS package/caret
stepGrid <- data.frame(.nvmax=seq(1, round(ncol(trainx_full) / 2, 0))) # Max number of parameters to use
tstart <- Sys.time()
set.seed(77)
fitlm2_full <- train(x=trainx_full, y=trainy_full, method='leapSeq', trControl=ctrl, tuneGrid=stepGrid, preProcess=c('center', 'scale', 'YeoJohnson'))
## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 1
## linear dependencies found
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitlm2_full
predlm2_full <- predict(fitlm2_full, testx_full)
dfr[20,] <- data.frame(
Data.Set='Full',
Model='Stepwise linear model',
Model.Class='Linear',
Tuning.Parameters=paste0('nvmax=', fitlm2_full$bestTune[['nvmax']]),
RMSE.Train=min(fitlm2_full$results[['RMSE']]),
RMSE.Test=postResample(predlm2_full, testy_full)[['RMSE']],
MAPE.Test=mape(predlm2_full, testy_full),
Train.Time=round(telapsed, 3)
)
# Reduced model
# Robust linear regression
tstart <- Sys.time()
set.seed(77)
fitrlm <- train(x=trainx_red, y=trainy_red, method='rlm', preProcess=c('center', 'scale', 'pca'), trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitrlm
predrlm <- predict(fitrlm, testx_red)
dfr[3,] <- data.frame(
Data.Set='Reduced',
Model='Robust linear regression',
Model.Class='Linear',
Tuning.Parameters='',
RMSE.Train=min(fitrlm$results[['RMSE']]),
RMSE.Test=postResample(predrlm, testy_red)[['RMSE']],
MAPE.Test=mape(predrlm, testy_red),
Train.Time=round(telapsed, 3)
)
# Full model
# Robust linear regression
tstart <- Sys.time()
set.seed(77)
fitrlm_full <- train(x=trainx_full, y=trainy_full, method='rlm', preProcess=c('center', 'scale', 'pca'), trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitrlm_full
predrlm_full <- predict(fitrlm_full, testx_full)
dfr[21,] <- data.frame(
Data.Set='Full',
Model='Robust linear regression',
Model.Class='Linear',
Tuning.Parameters='',
RMSE.Train=min(fitrlm_full$results[['RMSE']]),
RMSE.Test=postResample(predrlm_full, testy_full)[['RMSE']],
MAPE.Test=mape(predrlm_full, testy_full),
Train.Time=round(telapsed, 3)
)
# Reduced model
# PLS
tstart <- Sys.time()
set.seed(77)
fitpls <- train(x=trainx_red, y=trainy_red, method='pls', preProcess=c('center', 'scale'), trControl=ctrl, tuneLength=nrow(trainx_red) / 2)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitpls
predpls <- predict(fitpls, testx_red)
dfr[4,] <- data.frame(
Data.Set='Reduced',
Model='Partial least squares',
Model.Class='Linear',
Tuning.Parameters=paste0('ncomp=', fitpls$bestTune),
RMSE.Train=min(fitpls$results[['RMSE']]),
RMSE.Test=postResample(predpls, testy_red)[['RMSE']],
MAPE.Test=mape(predpls, testy_red),
Train.Time=round(telapsed, 3)
)
# Full model
# PLS
tstart <- Sys.time()
set.seed(77)
fitpls_full <- train(x=trainx_full, y=trainy_full, method='pls', preProcess=c('center', 'scale'), trControl=ctrl, tuneLength=nrow(trainx_full) / 2)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitpls_full
predpls_full <- predict(fitpls_full, testx_full)
dfr[22,] <- data.frame(
Data.Set='Full',
Model='Partial least squares',
Model.Class='Linear',
Tuning.Parameters=paste0('ncomp=', fitpls_full$bestTune),
RMSE.Train=min(fitpls_full$results[['RMSE']]),
RMSE.Test=postResample(predpls_full, testy_full)[['RMSE']],
MAPE.Test=mape(predpls_full, testy_full),
Train.Time=round(telapsed, 3)
)
# Reduced model
# Ridge regression
tstart <- Sys.time()
set.seed(77)
ridgeGrid <- data.frame(.lambda=seq(0, 0.1, length=15))
fitridge <- train(x=trainx_red, y=trainy_red, method='ridge', preProcess=c('center', 'scale'), trControl=ctrl, tuneGrid=ridgeGrid)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitridge
predridge <- predict(fitridge, testx_red)
dfr[5,] <- data.frame(
Data.Set='Reduced',
Model='Ridge regression',
Model.Class='Linear',
Tuning.Parameters=paste0('lambda=', round(fitridge$bestTune[['lambda']], 4)),
RMSE.Train=min(fitridge$results[['RMSE']]),
RMSE.Test=postResample(predridge, testy_red)[['RMSE']],
MAPE.Test=mape(predridge, testy_red),
Train.Time=round(telapsed, 3)
)
# Full model
# Ridge regression
tstart <- Sys.time()
set.seed(77)
ridgeGrid <- data.frame(.lambda=seq(0, 0.1, length=15))
fitridge_full <- train(x=trainx_full, y=trainy_full, method='ridge', preProcess=c('center', 'scale'), trControl=ctrl, tuneGrid=ridgeGrid)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitridge_full
predridge_full <- predict(fitridge_full, testx_full)
dfr[23,] <- data.frame(
Data.Set='Full',
Model='Ridge regression',
Model.Class='Linear',
Tuning.Parameters=paste0('lambda=', round(fitridge_full$bestTune[['lambda']], 4)),
RMSE.Train=min(fitridge_full$results[['RMSE']]),
RMSE.Test=postResample(predridge_full, testy_full)[['RMSE']],
MAPE.Test=mape(predridge_full, testy_full),
Train.Time=round(telapsed, 3)
)
# Reduced model
# Lasso regression
enetGrid <- expand.grid(.lambda=c(0, 0.01, 0.1), .fraction=seq(0.05, 1, length=20))
tstart <- Sys.time()
set.seed(77)
fitlasso <- train(x=trainx_red, y=trainy_red, method='enet', preProcess=c('center', 'scale'), trControl=ctrl, tuneGrid=enetGrid)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitlasso
predlasso <- predict(fitlasso, testx_red)
dfr[6,] <- data.frame(
Data.Set='Reduced',
Model='Lasso (elastic net)',
Model.Class='Linear',
Tuning.Parameters=paste0('lambda=', fitlasso$bestTune[['lambda']], ', fraction=', fitlasso$bestTune[['fraction']]),
RMSE.Train=min(fitlasso$results[['RMSE']]),
RMSE.Test=postResample(predlasso, testy_red)[['RMSE']],
MAPE.Test=mape(predlasso, testy_red),
Train.Time=round(telapsed, 3)
)
# Full model
# Lasso regression
enetGrid <- expand.grid(.lambda=c(0, 0.01, 0.1), .fraction=seq(0.05, 1, length=20))
tstart <- Sys.time()
set.seed(77)
fitlasso_full <- train(x=trainx_full, y=trainy_full, method='enet', preProcess=c('center', 'scale'), trControl=ctrl, tuneGrid=enetGrid)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitlasso_full
predlasso_full <- predict(fitlasso_full, testx_full)
dfr[24,] <- data.frame(
Data.Set='Full',
Model='Lasso (elastic net)',
Model.Class='Linear',
Tuning.Parameters=paste0('lambda=', fitlasso_full$bestTune[['lambda']], ', fraction=', fitlasso_full$bestTune[['fraction']]),
RMSE.Train=min(fitlasso_full$results[['RMSE']]),
RMSE.Test=postResample(predlasso_full, testy_full)[['RMSE']],
MAPE.Test=mape(predlasso_full, testy_full),
Train.Time=round(telapsed, 3)
)
# Reduced model
# KNN model
tstart <- Sys.time()
set.seed(77)
fitknn <- train(x=trainx_red, y=trainy_red, method='knn', preProc=c("center", "scale"), tuneLength=10)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitknn
predknn <- predict(fitknn, testx_red)
dfr[7,] <- data.frame(
Data.Set='Reduced',
Model='knn',
Model.Class='Nonlinear',
Tuning.Parameters=paste0('k=', fitknn$bestTune[['k']]),
RMSE.Train=min(fitknn$results[['RMSE']]),
RMSE.Test=postResample(predknn, testy_red)[['RMSE']],
MAPE.Test=mape(predknn, testy_red),
Train.Time=round(telapsed, 3)
)
# Full model
# KNN model
tstart <- Sys.time()
set.seed(77)
fitknn_full <- train(x=trainx_full, y=trainy_full, method='knn', preProc=c("center", "scale"), tuneLength=10)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitknn_full
predknn_full <- predict(fitknn_full, testx_full)
dfr[25,] <- data.frame(
Data.Set='Full',
Model='knn',
Model.Class='Nonlinear',
Tuning.Parameters=paste0('k=', fitknn_full$bestTune[['k']]),
RMSE.Train=min(fitknn_full$results[['RMSE']]),
RMSE.Test=postResample(predknn_full, testy_full)[['RMSE']],
MAPE.Test=mape(predknn_full, testy_full),
Train.Time=round(telapsed, 3)
)
# Reduced model
# MARS model
marsGrid <- expand.grid(.degree=1:2, .nprune=2:10) # set tuning parameters
tstart <- Sys.time()
set.seed(77)
fitmars <- train(x=trainx_red, y=trainy_red, method='earth', tuneGrid=marsGrid, trControl=ctrl)
## Loading required package: earth
## Warning: package 'earth' was built under R version 4.2.3
## Loading required package: Formula
## Loading required package: plotmo
## Warning: package 'plotmo' was built under R version 4.2.3
## Loading required package: plotrix
## Loading required package: TeachingDemos
## Warning: package 'TeachingDemos' was built under R version 4.2.3
##
## Attaching package: 'TeachingDemos'
## The following object is masked _by_ '.GlobalEnv':
##
## outliers
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitmars
predmars <- predict(fitmars, testx_red)
dfr[8,] <- data.frame(
Data.Set='Reduced',
Model='MARS',
Model.Class='Nonlinear',
Tuning.Parameters=paste0('degree=', fitmars$bestTune[['degree']], ', nprune=', fitmars$bestTune[['nprune']]),
RMSE.Train=min(fitmars$results[['RMSE']]),
RMSE.Test=postResample(predmars, testy_red)[['RMSE']],
MAPE.Test=mape(predmars, testy_red),
Train.Time=round(telapsed, 3)
)
# Full model
# MARS model
marsGrid <- expand.grid(.degree=1:2, .nprune=2:10) # set tuning parameters
tstart <- Sys.time()
set.seed(77)
fitmars_full <- train(x=trainx_full, y=trainy_full, method='earth', tuneGrid=marsGrid, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitmars_full
predmars_full <- predict(fitmars_full, testx_full)
dfr[26,] <- data.frame(
Data.Set='Full',
Model='MARS',
Model.Class='Nonlinear',
Tuning.Parameters=paste0('degree=', fitmars_full$bestTune[['degree']], ', nprune=', fitmars_full$bestTune[['nprune']]),
RMSE.Train=min(fitmars_full$results[['RMSE']]),
RMSE.Test=postResample(predmars_full, testy_full)[['RMSE']],
MAPE.Test=mape(predmars_full, testy_full),
Train.Time=round(telapsed, 3)
)
# Reduced model
# SVM model
tstart <- Sys.time()
set.seed(77)
fitsvm <- train(x=trainx_red, y=trainy_red, method='svmRadial', preProc=c('center', 'scale'), tuneLength=14, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitsvm
predsvm <- predict(fitsvm, testx_red)
dfr[9,] <- data.frame(
Data.Set='Reduced',
Model='SVM',
Model.Class='Nonlinear',
Tuning.Parameters=paste0('C=', fitsvm$bestTune[['C']], ', sigma=', round(fitsvm$bestTune[['sigma']], 3)),
RMSE.Train=min(fitsvm$results[['RMSE']]),
RMSE.Test=postResample(predsvm, testy_red)[['RMSE']],
MAPE.Test=mape(predsvm, testy_red),
Train.Time=round(telapsed, 3)
)
# Full model
# SVM model
tstart <- Sys.time()
set.seed(77)
fitsvm_full <- train(x=trainx_full, y=trainy_full, method='svmRadial', preProc=c('center', 'scale'), tuneLength=14, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitsvm_full
predsvm_full <- predict(fitsvm_full, testx_full)
dfr[27,] <- data.frame(
Data.Set='Full',
Model='SVM',
Model.Class='Nonlinear',
Tuning.Parameters=paste0('C=', fitsvm_full$bestTune[['C']], ', sigma=', round(fitsvm_full$bestTune[['sigma']], 3)),
RMSE.Train=min(fitsvm_full$results[['RMSE']]),
RMSE.Test=postResample(predsvm_full, testy_full)[['RMSE']],
MAPE.Test=mape(predsvm_full, testy_full),
Train.Time=round(telapsed, 3)
)# Reduced model
# nnet model using model averaging
nnetGrid <- expand.grid(.decay=c(0, 0.01, 0.1), .size=c(1:10), .bag=F) # set tuning parameters
tstart <- Sys.time()
set.seed(77)
fitnnet <- train(x=trainx_red, y=trainy_red, method='avNNet', preProc=c('center', 'scale'), tuneGrid=nnetGrid, trControl=ctrl,
linout=T, trace=F, MaxNWts=10 * (ncol(trainx_red) + 1) + 10 + 1, maxit=500)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitnnet
prednnet <- predict(fitnnet, testx_red)
dfr[10,] <- data.frame(
Data.Set='Reduced',
Model='Neural net',
Model.Class='Nonlinear',
Tuning.Parameters=paste0('decay=', fitnnet$bestTune[['decay']], ', size=', fitnnet$bestTune[['size']], ', bag=False'),
RMSE.Train=min(fitnnet$results[['RMSE']]),
RMSE.Test=postResample(prednnet, testy_red)[['RMSE']],
MAPE.Test=mape(prednnet, testy_red),
Train.Time=round(telapsed, 3)
)# Full model
# nnet model using model averaging
nnetGrid <- expand.grid(.decay=c(0, 0.01, 0.1), .size=c(1:10), .bag=F) # set tuning parameters
tstart <- Sys.time()
set.seed(77)
fitnnet_full <- train(x=trainx_full, y=trainy_full, method='avNNet', preProc=c('center', 'scale'), tuneGrid=nnetGrid, trControl=ctrl,
linout=T, trace=F, MaxNWts=10 * (ncol(trainx_full) + 1) + 10 + 1, maxit=500)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitnnet_full
prednnet_full <- predict(fitnnet_full, testx_full)
dfr[28,] <- data.frame(
Data.Set='Full',
Model='Neural net',
Model.Class='Nonlinear',
Tuning.Parameters=paste0('decay=', fitnnet_full$bestTune[['decay']], ', size=', fitnnet_full$bestTune[['size']], ', bag=False'),
RMSE.Train=min(fitnnet_full$results[['RMSE']]),
RMSE.Test=postResample(prednnet_full, testy_full)[['RMSE']],
MAPE.Test=mape(prednnet_full, testy_full),
Train.Time=round(telapsed, 3)
)# Reduced model
# Basic regression tree
tstart <- Sys.time()
set.seed(77)
fitcart1 <- train(trainx_red, trainy_red, method='rpart', tuneLength=10, trControl=ctrl)## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitcart1
predcart1 <- predict(fitcart1, testx_red)
dfr[11,] <- data.frame(
Data.Set='Reduced',
Model='Basic CART (tuned w/complexity parameter)',
Model.Class='Tree',
Tuning.Parameters=paste0('cp=', round(fitcart1$bestTune[['cp']], 4)),
RMSE.Train=min(fitcart1$results[['RMSE']]),
RMSE.Test=postResample(predcart1, testy_red)[['RMSE']],
MAPE.Test=mape(predcart1, testy_red),
Train.Time=round(telapsed, 3)
)# Full model
# Basic regression tree
tstart <- Sys.time()
set.seed(77)
fitcart1_full <- train(trainx_full, trainy_full, method='rpart', tuneLength=10, trControl=ctrl)## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitcart1_full
predcart1_full <- predict(fitcart1_full, testx_full)
dfr[29,] <- data.frame(
Data.Set='Full',
Model='Basic CART (tuned w/complexity parameter)',
Model.Class='Tree',
Tuning.Parameters=paste0('cp=', round(fitcart1_full$bestTune[['cp']], 4)),
RMSE.Train=min(fitcart1_full$results[['RMSE']]),
RMSE.Test=postResample(predcart1_full, testy_full)[['RMSE']],
MAPE.Test=mape(predcart1_full, testy_full),
Train.Time=round(telapsed, 3)
)# Reduced model
# Basic CART model - tuned using node depth
tstart <- Sys.time()
set.seed(77)
fitcart2 <- train(trainx_red, trainy_red, method='rpart2', tuneLength=10, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitcart2
predcart2 <- predict(fitcart2, testx_red)
dfr[12,] <- data.frame(
Data.Set='Reduced',
Model='Basic CART (tuned w/node depth)',
Model.Class='Tree',
Tuning.Parameters=paste0('maxdepth=', fitcart2$bestTune[['maxdepth']]),
RMSE.Train=min(fitcart2$results[['RMSE']]),
RMSE.Test=postResample(predcart2, testy_red)[['RMSE']],
MAPE.Test=mape(predcart2, testy_red),
Train.Time=round(telapsed, 3)
)# Full model
# Basic CART model - tuned using node depth
tstart <- Sys.time()
set.seed(77)
fitcart2_full <- train(trainx_full, trainy_full, method='rpart2', tuneLength=10, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitcart2_full
predcart2_full <- predict(fitcart2_full, testx_full)
dfr[30,] <- data.frame(
Data.Set='Full',
Model='Basic CART (tuned w/node depth)',
Model.Class='Tree',
Tuning.Parameters=paste0('maxdepth=', fitcart2_full$bestTune[['maxdepth']]),
RMSE.Train=min(fitcart2_full$results[['RMSE']]),
RMSE.Test=postResample(predcart2_full, testy_full)[['RMSE']],
MAPE.Test=mape(predcart2_full, testy_full),
Train.Time=round(telapsed, 3)
)# Reduced model
# Model tree
mtreeGrid1 <- expand.grid(.pruned=c('Yes', 'No'), .smoothed=c('Yes', 'No'), .rules=c('Yes', 'No'))
tstart <- Sys.time()
set.seed(77)
fitmtree1 <- train(trainx_red, trainy_red, method='M5', trControl=ctrl, control=Weka_control(M=10), tuneGrid=mtreeGrid1)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitmtree1
predmtree1 <- predict(fitmtree1, testx_red)
dfr[13,] <- data.frame(
Data.Set='Reduced',
Model='Regression model tree',
Model.Class='Tree',
Tuning.Parameters=paste0('pruned=', fitmtree1$bestTune[['pruned']],
', smoothed=', fitmtree1$bestTune[['smoothed']],
', rules=', fitmtree1$bestTune[['rules']]),
RMSE.Train=min(fitmtree1$results[['RMSE']]),
RMSE.Test=postResample(predmtree1, testy_red)[['RMSE']],
MAPE.Test=mape(predmtree1, testy_red),
Train.Time=round(telapsed, 3)
)# Full model
# Model tree
mtreeGrid1 <- expand.grid(.pruned=c('Yes', 'No'), .smoothed=c('Yes', 'No'), .rules=c('Yes', 'No'))
tstart <- Sys.time()
set.seed(77)
fitmtree1_full <- train(trainx_full, trainy_full, method='M5', trControl=ctrl, control=Weka_control(M=10), tuneGrid=mtreeGrid1)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitmtree1_full
predmtree1_full <- predict(fitmtree1_full, testx_full)
dfr[31,] <- data.frame(
Data.Set='Full',
Model='Regression model tree',
Model.Class='Tree',
Tuning.Parameters=paste0('pruned=', fitmtree1_full$bestTune[['pruned']],
', smoothed=', fitmtree1_full$bestTune[['smoothed']],
', rules=', fitmtree1_full$bestTune[['rules']]),
RMSE.Train=min(fitmtree1_full$results[['RMSE']]),
RMSE.Test=postResample(predmtree1_full, testy_full)[['RMSE']],
MAPE.Test=mape(predmtree1_full, testy_full),
Train.Time=round(telapsed, 3)
)# Reduced model
# Model tree (rule-based)
mtreeGrid2 <- expand.grid(.pruned=c('Yes', 'No'), .smoothed=c('Yes', 'No'))
tstart <- Sys.time()
set.seed(77)
fitmtree2 <- train(trainx_red, trainy_red, method='M5Rules', trControl=ctrl, control=Weka_control(M=10), tuneGrid=mtreeGrid2)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitmtree2
predmtree2 <- predict(fitmtree2, testx_red)
dfr[14,] <- data.frame(
Data.Set='Reduced',
Model='Regression model tree (rule-based)',
Model.Class='Tree',
Tuning.Parameters=paste0('pruned=', fitmtree2$bestTune[['pruned']],
', smoothed=', fitmtree2$bestTune[['smoothed']]),
RMSE.Train=min(fitmtree2$results[['RMSE']]),
RMSE.Test=postResample(predmtree2, testy_red)[['RMSE']],
MAPE.Test=mape(predmtree2, testy_red),
Train.Time=round(telapsed, 3)
)# Full model
# Model tree (rule-based)
mtreeGrid2 <- expand.grid(.pruned=c('Yes', 'No'), .smoothed=c('Yes', 'No'))
tstart <- Sys.time()
set.seed(77)
fitmtree2_full <- train(trainx_full, trainy_full, method='M5Rules', trControl=ctrl, control=Weka_control(M=10), tuneGrid=mtreeGrid2)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitmtree2_full
predmtree2_full <- predict(fitmtree2_full, testx_full)
dfr[32,] <- data.frame(
Data.Set='Full',
Model='Regression model tree (rule-based)',
Model.Class='Tree',
Tuning.Parameters=paste0('pruned=', fitmtree2_full$bestTune[['pruned']],
', smoothed=', fitmtree2_full$bestTune[['smoothed']]),
RMSE.Train=min(fitmtree2_full$results[['RMSE']]),
RMSE.Test=postResample(predmtree2_full, testy_full)[['RMSE']],
MAPE.Test=mape(predmtree2_full, testy_full),
Train.Time=round(telapsed, 3)
)# Reduced model
# Bagged tree
tstart <- Sys.time()
set.seed(77)
fitbag <- train(trainx_red, trainy_red, method='treebag', trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitbag
predbag <- predict(fitbag, testx_red)
dfr[15,] <- data.frame(
Data.Set='Reduced',
Model='Bagged tree',
Model.Class='Tree',
Tuning.Parameters='',
RMSE.Train=min(fitbag$results[['RMSE']]),
RMSE.Test=postResample(predbag, testy_red)[['RMSE']],
MAPE.Test=mape(predbag, testy_red),
Train.Time=round(telapsed, 3)
)# Full model
# Bagged tree
tstart <- Sys.time()
set.seed(77)
fitbag_full <- train(trainx_full, trainy_full, method='treebag', trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitbag_full
predbag_full <- predict(fitbag_full, testx_full)
dfr[33,] <- data.frame(
Data.Set='Full',
Model='Bagged tree',
Model.Class='Tree',
Tuning.Parameters='',
RMSE.Train=min(fitbag_full$results[['RMSE']]),
RMSE.Test=postResample(predbag_full, testy_full)[['RMSE']],
MAPE.Test=mape(predbag_full, testy_full),
Train.Time=round(telapsed, 3)
)# Reduced model
# Random Forest
tstart <- Sys.time()
set.seed(77)
fitrf <- train(trainx_red, trainy_red, method='rf', tuneLength=10, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitrf
predrf <- predict(fitrf, testx_red)
dfr[16,] <- data.frame(
Data.Set='Reduced',
Model='Random Forest',
Model.Class='Tree',
Tuning.Parameters=paste0('mtry=', fitrf$bestTune[['mtry']]),
RMSE.Train=min(fitrf$results[['RMSE']]),
RMSE.Test=postResample(predrf, testy_red)[['RMSE']],
MAPE.Test=mape(predrf, testy_red),
Train.Time=round(telapsed, 3)
)# Full model
# Random Forest
tstart <- Sys.time()
set.seed(77)
fitrf_full <- train(trainx_full, trainy_full, method='rf', tuneLength=10, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitrf_full
predrf_full <- predict(fitrf_full, testx_full)
dfr[34,] <- data.frame(
Data.Set='Full',
Model='Random Forest',
Model.Class='Tree',
Tuning.Parameters=paste0('mtry=', fitrf_full$bestTune[['mtry']]),
RMSE.Train=min(fitrf_full$results[['RMSE']]),
RMSE.Test=postResample(predrf_full, testy_full)[['RMSE']],
MAPE.Test=mape(predrf_full, testy_full),
Train.Time=round(telapsed, 3)
)# Set verbose=FALSE (plus knitr chunk option message=FALSE) to suppress gbm's trace output
# Reduced model
# Stochastic gradient boosting
gbmGrid <- expand.grid(.interaction.depth=seq(1, 7, by=2),
.n.trees=seq(100, 1000, by=50),
.shrinkage=c(0.01, 0.1),
.n.minobsinnode=10)
tstart <- Sys.time()
set.seed(77)
fitgbm <- train(trainx_red, trainy_red, method='gbm', tuneGrid=gbmGrid, trControl=ctrl, verbose=FALSE)
telapsed <- as.numeric(difftime(Sys.time(), tstart, units='secs'))
# fitgbm
predgbm <- predict(fitgbm, testx_red)
dfr[17,] <- data.frame(
Data.Set='Reduced',
Model='Stochastic gradient boosted tree',
Model.Class='Tree',
Tuning.Parameters=paste0('interaction.depth=', fitgbm$bestTune[['interaction.depth']],
', n.trees=', fitgbm$bestTune[['n.trees']],
', shrinkage=', fitgbm$bestTune[['shrinkage']],
', n.minobsinnode=10'),
RMSE.Train=min(fitgbm$results[['RMSE']]),
RMSE.Test=postResample(predgbm, testy_red)[['RMSE']],
MAPE.Test=mape(predgbm, testy_red),
Train.Time=round(telapsed, 3)
)# Set verbose=FALSE (plus knitr chunk option message=FALSE) to suppress gbm's trace output
# Full model
# Stochastic gradient boosting
gbmGrid <- expand.grid(.interaction.depth=seq(1, 7, by=2),
.n.trees=seq(100, 1000, by=50),
.shrinkage=c(0.01, 0.1),
.n.minobsinnode=10)
tstart <- Sys.time()
set.seed(77)
fitgbm_full <- train(trainx_full, trainy_full, method='gbm', tuneGrid=gbmGrid, trControl=ctrl, verbose=FALSE)
telapsed <- as.numeric(difftime(Sys.time(), tstart, units='secs'))
# fitgbm_full
predgbm_full <- predict(fitgbm_full, testx_full)
dfr[35,] <- data.frame(
Data.Set='Full',
Model='Stochastic gradient boosted tree',
Model.Class='Tree',
Tuning.Parameters=paste0('interaction.depth=', fitgbm_full$bestTune[['interaction.depth']],
', n.trees=', fitgbm_full$bestTune[['n.trees']],
', shrinkage=', fitgbm_full$bestTune[['shrinkage']],
', n.minobsinnode=10'),
RMSE.Train=min(fitgbm_full$results[['RMSE']]),
RMSE.Test=postResample(predgbm_full, testy_full)[['RMSE']],
MAPE.Test=mape(predgbm_full, testy_full),
Train.Time=round(telapsed, 3)
)# Reduced model
# Cubist
cubGrid <- expand.grid(.committees=c(seq(1, 10), seq(20, 100, by=10)), .neighbors=c(0, 1, 5, 9))
tstart <- Sys.time()
set.seed(77)
fitcub <- train(trainx_red, trainy_red, method='cubist', tuneGrid=cubGrid, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitcub
predcub <- predict(fitcub, testx_red)
dfr[18,] <- data.frame(
Data.Set='Reduced',
Model='Cubist',
Model.Class='Tree',
Tuning.Parameters=paste0('committees=', fitcub$bestTune[['committees']],
', neighbors=', fitcub$bestTune[['neighbors']]),
RMSE.Train=min(fitcub$results[['RMSE']]),
RMSE.Test=postResample(predcub, testy_red)[['RMSE']],
MAPE.Test=mape(predcub, testy_red),
Train.Time=round(telapsed, 3)
)# Full model
# Cubist
cubGrid <- expand.grid(.committees=c(seq(1, 10), seq(20, 100, by=10)), .neighbors=c(0, 1, 5, 9))
tstart <- Sys.time()
set.seed(77)
fitcub_full <- train(trainx_full, trainy_full, method='cubist', tuneGrid=cubGrid, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitcub_full
predcub_full <- predict(fitcub_full, testx_full)
dfr[36,] <- data.frame(
Data.Set='Full',
Model='Cubist',
Model.Class='Tree',
Tuning.Parameters=paste0('committees=', fitcub_full$bestTune[['committees']],
', neighbors=', fitcub_full$bestTune[['neighbors']]),
RMSE.Train=min(fitcub_full$results[['RMSE']]),
RMSE.Test=postResample(predcub_full, testy_full)[['RMSE']],
MAPE.Test=mape(predcub_full, testy_full),
Train.Time=round(telapsed, 3)
)# Stop the parallel processing cluster
stopCluster(cl)Model performance can be compared using a variety of metrics. One commonly used metric is root mean square error (RMSE), which measures the difference between the actual outcome values (pH in this case) and those predicted by the model. RMSE is expressed in the same units as the outcome variable, making it easy to interpret relative to pH itself.
Another useful metric is the mean absolute percentage error (MAPE), which measures the percentage difference between predicted and actual values of the outcome. Below is a summary of RMSE and MAPE values, along with the tuning parameters of the best-performing models:
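For reference, both metrics are straightforward to compute from a vector of predictions; a minimal sketch follows, with the mape() helper used in the code above assumed to return a percentage like the function below.
# Hand-rolled RMSE and MAPE for reference (assumed to mirror the mape() helper used above)
rmse_manual <- function(pred, actual) sqrt(mean((actual - pred)^2))              # same units as pH
mape_manual <- function(pred, actual) mean(abs((actual - pred) / actual)) * 100  # percentage error
# Example usage (e.g., for the full-set knn predictions):
# rmse_manual(predknn_full, testy_full)
# mape_manual(predknn_full, testy_full)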
# Results table
dfr %>%
arrange(MAPE.Test) %>%
select(Model, Data.Set, Tuning.Parameters, Train.Time, RMSE.Train, RMSE.Test, MAPE.Test) %>%
flextable() %>%
width(width = 1) %>%
fontsize(size = 10) %>%
line_spacing(space = 1) %>%
hline(part = "all") %>%
set_caption('Modeling Results')Model | Data.Set | Tuning.Parameters | Train.Time | RMSE.Train | RMSE.Test | MAPE.Test |
|---|---|---|---|---|---|---|
Cubist | Full | committees=100, neighbors=5 | 230.949 | 0.09489731 | 0.09467403 | 0.8069488 |
Random Forest | Full | mtry=26 | 291.638 | 0.09573711 | 0.09518857 | 0.8278027 |
Cubist | Reduced | committees=30, neighbors=9 | 185.834 | 0.09590937 | 0.09602074 | 0.8406549 |
Random Forest | Reduced | mtry=18 | 241.865 | 0.09756142 | 0.10008600 | 0.8564273 |
Stochastic gradient boosted tree | Full | interaction.depth=7, n.trees=950, shrinkage=0.1, n.minobsinnode=10 | 63.292 | 0.10471437 | 0.10147753 | 0.8862444 |
Stochastic gradient boosted tree | Reduced | interaction.depth=5, n.trees=650, shrinkage=0.1, n.minobsinnode=10 | 48.788 | 0.10722935 | 0.10653321 | 0.9463547 |
Regression model tree | Full | pruned=Yes, smoothed=Yes, rules=No | 98.277 | 0.12150091 | 0.11406403 | 0.9841826 |
SVM | Full | C=16, sigma=0.019 | 47.931 | 0.11328172 | 0.11362331 | 1.0158661 |
SVM | Reduced | C=8, sigma=0.024 | 49.163 | 0.11517041 | 0.11670584 | 1.0427894 |
Neural net | Full | decay=0.1, size=5, bag=False | 87.481 | 0.11565164 | 0.11789163 | 1.0522013 |
Neural net | Reduced | decay=0.1, size=5, bag=False | 76.165 | 0.11713042 | 0.11587517 | 1.0535517 |
Regression model tree (rule-based) | Full | pruned=Yes, smoothed=No | 96.790 | 0.12672150 | 0.13331952 | 1.1181864 |
knn | Reduced | k=11 | 5.201 | 0.12570259 | 0.12448787 | 1.1280221 |
Bagged tree | Full |  | 2.052 | 0.11960417 | 0.12225904 | 1.1282624 |
Regression model tree | Reduced | pruned=Yes, smoothed=No, rules=Yes | 92.994 | 0.12342576 | 0.13725888 | 1.1370836 |
Regression model tree (rule-based) | Reduced | pruned=Yes, smoothed=No | 80.521 | 0.12342576 | 0.13725888 | 1.1370836 |
knn | Full | k=13 | 6.429 | 0.12546626 | 0.12635171 | 1.1488225 |
Bagged tree | Reduced |  | 1.716 | 0.12080801 | 0.12566115 | 1.1556535 |
MARS | Reduced | degree=2, nprune=10 | 10.313 | 0.13075924 | 0.13523319 | 1.1976969 |
Basic CART (tuned w/node depth) | Reduced | maxdepth=10 | 0.714 | 0.12801291 | 0.13305594 | 1.2089476 |
Basic CART (tuned w/complexity parameter) | Reduced | cp=0.0136 | 0.464 | 0.12948167 | 0.13469959 | 1.2183462 |
Basic CART (tuned w/node depth) | Full | maxdepth=11 | 0.383 | 0.12911119 | 0.13328713 | 1.2259351 |
MARS | Full | degree=1, nprune=10 | 12.286 | 0.13164275 | 0.13438371 | 1.2379820 |
Partial least squares | Reduced | ncomp=11 | 0.756 | 0.13440134 | 0.13419731 | 1.2442610 |
Linear model | Reduced |  | 14.263 | 0.13418504 | 0.13426328 | 1.2442876 |
Ridge regression | Reduced | lambda=0.0214 | 2.693 | 0.13440744 | 0.13416671 | 1.2445097 |
Lasso (elastic net) | Reduced | lambda=0, fraction=0.9 | 0.860 | 0.13435146 | 0.13448619 | 1.2471906 |
Ridge regression | Full | lambda=0.0071 | 3.576 | 0.13214741 | 0.13687962 | 1.2653748 |
Lasso (elastic net) | Full | lambda=0, fraction=0.9 | 0.934 | 0.13204044 | 0.13763948 | 1.2671424 |
Partial least squares | Full | ncomp=29 | 0.602 | 0.13214800 | 0.13763492 | 1.2671462 |
Linear model | Full |  | 1.123 | 0.13305920 | 0.13768953 | 1.2735829 |
Basic CART (tuned w/complexity parameter) | Full | cp=0.0127 | 0.717 | 0.13031164 | 0.13833394 | 1.2770764 |
Stepwise linear model | Full | nvmax=12 | 0.767 | 0.13465862 | 0.13991498 | 1.2830543 |
Robust linear regression | Reduced |  | 1.090 | 0.13929776 | 0.13920669 | 1.2980118 |
Robust linear regression | Full |  | 1.919 | 0.13748209 | 0.13930736 | 1.3025277 |
Stepwise linear model | Reduced | nvmax=14 | 0.988 | 0.13537079 | 0.15091506 | 1.3898956 |
dfr %>%
ggplot(aes(x=reorder(Model, desc(MAPE.Test)), y=MAPE.Test, fill=Data.Set)) +
geom_bar(stat='identity', position='dodge', width=0.75) +
coord_flip() +
xlab('') +
ylab('MAPE (Test set)') +
scale_fill_manual(values=c('#c03333', '#808080')) +
ggtitle('Modeling Results')Our preliminary runs show that tree-based models were the best performers by a significant margin. Nonlinear regression-based models outperformed the linear ones, but were still outclassed by the tree-based models. The Cubist and random forest models were the top performers, followed by the stochastic gradient boosted tree and the regression model tree. It is not surprising that the tree-based models performed better when fed the full data set, as the richer feature set gave them more information about the outcome. It was, however, surprising that several of the top models scored better on the held-out test set than on the training set; models typically perform slightly worse on data they haven't encountered before, which was generally the case for the linear and nonlinear regression-based models.
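As a quick check on that observation, the results table itself can be used to flag which models scored better on the test split than in resampling; this is just a sketch against the dfr columns populated above.
# Flag models whose test RMSE beat their cross-validated training RMSE
dfr %>%
  mutate(Test.Minus.Train=RMSE.Test - RMSE.Train) %>%
  arrange(Test.Minus.Train) %>%
  dplyr::select(Model, Data.Set, RMSE.Train, RMSE.Test, Test.Minus.Train) %>%
  head(10)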
We also gauged accuracy against how long it took to train the various models (see plot below). As expected, the more accurate models generally took much longer to run than those that were less accurate. The model with the best MAPE score was also one of the longest to train.
# Performance vs accuracy
dfr %>%
ggplot(aes(x=MAPE.Test, y=Train.Time, color=Model.Class)) +
geom_point() +
xlab('Accuracy % (MAPE)') +
ylab('Training time (seconds)') +
ggtitle('Accuracy and training time by model class')It should be noted that if this model is to be used in a real-time, production environment, a model with faster runtimes may be preferable to one that performs slightly better but is much slower to train.
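Training time is not the same as prediction time, and in production it is the latter that matters for real-time scoring. A rough check might time the already-fitted models on the held-out predictors, as sketched below; the numbers will of course vary by machine.
# Rough per-batch prediction timing (illustrative only; results depend on hardware)
system.time(predict(fitcub_full, testx_full))   # top performer, slow to train
system.time(predict(fitbag_full, testx_full))   # far faster to train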
While our modeling yielded very good results from the top performers, we'll take a closer look at the top two models to see if there are ways to improve them before selecting a final model. As indicated above, the top two performers were the Cubist model run against the full data set with 100 committees and five neighbors, and the random forest model sampling 26 predictors at each split (mtry=26), also run against the full data set.
Because it was unclear whether the missing values in the categorical variable (Brand.Code) were intentionally uncoded or indeed missing, we decided to try rerunning the top two best-performing models after adjusting this variable. On the first run, we dropped the variable altogether. On the second run, we recoded missing Brand.Code values as “U” to represent unbranded products.
# Rerun top 2 models - full set, but drop Brand.Code
dfm_nobrand <- dfm_full %>%
select(-starts_with('Brand.Code'))
# Create vector of training rows
set.seed(77)
train_rows <- createDataPartition(dfm_nobrand$PH, p=0.8, list=F)
# Separate training from test
dfm_nobrand_train <- dfm_nobrand[train_rows,]
dfm_nobrand_test <- dfm_nobrand[-train_rows,]
# Separate outcome from predictors
trainx_nobrand <- dfm_nobrand_train[,2:ncol(dfm_nobrand_train)]
trainy_nobrand <- dfm_nobrand_train[,1]
testx_nobrand <- dfm_nobrand_test[,2:ncol(dfm_nobrand_test)]
testy_nobrand <- dfm_nobrand_test[,1]
# Parallel processing
cl <- makePSOCKcluster(num_cores)
registerDoParallel(cl)
# Cubist
cubGrid <- expand.grid(.committees=c(seq(1, 10), seq(20, 100, by=10)), .neighbors=c(0, 1, 5, 9))
tstart <- Sys.time()
set.seed(77)
fitcub_nobrand <- train(trainx_nobrand, trainy_nobrand, method='cubist', tuneGrid=cubGrid, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitcub_nobrand
predcub_nobrand <- predict(fitcub_nobrand, testx_nobrand)
dfr[37,] <- data.frame(
Data.Set='Rerun (no Brand.Code)',
Model='Cubist',
Model.Class='Tree',
Tuning.Parameters=paste0('committees=', fitcub_nobrand$bestTune[['committees']],
', neighbors=', fitcub_nobrand$bestTune[['neighbors']]),
RMSE.Train=min(fitcub_nobrand$results[['RMSE']]),
RMSE.Test=postResample(predcub_nobrand, testy_nobrand)[['RMSE']],
MAPE.Test=mape(predcub_nobrand, testy_nobrand),
Train.Time=round(telapsed, 3)
)
# Random Forest
tstart <- Sys.time()
set.seed(77)
fitrf_nobrand <- train(trainx_nobrand, trainy_nobrand, method='rf', tuneLength=10, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitrf_nobrand
predrf_nobrand <- predict(fitrf_nobrand, testx_nobrand)
dfr[39,] <- data.frame(
Data.Set='Rerun (no Brand.Code)',
Model='Random forest',
Model.Class='Tree',
Tuning.Parameters=paste0('mtry=', fitrf_nobrand$bestTune[['mtry']]),
RMSE.Train=min(fitrf_nobrand$results[['RMSE']]),
RMSE.Test=postResample(predrf_nobrand, testy_nobrand)[['RMSE']],
MAPE.Test=mape(predrf_nobrand, testy_nobrand),
Train.Time=round(telapsed, 3)
)
# Stop the parallel processing cluster
stopCluster(cl)# Rerun top 2 models - full set, but add uncoded Brand.Code values as U
# Init the full data set
dfm_unbrand <- dfm
# Remove observation with 12 missing values
dfm_unbrand <- dfm_unbrand[(rowSums(is.na(dfm_unbrand)) < 12),]
# Remove the MFR predictor because it has 212 missing values (8.2% of the data)
dfm_unbrand <- dfm_unbrand %>% select(-MFR)
# Recode blank brand codes as "U"
dfm_unbrand <- dfm_unbrand %>%
mutate(Brand.Code=ifelse(Brand.Code=='', 'U', as.character(Brand.Code)))
# Factor Brand.Code (needed for imputation)
dfm_unbrand$Brand.Code <- factor(dfm_unbrand$Brand.Code)
# Use knnImputation to impute values
dfm_unbrand <- knnImputation(dfm_unbrand, k=9)
# Create dummies for factor and character variables
dfm_unbrand <- dummy_cols(dfm_unbrand, select_columns='Brand.Code', remove_first_dummy=F, remove_selected_columns=T)
# Create vector of training rows
set.seed(77)
train_rows <- createDataPartition(dfm_unbrand$PH, p=0.8, list=F)
# Separate training from test
dfm_unbrand_train <- dfm_unbrand[train_rows,]
dfm_unbrand_test <- dfm_unbrand[-train_rows,]
# Separate outcome from predictors
trainx_unbrand <- dfm_unbrand_train[,2:ncol(dfm_unbrand_train)]
trainy_unbrand <- dfm_unbrand_train[,1]
testx_unbrand <- dfm_unbrand_test[,2:ncol(dfm_unbrand_test)]
testy_unbrand <- dfm_unbrand_test[,1]
# Parallel processing
cl <- makePSOCKcluster(num_cores)
registerDoParallel(cl)
# Cubist - same committees/neighbors grid as the earlier runs
cubGrid <- expand.grid(.committees=c(seq(1, 10), seq(20, 100, by=10)), .neighbors=c(0, 1, 5, 9))
tstart <- Sys.time()
set.seed(77)
fitcub_unbrand <- train(trainx_unbrand, trainy_unbrand, method='cubist', tuneGrid=cubGrid, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitcub_unbrand
predcub_unbrand <- predict(fitcub_unbrand, testx_unbrand)
dfr[38,] <- data.frame(
Data.Set='Rerun (keep unbranded)',
Model='Cubist',
Model.Class='Tree',
Tuning.Parameters=paste0('committees=', fitcub_unbrand$bestTune[['committees']],
', neighbors=', fitcub_unbrand$bestTune[['neighbors']]),
RMSE.Train=min(fitcub_unbrand$results[['RMSE']]),
RMSE.Test=postResample(predcub_unbrand, testy_unbrand)[['RMSE']],
MAPE.Test=mape(predcub_unbrand, testy_unbrand),
Train.Time=round(telapsed, 3)
)
# Random Forest
tstart <- Sys.time()
set.seed(77)
fitrf_unbrand <- train(trainx_unbrand, trainy_unbrand, method='rf', tuneLength=10, trControl=ctrl)
telapsed <- as.numeric(difftime(Sys.time(), tstart ,units='secs'))
# fitrf_unbrand
predrf_unbrand <- predict(fitrf_unbrand, testx_unbrand)
dfr[40,] <- data.frame(
Data.Set='Rerun (keep unbranded)',
Model='Random forest',
Model.Class='Tree',
Tuning.Parameters=paste0('mtry=', fitrf_unbrand$bestTune[['mtry']]),
RMSE.Train=min(fitrf_unbrand$results[['RMSE']]),
RMSE.Test=postResample(predrf_unbrand, testy_unbrand)[['RMSE']],
MAPE.Test=mape(predrf_unbrand, testy_unbrand),
Train.Time=round(telapsed, 3)
)
# Stop the parallel processing cluster
stopCluster(cl)# Reprint top results
dfr[c(16,18,34,36,37:40),] %>%
arrange(MAPE.Test) %>%
select(Model, Data.Set, Tuning.Parameters, Train.Time, RMSE.Train, RMSE.Test, MAPE.Test) %>%
flextable() %>%
width(width = 1) %>%
fontsize(size = 10) %>%
line_spacing(space = 1) %>%
hline(part = "all") %>%
set_caption('Cubist/Random forest results (including reruns)')Model | Data.Set | Tuning.Parameters | Train.Time | RMSE.Train | RMSE.Test | MAPE.Test |
|---|---|---|---|---|---|---|
Cubist | Full | committees=100, neighbors=5 | 230.949 | 0.09489731 | 0.09467403 | 0.8069488 |
Cubist | Rerun (keep unbranded) | committees=50, neighbors=5 | 240.451 | 0.09558272 | 0.09502358 | 0.8090312 |
Random Forest | Full | mtry=26 | 291.638 | 0.09573711 | 0.09518857 | 0.8278027 |
Random forest | Rerun (keep unbranded) | mtry=31 | 278.027 | 0.09582768 | 0.09605555 | 0.8292539 |
Cubist | Reduced | committees=30, neighbors=9 | 185.834 | 0.09590937 | 0.09602074 | 0.8406549 |
Cubist | Rerun (no Brand.Code) | committees=70, neighbors=5 | 244.527 | 0.09882926 | 0.10185535 | 0.8444153 |
Random Forest | Reduced | mtry=18 | 241.865 | 0.09756142 | 0.10008600 | 0.8564273 |
Random forest | Rerun (no Brand.Code) | mtry=23 | 253.217 | 0.10338676 | 0.10653838 | 0.9037640 |
As shown, rerunning the models using different Brand.Code treatments did not improve the MAPE of the models run with the full data set. As such, the top-performing model remains the Cubist model run against the full data set (imputing missing Brand.Code values), with 100 committees and five neighbors. As shown in the plot below, RMSE drops sharply as the number of committees increases from 1 to around 5, and then declines more gradually. Assuming that more committees mean longer runtimes, it would be feasible to decrease the number of committees from 100 to 20 and still obtain nearly equivalent performance. But if runtime isn't a concern, we recommend 100 committees.
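To quantify that trade-off, the resampling results stored inside the fitted model can be filtered at the chosen neighbor setting; the sketch below assumes fitcub_full$results contains the committees, neighbors, and RMSE columns that caret stores for a Cubist fit.
# Cross-validated RMSE at selected committee counts (neighbors fixed at 5)
fitcub_full$results %>%
  filter(neighbors == 5, committees %in% c(5, 10, 20, 50, 100)) %>%
  dplyr::select(committees, neighbors, RMSE, Rsquared)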
# Plot
plot(fitcub_full)Having selected the optimal model, we’ll examine the importance of individual predictors within the model.
# Variable importance - cubist full set
tmpdf <- varImp(fitcub_full)$importance
tmpdf$Variable <- row.names(tmpdf)
tmpdf %>%
select(Variable, Overall) %>%
arrange(desc(Overall)) %>%
head(10) %>%
flextable() %>%
width(width = 2) %>%
fontsize(size = 10) %>%
line_spacing(space = 1) %>%
hline(part = "all") %>%
set_caption('Top 10 variables - Cubist model (full set)')Variable | Overall |
|---|---|
Mnf.Flow | 100.00000 |
Balling.Lvl | 64.81481 |
Alch.Rel | 56.17284 |
Balling | 54.93827 |
Pressure.Vacuum | 51.23457 |
Density | 51.23457 |
Oxygen.Filler | 50.00000 |
Bowl.Setpoint | 43.82716 |
Air.Pressurer | 41.97531 |
Temperature | 41.97531 |
As shown, Mnf.Flow was the most informative predictor in the model, followed by Balling.Lvl and Alch.Rel.
# mnf.flow scatter
dfm |>
select(Mnf.Flow, PH) |>
plot()We can see there appear to be three groups of data: values around -100, values around 0, and values greater than zero. Upon further inspection, however, we discover there are no zero values in the Mnf.Flow column, only values of 0.2. We discretize the variable and plot violin diagrams of the distributions:
# mnf.flow violin
dfm |>
select(PH, Mnf.Flow) |>
na.omit() |>
mutate(Mnf.Flow = ifelse(Mnf.Flow < 0, "Less than Zero",
"Greater than Zero")) |>
ggplot(aes(x=Mnf.Flow, y=PH)) +
geom_violin(fill="lightblue") +
theme_minimal()We observe that observations with Mnf.Flow values greater than zero are likelier to fall within the critical pH range, as the observations with values less than zero contain the outliers above 9.0 pH.
dfm |>
select(PH, Mnf.Flow) |>
na.omit() |>
mutate(Mnf.Flow = ifelse(Mnf.Flow <= 0, "Less than Zero",
"Greater than Zero")) |>
group_by(Mnf.Flow) |>
summarize(mean.ph = mean(PH),
median.ph = median(PH),
min.ph = min(PH),
max.ph = max(PH),
sd.ph = sd(PH)) |>
as.data.frame() |>
t() |>
kable() |>
kable_styling()| Mnf.Flow | Greater than Zero | Less than Zero |
|---|---|---|
| mean.ph | 8.470983 | 8.633001 |
| median.ph | 8.48 | 8.66 |
| min.ph | 7.88 | 8.08 |
| max.ph | 8.88 | 9.36 |
| sd.ph | 0.1449628 | 0.1608029 |
We can see that the mean, median, standard deviation, minimum, and maximum pH are all higher for Mnf.Flow values below zero, providing evidence that observations with Mnf.Flow values greater than zero are likelier to fall within the critical pH range.
From the Cubist documentation on variable importance:
Cubist: The Cubist output contains variable usage statistics. It gives the percentage of times where each variable was used in a condition and/or a linear model. Note that this output will probably be inconsistent with the rules shown in the output from summary.cubist. At each split of the tree, Cubist saves a linear model (after feature selection) that is allowed to have terms for each variable used in the current split or any split above it. Quinlan (1992) discusses a smoothing algorithm where each model prediction is a linear combination of the parent and child model along the tree. As such, the final prediction is a function of all the linear models from the initial node to the terminal node. The percentages shown in the Cubist output reflects all the models involved in prediction (as opposed to the terminal models shown in the output). The variable importance used here is a linear combination of the usage in the rule conditions and the model.
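Those usage statistics can be inspected for the fitted model directly; the sketch below assumes the final Cubist fit exposes the usual usage component from the Cubist package.
# Percentage of rule conditions and linear models that use each predictor
head(fitcub_full$finalModel$usage, 10)
# The full rule set (lengthy with 100 committees) can be printed with:
# summary(fitcub_full$finalModel)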
Before we generate predictions on the evaluation data set, we'll do some cursory exploratory data analysis to verify that the evaluation data set isn't vastly different from the training set.
# Move outcome variable pH to front for easier access
dfe <- dfe_raw %>%
dplyr::select(PH, !matches('Brand.Code'), Brand.Code)
summary(dfe)## PH Carb.Volume Fill.Ounces PC.Volume
## Mode:logical Min. :5.147 Min. :23.75 Min. :0.09867
## NA's:267 1st Qu.:5.287 1st Qu.:23.92 1st Qu.:0.23333
## Median :5.340 Median :23.97 Median :0.27533
## Mean :5.369 Mean :23.97 Mean :0.27769
## 3rd Qu.:5.465 3rd Qu.:24.01 3rd Qu.:0.32200
## Max. :5.667 Max. :24.20 Max. :0.46400
## NA's :1 NA's :6 NA's :4
## Carb.Pressure Carb.Temp PSC PSC.Fill
## Min. :60.20 Min. :130.0 Min. :0.00400 Min. :0.0200
## 1st Qu.:65.30 1st Qu.:138.4 1st Qu.:0.04450 1st Qu.:0.1000
## Median :68.00 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.25 Mean :141.2 Mean :0.08545 Mean :0.1903
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :77.60 Max. :154.0 Max. :0.24600 Max. :0.6200
## NA's :1 NA's :5 NA's :3
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## Min. :0.00000 Min. :-100.20 Min. :113.0 Min. :37.80
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:120.2 1st Qu.:46.00
## Median :0.04000 Median : 0.20 Median :123.4 Median :47.80
## Mean :0.05107 Mean : 21.03 Mean :123.0 Mean :48.14
## 3rd Qu.:0.06000 3rd Qu.: 141.30 3rd Qu.:125.5 3rd Qu.:50.20
## Max. :0.24000 Max. : 220.40 Max. :136.0 Max. :60.20
## NA's :5 NA's :4 NA's :2
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4
## Min. :-50.00 Min. :-50.00 Min. :-50.00 Min. : 68.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 90.00
## Median : 10.40 Median : 26.80 Median : 27.70 Median : 98.00
## Mean : 12.01 Mean : 20.11 Mean : 19.61 Mean : 97.84
## 3rd Qu.: 20.40 3rd Qu.: 34.80 3rd Qu.: 33.00 3rd Qu.:104.00
## Max. : 50.00 Max. : 61.40 Max. : 49.20 Max. :140.00
## NA's :1 NA's :1 NA's :4
## Filler.Level Filler.Speed Temperature Usage.cont Carb.Flow
## Min. : 69.2 Min. :1006 Min. :63.80 Min. :12.90 Min. : 0
## 1st Qu.:100.6 1st Qu.:3812 1st Qu.:65.40 1st Qu.:18.12 1st Qu.:1083
## Median :118.6 Median :3978 Median :65.80 Median :21.44 Median :3038
## Mean :110.3 Mean :3581 Mean :66.23 Mean :20.90 Mean :2409
## 3rd Qu.:120.2 3rd Qu.:3996 3rd Qu.:66.60 3rd Qu.:23.74 3rd Qu.:3215
## Max. :153.2 Max. :4020 Max. :75.40 Max. :24.60 Max. :3858
## NA's :2 NA's :10 NA's :2 NA's :2
## Density MFR Balling Pressure.Vacuum
## Min. :0.060 Min. : 15.6 Min. :0.902 Min. :-6.400
## 1st Qu.:0.920 1st Qu.:707.0 1st Qu.:1.498 1st Qu.:-5.600
## Median :0.980 Median :724.6 Median :1.648 Median :-5.200
## Mean :1.177 Mean :697.8 Mean :2.203 Mean :-5.174
## 3rd Qu.:1.600 3rd Qu.:731.5 3rd Qu.:3.242 3rd Qu.:-4.800
## Max. :1.840 Max. :784.8 Max. :3.788 Max. :-3.600
## NA's :1 NA's :31 NA's :1 NA's :1
## Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer
## Min. :0.00240 Min. : 70.0 Min. :44.00 Min. :141.2
## 1st Qu.:0.01960 1st Qu.:100.0 1st Qu.:46.00 1st Qu.:142.2
## Median :0.03370 Median :120.0 Median :46.00 Median :142.6
## Mean :0.04666 Mean :109.6 Mean :47.73 Mean :142.8
## 3rd Qu.:0.05440 3rd Qu.:120.0 3rd Qu.:50.00 3rd Qu.:142.8
## Max. :0.39800 Max. :130.0 Max. :52.00 Max. :147.2
## NA's :3 NA's :1 NA's :2 NA's :1
## Alch.Rel Carb.Rel Balling.Lvl Brand.Code
## Min. :6.400 Min. :5.18 Min. :0.000 Length:267
## 1st Qu.:6.540 1st Qu.:5.34 1st Qu.:1.380 Class :character
## Median :6.580 Median :5.40 Median :1.480 Mode :character
## Mean :6.907 Mean :5.44 Mean :2.051
## 3rd Qu.:7.180 3rd Qu.:5.56 3rd Qu.:3.080
## Max. :7.820 Max. :5.74 Max. :3.420
## NA's :3 NA's :2
At first glance the evaluation set appears to be roughly on par with the training set. A number of missing values exist, but not in vast proportions.
# Factor categorical variable Brand.Code
dfe$Brand.Code <- factor(dfe$Brand.Code)
# Generate corr plot for pairwise correlations
corr1 <- cor(dfe[,2:(ncol(dfe)-1)], use='complete')
corrplot::corrplot(corr1, method='square', order='hclust', type='full', diag=T, tl.cex=0.75, cl.cex=0.75)Like the training data, some collinearity exists between pairs of variables, and the same variables appear to be correlated as in the training set.
# Count NAs per column
missing_values <- data.frame(colSums(is.na(dfe[,-1])))
missing_values$Variable <- row.names(missing_values)
colnames(missing_values) <- c('Missing.values', 'Variable')
missing_values %>%
dplyr::select(Variable, Missing.values) %>%
arrange(desc(Missing.values)) %>%
flextable() %>%
width(width = 2) %>%
fontsize(size = 10) %>%
line_spacing(space = 1) %>%
hline(part = "all") %>%
set_caption('Missing values')Variable | Missing.values |
|---|---|
MFR | 31 |
Filler.Speed | 10 |
Fill.Ounces | 6 |
PSC | 5 |
PSC.CO2 | 5 |
PC.Volume | 4 |
Carb.Pressure1 | 4 |
Hyd.Pressure4 | 4 |
PSC.Fill | 3 |
Oxygen.Filler | 3 |
Alch.Rel | 3 |
Fill.Pressure | 2 |
Filler.Level | 2 |
Temperature | 2 |
Usage.cont | 2 |
Pressure.Setpoint | 2 |
Carb.Rel | 2 |
Carb.Volume | 1 |
Carb.Temp | 1 |
Hyd.Pressure2 | 1 |
Hyd.Pressure3 | 1 |
Density | 1 |
Balling | 1 |
Pressure.Vacuum | 1 |
Bowl.Setpoint | 1 |
Air.Pressurer | 1 |
Carb.Pressure | 0 |
Mnf.Flow | 0 |
Hyd.Pressure1 | 0 |
Carb.Flow | 0 |
Balling.Lvl | 0 |
Brand.Code | 0 |
# Missing value patterns
invisible(md.pattern(dfe[,-1], rotate.names=T, plot=T))Missing values also appear to be on par with the training set, with MFR having the most missing values. There is one observation that contains 14 missing values, but since we’ll need to predict against every row in the evaluation set, we won’t remove it.
# Generate histograms of continuous variables
dfe %>%
dplyr::select(c(2:(ncol(dfe)-1))) %>%
gather() %>%
ggplot(aes(x=value)) +
geom_histogram(bins=20) +
facet_wrap(~key, scales = 'free_x') +
ggtitle('Histograms of continuous variables')# Calculate skewness on columns
colskewness <- data.frame(apply(dfe[,2:(ncol(dfe)-1)], 2, calcSkewness, type=1))
colskewness$Variable <- row.names(colskewness)
colnames(colskewness) <- c('Skewness', 'Variable')
# Graph skewness values
colskewness %>%
dplyr::select(Variable, Skewness) %>%
arrange(desc(abs(Skewness))) %>%
ggplot(aes(x=reorder(Variable, desc(Skewness)), y=Skewness)) +
geom_bar(stat='identity') +
coord_flip() +
ggtitle('Variable skewness') +
xlab('')Roughly the same variables appear to be skewed as in the training set.
# Set multiplier; outliers are considered as such when they lie outside of (multiplier) standard deviations from the mean
mult <- 3
outliers <- apply(dfe[,2:(ncol(dfe)-1)], 2, getOutliers, mult=mult)
dfout <- data.frame(colSums(outliers))
dfout$Variable <- row.names(dfout)
colnames(dfout) <- c('Outlier.count', 'Variable')
# Filter just those variables with outliers and sort by outlier count
dfout <- dfout %>%
dplyr::select(Variable, Outlier.count) %>%
arrange(desc(Outlier.count)) %>%
filter(Outlier.count > 0)
# Generate bar graph of outlier counts
dfout %>%
ggplot(aes(x=reorder(Variable, Outlier.count), y=Outlier.count, label=Outlier.count)) +
geom_bar(stat='identity') +
coord_flip() +
geom_text(check_overlap=T, hjust=-0.1, nudge_x=0.05) +
ggtitle('Outliers (only variables with outliers shown)') +
xlab('') + ylab('Outlier count')The proportion of outliers also seems to be similar to that of the training set.
Now that we’ve verified the evaluation data is on par with the training data, we’ll proceed with data preparation. Since our best-performing model used the full data set, we’ll perform the same steps on the evaluation set that we used earlier on the full training set, namely the following:
# Init data frame, removing outcome variable, PH
dfe_eval <- dfe[,-1]
# Remove the MFR predictor because it has a high proportion of missing data
dfe_eval <- dfe_eval %>% select(-MFR)
# Change blank brand codes to NA so they will be imputed
dfe_eval <- dfe_eval %>%
mutate(Brand.Code=ifelse(Brand.Code=='', NA, as.character(Brand.Code)))
# Factor Brand.Code (needed for imputation)
dfe_eval$Brand.Code <- factor(dfe_eval$Brand.Code)
# Use knnImputation to impute values
dfe_eval <- knnImputation(dfe_eval, k=9)
# Create dummies for factor and character variables
dfe_eval <- dummy_cols(dfe_eval, select_columns='Brand.Code', remove_first_dummy=F, remove_selected_columns=T)After preparing the data, we have 267 observations of 34 variables. The data is now ready for prediction.
To predict pH in the evaluation data set, we’ll use the modeling results from the top performing model (Cubist run against the full data set). The first few predicted values are included below.
# Predict using the Cubist full model
pred_eval <- predict(fitcub_full, dfe_eval)
df_ph <- data.frame(pred_eval)
colnames(df_ph) <- c('PH')
# Show first few values
data.frame(pred_eval) %>%
head(10) %>%
flextable() %>%
width(width = 2) %>%
fontsize(size = 10) %>%
line_spacing(space = 1) %>%
hline(part = "all") %>%
set_caption('Predicted pH (first few values)') %>%
delete_part( part = "header")8.628850 |
8.376658 |
8.519270 |
8.634902 |
8.452495 |
8.577147 |
8.453360 |
8.550534 |
8.566530 |
8.643949 |
# Save to csv
write.csv(df_ph, 'predicted_ph.csv', row.names=F)We’ll also generate some summary statistics for comparison purposes.
# Generate summary stats for predicted pH values
tmpdf <- data.frame(as.matrix(summary(pred_eval)))
tmpdf$Statistic <- row.names(tmpdf)
colnames(tmpdf) <- c('Evaluation', 'Statistic')
tmpdf <- tmpdf %>%
rbind(data.frame(Evaluation=sd(pred_eval), Statistic='Std dev'))
# Generate summary stats for training pH values
tmpdfm <- data.frame(as.matrix(summary(dfm_full$PH)))
tmpdfm$Statistic <- row.names(tmpdfm)
colnames(tmpdfm) <- c('Training', 'Statistic')
tmpdfm <- tmpdfm %>%
rbind(data.frame(Training=sd(dfm_full$PH), Statistic='Std dev'))
# Merge the summary stats to display in a table
tmpdf2 <- tmpdfm %>%
merge(tmpdf, by=c('Statistic'))
# Display summary
tmpdf2 %>%
select(Statistic, Training, Evaluation) %>%
flextable() %>%
width(width = 2) %>%
fontsize(size = 10) %>%
line_spacing(space = 1) %>%
hline(part = "all") %>%
set_caption('Comparison of pH statistics')Statistic | Training | Evaluation |
|---|---|---|
1st Qu. | 8.4400000 | 8.4597397 |
3rd Qu. | 8.6800000 | 8.6659360 |
Max. | 9.3600000 | 8.9317131 |
Mean | 8.5457724 | 8.5525592 |
Median | 8.5400000 | 8.5318241 |
Min. | 7.8800000 | 8.1460953 |
Std dev | 0.1724898 | 0.1459234 |
# Generate histogram of training vs predicted PH
tmpdf3 <- df_ph %>%
mutate(Set='Predicted') %>%
rbind(data.frame(PH=dfm_full$PH) %>% mutate(Set='Training'))
tmpdf3 %>%
ggplot(aes(x=PH)) +
geom_histogram(bins=30) +
facet_wrap('Set')As shown in the table above, the mean, median, and first and third quartiles are all roughly even between the training data and the evaluation predictions. The training data show a wider spread (lower minimum, higher maximum, and larger standard deviation) than the evaluation predictions, but this is natural given the much greater number of samples in the training set.
# Generate histogram of training vs predicted PH
tmpdf3 <- df_ph %>%
mutate(Set='Predicted') %>%
rbind(data.frame(PH=dfm_full$PH) %>% mutate(Set='Training'))
tmpdf3 %>%
ggplot(aes(x=PH, y=after_stat(density))) +
geom_histogram(bins=30) +
facet_wrap('Set')The distributions look to be similar.
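A two-sample Kolmogorov-Smirnov test offers a rough numeric check on that visual similarity; this is only a supplementary sanity check, not part of the model selection above.
# Compare the predicted pH distribution to the training pH distribution
ks.test(df_ph$PH, dfm_full$PH)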
In summary, after comparing linear, nonlinear, and tree-based models, the Cubist model trained on the full data set (100 committees, 5 neighbors) delivered the lowest test RMSE and MAPE and was selected as the final model. Mnf.Flow, Balling.Lvl, and Alch.Rel were its most informative predictors, and observations with Mnf.Flow values above zero were likelier to fall within the critical pH range. The selected model was used to predict pH for the evaluation set, and the predictions, whose distribution closely tracks the training pH distribution, were saved to predicted_ph.csv.
Box, G. and Cox, D. (1964). An Analysis of Transformations. Journal of the Royal Statistical Society: Series B (Methodological), Volume 26, Issue 2, June 1964, Pages 211-243. https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.2517-6161.1964.tb00553.x
Glen, S. (2023). “Variance Inflation Factor” From StatisticsHowTo.com: Elementary Statistics for the rest of us! https://www.statisticshowto.com/variance-inflation-factor/
Joanes, D. and Gill, C. (1998). Comparing measures of sample skewness and kurtosis. The Statistician, 47, pages 183-189.
Kaplan J. (2020). fastDummies: Fast Creation of Dummy (Binary) Columns and Rows from Categorical Variables. R package version 1.6.3, https://CRAN.R-project.org/package=fastDummies.
Kuhn M. (2022). caret: Classification and Regression Training. R package version 6.0-93, https://CRAN.R-project.org/package=caret.
Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer Science+Business Media.
Meyer D. et al. (2022). e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R package version 1.7-12, https://CRAN.R-project.org/package=e1071.
Petrie, A. (2020). regclass: Tools for an Introductory Class in Regression and Modeling. R package version 1.6, https://CRAN.R-project.org/package=regclass.
Torgo, L. (2016). Data Mining with R, learning with case studies (2nd ed.). Chapman and Hall/CRC. http://ltorgo.github.io/DMwR2.
van Buuren, S. (2018). Flexible Imputation of Missing Data (2nd ed.). Chapman & Hall/CRC Interdisciplinary Statistics Series. https://stefvanbuuren.name/fimd/sec-FCS.html
van Buuren, S. and Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. DOI 10.18637/jss.v045.i03.
van Buuren, S., et al. (2023). Multivariate Imputation by Chained Equations. https://cran.r-project.org/web/packages/mice/mice.pdf.
Yeo, I. and Johnson, R. (2000). A new family of power transformations to improve normality or symmetry. Biometrika, Volume 87, Issue 4, December 2000, pages 954–959. https://academic.oup.com/biomet/article-abstract/87/4/954/232908.