Predictive Analytics

library(AppliedPredictiveModeling)
library(arules)
library(arulesViz)
library(caret)
library(Cubist)
library(DataExplorer)
library(dplyr)
library(elasticnet)
library(fma)
library(forcats)
library(forecast)
library(fpp2)
library(gbm)
library(ggcorrplot)
library(ggplot2)
library(gridExtra)
library(kableExtra)
library(MASS)
library(mlbench)
library(naniar)
library(party)
library(partykit)
library(randomForest)
library(readxl)
library(tidyr)
library(tidyverse)
library(timeDate)

Forecasting: Principles and Practice

Time Series

We can use the autoplot() function from the forecast library to explore what the series gold, woolyrnq and gas look like respectively.

autoplot(gold) +
  ggtitle("Daily Morning Gold Prices") +
  xlab("Days since 1/1/1985") +
  ylab("US Dollars")

autoplot(woolyrnq) +
  ggtitle("Wool Yarn Production in Australia") +
  xlab("Year") +
  ylab("US Dollars")

autoplot(gas) +
  ggtitle("Gas Production") +
  xlab("Year") +
  ylab("US Dollars")

The frequency() function provides the frequency of each series.

frequency(gold)

## [1] 1

frequency(woolyrnq)

## [1] 4

frequency(gas)

## [1] 12

which.max() can be used to spot outliers.

which.max(gold)

## [1] 770

The outlier is the day the starting price of gold was $770.

Australian Retail Data

The following data represents retail sales in various categories for different Australian states.

#load excel file
retaildata <- read_excel('/Users/aaronzalki/Downloads/retail.xlsx', skip=1)
#View(retaildata)

Using the ts function, we can select one of the time series as follows:

myts <- ts(retaildata[,"A3349873A"],
  frequency=12, start=c(1982,4))

time_series <- ts(retaildata[,"A3349338X"], frequency=12, start=c(1982,4))
time_series

##        Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
## 1982                    63.9  64.0  62.7  65.6  62.6  64.4  66.0  65.3  77.9
## 1983  65.1  62.3  65.7  61.9  63.7  64.9  69.5  67.9  67.5  66.0  67.2  78.4
## 1984  69.2  68.2  70.7  67.4  71.0  67.9  67.4  70.7  68.1  73.5  75.2  81.9
## 1985  74.7  68.7  72.9  73.1  78.3  73.2  78.8  80.7  77.5  86.0  87.4  98.2
## 1986  85.4  76.5  83.7  84.9  91.9  86.7  90.5  91.3  89.6  93.0  92.3 107.4
## 1987  91.2  87.0  88.9  91.4  93.2  91.3  98.5  97.4  97.3  99.8  96.9 111.3
## 1988 100.1  95.0 101.4  96.2  97.8  91.3  94.6  95.4  98.3 101.4 100.7 120.3
## 1989 106.2  97.3 105.4 100.6 103.9 107.4 106.9 109.6 108.8 110.6 113.8 134.4
## 1990 121.4 114.6 123.5 118.8 123.6 127.5 125.9 130.1 116.9 123.3 125.3 136.5
## 1991 121.5 112.3 121.6 120.2 124.1 120.1 124.5 130.2 119.1 127.7 127.0 141.8
## 1992 133.5 125.9 135.1 135.1 137.7 129.5 134.2 134.9 132.7 138.3 134.9 153.8
## 1993 130.4 122.0 125.9 130.6 128.5 126.2 133.9 127.6 127.3 134.3 132.4 156.4
## 1994 145.7 131.7 143.9 137.1 136.1 127.8 130.5 135.7 138.5 153.9 151.6 179.2
## 1995 160.9 157.0 165.5 164.4 160.8 156.9 158.5 169.3 164.3 184.2 192.0 219.4
## 1996 168.0 166.6 170.2 163.3 167.9 174.3 169.1 173.1 164.1 178.5 172.1 188.8
## 1997 165.1 149.5 161.0 165.3 165.1 161.4 201.3 201.0 200.3 207.4 204.7 240.8
## 1998 233.7 207.8 214.0 218.9 219.3 212.4 216.7 211.9 218.6 223.4 209.4 247.8
## 1999 219.7 206.9 218.4 226.4 217.0 215.4 227.7 223.5 228.3 230.6 226.0 284.5
## 2000 185.8 183.1 186.9 176.9 172.2 176.8 169.9 171.9 169.3 153.4 150.9 174.9
## 2001 151.0 142.9 157.5 179.2 172.6 173.0 188.3 191.7 189.3 218.8 218.1 243.0
## 2002 230.5 210.4 216.0 212.6 226.1 213.6 272.8 268.3 272.3 251.0 249.8 289.3
## 2003 241.6 220.6 235.7 216.9 212.1 203.5 232.7 228.2 223.8 240.7 242.7 263.7
## 2004 231.3 215.4 228.6 234.5 221.4 218.5 224.6 228.2 226.1 236.4 248.1 296.5
## 2005 239.3 225.1 247.7 244.5 242.9 238.6 245.9 251.9 246.8 267.2 271.7 316.0
## 2006 287.0 265.7 291.5 268.2 261.7 253.6 264.8 273.6 269.1 289.6 289.6 341.9
## 2007 304.6 288.8 311.3 307.1 304.0 279.5 300.1 308.5 296.2 276.8 272.0 322.6
## 2008 239.7 229.9 241.3 223.8 234.7 221.4 238.0 250.0 243.7 250.6 254.7 320.6
## 2009 236.1 222.5 243.3 241.6 238.4 230.6 237.1 245.2 241.0 266.1 275.6 342.0
## 2010 255.4 215.5 229.6 214.4 224.5 205.6 198.7 203.3 203.2 214.4 201.4 251.8
## 2011 205.8 194.6 200.4 196.2 185.9 183.2 188.5 192.4 188.5 197.6 201.4 257.2
## 2012 188.1 200.1 199.7 214.5 216.2 208.6 204.5 213.9 211.7 230.6 219.5 276.2
## 2013 234.5 226.2 233.4 234.4 243.0 228.7 235.9 247.8 240.2 244.5 232.2 270.9

We can further explore the retail time series using the following functions:

autoplot(), ggseasonplot(), ggsubseriesplot(), gglagplot(), ggAcf()

autoplot(time_series)

ggseasonplot(time_series)

ggsubseriesplot(time_series)

gglagplot(time_series)

ggAcf(time_series)

Conclusion: There is an overall positive trend in the sales data, with an increase in the December month. The downward slope in the final autocorrelation function plot also depicts the positive trend.

Plastics Manufacturer

The following plastics data set consists of the monthly sales (in thousands) of product A for a plastics manufacturer for five years.

autoplot(plastics)

ggseasonplot(plastics)

ggsubseriesplot(plastics)

ggAcf(plastics)

The plastics time series has an increasing trend with a seasonal fluctuation of higher sales in the summer and lower sales in the winter. Based on the annual seasonal plot, the apex of plastics sales exist around August and September.

We can use a classical multiplicative decomposition to calculate the trend-cycle and seasonal indices.

decompose_plastics <- decompose(plastics, type="multiplicative")
decompose_plastics %>% 
  autoplot() +
  ggtitle("Multiplicative Decomposition of Plastic Product")

We can see in the above graphs that there is a seasonal component to plastic sales and the trend is increasing. We’ll now compute and plot the seasonally adjusted data.

seasonal_plastics <- plastics / decompose_plastics$seasonal
autoplot(plastics, series = "original data") +
autolayer(seasonal_plastics, series = "seasonally adjusted") +
ylab("Sales (Thousands)") +
ggtitle("Plastic Product Sales") +
scale_color_brewer(palette = "Set2")

We can also change one observation to be an outlier (add 500 to one observation), and recompute the seasonally adjusted data.

plastics_outlier <- plastics
plastics_outlier[50] <- plastics_outlier[50] + 500
decomp_plastics_outlier <- decompose(plastics_outlier, type="multiplicative")
seasonal_plastics_outlier <- plastics_outlier / decomp_plastics_outlier$seasonal
autoplot(plastics, series = 'original data') +
autolayer(seasonal_plastics, series = 'without outlier') +
autolayer(seasonal_plastics_outlier, series = 'with outlier') +
ylab("Sales (Thousands)") +
ggtitle("Plastic Product Sales") +
scale_color_brewer(palette = "Set2")

The outlier causes the series to be slightly higher than the original series. There are also throughs in the series that weren’t present in the original series.

Now we’ll see if it makes any difference if the outlier is near the end rather than in the middle of the time series.

#End
plastics_outlier_end <- plastics
plastics_outlier_end[50] <- plastics_outlier_end[50] + 500
decomp_plastics_outlier_end <- decompose(plastics_outlier_end, type="multiplicative")
seasonal_plastics_outlier_end <- plastics_outlier_end / decomp_plastics_outlier_end$seasonal
#Middle
plastics_outlier_mid <- plastics
plastics_outlier_mid[27] <- plastics_outlier_mid[27] + 500
decomp_plastics_outlier_mid <- decompose(plastics_outlier_mid, type="multiplicative")
seasonal_plastics_outlier_mid <- plastics_outlier_mid / decomp_plastics_outlier_mid$seasonal
#Plot
autoplot(plastics, series = 'original data') +
  autolayer(seasonal_plastics_outlier_end, series = 'with outlier near the end') +
  autolayer(seasonal_plastics_outlier_mid, series = 'with outlier in the middle') +
  ylab("Sales (Thousands)") +
  ggtitle("Plastic Product A Sales") +
  scale_color_brewer(palette = "Set2")

Where the outlier lies seems to make no large difference on the series overall, except where the through lies. The throughs occur slightly later for the series with the outlier towards the end.

Pigs Series

This pigs series accounts for the number of pigs slaughtered in Victoria each month.

The ses() function helps find the optimal values of $\alpha$ and $l_{0}$ and generate forecasts for the next four months.

ses_pigs <- ses(pigs, h = 4)
summary(ses_pigs)

## 
## Forecast method: Simple exponential smoothing
## 
## Model Information:
## Simple exponential smoothing 
## 
## Call:
##  ses(y = pigs, h = 4) 
## 
##   Smoothing parameters:
##     alpha = 0.2971 
## 
##   Initial states:
##     l = 77260.0561 
## 
##   sigma:  10308.58
## 
##      AIC     AICc      BIC 
## 4462.955 4463.086 4472.665 
## 
## Error measures:
##                    ME    RMSE      MAE       MPE     MAPE      MASE       ACF1
## Training set 385.8721 10253.6 7961.383 -0.922652 9.274016 0.7966249 0.01282239
## 
## Forecasts:
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Sep 1995       98816.41 85605.43 112027.4 78611.97 119020.8
## Oct 1995       98816.41 85034.52 112598.3 77738.83 119894.0
## Nov 1995       98816.41 84486.34 113146.5 76900.46 120732.4
## Dec 1995       98816.41 83958.37 113674.4 76092.99 121539.8

We can also compute a 95% prediction interval for the first forecast using $y \pm 1.96s$ where $s$ is the standard deviation of the residuals.

pig_mean <- ses_pigs$mean[1]
pig_stdev <- sd(ses_pigs$residuals)
pig_lower <- pig_mean - 1.96 * pig_stdev
pig_upper <- pig_mean + 1.96 * pig_stdev

We are 95% confident that the true mean lies in the interval (78679.97, 118952.8)

IBM Stock Price

A classic example of a non-stationary series is the daily closing IBM stock price series (data set ibmclose). Using R, we can plot the daily closing prices for IBM stock and the ACF and PACF.

tsggplot <- function(ts, title){
  grid.arrange(
    autoplot(ts) +
      ggtitle(title),
    grid.arrange(
      ggAcf(ts),
      ggPacf(ts), ncol=2), nrow=2)
}
tsggplot(ibmclose, "IBM Closing Stock Price")

In the top plot, we can see that there is a downward trend in IBM’s stock price. The gradual decrease in the ACF implies that this time series is not stationary.

ARIMA Models

We can use R to simulate and plot some data from simple ARIMA models.The following code is used to generate data from an AR(1) model with $\phi_1=.6$ and $\sigma^2=1$. The process starts with $y_1=0$.

y <- ts(numeric(100))
e <- rnorm(100)
for(i in 2:100)
  y[i] <- 0.6*y[i-1] + e[i]

We will need to adjust the code to see the effects of changing $\phi_1$ on the time series plot.

phi_sim <- function(phi){
  set.seed(50)
  y <- ts(numeric(100))
  e <- rnorm(100)
  for(i in 2:100){
    y[i] <- phi*y[i-1] + e[i]
  }
  return(y)
}
p <- autoplot(phi_sim(0.6))
for(phi in seq(0.1, 0.9, 0.1)){
  p <- p + autolayer(phi_sim(phi), series = paste(phi))
}
p +
  labs(title="Changes in Phi", color = "Phi") +
  theme(axis.title = element_blank(), legend.position = "bottom") +
  scale_color_brewer(palette = "Set1")

We can see that as phi increases, the distance of the time series from 0 increases.

The following code is used to generate data from an MA(1) model with $\theta_1=.6$ and $\sigma^2=1$.

MA1 <- function(theta){
  set.seed(50)
  y <- ts(numeric(100))
  e <- rnorm(100)
  e[1] <- 0
  for(i in 2:100){
   y[i] <- theta*e[i-1]+ e[i]
  }
  return(y)
}
ma <- MA1(.6)
plot(ma)

p <- autoplot(MA1(0.6))
for(theta in seq(0.1, 0.9, 0.1)){
  p <- p + autolayer(MA1(theta), series = paste(theta))
}
p +
  labs(title="Changes of Theta", color = "Theta") +
  theme(axis.title = element_blank(), legend.position = "bottom") +
  scale_color_brewer(palette = "Set1")

Based on the above time series plot, the effects of changing $\theta$ are less drastic than changing $\phi$. As $\theta$ increases, the disctance from 0 also increases.

The following code is used to generate data from an ARMA(1,1) model with $\phi_1=.6$, $\theta_1=.6$, and $\sigma^2=1$.

ARMA <- function(phi, theta){
  set.seed(50)
  y <- ts(numeric(100))
  e <- rnorm(100)
  e[1] <- 0
  for(i in 2:100)
    y[i] <- phi*y[i-1] + theta*e[i-1] + e[i]
  return(y)
}
arma <- ARMA(.6,.6)
plot(arma)

The following code is used to generate data from an AR(2) model with $\phi_1=-.8$, $\phi_2=.3$, and $\sigma^2=1$.(these parameters will give a non-stationary series)

AR2 <- function(phi_1, phi_2){
  set.seed(50)
  y <- ts(numeric(100))
  e <- rnorm(100)
  for(i in 3:100)
    y[i] <- phi_1*y[i-1] + phi_2*y[i-2] + e[i]
  return(y)
}
ar2 <- AR2(-.8, .3)
plot(ar2)

We’ll now compare the two.

p <- autoplot(arma)
p <- p + autolayer(ar2)
p +
  labs(title="ARMA vs. AR2") +
  theme(axis.title = element_blank(), legend.position = "bottom") +
  scale_color_brewer(palette = "Set1")

International Visitors

austa is a time series that represents the total international visitors to Australia (in millions) for the period 1980-2015.

auto.arima() helps find an appropriate ARIMA model.

aa_austa<-auto.arima(austa)
summary(aa_austa)

## Series: austa 
## ARIMA(0,1,1) with drift 
## 
## Coefficients:
##          ma1   drift
##       0.3006  0.1735
## s.e.  0.1647  0.0390
## 
## sigma^2 estimated as 0.03376:  log likelihood=10.62
## AIC=-15.24   AICc=-14.46   BIC=-10.57
## 
## Training set error measures:
##                        ME      RMSE       MAE       MPE     MAPE      MASE
## Training set 0.0008313383 0.1759116 0.1520309 -1.069983 5.513269 0.7461559
##                      ACF1
## Training set -0.000571993

Auto ARIMA chose the model ARIMA(0,1,1) with drift.

checkresiduals(aa_austa)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1) with drift
## Q* = 2.297, df = 5, p-value = 0.8067
## 
## Model df: 2.   Total lags used: 7

The residuals are fairly normal.

We can plot forecasts for the next 10 periods.

aa_austa_fc<-forecast(aa_austa,h=10)
autoplot(aa_austa_fc)

We can do the same on the specific ARIMA(0,1,1), ARIMA (0,1,0), ARIMA (2,1,0), ARIMA (0,0,1), ARIMA (0,2,1) models.

aa_austa_2<-arima(austa,order=c(0,1,1))
aa_austa_2_fc<-forecast(aa_austa_2,h=10)
autoplot(aa_austa_2_fc)

aa_austa_3<-arima(austa,order=c(0,1,0))
aa_austa_3_fc<-forecast(aa_austa_3,h=10)
autoplot(aa_austa_3_fc)

aa_austa_4<-arima(austa,order=c(2,1,0))
aa_austa_4_fc<-forecast(aa_austa_4,h=10)
autoplot(aa_austa_4_fc)

aa_austa5<-arima(austa, order=c(0,0,1))
aa_austa5_fc<-forecast(aa_austa5,h=10)
autoplot(aa_austa5_fc)

aa_austa6<-arima(austa, order=c(0,2,1))
aa_austa6_fc<-forecast(aa_austa6,h=10)
autoplot(aa_austa6_fc)

Applied Predictive Modeling

Glass Identification

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

The data can be accessed via:

data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

We can use visualizations to explore the predictor variables and understand their distributions as well as the relationships between predictors.

kable(summary(Glass[,1:10]))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

RI	Na	Mg	Al	Si	K	Ca	Ba	Fe	Type
Min. :1.511	Min. :10.73	Min. :0.000	Min. :0.290	Min. :69.81	Min. :0.0000	Min. : 5.430	Min. :0.000	Min. :0.00000	1:70
1st Qu.:1.517	1st Qu.:12.91	1st Qu.:2.115	1st Qu.:1.190	1st Qu.:72.28	1st Qu.:0.1225	1st Qu.: 8.240	1st Qu.:0.000	1st Qu.:0.00000	2:76
Median :1.518	Median :13.30	Median :3.480	Median :1.360	Median :72.79	Median :0.5550	Median : 8.600	Median :0.000	Median :0.00000	3:17
Mean :1.518	Mean :13.41	Mean :2.685	Mean :1.445	Mean :72.65	Mean :0.4971	Mean : 8.957	Mean :0.175	Mean :0.05701	5:13
3rd Qu.:1.519	3rd Qu.:13.82	3rd Qu.:3.600	3rd Qu.:1.630	3rd Qu.:73.09	3rd Qu.:0.6100	3rd Qu.: 9.172	3rd Qu.:0.000	3rd Qu.:0.10000	6: 9
Max. :1.534	Max. :17.38	Max. :4.490	Max. :3.500	Max. :75.41	Max. :6.2100	Max. :16.190	Max. :3.150	Max. :0.51000	7:29

#omit type column
r <- round(cor(Glass[,1:9]),3)
ggcorrplot(r,type = 'lower', lab=T, lab_size=2)

par(mfrow=c(3,3))
for(i in names(Glass)[-10]){
  #title of element boxplot
  boxplot(Glass[i], main=paste('Boxplot of', i), horizontal = T)
}

Magnesium Mg is the only boxplot to not have outliers.

#use timeDate function 'skewness'
Glass[-10] %>% 
  apply(2, skewness) %>% 
  sort(decreasing=T)

##          K         Ba         Ca         Fe         RI         Al         Na 
##  6.4600889  3.3686800  2.0184463  1.7298107  1.6027151  0.8946104  0.4478343 
##         Si         Mg 
## -0.7202392 -1.1364523

Barium Ba, Calcium Ca, Iron Fe, Potassium K, Magnesium Mg and RI have high skewness.

Soybean

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

data(Soybean)
## See ?Soybean for details

plot_missing helps us understand if there are particular predictors that are missing more data.

plot_missing(Soybean)

The above graph captures a clear pattern of missing data related to classes. germ, lodging, seed.tmt, sever and hail have the most missing data.

We’ll use the following code to handle missing data.

soybean_missing <- Soybean %>%
  mutate_if(is.factor, fct_explicit_na)

par(mar=c(1,1,1,1))
soybean_missing %>%
  gather(-Class, key = "var", value = "val") %>%
  ggplot(aes(x = val, fill=Class)) +
  geom_bar(alpha=1) +
  facet_wrap(~ var, scales = "free") +
  scale_fill_manual("target",
                    values = c('#99d8c9','#8856a7','#43a2ca','#e34a33','#ece7f2',
                               '#1c9099','#dd1c77','#c994c7','#fde0dd','#addd8e',
                               '#f7fcb9','#7fcdbb','#8c510a','#bababa',
                               '#1f78b4','#bf812d','#80cdc1','#01665e','#a6cee3')) +
  xlab("") +
  ylab("") +
  theme(panel.background = element_blank(), legend.position="top")

Chemical Manufacturing

We’d like to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch.

data(ChemicalManufacturingProcess)

A small percentage of cells in the predictor set contain missing values.

missing <- ChemicalManufacturingProcess %>%
  miss_var_summary() 
kable(missing %>% 
  filter(pct_miss > 0)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

variable	n_miss	pct_miss
ManufacturingProcess03	15	8.5227273
ManufacturingProcess11	10	5.6818182
ManufacturingProcess10	9	5.1136364
ManufacturingProcess25	5	2.8409091
ManufacturingProcess26	5	2.8409091
ManufacturingProcess27	5	2.8409091
ManufacturingProcess28	5	2.8409091
ManufacturingProcess29	5	2.8409091
ManufacturingProcess30	5	2.8409091
ManufacturingProcess31	5	2.8409091
ManufacturingProcess33	5	2.8409091
ManufacturingProcess34	5	2.8409091
ManufacturingProcess35	5	2.8409091
ManufacturingProcess36	5	2.8409091
ManufacturingProcess02	3	1.7045455
ManufacturingProcess06	2	1.1363636
ManufacturingProcess01	1	0.5681818
ManufacturingProcess04	1	0.5681818
ManufacturingProcess05	1	0.5681818
ManufacturingProcess07	1	0.5681818
ManufacturingProcess08	1	0.5681818
ManufacturingProcess12	1	0.5681818
ManufacturingProcess14	1	0.5681818
ManufacturingProcess22	1	0.5681818
ManufacturingProcess23	1	0.5681818
ManufacturingProcess24	1	0.5681818
ManufacturingProcess40	1	0.5681818
ManufacturingProcess41	1	0.5681818

We can handle this using a knn imputation function to fill in these missing values.

imputation <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
predict_imputer <- predict(imputation, ChemicalManufacturingProcess)
#compare to above chart
missing2 <- predict_imputer %>%
  miss_var_summary() 
kable(missing2 %>% 
  filter(pct_miss > 0)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

variable	n_miss	pct_miss

We can split the data into a training and a test set, pre-process the data, and tune a model.

near_zero <- nearZeroVar(predict_imputer)
part_c <- predict_imputer[,-near_zero]
training_rows <- createDataPartition(part_c$Yield, p = .80, list= FALSE)
train <- part_c[training_rows, ]
test <- part_c[-training_rows, ]
ridge_model <- train(train[,-1], train$Yield,
                      method = "ridge",
                      preProcess = c("center","scale"))
ridge_model

## Ridge Regression 
## 
## 144 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared    MAE      
##   0e+00   5.810175  0.09503788  1.7393161
##   1e-04   5.089325  0.10354023  1.6051963
##   1e-01   1.219323  0.37803380  0.7176247
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.

The optimal value of the performance metric is 0.1.

Using the same dataset, we can analyze which tree-based regression model gives the optimal resampling and test set performance.

imputation <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
predict_imputer <- predict(imputation, ChemicalManufacturingProcess)
near_zero <- nearZeroVar(predict_imputer)
part_a <- predict_imputer[,-near_zero]
training_rows <- createDataPartition(part_a$Yield, p = .80, list= FALSE)
train <- part_a[training_rows, ]
trainx <- train[,-1]
trainy <- train$Yield
test <- part_a[-training_rows, ]
testx <- test[,-1]
testy <- test$Yield

Random Forest

set.seed(999)
randomforest_grid <- expand.grid(mtry=seq(7,49,by=3))
randomforest_model <- train(trainx, trainy, method = "rf", 
                tuneGrid = randomforest_grid)
rf_imp <- varImp(randomforest_model, scale = FALSE) 
rf_imp <- rf_imp$importance %>% rownames_to_column("var") 
rf_imp <- rf_imp[order(-rf_imp$Overall), , drop=FALSE] %>% remove_rownames()
plot(randomforest_model)

head (rf_imp,15)

##                       var   Overall
## 1  ManufacturingProcess32 44.827177
## 2    BiologicalMaterial12  9.850315
## 3  ManufacturingProcess13  9.261473
## 4  ManufacturingProcess17  7.518551
## 5    BiologicalMaterial03  6.566614
## 6    BiologicalMaterial06  4.565992
## 7  ManufacturingProcess09  4.479845
## 8    BiologicalMaterial11  3.536727
## 9  ManufacturingProcess06  3.257252
## 10 ManufacturingProcess31  3.224230
## 11 ManufacturingProcess36  3.143898
## 12   BiologicalMaterial02  2.756447
## 13 ManufacturingProcess11  2.490006
## 14   BiologicalMaterial05  2.295996
## 15 ManufacturingProcess39  2.127252

Gradient Boost

gbm_grid <- expand.grid(.interaction.depth = seq(1, 7, by = 2), 
                       .n.trees = seq(100, 1000, by = 100), 
                       .shrinkage = c(0.005, 0.01, 0.1, 0.2),
                       .n.minobsinnode = seq(5, 16, by = 5))
gbm_train_model <- train(trainx, trainy, method = "gbm", 
                   tuneGrid = gbm_grid, verbose = FALSE)
gbm_imp <- data.frame(varImp(gbm_train_model)$importance) %>%
  rownames_to_column("var") 
gbm_imp <- gbm_imp[order(-gbm_imp$Overall), , drop=FALSE] %>%
  remove_rownames()
plot(gbm_train_model)

head (gbm_imp,15)

##                       var    Overall
## 1  ManufacturingProcess32 100.000000
## 2  ManufacturingProcess17  22.123570
## 3  ManufacturingProcess13  21.486489
## 4    BiologicalMaterial12  21.288606
## 5    BiologicalMaterial03  15.169493
## 6  ManufacturingProcess09  14.649199
## 7  ManufacturingProcess06  10.126854
## 8    BiologicalMaterial09   9.860545
## 9    BiologicalMaterial06   7.193767
## 10   BiologicalMaterial05   6.953206
## 11 ManufacturingProcess11   5.841824
## 12   BiologicalMaterial11   5.536637
## 13 ManufacturingProcess39   4.590900
## 14 ManufacturingProcess21   4.555823
## 15 ManufacturingProcess20   4.502685

Cubist

cubist_grid <- expand.grid(committees = c(1,5,10,20,40,80), neighbors = c(0,1,3,5,7,9))
cubist_model = train(trainx, trainy, method = "cubist", tuneGrid = cubist_grid) 
cubist_imp <- data.frame(varImp(cubist_model)$importance) %>%
  rownames_to_column("var")
cubist_imp <- cubist_imp[order(-cubist_imp$Overall), , drop=FALSE] %>%
  remove_rownames()
plot(cubist_model)

head (cubist_imp,15)

##                       var   Overall
## 1  ManufacturingProcess32 100.00000
## 2  ManufacturingProcess13  50.00000
## 3  ManufacturingProcess17  36.27451
## 4  ManufacturingProcess33  36.27451
## 5    BiologicalMaterial03  33.33333
## 6  ManufacturingProcess29  33.33333
## 7  ManufacturingProcess31  28.43137
## 8  ManufacturingProcess09  25.49020
## 9    BiologicalMaterial02  22.54902
## 10   BiologicalMaterial12  18.62745
## 11 ManufacturingProcess26  18.62745
## 12 ManufacturingProcess04  18.62745
## 13   BiologicalMaterial06  15.68627
## 14 ManufacturingProcess25  14.70588
## 15 ManufacturingProcess39  13.72549

Model Comparison

#use predict function for all three different models
rand_forest_train <- predict(randomforest_model)
gradient_boost_train <- predict(gbm_train_model)
cubist_train <- predict(cubist_model)
#combine results of three different models
model_compare <- data.frame(rbind(postResample(pred = cubist_train, obs = trainy), 
                         postResample(pred = gradient_boost_train, obs = trainy),
                         postResample(pred = rand_forest_train, obs = trainy)),
                   row.names =  c("Cubist","Gradient Boost", "Random Forest"))
model_compare

##                        RMSE  Rsquared          MAE
## Cubist         2.652103e-08 1.0000000 1.779814e-08
## Gradient Boost 1.051633e-01 0.9910530 7.405478e-02
## Random Forest  2.392494e-01 0.9619916 1.775683e-01

#test set performance
rand_forest_test <- predict(randomforest_model, newdata = testx)
gradient_boost_test <- predict(gbm_train_model, newdata = testx)
cubist_test <- predict(cubist_model, newdata = testx)
model_compare2 <- data.frame(rbind(postResample(pred = cubist_test, obs = testy), 
                         postResample(pred = gradient_boost_test, obs = testy),
                         postResample(pred = rand_forest_test, obs = testy)),
                   row.names =  c("Cubist","Gradient Boost", "Random Forest"))
model_compare2

##                     RMSE  Rsquared       MAE
## Cubist         0.5854631 0.6304609 0.3839526
## Gradient Boost 0.5834152 0.6110891 0.4499376
## Random Forest  0.6436170 0.5313606 0.4730455

cubist_model$bestTune

##    committees neighbors
## 32         80         1

The cubist model resulted in the lowest RMSE and MAE values for both the resampling and the test set predictions, as well as the highest $R^2$.

10 most important predictors in the optimal tree-based regression model

head(cubist_imp,10)

##                       var   Overall
## 1  ManufacturingProcess32 100.00000
## 2  ManufacturingProcess13  50.00000
## 3  ManufacturingProcess17  36.27451
## 4  ManufacturingProcess33  36.27451
## 5    BiologicalMaterial03  33.33333
## 6  ManufacturingProcess29  33.33333
## 7  ManufacturingProcess31  28.43137
## 8  ManufacturingProcess09  25.49020
## 9    BiologicalMaterial02  22.54902
## 10   BiologicalMaterial12  18.62745

Optimal single tree with the distribution of yield in the terminal nodes.

set.seed(456)
ctrl <- trainControl(method = "cv", number = 10)
tree_grid <- expand.grid(maxdepth= seq(1,15, by=1))
tree_model <- train(trainx, trainy, method = "rpart2", 
                  tuneGrid = tree_grid, trControl = ctrl)
plot(as.party(tree_model$finalModel), gp=gpar(fontsize=11))

Model Comparison

Friedman (1991) introduced several benchmark data sets create by simulation. One of these simulations used the following nonlinear equation to create data:

\[ y = 10\sin(\pi x_1x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N (0, \sigma ^2) \]

where the $x$ values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x) 
## Look at the data using
featurePlot(trainingData$x, trainingData$y)

## or other methods. 
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to 
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

We can tune several models on these data to see which gives the best performance.

neural_model <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = c(1:10),
                        .bag = FALSE)
ctrl <- trainControl(method = "cv")
avNNet_model <- train(trainingData$x, trainingData$y, 
                 method = "avNNet",
                 tuneGrid = neural_model,
                 trControl = ctrl,
                 preProcess = c("center", "scale"),
                 linout = TRUE,
                 trace = FALSE,
                 maxit = 500,
                 MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1)
avNNet_model

## Model Averaged Neural Network 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE      Rsquared   MAE     
##   0.00    1    2.439565  0.7644006  1.906904
##   0.00    2    2.538290  0.7448102  1.996572
##   0.00    3    2.097923  0.8270457  1.651204
##   0.00    4    2.003586  0.8428708  1.608316
##   0.00    5    2.495658  0.7821679  1.925544
##   0.00    6    2.498643  0.7589755  1.910032
##   0.00    7    3.695878  0.6121163  2.352241
##   0.00    8    6.282577  0.4593528  3.759895
##   0.00    9    4.973098  0.5538140  3.236094
##   0.00   10    2.703068  0.7437045  2.148617
##   0.01    1    2.433477  0.7660774  1.899358
##   0.01    2    2.488251  0.7557016  1.962511
##   0.01    3    2.087867  0.8302922  1.661617
##   0.01    4    2.057420  0.8371424  1.617592
##   0.01    5    2.036480  0.8378762  1.644359
##   0.01    6    2.202479  0.8113223  1.718370
##   0.01    7    2.362180  0.7969243  1.854537
##   0.01    8    2.468188  0.7717175  1.905309
##   0.01    9    2.442612  0.7632283  1.936275
##   0.01   10    2.465949  0.7629934  1.926444
##   0.10    1    2.440506  0.7640541  1.909513
##   0.10    2    2.563721  0.7398046  2.021249
##   0.10    3    2.144000  0.8186454  1.672726
##   0.10    4    2.112690  0.8281984  1.697981
##   0.10    5    2.075008  0.8319334  1.645113
##   0.10    6    2.189105  0.8143496  1.733985
##   0.10    7    2.323408  0.7912417  1.824610
##   0.10    8    2.359512  0.7870478  1.824359
##   0.10    9    2.371569  0.7894291  1.886362
##   0.10   10    2.396639  0.7779495  1.895462
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0 and bag = FALSE.

avNNet_predict <- predict(avNNet_model, newdata = testData$x)
avNNet_results <- postResample(pred = avNNet_predict, obs = testData$y)
avNNet_results

##      RMSE  Rsquared       MAE 
## 2.5093746 0.7778381 1.8429557

multi_variate <- expand.grid(.degree=1:2, .nprune=2:38)
mars_model <- train(x=trainingData$x, y=trainingData$y, 
                  method="earth",
                  preProcess = c("center", "scale"),
                  tuneGrid=multi_variate,
                  trControl = ctrl)
mars_model

## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE     
##   1        2      4.189975  0.2906552  3.462839
##   1        3      3.484994  0.5151055  2.764123
##   1        4      2.587964  0.7410329  2.101983
##   1        5      2.420658  0.7683094  1.985518
##   1        6      2.339561  0.7839667  1.871780
##   1        7      1.731624  0.8788241  1.423116
##   1        8      1.678080  0.8854956  1.311885
##   1        9      1.655259  0.8869872  1.291336
##   1       10      1.652845  0.8850859  1.299452
##   1       11      1.604286  0.8921908  1.256919
##   1       12      1.631147  0.8899063  1.266755
##   1       13      1.633440  0.8881519  1.270749
##   1       14      1.646108  0.8865326  1.279917
##   1       15      1.646108  0.8865326  1.279917
##   1       16      1.646108  0.8865326  1.279917
##   1       17      1.646108  0.8865326  1.279917
##   1       18      1.646108  0.8865326  1.279917
##   1       19      1.646108  0.8865326  1.279917
##   1       20      1.646108  0.8865326  1.279917
##   1       21      1.646108  0.8865326  1.279917
##   1       22      1.646108  0.8865326  1.279917
##   1       23      1.646108  0.8865326  1.279917
##   1       24      1.646108  0.8865326  1.279917
##   1       25      1.646108  0.8865326  1.279917
##   1       26      1.646108  0.8865326  1.279917
##   1       27      1.646108  0.8865326  1.279917
##   1       28      1.646108  0.8865326  1.279917
##   1       29      1.646108  0.8865326  1.279917
##   1       30      1.646108  0.8865326  1.279917
##   1       31      1.646108  0.8865326  1.279917
##   1       32      1.646108  0.8865326  1.279917
##   1       33      1.646108  0.8865326  1.279917
##   1       34      1.646108  0.8865326  1.279917
##   1       35      1.646108  0.8865326  1.279917
##   1       36      1.646108  0.8865326  1.279917
##   1       37      1.646108  0.8865326  1.279917
##   1       38      1.646108  0.8865326  1.279917
##   2        2      4.493713  0.1908055  3.729300
##   2        3      3.709145  0.4504905  2.987843
##   2        4      2.609735  0.7341463  2.103370
##   2        5      2.379342  0.7742456  1.931422
##   2        6      2.334586  0.7830818  1.849111
##   2        7      1.749014  0.8766222  1.427570
##   2        8      1.698294  0.8837462  1.400995
##   2        9      1.497818  0.9080943  1.207548
##   2       10      1.411723  0.9170883  1.129177
##   2       11      1.375422  0.9202096  1.096012
##   2       12      1.324813  0.9271017  1.054678
##   2       13      1.319186  0.9276385  1.043179
##   2       14      1.262835  0.9357576  1.000477
##   2       15      1.262417  0.9342643  1.001733
##   2       16      1.267577  0.9334989  1.014147
##   2       17      1.275619  0.9324509  1.016926
##   2       18      1.275619  0.9324509  1.016926
##   2       19      1.275619  0.9324509  1.016926
##   2       20      1.275619  0.9324509  1.016926
##   2       21      1.275619  0.9324509  1.016926
##   2       22      1.275619  0.9324509  1.016926
##   2       23      1.275619  0.9324509  1.016926
##   2       24      1.275619  0.9324509  1.016926
##   2       25      1.275619  0.9324509  1.016926
##   2       26      1.275619  0.9324509  1.016926
##   2       27      1.275619  0.9324509  1.016926
##   2       28      1.275619  0.9324509  1.016926
##   2       29      1.275619  0.9324509  1.016926
##   2       30      1.275619  0.9324509  1.016926
##   2       31      1.275619  0.9324509  1.016926
##   2       32      1.275619  0.9324509  1.016926
##   2       33      1.275619  0.9324509  1.016926
##   2       34      1.275619  0.9324509  1.016926
##   2       35      1.275619  0.9324509  1.016926
##   2       36      1.275619  0.9324509  1.016926
##   2       37      1.275619  0.9324509  1.016926
##   2       38      1.275619  0.9324509  1.016926
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 15 and degree = 2.

mars_predict <- predict(mars_model, newdata=testData$x)
mars_results <- postResample(pred=mars_predict, obs=testData$y)
mars_results

##      RMSE  Rsquared       MAE 
## 1.1908806 0.9428866 0.9496858

svm <- train(x=trainingData$x, y=trainingData$y, 
                  method="svmRadial", 
                  preProcess=c("center", "scale"), 
                  tuneLength=20)
svm

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   C          RMSE      Rsquared   MAE     
##        0.25  2.535413  0.7832510  2.010246
##        0.50  2.312221  0.7976533  1.821205
##        1.00  2.179041  0.8121920  1.705301
##        2.00  2.089542  0.8246834  1.630071
##        4.00  2.033124  0.8330785  1.576070
##        8.00  2.011953  0.8360991  1.561024
##       16.00  2.010544  0.8363445  1.562455
##       32.00  2.010544  0.8363445  1.562455
##       64.00  2.010544  0.8363445  1.562455
##      128.00  2.010544  0.8363445  1.562455
##      256.00  2.010544  0.8363445  1.562455
##      512.00  2.010544  0.8363445  1.562455
##     1024.00  2.010544  0.8363445  1.562455
##     2048.00  2.010544  0.8363445  1.562455
##     4096.00  2.010544  0.8363445  1.562455
##     8192.00  2.010544  0.8363445  1.562455
##    16384.00  2.010544  0.8363445  1.562455
##    32768.00  2.010544  0.8363445  1.562455
##    65536.00  2.010544  0.8363445  1.562455
##   131072.00  2.010544  0.8363445  1.562455
## 
## Tuning parameter 'sigma' was held constant at a value of 0.05869886
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.05869886 and C = 16.

svm_predict <- predict(svm, newdata=testData$x)
svm_results<- postResample(pred=svm_predict, obs=testData$y)
svm_results

##     RMSE Rsquared      MAE 
## 2.061539 0.827650 1.566485

model_comparison <- data.frame(rbind(avNNet_results, mars_results, svm_results))
#sort by RMSE
model_comparison[order(model_comparison$RMSE),] %>% kable %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
                full_width = F)

	RMSE	Rsquared	MAE
mars_results	1.190881	0.9428866	0.9496858
svm_results	2.061539	0.8276500	1.5664854
avNNet_results	2.509375	0.7778381	1.8429557

The Multivariate Adaptive Regression Splines (MARS) model has the best performance based on RMSE, $R^2$, MAE. In addition, the MARS model does select five informative predictors as seen below.

varImp(mars_model)

## earth variable importance
## 
##    Overall
## X1  100.00
## X4   75.31
## X2   48.86
## X5   15.61
## X3    0.00

Market Basket Analysis

Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket – and therefore ‘Market Basket Analysis’.

The Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item.

First, we will load the data.

#read csv from github path
groceryds <- read.transactions("https://raw.githubusercontent.com/aaronzalkisps/data621az/master/GroceryDataSet.csv", sep=",")
itemFrequencyPlot(groceryds, topN=10, type="absolute", main="Top 10 Items")

We can see that whole milk is the most common item. Next, we will use the apriori algorithm and display the top 15 rules with their support, confidence and lift.

apriori(groceryds, parameter=list(supp=0.001, conf=0.5) , control=list(verbose=FALSE)) %>%
  DATAFRAME() %>%
  arrange(desc(lift)) %>%
  top_n(10) %>%
  kable() %>%
  kable_styling()

LHS	RHS	support	confidence	coverage	lift	count
{root vegetables,tropical fruit}	{other vegetables}	0.0123030	0.5845411	0.0210473	3.020999	121
{rolls/buns,root vegetables}	{other vegetables}	0.0122013	0.5020921	0.0243010	2.594890	120
{root vegetables,yogurt}	{other vegetables}	0.0129131	0.5000000	0.0258261	2.584078	127
{root vegetables,yogurt}	{whole milk}	0.0145399	0.5629921	0.0258261	2.203354	143
{domestic eggs,other vegetables}	{whole milk}	0.0123030	0.5525114	0.0222674	2.162336	121
{rolls/buns,root vegetables}	{whole milk}	0.0127097	0.5230126	0.0243010	2.046888	125
{other vegetables,pip fruit}	{whole milk}	0.0135231	0.5175097	0.0261312	2.025351	133
{tropical fruit,yogurt}	{whole milk}	0.0151500	0.5173611	0.0292832	2.024770	149
{other vegetables,yogurt}	{whole milk}	0.0222674	0.5128806	0.0434164	2.007235	219
{other vegetables,whipped/sour cream}	{whole milk}	0.0146416	0.5070423	0.0288765	1.984385	144

Below, we can see the summary of the rules. There are 232 rules of association, most of which are based on 2 or 3 items.

rules <- apriori(groceryds, parameter = list(support = 0.01, confidence = 0.2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 98 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [232 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

summary(rules)

## set of 232 rules
## 
## rule length distribution (lhs + rhs):sizes
##   1   2   3 
##   1 151  80 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   2.000   2.341   3.000   3.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift       
##  Min.   :0.01007   Min.   :0.2006   Min.   :0.01729   Min.   :0.8991  
##  1st Qu.:0.01200   1st Qu.:0.2470   1st Qu.:0.03437   1st Qu.:1.4432  
##  Median :0.01490   Median :0.3170   Median :0.05241   Median :1.7277  
##  Mean   :0.02005   Mean   :0.3321   Mean   :0.06708   Mean   :1.7890  
##  3rd Qu.:0.02227   3rd Qu.:0.4033   3rd Qu.:0.07565   3rd Qu.:2.0762  
##  Max.   :0.25552   Max.   :0.5862   Max.   :1.00000   Max.   :3.2950  
##      count       
##  Min.   :  99.0  
##  1st Qu.: 118.0  
##  Median : 146.5  
##  Mean   : 197.2  
##  3rd Qu.: 219.0  
##  Max.   :2513.0  
## 
## mining info:
##       data ntransactions support confidence
##  groceryds          9835    0.01        0.2

We can see the top fifteen rules based on lift below. They are mostly associated with root vegetables as a consequent item.

inspect(head(rules, n = 15, by = "lift"))

##      lhs                                   rhs                  support   
## [1]  {citrus fruit,other vegetables}    => {root vegetables}    0.01037112
## [2]  {other vegetables,yogurt}          => {whipped/sour cream} 0.01016777
## [3]  {other vegetables,tropical fruit}  => {root vegetables}    0.01230300
## [4]  {beef}                             => {root vegetables}    0.01738688
## [5]  {citrus fruit,root vegetables}     => {other vegetables}   0.01037112
## [6]  {root vegetables,tropical fruit}   => {other vegetables}   0.01230300
## [7]  {other vegetables,whole milk}      => {root vegetables}    0.02318251
## [8]  {curd,whole milk}                  => {yogurt}             0.01006609
## [9]  {other vegetables,yogurt}          => {root vegetables}    0.01291307
## [10] {other vegetables,yogurt}          => {tropical fruit}     0.01230300
## [11] {other vegetables,root vegetables} => {citrus fruit}       0.01037112
## [12] {other vegetables,rolls/buns}      => {root vegetables}    0.01220132
## [13] {tropical fruit,whole milk}        => {root vegetables}    0.01199797
## [14] {rolls/buns,root vegetables}       => {other vegetables}   0.01220132
## [15] {root vegetables,yogurt}           => {other vegetables}   0.01291307
##      confidence coverage   lift     count
## [1]  0.3591549  0.02887646 3.295045 102  
## [2]  0.2341920  0.04341637 3.267062 100  
## [3]  0.3427762  0.03589222 3.144780 121  
## [4]  0.3313953  0.05246568 3.040367 171  
## [5]  0.5862069  0.01769192 3.029608 102  
## [6]  0.5845411  0.02104728 3.020999 121  
## [7]  0.3097826  0.07483477 2.842082 228  
## [8]  0.3852140  0.02613116 2.761356  99  
## [9]  0.2974239  0.04341637 2.728698 127  
## [10] 0.2833724  0.04341637 2.700550 121  
## [11] 0.2188841  0.04738180 2.644626 102  
## [12] 0.2863962  0.04260295 2.627525 120  
## [13] 0.2836538  0.04229792 2.602365 118  
## [14] 0.5020921  0.02430097 2.594890 120  
## [15] 0.5000000  0.02582613 2.584078 127

We can see the top fifteen rules based on confidence below. They are mostly associated with whole milk as a consquent item.

inspect(head(rules, n = 15, by = "confidence"))

##      lhs                                      rhs                support   
## [1]  {citrus fruit,root vegetables}        => {other vegetables} 0.01037112
## [2]  {root vegetables,tropical fruit}      => {other vegetables} 0.01230300
## [3]  {curd,yogurt}                         => {whole milk}       0.01006609
## [4]  {butter,other vegetables}             => {whole milk}       0.01148958
## [5]  {root vegetables,tropical fruit}      => {whole milk}       0.01199797
## [6]  {root vegetables,yogurt}              => {whole milk}       0.01453991
## [7]  {domestic eggs,other vegetables}      => {whole milk}       0.01230300
## [8]  {whipped/sour cream,yogurt}           => {whole milk}       0.01087951
## [9]  {rolls/buns,root vegetables}          => {whole milk}       0.01270971
## [10] {other vegetables,pip fruit}          => {whole milk}       0.01352313
## [11] {tropical fruit,yogurt}               => {whole milk}       0.01514997
## [12] {other vegetables,yogurt}             => {whole milk}       0.02226741
## [13] {other vegetables,whipped/sour cream} => {whole milk}       0.01464159
## [14] {rolls/buns,root vegetables}          => {other vegetables} 0.01220132
## [15] {root vegetables,yogurt}              => {other vegetables} 0.01291307
##      confidence coverage   lift     count
## [1]  0.5862069  0.01769192 3.029608 102  
## [2]  0.5845411  0.02104728 3.020999 121  
## [3]  0.5823529  0.01728521 2.279125  99  
## [4]  0.5736041  0.02003050 2.244885 113  
## [5]  0.5700483  0.02104728 2.230969 118  
## [6]  0.5629921  0.02582613 2.203354 143  
## [7]  0.5525114  0.02226741 2.162336 121  
## [8]  0.5245098  0.02074225 2.052747 107  
## [9]  0.5230126  0.02430097 2.046888 125  
## [10] 0.5175097  0.02613116 2.025351 133  
## [11] 0.5173611  0.02928317 2.024770 149  
## [12] 0.5128806  0.04341637 2.007235 219  
## [13] 0.5070423  0.02887646 1.984385 144  
## [14] 0.5020921  0.02430097 2.594890 120  
## [15] 0.5000000  0.02582613 2.584078 127

Below we will plot the rules:

plot(rules, jitter = 1)

head(rules, n = 10, by = "lift") %>% 
    plot(method = "graph", engine = "htmlwidget")

head(rules, n = 10, by = "confidence") %>% 
    plot(method = "graph", engine = "htmlwidget")

Textbook Reference

Forecasting: Principles and Practice, Hyndman, R.J., & Athanasopoulos, G. (2018)

Applied Predictive Modeling, Kuhn, M., & Johnson, K. (2013)