The purpose of this report is to build on previous work in the Assignment 01 report, namely by fitting a model suite to the wine dataset. Although this report is an extension of the Assignment 01 report, some relevant information is repeated herein. Both the dataset and the metadata (names) file are available for download here.
This report contains five sections:

1. Data Import & Prep
2. Data Quality Check
3. Exploratory Data Analysis
4. Model Build
5. Model Comparison
An appendix of relevant R code used in producing the report is included. The code is grouped by the same sections.
From the wine metadata file, the response variable is the class identifier, class. The response variable is a categorical or factor variable, with classes 1, 2, and 3. Each class corresponds to a different cultivar of wine. Other variables (or attributes) reflect different constituents found in each type of wine. The values of those variables are the results (quantities) of a chemical analysis for each observation.
The dimensions of the Wine dataset indicate there are 178 observations and 14 variables, including the response variable class. The variable class was recoded from integer() to factor(), while the variables magnesium and proline were recoded from integer() to numeric(). Lastly, the levels of class were recoded from 1, 2, and 3 to class_1, class_2, and class_3.
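The recoding step is short; a minimal sketch, assuming the raw wine data frame with class coded as the integers 1 through 3 (the full version appears in the appendix):

```r
# Recode the response to a labeled factor; recode two integer columns to numeric
wine$class     = factor(wine$class, levels = 1:3,
                        labels = c("class_1", "class_2", "class_3"))
wine$magnesium = as.numeric(wine$magnesium)
wine$proline   = as.numeric(wine$proline)
```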
The distribution of the response variable class is shown in Table 1 below. The levels of the response variable are not balanced (i.e. evenly distributed).
| Variable | Type | class_1 | class_2 | class_3 |
|---|---|---|---|---|
| class | Count | 59 | 71 | 48 |
| class | Percent | 33.15 | 39.89 | 26.97 |
Due to the unbalanced classes, the wine dataset was cloned and down sampled using stratified random sampling to achieve class balance. This dataset is referred to as wine.ds, where ds stands for down sampled. Table 2 below shows the distribution of the response variable class in wine.ds. There are 144 observations and 14 variables; the levels in the response variable class are balanced, each appearing \(1/3\) of the time.
| Variable | Type | class_1 | class_2 | class_3 |
|---|---|---|---|---|
| class | Count | 48 | 48 | 48 |
| class | Percent | 33.33 | 33.33 | 33.33 |
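For reference, a minimal sketch of the stratified down sampling step, using caret's downSample() on a toy response with the original wine class frequencies (the appendix applies the same call to the full wine data frame):

```r
library(caret)

# A toy response with the original wine class frequencies
y = factor(rep(c("class_1", "class_2", "class_3"), times = c(59, 71, 48)))
x = data.frame(noise = rnorm(length(y)))

# downSample() samples each level down to the size of the rarest level (48),
# returning a balanced data frame with the response renamed to "Class"
set.seed(123)
balanced = downSample(x = x, y = y)
table(balanced$Class)  # 48 / 48 / 48
```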
Summary statistics and discussion of missing values and outliers are not repeated here. Please reference the appendix for related code, or the Assignment 01 report for discussion.
After the initial data quality check, data are further examined to identify interesting information or detect interesting relationships. That process is known as Exploratory Data Analysis or EDA.
The type of EDA conducted depends on the statistical problem at hand: is it one of regression, or one of classification? The statistical problem faced with the wine dataset is one of classification.
The response variable, class, takes on three possible classes: class_1, class_2, or class_3. The appropriate EDA in this situation centers on interesting information or relationships by each of these classes, through both quantitative and qualitative means.
It is also important to understand what might not be useful. Scatterplots of each variable against the response are not useful here, because the response is categorical rather than continuous. However, boxplots and histograms by class can be useful, as can summary statistics by class and variable correlations by class.
Traditional EDA covers quantitative and qualitative means. For the wine dataset:
Traditional EDA - Quantitative included summary statistics across all classes, summary statistics by each class, and decile values of each numeric variable. Summary statistics for numeric variables included min, 1st quartile, median, mean, 3rd quartile, and max. Summary statistics for factor variables included counts by level.
Traditional EDA - Qualitative included side-by-side histograms for each variable by class, side-by-side boxplots for each variable by class, and correlation plots by each class. The latter was done as a crude means to see which variables are correlated with each other - and how that relationship differs - by class.
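As a flavor of the qualitative EDA, a minimal sketch of a side-by-side histogram conditioned on class, using lattice (attached with caret) and iris as a stand-in dataset; the appendix loops the same pattern over every numeric wine variable:

```r
library(lattice)

# One panel per class level; the report applies this to each wine variable
histogram(~ Sepal.Length | Species, data = iris,
          layout = c(3, 1), col = "beige", xlab = "Sepal.Length")
```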
Results from Traditional EDA are excluded here. Please reference the appendix for related code, or the Assignment 01 report for discussion.
Model-Based EDA is another way to glean information about relationships in the dataset. Naive models are used for this purpose, since the goal at this stage is not to build a highly accurate predictive model, but to uncover additional information.
Results are included here, as this expands on the work previously done in the Assignment 01 report. This section contains three parts: a naive decision tree model, a naive principal components analysis (PCA) model, and a naive linear discriminant analysis (LDA) model.
Each naive model is fit response-vs-all, i.e. the response class is modeled using all remaining variables (class ~ .). The first model of each pair uses the wine dataset with its original class frequencies, whereas the second uses the down sampled wine.ds dataset.
Both the PCA and LDA models can be used as a means of dimension reduction. Of particular interest is whether or not qualitative plots of the naive models show clear class separation. Though not guaranteed, clear class separation bodes well for constructing a model with high predictive accuracy.
A naive decision tree model was fit to the wine and wine.ds datasets. Both models were constructed using all variables in the dataset. Interesting information can be revealed from a naive decision tree model. In the tree plots below, the color of each square corresponds to a level in the response variable. Within the square:

- line 1 shows the most prevalent (predicted) class at that node;
- line 2 shows the proportion of observations belonging to each class at that node;
- line 3 shows the percentage of the dataset reaching that node.
In Figure 1 below, the first node is colored blue because roughly 40% of the rows in the wine dataset correspond to class_2 (line 2). The root node splits on the proline variable: values greater than or equal to 755 branch to the left, and values less than 755 branch to the right. On the left, class_1 is the most prevalent, representing 85% of the observations meeting this criterion (lines 1 and 2). In total, 38% of the wine dataset has a proline value greater than or equal to 755 (line 3).
Figure 1 shows the root node branch splitting on the proline variable, followed by flavanoids and OD280_OD315, and finally hue.
The same interpretation process may be used in Figure 2 below. Figure 2 uses the wine.ds dataset and tells a slightly different story from Figure 1. Here, the root node splits on the flavanoids variable, followed by proline and color_intensity.
A naive tree model can also be used as a proxy for variable importance. With the wine and wine.ds datasets, each naive tree model tells a slightly different story of variable importance.
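Beyond reading the plot, the importance scores rpart stores on the fitted object can be listed directly; a minimal sketch using rpart's bundled kyphosis data (the same idea applies to the wine trees):

```r
library(rpart)

# rpart accumulates the goodness-of-split attributed to each variable
# (including surrogate splits) into a named importance vector
fit = rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
fit$variable.importance
```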
A naive PCA model was fit to the wine and wine.ds datasets. Both models were constructed using all variables in the dataset. The models were fit using scaled variables.
The resulting biplot of the PCA model for the wine dataset is shown in Figure 3 below, which plots the first two principal components. Though the biplot() function in {stats} results in a rather cluttered plot, class separation can still be seen.
One benefit of a biplot is the ability to see the influence of the variable loadings on each of the first two principal components. For instance, in Figure 3 below, the loading vectors for alcohol and color_intensity point away from the other loadings and appear to have a large effect on the second principal component.
The resulting biplot of the PCA model for the wine.ds dataset is shown in Figure 4 below, which again plots the first two principal components. As in Figure 3, the loading vectors for alcohol and color_intensity stand apart from the others and appear to have a large effect on the second principal component.
Interestingly, Figure 4 appears to be close to a mirror-image of Figure 3. Despite the cluttered appearance of the biplots, both PCA models show class separation.
A naive LDA model was fit to the wine and wine.ds datasets. Both models were constructed using all variables in the dataset. The resulting graphics are shown in Figure 5 and Figure 6 below, which plot the two linear discriminants and illustrate class separation in both datasets.
Between the LDA and PCA models, the LDA model shows cleaner class separation. This is not surprising since LDA takes class information into account.
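The contrast is easy to reproduce on any labeled dataset; a minimal sketch using iris and only {stats} and {MASS}:

```r
library(MASS)

# PCA ignores class labels; LDA chooses directions that maximize
# between-class separation, so its projection separates classes more cleanly
pca     = prcomp(iris[, 1:4], scale. = TRUE)
lda.fit = lda(Species ~ ., data = iris)

par(mfrow = c(1, 2))
plot(pca$x[, 1:2], col = iris$Species, pch = 19, main = "PCA scores")
plot(predict(lda.fit)$x, col = iris$Species, pch = 19, main = "LDA scores")
```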
The class separation seen in the naive models from the Model-Based EDA section suggests that high predictive accuracy can be attained on both the wine and wine.ds datasets. Whether one dataset will lead to an edge in predictive accuracy remains to be seen.
This section contains three parts: random forest (RF) models, support vector machine (SVM) models, and neural network (NN) models.
Each model was fit using 10-fold cross-validation repeated 10 times (RCV). The RF models were also fit using out-of-bag (OOB) error and the tuneRF() function in {randomForest} to select a value of mtry.
Model fit was assessed on in-sample performance, with both accuracy and Kappa computed. For classification problems with unbalanced classes (such as the wine dataset), it is important to assess both accuracy and the Kappa statistic. From Applied Predictive Modeling (Kuhn & Johnson, 2013; 5th printing, 2016), “Kappa takes into account the accuracy that would be generated simply by chance… When the class distributions are equivalent, overall accuracy and Kappa are proportional.”
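To make the statistic concrete, Kappa can be recomputed by hand from any confusion matrix in this report; a minimal sketch using the SVM M1 counts reported later in Table 9 (the kappa.stat() helper here is ours, purely illustrative):

```r
# Cohen's Kappa: (observed accuracy - chance accuracy) / (1 - chance accuracy)
kappa.stat = function(cm){
  n   = sum(cm)
  p.o = sum(diag(cm)) / n                      # observed accuracy
  p.e = sum(rowSums(cm) * colSums(cm)) / n^2   # accuracy expected by chance
  (p.o - p.e) / (1 - p.e)
}

# Confusion matrix for SVM M1 (rows = prediction, columns = reference)
cm = matrix(c(59,  0,  0,
               0, 70,  0,
               0,  1, 48), nrow = 3, byrow = TRUE)
kappa.stat(cm)  # ~0.9915, matching the reported Kappa
```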
Lastly, models were fit first using the wine dataset with unbalanced classes. If 100% accuracy and Kappa were not achieved in-sample, then models were also fit to the wine.ds dataset with balanced classes. This was done with the notion that the more parsimonious model is preferred, with parsimony referring to the terms in the model as well as any other adjustments to the dataset from its original or raw form. Practically, this meant none of the RF models were fit to the wine.ds dataset.
A total of six RF models were fit. Each model is first grouped by the method used to determine the value of mtry: RCV (10-fold cross-validation repeated ten times), OOB (out-of-bag error), or tuneRF (the tuneRF() function in {randomForest}). Next, each model either did or did not use pre-processing. Here, pre-processing consisted of centering and scaling numeric variables to a mean of zero and standard deviation of one; a sketch of this transformation follows below.
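The centering and scaling correspond to caret's preProcess machinery; a minimal standalone sketch on iris (the report's models request the same transformation through train()'s preProcess argument):

```r
library(caret)

# Estimate centering/scaling parameters, then apply them
pp      = preProcess(iris[, 1:4], method = c("center", "scale"))
iris.pp = predict(pp, iris[, 1:4])

round(colMeans(iris.pp), 10)      # each column now has mean ~0
round(apply(iris.pp, 2, sd), 10)  # ...and standard deviation 1
```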
The first RF model (M1) used RCV and no pre-processing of the wine dataset. Two figures are included below. Figure 7 shows accuracy by three different values of mtry, while Figure 8 shows variable importance. The model used 500 trees and chose an mtry value of 2.
Table 3 below shows the confusion matrix for RF M1. Both accuracy and Kappa were 1.0.
| Prediction | class_1 | class_2 | class_3 |
|---|---|---|---|
| class_1 | 59 | 0 | 0 |
| class_2 | 0 | 71 | 0 |
| class_3 | 0 | 0 | 48 |
The second RF model (M2) used RCV and pre-processing of the wine dataset. Two figures are included below. Figure 9 shows accuracy by three different values of mtry, while Figure 10 shows variable importance. The model used 500 trees and chose an mtry value of 2.
Table 4 below shows the confusion matrix for RF M2. Both accuracy and Kappa were 1.0.
| Prediction | class_1 | class_2 | class_3 |
|---|---|---|---|
| class_1 | 59 | 0 | 0 |
| class_2 | 0 | 71 | 0 |
| class_3 | 0 | 0 | 48 |
The third RF model (M3) used OOB and no pre-processing of the wine dataset. Two figures are included below. Figure 11 shows accuracy by three different values of mtry, while Figure 12 shows variable importance. The model used 500 trees and chose an mtry value of 2. The OOB error rate during model fit was estimated at 1.69%.
Table 5 below shows the confusion matrix for RF M3. Both accuracy and Kappa were 1.0.
| Prediction | class_1 | class_2 | class_3 |
|---|---|---|---|
| class_1 | 59 | 0 | 0 |
| class_2 | 0 | 71 | 0 |
| class_3 | 0 | 0 | 48 |
The fourth RF model (M4) used OOB and pre-processing of the wine dataset. Two figures are included below. Figure 13 shows accuracy by three different values of mtry, while Figure 14 shows variable importance. The model used 500 trees and chose an mtry value of 2. The OOB error rate during model fit was estimated at 1.69%.
Table 6 below shows the confusion matrix for RF M4. Both accuracy and Kappa were 1.0.
| Prediction | class_1 | class_2 | class_3 |
|---|---|---|---|
| class_1 | 59 | 0 | 0 |
| class_2 | 0 | 71 | 0 |
| class_3 | 0 | 0 | 48 |
The fifth RF model (M5) used tuneRF to determine the value of mtry and no pre-processing of the wine dataset. The tuneRF parameters were set to try 500 trees, with a step factor of 1.5 (the factor by which to increase mtry at each step) and a minimum improvement of 0.01 (the improvement with the next value of mtry must be at least this much for the search to continue); a sketch of the tuneRF call appears after Table 7. Figure 15 shows variable importance using an mtry value of 3.
Table 7 below shows the confusion matrix for RF M5. Both accuracy and Kappa were 1.0.
| Prediction | class_1 | class_2 | class_3 |
|---|---|---|---|
| class_1 | 59 | 0 | 0 |
| class_2 | 0 | 71 | 0 |
| class_3 | 0 | 0 | 48 |
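For reference, a minimal self-contained sketch of the tuneRF search described above, run on iris instead of wine:

```r
library(randomForest)

# Starting from the default mtry, multiply by stepFactor = 1.5 each step and
# stop when OOB error improves by less than improve = 0.01
set.seed(123)
tuneRF(x = iris[, 1:4], y = iris$Species,
       ntreeTry = 500, stepFactor = 1.5, improve = 0.01,
       trace = TRUE, plot = FALSE)
```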
The sixth RF model (M6) used tuneRF to determine the value of mtry and pre-processing of the wine dataset. The tuneRF parameters match those of M5: 500 trees, a step factor of 1.5, and a minimum improvement of 0.01. Figure 16 shows variable importance using an mtry value of 3.
Table 8 below shows the confusion matrix for RF M6. Both accuracy and Kappa were 1.0.
| Prediction | class_1 | class_2 | class_3 |
|---|---|---|---|
| class_1 | 59 | 0 | 0 |
| class_2 | 0 | 71 | 0 |
| class_3 | 0 | 0 | 48 |
A total of four SVM models were fit. All four models used RCV (10-fold cross-validation repeated ten times). Each model is first grouped by whether the wine (original) or the wine.ds (down sampled) dataset was used, then by whether or not pre-processing was used. Here, pre-processing consisted of centering and scaling numeric variables to a mean of zero and standard deviation of one.
The first SVM model (M1) used RCV and no pre-processing of the wine dataset. Two figures are included below. Figure 17 shows accuracy for three different values of the cost parameter C, while Figure 18 shows variable importance by class. The model used a C value of 0.5 with 82 support vectors, yielding a training error of 0.0056.
Table 9 below shows the confusion matrix for SVM M1. The model yielded accuracy of 0.9944 and Kappa of 0.9915.
| Prediction | class_1 | class_2 | class_3 |
|---|---|---|---|
| class_1 | 59 | 0 | 0 |
| class_2 | 0 | 70 | 0 |
| class_3 | 0 | 1 | 48 |
The second SVM model (M2) used RCV and pre-processing of the wine dataset. Two figures are included below. Figure 19 shows accuracy for three different values of the cost parameter C, while Figure 20 shows variable importance by class. The model used a C value of 0.5 with 82 support vectors, yielding a training error of 0.0056.
Table 10 below shows the confusion matrix for SVM M2. The model yielded accuracy of 0.9944 and Kappa of 0.9915.
| Prediction | class_1 | class_2 | class_3 |
|---|---|---|---|
| class_1 | 59 | 0 | 0 |
| class_2 | 0 | 70 | 0 |
| class_3 | 0 | 1 | 48 |
The third SVM model (M3) used RCV and no pre-processing of the wine.ds dataset. Two figures are included below. Figure 21 shows accuracy for three different values of the cost parameter C, while Figure 22 shows variable importance by class. The model used a C value of 0.25 with 87 support vectors, yielding a training error of 0.0139.
Table 11 below shows the confusion matrix for SVM M3. The model yielded accuracy of 0.9888 and Kappa of 0.9829.
| Prediction | class_1 | class_2 | class_3 |
|---|---|---|---|
| class_1 | 58 | 0 | 0 |
| class_2 | 1 | 70 | 0 |
| class_3 | 0 | 1 | 48 |
The fourth SVM model (M4) used RCV and pre-processing of the wine.ds dataset. Two figures are included below. Figure 23 shows accuracy for three different values of the cost parameter C, while Figure 24 shows variable importance by class. The model used a C value of 0.25 with 87 support vectors, yielding a training error of 0.0139.
Table 12 below shows the confusion matrix for SVM M4. The model yielded accuracy of 0.9888 and Kappa of 0.9829.
| Prediction | class_1 | class_2 | class_3 |
|---|---|---|---|
| class_1 | 58 | 0 | 0 |
| class_2 | 1 | 70 | 0 |
| class_3 | 0 | 1 | 48 |
A total of four NN models were fit. All four models used RCV (10-fold cross-validation repeated ten times). Each model is first grouped by whether the wine (original) or the wine.ds (down sampled) dataset was used, then by whether or not pre-processing was used. Here, pre-processing consisted of centering and scaling numeric variables to a mean of zero and standard deviation of one.
The first NN model (M1) used RCV and no pre-processing of the wine dataset. Two figures are included below. Figure 25 shows accuracy for three different numbers of hidden units (1, 3, and 5) and two different values of weight decay (0.0 and 0.1), while Figure 26 shows variable importance by class. The model created a 13-5-3 network, where 13 is the number of input variables, 5 is the number of hidden units, and 3 is the number of output units (one per class). The model used 88 weights and a decay value of 0.1; a quick check of the weight count appears after Table 13.
Table 13 below shows the confusion matrix for NN M1. The model yielded accuracy of 0.9944 and Kappa of 0.9915.
| Prediction | class_1 | class_2 | class_3 |
|---|---|---|---|
| class_1 | 58 | 0 | 0 |
| class_2 | 1 | 71 | 0 |
| class_3 | 0 | 0 | 48 |
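The reported weight counts follow directly from the network dimensions: a fully connected single-hidden-layer network with bias terms has \((i + 1)h + (h + 1)o\) weights for \(i\) inputs, \(h\) hidden units, and \(o\) outputs. A quick check (the nn.weights() helper is ours, purely illustrative):

```r
# Weights in a fully connected i-h-o network with bias terms
nn.weights = function(i, h, o) (i + 1) * h + (h + 1) * o

nn.weights(13, 5, 3)  # 88, as reported for NN M1
nn.weights(13, 3, 3)  # 54, as reported for NN M2 through M4
```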
The second NN model (M2) used RCV and pre-processing of the wine dataset. Two figures are included below. Figure 27 shows accuracy for three different numbers of hidden units (1, 3, and 5) and two different values of weight decay (0.0 and 0.1), while Figure 28 shows variable importance by class. The model created a 13-3-3 network, where 13 is the number of input variables, 3 is the number of hidden units, and 3 is the number of output units (one per class). The model used 54 weights and a decay value of 0.1.
Table 14 below shows the confusion matrix for NN M2. The model yielded accuracy of 1.0 and Kappa of 1.0.
| Prediction | class_1 | class_2 | class_3 |
|---|---|---|---|
| class_1 | 59 | 0 | 0 |
| class_2 | 0 | 71 | 0 |
| class_3 | 0 | 0 | 48 |
The third NN model (M3) used RCV and no pre-processing of the wine.ds dataset. Two figures are included below. Figure 29 shows accuracy for three different numbers of hidden units (1, 3, and 5) and two different values of weight decay (0.0 and 0.1), while Figure 30 shows variable importance by class. The model created a 13-3-3 network, where 13 is the number of input variables, 3 is the number of hidden units, and 3 is the number of output units (one per class). The model used 54 weights and a decay value of 0.1.
Table 15 below shows the confusion matrix for NN M3. The model yielded accuracy of 0.9831 and Kappa of 0.9745.
| Prediction | class_1 | class_2 | class_3 |
|---|---|---|---|
| class_1 | 59 | 2 | 0 |
| class_2 | 0 | 68 | 0 |
| class_3 | 0 | 1 | 48 |
The fourth NN model (M4) used RCV and pre-processing of the wine.ds dataset. Two figures are included below. Figure 31 shows accuracy for three different numbers of hidden units (1, 3, and 5) and two different values of weight decay (0.0 and 0.1), while Figure 32 shows variable importance by class. The model created a 13-3-3 network, where 13 is the number of input variables, 3 is the number of hidden units, and 3 is the number of output units (one per class). The model used 54 weights and a decay value of 0.1.
Table 16 below shows the confusion matrix for NN M4. The model yielded accuracy of 0.9944 and Kappa of 0.9915.
| Prediction | class_1 | class_2 | class_3 |
|---|---|---|---|
| class_1 | 59 | 0 | 0 |
| class_2 | 0 | 70 | 0 |
| class_3 | 0 | 1 | 48 |
A total of fourteen models were fit: six RF, four SVM, and four NN. In addition to hyper-parameter tuning, models used 10-fold cross-validation repeated ten times. Models also alternated between using pre-processing and not, and using the down sampled dataset or not.
RF is the exception on both counts: in addition to RCV, the RF models also used OOB and tuneRF, and they were never fit to the down sampled wine.ds dataset, since maximum accuracy and Kappa were already achieved on the wine dataset.
Table 17 below is a summary of model performance across each type.
\begin{center} Table 17: Summary of Model Performance by Type \end{center}

| Model Type | Model Name | Pre-process | Down Sample | Method | Train: Accuracy | Train: Kappa |
|---|---|---|---|---|---|---|
| Random Forest | M1 | 0 | 0 | RCV | 1 | 1 |
| Random Forest | M2 | 1 | 0 | RCV | 1 | 1 |
| Random Forest | M3 | 0 | 0 | OOB | 1 | 1 |
| Random Forest | M4 | 1 | 0 | OOB | 1 | 1 |
| Random Forest | M5 | 0 | 0 | tuneRF | 1 | 1 |
| Random Forest | M6 | 1 | 0 | tuneRF | 1 | 1 |
| SVM | M1 | 0 | 0 | RCV | 0.9944 | 0.9915 |
| SVM | M2 | 1 | 0 | RCV | 0.9944 | 0.9915 |
| SVM | M3 | 0 | 1 | RCV | 0.9888 | 0.9829 |
| SVM | M4 | 1 | 1 | RCV | 0.9888 | 0.9829 |
| Neural Net | M1 | 0 | 0 | RCV | 0.9944 | 0.9915 |
| Neural Net | M2 | 1 | 0 | RCV | 1 | 1 |
| Neural Net | M3 | 0 | 1 | RCV | 0.9831 | 0.9745 |
| Neural Net | M4 | 1 | 1 | RCV | 0.9944 | 0.9915 |
There are a few interesting points to make:
- RF did the best overall, regardless of pre-processing or method.
- SVM achieved better performance when the down sampled dataset was not used, but pre-processing did not make a difference in performance.
- NN was the most sensitive to pre-processing and to whether or not the down sampled dataset was used, achieving maximum accuracy and Kappa only when using pre-processing without the down sampled dataset.
- NN also saw the biggest difference in accuracy plots: when the down sampled dataset was not used, accuracy was low with a weight decay of 0.0 and high with a weight decay of 0.1; when the down sampled dataset was used, accuracy started close to the maximum when 3 or 5 hidden units were used.

This exercise was a lot of fun and a good learning experience. Of all the models, I found neural nets to be the most interesting. I would like to go back and do more in-depth hyper-parameter tuning on the NN and SVM models. I also find that I learn the most by making truth tables of possible model combinations and seeing what each lever does when pulled (i.e. pre-processing, down sampling, or changing methods); a sketch of such a truth table closes this section.
The RF models continue to dominate in performance, with all models achieving maximum accuracy and Kappa. The only model to match this was NN M2, which used pre-processed but not down sampled data.
Finally, it was also interesting to compare the variable importance plots for each model against the naive decision tree built on the same dataset (wine or wine.ds) and to see which variables did (or did not) match.
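As an example of such a truth table, expand.grid() can enumerate every combination of the levers pulled in this report (note that not every combination was actually fit; RF was never run on wine.ds):

```r
# Enumerate every combination of the model "levers" in this report
expand.grid(pre.process = c(FALSE, TRUE),
            down.sample = c(FALSE, TRUE),
            method      = c("RCV", "OOB", "tuneRF"))
```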
#==============================================================================
# Environment Prep
#==============================================================================
# Clear workspace
rm(list=ls())
# Load packages
library(caret)
library(corrplot)  # needed for the correlation plots in the EDA section
library(MASS)
library(pander)
library(randomForest)
library(rattle)
library(rpart)
# Set code width to 60 to contain within PDF margins
knitr::opts_chunk$set(tidy = T, tidy.opts = list(width.cutoff = 60))
# Set all figures to be centered
knitr::opts_chunk$set(fig.align = "center")
#------------------------------------------------------------------------------
# Functions
#------------------------------------------------------------------------------
#--------------------------------------
# GitHub
#--------------------------------------
# Create function to source functions from GitHub
source.GitHub = function(url){
  require(RCurl)
  sapply(url, function(x){
    eval(parse(text = getURL(x, followlocation = T,
                             cainfo = system.file("CurlSSL", "cacert.pem",
                                                  package = "RCurl"))),
         envir = .GlobalEnv)
  })
}
# Assign URL and source functions
url = "http://bit.ly/1T6LhBJ"
source.GitHub(url); rm(url)
#==============================================================================
# Data Import & Prep
#==============================================================================
# Read data
wine = read.csv("wine.data", header = F)
# Assign column names
colnames(wine) = c("class",
"alcohol",
"malic_acid",
"ash",
"ash_alcalinity",
"magnesium",
"phenols_total",
"flavanoids",
"phenols_nonflavanoid",
"proanthocyanins",
"color_intensity",
"hue",
"OD280_OD315",
"proline")
# Check variable classes and head
str(wine)
# Recode integers to numeric
wine$magnesium = as.numeric(wine$magnesium)
wine$proline = as.numeric(wine$proline)
# Recode wine$class as factor
wine$class = as.factor(wine$class)
# Rename wine classes
levels(wine$class) = c("class_1", "class_2", "class_3")
#==============================================================================
# Data Quality Check
#==============================================================================
# Check variable classes and head
str(wine)
# Summary statistics
summary(wine)
# Class frequencies
fac.freq(wine$class, cat = F)
# Down sample to balance classes
set.seed(123)
wine.ds = downSample(x = wine[, -1],
y = wine[, 1])
# Rename class variable
colnames(wine.ds)[colnames(wine.ds)=="Class"] = "class"
# Check class frequencies
fac.freq(wine.ds$class, cat = F)
#==============================================================================
# Exploratory Data Analysis
#==============================================================================
#------------------------------------------------------------------------------
# Traditional EDA - Quantitative
#------------------------------------------------------------------------------
# Summary statistics by class
by(wine, wine$class, summary)
#------------------------------------------------------------------------------
# Traditional EDA - Qualitative
#------------------------------------------------------------------------------
# Create list of numeric variables
cn.num = colnames(wine[, !sapply(wine, is.factor)])
#--------------------------------------
# Histograms by class
#--------------------------------------
for (i in cn.num){
temp = histogram(~ wine[, i] | wine[, "class"],
data = wine, layout = c(3, 1), col = "beige",
xlab = paste(i))
print(temp)
rm(temp); rm(i)
}
#--------------------------------------
# Boxplots by class
#--------------------------------------
for (i in cn.num){
settings = list(box.rectangle = list(col = "black", fill = "beige"),
box.umbrella = list(col = "black"),
plot.symbol = list(col = "black"))
temp = bwplot(~ wine[, i] | wine[, "class"],
data = wine, layout = c(3, 1), par.settings = settings,
xlab = paste(i))
print(temp)
rm(settings); rm(temp); rm(i)
}
#--------------------------------------
# Correlation by class
#--------------------------------------
for (lvl in unique(wine$class)){
corrplot(cor(wine[wine$class == lvl, cn.num]),
tl.col = "black", tl.cex = 0.8, tl.srt = 45)
rm(lvl)
}
#------------------------------------------------------------------------------
# Model-Based EDA
#------------------------------------------------------------------------------
#--------------------------------------
# Decision tree
#--------------------------------------
# Model 1 - Original
fancyRpartPlot(rpart(class ~ ., data = wine), sub = "")
# Model 2 - Down Sampled
fancyRpartPlot(rpart(class ~ ., data = wine.ds), sub = "")
#--------------------------------------
# Principal Components Analysis
#--------------------------------------
# Model 1 - Original
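# Note: class is temporarily recoded to numeric so it enters the PCA as a
# 14th (scaled) variable and can label the biplot points via xlabs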
wine$class = as.numeric(wine$class)
wine.pcr.m1 = prcomp(wine, scale = T)
biplot(wine.pcr.m1, xlabs = wine[, "class"])
wine$class = as.factor(wine$class)
levels(wine$class) = c("class_1", "class_2", "class_3")
# Model 2 - Down Sampled
wine.ds$class = as.numeric(wine.ds$class)
wine.pcr.m2 = prcomp(wine.ds, scale = T)
biplot(wine.pcr.m2, xlabs = wine.ds[, "class"])
wine.ds$class = as.factor(wine.ds$class)
levels(wine.ds$class) = c("class_1", "class_2", "class_3")
#--------------------------------------
# Linear Discriminant Analysis
#--------------------------------------
# Model 1 - Original
wine.lda.m1 = lda(class ~ ., data = wine)
levels(wine$class) = c("1", "2", "3")
plot(wine.lda.m1)
levels(wine$class) = c("class_1", "class_2", "class_3")
# Model 2 - Down Sampled
wine.lda.m2 = lda(class ~ ., data = wine.ds)
levels(wine.ds$class) = c("1", "2", "3")
plot(wine.lda.m2)
levels(wine.ds$class) = c("class_1", "class_2", "class_3")
#==============================================================================
# Model Build
#==============================================================================
#------------------------------------------------------------------------------
# Random Forest
#------------------------------------------------------------------------------
#--------------------------------------
# Model 1 | preProcess = F | balanced = F | method = rcv
#--------------------------------------
# Specify fit parameters
wine.rf.m1.fc = trainControl(method = "repeatedcv",
number = 10,
repeats = 10,
classProbs = T)
# Build model
ptm = proc.time()
set.seed(123)
wine.rf.m1 = train(x = wine[, -1],
y = wine[, 1],
method = "rf",
trControl = wine.rf.m1.fc)
proc.time() - ptm; rm(ptm)
# In-sample summary
wine.rf.m1$finalModel
# In-sample fit
wine.rf.m1.trn.pred = predict(wine.rf.m1, newdata = wine[, -1])
wine.rf.m1.trn.cm = confusionMatrix(wine.rf.m1.trn.pred, wine$class)
wine.rf.m1.trn.cm$table
wine.rf.m1.trn.cm$overall[1:2]
# Plots
plot(wine.rf.m1, main = "RCV: wine.rf.m1")
plot(varImp(wine.rf.m1), main = "Var Imp: wine.rf.m1")
#--------------------------------------
# Model 2 | preProcess = T | balanced = F | method = rcv
#--------------------------------------
# Specify fit parameters
wine.rf.m2.fc = trainControl(method = "repeatedcv",
number = 10,
repeats = 10,
classProbs = T)
# Build model
ptm = proc.time()
set.seed(123)
wine.rf.m2 = train(x = wine[, -1],
y = wine[, 1],
method = "rf",
preProcess = c("center", "scale"),
trControl = wine.rf.m2.fc)
proc.time() - ptm; rm(ptm)
# In-sample summary
wine.rf.m2$finalModel
# In-sample fit
wine.rf.m2.trn.pred = predict(wine.rf.m2, newdata = wine[, -1])
wine.rf.m2.trn.cm = confusionMatrix(wine.rf.m2.trn.pred, wine$class)
wine.rf.m2.trn.cm$table
wine.rf.m2.trn.cm$overall[1:2]
# Plots
plot(wine.rf.m2, main = "RCV: wine.rf.m2")
plot(varImp(wine.rf.m2), main = "Var Imp: wine.rf.m2")
#--------------------------------------
# Model 3 | preProcess = F | balanced = F | method = oob
#--------------------------------------
# Specify fit parameters
wine.rf.m3.fc= trainControl(method = "oob",
classProbs = T)
# Build model
ptm = proc.time()
set.seed(123)
wine.rf.m3 = train(x = wine[, -1],
y = wine[, 1],
method = "rf",
trControl = wine.rf.m3.fc)
proc.time() - ptm; rm(ptm)
# In-sample summary
wine.rf.m3$finalModel
# In-sample fit
wine.rf.m3.trn.pred = predict(wine.rf.m3, newdata = wine[, -1])
wine.rf.m3.trn.cm = confusionMatrix(wine.rf.m3.trn.pred, wine$class)
wine.rf.m3.trn.cm$table
wine.rf.m3.trn.cm$overall[1:2]
# Plots
plot(wine.rf.m3, main = "OOB: wine.rf.m3")
plot(varImp(wine.rf.m3), main = "Var Imp: wine.rf.m3")
#--------------------------------------
# Model 4 | preProcess = T | balanced = F | method = oob
#--------------------------------------
# Specify fit parameters
wine.rf.m4.fc = trainControl(method = "oob",
classProbs = T)
# Build model
ptm = proc.time()
set.seed(123)
wine.rf.m4 = train(x = wine[, -1],
y = wine[, 1],
method = "rf",
preProcess = c("center", "scale"),
trControl = wine.rf.m4.fc)
proc.time() - ptm; rm(ptm)
# In-sample summary
wine.rf.m4$finalModel
# In-sample fit
wine.rf.m4.trn.pred = predict(wine.rf.m4, newdata = wine[, -1])
wine.rf.m4.trn.cm = confusionMatrix(wine.rf.m4.trn.pred, wine$class)
wine.rf.m4.trn.cm$table
wine.rf.m4.trn.cm$overall[1:2]
# Plots
plot(wine.rf.m4, main = "OOB: wine.rf.m4")
plot(varImp(wine.rf.m4), main = "Var Imp: wine.rf.m4")
#--------------------------------------
# Model 5 | preProcess = F | balanced = F | method = tuneRF
#--------------------------------------
# Tune value of mtry
set.seed(123)
wine.rf.m5.tune = tuneRF(x = wine[, -1],
y = wine[, 1],
ntreeTry = 500,
stepFactor = 1.5,
improve = 0.01,
trace = T,
plot = T)
wine.rf.m5.fc = trainControl(method = "none")
wine.rf.m5.grid = data.frame(mtry = 3)
# Build model
ptm = proc.time()
set.seed(123)
wine.rf.m5 = train(x = wine[, -1],
y = wine[, 1],
method = "rf",
tuneGrid = wine.rf.m5.grid,
trControl = wine.rf.m5.fc)
proc.time() - ptm; rm(ptm)
# In-sample summary
wine.rf.m5$finalModel
# In-sample fit
wine.rf.m5.trn.pred = predict(wine.rf.m5, newdata = wine[, -1])
wine.rf.m5.trn.cm = confusionMatrix(wine.rf.m5.trn.pred, wine$class)
wine.rf.m5.trn.cm$table
wine.rf.m5.trn.cm$overall[1:2]
# Plots
plot(varImp(wine.rf.m5), main = "Var Imp: wine.rf.m5")
#--------------------------------------
# Model 6 | preProcess = T | balanced = F | method = tuneRF
#--------------------------------------
# Tune value of mtry
set.seed(123)
wine.rf.m6.tune = tuneRF(x = wine[, -1],
y = wine[, 1],
ntreeTry = 500,
stepFactor = 1.5,
improve = 0.01,
trace = T,
plot = T)
wine.rf.m6.fc = trainControl(method = "none")
wine.rf.m6.grid = data.frame(mtry = 3)
# Build model
ptm = proc.time()
set.seed(123)
wine.rf.m6 = train(x = wine[, -1],
y = wine[, 1],
method = "rf",
preProcess = c("center", "scale"),
tuneGrid = wine.rf.m6.grid,
trControl = wine.rf.m6.fc)
proc.time() - ptm; rm(ptm)
# In-sample summary
wine.rf.m6$finalModel
# In-sample fit
wine.rf.m6.trn.pred = predict(wine.rf.m6, newdata = wine[, -1])
wine.rf.m6.trn.cm = confusionMatrix(wine.rf.m6.trn.pred, wine$class)
wine.rf.m6.trn.cm$table
wine.rf.m6.trn.cm$overall[1:2]
# Plots
plot(varImp(wine.rf.m6), main = "Var Imp: wine.rf.m6")
#------------------------------------------------------------------------------
# Support Vector Machine
#------------------------------------------------------------------------------
#--------------------------------------
# Model 1 | preProcess = F | balanced = F | method = rcv
#--------------------------------------
# Specify fit parameters
wine.svm.m1.fc = trainControl(method = "repeatedcv",
number = 10,
repeats = 10,
classProbs = T)
# Build model
ptm = proc.time()
set.seed(123)
wine.svm.m1 = train(x = wine[, -1],
y = wine[, 1],
method = "svmRadialWeights",
trControl = wine.svm.m1.fc)
proc.time() - ptm; rm(ptm)
# In-sample summary
wine.svm.m1$finalModel
# In-sample fit
wine.svm.m1.trn.pred = predict(wine.svm.m1, newdata = wine[, -1])
wine.svm.m1.trn.cm = confusionMatrix(wine.svm.m1.trn.pred, wine$class)
wine.svm.m1.trn.cm$table
wine.svm.m1.trn.cm$overall[1:2]
# Plots
plot(wine.svm.m1, main = "RCV: wine.svm.m1")
plot(varImp(wine.svm.m1), main = "Var Imp: wine.svm.m1")
#--------------------------------------
# Model 2 | preProcess = T | balanced = F | method = rcv
#--------------------------------------
# Specify fit parameters
wine.svm.m2.fc = trainControl(method = "repeatedcv",
number = 10,
repeats = 10,
classProbs = T)
# Build model
ptm = proc.time()
set.seed(123)
wine.svm.m2 = train(x = wine[, -1],
y = wine[, 1],
method = "svmRadialWeights",
preProcess = c("center", "scale"),
trControl = wine.svm.m2.fc)
proc.time() - ptm; rm(ptm)
# In-sample summary
wine.svm.m2$finalModel
# In-sample fit
wine.svm.m2.trn.pred = predict(wine.svm.m2, newdata = wine[, -1])
wine.svm.m2.trn.cm = confusionMatrix(wine.svm.m2.trn.pred, wine$class)
wine.svm.m2.trn.cm$table
wine.svm.m2.trn.cm$overall[1:2]
# Plots
plot(wine.svm.m2, main = "RCV: wine.svm.m2")
plot(varImp(wine.svm.m2), main = "Var Imp: wine.svm.m2")
#--------------------------------------
# Model 3 | preProcess = F | balanced = T | method = rcv
#--------------------------------------
# Specify fit parameters
wine.svm.m3.fc = trainControl(method = "repeatedcv",
number = 10,
repeats = 10,
classProbs = T)
# Build model
ptm = proc.time()
set.seed(123)
wine.svm.m3 = train(x = wine.ds[, -14],
y = wine.ds[, 14],
method = "svmRadialWeights",
trControl = wine.svm.m3.fc)
proc.time() - ptm; rm(ptm)
# In-sample summary
wine.svm.m3$finalModel
# Fit assessment: model trained on wine.ds, evaluated on the full wine dataset
wine.svm.m3.trn.pred = predict(wine.svm.m3, newdata = wine[, -1])
wine.svm.m3.trn.cm = confusionMatrix(wine.svm.m3.trn.pred, wine$class)
wine.svm.m3.trn.cm$table
wine.svm.m3.trn.cm$overall[1:2]
# Plots
plot(wine.svm.m3, main = "RCV: wine.svm.m3")
plot(varImp(wine.svm.m3), main = "Var Imp: wine.svm.m3")
#--------------------------------------
# Model 4 | preProcess = T | balanced = T | method = rcv
#--------------------------------------
# Specify fit parameters
wine.svm.m4.fc = trainControl(method = "repeatedcv",
number = 10,
repeats = 10,
classProbs = T)
# Build model
ptm = proc.time()
set.seed(123)
wine.svm.m4 = train(x = wine.ds[, -14],
y = wine.ds[, 14],
method = "svmRadialWeights",
preProcess = c("center", "scale"),
trControl = wine.svm.m4.fc)
proc.time() - ptm; rm(ptm)
# In-sample summary
wine.svm.m4$finalModel
# Fit assessment: model trained on wine.ds, evaluated on the full wine dataset
wine.svm.m4.trn.pred = predict(wine.svm.m4, newdata = wine[, -1])
wine.svm.m4.trn.cm = confusionMatrix(wine.svm.m4.trn.pred, wine$class)
wine.svm.m4.trn.cm$table
wine.svm.m4.trn.cm$overall[1:2]
# Plots
plot(wine.svm.m4, main = "RCV: wine.svm.m4")
plot(varImp(wine.svm.m4), main = "Var Imp: wine.svm.m4")
#------------------------------------------------------------------------------
# Neural Net
#------------------------------------------------------------------------------
#--------------------------------------
# Model 1 | preProcess = F | balanced = F | method = rcv
#--------------------------------------
# Specify fit parameters
wine.nn.m1.fc = trainControl(method = "repeatedcv",
number = 10,
repeats = 10,
classProbs = T)
# Build model
ptm = proc.time()
set.seed(123)
wine.nn.m1 = train(x = wine[, -1],
y = wine[, 1],
method = "nnet",
trace = F,
trControl = wine.nn.m1.fc)
proc.time() - ptm; rm(ptm)
# In-sample summary
wine.nn.m1$finalModel
# In-sample fit
wine.nn.m1.trn.pred = predict(wine.nn.m1, newdata = wine[, -1])
wine.nn.m1.trn.cm = confusionMatrix(wine.nn.m1.trn.pred, wine$class)
wine.nn.m1.trn.cm$table
wine.nn.m1.trn.cm$overall[1:2]
# Plots
plot(wine.nn.m1, main = "RCV: wine.nn.m1")
wine.nn.m1.vi = varImp(wine.nn.m1)
wine.nn.m1.vi$importance = as.data.frame(wine.nn.m1.vi$importance)[, -1]
plot(wine.nn.m1.vi, main = "Var Imp: wine.nn.m1")
#--------------------------------------
# Model 2 | preProcess = T | balanced = F | method = rcv
#--------------------------------------
# Specify fit parameters
wine.nn.m2.fc = trainControl(method = "repeatedcv",
number = 10,
repeats = 10,
classProbs = T)
# Build model
ptm = proc.time()
set.seed(123)
wine.nn.m2 = train(x = wine[, -1],
y = wine[, 1],
method = "nnet",
trace = F,
preProcess = c("center", "scale"),
trControl = wine.nn.m2.fc)
proc.time() - ptm; rm(ptm)
# In-sample summary
wine.nn.m2$finalModel
# In-sample fit
wine.nn.m2.trn.pred = predict(wine.nn.m2, newdata = wine[, -1])
wine.nn.m2.trn.cm = confusionMatrix(wine.nn.m2.trn.pred, wine$class)
wine.nn.m2.trn.cm$table
wine.nn.m2.trn.cm$overall[1:2]
# Plots
plot(wine.nn.m2, main = "RCV: wine.nn.m2")
wine.nn.m2.vi = varImp(wine.nn.m2)
wine.nn.m2.vi$importance = as.data.frame(wine.nn.m2.vi$importance)[, -1]
plot(wine.nn.m2.vi, main = "Var Imp: wine.nn.m2")
#--------------------------------------
# Model 3 | preProcess = F | balanced = T | method = rcv
#--------------------------------------
# Specify fit parameters
wine.nn.m3.fc = trainControl(method = "repeatedcv",
number = 10,
repeats = 10,
classProbs = T)
# Build model
ptm = proc.time()
set.seed(123)
wine.nn.m3 = train(x = wine.ds[, -14],
y = wine.ds[, 14],
method = "nnet",
trace = F,
trControl = wine.nn.m3.fc)
proc.time() - ptm; rm(ptm)
# In-sample summary
wine.nn.m3$finalModel
# Fit assessment: model trained on wine.ds, evaluated on the full wine dataset
wine.nn.m3.trn.pred = predict(wine.nn.m3, newdata = wine[, -1])
wine.nn.m3.trn.cm = confusionMatrix(wine.nn.m3.trn.pred, wine$class)
wine.nn.m3.trn.cm$table
wine.nn.m3.trn.cm$overall[1:2]
# Plots
plot(wine.nn.m3, main = "RCV: wine.nn.m3")
wine.nn.m3.vi = varImp(wine.nn.m3)
wine.nn.m3.vi$importance = as.data.frame(wine.nn.m3.vi$importance)[, -1]
plot(wine.nn.m3.vi, main = "Var Imp: wine.nn.m3")
#--------------------------------------
# Model 4 | preProcess = T | balanced = T | method = rcv
#--------------------------------------
# Specify fit parameters
wine.nn.m4.fc = trainControl(method = "repeatedcv",
number = 10,
repeats = 10,
classProbs = T)
# Build model
ptm = proc.time()
set.seed(123)
wine.nn.m4 = train(x = wine.ds[, -14],
y = wine.ds[, 14],
method = "nnet",
trace = F,
preProcess = c("center", "scale"),
trControl = wine.nn.m4.fc)
proc.time() - ptm; rm(ptm)
# In-sample summary
wine.nn.m4$finalModel
# Fit assessment: model trained on wine.ds, evaluated on the full wine dataset
wine.nn.m4.trn.pred = predict(wine.nn.m4, newdata = wine[, -1])
wine.nn.m4.trn.cm = confusionMatrix(wine.nn.m4.trn.pred, wine$class)
wine.nn.m4.trn.cm$table
wine.nn.m4.trn.cm$overall[1:2]
# Plots
plot(wine.nn.m4, main = "RCV: wine.nn.m4")
wine.nn.m4.vi = varImp(wine.nn.m4)
wine.nn.m4.vi$importance = as.data.frame(wine.nn.m4.vi$importance)[, -1]
plot(wine.nn.m4.vi, main = "Var Imp: wine.nn.m4")
#==============================================================================
# Model Comparison
#==============================================================================
#--------------------------------------
# Table Results
#--------------------------------------
# Model Types
model.types = cbind(c(rep("Random Forest", each = 6),
rep(c("SVM", "Neural Net"), each = 4)))
# Model Names
model.reps = c("M1", "M2", "M3", "M4")
model.names = cbind(c(model.reps,
"M5",
"M6",
rep(model.reps, times = 2)))
# Pre-process
model.pp = cbind(rep(c("0", "1"), times = 7))
# Down Sample
model.ds = cbind(c(rep("0", each = 6),
rep(c("0", "1"), each = 2, times = 2)))
# Method
model.meth = cbind(c(rep("RCV", each = 2),
rep("OOB", each = 2),
rep("tuneRF", each = 2),
rep("RCV", each = 8)))
# Accuracy, Train
model.trn.acc = rbind(wine.rf.m1.trn.cm$overall[1],
wine.rf.m2.trn.cm$overall[1],
wine.rf.m3.trn.cm$overall[1],
wine.rf.m4.trn.cm$overall[1],
wine.rf.m5.trn.cm$overall[1],
wine.rf.m6.trn.cm$overall[1],
wine.svm.m1.trn.cm$overall[1],
wine.svm.m2.trn.cm$overall[1],
wine.svm.m3.trn.cm$overall[1],
wine.svm.m4.trn.cm$overall[1],
wine.nn.m1.trn.cm$overall[1],
wine.nn.m2.trn.cm$overall[1],
wine.nn.m3.trn.cm$overall[1],
wine.nn.m4.trn.cm$overall[1])
# Kappa, Train
model.trn.kpp = rbind(wine.rf.m1.trn.cm$overall[2],
wine.rf.m2.trn.cm$overall[2],
wine.rf.m3.trn.cm$overall[2],
wine.rf.m4.trn.cm$overall[2],
wine.rf.m5.trn.cm$overall[2],
wine.rf.m6.trn.cm$overall[2],
wine.svm.m1.trn.cm$overall[2],
wine.svm.m2.trn.cm$overall[2],
wine.svm.m3.trn.cm$overall[2],
wine.svm.m4.trn.cm$overall[2],
wine.nn.m1.trn.cm$overall[2],
wine.nn.m2.trn.cm$overall[2],
wine.nn.m3.trn.cm$overall[2],
wine.nn.m4.trn.cm$overall[2])
# Data Frame
model.comp = data.frame(model.types,
model.names,
model.pp,
model.ds,
model.meth,
model.trn.acc,
model.trn.kpp)
rownames(model.comp) = 1:nrow(model.comp)
colnames(model.comp) = c("Model Type",
"Model Name",
"Pre-process",
"Down Sample",
"Method",
"Train: Accuracy",
"Train: Kappa")
rm(list = ls(pattern = "model"))
# FIN
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10240)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] nnet_7.3-12 kernlab_0.9-24 RCurl_1.95-4.8
## [4] bitops_1.0-6 rpart_4.1-10 rattle_4.1.0
## [7] randomForest_4.6-12 pander_0.6.0 MASS_7.3-45
## [10] caret_6.0-70 ggplot2_2.1.0 lattice_0.20-33
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.6 compiler_3.3.1 RColorBrewer_1.1-2
## [4] formatR_1.4 nloptr_1.0.4 plyr_1.8.4
## [7] class_7.3-14 iterators_1.0.8 tools_3.3.1
## [10] digest_0.6.10 lme4_1.1-12 evaluate_0.9
## [13] nlme_3.1-128 gtable_0.2.0 mgcv_1.8-13
## [16] Matrix_1.2-6 foreach_1.4.3 yaml_2.1.13
## [19] parallel_3.3.1 SparseM_1.7 e1071_1.6-7
## [22] RGtk2_2.20.31 stringr_1.0.0 knitr_1.13
## [25] pROC_1.8 MatrixModels_0.4-1 stats4_3.3.1
## [28] grid_3.3.1 rmarkdown_1.0 minqa_1.2.4
## [31] reshape2_1.4.1 car_2.1-2 magrittr_1.5
## [34] scales_0.4.0 codetools_0.2-14 htmltools_0.3.5
## [37] splines_3.3.1 rpart.plot_2.0.1 pbkrtest_0.4-6
## [40] colorspace_1.2-6 quantreg_5.26 stringi_1.1.1
## [43] munsell_0.4.3