Introduction

We begin with the problem statement and some background information. The issue we are trying to address occurs during what is known as “Rotogravure” printing.

Rotogravure printing involves rotating a chrome-plated copper cylinder in a bath of ink, scraping off the excess ink, and pressing a continuous supply of paper against the inked image with a rubber roller. Once the job is printed, the engraved image is removed from the cylinder, which is replated to be engraved for another job.

Sometimes, a series of grooves - called a “band” - appears in the cylinder during printing, ruining the finished product. These grooves are not present at the start of the printing run, but once they appear the printing press must be shut down. A technician then removes the band by polishing it out of the cylinder, or by transporting the cylinder to a plating station where the chrome surface is removed, the band is polished out of the copper subsurface, and a chrome finish is replated.

The occurrence of bands results in process delays, plant shutdowns, and losses in terms of labour and time.

When process delays have known causes, they can be mitigated by acquiring causal rules from human experts and then applying sensors and automated real-time diagnostic devices to the process. However, for some delays the experts have only “weak” causal knowledge or none at all.

In such cases, machine learning tools can collect training data and process it through an induction engine in search of “diagnostic” knowledge. Our aim in this analysis is therefore to find the most probable causes of band formation and to share those parameters so that they can be controlled, in order to mitigate the effects of banding.

Our analysis will follow the different paths in the flowchart as shown below. We first take the data that has been provided and perform a data clean-up activity (pre-processing) to make it usable for a machine learning algorithm. This clean-up activity involves several sub-steps, and we have tried different ways to address the missing data values, described in more detail in the relevant sections.

After this activity, we aim to build a robust prediction model by using various ML algorithms and testing them for their accuracy and prediction capability.

Data Cleaning

Our first step is to clean our data and remove duplicates. Such problems can arise from inconsistent entries (upper/lower case), duplicated records, and spelling mistakes made by the operators who noted the various parameter levels while making the cylinders.
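A minimal sketch of this kind of clean-up in base R is given here. The data frame `raw` and its values are hypothetical stand-ins, not the actual cylinder data, and the exact steps used for this report may differ.

```r
# Toy stand-in for the raw data (hypothetical values, not the real dataset)
raw <- data.frame(
  customer   = c("ACME", "acme", "Acme ", "Globex"),
  press_type = c("Motter94", "MOTTER94", "motter94", "WoodHoe70"),
  viscosity  = c(46, 46, 46, 40),
  stringsAsFactors = FALSE
)

# Normalise case and surrounding whitespace in every character column
is_chr <- vapply(raw, is.character, logical(1))
raw[is_chr] <- lapply(raw[is_chr], function(x) tolower(trimws(x)))

# Drop rows that are exact duplicates after normalisation
clean <- raw[!duplicated(raw), ]
print(clean)
```

Normalising before calling `duplicated()` matters: otherwise "ACME" and "acme" would be treated as different records.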

Our target variable in this data set is “band_type”, which has two levels: band or noband. Our explanatory variables are the remaining 39 variables, of which 20 attributes are numeric and 19 are nominal. After this common step, we proceed with analyzing the dataset and deriving more meaningful information.

Describing the given DataSet

We check here how the data is spread and what it contains. The variable we are interested in is “band_type”, in particular what causes banding and how well it can be predicted. From the dataset, we can see the description of the “band_type” variable, as shown below.

## variable = band_type
## type     = factor
## na       = 0 of 540 (0%)
## unique   = 2
##  band    = 228 (42.2%)
##  noband  = 312 (57.8%)

We find that, of the 540 records in the dataset, “band_type” is completely available (no missing values, na = 0 of 540). There are two unique levels for this variable (band and noband), and the split between them is reasonably balanced (42.2% versus 57.8%), with only a slight tilt towards noband.

An overall plot of the data is shown below. Here we find that only 51.3% of the rows are complete, i.e., rows which do not have any values missing. This works out to 277 of the 540 records. We also see that 4.6% of the individual observations are missing.

plot_intro(cyl)

A deeper look into the missing data shows that the variable with the most missing data is “location” at 28.89%, followed by “blade_pressure” and “blade_mfg”. The plot below shows, for each of the 40 variables, the percentage of missing data it contains. This information will help us later when determining which variables to consider or discard in our analysis.

plot_missing(cyl)
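The quantities summarised in these plots can also be computed directly. The sketch below uses a small hypothetical data frame `toy` (not the real `cyl` data) to show the share of complete rows, the share of missing cells, and the percentage missing per column.

```r
# Toy data frame with missing values (hypothetical, not the real dataset)
toy <- data.frame(
  location       = c("northus", NA, NA, "southus"),
  blade_pressure = c(25, NA, 30, 28),
  press_speed    = c(1800, 2000, 2100, 1900)
)

# Share of rows with no missing value at all
complete_share <- mean(complete.cases(toy))

# Share of individual cells that are missing
missing_cells <- mean(is.na(toy))

# Percentage of missing values per column, largest first
missing_by_col <- sort(colMeans(is.na(toy)) * 100, decreasing = TRUE)

print(complete_share)
print(missing_cells)
print(missing_by_col)
```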

A density plot is a representation of the distribution of a numeric variable. It helps to understand how normal or skewed the data is and whether any variables are wildly distributed. As can be gleaned, not all the variables seem relevant here (unit_number, job_number, etc.). The rest of the data seem to be well distributed without too many extremes.

plot_density(cyl)

Dealing with Missing Data & Using ML Models

We next turn our attention to the problem of missing data. There are several ways in which we can fill in the missing data and proceed with our analysis.

Here we share 5 different ways in which we have dealt with missing data. After cleaning and dealing with missing data, our dataset is ready for various ML algorithms. For this report, we have picked four ML models: RandomForest, NeuralNetwork, DecisionTrees and Support Vector Machines.

1. Omitting all missing data

In our first method, we simply remove all the rows that are incomplete and only consider the complete observations for our analysis.

## [1] 277

We find that the number of complete rows is 277. Therefore, we work only with these records for analysis.

Using this data, we proceed with creating correlation plots to see how the different variables are correlated. If the dataset has perfectly positively or negatively correlated attributes, the performance of the model will be impacted by a problem called “Multicollinearity”. Multicollinearity happens when one predictor variable in a multiple regression model can be linearly predicted from the others with a high degree of accuracy. This can lead to skewed or misleading results.

Decision trees and boosted tree algorithms are immune to multicollinearity by nature. When the model decides to split, the tree will choose only one of the perfectly correlated features. However, other algorithms like Logistic Regression or Linear Regression are not immune to this problem, and it needs to be fixed before training the model. Note, however, that the variable importance levels can still be affected by the presence of multicollinearity.

In our case, we only wish to observe correlations, if any, and remove one variable of each correlated pair where they exist.

We find that the following correlations exist in the data, to a lesser degree:

  1. “ink_pct” and “varnish_pct”
  2. “solvent_pct” and “varnish_pct”
  3. “roller_durometer” and “press_speed”
  4. “ink_pct” and “proof_cut”
  5. “varnish_pct” and “proof_cut”

We can next train our model with and without some of the above variables to observe if there is an improvement.
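As an illustration of how such pairs can be screened, the sketch below computes a correlation matrix on simulated data and lists every pair whose absolute correlation exceeds a cut-off. The simulated columns, the cut-off of 0.7 and the name `high_pairs` are assumptions made for the example; they are not taken from the analysis itself.

```r
set.seed(1)
n <- 100

# Simulated numeric predictors (hypothetical values)
ink_pct     <- rnorm(n, mean = 55, sd = 5)
varnish_pct <- 100 - ink_pct + rnorm(n, mean = 0, sd = 1)  # strongly tied to ink_pct
press_speed <- rnorm(n, mean = 1800, sd = 200)             # unrelated to the others
toy <- data.frame(ink_pct, varnish_pct, press_speed)

# Pairwise Pearson correlations
cm <- cor(toy, use = "pairwise.complete.obs")

# Keep each pair once (upper triangle) and flag |r| above the chosen cut-off
cutoff <- 0.7
idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
print(high_pairs)
```

One variable from each flagged pair can then be dropped before refitting, which is the comparison described above.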

Further, we identify from the problem description 12 variables which can be safely ignored, as including them would only reduce the model's capability. The variables that we decided to remove from our training model are listed below:

  1. date (some months might experience more banding than others, but this is explainable)
  2. customer
  3. job_number
  4. cylinder_no
  5. cylinder_division
  6. ink_color
  7. unit_number
  8. plating_tank
  9. blade_mfg
  10. chrome_content
  11. ESA_voltage
  12. ESA_amperage

a. Using the model: Random Forest

We first generate the training and test datasets based on a random sample. We use 70% of the supplied dataset as training set, and the remaining 30% as a test dataset.

We next build the random forest with the randomForest library, using the training data set. Since the number of variables is 39 in this case, we use mtry=6 (roughly the square root of 39); that is, 6 variables are tried at every split.
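The split sizes and the choice of mtry can be checked with a few lines of base R. This is a sketch only: the seed is arbitrary, and drawing the training rows with `sample()` is an assumption about how the split was made.

```r
set.seed(42)                                   # arbitrary seed, for reproducibility only
n_rows <- 277                                  # complete rows reported above
train  <- sample(n_rows, floor(0.7 * n_rows))  # indices of the training rows
test   <- setdiff(seq_len(n_rows), train)      # everything else is the test set

n_train <- length(train)
n_test  <- length(test)

# randomForest's default mtry for classification is floor(sqrt(p))
p <- 39
mtry_default <- floor(sqrt(p))

print(c(train = n_train, test = n_test, mtry = mtry_default))
```

With 277 complete rows, a 70% split gives 193 training and 84 test records, which matches the sample sizes in the output further below.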

From the results, we observe the out-of-bag error rate to be 25.91%.

## 
## Call:
##  randomForest(formula = form, data = data[train, ], ntree = 1000,      mtry = 6, importance = TRUE, localImp = TRUE, replace = FALSE,      na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 25.91%
## Confusion matrix:
##        band noband class.error
## band     44     31   0.4133333
## noband   19     99   0.1610169
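The OOB error rate and the per-class errors follow directly from the confusion matrix printed above. The short check below re-enters those four counts by hand (rows are the actual classes, columns the predicted ones).

```r
# OOB confusion matrix copied from the output above
conf <- matrix(c(44, 19, 31, 99), nrow = 2,
               dimnames = list(actual    = c("band", "noband"),
                               predicted = c("band", "noband")))

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(conf) / rowSums(conf)

# Overall OOB error: all misclassified records over all records
oob_error <- 1 - sum(diag(conf)) / sum(conf)

print(round(class_error, 4))
print(round(oob_error, 4))
```

The band class is misclassified far more often than noband (about 41% versus 16%), which the single OOB figure of 25.91% hides.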

The variable importance is shown in the plot below. The table lists the error rates (overall out-of-bag and per class) after each of the first 15 trees, followed by the corresponding error plot, which shows how the error rate changes as more trees are added to the forest.

##          OOB   band noband
##  [1,] 0.2535 0.3750 0.1538
##  [2,] 0.2783 0.3478 0.2319
##  [3,] 0.2483 0.2931 0.2184
##  [4,] 0.2545 0.2769 0.2400
##  [5,] 0.3258 0.3043 0.3394
##  [6,] 0.3279 0.3562 0.3091
##  [7,] 0.3048 0.3425 0.2807
##  [8,] 0.2895 0.3600 0.2435
##  [9,] 0.2789 0.3067 0.2609
## [10,] 0.2579 0.3200 0.2174
## [11,] 0.2408 0.3200 0.1897
## [12,] 0.2723 0.3733 0.2069
## [13,] 0.2902 0.3467 0.2542
## [14,] 0.3057 0.4133 0.2373
## [15,] 0.2953 0.3200 0.2797

Mean Decrease in Gini is the average (mean) of a variable’s total decrease in node impurity, weighted by the proportion of samples reaching that node in each individual decision tree in the random forest.

This is effectively a measure of how important a variable is for estimating the value of the target variable across all of the trees that make up the forest. A higher Mean Decrease in Gini indicates higher variable importance. The most important variables to the model will be highest in the plot and have the largest Mean Decrease in Gini values. Conversely, the least important variables will be lowest in the plot and have the smallest Mean Decrease in Gini values.
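To make the notion of a decrease in node impurity concrete, the sketch below computes the Gini decrease for a single split. It uses the class counts of the first split in the decision tree output later in this report (99 band / 178 noband at the root, divided into 84/98 and 15/80). The helper name `gini` is ours. A random forest accumulates such decreases over every split made on a variable and averages them across trees.

```r
# Gini impurity of a node, given its class counts
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

parent <- c(band = 99, noband = 178)   # root node
left   <- c(band = 84, noband = 98)    # left child
right  <- c(band = 15, noband = 80)    # right child
n      <- sum(parent)

# Impurity decrease: parent impurity minus size-weighted child impurities
decrease <- gini(parent) -
  (sum(left)  / n) * gini(left) -
  (sum(right) / n) * gini(right)

# Scaled by the number of observations in the node
improve <- n * decrease

print(round(decrease, 5))
print(round(improve, 5))
```

The scaled value agrees with the `improve` figure of about 11.51 that rpart reports for this split.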

In this case, the variables with a Mean Decrease in Gini between 3 and 4 (and therefore the most important) are:

  1. press_speed
  2. solvent_pct
  3. ink_temperature
  4. viscosity
  5. roller_durometer
  6. ink_pct
  7. anode_space_ratio
  8. hardener

If we repeat this test after correcting for multicollinearity, we get the results below:

## 
## Call:
##  randomForest(formula = form, data = data[train, ], ntree = 1000,      mtry = 6, importance = TRUE, localImp = TRUE, replace = FALSE,      na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 25.91%
## Confusion matrix:
##        band noband class.error
## band     42     33   0.4400000
## noband   17    101   0.1440678

##          OOB   band noband
##  [1,] 0.2676 0.4375 0.1282
##  [2,] 0.3246 0.4681 0.2239
##  [3,] 0.3333 0.4364 0.2717
##  [4,] 0.3148 0.3871 0.2700
##  [5,] 0.3352 0.4286 0.2736
##  [6,] 0.3611 0.4286 0.3182
##  [7,] 0.3871 0.4861 0.3246
##  [8,] 0.3617 0.4795 0.2870
##  [9,] 0.3280 0.4521 0.2500
## [10,] 0.3263 0.4324 0.2586
## [11,] 0.3298 0.4533 0.2500
## [12,] 0.3385 0.4667 0.2564
## [13,] 0.3420 0.4667 0.2627
## [14,] 0.3005 0.4133 0.2288
## [15,] 0.3420 0.4800 0.2542

The OOB error rate is unchanged at 25.91%, but with the correlated variables removed, the importance ranking by Mean Decrease in Gini gives the following variables of importance:

  1. press_speed
  2. solvent_pct
  3. ink_temperature
  4. viscosity
  5. ink_pct
  6. anode_space_ratio

We now proceed to predict using the test data set and create the confusion matrix for the same.

##         pred
## true     band noband
##   band     20      4
##   noband    9     51
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  84 
## 
##  
##                        | band.pred 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        20 |         4 |        24 | 
##                        |     0.833 |     0.167 |     0.286 | 
##                        |     0.690 |     0.073 |           | 
##                        |     0.238 |     0.048 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |         9 |        51 |        60 | 
##                        |     0.150 |     0.850 |     0.714 | 
##                        |     0.310 |     0.927 |           | 
##                        |     0.107 |     0.607 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        29 |        55 |        84 | 
##                        |     0.345 |     0.655 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 
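The proportions in the cross-table can be reproduced from the four counts alone. The sketch below re-enters the test-set confusion matrix by hand and derives the overall accuracy together with the row and column proportions.

```r
# Test-set confusion matrix copied from the output above
conf <- matrix(c(20, 9, 4, 51), nrow = 2,
               dimnames = list(true = c("band", "noband"),
                               pred = c("band", "noband")))

# Overall accuracy: correct predictions over all test records
accuracy <- sum(diag(conf)) / sum(conf)

# "N / Row Total": how each true class was predicted
row_rates <- prop.table(conf, margin = 1)

# "N / Col Total": what each predicted class actually contained
col_rates <- prop.table(conf, margin = 2)

print(round(accuracy, 3))
print(round(row_rates, 3))
print(round(col_rates, 3))
```

On this test set, 83.3% of true bands and 85.0% of true nobands are predicted correctly, for an overall accuracy of 71/84.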

We now use repeated cross-validation (10 folds, repeated 10 times) to see if we get better classification performance. The model summary, confusion matrix and cross-table are shown below. The best cross-validated accuracy is 0.724, obtained with mtry=2; on the test set, this model reaches an accuracy of 0.845 and a kappa of 0.616.

There is no standardized interpretation of the kappa statistic. According to Wikipedia (citing their paper), Landis and Koch consider 0-0.20 as slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1 as almost perfect.

Fleiss considers kappas > 0.75 as excellent, 0.40-0.75 as fair to good, and < 0.40 as poor.
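For reference, Cohen's kappa compares the observed agreement with the agreement expected by chance from the row and column totals. The sketch below applies that definition to the test-set confusion matrix reported in the output that follows (17, 6, 7, 54), entered by hand.

```r
# Confusion matrix from the cross-validated random forest (test set)
conf <- matrix(c(17, 7, 6, 54), nrow = 2,
               dimnames = list(prediction = c("band", "noband"),
                               reference  = c("band", "noband")))
n <- sum(conf)

# Observed agreement: share of records on the diagonal
p_observed <- sum(diag(conf)) / n

# Chance agreement: product of matching row and column totals
p_expected <- sum(rowSums(conf) * colSums(conf)) / n^2

kappa <- (p_observed - p_expected) / (1 - p_expected)

print(round(p_observed, 4))
print(round(p_expected, 4))
print(round(kappa, 3))
```

A kappa of about 0.616 falls at the lower edge of “substantial” on the Landis and Koch scale and within “fair to good” on Fleiss's scale.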

##   model parameter                         label forReg forClass probModel
## 1    rf      mtry #Randomly Selected Predictors   TRUE     TRUE      TRUE
##    user  system elapsed 
##   52.22    0.80   55.23
## Random Forest 
## 
## 193 samples
##  24 predictor
##   2 classes: 'band', 'noband' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 173, 173, 174, 173, 174, 174, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7242251  0.3746088
##   17    0.7123713  0.3648302
##   33    0.7035322  0.3468995
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction band noband
##     band     17      6
##     noband    7     54
##                                           
##                Accuracy : 0.8452          
##                  95% CI : (0.7499, 0.9149)
##     No Information Rate : 0.7143          
##     P-Value [Acc > NIR] : 0.003877        
##                                           
##                   Kappa : 0.616           
##                                           
##  Mcnemar's Test P-Value : 1.000000        
##                                           
##             Sensitivity : 0.7083          
##             Specificity : 0.9000          
##          Pos Pred Value : 0.7391          
##          Neg Pred Value : 0.8852          
##              Prevalence : 0.2857          
##          Detection Rate : 0.2024          
##    Detection Prevalence : 0.2738          
##       Balanced Accuracy : 0.8042          
##                                           
##        'Positive' Class : band            
## 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  84 
## 
##  
##                        | band.pred.1 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        17 |         7 |        24 | 
##                        |     0.708 |     0.292 |     0.286 | 
##                        |     0.739 |     0.115 |           | 
##                        |     0.202 |     0.083 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |         6 |        54 |        60 | 
##                        |     0.100 |     0.900 |     0.714 | 
##                        |     0.261 |     0.885 |           | 
##                        |     0.071 |     0.643 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        23 |        61 |        84 | 
##                        |     0.274 |     0.726 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 

For a large class imbalance (which is not the case in our example, as band and noband are reasonably balanced), the train function can subsample the dataset to balance the classes before model fitting. This is done by using the option sampling=“down”.
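The idea behind down-sampling can be sketched in base R: every class is randomly reduced to the size of the smallest class. The class counts below (75 band, 118 noband) are those of the training set in the OOB confusion matrix above. The seed is arbitrary, and this illustrates the principle only, not what train does internally during resampling.

```r
set.seed(7)  # arbitrary seed, for reproducibility only

# Class labels with the training-set class counts reported earlier
y <- factor(c(rep("band", 75), rep("noband", 118)))

# Row indices grouped by class
idx_by_class <- split(seq_along(y), y)

# Size of the smallest class
n_min <- min(lengths(idx_by_class))

# Randomly keep n_min rows from every class
keep <- unlist(lapply(idx_by_class, function(i) i[sample.int(length(i), n_min)]))

print(table(y[keep]))
```

Down-sampling discards 43 noband records here, so the model trains on less data in exchange for balanced classes.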

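The confusion matrix and cross-table for the down-sampled model are shown below. Test accuracy drops to 0.643, below the no-information rate of 0.714, while sensitivity for band rises to 0.917.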

##    user  system elapsed 
##   38.03    0.52   38.57
## Random Forest 
## 
## 193 samples
##  24 predictor
##   2 classes: 'band', 'noband' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 173, 174, 173, 174, 174, 174, ... 
## Addtional sampling using down-sampling
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7111988  0.4229284
##   17    0.6846257  0.3597217
##   33    0.6655585  0.3201383
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction band noband
##     band     22     28
##     noband    2     32
##                                           
##                Accuracy : 0.6429          
##                  95% CI : (0.5308, 0.7445)
##     No Information Rate : 0.7143          
##     P-Value [Acc > NIR] : 0.9393          
##                                           
##                   Kappa : 0.3396          
##                                           
##  Mcnemar's Test P-Value : 5.01e-06        
##                                           
##             Sensitivity : 0.9167          
##             Specificity : 0.5333          
##          Pos Pred Value : 0.4400          
##          Neg Pred Value : 0.9412          
##              Prevalence : 0.2857          
##          Detection Rate : 0.2619          
##    Detection Prevalence : 0.5952          
##       Balanced Accuracy : 0.7250          
##                                           
##        'Positive' Class : band            
## 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  84 
## 
##  
##                        | band.pred.2 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        22 |         2 |        24 | 
##                        |     0.917 |     0.083 |     0.286 | 
##                        |     0.440 |     0.059 |           | 
##                        |     0.262 |     0.024 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |        28 |        32 |        60 | 
##                        |     0.467 |     0.533 |     0.714 | 
##                        |     0.560 |     0.941 |           | 
##                        |     0.333 |     0.381 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        50 |        34 |        84 | 
##                        |     0.595 |     0.405 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 

b. Using the model: Neural Network

A neural network is a set of connected input/output units in which each connection has a weight associated with it. In the learning phase, the network adjusts these weights so that it predicts the correct class label for the given inputs.

We build a NN classifier model in this case by providing the dataset free from missing values, using 2 hidden units in a single hidden layer (the 71 weights reported below are consistent with this).

## # weights:  71
## initial  value 137.879331 
## final  value 128.946992 
## converged

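We then test the model on the test dataset. From it we obtain the misclassification table shown below. The model predicts noband for every test record: all 24 band cases are misclassified and all 60 noband cases are correct, so the accuracy is 60/84 = 71.4%, the same as always guessing the majority class. For predicting “banding”, the network therefore performs worse than randomForest.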

##         predicted
## true     noband
##   band       24
##   noband     60

The misclassification table can also be obtained by using the gmodels library:

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  84 
## 
##  
##                        | predicted 
## data$band_type[-train] |    noband | Row Total | 
## -----------------------|-----------|-----------|
##                   band |        24 |        24 | 
##                        |     0.286 |           | 
## -----------------------|-----------|-----------|
##                 noband |        60 |        60 | 
##                        |     0.714 |           | 
## -----------------------|-----------|-----------|
##           Column Total |        84 |        84 | 
## -----------------------|-----------|-----------|
## 
## 

We can now list the relative importance of the variables using a sourced function (courtesy: https://beckmw.wordpress.com/2013/08/12/variable-importance-in-neural-networks/ ).

The relative importance of each input variable for the response variable is a value from -1 to 1. From the data, we can get an idea of what the neural network is telling us about the specific importance of each explanatory variable for the response variable. This is from a method proposed by Garson 1991 (also Goh 1995) in which the relative importance of explanatory variables for specific response variables in a supervised neural network is obtained by deconstructing the model weights.

The idea is that all weights connecting a specific input node, through the hidden layer, to the specific response variable are identified. This is repeated for all other explanatory variables until the analyst has a list of all weights that are specific to each input variable. The connections are tallied for each input node and scaled relative to all other inputs. A single value is obtained for each explanatory variable that describes its relationship with the response variable in the model.

##                          rel.imp              x.names
## solvent_typeXYLOL    -1.00000000    solvent_typeXYLOL
## ink_temperature      -0.87550702      ink_temperature
## direct_steamyes      -0.86384484      direct_steamyes
## locationnorthus      -0.82482455      locationnorthus
## ink_typeuncoated     -0.75388867     ink_typeuncoated
## roughness            -0.69786427            roughness
## cylinder_sizetabloid -0.69255956 cylinder_sizetabloid
## locationsouthus      -0.60007710      locationsouthus
## solvent_pct          -0.56915850          solvent_pct
## proof_inkYES         -0.47786287         proof_inkYES
## grain_screenedYES    -0.41261131    grain_screenedYES
## ink_pct              -0.33059487              ink_pct
## hardener             -0.31448538             hardener
## paper_typeuncoated   -0.23438104   paper_typeuncoated
## paper_typesuper      -0.15555727      paper_typesuper
## solvent_typeNAPTHA   -0.08247157   solvent_typeNAPTHA
## press_typeMotter70   -0.04514294   press_typeMotter70
## press_typeMotter94    0.00000000   press_typeMotter94
## locationmideuropean   0.02689458  locationmideuropean
## wax                   0.08816143                  wax
## cylinder_sizespiegel  0.11718953 cylinder_sizespiegel
## press_typeWoodHoe70   0.17856352  press_typeWoodHoe70
## anode_space_ratio     0.23876785    anode_space_ratio
## humidity              0.27973108             humidity
## current_density       0.32245025      current_density
## ink_typecover         0.33768691        ink_typecover
## blade_pressure        0.37256921       blade_pressure
## caliper               0.43920565              caliper
## viscosity             0.45173435            viscosity
## locationscandanavian  0.50211415 locationscandanavian
## press_speed           0.51625717          press_speed
## press                 0.65012793                press
## cylinder_typeyes      0.83211514     cylinder_typeyes

The bar plot tells us that the variables solvent_type(XYLOL) and cylinder_type(YES) have the strongest negative and positive relationships, respectively, with the response variable band_type. Similarly, variables that have relative importance close to zero, such as press_type, do not have any substantial importance for band_type.

In decreasing order of positive relative importance:

  1. cylinder_type - yes
  2. press
  3. press_speed
  4. location - scandanavian
  5. viscosity

c. Using the model: Decision Tree

We consider the use of a Decision Tree, modelled using the rpart function. To get a large tree, we make the complexity parameter (cp) really small. In the output we see all the trees that are considered in the model, with the complexity parameter, number of splits, re-substitution error rate, cross-validated error rate and the associated standard error.

## 
## Classification tree:
## rpart(formula = data$band_type ~ ., data = data, method = "class", 
##     control = rpart.control(minsplit = 4, cp = 1e-05))
## 
## Variables actually used in tree construction:
##  [1] anode_space_ratio blade_pressure    caliper          
##  [4] current_density   cylinder_size     cylinder_type    
##  [7] hardener          humidity          ink_pct          
## [10] ink_temperature   ink_type          press            
## [13] press_speed       solvent_pct       solvent_type     
## [16] viscosity        
## 
## Root node error: 99/277 = 0.3574
## 
## n= 277 
## 
##          CP nsplit rel error  xerror     xstd
## 1  0.080808      0  1.000000 1.00000 0.080566
## 2  0.060606      4  0.676768 1.04040 0.081249
## 3  0.050505      5  0.616162 0.97980 0.080195
## 4  0.045455      6  0.565657 0.98990 0.080383
## 5  0.020202      8  0.474747 0.95960 0.079804
## 6  0.016162     13  0.373737 0.90909 0.078735
## 7  0.015152     18  0.292929 0.94949 0.079600
## 8  0.010101     24  0.202020 0.96970 0.080002
## 9  0.006734     34  0.101010 1.06061 0.081561
## 10 0.000010     40  0.060606 1.07071 0.081710

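We find that the complete decision tree will have 10 levels. Also, the minimum cross-validated error (xerror) is 0.909, relative to the root node error, at 13 splits (a tree with 14 terminal nodes). This is only about 9% better than always predicting the majority class, indicating this is not a very good model and has very poor performance.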
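The best row of the CP table can be located programmatically. The sketch below re-enters the table printed above by hand, picks the row with the smallest cross-validated error, and converts that relative error into an absolute misclassification rate by multiplying with the root node error (99/277).

```r
# CP table copied from the rpart output above
cp_table <- data.frame(
  CP     = c(0.080808, 0.060606, 0.050505, 0.045455, 0.020202,
             0.016162, 0.015152, 0.010101, 0.006734, 0.000010),
  nsplit = c(0, 4, 5, 6, 8, 13, 18, 24, 34, 40),
  xerror = c(1.00000, 1.04040, 0.97980, 0.98990, 0.95960,
             0.90909, 0.94949, 0.96970, 1.06061, 1.07071)
)
root_error <- 99 / 277

# Row with the smallest cross-validated (relative) error
best <- cp_table[which.min(cp_table$xerror), ]

# A binary tree with k splits has k + 1 terminal nodes
n_leaves <- best$nsplit + 1

# Absolute cross-validated misclassification rate
abs_cv_error <- best$xerror * root_error

print(best)
print(n_leaves)
print(round(abs_cv_error, 4))
```

The smallest xerror is 0.90909 at 13 splits (14 terminal nodes), an absolute cross-validated error of roughly 0.325 against a root node error of 0.357.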

## Call:
## rpart(formula = data$band_type ~ ., data = data, method = "class", 
##     control = rpart.control(minsplit = 4, cp = 1e-05))
##   n= 277 
## 
##           CP nsplit rel error   xerror       xstd
## 1 0.08080808      0 1.0000000 1.000000 0.08056613
## 2 0.06060606      4 0.6767677 1.040404 0.08124902
## 3 0.05050505      5 0.6161616 0.979798 0.08019495
## 4 0.05000000      6 0.5656566 0.989899 0.08038305
## 
## Variable importance
##             press       press_speed        press_type         viscosity 
##                20                15                14                13 
##     cylinder_size               wax           ink_pct    blade_pressure 
##                10                 8                 8                 4 
##         roughness          location anode_space_ratio         proof_ink 
##                 4                 4                 1                 1 
## 
## Node number 1: 277 observations,    complexity param=0.08080808
##   predicted class=noband  expected loss=0.3574007  P(node) =1
##     class counts:    99   178
##    probabilities: 0.357 0.643 
##   left son=2 (182 obs) right son=3 (95 obs)
##   Primary splits:
##       press          < 822.5   to the left,  improve=11.509960, (0 missing)
##       press_speed    < 2025    to the left,  improve=11.098460, (0 missing)
##       ink_type       splits as  RLL,         improve= 9.863416, (0 missing)
##       grain_screened splits as  RL,          improve= 7.633489, (0 missing)
##       solvent_pct    < 35.45   to the left,  improve= 7.182566, (0 missing)
##   Surrogate splits:
##       press_type  splits as  LLRL,        agree=0.859, adj=0.589, (0 split)
##       wax         < 2.35    to the right, agree=0.776, adj=0.347, (0 split)
##       press_speed < 2075    to the left,  agree=0.758, adj=0.295, (0 split)
##       roughness   < 0.78125 to the left,  agree=0.747, adj=0.263, (0 split)
##       location    splits as  RLLRR,       agree=0.744, adj=0.253, (0 split)
## 
## Node number 2: 182 observations,    complexity param=0.08080808
##   predicted class=noband  expected loss=0.4615385  P(node) =0.6570397
##     class counts:    84    98
##    probabilities: 0.462 0.538 
##   left son=4 (10 obs) right son=5 (172 obs)
##   Primary splits:
##       ink_pct       < 64      to the right, improve=6.135957, (0 missing)
##       viscosity     < 55.5    to the right, improve=5.723443, (0 missing)
##       solvent_pct   < 35.5    to the left,  improve=5.631499, (0 missing)
##       cylinder_size splits as  RRL,         improve=5.013794, (0 missing)
##       press_speed   < 2025    to the left,  improve=4.473709, (0 missing)
## 
## Node number 3: 95 observations
##   predicted class=noband  expected loss=0.1578947  P(node) =0.3429603
##     class counts:    15    80
##    probabilities: 0.158 0.842 
## 
## Node number 4: 10 observations
##   predicted class=band    expected loss=0  P(node) =0.03610108
##     class counts:    10     0
##    probabilities: 1.000 0.000 
## 
## Node number 5: 172 observations,    complexity param=0.08080808
##   predicted class=noband  expected loss=0.4302326  P(node) =0.6209386
##     class counts:    74    98
##    probabilities: 0.430 0.570 
##   left son=10 (146 obs) right son=11 (26 obs)
##   Primary splits:
##       press_speed   < 2025    to the left,  improve=6.072684, (0 missing)
##       cylinder_size splits as  RRL,         improve=5.004309, (0 missing)
##       viscosity     < 41.5    to the right, improve=3.686492, (0 missing)
##       ink_type      splits as  RLL,         improve=3.320349, (0 missing)
##       paper_type    splits as  R-L,         improve=3.047771, (0 missing)
##   Surrogate splits:
##       proof_ink         splits as  RL,          agree=0.860, adj=0.077, (0 split)
##       anode_space_ratio < 91.65   to the right, agree=0.860, adj=0.077, (0 split)
##       paper_type        splits as  R-L,         agree=0.855, adj=0.038, (0 split)
##       viscosity         < 68.5    to the left,  agree=0.855, adj=0.038, (0 split)
##       ink_temperature   < 18.05   to the left,  agree=0.855, adj=0.038, (0 split)
## 
## Node number 10: 146 observations,    complexity param=0.08080808
##   predicted class=noband  expected loss=0.4863014  P(node) =0.5270758
##     class counts:    71    75
##    probabilities: 0.486 0.514 
##   left son=20 (60 obs) right son=21 (86 obs)
##   Primary splits:
##       cylinder_size splits as  RRL,         improve=7.908771, (0 missing)
##       viscosity     < 55.5    to the right, improve=5.936941, (0 missing)
##       wax           < 2.65    to the left,  improve=5.702065, (0 missing)
##       press_type    splits as  RRLL,        improve=5.419073, (0 missing)
##       press         < 814     to the right, improve=5.419073, (0 missing)
##   Surrogate splits:
##       press_type     splits as  RRLR,        agree=0.836, adj=0.600, (0 split)
##       press          < 818.5   to the right, agree=0.836, adj=0.600, (0 split)
##       blade_pressure < 25.5    to the left,  agree=0.767, adj=0.433, (0 split)
##       press_speed    < 1755    to the right, agree=0.712, adj=0.300, (0 split)
##       wax            < 2.725   to the left,  agree=0.705, adj=0.283, (0 split)
## 
## Node number 11: 26 observations
##   predicted class=noband  expected loss=0.1153846  P(node) =0.09386282
##     class counts:     3    23
##    probabilities: 0.115 0.885 
## 
## Node number 20: 60 observations,    complexity param=0.05050505
##   predicted class=band    expected loss=0.3166667  P(node) =0.2166065
##     class counts:    41    19
##    probabilities: 0.683 0.317 
##   left son=40 (55 obs) right son=41 (5 obs)
##   Primary splits:
##       viscosity         < 40.5    to the right, improve=5.093939, (0 missing)
##       ink_temperature   < 16.4    to the right, improve=2.701361, (0 missing)
##       current_density   < 36      to the right, improve=2.594118, (0 missing)
##       blade_pressure    < 26.5    to the right, improve=2.002381, (0 missing)
##       anode_space_ratio < 99.15   to the left,  improve=1.907747, (0 missing)
## 
## Node number 21: 86 observations,    complexity param=0.06060606
##   predicted class=noband  expected loss=0.3488372  P(node) =0.3104693
##     class counts:    30    56
##    probabilities: 0.349 0.651 
##   left son=42 (6 obs) right son=43 (80 obs)
##   Primary splits:
##       viscosity         < 62      to the right, improve=5.469767, (0 missing)
##       humidity          < 86.5    to the right, improve=4.887949, (0 missing)
##       blade_pressure    < 27.5    to the left,  improve=4.606610, (0 missing)
##       press_speed       < 1325    to the left,  improve=4.501866, (0 missing)
##       anode_space_ratio < 106.78  to the left,  improve=3.727302, (0 missing)
## 
## Node number 40: 55 observations
##   predicted class=band    expected loss=0.2545455  P(node) =0.198556
##     class counts:    41    14
##    probabilities: 0.745 0.255 
## 
## Node number 41: 5 observations
##   predicted class=noband  expected loss=0  P(node) =0.01805054
##     class counts:     0     5
##    probabilities: 0.000 1.000 
## 
## Node number 42: 6 observations
##   predicted class=band    expected loss=0  P(node) =0.02166065
##     class counts:     6     0
##    probabilities: 1.000 0.000 
## 
## Node number 43: 80 observations
##   predicted class=noband  expected loss=0.3  P(node) =0.2888087
##     class counts:    24    56
##    probabilities: 0.300 0.700
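The per-node figures in the rpart summary above can be reproduced by hand from the class counts. The following is a minimal sketch in base R, assuming default priors and a 0/1 loss; the helper name node_summary is hypothetical, and the total of 277 observations is inferred from the reported P(node) values (for example 80/277 = 0.2888), not stated in the output.

```r
# Recompute the summary of one tree node from its class counts.
# Assumes default priors and 0/1 loss, so the expected loss is simply
# the share of the minority class in the node.
node_summary <- function(counts, n_total) {
  list(
    predicted     = names(counts)[which.max(counts)],  # majority class
    expected_loss = 1 - max(counts) / sum(counts),      # misclassified share
    p_node        = sum(counts) / n_total               # share of all rows
  )
}

# Node 43: 24 band and 56 noband observations; n_total = 277 is inferred.
node43 <- node_summary(c(band = 24, noband = 56), n_total = 277)
str(node43)
```

For node 43 this gives the predicted class noband, an expected loss of 0.3 and a P(node) of about 0.2888, in line with the printed values.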

The variable importance table is given below:

##                     Overall
## anode_space_ratio 24.762654
## blade_pressure    36.006487
## caliper            5.344160
## current_density    8.909889
## cylinder_size     18.628173
## cylinder_type      3.798601
## grain_screened    13.953228
## hardener          18.695875
## humidity          21.747414
## ink_pct           19.236685
## ink_temperature   24.076791
## ink_type          26.352381
## location           1.358391
## paper_type         5.199287
## press             20.781173
## press_speed       57.711832
## press_type         7.585740
## proof_ink          1.066667
## roughness          3.150000
## solvent_pct       39.015257
## solvent_type       1.190476
## viscosity         52.682730
## wax               26.552010
## direct_steam       0.000000

From this table, the most important variables, in decreasing order of importance, are:

  1. press_speed
  2. viscosity
  3. solvent_pct
  4. blade_pressure
  5. wax
  6. ink_type

d. Using the model: SVM

We next use a support vector machine (SVM) to build a classifier and to extract a feature importance list.

We first use a cost of 10, a moderately large value that penalizes misclassified training points heavily and therefore leaves only a narrow margin for misclassification.

## 
## Call:
## svm(formula = data[train, ]$band_type ~ ., data = data[train, 
##     ], method = "C-classification", kernel = "radial", cost = 10, 
##     gamma = 0.1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
## 
## Number of Support Vectors:  176
## 
##  ( 103 73 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  band noband
##   [1]   1   2   8   9  10  11  13  16  20  22  24  27  29  30  31  33  34
##  [18]  35  41  42  43  45  46  47  48  49  51  62  65  66  67  71  72  73
##  [35]  74  76  77  80  81  82  84  86  87  88  90  92  93  94  96  97  99
##  [52] 100 101 102 103 105 106 108 110 111 112 113 114 115 117 118 119 120
##  [69] 121 122 127 130 131 133 134 139 144 145 147 148 150 153 154 157 160
##  [86] 161 163 164 165 166 167 168 172 173 175 176 177 183 184 189 190 191
## [103] 193   3   4   5   6   7  12  14  17  18  19  21  26  28  36  37  38
## [120]  40  44  50  52  54  55  56  58  59  60  61  63  64  68  69  70  75
## [137]  78  79  83  89  91  95  98 104 107 109 116 123 124 126 128 129 132
## [154] 135 136 137 138 142 146 151 152 155 156 159 170 171 174 178 179 180
## [171] 181 182 185 186 187 188

Prediction using SVM: The accuracy is 73% (0.7262) with a kappa of 0.38, which shows poor predictive capability. The accuracy is not significantly better than the no-information rate of 0.7143 (p-value 0.46), so we would have to try different parameters to obtain a more robust model.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction band noband
##     band     16     15
##     noband    8     45
##                                          
##                Accuracy : 0.7262         
##                  95% CI : (0.618, 0.8179)
##     No Information Rate : 0.7143         
##     P-Value [Acc > NIR] : 0.4588         
##                                          
##                   Kappa : 0.3831         
##                                          
##  Mcnemar's Test P-Value : 0.2109         
##                                          
##             Sensitivity : 0.6667         
##             Specificity : 0.7500         
##          Pos Pred Value : 0.5161         
##          Neg Pred Value : 0.8491         
##              Prevalence : 0.2857         
##          Detection Rate : 0.1905         
##    Detection Prevalence : 0.3690         
##       Balanced Accuracy : 0.7083         
##                                          
##        'Positive' Class : band           
## 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  84 
## 
##  
##                        | svm.pred.2 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        16 |         8 |        24 | 
##                        |     0.667 |     0.333 |     0.286 | 
##                        |     0.516 |     0.151 |           | 
##                        |     0.190 |     0.095 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |        15 |        45 |        60 | 
##                        |     0.250 |     0.750 |     0.714 | 
##                        |     0.484 |     0.849 |           | 
##                        |     0.179 |     0.536 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        31 |        53 |        84 | 
##                        |     0.369 |     0.631 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 
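The accuracy, kappa and no-information rate reported above can be recomputed directly from the four cells of the confusion matrix. The following is a minimal sketch in base R; the function name accuracy_kappa is hypothetical, and kappa is computed with the usual Cohen formula (observed agreement corrected for chance agreement).

```r
# Confusion matrix of the SVM test run: rows = predicted, columns = reference.
cm <- matrix(c(16, 8, 15, 45), nrow = 2,
             dimnames = list(Prediction = c("band", "noband"),
                             Reference  = c("band", "noband")))

accuracy_kappa <- function(cm) {
  n  <- sum(cm)
  po <- sum(diag(cm)) / n                      # observed agreement (accuracy)
  pe <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
  c(accuracy = po, kappa = (po - pe) / (1 - pe))
}

stats <- accuracy_kappa(cm)
nir   <- max(colSums(cm)) / sum(cm)   # accuracy of always predicting the majority class
print(round(c(stats, nir = nir), 4))
```

An accuracy of 0.7262 against a no-information rate of 0.7143 shows how little the classifier adds over always predicting “noband”.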

The variable importance matrix (VIM) is not implemented in caret for “svm”, so we use the rminer package to obtain it.

2. Removing rows with more than 20% missing values and replacing the remaining missing data manually

In our second method, we first remove the rows in which more than 20% of the values are missing, and keep only the remaining rows for our analysis.

Then, instead of imputing the missing values of the categorical variables, we add a further level called “missing” to each of them. For the numeric variables, we calculate the “mode” of each variable and replace its missing values with that mode.
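The three steps described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the toy data frame and its values are made up (only the column names come from our dataset), and the helper names drop_sparse_rows, add_missing_level and impute_mode are hypothetical.

```r
# Toy data with made-up values; only the column names come from the report.
df <- data.frame(
  ink_type    = factor(c("coated", NA, "uncoated", "coated", NA)),
  viscosity   = c(40, 40, NA, 55, NA),
  humidity    = c(70, 75, 80, 78, 72),
  wax         = c(2.5, 2.5, 2.7, 3.0, 2.5),
  press_speed = c(1800, 2000, 2100, 1900, 1750)
)

# Step 1: drop rows in which more than 20% of the values are missing.
drop_sparse_rows <- function(df, max_frac = 0.20) {
  df[rowMeans(is.na(df)) <= max_frac, , drop = FALSE]
}

# Step 2: categorical columns get an explicit "missing" level instead of NA.
add_missing_level <- function(x) {
  x <- as.character(x)
  x[is.na(x)] <- "missing"
  factor(x)
}

# Step 3: numeric columns get NA replaced by the most frequent value (mode).
# On ties, which.max() keeps the smallest of the tied values.
impute_mode <- function(x) {
  counts <- table(x)                       # table() ignores NA by default
  mode_value <- as.numeric(names(counts)[which.max(counts)])
  x[is.na(x)] <- mode_value
  x
}

clean <- drop_sparse_rows(df)
clean$ink_type  <- add_missing_level(clean$ink_type)
clean$viscosity <- impute_mode(clean$viscosity)
print(clean)
```

In this toy example the fifth row has 2 of 5 values missing (40%) and is dropped, while rows with a single missing value (20%) are kept and filled in.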

Using this data we create correlation plots to see how the different variables are correlated. If the dataset contains perfectly positively or negatively correlated attributes, the performance of the model can be affected by a problem called “multicollinearity”. Multicollinearity occurs when one predictor variable in a multiple regression model can be linearly predicted from the others with a high degree of accuracy. This can lead to skewed or misleading results.

Decision tree and boosted tree algorithms are immune to multicollinearity by nature: when the tree decides on a split, it chooses only one of the perfectly correlated features. Other algorithms, such as logistic regression or linear regression, are not immune to this problem, and it needs to be fixed before training the model. Note, however, that the variable importance levels can still be affected by the presence of multicollinearity.

In our case, we only wish to observe whether any correlations exist and, if so, to remove one variable of each correlated pair.

We find that the following correlations exist in the data, although to a lesser degree:

  1. “ink_pct” and “varnish_pct”
  2. “solvent_pct” and “varnish_pct”

This implies that we can remove “varnish_pct” from our training model and observe whether there is an improvement.
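A correlation screen of this kind can be sketched in base R as shown here. The data is synthetic and only mimics the situation described above (varnish taking up most of the share left by ink and solvent); the cutoff of 0.5 and the seed are arbitrary assumptions, not values used in our analysis.

```r
set.seed(1)   # arbitrary seed, for reproducibility of the synthetic data only
n <- 200
ink_pct     <- runif(n, 45, 60)
solvent_pct <- runif(n, 25, 40)
# Assumed relationship for illustration: varnish fills the remaining share.
varnish_pct <- 100 - ink_pct - solvent_pct + rnorm(n, sd = 1)

vars <- data.frame(ink_pct, solvent_pct, varnish_pct)
r <- cor(vars)

# List each pair once (upper triangle) whose absolute correlation
# exceeds the chosen cutoff.
cutoff <- 0.5
idx <- which(abs(r) > cutoff & upper.tri(r), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)
print(flagged)
```

Both flagged pairs share varnish_pct, so dropping that single variable removes both correlations, which is the reasoning applied above.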

For this model as well, we continue to ignore the 12 variables identified in the previous section.

a. Using the model: Random Forest

We first generate the training and test datasets based on a random sample. We use 70% of the supplied dataset as training set, and the remaining 30% as a test dataset.
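A random 70/30 split of this kind can be sketched in base R as follows. The seed is an arbitrary assumption, and the row count of 485 is taken from the rpart output for this dataset; the index vector is named train to match the data[train, ] and data$band_type[-train] expressions that appear in the outputs.

```r
set.seed(42)       # arbitrary seed, for reproducibility only
n_rows <- 485      # rows left after removing the sparse rows

# Row indices of the training set (70%); the remaining rows form the test set.
train <- sample(n_rows, size = floor(0.7 * n_rows))
test  <- setdiff(seq_len(n_rows), train)

c(train = length(train), test = length(test))
```

This yields 339 training rows and 146 test rows, the sizes that appear in the model outputs of this section.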

We next build the random forest with the randomForest library on the training data set. The call shown below uses mtry=5, that is, 5 candidate variables are tried at every split; the data has 27 predictors, or 39 columns once the categorical variables are expanded into dummy variables.

From the results, we observe the out-of-bag error rate to be 23.3%.

## 
## Call:
##  randomForest(formula = form, data = data[train, ], ntree = 1000,      mtry = 5, importance = TRUE, localImp = TRUE, replace = FALSE,      na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 23.3%
## Confusion matrix:
##        band noband class.error
## band     64     61  0.48800000
## noband   18    196  0.08411215
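The OOB error rate and the per-class errors follow directly from the confusion matrix above. A minimal base R sketch, with variable names of our own choosing:

```r
# OOB confusion matrix: rows = actual class, columns = predicted class.
oob <- matrix(c(64, 18, 61, 196), nrow = 2,
              dimnames = list(actual    = c("band", "noband"),
                              predicted = c("band", "noband")))

class_error <- 1 - diag(oob) / rowSums(oob)   # error within each actual class
oob_error   <- 1 - sum(diag(oob)) / sum(oob)  # overall OOB error rate

print(round(class_error, 4))
print(round(oob_error, 4))
```

The overall rate of 23.3% hides a large difference between the classes: almost half of the “band” rows (61 of 125) are misclassified, against 8.4% of the “noband” rows.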

The variable importance is shown in the plot below, followed by the OOB and per-class error rates after each of the first 15 trees and the corresponding error plot. The error plot shows how the error rate changes as more trees are added to the forest.

##          OOB   band noband
##  [1,] 0.3629 0.4468 0.3117
##  [2,] 0.3600 0.4800 0.2880
##  [3,] 0.3755 0.4947 0.3038
##  [4,] 0.3611 0.4623 0.3022
##  [5,] 0.3722 0.4643 0.3198
##  [6,] 0.3560 0.4706 0.2892
##  [7,] 0.3598 0.4628 0.2995
##  [8,] 0.4018 0.5082 0.3397
##  [9,] 0.3772 0.4878 0.3128
## [10,] 0.3661 0.5040 0.2844
## [11,] 0.3650 0.5280 0.2689
## [12,] 0.3561 0.5120 0.2642
## [13,] 0.3333 0.4960 0.2383
## [14,] 0.3451 0.5120 0.2477
## [15,] 0.3451 0.5040 0.2523

In this case, the variables with a value between 6 and 8 on the Gini index (and therefore the most important) are:

  1. press_speed
  2. solvent_pct
  3. viscosity
  4. ink_temperature
  5. ink_pct
  6. humidity

We now proceed to predict using the test data set and create the confusion matrix for the same.

##         pred
## true     band noband
##   band     32     16
##   noband    7     91
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  146 
## 
##  
##                        | band.pred 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        32 |        16 |        48 | 
##                        |     0.667 |     0.333 |     0.329 | 
##                        |     0.821 |     0.150 |           | 
##                        |     0.219 |     0.110 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |         7 |        91 |        98 | 
##                        |     0.071 |     0.929 |     0.671 | 
##                        |     0.179 |     0.850 |           | 
##                        |     0.048 |     0.623 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        39 |       107 |       146 | 
##                        |     0.267 |     0.733 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 

We now use repeated cross-validation to see whether we get better classification performance. The output, confusion matrix and cross-table below are for a random forest tuned with 10-fold cross-validation repeated 10 times. The best cross-validated accuracy is 0.74 (kappa 0.39) at mtry=20; on the test set this model reaches an accuracy of 0.84 and a kappa of 0.63.

##   model parameter                         label forReg forClass probModel
## 1    rf      mtry #Randomly Selected Predictors   TRUE     TRUE      TRUE
##    user  system elapsed 
##  124.37    1.61  133.20
## Random Forest 
## 
## 339 samples
##  27 predictor
##   2 classes: 'band', 'noband' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 304, 305, 305, 305, 305, 305, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7029139  0.2639397
##   20    0.7388928  0.3906742
##   39    0.7372880  0.3878486
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 20.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction band noband
##     band     33      8
##     noband   15     90
##                                           
##                Accuracy : 0.8425          
##                  95% CI : (0.7731, 0.8974)
##     No Information Rate : 0.6712          
##     P-Value [Acc > NIR] : 2.324e-06       
##                                           
##                   Kappa : 0.6293          
##                                           
##  Mcnemar's Test P-Value : 0.2109          
##                                           
##             Sensitivity : 0.6875          
##             Specificity : 0.9184          
##          Pos Pred Value : 0.8049          
##          Neg Pred Value : 0.8571          
##              Prevalence : 0.3288          
##          Detection Rate : 0.2260          
##    Detection Prevalence : 0.2808          
##       Balanced Accuracy : 0.8029          
##                                           
##        'Positive' Class : band            
## 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  146 
## 
##  
##                        | band.pred.1 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        33 |        15 |        48 | 
##                        |     0.688 |     0.312 |     0.329 | 
##                        |     0.805 |     0.143 |           | 
##                        |     0.226 |     0.103 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |         8 |        90 |        98 | 
##                        |     0.082 |     0.918 |     0.671 | 
##                        |     0.195 |     0.857 |           | 
##                        |     0.055 |     0.616 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        41 |       105 |       146 | 
##                        |     0.281 |     0.719 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 
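To make the resampling scheme above concrete, the fold assignment of a 10-fold cross-validation repeated 10 times can be sketched in base R. This is not how caret builds its folds internally (caret also balances the class proportions across folds, which this sketch does not); the seed and variable names are assumptions.

```r
set.seed(7)   # arbitrary seed, for reproducibility only
n       <- 339   # training rows
k       <- 10    # folds per repeat
repeats <- 10

# For each repeat, shuffle the row indices and deal them into k folds.
folds <- lapply(seq_len(repeats), function(r) {
  split(sample(n), rep(seq_len(k), length.out = n))
})

# Size of the training part when each fold of the first repeat is held out.
train_sizes <- sapply(folds[[1]], function(held_out) n - length(held_out))
print(train_sizes)
```

Each of the 100 resamples trains on roughly 305 rows and evaluates on the remaining 33 or 34; the reported accuracy and kappa per mtry value are averages over these resamples.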

For a large class imbalance (which is not the case in our example, as the band and noband classes are reasonably well balanced), the caret function train can subsample the dataset to balance the classes before model fitting. This is done with the option sampling=“down”, which is set through trainControl.

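The corresponding confusion matrix and cross-table are shown below. With down-sampling the sensitivity for “band” rises from 0.69 to 0.81, while the overall accuracy drops from 0.84 to 0.78.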

##    user  system elapsed 
##   80.00    1.00   81.36
## Random Forest 
## 
## 339 samples
##  27 predictor
##   2 classes: 'band', 'noband' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 305, 306, 305, 304, 305, 306, ... 
## Addtional sampling using down-sampling
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7094372  0.4178767
##   20    0.7021401  0.3859451
##   39    0.6972737  0.3779739
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction band noband
##     band     39     23
##     noband    9     75
##                                          
##                Accuracy : 0.7808         
##                  95% CI : (0.7049, 0.845)
##     No Information Rate : 0.6712         
##     P-Value [Acc > NIR] : 0.002451       
##                                          
##                   Kappa : 0.5378         
##                                          
##  Mcnemar's Test P-Value : 0.021556       
##                                          
##             Sensitivity : 0.8125         
##             Specificity : 0.7653         
##          Pos Pred Value : 0.6290         
##          Neg Pred Value : 0.8929         
##              Prevalence : 0.3288         
##          Detection Rate : 0.2671         
##    Detection Prevalence : 0.4247         
##       Balanced Accuracy : 0.7889         
##                                          
##        'Positive' Class : band           
## 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  146 
## 
##  
##                        | band.pred.2 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        39 |         9 |        48 | 
##                        |     0.812 |     0.188 |     0.329 | 
##                        |     0.629 |     0.107 |           | 
##                        |     0.267 |     0.062 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |        23 |        75 |        98 | 
##                        |     0.235 |     0.765 |     0.671 | 
##                        |     0.371 |     0.893 |           | 
##                        |     0.158 |     0.514 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        62 |        84 |       146 | 
##                        |     0.425 |     0.575 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 
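The idea behind down-sampling can be sketched in base R: every class is reduced to the size of the smallest class by random selection. This illustrates the principle only and is not caret's implementation; the class counts of 125 and 214 are those of our training set, and the seed is arbitrary.

```r
set.seed(3)   # arbitrary seed, for reproducibility only

# Class labels of the training set: 125 band and 214 noband rows.
y <- factor(rep(c("band", "noband"), times = c(125, 214)))

# Keep a random subset of each class, sized to the smallest class.
n_min <- min(table(y))
keep <- unlist(
  lapply(split(seq_along(y), y), function(idx) {
    idx[sample.int(length(idx), n_min)]
  }),
  use.names = FALSE
)

table(y[keep])
```

After down-sampling both classes have 125 rows, so 89 “noband” rows are discarded. This is one plausible reason for the lower overall accuracy observed above.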

b. Using the model: Neural Network

We build a neural network (NN) classifier in this case by providing the dataset free from missing values, using a hidden layer with 2 units.

## # weights:  83
## initial  value 242.948297 
## final  value 223.156274 
## converged
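The weight count reported above can be checked by hand. The sketch below assumes a single-hidden-layer network with bias terms and no skip-layer connections, 39 input columns (the predictors after dummy encoding), 2 hidden units and 1 output unit; the function name n_weights is hypothetical.

```r
# Number of weights in a single-hidden-layer network with bias terms:
# every hidden unit sees all inputs plus a bias, and every output unit
# sees all hidden units plus a bias.
n_weights <- function(n_inputs, n_hidden, n_outputs = 1) {
  (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs
}

n_weights(n_inputs = 39, n_hidden = 2)
```

Under these assumptions the count is (39 + 1) * 2 + (2 + 1) * 1 = 83, matching the “# weights: 83” line.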

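We then test the model on the test dataset. From it we obtain the misclassification table shown below. The network predicts “noband” for every one of the 146 test observations, so its accuracy of 67.1% (98/146) equals the no-information rate and it detects none of the 48 “band” cases. Its performance is therefore poor compared to randomForest, especially for predicting “banding”.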

##         predicted
## true     noband
##   band       48
##   noband     98

The misclassification table can also be obtained by using the gmodels library:

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  146 
## 
##  
##                        | predicted 
## data$band_type[-train] |    noband | Row Total | 
## -----------------------|-----------|-----------|
##                   band |        48 |        48 | 
##                        |     0.329 |           | 
## -----------------------|-----------|-----------|
##                 noband |        98 |        98 | 
##                        |     0.671 |           | 
## -----------------------|-----------|-----------|
##           Column Total |       146 |       146 | 
## -----------------------|-----------|-----------|
## 
## 

We can now list the relative importance of the variables using a sourced function (courtesy: https://beckmw.wordpress.com/2013/08/12/variable-importance-in-neural-networks/)

##                          rel.imp              x.names
## press_speed          -1.00000000          press_speed
## direct_steamyes      -0.45418926      direct_steamyes
## anode_space_ratio    -0.42299572    anode_space_ratio
## cylinder_sizetabloid -0.42266536 cylinder_sizetabloid
## press_typeMotter94   -0.41648328   press_typeMotter94
## proof_inkyes         -0.41139218         proof_inkyes
## cylinder_typeno      -0.39029977      cylinder_typeno
## press_typeMotter70   -0.31007975   press_typeMotter70
## cylinder_typeyes     -0.30802519     cylinder_typeyes
## solvent_typexylol    -0.28425575    solvent_typexylol
## roughness            -0.25257263            roughness
## current_density      -0.24980925      current_density
## varnish_pct          -0.19591180          varnish_pct
## locationscandanavian -0.17879029 locationscandanavian
## ink_temperature      -0.14715537      ink_temperature
## proof_inkno          -0.14302633          proof_inkno
## solvent_typenaptha   -0.04598000   solvent_typenaptha
## press                -0.02971619                press
## caliper               0.00000000              caliper
## ink_typecover         0.01093417        ink_typecover
## cylinder_sizespiegel  0.03895839 cylinder_sizespiegel
## viscosity             0.05434212            viscosity
## ink_pct               0.06952549              ink_pct
## humidity              0.08271790             humidity
## roller_durometer      0.10893590     roller_durometer
## paper_typeuncoated    0.11548110   paper_typeuncoated
## grain_screenedyes     0.11687962    grain_screenedyes
## press_typeWoodHoe70   0.11934415  press_typeWoodHoe70
## solvent_pct           0.12436495          solvent_pct
## locationmissing       0.13678704      locationmissing
## wax                   0.14395420                  wax
## locationUSA           0.27883757          locationUSA
## ink_typeuncoated      0.28655381     ink_typeuncoated
## paper_typesuper       0.33059093      paper_typesuper
## grain_screenedno      0.34849813     grain_screenedno
## blade_pressure        0.40906394       blade_pressure
## locationmideuropean   0.42357588  locationmideuropean
## proof_cut             0.46211301            proof_cut
## hardener              0.46489333             hardener

The bar plot tells us that the variables press_speed and hardener have the strongest negative and positive relationships, respectively, with the response variable band_type.

The variables with the largest positive relative importance, in decreasing order (all weak, as the values are < 0.47), are:

  1. hardener
  2. proof_cut
  3. location - mid european
  4. blade_pressure
  5. grain_screened - no
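The list above considers the positive values only. If the sign is ignored, the ranking changes, as this base R sketch shows. The values are copied from the relative importance table and restricted to its ten entries of largest magnitude; the vector name rel_imp is our own.

```r
# The ten relative importance values of largest magnitude, from the table.
rel_imp <- c(
  press_speed = -1.00000000, direct_steamyes = -0.45418926,
  anode_space_ratio = -0.42299572, cylinder_sizetabloid = -0.42266536,
  press_typeMotter94 = -0.41648328, proof_inkyes = -0.41139218,
  blade_pressure = 0.40906394, locationmideuropean = 0.42357588,
  proof_cut = 0.46211301, hardener = 0.46489333
)

# Rank by absolute value so that strong negative relationships count too.
by_magnitude <- rel_imp[order(abs(rel_imp), decreasing = TRUE)]
print(round(by_magnitude, 3))
```

Ranked this way, press_speed clearly dominates, followed by hardener, proof_cut and direct_steamyes.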

c. Using the model: Decision Tree

We consider the use of the Decision Tree, modelled using the rpart function. To get a large tree we make the complexity parameter (cp) really small.

In the output we see all the trees that are considered in the model, with the complexity parameter, number of splits, re-substitution error rate (rel error), cross-validated error rate (xerror) and the associated standard error (xstd) of each.

From the complexity table we observe that the lowest cross-validated error of 0.75 occurs at 14 splits, that is, at a tree with 15 terminal nodes.

To reduce the tree size we can do pruning and repeated cross-validation. The pruned tree shown below has 3 splits (4 terminal nodes), with a relative re-substitution error of 0.78 and a cross-validated error of about 1.01.

This again is not a good predictor: its cross-validated error is no lower than that of the root node.

## 
## Classification tree:
## rpart(formula = data$band_type ~ ., data = data, method = "class", 
##     control = rpart.control(minsplit = 4, cp = 1e-05))
## 
## Variables actually used in tree construction:
##  [1] anode_space_ratio blade_pressure    caliper          
##  [4] cylinder_type     grain_screened    hardener         
##  [7] humidity          ink_pct           ink_temperature  
## [10] ink_type          location          paper_type       
## [13] press             press_speed       press_type       
## [16] proof_cut         proof_ink         roller_durometer 
## [19] roughness         solvent_pct       solvent_type     
## [22] varnish_pct       viscosity         wax              
## 
## Root node error: 173/485 = 0.3567
## 
## n= 485 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.0693642      0  1.000000 1.00000 0.060979
## 2  0.0375723      3  0.780347 1.00578 0.061057
## 3  0.0289017      7  0.618497 0.87283 0.058945
## 4  0.0231214      9  0.560694 0.84971 0.058506
## 5  0.0173410     11  0.514451 0.78613 0.057183
## 6  0.0115607     14  0.462428 0.75145 0.056386
## 7  0.0096339     27  0.312139 0.78035 0.057054
## 8  0.0086705     31  0.271676 0.79191 0.057310
## 9  0.0072254     38  0.208092 0.79191 0.057310
## 10 0.0057803     42  0.179191 0.79191 0.057310
## 11 0.0019268     61  0.069364 0.75723 0.056522
## 12 0.0000100     64  0.063584 0.75723 0.056522
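One common way to pick a tree size from a complexity table like the one above is to take the row with the lowest cross-validated error, or the smallest tree within one standard error of it (the “1-SE rule”). The base R sketch below applies both to the values copied from the table; it is a general heuristic and not necessarily the rule used to produce the pruned tree in this report.

```r
# Complexity table copied from the rpart output.
cp_table <- data.frame(
  CP     = c(0.0693642, 0.0375723, 0.0289017, 0.0231214, 0.0173410, 0.0115607,
             0.0096339, 0.0086705, 0.0072254, 0.0057803, 0.0019268, 0.0000100),
  nsplit = c(0, 3, 7, 9, 11, 14, 27, 31, 38, 42, 61, 64),
  xerror = c(1.00000, 1.00578, 0.87283, 0.84971, 0.78613, 0.75145,
             0.78035, 0.79191, 0.79191, 0.79191, 0.75723, 0.75723),
  xstd   = c(0.060979, 0.061057, 0.058945, 0.058506, 0.057183, 0.056386,
             0.057054, 0.057310, 0.057310, 0.057310, 0.056522, 0.056522)
)

# Row with the lowest cross-validated error.
best <- which.min(cp_table$xerror)

# 1-SE rule: smallest tree whose xerror is within one standard error
# of the minimum.
threshold <- cp_table$xerror[best] + cp_table$xstd[best]
one_se    <- min(which(cp_table$xerror <= threshold))

print(cp_table[c(best, one_se), ])
```

Here the minimum xerror is at 14 splits and the 1-SE rule selects the tree with 11 splits. The corresponding CP value could then be passed to rpart's prune function.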

## Call:
## rpart(formula = data$band_type ~ ., data = data, method = "class", 
##     control = rpart.control(minsplit = 4, cp = 1e-05))
##   n= 485 
## 
##           CP nsplit rel error  xerror       xstd
## 1 0.06936416      0 1.0000000 1.00000 0.06097943
## 2 0.05000000      3 0.7803468 1.00578 0.06105734
## 
## Variable importance
##       press_type         humidity      press_speed            press 
##               31               20               18               11 
##        roughness    cylinder_size          ink_pct     solvent_type 
##                9                4                1                1 
##   blade_pressure         location         hardener              wax 
##                1                1                1                1 
##  ink_temperature roller_durometer 
##                1                1 
## 
## Node number 1: 485 observations,    complexity param=0.06936416
##   predicted class=noband  expected loss=0.356701  P(node) =1
##     class counts:   173   312
##    probabilities: 0.357 0.643 
##   left son=2 (167 obs) right son=3 (318 obs)
##   Primary splits:
##       press_type     splits as  RRRL,        improve=13.743870, (0 missing)
##       press_speed    < 2184.5  to the left,  improve=13.282210, (0 missing)
##       press          < 822.5   to the left,  improve=13.017350, (0 missing)
##       ink_type       splits as  RLL,         improve=10.840790, (0 missing)
##       grain_screened splits as  LRL,         improve= 9.577898, (0 missing)
##   Surrogate splits:
##       press         < 818.5   to the left,  agree=0.777, adj=0.353, (0 split)
##       roughness     < 0.53125 to the left,  agree=0.753, adj=0.281, (0 split)
##       cylinder_size splits as  RLR,         agree=0.697, adj=0.120, (0 split)
##       humidity      < 84.5    to the right, agree=0.676, adj=0.060, (0 split)
##       ink_pct       < 64      to the right, agree=0.672, adj=0.048, (0 split)
## 
## Node number 2: 167 observations,    complexity param=0.06936416
##   predicted class=band    expected loss=0.4790419  P(node) =0.3443299
##     class counts:    87    80
##    probabilities: 0.521 0.479 
##   left son=4 (46 obs) right son=5 (121 obs)
##   Primary splits:
##       press_speed   < 1678    to the left,  improve=7.308378, (0 missing)
##       ink_pct       < 62.9    to the right, improve=5.404575, (0 missing)
##       cylinder_type splits as  LLR,         improve=5.085740, (0 missing)
##       humidity      < 73.5    to the left,  improve=5.077850, (0 missing)
##       location      splits as  RLLRR,       improve=4.147136, (0 missing)
##   Surrogate splits:
##       solvent_type     splits as  R-L,         agree=0.749, adj=0.087, (0 split)
##       location         splits as  RLRRR,       agree=0.743, adj=0.065, (0 split)
##       blade_pressure   < 24.5    to the left,  agree=0.743, adj=0.065, (0 split)
##       ink_temperature  < 19.05   to the right, agree=0.737, adj=0.043, (0 split)
##       roller_durometer < 29      to the left,  agree=0.737, adj=0.043, (0 split)
## 
## Node number 3: 318 observations
##   predicted class=noband  expected loss=0.2704403  P(node) =0.6556701
##     class counts:    86   232
##    probabilities: 0.270 0.730 
## 
## Node number 4: 46 observations
##   predicted class=band    expected loss=0.2391304  P(node) =0.09484536
##     class counts:    35    11
##    probabilities: 0.761 0.239 
## 
## Node number 5: 121 observations,    complexity param=0.06936416
##   predicted class=noband  expected loss=0.4297521  P(node) =0.2494845
##     class counts:    52    69
##    probabilities: 0.430 0.570 
##   left son=10 (22 obs) right son=11 (99 obs)
##   Primary splits:
##       humidity          < 75.5    to the left,  improve=8.113866, (0 missing)
##       ink_pct           < 62.9    to the right, improve=6.323642, (0 missing)
##       location          splits as  RLLRR,       improve=3.867792, (0 missing)
##       press_speed       < 2184.5  to the left,  improve=3.734125, (0 missing)
##       anode_space_ratio < 98.485  to the left,  improve=3.617543, (0 missing)
##   Surrogate splits:
##       press_speed < 2450    to the right, agree=0.835, adj=0.091, (0 split)
##       wax         < 0.85    to the left,  agree=0.826, adj=0.045, (0 split)
##       hardener    < 0.25    to the left,  agree=0.826, adj=0.045, (0 split)
## 
## Node number 10: 22 observations
##   predicted class=band    expected loss=0.1818182  P(node) =0.04536082
##     class counts:    18     4
##    probabilities: 0.818 0.182 
## 
## Node number 11: 99 observations
##   predicted class=noband  expected loss=0.3434343  P(node) =0.2041237
##     class counts:    34    65
##    probabilities: 0.343 0.657

The variable importance table is given below:

##                     Overall
## anode_space_ratio 49.255578
## blade_pressure    23.201752
## caliper           15.764260
## current_density    5.927432
## cylinder_size     23.370388
## cylinder_type     28.803180
## grain_screened    26.322680
## hardener          23.936243
## humidity          53.401308
## ink_pct           47.602685
## ink_temperature   41.345016
## ink_type          30.770425
## location          39.270901
## paper_type         2.000000
## press             43.063079
## press_speed       72.939017
## press_type        29.891329
## proof_cut         43.882342
## proof_ink          2.100000
## roller_durometer  17.912351
## roughness         19.240835
## solvent_pct       37.649654
## solvent_type       1.600000
## varnish_pct       19.361990
## viscosity         58.850678
## wax               22.796918
## direct_steam       0.000000

In decreasing order of importance, the variables are:

  1. press_speed
  2. viscosity
  3. humidity
  4. anode_space_ratio
  5. ink_pct

d. Using the model: SVM

We next use a support vector machine (SVM) to build a classifier and to extract a feature importance list.

We first use a cost of 10, a moderately large value that penalizes misclassified training points heavily and therefore leaves only a narrow margin for misclassification.

## 
## Call:
## svm(formula = data[train, ]$band_type ~ ., data = data[train, 
##     ], method = "C-classification", kernel = "radial", cost = 10, 
##     gamma = 0.1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
## 
## Number of Support Vectors:  314
## 
##  ( 192 122 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  band noband
##   [1]   1   2   4   5   6  10  13  14  17  19  22  24  27  28  32  33  34
##  [18]  36  37  42  47  49  50  51  52  53  54  56  57  58  59  60  61  62
##  [35]  63  64  66  68  70  71  73  74  76  77  79  82  85  87  88  89  92
##  [52]  94  95  96  97  98  99 102 103 104 106 107 111 113 114 116 119 121
##  [69] 122 124 128 130 133 135 136 138 139 141 142 143 144 146 147 148 150
##  [86] 151 153 154 157 159 161 162 163 167 168 169 170 173 176 177 178 180
## [103] 181 185 187 190 191 193 197 198 199 200 201 202 203 204 206 207 209
## [120] 210 213 215 216 217 219 221 226 227 229 231 233 234 236 238 241 244
## [137] 246 247 248 250 251 252 254 255 256 258 259 260 262 268 270 272 273
## [154] 274 276 277 279 280 282 284 286 287 288 293 295 296 297 301 303 304
## [171] 305 306 307 308 310 311 313 315 316 317 318 319 320 321 322 325 326
## [188] 331 332 333 337 339   3   7   8  11  12  15  16  18  20  21  23  25
## [205]  26  29  31  35  38  39  41  43  44  45  55  65  69  72  78  80  81
## [222]  83  84  90  93 101 105 108 109 110 112 115 118 123 125 126 127 129
## [239] 131 132 134 137 140 145 149 152 155 160 164 166 171 172 174 175 184
## [256] 186 188 192 195 196 208 211 212 214 218 220 222 223 225 228 230 232
## [273] 235 237 239 242 243 245 249 253 257 261 263 264 265 266 267 269 271
## [290] 275 278 281 283 285 289 290 291 294 298 299 300 302 309 312 314 323
## [307] 327 328 329 330 334 335 336 338

Prediction using SVM: the accuracy is 84% with a kappa of 0.63.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction band noband
##     band     33      8
##     noband   15     90
##                                           
##                Accuracy : 0.8425          
##                  95% CI : (0.7731, 0.8974)
##     No Information Rate : 0.6712          
##     P-Value [Acc > NIR] : 2.324e-06       
##                                           
##                   Kappa : 0.6293          
##                                           
##  Mcnemar's Test P-Value : 0.2109          
##                                           
##             Sensitivity : 0.6875          
##             Specificity : 0.9184          
##          Pos Pred Value : 0.8049          
##          Neg Pred Value : 0.8571          
##              Prevalence : 0.3288          
##          Detection Rate : 0.2260          
##    Detection Prevalence : 0.2808          
##       Balanced Accuracy : 0.8029          
##                                           
##        'Positive' Class : band            
## 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  146 
## 
##  
##                        | svm.pred.2 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        33 |        15 |        48 | 
##                        |     0.688 |     0.312 |     0.329 | 
##                        |     0.805 |     0.143 |           | 
##                        |     0.226 |     0.103 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |         8 |        90 |        98 | 
##                        |     0.082 |     0.918 |     0.671 | 
##                        |     0.195 |     0.857 |           | 
##                        |     0.055 |     0.616 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        41 |       105 |       146 | 
##                        |     0.281 |     0.719 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 
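
As a cross-check, the accuracy and kappa reported above can be recomputed by hand from the four cell counts. This is a minimal base R sketch; the matrix is copied from the confusion matrix above, and the names `cm`, `observed`, `expected` and `kappa_hat` are illustrative.

```r
# Test-set confusion matrix copied from the output above
# (rows: predicted class, columns: reference class; filled column by column).
cm <- matrix(c(33, 15, 8, 90), nrow = 2,
             dimnames = list(Prediction = c("band", "noband"),
                             Reference  = c("band", "noband")))

n         <- sum(cm)
observed  <- sum(diag(cm)) / n                        # overall accuracy
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2     # agreement expected by chance
kappa_hat <- (observed - expected) / (1 - expected)   # Cohen's kappa

cat(sprintf("accuracy = %.4f, kappa = %.4f\n", observed, kappa_hat))
```

Kappa discounts the agreement that would be expected by chance from the row and column totals alone, which is why it is noticeably lower than the raw accuracy.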

The variable importance matrix (VIM) is not implemented in caret for “svm”; therefore, we will use the rminer package to obtain it.

2a. Only replacing missing data manually (no removal of 20%)

In this method, we do not remove rows that have more than 20% of their values missing (as in the previous case); instead, we estimate values for all the missing data and proceed with our model.

Instead of imputing the values of missing categorical variables, we add another level called “missing” to each of them. We also calculate the “mode” of each numeric variable and replace its missing values with that mode.
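
The two replacement rules can be sketched in base R as follows. This is only an illustration under stated assumptions: the data frame `toy` is hypothetical stand-in data, and the helpers `stat_mode` and `add_missing_level` are names introduced here, not functions from the original analysis.

```r
# Hypothetical toy data standing in for the printing dataset.
toy <- data.frame(
  ink_type  = factor(c("coated", NA, "uncoated", "coated")),
  viscosity = c(40, 40, NA, 55)
)

# Most frequent value of a vector, ignoring NAs
# (on ties, the value that appears first wins).
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  values <- unique(x)
  values[which.max(tabulate(match(x, values)))]
}

# Categorical column: add an explicit "missing" level instead of imputing.
add_missing_level <- function(f) {
  f <- as.character(f)
  f[is.na(f)] <- "missing"
  factor(f)
}

toy$ink_type <- add_missing_level(toy$ink_type)
toy$viscosity[is.na(toy$viscosity)] <- stat_mode(toy$viscosity)
print(toy)
```

Keeping “missing” as its own level lets a model treat the absence of a value as information in itself, whereas mode replacement simply fills a numeric gap with the most common observed value.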

Using the data, we create correlation plots to see how the different variables are correlated. If the dataset contains perfectly positively or negatively correlated attributes, the performance of the model will be affected by a problem called “multicollinearity”. Multicollinearity occurs when one predictor variable in a multiple regression model can be linearly predicted from the others with a high degree of accuracy. This can lead to skewed or misleading results.

We find the following correlations in the data, although to a lesser degree:

  1. “ink_pct” and “varnish_pct”
  2. “solvent_pct” and “varnish_pct”

This implies that we can remove “varnish_pct” from our training model to see whether there is an improvement.
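
A correlation screen of this kind can be sketched in base R. The data below are simulated and purely hypothetical (varnish_pct is constructed so that it moves against the other two percentages), and the cutoff of 0.4 is an illustrative threshold, not a value taken from the analysis.

```r
set.seed(1)
n <- 200

# Hypothetical percentages: varnish_pct is built from the other two plus noise.
ink_pct     <- rnorm(n, mean = 55, sd = 5)
solvent_pct <- rnorm(n, mean = 38, sd = 4)
varnish_pct <- 100 - ink_pct - solvent_pct + rnorm(n, sd = 1)
num <- data.frame(ink_pct, solvent_pct, varnish_pct)

corr <- cor(num)

# List each pair once (upper triangle) whose absolute correlation exceeds the cutoff.
cutoff <- 0.4   # illustrative threshold
idx <- which(abs(corr) > cutoff & upper.tri(corr), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r    = corr[idx]
)
print(flagged)
```

When one variable appears in every flagged pair, dropping that single variable is usually the cheapest way to remove the redundancy.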

As in the previous section, we continue to ignore the 12 variables identified there for this model as well.

a. Using the model: Random Forest

We first generate the training and test datasets based on a random sample. We use 70% of the supplied dataset as training set, and the remaining 30% as a test dataset.
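
A 70/30 split of this kind can be sketched in base R as follows. The seed is a hypothetical choice for reproducibility; the row count of 540 is the number of observations reported in the outputs of this section.

```r
set.seed(42)                     # hypothetical seed, for reproducibility only
n_rows <- 540                    # observations in this section's dataset

# Row indices for the training set (70%) and the remaining test rows (30%).
train <- sample(n_rows, size = round(0.7 * n_rows))
test  <- setdiff(seq_len(n_rows), train)

cat(length(train), "training rows,", length(test), "test rows\n")
```

With an index vector like `train`, the two subsets are then addressed as `data[train, ]` and `data[-train, ]`, which is the form that appears in the model calls printed in this report.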

We next build the random forest on the training data set using the randomForest library. With the 26 predictors retained in this case (see the caret summary further below), we use mtry=5, as shown in the call below; that is, 5 variables are tried at every split.

From the results, we observe the out-of-bag error rate to be 20.63%.

## 
## Call:
##  randomForest(formula = form, data = data[train, ], ntree = 1000,      mtry = 5, importance = TRUE, localImp = TRUE, replace = FALSE,      na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 20.63%
## Confusion matrix:
##        band noband class.error
## band    110     56   0.3373494
## noband   22    190   0.1037736
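
The error rates in this output follow directly from the four counts. A minimal base R sketch, with the matrix copied from the output above and illustrative variable names:

```r
# Out-of-bag confusion matrix copied from the output above
# (rows: actual class, columns: out-of-bag prediction; filled column by column).
oob <- matrix(c(110, 22, 56, 190), nrow = 2,
              dimnames = list(actual   = c("band", "noband"),
                              oob_pred = c("band", "noband")))

class_error <- 1 - diag(oob) / rowSums(oob)   # per-class error rate
oob_error   <- 1 - sum(diag(oob)) / sum(oob)  # overall out-of-bag error rate

print(round(class_error, 4))
print(round(oob_error, 4))
```

The per-class figures show that most of the out-of-bag error comes from banding cases being predicted as “noband”.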

The variable importance is shown in the plot below, followed by the error rates of the first 15 trees and the corresponding error plot. The error plot shows how the error rate changes as more trees are added to the forest.

##          OOB   band noband
##  [1,] 0.2734 0.2712 0.2750
##  [2,] 0.2800 0.3173 0.2479
##  [3,] 0.2867 0.2984 0.2774
##  [4,] 0.2987 0.2643 0.3258
##  [5,] 0.3068 0.3020 0.3105
##  [6,] 0.3199 0.2876 0.3454
##  [7,] 0.3092 0.2911 0.3234
##  [8,] 0.3142 0.3168 0.3122
##  [9,] 0.2938 0.2699 0.3125
## [10,] 0.3029 0.2866 0.3158
## [11,] 0.3013 0.2909 0.3095
## [12,] 0.2766 0.3012 0.2571
## [13,] 0.2872 0.3012 0.2762
## [14,] 0.2838 0.3253 0.2512
## [15,] 0.2918 0.3494 0.2464

In this case, the variables scoring between 6 and 8 on the Gini index (and therefore the most important) are:

  1. press_speed
  2. solvent_pct
  3. viscosity
  4. ink_temperature
  5. ink_pct
  6. humidity

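We now predict on the test data set and create the corresponding confusion matrix. The matrix printed immediately below is the forest's out-of-bag confusion matrix for the 378 training observations, repeated from above; the test-set results appear in the cross-table for band.pred that follows.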

##        band noband class.error
## band    110     56   0.3373494
## noband   22    190   0.1037736

On the 162 test observations, the random forest classifies 141 correctly (45 of 62 band and 96 of 100 noband), an accuracy of about 87%, as shown in the confusion matrix and cross-table below.

##         pred
## true     band noband
##   band     45     17
##   noband    4     96
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  162 
## 
##  
##                        | band.pred 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        45 |        17 |        62 | 
##                        |     0.726 |     0.274 |     0.383 | 
##                        |     0.918 |     0.150 |           | 
##                        |     0.278 |     0.105 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |         4 |        96 |       100 | 
##                        |     0.040 |     0.960 |     0.617 | 
##                        |     0.082 |     0.850 |           | 
##                        |     0.025 |     0.593 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        49 |       113 |       162 | 
##                        |     0.302 |     0.698 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 

We now use repeated cross-validation (10 folds, repeated 10 times) to see whether we get better classification performance. As the output below shows, the best cross-validated accuracy is 0.785 (kappa 0.55), obtained with mtry=21; on the test set this model reaches an accuracy of 0.858 with a kappa of 0.685.

##   model parameter                         label forReg forClass probModel
## 1    rf      mtry #Randomly Selected Predictors   TRUE     TRUE      TRUE
##    user  system elapsed 
##  145.98    2.13  158.75
## Random Forest 
## 
## 378 samples
##  26 predictor
##   2 classes: 'band', 'noband' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 340, 341, 339, 341, 340, 340, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7530058  0.4748330
##   21    0.7851041  0.5523332
##   41    0.7787026  0.5388084
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 21.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction band noband
##     band     43      4
##     noband   19     96
##                                           
##                Accuracy : 0.858           
##                  95% CI : (0.7946, 0.9078)
##     No Information Rate : 0.6173          
##     P-Value [Acc > NIR] : 1.285e-11       
##                                           
##                   Kappa : 0.685           
##                                           
##  Mcnemar's Test P-Value : 0.003509        
##                                           
##             Sensitivity : 0.6935          
##             Specificity : 0.9600          
##          Pos Pred Value : 0.9149          
##          Neg Pred Value : 0.8348          
##              Prevalence : 0.3827          
##          Detection Rate : 0.2654          
##    Detection Prevalence : 0.2901          
##       Balanced Accuracy : 0.8268          
##                                           
##        'Positive' Class : band            
## 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  162 
## 
##  
##                        | band.pred.1 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        43 |        19 |        62 | 
##                        |     0.694 |     0.306 |     0.383 | 
##                        |     0.915 |     0.165 |           | 
##                        |     0.265 |     0.117 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |         4 |        96 |       100 | 
##                        |     0.040 |     0.960 |     0.617 | 
##                        |     0.085 |     0.835 |           | 
##                        |     0.025 |     0.593 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        47 |       115 |       162 | 
##                        |     0.290 |     0.710 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 
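
The resampling scheme named in the output above (10 folds, repeated 10 times) can be illustrated in base R. This is a simplified sketch, not caret's implementation: the folds here are not stratified by class, which is why the fitting-set sizes differ slightly from those in the caret summary. The seed and variable names are hypothetical.

```r
set.seed(7)        # hypothetical seed
n_train <- 378     # training rows reported in the output above
k       <- 10      # folds
repeats <- 10      # repetitions

# For each repetition, shuffle the row indices and deal them into k folds.
folds <- lapply(seq_len(repeats), function(r) {
  split(sample(n_train), rep(seq_len(k), length.out = n_train))
})

# Number of rows used for fitting when each fold of the first repetition is held out.
fit_sizes <- n_train - lengths(folds[[1]])
print(unname(fit_sizes))
```

Each model is fitted 100 times in total (10 folds times 10 repetitions), and the accuracy and kappa in the tuning table are averages over those fits.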

For a large class imbalance (which is not the case in our example, as band and noband are well balanced), the caret function train can subsample the dataset to balance the classes before model fitting. This is done by setting the option sampling=“down” in the trainControl object that is passed to train.
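
What down-sampling does can be shown by hand in base R. This is an illustration of the idea only, not caret's internal code; the label vector is hypothetical but uses the class counts of the full dataset (228 band, 312 noband), and the seed is arbitrary.

```r
set.seed(3)  # hypothetical seed

# Hypothetical label vector with the class counts of the full dataset.
band_type <- factor(rep(c("band", "noband"), times = c(228, 312)))

# Down-sampling: keep every row of the smallest class and an equally
# sized random subset of each larger class.
n_min <- min(table(band_type))
keep  <- unlist(lapply(split(seq_along(band_type), band_type),
                       function(rows) rows[sample.int(length(rows), n_min)]))

print(table(band_type[keep]))
```

The price of the balanced classes is that some majority-class rows are discarded, so the model is fitted on fewer observations.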

The confusion matrix and cross-table for the down-sampled random forest are shown below.

##    user  system elapsed 
##  114.42    1.36  118.07
## Random Forest 
## 
## 378 samples
##  26 predictor
##   2 classes: 'band', 'noband' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 341, 340, 340, 341, 339, 340, ... 
## Addtional sampling using down-sampling
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7812062  0.5493503
##   21    0.7695791  0.5290100
##   41    0.7683426  0.5252884
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction band noband
##     band     41      9
##     noband   21     91
##                                           
##                Accuracy : 0.8148          
##                  95% CI : (0.7463, 0.8714)
##     No Information Rate : 0.6173          
##     P-Value [Acc > NIR] : 4.355e-08       
##                                           
##                   Kappa : 0.5931          
##                                           
##  Mcnemar's Test P-Value : 0.04461         
##                                           
##             Sensitivity : 0.6613          
##             Specificity : 0.9100          
##          Pos Pred Value : 0.8200          
##          Neg Pred Value : 0.8125          
##              Prevalence : 0.3827          
##          Detection Rate : 0.2531          
##    Detection Prevalence : 0.3086          
##       Balanced Accuracy : 0.7856          
##                                           
##        'Positive' Class : band            
## 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  162 
## 
##  
##                        | band.pred.2 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        41 |        21 |        62 | 
##                        |     0.661 |     0.339 |     0.383 | 
##                        |     0.820 |     0.188 |           | 
##                        |     0.253 |     0.130 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |         9 |        91 |       100 | 
##                        |     0.090 |     0.910 |     0.617 | 
##                        |     0.180 |     0.812 |           | 
##                        |     0.056 |     0.562 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        50 |       112 |       162 | 
##                        |     0.309 |     0.691 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 

b. Using the model: Neural Network

We build a NN classifier model in this case from the dataset free of missing values, using 2 hidden units (the 87 weights reported below are consistent with 41 inputs feeding a single hidden layer of 2 units).

## # weights:  87
## initial  value 260.360533 
## final  value 259.203941 
## converged
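
The weight count in this output can be reproduced with a short calculation. The sketch below assumes a single hidden layer with bias terms, one output unit and no skip-layer connections, and takes the 41 inputs from the number of entries in the relative importance listing further below; the function name `n_weights` is introduced here for illustration.

```r
# Number of weights in a single-hidden-layer network with bias terms:
# (inputs + bias) into each hidden unit, plus (hidden units + bias) into each output.
n_weights <- function(inputs, hidden, outputs = 1) {
  (inputs + 1) * hidden + (hidden + 1) * outputs
}

print(n_weights(inputs = 41, hidden = 2))
```

Under these assumptions the count is 42 * 2 + 3 * 1 = 87, which matches the figure in the output.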

We then test the model on the test dataset and obtain the misclassification table shown below. The model predicts “noband” for all 162 test observations, so none of the 62 banding cases is detected and the accuracy (100/162, about 62%) is no better than the no-information rate. This is clearly worse than the randomForest results.

##         predicted
## true     noband
##   band       62
##   noband    100

The misclassification table can also be obtained by using the gmodels library:

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  162 
## 
##  
##                        | predicted 
## data$band_type[-train] |    noband | Row Total | 
## -----------------------|-----------|-----------|
##                   band |        62 |        62 | 
##                        |     0.383 |           | 
## -----------------------|-----------|-----------|
##                 noband |       100 |       100 | 
##                        |     0.617 |           | 
## -----------------------|-----------|-----------|
##           Column Total |       162 |       162 | 
## -----------------------|-----------|-----------|
## 
## 

We can now list the relative importance of the variables using a sourced function (courtesy: https://beckmw.wordpress.com/2013/08/12/variable-importance-in-neural-networks/)

##                           rel.imp              x.names
## press                -1.000000000                press
## solvent_pct          -0.834810109          solvent_pct
## press_typeWoodHoe70  -0.730755732  press_typeWoodHoe70
## grain_screenedyes    -0.526956828    grain_screenedyes
## solvent_typemissing  -0.498624082  solvent_typemissing
## press_typeMotter70   -0.436915259   press_typeMotter70
## direct_steamyes      -0.388076866      direct_steamyes
## paper_typeuncoated   -0.339440245   paper_typeuncoated
## cylinder_sizespiegel -0.309955529 cylinder_sizespiegel
## ink_pct              -0.255724903              ink_pct
## locationscandanavian -0.250135003 locationscandanavian
## wax                  -0.109553336                  wax
## ink_temperature      -0.086142170      ink_temperature
## roughness            -0.054880847            roughness
## proof_inkyes         -0.049629978         proof_inkyes
## caliper               0.000000000              caliper
## cylinder_typeno       0.001026903      cylinder_typeno
## blade_pressure        0.004262042       blade_pressure
## cylinder_sizetabloid  0.008538332 cylinder_sizetabloid
## locationmissing       0.010877133      locationmissing
## press_speed           0.063965114          press_speed
## current_density       0.076737726      current_density
## cylinder_sizemissing  0.120028831 cylinder_sizemissing
## locationUSA           0.130925774          locationUSA
## humidity              0.143374755             humidity
## locationmideuropean   0.152655043  locationmideuropean
## ink_typecover         0.213428644        ink_typecover
## solvent_typenaptha    0.222300942   solvent_typenaptha
## press_typeMotter94    0.269144105   press_typeMotter94
## proof_cut             0.276827428            proof_cut
## ink_typeuncoated      0.301945001     ink_typeuncoated
## hardener              0.325612844             hardener
## solvent_typexylol     0.335009201    solvent_typexylol
## proof_inkno           0.423948155          proof_inkno
## cylinder_typeyes      0.694660055     cylinder_typeyes
## anode_space_ratio     0.707571780    anode_space_ratio
## grain_screenedno      0.710663466     grain_screenedno
## viscosity             0.736837240            viscosity
## roller_durometer      0.781942884     roller_durometer
## direct_steamno        0.791373930       direct_steamno
## paper_typesuper       0.866745902      paper_typesuper

The bar plot tells us that the variables press and paper_type (super) have the strongest negative and positive relationships, respectively, with the response variable band_type.

In decreasing order of positive relative importance:

  1. paper_type - super
  2. direct_steam - no
  3. roller_durometer
  4. viscosity
  5. grain_screened - no

c. Using the model: Decision Tree

We consider the use of a decision tree, modelled using the rpart function. To get a large tree, we make the complexity parameter (cp) very small.

The output lists all the trees considered in the model, giving for each the complexity parameter, number of splits, resubstitution error rate, cross-validated error rate and the associated standard error.

From the complexity table we observe that the lowest cross-validated relative error (xerror) of 0.58 occurs at 12 splits, that is, at a tree size of 13.

To reduce this tree size we can do pruning and repeated cross-validation. Doing this gives us a tree with 4 splits (5 terminal nodes), with a relative resubstitution error of ~0.59.

This again is not a good predictor due to the large error rate.
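
The choice of tree size can be made explicit with a few lines of base R. In this sketch the data frame `cp_table` is copied by hand from the complexity table printed in this section, and the root node error (228/540) is taken from the same output; the variable names are illustrative.

```r
# CP, number of splits and cross-validated relative error,
# copied by hand from the complexity table in this section.
cp_table <- data.frame(
  CP     = c(0.2412281, 0.0526316, 0.0285088, 0.0219298, 0.0175439, 0.0131579,
             0.0087719, 0.0073099, 0.0065789, 0.0054825, 0.0043860, 0.0014620,
             0.0000100),
  nsplit = c(0, 1, 4, 8, 10, 12, 15, 29, 33, 40, 44, 62, 65),
  xerror = c(1.00000, 0.75877, 0.73684, 0.62719, 0.58333, 0.57895,
             0.61842, 0.61842, 0.59649, 0.59649, 0.59649, 0.61404, 0.61842)
)
root_error <- 228 / 540   # root node error from the rpart output

# Row with the smallest cross-validated error.
best <- cp_table[which.min(cp_table$xerror), ]

tree_size    <- best$nsplit + 1            # terminal nodes = splits + 1
abs_cv_error <- best$xerror * root_error   # relative error scaled to an absolute rate

cat("tree size:", tree_size, " absolute CV error:", round(abs_cv_error, 3), "\n")
```

The errors in the table are relative to the root node error, so multiplying by 228/540 converts them into a misclassification rate. With a fitted rpart object, the selected `best$CP` could then be passed to rpart's `prune` function.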

## 
## Classification tree:
## rpart(formula = data$band_type ~ ., data = data, method = "class", 
##     control = rpart.control(minsplit = 4, cp = 1e-05))
## 
## Variables actually used in tree construction:
##  [1] anode_space_ratio blade_pressure    caliper          
##  [4] current_density   cylinder_type     grain_screened   
##  [7] hardener          humidity          ink_pct          
## [10] ink_temperature   ink_type          location         
## [13] paper_type        press             press_speed      
## [16] press_type        proof_cut         proof_ink        
## [19] roller_durometer  roughness         solvent_pct      
## [22] solvent_type      viscosity         wax              
## 
## Root node error: 228/540 = 0.42222
## 
## n= 540 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.2412281      0  1.000000 1.00000 0.050340
## 2  0.0526316      1  0.758772 0.75877 0.047558
## 3  0.0285088      4  0.592105 0.73684 0.047184
## 4  0.0219298      8  0.469298 0.62719 0.044971
## 5  0.0175439     10  0.425439 0.58333 0.043913
## 6  0.0131579     12  0.390351 0.57895 0.043801
## 7  0.0087719     15  0.350877 0.61842 0.044768
## 8  0.0073099     29  0.228070 0.61842 0.044768
## 9  0.0065789     33  0.197368 0.59649 0.044241
## 10 0.0054825     40  0.149123 0.59649 0.044241
## 11 0.0043860     44  0.127193 0.59649 0.044241
## 12 0.0014620     62  0.048246 0.61404 0.044664
## 13 0.0000100     65  0.043860 0.61842 0.044768

## Call:
## rpart(formula = data$band_type ~ ., data = data, method = "class", 
##     control = rpart.control(minsplit = 4, cp = 1e-05))
##   n= 540 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.24122807      0 1.0000000 1.0000000 0.05033997
## 2 0.05263158      1 0.7587719 0.7587719 0.04755808
## 3 0.05000000      4 0.5921053 0.7368421 0.04718396
## 
## Variable importance
##   solvent_type      proof_ink grain_screened   direct_steam     paper_type 
##             23             21             12             10              9 
##     press_type       humidity    press_speed          press  cylinder_size 
##              7              5              4              3              2 
##      roughness 
##              2 
## 
## Node number 1: 540 observations,    complexity param=0.2412281
##   predicted class=noband  expected loss=0.4222222  P(node) =1
##     class counts:   228   312
##    probabilities: 0.422 0.578 
##   left son=2 (55 obs) right son=3 (485 obs)
##   Primary splits:
##       solvent_type   splits as  RLRR,        improve=40.88522, (0 missing)
##       proof_ink      splits as  LRR,         improve=37.53662, (0 missing)
##       grain_screened splits as  LRR,         improve=26.53113, (0 missing)
##       location       splits as  RRLRR,       improve=23.53878, (0 missing)
##       press_speed    < 2184.5  to the left,  improve=17.60284, (0 missing)
##   Surrogate splits:
##       proof_ink      splits as  LRR,  agree=0.996, adj=0.964, (0 split)
##       grain_screened splits as  LRR,  agree=0.952, adj=0.527, (0 split)
##       direct_steam   splits as  LRR,  agree=0.944, adj=0.455, (0 split)
##       paper_type     splits as  RLR,  agree=0.941, adj=0.418, (0 split)
##       cylinder_size  splits as  RLRR, agree=0.904, adj=0.055, (0 split)
## 
## Node number 2: 55 observations
##   predicted class=band    expected loss=0  P(node) =0.1018519
##     class counts:    55     0
##    probabilities: 1.000 0.000 
## 
## Node number 3: 485 observations,    complexity param=0.05263158
##   predicted class=noband  expected loss=0.356701  P(node) =0.8981481
##     class counts:   173   312
##    probabilities: 0.357 0.643 
##   left son=6 (167 obs) right son=7 (318 obs)
##   Primary splits:
##       press_type     splits as  RRRL,        improve=13.743870, (0 missing)
##       press_speed    < 2184.5  to the left,  improve=13.282210, (0 missing)
##       press          < 822.5   to the left,  improve=13.017350, (0 missing)
##       ink_type       splits as  RLL,         improve=10.840790, (0 missing)
##       grain_screened splits as  LRL,         improve= 9.577898, (0 missing)
##   Surrogate splits:
##       press         < 818.5   to the left,  agree=0.777, adj=0.353, (0 split)
##       roughness     < 0.53125 to the left,  agree=0.753, adj=0.281, (0 split)
##       cylinder_size splits as  R-LR,        agree=0.697, adj=0.120, (0 split)
##       humidity      < 84.5    to the right, agree=0.676, adj=0.060, (0 split)
##       ink_pct       < 64      to the right, agree=0.672, adj=0.048, (0 split)
## 
## Node number 6: 167 observations,    complexity param=0.05263158
##   predicted class=band    expected loss=0.4790419  P(node) =0.3092593
##     class counts:    87    80
##    probabilities: 0.521 0.479 
##   left son=12 (46 obs) right son=13 (121 obs)
##   Primary splits:
##       press_speed   < 1678    to the left,  improve=7.308378, (0 missing)
##       ink_pct       < 62.9    to the right, improve=5.404575, (0 missing)
##       cylinder_type splits as  LLR,         improve=5.085740, (0 missing)
##       humidity      < 73.5    to the left,  improve=5.077850, (0 missing)
##       location      splits as  RLLRR,       improve=4.147136, (0 missing)
##   Surrogate splits:
##       solvent_type     splits as  R--L,        agree=0.749, adj=0.087, (0 split)
##       location         splits as  RLRRR,       agree=0.743, adj=0.065, (0 split)
##       blade_pressure   < 24.5    to the left,  agree=0.743, adj=0.065, (0 split)
##       ink_temperature  < 19.05   to the right, agree=0.737, adj=0.043, (0 split)
##       roller_durometer < 29      to the left,  agree=0.737, adj=0.043, (0 split)
## 
## Node number 7: 318 observations
##   predicted class=noband  expected loss=0.2704403  P(node) =0.5888889
##     class counts:    86   232
##    probabilities: 0.270 0.730 
## 
## Node number 12: 46 observations
##   predicted class=band    expected loss=0.2391304  P(node) =0.08518519
##     class counts:    35    11
##    probabilities: 0.761 0.239 
## 
## Node number 13: 121 observations,    complexity param=0.05263158
##   predicted class=noband  expected loss=0.4297521  P(node) =0.2240741
##     class counts:    52    69
##    probabilities: 0.430 0.570 
##   left son=26 (22 obs) right son=27 (99 obs)
##   Primary splits:
##       humidity          < 75.5    to the left,  improve=8.113866, (0 missing)
##       ink_pct           < 62.9    to the right, improve=6.323642, (0 missing)
##       location          splits as  RLLRR,       improve=3.867792, (0 missing)
##       press_speed       < 2184.5  to the left,  improve=3.734125, (0 missing)
##       anode_space_ratio < 98.485  to the left,  improve=3.617543, (0 missing)
##   Surrogate splits:
##       press_speed < 2450    to the right, agree=0.835, adj=0.091, (0 split)
##       wax         < 0.85    to the left,  agree=0.826, adj=0.045, (0 split)
##       hardener    < 0.25    to the left,  agree=0.826, adj=0.045, (0 split)
## 
## Node number 26: 22 observations
##   predicted class=band    expected loss=0.1818182  P(node) =0.04074074
##     class counts:    18     4
##    probabilities: 0.818 0.182 
## 
## Node number 27: 99 observations
##   predicted class=noband  expected loss=0.3434343  P(node) =0.1833333
##     class counts:    34    65
##    probabilities: 0.343 0.657

The variable importance table is given below:

##                     Overall
## anode_space_ratio 53.338911
## blade_pressure    23.201752
## caliper           15.764260
## current_density    8.377432
## cylinder_size     22.520388
## cylinder_type     28.803180
## grain_screened    52.853813
## hardener          27.349880
## humidity          50.534641
## ink_pct           53.487118
## ink_temperature   38.678349
## ink_type          31.431536
## location          63.559683
## paper_type         2.000000
## press             43.563079
## press_speed       89.986305
## press_type        29.891329
## proof_cut         47.383511
## proof_ink         39.636624
## roller_durometer  19.512351
## roughness         19.240835
## solvent_pct       40.116320
## solvent_type      40.885223
## viscosity         59.128456
## wax               22.961624
## direct_steam       0.000000

In decreasing order of importance, the top five variables are:

  1. press_speed
  2. location
  3. viscosity
  4. ink_pct
  5. anode_space_ratio

d. Using the model: SVM

We consider the use of support vector machines to build a classifier and to also extract the feature importance list.

First we use a cost of 10 (fairly large), which penalizes misclassified training points heavily and therefore leaves only a narrow margin for misclassification.

## 
## Call:
## svm(formula = data[train, ]$band_type ~ ., data = data[train, 
##     ], method = "C-classification", kernel = "radial", cost = 10, 
##     gamma = 0.1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
## 
## Number of Support Vectors:  335
## 
##  ( 184 151 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  band noband
##   [1]   1   4   5   6   8  10  13  14  17  18  20  22  23  26  28  29  30
##  [18]  31  33  34  36  38  40  41  43  44  45  46  47  48  50  52  53  54
##  [35]  59  61  63  65  68  71  73  75  76  80  81  83  84  87  88  89  90
##  [52]  92  93  94  96  97 101 102 104 105 106 108 110 118 121 123 125 129
##  [69] 130 132 135 136 137 138 139 141 142 144 145 146 148 149 151 152 155
##  [86] 157 160 161 162 165 167 168 173 176 177 178 180 181 185 186 191 192
## [103] 198 199 200 201 202 208 214 217 218 222 223 224 227 230 231 233 235
## [120] 236 240 243 244 246 247 251 252 254 257 263 264 266 267 271 274 279
## [137] 281 283 284 285 286 288 292 294 295 296 297 299 302 303 305 306 312
## [154] 313 314 315 317 321 323 327 329 330 336 337 339 342 343 345 346 350
## [171] 353 354 356 359 361 362 366 370 371 372 373 374 375 377   2   9  11
## [188]  12  15  19  21  24  25  27  35  39  49  57  58  62  64  66  69  70
## [205]  72  74  77  78  79  82  85  86  91  95  98  99 100 103 107 112 113
## [222] 115 117 119 120 122 126 127 128 131 133 134 143 147 150 153 158 159
## [239] 163 166 169 170 171 172 174 175 182 184 187 188 189 194 197 203 204
## [256] 205 206 207 209 210 211 212 213 215 216 219 220 221 225 229 232 234
## [273] 237 238 241 242 245 253 255 256 258 259 260 261 262 265 269 270 272
## [290] 273 275 277 278 280 282 287 289 290 291 293 298 300 301 304 307 308
## [307] 309 310 311 318 319 320 324 325 326 328 334 335 338 341 344 347 348
## [324] 351 352 355 357 358 360 363 364 365 367 368 369

Prediction using SVM: this gives an accuracy of 86.4% with a kappa of 0.71, well above the no-information rate of 61.7%.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction band noband
##     band     48      8
##     noband   14     92
##                                           
##                Accuracy : 0.8642          
##                  95% CI : (0.8016, 0.9129)
##     No Information Rate : 0.6173          
##     P-Value [Acc > NIR] : 3.346e-12       
##                                           
##                   Kappa : 0.7072          
##                                           
##  Mcnemar's Test P-Value : 0.2864          
##                                           
##             Sensitivity : 0.7742          
##             Specificity : 0.9200          
##          Pos Pred Value : 0.8571          
##          Neg Pred Value : 0.8679          
##              Prevalence : 0.3827          
##          Detection Rate : 0.2963          
##    Detection Prevalence : 0.3457          
##       Balanced Accuracy : 0.8471          
##                                           
##        'Positive' Class : band            
## 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  162 
## 
##  
##                        | svm.pred.2 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        48 |        14 |        62 | 
##                        |     0.774 |     0.226 |     0.383 | 
##                        |     0.857 |     0.132 |           | 
##                        |     0.296 |     0.086 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |         8 |        92 |       100 | 
##                        |     0.080 |     0.920 |     0.617 | 
##                        |     0.143 |     0.868 |           | 
##                        |     0.049 |     0.568 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        56 |       106 |       162 | 
##                        |     0.346 |     0.654 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 
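The accuracy, kappa, and no-information rate reported above can be recomputed by hand from the four cells of the confusion matrix. The following is a minimal sketch in base R; the variable names are ours and are not taken from the original analysis code.

```r
# Test-set confusion matrix copied from the output above
# (rows = predicted class, columns = reference class)
cm <- matrix(c(48, 14, 8, 92), nrow = 2,
             dimnames = list(Prediction = c("band", "noband"),
                             Reference  = c("band", "noband")))

n            <- sum(cm)
accuracy     <- sum(diag(cm)) / n                       # observed agreement
expected     <- sum(rowSums(cm) * colSums(cm)) / n^2    # agreement expected by chance
cohen_kappa  <- (accuracy - expected) / (1 - expected)  # Cohen's kappa
nir          <- max(colSums(cm)) / n                    # no-information rate

round(c(accuracy = accuracy, kappa = cohen_kappa, nir = nir), 4)
```

Kappa corrects the raw accuracy for the agreement that would be expected by chance given the class frequencies, which is why it is lower than the accuracy itself.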

caret does not implement a variable importance measure for “svm” models, so we will use the rminer package to obtain the variable importance matrix (VIM).

3. Removing 20% then applying KNN Imputation

In our second method, we first remove the rows in which more than 20% of the values are missing, and consider only the remaining rows for our analysis.

Then, we use the K-Nearest Neighbor function (KNN) for imputation of missing values.
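The two steps can be sketched in base R as follows. This is only an illustration: the text does not say which KNN implementation was used, so the `knn_impute` helper, the toy data, and the choice of k = 2 are all assumptions. The sketch handles numeric columns only and uses the fully observed rows as candidate neighbours.

```r
# Hypothetical toy data with six numeric columns (not the real dataset)
toy <- data.frame(
  press_speed = c(1800, 2000, 2100, 2200, 1900,   NA),
  viscosity   = c(  40,   45,   NA,   52,   42,   NA),
  humidity    = c(  75,   80,   78,   82,   NA,   NA),
  ink_pct     = c(  55,   58,   60,   62,   56,   57),
  solvent_pct = c(  35,   33,   31,   30,   36,   34),
  wax         = c( 2.0,  2.5,  2.7,  3.0,  2.2,  2.4)
)

# Step 1: drop rows in which more than 20% of the values are missing
drop_sparse_rows <- function(df, max_missing = 0.20) {
  df[rowMeans(is.na(df)) <= max_missing, , drop = FALSE]
}

# Step 2: fill each missing value with the mean of the k nearest complete rows
knn_impute <- function(df, k = 2) {
  x <- as.matrix(df)
  z <- scale(x)                        # standardise so no column dominates the distance
  donors <- which(complete.cases(x))   # only fully observed rows act as neighbours
  for (i in which(!complete.cases(x))) {
    obs <- !is.na(x[i, ])              # columns observed in the incomplete row
    d   <- sqrt(colSums((t(z[donors, obs, drop = FALSE]) - z[i, obs])^2))
    nn  <- donors[order(d)[seq_len(min(k, length(donors)))]]
    x[i, !obs] <- colMeans(x[nn, !obs, drop = FALSE])
  }
  as.data.frame(x)
}

kept    <- drop_sparse_rows(toy)   # the last row (50% missing) is removed
imputed <- knn_impute(kept, k = 2)
imputed
```

Rows with a single missing value out of six (about 17%) survive the filter and are then imputed; the row with three missing values is discarded.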

a. Using the model: Random Forest

We first generate the training and test datasets based on a random sample. We use 70% of the supplied dataset as training set, and the remaining 30% as a test dataset.

We next build the random forest on the training data set using the randomForest library. Since the number of variables is 39 in this case, we use mtry=6, close to the square root of 39 (the usual rule of thumb for classification forests); that is, 6 variables are tried at every split.
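As a rough sketch of the split and of the choice of mtry, the sizes can be reproduced in base R. The seed and the use of `floor()` are our assumptions; the row count of 485 is the `n` reported in the rpart output later in this section.

```r
set.seed(42)                        # arbitrary seed, for reproducibility only

n_rows  <- 485                      # rows left after the 20% filter
n_train <- floor(0.70 * n_rows)     # size of the training set
train   <- sample(n_rows, n_train)  # row indices of the training set
n_test  <- n_rows - length(train)   # the remaining rows form the test set

n_vars <- 39                        # number of variables quoted in the text
mtry   <- floor(sqrt(n_vars))       # square-root rule for classification

c(train = n_train, test = n_test, mtry = mtry)
```

This gives 339 training rows and 146 test rows, matching the sample sizes that appear in the outputs of this section, and mtry = 6.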

## 
## Call:
##  randomForest(formula = form, data = data[train, ], ntree = 1000,      mtry = 6, importance = TRUE, localImp = TRUE, replace = FALSE,      na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 25.07%
## Confusion matrix:
##        band noband class.error
## band     59     66  0.52800000
## noband   19    195  0.08878505

From the results, we observe the out-of-bag error rate to be 25.07%.
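The out-of-bag error and the per-class errors follow directly from the OOB confusion matrix. A minimal check in base R, with variable names of our own choosing:

```r
# OOB confusion matrix from the randomForest output above
# (rows = true class, columns = OOB-predicted class)
oob <- matrix(c(59, 19, 66, 195), nrow = 2,
              dimnames = list(true = c("band", "noband"),
                              predicted = c("band", "noband")))

class_error <- 1 - diag(oob) / rowSums(oob)    # per-class error rate
oob_error   <- 1 - sum(diag(oob)) / sum(oob)   # overall OOB error rate

round(class_error, 4)
round(100 * oob_error, 2)
```

The class error of 0.528 for “band” shows that, out of bag, more than half of the banding cases are classified as “noband”, even though the overall error rate is only about 25%.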

##          OOB   band noband
##  [1,] 0.3710 0.5106 0.2857
##  [2,] 0.3969 0.5373 0.3228
##  [3,] 0.3701 0.4624 0.3168
##  [4,] 0.3298 0.4216 0.2778
##  [5,] 0.3355 0.4775 0.2526
##  [6,] 0.3691 0.5085 0.2864
##  [7,] 0.3364 0.4417 0.2745
##  [8,] 0.3587 0.4833 0.2871
##  [9,] 0.3333 0.4553 0.2619
## [10,] 0.3552 0.4839 0.2796
## [11,] 0.3620 0.5403 0.2582
## [12,] 0.3343 0.5040 0.2347
## [13,] 0.3156 0.4960 0.2103
## [14,] 0.3304 0.5120 0.2243
## [15,] 0.3304 0.5280 0.2150

The variable importance is shown in the plot below. The table above lists the out-of-bag (OOB) and per-class error rates after each of the first 15 trees, and the corresponding error plot shows the change in error rate as more trees are added to the forest.

##        band noband class.error
## band     59     66  0.52800000
## noband   19    195  0.08878505

In this case, the variables with a value greater than 6 on the Gini index (and therefore the most important) are:

  1. press_speed
  2. viscosity
  3. ink_pct
  4. solvent_pct
  5. humidity

We now proceed to predict using the test data set and create the confusion matrix for the same.

##         pred
## true     band noband
##   band     32     16
##   noband    6     92
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  146 
## 
##  
##                        | band.pred 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        32 |        16 |        48 | 
##                        |     0.667 |     0.333 |     0.329 | 
##                        |     0.842 |     0.148 |           | 
##                        |     0.219 |     0.110 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |         6 |        92 |        98 | 
##                        |     0.061 |     0.939 |     0.671 | 
##                        |     0.158 |     0.852 |           | 
##                        |     0.041 |     0.630 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        38 |       108 |       146 | 
##                        |     0.260 |     0.740 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 

We now use repeated cross-validation (10 folds, repeated 10 times) to see if we get better classification performance. The resampling results, confusion matrix, and cross-table are given below. As can be seen, the model selected is the one with mtry=36, which has a cross-validated accuracy of 0.75; on the test set it reaches an accuracy of 0.85 with a kappa of 0.64.
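To make the resampling scheme concrete, the sketch below builds the held-out index sets for 10 folds repeated 10 times on 339 training rows, using base R only. The helper `repeated_cv_folds` is hypothetical; the output in this section appears to come from caret, which additionally stratifies the folds by class, so the exact fold membership will differ.

```r
set.seed(1)   # arbitrary seed

# Held-out row indices for `repeats` independent k-fold partitions of n rows
repeated_cv_folds <- function(n, k = 10, repeats = 10) {
  folds <- list()
  for (r in seq_len(repeats)) {
    fold_id <- sample(rep(seq_len(k), length.out = n))   # random fold label per row
    for (f in seq_len(k)) {
      folds[[sprintf("Fold%02d.Rep%02d", f, r)]] <- which(fold_id == f)
    }
  }
  folds
}

folds       <- repeated_cv_folds(n = 339, k = 10, repeats = 10)
train_sizes <- 339 - lengths(folds)   # rows available for fitting in each resample

length(folds)        # number of resamples
table(train_sizes)   # training-set sizes across the resamples
```

Each of the 100 resamples fits the model on 305 or 306 rows and evaluates it on the remaining 33 or 34, which is consistent with the "Summary of sample sizes" line in the output.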

##   model parameter                         label forReg forClass probModel
## 1    rf      mtry #Randomly Selected Predictors   TRUE     TRUE      TRUE
##    user  system elapsed 
##  119.02    0.82  129.08
## Random Forest 
## 
## 339 samples
##  27 predictor
##   2 classes: 'band', 'noband' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 305, 305, 305, 305, 305, 306, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7083957  0.2747511
##   19    0.7436519  0.3996211
##   36    0.7453929  0.4037285
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 36.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction band noband
##     band     33      7
##     noband   15     91
##                                           
##                Accuracy : 0.8493          
##                  95% CI : (0.7808, 0.9031)
##     No Information Rate : 0.6712          
##     P-Value [Acc > NIR] : 8.551e-07       
##                                           
##                   Kappa : 0.6434          
##                                           
##  Mcnemar's Test P-Value : 0.1356          
##                                           
##             Sensitivity : 0.6875          
##             Specificity : 0.9286          
##          Pos Pred Value : 0.8250          
##          Neg Pred Value : 0.8585          
##              Prevalence : 0.3288          
##          Detection Rate : 0.2260          
##    Detection Prevalence : 0.2740          
##       Balanced Accuracy : 0.8080          
##                                           
##        'Positive' Class : band            
## 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  146 
## 
##  
##                        | band.pred.1 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        33 |        15 |        48 | 
##                        |     0.688 |     0.312 |     0.329 | 
##                        |     0.825 |     0.142 |           | 
##                        |     0.226 |     0.103 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |         7 |        91 |        98 | 
##                        |     0.071 |     0.929 |     0.671 | 
##                        |     0.175 |     0.858 |           | 
##                        |     0.048 |     0.623 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        40 |       106 |       146 | 
##                        |     0.274 |     0.726 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 

For a large class imbalance (which is not the case in our example, as the band and noband classes are reasonably well balanced), the train function can subsample the dataset to balance the classes before model fitting. This is done by setting the option sampling=“down” in the trainControl object passed to train.
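Down-sampling itself is simple: rows of the larger class are discarded at random until both classes have the same size. The base R sketch below illustrates the idea on hypothetical labels with the class counts of our training set (125 band, 214 noband); the `down_sample` helper is ours. In caret the same step is applied inside each resample rather than once on the whole training set.

```r
set.seed(7)   # arbitrary seed

# Hypothetical training labels with the class counts of the training set
train_df <- data.frame(
  id        = seq_len(339),
  band_type = factor(rep(c("band", "noband"), times = c(125, 214)))
)

# Randomly keep only as many rows of each class as the smallest class has
down_sample <- function(df, class_col) {
  rows_by_class <- split(seq_len(nrow(df)), df[[class_col]])
  n_min <- min(lengths(rows_by_class))
  keep  <- unlist(lapply(rows_by_class,
                         function(idx) idx[sample.int(length(idx), n_min)]))
  df[sort(keep), , drop = FALSE]
}

balanced <- down_sample(train_df, "band_type")
table(balanced$band_type)
```

The price of balancing is that 89 of the 214 “noband” rows are thrown away, so the model is fitted on less data.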

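The corresponding confusion matrix and cross-table are shown below. Down-sampling raises the sensitivity for “band” from 0.69 to 0.81, but the overall test accuracy drops from 0.85 to 0.77.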

##    user  system elapsed 
##   76.46    0.34   77.48
## Random Forest 
## 
## 339 samples
##  27 predictor
##   2 classes: 'band', 'noband' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 305, 305, 305, 304, 306, 305, ... 
## Addtional sampling using down-sampling
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7007341  0.3977763
##   19    0.6953435  0.3733094
##   36    0.7013952  0.3846407
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 36.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction band noband
##     band     39     25
##     noband    9     73
##                                          
##                Accuracy : 0.7671         
##                  95% CI : (0.6901, 0.833)
##     No Information Rate : 0.6712         
##     P-Value [Acc > NIR] : 0.007421       
##                                          
##                   Kappa : 0.5137         
##                                          
##  Mcnemar's Test P-Value : 0.010097       
##                                          
##             Sensitivity : 0.8125         
##             Specificity : 0.7449         
##          Pos Pred Value : 0.6094         
##          Neg Pred Value : 0.8902         
##              Prevalence : 0.3288         
##          Detection Rate : 0.2671         
##    Detection Prevalence : 0.4384         
##       Balanced Accuracy : 0.7787         
##                                          
##        'Positive' Class : band           
## 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  146 
## 
##  
##                        | band.pred.2 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        39 |         9 |        48 | 
##                        |     0.812 |     0.188 |     0.329 | 
##                        |     0.609 |     0.110 |           | 
##                        |     0.267 |     0.062 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |        25 |        73 |        98 | 
##                        |     0.255 |     0.745 |     0.671 | 
##                        |     0.391 |     0.890 |           | 
##                        |     0.171 |     0.500 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        64 |        82 |       146 | 
##                        |     0.438 |     0.562 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 

b. Using the model: Neural Network

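We build a neural network (NN) classifier in this case by providing the dataset free from missing values, using 2 hidden units. (The 77 weights reported below are consistent with 36 inputs, one hidden layer of 2 units, and a single output.)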

## # weights:  77
## initial  value 231.694164 
## final  value 223.156122 
## converged
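The weight count in this output can be checked with a short calculation. The sketch assumes a fully connected network with one hidden layer, a bias on every hidden and output unit, and the 36 dummy-coded inputs that appear in the relative importance table further below; the `n_weights` helper is ours.

```r
# Weights in a fully connected network with one hidden layer:
# every hidden unit and every output unit also has a bias term
n_weights <- function(n_inputs, n_hidden, n_outputs = 1) {
  (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs
}

n_weights(n_inputs = 36, n_hidden = 2)   # 77
```

Under these assumptions, 36 inputs and 2 hidden units give exactly the 77 weights reported.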

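We then test the model on the test dataset and obtain the misclassification table shown below. The model assigns every test observation to “noband”: all 98 “noband” cases are classified correctly, but none of the 48 “band” cases is detected. The accuracy of 67.1% (98/146) is therefore no better than the no-information rate, and the model is worse than randomForest at predicting “banding”.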

##         predicted
## true     noband
##   band       48
##   noband     98

The misclassification table can also be obtained by using the gmodels library:

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  146 
## 
##  
##                        | predicted 
## data$band_type[-train] |    noband | Row Total | 
## -----------------------|-----------|-----------|
##                   band |        48 |        48 | 
##                        |     0.329 |           | 
## -----------------------|-----------|-----------|
##                 noband |        98 |        98 | 
##                        |     0.671 |           | 
## -----------------------|-----------|-----------|
##           Column Total |       146 |       146 | 
## -----------------------|-----------|-----------|
## 
## 

We can now list the relative importance of the variables using a sourced function (courtesy: https://beckmw.wordpress.com/2013/08/12/variable-importance-in-neural-networks/)

##                          rel.imp              x.names
## locationmideuropean  -1.00000000  locationmideuropean
## solvent_pct          -0.81424418          solvent_pct
## locationsouthus      -0.71712401      locationsouthus
## wax                  -0.64254745                  wax
## locationscandanavian -0.62460938 locationscandanavian
## roughness            -0.60141152            roughness
## solvent_typeNAPTHA   -0.55580818   solvent_typeNAPTHA
## cylinder_sizespiegel -0.40198168 cylinder_sizespiegel
## locationnorthus      -0.38447601      locationnorthus
## ink_typecover        -0.27078375        ink_typecover
## solvent_typeXYLOL    -0.26878627    solvent_typeXYLOL
## cylinder_typeyes     -0.19506011     cylinder_typeyes
## current_density      -0.17749519      current_density
## paper_typeuncoated   -0.15372813   paper_typeuncoated
## paper_typesuper      -0.15337064      paper_typesuper
## proof_cut            -0.10804749            proof_cut
## caliper              -0.07973444              caliper
## direct_steamyes      -0.07343444      direct_steamyes
## hardener             -0.02488372             hardener
## cylinder_sizetabloid  0.00000000 cylinder_sizetabloid
## press_typeMotter70    0.05438288   press_typeMotter70
## press_typeMotter94    0.12215403   press_typeMotter94
## ink_temperature       0.15013680      ink_temperature
## anode_space_ratio     0.15294915    anode_space_ratio
## press_typeWoodHoe70   0.15561081  press_typeWoodHoe70
## grain_screenedYES     0.22613986    grain_screenedYES
## roller_durometer      0.31684380     roller_durometer
## viscosity             0.37384575            viscosity
## ink_typeuncoated      0.41528266     ink_typeuncoated
## press_speed           0.48484925          press_speed
## varnish_pct           0.55720107          varnish_pct
## humidity              0.56215695             humidity
## ink_pct               0.57964732              ink_pct
## blade_pressure        0.64043328       blade_pressure
## press                 0.81796317                press
## proof_inkYES          0.95849499         proof_inkYES

The bar plot tells us that the variables location (mideuropean) and proof_ink (yes) have the strongest negative and positive relationships, respectively, with the response variable band_type.

In decreasing order of importance:

  1. proof_ink - yes
  2. press
  3. blade_pressure
  4. ink_pct
  5. humidity
  6. press_speed

c. Using the model: Decision Tree

We consider the use of a decision tree, modelled using the rpart function. To get a large tree we make the complexity parameter (cp) very small.

In the output we see all the trees that are considered in the model, along with the complexity parameter, number of splits, resubstitution error rate, cross-validated error rate, and the associated standard error.

From the complexity table, the lowest cross-validated error (xerror) of about 0.80 is first reached at 24 splits, that is, a tree with 25 terminal nodes.

To reduce this tree size we can prune the tree and use repeated cross-validation. The pruned tree summarised below (cp = 0.05) has 3 splits, with a resubstitution error of about 0.78 and a cross-validated error of about 0.90, both relative to the root node error.

This again is not a good predictor because of the large error rate.
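The grow-then-prune workflow can be sketched with rpart as follows. Since the banding data is not reproduced here, the sketch uses the `kyphosis` data set that ships with rpart as a stand-in. The control settings mirror the call in the output below, while the seed and the rule of pruning at the lowest xerror are our assumptions.

```r
library(rpart)   # 'kyphosis' ships with rpart and stands in for the banding data

set.seed(123)    # the cross-validated error (xerror) depends on the random folds

# Grow a deliberately large tree by making cp very small
big_tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                  method = "class",
                  control = rpart.control(minsplit = 4, cp = 1e-05))

cp_table <- big_tree$cptable                   # CP, nsplit, rel error, xerror, xstd
best_row <- which.min(cp_table[, "xerror"])    # row with the lowest cross-validated error
pruned   <- prune(big_tree, cp = cp_table[best_row, "CP"])

printcp(pruned)
```

Note that the `rel error` and `xerror` columns are expressed relative to the root node error, so they must be multiplied by it to obtain absolute misclassification rates.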

## 
## Classification tree:
## rpart(formula = data$band_type ~ ., data = data, method = "class", 
##     control = rpart.control(minsplit = 4, cp = 1e-05))
## 
## Variables actually used in tree construction:
##  [1] anode_space_ratio blade_pressure    caliper          
##  [4] cylinder_size     cylinder_type     grain_screened   
##  [7] hardener          humidity          ink_pct          
## [10] ink_temperature   ink_type          location         
## [13] paper_type        press             press_speed      
## [16] press_type        proof_cut         proof_ink        
## [19] roller_durometer  roughness         solvent_pct      
## [22] varnish_pct       viscosity         wax              
## 
## Root node error: 173/485 = 0.3567
## 
## n= 485 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.0693642      0  1.000000 1.00000 0.060979
## 2  0.0375723      3  0.780347 0.90173 0.059463
## 3  0.0289017      7  0.618497 0.94220 0.060132
## 4  0.0231214      8  0.589595 0.84971 0.058506
## 5  0.0173410      9  0.566474 0.86705 0.058837
## 6  0.0144509     15  0.462428 0.84971 0.058506
## 7  0.0115607     17  0.433526 0.84971 0.058506
## 8  0.0096339     24  0.352601 0.80347 0.057561
## 9  0.0086705     28  0.312139 0.81503 0.057806
## 10 0.0077071     39  0.208092 0.82659 0.058045
## 11 0.0057803     42  0.184971 0.83237 0.058162
## 12 0.0038536     61  0.075145 0.80347 0.057561
## 13 0.0028902     64  0.063584 0.80347 0.057561
## 14 0.0000100     70  0.046243 0.81503 0.057806

## Call:
## rpart(formula = data$band_type ~ ., data = data, method = "class", 
##     control = rpart.control(minsplit = 4, cp = 1e-05))
##   n= 485 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.06936416      0 1.0000000 1.0000000 0.06097943
## 2 0.05000000      3 0.7803468 0.9017341 0.05946252
## 
## Variable importance
##       press_type         humidity      press_speed            press 
##               31               20               18               11 
##        roughness    cylinder_size          ink_pct     solvent_type 
##                9                4                1                1 
##         location         hardener              wax   blade_pressure 
##                1                1                1                1 
##  ink_temperature roller_durometer 
##                1                1 
## 
## Node number 1: 485 observations,    complexity param=0.06936416
##   predicted class=noband  expected loss=0.356701  P(node) =1
##     class counts:   173   312
##    probabilities: 0.357 0.643 
##   left son=2 (167 obs) right son=3 (318 obs)
##   Primary splits:
##       press_type     splits as  RRRL,        improve=13.743870, (0 missing)
##       press_speed    < 2184.5  to the left,  improve=13.282210, (0 missing)
##       press          < 822.5   to the left,  improve=13.017350, (0 missing)
##       ink_type       splits as  RLL,         improve=10.840790, (0 missing)
##       grain_screened splits as  RL,          improve= 8.957201, (0 missing)
##   Surrogate splits:
##       press         < 818.5   to the left,  agree=0.777, adj=0.353, (0 split)
##       roughness     < 0.53125 to the left,  agree=0.755, adj=0.287, (0 split)
##       cylinder_size splits as  RLR,         agree=0.697, adj=0.120, (0 split)
##       humidity      < 84.5    to the right, agree=0.676, adj=0.060, (0 split)
##       ink_pct       < 64      to the right, agree=0.672, adj=0.048, (0 split)
## 
## Node number 2: 167 observations,    complexity param=0.06936416
##   predicted class=band    expected loss=0.4790419  P(node) =0.3443299
##     class counts:    87    80
##    probabilities: 0.521 0.479 
##   left son=4 (46 obs) right son=5 (121 obs)
##   Primary splits:
##       press_speed   < 1678    to the left,  improve=7.308378, (0 missing)
##       ink_pct       < 62.9    to the right, improve=5.404575, (0 missing)
##       humidity      < 73.5    to the left,  improve=5.077850, (0 missing)
##       cylinder_type splits as  LR,          improve=4.631613, (0 missing)
##       roughness     < 0.5625  to the right, improve=4.059462, (0 missing)
##   Surrogate splits:
##       solvent_type     splits as  R-L,         agree=0.749, adj=0.087, (0 split)
##       location         splits as  RLRR-,       agree=0.743, adj=0.065, (0 split)
##       ink_temperature  < 19.05   to the right, agree=0.737, adj=0.043, (0 split)
##       blade_pressure   < 22.5    to the left,  agree=0.737, adj=0.043, (0 split)
##       roller_durometer < 29      to the left,  agree=0.737, adj=0.043, (0 split)
## 
## Node number 3: 318 observations
##   predicted class=noband  expected loss=0.2704403  P(node) =0.6556701
##     class counts:    86   232
##    probabilities: 0.270 0.730 
## 
## Node number 4: 46 observations
##   predicted class=band    expected loss=0.2391304  P(node) =0.09484536
##     class counts:    35    11
##    probabilities: 0.761 0.239 
## 
## Node number 5: 121 observations,    complexity param=0.06936416
##   predicted class=noband  expected loss=0.4297521  P(node) =0.2494845
##     class counts:    52    69
##    probabilities: 0.430 0.570 
##   left son=10 (22 obs) right son=11 (99 obs)
##   Primary splits:
##       humidity          < 75.5    to the left,  improve=8.113866, (0 missing)
##       ink_pct           < 62.9    to the right, improve=6.323642, (0 missing)
##       press_speed       < 2184.5  to the left,  improve=3.734125, (0 missing)
##       anode_space_ratio < 98.485  to the left,  improve=3.617543, (0 missing)
##       wax               < 1.725   to the left,  improve=3.391992, (0 missing)
##   Surrogate splits:
##       press_speed < 2450    to the right, agree=0.835, adj=0.091, (0 split)
##       wax         < 0.85    to the left,  agree=0.826, adj=0.045, (0 split)
##       hardener    < 0.25    to the left,  agree=0.826, adj=0.045, (0 split)
## 
## Node number 10: 22 observations
##   predicted class=band    expected loss=0.1818182  P(node) =0.04536082
##     class counts:    18     4
##    probabilities: 0.818 0.182 
## 
## Node number 11: 99 observations
##   predicted class=noband  expected loss=0.3434343  P(node) =0.2041237
##     class counts:    34    65
##    probabilities: 0.343 0.657

The variable importance table is given below:

##                     Overall
## anode_space_ratio 59.465712
## blade_pressure    41.265195
## caliper           16.286526
## current_density   14.078302
## cylinder_size     24.208212
## cylinder_type     15.785659
## grain_screened    21.538722
## hardener          20.936189
## humidity          42.792686
## ink_pct           55.397484
## ink_temperature   38.727212
## ink_type          29.660397
## location          27.500859
## paper_type         3.738095
## press             40.805786
## press_speed       76.145054
## press_type        30.664801
## proof_cut         40.656055
## proof_ink          5.491162
## roller_durometer  23.316414
## roughness         26.216762
## solvent_pct       38.299690
## varnish_pct       27.900896
## viscosity         45.424158
## wax               29.088331
## direct_steam       0.000000
## solvent_type       0.000000

From this, we can list the most important variables as:

  1. press_speed
  2. anode_space_ratio
  3. ink_pct
  4. viscosity
  5. humidity
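The ranking above can be reproduced by sorting the importance values. A minimal sketch in base R, with the values copied from the table and a vector name of our own choosing:

```r
# 'Overall' importance values copied from the variable importance table
importance <- c(
  anode_space_ratio = 59.465712, blade_pressure   = 41.265195,
  caliper           = 16.286526, current_density  = 14.078302,
  cylinder_size     = 24.208212, cylinder_type    = 15.785659,
  grain_screened    = 21.538722, hardener         = 20.936189,
  humidity          = 42.792686, ink_pct          = 55.397484,
  ink_temperature   = 38.727212, ink_type         = 29.660397,
  location          = 27.500859, paper_type       =  3.738095,
  press             = 40.805786, press_speed      = 76.145054,
  press_type        = 30.664801, proof_cut        = 40.656055,
  proof_ink         =  5.491162, roller_durometer = 23.316414,
  roughness         = 26.216762, solvent_pct      = 38.299690,
  varnish_pct       = 27.900896, viscosity        = 45.424158,
  wax               = 29.088331, direct_steam     =  0.000000,
  solvent_type      =  0.000000
)

top5 <- head(sort(importance, decreasing = TRUE), 5)
round(top5, 1)
```

The next three variables (blade_pressure, press, and proof_cut) follow closely behind humidity, so the cut-off at five is somewhat arbitrary.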

d. Using the model: SVM

We consider the use of support vector machines to build a classifier and to also extract the feature importance list.

First we use a cost of 10 (fairly large), which penalizes misclassified training points heavily and therefore allows only a narrow margin for misclassification.
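Besides the cost, the call below fixes a radial kernel with gamma = 0.1. As a reminder of what gamma controls, the radial kernel can be written out in base R. The `rbf_kernel` helper and the example points are ours, for illustration only.

```r
# Radial (RBF) kernel: similarity decays with the squared Euclidean distance
rbf_kernel <- function(x, y, gamma = 0.1) {
  exp(-gamma * sum((x - y)^2))
}

x <- c(0, 0)
rbf_kernel(x, x)         # identical points: 1
rbf_kernel(x, c(1, 1))   # exp(-0.2), about 0.82
rbf_kernel(x, c(5, 5))   # exp(-5), about 0.007
```

A larger gamma makes the similarity fall off faster, so each support vector influences a smaller neighbourhood. Combined with a large cost, this tends to produce a more flexible decision boundary that fits the training data more closely.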

## 
## Call:
## svm(formula = data[train, ]$band_type ~ ., data = data[train, 
##     ], method = "C-classification", kernel = "radial", cost = 10, 
##     gamma = 0.1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
## 
## Number of Support Vectors:  308
## 
##  ( 188 120 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  band noband
##   [1]   1   4   5   6  10  13  14  17  19  22  24  27  28  32  33  34  36
##  [18]  37  42  47  48  49  50  51  52  53  54  56  57  58  59  61  62  63
##  [35]  64  66  68  70  71  73  74  76  77  79  82  85  87  88  89  92  94
##  [52]  95  96  97  98  99 102 103 104 106 107 111 113 114 116 119 121 122
##  [69] 124 128 130 133 135 136 138 141 142 143 144 146 148 150 151 153 154
##  [86] 157 159 161 162 163 167 168 169 170 173 176 177 178 179 180 181 185
## [103] 187 190 191 193 197 198 199 200 201 202 203 204 206 207 209 210 213
## [120] 215 216 217 219 221 226 227 229 231 233 234 236 238 241 244 246 247
## [137] 248 250 251 252 254 255 256 258 259 260 262 268 270 272 273 274 276
## [154] 277 279 280 282 284 286 287 288 292 293 295 296 297 301 303 305 306
## [171] 307 308 313 315 316 317 318 319 320 321 322 325 326 331 332 333 337
## [188] 339   3   7   8  11  12  15  16  18  20  21  23  25  26  29  31  35
## [205]  38  39  41  43  44  45  55  65  69  72  78  80  81  83  84  90  93
## [222] 101 105 108 109 110 112 115 123 125 126 127 129 131 132 134 137 140
## [239] 145 149 152 155 160 164 166 171 172 174 175 184 186 188 192 195 196
## [256] 208 211 212 214 218 220 222 225 228 230 232 235 237 239 242 243 245
## [273] 249 253 257 261 263 264 265 266 267 269 271 275 278 281 283 285 289
## [290] 290 291 294 298 299 300 302 309 312 314 323 327 328 329 330 334 335
## [307] 336 338

Prediction using SVM: We find the accuracy to be 0.80 with a kappa of 0.52.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction band noband
##     band     28      9
##     noband   20     89
##                                           
##                Accuracy : 0.8014          
##                  95% CI : (0.7274, 0.8628)
##     No Information Rate : 0.6712          
##     P-Value [Acc > NIR] : 0.0003511       
##                                           
##                   Kappa : 0.522           
##                                           
##  Mcnemar's Test P-Value : 0.0633178       
##                                           
##             Sensitivity : 0.5833          
##             Specificity : 0.9082          
##          Pos Pred Value : 0.7568          
##          Neg Pred Value : 0.8165          
##              Prevalence : 0.3288          
##          Detection Rate : 0.1918          
##    Detection Prevalence : 0.2534          
##       Balanced Accuracy : 0.7457          
##                                           
##        'Positive' Class : band            
## 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  146 
## 
##  
##                        | svm.pred.2 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        28 |        20 |        48 | 
##                        |     0.583 |     0.417 |     0.329 | 
##                        |     0.757 |     0.183 |           | 
##                        |     0.192 |     0.137 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |         9 |        89 |        98 | 
##                        |     0.092 |     0.908 |     0.671 | 
##                        |     0.243 |     0.817 |           | 
##                        |     0.062 |     0.610 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        37 |       109 |       146 | 
##                        |     0.253 |     0.747 |           | 
## -----------------------|-----------|-----------|-----------|
## 
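Because the cost of a missed band is what matters in practice, it is worth looking at sensitivity and specificity separately rather than at accuracy alone. The sketch below recomputes them from the four cells of the table above, taking “band” as the positive class; the variable names are ours.

```r
# Test-set counts from the confusion matrix above, with "band" as the positive class
tp <- 28   # band predicted as band
fn <- 20   # band predicted as noband
fp <- 9    # noband predicted as band
tn <- 89   # noband predicted as noband

sensitivity       <- tp / (tp + fn)   # share of true bands that are detected
specificity       <- tn / (tn + fp)   # share of true nobands that are detected
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(sensitivity = sensitivity, specificity = specificity,
        balanced_accuracy = balanced_accuracy), 4)
```

The SVM detects only about 58% of the banding cases in the test set, although its overall accuracy is 80%.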
## 

caret does not implement a variable importance measure for “svm” models, so we will use the rminer package to obtain the variable importance matrix (VIM).

3a. Only using KNN Imputation directly (no removal of 20%)

In this case, we do not remove any observations from the dataset and perform KNN imputation on the complete dataset.

Using the data, we proceed with creating correlation plots to see how the different variables are correlated. If the dataset has perfectly positively or negatively correlated attributes, then the performance of the model will be impacted by a problem called “Multicollinearity”. Multicollinearity happens when one predictor variable in a multiple regression model can be linearly predicted from the others with a high degree of accuracy. This can lead to skewed or misleading results.

We find that the following correlations exist in the data, though to a lesser degree:

  1. “ink_pct” and “varnish_pct”
  2. “solvent_pct” and “varnish_pct”

This implies that we can remove “varnish_pct” from our training model to observe whether there is an improvement.
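A correlation screen of this kind can be sketched in base R. The data below is a hypothetical toy example, not the real dataset: three percentages are simulated so that they roughly sum to 100, which makes varnish_pct largely determined by the other two. The 0.5 threshold is also an assumption.

```r
set.seed(1)   # arbitrary seed

# Hypothetical toy data: three percentages that roughly sum to 100
n <- 200
ink_pct     <- rnorm(n, mean = 55, sd = 5)
solvent_pct <- rnorm(n, mean = 35, sd = 5)
varnish_pct <- 100 - ink_pct - solvent_pct + rnorm(n, sd = 1)
toy <- data.frame(ink_pct, solvent_pct, varnish_pct)

cor_mat <- cor(toy)

# List each pair once (upper triangle) whose absolute correlation exceeds 0.5
high <- which(abs(cor_mat) > 0.5 & upper.tri(cor_mat), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = round(cor_mat[high], 2)
)
flagged
```

In this toy example both flagged pairs involve varnish_pct, so dropping that single variable removes the redundancy while keeping ink_pct and solvent_pct.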

As before, we continue to ignore the 12 variables identified in the previous section for this model.

a. Using the model: Random Forest

We first generate the training and test datasets based on a random sample. We use 70% of the supplied dataset as training set, and the remaining 30% as a test dataset.

We next build the random forest on the training data set using the randomForest library. Since the number of variables is 39 in this case, we use mtry=6, close to the square root of 39 (the usual rule of thumb for classification forests); that is, 6 variables are tried at every split.

From the results, we observe the out-of-bag error rate to be 21.16%.

## 
## Call:
##  randomForest(formula = form, data = data[train, ], ntree = 1000,      mtry = 6, importance = TRUE, localImp = TRUE, replace = FALSE,      na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 21.16%
## Confusion matrix:
##        band noband class.error
## band    114     52   0.3132530
## noband   28    184   0.1320755

The variable importance is shown in the plot below, followed by the out-of-bag (OOB) and per-class error rates after each of the first 15 trees and the corresponding error plot. This error plot shows the change in error rate as more trees are added to the forest.

##          OOB   band noband
##  [1,] 0.4101 0.3898 0.4250
##  [2,] 0.4130 0.4151 0.4113
##  [3,] 0.3916 0.4462 0.3462
##  [4,] 0.3560 0.3741 0.3409
##  [5,] 0.3684 0.3922 0.3492
##  [6,] 0.3846 0.4395 0.3402
##  [7,] 0.3573 0.4500 0.2836
##  [8,] 0.3671 0.4472 0.3039
##  [9,] 0.3659 0.4815 0.2754
## [10,] 0.3360 0.4451 0.2500
## [11,] 0.3244 0.4329 0.2392
## [12,] 0.3235 0.4303 0.2392
## [13,] 0.3404 0.4578 0.2476
## [14,] 0.3501 0.4880 0.2417
## [15,] 0.3210 0.4759 0.1991

In this case, the variables with a Mean Decrease in Gini greater than 8 (and therefore the most important) are:

  1. press_speed
  2. viscosity
  3. ink_temperature
  4. humidity

We now proceed to predict using the test data set and create the confusion matrix for the same.

##        band noband class.error
## band    114     52   0.3132530
## noband   28    184   0.1320755
##         pred
## true     band noband
##   band     46     16
##   noband    5     95
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  162 
## 
##  
##                        | band.pred 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        46 |        16 |        62 | 
##                        |     0.742 |     0.258 |     0.383 | 
##                        |     0.902 |     0.144 |           | 
##                        |     0.284 |     0.099 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |         5 |        95 |       100 | 
##                        |     0.050 |     0.950 |     0.617 | 
##                        |     0.098 |     0.856 |           | 
##                        |     0.031 |     0.586 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        51 |       111 |       162 | 
##                        |     0.315 |     0.685 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 

We now use repeated cross-validation (10 folds, repeated 10 times) to see whether we get better classification performance. The resampling results, confusion matrix and cross-table are given below. The cross-validated accuracy is highest (0.77) at mtry=2; on the test set, this model reaches an accuracy of 0.84 and a kappa of 0.64.
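How the 100 resamples arise can be sketched as follows. This is a simplified, unstratified version in plain Python; the function name and seed are hypothetical, and the resampling used for the report also balances the classes within folds, which is why training sizes such as 339 can appear in its output.

```python
import random

def repeated_kfold(n_samples, n_folds=10, n_repeats=10, seed=0):
    """Yield (train_indices, holdout_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        # Deal the shuffled indices into n_folds nearly equal folds.
        folds = [order[i::n_folds] for i in range(n_folds)]
        for k in range(n_folds):
            holdout = folds[k]
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, holdout

# 378 = size of the training set reported in the output below.
resamples = list(repeated_kfold(378))
sizes = sorted({len(train) for train, _ in resamples})
print(len(resamples), sizes)
```

Each candidate value of mtry is fitted on every one of these training subsets and scored on the matching hold-out fold; the reported accuracy and kappa are averages over all resamples.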

##   model parameter                         label forReg forClass probModel
## 1    rf      mtry #Randomly Selected Predictors   TRUE     TRUE      TRUE
##    user  system elapsed 
##  119.15    0.58  121.81
## Random Forest 
## 
## 378 samples
##  26 predictor
##   2 classes: 'band', 'noband' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 340, 341, 340, 341, 341, 339, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7657964  0.5110627
##   18    0.7634542  0.5112989
##   35    0.7599980  0.5046530
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction band noband
##     band     41      5
##     noband   21     95
##                                           
##                Accuracy : 0.8395          
##                  95% CI : (0.7737, 0.8924)
##     No Information Rate : 0.6173          
##     P-Value [Acc > NIR] : 5.453e-10       
##                                           
##                   Kappa : 0.6428          
##                                           
##  Mcnemar's Test P-Value : 0.003264        
##                                           
##             Sensitivity : 0.6613          
##             Specificity : 0.9500          
##          Pos Pred Value : 0.8913          
##          Neg Pred Value : 0.8190          
##              Prevalence : 0.3827          
##          Detection Rate : 0.2531          
##    Detection Prevalence : 0.2840          
##       Balanced Accuracy : 0.8056          
##                                           
##        'Positive' Class : band            
## 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  162 
## 
##  
##                        | band.pred.1 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        41 |        21 |        62 | 
##                        |     0.661 |     0.339 |     0.383 | 
##                        |     0.891 |     0.181 |           | 
##                        |     0.253 |     0.130 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |         5 |        95 |       100 | 
##                        |     0.050 |     0.950 |     0.617 | 
##                        |     0.109 |     0.819 |           | 
##                        |     0.031 |     0.586 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        46 |       116 |       162 | 
##                        |     0.284 |     0.716 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 
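The accuracy, sensitivity, specificity and kappa in the statistics above can be recomputed from the four cells of the confusion matrix. The sketch below does this in plain Python; the helper name `cohen_kappa` is our own and is not part of any package used in the report.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix given as a list of rows."""
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2
    return (observed - expected) / (1 - expected)

# Rows = prediction, columns = reference, in the order (band, noband).
matrix = [[41, 5],
          [21, 95]]

total = sum(sum(row) for row in matrix)
accuracy = (matrix[0][0] + matrix[1][1]) / total
sensitivity = matrix[0][0] / (matrix[0][0] + matrix[1][0])  # band = positive
specificity = matrix[1][1] / (matrix[0][1] + matrix[1][1])
kappa = cohen_kappa(matrix)

print(f"accuracy={accuracy:.4f} sensitivity={sensitivity:.4f} "
      f"specificity={specificity:.4f} kappa={kappa:.4f}")
```

Kappa is lower than accuracy because it discounts the agreement that would be expected by chance given the class proportions.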

For a large class imbalance (which is not the case in our example, as the band and noband classes are reasonably well balanced), the train function can subsample the dataset to balance the classes before model fitting. This is done by setting the option sampling=“down” in the resampling control passed to the function.
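The idea of down-sampling can be sketched as follows: every class is randomly reduced to the size of the smallest class. This is a minimal illustration in plain Python with a hypothetical helper name and an arbitrary seed. We understand that the option used in the report applies the subsampling inside each resample rather than once up front, so the sketch shows the principle only.

```python
import random
from collections import Counter

def downsample(labels, seed=0):
    """Return sorted row indices with every class cut to the minority size."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = min(len(rows) for rows in by_class.values())
    kept = []
    for rows in by_class.values():
        kept.extend(rng.sample(rows, target))
    return sorted(kept)

# Class counts of the training set (row totals of the OOB confusion matrix).
labels = ["band"] * 166 + ["noband"] * 212
kept = downsample(labels)
balanced = Counter(labels[i] for i in kept)
print(balanced)
```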

The corresponding confusion matrix and cross-table are shown below.

##    user  system elapsed 
##  101.20    0.50  102.26
## Random Forest 
## 
## 378 samples
##  26 predictor
##   2 classes: 'band', 'noband' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 341, 341, 340, 340, 340, 339, ... 
## Addtional sampling using down-sampling
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7639534  0.5252234
##   18    0.7556248  0.5070152
##   35    0.7366871  0.4692026
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction band noband
##     band     50     12
##     noband   12     88
##                                           
##                Accuracy : 0.8519          
##                  95% CI : (0.7876, 0.9027)
##     No Information Rate : 0.6173          
##     P-Value [Acc > NIR] : 4.698e-11       
##                                           
##                   Kappa : 0.6865          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8065          
##             Specificity : 0.8800          
##          Pos Pred Value : 0.8065          
##          Neg Pred Value : 0.8800          
##              Prevalence : 0.3827          
##          Detection Rate : 0.3086          
##    Detection Prevalence : 0.3827          
##       Balanced Accuracy : 0.8432          
##                                           
##        'Positive' Class : band            
## 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  162 
## 
##  
##                        | band.pred.2 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        50 |        12 |        62 | 
##                        |     0.806 |     0.194 |     0.383 | 
##                        |     0.806 |     0.120 |           | 
##                        |     0.309 |     0.074 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |        12 |        88 |       100 | 
##                        |     0.120 |     0.880 |     0.617 | 
##                        |     0.194 |     0.880 |           | 
##                        |     0.074 |     0.543 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        62 |       100 |       162 | 
##                        |     0.383 |     0.617 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 

b. Using the model: Neural Network

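We build a neural network (NN) classifier in this case by providing the dataset free from missing values, using a single hidden layer with 2 units. The 75 weights reported below are consistent with 35 inputs, 2 hidden units and 1 output.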

## # weights:  75
## initial  value 262.836208 
## final  value 259.203890 
## converged
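The weight count in this output can be checked by hand. The sketch below (plain Python, hypothetical helper name) assumes a fully connected network with a single hidden layer, bias units and no skip-layer connections. The 35 inputs are an assumption taken from the number of rows in the relative-importance table later in this section.

```python
def n_weights(n_inputs, n_hidden, n_outputs=1):
    """Weights (including biases) in a fully connected one-hidden-layer net."""
    input_to_hidden = (n_inputs + 1) * n_hidden    # +1 for the bias unit
    hidden_to_output = (n_hidden + 1) * n_outputs  # +1 for the bias unit
    return input_to_hidden + hidden_to_output

# 35 inputs: numeric predictors plus dummy-coded factor levels (assumed).
print(n_weights(n_inputs=35, n_hidden=2))
```

Under these assumptions the count matches the "# weights:  75" line, which points to 2 hidden units in one hidden layer.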

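We then test the model on the test dataset and obtain the misclassification table shown below. The network predicts “noband” for all 162 test observations, so it detects none of the 62 banding cases, and its accuracy (100/162 ≈ 0.62) merely equals the proportion of “noband” cases. Its performance in predicting “banding” is therefore much worse than that of randomForest.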

##         predicted
## true     noband
##   band       62
##   noband    100

The misclassification table can also be obtained by using the gmodels library:

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  162 
## 
##  
##                        | predicted 
## data$band_type[-train] |    noband | Row Total | 
## -----------------------|-----------|-----------|
##                   band |        62 |        62 | 
##                        |     0.383 |           | 
## -----------------------|-----------|-----------|
##                 noband |       100 |       100 | 
##                        |     0.617 |           | 
## -----------------------|-----------|-----------|
##           Column Total |       162 |       162 | 
## -----------------------|-----------|-----------|
## 
## 

We can now list the relative importance of the variables using a sourced function (courtesy: https://beckmw.wordpress.com/2013/08/12/variable-importance-in-neural-networks/).

##                          rel.imp              x.names
## viscosity            -0.70805256            viscosity
## ink_typeuncoated     -0.54977492     ink_typeuncoated
## ink_temperature      -0.50998598      ink_temperature
## caliper              -0.49514066              caliper
## ink_typecover        -0.48778630        ink_typecover
## solvent_typeNAPTHA   -0.45666856   solvent_typeNAPTHA
## direct_steamyes      -0.33513404      direct_steamyes
## paper_typeuncoated   -0.31614261   paper_typeuncoated
## blade_pressure       -0.30417073       blade_pressure
## cylinder_sizespiegel -0.21385993 cylinder_sizespiegel
## press_typeMotter70   -0.08588101   press_typeMotter70
## anode_space_ratio     0.00000000    anode_space_ratio
## proof_cut             0.01041603            proof_cut
## locationnorthus       0.02738016      locationnorthus
## paper_typesuper       0.07173208      paper_typesuper
## proof_inkYES          0.22908412         proof_inkYES
## hardener              0.24559122             hardener
## roughness             0.30671182            roughness
## press_speed           0.33186274          press_speed
## solvent_typeXYLOL     0.33414148    solvent_typeXYLOL
## locationmideuropean   0.41522468  locationmideuropean
## locationsouthus       0.42687887      locationsouthus
## cylinder_sizetabloid  0.44025178 cylinder_sizetabloid
## ink_pct               0.48074246              ink_pct
## press_typeMotter94    0.58473779   press_typeMotter94
## press_typeWoodHoe70   0.59309384  press_typeWoodHoe70
## press                 0.64836929                press
## humidity              0.68131414             humidity
## roller_durometer      0.72800859     roller_durometer
## locationscandanavian  0.80335472 locationscandanavian
## cylinder_typeyes      0.84432853     cylinder_typeyes
## solvent_pct           0.87608894          solvent_pct
## current_density       0.93307578      current_density
## grain_screenedYES     0.96934458    grain_screenedYES
## wax                   1.00000000                  wax

The bar plot tells us that the variables viscosity and wax have the strongest negative and positive relationships, respectively, with the response variable band_type.

The variables with the largest positive relative importance, in decreasing order, are:

  1. wax
  2. grain_screenedYES
  3. current_density
  4. solvent_pct
  5. cylinder_typeyes
  6. locationscandanavian
  7. roller_durometer
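The list above contains the largest positive values only. Ranking by absolute value instead, as in the plain-Python sketch below, also brings in the strongest negative relationship. The dictionary holds a subset of the rel.imp values printed in the table.

```python
# A subset of the rel.imp values listed in the table above.
rel_imp = {
    "viscosity": -0.70805256,
    "ink_typeuncoated": -0.54977492,
    "ink_temperature": -0.50998598,
    "humidity": 0.68131414,
    "roller_durometer": 0.72800859,
    "locationscandanavian": 0.80335472,
    "cylinder_typeyes": 0.84432853,
    "solvent_pct": 0.87608894,
    "current_density": 0.93307578,
    "grain_screenedYES": 0.96934458,
    "wax": 1.00000000,
}

# Sort variable names by the magnitude of their relative importance.
by_magnitude = sorted(rel_imp, key=lambda name: abs(rel_imp[name]), reverse=True)
for name in by_magnitude[:8]:
    sign = "positive" if rel_imp[name] > 0 else "negative"
    print(f"{name:22s} {rel_imp[name]:+.3f} ({sign})")
```

With this ordering the first seven entries are the same as in the list, and viscosity follows in eighth place as the strongest negative contributor.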

c. Using the model: Decision Tree

We next consider a decision tree, modelled using the rpart function. To grow a large tree, we make the complexity parameter (cp) very small.

The output lists all the subtrees considered in the model, giving for each the complexity parameter, the number of splits, the resubstitution error rate, the cross-validated error rate and the associated standard error.

## 
## Classification tree:
## rpart(formula = data$band_type ~ ., data = data, method = "class", 
##     control = rpart.control(minsplit = 4, cp = 1e-05))
## 
## Variables actually used in tree construction:
##  [1] anode_space_ratio blade_pressure    caliper          
##  [4] current_density   cylinder_type     grain_screened   
##  [7] hardener          humidity          ink_pct          
## [10] ink_temperature   location          paper_type       
## [13] press             press_speed       press_type       
## [16] proof_cut         proof_ink         roller_durometer 
## [19] roughness         solvent_pct       solvent_type     
## [22] viscosity         wax              
## 
## Root node error: 228/540 = 0.42222
## 
## n= 540 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.0723684      0  1.000000 1.00000 0.050340
## 2  0.0526316      4  0.706140 0.91667 0.049643
## 3  0.0438596      5  0.653509 0.91667 0.049643
## 4  0.0307018      6  0.609649 0.85526 0.048955
## 5  0.0263158      7  0.578947 0.83333 0.048672
## 6  0.0197368     10  0.500000 0.78509 0.047979
## 7  0.0153509     12  0.460526 0.74123 0.047261
## 8  0.0131579     14  0.429825 0.71930 0.046869
## 9  0.0109649     16  0.403509 0.69298 0.046369
## 10 0.0087719     20  0.359649 0.68421 0.046195
## 11 0.0073099     32  0.254386 0.64912 0.045461
## 12 0.0065789     39  0.201754 0.64474 0.045365
## 13 0.0043860     45  0.162281 0.65789 0.045651
## 14 0.0021930     73  0.039474 0.67105 0.045927
## 15 0.0000100     77  0.030702 0.68860 0.046283

From the complexity table, we observe that the lowest cross-validated error (xerror) of about 0.64 occurs at 39 splits, that is, at a tree size of ~40.
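Reading the best row off the complexity table can be sketched as below (plain Python, values copied from the CP, nsplit, xerror and xstd columns). The sketch also applies the one-standard-error rule, a common heuristic that the report itself does not use; it is included only for comparison.

```python
# (CP, nsplit, xerror, xstd) rows copied from the complexity table above.
cp_table = [
    (0.0723684, 0, 1.00000, 0.050340),
    (0.0526316, 4, 0.91667, 0.049643),
    (0.0438596, 5, 0.91667, 0.049643),
    (0.0307018, 6, 0.85526, 0.048955),
    (0.0263158, 7, 0.83333, 0.048672),
    (0.0197368, 10, 0.78509, 0.047979),
    (0.0153509, 12, 0.74123, 0.047261),
    (0.0131579, 14, 0.71930, 0.046869),
    (0.0109649, 16, 0.69298, 0.046369),
    (0.0087719, 20, 0.68421, 0.046195),
    (0.0073099, 32, 0.64912, 0.045461),
    (0.0065789, 39, 0.64474, 0.045365),
    (0.0043860, 45, 0.65789, 0.045651),
    (0.0021930, 73, 0.67105, 0.045927),
    (0.0000100, 77, 0.68860, 0.046283),
]

# Row with the lowest cross-validated error.
best = min(cp_table, key=lambda row: row[2])

# One-standard-error rule: the smallest tree whose xerror is within one
# standard error of the minimum (rows are ordered by increasing nsplit).
threshold = best[2] + best[3]
one_se = next(row for row in cp_table if row[2] <= threshold)

print("min xerror:", best)
print("1-SE choice:", one_se)
```

The minimum sits at 39 splits, while the one-standard-error rule would settle for a smaller tree at the cost of a slightly higher cross-validated error.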

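To reduce this tree size, we can prune the tree and use repeated cross-validation. Doing this gives us a tree size of 12, with a relative resubstitution error of ~0.65 (the rel error column in the output below).

This again is not a good predictor, due to the large error rate: the cross-validated error of the pruned tree is 0.92 relative to the root node error.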

## Call:
## rpart(formula = data$band_type ~ ., data = data, method = "class", 
##     control = rpart.control(minsplit = 4, cp = 1e-05))
##   n= 540 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.07236842      0 1.0000000 1.0000000 0.05033997
## 2 0.05263158      4 0.7061404 0.9166667 0.04964270
## 3 0.05000000      5 0.6535088 0.9166667 0.04964270
## 
## Variable importance
##             press       press_speed        press_type          location 
##                25                19                15                10 
##    blade_pressure   current_density     cylinder_size               wax 
##                 9                 5                 4                 4 
##         roughness       solvent_pct        paper_type          hardener 
##                 3                 1                 1                 1 
##         proof_ink anode_space_ratio      solvent_type 
##                 1                 1                 1 
## 
## Node number 1: 540 observations,    complexity param=0.07236842
##   predicted class=noband  expected loss=0.4222222  P(node) =1
##     class counts:   228   312
##    probabilities: 0.422 0.578 
##   left son=2 (451 obs) right son=3 (89 obs)
##   Primary splits:
##       press_speed      < 2184.5    to the left,  improve=17.60284, (0 missing)
##       location         splits as  RLRRL, improve=16.58157, (0 missing)
##       paper_type       splits as  RLR, improve=16.03920, (0 missing)
##       roller_durometer < 33.00751  to the right, improve=12.48117, (0 missing)
##       press            < 822.5     to the left,  improve=11.94435, (0 missing)
##   Surrogate splits:
##       press        < 827.5     to the left,  agree=0.863, adj=0.169, (0 split)
##       solvent_pct  < 44.05     to the left,  agree=0.846, adj=0.067, (0 split)
##       solvent_type splits as  LRL, agree=0.839, adj=0.022, (0 split)
## 
## Node number 2: 451 observations,    complexity param=0.07236842
##   predicted class=noband  expected loss=0.4789357  P(node) =0.8351852
##     class counts:   216   235
##    probabilities: 0.479 0.521 
##   left son=4 (341 obs) right son=5 (110 obs)
##   Primary splits:
##       press      < 814       to the right, improve=13.487460, (0 missing)
##       paper_type splits as  RLR, improve=12.558990, (0 missing)
##       location   splits as  RLRRL, improve=11.622280, (0 missing)
##       humidity   < 69.5      to the right, improve=10.013630, (0 missing)
##       press_type splits as  LRLL, improve= 9.124902, (0 missing)
##   Surrogate splits:
##       press_type     splits as  RRLL, agree=0.956, adj=0.818, (0 split)
##       blade_pressure < 35.1208   to the left,  agree=0.916, adj=0.655, (0 split)
##       cylinder_size  splits as  RLL, agree=0.825, adj=0.282, (0 split)
##       wax            < 2.95      to the left,  agree=0.789, adj=0.136, (0 split)
##       press_speed    < 1432.5    to the right, agree=0.787, adj=0.127, (0 split)
## 
## Node number 3: 89 observations
##   predicted class=noband  expected loss=0.1348315  P(node) =0.1648148
##     class counts:    12    77
##    probabilities: 0.135 0.865 
## 
## Node number 4: 341 observations,    complexity param=0.07236842
##   predicted class=band    expected loss=0.4516129  P(node) =0.6314815
##     class counts:   187   154
##    probabilities: 0.548 0.452 
##   left son=8 (70 obs) right son=9 (271 obs)
##   Primary splits:
##       location        splits as  RLRRL, improve=9.922203, (0 missing)
##       current_density < 36        to the right, improve=9.745700, (0 missing)
##       paper_type      splits as  RLR, improve=9.592881, (0 missing)
##       press           < 822.5     to the left,  improve=9.523061, (0 missing)
##       ink_type        splits as  RLL, improve=8.733958, (0 missing)
##   Surrogate splits:
##       paper_type     splits as  RLR, agree=0.818, adj=0.114, (0 split)
##       press_type     splits as  L-RR, agree=0.806, adj=0.057, (0 split)
##       blade_pressure < 19        to the left,  agree=0.798, adj=0.014, (0 split)
## 
## Node number 5: 110 observations
##   predicted class=noband  expected loss=0.2636364  P(node) =0.2037037
##     class counts:    29    81
##    probabilities: 0.264 0.736 
## 
## Node number 8: 70 observations
##   predicted class=band    expected loss=0.2142857  P(node) =0.1296296
##     class counts:    55    15
##    probabilities: 0.786 0.214 
## 
## Node number 9: 271 observations,    complexity param=0.07236842
##   predicted class=noband  expected loss=0.4870849  P(node) =0.5018519
##     class counts:   132   139
##    probabilities: 0.487 0.513 
##   left son=18 (187 obs) right son=19 (84 obs)
##   Primary splits:
##       press           < 822.5     to the left,  improve=8.739744, (0 missing)
##       current_density < 36        to the right, improve=6.557576, (0 missing)
##       ink_temperature < 16.9      to the right, improve=5.393334, (0 missing)
##       press_type      splits as  L-RL, improve=5.142964, (0 missing)
##       press_speed     < 1275      to the left,  improve=4.898144, (0 missing)
##   Surrogate splits:
##       press_type splits as  L-RL, agree=0.793, adj=0.333, (0 split)
##       roughness  < 0.8073364 to the left,  agree=0.793, adj=0.333, (0 split)
##       wax        < 2.35      to the right, agree=0.756, adj=0.214, (0 split)
##       proof_ink  splits as  RL, agree=0.720, adj=0.095, (0 split)
##       hardener   < 1.35      to the left,  agree=0.720, adj=0.095, (0 split)
## 
## Node number 18: 187 observations,    complexity param=0.05263158
##   predicted class=band    expected loss=0.4278075  P(node) =0.3462963
##     class counts:   107    80
##    probabilities: 0.572 0.428 
##   left son=36 (151 obs) right son=37 (36 obs)
##   Primary splits:
##       current_density < 36        to the right, improve=5.087226, (0 missing)
##       ink_pct         < 55.85     to the right, improve=4.591306, (0 missing)
##       proof_cut       < 55.25     to the left,  improve=4.333522, (0 missing)
##       wax             < 2.25      to the left,  improve=3.755301, (0 missing)
##       press_speed     < 1735      to the left,  improve=3.491101, (0 missing)
##   Surrogate splits:
##       anode_space_ratio < 93.425    to the right, agree=0.829, adj=0.111, (0 split)
##       humidity          < 99        to the left,  agree=0.818, adj=0.056, (0 split)
##       ink_pct           < 43.1      to the right, agree=0.818, adj=0.056, (0 split)
##       solvent_type      splits as  L-R, agree=0.813, adj=0.028, (0 split)
##       solvent_pct       < 44.15     to the left,  agree=0.813, adj=0.028, (0 split)
## 
## Node number 19: 84 observations
##   predicted class=noband  expected loss=0.297619  P(node) =0.1555556
##     class counts:    25    59
##    probabilities: 0.298 0.702 
## 
## Node number 36: 151 observations
##   predicted class=band    expected loss=0.3708609  P(node) =0.2796296
##     class counts:    95    56
##    probabilities: 0.629 0.371 
## 
## Node number 37: 36 observations
##   predicted class=noband  expected loss=0.3333333  P(node) =0.06666667
##     class counts:    12    24
##    probabilities: 0.333 0.667

The variable importance table is given below:

##                     Overall
## anode_space_ratio 39.045225
## blade_pressure    53.689247
## caliper           23.911904
## current_density   24.670661
## cylinder_size      3.504540
## cylinder_type      9.884305
## grain_screened    12.257990
## hardener          44.723327
## humidity          57.332461
## ink_pct           50.808741
## ink_temperature   54.742040
## ink_type          40.796891
## location          56.392524
## paper_type        60.988988
## press             58.353369
## press_speed       56.561023
## press_type        29.485488
## proof_cut         57.179918
## proof_ink          6.087513
## roller_durometer  29.438657
## roughness         14.940079
## solvent_pct       22.471187
## solvent_type       6.492854
## viscosity         70.570605
## wax               49.027160
## direct_steam       0.000000

The highest-ranked variables in the table above, in decreasing order of importance, are:

  1. viscosity
  2. paper_type
  3. press
  4. humidity
  5. proof_cut
  6. press_speed

d. Using the model: SVM

We consider the use of support vector machines to build a classifier and to also extract the feature importance list.

We first use a cost of 10 (fairly large), which penalizes misclassified training points heavily and therefore leaves only a narrow margin for misclassification.
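The model in this section is fitted with a radial kernel and gamma = 0.1, as the call in the output shows. The radial (RBF) kernel measures similarity as exp(-gamma * ||x - y||^2). The sketch below evaluates it in plain Python on hypothetical, already-scaled feature vectors (not rows of the dataset).

```python
import math

def rbf_kernel(x, y, gamma=0.1):
    """Radial basis function kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Hypothetical, already-scaled feature vectors (not rows of the dataset).
a = [0.0, 1.0, -0.5]
b = [0.2, 0.8, -0.4]
c = [3.0, -2.0, 1.5]

print(rbf_kernel(a, a))  # identical points
print(rbf_kernel(a, b))  # nearby points
print(rbf_kernel(a, c))  # distant points
```

A larger gamma makes the similarity fall off faster with distance. Together with a large cost, this lets the decision boundary follow the training points more closely, which may help explain why 333 of the 378 training rows end up as support vectors.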

## 
## Call:
## svm(formula = data[train, ]$band_type ~ ., data = data[train, 
##     ], method = "C-classification", kernel = "radial", cost = 10, 
##     gamma = 0.1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
## 
## Number of Support Vectors:  333
## 
##  ( 183 150 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  band noband
##   [1]   1   4   5   6   7   8  10  13  14  16  17  18  20  22  23  26  28
##  [18]  29  30  31  33  34  36  38  40  41  42  43  44  45  46  48  50  52
##  [35]  53  54  59  61  63  65  68  71  73  75  76  81  83  84  87  88  89
##  [52]  90  92  93  94  96  97 101 102 104 105 108 110 114 118 121 123 124
##  [69] 125 129 130 132 135 136 137 138 139 141 142 144 145 146 148 149 151
##  [86] 152 155 157 160 161 162 165 167 168 173 176 177 178 179 180 181 185
## [103] 186 191 192 198 199 200 201 202 208 214 217 218 222 223 224 230 231
## [120] 233 235 240 244 246 247 251 252 254 257 263 264 266 267 271 274 279
## [137] 281 283 284 285 286 288 292 294 295 296 297 299 302 303 305 306 312
## [154] 314 315 317 321 323 327 329 330 336 337 339 342 343 345 346 350 353
## [171] 354 356 359 361 362 366 370 371 372 373 374 375 377   2   9  11  12
## [188]  15  19  21  24  25  27  35  39  49  57  58  62  64  66  67  69  70
## [205]  74  77  78  79  82  85  86  91  95  98  99 100 103 107 112 113 116
## [222] 117 119 120 122 126 127 128 131 133 134 143 147 150 153 158 159 163
## [239] 166 169 170 171 172 174 175 182 184 187 188 189 194 197 203 204 205
## [256] 206 207 209 210 211 212 213 215 216 219 220 221 225 229 232 234 237
## [273] 238 241 242 245 250 253 255 256 258 260 261 262 265 269 270 272 273
## [290] 277 278 280 282 287 289 290 291 293 298 300 301 304 307 308 309 310
## [307] 311 318 320 324 325 326 328 334 335 338 341 344 347 348 349 351 352
## [324] 355 357 358 360 363 364 365 367 368 369

Prediction using SVM: The accuracy of the model on the test set is 0.83, with a kappa of 0.64.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction band noband
##     band     45     10
##     noband   17     90
##                                           
##                Accuracy : 0.8333          
##                  95% CI : (0.7669, 0.8872)
##     No Information Rate : 0.6173          
##     P-Value [Acc > NIR] : 1.737e-09       
##                                           
##                   Kappa : 0.6395          
##                                           
##  Mcnemar's Test P-Value : 0.2482          
##                                           
##             Sensitivity : 0.7258          
##             Specificity : 0.9000          
##          Pos Pred Value : 0.8182          
##          Neg Pred Value : 0.8411          
##              Prevalence : 0.3827          
##          Detection Rate : 0.2778          
##    Detection Prevalence : 0.3395          
##       Balanced Accuracy : 0.8129          
##                                           
##        'Positive' Class : band            
## 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  162 
## 
##  
##                        | svm.pred.2 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        45 |        17 |        62 | 
##                        |     0.726 |     0.274 |     0.383 | 
##                        |     0.818 |     0.159 |           | 
##                        |     0.278 |     0.105 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |        10 |        90 |       100 | 
##                        |     0.100 |     0.900 |     0.617 | 
##                        |     0.182 |     0.841 |           | 
##                        |     0.062 |     0.556 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        55 |       107 |       162 | 
##                        |     0.340 |     0.660 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 

The variable importance measure (VIM) is not implemented in caret for “svm”; therefore, we will use the rminer package to obtain it.

4. Remove columns not being considered & then perform KNN Imputation (no removal of 20%)

In this method, the idea is to first remove the column variables that are not considered essential for our learning model (judged by heuristics and understanding), and only then perform KNN imputation.

The aim is to avoid losing useful data. If we first removed the rows that have more than 20% of their values missing, we might discard rows whose missing values lie mostly in variables we were not going to consider anyway. We therefore change the order of the operations: the unused columns are dropped first, no rows are removed, and the remaining gaps are imputed.
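The effect of the ordering can be illustrated on a toy table. Everything in the sketch below is hypothetical: the column names `ignored_a` and `ignored_b`, the scaled values, and the small hand-written KNN imputer, which stands in for whatever imputation routine was actually used for the report.

```python
import math

def missing_fraction(row):
    """Share of a row's values that are missing (None)."""
    return sum(v is None for v in row.values()) / len(row)

def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k nearest rows that have a value.

    Distance is Euclidean over the columns observed in both rows.
    """
    filled = [dict(r) for r in rows]
    for i, row in enumerate(rows):
        for col, value in row.items():
            if value is not None:
                continue
            candidates = []
            for j, other in enumerate(rows):
                if j == i or other[col] is None:
                    continue
                shared = [c for c in row
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((row[c] - other[c]) ** 2 for c in shared))
                candidates.append((dist, other[col]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][col] = sum(nearest) / len(nearest)
    return filled

IGNORED = {"ignored_a", "ignored_b"}  # hypothetical unused columns
rows = [  # hypothetical, already-scaled values
    {"viscosity": 0.40, "humidity": 0.70, "press_speed": 0.60,
     "ignored_a": None, "ignored_b": None},
    {"viscosity": 0.42, "humidity": None, "press_speed": 0.62,
     "ignored_a": 1.0, "ignored_b": 2.0},
    {"viscosity": 0.41, "humidity": 0.72, "press_speed": 0.61,
     "ignored_a": 1.0, "ignored_b": 2.0},
    {"viscosity": 0.90, "humidity": 0.20, "press_speed": 0.10,
     "ignored_a": 1.0, "ignored_b": 2.0},
]

# Earlier order: drop rows with more than 20% missing values first.
dropped_first = [r for r in rows if missing_fraction(r) <= 0.2]

# Order used in this section: drop unused columns, keep all rows, impute.
trimmed = [{c: v for c, v in r.items() if c not in IGNORED} for r in rows]
imputed = knn_impute(trimmed, k=2)

print(len(dropped_first), len(imputed), imputed[1]["humidity"])
```

In the toy table, the first row is complete in every variable of interest, yet the 20% rule would discard it because both unused columns are missing. With the order used in this section it is kept and even serves as a neighbour for the imputation.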

a. Using the model: Random Forest

We first generate the training and test datasets from a random sample. We use 70% of the supplied dataset as the training set and the remaining 30% as the test set.

We next build the random forest with the randomForest library, using the training dataset. Since the number of variables is 39 in this case, we use mtry=6 (approximately the square root of 39); that is, 6 variables are tried at every split.

From the results, we observe the out-of-bag error rate to be 19.31%.

## 
## Call:
##  randomForest(formula = form, data = data[train, ], ntree = 1000,      mtry = 6, importance = TRUE, localImp = TRUE, replace = FALSE,      na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 19.31%
## Confusion matrix:
##        band noband class.error
## band    115     51   0.3072289
## noband   22    190   0.1037736

The variable importance is shown in the plot below. It is followed by the error rates (overall OOB and per class) for the first 15 trees and by the corresponding error plot, which shows how the error rate changes as more trees are added to the forest.

##          OOB   band noband
##  [1,] 0.3309 0.3220 0.3375
##  [2,] 0.3319 0.3627 0.3071
##  [3,] 0.3750 0.4016 0.3544
##  [4,] 0.3846 0.4203 0.3563
##  [5,] 0.3620 0.3986 0.3333
##  [6,] 0.3768 0.4295 0.3350
##  [7,] 0.3398 0.3688 0.3166
##  [8,] 0.3306 0.3827 0.2886
##  [9,] 0.3144 0.3681 0.2718
## [10,] 0.3270 0.4085 0.2621
## [11,] 0.3110 0.3636 0.2692
## [12,] 0.3316 0.3697 0.3014
## [13,] 0.3183 0.3675 0.2796
## [14,] 0.3254 0.4036 0.2642
## [15,] 0.2937 0.3614 0.2406

Mean Decrease in Gini is the average (mean) of a variable’s total decrease in node impurity, weighted by the proportion of samples reaching that node in each individual decision tree in the random forest.

This is effectively a measure of how important a variable is for estimating the value of the target variable across all of the trees that make up the forest. A higher Mean Decrease in Gini indicates higher variable importance. The most important variables to the model appear highest in the plot and have the largest Mean Decrease in Gini values; conversely, the least important variables appear lowest in the plot and have the smallest Mean Decrease in Gini values.
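To make the decrease in node impurity concrete, the sketch below computes it for a single split, using the class counts printed for the root node of the decision tree in the rpart summary earlier in this report. It is plain Python with a hypothetical helper name. A random forest averages such decreases over all the splits on a variable and over all trees, which this single-split example does not attempt.

```python
def gini(counts):
    """Gini impurity 1 - sum(p_i^2) for a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Class counts (band, noband) at the root split of the rpart summary:
parent = [228, 312]   # node 1, 540 observations
left = [216, 235]     # node 2, press_speed < 2184.5
right = [12, 77]      # node 3

n = sum(parent)
weighted_children = (sum(left) * gini(left) + sum(right) * gini(right)) / n
decrease = gini(parent) - weighted_children   # impurity decrease of the split
improve = n * decrease                        # the same quantity times n

print(f"gini(parent)={gini(parent):.4f} decrease={decrease:.5f} "
      f"improve={improve:.5f}")
```

Multiplied by the number of observations in the node, the decrease comes out at about 17.60, which agrees with the improve value printed for the press_speed split in that summary.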

In this case, the variables with a Mean Decrease in Gini greater than 7 (and therefore the most important) are:

  1. press_speed
  2. viscosity
  3. humidity
  4. ink_temperature
  5. proof_cut
  6. ink_pct

##        band noband class.error
## band    115     51   0.3072289
## noband   22    190   0.1037736

We now proceed to predict using the test data set and create the confusion matrix for the same.

##         pred
## true     band noband
##   band     45     17
##   noband    6     94
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  162 
## 
##  
##                        | band.pred 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        45 |        17 |        62 | 
##                        |     0.726 |     0.274 |     0.383 | 
##                        |     0.882 |     0.153 |           | 
##                        |     0.278 |     0.105 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |         6 |        94 |       100 | 
##                        |     0.060 |     0.940 |     0.617 | 
##                        |     0.118 |     0.847 |           | 
##                        |     0.037 |     0.580 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        51 |       111 |       162 | 
##                        |     0.315 |     0.685 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 

We now use repeated cross-validation (10 folds, repeated 10 times) to see whether we get better classification performance. The confusion matrix and cross-table for this model are given below. As can be seen, the selected model uses mtry=2 and reaches an accuracy of 0.85 on the test set, with a kappa of 0.65; the cross-validated accuracy for mtry=2 is 0.78.

There is no standardized interpretation of the kappa statistic. According to Wikipedia (citing their paper), Landis and Koch consider 0-0.20 as slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1 as almost perfect.

Fleiss considers kappas > 0.75 as excellent, 0.40-0.75 as fair to good, and < 0.40 as poor.
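
To make the statistic concrete, here is a minimal base-R sketch that recomputes accuracy and kappa from the confusion matrix of the repeated cross-validation model reported in this section, and maps the result onto the Landis and Koch bands. The helper `landis_koch` is a hypothetical name introduced for this example; negative kappas are outside the quoted bands and return NA.

```r
# Confusion matrix of the repeated cross-validation model
# (rows = prediction, columns = reference), filled column by column
cm <- matrix(c(40, 22, 3, 97), nrow = 2,
             dimnames = list(Prediction = c("band", "noband"),
                             Reference  = c("band", "noband")))

n  <- sum(cm)
po <- sum(diag(cm)) / n                      # observed agreement (accuracy)
pe <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
kappa_hat <- (po - pe) / (1 - pe)

# Landis and Koch bands as quoted in the text
landis_koch <- function(k) {
  as.character(cut(k,
                   breaks = c(0, 0.20, 0.40, 0.60, 0.80, 1),
                   labels = c("slight", "fair", "moderate",
                              "substantial", "almost perfect"),
                   include.lowest = TRUE))
}

round(c(accuracy = po, kappa = kappa_hat), 4)
landis_koch(kappa_hat)
```

The recomputed values agree with the Accuracy and Kappa lines printed by caret for this model, and the kappa falls in the "substantial" band.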

##   model parameter                         label forReg forClass probModel
## 1    rf      mtry #Randomly Selected Predictors   TRUE     TRUE      TRUE
##    user  system elapsed 
##  119.08    0.84  121.20
## Random Forest 
## 
## 378 samples
##  26 predictor
##   2 classes: 'band', 'noband' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 341, 340, 341, 341, 340, 341, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7780226  0.5358503
##   18    0.7745713  0.5336904
##   35    0.7616369  0.5075637
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction band noband
##     band     40      3
##     noband   22     97
##                                           
##                Accuracy : 0.8457          
##                  95% CI : (0.7807, 0.8976)
##     No Information Rate : 0.6173          
##     P-Value [Acc > NIR] : 1.638e-10       
##                                           
##                   Kappa : 0.6532          
##                                           
##  Mcnemar's Test P-Value : 0.0003182       
##                                           
##             Sensitivity : 0.6452          
##             Specificity : 0.9700          
##          Pos Pred Value : 0.9302          
##          Neg Pred Value : 0.8151          
##              Prevalence : 0.3827          
##          Detection Rate : 0.2469          
##    Detection Prevalence : 0.2654          
##       Balanced Accuracy : 0.8076          
##                                           
##        'Positive' Class : band            
## 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  162 
## 
##  
##                        | band.pred.1 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        40 |        22 |        62 | 
##                        |     0.645 |     0.355 |     0.383 | 
##                        |     0.930 |     0.185 |           | 
##                        |     0.247 |     0.136 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |         3 |        97 |       100 | 
##                        |     0.030 |     0.970 |     0.617 | 
##                        |     0.070 |     0.815 |           | 
##                        |     0.019 |     0.599 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        43 |       119 |       162 | 
##                        |     0.265 |     0.735 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 

For a large class imbalance (which is not the case in our example, as the band and noband classes are reasonably well balanced), caret can subsample the data set to balance the classes before model fitting. This is done with the option sampling=“down”, which is set through trainControl and passed on to train.
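
The idea behind down-sampling can be sketched in base R as follows. This is only an illustration, not caret's implementation: the label vector is synthetic, built with the class counts of the training set used above (166 band, 212 noband), and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only to make this sketch reproducible

# Synthetic labels with the class counts of the training set
band_type <- factor(rep(c("band", "noband"), times = c(166, 212)))

# Size of the smallest class
n_min <- min(table(band_type))

# For each class, keep a random sample of n_min row indices
keep <- unlist(lapply(split(seq_along(band_type), band_type),
                      function(idx) sample(idx, n_min)))

table(band_type[keep])   # both classes now have n_min rows
```

When the option is set in caret, the subsampling is applied inside each resample rather than once up front, so the held-out folds keep their original class proportions.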

The corresponding confusion matrix and cross-table are shown below.

##    user  system elapsed 
##  101.75    0.46  103.03
## Random Forest 
## 
## 378 samples
##  26 predictor
##   2 classes: 'band', 'noband' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 341, 340, 340, 341, 340, 341, ... 
## Addtional sampling using down-sampling
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7621704  0.5196313
##   18    0.7515160  0.4983770
##   35    0.7432695  0.4825770
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction band noband
##     band     50     11
##     noband   12     89
##                                           
##                Accuracy : 0.858           
##                  95% CI : (0.7946, 0.9078)
##     No Information Rate : 0.6173          
##     P-Value [Acc > NIR] : 1.285e-11       
##                                           
##                   Kappa : 0.6986          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8065          
##             Specificity : 0.8900          
##          Pos Pred Value : 0.8197          
##          Neg Pred Value : 0.8812          
##              Prevalence : 0.3827          
##          Detection Rate : 0.3086          
##    Detection Prevalence : 0.3765          
##       Balanced Accuracy : 0.8482          
##                                           
##        'Positive' Class : band            
## 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  162 
## 
##  
##                        | band.pred.2 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        50 |        12 |        62 | 
##                        |     0.806 |     0.194 |     0.383 | 
##                        |     0.820 |     0.119 |           | 
##                        |     0.309 |     0.074 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |        11 |        89 |       100 | 
##                        |     0.110 |     0.890 |     0.617 | 
##                        |     0.180 |     0.881 |           | 
##                        |     0.068 |     0.549 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        61 |       101 |       162 | 
##                        |     0.377 |     0.623 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 

b. Using the model: Neural Network

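We build a neural network (NN) classifier in this case by providing the dataset free from missing values, using a single hidden layer with 2 units (the 75 weights reported below correspond to 35 inputs, 2 hidden units and 1 output).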

## # weights:  75
## initial  value 260.297685 
## final  value 259.203866 
## converged
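
The weight count in this output can be checked with a short calculation. The sketch below assumes a single hidden layer without skip-layer connections and 35 input columns, which is the number of rows in the relative-importance table further down; `n_weights` is a helper name made up for this example.

```r
# Number of weights in a single-hidden-layer network without skip-layer
# connections: every hidden unit and every output unit has one bias term
n_weights <- function(n_inputs, n_hidden, n_outputs = 1) {
  (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs
}

n_weights(n_inputs = 35, n_hidden = 2)
```

Under these assumptions the formula gives 75, the figure in the `# weights` line.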

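We then test the model on the test dataset and obtain the misclassification table shown below. The network predicts “noband” for every one of the 162 test observations, so its accuracy is 100/162 = 0.617, which equals the no-information rate, and it never detects “banding”. The fit also barely moved from its initial value (260.30 to 259.20), so this model performs worse than randomForest and should not be relied on as it stands.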

##         predicted
## true     noband
##   band       62
##   noband    100

The misclassification table can also be obtained by using the gmodels library:

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  162 
## 
##  
##                        | predicted 
## data$band_type[-train] |    noband | Row Total | 
## -----------------------|-----------|-----------|
##                   band |        62 |        62 | 
##                        |     0.383 |           | 
## -----------------------|-----------|-----------|
##                 noband |       100 |       100 | 
##                        |     0.617 |           | 
## -----------------------|-----------|-----------|
##           Column Total |       162 |       162 | 
## -----------------------|-----------|-----------|
## 
## 

We can now list the relative importance of the variables using a sourced function (courtesy: https://beckmw.wordpress.com/2013/08/12/variable-importance-in-neural-networks/)

##                          rel.imp              x.names
## ink_pct              -0.95428136              ink_pct
## humidity             -0.84236796             humidity
## wax                  -0.82342916                  wax
## solvent_typeXYLOL    -0.77438835    solvent_typeXYLOL
## press_typeWoodHoe70  -0.67125659  press_typeWoodHoe70
## ink_typecover        -0.59759231        ink_typecover
## caliper              -0.54061341              caliper
## viscosity            -0.50092145            viscosity
## cylinder_sizetabloid -0.43952260 cylinder_sizetabloid
## proof_cut            -0.32721833            proof_cut
## press_typeMotter94   -0.26625756   press_typeMotter94
## anode_space_ratio    -0.22611918    anode_space_ratio
## paper_typeuncoated   -0.21282091   paper_typeuncoated
## direct_steamyes      -0.12860134      direct_steamyes
## paper_typesuper      -0.10583570      paper_typesuper
## locationscandanavian -0.08288393 locationscandanavian
## ink_typeuncoated     -0.06885445     ink_typeuncoated
## press                -0.04479892                press
## blade_pressure       -0.02039919       blade_pressure
## cylinder_typeyes      0.00000000     cylinder_typeyes
## locationmideuropean   0.03440463  locationmideuropean
## ink_temperature       0.11009559      ink_temperature
## hardener              0.25508124             hardener
## roller_durometer      0.34119484     roller_durometer
## locationsouthus       0.34694584      locationsouthus
## locationnorthus       0.41373216      locationnorthus
## press_speed           0.42958805          press_speed
## current_density       0.50752345      current_density
## solvent_pct           0.55162539          solvent_pct
## roughness             0.56921057            roughness
## grain_screenedYES     0.69775106    grain_screenedYES
## proof_inkYES          0.72720408         proof_inkYES
## cylinder_sizespiegel  0.86676553 cylinder_sizespiegel
## solvent_typeNAPTHA    0.92750811   solvent_typeNAPTHA
## press_typeMotter70    1.00000000   press_typeMotter70

The bar plot tells us that the variables ink_pct and press_typeMotter70 have the strongest negative and positive relationships, respectively, with the response variable band_type.

In decreasing order of positive relative importance, the top five are:

  1. press_type - Motter70
  2. solvent_type - NAPTHA
  3. cylinder_size - spiegel
  4. grain_screened - YES
  5. roughness

c. Using the model: Decision Tree

We consider the use of a decision tree, modelled using the rpart function. To get a large tree we make the complexity parameter (cp) really small.

In the output we see all the trees considered in the model, each with its complexity parameter, number of splits, resubstitution error rate, cross-validated error rate and the associated standard error.

## 
## Classification tree:
## rpart(formula = data$band_type ~ ., data = data, method = "class", 
##     control = rpart.control(minsplit = 4, cp = 1e-05))
## 
## Variables actually used in tree construction:
##  [1] anode_space_ratio blade_pressure    caliper          
##  [4] current_density   cylinder_type     grain_screened   
##  [7] hardener          humidity          ink_pct          
## [10] ink_temperature   ink_type          location         
## [13] paper_type        press             press_speed      
## [16] press_type        proof_cut         proof_ink        
## [19] roller_durometer  roughness         solvent_pct      
## [22] solvent_type      viscosity         wax              
## 
## Root node error: 228/540 = 0.42222
## 
## n= 540 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.0723684      0  1.000000 1.00000 0.050340
## 2  0.0438596      3  0.728070 0.76316 0.047630
## 3  0.0307018      6  0.596491 0.79825 0.048178
## 4  0.0219298      7  0.565789 0.67544 0.046017
## 5  0.0175439     11  0.451754 0.60526 0.044455
## 6  0.0131579     15  0.381579 0.60526 0.044455
## 7  0.0109649     20  0.315789 0.63158 0.045071
## 8  0.0087719     22  0.293860 0.63158 0.045071
## 9  0.0065789     38  0.153509 0.62281 0.044870
## 10 0.0058480     42  0.127193 0.61842 0.044768
## 11 0.0043860     45  0.109649 0.62281 0.044870
## 12 0.0021930     64  0.026316 0.65789 0.045651
## 13 0.0014620     66  0.021930 0.65351 0.045556
## 14 0.0000100     69  0.017544 0.65351 0.045556

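From the complexity table we observe that the lowest cross-validated error (xerror) of about 0.61 occurs at 11 to 15 splits, that is, at a tree size of 12 to 16 terminal nodes. All errors in the table are expressed relative to the root node error of 0.42.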
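
Reading the table by eye is error-prone, so the selection can also be done programmatically. The sketch below retypes the relevant columns of the complexity table above into a data frame and picks both the row with the smallest xerror and the simplest tree within one standard error of it (a common rule of thumb, applied here as an assumption rather than as the method used in this analysis).

```r
# Columns of the complexity table printed above
cptable <- data.frame(
  CP     = c(0.0723684, 0.0438596, 0.0307018, 0.0219298, 0.0175439,
             0.0131579, 0.0109649, 0.0087719, 0.0065789, 0.0058480,
             0.0043860, 0.0021930, 0.0014620, 0.0000100),
  nsplit = c(0, 3, 6, 7, 11, 15, 20, 22, 38, 42, 45, 64, 66, 69),
  xerror = c(1.00000, 0.76316, 0.79825, 0.67544, 0.60526, 0.60526, 0.63158,
             0.63158, 0.62281, 0.61842, 0.62281, 0.65789, 0.65351, 0.65351),
  xstd   = c(0.050340, 0.047630, 0.048178, 0.046017, 0.044455, 0.044455,
             0.045071, 0.045071, 0.044870, 0.044768, 0.044870, 0.045651,
             0.045556, 0.045556)
)

# Row with the smallest cross-validated error (first one in case of ties)
best <- which.min(cptable$xerror)

# One-standard-error rule: simplest tree whose xerror is within one
# standard error of the minimum
threshold <- cptable$xerror[best] + cptable$xstd[best]
one_se    <- which(cptable$xerror <= threshold)[1]

cptable[unique(c(best, one_se)), ]
```

With a fitted rpart object, the same table is available in its `cptable` component, and the chosen CP value can be passed to `prune()`.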

To reduce this tree size we can prune the tree and use repeated cross-validation. Doing this gives us a tree with 3 splits (4 terminal nodes), with a relative resubstitution error of about 0.73 and a cross-validated error of about 0.76.

This again is not a good predictor due to the large error rate.

## Call:
## rpart(formula = data$band_type ~ ., data = data, method = "class", 
##     control = rpart.control(minsplit = 4, cp = 1e-05))
##   n= 540 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.07236842      0 1.0000000 1.0000000 0.05033997
## 2 0.05000000      3 0.7280702 0.7631579 0.04763031
## 
## Variable importance
##      press_speed            press       press_type roller_durometer 
##               19               17               11               11 
##   blade_pressure       paper_type         location         ink_type 
##                8                8                7                5 
##   grain_screened    cylinder_size        proof_cut              wax 
##                5                4                2                2 
##      solvent_pct 
##                1 
## 
## Node number 1: 540 observations,    complexity param=0.07236842
##   predicted class=noband  expected loss=0.4222222  P(node) =1
##     class counts:   228   312
##    probabilities: 0.422 0.578 
##   left son=2 (451 obs) right son=3 (89 obs)
##   Primary splits:
##       press_speed      < 2184.5   to the left,  improve=17.60284, (0 missing)
##       paper_type       splits as  RLR,          improve=16.03920, (0 missing)
##       roller_durometer < 33.07812 to the right, improve=14.64381, (0 missing)
##       location         splits as  RLRRL,        improve=13.89276, (0 missing)
##       press            < 822.5    to the left,  improve=11.94435, (0 missing)
##   Surrogate splits:
##       press        < 827.5    to the left,  agree=0.863, adj=0.169, (0 split)
##       solvent_pct  < 44.05    to the left,  agree=0.846, adj=0.067, (0 split)
##       solvent_type splits as  LRL,          agree=0.839, adj=0.022, (0 split)
## 
## Node number 2: 451 observations,    complexity param=0.07236842
##   predicted class=noband  expected loss=0.4789357  P(node) =0.8351852
##     class counts:   216   235
##    probabilities: 0.479 0.521 
##   left son=4 (341 obs) right son=5 (110 obs)
##   Primary splits:
##       press      < 814      to the right, improve=13.487460, (0 missing)
##       paper_type splits as  RLR,          improve=12.558990, (0 missing)
##       humidity   < 69.5     to the right, improve=10.013630, (0 missing)
##       location   splits as  RLRRL,        improve= 9.804620, (0 missing)
##       press_type splits as  LRLL,         improve= 9.124902, (0 missing)
##   Surrogate splits:
##       press_type     splits as  RRLL,         agree=0.956, adj=0.818, (0 split)
##       blade_pressure < 35.06462 to the left,  agree=0.900, adj=0.591, (0 split)
##       cylinder_size  splits as  RLL,          agree=0.825, adj=0.282, (0 split)
##       wax            < 2.95     to the left,  agree=0.789, adj=0.136, (0 split)
##       press_speed    < 1432.5   to the right, agree=0.787, adj=0.127, (0 split)
## 
## Node number 3: 89 observations
##   predicted class=noband  expected loss=0.1348315  P(node) =0.1648148
##     class counts:    12    77
##    probabilities: 0.135 0.865 
## 
## Node number 4: 341 observations,    complexity param=0.07236842
##   predicted class=band    expected loss=0.4516129  P(node) =0.6314815
##     class counts:   187   154
##    probabilities: 0.548 0.452 
##   left son=8 (208 obs) right son=9 (133 obs)
##   Primary splits:
##       roller_durometer < 33.07812 to the right, improve=10.805260, (0 missing)
##       current_density  < 36       to the right, improve= 9.745700, (0 missing)
##       paper_type       splits as  RLR,          improve= 9.592881, (0 missing)
##       press            < 822.5    to the left,  improve= 9.523061, (0 missing)
##       ink_type         splits as  RLL,          improve= 8.733958, (0 missing)
##   Surrogate splits:
##       paper_type     splits as  RLL,          agree=0.891, adj=0.722, (0 split)
##       location       splits as  RLLLL,        agree=0.848, adj=0.609, (0 split)
##       ink_type       splits as  RLL,          agree=0.801, adj=0.489, (0 split)
##       grain_screened splits as  RL,           agree=0.774, adj=0.421, (0 split)
##       proof_cut      < 39.84652 to the right, agree=0.692, adj=0.211, (0 split)
## 
## Node number 5: 110 observations
##   predicted class=noband  expected loss=0.2636364  P(node) =0.2037037
##     class counts:    29    81
##    probabilities: 0.264 0.736 
## 
## Node number 8: 208 observations
##   predicted class=band    expected loss=0.3509615  P(node) =0.3851852
##     class counts:   135    73
##    probabilities: 0.649 0.351 
## 
## Node number 9: 133 observations
##   predicted class=noband  expected loss=0.3909774  P(node) =0.2462963
##     class counts:    52    81
##    probabilities: 0.391 0.609
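
The error of the pruned tree can be recovered directly from the terminal nodes listed in this summary. The following base-R sketch adds up the misclassified observations in the four leaves (nodes 3, 5, 8 and 9) and expresses them both as an absolute rate and relative to the root node, which is the scale used in the `rel error` column.

```r
# Class counts (band, noband) in the four terminal nodes of the pruned tree
leaves <- rbind(
  node3 = c(band = 12,  noband = 77),
  node5 = c(band = 29,  noband = 81),
  node8 = c(band = 135, noband = 73),
  node9 = c(band = 52,  noband = 81)
)

# Each leaf predicts its majority class, so its errors are the minority count
errors <- sum(apply(leaves, 1, min))
n      <- sum(leaves)

# Errors made at the root when always predicting the majority class
root_errors <- min(colSums(leaves))

errors / n             # absolute resubstitution error rate
errors / root_errors   # error relative to the root node
```

The relative figure reproduces the `rel error` of the 3-split row, while the absolute rate shows that roughly 3 in 10 observations are still misclassified on the data the tree was grown on.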

The variable importance table is given below:

##                     Overall
## anode_space_ratio 29.198297
## blade_pressure    58.779461
## caliper           19.474535
## current_density   24.888647
## cylinder_size      7.194784
## cylinder_type      6.077429
## grain_screened    12.931862
## hardener          34.775933
## humidity          56.337643
## ink_pct           84.653468
## ink_temperature   56.023266
## ink_type          35.748195
## location          50.644398
## paper_type        59.847210
## press             60.211992
## press_speed       46.497600
## press_type        30.489342
## proof_cut         34.437336
## proof_ink         26.224542
## roller_durometer  32.888612
## roughness         30.441449
## solvent_pct       26.219146
## solvent_type       3.250000
## viscosity         77.422617
## wax               36.046013
## direct_steam       0.000000

In decreasing order of importance, the top five variables are:

  1. ink_pct
  2. viscosity
  3. press
  4. paper_type
  5. blade_pressure

d. Using the model: SVM

We consider the use of support vector machines to build a classifier and to also extract the feature importance list.

First we use a cost of 10 (slightly large), which penalizes misclassified training points heavily and therefore leaves only a narrow margin for misclassification.

## 
## Call:
## svm(formula = data[train, ]$band_type ~ ., data = data[train, 
##     ], method = "C-classification", kernel = "radial", cost = 10, 
##     gamma = 0.1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
## 
## Number of Support Vectors:  335
## 
##  ( 182 153 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  band noband
##   [1]   1   3   4   5   6   7   8  10  13  14  17  18  20  22  23  26  28
##  [18]  29  30  31  33  34  36  38  40  41  42  43  44  45  46  48  50  52
##  [35]  53  54  59  61  63  65  68  71  73  75  76  81  83  84  87  88  89
##  [52]  90  92  93  94  96  97 101 102 104 105 108 110 118 121 123 125 129
##  [69] 130 132 135 136 137 138 139 141 142 144 145 146 148 149 151 152 155
##  [86] 157 160 161 162 165 167 168 173 176 177 178 180 181 185 186 191 192
## [103] 198 199 200 201 202 208 214 217 218 222 223 224 230 231 233 235 236
## [120] 240 243 244 246 247 251 252 254 257 263 264 266 267 271 274 279 281
## [137] 283 284 285 286 288 292 294 295 296 297 299 302 303 305 306 312 314
## [154] 315 317 321 323 327 329 330 336 337 339 342 343 345 346 350 353 354
## [171] 356 359 361 362 366 370 371 372 373 374 375 377   2   9  12  15  19
## [188]  21  24  25  27  32  35  39  49  55  57  58  62  64  66  67  69  70
## [205]  74  77  78  79  82  85  86  91  95  98  99 100 103 107 112 113 115
## [222] 117 119 120 122 126 127 128 131 134 143 147 150 153 158 159 163 166
## [239] 169 170 171 172 174 175 182 184 188 189 194 197 203 204 205 206 207
## [256] 209 210 211 212 213 215 216 219 220 221 225 229 232 234 237 238 241
## [273] 242 245 248 250 253 255 256 258 259 260 261 262 265 270 272 273 275
## [290] 277 278 280 282 287 289 291 293 298 300 301 304 307 308 309 310 311
## [307] 318 319 320 324 325 326 328 332 334 335 338 341 344 347 348 349 351
## [324] 352 355 357 358 360 363 364 365 367 368 369 378
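
The call above uses a radial kernel with gamma = 0.1. As a small illustration of what that kernel measures, the sketch below implements the radial basis function in base R; the vectors are hypothetical and `rbf_kernel` is a name chosen for this example, not a function from e1071.

```r
# Radial basis function kernel: exp(-gamma * squared Euclidean distance)
rbf_kernel <- function(x, y, gamma = 0.1) {
  exp(-gamma * sum((x - y)^2))
}

# Hypothetical feature vectors, for illustration only
a <- c(1, 2, 3)
b <- c(1, 2, 3)
d <- c(2, 4, 6)

rbf_kernel(a, b)   # identical points: similarity of 1
rbf_kernel(a, d)   # more distant points: similarity closer to 0
```

A larger gamma makes the similarity fall off faster with distance, which tends to produce a more flexible decision boundary and, typically, more support vectors.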

Prediction using SVM: the accuracy of the model is 0.85, with a kappa of 0.67.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction band noband
##     band     47     10
##     noband   15     90
##                                           
##                Accuracy : 0.8457          
##                  95% CI : (0.7807, 0.8976)
##     No Information Rate : 0.6173          
##     P-Value [Acc > NIR] : 1.638e-10       
##                                           
##                   Kappa : 0.6683          
##                                           
##  Mcnemar's Test P-Value : 0.4237          
##                                           
##             Sensitivity : 0.7581          
##             Specificity : 0.9000          
##          Pos Pred Value : 0.8246          
##          Neg Pred Value : 0.8571          
##              Prevalence : 0.3827          
##          Detection Rate : 0.2901          
##    Detection Prevalence : 0.3519          
##       Balanced Accuracy : 0.8290          
##                                           
##        'Positive' Class : band            
## 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  162 
## 
##  
##                        | svm.pred.2 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        47 |        15 |        62 | 
##                        |     0.758 |     0.242 |     0.383 | 
##                        |     0.825 |     0.143 |           | 
##                        |     0.290 |     0.093 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |        10 |        90 |       100 | 
##                        |     0.100 |     0.900 |     0.617 | 
##                        |     0.175 |     0.857 |           | 
##                        |     0.062 |     0.556 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        57 |       105 |       162 | 
##                        |     0.352 |     0.648 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 

The variable importance matrix (VIM) is not implemented in caret for “svm”; therefore we will use the rminer package to obtain it.

5. Remove unconsidered columns first, Remove 20%, perform KNN Imputation

In this case, we first remove the unconsidered columns from the dataset, then further remove observations where more than 20% data is missing, and perform KNN imputation on the remaining data.

The variables that are not considered are:

  1. date (some months might experience more banding than others, but this is explainable)
  2. customer
  3. job_number
  4. cylinder_no
  5. cylinder_division
  6. ink_color
  7. unit_number
  8. plating_tank
  9. blade_mfg
  10. chrome_content
  11. ESA_voltage
  12. ESA_amperage
##  grain_screened proof_ink    paper_type      ink_type   direct_steam
##  NO :273        NO : 24   coated  :206   coated  :264   no :482     
##  YES:211        YES:460   super   :  0   cover   : 15   yes:  2     
##                           uncoated:278   uncoated:205               
##                                                                     
##                                                                     
##                                                                     
##  solvent_type cylinder_type     press_type      press       cylinder_size
##  LINE  :467   no :134       Albert70 : 60   Min.   :802.0   catalog:162  
##  NAPTHA:  2   yes:350       Motter70 : 48   1st Qu.:815.0   spiegel: 50  
##  XYLOL : 15                 Motter94 :210   Median :816.0   tabloid:272  
##                             WoodHoe70:166   Mean   :817.5                
##                                             3rd Qu.:824.0                
##                                             Max.   :828.0                
##          location     proof_cut       viscosity        caliper      
##  canadian    :215   Min.   :25.00   Min.   :35.00   Min.   :0.1330  
##  mideuropean : 49   1st Qu.:40.00   1st Qu.:43.00   1st Qu.:0.2000  
##  northus     :199   Median :45.00   Median :50.00   Median :0.2670  
##  scandanavian: 13   Mean   :45.04   Mean   :50.86   Mean   :0.2756  
##  southus     :  8   3rd Qu.:50.00   3rd Qu.:56.00   3rd Qu.:0.3083  
##                     Max.   :72.50   Max.   :72.00   Max.   :0.5330  
##  ink_temperature    humidity        roughness       blade_pressure 
##  Min.   :11.20   Min.   : 57.00   Min.   :0.05625   Min.   :16.00  
##  1st Qu.:14.50   1st Qu.: 73.00   1st Qu.:0.62500   1st Qu.:25.00  
##  Median :15.10   Median : 78.00   Median :0.75000   Median :30.00  
##  Mean   :15.31   Mean   : 78.47   Mean   :0.72738   Mean   :31.24  
##  3rd Qu.:16.00   3rd Qu.: 82.00   3rd Qu.:0.81250   3rd Qu.:33.81  
##  Max.   :24.50   Max.   :105.00   Max.   :1.25000   Max.   :70.00  
##   varnish_pct      press_speed      ink_pct       solvent_pct   
##  Min.   : 0.000   Min.   :   0   Min.   :41.00   Min.   :22.00  
##  1st Qu.: 0.000   1st Qu.:1600   1st Qu.:52.10   1st Qu.:36.80  
##  Median : 3.400   Median :1800   Median :56.75   Median :38.50  
##  Mean   : 5.767   Mean   :1831   Mean   :55.64   Mean   :38.58  
##  3rd Qu.:10.400   3rd Qu.:2050   3rd Qu.:58.80   3rd Qu.:41.20  
##  Max.   :35.800   Max.   :2600   Max.   :76.90   Max.   :53.40  
##       wax           hardener      roller_durometer current_density
##  Min.   :0.000   Min.   :0.0000   Min.   :28.00    Min.   :30.00  
##  1st Qu.:2.414   1st Qu.:0.8000   1st Qu.:30.00    1st Qu.:40.00  
##  Median :2.500   Median :1.0000   Median :34.00    Median :40.00  
##  Mean   :2.422   Mean   :0.9692   Mean   :34.78    Mean   :38.96  
##  3rd Qu.:2.700   3rd Qu.:1.0000   3rd Qu.:40.00    3rd Qu.:40.00  
##  Max.   :3.100   Max.   :3.0000   Max.   :60.00    Max.   :45.00  
##  anode_space_ratio  band_type  
##  Min.   : 90.0     band  :173  
##  1st Qu.:100.0     noband:311  
##  Median :103.1                 
##  Mean   :102.9                 
##  3rd Qu.:106.5                 
##  Max.   :117.9
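
The first two preparation steps of this approach (dropping unconsidered columns, then dropping observations with more than 20% missing values) can be sketched in base R as below. The data frame is a small hypothetical stand-in for the real data set, and the KNN imputation step itself is not shown.

```r
# Hypothetical toy data standing in for the real data set
df <- data.frame(
  customer    = c("A", "B", "C", "D"),
  viscosity   = c(43, NA, 50, NA),
  humidity    = c(73, 78, NA, NA),
  press_speed = c(1600, 1800, 2050, NA),
  ink_pct     = c(52.1, 56.8, 58.8, 55.0),
  wax         = c(2.4, 2.5, 2.7, 2.5),
  hardener    = c(0.8, 1.0, 1.0, 1.0)
)

# Step 1: drop the columns that are not considered
dropped <- c("customer")
df <- df[, !(names(df) %in% dropped), drop = FALSE]

# Step 2: drop rows where more than 20% of the remaining values are missing
frac_missing <- rowMeans(is.na(df))
df <- df[frac_missing <= 0.20, , drop = FALSE]

df   # remaining rows still contain a few NAs, to be filled by imputation
```

Removing the columns first matters: the missing fraction per row is then computed only over the variables that will actually be modelled.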

a. Using the model: Random Forest

We first generate the training and test datasets based on a random sample. We use 70% of the supplied dataset as training set, and the remaining 30% as a test dataset.
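
A 70/30 split of this kind can be sketched in base R as follows. The row count of 484 is taken from the summary above; the seed is arbitrary, since the one used in the analysis is not known.

```r
set.seed(1)  # arbitrary seed; the seed used in the analysis is not known

n_obs <- 484                                 # observations in the data set
train <- sample(n_obs, floor(0.7 * n_obs))   # row indices of the training set
test  <- setdiff(seq_len(n_obs), train)      # remaining rows form the test set

length(train)
length(test)
```

The resulting sizes match the 338 training samples and 146 test observations reported in the output of this section. Keeping `train` as a vector of row indices is what allows expressions such as `data[train, ]` and `data$band_type[-train]`.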

We next build the random forest using the randomForest library and the training data set. We use mtry=6, that is, 6 variables tried at every split; this is roughly the square root of the 39 variables in the supplied dataset (27 predictors remain after removing the columns listed above).

From the results, we observe the out-of-bag error rate to be 22.78%.

## 
## Call:
##  randomForest(formula = form, data = data[train, ], ntree = 1000,      mtry = 6, importance = TRUE, localImp = TRUE, replace = FALSE,      na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 22.78%
## Confusion matrix:
##        band noband class.error
## band     74     54   0.4218750
## noband   23    187   0.1095238
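
The error figures in this output follow directly from the confusion matrix. A minimal base-R sketch, with the matrix retyped from the output above:

```r
# OOB confusion matrix (rows = true class, columns = OOB prediction),
# filled column by column
oob <- matrix(c(74, 23, 54, 187), nrow = 2,
              dimnames = list(true = c("band", "noband"),
                              pred = c("band", "noband")))

class_error <- 1 - diag(oob) / rowSums(oob)    # per-class error rate
oob_error   <- 1 - sum(diag(oob)) / sum(oob)   # overall OOB error rate

round(class_error, 4)
round(100 * oob_error, 2)
```

The per-class view is worth keeping in mind: the overall OOB error is dominated by the larger noband class, while more than 4 in 10 band observations are misclassified.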

The variable importance is shown in the plot below. The OOB and per-class error rates after each of the first 15 trees are listed beneath it, together with the corresponding error plot, which shows the change in error rate as more trees are added to the forest.

##          OOB   band noband
##  [1,] 0.3468 0.4000 0.3108
##  [2,] 0.3483 0.4375 0.2893
##  [3,] 0.3770 0.4894 0.3067
##  [4,] 0.3630 0.4667 0.3011
##  [5,] 0.3233 0.4464 0.2500
##  [6,] 0.3139 0.4474 0.2359
##  [7,] 0.3478 0.4836 0.2650
##  [8,] 0.3405 0.5040 0.2388
##  [9,] 0.3140 0.4841 0.2079
## [10,] 0.2939 0.4762 0.1814
## [11,] 0.2934 0.4646 0.1884
## [12,] 0.3134 0.4724 0.2163
## [13,] 0.3095 0.4567 0.2201
## [14,] 0.2857 0.4409 0.1914
## [15,] 0.2917 0.4724 0.1818

Mean Decrease in Gini is the average (mean) of a variable’s total decrease in node impurity, weighted by the proportion of samples reaching that node in each individual decision tree in the random forest.

In this case, the variables with a Mean Decrease in Gini between 6 and 8 (and therefore the most important) are:

  1. viscosity
  2. press_speed
  3. ink_pct
  4. humidity
  5. solvent_pct
  6. ink_temperature
##        band noband class.error
## band     74     54   0.4218750
## noband   23    187   0.1095238

We now proceed to predict using the test data set and create the confusion matrix for the same.

##         pred
## true     band noband
##   band     26     19
##   noband    7     94
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  146 
## 
##  
##                        | band.pred 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        26 |        19 |        45 | 
##                        |     0.578 |     0.422 |     0.308 | 
##                        |     0.788 |     0.168 |           | 
##                        |     0.178 |     0.130 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |         7 |        94 |       101 | 
##                        |     0.069 |     0.931 |     0.692 | 
##                        |     0.212 |     0.832 |           | 
##                        |     0.048 |     0.644 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        33 |       113 |       146 | 
##                        |     0.226 |     0.774 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 
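
The headline metrics for this table can be recomputed by hand. The following is a minimal Python sketch (independent of the R workflow used in this report) that takes the four cell counts from the cross-table above and derives accuracy, sensitivity, and specificity, treating band as the positive class. The function and variable names are illustrative assumptions.

```python
def binary_metrics(tp, fn, fp, tn):
    """Return accuracy, sensitivity and specificity for a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # share of true 'band' rows predicted 'band'
    specificity = tn / (tn + fp)   # share of true 'noband' rows predicted 'noband'
    return accuracy, sensitivity, specificity

# Counts from the random forest test-set table (rows are the true classes).
acc, sens, spec = binary_metrics(tp=26, fn=19, fp=7, tn=94)
print(f"accuracy={acc:.3f} sensitivity={sens:.3f} specificity={spec:.3f}")
```

An accuracy of about 0.82 hides the fact that only about 58% of the actual band cases are caught, which matters here because band is the outcome the plant wants to detect.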

We now use repeated cross-validation (10 folds, repeated 10 times) to see whether we get better classification performance. The tuning results, confusion matrix, and cross-table are shown below. The selected model uses mtry = 19, with a cross-validated accuracy of 0.74 (kappa 0.42); on the test set it reaches an accuracy of 0.82 with a kappa of 0.55.

##   model parameter                         label forReg forClass probModel
## 1    rf      mtry #Randomly Selected Predictors   TRUE     TRUE      TRUE
##    user  system elapsed 
##  103.31    0.46  104.45
## Random Forest 
## 
## 338 samples
##  27 predictor
##   2 classes: 'band', 'noband' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 304, 304, 304, 304, 304, 304, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7156239  0.3255382
##   19    0.7414260  0.4178939
##   36    0.7401961  0.4165118
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 19.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction band noband
##     band     27      8
##     noband   18     93
##                                           
##                Accuracy : 0.8219          
##                  95% CI : (0.7501, 0.8802)
##     No Information Rate : 0.6918          
##     P-Value [Acc > NIR] : 0.000261        
##                                           
##                   Kappa : 0.555           
##                                           
##  Mcnemar's Test P-Value : 0.077556        
##                                           
##             Sensitivity : 0.6000          
##             Specificity : 0.9208          
##          Pos Pred Value : 0.7714          
##          Neg Pred Value : 0.8378          
##              Prevalence : 0.3082          
##          Detection Rate : 0.1849          
##    Detection Prevalence : 0.2397          
##       Balanced Accuracy : 0.7604          
##                                           
##        'Positive' Class : band            
## 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  146 
## 
##  
##                        | band.pred.1 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        27 |        18 |        45 | 
##                        |     0.600 |     0.400 |     0.308 | 
##                        |     0.771 |     0.162 |           | 
##                        |     0.185 |     0.123 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |         8 |        93 |       101 | 
##                        |     0.079 |     0.921 |     0.692 | 
##                        |     0.229 |     0.838 |           | 
##                        |     0.055 |     0.637 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        35 |       111 |       146 | 
##                        |     0.240 |     0.760 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 
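
The kappa statistic reported above corrects accuracy for the agreement expected by chance. As a rough check, the Python sketch below recomputes Cohen's kappa from the cross-table counts; it is an illustration of the standard formula, not the code used to produce the output, and the function name is an assumption.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix given as a list of rows."""
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance if predictions were independent of the truth.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)

# Rows: true band / noband; columns: predicted band / noband.
kappa = cohen_kappa([[27, 18], [8, 93]])
print(round(kappa, 3))
```

The result agrees with the kappa of 0.555 in the output, which indicates moderate agreement beyond chance despite the 0.82 accuracy.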

For a large class imbalance (which is not the case in our example, as the band and noband classes are reasonably balanced), the caret train function can subsample the dataset to balance the classes before model fitting. This is done by setting the option sampling = "down" in the trainControl object passed to train.
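
To make the idea of down-sampling concrete, here is a minimal Python sketch that randomly drops rows from the larger class until both classes have the same size. It is not caret's implementation and ignores where in the resampling loop the subsampling happens; the helper name downsample and the toy rows are assumptions, with class counts chosen to match the 128 band and 210 noband rows of the training set.

```python
import random

def downsample(rows, label_of, seed=0):
    """Randomly drop rows from larger classes until every class has the
    same number of rows as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced

# Toy data with the same class counts as the training set.
toy = [("band", i) for i in range(128)] + [("noband", i) for i in range(210)]
balanced = downsample(toy, label_of=lambda row: row[0])
counts = {label: sum(1 for r in balanced if r[0] == label)
          for label in ("band", "noband")}
print(counts)
```

The price of balancing this way is that 82 of the 210 noband rows are discarded, which is one plausible reason a down-sampled model can lose overall accuracy while gaining sensitivity for band.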

The confusion matrix and cross-table for the down-sampled model are shown below.

##    user  system elapsed 
##   76.70    0.25   77.42
## Random Forest 
## 
## 338 samples
##  27 predictor
##   2 classes: 'band', 'noband' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 304, 304, 305, 304, 304, 304, ... 
## Addtional sampling using down-sampling
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7304902  0.4589608
##   19    0.7165865  0.4260143
##   36    0.7140909  0.4200108
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction band noband
##     band     34     20
##     noband   11     81
##                                           
##                Accuracy : 0.7877          
##                  95% CI : (0.7124, 0.8509)
##     No Information Rate : 0.6918          
##     P-Value [Acc > NIR] : 0.006418        
##                                           
##                   Kappa : 0.5282          
##                                           
##  Mcnemar's Test P-Value : 0.150763        
##                                           
##             Sensitivity : 0.7556          
##             Specificity : 0.8020          
##          Pos Pred Value : 0.6296          
##          Neg Pred Value : 0.8804          
##              Prevalence : 0.3082          
##          Detection Rate : 0.2329          
##    Detection Prevalence : 0.3699          
##       Balanced Accuracy : 0.7788          
##                                           
##        'Positive' Class : band            
## 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  146 
## 
##  
##                        | band.pred.2 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        34 |        11 |        45 | 
##                        |     0.756 |     0.244 |     0.308 | 
##                        |     0.630 |     0.120 |           | 
##                        |     0.233 |     0.075 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |        20 |        81 |       101 | 
##                        |     0.198 |     0.802 |     0.692 | 
##                        |     0.370 |     0.880 |           | 
##                        |     0.137 |     0.555 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        54 |        92 |       146 | 
##                        |     0.370 |     0.630 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 

b. Using the model: Neural Network

We build a neural network (NN) classifier on the dataset free from missing values, using 2 hidden units in a single hidden layer (the 77 weights reported below correspond to 36 inputs, 2 hidden units, and 1 output).

## # weights:  77
## initial  value 234.064527 
## final  value 224.237831 
## converged
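
The weight count in this output can be used to infer the network shape. The Python sketch below counts the weights of a fully connected network with one hidden layer, bias terms, and no skip-layer connections. It assumes 36 input columns after dummy coding (read from the 36 rows of the relative-importance table later in this section) and a single output unit; the function name is illustrative.

```python
def nnet_weight_count(n_inputs, n_hidden, n_outputs=1):
    """Weights in a fully connected single-hidden-layer network with bias
    terms and no skip-layer connections."""
    hidden = (n_inputs + 1) * n_hidden     # +1 for the bias into each hidden unit
    output = (n_hidden + 1) * n_outputs    # +1 for the bias into each output unit
    return hidden + output

# 36 inputs and 2 hidden units reproduce the reported weight count.
print(nnet_weight_count(36, 2))  # 77
```

Under these assumptions, 77 weights are consistent with one hidden layer of 2 units.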

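We then test the model on the test dataset and obtain the misclassification table shown below. The network predicts noband for every test observation: all 101 noband cases are classified correctly, but none of the 45 band cases is detected. The resulting accuracy of 0.69 equals the no-information rate, so this model performs worse than the random forest, particularly for predicting banding.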

##         predicted
## true     noband
##   band       45
##   noband    101

The misclassification table can also be obtained using the gmodels library:

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  146 
## 
##  
##                        | predicted 
## data$band_type[-train] |    noband | Row Total | 
## -----------------------|-----------|-----------|
##                   band |        45 |        45 | 
##                        |     0.308 |           | 
## -----------------------|-----------|-----------|
##                 noband |       101 |       101 | 
##                        |     0.692 |           | 
## -----------------------|-----------|-----------|
##           Column Total |       146 |       146 | 
## -----------------------|-----------|-----------|
## 
## 
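
A table with a single predicted column is worth checking against the no-information rate, that is, the accuracy obtained by always predicting the most frequent class. The short Python sketch below does this with the counts from the table above; the variable names are illustrative.

```python
true_counts = {"band": 45, "noband": 101}      # test-set class totals
predicted_all = "noband"                        # the only class that was predicted

total = sum(true_counts.values())
accuracy = true_counts[predicted_all] / total   # correct only on 'noband' rows
no_information_rate = max(true_counts.values()) / total
band_sensitivity = 0 / true_counts["band"]      # no 'band' row was predicted 'band'

print(round(accuracy, 4), round(no_information_rate, 4), band_sensitivity)
```

Accuracy and no-information rate coincide, which means the fitted network adds nothing over a constant guess on this test set.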

We can now list the relative importance of the variables using a sourced function (courtesy: https://beckmw.wordpress.com/2013/08/12/variable-importance-in-neural-networks/).

##                           rel.imp              x.names
## press_speed          -1.000000000          press_speed
## press                -0.865225137                press
## ink_pct              -0.153081913              ink_pct
## locationnorthus      -0.131619085      locationnorthus
## ink_temperature      -0.125145981      ink_temperature
## viscosity            -0.122927382            viscosity
## anode_space_ratio    -0.115901024    anode_space_ratio
## cylinder_typeyes     -0.094379009     cylinder_typeyes
## caliper              -0.085273658              caliper
## solvent_pct          -0.071738121          solvent_pct
## grain_screenedYES    -0.062100566    grain_screenedYES
## wax                  -0.056329589                  wax
## humidity             -0.038158621             humidity
## direct_steamyes      -0.037380415      direct_steamyes
## roughness            -0.023136432            roughness
## current_density      -0.016455654      current_density
## roller_durometer     -0.005338653     roller_durometer
## cylinder_sizespiegel -0.002002103 cylinder_sizespiegel
## paper_typesuper       0.000000000      paper_typesuper
## locationsouthus       0.001943799      locationsouthus
## press_typeMotter70    0.006251217   press_typeMotter70
## solvent_typeNAPTHA    0.020177467   solvent_typeNAPTHA
## paper_typeuncoated    0.028459842   paper_typeuncoated
## locationscandanavian  0.028964029 locationscandanavian
## ink_typeuncoated      0.032644856     ink_typeuncoated
## press_typeMotter94    0.048582228   press_typeMotter94
## varnish_pct           0.051141124          varnish_pct
## cylinder_sizetabloid  0.051616850 cylinder_sizetabloid
## proof_cut             0.054414036            proof_cut
## solvent_typeXYLOL     0.059283698    solvent_typeXYLOL
## blade_pressure        0.059704615       blade_pressure
## press_typeWoodHoe70   0.099162850  press_typeWoodHoe70
## ink_typecover         0.106830765        ink_typecover
## proof_inkYES          0.126976265         proof_inkYES
## hardener              0.134731925             hardener
## locationmideuropean   0.143023053  locationmideuropean

The bar plot tells us that the variables press_speed and location (mideuropean) have the strongest negative and positive relationships, respectively, with the response variable band_type.

The variables with the largest positive relative importance, in decreasing order (all with weak weights), are:

  1. location - mideuropean
  2. hardener
  3. proof_ink - YES
  4. ink_type - cover
  5. press_type - WoodHoe70
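
The list above ranks positive values only. If importance is instead judged by magnitude regardless of sign, the ordering changes. The Python sketch below illustrates this with a subset of the rel.imp values copied from the table; ranking by absolute value is an alternative reading offered here as an assumption, not the method used in this report.

```python
# A subset of the rel.imp values from the table above (names as printed).
rel_imp = {
    "press_speed": -1.000000000,
    "press": -0.865225137,
    "ink_pct": -0.153081913,
    "locationnorthus": -0.131619085,
    "hardener": 0.134731925,
    "locationmideuropean": 0.143023053,
    "proof_inkYES": 0.126976265,
}

# Sort variable names by the size of the weight, ignoring its sign.
ranked = sorted(rel_imp, key=lambda name: abs(rel_imp[name]), reverse=True)
for name in ranked:
    print(f"{name:22s} {rel_imp[name]:+.3f}")
```

By magnitude, press_speed and press dominate every other variable in this subset, while the largest positive weight (locationmideuropean) is roughly seven times smaller than that of press_speed.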

c. Using the model: Decision Tree

We now consider a decision tree, modelled using the rpart function. To get a large tree, we make the complexity parameter (cp) very small.

The output lists all the trees considered by the model, giving for each the complexity parameter, number of splits, resubstitution error rate, cross-validated error rate, and the associated standard error.

## 
## Classification tree:
## rpart(formula = data$band_type ~ ., data = data, method = "class", 
##     control = rpart.control(minsplit = 4, cp = 1e-05))
## 
## Variables actually used in tree construction:
##  [1] anode_space_ratio blade_pressure    caliper          
##  [4] cylinder_type     grain_screened    hardener         
##  [7] humidity          ink_pct           ink_temperature  
## [10] ink_type          location          paper_type       
## [13] press             press_speed       press_type       
## [16] proof_cut         proof_ink         roller_durometer 
## [19] roughness         solvent_pct       viscosity        
## [22] wax              
## 
## Root node error: 173/484 = 0.35744
## 
## n= 484 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.0693642      0  1.000000 1.00000 0.060944
## 2  0.0375723      3  0.780347 1.00578 0.061022
## 3  0.0346821      7  0.618497 0.94220 0.060100
## 4  0.0289017      8  0.583815 0.85549 0.058591
## 5  0.0231214      9  0.554913 0.82081 0.057901
## 6  0.0173410     11  0.508671 0.80925 0.057660
## 7  0.0115607     15  0.439306 0.79769 0.057413
## 8  0.0096339     24  0.335260 0.76301 0.056636
## 9  0.0092486     28  0.294798 0.76301 0.056636
## 10 0.0086705     34  0.231214 0.76301 0.056636
## 11 0.0072254     36  0.213873 0.79769 0.057413
## 12 0.0057803     40  0.184971 0.79769 0.057413
## 13 0.0038536     60  0.069364 0.80347 0.057537
## 14 0.0028902     63  0.057803 0.80347 0.057537
## 15 0.0000100     65  0.052023 0.82081 0.057901

From the complexity table we observe that the lowest cross-validated error (xerror) of about 0.76 is first reached at 24 splits.
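
Picking a tree size from the complexity table can be done mechanically. The Python sketch below copies the CP, nsplit, xerror, and xstd columns from the table and finds both the row with the lowest cross-validated error and the row chosen by the one-standard-error rule (the simplest tree whose xerror is within one xstd of the minimum). The one-standard-error rule is a common heuristic added here for illustration; it is not necessarily the rule used in this report.

```python
# (CP, nsplit, xerror, xstd) rows copied from the complexity table.
cp_table = [
    (0.0693642, 0, 1.00000, 0.060944),
    (0.0375723, 3, 1.00578, 0.061022),
    (0.0346821, 7, 0.94220, 0.060100),
    (0.0289017, 8, 0.85549, 0.058591),
    (0.0231214, 9, 0.82081, 0.057901),
    (0.0173410, 11, 0.80925, 0.057660),
    (0.0115607, 15, 0.79769, 0.057413),
    (0.0096339, 24, 0.76301, 0.056636),
    (0.0092486, 28, 0.76301, 0.056636),
    (0.0086705, 34, 0.76301, 0.056636),
    (0.0072254, 36, 0.79769, 0.057413),
    (0.0057803, 40, 0.79769, 0.057413),
    (0.0038536, 60, 0.80347, 0.057537),
    (0.0028902, 63, 0.80347, 0.057537),
    (0.0000100, 65, 0.82081, 0.057901),
]

# Row with the lowest cross-validated error (the first one on ties).
best = min(cp_table, key=lambda row: row[2])
# One-standard-error rule: simplest tree within one xstd of the best xerror.
threshold = best[2] + best[3]
one_se = next(row for row in cp_table if row[2] <= threshold)
print("min xerror:", best[1], "splits; 1-SE choice:", one_se[1], "splits")
```

On this table the minimum is at 24 splits, while the one-standard-error rule would accept a much smaller tree with 11 splits.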

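To reduce the tree size we can prune the tree and use repeated cross-validation. The pruned tree summarized below (pruned at cp = 0.05) has 3 splits, with a relative resubstitution error of 0.78 and a cross-validated error of about 1.0.

This again is not a good predictor: a cross-validated error close to 1 means the tree does little better than always predicting the majority class.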

## Call:
## rpart(formula = data$band_type ~ ., data = data, method = "class", 
##     control = rpart.control(minsplit = 4, cp = 1e-05))
##   n= 484 
## 
##           CP nsplit rel error  xerror       xstd
## 1 0.06936416      0 1.0000000 1.00000 0.06094449
## 2 0.05000000      3 0.7803468 1.00578 0.06102204
## 
## Variable importance
##       press_type         humidity      press_speed            press 
##               31               19               17               11 
##        roughness    cylinder_size         location          ink_pct 
##                9                4                3                1 
##     solvent_type   blade_pressure         hardener              wax 
##                1                1                1                1 
##  ink_temperature roller_durometer 
##                1                1 
## 
## Node number 1: 484 observations,    complexity param=0.06936416
##   predicted class=noband  expected loss=0.357438  P(node) =1
##     class counts:   173   311
##    probabilities: 0.357 0.643 
##   left son=2 (166 obs) right son=3 (318 obs)
##   Primary splits:
##       press_type     splits as  RRRL, improve=14.034940, (0 missing)
##       press_speed    < 2184.5    to the left,  improve=13.365950, (0 missing)
##       press          < 822.5     to the left,  improve=13.137310, (0 missing)
##       ink_type       splits as  RLL, improve=10.721900, (0 missing)
##       grain_screened splits as  RL, improve= 9.344136, (0 missing)
##   Surrogate splits:
##       press         < 818.5     to the left,  agree=0.777, adj=0.349, (0 split)
##       roughness     < 0.5570005 to the left,  agree=0.756, adj=0.289, (0 split)
##       cylinder_size splits as  RLR, agree=0.698, adj=0.120, (0 split)
##       humidity      < 84.5      to the right, agree=0.678, adj=0.060, (0 split)
##       ink_pct       < 64        to the right, agree=0.674, adj=0.048, (0 split)
## 
## Node number 2: 166 observations,    complexity param=0.06936416
##   predicted class=band    expected loss=0.4759036  P(node) =0.3429752
##     class counts:    87    79
##    probabilities: 0.524 0.476 
##   left son=4 (46 obs) right son=5 (120 obs)
##   Primary splits:
##       press_speed   < 1678      to the left,  improve=7.134765, (0 missing)
##       ink_pct       < 62.9      to the right, improve=5.336261, (0 missing)
##       humidity      < 73.5      to the left,  improve=5.002718, (0 missing)
##       cylinder_type splits as  LR, improve=4.496773, (0 missing)
##       roughness     < 0.5882505 to the right, improve=4.235566, (0 missing)
##   Surrogate splits:
##       location         splits as  RLRR-, agree=0.771, adj=0.174, (0 split)
##       solvent_type     splits as  R-L, agree=0.747, adj=0.087, (0 split)
##       blade_pressure   < 24.5      to the left,  agree=0.741, adj=0.065, (0 split)
##       ink_temperature  < 19.05     to the right, agree=0.735, adj=0.043, (0 split)
##       roller_durometer < 29        to the left,  agree=0.735, adj=0.043, (0 split)
## 
## Node number 3: 318 observations
##   predicted class=noband  expected loss=0.2704403  P(node) =0.6570248
##     class counts:    86   232
##    probabilities: 0.270 0.730 
## 
## Node number 4: 46 observations
##   predicted class=band    expected loss=0.2391304  P(node) =0.09504132
##     class counts:    35    11
##    probabilities: 0.761 0.239 
## 
## Node number 5: 120 observations,    complexity param=0.06936416
##   predicted class=noband  expected loss=0.4333333  P(node) =0.2479339
##     class counts:    52    68
##    probabilities: 0.433 0.567 
##   left son=10 (22 obs) right son=11 (98 obs)
##   Primary splits:
##       humidity          < 75.5      to the left,  improve=7.979716, (0 missing)
##       ink_pct           < 62.9      to the right, improve=6.248649, (0 missing)
##       anode_space_ratio < 99.34579  to the left,  improve=4.158972, (0 missing)
##       press_speed       < 2184.5    to the left,  improve=3.856410, (0 missing)
##       wax               < 1.725     to the left,  improve=3.350725, (0 missing)
##   Surrogate splits:
##       press_speed < 2450      to the right, agree=0.833, adj=0.091, (0 split)
##       wax         < 0.85      to the left,  agree=0.825, adj=0.045, (0 split)
##       hardener    < 0.25      to the left,  agree=0.825, adj=0.045, (0 split)
## 
## Node number 10: 22 observations
##   predicted class=band    expected loss=0.1818182  P(node) =0.04545455
##     class counts:    18     4
##    probabilities: 0.818 0.182 
## 
## Node number 11: 98 observations
##   predicted class=noband  expected loss=0.3469388  P(node) =0.2024793
##     class counts:    34    64
##    probabilities: 0.347 0.653
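
The relative error of the pruned tree can be verified from its terminal nodes. The Python sketch below uses the class counts of the four leaves in the summary above (nodes 3, 4, 10, and 11) and compares the number of misclassified rows with the 173 errors made at the root; the variable names are illustrative.

```python
# (band count, noband count) for each terminal node of the pruned tree.
leaves = {3: (86, 232), 4: (35, 11), 10: (18, 4), 11: (34, 64)}

# Each leaf predicts its majority class, so its errors are the minority count.
errors = sum(min(band, noband) for band, noband in leaves.values())
n = sum(band + noband for band, noband in leaves.values())
root_errors = 173   # 'band' rows misclassified when the root predicts 'noband'

print(errors, n, round(errors / root_errors, 4), round(errors / n, 4))
```

The ratio of leaf errors to root errors reproduces the rel error of 0.7803 in the pruned complexity table, which corresponds to an absolute resubstitution error of about 0.28 on the 484 observations.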

The variable importance table is given below:

##                     Overall
## anode_space_ratio 46.792279
## blade_pressure    22.085604
## caliper           17.474892
## current_density   15.176716
## cylinder_size     20.076596
## cylinder_type     14.783200
## grain_screened    25.005585
## hardener          28.849537
## humidity          50.815463
## ink_pct           50.793496
## ink_temperature   44.141707
## ink_type          31.912385
## location          26.939384
## paper_type         2.609524
## press             43.517804
## press_speed       73.276042
## press_type        30.182396
## proof_cut         41.222215
## proof_ink          2.100000
## roller_durometer  22.586269
## roughness         30.553411
## solvent_pct       29.796049
## varnish_pct       22.285894
## viscosity         49.165163
## wax               23.305868
## direct_steam       0.000000
## solvent_type       0.000000

In decreasing order of importance, the variables to be considered are:

  1. press_speed
  2. humidity
  3. ink_pct
  4. viscosity
  5. anode_space_ratio

d. Using the model: SVM

We consider the use of support vector machines to build a classifier and to also extract the feature importance list.

First we use a cost of 10 (fairly large), which penalizes misclassified training points heavily and therefore allows only a narrow margin.

## 
## Call:
## svm(formula = data[train, ]$band_type ~ ., data = data[train, 
##     ], method = "C-classification", kernel = "radial", cost = 10, 
##     gamma = 0.1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
## 
## Number of Support Vectors:  301
## 
##  ( 179 122 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  band noband
##   [1]   1   2   4   5   6  10  12  14  17  19  22  24  25  27  28  30  32
##  [18]  33  36  47  48  49  50  51  52  53  55  56  57  58  59  60  63  65
##  [35]  67  69  70  72  73  75  76  78  81  82  84  85  86  87  88  91  93
##  [52]  94  95  96  97  98 101 102 103 105 110 112 113 115 118 121 123 127
##  [69] 128 129 134 135 136 137 138 140 141 142 143 146 147 149 150 152 153
##  [86] 156 160 162 165 166 167 168 169 175 176 177 179 180 184 186 189 190
## [103] 191 196 197 198 199 200 201 202 205 206 208 209 212 214 215 216 218
## [120] 225 226 228 230 232 233 234 235 237 240 243 244 245 246 247 249 250
## [137] 251 253 254 255 257 258 259 261 267 271 272 273 275 276 281 282 283
## [154] 285 286 287 291 292 294 300 302 304 307 309 310 312 314 315 317 318
## [171] 319 320 330 331 332 333 336 337 338   3   7   8  11  13  15  16  18
## [188]  20  21  23  26  29  31  35  37  38  39  41  43  44  45  54  61  62
## [205]  71  77  79  80  83  89  92 100 104 106 107 108 111 114 120 122 124
## [222] 125 126 130 131 132 133 139 144 145 148 151 154 158 159 163 170 171
## [239] 173 174 178 183 185 187 192 194 195 203 207 210 211 213 219 220 221
## [256] 224 227 229 231 236 238 241 242 248 252 256 260 262 263 264 265 266
## [273] 268 270 274 277 278 280 284 288 289 290 293 295 296 297 298 299 301
## [290] 303 308 311 313 321 322 324 326 327 328 329 334
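
The call above uses a radial kernel with gamma = 0.1. As a small illustration of what that kernel computes, the Python sketch below evaluates the radial basis function exp(-gamma * squared distance) for two pairs of points. The feature vectors are made-up values for illustration only, and the function name is an assumption.

```python
import math

def rbf_kernel(x, y, gamma=0.1):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Two made-up, already-scaled feature vectors (illustrative values only).
same = rbf_kernel([0.0, 1.0], [0.0, 1.0])    # identical points
apart = rbf_kernel([0.0, 1.0], [1.0, 3.0])   # squared distance of 5
print(same, round(apart, 4))
```

Similarity falls from 1 for identical points towards 0 as points move apart, and a larger gamma makes that decay faster. Note also that 301 of the 338 training rows are retained as support vectors, which suggests a decision boundary that leans on most of the training data.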

Prediction using SVM: the accuracy of the model on the test set is 0.82, with a kappa of 0.55.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction band noband
##     band     28     10
##     noband   17     91
##                                           
##                Accuracy : 0.8151          
##                  95% CI : (0.7425, 0.8744)
##     No Information Rate : 0.6918          
##     P-Value [Acc > NIR] : 0.0005379       
##                                           
##                   Kappa : 0.5468          
##                                           
##  Mcnemar's Test P-Value : 0.2482131       
##                                           
##             Sensitivity : 0.6222          
##             Specificity : 0.9010          
##          Pos Pred Value : 0.7368          
##          Neg Pred Value : 0.8426          
##              Prevalence : 0.3082          
##          Detection Rate : 0.1918          
##    Detection Prevalence : 0.2603          
##       Balanced Accuracy : 0.7616          
##                                           
##        'Positive' Class : band            
## 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  146 
## 
##  
##                        | svm.pred.2 
## data$band_type[-train] |      band |    noband | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                   band |        28 |        17 |        45 | 
##                        |     0.622 |     0.378 |     0.308 | 
##                        |     0.737 |     0.157 |           | 
##                        |     0.192 |     0.116 |           | 
## -----------------------|-----------|-----------|-----------|
##                 noband |        10 |        91 |       101 | 
##                        |     0.099 |     0.901 |     0.692 | 
##                        |     0.263 |     0.843 |           | 
##                        |     0.068 |     0.623 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |        38 |       108 |       146 | 
##                        |     0.260 |     0.740 |           | 
## -----------------------|-----------|-----------|-----------|
## 
## 

The variable importance matrix is not implemented in caret for “svm”; the rminer package can be used instead to obtain the variable importance measure (VIM).

Conclusion

In this project we tried 5 different ways of cleaning the given dataset and applied 4 prediction models each time. The table below gives a summarized view of the results of the various tests conducted.

The most important variables that could contribute to banding are given for each of the methods. Note that we do not have a feature importance list for SVM, due to the complexity involved in obtaining it.

Summarized Results

We started with 39 variables as potential causes of banding. With the different methods used, we have narrowed these down to a select few variables that are most likely to contribute to banding.

We have selected the variables that occur most frequently across all the methods.

The 4 variables, in descending order of importance (as a cause of banding), are given below:

  1. press_speed - This variable has been seen in the majority of the models
  2. viscosity - This is the second most often seen variable
  3. solvent_pct (and ink_pct) - since there is a correlation between the two, this variable is important
  4. humidity
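
Selecting variables by how often they appear across methods can be expressed as a simple tally. The Python sketch below counts appearances in the top-variable lists reported in this section only (random forest, neural network, and decision tree; SVM has no list). The full selection above also draws on the other data-cleaning variants, so this is an illustration of the counting step rather than a reproduction of the final ranking.

```python
from collections import Counter

# Top-variable lists as reported in this section (SVM has no list).
top_variables = {
    "random_forest": ["viscosity", "press_speed", "ink_pct", "humidity",
                      "solvent_pct", "ink_temperature"],
    "neural_network": ["location", "hardener", "proof_ink", "ink_type",
                       "press_type"],
    "decision_tree": ["press_speed", "humidity", "ink_pct", "viscosity",
                      "anode_space_ratio"],
}

# Count in how many lists each variable appears.
votes = Counter(name for names in top_variables.values() for name in names)
shared = sorted(name for name, count in votes.items() if count >= 2)
print(shared)
```

Within this section, press_speed, viscosity, ink_pct, and humidity are the variables named by more than one model, which is in line with the shortlist above.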

Management is therefore advised to monitor and control these variables closely during the Rotogravure printing process.