We begin by reviewing the problem statement along with some background information. The issue we address occurs during what is known as “Rotogravure” printing.
Rotogravure printing involves rotating a chrome-plated copper cylinder in a bath of ink, scraping off the excess ink, and pressing a continuous supply of paper against the inked image with a rubber roller. Once the job is printed, the engraved image is removed from the cylinder, which is replated to be engraved for another job.
Sometimes, a series of grooves - called a “band” - appears in the cylinder during printing, ruining the finished product. These grooves are not present at the start of the printing run, but once they appear the printing press must be shut down. A technician then removes the band by polishing it out of the cylinder, or by transporting the cylinder to a plating station where the chrome surface is removed, the band is polished out of the copper subsurface, and a chrome finish is replated.
The occurrence of bands results in process delays, plant shutdowns, and losses in terms of labour and time.
When process delays have known causes, they can be mitigated by acquiring causal rules from human experts and then applying sensors and automated real-time diagnostic devices to the process. However, for some delays the experts have only “weak” causal knowledge or none at all.
In such cases, machine learning tools can collect training data and process it through an induction engine in search of “diagnostic” knowledge. Our aim in this analysis is therefore to find the most probable causes of band formation and to identify the parameters that can be controlled in order to mitigate banding.
Our analysis follows the paths in the flowchart shown below. We first take the data provided and clean it up (pre-processing) to make it usable by a machine learning algorithm. This clean-up involves several sub-steps, and we have tried different ways of handling the missing data values, described in more detail in the relevant sections.
After this activity, we aim to build a robust prediction model by trying various ML algorithms and testing their accuracy and prediction capability.
Our first step is to clean our data and remove duplicates. Inconsistencies can arise from entry mistakes (upper/lower case), duplicated records, and spelling mistakes made while the dataset was being created by the operators, who noted the various parameter levels when making the cylinders.
Our target variable in this data set is “band_type”, which has two levels: band and noband. Our explanatory variables are the remaining 39 variables, of which 20 attributes are numeric and 19 are nominal. After this common step, we proceed to analyze the dataset and derive more meaningful information.
We next check how the data is spread and what it contains. The variable we are interested in is “band_type”, and in particular the causes of banding and our ability to predict it. The description of the “band_type” variable in the dataset is shown below.
## variable = band_type
## type = factor
## na = 0 of 540 (0%)
## unique = 2
## band = 228 (42.2%)
## noband = 312 (57.8%)
We find that out of the 540 records in the dataset, the information for “band_type” is completely available (no missing “band_type” values, na = 0 of 540). Further, there are two unique levels for this variable (band & noband). Also, the split between band & noband is reasonably balanced (42.2% versus 57.8%), with only a slight tilt towards noband.
An overall plot of the data is shown below. Only 51.3% of the rows are complete, i.e., have no missing values. This works out to 277 of the 540 records. We also see that 4.6% of all observations are missing.
plot_intro(cyl)
A deeper look into the missing data shows that the variable with the most missing data is “location” at 28.89%, followed by “blade_pressure” and “blade_mfg”. The plot below shows, for each of the 40 variables, the percentage of missing data it contains. This information will help us later when deciding which variables to consider or discard in our analysis.
plot_missing(cyl)
A density plot represents the distribution of a numeric variable. It helps us understand how normal or skewed the data is and whether any variables are very widely spread. As can be gleaned, not all the variables seem relevant here (unit_number, job_number, etc.). The rest of the data seem to be well distributed without too many extremes.
plot_density(cyl)
We next turn our attention to the problem of missing data. There are several ways in which we can fill in the missing data and proceed with our analysis.
Here we share 5 different ways in which we have dealt with missing data. After cleaning and dealing with missing data, our dataset is ready for various ML algorithms. For this report, we have picked four ML models: RandomForest, NeuralNetwork, DecisionTrees and Support Vector Machines.
In our first method, we simply remove all the rows that are incomplete and only consider the complete observations for our analysis.
## [1] 277
We find that the number of complete rows is 277. Therefore, we work only with these records for analysis.
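As a minimal sketch of this complete-case filter, the snippet below uses a small made-up data frame (`cyl_toy` and its values are illustrative assumptions, not the actual cylinder data) and keeps only the rows with no missing values.

```r
# Toy data frame standing in for the cylinder data (values are made up).
cyl_toy <- data.frame(
  press_speed = c(1800, NA, 2100, 1950),
  ink_type    = factor(c("coated", "uncoated", NA, "cover")),
  band_type   = factor(c("band", "noband", "noband", "band"))
)

# Keep only the rows in which every column is observed.
cyl_complete <- cyl_toy[complete.cases(cyl_toy), ]

# Number of complete rows in the toy data.
nrow(cyl_complete)
```

Applied to the real dataset, the same pattern would be expected to return the 277 complete rows reported above.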
Using the data, we create correlation plots to see how the different variables are correlated. If the dataset contains perfectly positively or negatively correlated attributes, the performance of the model can be affected by a problem called “Multicollinearity”. Multicollinearity happens when one predictor variable in a multiple regression model can be linearly predicted from the others with a high degree of accuracy. This can lead to skewed or misleading results.
Decision trees and boosted-tree algorithms are immune to multicollinearity by nature. When the model decides to split, the tree will choose only one of the perfectly correlated features. However, other algorithms such as Logistic Regression or Linear Regression are not immune to this problem, and it needs to be fixed before training the model. Note also that the variable importance levels can still be affected by the presence of multicollinearity.
In our case, we only wish to observe correlations, if any, and to remove one variable of each correlated pair where they exist.
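One way to screen for such pairs is sketched below in base R. The data frame `num_toy`, its columns, and the 0.8 threshold are assumptions chosen for illustration only; they are not taken from the analysis.

```r
set.seed(42)  # arbitrary seed so the toy data is reproducible

# Toy numeric data: b is almost a linear function of a, c is unrelated.
x <- rnorm(100)
num_toy <- data.frame(a = x,
                      b = 2 * x + rnorm(100, sd = 0.1),
                      c = rnorm(100))

# Pairwise correlations, ignoring missing values pair by pair.
cor_mat <- cor(num_toy, use = "pairwise.complete.obs")

# Blank the lower triangle and diagonal so each pair is reported once.
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- NA

# Pairs whose absolute correlation exceeds the (assumed) threshold.
high <- which(abs(cor_mat) > 0.8, arr.ind = TRUE)
pairs_found <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
pairs_found
```

For each reported pair, one of the two variables is a candidate for removal before training.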
We find the following correlations, which exist to a lesser degree in the data:
We can next train our model with and without some of the above variables to observe if there is an improvement.
Further, from the problem description we identify 12 variables that can be safely ignored, as including them would only reduce the model's capability. The variables we decided to remove from our training model are listed below:
We first generate the training and test datasets based on a random sample. We use 70% of the supplied dataset as training set, and the remaining 30% as a test dataset.
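A minimal sketch of such a split is shown here. The seed is an arbitrary assumption; the index vector is named `train` to match the `data[train, ]` and `data$band_type[-train]` expressions that appear in the output of this report.

```r
set.seed(123)                    # arbitrary seed, for reproducibility only
n <- 277                         # number of complete rows found above

# Randomly pick 70% of the row indices for training.
train <- sample(n, size = floor(0.7 * n))

# The remaining rows form the test set.
test <- setdiff(seq_len(n), train)

c(train = length(train), test = length(test))
```

With 277 rows this gives 193 training rows and 84 test rows, which agrees with the sample sizes shown in the model output further on.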
We next build the random forest with the randomForest library, using the training data set. Since the number of variables is 39 in this case, we use mtry=6 (approximately the square root of 39); that is, 6 variables are tried at every split.
From the results, we observe the out-of-bag error rate to be 25.91%.
##
## Call:
## randomForest(formula = form, data = data[train, ], ntree = 1000, mtry = 6, importance = TRUE, localImp = TRUE, replace = FALSE, na.action = na.roughfix)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 25.91%
## Confusion matrix:
## band noband class.error
## band 44 31 0.4133333
## noband 19 99 0.1610169
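The out-of-bag error rate and the per-class errors can be recomputed from the confusion matrix printed above. The sketch below simply re-enters those four counts by hand (rows are the true classes, columns the predicted classes).

```r
# Out-of-bag confusion matrix copied from the output above.
conf <- matrix(c(44, 31,
                 19, 99),
               nrow = 2, byrow = TRUE,
               dimnames = list(true      = c("band", "noband"),
                               predicted = c("band", "noband")))

# Per-class error: share of each true class that was misclassified.
class_error <- 1 - diag(conf) / rowSums(conf)

# Overall OOB error: share of all training rows that were misclassified.
oob_error <- 1 - sum(diag(conf)) / sum(conf)

round(100 * oob_error, 2)
round(class_error, 4)
```

The result shows that most of the error comes from the band class, where roughly 41% of the records are misclassified.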
The variable importance is shown in the plot below, followed by the error rates for the first 15 trees and the corresponding error plot. The error plot shows how the error rate changes as more trees are added to the forest.
## OOB band noband
## [1,] 0.2535 0.3750 0.1538
## [2,] 0.2783 0.3478 0.2319
## [3,] 0.2483 0.2931 0.2184
## [4,] 0.2545 0.2769 0.2400
## [5,] 0.3258 0.3043 0.3394
## [6,] 0.3279 0.3562 0.3091
## [7,] 0.3048 0.3425 0.2807
## [8,] 0.2895 0.3600 0.2435
## [9,] 0.2789 0.3067 0.2609
## [10,] 0.2579 0.3200 0.2174
## [11,] 0.2408 0.3200 0.1897
## [12,] 0.2723 0.3733 0.2069
## [13,] 0.2902 0.3467 0.2542
## [14,] 0.3057 0.4133 0.2373
## [15,] 0.2953 0.3200 0.2797
Mean Decrease in Gini is the average (mean) of a variable’s total decrease in node impurity, weighted by the proportion of samples reaching that node in each individual decision tree in the random forest.
This is effectively a measure of how important a variable is for estimating the value of the target variable across all of the trees that make up the forest. A higher Mean Decrease in Gini indicates higher variable importance. The most important variables appear highest in the plot and have the largest Mean Decrease in Gini values; conversely, the least important variables appear lowest in the plot and have the smallest values.
In this case, the variables with a Mean Decrease in Gini between 3 and 4 (and therefore the most important) are:
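To make the impurity decrease concrete, the sketch below computes it for a single split. The class counts are those of the root split in the decision tree reported later in this document (277 records split into nodes of 182 and 95); the helper `gini()` is a name introduced here for illustration.

```r
# Gini impurity of a node, given its class counts.
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

parent <- c(band = 99, noband = 178)  # root node
left   <- c(band = 84, noband = 98)   # left child
right  <- c(band = 15, noband = 80)   # right child
n <- sum(parent)

# Impurity decrease: parent impurity minus the size-weighted child impurities.
decrease <- gini(parent) -
  (sum(left) / n) * gini(left) -
  (sum(right) / n) * gini(right)

decrease       # decrease in impurity for this one split
n * decrease   # the same quantity scaled by the node size
```

Scaled by the node size, the value agrees with the `improve` figure that rpart prints for this split. A forest-level importance aggregates such decreases over every split that uses a given variable and averages them across trees.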
If we repeat this test, after multicollinearity correction, we get the below results:
##
## Call:
## randomForest(formula = form, data = data[train, ], ntree = 1000, mtry = 6, importance = TRUE, localImp = TRUE, replace = FALSE, na.action = na.roughfix)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 25.91%
## Confusion matrix:
## band noband class.error
## band 42 33 0.4400000
## noband 17 101 0.1440678
## OOB band noband
## [1,] 0.2676 0.4375 0.1282
## [2,] 0.3246 0.4681 0.2239
## [3,] 0.3333 0.4364 0.2717
## [4,] 0.3148 0.3871 0.2700
## [5,] 0.3352 0.4286 0.2736
## [6,] 0.3611 0.4286 0.3182
## [7,] 0.3871 0.4861 0.3246
## [8,] 0.3617 0.4795 0.2870
## [9,] 0.3280 0.4521 0.2500
## [10,] 0.3263 0.4324 0.2586
## [11,] 0.3298 0.4533 0.2500
## [12,] 0.3385 0.4667 0.2564
## [13,] 0.3420 0.4667 0.2627
## [14,] 0.3005 0.4133 0.2288
## [15,] 0.3420 0.4800 0.2542
The out-of-bag error rate is unchanged at 25.91%. The variable importance ranking, as measured by the Gini index, now gives the following variables of importance:
We now proceed to predict using the test data set and create the confusion matrix for the same.
## pred
## true band noband
## band 20 4
## noband 9 51
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 84
##
##
## | band.pred
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 20 | 4 | 24 |
## | 0.833 | 0.167 | 0.286 |
## | 0.690 | 0.073 | |
## | 0.238 | 0.048 | |
## -----------------------|-----------|-----------|-----------|
## noband | 9 | 51 | 60 |
## | 0.150 | 0.850 | 0.714 |
## | 0.310 | 0.927 | |
## | 0.107 | 0.607 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 29 | 55 | 84 |
## | 0.345 | 0.655 | |
## -----------------------|-----------|-----------|-----------|
##
##
We now use repeated cross-validation to see whether we get better classification performance. The model summary, confusion matrix and cross-table below are for a random forest tuned with 10-fold cross-validation repeated 10 times. The cross-validated accuracy is highest (0.724) at mtry=2; on the test set, this model reaches an accuracy of 0.845 and a kappa of 0.616.
There is no standardized interpretation of the kappa statistic. According to Wikipedia (citing their paper), Landis and Koch consider 0-0.20 as slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1 as almost perfect.
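The resampling scheme can be sketched in base R as follows. This is a simplified illustration: the folds here are assigned purely at random, whereas a modelling package may stratify them by class, and the seed is an arbitrary assumption.

```r
set.seed(1)    # arbitrary seed
n <- 193       # number of training rows
k <- 10        # folds per repeat
repeats <- 10  # number of repeats

# For each repeat, assign every training row to one of k folds at random.
folds <- lapply(seq_len(repeats), function(r) {
  sample(rep(seq_len(k), length.out = n))
})

# Training-set size when each fold of the first repeat is held out in turn.
train_sizes <- sapply(seq_len(k), function(f) sum(folds[[1]] != f))
train_sizes
```

Each model is fitted on roughly nine tenths of the training rows (173 or 174 of 193) and scored on the held-out fold, and the scores are averaged over all 100 fits.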
Fleiss considers kappas > 0.75 as excellent, 0.40-0.75 as fair to good, and < 0.40 as poor.
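For reference, Cohen's kappa can be computed directly from a confusion matrix. The sketch below re-enters the test-set confusion matrix from the output that follows (rows are predictions, columns are the reference classes).

```r
# Test-set confusion matrix copied from the output of this section.
conf <- matrix(c(17,  6,
                  7, 54),
               nrow = 2, byrow = TRUE,
               dimnames = list(prediction = c("band", "noband"),
                               reference  = c("band", "noband")))
n <- sum(conf)

# Observed agreement: the accuracy.
p_observed <- sum(diag(conf)) / n

# Agreement expected by chance, from the row and column totals.
p_expected <- sum(rowSums(conf) * colSums(conf)) / n^2

# Kappa: improvement over chance, relative to the maximum possible improvement.
cohen_kappa <- (p_observed - p_expected) / (1 - p_expected)

round(c(accuracy = p_observed, kappa = cohen_kappa), 4)
```

The value falls at the boundary between “moderate” and “substantial” on the Landis and Koch scale, and in the “fair to good” range according to Fleiss.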
## model parameter label forReg forClass probModel
## 1 rf mtry #Randomly Selected Predictors TRUE TRUE TRUE
## user system elapsed
## 52.22 0.80 55.23
## Random Forest
##
## 193 samples
## 24 predictor
## 2 classes: 'band', 'noband'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 173, 173, 174, 173, 174, 174, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7242251 0.3746088
## 17 0.7123713 0.3648302
## 33 0.7035322 0.3468995
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Confusion Matrix and Statistics
##
## Reference
## Prediction band noband
## band 17 6
## noband 7 54
##
## Accuracy : 0.8452
## 95% CI : (0.7499, 0.9149)
## No Information Rate : 0.7143
## P-Value [Acc > NIR] : 0.003877
##
## Kappa : 0.616
##
## Mcnemar's Test P-Value : 1.000000
##
## Sensitivity : 0.7083
## Specificity : 0.9000
## Pos Pred Value : 0.7391
## Neg Pred Value : 0.8852
## Prevalence : 0.2857
## Detection Rate : 0.2024
## Detection Prevalence : 0.2738
## Balanced Accuracy : 0.8042
##
## 'Positive' Class : band
##
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 84
##
##
## | band.pred.1
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 17 | 7 | 24 |
## | 0.708 | 0.292 | 0.286 |
## | 0.739 | 0.115 | |
## | 0.202 | 0.083 | |
## -----------------------|-----------|-----------|-----------|
## noband | 6 | 54 | 60 |
## | 0.100 | 0.900 | 0.714 |
## | 0.261 | 0.885 | |
## | 0.071 | 0.643 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 23 | 61 | 84 |
## | 0.274 | 0.726 | |
## -----------------------|-----------|-----------|-----------|
##
##
When there is a large class imbalance (which is not the case in our example, as band and noband are well balanced), caret's train function can subsample the dataset to balance the classes before model fitting. This is done by setting the option sampling=“down” in the resampling control passed to train.
The confusion matrix and cross-table for the down-sampled random forest are shown below.
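The idea behind down-sampling can be illustrated in base R. The labels below are made up, and this sketch is not how any particular package implements the option; it only shows the principle of shrinking every class to the size of the smallest one.

```r
set.seed(7)  # arbitrary seed

# Made-up, deliberately imbalanced class labels.
y <- factor(c(rep("band", 30), rep("noband", 70)))

# Size of the smallest class.
n_min <- min(table(y))

# From each class, keep a random subset of n_min row indices.
keep <- unlist(lapply(levels(y), function(lv) {
  idx <- which(y == lv)
  idx[sample.int(length(idx), n_min)]
}))

table(y[keep])
```

After this step both classes have the same number of rows, at the cost of discarding part of the majority class.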
## user system elapsed
## 38.03 0.52 38.57
## Random Forest
##
## 193 samples
## 24 predictor
## 2 classes: 'band', 'noband'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 173, 174, 173, 174, 174, 174, ...
## Addtional sampling using down-sampling
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7111988 0.4229284
## 17 0.6846257 0.3597217
## 33 0.6655585 0.3201383
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Confusion Matrix and Statistics
##
## Reference
## Prediction band noband
## band 22 28
## noband 2 32
##
## Accuracy : 0.6429
## 95% CI : (0.5308, 0.7445)
## No Information Rate : 0.7143
## P-Value [Acc > NIR] : 0.9393
##
## Kappa : 0.3396
##
## Mcnemar's Test P-Value : 5.01e-06
##
## Sensitivity : 0.9167
## Specificity : 0.5333
## Pos Pred Value : 0.4400
## Neg Pred Value : 0.9412
## Prevalence : 0.2857
## Detection Rate : 0.2619
## Detection Prevalence : 0.5952
## Balanced Accuracy : 0.7250
##
## 'Positive' Class : band
##
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 84
##
##
## | band.pred.2
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 22 | 2 | 24 |
## | 0.917 | 0.083 | 0.286 |
## | 0.440 | 0.059 | |
## | 0.262 | 0.024 | |
## -----------------------|-----------|-----------|-----------|
## noband | 28 | 32 | 60 |
## | 0.467 | 0.533 | 0.714 |
## | 0.560 | 0.941 | |
## | 0.333 | 0.381 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 50 | 34 | 84 |
## | 0.595 | 0.405 | |
## -----------------------|-----------|-----------|-----------|
##
##
A neural network is a set of connected input/output units in which each connection has a weight associated with it. In the learning phase, the network adjusts the weights so that it predicts the correct class label for the given inputs.
We build a NN classifier in this case by providing the dataset free from missing values, using a single hidden layer with 2 units (consistent with the 71 weights reported below).
## # weights: 71
## initial value 137.879331
## final value 128.946992
## converged
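The weight count above can be checked with a short calculation. The sketch assumes a fully connected network with one hidden layer, a bias on every hidden and output unit, and no direct input-to-output connections; `n_weights()` is a helper introduced here, and the 33 inputs correspond to the 33 input names in the importance listing further on.

```r
# Number of weights in a single-hidden-layer network with bias terms.
n_weights <- function(n_in, n_hidden, n_out = 1) {
  (n_in + 1) * n_hidden +   # input -> hidden weights, plus hidden biases
    (n_hidden + 1) * n_out  # hidden -> output weights, plus output biases
}

n_weights(n_in = 33, n_hidden = 2)
```

Under these assumptions, 33 inputs and 2 hidden units give exactly the 71 weights reported.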
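We then test the model on the test dataset. From it we obtain the misclassification table shown below. The table shows that the network assigns every test record to noband: all 60 noband records are classified correctly, but none of the 24 band records is detected. The resulting accuracy of 71.4% (60 of 84) equals the share of noband records in the test set, so for predicting banding this model performs worse than the randomForest.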
## predicted
## true noband
## band 24
## noband 60
The misclassification table can also be obtained by using the gmodels library:
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 84
##
##
## | predicted
## data$band_type[-train] | noband | Row Total |
## -----------------------|-----------|-----------|
## band | 24 | 24 |
## | 0.286 | |
## -----------------------|-----------|-----------|
## noband | 60 | 60 |
## | 0.714 | |
## -----------------------|-----------|-----------|
## Column Total | 84 | 84 |
## -----------------------|-----------|-----------|
##
##
We can now list the relative importance of the variable using a sourced function (courtesy: https://beckmw.wordpress.com/2013/08/12/variable-importance-in-neural-networks/)
The relative importance of each input variable for the response variable is a value from -1 to 1. From the data, we can get an idea of what the neural network is telling us about the specific importance of each explanatory variable for the response variable. This follows a method proposed by Garson (1991) (also Goh 1995), in which the relative importance of explanatory variables for specific response variables in a supervised neural network is obtained by deconstructing the model weights.
The idea is that all weights connecting a specific input node, through the hidden layer, to the specific response variable are identified. This is repeated for all other explanatory variables until the analyst has a list of all weights that are specific to each input variable. The connections are tallied for each input node and scaled relative to all other inputs. A single value is obtained for each explanatory variable that describes its relationship with the response variable in the model.
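The tallying and scaling described above can be sketched as follows. This is a simplified illustration and not the sourced function: the weights are made up, the network has 3 inputs and 2 hidden units, and the sign-preserving rescaling is an assumption chosen so that the values fall between -1 and 1 (Garson's original formulation works with absolute values and yields shares between 0 and 1).

```r
# Hypothetical input -> hidden weights (rows: inputs, columns: hidden units).
w_ih <- matrix(c( 0.8, -0.4,
                 -1.2,  0.3,
                  0.1,  0.5),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("x1", "x2", "x3"), c("h1", "h2")))

# Hypothetical hidden -> output weights.
w_ho <- c(h1 = 1.5, h2 = -0.7)

# For each input, sum over hidden units the product of the two weights
# on the path input -> hidden unit -> output.
sum_prod <- drop(w_ih %*% w_ho)

# Keep the sign and rescale the magnitudes onto [0, 1].
mag <- abs(sum_prod)
rel_imp <- sign(sum_prod) * (mag - min(mag)) / (max(mag) - min(mag))

sort(rel_imp)
```

In this toy case `x2` has the strongest negative relationship and `x3` the weakest relationship of either sign. The values listed next in this report come from the sourced function applied to the fitted network, not from this example.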
## rel.imp x.names
## solvent_typeXYLOL -1.00000000 solvent_typeXYLOL
## ink_temperature -0.87550702 ink_temperature
## direct_steamyes -0.86384484 direct_steamyes
## locationnorthus -0.82482455 locationnorthus
## ink_typeuncoated -0.75388867 ink_typeuncoated
## roughness -0.69786427 roughness
## cylinder_sizetabloid -0.69255956 cylinder_sizetabloid
## locationsouthus -0.60007710 locationsouthus
## solvent_pct -0.56915850 solvent_pct
## proof_inkYES -0.47786287 proof_inkYES
## grain_screenedYES -0.41261131 grain_screenedYES
## ink_pct -0.33059487 ink_pct
## hardener -0.31448538 hardener
## paper_typeuncoated -0.23438104 paper_typeuncoated
## paper_typesuper -0.15555727 paper_typesuper
## solvent_typeNAPTHA -0.08247157 solvent_typeNAPTHA
## press_typeMotter70 -0.04514294 press_typeMotter70
## press_typeMotter94 0.00000000 press_typeMotter94
## locationmideuropean 0.02689458 locationmideuropean
## wax 0.08816143 wax
## cylinder_sizespiegel 0.11718953 cylinder_sizespiegel
## press_typeWoodHoe70 0.17856352 press_typeWoodHoe70
## anode_space_ratio 0.23876785 anode_space_ratio
## humidity 0.27973108 humidity
## current_density 0.32245025 current_density
## ink_typecover 0.33768691 ink_typecover
## blade_pressure 0.37256921 blade_pressure
## caliper 0.43920565 caliper
## viscosity 0.45173435 viscosity
## locationscandanavian 0.50211415 locationscandanavian
## press_speed 0.51625717 press_speed
## press 0.65012793 press
## cylinder_typeyes 0.83211514 cylinder_typeyes
The bar plot tells us that the variables solvent_type(XYLOL) and cylinder_type(YES) have the strongest negative and positive relationships, respectively, with the response variable band_type. Similarly, variables that have relative importance close to zero, such as press_type, do not have any substantial importance for band_type.
In decreasing order of importance:
We next consider a decision tree, modelled using the rpart function. To grow a large tree we make the complexity parameter (cp) very small. The output lists all the subtrees considered in the model, giving for each the complexity parameter, number of splits, re-substitution error rate, cross-validated error rate and the associated standard error.
##
## Classification tree:
## rpart(formula = data$band_type ~ ., data = data, method = "class",
## control = rpart.control(minsplit = 4, cp = 1e-05))
##
## Variables actually used in tree construction:
## [1] anode_space_ratio blade_pressure caliper
## [4] current_density cylinder_size cylinder_type
## [7] hardener humidity ink_pct
## [10] ink_temperature ink_type press
## [13] press_speed solvent_pct solvent_type
## [16] viscosity
##
## Root node error: 99/277 = 0.3574
##
## n= 277
##
## CP nsplit rel error xerror xstd
## 1 0.080808 0 1.000000 1.00000 0.080566
## 2 0.060606 4 0.676768 1.04040 0.081249
## 3 0.050505 5 0.616162 0.97980 0.080195
## 4 0.045455 6 0.565657 0.98990 0.080383
## 5 0.020202 8 0.474747 0.95960 0.079804
## 6 0.016162 13 0.373737 0.90909 0.078735
## 7 0.015152 18 0.292929 0.94949 0.079600
## 8 0.010101 24 0.202020 0.96970 0.080002
## 9 0.006734 34 0.101010 1.06061 0.081561
## 10 0.000010 40 0.060606 1.07071 0.081710
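The complexity table lists 10 candidate subtrees, the largest of which has 40 splits. The minimum cross-validated relative error is 0.909, reached at 13 splits (a tree with 14 leaves). Since a relative error of 1.0 corresponds to always predicting the majority class, this tree improves on that baseline by only about 9%, indicating that this is not a very good model and has very poor performance.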
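The choice of subtree can be reproduced from the complexity table. The sketch below re-enters the table printed above by hand, picks the row with the smallest cross-validated error, and converts that relative error into an absolute error rate using the root node error of 99/277.

```r
# Complexity table copied from the output above.
cp_table <- data.frame(
  CP     = c(0.080808, 0.060606, 0.050505, 0.045455, 0.020202,
             0.016162, 0.015152, 0.010101, 0.006734, 0.000010),
  nsplit = c(0, 4, 5, 6, 8, 13, 18, 24, 34, 40),
  xerror = c(1.00000, 1.04040, 0.97980, 0.98990, 0.95960,
             0.90909, 0.94949, 0.96970, 1.06061, 1.07071)
)

# Row with the smallest cross-validated relative error.
best <- cp_table[which.min(cp_table$xerror), ]

# Relative errors are expressed as a fraction of the root node error.
root_error <- 99 / 277
abs_cv_error <- best$xerror * root_error

best
round(abs_cv_error, 4)
```

In absolute terms the best subtree misclassifies about 32% of records under cross-validation, compared with about 36% for the root node alone.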
## Call:
## rpart(formula = data$band_type ~ ., data = data, method = "class",
## control = rpart.control(minsplit = 4, cp = 1e-05))
## n= 277
##
## CP nsplit rel error xerror xstd
## 1 0.08080808 0 1.0000000 1.000000 0.08056613
## 2 0.06060606 4 0.6767677 1.040404 0.08124902
## 3 0.05050505 5 0.6161616 0.979798 0.08019495
## 4 0.05000000 6 0.5656566 0.989899 0.08038305
##
## Variable importance
## press press_speed press_type viscosity
## 20 15 14 13
## cylinder_size wax ink_pct blade_pressure
## 10 8 8 4
## roughness location anode_space_ratio proof_ink
## 4 4 1 1
##
## Node number 1: 277 observations, complexity param=0.08080808
## predicted class=noband expected loss=0.3574007 P(node) =1
## class counts: 99 178
## probabilities: 0.357 0.643
## left son=2 (182 obs) right son=3 (95 obs)
## Primary splits:
## press < 822.5 to the left, improve=11.509960, (0 missing)
## press_speed < 2025 to the left, improve=11.098460, (0 missing)
## ink_type splits as RLL, improve= 9.863416, (0 missing)
## grain_screened splits as RL, improve= 7.633489, (0 missing)
## solvent_pct < 35.45 to the left, improve= 7.182566, (0 missing)
## Surrogate splits:
## press_type splits as LLRL, agree=0.859, adj=0.589, (0 split)
## wax < 2.35 to the right, agree=0.776, adj=0.347, (0 split)
## press_speed < 2075 to the left, agree=0.758, adj=0.295, (0 split)
## roughness < 0.78125 to the left, agree=0.747, adj=0.263, (0 split)
## location splits as RLLRR, agree=0.744, adj=0.253, (0 split)
##
## Node number 2: 182 observations, complexity param=0.08080808
## predicted class=noband expected loss=0.4615385 P(node) =0.6570397
## class counts: 84 98
## probabilities: 0.462 0.538
## left son=4 (10 obs) right son=5 (172 obs)
## Primary splits:
## ink_pct < 64 to the right, improve=6.135957, (0 missing)
## viscosity < 55.5 to the right, improve=5.723443, (0 missing)
## solvent_pct < 35.5 to the left, improve=5.631499, (0 missing)
## cylinder_size splits as RRL, improve=5.013794, (0 missing)
## press_speed < 2025 to the left, improve=4.473709, (0 missing)
##
## Node number 3: 95 observations
## predicted class=noband expected loss=0.1578947 P(node) =0.3429603
## class counts: 15 80
## probabilities: 0.158 0.842
##
## Node number 4: 10 observations
## predicted class=band expected loss=0 P(node) =0.03610108
## class counts: 10 0
## probabilities: 1.000 0.000
##
## Node number 5: 172 observations, complexity param=0.08080808
## predicted class=noband expected loss=0.4302326 P(node) =0.6209386
## class counts: 74 98
## probabilities: 0.430 0.570
## left son=10 (146 obs) right son=11 (26 obs)
## Primary splits:
## press_speed < 2025 to the left, improve=6.072684, (0 missing)
## cylinder_size splits as RRL, improve=5.004309, (0 missing)
## viscosity < 41.5 to the right, improve=3.686492, (0 missing)
## ink_type splits as RLL, improve=3.320349, (0 missing)
## paper_type splits as R-L, improve=3.047771, (0 missing)
## Surrogate splits:
## proof_ink splits as RL, agree=0.860, adj=0.077, (0 split)
## anode_space_ratio < 91.65 to the right, agree=0.860, adj=0.077, (0 split)
## paper_type splits as R-L, agree=0.855, adj=0.038, (0 split)
## viscosity < 68.5 to the left, agree=0.855, adj=0.038, (0 split)
## ink_temperature < 18.05 to the left, agree=0.855, adj=0.038, (0 split)
##
## Node number 10: 146 observations, complexity param=0.08080808
## predicted class=noband expected loss=0.4863014 P(node) =0.5270758
## class counts: 71 75
## probabilities: 0.486 0.514
## left son=20 (60 obs) right son=21 (86 obs)
## Primary splits:
## cylinder_size splits as RRL, improve=7.908771, (0 missing)
## viscosity < 55.5 to the right, improve=5.936941, (0 missing)
## wax < 2.65 to the left, improve=5.702065, (0 missing)
## press_type splits as RRLL, improve=5.419073, (0 missing)
## press < 814 to the right, improve=5.419073, (0 missing)
## Surrogate splits:
## press_type splits as RRLR, agree=0.836, adj=0.600, (0 split)
## press < 818.5 to the right, agree=0.836, adj=0.600, (0 split)
## blade_pressure < 25.5 to the left, agree=0.767, adj=0.433, (0 split)
## press_speed < 1755 to the right, agree=0.712, adj=0.300, (0 split)
## wax < 2.725 to the left, agree=0.705, adj=0.283, (0 split)
##
## Node number 11: 26 observations
## predicted class=noband expected loss=0.1153846 P(node) =0.09386282
## class counts: 3 23
## probabilities: 0.115 0.885
##
## Node number 20: 60 observations, complexity param=0.05050505
## predicted class=band expected loss=0.3166667 P(node) =0.2166065
## class counts: 41 19
## probabilities: 0.683 0.317
## left son=40 (55 obs) right son=41 (5 obs)
## Primary splits:
## viscosity < 40.5 to the right, improve=5.093939, (0 missing)
## ink_temperature < 16.4 to the right, improve=2.701361, (0 missing)
## current_density < 36 to the right, improve=2.594118, (0 missing)
## blade_pressure < 26.5 to the right, improve=2.002381, (0 missing)
## anode_space_ratio < 99.15 to the left, improve=1.907747, (0 missing)
##
## Node number 21: 86 observations, complexity param=0.06060606
## predicted class=noband expected loss=0.3488372 P(node) =0.3104693
## class counts: 30 56
## probabilities: 0.349 0.651
## left son=42 (6 obs) right son=43 (80 obs)
## Primary splits:
## viscosity < 62 to the right, improve=5.469767, (0 missing)
## humidity < 86.5 to the right, improve=4.887949, (0 missing)
## blade_pressure < 27.5 to the left, improve=4.606610, (0 missing)
## press_speed < 1325 to the left, improve=4.501866, (0 missing)
## anode_space_ratio < 106.78 to the left, improve=3.727302, (0 missing)
##
## Node number 40: 55 observations
## predicted class=band expected loss=0.2545455 P(node) =0.198556
## class counts: 41 14
## probabilities: 0.745 0.255
##
## Node number 41: 5 observations
## predicted class=noband expected loss=0 P(node) =0.01805054
## class counts: 0 5
## probabilities: 0.000 1.000
##
## Node number 42: 6 observations
## predicted class=band expected loss=0 P(node) =0.02166065
## class counts: 6 0
## probabilities: 1.000 0.000
##
## Node number 43: 80 observations
## predicted class=noband expected loss=0.3 P(node) =0.2888087
## class counts: 24 56
## probabilities: 0.300 0.700
The variable importance table is given below:
## Overall
## anode_space_ratio 24.762654
## blade_pressure 36.006487
## caliper 5.344160
## current_density 8.909889
## cylinder_size 18.628173
## cylinder_type 3.798601
## grain_screened 13.953228
## hardener 18.695875
## humidity 21.747414
## ink_pct 19.236685
## ink_temperature 24.076791
## ink_type 26.352381
## location 1.358391
## paper_type 5.199287
## press 20.781173
## press_speed 57.711832
## press_type 7.585740
## proof_ink 1.066667
## roughness 3.150000
## solvent_pct 39.015257
## solvent_type 1.190476
## viscosity 52.682730
## wax 26.552010
## direct_steam 0.000000
From this, we can list in decreasing order the most important variables:
We consider the use of support vector machines to build a classifier and to also extract the feature importance list.
First we use a cost of 10 (fairly large), which penalizes misclassified training points heavily and therefore allows only a narrow margin for misclassification.
##
## Call:
## svm(formula = data[train, ]$band_type ~ ., data = data[train,
## ], method = "C-classification", kernel = "radial", cost = 10,
## gamma = 0.1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10
##
## Number of Support Vectors: 176
##
## ( 103 73 )
##
##
## Number of Classes: 2
##
## Levels:
## band noband
## [1] 1 2 8 9 10 11 13 16 20 22 24 27 29 30 31 33 34
## [18] 35 41 42 43 45 46 47 48 49 51 62 65 66 67 71 72 73
## [35] 74 76 77 80 81 82 84 86 87 88 90 92 93 94 96 97 99
## [52] 100 101 102 103 105 106 108 110 111 112 113 114 115 117 118 119 120
## [69] 121 122 127 130 131 133 134 139 144 145 147 148 150 153 154 157 160
## [86] 161 163 164 165 166 167 168 172 173 175 176 177 183 184 189 190 191
## [103] 193 3 4 5 6 7 12 14 17 18 19 21 26 28 36 37 38
## [120] 40 44 50 52 54 55 56 58 59 60 61 63 64 68 69 70 75
## [137] 78 79 83 89 91 95 98 104 107 109 116 123 124 126 128 129 132
## [154] 135 136 137 138 142 146 151 152 155 156 159 170 171 174 178 179 180
## [171] 181 182 185 186 187 188
Prediction using SVM: the accuracy is 72.6% with a kappa of 0.38, which indicates poor prediction capability. The accuracy is also not significantly above the no-information rate of 71.4% (p = 0.46). We would have to try different parameters to obtain a more robust model.
## Confusion Matrix and Statistics
##
## Reference
## Prediction band noband
## band 16 15
## noband 8 45
##
## Accuracy : 0.7262
## 95% CI : (0.618, 0.8179)
## No Information Rate : 0.7143
## P-Value [Acc > NIR] : 0.4588
##
## Kappa : 0.3831
##
## Mcnemar's Test P-Value : 0.2109
##
## Sensitivity : 0.6667
## Specificity : 0.7500
## Pos Pred Value : 0.5161
## Neg Pred Value : 0.8491
## Prevalence : 0.2857
## Detection Rate : 0.1905
## Detection Prevalence : 0.3690
## Balanced Accuracy : 0.7083
##
## 'Positive' Class : band
##
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 84
##
##
## | svm.pred.2
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 16 | 8 | 24 |
## | 0.667 | 0.333 | 0.286 |
## | 0.516 | 0.151 | |
## | 0.190 | 0.095 | |
## -----------------------|-----------|-----------|-----------|
## noband | 15 | 45 | 60 |
## | 0.250 | 0.750 | 0.714 |
## | 0.484 | 0.849 | |
## | 0.179 | 0.536 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 31 | 53 | 84 |
## | 0.369 | 0.631 | |
## -----------------------|-----------|-----------|-----------|
##
##
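The comparison between the accuracy and the no-information rate can be made explicit. The sketch below uses the counts from the SVM confusion matrix above; using a one-sided exact binomial test for this comparison is an assumption made here for illustration.

```r
n       <- 84        # test records
correct <- 16 + 45   # correctly classified band + noband records
majority <- 60       # records in the larger class (noband)

accuracy <- correct / n
nir      <- majority / n   # accuracy of always predicting noband

# Is the accuracy significantly greater than the no-information rate?
p_value <- binom.test(correct, n, p = nir, alternative = "greater")$p.value

round(c(accuracy = accuracy, nir = nir), 4)
p_value
```

A large p-value here means that the classifier cannot be distinguished from one that always predicts the majority class.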
The variable importance matrix (VIM) is not implemented in caret for “svm”; we will therefore use the rminer package to obtain the VIM.
In our second method, we first remove rows which have more than 20% of data observations missing, and consider only the remaining for our analysis.
Then, instead of imputing the values of missing categorical variables, we add another level called “missing” to them. For numeric variables, we calculate the “mode” and replace the missing values with it.
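A minimal sketch of this imputation step, assuming the data are held in a data frame. The helpers `stat_mode` and `impute_frame` are hypothetical names introduced here, and the toy values are invented.

```r
# Most frequent non-missing value; ties are resolved by first occurrence.
stat_mode <- function(x) {
  x  <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Numeric columns: fill NA with the mode.
# Other columns: turn NA into an explicit "missing" level.
impute_frame <- function(df) {
  for (col in names(df)) {
    x <- df[[col]]
    if (is.numeric(x)) {
      x[is.na(x)] <- stat_mode(x)
    } else {
      x <- as.character(x)
      x[is.na(x)] <- "missing"
      x <- factor(x)
    }
    df[[col]] <- x
  }
  df
}

toy <- data.frame(
  humidity = c(75, 80, NA, 80),
  location = c("USA", NA, "USA", "scandanavian"),
  stringsAsFactors = FALSE
)
imputed <- impute_frame(toy)
imputed
```

Treating "missing" as its own level lets a model learn whether the absence of a value is itself informative, which is why dummy columns such as `locationmissing` can appear in later output.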
Using the data, we proceed with creating correlation plots to see how the different variables are correlated. If the dataset has perfectly positively or negatively correlated attributes, the performance of the model will be impacted by a problem called “multicollinearity”. Multicollinearity happens when one predictor variable in a multiple regression model can be linearly predicted from the others with a high degree of accuracy. This can lead to skewed or misleading results.
Decision tree and boosted tree algorithms are immune to multicollinearity by nature: when the model decides to split, the tree will choose only one of the perfectly correlated features. However, other algorithms such as logistic regression or linear regression are not immune to this problem, and it needs to be fixed before training the model. Note, however, that the variable importance levels can still be affected by the presence of multicollinearity.
In our case, we only wish to observe correlations, if any, and remove one variable of each correlated pair if such pairs exist.
We find some correlations, present to a lesser degree in the data, that involve “varnish_pct”. This implies that we can remove “varnish_pct” from our training model to observe whether there is an improvement.
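One way to list correlated pairs without reading them off a plot is to scan the lower triangle of the correlation matrix. The sketch below uses synthetic data in which `varnish_pct` is constructed from the other two percentages. That relation, the cutoff of 0.5 and the name `high_pairs` are assumptions made purely for illustration, not properties of the real data set.

```r
set.seed(1)
n <- 200
ink_pct     <- runif(n, 40, 70)
solvent_pct <- runif(n, 20, 50)
toy <- data.frame(
  ink_pct     = ink_pct,
  solvent_pct = solvent_pct,
  # Made-up relation: the three percentages roughly sum to 100.
  varnish_pct = 100 - ink_pct - solvent_pct + rnorm(n, sd = 1),
  humidity    = runif(n, 60, 90)
)

cors <- cor(toy, use = "pairwise.complete.obs")
cors[upper.tri(cors, diag = TRUE)] <- NA          # keep each pair once

# Pairs whose absolute correlation exceeds the (assumed) cutoff of 0.5.
idx <- which(abs(cors) > 0.5, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cors)[idx[, "row"]],
  var2 = colnames(cors)[idx[, "col"]],
  r    = cors[idx]
)
high_pairs
```

When one variable appears in every flagged pair, as `varnish_pct` does in this synthetic example, dropping that single variable removes all of the flagged correlations at once.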
We continue to ignore the 12 variables identified in the previous section for this model as well.
We first generate the training and test datasets based on a random sample. We use 70% of the supplied dataset as training set, and the remaining 30% as a test dataset.
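A minimal sketch of such a split, assuming the cleaned data set has 485 rows (the 339 training plus 146 test observations reported in the output of this section). The seed is an arbitrary choice for reproducibility and is not taken from the original analysis.

```r
set.seed(42)                       # assumed seed, for reproducibility only
n <- 485                           # rows after clean-up (339 + 146 in the output)

# Row indices of the 70% training sample; the rest form the test set.
train <- sample(n, size = floor(0.7 * n))
test  <- setdiff(seq_len(n), train)

c(train = length(train), test = length(test))
```

Storing row indices rather than two copies of the data allows the same vector to select the training rows with `data[train, ]` and the held-out labels with `data$band_type[-train]`, as in the calls shown in the output.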
We next build the random forest on the training data set using the randomForest library. With 39 variables in this case, a value of mtry near √39 ≈ 6 would be typical; the call shown below uses mtry = 5, that is, 5 variables are tried at each split.
From the results, we observe the out-of-bag error rate to be 23.3%.
##
## Call:
## randomForest(formula = form, data = data[train, ], ntree = 1000, mtry = 5, importance = TRUE, localImp = TRUE, replace = FALSE, na.action = na.roughfix)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 23.3%
## Confusion matrix:
## band noband class.error
## band 64 61 0.48800000
## noband 18 196 0.08411215
The variable importance is shown in the plot below, followed by the error rates (out-of-bag and per class) after each of the first 15 trees and the corresponding error plot. The error plot shows how the error rate changes as more trees are added to the forest.
## OOB band noband
## [1,] 0.3629 0.4468 0.3117
## [2,] 0.3600 0.4800 0.2880
## [3,] 0.3755 0.4947 0.3038
## [4,] 0.3611 0.4623 0.3022
## [5,] 0.3722 0.4643 0.3198
## [6,] 0.3560 0.4706 0.2892
## [7,] 0.3598 0.4628 0.2995
## [8,] 0.4018 0.5082 0.3397
## [9,] 0.3772 0.4878 0.3128
## [10,] 0.3661 0.5040 0.2844
## [11,] 0.3650 0.5280 0.2689
## [12,] 0.3561 0.5120 0.2642
## [13,] 0.3333 0.4960 0.2383
## [14,] 0.3451 0.5120 0.2477
## [15,] 0.3451 0.5040 0.2523
In this case, the most important variables are those with a value between 6 and 8 on the Gini index in the importance plot.
We now proceed to predict using the test data set and create the confusion matrix for the same.
## pred
## true band noband
## band 32 16
## noband 7 91
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 146
##
##
## | band.pred
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 32 | 16 | 48 |
## | 0.667 | 0.333 | 0.329 |
## | 0.821 | 0.150 | |
## | 0.219 | 0.110 | |
## -----------------------|-----------|-----------|-----------|
## noband | 7 | 91 | 98 |
## | 0.071 | 0.929 | 0.671 |
## | 0.179 | 0.850 | |
## | 0.048 | 0.623 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 39 | 107 | 146 |
## | 0.267 | 0.733 | |
## -----------------------|-----------|-----------|-----------|
##
##
We now use repeated cross-validation (10 folds, repeated 10 times) to see whether we get better classification performance. The resampled accuracy is highest at mtry = 20 (0.74, kappa 0.39); on the test set, the selected model reaches an accuracy of 0.84 and a kappa of 0.63, as the confusion matrix and cross-table below show.
## model parameter label forReg forClass probModel
## 1 rf mtry #Randomly Selected Predictors TRUE TRUE TRUE
## user system elapsed
## 124.37 1.61 133.20
## Random Forest
##
## 339 samples
## 27 predictor
## 2 classes: 'band', 'noband'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 304, 305, 305, 305, 305, 305, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7029139 0.2639397
## 20 0.7388928 0.3906742
## 39 0.7372880 0.3878486
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 20.
## Confusion Matrix and Statistics
##
## Reference
## Prediction band noband
## band 33 8
## noband 15 90
##
## Accuracy : 0.8425
## 95% CI : (0.7731, 0.8974)
## No Information Rate : 0.6712
## P-Value [Acc > NIR] : 2.324e-06
##
## Kappa : 0.6293
##
## Mcnemar's Test P-Value : 0.2109
##
## Sensitivity : 0.6875
## Specificity : 0.9184
## Pos Pred Value : 0.8049
## Neg Pred Value : 0.8571
## Prevalence : 0.3288
## Detection Rate : 0.2260
## Detection Prevalence : 0.2808
## Balanced Accuracy : 0.8029
##
## 'Positive' Class : band
##
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 146
##
##
## | band.pred.1
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 33 | 15 | 48 |
## | 0.688 | 0.312 | 0.329 |
## | 0.805 | 0.143 | |
## | 0.226 | 0.103 | |
## -----------------------|-----------|-----------|-----------|
## noband | 8 | 90 | 98 |
## | 0.082 | 0.918 | 0.671 |
## | 0.195 | 0.857 | |
## | 0.055 | 0.616 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 41 | 105 | 146 |
## | 0.281 | 0.719 | |
## -----------------------|-----------|-----------|-----------|
##
##
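The resampling scheme behind "10 fold, repeated 10 times" can be sketched in base R. This is an unstratified illustration only: caret (where this scheme is requested with `trainControl(method = "repeatedcv", number = 10, repeats = 10)`) stratifies folds by class, so the sample sizes in its output differ slightly. The seed and variable names are assumptions.

```r
set.seed(7)                      # assumed seed, for reproducibility only
n <- 339                         # training rows in the output above
k <- 10                          # folds per repeat
repeats <- 10

# Each repeat reshuffles the rows and deals them into k folds.
folds <- lapply(seq_len(repeats), function(r) {
  split(sample(n), rep(seq_len(k), length.out = n))
})

# Rows used for fitting when each fold of the first repeat is held out.
train_sizes <- n - lengths(folds[[1]])
train_sizes
```

Every row is held out exactly once per repeat, so the reported accuracy and kappa for each mtry value are averages over 100 fitted models.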
For a large class imbalance (which is not the case in our example, as the band and noband classes are well balanced), the train function can also subsample the dataset to balance the classes before model fitting. This is done by using the option sampling = “down”, which is set through trainControl.
The corresponding confusion matrix and cross-table are shown below.
## user system elapsed
## 80.00 1.00 81.36
## Random Forest
##
## 339 samples
## 27 predictor
## 2 classes: 'band', 'noband'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 305, 306, 305, 304, 305, 306, ...
## Addtional sampling using down-sampling
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7094372 0.4178767
## 20 0.7021401 0.3859451
## 39 0.6972737 0.3779739
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Confusion Matrix and Statistics
##
## Reference
## Prediction band noband
## band 39 23
## noband 9 75
##
## Accuracy : 0.7808
## 95% CI : (0.7049, 0.845)
## No Information Rate : 0.6712
## P-Value [Acc > NIR] : 0.002451
##
## Kappa : 0.5378
##
## Mcnemar's Test P-Value : 0.021556
##
## Sensitivity : 0.8125
## Specificity : 0.7653
## Pos Pred Value : 0.6290
## Neg Pred Value : 0.8929
## Prevalence : 0.3288
## Detection Rate : 0.2671
## Detection Prevalence : 0.4247
## Balanced Accuracy : 0.7889
##
## 'Positive' Class : band
##
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 146
##
##
## | band.pred.2
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 39 | 9 | 48 |
## | 0.812 | 0.188 | 0.329 |
## | 0.629 | 0.107 | |
## | 0.267 | 0.062 | |
## -----------------------|-----------|-----------|-----------|
## noband | 23 | 75 | 98 |
## | 0.235 | 0.765 | 0.671 |
## | 0.371 | 0.893 | |
## | 0.158 | 0.514 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 62 | 84 | 146 |
## | 0.425 | 0.575 | |
## -----------------------|-----------|-----------|-----------|
##
##
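Down-sampling itself is simple to illustrate: every class is reduced at random to the size of the smallest class. The base R sketch below uses the training class counts from the out-of-bag confusion matrix earlier in this section (125 band, 214 noband). The seed and variable names are assumptions, and caret applies the same idea inside each resample rather than once up front.

```r
set.seed(3)                       # assumed seed, for reproducibility only

# Training labels with the class counts seen in the OOB confusion matrix.
labels <- factor(c(rep("band", 125), rep("noband", 214)))

# Size of the smallest class.
n_min <- min(table(labels))

# Randomly keep n_min row indices from each class.
keep <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  idx[sample.int(length(idx), n_min)]
}))

balanced <- labels[keep]
table(balanced)
```

The price of balancing is that 89 "noband" rows are discarded, which is consistent with down-sampling raising sensitivity for "band" while lowering overall accuracy in the output above.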
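We build a neural network (NN) classifier in this case, providing the dataset free from missing values and using a single hidden layer with 2 units (the 83 weights reported below correspond to 39 inputs and 2 hidden units).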
## # weights: 83
## initial value 242.948297
## final value 223.156274
## converged
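The reported weight count can be checked with a little arithmetic, assuming a fully connected network with one hidden layer, one output unit and a bias on every unit. The helper `n_weights` is a hypothetical name used only for this check.

```r
# Weights in a single-hidden-layer network with bias terms:
# (inputs + 1) weights into each hidden unit,
# (hidden + 1) weights into each output unit.
n_weights <- function(inputs, hidden, outputs = 1) {
  (inputs + 1) * hidden + (hidden + 1) * outputs
}

n_weights(inputs = 39, hidden = 2)
```

With the 39 input columns listed in the relative importance table and 2 hidden units, this reproduces the 83 weights printed above.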
We then proceed to test the model on the test dataset. From it we obtain the misclassification table shown below. The model predicts “noband” for all 146 test observations, so none of the 48 “band” cases is detected; its accuracy of 0.671 merely equals the proportion of “noband” cases, and it performs worse than randomForest at predicting banding.
## predicted
## true noband
## band 48
## noband 98
The misclassification table can also be obtained by using the gmodels library:
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 146
##
##
## | predicted
## data$band_type[-train] | noband | Row Total |
## -----------------------|-----------|-----------|
## band | 48 | 48 |
## | 0.329 | |
## -----------------------|-----------|-----------|
## noband | 98 | 98 |
## | 0.671 | |
## -----------------------|-----------|-----------|
## Column Total | 146 | 146 |
## -----------------------|-----------|-----------|
##
##
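The tables above contain only a "noband" column because no observation was predicted as "band". Fixing the factor levels of the predictions keeps the empty column, which makes the per-class rates easy to compute. The sketch below rebuilds the predictions from the counts shown above; the object names are illustrative.

```r
# True labels and predictions reconstructed from the table above.
truth <- factor(c(rep("band", 48), rep("noband", 98)),
                levels = c("band", "noband"))
pred  <- factor(rep("noband", 146), levels = levels(truth))  # keeps empty "band" level

cm <- table(true = truth, predicted = pred)
cm

sensitivity <- cm["band", "band"] / sum(cm["band", ])   # share of band cases detected
accuracy    <- sum(diag(cm)) / sum(cm)
c(sensitivity = sensitivity, accuracy = accuracy)
```

A sensitivity of zero alongside an accuracy equal to the majority-class share is the signature of a classifier that has collapsed to always predicting the majority class.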
We can now list the relative importance of the variables using a sourced function (courtesy: https://beckmw.wordpress.com/2013/08/12/variable-importance-in-neural-networks/)
## rel.imp x.names
## press_speed -1.00000000 press_speed
## direct_steamyes -0.45418926 direct_steamyes
## anode_space_ratio -0.42299572 anode_space_ratio
## cylinder_sizetabloid -0.42266536 cylinder_sizetabloid
## press_typeMotter94 -0.41648328 press_typeMotter94
## proof_inkyes -0.41139218 proof_inkyes
## cylinder_typeno -0.39029977 cylinder_typeno
## press_typeMotter70 -0.31007975 press_typeMotter70
## cylinder_typeyes -0.30802519 cylinder_typeyes
## solvent_typexylol -0.28425575 solvent_typexylol
## roughness -0.25257263 roughness
## current_density -0.24980925 current_density
## varnish_pct -0.19591180 varnish_pct
## locationscandanavian -0.17879029 locationscandanavian
## ink_temperature -0.14715537 ink_temperature
## proof_inkno -0.14302633 proof_inkno
## solvent_typenaptha -0.04598000 solvent_typenaptha
## press -0.02971619 press
## caliper 0.00000000 caliper
## ink_typecover 0.01093417 ink_typecover
## cylinder_sizespiegel 0.03895839 cylinder_sizespiegel
## viscosity 0.05434212 viscosity
## ink_pct 0.06952549 ink_pct
## humidity 0.08271790 humidity
## roller_durometer 0.10893590 roller_durometer
## paper_typeuncoated 0.11548110 paper_typeuncoated
## grain_screenedyes 0.11687962 grain_screenedyes
## press_typeWoodHoe70 0.11934415 press_typeWoodHoe70
## solvent_pct 0.12436495 solvent_pct
## locationmissing 0.13678704 locationmissing
## wax 0.14395420 wax
## locationUSA 0.27883757 locationUSA
## ink_typeuncoated 0.28655381 ink_typeuncoated
## paper_typesuper 0.33059093 paper_typesuper
## grain_screenedno 0.34849813 grain_screenedno
## blade_pressure 0.40906394 blade_pressure
## locationmideuropean 0.42357588 locationmideuropean
## proof_cut 0.46211301 proof_cut
## hardener 0.46489333 hardener
The bar plot tells us that the variables press_speed and hardener have the strongest negative and positive relationships, respectively, with the response variable band_type.
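In decreasing order of positive relative importance (all weak, as the weights are below 0.47), the leading variables are hardener, proof_cut, locationmideuropean, blade_pressure and grain_screenedno.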
We consider the use of a decision tree, modelled using the rpart function. To get a large tree, we make the complexity parameter (cp) very small.
The output lists all the trees considered in the model, giving for each the complexity parameter, number of splits, resubstitution error rate, cross-validated error rate and the associated standard error.
From the complexity table we observe that the lowest cross-validated error (xerror) of 0.75 occurs at 14 splits, while the relative resubstitution error keeps falling as the tree grows, down to 0.06 at 64 splits.
To reduce the tree size, we prune the tree, using repeated cross-validation to guide the choice of cp. The pruned tree summarised below (cp = 0.05) has 3 splits and a relative resubstitution error of 0.78.
This again is not a good predictor due to the large error rate.
##
## Classification tree:
## rpart(formula = data$band_type ~ ., data = data, method = "class",
## control = rpart.control(minsplit = 4, cp = 1e-05))
##
## Variables actually used in tree construction:
## [1] anode_space_ratio blade_pressure caliper
## [4] cylinder_type grain_screened hardener
## [7] humidity ink_pct ink_temperature
## [10] ink_type location paper_type
## [13] press press_speed press_type
## [16] proof_cut proof_ink roller_durometer
## [19] roughness solvent_pct solvent_type
## [22] varnish_pct viscosity wax
##
## Root node error: 173/485 = 0.3567
##
## n= 485
##
## CP nsplit rel error xerror xstd
## 1 0.0693642 0 1.000000 1.00000 0.060979
## 2 0.0375723 3 0.780347 1.00578 0.061057
## 3 0.0289017 7 0.618497 0.87283 0.058945
## 4 0.0231214 9 0.560694 0.84971 0.058506
## 5 0.0173410 11 0.514451 0.78613 0.057183
## 6 0.0115607 14 0.462428 0.75145 0.056386
## 7 0.0096339 27 0.312139 0.78035 0.057054
## 8 0.0086705 31 0.271676 0.79191 0.057310
## 9 0.0072254 38 0.208092 0.79191 0.057310
## 10 0.0057803 42 0.179191 0.79191 0.057310
## 11 0.0019268 61 0.069364 0.75723 0.056522
## 12 0.0000100 64 0.063584 0.75723 0.056522
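The complexity table above can be used to pick a pruning point programmatically. The sketch below copies the printed table into a data frame and applies two common rules: the minimum cross-validated error, and the "one standard error" rule (the smallest tree whose xerror is within one xstd of the minimum). The 1-SE rule is offered here as an assumption about a reasonable choice, not as the rule used in the original analysis.

```r
# Values copied from the complexity table printed above.
cp_table <- data.frame(
  CP     = c(0.0693642, 0.0375723, 0.0289017, 0.0231214, 0.0173410, 0.0115607,
             0.0096339, 0.0086705, 0.0072254, 0.0057803, 0.0019268, 0.0000100),
  nsplit = c(0, 3, 7, 9, 11, 14, 27, 31, 38, 42, 61, 64),
  xerror = c(1.00000, 1.00578, 0.87283, 0.84971, 0.78613, 0.75145,
             0.78035, 0.79191, 0.79191, 0.79191, 0.75723, 0.75723),
  xstd   = c(0.060979, 0.061057, 0.058945, 0.058506, 0.057183, 0.056386,
             0.057054, 0.057310, 0.057310, 0.057310, 0.056522, 0.056522)
)

# Rule 1: row with the lowest cross-validated error.
best <- which.min(cp_table$xerror)

# Rule 2: smallest tree within one standard error of that minimum.
threshold <- cp_table$xerror[best] + cp_table$xstd[best]
one_se    <- min(which(cp_table$xerror <= threshold))

cp_table[c(best, one_se), ]
```

The CP value of the chosen row could then be passed to rpart's `prune()` function. Since xerror comes from random cross-validation folds, the selected row can change from run to run.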
## Call:
## rpart(formula = data$band_type ~ ., data = data, method = "class",
## control = rpart.control(minsplit = 4, cp = 1e-05))
## n= 485
##
## CP nsplit rel error xerror xstd
## 1 0.06936416 0 1.0000000 1.00000 0.06097943
## 2 0.05000000 3 0.7803468 1.00578 0.06105734
##
## Variable importance
## press_type humidity press_speed press
## 31 20 18 11
## roughness cylinder_size ink_pct solvent_type
## 9 4 1 1
## blade_pressure location hardener wax
## 1 1 1 1
## ink_temperature roller_durometer
## 1 1
##
## Node number 1: 485 observations, complexity param=0.06936416
## predicted class=noband expected loss=0.356701 P(node) =1
## class counts: 173 312
## probabilities: 0.357 0.643
## left son=2 (167 obs) right son=3 (318 obs)
## Primary splits:
## press_type splits as RRRL, improve=13.743870, (0 missing)
## press_speed < 2184.5 to the left, improve=13.282210, (0 missing)
## press < 822.5 to the left, improve=13.017350, (0 missing)
## ink_type splits as RLL, improve=10.840790, (0 missing)
## grain_screened splits as LRL, improve= 9.577898, (0 missing)
## Surrogate splits:
## press < 818.5 to the left, agree=0.777, adj=0.353, (0 split)
## roughness < 0.53125 to the left, agree=0.753, adj=0.281, (0 split)
## cylinder_size splits as RLR, agree=0.697, adj=0.120, (0 split)
## humidity < 84.5 to the right, agree=0.676, adj=0.060, (0 split)
## ink_pct < 64 to the right, agree=0.672, adj=0.048, (0 split)
##
## Node number 2: 167 observations, complexity param=0.06936416
## predicted class=band expected loss=0.4790419 P(node) =0.3443299
## class counts: 87 80
## probabilities: 0.521 0.479
## left son=4 (46 obs) right son=5 (121 obs)
## Primary splits:
## press_speed < 1678 to the left, improve=7.308378, (0 missing)
## ink_pct < 62.9 to the right, improve=5.404575, (0 missing)
## cylinder_type splits as LLR, improve=5.085740, (0 missing)
## humidity < 73.5 to the left, improve=5.077850, (0 missing)
## location splits as RLLRR, improve=4.147136, (0 missing)
## Surrogate splits:
## solvent_type splits as R-L, agree=0.749, adj=0.087, (0 split)
## location splits as RLRRR, agree=0.743, adj=0.065, (0 split)
## blade_pressure < 24.5 to the left, agree=0.743, adj=0.065, (0 split)
## ink_temperature < 19.05 to the right, agree=0.737, adj=0.043, (0 split)
## roller_durometer < 29 to the left, agree=0.737, adj=0.043, (0 split)
##
## Node number 3: 318 observations
## predicted class=noband expected loss=0.2704403 P(node) =0.6556701
## class counts: 86 232
## probabilities: 0.270 0.730
##
## Node number 4: 46 observations
## predicted class=band expected loss=0.2391304 P(node) =0.09484536
## class counts: 35 11
## probabilities: 0.761 0.239
##
## Node number 5: 121 observations, complexity param=0.06936416
## predicted class=noband expected loss=0.4297521 P(node) =0.2494845
## class counts: 52 69
## probabilities: 0.430 0.570
## left son=10 (22 obs) right son=11 (99 obs)
## Primary splits:
## humidity < 75.5 to the left, improve=8.113866, (0 missing)
## ink_pct < 62.9 to the right, improve=6.323642, (0 missing)
## location splits as RLLRR, improve=3.867792, (0 missing)
## press_speed < 2184.5 to the left, improve=3.734125, (0 missing)
## anode_space_ratio < 98.485 to the left, improve=3.617543, (0 missing)
## Surrogate splits:
## press_speed < 2450 to the right, agree=0.835, adj=0.091, (0 split)
## wax < 0.85 to the left, agree=0.826, adj=0.045, (0 split)
## hardener < 0.25 to the left, agree=0.826, adj=0.045, (0 split)
##
## Node number 10: 22 observations
## predicted class=band expected loss=0.1818182 P(node) =0.04536082
## class counts: 18 4
## probabilities: 0.818 0.182
##
## Node number 11: 99 observations
## predicted class=noband expected loss=0.3434343 P(node) =0.2041237
## class counts: 34 65
## probabilities: 0.343 0.657
The variable importance table is given below:
## Overall
## anode_space_ratio 49.255578
## blade_pressure 23.201752
## caliper 15.764260
## current_density 5.927432
## cylinder_size 23.370388
## cylinder_type 28.803180
## grain_screened 26.322680
## hardener 23.936243
## humidity 53.401308
## ink_pct 47.602685
## ink_temperature 41.345016
## ink_type 30.770425
## location 39.270901
## paper_type 2.000000
## press 43.063079
## press_speed 72.939017
## press_type 29.891329
## proof_cut 43.882342
## proof_ink 2.100000
## roller_durometer 17.912351
## roughness 19.240835
## solvent_pct 37.649654
## solvent_type 1.600000
## varnish_pct 19.361990
## viscosity 58.850678
## wax 22.796918
## direct_steam 0.000000
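In decreasing order of importance, the leading variables are press_speed (72.9), viscosity (58.9), humidity (53.4), anode_space_ratio (49.3), ink_pct (47.6), proof_cut (43.9), press (43.1) and ink_temperature (41.3).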
We consider the use of support vector machines to build a classifier and to also extract the feature importance list.
First we use a cost of 10 (slightly large), which penalises misclassified training points heavily and therefore gives only a narrow margin for misclassification.
##
## Call:
## svm(formula = data[train, ]$band_type ~ ., data = data[train,
## ], method = "C-classification", kernel = "radial", cost = 10,
## gamma = 0.1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10
##
## Number of Support Vectors: 314
##
## ( 192 122 )
##
##
## Number of Classes: 2
##
## Levels:
## band noband
## [1] 1 2 4 5 6 10 13 14 17 19 22 24 27 28 32 33 34
## [18] 36 37 42 47 49 50 51 52 53 54 56 57 58 59 60 61 62
## [35] 63 64 66 68 70 71 73 74 76 77 79 82 85 87 88 89 92
## [52] 94 95 96 97 98 99 102 103 104 106 107 111 113 114 116 119 121
## [69] 122 124 128 130 133 135 136 138 139 141 142 143 144 146 147 148 150
## [86] 151 153 154 157 159 161 162 163 167 168 169 170 173 176 177 178 180
## [103] 181 185 187 190 191 193 197 198 199 200 201 202 203 204 206 207 209
## [120] 210 213 215 216 217 219 221 226 227 229 231 233 234 236 238 241 244
## [137] 246 247 248 250 251 252 254 255 256 258 259 260 262 268 270 272 273
## [154] 274 276 277 279 280 282 284 286 287 288 293 295 296 297 301 303 304
## [171] 305 306 307 308 310 311 313 315 316 317 318 319 320 321 322 325 326
## [188] 331 332 333 337 339 3 7 8 11 12 15 16 18 20 21 23 25
## [205] 26 29 31 35 38 39 41 43 44 45 55 65 69 72 78 80 81
## [222] 83 84 90 93 101 105 108 109 110 112 115 118 123 125 126 127 129
## [239] 131 132 134 137 140 145 149 152 155 160 164 166 171 172 174 175 184
## [256] 186 188 192 195 196 208 211 212 214 218 220 222 223 225 228 230 232
## [273] 235 237 239 242 243 245 249 253 257 261 263 264 265 266 267 269 271
## [290] 275 278 281 283 285 289 290 291 294 298 299 300 302 309 312 314 323
## [307] 327 328 329 330 334 335 336 338
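The call above uses a radial kernel with gamma = 0.1. As a reminder of what gamma controls, the radial basis kernel measures the similarity of two observations as exp(-gamma * ||u - v||^2). The sketch below evaluates it for two made-up points; `rbf_kernel` is an illustrative helper, not a function from the SVM package.

```r
# Radial basis function kernel between two numeric vectors.
rbf_kernel <- function(u, v, gamma = 0.1) {
  exp(-gamma * sum((u - v)^2))
}

u <- c(0, 0)
v <- c(1, 2)                        # squared distance to u is 5

k_small <- rbf_kernel(u, v)              # gamma = 0.1, as in the call above
k_large <- rbf_kernel(u, v, gamma = 1)   # larger gamma: similarity decays faster
c(gamma_0.1 = k_small, gamma_1 = k_large)
```

A larger gamma makes each training point influence only its close neighbours, which, combined with a large cost, tends to produce a model with many support vectors, such as the 314 of 339 training rows reported above.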
Prediction using SVM: The accuracy is 84% with a kappa of 0.63.
## Confusion Matrix and Statistics
##
## Reference
## Prediction band noband
## band 33 8
## noband 15 90
##
## Accuracy : 0.8425
## 95% CI : (0.7731, 0.8974)
## No Information Rate : 0.6712
## P-Value [Acc > NIR] : 2.324e-06
##
## Kappa : 0.6293
##
## Mcnemar's Test P-Value : 0.2109
##
## Sensitivity : 0.6875
## Specificity : 0.9184
## Pos Pred Value : 0.8049
## Neg Pred Value : 0.8571
## Prevalence : 0.3288
## Detection Rate : 0.2260
## Detection Prevalence : 0.2808
## Balanced Accuracy : 0.8029
##
## 'Positive' Class : band
##
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 146
##
##
## | svm.pred.2
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 33 | 15 | 48 |
## | 0.688 | 0.312 | 0.329 |
## | 0.805 | 0.143 | |
## | 0.226 | 0.103 | |
## -----------------------|-----------|-----------|-----------|
## noband | 8 | 90 | 98 |
## | 0.082 | 0.918 | 0.671 |
## | 0.195 | 0.857 | |
## | 0.055 | 0.616 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 41 | 105 | 146 |
## | 0.281 | 0.719 | |
## -----------------------|-----------|-----------|-----------|
##
##
The variable importance matrix (VIM) is not implemented in caret for “svm”; we will therefore use the rminer package to obtain it.
In this method, we do not remove rows that have more than 20% of their values missing (as in the previous case); instead, we estimate values for all the missing data and proceed with our model.
Then, instead of imputing the values of missing categorical variables, we add another level called “missing” to them. For numeric variables, we calculate the “mode” and replace the missing values with it.
Using the data, we proceed with creating correlation plots to see how the different variables are correlated. If the dataset has perfectly positively or negatively correlated attributes, the performance of the model will be impacted by a problem called “multicollinearity”. Multicollinearity happens when one predictor variable in a multiple regression model can be linearly predicted from the others with a high degree of accuracy. This can lead to skewed or misleading results.
We find the following correlations (present to a lesser degree in the data):
1. “ink_pct” and “varnish_pct”
2. “solvent_pct” and “varnish_pct”
This implies that we can remove “varnish_pct” from our training model to observe whether there is an improvement.
We continue to ignore the 12 variables identified in the previous section for this model as well.
We first generate the training and test datasets based on a random sample. We use 70% of the supplied dataset as training set, and the remaining 30% as a test dataset.
We next build the random forest on the training data set using the randomForest library. With 39 variables in this case, a value of mtry near √39 ≈ 6 would be typical; the call shown below uses mtry = 5, that is, 5 variables are tried at each split.
From the results, we observe the out-of-bag error rate to be 20.63%.
##
## Call:
## randomForest(formula = form, data = data[train, ], ntree = 1000, mtry = 5, importance = TRUE, localImp = TRUE, replace = FALSE, na.action = na.roughfix)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 20.63%
## Confusion matrix:
## band noband class.error
## band 110 56 0.3373494
## noband 22 190 0.1037736
The variable importance is shown in the plot below, followed by the error rates (out-of-bag and per class) after each of the first 15 trees and the corresponding error plot. The error plot shows how the error rate changes as more trees are added to the forest.
## OOB band noband
## [1,] 0.2734 0.2712 0.2750
## [2,] 0.2800 0.3173 0.2479
## [3,] 0.2867 0.2984 0.2774
## [4,] 0.2987 0.2643 0.3258
## [5,] 0.3068 0.3020 0.3105
## [6,] 0.3199 0.2876 0.3454
## [7,] 0.3092 0.2911 0.3234
## [8,] 0.3142 0.3168 0.3122
## [9,] 0.2938 0.2699 0.3125
## [10,] 0.3029 0.2866 0.3158
## [11,] 0.3013 0.2909 0.3095
## [12,] 0.2766 0.3012 0.2571
## [13,] 0.2872 0.3012 0.2762
## [14,] 0.2838 0.3253 0.2512
## [15,] 0.2918 0.3494 0.2464
In this case, the most important variables are those with a value between 6 and 8 on the Gini index in the importance plot.
We now proceed to predict using the test data set and create the confusion matrix for the same.
## band noband class.error
## band 110 56 0.3373494
## noband 22 190 0.1037736
The out-of-bag confusion matrix of the fitted forest is repeated above. Predicting on the test set gives the confusion matrix and cross-table below, with an accuracy of (45 + 96)/162 = 0.87.
## pred
## true band noband
## band 45 17
## noband 4 96
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 162
##
##
## | band.pred
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 45 | 17 | 62 |
## | 0.726 | 0.274 | 0.383 |
## | 0.918 | 0.150 | |
## | 0.278 | 0.105 | |
## -----------------------|-----------|-----------|-----------|
## noband | 4 | 96 | 100 |
## | 0.040 | 0.960 | 0.617 |
## | 0.082 | 0.850 | |
## | 0.025 | 0.593 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 49 | 113 | 162 |
## | 0.302 | 0.698 | |
## -----------------------|-----------|-----------|-----------|
##
##
We now use repeated cross-validation (10 folds, repeated 10 times) to see whether we get better classification performance. The resampled accuracy is highest at mtry = 21 (0.79, kappa 0.55); on the test set, the selected model reaches an accuracy of 0.858 and a kappa of 0.685, as the confusion matrix and cross-table below show.
## model parameter label forReg forClass probModel
## 1 rf mtry #Randomly Selected Predictors TRUE TRUE TRUE
## user system elapsed
## 145.98 2.13 158.75
## Random Forest
##
## 378 samples
## 26 predictor
## 2 classes: 'band', 'noband'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 340, 341, 339, 341, 340, 340, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7530058 0.4748330
## 21 0.7851041 0.5523332
## 41 0.7787026 0.5388084
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 21.
## Confusion Matrix and Statistics
##
## Reference
## Prediction band noband
## band 43 4
## noband 19 96
##
## Accuracy : 0.858
## 95% CI : (0.7946, 0.9078)
## No Information Rate : 0.6173
## P-Value [Acc > NIR] : 1.285e-11
##
## Kappa : 0.685
##
## Mcnemar's Test P-Value : 0.003509
##
## Sensitivity : 0.6935
## Specificity : 0.9600
## Pos Pred Value : 0.9149
## Neg Pred Value : 0.8348
## Prevalence : 0.3827
## Detection Rate : 0.2654
## Detection Prevalence : 0.2901
## Balanced Accuracy : 0.8268
##
## 'Positive' Class : band
##
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 162
##
##
## | band.pred.1
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 43 | 19 | 62 |
## | 0.694 | 0.306 | 0.383 |
## | 0.915 | 0.165 | |
## | 0.265 | 0.117 | |
## -----------------------|-----------|-----------|-----------|
## noband | 4 | 96 | 100 |
## | 0.040 | 0.960 | 0.617 |
## | 0.085 | 0.835 | |
## | 0.025 | 0.593 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 47 | 115 | 162 |
## | 0.290 | 0.710 | |
## -----------------------|-----------|-----------|-----------|
##
##
For a large class imbalance (which is not the case in our example, as the band and noband classes are well balanced), the train function can also subsample the dataset to balance the classes before model fitting. This is done by using the option sampling = “down”, which is set through trainControl.
The corresponding confusion matrix and cross-table are shown below.
## user system elapsed
## 114.42 1.36 118.07
## Random Forest
##
## 378 samples
## 26 predictor
## 2 classes: 'band', 'noband'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 341, 340, 340, 341, 339, 340, ...
## Addtional sampling using down-sampling
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7812062 0.5493503
## 21 0.7695791 0.5290100
## 41 0.7683426 0.5252884
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Confusion Matrix and Statistics
##
## Reference
## Prediction band noband
## band 41 9
## noband 21 91
##
## Accuracy : 0.8148
## 95% CI : (0.7463, 0.8714)
## No Information Rate : 0.6173
## P-Value [Acc > NIR] : 4.355e-08
##
## Kappa : 0.5931
##
## Mcnemar's Test P-Value : 0.04461
##
## Sensitivity : 0.6613
## Specificity : 0.9100
## Pos Pred Value : 0.8200
## Neg Pred Value : 0.8125
## Prevalence : 0.3827
## Detection Rate : 0.2531
## Detection Prevalence : 0.3086
## Balanced Accuracy : 0.7856
##
## 'Positive' Class : band
##
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 162
##
##
## | band.pred.2
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 41 | 21 | 62 |
## | 0.661 | 0.339 | 0.383 |
## | 0.820 | 0.188 | |
## | 0.253 | 0.130 | |
## -----------------------|-----------|-----------|-----------|
## noband | 9 | 91 | 100 |
## | 0.090 | 0.910 | 0.617 |
## | 0.180 | 0.812 | |
## | 0.056 | 0.562 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 50 | 112 | 162 |
## | 0.309 | 0.691 | |
## -----------------------|-----------|-----------|-----------|
##
##
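We build a neural network (NN) classifier in this case, providing the dataset free from missing values and using a single hidden layer with 2 units (the 87 weights reported below correspond to 41 inputs and 2 hidden units).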
## # weights: 87
## initial value 260.360533
## final value 259.203941
## converged
We then proceed to test the model on the test dataset. From it we obtain the misclassification table shown below. The model predicts “noband” for all 162 test observations, so none of the 62 “band” cases is detected; its accuracy of 0.617 merely equals the proportion of “noband” cases, and it performs worse than randomForest at predicting banding.
## predicted
## true noband
## band 62
## noband 100
The misclassification table can also be obtained by using the gmodels library:
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 162
##
##
## | predicted
## data$band_type[-train] | noband | Row Total |
## -----------------------|-----------|-----------|
## band | 62 | 62 |
## | 0.383 | |
## -----------------------|-----------|-----------|
## noband | 100 | 100 |
## | 0.617 | |
## -----------------------|-----------|-----------|
## Column Total | 162 | 162 |
## -----------------------|-----------|-----------|
##
##
We can now list the relative importance of the variables using a sourced function (courtesy: https://beckmw.wordpress.com/2013/08/12/variable-importance-in-neural-networks/)
## rel.imp x.names
## press -1.000000000 press
## solvent_pct -0.834810109 solvent_pct
## press_typeWoodHoe70 -0.730755732 press_typeWoodHoe70
## grain_screenedyes -0.526956828 grain_screenedyes
## solvent_typemissing -0.498624082 solvent_typemissing
## press_typeMotter70 -0.436915259 press_typeMotter70
## direct_steamyes -0.388076866 direct_steamyes
## paper_typeuncoated -0.339440245 paper_typeuncoated
## cylinder_sizespiegel -0.309955529 cylinder_sizespiegel
## ink_pct -0.255724903 ink_pct
## locationscandanavian -0.250135003 locationscandanavian
## wax -0.109553336 wax
## ink_temperature -0.086142170 ink_temperature
## roughness -0.054880847 roughness
## proof_inkyes -0.049629978 proof_inkyes
## caliper 0.000000000 caliper
## cylinder_typeno 0.001026903 cylinder_typeno
## blade_pressure 0.004262042 blade_pressure
## cylinder_sizetabloid 0.008538332 cylinder_sizetabloid
## locationmissing 0.010877133 locationmissing
## press_speed 0.063965114 press_speed
## current_density 0.076737726 current_density
## cylinder_sizemissing 0.120028831 cylinder_sizemissing
## locationUSA 0.130925774 locationUSA
## humidity 0.143374755 humidity
## locationmideuropean 0.152655043 locationmideuropean
## ink_typecover 0.213428644 ink_typecover
## solvent_typenaptha 0.222300942 solvent_typenaptha
## press_typeMotter94 0.269144105 press_typeMotter94
## proof_cut 0.276827428 proof_cut
## ink_typeuncoated 0.301945001 ink_typeuncoated
## hardener 0.325612844 hardener
## solvent_typexylol 0.335009201 solvent_typexylol
## proof_inkno 0.423948155 proof_inkno
## cylinder_typeyes 0.694660055 cylinder_typeyes
## anode_space_ratio 0.707571780 anode_space_ratio
## grain_screenedno 0.710663466 grain_screenedno
## viscosity 0.736837240 viscosity
## roller_durometer 0.781942884 roller_durometer
## direct_steamno 0.791373930 direct_steamno
## paper_typesuper 0.866745902 paper_typesuper
The bar plot tells us that the variables press and paper_type (super) have the strongest negative and positive relationships, respectively, with the response variable band_type.
In decreasing order of importance:
We consider the use of a decision tree, modelled using the rpart function. To grow a large tree we set the complexity parameter (cp) to a very small value.
The output lists all the subtrees considered in the model, giving the complexity parameter, the number of splits, the resubstitution error (rel error), the cross-validated error (xerror) and its standard error (xstd). These errors are expressed relative to the root node error.
From the complexity table we observe that the lowest cross-validated error, about 0.58, occurs at 12 splits (a tree with 13 terminal nodes).
To reduce the tree size we can prune the tree and use repeated cross-validation. The pruned tree summarised below (cp = 0.05) has 4 splits, with a resubstitution error of about 0.59 and a cross-validated error of about 0.74.
This again is not a good predictor due to the large error rate.
##
## Classification tree:
## rpart(formula = data$band_type ~ ., data = data, method = "class",
## control = rpart.control(minsplit = 4, cp = 1e-05))
##
## Variables actually used in tree construction:
## [1] anode_space_ratio blade_pressure caliper
## [4] current_density cylinder_type grain_screened
## [7] hardener humidity ink_pct
## [10] ink_temperature ink_type location
## [13] paper_type press press_speed
## [16] press_type proof_cut proof_ink
## [19] roller_durometer roughness solvent_pct
## [22] solvent_type viscosity wax
##
## Root node error: 228/540 = 0.42222
##
## n= 540
##
## CP nsplit rel error xerror xstd
## 1 0.2412281 0 1.000000 1.00000 0.050340
## 2 0.0526316 1 0.758772 0.75877 0.047558
## 3 0.0285088 4 0.592105 0.73684 0.047184
## 4 0.0219298 8 0.469298 0.62719 0.044971
## 5 0.0175439 10 0.425439 0.58333 0.043913
## 6 0.0131579 12 0.390351 0.57895 0.043801
## 7 0.0087719 15 0.350877 0.61842 0.044768
## 8 0.0073099 29 0.228070 0.61842 0.044768
## 9 0.0065789 33 0.197368 0.59649 0.044241
## 10 0.0054825 40 0.149123 0.59649 0.044241
## 11 0.0043860 44 0.127193 0.59649 0.044241
## 12 0.0014620 62 0.048246 0.61404 0.044664
## 13 0.0000100 65 0.043860 0.61842 0.044768
## Call:
## rpart(formula = data$band_type ~ ., data = data, method = "class",
## control = rpart.control(minsplit = 4, cp = 1e-05))
## n= 540
##
## CP nsplit rel error xerror xstd
## 1 0.24122807 0 1.0000000 1.0000000 0.05033997
## 2 0.05263158 1 0.7587719 0.7587719 0.04755808
## 3 0.05000000 4 0.5921053 0.7368421 0.04718396
##
## Variable importance
## solvent_type proof_ink grain_screened direct_steam paper_type
## 23 21 12 10 9
## press_type humidity press_speed press cylinder_size
## 7 5 4 3 2
## roughness
## 2
##
## Node number 1: 540 observations, complexity param=0.2412281
## predicted class=noband expected loss=0.4222222 P(node) =1
## class counts: 228 312
## probabilities: 0.422 0.578
## left son=2 (55 obs) right son=3 (485 obs)
## Primary splits:
## solvent_type splits as RLRR, improve=40.88522, (0 missing)
## proof_ink splits as LRR, improve=37.53662, (0 missing)
## grain_screened splits as LRR, improve=26.53113, (0 missing)
## location splits as RRLRR, improve=23.53878, (0 missing)
## press_speed < 2184.5 to the left, improve=17.60284, (0 missing)
## Surrogate splits:
## proof_ink splits as LRR, agree=0.996, adj=0.964, (0 split)
## grain_screened splits as LRR, agree=0.952, adj=0.527, (0 split)
## direct_steam splits as LRR, agree=0.944, adj=0.455, (0 split)
## paper_type splits as RLR, agree=0.941, adj=0.418, (0 split)
## cylinder_size splits as RLRR, agree=0.904, adj=0.055, (0 split)
##
## Node number 2: 55 observations
## predicted class=band expected loss=0 P(node) =0.1018519
## class counts: 55 0
## probabilities: 1.000 0.000
##
## Node number 3: 485 observations, complexity param=0.05263158
## predicted class=noband expected loss=0.356701 P(node) =0.8981481
## class counts: 173 312
## probabilities: 0.357 0.643
## left son=6 (167 obs) right son=7 (318 obs)
## Primary splits:
## press_type splits as RRRL, improve=13.743870, (0 missing)
## press_speed < 2184.5 to the left, improve=13.282210, (0 missing)
## press < 822.5 to the left, improve=13.017350, (0 missing)
## ink_type splits as RLL, improve=10.840790, (0 missing)
## grain_screened splits as LRL, improve= 9.577898, (0 missing)
## Surrogate splits:
## press < 818.5 to the left, agree=0.777, adj=0.353, (0 split)
## roughness < 0.53125 to the left, agree=0.753, adj=0.281, (0 split)
## cylinder_size splits as R-LR, agree=0.697, adj=0.120, (0 split)
## humidity < 84.5 to the right, agree=0.676, adj=0.060, (0 split)
## ink_pct < 64 to the right, agree=0.672, adj=0.048, (0 split)
##
## Node number 6: 167 observations, complexity param=0.05263158
## predicted class=band expected loss=0.4790419 P(node) =0.3092593
## class counts: 87 80
## probabilities: 0.521 0.479
## left son=12 (46 obs) right son=13 (121 obs)
## Primary splits:
## press_speed < 1678 to the left, improve=7.308378, (0 missing)
## ink_pct < 62.9 to the right, improve=5.404575, (0 missing)
## cylinder_type splits as LLR, improve=5.085740, (0 missing)
## humidity < 73.5 to the left, improve=5.077850, (0 missing)
## location splits as RLLRR, improve=4.147136, (0 missing)
## Surrogate splits:
## solvent_type splits as R--L, agree=0.749, adj=0.087, (0 split)
## location splits as RLRRR, agree=0.743, adj=0.065, (0 split)
## blade_pressure < 24.5 to the left, agree=0.743, adj=0.065, (0 split)
## ink_temperature < 19.05 to the right, agree=0.737, adj=0.043, (0 split)
## roller_durometer < 29 to the left, agree=0.737, adj=0.043, (0 split)
##
## Node number 7: 318 observations
## predicted class=noband expected loss=0.2704403 P(node) =0.5888889
## class counts: 86 232
## probabilities: 0.270 0.730
##
## Node number 12: 46 observations
## predicted class=band expected loss=0.2391304 P(node) =0.08518519
## class counts: 35 11
## probabilities: 0.761 0.239
##
## Node number 13: 121 observations, complexity param=0.05263158
## predicted class=noband expected loss=0.4297521 P(node) =0.2240741
## class counts: 52 69
## probabilities: 0.430 0.570
## left son=26 (22 obs) right son=27 (99 obs)
## Primary splits:
## humidity < 75.5 to the left, improve=8.113866, (0 missing)
## ink_pct < 62.9 to the right, improve=6.323642, (0 missing)
## location splits as RLLRR, improve=3.867792, (0 missing)
## press_speed < 2184.5 to the left, improve=3.734125, (0 missing)
## anode_space_ratio < 98.485 to the left, improve=3.617543, (0 missing)
## Surrogate splits:
## press_speed < 2450 to the right, agree=0.835, adj=0.091, (0 split)
## wax < 0.85 to the left, agree=0.826, adj=0.045, (0 split)
## hardener < 0.25 to the left, agree=0.826, adj=0.045, (0 split)
##
## Node number 26: 22 observations
## predicted class=band expected loss=0.1818182 P(node) =0.04074074
## class counts: 18 4
## probabilities: 0.818 0.182
##
## Node number 27: 99 observations
## predicted class=noband expected loss=0.3434343 P(node) =0.1833333
## class counts: 34 65
## probabilities: 0.343 0.657
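The complexity table at the start of the rpart output above can also be analysed programmatically. The sketch below re-enters the printed CP, nsplit, xerror and xstd columns as a plain data frame, so it does not need the fitted tree. It picks the row with the smallest cross-validated error and, as an alternative, applies the one-standard-error rule. That rule is a common convention for choosing a smaller tree; it is an assumption here, not necessarily the procedure used in this report.

```r
# cp table values copied from the printcp() output above.
cp_table <- data.frame(
  CP     = c(0.2412281, 0.0526316, 0.0285088, 0.0219298, 0.0175439,
             0.0131579, 0.0087719, 0.0073099, 0.0065789, 0.0054825,
             0.0043860, 0.0014620, 0.0000100),
  nsplit = c(0, 1, 4, 8, 10, 12, 15, 29, 33, 40, 44, 62, 65),
  xerror = c(1.00000, 0.75877, 0.73684, 0.62719, 0.58333, 0.57895,
             0.61842, 0.61842, 0.59649, 0.59649, 0.59649, 0.61404,
             0.61842),
  xstd   = c(0.050340, 0.047558, 0.047184, 0.044971, 0.043913,
             0.043801, 0.044768, 0.044768, 0.044241, 0.044241,
             0.044241, 0.044664, 0.044768)
)
root_error <- 228 / 540  # root node error reported by rpart

# Row with the smallest cross-validated error.
best <- which.min(cp_table$xerror)

# One-standard-error rule: the smallest tree whose xerror lies within
# one xstd of the minimum.
threshold <- cp_table$xerror[best] + cp_table$xstd[best]
one_se <- min(which(cp_table$xerror <= threshold))

cp_table[c(best, one_se), ]

# Convert the relative xerror into an absolute misclassification rate.
abs_error <- cp_table$xerror[best] * root_error
abs_error
```

With these values the minimum falls on the row with 12 splits and the one-standard-error rule selects the row with 10 splits. Multiplying by the root node error turns the relative error of about 0.58 into an absolute cross-validated error rate of roughly 24%.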
The variable importance table is given below:
## Overall
## anode_space_ratio 53.338911
## blade_pressure 23.201752
## caliper 15.764260
## current_density 8.377432
## cylinder_size 22.520388
## cylinder_type 28.803180
## grain_screened 52.853813
## hardener 27.349880
## humidity 50.534641
## ink_pct 53.487118
## ink_temperature 38.678349
## ink_type 31.431536
## location 63.559683
## paper_type 2.000000
## press 43.563079
## press_speed 89.986305
## press_type 29.891329
## proof_cut 47.383511
## proof_ink 39.636624
## roller_durometer 19.512351
## roughness 19.240835
## solvent_pct 40.116320
## solvent_type 40.885223
## viscosity 59.128456
## wax 22.961624
## direct_steam 0.000000
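In decreasing order of importance, the leading variables in the table above are press_speed (90.0), location (63.6), viscosity (59.1), ink_pct (53.5), anode_space_ratio (53.3), grain_screened (52.9) and humidity (50.5).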
We consider the use of support vector machines (SVM) to build a classifier and also to extract a feature importance list.
First we use a cost of 10 (slightly large), which penalises misclassified training points more heavily and therefore yields a narrower margin.
##
## Call:
## svm(formula = data[train, ]$band_type ~ ., data = data[train,
## ], method = "C-classification", kernel = "radial", cost = 10,
## gamma = 0.1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10
##
## Number of Support Vectors: 335
##
## ( 184 151 )
##
##
## Number of Classes: 2
##
## Levels:
## band noband
## [1] 1 4 5 6 8 10 13 14 17 18 20 22 23 26 28 29 30
## [18] 31 33 34 36 38 40 41 43 44 45 46 47 48 50 52 53 54
## [35] 59 61 63 65 68 71 73 75 76 80 81 83 84 87 88 89 90
## [52] 92 93 94 96 97 101 102 104 105 106 108 110 118 121 123 125 129
## [69] 130 132 135 136 137 138 139 141 142 144 145 146 148 149 151 152 155
## [86] 157 160 161 162 165 167 168 173 176 177 178 180 181 185 186 191 192
## [103] 198 199 200 201 202 208 214 217 218 222 223 224 227 230 231 233 235
## [120] 236 240 243 244 246 247 251 252 254 257 263 264 266 267 271 274 279
## [137] 281 283 284 285 286 288 292 294 295 296 297 299 302 303 305 306 312
## [154] 313 314 315 317 321 323 327 329 330 336 337 339 342 343 345 346 350
## [171] 353 354 356 359 361 362 366 370 371 372 373 374 375 377 2 9 11
## [188] 12 15 19 21 24 25 27 35 39 49 57 58 62 64 66 69 70
## [205] 72 74 77 78 79 82 85 86 91 95 98 99 100 103 107 112 113
## [222] 115 117 119 120 122 126 127 128 131 133 134 143 147 150 153 158 159
## [239] 163 166 169 170 171 172 174 175 182 184 187 188 189 194 197 203 204
## [256] 205 206 207 209 210 211 212 213 215 216 219 220 221 225 229 232 234
## [273] 237 238 241 242 245 253 255 256 258 259 260 261 262 265 269 270 272
## [290] 273 275 277 278 280 282 287 289 290 291 293 298 300 301 304 307 308
## [307] 309 310 311 318 319 320 324 325 326 328 334 335 338 341 344 347 348
## [324] 351 352 355 357 358 360 363 364 365 367 368 369
Prediction using SVM: This gives an accuracy of 86.4% with kappa of 0.7, which is a very good prediction rate.
## Confusion Matrix and Statistics
##
## Reference
## Prediction band noband
## band 48 8
## noband 14 92
##
## Accuracy : 0.8642
## 95% CI : (0.8016, 0.9129)
## No Information Rate : 0.6173
## P-Value [Acc > NIR] : 3.346e-12
##
## Kappa : 0.7072
##
## Mcnemar's Test P-Value : 0.2864
##
## Sensitivity : 0.7742
## Specificity : 0.9200
## Pos Pred Value : 0.8571
## Neg Pred Value : 0.8679
## Prevalence : 0.3827
## Detection Rate : 0.2963
## Detection Prevalence : 0.3457
## Balanced Accuracy : 0.8471
##
## 'Positive' Class : band
##
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 162
##
##
## | svm.pred.2
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 48 | 14 | 62 |
## | 0.774 | 0.226 | 0.383 |
## | 0.857 | 0.132 | |
## | 0.296 | 0.086 | |
## -----------------------|-----------|-----------|-----------|
## noband | 8 | 92 | 100 |
## | 0.080 | 0.920 | 0.617 |
## | 0.143 | 0.868 | |
## | 0.049 | 0.568 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 56 | 106 | 162 |
## | 0.346 | 0.654 | |
## -----------------------|-----------|-----------|-----------|
##
##
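The accuracy, kappa, sensitivity and specificity reported above follow directly from the four cells of the confusion matrix. The sketch below re-enters the printed counts and recomputes the statistics in base R; the variable names are illustrative.

```r
# Counts copied from the SVM confusion matrix above
# (rows = prediction, columns = reference).
cm <- matrix(c(48, 14, 8, 92), nrow = 2,
             dimnames = list(prediction = c("band", "noband"),
                             reference  = c("band", "noband")))

n <- sum(cm)
observed <- sum(diag(cm)) / n                      # accuracy
expected <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
cohen_kappa <- (observed - expected) / (1 - expected)

sensitivity <- cm["band", "band"] / sum(cm[, "band"])        # band is the positive class
specificity <- cm["noband", "noband"] / sum(cm[, "noband"])

round(c(accuracy = observed, kappa = cohen_kappa,
        sensitivity = sensitivity, specificity = specificity), 4)
```

Kappa corrects the observed accuracy for the agreement that would be expected by chance given the row and column totals, which is why it is lower than the raw accuracy.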
The variable importance matrix (VIM) is not implemented in caret for “svm”; we therefore use the rminer package to obtain it.
In our second method, we first remove the rows in which more than 20% of the values are missing, and consider only the remaining rows for our analysis.
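The row filter described above can be sketched in base R as follows. The data frame `toy` and its columns are hypothetical stand-ins, not the report's data, and the subsequent imputation step is not shown.

```r
# Toy data frame with missing values (hypothetical; not the report's data).
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", NA, "z", "w"),
  c = c(0.5, NA, 1.5, 2.5),
  d = c(10, 20, NA, 40),
  e = c(TRUE, FALSE, TRUE, NA),
  f = c(7, 8, 9, 10)
)

# Fraction of missing values in each row.
missing_frac <- rowMeans(is.na(toy))

# Drop rows with more than 20% missing values, keep the rest.
kept <- toy[missing_frac <= 0.20, , drop = FALSE]
kept
```

Here rows 2 and 4 have three and two of six values missing and are dropped, while row 3, with one missing value, is kept for imputation.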
Then, we use the K-Nearest Neighbor function (KNN) for imputation of missing values.
We first generate the training and test datasets based on a random sample. We use 70% of the supplied dataset as training set, and the remaining 30% as a test dataset.
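A 70/30 random split of this kind can be sketched with `sample()`. The seed and the use of `floor()` are assumptions made for illustration. The row count of 485 is the size reported for this dataset in the rpart output later in this section.

```r
set.seed(1)  # arbitrary seed, assumed only for reproducibility

n <- 485                                   # rows in the filtered dataset
train <- sample(n, size = floor(0.7 * n))  # row indices of the training set
test  <- setdiff(seq_len(n), train)        # remaining rows form the test set

length(train)  # 339
length(test)   # 146
```

These sizes agree with the 339 training samples and 146 test observations that appear in the outputs of this section. An index vector like `train` is then used as `data[train, ]` for fitting and `data[-train, ]` for testing.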
We next build the random forest with the randomForest library on the training dataset. Since the number of variables is 39 in this case, we use mtry = 6 (approximately √39), so that 6 randomly chosen variables are tried at every split.
##
## Call:
## randomForest(formula = form, data = data[train, ], ntree = 1000, mtry = 6, importance = TRUE, localImp = TRUE, replace = FALSE, na.action = na.roughfix)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 25.07%
## Confusion matrix:
## band noband class.error
## band 59 66 0.52800000
## noband 19 195 0.08878505
From the results, we observe the out-of-bag error rate to be 25.07%.
## OOB band noband
## [1,] 0.3710 0.5106 0.2857
## [2,] 0.3969 0.5373 0.3228
## [3,] 0.3701 0.4624 0.3168
## [4,] 0.3298 0.4216 0.2778
## [5,] 0.3355 0.4775 0.2526
## [6,] 0.3691 0.5085 0.2864
## [7,] 0.3364 0.4417 0.2745
## [8,] 0.3587 0.4833 0.2871
## [9,] 0.3333 0.4553 0.2619
## [10,] 0.3552 0.4839 0.2796
## [11,] 0.3620 0.5403 0.2582
## [12,] 0.3343 0.5040 0.2347
## [13,] 0.3156 0.4960 0.2103
## [14,] 0.3304 0.5120 0.2243
## [15,] 0.3304 0.5280 0.2150
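The overall out-of-bag error and the per-class errors in the random forest summary above follow directly from its confusion matrix. The sketch below re-enters the printed counts and recomputes them in base R.

```r
# OOB confusion matrix copied from the randomForest output above
# (rows = true class, columns = out-of-bag prediction).
oob <- matrix(c(59, 19, 66, 195), nrow = 2,
              dimnames = list(true      = c("band", "noband"),
                              predicted = c("band", "noband")))

class_error <- 1 - diag(oob) / rowSums(oob)   # error within each true class
oob_error   <- 1 - sum(diag(oob)) / sum(oob)  # overall OOB error rate

round(class_error, 4)
round(100 * oob_error, 2)
```

The overall rate of 25.07% hides a large difference between classes: more than half of the true band cases are misclassified, against fewer than 9% of the noband cases.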
The variable importance is shown in the plot, together with the error rates after each of the first 15 trees and the corresponding error plot. This error plot shows the change in error rate as more trees are added to the forest.
## band noband class.error
## band 59 66 0.52800000
## noband 19 195 0.08878505
In this case, the variables that are >6 in the Gini Index (and therefore most important) are:
We now proceed to predict using the test data set and create the confusion matrix for the same.
## pred
## true band noband
## band 32 16
## noband 6 92
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 146
##
##
## | band.pred
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 32 | 16 | 48 |
## | 0.667 | 0.333 | 0.329 |
## | 0.842 | 0.148 | |
## | 0.219 | 0.110 | |
## -----------------------|-----------|-----------|-----------|
## noband | 6 | 92 | 98 |
## | 0.061 | 0.939 | 0.671 |
## | 0.158 | 0.852 | |
## | 0.041 | 0.630 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 38 | 108 | 146 |
## | 0.260 | 0.740 | |
## -----------------------|-----------|-----------|-----------|
##
##
We now use repeated cross-validation to see if we get better classification performance. The output below shows the results of tuning the random forest with 10-fold cross-validation repeated 10 times, followed by the confusion matrix and cross-table on the test set. The selected model uses mtry = 36, with a cross-validated accuracy of 0.75 (kappa 0.40); on the test set it reaches an accuracy of 0.85 and a kappa of 0.64.
## model parameter label forReg forClass probModel
## 1 rf mtry #Randomly Selected Predictors TRUE TRUE TRUE
## user system elapsed
## 119.02 0.82 129.08
## Random Forest
##
## 339 samples
## 27 predictor
## 2 classes: 'band', 'noband'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 305, 305, 305, 305, 305, 306, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7083957 0.2747511
## 19 0.7436519 0.3996211
## 36 0.7453929 0.4037285
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 36.
## Confusion Matrix and Statistics
##
## Reference
## Prediction band noband
## band 33 7
## noband 15 91
##
## Accuracy : 0.8493
## 95% CI : (0.7808, 0.9031)
## No Information Rate : 0.6712
## P-Value [Acc > NIR] : 8.551e-07
##
## Kappa : 0.6434
##
## Mcnemar's Test P-Value : 0.1356
##
## Sensitivity : 0.6875
## Specificity : 0.9286
## Pos Pred Value : 0.8250
## Neg Pred Value : 0.8585
## Prevalence : 0.3288
## Detection Rate : 0.2260
## Detection Prevalence : 0.2740
## Balanced Accuracy : 0.8080
##
## 'Positive' Class : band
##
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 146
##
##
## | band.pred.1
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 33 | 15 | 48 |
## | 0.688 | 0.312 | 0.329 |
## | 0.825 | 0.142 | |
## | 0.226 | 0.103 | |
## -----------------------|-----------|-----------|-----------|
## noband | 7 | 91 | 98 |
## | 0.071 | 0.929 | 0.671 |
## | 0.175 | 0.858 | |
## | 0.048 | 0.623 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 40 | 106 | 146 |
## | 0.274 | 0.726 | |
## -----------------------|-----------|-----------|-----------|
##
##
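The resampling scheme used above (10 folds, repeated 10 times) can be illustrated with plain fold assignments. This is a simplified sketch: the helper `make_folds` is hypothetical and the folds are not stratified by class, whereas caret stratifies its folds on the outcome.

```r
set.seed(1)  # arbitrary seed (assumption)

n_obs   <- 339  # training samples reported in the output above
k       <- 10   # folds
repeats <- 10   # repetitions

# One random fold assignment: a shuffled vector of fold labels 1..k.
make_folds <- function(n, k) sample(rep(seq_len(k), length.out = n))

# Matrix with one column of fold labels per repeat.
fold_ids <- replicate(repeats, make_folds(n_obs, k))

# Training-set size when fold j of the first repeat is held out.
train_sizes <- sapply(seq_len(k), function(j) sum(fold_ids[, 1] != j))
train_sizes
```

Holding out one fold of 33 or 34 observations leaves 305 or 306 for training, which matches the sample sizes listed in the resampling summary. Each tuning value of mtry is then evaluated on all 100 held-out folds and the results are averaged.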
For a large class imbalance (which is not the case in our example, as the band and noband classes are reasonably balanced), the train function can subsample the dataset to balance the classes before model fitting. This is done with the option sampling = “down”, passed through trainControl.
The corresponding confusion matrix and cross-table are shown below. Down-sampling raises the sensitivity for band from 0.69 to 0.81 but lowers the overall test accuracy from 0.85 to 0.77.
## user system elapsed
## 76.46 0.34 77.48
## Random Forest
##
## 339 samples
## 27 predictor
## 2 classes: 'band', 'noband'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 305, 305, 305, 304, 306, 305, ...
## Addtional sampling using down-sampling
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7007341 0.3977763
## 19 0.6953435 0.3733094
## 36 0.7013952 0.3846407
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 36.
## Confusion Matrix and Statistics
##
## Reference
## Prediction band noband
## band 39 25
## noband 9 73
##
## Accuracy : 0.7671
## 95% CI : (0.6901, 0.833)
## No Information Rate : 0.6712
## P-Value [Acc > NIR] : 0.007421
##
## Kappa : 0.5137
##
## Mcnemar's Test P-Value : 0.010097
##
## Sensitivity : 0.8125
## Specificity : 0.7449
## Pos Pred Value : 0.6094
## Neg Pred Value : 0.8902
## Prevalence : 0.3288
## Detection Rate : 0.2671
## Detection Prevalence : 0.4384
## Balanced Accuracy : 0.7787
##
## 'Positive' Class : band
##
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 146
##
##
## | band.pred.2
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 39 | 9 | 48 |
## | 0.812 | 0.188 | 0.329 |
## | 0.609 | 0.110 | |
## | 0.267 | 0.062 | |
## -----------------------|-----------|-----------|-----------|
## noband | 25 | 73 | 98 |
## | 0.255 | 0.745 | 0.671 |
## | 0.391 | 0.890 | |
## | 0.171 | 0.500 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 64 | 82 | 146 |
## | 0.438 | 0.562 | |
## -----------------------|-----------|-----------|-----------|
##
##
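The idea behind the down-sampling used above can be sketched in base R: every class is reduced to the size of the smallest one by random selection. The label vector below is a toy stand-in that only reproduces the class counts of the training set (125 band, 214 noband), and caret's own implementation may differ in detail.

```r
set.seed(1)  # arbitrary seed (assumption)

# Toy labels with the class counts of the training set reported above;
# the predictors are omitted.
y <- factor(rep(c("band", "noband"), times = c(125, 214)))

# Size of the smallest class.
n_min <- min(table(y))

# Keep n_min randomly chosen row indices from every class.
keep <- unlist(lapply(split(seq_along(y), y), function(idx) {
  idx[sample.int(length(idx), n_min)]
}))

table(y[keep])
```

After this step both classes contain 125 observations, at the cost of discarding 89 noband rows. This loss of training data is one plausible reason why the overall accuracy drops.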
We build a neural network (NN) classifier on the dataset free from missing values, using a single hidden layer with 2 units.
## # weights: 77
## initial value 231.694164
## final value 223.156122
## converged
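We then test the model on the test dataset and obtain the misclassification table shown below. As in the first method, the network predicts noband for every one of the 146 test observations, so none of the 48 banding cases is identified, and the accuracy (98/146 ≈ 67.1%) equals the no-information rate. This model therefore performs worse than randomForest for predicting “banding”.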
## predicted
## true noband
## band 48
## noband 98
The misclassification table can also be obtained by using the gmodels library:
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 146
##
##
## | predicted
## data$band_type[-train] | noband | Row Total |
## -----------------------|-----------|-----------|
## band | 48 | 48 |
## | 0.329 | |
## -----------------------|-----------|-----------|
## noband | 98 | 98 |
## | 0.671 | |
## -----------------------|-----------|-----------|
## Column Total | 146 | 146 |
## -----------------------|-----------|-----------|
##
##
We can now list the relative importance of the variables using a sourced function (courtesy: https://beckmw.wordpress.com/2013/08/12/variable-importance-in-neural-networks/).
## rel.imp x.names
## locationmideuropean -1.00000000 locationmideuropean
## solvent_pct -0.81424418 solvent_pct
## locationsouthus -0.71712401 locationsouthus
## wax -0.64254745 wax
## locationscandanavian -0.62460938 locationscandanavian
## roughness -0.60141152 roughness
## solvent_typeNAPTHA -0.55580818 solvent_typeNAPTHA
## cylinder_sizespiegel -0.40198168 cylinder_sizespiegel
## locationnorthus -0.38447601 locationnorthus
## ink_typecover -0.27078375 ink_typecover
## solvent_typeXYLOL -0.26878627 solvent_typeXYLOL
## cylinder_typeyes -0.19506011 cylinder_typeyes
## current_density -0.17749519 current_density
## paper_typeuncoated -0.15372813 paper_typeuncoated
## paper_typesuper -0.15337064 paper_typesuper
## proof_cut -0.10804749 proof_cut
## caliper -0.07973444 caliper
## direct_steamyes -0.07343444 direct_steamyes
## hardener -0.02488372 hardener
## cylinder_sizetabloid 0.00000000 cylinder_sizetabloid
## press_typeMotter70 0.05438288 press_typeMotter70
## press_typeMotter94 0.12215403 press_typeMotter94
## ink_temperature 0.15013680 ink_temperature
## anode_space_ratio 0.15294915 anode_space_ratio
## press_typeWoodHoe70 0.15561081 press_typeWoodHoe70
## grain_screenedYES 0.22613986 grain_screenedYES
## roller_durometer 0.31684380 roller_durometer
## viscosity 0.37384575 viscosity
## ink_typeuncoated 0.41528266 ink_typeuncoated
## press_speed 0.48484925 press_speed
## varnish_pct 0.55720107 varnish_pct
## humidity 0.56215695 humidity
## ink_pct 0.57964732 ink_pct
## blade_pressure 0.64043328 blade_pressure
## press 0.81796317 press
## proof_inkYES 0.95849499 proof_inkYES
The bar plot tells us that the variables location (mideuropean) and proof_ink (yes) have the strongest negative and positive relationships, respectively, with the response variable band_type.
In decreasing order of importance:
We consider the use of a decision tree, modelled using the rpart function. To grow a large tree we set the complexity parameter (cp) to a very small value.
The output lists all the subtrees considered in the model, giving the complexity parameter, the number of splits, the resubstitution error (rel error), the cross-validated error (xerror) and its standard error (xstd). These errors are expressed relative to the root node error.
From the complexity table we observe that the lowest cross-validated error, about 0.80, is first reached at 24 splits (a tree with 25 terminal nodes).
To reduce the tree size we can prune the tree and use repeated cross-validation. The pruned tree summarised below (cp = 0.05) has 3 splits, with a resubstitution error of about 0.78 and a cross-validated error of about 0.90.
This again is not a good predictor due to the large error rate.
##
## Classification tree:
## rpart(formula = data$band_type ~ ., data = data, method = "class",
## control = rpart.control(minsplit = 4, cp = 1e-05))
##
## Variables actually used in tree construction:
## [1] anode_space_ratio blade_pressure caliper
## [4] cylinder_size cylinder_type grain_screened
## [7] hardener humidity ink_pct
## [10] ink_temperature ink_type location
## [13] paper_type press press_speed
## [16] press_type proof_cut proof_ink
## [19] roller_durometer roughness solvent_pct
## [22] varnish_pct viscosity wax
##
## Root node error: 173/485 = 0.3567
##
## n= 485
##
## CP nsplit rel error xerror xstd
## 1 0.0693642 0 1.000000 1.00000 0.060979
## 2 0.0375723 3 0.780347 0.90173 0.059463
## 3 0.0289017 7 0.618497 0.94220 0.060132
## 4 0.0231214 8 0.589595 0.84971 0.058506
## 5 0.0173410 9 0.566474 0.86705 0.058837
## 6 0.0144509 15 0.462428 0.84971 0.058506
## 7 0.0115607 17 0.433526 0.84971 0.058506
## 8 0.0096339 24 0.352601 0.80347 0.057561
## 9 0.0086705 28 0.312139 0.81503 0.057806
## 10 0.0077071 39 0.208092 0.82659 0.058045
## 11 0.0057803 42 0.184971 0.83237 0.058162
## 12 0.0038536 61 0.075145 0.80347 0.057561
## 13 0.0028902 64 0.063584 0.80347 0.057561
## 14 0.0000100 70 0.046243 0.81503 0.057806
## Call:
## rpart(formula = data$band_type ~ ., data = data, method = "class",
## control = rpart.control(minsplit = 4, cp = 1e-05))
## n= 485
##
## CP nsplit rel error xerror xstd
## 1 0.06936416 0 1.0000000 1.0000000 0.06097943
## 2 0.05000000 3 0.7803468 0.9017341 0.05946252
##
## Variable importance
## press_type humidity press_speed press
## 31 20 18 11
## roughness cylinder_size ink_pct solvent_type
## 9 4 1 1
## location hardener wax blade_pressure
## 1 1 1 1
## ink_temperature roller_durometer
## 1 1
##
## Node number 1: 485 observations, complexity param=0.06936416
## predicted class=noband expected loss=0.356701 P(node) =1
## class counts: 173 312
## probabilities: 0.357 0.643
## left son=2 (167 obs) right son=3 (318 obs)
## Primary splits:
## press_type splits as RRRL, improve=13.743870, (0 missing)
## press_speed < 2184.5 to the left, improve=13.282210, (0 missing)
## press < 822.5 to the left, improve=13.017350, (0 missing)
## ink_type splits as RLL, improve=10.840790, (0 missing)
## grain_screened splits as RL, improve= 8.957201, (0 missing)
## Surrogate splits:
## press < 818.5 to the left, agree=0.777, adj=0.353, (0 split)
## roughness < 0.53125 to the left, agree=0.755, adj=0.287, (0 split)
## cylinder_size splits as RLR, agree=0.697, adj=0.120, (0 split)
## humidity < 84.5 to the right, agree=0.676, adj=0.060, (0 split)
## ink_pct < 64 to the right, agree=0.672, adj=0.048, (0 split)
##
## Node number 2: 167 observations, complexity param=0.06936416
## predicted class=band expected loss=0.4790419 P(node) =0.3443299
## class counts: 87 80
## probabilities: 0.521 0.479
## left son=4 (46 obs) right son=5 (121 obs)
## Primary splits:
## press_speed < 1678 to the left, improve=7.308378, (0 missing)
## ink_pct < 62.9 to the right, improve=5.404575, (0 missing)
## humidity < 73.5 to the left, improve=5.077850, (0 missing)
## cylinder_type splits as LR, improve=4.631613, (0 missing)
## roughness < 0.5625 to the right, improve=4.059462, (0 missing)
## Surrogate splits:
## solvent_type splits as R-L, agree=0.749, adj=0.087, (0 split)
## location splits as RLRR-, agree=0.743, adj=0.065, (0 split)
## ink_temperature < 19.05 to the right, agree=0.737, adj=0.043, (0 split)
## blade_pressure < 22.5 to the left, agree=0.737, adj=0.043, (0 split)
## roller_durometer < 29 to the left, agree=0.737, adj=0.043, (0 split)
##
## Node number 3: 318 observations
## predicted class=noband expected loss=0.2704403 P(node) =0.6556701
## class counts: 86 232
## probabilities: 0.270 0.730
##
## Node number 4: 46 observations
## predicted class=band expected loss=0.2391304 P(node) =0.09484536
## class counts: 35 11
## probabilities: 0.761 0.239
##
## Node number 5: 121 observations, complexity param=0.06936416
## predicted class=noband expected loss=0.4297521 P(node) =0.2494845
## class counts: 52 69
## probabilities: 0.430 0.570
## left son=10 (22 obs) right son=11 (99 obs)
## Primary splits:
## humidity < 75.5 to the left, improve=8.113866, (0 missing)
## ink_pct < 62.9 to the right, improve=6.323642, (0 missing)
## press_speed < 2184.5 to the left, improve=3.734125, (0 missing)
## anode_space_ratio < 98.485 to the left, improve=3.617543, (0 missing)
## wax < 1.725 to the left, improve=3.391992, (0 missing)
## Surrogate splits:
## press_speed < 2450 to the right, agree=0.835, adj=0.091, (0 split)
## wax < 0.85 to the left, agree=0.826, adj=0.045, (0 split)
## hardener < 0.25 to the left, agree=0.826, adj=0.045, (0 split)
##
## Node number 10: 22 observations
## predicted class=band expected loss=0.1818182 P(node) =0.04536082
## class counts: 18 4
## probabilities: 0.818 0.182
##
## Node number 11: 99 observations
## predicted class=noband expected loss=0.3434343 P(node) =0.2041237
## class counts: 34 65
## probabilities: 0.343 0.657
The variable importance table is given below:
## Overall
## anode_space_ratio 59.465712
## blade_pressure 41.265195
## caliper 16.286526
## current_density 14.078302
## cylinder_size 24.208212
## cylinder_type 15.785659
## grain_screened 21.538722
## hardener 20.936189
## humidity 42.792686
## ink_pct 55.397484
## ink_temperature 38.727212
## ink_type 29.660397
## location 27.500859
## paper_type 3.738095
## press 40.805786
## press_speed 76.145054
## press_type 30.664801
## proof_cut 40.656055
## proof_ink 5.491162
## roller_durometer 23.316414
## roughness 26.216762
## solvent_pct 38.299690
## varnish_pct 27.900896
## viscosity 45.424158
## wax 29.088331
## direct_steam 0.000000
## solvent_type 0.000000
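From this, we can list the most important variables as press_speed (76.1), anode_space_ratio (59.5), ink_pct (55.4), viscosity (45.4), humidity (42.8), blade_pressure (41.3), press (40.8) and proof_cut (40.7).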
We consider the use of support vector machines (SVM) to build a classifier and also to extract a feature importance list.
First we use a cost of 10 (slightly large), which penalises misclassified training points more heavily and therefore yields a narrower margin.
##
## Call:
## svm(formula = data[train, ]$band_type ~ ., data = data[train,
## ], method = "C-classification", kernel = "radial", cost = 10,
## gamma = 0.1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10
##
## Number of Support Vectors: 308
##
## ( 188 120 )
##
##
## Number of Classes: 2
##
## Levels:
## band noband
## [1] 1 4 5 6 10 13 14 17 19 22 24 27 28 32 33 34 36
## [18] 37 42 47 48 49 50 51 52 53 54 56 57 58 59 61 62 63
## [35] 64 66 68 70 71 73 74 76 77 79 82 85 87 88 89 92 94
## [52] 95 96 97 98 99 102 103 104 106 107 111 113 114 116 119 121 122
## [69] 124 128 130 133 135 136 138 141 142 143 144 146 148 150 151 153 154
## [86] 157 159 161 162 163 167 168 169 170 173 176 177 178 179 180 181 185
## [103] 187 190 191 193 197 198 199 200 201 202 203 204 206 207 209 210 213
## [120] 215 216 217 219 221 226 227 229 231 233 234 236 238 241 244 246 247
## [137] 248 250 251 252 254 255 256 258 259 260 262 268 270 272 273 274 276
## [154] 277 279 280 282 284 286 287 288 292 293 295 296 297 301 303 305 306
## [171] 307 308 313 315 316 317 318 319 320 321 322 325 326 331 332 333 337
## [188] 339 3 7 8 11 12 15 16 18 20 21 23 25 26 29 31 35
## [205] 38 39 41 43 44 45 55 65 69 72 78 80 81 83 84 90 93
## [222] 101 105 108 109 110 112 115 123 125 126 127 129 131 132 134 137 140
## [239] 145 149 152 155 160 164 166 171 172 174 175 184 186 188 192 195 196
## [256] 208 211 212 214 218 220 222 225 228 230 232 235 237 239 242 243 245
## [273] 249 253 257 261 263 264 265 266 267 269 271 275 278 281 283 285 289
## [290] 290 291 294 298 299 300 302 309 312 314 323 327 328 329 330 334 335
## [307] 336 338
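For reference, the role of the cost and gamma arguments in the call above can be written out. What follows is the standard textbook soft-margin formulation with a radial kernel, not something taken from the original analysis; the symbols $w$, $b$, $\xi_i$ and $\phi$ are the usual ones and are introduced here only for illustration. In the fit above, $C = 10$ and $\gamma = 0.1$.

```latex
\[
\min_{w,\,b,\,\xi}\; \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\bigl(w^{\top}\phi(x_i)+b\bigr)\ge 1-\xi_i,\qquad \xi_i\ge 0
\]
\[
K(x_i,x_j)=\phi(x_i)^{\top}\phi(x_j)=\exp\!\bigl(-\gamma\,\lVert x_i-x_j\rVert^{2}\bigr)
\]
```

A larger $C$ makes each slack $\xi_i$ more expensive, so fewer training points are allowed inside or beyond the margin; a larger $\gamma$ makes the kernel more local.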
Prediction using SVM: We find the accuracy to be 0.80 with a kappa of 0.52.
## Confusion Matrix and Statistics
##
## Reference
## Prediction band noband
## band 28 9
## noband 20 89
##
## Accuracy : 0.8014
## 95% CI : (0.7274, 0.8628)
## No Information Rate : 0.6712
## P-Value [Acc > NIR] : 0.0003511
##
## Kappa : 0.522
##
## Mcnemar's Test P-Value : 0.0633178
##
## Sensitivity : 0.5833
## Specificity : 0.9082
## Pos Pred Value : 0.7568
## Neg Pred Value : 0.8165
## Prevalence : 0.3288
## Detection Rate : 0.1918
## Detection Prevalence : 0.2534
## Balanced Accuracy : 0.7457
##
## 'Positive' Class : band
##
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 146
##
##
## | svm.pred.2
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 28 | 20 | 48 |
## | 0.583 | 0.417 | 0.329 |
## | 0.757 | 0.183 | |
## | 0.192 | 0.137 | |
## -----------------------|-----------|-----------|-----------|
## noband | 9 | 89 | 98 |
## | 0.092 | 0.908 | 0.671 |
## | 0.243 | 0.817 | |
## | 0.062 | 0.610 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 37 | 109 | 146 |
## | 0.253 | 0.747 | |
## -----------------------|-----------|-----------|-----------|
##
##
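The headline statistics in the output above can be reproduced by hand from the four cell counts. The following is a minimal base R sketch; the object names are chosen here for illustration and are not from the original code.

```r
# Confusion matrix from the SVM output above:
# rows = predicted class, columns = reference (true) class.
cm <- matrix(c(28, 20, 9, 89), nrow = 2,
             dimnames = list(Prediction = c("band", "noband"),
                             Reference  = c("band", "noband")))

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals.
p_chance    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - p_chance) / (1 - p_chance)

# "band" is the positive class, as in the output above.
sensitivity <- cm["band", "band"] / sum(cm[, "band"])
specificity <- cm["noband", "noband"] / sum(cm[, "noband"])

round(c(accuracy = accuracy, kappa = cohen_kappa,
        sensitivity = sensitivity, specificity = specificity), 4)
```

Kappa corrects the raw accuracy for the agreement that would be expected from the class totals alone, which is why it is lower than the accuracy.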
The variable importance matrix is not implemented in caret for “svm”; we therefore use the rminer package to obtain the VIM.
In this case, we do not remove any observations from the dataset and perform KNN imputation on the complete dataset.
Using the data, we proceed with creating correlation plots to see how the different variables are correlated. If the dataset has perfectly positively or negatively correlated attributes, then the performance of the model will be affected by a problem called “multicollinearity”. Multicollinearity occurs when one predictor variable in a multiple regression model can be linearly predicted from the others with a high degree of accuracy. This can lead to skewed or misleading results.
We find that the following correlations exist in the data, to a lesser degree:
This implies that we can remove “varnish_pct” from our training model to observe whether there is an improvement.
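A correlation screen of this kind can be sketched in base R as follows. The data here are synthetic: the column names are borrowed from the dataset only as labels, and the strong relationship between the first two columns is constructed for the demonstration and does not reflect the real measurements. The threshold of 0.8 is likewise an arbitrary choice.

```r
set.seed(1)

# Synthetic stand-in data; values are made up for illustration.
n <- 200
solvent_pct <- rnorm(n, mean = 40, sd = 5)
demo <- data.frame(
  solvent_pct = solvent_pct,
  varnish_pct = 60 - solvent_pct + rnorm(n, sd = 1),  # built to correlate
  humidity    = rnorm(n, mean = 75, sd = 8)
)

# Pairwise Pearson correlations between the numeric predictors.
cm <- cor(demo)

# Report each pair once (upper triangle) when |r| exceeds the threshold.
threshold <- 0.8
idx <- which(abs(cm) > threshold & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
high_pairs
```

Dropping one variable from each flagged pair is one simple way to reduce redundancy before fitting.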
We continue to ignore the 12 variables identified in the previous section for this model as well.
We first generate the training and test datasets based on a random sample. We use 70% of the supplied dataset as training set, and the remaining 30% as a test dataset.
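The split can be sketched as below. The row count of 540 and the resulting 378/162 split match the figures in the outputs of this section; the seed is arbitrary and the variable names are assumptions chosen to mirror the `data[train, ]` and `data$band_type[-train]` expressions visible in those outputs.

```r
set.seed(42)  # arbitrary seed, only so the split is reproducible

n_obs   <- 540                    # rows in the full dataset
n_train <- round(0.7 * n_obs)     # 378 training rows

# `train` holds row indices, so data[train, ] is the training set
# and data[-train, ] is the test set.
train <- sample(n_obs, n_train)
test  <- setdiff(seq_len(n_obs), train)

c(train = length(train), test = length(test))
```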
We next build the random forest using the randomForest library on the training data set. Since the number of variables is 39 in this case, we use mtry=6 (approximately the square root of 39); that is, 6 variables are tried at every split.
From the results, we observe the out-of-bag error rate to be 21.16%.
##
## Call:
## randomForest(formula = form, data = data[train, ], ntree = 1000, mtry = 6, importance = TRUE, localImp = TRUE, replace = FALSE, na.action = na.roughfix)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 21.16%
## Confusion matrix:
## band noband class.error
## band 114 52 0.3132530
## noband 28 184 0.1320755
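The out-of-bag figures above follow directly from the four counts in the confusion matrix. A short base R check, with object names chosen here for illustration:

```r
# OOB confusion matrix from the output above (rows = true class).
oob <- matrix(c(114, 28, 52, 184), nrow = 2,
              dimnames = list(true = c("band", "noband"),
                              pred = c("band", "noband")))

# Overall OOB error: off-diagonal counts over all training rows.
oob_error <- (sum(oob) - sum(diag(oob))) / sum(oob)

# Per-class error: misclassified share within each true class.
class_error <- 1 - diag(oob) / rowSums(oob)

round(100 * oob_error, 2)   # in percent
round(class_error, 4)

# Square-root rule of thumb for mtry with 39 variables.
floor(sqrt(39))
```

The band class is misclassified more than twice as often as the noband class, which the single overall error rate hides.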
The variable importance is shown in the plot below, followed by the error rates of the first 15 trees and the corresponding error plot. This error plot shows the change in error rate as more trees are added to the forest.
## OOB band noband
## [1,] 0.4101 0.3898 0.4250
## [2,] 0.4130 0.4151 0.4113
## [3,] 0.3916 0.4462 0.3462
## [4,] 0.3560 0.3741 0.3409
## [5,] 0.3684 0.3922 0.3492
## [6,] 0.3846 0.4395 0.3402
## [7,] 0.3573 0.4500 0.2836
## [8,] 0.3671 0.4472 0.3039
## [9,] 0.3659 0.4815 0.2754
## [10,] 0.3360 0.4451 0.2500
## [11,] 0.3244 0.4329 0.2392
## [12,] 0.3235 0.4303 0.2392
## [13,] 0.3404 0.4578 0.2476
## [14,] 0.3501 0.4880 0.2417
## [15,] 0.3210 0.4759 0.1991
In this case, the variables with a value greater than 8 in the Gini Index (and therefore the most important) are:
We now proceed to predict using the test data set and create the corresponding confusion matrix.
## band noband class.error
## band 114 52 0.3132530
## noband 28 184 0.1320755
## pred
## true band noband
## band 46 16
## noband 5 95
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 162
##
##
## | band.pred
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 46 | 16 | 62 |
## | 0.742 | 0.258 | 0.383 |
## | 0.902 | 0.144 | |
## | 0.284 | 0.099 | |
## -----------------------|-----------|-----------|-----------|
## noband | 5 | 95 | 100 |
## | 0.050 | 0.950 | 0.617 |
## | 0.098 | 0.856 | |
## | 0.031 | 0.586 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 51 | 111 | 162 |
## | 0.315 | 0.685 | |
## -----------------------|-----------|-----------|-----------|
##
##
We now use a repeated cross-validation technique to see if we get a better classification performance. The confusion matrix and cross-table below show the results of 10-fold cross-validation of the random forest, repeated 10 times. As can be seen, the selected model (mtry=2) obtains an accuracy of 0.84 and a kappa of 0.64 on the test set.
## model parameter label forReg forClass probModel
## 1 rf mtry #Randomly Selected Predictors TRUE TRUE TRUE
## user system elapsed
## 119.15 0.58 121.81
## Random Forest
##
## 378 samples
## 26 predictor
## 2 classes: 'band', 'noband'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 340, 341, 340, 341, 341, 339, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7657964 0.5110627
## 18 0.7634542 0.5112989
## 35 0.7599980 0.5046530
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Confusion Matrix and Statistics
##
## Reference
## Prediction band noband
## band 41 5
## noband 21 95
##
## Accuracy : 0.8395
## 95% CI : (0.7737, 0.8924)
## No Information Rate : 0.6173
## P-Value [Acc > NIR] : 5.453e-10
##
## Kappa : 0.6428
##
## Mcnemar's Test P-Value : 0.003264
##
## Sensitivity : 0.6613
## Specificity : 0.9500
## Pos Pred Value : 0.8913
## Neg Pred Value : 0.8190
## Prevalence : 0.3827
## Detection Rate : 0.2531
## Detection Prevalence : 0.2840
## Balanced Accuracy : 0.8056
##
## 'Positive' Class : band
##
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 162
##
##
## | band.pred.1
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 41 | 21 | 62 |
## | 0.661 | 0.339 | 0.383 |
## | 0.891 | 0.181 | |
## | 0.253 | 0.130 | |
## -----------------------|-----------|-----------|-----------|
## noband | 5 | 95 | 100 |
## | 0.050 | 0.950 | 0.617 |
## | 0.109 | 0.819 | |
## | 0.031 | 0.586 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 46 | 116 | 162 |
## | 0.284 | 0.716 | |
## -----------------------|-----------|-----------|-----------|
##
##
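The resampling scheme reported above (10 folds, repeated 10 times, on 378 training rows) can be sketched in base R. This is only an illustration of the idea: the helper `make_folds` is hypothetical, the seed is arbitrary, and the sketch does not stratify by class, so the exact fold sizes may differ slightly from those caret reports.

```r
set.seed(7)  # arbitrary, for reproducibility

n_train   <- 378   # training rows, as in the output above
n_folds   <- 10
n_repeats <- 10

# For each repeat, shuffle the rows and deal them into 10 folds.
# Each fold is held out once; the remaining rows are used for fitting.
make_folds <- function(n, k) {
  fold_id <- sample(rep(seq_len(k), length.out = n))
  split(seq_len(n), fold_id)
}

resamples <- unlist(
  lapply(seq_len(n_repeats), function(r) make_folds(n_train, n_folds)),
  recursive = FALSE
)

length(resamples)                     # held-out sets in total
range(n_train - lengths(resamples))   # training sizes per resample
```

Each of the 100 resamples fits the model on roughly 340 rows and scores it on the remaining 37 or 38, and the reported cross-validated accuracy is the average over all of them.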
For a large class imbalance (which is not the case in our example, as the band and noband classes are reasonably well balanced), caret's train function can subsample the dataset to balance the classes before model fitting. This is done by setting the option sampling=“down” in the trainControl settings passed to the function.
The corresponding confusion matrix and cross-table are shown below.
## user system elapsed
## 101.20 0.50 102.26
## Random Forest
##
## 378 samples
## 26 predictor
## 2 classes: 'band', 'noband'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 341, 341, 340, 340, 340, 339, ...
## Addtional sampling using down-sampling
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7639534 0.5252234
## 18 0.7556248 0.5070152
## 35 0.7366871 0.4692026
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Confusion Matrix and Statistics
##
## Reference
## Prediction band noband
## band 50 12
## noband 12 88
##
## Accuracy : 0.8519
## 95% CI : (0.7876, 0.9027)
## No Information Rate : 0.6173
## P-Value [Acc > NIR] : 4.698e-11
##
## Kappa : 0.6865
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8065
## Specificity : 0.8800
## Pos Pred Value : 0.8065
## Neg Pred Value : 0.8800
## Prevalence : 0.3827
## Detection Rate : 0.3086
## Detection Prevalence : 0.3827
## Balanced Accuracy : 0.8432
##
## 'Positive' Class : band
##
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 162
##
##
## | band.pred.2
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 50 | 12 | 62 |
## | 0.806 | 0.194 | 0.383 |
## | 0.806 | 0.120 | |
## | 0.309 | 0.074 | |
## -----------------------|-----------|-----------|-----------|
## noband | 12 | 88 | 100 |
## | 0.120 | 0.880 | 0.617 |
## | 0.194 | 0.880 | |
## | 0.074 | 0.543 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 62 | 100 | 162 |
## | 0.383 | 0.617 | |
## -----------------------|-----------|-----------|-----------|
##
##
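The down-sampling step used for the model above can be sketched in base R. The class counts (166 band, 212 noband) are the training-set totals from the out-of-bag confusion matrix; everything else, including the seed and object names, is an assumption for illustration. The sketch shows a single pass, whereas caret applies the subsampling inside each resampling iteration.

```r
set.seed(3)  # arbitrary, for reproducibility

# Class labels with the training-set counts reported earlier
# (166 band, 212 noband); the predictors are omitted here.
band_type <- factor(rep(c("band", "noband"), times = c(166, 212)))

# Keep every row of the smallest class and an equally sized random
# sample of each larger class.
n_min <- min(table(band_type))
keep  <- unlist(lapply(split(seq_along(band_type), band_type),
                       function(rows) rows[sample.int(length(rows), n_min)]))

table(band_type[keep])
```

The price of the balanced classes is that 46 noband rows are discarded in each pass.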
We build a NN classifier model in this case by providing the dataset free from missing values, using a single hidden layer with 2 units.
## # weights: 75
## initial value 262.836208
## final value 259.203890
## converged
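The weight count of 75 reported above is consistent with a network of 35 input columns, one hidden layer of 2 units and a single output unit, each hidden and output unit having its own bias term. This architecture is an inference from the output, not something stated in the original code; 35 is the number of rows in the relative importance table further down.

```r
# Assumed architecture: 35 inputs -> 2 hidden units -> 1 output,
# with a bias term feeding every hidden and output unit.
n_inputs <- 35
n_hidden <- 2
n_output <- 1

n_weights <- (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_output
n_weights
```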
We then proceed to test the model by using the test dataset on this model. From it we obtain the misclassification table shown below. The model predicts noband for all 162 test observations, so none of the 62 band cases is detected; its performance for predicting “banding” is therefore worse than that of randomForest.
## predicted
## true noband
## band 62
## noband 100
The misclassification table can also be obtained by using the gmodels library:
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 162
##
##
## | predicted
## data$band_type[-train] | noband | Row Total |
## -----------------------|-----------|-----------|
## band | 62 | 62 |
## | 0.383 | |
## -----------------------|-----------|-----------|
## noband | 100 | 100 |
## | 0.617 | |
## -----------------------|-----------|-----------|
## Column Total | 162 | 162 |
## -----------------------|-----------|-----------|
##
##
We can now list the relative importance of the variables using a sourced function (courtesy: https://beckmw.wordpress.com/2013/08/12/variable-importance-in-neural-networks/).
## rel.imp x.names
## viscosity -0.70805256 viscosity
## ink_typeuncoated -0.54977492 ink_typeuncoated
## ink_temperature -0.50998598 ink_temperature
## caliper -0.49514066 caliper
## ink_typecover -0.48778630 ink_typecover
## solvent_typeNAPTHA -0.45666856 solvent_typeNAPTHA
## direct_steamyes -0.33513404 direct_steamyes
## paper_typeuncoated -0.31614261 paper_typeuncoated
## blade_pressure -0.30417073 blade_pressure
## cylinder_sizespiegel -0.21385993 cylinder_sizespiegel
## press_typeMotter70 -0.08588101 press_typeMotter70
## anode_space_ratio 0.00000000 anode_space_ratio
## proof_cut 0.01041603 proof_cut
## locationnorthus 0.02738016 locationnorthus
## paper_typesuper 0.07173208 paper_typesuper
## proof_inkYES 0.22908412 proof_inkYES
## hardener 0.24559122 hardener
## roughness 0.30671182 roughness
## press_speed 0.33186274 press_speed
## solvent_typeXYLOL 0.33414148 solvent_typeXYLOL
## locationmideuropean 0.41522468 locationmideuropean
## locationsouthus 0.42687887 locationsouthus
## cylinder_sizetabloid 0.44025178 cylinder_sizetabloid
## ink_pct 0.48074246 ink_pct
## press_typeMotter94 0.58473779 press_typeMotter94
## press_typeWoodHoe70 0.59309384 press_typeWoodHoe70
## press 0.64836929 press
## humidity 0.68131414 humidity
## roller_durometer 0.72800859 roller_durometer
## locationscandanavian 0.80335472 locationscandanavian
## cylinder_typeyes 0.84432853 cylinder_typeyes
## solvent_pct 0.87608894 solvent_pct
## current_density 0.93307578 current_density
## grain_screenedYES 0.96934458 grain_screenedYES
## wax 1.00000000 wax
The bar plot tells us that the variables viscosity and wax have the strongest negative and positive relationships, respectively, with the response variable band_type.
In decreasing order of importance:
We consider the use of a decision tree, modelled using the rpart function. To get a large tree, we set the complexity parameter (cp) to a very small value.
In the output we see all the trees considered in the model, with the complexity parameter, number of splits, resubstitution error rate, cross-validated error rate and the associated standard error for each.
##
## Classification tree:
## rpart(formula = data$band_type ~ ., data = data, method = "class",
## control = rpart.control(minsplit = 4, cp = 1e-05))
##
## Variables actually used in tree construction:
## [1] anode_space_ratio blade_pressure caliper
## [4] current_density cylinder_type grain_screened
## [7] hardener humidity ink_pct
## [10] ink_temperature location paper_type
## [13] press press_speed press_type
## [16] proof_cut proof_ink roller_durometer
## [19] roughness solvent_pct solvent_type
## [22] viscosity wax
##
## Root node error: 228/540 = 0.42222
##
## n= 540
##
## CP nsplit rel error xerror xstd
## 1 0.0723684 0 1.000000 1.00000 0.050340
## 2 0.0526316 4 0.706140 0.91667 0.049643
## 3 0.0438596 5 0.653509 0.91667 0.049643
## 4 0.0307018 6 0.609649 0.85526 0.048955
## 5 0.0263158 7 0.578947 0.83333 0.048672
## 6 0.0197368 10 0.500000 0.78509 0.047979
## 7 0.0153509 12 0.460526 0.74123 0.047261
## 8 0.0131579 14 0.429825 0.71930 0.046869
## 9 0.0109649 16 0.403509 0.69298 0.046369
## 10 0.0087719 20 0.359649 0.68421 0.046195
## 11 0.0073099 32 0.254386 0.64912 0.045461
## 12 0.0065789 39 0.201754 0.64474 0.045365
## 13 0.0043860 45 0.162281 0.65789 0.045651
## 14 0.0021930 73 0.039474 0.67105 0.045927
## 15 0.0000100 77 0.030702 0.68860 0.046283
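The complexity table above can also be read programmatically. The sketch below copies the nsplit, xerror and xstd columns by hand and picks out two rows: the one with the lowest cross-validated error, and the one chosen by the one-standard-error rule. The latter is a common convention that is not used in the original analysis and is shown only for comparison.

```r
# Columns copied from the complexity table above.
cp_table <- data.frame(
  nsplit = c(0, 4, 5, 6, 7, 10, 12, 14, 16, 20, 32, 39, 45, 73, 77),
  xerror = c(1.00000, 0.91667, 0.91667, 0.85526, 0.83333, 0.78509,
             0.74123, 0.71930, 0.69298, 0.68421, 0.64912, 0.64474,
             0.65789, 0.67105, 0.68860),
  xstd   = c(0.050340, 0.049643, 0.049643, 0.048955, 0.048672, 0.047979,
             0.047261, 0.046869, 0.046369, 0.046195, 0.045461, 0.045365,
             0.045651, 0.045927, 0.046283)
)
root_error <- 228 / 540   # "Root node error" from the output

# Row with the lowest cross-validated error.
best <- which.min(cp_table$xerror)

# 1-SE rule: smallest tree within one standard error of the minimum.
limit  <- cp_table$xerror[best] + cp_table$xstd[best]
one_se <- which(cp_table$xerror <= limit)[1]

cp_table[c(best, one_se), ]

# xerror is scaled by the root node error; convert to an absolute rate.
round(cp_table$xerror[best] * root_error, 3)
```

Because rpart reports rel error and xerror relative to the root node error, the last line converts the minimum xerror into an absolute cross-validated misclassification rate.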
From the complexity table we observe that the lowest cross-validated error (xerror) of about 0.65 occurs at 39 splits, that is, at a tree size of ~40.
To reduce this tree size we can do pruning and repeated cross-validation. Doing this gives us a tree size of 12, with a resubstitution error rate of ~0.65 (as before).
This again is not a good predictor due to the large error rate.
## Call:
## rpart(formula = data$band_type ~ ., data = data, method = "class",
## control = rpart.control(minsplit = 4, cp = 1e-05))
## n= 540
##
## CP nsplit rel error xerror xstd
## 1 0.07236842 0 1.0000000 1.0000000 0.05033997
## 2 0.05263158 4 0.7061404 0.9166667 0.04964270
## 3 0.05000000 5 0.6535088 0.9166667 0.04964270
##
## Variable importance
## press press_speed press_type location
## 25 19 15 10
## blade_pressure current_density cylinder_size wax
## 9 5 4 4
## roughness solvent_pct paper_type hardener
## 3 1 1 1
## proof_ink anode_space_ratio solvent_type
## 1 1 1
##
## Node number 1: 540 observations, complexity param=0.07236842
## predicted class=noband expected loss=0.4222222 P(node) =1
## class counts: 228 312
## probabilities: 0.422 0.578
## left son=2 (451 obs) right son=3 (89 obs)
## Primary splits:
## press_speed < 2184.5 to the left, improve=17.60284, (0 missing)
## location splits as RLRRL, improve=16.58157, (0 missing)
## paper_type splits as RLR, improve=16.03920, (0 missing)
## roller_durometer < 33.00751 to the right, improve=12.48117, (0 missing)
## press < 822.5 to the left, improve=11.94435, (0 missing)
## Surrogate splits:
## press < 827.5 to the left, agree=0.863, adj=0.169, (0 split)
## solvent_pct < 44.05 to the left, agree=0.846, adj=0.067, (0 split)
## solvent_type splits as LRL, agree=0.839, adj=0.022, (0 split)
##
## Node number 2: 451 observations, complexity param=0.07236842
## predicted class=noband expected loss=0.4789357 P(node) =0.8351852
## class counts: 216 235
## probabilities: 0.479 0.521
## left son=4 (341 obs) right son=5 (110 obs)
## Primary splits:
## press < 814 to the right, improve=13.487460, (0 missing)
## paper_type splits as RLR, improve=12.558990, (0 missing)
## location splits as RLRRL, improve=11.622280, (0 missing)
## humidity < 69.5 to the right, improve=10.013630, (0 missing)
## press_type splits as LRLL, improve= 9.124902, (0 missing)
## Surrogate splits:
## press_type splits as RRLL, agree=0.956, adj=0.818, (0 split)
## blade_pressure < 35.1208 to the left, agree=0.916, adj=0.655, (0 split)
## cylinder_size splits as RLL, agree=0.825, adj=0.282, (0 split)
## wax < 2.95 to the left, agree=0.789, adj=0.136, (0 split)
## press_speed < 1432.5 to the right, agree=0.787, adj=0.127, (0 split)
##
## Node number 3: 89 observations
## predicted class=noband expected loss=0.1348315 P(node) =0.1648148
## class counts: 12 77
## probabilities: 0.135 0.865
##
## Node number 4: 341 observations, complexity param=0.07236842
## predicted class=band expected loss=0.4516129 P(node) =0.6314815
## class counts: 187 154
## probabilities: 0.548 0.452
## left son=8 (70 obs) right son=9 (271 obs)
## Primary splits:
## location splits as RLRRL, improve=9.922203, (0 missing)
## current_density < 36 to the right, improve=9.745700, (0 missing)
## paper_type splits as RLR, improve=9.592881, (0 missing)
## press < 822.5 to the left, improve=9.523061, (0 missing)
## ink_type splits as RLL, improve=8.733958, (0 missing)
## Surrogate splits:
## paper_type splits as RLR, agree=0.818, adj=0.114, (0 split)
## press_type splits as L-RR, agree=0.806, adj=0.057, (0 split)
## blade_pressure < 19 to the left, agree=0.798, adj=0.014, (0 split)
##
## Node number 5: 110 observations
## predicted class=noband expected loss=0.2636364 P(node) =0.2037037
## class counts: 29 81
## probabilities: 0.264 0.736
##
## Node number 8: 70 observations
## predicted class=band expected loss=0.2142857 P(node) =0.1296296
## class counts: 55 15
## probabilities: 0.786 0.214
##
## Node number 9: 271 observations, complexity param=0.07236842
## predicted class=noband expected loss=0.4870849 P(node) =0.5018519
## class counts: 132 139
## probabilities: 0.487 0.513
## left son=18 (187 obs) right son=19 (84 obs)
## Primary splits:
## press < 822.5 to the left, improve=8.739744, (0 missing)
## current_density < 36 to the right, improve=6.557576, (0 missing)
## ink_temperature < 16.9 to the right, improve=5.393334, (0 missing)
## press_type splits as L-RL, improve=5.142964, (0 missing)
## press_speed < 1275 to the left, improve=4.898144, (0 missing)
## Surrogate splits:
## press_type splits as L-RL, agree=0.793, adj=0.333, (0 split)
## roughness < 0.8073364 to the left, agree=0.793, adj=0.333, (0 split)
## wax < 2.35 to the right, agree=0.756, adj=0.214, (0 split)
## proof_ink splits as RL, agree=0.720, adj=0.095, (0 split)
## hardener < 1.35 to the left, agree=0.720, adj=0.095, (0 split)
##
## Node number 18: 187 observations, complexity param=0.05263158
## predicted class=band expected loss=0.4278075 P(node) =0.3462963
## class counts: 107 80
## probabilities: 0.572 0.428
## left son=36 (151 obs) right son=37 (36 obs)
## Primary splits:
## current_density < 36 to the right, improve=5.087226, (0 missing)
## ink_pct < 55.85 to the right, improve=4.591306, (0 missing)
## proof_cut < 55.25 to the left, improve=4.333522, (0 missing)
## wax < 2.25 to the left, improve=3.755301, (0 missing)
## press_speed < 1735 to the left, improve=3.491101, (0 missing)
## Surrogate splits:
## anode_space_ratio < 93.425 to the right, agree=0.829, adj=0.111, (0 split)
## humidity < 99 to the left, agree=0.818, adj=0.056, (0 split)
## ink_pct < 43.1 to the right, agree=0.818, adj=0.056, (0 split)
## solvent_type splits as L-R, agree=0.813, adj=0.028, (0 split)
## solvent_pct < 44.15 to the left, agree=0.813, adj=0.028, (0 split)
##
## Node number 19: 84 observations
## predicted class=noband expected loss=0.297619 P(node) =0.1555556
## class counts: 25 59
## probabilities: 0.298 0.702
##
## Node number 36: 151 observations
## predicted class=band expected loss=0.3708609 P(node) =0.2796296
## class counts: 95 56
## probabilities: 0.629 0.371
##
## Node number 37: 36 observations
## predicted class=noband expected loss=0.3333333 P(node) =0.06666667
## class counts: 12 24
## probabilities: 0.333 0.667
The variable importance table is given below:
## Overall
## anode_space_ratio 39.045225
## blade_pressure 53.689247
## caliper 23.911904
## current_density 24.670661
## cylinder_size 3.504540
## cylinder_type 9.884305
## grain_screened 12.257990
## hardener 44.723327
## humidity 57.332461
## ink_pct 50.808741
## ink_temperature 54.742040
## ink_type 40.796891
## location 56.392524
## paper_type 60.988988
## press 58.353369
## press_speed 56.561023
## press_type 29.485488
## proof_cut 57.179918
## proof_ink 6.087513
## roller_durometer 29.438657
## roughness 14.940079
## solvent_pct 22.471187
## solvent_type 6.492854
## viscosity 70.570605
## wax 49.027160
## direct_steam 0.000000
In decreasing order of importance the variables are:
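The ordering can be recovered by sorting the values in the table above. The sketch below copies those values by hand into a named vector; the object names are chosen here for illustration.

```r
# Overall importance values copied from the table above.
importance <- c(
  anode_space_ratio = 39.045225, blade_pressure   = 53.689247,
  caliper           = 23.911904, current_density  = 24.670661,
  cylinder_size     =  3.504540, cylinder_type    =  9.884305,
  grain_screened    = 12.257990, hardener         = 44.723327,
  humidity          = 57.332461, ink_pct          = 50.808741,
  ink_temperature   = 54.742040, ink_type         = 40.796891,
  location          = 56.392524, paper_type       = 60.988988,
  press             = 58.353369, press_speed      = 56.561023,
  press_type        = 29.485488, proof_cut        = 57.179918,
  proof_ink         =  6.087513, roller_durometer = 29.438657,
  roughness         = 14.940079, solvent_pct      = 22.471187,
  solvent_type      =  6.492854, viscosity        = 70.570605,
  wax               = 49.027160, direct_steam     =  0.000000
)

# Largest values first.
ranked <- sort(importance, decreasing = TRUE)
round(head(ranked, 10), 2)
```

Sorted this way, the five largest values belong to viscosity, paper_type, press, humidity and proof_cut.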
We consider the use of support vector machines to build a classifier and also to extract the feature importance list.
First we use a cost of 10 (relatively large), which penalizes misclassified training points heavily and therefore leaves only a narrow margin for misclassification.
##
## Call:
## svm(formula = data[train, ]$band_type ~ ., data = data[train,
## ], method = "C-classification", kernel = "radial", cost = 10,
## gamma = 0.1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10
##
## Number of Support Vectors: 333
##
## ( 183 150 )
##
##
## Number of Classes: 2
##
## Levels:
## band noband
## [1] 1 4 5 6 7 8 10 13 14 16 17 18 20 22 23 26 28
## [18] 29 30 31 33 34 36 38 40 41 42 43 44 45 46 48 50 52
## [35] 53 54 59 61 63 65 68 71 73 75 76 81 83 84 87 88 89
## [52] 90 92 93 94 96 97 101 102 104 105 108 110 114 118 121 123 124
## [69] 125 129 130 132 135 136 137 138 139 141 142 144 145 146 148 149 151
## [86] 152 155 157 160 161 162 165 167 168 173 176 177 178 179 180 181 185
## [103] 186 191 192 198 199 200 201 202 208 214 217 218 222 223 224 230 231
## [120] 233 235 240 244 246 247 251 252 254 257 263 264 266 267 271 274 279
## [137] 281 283 284 285 286 288 292 294 295 296 297 299 302 303 305 306 312
## [154] 314 315 317 321 323 327 329 330 336 337 339 342 343 345 346 350 353
## [171] 354 356 359 361 362 366 370 371 372 373 374 375 377 2 9 11 12
## [188] 15 19 21 24 25 27 35 39 49 57 58 62 64 66 67 69 70
## [205] 74 77 78 79 82 85 86 91 95 98 99 100 103 107 112 113 116
## [222] 117 119 120 122 126 127 128 131 133 134 143 147 150 153 158 159 163
## [239] 166 169 170 171 172 174 175 182 184 187 188 189 194 197 203 204 205
## [256] 206 207 209 210 211 212 213 215 216 219 220 221 225 229 232 234 237
## [273] 238 241 242 245 250 253 255 256 258 260 261 262 265 269 270 272 273
## [290] 277 278 280 282 287 289 290 291 293 298 300 301 304 307 308 309 310
## [307] 311 318 320 324 325 326 328 334 335 338 341 344 347 348 349 351 352
## [324] 355 357 358 360 363 364 365 367 368 369
Prediction using SVM: The accuracy of the model is 0.83 with a kappa of 0.64.
## Confusion Matrix and Statistics
##
## Reference
## Prediction band noband
## band 45 10
## noband 17 90
##
## Accuracy : 0.8333
## 95% CI : (0.7669, 0.8872)
## No Information Rate : 0.6173
## P-Value [Acc > NIR] : 1.737e-09
##
## Kappa : 0.6395
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 0.7258
## Specificity : 0.9000
## Pos Pred Value : 0.8182
## Neg Pred Value : 0.8411
## Prevalence : 0.3827
## Detection Rate : 0.2778
## Detection Prevalence : 0.3395
## Balanced Accuracy : 0.8129
##
## 'Positive' Class : band
##
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 162
##
##
## | svm.pred.2
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 45 | 17 | 62 |
## | 0.726 | 0.274 | 0.383 |
## | 0.818 | 0.159 | |
## | 0.278 | 0.105 | |
## -----------------------|-----------|-----------|-----------|
## noband | 10 | 90 | 100 |
## | 0.100 | 0.900 | 0.617 |
## | 0.182 | 0.841 | |
## | 0.062 | 0.556 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 55 | 107 | 162 |
## | 0.340 | 0.660 | |
## -----------------------|-----------|-----------|-----------|
##
##
The variable importance matrix is not implemented in caret for “svm”; we therefore use the rminer package to obtain the VIM.
In this method, the idea is to first remove those column variables that are not considered essential (by way of heuristics and understanding) for our learning model, and then perform KNN imputation.
The idea here is to avoid losing essential data. If we first removed the rows that have more than 20% of their data missing, we might discard rows whose missing values fall mostly in variables we were not going to consider anyway. We therefore perform the operations in a slightly different order.
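The effect of the ordering can be seen on a toy example. Everything below is made up for the demonstration: the data frame, the column names and the choice of which columns count as non-essential are all hypothetical, and the imputation step itself is left out.

```r
# Toy data: columns c and d stand in for variables that will be dropped.
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(5, NA, 7, 8),
  c = c(NA, NA, 1, 2),
  d = c(NA, 3, NA, 4)
)
drop_cols   <- c("c", "d")
keep_cols   <- setdiff(names(toy), drop_cols)
max_missing <- 0.20   # the 20% row threshold mentioned in the text

# Share of missing values per row.
row_missing <- function(df) rowMeans(is.na(df))

# Order A: filter rows on all columns, then drop the columns.
rows_first <- toy[row_missing(toy) <= max_missing, keep_cols]

# Order B: drop the columns first, then see what is still missing.
cols_first    <- toy[, keep_cols]
still_missing <- row_missing(cols_first)

nrow(rows_first)   # rows surviving order A
nrow(cols_first)   # rows kept for imputation in order B
still_missing
```

In order A most rows are discarded because of gaps in columns that are dropped anyway; in order B all rows are kept and only one of them still has a value left to impute.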
We first generate the training and test datasets based on a random sample. We use 70% of the supplied dataset as training set, and the remaining 30% as a test dataset.
We next build the random forest using the randomForest library on the training data set. Since the number of variables is 39 in this case, we use mtry=6 (approximately the square root of 39); that is, 6 variables are tried at every split.
From the results, we observe the out-of-bag error rate to be 19.31%.
##
## Call:
## randomForest(formula = form, data = data[train, ], ntree = 1000, mtry = 6, importance = TRUE, localImp = TRUE, replace = FALSE, na.action = na.roughfix)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 19.31%
## Confusion matrix:
## band noband class.error
## band 115 51 0.3072289
## noband 22 190 0.1037736
The variable importance is shown in the plot below, followed by the error rates of the first 15 trees and the corresponding error plot. This error plot shows the change in error rate as more trees are added to the forest.
## OOB band noband
## [1,] 0.3309 0.3220 0.3375
## [2,] 0.3319 0.3627 0.3071
## [3,] 0.3750 0.4016 0.3544
## [4,] 0.3846 0.4203 0.3563
## [5,] 0.3620 0.3986 0.3333
## [6,] 0.3768 0.4295 0.3350
## [7,] 0.3398 0.3688 0.3166
## [8,] 0.3306 0.3827 0.2886
## [9,] 0.3144 0.3681 0.2718
## [10,] 0.3270 0.4085 0.2621
## [11,] 0.3110 0.3636 0.2692
## [12,] 0.3316 0.3697 0.3014
## [13,] 0.3183 0.3675 0.2796
## [14,] 0.3254 0.4036 0.2642
## [15,] 0.2937 0.3614 0.2406
Mean Decrease in Gini is the average (mean) of a variable’s total decrease in node impurity, weighted by the proportion of samples reaching that node in each individual decision tree in the random forest.
This is effectively a measure of how important a variable is for estimating the value of the target variable across all of the trees that make up the forest. A higher Mean Decrease in Gini indicates higher variable importance. The most important variables to the model will be highest in the plot and have the largest Mean Decrease in Gini values; conversely, the least important variables will be lowest in the plot and have the smallest Mean Decrease in Gini values.
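The quantity being averaged is the drop in Gini impurity produced by a single split. As a worked illustration, the sketch below computes it for the root split of the decision tree printed earlier (540 rows, 228 band and 312 noband, split into 216/235 and 12/77). The helper `gini` is written here for the example. A random forest accumulates such decreases over every split made on a variable and averages them over the trees.

```r
# Gini impurity of a node given its class counts.
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Root split of the decision tree printed earlier
# (press_speed < 2184.5): parent node and its two children.
parent <- c(band = 228, noband = 312)
left   <- c(band = 216, noband = 235)
right  <- c(band = 12,  noband = 77)

# Impurity decrease, weighting each child by its share of the rows.
n <- sum(parent)
decrease <- gini(parent) -
  (sum(left) / n) * gini(left) -
  (sum(right) / n) * gini(right)

decrease
n * decrease   # decrease scaled by the number of rows in the node
```

Scaled by the 540 rows in the node, the decrease agrees with the improve value of 17.60284 that rpart reports for this split.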
In this case, the variables with a value greater than 7 in the Gini Index (and therefore the most important) are:
## band noband class.error
## band 115 51 0.3072289
## noband 22 190 0.1037736
We now proceed to predict using the test data set and create the corresponding confusion matrix.
## pred
## true band noband
## band 45 17
## noband 6 94
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 162
##
##
## | band.pred
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 45 | 17 | 62 |
## | 0.726 | 0.274 | 0.383 |
## | 0.882 | 0.153 | |
## | 0.278 | 0.105 | |
## -----------------------|-----------|-----------|-----------|
## noband | 6 | 94 | 100 |
## | 0.060 | 0.940 | 0.617 |
## | 0.118 | 0.847 | |
## | 0.037 | 0.580 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 51 | 111 | 162 |
## | 0.315 | 0.685 | |
## -----------------------|-----------|-----------|-----------|
##
##
We now use a repeated cross-validation technique to see if we get a better classification performance. The confusion matrix and cross-table below show the results of 10-fold cross-validation of the random forest, repeated 10 times. As can be seen, the selected model (mtry=2) obtains an accuracy of 0.85 and a kappa of 0.65 on the test set.
There is no standardized interpretation of the kappa statistic. According to Wikipedia (citing their paper), Landis and Koch consider 0-0.20 as slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1 as almost perfect.
Fleiss considers kappas > 0.75 as excellent, 0.40-0.75 as fair to good, and < 0.40 as poor.
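The Landis and Koch bands quoted above can be applied mechanically to the kappa values reported in this section. The helper `landis_koch` below is hypothetical and written only for this illustration; it treats the upper bound of each band as inclusive.

```r
# Landis and Koch bands as quoted above; upper bounds are inclusive.
landis_koch <- function(kappa) {
  cut(kappa,
      breaks = c(0, 0.20, 0.40, 0.60, 0.80, 1),
      labels = c("slight", "fair", "moderate", "substantial", "almost perfect"),
      include.lowest = TRUE)
}

# Test-set kappa values reported in the outputs of this section.
kappas <- c(svm = 0.6395, rf_cv = 0.6532, rf_down = 0.6986)
data.frame(kappa = kappas, label = landis_koch(kappas))
```

Under this scale all three models fall in the same band, so the labels alone do not separate them; the kappa values themselves still need to be compared.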
## model parameter label forReg forClass probModel
## 1 rf mtry #Randomly Selected Predictors TRUE TRUE TRUE
## user system elapsed
## 119.08 0.84 121.20
## Random Forest
##
## 378 samples
## 26 predictor
## 2 classes: 'band', 'noband'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 341, 340, 341, 341, 340, 341, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7780226 0.5358503
## 18 0.7745713 0.5336904
## 35 0.7616369 0.5075637
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Confusion Matrix and Statistics
##
## Reference
## Prediction band noband
## band 40 3
## noband 22 97
##
## Accuracy : 0.8457
## 95% CI : (0.7807, 0.8976)
## No Information Rate : 0.6173
## P-Value [Acc > NIR] : 1.638e-10
##
## Kappa : 0.6532
##
## Mcnemar's Test P-Value : 0.0003182
##
## Sensitivity : 0.6452
## Specificity : 0.9700
## Pos Pred Value : 0.9302
## Neg Pred Value : 0.8151
## Prevalence : 0.3827
## Detection Rate : 0.2469
## Detection Prevalence : 0.2654
## Balanced Accuracy : 0.8076
##
## 'Positive' Class : band
##
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 162
##
##
## | band.pred.1
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 40 | 22 | 62 |
## | 0.645 | 0.355 | 0.383 |
## | 0.930 | 0.185 | |
## | 0.247 | 0.136 | |
## -----------------------|-----------|-----------|-----------|
## noband | 3 | 97 | 100 |
## | 0.030 | 0.970 | 0.617 |
## | 0.070 | 0.815 | |
## | 0.019 | 0.599 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 43 | 119 | 162 |
## | 0.265 | 0.735 | |
## -----------------------|-----------|-----------|-----------|
##
##
For a large class imbalance (which is not the case in our example, as the band and noband classes are reasonably well balanced), caret's train function can subsample the dataset to balance the classes before model fitting. This is done by setting the option sampling=“down” in the trainControl settings passed to the function.
The corresponding confusion matrix and cross-table are shown below.
## user system elapsed
## 101.75 0.46 103.03
## Random Forest
##
## 378 samples
## 26 predictor
## 2 classes: 'band', 'noband'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 341, 340, 340, 341, 340, 341, ...
## Addtional sampling using down-sampling
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7621704 0.5196313
## 18 0.7515160 0.4983770
## 35 0.7432695 0.4825770
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Confusion Matrix and Statistics
##
## Reference
## Prediction band noband
## band 50 11
## noband 12 89
##
## Accuracy : 0.858
## 95% CI : (0.7946, 0.9078)
## No Information Rate : 0.6173
## P-Value [Acc > NIR] : 1.285e-11
##
## Kappa : 0.6986
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8065
## Specificity : 0.8900
## Pos Pred Value : 0.8197
## Neg Pred Value : 0.8812
## Prevalence : 0.3827
## Detection Rate : 0.3086
## Detection Prevalence : 0.3765
## Balanced Accuracy : 0.8482
##
## 'Positive' Class : band
##
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 162
##
##
## | band.pred.2
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 50 | 12 | 62 |
## | 0.806 | 0.194 | 0.383 |
## | 0.820 | 0.119 | |
## | 0.309 | 0.074 | |
## -----------------------|-----------|-----------|-----------|
## noband | 11 | 89 | 100 |
## | 0.110 | 0.890 | 0.617 |
## | 0.180 | 0.881 | |
## | 0.068 | 0.549 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 61 | 101 | 162 |
## | 0.377 | 0.623 | |
## -----------------------|-----------|-----------|-----------|
##
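The change in the McNemar p-value between the two random forest runs (0.0003182 without down-sampling, 1 with it) reflects how evenly the two kinds of error are distributed. A minimal sketch, assuming the value is computed with `mcnemar.test` from the stats package using its default continuity correction, with both confusion matrices copied from the output above:

```r
# Confusion matrices as printed above (rows = prediction, columns = reference)
cm_plain <- matrix(c(40, 22, 3, 97), nrow = 2,
                   dimnames = list(Prediction = c("band", "noband"),
                                   Reference  = c("band", "noband")))
cm_down  <- matrix(c(50, 12, 11, 89), nrow = 2,
                   dimnames = dimnames(cm_plain))

# McNemar's test compares the two off-diagonal (error) counts
p_plain <- mcnemar.test(cm_plain)$p.value  # errors 3 vs 22: strongly asymmetric
p_down  <- mcnemar.test(cm_down)$p.value   # errors 11 vs 12: nearly symmetric

c(without_downsampling = p_plain, with_downsampling = p_down)
```

Without down-sampling the model misses many more band cases than it raises false alarms; with down-sampling the two error types are almost equal, at a similar overall accuracy.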
##
We build a neural network (NN) classifier in this case by providing the dataset free from missing values, using a single hidden layer with 2 units.
## # weights: 75
## initial value 260.297685
## final value 259.203866
## converged
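The `# weights: 75` line can be checked by hand. The sketch below assumes a single-hidden-layer network with bias terms and no skip-layer connections (the form fitted by the nnet package, whose output format this matches), 35 input columns after dummy coding (the number of rows in the relative importance table further down), 2 hidden units and 1 output unit. The helper `n_weights` is a hypothetical name introduced for illustration.

```r
# Number of weights in a single-hidden-layer network with bias terms
# (no skip-layer connections assumed)
n_weights <- function(n_in, n_hidden, n_out = 1) {
  (n_in + 1) * n_hidden + (n_hidden + 1) * n_out
}

n_weights(35, 2)  # 36 * 2 + 3 * 1 = 75
```

The count matches the printed value, which is consistent with 2 hidden units in one hidden layer.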
We then test the model on the test dataset and obtain the misclassification table shown below. The network assigns every test observation to “noband”: all 62 band cases are misclassified, so the accuracy is 100/162 ≈ 0.62, which is no better than the no-information rate. The model therefore performs poorly, especially for predicting “banding”, compared to randomForest.
## predicted
## true noband
## band 62
## noband 100
The misclassification table can also be obtained by using the gmodels library:
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 162
##
##
## | predicted
## data$band_type[-train] | noband | Row Total |
## -----------------------|-----------|-----------|
## band | 62 | 62 |
## | 0.383 | |
## -----------------------|-----------|-----------|
## noband | 100 | 100 |
## | 0.617 | |
## -----------------------|-----------|-----------|
## Column Total | 162 | 162 |
## -----------------------|-----------|-----------|
##
##
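Both tables above have a single noband column because no observation is predicted as band. A small base-R sketch, using stand-in vectors `truth` and `pred` rebuilt from the printed counts (not the original objects), shows how fixing the factor levels keeps the empty band column visible and how the accuracy compares with the no-information rate:

```r
# Stand-in vectors reproducing the counts in the table above
truth <- factor(c(rep("band", 62), rep("noband", 100)),
                levels = c("band", "noband"))
pred  <- rep("noband", 162)            # every observation is predicted as noband

# Forcing both levels keeps the all-zero "band" column in the table
pred <- factor(pred, levels = levels(truth))
tab  <- table(true = truth, predicted = pred)
print(tab)

accuracy <- sum(diag(tab)) / sum(tab)          # share of correct predictions
nir      <- max(table(truth)) / length(truth)  # always guessing the majority class
c(accuracy = accuracy, no_information_rate = nir)
```

When accuracy equals the no-information rate, the classifier carries no more information than always predicting the majority class.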
We can now list the relative importance of the variables using a sourced function (courtesy: https://beckmw.wordpress.com/2013/08/12/variable-importance-in-neural-networks/)
## rel.imp x.names
## ink_pct -0.95428136 ink_pct
## humidity -0.84236796 humidity
## wax -0.82342916 wax
## solvent_typeXYLOL -0.77438835 solvent_typeXYLOL
## press_typeWoodHoe70 -0.67125659 press_typeWoodHoe70
## ink_typecover -0.59759231 ink_typecover
## caliper -0.54061341 caliper
## viscosity -0.50092145 viscosity
## cylinder_sizetabloid -0.43952260 cylinder_sizetabloid
## proof_cut -0.32721833 proof_cut
## press_typeMotter94 -0.26625756 press_typeMotter94
## anode_space_ratio -0.22611918 anode_space_ratio
## paper_typeuncoated -0.21282091 paper_typeuncoated
## direct_steamyes -0.12860134 direct_steamyes
## paper_typesuper -0.10583570 paper_typesuper
## locationscandanavian -0.08288393 locationscandanavian
## ink_typeuncoated -0.06885445 ink_typeuncoated
## press -0.04479892 press
## blade_pressure -0.02039919 blade_pressure
## cylinder_typeyes 0.00000000 cylinder_typeyes
## locationmideuropean 0.03440463 locationmideuropean
## ink_temperature 0.11009559 ink_temperature
## hardener 0.25508124 hardener
## roller_durometer 0.34119484 roller_durometer
## locationsouthus 0.34694584 locationsouthus
## locationnorthus 0.41373216 locationnorthus
## press_speed 0.42958805 press_speed
## current_density 0.50752345 current_density
## solvent_pct 0.55162539 solvent_pct
## roughness 0.56921057 roughness
## grain_screenedYES 0.69775106 grain_screenedYES
## proof_inkYES 0.72720408 proof_inkYES
## cylinder_sizespiegel 0.86676553 cylinder_sizespiegel
## solvent_typeNAPTHA 0.92750811 solvent_typeNAPTHA
## press_typeMotter70 1.00000000 press_typeMotter70
The bar plot tells us that the variables ink_pct and press_typeMotter70 have the strongest negative and positive relationships, respectively, with the response variable band_type.
In decreasing order of importance:
We next consider a decision tree, modelled using the rpart function. To grow a large tree, we set the complexity parameter (cp) to a very small value.
The output lists all the trees considered in the model, giving for each the complexity parameter, number of splits, resubstitution error rate, cross-validated error rate and the associated standard error.
##
## Classification tree:
## rpart(formula = data$band_type ~ ., data = data, method = "class",
## control = rpart.control(minsplit = 4, cp = 1e-05))
##
## Variables actually used in tree construction:
## [1] anode_space_ratio blade_pressure caliper
## [4] current_density cylinder_type grain_screened
## [7] hardener humidity ink_pct
## [10] ink_temperature ink_type location
## [13] paper_type press press_speed
## [16] press_type proof_cut proof_ink
## [19] roller_durometer roughness solvent_pct
## [22] solvent_type viscosity wax
##
## Root node error: 228/540 = 0.42222
##
## n= 540
##
## CP nsplit rel error xerror xstd
## 1 0.0723684 0 1.000000 1.00000 0.050340
## 2 0.0438596 3 0.728070 0.76316 0.047630
## 3 0.0307018 6 0.596491 0.79825 0.048178
## 4 0.0219298 7 0.565789 0.67544 0.046017
## 5 0.0175439 11 0.451754 0.60526 0.044455
## 6 0.0131579 15 0.381579 0.60526 0.044455
## 7 0.0109649 20 0.315789 0.63158 0.045071
## 8 0.0087719 22 0.293860 0.63158 0.045071
## 9 0.0065789 38 0.153509 0.62281 0.044870
## 10 0.0058480 42 0.127193 0.61842 0.044768
## 11 0.0043860 45 0.109649 0.62281 0.044870
## 12 0.0021930 64 0.026316 0.65789 0.045651
## 13 0.0014620 66 0.021930 0.65351 0.045556
## 14 0.0000100 69 0.017544 0.65351 0.045556
From the complexity table above we observe that the lowest cross-validated relative error (xerror) of about 0.6 occurs at 11 to 15 splits, well before the largest tree.
To reduce the tree size we can prune the tree and repeat the cross-validation. Doing this gives us a tree size of 5, with a resubstitution error rate of ~0.6 (as before).
This again is not a good predictor due to the large error rate.
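The choice of tree size can also be read off the complexity table programmatically. The sketch below copies the CP, nsplit, xerror and xstd columns from the rpart output above into a data frame (`cp_tab` is a name introduced here) and picks both the row with the smallest cross-validated error and the row chosen by the common one-standard-error rule. It is an illustration of the selection step, not the pruning procedure actually used in this report.

```r
# Complexity table copied from the rpart output above
cp_tab <- data.frame(
  CP     = c(0.0723684, 0.0438596, 0.0307018, 0.0219298, 0.0175439, 0.0131579,
             0.0109649, 0.0087719, 0.0065789, 0.0058480, 0.0043860, 0.0021930,
             0.0014620, 0.0000100),
  nsplit = c(0, 3, 6, 7, 11, 15, 20, 22, 38, 42, 45, 64, 66, 69),
  xerror = c(1.00000, 0.76316, 0.79825, 0.67544, 0.60526, 0.60526, 0.63158,
             0.63158, 0.62281, 0.61842, 0.62281, 0.65789, 0.65351, 0.65351),
  xstd   = c(0.050340, 0.047630, 0.048178, 0.046017, 0.044455, 0.044455,
             0.045071, 0.045071, 0.044870, 0.044768, 0.044870, 0.045651,
             0.045556, 0.045556)
)

# Row with the smallest cross-validated error (first one on ties)
best <- which.min(cp_tab$xerror)

# One-standard-error rule: smallest tree whose xerror is within
# one xstd of the minimum
threshold <- cp_tab$xerror[best] + cp_tab$xstd[best]
one_se    <- min(which(cp_tab$xerror <= threshold))

cp_tab[unique(c(best, one_se)), ]
```

With a fitted rpart object, the same table is available as its `cptable` component, and the selected CP value can be passed to `prune()`.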
## Call:
## rpart(formula = data$band_type ~ ., data = data, method = "class",
## control = rpart.control(minsplit = 4, cp = 1e-05))
## n= 540
##
## CP nsplit rel error xerror xstd
## 1 0.07236842 0 1.0000000 1.0000000 0.05033997
## 2 0.05000000 3 0.7280702 0.7631579 0.04763031
##
## Variable importance
## press_speed press press_type roller_durometer
## 19 17 11 11
## blade_pressure paper_type location ink_type
## 8 8 7 5
## grain_screened cylinder_size proof_cut wax
## 5 4 2 2
## solvent_pct
## 1
##
## Node number 1: 540 observations, complexity param=0.07236842
## predicted class=noband expected loss=0.4222222 P(node) =1
## class counts: 228 312
## probabilities: 0.422 0.578
## left son=2 (451 obs) right son=3 (89 obs)
## Primary splits:
## press_speed < 2184.5 to the left, improve=17.60284, (0 missing)
## paper_type splits as RLR, improve=16.03920, (0 missing)
## roller_durometer < 33.07812 to the right, improve=14.64381, (0 missing)
## location splits as RLRRL, improve=13.89276, (0 missing)
## press < 822.5 to the left, improve=11.94435, (0 missing)
## Surrogate splits:
## press < 827.5 to the left, agree=0.863, adj=0.169, (0 split)
## solvent_pct < 44.05 to the left, agree=0.846, adj=0.067, (0 split)
## solvent_type splits as LRL, agree=0.839, adj=0.022, (0 split)
##
## Node number 2: 451 observations, complexity param=0.07236842
## predicted class=noband expected loss=0.4789357 P(node) =0.8351852
## class counts: 216 235
## probabilities: 0.479 0.521
## left son=4 (341 obs) right son=5 (110 obs)
## Primary splits:
## press < 814 to the right, improve=13.487460, (0 missing)
## paper_type splits as RLR, improve=12.558990, (0 missing)
## humidity < 69.5 to the right, improve=10.013630, (0 missing)
## location splits as RLRRL, improve= 9.804620, (0 missing)
## press_type splits as LRLL, improve= 9.124902, (0 missing)
## Surrogate splits:
## press_type splits as RRLL, agree=0.956, adj=0.818, (0 split)
## blade_pressure < 35.06462 to the left, agree=0.900, adj=0.591, (0 split)
## cylinder_size splits as RLL, agree=0.825, adj=0.282, (0 split)
## wax < 2.95 to the left, agree=0.789, adj=0.136, (0 split)
## press_speed < 1432.5 to the right, agree=0.787, adj=0.127, (0 split)
##
## Node number 3: 89 observations
## predicted class=noband expected loss=0.1348315 P(node) =0.1648148
## class counts: 12 77
## probabilities: 0.135 0.865
##
## Node number 4: 341 observations, complexity param=0.07236842
## predicted class=band expected loss=0.4516129 P(node) =0.6314815
## class counts: 187 154
## probabilities: 0.548 0.452
## left son=8 (208 obs) right son=9 (133 obs)
## Primary splits:
## roller_durometer < 33.07812 to the right, improve=10.805260, (0 missing)
## current_density < 36 to the right, improve= 9.745700, (0 missing)
## paper_type splits as RLR, improve= 9.592881, (0 missing)
## press < 822.5 to the left, improve= 9.523061, (0 missing)
## ink_type splits as RLL, improve= 8.733958, (0 missing)
## Surrogate splits:
## paper_type splits as RLL, agree=0.891, adj=0.722, (0 split)
## location splits as RLLLL, agree=0.848, adj=0.609, (0 split)
## ink_type splits as RLL, agree=0.801, adj=0.489, (0 split)
## grain_screened splits as RL, agree=0.774, adj=0.421, (0 split)
## proof_cut < 39.84652 to the right, agree=0.692, adj=0.211, (0 split)
##
## Node number 5: 110 observations
## predicted class=noband expected loss=0.2636364 P(node) =0.2037037
## class counts: 29 81
## probabilities: 0.264 0.736
##
## Node number 8: 208 observations
## predicted class=band expected loss=0.3509615 P(node) =0.3851852
## class counts: 135 73
## probabilities: 0.649 0.351
##
## Node number 9: 133 observations
## predicted class=noband expected loss=0.3909774 P(node) =0.2462963
## class counts: 52 81
## probabilities: 0.391 0.609
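The `improve` value reported for a split can be reproduced from the class counts in the summary. Assuming the default Gini splitting criterion of rpart for classification, the improvement is the node size times the decrease in Gini impurity. The sketch below checks this for the root split on press_speed, using the class counts of nodes 1, 2 and 3 printed above (`gini_n` is a helper name introduced here).

```r
# n * Gini impurity of a node, computed from its class counts
gini_n <- function(counts) {
  n <- sum(counts)
  p <- counts / n
  n * (1 - sum(p^2))
}

parent <- c(band = 228, noband = 312)  # node 1 (540 observations)
left   <- c(band = 216, noband = 235)  # node 2 (451 observations)
right  <- c(band = 12,  noband = 77)   # node 3 (89 observations)

# Impurity of the parent minus the impurity remaining in the two children
improve <- gini_n(parent) - gini_n(left) - gini_n(right)
improve  # about 17.60, matching improve=17.60284 for the press_speed split
```

The split is chosen because it gives the largest such decrease among the candidate primary splits listed for the node.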
The variable importance table is given below:
## Overall
## anode_space_ratio 29.198297
## blade_pressure 58.779461
## caliper 19.474535
## current_density 24.888647
## cylinder_size 7.194784
## cylinder_type 6.077429
## grain_screened 12.931862
## hardener 34.775933
## humidity 56.337643
## ink_pct 84.653468
## ink_temperature 56.023266
## ink_type 35.748195
## location 50.644398
## paper_type 59.847210
## press 60.211992
## press_speed 46.497600
## press_type 30.489342
## proof_cut 34.437336
## proof_ink 26.224542
## roller_durometer 32.888612
## roughness 30.441449
## solvent_pct 26.219146
## solvent_type 3.250000
## viscosity 77.422617
## wax 36.046013
## direct_steam 0.000000
In decreasing order of importance, the variables are:
We consider the use of support vector machines to build a classifier and also to extract the feature importance list.
First we use a cost of 10 (slightly large), which penalizes margin violations more heavily and therefore allows only a narrow margin for misclassification.
##
## Call:
## svm(formula = data[train, ]$band_type ~ ., data = data[train,
## ], method = "C-classification", kernel = "radial", cost = 10,
## gamma = 0.1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10
##
## Number of Support Vectors: 335
##
## ( 182 153 )
##
##
## Number of Classes: 2
##
## Levels:
## band noband
## [1] 1 3 4 5 6 7 8 10 13 14 17 18 20 22 23 26 28
## [18] 29 30 31 33 34 36 38 40 41 42 43 44 45 46 48 50 52
## [35] 53 54 59 61 63 65 68 71 73 75 76 81 83 84 87 88 89
## [52] 90 92 93 94 96 97 101 102 104 105 108 110 118 121 123 125 129
## [69] 130 132 135 136 137 138 139 141 142 144 145 146 148 149 151 152 155
## [86] 157 160 161 162 165 167 168 173 176 177 178 180 181 185 186 191 192
## [103] 198 199 200 201 202 208 214 217 218 222 223 224 230 231 233 235 236
## [120] 240 243 244 246 247 251 252 254 257 263 264 266 267 271 274 279 281
## [137] 283 284 285 286 288 292 294 295 296 297 299 302 303 305 306 312 314
## [154] 315 317 321 323 327 329 330 336 337 339 342 343 345 346 350 353 354
## [171] 356 359 361 362 366 370 371 372 373 374 375 377 2 9 12 15 19
## [188] 21 24 25 27 32 35 39 49 55 57 58 62 64 66 67 69 70
## [205] 74 77 78 79 82 85 86 91 95 98 99 100 103 107 112 113 115
## [222] 117 119 120 122 126 127 128 131 134 143 147 150 153 158 159 163 166
## [239] 169 170 171 172 174 175 182 184 188 189 194 197 203 204 205 206 207
## [256] 209 210 211 212 213 215 216 219 220 221 225 229 232 234 237 238 241
## [273] 242 245 248 250 253 255 256 258 259 260 261 262 265 270 272 273 275
## [290] 277 278 280 282 287 289 291 293 298 300 301 304 307 308 309 310 311
## [307] 318 319 320 324 325 326 328 332 334 335 338 341 344 347 348 349 351
## [324] 352 355 357 358 360 363 364 365 367 368 369 378
Prediction using SVM: the accuracy of the model is 0.85 with a kappa of 0.67.
## Confusion Matrix and Statistics
##
## Reference
## Prediction band noband
## band 47 10
## noband 15 90
##
## Accuracy : 0.8457
## 95% CI : (0.7807, 0.8976)
## No Information Rate : 0.6173
## P-Value [Acc > NIR] : 1.638e-10
##
## Kappa : 0.6683
##
## Mcnemar's Test P-Value : 0.4237
##
## Sensitivity : 0.7581
## Specificity : 0.9000
## Pos Pred Value : 0.8246
## Neg Pred Value : 0.8571
## Prevalence : 0.3827
## Detection Rate : 0.2901
## Detection Prevalence : 0.3519
## Balanced Accuracy : 0.8290
##
## 'Positive' Class : band
##
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 162
##
##
## | svm.pred.2
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 47 | 15 | 62 |
## | 0.758 | 0.242 | 0.383 |
## | 0.825 | 0.143 | |
## | 0.290 | 0.093 | |
## -----------------------|-----------|-----------|-----------|
## noband | 10 | 90 | 100 |
## | 0.100 | 0.900 | 0.617 |
## | 0.175 | 0.857 | |
## | 0.062 | 0.556 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 57 | 105 | 162 |
## | 0.352 | 0.648 | |
## -----------------------|-----------|-----------|-----------|
##
##
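The headline statistics can be recomputed directly from the SVM confusion matrix above. The following base-R sketch uses the standard definitions of accuracy, Cohen's kappa, sensitivity and specificity, with band as the positive class; the variable names are introduced here for illustration.

```r
# SVM confusion matrix from above (rows = prediction, columns = reference)
cm <- matrix(c(47, 15, 10, 90), nrow = 2,
             dimnames = list(Prediction = c("band", "noband"),
                             Reference  = c("band", "noband")))

n  <- sum(cm)
po <- sum(diag(cm)) / n                       # observed agreement = accuracy
pe <- sum(rowSums(cm) * colSums(cm)) / n^2    # agreement expected by chance
kappa_hat <- (po - pe) / (1 - pe)

sensitivity <- cm["band", "band"] / sum(cm[, "band"])        # 47 / 62
specificity <- cm["noband", "noband"] / sum(cm[, "noband"])  # 90 / 100

round(c(accuracy = po, kappa = kappa_hat,
        sensitivity = sensitivity, specificity = specificity), 4)
```

Kappa discounts the agreement that would be expected from the class frequencies alone, which is why it is lower than the raw accuracy.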
The variable importance matrix is not implemented in caret for “svm”; therefore, we will use the rminer package to obtain the VIM.
In this case, we first remove the unconsidered columns from the dataset, then remove observations in which more than 20% of the data is missing, and finally perform KNN imputation on the remaining data.
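The row filter can be sketched in base R as follows. The data frame `df` holds made-up values (only the column names are borrowed from the dataset), and the imputation step is not shown, since it depends on the package used.

```r
# Toy data frame (illustrative values only, not the report's data)
df <- data.frame(
  viscosity   = c(43,   NA,   50,   NA,   56),
  humidity    = c(73,   NA,   78,   82,   NA),
  wax         = c(2.5,  NA,   2.7,  2.4,  2.6),
  caliper     = c(0.20, 0.27, NA,   0.31, 0.30),
  roughness   = c(0.62, NA,   0.75, NA,   0.72),
  press_speed = c(1600, NA,   1800, 2050, 1900)
)

# Fraction of missing values in each row
frac_missing <- rowMeans(is.na(df))

# Keep observations with at most 20% of their values missing
df_kept <- df[frac_missing <= 0.20, ]
nrow(df_kept)
```

Rows 2 and 4 are dropped here because five and two of their six values are missing; the remaining rows still contain isolated gaps that an imputation step would fill.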
The variables that are not considered are:
## grain_screened proof_ink paper_type ink_type direct_steam
## NO :273 NO : 24 coated :206 coated :264 no :482
## YES:211 YES:460 super : 0 cover : 15 yes: 2
## uncoated:278 uncoated:205
##
##
##
## solvent_type cylinder_type press_type press cylinder_size
## LINE :467 no :134 Albert70 : 60 Min. :802.0 catalog:162
## NAPTHA: 2 yes:350 Motter70 : 48 1st Qu.:815.0 spiegel: 50
## XYLOL : 15 Motter94 :210 Median :816.0 tabloid:272
## WoodHoe70:166 Mean :817.5
## 3rd Qu.:824.0
## Max. :828.0
## location proof_cut viscosity caliper
## canadian :215 Min. :25.00 Min. :35.00 Min. :0.1330
## mideuropean : 49 1st Qu.:40.00 1st Qu.:43.00 1st Qu.:0.2000
## northus :199 Median :45.00 Median :50.00 Median :0.2670
## scandanavian: 13 Mean :45.04 Mean :50.86 Mean :0.2756
## southus : 8 3rd Qu.:50.00 3rd Qu.:56.00 3rd Qu.:0.3083
## Max. :72.50 Max. :72.00 Max. :0.5330
## ink_temperature humidity roughness blade_pressure
## Min. :11.20 Min. : 57.00 Min. :0.05625 Min. :16.00
## 1st Qu.:14.50 1st Qu.: 73.00 1st Qu.:0.62500 1st Qu.:25.00
## Median :15.10 Median : 78.00 Median :0.75000 Median :30.00
## Mean :15.31 Mean : 78.47 Mean :0.72738 Mean :31.24
## 3rd Qu.:16.00 3rd Qu.: 82.00 3rd Qu.:0.81250 3rd Qu.:33.81
## Max. :24.50 Max. :105.00 Max. :1.25000 Max. :70.00
## varnish_pct press_speed ink_pct solvent_pct
## Min. : 0.000 Min. : 0 Min. :41.00 Min. :22.00
## 1st Qu.: 0.000 1st Qu.:1600 1st Qu.:52.10 1st Qu.:36.80
## Median : 3.400 Median :1800 Median :56.75 Median :38.50
## Mean : 5.767 Mean :1831 Mean :55.64 Mean :38.58
## 3rd Qu.:10.400 3rd Qu.:2050 3rd Qu.:58.80 3rd Qu.:41.20
## Max. :35.800 Max. :2600 Max. :76.90 Max. :53.40
## wax hardener roller_durometer current_density
## Min. :0.000 Min. :0.0000 Min. :28.00 Min. :30.00
## 1st Qu.:2.414 1st Qu.:0.8000 1st Qu.:30.00 1st Qu.:40.00
## Median :2.500 Median :1.0000 Median :34.00 Median :40.00
## Mean :2.422 Mean :0.9692 Mean :34.78 Mean :38.96
## 3rd Qu.:2.700 3rd Qu.:1.0000 3rd Qu.:40.00 3rd Qu.:40.00
## Max. :3.100 Max. :3.0000 Max. :60.00 Max. :45.00
## anode_space_ratio band_type
## Min. : 90.0 band :173
## 1st Qu.:100.0 noband:311
## Median :103.1
## Mean :102.9
## 3rd Qu.:106.5
## Max. :117.9
We first generate the training and test datasets based on a random sample. We use 70% of the supplied dataset as training set, and the remaining 30% as a test dataset.
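A minimal sketch of such a split, assuming the 484 observations of the cleaned dataset and an arbitrary seed. The index vector is named `train` to match the `data[train, ]` and `data$band_type[-train]` expressions that appear in the output of this report.

```r
set.seed(42)            # arbitrary seed, only for reproducibility of the sketch
n <- 484                # number of observations in the cleaned dataset

# Row indices of the training set: a random 70% of the rows
train <- sample.int(n, size = floor(0.7 * n))

length(train)           # 338 training rows
n - length(train)       # 146 test rows, selected as data[-train, ]
```

These sizes agree with the 338 training samples and 146 test observations reported in the outputs of this section.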
We next build the random forest on the training dataset using the randomForest library. Since the number of variables is 39 in this case, we use mtry=6 (roughly the square root of 39); that is, 6 randomly selected variables are tried at every split.
From the results, we observe the out-of-bag error rate to be 22.78%.
##
## Call:
## randomForest(formula = form, data = data[train, ], ntree = 1000, mtry = 6, importance = TRUE, localImp = TRUE, replace = FALSE, na.action = na.roughfix)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 22.78%
## Confusion matrix:
## band noband class.error
## band 74 54 0.4218750
## noband 23 187 0.1095238
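The out-of-bag error rate and the class.error column follow directly from the confusion matrix above, as this short base-R check shows (the matrix is retyped from the output; `oob` is a name introduced here):

```r
# Out-of-bag confusion matrix from the randomForest output (rows = true class)
oob <- matrix(c(74, 23, 54, 187), nrow = 2,
              dimnames = list(true = c("band", "noband"),
                              predicted = c("band", "noband")))

# Overall OOB error: misclassified observations over all observations
oob_error <- 1 - sum(diag(oob)) / sum(oob)

# Per-class error: misclassified share within each true class
class_error <- 1 - diag(oob) / rowSums(oob)

round(100 * oob_error, 2)  # 22.78
class_error                # band 0.421875, noband 0.1095238
```

The per-class values show that most of the error comes from band observations being classified as noband.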
The variable importance is shown in the plot below, followed by the error rates (OOB and per class) after each of the first 15 trees and the corresponding error plot. The error plot shows how the error rate changes as more trees are added to the forest.
## OOB band noband
## [1,] 0.3468 0.4000 0.3108
## [2,] 0.3483 0.4375 0.2893
## [3,] 0.3770 0.4894 0.3067
## [4,] 0.3630 0.4667 0.3011
## [5,] 0.3233 0.4464 0.2500
## [6,] 0.3139 0.4474 0.2359
## [7,] 0.3478 0.4836 0.2650
## [8,] 0.3405 0.5040 0.2388
## [9,] 0.3140 0.4841 0.2079
## [10,] 0.2939 0.4762 0.1814
## [11,] 0.2934 0.4646 0.1884
## [12,] 0.3134 0.4724 0.2163
## [13,] 0.3095 0.4567 0.2201
## [14,] 0.2857 0.4409 0.1914
## [15,] 0.2917 0.4724 0.1818
Mean Decrease in Gini is the average (mean) of a variable’s total decrease in node impurity, weighted by the proportion of samples reaching that node in each individual decision tree in the random forest.
In this case, the variables that are between 6-8 in the Gini Index (and therefore most important) are:
## band noband class.error
## band 74 54 0.4218750
## noband 23 187 0.1095238
We now predict on the test dataset and create the corresponding confusion matrix.
## pred
## true band noband
## band 26 19
## noband 7 94
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 146
##
##
## | band.pred
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 26 | 19 | 45 |
## | 0.578 | 0.422 | 0.308 |
## | 0.788 | 0.168 | |
## | 0.178 | 0.130 | |
## -----------------------|-----------|-----------|-----------|
## noband | 7 | 94 | 101 |
## | 0.069 | 0.931 | 0.692 |
## | 0.212 | 0.832 | |
## | 0.048 | 0.644 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 33 | 113 | 146 |
## | 0.226 | 0.774 | |
## -----------------------|-----------|-----------|-----------|
##
##
We now use repeated cross-validation to see if we get better classification performance. The confusion matrix and cross-table below give the results of fitting the random forest with 10-fold cross-validation repeated 10 times. As can be seen, the accuracy obtained on the test set is 0.82 with mtry=19, at a kappa of about 0.56.
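The resampling scheme can be illustrated with a few lines of base R. This is a simplified sketch with plain random fold assignment and an arbitrary seed; caret additionally stratifies the folds by class, so it is not a reproduction of its internals.

```r
set.seed(7)          # arbitrary seed for the sketch
n       <- 338       # training-set size (338 samples in the caret output)
k       <- 10        # number of folds
repeats <- 10        # number of repetitions

# One random fold assignment per repeat: each row gets a fold label 1..k
fold_ids <- replicate(repeats, sample(rep(seq_len(k), length.out = n)),
                      simplify = FALSE)

# Training-set size when each fold of the first repeat is held out
train_sizes <- n - tabulate(fold_ids[[1]], nbins = k)
train_sizes
```

Each model is fitted on roughly 304 rows and evaluated on the remaining 33 or 34, which is where the "Summary of sample sizes" line in the caret output comes from; the reported accuracy is the average over all 100 held-out folds.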
## model parameter label forReg forClass probModel
## 1 rf mtry #Randomly Selected Predictors TRUE TRUE TRUE
## user system elapsed
## 103.31 0.46 104.45
## Random Forest
##
## 338 samples
## 27 predictor
## 2 classes: 'band', 'noband'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 304, 304, 304, 304, 304, 304, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7156239 0.3255382
## 19 0.7414260 0.4178939
## 36 0.7401961 0.4165118
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 19.
## Confusion Matrix and Statistics
##
## Reference
## Prediction band noband
## band 27 8
## noband 18 93
##
## Accuracy : 0.8219
## 95% CI : (0.7501, 0.8802)
## No Information Rate : 0.6918
## P-Value [Acc > NIR] : 0.000261
##
## Kappa : 0.555
##
## Mcnemar's Test P-Value : 0.077556
##
## Sensitivity : 0.6000
## Specificity : 0.9208
## Pos Pred Value : 0.7714
## Neg Pred Value : 0.8378
## Prevalence : 0.3082
## Detection Rate : 0.1849
## Detection Prevalence : 0.2397
## Balanced Accuracy : 0.7604
##
## 'Positive' Class : band
##
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 146
##
##
## | band.pred.1
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 27 | 18 | 45 |
## | 0.600 | 0.400 | 0.308 |
## | 0.771 | 0.162 | |
## | 0.185 | 0.123 | |
## -----------------------|-----------|-----------|-----------|
## noband | 8 | 93 | 101 |
## | 0.079 | 0.921 | 0.692 |
## | 0.229 | 0.838 | |
## | 0.055 | 0.637 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 35 | 111 | 146 |
## | 0.240 | 0.760 | |
## -----------------------|-----------|-----------|-----------|
##
##
For datasets with a large class imbalance (which is not the case in our example, as the band and noband classes are reasonably well balanced), the caret function train can subsample the dataset to balance the classes before model fitting. This is done by setting the option sampling=“down” in trainControl.
The corresponding confusion matrix and cross-table are shown below.
## user system elapsed
## 76.70 0.25 77.42
## Random Forest
##
## 338 samples
## 27 predictor
## 2 classes: 'band', 'noband'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 304, 304, 305, 304, 304, 304, ...
## Addtional sampling using down-sampling
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7304902 0.4589608
## 19 0.7165865 0.4260143
## 36 0.7140909 0.4200108
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Confusion Matrix and Statistics
##
## Reference
## Prediction band noband
## band 34 20
## noband 11 81
##
## Accuracy : 0.7877
## 95% CI : (0.7124, 0.8509)
## No Information Rate : 0.6918
## P-Value [Acc > NIR] : 0.006418
##
## Kappa : 0.5282
##
## Mcnemar's Test P-Value : 0.150763
##
## Sensitivity : 0.7556
## Specificity : 0.8020
## Pos Pred Value : 0.6296
## Neg Pred Value : 0.8804
## Prevalence : 0.3082
## Detection Rate : 0.2329
## Detection Prevalence : 0.3699
## Balanced Accuracy : 0.7788
##
## 'Positive' Class : band
##
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 146
##
##
## | band.pred.2
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 34 | 11 | 45 |
## | 0.756 | 0.244 | 0.308 |
## | 0.630 | 0.120 | |
## | 0.233 | 0.075 | |
## -----------------------|-----------|-----------|-----------|
## noband | 20 | 81 | 101 |
## | 0.198 | 0.802 | 0.692 |
## | 0.370 | 0.880 | |
## | 0.137 | 0.555 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 54 | 92 | 146 |
## | 0.370 | 0.630 | |
## -----------------------|-----------|-----------|-----------|
##
##
We build a neural network (NN) classifier in this case by providing the dataset free from missing values, using a single hidden layer with 2 units.
## # weights: 77
## initial value 234.064527
## final value 224.237831
## converged
We then test the model on the test dataset and obtain the misclassification table shown below. The network assigns every test observation to “noband”: all 45 band cases are misclassified, so the accuracy is 101/146 ≈ 0.69, which is no better than the no-information rate. The model therefore performs poorly, especially for predicting “banding”, compared to randomForest.
## predicted
## true noband
## band 45
## noband 101
The misclassification table can also be obtained by using the gmodels library:
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 146
##
##
## | predicted
## data$band_type[-train] | noband | Row Total |
## -----------------------|-----------|-----------|
## band | 45 | 45 |
## | 0.308 | |
## -----------------------|-----------|-----------|
## noband | 101 | 101 |
## | 0.692 | |
## -----------------------|-----------|-----------|
## Column Total | 146 | 146 |
## -----------------------|-----------|-----------|
##
##
We can now list the relative importance of the variables using a sourced function (courtesy: https://beckmw.wordpress.com/2013/08/12/variable-importance-in-neural-networks/)
## rel.imp x.names
## press_speed -1.000000000 press_speed
## press -0.865225137 press
## ink_pct -0.153081913 ink_pct
## locationnorthus -0.131619085 locationnorthus
## ink_temperature -0.125145981 ink_temperature
## viscosity -0.122927382 viscosity
## anode_space_ratio -0.115901024 anode_space_ratio
## cylinder_typeyes -0.094379009 cylinder_typeyes
## caliper -0.085273658 caliper
## solvent_pct -0.071738121 solvent_pct
## grain_screenedYES -0.062100566 grain_screenedYES
## wax -0.056329589 wax
## humidity -0.038158621 humidity
## direct_steamyes -0.037380415 direct_steamyes
## roughness -0.023136432 roughness
## current_density -0.016455654 current_density
## roller_durometer -0.005338653 roller_durometer
## cylinder_sizespiegel -0.002002103 cylinder_sizespiegel
## paper_typesuper 0.000000000 paper_typesuper
## locationsouthus 0.001943799 locationsouthus
## press_typeMotter70 0.006251217 press_typeMotter70
## solvent_typeNAPTHA 0.020177467 solvent_typeNAPTHA
## paper_typeuncoated 0.028459842 paper_typeuncoated
## locationscandanavian 0.028964029 locationscandanavian
## ink_typeuncoated 0.032644856 ink_typeuncoated
## press_typeMotter94 0.048582228 press_typeMotter94
## varnish_pct 0.051141124 varnish_pct
## cylinder_sizetabloid 0.051616850 cylinder_sizetabloid
## proof_cut 0.054414036 proof_cut
## solvent_typeXYLOL 0.059283698 solvent_typeXYLOL
## blade_pressure 0.059704615 blade_pressure
## press_typeWoodHoe70 0.099162850 press_typeWoodHoe70
## ink_typecover 0.106830765 ink_typecover
## proof_inkYES 0.126976265 proof_inkYES
## hardener 0.134731925 hardener
## locationmideuropean 0.143023053 locationmideuropean
The bar plot tells us that the variables press_speed and location (mideuropean) have the strongest negative and positive relationships, respectively, with the response variable band_type.
In decreasing order of importance (weak weights):
We next consider a decision tree, modelled using the rpart function. To grow a large tree, we set the complexity parameter (cp) to a very small value.
The output lists all the trees considered in the model, giving for each the complexity parameter, number of splits, resubstitution error rate, cross-validated error rate and the associated standard error.
##
## Classification tree:
## rpart(formula = data$band_type ~ ., data = data, method = "class",
## control = rpart.control(minsplit = 4, cp = 1e-05))
##
## Variables actually used in tree construction:
## [1] anode_space_ratio blade_pressure caliper
## [4] cylinder_type grain_screened hardener
## [7] humidity ink_pct ink_temperature
## [10] ink_type location paper_type
## [13] press press_speed press_type
## [16] proof_cut proof_ink roller_durometer
## [19] roughness solvent_pct viscosity
## [22] wax
##
## Root node error: 173/484 = 0.35744
##
## n= 484
##
## CP nsplit rel error xerror xstd
## 1 0.0693642 0 1.000000 1.00000 0.060944
## 2 0.0375723 3 0.780347 1.00578 0.061022
## 3 0.0346821 7 0.618497 0.94220 0.060100
## 4 0.0289017 8 0.583815 0.85549 0.058591
## 5 0.0231214 9 0.554913 0.82081 0.057901
## 6 0.0173410 11 0.508671 0.80925 0.057660
## 7 0.0115607 15 0.439306 0.79769 0.057413
## 8 0.0096339 24 0.335260 0.76301 0.056636
## 9 0.0092486 28 0.294798 0.76301 0.056636
## 10 0.0086705 34 0.231214 0.76301 0.056636
## 11 0.0072254 36 0.213873 0.79769 0.057413
## 12 0.0057803 40 0.184971 0.79769 0.057413
## 13 0.0038536 60 0.069364 0.80347 0.057537
## 14 0.0028902 63 0.057803 0.80347 0.057537
## 15 0.0000100 65 0.052023 0.82081 0.057901
From the complexity table above we observe that the lowest cross-validated relative error (xerror) of about 0.76 occurs at 24 to 34 splits.
To reduce the tree size we can prune the tree and repeat the cross-validation. Doing this gives us a tree size of 8, with a resubstitution error rate of ~0.75 (as before).
This again is not a good predictor due to the large error rate.
## Call:
## rpart(formula = data$band_type ~ ., data = data, method = "class",
## control = rpart.control(minsplit = 4, cp = 1e-05))
## n= 484
##
## CP nsplit rel error xerror xstd
## 1 0.06936416 0 1.0000000 1.00000 0.06094449
## 2 0.05000000 3 0.7803468 1.00578 0.06102204
##
## Variable importance
## press_type humidity press_speed press
## 31 19 17 11
## roughness cylinder_size location ink_pct
## 9 4 3 1
## solvent_type blade_pressure hardener wax
## 1 1 1 1
## ink_temperature roller_durometer
## 1 1
##
## Node number 1: 484 observations, complexity param=0.06936416
## predicted class=noband expected loss=0.357438 P(node) =1
## class counts: 173 311
## probabilities: 0.357 0.643
## left son=2 (166 obs) right son=3 (318 obs)
## Primary splits:
## press_type splits as RRRL, improve=14.034940, (0 missing)
## press_speed < 2184.5 to the left, improve=13.365950, (0 missing)
## press < 822.5 to the left, improve=13.137310, (0 missing)
## ink_type splits as RLL, improve=10.721900, (0 missing)
## grain_screened splits as RL, improve= 9.344136, (0 missing)
## Surrogate splits:
## press < 818.5 to the left, agree=0.777, adj=0.349, (0 split)
## roughness < 0.5570005 to the left, agree=0.756, adj=0.289, (0 split)
## cylinder_size splits as RLR, agree=0.698, adj=0.120, (0 split)
## humidity < 84.5 to the right, agree=0.678, adj=0.060, (0 split)
## ink_pct < 64 to the right, agree=0.674, adj=0.048, (0 split)
##
## Node number 2: 166 observations, complexity param=0.06936416
## predicted class=band expected loss=0.4759036 P(node) =0.3429752
## class counts: 87 79
## probabilities: 0.524 0.476
## left son=4 (46 obs) right son=5 (120 obs)
## Primary splits:
## press_speed < 1678 to the left, improve=7.134765, (0 missing)
## ink_pct < 62.9 to the right, improve=5.336261, (0 missing)
## humidity < 73.5 to the left, improve=5.002718, (0 missing)
## cylinder_type splits as LR, improve=4.496773, (0 missing)
## roughness < 0.5882505 to the right, improve=4.235566, (0 missing)
## Surrogate splits:
## location splits as RLRR-, agree=0.771, adj=0.174, (0 split)
## solvent_type splits as R-L, agree=0.747, adj=0.087, (0 split)
## blade_pressure < 24.5 to the left, agree=0.741, adj=0.065, (0 split)
## ink_temperature < 19.05 to the right, agree=0.735, adj=0.043, (0 split)
## roller_durometer < 29 to the left, agree=0.735, adj=0.043, (0 split)
##
## Node number 3: 318 observations
## predicted class=noband expected loss=0.2704403 P(node) =0.6570248
## class counts: 86 232
## probabilities: 0.270 0.730
##
## Node number 4: 46 observations
## predicted class=band expected loss=0.2391304 P(node) =0.09504132
## class counts: 35 11
## probabilities: 0.761 0.239
##
## Node number 5: 120 observations, complexity param=0.06936416
## predicted class=noband expected loss=0.4333333 P(node) =0.2479339
## class counts: 52 68
## probabilities: 0.433 0.567
## left son=10 (22 obs) right son=11 (98 obs)
## Primary splits:
## humidity < 75.5 to the left, improve=7.979716, (0 missing)
## ink_pct < 62.9 to the right, improve=6.248649, (0 missing)
## anode_space_ratio < 99.34579 to the left, improve=4.158972, (0 missing)
## press_speed < 2184.5 to the left, improve=3.856410, (0 missing)
## wax < 1.725 to the left, improve=3.350725, (0 missing)
## Surrogate splits:
## press_speed < 2450 to the right, agree=0.833, adj=0.091, (0 split)
## wax < 0.85 to the left, agree=0.825, adj=0.045, (0 split)
## hardener < 0.25 to the left, agree=0.825, adj=0.045, (0 split)
##
## Node number 10: 22 observations
## predicted class=band expected loss=0.1818182 P(node) =0.04545455
## class counts: 18 4
## probabilities: 0.818 0.182
##
## Node number 11: 98 observations
## predicted class=noband expected loss=0.3469388 P(node) =0.2024793
## class counts: 34 64
## probabilities: 0.347 0.653
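The CP table can be checked by hand against the leaf nodes printed above. The following base R sketch recomputes the relative error from the class counts of the four leaves (nodes 3, 4, 10 and 11) and rescales the cross-validated relative error into an absolute error rate. The variable names are illustrative and are not taken from the original analysis.

```r
# Class counts (band, noband) in the four leaves of the printed tree.
leaves <- rbind(
  node3  = c(band = 86, noband = 232),  # predicted noband
  node4  = c(band = 35, noband = 11),   # predicted band
  node10 = c(band = 18, noband = 4),    # predicted band
  node11 = c(band = 34, noband = 64)    # predicted noband
)

n_total     <- sum(leaves)                 # all observations in the tree
root_errors <- 173                         # root predicts noband, so every band row is wrong
leaf_errors <- sum(apply(leaves, 1, min))  # each leaf misclassifies its minority class

rel_error <- leaf_errors / root_errors     # "rel error" column at nsplit = 3
root_rate <- root_errors / n_total         # expected loss at node 1
cv_rate   <- 1.00578 * root_rate           # xerror rescaled to an absolute error rate

print(round(c(rel_error = rel_error, root_rate = root_rate, cv_rate = cv_rate), 4))
```

The rescaled cross-validated rate comes out at roughly the same value as the error of always predicting noband, which is why the tree adds little predictive value despite its lower training error.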
The variable importance table is given below:
##                     Overall
## anode_space_ratio 46.792279
## blade_pressure    22.085604
## caliper           17.474892
## current_density   15.176716
## cylinder_size     20.076596
## cylinder_type     14.783200
## grain_screened    25.005585
## hardener          28.849537
## humidity          50.815463
## ink_pct           50.793496
## ink_temperature   44.141707
## ink_type          31.912385
## location          26.939384
## paper_type         2.609524
## press             43.517804
## press_speed       73.276042
## press_type        30.182396
## proof_cut         41.222215
## proof_ink          2.100000
## roller_durometer  22.586269
## roughness         30.553411
## solvent_pct       29.796049
## varnish_pct       22.285894
## viscosity         49.165163
## wax               23.305868
## direct_steam       0.000000
## solvent_type       0.000000
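In decreasing order of importance, the top-ranked variables in the table above are press_speed (73.3), humidity (50.8), ink_pct (50.8), viscosity (49.2) and anode_space_ratio (46.8).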
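Because the importance table is printed in alphabetical order, the ranking has to be obtained by sorting it. A minimal base R sketch, with the scores copied from the table above and the object names chosen only for illustration:

```r
# "Overall" importance scores copied from the table above.
importance <- c(
  anode_space_ratio = 46.792279, blade_pressure   = 22.085604,
  caliper           = 17.474892, current_density  = 15.176716,
  cylinder_size     = 20.076596, cylinder_type    = 14.783200,
  grain_screened    = 25.005585, hardener         = 28.849537,
  humidity          = 50.815463, ink_pct          = 50.793496,
  ink_temperature   = 44.141707, ink_type         = 31.912385,
  location          = 26.939384, paper_type       =  2.609524,
  press             = 43.517804, press_speed      = 73.276042,
  press_type        = 30.182396, proof_cut        = 41.222215,
  proof_ink         =  2.100000, roller_durometer = 22.586269,
  roughness         = 30.553411, solvent_pct      = 29.796049,
  varnish_pct       = 22.285894, viscosity        = 49.165163,
  wax               = 23.305868, direct_steam     =  0.000000,
  solvent_type      =  0.000000
)

# Sort from most to least important and show the leading entries.
ranked <- sort(importance, decreasing = TRUE)
print(head(ranked, 5))
```

How many variables to keep from the sorted list is a judgment call; the cut-off of five used in `head()` is an assumption of this sketch.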
Next, we use a support vector machine (SVM) to build a classifier and to extract a feature importance list.
First we use a cost of 10 (fairly large) with a radial kernel and gamma = 0.1. A large cost penalizes misclassified training points heavily, so the classifier tolerates few margin violations and the resulting margin is narrow.
##
## Call:
## svm(formula = data[train, ]$band_type ~ ., data = data[train,
## ], method = "C-classification", kernel = "radial", cost = 10,
## gamma = 0.1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10
##
## Number of Support Vectors: 301
##
## ( 179 122 )
##
##
## Number of Classes: 2
##
## Levels:
## band noband
## [1] 1 2 4 5 6 10 12 14 17 19 22 24 25 27 28 30 32
## [18] 33 36 47 48 49 50 51 52 53 55 56 57 58 59 60 63 65
## [35] 67 69 70 72 73 75 76 78 81 82 84 85 86 87 88 91 93
## [52] 94 95 96 97 98 101 102 103 105 110 112 113 115 118 121 123 127
## [69] 128 129 134 135 136 137 138 140 141 142 143 146 147 149 150 152 153
## [86] 156 160 162 165 166 167 168 169 175 176 177 179 180 184 186 189 190
## [103] 191 196 197 198 199 200 201 202 205 206 208 209 212 214 215 216 218
## [120] 225 226 228 230 232 233 234 235 237 240 243 244 245 246 247 249 250
## [137] 251 253 254 255 257 258 259 261 267 271 272 273 275 276 281 282 283
## [154] 285 286 287 291 292 294 300 302 304 307 309 310 312 314 315 317 318
## [171] 319 320 330 331 332 333 336 337 338 3 7 8 11 13 15 16 18
## [188] 20 21 23 26 29 31 35 37 38 39 41 43 44 45 54 61 62
## [205] 71 77 79 80 83 89 92 100 104 106 107 108 111 114 120 122 124
## [222] 125 126 130 131 132 133 139 144 145 148 151 154 158 159 163 170 171
## [239] 173 174 178 183 185 187 192 194 195 203 207 210 211 213 219 220 221
## [256] 224 227 229 231 236 238 241 242 248 252 256 260 262 263 264 265 266
## [273] 268 270 274 277 278 280 284 288 289 290 293 295 296 297 298 299 301
## [290] 303 308 311 313 321 322 324 326 327 328 329 334
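The output above reports 301 support vectors (179 and 122 per class) followed by their row positions in the training set. The sketch below shows how a fit of this kind and those quantities can be produced with the e1071 package. It is only an illustration: the data are synthetic, the two predictors and the 140/60 split are assumptions, and the SVM type is passed through the `type` argument of `svm()`.

```r
library(e1071)  # assumed to be installed; provides svm()

set.seed(1)
n <- 200
# Hypothetical stand-in data: column names mirror the report, values are synthetic.
toy <- data.frame(
  press_speed = rnorm(n, mean = 1900, sd = 300),
  humidity    = rnorm(n, mean = 78, sd = 8)
)
score <- toy$press_speed + 20 * toy$humidity + rnorm(n, sd = 200)
toy$band_type <- factor(ifelse(score < 3400, "band", "noband"))

train <- sample(n, 140)  # row indices of the training set

# Radial-kernel C-classification with the same cost and gamma as in the report.
fit <- svm(band_type ~ ., data = toy[train, ],
           type = "C-classification", kernel = "radial",
           cost = 10, gamma = 0.1)

print(fit$nSV)          # number of support vectors per class
print(head(fit$index))  # positions of the support vectors within the training rows

# Predict on the held-out rows and cross-tabulate against the true labels.
pred <- predict(fit, newdata = toy[-train, ])
print(table(Prediction = pred, Reference = toy$band_type[-train]))
```

A large share of training rows ending up as support vectors, as in the report, usually indicates that the classes overlap considerably in the feature space.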
Prediction using SVM: on the 146 held-out observations, the model reaches an accuracy of 0.82 with a kappa of 0.55.
## Confusion Matrix and Statistics
##
## Reference
## Prediction band noband
## band 28 10
## noband 17 91
##
## Accuracy : 0.8151
## 95% CI : (0.7425, 0.8744)
## No Information Rate : 0.6918
## P-Value [Acc > NIR] : 0.0005379
##
## Kappa : 0.5468
##
## Mcnemar's Test P-Value : 0.2482131
##
## Sensitivity : 0.6222
## Specificity : 0.9010
## Pos Pred Value : 0.7368
## Neg Pred Value : 0.8426
## Prevalence : 0.3082
## Detection Rate : 0.1918
## Detection Prevalence : 0.2603
## Balanced Accuracy : 0.7616
##
## 'Positive' Class : band
##
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 146
##
##
## | svm.pred.2
## data$band_type[-train] | band | noband | Row Total |
## -----------------------|-----------|-----------|-----------|
## band | 28 | 17 | 45 |
## | 0.622 | 0.378 | 0.308 |
## | 0.737 | 0.157 | |
## | 0.192 | 0.116 | |
## -----------------------|-----------|-----------|-----------|
## noband | 10 | 91 | 101 |
## | 0.099 | 0.901 | 0.692 |
## | 0.263 | 0.843 | |
## | 0.068 | 0.623 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 38 | 108 | 146 |
## | 0.260 | 0.740 | |
## -----------------------|-----------|-----------|-----------|
##
##
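The headline statistics can be recomputed directly from the four cell counts of the confusion matrix. The following base R sketch does this for accuracy, Cohen's kappa, sensitivity, specificity and balanced accuracy, treating band as the positive class as in the output above. The object names are illustrative.

```r
# Test-set confusion matrix from the output above (rows: prediction, columns: reference).
cm <- matrix(c(28, 17, 10, 91), nrow = 2,
             dimnames = list(Prediction = c("band", "noband"),
                             Reference  = c("band", "noband")))

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals.
expected   <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_stat <- (accuracy - expected) / (1 - expected)

# "band" is the positive class.
sensitivity <- cm["band", "band"] / sum(cm[, "band"])
specificity <- cm["noband", "noband"] / sum(cm[, "noband"])
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, kappa = kappa_stat,
              sensitivity = sensitivity, specificity = specificity,
              balanced = balanced), 4))
```

The sensitivity is the weakest figure here: 17 of the 45 true band cases in the test set are predicted as noband, which matters because missed bands are the costly outcome for the plant.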
Variable importance is not implemented in caret for “svm”, so we will use the rminer package to obtain the variable importance matrix (VIM).
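As a model-agnostic alternative, variable importance for an SVM can also be estimated by permutation: shuffle one predictor at a time in the held-out data and measure how much the accuracy drops. The sketch below illustrates that idea with e1071 on synthetic data. It is not the rminer procedure, and the predictors `signal` and `noise`, the split sizes and the 20 repetitions are all assumptions made for the example.

```r
library(e1071)  # assumed to be installed; provides svm()

set.seed(42)
n <- 300
toy <- data.frame(
  signal = rnorm(n),  # hypothetical predictor that drives the class
  noise  = rnorm(n)   # hypothetical predictor unrelated to the class
)
toy$band_type <- factor(ifelse(toy$signal + rnorm(n, sd = 0.3) > 0, "noband", "band"))

train <- sample(n, 200)
fit   <- svm(band_type ~ ., data = toy[train, ],
             kernel = "radial", cost = 10, gamma = 0.1)

test     <- toy[-train, ]
accuracy <- function(d) mean(predict(fit, newdata = d) == d$band_type)
base_acc <- accuracy(test)

# Mean drop in test accuracy after shuffling each predictor 20 times.
perm_importance <- sapply(c("signal", "noise"), function(v) {
  drops <- replicate(20, {
    shuffled <- test
    shuffled[[v]] <- sample(shuffled[[v]])
    base_acc - accuracy(shuffled)
  })
  mean(drops)
})

print(sort(perm_importance, decreasing = TRUE))
```

A predictor whose shuffling barely changes the accuracy contributes little to the fitted model, whereas a large drop marks a variable the classifier depends on.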
In this project we cleaned the given dataset in five different ways and applied four prediction models to each cleaned version. The table below summarizes the results of these tests.
The most important variables that could contribute to banding are given for each method. Note that we do not report feature importance for the SVM because of the complexity involved.
Summarized Results
We started with 39 variables as potential causes of banding. Using the different methods, we have narrowed these down to the few variables that are most likely to contribute to banding.
To make our selection, we took the variables that occur most frequently across all the methods.
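The frequency-based selection can be sketched as a simple tally over the top-ranked variables of each method. The example below uses only the two rankings visible in this section (the variable importance from the tree summary and the "Overall" table), each cut at five variables. Both the cut-off and the restriction to two lists are assumptions of this sketch, so its output is an illustration and not the final selection.

```r
# Top-five variables from the two rankings shown in this section.
top_lists <- list(
  tree_summary  = c("press_type", "humidity", "press_speed", "press", "roughness"),
  overall_table = c("press_speed", "humidity", "ink_pct", "viscosity", "anode_space_ratio")
)

# Count how often each variable appears across the lists.
counts <- sort(table(unlist(top_lists)), decreasing = TRUE)
print(counts)

# Variables that appear in the largest number of lists.
shared <- names(counts)[counts == max(counts)]
print(shared)
```

With more methods and cleaning strategies, the same tally extends by adding one named entry per ranking to `top_lists`.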
The four variables, in descending order of importance as causes of banding, are given below:
Management is therefore advised to monitor and control these variables closely during the Rotogravure printing process.