Friday, August 31, 2014

1. About this presentation file

This presentation file on machine learning algorithms has been uploaded to my homepage. If you want to see it, please visit http://www.libcell.com. Additionally, it may be converted into HTML, PDF, and MS Word document formats.

After accessing my website, you may click the last button at the top, "TASK".

Finally, you will find the presentation file via the link named "2014.08.22::Machine Learning Algorithms including DT, RF & SVM" (http://rpubs.com/libcell/drs).

2. Common algorithms in ML

Clustering Algorithms

  • K-means / K-medoids (see the k-means sketch after this list)
  • Hierarchical clustering (HCA)
  • Self-organizing map (SOM)
  • Density-based clustering (DBSCAN)
  • Expectation maximization (EM)
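
As a quick taste of the first item above, here is a minimal k-means sketch on the iris data set (the same data used throughout this file); base R's kmeans() suffices, and the cluster numbering is arbitrary:

# k-means with 3 centers on the four iris measurements
set.seed(10)
km <- kmeans(iris[, 1:4], centers=3, nstart=20)
table(km$cluster, iris$Species) # compare clusters against the true species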

Classifying Algorithms

  • LDA/QDA, NB, KNN/wKNN (see the kNN sketch after this list)
  • Decision tree (DT) & Random forest (RF)
  • Support vector machine (SVM)
  • Artificial neural network (ANN)
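
Likewise, a minimal kNN sketch with the class package (shipped with standard R installations); the split and k are illustrative:

library(class) # kNN implementation
set.seed(7)
idx <- sample(nrow(iris), 100) # simple 100/50 train/test split
pred_knn <- knn(train=iris[idx, 1:4], test=iris[-idx, 1:4],
                cl=iris$Species[idx], k=5)
table(pred_knn, iris$Species[-idx]) # confusion matrix on held-out flowers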

3. Decision Tree Algorithm

A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents an outcome of the test, and each leaf node represents a class label (the decision taken after computing all attributes). The paths from root to leaf represent classification rules.

In decision analysis, a decision tree and the closely related influence diagram are used as visual and analytical decision support tools, in which the expected values (or expected utility) of competing alternatives are calculated.

A decision tree consists of 3 types of nodes:

  • Decision nodes - commonly represented by squares
  • Chance nodes - represented by circles
  • End nodes - represented by triangles

1). Example of a decision tree.

2). Then, the samples can be divided into several zones.

3). So, this model may be used for sample classification.

  • For another example:

4. Implementing DT in R language

  • 1). Introduction to the data set, iris
data(iris) # loading the data set
summary(iris) # showing data's general characteristics
##   Sepal.Length   Sepal.Width    Petal.Length   Petal.Width 
##  Min.   :4.30   Min.   :2.00   Min.   :1.00   Min.   :0.1  
##  1st Qu.:5.10   1st Qu.:2.80   1st Qu.:1.60   1st Qu.:0.3  
##  Median :5.80   Median :3.00   Median :4.35   Median :1.3  
##  Mean   :5.84   Mean   :3.06   Mean   :3.76   Mean   :1.2  
##  3rd Qu.:6.40   3rd Qu.:3.30   3rd Qu.:5.10   3rd Qu.:1.8  
##  Max.   :7.90   Max.   :4.40   Max.   :6.90   Max.   :2.5  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

dim(iris) # displaying dimension of the data set
## [1] 150   5
head(iris) # displaying the first six rows
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

class(iris)
## [1] "data.frame"
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
table(iris$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50

  • 2). Visualization of the data set, iris
cols <- ifelse (iris$Species == "setosa", 3, 
                ifelse (iris$Species == "versicolor", 2, 4)) 
plot(iris$Petal.Width, iris$Petal.Length, col=cols)
abline(h=2.4, lty=2)
abline(v=0.75, lty=2)

[Plot: Petal.Length vs Petal.Width colored by species, with dashed partition lines at Petal.Width=0.75 and Petal.Length=2.4]

  • 3). The first DT algorithm, CART
# computing the number of test samples to draw from each of the three species
a <- round(1/4*sum(iris$Species == "setosa"))
b <- round(1/4*sum(iris$Species == "versicolor"))
c <- round(1/4*sum(iris$Species == "virginica"))
a; b; c
## [1] 12
## [1] 12
## [1] 12

# stratified sampling to produce the training set and testing set
require(sampling)
## Loading required package: sampling
sub <- strata(iris, stratanames="Species", size=c(a, b, c), method="srswor")
sub
##        Species ID_unit Prob Stratum
## 8       setosa       8 0.24       1
## 10      setosa      10 0.24       1
## 13      setosa      13 0.24       1
## 15      setosa      15 0.24       1
## 17      setosa      17 0.24       1
## 19      setosa      19 0.24       1
## 21      setosa      21 0.24       1
## 22      setosa      22 0.24       1
## 27      setosa      27 0.24       1
## 33      setosa      33 0.24       1
## 35      setosa      35 0.24       1
## 44      setosa      44 0.24       1
## 53  versicolor      53 0.24       2
## 55  versicolor      55 0.24       2
## 57  versicolor      57 0.24       2
## 60  versicolor      60 0.24       2
## 62  versicolor      62 0.24       2
## 63  versicolor      63 0.24       2
## 66  versicolor      66 0.24       2
## 72  versicolor      72 0.24       2
## 74  versicolor      74 0.24       2
## 75  versicolor      75 0.24       2
## 80  versicolor      80 0.24       2
## 85  versicolor      85 0.24       2
## 102  virginica     102 0.24       3
## 104  virginica     104 0.24       3
## 106  virginica     106 0.24       3
## 107  virginica     107 0.24       3
## 109  virginica     109 0.24       3
## 110  virginica     110 0.24       3
## 114  virginica     114 0.24       3
## 134  virginica     134 0.24       3
## 135  virginica     135 0.24       3
## 140  virginica     140 0.24       3
## 142  virginica     142 0.24       3
## 144  virginica     144 0.24       3

train_iris <- iris[-sub$ID_unit, ] # generating the training set
test_iris <- iris[sub$ID_unit, ] # generating the testing set
summary(train_iris)
##   Sepal.Length   Sepal.Width    Petal.Length   Petal.Width 
##  Min.   :4.30   Min.   :2.00   Min.   :1.00   Min.   :0.1  
##  1st Qu.:5.10   1st Qu.:2.80   1st Qu.:1.52   1st Qu.:0.3  
##  Median :5.75   Median :3.00   Median :4.35   Median :1.3  
##  Mean   :5.82   Mean   :3.05   Mean   :3.76   Mean   :1.2  
##  3rd Qu.:6.40   3rd Qu.:3.30   3rd Qu.:5.10   3rd Qu.:1.8  
##  Max.   :7.90   Max.   :4.40   Max.   :6.90   Max.   :2.5  
##        Species  
##  setosa    :38  
##  versicolor:38  
##  virginica :38  
##                 
##                 
## 
summary(test_iris)
##   Sepal.Length   Sepal.Width    Petal.Length   Petal.Width  
##  Min.   :4.80   Min.   :2.20   Min.   :1.20   Min.   :0.10  
##  1st Qu.:5.20   1st Qu.:2.80   1st Qu.:1.60   1st Qu.:0.40  
##  Median :5.85   Median :3.05   Median :4.35   Median :1.40  
##  Mean   :5.91   Mean   :3.09   Mean   :3.76   Mean   :1.19  
##  3rd Qu.:6.42   3rd Qu.:3.40   3rd Qu.:5.10   3rd Qu.:1.73  
##  Max.   :7.60   Max.   :4.10   Max.   :6.60   Max.   :2.50  
##        Species  
##  setosa    :12  
##  versicolor:12  
##  virginica :12  
##                 
##                 
## 
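
If the sampling package is unavailable, an equivalent stratified split can be sketched in base R (the object names train_iris2/test_iris2 are hypothetical; the rows drawn will differ from the strata() call above):

# base-R stratified sampling: 1/4 of each species, without replacement
set.seed(42)
test_idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                          function(i) sample(i, length(i) %/% 4)))
train_iris2 <- iris[-test_idx, ]
test_iris2 <- iris[test_idx, ]
table(test_iris2$Species) # 12 flowers per species, as above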

require(rpart)
## Loading required package: rpart
formula_iris_Reg <- Species ~ . # setting the model formula
rp_iris_Reg <- rpart(formula_iris_Reg, train_iris, method="class") # constructing the tree
print(rp_iris_Reg) # exporting the basic information of DT
## n= 114 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 114 76 setosa (0.33333 0.33333 0.33333)  
##   2) Petal.Length< 2.45 38  0 setosa (1.00000 0.00000 0.00000) *
##   3) Petal.Length>=2.45 76 38 versicolor (0.00000 0.50000 0.50000)  
##     6) Petal.Width< 1.75 39  2 versicolor (0.00000 0.94872 0.05128) *
##     7) Petal.Width>=1.75 37  1 virginica (0.00000 0.02703 0.97297) *
printcp(rp_iris_Reg) # exporting the cp table of DT
## 
## Classification tree:
## rpart(formula = formula_iris_Reg, data = train_iris, method = "class")
## 
## Variables actually used in tree construction:
## [1] Petal.Length Petal.Width 
## 
## Root node error: 76/114 = 0.67
## 
## n= 114 
## 
##     CP nsplit rel error xerror  xstd
## 1 0.50      0     1.000  1.211 0.055
## 2 0.46      1     0.500  0.750 0.070
## 3 0.01      2     0.039  0.039 0.022

summary(rp_iris_Reg) # exporting the detailed information of DT
## Call:
## rpart(formula = formula_iris_Reg, data = train_iris, method = "class")
##   n= 114 
## 
##       CP nsplit rel error  xerror    xstd
## 1 0.5000      0   1.00000 1.21053 0.05544
## 2 0.4605      1   0.50000 0.75000 0.07024
## 3 0.0100      2   0.03947 0.03947 0.02249
## 
## Variable importance
##  Petal.Width Petal.Length Sepal.Length  Sepal.Width 
##           33           31           22           13 
## 
## Node number 1: 114 observations,    complexity param=0.5
##   predicted class=setosa      expected loss=0.6667  P(node) =1
##     class counts:    38    38    38
##    probabilities: 0.333 0.333 0.333 
##   left son=2 (38 obs) right son=3 (76 obs)
##   Primary splits:
##       Petal.Length < 2.45 to the left,  improve=38.00, (0 missing)
##       Petal.Width  < 0.75 to the left,  improve=38.00, (0 missing)
##       Sepal.Length < 5.45 to the left,  improve=28.61, (0 missing)
##       Sepal.Width  < 3.05 to the right, improve=12.69, (0 missing)
##   Surrogate splits:
##       Petal.Width  < 0.75 to the left,  agree=1.000, adj=1.000, (0 split)
##       Sepal.Length < 5.45 to the left,  agree=0.939, adj=0.816, (0 split)
##       Sepal.Width  < 3.35 to the right, agree=0.816, adj=0.447, (0 split)
## 
## Node number 2: 38 observations
##   predicted class=setosa      expected loss=0  P(node) =0.3333
##     class counts:    38     0     0
##    probabilities: 1.000 0.000 0.000 
## 
## Node number 3: 76 observations,    complexity param=0.4605
##   predicted class=versicolor  expected loss=0.5  P(node) =0.6667
##     class counts:     0    38    38
##    probabilities: 0.000 0.500 0.500 
##   left son=6 (39 obs) right son=7 (37 obs)
##   Primary splits:
##       Petal.Width  < 1.75 to the left,  improve=32.260, (0 missing)
##       Petal.Length < 4.75 to the left,  improve=29.160, (0 missing)
##       Sepal.Length < 6.15 to the left,  improve=10.640, (0 missing)
##       Sepal.Width  < 2.65 to the left,  improve= 4.584, (0 missing)
##   Surrogate splits:
##       Petal.Length < 4.75 to the left,  agree=0.921, adj=0.838, (0 split)
##       Sepal.Length < 6.15 to the left,  agree=0.750, adj=0.486, (0 split)
##       Sepal.Width  < 2.95 to the left,  agree=0.684, adj=0.351, (0 split)
## 
## Node number 6: 39 observations
##   predicted class=versicolor  expected loss=0.05128  P(node) =0.3421
##     class counts:     0    37     2
##    probabilities: 0.000 0.949 0.051 
## 
## Node number 7: 37 observations
##   predicted class=virginica   expected loss=0.02703  P(node) =0.3246
##     class counts:     0     1    36
##    probabilities: 0.000 0.027 0.973

Rolling back the DT construction (parameter tuning)

formula_iris_Reg <- Species ~ . # setting the model formula
rp_iris_Reg1 <- rpart(formula_iris_Reg, train_iris, method="class", 
                      minsplit=10) # re-constructing the tree, changing minsplit from 20 (default) to 10
print(rp_iris_Reg1) # exporting the basic information of DT
## n= 114 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 114 76 setosa (0.33333 0.33333 0.33333)  
##   2) Petal.Length< 2.45 38  0 setosa (1.00000 0.00000 0.00000) *
##   3) Petal.Length>=2.45 76 38 versicolor (0.00000 0.50000 0.50000)  
##     6) Petal.Width< 1.75 39  2 versicolor (0.00000 0.94872 0.05128) *
##     7) Petal.Width>=1.75 37  1 virginica (0.00000 0.02703 0.97297) *

printcp(rp_iris_Reg1) # exporting the cp table of DT
## 
## Classification tree:
## rpart(formula = formula_iris_Reg, data = train_iris, method = "class", 
##     minsplit = 10)
## 
## Variables actually used in tree construction:
## [1] Petal.Length Petal.Width 
## 
## Root node error: 76/114 = 0.67
## 
## n= 114 
## 
##     CP nsplit rel error xerror  xstd
## 1 0.50      0     1.000  1.237 0.053
## 2 0.46      1     0.500  0.750 0.070
## 3 0.01      2     0.039  0.039 0.022

rp_iris_Reg2 <- rpart(formula_iris_Reg, train_iris, method="class", 
                      cp=0.1) # re-constructing the tree, changing cp from 0.01 (default) to 0.1
print(rp_iris_Reg2) # exporting the basic information of DT
## n= 114 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 114 76 setosa (0.33333 0.33333 0.33333)  
##   2) Petal.Length< 2.45 38  0 setosa (1.00000 0.00000 0.00000) *
##   3) Petal.Length>=2.45 76 38 versicolor (0.00000 0.50000 0.50000)  
##     6) Petal.Width< 1.75 39  2 versicolor (0.00000 0.94872 0.05128) *
##     7) Petal.Width>=1.75 37  1 virginica (0.00000 0.02703 0.97297) *

rp_iris_Reg3 <- prune.rpart(rp_iris_Reg, cp=0.1) # directly pruning the fitted tree at cp=0.1
print(rp_iris_Reg3)
## n= 114 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 114 76 setosa (0.33333 0.33333 0.33333)  
##   2) Petal.Length< 2.45 38  0 setosa (1.00000 0.00000 0.00000) *
##   3) Petal.Length>=2.45 76 38 versicolor (0.00000 0.50000 0.50000)  
##     6) Petal.Width< 1.75 39  2 versicolor (0.00000 0.94872 0.05128) *
##     7) Petal.Width>=1.75 37  1 virginica (0.00000 0.02703 0.97297) *
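
Rather than fixing cp by hand, a common rule is to prune at the cp value with the smallest cross-validated error in the cp table (a short sketch on the tree fitted above; object names are illustrative):

cp_tab <- rp_iris_Reg$cptable # the same table printed by printcp()
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]
rp_iris_best <- prune(rp_iris_Reg, cp=best_cp) # prune at the CV-optimal cp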

rp_iris_Reg4 <- rpart(formula_iris_Reg, train_iris, method="class", 
                      maxdepth=2) # re-constructing the tree, limiting the depth to 2
print(rp_iris_Reg4) # exporting the basic information of DT
## n= 114 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 114 76 setosa (0.33333 0.33333 0.33333)  
##   2) Petal.Length< 2.45 38  0 setosa (1.00000 0.00000 0.00000) *
##   3) Petal.Length>=2.45 76 38 versicolor (0.00000 0.50000 0.50000)  
##     6) Petal.Width< 1.75 39  2 versicolor (0.00000 0.94872 0.05128) *
##     7) Petal.Width>=1.75 37  1 virginica (0.00000 0.02703 0.97297) *

Visualization of the tree results

rp_iris_plot <- rpart(formula_iris_Reg, train_iris, method="class", 
                      minsplit=2) # re-constructing the tree, changing minsplit from 20 to 2
print(rp_iris_plot)
## n= 114 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 114 76 setosa (0.33333 0.33333 0.33333)  
##    2) Petal.Length< 2.45 38  0 setosa (1.00000 0.00000 0.00000) *
##    3) Petal.Length>=2.45 76 38 versicolor (0.00000 0.50000 0.50000)  
##      6) Petal.Width< 1.75 39  2 versicolor (0.00000 0.94872 0.05128)  
##       12) Sepal.Length< 7.1 38  1 versicolor (0.00000 0.97368 0.02632) *
##       13) Sepal.Length>=7.1 1  0 virginica (0.00000 0.00000 1.00000) *
##      7) Petal.Width>=1.75 37  1 virginica (0.00000 0.02703 0.97297) *

# install.packages("rpart.plot")
library(rpart.plot)

par(mfrow=c(2,2))
rpart.plot(rp_iris_plot, type=1); rpart.plot(rp_iris_plot, type=2);
rpart.plot(rp_iris_plot, type=3); rpart.plot(rp_iris_plot, type=4)

[Plot: the tree drawn four times with rpart.plot, type=1 to type=4]

par(mfrow=c(1,1))

prp(rp_iris_plot, extra=6, box.col=c("pink", "palegreen3", "lightblue")[rp_iris_plot$frame$yval])
## Warning: extra=6 but the response has 3 levels (only the 2nd level is
## displayed)

[Plot: prp rendering of the tree with colored class boxes]
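
The warning means extra=6 suits two-level responses; to display all three class probabilities, extra=4 can be used instead (a sketch, following rpart.plot's documented extra codes):

prp(rp_iris_plot, extra=4, # extra=4: probability per class in each node
    box.col=c("pink", "palegreen3", "lightblue")[rp_iris_plot$frame$yval])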

Prediction on the testing set

pre_iris_Reg <- predict(rp_iris_Reg, test_iris, method="class") # note: predict.rpart takes type=, not method=, so this call returns class probabilities
pre_iris_Reg
##     setosa versicolor virginica
## 8        1    0.00000   0.00000
## 10       1    0.00000   0.00000
## 13       1    0.00000   0.00000
## 15       1    0.00000   0.00000
## 17       1    0.00000   0.00000
## 19       1    0.00000   0.00000
## 21       1    0.00000   0.00000
## 22       1    0.00000   0.00000
## 27       1    0.00000   0.00000
## 33       1    0.00000   0.00000
## 35       1    0.00000   0.00000
## 44       1    0.00000   0.00000
## 53       0    0.94872   0.05128
## 55       0    0.94872   0.05128
## 57       0    0.94872   0.05128
## 60       0    0.94872   0.05128
## 62       0    0.94872   0.05128
## 63       0    0.94872   0.05128
## 66       0    0.94872   0.05128
## 72       0    0.94872   0.05128
## 74       0    0.94872   0.05128
## 75       0    0.94872   0.05128
## 80       0    0.94872   0.05128
## 85       0    0.94872   0.05128
## 102      0    0.02703   0.97297
## 104      0    0.02703   0.97297
## 106      0    0.02703   0.97297
## 107      0    0.94872   0.05128
## 109      0    0.02703   0.97297
## 110      0    0.02703   0.97297
## 114      0    0.02703   0.97297
## 134      0    0.94872   0.05128
## 135      0    0.94872   0.05128
## 140      0    0.02703   0.97297
## 142      0    0.02703   0.97297
## 144      0    0.02703   0.97297
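
Since the call above returned class probabilities (see the note in the code), a confusion matrix needs the predicted labels, obtained with type="class" (a sketch):

pre_cls <- predict(rp_iris_Reg, test_iris, type="class") # predicted labels
table(observed=test_iris$Species, predicted=pre_cls) # confusion matrix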

  • 4). Another DT algorithm, C4.5, via the RWeka, party and partykit packages
library(RWeka)
library(party)
## Loading required package: grid
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: sandwich
## Loading required package: strucchange
## Loading required package: modeltools
## Loading required package: stats4
library(partykit)
## 
## Attaching package: 'partykit'
## 
## The following objects are masked from 'package:party':
## 
##     ctree, ctree_control, edge_simple, mob, mob_control,
##     node_barplot, node_bivplot, node_boxplot, node_inner,
##     node_surv, node_terminal
## 
## The following object is masked from 'package:grid':
## 
##     depth
oldpar=par(mar=c(3,3,1.5,1),mgp=c(1.5,0.5,0),cex=0.3)
data(iris)

m1<-J48(Species~Petal.Width+Petal.Length,data=iris)
table(iris$Species,predict(m1))
##             
##              setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         49         1
##   virginica       0          2        48
write_to_dot(m1)
## digraph J48Tree {
## N0 [label="Petal.Width" ]
## N0->N1 [label="<= 0.6"]
## N1 [label="setosa (50.0)" shape=box style=filled ]
## N0->N2 [label="> 0.6"]
## N2 [label="Petal.Width" ]
## N2->N3 [label="<= 1.7"]
## N3 [label="Petal.Length" ]
## N3->N4 [label="<= 4.9"]
## N4 [label="versicolor (48.0/1.0)" shape=box style=filled ]
## N3->N5 [label="> 4.9"]
## N5 [label="Petal.Width" ]
## N5->N6 [label="<= 1.5"]
## N6 [label="virginica (3.0)" shape=box style=filled ]
## N5->N7 [label="> 1.5"]
## N7 [label="versicolor (3.0/1.0)" shape=box style=filled ]
## N2->N8 [label="> 1.7"]
## N8 [label="virginica (46.0/1.0)" shape=box style=filled ]
## }
# if (require("party", quietly=TRUE)) plot(m1)

if (require("party", quietly=TRUE)) plot(m1)

[Plot: the J48 tree rendered with party's plot method]

5. Random Forest

Random forests are an ensemble learning method for classification (and regression) that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by individual trees.

Random Forests grows many classification trees. To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
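
This voting can be inspected directly: randomForest's predict() exposes each tree's vote via predict.all=TRUE (a minimal sketch; the model, seed and ntree are illustrative):

library(randomForest)
set.seed(2)
rf <- randomForest(Species ~ ., data=iris, ntree=15)
votes <- predict(rf, iris[1, ], predict.all=TRUE)
votes$individual # one vote per tree for the first flower
votes$aggregate # the majority-vote class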

It is one of several ensemble classifiers, alongside Boosting, Bagging and others, and an excellent ML method.

Boosting > Bagging > Classification tree

  • Algorithm comparison among DT, NB, Bagging and Boosting methods.

Bagging method - e.g. the bootstrap aggregating algorithm (see the sketch below)

Boosting method - e.g. the AdaBoost algorithm
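
A concrete sketch of bagging (all names are illustrative): bootstrap several rpart trees and take a majority vote.

library(rpart)
set.seed(1)
B <- 25 # number of bootstrap trees
trees <- lapply(1:B, function(b) {
  boot <- iris[sample(nrow(iris), replace=TRUE), ] # bootstrap resample
  rpart(Species ~ ., data=boot, method="class")
})
tree_votes <- sapply(trees, function(t) as.character(predict(t, iris, type="class")))
pred_bag <- apply(tree_votes, 1, function(v) names(which.max(table(v)))) # majority vote
mean(pred_bag == iris$Species) # resubstitution accuracy of the ensemble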

  • Implementing in R Language
# install.packages("randomForest")
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
set.seed(4)
iris_rf <- randomForest(Species~., data=iris, ntree=1000, impartance=TRUE) # note: "impartance" is a typo for importance; as written it is silently ignored, so importance() below reports only MeanDecreaseGini
importance(iris_rf)
##              MeanDecreaseGini
## Sepal.Length            9.942
## Sepal.Width             2.247
## Petal.Length           43.769
## Petal.Width            43.311

print(iris_rf)
## 
## Call:
##  randomForest(formula = Species ~ ., data = iris, ntree = 1000,      impartance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 4.67%
## Confusion matrix:
##            setosa versicolor virginica class.error
## setosa         50          0         0        0.00
## versicolor      0         47         3        0.06
## virginica       0          4        46        0.08

set.seed(1)
iris_rf1 <- randomForest(Species~., data=iris, proximity=TRUE)
MDSplot(iris_rf1, iris$Species, palette=rep(1, 3), 
        pch=as.numeric(iris$Species))

[Plot: MDSplot of the iris_rf1 proximity matrix, points marked by species]

dotchart(importance(iris_rf), col= 1:4)

[Plot: dotchart of variable importance (MeanDecreaseGini)]

  • optimizing mtry, the number of variables randomly sampled as split candidates at each node
n <- ncol(iris)-1 # number of predictor variables
rate <- 1 # initializing the vector of model error rates
for ( i in 1:n) {
  set.seed(22)
  model <- randomForest(Species~., data=iris, mtry=i, importance=TRUE, ntree=800)
  rate[i] <- mean(model$err.rate) # computing the misclassification rate
  # print(model)
}
rate
## [1] 0.06035 0.04577 0.04513 0.04645

So, the best mtry is 3 (minimum error rate).
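
As a cross-check, randomForest ships a helper, tuneRF(), that searches mtry automatically (a sketch; the reported optimum can vary with the seed):

set.seed(22)
tuneRF(iris[, -5], iris$Species, ntreeTry=800,
       stepFactor=2, improve=0.01) # prints the OOB error for each mtry tried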

model1 <- randomForest(Species~., data=iris, mtry=3, importance=TRUE, ntree=800)
plot(model1, col=2:5)
legend(700, 0.01, "total", cex=0.9, bty="n")
legend(700, 0.05, "setosa", cex=0.9, bty="n")
legend(700, 0.07, "versicolor", cex=0.9, bty="n")
legend(700, 0.09, "virginica", cex=0.9, bty="n")

[Plot: OOB error of model1 versus number of trees, with per-class curves]

hist(treesize(model1))

[Plot: histogram of treesize(model1)]

model2 <- randomForest(Species~., data=iris, mtry=3, importance=TRUE, ntree=800, proximity=TRUE)
MDSplot(model2, iris$Species, palette=1:3, pch=as.numeric(iris$Species))

[Plot: MDSplot of the model2 proximity matrix]

6. Support Vector Machine

A support vector machine separates classes with a maximum-margin hyperplane, using kernel functions to handle non-linear boundaries. The e1071 package provides the svm() interface used below.

library(e1071)
attach(iris)
class(iris)
## [1] "data.frame"
x <- subset(iris, select=-Species)
y <- Species
table(y)
## y
##     setosa versicolor  virginica 
##         50         50         50

Preliminary SVM model selection

type <- c("C-classification", "nu-classification", "one-classification")
kernel <- c("linear", "polynomial", "radial", "sigmoid")
pred <- array(0, dim=c(150, 3, 4))
errors <- matrix(0, 3, 4) # counts of misclassified samples per model
yy <- as.integer(y)
for (i in 1:3)
{
  for (j in 1:4)
  {
    pred[, i, j] <- predict(svm(x, y, type=type[i], kernel=kernel[j]), x)
    if (i > 2) errors[i, j] <- sum(pred[, i, j] != 1)
    else errors[i, j] <- sum(pred[, i, j] != yy)
  }
}

dimnames(errors) <- list(type, kernel)
errors # the number of misclassified samples for each model
##                    linear polynomial radial sigmoid
## C-classification        5          7      4      17
## nu-classification       5         14      5      12
## one-classification    102         75     76      75
pred # the full prediction array; columns are the three SVM types: 1/2/3 are species codes, and for one-classification 1 = inlier, 0 = outlier
## , , 1
## 
##        [,1] [,2] [,3]
##   [1,]    1    1    0
##   [2,]    1    1    0
##   [3,]    1    1    0
##   [4,]    1    1    0
##   [5,]    1    1    0
##   [6,]    1    1    0
##   [7,]    1    1    0
##   [8,]    1    1    0
##   [9,]    1    1    0
##  [10,]    1    1    0
##  [11,]    1    1    0
##  [12,]    1    1    0
##  [13,]    1    1    0
##  [14,]    1    1    0
##  [15,]    1    1    0
##  [16,]    1    1    0
##  [17,]    1    1    0
##  [18,]    1    1    0
##  [19,]    1    1    0
##  [20,]    1    1    0
##  [21,]    1    1    0
##  [22,]    1    1    0
##  [23,]    1    1    0
##  [24,]    1    1    0
##  [25,]    1    1    0
##  [26,]    1    1    0
##  [27,]    1    1    0
##  [28,]    1    1    0
##  [29,]    1    1    0
##  [30,]    1    1    0
##  [31,]    1    1    0
##  [32,]    1    1    0
##  [33,]    1    1    0
##  [34,]    1    1    0
##  [35,]    1    1    0
##  [36,]    1    1    0
##  [37,]    1    1    0
##  [38,]    1    1    0
##  [39,]    1    1    0
##  [40,]    1    1    0
##  [41,]    1    1    0
##  [42,]    1    1    0
##  [43,]    1    1    0
##  [44,]    1    1    0
##  [45,]    1    1    0
##  [46,]    1    1    0
##  [47,]    1    1    0
##  [48,]    1    1    0
##  [49,]    1    1    0
##  [50,]    1    1    0
##  [51,]    2    2    1
##  [52,]    2    2    0
##  [53,]    2    2    1
##  [54,]    2    2    0
##  [55,]    2    2    1
##  [56,]    2    2    0
##  [57,]    2    2    0
##  [58,]    2    2    0
##  [59,]    2    2    1
##  [60,]    2    2    0
##  [61,]    2    2    0
##  [62,]    2    2    0
##  [63,]    2    2    1
##  [64,]    2    2    0
##  [65,]    2    2    0
##  [66,]    2    2    1
##  [67,]    2    2    0
##  [68,]    2    2    0
##  [69,]    2    2    1
##  [70,]    2    2    0
##  [71,]    3    2    0
##  [72,]    2    2    1
##  [73,]    3    2    1
##  [74,]    2    2    0
##  [75,]    2    2    1
##  [76,]    2    2    1
##  [77,]    2    2    1
##  [78,]    3    3    1
##  [79,]    2    2    0
##  [80,]    2    2    0
##  [81,]    2    2    0
##  [82,]    2    2    0
##  [83,]    2    2    0
##  [84,]    3    3    0
##  [85,]    2    2    0
##  [86,]    2    2    0
##  [87,]    2    2    1
##  [88,]    2    2    1
##  [89,]    2    2    0
##  [90,]    2    2    0
##  [91,]    2    2    0
##  [92,]    2    2    0
##  [93,]    2    2    0
##  [94,]    2    2    0
##  [95,]    2    2    0
##  [96,]    2    2    0
##  [97,]    2    2    0
##  [98,]    2    2    1
##  [99,]    2    2    0
## [100,]    2    2    0
## [101,]    3    3    0
## [102,]    3    3    0
## [103,]    3    3    1
## [104,]    3    3    0
## [105,]    3    3    1
## [106,]    3    3    1
## [107,]    3    2    0
## [108,]    3    3    1
## [109,]    3    3    1
## [110,]    3    3    1
## [111,]    3    3    1
## [112,]    3    3    1
## [113,]    3    3    1
## [114,]    3    3    0
## [115,]    3    3    0
## [116,]    3    3    1
## [117,]    3    3    0
## [118,]    3    3    1
## [119,]    3    3    1
## [120,]    3    3    1
## [121,]    3    3    1
## [122,]    3    3    0
## [123,]    3    3    1
## [124,]    3    3    1
## [125,]    3    3    0
## [126,]    3    3    1
## [127,]    3    3    1
## [128,]    3    3    0
## [129,]    3    3    1
## [130,]    3    3    1
## [131,]    3    3    1
## [132,]    3    3    1
## [133,]    3    3    1
## [134,]    2    2    0
## [135,]    3    3    0
## [136,]    3    3    1
## [137,]    3    3    0
## [138,]    3    3    0
## [139,]    3    2    0
## [140,]    3    3    1
## [141,]    3    3    1
## [142,]    3    3    1
## [143,]    3    3    0
## [144,]    3    3    1
## [145,]    3    3    1
## [146,]    3    3    1
## [147,]    3    3    1
## [148,]    3    3    1
## [149,]    3    3    0
## [150,]    3    3    0
## 
## , , 2
## 
##        [,1] [,2] [,3]
##   [1,]    1    1    1
##   [2,]    1    1    0
##   [3,]    1    1    1
##   [4,]    1    1    1
##   [5,]    1    1    1
##   [6,]    1    1    1
##   [7,]    1    1    1
##   [8,]    1    1    1
##   [9,]    1    1    0
##  [10,]    1    1    0
##  [11,]    1    1    1
##  [12,]    1    1    1
##  [13,]    1    1    0
##  [14,]    1    1    1
##  [15,]    1    1    1
##  [16,]    1    1    1
##  [17,]    1    1    1
##  [18,]    1    1    1
##  [19,]    1    1    1
##  [20,]    1    1    1
##  [21,]    1    1    0
##  [22,]    1    1    1
##  [23,]    1    1    1
##  [24,]    1    1    1
##  [25,]    1    1    0
##  [26,]    1    1    0
##  [27,]    1    1    1
##  [28,]    1    1    1
##  [29,]    1    1    1
##  [30,]    1    1    1
##  [31,]    1    1    0
##  [32,]    1    1    1
##  [33,]    1    1    1
##  [34,]    1    1    1
##  [35,]    1    1    0
##  [36,]    1    1    1
##  [37,]    1    1    1
##  [38,]    1    1    1
##  [39,]    1    1    0
##  [40,]    1    1    1
##  [41,]    1    1    1
##  [42,]    1    1    1
##  [43,]    1    1    0
##  [44,]    1    1    0
##  [45,]    1    1    1
##  [46,]    1    1    1
##  [47,]    1    1    1
##  [48,]    1    1    1
##  [49,]    1    1    1
##  [50,]    1    1    1
##  [51,]    2    2    1
##  [52,]    2    2    0
##  [53,]    2    2    1
##  [54,]    2    2    1
##  [55,]    2    2    0
##  [56,]    2    2    0
##  [57,]    2    2    0
##  [58,]    2    2    0
##  [59,]    2    2    1
##  [60,]    2    2    0
##  [61,]    2    2    1
##  [62,]    2    2    0
##  [63,]    2    2    1
##  [64,]    2    2    0
##  [65,]    2    2    0
##  [66,]    2    2    0
##  [67,]    2    2    0
##  [68,]    2    2    0
##  [69,]    2    2    1
##  [70,]    2    2    0
##  [71,]    2    2    1
##  [72,]    2    2    0
##  [73,]    2    2    0
##  [74,]    2    2    0
##  [75,]    2    2    0
##  [76,]    2    2    0
##  [77,]    2    2    1
##  [78,]    2    2    0
##  [79,]    2    2    0
##  [80,]    2    2    0
##  [81,]    2    2    1
##  [82,]    2    2    1
##  [83,]    2    2    0
##  [84,]    2    2    0
##  [85,]    2    2    0
##  [86,]    2    2    0
##  [87,]    2    2    0
##  [88,]    2    2    1
##  [89,]    2    2    0
##  [90,]    2    2    0
##  [91,]    2    2    0
##  [92,]    2    2    0
##  [93,]    2    2    0
##  [94,]    2    2    1
##  [95,]    2    2    0
##  [96,]    2    2    0
##  [97,]    2    2    0
##  [98,]    2    2    0
##  [99,]    2    2    0
## [100,]    2    2    0
## [101,]    3    3    1
## [102,]    3    2    0
## [103,]    3    3    0
## [104,]    3    3    0
## [105,]    3    3    0
## [106,]    3    3    1
## [107,]    2    2    1
## [108,]    3    3    1
## [109,]    3    3    0
## [110,]    3    3    1
## [111,]    3    3    0
## [112,]    3    3    0
## [113,]    3    3    0
## [114,]    3    3    1
## [115,]    3    3    1
## [116,]    3    3    1
## [117,]    3    2    0
## [118,]    3    3    0
## [119,]    3    3    1
## [120,]    3    2    1
## [121,]    3    3    0
## [122,]    3    2    1
## [123,]    3    3    1
## [124,]    3    2    0
## [125,]    3    3    0
## [126,]    3    3    1
## [127,]    2    2    0
## [128,]    2    2    0
## [129,]    3    3    0
## [130,]    3    3    1
## [131,]    3    3    1
## [132,]    3    3    1
## [133,]    3    3    0
## [134,]    2    2    0
## [135,]    2    2    0
## [136,]    3    3    1
## [137,]    3    3    1
## [138,]    3    2    0
## [139,]    2    2    0
## [140,]    3    3    0
## [141,]    3    3    1
## [142,]    3    3    1
## [143,]    3    2    0
## [144,]    3    3    0
## [145,]    3    3    1
## [146,]    3    3    1
## [147,]    3    3    0
## [148,]    3    3    0
## [149,]    3    3    1
## [150,]    2    2    1
## 
## , , 3
## 
##        [,1] [,2] [,3]
##   [1,]    1    1    1
##   [2,]    1    1    0
##   [3,]    1    1    1
##   [4,]    1    1    0
##   [5,]    1    1    1
##   [6,]    1    1    0
##   [7,]    1    1    0
##   [8,]    1    1    1
##   [9,]    1    1    0
##  [10,]    1    1    0
##  [11,]    1    1    0
##  [12,]    1    1    1
##  [13,]    1    1    0
##  [14,]    1    1    0
##  [15,]    1    1    0
##  [16,]    1    1    0
##  [17,]    1    1    0
##  [18,]    1    1    1
##  [19,]    1    1    0
##  [20,]    1    1    0
##  [21,]    1    1    1
##  [22,]    1    1    0
##  [23,]    1    1    0
##  [24,]    1    1    1
##  [25,]    1    1    1
##  [26,]    1    1    0
##  [27,]    1    1    1
##  [28,]    1    1    1
##  [29,]    1    1    1
##  [30,]    1    1    1
##  [31,]    1    1    1
##  [32,]    1    1    1
##  [33,]    1    1    0
##  [34,]    1    1    0
##  [35,]    1    1    1
##  [36,]    1    1    1
##  [37,]    1    1    0
##  [38,]    1    1    0
##  [39,]    1    1    0
##  [40,]    1    1    1
##  [41,]    1    1    1
##  [42,]    1    1    0
##  [43,]    1    1    0
##  [44,]    1    1    1
##  [45,]    1    1    0
##  [46,]    1    1    0
##  [47,]    1    1    0
##  [48,]    1    1    0
##  [49,]    1    1    0
##  [50,]    1    1    1
##  [51,]    2    2    0
##  [52,]    2    2    1
##  [53,]    2    2    1
##  [54,]    2    2    0
##  [55,]    2    2    1
##  [56,]    2    2    1
##  [57,]    2    2    0
##  [58,]    2    2    0
##  [59,]    2    2    1
##  [60,]    2    2    0
##  [61,]    2    2    0
##  [62,]    2    2    1
##  [63,]    2    2    0
##  [64,]    2    2    1
##  [65,]    2    2    1
##  [66,]    2    2    0
##  [67,]    2    2    1
##  [68,]    2    2    1
##  [69,]    2    2    0
##  [70,]    2    2    0
##  [71,]    2    2    1
##  [72,]    2    2    1
##  [73,]    2    2    1
##  [74,]    2    2    1
##  [75,]    2    2    1
##  [76,]    2    2    1
##  [77,]    2    2    0
##  [78,]    3    3    1
##  [79,]    2    2    1
##  [80,]    2    2    1
##  [81,]    2    2    0
##  [82,]    2    2    0
##  [83,]    2    2    1
##  [84,]    3    3    1
##  [85,]    2    2    0
##  [86,]    2    2    0
##  [87,]    2    2    1
##  [88,]    2    2    0
##  [89,]    2    2    1
##  [90,]    2    2    1
##  [91,]    2    2    1
##  [92,]    2    2    1
##  [93,]    2    2    1
##  [94,]    2    2    0
##  [95,]    2    2    1
##  [96,]    2    2    1
##  [97,]    2    2    1
##  [98,]    2    2    1
##  [99,]    2    2    0
## [100,]    2    2    1
## [101,]    3    3    0
## [102,]    3    3    1
## [103,]    3    3    1
## [104,]    3    3    1
## [105,]    3    3    1
## [106,]    3    3    0
## [107,]    3    2    0
## [108,]    3    3    0
## [109,]    3    3    0
## [110,]    3    3    0
## [111,]    3    3    1
## [112,]    3    3    1
## [113,]    3    3    1
## [114,]    3    3    0
## [115,]    3    3    0
## [116,]    3    3    0
## [117,]    3    3    1
## [118,]    3    3    0
## [119,]    3    3    0
## [120,]    2    2    0
## [121,]    3    3    0
## [122,]    3    3    0
## [123,]    3    3    0
## [124,]    3    3    1
## [125,]    3    3    0
## [126,]    3    3    0
## [127,]    3    3    1
## [128,]    3    3    1
## [129,]    3    3    1
## [130,]    3    3    0
## [131,]    3    3    0
## [132,]    3    3    0
## [133,]    3    3    1
## [134,]    2    2    1
## [135,]    3    3    0
## [136,]    3    3    0
## [137,]    3    3    0
## [138,]    3    3    1
## [139,]    3    3    1
## [140,]    3    3    1
## [141,]    3    3    0
## [142,]    3    3    0
## [143,]    3    3    1
## [144,]    3    3    0
## [145,]    3    3    0
## [146,]    3    3    1
## [147,]    3    3    0
## [148,]    3    3    1
## [149,]    3    3    0
## [150,]    3    3    1
## 
## , , 4
## 
##        [,1] [,2] [,3]
##   [1,]    1    1    0
##   [2,]    1    1    1
##   [3,]    1    1    1
##   [4,]    1    1    1
##   [5,]    1    1    0
##   [6,]    1    1    0
##   [7,]    1    1    1
##   [8,]    1    1    0
##   [9,]    1    1    1
##  [10,]    1    1    1
##  [11,]    1    1    0
##  [12,]    1    1    1
##  [13,]    1    1    1
##  [14,]    1    1    1
##  [15,]    1    1    0
##  [16,]    1    1    0
##  [17,]    1    1    0
##  [18,]    1    1    0
##  [19,]    1    1    0
##  [20,]    1    1    0
##  [21,]    1    1    1
##  [22,]    1    1    0
##  [23,]    1    1    0
##  [24,]    1    1    1
##  [25,]    1    1    1
##  [26,]    1    1    1
##  [27,]    1    1    1
##  [28,]    1    1    0
##  [29,]    1    1    0
##  [30,]    1    1    1
##  [31,]    1    1    1
##  [32,]    1    1    1
##  [33,]    1    1    0
##  [34,]    1    1    0
##  [35,]    1    1    1
##  [36,]    1    1    1
##  [37,]    1    1    0
##  [38,]    1    1    0
##  [39,]    1    1    1
##  [40,]    1    1    0
##  [41,]    1    1    0
##  [42,]    2    1    1
##  [43,]    1    1    1
##  [44,]    1    1    0
##  [45,]    1    1    0
##  [46,]    1    1    1
##  [47,]    1    1    0
##  [48,]    1    1    1
##  [49,]    1    1    0
##  [50,]    1    1    1
##  [51,]    2    2    1
##  [52,]    2    2    1
##  [53,]    3    3    1
##  [54,]    2    2    0
##  [55,]    3    3    1
##  [56,]    2    2    0
##  [57,]    3    3    1
##  [58,]    2    2    0
##  [59,]    2    2    1
##  [60,]    2    2    0
##  [61,]    2    2    0
##  [62,]    2    2    1
##  [63,]    2    2    0
##  [64,]    2    2    1
##  [65,]    2    2    0
##  [66,]    2    2    1
##  [67,]    2    2    1
##  [68,]    2    2    0
##  [69,]    2    2    0
##  [70,]    2    2    0
##  [71,]    3    3    1
##  [72,]    2    2    0
##  [73,]    3    3    1
##  [74,]    2    2    0
##  [75,]    2    2    1
##  [76,]    2    2    1
##  [77,]    3    3    1
##  [78,]    3    3    1
##  [79,]    2    2    1
##  [80,]    2    2    0
##  [81,]    2    2    0
##  [82,]    2    2    0
##  [83,]    2    2    0
##  [84,]    3    3    1
##  [85,]    2    2    0
##  [86,]    2    2    1
##  [87,]    3    3    1
##  [88,]    2    2    0
##  [89,]    2    2    1
##  [90,]    2    2    0
##  [91,]    2    2    0
##  [92,]    2    2    1
##  [93,]    2    2    0
##  [94,]    2    2    0
##  [95,]    2    2    0
##  [96,]    2    2    1
##  [97,]    2    2    0
##  [98,]    2    2    1
##  [99,]    2    2    0
## [100,]    2    2    0
## [101,]    3    3    0
## [102,]    3    3    1
## [103,]    3    3    0
## [104,]    3    3    1
## [105,]    3    3    1
## [106,]    2    3    0
## [107,]    2    2    0
## [108,]    3    3    0
## [109,]    3    3    1
## [110,]    3    3    0
## [111,]    3    3    1
## [112,]    3    3    1
## [113,]    3    3    0
## [114,]    3    3    1
## [115,]    3    3    1
## [116,]    3    3    0
## [117,]    3    3    1
## [118,]    2    2    0
## [119,]    3    3    1
## [120,]    2    3    0
## [121,]    3    3    0
## [122,]    3    3    1
## [123,]    2    3    1
## [124,]    3    3    1
## [125,]    3    3    0
## [126,]    3    3    0
## [127,]    3    3    1
## [128,]    3    3    1
## [129,]    3    3    1
## [130,]    3    3    0
## [131,]    3    3    1
## [132,]    2    2    0
## [133,]    3    3    1
## [134,]    3    3    1
## [135,]    3    3    1
## [136,]    2    3    0
## [137,]    3    3    0
## [138,]    3    3    1
## [139,]    3    3    1
## [140,]    3    3    0
## [141,]    3    3    0
## [142,]    3    3    0
## [143,]    3    3    1
## [144,]    3    3    0
## [145,]    3    3    0
## [146,]    3    3    0
## [147,]    3    3    1
## [148,]    3    3    1
## [149,]    3    3    0
## [150,]    3    3    1

table(pred[, 1, 3], y) # showing the prediction performance of the C-classification model with radial kernel
##    y
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         48         2
##   3      0          2        48
model <- svm(x, y, kernel="radial", gamma= if (is.vector(x)) 1 else 1/ncol(x))
model
## 
## Call:
## svm.default(x = x, y = y, kernel = "radial", gamma = if (is.vector(x)) 1 else 1/ncol(x))
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.25 
## 
## Number of Support Vectors:  51

Visualization; "+" marks the support vectors.

plot(cmdscale(dist(iris[, -5])), 
     col=c("lightgray", "black", "gray")[as.integer(iris[, 5])],
     pch=c("o", "+")[1:150 %in% model$index + 1])
legend(2, -0.3, c("setosa", "versicolor", "virginica"),
       col=c("lightgray", "black", "gray"), lty=1)

[Plot: cmdscale projection of the iris distances; "+" marks support vectors]

model <- svm(Species ~ ., data = iris, method = "C-classification", kernel = "radial", cost = 10, gamma = 0.1) # note: svm() takes type=, not method=; C-classification is already the default for a factor response
summary(model)
## 
## Call:
## svm(formula = Species ~ ., data = iris, method = "C-classification", 
##     kernel = "radial", cost = 10, gamma = 0.1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
##       gamma:  0.1 
## 
## Number of Support Vectors:  32
## 
##  ( 3 16 13 )
## 
## 
## Number of Classes:  3 
## 
## Levels: 
##  setosa versicolor virginica

plot(model, iris, Petal.Width ~Petal.Length, slice = list(Sepal.Width = 3, Sepal.Length = 4))

[Plot: SVM decision regions over Petal.Width vs Petal.Length, sliced at Sepal.Width=3, Sepal.Length=4]

Model optimization

wts <- c(1, 1, 1) # setting the weight of each class in the model
names(wts) <- c("setosa", "versicolor", "virginica") # matching each weight to its class
x <- subset(iris, select=-Species)
y <- Species
model1 <- svm(x, y, class.weights = wts) # building the model
pred1 <- predict(model1, x) # predicting with the model
table(pred1, y) # showing the prediction results
##             y
## pred1        setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         48         2
##   virginica       0          2        48

wts <- c(1, 100, 100) # adjusted class weights
names(wts) <- c("setosa", "versicolor", "virginica") # matching each weight to its class
model2 <- svm(x, y, class.weights = wts) # building the model
pred2 <- predict(model2, x) # predicting with the model
table(pred2, y) # showing the prediction results
##             y
## pred2        setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         49         1
##   virginica       0          1        49

wts <- c(1, 500, 500) # adjusted class weights
names(wts) <- c("setosa", "versicolor", "virginica") # matching each weight to its class
model3 <- svm(x, y, class.weights = wts, cross=10) # building the model with 10-fold CV
pred3 <- predict(model3, x) # predicting with the model
table(pred3, y)
table(pred3, y)
##             y
## pred3        setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         50         0
##   virginica       0          0        50

summary(model3)
## 
## Call:
## svm.default(x = x, y = y, class.weights = wts, cross = 10)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.25 
## 
## Number of Support Vectors:  29
## 
##  ( 8 10 11 )
## 
## 
## Number of Classes:  3 
## 
## Levels: 
##  setosa versicolor virginica
## 
## 10-fold cross-validation on training data:
## 
## Total Accuracy: 93.33 
## Single Accuracies:
##  93.33 93.33 100 80 100 100 93.33 100 93.33 80

model3$sv # inspecting the support vectors; NULL because the component is named SV (use model3$SV)
## NULL
model3$index
##  [1]   9  14  16  21  23  24  26  42  51  58  61  69  71  73  78  84  86
## [18]  99 107 111 119 120 124 126 130 132 134 139 149
model3$rho
## [1] -0.02036  0.13116  2.29179
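
Beyond class weights, e1071's tune.svm() can grid-search gamma and cost by cross-validation (a sketch; the grid values are illustrative):

set.seed(1)
tuned <- tune.svm(Species ~ ., data=iris, gamma=10^(-3:0), cost=10^(0:2))
summary(tuned) # best (gamma, cost) pair and its CV error
tuned$best.model # the svm refitted at the best parameters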

7. Comparison among DT, RF and SVM

  • SVM requires data preprocessing (e.g. feature scaling), but RF does not (see the note below).
  • For binary classification, RF and SVM show no significant difference in generalization ability, but RF performs better on multi-class problems.
  • Robustness: no significant differences.
  • For classification of imbalanced data sets, SVM is better.
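
On the first point, note that e1071's svm() already scales variables internally by default (scale = TRUE); a quick sketch of the effect of turning that off:

library(e1071)
m_scaled <- svm(Species ~ ., data=iris) # scale=TRUE is the default
m_raw <- svm(Species ~ ., data=iris, scale=FALSE)
sum(predict(m_scaled, iris) != iris$Species) # training errors, scaled
sum(predict(m_raw, iris) != iris$Species) # training errors, unscaled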

8. Acknowledgement

  • First, I want to express my gratitude to Prof. Zhu Feng for his guidance and valuable advice, and for his support of my continued education.
  • Thanks to Miss Tang Jing and Miss Yang Qingxia for their help.
  • Thanks also to the other IDRB group members.

Aug 24th, 2014, at Chongqing University

By Li Bo