Introduction to Machine Learning with R

Joel Correa da Rosa
March 30, 2018

MODULE II : Algorithms for Machine Learning

Algorithms for Machine Learning

In this section, we show the basic use of machine learning algorithms.

  1. Decision Trees
  2. Support Vector Machines
  3. Neural Networks
  4. Random Forests
  5. Naive Bayes

Application

  • In this module, we will use a genomic dataset to perform the classification task.

  • We will apply the most important algorithms.

  • This dataset consists of normalised expression levels of 25 genes, hypothesised to be associated with a disease subtype, obtained for 40 patient samples (dataset1.csv).

  • Missing values represent expression levels below the limit of detection (LOD).

  • The data also contain the disease diagnosis reported by the gold-standard assay; a positive assay indicates disease subtype B.
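Since values below the LOD are stored as missing, it helps to count them before modelling. The sketch below uses a small hypothetical data frame (`toy`, not dataset1.csv) and shows one possible convention, replacing each NA by the gene's smallest observed value. This is an assumption for illustration, not the method used in this module; rpart can also work with NA predictors directly through surrogate splits.

```r
# Toy stand-in for dataset1.csv (hypothetical values, not the course data)
toy <- data.frame(
  Disease = factor(rep(c("A", "B"), each = 5)),
  Gene1   = c(6.1, NA, 9.7, 11.2, NA, 12.3, 14.6, 11.0, 10.5, 9.9),
  Gene2   = c(5.1, 7.3, 9.2, 8.9, 10.0, NA, 12.7, 9.5, 8.1, 7.7)
)

# Count values below the LOD (stored as NA) for each gene
na.count <- colSums(is.na(toy[, -1]))
na.count

# Assumed convention: replace NA by the gene's smallest observed value
impute.min <- function(x) {
  x[is.na(x)] <- min(x, na.rm = TRUE)
  x
}
toy.imp <- toy
toy.imp[, -1] <- lapply(toy[, -1], impute.min)
summary(toy.imp)
```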

Genomic Data Set

d<-read.csv('dataset1.csv',stringsAsFactors = TRUE)  # factors are no longer the default since R 4.0
rownames(d)<-d$Patient
summary(d[,1:10])
    Patient   Disease      ASSAY        Gene1            Gene2       
 A1     : 1   A:19    Negative:18   Min.   : 6.133   Min.   : 5.089  
 A10    : 1   B:21    Positive:22   1st Qu.: 9.677   1st Qu.: 7.308  
 A11    : 1                         Median :11.193   Median : 9.151  
 A12    : 1                         Mean   :11.022   Mean   : 8.855  
 A13    : 1                         3rd Qu.:12.338   3rd Qu.: 9.976  
 A14    : 1                         Max.   :14.654   Max.   :12.708  
 (Other):34                         NA's   :5                        
     Gene3           Gene4            Gene5            Gene6      
 Min.   :6.175   Min.   : 3.534   Min.   : 5.264   Min.   :6.142  
 1st Qu.:7.202   1st Qu.: 5.922   1st Qu.: 8.847   1st Qu.:7.030  
 Median :7.456   Median : 6.917   Median :10.886   Median :7.425  
 Mean   :7.440   Mean   : 7.111   Mean   :10.984   Mean   :7.321  
 3rd Qu.:7.747   3rd Qu.: 8.221   3rd Qu.:12.764   3rd Qu.:7.597  
 Max.   :8.348   Max.   :10.043   Max.   :17.006   Max.   :8.507  
                 NA's   :1        NA's   :1        NA's   :1      
     Gene7       
 Min.   :0.1981  
 1st Qu.:2.8718  
 Median :6.0272  
 Mean   :5.5149  
 3rd Qu.:8.0715  
 Max.   :9.5222  
 NA's   :2       
d<-subset(d,select = -c(Patient,ASSAY))

Decision Trees (Basics)

  • A Decision Tree is a method of recursive partitioning.

  • It follows the paradigm: “Divide and Conquer”

  • 3 elements need to be specified:

    • Split Rule
    • Prediction Rule
    • Stopping Rule
  • It splits the data recursively until the criteria of the stopping rule are met.
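The three elements can be sketched in a few lines of R. This is a simplified illustration on hypothetical toy data, assuming Gini impurity as the split criterion and a majority vote as the prediction rule; the function names (`gini`, `best.split`, `majority`, `stop.here`) are made up for this example and do not come from any package.

```r
# Gini impurity of a vector of class labels
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Split rule: choose the threshold on x with the lowest weighted impurity
best.split <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between neighbours
  score <- sapply(cuts, function(cut) {
    left  <- y[x < cut]
    right <- y[x >= cut]
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  list(cut = cuts[which.min(score)], impurity = min(score))
}

# Prediction rule: majority class among the observations in a node
majority <- function(y) names(which.max(table(y)))

# Stopping rule: stop when the node is pure or has too few observations
stop.here <- function(y, minsplit = 5) {
  length(unique(y)) == 1 || length(y) < minsplit
}

# Hypothetical toy data: one predictor, two classes
x <- c(1.2, 2.5, 3.1, 4.8, 6.0, 6.7, 7.9, 8.4)
y <- factor(c("A", "A", "A", "A", "B", "B", "B", "B"))

s <- best.split(x, y)
s$cut                    # threshold chosen by the split rule
majority(y[x < s$cut])   # predicted class in the left node
stop.here(y[x < s$cut])  # is the left node finished?
```

A full tree applies `best.split` to every predictor, keeps the best one, and repeats the procedure inside each child node until `stop.here` returns TRUE.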

A Tree Example

[include a figure of a tree]

Decision Trees (R)

R has several packages for fitting decision trees with different algorithms.

  • rpart
  • party
  • C50

Other packages are used to plot the fitted tree.

Decision Trees in R

The most famous algorithm for growing a decision tree is called CART (Classification and Regression Trees).

It has been implemented in the package rpart.

library(rpart)
fit.rp<-rpart(Disease~.,data=d)
fit.rp
n= 40 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 40 19 B (0.47500000 0.52500000)  
  2) Gene7< 5.654831 18  1 A (0.94444444 0.05555556) *
  3) Gene7>=5.654831 22  2 B (0.09090909 0.90909091) *
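In the output above, each line lists the node number, the split, the number of samples, the number misclassified (loss), the predicted class and the class probabilities. Node 2, for example, holds 18 samples, predicts A and misclassifies 1 of them. A fitted tree can then be used for prediction. The sketch below uses simulated data (`toy`, a hypothetical stand-in for dataset1.csv) so that it runs on its own.

```r
library(rpart)

# Simulated stand-in for the genomic data (hypothetical, not dataset1.csv)
set.seed(42)
n <- 40
toy <- data.frame(
  Disease = factor(rep(c("A", "B"), each = n / 2)),
  Gene1   = rnorm(n, mean = 10),
  Gene7   = c(rnorm(n / 2, mean = 3), rnorm(n / 2, mean = 8))
)

fit.toy <- rpart(Disease ~ ., data = toy)

# Predicted class for each training sample
pred <- predict(fit.toy, newdata = toy, type = "class")

# Confusion matrix and training accuracy (optimistic: same data used to fit)
conf <- table(observed = toy$Disease, predicted = pred)
conf
sum(diag(conf)) / sum(conf)
```

Accuracy measured on the training samples overstates performance; a held-out set or cross-validation gives a fairer estimate.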

Decision Trees Tuning

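Tuning parameters for decision trees are set with rpart.control. Here minsplit = 5 lowers the minimum number of observations a node must contain before a split is attempted (the default is 20), so the tree can grow deeper.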

rp.control <-rpart.control(minsplit = 5)
fit.rp<-rpart(Disease~.,data=d,control = rp.control)
fit.rp
n= 40 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 40 19 B (0.47500000 0.52500000)  
   2) Gene7< 5.654831 18  1 A (0.94444444 0.05555556) *
   3) Gene7>=5.654831 22  2 B (0.09090909 0.90909091)  
     6) Gene3< 7.282527 7  2 B (0.28571429 0.71428571)  
      12) Gene10< 10.44221 2  0 A (1.00000000 0.00000000) *
      13) Gene10>=10.44221 5  0 B (0.00000000 1.00000000) *
     7) Gene3>=7.282527 15  0 B (0.00000000 1.00000000) *
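To see how the control settings change the size of the tree, one can count the terminal nodes under different values. This is a minimal sketch on simulated data (`toy` is hypothetical, not dataset1.csv); `n.leaves` is a helper written for this example, while `minsplit` and `maxdepth` are arguments of rpart.control.

```r
library(rpart)

# Simulated stand-in for the genomic data (hypothetical, not dataset1.csv)
set.seed(7)
n <- 40
toy <- data.frame(
  Disease = factor(rep(c("A", "B"), each = n / 2)),
  Gene3   = rnorm(n, mean = 7.4, sd = 0.5),
  Gene7   = c(rnorm(n / 2, mean = 4), rnorm(n / 2, mean = 7))
)

# Number of terminal nodes in a fitted rpart tree
n.leaves <- function(fit) sum(fit$frame$var == "<leaf>")

fit.default <- rpart(Disease ~ ., data = toy)   # minsplit = 20 by default
fit.small   <- rpart(Disease ~ ., data = toy,
                     control = rpart.control(minsplit = 5))
fit.stump   <- rpart(Disease ~ ., data = toy,
                     control = rpart.control(maxdepth = 1))

# Compare tree sizes under the three settings
c(default   = n.leaves(fit.default),
  minsplit5 = n.leaves(fit.small),
  maxdepth1 = n.leaves(fit.stump))
```

Deeper trees fit the training data more closely but are more prone to overfitting with only 40 samples.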

Plotting a Decision Tree

plot(fit.rp)
text(fit.rp)

[figure: plot of the fitted decision tree]
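The default plot can be hard to read. The sketch below shows a few optional arguments of plot and text for rpart objects that often help; the data are simulated (`toy` is hypothetical, not dataset1.csv) and the chosen values are only a starting point.

```r
library(rpart)

# Simulated stand-in for the genomic data (hypothetical, not dataset1.csv)
set.seed(42)
n <- 40
toy <- data.frame(
  Disease = factor(rep(c("A", "B"), each = n / 2)),
  Gene1   = rnorm(n, mean = 10),
  Gene7   = c(rnorm(n / 2, mean = 3), rnorm(n / 2, mean = 8))
)
fit.toy <- rpart(Disease ~ ., data = toy)

# uniform = TRUE spaces the levels evenly; margin leaves room for the labels
plot(fit.toy, uniform = TRUE, margin = 0.1)

# use.n = TRUE adds the class counts to each node label
text(fit.toy, use.n = TRUE, cex = 0.8)
```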

Neural Networks

Neural networks perform non-linear regression and are known to be universal approximators, i.e. they can approximate any continuous mapping between \( Y \) (outcome) and \( X \) (predictors), provided they include a sufficient number of parameters.

Tuning parameter: size (the number of neurons in the hidden layer)

library(nnet)
nnet(Disease ~ ., data = d, size = 1)
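The nnet function lives in the nnet package, which has to be loaded with library(nnet) before it can be called. The sketch below fits networks with two values of size on simulated data (`toy` is hypothetical, not dataset1.csv). Scaling the predictors and setting trace = FALSE are choices made for this example, not requirements.

```r
library(nnet)

# Simulated stand-in for the genomic data (hypothetical, not dataset1.csv)
set.seed(123)
n <- 40
toy <- data.frame(
  Disease = factor(rep(c("A", "B"), each = n / 2)),
  Gene1   = rnorm(n, mean = 10),
  Gene7   = c(rnorm(n / 2, mean = 3), rnorm(n / 2, mean = 8))
)

# Standardise the predictors; training is usually more stable on scaled inputs
toy[, -1] <- scale(toy[, -1])

# Fit single-hidden-layer networks with 1 and 3 hidden neurons
fit.nn1 <- nnet(Disease ~ ., data = toy, size = 1, trace = FALSE)
fit.nn3 <- nnet(Disease ~ ., data = toy, size = 3, trace = FALSE)

# The number of weights (parameters) grows with size
length(fit.nn1$wts)
length(fit.nn3$wts)

# Training accuracy of the smaller network
pred <- predict(fit.nn1, newdata = toy, type = "class")
mean(pred == toy$Disease)
```

The initial weights are random, so results vary between runs unless the seed is fixed. With only 40 samples, a small size is the safer choice.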