Joel Correa da Rosa
March 30, 2018
In this module, we illustrate the basic use of machine learning algorithms for classification, applying some of the most widely used methods to a genomic dataset.
This dataset (dataset1.csv) consists of normalised expression levels of 25 genes, hypothesised to be associated with a disease subtype, measured in 40 patient samples.
Missing values represent expression levels below the limit of detection (LOD).
The data also contain the disease diagnosis output by the gold-standard assay; the assay is positive for disease subtype B.
d <- read.csv('dataset1.csv')
rownames(d) <- d$Patient  # index rows by patient ID
summary(d[, 1:10])        # summarise the first 10 columns
Patient Disease ASSAY Gene1 Gene2
A1 : 1 A:19 Negative:18 Min. : 6.133 Min. : 5.089
A10 : 1 B:21 Positive:22 1st Qu.: 9.677 1st Qu.: 7.308
A11 : 1 Median :11.193 Median : 9.151
A12 : 1 Mean :11.022 Mean : 8.855
A13 : 1 3rd Qu.:12.338 3rd Qu.: 9.976
A14 : 1 Max. :14.654 Max. :12.708
(Other):34 NA's :5
Gene3 Gene4 Gene5 Gene6
Min. :6.175 Min. : 3.534 Min. : 5.264 Min. :6.142
1st Qu.:7.202 1st Qu.: 5.922 1st Qu.: 8.847 1st Qu.:7.030
Median :7.456 Median : 6.917 Median :10.886 Median :7.425
Mean :7.440 Mean : 7.111 Mean :10.984 Mean :7.321
3rd Qu.:7.747 3rd Qu.: 8.221 3rd Qu.:12.764 3rd Qu.:7.597
Max. :8.348 Max. :10.043 Max. :17.006 Max. :8.507
NA's :1 NA's :1 NA's :1
Gene7
Min. :0.1981
1st Qu.:2.8718
Median :6.0272
Mean :5.5149
3rd Qu.:8.0715
Max. :9.5222
NA's :2
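Before modelling, it is worth quantifying the missingness. Because the missing values are below-LOD measurements rather than random gaps, one common convention is to substitute a small placeholder value. The actual LOD is not provided, so the sketch below uses half the smallest observed value of each gene as a stand-in (an assumption, for illustration only) and writes the result to a separate copy so the original data are untouched.

# Count below-LOD (missing) values per gene
colSums(is.na(d[, grep("^Gene", names(d))]))

# Hypothetical imputation: replace each below-LOD value with half the
# smallest observed value of that gene (the true LOD is not given,
# so this substitute is illustrative only)
d.imputed <- d
for (g in grep("^Gene", names(d.imputed))) {
  x <- d.imputed[, g]
  x[is.na(x)] <- min(x, na.rm = TRUE) / 2
  d.imputed[, g] <- x
}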
d <- subset(d, select = -c(Patient, ASSAY))  # drop the ID and assay columns; keep Disease as the outcome
A Decision Tree is a method of recursive partitioning. It follows the “divide and conquer” paradigm.
Three elements need to be specified:
- a splitting rule, which chooses the best split at each node;
- a stopping rule, which decides when to stop splitting;
- a rule for assigning a class to each terminal node.
The algorithm splits the data recursively until the stopping rule is met; the default splitting rule is illustrated in the sketch below.
[include a figure of a tree]
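To make the splitting rule concrete, the sketch below computes the Gini impurity, the default classification criterion in rpart, and the impurity reduction achieved by a hypothetical split of Gene7 at its median. The threshold is purely illustrative: rpart searches over all candidate splits, and it handles missing values with surrogate splits rather than dropping them.

# Gini impurity of a set of class labels
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Impurity reduction for an illustrative split: Gene7 below/above its median
# (complete cases of Gene7 only, for simplicity)
cc    <- !is.na(d$Gene7)
thr   <- median(d$Gene7, na.rm = TRUE)
left  <- d$Disease[cc & d$Gene7 <  thr]
right <- d$Disease[cc & d$Gene7 >= thr]
gini(d$Disease[cc]) -
  (length(left) * gini(left) + length(right) * gini(right)) / sum(cc)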
R has several packages for fitting decision trees with different algorithms; other packages are used to plot the fitted tree.
The most famous algorithm for growing a decision tree is CART (Classification and Regression Trees).
It is implemented in the rpart package.
library(rpart)
fit.rp <- rpart(Disease ~ ., data = d)  # grow a classification tree with default settings
fit.rp
n= 40
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 40 19 B (0.47500000 0.52500000)
2) Gene7< 5.654831 18 1 A (0.94444444 0.05555556) *
3) Gene7>=5.654831 22 2 B (0.09090909 0.90909091) *
Tuning parameters for decision trees
rp.control <- rpart.control(minsplit = 5)  # allow splitting nodes with as few as 5 observations
fit.rp <- rpart(Disease ~ ., data = d, control = rp.control)
fit.rp
n= 40
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 40 19 B (0.47500000 0.52500000)
2) Gene7< 5.654831 18 1 A (0.94444444 0.05555556) *
3) Gene7>=5.654831 22 2 B (0.09090909 0.90909091)
6) Gene3< 7.282527 7 2 B (0.28571429 0.71428571)
12) Gene10< 10.44221 2 0 A (1.00000000 0.00000000) *
13) Gene10>=10.44221 5 0 B (0.00000000 1.00000000) *
7) Gene3>=7.282527 15 0 B (0.00000000 1.00000000) *
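The tree grown with minsplit = 5 is deeper and may overfit. A standard rpart workflow is to inspect the cross-validated error across complexity-parameter (cp) values and prune at the cp that minimises it, as sketched below.

printcp(fit.rp)  # cross-validated error (xerror) for each cp value
cp.best <- fit.rp$cptable[which.min(fit.rp$cptable[, "xerror"]), "CP"]
fit.pruned <- prune(fit.rp, cp = cp.best)
fit.pruned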
plot(fit.rp)  # draw the tree skeleton
text(fit.rp)  # label splits and terminal nodes
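The base plot/text output is rather spartan. If the rpart.plot package is installed (it is a separate CRAN package, so this assumes it is available), it produces a more readable figure:

library(rpart.plot)
rpart.plot(fit.rp)  # nicer rendering of the same fitted tree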
Neural networks perform non-linear regression and are known to be universal approximators, i.e. they can approximate any mapping between \( Y \) (outcome) and \( X \) (predictors), provided the network includes a sufficient number of parameters.
Tuning parameter: size (the number of neurons in the hidden layer)
library(nnet)
set.seed(1)  # nnet starts from random initial weights; fix the seed for reproducibility
fit.nn <- nnet(Disease ~ ., data = d, size = 1)
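To check the fit, we can cross-tabulate the network's predicted classes against the recorded diagnosis. By default nnet drops incomplete rows (na.omit), so the sketch below restricts the comparison to the complete cases; this evaluation step is an addition, not part of the original code.

dc <- d[complete.cases(d), ]  # the rows nnet actually used
table(predicted = predict(fit.nn, newdata = dc, type = "class"),
      observed  = dc$Disease)

With only 40 samples, larger values of size fit the training data more closely but are prone to overfitting, so size should ideally be tuned by cross-validation.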