Joel Correa da Rosa
March 30, 2018
In this short course we will introduce the use of machine learning techniques for classification and regression.
Applications in genomics and genetics will be covered.
Throughout this course we will use R and RStudio, standard tools for data scientists that can be freely downloaded from https://www.r-project.org/ and https://www.rstudio.com/, respectively.
R is a free software environment for statistical computing and graphics that is supported by the R Foundation for Statistical Computing.
library(caret)
library(Boruta)
library(rpart)
library(nnet)
library(e1071)
library(randomForest)
library(ggplot2)
library(pROC)
library(nlme)
If a package is not installed, you can install it with install.packages(), for example:
install.packages('Boruta')
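A minimal sketch, assuming the package list above, that installs only the packages that are not yet available on your system:
pkgs <- c("caret", "Boruta", "rpart", "nnet", "e1071",
          "randomForest", "ggplot2", "pROC", "nlme")
# find the packages that are not yet installed
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
# install them in one call (skipped if nothing is missing)
if (length(missing) > 0) install.packages(missing)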
Machine learning is the transformation of data into intelligent action by exploiting the processing capabilities of a computer (machine).
It is more like training an employee than raising a child.
To learn means to improve performance at some task through experience.
Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Tasks: recognition, diagnosis, planning, robot control, prediction, etc.
An example of the learning paradigm applied to a classification task is given below.
T: predict response to a drug from gene expression in biopsied tissues.
P: accuracy in predicting response to the drug.
E: a sample of 100 subjects (50 Responders / 50 Non-Responders) with microarray intensities measured on 50,000 genes.
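As a minimal sketch of how such a classification task could be set up with caret (the data frame gene.data, its response column, and the use of only 20 simulated genes are hypothetical simplifications, not the course data):
# hypothetical data: 100 subjects, 20 gene-expression predictors, and a response label
set.seed(1)
gene.data <- data.frame(matrix(rnorm(100 * 20), nrow = 100))
gene.data$response <- factor(rep(c("Responder", "NonResponder"), each = 50))
# train a random forest classifier (T) and estimate its accuracy (P)
# from the sample of 100 subjects (E) by 5-fold cross-validation
fit.rf <- train(response ~ ., data = gene.data,
                method = "rf",
                trControl = trainControl(method = "cv", number = 5))
fit.rf$results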
Supervised Learning
Unsupervised Learning
Two supervised learning tasks are covered in this short course:
Classification
A categorical outcome \( Y \) is to be predicted by a set of variables \( X_1, X_2, \ldots, X_P \).
Regression
A continuous (or quantitative) outcome \( Y \) is to be predicted by a set of variables \( X_1, X_2, \ldots, X_P \).
The machine learning algorithm is supposed to learn how to predict \( Y \) based on a set of \( n \) observations.
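A minimal illustration (with made-up values) of how the type of the outcome defines the task in R:
# a categorical outcome (factor) defines a classification task
y.class <- factor(c("Responder", "Non-Responder", "Responder"))
# a continuous outcome (numeric) defines a regression task
y.reg <- c(1.2, 0.7, 3.4)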
Machine learning algorithms are trained to learn: like a student being prepared for a school exam, they must train before being tested.
Data sets are usually partitioned into 3 portions:
Training set
Validation set
Test set
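A sketch of such a split with caret's createDataPartition(); the data frame dat, the 70/30 proportion, and the absence of a separate validation set are illustrative choices:
# hypothetical data frame with predictor X and outcome Y
set.seed(2018)
dat <- data.frame(X = rnorm(100), Y = rnorm(100))
# send 70% of the rows to training and keep the rest for testing
in.train <- createDataPartition(dat$Y, p = 0.7, list = FALSE)
train.set <- dat[in.train, ]
test.set <- dat[-in.train, ]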
We generate random data, so that we know the true data-generating process. In R, the function set.seed() makes our work reproducible.
set.seed(123)
# a normally distributed predictor
X <- rnorm(1000, 0, 1)
# standard deviation of the noise
noise <- 0.5
# an outcome linearly related to X, with additive Gaussian noise
Y <- 0.5 + 0.70*X + rnorm(1000, 0, noise)
What is the mapping to be learned?
# scatterplot of the simulated data
plot(X, Y)
# overlay the true line: intercept 0.5, slope 0.7
abline(c(0.5, 0.7))
In this example, we know the data generation process.
# randomly select 85 of the 1000 observations for training
ind.train <- sample(1:1000, 85)
# training sets
Xtrain<-X[ind.train]
Ytrain<-Y[ind.train]
# test sets
Xtest<-X[-ind.train]
Ytest<-Y[-ind.train]
# data frames for training and test
Dtrain <- cbind.data.frame(X = Xtrain, Y = Ytrain)
Dtest <- cbind.data.frame(X = Xtest, Y = Ytest)
# fitting a linear regression
fit.lm <- lm(Y ~ X, data = Dtrain)
summary(fit.lm)
Call:
lm(formula = Y ~ X, data = Dtrain)
Residuals:
Min 1Q Median 3Q Max
-1.49663 -0.29860 0.00437 0.36980 0.97890
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.48371 0.05772 8.38 1.13e-12 ***
X 0.71589 0.06144 11.65 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5314 on 83 degrees of freedom
Multiple R-squared: 0.6206, Adjusted R-squared: 0.616
F-statistic: 135.8 on 1 and 83 DF, p-value: < 2.2e-16
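The estimated intercept (0.48) and slope (0.72) are close to the true values of 0.5 and 0.7 used to generate the data. The estimates can also be extracted directly:
# extract the fitted coefficients
coef(fit.lm)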
# prediction in training set
pred.lm.train <- predict(fit.lm, newdata = Dtrain)
# predictions in test set
pred.lm.test <- predict(fit.lm, newdata = Dtest)
cbind(Pred = pred.lm.train, Obs = Dtrain$Y)[1:10,]
Pred Obs
1 0.170749274 0.40638822
2 0.056085383 -0.08434249
3 1.085436602 0.86247141
4 0.681486887 1.48711926
5 1.303583792 1.28245578
6 0.986103462 0.81178075
7 1.357776805 0.96513173
8 -0.008002725 0.36180212
9 -0.365483167 -0.59215796
10 -0.533646166 -0.71063779
cbind(Pred = pred.lm.test, Obs = Dtest$Y)[1:10,]
Pred Obs
1 0.08246897 -0.39023232
2 0.31892486 -0.18110176
3 1.59956350 1.58210570
4 0.53418173 0.48326831
5 0.57626105 -0.68416997
6 1.71149709 2.22083222
7 0.81366929 0.94750421
8 -0.42193412 0.82256082
9 0.16466250 -0.03544303
10 1.36000900 2.75555283
plot(pred.lm.train, Dtrain$Y)
RMSE.train = RMSE(pred.lm.train, Dtrain$Y)
RMSE.train
[1] 0.5251293
plot(pred.lm.test, Dtest$Y)
RMSE.test = RMSE(pred.lm.test, Dtest$Y)
RMSE.test
[1] 0.5029191
Mean Squared Error \( MSE = \frac{1}{n}\sum_{i=1}^{n}(P_i-O_i)^2 \), where \( P_i \) is the predicted and \( O_i \) the observed value for observation \( i \).
Root Mean Squared Error \( RMSE = \sqrt{MSE} \)
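Both the training RMSE (0.525) and the test RMSE (0.503) are close to the noise standard deviation of 0.5 used to simulate the data, as expected for a correctly specified model. As a check, computing the formula above by hand should reproduce the value returned by caret's RMSE():
# RMSE computed directly from its definition
sqrt(mean((pred.lm.test - Dtest$Y)^2))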
# fitting a misspecified model: a regression of Y on X^2
fit.sq <- lm(Y ~ I(X^2), data = Dtrain)
summary(fit.sq)
Call:
lm(formula = Y ~ I(X^2), data = Dtrain)
Residuals:
Min 1Q Median 3Q Max
-2.0336 -0.5169 -0.1278 0.4850 2.1868
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.61492 0.12302 4.998 3.17e-06 ***
I(X^2) -0.18822 0.09336 -2.016 0.047 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8423 on 83 degrees of freedom
Multiple R-squared: 0.04668, Adjusted R-squared: 0.0352
F-statistic: 4.065 on 1 and 83 DF, p-value: 0.04703
# prediction in training set
pred.sq.train <- predict(fit.sq, newdata = Dtrain)
# predictions in test set
pred.sq.test <- predict(fit.sq, newdata = Dtest)
plot(pred.sq.train, Dtrain$Y)
RMSE.train = RMSE(pred.sq.train, Dtrain$Y)
RMSE.train
[1] 0.8323796
plot(pred.sq.test, Dtest$Y)
RMSE.test = RMSE(pred.sq.test, Dtest$Y)
RMSE.test
[1] 0.9568338
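Comparing the two models side by side makes the point: the correctly specified linear model has the lower error on both the training and the test set, while the misspecified quadratic model explains almost no variance (R-squared of about 0.05) because the true relationship between Y and X is linear. A short sketch that collects the four RMSE values for comparison:
# collect training and test RMSE for both models
rmse.summary <- data.frame(
  model = c("Y ~ X", "Y ~ I(X^2)"),
  RMSE.train = c(RMSE(pred.lm.train, Dtrain$Y), RMSE(pred.sq.train, Dtrain$Y)),
  RMSE.test = c(RMSE(pred.lm.test, Dtest$Y), RMSE(pred.sq.test, Dtest$Y))
)
rmse.summary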
Algorithms