Joel Correa da Rosa
March 30, 2018
In this short course we will introduce the use of machine learning techniques for classification and regression.
Applications in genomics and genetics will be covered.
Throughout this course we will use R and RStudio, standard tools for data scientists; both can be freely downloaded from their official websites.
R is a free software environment for statistical computing and graphics that is supported by the R Foundation for Statistical Computing.
library(caret)         # unified interface for training and tuning ML models
library(Boruta)        # feature selection based on random forests
library(rpart)         # recursive partitioning (decision trees)
library(nnet)          # feed-forward neural networks
library(e1071)         # support vector machines and other utilities
library(randomForest)  # random forests for classification and regression
library(ggplot2)       # graphics
library(pROC)          # ROC curves and AUC
library(nlme)          # linear and nonlinear mixed-effects models
library(plsgenomics)   # partial least squares methods for genomic data
If a package is not installed, you can install it with:
install.packages('Boruta')
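A sketch for installing all of the packages listed above in one pass, skipping those already present (the vector of package names is the only assumption):

# install only the packages that are not yet available locally
pkgs <- c("caret", "Boruta", "rpart", "nnet", "e1071", "randomForest",
          "ggplot2", "pROC", "nlme", "plsgenomics")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)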
Machine learning is the transformation of data into intelligent action by exploiting the processing capabilities of a computer (machine).
It is more like training an employee than raising a child.
To learn means to improve performance at some task through experience.
Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Tasks: recognition, diagnosis, planning, robot control, prediction, etc.
An example of the learning paradigm applied to a classification task is given below.
T: predict response to a drug from gene expression in biopsied tissues.
P: accuracy in predicting response to the drug.
E: a sample of 100 subjects (50 responders / 50 non-responders) with microarray intensities measured on 50,000 genes.
Supervised Learning
Unsupervised Learning
Two types of task are covered in this short course.
Classification
A categorical outcome \( Y \) is to be predicted by a set of variables \( X_1, X_2, \ldots, X_P \)
Regression
A continuous (or quantitative) outcome \( Y \) is to be predicted by a set of variables \( X_1, X_2, \ldots, X_P \)
The machine learning algorithm is supposed to learn how to predict \( Y \) from a set of \( n \) observations.
Pre-1980s: learning methods could only learn linear decision surfaces. Linear learning methods have nice theoretical properties.
1980s: neural networks and decision trees allowed efficient learning of non-linear decision surfaces. Their theoretical basis is minimal and they suffer from local minima.
1990s: efficient learning algorithms for non-linear functions, based on computational learning theory, were developed; they combine non-linear decision surfaces with nice theoretical properties.
Machine learning algorithms are trained to learn. Like a student preparing for a school exam, an algorithm must train before it is tested.
Data sets are usually partitioned into 3 portions: a training set, a validation set (used for tuning) and a test set, as sketched below.
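A minimal sketch of such a three-way split, assuming a generic data frame D (a placeholder name) and 60/20/20 proportions chosen here only for illustration:

# shuffle the row indices of a hypothetical data frame D
n   <- nrow(D)
idx <- sample(seq_len(n))
n.train <- floor(0.60 * n)
n.valid <- floor(0.20 * n)
# first 60% of the shuffled rows for training, next 20% for validation, rest for testing
D.train <- D[idx[1:n.train], ]
D.valid <- D[idx[(n.train + 1):(n.train + n.valid)], ]
D.test  <- D[idx[(n.train + n.valid + 1):n], ]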
We generate random data so that we know the truth behind it. In R, the function set.seed() makes our work reproducible.
set.seed(123)
# a normally distributed predictor
X <- rnorm(1000, 0, 1)
# standard deviation of the noise
noise <- 0.5
# an outcome linearly related to X, plus Gaussian noise
Y <- 0.5 + 0.70*X + rnorm(1000, 0, noise)
What is the mapping to be learned?
plot(X,Y)
abline(c(0.5,0.7))
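Since ggplot2 is already loaded, the same figure can be drawn with it; a sketch, with the true line \( Y = 0.5 + 0.7X \) overlaid in red:

# scatter plot of the simulated data with the true regression line
ggplot(data.frame(X = X, Y = Y), aes(x = X, y = Y)) +
  geom_point(alpha = 0.3) +
  geom_abline(intercept = 0.5, slope = 0.7, colour = "red")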
In this example, we know the data generation process.
# randomly select 850 of the 1000 indexes (85%) for training
ind.train <- sample(1:1000, 850)
# training sets
Xtrain<-X[ind.train]
Ytrain<-Y[ind.train]
# test sets
Xtest<-X[-ind.train]
Ytest<-Y[-ind.train]
# Data frames for training and test
Dtrain = cbind.data.frame(X= Xtrain, Y = Ytrain)
Dtest = cbind.data.frame(X= Xtest, Y = Ytest)
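caret, loaded above, provides createDataPartition() as an alternative to sample(); a sketch that would build a comparable 85% split (the *.alt object names are introduced here only for illustration):

# createDataPartition stratifies a numeric outcome by quantiles
ind.caret  <- createDataPartition(Y, p = 0.85, list = FALSE)
Dtrain.alt <- cbind.data.frame(X = X[ind.caret],  Y = Y[ind.caret])
Dtest.alt  <- cbind.data.frame(X = X[-ind.caret], Y = Y[-ind.caret])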
# fitting a linear regression
fit.lm <- lm(Y ~ X, data = Dtrain)
summary(fit.lm)
Call:
lm(formula = Y ~ X, data = Dtrain)
Residuals:
Min 1Q Median 3Q Max
-1.52681 -0.33250 0.00977 0.34290 1.63686
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.53126 0.01718 30.91 <2e-16 ***
X 0.74095 0.01764 41.99 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5009 on 848 degrees of freedom
Multiple R-squared: 0.6753, Adjusted R-squared: 0.6749
F-statistic: 1764 on 1 and 848 DF, p-value: < 2.2e-16
# prediction in training set
pred.lm.train <- predict(fit.lm, newdata = Dtrain)
# predictions in test set
pred.lm.test <- predict(fit.lm, newdata = Dtest)
cbind(Pred = pred.lm.train, Obs = Dtrain$Y)[1:10,]
Pred Obs
1 0.20734945 0.40638822
2 0.08867142 -0.08434249
3 1.15405804 0.86247141
4 0.73596690 1.48711926
5 1.37984210 1.28245578
6 1.05124746 0.81178075
7 1.43593229 0.96513173
8 0.02233973 0.36180212
9 -0.34765534 -0.59215796
10 -0.52170537 -0.71063779
cbind(Pred = pred.lm.test, Obs = Dtest$Y)[1:10,]
Pred Obs
1 1.8552713 1.8443648
2 0.3126293 0.4435157
3 1.1945033 0.6493635
4 0.4853885 0.6708278
5 -0.3009030 -0.7963763
6 0.2327452 -0.4387448
7 -0.6162832 -0.9873534
8 2.0502677 2.8246106
9 -1.1797118 -1.3023907
10 1.3439624 0.8220265
plot(pred.lm.train, Dtrain$Y)
RMSE.train = RMSE(pred.lm.train, Dtrain$Y)
RMSE.train
[1] 0.5003348
plot(pred.lm.test, Dtest$Y)
RMSE.test = RMSE(pred.lm.test, Dtest$Y)
RMSE.test
[1] 0.5166585
Mean Squared Error \( MSE = \frac{1}{n}\sum_{i=1}^{n}(P_i-O_i)^2 \), where \( P_i \) is the predicted and \( O_i \) the observed value
Root Mean Squared Error \( RMSE = \sqrt{MSE} \)
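A sketch of computing the RMSE directly from the definition, using the test-set predictions obtained above; it should reproduce the value returned by caret's RMSE():

# root mean squared error computed by hand on the test set
sqrt(mean((pred.lm.test - Dtest$Y)^2))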
# fitting a regression of Y on X^2 (a misspecified mapping)
fit.sq <- lm(Y ~ I(X^2), data = Dtrain)
summary(fit.sq)
Call:
lm(formula = Y ~ I(X^2), data = Dtrain)
Residuals:
Min 1Q Median 3Q Max
-3.2535 -0.5838 0.0186 0.5470 2.8434
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.52175 0.03706 14.080 <2e-16 ***
I(X^2) 0.02438 0.02274 1.072 0.284
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8785 on 848 degrees of freedom
Multiple R-squared: 0.001354, Adjusted R-squared: 0.000176
F-statistic: 1.149 on 1 and 848 DF, p-value: 0.284
# prediction in training set
pred.sq.train <- predict(fit.sq, newdata = Dtrain)
# predictions in test set
pred.sq.test <- predict(fit.sq, newdata = Dtest)
plot(pred.sq.train, Dtrain$Y)
RMSE.train = RMSE(pred.sq.train, Dtrain$Y)
RMSE.train
[1] 0.8774467
plot(pred.sq.test, Dtest$Y)
RMSE.test = RMSE(pred.sq.test, Dtest$Y)
RMSE.test
[1] 0.9674257
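For comparison, the two fits can be placed side by side (a small sketch using the prediction objects created above); the model based on \( X^2 \), which does not match the data-generating process, is clearly worse on the test set:

# test-set RMSE of the correctly specified and the misspecified model
data.frame(model = c("Y ~ X", "Y ~ I(X^2)"),
           RMSE.test = c(RMSE(pred.lm.test, Dtest$Y),
                         RMSE(pred.sq.test, Dtest$Y)))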
Generate data for a logistic regression
X1 = rnorm(1000) # some continuous variables
X2 = rnorm(1000)
Z = 1 + 0.8*X1 + 3.5*X2 # linear combination with a bias
Pr = 1/(1+exp(-Z)) # pass through an inv-logit function
Y1 = factor(rbinom(1000,1,Pr))
plot(X1,Y1)
plot(X2,Y1)
Dlog<-cbind.data.frame(X1,X2,Y1)
summary(Dlog)
X1 X2 Y1
Min. :-2.949872 Min. :-3.12909 0:389
1st Qu.:-0.635576 1st Qu.:-0.66774 1:611
Median :-0.039240 Median : 0.03118
Mean : 0.008998 Mean :-0.01065
3rd Qu.: 0.675692 3rd Qu.: 0.65103
Max. : 3.421095 Max. : 3.44599
Training and testing sets
# reuse the training indexes generated earlier
Dlog.train <- Dlog[ind.train, ]
Dlog.test  <- Dlog[-ind.train, ]
# fitting a logistic regression
fit.log <- glm(Y1 ~ X1 + X2, data = Dlog.train, family = "binomial")
summary(fit.log)
Call:
glm(formula = Y1 ~ X1 + X2, family = "binomial", data = Dlog.train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.58628 -0.33123 0.08766 0.39971 2.82640
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.1545 0.1310 8.811 < 2e-16 ***
X1 0.9104 0.1303 6.989 2.76e-12 ***
X2 3.5121 0.2511 13.987 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1138.20 on 849 degrees of freedom
Residual deviance: 499.65 on 847 degrees of freedom
AIC: 505.65
Number of Fisher Scoring iterations: 6
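As a quick sanity check (a sketch based on the fitted object above), the maximum likelihood estimates can be placed next to the true coefficients used to simulate the data:

# true simulation coefficients (1, 0.8, 3.5) versus the fitted estimates
cbind(true = c(1, 0.8, 3.5), estimated = coef(fit.log))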
# prediction in training set
pred.log.train <- predict(fit.log, newdata = Dlog.train, type = "response")
# predictions in test set
pred.log.test <- predict(fit.log, newdata = Dlog.test, type = "response")
plot(pred.log.train, Dlog.train$Y1)
plot(pred.log.test, Dlog.test$Y1)
table(pred.log.train>0.5, Dlog.train$Y1)
0 1
FALSE 269 51
TRUE 64 466
table(pred.log.test>0.5, Dlog.test$Y1)
0 1
FALSE 46 7
TRUE 10 87
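Since caret and pROC are already loaded, the same evaluation can be done with confusionMatrix() and an ROC curve; a sketch using the objects above (pred.class.test is a new name introduced here):

# confusion matrix with accuracy, sensitivity and specificity on the test set
pred.class.test <- factor(as.numeric(pred.log.test > 0.5), levels = c(0, 1))
confusionMatrix(pred.class.test, Dlog.test$Y1, positive = "1")
# ROC curve and area under the curve from the predicted probabilities
roc.test <- roc(Dlog.test$Y1, pred.log.test)
plot(roc.test)
auc(roc.test)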
Algorithms
In broad terms, the parameters to be estimated in machine learning algorithms are either “Tuning/Penalty” or “Effect/Impact” parameters.
The Tuning/Penalty parameters keep the classification or regression strategy simple, which increases the likelihood of good performance on the test data.
The Effect/Impact parameters describe the magnitude of each predictor's contribution to the outcome/response variable. A sketch of the distinction is given below.
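A small illustration, sketched with the simulated regression data from earlier: in a regression tree the complexity parameter cp acts as a tuning/penalty parameter that caret can select by cross-validation, whereas the intercept and slope of the linear model are effect parameters estimated directly from the data.

# tuning/penalty parameter: caret selects cp for an rpart tree by 5-fold CV
fit.tree <- train(Y ~ X, data = Dtrain, method = "rpart",
                  trControl = trainControl(method = "cv", number = 5),
                  tuneLength = 5)
fit.tree$bestTune  # the chosen value of the penalty parameter cp

# effect parameters: the estimated intercept and slope of the linear model
coef(fit.lm)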