Joel Correa da Rosa
March 30, 2018
In this short course we will introduce the use of machine learning techniques for classification and regression.
Applications in genomics and genetics will be covered.
Throughout this course we will use R and RStudio, standard tools for data scientists; both can be freely downloaded from their official websites.
R is a free software environment for statistical computing and graphics that is supported by the R Foundation for Statistical Computing.
library(caret)         # unified interface for training and tuning ML models
library(Boruta)        # feature selection based on random forests
library(rpart)         # recursive partitioning (decision trees)
library(nnet)          # feed-forward neural networks
library(e1071)         # support vector machines and other utilities
library(randomForest)  # random forests for classification and regression
library(ggplot2)       # graphics
library(pROC)          # ROC curves and AUC
library(nlme)          # linear and nonlinear mixed-effects models
library(plsgenomics)   # partial least squares methods for genomic data
If a package is not installed, you can install it with:
install.packages('Boruta')
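A sketch for installing all of the packages listed above in one pass, skipping those already present (the vector of package names is the only assumption):

# install only the packages that are not yet available locally
pkgs <- c("caret", "Boruta", "rpart", "nnet", "e1071", "randomForest",
          "ggplot2", "pROC", "nlme", "plsgenomics")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)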
Machine learning is the transformation of data into intelligent action by exploiting the processing capabilities of a computer (machine).
It is more like training an employee than raising a child.
To learn means to improve performance at some task through experience.
Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Tasks: recognition, diagnosis, planning, robot control, prediction, etc.
An example of the learning paradigm applied to a classification task is given below.
T: predict response to a drug from gene expression in biopsied tissues.
P: accuracy in predicting response to the drug.
E: a sample of 100 subjects (50 responders / 50 non-responders) with microarray intensities measured on 50,000 genes.
Supervised Learning
Unsupervised Learning
Two types of task are covered in this short course.
Classification
A categorical outcome \( Y \) is to be predicted by a set of variables \( X_1, X_2, \ldots, X_P \)
Regression
A continuous (or quantitative) outcome \( Y \) is to be predicted by a set of variables \( X_1, X_2, \ldots, X_P \)
The machine learning algorithm is supposed to learn how to predict \( Y \) from a set of \( n \) observations.
Pre-1980s: learning methods could only learn linear decision surfaces. Linear learning methods have nice theoretical properties.
1980s: neural networks and decision trees allowed efficient learning of non-linear decision surfaces. Their theoretical basis is minimal and they suffer from local minima.
1990s: efficient learning algorithms for non-linear functions, based on computational learning theory, were developed; they combine non-linear decision surfaces with nice theoretical properties.
Machine learning algorithms are trained to learn. Like a student preparing for a school exam, an algorithm must train before it is tested.
Data sets are usually partitioned into 3 portions: a training set, a validation set (used for tuning) and a test set, as sketched below.
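A minimal sketch of such a three-way split, assuming a generic data frame D (a placeholder name) and 60/20/20 proportions chosen here only for illustration:

# shuffle the row indices of a hypothetical data frame D
n   <- nrow(D)
idx <- sample(seq_len(n))
n.train <- floor(0.60 * n)
n.valid <- floor(0.20 * n)
# first 60% of the shuffled rows for training, next 20% for validation, rest for testing
D.train <- D[idx[1:n.train], ]
D.valid <- D[idx[(n.train + 1):(n.train + n.valid)], ]
D.test  <- D[idx[(n.train + n.valid + 1):n], ]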
We generate random data so that we know the truth behind it. In R, the function set.seed() makes our work reproducible.
set.seed(123)
# a normally distributed predictor
X <- rnorm(1000, 0, 1)
# standard deviation of the noise
noise <- 0.5
# an outcome linearly related to X, plus Gaussian noise
Y <- 0.5 + 0.70*X + rnorm(1000, 0, noise)
What is the mapping to be learned?
plot(X,Y)
abline(c(0.5,0.7))
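Since ggplot2 is already loaded, the same figure can be drawn with it; a sketch, with the true line \( Y = 0.5 + 0.7X \) overlaid in red:

# scatter plot of the simulated data with the true regression line
ggplot(data.frame(X = X, Y = Y), aes(x = X, y = Y)) +
  geom_point(alpha = 0.3) +
  geom_abline(intercept = 0.5, slope = 0.7, colour = "red")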
In this example, we know the data generation process.
# randomly select 850 of the 1000 indexes (85%) for training
ind.train <- sample(1:1000, 850)
# training sets
Xtrain<-X[ind.train]
Ytrain<-Y[ind.train]
# test sets
Xtest<-X[-ind.train]
Ytest<-Y[-ind.train]
# Data frames for training and test
Dtrain = cbind.data.frame(X= Xtrain, Y = Ytrain)
Dtest = cbind.data.frame(X= Xtest, Y = Ytest)
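caret, loaded above, provides createDataPartition() as an alternative to sample(); a sketch that would build a comparable 85% split (the *.alt object names are introduced here only for illustration):

# createDataPartition stratifies a numeric outcome by quantiles
ind.caret  <- createDataPartition(Y, p = 0.85, list = FALSE)
Dtrain.alt <- cbind.data.frame(X = X[ind.caret],  Y = Y[ind.caret])
Dtest.alt  <- cbind.data.frame(X = X[-ind.caret], Y = Y[-ind.caret])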
# fitting a linear regression
fit.lm <- lm(Y ~ X, data = Dtrain)
summary(fit.lm)
Call:
lm(formula = Y ~ X, data = Dtrain)
Residuals:
Min 1Q Median 3Q Max
-1.52681 -0.33250 0.00977 0.34290 1.63686
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.53126 0.01718 30.91 <2e-16 ***
X 0.74095 0.01764 41.99 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5009 on 848 degrees of freedom
Multiple R-squared: 0.6753, Adjusted R-squared: 0.6749
F-statistic: 1764 on 1 and 848 DF, p-value: < 2.2e-16
# prediction in training set
pred.lm.train <- predict(fit.lm, newdata = Dtrain)
# predictions in test set
pred.lm.test <- predict(fit.lm, newdata = Dtest)
cbind(Pred = pred.lm.train, Obs = Dtrain$Y)[1:10,]
Pred Obs
1 0.20734945 0.40638822
2 0.08867142 -0.08434249
3 1.15405804 0.86247141
4 0.73596690 1.48711926
5 1.37984210 1.28245578
6 1.05124746 0.81178075
7 1.43593229 0.96513173
8 0.02233973 0.36180212
9 -0.34765534 -0.59215796
10 -0.52170537 -0.71063779
cbind(Pred = pred.lm.test, Obs = Dtest$Y)[1:10,]
Pred Obs
1 1.8552713 1.8443648
2 0.3126293 0.4435157
3 1.1945033 0.6493635
4 0.4853885 0.6708278
5 -0.3009030 -0.7963763
6 0.2327452 -0.4387448
7 -0.6162832 -0.9873534
8 2.0502677 2.8246106
9 -1.1797118 -1.3023907
10 1.3439624 0.8220265
plot(pred.lm.train, Dtrain$Y)
RMSE.train = RMSE(pred.lm.train, Dtrain$Y)
RMSE.train
[1] 0.5003348
plot(pred.lm.test, Dtest$Y)
RMSE.test = RMSE(pred.lm.test, Dtest$Y)
RMSE.test
[1] 0.5166585
Mean Squared Error \( MSE = \frac{1}{n}\sum_{i=1}^{n}(P_i-O_i)^2 \), where \( P_i \) is the predicted and \( O_i \) the observed value
Root Mean Squared Error \( RMSE = \sqrt{MSE} \)
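A sketch of computing the RMSE directly from the definition, using the test-set predictions obtained above; it should reproduce the value returned by caret's RMSE():

# root mean squared error computed by hand on the test set
sqrt(mean((pred.lm.test - Dtest$Y)^2))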
# fitting a regression of Y on X^2 (a misspecified mapping)
fit.sq <- lm(Y ~ I(X^2), data = Dtrain)
summary(fit.sq)
Call:
lm(formula = Y ~ I(X^2), data = Dtrain)
Residuals:
Min 1Q Median 3Q Max
-3.2535 -0.5838 0.0186 0.5470 2.8434
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.52175 0.03706 14.080 <2e-16 ***
I(X^2) 0.02438 0.02274 1.072 0.284
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8785 on 848 degrees of freedom
Multiple R-squared: 0.001354, Adjusted R-squared: 0.000176
F-statistic: 1.149 on 1 and 848 DF, p-value: 0.284
# prediction in training set
pred.sq.train <- predict(fit.sq, newdata = Dtrain)
# predictions in test set
pred.sq.test <- predict(fit.sq, newdata = Dtest)
plot(pred.sq.train, Dtrain$Y)
RMSE.train = RMSE(pred.sq.train, Dtrain$Y)
RMSE.train
[1] 0.8774467
plot(pred.sq.test, Dtest$Y)
RMSE.test = RMSE(pred.sq.test, Dtest$Y)
RMSE.test
[1] 0.9674257
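For comparison, the two fits can be placed side by side (a small sketch using the prediction objects created above); the model based on \( X^2 \), which does not match the data-generating process, is clearly worse on the test set:

# test-set RMSE of the correctly specified and the misspecified model
data.frame(model = c("Y ~ X", "Y ~ I(X^2)"),
           RMSE.test = c(RMSE(pred.lm.test, Dtest$Y),
                         RMSE(pred.sq.test, Dtest$Y)))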
Generate data for a logistic regression
X1 = rnorm(1000) # some continuous variables
X2 = rnorm(1000)
Z = 1 + 0.8*X1 + 3.5*X2 # linear combination with a bias
Pr = 1/(1+exp(-Z)) # pass through an inv-logit function
Y1 = factor(rbinom(1000,1,Pr))
plot(X1,Y1)
plot(X2,Y1)
Dlog<-cbind.data.frame(X1,X2,Y1)
summary(Dlog)
X1 X2 Y1
Min. :-2.949872 Min. :-3.12909 0:389
1st Qu.:-0.635576 1st Qu.:-0.66774 1:611
Median :-0.039240 Median : 0.03118
Mean : 0.008998 Mean :-0.01065
3rd Qu.: 0.675692 3rd Qu.: 0.65103
Max. : 3.421095 Max. : 3.44599
Training and testing sets
# reuse the training indexes generated earlier
Dlog.train <- Dlog[ind.train, ]
Dlog.test  <- Dlog[-ind.train, ]
# fitting a logistic regression
fit.log <- glm(Y1 ~ X1 + X2, data = Dlog.train, family = "binomial")
summary(fit.log)
Call:
glm(formula = Y1 ~ X1 + X2, family = "binomial", data = Dlog.train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.58628 -0.33123 0.08766 0.39971 2.82640
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.1545 0.1310 8.811 < 2e-16 ***
X1 0.9104 0.1303 6.989 2.76e-12 ***
X2 3.5121 0.2511 13.987 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1138.20 on 849 degrees of freedom
Residual deviance: 499.65 on 847 degrees of freedom
AIC: 505.65
Number of Fisher Scoring iterations: 6
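As a quick sanity check (a sketch based on the fitted object above), the maximum likelihood estimates can be placed next to the true coefficients used to simulate the data:

# true simulation coefficients (1, 0.8, 3.5) versus the fitted estimates
cbind(true = c(1, 0.8, 3.5), estimated = coef(fit.log))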
# prediction in training set
pred.log.train <- predict(fit.log, newdata = Dlog.train, type = "response")
# predictions in test set
pred.log.test <- predict(fit.log, newdata = Dlog.test, type = "response")
plot(pred.log.train, Dlog.train$Y1)
plot(pred.log.test, Dlog.test$Y1)
table(pred.log.train>0.5, Dlog.train$Y1)
0 1
FALSE 269 51
TRUE 64 466
table(pred.log.test>0.5, Dlog.test$Y1)
0 1
FALSE 46 7
TRUE 10 87
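Since caret and pROC are already loaded, the same evaluation can be done with confusionMatrix() and an ROC curve; a sketch using the objects above (pred.class.test is a new name introduced here):

# confusion matrix with accuracy, sensitivity and specificity on the test set
pred.class.test <- factor(as.numeric(pred.log.test > 0.5), levels = c(0, 1))
confusionMatrix(pred.class.test, Dlog.test$Y1, positive = "1")
# ROC curve and area under the curve from the predicted probabilities
roc.test <- roc(Dlog.test$Y1, pred.log.test)
plot(roc.test)
auc(roc.test)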
Algorithms
In broad terms, the parameters to be estimated in machine learning algorithms are either “Tuning/Penalty” or “Effect/Impact” parameters.
The Tuning/Penalty parameters keep the classification or regression strategy simple, which increases the likelihood of good performance on the test data.
The Effect/Impact parameters describe the magnitude of each predictor's contribution to the outcome/response variable. A sketch of the distinction is given below.
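A small illustration, sketched with the simulated regression data from earlier: in a regression tree the complexity parameter cp acts as a tuning/penalty parameter that caret can select by cross-validation, whereas the intercept and slope of the linear model are effect parameters estimated directly from the data.

# tuning/penalty parameter: caret selects cp for an rpart tree by 5-fold CV
fit.tree <- train(Y ~ X, data = Dtrain, method = "rpart",
                  trControl = trainControl(method = "cv", number = 5),
                  tuneLength = 5)
fit.tree$bestTune  # the chosen value of the penalty parameter cp

# effect parameters: the estimated intercept and slope of the linear model
coef(fit.lm)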