Joel Correa da Rosa
March 30, 2018
In this short course we will introduce the use of machine learning techniques for classification and regression.
Applications in genomics and genetics will be covered.
Throughout this course we will use R and RStudio, standard tools for data scientists that can be freely downloaded from https://www.r-project.org/ and https://www.rstudio.com/, respectively.
R is a free software environment for statistical computing and graphics that is supported by the R Foundation for Statistical Computing.
library(caret)
library(Boruta)
library(rpart)
library(nnet)
library(e1071)
library(randomForest)
library(ggplot2)
library(pROC)
library(nlme)
If a package is not installed, you can install it with install.packages(), for example:
install.packages('Boruta')
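A minimal sketch, assuming the package list above, that installs only the packages that are not yet available on your system:
pkgs <- c("caret", "Boruta", "rpart", "nnet", "e1071",
          "randomForest", "ggplot2", "pROC", "nlme")
# find the packages that are not yet installed
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
# install them in one call (skipped if nothing is missing)
if (length(missing) > 0) install.packages(missing)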
Machine learning is the transformation of data into intelligent action by exploiting the processing capabilities of a computer (machine).
It is more like training an employee than raising a child.
To learn means to improve performance at some task through experience.
Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Tasks: recognition, diagnosis, planning, robot control, prediction, etc.
An example of the learning paradigm applied to a classification task is given below.
T: predict response to a drug from gene expression in biopsied tissues.
P: accuracy in predicting response to the drug.
E: a sample of 100 subjects (50 Responders / 50 Non-Responders) with microarray intensities measured on 50,000 genes.
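As a minimal sketch of how such a classification task could be set up with caret (the data frame gene.data, its response column, and the use of only 20 simulated genes are hypothetical simplifications, not the course data):
# hypothetical data: 100 subjects, 20 gene-expression predictors, and a response label
set.seed(1)
gene.data <- data.frame(matrix(rnorm(100 * 20), nrow = 100))
gene.data$response <- factor(rep(c("Responder", "NonResponder"), each = 50))
# train a random forest classifier (T) and estimate its accuracy (P)
# from the sample of 100 subjects (E) by 5-fold cross-validation
fit.rf <- train(response ~ ., data = gene.data,
                method = "rf",
                trControl = trainControl(method = "cv", number = 5))
fit.rf$results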
Supervised Learning
Unsupervised Learning
Two supervised learning tasks are covered in this short course:
Classification
A categorical outcome \( Y \) is to be predicted by a set of variables \( X_1, X_2, \ldots, X_P \).
Regression
A continuous (or quantitative) outcome \( Y \) is to be predicted by a set of variables \( X_1, X_2, \ldots, X_P \).
The machine learning algorithm is supposed to learn how to predict \( Y \) based on a set of \( n \) observations.
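A minimal illustration (with made-up values) of how the type of the outcome defines the task in R:
# a categorical outcome (factor) defines a classification task
y.class <- factor(c("Responder", "Non-Responder", "Responder"))
# a continuous outcome (numeric) defines a regression task
y.reg <- c(1.2, 0.7, 3.4)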
Machine learning algorithms are trained to learn: like a student being prepared for a school exam, they must train before being tested.
Data sets are usually partitioned into 3 portions:
Training set
Validation set
Test set
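A sketch of such a split with caret's createDataPartition(); the data frame dat, the 70/30 proportion, and the absence of a separate validation set are illustrative choices:
# hypothetical data frame with predictor X and outcome Y
set.seed(2018)
dat <- data.frame(X = rnorm(100), Y = rnorm(100))
# send 70% of the rows to training and keep the rest for testing
in.train <- createDataPartition(dat$Y, p = 0.7, list = FALSE)
train.set <- dat[in.train, ]
test.set <- dat[-in.train, ]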
We generate random data, so that we know the true data-generating process. In R, the function set.seed() makes our work reproducible.
set.seed(123)
# a normally distributed predictor
X <- rnorm(1000, 0, 1)
# standard deviation of the noise
noise <- 0.5
# an outcome linearly related to X, with additive Gaussian noise
Y <- 0.5 + 0.70*X + rnorm(1000, 0, noise)
What is the mapping to be learned?
# scatterplot of the simulated data
plot(X, Y)
# overlay the true line: intercept 0.5, slope 0.7
abline(c(0.5, 0.7))
In this example, we know the data generation process.
# randomly select 85 of the 1000 observations for training
ind.train <- sample(1:1000, 85)
# training sets
Xtrain<-X[ind.train]
Ytrain<-Y[ind.train]
# test sets
Xtest<-X[-ind.train]
Ytest<-Y[-ind.train]
# data frames for training and test
Dtrain <- cbind.data.frame(X = Xtrain, Y = Ytrain)
Dtest <- cbind.data.frame(X = Xtest, Y = Ytest)
# fitting a linear regression
fit.lm <- lm(Y ~ X, data = Dtrain)
summary(fit.lm)
Call:
lm(formula = Y ~ X, data = Dtrain)
Residuals:
Min 1Q Median 3Q Max
-1.49663 -0.29860 0.00437 0.36980 0.97890
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.48371 0.05772 8.38 1.13e-12 ***
X 0.71589 0.06144 11.65 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5314 on 83 degrees of freedom
Multiple R-squared: 0.6206, Adjusted R-squared: 0.616
F-statistic: 135.8 on 1 and 83 DF, p-value: < 2.2e-16
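The estimated intercept (0.48) and slope (0.72) are close to the true values of 0.5 and 0.7 used to generate the data. The estimates can also be extracted directly:
# extract the fitted coefficients
coef(fit.lm)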
# prediction in training set
pred.lm.train <- predict(fit.lm, newdata = Dtrain)
# predictions in test set
pred.lm.test <- predict(fit.lm, newdata = Dtest)
cbind(Pred = pred.lm.train, Obs = Dtrain$Y)[1:10,]
Pred Obs
1 0.170749274 0.40638822
2 0.056085383 -0.08434249
3 1.085436602 0.86247141
4 0.681486887 1.48711926
5 1.303583792 1.28245578
6 0.986103462 0.81178075
7 1.357776805 0.96513173
8 -0.008002725 0.36180212
9 -0.365483167 -0.59215796
10 -0.533646166 -0.71063779
cbind(Pred = pred.lm.test, Obs = Dtest$Y)[1:10,]
Pred Obs
1 0.08246897 -0.39023232
2 0.31892486 -0.18110176
3 1.59956350 1.58210570
4 0.53418173 0.48326831
5 0.57626105 -0.68416997
6 1.71149709 2.22083222
7 0.81366929 0.94750421
8 -0.42193412 0.82256082
9 0.16466250 -0.03544303
10 1.36000900 2.75555283
plot(pred.lm.train, Dtrain$Y)
RMSE.train = RMSE(pred.lm.train, Dtrain$Y)
RMSE.train
[1] 0.5251293
plot(pred.lm.test, Dtest$Y)
RMSE.test = RMSE(pred.lm.test, Dtest$Y)
RMSE.test
[1] 0.5029191
Mean Squared Error \( MSE = \frac{1}{n}\sum_{i=1}^{n}(P_i-O_i)^2 \), where \( P_i \) is the predicted and \( O_i \) the observed value for observation \( i \).
Root Mean Squared Error \( RMSE = \sqrt{MSE} \)
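Both the training RMSE (0.525) and the test RMSE (0.503) are close to the noise standard deviation of 0.5 used to simulate the data, as expected for a correctly specified model. As a check, computing the formula above by hand should reproduce the value returned by caret's RMSE():
# RMSE computed directly from its definition
sqrt(mean((pred.lm.test - Dtest$Y)^2))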
# fitting a misspecified model: a regression of Y on X^2
fit.sq <- lm(Y ~ I(X^2), data = Dtrain)
summary(fit.sq)
Call:
lm(formula = Y ~ I(X^2), data = Dtrain)
Residuals:
Min 1Q Median 3Q Max
-2.0336 -0.5169 -0.1278 0.4850 2.1868
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.61492 0.12302 4.998 3.17e-06 ***
I(X^2) -0.18822 0.09336 -2.016 0.047 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8423 on 83 degrees of freedom
Multiple R-squared: 0.04668, Adjusted R-squared: 0.0352
F-statistic: 4.065 on 1 and 83 DF, p-value: 0.04703
# prediction in training set
pred.sq.train <- predict(fit.sq, newdata = Dtrain)
# predictions in test set
pred.sq.test <- predict(fit.sq, newdata = Dtest)
plot(pred.sq.train, Dtrain$Y)
RMSE.train = RMSE(pred.sq.train, Dtrain$Y)
RMSE.train
[1] 0.8323796
plot(pred.sq.test, Dtest$Y)
RMSE.test = RMSE(pred.sq.test, Dtest$Y)
RMSE.test
[1] 0.9568338
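Comparing the two models side by side makes the point: the correctly specified linear model has the lower error on both the training and the test set, while the misspecified quadratic model explains almost no variance (R-squared of about 0.05) because the true relationship between Y and X is linear. A short sketch that collects the four RMSE values for comparison:
# collect training and test RMSE for both models
rmse.summary <- data.frame(
  model = c("Y ~ X", "Y ~ I(X^2)"),
  RMSE.train = c(RMSE(pred.lm.train, Dtrain$Y), RMSE(pred.sq.train, Dtrain$Y)),
  RMSE.test = c(RMSE(pred.lm.test, Dtest$Y), RMSE(pred.sq.test, Dtest$Y))
)
rmse.summary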
Algorithms