This is a summary of section 4 in Machine Learning Course by JHU in coursera. The report include the following topics:

Regularized regression

Basic ideas

  1. Fit a regression model
  2. Penalize large coefficients

Pros and cons

Can help with bias/variance tradeoff; may demand large dataset

Example

library(ElemStatLearn); data("prostate")
str(prostate)
## 'data.frame':    97 obs. of  10 variables:
##  $ lcavol : num  -0.58 -0.994 -0.511 -1.204 0.751 ...
##  $ lweight: num  2.77 3.32 2.69 3.28 3.43 ...
##  $ age    : int  50 58 74 58 62 50 64 58 47 63 ...
##  $ lbph   : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ svi    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lcp    : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ gleason: int  6 6 7 6 6 6 6 6 6 6 ...
##  $ pgg45  : int  0 0 20 0 0 0 0 0 0 0 ...
##  $ lpsa   : num  -0.431 -0.163 -0.163 -0.163 0.372 ...
##  $ train  : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
###regression subset selection
covnames <- names(prostate[-(9:10)])
y <- prostate$lpsa
x <- prostate[,covnames]

form <- as.formula(paste("lpsa~", paste(covnames, collapse="+"), sep=""))
summary(lm(form, data=prostate[prostate$train,]))
## 
## Call:
## lm(formula = form, data = prostate[prostate$train, ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.64870 -0.34147 -0.05424  0.44941  1.48675 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.429170   1.553588   0.276  0.78334    
## lcavol       0.576543   0.107438   5.366 1.47e-06 ***
## lweight      0.614020   0.223216   2.751  0.00792 ** 
## age         -0.019001   0.013612  -1.396  0.16806    
## lbph         0.144848   0.070457   2.056  0.04431 *  
## svi          0.737209   0.298555   2.469  0.01651 *  
## lcp         -0.206324   0.110516  -1.867  0.06697 .  
## gleason     -0.029503   0.201136  -0.147  0.88389    
## pgg45        0.009465   0.005447   1.738  0.08755 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7123 on 58 degrees of freedom
## Multiple R-squared:  0.6944, Adjusted R-squared:  0.6522 
## F-statistic: 16.47 on 8 and 58 DF,  p-value: 2.042e-12
set.seed(1)
train.ind <- sample(nrow(prostate), ceiling(nrow(prostate))/2)
y.test <- prostate$lpsa[-train.ind]
x.test <- x[-train.ind,]

y <- prostate$lpsa[train.ind]
x <- x[train.ind,]

p <- length(covnames)
rss <- list()
for (i in 1:p) {
  cat(i)
  Index <- combn(p,i)

  rss[[i]] <- apply(Index, 2, function(is) {
    form <- as.formula(paste("y~", paste(covnames[is], collapse="+"), sep=""))
    isfit <- lm(form, data=x)
    yhat <- predict(isfit)
    train.rss <- sum((y - yhat)^2)

    yhat <- predict(isfit, newdata=x.test)
    test.rss <- sum((y.test - yhat)^2)
    c(train.rss, test.rss)
  })
}
## 12345678
#png("Plots/selection-plots-01.png", height=432, width=432, pointsize=12)
plot(1:p, 1:p, type="n", ylim=range(unlist(rss)), xlim=c(0,p), xlab="number of predictors", ylab="residual sum of squares", main="Prostate cancer data")
for (i in 1:p) {
  points(rep(i-0.15, ncol(rss[[i]])), rss[[i]][1, ], col="blue")
  points(rep(i+0.15, ncol(rss[[i]])), rss[[i]][2, ], col="red")
}
minrss <- sapply(rss, function(x) min(x[1,]))
lines((1:p)-0.15, minrss, col="blue", lwd=1.7)
minrss <- sapply(rss, function(x) min(x[2,]))
lines((1:p)+0.15, minrss, col="red", lwd=1.7)
legend("topright", c("Train", "Test"), col=c("blue", "red"), pch=1)

We find that as the predictor number increases, the error of training set is going down as we are overfitting the training dataset. However, at the same time the error of test set increases.

Solution for overfitting

  • Split samples (model selection approach)
  • Decomposing expected prediction error

Another issue for high-dimensional data

small = prostate[1:5,]
lm(lpsa~., data = small)
## 
## Call:
## lm(formula = lpsa ~ ., data = small)
## 
## Coefficients:
## (Intercept)       lcavol      lweight          age         lbph  
##     9.60615      0.13901     -0.79142      0.09516           NA  
##         svi          lcp      gleason        pgg45    trainTRUE  
##          NA           NA     -2.08710           NA           NA

Here we see that some predictors were calculated as NA. This is because there are more predictors than the samples and R is not able to do the estimation.

Solution for dealing with high-dimensional data

  • Hard thresolding
  • Regularization of regression
Basic idea of regularization of regression
In order to control the high variance in the data, we can shrink the coefficient by adding penalty. The penalty is used for reducing the complexity and respecting the structure of the problem.

Ridge regression

Solve:

\[ \sum_{i=1}^N \left(y_i - \beta_0 + \sum_{j=1}^p x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^p \beta_j^2\]

Tuning parameter \(\lambda\)

  • Controls the size of coefficients
  • Controls the amount of regularization

Lasso

Lasso stands for least absolute shrinkage and selection operator.

Combining predictors

Basic idea

  • Improves accuracy
  • Reduces interpretability
  • Boosting, bagging, random forest are the variants of the theme

Examples

Basic intuition: majority vote

Approaches for combining predictors

Example with the wage data

library(ISLR); data(Wage)
library(ggplot2); library(caret)
## Loading required package: lattice
Wage <- subset(Wage,select=-c(logwage))

# Create a building data set and validation set
inBuild <- createDataPartition(y=Wage$wage,
                              p=0.7, list=FALSE)
validation <- Wage[-inBuild,]; buildData <- Wage[inBuild,]

inTrain <- createDataPartition(y=buildData$wage,
                              p=0.7, list=FALSE)
training <- buildData[inTrain,]; testing <- buildData[-inTrain,]

###build two different models to predict wage variables
mod1 <- train(wage~., method = "glm", data = training)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
mod2 <- train(wage~., method = "rf", data = training, trControl = trainControl(method = "cv"), number = 3)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
### plot two predictions 
pred1 <- predict(mod1, testing)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
pred2 <- predict(mod2, testing)
qplot(pred1, pred2, data = testing, color = wage)

##fit a model that combines predictors 
predDF <- data.frame(pred1, pred2, wage = testing$wage)
comModFit <- train(wage~., method = "gam", data = predDF )
## Loading required package: mgcv
## Loading required package: nlme
## This is mgcv 1.8-12. For overview type 'help("mgcv-package")'.
combPred <- predict(comModFit, predDF)

###testing errors
sqrt(sum((pred1 - testing$wage)^2))
## [1] 913.4245
sqrt(sum((pred2 - testing$wage)^2))
## [1] 936.1133
sqrt(sum((combPred - testing$wage)^2))
## [1] 904.6411
###prediction on the validation set
pred1V <- predict(mod1,validation); pred2V <- predict(mod2,validation)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
predVDF <- data.frame(pred1=pred1V,pred2=pred2V)
combPredV <- predict(comModFit,predVDF)

Recall scalability matter

Forcasting

Forcasting is typical applied on the time-series data. The characteristics of this kind of analysis includes:

  1. Data are dependent over time
  2. Specific pattern types: trend, seasonal, cycles
  3. Subsampling into training/testing dataset is more complicated
  4. Similar issue arises in spatial data
  5. Typical goal is to predict one or more observations into future

Beaware of surious correlations

Be aware of extrapolations

Example: google data

library(quantmod)
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: TTR
## Version 0.4-0 included new data defaults. See ?getSymbols.
from.dat <- as.Date("01/01/08", format="%m/%d/%y")
to.dat <- as.Date("12/31/13", format="%m/%d/%y")
getSymbols("GOOG", src="google", from = from.dat, to = to.dat)
##     As of 0.4-0, 'getSymbols' uses env=parent.frame() and
##  auto.assign=TRUE by default.
## 
##  This  behavior  will be  phased out in 0.5-0  when the call  will
##  default to use auto.assign=FALSE. getOption("getSymbols.env") and 
##  getOptions("getSymbols.auto.assign") are now checked for alternate defaults
## 
##  This message is shown once per session and may be disabled by setting 
##  options("getSymbols.warning4.0"=FALSE). See ?getSymbols for more details.
## [1] "GOOG"
head(GOOG)
##            GOOG.Open GOOG.High GOOG.Low GOOG.Close GOOG.Volume
## 2008-01-02    346.09    348.34   338.53     342.25          NA
## 2008-01-03    342.29    343.08   337.92     342.32          NA
## 2008-01-04    339.51    340.14   327.17     328.17          NA
## 2008-01-07    326.64    330.81   318.36     324.30          NA
## 2008-01-08    326.17    329.65   315.18     315.52          NA
## 2008-01-09    314.70    326.34   310.94     326.27          NA
##summarize monthly and store as time series
GOOG <- as.data.frame(GOOG)
rownames(GOOG) = as.Date(rownames(GOOG))
#mGoog <- to.monthly(GOOG)
#googOpen <- Op(mGoog)

##decompose a time series into parts
#plot(decompose(ts1))

##Training and testing dataset on a consective basis
#ts1Train <- window(ts1,start=1,end=5)
#ts1Test <- window(ts1,start=5,end=(7-0.01))

######Simple moving average: ma()
#plot(ts1Train)
#lines(ma(ts1Train,order=3),col="red")

########exponential smoothing
#ets1 <- ets(ts1Train,model="MMM")
#fcast <- forecast(ets1)
#plot(fcast); lines(ts1Test,col="red")

#####get the accuracy: accuracy()

Further readings

Unsupervised prediction

Key ideas

  • Sometimes we don’t know about the labels for prediction
  • To build a predictor, you need create clusters and name clusters

Examples with Iris dataset

data(iris); library(ggplot2)
inTrain <- createDataPartition(y=iris$Species, p=0.7, list=FALSE)
training <- iris[inTrain,]
testing <- iris[-inTrain,]
dim(training); dim(testing)
## [1] 105   5
## [1] 45  5
###create cluster with K-means clustering
kMeans1 <- kmeans(subset(training,select=-c(Species)),centers=3)
training$clusters <- as.factor(kMeans1$cluster)
qplot(Petal.Width,Petal.Length,colour=clusters,data=training)

##compare to the real labels
table(kMeans1$cluster,training$Species)
##    
##     setosa versicolor virginica
##   1     35          0         0
##   2      0         34        12
##   3      0          1        23
####build predictor
modFit <- train(clusters ~.,data=subset(training,select=-c(Species)),method="rpart")
## Loading required package: rpart
###apply on the test
testClusterPred <- predict(modFit,testing)