Playing with the nottem dataset using the caret package in R

This report summarizes the basic functions of the caret package in R, using the nottem dataset as an example. The dataset is a ts object holding 20 years of monthly temperature measurements at Nottingham Castle (Anderson, O. D. 1976). The report focuses on hands-on practice in implementing machine learning for newcomers to the field. Most of the material is derived from the JHU Practical Machine Learning course on Coursera. The following caret functions are covered:

preProcess
createDataPartition
train
predict

The nottem dataset: preprocessing the ts object

A quick look at the original data shows the seasonality of the monthly temperature. Some sort of normalization may be necessary to remove the non-linear trend from the time-series data. The data were melted into a data frame with an additional variable that indicates the time point of each measurement.

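data("nottem")
plot(nottem)  # monthly series; the yearly cycle is clearly visible

library(reshape2)
# Build a two-column data frame: time index and temperature
temp.new <- data.frame(timePoint = melt(time(nottem)), temp = melt(nottem))
colnames(temp.new) <- c('timePoint', 'temp')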
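
The seasonality noted above can also be checked numerically. The following is a minimal base-R sketch and not part of the original analysis: it labels each observation with its month using cycle() and averages by month with tapply(). The names month and monthly.mean are introduced here for illustration only.

```r
data("nottem")

# cycle() gives the position of each observation within the year (1 = January)
month <- as.integer(cycle(nottem))

# Mean temperature per calendar month across all 20 years
monthly.mean <- tapply(as.numeric(nottem), month, mean)
round(monthly.mean, 1)

# Spread between the warmest and coldest month, a rough measure of seasonality
diff(range(monthly.mean))
```

A large spread relative to the year-to-year variation suggests that the month carries most of the signal in this series.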

Splitting data into training and testing sets

The ratio of training to testing data is usually 3:1, which corresponds to p = 0.75 in createDataPartition.

library(caret); library(kernlab)

## Loading required package: lattice

## Loading required package: ggplot2

## 
## Attaching package: 'kernlab'

## The following object is masked from 'package:ggplot2':
## 
##     alpha

inTrain <- createDataPartition(y = temp.new$temp, p = 0.75, list = FALSE)
training <- temp.new[inTrain, ]
testing <- temp.new[-inTrain, ]
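
createDataPartition draws a random sample, so calling set.seed() before it is what makes a split reproducible. As a rough illustration of the idea, here is a base-R sketch of a simple random 75/25 split. It is an assumption-laden stand-in, not caret's method: for a numeric outcome createDataPartition samples within percentile groups of y, so its indices and row counts can differ from this sketch. The names temp.demo, in.train, training.demo and testing.demo are hypothetical.

```r
data("nottem")
temp.demo <- data.frame(timePoint = as.numeric(time(nottem)),
                        temp = as.numeric(nottem))

set.seed(666)  # fix the random draw so the split can be reproduced
n <- nrow(temp.demo)
in.train <- sample(n, size = round(0.75 * n))  # simple random 75% sample

training.demo <- temp.demo[in.train, ]
testing.demo  <- temp.demo[-in.train, ]

# Row counts of the two sets
c(train = nrow(training.demo), test = nrow(testing.demo))
```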

Model fitting

Here we use a Bayesian Regularized Neural Network (method = "brnn") to fit the training set. The preprocessing method “center” is passed through the preProcess argument; it subtracts the mean from each predictor in the training data, which is similar to a normalization step.

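set.seed(666)
# Note: x is the whole training data frame, so temp itself is passed as a
# predictor alongside timePoint (hence xNames has length 2 in the summary below).
modelFit <- train(x = training, y = training$temp, method = "brnn", preProcess = c("center"))
modelFit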
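
To make the “center” step concrete, here is a small base-R sketch of what centering amounts to. It assumes the usual convention that the mean is estimated on the training data and then reused for new data. The numbers and the names train.x, test.x and train.mean are made up for illustration; this is not caret's internal code.

```r
# Toy predictor values (made up for illustration)
train.x <- c(1920.0, 1925.5, 1931.0, 1939.5)
test.x  <- c(1922.0, 1936.0)

# The mean is estimated from the training data only
train.mean <- mean(train.x)

train.centered <- train.x - train.mean
test.centered  <- test.x - train.mean  # the training mean is reused here

mean(train.centered)  # zero up to floating-point error
test.centered
```

Reusing the training mean keeps the testing set from influencing the preprocessing, which is why the transformation is best specified inside train() rather than applied to the full dataset beforehand.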

Final model

summary(modelFit$finalModel)

##              Length Class      Mode     
## theta          3    -none-     list     
## alpha          1    -none-     numeric  
## beta           1    -none-     numeric  
## gamma          1    -none-     numeric  
## Ed             1    -none-     numeric  
## Ew             1    -none-     numeric  
## F_history     13    -none-     numeric  
## reason         1    -none-     character
## epoch          1    -none-     numeric  
## neurons        1    -none-     numeric  
## p              1    -none-     numeric  
## n              1    -none-     numeric  
## npar           1    -none-     numeric  
## x_normalized 362    -none-     numeric  
## x_base         2    -none-     numeric  
## x_spread       2    -none-     numeric  
## y_base         1    -none-     numeric  
## y_spread       1    -none-     numeric  
## y            181    -none-     numeric  
## normalize      1    -none-     logical  
## call           4    -none-     call     
## xNames         2    -none-     character
## problemType    1    -none-     character
## tuneValue      1    data.frame list     
## obsLevels      1    -none-     logical

Prediction

Comparing the temp values from the testing set (blue crosses) with the predictions (red circles), we see that most of the points lie close together, except for two or three that do not align.

prediction <- predict(modelFit, newdata = testing)

## Loading required package: brnn

## Loading required package: Formula

prediction.frame <- data.frame(testing$timePoint, prediction)
plot(testing, col='navy blue', pch = 4, lwd = 2)
points(prediction.frame, col = 'dark red', lwd = 2)
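
The visual comparison can be complemented with a single error figure. The sketch below defines a root-mean-square error helper in base R; the function name rmse and the toy vectors are introduced here for illustration and do not come from the original report. caret also provides postResample(pred, obs), which reports RMSE among other metrics.

```r
# Root-mean-square error between observed and predicted values
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Made-up numbers, only to show the call
obs  <- c(40, 50, 60)
pred <- c(41, 48, 60)
rmse(obs, pred)

# With the objects from this report the call would be:
# rmse(testing$temp, prediction)
```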