Random Forest

Friday, September 26, 2014

Introduction

Terminologies

Bagging/bootstrap aggregation: a technique for reducing the variance of an estimated prediction function.
Regression tree:
- a tree-like graph of hierarchical decisions that splits the data on predictor values and predicts a continuous response at each leaf.
Pruning:
- literally, trimming a tree, shrub or bush by cutting away branches.
- for a regression tree, removing branches (splits) that add little predictive value, which simplifies the model and reduces overfitting.
- criteria for pruning: minimum node size, maximum standard deviation of samples at a node.
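
The variance reduction from bagging can be illustrated with a minimal base R sketch. It uses simulated data and a deliberately flexible polynomial fit in place of a tree; the data, the model and all object names are assumptions made for illustration only.

```r
# Bagging sketch: average many fits, each trained on a bootstrap resample.
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.5)          # simulated noisy response
train <- data.frame(x = x, y = y)
grid <- data.frame(x = seq(1, 9, length.out = 50))

# Fit one high-variance model to a bootstrap resample and predict on the grid.
fit_one <- function(d) {
  boot <- d[sample(nrow(d), replace = TRUE), ]
  predict(lm(y ~ poly(x, 8), data = boot), newdata = grid)
}

B <- 200
preds <- replicate(B, fit_one(train))     # 50 x B matrix, one column per fit
bagged <- rowMeans(preds)                 # bagged prediction: average over fits

# Spread of single fits at each grid point; averaging B fits smooths this out.
single_fit_sd <- apply(preds, 1, sd)
summary(single_fit_sd)
```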

Terminologies [2]

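Random Forest:
- An ensemble method for classification and regression.
- Uses many tree models together for better performance than a single tree model.
- Grows many trees, each on a bootstrap sample with a random subset of predictors tried at each split, then averages their predictions (regression) or takes a majority vote (classification).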
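
For regression, the forest's prediction is the average of the individual tree predictions. A minimal sketch that checks this with predict.all = TRUE, using the built-in mtcars data rather than the water quality dataset from these slides (the dataset and object names are assumptions for illustration):

```r
library(randomForest)

set.seed(42)
rf <- randomForest(mpg ~ ., data = mtcars, ntree = 200)

# predict.all = TRUE returns the forest prediction and every tree's prediction
p <- predict(rf, newdata = mtcars, predict.all = TRUE)

dim(p$individual)                        # rows = cases, columns = trees
tree_mean <- rowMeans(p$individual)      # average over the trees

# The aggregate prediction should match the per-tree average
all.equal(unname(p$aggregate), unname(tree_mean))
```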

Workflow

Prepare the dataset:
- Cases or samples set in rows (n rows)
- Measured variables set in columns (m columns)
- R reads this as a data.frame
Choose the training dataset:
- Use the training set to fit the model, then make predictions.
- In this case, for descriptive purposes, we will use the whole dataset as the training set.
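
The layout and the training split can be sketched in base R. The toy values, the column names and the 70/30 split are assumptions for illustration; the slides themselves train on the whole dataset.

```r
# Toy data frame: one row per sample, one column per measured variable
set.seed(1)
toy <- data.frame(
  ec   = rnorm(20, mean = 300, sd = 50),
  ph   = rnorm(20, mean = 7, sd = 0.5),
  temp = rnorm(20, mean = 25, sd = 2)
)
str(toy)

# Hold out 30% of the rows for testing; fit the model on the rest
train_idx <- sample(nrow(toy), size = floor(0.7 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]
```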

Workflow [2]

Prepare the package:
- Install the randomForest package from CRAN using: install.packages("randomForest")
- Load the package using: require("randomForest") or library("randomForest")
Load the dataset using: read.csv()
Check for missing values (NAs) using: sum(is.na(dataframe))
Impute missing values using: rfImpute()
Run randomForest using: randomForest()
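
Counting missing values can be sketched on the built-in airquality data, used here only as a stand-in for the water quality dataset. is.na() flags missing entries, so summing it counts them.

```r
aq <- airquality                            # built-in data frame containing NAs

total_na      <- sum(is.na(aq))             # missing cells in the whole data frame
na_per_column <- colSums(is.na(aq))         # missing cells per variable
rows_with_na  <- sum(!complete.cases(aq))   # rows with at least one NA

total_na
na_per_column
rows_with_na
```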

The code

Installing and loading libraries

#install.packages("randomForest")
require("randomForest")

## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.

Loading dataset

The dataset used in this exercise contains more than 15 water quality variables.
We will subset the dataset as group1, keeping 10 predictors (x coord, y coord, elevation, pH, hardness, TDS, temperature, Eh, cumulative rainfall, lag-1 rainfall), the response, electrical conductivity (EC), and the type column, which is kept in the subset but is not included in the model formula.

Loading dataset [2]

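# read.csv() already returns a data.frame, so no conversion is needed
data <- read.csv("0806alldata.csv", header = TRUE)
# attach() puts the columns on the search path; a column named CO2 then masks
# datasets::CO2, which produces the message below. The later calls pass
# data = explicitly, so they do not depend on attach().
attach(data)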

## The following object is masked from package:datasets:
## 
##     CO2

group1 <- data[,c("x", "y", "type", 
                  "ec", "elv", 
                  "ph", "hard", 
                  "tds", "temp",
                  "eh", "cumrain", 
                  "lag1")]

Making model using `randomForest()`

# imputing missing values
group1Imp <- rfImpute(ec ~ ., data = group1)

##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 | 1.86e+04    56.97 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 | 1.857e+04    56.89 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 | 1.823e+04    55.85 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 | 1.838e+04    56.30 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 | 1.912e+04    58.55 |

# making model
rfModel1 <- randomForest(ec ~ x + y + elv + 
                              ph + hard + tds + 
                              temp + eh + cumrain + 
                              lag1, data = group1Imp, 
                              ntree = 500, 
                              mtry = 2,
                              importance = TRUE,
                              do.trace = 100, 
                              proximity=TRUE)

##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  100 | 1.941e+04    59.45 |
##  200 | 1.864e+04    57.09 |
##  300 | 1.859e+04    56.92 |
##  400 | 1.869e+04    57.25 |
##  500 | 1.861e+04    57.00 |

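Note that the shorthand ec ~ . uses every other column of the data frame as a predictor. Here that would also include type, so the model call spells out the formula instead: ec ~ x + y + elv + ph + hard + tds + temp + eh + cumrain + lag1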
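
How the dot in a formula expands can be checked in base R with terms(). The toy data frame below is an assumption for illustration; its values are made up and only the column names echo the slides.

```r
toy <- data.frame(
  ec   = c(310, 295, 402),
  type = factor(c("well", "spring", "well")),
  ph   = c(7.1, 6.8, 7.4),
  tds  = c(200, 190, 260)
)

# "." expands to every column except the response, including type
dot_terms <- attr(terms(ec ~ ., data = toy), "term.labels")

# An explicit formula uses only the predictors that are named
explicit_terms <- attr(terms(ec ~ ph + tds, data = toy), "term.labels")

dot_terms
explicit_terms
```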

Making model using `randomForest()` [2]

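Options:
- ntree: number of trees grown.
- mtry: number of predictors sampled as split candidates at each node.
- importance = TRUE: compute variable importance measures.
- do.trace = 100: print the out-of-bag error after every 100 trees.
- proximity = TRUE: compute the proximity measure among the rows.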
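
The effect of mtry can be explored by comparing the out-of-bag MSE for a few values. The sketch below uses the built-in mtcars data, which also happens to have 10 predictors, rather than the water quality dataset; the candidate values and object names are assumptions for illustration. For regression, randomForest() defaults to mtry = max(floor(p / 3), 1), where p is the number of predictors.

```r
library(randomForest)

set.seed(7)
p <- ncol(mtcars) - 1                      # number of predictors (mpg is the response)
default_mtry <- max(floor(p / 3), 1)       # package default for regression
candidates <- c(2, default_mtry, p)        # mtry = p is plain bagging

# OOB MSE after the last tree for each candidate mtry
oob_mse <- sapply(candidates, function(m) {
  fit <- randomForest(mpg ~ ., data = mtcars, ntree = 300, mtry = m)
  fit$mse[fit$ntree]
})
names(oob_mse) <- paste0("mtry=", candidates)
oob_mse
```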

Evaluating the model

We can evaluate the results from the model by using the following functions:

print(rfModel1)

## 
## Call:
##  randomForest(formula = ec ~ x + y + elv + ph + hard + tds + temp +      eh + cumrain + lag1, data = group1Imp, ntree = 500, mtry = 2,      importance = TRUE, do.trace = 100, proximity = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 18610
##                     % Var explained: 43
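
The two error summaries are consistent with each other: the %Var(y) column in the trace output is the out-of-bag MSE as a percentage of the variance of ec, and % Var explained is its complement. A quick arithmetic check using the numbers printed in the output (the variable names are illustrative):

```r
oob_mse   <- 1.861e4    # OOB MSE after 500 trees, from the trace output
pct_var_y <- 57.00      # OOB MSE as a percentage of Var(ec), from the trace output

pct_explained  <- 100 - pct_var_y              # matches "% Var explained: 43"
implied_var_ec <- oob_mse / (pct_var_y / 100)  # approximate variance of ec

pct_explained
implied_var_ec
```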

plot(rfModel1)

[Figure: out-of-bag MSE against the number of trees]

round(importance(rfModel1), 3)

##         %IncMSE IncNodePurity
## x         3.155        589228
## y         6.626        658026
## elv       5.992        641067
## ph       11.896        864079
## hard     16.160       1142347
## tds      22.704       2094910
## temp     20.972       1639244
## eh       13.068        915398
## cumrain   3.484        204159
## lag1      2.165        237052

varImpPlot(rfModel1)

[Figure: variable importance plot, %IncMSE and IncNodePurity]

varImpPlot() plots the predictors ranked by each importance measure. Here tds, temp and hard rank highest on both measures, while x, cumrain and lag1 rank lowest.
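
The same ranking can be read off by sorting the importance table. The sketch below re-enters the values printed by importance(rfModel1) above into a plain data frame (the object and column names are illustrative), so it runs without the original dataset.

```r
# Importance values copied from the output above
imp <- data.frame(
  predictor       = c("x", "y", "elv", "ph", "hard", "tds", "temp", "eh", "cumrain", "lag1"),
  inc_mse         = c(3.155, 6.626, 5.992, 11.896, 16.160, 22.704, 20.972, 13.068, 3.484, 2.165),
  inc_node_purity = c(589228, 658026, 641067, 864079, 1142347, 2094910, 1639244, 915398, 204159, 237052)
)

# Rank predictors from most to least important on each measure
by_mse    <- imp[order(-imp$inc_mse), ]
by_purity <- imp[order(-imp$inc_node_purity), ]

head(by_mse$predictor, 3)
head(by_purity$predictor, 3)
```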
