Friday, September 26, 2014

Introduction

Terminology

  • Bagging/bootstrap aggregation: a technique for reducing the variance of an estimated prediction function by fitting the same model to many bootstrap resamples of the data and averaging the predictions.
  • Regression tree:
    • a tree-like graph that models a hierarchy of decisions (splits on predictor values) and predicts a numeric response at each terminal node.
  • Pruning:
    • literally, trimming a tree, shrub or bush; cutting away a branch from a tree.
    • for a regression tree, removing branches (splits) that add little predictive power, so that insignificant variables drop out of the model.
    • criteria for pruning: minimum node size, maximum standard deviation of samples at a node.
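
As a rough illustration of bagging, the sketch below fits a simple model to bootstrap resamples and averages the predictions. The built-in mtcars data, the lm() learner, the seed and the number of resamples are all assumptions chosen for brevity; lm() is only a stand-in for the tree models discussed in these slides.

```r
# Bagging sketch in base R (mtcars and lm() are illustrative stand-ins)
set.seed(42)                      # arbitrary seed, only for reproducibility
n_boot  <- 200                    # number of bootstrap resamples (assumed)
newdata <- data.frame(wt = c(2.5, 3.5))

# Each replicate: resample rows with replacement, refit, predict
preds <- replicate(n_boot, {
  idx <- sample(nrow(mtcars), replace = TRUE)
  fit <- lm(mpg ~ wt, data = mtcars[idx, ])
  predict(fit, newdata = newdata)
})

# preds has one row per new case and one column per resample;
# the bagged prediction is the average over resamples
bagged <- rowMeans(preds)
bagged
```

The variance reduction matters most for unstable learners such as deep trees, which is why random forests are built on this idea.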

Terminology [2]

  • Random Forest:
    • An ensemble method for classification and regression built from many decision trees.
    • Uses multiple models for better performance than a single tree model.
    • Grows many regression trees, each on a bootstrap sample with a random subset of predictors tried at each split, then averages their predictions.
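
To make the averaging step concrete, the sketch below checks that a regression forest's prediction is the mean of its individual tree predictions. It assumes the randomForest package is installed and uses the built-in mtcars data, a small ntree and an arbitrary seed purely for illustration.

```r
library(randomForest)

set.seed(1)  # arbitrary seed for reproducibility
rf <- randomForest(mpg ~ ., data = mtcars, ntree = 50)

# predict.all = TRUE keeps the prediction of every individual tree
p <- predict(rf, newdata = mtcars[1:5, ], predict.all = TRUE)

dim(p$individual)                    # one row per case, one column per tree
tree_mean <- rowMeans(p$individual)  # average over the trees

# For regression, the aggregate prediction is this average
all.equal(unname(tree_mean), unname(p$aggregate))
```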

Workflow

  • Prepare the dataset:
    • Cases or samples set in rows (n rows)
    • Measured variables set in columns (m columns)
    • R reads this as a data.frame
  • Choose the training dataset:
    • The training set is used to build the model, which is then used to make predictions.
    • In this case, for descriptive purposes, we will use the whole dataset as the training set.
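
For reference, a minimal base R sketch of the layout described above, plus what a separate training set could look like if one were wanted. The mtcars data, the 70/30 split and the seed are assumptions for illustration; these slides use the whole dataset for training.

```r
# Stand-in data.frame: rows are cases, columns are measured variables
df <- mtcars
n <- nrow(df)   # number of cases (rows)
m <- ncol(df)   # number of variables (columns)
is.data.frame(df)

# Optional hold-out split (not used in these slides)
set.seed(1)                                   # arbitrary seed
train_idx <- sample(n, size = floor(0.7 * n)) # 70% of rows, assumed ratio
train <- df[train_idx, ]
test  <- df[-train_idx, ]

c(train = nrow(train), test = nrow(test))
```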

Workflow [2]

  • Prepare the package:
    • We will use the randomForest package from the CRAN repository, installed using: install.packages("randomForest")
    • Load the package using: require("randomForest") or library("randomForest")
  • Load the dataset using: read.csv()
  • Check for missing values (NAs) using: sum(is.na(dataframe))
  • Impute missing values using: rfImpute()
  • Run randomForest using: randomForest()
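
The missing-value check can be sketched in base R as follows. The built-in airquality data is an assumption used only because it ships with R and contains NAs; the variable names are illustrative.

```r
# airquality ships with R and contains missing values
df <- airquality

n_missing  <- sum(is.na(df))            # total number of NA cells
n_present  <- sum(!is.na(df))           # non-missing cells (not the NA count)
na_by_col  <- colSums(is.na(df))        # NA count per variable
n_complete <- sum(complete.cases(df))   # rows without any NA

n_missing
na_by_col
n_complete
```

Note that rfImpute() fills NAs in the predictors only; the response column must be complete.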

The code

Installing and loading libraries

#install.packages("randomForest")
require("randomForest")
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.

Loading dataset

  • The dataset used in this exercise contains more than 15 water quality variables.
  • We will subset the dataset as group1, containing 10 predictors (x coordinate, y coordinate, elevation, pH, hardness, TDS, temperature, Eh, cumulative rainfall, lag-1 rainfall) plus the type column. Electrical conductivity (EC) is used as the response.

Loading dataset [2]

# read.csv() already returns a data.frame; columns are selected by name below,
# so attach() is not needed
data <- read.csv("0806alldata.csv", header = TRUE)
group1 <- data[,c("x", "y", "type", 
                  "ec", "elv", 
                  "ph", "hard", 
                  "tds", "temp",
                  "eh", "cumrain", 
                  "lag1")]

Making model using randomForest()

# imputing missing values
group1Imp <- rfImpute(ec ~ ., data = group1)
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 | 1.86e+04    56.97 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 | 1.857e+04    56.89 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 | 1.823e+04    55.85 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 | 1.838e+04    56.30 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 | 1.912e+04    58.55 |
# making model
rfModel1 <- randomForest(ec ~ x + y + elv + 
                              ph + hard + tds + 
                              temp + eh + cumrain + 
                              lag1, data = group1Imp, 
                              ntree = 500, 
                              mtry = 2,
                              importance = TRUE,
                              do.trace = 100, 
                              proximity=TRUE) 
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  100 | 1.941e+04    59.45 |
##  200 | 1.864e+04    57.09 |
##  300 | 1.859e+04    56.92 |
##  400 | 1.869e+04    57.25 |
##  500 | 1.861e+04    57.00 |

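Notice that the model uses the full equation ec ~ x + y + elv + ph + hard + tds + temp + eh + cumrain + lag1. The shorthand ec ~ . would use every other column of group1Imp as a predictor, including type, so the two forms are equivalent only if type is dropped from the data first.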
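
A small base R sketch of how the dot in a formula expands. The toy data.frame and its values are hypothetical; only the column names echo the ones used in these slides.

```r
# Hypothetical toy data: a response (ec), two predictors and an extra column
toy <- data.frame(
  ec   = c(100, 250, 180),
  x    = c(1, 2, 3),
  y    = c(4, 5, 6),
  type = factor(c("well", "spring", "well"))
)

# With data supplied, "." expands to every column except the response
dot_terms      <- attr(terms(ec ~ ., data = toy), "term.labels")
explicit_terms <- attr(terms(ec ~ x + y), "term.labels")

dot_terms
setdiff(dot_terms, explicit_terms)   # columns pulled in only by the dot
```

Writing the predictors out explicitly is the safer choice when the data.frame carries columns that should not enter the model.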

Making model using randomForest() [2]

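  • Options:
    • ntree: number of trees grown.
    • mtry: number of predictors sampled for splitting at each node.
    • importance = TRUE: assess the importance of each predictor.
    • do.trace = 100: print the out-of-bag error after every 100 trees.
    • proximity = TRUE: calculate the proximity measure among the rows.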
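
To get a feel for mtry, the sketch below compares the out-of-bag error for a few values. It assumes the randomForest package is installed; the mtcars data, ntree = 200 and the seed are illustrative choices, not the settings used in these slides.

```r
library(randomForest)

p <- ncol(mtcars) - 1                  # number of predictors
default_mtry <- max(floor(p / 3), 1)   # package default for regression

set.seed(123)                          # arbitrary seed
mtry_values <- c(2, default_mtry, p)   # mtry = p amounts to plain bagging

oob_mse <- sapply(mtry_values, function(m) {
  fit <- randomForest(mpg ~ ., data = mtcars, ntree = 200, mtry = m)
  tail(fit$mse, 1)                     # out-of-bag MSE after the last tree
})
names(oob_mse) <- paste0("mtry=", mtry_values)
oob_mse
```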

Evaluating the model

  • We can evaluate the results from the model by using the following functions:
print(rfModel1)
## 
## Call:
##  randomForest(formula = ec ~ x + y + elv + ph + hard + tds + temp +      eh + cumrain + lag1, data = group1Imp, ntree = 500, mtry = 2,      importance = TRUE, do.trace = 100, proximity = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 18610
##                     % Var explained: 43
plot(rfModel1)

[Figure: out-of-bag error versus number of trees, produced by plot(rfModel1)]
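
In the printed summary, "% Var explained: 43" is 100 minus the 57.00 shown in the %Var(y) column of the trace at 500 trees. The sketch below reproduces that relationship on stand-in data; the mtcars data, ntree = 100 and the seed are assumptions, and it relies on the fitted object's mse and rsq components.

```r
library(randomForest)

set.seed(7)   # arbitrary seed
fit <- randomForest(mpg ~ ., data = mtcars, ntree = 100)

n         <- nrow(mtcars)
final_mse <- tail(fit$mse, 1)                 # OOB MSE after the last tree
var_y     <- var(mtcars$mpg) * (n - 1) / n    # variance of y with divisor n

pct_of_var    <- 100 * final_mse / var_y      # the %Var(y) column
pct_explained <- 100 - pct_of_var             # "% Var explained"

c(computed = pct_explained, stored = 100 * tail(fit$rsq, 1))
```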

round(importance(rfModel1), 3)
##         %IncMSE IncNodePurity
## x         3.155        589228
## y         6.626        658026
## elv       5.992        641067
## ph       11.896        864079
## hard     16.160       1142347
## tds      22.704       2094910
## temp     20.972       1639244
## eh       13.068        915398
## cumrain   3.484        204159
## lag1      2.165        237052
varImpPlot(rfModel1)

[Figure: variable importance plot, produced by varImpPlot(rfModel1)]

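  • The varImpPlot() function ranks the predictors by both importance measures (%IncMSE and IncNodePurity). Here tds, temp and hard rank highest on both, while x, cumrain and lag1 have the lowest %IncMSE.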
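
The same ranking can be read off as a sorted table. This is a minimal sketch on the built-in mtcars data, assuming the randomForest package is installed; the model, ntree and seed are illustrative and not the rfModel1 fitted above.

```r
library(randomForest)

set.seed(99)   # arbitrary seed
fit <- randomForest(mpg ~ ., data = mtcars, ntree = 200, importance = TRUE)

imp <- importance(fit)   # matrix with %IncMSE and IncNodePurity columns

# Sort predictors from most to least important by %IncMSE
imp_sorted <- imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]
round(imp_sorted, 3)

rownames(imp_sorted)[1:3]   # the three highest-ranked predictors
```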

References

References [2]