Friday, September 26, 2014
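The steps are:
1. Store the data in a data.frame.
2. Install the randomForest package from the CRAN repo using: install.packages("randomForest")
3. Load it with require("randomForest") or library("randomForest")
4. Check the data with sum(!is.na(dataframe)), which counts the non-missing values; sum(is.na(dataframe)) counts the missing ones.
5. Impute the missing values with rfImpute()
6. Fit the model with randomForest()

#install.packages("randomForest")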
require("randomForest")
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
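The check sum(!is.na(dataframe)) mentioned above counts the cells that are present, not the ones that are missing. A minimal base R sketch of both counts follows; the data frame toy and its values are hypothetical and are not taken from 0806alldata.csv.

```r
# Hypothetical toy data with three missing cells (not the groundwater data)
toy <- data.frame(
  ph  = c(7.1, NA, 6.8, 7.4),
  tds = c(310, 295, NA, NA)
)

n_present  <- sum(!is.na(toy))      # cells that hold a value
n_missing  <- sum(is.na(toy))       # cells that are NA
per_column <- colSums(is.na(toy))   # missing count for each column

print(n_present)
print(n_missing)
print(per_column)
```

The per-column count is usually the more useful view, since it shows which predictors rfImpute() will actually have to fill in.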
Group1 contains only 10 predictors (x coordinate, y coordinate, elevation, pH, hardness, TDS, temperature, Eh, cumulative rainfall, lag-1 rainfall). Electroconductivity (EC) is used as the response.

data <- as.data.frame(read.csv("0806alldata.csv", header = TRUE))
attach(data)
## The following object is masked from package:datasets:
##
##     CO2
group1 <- data[, c("x", "y", "type",
                   "ec", "elv",
                   "ph", "hard",
                   "tds", "temp",
                   "eh", "cumrain",
                   "lag1")]
Missing values are imputed with rfImpute() before randomForest() is called.

# imputing missing values
group1Imp <- rfImpute(ec ~ ., data = group1)
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 | 1.86e+04    56.97 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 | 1.857e+04    56.89 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 | 1.823e+04    55.85 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 | 1.838e+04    56.30 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  300 | 1.912e+04    58.55 |
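To try the imputation step without the original CSV file, here is a minimal sketch on R's built-in airquality data set. The data set, the response Temp and the object name aq_imputed are assumptions made for illustration only; they are not part of the groundwater analysis. rfImpute() needs a response without missing values, which is why Temp is used here.

```r
# Sketch only: runs when the randomForest package is installed
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)  # the imputation is random, so fix the seed for repeatability

  # airquality has NAs in Ozone and Solar.R; Temp is complete
  print(colSums(is.na(airquality)))

  # Same call pattern as rfImpute(ec ~ ., data = group1)
  aq_imputed <- randomForest::rfImpute(Temp ~ ., data = airquality)

  # After imputation no missing values should remain
  print(sum(is.na(aq_imputed)))
}
```

As in the output above, rfImpute() prints one out-of-bag trace per iteration, because it grows a fresh forest each time it refines the imputed values.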
# making model
rfModel1 <- randomForest(ec ~ x + y + elv +
                           ph + hard + tds +
                           temp + eh + cumrain +
                           lag1,
                         data = group1Imp,
                         ntree = 500,
                         mtry = 2,
                         importance = TRUE,
                         do.trace = 100,
                         proximity = TRUE)
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  100 | 1.941e+04    59.45 |
##  200 | 1.864e+04    57.09 |
##  300 | 1.859e+04    56.92 |
##  400 | 1.869e+04    57.25 |
##  500 | 1.861e+04    57.00 |
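Note that the formula can be written either in shorthand as ec ~ . or in full as ec ~ x + y + elv + ph + hard + tds + temp + eh + cumrain + lag1. The shorthand uses every other column of the data as a predictor, so with group1Imp it would also include type, which the full formula leaves out.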
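The difference between the two ways of writing the formula can be checked in base R with terms(), which expands the dot against a data frame. The data frame toy below is hypothetical: it only mimics a few of the column names used here, and its values are made up.

```r
# Hypothetical stand-in with a few of the same column names (values invented)
toy <- data.frame(
  ec   = c(410, 520, 380),
  x    = c(1, 2, 3),
  y    = c(4, 5, 6),
  type = c("well", "spring", "well"),
  stringsAsFactors = FALSE
)

# "." expands to every column except the response
dot_terms  <- attr(terms(ec ~ ., data = toy), "term.labels")

# An explicit formula uses only the predictors that are named
full_terms <- attr(terms(ec ~ x + y, data = toy), "term.labels")

print(dot_terms)
print(full_terms)
```

Writing the formula out in full is the safer choice when the data frame carries columns, such as type, that should stay out of the model.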
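The main arguments of randomForest() [2]:
ntree: number of trees grown.
mtry: number of predictors sampled for splitting at each node.
importance = TRUE: assess the importance of the predictors.
proximity = TRUE: compute the proximity measure among the rows.

print(rfModel1)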
##
## Call:
##  randomForest(formula = ec ~ x + y + elv + ph + hard + tds + temp + eh + cumrain + lag1, data = group1Imp, ntree = 500, mtry = 2, importance = TRUE, do.trace = 100, proximity = TRUE)
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
##
##           Mean of squared residuals: 18610
##                     % Var explained: 43
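The trace output and the printed summary appear to report the same out-of-bag error in two ways: the %Var(y) column gives the MSE as a percentage of the variance of ec, and % Var explained is its complement. The short sketch below redoes that arithmetic with the numbers printed above; the variable names are illustrative, and the implied variance is only a back-calculation from rounded output.

```r
# Values copied from the 500-tree line of the trace and from print(rfModel1)
oob_mse   <- 1.861e4   # out-of-bag mean squared error
pct_var_y <- 57.00     # MSE as a percentage of Var(ec)

# Share of the variance that the forest accounts for
pct_explained <- 100 - pct_var_y

# Variance of ec implied by the two rounded numbers above
implied_var_y <- oob_mse / (pct_var_y / 100)

print(pct_explained)
print(implied_var_y)
```

So a falling %Var(y) in the trace and a rising % Var explained in the summary describe the same improvement.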
plot(rfModel1)
round(importance(rfModel1), 3)
##         %IncMSE IncNodePurity
## x         3.155        589228
## y         6.626        658026
## elv       5.992        641067
## ph       11.896        864079
## hard     16.160       1142347
## tds      22.704       2094910
## temp     20.972       1639244
## eh       13.068        915398
## cumrain   3.484        204159
## lag1      2.165        237052
varImpPlot(rfModel1)
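The varImpPlot() function plots the predictors ranked by each importance measure, which separates the more important predictors (here tds, temp and hard) from the less important ones (such as lag1, x and cumrain).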
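The same ranking can be read as a table. This sketch re-enters the importance values printed above into a plain data frame and sorts them by %IncMSE; the names imp, pct_inc_mse and inc_node_purity are illustrative. With the fitted model at hand, the matrix returned by importance(rfModel1) could be sorted the same way.

```r
# Importance values copied from the round(importance(rfModel1), 3) output
imp <- data.frame(
  predictor = c("x", "y", "elv", "ph", "hard",
                "tds", "temp", "eh", "cumrain", "lag1"),
  pct_inc_mse = c(3.155, 6.626, 5.992, 11.896, 16.160,
                  22.704, 20.972, 13.068, 3.484, 2.165),
  inc_node_purity = c(589228, 658026, 641067, 864079, 1142347,
                      2094910, 1639244, 915398, 204159, 237052),
  stringsAsFactors = FALSE
)

# Most important predictor first, judged by %IncMSE
ranked <- imp[order(imp$pct_inc_mse, decreasing = TRUE), ]
print(ranked, row.names = FALSE)
```

The two measures agree on the top three predictors but not on every position further down, which is why varImpPlot() draws them in separate panels.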