We will need the mvpart package to build the multivariate regression tree, vegan to standardize our numerical data, and plyr to reshape the data frame after normalization.
require(mvpart)
## Loading required package: mvpart
require(vegan)
## Loading required package: vegan
## Loading required package: permute
## This is vegan 2.0-6
require(plyr)
## Loading required package: plyr
Let's use some car data from 1990: the car.test.frame data set that ships with the rpart package. It is a data frame with both numerical and categorical data.
data(car.test.frame, package = "rpart")  # load the 1990 Consumer Reports car data from rpart
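A quick look at the structure (plain base R) shows the mix of numeric and factor columns we will be working with:
str(car.test.frame)  # Country and Type are factors; the remaining columns are numeric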
Let's normalize the numerical data so that each numerical column has a mean of 0 and a standard deviation of 1. This prevents the variation in any one column from dominating the tree-building decisions. We can only standardize numerical columns, so afterwards we need to put the categorical data back in.
car.norm <- decostand(car.test.frame[, c("Price", "Reliability", "Mileage",
"Weight", "Disp.", "HP")], "standardize", na.rm = TRUE)
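As a quick sanity check (optional, plain base R), each standardized column should now have a mean of roughly 0 and a standard deviation of roughly 1:
round(colMeans(car.norm, na.rm = TRUE), 3)  # column means should be ~0
round(apply(car.norm, 2, sd, na.rm = TRUE), 3)  # column standard deviations should be ~1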
car.norm.df <- transform(car.norm, Country = car.test.frame$Country)
car.norm.df <- transform(car.norm.df, Type = car.test.frame$Type)
car.norm.df
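As a side note on style, the two transform() calls above can be collapsed into a single call that adds both categorical columns at once; the resulting data frame is the same:
car.norm.df <- transform(car.norm,
                         Country = car.test.frame$Country,  # reattach the categorical columns
                         Type = car.test.frame$Type)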
Let's use all the data we have collected to identify which groups most accurately predict a change in price. mvpart is a wrapper for rpart; in this case our response matrix, Price, has only one column. The xv = "pick" option allows us to pick the tree size ourselves, effectively pruning our own tree. xvmult = 1000 tells the wrapper to run 1000 multiple cross-validations and display the comparative results.
mfit <- mvpart(Price ~ Mileage + HP + Reliability + Weight + Disp. + Country +
Type, car.norm.df, xv = "pick", xvmult = 1000)
We might select the tree that gives us the lowest cross-validated relative error (CVRE), but we can see that those results don't make as much sense for our data. The most parsimonious result has 5 leaves, and it partitions the cars into groups of increasing price. The smallest number of groups that does not sacrifice a significant decrease in CVRE often provides a logical partitioning of your data; however, you will need to make a decision about what leaf structure makes the most sense for your data.
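If you would rather inspect the cross-validation results numerically instead of (or in addition to) picking from the plot, the fitted object can be examined with the usual rpart-style helpers. This is a sketch that assumes mfit behaves like an rpart object, which it should since mvpart is a fork of rpart:
printcp(mfit)  # complexity parameter table with cross-validated error for each tree size
plot(mfit)  # redraw the selected tree
text(mfit)  # add the split labels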