In this case, I use the DAAG library, which contains the data for this practical case. DAAG is not part of the base R installation, so install it once with install.packages("DAAG") and then load it with the library(DAAG) command.
The data set I will use is called spam7. It contains 4601 email samples. The explanatory variables are:
crl.tot, total length of uninterrupted sequences of capital letters;
dollar, frequency of the $ symbol, as a percentage of the total number of characters;
bang, frequency of the ! symbol, as a percentage of the total number of characters;
money, frequency of the word money, as a percentage of the total number of words;
n000, frequency of the string 000, as a percentage of the total number of words;
make, frequency of the word make, as a percentage of the total number of words.
The dependent variable ‘yesno’ is a factor that takes the values ‘y’ (spam) or ‘n’ (non-spam).
# Load the DAAG library
library(DAAG)
dim(spam7)
## [1] 4601 7
spam7[1:5,]
## crl.tot dollar bang money n000 make yesno
## 1 278 0.000 0.778 0.00 0.00 0.00 y
## 2 1028 0.180 0.372 0.43 0.43 0.21 y
## 3 2259 0.184 0.276 0.06 1.16 0.06 y
## 4 191 0.000 0.137 0.00 0.00 0.00 y
## 5 191 0.000 0.135 0.00 0.00 0.00 y
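Before fitting a tree, it can be useful to check how many emails fall in each class and how the explanatory variables are distributed. The following is a minimal sketch, assuming the DAAG package is installed; the names class_counts and class_props are introduced here only for illustration.

```r
library(DAAG)  # provides the spam7 data frame

# Number of emails in each class ('n' = non-spam, 'y' = spam)
class_counts <- table(spam7$yesno)
print(class_counts)

# The same counts expressed as proportions of all emails
class_props <- prop.table(class_counts)
print(round(class_props, 3))

# Minimum, quartiles, mean and maximum of each explanatory variable
summary(spam7[, names(spam7) != "yesno"])
```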
Next, I use the rpart library to fit a decision tree to these data.
library(rpart)
I build a classification tree from the six explanatory variables and store the fitted tree in the spam.tree variable:
spam.tree <- rpart(formula = yesno ~ crl.tot + dollar + bang +
money + n000 + make, method="class", data=spam7)
plot(spam.tree)
text(spam.tree)
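The plot shows the shape of the tree, but a text listing and a confusion matrix make it easier to see what the tree actually predicts. The following is a sketch under the assumption that DAAG and rpart are installed; train_pred, conf_mat and train_acc are names introduced for this example and are not part of the original walkthrough.

```r
library(DAAG)   # provides spam7
library(rpart)

spam.tree <- rpart(yesno ~ crl.tot + dollar + bang + money + n000 + make,
                   method = "class", data = spam7)

# Text listing of the splits, node sizes and predicted classes
print(spam.tree)

# Predicted class for every email in the data used to fit the tree
train_pred <- predict(spam.tree, type = "class")

# Confusion matrix: rows are the observed classes, columns the predictions
conf_mat <- table(observed = spam7$yesno, predicted = train_pred)
print(conf_mat)

# Share of emails classified correctly. This is an optimistic figure,
# because it is computed on the same data used to fit the tree.
train_acc <- sum(diag(conf_mat)) / sum(conf_mat)
print(train_acc)
```

Because the accuracy is measured on the training data, it says little about how the tree behaves on new emails, which is why the cross-validated error is used when choosing the tree size.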
The rpart library allows you to prune the generated trees in order to find an optimal tree size such that a balance is reached between the complexity of the tree and the fit to the data.
The rpart library defines a complexity parameter, ‘cp’. It is a non-negative value (0.01 by default): a split is attempted only if it decreases the overall lack of fit by at least a factor of cp, so the smaller this parameter, the deeper the tree that the library builds. I therefore start by building a tree with a very small complexity parameter, for example 0.0000001.
spam7a.tree <- rpart(formula = yesno ~ crl.tot + dollar +
bang + money + n000 + make, method="class", data=spam7,cp=0.0000001)
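To see the effect of cp on tree size, one option is to count the terminal nodes of a tree fitted with the default cp and of one fitted with a very small cp. This is a minimal sketch, assuming DAAG and rpart are installed; n_leaves, fit_default and fit_deep are hypothetical names used only here, and the formula yesno ~ . is shorthand for using all six explanatory variables.

```r
library(DAAG)   # provides spam7
library(rpart)

# Helper introduced for this example: number of terminal nodes of an rpart fit
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

# Tree with the default complexity parameter (cp = 0.01)
fit_default <- rpart(yesno ~ ., data = spam7, method = "class")

# Tree with a very small complexity parameter, as in the text
fit_deep <- rpart(yesno ~ ., data = spam7, method = "class", cp = 0.0000001)

print(n_leaves(fit_default))
print(n_leaves(fit_deep))
```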
Now I plot the cross-validated error against the complexity parameter and the size of the tree:
plotcp(spam7a.tree)
This chart is based on cross-validation, which assigns the samples to folds at random, so it may vary from one run to the next unless a random seed is fixed. In my run, the smallest error is obtained for a cp value of approximately 0.0033. To obtain that value exactly, I use the command:
cpar <- spam7a.tree$cptable[which.min(spam7a.tree$cptable[, "xerror"]), "CP"]
Like the chart above, this value depends on the random cross-validation folds, so it is only an approximate optimum and can change between runs.
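If a repeatable result is needed, one possibility is to fix the random seed before calling rpart, since the cross-validation folds are drawn with R's random number generator. The sketch below assumes DAAG and rpart are installed; the seed value 123 is arbitrary, and fit_with_seed and best_cp are helper names introduced only for this illustration.

```r
library(DAAG)   # provides spam7
library(rpart)

# Fit the deep tree after fixing the seed, so the random assignment of
# emails to cross-validation folds is the same on every call
fit_with_seed <- function(seed) {
  set.seed(seed)
  rpart(yesno ~ crl.tot + dollar + bang + money + n000 + make,
        method = "class", data = spam7, cp = 0.0000001)
}

# cp value of the row of the cp table with the smallest cross-validated error
best_cp <- function(fit) {
  tab <- fit$cptable
  tab[which.min(tab[, "xerror"]), "CP"]
}

cp_a <- best_cp(fit_with_seed(123))
cp_b <- best_cp(fit_with_seed(123))

# With the same seed, both runs select the same cp
print(identical(cp_a, cp_b))
```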
I prune the tree using the prune command:
spam7b.tree <- prune(spam7a.tree,cp=cpar)
Finally, I plot the pruned tree:
plot(spam7b.tree,uniform=TRUE)
text(spam7b.tree,cex=0.75)
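To get a feel for what pruning changed, the number of terminal nodes and the training error of the two trees can be compared. This is a sketch rather than part of the original analysis: it assumes DAAG and rpart are installed, uses an arbitrary seed, and introduces the names leaves_full, leaves_pruned, err_full and err_pruned for illustration.

```r
library(DAAG)   # provides spam7
library(rpart)

set.seed(1)  # arbitrary seed, only to make the cross-validation repeatable

# Deep tree and its pruned version, built as in the text
spam7a.tree <- rpart(yesno ~ crl.tot + dollar + bang + money + n000 + make,
                     method = "class", data = spam7, cp = 0.0000001)
cpar <- spam7a.tree$cptable[which.min(spam7a.tree$cptable[, "xerror"]), "CP"]
spam7b.tree <- prune(spam7a.tree, cp = cpar)

# Number of terminal nodes before and after pruning
leaves_full   <- sum(spam7a.tree$frame$var == "<leaf>")
leaves_pruned <- sum(spam7b.tree$frame$var == "<leaf>")
print(c(full = leaves_full, pruned = leaves_pruned))

# Misclassification rate on the data used to fit the trees
err_full   <- mean(predict(spam7a.tree, type = "class") != spam7$yesno)
err_pruned <- mean(predict(spam7b.tree, type = "class") != spam7$yesno)
print(c(full = err_full, pruned = err_pruned))
```

The pruned tree cannot have a lower training error than the deep tree it was cut from; the point of pruning is that the smaller tree is expected to do at least as well on emails it has not seen.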