A decision tree is a tool that builds classification and regression models in the shape of a tree structure. A decision tree takes the form of a graph that illustrates the possible outcomes of different decisions based on a variety of parameters. Decision trees break the data down into smaller and smaller subsets and are commonly used in machine learning and data mining. The approach is also referred to as recursive partitioning.
Many decision trees are based on an algorithm called ID3, created by J. R. Quinlan. ID3 works top-down, repeatedly partitioning the data into subsets that are as homogeneous as possible, and it relies on two quantities:
Entropy: a measure of how mixed a set of data points is. If a sample is completely homogeneous the entropy is zero; if the sample is evenly divided among the classes the entropy is one.
Information gain: the decrease in entropy after the dataset is split on an attribute/parameter.
Decision trees make splits on the attribute that generates the highest information gain, which results in the most homogeneous subsets. Entropy values are calculated for every parameter entered into the tree model; at each decision, the parameter with the highest information gain is selected, and the process is then repeated on each subset.
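To make these two definitions concrete, here is a minimal base-R sketch of our own (the function names entropy() and info_gain() are not part of any package used later) that computes the entropy of a set of class labels and the information gain of a candidate split:
entropy <- function(labels) {
  p <- table(labels) / length(labels)   # proportion of each class
  p <- p[p > 0]                         # drop empty classes to avoid log2(0)
  -sum(p * log2(p))                     # Shannon entropy in bits
}
info_gain <- function(labels, branch) {
  # 'branch' assigns each observation to a child node (e.g. a logical split such as x <= threshold)
  n <- length(labels)
  children <- split(labels, branch)     # observations falling into each child node
  weighted <- sum(sapply(children, function(s) (length(s) / n) * entropy(s)))
  entropy(labels) - weighted            # decrease in entropy produced by the split
}
A split with higher info_gain() produces child subsets that are, on average, more homogeneous than the parent node.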
Decision Tree Components
Decision trees are made up of two parts: nodes and leaves.
Nodes: represent a decision test; each node examines a single variable and moves to another node based on the outcome.
Leaves: represent the outcome of the decision.
What can I do with a decision tree?
Decision trees are useful for making various kinds of predictions: for example, predicting whether an email is spam or not, predicting health outcomes, or predicting which group an individual belongs to based on the factors specified in the decision tree model.
Advantages of decision trees:
simple to understand and interpret
help determine the expected outcomes of various scenarios
help determine best and worst values for different scenarios
can be combined with other decision techniques
require a relatively low degree of data preparation
can accommodate missing data
low sensitivity to outliers
low impact of nonlinear relationships between parameters
can handle both categorical and numeric variables
can translate the decision tree results into “decision rules”
Disadvantages of decision trees:
for categorical variables, more levels of the variable create more bias of the decision tree toward that variable
if the tree is over-fitted to the data, the results can be poor predictors
R Package: ‘party’
The package we will use to create decision trees is called ‘party’. To install the package, use the syntax below. We will also be using the packages plyr and readr for some data set structuring. Measure and clear the memory.
Install party and load the necessary libraries.
install.packages("party")
# Note: Rtools is a Windows build toolchain, not a CRAN package, so it is not installed with install.packages()
rm( list=ls())
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 456823 24.4 983384 52.6 643648 34.4
## Vcells 825984 6.4 8388608 64.0 1648449 12.6
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
library(plyr)
##
## Attaching package: 'plyr'
## The following object is masked from 'package:modeltools':
##
## empty
library(readr)
To create decision trees, we will be using the function ctree() from the package ‘party’.
### A BRIEF OVERVIEW OF ctree()
The function ctree() is used to create conditional inference trees. The main components of this function are formula and data. Other components include subset, weights, controls, xtrafo, ytrafo, and scores.
formula: a symbolic description of the model to be fit, in the form response ~ predictor1 + predictor2.
data: tells the function which dataset to pull the variables listed in the model from.
subset: an optional argument that specifies a subset of observations to be used in the fitting process. It should be used if you don’t want to fit the model to the entire dataset.
weights: an optional vector of weights to be used in the model fitting process. It can only consist of non-negative integers.
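As a rough illustration of how these arguments fit together (this sketch is our own; mydata, outcome, and the predictor names are placeholders, and the controls are built with ctree_control()), a call might look like this:
# Illustrative sketch only -- 'mydata', 'outcome', and the predictors are placeholder names
# fit <- ctree(outcome ~ predictor1 + predictor2,
#              data = mydata,                                  # dataset the variables come from
#              subset = mydata$group == "train",               # optional: fit on a subset of rows
#              weights = mydata$w,                             # optional: non-negative integer weights
#              controls = ctree_control(mincriterion = 0.95,   # strength of evidence required to split
#                                       minbucket = 7))        # minimum observations in a leaf
In the examples below we will only need the formula and data arguments.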
?ctree
## starting httpd help server ... done
For the example, we will be using the iris dataset from the UCI Machine Learning Repository.
The iris dataset contains information about three different types of iris flowers: setosa iris, versicolor iris, and virginica iris. There are five variables included in the dataset: sepal.length, sepal.width, petal.length, petal.width, and class. Each entry in the dataset represents a different iris flower: the length and width of the sepal and petals are listed for each flower along with the type, or class, of the iris. The sepal and petal refer to two different components of the iris flower, each of which contributes to the overall aesthetic of the flower. Using the syntax below, we will load the dataset iris from the UCI website and rename each of the columns to reflect each of the five variables.
First, read the dataset from the link. Then rename the variables in the data frame. Finally, convert the iris class to a factor and produce a summary. There are 50 instances of each class (setosa, virginica, and versicolor) and there are no NAs.
iris <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", col_names = FALSE)
##
## -- Column specification --------------------------------------------------------
## cols(
## X1 = col_double(),
## X2 = col_double(),
## X3 = col_double(),
## X4 = col_double(),
## X5 = col_character()
## )
iris<-rename(iris, c("X1"="sepal.length", "X2"="sepal.width", "X3"="petal.length", "X4"="petal.width", "X5"="class"))
iris$class<-as.factor(iris$class)
#overall descriptives of the dataset
summary(iris)
## sepal.length sepal.width petal.length petal.width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.054 Mean :3.759 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## class
## Iris-setosa :50
## Iris-versicolor:50
## Iris-virginica :50
##
##
##
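As a quick check on the points above (no missing values, 50 flowers per class), the following two lines, added here for convenience, confirm them directly:
sum(is.na(iris))   # 0: no missing values anywhere in the dataset
table(iris$class)  # 50 flowers in each of the three classes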
First we will create four different decision trees, each using a single variable to predict which class a given iris flower belongs to.
By sepal length
Predicting iris class by sepal length. To see how well sepal length predicts which class of iris a flower is, we create the following decision tree. To have the variable names at hand, I run the names() command (my own addition).
names(iris)
## [1] "sepal.length" "sepal.width" "petal.length" "petal.width" "class"
tree1<-ctree(class~sepal.length, data=iris)
#set the model for the tree, predicting class by sepal length , data set being used is iris
plot(tree1)
#view the decision tree
Interpreting the decision tree
To understand what the decision tree is saying, we want to start at the root of the tree (the first decision node). Looking at the first decision node, we know that the decision is determined by the variable sepal.length. There are two branches from this node: if the sepal length is less than or equal to 5.4, the flower drops down into the first group of iris flowers; otherwise it moves on to the next decision node.
This first group, indicated by the first graph on the left side, tells us that there are 52 flowers that have a sepal length less than or equal to 5.4. Of these 52 flowers, approximately 80% fall into the first class of iris, which is setosa; about 15% fall into the second class, which is versicolor; and the remainder fall into the virginica class. The y-axis represents the proportion of the total flowers in this group that correspond to each of the iris flower classes (setosa, versicolor, virginica).
Next, we move to the second node. Again, the decision is determined by the variable sepal.length. The two branches here are less than or equal to 6.1 and greater than 6.1. If the flower has a sepal length that is less than or equal to 6.1, it falls into the second group. Looking at the second graph, we can see that the majority of the 43 flowers in this group are of iris class versicolor.
Next, we move to the third node. Here, if the sepal length is less than or equal to 7 the flower falls into the third group, and if it is greater than 7 the flower falls into the fourth group. Looking at the third graph, there are 43 flowers in the third group and the majority of these flowers are of class virginica; however, you can see that there is still a good number of versicolor iris flowers in this group as well.
In the fourth and last group, there are the remaining 12 flowers. All of these flowers belong to the third class of iris: virginica.
Overall, the decision tree tells us that setosa iris flowers tend to have shorter sepal length, versicolor iris flowers have mid-length sepals, and virginica iris flowers tend to have the longest sepal length. In the plots, setosa is the darkest shade and virginica is the lightest.
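One way to check these conclusions numerically (this step is our own addition, not part of the original walkthrough) is to cross-tabulate the tree's predicted classes against the true classes:
table(predicted = predict(tree1), actual = iris$class)   # confusion matrix for tree 1
The diagonal counts correctly classified flowers; the off-diagonal cells should show where sepal length alone struggles to separate versicolor from virginica.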
Predicting iris class by sepal width. We build the tree and visualize it.
tree2<-ctree(class~sepal.width, data=iris)
plot(tree2)
Interpreting the decision tree
Looking at the decision tree, you can see that using sepal width creates three groups of flowers, compared to the four groups created by sepal length.
Here the results are much more mixed. The main conclusions would be that setosa irises tend to have wider sepals, versicolor irises tend to have narrower sepals, and virginica irises show more variety in sepal width.
Predicting iris class by petal length.
tree3<-ctree(class~petal.length,data=iris)
plot(tree3)
## TREE 4 Predicting iris class by petal width.
tree4<-ctree(class~petal.width,data=iris)
plot(tree4)
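Before combining variables, a quick side-by-side comparison (our own addition) of the training-set misclassification rate of the four single-variable trees makes it easy to see which single measurement separates the classes best; it should point toward the petal measurements:
# Share of flowers each single-variable tree misclassifies on the training data
misclass <- function(tree) mean(predict(tree) != iris$class)
sapply(list(sepal.length = tree1, sepal.width = tree2,
            petal.length = tree3, petal.width = tree4), misclass)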
Predicting iris class by sepal dimensions
tree5<-ctree(class~sepal.length+sepal.width, data=iris)
plot(tree5)
## TREE 6 Prediction by petal dimensions
tree6<-ctree(class~petal.length+petal.width, data=iris)
plot(tree6)
Prediction based on both sepal and petal dimensions
tree7<-ctree(class~sepal.length + sepal.width + petal.length + petal.width, data=iris)
plot(tree7)
Interpreting the decision tree
Here we have the decision tree that includes all four variables (sepal length, sepal width, petal length, and petal width) in the prediction model.
Notice that only two factors are used in the decision nodes: petal length and petal width. This tells us that these two factors are the most important for distinguishing which iris class each flower belongs to. The factors sepal length and sepal width are not necessary to predict which class the flowers belong to. This decision tree is identical to decision tree #6.
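To confirm that trees 6 and 7 really behave the same way (our own check, not part of the original text), we can compare their predictions directly:
all(predict(tree6) == predict(tree7))                     # TRUE if the two trees classify every flower identically
table(predicted = predict(tree7), actual = iris$class)    # confusion matrix for the full model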
There are two approaches to avoid overfitting a decision tree to your data.
Pre-pruning: stops the tree from growing early, before the training data is perfectly classified (see the ctree() sketch below).
Post-trimming: or post-pruning; the tree is grown until the training data is perfectly classified and is then pruned, or trimmed, back.
Post-trimming is the most common approach because it is often difficult to estimate when to stop growing the tree. The important thing is to define the criteria that determine the correct final tree size.
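Before turning to those criteria, note that with ctree() specifically, pre-pruning is the more natural route: the controls argument (built with ctree_control()) lets you demand stronger statistical evidence or larger nodes before a split is allowed. A minimal sketch of our own, with the particular threshold values chosen arbitrarily:
# Pre-pruned tree: stricter split criterion and at least 20 flowers per terminal node
pruned_tree <- ctree(class ~ sepal.length + sepal.width + petal.length + petal.width,
                     data = iris,
                     controls = ctree_control(mincriterion = 0.99, minbucket = 20))
plot(pruned_tree)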
Validation set: use a different data set, other than the training set, to evaluate post-trimming decisions on the decision tree. Often the dataset is broken into two sets, the training set and the validation set. The decision tree is constructed on the training set, and any post-trimming is then evaluated on the validation set.
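A minimal sketch of that idea with the iris data (the 70/30 split and the seed are arbitrary choices of ours):
set.seed(42)                                                # for a reproducible split
train_idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train_set <- iris[train_idx, ]                              # training set: grow the tree here
valid_set <- iris[-train_idx, ]                             # validation set: evaluate or prune here
val_tree <- ctree(class ~ ., data = train_set)
table(predicted = predict(val_tree, newdata = valid_set), actual = valid_set$class)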
Statistical testing: create the decision tree using the training set, then apply statistical tests (error estimation or chi-square tests) to determine whether pruning or expanding a node produces an improvement beyond the training set. For more information on these statistical tests, see “Overfitting Data” in the references and resources section.