Decision Tree

Md. Tahseen Anam, Farzana Farid Nitol

April 20, 2018

What is a Decision Tree?

To understand a decision tree, let's first understand what a tree is.

A tree is a data structure made up of nodes (or vertices) and edges, without any cycle. The tree with no nodes is called the null or empty tree. A non-empty tree consists of a root node and potentially many levels of additional nodes that form a hierarchy. A (rooted) tree with only one node (the root) has a height of zero (or one, depending on the convention).
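As a small, hypothetical illustration (in R, the language used for the examples later in this document), a rooted tree can be represented as nested lists and its height computed recursively; the field names value and children are just illustrative conventions:

# A toy rooted tree: each node is a list holding a value and a list of child nodes.
tree <- list(value = "root", children = list(
  list(value = "left child", children = list()),
  list(value = "right child", children = list(
    list(value = "grandchild", children = list())
  ))
))

# Height of a rooted tree: 0 for a single node, otherwise 1 + the height of the tallest child.
tree_height <- function(node) {
  if (length(node$children) == 0) return(0)
  1 + max(sapply(node$children, tree_height))
}

tree_height(tree)  # 2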

A Decision Tree is a supervised predictive model that can learn to predict discrete or continuous outputs by answering a set of simple questions based on the values of the input features it receives.

But what is a supervised predictive model?

What is a supervised predictive model?

In supervised learning, you have input variables (X) and an output variable (Y), and you use an algorithm to learn the mapping function from the input to the output.

Y = f(X)

The goal is to approximate the mapping function so well that, when you have new input data (X), you can predict the output variable (Y) for that data.

It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process. We know the correct answers; the algorithm iteratively makes predictions on the training data and is corrected by the teacher. Learning stops when the algorithm achieves an acceptable level of performance.

A model trained in this way is called a supervised predictive model.
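As a minimal sketch in R (with made-up data; the "true" mapping f here is chosen by hand), a supervised model learns an approximation of Y = f(X) from labelled examples and then predicts Y for new inputs:

# Hypothetical labelled data: Y is generated from a known mapping f(X) = 2X + 3 plus noise.
set.seed(1)
X <- runif(50, 0, 10)
Y <- 2 * X + 3 + rnorm(50, sd = 1)

# "Teacher" phase: learn an approximation of the mapping from the training examples.
model <- lm(Y ~ X)

# Predict the output for new, unseen inputs; the results should be close to 11 and 17.
predict(model, data.frame(X = c(4, 7)))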

Decision Trees Overview

The examples below use the well-known iris dataset, in which each flower is described by four measurements: sepal length, sepal width, petal length, and petal width. Below is a random sample of 10 flowers from the dataset.

Each of the four measurements is given in centimeters, and the target value encodes the type of iris: 0 (Setosa), 1 (Versicolour), or 2 (Virginica).

A Decision Tree (DT) uses different attributes to repeatedly split the data it receives into subsets, and repeats the process until the subsets are pure. Pure means that all examples in a subset share the same target value. In the iris dataset, the DT may choose to split the data into two subsets by asking which flowers have a petal length of less than 2.45 cm:

In this case, the first subset, containing all flowers with a petal length of less than 2.45 cm, is pure: all of the target values in the subset are 0. The other subset is not pure, however, as it contains data points with target values of both 1 and 2. To further split the data, the DT can ask which flowers have a sepal length of less than 6 cm. This results in splitting the data into two pure subsets:
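This first split can be checked directly in R with the built-in iris dataset; note that the built-in data contains all 150 flowers rather than the 10-flower sample used above, but the 2.45 cm threshold separates the species in the same way:

data(iris)

# Flowers with a petal length below 2.45 cm form a pure subset: all of them are setosa.
table(iris$Species[iris$Petal.Length < 2.45])

# The remaining flowers are a mix of versicolor and virginica, so that subset is impure.
table(iris$Species[iris$Petal.Length >= 2.45])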

Thus, in order to predict the iris type of a new sample, we can follow the splits and evaluate each decision using our new sample data.

For example, if we had the following unknown iris measurements:

We would classify the flower as Versicolour (target value 1), as the petal length is more than 2.45 cm and the sepal length is less than 6 cm.
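As a hedged sketch of this prediction workflow in R (using the rpart package, which also appears later in this document): the measurement values below are hypothetical, chosen so the flower has a petal length above 2.45 cm and a sepal length below 6 cm, and a tree fit on the full built-in iris data may pick different split conditions than the 10-flower sample above.

library(rpart)

# Fit a small classification tree on the built-in iris data.
fit_iris <- rpart(Species ~ ., data = iris, method = "class")

# A hypothetical new flower following the Versicolour branch described above.
new_flower <- data.frame(Sepal.Length = 5.8, Sepal.Width = 2.7,
                         Petal.Length = 4.1, Petal.Width = 1.3)

# Follow the learned splits to classify the new sample (expected: versicolor).
predict(fit_iris, new_flower, type = "class")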

Decision Trees Algorithm

When choosing attributes to use as criteria for splits, it is possible to choose any attribute. How does the DT algorithm decide which attribute and condition to split on?

To put it simply, the DT algorithm chooses the attribute and condition based on how pure the resulting subsets will be. The pureness of a subset is determined by the certainty of the target values within it. For example, in our iris dataset, we split the data using the criterion sepal length < 6 cm (condition 1) and came up with the following distribution of target values in the subsets.

Condition 1

  1. First subset (three examples): all with target value 1
  2. Second subset (four examples): all with target value 2

In this case, we are completely certain about which target values can be found in each subset: the first subset contains 100% target value 1, and the second subset contains 100% target value 2.

Now, let's imagine that we have split the same data using an imaginary condition (condition 2) that resulted in the following distribution:

Condition 2

  1. First subset (five examples): three with target value 1 and two with target value 2
  2. Second subset (two examples): one with target value 1 and one with target value 2

In the first subset, we do not have complete certainty about the target values. There are three examples with target value 1 and two with target value 2. In the second subset, we have complete uncertainty: the chance of the target value being either 1 or 2 is exactly 50%.

In this case, the algorithm will prefer the first split attribute and condition as it allows for more certainty of the target values.

Quantifying Pureness of a Subset

To measure the pureness or uncertainty of a subset, we need a quantitative measure so that the DT algorithm can be objective when choosing the best attribute and condition to split on. There are different ways to measure the uncertainty in a set of values, but for the purposes of this example, we will use Entropy (represented by “H”).

The entropy is defined as

H(X) = - Σ p_i * log2(p_i), summed over i = 1, …, n

where X is the resulting split, n is the number of different target values in the subset, and p_i is the proportion of the i-th target value in the subset.

For example, the entropy for conditions 1 and 2 will be the following. The log is base 2.

H(Condition 1, first subset) = -(1 * log(1)) = 0 (pure subset)
H(Condition 1, second subset) = -(1 * log(1)) = 0 (pure subset)
H(Condition 2, first subset) = -(3/5 * log(3/5) + 2/5 * log(2/5)) ≈ 0.97 (impure subset)
H(Condition 2, second subset) = -(1/2 * log(1/2) + 1/2 * log(1/2)) = 1 (impure subset)
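These values can be reproduced with a short R function; the function name entropy and the hand-written target-value vectors below are illustrative:

# Entropy of a vector of target values, using log base 2.
entropy <- function(targets) {
  p <- table(targets) / length(targets)  # proportion of each target value
  -sum(p * log2(p))
}

entropy(c(1, 1, 1))        # 0     (Condition 1, first subset: pure)
entropy(c(2, 2, 2, 2))     # 0     (Condition 1, second subset: pure)
entropy(c(1, 1, 1, 2, 2))  # ~0.97 (Condition 2, first subset: impure)
entropy(c(1, 2))           # 1     (Condition 2, second subset: impure)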

Real-world applications

  1. Agriculture
  2. Astronomy
  3. Control Systems
  4. Financial analysis
  5. Manufacturing and Production
  6. Molecular biology
  7. Object recognition
  8. Plant diseases
  9. Text processing

Advantages of decision trees

  1. Can be used for regression or classification

A regression problem requires the prediction of a quantity. A classification problem requires that examples be classified into one of two or more classes.

Regression is used to predict continuous values, while classification is used to predict which class a data point belongs to (a discrete value); a short R sketch after this list illustrates both.

Example: I have a house with W rooms, X bathrooms, Y square-footage and Z lot-size. Based on other houses in the area that have recently sold, how much (dollar amount) can I sell my house for? I would use regression for this kind of problem.

Example: I have an unknown fruit that is yellow in color, 5.5 inches long, diameter of an inch, and density of X. What fruit is this? I would use classification for this kind of problem to classify it as a banana (as opposed to an apple or orange).

  2. Can be displayed graphically
  3. Highly interpretable
  4. Can be specified as a series of rules, and more closely approximate human decision-making than other models
  5. Prediction is fast
  6. Features don't need scaling. Feature scaling (rescaling inputs to a common range) is unnecessary because a tree's splits depend only on how a feature's values compare to a threshold, not on their absolute magnitude.
  7. Automatically learns feature interactions
  8. Tends to ignore irrelevant features
  9. Non-parametric (will outperform linear models if the relationship between features and response is highly non-linear)
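As mentioned in the first advantage, the same rpart interface covers both modes; here is a brief hedged sketch using R's built-in iris (classification) and mtcars (regression) datasets:

library(rpart)

# Classification: predict a discrete class (the iris species).
class_tree <- rpart(Species ~ ., data = iris, method = "class")
predict(class_tree, iris[1, ], type = "class")  # a species label

# Regression: predict a continuous value (fuel efficiency in mpg).
reg_tree <- rpart(mpg ~ ., data = mtcars, method = "anova")
predict(reg_tree, mtcars[1, ])  # a numeric prediction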

Disadvantages of decision trees

  1. Performance is (generally) not competitive with the best supervised learning methods
  2. Can easily overfit the training data (tuning is required; see the sketch after this list). Overfitting means the model memorizes the training data instead of learning patterns that generalize.
  3. Small variations in the data can result in a completely different tree (high variance). Variance means how much the learned model changes when the training data changes.
  4. Recursive binary splitting makes “locally optimal” decisions that may not result in a globally optimal tree
  5. Doesn’t tend to work well if the classes are highly unbalanced
  6. Doesn’t tend to work well with very small datasets
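Regarding the overfitting point (item 2), a brief hedged sketch: rpart exposes a complexity parameter (cp) and other controls that limit how far a tree grows, and an overgrown tree can be pruned back. The parameter values below are purely illustrative, not recommendations.

library(rpart)

# Grow a deliberately deep tree (very low cp, tiny minsplit), which is likely to overfit.
deep_tree <- rpart(Species ~ ., data = iris, method = "class",
                   control = rpart.control(cp = 0.0001, minsplit = 2))

# Inspect the cross-validated error for each complexity value, then prune the tree back.
printcp(deep_tree)
pruned_tree <- prune(deep_tree, cp = 0.1)  # cp value chosen here only for illustration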

Install and load required packages for fancy decision tree plotting

install.packages('rattle')

install.packages('rpart.plot')

install.packages('RColorBrewer')

library(rpart)

library(rattle)

library(rpart.plot)

library(RColorBrewer)

# Read the training and test data (these appear to be the Kaggle Titanic train.csv and test.csv files)
train <- read.csv("train.csv")

test <- read.csv("test.csv")

Recreate the gender model

fit <- rpart(Survived ~ Sex, data=train, method="class")

fancyRpartPlot(fit)

Build a deeper tree

fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data=train, method="class")

Plot it with base-R

plot(fit)

text(fit)

And then make it look better with fancyRpartPlot!

fancyRpartPlot(fit)

Now let’s make a prediction and write a submission file

Prediction <- predict(fit, test, type = "class")

submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)

write.csv(submit, file = "myfirstdtree.csv", row.names = FALSE)
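As an optional sanity check (assuming train.csv is the Kaggle Titanic training file, which includes the Survived column), we can compare the tree's predictions on the training data against the actual outcomes:

# Confusion matrix of predicted versus actual outcomes on the training data.
train_pred <- predict(fit, train, type = "class")

table(predicted = train_pred, actual = train$Survived)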