Md. Tahseen Anam, Farzana Farid Nitol
April 20, 2018
To understand a decision tree, let's first try to understand what a tree is.
A tree is a data structure made up of nodes (or vertices) and edges, without any cycle. The tree with no nodes is called the null or empty tree. A tree that is not empty consists of a root node and potentially many levels of additional nodes that form a hierarchy. A (rooted) tree with only one node (the root) has a height of zero (or one, depending on the convention).
A Decision Tree is a supervised predictive model that can learn to predict discrete or continuous outputs by answering a set of simple questions based on the values of the input features it receives.
But what is a supervised predictive model?
Supervised learning is where you have input variables (x) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output.
Y = f(X)
The goal is to approximate the mapping function so well that, when you have new input data (x), you can predict the output variable (Y) for that data.
It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process. We know the correct answers, the algorithm iteratively makes predictions on the training data and is corrected by the teacher. Learning stops when the algorithm achieves an acceptable level of performance.
Such an algorithm is called a supervised predictive model.
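As a minimal sketch of this idea in R (using the built-in cars dataset and a plain linear model purely for illustration, not yet a decision tree):

# Learn a mapping f from input x (speed) to output Y (stopping distance)
data(cars)
model <- lm(dist ~ speed, data = cars)                    # fit f on the training data
predict(model, newdata = data.frame(speed = c(10, 20)))   # predict Y for new inputs x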
As a running example, consider the well-known iris dataset, in which each flower is described by four measurements and labeled as one of three species. Below is a random sample of 10 flowers from the dataset.
Each of the four measurements is given in centimeters, and the target value represents the type of iris: 0 (Setosa), 1 (Versicolour), or 2 (Virginica).
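Such a sample can be drawn from R's built-in iris data (used here as a stand-in for the dataset above; note that its Species column stores the species names rather than the 0/1/2 codes, and the seed value is arbitrary):

# Draw a reproducible random sample of 10 flowers
data(iris)
set.seed(1)
iris[sample(nrow(iris), 10), ]   # four measurements plus the Species (target) column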
A decision tree uses different attributes to repeatedly split the data it receives into subsets, and repeats the process until the subsets are pure, meaning that all examples in a subset share the same target value. In the iris dataset, the decision tree (DT) may choose to split the data into two by asking which flowers have a petal length of less than 2.45 cm:
In this case, the first subset, representing all flowers with a petal length of less than 2.45 cm, is pure: all of the target values in the subset are 0. The other subset is not pure, however, as it contains data points with target values of both 1 and 2. To further split the data, the DT can ask which flowers have a sepal length of less than 6 cm. This results in splitting the data into two pure subsets:
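With the built-in iris data, the two splits can be reproduced directly, and table() on the species column shows how pure each subset is. The thresholds 2.45 cm and 6 cm are taken from the text; on the full 150-flower dataset the second split is not perfectly pure, unlike in the small sample discussed here, so treat this only as a sketch of how subsets are inspected.

# First split: petal length < 2.45 cm
left  <- iris[iris$Petal.Length < 2.45, ]
right <- iris[iris$Petal.Length >= 2.45, ]
table(left$Species)    # only setosa -> pure subset
table(right$Species)   # versicolor and virginica -> impure, split again

# Second split of the impure subset: sepal length < 6 cm
table(right[right$Sepal.Length < 6, "Species"])
table(right[right$Sepal.Length >= 6, "Species"])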
Thus, in order to predict the iris type of a new sample, we can follow the splits and evaluate each decision using our new sample data.
For example, if we had the following unknown iris measurements:
We would classify the flower as Versicolour (target value 1), as the petal length is more than 2.45 cm and the sepal length is less than 6 cm.
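Following those two splits by hand might look like the sketch below; the new_flower values are made up purely for illustration.

# Hypothetical new measurement (values invented for the example)
new_flower <- data.frame(Petal.Length = 4.8, Sepal.Length = 5.6)

if (new_flower$Petal.Length < 2.45) {
  prediction <- 0   # Setosa
} else if (new_flower$Sepal.Length < 6) {
  prediction <- 1   # Versicolour
} else {
  prediction <- 2   # Virginica
}
prediction          # 1 for these made-up values, i.e. Versicolour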
When choosing attributes to use as criteria for splits, it is possible to choose any attribute. How does the DT algorithm decide which attribute and condition to split on?
To put it simply, the DT algorithm chooses the attribute and condition based on how pure the resulting subsets will be. The pureness of a subset is determined by how certain we are about the target values within it. For example, in our iris dataset, we split the data using the criterion sepal length < 6 cm (condition 1) and came up with the following distribution of target values in the subsets.
Condition 1
In this case, we are completely certain about which target values can be found in each subset: the first subset contains 100% target value 1, and the second subset contains 100% target value 2.
Now, let's imagine that we have split the same data using an imaginary condition (condition 2) that resulted in the following distribution:
Condition 2
In the first subset, we do not have complete certainty about the target values. There are three examples with target value 1 and two with target value 2. In the second subset, we have complete uncertainty: the chance of the target value being either 1 or 2 is exactly 50%.
In this case, the algorithm will prefer the first split attribute and condition as it allows for more certainty of the target values.
To measure the pureness or uncertainty of a subset, we need a quantitative measure so that the DT algorithm can be objective when choosing the best attribute and condition to split on. There are different ways to measure the uncertainty in a set of values, but for the purposes of this example, we will use Entropy (represented by "H"):
H(X) = - ( p1*log2(p1) + p2*log2(p2) + ... + pn*log2(pn) )
where X is the resulting split, n is the number of different target values in the subset, and pi is the proportion of the ith target value in the subset.
For example, the entropy for conditions 1 and 2 will be the following. The log is base 2.
H(Condition 1, first subset)  = -( 1 * log2(1) ) = 0      (pure subset)
H(Condition 1, second subset) = -( 1 * log2(1) ) = 0      (pure subset)
H(Condition 2, first subset)  = -( 3/5 * log2(3/5) + 2/5 * log2(2/5) ) = 0.97   (impure subset)
H(Condition 2, second subset) = -( 1/2 * log2(1/2) + 1/2 * log2(1/2) ) = 1      (impure subset)
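These values can be checked with a small helper in R; the function name entropy and the proportion vectors are just for illustration.

# Entropy of a subset, given the proportions of each target value in it
entropy <- function(p) -sum(p * log2(p))

entropy(c(1))          # Condition 1, either subset:  0    (pure)
entropy(c(3/5, 2/5))   # Condition 2, first subset:  ~0.97 (impure)
entropy(c(1/2, 1/2))   # Condition 2, second subset:  1    (impure)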
As noted earlier, a decision tree can predict either discrete or continuous outputs; this is the distinction between regression and classification. A regression problem requires the prediction of a quantity. A classification problem requires that examples be classified into one of two or more classes.
Regression is used to predict continuous values. Classification is used to predict which class a data point is part of (discrete value).
Example: I have a house with W rooms, X bathrooms, Y square-footage and Z lot-size. Based on other houses in the area that have recently sold, how much (dollar amount) can I sell my house for? I would use regression for this kind of problem.
Example: I have an unknown fruit that is yellow in color, 5.5 inches long, diameter of an inch, and density of X. What fruit is this? I would use classification for this kind of problem to classify it as a banana (as opposed to an apple or orange).
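In rpart (used in the code below), the same interface covers both problem types: method = "anova" grows a regression tree for continuous targets and method = "class" grows a classification tree for discrete ones. A brief sketch using R's built-in mtcars and iris data instead of the house and fruit examples above:

library(rpart)
# Regression tree: predict a continuous value (fuel consumption in mpg)
reg_fit <- rpart(mpg ~ wt + hp, data = mtcars, method = "anova")
# Classification tree: predict a discrete class (iris species)
cls_fit <- rpart(Species ~ ., data = iris, method = "class")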
The following R code builds a classification decision tree on the Kaggle Titanic dataset (train.csv and test.csv).

# Install the plotting helpers; rpart itself ships with base R
install.packages('rattle')
install.packages('rpart.plot')
install.packages('RColorBrewer')

library(rpart)         # recursive partitioning (decision trees)
library(rattle)        # provides fancyRpartPlot()
library(rpart.plot)
library(RColorBrewer)
# Load the training and test sets
train <- read.csv("train.csv")
test <- read.csv("test.csv")
# A first tree using only Sex as a predictor
fit <- rpart(Survived ~ Sex, data = train, method = "class")
fancyRpartPlot(fit)
# A richer tree using more passenger attributes
fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
             data = train, method = "class")
plot(fit)             # base-graphics plot of the tree
text(fit)             # add the split labels
fancyRpartPlot(fit)   # nicer rendering of the same tree
# Predict Survived on the test set and write a submission file
Prediction <- predict(fit, test, type = "class")
submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)
write.csv(submit, file = "myfirstdtree.csv", row.names = FALSE)