Decision trees are a graphical method to represent choices and their consequences. They are a popular data mining and machine learning technique. A decision tree is a supervised learning algorithm and can be used for regression as well as classification problems.
A classification tree is very similar to a regression tree, except that it predicts a categorical (qualitative) response. Because the predicted value is a class rather than a number, the residual sum of squares cannot be used as a measure here. Instead, classification trees are grown using measures such as the classification error rate, the Gini index, or cross-entropy.
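To make these measures concrete, here is a minimal sketch in base R that computes the classification error rate, the Gini index, and the cross-entropy for the class labels that fall into one node. The helper name `node_impurity` is hypothetical and not part of any package; it only illustrates the formulas and is not how rpart computes its splits internally.

```r
# Impurity of a single node, given the class labels of the observations in it.
node_impurity <- function(labels) {
  p <- as.numeric(prop.table(table(labels)))  # class proportions in the node
  p <- p[p > 0]                               # drop empty classes so log() stays finite
  c(
    error   = 1 - max(p),        # classification error rate
    gini    = 1 - sum(p^2),      # Gini index
    entropy = -sum(p * log(p))   # cross-entropy (natural log)
  )
}

# A pure node has zero impurity.
node_impurity(c("a", "a", "a", "a"))

# A 50/50 node is the most impure case for two classes.
node_impurity(c("a", "a", "b", "b"))
```

All three measures are zero for a pure node and grow as the classes become more mixed, which is why a tree prefers splits that lower them.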
The task is to use the iris data set.
This data set is built into R (it ships with the datasets package), so it is easy to find.
The iris data set is small, so advanced R programmers or data scientists would probably find it unsuitable, but for me, as a lady and a beginner in big data analytics, it is a great place to start and to show some simple tricks for plotting and analyzing a data set.
Load the iris data set, which is required for the task:
library(datasets)
data("iris")
The iris dataset gives the measurements, in centimeters, of sepal length and width and petal length and width for 50 flowers from each of 3 species of iris. The species are Iris setosa, Iris versicolor, and Iris virginica.
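As a quick check of this description, the following small sketch uses only base R to look at the structure of the data and count the flowers per species.

```r
library(datasets)
data("iris")

str(iris)             # structure: 150 observations of 5 variables
table(iris$Species)   # number of flowers recorded for each species
```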
Check the first 10 rows of the dataset:
head(iris, 10)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
Check the last 10 rows of the dataset:
tail(iris, 10)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 141 6.7 3.1 5.6 2.4 virginica
## 142 6.9 3.1 5.1 2.3 virginica
## 143 5.8 2.7 5.1 1.9 virginica
## 144 6.8 3.2 5.9 2.3 virginica
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
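As the output shows, the rows are ordered by species, so the same species repeats over many consecutive rows. This creates a problem for training and testing: if we simply took the first rows for training and the last rows for testing, the two sets would contain different species.
To avoid this, we shuffle the rows of the dataset before splitting it.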
shuffle_iris <- sample(1 : nrow(iris))
head(shuffle_iris)
## [1] 68 149 44 69 80 37
# reorder the rows of iris using the shuffled indices
iris_data <- iris[shuffle_iris,]
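Because `sample()` draws a new random permutation each time, the row numbers shown in the output will differ from run to run. A minimal sketch of how the shuffle could be made repeatable with `set.seed()` follows; the seed value 42 and the names `shuffle_a` and `shuffle_b` are arbitrary choices for illustration, not values used in this document.

```r
set.seed(42)                               # arbitrary seed, chosen only for illustration
shuffle_a <- sample(seq_len(nrow(iris)))   # a random permutation of 1..150

set.seed(42)                               # resetting the same seed ...
shuffle_b <- sample(seq_len(nrow(iris)))   # ... reproduces the same permutation

identical(shuffle_a, shuffle_b)            # TRUE: both shuffles are the same
```

Setting a seed before shuffling means the train/test split, and therefore the accuracy figures, can be reproduced exactly.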
To inspect the shuffled rows:
head(iris_data,10)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 68 5.8 2.7 4.1 1.0 versicolor
## 149 6.2 3.4 5.4 2.3 virginica
## 44 5.0 3.5 1.6 0.6 setosa
## 69 6.2 2.2 4.5 1.5 versicolor
## 80 5.7 2.6 3.5 1.0 versicolor
## 37 5.5 3.5 1.3 0.2 setosa
## 65 5.6 2.9 3.6 1.3 versicolor
## 42 4.5 2.3 1.3 0.3 setosa
## 122 5.6 2.8 4.9 2.0 virginica
## 134 6.3 2.8 5.1 1.5 virginica
tail(iris_data,10)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 119 7.7 2.6 6.9 2.3 virginica
## 43 4.4 3.2 1.3 0.2 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 33 5.2 4.1 1.5 0.1 setosa
## 136 7.7 3.0 6.1 2.3 virginica
## 5 5.0 3.6 1.4 0.2 setosa
## 130 7.2 3.0 5.8 1.6 virginica
## 93 5.8 2.6 4.0 1.2 versicolor
## 45 5.1 3.8 1.9 0.4 setosa
## 71 5.9 3.2 4.8 1.8 versicolor
So, according to the tables, the rows of the iris data are now in random order, which is what we want for the training and testing phases.
Create a function that splits the data into training and test sets for the decision tree:
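# Return the first `size` share of the rows (train = TRUE)
# or the remaining rows (train = FALSE) of an already shuffled data frame.
create_train_test <- function(data, size = 0.8, train = TRUE) {
  n_row <- nrow(data)
  total_row <- floor(size * n_row)      # number of training rows
  train_sample <- seq_len(total_row)    # row indices 1..total_row
  if (train) {
    data[train_sample, ]
  } else {
    data[-train_sample, ]
  }
}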
iris_train <- create_train_test(iris_data , size = 0.8, train = TRUE)
iris_test <- create_train_test(iris_data , size = 0.8, train = FALSE)
Rows and Columns of Randomized Dataset:
dim(iris_data)
## [1] 150 5
Rows and Columns of Training Dataset:
dim(iris_train)
## [1] 120 5
Rows and Columns of Testing Dataset:
dim(iris_test)
## [1] 30 5
To verify that the random split keeps the species reasonably balanced, check the proportion of each species.
In the training set:
prop.table(table(iris_train$Species)) # proportion of each species
##
## setosa versicolor virginica
## 0.3166667 0.3500000 0.3333333
In the test set:
prop.table(table(iris_test$Species))
##
## setosa versicolor virginica
## 0.4000000 0.2666667 0.3333333
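In both cases each species makes up roughly one third of the rows, although the proportions in the small test set (30 rows) vary more, from about 27 to 40 percent.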
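The proportions differ between the two sets because the split is purely random. If exact balance matters, one option is a stratified split, which samples the same share of rows within each species. The sketch below is a different technique from the `create_train_test` function used in this document; the helper name `stratified_split` and the seed are hypothetical.

```r
# Stratified split: sample `size` of the row indices within each group.
stratified_split <- function(data, group, size = 0.8) {
  idx_by_group <- split(seq_len(nrow(data)), data[[group]])
  train_idx <- unlist(lapply(idx_by_group, function(idx) {
    # sample positions within the group, then map them back to row indices
    idx[sample.int(length(idx), floor(size * length(idx)))]
  }), use.names = FALSE)
  list(train = data[train_idx, ], test = data[-train_idx, ])
}

set.seed(1)  # hypothetical seed, only to make the sketch repeatable
parts <- stratified_split(iris, "Species", size = 0.8)

prop.table(table(parts$train$Species))   # each species is one third of the training set
prop.table(table(parts$test$Species))    # and one third of the test set
```

With 50 flowers per species and `size = 0.8`, this puts 40 flowers of each species in the training set and 10 in the test set.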
library(rpart) # for training the decision tree
library(rpart.plot) # for plotting the decision tree
Model:
fit <- rpart(Species ~ ., data = iris_train, method = "class")
# extra = 104 shows the probability of every class plus the percentage of observations in each node
rpart.plot(fit, extra = 104)
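The plot is the easiest way to read the tree, but the fitted object can also be inspected as text. The sketch below fits a tree on the full built-in iris data, rather than on `iris_train`, purely so that it runs on its own; the name `fit_demo` is hypothetical.

```r
library(rpart)

# Fit on the full iris data so that the sketch is self-contained.
fit_demo <- rpart(Species ~ ., data = iris, method = "class")

print(fit_demo)                 # text version of the tree, one line per node
fit_demo$variable.importance    # which predictors the splits rely on most
```

Each line of the printed tree shows the split rule, the number of observations in the node, and the predicted class, which is the same information the plot draws as boxes.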
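Predict the species for the test dataset and build the confusion matrix (rows are the actual species, columns are the predicted species):
predict_unseen <- predict(fit, iris_test, type = "class")
table_mat <- table(iris_test$Species, predict_unseen)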
table_mat
## predict_unseen
## setosa versicolor virginica
## setosa 12 0 0
## versicolor 0 7 1
## virginica 0 1 9
Check the accuracy rate:
accuracy_test <- sum(diag(table_mat)) / sum(table_mat)
print(paste("Accuracy for Test is ", accuracy_test))
## [1] "Accuracy for Test is 0.933333333333333"
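Accuracy is a single number, so it hides which species the tree confuses. As a small worked example, the sketch below rebuilds the confusion matrix from the table above by hand and derives the per-species recall and precision from it. The variable names are hypothetical; the counts are copied from the output shown.

```r
# Confusion matrix copied from the table above (rows = actual, columns = predicted).
species <- c("setosa", "versicolor", "virginica")
conf_mat <- matrix(c(12, 0, 0,
                      0, 7, 1,
                      0, 1, 9),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(actual = species, predicted = species))

accuracy  <- sum(diag(conf_mat)) / sum(conf_mat)   # 28 correct out of 30
recall    <- diag(conf_mat) / rowSums(conf_mat)    # correct / actual, per species
precision <- diag(conf_mat) / colSums(conf_mat)    # correct / predicted, per species

round(accuracy, 3)
round(rbind(recall, precision), 3)
```

Here setosa is classified perfectly, while the two errors are one versicolor flower predicted as virginica and one virginica flower predicted as versicolor.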
Proportion of each Petal.Length value in the training set:
prop.table(table(iris_train$Petal.Length))
##
## 1 1.1 1.2 1.3 1.4 1.5
## 0.008333333 0.008333333 0.008333333 0.041666667 0.083333333 0.083333333
## 1.6 1.7 1.9 3 3.3 3.5
## 0.041666667 0.033333333 0.008333333 0.008333333 0.016666667 0.008333333
## 3.6 3.7 3.8 3.9 4 4.1
## 0.008333333 0.008333333 0.008333333 0.008333333 0.025000000 0.025000000
## 4.2 4.3 4.4 4.5 4.6 4.7
## 0.025000000 0.016666667 0.033333333 0.058333333 0.025000000 0.041666667
## 4.8 4.9 5 5.1 5.2 5.3
## 0.025000000 0.041666667 0.025000000 0.058333333 0.008333333 0.016666667
## 5.4 5.5 5.6 5.7 5.8 5.9
## 0.016666667 0.025000000 0.033333333 0.016666667 0.016666667 0.016666667
## 6 6.1 6.3 6.4 6.6 6.7
## 0.016666667 0.008333333 0.008333333 0.008333333 0.008333333 0.016666667
Proportion of each Petal.Length value in the test set:
prop.table(table(iris_test$Petal.Length))
##
## 1.2 1.3 1.4 1.5 1.6 1.9 3.5
## 0.03333333 0.06666667 0.10000000 0.10000000 0.06666667 0.03333333 0.03333333
## 3.9 4 4.2 4.5 4.8 5 5.1
## 0.06666667 0.06666667 0.03333333 0.03333333 0.03333333 0.03333333 0.03333333
## 5.2 5.6 5.7 5.8 6.1 6.9
## 0.03333333 0.06666667 0.03333333 0.03333333 0.06666667 0.03333333
Model with Petal.Length as the response:
fit_1 <- rpart(Petal.Length ~ ., data = iris_train, method = "class")
rpart.plot(fit_1, extra = 106)
## Warning: extra=106 but the response has 42 levels (only the 2nd level is
## displayed)
## Warning: All boxes will be white (the box.palette argument will be ignored) because
## the number of classes in the response 42 is greater than length(box.palette) 6.
## To silence this warning use box.palette=0 or trace=-1.
# Note: this call uses fit (the Species model), not fit_1, so the predictions are species labels
predict_unseen <- predict(fit, iris_test, type = "class")
table_mat <- table(iris_test$Petal.Length, predict_unseen)
table_mat
## predict_unseen
## setosa versicolor virginica
## 1.2 1 0 0
## 1.3 2 0 0
## 1.4 3 0 0
## 1.5 3 0 0
## 1.6 2 0 0
## 1.9 1 0 0
## 3.5 0 1 0
## 3.9 0 2 0
## 4 0 2 0
## 4.2 0 1 0
## 4.5 0 1 0
## 4.8 0 0 1
## 5 0 0 1
## 5.1 0 0 1
## 5.2 0 0 1
## 5.6 0 0 2
## 5.7 0 0 1
## 5.8 0 1 0
## 6.1 0 0 2
## 6.9 0 0 1
accuracy_test <- sum(diag(table_mat)) / sum(table_mat)
print(paste("Accuracy for Test is ", accuracy_test))
## [1] "Accuracy for Test is 0.0333333333333333"
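The very low accuracy here is not a fair measure of the tree. Petal.Length is a numeric variable, so `method = "class"` treats each of its distinct values as a separate class (42 of them, according to the warning above). The table crosses petal lengths with predicted species, so its diagonal does not count correct predictions. A minimal sketch of an alternative, assuming the goal is to predict Petal.Length as a number, is a regression tree with `method = "anova"`, evaluated with the root mean squared error instead of accuracy. The seed, the split, and the names `train_set`, `test_set`, and `fit_reg` are hypothetical, and the sketch makes its own split so that it runs on its own.

```r
library(rpart)

set.seed(123)                                # hypothetical seed for a repeatable split
rows      <- sample(seq_len(nrow(iris)))     # shuffled row indices
train_set <- iris[rows[1:120], ]             # 80% of the rows for training
test_set  <- iris[rows[121:150], ]           # 20% of the rows for testing

# method = "anova" fits a regression tree, which suits a numeric response.
fit_reg <- rpart(Petal.Length ~ ., data = train_set, method = "anova")

pred <- predict(fit_reg, test_set)           # numeric predictions, in centimeters
rmse <- sqrt(mean((test_set$Petal.Length - pred)^2))
rmse                                         # typical prediction error, in centimeters
```

The root mean squared error is in the same unit as the response, so it reads as a typical error in centimeters rather than a share of exact matches.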