R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

Decision Tree in R

A decision tree is a graphical method for representing choices and their consequences. It is a popular data mining and machine learning technique: a supervised learning algorithm that can be used for regression as well as classification problems.

Creating Decision Tree by using R:

Classification Decision Tree:

A classification tree is very similar to a regression tree, except that it predicts a categorical (qualitative) response. Because the response is qualitative, the residual sum of squares cannot be used to measure the quality of a split. Instead, classification trees are built using measures such as the classification error rate, the Gini index, or cross-entropy.
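As a rough illustration of these measures, the sketch below computes the classification error rate, the Gini index (which rpart uses by default for classification), and the cross-entropy for a single node from its class counts. The helper name `node_impurity` and the example counts are made up for this illustration; they are not part of rpart.

```r
# Impurity measures for one tree node, given the class counts in that node.
# `node_impurity` is a hypothetical helper written for this example.
node_impurity <- function(counts) {
  p <- counts / sum(counts)              # class proportions in the node
  p_pos <- p[p > 0]                      # drop empty classes so log2(0) never occurs
  c(
    error   = 1 - max(p),                # classification error rate
    gini    = 1 - sum(p^2),              # Gini index
    entropy = -sum(p_pos * log2(p_pos))  # cross-entropy in bits
  )
}

pure  <- node_impurity(c(50, 0, 0))    # node holding a single species
mixed <- node_impurity(c(50, 50, 50))  # node holding all three species equally
print(round(rbind(pure, mixed), 3))
```

All three measures are zero for a pure node and largest when the classes are evenly mixed, which is why a split that lowers them is preferred.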

Loading required R packages for Decision Tree:

  • Load Required Packages:
    • datasets: for loading the iris dataset.
    • rpart.plot: plots an rpart model, automatically tailoring the plot for the model’s response type.
    • rpart: a machine learning library in R used for building classification and regression trees.

Data Sets:

The task uses the Iris data set.

Data Source:

This data set is built into R (it ships with the datasets package), so it is easy to find.

The Iris data set is small, so advanced R programmers or data scientists would probably find it rather unsuitable, but for me, as a beginner in big data analytics, it is a great place to start and to show some simple tricks for plotting and analyzing data.

Loading Data sets

Load the Iris data set, which is required for this task:

library(datasets)
data("iris")

The iris dataset gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
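The description above can be checked directly against the data. The following is a small sketch using only base R; the variable name `species_counts` is introduced here for illustration.

```r
library(datasets)
data("iris")

str(iris)                              # column types and the first few values
species_counts <- table(iris$Species)  # number of flowers recorded per species
print(species_counts)
```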

To inspect the dataset:

Check the first rows of the dataset:

head(iris, 10)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa

Check the last rows of the dataset:

tail(iris, 10)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 141          6.7         3.1          5.6         2.4 virginica
## 142          6.9         3.1          5.1         2.3 virginica
## 143          5.8         2.7          5.1         1.9 virginica
## 144          6.8         3.2          5.9         2.3 virginica
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

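As the first and last rows show, the dataset is ordered by species: the same species repeats row after row. That creates a problem when the data is split by row position into training and test sets, because one part could end up missing a species entirely.

To avoid this, we shuffle the dataset.

Shuffling:

We shuffle the rows of the dataset before splitting it into training and test sets.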

shuffle_iris <- sample(1 : nrow(iris))
head(shuffle_iris)
## [1]  68 149  44  69  80  37
# reorder the rows of iris using the shuffled indices
iris_data <- iris[shuffle_iris,]

To inspect the shuffled data:

 head(iris_data,10)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 68           5.8         2.7          4.1         1.0 versicolor
## 149          6.2         3.4          5.4         2.3  virginica
## 44           5.0         3.5          1.6         0.6     setosa
## 69           6.2         2.2          4.5         1.5 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 37           5.5         3.5          1.3         0.2     setosa
## 65           5.6         2.9          3.6         1.3 versicolor
## 42           4.5         2.3          1.3         0.3     setosa
## 122          5.6         2.8          4.9         2.0  virginica
## 134          6.3         2.8          5.1         1.5  virginica
 tail(iris_data,10)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 119          7.7         2.6          6.9         2.3  virginica
## 43           4.4         3.2          1.3         0.2     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 33           5.2         4.1          1.5         0.1     setosa
## 136          7.7         3.0          6.1         2.3  virginica
## 5            5.0         3.6          1.4         0.2     setosa
## 130          7.2         3.0          5.8         1.6  virginica
## 93           5.8         2.6          4.0         1.2 versicolor
## 45           5.1         3.8          1.9         0.4     setosa
## 71           5.9         3.2          4.8         1.8 versicolor

So, according to the tables, the rows of the iris data are now in random order, which is good for the training and testing phase.
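Because `sample()` draws a new permutation on every run, the row order shown above will differ each time the document is knitted. A minimal sketch of a repeatable shuffle, assuming a fixed seed is acceptable, is given below; the seed value 42 and the names `shuffle_idx`, `iris_shuffled`, and `same_again` are arbitrary choices for this example.

```r
library(datasets)
data("iris")

set.seed(42)  # arbitrary seed, chosen only for this example
shuffle_idx <- sample(seq_len(nrow(iris)))  # random permutation of the row numbers
iris_shuffled <- iris[shuffle_idx, ]

# Resetting the same seed reproduces the same permutation.
set.seed(42)
same_again <- sample(seq_len(nrow(iris)))
print(identical(shuffle_idx, same_again))
```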

Model

Create a function that splits the data into training and test sets for the decision tree.

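create_train_test <- function(data, size = 0.8, train = TRUE) {
  n_row <- nrow(data)
  # number of rows in the training part, rounded down to a whole number
  total_row <- floor(size * n_row)
  train_sample <- seq_len(total_row)
  if (train) {
    # the first `total_row` rows form the training set
    return(data[train_sample, ])
  } else {
    # the remaining rows form the test set
    return(data[-train_sample, ])
  }
}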

Create Decision Tree According to Species

Training dataset:

iris_train <- create_train_test(iris_data , size = 0.8, train = TRUE)

Test dataset:

iris_test <- create_train_test(iris_data , size = 0.8, train = FALSE)

Comparison:

Rows and Columns of Randomized Dataset:

dim(iris_data)
## [1] 150   5

Rows and Columns of Training Dataset:

dim(iris_train)
## [1] 120   5

Rows and Columns of Testing Dataset:

dim(iris_test)
## [1] 30  5

Verify

To verify that the randomized dataset is balanced across species:

In Training Case

prop.table(table(iris_train$Species)) # proportion of each species in the training set
## 
##     setosa versicolor  virginica 
##  0.3166667  0.3500000  0.3333333

In Testing Case

prop.table(table(iris_test$Species))
## 
##     setosa versicolor  virginica 
##  0.4000000  0.2666667  0.3333333

In both cases each species accounts for roughly a third of the rows (between about 27 and 40 percent), so both sets contain all three species.
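A plain random shuffle only balances the species approximately. If exact proportions are wanted, one option is a stratified split that samples 80% of the rows within each species. The sketch below is one way to do this in base R; it is not the method used in this document, and the names `idx_by_species`, `train_idx`, `strat_train`, and `strat_test` and the seed are assumptions for illustration.

```r
library(datasets)
data("iris")

set.seed(1)  # arbitrary seed for a repeatable example

# Row numbers grouped by species, then 80% drawn from each group.
idx_by_species <- split(seq_len(nrow(iris)), iris$Species)
train_idx <- unlist(lapply(idx_by_species, function(idx) {
  sample(idx, size = floor(0.8 * length(idx)))
}), use.names = FALSE)

strat_train <- iris[train_idx, ]
strat_test  <- iris[-train_idx, ]

print(prop.table(table(strat_train$Species)))
print(prop.table(table(strat_test$Species)))
```

With 50 flowers per species, each species contributes 40 rows to the training set and 10 to the test set.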

Build the model:

library(rpart) # for training the decision tree
library(rpart.plot) # for plotting the decision tree
## Warning: package 'rpart.plot' was built under R version 4.1.1

Model:

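# Use the column name so `.` expands to the other four columns only.
# extra = 106 is meant for binary responses; extra = 104 shows all three classes.
fit <- rpart(Species ~ ., data = iris_train, method = "class")
rpart.plot(fit, extra = 106)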
## Warning: extra=106 but the response has 3 levels (only the 2nd level is
## displayed)
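The fitted tree can also be inspected as text, which helps when the plot is hard to read. The sketch below fits a tree on the full iris data purely to have something to inspect; `demo_fit` is a name chosen for this example, and the plotting call is guarded because it assumes rpart.plot is installed.

```r
library(rpart)
data("iris")

# Fit on the full iris data just to have a tree to inspect.
demo_fit <- rpart(Species ~ ., data = iris, method = "class")

print(demo_fit)    # text listing of the splits and the class assigned in each node
printcp(demo_fit)  # complexity table: relative error after each additional split

# extra = 104 shows the probability of every class plus the share of rows per node.
if (requireNamespace("rpart.plot", quietly = TRUE)) {
  rpart.plot::rpart.plot(demo_fit, extra = 104)
}
```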

Make a Prediction:

Predict the test dataset:

predict_unseen <- predict(fit, iris_test, type = "class")
table_mat <- table(iris_test$Species ,predict_unseen)
table_mat
##             predict_unseen
##              setosa versicolor virginica
##   setosa         12          0         0
##   versicolor      0          7         1
##   virginica       0          1         9

Check Accuracy Rate:

accuracy_test <- sum(diag(table_mat)) / sum(table_mat)
print(paste("Accuracy for Test is ", accuracy_test))
## [1] "Accuracy for Test is  0.933333333333333"
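Overall accuracy hides which species are being confused. As a small sketch, the confusion matrix printed earlier can be re-entered by hand to compute per-class recall and precision. The counts below are copied from that output; the names `conf_mat`, `recall`, and `precision` are introduced here for illustration.

```r
# Confusion matrix copied from the printed table (rows = actual, columns = predicted).
species <- c("setosa", "versicolor", "virginica")
conf_mat <- matrix(c(12, 0, 0,
                      0, 7, 1,
                      0, 1, 9),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(actual = species, predicted = species))

recall    <- diag(conf_mat) / rowSums(conf_mat)  # share of each actual class found
precision <- diag(conf_mat) / colSums(conf_mat)  # share of each predicted class that is right
accuracy  <- sum(diag(conf_mat)) / sum(conf_mat)

print(round(rbind(recall, precision), 3))
print(accuracy)
```

Setosa is classified perfectly, while the two errors come from versicolor and virginica being mistaken for each other.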

Decision Tree By Petal Length

prop.table(table(iris_train$Petal.Length))
## 
##           1         1.1         1.2         1.3         1.4         1.5 
## 0.008333333 0.008333333 0.008333333 0.041666667 0.083333333 0.083333333 
##         1.6         1.7         1.9           3         3.3         3.5 
## 0.041666667 0.033333333 0.008333333 0.008333333 0.016666667 0.008333333 
##         3.6         3.7         3.8         3.9           4         4.1 
## 0.008333333 0.008333333 0.008333333 0.008333333 0.025000000 0.025000000 
##         4.2         4.3         4.4         4.5         4.6         4.7 
## 0.025000000 0.016666667 0.033333333 0.058333333 0.025000000 0.041666667 
##         4.8         4.9           5         5.1         5.2         5.3 
## 0.025000000 0.041666667 0.025000000 0.058333333 0.008333333 0.016666667 
##         5.4         5.5         5.6         5.7         5.8         5.9 
## 0.016666667 0.025000000 0.033333333 0.016666667 0.016666667 0.016666667 
##           6         6.1         6.3         6.4         6.6         6.7 
## 0.016666667 0.008333333 0.008333333 0.008333333 0.008333333 0.016666667
prop.table(table(iris_test$Petal.Length))
## 
##        1.2        1.3        1.4        1.5        1.6        1.9        3.5 
## 0.03333333 0.06666667 0.10000000 0.10000000 0.06666667 0.03333333 0.03333333 
##        3.9          4        4.2        4.5        4.8          5        5.1 
## 0.06666667 0.06666667 0.03333333 0.03333333 0.03333333 0.03333333 0.03333333 
##        5.2        5.6        5.7        5.8        6.1        6.9 
## 0.03333333 0.06666667 0.03333333 0.03333333 0.06666667 0.03333333

Model:

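# Petal.Length is numeric, so method = "class" treats each of its 42 distinct
# values as a separate class; a regression tree (method = "anova") suits it better.
fit_1 <- rpart(Petal.Length ~ ., data = iris_train, method = "class")
rpart.plot(fit_1, extra = 106)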
## Warning: extra=106 but the response has 42 levels (only the 2nd level is
## displayed)
## Warning: All boxes will be white (the box.palette argument will be ignored) because
## the number of classes in the response 42 is greater than length(box.palette) 6.
## To silence this warning use box.palette=0 or trace=-1.

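# Note: this predicts with `fit` (the species model), not `fit_1`, so the table
# cross-tabulates observed petal length against predicted species.
predict_unseen <- predict(fit, iris_test, type = "class")
table_mat <- table(iris_test$Petal.Length, predict_unseen)
table_mat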
##      predict_unseen
##       setosa versicolor virginica
##   1.2      1          0         0
##   1.3      2          0         0
##   1.4      3          0         0
##   1.5      3          0         0
##   1.6      2          0         0
##   1.9      1          0         0
##   3.5      0          1         0
##   3.9      0          2         0
##   4        0          2         0
##   4.2      0          1         0
##   4.5      0          1         0
##   4.8      0          0         1
##   5        0          0         1
##   5.1      0          0         1
##   5.2      0          0         1
##   5.6      0          0         2
##   5.7      0          0         1
##   5.8      0          1         0
##   6.1      0          0         2
##   6.9      0          0         1
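# table_mat is 20 x 3 here, so diag() returns only cells [1,1], [2,2], [3,3]
# (1 + 0 + 0); the resulting 1/30 is not a meaningful accuracy.
accuracy_test <- sum(diag(table_mat)) / sum(table_mat)
print(paste("Accuracy for Test is ", accuracy_test))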
## [1] "Accuracy for Test is  0.0333333333333333"
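Since petal length is a continuous measurement, a regression tree is a more natural fit than a classification tree. The sketch below shows one way this could look, using `method = "anova"` and a root mean squared error instead of an accuracy rate. It is an alternative under stated assumptions, not the analysis run above: the seed, the fresh 120/30 split, and the names `reg_train`, `reg_test`, `reg_fit`, `reg_pred`, and `rmse` are all chosen for this example.

```r
library(rpart)
data("iris")

set.seed(7)  # arbitrary seed for a repeatable split
train_rows <- sample(seq_len(nrow(iris)), size = 120)
reg_train <- iris[train_rows, ]
reg_test  <- iris[-train_rows, ]

# method = "anova" fits a regression tree for a numeric response.
reg_fit <- rpart(Petal.Length ~ ., data = reg_train, method = "anova")

# predict() returns numeric petal lengths, so compare them with an error measure.
reg_pred <- predict(reg_fit, reg_test)
rmse <- sqrt(mean((reg_test$Petal.Length - reg_pred)^2))
print(rmse)
```

The RMSE is in centimeters, the same unit as petal length, so it can be read as a typical prediction error.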