CART PRACTICAL ASSIGNMENT

Loading data:

ss=c(1:40,51:90,101:140)  # first 40 rows of each species in iris
ss

##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  51  52  53  54  55  56  57  58  59  60  61  62  63  64
##  [55]  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82
##  [73]  83  84  85  86  87  88  89  90 101 102 103 104 105 106 107 108 109 110
##  [91] 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
## [109] 129 130 131 132 133 134 135 136 137 138 139 140

data1=iris[ss,]
data1

Checking Normality of variables:

library(mvShapiroTest)
mvShapiro.Test(as.matrix(data1[,1:4]))

## 
##  Generalized Shapiro-Wilk test for Multivariate Normality by
##  Villasenor-Alva and Gonzalez-Estrada
## 
## data:  as.matrix(data1[, 1:4])
## MVW = 0.97861, p-value = 0.00253

The p-value of the test is 0.00253 < 0.05. We therefore reject the null hypothesis at the 5% level of significance and conclude that the four measurements are not jointly Gaussian. Note that this test concerns the joint distribution only; it does not by itself show that each individual variable is non-Gaussian.
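A per-variable check can complement the multivariate test, since the latter speaks only to the joint distribution. The sketch below applies base R's `shapiro.test` to each of the four measurements. This univariate follow-up is added for illustration and was not part of the original analysis; the name `uni_p` is hypothetical.

```r
# Same subset as above: first 40 rows of each species
ss <- c(1:40, 51:90, 101:140)
data1 <- iris[ss, ]

# Univariate Shapiro-Wilk p-value for each numeric column
uni_p <- sapply(data1[, 1:4], function(x) shapiro.test(x)$p.value)
round(uni_p, 4)
```

Variables with a p-value below 0.05 would be flagged as non-Gaussian at the 5% level; the others would not be rejected by this test.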

K-means clustering:

First, we need to standardize the data.

data1s=scale(data1[,1:4])
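As a quick sanity check, `scale` with its default arguments centers each column to mean 0 and rescales it to standard deviation 1, so that no single measurement dominates the Euclidean distances used by k-means. A minimal sketch that verifies this on the same subset:

```r
# Rebuild the standardized matrix from the same 120 rows
ss <- c(1:40, 51:90, 101:140)
data1s <- scale(iris[ss, 1:4])

# Column means should be numerically zero and standard deviations one
col_means <- colMeans(data1s)
col_sds <- apply(data1s, 2, sd)
round(col_means, 10)
col_sds
```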

Next, we apply k-means clustering to the standardized data and choose the number of clusters from the scree plot of the total within-cluster sum of squares:

library(factoextra)
fviz_nbclust(data1s, kmeans, method = "wss")

From the graph above, we observe that the scree plot has an elbow at 2: the total within-cluster sum of squares drops sharply when going from 1 to 2 clusters and decreases much more slowly thereafter. Hence, on the basis of k-means clustering, we infer that the optimal number of clusters is 2.
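The quantity plotted in the scree plot can also be computed directly with `kmeans`, which makes the elbow easier to inspect numerically. The sketch below is an illustration only: the seed, `nstart = 25`, and the names `wss` and `km2` are assumptions, not choices made in the original analysis. It also cross-tabulates the 2-cluster solution against the species labels.

```r
set.seed(1)  # assumed seed, only for reproducible random starts
ss <- c(1:40, 51:90, 101:140)
data1 <- iris[ss, ]
data1s <- scale(data1[, 1:4])

# Total within-cluster sum of squares for k = 1, ..., 10
wss <- sapply(1:10, function(k) {
  kmeans(data1s, centers = k, nstart = 25)$tot.withinss
})
round(wss, 2)

# Fit the 2-cluster solution and compare clusters with species
km2 <- kmeans(data1s, centers = 2, nstart = 25)
table(cluster = km2$cluster, species = data1$Species)
```

For k = 1 the value equals (n - 1) times the number of standardized columns, here 119 * 4 = 476, which gives a baseline against which the later drops can be judged.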

CART Model:

library(rpart)
CART1 <- rpart(Species ~., data = data1)
par(xpd = NA) # otherwise on some devices the text is clipped
plot(CART1)
text(CART1, digits = 3)

print(CART1)

## n= 120 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 120 80 setosa (0.33333333 0.33333333 0.33333333)  
##   2) Petal.Length< 2.6 40  0 setosa (1.00000000 0.00000000 0.00000000) *
##   3) Petal.Length>=2.6 80 40 versicolor (0.00000000 0.50000000 0.50000000)  
##     6) Petal.Width< 1.75 44  5 versicolor (0.00000000 0.88636364 0.11363636) *
##     7) Petal.Width>=1.75 36  1 virginica (0.00000000 0.02777778 0.97222222) *
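Reading the printed tree: node 2 isolates all 40 setosa rows with Petal.Length < 2.6, and Petal.Width then separates versicolor from virginica at 1.75. The terminal-node losses (0 + 5 + 1) give 6 misclassified rows out of 120, a training error of 5%. Since `ss` leaves out rows 41 to 50, 91 to 100 and 141 to 150 of `iris`, one way to check the tree on unseen data is to treat those 30 rows as a test set. That use of the held-out rows is an assumption; the original does not say how they are meant to be used. The names `train`, `test` and `pred` are hypothetical.

```r
library(rpart)

ss <- c(1:40, 51:90, 101:140)
train <- iris[ss, ]    # the 120 rows used to fit the tree
test <- iris[-ss, ]    # the 30 remaining rows (assumed test set)

CART1 <- rpart(Species ~ ., data = train)

# Predicted class for each held-out row
pred <- predict(CART1, newdata = test, type = "class")

# Confusion matrix and overall accuracy on the held-out rows
table(predicted = pred, actual = test$Species)
mean(pred == test$Species)
```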

Regression Tree Model:

data2=data1[,-5]  # drop Species (column 5); keep the four numeric measurements
library(rpart)
RegTree1 <- rpart(Sepal.Length ~., data = data2)
par(xpd = NA) # otherwise on some devices the text is clipped
plot(RegTree1)
text(RegTree1, digits = 3)

print(RegTree1)

## n= 120 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 120 85.148000 5.890000  
##    2) Petal.Length< 4.15 55  9.533818 5.178182  
##      4) Petal.Length< 3.4 41  5.132195 5.034146  
##        8) Sepal.Width< 3.65 31  2.498710 4.906452 *
##        9) Sepal.Width>=3.65 10  0.561000 5.430000 *
##      5) Petal.Length>=3.4 14  1.060000 5.600000 *
##    3) Petal.Length>=4.15 65 24.166150 6.492308  
##      6) Petal.Length< 5.65 48  8.266667 6.233333  
##       12) Sepal.Width< 3.05 35  5.701714 6.137143 *
##       13) Sepal.Width>=3.05 13  1.369231 6.492308 *
##      7) Petal.Length>=5.65 17  3.590588 7.223529 *
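The deviances in the printout give a rough measure of fit: the six terminal nodes sum to about 14.78, against 85.15 at the root, so the tree accounts for roughly 83% of the training variation in Sepal.Length. As a hedged sketch of an out-of-sample check, the 30 rows of `iris` not selected by `ss` can be used as a test set. That use is an assumption rather than part of the original analysis, and `train`, `test`, `pred` and `rmse` are hypothetical names.

```r
library(rpart)

ss <- c(1:40, 51:90, 101:140)
train <- iris[ss, -5]    # numeric columns only, as in data2
test <- iris[-ss, -5]    # held-out rows (assumed test set)

RegTree1 <- rpart(Sepal.Length ~ ., data = train)

# Predicted Sepal.Length for the held-out rows
pred <- predict(RegTree1, newdata = test)

# Root mean squared prediction error on the held-out rows
rmse <- sqrt(mean((test$Sepal.Length - pred)^2))
rmse

# Training R-squared from the fitted values, for comparison with the printout
1 - sum((train$Sepal.Length - predict(RegTree1))^2) /
  sum((train$Sepal.Length - mean(train$Sepal.Length))^2)
```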

Prokarso Chattopadhyay