Loading data (the first 40 observations of each species in iris, 120 rows in total):
ss=c(1:40,51:90,101:140)
ss
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## [55] 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82
## [73] 83 84 85 86 87 88 89 90 101 102 103 104 105 106 107 108 109 110
## [91] 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
## [109] 129 130 131 132 133 134 135 136 137 138 139 140
data1=iris[ss,]
data1
Checking multivariate normality of the four measurements:
library(mvShapiroTest)
mvShapiro.Test(as.matrix(data1[,1:4]))
##
## Generalized Shapiro-Wilk test for Multivariate Normality by
## Villasenor-Alva and Gonzalez-Estrada
##
## data: as.matrix(data1[, 1:4])
## MVW = 0.97861, p-value = 0.00253
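The p-value of the test is 0.00253 < 0.05, so we reject the null hypothesis of multivariate normality at the 5% level of significance and conclude that the four measurements are not jointly Gaussian. Note that this does not by itself imply that each individual variable is non-Gaussian; the marginal distributions would have to be tested separately.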
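The test above concerns the joint distribution of the four variables. To look at each variable on its own, one option is a univariate Shapiro-Wilk test per column. The following is a minimal sketch, assuming base R's shapiro.test is an acceptable univariate check here; the name marginal_p is introduced only for illustration.

```r
# Rebuild the 120-row subset used in this analysis.
ss <- c(1:40, 51:90, 101:140)
data1 <- iris[ss, ]

# Univariate Shapiro-Wilk p-value for each of the four measurements
# (marginal_p is an illustrative name, not part of the original analysis).
marginal_p <- sapply(data1[, 1:4], function(x) shapiro.test(x)$p.value)
round(marginal_p, 4)
```

A variable whose p-value exceeds 0.05 would not be rejected as Gaussian at the 5% level, even though joint normality is rejected.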
First, we need to standardize the data.
data1s=scale(data1[,1:4])
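As a quick sanity check on the standardization, each column of the scaled matrix should have mean 0 and standard deviation 1. A small sketch, with col_means and col_sds as illustrative names:

```r
# Standardize the four measurements of the 120-row subset.
ss <- c(1:40, 51:90, 101:140)
data1s <- scale(iris[ss, 1:4])

# Column means should be (numerically) zero and standard deviations one.
col_means <- colMeans(data1s)
col_sds <- apply(data1s, 2, sd)
round(col_means, 10)
round(col_sds, 10)
```

Standardizing matters for k-means because the algorithm uses Euclidean distance, so variables measured on larger scales would otherwise dominate the clustering.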
Next, we apply k-means clustering to the standardized data and choose the number of clusters by inspecting the elbow (scree) plot of the total within-cluster sum of squares:
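library(factoextra)
set.seed(123) # k-means starts from random centres; fixing a seed makes the plot reproducible
fviz_nbclust(data1s,kmeans,method="wss")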
From the graph above, we observe that the plot has an elbow at k = 2: moving from 1 to 2 clusters removes a large share of the total within-cluster sum of squares, while adding further clusters reduces it comparatively little. Hence, on the basis of k-means clustering, we take the optimal number of clusters to be 2.
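The quantity shown in the elbow plot can also be computed directly, which makes the size of each drop explicit. This is a rough sketch using only base R's kmeans; the seed, nstart = 25 and the range 1 to 10 are assumptions made for illustration, not settings taken from the analysis above.

```r
# Standardized data, rebuilt so that the example is self-contained.
ss <- c(1:40, 51:90, 101:140)
data1s <- scale(iris[ss, 1:4])

set.seed(123)  # assumed seed, only for reproducibility
# Total within-cluster sum of squares for k = 1, ..., 10.
wss <- sapply(1:10, function(k) {
  kmeans(data1s, centers = k, nstart = 25)$tot.withinss
})

# Share of the k = 1 sum of squares that remains at each k.
round(wss / wss[1], 3)
```

For k = 1 the value equals (n - 1) times the number of variables, i.e. 119 * 4 = 476, because every standardized column has unit variance; the ratios then show how much of that total is left as clusters are added.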
Fitting a classification tree (CART) for Species:
library(rpart)
CART1 <- rpart(Species ~., data = data1)
par(xpd = NA) # otherwise on some devices the text is clipped
plot(CART1)
text(CART1, digits = 3)
print(CART1)
## n= 120
##
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
##
## 1) root 120 80 setosa (0.33333333 0.33333333 0.33333333)
##   2) Petal.Length< 2.6 40 0 setosa (1.00000000 0.00000000 0.00000000) *
##   3) Petal.Length>=2.6 80 40 versicolor (0.00000000 0.50000000 0.50000000)
##     6) Petal.Width< 1.75 44 5 versicolor (0.00000000 0.88636364 0.11363636) *
##     7) Petal.Width>=1.75 36 1 virginica (0.00000000 0.02777778 0.97222222) *
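The terminal nodes in the printed tree have losses of 0, 5 and 1, i.e. 6 of the 120 training observations are misclassified. Training error tends to be optimistic, so one possible check is to predict the 30 iris rows that were left out of the subset. The sketch below assumes those left-out rows are a reasonable hold-out set, which the analysis above does not state; train_set, test_set, conf_mat and accuracy are illustrative names.

```r
library(rpart)

# Rows used for fitting, and the 30 remaining iris rows as an assumed hold-out set.
ss <- c(1:40, 51:90, 101:140)
train_set <- iris[ss, ]
test_set <- iris[-ss, ]

# Same model formula as above, fitted on the training rows.
fit_class <- rpart(Species ~ ., data = train_set)

# Predicted classes for the hold-out rows and the resulting confusion matrix.
pred <- predict(fit_class, newdata = test_set, type = "class")
conf_mat <- table(predicted = pred, actual = test_set$Species)
accuracy <- mean(pred == test_set$Species)

conf_mat
accuracy
```

With only 10 hold-out observations per species the accuracy estimate is coarse, so it is best read as a rough indication rather than a precise error rate.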
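Fitting a regression tree for Sepal.Length on the remaining numeric variables:
data2=data1[,-5] # drop the Species column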
library(rpart)
RegTree1 <- rpart(Sepal.Length ~., data = data2)
par(xpd = NA) # otherwise on some devices the text is clipped
plot(RegTree1)
text(RegTree1, digits = 3)
print(RegTree1)
## n= 120
##
## node), split, n, deviance, yval
##       * denotes terminal node
##
##  1) root 120 85.148000 5.890000
##    2) Petal.Length< 4.15 55 9.533818 5.178182
##      4) Petal.Length< 3.4 41 5.132195 5.034146
##        8) Sepal.Width< 3.65 31 2.498710 4.906452 *
##        9) Sepal.Width>=3.65 10 0.561000 5.430000 *
##      5) Petal.Length>=3.4 14 1.060000 5.600000 *
##    3) Petal.Length>=4.15 65 24.166150 6.492308
##      6) Petal.Length< 5.65 48 8.266667 6.233333
##       12) Sepal.Width< 3.05 35 5.701714 6.137143 *
##       13) Sepal.Width>=3.05 13 1.369231 6.492308 *
##      7) Petal.Length>=5.65 17 3.590588 7.223529 *
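The printed deviances give a rough measure of fit: the six terminal nodes sum to about 14.78, against a root deviance of 85.148, so the tree accounts for roughly 83% of the variation in Sepal.Length on the training rows. The sketch below computes the same quantity from the fitted values and adds a hold-out RMSE, assuming (as an illustration only) that the 30 iris rows outside the subset serve as test data; all object names are hypothetical.

```r
library(rpart)

# Training rows without Species, and the remaining rows as an assumed hold-out set.
ss <- c(1:40, 51:90, 101:140)
train_set <- iris[ss, -5]
test_set <- iris[-ss, -5]

# Same regression tree as above.
fit_reg <- rpart(Sepal.Length ~ ., data = train_set)

# Training R-squared: 1 - (residual sum of squares / total sum of squares).
sse <- sum((train_set$Sepal.Length - predict(fit_reg))^2)
sst <- sum((train_set$Sepal.Length - mean(train_set$Sepal.Length))^2)
r_squared <- 1 - sse / sst

# Root mean squared error on the hold-out rows.
rmse_test <- sqrt(mean((test_set$Sepal.Length - predict(fit_reg, newdata = test_set))^2))

r_squared
rmse_test
```

Here sst is the root deviance reported by print(RegTree1), and sse is the sum of the terminal-node deviances.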