mydata <- read.csv("D:/MARV BS MATH/4th year, 2nd sem/Multivariate/Discriminant data.txt", header = TRUE)
str(mydata)
'data.frame': 22 obs. of 9 variables:
$ Company : chr "Arizona " "Boston " "Central " "Commonwealth" ...
$ Fixed_charge: num 1.06 0.89 1.43 1.02 1.49 1.32 1.22 1.1 1.34 1.12 ...
$ RoR : num 9.2 10.3 15.4 11.2 8.8 13.5 12.2 9.2 13 12.4 ...
$ Cost : int 151 202 113 168 192 111 175 245 168 197 ...
$ Load : num 54.4 57.9 53 56 51.2 60 67.6 57 60.4 53 ...
$ D.Demand : num 1.6 2.2 3.4 0.3 1 -2.2 2.2 3.3 7.2 2.7 ...
$ Sales : int 9077 5088 9212 6423 3300 11127 7642 13082 8406 6455 ...
$ Nuclear : num 0 25.3 0 34.3 15.6 22.5 0 0 0 39.2 ...
$ Fuel_Cost : num 0.628 1.555 1.058 0.7 2.044 ...
library(rmarkdown)   # provides paged_table()
a <- head(mydata)
paged_table(a)
pairs(mydata[2:9])
A pairs plot provides a scatter plot for every possible pair of variables.
If you want to look at one of the scatterplots separately, you can use:
plot(Fuel_Cost ~ Sales, data = mydata)                             # Fuel_Cost against Sales
with(mydata, text(Fuel_Cost ~ Sales, labels = Company, pos = 4))   # label each point with the company name
Standardization is essential for cluster analysis: distance-based methods are sensitive to the scales of the variables, so variables with large ranges would otherwise dominate the distances.
z <- mydata[, -1]   # drop the Company column
paged_table(z)
means <- apply(z, 2, mean)   # column means
means
Fixed_charge RoR Cost Load D.Demand Sales
1.114091 10.736364 168.181818 56.977273 3.240909 8914.045455
Nuclear Fuel_Cost
12.000000 1.102727
sds <- apply(z, 2, sd)   # column standard deviations
sds
Fixed_charge RoR Cost Load D.Demand Sales
0.1845112 2.2440494 41.1913495 4.4611478 3.1182503 3549.9840305
Nuclear Fuel_Cost
16.7919198 0.5560981
nor <- scale(z, center = means, scale = sds)   # standardize each variable
nor
Fixed_charge RoR Cost Load D.Demand Sales
[1,] -0.29315791 -0.68463896 -0.417122002 -0.57771516 -0.52622751 0.04590290
[2,] -1.21451134 -0.19445367 0.821002037 0.20683629 -0.33381191 -1.07776413
[3,] 1.71214073 2.07822360 -1.339645796 -0.89153574 0.05101929 0.08393124
[4,] -0.50994695 0.20660702 -0.004413989 -0.21906307 -0.94312798 -0.70170610
[5,] 2.03732429 -0.86288816 0.578232617 -1.29501935 -0.71864311 -1.58142837
[6,] 1.11597086 1.23153991 -1.388199680 0.67756716 -1.74485965 0.62337028
[7,] 0.57399826 0.65223002 0.165524604 2.38116460 -0.33381191 -0.35832428
[8,] -0.07636887 -0.68463896 1.864910540 0.00509449 0.01895002 1.17407698
[9,] 1.22436538 1.00872841 -0.004413989 0.76723019 1.26965142 -0.14311204
[10,] 0.03202565 0.74135462 0.699617327 -0.89153574 -0.17346558 -0.69269198
[11,] -1.97327298 -1.44219805 0.116970720 -1.22777208 1.04516655 2.40196983
[12,] 0.08622291 0.07292013 0.238355430 1.12588228 0.14722709 -0.77748109
[13,] 0.19461744 0.87504152 0.748171211 -0.73462545 1.01309729 -0.48874740
[14,] -0.13056613 0.56310542 -1.752353809 -1.60883993 -0.59036605 0.21379097
[15,] -0.83513051 -1.39763576 -0.101521757 1.17071379 -1.07140505 -0.68902999
[16,] 0.24881470 -0.37270287 2.034849134 -0.21906307 1.91103676 1.99351729
[17,] -1.91907572 -1.93238335 -0.781276132 1.10346652 1.84689822 -0.90142531
[18,] -0.34735517 0.83047922 -0.441398944 -0.06215278 -0.17346558 0.34534086
[19,] 0.24881470 0.42941852 -1.558138274 -0.66737818 -1.71279038 1.29379583
[20,] 0.46560374 0.47398082 -0.489952828 0.65515141 0.08308855 -0.45832473
[21,] -0.40155243 -0.95201276 0.869555920 0.90172472 0.08308855 -0.63776215
[22,] -0.23896065 -0.64007666 0.141247662 -0.60013092 0.85275095 0.33210137
Nuclear Fuel_Cost
[1,] -0.7146294 -0.85367545
[2,] 0.7920476 0.81329670
[3,] -0.7146294 -0.08043055
[4,] 1.3280197 -0.72420189
[5,] 0.2143888 1.69263800
[6,] 0.6253007 0.24864810
[7,] -0.7146294 0.98772637
[8,] -0.7146294 -1.42731528
[9,] -0.7146294 -0.43288637
[10,] 1.6198267 -0.86266667
[11,] -0.7146294 -0.60192130
[12,] -0.7146294 1.42829614
[13,] 2.2749037 -1.03529809
[14,] -0.7146294 -0.92560521
[15,] -0.6610322 0.53456889
[16,] -0.7146294 -0.86806140
[17,] -0.2203441 1.46965575
[18,] -0.7146294 0.00948165
[19,] -0.7146294 -0.83928950
[20,] 1.7329764 -0.72060540
[21,] -0.7146294 1.82211157
[22,] 0.8694658 0.36553395
attr(,"scaled:center")
Fixed_charge RoR Cost Load D.Demand Sales
1.114091 10.736364 168.181818 56.977273 3.240909 8914.045455
Nuclear Fuel_Cost
12.000000 1.102727
attr(,"scaled:scale")
Fixed_charge RoR Cost Load D.Demand Sales
0.1845112 2.2440494 41.1913495 4.4611478 3.1182503 3549.9840305
Nuclear Fuel_Cost
16.7919198 0.5560981
distance <- dist(nor)              # Euclidean distances on the standardized data
mydata.hclust <- hclust(distance)  # complete linkage by default
plot(mydata.hclust)
plot(mydata.hclust, labels = mydata$Company, main = 'Default from hclust')
plot(mydata.hclust, hang = -1, labels = mydata$Company, main = 'Default from hclust')   # hang = -1 aligns the labels at the baseline
mydata.hclust <- hclust(distance, method = "average")   # re-cluster with average linkage
plot(mydata.hclust, hang = -1)
member <- cutree(mydata.hclust, 3)   # cut the tree into 3 clusters
table(member)
member
1 2 3
18 1 3
b <- aggregate(nor, list(member), mean)   # cluster means on the standardized scale
paged_table(b)
c <- aggregate(mydata[, -1], list(member), mean)   # cluster means on the original scale
paged_table(c)
library(cluster)
plot(silhouette(cutree(mydata.hclust, 3), distance))   # silhouette plot for the 3-cluster solution
Based on the above plot, if any bar extends to the negative side, the corresponding observation is poorly matched to its cluster; such an observation can be treated as an outlier and removed from the analysis.
A scree plot lets us see the variability within clusters: as we increase the number of clusters, the within-group sum of squares comes down.
wss <- (nrow(nor) - 1) * sum(apply(nor, 2, var))                   # total WSS for k = 1
for (i in 2:20) wss[i] <- sum(kmeans(nor, centers = i)$withinss)   # WSS for k = 2, ..., 20
plot(1:20, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
So for this data, the ideal number of clusters appears to be 3, 4, or 5.
set.seed(123)
kc <- kmeans(nor, 3)   # k-means with k = 3
kc
K-means clustering with 3 clusters of sizes 7, 5, 10
Cluster means:
Fixed_charge RoR Cost Load D.Demand Sales
1 -0.23896065 -0.65917479 0.2556961 0.7992527 -0.05435116 -0.8604593
2 0.51980100 1.02655333 -1.2959473 -0.5104679 -0.83409247 0.5120458
3 -0.09262805 -0.05185431 0.4689864 -0.3042429 0.45509205 0.3462986
Nuclear Fuel_Cost
1 -0.2884040 1.2497562
2 -0.4466434 -0.3174391
3 0.4252045 -0.7161098
Clustering vector:
[1] 3 1 2 3 1 2 1 3 3 3 3 1 3 2 1 3 1 2 2 3 1 3
Within cluster sum of squares by cluster:
[1] 34.16481 15.15613 57.53424
(between_SS / total_SS = 36.4 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
ot <- nor
datadistshortset <- dist(ot, method = "euclidean")
hc1 <- hclust(datadistshortset, method = "complete")
pamvshortset <- pam(datadistshortset, 4, diss = TRUE, keep.diss = TRUE)   # diss = TRUE because the input is a dist object
clusplot(pamvshortset, shade = FALSE, labels = 2, col.clus = "blue",
         col.p = "red", span = FALSE, main = "Cluster Mapping", cex = 1.2)
k2 <- kmeans(nor, centers = 3, nstart = 25)
We can execute k-means in R with the kmeans function.
The kmeans function also has an nstart option that tries multiple initial configurations and reports the best one.
For example, nstart = 25 generates 25 initial configurations and keeps the best result; this approach is often recommended, as the sketch below illustrates.
str(k2)
List of 9
$ cluster : int [1:22] 3 2 3 3 2 3 2 1 3 3 ...
$ centers : num [1:3, 1:8] -0.6 -0.239 0.289 -0.833 -0.659 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:3] "1" "2" "3"
.. ..$ : chr [1:8] "Fixed_charge" "RoR" "Cost" "Load" ...
$ totss : num 168
$ withinss : num [1:3] 9.53 34.16 58.01
$ tot.withinss: num 102
$ betweenss : num 66.3
$ size : int [1:3] 3 7 12
$ iter : int 2
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
We can also visualize the kmeans results with fviz_cluster (from the factoextra package), which provides a nice illustration of the clusters.
library(factoextra)   # provides fviz_cluster, fviz_nbclust, and fviz_gap_stat
fviz_cluster(k2, data = nor)
If there are more than two dimensions (variables), fviz_cluster performs principal component analysis (PCA) and plots the data points according to the first two principal components, which explain the majority of the variance.
We can find the optimal number of clusters in R with the following code. The results suggest that 4 is the optimal number of clusters, as that is where the bend (the "knee") appears in the curve.
This is the same elbow method we coded by hand above.
fviz_nbclust(nor, kmeans, method = "wss")
The average silhouette approach measures the quality of a clustering: it determines how well each observation lies within its cluster.
A high average silhouette width indicates a good clustering. The average silhouette method computes the average silhouette of the observations for different values of k.
We can run it with the code below:
fviz_nbclust(nor, kmeans, method = "silhouette")
This approach can be used with any type of clustering method (e.g., k-means clustering, hierarchical clustering).
The gap statistic compares the total intra-cluster variation for different values of k with its expected value under a null reference distribution of the data.
We can run it with the code below:
gap_stat <- clusGap(nor, FUN = kmeans, nstart = 25,
                    K.max = 10, B = 50)
fviz_gap_stat(gap_stat)
With this method as well, the optimal number of clusters is 5.
K-means clustering is a very simple and fast algorithm that can efficiently deal with very large data sets.
However, k-means requires the number of clusters as an input, whereas hierarchical clustering is an alternative approach that does not require committing to a particular number of clusters in advance.
Hierarchical clustering has the added advantage of producing an attractive tree-based representation of the observations (the dendrogram).