Data
## # A tibble: 6 x 9
## Company Fixed_charge RoR Cost Load `D Demand` Sales Nuclear Fuel_Cost
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arizona 1.06 9.2 151 54.4 1.6 9077 0 0.628
## 2 Boston 0.89 10.3 202 57.9 2.2 5088 25.3 1.56
## 3 Central 1.43 15.4 113 53 3.4 9212 0 1.06
## 4 Commonwealth 1.02 11.2 168 56 0.3 6423 34.3 0.7
## 5 Con Ed NY 1.49 8.8 192 51.2 1 3300 15.6 2.04
## 6 Florida 1.32 13.5 111 60 -2.2 11127 22.5 1.24
## tibble [22 x 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Company : chr [1:22] "Arizona" "Boston" "Central" "Commonwealth" ...
## $ Fixed_charge: num [1:22] 1.06 0.89 1.43 1.02 1.49 1.32 1.22 1.1 1.34 1.12 ...
## $ RoR : num [1:22] 9.2 10.3 15.4 11.2 8.8 13.5 12.2 9.2 13 12.4 ...
## $ Cost : num [1:22] 151 202 113 168 192 111 175 245 168 197 ...
## $ Load : num [1:22] 54.4 57.9 53 56 51.2 60 67.6 57 60.4 53 ...
## $ D Demand : num [1:22] 1.6 2.2 3.4 0.3 1 -2.2 2.2 3.3 7.2 2.7 ...
## $ Sales : num [1:22] 9077 5088 9212 6423 3300 ...
## $ Nuclear : num [1:22] 0 25.3 0 34.3 15.6 22.5 0 0 0 39.2 ...
## $ Fuel_Cost : num [1:22] 0.628 1.555 1.058 0.7 2.044 ...
## - attr(*, "spec")=
## .. cols(
## .. Company = col_character(),
## .. Fixed_charge = col_double(),
## .. RoR = col_double(),
## .. Cost = col_double(),
## .. Load = col_double(),
## .. `D Demand` = col_double(),
## .. Sales = col_double(),
## .. Nuclear = col_double(),
## .. Fuel_Cost = col_double()
## .. )
From the second column, we have quantitative data. 22 observations with 9 variables.
Understand about data
Let’s see some plots and understand about data.
Scatter Plot
plot(Fuel_Cost ~ Sales, data = utilities)
# Add labels
with(utilities, text(Fuel_Cost ~ Sales, labels = Company, pos=4, cex=0.5))- Much more clear view of the data.
We can see,
- Four companies towards more sales, but less fuel cost.
- Another group of companies have medium sales, and medium to lower fuel cost.
- Another group of companies having low sales, but higher fuel cost.
- Also we can just see \(3\) clusters.
Normalization
- First we select quantitative data and apply Z normalization.
## # A tibble: 6 x 8
## Fixed_charge RoR Cost Load `D Demand` Sales Nuclear Fuel_Cost
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.06 9.2 151 54.4 1.6 9077 0 0.628
## 2 0.89 10.3 202 57.9 2.2 5088 25.3 1.56
## 3 1.43 15.4 113 53 3.4 9212 0 1.06
## 4 1.02 11.2 168 56 0.3 6423 34.3 0.7
## 5 1.49 8.8 192 51.2 1 3300 15.6 2.04
## 6 1.32 13.5 111 60 -2.2 11127 22.5 1.24
- We can clearly see variables like Fuel_Cost which are low in values, but variables like Sales which are very high in values.
# means
m <- apply(X = z, MARGIN = 2, FUN = mean)
# std.dev
sd <- apply(X = z, MARGIN = 2, FUN = sd)
# normalized data set
z <- scale(z,m,sd)
head(z)## Fixed_charge RoR Cost Load D Demand Sales
## [1,] -0.2931579 -0.6846390 -0.417122002 -0.5777152 -0.52622751 0.04590290
## [2,] -1.2145113 -0.1944537 0.821002037 0.2068363 -0.33381191 -1.07776413
## [3,] 1.7121407 2.0782236 -1.339645796 -0.8915357 0.05101929 0.08393124
## [4,] -0.5099470 0.2066070 -0.004413989 -0.2190631 -0.94312798 -0.70170610
## [5,] 2.0373243 -0.8628882 0.578232617 -1.2950193 -0.71864311 -1.58142837
## [6,] 1.1159709 1.2315399 -1.388199680 0.6775672 -1.74485965 0.62337028
## Nuclear Fuel_Cost
## [1,] -0.7146294 -0.85367545
## [2,] 0.7920476 0.81329670
## [3,] -0.7146294 -0.08043055
## [4,] 1.3280197 -0.72420189
## [5,] 0.2143888 1.69263800
## [6,] 0.6253007 0.24864810
Distance
Euclidean distance
distance <- dist(z)
# distance
# We can get distance matrix
print(distance, digits = 3) # remove some decimal numbers## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 2 3.10
## 3 3.68 4.92
## 4 2.46 2.16 4.11
## 5 4.12 3.85 4.47 4.13
## 6 3.61 4.22 2.99 3.20 4.60
## 7 3.90 3.45 4.22 3.97 4.60 3.35
## 8 2.74 3.89 4.99 3.69 5.16 4.91 4.36
## 9 3.25 3.96 2.75 3.75 4.49 3.73 2.80 3.59
## 10 3.10 2.71 3.93 1.49 4.05 3.83 4.51 3.67 3.57
## 11 3.49 4.79 5.90 4.86 6.46 6.00 6.00 3.46 5.18 5.08
## 12 3.22 2.43 4.03 3.50 3.60 3.74 1.66 4.06 2.74 3.94 5.21
## 13 3.96 3.43 4.39 2.58 4.76 4.55 5.01 4.14 3.66 1.41 5.31 4.50
## 14 2.11 4.32 2.74 3.23 4.82 3.47 4.91 4.34 3.82 3.61 4.32 4.34 4.39
## 15 2.59 2.50 5.16 3.19 4.26 4.07 2.93 3.85 4.11 4.26 4.74 2.33 5.10 4.24
## 16 4.03 4.84 5.26 4.97 5.82 5.84 5.04 2.20 3.63 4.53 3.43 4.62 4.41 5.17 5.18
## 17 4.40 3.62 6.36 4.89 5.63 6.10 4.58 5.43 4.90 5.48 4.75 3.50 5.61 5.56 3.40
## 18 1.88 2.90 2.72 2.65 4.34 2.85 2.95 3.24 2.43 3.07 3.95 2.45 3.78 2.30 3.00
## 19 2.41 4.63 3.18 3.46 5.13 2.58 4.52 4.11 4.11 4.13 4.52 4.41 5.01 1.88 4.03
## 20 3.17 3.00 3.73 1.82 4.39 2.91 3.54 4.09 2.95 2.05 5.35 3.43 2.23 3.74 3.78
## 21 3.45 2.32 5.09 3.88 3.64 4.63 2.68 3.98 3.74 4.36 4.88 1.38 4.94 4.93 2.10
## 22 2.51 2.42 4.11 2.58 3.77 4.03 4.00 3.24 3.21 2.56 3.44 3.00 2.74 3.51 3.35
## 16 17 18 19 20 21
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## 11
## 12
## 13
## 14
## 15
## 16
## 17 5.56
## 18 3.97 4.43
## 19 5.23 6.09 2.47
## 20 4.82 4.87 2.92 3.90
## 21 4.57 3.10 3.19 4.97 4.15
## 22 3.46 3.63 2.55 3.97 2.62 3.01
- Higher values shows high dissimilarity between observations.
- Lower values shows low dissimilarity between observations.
- Both in terms of euclidean distance.
Hierarchical Clustering
Single Linkage, Complete Linkage, Average Linkage Clustering
Cluster Memberships
member.sin.link <- cutree(hc.sin.link, 3) # 3 clusters
member.com.link <- cutree(hc.com.link, 3) # 3 clusters
member.avg.link <- cutree(hc.avg.link, 3) # 3 clusters
# Understand mismatching
table(member.sin.link, member.com.link, member.avg.link)## , , member.avg.link = 1
##
## member.com.link
## member.sin.link 1 2 3
## 1 13 5 0
## 2 0 0 0
## 3 0 0 0
##
## , , member.avg.link = 2
##
## member.com.link
## member.sin.link 1 2 3
## 1 0 0 0
## 2 1 0 0
## 3 0 0 0
##
## , , member.avg.link = 3
##
## member.com.link
## member.sin.link 1 2 3
## 1 0 0 2
## 2 0 0 0
## 3 0 0 1
Cluster Means
## Group.1 Fixed_charge RoR Cost Load D Demand Sales
## 1 1 0.3068832 0.4326015 -0.31481203 -0.3743722 -0.2605107 -0.1575387
## 2 2 -0.4991075 -0.7113763 0.07812761 1.3365904 0.1343994 -0.6728046
## 3 3 -0.6002757 -0.8331800 1.33891013 -0.4805802 0.9917178 1.8565214
## Nuclear Fuel_Cost
## 1 0.3692252 -0.2389329
## 2 -0.6050529 1.2484717
## 3 -0.7146294 -0.9657660
## Group.1 Fixed_charge RoR Cost Load D Demand Sales Nuclear
## 1 1 1.170714 11.707143 155.2143 55.30714 2.428571 8354.786 18.20
## 2 2 1.022000 9.140000 171.4000 62.94000 3.660000 6525.600 1.84
## 3 3 1.003333 8.866667 223.3333 54.83333 6.333333 15504.667 0.00
## Fuel_Cost
## 1 0.9698571
## 2 1.7970000
## 3 0.5656667
- Consider sales variable, companies that belongs to cluster 3 they have higher sales.
- Also companies that belongs to cluster 2, they have higher fuel costs.
- Like this we can characterize the variables.
- Also using original units, interpretation can make easier.
Silhouette Plots
- Here if cluster formation has been good, or the members in clusters are cloaser to each other \(s_i\) values will be high.
- \(s_i\) is negative means sort of ourlier.
Scree Plot
wss <- (nrow(z)-1)*(sum(apply(z, 2, var)))
for (i in 2:20) wss[i] <- sum(kmeans(z, centers = i)$withinss)
plot(1:20, wss, type = "b",
xlab = "Number of Clusters", ylab = "Within Group SS")- We can see here within group variability drops from one cluster to another cluster.
- From the plot from the 4th cluster, and beyond that the clusters are not very significant.
K-Means Clustering (Non-Hierarchical)
## K-means clustering with 3 clusters of sizes 10, 7, 5
##
## Cluster means:
## Fixed_charge RoR Cost Load D Demand Sales
## 1 -0.09262805 -0.05185431 0.4689864 -0.3042429 0.45509205 0.3462986
## 2 -0.23896065 -0.65917479 0.2556961 0.7992527 -0.05435116 -0.8604593
## 3 0.51980100 1.02655333 -1.2959473 -0.5104679 -0.83409247 0.5120458
## Nuclear Fuel_Cost
## 1 0.4252045 -0.7161098
## 2 -0.2884040 1.2497562
## 3 -0.4466434 -0.3174391
##
## Clustering vector:
## [1] 1 2 3 1 2 3 2 1 1 1 1 2 1 3 2 1 2 3 3 1 2 1
##
## Within cluster sum of squares by cluster:
## [1] 57.53424 34.16481 15.15613
## (between_SS / total_SS = 36.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
## [1] 1 2 3 1 2 3 2 1 1 1 1 2 1 3 2 1 2 3 3 1 2 1
## Fixed_charge RoR Cost Load D Demand Sales
## 1 -0.09262805 -0.05185431 0.4689864 -0.3042429 0.45509205 0.3462986
## 2 -0.23896065 -0.65917479 0.2556961 0.7992527 -0.05435116 -0.8604593
## 3 0.51980100 1.02655333 -1.2959473 -0.5104679 -0.83409247 0.5120458
## Nuclear Fuel_Cost
## 1 0.4252045 -0.7161098
## 2 -0.2884040 1.2497562
## 3 -0.4466434 -0.3174391
Plots
- Clusters are good when the Between cluster distance is high, and the Within cluster distance is low.