Data

library(readr)
utilities <- read_csv("Data/utilities.csv")
head(utilities)
## # A tibble: 6 x 9
##   Company      Fixed_charge   RoR  Cost  Load `D Demand` Sales Nuclear Fuel_Cost
##   <chr>               <dbl> <dbl> <dbl> <dbl>      <dbl> <dbl>   <dbl>     <dbl>
## 1 Arizona              1.06   9.2   151  54.4        1.6  9077     0       0.628
## 2 Boston               0.89  10.3   202  57.9        2.2  5088    25.3     1.56 
## 3 Central              1.43  15.4   113  53          3.4  9212     0       1.06 
## 4 Commonwealth         1.02  11.2   168  56          0.3  6423    34.3     0.7  
## 5 Con Ed NY            1.49   8.8   192  51.2        1    3300    15.6     2.04 
## 6 Florida              1.32  13.5   111  60         -2.2 11127    22.5     1.24
str(utilities)
## tibble [22 x 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Company     : chr [1:22] "Arizona" "Boston" "Central" "Commonwealth" ...
##  $ Fixed_charge: num [1:22] 1.06 0.89 1.43 1.02 1.49 1.32 1.22 1.1 1.34 1.12 ...
##  $ RoR         : num [1:22] 9.2 10.3 15.4 11.2 8.8 13.5 12.2 9.2 13 12.4 ...
##  $ Cost        : num [1:22] 151 202 113 168 192 111 175 245 168 197 ...
##  $ Load        : num [1:22] 54.4 57.9 53 56 51.2 60 67.6 57 60.4 53 ...
##  $ D Demand    : num [1:22] 1.6 2.2 3.4 0.3 1 -2.2 2.2 3.3 7.2 2.7 ...
##  $ Sales       : num [1:22] 9077 5088 9212 6423 3300 ...
##  $ Nuclear     : num [1:22] 0 25.3 0 34.3 15.6 22.5 0 0 0 39.2 ...
##  $ Fuel_Cost   : num [1:22] 0.628 1.555 1.058 0.7 2.044 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Company = col_character(),
##   ..   Fixed_charge = col_double(),
##   ..   RoR = col_double(),
##   ..   Cost = col_double(),
##   ..   Load = col_double(),
##   ..   `D Demand` = col_double(),
##   ..   Sales = col_double(),
##   ..   Nuclear = col_double(),
##   ..   Fuel_Cost = col_double()
##   .. )
attach(utilities)

From the second column, we have quantitative data. 22 observations with 9 variables.

Understand about data

Let’s see some plots and understand about data.

Scatter Plot

plot(Fuel_Cost ~ Sales, data = utilities)

# Add labels 
with(utilities, text(Fuel_Cost ~ Sales, labels = Company, pos=4, cex=0.5))

  • Much more clear view of the data.

We can see,

  • Four companies towards more sales, but less fuel cost.
  • Another group of companies have medium sales, and medium to lower fuel cost.
  • Another group of companies having low sales, but higher fuel cost.
  • Also we can just see \(3\) clusters.

Normalization

  • First we select quantitative data and apply Z normalization.
# remove 1st column, i.e. Companies
z <- utilities[, -c(1,1)]
head(z)
## # A tibble: 6 x 8
##   Fixed_charge   RoR  Cost  Load `D Demand` Sales Nuclear Fuel_Cost
##          <dbl> <dbl> <dbl> <dbl>      <dbl> <dbl>   <dbl>     <dbl>
## 1         1.06   9.2   151  54.4        1.6  9077     0       0.628
## 2         0.89  10.3   202  57.9        2.2  5088    25.3     1.56 
## 3         1.43  15.4   113  53          3.4  9212     0       1.06 
## 4         1.02  11.2   168  56          0.3  6423    34.3     0.7  
## 5         1.49   8.8   192  51.2        1    3300    15.6     2.04 
## 6         1.32  13.5   111  60         -2.2 11127    22.5     1.24
  • We can clearly see variables like Fuel_Cost which are low in values, but variables like Sales which are very high in values.
# means
m <- apply(X = z, MARGIN = 2, FUN = mean)
# std.dev
sd <- apply(X = z, MARGIN = 2, FUN = sd)
# normalized data set
z <- scale(z,m,sd)
head(z)
##      Fixed_charge        RoR         Cost       Load    D Demand       Sales
## [1,]   -0.2931579 -0.6846390 -0.417122002 -0.5777152 -0.52622751  0.04590290
## [2,]   -1.2145113 -0.1944537  0.821002037  0.2068363 -0.33381191 -1.07776413
## [3,]    1.7121407  2.0782236 -1.339645796 -0.8915357  0.05101929  0.08393124
## [4,]   -0.5099470  0.2066070 -0.004413989 -0.2190631 -0.94312798 -0.70170610
## [5,]    2.0373243 -0.8628882  0.578232617 -1.2950193 -0.71864311 -1.58142837
## [6,]    1.1159709  1.2315399 -1.388199680  0.6775672 -1.74485965  0.62337028
##         Nuclear   Fuel_Cost
## [1,] -0.7146294 -0.85367545
## [2,]  0.7920476  0.81329670
## [3,] -0.7146294 -0.08043055
## [4,]  1.3280197 -0.72420189
## [5,]  0.2143888  1.69263800
## [6,]  0.6253007  0.24864810

Distance

Euclidean distance

distance <- dist(z)
# distance
# We can get distance matrix
print(distance, digits = 3)   # remove some decimal numbers
##       1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
## 2  3.10                                                                      
## 3  3.68 4.92                                                                 
## 4  2.46 2.16 4.11                                                            
## 5  4.12 3.85 4.47 4.13                                                       
## 6  3.61 4.22 2.99 3.20 4.60                                                  
## 7  3.90 3.45 4.22 3.97 4.60 3.35                                             
## 8  2.74 3.89 4.99 3.69 5.16 4.91 4.36                                        
## 9  3.25 3.96 2.75 3.75 4.49 3.73 2.80 3.59                                   
## 10 3.10 2.71 3.93 1.49 4.05 3.83 4.51 3.67 3.57                              
## 11 3.49 4.79 5.90 4.86 6.46 6.00 6.00 3.46 5.18 5.08                         
## 12 3.22 2.43 4.03 3.50 3.60 3.74 1.66 4.06 2.74 3.94 5.21                    
## 13 3.96 3.43 4.39 2.58 4.76 4.55 5.01 4.14 3.66 1.41 5.31 4.50               
## 14 2.11 4.32 2.74 3.23 4.82 3.47 4.91 4.34 3.82 3.61 4.32 4.34 4.39          
## 15 2.59 2.50 5.16 3.19 4.26 4.07 2.93 3.85 4.11 4.26 4.74 2.33 5.10 4.24     
## 16 4.03 4.84 5.26 4.97 5.82 5.84 5.04 2.20 3.63 4.53 3.43 4.62 4.41 5.17 5.18
## 17 4.40 3.62 6.36 4.89 5.63 6.10 4.58 5.43 4.90 5.48 4.75 3.50 5.61 5.56 3.40
## 18 1.88 2.90 2.72 2.65 4.34 2.85 2.95 3.24 2.43 3.07 3.95 2.45 3.78 2.30 3.00
## 19 2.41 4.63 3.18 3.46 5.13 2.58 4.52 4.11 4.11 4.13 4.52 4.41 5.01 1.88 4.03
## 20 3.17 3.00 3.73 1.82 4.39 2.91 3.54 4.09 2.95 2.05 5.35 3.43 2.23 3.74 3.78
## 21 3.45 2.32 5.09 3.88 3.64 4.63 2.68 3.98 3.74 4.36 4.88 1.38 4.94 4.93 2.10
## 22 2.51 2.42 4.11 2.58 3.77 4.03 4.00 3.24 3.21 2.56 3.44 3.00 2.74 3.51 3.35
##      16   17   18   19   20   21
## 2                               
## 3                               
## 4                               
## 5                               
## 6                               
## 7                               
## 8                               
## 9                               
## 10                              
## 11                              
## 12                              
## 13                              
## 14                              
## 15                              
## 16                              
## 17 5.56                         
## 18 3.97 4.43                    
## 19 5.23 6.09 2.47               
## 20 4.82 4.87 2.92 3.90          
## 21 4.57 3.10 3.19 4.97 4.15     
## 22 3.46 3.63 2.55 3.97 2.62 3.01
  • Higher values shows high dissimilarity between observations.
  • Lower values shows low dissimilarity between observations.
  • Both in terms of euclidean distance.

Hierarchical Clustering

Single Linkage, Complete Linkage, Average Linkage Clustering

# Single linkage
hc.sin.link <- hclust(distance,method = "single")
plot(hc.sin.link)

# Complete linkage
hc.com.link <- hclust(distance)
plot(hc.com.link)

# Average linkage
hc.avg.link <- hclust(distance, method = "average")
plot(hc.avg.link)

# With company names
plot(hc.sin.link, labels = utilities$Company, hang = -1)

plot(hc.com.link, labels = utilities$Company, hang = -1)

plot(hc.avg.link, labels = utilities$Company, hang = -1)

Cluster Memberships

member.sin.link <- cutree(hc.sin.link, 3) # 3 clusters
member.com.link <- cutree(hc.com.link, 3) # 3 clusters
member.avg.link <- cutree(hc.avg.link, 3) # 3 clusters

# Understand mismatching
table(member.sin.link, member.com.link, member.avg.link)
## , , member.avg.link = 1
## 
##                member.com.link
## member.sin.link  1  2  3
##               1 13  5  0
##               2  0  0  0
##               3  0  0  0
## 
## , , member.avg.link = 2
## 
##                member.com.link
## member.sin.link  1  2  3
##               1  0  0  0
##               2  1  0  0
##               3  0  0  0
## 
## , , member.avg.link = 3
## 
##                member.com.link
## member.sin.link  1  2  3
##               1  0  0  2
##               2  0  0  0
##               3  0  0  1

Cluster Means

# Cluster means for complete linkage
aggregate(z, list(member.com.link), mean)
##   Group.1 Fixed_charge        RoR        Cost       Load   D Demand      Sales
## 1       1    0.3068832  0.4326015 -0.31481203 -0.3743722 -0.2605107 -0.1575387
## 2       2   -0.4991075 -0.7113763  0.07812761  1.3365904  0.1343994 -0.6728046
## 3       3   -0.6002757 -0.8331800  1.33891013 -0.4805802  0.9917178  1.8565214
##      Nuclear  Fuel_Cost
## 1  0.3692252 -0.2389329
## 2 -0.6050529  1.2484717
## 3 -0.7146294 -0.9657660
# in Original units
aggregate(utilities[, -c(1,1)], list(member.com.link), mean)
##   Group.1 Fixed_charge       RoR     Cost     Load D Demand     Sales Nuclear
## 1       1     1.170714 11.707143 155.2143 55.30714 2.428571  8354.786   18.20
## 2       2     1.022000  9.140000 171.4000 62.94000 3.660000  6525.600    1.84
## 3       3     1.003333  8.866667 223.3333 54.83333 6.333333 15504.667    0.00
##   Fuel_Cost
## 1 0.9698571
## 2 1.7970000
## 3 0.5656667
  • Consider sales variable, companies that belongs to cluster 3 they have higher sales.
  • Also companies that belongs to cluster 2, they have higher fuel costs.
  • Like this we can characterize the variables.
  • Also using original units, interpretation can make easier.

Silhouette Plots

library(cluster)
plot(silhouette(cutree(hc.com.link, 3), distance))

  • Here if cluster formation has been good, or the members in clusters are cloaser to each other \(s_i\) values will be high.
  • \(s_i\) is negative means sort of ourlier.

Scree Plot

wss <- (nrow(z)-1)*(sum(apply(z, 2, var)))
for (i in 2:20) wss[i] <- sum(kmeans(z, centers = i)$withinss)
plot(1:20, wss, type = "b", 
     xlab = "Number of Clusters", ylab = "Within Group SS")

  • We can see here within group variability drops from one cluster to another cluster.
  • From the plot from the 4th cluster, and beyond that the clusters are not very significant.

K-Means Clustering (Non-Hierarchical)

kmc <- kmeans(x = z, 3)
kmc
## K-means clustering with 3 clusters of sizes 10, 7, 5
## 
## Cluster means:
##   Fixed_charge         RoR       Cost       Load    D Demand      Sales
## 1  -0.09262805 -0.05185431  0.4689864 -0.3042429  0.45509205  0.3462986
## 2  -0.23896065 -0.65917479  0.2556961  0.7992527 -0.05435116 -0.8604593
## 3   0.51980100  1.02655333 -1.2959473 -0.5104679 -0.83409247  0.5120458
##      Nuclear  Fuel_Cost
## 1  0.4252045 -0.7161098
## 2 -0.2884040  1.2497562
## 3 -0.4466434 -0.3174391
## 
## Clustering vector:
##  [1] 1 2 3 1 2 3 2 1 1 1 1 2 1 3 2 1 2 3 3 1 2 1
## 
## Within cluster sum of squares by cluster:
## [1] 57.53424 34.16481 15.15613
##  (between_SS / total_SS =  36.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
# Clusters
kmc$cluster
##  [1] 1 2 3 1 2 3 2 1 1 1 1 2 1 3 2 1 2 3 3 1 2 1
# Averages
kmc$centers
##   Fixed_charge         RoR       Cost       Load    D Demand      Sales
## 1  -0.09262805 -0.05185431  0.4689864 -0.3042429  0.45509205  0.3462986
## 2  -0.23896065 -0.65917479  0.2556961  0.7992527 -0.05435116 -0.8604593
## 3   0.51980100  1.02655333 -1.2959473 -0.5104679 -0.83409247  0.5120458
##      Nuclear  Fuel_Cost
## 1  0.4252045 -0.7161098
## 2 -0.2884040  1.2497562
## 3 -0.4466434 -0.3174391

Plots

plot(Sales ~ `D Demand`, utilities)

plot(Sales ~ `D Demand`, utilities, col = kmc$cluster)

library(ggplot2)

ggplot(data = utilities, aes(x = `D Demand`,
                             y = Sales, 
                             col=kmc$cluster)) + geom_point()

  • Clusters are good when the Between cluster distance is high, and the Within cluster distance is low.