Data

library(readr)
utilities <- read_csv("Data/utilities.csv")
head(utilities)

## # A tibble: 6 x 9
##   Company      Fixed_charge   RoR  Cost  Load `D Demand` Sales Nuclear Fuel_Cost
##   <chr>               <dbl> <dbl> <dbl> <dbl>      <dbl> <dbl>   <dbl>     <dbl>
## 1 Arizona              1.06   9.2   151  54.4        1.6  9077     0       0.628
## 2 Boston               0.89  10.3   202  57.9        2.2  5088    25.3     1.56 
## 3 Central              1.43  15.4   113  53          3.4  9212     0       1.06 
## 4 Commonwealth         1.02  11.2   168  56          0.3  6423    34.3     0.7  
## 5 Con Ed NY            1.49   8.8   192  51.2        1    3300    15.6     2.04 
## 6 Florida              1.32  13.5   111  60         -2.2 11127    22.5     1.24

str(utilities)

## tibble [22 x 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Company     : chr [1:22] "Arizona" "Boston" "Central" "Commonwealth" ...
##  $ Fixed_charge: num [1:22] 1.06 0.89 1.43 1.02 1.49 1.32 1.22 1.1 1.34 1.12 ...
##  $ RoR         : num [1:22] 9.2 10.3 15.4 11.2 8.8 13.5 12.2 9.2 13 12.4 ...
##  $ Cost        : num [1:22] 151 202 113 168 192 111 175 245 168 197 ...
##  $ Load        : num [1:22] 54.4 57.9 53 56 51.2 60 67.6 57 60.4 53 ...
##  $ D Demand    : num [1:22] 1.6 2.2 3.4 0.3 1 -2.2 2.2 3.3 7.2 2.7 ...
##  $ Sales       : num [1:22] 9077 5088 9212 6423 3300 ...
##  $ Nuclear     : num [1:22] 0 25.3 0 34.3 15.6 22.5 0 0 0 39.2 ...
##  $ Fuel_Cost   : num [1:22] 0.628 1.555 1.058 0.7 2.044 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Company = col_character(),
##   ..   Fixed_charge = col_double(),
##   ..   RoR = col_double(),
##   ..   Cost = col_double(),
##   ..   Load = col_double(),
##   ..   `D Demand` = col_double(),
##   ..   Sales = col_double(),
##   ..   Nuclear = col_double(),
##   ..   Fuel_Cost = col_double()
##   .. )

attach(utilities)

From the second column, we have quantitative data. 22 observations with 9 variables.

Understand about data

Let’s see some plots and understand about data.

Scatter Plot

plot(Fuel_Cost ~ Sales, data = utilities)

# Add labels 
with(utilities, text(Fuel_Cost ~ Sales, labels = Company, pos=4, cex=0.5))

Much more clear view of the data.

We can see,

Four companies towards more sales, but less fuel cost.
Another group of companies have medium sales, and medium to lower fuel cost.
Another group of companies having low sales, but higher fuel cost.
Also we can just see \(3\) clusters.

Normalization

First we select quantitative data and apply Z normalization.

# remove 1st column, i.e. Companies
z <- utilities[, -c(1,1)]
head(z)

## # A tibble: 6 x 8
##   Fixed_charge   RoR  Cost  Load `D Demand` Sales Nuclear Fuel_Cost
##          <dbl> <dbl> <dbl> <dbl>      <dbl> <dbl>   <dbl>     <dbl>
## 1         1.06   9.2   151  54.4        1.6  9077     0       0.628
## 2         0.89  10.3   202  57.9        2.2  5088    25.3     1.56 
## 3         1.43  15.4   113  53          3.4  9212     0       1.06 
## 4         1.02  11.2   168  56          0.3  6423    34.3     0.7  
## 5         1.49   8.8   192  51.2        1    3300    15.6     2.04 
## 6         1.32  13.5   111  60         -2.2 11127    22.5     1.24

We can clearly see variables like Fuel_Cost which are low in values, but variables like Sales which are very high in values.

# means
m <- apply(X = z, MARGIN = 2, FUN = mean)
# std.dev
sd <- apply(X = z, MARGIN = 2, FUN = sd)
# normalized data set
z <- scale(z,m,sd)
head(z)

##      Fixed_charge        RoR         Cost       Load    D Demand       Sales
## [1,]   -0.2931579 -0.6846390 -0.417122002 -0.5777152 -0.52622751  0.04590290
## [2,]   -1.2145113 -0.1944537  0.821002037  0.2068363 -0.33381191 -1.07776413
## [3,]    1.7121407  2.0782236 -1.339645796 -0.8915357  0.05101929  0.08393124
## [4,]   -0.5099470  0.2066070 -0.004413989 -0.2190631 -0.94312798 -0.70170610
## [5,]    2.0373243 -0.8628882  0.578232617 -1.2950193 -0.71864311 -1.58142837
## [6,]    1.1159709  1.2315399 -1.388199680  0.6775672 -1.74485965  0.62337028
##         Nuclear   Fuel_Cost
## [1,] -0.7146294 -0.85367545
## [2,]  0.7920476  0.81329670
## [3,] -0.7146294 -0.08043055
## [4,]  1.3280197 -0.72420189
## [5,]  0.2143888  1.69263800
## [6,]  0.6253007  0.24864810

Distance

Euclidean distance

distance <- dist(z)
# distance
# We can get distance matrix
print(distance, digits = 3)   # remove some decimal numbers

##       1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
## 2  3.10                                                                      
## 3  3.68 4.92                                                                 
## 4  2.46 2.16 4.11                                                            
## 5  4.12 3.85 4.47 4.13                                                       
## 6  3.61 4.22 2.99 3.20 4.60                                                  
## 7  3.90 3.45 4.22 3.97 4.60 3.35                                             
## 8  2.74 3.89 4.99 3.69 5.16 4.91 4.36                                        
## 9  3.25 3.96 2.75 3.75 4.49 3.73 2.80 3.59                                   
## 10 3.10 2.71 3.93 1.49 4.05 3.83 4.51 3.67 3.57                              
## 11 3.49 4.79 5.90 4.86 6.46 6.00 6.00 3.46 5.18 5.08                         
## 12 3.22 2.43 4.03 3.50 3.60 3.74 1.66 4.06 2.74 3.94 5.21                    
## 13 3.96 3.43 4.39 2.58 4.76 4.55 5.01 4.14 3.66 1.41 5.31 4.50               
## 14 2.11 4.32 2.74 3.23 4.82 3.47 4.91 4.34 3.82 3.61 4.32 4.34 4.39          
## 15 2.59 2.50 5.16 3.19 4.26 4.07 2.93 3.85 4.11 4.26 4.74 2.33 5.10 4.24     
## 16 4.03 4.84 5.26 4.97 5.82 5.84 5.04 2.20 3.63 4.53 3.43 4.62 4.41 5.17 5.18
## 17 4.40 3.62 6.36 4.89 5.63 6.10 4.58 5.43 4.90 5.48 4.75 3.50 5.61 5.56 3.40
## 18 1.88 2.90 2.72 2.65 4.34 2.85 2.95 3.24 2.43 3.07 3.95 2.45 3.78 2.30 3.00
## 19 2.41 4.63 3.18 3.46 5.13 2.58 4.52 4.11 4.11 4.13 4.52 4.41 5.01 1.88 4.03
## 20 3.17 3.00 3.73 1.82 4.39 2.91 3.54 4.09 2.95 2.05 5.35 3.43 2.23 3.74 3.78
## 21 3.45 2.32 5.09 3.88 3.64 4.63 2.68 3.98 3.74 4.36 4.88 1.38 4.94 4.93 2.10
## 22 2.51 2.42 4.11 2.58 3.77 4.03 4.00 3.24 3.21 2.56 3.44 3.00 2.74 3.51 3.35
##      16   17   18   19   20   21
## 2                               
## 3                               
## 4                               
## 5                               
## 6                               
## 7                               
## 8                               
## 9                               
## 10                              
## 11                              
## 12                              
## 13                              
## 14                              
## 15                              
## 16                              
## 17 5.56                         
## 18 3.97 4.43                    
## 19 5.23 6.09 2.47               
## 20 4.82 4.87 2.92 3.90          
## 21 4.57 3.10 3.19 4.97 4.15     
## 22 3.46 3.63 2.55 3.97 2.62 3.01

Higher values shows high dissimilarity between observations.
Lower values shows low dissimilarity between observations.
Both in terms of euclidean distance.

Hierarchical Clustering

Single Linkage, Complete Linkage, Average Linkage Clustering

# Single linkage
hc.sin.link <- hclust(distance,method = "single")
plot(hc.sin.link)

# Complete linkage
hc.com.link <- hclust(distance)
plot(hc.com.link)

# Average linkage
hc.avg.link <- hclust(distance, method = "average")
plot(hc.avg.link)

# With company names
plot(hc.sin.link, labels = utilities$Company, hang = -1)

plot(hc.com.link, labels = utilities$Company, hang = -1)

plot(hc.avg.link, labels = utilities$Company, hang = -1)

Cluster Memberships

member.sin.link <- cutree(hc.sin.link, 3) # 3 clusters
member.com.link <- cutree(hc.com.link, 3) # 3 clusters
member.avg.link <- cutree(hc.avg.link, 3) # 3 clusters

# Understand mismatching
table(member.sin.link, member.com.link, member.avg.link)

## , , member.avg.link = 1
## 
##                member.com.link
## member.sin.link  1  2  3
##               1 13  5  0
##               2  0  0  0
##               3  0  0  0
## 
## , , member.avg.link = 2
## 
##                member.com.link
## member.sin.link  1  2  3
##               1  0  0  0
##               2  1  0  0
##               3  0  0  0
## 
## , , member.avg.link = 3
## 
##                member.com.link
## member.sin.link  1  2  3
##               1  0  0  2
##               2  0  0  0
##               3  0  0  1

Cluster Means

# Cluster means for complete linkage
aggregate(z, list(member.com.link), mean)

##   Group.1 Fixed_charge        RoR        Cost       Load   D Demand      Sales
## 1       1    0.3068832  0.4326015 -0.31481203 -0.3743722 -0.2605107 -0.1575387
## 2       2   -0.4991075 -0.7113763  0.07812761  1.3365904  0.1343994 -0.6728046
## 3       3   -0.6002757 -0.8331800  1.33891013 -0.4805802  0.9917178  1.8565214
##      Nuclear  Fuel_Cost
## 1  0.3692252 -0.2389329
## 2 -0.6050529  1.2484717
## 3 -0.7146294 -0.9657660

# in Original units
aggregate(utilities[, -c(1,1)], list(member.com.link), mean)

##   Group.1 Fixed_charge       RoR     Cost     Load D Demand     Sales Nuclear
## 1       1     1.170714 11.707143 155.2143 55.30714 2.428571  8354.786   18.20
## 2       2     1.022000  9.140000 171.4000 62.94000 3.660000  6525.600    1.84
## 3       3     1.003333  8.866667 223.3333 54.83333 6.333333 15504.667    0.00
##   Fuel_Cost
## 1 0.9698571
## 2 1.7970000
## 3 0.5656667

Consider sales variable, companies that belongs to cluster 3 they have higher sales.
Also companies that belongs to cluster 2, they have higher fuel costs.
Like this we can characterize the variables.
Also using original units, interpretation can make easier.

Silhouette Plots

library(cluster)
plot(silhouette(cutree(hc.com.link, 3), distance))

Here if cluster formation has been good, or the members in clusters are cloaser to each other \(s_i\) values will be high.
\(s_i\) is negative means sort of ourlier.

Scree Plot

wss <- (nrow(z)-1)*(sum(apply(z, 2, var)))
for (i in 2:20) wss[i] <- sum(kmeans(z, centers = i)$withinss)
plot(1:20, wss, type = "b", 
     xlab = "Number of Clusters", ylab = "Within Group SS")

We can see here within group variability drops from one cluster to another cluster.
From the plot from the 4th cluster, and beyond that the clusters are not very significant.

K-Means Clustering (Non-Hierarchical)

kmc <- kmeans(x = z, 3)
kmc

## K-means clustering with 3 clusters of sizes 10, 7, 5
## 
## Cluster means:
##   Fixed_charge         RoR       Cost       Load    D Demand      Sales
## 1  -0.09262805 -0.05185431  0.4689864 -0.3042429  0.45509205  0.3462986
## 2  -0.23896065 -0.65917479  0.2556961  0.7992527 -0.05435116 -0.8604593
## 3   0.51980100  1.02655333 -1.2959473 -0.5104679 -0.83409247  0.5120458
##      Nuclear  Fuel_Cost
## 1  0.4252045 -0.7161098
## 2 -0.2884040  1.2497562
## 3 -0.4466434 -0.3174391
## 
## Clustering vector:
##  [1] 1 2 3 1 2 3 2 1 1 1 1 2 1 3 2 1 2 3 3 1 2 1
## 
## Within cluster sum of squares by cluster:
## [1] 57.53424 34.16481 15.15613
##  (between_SS / total_SS =  36.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

# Clusters
kmc$cluster

##  [1] 1 2 3 1 2 3 2 1 1 1 1 2 1 3 2 1 2 3 3 1 2 1

# Averages
kmc$centers

##   Fixed_charge         RoR       Cost       Load    D Demand      Sales
## 1  -0.09262805 -0.05185431  0.4689864 -0.3042429  0.45509205  0.3462986
## 2  -0.23896065 -0.65917479  0.2556961  0.7992527 -0.05435116 -0.8604593
## 3   0.51980100  1.02655333 -1.2959473 -0.5104679 -0.83409247  0.5120458
##      Nuclear  Fuel_Cost
## 1  0.4252045 -0.7161098
## 2 -0.2884040  1.2497562
## 3 -0.4466434 -0.3174391

Plots

plot(Sales ~ `D Demand`, utilities)

plot(Sales ~ `D Demand`, utilities, col = kmc$cluster)

library(ggplot2)

ggplot(data = utilities, aes(x = `D Demand`,
                             y = Sales, 
                             col=kmc$cluster)) + geom_point()

Clusters are good when the Between cluster distance is high, and the Within cluster distance is low.

Simple Introduction to Clustering in R

Pruthuvi Kehelbedda

2020-12-15

Data

Understand about data

Scatter Plot

Normalization

Distance

Euclidean distance

Hierarchical Clustering

Single Linkage, Complete Linkage, Average Linkage Clustering

Cluster Memberships

Cluster Means

Silhouette Plots

Scree Plot

K-Means Clustering (Non-Hierarchical)

Plots