Basics of Plotting

setwd("~/Google Drive/CR Rao Course")
# Load the libraries
library(MASS)
library(sparcl)
library(HSAUR)
## Loading required package: tools
data(Animals)
head(Animals)
##                     body brain
## Mountain beaver     1.35   8.1
## Cow               465.00 423.0
## Grey wolf          36.33 119.5
## Goat               27.66 115.0
## Guinea pig          1.04   5.5
## Dipliodocus     11700.00  50.0
attach(Animals)

We want to develop a scatter plot of brain weight against body weight. By convention the dependent variable (brain weight) goes on the y-axis and the independent variable (body weight) on the x-axis.

plot(Animals$body, Animals$brain)

The identify() function can be used to label individual points with the data names; it is commented out here because it requires an interactive graphics device.

#identify(Animals$body, Animals$brain, labels = rownames(Animals))

Since the data span several orders of magnitude, we transform both variables by taking logarithms and then fit a linear regression model to the transformed data. In the plot command further below, pch stands for plotting character.

# Use identify command to find the outliers
#identify(log(Animals$body), log(Animals$brain), labels = rownames(Animals))
model1 <- lm(log(brain)~log(body), data=Animals)
summary(model1)
## 
## Call:
## lm(formula = log(brain) ~ log(body), data = Animals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2890 -0.6763  0.3316  0.8646  2.5835 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.55490    0.41314   6.184 1.53e-06 ***
## log(body)    0.49599    0.07817   6.345 1.02e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.532 on 26 degrees of freedom
## Multiple R-squared:  0.6076, Adjusted R-squared:  0.5925 
## F-statistic: 40.26 on 1 and 26 DF,  p-value: 1.017e-06

So the fitted linear regression equation is log(brain weight) = 2.55 + 0.50 · log(body weight).
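As a quick sanity check of the fitted equation, we can predict the brain weight of a hypothetical 100 kg animal (the value 100 is purely illustrative) and back-transform from the log scale:

# Predict log(brain weight) for a hypothetical 100 kg body weight,
# then exponentiate to get back to grams
new_animal <- data.frame(body = 100) # hypothetical body weight
exp(predict(model1, newdata = new_animal))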

plot(log(Animals$body), log(Animals$brain), xlab="Log of Body Weight in Kg", ylab="Log of Brain Weight in Gms", main="Plot of Brain Weight against Body Weight", pch=16, col="red")
abline(model1, col="blue", lwd=2)

Cluster Analysis

The aim of cluster analysis is to group objects into classes. Cluster analysis is an example of unsupervised learning: we do not know in advance how many groups will be formed. The alternative, supervised learning, is also called pattern recognition; there the number of groups is known in advance.

Nearest neighbour single linkage cluster analysis

The basic requirement is a way to measure the distance between two objects; the most popular measure is the Euclidean distance. The algorithm starts by calculating the distance matrix of the objects, and the distance between two clusters is defined as the minimum distance between their members.
Steps of the process are (a toy sketch follows the list):
1. Create clusters with each cluster containing only one object.
2. Find the distance between each pair of clusters and create a distance matrix.
3. Among the distances, find the pair of clusters with the minimum distance and merge them.
4. Recalculate the distance matrix between the remaining n-1 clusters.
5. Repeat steps 3-4 until all objects are merged into a single cluster.
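A minimal toy sketch of single linkage (the matrix m below is made-up data purely for illustration); note that R's hclust() uses complete linkage by default, so method = "single" must be requested explicitly for the nearest-neighbour method:

# Five made-up points in two dimensions
m <- matrix(c(1, 1,
              1.5, 1,
              5, 5,
              5, 5.5,
              9, 9), ncol = 2, byrow = TRUE)
d <- dist(m)                       # Euclidean distance matrix
hc <- hclust(d, method = "single") # nearest-neighbour (single) linkage
hc$merge                           # the order in which clusters were merged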

Working example

Import the protein data into R using copy and paste.
Idea from the workshop: could clustering techniques be used for dosimetric data?

prot <- read.csv("~/Google Drive/CR Rao Course/protein.csv")
head(prot)
##          X RMEAT WMEAT EGGS MILK FISH CEREAL STARCH NUTS FRVEG
## 1  Albania  10.1   1.4  0.5  8.9  0.2   42.3    0.6  5.5   1.7
## 2  Austria   8.9  14.0  4.3 19.9  2.1   28.0    3.6  1.3   4.3
## 3  Belgium  13.5   9.3  4.1 17.5  4.5   26.6    5.7  2.1   4.0
## 4 Bulgaria   7.8   6.0  1.6  8.3  1.2   56.7    1.1  3.7   4.2
## 5 Czechslo   9.7  11.4  2.8 12.5  2.0   34.3    5.0  1.1   4.0
## 6  Denmark  10.6  10.8  3.7 25.0  9.9   21.9    4.8  0.7   2.4
prot[c(1,2), ]
##         X RMEAT WMEAT EGGS MILK FISH CEREAL STARCH NUTS FRVEG
## 1 Albania  10.1   1.4  0.5  8.9  0.2   42.3    0.6  5.5   1.7
## 2 Austria   8.9  14.0  4.3 19.9  2.1   28.0    3.6  1.3   4.3

Convert the first column into row labels and then remove that column.

rownames(prot) <- prot$X
prot[c(1,2), ]
##               X RMEAT WMEAT EGGS MILK FISH CEREAL STARCH NUTS FRVEG
## Albania Albania  10.1   1.4  0.5  8.9  0.2   42.3    0.6  5.5   1.7
## Austria Austria   8.9  14.0  4.3 19.9  2.1   28.0    3.6  1.3   4.3
prot1 <- prot[ ,-c(1)]

Calculate the Euclidean distance between the observations using the dist() function.

prot_dist <- dist(prot1)
prot_dist_matrix <- as.matrix(prot_dist)
prot_dist_matrix[1:6, 1:6]
##           Albania   Austria   Belgium Bulgaria Czechslo  Denmark
## Albania   0.00000 23.176281 21.650173 15.68821 15.15454 30.15759
## Austria  23.17628  0.000000  7.868291 32.30449 10.30534 11.95659
## Belgium  21.65017  7.868291  0.000000 32.78613 10.60990 11.11980
## Bulgaria 15.68821 32.304489 32.786125  0.00000 24.00542 40.33410
## Czechslo 15.15454 10.305338 10.609901 24.00542  0.00000 19.42061
## Denmark  30.15759 11.956588 11.119802 40.33410 19.42061  0.00000

Using the hclust command.

prot_clust <- hclust(prot_dist)

The hclust command takes the 'dist' object generated by the dist command, not the square matrix. Note also that hclust() uses complete linkage by default; pass method = "single" for the nearest-neighbour method described above.
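For example, a single-linkage tree for the protein data (its groupings would differ from the complete-linkage output shown below) could be obtained with:

prot_clust_single <- hclust(prot_dist, method = "single") # nearest-neighbour linkage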

Restricting the clustering to 2 and 3 clusters only

two_clust <- cutree(prot_clust, k=2)
three_clust <- cutree(prot_clust, k=3)
three_clust
##  Albania  Austria  Belgium Bulgaria Czechslo  Denmark Egermany  Finland 
##        1        2        2        3        1        2        1        2 
##   France   Greece  Hungary  Ireland    Italy  Netherl   Norway   Poland 
##        2        1        1        2        1        2        2        1 
## Portugal  Romania    Spain   Sweden Switzerl       UK     USSR Wgermany 
##        1        3        1        2        2        2        1        2 
## Yugoslav 
##        3

Sorting the nations by cluster membership

sort(three_clust)
##  Albania Czechslo Egermany   Greece  Hungary    Italy   Poland Portugal 
##        1        1        1        1        1        1        1        1 
##    Spain     USSR  Austria  Belgium  Denmark  Finland   France  Ireland 
##        1        1        2        2        2        2        2        2 
##  Netherl   Norway   Sweden Switzerl       UK Wgermany Bulgaria  Romania 
##        2        2        2        2        2        2        3        3 
## Yugoslav 
##        3

Cutting the tree into four clusters

four_clust <- cutree(prot_clust, k=4)
sort(four_clust)
##  Albania Czechslo   Greece  Hungary    Italy   Poland     USSR  Austria 
##        1        1        1        1        1        1        1        2 
##  Belgium  Denmark  Finland   France  Ireland  Netherl   Norway   Sweden 
##        2        2        2        2        2        2        2        2 
## Switzerl       UK Wgermany Bulgaria  Romania Yugoslav Egermany Portugal 
##        2        2        2        3        3        3        4        4 
##    Spain 
##        4

Create a dendrogram

plot(prot_clust, hang = -1) # hang = -1 aligns all leaf labels at the baseline

Hierarchical clustering is a form of exploratory data evaluation. In practice, people try several clustering methods and compare the groupings they produce, so this is a preliminary investigation into the relationships in the data.

Reading the plot
Bulgaria and Yugoslavia are the first pair to be joined, and that cluster is then joined with Romania.

Read the dendrogram from the bottom up to follow the steps of the clustering: the lowest joins were made first.
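The sparcl package loaded at the top provides ColorDendrogram() for colouring the leaves of the tree by cluster membership. A minimal sketch using the three-cluster solution (the branchlength value is tuned by eye):

# Colour the dendrogram leaves by the three-cluster membership
ColorDendrogram(prot_clust, y = three_clust, labels = names(three_clust), branchlength = 10)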

K Means Clustering

Method seeks to cluster into a specified number of groups by minimizing a “numerical criterion”.

Methodology
Say you have a dataset with n objects and you decide on the number of clusters, K. The idea extends the concept of variance: for each object in a cluster we measure how close its measurement vector is to the centroid (mean vector) of that cluster.

Summing these squared distances within each cluster, and then across the K clusters, gives a single number for any given partition.

In principle we could repeat this for all possible ways of forming K clusters from the n objects.

The partition that gives the minimum value would be the best solution.

However, the number of possible partitions is so large that exhaustive search is impractical.

Hence we use the following algorithm

The first order of business is to find K, the number of clusters; this is done by a graphical method (the scree plot shown later).

  1. Determine initial clusters - this can be done with the hclust() method (single linkage) as shown above.

  2. Calculate the centroid (mean) of each cluster determined by the hierarchical clustering: the mean of each measurement over the objects in that cluster.

  3. Calculate the Euclidean distance between each object and each centroid, then shuffle the objects so that each one is grouped into the cluster whose centroid is closest.

  4. Repeat this process until the total distance is minimized. This is the best solution found.

So essentially the algorithm starts with the clusters generated by hierarchical clustering and then an optimization process is run; a sketch of that optimization follows.
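To make the reassignment loop concrete, here is a minimal illustrative implementation of the iteration (it assumes no cluster ever becomes empty, and is a sketch rather than a substitute for the built-in kmeans()):

# Illustrative k-means iteration: assign each object to its nearest centroid,
# recompute the centroids, and repeat until nothing changes.
my_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE] # arbitrary starting centroids
  for (i in 1:max_iter) {
    # distance of every object to every centroid
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    cluster <- apply(d, 1, which.min) # shuffle each object to its nearest centroid
    new_centers <- apply(X, 2, function(col) tapply(col, cluster, mean))
    if (isTRUE(all.equal(unname(new_centers), unname(centers)))) break # converged
    centers <- new_centers
  }
  list(cluster = cluster, centers = centers)
}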

More formally: we have n objects 1, 2, 3, …, n, and each object has p quantitative variables. We create clusters C1, C2, …, CK and start with an arbitrary partition of the data into these clusters. We then calculate the centroid (essentially a mean) of each cluster: the average of each variable over the objects in the cluster, i.e. the average of all data vectors in the cluster.

Calculate the squared distance between the average value of each variable and the actual value for each object in each cluster, and sum these within each cluster. Summing over all the clusters gives the error sum of squares (ESSc), a single number that measures the quality of the partition. The optimization process consists of finding the partition for which the ESS has the minimum value.

The essence of the shuffling is that each object is compared against the centroid of each cluster and shifted into the cluster whose centroid is closest. This is repeated until the value of ESSc reaches its minimum.

The partition that minimizes the ESS is called the K-means clustering.
ESS = Within Sum of Squares
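In symbols, with $x_i$ the data vector of object $i$ and $\bar{x}_k$ the centroid of cluster $C_k$ (a standard formulation of the criterion):

$$\mathrm{ESS} = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \bar{x}_k \rVert^2$$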


Cluster analysis on Exoplanet Data

First, load the planets data from the HSAUR package and examine it.
data("planets", package = "HSAUR")
head(planets)
##    mass period eccen
## 1 0.120  4.950  0.00
## 2 0.197  3.971  0.00
## 3 0.210 44.280  0.34
## 4 0.220 75.800  0.28
## 5 0.230  6.403  0.08
## 6 0.250  3.024  0.02
dim(planets)
## [1] 101   3
summary(planets)
##       mass            period             eccen       
##  Min.   : 0.050   Min.   :   2.985   Min.   :0.0000  
##  1st Qu.: 0.930   1st Qu.:  44.280   1st Qu.:0.1000  
##  Median : 1.760   Median : 337.110   Median :0.2700  
##  Mean   : 3.327   Mean   : 666.531   Mean   :0.2815  
##  3rd Qu.: 4.140   3rd Qu.:1089.000   3rd Qu.:0.4100  
##  Max.   :17.500   Max.   :5360.000   Max.   :0.9270

Using the K-means clustering technique for the un-normalized data

In order to make the result reproducible, nstart should be kept at 25: kmeans() is then run from 25 random starting configurations and the best solution is kept, which avoids variability between people using different random starts.
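A quick way to see the effect (illustrative; the exact numbers depend on the random seed):

# With a single random start, repeated runs can land in different local optima
set.seed(1)
replicate(5, kmeans(planets, centers = 3, nstart = 1)$tot.withinss)
# With 25 starts the best of 25 solutions is returned, so the result is stable
replicate(5, kmeans(planets, centers = 3, nstart = 25)$tot.withinss)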

planet1 <- kmeans(planets, centers = 3, nstart = 25)
## Check the output
planet1
## K-means clustering with 3 clusters of sizes 9, 24, 68
## 
## Cluster means:
##       mass    period     eccen
## 1 5.390000 2767.2444 0.3283333
## 2 4.233333 1235.9729 0.3232917
## 3 2.734500  187.5163 0.2606221
## 
## Clustering vector:
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
##   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   1   3 
##  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36 
##   2   3   1   3   2   3   3   3   3   2   3   3   3   2   3   3   3   3 
##  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54 
##   1   2   3   3   3   3   3   2   3   3   3   3   2   2   3   3   3   2 
##  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72 
##   3   2   3   3   2   3   1   3   3   3   3   3   2   3   2   3   3   3 
##  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90 
##   3   1   2   3   2   2   2   2   3   3   2   3   3   2   3   1   3   2 
##  91  92  93  94  95  96  97  98  99 100 101 
##   2   3   3   1   1   3   1   2   3   2   3 
## 
## Within cluster sum of squares by cluster:
## [1] 8224052 2492267 2526577
##  (between_SS / total_SS =  82.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

Interpretation of the result:
1. Three clusters are generated, with the mean values of the variables presented.
2. The number of planets in each cluster is shown in the first line.
3. However, the variability in period is so large that it dominates the clustering algorithm.
4. Clustering vector: tells us into which cluster each object has been put.
5. Within cluster sum of squares by cluster: each value is the ESS for one cluster, and their sum gives the error sum of squares (ESSc).
6. The available components can be extracted from the object created previously:

planet1$cluster
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
##   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   1   3 
##  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36 
##   2   3   1   3   2   3   3   3   3   2   3   3   3   2   3   3   3   3 
##  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54 
##   1   2   3   3   3   3   3   2   3   3   3   3   2   2   3   3   3   2 
##  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72 
##   3   2   3   3   2   3   1   3   3   3   3   3   2   3   2   3   3   3 
##  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90 
##   3   1   2   3   2   2   2   2   3   3   2   3   3   2   3   1   3   2 
##  91  92  93  94  95  96  97  98  99 100 101 
##   2   3   3   1   1   3   1   2   3   2   3
planet1$centers
##       mass    period     eccen
## 1 5.390000 2767.2444 0.3283333
## 2 4.233333 1235.9729 0.3232917
## 3 2.734500  187.5163 0.2606221
planet1$totss
## [1] 76345237
planet1$tot.withinss
## [1] 13242896
planet1$withinss
## [1] 8224052 2492267 2526577

The tot.withinss component is the metric used to judge how many clusters K are required. We can repeat the analysis for different values of K, recording the tot.withinss value each time, and then construct a scree plot to choose the best K. This can be done with a loop.

Doing the loop calculation in R.

# Create a vector of 10 zeros to hold the results
ess <- rep(0, 10)
# Loop over the candidate numbers of clusters
for (k in 1:10)
{
  ess[k] <- kmeans(planets, nstart = 25, centers = k + 1)$tot.withinss # the k-th entry of ess is the total within sum of squares for K = k + 1 clusters
}

ess
##  [1] 26258739.8 13242896.4  5390449.7  3100853.0  1893142.0  1406412.1
##  [7]  1131022.3  1039690.6   559727.2   478045.0

In the above loop, all 10 k-means solutions were calculated and their total within sums of squares were placed in the ess vector.

Using the scree plot to determine the optimal K

Now draw the scree plot of the ess values.

plot(2:11, ess, xlab = "Number of clusters", ylab="Within Sum of Squares", col="blue", pch=10, main ="Scree Plot to determine optimal K") # x runs 2:11 because centers = k + 1

The choice of K from a scree plot is subjective, but here K = 5 (the elbow of the curve) looks reasonable.

Now repeat the analysis with 5 clusters

planet2 <- kmeans(planets, centers = 5, nstart = 25)
planet2
## K-means clustering with 5 clusters of sizes 1, 50, 23, 9, 18
## 
## Cluster means:
##       mass     period     eccen
## 1 4.000000 5360.00000 0.1600000
## 2 2.659520   87.40713 0.2318460
## 3 2.617391  544.55217 0.3456522
## 4 5.057778 2387.46667 0.3550000
## 5 5.186667 1309.85556 0.3077222
## 
## Clustering vector:
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   4   2 
##  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36 
##   3   2   4   2   5   2   2   2   3   3   3   2   3   4   2   2   3   2 
##  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54 
##   4   5   3   3   2   3   2   5   2   3   3   3   3   3   3   2   2   3 
##  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72 
##   2   5   2   2   5   3   4   3   2   2   2   2   5   2   5   2   3   2 
##  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90 
##   2   1   5   2   5   5   5   5   3   3   5   2   2   5   2   4   2   5 
##  91  92  93  94  95  96  97  98  99 100 101 
##   5   3   3   4   4   2   4   5   2   5   2 
## 
## Within cluster sum of squares by cluster:
## [1]       0.0  486552.9  678974.2  884635.1 1050690.8
##  (between_SS / total_SS =  95.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

Now the output shows:
1. The clusters have sizes of 1, 50, 23, 9 and 18 planets.
2. Again the period has influenced the clustering process, so much so that one cluster contains only a single planet.
3. That singleton cluster holds the planet that takes 5360 days to revolve around its sun.
4. Hence normalization is needed.

Since the data are so variable we need to "normalize" them, as variables with large values dominate the clustering. A log transformation looks attractive but cannot be used for this dataset: the eccentricity variable contains zeros, and log 0 is not defined.
Methods for normalization

  1. Standardization = (Observation - Mean) / Standard Deviation
  2. Divide each observation by the range of its variable.
    The latter preserves the sign of the observations (positive stays positive, negative stays negative). A sketch of both methods follows.
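The first method is available as the built-in scale() function; the second can be done with sweep(). A minimal sketch (the object names planet_std and planet_rng are illustrative):

# Method 1: standardization (subtract the mean, divide by the standard deviation)
planet_std <- scale(planets)
# Method 2: divide each variable by its range
planet_rng <- sweep(planets, 2, apply(planets, 2, function(x) max(x) - min(x)), "/")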

Normalizing the planets dataset

#Normalization using range
planetnorm <- planets
planetnorm$mass1 <- planetnorm$mass/(max(planetnorm$mass)-min(planetnorm$mass))
summary(planetnorm$mass1)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.002865 0.053300 0.100900 0.190700 0.237200 1.003000
planetnorm$period1 <-planetnorm$period/(max(planetnorm$period)-min(planetnorm$period))
planetnorm$eccen1 <- planetnorm$eccen/(max(planetnorm$eccen)-min(planetnorm$eccen))
summary(planetnorm)
##       mass            period             eccen            mass1         
##  Min.   : 0.050   Min.   :   2.985   Min.   :0.0000   Min.   :0.002865  
##  1st Qu.: 0.930   1st Qu.:  44.280   1st Qu.:0.1000   1st Qu.:0.053295  
##  Median : 1.760   Median : 337.110   Median :0.2700   Median :0.100860  
##  Mean   : 3.327   Mean   : 666.531   Mean   :0.2815   Mean   :0.190675  
##  3rd Qu.: 4.140   3rd Qu.:1089.000   3rd Qu.:0.4100   3rd Qu.:0.237249  
##  Max.   :17.500   Max.   :5360.000   Max.   :0.9270   Max.   :1.002865  
##     period1              eccen1      
##  Min.   :0.0005572   Min.   :0.0000  
##  1st Qu.:0.0082658   1st Qu.:0.1079  
##  Median :0.0629287   Median :0.2913  
##  Mean   :0.1244221   Mean   :0.3037  
##  3rd Qu.:0.2032849   3rd Qu.:0.4423  
##  Max.   :1.0005572   Max.   :1.0000

Cluster analysis of normalized planet data

Now do cluster analysis of the normalized dataset.

planetnorm1 <- kmeans(planetnorm[ , c(4,5,6)], nstart=25, centers=3 )
planetnorm1
## K-means clustering with 3 clusters of sizes 34, 53, 14
## 
## Cluster means:
##        mass1    period1    eccen1
## 1 0.16777347 0.11500361 0.5343613
## 2 0.09576256 0.07984122 0.1315524
## 3 0.60560786 0.31606632 0.3953614
## 
## Clustering vector:
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
##   2   2   1   2   2   2   2   1   2   2   2   2   2   2   2   2   2   2 
##  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36 
##   2   1   1   2   1   2   2   2   2   2   1   1   1   1   2   1   2   2 
##  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54 
##   2   2   1   1   2   2   1   2   2   1   1   1   1   2   2   2   2   1 
##  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72 
##   2   2   2   2   2   2   2   2   2   1   1   2   2   1   1   2   1   1 
##  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90 
##   2   3   1   2   1   1   1   1   1   2   2   2   1   3   1   3   1   3 
##  91  92  93  94  95  96  97  98  99 100 101 
##   3   1   3   3   3   3   3   3   3   3   3 
## 
## Within cluster sum of squares by cluster:
## [1] 1.787017 1.755694 1.719130
##  (between_SS / total_SS =  57.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

20 Loops

Creating a loop to find the best K to be chosen. We do 20 loops.

ess2 <-rep(1,20) # repeat 1 20 times. Choice of the number is arbitrary.

for (k in 1:20) # allows you to put the value of ess in the proper place in the vector.
  {
  ess2[k] <- kmeans(planetnorm[ , c(4,5,6)], centers = k+1, nstart = 25)$tot.withinss
}

Create the scree plot from the normalized data kmeans

plot(2:21, ess2, xlab = "Number of clusters",ylab="Within Sum of Squares", col="blue", pch=10, main ="Scree Plot to determine optimal K with 20 loops") # x runs 2:21 because centers = k + 1

30 Loops

Creating a loop to find the best K to be chosen. We do 30 loops.

ess3 <-rep(1,30) # repeat 1 30 times. Choice of the number is arbitrary.

for (k in 1:30) # allows you to put the value of ess in the proper place in the vector.
  {
  ess3[k] <- kmeans(planetnorm[ , c(4,5,6)], centers = k+1, nstart = 25)$tot.withinss
}

Create the scree plot from the normalized data kmeans

plot(2:31, ess3, xlab = "Number of clusters",ylab="Within Sum of Squares", col="blue", pch=10, main ="Scree Plot to determine optimal K with 30 loops") # x runs 2:31 because centers = k + 1

Tip

A better way to choose the K value is the mclust package in R, which fits model-based clusterings and selects the number of clusters with the BIC.
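A minimal sketch, assuming the mclust package is installed (by default Mclust() tries 1 to 9 clusters and keeps the number with the best BIC):

library(mclust)
fit <- Mclust(planetnorm[ , c(4, 5, 6)]) # model-based clustering of the normalized variables
fit$G                                    # number of clusters chosen by BIC
plot(fit, what = "BIC")                  # BIC curves for the candidate models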