1.1.1- function that samples the first 10 coordinates of each µj

set.seed(123)
miu_p_i <- function(){
  # sample the first 10 coordinates of a mean vector mu_j from a standard normal
  miu_p_i <- rnorm(10)
  return(miu_p_i)
}

1.1.2- function that samples a dataset of dimension 90 × p

set.seed(123)
library(SimDesign) # provides rmvnorm()
sample_ds_fun <- function(miu1, miu2, miu3, p, sig){
  sig <- sig*diag(p) #creating the covariance matrix sigma
  result_matrix <- as.data.frame(matrix(0, 90, p)) #creating the result matrix
  #extending each miu to length p by padding with zeros at the end
  miu1 <- append(miu1, rep(0, p-10))
  miu2 <- append(miu2, rep(0, p-10))
  miu3 <- append(miu3, rep(0, p-10))
  for (i in 1:90) {
    if (i <= 20) {
      result_matrix[i,] <- rmvnorm(1, miu1, sig)}
    if ((i > 20) & (i <= 50)){
      result_matrix[i,] <- rmvnorm(1, miu2, sig)}
    if (i > 50) {
      result_matrix[i,] <- rmvnorm(1, miu3, sig)}
  }
  return(result_matrix)
}

1.1.3- a function that computes the accuracy of a given clustering result

library(combinat) # provides permn()
accuracy_func <- function(sam_data, true_label){
  clust <- kmeans(sam_data, 3)
  tableA <- table(clust$cluster, true_label)
  p <- permn(3) #returns all permutations of seq(3)
  maximum <- 0
  for (i in 1:6) {
    #accuracy under this label permutation: matched points out of 90
    cal_test <- sum(diag(tableA[p[[i]],]))/90
    if (cal_test > maximum){
      maximum <- cal_test
    }
  }
  return(maximum)
}

1.1.4- wrapper for the K-means algorithm that outputs the accuracy and the run-time

acc_and_runtime <- function(df, true_label){
  start_time <- Sys.time()
  acc <- accuracy_func(df, true_label)
  #elapsed time in seconds (fixing the units avoids difftime's automatic unit choice)
  time_taken <- as.numeric(difftime(Sys.time(), start_time, units = "secs"))
  return(c(acc, time_taken))
}
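As a bridge to the results in 1.2.1, here is a hedged sketch of how the functions above can be combined into the full simulation. The repetition count (20 runs per combination) and the standard-error formula (sd/sqrt(n)) are assumptions for illustration, not necessarily the exact settings used to produce the table.

true_label <- rep(1:3, times = c(20, 30, 40)) #group sizes implied by sample_ds_fun
p_values   <- c(10, 20, 50)
sig_values <- c(1, 7, 25, 37)

results <- data.frame()
for (p in p_values) {
  for (sig in sig_values) {
    acc <- replicate(20, { #20 repetitions per (p, sigma) combination (assumed)
      df <- sample_ds_fun(miu_p_i(), miu_p_i(), miu_p_i(), p, sig)
      accuracy_func(df, true_label)
    })
    results <- rbind(results,
                     data.frame(p = p, sigma = sig,
                                average_accuracy = mean(acc),
                                std = sd(acc)/sqrt(length(acc)))) #standard error
  }
}
results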

1.2.1- the average accuracy and the standard-error for each combination of p and sigma

  p   sigma   average_accuracy        std
 10       1          0.9363889  0.0081855
 10       7          0.5366667  0.0091808
 10      25          0.4373611  0.0051172
 10      37          0.4241667  0.0045623
 20       1          0.9159722  0.0113904
 20       7          0.5027778  0.0083964
 20      25          0.4227778  0.0040164
 20      37          0.4183333  0.0043758
 50       1          0.8761111  0.0147690
 50       7          0.4604167  0.0066615
 50      25          0.4184722  0.0042482
 50      37          0.4130556  0.0036057

1.2.2- figure describing run-time.

1.2.3- the effect of increasing p and increasing sigma on accuracy and run-time

We can see that the lower the dimension p, the higher the average accuracy. We also see from the graphs and the table that, for a given sigma, different dimensions give roughly the same results: for a low sigma we get approximately 90% accuracy in all of the dimensions, and for a high sigma approximately 40% accuracy in all of the dimensions. This makes sense: the larger sigma is, the more significant the added noise (error), and the greater the variability, the lower the accuracy.

In principle the running time may vary, since hardware also affects it. In general, the higher the dimension p, the longer the run time, which makes sense because a larger p means more calculations.

############################################################

2 Comparing Covid-19 data and demographic data

In this part we explore how socio-economic similarity between cities relates to patterns in the spread and effect of the coronavirus. We use the demographic data set (the CBS demographics file) that is used to create the socio-economic ranking by the Israeli Central Bureau of Statistics (CBS); each row is a town/moatza mekomit and the variables represent demographic properties. We compare the demographics to the Covid-19 statistics data set by town found in the covid towns file. For the Covid-19 statistics, we processed the file to produce the monthly number of verified cases, recoveries, deaths and diagnostic tests in each town.

1. We randomly choose a set of 20 cities described in the CBS (demographics) data set and identify these cities in the coronavirus data set. We now have two data sets with the same cities; a minimal sketch of this step is given below.
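The sketch below assumes hypothetical data frames demographics_df and covid_df sharing a town column; the real object and column names in our script may differ.

set.seed(123)
#demographics_df and covid_df are hypothetical names for the CBS and Covid data frames
chosen_towns <- sample(unique(demographics_df$town), 20) #20 random towns

demo_sub  <- demographics_df[demographics_df$town %in% chosen_towns, ]
covid_sub <- covid_df[covid_df$town %in% chosen_towns, ]  #same towns in both data sets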

2. We construct a hierarchical tree for the Covid-19 data and decide how to define distances between two cities so that the results are meaningful.

Distances calculation method:

In Lect3B_Distances we studied the Manhattan distance (page 8); however, its plot turned out poorly here, so we decided to use the Canberra distance, a numerical measure of the distance between pairs of points in a vector space, introduced in 1966 and refined in 1967 by Godfrey N. Lance and William T. Williams. It is a weighted version of the Manhattan distance and has been used as a metric for comparing ranked lists and for intrusion detection in computer security. This choice made the visualization much clearer; we also tried the Manhattan and Euclidean distances.

We sampled 20 towns from the Covid data frame; after reading about and comparing the distance measures available to us, the Canberra distance proved the most suitable for the next parts. A minimal sketch of the construction follows.
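The sketch assumes the covid_sub data frame from the earlier step, with a town column and numeric monthly columns; these names are illustrative, not the actual ones in our script.

covid_scaled <- scale(covid_sub[, sapply(covid_sub, is.numeric)]) #scale the numeric columns
rownames(covid_scaled) <- covid_sub$town                          #label rows by town name

covid_dist <- dist(covid_scaled, method = "canberra") #Canberra distance between towns
covid_hc   <- hclust(covid_dist)                      #hierarchical clustering
plot(covid_hc, main = "Covid-19 data, Canberra distance")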

3. Construct a hierarchical tree for the demographic data. Decide how to define distances between two cities so that the results are meaningful.

Distances calculation method:

As in question 2.2, the Manhattan distance from Lect3B_Distances (page 8) produced a poor plot, so we again use the Canberra distance, which gave a much clearer visualization (we also tried the Manhattan and Euclidean distances).

For the 20 sampled towns from the CBS data frame as well, the Canberra distance proved the most suitable for the next parts, after reading about and comparing the distance measures available to us. The construction is the same as in the sketch above, with the scaled demographic columns in place of the Covid columns.

4. Compare the two hierarchies. Comment on similarities and differences.

A tanglegram plot shows two dendrograms (with the same set of labels) facing each other, with their labels connected by lines. Tanglegrams can be used to visually compare two methods of hierarchical clustering, and are sometimes used in biology when comparing two phylogenetic trees.

Since we scaled and normalized the columns in both data frames, our vectors are zero-mean, with most of the values near zero themselves. This made the Canberra distance suitable for the previous parts, after reading about and comparing the distance measures available to us. A sketch of the comparison is given below.
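A hedged sketch of the comparison with dendextend, assuming covid_hc and demo_hc are the hclust objects built from the Covid and demographic data respectively:

library(dendextend)

dend_covid <- as.dendrogram(covid_hc) #Covid tree as a dendrogram
dend_demo  <- as.dendrogram(demo_hc)  #demographic tree as a dendrogram

#draw the two dendrograms facing each other, matching labels connected by lines
tanglegram(dend_covid, dend_demo,
           main_left = "Covid-19", main_right = "Demographics")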

Similarities:

Apart from the city names, the two trees are barely alike. However:

  1. The city Shoam is colored the same in both trees.

  2. Or Akiva & Zarzir, Mevaseeret Zion & Majdal Shams, and Kazrin & Kdumim are grouped together in both trees.

  3. Hadera and Hulon each appear as a single-leaf subtree in the Covid tree but are grouped together in the demographic tree; we read this as a high similarity of distances in both trees.

Also, cities that are socio-economically similar are close in both trees. We can assume that the socio-economic rank affects the spread of Covid.

Differences:

As displayed, most towns are grouped with different cities in each tree, ending up closer to new neighbours.

Also, the number of cities in each cluster (subtree), out of the 6 clusters we set, differs between the two trees.

Third, the demographic tree is quite balanced compared to the Covid tree: in the Covid tree most of the branches (15/20) are grouped together on one side, whereas the CBS tree has its branches split more evenly.

5. Choose a similarity score for the two trees. You can base your score on one of the scores implemented in the dendextend package, including Baker's Gamma, the cophenetic correlation or the Fowlkes-Mallows (Bk) index. Explain what the score is measuring and what a high score means.

## The baker's index for the hierarchical trees is: 0.0277

We chose Baker's Gamma Index and obtained the value above, which means the two trees are not statistically similar. Baker's Gamma is the rank correlation between the stages at which pairs of objects merge in each of the two trees. The value can range from -1 to 1, with values near 0 meaning the two trees are not statistically similar. For an exact p-value one should use a permutation test; one option is to permute the labels of one tree many times and compute the distribution of the score under the null hypothesis (keeping the tree topologies constant).

It is calculated by taking two items and finding the highest possible level k (the number of cluster groups created when cutting the tree) at which the two items still belong to the same subtree. That k is recorded, and the same is done for these two items in the second tree. There are n choose 2 such pairs of items in the tree, and these numbers are computed for each of the two trees. Then the two sets of numbers (one per tree) are paired according to the pairs of items compared, and a Spearman correlation is calculated.
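A minimal sketch of the computation with dendextend's cor_bakers_gamma (see the sources); dend_covid and dend_demo are the dendrogram objects assumed in the earlier sketch.

library(dendextend)

bakers_gamma <- cor_bakers_gamma(dend_covid, dend_demo) #Baker's Gamma between the two trees
round(bakers_gamma, 4)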

The demographic tree is quite balanced and the Covid-19 tree is not, meaning each half of the demographic tree is almost the same size.

6. Find a background distribution for this score, assuming the labels of the trees are completely unrelated. To do this, randomly permute the labels (city names) of one tree, keeping the labels of the other tree fixed. For each randomization, compute the similarity score.

To assess whether this value of Baker's Gamma Index is statistically significant, we need to perform a permutation test. Let us look at the distribution of Baker's Gamma Index under the null hypothesis; it will differ for different tree structures and sizes. Here are the results when a tree is compared with itself (after shuffling its own labels) and when tree 1 is compared with the shuffled tree 2.
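A sketch of the permutation under assumed settings (1000 permutations, shuffling the demographic tree's labels while keeping the Covid tree fixed):

set.seed(123)
R <- 1000
bakers_gamma_null <- numeric(R)
dend_shuffled <- dend_demo
for (i in 1:R) {
  labels(dend_shuffled) <- sample(labels(dend_demo)) #permute the city names
  bakers_gamma_null[i] <- cor_bakers_gamma(dend_covid, dend_shuffled)
}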

7. Display the results as a histogram approximating the null-distribution scores, with a vertical line showing your score. Compute the estimated p-value for the observed distribution.

## Our p-value is: 0.45
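A sketch of the histogram, the vertical line for our observed score, and one common way to estimate the p-value (the two-sided proportion of permuted scores at least as extreme as the observed one); it reuses bakers_gamma_null from the permutation sketch above.

hist(bakers_gamma_null, breaks = 30,
     main = "Baker's Gamma under the null hypothesis", xlab = "Baker's Gamma")
abline(v = bakers_gamma, col = "red", lwd = 2)               #our observed score

p_value <- mean(abs(bakers_gamma_null) >= abs(bakers_gamma)) #estimated p-value
round(p_value, 2)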

8. Explain your results in light of the null hypothesis you were testing.

The smaller the p-value, the stronger the evidence against the null hypothesis. A p-value below 0.05 is conventionally considered statistically significant: it indicates that, if the null hypothesis were true, a result at least this extreme would occur less than 5% of the time. Our null hypothesis was that the labels of the trees are completely unrelated, as per question 2.6. That is, \[H_0:\ \text{Baker's Gamma Index} = 0 \qquad H_1:\ \text{Baker's Gamma Index} \neq 0\]

As per the previous section and the permutation test, the p-value is greater than 0.05, so we do not reject \(H_0\); we have no statistical evidence that the hierarchical trees match each other.

The correlation between the demographic and Covid data sets is not significant, as we also showed in the Baker's index calculation.

############################################################

3.3 Shiny apps

The app should allow a researcher to find an interesting clustering of the tissues and to visually inspect the results. This will be based on a Shiny app. To create the app you will need to use RStudio: go to New File > New Shiny App and create a new single-file app.

A Shiny app requires inputs and a plotting function. Most of your changes are to the plotting function renderPlot. Change it so that it:

1. Runs a K-means algorithm (using your function).

2. Shows a two-dimensional rendering of your tissues, with colors corresponding to the resulting clusters.

The input is a slider bar; change it so that it reads the number of clusters. A minimal sketch of such an app is given below.
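The sketch is in the spirit of the Shiny k-means gallery example listed in the sources. tissues_df is a hypothetical data frame of tissue measurements, the first two columns are used for the 2-D rendering, and base kmeans stands in for your own K-means function.

library(shiny)

ui <- fluidPage(
  titlePanel("K-means clustering of tissues"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("k", "Number of clusters:", min = 2, max = 9, value = 3)
    ),
    mainPanel(plotOutput("clusterPlot"))
  )
)

server <- function(input, output) {
  output$clusterPlot <- renderPlot({
    clusters <- kmeans(tissues_df, centers = input$k) #replace with your own K-means function
    plot(tissues_df[, 1:2], col = clusters$cluster, pch = 19, #2-D rendering of the tissues
         main = paste("K-means with k =", input$k))
  })
}

shinyApp(ui = ui, server = server)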

Main sources:

-http://www.sthda.com/english/wiki/beautiful-dendrogram-visualizations-in-r-5-must-known-methods-unsupervised-machine-learning

-https://en.wikipedia.org/wiki/Canberra_distance

-https://www.statology.org/dist-function-in-r/

-https://cran.r-project.org/web/packages/dendextend/vignettes/dendextend.html#comparing-two-dendrograms

-https://rdrr.io/cran/dendextend/man/cor_bakers_gamma.html

-https://cran.r-project.org/web/packages/dendextend/vignettes/dendextend.html#tanglegram

-https://r-lang.com/scale-function-in-r/

-http://www.sthda.com/english/wiki/abline-r-function-an-easy-way-to-add-straight-lines-to-a-plot-using-r-software

-https://www.r-bloggers.com/2021/12/how-to-use-the-scale-function-in-r/

-https://shiny.rstudio.com/gallery/kmeans-example.html

-https://github.com/duf59/shiny-kmeans/blob/master/server.R

-https://github.com/duf59/shiny-kmeans/blob/master/shiny-kmeans.Rproj

-https://github.com/duf59/shiny-kmeans/blob/master/ui.R

-https://github.com/DeepanshKhurana/shinyapp-k-means-iris/blob/master/app.R

-https://medium.com/codex/r-shiny-app-creating-an-app-on-r-shiny-and-sharing-with-shinyapps-io-a6f519c02f48

-https://docs.faculty.ai/user-guide/apps/rshiny.html

-https://rstudio-pubs-static.s3.amazonaws.com/112852_17fd2e0d9ef140ae89550dc4824975f0.html

-https://stackoverflow.com/questions/23050928/error-in-plot-new-figure-margins-too-large-scatter-plot

-https://stackoverflow.com/questions/37380299/how-to-set-plotly-chart-to-have-a-transparent-background-in-r

-https://www.rdocumentation.org/packages/SimDesign/versions/2.8/topics/rmvnorm

-https://www.r-bloggers.com/2017/05/5-ways-to-measure-running-time-of-r-code/

-https://www.rdocumentation.org/packages/combinat/versions/0.0-8/topics/permn