Introduction. The biological problem

  • In this class we will shift to another statistical concept. Instead of looking at measures of central tendency, we will focus on variation and variability, and on how we can translate them into valuable information in various contexts. The analysis of variation is key to many data analysis techniques, including the comparison of multiple samples through ANOVA (Analysis of Variance), the handling of complex data through dimensionality reduction, and clustering methods that take data variability into account.

  • The topics of this lecture include:

    • A. The use of variance in Analysis of Variance, a family of methods that take advantage of distribution variances in order to draw conclusions on their means
    • B. Techniques that assess variation in data in order to analyze complex datasets through approaches that are grouped under the collective term of Dimensionality Reduction
    • C. Clustering approaches that build on data variability

Part A. Using Variance to improve our statistical inference

Variance (or dispersion) analysis is a very useful methodological tool in statistical analysis. It is based on the comparison of variances between samples. In the simplest case, we may compare the variances of two samples with the F-test (or variance test). In R, the corresponding function is called \(var.test()\) and its application is straightforward, as shown below. We will perform the test on two subsets of a broader dataset that contains the weights of chicks depending on the feed supplement they were given, comprising 71 measurements from 6 different feeds:

str(chickwts)
## 'data.frame':    71 obs. of  2 variables:
##  $ weight: num  179 160 136 227 217 168 108 124 143 140 ...
##  $ feed  : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...

We would first like to see how we can compare the variances in a pairwise manner, choosing as an example the comparison between linseed and horsebean.

# indices of the measurements for each of the two feeds
which(chickwts$feed=="linseed")->linseed
which(chickwts$feed=="horsebean")->horsebean
# F test comparing the variances of the corresponding weight values
var.test(chickwts$weight[linseed],chickwts$weight[horsebean])
## 
##  F test to compare two variances
## 
## data:  chickwts$weight[linseed] and chickwts$weight[horsebean]
## F = 1.8289, num df = 11, denom df = 9, p-value = 0.3739
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.4674894 6.5617411
## sample estimates:
## ratio of variances 
##           1.828854

An overview of the output shows that it is very similar to that of \(t.test()\). The corresponding statistic is F. Degrees of freedom, the p-value and a confidence interval are also included in the output. The reference value here is the ratio of the two variances and the baseline value is 1, since in the case of equal variances \(var_A/var_B=1\). In this example, although the variance of linseed is about 1.8 times that of horsebean, the difference is not considered statistically significant. (Notice the p-value, and in addition how the value 1 is included within the 95% confidence interval.)
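
In fact, the F statistic reported above is nothing more than the ratio of the two sample variances, something we can verify directly:

# should match the "ratio of variances" reported by var.test() (about 1.83)
var(chickwts$weight[linseed])/var(chickwts$weight[horsebean])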

Now, compare this to:

t.test(chickwts$weight[linseed],chickwts$weight[horsebean])
## 
##  Welch Two Sample t-test
## 
## data:  chickwts$weight[linseed] and chickwts$weight[horsebean]
## t = 3.0172, df = 19.769, p-value = 0.006869
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  18.0403 99.0597
## sample estimates:
## mean of x mean of y 
##    218.75    160.20

which tells us that two samples may have a statistically significant difference in their mean values, while this difference is not reflected at the level of their variances.

Up to this point we have seen how to compare two samples against each other. But what happens when more than two samples are to be compared? One easy answer is to perform pairwise t-tests on all possible combinations, like this:

pairwise.t.test(chickwts$weight, chickwts$feed, p.adjust.method = "fdr")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  chickwts$weight and chickwts$feed 
## 
##           casein  horsebean linseed meatmeal soybean
## horsebean 1.6e-08 -         -       -        -      
## linseed   4.5e-05 0.0228    -       -        -      
## meatmeal  0.0570  2.8e-05   0.0225  -        -      
## soybean   0.0012  0.0007    0.2187  0.1991   -      
## sunflower 0.8125  1.2e-08   2.8e-05 0.0360   0.0007 
## 
## P value adjustment method: fdr

This is not entirely unjustified, although it comes at a cost, which you may see in the p.adjust.method parameter. The adjustment of the p-values is necessary whenever we perform multiple comparisons (as we do here), because when a large number of p-values is produced we need to apply stricter thresholds to what we will consider small enough. (There is a lot of statistics behind p-value adjustment and corrections for multiple testing, but you can read a nice and compact introduction on the topic here.)
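
To get a feel for what the adjustment does, we can re-run the pairwise tests without any correction and adjust the resulting p-values ourselves with \(p.adjust()\) (a small sketch; the object names are ours):

# raw (unadjusted) p-values from the pairwise t-tests
rawp <- pairwise.t.test(chickwts$weight, chickwts$feed, p.adjust.method = "none")$p.value
rawp <- rawp[!is.na(rawp)]
# FDR-adjusted p-values are always at least as large as the raw ones
cbind(raw = rawp, fdr = p.adjust(rawp, method = "fdr"))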

Going beyond pairwise comparisons and back to variance, there is a great way to use the variance of multiple samples to our advantage in order to draw more accurate conclusions about differences in means. Below we see how.

Analysis of Variance (ANOVA)

The term Analysis of Variance refers to a set of statistical modeling analyses that aim to estimate differences in the mean values between groups in complex samples. ANOVA is the methodological approach of choice for comparing more than two samples. Conceptually, it does not differ much from a multi-pair t-test approach; however, it is generally more rigorous and thus leads to more conservative results with fewer type I errors, i.e. it rarely leads to the rejection of null hypotheses that are true.

The basis of ANOVA is the estimation of the difference in the mean value of a parametric variable in relation to one or more categorical variables. This may sound too technical at first, but it basically means that, given a set of values grouped by one or more categorical variables, we are looking to disentangle the variability within each category from the variability between the categories.

ANOVA

ANOVA in theory

The best way to understand how this is done is to look at a step-by-step example. We will return to the \(chickwts\) data and take a look at all the feed categories visually:
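
A chart of this kind can be produced in several ways; below is one possible sketch using ggplot2 (the original figure may have been drawn differently), showing the weights per feed in separate panels, with each category's own mean highlighted and the overall mean as a solid black line running through all panels:

library(ggplot2)
# one panel per feed; black line = overall mean, red point = per-feed mean
ggplot(chickwts, aes(x = feed, y = weight)) +
  geom_jitter(width = 0.1) +
  stat_summary(fun = mean, geom = "point", colour = "red", size = 3) +
  geom_hline(yintercept = mean(chickwts$weight)) +
  facet_wrap(~feed, scales = "free_x", nrow = 1)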

What more can we learn from such a chart? In it we see that the total dispersion of the values in each category can be broken down into two components. The first is the degree to which the individual mean of each category differs from the overall mean (the solid black line running through all panels) and the second is the degree to which the values are scattered around the mean of their own category. We can see the first as "signal" and the second as "noise". This is because the former is indeed indicative of the difference between the categories in terms of their mean values, while the latter has to do with the dispersion of the values within each category.

The purpose of ANOVA is precisely to assess the weight of the former in relation to the latter. You can read more about ANOVA here. Below we will see how we can apply it in practice.

ANOVA in practice

ANOVA essentially belongs to a broader class of statistical models called generalized linear models. The aim of the model is to estimate the relationship between a response variable and one or more explanatory variables at the level of variance.

R specifies such models with a formula notation, separating the response variable from one or more explanatory variables with the “~” symbol. In the general case we are interested in analyzing the combined/isolated variance of a ResponseVariable against one, two or more explanatory, categorical variables (A, B, C etc). The general “grammar” looks something like this:

aov (ResponseVariable ~ A + B + C …, data = data.frame)

where ResponseVariable is the parametric variable whose mean values we are interested in testing and A, B, C, etc. are the categorical variables on the basis of which we want to test it. In our case we only have: - a response variable (weight) and - one categorical variable (the dietary supplement, feed).

Based on the above the call of the ANOVA R function \(aov()\) will be:

aov(weight~feed, data=chickwts)->fit
summary(fit)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## feed         5 231129   46226   15.37 5.94e-10 ***
## Residuals   65 195556    3009                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This creates a model object (which we have called \(fit\)), whose contents we inspect with the \(summary()\) function shown above. Notice the F value and its associated probability, Pr(>F).
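
The F value itself is simply the ratio of the between-group mean square to the within-group mean square. As a quick sanity check (a sketch, not part of the original analysis), we can reproduce it by hand:

# ANOVA decomposition by hand: between-group vs within-group variability
overall <- mean(chickwts$weight)
ss_between <- sum(tapply(chickwts$weight, chickwts$feed, function(x) length(x)*(mean(x)-overall)^2))
ss_within <- sum(tapply(chickwts$weight, chickwts$feed, function(x) sum((x-mean(x))^2)))
df_between <- nlevels(chickwts$feed)-1               # 5
df_within <- nrow(chickwts)-nlevels(chickwts$feed)   # 65
(ss_between/df_between)/(ss_within/df_within)        # should reproduce the F value of ~15.4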

What does the associated probability tell us? At a first level, that its value is small enough to justify the rejection of the null hypothesis, and therefore that the data of the different categories do not all come from the same distribution. On its own, however, this is rather uninformative, as it tells us nothing about which dietary supplements behave best. We can obtain this information by applying a confidence interval calculation to the differences in the mean values of the individual categories, with the \(TukeyHSD()\) function acting on the object created by ANOVA.

aov(weight~feed, data=chickwts)->fit
TukeyHSD(fit, conf.level = 0.99)
##   Tukey multiple comparisons of means
##     99% family-wise confidence level
## 
## Fit: aov(formula = weight ~ feed, data = chickwts)
## 
## $feed
##                            diff         lwr        upr     p adj
## horsebean-casein    -163.383333 -245.964363 -80.802304 0.0000000
## linseed-casein      -104.833333 -183.571256 -26.095411 0.0002100
## meatmeal-casein      -46.674242 -127.181777  33.833292 0.3324584
## soybean-casein       -77.154762 -153.028522  -1.281001 0.0083653
## sunflower-casein       5.333333  -73.404589  84.071256 0.9998902
## linseed-horsebean     58.550000  -24.031030 141.131030 0.1413329
## meatmeal-horsebean   116.709091   32.439113 200.979069 0.0001062
## soybean-horsebean     86.228571    6.373743 166.083400 0.0042167
## sunflower-horsebean  168.716667   86.135637 251.297696 0.0000000
## meatmeal-linseed      58.159091  -22.348444 138.666626 0.1276965
## soybean-linseed       27.678571  -48.195189 103.552332 0.7932853
## sunflower-linseed    110.166667   31.428744 188.904589 0.0000884
## soybean-meatmeal     -30.480519 -108.189144  47.228105 0.7391356
## sunflower-meatmeal    52.007576  -28.499959 132.515111 0.2206962
## sunflower-soybean     82.488095    6.614335 158.361856 0.0038845

and we can plot these results, including their confidence intervals as you see below:

aov(weight~feed, data=chickwts)->fit
TukeyHSD(fit, conf.level = 0.99)->tukey
par(las=1)            # horizontal axis labels
par(mar=c(5,10,2,2))  # wide left margin for the long comparison names
plot(tukey)
# annotate each interval with -log10 of its adjusted p-value
for(i in 1:length(tukey$feed[,1])){
  text(x=tukey$feed[i,1],y=length(tukey$feed[,1])-i+1.3, labels=round(-log10(tukey$feed[i,4]), digits=3), cex=0.4)
}

Take a look at the final outcome of the ANOVA analysis and compare it to the pairwise t-tests we saw above. You will notice that some of the p-values are greater in the ANOVA, and thus some comparisons that were deemed significant at a given level by the t-tests are not significant for the ANOVA (at the same level). This is a nice demonstration of how taking the combined variances into account allows us to be more rigorous and avoid type I errors (false positives).

Part B. Dimensionality Reduction for Complex Data

Having seen how we can take advantage of variance in order to improve our statistical armour, we will now move to a different topic: that of analyzing and making sense of complex data.

Imagine that you are observing an object in three dimensions with a rather complex shape, folds, cavities, bumps, etc. In fact, what you are doing is looking for the vantage points (the viewing angles) that allow you to better discern its peculiarities. Now imagine that instead of an object you have a cloud of points corresponding to the elements for which you have measured three variables. In a graphic representation of them in three dimensions you will try to rotate the axes in such a way that you can better distinguish if your points are evenly distributed, create groups, etc.

In the example shown below we plot three measurements of demographic data from the 50 states of the USA.

USStateData<-state.x77
head(USStateData)
##            Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
## California      21198   5114        1.1    71.71   10.3    62.6    20 156361
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
library(scatterplot3d)
scatterplot3d(USStateData[,1:3], pch=19, color="dark red", angle=35)

You see that the data referring to the population, the average income and the percentage of illiterate people are more or less evenly distributed. We can try to rotate the axes a little using the angle parameter:

library(scatterplot3d)
scatterplot3d(USStateData[,1:3], pch=19, color="dark red", angle=75)

and we see that indeed a slight rotation of the axes allows us to detect differences in the initially seemingly uniform set of points. This may be fine for a 3-D space defined by three variables, but what if we had 4, 5 or 20? In fact, for this dataset we have 8 different variables. What can we make of the states' behaviour by looking at 8 variables at the same time?

This is what we call the “curse of dimensionality”. A large number of variables define an equally large dimensional space in which we have to search for patterns. You can read a bit more about it (although in rather abstract terms) here. The main point to keep in mind is that, in order to discern patterns in high-dimensional data, we first need to define the parts of the data that contain the largest variability. This is because by identifying variation we can then use it to look into the data in a clearer way. (You can think of it intuitively as looking for ways to “spread” the data as much as possible).

In the following we will see two methods for reducing and interpreting data dimensionality: Principal Component Analysis (PCA) and t-distributed Stochastic Neighbour Embedding (t-SNE).

Principal Component Analysis (PCA)

PCA is a method of data transformation: it aims to represent the data in a space that highlights their greatest variation and thus helps us better discern the existence of distinct groups. The main difference from the original representation is that the variables are not used as they are; instead we use linear transformations of them, which are called principal components. These transformations are essentially new variables, corresponding to new axes that capture the maximum variance of the elements.

PCA. Notice how the coloured axes represent the first two PCs, which are, in addition, orthogonal to each other

Each such component is the result of a combination of all variables, with the first principal component (or PC) corresponding to the axis along which the points show the greatest possible variance. The process continues for the second principal component, the third and so on. Each PCA analysis will eventually generate a number of components equal to the number of variables (and therefore the dimensions).

But what do we gain? The fact that, by construction, the analysis distributes the overall variation unevenly, giving priority to the first principal components. In this way, even for tens or hundreds of variables, a large share of the total variation (often 80% or 90%) is concentrated in the first two or three principal components, which are then enough to represent our data reasonably well. By now you should have understood that variance is a proxy for information, and that is exactly where the concept of reducing dimensionality lies.

PCA in practice

Let’s look at how we can use PCA in R to analyze the States data we only superficially addressed in the 3D plot above. We will first apply PCA with the \(prcomp()\) function to create a PCA object

statespca <- prcomp(USStateData[,1:8], scale = T)  # scale = T standardizes each variable before the analysis

and then use the \(summary()\) function that we have already encountered to see the contents of the analysis

summary(statespca)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.8971 1.2775 1.0545 0.84113 0.62019 0.55449 0.38006
## Proportion of Variance 0.4499 0.2040 0.1390 0.08844 0.04808 0.03843 0.01806
## Cumulative Proportion  0.4499 0.6539 0.7928 0.88128 0.92936 0.96780 0.98585
##                            PC8
## Standard deviation     0.33643
## Proportion of Variance 0.01415
## Cumulative Proportion  1.00000

We first see that there are as many components as there are variables. But notice the last row of the table above (which corresponds to the cumulative proportion of variance) and see how the variance is unevenly distributed across the first PCs. This practically tells us that we can capture over 65% of the information by looking at just the first two components. How do we do it? We can plot the data with appropriate functions shown below (using the factoextra library).
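
Before moving to the plots, a quick way to visualize this uneven distribution is a scree-type plot of the proportion of variance per component (a small sketch in base R; the factoextra package used below also provides \(fviz_eig()\) for the same purpose):

# proportion of variance captured by each principal component
pvar <- statespca$sdev^2/sum(statespca$sdev^2)
barplot(pvar, names.arg=paste0("PC", 1:length(pvar)), las=1, ylab="Proportion of variance")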

What does it look like?

We may first visualize the PC scores (i.e. the coordinates) of each object (State) in our data in the two-dimensional space defined by the first two PCs.

library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_pca_ind(statespca, col.ind="cos2", gradient.cols=c("steelblue4", "grey", "firebrick4"))

Notice how a 2D plot of PCA data is much more informative than a 3D plot of unprocessed data. In the plot above we are using the information on all 8 variables and thus we can achieve a clearer distinction. Not all states are alike (as one would expect).

What matters more?

The plot we see above is the result of combining information from all 8 variables, but which variables are responsible for placing North Dakota in the bottom left and Texas in the top right? We can see the relative contributions of the original variables as vectors in the same 2D space, like this:

library(factoextra)
fviz_pca_var(statespca, col.var="contrib", gradient.cols=c("steelblue4", "grey", "firebrick4"))

Here we see that what makes a state appear at the bottom left is largely associated with low temperatures (the Frost variable) and life expectancy, while high Murder rates are likely to drive a state in the opposite direction of the plot. Also notice that arrows pointing in the same direction represent features that are "correlated" (a concept we will discuss in a later class).
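
If you prefer numbers to arrows, the same information is stored in the rotation element of the \(prcomp()\) object, which holds the loadings of each original variable on each component:

# loadings of the original variables on the first two principal components
round(statespca$rotation[,1:2], 2)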

PCA and grouping data

PCA can help us define groups

Dimensionality reduction techniques are often presented next to clustering approaches (and this class is no exception). This is because they can assist us in visually identifying groups in the data. Although PCA is not the best example of such a function, we can see how this is done if we create a grouping of our own and then project it on the PCA plot. We can do this by working directly on the data frame that contains the original data.

In the following we add an additional variable labeled "State" to the original data, in which we assign each state to a broad geographical region.

USStateData<-as.data.frame(state.x77)
USStateData$State<-""
USStateData$State[which(row.names(USStateData) %in% c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont", "New Jersey", "New York", "Pennsylvania", "Illinois", "Indiana", "Michigan", "Ohio", "Wisconsin"))]<-"North"
USStateData$State[which(row.names(USStateData) %in% c("Iowa", "Kansas", "Minnesota", "Missouri", "Nebraska", "North Dakota", "South Dakota", "Utah", "Wyoming", "Colorado", "Idaho", "Montana"))]<-"MidWest"
USStateData$State[which(row.names(USStateData) %in% c("Delaware", "Florida", "Georgia", "Maryland", "North Carolina", "South Carolina", "Virginia", "West Virginia", "Alabama", "Kentucky", "Mississippi", "Tennessee", "Arkansas", "Louisiana", "Oklahoma", "Texas"))]<-"South"
USStateData$State[which(row.names(USStateData) %in% c("Arizona", "Nevada", "New Mexico", "Alaska", "California", "Hawaii", "Oregon", "Washington"))]<-"West"

Let’s now see how these 4 geographical categories (North, South, West, Midwest) are distributed in the 2D PCA plot

library(ggfortify)
autoplot(statespca, data=USStateData, colour="State", frame=T)

We see that the geographical position of a state matters. Each polygon encloses the states that correspond to a certain geographical region. There is a more or less clear divide between North and South, while MidWest and West are also clearly different in many respects. We thus see that PCA has captured some of the characteristics that are particular to the geographical position of the states, based on demographic data alone. A nice example of how we can use dimensionality reduction to get information on the "granularity" of the data.

t-distributed Stochastic Neighbour Embedding (t-SNE)

PCA is based on linear transformations of the variables and their relationships. In many cases, however, especially when the number of variables is very large, the relationships between the variables are non-linear, and methods such as PCA may then capture the structure of the data poorly. Methods that overcome this problem are based on non-linear and probabilistic approaches, in which the relationships between the elements are not restricted to linear ones. An extremely effective methodology of this type is t-distributed Stochastic Neighbour Embedding (t-SNE), which is essentially another mapping method applied to large-scale data for visualization in two or three dimensions.

t-SNE in practice

We will not go into the (rather tedious) technical details of t-SNE but will instead present it directly in a molecular biology context, as t-SNE is one of the methods of preference for the analysis of single-cell (sc) genomics data. In sc analyses we measure a large number of cells for many features at once (e.g. the expression or methylation of hundreds or thousands of genes). This makes these datasets extremely high-dimensional (the dimension is equal to the number of genes analyzed), and thus even PCA struggles. t-SNE can analyze this sort of data in a fast and efficient way.

Below we see how we can use R to analyze a (preprocessed) single-cell dataset. We will use data from a published paper on cell fate decisions during mouse gastrulation (Mohammed et al., Cell Reports, 2017).

We have already downloaded and filtered the data, which we now load into R.
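
The exact loading step depends on how the filtered table was saved; a minimal sketch, assuming a tab-delimited text file (the file name below is hypothetical), would be:

# hypothetical file name; check.names=FALSE keeps the repeated timepoint column names (E3.5, E4.5, ...)
allrawcounts <- read.table("mohammed2017_filtered_counts.txt", header=TRUE, sep="\t", check.names=FALSE)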

They comprise a 24483 x 722 table, meaning that 722 cells have been analyzed for more than 24,000 genes.

dim(allrawcounts)
## [1] 24483   722

As this is a time-course experiment, we also know the time at which the cells were drawn from the sample, and so we can see that there are:

table(colnames(allrawcounts))
## 
##        E3.5        E4.5        E5.5        E6.5       E6.75 Gene.Symbol 
##          99         105         267         168          82           1

cells drawn at days E3.5, E4.5, E5.5, E6.5 and E6.75 of embryonic development (the single Gene.Symbol entry corresponds to the column holding the gene names). We are mostly interested in the first transition, between E3.5 and E4.5, so we can subset only these cells from the population:

allcounts<-allrawcounts[,c(1, which(colnames(allrawcounts)=="E4.5" | colnames(allrawcounts)=="E3.5" ))]

We will now try to see whether these two categories are distributed differently in a reduced space. One thing that needs to be done first is to filter out the genes that have very low expression, since they are not informative. We will only keep genes that have more than 1000 reads in total across all cells.

selcountdata<-allcounts[which(rowSums(allcounts[,2:205])>1000),]
dim(selcountdata)
## [1] 8880  205

This means we are left with 204 cells from the two timepoints (205 columns, including the gene-symbol column), measured in the space of 8880 genes. It is in this space that we will try to apply t-SNE. As we want to project the reduced data into two dimensions, we will set this as a parameter:

library(Rtsne)
set.seed(1)  # t-SNE is stochastic; fixing the seed makes the map reproducible
# transpose so that rows are cells and columns are genes; the Gene.Symbol column is left out
selcountdatatsne<-Rtsne(t(selcountdata[,2:205]), dims=2, check_duplicates = FALSE)

The object that we have created carries all the necessary information; the two-dimensional coordinates are stored in a matrix named Y within the returned list:

str(selcountdatatsne)
## List of 14
##  $ N                  : int 204
##  $ Y                  : num [1:204, 1:2] -5.23 -4.78 -4.79 -5.98 -5.97 ...
##  $ costs              : num [1:204] -5.36e-04 9.98e-04 1.04e-03 1.36e-04 -7.52e-05 ...
##  $ itercosts          : num [1:20] 48.4 48.5 48.8 48.2 48.2 ...
##  $ origD              : int 50
##  $ perplexity         : num 30
##  $ theta              : num 0.5
##  $ max_iter           : num 1000
##  $ stop_lying_iter    : int 250
##  $ mom_switch_iter    : int 250
##  $ momentum           : num 0.5
##  $ final_momentum     : num 0.8
##  $ eta                : num 200
##  $ exaggeration_factor: num 12

As a final step we will plot these data, colouring the points by their origin. We know from the structure of the original table that the first 99 columns were E3.5 and the next 105 were E4.5, so we can use these positions as indices to save time:

plot(selcountdatatsne$Y, main="tSNE between E3.5 and E4.5", xlab="tSNE1", ylab="tSNE2", cex.main=2, cex.lab=1.4, las=1)
lines(selcountdatatsne$Y[1:99,], type="p", pch=19, col="steelblue4")
lines(selcountdatatsne$Y[100:204,], type="p", pch=19, col="firebrick4")
legend("bottomleft", c("E3.5", "E4.5"), fill=c("steelblue4", "firebrick4"), bty="n")

You see that the discrimination is near perfect. We thus have a clear and fast way to see that the two developmental stages are clearly distinct at the gene expression level, without looking at ~9000 gene expression values for each cell but just at two (reduced) dimensions.

In the case of PCA, we saw how we can use the variable analysis to figure out which of the variables is most associated with each state. In this case the variables number ~9000. How could we figure out which of these ~9000 genes are most "responsible" for the E3.5-E4.5 differences? We will revisit this question in a later chapter.

Part C. Taking advantage of variation in clustering

As a last example of the uses of variation in data analysis, we will revisit the problem of identifying groups/clusters in multidimensional data. Remember that we have seen how hierarchical clustering performs an analysis of distances before iteratively joining the items with the smallest distances. This is not the only way to cluster things; in fact, most commonly used clustering techniques take advantage of the variability of the data in a space and then go on to partition that space into areas that define the groups (= clusters). This is what is called Partition Clustering, with the most frequently used approaches being k-means clustering and Partitioning Around Medoids (PAM).

In the following we will present k-means, as it is the most frequently employed and the most intuitive to explain.

k-means. Step 1: Calculate the optimal number of groups

The main difference between hierarchical methods and partition methods is that the latter require the number of groups as input. We thus need to preset the number of clusters we are going to create. There are many ways to do this, but we will refrain from presenting a detailed analysis here. The general principle behind any approach to estimating the optimal number of groups is "group coherence". A small number of groups is expected to have a low degree of coherence, because each group contains a large number of elements with expectedly larger variance. On the other hand, a large number of groups will lead to coherent groups, but due to their number they will not have the same ability to interpret the data. One group-coherence measure is the Within-Sum-of-Squares (WSS), which, when applied to the US state data (a sketch of this computation is shown right below), gives us an optimal number of 2. We will thus try to cluster the state data in two groups with k-means as follows:
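
One common way to apply the WSS criterion is the "elbow" heuristic: run k-means for increasing values of k and look for the point after which the total WSS stops dropping sharply. A minimal sketch (using the same five variables as in the clustering below; the object name wss is ours):

set.seed(1)
# total within-cluster sum of squares for k = 1..8
wss <- sapply(1:8, function(k) kmeans(USStateData[,2:6], centers=k, iter.max=100, nstart=10)$tot.withinss)
plot(1:8, wss, type="b", pch=19, xlab="Number of clusters k", ylab="Total within-cluster SS", las=1)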

k-means. Step 2: The algorithm

K-means is based on an iterative convergence process that uses distances from the mean values of a given number of groups as the criterion for allocating the data. Thus, the inputs are the table of data with the values of the variables and the desired number of groups. For a set of \(p\) items measured on \(n\) variables, to be allocated to \(k\) groups, the procedure (sketched in code after the steps below) is as follows:

k-means algorithm schematically

  1. The \(p\) elements are randomly divided into groups equal to the desired number (\(k\)).

  2. For each group, an average value (k-mean) is calculated for all \(n\) variables. This mean value is defined as a vector of length \(n\) called a centroid.

  3. The Euclidean distances of all elements from all centroids are calculated. For each element there are \(k\) such distances.

  4. The elements are redistributed into \(k\) groups based on the minimum distance of each element from the centroids. This creates new groups, again \(k\) in number.

  5. Steps 2-4 are repeated until the group assignments no longer change (convergence) or a maximum number of iterations is reached.
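
To make the iteration above concrete, here is a small illustrative sketch written in plain R (the function name simple_kmeans is ours; in practice we use the built-in \(kmeans()\) shown in the next step):

# an illustrative sketch of the k-means iteration (empty groups are not handled, for simplicity)
simple_kmeans <- function(x, k, max_iter=100) {
  x <- as.matrix(x)
  groups <- sample(rep(1:k, length.out=nrow(x)))     # step 1: random initial allocation
  for (iter in 1:max_iter) {
    # step 2: one centroid (vector of column means) per group
    centroids <- sapply(1:k, function(g) colMeans(x[groups==g, , drop=FALSE]))
    # step 3: Euclidean distance of every element from every centroid
    dists <- sapply(1:k, function(g) sqrt(colSums((t(x)-centroids[,g])^2)))
    # step 4: reassign each element to its nearest centroid
    newgroups <- apply(dists, 1, which.min)
    if (all(newgroups == groups)) break              # step 5: stop at convergence
    groups <- newgroups
  }
  groups
}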

k-means. Step 3: Clustering

R has a homonymous function called \(kmeans()\), which takes the dataset and the number of groups as arguments; it is also a good idea to set a maximum number of iterations in case there is no convergence:

kmeans(USStateData[, 2:6], centers = 2, iter.max = 100)
## K-means clustering with 2 clusters of sizes 20, 30
## 
## Cluster means:
##     Income Illiteracy Life Exp   Murder  HS Grad
## 1 3830.600       1.50 70.27000 8.820000 47.78500
## 2 4839.267       0.95 71.28433 6.416667 56.65667
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              2              2              1              2 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              2              2              2              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              2              1              2              2              2 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              2              1              1              1              2 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              2              2              1              1 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              2              2              2              1              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              1              2              1              2              2 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              1              2              2              2              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              1              1              1              1              1 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              1              2              2 
## 
## Within cluster sum of squares by cluster:
## [1] 1878677 4416455
##  (between_SS / total_SS =  66.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

We see that the result of the application leads to two groups with obvious differences in income (\(Income\)), illiteracy and high-school graduation rates (\(Illiteracy\), \(HS Grad\)) and crime (\(Murder\)). Along with the mean values, k-means returns the allocation of the states to the two groups, as well as some data on group cohesion (WSS).

We can see which states belong to which group by applying \(which()\) to the \(cluster\) element. Note that, because k-means starts from a random allocation, the numeric cluster labels can change between runs; in the fresh run below, cluster 1 corresponds to the larger, higher-income group of the output above:

statekm2 <-kmeans(USStateData[,2:6], centers = 2, iter.max = 100)
which (statekm2$cluster == 1)
##        Alaska       Arizona    California      Colorado   Connecticut 
##             2             3             5             6             7 
##      Delaware       Florida        Hawaii      Illinois       Indiana 
##             8             9            11            13            14 
##          Iowa        Kansas      Maryland Massachusetts      Michigan 
##            15            16            20            21            22 
##     Minnesota       Montana      Nebraska        Nevada    New Jersey 
##            23            26            27            28            30 
##      New York  North Dakota          Ohio        Oregon  Pennsylvania 
##            32            34            35            37            38 
##  Rhode Island      Virginia    Washington     Wisconsin       Wyoming 
##            39            46            47            49            50

We observe a geographical enrichment of northern (and western) states in this cluster. Respectively, in cluster 2:

which(statekm2$cluster == 2)
##        Alabama       Arkansas        Georgia          Idaho       Kentucky 
##              1              4             10             12             17 
##      Louisiana          Maine    Mississippi       Missouri  New Hampshire 
##             18             19             24             25             29 
##     New Mexico North Carolina       Oklahoma South Carolina   South Dakota 
##             31             33             36             40             41 
##      Tennessee          Texas           Utah        Vermont  West Virginia 
##             42             43             44             45             48

there is a tendency for over-representation of the southern states.

Let’s see how two of the most important variables are distributed between the two clusters, by plotting Income and Illiteracy for the two clusters:

plot(USStateData$Income, USStateData$Illiteracy, main="US Data", cex.main=2, xlab="Income", ylab="Illiteracy", cex.lab=1.5)
lines(USStateData$Income[which(statekm2$cluster==1)], USStateData$Illiteracy[which(statekm2$cluster==1)], type="p", pch=19, col="firebrick4")
lines(USStateData$Income[which(statekm2$cluster==2)], USStateData$Illiteracy[which(statekm2$cluster==2)], type="p", pch=19, col="steelblue4")
legend("topright", c("Cluster 1 (mostly northern)", "Cluster 2 (mostly southern)"), fill=c("firebrick4", "steelblue4"), bty="n")

From this it is obvious that the northern states are on average more literate and higher-earning (notice, though, that these are 1977 data).

Can we evaluate this trend in any way? A solution is given by the following procedure. First we will create an auxiliary vector that encodes the North-South geographical position of each state, following the historical distinction between the Union and the Confederacy for the eastern states and a geographical division for the western ones. Given the following order of states:

rownames(USStateData)
##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"

this vector becomes:

NS<-c("S","N","S","S","S","S","N","N","S","S","N","N","N","N","N","S","S","S","N","N","N","N","N","S","S","N","N","S","N","N","S","N","S","N","N","S","N","N","N","S","N","S","S","S","N","S","N","S","N","N")

Next we can see to what extent the grouping we performed coincides with the geographical distribution, by creating a contingency table with \(table()\):

table(statekm2$cluster, NS)
##    NS
##      N  S
##   1 23  7
##   2  5 15

This 2x2 contingency table shows that there is indeed a tendency for the northern states to belong to group 1 (23/28 = 82%). The corresponding trend of the southern states for group 2 is 15/22 (68%).
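
The same percentages can be read off directly by converting the counts into column-wise proportions with \(prop.table()\):

# proportion of northern (N) and southern (S) states falling into each cluster
round(prop.table(table(statekm2$cluster, NS), margin=2), 2)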

One additional question we may ask is how statistically significant this result is. We will come back to this question in an upcoming lecture. For the moment we can say that k-means has done a fair job in spotting demographic differences that are associated with the North-South divide in the US.

Assignment

For this week's assignment you are asked to revisit a gene expression dataset you analyzed two weeks ago. Only this time you will work not with the processed data but with the original normalized data.

First download the dataset from this link that contains the normalized gene expression

The data come from an expression experiment using DNA microarrays. mRNA expression has been measured for 18703 mouse genes in a wild-type strain (Wt-control) against 5 different conditions (TG, A, B, C, D). Each experiment has 10 biological replicates.

You are asked to:

  1. Extract the subset of genes whose average expression (over all samples) places them in the top 5% of most highly expressed genes

  2. Perform a PCA analysis of this subset over all samples

  3. Spot differences between the 6 conditions and present them visually

  4. Try to isolate the genes that are more responsible for the PCA plot (with the use of \(fviz_pca_var()\) function)

  5. Repeat the analysis with k-means for 6 clusters and return the contingency table. What do you make of it?