Descriptive Statistics

Comprehensive plan: https://www.analyticsvidhya.com/blog/2017/01/the-most-comprehensive-data-science-learning-plan-for-2017/

Books (optional): http://onlinestatbook.com/2/index.html - supplement the online course with this free online statistics book.

Exploratory Analysis Content


Adding a geom

library(ggplot2)  ## qplot() and the mpg dataset come from ggplot2
qplot(displ, hwy, data = mpg, geom = c("point", "smooth"))


Principles of Analytic Graphics


K-means clustering - example

Can we find things that are close together?

Clustering organizes things that are close into groups


Hugely important/impactful

http://scholar.google.com/scholar?hl=en&q=cluster+analysis&btnG=&as_sdt=1%2C21&as_sdtp=
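
As a concrete example, here is a minimal k-means sketch using base R's kmeans(); the three-group simulation mirrors the one used for the hierarchical example below, but this particular code is an assumption, not from the original slides.

set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
## ask for 3 clusters; kmeansObj$cluster holds each point's assignment
kmeansObj <- kmeans(data.frame(x = x, y = y), centers = 3)
plot(x, y, col = kmeansObj$cluster, pch = 19, cex = 2)
points(kmeansObj$centers, col = 1:3, pch = 3, cex = 3, lwd = 3)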


Hierarchical clustering


How do we define close?


Example distances - Euclidean

In general, the Euclidean distance between points \((A_1, B_1, \ldots, Z_1)\) and \((A_2, B_2, \ldots, Z_2)\) is:

\[\sqrt{(A_1-A_2)^2 + (B_1-B_2)^2 + \ldots + (Z_1-Z_2)^2}\]

http://rafalab.jhsph.edu/688/lec/lecture5-clustering.pdf
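
This is what dist() computes by default; a minimal sketch (the points p1 and p2 are illustrative):

p1 <- c(0, 0); p2 <- c(3, 4)
sqrt(sum((p1 - p2)^2))  ## 5, by the formula above
dist(rbind(p1, p2))     ## same result; "euclidean" is the default method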


Example distances - Manhattan

In general, the Manhattan (taxicab) distance is the sum of the absolute coordinate differences:

\[|A_1-A_2| + |B_1-B_2| + \ldots + |Z_1-Z_2|\]

http://en.wikipedia.org/wiki/Taxicab_geometry
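
The same two illustrative points under the Manhattan metric, via dist()'s method argument (a minimal sketch):

p1 <- c(0, 0); p2 <- c(3, 4)
sum(abs(p1 - p2))                          ## 7, by the formula above
dist(rbind(p1, p2), method = "manhattan")  ## same result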


Hierarchical clustering - example

set.seed(1234); par(mar=c(0,0,0,0))
x <- rnorm(12,mean=rep(1:3,each=4),sd=0.2)
y <- rnorm(12,mean=rep(c(1,2,1),each=4),sd=0.2)
plot(x,y,col="blue",pch=19,cex=2)
text(x+0.05,y+0.05,labels=as.character(1:12))


Hierarchical clustering - dist

dataFrame <- data.frame(x=x,y=y)
dist(dataFrame)
##             1          2          3          4          5          6
## 2  0.34120511                                                       
## 3  0.57493739 0.24102750                                            
## 4  0.26381786 0.52578819 0.71861759                                 
## 5  1.69424700 1.35818182 1.11952883 1.80666768                      
## 6  1.65812902 1.31960442 1.08338841 1.78081321 0.08150268           
## 7  1.49823399 1.16620981 0.92568723 1.60131659 0.21110433 0.21666557
## 8  1.99149025 1.69093111 1.45648906 2.02849490 0.61704200 0.69791931
## 9  2.13629539 1.83167669 1.67835968 2.35675598 1.18349654 1.11500116
## 10 2.06419586 1.76999236 1.63109790 2.29239480 1.23847877 1.16550201
## 11 2.14702468 1.85183204 1.71074417 2.37461984 1.28153948 1.21077373
## 12 2.05664233 1.74662555 1.58658782 2.27232243 1.07700974 1.00777231
##             7          8          9         10         11
## 2                                                        
## 3                                                        
## 4                                                        
## 5                                                        
## 6                                                        
## 7                                                        
## 8  0.65062566                                            
## 9  1.28582631 1.76460709                                 
## 10 1.32063059 1.83517785 0.14090406                      
## 11 1.37369662 1.86999431 0.11624471 0.08317570           
## 12 1.17740375 1.66223814 0.10848966 0.19128645 0.20802789

Hierarchical clustering - merging steps #1-#3

(Plots omitted: each step merges the two closest points or clusters into a new cluster, starting with the closest pair.)


Hierarchical clustering - hclust

dataFrame <- data.frame(x=x,y=y)
distxy <- dist(dataFrame)
hClustering <- hclust(distxy)
plot(hClustering)
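
The dendrogram records the merge order but does not itself pick a number of clusters; cutree() extracts cluster assignments from the fitted tree. A minimal sketch, assuming the hClustering object above:

cutree(hClustering, k = 3)    ## assign each of the 12 points to one of 3 clusters
cutree(hClustering, h = 1.5)  ## or cut the tree at a given height instead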


Prettier dendrograms

myplclust <- function( hclust, lab=hclust$labels, lab.col=rep(1,length(hclust$labels)), hang=0.1,...){
  ## modification of plclust for plotting hclust objects *in colour*!
  ## Copyright Eva KF Chan 2009
  ## Arguments:
  ##    hclust:    hclust object
  ##    lab:        a character vector of labels of the leaves of the tree
  ##    lab.col:    colour for the labels; NA=default device foreground colour
  ##    hang:     as in hclust & plclust
  ## Side effect:
  ##    A display of hierarchical cluster with coloured leaf labels.
  y <- rep(hclust$height,2); x <- as.numeric(hclust$merge)
  y <- y[which(x<0)]; x <- x[which(x<0)]; x <- abs(x)
  y <- y[order(x)]; x <- x[order(x)]
  plot( hclust, labels=FALSE, hang=hang, ... )
  text( x=x, y=y[hclust$order]-(max(hclust$height)*hang),
        labels=lab[hclust$order], col=lab.col[hclust$order], 
        srt=90, adj=c(1,0.5), xpd=NA, ... )
}

Pretty dendrograms

dataFrame <- data.frame(x=x,y=y)
distxy <- dist(dataFrame)
hClustering <- hclust(distxy)
myplclust(hClustering,lab=rep(1:3,each=4),lab.col=rep(1:3,each=4))


Even prettier dendrograms

http://gallery.r-enthusiasts.com/RGraphGallery.php?graph=79


Merging points - complete

Complete linkage: the distance between two clusters is the distance between their two farthest points.


Merging points - average

Average linkage: the distance between two clusters is the distance between their averages (the center of gravity of each cluster).


heatmap()

dataFrame <- data.frame(x=x,y=y)
set.seed(143)
dataMatrix <- as.matrix(dataFrame)[sample(1:12),]
heatmap(dataMatrix)
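
heatmap() runs hierarchical clustering on both the rows and the columns of the matrix and reorders the image accordingly. To see what the clustering adds, suppress the reordering with the Rowv and Colv arguments (a minimal sketch):

heatmap(dataMatrix, Rowv = NA, Colv = NA)  ## same matrix, no row/column reordering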


Notes and further resources


Matrix data

set.seed(12345); par(mar=rep(0.2,4))
dataMatrix <- matrix(rnorm(400),nrow=40)
image(1:10,1:40,t(dataMatrix)[,nrow(dataMatrix):1])


Cluster the data

par(mar=rep(0.2,4))
heatmap(dataMatrix)


What if we add a pattern?

set.seed(678910)
for(i in 1:40){
  # flip a coin
  coinFlip <- rbinom(1,size=1,prob=0.5)
  # if coin is heads add a common pattern to that row
  if(coinFlip){
    dataMatrix[i,] <- dataMatrix[i,] + rep(c(0,3),each=5)
  }
}

What if we add a pattern? - the data

par(mar=rep(0.2,4))
image(1:10,1:40,t(dataMatrix)[,nrow(dataMatrix):1])


What if we add a pattern? - the clustered data

par(mar=rep(0.2,4))
heatmap(dataMatrix)


Patterns in rows and columns

hh <- hclust(dist(dataMatrix)); dataMatrixOrdered <- dataMatrix[hh$order,]
par(mfrow=c(1,3))
image(t(dataMatrixOrdered)[,nrow(dataMatrixOrdered):1])
plot(rowMeans(dataMatrixOrdered),40:1,xlab="Row Mean",ylab="Row",pch=19)
plot(colMeans(dataMatrixOrdered),xlab="Column",ylab="Column Mean",pch=19)


Components of the SVD - \(u\) and \(v\)

svd1 <- svd(scale(dataMatrixOrdered))
par(mfrow=c(1,3))
image(t(dataMatrixOrdered)[,nrow(dataMatrixOrdered):1])
plot(svd1$u[,1],40:1,xlab="First left singular vector",ylab="Row",pch=19)
plot(svd1$v[,1],xlab="Column",ylab="First right singular vector",pch=19)


Components of the SVD - Variance explained

par(mfrow=c(1,2))
plot(svd1$d,xlab="Column",ylab="Singular value",pch=19)
plot(svd1$d^2/sum(svd1$d^2),xlab="Column",ylab="Prop. of variance explained",pch=19)


Relationship to principal components

svd1 <- svd(scale(dataMatrixOrdered))
pca1 <- prcomp(dataMatrixOrdered,scale=TRUE)
plot(pca1$rotation[,1],svd1$v[,1],pch=19,xlab="Principal Component 1",ylab="Right Singular Vector 1")
abline(c(0,1))
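
The singular values and the principal-component standard deviations carry the same information: for an n-row matrix decomposed after scaling, sdev = d / sqrt(n - 1). A minimal numerical check, assuming the svd1 and pca1 objects above:

n <- nrow(dataMatrixOrdered)
max(abs(pca1$sdev - svd1$d/sqrt(n - 1)))  ## ~0, up to floating-point error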


Components of the SVD - variance explained

constantMatrix <- dataMatrixOrdered*0
for(i in 1:dim(dataMatrixOrdered)[1]){constantMatrix[i,] <- rep(c(0,1),each=5)}
svd1 <- svd(constantMatrix)
par(mfrow=c(1,3))
image(t(constantMatrix)[,nrow(constantMatrix):1])
plot(svd1$d,xlab="Column",ylab="Singular value",pch=19)
plot(svd1$d^2/sum(svd1$d^2),xlab="Column",ylab="Prop. of variance explained",pch=19)


What if we add a second pattern?

set.seed(678910)
for(i in 1:40){
  # flip two coins
  coinFlip1 <- rbinom(1,size=1,prob=0.5)
  coinFlip2 <- rbinom(1,size=1,prob=0.5)
  # if the first coin is heads, add a block pattern (last 5 columns) to the row
  if(coinFlip1){
    dataMatrix[i,] <- dataMatrix[i,] + rep(c(0,5),each=5)
  }
  # if the second coin is heads, add an alternating pattern to the row
  if(coinFlip2){
    dataMatrix[i,] <- dataMatrix[i,] + rep(c(0,5),5)
  }
}
hh <- hclust(dist(dataMatrix)); dataMatrixOrdered <- dataMatrix[hh$order,]

Singular value decomposition - true patterns

svd2 <- svd(scale(dataMatrixOrdered))
par(mfrow=c(1,3))
image(t(dataMatrixOrdered)[,nrow(dataMatrixOrdered):1])
plot(rep(c(0,1),each=5),pch=19,xlab="Column",ylab="Pattern 1")
plot(rep(c(0,1),5),pch=19,xlab="Column",ylab="Pattern 2")


\(v\) and patterns of variance in rows

svd2 <- svd(scale(dataMatrixOrdered))
par(mfrow=c(1,3))
image(t(dataMatrixOrdered)[,nrow(dataMatrixOrdered):1])
plot(svd2$v[,1],pch=19,xlab="Column",ylab="First right singular vector")
plot(svd2$v[,2],pch=19,xlab="Column",ylab="Second right singular vector")


\(d\) and variance explained

svd1 <- svd(scale(dataMatrixOrdered))
par(mfrow=c(1,2))
plot(svd1$d,xlab="Column",ylab="Singular value",pch=19)
plot(svd1$d^2/sum(svd1$d^2),xlab="Column",ylab="Percent of variance explained",pch=19)


Missing values

dataMatrix2 <- dataMatrixOrdered
## Randomly insert some missing data
dataMatrix2[sample(1:100,size=40,replace=FALSE)] <- NA
svd1 <- svd(scale(dataMatrix2))  ## Doesn't work!
## Error in svd(scale(dataMatrix2)): infinite or missing values in 'x'
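
One crude fallback (an assumption here, not from the original slides) is to fill each missing value with its column mean before decomposing; a minimal sketch:

dataMatrix2 <- dataMatrixOrdered
dataMatrix2[sample(1:100,size=40,replace=FALSE)] <- NA
## replace the NAs in each column with that column's observed mean
for(j in 1:ncol(dataMatrix2)){
  miss <- is.na(dataMatrix2[,j])
  dataMatrix2[miss,j] <- mean(dataMatrix2[,j], na.rm=TRUE)
}
svd1 <- svd(scale(dataMatrix2))  ## now runs

A better option, shown next, is k-nearest-neighbor imputation from the Bioconductor impute package.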

Imputing {impute}

library(impute)  ## Available from http://bioconductor.org
dataMatrix2 <- dataMatrixOrdered
dataMatrix2[sample(1:100,size=40,replace=FALSE)] <- NA
dataMatrix2 <- impute.knn(dataMatrix2)$data
svd1 <- svd(scale(dataMatrixOrdered)); svd2 <- svd(scale(dataMatrix2))
par(mfrow=c(1,2)); plot(svd1$v[,1],pch=19); plot(svd2$v[,1],pch=19)


Face example

load("data/face.rda")
image(t(faceData)[,nrow(faceData):1])


Face example - variance explained

svd1 <- svd(scale(faceData))
plot(svd1$d^2/sum(svd1$d^2),pch=19,xlab="Singular vector",ylab="Variance explained")


Face example - create approximations

svd1 <- svd(scale(faceData))
## Note that %*% is matrix multiplication

# Here svd1$d[1] is a scalar, so it can multiply the outer product directly
approx1 <- svd1$u[,1] %*% t(svd1$v[,1]) * svd1$d[1]

# With several components we need to make the diagonal matrix out of d
approx5 <- svd1$u[,1:5] %*% diag(svd1$d[1:5]) %*% t(svd1$v[,1:5])
approx10 <- svd1$u[,1:10] %*% diag(svd1$d[1:10]) %*% t(svd1$v[,1:10])
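
The same construction works for any rank k; a small helper wraps it (svdApprox is a hypothetical name, not from the slides):

svdApprox <- function(s, k){
  ## rank-k reconstruction u_k d_k v_k'; nrow = k guards the k = 1 case,
  ## where diag() of a plain scalar would build an identity matrix instead
  s$u[, 1:k, drop=FALSE] %*% diag(s$d[1:k], nrow=k) %*% t(s$v[, 1:k, drop=FALSE])
}
approx5 <- svdApprox(svd1, 5)  ## identical to the explicit version above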

Face example - plot approximations

par(mfrow=c(1,4))
image(t(approx1)[,nrow(approx1):1], main = "(a)")
image(t(approx5)[,nrow(approx5):1], main = "(b)")
image(t(approx10)[,nrow(approx10):1], main = "(c)")
image(t(faceData)[,nrow(faceData):1], main = "(d)")  ## Original data


Notes and further resources