stat4arch

Petr Pajdla & Peter Tkáč

AES_707: Statistics seminar for archaeologists

Seminar 6

28. 4. 2022

Today:

  • Multivariate statistics
  • Dimensionality reduction
  • Clustering

But first... an exercise!

Imagine you have radiocarbon data from the whole Czech Republic and you want to know which regions most of the Neolithic data come from. In other words, which districts have the most Neolithic dates?

Instructions I.

  1. Download the dataset Lasoles_14C_database.csv from the Zenodo repository (https://doi.org/10.5281/zenodo.5728242).
  2. Open the dataset in RStudio (read.csv()).
  3. Explore the dataset: find out how many observations and variables there are, etc.
  4. Try to identify the variable that represents uncalibrated radiocarbon dates.

Hints:

  • You can find a description of the dataset and its variables in the article below:

Tkáč, P. and Kolář, J. 2021. Towards New Demography Proxies and Regional Chronologies: Radiocarbon Dates from Archaeological Contexts Located in the Czech Republic Covering the Period Between 10,000 BC and AD 1250. Journal of Open Archaeology Data, 9, p. 9. DOI: https://doi.org/10.5334/joad.85

  • Useful functions: here::here(), ncol(), nrow(), head(), str().
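A minimal sketch of these steps, assuming the downloaded file is saved in a data/ folder inside your RStudio project (the object name c14 is our choice):

# read the dataset from the project's data/ folder
library(here)
c14 <- read.csv(here("data", "Lasoles_14C_database.csv"))

nrow(c14)  # number of observations
ncol(c14)  # number of variables
str(c14)   # variable names and types
head(c14)  # first few rows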

Instructions II.

  1. With the help of the dplyr package, filter the data from the Neolithic period and then count the number of observations per district. Order the districts from the highest to the lowest number of observations.

Hints:

  • Use str() to check the numeric variables in your data frame. If one of them is stored as a factor, convert it to numeric:

dataframe$variable <- as.numeric(as.character(dataframe$variable))

  • To define the Neolithic, use the period between 6500 and 5500 uncalibrated radiocarbon years BP.
  • Useful functions: names(), str(), head(), filter(), summarise(), %>%, arrange(), group_by().
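A sketch of one possible solution; the column names bp (uncalibrated date) and District are assumptions here, so check names(c14) for the actual ones:

library(dplyr)

# make sure the date variable is numeric, not a factor
c14$bp <- as.numeric(as.character(c14$bp))

# keep only the Neolithic dates
neolithic <- c14 %>% 
  filter(bp >= 5500, bp <= 6500)

# count dates per district, highest first
neolithic %>% 
  group_by(District) %>% 
  summarise(n_dates = n()) %>% 
  arrange(desc(n_dates))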

Additional task:

  • Select the three districts with the most dates and compare how their counts differ across activity areas (variable: Activity_CZ). In other words, is there any difference between the number of dates coming from graveyards (poh) and from settlements (sid)?
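A sketch continuing from the code above (District again being an assumed column name):

# counts of Neolithic dates per district and activity area
neolithic %>% 
  filter(Activity_CZ %in% c("poh", "sid")) %>% 
  count(District, Activity_CZ) %>% 
  arrange(desc(n))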

Multivariate data & methods

Multivariate data

Division based on number of studied variables

  • Univariate
    (one variable per observation);
  • Bivariate
    (two variables);
  • Multivariate
    (also multidimensional, multiple variables).

Dimensionality reduction

The curse of higher dimensions…

  • Computational inefficiency;
  • Low data density in higher dimensions (see the sketch below);
  • Problematic visualization, as we do not easily cope with more than 3 dimensions;
  • Difficult interpretation, and much more.
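A quick simulation of the sparsity point (our own illustration, not from the seminar): with the number of points fixed, the mean nearest-neighbour distance grows rapidly with the number of dimensions p.

set.seed(42)
# mean nearest-neighbour distance among 100 uniform points in [0, 1]^p
sapply(c(1, 2, 5, 10, 50), function(p) {
  x <- matrix(runif(100 * p), ncol = p)
  d <- as.matrix(dist(x)) + diag(Inf, 100)  # Inf on the diagonal excludes self
  mean(apply(d, 1, min))
})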

Principal components analysis

Goal: Find a low-dimensional representation of the observations that explains a good fraction of the variation.

  • The first principal component (PC) is the direction that maximizes the variance of the projected data.
  • The second PC again maximizes the projected variance, subject to being orthogonal to the first; later PCs follow the same rule.

PCA I: Data

Data preparation

  • Works with numeric data only.
  • Format the input as a matrix.
  • Variables on very different scales are often standardized first (prcomp(..., scale. = TRUE)); compare the scaled fibulae example in the k-means part.
data("DartPoints", package = "archdata")
dp <- DartPoints %>% 
  select(Length, Width, Thickness, Weight) %>% # select columns
  as.matrix() # format as a matrix
# set row names with IDs
rownames(dp) <- DartPoints$Catalog
head(dp, 4)
        Length Width Thickness Weight
41-0322   42.8  15.8       5.8    3.6
35-2946   40.5  17.4       5.8    4.5
35-2921   37.5  16.3       6.1    3.6
36-3487   40.3  16.1       6.3    4.0

Plot the original data

plot(dp)

[Figure: pairwise scatterplots of the four dart point measurements]

PCA II

dp_pca <- prcomp(dp)
dp_pca
Standard deviations (1, .., p=4):
[1] 13.937973  3.238177  1.633327  1.192596

Rotation (n x k) = (4 x 4):
                 PC1        PC2        PC3         PC4
Length    0.90926230  0.3813173  0.1663782  0.01255058
Width     0.30499443 -0.8831253  0.3562343  0.01285657
Thickness 0.06682732 -0.0564513 -0.1616860 -0.98295728
Weight    0.27523551 -0.2673951 -0.9051371  0.18295404
summary(dp_pca)
Importance of components:
                           PC1     PC2     PC3     PC4
Standard deviation     13.9380 3.23818 1.63333 1.19260
Proportion of Variance  0.9302 0.05021 0.01277 0.00681
Cumulative Proportion   0.9302 0.98042 0.99319 1.00000
plot(dp_pca)

[Figure: scree plot of the principal component variances]

Scree plot

  • Shows the amount of variation explained by each PC.
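The proportions reported by summary(dp_pca) can be recomputed directly from the standard deviations (the variance of each PC is its squared standard deviation):

# proportion of variance explained by each principal component
pve <- dp_pca$sdev^2 / sum(dp_pca$sdev^2)
round(pve, 4)
# cumulative proportion, as in summary(dp_pca)
cumsum(pve)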

PCA III: Biplot

biplot(dp_pca)

[Figure: biplot of the first two principal components (base R)]

ggbiplot::ggbiplot(dp_pca)

[Figure: the same biplot drawn with ggbiplot]

PCA IV: The details

# structure of the prcomp object
str(dp_pca)
List of 5
 $ sdev    : num [1:4] 13.94 3.24 1.63 1.19
 $ rotation: num [1:4, 1:4] 0.9093 0.305 0.0668 0.2752 0.3813 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "Length" "Width" "Thickness" "Weight"
  .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
 $ center  : Named num [1:4] 49.33 22.08 7.27 7.64
  ..- attr(*, "names")= chr [1:4] "Length" "Width" "Thickness" "Weight"
 $ scale   : logi FALSE
 $ x       : num [1:91, 1:4] -9.06 -10.42 -13.71 -11.1 -20.24 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:91] "41-0322" "35-2946" "35-2921" "36-3487" ...
  .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
 - attr(*, "class")= chr "prcomp"
# extract the coordinates
dp_pca_x <- dp_pca$x
head(dp_pca_x)
               PC1       PC2         PC3        PC4
41-0322  -9.063678  4.217119  0.57461646 0.54402974
35-2946 -10.419278  1.686433 -0.05270198 0.70039255
35-2921 -13.710223  1.737639 -0.17757672 0.18905277
36-3487 -11.101828  2.863704 -0.17735660 0.09821324
36-3321 -20.238281 -1.133789  0.47561971 1.93910904
35-2959  -9.946694  3.209080  1.58242087 2.10559068
class(dp_pca_x)
[1] "matrix" "array" 
dp_pca_x <- as.data.frame(dp_pca_x)
class(dp_pca_x)
[1] "data.frame"

PCA V: Custom plot

library(ggplot2)  # for ggplot(); dplyr (bind_cols()) is loaded already

dp_pca_x %>% 
  bind_cols(
    type = DartPoints$Name
  ) %>% 
  ggplot(
    aes(PC1, PC2, 
        color = type)
  ) +
  geom_point()

Clustering

Clustering

  • Broad set of techniques for finding subgroups.
  • Observations within a group are quite similar to each other, while observations in different groups are quite different from each other.

Clustering methods

  • Partitioning: k-means clustering
  • Agglomerative/divisive: hierarchical clustering
  • Model-based: mixture models


K-means partitioning

K-means partitioning

  • Simple approach to partitioning a data set into K distinct, non-overlapping clusters.
  • K is specified beforehand.

(James et al. 2013)

K-means algorithm

  • K … number of clusters.

The process:

  1. Randomly assign each observation to groups 1 to K;
  2. For each of K clusters, derive its centroid (midpoint);
  3. Reassign each observation into a cluster whose centroid is the closest;
  4. Repeat steps 2 and 3 until the cluster assignments are stable (the result is a local, not necessarily global, optimum).

(James et al. 2013)
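A didactic sketch of these four steps (our own illustration, not a replacement for stats::kmeans()); it assumes Euclidean distance and that no cluster becomes empty during the iterations:

naive_kmeans <- function(x, k, max_iter = 100) {
  # 1. randomly assign each observation to groups 1 to K
  groups <- sample(seq_len(k), nrow(x), replace = TRUE)
  for (i in seq_len(max_iter)) {
    # 2. derive the centroid (column means) of each cluster
    centers <- apply(x, 2, function(col) tapply(col, groups, mean))
    # 3. reassign each observation to the cluster with the closest centroid
    d <- as.matrix(dist(rbind(centers, x)))[-seq_len(k), seq_len(k)]
    new_groups <- apply(d, 1, which.min)
    # 4. iterate until the cluster assignments are stable
    if (all(new_groups == groups)) break
    groups <- new_groups
  }
  groups
}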

Some properties of k-means

  • The standard k-means algorithm uses Euclidean distance.
  • Different runs can converge to different local optima, so the resulting clusters are not fully stable.

  • Number of clusters K must be specified beforehand.
    • Elbow method (see the sketch below);
    • Silhouette method, etc.
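A sketch of the elbow method, where x stands for the numeric matrix to be clustered (an assumption here; e.g. the PCA scores used below): plot the total within-cluster sum of squares against K and look for the "elbow" where the curve flattens.

wss <- sapply(1:8, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})
plot(1:8, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")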

K-means I: Data

data(Fibulae, package = "archdata")
# create matrix
fib <- Fibulae %>% 
  select(
    tot.len = Length, 
    bow.hei = BH,
    foot.len = FL
  ) %>% 
  as.matrix()
head(fib)
     tot.len bow.hei foot.len
[1,]     114      24       93
[2,]      35       7       21
[3,]      60      15       33
[4,]      74      26       23
[5,]      68      23       20
[6,]      55      15       27
plot(fib)

[Figure: pairwise scatterplots of the three fibula measurements]

K-means II: Reduce dimensions (PCA)

# scale + PCA
fib_pca <- prcomp(scale(fib))
# summary of a PCA
summary(fib_pca)
Importance of components:
                          PC1    PC2     PC3
Standard deviation     1.4691 0.8941 0.20585
Proportion of Variance 0.7194 0.2665 0.01412
Cumulative Proportion  0.7194 0.9859 1.00000
plot(fib_pca$x)

[Figure: fibulae projected onto the principal components]

K-means III: Find optimal K

  • Package factoextra.
factoextra::fviz_nbclust(fib_pca$x, kmeans)

[Figure: average silhouette width for different numbers of clusters K]

K-means IV

fib_km <- kmeans(fib_pca$x, centers = 2)
fib_km
K-means clustering with 2 clusters of sizes 24, 6

Cluster means:
         PC1          PC2           PC3
1  0.6099434 -0.007314937 -0.0001198279
2 -2.4397737  0.029259748  0.0004793116

Clustering vector:
 [1] 2 1 1 2 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2

Within cluster sum of squares by cluster:
[1] 22.13457 20.21529
 (between_SS / total_SS =  51.3 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

K-means V

# cluster assignment
fib_km$cluster
 [1] 2 1 1 2 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2
# create a data frame with clusters
fib_df <- fib_pca$x %>% 
  as.data.frame() %>% 
  bind_cols(
    clust = fib_km$cluster)

head(fib_df, 4)
          PC1         PC2         PC3 clust
1 -4.22727376  0.03044856  0.60841190     2
2  1.76360759  1.78652460 -0.05839999     1
3 -0.08164798  0.43974905 -0.05057241     1
4 -1.28463323 -2.07589875 -0.24087960     2
fib_df %>% 
  ggplot(
    aes(PC1, PC2, 
        color = factor(clust))) +
  geom_point(size = 4) +
  theme_minimal()

[Figure: PCA scores colored by k-means cluster]

Hierarchical clustering

Hierarchical clustering

  • The number of clusters is not specified beforehand.
  • Output is a dendrogram - a hierarchical structure visualizing the cluster growth.

  • Starts with a distance matrix.

Agglomerative algorithm:

  1. Put each object in its own cluster;
  2. Join the clusters that are the closest;
  3. Iterate until a single cluster encompassing all objects is reached.

Agglomerative:

  • Builds the hierarchy of clusters from the bottom up until a single cluster is reached.

Divisive:

  • Divides a single large cluster into individual objects (top-down).

Hierarchical clustering I: Data

# data set same as in k-means
head(fib)
     tot.len bow.hei foot.len
[1,]     114      24       93
[2,]      35       7       21
[3,]      60      15       33
[4,]      74      26       23
[5,]      68      23       20
[6,]      55      15       27
# create distance matrix
fib_dist <- dist(fib) 
class(fib_dist)
[1] "dist"
as.matrix(fib_dist)[1:6, 1:4]
          1         2        3         4
1   0.00000 108.23123 81.22192 80.647381
2 108.23123   0.00000 28.86174 43.428102
3  81.22192  28.86174  0.00000 20.420578
4  80.64738  43.42810 20.42058  0.000000
5  86.29021  36.68787 17.23369  7.348469
6  88.98314  22.36068  7.81025 22.315914
  • A matrix with n rows and n columns.
  • Pair-wise distances between the objects.
  • Notice the zeros on the diagonal, i.e. the distance of an object to itself is 0.
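dist() computes Euclidean distances by default; other metrics can be requested via its method argument, for example:

# Manhattan (city-block) distances instead of Euclidean
fib_manhattan <- dist(fib, method = "manhattan")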

Hierarchical clustering II: Dendrogram

fib_hclust <- hclust(fib_dist)
fib_hclust

Call:
hclust(d = fib_dist)

Cluster method   : complete 
Distance         : euclidean 
Number of objects: 30 
plot(fib_hclust)

[Figure: dendrogram of the hierarchical clustering (plot(fib_hclust))]

Hierarchical clustering III: Clusters

  • There are different options for where to cut the dendrogram…

[Figure: the dendrogram with possible cutting heights indicated]

  • The cutree(x, k, h) function:
    • specify k, the number of clusters; or
    • specify h, the height at which to cut.
h35 <- cutree(fib_hclust, h = 35)
h35
 [1] 1 2 3 4 4 3 2 2 3 2 4 4 3 2 2 2 2 2 2 3 1 3 2 2 2 3 3 2 2 1
k3 <- cutree(fib_hclust, k = 3)
k3
 [1] 1 2 3 3 3 3 2 2 3 2 3 3 3 2 2 2 2 2 2 3 1 3 2 2 2 3 3 2 2 1

Hierarchical clustering IV

plot(fib, col = h35)

[Figure: scatterplots colored by the clusters from cutting at h = 35]

plot(fib, col = k3)

[Figure: scatterplots colored by the three clusters from k = 3]

Cluster linkage methods I

Complete (maximum) linkage

  • The distance between two clusters is the largest pairwise distance between their members.
  • method = "complete"

[Figure: dendrogram with complete linkage]

Single (minimum) linkage

  • The distance between two clusters is the smallest pairwise distance between their members.
  • method = "single"

[Figure: dendrogram with single linkage]

Cluster linkage methods II

Mean (average) linkage

  • The distance between two clusters is the mean pairwise distance between their members.
  • method = "average"

[Figure: dendrogram with average linkage]

Ward's method

  • Minimizes the total within-cluster variance.
  • method = "ward.D2"

[Figure: dendrogram with Ward's method]
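The linkage criterion is chosen via the method argument of hclust(); for example:

plot(hclust(fib_dist, method = "single"))
plot(hclust(fib_dist, method = "average"))
plot(hclust(fib_dist, method = "ward.D2"))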

K-means partitioning

  • Pre-specified number of clusters.
  • Clusters may vary (different local optima).
  • Works best when the groups in the data are hyper-spherical.

Hierarchical clustering

  • Variable cluster numbers.
  • Clusters are stable.
  • Any shape of data distribution.

Comparison of k-means vs hclust for k = 2

K-means clustering

[Figure: k-means clusters, k = 2]

Hierarchical clustering

[Figure: hierarchical clusters, k = 2]
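The code behind these comparison figures is not shown on the slides; a minimal sketch producing a similar side-by-side comparison for a given k (the same pattern applies for k = 3 and k = 4):

k <- 2
km_cl <- kmeans(fib_pca$x, centers = k)$cluster
hc_cl <- cutree(hclust(dist(fib_pca$x)), k = k)
par(mfrow = c(1, 2))
plot(fib_pca$x[, 1:2], col = km_cl, pch = 19, main = "K-means")
plot(fib_pca$x[, 1:2], col = hc_cl, pch = 19, main = "Hierarchical")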

Comparison of k-means vs hclust for k = 3

K-means clustering

[Figure: k-means clusters, k = 3]

Hierarchical clustering

[Figure: hierarchical clusters, k = 3]

Comparison of k-means vs hclust for k = 4

K-means clustering

[Figure: k-means clusters, k = 4]

Hierarchical clustering

[Figure: hierarchical clusters, k = 4]