Time-Series Clustering of COVID-19 Deaths Data

class: center, middle, inverse, title-slide

# Time-Series Clustering of COVID-19 Deaths Data
## What I learned from Fadel’s COVID-19 deaths paper
### Yinjiao Ma
### SLU
### updated: 2021-07-01

---

background-image: url(https://www.cdc.gov/coronavirus/2019-ncov/covid-data/covidview/06252021/images/cases-06252021.jpg)
background-position: center
background-size: 80%

???

Image credit: [CDC](https://www.cdc.gov/coronavirus/2019-ncov/covid-data/covidview/06252021/images/cases-06252021.jpg)

---

# Clustering

- Unsupervised learning task

- Partition a set of unlabeled data objects into homogeneous clusters

- Objects in the same cluster are more similar to each other than objects in different clusters

.footnote[

Reference: Pablo Montero, José A. Vilar (2014). TSclust: An R Package for Time Series Clustering. Journal of Statistical Software.

]
---
# Clustering methods

- Distance-based methods

  1. Partitioning algorithms: *k*-Means, *k*-Medians, *k*-Medoids

  1. Hierarchical algorithms: agglomerative and divisive methods

- Density-based and grid-based methods

- Probabilistic and generative models

- High-dimensional clustering
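To make the two distance-based families concrete, here is a minimal base R sketch on hypothetical toy data (the points, seed, and the choice of average linkage are assumptions for illustration only). It runs one partitioning method (`kmeans`) and one agglomerative hierarchical method (`hclust`) on the same points.

```r
set.seed(1)
# Hypothetical toy data: two well-separated groups of 10 points in 2-D
x <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(20, mean = 3, sd = 0.3), ncol = 2))

# Partitioning: k-means with k = 2
km <- kmeans(x, centers = 2, nstart = 10)

# Hierarchical (agglomerative): build the merge tree, then cut it into 2 groups
hc <- hclust(dist(x), method = "average")
hc_groups <- cutree(hc, k = 2)

# Cross-tabulate the two partitions to see how they line up
agreement <- table(kmeans = km$cluster, hclust = hc_groups)
```

Printing `agreement` shows how the two partitions line up.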

.footnote[

Reference: Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and Techniques (3rd ed.). Waltham: Morgan Kaufmann.

]
???
Agglomerative methods: start with individual points, merge them into clusters, then merge small clusters into bigger ones.
Divisive methods: start with a single big cluster, then split it into smaller clusters.

---

# Time-series data analysis

- Data on the same variable are collected over time
  - Example: daily COVID-19-related deaths by county

- Objectives of the analysis
  1. To describe the important features of the time series pattern
  
  1. To explain how the past affects the future or how two time series can “interact”
  
  1. To forecast future values of the series
  
  1. To possibly serve as a control standard for a variable that measures the quality of product in some manufacturing situations
  
  1. To identify similar COVID-19 outbreak patterns using time-series clustering
.footnote[

Reference: https://online.stat.psu.edu/stat510/lesson/1/1.1

]

---

## Time-series clustering methods

#### Raw-data-based
  - Approaches work directly with raw time series data
  - Replace the distance/similarity measure for static data with an appropriate one for time series

#### Feature-based
  - Converts raw time series data into a feature vector of lower dimension
  - Apply a conventional clustering algorithm to the extracted feature vectors

#### Model-based
  - Converts raw time series data into a number of model parameters 
  - Apply a conventional clustering algorithm to the model parameters
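As an illustration of the feature-based route, the sketch below reduces each series to two timing features and clusters those with `kmeans`. The toy curves, the helper names `make_series` and `extract_features`, and the choice of features are all assumptions made for this example, not something taken from the referenced survey.

```r
set.seed(42)
days <- 1:60

# Hypothetical toy series: a bell-shaped outbreak curve peaking on a given day
make_series <- function(peak) exp(-((days - peak) / 6)^2) + rnorm(60, sd = 0.02)

# Three early-peaking and three late-peaking series, one per row
series <- rbind(t(sapply(c(15, 16, 17), make_series)),
                t(sapply(c(44, 45, 46), make_series)))

# Feature extraction: summarise each 60-day series by two numbers
extract_features <- function(x) {
  c(peak_day = which.max(x),                    # day of the maximum
    center_of_mass = sum(days * x) / sum(x))    # weighted mean day
}
features <- t(apply(series, 1, extract_features))

# Conventional clustering on the standardised feature vectors
clusters <- kmeans(scale(features), centers = 2, nstart = 10)$cluster
```

The raw-data-based route would instead pass `series` itself to the clustering step, with a distance suited to time series.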

.footnote[

Reference: T. Warren Liao (2005). Clustering of time series data -- a survey. Pattern Recognition.

]

---

background-image: url(https://d3i71xaburhd42.cloudfront.net/20faa2ef4bb4e84b1d68750cda28d0a45fb16075/3-Figure1-1.png)
background-position: 
background-size: 50%

# Time-series clustering methods

.footnote[

Reference: T. Warren Liao (2005). Clustering of time series data -- a survey. Pattern Recognition.

]

---

# Clustering validation

- Clustering evaluation: goodness of the results

  - External: compare clustering results with prior knowledge (ground truth)

  - Internal: how well the clusters are separated and how compact the clusters are

  - Relative: compare clusters generated by different parameter settings for the same algorithm
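External validation can be sketched with the Rand index, the share of point pairs on which two partitions agree. This is one possible external measure chosen for illustration; the helper `rand_index` and the toy labels are hypothetical.

```r
# Rand index: fraction of point pairs treated the same way by both partitions
# (both in the same cluster, or both in different clusters)
rand_index <- function(a, b) {
  stopifnot(length(a) == length(b))
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  pairs <- upper.tri(same_a)          # count each unordered pair once
  mean(same_a[pairs] == same_b[pairs])
}

truth <- c(1, 1, 1, 2, 2, 2)   # hypothetical ground-truth labels
found <- c(2, 2, 2, 1, 1, 3)   # hypothetical clustering result

ri <- rand_index(truth, found)
```

Cluster labels are arbitrary, so the index compares pair memberships rather than the label values themselves.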

.footnote[

Reference: Malika Charrad et al. (2014). NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set. Journal of Statistical Software.

]

---
## Method in the paper: A Two-Stage Modeling Framework for Analyzing COVID-19 Deaths By County

### 1. Clustering algorithm: *k*-Means

- Select k points as initial centroids
- Repeat
  - Form k clusters by assigning each point to its closest centroid
  - Re-compute the centroids (i.e., mean point) of each cluster
- Until convergence criterion is satisfied
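The steps above can be sketched in a few lines of base R. This is a teaching sketch, not the routine used in the paper; the function name `simple_kmeans`, the stopping rule (assignments no longer change), and the toy data are assumptions.

```r
simple_kmeans <- function(x, k, max_iter = 100) {
  # Select k points as initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every point to every centroid (n x k)
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
    # Assign each point to its closest centroid
    new_assignment <- max.col(-d, ties.method = "first")
    # Convergence criterion: no point changed cluster
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Re-compute each centroid as the mean of its members
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centroids)
}

set.seed(1)
# Hypothetical toy data: two well-separated groups of 10 points in 2-D
x <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(20, mean = 3, sd = 0.3), ncol = 2))
fit <- simple_kmeans(x, k = 2)
```

In practice base R's `kmeans()` is preferable, since it supports multiple random starts via `nstart`.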

### 2. Distance measure: Euclidean distance

`$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{il} - x_{jl}|^2}$$`
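A small worked instance of the formula, assuming two hypothetical series `xi` and `xj` of equal length `l = 3`; the helper name `euclidean_distance` is made up for this sketch.

```r
# Euclidean distance between two equal-length series, term by term as in the formula
euclidean_distance <- function(xi, xj) {
  stopifnot(length(xi) == length(xj))
  sqrt(sum(abs(xi - xj)^2))
}

xi <- c(1, 2, 3)   # hypothetical series i
xj <- c(4, 6, 3)   # hypothetical series j

d_manual <- euclidean_distance(xi, xj)          # sqrt(3^2 + 4^2 + 0^2)
d_base   <- as.numeric(dist(rbind(xi, xj)))     # same value from base R's dist()
```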
.footnote[
 
Reference: [Coursera course: Cluster Analysis in Data Mining by University of Illinois at Urbana-Champaign](https://www.coursera.org/learn/cluster-analysis/home/welcome)

]

---

### 3. Data management
 - Smoothed using a 7-day moving average
 - Rescaled the 7-day average to between 0 and 1

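```
# Requires dplyr (mutate) and zoo (rollmeanr); applied to data grouped by county
mutate(
  # 7-day right-aligned moving average of new (adjusted) deaths
  newMA7 = rollmeanr(newDeaths, k = 7, fill = NA),
  # maximum of the moving average per county, used to scale the data
  maxMA7 = max(newMA7, na.rm = TRUE),
  # scale to a 0-1 range by county; pmax() with na.rm = TRUE turns NA into 0
  scaledNewMA7 = pmax(0, newMA7 / maxMA7, na.rm = TRUE)
)
```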
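To see what these two steps do to a single county, here is a self-contained base R sketch on made-up numbers. It swaps `zoo::rollmeanr` for `stats::filter(..., sides = 1)`, which also gives a right-aligned moving average with `NA` for the first six days; the variable names and the data are hypothetical.

```r
# Hypothetical daily new deaths for one county (14 days of toy numbers)
new_deaths <- c(0, 1, 0, 2, 3, 5, 3, 6, 8, 7, 4, 3, 2, 1)

# Right-aligned 7-day moving average; the first 6 values are NA
ma7 <- as.numeric(stats::filter(new_deaths, rep(1 / 7, 7), sides = 1))

# Rescale by the county's maximum; pmax(..., na.rm = TRUE) maps the NAs to 0
scaled_ma7 <- pmax(0, ma7 / max(ma7, na.rm = TRUE), na.rm = TRUE)
```

After scaling, every county's curve peaks at 1, so clustering compares the shape of the outbreak rather than its size.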

.footnote[
 
R code from Fadel's document (https://fmegahed.github.io/covid_deaths.html#3_Time-Series_Clustering)

]

---

# R package: **NbClust**
- Functions to perform
  - Clustering algorithms: *k*-means and hierarchical clustering
  - Distance measures

- 30 indices (relative validation) to determine the number of clusters

- Proposes the best clustering scheme from the different results

```
library(NbClust)

nc <- NbClust(clusteringPrep,
              distance = "euclidean",   # Euclidean distance
              min.nc = 2, max.nc = 51,  # search for the optimal k between 2 and 51
              method = "kmeans",        # k-means clustering
              index = "all")            # 26 of the 30 indices in the package
```

- Computationally demanding indices (Gamma, Gplus, Tau, and Gap) are included only with:

```
   index = "alllong"
```
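To show what a single relative index does, the sketch below computes the Calinski-Harabasz index (one of the indices NbClust offers) by hand for several values of k and keeps the k with the largest value. It uses base R's `kmeans` on hypothetical toy data rather than NbClust itself; the helper `ch_index` is made up for this example.

```r
set.seed(123)
# Hypothetical toy data: three well-separated groups of 30 points in 2-D
centers <- rbind(c(0, 0), c(5, 5), c(10, 0))
x <- do.call(rbind, lapply(1:3, function(g) {
  cbind(rnorm(30, centers[g, 1], 0.3), rnorm(30, centers[g, 2], 0.3))
}))

# Calinski-Harabasz index: between-cluster over within-cluster dispersion,
# each divided by its degrees of freedom
ch_index <- function(fit, n) {
  k <- nrow(fit$centers)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
}

candidate_k <- 2:6
ch <- sapply(candidate_k, function(k) {
  ch_index(kmeans(x, centers = k, nstart = 25, iter.max = 50), nrow(x))
})

# The index proposes the k with the largest value
best_k <- candidate_k[which.max(ch)]
```

NbClust repeats this kind of search for each index and then compares the proposals.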

.footnote[
 
R code from Fadel's document (https://fmegahed.github.io/covid_deaths.html#3_Time-Series_Clustering)

]

---
background-image: url(https://d3i71xaburhd42.cloudfront.net/e42795688a276731132dbb1759e9c9370449d546/19-Table2-1.png)
background-position: 
background-size: 50%

???
Thirty indices from NbClust

---
# Time-series clustering results
- Ways to select the number of clusters
  - Majority rule
  - Considering only indices that performed best in simulation studies

```
## * Among all indices:                                                
## * 6 proposed 2 as the best number of clusters 
## * 2 proposed 3 as the best number of clusters 
## * 7 proposed 4 as the best number of clusters 
## * 1 proposed 25 as the best number of clusters 
## * 1 proposed 26 as the best number of clusters 
## * 2 proposed 34 as the best number of clusters 
## * 1 proposed 41 as the best number of clusters 
## * 2 proposed 43 as the best number of clusters 
## * 1 proposed 46 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  4

```
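The majority rule is a simple vote count. A minimal sketch, with the tallies copied from the output above (the variable names are hypothetical):

```r
# Number of clusters (names) and how many indices proposed it (values)
votes <- c("2" = 6, "3" = 2, "4" = 7, "25" = 1, "26" = 1,
           "34" = 2, "41" = 1, "43" = 2, "46" = 1)

# Majority rule: pick the number of clusters with the most votes
best_k <- as.integer(names(votes)[which.max(votes)])
```

Note that k = 2 is a close runner-up (6 votes against 7), which is one reason to also look at the indices that performed best in simulation studies.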
.footnote[
 
Results from Fadel's document (https://fmegahed.github.io/covid_deaths.html#3_Time-Series_Clustering)

]

---

### What I learned
   - Clustering analysis
   - R packages for clustering analysis
   - How to obtain COVID-19 results and import into R

### Suggestions
Try different distance measures using the R packages **TSclust** or **dtwclust**

#### **TSclust**
   - Fréchet distance
   - Dynamic time warping distance
   - Model-based approaches
   ......

#### **dtwclust**
   - Dynamic time warping distance
   - Global alignment kernel distance
   - Soft-DTW
   - Shape-based distance
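To illustrate why dynamic time warping may suit outbreak curves, here is a textbook dynamic-programming sketch in base R. It uses absolute difference as the local cost and no window constraint, and it is not the implementation used by TSclust or dtwclust; the function name `dtw_distance` and the toy series are assumptions.

```r
# Dynamic time warping distance between two numeric series
dtw_distance <- function(x, y) {
  n <- length(x)
  m <- length(y)
  # cost[i + 1, j + 1] = cheapest alignment of x[1..i] with y[1..j]
  cost <- matrix(Inf, n + 1, m + 1)
  cost[1, 1] <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      local <- abs(x[i] - y[j])
      cost[i + 1, j + 1] <- local + min(cost[i, j + 1],  # repeat y[j]
                                        cost[i + 1, j],  # repeat x[i]
                                        cost[i, j])      # advance both
    }
  }
  cost[n + 1, m + 1]
}

# Hypothetical toy series: the same wave, shifted by one day
x <- c(0, 0, 1, 2, 1, 0)
y <- c(0, 1, 2, 1, 0, 0)

d_dtw <- dtw_distance(x, y)         # warping absorbs the shift
d_euclid <- sqrt(sum((x - y)^2))    # point-by-point comparison penalises it
```

Under Euclidean distance two counties with the same wave a few days apart look different, while DTW treats them as similar.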
   
---

background-image: url(https://d3i71xaburhd42.cloudfront.net/a46ec863bbf3e179de4e7ccedd205a96ab1ca64f/13-Table1-1.png)
background-position: 
background-size: 60%

# Multivariate time series clustering
- R package: **TSclust**
- R package: **dtwclust**

.footnote[
 
LB: Lower bounds for DTW; DTW: Dynamic time warping; GAK: Global alignment kernel; SBD: Shape-based distance

]