Instructions

Consider the file A0001.mat from the PhysioNet Challenge https://physionetchallenges.github.io/2020/. Using R:

  1. Plot the histogram of all the 12 ECG leads with the respective density curve.
  2. Using the second lead inverted, apply the expectation-maximization algorithm with 2 and 3 latent classes.
  3. Plot the densities for each of the latent classes.
  4. Which point in the ECG belongs to each latent class? Plot the ECG in which each point has a color corresponding to the class to which it belongs.
  5. Use the difference between the averages of class distribution as the convergence criterion.
  6. Generate 1000 data points at random according to a single distribution fitted to the original data.
  7. Generate 1000 data points at random according to a mixture of distributions fitted using one of the previous EM computed in line b.

Submitted code file should include comments to improve readability.

Setup

First, load the libraries:

Import data using readMat from R.matlab package. The data comes in a List container, so we drop it to work directly with the matrix:

Histograms

  1. Plot the histogram of all the 12 ECG leads with the respective density curve.

Expectation-Maximization

  1. Using the second lead inverted, apply the expectation-maximization algorithm with 2 and 3 latent classes.
  2. Plot the densities for each of the latent classes.
# expectation maximization function ----
expmax <- function(data, k, eps = 1e-5, crisp = FALSE, plot = TRUE) {
  set.seed(2020)
  sample_mean <- sample(data, k)
  sample_sd <- rep(sd(data) / k, k)
  max_iterations <- 10000

  for (i in seq_len(max_iterations)) {
    # expectation ----
    dens <- sapply(seq_len(k), function(x) {
      dnorm(data, mean = sample_mean[x], sd = sample_sd[x])
    })

    rel <- t(apply(dens, 1, function(x) {
      x / sum(x)
    }))

    # maximization ----
    attrib <- t(apply(rel, 1, function(x) {
      replace(rep(0, k), which.max(x)[1], 1)
    }))

    if (crisp) {
      # crisp ----
      sample_mean.new <- sapply(seq_len(k), function(x) {
        mean(data[attrib[, x] == 1])
      })

      sample_sd.new <- sapply(seq_len(k), function(x) {
        sd(data[attrib[, x] == 1])
      })
    } else {
      # soft ----
      sample_mean.new <- sapply(seq_len(k), function(x) {
        sum(data * rel[, x]) / sum(rel[, x])
      })

      sample_sd.new <- sapply(seq_len(k), function(x) {
        sqrt(sum(rel[, x] * (data - sample_mean[x])^2) / sum(rel[, x]))
      })
    }

    dif <- sum(abs(sample_mean.new - sample_mean))

    if (dif < eps) {
      message("Finished in ", i, " iterations.")
      if (plot) {
        gen_data_dens <- list()
        max_y <- -Inf
        for (i in seq_len(k)) {
          set.seed(2020)
          gen_data_dens[[i]] <- density(rnorm(100000, mean = sample_mean[i], sd = sample_sd[i]))
          max <- max(gen_data_dens[[i]]$y)
          if (max > max_y) {
            max_y <- max
          }
        }

        hist(data,
          main = paste("Expectation-Maximization with", k, "classes"),
          freq = FALSE, xlab = "Value", ylim = c(0, max_y)
        )

        for (i in seq_len(k)) {
          lines(gen_data_dens[[i]], lwd = 1, col = i + 1)
        }
      }

      return(list(mean = sample_mean, sd = sample_sd, attrib = attrib, rel = rel, dens = dens))
    }

    sample_mean <- sample_mean.new
    sample_sd <- sample_sd.new
  }
  stop("Algorithm did not converge.")
}

# data ----
lead <- data[[2]] * -1

# using 2 classes ----
result1 <- expmax(lead, 2, plot = TRUE)
Finished in 56 iterations.

Finished in 71 iterations.

Sections by class

  1. Which point in the ECG belongs to each latent class? Plot the ECG in which each point has a color corresponding to the class to which it belongs.
  2. Use the difference between the averages of class distribution as the convergence criterion.

Generating data

  1. Generate 1000 data points at random according to a single distribution fitted to the original data.
  2. Generate 1000 data points at random according to a mixture of distributions fitted using one of the previous EM computed in line b.

---
title: "HEADS - HIDA - CompStat: Assignment 2"
output: 
  html_notebook: 
    highlight: pygments
    theme: united
    toc: yes
author: Francisco Bischoff
---

## Instructions

Consider the file A0001.mat from the PhysioNet Challenge https://physionetchallenges.github.io/2020/. Using R:

a) Plot the histogram of all the 12 ECG leads with the respective density curve.
b) Using the second lead inverted, apply the expectation-maximization algorithm with 2 and 3 latent classes.
c) Plot the densities for each of the latent classes.
d) Which point in the ECG belongs to each latent class? Plot the ECG in which each point has a color corresponding to the class to which it belongs.
e) Use the difference between the averages of class distribution as the convergence criterion.
f) Generate 1000 data points at random according to a single distribution fitted to the original data.
g) Generate 1000 data points at random according to a mixture of distributions fitted using one of the previous EM computed in line b. 

Submitted code file should include comments to improve readability.

## Setup

First, load the libraries:

```{r setup, message = FALSE}
library(R.matlab)
```

Import data using `readMat` from `R.matlab` package. The data comes in a `List` container, so we drop it to work directly with the `matrix`:

```{r import}
# Import the data ----
data <- readMat("A0001.mat")
data <- data[[1]] # get rid of List
data <- t(data) # transpose the matrix, so each column represents one lead.
data <- as.data.frame(data)
colnames(data) <- c("I", "II", "III", "aVR", "aVL", "aVF", "V1", "V2", "V3", "V4", "V5", "V6")
```

## Histograms

a) Plot the histogram of all the 12 ECG leads with the respective density curve.

```{r histograms, fig.height=6, fig.width=10}
oldpar <- par(mfrow = c(3, 4))

for (i in 1:12) {
  hist(data[[i]], main = names(data[i]), freq = FALSE, xlab = "Value", ylim = c(0, 0.011))
  lines(density(data[[i]]), lwd = 1, col = 4)
}

par(oldpar)
```

## Expectation-Maximization

b) Using the second lead inverted, apply the expectation-maximization algorithm with 2 and 3 latent classes.
c) Plot the densities for each of the latent classes.

```{r em, fig.height=6, fig.width=10}
# expectation maximization function ----
expmax <- function(data, k, eps = 1e-5, crisp = FALSE, plot = TRUE) {
  set.seed(2020)
  sample_mean <- sample(data, k)
  sample_sd <- rep(sd(data) / k, k)
  max_iterations <- 10000

  for (i in seq_len(max_iterations)) {
    # expectation ----
    dens <- sapply(seq_len(k), function(x) {
      dnorm(data, mean = sample_mean[x], sd = sample_sd[x])
    })

    rel <- t(apply(dens, 1, function(x) {
      x / sum(x)
    }))

    # maximization ----
    attrib <- t(apply(rel, 1, function(x) {
      replace(rep(0, k), which.max(x)[1], 1)
    }))

    if (crisp) {
      # crisp ----
      sample_mean.new <- sapply(seq_len(k), function(x) {
        mean(data[attrib[, x] == 1])
      })

      sample_sd.new <- sapply(seq_len(k), function(x) {
        sd(data[attrib[, x] == 1])
      })
    } else {
      # soft ----
      sample_mean.new <- sapply(seq_len(k), function(x) {
        sum(data * rel[, x]) / sum(rel[, x])
      })

      sample_sd.new <- sapply(seq_len(k), function(x) {
        sqrt(sum(rel[, x] * (data - sample_mean[x])^2) / sum(rel[, x]))
      })
    }

    dif <- sum(abs(sample_mean.new - sample_mean))

    if (dif < eps) {
      message("Finished in ", i, " iterations.")
      if (plot) {
        gen_data_dens <- list()
        max_y <- -Inf
        for (i in seq_len(k)) {
          set.seed(2020)
          gen_data_dens[[i]] <- density(rnorm(100000, mean = sample_mean[i], sd = sample_sd[i]))
          max <- max(gen_data_dens[[i]]$y)
          if (max > max_y) {
            max_y <- max
          }
        }

        hist(data,
          main = paste("Expectation-Maximization with", k, "classes"),
          freq = FALSE, xlab = "Value", ylim = c(0, max_y)
        )

        for (i in seq_len(k)) {
          lines(gen_data_dens[[i]], lwd = 1, col = i + 1)
        }
      }

      return(list(mean = sample_mean, sd = sample_sd, attrib = attrib, rel = rel, dens = dens))
    }

    sample_mean <- sample_mean.new
    sample_sd <- sample_sd.new
  }
  stop("Algorithm did not converge.")
}

# data ----
lead <- data[[2]] * -1

# using 2 classes ----
result1 <- expmax(lead, 2, plot = TRUE)

# using 3 classes ----
result2 <- expmax(lead, 3, plot = TRUE)
```

## Sections by class

d) Which point in the ECG belongs to each latent class? Plot the ECG in which each point has a color corresponding to the class to which it belongs.
e) Use the difference between the averages of class distribution as the convergence criterion.

```{r classes2, fig.height=5, fig.width=10, collapse=TRUE}

splits <- sort(result1$mean)
split1 <- mean(splits)

class1 <- ifelse(result1$attrib[, 1] == 1, -lead, NA)
class2 <- ifelse(result1$attrib[, 2] == 1, -lead, NA)

{
  plot(class1,
    type = "l", col = 2, main = "Sections by class - 2", ylab = "value",
    ylim = c(-720, 720)
  )
  lines(class2, col = 3)
  legend("topright",
    legend = c("1", "2"),
    lty = 1, col = 2:3,
    title = "Classes"
  )
}
```

```{r classes3, fig.height=5, fig.width=10, collapse=TRUE}

splits <- sort(result2$mean)
split1 <- mean(splits[1:2])
split2 <- mean(splits[2:3])

class1 <- ifelse(result2$attrib[, 1] == 1, -lead, NA)
class2 <- ifelse(result2$attrib[, 2] == 1, -lead, NA)
class3 <- ifelse(result2$attrib[, 3] == 1, -lead, NA)


{
  plot(class1,
    type = "l", col = 2, main = "Sections by class - 3", ylab = "value",
    ylim = c(-720, 720)
  )
  lines(class2, col = 3)
  lines(class3, col = 4)
  legend("topright",
    legend = c("1", "2", "3"),
    lty = 1, col = 2:4,
    title = "Classes"
  )
}
```

## Generating data

f) Generate 1000 data points at random according to a single distribution fitted to the original data.
g) Generate 1000 data points at random according to a mixture of distributions fitted using one of the previous EM computed in line b. 

```{r generated, fig.height=6, fig.width=10}

n <- 1000
classes <- apply(result1$attrib, 1, which.max)
class_prop <- prop.table(table(classes))
set.seed(1000)
sample_classes <- sample(seq_along(result1$mean), size = n, replace = T, prob = class_prop)
sample_prop <- prop.table(table(sample_classes))
mixture1_gen <- round(unlist(sapply(seq_along(result1$mean), function(x) {
  rnorm(n * sample_prop[x], mean = result1$mean[x], sd = result1$sd[x])
})), 0)

classes <- apply(result2$attrib, 1, which.max)
class_prop <- prop.table(table(classes))
set.seed(1000)
sample_classes <- sample(seq_along(result2$mean), size = n, replace = T, prob = class_prop)
sample_prop <- prop.table(table(sample_classes))
mixture2_gen <- round(unlist(sapply(seq_along(result2$mean), function(x) {
  rnorm(n * sample_prop[x], mean = result2$mean[x], sd = result2$sd[x])
})), 0)

original_gen <- rnorm(n, mean = mean(lead), sd = sd(lead))

oldpar <- par(mfrow = c(2, 2))
breaks <- 10
{
  hist(lead,
    breaks = breaks, main = "Original dataset", freq = FALSE,
    xlim = c(-800, 800), xlab = "value"
  )
  hist(original_gen,
    breaks = breaks, main = "Generated from original", freq = FALSE,
    xlim = c(-800, 800), xlab = "value"
  )
  hist(mixture1_gen,
    breaks = breaks, main = "Generated from 2 classes", freq = FALSE,
    xlim = c(-800, 800), xlab = "value"
  )
  hist(mixture2_gen,
    breaks = breaks, main = "Generated from 3 classes", freq = FALSE,
    xlim = c(-800, 800), xlab = "value"
  )
}
par(oldpar)
```
