Exploratory analysis and visualization II

This work is a continuation of the tutorial on single-cell expression.

We collected three new 96-well plates and performed single-cell expression analysis as before.

The drugs we use are:

DMSO (control),
SAHA (the same HDAC inhibitor as in the tutorial) and,
PMA (phorbol myristate acetate, an activator of the NF-kB pathway).

We use Jurkat T cells that are not infected by HIV, and J-Lat that have a latent HIV infection.

Question 1

Load the data from the three plates (P3.tsv.gz, P4.tsv.gz and P5.tsv.gz). Which command do you use? Provide code alternatives (if any). Give some evidence that the data is properly loaded.

# Load data
P3_tsv <- read.table("H3/P3.tsv.gz", header = TRUE, row.names = 1) 
P4_tsv <- read.table("H3/P4.tsv.gz", header = TRUE, row.names = 1)
P5_tsv <- read.table("H3/P5.tsv.gz", header = TRUE, row.names = 1)

# Check if it is correctly loaded
# View(P3_tsv)
# View(P4_tsv)
# View(P5_tsv)

# We could also load data with readr package
library(readr)
P3_tsv <- read_delim("H3/P3.tsv.gz", 
                     delim = "\t", escape_double = FALSE, 
                     trim_ws = TRUE)
P4_tsv <- read_delim("H3/P4.tsv.gz", 
                     delim = "\t", escape_double = FALSE, 
                     trim_ws = TRUE)
P5_tsv <- read_delim("H3/P5.tsv.gz", 
                     delim = "\t", escape_double = FALSE, 
                     trim_ws = TRUE)

# Another way to check if data is correctly loaded
str(P3_tsv)
str(P4_tsv)
str(P5_tsv)

P3 tsv
	P2769_N701.S502	P2769_N701.S503	P2769_N701.S505
ENSG00000000003.14	0	0	0
ENSG00000000005.5	0	0	0
ENSG00000000419.12	39	68	16
ENSG00000000457.13	0	8	17
ENSG00000000460.16	0	46	4
ENSG00000000938.12	0	0	0

P4 tsv
	P2770_N701.S513	P2770_N701.S515	P2770_N701.S516	P2770_N701.S517
ENSG00000000003.14	0	0.00	0	0
ENSG00000000005.5	0	0.00	0	0
ENSG00000000419.12	49	79.00	144	21
ENSG00000000457.13	0	68.42	0	2
ENSG00000000460.16	0	127.58	0	13
ENSG00000000938.12	0	0.00	0	0

P5 tsv
	P2771_N701.S502	P2771_N701.S503	P2771_N701.S505	P2771_N701.S506
ENSG00000000003.14	0	0	0	0
ENSG00000000005.5	0	0	0	0
ENSG00000000419.12	41	127	113	60
ENSG00000000457.13	60	2	16	0
ENSG00000000460.16	0	59	48	0
ENSG00000000938.12	0	0	0	0

Question 2

Load the treatment associated with every cell (samplesheet.tsv). Which command do you use? How many different conditions are present in the three plates? How many cells are in each condition? Do you need to reorder cells so that data in sample sheet and tsv files match?

## Read metadata as a vector and as a factor
treatment <- read.delim("H3/samplesheet.tsv", stringsAsFactors=TRUE, row.names = 1)
str(treatment)

## 'data.frame':    288 obs. of  1 variable:
##  $ label: Factor w/ 6 levels "J-LatA2+DMSO",..: 4 6 5 1 3 3 2 2 4 6 ...

There are 6 possible conditions present in the three plates, for each of the two cell types three different drugs:

paste(unique(treatment$label), collapse = ", ")

## [1] "Jurkat+DMSO, Jurkat+SAHA, Jurkat+PMA, J-LatA2+DMSO, J-LatA2+SAHA, J-LatA2+PMA"

Each condition presents:

sapply(unique(treatment$label), function (x) {
  paste0(x, ": ", length(which(treatment$label == x)), " cells")
})

## [1] "Jurkat+DMSO: 18 cells"  "Jurkat+SAHA: 36 cells"  "Jurkat+PMA: 36 cells"  
## [4] "J-LatA2+DMSO: 36 cells" "J-LatA2+SAHA: 81 cells" "J-LatA2+PMA: 81 cells"

In total we have 90 Jurkat cells and 198 J-LatA2 cells.
Looking at number of cells by drug treatment we observe that: DMSO 54 cells, SAHA 117 cells, PMA 117 cells.

In order for treatment sheet and tsv files to match we just need to bind rows of P3_tsv, P4_tsv and P4_tsv data frames.

P3_tsv <- as.data.frame(t(P3_tsv))
P4_tsv <- as.data.frame(t(P4_tsv))
P5_tsv <- as.data.frame(t(P5_tsv))

# Merge P3, P4 and P5 into single df
DATA <- rbind(P3_tsv, P4_tsv, P5_tsv)

# Format meta data
treatment %<>% separate(label, c("cell", "drug"), sep = "[+]")
treatment$totalexp <- rowSums(DATA)
treatment$replicates <- as.factor(c(rep("A", 96), rep("B",96), rep("C",96)))

DATA
	ENSG00000000003.14	ENSG00000000005.5	ENSG00000000419.12
P2769_N701.S502	0	0	39
P2769_N701.S503	0	0	68
P2769_N701.S505	0	0	16
P2769_N701.S506	0	0	0
P2769_N701.S507	0	0	0
P2769_N701.S508	0	0	0

Meta Data
	cell	drug	totalexp	replicates
P2769_N701.S502	Jurkat	DMSO	142921.98	A
P2769_N701.S503	Jurkat	SAHA	344847.00	A
P2769_N701.S505	Jurkat	PMA	265363.00	A
P2769_N701.S506	J-LatA2	DMSO	381307.99	A
P2769_N701.S507	J-LatA2	SAHA	76254.00	A
P2769_N701.S508	J-LatA2	SAHA	89569.01	A

Question 3

Represent the data with a PCA projection and tSNE. Scale or normalize data appropriately. Show the different steps in the process and comparison with raw-data and normalized. Using colors, argue whether batch effects are present between the three plates. Provide different visualization and/or code alternatives.

Raw Data

# PCA with RAW data
PCA_raw <- stats::prcomp(DATA)

ggplot2::autoplot(PCA_raw, data = treatment, colour = "drug", shape = "cell") + 
  labs(title = "PCA - Raw Data",
       color = "Drug",
       shape = "Cell") + 
  theme_classic() +
  theme(plot.title = element_text(hjust = "0.5", face = "bold"))

tsne_plot <- function(tSNE, treatment) {
  plot_data <- 
    data.frame(V1 = tSNE$Y[,1], V2 = tSNE$Y[,2],
               cell = treatment$cell,
               drug = treatment$drug,
               replicates = treatment$replicates,
               totalexp = treatment$totalexp)
  return(plot_data)
}

# tSNE with RAW data
tSNE <- Rtsne(as.matrix(DATA))
tsne_data <- tsne_plot(tSNE, treatment)
ggplot(tsne_data, aes(x=V1, y=V2)) + 
  geom_point(aes(color=drug, shape=replicates)) + 
  labs(title = "tSNE - Raw Data",
       color = "Drug",
       shape = "Replicates") + 
  theme_classic() +
  theme(plot.title = element_text(hjust = "0.5", face = "bold"))

By looking at the PCA performed with the raw data, we do not observe a batch effect on the replicates. But we observe some samples that might be outliers.

The variance is very different among the variables; in the case of some genes, the variance is several orders of magnitude higher than the rest. If the variables are not standardized to have zero mean and standard deviation 1 before conducting the PCA study, some variables will dominate most of the variables.

The tSNE performed with the RAW data presents three big very clear clusters one with SAHA treated samples, another with DMSO samples and a final one with PMA samples. All three homogeneously composed of the three type of replicates. Note that some control DMSO samples and SAHA samples mix with the PMA cluster that is the most dispersed one.

Scaled Data

# Normalize data
DATA.norm <- DATA[, colSums(DATA)>0]
DATA.norm <- DATA.norm/rowSums(DATA.norm)

# PCA with scaled data
PCA_scaled <- stats::prcomp(DATA.norm, scale = TRUE)
ggplot2::autoplot(PCA_scaled, data = treatment, colour = "drug", shape = "cell") + 
  geom_label_repel(aes(label = rownames(DATA.norm)), max.overlaps = 1) +
  labs(title = "PCA - Raw Data",
       color = "Drug",
       shape = "Cell") + 
  theme_classic() +
  theme(plot.title = element_text(hjust = "0.5", face = "bold"))

## Warning: ggrepel: 286 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

When performing the PCA with the scaled data, we see that it has been highly affected by outliers, so the next step will be to filter them, in order to ensure a better visualization.

tSNE <- Rtsne(as.matrix(DATA.norm))
tsne_data <- tsne_plot(tSNE, treatment)

ggplot(tsne_data, aes(x=V1, y=V2)) + 
  geom_point(aes(color=drug, shape=cell)) + 
  labs(title = "tSNE - Scaled Data",
       color = "Drug",
       shape = "Cell") + 
  theme_classic() +
  theme(plot.title = element_text(hjust = "0.5", face = "bold"))

When performing the tSNE with the scaled data we do not observe a great difference in contrast to the tSNE applied with the RAW data.

Outlier filtering

We will now perform a filtering of the outliers by applying a threshold on total gene expression, then we will apply a final PCA to our filtered data and visualize it.

# Remove outlier
DATA.fltr <- DATA.norm %>% filter(rownames(.) != "P2771_N704.S506")
treatment.fltr <- treatment %>% filter(rownames(.) != "P2771_N704.S506")
DATA.fltr <- DATA.fltr[, colSums(DATA.fltr)>0]

PCA_final <- stats::prcomp(DATA.fltr, scale = TRUE)
ggplot2::autoplot(PCA_final, data = treatment.fltr, 
                  colour = "drug", shape = "cell") + 
        labs(title = "PCA - Final",
             color = "Drug",
             shape = "Cell") +
        theme_classic() +
        theme(plot.title = element_text(hjust = "0.5", face = "bold"))

Question 4

Check the effect of perplexity and iterations, the same as scaling the data for tSNE or PCA plots respectively. As we have seen in previous practicals:

Perplexity

It says how to balance attention between local and global aspects of your data. It is a guess about the number of close neighbors each point has.

set.seed(123)

# List of plots with different perplexity
tsne_plots <- lapply(c(1, 10, 25, 30, 50, 70), function(x) {
  tSNE <- Rtsne(as.matrix(DATA.fltr), perplexity = x)
  tsne_data <- tsne_plot(tSNE, treatment.fltr)
  ggplot(tsne_data, aes(x=V1, y=V2)) + 
         geom_point(aes(color=drug, shape=cell)) + 
         labs(title = paste0("Perplexity ", x),
              shape = "Cell",
              color = "Drug") +
         theme_classic() +
         theme(plot.title = element_text(hjust = "0.5", size = 8),
               legend.position = "bottom")
})

ggarrange(plotlist = tsne_plots, ncol = 2, nrow = 3, common.legend = TRUE, legend = "bottom")

Iterations

The more the better. Sometimes, not feasible, one might have to wait for days to reach a very high number of iterations. In contrast, if you use too few iterations the clusters might not be visible and you typically discover a huge clump of data points in the center of your tSNE plot.

# List of plots with different iterations
tsne_plots_2 <- lapply(c(100, 1000, 2000, 5000), function(x) {
  tSNE <- Rtsne(as.matrix(DATA.fltr), perplexity = 50, max_iter = x)
  tsne_data <- tsne_plot(tSNE, treatment.fltr)
  ggplot(tsne_data, aes(x=V1, y=V2)) + 
    geom_point(aes(color=drug, shape=cell)) +
    labs(title = paste0("tSNE - Iterations ", x)) + 
    theme_classic() +
    theme(plot.title = element_text(hjust = "0.5"))
})

ggarrange(plotlist = tsne_plots_2, ncol = 2, nrow = 2, common.legend = TRUE, legend = "bottom")

Question 5

Argue whether the drugs SAHA and PMA have the same effect on the cells. Add a legend and label the axes accordingly.

ggplot2::autoplot(PCA_final, data = treatment.fltr, 
                  colour = "totalexp", shape = "drug", size = "cell") + 
  scale_color_gradient(low = "green", high = "red") +
  scale_shape_manual(values = c(3, 24, 21)) +
  scale_size_manual(values = c(4, 8)) +
  labs(shape = "Drug",
       size  = "Cell",
       color = "Total\nExpression") +
  theme_classic()

By looking at the PCA colored by total expression, we can see that most of highest expressed samples correspond to Jurkat cells in both SAHA and PMA treatments.

Question 6

Using the correct PCA analysis, find an interpretation for the first and the second principal component.

The vector defining the first principal component (PC1) follows the direction in which the observations vary the most. The second component (PC2) follows the second direction in which the data show the greatest variance and which is not correlated (orthogonal) with the first component.

By observing the PCA performed with scaled data and outliers removed:

In Principal Component (PC1) correlates to drug, as we can distinguish 3 different clusters the first one in the left composed of PMA treated samples the one at the center of DMSO and the right one of SAHA, all of them an homogeneous proportion of the two cell types.
In Principal Component (PC2) the variation is also correlated to drugs but in this case we can observe that PMA and SAHA treated samples are mixed from -0.05 to higher values. And from -0.05 to lower values we mainly find DMSO (control) samples. In this principal component we can easily differentiate control from treated samples.

Question 7

Using your answer to the previous question, find three genes that are activated by SAHA but not by PMA

To find the genes activated by SAHA but not with PMA we will find which genes present the highest weight in the PC1, that as noted before PC1 is the principal component differentiating the most the different drugs. We have selected the top 10 highest weight genes.

# Get weights of PC1
weights <- data.frame(PC1 = PCA_final$rotation[,1])
highest <- weights %>% arrange(desc(PC1)) %>% head(10)

We will now compute the mean expression of the selected genes by drug types, and we will observe which ones present a biggest difference in expression between SAHA and PMA. As we can consider this genes much more activated by SAHA than PMA (on average) in all cells.

# Add gene expression of selected gene
putative <- DATA[,rownames(highest)]
# Add drugs corresponding to each sample
putative$drug <- treatment$drug
# Compute mean by drug of gene expression
putative %>% 
  group_by(drug) %>% 
  summarize(across(everything(), list(mean))) %>%
  t() %>%
  select(name, SAHA, PMA) %>% 
  mutate(diff = SAHA-PMA) %>% 
  arrange(desc(diff)) %>% head(3) %>% 
  rename(Genes = name) %>%
  select(Genes, SAHA, PMA) %>% 
  kbl(booktabs = TRUE, caption = "Genes activated by SAHA but not by PMA") %>%
  kable_classic(html_font = "Cambria") %>% # Font = Cambria 
  kable_styling(latex_options = "HOLD_position")

Question 8

Create a UMAP visualization of the results. Explore differences between producing results from raw data counts or normalized results. Explore also the effect of predicting a new sample result from previous UMAP analysis. Compare results and give some thoughts on it.

UMAP_raw <- umap(DATA)

plot_umap <- function(x_given, y_given, col_given, shape_given) {
  data_plot <- data.frame(x = x_given, 
                          y = y_given, 
                          col = col_given,
                          shape = shape_given)
  p <- ggplot(data_plot) + 
    geom_point(aes(x = x, y = y, 
                   color = col, shape = shape)) +
    theme_classic()
  return(p)
}


# data_plot <- data.frame(x = UMAP_raw$layout[,1], 
#                         y = UMAP_raw$layout[,2], 
#                         drug = treatment$drug, 
#                         replicates = treatment$replicates)

# UMAP_1 <- ggplot(data_plot) + 
#   geom_point(aes(x = x, y = y, 
#                  color = drug, shape = replicates)) + 
#   labs(title = "UMAP analysis treatment",
#        color = "Drug",
#        shape = "Replicates") + 
#   theme_classic() +
#   theme(plot.title = element_text(hjust = "0.5", face = "bold"))

plot_umap(UMAP_raw$layout[,1], UMAP_raw$layout[,2],
          treatment$drug, treatment$replicates) + 
  labs(title = "UMAP analysis treatment",
       color = "Drug",
       shape = "Replicates")

Question 9

Test some parameters such as minimum distance and number of neighbors. Compare results and give some thoughts on it.

umap_plots <- lapply(c(2, 10, 20, 50), function(x) {
  UMAP <- umap(DATA, n_neighbors=x)
  plot_umap(UMAP$layout[,1], UMAP_raw$layout[,2],
            treatment$drug, treatment$replicates) + 
            labs(title = "UMAP analysis treatment",
                 color = "Drug",
                 shape = "Replicates")
})

## Warning: failed creating initial embedding; using random embedding instead

## Warning: failed creating initial embedding; using random embedding instead

## Warning: failed creating initial embedding; using random embedding instead

## Warning: failed creating initial embedding; using random embedding instead

## Warning: failed creating initial embedding; using random embedding instead

## Warning: failed creating initial embedding; using random embedding instead

## Warning: failed creating initial embedding; using random embedding instead

## Warning: failed creating initial embedding; using random embedding instead

## Warning: failed creating initial embedding; using random embedding instead

## Warning: failed creating initial embedding; using random embedding instead

## Warning: failed creating initial embedding; using random embedding instead

## Warning: failed creating initial embedding; using random embedding instead

## Warning: failed creating initial embedding; using random embedding instead

## Warning: failed creating initial embedding; using random embedding instead

## Warning: failed creating initial embedding; using random embedding instead

Question 10

Create a final visualization using results from PCA, tSNE and UMAP and write some thoughts on the comparison of the three high-dimension techniques. Use as many metadata available as possible and generate clear and easy to interpret plots.