Exploratory analysis and visualization II
This work is a continuation of the tutorial on single-cell expression.
We collected three new 96-well plates and performed single-cell expression analysis as before.
The drugs we use are:
- DMSO (control),
- SAHA (the same HDAC inhibitor as in the tutorial) and,
- PMA (phorbol myristate acetate, an activator of the NF-kB pathway).
We use Jurkat T cells that are not infected by HIV, and J-Lat that have a latent HIV infection.
Question 1
Load the data from the three plates (P3.tsv.gz,
P4.tsv.gz and P5.tsv.gz). Which command do you
use? Provide code alternatives (if any). Give some evidence that the
data is properly loaded.
# Load data
P3_tsv <- read.table("H3/P3.tsv.gz", header = TRUE, row.names = 1)
P4_tsv <- read.table("H3/P4.tsv.gz", header = TRUE, row.names = 1)
P5_tsv <- read.table("H3/P5.tsv.gz", header = TRUE, row.names = 1)
# Check if it is correctly loaded
# View(P3_tsv)
# View(P4_tsv)
# View(P5_tsv)# We could also load data with readr package
library(readr)
P3_tsv <- read_delim("H3/P3.tsv.gz",
delim = "\t", escape_double = FALSE,
trim_ws = TRUE)
P4_tsv <- read_delim("H3/P4.tsv.gz",
delim = "\t", escape_double = FALSE,
trim_ws = TRUE)
P5_tsv <- read_delim("H3/P5.tsv.gz",
delim = "\t", escape_double = FALSE,
trim_ws = TRUE)
# Another way to check if data is correctly loaded
str(P3_tsv)
str(P4_tsv)
str(P5_tsv)| P2769_N701.S502 | P2769_N701.S503 | P2769_N701.S505 | P2769_N701.S506 | |
|---|---|---|---|---|
| ENSG00000000003.14 | 0 | 0 | 0 | 0 |
| ENSG00000000005.5 | 0 | 0 | 0 | 0 |
| ENSG00000000419.12 | 39 | 68 | 16 | 0 |
| ENSG00000000457.13 | 0 | 8 | 17 | 0 |
| ENSG00000000460.16 | 0 | 46 | 4 | 0 |
| ENSG00000000938.12 | 0 | 0 | 0 | 0 |
| P2770_N701.S513 | P2770_N701.S515 | P2770_N701.S516 | P2770_N701.S517 | |
|---|---|---|---|---|
| ENSG00000000003.14 | 0 | 0.00 | 0 | 0 |
| ENSG00000000005.5 | 0 | 0.00 | 0 | 0 |
| ENSG00000000419.12 | 49 | 79.00 | 144 | 21 |
| ENSG00000000457.13 | 0 | 68.42 | 0 | 2 |
| ENSG00000000460.16 | 0 | 127.58 | 0 | 13 |
| ENSG00000000938.12 | 0 | 0.00 | 0 | 0 |
| P2771_N701.S502 | P2771_N701.S503 | P2771_N701.S505 | P2771_N701.S506 | |
|---|---|---|---|---|
| ENSG00000000003.14 | 0 | 0 | 0 | 0 |
| ENSG00000000005.5 | 0 | 0 | 0 | 0 |
| ENSG00000000419.12 | 41 | 127 | 113 | 60 |
| ENSG00000000457.13 | 60 | 2 | 16 | 0 |
| ENSG00000000460.16 | 0 | 59 | 48 | 0 |
| ENSG00000000938.12 | 0 | 0 | 0 | 0 |
Question 2
Load the treatment associated with every cell
(samplesheet.tsv). Which command do you use? How many
different conditions are present in the three plates? How many cells are
in each condition? Do you need to reorder cells so that data in sample
sheet and tsv files match?
## Read metadata as a vector and as a factor
treatment <- read.delim("H3/samplesheet.tsv", stringsAsFactors=TRUE, row.names = 1)
str(treatment)## 'data.frame': 288 obs. of 1 variable:
## $ label: Factor w/ 6 levels "J-LatA2+DMSO",..: 4 6 5 1 3 3 2 2 4 6 ...
There are 6 possible conditions present in the three plates, for each of the two cell types three different drugs:
paste(unique(treatment$label), collapse = ", ")## [1] "Jurkat+DMSO, Jurkat+SAHA, Jurkat+PMA, J-LatA2+DMSO, J-LatA2+SAHA, J-LatA2+PMA"
Each condition presents:
sapply(unique(treatment$label), function (x) {
paste0(x, ": ", length(which(treatment$label == x)), " cells")
})## [1] "Jurkat+DMSO: 18 cells" "Jurkat+SAHA: 36 cells" "Jurkat+PMA: 36 cells"
## [4] "J-LatA2+DMSO: 36 cells" "J-LatA2+SAHA: 81 cells" "J-LatA2+PMA: 81 cells"
In total we have 90 Jurkat cells and 198 J-LatA2 cells.
Looking at number of cells by drug treatment we observe that: DMSO 54 cells, SAHA 117 cells, PMA 117 cells.
In order for treatment sheet and tsv files to match we just need to
bind rows of P3_tsv, P4_tsv and
P4_tsv data frames.
P3_tsv <- as.data.frame(t(P3_tsv))
P4_tsv <- as.data.frame(t(P4_tsv))
P5_tsv <- as.data.frame(t(P5_tsv))
# Merge P3, P4 and P5 into single df
DATA <- rbind(P3_tsv, P4_tsv, P5_tsv)
# Format meta data
treatment %<>% separate(label, c("cell", "drug"), sep = "[+]")
treatment$totalexp <- rowSums(DATA)
treatment$replicates <- as.factor(c(rep("A", 96), rep("B",96), rep("C",96)))| ENSG00000000003.14 | ENSG00000000005.5 | ENSG00000000419.12 | |
|---|---|---|---|
| P2769_N701.S502 | 0 | 0 | 39 |
| P2769_N701.S503 | 0 | 0 | 68 |
| P2769_N701.S505 | 0 | 0 | 16 |
| P2769_N701.S506 | 0 | 0 | 0 |
| P2769_N701.S507 | 0 | 0 | 0 |
| P2769_N701.S508 | 0 | 0 | 0 |
| cell | drug | totalexp | replicates | |
|---|---|---|---|---|
| P2769_N701.S502 | Jurkat | DMSO | 142921.98 | A |
| P2769_N701.S503 | Jurkat | SAHA | 344847.00 | A |
| P2769_N701.S505 | Jurkat | PMA | 265363.00 | A |
| P2769_N701.S506 | J-LatA2 | DMSO | 381307.99 | A |
| P2769_N701.S507 | J-LatA2 | SAHA | 76254.00 | A |
| P2769_N701.S508 | J-LatA2 | SAHA | 89569.01 | A |
Question 3
Represent the data with a PCA projection and tSNE. Scale or normalize data appropriately. Show the different steps in the process and comparison with raw-data and normalized. Using colors, argue whether batch effects are present between the three plates. Provide different visualization and/or code alternatives.
Raw Data
# PCA with RAW data
PCA_raw <- stats::prcomp(DATA)
ggplot2::autoplot(PCA_raw, data = treatment, colour = "drug", shape = "cell") +
labs(title = "PCA - Raw Data",
color = "Drug",
shape = "Cell") +
theme_classic() +
theme(plot.title = element_text(hjust = "0.5", face = "bold"))tsne_plot <- function(tSNE, treatment) {
plot_data <-
data.frame(V1 = tSNE$Y[,1], V2 = tSNE$Y[,2],
cell = treatment$cell,
drug = treatment$drug,
replicates = treatment$replicates,
totalexp = treatment$totalexp)
return(plot_data)
}
# tSNE with RAW data
tSNE <- Rtsne(as.matrix(DATA))
tsne_data <- tsne_plot(tSNE, treatment)
ggplot(tsne_data, aes(x=V1, y=V2)) +
geom_point(aes(color=drug, shape=replicates)) +
labs(title = "tSNE - Raw Data",
color = "Drug",
shape = "Replicates") +
theme_classic() +
theme(plot.title = element_text(hjust = "0.5", face = "bold"))- By looking at the PCA performed with the raw data, we do not observe a batch effect on the replicates. But we observe some samples that might be outliers.
The variance is very different among the variables; in the case of some genes, the variance is several orders of magnitude higher than the rest. If the variables are not standardized to have zero mean and standard deviation 1 before conducting the PCA study, some variables will dominate most of the variables.
- The tSNE performed with the RAW data presents three big very clear clusters one with SAHA treated samples, another with DMSO samples and a final one with PMA samples. All three homogeneously composed of the three type of replicates. Note that some control DMSO samples and SAHA samples mix with the PMA cluster that is the most dispersed one.
Scaled Data
# Normalize data
DATA.norm <- DATA[, colSums(DATA)>0]
DATA.norm <- DATA.norm/rowSums(DATA.norm)# PCA with scaled data
PCA_scaled <- stats::prcomp(DATA.norm, scale = TRUE)
ggplot2::autoplot(PCA_scaled, data = treatment, colour = "drug", shape = "cell") +
geom_label_repel(aes(label = rownames(DATA.norm)), max.overlaps = 1) +
labs(title = "PCA - Raw Data",
color = "Drug",
shape = "Cell") +
theme_classic() +
theme(plot.title = element_text(hjust = "0.5", face = "bold"))## Warning: ggrepel: 286 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
- When performing the PCA with the scaled data, we see that it has been highly affected by outliers, so the next step will be to filter them, in order to ensure a better visualization.
tSNE <- Rtsne(as.matrix(DATA.norm))
tsne_data <- tsne_plot(tSNE, treatment)ggplot(tsne_data, aes(x=V1, y=V2)) +
geom_point(aes(color=drug, shape=cell)) +
labs(title = "tSNE - Scaled Data",
color = "Drug",
shape = "Cell") +
theme_classic() +
theme(plot.title = element_text(hjust = "0.5", face = "bold"))- When performing the tSNE with the scaled data we do not observe a great difference in contrast to the tSNE applied with the RAW data.
Outlier filtering
We will now perform a filtering of the outliers by applying a threshold on total gene expression, then we will apply a final PCA to our filtered data and visualize it.
# Remove outlier
DATA.fltr <- DATA.norm %>% filter(rownames(.) != "P2771_N704.S506")
treatment.fltr <- treatment %>% filter(rownames(.) != "P2771_N704.S506")
DATA.fltr <- DATA.fltr[, colSums(DATA.fltr)>0]PCA_final <- stats::prcomp(DATA.fltr, scale = TRUE)
ggplot2::autoplot(PCA_final, data = treatment.fltr,
colour = "drug", shape = "cell") +
labs(title = "PCA - Final",
color = "Drug",
shape = "Cell") +
theme_classic() +
theme(plot.title = element_text(hjust = "0.5", face = "bold"))Question 4
Check the effect of perplexity and iterations, the same as scaling the data for tSNE or PCA plots respectively. As we have seen in previous practicals:
Perplexity
It says how to balance attention between local and global aspects of your data. It is a guess about the number of close neighbors each point has.
set.seed(123)
# List of plots with different perplexity
tsne_plots <- lapply(c(1, 10, 25, 30, 50, 70), function(x) {
tSNE <- Rtsne(as.matrix(DATA.fltr), perplexity = x)
tsne_data <- tsne_plot(tSNE, treatment.fltr)
ggplot(tsne_data, aes(x=V1, y=V2)) +
geom_point(aes(color=drug, shape=cell)) +
labs(title = paste0("Perplexity ", x),
shape = "Cell",
color = "Drug") +
theme_classic() +
theme(plot.title = element_text(hjust = "0.5", size = 8),
legend.position = "bottom")
})ggarrange(plotlist = tsne_plots, ncol = 2, nrow = 3, common.legend = TRUE, legend = "bottom")Iterations
The more the better. Sometimes, not feasible, one might have to wait for days to reach a very high number of iterations. In contrast, if you use too few iterations the clusters might not be visible and you typically discover a huge clump of data points in the center of your tSNE plot.
# List of plots with different iterations
tsne_plots_2 <- lapply(c(100, 1000, 2000, 5000), function(x) {
tSNE <- Rtsne(as.matrix(DATA.fltr), perplexity = 50, max_iter = x)
tsne_data <- tsne_plot(tSNE, treatment.fltr)
ggplot(tsne_data, aes(x=V1, y=V2)) +
geom_point(aes(color=drug, shape=cell)) +
labs(title = paste0("tSNE - Iterations ", x)) +
theme_classic() +
theme(plot.title = element_text(hjust = "0.5"))
})ggarrange(plotlist = tsne_plots_2, ncol = 2, nrow = 2, common.legend = TRUE, legend = "bottom")Question 5
Argue whether the drugs SAHA and PMA have the same effect on the cells. Add a legend and label the axes accordingly.
ggplot2::autoplot(PCA_final, data = treatment.fltr,
colour = "totalexp", shape = "drug", size = "cell") +
scale_color_gradient(low = "green", high = "red") +
scale_shape_manual(values = c(3, 24, 21)) +
scale_size_manual(values = c(4, 8)) +
labs(shape = "Drug",
size = "Cell",
color = "Total\nExpression") +
theme_classic()By looking at the PCA colored by total expression, we can see that most of highest expressed samples correspond to Jurkat cells in both SAHA and PMA treatments.
Question 6
Using the correct PCA analysis, find an interpretation for the first and the second principal component.
The vector defining the first principal component (PC1) follows the direction in which the observations vary the most. The second component (PC2) follows the second direction in which the data show the greatest variance and which is not correlated (orthogonal) with the first component.
By observing the PCA performed with scaled data and outliers removed:
In Principal Component (PC1) correlates to drug, as we can distinguish 3 different clusters the first one in the left composed of PMA treated samples the one at the center of DMSO and the right one of SAHA, all of them an homogeneous proportion of the two cell types.
In Principal Component (PC2) the variation is also correlated to drugs but in this case we can observe that PMA and SAHA treated samples are mixed from -0.05 to higher values. And from -0.05 to lower values we mainly find DMSO (control) samples. In this principal component we can easily differentiate control from treated samples.
Question 7
Using your answer to the previous question, find three genes that are activated by SAHA but not by PMA
To find the genes activated by SAHA but not with PMA we will find which genes present the highest weight in the PC1, that as noted before PC1 is the principal component differentiating the most the different drugs. We have selected the top 10 highest weight genes.
# Get weights of PC1
weights <- data.frame(PC1 = PCA_final$rotation[,1])
highest <- weights %>% arrange(desc(PC1)) %>% head(10)We will now compute the mean expression of the selected genes by drug types, and we will observe which ones present a biggest difference in expression between SAHA and PMA. As we can consider this genes much more activated by SAHA than PMA (on average) in all cells.
# Add gene expression of selected gene
putative <- DATA[,rownames(highest)]
# Add drugs corresponding to each sample
putative$drug <- treatment$drug
# Compute mean by drug of gene expression
putative %>%
group_by(drug) %>%
summarize(across(everything(), list(mean))) %>%
t() %>%
select(name, SAHA, PMA) %>%
mutate(diff = SAHA-PMA) %>%
arrange(desc(diff)) %>% head(3) %>%
rename(Genes = name) %>%
select(Genes, SAHA, PMA) %>%
kbl(booktabs = TRUE, caption = "Genes activated by SAHA but not by PMA") %>%
kable_classic(html_font = "Cambria") %>% # Font = Cambria
kable_styling(latex_options = "HOLD_position")Question 8
Create a UMAP visualization of the results. Explore differences between producing results from raw data counts or normalized results. Explore also the effect of predicting a new sample result from previous UMAP analysis. Compare results and give some thoughts on it.
UMAP_raw <- umap(DATA)
plot_umap <- function(x_given, y_given, col_given, shape_given) {
data_plot <- data.frame(x = x_given,
y = y_given,
col = col_given,
shape = shape_given)
p <- ggplot(data_plot) +
geom_point(aes(x = x, y = y,
color = col, shape = shape)) +
theme_classic()
return(p)
}
# data_plot <- data.frame(x = UMAP_raw$layout[,1],
# y = UMAP_raw$layout[,2],
# drug = treatment$drug,
# replicates = treatment$replicates)
# UMAP_1 <- ggplot(data_plot) +
# geom_point(aes(x = x, y = y,
# color = drug, shape = replicates)) +
# labs(title = "UMAP analysis treatment",
# color = "Drug",
# shape = "Replicates") +
# theme_classic() +
# theme(plot.title = element_text(hjust = "0.5", face = "bold"))
plot_umap(UMAP_raw$layout[,1], UMAP_raw$layout[,2],
treatment$drug, treatment$replicates) +
labs(title = "UMAP analysis treatment",
color = "Drug",
shape = "Replicates")Question 9
Test some parameters such as minimum distance and number of neighbors. Compare results and give some thoughts on it.
umap_plots <- lapply(c(2, 10, 20, 50), function(x) {
UMAP <- umap(DATA, n_neighbors=x)
plot_umap(UMAP$layout[,1], UMAP_raw$layout[,2],
treatment$drug, treatment$replicates) +
labs(title = "UMAP analysis treatment",
color = "Drug",
shape = "Replicates")
})## Warning: failed creating initial embedding; using random embedding instead
## Warning: failed creating initial embedding; using random embedding instead
## Warning: failed creating initial embedding; using random embedding instead
## Warning: failed creating initial embedding; using random embedding instead
## Warning: failed creating initial embedding; using random embedding instead
## Warning: failed creating initial embedding; using random embedding instead
## Warning: failed creating initial embedding; using random embedding instead
## Warning: failed creating initial embedding; using random embedding instead
## Warning: failed creating initial embedding; using random embedding instead
## Warning: failed creating initial embedding; using random embedding instead
## Warning: failed creating initial embedding; using random embedding instead
## Warning: failed creating initial embedding; using random embedding instead
## Warning: failed creating initial embedding; using random embedding instead
## Warning: failed creating initial embedding; using random embedding instead
## Warning: failed creating initial embedding; using random embedding instead
Question 10
Create a final visualization using results from PCA, tSNE and UMAP and write some thoughts on the comparison of the three high-dimension techniques. Use as many metadata available as possible and generate clear and easy to interpret plots.