BelgradeR Meetup

Slajdovi sa prezentacije: Google drive

Rmd fajl ove stranice: Download

Za dole sprovedenu analizu potrebni su fajlovi: “GTEx_Analysis_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm__Pilot_V3_patch1.gct.gz”, “GTEx_Analysis_Annotations_Sample_DS__Pilot_V3.txt”, koji se mogu download-ovati sa http://www.gtexportal.org/home/datasets, nakon kratke registracije naloga.

Potrebni paketi:

library(readr)
library(tsne)
library(Rtsne)

Vizualizacija genomskih podataka

Tehnološki napredak u oblasti sekvenciranja ljudskog genoma omogućava nam da mnoge ćelijske procese kvantifikujemo brzo, jeftino i sa velikom preciznošću. Na primeru procesa transkripcije postiže se rezolucija do nivoa pojedinog gena (~ 20.000) svake pojedinačne ćelije u nekoj studiji. Izazov tumačenja ovako dobijenih i izvedenih podataka se u nekim slučajevima uspešno može rešiti t-SNE pristupom.

GTEx projekat: LINK

GTEx <- read_delim("~/apps/tsne_r_meetup/GTEx_Analysis_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm__Pilot_V3_patch1.gct.gz", "\t", escape_double = FALSE, 
                   comment = "#", trim_ws = TRUE, skip = 1)

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Name = col_character(),
##   Description = col_character()
## )

## See spec(...) for full column specifications.

GTEx_anno <- read_delim("~/apps/tsne_r_meetup/GTEx_Analysis_Annotations_Sample_DS__Pilot_V3.txt", "\t", escape_double = FALSE, trim_ws = TRUE,
col_types = cols_only(SAMPID = col_guess(), SMTS = col_guess(), SMTSD = col_guess()))

GTEx_samples <- names(GTEx[-c(1:2)])
GTEx_anno <- GTEx_anno[GTEx_anno$SAMPID %in% GTEx_samples,]
# cuvamo i features - bilo bi dobro stratifikovati rezultate i spram prirode feature-a: coding-genes, noncoding ili jos granularnije mitochondrial, sex genes, autosomal..
GTEx_features <- GTEx[c(1:2)]

GTEx.values <- GTEx[-c(1:2)]
rm(GTEx_samples)

# filter for rows that have RPKM > 0.1 in at least 80% of samples
rows_to_filter <- apply(GTEx.values, 1, function(x) sum(x>0.1)/length(x)) > 0.8
GTEx.values.filtered <- t(GTEx.values[ rows_to_filter, ])
GTEx.features <- GTEx[rows_to_filter, c(1:2)]
rm(GTEx, GTEx.values, rows_to_filter)

# log2, pseudocount 1, https://www.biostars.org/p/100926/ zanimaju nas promene u relativnim zastupljenostima, tj. proporcije koje se bolje modeliraju u log skali
GTEx.values.filtered.log2 <- log2(GTEx.values.filtered + 1)
rm(GTEx.values.filtered)

# zero-mean normalization (+ scaling ?)
GTEx.values.filtered.log2.scaled <- t(scale(t(GTEx.values.filtered.log2), center=T, scale=F))
#rm(GTEx.values.filtered.log2)

# tsne
#library(tsne)
# callback after each epoch to plot state
#ecb = function(x,y){ plot(x,t='n'); text(x,labels="*") }
#GTEx.tsne <- tsne(GTEx.values.filtered.log2.scaled, epoch_callback = ecb, 
#                  max_iter = 1000, perplexity=15, epoch=100)

# Rtsne
GTEx.tsne <- Rtsne(GTEx.values.filtered.log2.scaled, dims = 2, perplexity = 30,
                   theta = 0.0, check_duplicates = FALSE, pca = TRUE, max_iter = 1000,
                   verbose = TRUE, is_distance = FALSE, Y_init = NULL, initial_dims = 100)

## Read the 1641 x 100 data matrix successfully!
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.000000
## Computing input similarities...
## Normalizing input...
## Symmetrizing...
## Done in 0.71 seconds!
## Learning embedding...
## Iteration 50: error is 67.173967 (50 iterations in 1.97 seconds)
## Iteration 100: error is 55.853517 (50 iterations in 1.95 seconds)
## Iteration 150: error is 53.736117 (50 iterations in 2.15 seconds)
## Iteration 200: error is 52.769302 (50 iterations in 2.22 seconds)
## Iteration 250: error is 52.223140 (50 iterations in 2.12 seconds)
## Iteration 300: error is 1.006698 (50 iterations in 2.20 seconds)
## Iteration 350: error is 0.735200 (50 iterations in 2.21 seconds)
## Iteration 400: error is 0.623059 (50 iterations in 2.13 seconds)
## Iteration 450: error is 0.564449 (50 iterations in 2.18 seconds)
## Iteration 500: error is 0.528877 (50 iterations in 2.27 seconds)
## Iteration 550: error is 0.505114 (50 iterations in 2.10 seconds)
## Iteration 600: error is 0.488196 (50 iterations in 2.07 seconds)
## Iteration 650: error is 0.475713 (50 iterations in 2.09 seconds)
## Iteration 700: error is 0.466145 (50 iterations in 2.10 seconds)
## Iteration 750: error is 0.458658 (50 iterations in 2.06 seconds)
## Iteration 800: error is 0.452740 (50 iterations in 2.07 seconds)
## Iteration 850: error is 0.447949 (50 iterations in 2.22 seconds)
## Iteration 900: error is 0.444022 (50 iterations in 2.34 seconds)
## Iteration 950: error is 0.440744 (50 iterations in 2.22 seconds)
## Iteration 1000: error is 0.437976 (50 iterations in 2.14 seconds)
## Fitting performed in 42.82 seconds.

colors = rainbow(length(unique(GTEx_anno$SMTS)))
names(colors) = unique(GTEx_anno$SMTS)
plot(GTEx.tsne$Y, pch=16, col=colors[GTEx_anno$SMTS])

Vizualizacija visoko-dimenzionih genomskih podataka uz pomoć t-SNE

Dusan Randjelovic

02/28/2017

BelgradeR Meetup

Vizualizacija genomskih podataka