This is a full pipeline for importaing affymetrix .CEL data (human microarray chip data) and running full pairwise differential abundance tests. A full annotation table is also referenced in order to link probe names to gene names.
The .CEL data is read and analyzed by the “limma” package
First, the necessary libraries are activated and the annotation table (downloaded from an Affymetrix website) is read in:
Annotation <- read.csv("HTA-2_0.na35.2.hg19.transcript.csv/HTA-2_0.na35.2.hg19.transcript.csv", header=TRUE, skip=19)
The annotation table is very large and many columns are merged redundant database references (using “//” as separator). it is far too wide to display here! However, it is very powerful as it links the probes to a huge number of databases.
The Annotation table has several columns:
colnames(Annotation)
## [1] "transcript_cluster_id" "probeset_id"
## [3] "seqname" "strand"
## [5] "start" "stop"
## [7] "total_probes" "gene_assignment"
## [9] "mrna_assignment" "swissprot"
## [11] "unigene" "GO_biological_process"
## [13] "GO_cellular_component" "GO_molecular_function"
## [15] "pathway" "protein_domains"
## [17] "category" "locus.type"
## [19] "notes"
Many of these are potentially very useful. For now, I am useing the gene_assignment column. This contains many names separated by //. A single entry can look like this:
Annotation[1,"gene_assignment"]
## [1] NR_046018 // DDX11L1 // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 // 1p36.33 // 100287102 /// ENST00000456328 // DDX11L1 // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 // 1p36.33 // 100287102 /// ENST00000456328 // DDX11L5 // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 5 // 9p24.3 // 100287596 /// AM992878 // DDX11L12 // DEAD/H box polypeptide 11 like 12 // --- // 100532727 /// AM992880 // DDX11L9 // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 9 // 15q26.3 // 100288486 /// BC032353 // DDX11L1 // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 // 1p36.33 // 100287102 /// uc001aaa.3 // DDX11L1 // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 // 1p36.33 // 100287102 /// uc010nxq.1 // DDX11L9 // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 9 // 15q26.3 // 100288486 /// uc010nxr.1 // DDX11L12 // DEAD/H box polypeptide 11 like 12 // --- // 100532727
## 41529 Levels: --- ...
The second one is a gene name that links well to many databases and the third provides a description. The gene names can be fed into the Reactome online database for pathway mapping. We will create a focused mapping table that only has the necessary fields. This object will be used to attach gene names and functions to the final diff.expression tables
genenames.all <- Annotation
gene.split <- data.frame(do.call('rbind',strsplit(as.character(genenames.all$gene_assignment),"//",fixed=T)))
## Warning in .Method(..., deparse.level = deparse.level): number of columns
## of result is not a multiple of vector length (arg 1)
probe.gene.map <- data.frame(gene.split[,c(2,3)])
probe.gene.map$probe <- Annotation[,1]
probe.gene.map <- probe.gene.map[,c(3,1,2)]
probe.gene.map$X2 <- gsub("[[:space:]]", "", probe.gene.map$X2)
colnames(probe.gene.map) <- c("probe","name","description")
head(probe.gene.map)
## probe name
## 1 TC01000001.hg.1 DDX11L1
## 2 TC01000002.hg.1 MIR1302-11
## 3 TC01000003.hg.1 OR4F5
## 4 TC01000004.hg.1 RP11-34P13.9
## 5 TC01000005.hg.1 LOC100132287
## 6 TC01000006.hg.1 ---
## description
## 1 DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1
## 2 microRNA 1302-11
## 3 olfactory receptor, family 4, subfamily F, member 5
## 4 putative novel transcript
## 5 uncharacterized LOC100132287
## 6 ---
Differential abundance can only be performed on pairs of sample sets. Each pairwise test has to be configured as a separate file. Importing the test matrices:
targets <- read.csv("targets.txt", as.is=T)
targets.tGF.cGF <- read.csv("targets.GFonly.txt", as.is=T)
targets.tGFR.cGFR <- read.csv("targets.GFRonly.txt", as.is=T)
targets.cGFR.cGF <- read.csv("targets.noDrug.Strain.txt", as.is=T)
targets.tGFR.tGF <- read.csv("targets.drugonly.txt", as.is=T)
all of the samples and metadata are listed here:
targets
## FileName Strain Treatment Experiment
## 1 20160505_HTA2.0_0316_WTH_1R_(HTA-2_0).CEL GF control 1
## 2 20160505_HTA2.0_0317_WTH_2R_(HTA-2_0).CEL GF drug 1
## 3 20160505_HTA2.0_0318_WTH_3R_(HTA-2_0).CEL GFR control 1
## 4 20160505_HTA2.0_0319_WTH_4R_(HTA-2_0).CEL GFR drug 1
## 5 20160616_HTA2.0_0333_WTH_1_R1_(HTA-2_0).CEL GF control 2
## 6 20160616_HTA2.0_0334_WTH_2_R1_(HTA-2_0).CEL GF drug 2
## 7 20160616_HTA2.0_0335_WTH_3_R2_(HTA-2_0).CEL GFR control 2
## 8 20160616_HTA2.0_0336_WTH_4_R1_(HTA-2_0).CEL GFR drug 2
The matrices are set up as such; this matrix is for comparing non-treated and treated samples within in strain GF (control strain). Note thta this is just one of the tests; we now have matrices for 3 other test schemes.
targets.tGF.cGF
## FileName Strain Treatment Experiment
## 1 20160505_HTA2.0_0316_WTH_1R_(HTA-2_0).CEL GF control 1
## 2 20160505_HTA2.0_0317_WTH_2R_(HTA-2_0).CEL GF drug 1
## 3 20160616_HTA2.0_0333_WTH_1_R1_(HTA-2_0).CEL GF control 2
## 4 20160616_HTA2.0_0334_WTH_2_R1_(HTA-2_0).CEL GF drug 2
dat <- read.celfiles(targets$FileName) #read .CEL files
## Loading required package: pd.hta.2.0
## Loading required package: RSQLite
## Loading required package: DBI
## Platform design info loaded.
## Reading in : 20160505_HTA2.0_0316_WTH_1R_(HTA-2_0).CEL
## Reading in : 20160505_HTA2.0_0317_WTH_2R_(HTA-2_0).CEL
## Reading in : 20160505_HTA2.0_0318_WTH_3R_(HTA-2_0).CEL
## Reading in : 20160505_HTA2.0_0319_WTH_4R_(HTA-2_0).CEL
## Reading in : 20160616_HTA2.0_0333_WTH_1_R1_(HTA-2_0).CEL
## Reading in : 20160616_HTA2.0_0334_WTH_2_R1_(HTA-2_0).CEL
## Reading in : 20160616_HTA2.0_0335_WTH_3_R2_(HTA-2_0).CEL
## Reading in : 20160616_HTA2.0_0336_WTH_4_R1_(HTA-2_0).CEL
project.bgcorrect.norm.avg <- rma(dat, background=TRUE, normalize=TRUE) #normalize
## Background correcting
## Normalizing
## Calculating Expression
#this may remove control probes, it should only include probes which actually link to genes
project.bgcorrect.norm.avg.filt <- exprs(project.bgcorrect.norm.avg)[which(rownames(project.bgcorrect.norm.avg) %in% Annotation[,1]),]
#this writes the full results as a file
write.table(project.bgcorrect.norm.avg.filt, "full.expression.results.tsv", sep="\t", row.names = TRUE, quote=FALSE)
After export and filtration, the simplified table looks like (this is the first 6 probes and their normalized intensities):
head(data.frame(project.bgcorrect.norm.avg.filt))
## X20160505_HTA2.0_0316_WTH_1R_.HTA.2_0..CEL
## 2824546_st 12.36100
## 2824549_st 11.90234
## 2824551_st 11.74690
## 2824554_st 12.01546
## 2827992_st 11.01983
## 2827995_st 10.78221
## X20160505_HTA2.0_0317_WTH_2R_.HTA.2_0..CEL
## 2824546_st 12.23731
## 2824549_st 11.62001
## 2824551_st 11.81748
## 2824554_st 11.80532
## 2827992_st 10.97427
## 2827995_st 10.90626
## X20160505_HTA2.0_0318_WTH_3R_.HTA.2_0..CEL
## 2824546_st 12.55858
## 2824549_st 11.84178
## 2824551_st 11.90814
## 2824554_st 11.93338
## 2827992_st 11.34647
## 2827995_st 11.07592
## X20160505_HTA2.0_0319_WTH_4R_.HTA.2_0..CEL
## 2824546_st 12.44655
## 2824549_st 11.89080
## 2824551_st 11.86767
## 2824554_st 11.81407
## 2827992_st 11.38106
## 2827995_st 11.13100
## X20160616_HTA2.0_0333_WTH_1_R1_.HTA.2_0..CEL
## 2824546_st 12.13622
## 2824549_st 11.76768
## 2824551_st 11.65620
## 2824554_st 11.78064
## 2827992_st 11.02287
## 2827995_st 10.26596
## X20160616_HTA2.0_0334_WTH_2_R1_.HTA.2_0..CEL
## 2824546_st 12.13720
## 2824549_st 11.78317
## 2824551_st 11.63193
## 2824554_st 11.83432
## 2827992_st 11.08451
## 2827995_st 10.41974
## X20160616_HTA2.0_0335_WTH_3_R2_.HTA.2_0..CEL
## 2824546_st 12.27436
## 2824549_st 11.81748
## 2824551_st 11.98635
## 2824554_st 11.91740
## 2827992_st 11.09480
## 2827995_st 10.76636
## X20160616_HTA2.0_0336_WTH_4_R1_.HTA.2_0..CEL
## 2824546_st 12.27725
## 2824549_st 11.69465
## 2824551_st 11.89022
## 2824554_st 11.83155
## 2827992_st 11.21518
## 2827995_st 10.79048
This full data table that we just generated is only for record keeping purposes only. For each of the pairwise tests we will do, we will repeat this except that it will be limited to the samples we are testing with: define the test using the target matrix table (e.g. c vs. t, GF vs GFR) read in the .CEL files limited to the matrix table (only a subset of .CELs are read in) normalize define your tests and make differential abundance tables
dat <- read.celfiles(targets.cGFR.cGF$FileName)
## Platform design info loaded.
## Reading in : 20160505_HTA2.0_0316_WTH_1R_(HTA-2_0).CEL
## Reading in : 20160505_HTA2.0_0318_WTH_3R_(HTA-2_0).CEL
## Reading in : 20160616_HTA2.0_0333_WTH_1_R1_(HTA-2_0).CEL
## Reading in : 20160616_HTA2.0_0335_WTH_3_R2_(HTA-2_0).CEL
project.bgcorrect.norm.avg <- rma(dat, background=TRUE, normalize=TRUE)
## Background correcting
## Normalizing
## Calculating Expression
project.bgcorrect.norm.avg.filt <- exprs(project.bgcorrect.norm.avg)[which(rownames(project.bgcorrect.norm.avg) %in% Annotation[,1]),]
Exp <- factor(targets.cGFR.cGF$Experiment)
Strain <- factor(targets.cGFR.cGF$Strain, levels=c("GF","GFR"))
Treatment <- factor(targets.cGFR.cGF$Treatment, levels=c("control","drug"))
This is important decision; these chips were ran from different experimental batches, thus I have controlled for experiment, essentially running a paired analysis. In an earlier iteration, I did not control for experiment and found very few differences due to the high variance between experimental replicates
design <- model.matrix(~Exp+Strain)
eset <- project.bgcorrect.norm.avg.filt
fit <- lmFit(eset, design)
fit <- eBayes(fit, trend=T, robust=T)
## Warning: 65 very small variances detected, have been offset away from zero
results <- decideTests(fit)
write.table(summary(results), "controls.summary", quote=F)
summary(results)
## (Intercept) Exp2 StrainGFR
## Down 0 0 29
## NotSig 4 70523 70450
## Up 70519 0 44
library(data.table)
library(dplyr)
diff.cGFR.cGF <- topTable(fit, coef="StrainGFR", n=73)
setDT(diff.cGFR.cGF, keep.rownames = TRUE)[]
colnames(diff.cGFR.cGF)[1] <- "probe"
diff.cGFR.cGF <- dplyr::left_join(diff.cGFR.cGF,probe.gene.map,by="probe")
## Warning: Column `probe` joining character vector and factor, coercing into
## character vector
write.table(diff.cGFR.cGF, "diff.cGFR.cGF", quote = F)
write.table(diff.cGFR.cGF$name, "diff.cGFR.cGF.gene.list", quote = F)
head(diff.cGFR.cGF)
## probe logFC AveExpr t P.Value adj.P.Val
## 1 TC05001205.hg.1 -4.628246 5.478085 -51.37863 3.455196e-07 0.02263877
## 2 TC18000084.hg.1 -3.559838 5.536229 -38.69108 1.174218e-06 0.02263877
## 3 TC09000323.hg.1 4.167048 7.927503 34.93740 1.822866e-06 0.02263877
## 4 TC19000805.hg.1 3.678938 5.928768 35.70503 1.814231e-06 0.02263877
## 5 TC03000117.hg.1 2.870035 4.204782 32.22194 2.582918e-06 0.02263877
## 6 TC0X000272.hg.1 -2.896188 5.694877 -31.54888 2.828696e-06 0.02263877
## B name description
## 1 4.343165 CDH10 cadherin 10, type 2 (T2-cadherin)
## 2 4.141455 ANKRD30B ankyrin repeat domain 30B
## 3 4.041932 MAMDC2 MAM domain containing 2
## 4 3.995222 ERVV-2 endogenous retrovirus group V, member 2
## 5 3.950694 ZNF385D-AS1 ZNF385D antisense RNA 1
## 6 3.924952 GAGE12F G antigen 12F
This table can be regarded as a final output of the differential expression pipeline.
There is an API, a restful API, and some sort of R package “reactome.db” for this. Querying the database directly https://reactome.org/PathwayBrowser/#TOOL=AT is simple though. Pasting in the list of genes found in the *.gene.list file that was written, we get access to a pathway map with a good deal of depth for exploration
This image can be zoomed in on the website for perhaps more information or dramatic focus.
The results can be downloaded and loaded into R:
cGF.cGFR.reactome <- read.csv("reactome.cGFR.cGF.csv", header=T)
head(cGF.cGFR.reactome)
## Pathway.identifier Pathway.name
## 1 R-HSA-380994 ATF4 activates genes
## 2 R-HSA-418990 Adherens junctions interactions
## 3 R-HSA-6785807 Interleukin-4 and Interleukin-13 signaling
## 4 R-HSA-381042 PERK regulates gene expression
## 5 R-HSA-140834 Extrinsic Pathway of Fibrin Clot Formation
## 6 R-HSA-421270 Cell-cell junction organization
## X.Entities.found X.Entities.total Entities.ratio Entities.pValue
## 1 2 32 0.0022884932 0.006925115
## 2 2 35 0.0025030394 0.008223584
## 3 4 211 0.0150897518 0.009004908
## 4 2 38 0.0027175856 0.009622609
## 5 1 6 0.0004290925 0.022909466
## 6 2 67 0.0047915326 0.027866402
## Entities.FDR X.Reactions.found X.Reactions.total Reactions.ratio
## 1 0.3271687 1 7 0.0005938242
## 2 0.3271687 2 16 0.0013573125
## 3 0.3271687 2 46 0.0039022735
## 4 0.3271687 1 11 0.0009331524
## 5 0.3291710 1 8 0.0006786563
## 6 0.3291710 2 21 0.0017814727
## Species.identifier Species.name Submitted.entities.found Mapped.entities
## 1 9606 Homo sapiens CCL2 NA
## 2 9606 Homo sapiens CDH10;CDH12 NA
## 3 9606 Homo sapiens SAA1;CCL2 NA
## 4 9606 Homo sapiens CCL2 NA
## 5 9606 Homo sapiens TFPI NA
## 6 9606 Homo sapiens CDH10;CDH12 NA
## Found.reaction.identifiers
## 1 R-HSA-1791056
## 2 R-HSA-419002;R-HSA-419001
## 3 R-HSA-6789524;R-HSA-6789615
## 4 R-HSA-1791056
## 5 R-HSA-140825
## 6 R-HSA-419002;R-HSA-419001
There is a column that signified which gene names linked to each pathway.
*before generating dex tables for each of the tests separate the up and down regulated genes before sending to reactome
close dont delete