This is a full pipeline for importaing affymetrix .CEL data (human microarray chip data) and running full pairwise differential abundance tests. A full annotation table is also referenced in order to link probe names to gene names.

The .CEL data is read and analyzed by the “limma” package

initializing limma and dependent packages and importing the annotation table

First, the necessary libraries are activated and the annotation table (downloaded from an Affymetrix website) is read in:

Annotation <- read.csv("HTA-2_0.na35.2.hg19.transcript.csv/HTA-2_0.na35.2.hg19.transcript.csv", header=TRUE, skip=19)

The annotation table is very large and many columns are merged redundant database references (using “//” as separator). it is far too wide to display here! However, it is very powerful as it links the probes to a huge number of databases.

Mapping probe ID to genes and pathways

The Annotation table has several columns:

colnames(Annotation)
##  [1] "transcript_cluster_id" "probeset_id"          
##  [3] "seqname"               "strand"               
##  [5] "start"                 "stop"                 
##  [7] "total_probes"          "gene_assignment"      
##  [9] "mrna_assignment"       "swissprot"            
## [11] "unigene"               "GO_biological_process"
## [13] "GO_cellular_component" "GO_molecular_function"
## [15] "pathway"               "protein_domains"      
## [17] "category"              "locus.type"           
## [19] "notes"

Many of these are potentially very useful. For now, I am useing the gene_assignment column. This contains many names separated by //. A single entry can look like this:

Annotation[1,"gene_assignment"]
## [1] NR_046018 // DDX11L1 // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 // 1p36.33 // 100287102 /// ENST00000456328 // DDX11L1 // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 // 1p36.33 // 100287102 /// ENST00000456328 // DDX11L5 // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 5 // 9p24.3 // 100287596 /// AM992878 // DDX11L12 // DEAD/H box polypeptide 11 like 12 // --- // 100532727 /// AM992880 // DDX11L9 // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 9 // 15q26.3 // 100288486 /// BC032353 // DDX11L1 // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 // 1p36.33 // 100287102 /// uc001aaa.3 // DDX11L1 // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 // 1p36.33 // 100287102 /// uc010nxq.1 // DDX11L9 // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 9 // 15q26.3 // 100288486 /// uc010nxr.1 // DDX11L12 // DEAD/H box polypeptide 11 like 12 // --- // 100532727
## 41529 Levels: --- ...

The second one is a gene name that links well to many databases and the third provides a description. The gene names can be fed into the Reactome online database for pathway mapping. We will create a focused mapping table that only has the necessary fields. This object will be used to attach gene names and functions to the final diff.expression tables

genenames.all <- Annotation
gene.split <- data.frame(do.call('rbind',strsplit(as.character(genenames.all$gene_assignment),"//",fixed=T)))
## Warning in .Method(..., deparse.level = deparse.level): number of columns
## of result is not a multiple of vector length (arg 1)
probe.gene.map <- data.frame(gene.split[,c(2,3)])
probe.gene.map$probe <- Annotation[,1]
probe.gene.map <- probe.gene.map[,c(3,1,2)]
probe.gene.map$X2 <- gsub("[[:space:]]", "", probe.gene.map$X2)
colnames(probe.gene.map) <- c("probe","name","description")

head(probe.gene.map)
##             probe         name
## 1 TC01000001.hg.1      DDX11L1
## 2 TC01000002.hg.1   MIR1302-11
## 3 TC01000003.hg.1        OR4F5
## 4 TC01000004.hg.1 RP11-34P13.9
## 5 TC01000005.hg.1 LOC100132287
## 6 TC01000006.hg.1          ---
##                                             description
## 1  DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 
## 2                                     microRNA 1302-11 
## 3  olfactory receptor, family 4, subfamily F, member 5 
## 4                            putative novel transcript 
## 5                         uncharacterized LOC100132287 
## 6                                                   ---

Test matrices

Differential abundance can only be performed on pairs of sample sets. Each pairwise test has to be configured as a separate file. Importing the test matrices:

targets <- read.csv("targets.txt", as.is=T)
targets.tGF.cGF <- read.csv("targets.GFonly.txt", as.is=T)
targets.tGFR.cGFR <- read.csv("targets.GFRonly.txt", as.is=T)
targets.cGFR.cGF <- read.csv("targets.noDrug.Strain.txt", as.is=T)
targets.tGFR.tGF <- read.csv("targets.drugonly.txt", as.is=T)

all of the samples and metadata are listed here:

targets
##                                      FileName Strain Treatment Experiment
## 1   20160505_HTA2.0_0316_WTH_1R_(HTA-2_0).CEL     GF   control          1
## 2   20160505_HTA2.0_0317_WTH_2R_(HTA-2_0).CEL     GF      drug          1
## 3   20160505_HTA2.0_0318_WTH_3R_(HTA-2_0).CEL    GFR   control          1
## 4   20160505_HTA2.0_0319_WTH_4R_(HTA-2_0).CEL    GFR      drug          1
## 5 20160616_HTA2.0_0333_WTH_1_R1_(HTA-2_0).CEL     GF   control          2
## 6 20160616_HTA2.0_0334_WTH_2_R1_(HTA-2_0).CEL     GF      drug          2
## 7 20160616_HTA2.0_0335_WTH_3_R2_(HTA-2_0).CEL    GFR   control          2
## 8 20160616_HTA2.0_0336_WTH_4_R1_(HTA-2_0).CEL    GFR      drug          2

The matrices are set up as such; this matrix is for comparing non-treated and treated samples within in strain GF (control strain). Note thta this is just one of the tests; we now have matrices for 3 other test schemes.

targets.tGF.cGF
##                                      FileName Strain Treatment Experiment
## 1   20160505_HTA2.0_0316_WTH_1R_(HTA-2_0).CEL     GF   control          1
## 2   20160505_HTA2.0_0317_WTH_2R_(HTA-2_0).CEL     GF      drug          1
## 3 20160616_HTA2.0_0333_WTH_1_R1_(HTA-2_0).CEL     GF   control          2
## 4 20160616_HTA2.0_0334_WTH_2_R1_(HTA-2_0).CEL     GF      drug          2

Reading the .CEL files and normalizing.

dat <- read.celfiles(targets$FileName) #read .CEL files
## Loading required package: pd.hta.2.0
## Loading required package: RSQLite
## Loading required package: DBI
## Platform design info loaded.
## Reading in : 20160505_HTA2.0_0316_WTH_1R_(HTA-2_0).CEL
## Reading in : 20160505_HTA2.0_0317_WTH_2R_(HTA-2_0).CEL
## Reading in : 20160505_HTA2.0_0318_WTH_3R_(HTA-2_0).CEL
## Reading in : 20160505_HTA2.0_0319_WTH_4R_(HTA-2_0).CEL
## Reading in : 20160616_HTA2.0_0333_WTH_1_R1_(HTA-2_0).CEL
## Reading in : 20160616_HTA2.0_0334_WTH_2_R1_(HTA-2_0).CEL
## Reading in : 20160616_HTA2.0_0335_WTH_3_R2_(HTA-2_0).CEL
## Reading in : 20160616_HTA2.0_0336_WTH_4_R1_(HTA-2_0).CEL
project.bgcorrect.norm.avg <- rma(dat, background=TRUE, normalize=TRUE) #normalize
## Background correcting
## Normalizing
## Calculating Expression
#this may remove control probes, it should only include probes which actually link to genes
project.bgcorrect.norm.avg.filt <- exprs(project.bgcorrect.norm.avg)[which(rownames(project.bgcorrect.norm.avg) %in% Annotation[,1]),]
#this writes the full results as a file
write.table(project.bgcorrect.norm.avg.filt, "full.expression.results.tsv", sep="\t", row.names = TRUE, quote=FALSE)                                        

The expression table

After export and filtration, the simplified table looks like (this is the first 6 probes and their normalized intensities):

head(data.frame(project.bgcorrect.norm.avg.filt))
##            X20160505_HTA2.0_0316_WTH_1R_.HTA.2_0..CEL
## 2824546_st                                   12.36100
## 2824549_st                                   11.90234
## 2824551_st                                   11.74690
## 2824554_st                                   12.01546
## 2827992_st                                   11.01983
## 2827995_st                                   10.78221
##            X20160505_HTA2.0_0317_WTH_2R_.HTA.2_0..CEL
## 2824546_st                                   12.23731
## 2824549_st                                   11.62001
## 2824551_st                                   11.81748
## 2824554_st                                   11.80532
## 2827992_st                                   10.97427
## 2827995_st                                   10.90626
##            X20160505_HTA2.0_0318_WTH_3R_.HTA.2_0..CEL
## 2824546_st                                   12.55858
## 2824549_st                                   11.84178
## 2824551_st                                   11.90814
## 2824554_st                                   11.93338
## 2827992_st                                   11.34647
## 2827995_st                                   11.07592
##            X20160505_HTA2.0_0319_WTH_4R_.HTA.2_0..CEL
## 2824546_st                                   12.44655
## 2824549_st                                   11.89080
## 2824551_st                                   11.86767
## 2824554_st                                   11.81407
## 2827992_st                                   11.38106
## 2827995_st                                   11.13100
##            X20160616_HTA2.0_0333_WTH_1_R1_.HTA.2_0..CEL
## 2824546_st                                     12.13622
## 2824549_st                                     11.76768
## 2824551_st                                     11.65620
## 2824554_st                                     11.78064
## 2827992_st                                     11.02287
## 2827995_st                                     10.26596
##            X20160616_HTA2.0_0334_WTH_2_R1_.HTA.2_0..CEL
## 2824546_st                                     12.13720
## 2824549_st                                     11.78317
## 2824551_st                                     11.63193
## 2824554_st                                     11.83432
## 2827992_st                                     11.08451
## 2827995_st                                     10.41974
##            X20160616_HTA2.0_0335_WTH_3_R2_.HTA.2_0..CEL
## 2824546_st                                     12.27436
## 2824549_st                                     11.81748
## 2824551_st                                     11.98635
## 2824554_st                                     11.91740
## 2827992_st                                     11.09480
## 2827995_st                                     10.76636
##            X20160616_HTA2.0_0336_WTH_4_R1_.HTA.2_0..CEL
## 2824546_st                                     12.27725
## 2824549_st                                     11.69465
## 2824551_st                                     11.89022
## 2824554_st                                     11.83155
## 2827992_st                                     11.21518
## 2827995_st                                     10.79048

Generating expression tables for differential abundance tests

This full data table that we just generated is only for record keeping purposes only. For each of the pairwise tests we will do, we will repeat this except that it will be limited to the samples we are testing with: define the test using the target matrix table (e.g. c vs. t, GF vs GFR) read in the .CEL files limited to the matrix table (only a subset of .CELs are read in) normalize define your tests and make differential abundance tables

untreated GF vs untreated GFR

1. Define the subset of .CEL files, read them in, and normalize them

dat <- read.celfiles(targets.cGFR.cGF$FileName)
## Platform design info loaded.
## Reading in : 20160505_HTA2.0_0316_WTH_1R_(HTA-2_0).CEL
## Reading in : 20160505_HTA2.0_0318_WTH_3R_(HTA-2_0).CEL
## Reading in : 20160616_HTA2.0_0333_WTH_1_R1_(HTA-2_0).CEL
## Reading in : 20160616_HTA2.0_0335_WTH_3_R2_(HTA-2_0).CEL
project.bgcorrect.norm.avg <- rma(dat, background=TRUE, normalize=TRUE)
## Background correcting
## Normalizing
## Calculating Expression
project.bgcorrect.norm.avg.filt <- exprs(project.bgcorrect.norm.avg)[which(rownames(project.bgcorrect.norm.avg) %in% Annotation[,1]),]

2. Load in the metadata

Exp <- factor(targets.cGFR.cGF$Experiment)
Strain <- factor(targets.cGFR.cGF$Strain, levels=c("GF","GFR"))
Treatment <- factor(targets.cGFR.cGF$Treatment, levels=c("control","drug"))

3. Define the test as controlling for replicate (“exp”) while testing effect of strain

This is important decision; these chips were ran from different experimental batches, thus I have controlled for experiment, essentially running a paired analysis. In an earlier iteration, I did not control for experiment and found very few differences due to the high variance between experimental replicates

design <- model.matrix(~Exp+Strain)

4. Run “fit” algorithms and separate out the significantly different probes

eset <- project.bgcorrect.norm.avg.filt
fit <- lmFit(eset, design)
fit <- eBayes(fit, trend=T, robust=T)
## Warning: 65 very small variances detected, have been offset away from zero
results <- decideTests(fit)
write.table(summary(results), "controls.summary", quote=F)

This is a simple summary of the differences, only the far right column

summary(results)
##        (Intercept)  Exp2 StrainGFR
## Down             0     0        29
## NotSig           4 70523     70450
## Up           70519     0        44

5. Generate a table of the differentially expressed genes and attach gene name/function information

library(data.table)
library(dplyr)
diff.cGFR.cGF <- topTable(fit, coef="StrainGFR", n=73)
setDT(diff.cGFR.cGF, keep.rownames = TRUE)[]
colnames(diff.cGFR.cGF)[1] <- "probe"
diff.cGFR.cGF <- dplyr::left_join(diff.cGFR.cGF,probe.gene.map,by="probe")
## Warning: Column `probe` joining character vector and factor, coercing into
## character vector
write.table(diff.cGFR.cGF, "diff.cGFR.cGF", quote = F)
write.table(diff.cGFR.cGF$name, "diff.cGFR.cGF.gene.list", quote = F)
head(diff.cGFR.cGF)
##             probe     logFC  AveExpr         t      P.Value  adj.P.Val
## 1 TC05001205.hg.1 -4.628246 5.478085 -51.37863 3.455196e-07 0.02263877
## 2 TC18000084.hg.1 -3.559838 5.536229 -38.69108 1.174218e-06 0.02263877
## 3 TC09000323.hg.1  4.167048 7.927503  34.93740 1.822866e-06 0.02263877
## 4 TC19000805.hg.1  3.678938 5.928768  35.70503 1.814231e-06 0.02263877
## 5 TC03000117.hg.1  2.870035 4.204782  32.22194 2.582918e-06 0.02263877
## 6 TC0X000272.hg.1 -2.896188 5.694877 -31.54888 2.828696e-06 0.02263877
##          B        name                               description
## 1 4.343165       CDH10        cadherin 10, type 2 (T2-cadherin) 
## 2 4.141455    ANKRD30B                ankyrin repeat domain 30B 
## 3 4.041932      MAMDC2                  MAM domain containing 2 
## 4 3.995222      ERVV-2  endogenous retrovirus group V, member 2 
## 5 3.950694 ZNF385D-AS1                  ZNF385D antisense RNA 1 
## 6 3.924952     GAGE12F                            G antigen 12F

This table can be regarded as a final output of the differential expression pipeline.

TO DO: split this into up and down regulated for analysis purposes

6. linking genes to pathways using Reactome.db

There is an API, a restful API, and some sort of R package “reactome.db” for this. Querying the database directly https://reactome.org/PathwayBrowser/#TOOL=AT is simple though. Pasting in the list of genes found in the *.gene.list file that was written, we get access to a pathway map with a good deal of depth for exploration

This image can be zoomed in on the website for perhaps more information or dramatic focus.

The results can be downloaded and loaded into R:

cGF.cGFR.reactome <- read.csv("reactome.cGFR.cGF.csv", header=T)
head(cGF.cGFR.reactome)
##   Pathway.identifier                               Pathway.name
## 1       R-HSA-380994                       ATF4 activates genes
## 2       R-HSA-418990            Adherens junctions interactions
## 3      R-HSA-6785807 Interleukin-4 and Interleukin-13 signaling
## 4       R-HSA-381042             PERK regulates gene expression
## 5       R-HSA-140834 Extrinsic Pathway of Fibrin Clot Formation
## 6       R-HSA-421270            Cell-cell junction organization
##   X.Entities.found X.Entities.total Entities.ratio Entities.pValue
## 1                2               32   0.0022884932     0.006925115
## 2                2               35   0.0025030394     0.008223584
## 3                4              211   0.0150897518     0.009004908
## 4                2               38   0.0027175856     0.009622609
## 5                1                6   0.0004290925     0.022909466
## 6                2               67   0.0047915326     0.027866402
##   Entities.FDR X.Reactions.found X.Reactions.total Reactions.ratio
## 1    0.3271687                 1                 7    0.0005938242
## 2    0.3271687                 2                16    0.0013573125
## 3    0.3271687                 2                46    0.0039022735
## 4    0.3271687                 1                11    0.0009331524
## 5    0.3291710                 1                 8    0.0006786563
## 6    0.3291710                 2                21    0.0017814727
##   Species.identifier Species.name Submitted.entities.found Mapped.entities
## 1               9606 Homo sapiens                     CCL2              NA
## 2               9606 Homo sapiens              CDH10;CDH12              NA
## 3               9606 Homo sapiens                SAA1;CCL2              NA
## 4               9606 Homo sapiens                     CCL2              NA
## 5               9606 Homo sapiens                     TFPI              NA
## 6               9606 Homo sapiens              CDH10;CDH12              NA
##    Found.reaction.identifiers
## 1               R-HSA-1791056
## 2   R-HSA-419002;R-HSA-419001
## 3 R-HSA-6789524;R-HSA-6789615
## 4               R-HSA-1791056
## 5                R-HSA-140825
## 6   R-HSA-419002;R-HSA-419001

There is a column that signified which gene names linked to each pathway.

Reactome also generates a spectacular report .pdf. Be sure to download it and examine!

To Do:

*before generating dex tables for each of the tests separate the up and down regulated genes before sending to reactome

close dont delete