Analyze RNA-Seq data with R

The aim of this part is to analyze and a host-pathogen protein interaction map derived from a Y2H experiment using the software “Cytoscape”. The network is extended with additional interactions and protein properties from online resources. Additionally, expression fold change values derived from a RNA-Seq experiment are integrated to examine the expression dynamic after pathogen treatment.

Softwares in this course

R

R R is a free software environment for statistical computing and graphics. R

RStudio

RStudio RStudio provides free and open source tools for R and enterprise-ready professional software for data science teams to develop and share their work at scale. RStudio

R / RStudio Basics

RStudio environment

The environment of RStudio

  1. The editor, where you write your program
  2. The console, where you can enter R commands and where the output of your program is shown.
  3. In the environment / history area are listed in the environment tab all variables, which are currently in use. In the history tab are shown the last commands, which have been executed in the console.
  4. In the fourth area, five tabs are contained: Files, Plots, Packages, Help and Viewer.
    • Files show your file system on your computer.
    • Plots show the last produces graphic.
    • Packages show all installed R packages and allows to install new ones from the CRAN repository.
    • In the Help tab, information about R functions can be found.
    • Viewer also shows java script graphics.
  5. The Run button executes the currently selected lines in the editor.

A comprehensive list of shortcuts can be found here: shortcuts

Working environment setup

Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data . In the following analysis of RNA-Seq data several R-packages are needed.

  • BioManager: A convenient tool to install and update Bioconductor packages.
  • DESeq2: Analysis and visualization of RNA-Seq data.
  • GEOQuery: Loading of GEO dataset description.
  • openxlsx: reading and writing of Excel files.
  • ggplot2: advanced plotting package.
  • RColorBrewer: create color gradients for plots.
  • pheatmap: drawing heatmaps and clustering data.
  • RCy3

setup the working directory

<your path> can be absolute or relative folder path

# clear all elements from environment
rm(list = ls())
# set working directory
setwd("<your path>/Praktikum2021/")

Installation of required R packages

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager", dependencies = TRUE, repos = "http://cran.r-project.org")
    }
packs <- c("DESeq2", "GEOquery", "openxlsx", "ggplot2", "RColorBrewer", "pheatmap", "RCy3", "apeglm")
BiocManager::install(packs)
## 
## The downloaded binary packages are in
##  /var/folders/bd/knxrzm1x1s1fcg5sc7hs_7z9dp_qq3/T//RtmpsXwRQV/downloaded_packages

Check what package(s) is loaded

(.packages())
## [1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [7] "base"

load packages

For the R programming environment, a lot of packages are available, which contain specialized functions for working with different dataset, conducting special analyses or producing special plots.

Here shows how to load some packages in R.

library(DESeq2)
## Warning: package 'BiocGenerics' was built under R version 4.0.3
## Warning: multiple methods tables found for 'which'

## Warning: multiple methods tables found for 'which'
library(GEOquery)
library(openxlsx)
library(ggplot2)
library(RColorBrewer)
library(pheatmap)
library(RCy3)

Basic Analysis of a RNA-Seq dataset

In this R session, the RNA-Seq dataset GSE88798 (description above) from GEO is loaded for basic analysis and visualization. First a subset of the experiment is selected: Treatment of Col-0 with Mock or Pto_AvrRpm1 at time point 1 hour after inoculation. This dataset is used to identify genes with a significant differential expression visualize the counts of the most significant gene in three repeats in mock vs. pathogen treatment.

The following command load the RNA-seq count matrix and adjust the format for using it with the DESeq2 package

gseData <- read.table("GSE88798/GSE88798_ReadCountTable_M001_348.txt", header = T, sep = "\t")

Adjust the variable gseData

gseLocus <- gseData[,1] # get locus names from first column
gseData <- gseData[,-1] # remove locus name column to be compatible with subsequent analysis
row.names(gseData) <- gseLocus # set locus names as row names
colnames(gseData) <- NULL # clear column names

The table containing the read counts does not include any description of the experiments, except the experiment identifier. The experiment description is in the file *_series_matrix.txt. You can open this file with a text editor to see the structure of the file.

A convenient way to load this file is using the command getGEO from the GEOquery package:

read the series matrix, which contains detailed information on each experiment

https://bioconductor.org/packages/devel/bioc/vignettes/GEOquery/inst/doc/GEOquery.html#getting-gse-series-matrix-files-as-an-expressionset

gseTable <- getGEO(filename = "GSE88798/GSE88798_series_matrix.txt")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
## File stored at:
## /var/folders/bd/knxrzm1x1s1fcg5sc7hs_7z9dp_qq3/T//RtmpsXwRQV/GPL17639.soft
gseTable # show element
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 0 features, 366 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM2347874 GSM2347875 ... GSM2359638 (366 total)
##   varLabels: title geo_accession ... treatment:ch1 (48 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: GPL17639
str(gseTable) # inspect structure of element
## Formal class 'ExpressionSet' [package "Biobase"] with 7 slots
##   ..@ experimentData   :Formal class 'MIAME' [package "Biobase"] with 13 slots
##   .. .. ..@ name             : chr "akira,,mine"
##   .. .. ..@ lab              : chr ""
##   .. .. ..@ contact          : chr "mine@mpipz.mpg.de"
##   .. .. ..@ title            : chr "Time-resolved transcriptome analysis with genetic perturbations reveals a critical time window for effective plant immunity"
##   .. .. ..@ abstract         : chr "Pathogen invasion in plants is associated with transcriptional reprogramming. Enigmatically, plants induce simi"| __truncated__
##   .. .. ..@ url              : chr "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE88798"
##   .. .. ..@ pubMedIds        : chr ""
##   .. .. ..@ samples          : list()
##   .. .. ..@ hybridizations   : list()
##   .. .. ..@ normControls     : list()
##   .. .. ..@ preprocessing    : list()
##   .. .. ..@ other            :List of 24
##   .. .. .. ..$ contact_address        : chr "Carl-von-Linne Weg 10"
##   .. .. .. ..$ contact_city           : chr "Cologne"
##   .. .. .. ..$ contact_country        : chr "Germany"
##   .. .. .. ..$ contact_department     : chr "Plant Microbe Interactions"
##   .. .. .. ..$ contact_email          : chr "mine@mpipz.mpg.de"
##   .. .. .. ..$ contact_institute      : chr "Max Planck Institute for Plant Breeding Research"
##   .. .. .. ..$ contact_laboratory     : chr "Tsuda lab"
##   .. .. .. ..$ contact_name           : chr "akira,,mine"
##   .. .. .. ..$ contact_zip/postal_code: chr "50829"
##   .. .. .. ..$ contributor            : chr "Akira,,Mine\nCarilin,,Seyfferth\nBarbara,,Kracher\nKenichi,,Tsuda"
##   .. .. .. ..$ geo_accession          : chr "GSE88798"
##   .. .. .. ..$ last_update_date       : chr "Dec 12 2017"
##   .. .. .. ..$ overall_design         : chr "Leaves of Col-0 and all the single, double, triple and quadruple mutants of dde2-2, ein2-1, pad4-1, sid2-2 were"| __truncated__
##   .. .. .. ..$ platform_id            : chr "GPL17639"
##   .. .. .. ..$ platform_taxid         : chr "3702"
##   .. .. .. ..$ relation               : chr "BioProject: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA348676\nSRA: https://www.ncbi.nlm.nih.gov/sra?term=SRP091641"
##   .. .. .. ..$ sample_id              : chr "GSM2347874 GSM2347875 GSM2347876 GSM2347877 GSM2347878 GSM2347879 GSM2347880 GSM2347881 GSM2347882 GSM2347883 G"| __truncated__
##   .. .. .. ..$ sample_taxid           : chr "3702"
##   .. .. .. ..$ status                 : chr "Public on Dec 12 2017"
##   .. .. .. ..$ submission_date        : chr "Oct 17 2016"
##   .. .. .. ..$ summary                : chr "Pathogen invasion in plants is associated with transcriptional reprogramming. Enigmatically, plants induce simi"| __truncated__
##   .. .. .. ..$ supplementary_file     : chr "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE88nnn/GSE88798/suppl/GSE88798_ReadCountTable_M001_348.txt.gz\nftp://ft"| __truncated__
##   .. .. .. ..$ title                  : chr "Time-resolved transcriptome analysis with genetic perturbations reveals a critical time window for effective plant immunity"
##   .. .. .. ..$ type                   : chr "Expression profiling by high throughput sequencing"
##   .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
##   .. .. .. .. ..@ .Data:List of 2
##   .. .. .. .. .. ..$ : int [1:3] 1 0 0
##   .. .. .. .. .. ..$ : int [1:3] 1 1 0
##   ..@ assayData        :<environment: 0x7fc5361d1ac0> 
##   ..@ phenoData        :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
##   .. .. ..@ varMetadata      :'data.frame':  48 obs. of  1 variable:
##   .. .. .. ..$ labelDescription: chr [1:48] NA NA NA NA ...
##   .. .. ..@ data             :'data.frame':  366 obs. of  48 variables:
##   .. .. .. ..$ title                  : chr [1:366] "x01_Col_01_mock_01h_rep01" "x01_Col_02_EV_01h_rep01" "x01_Col_03_AvrRpt2_01h_rep01" "x01_Col_01_mock_01h_rep02" ...
##   .. .. .. ..$ geo_accession          : chr [1:366] "GSM2347874" "GSM2347875" "GSM2347876" "GSM2347877" ...
##   .. .. .. ..$ status                 : chr [1:366] "Public on Dec 12 2017" "Public on Dec 12 2017" "Public on Dec 12 2017" "Public on Dec 12 2017" ...
##   .. .. .. ..$ submission_date        : chr [1:366] "Oct 17 2016" "Oct 17 2016" "Oct 17 2016" "Oct 17 2016" ...
##   .. .. .. ..$ last_update_date       : chr [1:366] "Dec 12 2017" "Dec 12 2017" "Dec 12 2017" "Dec 12 2017" ...
##   .. .. .. ..$ type                   : chr [1:366] "SRA" "SRA" "SRA" "SRA" ...
##   .. .. .. ..$ channel_count          : chr [1:366] "1" "1" "1" "1" ...
##   .. .. .. ..$ source_name_ch1        : chr [1:366] "Fully-expanded leaves of 30 to 33 day-old plants" "Fully-expanded leaves of 30 to 33 day-old plants" "Fully-expanded leaves of 30 to 33 day-old plants" "Fully-expanded leaves of 30 to 33 day-old plants" ...
##   .. .. .. ..$ organism_ch1           : chr [1:366] "Arabidopsis thaliana" "Arabidopsis thaliana" "Arabidopsis thaliana" "Arabidopsis thaliana" ...
##   .. .. .. ..$ characteristics_ch1    : chr [1:366] "genotype: Col-0" "genotype: Col-0" "genotype: Col-0" "genotype: Col-0" ...
##   .. .. .. ..$ characteristics_ch1.1  : chr [1:366] "treatment: Mock" "treatment: Pto DC3000" "treatment: Pto AvrRpt2" "treatment: Mock" ...
##   .. .. .. ..$ characteristics_ch1.2  : chr [1:366] "replicate: #1" "replicate: #1" "replicate: #1" "replicate: #2" ...
##   .. .. .. ..$ characteristics_ch1.3  : chr [1:366] "time: 1 hpi" "time: 1 hpi" "time: 1 hpi" "time: 1 hpi" ...
##   .. .. .. ..$ characteristics_ch1.4  : chr [1:366] "tissue: Leaves" "tissue: Leaves" "tissue: Leaves" "tissue: Leaves" ...
##   .. .. .. ..$ treatment_protocol_ch1 : chr [1:366] "Fully-expanded leaves were syringe-infiltrated with mock (water) or bacterial suspensions at the OD600 of 0.001." "Fully-expanded leaves were syringe-infiltrated with mock (water) or bacterial suspensions at the OD600 of 0.001." "Fully-expanded leaves were syringe-infiltrated with mock (water) or bacterial suspensions at the OD600 of 0.001." "Fully-expanded leaves were syringe-infiltrated with mock (water) or bacterial suspensions at the OD600 of 0.001." ...
##   .. .. .. ..$ growth_protocol_ch1    : chr [1:366] "Arabidopsis plants were grown in a chamber at 22°C with a 10 h light period and 60% relative humidity for 3 wee"| __truncated__ "Arabidopsis plants were grown in a chamber at 22°C with a 10 h light period and 60% relative humidity for 3 wee"| __truncated__ "Arabidopsis plants were grown in a chamber at 22°C with a 10 h light period and 60% relative humidity for 3 wee"| __truncated__ "Arabidopsis plants were grown in a chamber at 22°C with a 10 h light period and 60% relative humidity for 3 wee"| __truncated__ ...
##   .. .. .. ..$ molecule_ch1           : chr [1:366] "total RNA" "total RNA" "total RNA" "total RNA" ...
##   .. .. .. ..$ extract_protocol_ch1   : chr [1:366] "Total RNAs were extracted using TRIzol (Invitrogen), followed by Dnase I (Roche) treatment." "Total RNAs were extracted using TRIzol (Invitrogen), followed by Dnase I (Roche) treatment." "Total RNAs were extracted using TRIzol (Invitrogen), followed by Dnase I (Roche) treatment." "Total RNAs were extracted using TRIzol (Invitrogen), followed by Dnase I (Roche) treatment." ...
##   .. .. .. ..$ extract_protocol_ch1.1 : chr [1:366] "RNA libraries were prepared from 1 μg of total RNA. Poly A enrichment and library preparation were performed wi"| __truncated__ "RNA libraries were prepared from 1 μg of total RNA. Poly A enrichment and library preparation were performed wi"| __truncated__ "RNA libraries were prepared from 1 μg of total RNA. Poly A enrichment and library preparation were performed wi"| __truncated__ "RNA libraries were prepared from 1 μg of total RNA. Poly A enrichment and library preparation were performed wi"| __truncated__ ...
##   .. .. .. ..$ taxid_ch1              : chr [1:366] "3702" "3702" "3702" "3702" ...
##   .. .. .. ..$ description            : chr [1:366] "M001" "M002" "M003" "M004" ...
##   .. .. .. ..$ description.1          : chr [1:366] "" "" "" "" ...
##   .. .. .. ..$ data_processing        : chr [1:366] "Illumina CASAVA-1.8.2 software was used for basecalling and Illumina adapter matching reads were removed using cutadapt 1.5" "Illumina CASAVA-1.8.2 software was used for basecalling and Illumina adapter matching reads were removed using cutadapt 1.5" "Illumina CASAVA-1.8.2 software was used for basecalling and Illumina adapter matching reads were removed using cutadapt 1.5" "Illumina CASAVA-1.8.2 software was used for basecalling and Illumina adapter matching reads were removed using cutadapt 1.5" ...
##   .. .. .. ..$ data_processing.1      : chr [1:366] "Reads were mapped to the Arabidopsis genome using splice-aware read aligner TopHat (version 2.0.10) with the pa"| __truncated__ "Reads were mapped to the Arabidopsis genome using splice-aware read aligner TopHat (version 2.0.10) with the pa"| __truncated__ "Reads were mapped to the Arabidopsis genome using splice-aware read aligner TopHat (version 2.0.10) with the pa"| __truncated__ "Reads were mapped to the Arabidopsis genome using splice-aware read aligner TopHat (version 2.0.10) with the pa"| __truncated__ ...
##   .. .. .. ..$ data_processing.2      : chr [1:366] "Supplementary_files_format_and_content: Tab-delimited text file including raw read counts per gene for each of "| __truncated__ "Supplementary_files_format_and_content: Tab-delimited text file including raw read counts per gene for each of "| __truncated__ "Supplementary_files_format_and_content: Tab-delimited text file including raw read counts per gene for each of "| __truncated__ "Supplementary_files_format_and_content: Tab-delimited text file including raw read counts per gene for each of "| __truncated__ ...
##   .. .. .. ..$ platform_id            : chr [1:366] "GPL17639" "GPL17639" "GPL17639" "GPL17639" ...
##   .. .. .. ..$ contact_name           : chr [1:366] "akira,,mine" "akira,,mine" "akira,,mine" "akira,,mine" ...
##   .. .. .. ..$ contact_email          : chr [1:366] "mine@mpipz.mpg.de" "mine@mpipz.mpg.de" "mine@mpipz.mpg.de" "mine@mpipz.mpg.de" ...
##   .. .. .. ..$ contact_laboratory     : chr [1:366] "Tsuda lab" "Tsuda lab" "Tsuda lab" "Tsuda lab" ...
##   .. .. .. ..$ contact_department     : chr [1:366] "Plant Microbe Interactions" "Plant Microbe Interactions" "Plant Microbe Interactions" "Plant Microbe Interactions" ...
##   .. .. .. ..$ contact_institute      : chr [1:366] "Max Planck Institute for Plant Breeding Research" "Max Planck Institute for Plant Breeding Research" "Max Planck Institute for Plant Breeding Research" "Max Planck Institute for Plant Breeding Research" ...
##   .. .. .. ..$ contact_address        : chr [1:366] "Carl-von-Linne Weg 10" "Carl-von-Linne Weg 10" "Carl-von-Linne Weg 10" "Carl-von-Linne Weg 10" ...
##   .. .. .. ..$ contact_city           : chr [1:366] "Cologne" "Cologne" "Cologne" "Cologne" ...
##   .. .. .. ..$ contact_zip/postal_code: chr [1:366] "50829" "50829" "50829" "50829" ...
##   .. .. .. ..$ contact_country        : chr [1:366] "Germany" "Germany" "Germany" "Germany" ...
##   .. .. .. ..$ data_row_count         : chr [1:366] "0" "0" "0" "0" ...
##   .. .. .. ..$ instrument_model       : chr [1:366] "Illumina HiSeq 2500" "Illumina HiSeq 2500" "Illumina HiSeq 2500" "Illumina HiSeq 2500" ...
##   .. .. .. ..$ library_selection      : chr [1:366] "cDNA" "cDNA" "cDNA" "cDNA" ...
##   .. .. .. ..$ library_source         : chr [1:366] "transcriptomic" "transcriptomic" "transcriptomic" "transcriptomic" ...
##   .. .. .. ..$ library_strategy       : chr [1:366] "RNA-Seq" "RNA-Seq" "RNA-Seq" "RNA-Seq" ...
##   .. .. .. ..$ relation               : chr [1:366] "BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN05913698" "BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN05913699" "BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN05913697" "BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN05913696" ...
##   .. .. .. ..$ relation.1             : chr [1:366] "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX2248274" "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX2248275" "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX2248276" "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX2248277" ...
##   .. .. .. ..$ supplementary_file_1   : chr [1:366] "NONE" "NONE" "NONE" "NONE" ...
##   .. .. .. ..$ genotype:ch1           : chr [1:366] "Col-0" "Col-0" "Col-0" "Col-0" ...
##   .. .. .. ..$ replicate:ch1          : chr [1:366] "#1" "#1" "#1" "#2" ...
##   .. .. .. ..$ time:ch1               : chr [1:366] "1 hpi" "1 hpi" "1 hpi" "1 hpi" ...
##   .. .. .. ..$ tissue:ch1             : chr [1:366] "Leaves" "Leaves" "Leaves" "Leaves" ...
##   .. .. .. ..$ treatment:ch1          : chr [1:366] "Mock" "Pto DC3000" "Pto AvrRpt2" "Mock" ...
##   .. .. ..@ dimLabels        : chr [1:2] "sampleNames" "sampleColumns"
##   .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
##   .. .. .. .. ..@ .Data:List of 1
##   .. .. .. .. .. ..$ : int [1:3] 1 1 0
##   ..@ featureData      :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
##   .. .. ..@ varMetadata      :'data.frame':  0 obs. of  3 variables:
##   .. .. .. ..$ Column          : chr(0) 
##   .. .. .. ..$ Description     : chr(0) 
##   .. .. .. ..$ labelDescription: chr(0) 
##   .. .. ..@ data             :'data.frame':  0 obs. of  0 variables
##   .. .. ..@ dimLabels        : chr [1:2] "featureNames" "featureColumns"
##   .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
##   .. .. .. .. ..@ .Data:List of 1
##   .. .. .. .. .. ..$ : int [1:3] 1 1 0
##   ..@ annotation       : chr "GPL17639"
##   ..@ protocolData     :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
##   .. .. ..@ varMetadata      :'data.frame':  0 obs. of  1 variable:
##   .. .. .. ..$ labelDescription: chr(0) 
##   .. .. ..@ data             :'data.frame':  366 obs. of  0 variables
##   .. .. ..@ dimLabels        : chr [1:2] "sampleNames" "sampleColumns"
##   .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
##   .. .. .. .. ..@ .Data:List of 1
##   .. .. .. .. .. ..$ : int [1:3] 1 1 0
##   ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
##   .. .. ..@ .Data:List of 4
##   .. .. .. ..$ : int [1:3] 4 0 0
##   .. .. .. ..$ : int [1:3] 2 48 0
##   .. .. .. ..$ : int [1:3] 1 3 0
##   .. .. .. ..$ : int [1:3] 1 0 0

The variable gseTable is an ExpressionSet object. To work with the information contained in this object, the relevant data have to be extracted into a simple variable

# extract relevant information: genotype, replicate, time point and treatment for each experiment
myColData <- phenoData(gseTable) # extract metadata using the function phenoData
str(myColData)
## Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
##   ..@ varMetadata      :'data.frame':    48 obs. of  1 variable:
##   .. ..$ labelDescription: chr [1:48] NA NA NA NA ...
##   ..@ data             :'data.frame':    366 obs. of  48 variables:
##   .. ..$ title                  : chr [1:366] "x01_Col_01_mock_01h_rep01" "x01_Col_02_EV_01h_rep01" "x01_Col_03_AvrRpt2_01h_rep01" "x01_Col_01_mock_01h_rep02" ...
##   .. ..$ geo_accession          : chr [1:366] "GSM2347874" "GSM2347875" "GSM2347876" "GSM2347877" ...
##   .. ..$ status                 : chr [1:366] "Public on Dec 12 2017" "Public on Dec 12 2017" "Public on Dec 12 2017" "Public on Dec 12 2017" ...
##   .. ..$ submission_date        : chr [1:366] "Oct 17 2016" "Oct 17 2016" "Oct 17 2016" "Oct 17 2016" ...
##   .. ..$ last_update_date       : chr [1:366] "Dec 12 2017" "Dec 12 2017" "Dec 12 2017" "Dec 12 2017" ...
##   .. ..$ type                   : chr [1:366] "SRA" "SRA" "SRA" "SRA" ...
##   .. ..$ channel_count          : chr [1:366] "1" "1" "1" "1" ...
##   .. ..$ source_name_ch1        : chr [1:366] "Fully-expanded leaves of 30 to 33 day-old plants" "Fully-expanded leaves of 30 to 33 day-old plants" "Fully-expanded leaves of 30 to 33 day-old plants" "Fully-expanded leaves of 30 to 33 day-old plants" ...
##   .. ..$ organism_ch1           : chr [1:366] "Arabidopsis thaliana" "Arabidopsis thaliana" "Arabidopsis thaliana" "Arabidopsis thaliana" ...
##   .. ..$ characteristics_ch1    : chr [1:366] "genotype: Col-0" "genotype: Col-0" "genotype: Col-0" "genotype: Col-0" ...
##   .. ..$ characteristics_ch1.1  : chr [1:366] "treatment: Mock" "treatment: Pto DC3000" "treatment: Pto AvrRpt2" "treatment: Mock" ...
##   .. ..$ characteristics_ch1.2  : chr [1:366] "replicate: #1" "replicate: #1" "replicate: #1" "replicate: #2" ...
##   .. ..$ characteristics_ch1.3  : chr [1:366] "time: 1 hpi" "time: 1 hpi" "time: 1 hpi" "time: 1 hpi" ...
##   .. ..$ characteristics_ch1.4  : chr [1:366] "tissue: Leaves" "tissue: Leaves" "tissue: Leaves" "tissue: Leaves" ...
##   .. ..$ treatment_protocol_ch1 : chr [1:366] "Fully-expanded leaves were syringe-infiltrated with mock (water) or bacterial suspensions at the OD600 of 0.001." "Fully-expanded leaves were syringe-infiltrated with mock (water) or bacterial suspensions at the OD600 of 0.001." "Fully-expanded leaves were syringe-infiltrated with mock (water) or bacterial suspensions at the OD600 of 0.001." "Fully-expanded leaves were syringe-infiltrated with mock (water) or bacterial suspensions at the OD600 of 0.001." ...
##   .. ..$ growth_protocol_ch1    : chr [1:366] "Arabidopsis plants were grown in a chamber at 22°C with a 10 h light period and 60% relative humidity for 3 wee"| __truncated__ "Arabidopsis plants were grown in a chamber at 22°C with a 10 h light period and 60% relative humidity for 3 wee"| __truncated__ "Arabidopsis plants were grown in a chamber at 22°C with a 10 h light period and 60% relative humidity for 3 wee"| __truncated__ "Arabidopsis plants were grown in a chamber at 22°C with a 10 h light period and 60% relative humidity for 3 wee"| __truncated__ ...
##   .. ..$ molecule_ch1           : chr [1:366] "total RNA" "total RNA" "total RNA" "total RNA" ...
##   .. ..$ extract_protocol_ch1   : chr [1:366] "Total RNAs were extracted using TRIzol (Invitrogen), followed by Dnase I (Roche) treatment." "Total RNAs were extracted using TRIzol (Invitrogen), followed by Dnase I (Roche) treatment." "Total RNAs were extracted using TRIzol (Invitrogen), followed by Dnase I (Roche) treatment." "Total RNAs were extracted using TRIzol (Invitrogen), followed by Dnase I (Roche) treatment." ...
##   .. ..$ extract_protocol_ch1.1 : chr [1:366] "RNA libraries were prepared from 1 μg of total RNA. Poly A enrichment and library preparation were performed wi"| __truncated__ "RNA libraries were prepared from 1 μg of total RNA. Poly A enrichment and library preparation were performed wi"| __truncated__ "RNA libraries were prepared from 1 μg of total RNA. Poly A enrichment and library preparation were performed wi"| __truncated__ "RNA libraries were prepared from 1 μg of total RNA. Poly A enrichment and library preparation were performed wi"| __truncated__ ...
##   .. ..$ taxid_ch1              : chr [1:366] "3702" "3702" "3702" "3702" ...
##   .. ..$ description            : chr [1:366] "M001" "M002" "M003" "M004" ...
##   .. ..$ description.1          : chr [1:366] "" "" "" "" ...
##   .. ..$ data_processing        : chr [1:366] "Illumina CASAVA-1.8.2 software was used for basecalling and Illumina adapter matching reads were removed using cutadapt 1.5" "Illumina CASAVA-1.8.2 software was used for basecalling and Illumina adapter matching reads were removed using cutadapt 1.5" "Illumina CASAVA-1.8.2 software was used for basecalling and Illumina adapter matching reads were removed using cutadapt 1.5" "Illumina CASAVA-1.8.2 software was used for basecalling and Illumina adapter matching reads were removed using cutadapt 1.5" ...
##   .. ..$ data_processing.1      : chr [1:366] "Reads were mapped to the Arabidopsis genome using splice-aware read aligner TopHat (version 2.0.10) with the pa"| __truncated__ "Reads were mapped to the Arabidopsis genome using splice-aware read aligner TopHat (version 2.0.10) with the pa"| __truncated__ "Reads were mapped to the Arabidopsis genome using splice-aware read aligner TopHat (version 2.0.10) with the pa"| __truncated__ "Reads were mapped to the Arabidopsis genome using splice-aware read aligner TopHat (version 2.0.10) with the pa"| __truncated__ ...
##   .. ..$ data_processing.2      : chr [1:366] "Supplementary_files_format_and_content: Tab-delimited text file including raw read counts per gene for each of "| __truncated__ "Supplementary_files_format_and_content: Tab-delimited text file including raw read counts per gene for each of "| __truncated__ "Supplementary_files_format_and_content: Tab-delimited text file including raw read counts per gene for each of "| __truncated__ "Supplementary_files_format_and_content: Tab-delimited text file including raw read counts per gene for each of "| __truncated__ ...
##   .. ..$ platform_id            : chr [1:366] "GPL17639" "GPL17639" "GPL17639" "GPL17639" ...
##   .. ..$ contact_name           : chr [1:366] "akira,,mine" "akira,,mine" "akira,,mine" "akira,,mine" ...
##   .. ..$ contact_email          : chr [1:366] "mine@mpipz.mpg.de" "mine@mpipz.mpg.de" "mine@mpipz.mpg.de" "mine@mpipz.mpg.de" ...
##   .. ..$ contact_laboratory     : chr [1:366] "Tsuda lab" "Tsuda lab" "Tsuda lab" "Tsuda lab" ...
##   .. ..$ contact_department     : chr [1:366] "Plant Microbe Interactions" "Plant Microbe Interactions" "Plant Microbe Interactions" "Plant Microbe Interactions" ...
##   .. ..$ contact_institute      : chr [1:366] "Max Planck Institute for Plant Breeding Research" "Max Planck Institute for Plant Breeding Research" "Max Planck Institute for Plant Breeding Research" "Max Planck Institute for Plant Breeding Research" ...
##   .. ..$ contact_address        : chr [1:366] "Carl-von-Linne Weg 10" "Carl-von-Linne Weg 10" "Carl-von-Linne Weg 10" "Carl-von-Linne Weg 10" ...
##   .. ..$ contact_city           : chr [1:366] "Cologne" "Cologne" "Cologne" "Cologne" ...
##   .. ..$ contact_zip/postal_code: chr [1:366] "50829" "50829" "50829" "50829" ...
##   .. ..$ contact_country        : chr [1:366] "Germany" "Germany" "Germany" "Germany" ...
##   .. ..$ data_row_count         : chr [1:366] "0" "0" "0" "0" ...
##   .. ..$ instrument_model       : chr [1:366] "Illumina HiSeq 2500" "Illumina HiSeq 2500" "Illumina HiSeq 2500" "Illumina HiSeq 2500" ...
##   .. ..$ library_selection      : chr [1:366] "cDNA" "cDNA" "cDNA" "cDNA" ...
##   .. ..$ library_source         : chr [1:366] "transcriptomic" "transcriptomic" "transcriptomic" "transcriptomic" ...
##   .. ..$ library_strategy       : chr [1:366] "RNA-Seq" "RNA-Seq" "RNA-Seq" "RNA-Seq" ...
##   .. ..$ relation               : chr [1:366] "BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN05913698" "BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN05913699" "BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN05913697" "BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN05913696" ...
##   .. ..$ relation.1             : chr [1:366] "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX2248274" "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX2248275" "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX2248276" "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX2248277" ...
##   .. ..$ supplementary_file_1   : chr [1:366] "NONE" "NONE" "NONE" "NONE" ...
##   .. ..$ genotype:ch1           : chr [1:366] "Col-0" "Col-0" "Col-0" "Col-0" ...
##   .. ..$ replicate:ch1          : chr [1:366] "#1" "#1" "#1" "#2" ...
##   .. ..$ time:ch1               : chr [1:366] "1 hpi" "1 hpi" "1 hpi" "1 hpi" ...
##   .. ..$ tissue:ch1             : chr [1:366] "Leaves" "Leaves" "Leaves" "Leaves" ...
##   .. ..$ treatment:ch1          : chr [1:366] "Mock" "Pto DC3000" "Pto AvrRpt2" "Mock" ...
##   ..@ dimLabels        : chr [1:2] "sampleNames" "sampleColumns"
##   ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
##   .. .. ..@ .Data:List of 1
##   .. .. .. ..$ : int [1:3] 1 1 0
head(myColData)
## An object of class 'AnnotatedDataFrame'
##   sampleNames: GSM2347874 GSM2347875 ... GSM2347879 (6 total)
##   varLabels: title geo_accession ... treatment:ch1 (48 total)
##   varMetadata: labelDescription

When you print out the variable myColData, then you see that it contains 366 samples, but the count table contains only 348 samples. And we are only interested in the sample properties genotype, replicate, time and treatment. So this part is now extracted from all metadata:

# extract metadata for experiment 1 to 348, which are in
# GSE88798_ReadCountTable_M001_348.txt and the four columns genotype, replicate, time, treatment
myColData <- pData(myColData[1:348, c(44, 45, 46, 48)])
str(myColData)
## 'data.frame':    348 obs. of  4 variables:
##  $ genotype:ch1 : chr  "Col-0" "Col-0" "Col-0" "Col-0" ...
##  $ replicate:ch1: chr  "#1" "#1" "#1" "#2" ...
##  $ time:ch1     : chr  "1 hpi" "1 hpi" "1 hpi" "1 hpi" ...
##  $ treatment:ch1: chr  "Mock" "Pto DC3000" "Pto AvrRpt2" "Mock" ...
head(myColData)
##            genotype:ch1 replicate:ch1 time:ch1 treatment:ch1
## GSM2347874        Col-0            #1    1 hpi          Mock
## GSM2347875        Col-0            #1    1 hpi    Pto DC3000
## GSM2347876        Col-0            #1    1 hpi   Pto AvrRpt2
## GSM2347877        Col-0            #2    1 hpi          Mock
## GSM2347878        Col-0            #2    1 hpi    Pto DC3000
## GSM2347879        Col-0            #2    1 hpi   Pto AvrRpt2
# set new colnames
colnames(myColData) <- c("genotype", "replicate", "time", "treatment")
str(myColData)
## 'data.frame':    348 obs. of  4 variables:
##  $ genotype : chr  "Col-0" "Col-0" "Col-0" "Col-0" ...
##  $ replicate: chr  "#1" "#1" "#1" "#2" ...
##  $ time     : chr  "1 hpi" "1 hpi" "1 hpi" "1 hpi" ...
##  $ treatment: chr  "Mock" "Pto DC3000" "Pto AvrRpt2" "Mock" ...
head(myColData)
##            genotype replicate  time   treatment
## GSM2347874    Col-0        #1 1 hpi        Mock
## GSM2347875    Col-0        #1 1 hpi  Pto DC3000
## GSM2347876    Col-0        #1 1 hpi Pto AvrRpt2
## GSM2347877    Col-0        #2 1 hpi        Mock
## GSM2347878    Col-0        #2 1 hpi  Pto DC3000
## GSM2347879    Col-0        #2 1 hpi Pto AvrRpt2

To simplify the dataset and avoid problems unnecessary white spaces and unwanted special character are removed.

# adjust the values in the series matrix to be more intuitive
unique(myColData$replicate) # show a unique set of entries in replicate column
## [1] "#1" "#2" "#3"
myColData$replicate <- gsub("^ ","",myColData$replicate) # remove leading whitespace
myColData$replicate <- gsub("#","",myColData$replicate) # remove # character
myColData$replicate <- as.numeric(myColData$replicate) # convert from string to number

unique(myColData$time) # do the same for column time
## [1] "1 hpi"  "3 hpi"  "4 hpi"  "6 hpi"  "9 hpi"  "12 hpi" "16 hpi" "24 hpi"
myColData$time <- gsub("^ ", "", myColData$time)
myColData$time <- gsub(" hpi", "", myColData$time)
myColData$time <- as.numeric(myColData$time)

unique(myColData$genotype) # do the same for column genotype
## [1] "Col-0"                       "sid2-2"                     
## [3] "dde2-2 ein2-1 pad4-1 sid2-2" "pad4-1"                     
## [5] "pad4-1 sid2-2"               "rpm1-3 rpad4-1 sid2-22 101C"
myColData$genotype <- gsub("^ ","",myColData$genotype)
myColData$genotype <- gsub(" ","_",myColData$genotype)

unique(myColData$treatment) # do the same for column treatment
## [1] "Mock"        "Pto DC3000"  "Pto AvrRpt2" "Pto AvrRpm1"
myColData$treatment <- gsub("^ ","",myColData$treatment)
myColData$treatment <- gsub(" ","_",myColData$treatment)

In this analysis we are only interested in the genes, which are differentially expressed between mock treatment and treatment with Pseudomonas syringae pv. tomato (Pto) carrying a vector with AvrRpm1 in the accession Col-0 at the time point 1 hour after treatment. Therefore we have to filter the metadata for the accession Col-0, for the treatments Mock or Pto_AvrRpm1 and the time point 1.

############################################################################################
# select only Col-0 and Mock + Pto_AvrRpm1 treatment at time point 1 hpi
# find index of these positions
selectionIndex <-
  which(
    myColData$genotype == "Col-0" &
      (
        myColData$treatment == "Mock" |
          myColData$treatment == "Pto_AvrRpm1"
      ) & (myColData$time == 1)
  )

# extract metadata
myColData <- myColData[selectionIndex, ]
# extract counts
gseData <- gseData[,selectionIndex]

In R there is a special data type called factor (beside the usual ones like int, string, …), which is extensively used. Here the function for differential expression analysis expects the metadata to be of data type factor. Because the metadata are stored as strings, they have to be converted to factors:

# convert all metadata to factors as this is needed by the differential expression function
myColData$time <- as.factor(myColData$time)
myColData$replicate <- as.factor(myColData$replicate)
myColData$genotype <- as.factor(myColData$genotype)
myColData$treatment <- as.factor(myColData$treatment)

# check colnames / rownames for identity and adjust them
all(rownames(myColData) %in% colnames(gseData))
## [1] FALSE
colnames(gseData) <- rownames(myColData)
all(rownames(myColData) %in% colnames(gseData))
## [1] TRUE

Use now the prepared count matrix and the metadata matrix to create a DESeq object, which is needed for the differential expression analysis. The important setting is the design parameter, which has to be set to “~ treatment” as we want to calculate a p-value, which is dependent on the treatment of the plant.

# make deseq object for differential expression test
ddsMat <- DESeqDataSetFromMatrix(countData = gseData,
                                 colData = myColData,
                                 design = ~ treatment)

Set Mock as the untreated condition.

# relevel conditions, set "Mock" as reference and remove unnessecary levels
levels(ddsMat$treatment) # show levels
## [1] "Mock"        "Pto_AvrRpm1"
ddsMat$treatment <- relevel(ddsMat$treatment, ref = "Mock")
ddsMat$treatment <- droplevels(ddsMat$treatment)

Remove genes, which have almost no counts. This accelerates the calculation.

# Pre-filtering, remove genes having less than 10 counts over all experiments
keep <- rowSums(counts(ddsMat)) >= 10
ddsMat <- ddsMat[keep,]

Conduct the differential expression analysis. Count normalization is done automatically.

# Differential expression analysis
# noCores <- parallel:::detectCores() - 1 # determine number of processor cores for parallel processing
# ddsMat <- DESeq(ddsMat, parallel = TRUE, BPPARAM = SnowParam(noCores)) # run differential expression analysis
ddsMat <- DESeq(ddsMat, parallel = FALSE) # run differential expression analysis
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res <- results(ddsMat) # extract results from DESeq object
res
## log2 fold change (MLE): treatment Pto AvrRpm1 vs Mock 
## Wald test p-value: treatment Pto AvrRpm1 vs Mock 
## DataFrame with 20255 rows and 6 columns
##             baseMean log2FoldChange     lfcSE      stat      pvalue        padj
##            <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
## AT1G01010  378.65099      0.0507979  0.614377  0.082682  0.93410443   0.9721451
## AT1G01020  273.86578     -0.3695763  0.283748 -1.302483  0.19275136   0.4569515
## AT1G01030  207.58372     -1.2485585  0.478459 -2.609542  0.00906636   0.0823238
## AT1G01040 1048.62226     -0.3100009  0.354170 -0.875289  0.38141661   0.6397881
## AT1G01046    5.31325      0.3726277  1.105924  0.336938  0.73616377   0.8756635
## ...              ...            ...       ...       ...         ...         ...
## AT5G67600    777.605      1.1828501  0.743836   1.59020 1.11789e-01 0.347546604
## AT5G67610    251.943     -0.0417828  0.286693  -0.14574 8.84126e-01 0.952025704
## AT5G67620    161.053     -3.3379630  0.701943  -4.75532 1.98136e-06 0.000176331
## AT5G67630    685.906     -1.2305383  0.534264  -2.30324 2.12653e-02 0.138499846
## AT5G67640    138.371     -2.1042483  0.673674  -3.12354 1.78690e-03 0.027777002

Plot the calculated log2 fold changes vs normalized counts.

original

# Exploring results, plot log2 fold change vs. mean normalized counts
plotMA(res, ylim=c(-3, 3))

Remove noise

resultsNames(ddsMat)
## [1] "Intercept"                     "treatment_Pto_AvrRpm1_vs_Mock"
resLFC <- lfcShrink(ddsMat, coef = "treatment_Pto_AvrRpm1_vs_Mock")
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
##     sequence count data: removing the noise and preserving large differences.
##     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
resLFC
## log2 fold change (MAP): treatment Pto AvrRpm1 vs Mock 
## Wald test p-value: treatment Pto AvrRpm1 vs Mock 
## DataFrame with 20255 rows and 5 columns
##             baseMean log2FoldChange     lfcSE      pvalue        padj
##            <numeric>      <numeric> <numeric>   <numeric>   <numeric>
## AT1G01010  378.65099      0.0211266  0.383443  0.93410443   0.9721451
## AT1G01020  273.86578     -0.2868327  0.258857  0.19275136   0.4569515
## AT1G01030  207.58372     -0.9337151  0.506783  0.00906636   0.0823238
## AT1G01040 1048.62226     -0.2089155  0.299523  0.38141661   0.6397881
## AT1G01046    5.31325      0.0628926  0.452467  0.73616377   0.8756635
## ...              ...            ...       ...         ...         ...
## AT5G67600    777.605      0.4481358  0.586555 1.11789e-01 0.347546604
## AT5G67610    251.943     -0.0306231  0.247692 8.84126e-01 0.952025704
## AT5G67620    161.053     -3.0234911  0.740305 1.98136e-06 0.000176331
## AT5G67630    685.906     -0.8192357  0.563294 2.12653e-02 0.138499846
## AT5G67640    138.371     -1.6241491  0.765341 1.78690e-03 0.027777002
# plot without noise
plotMA(resLFC, ylim=c(-5,5))
# identify points interactively, click on points, then press "Finish"
idx <- identify(resLFC$baseMean, resLFC$log2FoldChange)

rownames(res)[idx]
## character(0)

DEGs

Identify the most significantly upregulated gene and plot the counts in all three repeats for mock treatment and Pto_AvrRpm1 treatment.

# plot counts for most significant upregulated gene
upIndex <- which(res$log2FoldChange > 0) # find upregulated genes
myIndex <- upIndex[which.min(res$padj[upIndex])] # find most significant upregulated gene

extract read counts for selected gene

fiss <- plotCounts(ddsMat, myIndex,
                   intgroup = c("treatment"), returnData = TRUE)
fiss
##                 count   treatment
## GSM2347874  12.889481        Mock
## GSM2347877   6.476583        Mock
## GSM2347880   4.562768        Mock
## GSM2348162 488.707390 Pto_AvrRpm1
## GSM2348163 523.244554 Pto_AvrRpm1
## GSM2348164 530.299487 Pto_AvrRpm1
# create plot from read counts
g <- ggplot(fiss,
       aes(x = treatment, y = count, color = treatment, group = treatment)) +
        geom_point(position=position_jitter(w=0.1,h=0)) + geom_smooth(se = FALSE, method = "loess") + scale_y_log10(breaks=c(10,50,100,500, 1000, 5000)) +
       ggtitle(rownames(res)[myIndex])
# print plot
print(g)
## `geom_smooth()` using formula 'y ~ x'

# print plot to pdf file
# pdf("Most significant gene.pdf", width = 8, height = 6)
# print(g)
# dev.off()

Differentially expressed genes

Write out the significantly differentially expressed genes

# Identification of differentially expressed genes
summary(res)
## 
## out of 20255 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 780, 3.9%
## LFC < 0 (down)     : 1620, 8%
## outliers [1]       : 246, 1.2%
## low counts [2]     : 786, 3.9%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
# set P-Value < 0.05
res05 <- results(ddsMat, alpha=0.05)
summary(res05)
## 
## out of 20255 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 451, 2.2%
## LFC < 0 (down)     : 1135, 5.6%
## outliers [1]       : 246, 1.2%
## low counts [2]     : 786, 3.9%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
# write results to excel file
myData <- data.frame(res[which(res$padj < 0.05),])
myData <- myData[order(myData$padj),]
write.xlsx(myData, "Significantly DE Genes.xlsx")

TASK 1

For the 10 upregulated and most significant genes, check on Araport, if they are related to pathogens / immunity

Data transformations and visualization for further evaluation

Transformation of counts, to account for different sequencing depth. This allows to compare different experiments.

# Variance stabilizing transformation
vsd <- vst(ddsMat, blind=FALSE)
head(assay(vsd), 5)
##           GSM2347874 GSM2347877 GSM2347880 GSM2348162 GSM2348163 GSM2348164
## AT1G01010   7.679779   7.819196   9.527553   8.522782   8.425180   8.963733
## AT1G01020   8.331820   8.533809   8.168461   7.972099   7.957201   8.080437
## AT1G01030   8.554305   8.561476   7.467769   7.309531   7.312543   6.697376
## AT1G01040  10.038896   9.814230  10.629327   9.761356   9.892150  10.031034
## AT1G01046   3.745359   3.919170   4.620758   4.050597   4.611107   4.272593

plot

Plot the normalized values RNA-Seq counts to see if the results are consistent.

# Data quality assessment by sample clustering and visualization
# identify the 20 most significantly upregulated genes
upIndex <- which(res$log2FoldChange > 0)
myIndex <- upIndex[order(res$padj[upIndex])[1:20]]

df <- as.data.frame(colData(ddsMat)[,c("treatment")])
rownames(df) <- colnames(vsd)
colnames(df) <- c("treatment")

pheatmap(assay(vsd)[myIndex,], cluster_rows=FALSE, show_rownames=T,
         cluster_cols=FALSE, annotation_col=df)

Heatmap of the sample-to-sample distances

Heatmap Analysis: For the heatmap, the Euclidean distance (or distance measures) is calculated for all pairwise combinations of the experiment. The distance is used to cluster the experiments and identify their similarity. Normally one would expect, that experiments with the same treatment have a low distance and cluster together, whereas experiments with different treatments should have a bigger distance.

sampleDists <- dist(t(assay(vsd))) # calc euclidean distance between all 6 experiments
sampleDistMatrix <- as.matrix(sampleDists) # distance matrix
rownames(sampleDistMatrix) <- paste(vsd$treatment, sep="-")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         col=colors)

PCA

Prinicpal Compnent Analysis (PCA): A PCA is an unsupervised machine learning method to identify entities, which belong together or to separate entities, which have different properties. In the following figure the PCA results of three different iris species are shown. It is tested, if these three species can be separated on basis of the length and width of sepals and petals. You can see that Iris setosa can be separated well from the other two species, but Iris versicolor and Iris virginica are strongly overlapping and cannot be separated using two principal components.

In our case the PCA is used to check, if the experiments with the same treatment are near to each other and experiments with a different treatment are distant to each other.

pcaData <- plotPCA(vsd, intgroup=c("treatment"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
q <- ggplot(pcaData, aes(PC1, PC2, color=treatment)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) +
  coord_fixed()
print(q)

Task 2

Discuss the 3 repeats of the mock and AvrRpm1 treatment on the basis of the PCA results and the distance plot. Would you expect inconsistencies between the repeats only looking at the VSD values plot or the dot plot of the most significantly upregulated gene? What is your suggestion for the experimenter?