The aim of this part is to analyze and a host-pathogen protein interaction map derived from a Y2H experiment using the software “Cytoscape”. The network is extended with additional interactions and protein properties from online resources. Additionally, expression fold change values derived from a RNA-Seq experiment are integrated to examine the expression dynamic after pathogen treatment.
RStudio environment
A comprehensive list of shortcuts can be found here: shortcuts
Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data . In the following analysis of RNA-Seq data several R-packages are needed.
<your path> can be absolute or relative folder path
# clear all elements from environment
rm(list = ls())
# set working directory
setwd("<your path>/Praktikum2021/")
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager", dependencies = TRUE, repos = "http://cran.r-project.org")
}
packs <- c("DESeq2", "GEOquery", "openxlsx", "ggplot2", "RColorBrewer", "pheatmap", "RCy3", "apeglm")
BiocManager::install(packs)
##
## The downloaded binary packages are in
## /var/folders/bd/knxrzm1x1s1fcg5sc7hs_7z9dp_qq3/T//RtmpsXwRQV/downloaded_packages
Check what package(s) is loaded
(.packages())
## [1] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
## [7] "base"
For the R programming environment, a lot of packages are available, which contain specialized functions for working with different dataset, conducting special analyses or producing special plots.
Here shows how to load some packages in R.
library(DESeq2)
## Warning: package 'BiocGenerics' was built under R version 4.0.3
## Warning: multiple methods tables found for 'which'
## Warning: multiple methods tables found for 'which'
library(GEOquery)
library(openxlsx)
library(ggplot2)
library(RColorBrewer)
library(pheatmap)
library(RCy3)
In this R session, the RNA-Seq dataset GSE88798 (description above) from GEO is loaded for basic analysis and visualization. First a subset of the experiment is selected: Treatment of Col-0 with Mock or Pto_AvrRpm1 at time point 1 hour after inoculation. This dataset is used to identify genes with a significant differential expression visualize the counts of the most significant gene in three repeats in mock vs. pathogen treatment.
The following command load the RNA-seq count matrix and adjust the format for using it with the DESeq2 package
gseData <- read.table("GSE88798/GSE88798_ReadCountTable_M001_348.txt", header = T, sep = "\t")
gseLocus <- gseData[,1] # get locus names from first column
gseData <- gseData[,-1] # remove locus name column to be compatible with subsequent analysis
row.names(gseData) <- gseLocus # set locus names as row names
colnames(gseData) <- NULL # clear column names
The table containing the read counts does not include any description of the experiments, except the experiment identifier. The experiment description is in the file *_series_matrix.txt. You can open this file with a text editor to see the structure of the file.
A convenient way to load this file is using the command getGEO from the GEOquery package:
read the series matrix, which contains detailed information on each experiment
gseTable <- getGEO(filename = "GSE88798/GSE88798_series_matrix.txt")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
## File stored at:
## /var/folders/bd/knxrzm1x1s1fcg5sc7hs_7z9dp_qq3/T//RtmpsXwRQV/GPL17639.soft
gseTable # show element
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 0 features, 366 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM2347874 GSM2347875 ... GSM2359638 (366 total)
## varLabels: title geo_accession ... treatment:ch1 (48 total)
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: GPL17639
str(gseTable) # inspect structure of element
## Formal class 'ExpressionSet' [package "Biobase"] with 7 slots
## ..@ experimentData :Formal class 'MIAME' [package "Biobase"] with 13 slots
## .. .. ..@ name : chr "akira,,mine"
## .. .. ..@ lab : chr ""
## .. .. ..@ contact : chr "mine@mpipz.mpg.de"
## .. .. ..@ title : chr "Time-resolved transcriptome analysis with genetic perturbations reveals a critical time window for effective plant immunity"
## .. .. ..@ abstract : chr "Pathogen invasion in plants is associated with transcriptional reprogramming. Enigmatically, plants induce simi"| __truncated__
## .. .. ..@ url : chr "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE88798"
## .. .. ..@ pubMedIds : chr ""
## .. .. ..@ samples : list()
## .. .. ..@ hybridizations : list()
## .. .. ..@ normControls : list()
## .. .. ..@ preprocessing : list()
## .. .. ..@ other :List of 24
## .. .. .. ..$ contact_address : chr "Carl-von-Linne Weg 10"
## .. .. .. ..$ contact_city : chr "Cologne"
## .. .. .. ..$ contact_country : chr "Germany"
## .. .. .. ..$ contact_department : chr "Plant Microbe Interactions"
## .. .. .. ..$ contact_email : chr "mine@mpipz.mpg.de"
## .. .. .. ..$ contact_institute : chr "Max Planck Institute for Plant Breeding Research"
## .. .. .. ..$ contact_laboratory : chr "Tsuda lab"
## .. .. .. ..$ contact_name : chr "akira,,mine"
## .. .. .. ..$ contact_zip/postal_code: chr "50829"
## .. .. .. ..$ contributor : chr "Akira,,Mine\nCarilin,,Seyfferth\nBarbara,,Kracher\nKenichi,,Tsuda"
## .. .. .. ..$ geo_accession : chr "GSE88798"
## .. .. .. ..$ last_update_date : chr "Dec 12 2017"
## .. .. .. ..$ overall_design : chr "Leaves of Col-0 and all the single, double, triple and quadruple mutants of dde2-2, ein2-1, pad4-1, sid2-2 were"| __truncated__
## .. .. .. ..$ platform_id : chr "GPL17639"
## .. .. .. ..$ platform_taxid : chr "3702"
## .. .. .. ..$ relation : chr "BioProject: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA348676\nSRA: https://www.ncbi.nlm.nih.gov/sra?term=SRP091641"
## .. .. .. ..$ sample_id : chr "GSM2347874 GSM2347875 GSM2347876 GSM2347877 GSM2347878 GSM2347879 GSM2347880 GSM2347881 GSM2347882 GSM2347883 G"| __truncated__
## .. .. .. ..$ sample_taxid : chr "3702"
## .. .. .. ..$ status : chr "Public on Dec 12 2017"
## .. .. .. ..$ submission_date : chr "Oct 17 2016"
## .. .. .. ..$ summary : chr "Pathogen invasion in plants is associated with transcriptional reprogramming. Enigmatically, plants induce simi"| __truncated__
## .. .. .. ..$ supplementary_file : chr "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE88nnn/GSE88798/suppl/GSE88798_ReadCountTable_M001_348.txt.gz\nftp://ft"| __truncated__
## .. .. .. ..$ title : chr "Time-resolved transcriptome analysis with genetic perturbations reveals a critical time window for effective plant immunity"
## .. .. .. ..$ type : chr "Expression profiling by high throughput sequencing"
## .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
## .. .. .. .. ..@ .Data:List of 2
## .. .. .. .. .. ..$ : int [1:3] 1 0 0
## .. .. .. .. .. ..$ : int [1:3] 1 1 0
## ..@ assayData :<environment: 0x7fc5361d1ac0>
## ..@ phenoData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
## .. .. ..@ varMetadata :'data.frame': 48 obs. of 1 variable:
## .. .. .. ..$ labelDescription: chr [1:48] NA NA NA NA ...
## .. .. ..@ data :'data.frame': 366 obs. of 48 variables:
## .. .. .. ..$ title : chr [1:366] "x01_Col_01_mock_01h_rep01" "x01_Col_02_EV_01h_rep01" "x01_Col_03_AvrRpt2_01h_rep01" "x01_Col_01_mock_01h_rep02" ...
## .. .. .. ..$ geo_accession : chr [1:366] "GSM2347874" "GSM2347875" "GSM2347876" "GSM2347877" ...
## .. .. .. ..$ status : chr [1:366] "Public on Dec 12 2017" "Public on Dec 12 2017" "Public on Dec 12 2017" "Public on Dec 12 2017" ...
## .. .. .. ..$ submission_date : chr [1:366] "Oct 17 2016" "Oct 17 2016" "Oct 17 2016" "Oct 17 2016" ...
## .. .. .. ..$ last_update_date : chr [1:366] "Dec 12 2017" "Dec 12 2017" "Dec 12 2017" "Dec 12 2017" ...
## .. .. .. ..$ type : chr [1:366] "SRA" "SRA" "SRA" "SRA" ...
## .. .. .. ..$ channel_count : chr [1:366] "1" "1" "1" "1" ...
## .. .. .. ..$ source_name_ch1 : chr [1:366] "Fully-expanded leaves of 30 to 33 day-old plants" "Fully-expanded leaves of 30 to 33 day-old plants" "Fully-expanded leaves of 30 to 33 day-old plants" "Fully-expanded leaves of 30 to 33 day-old plants" ...
## .. .. .. ..$ organism_ch1 : chr [1:366] "Arabidopsis thaliana" "Arabidopsis thaliana" "Arabidopsis thaliana" "Arabidopsis thaliana" ...
## .. .. .. ..$ characteristics_ch1 : chr [1:366] "genotype: Col-0" "genotype: Col-0" "genotype: Col-0" "genotype: Col-0" ...
## .. .. .. ..$ characteristics_ch1.1 : chr [1:366] "treatment: Mock" "treatment: Pto DC3000" "treatment: Pto AvrRpt2" "treatment: Mock" ...
## .. .. .. ..$ characteristics_ch1.2 : chr [1:366] "replicate: #1" "replicate: #1" "replicate: #1" "replicate: #2" ...
## .. .. .. ..$ characteristics_ch1.3 : chr [1:366] "time: 1 hpi" "time: 1 hpi" "time: 1 hpi" "time: 1 hpi" ...
## .. .. .. ..$ characteristics_ch1.4 : chr [1:366] "tissue: Leaves" "tissue: Leaves" "tissue: Leaves" "tissue: Leaves" ...
## .. .. .. ..$ treatment_protocol_ch1 : chr [1:366] "Fully-expanded leaves were syringe-infiltrated with mock (water) or bacterial suspensions at the OD600 of 0.001." "Fully-expanded leaves were syringe-infiltrated with mock (water) or bacterial suspensions at the OD600 of 0.001." "Fully-expanded leaves were syringe-infiltrated with mock (water) or bacterial suspensions at the OD600 of 0.001." "Fully-expanded leaves were syringe-infiltrated with mock (water) or bacterial suspensions at the OD600 of 0.001." ...
## .. .. .. ..$ growth_protocol_ch1 : chr [1:366] "Arabidopsis plants were grown in a chamber at 22°C with a 10 h light period and 60% relative humidity for 3 wee"| __truncated__ "Arabidopsis plants were grown in a chamber at 22°C with a 10 h light period and 60% relative humidity for 3 wee"| __truncated__ "Arabidopsis plants were grown in a chamber at 22°C with a 10 h light period and 60% relative humidity for 3 wee"| __truncated__ "Arabidopsis plants were grown in a chamber at 22°C with a 10 h light period and 60% relative humidity for 3 wee"| __truncated__ ...
## .. .. .. ..$ molecule_ch1 : chr [1:366] "total RNA" "total RNA" "total RNA" "total RNA" ...
## .. .. .. ..$ extract_protocol_ch1 : chr [1:366] "Total RNAs were extracted using TRIzol (Invitrogen), followed by Dnase I (Roche) treatment." "Total RNAs were extracted using TRIzol (Invitrogen), followed by Dnase I (Roche) treatment." "Total RNAs were extracted using TRIzol (Invitrogen), followed by Dnase I (Roche) treatment." "Total RNAs were extracted using TRIzol (Invitrogen), followed by Dnase I (Roche) treatment." ...
## .. .. .. ..$ extract_protocol_ch1.1 : chr [1:366] "RNA libraries were prepared from 1 μg of total RNA. Poly A enrichment and library preparation were performed wi"| __truncated__ "RNA libraries were prepared from 1 μg of total RNA. Poly A enrichment and library preparation were performed wi"| __truncated__ "RNA libraries were prepared from 1 μg of total RNA. Poly A enrichment and library preparation were performed wi"| __truncated__ "RNA libraries were prepared from 1 μg of total RNA. Poly A enrichment and library preparation were performed wi"| __truncated__ ...
## .. .. .. ..$ taxid_ch1 : chr [1:366] "3702" "3702" "3702" "3702" ...
## .. .. .. ..$ description : chr [1:366] "M001" "M002" "M003" "M004" ...
## .. .. .. ..$ description.1 : chr [1:366] "" "" "" "" ...
## .. .. .. ..$ data_processing : chr [1:366] "Illumina CASAVA-1.8.2 software was used for basecalling and Illumina adapter matching reads were removed using cutadapt 1.5" "Illumina CASAVA-1.8.2 software was used for basecalling and Illumina adapter matching reads were removed using cutadapt 1.5" "Illumina CASAVA-1.8.2 software was used for basecalling and Illumina adapter matching reads were removed using cutadapt 1.5" "Illumina CASAVA-1.8.2 software was used for basecalling and Illumina adapter matching reads were removed using cutadapt 1.5" ...
## .. .. .. ..$ data_processing.1 : chr [1:366] "Reads were mapped to the Arabidopsis genome using splice-aware read aligner TopHat (version 2.0.10) with the pa"| __truncated__ "Reads were mapped to the Arabidopsis genome using splice-aware read aligner TopHat (version 2.0.10) with the pa"| __truncated__ "Reads were mapped to the Arabidopsis genome using splice-aware read aligner TopHat (version 2.0.10) with the pa"| __truncated__ "Reads were mapped to the Arabidopsis genome using splice-aware read aligner TopHat (version 2.0.10) with the pa"| __truncated__ ...
## .. .. .. ..$ data_processing.2 : chr [1:366] "Supplementary_files_format_and_content: Tab-delimited text file including raw read counts per gene for each of "| __truncated__ "Supplementary_files_format_and_content: Tab-delimited text file including raw read counts per gene for each of "| __truncated__ "Supplementary_files_format_and_content: Tab-delimited text file including raw read counts per gene for each of "| __truncated__ "Supplementary_files_format_and_content: Tab-delimited text file including raw read counts per gene for each of "| __truncated__ ...
## .. .. .. ..$ platform_id : chr [1:366] "GPL17639" "GPL17639" "GPL17639" "GPL17639" ...
## .. .. .. ..$ contact_name : chr [1:366] "akira,,mine" "akira,,mine" "akira,,mine" "akira,,mine" ...
## .. .. .. ..$ contact_email : chr [1:366] "mine@mpipz.mpg.de" "mine@mpipz.mpg.de" "mine@mpipz.mpg.de" "mine@mpipz.mpg.de" ...
## .. .. .. ..$ contact_laboratory : chr [1:366] "Tsuda lab" "Tsuda lab" "Tsuda lab" "Tsuda lab" ...
## .. .. .. ..$ contact_department : chr [1:366] "Plant Microbe Interactions" "Plant Microbe Interactions" "Plant Microbe Interactions" "Plant Microbe Interactions" ...
## .. .. .. ..$ contact_institute : chr [1:366] "Max Planck Institute for Plant Breeding Research" "Max Planck Institute for Plant Breeding Research" "Max Planck Institute for Plant Breeding Research" "Max Planck Institute for Plant Breeding Research" ...
## .. .. .. ..$ contact_address : chr [1:366] "Carl-von-Linne Weg 10" "Carl-von-Linne Weg 10" "Carl-von-Linne Weg 10" "Carl-von-Linne Weg 10" ...
## .. .. .. ..$ contact_city : chr [1:366] "Cologne" "Cologne" "Cologne" "Cologne" ...
## .. .. .. ..$ contact_zip/postal_code: chr [1:366] "50829" "50829" "50829" "50829" ...
## .. .. .. ..$ contact_country : chr [1:366] "Germany" "Germany" "Germany" "Germany" ...
## .. .. .. ..$ data_row_count : chr [1:366] "0" "0" "0" "0" ...
## .. .. .. ..$ instrument_model : chr [1:366] "Illumina HiSeq 2500" "Illumina HiSeq 2500" "Illumina HiSeq 2500" "Illumina HiSeq 2500" ...
## .. .. .. ..$ library_selection : chr [1:366] "cDNA" "cDNA" "cDNA" "cDNA" ...
## .. .. .. ..$ library_source : chr [1:366] "transcriptomic" "transcriptomic" "transcriptomic" "transcriptomic" ...
## .. .. .. ..$ library_strategy : chr [1:366] "RNA-Seq" "RNA-Seq" "RNA-Seq" "RNA-Seq" ...
## .. .. .. ..$ relation : chr [1:366] "BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN05913698" "BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN05913699" "BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN05913697" "BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN05913696" ...
## .. .. .. ..$ relation.1 : chr [1:366] "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX2248274" "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX2248275" "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX2248276" "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX2248277" ...
## .. .. .. ..$ supplementary_file_1 : chr [1:366] "NONE" "NONE" "NONE" "NONE" ...
## .. .. .. ..$ genotype:ch1 : chr [1:366] "Col-0" "Col-0" "Col-0" "Col-0" ...
## .. .. .. ..$ replicate:ch1 : chr [1:366] "#1" "#1" "#1" "#2" ...
## .. .. .. ..$ time:ch1 : chr [1:366] "1 hpi" "1 hpi" "1 hpi" "1 hpi" ...
## .. .. .. ..$ tissue:ch1 : chr [1:366] "Leaves" "Leaves" "Leaves" "Leaves" ...
## .. .. .. ..$ treatment:ch1 : chr [1:366] "Mock" "Pto DC3000" "Pto AvrRpt2" "Mock" ...
## .. .. ..@ dimLabels : chr [1:2] "sampleNames" "sampleColumns"
## .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
## .. .. .. .. ..@ .Data:List of 1
## .. .. .. .. .. ..$ : int [1:3] 1 1 0
## ..@ featureData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
## .. .. ..@ varMetadata :'data.frame': 0 obs. of 3 variables:
## .. .. .. ..$ Column : chr(0)
## .. .. .. ..$ Description : chr(0)
## .. .. .. ..$ labelDescription: chr(0)
## .. .. ..@ data :'data.frame': 0 obs. of 0 variables
## .. .. ..@ dimLabels : chr [1:2] "featureNames" "featureColumns"
## .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
## .. .. .. .. ..@ .Data:List of 1
## .. .. .. .. .. ..$ : int [1:3] 1 1 0
## ..@ annotation : chr "GPL17639"
## ..@ protocolData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
## .. .. ..@ varMetadata :'data.frame': 0 obs. of 1 variable:
## .. .. .. ..$ labelDescription: chr(0)
## .. .. ..@ data :'data.frame': 366 obs. of 0 variables
## .. .. ..@ dimLabels : chr [1:2] "sampleNames" "sampleColumns"
## .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
## .. .. .. .. ..@ .Data:List of 1
## .. .. .. .. .. ..$ : int [1:3] 1 1 0
## ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
## .. .. ..@ .Data:List of 4
## .. .. .. ..$ : int [1:3] 4 0 0
## .. .. .. ..$ : int [1:3] 2 48 0
## .. .. .. ..$ : int [1:3] 1 3 0
## .. .. .. ..$ : int [1:3] 1 0 0
The variable gseTable is an ExpressionSet object. To work with the information contained in this object, the relevant data have to be extracted into a simple variable
# extract relevant information: genotype, replicate, time point and treatment for each experiment
myColData <- phenoData(gseTable) # extract metadata using the function phenoData
str(myColData)
## Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
## ..@ varMetadata :'data.frame': 48 obs. of 1 variable:
## .. ..$ labelDescription: chr [1:48] NA NA NA NA ...
## ..@ data :'data.frame': 366 obs. of 48 variables:
## .. ..$ title : chr [1:366] "x01_Col_01_mock_01h_rep01" "x01_Col_02_EV_01h_rep01" "x01_Col_03_AvrRpt2_01h_rep01" "x01_Col_01_mock_01h_rep02" ...
## .. ..$ geo_accession : chr [1:366] "GSM2347874" "GSM2347875" "GSM2347876" "GSM2347877" ...
## .. ..$ status : chr [1:366] "Public on Dec 12 2017" "Public on Dec 12 2017" "Public on Dec 12 2017" "Public on Dec 12 2017" ...
## .. ..$ submission_date : chr [1:366] "Oct 17 2016" "Oct 17 2016" "Oct 17 2016" "Oct 17 2016" ...
## .. ..$ last_update_date : chr [1:366] "Dec 12 2017" "Dec 12 2017" "Dec 12 2017" "Dec 12 2017" ...
## .. ..$ type : chr [1:366] "SRA" "SRA" "SRA" "SRA" ...
## .. ..$ channel_count : chr [1:366] "1" "1" "1" "1" ...
## .. ..$ source_name_ch1 : chr [1:366] "Fully-expanded leaves of 30 to 33 day-old plants" "Fully-expanded leaves of 30 to 33 day-old plants" "Fully-expanded leaves of 30 to 33 day-old plants" "Fully-expanded leaves of 30 to 33 day-old plants" ...
## .. ..$ organism_ch1 : chr [1:366] "Arabidopsis thaliana" "Arabidopsis thaliana" "Arabidopsis thaliana" "Arabidopsis thaliana" ...
## .. ..$ characteristics_ch1 : chr [1:366] "genotype: Col-0" "genotype: Col-0" "genotype: Col-0" "genotype: Col-0" ...
## .. ..$ characteristics_ch1.1 : chr [1:366] "treatment: Mock" "treatment: Pto DC3000" "treatment: Pto AvrRpt2" "treatment: Mock" ...
## .. ..$ characteristics_ch1.2 : chr [1:366] "replicate: #1" "replicate: #1" "replicate: #1" "replicate: #2" ...
## .. ..$ characteristics_ch1.3 : chr [1:366] "time: 1 hpi" "time: 1 hpi" "time: 1 hpi" "time: 1 hpi" ...
## .. ..$ characteristics_ch1.4 : chr [1:366] "tissue: Leaves" "tissue: Leaves" "tissue: Leaves" "tissue: Leaves" ...
## .. ..$ treatment_protocol_ch1 : chr [1:366] "Fully-expanded leaves were syringe-infiltrated with mock (water) or bacterial suspensions at the OD600 of 0.001." "Fully-expanded leaves were syringe-infiltrated with mock (water) or bacterial suspensions at the OD600 of 0.001." "Fully-expanded leaves were syringe-infiltrated with mock (water) or bacterial suspensions at the OD600 of 0.001." "Fully-expanded leaves were syringe-infiltrated with mock (water) or bacterial suspensions at the OD600 of 0.001." ...
## .. ..$ growth_protocol_ch1 : chr [1:366] "Arabidopsis plants were grown in a chamber at 22°C with a 10 h light period and 60% relative humidity for 3 wee"| __truncated__ "Arabidopsis plants were grown in a chamber at 22°C with a 10 h light period and 60% relative humidity for 3 wee"| __truncated__ "Arabidopsis plants were grown in a chamber at 22°C with a 10 h light period and 60% relative humidity for 3 wee"| __truncated__ "Arabidopsis plants were grown in a chamber at 22°C with a 10 h light period and 60% relative humidity for 3 wee"| __truncated__ ...
## .. ..$ molecule_ch1 : chr [1:366] "total RNA" "total RNA" "total RNA" "total RNA" ...
## .. ..$ extract_protocol_ch1 : chr [1:366] "Total RNAs were extracted using TRIzol (Invitrogen), followed by Dnase I (Roche) treatment." "Total RNAs were extracted using TRIzol (Invitrogen), followed by Dnase I (Roche) treatment." "Total RNAs were extracted using TRIzol (Invitrogen), followed by Dnase I (Roche) treatment." "Total RNAs were extracted using TRIzol (Invitrogen), followed by Dnase I (Roche) treatment." ...
## .. ..$ extract_protocol_ch1.1 : chr [1:366] "RNA libraries were prepared from 1 μg of total RNA. Poly A enrichment and library preparation were performed wi"| __truncated__ "RNA libraries were prepared from 1 μg of total RNA. Poly A enrichment and library preparation were performed wi"| __truncated__ "RNA libraries were prepared from 1 μg of total RNA. Poly A enrichment and library preparation were performed wi"| __truncated__ "RNA libraries were prepared from 1 μg of total RNA. Poly A enrichment and library preparation were performed wi"| __truncated__ ...
## .. ..$ taxid_ch1 : chr [1:366] "3702" "3702" "3702" "3702" ...
## .. ..$ description : chr [1:366] "M001" "M002" "M003" "M004" ...
## .. ..$ description.1 : chr [1:366] "" "" "" "" ...
## .. ..$ data_processing : chr [1:366] "Illumina CASAVA-1.8.2 software was used for basecalling and Illumina adapter matching reads were removed using cutadapt 1.5" "Illumina CASAVA-1.8.2 software was used for basecalling and Illumina adapter matching reads were removed using cutadapt 1.5" "Illumina CASAVA-1.8.2 software was used for basecalling and Illumina adapter matching reads were removed using cutadapt 1.5" "Illumina CASAVA-1.8.2 software was used for basecalling and Illumina adapter matching reads were removed using cutadapt 1.5" ...
## .. ..$ data_processing.1 : chr [1:366] "Reads were mapped to the Arabidopsis genome using splice-aware read aligner TopHat (version 2.0.10) with the pa"| __truncated__ "Reads were mapped to the Arabidopsis genome using splice-aware read aligner TopHat (version 2.0.10) with the pa"| __truncated__ "Reads were mapped to the Arabidopsis genome using splice-aware read aligner TopHat (version 2.0.10) with the pa"| __truncated__ "Reads were mapped to the Arabidopsis genome using splice-aware read aligner TopHat (version 2.0.10) with the pa"| __truncated__ ...
## .. ..$ data_processing.2 : chr [1:366] "Supplementary_files_format_and_content: Tab-delimited text file including raw read counts per gene for each of "| __truncated__ "Supplementary_files_format_and_content: Tab-delimited text file including raw read counts per gene for each of "| __truncated__ "Supplementary_files_format_and_content: Tab-delimited text file including raw read counts per gene for each of "| __truncated__ "Supplementary_files_format_and_content: Tab-delimited text file including raw read counts per gene for each of "| __truncated__ ...
## .. ..$ platform_id : chr [1:366] "GPL17639" "GPL17639" "GPL17639" "GPL17639" ...
## .. ..$ contact_name : chr [1:366] "akira,,mine" "akira,,mine" "akira,,mine" "akira,,mine" ...
## .. ..$ contact_email : chr [1:366] "mine@mpipz.mpg.de" "mine@mpipz.mpg.de" "mine@mpipz.mpg.de" "mine@mpipz.mpg.de" ...
## .. ..$ contact_laboratory : chr [1:366] "Tsuda lab" "Tsuda lab" "Tsuda lab" "Tsuda lab" ...
## .. ..$ contact_department : chr [1:366] "Plant Microbe Interactions" "Plant Microbe Interactions" "Plant Microbe Interactions" "Plant Microbe Interactions" ...
## .. ..$ contact_institute : chr [1:366] "Max Planck Institute for Plant Breeding Research" "Max Planck Institute for Plant Breeding Research" "Max Planck Institute for Plant Breeding Research" "Max Planck Institute for Plant Breeding Research" ...
## .. ..$ contact_address : chr [1:366] "Carl-von-Linne Weg 10" "Carl-von-Linne Weg 10" "Carl-von-Linne Weg 10" "Carl-von-Linne Weg 10" ...
## .. ..$ contact_city : chr [1:366] "Cologne" "Cologne" "Cologne" "Cologne" ...
## .. ..$ contact_zip/postal_code: chr [1:366] "50829" "50829" "50829" "50829" ...
## .. ..$ contact_country : chr [1:366] "Germany" "Germany" "Germany" "Germany" ...
## .. ..$ data_row_count : chr [1:366] "0" "0" "0" "0" ...
## .. ..$ instrument_model : chr [1:366] "Illumina HiSeq 2500" "Illumina HiSeq 2500" "Illumina HiSeq 2500" "Illumina HiSeq 2500" ...
## .. ..$ library_selection : chr [1:366] "cDNA" "cDNA" "cDNA" "cDNA" ...
## .. ..$ library_source : chr [1:366] "transcriptomic" "transcriptomic" "transcriptomic" "transcriptomic" ...
## .. ..$ library_strategy : chr [1:366] "RNA-Seq" "RNA-Seq" "RNA-Seq" "RNA-Seq" ...
## .. ..$ relation : chr [1:366] "BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN05913698" "BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN05913699" "BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN05913697" "BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN05913696" ...
## .. ..$ relation.1 : chr [1:366] "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX2248274" "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX2248275" "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX2248276" "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX2248277" ...
## .. ..$ supplementary_file_1 : chr [1:366] "NONE" "NONE" "NONE" "NONE" ...
## .. ..$ genotype:ch1 : chr [1:366] "Col-0" "Col-0" "Col-0" "Col-0" ...
## .. ..$ replicate:ch1 : chr [1:366] "#1" "#1" "#1" "#2" ...
## .. ..$ time:ch1 : chr [1:366] "1 hpi" "1 hpi" "1 hpi" "1 hpi" ...
## .. ..$ tissue:ch1 : chr [1:366] "Leaves" "Leaves" "Leaves" "Leaves" ...
## .. ..$ treatment:ch1 : chr [1:366] "Mock" "Pto DC3000" "Pto AvrRpt2" "Mock" ...
## ..@ dimLabels : chr [1:2] "sampleNames" "sampleColumns"
## ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
## .. .. ..@ .Data:List of 1
## .. .. .. ..$ : int [1:3] 1 1 0
head(myColData)
## An object of class 'AnnotatedDataFrame'
## sampleNames: GSM2347874 GSM2347875 ... GSM2347879 (6 total)
## varLabels: title geo_accession ... treatment:ch1 (48 total)
## varMetadata: labelDescription
When you print out the variable myColData, then you see that it contains 366 samples, but the count table contains only 348 samples. And we are only interested in the sample properties genotype, replicate, time and treatment. So this part is now extracted from all metadata:
# extract metadata for experiment 1 to 348, which are in
# GSE88798_ReadCountTable_M001_348.txt and the four columns genotype, replicate, time, treatment
myColData <- pData(myColData[1:348, c(44, 45, 46, 48)])
str(myColData)
## 'data.frame': 348 obs. of 4 variables:
## $ genotype:ch1 : chr "Col-0" "Col-0" "Col-0" "Col-0" ...
## $ replicate:ch1: chr "#1" "#1" "#1" "#2" ...
## $ time:ch1 : chr "1 hpi" "1 hpi" "1 hpi" "1 hpi" ...
## $ treatment:ch1: chr "Mock" "Pto DC3000" "Pto AvrRpt2" "Mock" ...
head(myColData)
## genotype:ch1 replicate:ch1 time:ch1 treatment:ch1
## GSM2347874 Col-0 #1 1 hpi Mock
## GSM2347875 Col-0 #1 1 hpi Pto DC3000
## GSM2347876 Col-0 #1 1 hpi Pto AvrRpt2
## GSM2347877 Col-0 #2 1 hpi Mock
## GSM2347878 Col-0 #2 1 hpi Pto DC3000
## GSM2347879 Col-0 #2 1 hpi Pto AvrRpt2
# set new colnames
colnames(myColData) <- c("genotype", "replicate", "time", "treatment")
str(myColData)
## 'data.frame': 348 obs. of 4 variables:
## $ genotype : chr "Col-0" "Col-0" "Col-0" "Col-0" ...
## $ replicate: chr "#1" "#1" "#1" "#2" ...
## $ time : chr "1 hpi" "1 hpi" "1 hpi" "1 hpi" ...
## $ treatment: chr "Mock" "Pto DC3000" "Pto AvrRpt2" "Mock" ...
head(myColData)
## genotype replicate time treatment
## GSM2347874 Col-0 #1 1 hpi Mock
## GSM2347875 Col-0 #1 1 hpi Pto DC3000
## GSM2347876 Col-0 #1 1 hpi Pto AvrRpt2
## GSM2347877 Col-0 #2 1 hpi Mock
## GSM2347878 Col-0 #2 1 hpi Pto DC3000
## GSM2347879 Col-0 #2 1 hpi Pto AvrRpt2
To simplify the dataset and avoid problems unnecessary white spaces and unwanted special character are removed.
# adjust the values in the series matrix to be more intuitive
unique(myColData$replicate) # show a unique set of entries in replicate column
## [1] "#1" "#2" "#3"
myColData$replicate <- gsub("^ ","",myColData$replicate) # remove leading whitespace
myColData$replicate <- gsub("#","",myColData$replicate) # remove # character
myColData$replicate <- as.numeric(myColData$replicate) # convert from string to number
unique(myColData$time) # do the same for column time
## [1] "1 hpi" "3 hpi" "4 hpi" "6 hpi" "9 hpi" "12 hpi" "16 hpi" "24 hpi"
myColData$time <- gsub("^ ", "", myColData$time)
myColData$time <- gsub(" hpi", "", myColData$time)
myColData$time <- as.numeric(myColData$time)
unique(myColData$genotype) # do the same for column genotype
## [1] "Col-0" "sid2-2"
## [3] "dde2-2 ein2-1 pad4-1 sid2-2" "pad4-1"
## [5] "pad4-1 sid2-2" "rpm1-3 rpad4-1 sid2-22 101C"
myColData$genotype <- gsub("^ ","",myColData$genotype)
myColData$genotype <- gsub(" ","_",myColData$genotype)
unique(myColData$treatment) # do the same for column treatment
## [1] "Mock" "Pto DC3000" "Pto AvrRpt2" "Pto AvrRpm1"
myColData$treatment <- gsub("^ ","",myColData$treatment)
myColData$treatment <- gsub(" ","_",myColData$treatment)
In this analysis we are only interested in the genes, which are differentially expressed between mock treatment and treatment with Pseudomonas syringae pv. tomato (Pto) carrying a vector with AvrRpm1 in the accession Col-0 at the time point 1 hour after treatment. Therefore we have to filter the metadata for the accession Col-0, for the treatments Mock or Pto_AvrRpm1 and the time point 1.
############################################################################################
# select only Col-0 and Mock + Pto_AvrRpm1 treatment at time point 1 hpi
# find index of these positions
selectionIndex <-
which(
myColData$genotype == "Col-0" &
(
myColData$treatment == "Mock" |
myColData$treatment == "Pto_AvrRpm1"
) & (myColData$time == 1)
)
# extract metadata
myColData <- myColData[selectionIndex, ]
# extract counts
gseData <- gseData[,selectionIndex]
In R there is a special data type called factor (beside the usual ones like int, string, …), which is extensively used. Here the function for differential expression analysis expects the metadata to be of data type factor. Because the metadata are stored as strings, they have to be converted to factors:
# convert all metadata to factors as this is needed by the differential expression function
myColData$time <- as.factor(myColData$time)
myColData$replicate <- as.factor(myColData$replicate)
myColData$genotype <- as.factor(myColData$genotype)
myColData$treatment <- as.factor(myColData$treatment)
# check colnames / rownames for identity and adjust them
all(rownames(myColData) %in% colnames(gseData))
## [1] FALSE
colnames(gseData) <- rownames(myColData)
all(rownames(myColData) %in% colnames(gseData))
## [1] TRUE
Use now the prepared count matrix and the metadata matrix to create a DESeq object, which is needed for the differential expression analysis. The important setting is the design parameter, which has to be set to “~ treatment” as we want to calculate a p-value, which is dependent on the treatment of the plant.
# make deseq object for differential expression test
ddsMat <- DESeqDataSetFromMatrix(countData = gseData,
colData = myColData,
design = ~ treatment)
Set Mock as the untreated condition.
# relevel conditions, set "Mock" as reference and remove unnessecary levels
levels(ddsMat$treatment) # show levels
## [1] "Mock" "Pto_AvrRpm1"
ddsMat$treatment <- relevel(ddsMat$treatment, ref = "Mock")
ddsMat$treatment <- droplevels(ddsMat$treatment)
Remove genes, which have almost no counts. This accelerates the calculation.
# Pre-filtering, remove genes having less than 10 counts over all experiments
keep <- rowSums(counts(ddsMat)) >= 10
ddsMat <- ddsMat[keep,]
Conduct the differential expression analysis. Count normalization is done automatically.
# Differential expression analysis
# noCores <- parallel:::detectCores() - 1 # determine number of processor cores for parallel processing
# ddsMat <- DESeq(ddsMat, parallel = TRUE, BPPARAM = SnowParam(noCores)) # run differential expression analysis
ddsMat <- DESeq(ddsMat, parallel = FALSE) # run differential expression analysis
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res <- results(ddsMat) # extract results from DESeq object
res
## log2 fold change (MLE): treatment Pto AvrRpm1 vs Mock
## Wald test p-value: treatment Pto AvrRpm1 vs Mock
## DataFrame with 20255 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## AT1G01010 378.65099 0.0507979 0.614377 0.082682 0.93410443 0.9721451
## AT1G01020 273.86578 -0.3695763 0.283748 -1.302483 0.19275136 0.4569515
## AT1G01030 207.58372 -1.2485585 0.478459 -2.609542 0.00906636 0.0823238
## AT1G01040 1048.62226 -0.3100009 0.354170 -0.875289 0.38141661 0.6397881
## AT1G01046 5.31325 0.3726277 1.105924 0.336938 0.73616377 0.8756635
## ... ... ... ... ... ... ...
## AT5G67600 777.605 1.1828501 0.743836 1.59020 1.11789e-01 0.347546604
## AT5G67610 251.943 -0.0417828 0.286693 -0.14574 8.84126e-01 0.952025704
## AT5G67620 161.053 -3.3379630 0.701943 -4.75532 1.98136e-06 0.000176331
## AT5G67630 685.906 -1.2305383 0.534264 -2.30324 2.12653e-02 0.138499846
## AT5G67640 138.371 -2.1042483 0.673674 -3.12354 1.78690e-03 0.027777002
# Exploring results, plot log2 fold change vs. mean normalized counts
plotMA(res, ylim=c(-3, 3))
resultsNames(ddsMat)
## [1] "Intercept" "treatment_Pto_AvrRpm1_vs_Mock"
resLFC <- lfcShrink(ddsMat, coef = "treatment_Pto_AvrRpm1_vs_Mock")
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
## Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
## sequence count data: removing the noise and preserving large differences.
## Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
resLFC
## log2 fold change (MAP): treatment Pto AvrRpm1 vs Mock
## Wald test p-value: treatment Pto AvrRpm1 vs Mock
## DataFrame with 20255 rows and 5 columns
## baseMean log2FoldChange lfcSE pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric>
## AT1G01010 378.65099 0.0211266 0.383443 0.93410443 0.9721451
## AT1G01020 273.86578 -0.2868327 0.258857 0.19275136 0.4569515
## AT1G01030 207.58372 -0.9337151 0.506783 0.00906636 0.0823238
## AT1G01040 1048.62226 -0.2089155 0.299523 0.38141661 0.6397881
## AT1G01046 5.31325 0.0628926 0.452467 0.73616377 0.8756635
## ... ... ... ... ... ...
## AT5G67600 777.605 0.4481358 0.586555 1.11789e-01 0.347546604
## AT5G67610 251.943 -0.0306231 0.247692 8.84126e-01 0.952025704
## AT5G67620 161.053 -3.0234911 0.740305 1.98136e-06 0.000176331
## AT5G67630 685.906 -0.8192357 0.563294 2.12653e-02 0.138499846
## AT5G67640 138.371 -1.6241491 0.765341 1.78690e-03 0.027777002
# plot without noise
plotMA(resLFC, ylim=c(-5,5))
# identify points interactively, click on points, then press "Finish"
idx <- identify(resLFC$baseMean, resLFC$log2FoldChange)
rownames(res)[idx]
## character(0)
Identify the most significantly upregulated gene and plot the counts in all three repeats for mock treatment and Pto_AvrRpm1 treatment.
# plot counts for most significant upregulated gene
upIndex <- which(res$log2FoldChange > 0) # find upregulated genes
myIndex <- upIndex[which.min(res$padj[upIndex])] # find most significant upregulated gene
fiss <- plotCounts(ddsMat, myIndex,
intgroup = c("treatment"), returnData = TRUE)
fiss
## count treatment
## GSM2347874 12.889481 Mock
## GSM2347877 6.476583 Mock
## GSM2347880 4.562768 Mock
## GSM2348162 488.707390 Pto_AvrRpm1
## GSM2348163 523.244554 Pto_AvrRpm1
## GSM2348164 530.299487 Pto_AvrRpm1
# create plot from read counts
g <- ggplot(fiss,
aes(x = treatment, y = count, color = treatment, group = treatment)) +
geom_point(position=position_jitter(w=0.1,h=0)) + geom_smooth(se = FALSE, method = "loess") + scale_y_log10(breaks=c(10,50,100,500, 1000, 5000)) +
ggtitle(rownames(res)[myIndex])
# print plot
print(g)
## `geom_smooth()` using formula 'y ~ x'
# print plot to pdf file
# pdf("Most significant gene.pdf", width = 8, height = 6)
# print(g)
# dev.off()
Write out the significantly differentially expressed genes
# Identification of differentially expressed genes
summary(res)
##
## out of 20255 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 780, 3.9%
## LFC < 0 (down) : 1620, 8%
## outliers [1] : 246, 1.2%
## low counts [2] : 786, 3.9%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
# set P-Value < 0.05
res05 <- results(ddsMat, alpha=0.05)
summary(res05)
##
## out of 20255 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 451, 2.2%
## LFC < 0 (down) : 1135, 5.6%
## outliers [1] : 246, 1.2%
## low counts [2] : 786, 3.9%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
# write results to excel file
myData <- data.frame(res[which(res$padj < 0.05),])
myData <- myData[order(myData$padj),]
write.xlsx(myData, "Significantly DE Genes.xlsx")
For the 10 upregulated and most significant genes, check on Araport, if they are related to pathogens / immunity
Transformation of counts, to account for different sequencing depth. This allows to compare different experiments.
# Variance stabilizing transformation
vsd <- vst(ddsMat, blind=FALSE)
head(assay(vsd), 5)
## GSM2347874 GSM2347877 GSM2347880 GSM2348162 GSM2348163 GSM2348164
## AT1G01010 7.679779 7.819196 9.527553 8.522782 8.425180 8.963733
## AT1G01020 8.331820 8.533809 8.168461 7.972099 7.957201 8.080437
## AT1G01030 8.554305 8.561476 7.467769 7.309531 7.312543 6.697376
## AT1G01040 10.038896 9.814230 10.629327 9.761356 9.892150 10.031034
## AT1G01046 3.745359 3.919170 4.620758 4.050597 4.611107 4.272593
Plot the normalized values RNA-Seq counts to see if the results are consistent.
# Data quality assessment by sample clustering and visualization
# identify the 20 most significantly upregulated genes
upIndex <- which(res$log2FoldChange > 0)
myIndex <- upIndex[order(res$padj[upIndex])[1:20]]
df <- as.data.frame(colData(ddsMat)[,c("treatment")])
rownames(df) <- colnames(vsd)
colnames(df) <- c("treatment")
pheatmap(assay(vsd)[myIndex,], cluster_rows=FALSE, show_rownames=T,
cluster_cols=FALSE, annotation_col=df)
Heatmap Analysis: For the heatmap, the Euclidean distance (or distance measures) is calculated for all pairwise combinations of the experiment. The distance is used to cluster the experiments and identify their similarity. Normally one would expect, that experiments with the same treatment have a low distance and cluster together, whereas experiments with different treatments should have a bigger distance.
sampleDists <- dist(t(assay(vsd))) # calc euclidean distance between all 6 experiments
sampleDistMatrix <- as.matrix(sampleDists) # distance matrix
rownames(sampleDistMatrix) <- paste(vsd$treatment, sep="-")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
clustering_distance_rows=sampleDists,
clustering_distance_cols=sampleDists,
col=colors)
Prinicpal Compnent Analysis (PCA): A PCA is an unsupervised machine learning method to identify entities, which belong together or to separate entities, which have different properties. In the following figure the PCA results of three different iris species are shown. It is tested, if these three species can be separated on basis of the length and width of sepals and petals. You can see that Iris setosa can be separated well from the other two species, but Iris versicolor and Iris virginica are strongly overlapping and cannot be separated using two principal components.
In our case the PCA is used to check, if the experiments with the same treatment are near to each other and experiments with a different treatment are distant to each other.
pcaData <- plotPCA(vsd, intgroup=c("treatment"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
q <- ggplot(pcaData, aes(PC1, PC2, color=treatment)) +
geom_point(size=3) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
coord_fixed()
print(q)
Discuss the 3 repeats of the mock and AvrRpm1 treatment on the basis of the PCA results and the distance plot. Would you expect inconsistencies between the repeats only looking at the VSD values plot or the dot plot of the most significantly upregulated gene? What is your suggestion for the experimenter?