RNAseq Arabidopsis masterclass assignment:

What do the expression profiles of the vim1, vim123 and met1 mutants suggest about these genes and their associations in the Arabidopsis genome?

Library preparation and experimental design

We will be analysing three Arabidopsis thaliana mutant lines (vim1, vim123 and met1) alongside the wild type (Col_0). To assess the genes of interest we need an appropriate experimental design. This means sequencing an adequate number of replicate samples of each line, using a minimum of 4 replicates of the sequenced individuals.

To prepare the experimental samples for sequencing, the RNA of interest (in this case mRNA) is extracted from the cells, treated with DNase and fragmented by sonication. The fragments are converted to cDNA, adenylated and ligated to adapters before a PCR is carried out to amplify the DNA fragments. At this point the DNA concentration is normalised between the different samples. Before sending the samples off for sequencing it is important to assess the libraries using a Bioanalyzer to determine whether there are any issues with contamination of the sample set (potentially saving money from the research budget if this is the case!).

Post sequencing data assessment

Once sequenced, the samples will be returned as FASTQ files. These can be visualised using FastQC to begin assessing the quality of the sequence dataset. Important aspects to look out for are the Phred quality scores, GC content and quality per tile.
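To make the two headline metrics concrete, here is a minimal sketch of how a Phred score and GC content can be read off a single FASTQ record. It assumes the common Phred+33 encoding, and the record itself is made up for illustration; FastQC computes these (and much more) across the whole file.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into per-base Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def gc_fraction(sequence):
    """Fraction of bases in the read that are G or C."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# A single made-up FASTQ record: header, bases, separator, qualities.
record = ["@read_1", "ATGCGCGTTA", "+", "IIIIIHHH##"]

scores = phred_scores(record[3])
mean_quality = sum(scores) / len(scores)

# A Phred score Q corresponds to an error probability of 10^(-Q/10).
error_probs = [10 ** (-q / 10) for q in scores]

print(scores, round(mean_quality, 1), gc_fraction(record[1]))
```

The last two bases in this toy read have a Phred score of 2, which is the kind of low-quality tail that trimming is meant to remove.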

To proceed, the sequence data needs to be trimmed, here using Trimmomatic. To align the reads to the Arabidopsis genome we need either a splice-aware aligner or a pseudoaligner rather than a non-splice-aware aligner, because reads derived from mRNA can span exon junctions. In this case we have opted to use HISAT2, a splice-aware aligner.

This aligner produces a SAM file, which can then be compressed into a BAM file, sorted and indexed using samtools, and visualised using IGV. At this point we can measure the expression of each gene by using LiBiNorm to create a counts file keyed by either gene ID (useful for identifying the genes in a database at a later date) or gene name. To do this we require a reference annotation for the Arabidopsis thaliana genome, such as A_thaliana.TAIR10.41.gtf.
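The alignment and BAM steps above can be sketched as a short list of commands assembled in Python. This is only an outline under stated assumptions: the file names and index prefix are hypothetical, and the reads are assumed to be single-end (the original does not say). The commands are printed rather than run, and the trimming and LiBiNorm counting steps are left out.

```python
import shlex

# Hypothetical names: none of these appear in the original analysis.
sample = "col_0_1"
index_prefix = "tair10_index"          # assumed prefix of an index built with hisat2-build
reads = f"{sample}_trimmed.fastq.gz"   # assumed single-end, already trimmed

commands = [
    # Splice-aware alignment of unpaired reads, written out as SAM.
    ["hisat2", "-x", index_prefix, "-U", reads, "-S", f"{sample}.sam"],
    # Sort by coordinate and write BAM (required before indexing).
    ["samtools", "sort", "-o", f"{sample}.sorted.bam", f"{sample}.sam"],
    # Index the sorted BAM so that IGV can load it.
    ["samtools", "index", f"{sample}.sorted.bam"],
]

for cmd in commands:
    print(shlex.join(cmd))
```

Each list could be passed to subprocess.run(cmd, check=True) once the tools are installed and the inputs exist.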

Analysis of the sequence data in R

To begin the analysis we first need to normalise the data and check that the replicates within each condition (i.e. Col_0, vim1, met1 and vim123) are consistent with one another. To do this we can make an xy plot of each replicate against another. If the replicates are consistent then the points should follow y=x; if they do not follow this distribution it may suggest that some of the genes within a sample have been up- or downregulated for some reason.

## [1] "col_0_1" "col_0_2" "col_0_3"

## [1] "met1_1" "met1_2" "met1_3"

## [1] "vim1_1" "vim1_2" "vim1_3"

## [1] "vim123_1" "vim123_2" "vim123_3"

From the replicate xy plots, the second replicate of the vim123 mutant (vim123_2) appears to follow the y=x line less closely than the others. This suggests that some of the genes may be up- or downregulated in this replicate in comparison to the other replicates, and perhaps that it should be removed from the analysis. However, because a) few samples are available for this analysis (we would require 2 at a minimum) and b) these xy plots aren't terrible, with the spread being only a potential indication of a problem, we shall continue this analysis with all replicates of the samples.
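The replicate check can also be expressed numerically. The sketch below is not the normalisation used in our R analysis; it swaps in a simple counts-per-million scaling on made-up counts, then measures how far two replicates fall from the y=x line.

```python
import math


def log_cpm(counts, pseudocount=1.0):
    """Scale raw counts to counts per million, then take log2 with a pseudocount."""
    total = sum(counts)
    return [math.log2(c / total * 1_000_000 + pseudocount) for c in counts]


def pearson(x, y):
    """Pearson correlation between two equally long lists of values."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up raw counts for five genes in two replicates of one condition.
rep_1 = [10, 200, 35, 0, 980]
rep_2 = [20, 400, 70, 0, 1960]  # same profile, sequenced twice as deeply

x, y = log_cpm(rep_1), log_cpm(rep_2)
r = pearson(x, y)
# Largest vertical distance of any gene from the y=x line.
max_deviation = max(abs(a - b) for a, b in zip(x, y))

print(round(r, 6), round(max_deviation, 6))
```

Because the second toy replicate differs only in sequencing depth, normalisation brings every gene back onto y=x; a replicate like vim123_2 would show a lower correlation and larger deviations.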

After this has been assessed we can begin to explore our data, for example by using a PCA to see whether any clustering occurs between the different samples.

We can see in the PCA that the vim1 mutant clusters closely with the wild type (WT), Col_0, suggesting that the expression profile of this mutant remains similar to that of the WT. In comparison, met1 and vim123 cluster together, indicating that their expression profiles are less similar to the WT and more similar to one another.

It is also possible to create a distance matrix and use it to plot the hierarchical clustering of the samples.

This serves a similar function to the PCA in illustrating the potential clustering of the samples, and it shows the same clustering between the met1 and vim123 mutants as the PCA.
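As a rough illustration of what the distance matrix contains, the sketch below computes pairwise Euclidean distances between sample profiles. The values are invented to mimic the pattern described above and are not our data; a full hierarchical clustering would go on to merge the closest groups repeatedly.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up log-expression values for four genes in each sample.
profiles = {
    "col_0":  [5.0, 7.1, 3.2, 8.0],
    "vim1":   [5.1, 7.0, 3.3, 8.1],
    "met1":   [8.9, 2.0, 6.5, 4.1],
    "vim123": [8.7, 2.2, 6.4, 4.0],
}

# One entry per unordered pair of samples.
distances = {
    (s1, s2): euclidean(profiles[s1], profiles[s2])
    for s1, s2 in combinations(profiles, 2)
}

# The first merge of an agglomerative clustering joins the closest pair.
closest_pair = min(distances, key=distances.get)

for pair, d in sorted(distances.items(), key=lambda item: item[1]):
    print(pair, round(d, 3))
```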

Beyond this, we can go on to extract the differentially expressed genes from the dataset. To do so, we can adjust both the log fold change and p-value thresholds in order to filter down to a manageable number of differentially expressed genes, in our case requiring an adjusted p-value of less than 0.01. The plots below show the total set of genes for each of the mutant lines, with the red points marking the filtered differentially expressed genes from each mutant that had an adjusted p-value less than 0.01.

MA plots, top to bottom: vim1 mutant, vim123 mutant, met1 mutant (red points mark the filtered differentially expressed genes selected for further analysis).

These plots illustrate that the vim123 and met1 mutants have large numbers of both upregulated and downregulated differentially expressed genes, especially in comparison to the vim1 mutant.
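The filtering behind the red points can be sketched as follows. This is a simplified stand-in, not the R workflow itself: the rows are made up, the Benjamini-Hochberg adjustment is written out by hand, and the fold-change cutoff of 1 is an assumption (only the 0.01 threshold comes from our analysis).

```python
import math


def ma_point(mean_control, mean_mutant, pseudocount=1.0):
    """Return (A, M): average log2 expression and log2 fold change, mutant vs control."""
    log_c = math.log2(mean_control + pseudocount)
    log_m = math.log2(mean_mutant + pseudocount)
    return (log_c + log_m) / 2, log_m - log_c


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up rows: (gene label, wild-type mean count, mutant mean count, raw p-value).
rows = [
    ("gene_a", 15, 255, 0.0001),
    ("gene_b", 63, 7, 0.002),
    ("gene_c", 127, 127, 0.004),
    ("gene_d", 31, 127, 0.03),
    ("gene_e", 7, 7, 0.6),
]

padj = benjamini_hochberg([row[3] for row in rows])
alpha, min_abs_lfc = 0.01, 1.0  # alpha from the text; the fold-change cutoff is assumed

red_points = []
for (gene, wt, mut, _), q in zip(rows, padj):
    a, m = ma_point(wt, mut)
    if q < alpha and abs(m) >= min_abs_lfc:
        red_points.append((gene, a, m))

print(red_points)
```

In this toy table gene_c passes the p-value threshold but shows no fold change, so it is not selected; this is why filtering on both quantities keeps the list manageable.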

Analysis of the differentially expressed genes

There are several ways to assess and compare the differentially expressed genes between the different Arabidopsis mutants. We can perform a Gene Ontology (GO) term analysis to give an idea of which processes these differentially expressed genes are involved in. To perform this analysis we used agriGO to search for the associated GO terms.

[Figures: GO term plots for vim1, vim123 and met1]

From the GO plots we can begin to see that the differentially expressed genes within the met1 and vim123 mutants appear to play substantial roles in altering the regulation of metabolic pathways as well as the maintenance and replication of DNA within the cell. By contrast, the differentially expressed genes in the vim1 mutant appear to be far less specialised and are spread throughout the cell.
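To give a feel for what it means for a GO term to be over-represented, here is a generic hypergeometric test on made-up numbers. It is an illustration of the general idea only; it is not a description of how agriGO computes or corrects its statistics.

```python
from math import comb


def enrichment_p_value(n_background, n_annotated, n_selected, n_overlap):
    """Upper-tail hypergeometric probability.

    Probability of seeing at least n_overlap annotated genes when n_selected
    genes are drawn from n_background genes, n_annotated of which carry the term.
    """
    upper = min(n_annotated, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Made-up numbers: 20 background genes, 5 carry the GO term,
# 4 differentially expressed genes, 3 of which carry the term.
p = enrichment_p_value(20, 5, 4, 3)
print(round(p, 4))
```

A small p-value here says the term appears among the selected genes more often than chance would predict; with many terms tested, a multiple-testing correction is also needed.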

We can also assess the overlap of differentially expressed genes between the different Arabidopsis mutants and present this using a Venn diagram.

This diagram shows that the met1 and vim123 mutants share far more differentially expressed genes with each other than with the vim1 mutant, again reinforcing the clustering seen in our earlier PCA.
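The regions of a Venn diagram are simply set operations on the gene lists. A minimal sketch, using placeholder gene labels rather than our actual results:

```python
# Made-up gene labels standing in for each mutant's differentially expressed genes.
de_genes = {
    "vim1":   {"g1", "g2"},
    "vim123": {"g2", "g3", "g4", "g5"},
    "met1":   {"g3", "g4", "g5", "g6"},
}

# Genes differentially expressed in every mutant (centre of the diagram).
shared_by_all = set.intersection(*de_genes.values())

# Pairwise overlaps.
met1_vim123 = de_genes["met1"] & de_genes["vim123"]
vim1_vim123 = de_genes["vim1"] & de_genes["vim123"]

# Genes unique to vim1.
vim1_only = de_genes["vim1"] - de_genes["vim123"] - de_genes["met1"]

print(len(shared_by_all), len(met1_vim123), len(vim1_vim123), len(vim1_only))
```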

Conclusions

Our analysis demonstrates that the Arabidopsis mutants met1 and vim123 have expression profiles far more similar to each other than to either the wild type or the vim1 mutant. In addition, when looking at the more specific functions of the up- and downregulated differentially expressed genes within these mutants, the genes appeared to be far more specialised in regulating metabolic processes, DNA maintenance and replication than those of the vim1 mutant, potentially suggesting that these genes carry out or are involved in more closely related pathways within the cell.