1M_reads data set were used to identify location candidates within a certain gene that control Nkt cell production.Within this first section, only the raw sequencing data pertained with a read depth of 10k was downloaded and the paired end reads aligned to a specific reference genome. 14 samples in total were sequenced and aligned, 7 each from two different mouse types: NOD.Nkrp1 and NOD.Nkrp1.Nkt1, communicated throughout the report as ‘NN’ and ‘NNN’ respectively.
After downloading, the 10k reads were aligned to the genome using the rsem-calculate-expression command line. The *.genome.bam files produced were in the SAM format, including columns for the flag score and MAPQ score, as well as the associated sequence of each of the alignments. The flag score is a value that is given depending on the properties the read possess; ranging from whether or not it is a paired read to whether the read is a PCR or optical duplicate. In particular, all reads with flag scores below 256 are known to be Primary alignments, meaning that their mapped location in the genome that gives the best match to the read. Flags scores above 256 are given to Secondary alignments; i.e. other locations within the genome that also match the read, but not as closely. The MAPQ score relates to the mapping quality of each read, or in other words the probability that a particular read has been aligned at the correct position within a gene. The higher the MAPQ score, the more likely the read is mapped to the correct location. These two aspects, along with the nucleotide sequence of each read, were extracted from each of the *.genome.bam files and the information used to produce the graph seen below.
Figure 1: The relationship between multiplicity of alignments and MAPQ scores for Primary and Secondary alignments. The MAPQ score of a read relates to the probability that the mapped position of the read is correct. As each read can give rise to more than one possible alignment, each individual alignment can also be classed. The alignment that represents the location in the genome that gives the best match to the read is classed as a primary alignment, whereas all others are secondary.
Figure 1 above shows a clear relationship, initially between the class of the alignment and its MAPQ score of the left, and then on the right relating this to the multiplicity of the alignments. A number of conclusions can be drawn from these graphs. Firstly, it is clear from this graph that within the data set, Secondary alignments have lower MAPQ scores, suggesting there is less certainty about the correctness of their mapping location of the gene. Additionally, it can be seen that the majority of the data points that had higher multiplicities were Secondary alignments. This means that less closely matching alignments were abundant within the data set, having a negative impact of the quanification/ mapping accuracy. Overall, highlighting the reads with Secondary alignments that have a high abundance within the data set as the most problematic reads for quantifying gene expression as they have lower MAPQ scores (below 7). Opposingly, it can be seen that the alignments that achieved a MAPQ score of 100 were mainly Primary alignments. Therefore, to gain better results the secondary alignments that have a high multiplicity should be filtered out of the data.
Although only the first sample is depicted in Figure 1, the relationship between the multiplicity of the alignments and the mapping quality of the primary compared to secondary alignments was extremely similar for all 14 samples tested. No major differences were seen between the different mouse types.
In a similar process to that of Section 1, the 100k and 1M read depth data files were downloaded and aligned with a reference genome in order to gauge the effect that increasing the read depth has on accurately detecting differentially expressed genes. That is, what impact does read depth have on the relationship established in Section 1: that Secondary and higher multiplicity alignments tend to have lower MAPQ scores, and thus are more likely to be mapped to incorrect regions within a gene.
Figure 2: The relationship between MAPQ scores for Primary and Secondary alignments, and how this differs with read depth. The titles to the top left of each individual graph refer to both the sample and read depth used in the sequencing and then aligning the genes; i.e. how many reads per sample were performed.
Figure 3: The relationship between the multiplicity of alignments and their corresponding MAPQ scores. As in Figure 2, the titles in the top left of each individual graph refer to the sample and the depth at which the samples were read.
The same three columns of interest (flag score, MAPQ score and nucleotide sequence) were extracted from the *.genome.bam files for both 100k and 1M read depths. As can be seen in both Figure 2 and 3 above, the same overall trend can be seen amongst different read depths. This trend is that Secondary alignments (those that don’t match as closely to the read) tend to have a lower MAPQ score, meaning they are more likely to be mapped to an incorrect position. This may be problematic for quantification within gene expression studies, particularly with Secondary alignments that have a low MAPQ score but high multiplicity within the data set. On the other hand, by increasing the read depth, the number of genes with enough reads to pass a minimum threshold for quanitifcation will also be increased. This will improve the accurateness of gene expression studies as it will increase our ability to detect genes with statistically significant differences in expression. Therefore, as long as the Secondary alignment with low MAPQ scores that are abundant within the data set are filtered out, a higher read depth shouuld be used to improve gene expression studies.
Each of the other samples within each read depth group, regardless of their mouse type, showed a similar trend and relationship. Hence, only one sample is depticted in the graphs above as an example of the relationship read depth has on MAPQ scores and multiplicity of alignments.
Nkt cells are a class of lymphocytes that have a vital role both within the adaptive immune system of a organism, as well as relating to the regulation of autoimmune responses. Previous studies on the NOD.Nkrp1 (NN) mouse strain have reported a decrease in the number of Nkt cells, as well as a deficiency of the action of Nkt cells overall. In this study, the aim was to produce a NOD.Nkrp1.Nkt1 (NNN) mouse strain that contains a congenic segment encoding for the increased production of Nkt cells, in order to overcome this deficiency. The data depicted below visualises a comparison between the NN and NNN strains, determining which genes have a significant difference in their level of expression. The location of the gene which has the greatest difference in its level of expression between the two strains will most likely be the congenic segment as it does not exist within the NN strain. This highly expressed gene can also be matched to the sequence of the reads the alignment was based off, which can be used as a check later in the study to determine whether the mouse strain with the congenic segment was successfully produced.
Initially, the genes that are differentially expressed between the two mouse strains were extracted by performing a hypothesis test, with a null hypothesis that gene expression was the same between the strains. This then allowed for the data to be filtered for only significantly differentially expressed genes by restricting the padj value to be below 0.05; that is including only the genes that show statistically strong evidence that the null hypothesis is incorrect. The graph below shows the positions of the 33 genes that were highlighted by the hypothesis test to be significantly differentially expressed, graphed against the log2FoldChange values of these genes.
Figure 4: The relationship between the location of the genes, and the difference in its expression between the two mouse strains. The log2FoldChange parameter quantifies the size of the difference between the two mouse strains.
Within Figure 4 above, each point, corresponding to a significant gene, has be graphed in regards to its genomic position, as wellas its log2FoldChange value. This log2FoldChamge value quantifies the size of the difference in expression for each gene between the two mouse strains. Thus, the genes that obtained a higher absolute value for the log2FoldChange parameter are the genes that have the highest difference in their level of expression between NN and NNN. Greater negative log2FoldChange suggests that the gene is more highly expressed within the NN strain, where as a greater positive lg2FoldChange suggests it is more highly expressed within the NNN strain. Given the NN strain does not contain a cogenic segment that has genes to encode for Nkt production, and also that the number of Nkt cells have been reported to be deficient for this strain, It is most likely that the location of the cogenic segment on the NNN strain will correspond to the gene in the graph with the highest positive log2FoldChange value.
The samples higlighted to be possible locations for the congenic segment in Figure 4 were extracted from data to create the table displayed below.
| gene_id | log2FoldChange | pvalue | seqname | start | end |
|---|---|---|---|---|---|
| ENSMUSG00000026602 | -5.038466 | 0 | 1 | 156310727 | 156328035 |
| ENSMUSG00000060985 | 5.196289 | 0 | 1 | 156255296 | 156303664 |
| ENSMUSG00000073491 | 5.909619 | 0 | 1 | 173566284 | 173599274 |
| ENSMUSG00000026535 | -8.135576 | 0 | 1 | 173962568 | 173982744 |
| ENSMUSG00000070501 | 10.259029 | 0 | 1 | 173519717 | 173535957 |
From this additional data, we can select ENSMUSG00000070501 as the most likely gene for control of Nkt cell production given it has the highest, positive log2FoldChange value, meaning it was more greatly expressed in the NNN mouse types. Table 1 also gives the location information about each of the candidates. Thus, it can be deduced that the congenic segment within mouse NNN that is most likely to control Nkt cell production is between 173519717 and 173535957 positions on chromosome 1.
Holly Radford
ID: 13616633
BC5203