This pipeline analyzes data from a scRNAseq using a 10X Genome kit. For more information on this kit, check this link. All of the code here was done using Python 3.8 and terminal. If using Windows, you can download Windows Linux Subsystem using wsl --install on command prompt. For more information on Windows Linux Subsystem, click here.
The scRNAseq will give you two files for each read, meaning you will have a L1 R1 file and an L2 R1 file. Same goes for R2. But they both have the same information, is just that the sequencing occurs in two different lanes. Before anything else, you will need to merge both L1 and L2 files for each read in order to work with the complete sequencing. For that, youāll have to enter on terminal:
cat file1 file2 > final_file
This takes approximately 10 minutes and creates a 4 to 8 GB file.
If you want to know how many lines (or reads) is in that file, just run
zcat final_file | wc -l
This will give you the number of lines in the file. Since each read is comprised of four lines, just divide this number by four to find the number of reads you will be dealing with.
Now that you have all your data in one place, the next step is getting the raw data ready to be analysed.
The data
The raw data consist of three files corresponding to the three sequences generated from each molecule (see Figure on page 15 of the manual):
Each file has the same structure (.fastq) and contains four lines for each sequence:
The sequences should be in the same order in each of the three files and the name of each sequence identical in each file (with the exception of the characters after the space in the read name).
R1 example:
@A00904:175:H2NK5DMXY:1:1101:3893:1031 1:N:0:CCACTACA
ANCCTTTAGGGCCCTTATTTCGCGAAAC
+
F#FFFFF:FFFFFFFF:FFFFFFFFFFF
@A00904:175:H2NK5DMXY:1:1101:11885:1031 1:N:0:CCACTACA
CNTGTTGAGCGGGTATGCACAAGCACAC
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFF
@A00904:175:H2NK5DMXY:1:1101:12481:1031 1:N:0:CCACTACA
R2 example:
@A00904:175:H2NK5DMXY:1:1101:3893:1031 2:N:0:CCACTACA
CTTCCAGCAGCTGAAGGCGACCCACTTTGAAGAGTTCTTGTCGAAGCTGAGCCCGACTCTGCGCGAGCAGGTGCAAGCTGCCCTGAACCAG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00904:175:H2NK5DMXY:1:1101:11885:1031 2:N:0:CCACTACA
AATTGTTGCAGTTAAAACGCTCGTAGTTGAATTTCAAAGAATCGATATTTTATTGTAACTATTCTAGGGGAACTATTTTAGCTTTTCGCTT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFFFFFFFFFFFFFFFFFFFFF
You can look at your file using
zless file_name
NAME OF THE SEQUENCE_10XBarcode_UMIsequence
mRNA sequence (possibly trimmed)
+
Quality (possibly trimmed)
This is the code I used for this part of the analysis:
### This code runs through R1 file, gets the read name, barcode and umi and creates an id, then goes to R2 file and gets the mRNA sequence and its quality corresponding to that read and puts all that information in a output file. To do that, it goes line by line using line_counter.
r1 = "merged_R1.fastq.gz"
r2 = "merged_R2.fastq.gz"
import re
with open(r1, "rt") as fp1, open(r2, "rt") as fp2: #opens files, has to be rt
with open("output.fastq", "w+") as output_file: #creates output file
print_counter = 0 #running visual
trimmed_counter = 0 #trimmed seqs
total_counter = 0 #total seqs in file
line_counter = 0 #4 lines file
i = 0 #index for trimming quality
for r1_line, r2_line in zip(fp1, fp2):
if re.search('^@', r1_line) and re.search('^@', r2_line): #searches and gets the first line with id
line_counter += 1
r1_id = list(r1_line.split(" "))[0]
r2_id = list(r2_line.split(" "))[0]
else:
line_counter += 1
if line_counter == 2: #if it's the second line, creates header and gets mrna
barcode = r1_line[:15]
umi = r1_line[16:27]
new_header = r1_id+'_'+barcode+'_'+umi
mrna = r2_line
trim_counter = 0 #trims seqs with more than 6 consecutive A (polyA)
for index, letter in enumerate(mrna):
if letter == "A":
trim_counter += 1
if trim_counter == 6:
trim_counter = 0
mrna = mrna[:index-6] + "\n"
i = index
trimmed_counter += 1
break #stops looking for A
else:
trim_counter = 0
print(new_header, file=output_file) #prints info to output file
print(mrna, end='', file=output_file)
print("+", file=output_file)
elif line_counter == 4: #when it gets to the fourth line, it trimms quality if necessary
if i == 0:
quality = r2_line
else:
quality = r2_line[:i-6] + "\n"
i = 0
print(quality, end='', file=output_file)
line_counter = 0
total_counter += 1
print_counter += 1
if print_counter == 10**7: #prints a # every 10 million iterations to know code is running
print_counter = 0
print("#")
with open("controls.txt", "w+") as control_file: #creates controls file
control = f"Total number of sequences: {total_counter} \nTrimmed sequences: {trimmed_counter}"
print(control, file=control_file)
print("Done")
For a 180 million reads file, this step takes approximately 3 hours and creates a roughly 40GB file.
You then need to map all sequences to the P. falciparum genome using Hisat2.
The data:
The reference genome is PlasmoDB-43_Pfalciparum3D7_Genome.fasta and can be downloaded through PlasmodDB. You should also download the .gff file, so later you can use IGV to see the mapped genes.
The analysis:
To install hisat2 youāll need to run these lines on terminal
sudo aot -get update -y
sudo apt -get install -y hisat2
Now that you have hisat2, you will use it to build the indexes. For that you have to download the reference genome
wget https://plasmodb.org/common/downloads/Current_Release/Pfalciparum3D7/fasta/data/
mv PlasmoDB-54_Pfalciparum3D7_Genome.fasta genome.fa #optional; changes the file name to something smaller
Then you will build the HFM index
hisat2 -build genome.fa genome_build
This takes less than 5 minutes.
This file will be used to map your sequences and all future samples you might have that uses the same genome.
Once you have done all that building, you wonāt have to do it again. Just start from this step
hisat2 -x genome_build output.fastq -S mapped_seq.sam --no-unal --max-intronlen 5000 &>alignment_log.txt
Where --no-unal means that it will output only aligned sequences, --max-intron-len is set to 5000 on Plasmodium so you donāt have reads that map on two exons when intron length is big and > outputs the final result to a .txt file. This will give you a my_sample.sam where you will have your paired sequences and an alignment_log.txt file where you will see someting like this
179398290 reads; of these:
179398290 (100.00%) were unpaired; of these:
51998275 (28.98%) aligned 0 times
70122656 (39.09%) aligned exactly 1 time
57277359 (31.93%) aligned >1 times
71.02% overall alignment rate
In this file there may be some Warning: skipping read messages. Thatās ok, it only means that that sequence was trimmed too much to be able to align somewhere. For 180 million reads this step takes roughly 3 hours and the final file had 40GB.
This resulting SAM file is huge, so we will convert it to a BAM file, which is a compressed version of SAM.
samtools view -b my_sample.sam > my_sample.bam
This will take another hour for 180 million reads and gets the file down to 6GB.
You can convert it back if you want, so after you have your BAM file, delete the SAM file to open up some space. To make the BAM -> SAM conversion you have to type
samtools view -h -o out.sam in.bam
This going back step takes roughly 1 hour.
Next, youāll sort and index your BAM file.
First
samtools sort my_sample.bam > my_sample.sorted.bam
This takes roughly 2 hours for 180 million reads.
Then
samtools index my_sample.sorted.bam
This takes less than an hour.
If you want to know how many mapped reads are in your new bam fie, you can use
samtools view -c -F 260 paired_seq.bam
where -F260will give you the number of only mapped primary alignments. In this case, there were 127400015 mapped reads primarily aligned. Since we arlready knew there were 179398290 of total reads, we know that 51998275 reads werenāt mapped (secondary alignment). That checks with the alignment log we kept from one of the previous steps.
Now you will parse through your BAM file to see the mapped reads to each droplet and determine which droplets yielded suitable transcriptome data for analysis.
A00904:175:H2NK5DMXY:1:1101:11885:1031_CNTGTTGAGCGGGTA_GCACAAGCACA 0 Pf3D7_05_v3 1290213 60 91M * 0 0 AATTGTTGCAGTTAAAACGCTCGTAGTTGAATTTCAAAGAATCGATATTTTATTGTAACTATTCTAGGGGAACTATTTTAGCTTTTCGCTT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:91 YT:Z:UU NH:i:1
A00904:175:H2NK5DMXY:1:1101:19822:1031_GNTCAGTTCACACCC_GAGGTACTTAG 0 Pf3D7_05_v3 1295861 1 91M * 0 0 GGTTGACAAACCAGTGTTGCGAAGCTAAGTCTGTTGGATAATGGCTGAACGCCTCTTAAGCCAGAAACCATGCTGATTAAACAATACTATT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:0 ZS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:91 YT:Z:UU NH:i:2
As you can see, the first part of this line is the read name, the second part is the strand direction (0 = forward, 16 = reverse and anything else is a secondary alingment), the third part is the chromossome number and the fourth and last part we will use is the read position in that chromossome.
samtools view bam_file_name | python3 script_name
If you want to test you code beforehand, use a smaller version of your big BAM file running
samtools view -h big.bam | head -n 2000000 | samtools view -bS - > little.bam
This is the code I used for this part of the analysis
### This code parses over every line of the BAM file taking the read name, chromossome, position and strand and creating two dictionaries. One for each cell in the sample (each different barcode) and the other for each unique read. This will allow you to know how many reads there were in each cell and how many times each unique read showed up (PCR duplicates).
from collections import defaultdict
import sys #if running on BAM
read_dic = defaultdict(list)
dup_dic = defaultdict(int)
for line in sys.stdin: #if running on BAM
#with open("paired_seq.sam", "rt") as f1: #if running on SAM
#for line in f1: #if running on SAM
info = line.split(" ") #gets the information from each line
read = info[0]
barcode = read.split("_")[1]
umi = read.split("_")[2]
chromossome = info[2]
position = info[3]
strand = info[1]
if strand == "0": #if it's a secondary alignment it doesn't count
strand = "Forward"
unique = [barcode, umi, chromossome, position, strand] #creates a unique read
elif strand == "16":
strand = "Reverse"
unique = [barcode, umi, chromossome, position, strand]
else:
continue
if unique not in read_dic[barcode]: #if that cell doesn't have this read, append it
read_dic[barcode].append(unique)
unique = f"{barcode}, {umi}, {chromossome}, {position}"
dup_dic[unique] += 1 #counts duplicates
with open("unique_reads_per_cell.txt", "w+") as unique_cell: #prints output to file
print('Barcode\tReads', file = unique_cell)
for k,v in read_dic.items():
print(f"{k}\t{len(v)}", file = unique_cell)
with open("duplicate_reads.txt", "w+") as duplicate:
print('Mapped_sequences\t Duplicates', file = duplicate)
for k,v in dup_dic.items():
print(f"{k}\t{v}", file = duplicate)
unique_cell.close()
duplicate.close()
This code took approximately 26 hours to run on the BAM file and 36 hours on the SAM, for 127 million mapped reads. Both gave identical results. The text file created with the reads duplicates, should contain a table that looks like this
Mapped_sequences Duplicates
CNTGTTGAGCGGGTA, PGCACAAGCACA, Pf3D7_05_v3, 1290213, Forward 1
GNTCAGTTCACACCC, PGAGGTACTTAG, Pf3D7_05_v3, 1295861, Forward 1
GNAGGACTCTTCGAC, PATTGAGTGAAT, Pf3D7_06_v3, 444756, Forward 1
ANCCAGCAGGCTTAG, PCGTTTCGGCTG, Pf3D7_13_v3, 2003431, Reverse 1
CNATGACGTTAAGTC, PATGTCCTACTG, Pf3D7_02_v3, 839099, Forward 1
CNCATGATCCTTATA, PTCACATCAGAA, Pf3D7_13_v3, 537525, Reverse 1
TCACTCGGTCGGCTA, PTTACTAACTCC, Pf3D7_07_v3, 1084314, Forward 7
CCTCACAAGCTCGGC, PGTTTCTTACTG, Pf3D7_04_v3, 429020, Reverse 11
GTTGTGATCTGAGAT, PAGCGGCATTTC, Pf3D7_07_v3, 1087412, Forward 5
The text file created with the number of unique reads per cell should have a table that looks like this
Barcode Reads
CNTGTTGAGCGGGTA 13
GNTCAGTTCACACCC 24
GNAGGACTCTTCGAC 41
ANCCAGCAGGCTTAG 39
CNATGACGTTAAGTC 33
CNCATGATCCTTATA 24
TCACTCGGTCGGCTA 12218
CCTCACAAGCTCGGC 9075
GTTGTGATCTGAGAT 7017
The average of duplicate reads in this analysis is 5.54, meaning that there are a lot of duplicate reads in the file (for files with little duplicate results, this number should be close to 1). Another thing we get from this analysis is a histogram of the frequency of reads in a single cell. This histogram should have a very high peak at the beggining, which accounts for all the loose mRNA fragments in the sample and then it should show a slightly left skewed curve. In this sample, the histogram shows good results.
Also from this analysis we can see that there were almost 870 thousand cells in the sample (867315) and almost 23 millions unique reads (22993451).
This is the final step before the real analysis begin. You will create a table where the columns will be each cell barcode and the rows will be each gene and you will populate this table with how many times that gene was mapped to that cell. As you saw in the histogram from the previous step, there are a lot of so called cells that mapped a very small number of reads. These cells are most likely loose mRNA or some sort of background contamination in the sample. Because of that, you will have to filter out these cells when doing your analysis.
Pf3D7_13_v3 VEuPathDB protein_coding_gene 624510 626292 . + . ID=PF3D7_1314600;Name=LipL1;description=lipoate-protein ligase 1
Pf3D7_13_v3 VEuPathDB mRNA 624510 626292 . + . ID=PF3D7_1314600.1;Parent=PF3D7_1314600;Note=2.3.1.181 (Lipoyl(octanoyl) transferase)%3B2.7.7.63 (Transferred entry: 6.3.1.20)%3B6.3.1.20 (Lipoate--protein ligase);description=lipoate-protein ligase 1
Pf3D7_13_v3 VEuPathDB exon 624510 626292 . + . ID=exon_PF3D7_1314600.1-E1;Parent=PF3D7_1314600.1;gene_id=PF3D7_1314600
Pf3D7_13_v3 VEuPathDB CDS 624785 626011 . + 0 ID=PF3D7_1314600.1-p1-CDS1;Parent=PF3D7_1314600.1;gene_id=PF3D7_1314600;protein_source_id=PF3D7_1314600.1-p1
Pf3D7_13_v3 VEuPathDB five_prime_UTR 624510 624784 . + . ID=utr_PF3D7_1314600.1_1;Parent=PF3D7_1314600.1
Pf3D7_13_v3 VEuPathDB three_prime_UTR 626012 626292 . + . ID=utr_PF3D7_1314600.1_2;Parent=PF3D7_1314600.1
Pf3D7_06_v3 VEuPathDB protein_coding_gene 876679 877298 . - . ID=PF3D7_0621350;Name=SYS1;description=protein SYS1%2C putative
Pf3D7_06_v3 VEuPathDB mRNA 876679 877298 . - . ID=PF3D7_0621350.1;Parent=PF3D7_0621350;description=protein SYS1%2C putative
Pf3D7_06_v3 VEuPathDB exon 876679 876803 . - . ID=exon_PF3D7_0621350.1-E3;Parent=PF3D7_0621350.1;gene_id=PF3D7_0621350
Pf3D7_06_v3 VEuPathDB exon 876876 876991 . - . ID=exon_PF3D7_0621350.1-E2;Parent=PF3D7_0621350.1;gene_id=PF3D7_0621350
Pf3D7_06_v3 VEuPathDB exon 877123 877298 . - . ID=exon_PF3D7_0621350.1-E1;Parent=PF3D7_0621350.1;gene_id=PF3D7_0621350
Pf3D7_06_v3 VEuPathDB CDS 876679 876803 . - 2 ID=PF3D7_0621350.1-p1-CDS3;Parent=PF3D7_0621350.1;gene_id=PF3D7_0621350;protein_source_id=PF3D7_0621350.1-p1
Pf3D7_06_v3 VEuPathDB CDS 876876 876991 . - 1 ID=PF3D7_0621350.1-p1-CDS2;Parent=PF3D7_0621350.1;gene_id=PF3D7_0621350;protein_source_id=PF3D7_0621350.1-p1
Pf3D7_06_v3 VEuPathDB CDS 877123 877298 . - 0 ID=PF3D7_0621350.1-p1-CDS1;Parent=PF3D7_0621350.1;gene_id=PF3D7_0621350;protein_source_id=PF3D7_0621350.1-p1
c1 = 0
cell_reads = []
with open("unique_reads_per_cell.txt", "r") as bc_file:
for line in bc_file: # creates a list with all cells with more than 5000 reads
if c1 < 1:
c1 += 1
else:
barcode = line.split()[0]
reads = int(line.split()[-1])
if reads >= 5000:
cell_reads.append(barcode)
The next step is to create an annotation file using the .gff file. This new file will get all the information we need from the .gff file, meaning the gene name, PlasmoDB ID, chromossome, position and strand. Since the base pairing before the sequencing occurs on the 3ā end, you should add 500bp to the geneās 3ā position, just to make sure you count any pairing that happend in that gene. In the end, the new file should look like this.
Pf3D7_13_v3 PF3D7_1314600_lipoate-protein_ligase_1 624510 626792 foward_strand
Pf3D7_06_v3 PF3D7_0621350_protein_SYS1%2C_putative 877179 877298 reverse_strand
Pf3D7_02_v3 PF3D7_0209400_conserved_protein%2C_unknown_function 388716 391107 reverse_strand
For this part, I used this code
annotation_file = open("annotation_file.txt", "w+") #creates annotation file with gene names and positions
c2 = 0
with open("PlasmoDB-54_Pfalciparum3D7.gff", "r") as gff_file:
for line in gff_file:
if c2 >= 2: #ignores first two lines because they can't be split
annotation = line.split()
if annotation[2] == "protein_coding_gene": #finds line with gene info
chromossome = annotation[0] #gets chr name
gene_name = line.split("=")[-1] #gets gene name
gene_name = gene_name.replace(" ", "_")
gene_id = re.split("=|;", line)[1] #gets gene id from plsmodb
gene = f"{gene_id}_{gene_name.split()[0]}" #merges gene id and name for final table
if annotation[6] == "+": #if forward print normal
position = "foward_strand"
position_5 = int(annotation[3])
position_3 = int(annotation[4]) + 500
print(f"{chromossome}\t{gene}\t{position_5}\t{position_3}\t{position}", file=annotation_file)
elif annotation[6] == "-": #if reverse print inverted so the smaller number
position = "reverse_strand" #is always first
position_5 = int(annotation[4])
position_3 = int(annotation[3]) + 500
print(f"{chromossome}\t{gene}\t{position_3}\t{position_5}\t{position}", file=annotation_file)
else:
continue
else:
c2 += 1
Now to actually make the final table, you will need a dictionary with the genes and the cells. For that, you will need to know which gene is which. What I did was create a dictionary where the key were each chromossome and the value was a list with each geneās name and position. This is the code I used
chr_dic = dict()
with open("annotation_file.txt", "r") as ann: #creates dict with all chromossomes as key
for line in ann:
chr_name = line.split()[0]
if chr_name not in chr_dic:
chr_dic[chr_name] = []
if len(chr_dic) < 16:
continue
else:
break
with open("annotation_file.txt", "r") as annotation_file: #populates chr_dic with tuples of gene name
for line in annotation_file: #and position
ann_chromossome = line.split()[0]
ann_gene = line.split()[1]
ann_position = (int(line.split()[2]), int(line.split()[3]))
new_gene = (ann_gene,ann_position)
chr_dic[ann_chromossome].append(new_gene)
Then you will need to know which cell has which gene mapped and how many times. For that I created another dictionary where the key was the geneās names and the value was a list with all the cell it was mapped to (allowing cell duplicates). This is the code I used
hit_dic = defaultdict(list)
with open("unique_reads.txt", "r") as mapped: #creates a dict with every gene and all the cells that
next(mapped) #mapped that gene
for line in mapped: #uses the file with unique reads
read_bc = line.split()[0][:-1]
read_chromossome = line.split()[2][:-1]
read_position = int(line.split()[3][:-1])
if read_bc in cell_reads: #if that cell has more than 5000 reads
for k, v in chr_dic.items():
if k == read_chromossome: #looks for that gene in the right chromossome
for (name, position) in v:
pos1 = int(position[0])
pos2 = int(position[1])
if pos1 < read_position < pos2: #if the position maps to a gene
hit_dic[name].append(read_bc) #links the cell barcode to that gene
Now you have all the information you need in one place, all thatās left to do is print it to a .txt file. For that I used this code
final_table = open("final_table.txt", "w+") #creates the final table with how many reads for a gene
for bc in cell_reads: #in each cell
print(f"{bc}\t", file=final_table, end="") #creates columns
for gene_tag, cells in hit_dic.items():
print("\n"+gene_tag, file=final_table, end="\t") #creates rows
c3 = 0
for cell in cell_reads: #populates table
cell_count = cells.count(cell)
print(cell_count, file=final_table, end="\t")
final_table.close()
The total runtime for this step for a inital sequence of 180 million reads which came down to 1972 cells and 5318 genes, was 1 hour.
TCACTCGGTCGGCTA CCTCACAAGCTCGGC GTTGTGATCTGAGAT TGATGGTGTTGGATC TCCCACAAGCGTACA
PF3D7_0409000_conserved_Plasmodium_protein%2C_unknown_function 0 1 0 0 0
PF3D7_0309200_serine/threonine_protein_kinase_ARK2%2C_putative 0 1 0 0 1
PF3D7_0207500_serine_repeat_antigen_6 9 4 4 0 18
PF3D7_1334800_MSP7-like_protein 8 1 7 0 2
The first analysis you will make with the count table will be creating a PCA, which is a graph of Principal Component Analysis which examines the covariances / correlations between individuals. In this graph you will see how different each individual cell is from the others, allowing you to identify groups of samples that are similiar. This is done by some complex mathematics but what basically happens is that there is a type of linear transformation of the dataset that fits it to a new coordinate system in such a way that the most significant variance is found on the first coordinate, and each subsequent coordinate is orthogonal to the last and has a lesser variance. When many variables correlate to each other, they will contribute to the same principal component and that principal component will be able to explain a certain percentage of the total variation in the dataset. So you will calculate the most meaninfull and representative genes and then plot each cell according to their expression of that genes.
The data
For this analysis you will use the count table you created in the previous step and all the analysis will be done in R.
The analysis
The first step is importing the .txt file with the table you created in the previous step. After that, you will normalize your data counts. Since the counts you have for each gene are a matter of technical and not biological information, you can normalize the whole table by the same amount of reads keeping their proportion. For that just run
final_table_normalized[] <- lapply(final_table_normalized[], function(x) ((x/(sum(x))*1000)))
After that you will have to transpose your dataframe so the columns are genes and the rows are cells. Since in the code I wrote for creating this table I create an aditional column by using \t, I had to run an aditional line of code before transposing
final_table_normalized$X <- NULL #gets rid of the last column
transposed_normalized <- t(final_table_normalized)
There will be some genes in the data that will have very little reads, as in only 1 or 2. That could throw you data off, since the PCA analysis will give it the same weigth as genes that are highly expressed in some cells and not at all in others, which could in fact indicate something. So you will want to get rid of the genes that have less than 2.5 counts throughout all the cells. For that just run
i <- (colSums(transposed_normalized, na.rm = T) > 2.5) #lists the cols where the count sum is bigger than 2.5
trans_norm_high_exp <- transposed_normalized[,i] #trimms the dataframe to only the highly expressed genes
Now, for the actual PCA, just run
pca_data <- prcomp(trans_norm_high_exp, scale. = TRUE)
For plotting the PCA graph you can either use the ggfortify package, in which case you just have to run
library(ggfortify)
autoplot(pca_data)
Or you can use ggplot2. For that you have to run
library(ggplot2)
pca.df <- as.data.frame(pca_data$x)
percentage <- round(pca_data$sdev^2 / sum(pca_data$sdev^2) * 100, 2)
percentage <- paste(colnames(pca.df), "(", paste(as.character(percentage), "%", ")", sep=""))
plot <- ggplot(pca.df, aes(x=PC1, y=PC2)) + geom_point() + xlab(percentage[1]) + ylab(percentage[2])
In this type of dataset, namely Plasmodiumās blood stages scRNAseq, it is normal to have a principal component in the single digits. Considering there are roughly 5000 genes in the species, that is suficient to be meaningfull. You can actually see this using
library(factoextra)
fviz_eig(pca_data)
Once you have the PCA graph, you will want to annotate it somehow. In this instance, I used two genes known for being expressed in different stages of the parasiteās blood cycle - PF3D7_1462800 and PF3D7_1410400. The tricky part here is defining a cutoff value for the geneās expression. This will have to be done by trial and error, but always keeping in mind that thereās a fine line between data manipulation and the truth. Not that the absolute truth can be given by this type of analysis, since there is an inherited flaw in the data that comes from the technology used, meaning that there migh be cells that express a given gene, but that gene wasnāt sequenced in that cell, so it wonāt show in the analysis. All of the results and conclusions drawn from analysing a scRNAseq dataset have to be carefully taken with a grain of salt.
Just remember to always use the same cutoff when comparing the expression of different genes and to always stay truthful about what you did in your analysis.
The data
You will use the PCA graph created previously.
The analysis
For annotating the graph in this manner, all you have to do is create a column for the stages and populate it according to some geneās expression.
# creates a new dataframe with non-numerical info which can't be used for running pca
h3_trans_norm_high_exp_col <- h3_trans_norm_high_exp
# adds stage column
h3_trans_norm_high_exp_col["stages"]<-"n"
# annotates each cell based on a given gene's expression
c<-grep("PF3D7_1462800",colnames(h3_trans_norm_high_exp_col))
m<-max(h3_trans_norm_high_exp_col[,c]) # can be used to calculate some kind of cutoff
h3_trans_norm_high_exp_col[which(h3_trans_norm_high_exp_col[,c]>0.1),5006]<-"gapdh"
# creates plot automatically using stage information for color
autoplot(h3_pca, data=h3_trans_norm_high_exp_col, colour="sample", alpha=0.4)
The output
Doing this you will have a graph that looks like this
For this analysis, four samples were used in order to assess Plasmodium falciparum blood stageās viability after different sinchronization / enrichement processes. These processes were MACS, Percoll and saponin. To compare the data from each process, I made a PCA using all of their data. That means a count table where all of the genes are present, as well as the name of each process and the stage of each cell. From that you can create a PCA graph that shows the difference in each processā recuperation stage. For that you will need to merge the count tables and run prcomp on that merged table.
The data
For this analysis you will use the processed dataframes from the previous step. Both with and without the annotation column.
The analysis
The first thing to do is to create another annotation column that will distinguish one sample from the other. For each sample run
h3_trans_norm_high_exp_col["sample"]<-"saponin1"
Then, to merge all samples into one big count table run
library(plyr)
# gets and merges barcodes
h3_barcodes <- rownames(h3_trans_norm_high_exp_col)
h4_barcodes <- rownames(h4_trans_norm_high_exp_col)
h5_barcodes <- rownames(h5_trans_norm_high_exp_col)
all_barcodes <- c(h3_barcodes, h4_barcodes, h5_barcodes)
# gets and merges stages
h3_stages <- h3_trans_norm_high_exp_col$stages
h4_stages <- h4_trans_norm_high_exp_col$stages
h5_stages <- h5_trans_norm_high_exp_col$stages
all_stages <- c(h3_stages, h4_stages, h5_stages)
# gets and merges samples
h3_sample <- h3_trans_norm_high_exp_col$sample
h4_sample <- h4_trans_norm_high_exp_col$sample
h5_sample <- h5_trans_norm_high_exp_col$sample
all_sample <- c(h3_sample, h4_sample, h5_sample)
# merges count tables filling any columns (genes) missing from one or another with NA so all genes are accounted for
merged_count_table <- rbind.fill(h3_trans_norm_high_exp, h4_trans_norm_high_exp, h5_trans_norm_high_exp)
# replaces NA with 0
merged_count_table[is.na(merged_count_table)] <- 0
# creates a merged count table with the highly expressed genes
merged_count_table_high_exp <- merged_count_table
i<-apply(merged_count_table,2,max)>2.5
merged_count_table_high_exp<-merged_count_table[,i]
# creates full annotated table with stages, sample and barodes
merged_full_table <- merged_count_table_high_exp
merged_full_table$stages <- all_stages
merged_full_table$sample <- all_sample
merged_full_table$barccodes <- all_barcodes
Now you have a PCA from all you samples, all thatās left to do is run
# calculates pca from merged samples and plots it
merged_pca <- prcomp(merged_count_table, scale =T)
autoplot(merged_pca, data=merged_full_table, colour="sample", alpha=0.4)
The output
In the end, the PCA graph with all the samples will look like this
Once you have you PCA you can annotate it using genesā expression. For that you will either use your genes of interest or genes established by the literature. If you want to know more about Plasmodium falciparumās, P. knowlesiās or P. bergheiās genes, you can go to the Malaria Cell Atlas. In there you can either test your gene or find one that suits your purposes. For this analysis I chose the REX1 gene to represent rings, Ribosomal protein L13 for trophozoites, Rhoptry protein for schizonts and MSP3 for late schizonts.
The data
You will use the merged count table you created in the last step.
The analysis
For plotting this kind of graph, we will use the manual version of plotting a PCA graph. That means that the first step is to create a pca_df that will be used as base for plotting. In this dataframe you will also need a stages column and an expression column.
pca_df <- as.data.frame(merged_pca$x)
pca_df$stages <- "n"
pca_df$expression <- 0
This dataframe cointains the PCA coordinates and is where we will input the stage and the level of expression of the cells of interest. For that, the merged_full_table will come in handy.
Once you have chosen the genes you will use, you have to find it in the count table, understand how much the cells express it, determine an expression cutoff for your analysis and identify each cell that meets your cutoff and their respective expression.
c<-grep("PF3D7_0935900",colnames(merged_full_table)) # finds gene's column
max <- max(merged_full_table[,c]) # gets the maximum expression for that gene
pca_df[which(merged_full_table[,c]>max*.05),5157]<-"rex1" # finds the top 95% cells expressing that gene and identify them
pca_df[which(merged_full_table[,c]>max*.05),5158] <- merged_full_table[which(merged_full_table[,c]>max*.05),c] # adds those cells' expresions levels to the table
With that you can annotate as many genes as you want. Just remember that if you are using genes that should be expresses roughly at the same time, they can override each other depending on how you run. Also, the cutoff for each gene is a trial and error kind of approach. Just remember to keep it honest.
And then, once you identified all the genes you wanted and got their expression level, all thatās left to do is plot pca_df. For this plot it is interesting to have each dotās size different according to that cellās level of expression of each gene. Thatās why there should be an expression column in pca_df. You will use stages to color each dot and expression to determine the size of the dot.
Now you can plot pca_df subset to each sample using
ggplot(pca_df[which(merged_full_table$sample=="percoll"),], aes(x=PC1, y=PC2, color=stages, size=expression)) + geom_point(alpha=0.4) + labs(title="percoll", x=percentage[1], y=percentage[2])
Or subset it to more than one sample creating a vector with the row numbers of the samples of interest and using it to subset the rows plotted.
bc <- which(merged_full_table$sample=="percoll" | merged_full_table$sample=="macs")
ggplot(pca_df[bc,], aes(x=PC1, y=PC2, color=stages, size=expression)) + geom_point(alpha=0.2) + labs(title="macs+percoll", x=percentage[1], y=percentage[2])
If you only want to plot a specific gene across all samples, you can use
ggplot(pca_df[which(pca_df$stage=="rhoptry"),], aes(x=PC1, y=PC2, color=stages, size=expression)) + geom_point(alpha=0.4) + labs(title="rhoptry", x=percentage[1], y=percentage[2])
To plot only one gene in one sample, use
ggplot(pca_df[which(pca_df$stages=="rhoptry" & merged_full_table$sample=="macs"),], aes(x=PC1, y=PC2, color=stages, size=expression)) + geom_point(alpha=0.4) + labs(title="rhoptry", x=percentage[1], y=percentage[2])
Finally, to plot only one gene in a few samples, use
ggplot(pca_df[which(pca_df$stage=="rhoptry" & (merged_full_table$sample=="macs" | merged_full_table$sample=="percoll")),], aes(x=PC1, y=PC2, color=stages, size=expression)) + geom_point(alpha=0.4) + labs(title="macs + percoll rhoptry", x=percentage[1], y=percentage[2])