Pipeline for scRNAseq analysis

This pipeline analyzes data from a scRNAseq using a 10X Genome kit. For more information on this kit, check this link. All of the code here was done using Python 3.8 and terminal. If using Windows, you can download Windows Linux Subsystem using wsl --install on command prompt. For more information on Windows Linux Subsystem, click here.

0) Merge L1 and L2 files

The scRNAseq will give you two files for each read, meaning you will have a L1 R1 file and an L2 R1 file. Same goes for R2. But they both have the same information, is just that the sequencing occurs in two different lanes. Before anything else, you will need to merge both L1 and L2 files for each read in order to work with the complete sequencing. For that, you’ll have to enter on terminal:

cat file1 file2 > final_file

This takes approximately 10 minutes and creates a 4 to 8 GB file.
If you want to know how many lines (or reads) is in that file, just run

zcat final_file | wc -l

This will give you the number of lines in the file. Since each read is comprised of four lines, just divide this number by four to find the number of reads you will be dealing with.

1) Select sequences and combine barcode and UMI information

Now that you have all your data in one place, the next step is getting the raw data ready to be analysed.

The data
The raw data consist of three files corresponding to the three sequences generated from each molecule (see Figure on page 15 of the manual):
- I1 contains the sample barcode (8-mer). Each sample is barcoded with four 8-mers;
- R1 contains the 10X barcode (16-mer unique to each droplet) followed by the UMI (10-mer unique to each molecule… in theory);
- R2 contains the actual mRNA sequence.
Each file has the same structure (.fastq) and contains four lines for each sequence:
- The sequence name (use only the string before the space!);
- The actual DNA sequence;
- A plus sign “+”;
- The quality of each nucleotide sequenced encoded in ASCII (quality score = ASCII - 33).
The sequences should be in the same order in each of the three files and the name of each sequence identical in each file (with the exception of the characters after the space in the read name).

R1 example:

@A00904:175:H2NK5DMXY:1:1101:3893:1031 1:N:0:CCACTACA
ANCCTTTAGGGCCCTTATTTCGCGAAAC
+
F#FFFFF:FFFFFFFF:FFFFFFFFFFF
@A00904:175:H2NK5DMXY:1:1101:11885:1031 1:N:0:CCACTACA
CNTGTTGAGCGGGTATGCACAAGCACAC
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFF
@A00904:175:H2NK5DMXY:1:1101:12481:1031 1:N:0:CCACTACA

R2 example:

@A00904:175:H2NK5DMXY:1:1101:3893:1031 2:N:0:CCACTACA
CTTCCAGCAGCTGAAGGCGACCCACTTTGAAGAGTTCTTGTCGAAGCTGAGCCCGACTCTGCGCGAGCAGGTGCAAGCTGCCCTGAACCAG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00904:175:H2NK5DMXY:1:1101:11885:1031 2:N:0:CCACTACA
AATTGTTGCAGTTAAAACGCTCGTAGTTGAATTTCAAAGAATCGATATTTTATTGTAACTATTCTAGGGGAACTATTTTAGCTTTTCGCTT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFFFFFFFFFFFFFFFFFFFFF

You can look at your file using

zless file_name

The analysis
In this step of the analysis you’ll need to write a code that will identify your sequences and make a new .fastq file with the sequences identification and their actual sequence. For that it will need to:
- Find the the 10X barcode and UMI sequence (the 16 and 10 nucleotides of the corresponding sequence in R1) and add it to the sequence name;
- Get the corresponding mRNA sequence (on R2) and trim it if it contains more than 6 As in a row (which could indicate that you are reading through the polyadenylation tail);
- Get the corresponding quality (trimmed to be the same size as the mRNA sequence)
  At the end, your new .fastq file should look like this:

NAME OF THE SEQUENCE_10XBarcode_UMIsequence
mRNA sequence (possibly trimmed)
+
Quality (possibly trimmed)

The output/controls
From this analysis you should end up with one new .fastq file containing the scRNA-seq sequences associated from your samples and all the 10X/UMI information.
You should record (as controls, in a different file)
- the total number of sequences in your file;
- the number of sequences corresponding to your sample (i.e., printed in your new fastq);
- the number of sequences you had to trim.

This is the code I used for this part of the analysis:

### This code runs through R1 file, gets the read name, barcode and umi and creates an id, then goes to R2 file and gets the mRNA sequence and its quality corresponding to that read and puts all that information in a output file. To do that, it goes line by line using line_counter.

r1 = "merged_R1.fastq.gz"
r2 = "merged_R2.fastq.gz"

import re

with open(r1, "rt") as fp1, open(r2, "rt") as fp2:    #opens files, has to be rt
    with open("output.fastq", "w+") as output_file:   #creates output file
        
        print_counter = 0         #running visual
        trimmed_counter = 0       #trimmed seqs
        total_counter = 0         #total seqs in file
        line_counter = 0          #4 lines file
        i = 0                     #index for trimming quality
        
        for r1_line, r2_line in zip(fp1, fp2):
            if re.search('^@', r1_line) and re.search('^@', r2_line):   #searches and gets the first line with id
                line_counter += 1                                       
                r1_id = list(r1_line.split(" "))[0]
                r2_id = list(r2_line.split(" "))[0]
                
            else:                
                line_counter += 1
                if line_counter == 2:      #if it's the second line, creates header and gets mrna
                    barcode = r1_line[:15]
                    umi = r1_line[16:27]
                    new_header = r1_id+'_'+barcode+'_'+umi
                    mrna = r2_line
                    
                    trim_counter = 0        #trims seqs with more than 6 consecutive A (polyA)
                    for index, letter in enumerate(mrna):
                         if letter == "A":
                             trim_counter += 1
                             if trim_counter == 6:
                                 trim_counter = 0
                                 mrna = mrna[:index-6] + "\n"
                                 i = index
                                 trimmed_counter += 1
                                 break       #stops looking for A
                         else:
                             trim_counter = 0
                             
                    print(new_header, file=output_file)      #prints info to output file
                    print(mrna, end='', file=output_file)
                    print("+", file=output_file)
                    
                elif line_counter == 4:     #when it gets to the fourth line, it trimms quality if necessary
                    if i == 0:              
                        quality = r2_line
                    else:
                        quality = r2_line[:i-6] + "\n"
                        i = 0
                    print(quality, end='', file=output_file)
                    line_counter = 0
                    total_counter += 1
                    print_counter += 1
                     
                    if print_counter == 10**7:   #prints a # every 10 million iterations to know code is running
                        print_counter = 0
                        print("#")
        
                 
    with open("controls.txt", "w+") as control_file:    #creates controls file
        control = f"Total number of sequences: {total_counter} \nTrimmed sequences: {trimmed_counter}"
        print(control, file=control_file)      
    print("Done")

For a 180 million reads file, this step takes approximately 3 hours and creates a roughly 40GB file.

2) Map the sequences to the genome

You then need to map all sequences to the P. falciparum genome using Hisat2.

The data:
The reference genome is PlasmoDB-43_Pfalciparum3D7_Genome.fasta and can be downloaded through PlasmodDB. You should also download the .gff file, so later you can use IGV to see the mapped genes.
The analysis:
To install hisat2 you’ll need to run these lines on terminal

sudo aot -get update -y
sudo apt -get install -y hisat2

Now that you have hisat2, you will use it to build the indexes. For that you have to download the reference genome

wget https://plasmodb.org/common/downloads/Current_Release/Pfalciparum3D7/fasta/data/
mv PlasmoDB-54_Pfalciparum3D7_Genome.fasta genome.fa   #optional; changes the file name to something smaller

Then you will build the HFM index

    hisat2 -build genome.fa genome_build

This takes less than 5 minutes.

This file will be used to map your sequences and all future samples you might have that uses the same genome.
Once you have done all that building, you won’t have to do it again. Just start from this step

    hisat2 -x genome_build output.fastq -S mapped_seq.sam --no-unal --max-intronlen 5000 &>alignment_log.txt

Where --no-unal means that it will output only aligned sequences, --max-intron-len is set to 5000 on Plasmodium so you don’t have reads that map on two exons when intron length is big and > outputs the final result to a .txt file. This will give you a my_sample.sam where you will have your paired sequences and an alignment_log.txt file where you will see someting like this

    179398290 reads; of these:
      179398290 (100.00%) were unpaired; of these:
      51998275 (28.98%) aligned 0 times
      70122656 (39.09%) aligned exactly 1 time
      57277359 (31.93%) aligned >1 times
    71.02% overall alignment rate

In this file there may be some Warning: skipping read messages. That’s ok, it only means that that sequence was trimmed too much to be able to align somewhere. For 180 million reads this step takes roughly 3 hours and the final file had 40GB.

This resulting SAM file is huge, so we will convert it to a BAM file, which is a compressed version of SAM.

    samtools view -b my_sample.sam > my_sample.bam

This will take another hour for 180 million reads and gets the file down to 6GB.

You can convert it back if you want, so after you have your BAM file, delete the SAM file to open up some space. To make the BAM -> SAM conversion you have to type

  samtools view -h -o out.sam in.bam

This going back step takes roughly 1 hour.

Next, you’ll sort and index your BAM file.
First

     samtools sort my_sample.bam > my_sample.sorted.bam

This takes roughly 2 hours for 180 million reads.

Then

     samtools index my_sample.sorted.bam

This takes less than an hour.

If you want to know how many mapped reads are in your new bam fie, you can use

     samtools view -c -F 260 paired_seq.bam

where -F260will give you the number of only mapped primary alignments. In this case, there were 127400015 mapped reads primarily aligned. Since we arlready knew there were 179398290 of total reads, we know that 51998275 reads weren’t mapped (secondary alignment). That checks with the alignment log we kept from one of the previous steps.

The output/controls
You will end this part of the analysis with a .sorted.bam.bai file and a .txt file with the record of the number of reads that mapped to the P. falciparum genome.

3) Parsing the BAM file

Now you will parse through your BAM file to see the mapped reads to each droplet and determine which droplets yielded suitable transcriptome data for analysis.

The data
At this point you will only work with your sorted .bam file that contains all the mapped reads. Look at this article to understand how this file is formated.
An example of this file

A00904:175:H2NK5DMXY:1:1101:11885:1031_CNTGTTGAGCGGGTA_GCACAAGCACA      0       Pf3D7_05_v3     1290213 60      91M     *       0       0       AATTGTTGCAGTTAAAACGCTCGTAGTTGAATTTCAAAGAATCGATATTTTATTGTAACTATTCTAGGGGAACTATTTTAGCTTTTCGCTT        FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFFFFFFFFFFFFFFFFFFFFF     AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:91 YT:Z:UU NH:i:1
A00904:175:H2NK5DMXY:1:1101:19822:1031_GNTCAGTTCACACCC_GAGGTACTTAG      0       Pf3D7_05_v3     1295861 1       91M     *       0       0       GGTTGACAAACCAGTGTTGCGAAGCTAAGTCTGTTGGATAATGGCTGAACGCCTCTTAAGCCAGAAACCATGCTGATTAAACAATACTATT        FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFF     AS:i:0  ZS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:91 YT:Z:UU NH:i:2

As you can see, the first part of this line is the read name, the second part is the strand direction (0 = forward, 16 = reverse and anything else is a secondary alingment), the third part is the chromossome number and the fourth and last part we will use is the read position in that chromossome.

The analysis
You will have to write a code to parse the .bam file and determine the number of “unique reads” present in each droplet and select those you will use for biological analyses.
The first step is to determine the number of “unique reads” per droplet: if two reads in the same droplet (i.e., with the same 10Xbarcode) have the same UMI and map to the same genome position, they are likely to be derived from the same mRNA molecule and represents PCR duplication. We randomly keep one and discard the other(s). So, from the .bam file, you need to summarize the read information and generate a unique identifier consisting of the combination of
- The droplet ID (the 10XBarcode you stored in the read name)
- The UMI (the unique 10-mer you stored in the read name)
- The chromosome, nucleotide position and strand the read map to.
To run this you can use the BAM or the SAM file. If using the SAM file, just run the script like usual on the command line. If using the BAM file, use the unsorted and unindexed file to run

    samtools view bam_file_name | python3 script_name

If you want to test you code beforehand, use a smaller version of your big BAM file running

  samtools view -h big.bam | head -n 2000000 | samtools view -bS - > little.bam

The output/controls From this analysis you should get:
- A table of the number of unique sequences per droplet (so we can look at the distribution)
- A table of unique sequences and how many reads they had (so we can take their average)

This is the code I used for this part of the analysis

### This code parses over every line of the BAM file taking the read name, chromossome, position and strand and creating two dictionaries. One for each cell in the sample (each different barcode) and the other for each unique read. This will allow you to know how many reads there were in each cell and how many times each unique read showed up (PCR duplicates).

from collections import defaultdict
import sys       #if running on BAM

read_dic = defaultdict(list)
dup_dic = defaultdict(int)
for line in sys.stdin:    #if running on BAM
 

#with open("paired_seq.sam", "rt") as f1:   #if running on SAM
  #for line in f1:                          #if running on SAM
      info = line.split("   ")                #gets the information from each line
      read = info[0]
      barcode = read.split("_")[1]
      umi = read.split("_")[2]
      chromossome = info[2]
      position = info[3]
      strand = info[1]
        
      if strand == "0":                #if it's a secondary alignment it doesn't count
          strand = "Forward"
          unique = [barcode, umi, chromossome, position, strand]   #creates a unique read
      elif strand == "16":
          strand = "Reverse"
          unique = [barcode, umi, chromossome, position, strand]
      else:
           continue
        
      if unique not in read_dic[barcode]:    #if that cell doesn't have this read, append it
          read_dic[barcode].append(unique)
          
      unique = f"{barcode}, {umi}, {chromossome}, {position}"
      dup_dic[unique] += 1              #counts duplicates
            
with open("unique_reads_per_cell.txt", "w+") as unique_cell:   #prints output to file
    print('Barcode\tReads', file = unique_cell)
    for k,v in read_dic.items():
        print(f"{k}\t{len(v)}", file = unique_cell)
                  
with open("duplicate_reads.txt", "w+") as duplicate:
    print('Mapped_sequences\t Duplicates', file = duplicate)
    for k,v in dup_dic.items():
        print(f"{k}\t{v}", file = duplicate)
          
unique_cell.close()
duplicate.close()

This code took approximately 26 hours to run on the BAM file and 36 hours on the SAM, for 127 million mapped reads. Both gave identical results. The text file created with the reads duplicates, should contain a table that looks like this

    Mapped_sequences         Duplicates
    CNTGTTGAGCGGGTA, PGCACAAGCACA, Pf3D7_05_v3, 1290213, Forward    1
    GNTCAGTTCACACCC, PGAGGTACTTAG, Pf3D7_05_v3, 1295861, Forward    1
    GNAGGACTCTTCGAC, PATTGAGTGAAT, Pf3D7_06_v3, 444756, Forward     1
    ANCCAGCAGGCTTAG, PCGTTTCGGCTG, Pf3D7_13_v3, 2003431, Reverse    1
    CNATGACGTTAAGTC, PATGTCCTACTG, Pf3D7_02_v3, 839099, Forward     1
    CNCATGATCCTTATA, PTCACATCAGAA, Pf3D7_13_v3, 537525, Reverse     1
    TCACTCGGTCGGCTA, PTTACTAACTCC, Pf3D7_07_v3, 1084314, Forward    7
    CCTCACAAGCTCGGC, PGTTTCTTACTG, Pf3D7_04_v3, 429020, Reverse     11
    GTTGTGATCTGAGAT, PAGCGGCATTTC, Pf3D7_07_v3, 1087412, Forward    5

The text file created with the number of unique reads per cell should have a table that looks like this

    Barcode Reads
    CNTGTTGAGCGGGTA 13
    GNTCAGTTCACACCC 24
    GNAGGACTCTTCGAC 41
    ANCCAGCAGGCTTAG 39
    CNATGACGTTAAGTC 33
    CNCATGATCCTTATA 24
    TCACTCGGTCGGCTA 12218
    CCTCACAAGCTCGGC 9075
    GTTGTGATCTGAGAT 7017

The average of duplicate reads in this analysis is 5.54, meaning that there are a lot of duplicate reads in the file (for files with little duplicate results, this number should be close to 1). Another thing we get from this analysis is a histogram of the frequency of reads in a single cell. This histogram should have a very high peak at the beggining, which accounts for all the loose mRNA fragments in the sample and then it should show a slightly left skewed curve. In this sample, the histogram shows good results.

Also from this analysis we can see that there were almost 870 thousand cells in the sample (867315) and almost 23 millions unique reads (22993451).

4) Creating Cells x Genes table

This is the final step before the real analysis begin. You will create a table where the columns will be each cell barcode and the rows will be each gene and you will populate this table with how many times that gene was mapped to that cell. As you saw in the histogram from the previous step, there are a lot of so called cells that mapped a very small number of reads. These cells are most likely loose mRNA or some sort of background contamination in the sample. Because of that, you will have to filter out these cells when doing your analysis.

The data
For this step you will use both table files you created while parsing your BAM file and the .gff file you should have downloaded with your genome .fasta file. After the header, this file looks like this

Pf3D7_13_v3     VEuPathDB       protein_coding_gene     624510  626292  .       +       .       ID=PF3D7_1314600;Name=LipL1;description=lipoate-protein ligase 1
Pf3D7_13_v3     VEuPathDB       mRNA    624510  626292  .       +       .       ID=PF3D7_1314600.1;Parent=PF3D7_1314600;Note=2.3.1.181 (Lipoyl(octanoyl) transferase)%3B2.7.7.63 (Transferred entry: 6.3.1.20)%3B6.3.1.20 (Lipoate--protein ligase);description=lipoate-protein ligase 1
Pf3D7_13_v3     VEuPathDB       exon    624510  626292  .       +       .       ID=exon_PF3D7_1314600.1-E1;Parent=PF3D7_1314600.1;gene_id=PF3D7_1314600
Pf3D7_13_v3     VEuPathDB       CDS     624785  626011  .       +       0       ID=PF3D7_1314600.1-p1-CDS1;Parent=PF3D7_1314600.1;gene_id=PF3D7_1314600;protein_source_id=PF3D7_1314600.1-p1
Pf3D7_13_v3     VEuPathDB       five_prime_UTR  624510  624784  .       +       .       ID=utr_PF3D7_1314600.1_1;Parent=PF3D7_1314600.1
Pf3D7_13_v3     VEuPathDB       three_prime_UTR 626012  626292  .       +       .       ID=utr_PF3D7_1314600.1_2;Parent=PF3D7_1314600.1
Pf3D7_06_v3     VEuPathDB       protein_coding_gene     876679  877298  .       -       .       ID=PF3D7_0621350;Name=SYS1;description=protein SYS1%2C putative
Pf3D7_06_v3     VEuPathDB       mRNA    876679  877298  .       -       .       ID=PF3D7_0621350.1;Parent=PF3D7_0621350;description=protein SYS1%2C putative
Pf3D7_06_v3     VEuPathDB       exon    876679  876803  .       -       .       ID=exon_PF3D7_0621350.1-E3;Parent=PF3D7_0621350.1;gene_id=PF3D7_0621350
Pf3D7_06_v3     VEuPathDB       exon    876876  876991  .       -       .       ID=exon_PF3D7_0621350.1-E2;Parent=PF3D7_0621350.1;gene_id=PF3D7_0621350
Pf3D7_06_v3     VEuPathDB       exon    877123  877298  .       -       .       ID=exon_PF3D7_0621350.1-E1;Parent=PF3D7_0621350.1;gene_id=PF3D7_0621350
Pf3D7_06_v3     VEuPathDB       CDS     876679  876803  .       -       2       ID=PF3D7_0621350.1-p1-CDS3;Parent=PF3D7_0621350.1;gene_id=PF3D7_0621350;protein_source_id=PF3D7_0621350.1-p1
Pf3D7_06_v3     VEuPathDB       CDS     876876  876991  .       -       1       ID=PF3D7_0621350.1-p1-CDS2;Parent=PF3D7_0621350.1;gene_id=PF3D7_0621350;protein_source_id=PF3D7_0621350.1-p1
Pf3D7_06_v3     VEuPathDB       CDS     877123  877298  .       -       0       ID=PF3D7_0621350.1-p1-CDS1;Parent=PF3D7_0621350.1;gene_id=PF3D7_0621350;protein_source_id=PF3D7_0621350.1-p1

The analysis
The first thing to do is to filter out the barcode from cells that have less than 5000 unique reads mapped. For that I used this code

c1 = 0
cell_reads = []
with open("unique_reads_per_cell.txt", "r") as bc_file:
    for line in bc_file:              # creates a list with all cells with more than 5000 reads
        if c1 < 1:
            c1 += 1
        else:
            barcode = line.split()[0]
            reads = int(line.split()[-1])
            if reads >= 5000:
               cell_reads.append(barcode)

The next step is to create an annotation file using the .gff file. This new file will get all the information we need from the .gff file, meaning the gene name, PlasmoDB ID, chromossome, position and strand. Since the base pairing before the sequencing occurs on the 3’ end, you should add 500bp to the gene’s 3’ position, just to make sure you count any pairing that happend in that gene. In the end, the new file should look like this.

Pf3D7_13_v3 PF3D7_1314600_lipoate-protein_ligase_1  624510  626792  foward_strand
Pf3D7_06_v3 PF3D7_0621350_protein_SYS1%2C_putative  877179  877298  reverse_strand
Pf3D7_02_v3 PF3D7_0209400_conserved_protein%2C_unknown_function 388716  391107  reverse_strand

For this part, I used this code

annotation_file = open("annotation_file.txt", "w+")        #creates annotation file with gene names and positions
c2 = 0                                                     
with open("PlasmoDB-54_Pfalciparum3D7.gff", "r") as gff_file:
    for line in gff_file:
        if c2 >= 2:                                        #ignores first two lines because they can't be split
            annotation = line.split()
            if annotation[2] == "protein_coding_gene":     #finds line with gene info
                chromossome = annotation[0]                #gets chr name
                
                gene_name = line.split("=")[-1]            #gets gene name
                gene_name = gene_name.replace(" ", "_")
                
                gene_id = re.split("=|;", line)[1]         #gets gene id from plsmodb
                
                gene = f"{gene_id}_{gene_name.split()[0]}"            #merges gene id and name for final table
                
                
                if annotation[6] == "+":                   #if forward print normal
                    position = "foward_strand"
                    position_5 = int(annotation[3])
                    position_3 = int(annotation[4]) + 500
                    print(f"{chromossome}\t{gene}\t{position_5}\t{position_3}\t{position}", file=annotation_file)
                    
                elif annotation[6] == "-":                 #if reverse print inverted so the smaller number 
                    position = "reverse_strand"            #is always first
                    position_5 = int(annotation[4])        
                    position_3 = int(annotation[3]) + 500 
                    print(f"{chromossome}\t{gene}\t{position_3}\t{position_5}\t{position}", file=annotation_file)
                
            else:
                continue
        else:
            c2 += 1

Now to actually make the final table, you will need a dictionary with the genes and the cells. For that, you will need to know which gene is which. What I did was create a dictionary where the key were each chromossome and the value was a list with each gene’s name and position. This is the code I used

chr_dic = dict()
with open("annotation_file.txt", "r") as ann:   #creates dict with all chromossomes as key
    for line in ann:
        chr_name = line.split()[0]
        if chr_name not in chr_dic:
            chr_dic[chr_name] = []
        if len(chr_dic) < 16:
            continue
        else:
            break
 
with open("annotation_file.txt", "r") as annotation_file:   #populates chr_dic with tuples of gene name 
    for line in annotation_file:                            #and position
        ann_chromossome = line.split()[0]
        ann_gene = line.split()[1]
        ann_position = (int(line.split()[2]), int(line.split()[3]))
        new_gene = (ann_gene,ann_position)
        chr_dic[ann_chromossome].append(new_gene)

Then you will need to know which cell has which gene mapped and how many times. For that I created another dictionary where the key was the gene’s names and the value was a list with all the cell it was mapped to (allowing cell duplicates). This is the code I used

hit_dic = defaultdict(list)       
with open("unique_reads.txt", "r") as mapped:    #creates a dict with every gene and all the cells that 
    next(mapped)                                 #mapped that gene
    for line in mapped:                          #uses the file with unique reads
        read_bc = line.split()[0][:-1]
        read_chromossome = line.split()[2][:-1]
        read_position = int(line.split()[3][:-1])
        if read_bc in cell_reads:                #if that cell has more than 5000 reads
            for k, v in chr_dic.items():
                if k == read_chromossome:        #looks for that gene in the right chromossome
                    for (name, position) in v:   
                        pos1 = int(position[0])
                        pos2 = int(position[1])
                        if pos1 < read_position < pos2:    #if the position maps to a gene
                            hit_dic[name].append(read_bc)  #links the cell barcode to that gene

Now you have all the information you need in one place, all that’s left to do is print it to a .txt file. For that I used this code

final_table = open("final_table.txt", "w+")       #creates the final table with how many reads for a gene 
for bc in cell_reads:                             #in each cell
    print(f"{bc}\t", file=final_table, end="")    #creates columns
for gene_tag, cells in hit_dic.items():
    print("\n"+gene_tag, file=final_table, end="\t")  #creates rows
    c3 = 0
    for cell in cell_reads:                       #populates table
        cell_count = cells.count(cell)
        print(cell_count, file=final_table, end="\t")
final_table.close()

The total runtime for this step for a inital sequence of 180 million reads which came down to 1972 cells and 5318 genes, was 1 hour.

The output/controls
In the end, you will have a table that should look like this

TCACTCGGTCGGCTA CCTCACAAGCTCGGC GTTGTGATCTGAGAT TGATGGTGTTGGATC TCCCACAAGCGTACA
PF3D7_0409000_conserved_Plasmodium_protein%2C_unknown_function  0   1   0   0   0
PF3D7_0309200_serine/threonine_protein_kinase_ARK2%2C_putative  0   1   0   0   1
PF3D7_0207500_serine_repeat_antigen_6   9   4   4   0   18
PF3D7_1334800_MSP7-like_protein 8   1   7   0   2

5) Creating a PCA graph

The first analysis you will make with the count table will be creating a PCA, which is a graph of Principal Component Analysis which examines the covariances / correlations between individuals. In this graph you will see how different each individual cell is from the others, allowing you to identify groups of samples that are similiar. This is done by some complex mathematics but what basically happens is that there is a type of linear transformation of the dataset that fits it to a new coordinate system in such a way that the most significant variance is found on the first coordinate, and each subsequent coordinate is orthogonal to the last and has a lesser variance. When many variables correlate to each other, they will contribute to the same principal component and that principal component will be able to explain a certain percentage of the total variation in the dataset. So you will calculate the most meaninfull and representative genes and then plot each cell according to their expression of that genes.

The data
For this analysis you will use the count table you created in the previous step and all the analysis will be done in R.
The analysis
The first step is importing the .txt file with the table you created in the previous step. After that, you will normalize your data counts. Since the counts you have for each gene are a matter of technical and not biological information, you can normalize the whole table by the same amount of reads keeping their proportion. For that just run

final_table_normalized[] <- lapply(final_table_normalized[], function(x) ((x/(sum(x))*1000)))

After that you will have to transpose your dataframe so the columns are genes and the rows are cells. Since in the code I wrote for creating this table I create an aditional column by using \t, I had to run an aditional line of code before transposing

final_table_normalized$X <- NULL    #gets rid of the last column
transposed_normalized <- t(final_table_normalized)

There will be some genes in the data that will have very little reads, as in only 1 or 2. That could throw you data off, since the PCA analysis will give it the same weigth as genes that are highly expressed in some cells and not at all in others, which could in fact indicate something. So you will want to get rid of the genes that have less than 2.5 counts throughout all the cells. For that just run

i <- (colSums(transposed_normalized, na.rm = T) > 2.5)   #lists the cols where the count sum is bigger than 2.5
trans_norm_high_exp <- transposed_normalized[,i]         #trimms the dataframe to only the highly expressed genes

Now, for the actual PCA, just run

pca_data <- prcomp(trans_norm_high_exp, scale. = TRUE)

For plotting the PCA graph you can either use the ggfortify package, in which case you just have to run

library(ggfortify)
autoplot(pca_data)

Or you can use ggplot2. For that you have to run

library(ggplot2)
pca.df <- as.data.frame(pca_data$x)
percentage <- round(pca_data$sdev^2 / sum(pca_data$sdev^2) * 100, 2)
percentage <- paste(colnames(pca.df), "(", paste(as.character(percentage), "%", ")", sep=""))
plot <- ggplot(pca.df, aes(x=PC1, y=PC2)) + geom_point() + xlab(percentage[1]) + ylab(percentage[2])

The output/controls
This analysis resulted in this plot.

In this type of dataset, namely Plasmodium’s blood stages scRNAseq, it is normal to have a principal component in the single digits. Considering there are roughly 5000 genes in the species, that is suficient to be meaningfull. You can actually see this using

library(factoextra)
fviz_eig(pca_data)

6) Annotating the PCA graph

Once you have the PCA graph, you will want to annotate it somehow. In this instance, I used two genes known for being expressed in different stages of the parasite’s blood cycle - PF3D7_1462800 and PF3D7_1410400. The tricky part here is defining a cutoff value for the gene’s expression. This will have to be done by trial and error, but always keeping in mind that there’s a fine line between data manipulation and the truth. Not that the absolute truth can be given by this type of analysis, since there is an inherited flaw in the data that comes from the technology used, meaning that there migh be cells that express a given gene, but that gene wasn’t sequenced in that cell, so it won’t show in the analysis. All of the results and conclusions drawn from analysing a scRNAseq dataset have to be carefully taken with a grain of salt.
Just remember to always use the same cutoff when comparing the expression of different genes and to always stay truthful about what you did in your analysis.

The data
You will use the PCA graph created previously.
The analysis
For annotating the graph in this manner, all you have to do is create a column for the stages and populate it according to some gene’s expression.

# creates a new dataframe with non-numerical info which can't be used for running pca
h3_trans_norm_high_exp_col <- h3_trans_norm_high_exp

# adds stage column
h3_trans_norm_high_exp_col["stages"]<-"n"

# annotates each cell based on a given gene's expression
c<-grep("PF3D7_1462800",colnames(h3_trans_norm_high_exp_col))
m<-max(h3_trans_norm_high_exp_col[,c]) # can be used to calculate some kind of cutoff
h3_trans_norm_high_exp_col[which(h3_trans_norm_high_exp_col[,c]>0.1),5006]<-"gapdh"

# creates plot automatically using stage information for color
autoplot(h3_pca, data=h3_trans_norm_high_exp_col, colour="sample", alpha=0.4)

The output
Doing this you will have a graph that looks like this

7) Creating a PCA graph with multiple samples

For this analysis, four samples were used in order to assess Plasmodium falciparum blood stage’s viability after different sinchronization / enrichement processes. These processes were MACS, Percoll and saponin. To compare the data from each process, I made a PCA using all of their data. That means a count table where all of the genes are present, as well as the name of each process and the stage of each cell. From that you can create a PCA graph that shows the difference in each process’ recuperation stage. For that you will need to merge the count tables and run prcomp on that merged table.

The data
For this analysis you will use the processed dataframes from the previous step. Both with and without the annotation column.
The analysis
The first thing to do is to create another annotation column that will distinguish one sample from the other. For each sample run

h3_trans_norm_high_exp_col["sample"]<-"saponin1"

Then, to merge all samples into one big count table run

library(plyr)

# gets and merges barcodes
h3_barcodes <- rownames(h3_trans_norm_high_exp_col)
h4_barcodes <- rownames(h4_trans_norm_high_exp_col)
h5_barcodes <- rownames(h5_trans_norm_high_exp_col)
all_barcodes <- c(h3_barcodes, h4_barcodes, h5_barcodes)

# gets and merges stages
h3_stages <- h3_trans_norm_high_exp_col$stages
h4_stages <- h4_trans_norm_high_exp_col$stages
h5_stages <- h5_trans_norm_high_exp_col$stages
all_stages <- c(h3_stages, h4_stages, h5_stages)

# gets and merges samples
h3_sample <- h3_trans_norm_high_exp_col$sample
h4_sample <- h4_trans_norm_high_exp_col$sample
h5_sample <- h5_trans_norm_high_exp_col$sample
all_sample <- c(h3_sample, h4_sample, h5_sample)

# merges count tables filling any columns (genes) missing from one or another with NA so all genes are accounted for
merged_count_table <- rbind.fill(h3_trans_norm_high_exp, h4_trans_norm_high_exp, h5_trans_norm_high_exp)

# replaces NA with 0
merged_count_table[is.na(merged_count_table)] <- 0

# creates a merged count table with the highly expressed genes
merged_count_table_high_exp <- merged_count_table
i<-apply(merged_count_table,2,max)>2.5
merged_count_table_high_exp<-merged_count_table[,i]

# creates full annotated table with stages, sample and barodes
merged_full_table <- merged_count_table_high_exp
merged_full_table$stages <- all_stages
merged_full_table$sample <- all_sample
merged_full_table$barccodes <- all_barcodes

Now you have a PCA from all you samples, all that’s left to do is run

# calculates pca from merged samples and plots it
merged_pca <- prcomp(merged_count_table, scale =T)
autoplot(merged_pca, data=merged_full_table, colour="sample", alpha=0.4)

The output
In the end, the PCA graph with all the samples will look like this

8) Annotating your merged PCA graph

Once you have you PCA you can annotate it using genes’ expression. For that you will either use your genes of interest or genes established by the literature. If you want to know more about Plasmodium falciparum’s, P. knowlesi’s or P. berghei’s genes, you can go to the Malaria Cell Atlas. In there you can either test your gene or find one that suits your purposes. For this analysis I chose the REX1 gene to represent rings, Ribosomal protein L13 for trophozoites, Rhoptry protein for schizonts and MSP3 for late schizonts.

The data
You will use the merged count table you created in the last step.
The analysis
For plotting this kind of graph, we will use the manual version of plotting a PCA graph. That means that the first step is to create a pca_df that will be used as base for plotting. In this dataframe you will also need a stages column and an expression column.

pca_df <- as.data.frame(merged_pca$x)
pca_df$stages <- "n"
pca_df$expression <- 0

This dataframe cointains the PCA coordinates and is where we will input the stage and the level of expression of the cells of interest. For that, the merged_full_table will come in handy.
Once you have chosen the genes you will use, you have to find it in the count table, understand how much the cells express it, determine an expression cutoff for your analysis and identify each cell that meets your cutoff and their respective expression.

c<-grep("PF3D7_0935900",colnames(merged_full_table)) # finds gene's column
max <- max(merged_full_table[,c]) # gets the maximum expression for that gene
pca_df[which(merged_full_table[,c]>max*.05),5157]<-"rex1" # finds the top 95% cells expressing that gene and identify them
pca_df[which(merged_full_table[,c]>max*.05),5158] <- merged_full_table[which(merged_full_table[,c]>max*.05),c] # adds those cells' expresions levels to the table

With that you can annotate as many genes as you want. Just remember that if you are using genes that should be expresses roughly at the same time, they can override each other depending on how you run. Also, the cutoff for each gene is a trial and error kind of approach. Just remember to keep it honest.
And then, once you identified all the genes you wanted and got their expression level, all that’s left to do is plot pca_df. For this plot it is interesting to have each dot’s size different according to that cell’s level of expression of each gene. That’s why there should be an expression column in pca_df. You will use stages to color each dot and expression to determine the size of the dot.
Now you can plot pca_df subset to each sample using

ggplot(pca_df[which(merged_full_table$sample=="percoll"),], aes(x=PC1, y=PC2, color=stages, size=expression)) + geom_point(alpha=0.4) + labs(title="percoll", x=percentage[1], y=percentage[2])

Or subset it to more than one sample creating a vector with the row numbers of the samples of interest and using it to subset the rows plotted.

bc <- which(merged_full_table$sample=="percoll" | merged_full_table$sample=="macs") 
ggplot(pca_df[bc,], aes(x=PC1, y=PC2, color=stages, size=expression)) + geom_point(alpha=0.2) + labs(title="macs+percoll", x=percentage[1], y=percentage[2])

If you only want to plot a specific gene across all samples, you can use

ggplot(pca_df[which(pca_df$stage=="rhoptry"),], aes(x=PC1, y=PC2, color=stages, size=expression)) + geom_point(alpha=0.4) + labs(title="rhoptry", x=percentage[1], y=percentage[2])

To plot only one gene in one sample, use

ggplot(pca_df[which(pca_df$stages=="rhoptry" & merged_full_table$sample=="macs"),], aes(x=PC1, y=PC2, color=stages, size=expression)) + geom_point(alpha=0.4) + labs(title="rhoptry", x=percentage[1], y=percentage[2])

Finally, to plot only one gene in a few samples, use

ggplot(pca_df[which(pca_df$stage=="rhoptry" & (merged_full_table$sample=="macs" | merged_full_table$sample=="percoll")),], aes(x=PC1, y=PC2, color=stages, size=expression)) + geom_point(alpha=0.4) + labs(title="macs + percoll rhoptry", x=percentage[1], y=percentage[2])