02-kallisto-pseudo-align

Author

Sarah Tanja

Published

May 16, 2023

Run kallisto with no arguments to confirm the file path to the program is correct; it should print the kallisto version and usage information:

Code

/home/shared/kallisto/kallisto
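
To print only the version, kallisto also has a `version` subcommand. The sketch below first checks that the binary exists and is executable; `check_program` is a hypothetical helper name, not part of kallisto.

```bash
# Path to kallisto used throughout this notebook
KALLISTO=/home/shared/kallisto/kallisto

# Hypothetical helper: report whether a program path exists and is executable
check_program() {
  local prog="$1"
  if [ -x "$prog" ]; then
    echo "found: $prog"
  else
    echo "missing: $prog"
    return 1
  fi
}

# Print the version only if the binary is actually there
if check_program "$KALLISTO"; then
  "$KALLISTO" version
fi
```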

Create Index File with `kallisto index -i`

This code builds a kallisto index from the file Montipora_capitata_HIv3.genes.cds.fna and writes it to mcap_rna_kallisto.index. /home/shared/kallisto/kallisto is the absolute path to the kallisto program on the raven server. The arguments after the index command indicate where to write the index (-i ../output/kallisto-index/mcap_rna_kallisto.index) and where to read the sequences from (../data/Montipora_capitata_HIv3.genes.cds.fna).

Code


# if it doesn't already exist, make an index folder inside output
mkdir -p ../output/kallisto-index

# use kallisto to build an index file
/home/shared/kallisto/kallisto \
index -i \
../output/kallisto-index/mcap_rna_kallisto.index \
../data/Montipora_capitata_HIv3.genes.cds.fna
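
A quick sanity check after indexing might look like the sketch below. It counts the sequences in the input FASTA (each record starts with a `>` header line) and confirms that the index file exists and is not empty. The function names `count_fasta_records` and `index_is_built` are hypothetical helpers, and the check says nothing about whether the index contents are valid.

```bash
# Count FASTA records: each record starts with a '>' header line
count_fasta_records() {
  grep -c '^>' "$1"
}

# Succeed only if the file exists and is non-empty
index_is_built() {
  [ -s "$1" ]
}

# Paths from the indexing step above
fasta=../data/Montipora_capitata_HIv3.genes.cds.fna
index=../output/kallisto-index/mcap_rna_kallisto.index

if [ -f "$fasta" ]; then
  echo "sequences in FASTA: $(count_fasta_records "$fasta")"
fi

if index_is_built "$index"; then
  echo "index present: $index"
else
  echo "index missing or empty: $index"
fi
```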

Make Abundance Files `abundance.tsv` for each sample with `kallisto quant -i`

- create a subdirectory `kallisto-abundance` in the `output` folder using `mkdir -p`
- use the `find` utility to list all files in the `/home/shared/8TB_HDD_01/mcap/` directory that match the pattern `*.fastq`
- use the `basename` command to extract the base file name of each file (i.e., the file name without the directory path) and remove the suffix `.fastq`
- run the kallisto `quant` command on each input file, with the following options:
  - `-i ../output/kallisto-index/mcap_rna_kallisto.index`: use the kallisto index file built in the previous section.
  - `-o ../output/kallisto-abundance/{}`: write the output files to a subdirectory of `../output/kallisto-abundance/` named after the base file name of the input file (`{}` is the `xargs` placeholder for the base file name).
  - `-t 40`: use 40 threads for the computation.
  - `--single -l 100 -s 10`: treat the input as single-end reads (`--single`), with an estimated average fragment length of 100 (`-l 100`) and a standard deviation of 10 (`-s 10`).
  - the input file is given as `/home/shared/8TB_HDD_01/mcap/{}.fastq`, where `{}` is replaced by the base file name from the previous step.

Code

# make a directory in the output folder for the kallisto results (the `-p` option creates intermediate directories as needed and does not throw an error if the directory already exists)
mkdir -p ../output/kallisto-abundance


find /home/shared/8TB_HDD_01/mcap/*.fastq \
| xargs basename -s .fastq | xargs -I{} /home/shared/kallisto/kallisto \
quant -i ../output/kallisto-index/mcap_rna_kallisto.index \
-o ../output/kallisto-abundance/{} \
-t 40 \
--single -l 100 -s 10 /home/shared/8TB_HDD_01/mcap/{}.fastq
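
To see what the `basename` and `{}` substitution steps produce before committing 40 threads to a real run, a dry run can help. The sketch below only prints the command that would be built for each FASTQ path. The function name `print_quant_commands` and the sample file names are hypothetical, and it uses the POSIX form `basename FILE SUFFIX`, which gives the same result as `basename -s` for a single file.

```bash
# Paths taken from the notebook; the sample names passed in below are hypothetical
INDEX=../output/kallisto-index/mcap_rna_kallisto.index
OUTDIR=../output/kallisto-abundance

# Dry run: print one kallisto command per FASTQ path instead of executing it
print_quant_commands() {
  local fq sample
  for fq in "$@"; do
    # Strip the directory, then the .fastq suffix
    sample=$(basename "$fq" .fastq)
    echo "kallisto quant -i ${INDEX} -o ${OUTDIR}/${sample} -t 40 --single -l 100 -s 10 ${fq}"
  done
}

print_quant_commands \
  /home/shared/8TB_HDD_01/mcap/sampleA.fastq \
  /home/shared/8TB_HDD_01/mcap/sampleB.fastq
```

Each printed line shows the output folder that a sample would get, which makes a mismatch between file names and folder names easy to spot.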

This makes an individual folder for each sample. In each folder there should be three files:

- run_info.json
- abundance.tsv
- abundance.h5
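
One way to confirm that every sample folder is complete is sketched below. It loops over the sample folders and reports any of the three expected files that is missing or empty. The function name `check_kallisto_outputs` is a hypothetical helper.

```bash
# Report any sample folder that lacks one of the three expected kallisto outputs
check_kallisto_outputs() {
  local outdir="$1" sample f missing=0
  for sample in "$outdir"/*/; do
    # Skip the unexpanded glob when there are no sample folders
    [ -d "$sample" ] || continue
    for f in run_info.json abundance.tsv abundance.h5; do
      if [ ! -s "${sample}${f}" ]; then
        echo "missing: ${sample}${f}"
        missing=$((missing + 1))
      fi
    done
  done
  echo "files missing: ${missing}"
  [ "$missing" -eq 0 ]
}

check_kallisto_outputs ../output/kallisto-abundance || true
```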

Trinity Abundance Estimates, Gene Expression Matrix

This next command runs the abundance_estimates_to_matrix.pl script from the Trinity RNA-seq assembly software package to create a gene expression matrix from kallisto output files.

The specific options and arguments used in the command are as follows:

- `perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl`: run the abundance_estimates_to_matrix.pl script from Trinity.

Code
```
/home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl
```
- `--est_method kallisto`: specify that the abundance estimates were generated using kallisto.
- `--gene_trans_map none`: do not use a gene-to-transcript mapping file.
- `--out_prefix ../output/trinity-matrix/mcap`: write the gene expression matrix files to `../output/trinity-matrix/` with the file name prefix `mcap`.
- `--name_sample_by_basedir`: use the sample directory name (i.e., the final directory in the input file paths) as the sample name in the output matrix.
- the kallisto abundance.tsv files generated in the last step are the input for creating the gene expression matrix.

trinity gene expression matrix

Code


# If it doesn't already exist, create a folder named 'trinity-matrix' inside the 'output' folder
mkdir -p ../output/trinity-matrix

# A bash array holding the path to each sample's 'abundance.tsv' file
# (the parentheses make the shell expand the glob at assignment time)
abundance_paths=(../output/kallisto-abundance/*/abundance.tsv)

# Create gene expression matrix with Trinity, passing each path as its own argument
perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \
  --est_method kallisto \
  --gene_trans_map none \
  --name_sample_by_basedir \
  --out_prefix ../output/trinity-matrix/mcap \
  "${abundance_paths[@]}"

Getting some errors I don't understand. The line `sh: 1: R: not found` suggests that the Trinity script calls R for its TMM scaling step and that no R executable was found on the PATH of the shell that ran it. The paths in the log point to ../output/trinity-gem/, so this output appears to come from a run with a different --out_prefix than the one shown above.

R --no-save --no-restore --no-site-file --no-init-file -q < ../output/trinity-gem/.isoform.TPM.not_cross_norm.runTMM.R 1>&2
sh: 1: R: not found
Error, cmd: R --no-save --no-restore --no-site-file --no-init-file -q < ../output/trinity-gem/.isoform.TPM.not_cross_norm.runTMM.R 1>&2 died with ret (32512) at /home/shared/trinityrnaseq-v2.12.0/util/support_scripts/run_TMM_scale_matrix.pl line 105.
Error, CMD: /home/shared/trinityrnaseq-v2.12.0/util/support_scripts/run_TMM_scale_matrix.pl --matrix ../output/trinity-gem/.isoform.TPM.not_cross_norm > ../output/trinity-gem/.isoform.TMM.EXPR.matrix died with ret 6400 at /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl line 385.
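
Since the log reports `R: not found`, a reasonable first check is whether R is visible from the same shell that runs the Trinity script. The sketch below does that and also decodes the `ret` values in the log, assuming they are raw wait statuses as returned by Perl's `system` call (exit code in the high byte). Both function names are hypothetical helpers.

```bash
# Report whether a command can be found on the current PATH
require_on_path() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 found at $(command -v "$1")"
  else
    echo "$1 not found on PATH"
    return 1
  fi
}

# Assumption: the 'ret' value is a raw wait status, so the exit code
# is the status shifted right by 8 bits
exit_code_from_status() {
  echo $(( $1 >> 8 ))
}

require_on_path R || echo "R is needed for the TMM step; install it or add it to PATH"

# 32512 decodes to 127, the shell's 'command not found' exit code
exit_code_from_status 32512
```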

Summary & Next Steps

Info

Then we take the matrix (which is a count matrix) located at ‘../output/trinity-matrix/mcap.isoform.counts.matrix’ and run it through the 03-DESeq2 code!
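
Before moving on, it can help to confirm that the matrix has the expected shape. The sketch below assumes the file is tab-delimited with one header line and the transcript name in the first column; `matrix_shape` is a hypothetical helper name.

```bash
# Print the number of transcripts (data rows) and samples (data columns).
# Assumes a tab-delimited file with one header line and row names in column 1.
matrix_shape() {
  awk -F'\t' '
    NR == 2 { samples = NF - 1 }
    END     { printf "transcripts=%d samples=%d\n", NR - 1, samples }
  ' "$1"
}

matrix=../output/trinity-matrix/mcap.isoform.counts.matrix

# Only inspect the matrix if the Trinity step actually produced it
if [ -f "$matrix" ]; then
  matrix_shape "$matrix"
  head -n 3 "$matrix"
fi
```

The sample count should match the number of folders in ../output/kallisto-abundance.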
