Climate change is an unavoidable suite of stressors for aquatic organisms and its effects are exacerbated by the personal care products and pharmaceuticals we use regularly. Thinking about what that multiple-stressor environment looks like and understanding its impact on aquatic organisms, we will turn our attention to marine mussels, Mytilus spp. They are an indicator organism for environmental health and inferences to those impacts to human health. I am exploring just how Mytilus spp. respond to man-made contaminants and to do that I’m using differential gene expression to help me draw conclusions about what common or unique physiological responses are displayed by M. galloprovincials when exposed to various anthropogenic stressors. This molecular-based technique ill help me get ot the heart of what physiological functions could be impaired.
Data
Since I want to explore a ‘hunch’ before making larger experimental choices, it makes sense to look at what’s out there. I chose four different types of stressors and a solid reference genome from existing sequenced data in the NCBI & Uniprot databases. Listed below are the data I chose:
Existing RNA-Seq data was retrieved from complete RNASeq data available in the NCBI database along wiht the reference genome. These very large, ‘crowd sourced’, and free sites provide the foundation for many types of molecular level work - give it a browse!
From this database I was able to select a reference, and examples of broad categories of stressors; organic pharmaceutical contaminants like Valsartan (blood pressure medication) and Carbemazepine (seizure), endocrine disruptoring compounds like synthetic Estrogen, biotoxins like those created by harmful algal blooms, and finally abiotic stressors like hypoxia.
Workflow
flowchart LR
A[Study Organism] --> B[Data Search] --> C[Reproducible Bioinformatics w/ GitHub]
C --> D{Tools for Analysis}
D --> E[RNASeq -> Data type]
D --> F[Kallisto & DESeq -> Quantifyng DEGs]
D --> G[RStudio -> Visualizing Results]
D --> H[Uniprot & Blast -> What *are* the DEGs?]
D --> I[Annotation & Functional Gene Enrichment -> Next Steps]
We’ve already covered the study organism and the data search - lets give a shoutout to the workhorse platforms of the show: GitHub and RStudio/ R. Both polatforms are free, can be connected, and keep you from accidentally ‘losing’ your work when you aren’t using the best file structure practices. Check out Happy Git for more information on how this symbiotic relationship can be attained.
RNASeq
This is the type of sequence data I have available to work with from NCBI so I have to retrieve it. ### Pull out all of the sequences from the exposures.
Pulling my reference genome from prefab code provided by NCBI & verify where I am working
Code
cd ../datals
Code
# grab the referencecd ../datacurl-OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCA_900618805.1/download?include_annotation_type=GENOME_FASTA,GENOME_GFF,RNA_FASTA,CDS_FASTA,PROT_FASTA,SEQUENCE_REPORT&filename=GCA_900618805.1.zip"-H"Accept: application/zip"unzip GCA_900618805.1.zip /chris-musselcon/data
QC Note
Even though I didn’t do this step, it is important and how it could be done with MultiQC.
Kallisto
This process aligns my sequences to the reference to support DEG quantification.
Code
cd /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/data/ncbi_dataset/data/GCA_900618805.1head cds_from_genomic.fna
head ../data/ncbi_dataset/data/GCA_900618805.1/cds_from_genomic.fna
Code
/home/shared/kallisto/kallisto quant
kallisto 0.46.1
Computes equivalence classes for reads and quantifies abundances
Usage: kallisto quant [arguments] FASTQ-files
Required arguments:
-i, --index=STRING Filename for the kallisto index to be used for
quantification
-o, --output-dir=STRING Directory to write output to
Optional arguments:
--bias Perform sequence based bias correction
-b, --bootstrap-samples=INT Number of bootstrap samples (default: 0)
--seed=INT Seed for the bootstrap sampling (default: 42)
--plaintext Output plaintext instead of HDF5
--fusion Search for fusions for Pizzly
--single Quantify single-end reads
--single-overhang Include reads where unobserved rest of fragment is
predicted to lie outside a transcript
--fr-stranded Strand specific reads, first read forward
--rf-stranded Strand specific reads, first read reverse
-l, --fragment-length=DOUBLE Estimated average fragment length
-s, --sd=DOUBLE Estimated standard deviation of fragment length
(default: -l, -s values are estimated from paired
end data, but are required when using --single)
-t, --threads=INT Number of threads to use (default: 1)
--pseudobam Save pseudoalignments to transcriptome to BAM file
--genomebam Project pseudoalignments to genome sorted BAM file
-g, --gtf GTF file for transcriptome information
(required for --genomebam)
-c, --chromosomes Tab separated file with chromosome names and lengths
(optional for --genomebam, but recommended)
Code
ls /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/ncbi/
# Select top 50 differentially expressed genes - is it valuable to select more than 50?res <-results(deseq2.dds)res_ordered <- res[order(res$padj), ]top_genes <-row.names(res_ordered)[1:50]# Extract counts and normalizecounts <-counts(deseq2.dds, normalized =TRUE)counts_top <- counts[top_genes, ]# Log-transform countslog_counts_top <-log2(counts_top +1)# Generate heatmappheatmap(log_counts_top, scale ="row")
Our PCA shows strong clustering and warrants further review. ### Heatmap
Code
#what are we doing here and how do I get it to compare the others against each other?head(deseq2.res)
Code
# Count number of hits with adjusted p-value less then 0.05dim(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <=0.05, ])
Our heatmap shows us the contaminant sequences labeled below the color blocks on the X-axis, and up-regulated (red) or down-regulated (blue) genes and the y-axis. When we go back and match the sequence numbers we see confirmation of the result of the PCA. Now we want to look at how many DEGs are statistically significant. ### Volcano Plots
Code
tmp <- deseq2.res# The main plotplot(tmp$baseMean, tmp$log2FoldChange, pch=20, cex=0.45, ylim=c(-3, 3), log="x", col="darkgray",main="DEG Valsartan vs DSP (pval <= 0.05)",xlab="mean of normalized counts",ylab="Log2 Fold Change")# Getting the significant points and plotting them again so they're a different colortmp.sig <- deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <=0.05, ]points(tmp.sig$baseMean, tmp.sig$log2FoldChange, pch=20, cex=0.45, col="red")# 2 FC linesabline(h=c(-1,1), col="blue")
Code
# Prepare the data for plottingres_df <-as.data.frame(deseq2.res)res_df$gene <-row.names(res_df)# Create volcano plotvolcano_plot <-ggplot(res_df, aes(x = log2FoldChange, y =-log10(padj), color = padj <0.05)) +geom_point(alpha =0.6, size =1.5) +scale_color_manual(values =c("grey", "red")) +labs(title ="Volcano Plot",x ="Log2 Fold Change",y ="-Log10 Adjusted P-value",color ="Significantly\nDifferentially Expressed") +theme_minimal() +theme(panel.grid.major =element_blank(),panel.grid.minor =element_blank(),legend.position ="top")print(volcano_plot)
Now we have something to work with, so we create a result table for further work.
This will take my annotated reference gene sequence (cds file) and compare it to the result table of differentially expressed genes we created above and a reference of all the known proteins in the Uniprot protein database so we can blast them against each other and pick up what the functions of the DEGs may be.
Using BLAST
This is the tool to compare my cds file and the uniprot database. Going to call my existing uniprot database made in week 2 to blast my cds file
Extract the actual term I need from cds before I merge with deg.
Code
library(tidyverse)# Read the TSV file into a data framecds1 <-read.table("/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/data/cds.tab", sep ="\t", header =FALSE)# Separate the column into multiple columns based on a delimitercds2 <-separate(cds1, V2, into =c("Column 1", "Column 2", "Column 3"), sep ="\t")# Write the updated data frame back to a TSV filewrite.table(cds2, file ="/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/data/newcds.tab", sep ="\t", row.names =FALSE)
Fixing deg column data being used as row names
Code
# Create a new column with the values from row namesdeg$column_name <-rownames(deg)# Reset the row names to default numeric indicesrownames(deg) <-NULL
Merge cds and DEGlist before merging with Uniprot protein table
My table is acting up, so it will be a mystery until resolved.
Next Steps
Functional Gene Enrichment
Physiological inference
Preliminary conclusions & any further evaluation of targeted processes
If you want to follow along as I continue, please check out the Rpubs link or the Zenodo data repository linked below
---title: "*Mytilus galloprovincialis* differential gene expression in response to contaminant exposure"author: "Chris"date: "`r format(Sys.time(), '%d %B, %Y')`" format: html: df-print: paged toc: true toc-location: left smooth-scroll: true link-external-icon: true link-external-newwindow: true code-fold: show code-tools: true code-copy: true highlight-style: arrow code-overflow: wrap theme: nightreference-location: margincitation-location: margin---```{r setup, include=FALSE}library(knitr)library(tidyverse)library(kableExtra)library(DESeq2)library(pheatmap)library(RColorBrewer)library(data.table)library(DT)library(Biostrings)knitr::opts_chunk$set(echo =TRUE, # Display code chunkseval =FALSE, # Evaluate code chunkswarning =FALSE, # Hide warningsmessage =FALSE, # Hide messagesfig.width =10, # Set plot width in inchesfig.height =10, # Set plot height in inchesfig.align ="center"# Align plots to the center)```# Project Description[](https://github.com/course-fish546-2023/chris-musselcon/blob/main/assets/mytilus.png?raw=true)Climate change is an unavoidable suite of stressors for aquatic organisms and its effects are exacerbated by the personal care products and pharmaceuticals we use regularly. Thinking about what that multiple-stressor environment looks like and understanding its impact on aquatic organisms, we will turn our attention to marine mussels, *Mytilus* spp. They are an indicator organism for environmental health and inferences to those impacts to human health. I am exploring just how *Mytilus* spp. respond to man-made contaminants and to do that I'm using differential gene expression to help me draw conclusions about what common or unique physiological responses are displayed by *M. galloprovincials* when exposed to various anthropogenic stressors. This molecular-based technique ill help me get ot the heart of what physiological functions could be impaired.# DataSince I want to explore a 'hunch' before making larger experimental choices, it makes sense to look at what's out there. I chose four different types of stressors and a solid reference genome from existing sequenced data in the NCBI & Uniprot databases. Listed below are the data I chose:- [Exposure to Valsartan & Carbamazepine](https://www.ncbi.nlm.nih.gov/sra/SRX15826291%5Baccn%5D)- [Exposure to a synthetic hormone 17 a-Ethinylestradiol (EE2)](https://www.ncbi.nlm.nih.gov/sra/SRX12971792%5Baccn%5D)- [Diarrhetic Shellfish Poisoning (DSP) toxins associated with Harmful Algal Blooms (HABs)](https://www.ncbi.nlm.nih.gov/sra/SRX4582204%5Baccn%5D)- [Hypoxia](https://www.ncbi.nlm.nih.gov/sra/SRX9464766%5Baccn%5D)\- [*Mytilus galloprovincialis* Reference](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCA_025277285.1/)## NCBI & Uniprot DatabasesExisting RNA-Seq data was retrieved from complete RNASeq data available in the [NCBI database](https://www.ncbi.nlm.nih.gov/) along wiht the reference genome. These very large, 'crowd sourced', and free sites provide the foundation for many types of molecular level work - give it a browse!From this database I was able to select a reference, and examples of broad categories of stressors; organic pharmaceutical contaminants like Valsartan (blood pressure medication) and Carbemazepine (seizure), endocrine disruptoring compounds like synthetic Estrogen, biotoxins like those created by harmful algal blooms, and finally abiotic stressors like hypoxia.# Workflow```{mermaid}flowchart LR A[Study Organism] --> B[Data Search] --> C[Reproducible Bioinformatics w/ GitHub] C --> D{Tools for Analysis} D --> E[RNASeq -> Data type] D --> F[Kallisto & DESeq -> Quantifyng DEGs] D --> G[RStudio -> Visualizing Results] D --> H[Uniprot & Blast -> What *are* the DEGs?] D --> I[Annotation & Functional Gene Enrichment -> Next Steps]```[{fig-align="left"}](https://github.com/course-fish546-2023/chris-musselcon/blob/main/assets/ghicon.png?raw=true)We've already covered the study organism and the data search - lets give a shoutout to the workhorse platforms of the show: GitHub and RStudio/ R. Both polatforms are free, can be connected, and keep you from accidentally 'losing' your work when you aren't using the best file structure practices. Check out [Happy Git](https://happygitwithr.com/) for more information on how this symbiotic relationship can be attained.## RNASeqThis is the type of sequence data I have available to work with from NCBI so I have to retrieve it. \### Pull out all of the sequences from the exposures.```{bash}/home/shared/sratoolkit.3.0.2-ubuntu64/bin/./fasterq-dump\--outdir /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/ncbi \--progress \SRR19782039 \SRR13013756 \SRR7725722 \SRR16771870 ```### Pulling my reference genome from prefab code provided by NCBI & verify where I am working```{bash}cd ../datals``````{bash}# grab the referencecd ../datacurl-OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCA_900618805.1/download?include_annotation_type=GENOME_FASTA,GENOME_GFF,RNA_FASTA,CDS_FASTA,PROT_FASTA,SEQUENCE_REPORT&filename=GCA_900618805.1.zip"-H"Accept: application/zip"unzip GCA_900618805.1.zip /chris-musselcon/data ```## QC NoteEven though I didn't do this step, it is important and how it could be done with [MultiQC](https://multiqc.info/docs/).## KallistoThis process aligns my sequences to the reference to support DEG quantification.```{r, engine='bash', eval=TRUE} cd /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/data/ncbi_dataset/data/GCA_900618805.1 head cds_from_genomic.fna```Create the index file to align my short read files to the genes from the MGAL_10 (reference genome)```{r, engine='bash'}/home/shared/kallisto/kallisto \index \-i ../data/MGAL_cds.index \../data/ncbi_dataset/data/GCA_900618805.1/cds_from_genomic.fna```Let's take a look```{bash}head ../data/ncbi_dataset/data/GCA_900618805.1/cds_from_genomic.fna```[](https://github.com/course-fish546-2023/chris-musselcon/blob/main/assets/reference_head.png?raw=true)```{r, engine='bash', eval=TRUE}/home/shared/kallisto/kallisto quant``````{r, engine='bash', eval=TRUE}ls /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/ncbi/``````{r, engine='bash'}#mkdir ../outputmkdir ../output/kallisto_01#pulling all files for comparison to my indexfind /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/ncbi/*_1.fastq \| xargs basename -s _1.fastq | xargs -I{} /home/shared/kallisto/kallisto \quant -i ../data/MGAL_cds.index \-o ../output/kallisto_01/{} \-t 4 \/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/ncbi/{}_1.fastq \/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/ncbi/{}_2.fastq \```Getting length mean & SD for the single read sequences in the .fastq files. Code taken from: https://www.biostars.org/p/243552/```{bash}awk'BEGIN { t=0.0;sq=0.0; n=0;} ;NR%4==2 {n++;L=length($0);t+=L;sq+=L*L;}END{m=t/n;printf("total %d avg=%f stddev=%f\n",n,m,sq/n-m*m);}' /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/ncbi/*.fastq```Grabbing only the fastq files```{r, eval=FALSE, engine='bash'}find /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/ncbi/*.fastq \| xargs basename -s .fastq | xargs -I{} /home/shared/kallisto/kallisto \quant -i ../data/MGAL_cds.index \--single -l 92-s 405 \-o ../output/kallisto_01/{} \-t 4 \/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/ncbi/{}.fastq``````{r, engine='bash'}perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \--est_method kallisto \--gene_trans_map none \--out_prefix /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/kallisto_01 \--name_sample_by_basedir \/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/kallisto_01/SRR7725722/abundance.tsv \/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/kallisto_01/SRR7725722_1/abundance.tsv \/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/kallisto_01/SRR7725722_2/abundance.tsv \/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/kallisto_01/SRR13013756/abundance.tsv \/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/kallisto_01/SRR16771870/abundance.tsv \/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/kallisto_01/SRR16771870_1/abundance.tsv \/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/kallisto_01/SRR16771870_2/abundance.tsv \/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/kallisto_01/SRR19782039/abundance.tsv```Let's get this into a manageable format we can feed to DESeq and followup with visualization.```{r}countmatrix <-read.delim("../output/kallisto_01.isoform.counts.matrix", header =TRUE, sep ='\t')rownames(countmatrix) <- countmatrix$Xcountmatrix <- countmatrix[,-1]head(countmatrix)#tail(countmatrix)#dim(countmatrix)``````{r}countmatrix <-round(countmatrix, 0)str(countmatrix)``````{r}dim(countmatrix)#dim(deseq2.colData)``````{r}length(colnames(data))```## DESeq2This quantifies how many differentially expressed genes exist based on the parameters we set. We will get the data into a dataframe for review.```{r}deseq2.colData <-data.frame(condition=factor(c("DSP", "DSP", "Hypoxia", "EE2", "EE2", "EE2", "Valsartan")), type=factor(rep("single-read", 7)))rownames(deseq2.colData) <-colnames(data)dim(deseq2.colData)deseq2.dds <-DESeqDataSetFromMatrix(countData = countmatrix,colData = deseq2.colData, design =~ condition)``````{r}deseq2.dds <-DESeq(deseq2.dds)deseq2.res <-results(deseq2.dds)deseq2.res <- deseq2.res[order(rownames(deseq2.res)), ]head(deseq2.dds)```## VisualizationThis is an interim step to look at your DESeq data and see what is interesting/ make any general inferences. \### PCA Plot```{r}vsd <-vst(deseq2.dds, blind =FALSE)plotPCA(vsd, intgroup ="condition")```[](https://github.com/course-fish546-2023/chris-musselcon/blob/main/assets/pca.png?raw=true)```{r}# Select top 50 differentially expressed genes - is it valuable to select more than 50?res <-results(deseq2.dds)res_ordered <- res[order(res$padj), ]top_genes <-row.names(res_ordered)[1:50]# Extract counts and normalizecounts <-counts(deseq2.dds, normalized =TRUE)counts_top <- counts[top_genes, ]# Log-transform countslog_counts_top <-log2(counts_top +1)# Generate heatmappheatmap(log_counts_top, scale ="row")```Our PCA shows strong clustering and warrants further review. \### Heatmap```{r}#what are we doing here and how do I get it to compare the others against each other?head(deseq2.res)```[](https://github.com/course-fish546-2023/chris-musselcon/blob/main/assets/heatmap.png?raw=true)```{r}# Count number of hits with adjusted p-value less then 0.05dim(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <=0.05, ])```Our heatmap shows us the contaminant sequences labeled below the color blocks on the X-axis, and up-regulated (red) or down-regulated (blue) genes and the y-axis. When we go back and match the sequence numbers we see confirmation of the result of the PCA. Now we want to look at how many DEGs are statistically significant. \### Volcano Plots```{r}tmp <- deseq2.res# The main plotplot(tmp$baseMean, tmp$log2FoldChange, pch=20, cex=0.45, ylim=c(-3, 3), log="x", col="darkgray",main="DEG Valsartan vs DSP (pval <= 0.05)",xlab="mean of normalized counts",ylab="Log2 Fold Change")# Getting the significant points and plotting them again so they're a different colortmp.sig <- deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <=0.05, ]points(tmp.sig$baseMean, tmp.sig$log2FoldChange, pch=20, cex=0.45, col="red")# 2 FC linesabline(h=c(-1,1), col="blue")```[](https://github.com/course-fish546-2023/chris-musselcon/blob/main/assets/volcano1.png?raw=true)```{r}# Prepare the data for plottingres_df <-as.data.frame(deseq2.res)res_df$gene <-row.names(res_df)# Create volcano plotvolcano_plot <-ggplot(res_df, aes(x = log2FoldChange, y =-log10(padj), color = padj <0.05)) +geom_point(alpha =0.6, size =1.5) +scale_color_manual(values =c("grey", "red")) +labs(title ="Volcano Plot",x ="Log2 Fold Change",y ="-Log10 Adjusted P-value",color ="Significantly\nDifferentially Expressed") +theme_minimal() +theme(panel.grid.major =element_blank(),panel.grid.minor =element_blank(),legend.position ="top")print(volcano_plot)```[](https://github.com/course-fish546-2023/chris-musselcon/blob/main/assets/volcano.png?raw=true)Now we have something to work with, so we create a result table for further work.```{r}write.table(tmp.sig, "../output/DEGlist.tab", sep ='\t', row.names = T)```## AnnotationThis will take my annotated reference gene sequence (cds file) and compare it to the result table of differentially expressed genes we created above and a reference of all the known proteins in the Uniprot protein database so we can blast them against each other and pick up what the functions of the DEGs may be.### Using BLASTThis is the tool to compare my cds file and the uniprot database. Going to call my existing uniprot database made in week 2 to blast my cds file```{r, engine='bash'}/home/shared/8TB_HDD_02/cnmntgna/Applications/bioinfo/ncbi-blast-2.13.0+/bin/blastx \-query /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/data/ncbi_dataset/data/GCA_900618805.1/cds_from_genomic.fna \-db /home/shared/8TB_HDD_02/cnmntgna/output/blastdb/uniprot_sprot_r2023_01 \-out /home/shared/8TB_HDD_02/cnmntgna/data/cds_blastx.tab \-evalue 1E-20 \-num_threads 20 \-max_target_seqs 1 \-outfmt 6``````{bash}head /home/shared/8TB_HDD_02/cnmntgna/data/cds_blastx.tab```### Tables on tables on tables!This is my annotated reference genome.```{bash}head-2 ../code/cds.csvwc-l ../code/cds.csv```Fixing the format for a clean merge/ join.```{bash}tr'|''\t'< ../code/cds.csv |head-2``````{bash}tr'|''\t'< ../code/cds.csv \> ../data/cds.csv``````{bash}head ../data/cds.csv```This is the list of DEGs generated from DESeq and will also be prepared for merging/ joining```{bash}head-2 ../output/DEGlist.tabwc-l ../output/DEGlist.tab``````{bash}tr'|''\t'< ../output/DEGlist.tab |head-2``````{bash}tr'|''\t'< ../output/DEGlist.tab \> ../data/DEGlist.tab``````{bash}head ../data/DEGlist.tab``````{bash}head ../data/cds.tab```Verify everything is formatted correctly```{r, eval=TRUE}getwd()cds <-read.table("/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/data/cds.tab", sep ="\t", header =FALSE)deg <-read.table("/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/data/DEGlist.tab", sep ="\t", header =TRUE)```Extract the actual term I need from cds before I merge with deg.```{r, eval=TRUE}library(tidyverse)# Read the TSV file into a data framecds1 <-read.table("/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/data/cds.tab", sep ="\t", header =FALSE)# Separate the column into multiple columns based on a delimitercds2 <-separate(cds1, V2, into =c("Column 1", "Column 2", "Column 3"), sep ="\t")# Write the updated data frame back to a TSV filewrite.table(cds2, file ="/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/data/newcds.tab", sep ="\t", row.names =FALSE)```Fixing deg column data being used as row names```{r, eval=TRUE}# Create a new column with the values from row namesdeg$column_name <-rownames(deg)# Reset the row names to default numeric indicesrownames(deg) <-NULL```Merge cds and DEGlist before merging with Uniprot protein table```{r, eval=TRUE}cdsDEG <-merge(cds2, deg, by.x =1, by.y =7 )head(cdsDEG)```Prep Uniprot table for joining```{bash, eval=TRUE}head-2 ../data/uniprot_table_r2023_01.tabwc-l ../data/uniprot_table_r2023_01.tab``````{r, eval=TRUE}uni<-read.csv("../data/uniprot_table_r2023_01.tab", sep ='\t', header =TRUE)```Join all three tables and take a look at the list.```{r, eval-TRUE}library(kableExtra)library(knitr)library(dplyr)library(readr)#new table processfull <-left_join(cdsDEG, uni, by =c("Column 2"="Entry"))write.table(full, file ="/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/data/annoblast.tab", sep ="\t", row.names =FALSE)```My table is acting up, so it will be a mystery until resolved. ## Next Steps- Functional Gene Enrichment- Physiological inference- Preliminary conclusions & any further evaluation of targeted processes\ If you want to follow along as I continue, please check out the Rpubs link or the Zenodo data repository linked below- [Rpubs link](https://rpubs.com/cmantegna/1039864)- [](https://zenodo.org/badge/latestdoi/624256861)