In this study, it is another look at nasopharygeal carcinoma (NPC) and how Epstein-Barr virus (EBV) latent infection can attach to certain regions of the human genome where less gene profiles are located and start hijacking genes associated with tumorigenesis or tumor creation and tumor angiogenesis of creating blood supply to the tumors being grown in NPC. It switches compartments along the chromosome host where the chromatin of DNA that wraps it along histones to create an environment that the tumor suppressing genes are silenced and tumor creation genes thrive in aggressive NPC as well as similar results in gastric carcinoma. The researchers, they, found a specific gene and a specific signature to predict with great certainty if a sample was NPC with latent EBV, or not as specifically an outcome as gastric carcinoma. It compared different cell samples of commercial NPC epithelial, then equal amounts of NPC with EBV and NPC without EBV as well as patient derived xenografts (internet says are tumor biopsies implanted in an immunodeficient or humanized mouse to allow the cancerous tumor to grow and be monitored to see how the cancer survives and thrives) with NPC and gastric carcinoma or GC. The study is a lot of research by a team of 27 researchers that really dived into the effects of EBV and how it could possibly survive, mutate, and die or become dormant or put back into its latent state in the human host.
There are 16 cell line sources that the samples were taken for this research that were all done in vitro for this study that are provided in the study ID at GSE299775, but other samples as well are covered in the published PubMed document. The samples they used are from in total 16 different cell lines, made up of:
2 commercial NPC epithelial cells called normal immortalised nasopharyngeal epithelial cells or NPE in 2 samples they reference by NP69 and NP460
4 EBV positive NPC or EBV+ NPC cell line samples in reference used as C17, C666, NPC43, and HK1EBV
4 EBV negative NPC or EBV- NPC cell line samples in reference as NPC43noEBV, NPC53, NPC38, and HK1
4 EBV+ patient derived xenographs (PDXs) in reference as C15, Xeno23, Xeno32, and NPC113
1 EBV+ GC or EBV positive gastric carcinoma samples or EBVaGC, in reference is YCCEL1
1 EBV+ clinical NPC samples from a clinical study of patients with metastatic or non-metastatic NPC in China with 77 total samples from Hong Kong and 100 total samples from Guangzhou China. The 77 samples are sourced as HK and the 100 samples are sourced as GZ. The non-metastatic class is if the NPC samples were not metastatic within 3 years as DMFS for 3-year Distant Metastasis-Free Survival. There are 39 metastatic patients and 38 non-metastatic or DMFS samples in the HK clinical population and 36 metastatic and 64 DMFS samples in the GZ clinical population
The study is a lot to unpack in one sitting (most document had my highlighting on it) and much work and figures of analysis were done in this 21 page research study published last August in 2025, but overall they confirmed the LMP1 gene affecting NPC changes and being dysregulated by EBV in our last analysis but also the gene region for KDM5B gene and their interaction regions along the human chromosome by EBV that had the KDM5B signature, called EBVIR-enhancer-KDMB5. The researchers, they, found that the marker was elevated in high amounts of EBV+ NPC but not so much or at all in EBV- NPC samples. It was very interesting discoveries made and learned a bunch of things such as:
EBV affects or is associated with 200,000 cancer cases each year.
EBV is latent in episomes or extrachromosomal plasmids (DNA found not in the chromosome is extrachromosomal and extrachromosomal plasmids are small, circular DNA fragments that replicate independently of the bacterial chromosome, … plasmids are valuable tools in molecular biology and genetics for recombinant protein production that are used to introduce, manipulate, or delete certain genes from the host cell, AKA sex factors, conjugates, extra chromosomal replicos, or transfer factors. They can replicate independently and are often used in research for ease in identification and isolation — wikipedia internet search) that replicate in the S-phase of the cell cycle (cell replication is a cycle of mitosis where a G1 for gap 1 where RNA is synthesized and organelles of cell made, S is for synthesis phase for DNA replication of 2 identical copies of chromosomes as well as the centrosome for organizing spindle fibers in mitosis, the G2 phase is continued cell growth making additional proteins and organelles with a checkpoint to DNA being error free. The M phase is the mitotic phase of dividing into 2 daughter cells of chromosomes identical to each other and the cytokinesis of cytoplasm and cell membrane diviison into 2 separate cells. This is for human cell division and replication with eurkaryotic cells. Viral replication )
There are 3 known regions of the human chromosome that EBV lives dormant and can be activated to make cancers worse, especially in NPC and GC.
The regions that the human chromosome is susceptible to being hijacked by latent EBV are areas that have low gene populations or gene poor or high gene populations or gene rich and are inactive normally but made active by EBV or that are active normally but made active by EBV that they call the active(A) or inactive(B) compartment switching in EBV+ samples. These are called EBVIR for EBV interacting regions of the human chromosome.
there are 2 histone signatures in H3K27 of H3K27ac and H3K27me3 that are present in EBV- NPC, but the EBV+ NPC samples lacked the H3K27ac signature and only had the H3K27me3 one.
The gastric carcinoma with EBV didn’t show chromatin remodeling but the EBV+ samples did show chromatin remodeling in the study using Omni-C analysis
The number of EBVIR is positively correlated with the number of EBV DNA copy numbers from the study, implying the EBVIR are regions of the chromatin that EBV takes over control of DNA modeling and transcription and cell signaling.
The EBVIR regions that are most affected in all EBV+ samples and xenograph samples of EBV found that these 3 regions or oriP called EBV-oriP-EBVIRs, the EBNA1 intron and oriLyt called the EBV-EBNA1-EBVIRs, and the lncRNA or long non coding region of RNA at RMPS1 called EBV-RMPS1-EBVIRs, that were all 3 regions that did not randomly interact with the host or human chromatin. A finding consistent in all EBV positive and xenograph samples that was confirmed in the in vitro studies.
patient derived xenograph is not a skin sample but a tumor biopy that allows the cancer to grow in a mouse specimen that is a living human substitute that has been made immune deficient.
The depletion of the gene, CTFC, a chromatin modeling gene for its natural healthy structure, was apparent in all EBV infected samples of NPC that were initially uninfected. A large scale loop-loop interaction of chromatin was dysregulated in the EBV+ samples more so than the commercial NPE and the EBV- samples.
A knockdown approach to removing the KDM5B gene in the EBV+ NPC cells showed reduced cancer growth, while this inhibition of KDM5B is one way to reduce NPC tumor growth by EBV, there is also a non-biological substance or treatment that is made in a lab somewhere specifically to inhibit KDM5B and it is called GSK-467, when used on EBV+ NPC and EBVaGC it showed a strong suppression of cancer cell proliferation or ability for tumor cells to duplicate and metastasize in the body. So, there is a treatment for EBV+ nasopharyngeal carcinoma and Gastric carcinoma.
The EBV episomes preferred to be tethered at chromosomes 3, 4, 5, 6, and 13 in regions with few genes and rich in AT DNA sequence nucleotide bonds (adenosine-tyrosine, the other bonded DNA nucleotide pair is cytosine-guanine), noted that the EBVIR regions of EBV-oriP-EBVIRs and EBV-EBNA1-EBVIRs were frequently found interacting on chromosome 1 at 1q31.1-32.1 (we can search this by BLAST or looking up the gene at its chromosomal location to get the chromosome view and see the density of genes or overlap of genes in that viewing window to see what genes are found in that segment of chromosome 1 that are affected by the EBV episome) but this location is where KDM5B is located with binding targets of KDM5B and increased chromatin accessibility (chromatin is normally tightly wrapped and not supposed to be accessible to prevent DNA degradation and damage) for a local enhancer to activate genes normally inactive. These could be genes that create the tumors in NPC as there was a high EBV presence in the NPC cells showing high gene expression of KDM5B and the EBVIR-enhancer-KDM5B signature.
The NCBI chromosome
viewer for KDM5B is seen above in the middle at top of image and the
genes in the less gene dense region of KDM5B chromosomal location on
chromosome 1 on the reverse or complementary strand at 1.q32.1.
a study was used to show the clusters of EBV high and low latent gene expression in cancers that were distinguished by the LMP1 gene (in last project we saw the H101R amino acid change in EBV used in NPC samples that delivered results showing rapid NPC proliferation and differentiation when infected with EBV) and RPMS1 gene that the study gives as one of the gene regions for their EBVIRs results in chromatin remodeling and dysregulation of genes involved in protection from cancer. The KDM5B gene is a site where EBV activates enhancers and promotes transcription in EBV high clusters labeled the EBVIR-enhancer-KDM5B signature that saw upregulated in EBV+ NPC of enhancer activation through EBV- host chromatin interactions.
MK167 for increased proliferative capacity with increased expression of MYC and BCL2 seen in increased EBVIR-enhancer-KDM5B and in as many as 20% of the total NPC epithelial cells of the ten subclusters of NPC epithelial cells found when using in situ hybridisation probes to target LMP1 and RPMS1 to further confirm the association of latent EBV infection in activation of EBVIR-enhancer-KDM5B signals, had high levels of CD74 that is a MHC II chaperone or major histocompatibility complex class MHC II chaperone, CIITA, a MHC class II transactivator, as well as LGALS9 (Galectin-9) and CD47 (surface glycoprotein that signals not to eat it) immunosuppressive genes all confirmed with scRNA-seq data. The vascular cell adhesion molecule, VCAM1, a gene associated with aggressiveness and metastasis of cancer genes was also upregulated significantly in same cells. Done with NanoString Spatial Molecular Imaging (SMI platform)
With SMI spatial imaging there were 2 of the ten subclusters of NPC epithelial cells havig strong epigenetic dysregulation also had an involvement with the Polycomb Complex 2 (PRC2) key member EZH2 and the DNA methyltransferase DNMT1 were more likely to have broken apart in the stromal area (crucial to understading tumor microenvironment and impact of cancer progression - internet search of stromal in spatial analysis, also ’stroma rich ECM found in pancreatic cancer, must mean the majority of the tumor material) of figure 4a and 4c. This observation in research probably means that the tumor is made up of dysregulated EZH2 and DNMT1 involved in the Polycomb Complex 2, and methylation of DNA, RNA, and protein groups.
Immunohistochemistry staining was done to evaluate the KDM5B expression level of the clinical specimen and found that KDM5B and a low number of tertiary lymphoid structures or TLS (tertiary lymphoid structures are organized immune cell aggregates in non-lymphoid tissue that play a key role in local adaptive immunity and cancer prognosis usually or often in response to chronic inflammation or tumors, unlike secondary lymphoid structures such as lymph nodes, TLS form in situ inside affected tissues and serve as hubs for antigen presentation and adaptive immune activation which contributes to anti-tumor immunity — internet search) were more likely to develop metastasis after conventional treatment.
testing their hypothesis that aberrant activation of KDM5B and binding targets to KDM5B through enhancer activation at EBVIRs is associated with the aggressiveness of cancer cells in NPC and that patients with a strong EBVIR-enhancer-KDM5B signature are likely to have poor clinical outcomes, they used the clinical NPC patients totalling 77 from Hong Kong, China and 100 from Guanghzou, China. All samples have NPC but 39 of the 77 have metastatic NPC and 36 from the 100. The others have 3-year distant metastasis-free survival or DMFS NPC. None of those samples had treatment of conventional chemotherapy labeled treatment-naive samples. The study carried out this test with a multicentre and multiomics stuty of somatic mutations and chromosomal instability (CIN) in NPC with whole-exome sequencing (WEC) to see the expression of the EBVIR-enhancer-KDM5B signature and oncogenic pathway activity using RNA sequencing of the NPC samples before and after conventional chemotherapy but not mentioned findings of note or any findings of results after treatment with chemotherapy or if in situ treatment of chemotherapy as the samples are all treatment-naive and they already know the 3 year DMFS. Table 1 of report summarizes the outcome and will see if that shows the treatment after conventional chemotherapy. The end of the report has sections for each study and in it, there is a statement that all enrolled patients received their ‘first-line treatments according to NCCN guidelines.’ So, all these patients had already received their first treatment of chemotherapy using their subject year of enrollment in 2011 given that information, but internet search says that the National Comprehensive Cancer Network (NCCN) is the NCCN clinical Practice Guidelines in Oncology that are based on anual changes that support most recent evidence with expert medical opinion from interdisciplinary panel of clinicians and researchers.
They found their hypothesis validated by finding metastatic genes significantly enriched for KDM5B and MYC binding targets using spatial analysis of whole-transcriptome data. They did a GSVA or gene set variation analysis that showed elevation of KDM5B in MET vs non-MET NCP and normal controls. They found their signature marker of EBVIR-enhancer-KDM5B highly correlated with the KDM5B targets derived from independent analysis of all 77+100 samples of NPC from those two regions in China using a univariate and multivariate logistic regression model in figure 9b (only goes up to figure 8 in my pdf copy but there is a table of primers used in the knockout of the KDM5B gene to study effects) and Data 4 supplementary external link. Results from the table of the MET and non-MET NCP clinical samples showed that high KDM5B pathway activity is a high risk for metastasis, but separately even without the KDM5B pathway activity, so is tumor stage and plasma EBV DNA copy number before and after treatment, but together prediction accuracy improves in predicting metastasis or MET with NPC. Stage 4 or late stage NPC patients with high activity in the KDM5B pathway have a very high likelihood of Metastasis.
KDM5B regulates the genes in the EBVIR-enhancer-KDM5B signature that controls for viral life cycle, regulation of viral entry into host cells, and oxidative stress-induced cell death. The KDM5B binding signals on the EBV genome were examined to investigate the role of KDM5B in regulating the EBV life cycle. EBNA1 when bound to the EBV-oriP region of KDM5B it allows persistent and stable EBV infection. When interaction with EBNA1 bindings, KDM5B, and H3H27ac signals are localized to the EBV-oriP region, the latency of EBV is regulated by KDM5B in making an EBV negative host chromatin interact to maintain the EBV genome. The knockdown (KD) of KDM5B labeled KDM5B-KD showed a reduction in cellular EBV copies showing that with KDM5B present, then EBV has the ability to hijack the KDM5B region of DNA to promote cancer cell growth and maintain its survival in a cellular context.
The knockdown of KDM5B, KDM5B-KD, showed that the Differentially expressed genes in knockdown versus WT or ?wild type? that an expression of VCAM1 and VEGFA, genes revealed in spatial analysis imaging to have strong stroma or central regional changes in the tumors, were suppressed by knocking out this gene KDM5B we call KDM5B-KD.
VCAM1 is an encoding gene for vascular cell adhesion molecule 1 known to play a significant role in various cancers, promotes cell invasion, metastasis, and cellular immune response.
VEGFA encodes vascular endothelial growth factor A and is crucial in metastasis by promoting tumour angiogenesis and inhibiting T-cell function.
other genes that KDM5B-KD repressed or inhibited were also highly involved in metastasis related pathways like hedgehog signaling, epithelial-mesenchymal transition (EMT), and hypoxia.
KDM5B regulated targets were enriched for several transcription factors (TFs), such as BRD4 (this is either separately done or done when KDM5B-KD was active in knocking down KDM5B).
BRD4 is an encoding Bromodomain Containing 4, that is an enhancer epigenomic reader that regulates histone acetylation to control transcriptions for cancer cell proliferation.
JQ1, a BRD4 inhibitor, was investigated as a therapeutic for NPC as it is known to inhibit the enrichment of BRD4 at enhancer sites, as BRD4 was enriched when KDM5B was knocked down by KDM5B-KD. Treatment of NPC by JQ1 delivered results showing higher efficacy than the(non-biological) KDM5B inhibitor (called GSK-467 that was used most recently before this portion of report) in the process of inhibiting cancer cell growth in both EBV+NPC and in EBVaGC.
EBV in NPC is mainly associated with type II latency that expresses LMP1, LMP2A, EBNA1, EBERs1/2, and BART transcripts, but EBV infection in gastric cancer is associated with type I latency EBV. This suggests that the ability of EBV to direct cancer-specific compartment switching on and off of genes related to tumor suppression or tumor cell apoptosis depends on the cell and tissue specificity or type.
Even though EBVaGC and EBV+NPC are not from same latency groups of EBV, they seem to use a similar docking mechanism to the host chromosome that leads to enhancer infestation that promotes tumour growth in both EBVaGC and in EBV+NPC as KMD5B-KD showed that both types of cancer cell growth were suppressed when used to knockdown or knockout KMD5B gene.
the KDM5B and EBV-enhancer-KDM5B signature are affective in the G2/Mitosis checkpoint of cell replication that occurs after the initial checks for DNA damage are approved by cell regulators and regulators of cell apoptosis or cell death. This is also where G2/M involves the spindle process of cell replication of mitosis where the chromosomes split into identical sister chromatids. This makes since as the chromatids haven’t began wrapping around the histones of chromosomes to protect the DNA and where a viral infection can occur, such as EBV in cell replication. (The knockdown of KDM5B when done with JQ1 was to prevent the BRD4 protein from acetylating histones that are what the chromatin is looped around to protect DNA).
of the 3 regions of host chomosomes that EBV interacts as EBVIR, EBV-oriP-EBVIR is where KDM5B is located, between the other 2 EBVIR or EBV-EBNA1-EBVIR and EBV-RPMS1-EBVIR, the EBV-EBNA1-EBVIR interacting region is a region that covers oriLyt, a lytic origin of DNA replication by BZLF1. Lytic is to lyse which is to rupture the cell wall or membrane, caused by BZLF1 (BZLF1 is a pioneer factor that reverses epigenetic silencing of viral DNA to allow escape from latency … related to EBV in genome wide analyses of Zta binding to EBV genome — internet search). The EBV-EBNA1-EBVIR may facilitate insulation of latent and lytic gene expression of EBV interaction with host chromosome.
There is elevated expression of CIITA a class II transcriptional activator that interacts with the enhanceosome complex of MHC-II genes, in cancer cells with high expression of the EBV-enhancer-KDM5B signature. This is an interaction sufficient enough to induce the expression of MHC class II genes in MHC-II deficient cells that lack interferon-gamma stimulation.
studies have shown that in vitro EBV infection of pseudostratified epithelium in a 3D primary nasopharyngeal air-liquid interface cell culture model can induce expression of MHC-II genes, including CD74, in suprabasal cells.
CD74 as an invariant CD74 chain physically mediates the subcellular trafficking of the MHC-II complex to promote the expression of antigen-bound MHC-II molecules on antigen presenting cells (APC) such as B cells, dendritic cells, and macrophages. When MHC-II molecules are on APCs, they evoke adaptive T-cell immunity by boosting CD4 helper T cells and tumour-reactive CD8+ T cells.
Previous study using scRNA-seq NPC data found that when there is a dual epithelial-immune features of NPC cells having high expression of MHC-II molecule HLA-DR, that these cells are highly proliferative and can reduce the cytotoxicity of the CD8+ T cells and result in a poor prognosis of NPC for the patient. This study found similar results in the dual epithelial immune features expressed with a strong EBVIR-enhancer-DM5B (not typo, they now exchange the signature name to be EBVIR-enhancer-DM5B signature instead of EBVIR-enhancer-KDM5B signature in this section of the MHC-II discussion) having similar dual epithelial immune features of expressing multiple immunosuppressive molecules of CD47 and Galectin-9 (LGALS9) among others unnamed. The transcription of these genes and other MHC-II genes was unaltered when knocking down KDM5B with KDM5B-KD. They suggest the elevation of these genes unaffected by KDM5B-KD are independent of KDM5B regulation in NPC.
using last information tidbit, these data indicate that a high-order epigenotype change associated with EBV-host chromatin interactions promote tumour growth, and increase the epigenetic plasticity of epithelial cells (epigenetic plasticity is the dynamic and reversible changes in gene expression that occur without altering the underlying DNA sequence, but allows organisms to adapt to environmental changes — internet search, similar to neuroplasticity in forming habits and repeating bad and good habits) with immune cell features for immune evasion in NPC (immune cells leaving or exiting defensive role in NPC cells, but an internet search says it is how pathogens outsmart immne system to survive and replicate through various strategies), which likely contributes to the increased risk of metastasis in NPC.
spatial data analysis imaging of cancer cells with elevated EBVIR-enhancer-KDM5B signature induce strong DNA damage stress, and confirmed in the multicentre and multiomics study where those patients having that signature have noticeable chromosome instability and more likely to develop metastasis after conventional treatment.
EBNA1 binds fragile DNA on human chromosome 2 that triggers the breakage of DNA in a dose dependent manner which induces genomic instability and alters the copy number of tumour suppressor ATM and proto-oncogene MLL after cell division which may lead to tumorigenesis (research article by Li et. al resourced in this study)
EBV particle induce centromere amplification and chromosomal instability in a B-cell model, and KDM5B is a key factor in response to DNA damage via recruitment of Ku70 and BRCA1 (research article by Shumilov et al. resourced in this study)
this study provides more proof of DNA damage and damage to DNA repair machinery through cross-species chromatin interactions and hijacking the KDM5B function.
This study found that when combined factors of having the EBVIR-enhancer-KDM5B signature as well as plasma EBV DNA copy number at diagnosis and treatment and the overall stage of disease progression of NPC, that the 5 year survival rate is reduced in those with the signature and all other factors remaining constant for the other 2 features, from an expected survival of 80-90% to a new much lower expected survival rate of 20-30% in 5 years.
Even with immunotherapy, metastasis remains the leading cause of NPC mortality with a median progression free survival of only 1 year. They recommend using the JQ1 inhibition of BRD4 in transcribing the EBV enhanced genes responsible for the NPC progression as it shows better results in vitro in suppressing tumour growth and proliferation of EBV+NPC cancer cells.
As you can see the study was very in depth and has many bits of information to find and analyse, none of the 27 contributors slacked off as there were many different directions this study took to prove that their hypothesis of chromatin loop-loop disruption by EBV latent type II was a pathway for metastatic NPC. Many explanatory diagrams and plots of the analysis, as well as many different sources of data and methods to analyzing the various data. The study seemed to expand on another study they referenced that found KDM5B as a region of further study in EBV associated NPC by Shumilov et al. Reasonable inferences and understandings of the findings or results of their tests and analysis. The figures provided used many machine learning tools like unsupervised learning in the heirarchical clustering to find compartments that switch on and off in EBV+NPC and EBV-NPC samples as well as the NPE samples. It also used supervised learning of univariate and multivariate logistic regression, and differential expression of genes with DESeq. Some heat maps and clustering with UMap were done as well that is also unsupervised machine learning. This study took me a few hours last night and a whole work day today of about 6 hours to annotate and grab information from without starting the analysis of the genes and gene targets to compare in the study, but we do have an idea of the genes to target and look for in comparing to the top upregulated and downregulated genes in our study but making a special list for those genes in the study publication that were found to affect latent EBV infection and poor prognosis in NPC and also GC. For one it is the chromosome regions of interaction mentioned as being resorts for EBV to land and comingle and change the human host chromosome at EBNA1, OriP, and RPMS1, but also other genes had affects of interest in this study: KDM5B, LMP1, H3K27 histone signatures of ac and me3 postscripts, CTFC, JQ1, BRD4, BRCA1, LGALS9, CD47, VCAM1, VEGFA, MK167, MYC, BCL2, BZLF1, ATM, MLL, Ku70, CD74, and MK167.
The data we use here is from GSE299775 is for the spatial transcriptome data that was used in GEOMX-DSP portion of their study analysis and research. There are 28 samples on the page and in the series data information.
series <- read.table('GSE299775_series_matrix.txt', header=T, comment.char='!')
colnames(series)
## [1] "ID_REF" "GSM9045905" "GSM9045906" "GSM9045907" "GSM9045908"
## [6] "GSM9045909" "GSM9045910" "GSM9045911" "GSM9045912" "GSM9045913"
## [11] "GSM9045914" "GSM9045915" "GSM9045916" "GSM9045917" "GSM9045918"
## [16] "GSM9045919" "GSM9045920" "GSM9045921" "GSM9045922" "GSM9045923"
## [21] "GSM9045924" "GSM9045925" "GSM9045926" "GSM9045927" "GSM9045928"
## [26] "GSM9045929" "GSM9045930" "GSM9045931" "GSM9045932"
Other gene expression data related to their overall study and published work is found in:
GSE274581 is the Omni-C data for the NPC epithelial commercial line called NPE, the EBV+ NPC or the NPC infected with EBV, and the EBV- NPC or the NPC not infected with EBV.
GSE278605 is for the ATAC-seq data for same samples of NPE, EBV+ NPC, and EBV- NPC.
GSE278607 is the CUT&RUN seq data for same samples above but H3K4me3, CTCF, and H3K27ac.
GSE299776 is the spatial transcriptome data of same samples for the 10X visum platform.
GSE150825 & GSE162025 is the source where the single cell RNA sequence or scRNA-seq data for NPC was resourced and used for comparison in this study. This could be the clinical samples from the Hong Kong and Guangzhou regions of China with NPC.
======================
Lets read in the count matrix data and series matrix data.
totalSeriesInfo <- read.table("GSE299775_series_matrix.txt",
header=FALSE,
sep=' ',
nrows=73)
head(totalSeriesInfo,10)
## V1
## 1 !Series_title\tSpatial transcriptome profiling for nasopharyngeal carcinoma [GeoMx-DSP]
## 2 !Series_geo_accession\tGSE299775
## 3 !Series_status\tPublic on Jun 14 2025
## 4 !Series_submission_date\tJun 13 2025
## 5 !Series_last_update_date\tAug 13 2025
## 6 !Series_pubmed_id\t40759888
## 7 !Series_summary\tAs the first human DNA tumour virus Epstein-Barr virus (EBV) is detectable in over 90% of nasopharyngeal carcinoma (NPC) patients in the endemic regions. Genome-wide analysis of 3D chromosome conformation revealed the virus-host chromatin interactions induce spatial reorganisation of loops and compartments, exhibiting “enhancer infestation” and “H3K27 bivalency” at EBV interaction regions (EBVIRs). Through this mechanism, EBV hijacks KDM5B, a genome stability gatekeeper, and its binding targets leading to aberrant activation of the EBVIRs-enhancer-KDM5B signature. The cancer cells with this signature had increased MYC activation, DNA damage responses, and epigenetic plasticity for epithelial-immune cell dual features with metastatic potential. Our multi-centre multi-omics study further confirmed that this signature was highly relevant to chromosome instability and could be utilised as a risk factor for distant metastasis. This study highlights a novel paradigm where latent viral episomes could alter host high-order epigenotype, promoting transcriptional rewiring and risk of metastasis in NPC.
## 8 !Series_overall_design\tSpatial transcriptome data were generated from primary NPC patients. Patients with NPC who developed distant metastasis within three years after treatment were classified as metastatic NPC (MET-NPC), while those without early metastasis were designated as non-metastatic NPC (nonMET-NPC).
## 9 !Series_type\tOther
## 10 !Series_contributor\tWei,,Dai
It looks like the sample type is in row 40, but other information we could use is in the series data.
Lets just read in row 40 by itself for sample type.
sampleType <- read.table('GSE299775_series_matrix.txt', skip=40, nrow=1, header=F)
sampleType
## V1 V2 V3 V4
## 1 !Sample_characteristics_ch1 cell type: MET cell type: MET cell type: nonMET
## V5 V6 V7 V8 V9
## 1 cell type: nonMET cell type: MET cell type: MET cell type: MET cell type: MET
## V10 V11 V12 V13
## 1 cell type: MET cell type: MET cell type: MET cell type: nonMET
## V14 V15 V16 V17
## 1 cell type: nonMET cell type: MET cell type: MET cell type: Normal
## V18 V19 V20 V21
## 1 cell type: Normal cell type: Normal cell type: nonMET cell type: nonMET
## V22 V23 V24 V25
## 1 cell type: Normal cell type: nonMET cell type: nonMET cell type: MET
## V26 V27 V28 V29
## 1 cell type: MET cell type: MET cell type: nonMET cell type: nonMET
Lets save this as a string the sample type.
sampleName <- as.character(sampleType[1,])
sampleName
## [1] "!Sample_characteristics_ch1" "cell type: MET"
## [3] "cell type: MET" "cell type: nonMET"
## [5] "cell type: nonMET" "cell type: MET"
## [7] "cell type: MET" "cell type: MET"
## [9] "cell type: MET" "cell type: MET"
## [11] "cell type: MET" "cell type: MET"
## [13] "cell type: nonMET" "cell type: nonMET"
## [15] "cell type: MET" "cell type: MET"
## [17] "cell type: Normal" "cell type: Normal"
## [19] "cell type: Normal" "cell type: nonMET"
## [21] "cell type: nonMET" "cell type: Normal"
## [23] "cell type: nonMET" "cell type: nonMET"
## [25] "cell type: MET" "cell type: MET"
## [27] "cell type: MET" "cell type: nonMET"
## [29] "cell type: nonMET"
Remove the ‘cell type:’ from our character string of sample type to combine to our GSM sample IDs in our series dataset of header sample markers only.
sampleNames <- gsub('cell type: ', '',sampleName)
sampleNames
## [1] "!Sample_characteristics_ch1" "MET"
## [3] "MET" "nonMET"
## [5] "nonMET" "MET"
## [7] "MET" "MET"
## [9] "MET" "MET"
## [11] "MET" "MET"
## [13] "nonMET" "nonMET"
## [15] "MET" "MET"
## [17] "Normal" "Normal"
## [19] "Normal" "nonMET"
## [21] "nonMET" "Normal"
## [23] "nonMET" "nonMET"
## [25] "MET" "MET"
## [27] "MET" "nonMET"
## [29] "nonMET"
Lets look at the header of series.
header <- as.character(colnames(series))
header
## [1] "ID_REF" "GSM9045905" "GSM9045906" "GSM9045907" "GSM9045908"
## [6] "GSM9045909" "GSM9045910" "GSM9045911" "GSM9045912" "GSM9045913"
## [11] "GSM9045914" "GSM9045915" "GSM9045916" "GSM9045917" "GSM9045918"
## [16] "GSM9045919" "GSM9045920" "GSM9045921" "GSM9045922" "GSM9045923"
## [21] "GSM9045924" "GSM9045925" "GSM9045926" "GSM9045927" "GSM9045928"
## [26] "GSM9045929" "GSM9045930" "GSM9045931" "GSM9045932"
Combine these into a data frame.
dataIDs <- data.frame(GSM_ID = header, sampleType=sampleNames)
dataIDs
## GSM_ID sampleType
## 1 ID_REF !Sample_characteristics_ch1
## 2 GSM9045905 MET
## 3 GSM9045906 MET
## 4 GSM9045907 nonMET
## 5 GSM9045908 nonMET
## 6 GSM9045909 MET
## 7 GSM9045910 MET
## 8 GSM9045911 MET
## 9 GSM9045912 MET
## 10 GSM9045913 MET
## 11 GSM9045914 MET
## 12 GSM9045915 MET
## 13 GSM9045916 nonMET
## 14 GSM9045917 nonMET
## 15 GSM9045918 MET
## 16 GSM9045919 MET
## 17 GSM9045920 Normal
## 18 GSM9045921 Normal
## 19 GSM9045922 Normal
## 20 GSM9045923 nonMET
## 21 GSM9045924 nonMET
## 22 GSM9045925 Normal
## 23 GSM9045926 nonMET
## 24 GSM9045927 nonMET
## 25 GSM9045928 MET
## 26 GSM9045929 MET
## 27 GSM9045930 MET
## 28 GSM9045931 nonMET
## 29 GSM9045932 nonMET
We have 28 samples of MET, nonMET, and Normal. Lets see the breakdown by sampleType.
dataID <- dataIDs[c(2:29),]
table(dataID$sampleType)
##
## MET nonMET Normal
## 14 10 4
Our data, dataID, has 14 MET, 10 nonMET, and 4 Normal sample types.
dataID2 <- dataID[order(dataID$sampleType, decreasing=T),]
dataID2
## GSM_ID sampleType
## 17 GSM9045920 Normal
## 18 GSM9045921 Normal
## 19 GSM9045922 Normal
## 22 GSM9045925 Normal
## 4 GSM9045907 nonMET
## 5 GSM9045908 nonMET
## 13 GSM9045916 nonMET
## 14 GSM9045917 nonMET
## 20 GSM9045923 nonMET
## 21 GSM9045924 nonMET
## 23 GSM9045926 nonMET
## 24 GSM9045927 nonMET
## 28 GSM9045931 nonMET
## 29 GSM9045932 nonMET
## 2 GSM9045905 MET
## 3 GSM9045906 MET
## 6 GSM9045909 MET
## 7 GSM9045910 MET
## 8 GSM9045911 MET
## 9 GSM9045912 MET
## 10 GSM9045913 MET
## 11 GSM9045914 MET
## 12 GSM9045915 MET
## 15 GSM9045918 MET
## 16 GSM9045919 MET
## 25 GSM9045928 MET
## 26 GSM9045929 MET
## 27 GSM9045930 MET
Lets remove the unused variables that clutter or environment space.
rm(dataIDs, sampleType,series, header, sampleInfo, sampleName, sampleNames)
## Warning in rm(dataIDs, sampleType, series, header, sampleInfo, sampleName, :
## object 'sampleInfo' not found
ls()
## [1] "dataID" "dataID2" "totalSeriesInfo"
Now lets read in the count data provided by this GSE299775 study.
counts <- read.table('GSE299775_count_matrix.txt', header=T)
str(counts)
## 'data.frame': 18677 obs. of 29 variables:
## $ ID_REF : chr "TGIF2LY" "OR5B21" "SUGCT" "GABARAP" ...
## $ DSP.1001660020631.H.A03: num 0.71 2.84 0.71 22.73 15.63 ...
## $ DSP.1001660020631.H.A05: num 1.07 4.26 1.07 18.12 10.66 ...
## $ DSP.1001660020631.H.A07: num 4.26 4.26 4.26 4.26 12.79 ...
## $ DSP.1001660020631.H.A09: num 4.26 4.26 4.26 38.36 4.26 ...
## $ DSP.1001660020631.H.A12: num 0.95 0.95 0.95 23.68 3.79 ...
## $ DSP.1001660020631.H.B02: num 0.53 2.66 0.53 28.77 13.32 ...
## $ DSP.1001660020631.H.B04: num 4.65 2.33 1.55 24.03 3.1 ...
## $ DSP.1001660020631.H.B06: num 1.79 5.16 1.79 11.44 9.65 ...
## $ DSP.1001660020631.H.B08: num 1.71 3.41 5.97 32.4 14.49 ...
## $ DSP.1001660020631.H.B10: num 1.89 4.74 2.84 18.94 14.21 ...
## $ DSP.1001660020631.H.B12: num 3.88 4.65 3.88 30.23 13.18 ...
## $ DSP.1001660020631.H.C02: num 1.71 6.82 1.71 28.99 10.23 ...
## $ DSP.1001660020631.H.C03: num 1.42 4.26 0.71 46.18 10.66 ...
## $ DSP.1001660020631.H.C06: num 1.22 3.65 1.22 13.4 8.53 2.44 1.22 3.65 1.22 3.65 ...
## $ DSP.1001660020631.H.C10: num 4.26 3.2 1.07 21.31 11.72 ...
## $ DSP.1001660020631.H.C12: num 1.07 2.13 2.13 28.77 4.26 ...
## $ DSP.1001660020631.H.D01: num 2.56 0.85 0.85 28.13 2.56 ...
## $ DSP.1001660020631.H.D02: num 2.33 3.1 1.55 33.33 3.88 ...
## $ DSP.1001660020631.H.D11: num 1.1 1.65 0.83 30.53 12.38 ...
## $ DSP.1001660020631.H.E01: num 0.78 2.33 1.55 18.6 13.95 ...
## $ DSP.1001660020631.H.E02: num 0.74 1.11 1.85 37.07 2.59 ...
## $ DSP.1001660020631.H.E05: num 4.26 4.26 4.26 29.84 17.05 ...
## $ DSP.1001660020631.H.E07: num 1.48 1.85 1.11 25.95 18.9 ...
## $ DSP.1001660020631.H.E09: num 2.13 2.84 3.55 29.84 11.37 ...
## $ DSP.1001660020631.H.E11: num 3.1 7.75 0.78 18.6 11.63 ...
## $ DSP.1001660020631.H.F01: num 1.29 2.84 1.03 22.48 21.7 ...
## $ DSP.1001660020631.H.F03: num 1.71 5.12 1.71 15.35 6.82 ...
## $ DSP.1001660020631.H.F05: num 1.71 1.71 3.41 25.58 10.23 ...
It is nice to have the IDs ordered by class sample type, but the data is not ordered so we kept the original dataID that removed the ID_REF feature to keep only the GSM ID next to sample type.
colnames(counts)
## [1] "ID_REF" "DSP.1001660020631.H.A03"
## [3] "DSP.1001660020631.H.A05" "DSP.1001660020631.H.A07"
## [5] "DSP.1001660020631.H.A09" "DSP.1001660020631.H.A12"
## [7] "DSP.1001660020631.H.B02" "DSP.1001660020631.H.B04"
## [9] "DSP.1001660020631.H.B06" "DSP.1001660020631.H.B08"
## [11] "DSP.1001660020631.H.B10" "DSP.1001660020631.H.B12"
## [13] "DSP.1001660020631.H.C02" "DSP.1001660020631.H.C03"
## [15] "DSP.1001660020631.H.C06" "DSP.1001660020631.H.C10"
## [17] "DSP.1001660020631.H.C12" "DSP.1001660020631.H.D01"
## [19] "DSP.1001660020631.H.D02" "DSP.1001660020631.H.D11"
## [21] "DSP.1001660020631.H.E01" "DSP.1001660020631.H.E02"
## [23] "DSP.1001660020631.H.E05" "DSP.1001660020631.H.E07"
## [25] "DSP.1001660020631.H.E09" "DSP.1001660020631.H.E11"
## [27] "DSP.1001660020631.H.F01" "DSP.1001660020631.H.F03"
## [29] "DSP.1001660020631.H.F05"
Lets use those sample type IDs with GSM to rename the counts data column names into the GSM ID with sample type.
dataID$ID_name <- paste(dataID$GSM_ID,dataID$sampleType, sep='_')
dataID
## GSM_ID sampleType ID_name
## 2 GSM9045905 MET GSM9045905_MET
## 3 GSM9045906 MET GSM9045906_MET
## 4 GSM9045907 nonMET GSM9045907_nonMET
## 5 GSM9045908 nonMET GSM9045908_nonMET
## 6 GSM9045909 MET GSM9045909_MET
## 7 GSM9045910 MET GSM9045910_MET
## 8 GSM9045911 MET GSM9045911_MET
## 9 GSM9045912 MET GSM9045912_MET
## 10 GSM9045913 MET GSM9045913_MET
## 11 GSM9045914 MET GSM9045914_MET
## 12 GSM9045915 MET GSM9045915_MET
## 13 GSM9045916 nonMET GSM9045916_nonMET
## 14 GSM9045917 nonMET GSM9045917_nonMET
## 15 GSM9045918 MET GSM9045918_MET
## 16 GSM9045919 MET GSM9045919_MET
## 17 GSM9045920 Normal GSM9045920_Normal
## 18 GSM9045921 Normal GSM9045921_Normal
## 19 GSM9045922 Normal GSM9045922_Normal
## 20 GSM9045923 nonMET GSM9045923_nonMET
## 21 GSM9045924 nonMET GSM9045924_nonMET
## 22 GSM9045925 Normal GSM9045925_Normal
## 23 GSM9045926 nonMET GSM9045926_nonMET
## 24 GSM9045927 nonMET GSM9045927_nonMET
## 25 GSM9045928 MET GSM9045928_MET
## 26 GSM9045929 MET GSM9045929_MET
## 27 GSM9045930 MET GSM9045930_MET
## 28 GSM9045931 nonMET GSM9045931_nonMET
## 29 GSM9045932 nonMET GSM9045932_nonMET
Rename the counts data set header of column names.
colnames(counts)[2:29] <- dataID$ID_name
colnames(counts)
## [1] "ID_REF" "GSM9045905_MET" "GSM9045906_MET"
## [4] "GSM9045907_nonMET" "GSM9045908_nonMET" "GSM9045909_MET"
## [7] "GSM9045910_MET" "GSM9045911_MET" "GSM9045912_MET"
## [10] "GSM9045913_MET" "GSM9045914_MET" "GSM9045915_MET"
## [13] "GSM9045916_nonMET" "GSM9045917_nonMET" "GSM9045918_MET"
## [16] "GSM9045919_MET" "GSM9045920_Normal" "GSM9045921_Normal"
## [19] "GSM9045922_Normal" "GSM9045923_nonMET" "GSM9045924_nonMET"
## [22] "GSM9045925_Normal" "GSM9045926_nonMET" "GSM9045927_nonMET"
## [25] "GSM9045928_MET" "GSM9045929_MET" "GSM9045930_MET"
## [28] "GSM9045931_nonMET" "GSM9045932_nonMET"
lets look at the top few rows of this data.
head(counts)
## ID_REF GSM9045905_MET GSM9045906_MET GSM9045907_nonMET GSM9045908_nonMET
## 1 TGIF2LY 0.71 1.07 4.26 4.26
## 2 OR5B21 2.84 4.26 4.26 4.26
## 3 SUGCT 0.71 1.07 4.26 4.26
## 4 GABARAP 22.73 18.12 4.26 38.36
## 5 CAD 15.63 10.66 12.79 4.26
## 6 TAS2R1 0.71 3.20 4.26 4.26
## GSM9045909_MET GSM9045910_MET GSM9045911_MET GSM9045912_MET GSM9045913_MET
## 1 0.95 0.53 4.65 1.79 1.71
## 2 0.95 2.66 2.33 5.16 3.41
## 3 0.95 0.53 1.55 1.79 5.97
## 4 23.68 28.77 24.03 11.44 32.40
## 5 3.79 13.32 3.10 9.65 14.49
## 6 1.89 1.60 3.10 1.57 0.85
## GSM9045914_MET GSM9045915_MET GSM9045916_nonMET GSM9045917_nonMET
## 1 1.89 3.88 1.71 1.42
## 2 4.74 4.65 6.82 4.26
## 3 2.84 3.88 1.71 0.71
## 4 18.94 30.23 28.99 46.18
## 5 14.21 13.18 10.23 10.66
## 6 0.95 0.78 1.71 2.84
## GSM9045918_MET GSM9045919_MET GSM9045920_Normal GSM9045921_Normal
## 1 1.22 4.26 1.07 2.56
## 2 3.65 3.20 2.13 0.85
## 3 1.22 1.07 2.13 0.85
## 4 13.40 21.31 28.77 28.13
## 5 8.53 11.72 4.26 2.56
## 6 2.44 1.07 2.13 1.71
## GSM9045922_Normal GSM9045923_nonMET GSM9045924_nonMET GSM9045925_Normal
## 1 2.33 1.10 0.78 0.74
## 2 3.10 1.65 2.33 1.11
## 3 1.55 0.83 1.55 1.85
## 4 33.33 30.53 18.60 37.07
## 5 3.88 12.38 13.95 2.59
## 6 0.78 0.83 3.10 2.59
## GSM9045926_nonMET GSM9045927_nonMET GSM9045928_MET GSM9045929_MET
## 1 4.26 1.48 2.13 3.10
## 2 4.26 1.85 2.84 7.75
## 3 4.26 1.11 3.55 0.78
## 4 29.84 25.95 29.84 18.60
## 5 17.05 18.90 11.37 11.63
## 6 4.26 2.22 3.55 3.10
## GSM9045930_MET GSM9045931_nonMET GSM9045932_nonMET
## 1 1.29 1.71 1.71
## 2 2.84 5.12 1.71
## 3 1.03 1.71 3.41
## 4 22.48 15.35 25.58
## 5 21.70 6.82 10.23
## 6 2.33 1.71 1.71
The data counts were collected using filtering, normalization, and the top 25th percentile according to row 47 of the series information.
detailsCounts <- totalSeriesInfo[47,]
details <- strsplit(detailsCounts, split='\t')
details[[1]][2]
## [1] "After quality control, batch effect removal, and Q3 normalisation using the top 25% of expressers to normalize across region of interest (ROI). The counts of the oligonucleotide tags were finally used to calculate the abundance of the corresponding mRNAs."
This is total RNA (row 41 of totalSeriesInfo) data of primary nasopharyngeal carcinoma or NPC.
This is the top 25% of expressed genes after quality control filtering, normalizing, and then calculating the abundance of the corresponding mRNAs from the oligonucleotide strand counts.
Lets make a list of those genes we want from the study before finding our top genes expressed in primary nasopharyngeal carcinoma from fold change data in this counts data set.
We have an idea of the genes to target and look for in comparing to the top upregulated and downregulated genes in our study but making a special list for those genes in the study publication that were found to affect latent EBV infection and poor prognosis in NPC and also GC. For one it is the chromosome regions of interaction mentioned as being resorts for EBV to land and comingle and change the human host chromosome at EBNA1, OriP, and RPMS1, but also other genes had affects of interest in this study: KDM5B, LMP1, H3K27 histone signatures of ac and me3 postscripts, CTFC, JQ1, BRD4, BRCA1, LGALS9, CD47, VCAM1, VEGFA, MK167, MYC, BCL2, BZLF1, ATM, MLL, Ku70, CD74, and MK167.
listed <- c("EBNA1", "oriP", "RPMS1", "KDM5B", "LMP1", "H3K27","H3K27ac", "H3K27me3", "CTFC", "JQ1", "BRD4", "BRCA1", "LGALS9", "CD47", "VCAM1", "VEGFA", "MK167", "MYC", "BCL2", "BZLF1", "ATM", "MLL", "Ku70", "CD74", "MK167")
Lets see if any of these genes as they are written are in this data by ID_REF in the counts data.
Listed <- counts[counts$ID_REF %in% listed,]
Listed
## ID_REF GSM9045905_MET GSM9045906_MET GSM9045907_nonMET GSM9045908_nonMET
## 5499 VEGFA 33.39 15.98 21.31 59.68
## 9038 BCL2 24.86 26.64 4.26 25.58
## 9291 BRCA1 8.53 8.53 4.26 4.26
## 9642 ATM 5.68 6.39 12.79 17.05
## 10292 MYC 56.12 88.45 17.05 12.79
## 11188 VCAM1 7.81 22.38 21.31 12.79
## 12676 KDM5B 12.08 8.53 8.53 4.26
## 13596 CD47 15.63 12.79 12.79 25.58
## 14196 BRD4 53.99 21.31 12.79 29.84
## 15400 CD74 363.74 288.79 865.30 980.39
## GSM9045909_MET GSM9045910_MET GSM9045911_MET GSM9045912_MET
## 5499 32.21 32.50 11.63 28.72
## 9038 4.74 9.06 10.85 6.95
## 9291 4.74 7.46 3.88 11.44
## 9642 3.79 4.26 7.75 4.49
## 10292 35.05 44.22 6.98 72.01
## 11188 34.10 57.54 17.05 70.00
## 12676 11.37 7.46 21.70 7.18
## 13596 42.63 10.12 12.40 21.09
## 14196 23.68 31.44 28.68 20.19
## 15400 290.80 1113.59 564.21 447.34
## GSM9045913_MET GSM9045914_MET GSM9045915_MET GSM9045916_nonMET
## 5499 52.86 11.37 13.18 8.53
## 9038 35.81 13.26 14.73 11.94
## 9291 8.53 8.53 3.88 1.71
## 9642 3.41 0.95 6.20 3.41
## 10292 100.60 16.10 79.05 25.58
## 11188 11.08 30.31 51.93 22.17
## 12676 12.79 8.53 11.63 5.12
## 13596 17.05 27.47 7.75 8.53
## 14196 37.51 32.21 41.85 23.87
## 15400 324.81 448.99 1335.34 862.74
## GSM9045917_nonMET GSM9045918_MET GSM9045919_MET GSM9045920_Normal
## 5499 29.13 8.53 31.97 12.79
## 9038 5.68 25.58 10.66 7.46
## 9291 7.10 6.09 9.59 6.39
## 9642 2.13 2.44 7.46 3.20
## 10292 12.08 85.25 21.31 17.05
## 11188 4.97 25.58 49.02 6.39
## 12676 13.50 10.96 3.20 5.33
## 13596 7.81 12.18 19.18 8.53
## 14196 40.49 30.45 39.43 24.51
## 15400 174.05 580.93 705.45 801.36
## GSM9045921_Normal GSM9045922_Normal GSM9045923_nonMET GSM9045924_nonMET
## 5499 34.10 25.58 12.65 6.20
## 9038 2.56 7.75 13.75 10.85
## 9291 3.41 6.98 4.40 9.30
## 9642 2.56 9.30 9.08 6.98
## 10292 17.05 13.18 26.13 15.50
## 11188 4.26 3.10 41.53 27.90
## 12676 3.41 9.30 10.45 8.53
## 13596 15.35 13.95 14.85 30.23
## 14196 28.13 27.13 35.20 20.15
## 15400 843.99 993.56 1095.89 435.56
## GSM9045925_Normal GSM9045926_nonMET GSM9045927_nonMET GSM9045928_MET
## 5499 17.79 4.26 25.20 6.39
## 9038 2.22 12.79 39.29 14.21
## 9291 2.59 8.53 10.75 10.66
## 9642 4.45 12.79 3.34 7.81
## 10292 8.90 34.10 85.25 10.66
## 11188 6.30 4.26 26.69 6.39
## 12676 8.53 8.53 8.90 4.26
## 13596 16.31 4.26 16.68 8.53
## 14196 20.76 42.63 25.20 34.10
## 15400 946.29 677.75 242.41 895.14
## GSM9045929_MET GSM9045930_MET GSM9045931_nonMET GSM9045932_nonMET
## 5499 8.53 16.79 6.82 6.82
## 9038 14.73 28.68 5.12 6.82
## 9291 1.55 10.33 3.41 3.41
## 9642 3.88 4.13 6.82 11.94
## 10292 20.93 128.65 15.35 10.23
## 11188 20.93 41.33 46.04 20.46
## 12676 10.85 2.33 8.53 6.82
## 13596 9.30 21.18 11.94 25.58
## 14196 17.05 24.03 17.05 30.69
## 15400 678.91 1165.36 1239.55 821.82
There are 10 genes in the data of counts and the most important one is there, KDM5B, as it was the whole basis for this research study. The listed study genes totaled 25 genes, with 10 in this counts data and 15 not in the data. That could be due to the study genes having alternate names or not making the cut in quality filtering, normalization, and the top 25% of genes using a 3rd quantile filtering to omit less high variability genes.
There is no simple way to get the genes that aren’t in the data as I tried the different methods to extract only the genes in and only genes not in, the latter didn’t want to work.
listed <- as.factor(listed)
class(listed)
## [1] "factor"
studyListed <- as.factor(Listed$ID_REF)
class(studyListed)
## [1] "factor"
notListed <- listed[-(studyListed %in% listed)]
notListed
## [1] oriP RPMS1 KDM5B LMP1 H3K27 H3K27ac H3K27me3 CTFC
## [9] JQ1 BRD4 BRCA1 LGALS9 CD47 VCAM1 VEGFA MK167
## [17] MYC BCL2 BZLF1 ATM MLL Ku70 CD74 MK167
## 24 Levels: ATM BCL2 BRCA1 BRD4 BZLF1 CD47 CD74 CTFC EBNA1 H3K27 ... VEGFA
So, obviously that is more than 15 genes and a huge error. We can just eyeball the list of 25 listed genes in the study and mark off the listed from the list to make it the notListed list of genes not in the list. I did the same thing as characters, but changing to factor type objects makes no difference.
notListed <- as.factor(c("EBNA1", "oriP", "RPMS1", "LMP1", "H3K27","H3K27ac", "H3K27me3", "CTFC", "JQ1", "LGALS9", "MK167", "BZLF1", "MLL", "Ku70", "MK167"))
notListed
## [1] EBNA1 oriP RPMS1 LMP1 H3K27 H3K27ac H3K27me3 CTFC
## [9] JQ1 LGALS9 MK167 BZLF1 MLL Ku70 MK167
## 14 Levels: BZLF1 CTFC EBNA1 H3K27 H3K27ac H3K27me3 JQ1 Ku70 LGALS9 ... RPMS1
It is unfortunate that the JQ1 gene that inhibits the BRD4 gene in transcription of cancer cell proliferation isn’t in the counts data, but that makes sense as it inhibits the effects of KDM5B involvment in EBV+NPC and rapid cancer progression. Further search of JQ1 in genecards turned up nothing, and when using the internet, it turns out that JQ1 is a lab developed compound and not a gene that does inhibit the BRD4 family of genes that is useful in slowing cancer progression in cell proliferation and tumor growth. JQ1 is a Thienotriazolodiazepine seen in many studies and experiments to slow cancer progression.
The 3 genes for the EBV interacting regions or EBVIR are not in the counts data so they must be named something else or not considered genes but regions on the chromosome. Those genes were the EBNA1, oriP, and RPMS1. Genecards.org says there is an EBNA1BP2 but not EBNA1, for oriP nothing. It turns out the internet does have results for EBNA1, oriP, and RPMS1. They are EBV genes and not human genes.
The CTFC gene is a regulatory protein called CCCTC-binding factor that is listed in genecards.org when looking up CCCTC-binding factor but not CTFC. MLL is now called KMT2A in genecards.org.
It is odd that LMP1 is not in this data of counts after quality control filtering, normalization, and taking the top 25th percentile of gene counts. LMP1 is associated with EBV latency type II and also in last analysis we did with EBV+NPC top genes in GSE271486. LMP1 was mentioned in this study for its relation to EBV+NPC but didn’t make their measures in this counts data.
studyListed
## [1] VEGFA BCL2 BRCA1 ATM MYC VCAM1 KDM5B CD47 BRD4 CD74
## Levels: ATM BCL2 BRCA1 BRD4 CD47 CD74 KDM5B MYC VCAM1 VEGFA
We can see that the studyListed set of genes is not in the notListed set of genes. However, the two genes that KDM5B enhanced when the EBVIR-enhancer-KDM5B signature was present are there, VEGFA and VCAM1 as well as KDM5B. Also the BRD4 gene that plays a role in cancer proliferation and tumor growth is present. The two MHC-II t-cell genes of CD47 and CD74 are present as well, and the two study genes this research mentioned in having associations to NPC of BRCA1, ATM, and MYC. The only gene not mentioned is BCL2 that is in the listed study genes This gene was found to be present with MYC in the top 20% of subclusters of high EBV epithelial cells of NPC when the EBVIR-enhancer-KDM5B signature is present.
Lets remove the genes that are not human genes of the EBV 3 genes and the Thienotriazolodiazepine called JQ1 from our listed set of study genes.
listed
## [1] EBNA1 oriP RPMS1 KDM5B LMP1 H3K27 H3K27ac H3K27me3
## [9] CTFC JQ1 BRD4 BRCA1 LGALS9 CD47 VCAM1 VEGFA
## [17] MK167 MYC BCL2 BZLF1 ATM MLL Ku70 CD74
## [25] MK167
## 24 Levels: ATM BCL2 BRCA1 BRD4 BZLF1 CD47 CD74 CTFC EBNA1 H3K27 ... VEGFA
humanListed <- as.character(listed[-c(1,2,3,10)])
humanListed
## [1] "KDM5B" "LMP1" "H3K27" "H3K27ac" "H3K27me3" "CTFC"
## [7] "BRD4" "BRCA1" "LGALS9" "CD47" "VCAM1" "VEGFA"
## [13] "MK167" "MYC" "BCL2" "BZLF1" "ATM" "MLL"
## [19] "Ku70" "CD74" "MK167"
Lets remove the unneccessary data from our enviroment.
rm(dataSamples,details,totalSeriesInfo,detailsCounts,dataID2)
## Warning in rm(dataSamples, details, totalSeriesInfo, detailsCounts, dataID2):
## object 'dataSamples' not found
Lets go ahead and get our data set of means by sample and fold change values for top upregulated or stimulated genes and top downregulated or inhibited genes in comparing MET to normal and nonMET to normal. This data was taken from samples from the group of Chinese biregional or multicentric and multiomics data of all enrolled patients having their first line treatment which was chemotherapy in 2011 guided by the NCCC and then tracked to see which patients developed MET 3 years past last treatment of chemotherapy.
colnames(counts)
## [1] "ID_REF" "GSM9045905_MET" "GSM9045906_MET"
## [4] "GSM9045907_nonMET" "GSM9045908_nonMET" "GSM9045909_MET"
## [7] "GSM9045910_MET" "GSM9045911_MET" "GSM9045912_MET"
## [10] "GSM9045913_MET" "GSM9045914_MET" "GSM9045915_MET"
## [13] "GSM9045916_nonMET" "GSM9045917_nonMET" "GSM9045918_MET"
## [16] "GSM9045919_MET" "GSM9045920_Normal" "GSM9045921_Normal"
## [19] "GSM9045922_Normal" "GSM9045923_nonMET" "GSM9045924_nonMET"
## [22] "GSM9045925_Normal" "GSM9045926_nonMET" "GSM9045927_nonMET"
## [25] "GSM9045928_MET" "GSM9045929_MET" "GSM9045930_MET"
## [28] "GSM9045931_nonMET" "GSM9045932_nonMET"
Lets organize the columns so its easier to get rowMeans by sample type or class of MET, nonMET, and Normal.
DF <- counts[,c(1:3,6:12,15,16,25:27,4,5,13,14,20,21,23,24,28,29,17,18,19,22)]
colnames(DF)
## [1] "ID_REF" "GSM9045905_MET" "GSM9045906_MET"
## [4] "GSM9045909_MET" "GSM9045910_MET" "GSM9045911_MET"
## [7] "GSM9045912_MET" "GSM9045913_MET" "GSM9045914_MET"
## [10] "GSM9045915_MET" "GSM9045918_MET" "GSM9045919_MET"
## [13] "GSM9045928_MET" "GSM9045929_MET" "GSM9045930_MET"
## [16] "GSM9045907_nonMET" "GSM9045908_nonMET" "GSM9045916_nonMET"
## [19] "GSM9045917_nonMET" "GSM9045923_nonMET" "GSM9045924_nonMET"
## [22] "GSM9045926_nonMET" "GSM9045927_nonMET" "GSM9045931_nonMET"
## [25] "GSM9045932_nonMET" "GSM9045920_Normal" "GSM9045921_Normal"
## [28] "GSM9045922_Normal" "GSM9045925_Normal"
The columns are arranged by class type with MET classes columns 2 through 15, the nonMET class are columns 16 through 25, and the Normal class are 26 through 29.
DF$MET_mean <- rowMeans(DF[,c(2:15)])
DF$nonMET_mean <- rowMeans(DF[,c(16:25)])
DF$Normal_mean <- rowMeans(DF[,c(26:29)])
DF$MET_Normal_FoldChange <- DF$MET_mean/DF$Normal_mean
DF$nonMET_Normal_FoldChange <- DF$nonMET_mean/DF$Normal_mean
DF$MET_nonMET_FoldChange <- DF$MET_mean/DF$nonMET_mean
colnames(DF)
## [1] "ID_REF" "GSM9045905_MET"
## [3] "GSM9045906_MET" "GSM9045909_MET"
## [5] "GSM9045910_MET" "GSM9045911_MET"
## [7] "GSM9045912_MET" "GSM9045913_MET"
## [9] "GSM9045914_MET" "GSM9045915_MET"
## [11] "GSM9045918_MET" "GSM9045919_MET"
## [13] "GSM9045928_MET" "GSM9045929_MET"
## [15] "GSM9045930_MET" "GSM9045907_nonMET"
## [17] "GSM9045908_nonMET" "GSM9045916_nonMET"
## [19] "GSM9045917_nonMET" "GSM9045923_nonMET"
## [21] "GSM9045924_nonMET" "GSM9045926_nonMET"
## [23] "GSM9045927_nonMET" "GSM9045931_nonMET"
## [25] "GSM9045932_nonMET" "GSM9045920_Normal"
## [27] "GSM9045921_Normal" "GSM9045922_Normal"
## [29] "GSM9045925_Normal" "MET_mean"
## [31] "nonMET_mean" "Normal_mean"
## [33] "MET_Normal_FoldChange" "nonMET_Normal_FoldChange"
## [35] "MET_nonMET_FoldChange"
str(DF)
## 'data.frame': 18677 obs. of 35 variables:
## $ ID_REF : chr "TGIF2LY" "OR5B21" "SUGCT" "GABARAP" ...
## $ GSM9045905_MET : num 0.71 2.84 0.71 22.73 15.63 ...
## $ GSM9045906_MET : num 1.07 4.26 1.07 18.12 10.66 ...
## $ GSM9045909_MET : num 0.95 0.95 0.95 23.68 3.79 ...
## $ GSM9045910_MET : num 0.53 2.66 0.53 28.77 13.32 ...
## $ GSM9045911_MET : num 4.65 2.33 1.55 24.03 3.1 ...
## $ GSM9045912_MET : num 1.79 5.16 1.79 11.44 9.65 ...
## $ GSM9045913_MET : num 1.71 3.41 5.97 32.4 14.49 ...
## $ GSM9045914_MET : num 1.89 4.74 2.84 18.94 14.21 ...
## $ GSM9045915_MET : num 3.88 4.65 3.88 30.23 13.18 ...
## $ GSM9045918_MET : num 1.22 3.65 1.22 13.4 8.53 2.44 1.22 3.65 1.22 3.65 ...
## $ GSM9045919_MET : num 4.26 3.2 1.07 21.31 11.72 ...
## $ GSM9045928_MET : num 2.13 2.84 3.55 29.84 11.37 ...
## $ GSM9045929_MET : num 3.1 7.75 0.78 18.6 11.63 ...
## $ GSM9045930_MET : num 1.29 2.84 1.03 22.48 21.7 ...
## $ GSM9045907_nonMET : num 4.26 4.26 4.26 4.26 12.79 ...
## $ GSM9045908_nonMET : num 4.26 4.26 4.26 38.36 4.26 ...
## $ GSM9045916_nonMET : num 1.71 6.82 1.71 28.99 10.23 ...
## $ GSM9045917_nonMET : num 1.42 4.26 0.71 46.18 10.66 ...
## $ GSM9045923_nonMET : num 1.1 1.65 0.83 30.53 12.38 ...
## $ GSM9045924_nonMET : num 0.78 2.33 1.55 18.6 13.95 ...
## $ GSM9045926_nonMET : num 4.26 4.26 4.26 29.84 17.05 ...
## $ GSM9045927_nonMET : num 1.48 1.85 1.11 25.95 18.9 ...
## $ GSM9045931_nonMET : num 1.71 5.12 1.71 15.35 6.82 ...
## $ GSM9045932_nonMET : num 1.71 1.71 3.41 25.58 10.23 ...
## $ GSM9045920_Normal : num 1.07 2.13 2.13 28.77 4.26 ...
## $ GSM9045921_Normal : num 2.56 0.85 0.85 28.13 2.56 ...
## $ GSM9045922_Normal : num 2.33 3.1 1.55 33.33 3.88 ...
## $ GSM9045925_Normal : num 0.74 1.11 1.85 37.07 2.59 ...
## $ MET_mean : num 2.08 3.66 1.92 22.57 11.64 ...
## $ nonMET_mean : num 2.27 3.65 2.38 26.36 11.73 ...
## $ Normal_mean : num 1.68 1.8 1.59 31.82 3.32 ...
## $ MET_Normal_FoldChange : num 1.244 2.038 1.206 0.709 3.504 ...
## $ nonMET_Normal_FoldChange: num 1.355 2.032 1.493 0.828 3.53 ...
## $ MET_nonMET_FoldChange : num 0.919 1.003 0.808 0.856 0.993 ...
MET_nonMET_FC_ordered <- DF[order(DF$MET_nonMET_FoldChange, decreasing=T),]
head(MET_nonMET_FC_ordered)
## ID_REF GSM9045905_MET GSM9045906_MET GSM9045909_MET GSM9045910_MET
## 10538 TRPA1 0.71 4.26 1.89 188.09
## 14378 CCL2 127.88 4.26 38.84 6.39
## 7211 SHISA2 41.92 1.07 3.79 6.93
## 15643 LGR5 22.02 3.20 3.79 19.18
## 1979 FDCSP 1.42 3.20 0.95 2.13
## 1516 CXCL6 13.50 4.26 152.50 2.66
## GSM9045911_MET GSM9045912_MET GSM9045913_MET GSM9045914_MET
## 10538 1.55 5.16 4.26 2.84
## 14378 3.88 2.69 5.12 161.98
## 7211 10.08 1.79 49.45 21.79
## 15643 30.23 8.75 5.12 34.10
## 1979 1.55 2.92 150.89 119.35
## 1516 4.65 2.69 0.85 0.95
## GSM9045915_MET GSM9045918_MET GSM9045919_MET GSM9045928_MET
## 10538 9.30 1.22 1.07 5.68
## 14378 11.63 8.53 457.16 7.10
## 7211 1.55 2.44 59.68 4.26
## 15643 0.78 4.87 117.22 7.10
## 1979 2.33 1.22 1.07 7.10
## 1516 1.55 2.44 9.59 1.42
## GSM9045929_MET GSM9045930_MET GSM9045907_nonMET GSM9045908_nonMET
## 10538 15.50 2.84 4.26 4.26
## 14378 1.55 91.97 4.26 4.26
## 7211 6.98 1.03 4.26 4.26
## 15643 3.10 47.02 4.26 4.26
## 1979 5.43 25.06 4.26 12.79
## 1516 3.88 0.78 12.79 4.26
## GSM9045916_nonMET GSM9045917_nonMET GSM9045923_nonMET GSM9045924_nonMET
## 10538 1.71 0.71 0.55 4.65
## 14378 5.12 2.84 3.58 81.38
## 7211 1.71 0.71 3.58 3.88
## 15643 1.71 9.24 2.48 6.20
## 1979 1.71 2.13 1.93 5.43
## 1516 1.71 0.71 0.83 0.78
## GSM9045926_nonMET GSM9045927_nonMET GSM9045931_nonMET GSM9045932_nonMET
## 10538 4.26 1.48 1.71 1.71
## 14378 4.26 2.97 1.71 1.71
## 7211 4.26 2.22 1.71 1.71
## 15643 4.26 2.59 5.12 3.41
## 1979 8.53 2.97 5.12 1.71
## 1516 4.26 2.22 1.71 1.71
## GSM9045920_Normal GSM9045921_Normal GSM9045922_Normal GSM9045925_Normal
## 10538 1.07 0.85 0.78 1.48
## 14378 3.20 2.56 3.88 2.97
## 7211 1.07 2.56 2.33 3.71
## 15643 3.20 2.56 0.78 1.11
## 1979 3.20 0.85 0.78 3.34
## 1516 3.20 3.41 48.83 37.81
## MET_mean nonMET_mean Normal_mean MET_Normal_FoldChange
## 10538 17.45500 2.530 1.0450 16.703349
## 14378 66.35571 11.209 3.1525 21.048601
## 7211 15.19714 2.830 2.4175 6.286305
## 15643 21.89143 4.353 1.9125 11.446499
## 1979 23.18714 4.658 2.0425 11.352334
## 1516 14.40857 3.098 23.3125 0.618062
## nonMET_Normal_FoldChange MET_nonMET_FoldChange
## 10538 2.4210526 6.899209
## 14378 3.5555908 5.919860
## 7211 1.1706308 5.370015
## 15643 2.2760784 5.029044
## 1979 2.2805386 4.977918
## 1516 0.1328901 4.650927
top10UpAnd10Down_met_nonmetFC <- MET_nonMET_FC_ordered[c(1:10,18668:18677),]
top10UpAnd10Down_met_nonmetFC[,c(1,35)]
## ID_REF MET_nonMET_FoldChange
## 10538 TRPA1 6.89920949
## 14378 CCL2 5.91986032
## 7211 SHISA2 5.37001514
## 15643 LGR5 5.02904401
## 1979 FDCSP 4.97791817
## 1516 CXCL6 4.65092687
## 7594 CST1 3.93830205
## 11836 MMP1 3.77502483
## 13000 FSTL3 3.72344208
## 14923 AKR1C3 3.37745476
## 4627 AMY1A 0.15966597
## 7092 MSLN 0.14037864
## 6013 GSTA1 0.13022230
## 3758 MSMB 0.12031737
## 17621 NCCRP1 0.11610074
## 5503 COL14A1 0.10322887
## 5687 ZNF331 0.05778740
## 14741 SCGB1A1 0.03950420
## 202 ZPBP 0.02622455
## 9475 LTF 0.02573839
That result is the top 10 up regulated and top 10 down regulated genes between MET and non-MET NPC fold change values of MET mean value over nonMET mean value.
top10UpAnd10Down_met_nonmetFC$ID_REF
## [1] "TRPA1" "CCL2" "SHISA2" "LGR5" "FDCSP" "CXCL6" "CST1"
## [8] "MMP1" "FSTL3" "AKR1C3" "AMY1A" "MSLN" "GSTA1" "MSMB"
## [15] "NCCRP1" "COL14A1" "ZNF331" "SCGB1A1" "ZPBP" "LTF"
Those are the top 20 genes most differentially changed from MET NPC to nonMET NPC. Lets see where are list of genes are in the studyListed set of genes that are in this counts data frame.
foldchage_studyListed <- MET_nonMET_FC_ordered[which(MET_nonMET_FC_ordered$ID_REF %in% Listed$ID_REF),]
foldchage_studyListed[,c(1,35)]
## ID_REF MET_nonMET_FoldChange
## 10292 MYC 2.1518539
## 11188 VCAM1 1.3947860
## 9291 BRCA1 1.2970418
## 9038 BCL2 1.2637524
## 5499 VEGFA 1.2025392
## 12676 KDM5B 1.1411223
## 14196 BRD4 1.1204038
## 13596 CD47 1.0710900
## 15400 CD74 0.8889044
## 9642 ATM 0.5679204
All genes related to tumor proliferation in the research report for NPC cellular changes when KDM5B are present seem to be upregulated from above information just comparing the results of the MET to nonMET foldchange gene expression values, except the ATM and CD74 genes. CD74 is one of the MHC-II genes mentioned that is a chaperone to immune activity and seems to perform immune evasion by lowering in value, and ATM is a gene that normally suppresses tumor growth but is low by almost half as much in MET compared to nonMET cases of NPC following treatment. ATM is lowered when the EBNA1 gene (an EBV latent type II gene) infiltrates its EBVIR area of the human chromosome. Both the VCAM1 that stimulates tumor growth and proliferation as well as VEGFA that stimulates tumour growth by creating blood routes for the tumor in tumor angiogenesis are up regulated in MET vs nonMET as well as KDM5B the gene of interest to the research study we are analyzing one portion of their in depth study. These genes are also the only genes from the list of 25 genes in the study we noted and totaling 10 remaining in the count data of scRNA data after they filtered and normalized and selected top 25th percentile of genes having highest variability.
Let just keep the fold change data set of genes we already have comparing MET to nonMET top 20 genes with highest change in up or down regulation. And use the ten genes from the study that made the quality control filtering of counts.
Lets make a matrix after binding these genes together into one dataset.Lets rename the long name of our top20 genes of fold change to a shorter name.
top20 <- top10UpAnd10Down_met_nonmetFC
rm(top10UpAnd10Down_met_nonmetFC)
Our data was already arranged to have the columns of sample types in order across the data frame. Lets add a column to each of the 2 data sets of top20 and study genes with FC values.
foldchage_studyListed$geneSource <- 'GSE299775 EBV+NCP gene of interest'
top20$geneSource <- "top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC"
Now combine the 30 genes into one dataset and write to csv.
top30 <- rbind(foldchage_studyListed,top20)
top30[,c(1,35)]
## ID_REF MET_nonMET_FoldChange
## 10292 MYC 2.15185389
## 11188 VCAM1 1.39478595
## 9291 BRCA1 1.29704183
## 9038 BCL2 1.26375241
## 5499 VEGFA 1.20253916
## 12676 KDM5B 1.14112231
## 14196 BRD4 1.12040383
## 13596 CD47 1.07109005
## 15400 CD74 0.88890443
## 9642 ATM 0.56792044
## 10538 TRPA1 6.89920949
## 14378 CCL2 5.91986032
## 7211 SHISA2 5.37001514
## 15643 LGR5 5.02904401
## 1979 FDCSP 4.97791817
## 1516 CXCL6 4.65092687
## 7594 CST1 3.93830205
## 11836 MMP1 3.77502483
## 13000 FSTL3 3.72344208
## 14923 AKR1C3 3.37745476
## 4627 AMY1A 0.15966597
## 7092 MSLN 0.14037864
## 6013 GSTA1 0.13022230
## 3758 MSMB 0.12031737
## 17621 NCCRP1 0.11610074
## 5503 COL14A1 0.10322887
## 5687 ZNF331 0.05778740
## 14741 SCGB1A1 0.03950420
## 202 ZPBP 0.02622455
## 9475 LTF 0.02573839
write.csv(top30,'top20_MET_nonMET_FC_and_10GSE299775_genes.csv',row.names=F)
You can access these 30 genes and samples with means per sample class and fold change values here
Now lets make a matrix of these genes to predict with randomForest the class as normal, metastatic NPC, or non-metastatic NPC.
colnames(top30)
## [1] "ID_REF" "GSM9045905_MET"
## [3] "GSM9045906_MET" "GSM9045909_MET"
## [5] "GSM9045910_MET" "GSM9045911_MET"
## [7] "GSM9045912_MET" "GSM9045913_MET"
## [9] "GSM9045914_MET" "GSM9045915_MET"
## [11] "GSM9045918_MET" "GSM9045919_MET"
## [13] "GSM9045928_MET" "GSM9045929_MET"
## [15] "GSM9045930_MET" "GSM9045907_nonMET"
## [17] "GSM9045908_nonMET" "GSM9045916_nonMET"
## [19] "GSM9045917_nonMET" "GSM9045923_nonMET"
## [21] "GSM9045924_nonMET" "GSM9045926_nonMET"
## [23] "GSM9045927_nonMET" "GSM9045931_nonMET"
## [25] "GSM9045932_nonMET" "GSM9045920_Normal"
## [27] "GSM9045921_Normal" "GSM9045922_Normal"
## [29] "GSM9045925_Normal" "MET_mean"
## [31] "nonMET_mean" "Normal_mean"
## [33] "MET_Normal_FoldChange" "nonMET_Normal_FoldChange"
## [35] "MET_nonMET_FoldChange" "geneSource"
Only include sample type but make a vector for column names as these gene IDs and a column vector of class type for the type of sample each one is in 3 classes.
genes <- top30$ID_REF
class <- c("MET" ,
"MET" , "MET" ,
"MET" , "MET" ,
"MET" , "MET" ,
"MET" , "MET" ,
"MET" , "MET" ,
"MET" , "MET",
"MET" , "nonMET",
"nonMET" , "nonMET" ,
"nonMET" , "nonMET",
"nonMET" , "nonMET" ,
"nonMET" , "nonMET" ,
"nonMET" , "Normal" ,
"Normal" , "Normal" ,
"Normal" )
matrix <- data.frame(t(top30[,c(2:29)]))
colnames(matrix) <- genes
matrix$class <- as.factor(class)
str(matrix)
## 'data.frame': 28 obs. of 31 variables:
## $ MYC : num 56.12 88.45 35.05 44.22 6.98 ...
## $ VCAM1 : num 7.81 22.38 34.1 57.54 17.05 ...
## $ BRCA1 : num 8.53 8.53 4.74 7.46 3.88 ...
## $ BCL2 : num 24.86 26.64 4.74 9.06 10.85 ...
## $ VEGFA : num 33.4 16 32.2 32.5 11.6 ...
## $ KDM5B : num 12.08 8.53 11.37 7.46 21.7 ...
## $ BRD4 : num 54 21.3 23.7 31.4 28.7 ...
## $ CD47 : num 15.6 12.8 42.6 10.1 12.4 ...
## $ CD74 : num 364 289 291 1114 564 ...
## $ ATM : num 5.68 6.39 3.79 4.26 7.75 4.49 3.41 0.95 6.2 2.44 ...
## $ TRPA1 : num 0.71 4.26 1.89 188.09 1.55 ...
## $ CCL2 : num 127.88 4.26 38.84 6.39 3.88 ...
## $ SHISA2 : num 41.92 1.07 3.79 6.93 10.08 ...
## $ LGR5 : num 22.02 3.2 3.79 19.18 30.23 ...
## $ FDCSP : num 1.42 3.2 0.95 2.13 1.55 ...
## $ CXCL6 : num 13.5 4.26 152.5 2.66 4.65 ...
## $ CST1 : num 4.97 9.59 2.84 0.53 1.55 ...
## $ MMP1 : num 2.84 2.13 14.21 3.2 1.55 ...
## $ FSTL3 : num 138.53 12.79 8.53 31.97 6.98 ...
## $ AKR1C3 : num 1.42 4.26 3.79 3.73 1.55 ...
## $ AMY1A : num 8.53 10.66 8.53 10.66 7.75 ...
## $ MSLN : num 2.13 1.07 1.89 1.6 3.1 4.26 2.56 1.89 5.43 2.44 ...
## $ GSTA1 : num 4.97 1.07 2.84 3.2 31 ...
## $ MSMB : num 1.42 3.2 0.95 1.07 6.98 6.28 1.71 1.89 0.78 1.22 ...
## $ NCCRP1 : num 1.42 1.07 2.84 1.6 4.65 2.92 1.71 5.68 1.55 1.22 ...
## $ COL14A1: num 0.71 3.2 0.95 1.07 2.33 4.49 0.85 9.47 2.33 2.44 ...
## $ ZNF331 : num 3.55 3.2 2.84 2.66 4.65 2.24 6.82 4.74 3.88 6.09 ...
## $ SCGB1A1: num 2.13 3.2 1.89 3.2 3.88 1.57 2.56 4.74 1.55 1.22 ...
## $ ZPBP : num 0.71 3.2 0.95 0.53 1.55 1.79 0.85 2.84 1.55 2.44 ...
## $ LTF : num 4.97 9.59 1.89 5.86 1.55 ...
## $ class : Factor w/ 3 levels "MET","nonMET",..: 1 1 1 1 1 1 1 1 1 1 ...
table(matrix$class)
##
## MET nonMET Normal
## 14 10 4
There are 14 MET and 10 nonMET samples, but only 4 Normal samples. Lets see if the model for randomForest can predict accurately on a 3 class model using these genes.
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
set.seed(1234)
intrain <- sample(1:28,.75*28)
training <- matrix[intrain,]
testing <- matrix[-intrain,]
table(training$class)
##
## MET nonMET Normal
## 11 7 3
table(testing$class)
##
## MET nonMET Normal
## 3 3 1
This should be interesting. We will use the same settings as our last project on a 2 class model.
rf <- randomForest(training[,c(1:30)], y=training$class, mtry=6, ntree=5000, confusion=T)
rf$confusion
## MET nonMET Normal class.error
## MET 9 2 0 0.1818182
## nonMET 2 5 0 0.2857143
## Normal 0 3 0 1.0000000
Not very good with the normal class prediction as all 3 samples that were normal were misidentified as nonMET with a 100% class error in the training model. But the MET class predicted 10 of the 11 MET samples correctly, and 5 of the 7 nonMET classes correctly. Lets see how well this model predicts the testing set classes.
predicted <- predict(rf,testing)
results <- data.frame(predicted = predicted, actual=testing$class)
results
## predicted actual
## GSM9045905_MET MET MET
## GSM9045909_MET MET MET
## GSM9045929_MET MET MET
## GSM9045917_nonMET MET nonMET
## GSM9045924_nonMET MET nonMET
## GSM9045931_nonMET nonMET nonMET
## GSM9045920_Normal Normal Normal
The prediction accuracy in the nonMET and Normal classes was 100% for each class, but the accuracy in prediction of the MET class was only 3 out of 5 with 2 of those MET classes predicted as nonMET that is 3/5 in accuracy on MET class predictio or 60%, but 100% in accuracy for the nonMET and Normal class. Overall it predicted 5/7 correctly which is
sum(results$predicted==results$actual)/length(results$predicted)
## [1] 0.7142857
71.4 % accuracy in prediction using all 30 genes of top 10 up regulated, top 10 down regulated, and the only 10 genes from the study that made it into this counts database after quality control filtering and normalization and removing bottom 75th percentile genes in variability.
How well do only the 10 study genes compare? Lets find out using same settings of this random forest algorithm.
matrix10 <- matrix[,c(1:10,31)]
matrix20 <- matrix[,c(11:30,31)]
training <- matrix10[intrain,]
testing <- matrix10[-intrain,]
table(training$class)
##
## MET nonMET Normal
## 11 7 3
table(testing$class)
##
## MET nonMET Normal
## 3 3 1
We are using the same samples but only the 10 genes in the study from our matrix to predict the class in a 3 class model using randomForest classifier.
rf1 <- randomForest(training[,c(1:10)], y=training$class, mtry=3, ntree=5000, confusion=T)
rf$confusion
## MET nonMET Normal class.error
## MET 9 2 0 0.1818182
## nonMET 2 5 0 0.2857143
## Normal 0 3 0 1.0000000
This time, the normal class in building the model was 100% accurate in prediction, while the nonMET class was only 5/7 or 71% accurate, and the MET class was 10/11 accurate or 91% accurate. Lets see how well the 10 study genes predict the testing set.
prediction1 <- predict(rf1,testing)
results1 <- data.frame(predicted=prediction1,actual=testing$class)
results1
## predicted actual
## GSM9045905_MET MET MET
## GSM9045909_MET MET MET
## GSM9045929_MET nonMET MET
## GSM9045917_nonMET MET nonMET
## GSM9045924_nonMET MET nonMET
## GSM9045931_nonMET MET nonMET
## GSM9045920_Normal MET Normal
This model scored terribly on the prediction data as it predicted almost all classes to be MET except for one MET class it predicted to be nonMET. It scored 2/3 MET classes correctly or 67% accurate, while all 3 of the nonMET classes were predicted incorrectly or 0% accurate, and all Normal classes predicted incorrectly for 1/1 with 0% accuracy. The overall prediction accuracy was
sum(results1$predicted==results1$actual)/length(results1$predicted)
## [1] 0.2857143
28.6 % accuracy in prediction using these 10 study genes that made the filtering process and quality control of the study
colnames(matrix10)[1:10]
## [1] "MYC" "VCAM1" "BRCA1" "BCL2" "VEGFA" "KDM5B" "BRD4" "CD47" "CD74"
## [10] "ATM"
Now lets look at how well the other genes from our fold change values of most change in top 10 up regulated and top 10 down regulated genes totaling 20 genes did.
training <- matrix20[intrain,]
testing <- matrix20[-intrain,]
table(training$class)
##
## MET nonMET Normal
## 11 7 3
table(testing$class)
##
## MET nonMET Normal
## 3 3 1
rf2 <- randomForest(training[,c(1:20)], y=training$class, mtry=6, ntree=5000, confusion=T)
rf2$confusion
## MET nonMET Normal class.error
## MET 9 2 0 0.1818182
## nonMET 2 4 1 0.4285714
## Normal 0 2 1 0.6666667
The model accuracy in prediction for the MET class was 8/11 with 73% accuracy, nonMET was 4/7 for 58% accuracy, and the normal class was 1/3 for 33% accuracy. Lets see how well the prediction accuracy is on the testing data.
prediction2 <- predict(rf2,testing)
results2 <- data.frame(predicted=prediction2,actual=testing$class)
results2
## predicted actual
## GSM9045905_MET MET MET
## GSM9045909_MET MET MET
## GSM9045929_MET nonMET MET
## GSM9045917_nonMET MET nonMET
## GSM9045924_nonMET MET nonMET
## GSM9045931_nonMET nonMET nonMET
## GSM9045920_Normal Normal Normal
We can see the prediction accuracy in the MET class is 2/3 or 66% accurate, the nonMET class is 1/3 for 33% accuracy, and the Normal class is 100% accurate with 1/1 predicted correctly. The overall prediction accuracy using the 20 genes was
sum(results2$predicted==results2$actual)/length(results2$predicted)
## [1] 0.5714286
57.1 % which is actually better than the prediction accuracy of the 10 genes from the research study in predicting a 3 class model, but the best model was when we used all 30 genes to predict the 3 classes as that scored 71.4 %. But when using a 3 class model or any model where there are more than 2 classes, the machine will always struggle, better tuning of the model delivers better accuracy, but it is a tough battle to score high in prediction when using a 3 class model to predict, it always or random forest or any model always predicts better with a 2 class model if the features are good enough to explain all the variation of the target feature.
Thanks so much and I hope you enjoyed this analysis. More time was spent understanding the research published article but it pays off as some parts weren’t that clear until tested. I admit the article was lengthy and I did not read all the wonderful plots or graphs but got as much information I could use to find the features worth testing with a portion of the data that was used in the study. The study helped understand what these samples are, where were they taken, and that not all the samples were used. Many of the plots and graphs are beyond my current understanding in bioinformatics and gene expression analysis in a lab setting and using machines and other intricate algorithms to analyze data, but the overall goal and missio of the study is understood as I do have a Master’s degree in data science with bioinformatics, course knowledge in genetics, biology, biochemistry, differential diagnosis, immunity and infection, pathology, and neurology from 3 years of chiropractic medical school. Work like this keeps the mind alert and trained to help stay sharp and prevent dimentia or Alzheimer’s or other cognitive disorders from not allowing the mind to explore and create or understand and learn by connecting ideas, also the internet is wonderful in its ability to use AI to give information fast and options to compare with own understanding, however they do say it uses a lot of computing technology and energy that drains water resources as well as electrical stores for maintaining grid power in communities. It all pays off and we get better understanding and leads to more people searching for what they want to find and accomplish or the skills to learn that they never thought they could. I do recommend exploring this article at link above and doing the analysis presented in front of you to make your own inferences or guide you in exploring other parts of the data and study to design your own questions to answer with data science and machine learning.
Keep checking in. We will be adding these 30 genes to our pathology database but not today.
========================================================================
*** Part 2
We will just upload these 30 genes to the database of pathologies for this part.
NPC_30 <- read.csv('top20_MET_nonMET_FC_and_10GSE299775_genes.csv', header=T, sep=',')
str(NPC_30)
## 'data.frame': 30 obs. of 36 variables:
## $ ID_REF : chr "MYC" "VCAM1" "BRCA1" "BCL2" ...
## $ GSM9045905_MET : num 56.12 7.81 8.53 24.86 33.39 ...
## $ GSM9045906_MET : num 88.45 22.38 8.53 26.64 15.98 ...
## $ GSM9045909_MET : num 35.05 34.1 4.74 4.74 32.21 ...
## $ GSM9045910_MET : num 44.22 57.54 7.46 9.06 32.5 ...
## $ GSM9045911_MET : num 6.98 17.05 3.88 10.85 11.63 ...
## $ GSM9045912_MET : num 72.01 70 11.44 6.95 28.72 ...
## $ GSM9045913_MET : num 100.6 11.08 8.53 35.81 52.86 ...
## $ GSM9045914_MET : num 16.1 30.31 8.53 13.26 11.37 ...
## $ GSM9045915_MET : num 79.05 51.93 3.88 14.73 13.18 ...
## $ GSM9045918_MET : num 85.25 25.58 6.09 25.58 8.53 ...
## $ GSM9045919_MET : num 21.31 49.02 9.59 10.66 31.97 ...
## $ GSM9045928_MET : num 10.66 6.39 10.66 14.21 6.39 ...
## $ GSM9045929_MET : num 20.93 20.93 1.55 14.73 8.53 ...
## $ GSM9045930_MET : num 128.7 41.3 10.3 28.7 16.8 ...
## $ GSM9045907_nonMET : num 17.05 21.31 4.26 4.26 21.31 ...
## $ GSM9045908_nonMET : num 12.79 12.79 4.26 25.58 59.68 ...
## $ GSM9045916_nonMET : num 25.58 22.17 1.71 11.94 8.53 ...
## $ GSM9045917_nonMET : num 12.08 4.97 7.1 5.68 29.13 ...
## $ GSM9045923_nonMET : num 26.1 41.5 4.4 13.8 12.6 ...
## $ GSM9045924_nonMET : num 15.5 27.9 9.3 10.8 6.2 ...
## $ GSM9045926_nonMET : num 34.1 4.26 8.53 12.79 4.26 ...
## $ GSM9045927_nonMET : num 85.2 26.7 10.8 39.3 25.2 ...
## $ GSM9045931_nonMET : num 15.35 46.04 3.41 5.12 6.82 ...
## $ GSM9045932_nonMET : num 10.23 20.46 3.41 6.82 6.82 ...
## $ GSM9045920_Normal : num 17.05 6.39 6.39 7.46 12.79 ...
## $ GSM9045921_Normal : num 17.05 4.26 3.41 2.56 34.1 ...
## $ GSM9045922_Normal : num 13.18 3.1 6.98 7.75 25.58 ...
## $ GSM9045925_Normal : num 8.9 6.3 2.59 2.22 17.79 ...
## $ MET_mean : num 54.67 31.82 7.41 17.2 21.72 ...
## $ nonMET_mean : num 25.41 22.81 5.71 13.61 18.06 ...
## $ Normal_mean : num 14.04 5.01 4.84 5 22.57 ...
## $ MET_Normal_FoldChange : num 3.892 6.348 1.53 3.441 0.962 ...
## $ nonMET_Normal_FoldChange: num 1.81 4.55 1.18 2.72 0.8 ...
## $ MET_nonMET_FoldChange : num 2.15 1.39 1.3 1.26 1.2 ...
## $ geneSource : chr "GSE299775 EBV+NCP gene of interest" "GSE299775 EBV+NCP gene of interest" "GSE299775 EBV+NCP gene of interest" "GSE299775 EBV+NCP gene of interest" ...
Lets get the other table of pathologies to use with out table of 30 NPC genes.
path <- "C:/.../genes Mono MS FM EBV Lyme NKTCL EBVaNPC/" #path of directory to your file of pathologies.
setwd(path)
pathologyDB <- read.csv("pathologyDB_EBVaNPC_13plus20genes.csv", header=T, sep=',')
str(pathologyDB)
## 'data.frame': 237 obs. of 7 variables:
## $ Ensembl_ID : chr "ENSG00000211899" "ENSG00000164458" "ENSG00000211644" "ENSG00000125869" ...
## $ Genecards_ID : chr "IGHM" "TBXT" "IGLV1-51" "LAMP5" ...
## $ FC_pathology_control: num 18550 1051 179 140 105 ...
## $ topGenePathology : chr "Epstein Barr Virus" "Epstein Barr Virus" "Epstein Barr Virus" "Epstein Barr Virus" ...
## $ mediaType : chr "LCLs of PBMCs RNA-Seq format" "LCLs of PBMCs RNA-Seq format" "LCLs of PBMCs RNA-Seq format" "LCLs of PBMCs RNA-Seq format" ...
## $ studySummarized : chr "The EBV or Epstein-Barr Viral infected samples were obtained from lymphoblastic cells in peripheral blood monon"| __truncated__ "The EBV or Epstein-Barr Viral infected samples were obtained from lymphoblastic cells in peripheral blood monon"| __truncated__ "The EBV or Epstein-Barr Viral infected samples were obtained from lymphoblastic cells in peripheral blood monon"| __truncated__ "The EBV or Epstein-Barr Viral infected samples were obtained from lymphoblastic cells in peripheral blood monon"| __truncated__ ...
## $ GSE_study_ID : chr "GSE253756" "GSE253756" "GSE253756" "GSE253756" ...
colnames(NPC_30)
## [1] "ID_REF" "GSM9045905_MET"
## [3] "GSM9045906_MET" "GSM9045909_MET"
## [5] "GSM9045910_MET" "GSM9045911_MET"
## [7] "GSM9045912_MET" "GSM9045913_MET"
## [9] "GSM9045914_MET" "GSM9045915_MET"
## [11] "GSM9045918_MET" "GSM9045919_MET"
## [13] "GSM9045928_MET" "GSM9045929_MET"
## [15] "GSM9045930_MET" "GSM9045907_nonMET"
## [17] "GSM9045908_nonMET" "GSM9045916_nonMET"
## [19] "GSM9045917_nonMET" "GSM9045923_nonMET"
## [21] "GSM9045924_nonMET" "GSM9045926_nonMET"
## [23] "GSM9045927_nonMET" "GSM9045931_nonMET"
## [25] "GSM9045932_nonMET" "GSM9045920_Normal"
## [27] "GSM9045921_Normal" "GSM9045922_Normal"
## [29] "GSM9045925_Normal" "MET_mean"
## [31] "nonMET_mean" "Normal_mean"
## [33] "MET_Normal_FoldChange" "nonMET_Normal_FoldChange"
## [35] "MET_nonMET_FoldChange" "geneSource"
Just keep the ID_REF, MET_nonMET_FoldChange, and geneSource columns.
genes30_ID_FC <- NPC_30[,c(1,35,36)]
genes30_ID_FC
## ID_REF MET_nonMET_FoldChange
## 1 MYC 2.15185389
## 2 VCAM1 1.39478595
## 3 BRCA1 1.29704183
## 4 BCL2 1.26375241
## 5 VEGFA 1.20253916
## 6 KDM5B 1.14112231
## 7 BRD4 1.12040383
## 8 CD47 1.07109005
## 9 CD74 0.88890443
## 10 ATM 0.56792044
## 11 TRPA1 6.89920949
## 12 CCL2 5.91986032
## 13 SHISA2 5.37001514
## 14 LGR5 5.02904401
## 15 FDCSP 4.97791817
## 16 CXCL6 4.65092687
## 17 CST1 3.93830205
## 18 MMP1 3.77502483
## 19 FSTL3 3.72344208
## 20 AKR1C3 3.37745476
## 21 AMY1A 0.15966597
## 22 MSLN 0.14037864
## 23 GSTA1 0.13022230
## 24 MSMB 0.12031737
## 25 NCCRP1 0.11610074
## 26 COL14A1 0.10322887
## 27 ZNF331 0.05778740
## 28 SCGB1A1 0.03950420
## 29 ZPBP 0.02622455
## 30 LTF 0.02573839
## geneSource
## 1 GSE299775 EBV+NCP gene of interest
## 2 GSE299775 EBV+NCP gene of interest
## 3 GSE299775 EBV+NCP gene of interest
## 4 GSE299775 EBV+NCP gene of interest
## 5 GSE299775 EBV+NCP gene of interest
## 6 GSE299775 EBV+NCP gene of interest
## 7 GSE299775 EBV+NCP gene of interest
## 8 GSE299775 EBV+NCP gene of interest
## 9 GSE299775 EBV+NCP gene of interest
## 10 GSE299775 EBV+NCP gene of interest
## 11 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 12 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 13 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 14 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 15 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 16 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 17 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 18 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 19 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 20 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 21 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 22 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 23 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 24 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 25 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 26 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 27 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 28 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 29 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 30 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
colnames(pathologyDB)
## [1] "Ensembl_ID" "Genecards_ID" "FC_pathology_control"
## [4] "topGenePathology" "mediaType" "studySummarized"
## [7] "GSE_study_ID"
We need to make our column names the same exact name as the pathologyDB to add them to it, but we have to add those columns and fill them in too.
colnames(genes30_ID_FC) <- c("Genecards_ID", "FC_pathology_control","topGenePathology")
The published article has a section that describes what procedure was done for this data. It is not PBMCs or peripheral blood mononuclear cells, but biopsied FFPE NPC using total RNA and not mRNA as most of our studies used. The section of the article ‘Spatial Transcriptome Analysis with GeoMX Digital Spatial Profiler (DSP)’ on page 15 on PubMed document.
genes30_ID_FC$Ensembl_ID <- "Ensembl_ID needed"
genes30_ID_FC$mediaType <- "Total RNA of biopsied Formalin-Fixed Paraffine-Embedded or FFPE Nasopharyngeal Carcinoma or NPC tumor samples of 16 MET (metastatic after 3 years) and 8 nonMET (not metastatic afer 3 years) and 4 normal for 28 samples in total"
genes30_ID_FC$studySummarized <- "GSE299775 on the chromatin loops that DNA of less dense genes have access to latent type II Epstein-Barr Virus (EBV) that can hijack the genes in those areas and cause cancer cell proliferation, tumor growth, and a higher risk of fatality in Nasopharyngeal Carcinoma (NPC) patients. The gene target of interest for this study is KDM5B and the EBV interacting region or EBVIR that one of the 3 EBV genes can enter the chromatin and hijack the human DNA to activate and inactivate genes that promote tumor growth and metastasis. This study found that the chromatin around KDM5B and the EBVIR-enhancer-KDM5B signature was found in EBV+NPC and in patient derived xenographs where the tumor is implanted in a mouse that has been made immunodeficient to watch the tumor progress. They found that when KDM5B is upregulated that the cancer promoting genes are also active and cancer suppressor genes inactive. There are 10 genes from the study that made the cut of this study's quality control, batch effect removal, normalization, and keeping only the top 25th percentile of gene counts of mRNA from the NPC tumors, so not all genes they mentioned in the study made this step of the research. There were other gene sets they analyzed with various unsupervised and supervised machine learning algorithms. These genes as a set were also sorted to get means per sample type or class then get the fold change of the MET over the nonMET to use the top 10 up regulated and top 10 down regulated genes for a total of 20 genes. The Ensembl_ID wasn't included in this study and has to be found online manually. The 10 genes from the study are: MYC, VCAM1, BRCA1, BCL2, VEGFA, KDM5B, BRD4, CD47, CD74, and ATM. There were 3 genes that were EBV genes: EBNA1, oriP, and RPMS1 that were the EBV genes that lock onto the chromatin of human host and have been shown to enhance carcinoma cell proliferation in NPC as well as gastric carcinoma. There were some other genes in the study that didn't get through quality control in the data for GSE299775: LMP1, H3K27,H3K27ac, H3K27me3, CTFC, LGALS9, MK167, BZLF1, MLL, Ku70, MK167. The top 10 up regulated or enhanced genes from fold change values of MET/nonMET were: TRPA1, CCL2, SHISA2, LGR5, FDCSP, CXCL6, CST1, MMP1, FSTL3, and AKR1C3. And the top 10 down regulated or silenced genes from fold change values MET/nonMET were: AMY1A, MSLN, GSTA1, MSMB,
NCCRP1, COL14A1, ZNF331, SCGB1A1, ZPBP, LTF."
genes30_ID_FC$GSE_study_ID <- "GSE299775"
colnames(genes30_ID_FC)
## [1] "Genecards_ID" "FC_pathology_control" "topGenePathology"
## [4] "Ensembl_ID" "mediaType" "studySummarized"
## [7] "GSE_study_ID"
genes30 <- genes30_ID_FC[,c(4,1,2,3,5:7)]
colnames(genes30)
## [1] "Ensembl_ID" "Genecards_ID" "FC_pathology_control"
## [4] "topGenePathology" "mediaType" "studySummarized"
## [7] "GSE_study_ID"
colnames(pathologyDB)
## [1] "Ensembl_ID" "Genecards_ID" "FC_pathology_control"
## [4] "topGenePathology" "mediaType" "studySummarized"
## [7] "GSE_study_ID"
Lets read in our last project’s data set for GSE271486 and see if the Ensemble_ID for our 30 genes our in the file to easily add to this file before manually adding the Ensemble ID.
prevStudy <- "C:...GSE271486_HNE_1_MUT_LMP1_vs_HNE_1_WT_LMP1.csv" #use own path on computer for this file that came with the study GSE271486.
previous <- read.csv(prevStudy, header=T, sep=',')
head(previous)
## gene_id gene_name
## 1 ENSG00000000003 TSPAN6
## 2 ENSG00000000005 TNMD
## 3 ENSG00000000419 DPM1
## 4 ENSG00000000457 SCYL3
## 5 ENSG00000000460 C1orf112
## 6 ENSG00000000938 FGR
## description
## 1 tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]
## 2 tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757]
## 3 dolichyl-phosphate mannosyltransferase subunit 1, catalytic [Source:HGNC Symbol;Acc:HGNC:3005]
## 4 SCY1 like pseudokinase 3 [Source:HGNC Symbol;Acc:HGNC:19285]
## 5 chromosome 1 open reading frame 112 [Source:HGNC Symbol;Acc:HGNC:25565]
## 6 FGR proto-oncogene, Src family tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:3697]
## locus HNE_1_MUT_LMP1_1_count HNE_1_MUT_LMP1_2_count
## 1 X:100627109-100639991:- 1057 700
## 2 X:100584802-100599885:+ 0 0
## 3 20:50934867-50958555:- 937 678
## 4 1:169849631-169894267:- 200 136
## 5 1:169662007-169854080:+ 537 474
## 6 1:27612064-27635277:- 0 0
## HNE_1_MUT_LMP1_3_count HNE_1_WT_LMP1_1_count HNE_1_WT_LMP1_2_count
## 1 855 1643 826
## 2 0 0 0
## 3 932 1114 930
## 4 158 273 120
## 5 471 839 617
## 6 0 0 0
## HNE_1_WT_LMP1_3_count HNE_1_MUT_LMP1_1_FPKM HNE_1_MUT_LMP1_2_FPKM
## 1 875 9.355201 8.273695
## 2 0 0.000000 0.000000
## 3 845 19.973180 19.300172
## 4 177 1.247310 1.132686
## 5 594 3.772203 4.446549
## 6 0 0.000000 0.000000
## HNE_1_MUT_LMP1_3_FPKM HNE_1_WT_LMP1_1_FPKM HNE_1_WT_LMP1_2_FPKM
## 1 9.658906 10.528930 7.0215334
## 2 0.000000 0.000000 0.0000000
## 3 25.357562 17.193387 19.0398968
## 4 1.257724 1.232754 0.7187855
## 5 4.223044 4.267284 4.1627526
## 6 0.000000 0.000000 0.0000000
## HNE_1_WT_LMP1_3_FPKM
## 1 9.546551
## 2 0.000000
## 3 22.203657
## 4 1.360751
## 5 5.143604
## 6 0.000000
Lets see if our 30 genes our listed in this dataset we just read in from last project on EBV NPC in GSE271486.
maybe30 <- previous[which(previous$gene_name %in% genes30$Genecards_ID),]
maybe30$gene_name
## [1] "BRCA1" "LTF" "CD74" "ZPBP" "FSTL3" "MSLN" "TRPA1"
## [8] "CCL2" "VEGFA" "KDM5B" "CXCL6" "ZNF331" "MYC" "LGR5"
## [15] "BRD4" "SCGB1A1" "ATM" "VCAM1" "CST1" "BCL2" "SHISA2"
## [22] "FDCSP" "COL14A1" "NCCRP1" "AKR1C3" "MMP1" "CD47" "AMY1A"
## [29] "GSTA1" "MSMB" "AKR1C3"
genes30$Genecards_ID
## [1] "MYC" "VCAM1" "BRCA1" "BCL2" "VEGFA" "KDM5B" "BRD4"
## [8] "CD47" "CD74" "ATM" "TRPA1" "CCL2" "SHISA2" "LGR5"
## [15] "FDCSP" "CXCL6" "CST1" "MMP1" "FSTL3" "AKR1C3" "AMY1A"
## [22] "MSLN" "GSTA1" "MSMB" "NCCRP1" "COL14A1" "ZNF331" "SCGB1A1"
## [29] "ZPBP" "LTF"
We have an extra ID in the 30 gene match of our previous study. We don’t care about the other information but that we can get the Ensemble ID, so we can remove the duplicates.
Yes30 <- maybe30[,c(1,2)]
Yes30 <- Yes30[!duplicated(Yes30$gene_name),]
Yes30
## gene_id gene_name
## 310 ENSG00000012048 BRCA1
## 316 ENSG00000012223 LTF
## 394 ENSG00000019582 CD74
## 591 ENSG00000042813 ZPBP
## 1111 ENSG00000070404 FSTL3
## 2710 ENSG00000102854 MSLN
## 2900 ENSG00000104321 TRPA1
## 3574 ENSG00000108691 CCL2
## 4100 ENSG00000112715 VEGFA
## 4672 ENSG00000117139 KDM5B
## 5599 ENSG00000124875 CXCL6
## 6361 ENSG00000130844 ZNF331
## 7396 ENSG00000136997 MYC
## 7808 ENSG00000139292 LGR5
## 8178 ENSG00000141867 BRD4
## 9161 ENSG00000149021 SCGB1A1
## 9199 ENSG00000149311 ATM
## 10805 ENSG00000162692 VCAM1
## 12810 ENSG00000170373 CST1
## 13149 ENSG00000171791 BCL2
## 14928 ENSG00000180730 SHISA2
## 15088 ENSG00000181617 FDCSP
## 16614 ENSG00000187955 COL14A1
## 16753 ENSG00000188505 NCCRP1
## 17023 ENSG00000196139 AKR1C3
## 17195 ENSG00000196611 MMP1
## 17244 ENSG00000196776 CD47
## 30895 ENSG00000237763 AMY1A
## 33183 ENSG00000243955 GSTA1
## 42653 ENSG00000263639 MSMB
Lets merge this data with our 30 genes of NPC from GSE299775.
Ids30 <- merge(Yes30, genes30, by.x='gene_name', by.y='Genecards_ID')
colnames(Ids30)
## [1] "gene_name" "gene_id" "Ensembl_ID"
## [4] "FC_pathology_control" "topGenePathology" "mediaType"
## [7] "studySummarized" "GSE_study_ID"
We need to change the gene_id to Ensemble_ID and remove the Ensemble_ID that was a placeholder filled with ‘needs Ensembl_ID’ before adding it to the pathology database.
ID30 <- Ids30[,c(2,1,4:8)] #rearrange while dropping placeholder Ensembl_ID column
colnames(ID30)[1] <- "Ensembl_ID"
colnames(ID30)[2] <- "Genecards_ID"
colnames(ID30)
## [1] "Ensembl_ID" "Genecards_ID" "FC_pathology_control"
## [4] "topGenePathology" "mediaType" "studySummarized"
## [7] "GSE_study_ID"
Once more check to compare that the columns are the same name.
colnames(pathologyDB)
## [1] "Ensembl_ID" "Genecards_ID" "FC_pathology_control"
## [4] "topGenePathology" "mediaType" "studySummarized"
## [7] "GSE_study_ID"
Now we are ready to add these 30 genes to our pathology database.
Pathology <- rbind(pathologyDB,ID30)
tail(Pathology,2)
## Ensembl_ID Genecards_ID FC_pathology_control
## 266 ENSG00000130844 ZNF331 0.05778740
## 267 ENSG00000042813 ZPBP 0.02622455
## topGenePathology
## 266 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 267 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## mediaType
## 266 Total RNA of biopsied Formalin-Fixed Paraffine-Embedded or FFPE Nasopharyngeal Carcinoma or NPC tumor samples of 16 MET (metastatic after 3 years) and 8 nonMET (not metastatic afer 3 years) and 4 normal for 28 samples in total
## 267 Total RNA of biopsied Formalin-Fixed Paraffine-Embedded or FFPE Nasopharyngeal Carcinoma or NPC tumor samples of 16 MET (metastatic after 3 years) and 8 nonMET (not metastatic afer 3 years) and 4 normal for 28 samples in total
## studySummarized
## 266 GSE299775 on the chromatin loops that DNA of less dense genes have access to latent type II Epstein-Barr Virus (EBV) that can hijack the genes in those areas and cause cancer cell proliferation, tumor growth, and a higher risk of fatality in Nasopharyngeal Carcinoma (NPC) patients. The gene target of interest for this study is KDM5B and the EBV interacting region or EBVIR that one of the 3 EBV genes can enter the chromatin and hijack the human DNA to activate and inactivate genes that promote tumor growth and metastasis. This study found that the chromatin around KDM5B and the EBVIR-enhancer-KDM5B signature was found in EBV+NPC and in patient derived xenographs where the tumor is implanted in a mouse that has been made immunodeficient to watch the tumor progress. They found that when KDM5B is upregulated that the cancer promoting genes are also active and cancer suppressor genes inactive. There are 10 genes from the study that made the cut of this study's quality control, batch effect removal, normalization, and keeping only the top 25th percentile of gene counts of mRNA from the NPC tumors, so not all genes they mentioned in the study made this step of the research. There were other gene sets they analyzed with various unsupervised and supervised machine learning algorithms. These genes as a set were also sorted to get means per sample type or class then get the fold change of the MET over the nonMET to use the top 10 up regulated and top 10 down regulated genes for a total of 20 genes. The Ensembl_ID wasn't included in this study and has to be found online manually. The 10 genes from the study are: MYC, VCAM1, BRCA1, BCL2, VEGFA, KDM5B, BRD4, CD47, CD74, and ATM. There were 3 genes that were EBV genes: EBNA1, oriP, and RPMS1 that were the EBV genes that lock onto the chromatin of human host and have been shown to enhance carcinoma cell proliferation in NPC as well as gastric carcinoma. There were some other genes in the study that didn't get through quality control in the data for GSE299775: LMP1, H3K27,H3K27ac, H3K27me3, CTFC, LGALS9, MK167, BZLF1, MLL, Ku70, MK167. The top 10 up regulated or enhanced genes from fold change values of MET/nonMET were: TRPA1, CCL2, SHISA2, LGR5, FDCSP, CXCL6, CST1, MMP1, FSTL3, and AKR1C3. And the top 10 down regulated or silenced genes from fold change values MET/nonMET were: AMY1A, MSLN, GSTA1, MSMB, \nNCCRP1, COL14A1, ZNF331, SCGB1A1, ZPBP, LTF.
## 267 GSE299775 on the chromatin loops that DNA of less dense genes have access to latent type II Epstein-Barr Virus (EBV) that can hijack the genes in those areas and cause cancer cell proliferation, tumor growth, and a higher risk of fatality in Nasopharyngeal Carcinoma (NPC) patients. The gene target of interest for this study is KDM5B and the EBV interacting region or EBVIR that one of the 3 EBV genes can enter the chromatin and hijack the human DNA to activate and inactivate genes that promote tumor growth and metastasis. This study found that the chromatin around KDM5B and the EBVIR-enhancer-KDM5B signature was found in EBV+NPC and in patient derived xenographs where the tumor is implanted in a mouse that has been made immunodeficient to watch the tumor progress. They found that when KDM5B is upregulated that the cancer promoting genes are also active and cancer suppressor genes inactive. There are 10 genes from the study that made the cut of this study's quality control, batch effect removal, normalization, and keeping only the top 25th percentile of gene counts of mRNA from the NPC tumors, so not all genes they mentioned in the study made this step of the research. There were other gene sets they analyzed with various unsupervised and supervised machine learning algorithms. These genes as a set were also sorted to get means per sample type or class then get the fold change of the MET over the nonMET to use the top 10 up regulated and top 10 down regulated genes for a total of 20 genes. The Ensembl_ID wasn't included in this study and has to be found online manually. The 10 genes from the study are: MYC, VCAM1, BRCA1, BCL2, VEGFA, KDM5B, BRD4, CD47, CD74, and ATM. There were 3 genes that were EBV genes: EBNA1, oriP, and RPMS1 that were the EBV genes that lock onto the chromatin of human host and have been shown to enhance carcinoma cell proliferation in NPC as well as gastric carcinoma. There were some other genes in the study that didn't get through quality control in the data for GSE299775: LMP1, H3K27,H3K27ac, H3K27me3, CTFC, LGALS9, MK167, BZLF1, MLL, Ku70, MK167. The top 10 up regulated or enhanced genes from fold change values of MET/nonMET were: TRPA1, CCL2, SHISA2, LGR5, FDCSP, CXCL6, CST1, MMP1, FSTL3, and AKR1C3. And the top 10 down regulated or silenced genes from fold change values MET/nonMET were: AMY1A, MSLN, GSTA1, MSMB, \nNCCRP1, COL14A1, ZNF331, SCGB1A1, ZPBP, LTF.
## GSE_study_ID
## 266 GSE299775
## 267 GSE299775
Lets write this pathology database out to csv to continue adding pathology fold change results of top genes for each pathology as we progress in making our database of pathologies of EBV, EBV associated pathologies, Lyme disease, multiple sclerosis, and fibromyalgia.
Lets just see which pathologies we have in our database currently.
table(Pathology$topGenePathology)
##
## EBVaNPC nasopharyngeal carcinoma with EBV infection
## 34
## Epstein Barr Virus
## 80
## fibromyalgia
## 15
## GSE299775 EBV+NCP gene of interest
## 10
## Lyme Disease 6 months
## 33
## mononucleosis
## 15
## Multiple Sclerosis
## 41
## NKTCL Natural Killer T-Cell Lymphoma & EBV
## 19
## top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 20
There are 9 types but a couple of them are EBV in Nasopharyngeal Carcinoma. We have more to research and analyze in Hodgkin’s lymphoma, large B-cell Lymphoma, gastric carcinoma, intrahepatic cholangiocarcinoma in gall bladder bile duct carcinoma, and others if we find them on the way before we can test out our machine to see if the machine can predict by class on similar data we find from same media type. Seems all have been normalized between 0 and 1 and scaled to some affect, but fold change will still be a change between the pathology vs the normal or baseline state. Its neat because if you think a pathology is similar and might be affected by the same pathway of cancer proliferation and metastasis in these pathologies or disease progression, the same genes could possibly be affected that are also affected in other pathologies related to it. In this study we just completed on EBV positive NPC and gastric carcinoma. They noticed that the same chromatin loop interaction site of DNA access for latent EBV to infiltrate and hijack the KDM5B virus was present with their identified EBVIR-enhancer-KDM5B signature but that only the EBV latent type II did the compartment switching in the chromatin loops for less gene dense or gene rich regions did the EBV infiltrate NPC and not for gastric carcinoma, in gastric carcinoma they didn’t see compartment switching but did see it increase risk of metastasis with the signature present. This is because EBV latent type 1 is what affects gastric carcinoma. A mnemonic for cancer like skin cancer and EBV or HPV for human papolloma virus, that they change the normal epithelial from stratified epithelial cells into a different not stratified type cell, the mnemonic is,‘if its satisfied, its stratified.’ Like food in to satisfy hunger and farting or pooping to satisfy an upset gut and the reproductive tract lining for sexual intercourse and the urethral tract for peeing to satisfy emptying a full bladder. Basically, just the entire gastrointestinal tract and uretergenito tract.
Lets write this out to csv and you can get a copy here.
write.csv(Pathology,'pathologyDB_2EBVNPC_Lyme_MS_mono_NKTCL_EBV.csv', row.names=F)
Thanks so much and keep checking back to see how this is going.