In this study, it is another look at nasopharygeal carcinoma (NPC) and how Epstein-Barr virus (EBV) latent infection can attach to certain regions of the human genome where less gene profiles are located and start hijacking genes associated with tumorigenesis or tumor creation and tumor angiogenesis of creating blood supply to the tumors being grown in NPC. It switches compartments along the chromosome host where the chromatin of DNA that wraps it along histones to create an environment that the tumor suppressing genes are silenced and tumor creation genes thrive in aggressive NPC as well as similar results in gastric carcinoma. The researchers, they, found a specific gene and a specific signature to predict with great certainty if a sample was NPC with latent EBV, or not as specifically an outcome as gastric carcinoma. It compared different cell samples of commercial NPC epithelial, then equal amounts of NPC with EBV and NPC without EBV as well as patient derived xenografts (internet says are tumor biopsies implanted in an immunodeficient or humanized mouse to allow the cancerous tumor to grow and be monitored to see how the cancer survives and thrives) with NPC and gastric carcinoma or GC. The study is a lot of research by a team of 27 researchers that really dived into the effects of EBV and how it could possibly survive, mutate, and die or become dormant or put back into its latent state in the human host.

There are 16 cell line sources that the samples were taken for this research that were all done in vitro for this study that are provided in the study ID at GSE299775, but other samples as well are covered in the published PubMed document. The samples they used are from in total 16 different cell lines, made up of:

The study is a lot to unpack in one sitting (most document had my highlighting on it) and much work and figures of analysis were done in this 21 page research study published last August in 2025, but overall they confirmed the LMP1 gene affecting NPC changes and being dysregulated by EBV in our last analysis but also the gene region for KDM5B gene and their interaction regions along the human chromosome by EBV that had the KDM5B signature, called EBVIR-enhancer-KDMB5. The researchers, they, found that the marker was elevated in high amounts of EBV+ NPC but not so much or at all in EBV- NPC samples. It was very interesting discoveries made and learned a bunch of things such as:

chromosome viewer in ncbi KDM5B viewer chromosome viewer in ncbi KDM5B viewer The NCBI chromosome viewer for KDM5B is seen above in the middle at top of image and the genes in the less gene dense region of KDM5B chromosomal location on chromosome 1 on the reverse or complementary strand at 1.q32.1.

As you can see the study was very in depth and has many bits of information to find and analyse, none of the 27 contributors slacked off as there were many different directions this study took to prove that their hypothesis of chromatin loop-loop disruption by EBV latent type II was a pathway for metastatic NPC. Many explanatory diagrams and plots of the analysis, as well as many different sources of data and methods to analyzing the various data. The study seemed to expand on another study they referenced that found KDM5B as a region of further study in EBV associated NPC by Shumilov et al. Reasonable inferences and understandings of the findings or results of their tests and analysis. The figures provided used many machine learning tools like unsupervised learning in the heirarchical clustering to find compartments that switch on and off in EBV+NPC and EBV-NPC samples as well as the NPE samples. It also used supervised learning of univariate and multivariate logistic regression, and differential expression of genes with DESeq. Some heat maps and clustering with UMap were done as well that is also unsupervised machine learning. This study took me a few hours last night and a whole work day today of about 6 hours to annotate and grab information from without starting the analysis of the genes and gene targets to compare in the study, but we do have an idea of the genes to target and look for in comparing to the top upregulated and downregulated genes in our study but making a special list for those genes in the study publication that were found to affect latent EBV infection and poor prognosis in NPC and also GC. For one it is the chromosome regions of interaction mentioned as being resorts for EBV to land and comingle and change the human host chromosome at EBNA1, OriP, and RPMS1, but also other genes had affects of interest in this study: KDM5B, LMP1, H3K27 histone signatures of ac and me3 postscripts, CTFC, JQ1, BRD4, BRCA1, LGALS9, CD47, VCAM1, VEGFA, MK167, MYC, BCL2, BZLF1, ATM, MLL, Ku70, CD74, and MK167.

The data we use here is from GSE299775 is for the spatial transcriptome data that was used in GEOMX-DSP portion of their study analysis and research. There are 28 samples on the page and in the series data information.

series <- read.table('GSE299775_series_matrix.txt', header=T, comment.char='!')
colnames(series)
##  [1] "ID_REF"     "GSM9045905" "GSM9045906" "GSM9045907" "GSM9045908"
##  [6] "GSM9045909" "GSM9045910" "GSM9045911" "GSM9045912" "GSM9045913"
## [11] "GSM9045914" "GSM9045915" "GSM9045916" "GSM9045917" "GSM9045918"
## [16] "GSM9045919" "GSM9045920" "GSM9045921" "GSM9045922" "GSM9045923"
## [21] "GSM9045924" "GSM9045925" "GSM9045926" "GSM9045927" "GSM9045928"
## [26] "GSM9045929" "GSM9045930" "GSM9045931" "GSM9045932"

Other gene expression data related to their overall study and published work is found in:

======================

Lets read in the count matrix data and series matrix data.

totalSeriesInfo <- read.table("GSE299775_series_matrix.txt", 
                              header=FALSE, 
                              sep=' ', 
                              nrows=73)
head(totalSeriesInfo,10)
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        V1
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 !Series_title\tSpatial transcriptome profiling for nasopharyngeal carcinoma [GeoMx-DSP]
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        !Series_geo_accession\tGSE299775
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   !Series_status\tPublic on Jun 14 2025
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    !Series_submission_date\tJun 13 2025
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   !Series_last_update_date\tAug 13 2025
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             !Series_pubmed_id\t40759888
## 7  !Series_summary\tAs the first human DNA tumour virus Epstein-Barr virus (EBV) is detectable in over 90% of nasopharyngeal carcinoma (NPC) patients in the endemic regions. Genome-wide analysis of 3D chromosome conformation revealed the virus-host chromatin interactions induce spatial reorganisation of loops and compartments, exhibiting “enhancer infestation” and “H3K27 bivalency” at EBV interaction regions (EBVIRs). Through this mechanism, EBV hijacks KDM5B, a genome stability gatekeeper, and its binding targets leading to aberrant activation of the EBVIRs-enhancer-KDM5B signature. The cancer cells with this signature had increased MYC activation, DNA damage responses, and epigenetic plasticity for epithelial-immune cell dual features with metastatic potential. Our multi-centre multi-omics study further confirmed that this signature was highly relevant to chromosome instability and could be utilised as a risk factor for distant metastasis. This study highlights a novel paradigm where latent viral episomes could alter host high-order epigenotype, promoting transcriptional rewiring and risk of metastasis in NPC.
## 8                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                !Series_overall_design\tSpatial transcriptome data were generated from primary NPC patients. Patients with NPC who developed distant metastasis within three years after treatment were classified as metastatic NPC (MET-NPC), while those without early metastasis were designated as non-metastatic NPC (nonMET-NPC).
## 9                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     !Series_type\tOther
## 10                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          !Series_contributor\tWei,,Dai

It looks like the sample type is in row 40, but other information we could use is in the series data.

Lets just read in row 40 by itself for sample type.

sampleType <- read.table('GSE299775_series_matrix.txt', skip=40, nrow=1, header=F)
sampleType
##                            V1             V2             V3                V4
## 1 !Sample_characteristics_ch1 cell type: MET cell type: MET cell type: nonMET
##                  V5             V6             V7             V8             V9
## 1 cell type: nonMET cell type: MET cell type: MET cell type: MET cell type: MET
##              V10            V11            V12               V13
## 1 cell type: MET cell type: MET cell type: MET cell type: nonMET
##                 V14            V15            V16               V17
## 1 cell type: nonMET cell type: MET cell type: MET cell type: Normal
##                 V18               V19               V20               V21
## 1 cell type: Normal cell type: Normal cell type: nonMET cell type: nonMET
##                 V22               V23               V24            V25
## 1 cell type: Normal cell type: nonMET cell type: nonMET cell type: MET
##              V26            V27               V28               V29
## 1 cell type: MET cell type: MET cell type: nonMET cell type: nonMET

Lets save this as a string the sample type.

sampleName <- as.character(sampleType[1,])
sampleName
##  [1] "!Sample_characteristics_ch1" "cell type: MET"             
##  [3] "cell type: MET"              "cell type: nonMET"          
##  [5] "cell type: nonMET"           "cell type: MET"             
##  [7] "cell type: MET"              "cell type: MET"             
##  [9] "cell type: MET"              "cell type: MET"             
## [11] "cell type: MET"              "cell type: MET"             
## [13] "cell type: nonMET"           "cell type: nonMET"          
## [15] "cell type: MET"              "cell type: MET"             
## [17] "cell type: Normal"           "cell type: Normal"          
## [19] "cell type: Normal"           "cell type: nonMET"          
## [21] "cell type: nonMET"           "cell type: Normal"          
## [23] "cell type: nonMET"           "cell type: nonMET"          
## [25] "cell type: MET"              "cell type: MET"             
## [27] "cell type: MET"              "cell type: nonMET"          
## [29] "cell type: nonMET"

Remove the ‘cell type:’ from our character string of sample type to combine to our GSM sample IDs in our series dataset of header sample markers only.

sampleNames <- gsub('cell type: ', '',sampleName)
sampleNames
##  [1] "!Sample_characteristics_ch1" "MET"                        
##  [3] "MET"                         "nonMET"                     
##  [5] "nonMET"                      "MET"                        
##  [7] "MET"                         "MET"                        
##  [9] "MET"                         "MET"                        
## [11] "MET"                         "MET"                        
## [13] "nonMET"                      "nonMET"                     
## [15] "MET"                         "MET"                        
## [17] "Normal"                      "Normal"                     
## [19] "Normal"                      "nonMET"                     
## [21] "nonMET"                      "Normal"                     
## [23] "nonMET"                      "nonMET"                     
## [25] "MET"                         "MET"                        
## [27] "MET"                         "nonMET"                     
## [29] "nonMET"

Lets look at the header of series.

header <- as.character(colnames(series))
header
##  [1] "ID_REF"     "GSM9045905" "GSM9045906" "GSM9045907" "GSM9045908"
##  [6] "GSM9045909" "GSM9045910" "GSM9045911" "GSM9045912" "GSM9045913"
## [11] "GSM9045914" "GSM9045915" "GSM9045916" "GSM9045917" "GSM9045918"
## [16] "GSM9045919" "GSM9045920" "GSM9045921" "GSM9045922" "GSM9045923"
## [21] "GSM9045924" "GSM9045925" "GSM9045926" "GSM9045927" "GSM9045928"
## [26] "GSM9045929" "GSM9045930" "GSM9045931" "GSM9045932"

Combine these into a data frame.

dataIDs <- data.frame(GSM_ID = header, sampleType=sampleNames)
dataIDs
##        GSM_ID                  sampleType
## 1      ID_REF !Sample_characteristics_ch1
## 2  GSM9045905                         MET
## 3  GSM9045906                         MET
## 4  GSM9045907                      nonMET
## 5  GSM9045908                      nonMET
## 6  GSM9045909                         MET
## 7  GSM9045910                         MET
## 8  GSM9045911                         MET
## 9  GSM9045912                         MET
## 10 GSM9045913                         MET
## 11 GSM9045914                         MET
## 12 GSM9045915                         MET
## 13 GSM9045916                      nonMET
## 14 GSM9045917                      nonMET
## 15 GSM9045918                         MET
## 16 GSM9045919                         MET
## 17 GSM9045920                      Normal
## 18 GSM9045921                      Normal
## 19 GSM9045922                      Normal
## 20 GSM9045923                      nonMET
## 21 GSM9045924                      nonMET
## 22 GSM9045925                      Normal
## 23 GSM9045926                      nonMET
## 24 GSM9045927                      nonMET
## 25 GSM9045928                         MET
## 26 GSM9045929                         MET
## 27 GSM9045930                         MET
## 28 GSM9045931                      nonMET
## 29 GSM9045932                      nonMET

We have 28 samples of MET, nonMET, and Normal. Lets see the breakdown by sampleType.

dataID <- dataIDs[c(2:29),]
table(dataID$sampleType)
## 
##    MET nonMET Normal 
##     14     10      4

Our data, dataID, has 14 MET, 10 nonMET, and 4 Normal sample types.

dataID2 <- dataID[order(dataID$sampleType, decreasing=T),]
dataID2
##        GSM_ID sampleType
## 17 GSM9045920     Normal
## 18 GSM9045921     Normal
## 19 GSM9045922     Normal
## 22 GSM9045925     Normal
## 4  GSM9045907     nonMET
## 5  GSM9045908     nonMET
## 13 GSM9045916     nonMET
## 14 GSM9045917     nonMET
## 20 GSM9045923     nonMET
## 21 GSM9045924     nonMET
## 23 GSM9045926     nonMET
## 24 GSM9045927     nonMET
## 28 GSM9045931     nonMET
## 29 GSM9045932     nonMET
## 2  GSM9045905        MET
## 3  GSM9045906        MET
## 6  GSM9045909        MET
## 7  GSM9045910        MET
## 8  GSM9045911        MET
## 9  GSM9045912        MET
## 10 GSM9045913        MET
## 11 GSM9045914        MET
## 12 GSM9045915        MET
## 15 GSM9045918        MET
## 16 GSM9045919        MET
## 25 GSM9045928        MET
## 26 GSM9045929        MET
## 27 GSM9045930        MET

Lets remove the unused variables that clutter or environment space.

rm(dataIDs, sampleType,series, header, sampleInfo, sampleName, sampleNames)
## Warning in rm(dataIDs, sampleType, series, header, sampleInfo, sampleName, :
## object 'sampleInfo' not found
ls()
## [1] "dataID"          "dataID2"         "totalSeriesInfo"

Now lets read in the count data provided by this GSE299775 study.

counts <- read.table('GSE299775_count_matrix.txt', header=T)
str(counts)
## 'data.frame':    18677 obs. of  29 variables:
##  $ ID_REF                 : chr  "TGIF2LY" "OR5B21" "SUGCT" "GABARAP" ...
##  $ DSP.1001660020631.H.A03: num  0.71 2.84 0.71 22.73 15.63 ...
##  $ DSP.1001660020631.H.A05: num  1.07 4.26 1.07 18.12 10.66 ...
##  $ DSP.1001660020631.H.A07: num  4.26 4.26 4.26 4.26 12.79 ...
##  $ DSP.1001660020631.H.A09: num  4.26 4.26 4.26 38.36 4.26 ...
##  $ DSP.1001660020631.H.A12: num  0.95 0.95 0.95 23.68 3.79 ...
##  $ DSP.1001660020631.H.B02: num  0.53 2.66 0.53 28.77 13.32 ...
##  $ DSP.1001660020631.H.B04: num  4.65 2.33 1.55 24.03 3.1 ...
##  $ DSP.1001660020631.H.B06: num  1.79 5.16 1.79 11.44 9.65 ...
##  $ DSP.1001660020631.H.B08: num  1.71 3.41 5.97 32.4 14.49 ...
##  $ DSP.1001660020631.H.B10: num  1.89 4.74 2.84 18.94 14.21 ...
##  $ DSP.1001660020631.H.B12: num  3.88 4.65 3.88 30.23 13.18 ...
##  $ DSP.1001660020631.H.C02: num  1.71 6.82 1.71 28.99 10.23 ...
##  $ DSP.1001660020631.H.C03: num  1.42 4.26 0.71 46.18 10.66 ...
##  $ DSP.1001660020631.H.C06: num  1.22 3.65 1.22 13.4 8.53 2.44 1.22 3.65 1.22 3.65 ...
##  $ DSP.1001660020631.H.C10: num  4.26 3.2 1.07 21.31 11.72 ...
##  $ DSP.1001660020631.H.C12: num  1.07 2.13 2.13 28.77 4.26 ...
##  $ DSP.1001660020631.H.D01: num  2.56 0.85 0.85 28.13 2.56 ...
##  $ DSP.1001660020631.H.D02: num  2.33 3.1 1.55 33.33 3.88 ...
##  $ DSP.1001660020631.H.D11: num  1.1 1.65 0.83 30.53 12.38 ...
##  $ DSP.1001660020631.H.E01: num  0.78 2.33 1.55 18.6 13.95 ...
##  $ DSP.1001660020631.H.E02: num  0.74 1.11 1.85 37.07 2.59 ...
##  $ DSP.1001660020631.H.E05: num  4.26 4.26 4.26 29.84 17.05 ...
##  $ DSP.1001660020631.H.E07: num  1.48 1.85 1.11 25.95 18.9 ...
##  $ DSP.1001660020631.H.E09: num  2.13 2.84 3.55 29.84 11.37 ...
##  $ DSP.1001660020631.H.E11: num  3.1 7.75 0.78 18.6 11.63 ...
##  $ DSP.1001660020631.H.F01: num  1.29 2.84 1.03 22.48 21.7 ...
##  $ DSP.1001660020631.H.F03: num  1.71 5.12 1.71 15.35 6.82 ...
##  $ DSP.1001660020631.H.F05: num  1.71 1.71 3.41 25.58 10.23 ...

It is nice to have the IDs ordered by class sample type, but the data is not ordered so we kept the original dataID that removed the ID_REF feature to keep only the GSM ID next to sample type.

colnames(counts)
##  [1] "ID_REF"                  "DSP.1001660020631.H.A03"
##  [3] "DSP.1001660020631.H.A05" "DSP.1001660020631.H.A07"
##  [5] "DSP.1001660020631.H.A09" "DSP.1001660020631.H.A12"
##  [7] "DSP.1001660020631.H.B02" "DSP.1001660020631.H.B04"
##  [9] "DSP.1001660020631.H.B06" "DSP.1001660020631.H.B08"
## [11] "DSP.1001660020631.H.B10" "DSP.1001660020631.H.B12"
## [13] "DSP.1001660020631.H.C02" "DSP.1001660020631.H.C03"
## [15] "DSP.1001660020631.H.C06" "DSP.1001660020631.H.C10"
## [17] "DSP.1001660020631.H.C12" "DSP.1001660020631.H.D01"
## [19] "DSP.1001660020631.H.D02" "DSP.1001660020631.H.D11"
## [21] "DSP.1001660020631.H.E01" "DSP.1001660020631.H.E02"
## [23] "DSP.1001660020631.H.E05" "DSP.1001660020631.H.E07"
## [25] "DSP.1001660020631.H.E09" "DSP.1001660020631.H.E11"
## [27] "DSP.1001660020631.H.F01" "DSP.1001660020631.H.F03"
## [29] "DSP.1001660020631.H.F05"

Lets use those sample type IDs with GSM to rename the counts data column names into the GSM ID with sample type.

dataID$ID_name <- paste(dataID$GSM_ID,dataID$sampleType, sep='_')
dataID
##        GSM_ID sampleType           ID_name
## 2  GSM9045905        MET    GSM9045905_MET
## 3  GSM9045906        MET    GSM9045906_MET
## 4  GSM9045907     nonMET GSM9045907_nonMET
## 5  GSM9045908     nonMET GSM9045908_nonMET
## 6  GSM9045909        MET    GSM9045909_MET
## 7  GSM9045910        MET    GSM9045910_MET
## 8  GSM9045911        MET    GSM9045911_MET
## 9  GSM9045912        MET    GSM9045912_MET
## 10 GSM9045913        MET    GSM9045913_MET
## 11 GSM9045914        MET    GSM9045914_MET
## 12 GSM9045915        MET    GSM9045915_MET
## 13 GSM9045916     nonMET GSM9045916_nonMET
## 14 GSM9045917     nonMET GSM9045917_nonMET
## 15 GSM9045918        MET    GSM9045918_MET
## 16 GSM9045919        MET    GSM9045919_MET
## 17 GSM9045920     Normal GSM9045920_Normal
## 18 GSM9045921     Normal GSM9045921_Normal
## 19 GSM9045922     Normal GSM9045922_Normal
## 20 GSM9045923     nonMET GSM9045923_nonMET
## 21 GSM9045924     nonMET GSM9045924_nonMET
## 22 GSM9045925     Normal GSM9045925_Normal
## 23 GSM9045926     nonMET GSM9045926_nonMET
## 24 GSM9045927     nonMET GSM9045927_nonMET
## 25 GSM9045928        MET    GSM9045928_MET
## 26 GSM9045929        MET    GSM9045929_MET
## 27 GSM9045930        MET    GSM9045930_MET
## 28 GSM9045931     nonMET GSM9045931_nonMET
## 29 GSM9045932     nonMET GSM9045932_nonMET

Rename the counts data set header of column names.

colnames(counts)[2:29] <- dataID$ID_name

colnames(counts)
##  [1] "ID_REF"            "GSM9045905_MET"    "GSM9045906_MET"   
##  [4] "GSM9045907_nonMET" "GSM9045908_nonMET" "GSM9045909_MET"   
##  [7] "GSM9045910_MET"    "GSM9045911_MET"    "GSM9045912_MET"   
## [10] "GSM9045913_MET"    "GSM9045914_MET"    "GSM9045915_MET"   
## [13] "GSM9045916_nonMET" "GSM9045917_nonMET" "GSM9045918_MET"   
## [16] "GSM9045919_MET"    "GSM9045920_Normal" "GSM9045921_Normal"
## [19] "GSM9045922_Normal" "GSM9045923_nonMET" "GSM9045924_nonMET"
## [22] "GSM9045925_Normal" "GSM9045926_nonMET" "GSM9045927_nonMET"
## [25] "GSM9045928_MET"    "GSM9045929_MET"    "GSM9045930_MET"   
## [28] "GSM9045931_nonMET" "GSM9045932_nonMET"

lets look at the top few rows of this data.

head(counts)
##    ID_REF GSM9045905_MET GSM9045906_MET GSM9045907_nonMET GSM9045908_nonMET
## 1 TGIF2LY           0.71           1.07              4.26              4.26
## 2  OR5B21           2.84           4.26              4.26              4.26
## 3   SUGCT           0.71           1.07              4.26              4.26
## 4 GABARAP          22.73          18.12              4.26             38.36
## 5     CAD          15.63          10.66             12.79              4.26
## 6  TAS2R1           0.71           3.20              4.26              4.26
##   GSM9045909_MET GSM9045910_MET GSM9045911_MET GSM9045912_MET GSM9045913_MET
## 1           0.95           0.53           4.65           1.79           1.71
## 2           0.95           2.66           2.33           5.16           3.41
## 3           0.95           0.53           1.55           1.79           5.97
## 4          23.68          28.77          24.03          11.44          32.40
## 5           3.79          13.32           3.10           9.65          14.49
## 6           1.89           1.60           3.10           1.57           0.85
##   GSM9045914_MET GSM9045915_MET GSM9045916_nonMET GSM9045917_nonMET
## 1           1.89           3.88              1.71              1.42
## 2           4.74           4.65              6.82              4.26
## 3           2.84           3.88              1.71              0.71
## 4          18.94          30.23             28.99             46.18
## 5          14.21          13.18             10.23             10.66
## 6           0.95           0.78              1.71              2.84
##   GSM9045918_MET GSM9045919_MET GSM9045920_Normal GSM9045921_Normal
## 1           1.22           4.26              1.07              2.56
## 2           3.65           3.20              2.13              0.85
## 3           1.22           1.07              2.13              0.85
## 4          13.40          21.31             28.77             28.13
## 5           8.53          11.72              4.26              2.56
## 6           2.44           1.07              2.13              1.71
##   GSM9045922_Normal GSM9045923_nonMET GSM9045924_nonMET GSM9045925_Normal
## 1              2.33              1.10              0.78              0.74
## 2              3.10              1.65              2.33              1.11
## 3              1.55              0.83              1.55              1.85
## 4             33.33             30.53             18.60             37.07
## 5              3.88             12.38             13.95              2.59
## 6              0.78              0.83              3.10              2.59
##   GSM9045926_nonMET GSM9045927_nonMET GSM9045928_MET GSM9045929_MET
## 1              4.26              1.48           2.13           3.10
## 2              4.26              1.85           2.84           7.75
## 3              4.26              1.11           3.55           0.78
## 4             29.84             25.95          29.84          18.60
## 5             17.05             18.90          11.37          11.63
## 6              4.26              2.22           3.55           3.10
##   GSM9045930_MET GSM9045931_nonMET GSM9045932_nonMET
## 1           1.29              1.71              1.71
## 2           2.84              5.12              1.71
## 3           1.03              1.71              3.41
## 4          22.48             15.35             25.58
## 5          21.70              6.82             10.23
## 6           2.33              1.71              1.71

The data counts were collected using filtering, normalization, and the top 25th percentile according to row 47 of the series information.

detailsCounts <- totalSeriesInfo[47,]

details <- strsplit(detailsCounts, split='\t')
details[[1]][2]
## [1] "After quality control, batch effect removal, and Q3 normalisation using the top 25% of expressers to normalize across region of interest (ROI). The counts of the oligonucleotide tags were finally used to calculate the abundance of the corresponding mRNAs."

This is total RNA (row 41 of totalSeriesInfo) data of primary nasopharyngeal carcinoma or NPC.

This is the top 25% of expressed genes after quality control filtering, normalizing, and then calculating the abundance of the corresponding mRNAs from the oligonucleotide strand counts.

Lets make a list of those genes we want from the study before finding our top genes expressed in primary nasopharyngeal carcinoma from fold change data in this counts data set.

We have an idea of the genes to target and look for in comparing to the top upregulated and downregulated genes in our study but making a special list for those genes in the study publication that were found to affect latent EBV infection and poor prognosis in NPC and also GC. For one it is the chromosome regions of interaction mentioned as being resorts for EBV to land and comingle and change the human host chromosome at EBNA1, OriP, and RPMS1, but also other genes had affects of interest in this study: KDM5B, LMP1, H3K27 histone signatures of ac and me3 postscripts, CTFC, JQ1, BRD4, BRCA1, LGALS9, CD47, VCAM1, VEGFA, MK167, MYC, BCL2, BZLF1, ATM, MLL, Ku70, CD74, and MK167.

listed <- c("EBNA1", "oriP", "RPMS1", "KDM5B", "LMP1", "H3K27","H3K27ac", "H3K27me3", "CTFC", "JQ1", "BRD4", "BRCA1", "LGALS9", "CD47", "VCAM1", "VEGFA", "MK167", "MYC", "BCL2", "BZLF1", "ATM", "MLL", "Ku70", "CD74", "MK167")

Lets see if any of these genes as they are written are in this data by ID_REF in the counts data.

Listed <- counts[counts$ID_REF %in% listed,]
Listed
##       ID_REF GSM9045905_MET GSM9045906_MET GSM9045907_nonMET GSM9045908_nonMET
## 5499   VEGFA          33.39          15.98             21.31             59.68
## 9038    BCL2          24.86          26.64              4.26             25.58
## 9291   BRCA1           8.53           8.53              4.26              4.26
## 9642     ATM           5.68           6.39             12.79             17.05
## 10292    MYC          56.12          88.45             17.05             12.79
## 11188  VCAM1           7.81          22.38             21.31             12.79
## 12676  KDM5B          12.08           8.53              8.53              4.26
## 13596   CD47          15.63          12.79             12.79             25.58
## 14196   BRD4          53.99          21.31             12.79             29.84
## 15400   CD74         363.74         288.79            865.30            980.39
##       GSM9045909_MET GSM9045910_MET GSM9045911_MET GSM9045912_MET
## 5499           32.21          32.50          11.63          28.72
## 9038            4.74           9.06          10.85           6.95
## 9291            4.74           7.46           3.88          11.44
## 9642            3.79           4.26           7.75           4.49
## 10292          35.05          44.22           6.98          72.01
## 11188          34.10          57.54          17.05          70.00
## 12676          11.37           7.46          21.70           7.18
## 13596          42.63          10.12          12.40          21.09
## 14196          23.68          31.44          28.68          20.19
## 15400         290.80        1113.59         564.21         447.34
##       GSM9045913_MET GSM9045914_MET GSM9045915_MET GSM9045916_nonMET
## 5499           52.86          11.37          13.18              8.53
## 9038           35.81          13.26          14.73             11.94
## 9291            8.53           8.53           3.88              1.71
## 9642            3.41           0.95           6.20              3.41
## 10292         100.60          16.10          79.05             25.58
## 11188          11.08          30.31          51.93             22.17
## 12676          12.79           8.53          11.63              5.12
## 13596          17.05          27.47           7.75              8.53
## 14196          37.51          32.21          41.85             23.87
## 15400         324.81         448.99        1335.34            862.74
##       GSM9045917_nonMET GSM9045918_MET GSM9045919_MET GSM9045920_Normal
## 5499              29.13           8.53          31.97             12.79
## 9038               5.68          25.58          10.66              7.46
## 9291               7.10           6.09           9.59              6.39
## 9642               2.13           2.44           7.46              3.20
## 10292             12.08          85.25          21.31             17.05
## 11188              4.97          25.58          49.02              6.39
## 12676             13.50          10.96           3.20              5.33
## 13596              7.81          12.18          19.18              8.53
## 14196             40.49          30.45          39.43             24.51
## 15400            174.05         580.93         705.45            801.36
##       GSM9045921_Normal GSM9045922_Normal GSM9045923_nonMET GSM9045924_nonMET
## 5499              34.10             25.58             12.65              6.20
## 9038               2.56              7.75             13.75             10.85
## 9291               3.41              6.98              4.40              9.30
## 9642               2.56              9.30              9.08              6.98
## 10292             17.05             13.18             26.13             15.50
## 11188              4.26              3.10             41.53             27.90
## 12676              3.41              9.30             10.45              8.53
## 13596             15.35             13.95             14.85             30.23
## 14196             28.13             27.13             35.20             20.15
## 15400            843.99            993.56           1095.89            435.56
##       GSM9045925_Normal GSM9045926_nonMET GSM9045927_nonMET GSM9045928_MET
## 5499              17.79              4.26             25.20           6.39
## 9038               2.22             12.79             39.29          14.21
## 9291               2.59              8.53             10.75          10.66
## 9642               4.45             12.79              3.34           7.81
## 10292              8.90             34.10             85.25          10.66
## 11188              6.30              4.26             26.69           6.39
## 12676              8.53              8.53              8.90           4.26
## 13596             16.31              4.26             16.68           8.53
## 14196             20.76             42.63             25.20          34.10
## 15400            946.29            677.75            242.41         895.14
##       GSM9045929_MET GSM9045930_MET GSM9045931_nonMET GSM9045932_nonMET
## 5499            8.53          16.79              6.82              6.82
## 9038           14.73          28.68              5.12              6.82
## 9291            1.55          10.33              3.41              3.41
## 9642            3.88           4.13              6.82             11.94
## 10292          20.93         128.65             15.35             10.23
## 11188          20.93          41.33             46.04             20.46
## 12676          10.85           2.33              8.53              6.82
## 13596           9.30          21.18             11.94             25.58
## 14196          17.05          24.03             17.05             30.69
## 15400         678.91        1165.36           1239.55            821.82

There are 10 genes in the data of counts and the most important one is there, KDM5B, as it was the whole basis for this research study. The listed study genes totaled 25 genes, with 10 in this counts data and 15 not in the data. That could be due to the study genes having alternate names or not making the cut in quality filtering, normalization, and the top 25% of genes using a 3rd quantile filtering to omit less high variability genes.

There is no simple way to get the genes that aren’t in the data as I tried the different methods to extract only the genes in and only genes not in, the latter didn’t want to work.

listed <- as.factor(listed)
class(listed)
## [1] "factor"
studyListed <- as.factor(Listed$ID_REF)
class(studyListed)
## [1] "factor"
notListed <- listed[-(studyListed %in% listed)]
notListed
##  [1] oriP     RPMS1    KDM5B    LMP1     H3K27    H3K27ac  H3K27me3 CTFC    
##  [9] JQ1      BRD4     BRCA1    LGALS9   CD47     VCAM1    VEGFA    MK167   
## [17] MYC      BCL2     BZLF1    ATM      MLL      Ku70     CD74     MK167   
## 24 Levels: ATM BCL2 BRCA1 BRD4 BZLF1 CD47 CD74 CTFC EBNA1 H3K27 ... VEGFA

So, obviously that is more than 15 genes and a huge error. We can just eyeball the list of 25 listed genes in the study and mark off the listed from the list to make it the notListed list of genes not in the list. I did the same thing as characters, but changing to factor type objects makes no difference.

notListed <- as.factor(c("EBNA1", "oriP", "RPMS1", "LMP1", "H3K27","H3K27ac", "H3K27me3", "CTFC", "JQ1", "LGALS9", "MK167",   "BZLF1",  "MLL", "Ku70", "MK167"))
notListed
##  [1] EBNA1    oriP     RPMS1    LMP1     H3K27    H3K27ac  H3K27me3 CTFC    
##  [9] JQ1      LGALS9   MK167    BZLF1    MLL      Ku70     MK167   
## 14 Levels: BZLF1 CTFC EBNA1 H3K27 H3K27ac H3K27me3 JQ1 Ku70 LGALS9 ... RPMS1

It is unfortunate that the JQ1 gene that inhibits the BRD4 gene in transcription of cancer cell proliferation isn’t in the counts data, but that makes sense as it inhibits the effects of KDM5B involvment in EBV+NPC and rapid cancer progression. Further search of JQ1 in genecards turned up nothing, and when using the internet, it turns out that JQ1 is a lab developed compound and not a gene that does inhibit the BRD4 family of genes that is useful in slowing cancer progression in cell proliferation and tumor growth. JQ1 is a Thienotriazolodiazepine seen in many studies and experiments to slow cancer progression.

The 3 genes for the EBV interacting regions or EBVIR are not in the counts data so they must be named something else or not considered genes but regions on the chromosome. Those genes were the EBNA1, oriP, and RPMS1. Genecards.org says there is an EBNA1BP2 but not EBNA1, for oriP nothing. It turns out the internet does have results for EBNA1, oriP, and RPMS1. They are EBV genes and not human genes.

The CTFC gene is a regulatory protein called CCCTC-binding factor that is listed in genecards.org when looking up CCCTC-binding factor but not CTFC. MLL is now called KMT2A in genecards.org.

It is odd that LMP1 is not in this data of counts after quality control filtering, normalization, and taking the top 25th percentile of gene counts. LMP1 is associated with EBV latency type II and also in last analysis we did with EBV+NPC top genes in GSE271486. LMP1 was mentioned in this study for its relation to EBV+NPC but didn’t make their measures in this counts data.

studyListed
##  [1] VEGFA BCL2  BRCA1 ATM   MYC   VCAM1 KDM5B CD47  BRD4  CD74 
## Levels: ATM BCL2 BRCA1 BRD4 CD47 CD74 KDM5B MYC VCAM1 VEGFA

We can see that the studyListed set of genes is not in the notListed set of genes. However, the two genes that KDM5B enhanced when the EBVIR-enhancer-KDM5B signature was present are there, VEGFA and VCAM1 as well as KDM5B. Also the BRD4 gene that plays a role in cancer proliferation and tumor growth is present. The two MHC-II t-cell genes of CD47 and CD74 are present as well, and the two study genes this research mentioned in having associations to NPC of BRCA1, ATM, and MYC. The only gene not mentioned is BCL2 that is in the listed study genes This gene was found to be present with MYC in the top 20% of subclusters of high EBV epithelial cells of NPC when the EBVIR-enhancer-KDM5B signature is present.

Lets remove the genes that are not human genes of the EBV 3 genes and the Thienotriazolodiazepine called JQ1 from our listed set of study genes.

listed 
##  [1] EBNA1    oriP     RPMS1    KDM5B    LMP1     H3K27    H3K27ac  H3K27me3
##  [9] CTFC     JQ1      BRD4     BRCA1    LGALS9   CD47     VCAM1    VEGFA   
## [17] MK167    MYC      BCL2     BZLF1    ATM      MLL      Ku70     CD74    
## [25] MK167   
## 24 Levels: ATM BCL2 BRCA1 BRD4 BZLF1 CD47 CD74 CTFC EBNA1 H3K27 ... VEGFA
humanListed <- as.character(listed[-c(1,2,3,10)])
humanListed
##  [1] "KDM5B"    "LMP1"     "H3K27"    "H3K27ac"  "H3K27me3" "CTFC"    
##  [7] "BRD4"     "BRCA1"    "LGALS9"   "CD47"     "VCAM1"    "VEGFA"   
## [13] "MK167"    "MYC"      "BCL2"     "BZLF1"    "ATM"      "MLL"     
## [19] "Ku70"     "CD74"     "MK167"

Lets remove the unneccessary data from our enviroment.

rm(dataSamples,details,totalSeriesInfo,detailsCounts,dataID2)
## Warning in rm(dataSamples, details, totalSeriesInfo, detailsCounts, dataID2):
## object 'dataSamples' not found

Lets go ahead and get our data set of means by sample and fold change values for top upregulated or stimulated genes and top downregulated or inhibited genes in comparing MET to normal and nonMET to normal. This data was taken from samples from the group of Chinese biregional or multicentric and multiomics data of all enrolled patients having their first line treatment which was chemotherapy in 2011 guided by the NCCC and then tracked to see which patients developed MET 3 years past last treatment of chemotherapy.

colnames(counts)
##  [1] "ID_REF"            "GSM9045905_MET"    "GSM9045906_MET"   
##  [4] "GSM9045907_nonMET" "GSM9045908_nonMET" "GSM9045909_MET"   
##  [7] "GSM9045910_MET"    "GSM9045911_MET"    "GSM9045912_MET"   
## [10] "GSM9045913_MET"    "GSM9045914_MET"    "GSM9045915_MET"   
## [13] "GSM9045916_nonMET" "GSM9045917_nonMET" "GSM9045918_MET"   
## [16] "GSM9045919_MET"    "GSM9045920_Normal" "GSM9045921_Normal"
## [19] "GSM9045922_Normal" "GSM9045923_nonMET" "GSM9045924_nonMET"
## [22] "GSM9045925_Normal" "GSM9045926_nonMET" "GSM9045927_nonMET"
## [25] "GSM9045928_MET"    "GSM9045929_MET"    "GSM9045930_MET"   
## [28] "GSM9045931_nonMET" "GSM9045932_nonMET"

Lets organize the columns so its easier to get rowMeans by sample type or class of MET, nonMET, and Normal.

DF <- counts[,c(1:3,6:12,15,16,25:27,4,5,13,14,20,21,23,24,28,29,17,18,19,22)]
colnames(DF)
##  [1] "ID_REF"            "GSM9045905_MET"    "GSM9045906_MET"   
##  [4] "GSM9045909_MET"    "GSM9045910_MET"    "GSM9045911_MET"   
##  [7] "GSM9045912_MET"    "GSM9045913_MET"    "GSM9045914_MET"   
## [10] "GSM9045915_MET"    "GSM9045918_MET"    "GSM9045919_MET"   
## [13] "GSM9045928_MET"    "GSM9045929_MET"    "GSM9045930_MET"   
## [16] "GSM9045907_nonMET" "GSM9045908_nonMET" "GSM9045916_nonMET"
## [19] "GSM9045917_nonMET" "GSM9045923_nonMET" "GSM9045924_nonMET"
## [22] "GSM9045926_nonMET" "GSM9045927_nonMET" "GSM9045931_nonMET"
## [25] "GSM9045932_nonMET" "GSM9045920_Normal" "GSM9045921_Normal"
## [28] "GSM9045922_Normal" "GSM9045925_Normal"

The columns are arranged by class type with MET classes columns 2 through 15, the nonMET class are columns 16 through 25, and the Normal class are 26 through 29.

DF$MET_mean <- rowMeans(DF[,c(2:15)])
DF$nonMET_mean <- rowMeans(DF[,c(16:25)])
DF$Normal_mean <- rowMeans(DF[,c(26:29)])

DF$MET_Normal_FoldChange <- DF$MET_mean/DF$Normal_mean
DF$nonMET_Normal_FoldChange <- DF$nonMET_mean/DF$Normal_mean
DF$MET_nonMET_FoldChange <- DF$MET_mean/DF$nonMET_mean

colnames(DF)
##  [1] "ID_REF"                   "GSM9045905_MET"          
##  [3] "GSM9045906_MET"           "GSM9045909_MET"          
##  [5] "GSM9045910_MET"           "GSM9045911_MET"          
##  [7] "GSM9045912_MET"           "GSM9045913_MET"          
##  [9] "GSM9045914_MET"           "GSM9045915_MET"          
## [11] "GSM9045918_MET"           "GSM9045919_MET"          
## [13] "GSM9045928_MET"           "GSM9045929_MET"          
## [15] "GSM9045930_MET"           "GSM9045907_nonMET"       
## [17] "GSM9045908_nonMET"        "GSM9045916_nonMET"       
## [19] "GSM9045917_nonMET"        "GSM9045923_nonMET"       
## [21] "GSM9045924_nonMET"        "GSM9045926_nonMET"       
## [23] "GSM9045927_nonMET"        "GSM9045931_nonMET"       
## [25] "GSM9045932_nonMET"        "GSM9045920_Normal"       
## [27] "GSM9045921_Normal"        "GSM9045922_Normal"       
## [29] "GSM9045925_Normal"        "MET_mean"                
## [31] "nonMET_mean"              "Normal_mean"             
## [33] "MET_Normal_FoldChange"    "nonMET_Normal_FoldChange"
## [35] "MET_nonMET_FoldChange"
str(DF)
## 'data.frame':    18677 obs. of  35 variables:
##  $ ID_REF                  : chr  "TGIF2LY" "OR5B21" "SUGCT" "GABARAP" ...
##  $ GSM9045905_MET          : num  0.71 2.84 0.71 22.73 15.63 ...
##  $ GSM9045906_MET          : num  1.07 4.26 1.07 18.12 10.66 ...
##  $ GSM9045909_MET          : num  0.95 0.95 0.95 23.68 3.79 ...
##  $ GSM9045910_MET          : num  0.53 2.66 0.53 28.77 13.32 ...
##  $ GSM9045911_MET          : num  4.65 2.33 1.55 24.03 3.1 ...
##  $ GSM9045912_MET          : num  1.79 5.16 1.79 11.44 9.65 ...
##  $ GSM9045913_MET          : num  1.71 3.41 5.97 32.4 14.49 ...
##  $ GSM9045914_MET          : num  1.89 4.74 2.84 18.94 14.21 ...
##  $ GSM9045915_MET          : num  3.88 4.65 3.88 30.23 13.18 ...
##  $ GSM9045918_MET          : num  1.22 3.65 1.22 13.4 8.53 2.44 1.22 3.65 1.22 3.65 ...
##  $ GSM9045919_MET          : num  4.26 3.2 1.07 21.31 11.72 ...
##  $ GSM9045928_MET          : num  2.13 2.84 3.55 29.84 11.37 ...
##  $ GSM9045929_MET          : num  3.1 7.75 0.78 18.6 11.63 ...
##  $ GSM9045930_MET          : num  1.29 2.84 1.03 22.48 21.7 ...
##  $ GSM9045907_nonMET       : num  4.26 4.26 4.26 4.26 12.79 ...
##  $ GSM9045908_nonMET       : num  4.26 4.26 4.26 38.36 4.26 ...
##  $ GSM9045916_nonMET       : num  1.71 6.82 1.71 28.99 10.23 ...
##  $ GSM9045917_nonMET       : num  1.42 4.26 0.71 46.18 10.66 ...
##  $ GSM9045923_nonMET       : num  1.1 1.65 0.83 30.53 12.38 ...
##  $ GSM9045924_nonMET       : num  0.78 2.33 1.55 18.6 13.95 ...
##  $ GSM9045926_nonMET       : num  4.26 4.26 4.26 29.84 17.05 ...
##  $ GSM9045927_nonMET       : num  1.48 1.85 1.11 25.95 18.9 ...
##  $ GSM9045931_nonMET       : num  1.71 5.12 1.71 15.35 6.82 ...
##  $ GSM9045932_nonMET       : num  1.71 1.71 3.41 25.58 10.23 ...
##  $ GSM9045920_Normal       : num  1.07 2.13 2.13 28.77 4.26 ...
##  $ GSM9045921_Normal       : num  2.56 0.85 0.85 28.13 2.56 ...
##  $ GSM9045922_Normal       : num  2.33 3.1 1.55 33.33 3.88 ...
##  $ GSM9045925_Normal       : num  0.74 1.11 1.85 37.07 2.59 ...
##  $ MET_mean                : num  2.08 3.66 1.92 22.57 11.64 ...
##  $ nonMET_mean             : num  2.27 3.65 2.38 26.36 11.73 ...
##  $ Normal_mean             : num  1.68 1.8 1.59 31.82 3.32 ...
##  $ MET_Normal_FoldChange   : num  1.244 2.038 1.206 0.709 3.504 ...
##  $ nonMET_Normal_FoldChange: num  1.355 2.032 1.493 0.828 3.53 ...
##  $ MET_nonMET_FoldChange   : num  0.919 1.003 0.808 0.856 0.993 ...
MET_nonMET_FC_ordered <- DF[order(DF$MET_nonMET_FoldChange, decreasing=T),]
head(MET_nonMET_FC_ordered)
##       ID_REF GSM9045905_MET GSM9045906_MET GSM9045909_MET GSM9045910_MET
## 10538  TRPA1           0.71           4.26           1.89         188.09
## 14378   CCL2         127.88           4.26          38.84           6.39
## 7211  SHISA2          41.92           1.07           3.79           6.93
## 15643   LGR5          22.02           3.20           3.79          19.18
## 1979   FDCSP           1.42           3.20           0.95           2.13
## 1516   CXCL6          13.50           4.26         152.50           2.66
##       GSM9045911_MET GSM9045912_MET GSM9045913_MET GSM9045914_MET
## 10538           1.55           5.16           4.26           2.84
## 14378           3.88           2.69           5.12         161.98
## 7211           10.08           1.79          49.45          21.79
## 15643          30.23           8.75           5.12          34.10
## 1979            1.55           2.92         150.89         119.35
## 1516            4.65           2.69           0.85           0.95
##       GSM9045915_MET GSM9045918_MET GSM9045919_MET GSM9045928_MET
## 10538           9.30           1.22           1.07           5.68
## 14378          11.63           8.53         457.16           7.10
## 7211            1.55           2.44          59.68           4.26
## 15643           0.78           4.87         117.22           7.10
## 1979            2.33           1.22           1.07           7.10
## 1516            1.55           2.44           9.59           1.42
##       GSM9045929_MET GSM9045930_MET GSM9045907_nonMET GSM9045908_nonMET
## 10538          15.50           2.84              4.26              4.26
## 14378           1.55          91.97              4.26              4.26
## 7211            6.98           1.03              4.26              4.26
## 15643           3.10          47.02              4.26              4.26
## 1979            5.43          25.06              4.26             12.79
## 1516            3.88           0.78             12.79              4.26
##       GSM9045916_nonMET GSM9045917_nonMET GSM9045923_nonMET GSM9045924_nonMET
## 10538              1.71              0.71              0.55              4.65
## 14378              5.12              2.84              3.58             81.38
## 7211               1.71              0.71              3.58              3.88
## 15643              1.71              9.24              2.48              6.20
## 1979               1.71              2.13              1.93              5.43
## 1516               1.71              0.71              0.83              0.78
##       GSM9045926_nonMET GSM9045927_nonMET GSM9045931_nonMET GSM9045932_nonMET
## 10538              4.26              1.48              1.71              1.71
## 14378              4.26              2.97              1.71              1.71
## 7211               4.26              2.22              1.71              1.71
## 15643              4.26              2.59              5.12              3.41
## 1979               8.53              2.97              5.12              1.71
## 1516               4.26              2.22              1.71              1.71
##       GSM9045920_Normal GSM9045921_Normal GSM9045922_Normal GSM9045925_Normal
## 10538              1.07              0.85              0.78              1.48
## 14378              3.20              2.56              3.88              2.97
## 7211               1.07              2.56              2.33              3.71
## 15643              3.20              2.56              0.78              1.11
## 1979               3.20              0.85              0.78              3.34
## 1516               3.20              3.41             48.83             37.81
##       MET_mean nonMET_mean Normal_mean MET_Normal_FoldChange
## 10538 17.45500       2.530      1.0450             16.703349
## 14378 66.35571      11.209      3.1525             21.048601
## 7211  15.19714       2.830      2.4175              6.286305
## 15643 21.89143       4.353      1.9125             11.446499
## 1979  23.18714       4.658      2.0425             11.352334
## 1516  14.40857       3.098     23.3125              0.618062
##       nonMET_Normal_FoldChange MET_nonMET_FoldChange
## 10538                2.4210526              6.899209
## 14378                3.5555908              5.919860
## 7211                 1.1706308              5.370015
## 15643                2.2760784              5.029044
## 1979                 2.2805386              4.977918
## 1516                 0.1328901              4.650927
top10UpAnd10Down_met_nonmetFC <- MET_nonMET_FC_ordered[c(1:10,18668:18677),]
top10UpAnd10Down_met_nonmetFC[,c(1,35)]
##        ID_REF MET_nonMET_FoldChange
## 10538   TRPA1            6.89920949
## 14378    CCL2            5.91986032
## 7211   SHISA2            5.37001514
## 15643    LGR5            5.02904401
## 1979    FDCSP            4.97791817
## 1516    CXCL6            4.65092687
## 7594     CST1            3.93830205
## 11836    MMP1            3.77502483
## 13000   FSTL3            3.72344208
## 14923  AKR1C3            3.37745476
## 4627    AMY1A            0.15966597
## 7092     MSLN            0.14037864
## 6013    GSTA1            0.13022230
## 3758     MSMB            0.12031737
## 17621  NCCRP1            0.11610074
## 5503  COL14A1            0.10322887
## 5687   ZNF331            0.05778740
## 14741 SCGB1A1            0.03950420
## 202      ZPBP            0.02622455
## 9475      LTF            0.02573839

That result is the top 10 up regulated and top 10 down regulated genes between MET and non-MET NPC fold change values of MET mean value over nonMET mean value.

top10UpAnd10Down_met_nonmetFC$ID_REF
##  [1] "TRPA1"   "CCL2"    "SHISA2"  "LGR5"    "FDCSP"   "CXCL6"   "CST1"   
##  [8] "MMP1"    "FSTL3"   "AKR1C3"  "AMY1A"   "MSLN"    "GSTA1"   "MSMB"   
## [15] "NCCRP1"  "COL14A1" "ZNF331"  "SCGB1A1" "ZPBP"    "LTF"

Those are the top 20 genes most differentially changed from MET NPC to nonMET NPC. Lets see where are list of genes are in the studyListed set of genes that are in this counts data frame.

foldchage_studyListed <- MET_nonMET_FC_ordered[which(MET_nonMET_FC_ordered$ID_REF %in% Listed$ID_REF),]
foldchage_studyListed[,c(1,35)]
##       ID_REF MET_nonMET_FoldChange
## 10292    MYC             2.1518539
## 11188  VCAM1             1.3947860
## 9291   BRCA1             1.2970418
## 9038    BCL2             1.2637524
## 5499   VEGFA             1.2025392
## 12676  KDM5B             1.1411223
## 14196   BRD4             1.1204038
## 13596   CD47             1.0710900
## 15400   CD74             0.8889044
## 9642     ATM             0.5679204

All genes related to tumor proliferation in the research report for NPC cellular changes when KDM5B are present seem to be upregulated from above information just comparing the results of the MET to nonMET foldchange gene expression values, except the ATM and CD74 genes. CD74 is one of the MHC-II genes mentioned that is a chaperone to immune activity and seems to perform immune evasion by lowering in value, and ATM is a gene that normally suppresses tumor growth but is low by almost half as much in MET compared to nonMET cases of NPC following treatment. ATM is lowered when the EBNA1 gene (an EBV latent type II gene) infiltrates its EBVIR area of the human chromosome. Both the VCAM1 that stimulates tumor growth and proliferation as well as VEGFA that stimulates tumour growth by creating blood routes for the tumor in tumor angiogenesis are up regulated in MET vs nonMET as well as KDM5B the gene of interest to the research study we are analyzing one portion of their in depth study. These genes are also the only genes from the list of 25 genes in the study we noted and totaling 10 remaining in the count data of scRNA data after they filtered and normalized and selected top 25th percentile of genes having highest variability.

Let just keep the fold change data set of genes we already have comparing MET to nonMET top 20 genes with highest change in up or down regulation. And use the ten genes from the study that made the quality control filtering of counts.

Lets make a matrix after binding these genes together into one dataset.Lets rename the long name of our top20 genes of fold change to a shorter name.

top20 <- top10UpAnd10Down_met_nonmetFC
rm(top10UpAnd10Down_met_nonmetFC)

Our data was already arranged to have the columns of sample types in order across the data frame. Lets add a column to each of the 2 data sets of top20 and study genes with FC values.

foldchage_studyListed$geneSource <- 'GSE299775 EBV+NCP gene of interest'
top20$geneSource <- "top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC"

Now combine the 30 genes into one dataset and write to csv.

top30 <- rbind(foldchage_studyListed,top20)
top30[,c(1,35)]
##        ID_REF MET_nonMET_FoldChange
## 10292     MYC            2.15185389
## 11188   VCAM1            1.39478595
## 9291    BRCA1            1.29704183
## 9038     BCL2            1.26375241
## 5499    VEGFA            1.20253916
## 12676   KDM5B            1.14112231
## 14196    BRD4            1.12040383
## 13596    CD47            1.07109005
## 15400    CD74            0.88890443
## 9642      ATM            0.56792044
## 10538   TRPA1            6.89920949
## 14378    CCL2            5.91986032
## 7211   SHISA2            5.37001514
## 15643    LGR5            5.02904401
## 1979    FDCSP            4.97791817
## 1516    CXCL6            4.65092687
## 7594     CST1            3.93830205
## 11836    MMP1            3.77502483
## 13000   FSTL3            3.72344208
## 14923  AKR1C3            3.37745476
## 4627    AMY1A            0.15966597
## 7092     MSLN            0.14037864
## 6013    GSTA1            0.13022230
## 3758     MSMB            0.12031737
## 17621  NCCRP1            0.11610074
## 5503  COL14A1            0.10322887
## 5687   ZNF331            0.05778740
## 14741 SCGB1A1            0.03950420
## 202      ZPBP            0.02622455
## 9475      LTF            0.02573839
write.csv(top30,'top20_MET_nonMET_FC_and_10GSE299775_genes.csv',row.names=F)

You can access these 30 genes and samples with means per sample class and fold change values here

Now lets make a matrix of these genes to predict with randomForest the class as normal, metastatic NPC, or non-metastatic NPC.

colnames(top30)
##  [1] "ID_REF"                   "GSM9045905_MET"          
##  [3] "GSM9045906_MET"           "GSM9045909_MET"          
##  [5] "GSM9045910_MET"           "GSM9045911_MET"          
##  [7] "GSM9045912_MET"           "GSM9045913_MET"          
##  [9] "GSM9045914_MET"           "GSM9045915_MET"          
## [11] "GSM9045918_MET"           "GSM9045919_MET"          
## [13] "GSM9045928_MET"           "GSM9045929_MET"          
## [15] "GSM9045930_MET"           "GSM9045907_nonMET"       
## [17] "GSM9045908_nonMET"        "GSM9045916_nonMET"       
## [19] "GSM9045917_nonMET"        "GSM9045923_nonMET"       
## [21] "GSM9045924_nonMET"        "GSM9045926_nonMET"       
## [23] "GSM9045927_nonMET"        "GSM9045931_nonMET"       
## [25] "GSM9045932_nonMET"        "GSM9045920_Normal"       
## [27] "GSM9045921_Normal"        "GSM9045922_Normal"       
## [29] "GSM9045925_Normal"        "MET_mean"                
## [31] "nonMET_mean"              "Normal_mean"             
## [33] "MET_Normal_FoldChange"    "nonMET_Normal_FoldChange"
## [35] "MET_nonMET_FoldChange"    "geneSource"

Only include sample type but make a vector for column names as these gene IDs and a column vector of class type for the type of sample each one is in 3 classes.

genes <- top30$ID_REF

class <- c("MET" ,         
 "MET"  ,         "MET"  ,        
"MET"     ,      "MET" ,         
 "MET"  ,         "MET"  ,        
 "MET" ,          "MET"  ,        
"MET"  ,         "MET" ,         
 "MET" ,          "MET",          
 "MET" ,          "nonMET",       
 "nonMET" ,       "nonMET"  ,     
"nonMET"  ,      "nonMET",       
 "nonMET" ,       "nonMET" ,      
"nonMET"  ,      "nonMET"  ,     
 "nonMET" ,       "Normal" ,      
 "Normal" ,       "Normal" ,      
 "Normal"   )

matrix <- data.frame(t(top30[,c(2:29)]))

colnames(matrix) <- genes
matrix$class <- as.factor(class)

str(matrix)
## 'data.frame':    28 obs. of  31 variables:
##  $ MYC    : num  56.12 88.45 35.05 44.22 6.98 ...
##  $ VCAM1  : num  7.81 22.38 34.1 57.54 17.05 ...
##  $ BRCA1  : num  8.53 8.53 4.74 7.46 3.88 ...
##  $ BCL2   : num  24.86 26.64 4.74 9.06 10.85 ...
##  $ VEGFA  : num  33.4 16 32.2 32.5 11.6 ...
##  $ KDM5B  : num  12.08 8.53 11.37 7.46 21.7 ...
##  $ BRD4   : num  54 21.3 23.7 31.4 28.7 ...
##  $ CD47   : num  15.6 12.8 42.6 10.1 12.4 ...
##  $ CD74   : num  364 289 291 1114 564 ...
##  $ ATM    : num  5.68 6.39 3.79 4.26 7.75 4.49 3.41 0.95 6.2 2.44 ...
##  $ TRPA1  : num  0.71 4.26 1.89 188.09 1.55 ...
##  $ CCL2   : num  127.88 4.26 38.84 6.39 3.88 ...
##  $ SHISA2 : num  41.92 1.07 3.79 6.93 10.08 ...
##  $ LGR5   : num  22.02 3.2 3.79 19.18 30.23 ...
##  $ FDCSP  : num  1.42 3.2 0.95 2.13 1.55 ...
##  $ CXCL6  : num  13.5 4.26 152.5 2.66 4.65 ...
##  $ CST1   : num  4.97 9.59 2.84 0.53 1.55 ...
##  $ MMP1   : num  2.84 2.13 14.21 3.2 1.55 ...
##  $ FSTL3  : num  138.53 12.79 8.53 31.97 6.98 ...
##  $ AKR1C3 : num  1.42 4.26 3.79 3.73 1.55 ...
##  $ AMY1A  : num  8.53 10.66 8.53 10.66 7.75 ...
##  $ MSLN   : num  2.13 1.07 1.89 1.6 3.1 4.26 2.56 1.89 5.43 2.44 ...
##  $ GSTA1  : num  4.97 1.07 2.84 3.2 31 ...
##  $ MSMB   : num  1.42 3.2 0.95 1.07 6.98 6.28 1.71 1.89 0.78 1.22 ...
##  $ NCCRP1 : num  1.42 1.07 2.84 1.6 4.65 2.92 1.71 5.68 1.55 1.22 ...
##  $ COL14A1: num  0.71 3.2 0.95 1.07 2.33 4.49 0.85 9.47 2.33 2.44 ...
##  $ ZNF331 : num  3.55 3.2 2.84 2.66 4.65 2.24 6.82 4.74 3.88 6.09 ...
##  $ SCGB1A1: num  2.13 3.2 1.89 3.2 3.88 1.57 2.56 4.74 1.55 1.22 ...
##  $ ZPBP   : num  0.71 3.2 0.95 0.53 1.55 1.79 0.85 2.84 1.55 2.44 ...
##  $ LTF    : num  4.97 9.59 1.89 5.86 1.55 ...
##  $ class  : Factor w/ 3 levels "MET","nonMET",..: 1 1 1 1 1 1 1 1 1 1 ...
table(matrix$class)
## 
##    MET nonMET Normal 
##     14     10      4

There are 14 MET and 10 nonMET samples, but only 4 Normal samples. Lets see if the model for randomForest can predict accurately on a 3 class model using these genes.

library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
set.seed(1234)

intrain <- sample(1:28,.75*28)

training <- matrix[intrain,]
testing <- matrix[-intrain,]

table(training$class)
## 
##    MET nonMET Normal 
##     11      7      3
table(testing$class)
## 
##    MET nonMET Normal 
##      3      3      1

This should be interesting. We will use the same settings as our last project on a 2 class model.

rf <- randomForest(training[,c(1:30)], y=training$class, mtry=6, ntree=5000, confusion=T)
rf$confusion
##        MET nonMET Normal class.error
## MET      9      2      0   0.1818182
## nonMET   2      5      0   0.2857143
## Normal   0      3      0   1.0000000

Not very good with the normal class prediction as all 3 samples that were normal were misidentified as nonMET with a 100% class error in the training model. But the MET class predicted 10 of the 11 MET samples correctly, and 5 of the 7 nonMET classes correctly. Lets see how well this model predicts the testing set classes.

predicted <- predict(rf,testing)
results <- data.frame(predicted = predicted, actual=testing$class)
results
##                   predicted actual
## GSM9045905_MET          MET    MET
## GSM9045909_MET          MET    MET
## GSM9045929_MET          MET    MET
## GSM9045917_nonMET       MET nonMET
## GSM9045924_nonMET       MET nonMET
## GSM9045931_nonMET    nonMET nonMET
## GSM9045920_Normal    Normal Normal

The prediction accuracy in the nonMET and Normal classes was 100% for each class, but the accuracy in prediction of the MET class was only 3 out of 5 with 2 of those MET classes predicted as nonMET that is 3/5 in accuracy on MET class predictio or 60%, but 100% in accuracy for the nonMET and Normal class. Overall it predicted 5/7 correctly which is

sum(results$predicted==results$actual)/length(results$predicted)
## [1] 0.7142857

71.4 % accuracy in prediction using all 30 genes of top 10 up regulated, top 10 down regulated, and the only 10 genes from the study that made it into this counts database after quality control filtering and normalization and removing bottom 75th percentile genes in variability.

How well do only the 10 study genes compare? Lets find out using same settings of this random forest algorithm.

matrix10 <- matrix[,c(1:10,31)]
matrix20 <- matrix[,c(11:30,31)]
training <- matrix10[intrain,]
testing <- matrix10[-intrain,]

table(training$class)
## 
##    MET nonMET Normal 
##     11      7      3
table(testing$class)
## 
##    MET nonMET Normal 
##      3      3      1

We are using the same samples but only the 10 genes in the study from our matrix to predict the class in a 3 class model using randomForest classifier.

rf1 <- randomForest(training[,c(1:10)], y=training$class, mtry=3, ntree=5000, confusion=T)
rf$confusion
##        MET nonMET Normal class.error
## MET      9      2      0   0.1818182
## nonMET   2      5      0   0.2857143
## Normal   0      3      0   1.0000000

This time, the normal class in building the model was 100% accurate in prediction, while the nonMET class was only 5/7 or 71% accurate, and the MET class was 10/11 accurate or 91% accurate. Lets see how well the 10 study genes predict the testing set.

prediction1 <- predict(rf1,testing)
results1 <- data.frame(predicted=prediction1,actual=testing$class)
results1
##                   predicted actual
## GSM9045905_MET          MET    MET
## GSM9045909_MET          MET    MET
## GSM9045929_MET       nonMET    MET
## GSM9045917_nonMET       MET nonMET
## GSM9045924_nonMET       MET nonMET
## GSM9045931_nonMET       MET nonMET
## GSM9045920_Normal       MET Normal

This model scored terribly on the prediction data as it predicted almost all classes to be MET except for one MET class it predicted to be nonMET. It scored 2/3 MET classes correctly or 67% accurate, while all 3 of the nonMET classes were predicted incorrectly or 0% accurate, and all Normal classes predicted incorrectly for 1/1 with 0% accuracy. The overall prediction accuracy was

sum(results1$predicted==results1$actual)/length(results1$predicted)
## [1] 0.2857143

28.6 % accuracy in prediction using these 10 study genes that made the filtering process and quality control of the study

colnames(matrix10)[1:10]
##  [1] "MYC"   "VCAM1" "BRCA1" "BCL2"  "VEGFA" "KDM5B" "BRD4"  "CD47"  "CD74" 
## [10] "ATM"

Now lets look at how well the other genes from our fold change values of most change in top 10 up regulated and top 10 down regulated genes totaling 20 genes did.

training <- matrix20[intrain,]
testing <- matrix20[-intrain,]

table(training$class)
## 
##    MET nonMET Normal 
##     11      7      3
table(testing$class)
## 
##    MET nonMET Normal 
##      3      3      1
rf2 <- randomForest(training[,c(1:20)], y=training$class, mtry=6, ntree=5000, confusion=T)
rf2$confusion
##        MET nonMET Normal class.error
## MET      9      2      0   0.1818182
## nonMET   2      4      1   0.4285714
## Normal   0      2      1   0.6666667

The model accuracy in prediction for the MET class was 8/11 with 73% accuracy, nonMET was 4/7 for 58% accuracy, and the normal class was 1/3 for 33% accuracy. Lets see how well the prediction accuracy is on the testing data.

prediction2 <- predict(rf2,testing)
results2 <- data.frame(predicted=prediction2,actual=testing$class)
results2
##                   predicted actual
## GSM9045905_MET          MET    MET
## GSM9045909_MET          MET    MET
## GSM9045929_MET       nonMET    MET
## GSM9045917_nonMET       MET nonMET
## GSM9045924_nonMET       MET nonMET
## GSM9045931_nonMET    nonMET nonMET
## GSM9045920_Normal    Normal Normal

We can see the prediction accuracy in the MET class is 2/3 or 66% accurate, the nonMET class is 1/3 for 33% accuracy, and the Normal class is 100% accurate with 1/1 predicted correctly. The overall prediction accuracy using the 20 genes was

sum(results2$predicted==results2$actual)/length(results2$predicted)
## [1] 0.5714286

57.1 % which is actually better than the prediction accuracy of the 10 genes from the research study in predicting a 3 class model, but the best model was when we used all 30 genes to predict the 3 classes as that scored 71.4 %. But when using a 3 class model or any model where there are more than 2 classes, the machine will always struggle, better tuning of the model delivers better accuracy, but it is a tough battle to score high in prediction when using a 3 class model to predict, it always or random forest or any model always predicts better with a 2 class model if the features are good enough to explain all the variation of the target feature.

Thanks so much and I hope you enjoyed this analysis. More time was spent understanding the research published article but it pays off as some parts weren’t that clear until tested. I admit the article was lengthy and I did not read all the wonderful plots or graphs but got as much information I could use to find the features worth testing with a portion of the data that was used in the study. The study helped understand what these samples are, where were they taken, and that not all the samples were used. Many of the plots and graphs are beyond my current understanding in bioinformatics and gene expression analysis in a lab setting and using machines and other intricate algorithms to analyze data, but the overall goal and missio of the study is understood as I do have a Master’s degree in data science with bioinformatics, course knowledge in genetics, biology, biochemistry, differential diagnosis, immunity and infection, pathology, and neurology from 3 years of chiropractic medical school. Work like this keeps the mind alert and trained to help stay sharp and prevent dimentia or Alzheimer’s or other cognitive disorders from not allowing the mind to explore and create or understand and learn by connecting ideas, also the internet is wonderful in its ability to use AI to give information fast and options to compare with own understanding, however they do say it uses a lot of computing technology and energy that drains water resources as well as electrical stores for maintaining grid power in communities. It all pays off and we get better understanding and leads to more people searching for what they want to find and accomplish or the skills to learn that they never thought they could. I do recommend exploring this article at link above and doing the analysis presented in front of you to make your own inferences or guide you in exploring other parts of the data and study to design your own questions to answer with data science and machine learning.

Keep checking in. We will be adding these 30 genes to our pathology database but not today.

========================================================================

*** Part 2

We will just upload these 30 genes to the database of pathologies for this part.

NPC_30 <- read.csv('top20_MET_nonMET_FC_and_10GSE299775_genes.csv', header=T, sep=',')

str(NPC_30)
## 'data.frame':    30 obs. of  36 variables:
##  $ ID_REF                  : chr  "MYC" "VCAM1" "BRCA1" "BCL2" ...
##  $ GSM9045905_MET          : num  56.12 7.81 8.53 24.86 33.39 ...
##  $ GSM9045906_MET          : num  88.45 22.38 8.53 26.64 15.98 ...
##  $ GSM9045909_MET          : num  35.05 34.1 4.74 4.74 32.21 ...
##  $ GSM9045910_MET          : num  44.22 57.54 7.46 9.06 32.5 ...
##  $ GSM9045911_MET          : num  6.98 17.05 3.88 10.85 11.63 ...
##  $ GSM9045912_MET          : num  72.01 70 11.44 6.95 28.72 ...
##  $ GSM9045913_MET          : num  100.6 11.08 8.53 35.81 52.86 ...
##  $ GSM9045914_MET          : num  16.1 30.31 8.53 13.26 11.37 ...
##  $ GSM9045915_MET          : num  79.05 51.93 3.88 14.73 13.18 ...
##  $ GSM9045918_MET          : num  85.25 25.58 6.09 25.58 8.53 ...
##  $ GSM9045919_MET          : num  21.31 49.02 9.59 10.66 31.97 ...
##  $ GSM9045928_MET          : num  10.66 6.39 10.66 14.21 6.39 ...
##  $ GSM9045929_MET          : num  20.93 20.93 1.55 14.73 8.53 ...
##  $ GSM9045930_MET          : num  128.7 41.3 10.3 28.7 16.8 ...
##  $ GSM9045907_nonMET       : num  17.05 21.31 4.26 4.26 21.31 ...
##  $ GSM9045908_nonMET       : num  12.79 12.79 4.26 25.58 59.68 ...
##  $ GSM9045916_nonMET       : num  25.58 22.17 1.71 11.94 8.53 ...
##  $ GSM9045917_nonMET       : num  12.08 4.97 7.1 5.68 29.13 ...
##  $ GSM9045923_nonMET       : num  26.1 41.5 4.4 13.8 12.6 ...
##  $ GSM9045924_nonMET       : num  15.5 27.9 9.3 10.8 6.2 ...
##  $ GSM9045926_nonMET       : num  34.1 4.26 8.53 12.79 4.26 ...
##  $ GSM9045927_nonMET       : num  85.2 26.7 10.8 39.3 25.2 ...
##  $ GSM9045931_nonMET       : num  15.35 46.04 3.41 5.12 6.82 ...
##  $ GSM9045932_nonMET       : num  10.23 20.46 3.41 6.82 6.82 ...
##  $ GSM9045920_Normal       : num  17.05 6.39 6.39 7.46 12.79 ...
##  $ GSM9045921_Normal       : num  17.05 4.26 3.41 2.56 34.1 ...
##  $ GSM9045922_Normal       : num  13.18 3.1 6.98 7.75 25.58 ...
##  $ GSM9045925_Normal       : num  8.9 6.3 2.59 2.22 17.79 ...
##  $ MET_mean                : num  54.67 31.82 7.41 17.2 21.72 ...
##  $ nonMET_mean             : num  25.41 22.81 5.71 13.61 18.06 ...
##  $ Normal_mean             : num  14.04 5.01 4.84 5 22.57 ...
##  $ MET_Normal_FoldChange   : num  3.892 6.348 1.53 3.441 0.962 ...
##  $ nonMET_Normal_FoldChange: num  1.81 4.55 1.18 2.72 0.8 ...
##  $ MET_nonMET_FoldChange   : num  2.15 1.39 1.3 1.26 1.2 ...
##  $ geneSource              : chr  "GSE299775 EBV+NCP gene of interest" "GSE299775 EBV+NCP gene of interest" "GSE299775 EBV+NCP gene of interest" "GSE299775 EBV+NCP gene of interest" ...

Lets get the other table of pathologies to use with out table of 30 NPC genes.

path <- "C:/.../genes Mono MS FM EBV Lyme NKTCL EBVaNPC/" #path of directory to your file of pathologies.
setwd(path)

pathologyDB <- read.csv("pathologyDB_EBVaNPC_13plus20genes.csv", header=T, sep=',')

str(pathologyDB)
## 'data.frame':    237 obs. of  7 variables:
##  $ Ensembl_ID          : chr  "ENSG00000211899" "ENSG00000164458" "ENSG00000211644" "ENSG00000125869" ...
##  $ Genecards_ID        : chr  "IGHM" "TBXT" "IGLV1-51" "LAMP5" ...
##  $ FC_pathology_control: num  18550 1051 179 140 105 ...
##  $ topGenePathology    : chr  "Epstein Barr Virus" "Epstein Barr Virus" "Epstein Barr Virus" "Epstein Barr Virus" ...
##  $ mediaType           : chr  "LCLs of PBMCs RNA-Seq format" "LCLs of PBMCs RNA-Seq format" "LCLs of PBMCs RNA-Seq format" "LCLs of PBMCs RNA-Seq format" ...
##  $ studySummarized     : chr  "The EBV or Epstein-Barr Viral infected samples were obtained from lymphoblastic cells in peripheral blood monon"| __truncated__ "The EBV or Epstein-Barr Viral infected samples were obtained from lymphoblastic cells in peripheral blood monon"| __truncated__ "The EBV or Epstein-Barr Viral infected samples were obtained from lymphoblastic cells in peripheral blood monon"| __truncated__ "The EBV or Epstein-Barr Viral infected samples were obtained from lymphoblastic cells in peripheral blood monon"| __truncated__ ...
##  $ GSE_study_ID        : chr  "GSE253756" "GSE253756" "GSE253756" "GSE253756" ...
colnames(NPC_30)
##  [1] "ID_REF"                   "GSM9045905_MET"          
##  [3] "GSM9045906_MET"           "GSM9045909_MET"          
##  [5] "GSM9045910_MET"           "GSM9045911_MET"          
##  [7] "GSM9045912_MET"           "GSM9045913_MET"          
##  [9] "GSM9045914_MET"           "GSM9045915_MET"          
## [11] "GSM9045918_MET"           "GSM9045919_MET"          
## [13] "GSM9045928_MET"           "GSM9045929_MET"          
## [15] "GSM9045930_MET"           "GSM9045907_nonMET"       
## [17] "GSM9045908_nonMET"        "GSM9045916_nonMET"       
## [19] "GSM9045917_nonMET"        "GSM9045923_nonMET"       
## [21] "GSM9045924_nonMET"        "GSM9045926_nonMET"       
## [23] "GSM9045927_nonMET"        "GSM9045931_nonMET"       
## [25] "GSM9045932_nonMET"        "GSM9045920_Normal"       
## [27] "GSM9045921_Normal"        "GSM9045922_Normal"       
## [29] "GSM9045925_Normal"        "MET_mean"                
## [31] "nonMET_mean"              "Normal_mean"             
## [33] "MET_Normal_FoldChange"    "nonMET_Normal_FoldChange"
## [35] "MET_nonMET_FoldChange"    "geneSource"

Just keep the ID_REF, MET_nonMET_FoldChange, and geneSource columns.

genes30_ID_FC <- NPC_30[,c(1,35,36)]
genes30_ID_FC
##     ID_REF MET_nonMET_FoldChange
## 1      MYC            2.15185389
## 2    VCAM1            1.39478595
## 3    BRCA1            1.29704183
## 4     BCL2            1.26375241
## 5    VEGFA            1.20253916
## 6    KDM5B            1.14112231
## 7     BRD4            1.12040383
## 8     CD47            1.07109005
## 9     CD74            0.88890443
## 10     ATM            0.56792044
## 11   TRPA1            6.89920949
## 12    CCL2            5.91986032
## 13  SHISA2            5.37001514
## 14    LGR5            5.02904401
## 15   FDCSP            4.97791817
## 16   CXCL6            4.65092687
## 17    CST1            3.93830205
## 18    MMP1            3.77502483
## 19   FSTL3            3.72344208
## 20  AKR1C3            3.37745476
## 21   AMY1A            0.15966597
## 22    MSLN            0.14037864
## 23   GSTA1            0.13022230
## 24    MSMB            0.12031737
## 25  NCCRP1            0.11610074
## 26 COL14A1            0.10322887
## 27  ZNF331            0.05778740
## 28 SCGB1A1            0.03950420
## 29    ZPBP            0.02622455
## 30     LTF            0.02573839
##                                                              geneSource
## 1                                    GSE299775 EBV+NCP gene of interest
## 2                                    GSE299775 EBV+NCP gene of interest
## 3                                    GSE299775 EBV+NCP gene of interest
## 4                                    GSE299775 EBV+NCP gene of interest
## 5                                    GSE299775 EBV+NCP gene of interest
## 6                                    GSE299775 EBV+NCP gene of interest
## 7                                    GSE299775 EBV+NCP gene of interest
## 8                                    GSE299775 EBV+NCP gene of interest
## 9                                    GSE299775 EBV+NCP gene of interest
## 10                                   GSE299775 EBV+NCP gene of interest
## 11 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 12 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 13 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 14 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 15 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 16 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 17 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 18 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 19 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 20 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 21 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 22 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 23 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 24 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 25 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 26 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 27 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 28 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 29 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 30 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
colnames(pathologyDB)
## [1] "Ensembl_ID"           "Genecards_ID"         "FC_pathology_control"
## [4] "topGenePathology"     "mediaType"            "studySummarized"     
## [7] "GSE_study_ID"

We need to make our column names the same exact name as the pathologyDB to add them to it, but we have to add those columns and fill them in too.

colnames(genes30_ID_FC) <- c("Genecards_ID", "FC_pathology_control","topGenePathology")

The published article has a section that describes what procedure was done for this data. It is not PBMCs or peripheral blood mononuclear cells, but biopsied FFPE NPC using total RNA and not mRNA as most of our studies used. The section of the article ‘Spatial Transcriptome Analysis with GeoMX Digital Spatial Profiler (DSP)’ on page 15 on PubMed document.

genes30_ID_FC$Ensembl_ID <- "Ensembl_ID needed"
genes30_ID_FC$mediaType <- "Total RNA of biopsied Formalin-Fixed Paraffine-Embedded or FFPE Nasopharyngeal Carcinoma or NPC tumor samples of 16 MET (metastatic after 3 years) and 8 nonMET (not metastatic afer 3 years) and 4 normal for 28 samples in total"
genes30_ID_FC$studySummarized <- "GSE299775 on the chromatin loops that DNA of less dense genes have access to latent type II Epstein-Barr Virus (EBV) that can hijack the genes in those areas and cause cancer cell proliferation, tumor growth, and a higher risk of fatality in Nasopharyngeal Carcinoma (NPC) patients. The gene target of interest for this study is KDM5B and the EBV interacting region or EBVIR that one of the 3 EBV genes can enter the chromatin and hijack the human DNA to activate and inactivate genes that promote tumor growth and metastasis. This study found that the chromatin around KDM5B and the EBVIR-enhancer-KDM5B signature was found in EBV+NPC and in patient derived xenographs where the tumor is implanted in a mouse that has been made immunodeficient to watch the tumor progress. They found that when KDM5B is upregulated that the cancer promoting genes are also active and cancer suppressor genes inactive. There are 10 genes from the study that made the cut of this study's quality control, batch effect removal, normalization, and keeping only the top 25th percentile of gene counts of mRNA from the NPC tumors, so not all genes they mentioned in the study made this step of the research. There were other gene sets they analyzed with various unsupervised and supervised machine learning algorithms. These genes as a set were also sorted to get means per sample type or class then get the fold change of the MET over the nonMET to use the top 10 up regulated and top 10 down regulated genes for a total of 20 genes. The Ensembl_ID wasn't included in this study and has to be found online manually. The 10 genes from the study are: MYC, VCAM1, BRCA1, BCL2, VEGFA, KDM5B, BRD4, CD47, CD74, and ATM. There were 3 genes that were EBV genes: EBNA1, oriP, and RPMS1 that were the EBV genes that lock onto the chromatin of human host and have been shown to enhance carcinoma cell proliferation in NPC as well as gastric carcinoma. There were some other genes in the study that didn't get through quality control in the data for GSE299775: LMP1, H3K27,H3K27ac, H3K27me3, CTFC, LGALS9, MK167,   BZLF1,  MLL, Ku70, MK167. The top 10 up regulated or enhanced genes from fold change values of MET/nonMET were: TRPA1, CCL2, SHISA2, LGR5, FDCSP, CXCL6, CST1, MMP1, FSTL3, and AKR1C3. And the top 10 down regulated or silenced genes from fold change values MET/nonMET were: AMY1A, MSLN, GSTA1, MSMB, 
NCCRP1, COL14A1, ZNF331, SCGB1A1, ZPBP, LTF."  

genes30_ID_FC$GSE_study_ID <- "GSE299775"
colnames(genes30_ID_FC)
## [1] "Genecards_ID"         "FC_pathology_control" "topGenePathology"    
## [4] "Ensembl_ID"           "mediaType"            "studySummarized"     
## [7] "GSE_study_ID"
genes30 <- genes30_ID_FC[,c(4,1,2,3,5:7)]
colnames(genes30)
## [1] "Ensembl_ID"           "Genecards_ID"         "FC_pathology_control"
## [4] "topGenePathology"     "mediaType"            "studySummarized"     
## [7] "GSE_study_ID"
colnames(pathologyDB)
## [1] "Ensembl_ID"           "Genecards_ID"         "FC_pathology_control"
## [4] "topGenePathology"     "mediaType"            "studySummarized"     
## [7] "GSE_study_ID"

Lets read in our last project’s data set for GSE271486 and see if the Ensemble_ID for our 30 genes our in the file to easily add to this file before manually adding the Ensemble ID.

prevStudy <- "C:...GSE271486_HNE_1_MUT_LMP1_vs_HNE_1_WT_LMP1.csv" #use own path on computer for this file that came with the study GSE271486.
previous <- read.csv(prevStudy, header=T, sep=',')
head(previous)
##           gene_id gene_name
## 1 ENSG00000000003    TSPAN6
## 2 ENSG00000000005      TNMD
## 3 ENSG00000000419      DPM1
## 4 ENSG00000000457     SCYL3
## 5 ENSG00000000460  C1orf112
## 6 ENSG00000000938       FGR
##                                                                                      description
## 1                                              tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]
## 2                                                tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757]
## 3 dolichyl-phosphate mannosyltransferase subunit 1, catalytic [Source:HGNC Symbol;Acc:HGNC:3005]
## 4                                   SCY1 like pseudokinase 3 [Source:HGNC Symbol;Acc:HGNC:19285]
## 5                        chromosome 1 open reading frame 112 [Source:HGNC Symbol;Acc:HGNC:25565]
## 6              FGR proto-oncogene, Src family tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:3697]
##                     locus HNE_1_MUT_LMP1_1_count HNE_1_MUT_LMP1_2_count
## 1 X:100627109-100639991:-                   1057                    700
## 2 X:100584802-100599885:+                      0                      0
## 3  20:50934867-50958555:-                    937                    678
## 4 1:169849631-169894267:-                    200                    136
## 5 1:169662007-169854080:+                    537                    474
## 6   1:27612064-27635277:-                      0                      0
##   HNE_1_MUT_LMP1_3_count HNE_1_WT_LMP1_1_count HNE_1_WT_LMP1_2_count
## 1                    855                  1643                   826
## 2                      0                     0                     0
## 3                    932                  1114                   930
## 4                    158                   273                   120
## 5                    471                   839                   617
## 6                      0                     0                     0
##   HNE_1_WT_LMP1_3_count HNE_1_MUT_LMP1_1_FPKM HNE_1_MUT_LMP1_2_FPKM
## 1                   875              9.355201              8.273695
## 2                     0              0.000000              0.000000
## 3                   845             19.973180             19.300172
## 4                   177              1.247310              1.132686
## 5                   594              3.772203              4.446549
## 6                     0              0.000000              0.000000
##   HNE_1_MUT_LMP1_3_FPKM HNE_1_WT_LMP1_1_FPKM HNE_1_WT_LMP1_2_FPKM
## 1              9.658906            10.528930            7.0215334
## 2              0.000000             0.000000            0.0000000
## 3             25.357562            17.193387           19.0398968
## 4              1.257724             1.232754            0.7187855
## 5              4.223044             4.267284            4.1627526
## 6              0.000000             0.000000            0.0000000
##   HNE_1_WT_LMP1_3_FPKM
## 1             9.546551
## 2             0.000000
## 3            22.203657
## 4             1.360751
## 5             5.143604
## 6             0.000000

Lets see if our 30 genes our listed in this dataset we just read in from last project on EBV NPC in GSE271486.

maybe30 <- previous[which(previous$gene_name %in% genes30$Genecards_ID),]
maybe30$gene_name
##  [1] "BRCA1"   "LTF"     "CD74"    "ZPBP"    "FSTL3"   "MSLN"    "TRPA1"  
##  [8] "CCL2"    "VEGFA"   "KDM5B"   "CXCL6"   "ZNF331"  "MYC"     "LGR5"   
## [15] "BRD4"    "SCGB1A1" "ATM"     "VCAM1"   "CST1"    "BCL2"    "SHISA2" 
## [22] "FDCSP"   "COL14A1" "NCCRP1"  "AKR1C3"  "MMP1"    "CD47"    "AMY1A"  
## [29] "GSTA1"   "MSMB"    "AKR1C3"
genes30$Genecards_ID
##  [1] "MYC"     "VCAM1"   "BRCA1"   "BCL2"    "VEGFA"   "KDM5B"   "BRD4"   
##  [8] "CD47"    "CD74"    "ATM"     "TRPA1"   "CCL2"    "SHISA2"  "LGR5"   
## [15] "FDCSP"   "CXCL6"   "CST1"    "MMP1"    "FSTL3"   "AKR1C3"  "AMY1A"  
## [22] "MSLN"    "GSTA1"   "MSMB"    "NCCRP1"  "COL14A1" "ZNF331"  "SCGB1A1"
## [29] "ZPBP"    "LTF"

We have an extra ID in the 30 gene match of our previous study. We don’t care about the other information but that we can get the Ensemble ID, so we can remove the duplicates.

Yes30 <- maybe30[,c(1,2)]
Yes30 <- Yes30[!duplicated(Yes30$gene_name),]
Yes30
##               gene_id gene_name
## 310   ENSG00000012048     BRCA1
## 316   ENSG00000012223       LTF
## 394   ENSG00000019582      CD74
## 591   ENSG00000042813      ZPBP
## 1111  ENSG00000070404     FSTL3
## 2710  ENSG00000102854      MSLN
## 2900  ENSG00000104321     TRPA1
## 3574  ENSG00000108691      CCL2
## 4100  ENSG00000112715     VEGFA
## 4672  ENSG00000117139     KDM5B
## 5599  ENSG00000124875     CXCL6
## 6361  ENSG00000130844    ZNF331
## 7396  ENSG00000136997       MYC
## 7808  ENSG00000139292      LGR5
## 8178  ENSG00000141867      BRD4
## 9161  ENSG00000149021   SCGB1A1
## 9199  ENSG00000149311       ATM
## 10805 ENSG00000162692     VCAM1
## 12810 ENSG00000170373      CST1
## 13149 ENSG00000171791      BCL2
## 14928 ENSG00000180730    SHISA2
## 15088 ENSG00000181617     FDCSP
## 16614 ENSG00000187955   COL14A1
## 16753 ENSG00000188505    NCCRP1
## 17023 ENSG00000196139    AKR1C3
## 17195 ENSG00000196611      MMP1
## 17244 ENSG00000196776      CD47
## 30895 ENSG00000237763     AMY1A
## 33183 ENSG00000243955     GSTA1
## 42653 ENSG00000263639      MSMB

Lets merge this data with our 30 genes of NPC from GSE299775.

Ids30 <- merge(Yes30, genes30, by.x='gene_name', by.y='Genecards_ID')
colnames(Ids30)
## [1] "gene_name"            "gene_id"              "Ensembl_ID"          
## [4] "FC_pathology_control" "topGenePathology"     "mediaType"           
## [7] "studySummarized"      "GSE_study_ID"

We need to change the gene_id to Ensemble_ID and remove the Ensemble_ID that was a placeholder filled with ‘needs Ensembl_ID’ before adding it to the pathology database.

ID30 <- Ids30[,c(2,1,4:8)] #rearrange while dropping placeholder Ensembl_ID column
colnames(ID30)[1] <- "Ensembl_ID"
colnames(ID30)[2] <- "Genecards_ID"
colnames(ID30)
## [1] "Ensembl_ID"           "Genecards_ID"         "FC_pathology_control"
## [4] "topGenePathology"     "mediaType"            "studySummarized"     
## [7] "GSE_study_ID"

Once more check to compare that the columns are the same name.

colnames(pathologyDB)
## [1] "Ensembl_ID"           "Genecards_ID"         "FC_pathology_control"
## [4] "topGenePathology"     "mediaType"            "studySummarized"     
## [7] "GSE_study_ID"

Now we are ready to add these 30 genes to our pathology database.

Pathology <- rbind(pathologyDB,ID30)
tail(Pathology,2)
##          Ensembl_ID Genecards_ID FC_pathology_control
## 266 ENSG00000130844       ZNF331           0.05778740
## 267 ENSG00000042813         ZPBP           0.02622455
##                                                         topGenePathology
## 266 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
## 267 top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC
##                                                                                                                                                                                                                              mediaType
## 266 Total RNA of biopsied Formalin-Fixed Paraffine-Embedded or FFPE Nasopharyngeal Carcinoma or NPC tumor samples of 16 MET (metastatic after 3 years) and 8 nonMET (not metastatic afer 3 years) and 4 normal for 28 samples in total
## 267 Total RNA of biopsied Formalin-Fixed Paraffine-Embedded or FFPE Nasopharyngeal Carcinoma or NPC tumor samples of 16 MET (metastatic after 3 years) and 8 nonMET (not metastatic afer 3 years) and 4 normal for 28 samples in total
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  studySummarized
## 266 GSE299775 on the chromatin loops that DNA of less dense genes have access to latent type II Epstein-Barr Virus (EBV) that can hijack the genes in those areas and cause cancer cell proliferation, tumor growth, and a higher risk of fatality in Nasopharyngeal Carcinoma (NPC) patients. The gene target of interest for this study is KDM5B and the EBV interacting region or EBVIR that one of the 3 EBV genes can enter the chromatin and hijack the human DNA to activate and inactivate genes that promote tumor growth and metastasis. This study found that the chromatin around KDM5B and the EBVIR-enhancer-KDM5B signature was found in EBV+NPC and in patient derived xenographs where the tumor is implanted in a mouse that has been made immunodeficient to watch the tumor progress. They found that when KDM5B is upregulated that the cancer promoting genes are also active and cancer suppressor genes inactive. There are 10 genes from the study that made the cut of this study's quality control, batch effect removal, normalization, and keeping only the top 25th percentile of gene counts of mRNA from the NPC tumors, so not all genes they mentioned in the study made this step of the research. There were other gene sets they analyzed with various unsupervised and supervised machine learning algorithms. These genes as a set were also sorted to get means per sample type or class then get the fold change of the MET over the nonMET to use the top 10 up regulated and top 10 down regulated genes for a total of 20 genes. The Ensembl_ID wasn't included in this study and has to be found online manually. The 10 genes from the study are: MYC, VCAM1, BRCA1, BCL2, VEGFA, KDM5B, BRD4, CD47, CD74, and ATM. There were 3 genes that were EBV genes: EBNA1, oriP, and RPMS1 that were the EBV genes that lock onto the chromatin of human host and have been shown to enhance carcinoma cell proliferation in NPC as well as gastric carcinoma. There were some other genes in the study that didn't get through quality control in the data for GSE299775: LMP1, H3K27,H3K27ac, H3K27me3, CTFC, LGALS9, MK167,   BZLF1,  MLL, Ku70, MK167. The top 10 up regulated or enhanced genes from fold change values of MET/nonMET were: TRPA1, CCL2, SHISA2, LGR5, FDCSP, CXCL6, CST1, MMP1, FSTL3, and AKR1C3. And the top 10 down regulated or silenced genes from fold change values MET/nonMET were: AMY1A, MSLN, GSTA1, MSMB, \nNCCRP1, COL14A1, ZNF331, SCGB1A1, ZPBP, LTF.
## 267 GSE299775 on the chromatin loops that DNA of less dense genes have access to latent type II Epstein-Barr Virus (EBV) that can hijack the genes in those areas and cause cancer cell proliferation, tumor growth, and a higher risk of fatality in Nasopharyngeal Carcinoma (NPC) patients. The gene target of interest for this study is KDM5B and the EBV interacting region or EBVIR that one of the 3 EBV genes can enter the chromatin and hijack the human DNA to activate and inactivate genes that promote tumor growth and metastasis. This study found that the chromatin around KDM5B and the EBVIR-enhancer-KDM5B signature was found in EBV+NPC and in patient derived xenographs where the tumor is implanted in a mouse that has been made immunodeficient to watch the tumor progress. They found that when KDM5B is upregulated that the cancer promoting genes are also active and cancer suppressor genes inactive. There are 10 genes from the study that made the cut of this study's quality control, batch effect removal, normalization, and keeping only the top 25th percentile of gene counts of mRNA from the NPC tumors, so not all genes they mentioned in the study made this step of the research. There were other gene sets they analyzed with various unsupervised and supervised machine learning algorithms. These genes as a set were also sorted to get means per sample type or class then get the fold change of the MET over the nonMET to use the top 10 up regulated and top 10 down regulated genes for a total of 20 genes. The Ensembl_ID wasn't included in this study and has to be found online manually. The 10 genes from the study are: MYC, VCAM1, BRCA1, BCL2, VEGFA, KDM5B, BRD4, CD47, CD74, and ATM. There were 3 genes that were EBV genes: EBNA1, oriP, and RPMS1 that were the EBV genes that lock onto the chromatin of human host and have been shown to enhance carcinoma cell proliferation in NPC as well as gastric carcinoma. There were some other genes in the study that didn't get through quality control in the data for GSE299775: LMP1, H3K27,H3K27ac, H3K27me3, CTFC, LGALS9, MK167,   BZLF1,  MLL, Ku70, MK167. The top 10 up regulated or enhanced genes from fold change values of MET/nonMET were: TRPA1, CCL2, SHISA2, LGR5, FDCSP, CXCL6, CST1, MMP1, FSTL3, and AKR1C3. And the top 10 down regulated or silenced genes from fold change values MET/nonMET were: AMY1A, MSLN, GSTA1, MSMB, \nNCCRP1, COL14A1, ZNF331, SCGB1A1, ZPBP, LTF.
##     GSE_study_ID
## 266    GSE299775
## 267    GSE299775

Lets write this pathology database out to csv to continue adding pathology fold change results of top genes for each pathology as we progress in making our database of pathologies of EBV, EBV associated pathologies, Lyme disease, multiple sclerosis, and fibromyalgia.

Lets just see which pathologies we have in our database currently.

table(Pathology$topGenePathology)
## 
##                  EBVaNPC nasopharyngeal carcinoma with EBV infection 
##                                                                   34 
##                                                   Epstein Barr Virus 
##                                                                   80 
##                                                         fibromyalgia 
##                                                                   15 
##                                   GSE299775 EBV+NCP gene of interest 
##                                                                   10 
##                                                Lyme Disease 6 months 
##                                                                   33 
##                                                        mononucleosis 
##                                                                   15 
##                                                   Multiple Sclerosis 
##                                                                   41 
##                           NKTCL Natural Killer T-Cell Lymphoma & EBV 
##                                                                   19 
## top 10 up and 10 down regulated foldchange mean of MET to nonMET NPC 
##                                                                   20

There are 9 types but a couple of them are EBV in Nasopharyngeal Carcinoma. We have more to research and analyze in Hodgkin’s lymphoma, large B-cell Lymphoma, gastric carcinoma, intrahepatic cholangiocarcinoma in gall bladder bile duct carcinoma, and others if we find them on the way before we can test out our machine to see if the machine can predict by class on similar data we find from same media type. Seems all have been normalized between 0 and 1 and scaled to some affect, but fold change will still be a change between the pathology vs the normal or baseline state. Its neat because if you think a pathology is similar and might be affected by the same pathway of cancer proliferation and metastasis in these pathologies or disease progression, the same genes could possibly be affected that are also affected in other pathologies related to it. In this study we just completed on EBV positive NPC and gastric carcinoma. They noticed that the same chromatin loop interaction site of DNA access for latent EBV to infiltrate and hijack the KDM5B virus was present with their identified EBVIR-enhancer-KDM5B signature but that only the EBV latent type II did the compartment switching in the chromatin loops for less gene dense or gene rich regions did the EBV infiltrate NPC and not for gastric carcinoma, in gastric carcinoma they didn’t see compartment switching but did see it increase risk of metastasis with the signature present. This is because EBV latent type 1 is what affects gastric carcinoma. A mnemonic for cancer like skin cancer and EBV or HPV for human papolloma virus, that they change the normal epithelial from stratified epithelial cells into a different not stratified type cell, the mnemonic is,‘if its satisfied, its stratified.’ Like food in to satisfy hunger and farting or pooping to satisfy an upset gut and the reproductive tract lining for sexual intercourse and the urethral tract for peeing to satisfy emptying a full bladder. Basically, just the entire gastrointestinal tract and uretergenito tract.

Lets write this out to csv and you can get a copy here.

write.csv(Pathology,'pathologyDB_2EBVNPC_Lyme_MS_mono_NKTCL_EBV.csv', row.names=F)

Thanks so much and keep checking back to see how this is going.