This is a continuation of the other part 1 and part 2 of the project anayzing the effects of Epstein-Barr viral infection (EBV) on Nasopharyngeal Carcinoma (NPC) using gene expression data from the NBCI study GSE271486 and the extraction of the top genes the study found in its free Pubmed article.

We are going to expand on Part 2 by using those top genes we found from the study totaling 13 and predicting the class of the sample as NPC or EBVaNPC which is EBV infected NPC via taking the mutated 101th position amino acid of Hystadine (H) replaced with Arginine (R) in the latent membrane protein gene called LMP1 that has many associations with tumorigenesis and cancers when analyzed by the study’s research team. You can read the details in the PubMed article at link above or below.

Lets read in the large data file of the NPC and EBV sample with fold change data as well as the 13 study genes.

EBVaNPC <- read.csv("NPCEBV_ordered.csv", header=T, sep=',')

studyGenes13 <- read.csv("studyGenesDF.csv", header=T, sep=',')
str(EBVaNPC)
## 'data.frame':    50868 obs. of  19 variables:
##  $ gene_id               : chr  "ENSG00000001561" "ENSG00000004846" "ENSG00000005108" "ENSG00000006210" ...
##  $ gene_name             : chr  "ENPP4" "ABCB5" "THSD7A" "CX3CL1" ...
##  $ description           : chr  "ectonucleotide pyrophosphatase/phosphodiesterase 4 [Source:HGNC Symbol;Acc:HGNC:3359]" "ATP binding cassette subfamily B member 5 [Source:HGNC Symbol;Acc:HGNC:46]" "thrombospondin type 1 domain containing 7A [Source:HGNC Symbol;Acc:HGNC:22207]" "C-X3-C motif chemokine ligand 1 [Source:HGNC Symbol;Acc:HGNC:10647]" ...
##  $ locus                 : chr  "6:46129993-46146699:+" "7:20615207-20777038:+" "7:11370357-11832198:-" "16:57372458-57385048:+" ...
##  $ HNE_1_MUT_LMP1_1_count: int  0 1 0 1 1 1 0 0 2 1 ...
##  $ HNE_1_MUT_LMP1_2_count: int  1 0 0 0 0 0 0 2 0 0 ...
##  $ HNE_1_MUT_LMP1_3_count: int  1 0 1 0 0 0 1 1 0 0 ...
##  $ HNE_1_WT_LMP1_1_count : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ HNE_1_WT_LMP1_2_count : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ HNE_1_WT_LMP1_3_count : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ HNE_1_MUT_LMP1_1_FPKM : num  0 0.00529 0 0.00414 0.0065 ...
##  $ HNE_1_MUT_LMP1_2_FPKM : num  0.0154 0 0 0 0 ...
##  $ HNE_1_MUT_LMP1_3_FPKM : num  0.01476 0 0.00375 0 0 ...
##  $ HNE_1_WT_LMP1_1_FPKM  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ HNE_1_WT_LMP1_2_FPKM  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ HNE_1_WT_LMP1_3_FPKM  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EBV_mean              : num  0.667 0.333 0.333 0.333 0.333 ...
##  $ baseline_mean         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FoldchangeEBV_baseline: num  Inf Inf Inf Inf Inf ...
str(studyGenes13)
## 'data.frame':    13 obs. of  3 variables:
##  $ gene_name             : chr  "CEACAM6" "FOXP3" "RAB38" "ANKRD11" ...
##  $ description           : chr  "CEA cell adhesion molecule 6 [Source:HGNC Symbol;Acc:HGNC:1818]" "forkhead box P3 [Source:HGNC Symbol;Acc:HGNC:6106]" "RAB38, member RAS oncogene family [Source:HGNC Symbol;Acc:HGNC:9776]" "ankyrin repeat domain 11 [Source:HGNC Symbol;Acc:HGNC:21316]" ...
##  $ FoldchangeEBV_baseline: num  Inf 2.83 2.4 1.84 0.5 ...
DF13 <- EBVaNPC[,c(2,5:10)]
colnames(DF13)
## [1] "gene_name"              "HNE_1_MUT_LMP1_1_count" "HNE_1_MUT_LMP1_2_count"
## [4] "HNE_1_MUT_LMP1_3_count" "HNE_1_WT_LMP1_1_count"  "HNE_1_WT_LMP1_2_count" 
## [7] "HNE_1_WT_LMP1_3_count"
data13 <- DF13[which(DF13$gene_name %in% studyGenes13$gene_name),]

genesHeader <- data13$gene_name
class <- c("EBVaNPC","EBVaNPC","EBVaNPC","NPC", "NPC","NPC")

str(data13)
## 'data.frame':    13 obs. of  7 variables:
##  $ gene_name             : chr  "CEACAM6" "FOXP3" "RAB38" "ANKRD11" ...
##  $ HNE_1_MUT_LMP1_1_count: int  3 6 5 6518 4 0 0 3 8 0 ...
##  $ HNE_1_MUT_LMP1_2_count: int  3 7 4 6902 11 1 1 2 10 0 ...
##  $ HNE_1_MUT_LMP1_3_count: int  2 4 3 3838 5 2 3 4 10 1 ...
##  $ HNE_1_WT_LMP1_1_count : int  0 3 3 3197 13 2 4 16 50 2 ...
##  $ HNE_1_WT_LMP1_2_count : int  0 1 2 2469 10 3 3 7 19 1 ...
##  $ HNE_1_WT_LMP1_3_count : int  0 2 0 3724 17 3 5 6 26 2 ...

Lets make the matrix to test out the same random Forest model settings on this low count for samples study to see how well it predicts the class of the sample as EBVaNPC or NPC.

matrix13 <- data.frame(t(data13[,c(2:7)]))

colnames(matrix13) <- genesHeader

matrix13$class <- as.factor(class)


str(matrix13)
## 'data.frame':    6 obs. of  14 variables:
##  $ CEACAM6 : int  3 3 2 0 0 0
##  $ FOXP3   : int  6 7 4 3 1 2
##  $ RAB38   : int  5 4 3 3 2 0
##  $ ANKRD11 : int  6518 6902 3838 3197 2469 3724
##  $ CD34    : int  4 11 5 13 10 17
##  $ HLA-DPA1: int  0 1 2 2 3 3
##  $ CD14    : int  0 1 3 4 3 5
##  $ CSF1R   : int  3 2 4 16 7 6
##  $ LBH     : int  8 10 10 50 19 26
##  $ IL7     : int  0 0 1 2 1 2
##  $ PRRX1   : int  0 0 1 2 2 4
##  $ WNT7A   : int  0 0 0 2 1 2
##  $ LTA     : int  0 0 0 2 1 3
##  $ class   : Factor w/ 2 levels "EBVaNPC","NPC": 1 1 1 2 2 2

Lets separate into 2 classes each for the training data and 1 sample each for the testing data.

training <- matrix13[c(1,3,4,5),]
testing <- matrix13[c(2,6),]
table(testing$class)
## 
## EBVaNPC     NPC 
##       1       1
table(training$class)
## 
## EBVaNPC     NPC 
##       2       2
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
rf_study <- randomForest(training[,c(1:13)], training[,14],mtry=6,
                         importance=TRUE, confusion=T)
rf_study$confusion
##         EBVaNPC NPC class.error
## EBVaNPC       2   0           0
## NPC           0   2           0

The model scored 100% on training with all classes of EBV infected NCP identified and all classes of baseline NCP only identified.

Resulting output: EBVncp ncp class.error EBVncp 2 0 0 ncp 0 2 0

Now lets see how well it predicts on the hold out set or our testing set of 1 sample each.

prediction <- predict(rf_study,testing)
result <- data.frame(predicted=prediction, actual=testing$class)
result
##                        predicted  actual
## HNE_1_MUT_LMP1_2_count   EBVaNPC EBVaNPC
## HNE_1_WT_LMP1_3_count        NPC     NPC

The training model had an error when training on the NPC class with it incorrectly identifying one sample of NPC as EBVaNPC. The testing set was 100% correctly identified by correctly identifying the EBVaNPC sample as such and correctly identifying the NPC sample as NPC.

This model scored as well as our genes we found by differential expression fold change analysis, so we can add these genes to our database of pathologies as well and move onto another study in the lymphomas of Burkitt Lymphoma that interestingly enough, this study found in one of their research articles reviewed that the same LMP1 gene mutation at the 101th position of amino acid in its mRNA form, has the same hystidine (H) amino acid replaced with another amino acid but it is the amino acid D for Aspartic acid.

Lets read in our all genes database and format the 13 study genes to put into our database of pathologies.

path <- "C/...genes..." #use your own path to the pathologies database.
setwd(path)

pathologyDB <- read.csv("pathologyDB_NKTCL_added.csv", header=T, sep=',')

str(pathologyDB)
## 'data.frame':    203 obs. of  7 variables:
##  $ Ensembl_ID          : chr  "ENSG00000211899" "ENSG00000164458" "ENSG00000211644" "ENSG00000125869" ...
##  $ Genecards_ID        : chr  "IGHM" "TBXT" "IGLV1-51" "LAMP5" ...
##  $ FC_pathology_control: num  18550 1051 179 140 105 ...
##  $ topGenePathology    : chr  "Epstein Barr Virus" "Epstein Barr Virus" "Epstein Barr Virus" "Epstein Barr Virus" ...
##  $ mediaType           : chr  "LCLs of PBMCs RNA-Seq format" "LCLs of PBMCs RNA-Seq format" "LCLs of PBMCs RNA-Seq format" "LCLs of PBMCs RNA-Seq format" ...
##  $ studySummarized     : chr  "The EBV or Epstein-Barr Viral infected samples were obtained from lymphoblastic cells in peripheral blood monon"| __truncated__ "The EBV or Epstein-Barr Viral infected samples were obtained from lymphoblastic cells in peripheral blood monon"| __truncated__ "The EBV or Epstein-Barr Viral infected samples were obtained from lymphoblastic cells in peripheral blood monon"| __truncated__ "The EBV or Epstein-Barr Viral infected samples were obtained from lymphoblastic cells in peripheral blood monon"| __truncated__ ...
##  $ GSE_study_ID        : chr  "GSE253756" "GSE253756" "GSE253756" "GSE253756" ...
colnames(pathologyDB)
## [1] "Ensembl_ID"           "Genecards_ID"         "FC_pathology_control"
## [4] "topGenePathology"     "mediaType"            "studySummarized"     
## [7] "GSE_study_ID"
head(studyGenes13)
##   gene_name
## 1   CEACAM6
## 2     FOXP3
## 3     RAB38
## 4   ANKRD11
## 5      CD34
## 6  HLA-DPA1
##                                                                                 description
## 1                           CEA cell adhesion molecule 6 [Source:HGNC Symbol;Acc:HGNC:1818]
## 2                                        forkhead box P3 [Source:HGNC Symbol;Acc:HGNC:6106]
## 3                      RAB38, member RAS oncogene family [Source:HGNC Symbol;Acc:HGNC:9776]
## 4                              ankyrin repeat domain 11 [Source:HGNC Symbol;Acc:HGNC:21316]
## 5                                          CD34 molecule [Source:HGNC Symbol;Acc:HGNC:1662]
## 6 major histocompatibility complex, class II, DP alpha 1 [Source:HGNC Symbol;Acc:HGNC:4938]
##   FoldchangeEBV_baseline
## 1                    Inf
## 2               2.833333
## 3               2.400000
## 4               1.837913
## 5               0.500000
## 6               0.375000

Keep only the fold change and gene name of the study genes.

keep2 <- studyGenes13[,c(1,3)]
colnames(keep2)
## [1] "gene_name"              "FoldchangeEBV_baseline"

Lets add in the other features and set them to their filler content known from the study reviewed, but not yet the Ensembl_ID. However, the original data from the gene expression data of GSE271486 supplied the Ensemble_ID in that data frame. So we can read that in first and take the Ensemble_ID.

npcEBV <- read.csv("GSE271486_HNE_1_MUT_LMP1_vs_HNE_1_WT_LMP1.csv", header=T, sep=',')

colnames(npcEBV)
##  [1] "gene_id"                "gene_name"              "description"           
##  [4] "locus"                  "HNE_1_MUT_LMP1_1_count" "HNE_1_MUT_LMP1_2_count"
##  [7] "HNE_1_MUT_LMP1_3_count" "HNE_1_WT_LMP1_1_count"  "HNE_1_WT_LMP1_2_count" 
## [10] "HNE_1_WT_LMP1_3_count"  "HNE_1_MUT_LMP1_1_FPKM"  "HNE_1_MUT_LMP1_2_FPKM" 
## [13] "HNE_1_MUT_LMP1_3_FPKM"  "HNE_1_WT_LMP1_1_FPKM"   "HNE_1_WT_LMP1_2_FPKM"  
## [16] "HNE_1_WT_LMP1_3_FPKM"
original <- npcEBV[which(npcEBV$gene_name %in% genesHeader),c(1,2)]
original
##               gene_id gene_name
## 669   ENSG00000049768     FOXP3
## 1699  ENSG00000086548   CEACAM6
## 2923  ENSG00000104432       IL7
## 4533  ENSG00000116132     PRRX1
## 5431  ENSG00000123892     RAB38
## 9808  ENSG00000154764     WNT7A
## 12112 ENSG00000167522   ANKRD11
## 12833 ENSG00000170458      CD14
## 13644 ENSG00000174059      CD34
## 15297 ENSG00000182578     CSF1R
## 19593 ENSG00000213626       LBH
## 23681 ENSG00000226979       LTA
## 26608 ENSG00000231389  HLA-DPA1

Lets merge the original dataset with the 13 study genes with fold change of keep2 dataset.

IDs13 <- merge(original, keep2, by.x='gene_name', by.y='gene_name')
IDs13
##    gene_name         gene_id FoldchangeEBV_baseline
## 1    ANKRD11 ENSG00000167522              1.8379127
## 2       CD14 ENSG00000170458              0.3333333
## 3       CD34 ENSG00000174059              0.5000000
## 4    CEACAM6 ENSG00000086548                    Inf
## 5      CSF1R ENSG00000182578              0.3103448
## 6      FOXP3 ENSG00000049768              2.8333333
## 7   HLA-DPA1 ENSG00000231389              0.3750000
## 8        IL7 ENSG00000104432              0.2000000
## 9        LBH ENSG00000213626              0.2947368
## 10       LTA ENSG00000226979              0.0000000
## 11     PRRX1 ENSG00000116132              0.1250000
## 12     RAB38 ENSG00000123892              2.4000000
## 13     WNT7A ENSG00000154764              0.0000000

Now lets add in the other features to our pathologies database.

IDs13$topGenePathology <- "EBVaNPC nasopharyngeal carcinoma with EBV infection"
IDs13$mediaType <- "peripheral blood mononuclear cells - immune cells, reverse transcribed RNA into messenger RNA, 13 genes in study confirmed"
IDs13$studySummarized <- "This study found the mutation of a gene most prominent in a study of Epstein-Barr infected patients, then recreated that mutation of hystidine replaced with Arginine at the 101th amino acid in the sequence for the gene LMP1 for a latent membrane protein that is found to be abnormal (the study explains in detail the process of using specialized control plasmids for site specific mutagenesis at the 101th amino acid H, another lab created and then having a transfer plasmid, envelope plasmid, and packaging plasmid among other tedious steps by a few other lab created products and time of 2-3 days, a centrifuge collection, a T cell creation, viral creation with a lentivirus, replacing the substrate every 14-16 hours, etc and so on but details in the 'cell lines and culture conditions' subsection of the 'Method' section of PubMed article), a variant, dysfunction or dysregulated in uncontrolled tumor growth of many cancers in studies they evaluated in this study's research. Then they took a commercial line of nasopharyngeal carcinoma patient samples (NPC) and transfected the mutated strand of LMP1 of H101R in EBV infection of 3 NPC samples and compared it to the wild type or controlled or baseline NPC. They found that there was uncontrolled rapid progression of tumor growth and proliferation with the EBV infected NPC or EBVaNPC for EBV associated NPC samples compared to the controlled NPC samples. Takeaway, closer to finding treatments for NPC and know not to keep NPC patients near anybody with the kissing disease of mononucleosis or sharing any utensils, lipstick, or other personal hygiene products like chapstick or water bottles with an EBV infected person in the active stage of infection (about 6-8 weeks) if you want them to live longer and have a chance. In the study of EBVaNPC and NPC wild type, the study used single cell RNA sequencing with a chip array and created their own computer system ViSFE to rank the barcode different length nucleotide sequences then combine and extract the genes from those results. The provided gene expression data included counts of those genes by barcodes for each and the fragments per kilo million or FPKM, we used the counts only and got fold change values to compare and found our own genes and the study's genes. The genes we found from most up and down regulated 10 genes will be added as well, these are the genes found in the study to predict the class of EBVaNPC or NPC. Both sets of genes scored a perfect classification with limited samples using random forest classifier."

IDs13$GSE_study_ID <- 'GSE271486'

colnames(IDs13)
## [1] "gene_name"              "gene_id"                "FoldchangeEBV_baseline"
## [4] "topGenePathology"       "mediaType"              "studySummarized"       
## [7] "GSE_study_ID"
colnames(pathologyDB)
## [1] "Ensembl_ID"           "Genecards_ID"         "FC_pathology_control"
## [4] "topGenePathology"     "mediaType"            "studySummarized"     
## [7] "GSE_study_ID"
colnames(IDs13)[1] <- 'Ensembl_ID'
colnames(IDs13)[2] <- 'Genecards_ID'
colnames(IDs13)[3] <- 'FC_pathology_control'

colnames(IDs13)
## [1] "Ensembl_ID"           "Genecards_ID"         "FC_pathology_control"
## [4] "topGenePathology"     "mediaType"            "studySummarized"     
## [7] "GSE_study_ID"

Now we can add these study genes to our pathology database.

pathologyDB_added13 <- rbind(pathologyDB,IDs13)

tail(pathologyDB_added13)
##     Ensembl_ID    Genecards_ID FC_pathology_control
## 211        IL7 ENSG00000104432            0.2000000
## 212        LBH ENSG00000213626            0.2947368
## 213        LTA ENSG00000226979            0.0000000
## 214      PRRX1 ENSG00000116132            0.1250000
## 215      RAB38 ENSG00000123892            2.4000000
## 216      WNT7A ENSG00000154764            0.0000000
##                                        topGenePathology
## 211 EBVaNPC nasopharyngeal carcinoma with EBV infection
## 212 EBVaNPC nasopharyngeal carcinoma with EBV infection
## 213 EBVaNPC nasopharyngeal carcinoma with EBV infection
## 214 EBVaNPC nasopharyngeal carcinoma with EBV infection
## 215 EBVaNPC nasopharyngeal carcinoma with EBV infection
## 216 EBVaNPC nasopharyngeal carcinoma with EBV infection
##                                                                                                                      mediaType
## 211 peripheral blood mononuclear cells - immune cells, reverse transcribed RNA into messenger RNA, 13 genes in study confirmed
## 212 peripheral blood mononuclear cells - immune cells, reverse transcribed RNA into messenger RNA, 13 genes in study confirmed
## 213 peripheral blood mononuclear cells - immune cells, reverse transcribed RNA into messenger RNA, 13 genes in study confirmed
## 214 peripheral blood mononuclear cells - immune cells, reverse transcribed RNA into messenger RNA, 13 genes in study confirmed
## 215 peripheral blood mononuclear cells - immune cells, reverse transcribed RNA into messenger RNA, 13 genes in study confirmed
## 216 peripheral blood mononuclear cells - immune cells, reverse transcribed RNA into messenger RNA, 13 genes in study confirmed
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       studySummarized
## 211 This study found the mutation of a gene most prominent in a study of Epstein-Barr infected patients, then recreated that mutation of hystidine replaced with Arginine at the 101th amino acid in the sequence for the gene LMP1 for a latent membrane protein that is found to be abnormal (the study explains in detail the process of using specialized control plasmids for site specific mutagenesis at the 101th amino acid H, another lab created and then having a transfer plasmid, envelope plasmid, and packaging plasmid among other tedious steps by a few other lab created products and time of 2-3 days, a centrifuge collection, a T cell creation, viral creation with a lentivirus, replacing the substrate every 14-16 hours, etc and so on but details in the 'cell lines and culture conditions' subsection of the 'Method' section of PubMed article), a variant, dysfunction or dysregulated in uncontrolled tumor growth of many cancers in studies they evaluated in this study's research. Then they took a commercial line of nasopharyngeal carcinoma patient samples (NPC) and transfected the mutated strand of LMP1 of H101R in EBV infection of 3 NPC samples and compared it to the wild type or controlled or baseline NPC. They found that there was uncontrolled rapid progression of tumor growth and proliferation with the EBV infected NPC or EBVaNPC for EBV associated NPC samples compared to the controlled NPC samples. Takeaway, closer to finding treatments for NPC and know not to keep NPC patients near anybody with the kissing disease of mononucleosis or sharing any utensils, lipstick, or other personal hygiene products like chapstick or water bottles with an EBV infected person in the active stage of infection (about 6-8 weeks) if you want them to live longer and have a chance. In the study of EBVaNPC and NPC wild type, the study used single cell RNA sequencing with a chip array and created their own computer system ViSFE to rank the barcode different length nucleotide sequences then combine and extract the genes from those results. The provided gene expression data included counts of those genes by barcodes for each and the fragments per kilo million or FPKM, we used the counts only and got fold change values to compare and found our own genes and the study's genes. The genes we found from most up and down regulated 10 genes will be added as well, these are the genes found in the study to predict the class of EBVaNPC or NPC. Both sets of genes scored a perfect classification with limited samples using random forest classifier.
## 212 This study found the mutation of a gene most prominent in a study of Epstein-Barr infected patients, then recreated that mutation of hystidine replaced with Arginine at the 101th amino acid in the sequence for the gene LMP1 for a latent membrane protein that is found to be abnormal (the study explains in detail the process of using specialized control plasmids for site specific mutagenesis at the 101th amino acid H, another lab created and then having a transfer plasmid, envelope plasmid, and packaging plasmid among other tedious steps by a few other lab created products and time of 2-3 days, a centrifuge collection, a T cell creation, viral creation with a lentivirus, replacing the substrate every 14-16 hours, etc and so on but details in the 'cell lines and culture conditions' subsection of the 'Method' section of PubMed article), a variant, dysfunction or dysregulated in uncontrolled tumor growth of many cancers in studies they evaluated in this study's research. Then they took a commercial line of nasopharyngeal carcinoma patient samples (NPC) and transfected the mutated strand of LMP1 of H101R in EBV infection of 3 NPC samples and compared it to the wild type or controlled or baseline NPC. They found that there was uncontrolled rapid progression of tumor growth and proliferation with the EBV infected NPC or EBVaNPC for EBV associated NPC samples compared to the controlled NPC samples. Takeaway, closer to finding treatments for NPC and know not to keep NPC patients near anybody with the kissing disease of mononucleosis or sharing any utensils, lipstick, or other personal hygiene products like chapstick or water bottles with an EBV infected person in the active stage of infection (about 6-8 weeks) if you want them to live longer and have a chance. In the study of EBVaNPC and NPC wild type, the study used single cell RNA sequencing with a chip array and created their own computer system ViSFE to rank the barcode different length nucleotide sequences then combine and extract the genes from those results. The provided gene expression data included counts of those genes by barcodes for each and the fragments per kilo million or FPKM, we used the counts only and got fold change values to compare and found our own genes and the study's genes. The genes we found from most up and down regulated 10 genes will be added as well, these are the genes found in the study to predict the class of EBVaNPC or NPC. Both sets of genes scored a perfect classification with limited samples using random forest classifier.
## 213 This study found the mutation of a gene most prominent in a study of Epstein-Barr infected patients, then recreated that mutation of hystidine replaced with Arginine at the 101th amino acid in the sequence for the gene LMP1 for a latent membrane protein that is found to be abnormal (the study explains in detail the process of using specialized control plasmids for site specific mutagenesis at the 101th amino acid H, another lab created and then having a transfer plasmid, envelope plasmid, and packaging plasmid among other tedious steps by a few other lab created products and time of 2-3 days, a centrifuge collection, a T cell creation, viral creation with a lentivirus, replacing the substrate every 14-16 hours, etc and so on but details in the 'cell lines and culture conditions' subsection of the 'Method' section of PubMed article), a variant, dysfunction or dysregulated in uncontrolled tumor growth of many cancers in studies they evaluated in this study's research. Then they took a commercial line of nasopharyngeal carcinoma patient samples (NPC) and transfected the mutated strand of LMP1 of H101R in EBV infection of 3 NPC samples and compared it to the wild type or controlled or baseline NPC. They found that there was uncontrolled rapid progression of tumor growth and proliferation with the EBV infected NPC or EBVaNPC for EBV associated NPC samples compared to the controlled NPC samples. Takeaway, closer to finding treatments for NPC and know not to keep NPC patients near anybody with the kissing disease of mononucleosis or sharing any utensils, lipstick, or other personal hygiene products like chapstick or water bottles with an EBV infected person in the active stage of infection (about 6-8 weeks) if you want them to live longer and have a chance. In the study of EBVaNPC and NPC wild type, the study used single cell RNA sequencing with a chip array and created their own computer system ViSFE to rank the barcode different length nucleotide sequences then combine and extract the genes from those results. The provided gene expression data included counts of those genes by barcodes for each and the fragments per kilo million or FPKM, we used the counts only and got fold change values to compare and found our own genes and the study's genes. The genes we found from most up and down regulated 10 genes will be added as well, these are the genes found in the study to predict the class of EBVaNPC or NPC. Both sets of genes scored a perfect classification with limited samples using random forest classifier.
## 214 This study found the mutation of a gene most prominent in a study of Epstein-Barr infected patients, then recreated that mutation of hystidine replaced with Arginine at the 101th amino acid in the sequence for the gene LMP1 for a latent membrane protein that is found to be abnormal (the study explains in detail the process of using specialized control plasmids for site specific mutagenesis at the 101th amino acid H, another lab created and then having a transfer plasmid, envelope plasmid, and packaging plasmid among other tedious steps by a few other lab created products and time of 2-3 days, a centrifuge collection, a T cell creation, viral creation with a lentivirus, replacing the substrate every 14-16 hours, etc and so on but details in the 'cell lines and culture conditions' subsection of the 'Method' section of PubMed article), a variant, dysfunction or dysregulated in uncontrolled tumor growth of many cancers in studies they evaluated in this study's research. Then they took a commercial line of nasopharyngeal carcinoma patient samples (NPC) and transfected the mutated strand of LMP1 of H101R in EBV infection of 3 NPC samples and compared it to the wild type or controlled or baseline NPC. They found that there was uncontrolled rapid progression of tumor growth and proliferation with the EBV infected NPC or EBVaNPC for EBV associated NPC samples compared to the controlled NPC samples. Takeaway, closer to finding treatments for NPC and know not to keep NPC patients near anybody with the kissing disease of mononucleosis or sharing any utensils, lipstick, or other personal hygiene products like chapstick or water bottles with an EBV infected person in the active stage of infection (about 6-8 weeks) if you want them to live longer and have a chance. In the study of EBVaNPC and NPC wild type, the study used single cell RNA sequencing with a chip array and created their own computer system ViSFE to rank the barcode different length nucleotide sequences then combine and extract the genes from those results. The provided gene expression data included counts of those genes by barcodes for each and the fragments per kilo million or FPKM, we used the counts only and got fold change values to compare and found our own genes and the study's genes. The genes we found from most up and down regulated 10 genes will be added as well, these are the genes found in the study to predict the class of EBVaNPC or NPC. Both sets of genes scored a perfect classification with limited samples using random forest classifier.
## 215 This study found the mutation of a gene most prominent in a study of Epstein-Barr infected patients, then recreated that mutation of hystidine replaced with Arginine at the 101th amino acid in the sequence for the gene LMP1 for a latent membrane protein that is found to be abnormal (the study explains in detail the process of using specialized control plasmids for site specific mutagenesis at the 101th amino acid H, another lab created and then having a transfer plasmid, envelope plasmid, and packaging plasmid among other tedious steps by a few other lab created products and time of 2-3 days, a centrifuge collection, a T cell creation, viral creation with a lentivirus, replacing the substrate every 14-16 hours, etc and so on but details in the 'cell lines and culture conditions' subsection of the 'Method' section of PubMed article), a variant, dysfunction or dysregulated in uncontrolled tumor growth of many cancers in studies they evaluated in this study's research. Then they took a commercial line of nasopharyngeal carcinoma patient samples (NPC) and transfected the mutated strand of LMP1 of H101R in EBV infection of 3 NPC samples and compared it to the wild type or controlled or baseline NPC. They found that there was uncontrolled rapid progression of tumor growth and proliferation with the EBV infected NPC or EBVaNPC for EBV associated NPC samples compared to the controlled NPC samples. Takeaway, closer to finding treatments for NPC and know not to keep NPC patients near anybody with the kissing disease of mononucleosis or sharing any utensils, lipstick, or other personal hygiene products like chapstick or water bottles with an EBV infected person in the active stage of infection (about 6-8 weeks) if you want them to live longer and have a chance. In the study of EBVaNPC and NPC wild type, the study used single cell RNA sequencing with a chip array and created their own computer system ViSFE to rank the barcode different length nucleotide sequences then combine and extract the genes from those results. The provided gene expression data included counts of those genes by barcodes for each and the fragments per kilo million or FPKM, we used the counts only and got fold change values to compare and found our own genes and the study's genes. The genes we found from most up and down regulated 10 genes will be added as well, these are the genes found in the study to predict the class of EBVaNPC or NPC. Both sets of genes scored a perfect classification with limited samples using random forest classifier.
## 216 This study found the mutation of a gene most prominent in a study of Epstein-Barr infected patients, then recreated that mutation of hystidine replaced with Arginine at the 101th amino acid in the sequence for the gene LMP1 for a latent membrane protein that is found to be abnormal (the study explains in detail the process of using specialized control plasmids for site specific mutagenesis at the 101th amino acid H, another lab created and then having a transfer plasmid, envelope plasmid, and packaging plasmid among other tedious steps by a few other lab created products and time of 2-3 days, a centrifuge collection, a T cell creation, viral creation with a lentivirus, replacing the substrate every 14-16 hours, etc and so on but details in the 'cell lines and culture conditions' subsection of the 'Method' section of PubMed article), a variant, dysfunction or dysregulated in uncontrolled tumor growth of many cancers in studies they evaluated in this study's research. Then they took a commercial line of nasopharyngeal carcinoma patient samples (NPC) and transfected the mutated strand of LMP1 of H101R in EBV infection of 3 NPC samples and compared it to the wild type or controlled or baseline NPC. They found that there was uncontrolled rapid progression of tumor growth and proliferation with the EBV infected NPC or EBVaNPC for EBV associated NPC samples compared to the controlled NPC samples. Takeaway, closer to finding treatments for NPC and know not to keep NPC patients near anybody with the kissing disease of mononucleosis or sharing any utensils, lipstick, or other personal hygiene products like chapstick or water bottles with an EBV infected person in the active stage of infection (about 6-8 weeks) if you want them to live longer and have a chance. In the study of EBVaNPC and NPC wild type, the study used single cell RNA sequencing with a chip array and created their own computer system ViSFE to rank the barcode different length nucleotide sequences then combine and extract the genes from those results. The provided gene expression data included counts of those genes by barcodes for each and the fragments per kilo million or FPKM, we used the counts only and got fold change values to compare and found our own genes and the study's genes. The genes we found from most up and down regulated 10 genes will be added as well, these are the genes found in the study to predict the class of EBVaNPC or NPC. Both sets of genes scored a perfect classification with limited samples using random forest classifier.
##     GSE_study_ID
## 211    GSE271486
## 212    GSE271486
## 213    GSE271486
## 214    GSE271486
## 215    GSE271486
## 216    GSE271486

Now we should add the genes we found in part 1 from using two different sets of differentially expressed genes that both sets scored 100% accuracy in predicting the class as EBVaNPC or NPC only.

I scrolled down to part 1 below and found the 20 genes that worked for top 10 up regulated and top 10 down regulated when we filtered out the 0 value means in the EBVaNPC and NPC means of the samples.

[1] “AC139530.2” “AC011511.4” “AC092143.1” “AL645465.1” “AL353997.3”
[6] “AC022384.1” “MPO” “TRIM6-TRIM34” “FUT4” “ITGA4”
[11] “RPL23AP34” “AC027088.3” “MYOM3” “CALCR” “MOXD1”
[16] “SMAD5-AS1” “HSPE1P7” “AC026464.2” “RASA4B” “AC026954.2”

We can store this as our list of genes found that also predict 100% accuracy in EBVaNPC or NPC.

genes20 <- c("AC139530.2"  , "AC011511.4"  , "AC092143.1"  , "AL645465.1" ,  "AL353997.3"  
 ,"AC022384.1"  , "MPO"     ,     "TRIM6-TRIM34", "FUT4"   ,      "ITGA4"       
,"RPL23AP34" ,   "AC027088.3" ,  "MYOM3"       , "CALCR"   ,     "MOXD1"      
,"SMAD5-AS1"  ,  "HSPE1P7" ,     "AC026464.2"  , "RASA4B"   ,    "AC026954.2" ) 

We also need the fold change data from the EBVaNPC data set we uploaded in this part 3, and we can get the Ensembl ID from this dataset as well.

FC_20genes <- EBVaNPC[which(EBVaNPC$gene_name %in% genes20),c(1,2,19)]
FC_20genes
##               gene_id    gene_name FoldchangeEBV_baseline
## 1787  ENSG00000262660   AC139530.2           520.00000000
## 1788  ENSG00000267303   AC011511.4           154.25000000
## 1789  ENSG00000198211   AC092143.1            91.00000000
## 1790  ENSG00000240963   AL645465.1            45.25000000
## 1791  ENSG00000267441   AL353997.3            28.25000000
## 1792  ENSG00000272410   AC022384.1            25.55555556
## 1793  ENSG00000005381          MPO            22.66666667
## 1794  ENSG00000258588 TRIM6-TRIM34            18.00000000
## 1795  ENSG00000196371         FUT4            16.00000000
## 1796  ENSG00000115232        ITGA4            14.00000000
## 23709 ENSG00000225991    RPL23AP34             0.07692308
## 23710 ENSG00000259265   AC027088.3             0.07692308
## 23711 ENSG00000142661        MYOM3             0.07142857
## 23712 ENSG00000004948        CALCR             0.06666667
## 23713 ENSG00000079931        MOXD1             0.06250000
## 23714 ENSG00000164621    SMAD5-AS1             0.05882353
## 23715 ENSG00000270945      HSPE1P7             0.04761905
## 23716 ENSG00000260108   AC026464.2             0.03557312
## 23717 ENSG00000170667       RASA4B             0.03225806
## 23718 ENSG00000261915   AC026954.2             0.01107011
## 50278 ENSG00000279999   AL353997.3                     NA

Lets make the names of these columns the needed column names to our pathology database to add them later with a row bind.

colnames(FC_20genes)[1] <- 'Ensembl_ID'
colnames(FC_20genes)[2] <- 'Genecards_ID'
colnames(FC_20genes)[3] <- 'FC_pathology_control'

colnames(FC_20genes)
## [1] "Ensembl_ID"           "Genecards_ID"         "FC_pathology_control"
FC_20genes$topGenePathology <- "EBVaNPC nasopharyngeal carcinoma with EBV infection"
FC_20genes$mediaType <- "peripheral blood mononuclear cells - immune cells, reverse transcribed RNA into messenger RNA, these are the top 10 up regulated and top 10 bottom regulated genes after getting the fold change values of the EBVaNPC vs NPC or Mutated vs Wild Type and then filtering to remove the mean values of 0 that led to a 0 value fold change being very small and rounded to 0 or an Inf value from too large a value."
FC_20genes$studySummarized <- "This study found the mutation of a gene most prominent in a study of Epstein-Barr infected patients, then recreated that mutation of hystidine replaced with Arginine at the 101th amino acid in the sequence for the gene LMP1 for a latent membrane protein that is found to be abnormal (the study explains in detail the process of using specialized control plasmids for site specific mutagenesis at the 101th amino acid H, another lab created and then having a transfer plasmid, envelope plasmid, and packaging plasmid among other tedious steps by a few other lab created products and time of 2-3 days, a centrifuge collection, a T cell creation, viral creation with a lentivirus, replacing the substrate every 14-16 hours, etc and so on but details in the 'cell lines and culture conditions' subsection of the 'Method' section of PubMed article), a variant, dysfunction or dysregulated in uncontrolled tumor growth of many cancers in studies they evaluated in this study's research. Then they took a commercial line of nasopharyngeal carcinoma patient samples (NPC) and transfected the mutated strand of LMP1 of H101R in EBV infection of 3 NPC samples and compared it to the wild type or controlled or baseline NPC. They found that there was uncontrolled rapid progression of tumor growth and proliferation with the EBV infected NPC or EBVaNPC for EBV associated NPC samples compared to the controlled NPC samples. Takeaway, closer to finding treatments for NPC and know not to keep NPC patients near anybody with the kissing disease of mononucleosis or sharing any utensils, lipstick, or other personal hygiene products like chapstick or water bottles with an EBV infected person in the active stage of infection (about 6-8 weeks) if you want them to live longer and have a chance. In the study of EBVaNPC and NPC wild type, the study used single cell RNA sequencing with a chip array and created their own computer system ViSFE to rank the barcode different length nucleotide sequences then combine and extract the genes from those results. The provided gene expression data included counts of those genes by barcodes for each and the fragments per kilo million or FPKM, we used the counts only and got fold change values to compare and found our own genes and the study's genes. The genes we found from most up and down regulated 10 genes will be added as well, these are the genes found in the study to predict the class of EBVaNPC or NPC. Both sets of genes scored a perfect classification with limited samples using random forest classifier."

FC_20genes$GSE_study_ID <- 'GSE271486'

colnames(FC_20genes)
## [1] "Ensembl_ID"           "Genecards_ID"         "FC_pathology_control"
## [4] "topGenePathology"     "mediaType"            "studySummarized"     
## [7] "GSE_study_ID"
colnames(pathologyDB_added13)
## [1] "Ensembl_ID"           "Genecards_ID"         "FC_pathology_control"
## [4] "topGenePathology"     "mediaType"            "studySummarized"     
## [7] "GSE_study_ID"

We can add these additional 20 genes to our pathology database.

pathologyDB_13plus20 <- rbind(pathologyDB_added13,FC_20genes)
dim(pathologyDB_13plus20)
## [1] 237   7
dim(pathologyDB)
## [1] 203   7
dim(pathologyDB_added13)
## [1] 216   7

Lets write this dataset out to csv.

setwd(path)
write.csv(pathologyDB_13plus20,"pathologyDB_EBVaNPC_13plus20genes.csv",row.names=F)

You can access this pathology database at this link. We will continue adding to it and will be moving on to the next set of EBV associated pathologies for our machine to predict our pathology diseases from the projects we have analyzed thus far in Lyme disease, infectious Mononucleosis, Multiple Sclerosis, Fibromyalgia, EBV initial infection, Nasopharyngeal Carcinoma with EBV infection, and Natural Killer T-cell Lymphoma.

We can actually see the table of classes of pathologies we have in our latest edition of the pathology database.

table(pathologyDB_13plus20$topGenePathology)
## 
## EBVaNPC nasopharyngeal carcinoma with EBV infection 
##                                                  34 
##                                  Epstein Barr Virus 
##                                                  80 
##                                        fibromyalgia 
##                                                  15 
##                               Lyme Disease 6 months 
##                                                  33 
##                                       mononucleosis 
##                                                  15 
##                                  Multiple Sclerosis 
##                                                  41 
##          NKTCL Natural Killer T-Cell Lymphoma & EBV 
##                                                  19

Great, now we keep adding and looking for other pathologies related to EBV. The article may have mentioned some other pathologies but I will review it again to see. For now we are going to be looking at Hodgkin’s Lymphoma and Burkitt Lymphoma gene expression data if we come across it or other EBV flagged pathologies in National Center for Bioinformatics (NCBI) online database.

Thanks and keep checking in. Below are the reverse sequenced parts to this project, with Part 2 next, and Part 1 at the end of this document following Part 2. They are separated by equal signs and labeled with 3 asterick symbols before their respective title.

=====================================

*** Part 2

This is part 2 to the previous quick analysis and predictive analytics testing in finding the top genes with fold change values from a study on nasopharyngeal carcinoma (NPC) and how Epstein-Barr viral infection (EBV) influences the proliferation rate to increase and to create more mutant strands of the target protein with an amino acid sequence change from Histadine to Arginine at the 101th amino acid marker of the messenger RNA or mRNA strand. I reviewed the pdf free study for EBVaNPC GOF and found that it is their term for Epstein-Barr Virus associated Gene Ontology Fingerprint. They found the GOF by screening thousands of studies on EBV and tested the differentially expressed genes then compared those to a network of gene sets from gene set expression analysis or GSEA for common genes as influencers to the type of NPC that is more aggressive or has seen a marked change from commercial line NPC they call the Wild Type (WT) vs their mutatated (MUT) strand that they used strand profilers to create the mutation at H101R that they found in their own clinical study of 26 EBV infected patients where 13 had EBV and the other 13 had EBV and NPC. They saw a pattern of changes in the Latent Membrane Protein (LMP1) to have the histadine amino acid (H amino acid letter) replacement at the 101th or 101st amino acid in the strand with Arginine (R amino acid letter).

Additonally in my summary of the report that was free to download at Pubmed article. The study did much lab and computational analysis and developed its own strategy for finding various lengths of strands that are created to use in single cell RNA sequencing so that they ranked the strands and found the length of the strand of nucleotides had negligible affects on the outcome of their own predictive machine learners to predict the class of their samples as EBV or EBV with NPC. The LMP1 gene is said to be an intracellular signaling protein and affects the T cell activity in identifying antigens that need removal or cell apoptosis of an unhealthy or infected cell. When there is a defective LMP1 gene product it causes an increase in a gene, FOXP3, that causes T cell anergy, looking it up this is the same process that makes HIV exhaust the T cells from using their CD4 proliferation to get natural killer cells in the bone marrow to (MEN - all men are natural killers - mnemonic for the Monocytes, Eosinophils, and Neutrophils that are made in bone marrow and only the Eosinophils and Neutrophils are granolocytes that send out granules to destroy infected cells due to other mnemonic BEN where Basophiles, Eosinophils, and Neutrophils are only White Blood Cells that have granulocytes, and monocytes are macrophages once they pass the blood vessels into the intracellular spaces of tissue in a response to injury defense producing 5 cardinal signs of inflammation or redness (rubor), pain (dolor), swelling (tumor), heat (calor), and funcio de laisso or temporary loss of function).

The mutated strand of H101R of the gene LMP1 in EBV infected patients can lead to NPC by increasing the FOXP3 gene that creates T-cell anergy or stops the immune regulating defenses of natural killer cells from seeking out and destroying pathogens, and also retards the gene WNT7A that is an immune regulatory gene involved in stem cell maintenance and healing when defective or mutated can slow the process of healing from infection. I extracted the EBVaNPC GOF genes manually as I stopped scraping sites long ago and this is a pub med article and would make no sense to build a program in a few days to scrape an article that could take a few hours at most with notetaking annotations and highlighting. The figure 5d, with the link analysis to the genes found in EBVaNPC GOF and in the GSEA study they found with Gene Ontology (GO) were of interest for my own research to use our database of differentially expressed genes we found and took the top and bottom genes with most upregulated and most down regulated changes from NPC to NPC with EBV, as they used a commercial line of NPC and transfected the NPC with their mutated LMP1 gene with H101R found in EBV. So we will do that in this part. I am leaving the previous work on the tail end of this part 2 by dividing this part 2 from part 1 below with a bunch of equal signs and 3 astericks before a label as Part 1.

The genes found in the study of interest for this project to analyze and explore and possibly add or work with separately to see the changes or viability of them in the machine learning algorithm used are from the Figure 5D of the study. They also preferred using our same preferred supervised learning algorithm of random forest compared to others they used of gradient boosted machines, logistic regression, naive bayes, random forest, and a few unsupervised learning algorithms of support vector machines and K-Nearest Neighbors. The genes that we are interested in for this part 2 research analysis are the genes they found in EBVaNPC GOF and in their GSEA study with GO.

EBVaNPC GOF genes:

GSEA genes:

There were also some other genes mentioned in the study findings that weren’t included in the link analysis diagram of Figure 5d. Those genes are:

So, our list of genes to look up in our database before we extracted the most up regulated and most down regulated genes from NPC to NPC with EBV will be: CD14, CSF1R, FOXP3, LTA, WNT7A, CD34, ANKRD11, IL7, LBH, PRRX1, HLA-DPA1, CEACAM6, and RAB38. Lets make this into a list of genes to extract.

studyGenes <- c("CD14", "CSF1R", "FOXP3", "LTA", "WNT7A", "CD34", "ANKRD11", "IL7", "LBH", "PRRX1", "HLA-DPA1", "CEACAM6", "RAB38")

Lets read in that dataset and extract those genes into their own data set of top genes for that study GSE271486. We have to start from beginning as those data sets weren’t saved but no worries it was a quick analysis from start to end.

NPCEBV <- read.csv("GSE271486_HNE_1_MUT_LMP1_vs_HNE_1_WT_LMP1.csv", header=T, sep=',')
colnames(NPCEBV)
##  [1] "gene_id"                "gene_name"              "description"           
##  [4] "locus"                  "HNE_1_MUT_LMP1_1_count" "HNE_1_MUT_LMP1_2_count"
##  [7] "HNE_1_MUT_LMP1_3_count" "HNE_1_WT_LMP1_1_count"  "HNE_1_WT_LMP1_2_count" 
## [10] "HNE_1_WT_LMP1_3_count"  "HNE_1_MUT_LMP1_1_FPKM"  "HNE_1_MUT_LMP1_2_FPKM" 
## [13] "HNE_1_MUT_LMP1_3_FPKM"  "HNE_1_WT_LMP1_1_FPKM"   "HNE_1_WT_LMP1_2_FPKM"  
## [16] "HNE_1_WT_LMP1_3_FPKM"
NPCEBV$EBV_mean <- rowMeans(NPCEBV[,c(5:7)])
NPCEBV$baseline_mean <- rowMeans(NPCEBV[,c(8:10)])
NPCEBV$FoldchangeEBV_baseline <- NPCEBV$EBV_mean/NPCEBV$baseline_mean
NPCEBV_ordered <- NPCEBV[order(NPCEBV$FoldchangeEBV_baseline, decreasing=T),]
str(NPCEBV_ordered)
## 'data.frame':    50868 obs. of  19 variables:
##  $ gene_id               : chr  "ENSG00000001561" "ENSG00000004846" "ENSG00000005108" "ENSG00000006210" ...
##  $ gene_name             : chr  "ENPP4" "ABCB5" "THSD7A" "CX3CL1" ...
##  $ description           : chr  "ectonucleotide pyrophosphatase/phosphodiesterase 4 [Source:HGNC Symbol;Acc:HGNC:3359]" "ATP binding cassette subfamily B member 5 [Source:HGNC Symbol;Acc:HGNC:46]" "thrombospondin type 1 domain containing 7A [Source:HGNC Symbol;Acc:HGNC:22207]" "C-X3-C motif chemokine ligand 1 [Source:HGNC Symbol;Acc:HGNC:10647]" ...
##  $ locus                 : chr  "6:46129993-46146699:+" "7:20615207-20777038:+" "7:11370357-11832198:-" "16:57372458-57385048:+" ...
##  $ HNE_1_MUT_LMP1_1_count: int  0 1 0 1 1 1 0 0 2 1 ...
##  $ HNE_1_MUT_LMP1_2_count: int  1 0 0 0 0 0 0 2 0 0 ...
##  $ HNE_1_MUT_LMP1_3_count: int  1 0 1 0 0 0 1 1 0 0 ...
##  $ HNE_1_WT_LMP1_1_count : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ HNE_1_WT_LMP1_2_count : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ HNE_1_WT_LMP1_3_count : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ HNE_1_MUT_LMP1_1_FPKM : num  0 0.00529 0 0.00414 0.0065 ...
##  $ HNE_1_MUT_LMP1_2_FPKM : num  0.0154 0 0 0 0 ...
##  $ HNE_1_MUT_LMP1_3_FPKM : num  0.01476 0 0.00375 0 0 ...
##  $ HNE_1_WT_LMP1_1_FPKM  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ HNE_1_WT_LMP1_2_FPKM  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ HNE_1_WT_LMP1_3_FPKM  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EBV_mean              : num  0.667 0.333 0.333 0.333 0.333 ...
##  $ baseline_mean         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FoldchangeEBV_baseline: num  Inf Inf Inf Inf Inf ...

Lets see if any of the study genes are in the gene_name feature of our data before we filtered it to exclude missing values or 0s in the samples and means.

common <- NPCEBV_ordered[which(NPCEBV_ordered$gene_name %in% studyGenes),]
common
##               gene_id gene_name
## 1699  ENSG00000086548   CEACAM6
## 669   ENSG00000049768     FOXP3
## 5431  ENSG00000123892     RAB38
## 12112 ENSG00000167522   ANKRD11
## 13644 ENSG00000174059      CD34
## 26608 ENSG00000231389  HLA-DPA1
## 12833 ENSG00000170458      CD14
## 15297 ENSG00000182578     CSF1R
## 19593 ENSG00000213626       LBH
## 2923  ENSG00000104432       IL7
## 4533  ENSG00000116132     PRRX1
## 9808  ENSG00000154764     WNT7A
## 23681 ENSG00000226979       LTA
##                                                                                     description
## 1699                            CEA cell adhesion molecule 6 [Source:HGNC Symbol;Acc:HGNC:1818]
## 669                                          forkhead box P3 [Source:HGNC Symbol;Acc:HGNC:6106]
## 5431                       RAB38, member RAS oncogene family [Source:HGNC Symbol;Acc:HGNC:9776]
## 12112                              ankyrin repeat domain 11 [Source:HGNC Symbol;Acc:HGNC:21316]
## 13644                                          CD34 molecule [Source:HGNC Symbol;Acc:HGNC:1662]
## 26608 major histocompatibility complex, class II, DP alpha 1 [Source:HGNC Symbol;Acc:HGNC:4938]
## 12833                                          CD14 molecule [Source:HGNC Symbol;Acc:HGNC:1628]
## 15297                   colony stimulating factor 1 receptor [Source:HGNC Symbol;Acc:HGNC:2433]
## 19593                LBH regulator of WNT signaling pathway [Source:HGNC Symbol;Acc:HGNC:29532]
## 2923                                           interleukin 7 [Source:HGNC Symbol;Acc:HGNC:6023]
## 4533                               paired related homeobox 1 [Source:HGNC Symbol;Acc:HGNC:9142]
## 9808                                   Wnt family member 7A [Source:HGNC Symbol;Acc:HGNC:12786]
## 23681                                      lymphotoxin alpha [Source:HGNC Symbol;Acc:HGNC:6709]
##                         locus HNE_1_MUT_LMP1_1_count HNE_1_MUT_LMP1_2_count
## 1699   19:41750977-41772208:+                      3                      3
## 669     X:49250436-49264826:-                      6                      7
## 5431   11:88113242-88175467:-                      5                      4
## 12112  16:89267627-89490561:-                   6518                   6902
## 13644 1:207880972-207911402:-                      4                     11
## 26608   6:33064569-33080775:-                      0                      1
## 12833 5:140631728-140633701:-                      0                      1
## 15297 5:150053291-150113372:-                      3                      2
## 19593   2:30231531-30323730:+                      8                     10
## 2923    8:78675743-78805523:-                      0                      0
## 4533  1:170662728-170739419:+                      0                      0
## 9808    3:13816258-13880121:-                      0                      0
## 23681   6:31572054-31574324:+                      0                      0
##       HNE_1_MUT_LMP1_3_count HNE_1_WT_LMP1_1_count HNE_1_WT_LMP1_2_count
## 1699                       2                     0                     0
## 669                        4                     3                     1
## 5431                       3                     3                     2
## 12112                   3838                  3197                  2469
## 13644                      5                    13                    10
## 26608                      2                     2                     3
## 12833                      3                     4                     3
## 15297                      4                    16                     7
## 19593                     10                    50                    19
## 2923                       1                     2                     1
## 4533                       1                     2                     2
## 9808                       0                     2                     1
## 23681                      0                     2                     1
##       HNE_1_WT_LMP1_3_count HNE_1_MUT_LMP1_1_FPKM HNE_1_MUT_LMP1_2_FPKM
## 1699                      0            0.04930920           0.065840603
## 669                       2            0.06943194           0.108171793
## 5431                      0            0.08618129           0.092081288
## 12112                  3724           15.22033021          21.523280160
## 13644                    17            0.01727330           0.063439477
## 26608                     3            0.00000000           0.007629582
## 12833                     5            0.00000000           0.018610955
## 15297                     6            0.02419598           0.021542269
## 19593                    26            0.05776895           0.096431264
## 2923                      2            0.00000000           0.000000000
## 4533                      4            0.00000000           0.000000000
## 9808                      2            0.00000000           0.000000000
## 23681                     3            0.00000000           0.000000000
##       HNE_1_MUT_LMP1_3_FPKM HNE_1_WT_LMP1_1_FPKM HNE_1_WT_LMP1_2_FPKM
## 1699            0.041948267          0.000000000          0.000000000
## 669             0.059082292          0.025135938          0.011115633
## 5431            0.066000178          0.037441990          0.033115249
## 12112          11.439281190          5.405314676          5.537387379
## 13644           0.027560037          0.040650254          0.041479696
## 26608           0.014587793          0.008277097          0.016466422
## 12833           0.053385517          0.040380901          0.040166796
## 15297           0.041178567          0.093428733          0.054216856
## 19593           0.092169972          0.261410758          0.131771024
## 2923            0.009805968          0.011122054          0.007377604
## 4533            0.006280126          0.007122999          0.009449813
## 9808            0.000000000          0.015655415          0.010384724
## 23681           0.000000000          0.036093823          0.023942156
##       HNE_1_WT_LMP1_3_FPKM     EBV_mean baseline_mean FoldchangeEBV_baseline
## 1699            0.00000000    2.6666667      0.000000                    Inf
## 669             0.02852506    5.6666667      2.000000              2.8333333
## 5431            0.00000000    4.0000000      1.666667              2.4000000
## 12112          10.71962836 5752.6666667   3130.000000              1.8379127
## 13644           0.09050085    6.6666667     13.333333              0.5000000
## 26608           0.02113450    1.0000000      2.666667              0.3750000
## 12833           0.08592286    1.3333333      4.000000              0.3333333
## 15297           0.05964330    3.0000000      9.666667              0.3103448
## 19593           0.23143041    9.3333333     31.666667              0.2947368
## 2923            0.01893249    0.3333333      1.666667              0.2000000
## 4533            0.02425021    0.3333333      2.666667              0.1250000
## 9808            0.02664939    0.0000000      1.666667              0.0000000
## 23681           0.09216093    0.0000000      2.000000              0.0000000

All 13 genes are in the data we have before filtering it. Lets see if the 13 genes are also in the data once we excluded the 0s for mean values and fold change irregularities created by having 0s for the baseline or EBV mutated mean value of genes.

NPCEBV2 <- subset(NPCEBV_ordered,NPCEBV_ordered$baseline_mean > 0 & NPCEBV_ordered$EBV_mean > 0)
str(NPCEBV2)
## 'data.frame':    21932 obs. of  19 variables:
##  $ gene_id               : chr  "ENSG00000262660" "ENSG00000267303" "ENSG00000198211" "ENSG00000240963" ...
##  $ gene_name             : chr  "AC139530.2" "AC011511.4" "AC092143.1" "AL645465.1" ...
##  $ description           : chr  "novel protein" "novel transcript" "novel protein (MC1R-TUBB3 readthrough)" "novel transcript, antisense to C1orf100" ...
##  $ locus                 : chr  "17:81703371-81720539:+" "19:10315471-10320678:-" "16:89919165-89936092:+" "1:244375100-244409592:-" ...
##  $ HNE_1_MUT_LMP1_1_count: int  520 617 211 93 1 15 0 0 5 0 ...
##  $ HNE_1_MUT_LMP1_2_count: int  0 0 0 3 63 203 36 54 4 13 ...
##  $ HNE_1_MUT_LMP1_3_count: int  0 0 62 85 49 12 32 0 7 1 ...
##  $ HNE_1_WT_LMP1_1_count : int  0 0 3 1 2 4 0 3 0 0 ...
##  $ HNE_1_WT_LMP1_2_count : int  1 0 0 1 1 2 3 0 0 0 ...
##  $ HNE_1_WT_LMP1_3_count : int  0 4 0 2 1 3 0 0 1 1 ...
##  $ HNE_1_MUT_LMP1_1_FPKM : num  14.5446 17.9294 4.0862 7.7876 0.0913 ...
##  $ HNE_1_MUT_LMP1_2_FPKM : num  0 0 0 0.335 7.679 ...
##  $ HNE_1_MUT_LMP1_3_FPKM : num  0 0 1.53 9.08 5.71 ...
##  $ HNE_1_WT_LMP1_1_FPKM  : num  0 0 0.0421 0.0606 0.1322 ...
##  $ HNE_1_WT_LMP1_2_FPKM  : num  0.0269 0 0 0.0804 0.0877 ...
##  $ HNE_1_WT_LMP1_3_FPKM  : num  0 0.143 0 0.206 0.113 ...
##  $ EBV_mean              : num  173.3 205.7 91 60.3 37.7 ...
##  $ baseline_mean         : num  0.333 1.333 1 1.333 1.333 ...
##  $ FoldchangeEBV_baseline: num  520 154.2 91 45.3 28.2 ...
range(NPCEBV2$FoldchangeEBV_baseline)
## [1]   0.01107011 520.00000000
commonFiltered <- NPCEBV2[which(NPCEBV2$gene_name %in% studyGenes),]
commonFiltered
##               gene_id gene_name
## 669   ENSG00000049768     FOXP3
## 5431  ENSG00000123892     RAB38
## 12112 ENSG00000167522   ANKRD11
## 13644 ENSG00000174059      CD34
## 26608 ENSG00000231389  HLA-DPA1
## 12833 ENSG00000170458      CD14
## 15297 ENSG00000182578     CSF1R
## 19593 ENSG00000213626       LBH
## 2923  ENSG00000104432       IL7
## 4533  ENSG00000116132     PRRX1
##                                                                                     description
## 669                                          forkhead box P3 [Source:HGNC Symbol;Acc:HGNC:6106]
## 5431                       RAB38, member RAS oncogene family [Source:HGNC Symbol;Acc:HGNC:9776]
## 12112                              ankyrin repeat domain 11 [Source:HGNC Symbol;Acc:HGNC:21316]
## 13644                                          CD34 molecule [Source:HGNC Symbol;Acc:HGNC:1662]
## 26608 major histocompatibility complex, class II, DP alpha 1 [Source:HGNC Symbol;Acc:HGNC:4938]
## 12833                                          CD14 molecule [Source:HGNC Symbol;Acc:HGNC:1628]
## 15297                   colony stimulating factor 1 receptor [Source:HGNC Symbol;Acc:HGNC:2433]
## 19593                LBH regulator of WNT signaling pathway [Source:HGNC Symbol;Acc:HGNC:29532]
## 2923                                           interleukin 7 [Source:HGNC Symbol;Acc:HGNC:6023]
## 4533                               paired related homeobox 1 [Source:HGNC Symbol;Acc:HGNC:9142]
##                         locus HNE_1_MUT_LMP1_1_count HNE_1_MUT_LMP1_2_count
## 669     X:49250436-49264826:-                      6                      7
## 5431   11:88113242-88175467:-                      5                      4
## 12112  16:89267627-89490561:-                   6518                   6902
## 13644 1:207880972-207911402:-                      4                     11
## 26608   6:33064569-33080775:-                      0                      1
## 12833 5:140631728-140633701:-                      0                      1
## 15297 5:150053291-150113372:-                      3                      2
## 19593   2:30231531-30323730:+                      8                     10
## 2923    8:78675743-78805523:-                      0                      0
## 4533  1:170662728-170739419:+                      0                      0
##       HNE_1_MUT_LMP1_3_count HNE_1_WT_LMP1_1_count HNE_1_WT_LMP1_2_count
## 669                        4                     3                     1
## 5431                       3                     3                     2
## 12112                   3838                  3197                  2469
## 13644                      5                    13                    10
## 26608                      2                     2                     3
## 12833                      3                     4                     3
## 15297                      4                    16                     7
## 19593                     10                    50                    19
## 2923                       1                     2                     1
## 4533                       1                     2                     2
##       HNE_1_WT_LMP1_3_count HNE_1_MUT_LMP1_1_FPKM HNE_1_MUT_LMP1_2_FPKM
## 669                       2            0.06943194           0.108171793
## 5431                      0            0.08618129           0.092081288
## 12112                  3724           15.22033021          21.523280160
## 13644                    17            0.01727330           0.063439477
## 26608                     3            0.00000000           0.007629582
## 12833                     5            0.00000000           0.018610955
## 15297                     6            0.02419598           0.021542269
## 19593                    26            0.05776895           0.096431264
## 2923                      2            0.00000000           0.000000000
## 4533                      4            0.00000000           0.000000000
##       HNE_1_MUT_LMP1_3_FPKM HNE_1_WT_LMP1_1_FPKM HNE_1_WT_LMP1_2_FPKM
## 669             0.059082292          0.025135938          0.011115633
## 5431            0.066000178          0.037441990          0.033115249
## 12112          11.439281190          5.405314676          5.537387379
## 13644           0.027560037          0.040650254          0.041479696
## 26608           0.014587793          0.008277097          0.016466422
## 12833           0.053385517          0.040380901          0.040166796
## 15297           0.041178567          0.093428733          0.054216856
## 19593           0.092169972          0.261410758          0.131771024
## 2923            0.009805968          0.011122054          0.007377604
## 4533            0.006280126          0.007122999          0.009449813
##       HNE_1_WT_LMP1_3_FPKM     EBV_mean baseline_mean FoldchangeEBV_baseline
## 669             0.02852506    5.6666667      2.000000              2.8333333
## 5431            0.00000000    4.0000000      1.666667              2.4000000
## 12112          10.71962836 5752.6666667   3130.000000              1.8379127
## 13644           0.09050085    6.6666667     13.333333              0.5000000
## 26608           0.02113450    1.0000000      2.666667              0.3750000
## 12833           0.08592286    1.3333333      4.000000              0.3333333
## 15297           0.05964330    3.0000000      9.666667              0.3103448
## 19593           0.23143041    9.3333333     31.666667              0.2947368
## 2923            0.01893249    0.3333333      1.666667              0.2000000
## 4533            0.02425021    0.3333333      2.666667              0.1250000

Only 10 of the genes are in the data once 0s were removed for genes that created irregularities in the fold change values of EBV mutated NCP compared to NCP only.

Lets see those genes and which removed to compare information.

c <- common$gene_name
c
##  [1] "CEACAM6"  "FOXP3"    "RAB38"    "ANKRD11"  "CD34"     "HLA-DPA1"
##  [7] "CD14"     "CSF1R"    "LBH"      "IL7"      "PRRX1"    "WNT7A"   
## [13] "LTA"
cf <- commonFiltered$gene_name
cf
##  [1] "FOXP3"    "RAB38"    "ANKRD11"  "CD34"     "HLA-DPA1" "CD14"    
##  [7] "CSF1R"    "LBH"      "IL7"      "PRRX1"
nc <- c[-which(c %in% cf)]
nc
## [1] "CEACAM6" "WNT7A"   "LTA"

There are 3 genes that aren’t in the study gene set once filtered to exclude 0 fold change values or irregular fold change values from having 0 mean value for the gene. Lets look at the data of these 3 genes from unfiltered data.

genes3 <- NPCEBV_ordered[which(NPCEBV_ordered$gene_name %in% nc),]
genes3
##               gene_id gene_name
## 1699  ENSG00000086548   CEACAM6
## 9808  ENSG00000154764     WNT7A
## 23681 ENSG00000226979       LTA
##                                                           description
## 1699  CEA cell adhesion molecule 6 [Source:HGNC Symbol;Acc:HGNC:1818]
## 9808         Wnt family member 7A [Source:HGNC Symbol;Acc:HGNC:12786]
## 23681            lymphotoxin alpha [Source:HGNC Symbol;Acc:HGNC:6709]
##                        locus HNE_1_MUT_LMP1_1_count HNE_1_MUT_LMP1_2_count
## 1699  19:41750977-41772208:+                      3                      3
## 9808   3:13816258-13880121:-                      0                      0
## 23681  6:31572054-31574324:+                      0                      0
##       HNE_1_MUT_LMP1_3_count HNE_1_WT_LMP1_1_count HNE_1_WT_LMP1_2_count
## 1699                       2                     0                     0
## 9808                       0                     2                     1
## 23681                      0                     2                     1
##       HNE_1_WT_LMP1_3_count HNE_1_MUT_LMP1_1_FPKM HNE_1_MUT_LMP1_2_FPKM
## 1699                      0             0.0493092             0.0658406
## 9808                      2             0.0000000             0.0000000
## 23681                     3             0.0000000             0.0000000
##       HNE_1_MUT_LMP1_3_FPKM HNE_1_WT_LMP1_1_FPKM HNE_1_WT_LMP1_2_FPKM
## 1699             0.04194827           0.00000000           0.00000000
## 9808             0.00000000           0.01565541           0.01038472
## 23681            0.00000000           0.03609382           0.02394216
##       HNE_1_WT_LMP1_3_FPKM EBV_mean baseline_mean FoldchangeEBV_baseline
## 1699            0.00000000 2.666667      0.000000                    Inf
## 9808            0.02664939 0.000000      1.666667                      0
## 23681           0.09216093 0.000000      2.000000                      0

Looks like the 2 key genes of CEACAM6 and WNT7A didn’t make the filter as well as LTA, because CEACAM6 was highly up regulated from almost 0 (we can say almost 0 because in gene expression data after normalizing the data between 0 and 1 some values can be 1X1.0^-12 or 0.000000000001) and excluded or rounded to 0.00 automatically by the computer that rounds to the 8th or 12th figure). The WNT7A gene is found to be highly downregulated in EBV mutated NCP as the study claimed and it is responsible for stem cell maintenance (WBCs, RBCs, and lymphocytes are stem cells and natural killer cells are WBCs of monocytes, eosinophiles, and neutrophiles) and renewal, while CEACAM6 the study cited a source as saying it is up regulated in many cancers and proliferation and metastasis of tumors we can see here is highly upregulated. And LT7 we can go to the included summary of the gene in the gene data they provided that it is highly downregulated and is a lymphotoxin alpha which sounds like that is a good thing for overall immunity but it would need to be looked up by its genecards or ensembl ID definition to determine if it is a great immunity defense or adds to weakened immune defenses. Lets look at all the genes next to their fold change with only the gene name, the description, and the fold change value.

studyGenesDF <- NPCEBV_ordered[which(NPCEBV_ordered$gene_name %in% studyGenes), c(2,3,19)]
studyGenesDF
##       gene_name
## 1699    CEACAM6
## 669       FOXP3
## 5431      RAB38
## 12112   ANKRD11
## 13644      CD34
## 26608  HLA-DPA1
## 12833      CD14
## 15297     CSF1R
## 19593       LBH
## 2923        IL7
## 4533      PRRX1
## 9808      WNT7A
## 23681       LTA
##                                                                                     description
## 1699                            CEA cell adhesion molecule 6 [Source:HGNC Symbol;Acc:HGNC:1818]
## 669                                          forkhead box P3 [Source:HGNC Symbol;Acc:HGNC:6106]
## 5431                       RAB38, member RAS oncogene family [Source:HGNC Symbol;Acc:HGNC:9776]
## 12112                              ankyrin repeat domain 11 [Source:HGNC Symbol;Acc:HGNC:21316]
## 13644                                          CD34 molecule [Source:HGNC Symbol;Acc:HGNC:1662]
## 26608 major histocompatibility complex, class II, DP alpha 1 [Source:HGNC Symbol;Acc:HGNC:4938]
## 12833                                          CD14 molecule [Source:HGNC Symbol;Acc:HGNC:1628]
## 15297                   colony stimulating factor 1 receptor [Source:HGNC Symbol;Acc:HGNC:2433]
## 19593                LBH regulator of WNT signaling pathway [Source:HGNC Symbol;Acc:HGNC:29532]
## 2923                                           interleukin 7 [Source:HGNC Symbol;Acc:HGNC:6023]
## 4533                               paired related homeobox 1 [Source:HGNC Symbol;Acc:HGNC:9142]
## 9808                                   Wnt family member 7A [Source:HGNC Symbol;Acc:HGNC:12786]
## 23681                                      lymphotoxin alpha [Source:HGNC Symbol;Acc:HGNC:6709]
##       FoldchangeEBV_baseline
## 1699                     Inf
## 669                2.8333333
## 5431               2.4000000
## 12112              1.8379127
## 13644              0.5000000
## 26608              0.3750000
## 12833              0.3333333
## 15297              0.3103448
## 19593              0.2947368
## 2923               0.2000000
## 4533               0.1250000
## 9808               0.0000000
## 23681              0.0000000

The top 4 genes are up regulated while the last 9 genes are downregulated in comparing the EBV mutated H101R of LMP1 gene in NPC to NPC without the mutation or the wild type. The top 4 upregulated genes includes CEACAM6, FOXP3, and RAB38 as well as ANKRD11. CEACAM6 is seen in tumor creation, proliferation, and metastasis, while FOXP3 is seen in T cell anergy or lazy non proliferative T cells like CD4 that gets Natural Killer Cells (NKC) to fight infection and destroy contaminants, and that RAB38 that is seen highly upregulated in the phenotypic metastatic pancreatic cancer is upregulated but only at a value of 2.4 which could be almost 2 1/2 times the gene expression it should be at in controlled NPC.

Lets write both of these data sets to the csv format and upload and share to view or use at a later time.

write.csv(NPCEBV_ordered, "NPCEBV_ordered.csv", row.names=F)
write.csv(studyGenesDF, "studyGenesDF.csv",row.names=F)
write.csv(NPCEBV2, "NPCEBV_filtered.csv", row.names=F)

You can get the ordered and not filtered data set of 50,868 genes here, the data set of only the 13 genes in the study from unfiltered data set here, and the filtered data of genes that is 21,932 genes long here.

Thanks so much, beyond this we could test to see how well these study genes perform in predicting the EBVaNCP or the NCP class samples.

================================================

*** Part 1

In this project, we seek to understand target genes in the study GSE271486 that is a recently published study analyzing nasopharyngeal carcinoma (NCP) and the effects of Epstein-Barr Viral (EBV) infection to changes in the latent membrane protein 1 or LMP1. The study used machine learning to predict a model developed from 26 EBV infected patients that showed there were changes in oncogenes that could help indicate NPC developing in those infected with EBV. Their gene targets found through the gene ontology fingerprint of EBV and NPC were associated with changes in H101R and its affects on immune regulation processes to the FOXP3-T cell anergy and WNT7A-stem cell population maintenance.

This study, GSE271486, essentially found the most differentially expressed genes in an infected EBV population of 26, then used those gene targets and gene groups from the Gene Ontology fingerprint of NPC and EBV infection to determine the top feature in identifying EBV infection was H101R mutation or genes of most variability as in cell differentiation in cancer indicator using a commercial cell line of NCP called HNE-1 that they infected with the mutant H101R in vitro or in a petri dish and watched changes. They compared the non-infected NPC to the EBV infected NCP petri dish outcomes after 7 days, and confirmed their findings that mutations in H101R to the LMBP-1 region gave a higher risk of NCP. However, the cell line alread had NCP, so we can look at the data they supplied for us already in a csv format that has 3 samples of infected EBV called HNE-1MUT-LMP1 and 3 not infected assumed as it is the other group HNE-1WT_LMP1. There are 3 counts and 3 fragments per kilomillion features (fpkm) for each group of infected EBV or not infected. I expect to see the group of most differentially expressed genes being those related to EBV and NCP fingerprint gene groups from the Gene Ontology Fingerprint.

Sounds like a super fascinating study, lets see what we find and if we can use these top genes in our database of pathologies that we can use to build a machine model to predict EBV associated pathologies, EBV infection, fibromyalgia, Lyme disease, multiple sclerosis, mononucleosis, Hodgkin’s lymphoma, Burkett’s lymphoma, Large B-cell lymphoma, or Natural Killer T-cell Lymphoma.

Lets read in the csv file that was provided.

ncpEBV <- read.csv("GSE271486_HNE_1_MUT_LMP1_vs_HNE_1_WT_LMP1.csv", header=T, sep=',')

colnames(ncpEBV)
##  [1] "gene_id"                "gene_name"              "description"           
##  [4] "locus"                  "HNE_1_MUT_LMP1_1_count" "HNE_1_MUT_LMP1_2_count"
##  [7] "HNE_1_MUT_LMP1_3_count" "HNE_1_WT_LMP1_1_count"  "HNE_1_WT_LMP1_2_count" 
## [10] "HNE_1_WT_LMP1_3_count"  "HNE_1_MUT_LMP1_1_FPKM"  "HNE_1_MUT_LMP1_2_FPKM" 
## [13] "HNE_1_MUT_LMP1_3_FPKM"  "HNE_1_WT_LMP1_1_FPKM"   "HNE_1_WT_LMP1_2_FPKM"  
## [16] "HNE_1_WT_LMP1_3_FPKM"
head(ncpEBV)
##           gene_id gene_name
## 1 ENSG00000000003    TSPAN6
## 2 ENSG00000000005      TNMD
## 3 ENSG00000000419      DPM1
## 4 ENSG00000000457     SCYL3
## 5 ENSG00000000460  C1orf112
## 6 ENSG00000000938       FGR
##                                                                                      description
## 1                                              tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]
## 2                                                tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757]
## 3 dolichyl-phosphate mannosyltransferase subunit 1, catalytic [Source:HGNC Symbol;Acc:HGNC:3005]
## 4                                   SCY1 like pseudokinase 3 [Source:HGNC Symbol;Acc:HGNC:19285]
## 5                        chromosome 1 open reading frame 112 [Source:HGNC Symbol;Acc:HGNC:25565]
## 6              FGR proto-oncogene, Src family tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:3697]
##                     locus HNE_1_MUT_LMP1_1_count HNE_1_MUT_LMP1_2_count
## 1 X:100627109-100639991:-                   1057                    700
## 2 X:100584802-100599885:+                      0                      0
## 3  20:50934867-50958555:-                    937                    678
## 4 1:169849631-169894267:-                    200                    136
## 5 1:169662007-169854080:+                    537                    474
## 6   1:27612064-27635277:-                      0                      0
##   HNE_1_MUT_LMP1_3_count HNE_1_WT_LMP1_1_count HNE_1_WT_LMP1_2_count
## 1                    855                  1643                   826
## 2                      0                     0                     0
## 3                    932                  1114                   930
## 4                    158                   273                   120
## 5                    471                   839                   617
## 6                      0                     0                     0
##   HNE_1_WT_LMP1_3_count HNE_1_MUT_LMP1_1_FPKM HNE_1_MUT_LMP1_2_FPKM
## 1                   875              9.355201              8.273695
## 2                     0              0.000000              0.000000
## 3                   845             19.973180             19.300172
## 4                   177              1.247310              1.132686
## 5                   594              3.772203              4.446549
## 6                     0              0.000000              0.000000
##   HNE_1_MUT_LMP1_3_FPKM HNE_1_WT_LMP1_1_FPKM HNE_1_WT_LMP1_2_FPKM
## 1              9.658906            10.528930            7.0215334
## 2              0.000000             0.000000            0.0000000
## 3             25.357562            17.193387           19.0398968
## 4              1.257724             1.232754            0.7187855
## 5              4.223044             4.267284            4.1627526
## 6              0.000000             0.000000            0.0000000
##   HNE_1_WT_LMP1_3_FPKM
## 1             9.546551
## 2             0.000000
## 3            22.203657
## 4             1.360751
## 5             5.143604
## 6             0.000000
str(ncpEBV)
## 'data.frame':    50868 obs. of  16 variables:
##  $ gene_id               : chr  "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" ...
##  $ gene_name             : chr  "TSPAN6" "TNMD" "DPM1" "SCYL3" ...
##  $ description           : chr  "tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]" "tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757]" "dolichyl-phosphate mannosyltransferase subunit 1, catalytic [Source:HGNC Symbol;Acc:HGNC:3005]" "SCY1 like pseudokinase 3 [Source:HGNC Symbol;Acc:HGNC:19285]" ...
##  $ locus                 : chr  "X:100627109-100639991:-" "X:100584802-100599885:+" "20:50934867-50958555:-" "1:169849631-169894267:-" ...
##  $ HNE_1_MUT_LMP1_1_count: int  1057 0 937 200 537 0 277 2475 1635 796 ...
##  $ HNE_1_MUT_LMP1_2_count: int  700 0 678 136 474 0 256 1634 1753 668 ...
##  $ HNE_1_MUT_LMP1_3_count: int  855 0 932 158 471 0 316 1864 1833 804 ...
##  $ HNE_1_WT_LMP1_1_count : int  1643 0 1114 273 839 0 386 3631 2044 1101 ...
##  $ HNE_1_WT_LMP1_2_count : int  826 0 930 120 617 0 299 1907 1732 878 ...
##  $ HNE_1_WT_LMP1_3_count : int  875 0 845 177 594 0 280 2069 2159 774 ...
##  $ HNE_1_MUT_LMP1_1_FPKM : num  9.36 0 19.97 1.25 3.77 ...
##  $ HNE_1_MUT_LMP1_2_FPKM : num  8.27 0 19.3 1.13 4.45 ...
##  $ HNE_1_MUT_LMP1_3_FPKM : num  9.66 0 25.36 1.26 4.22 ...
##  $ HNE_1_WT_LMP1_1_FPKM  : num  10.53 0 17.19 1.23 4.27 ...
##  $ HNE_1_WT_LMP1_2_FPKM  : num  7.022 0 19.04 0.719 4.163 ...
##  $ HNE_1_WT_LMP1_3_FPKM  : num  9.55 0 22.2 1.36 5.14 ...

Lets do a quick analysis with fold change values and see the most differentially expressed genes using only the counts (no filtering with counts and FPKM or mitochondrial DNA percent as was done in data made for Seurat) from non infected EBV to infected EBV on the NCP commercial cell line.

ncpEBV$EBV_mean <- rowMeans(ncpEBV[,c(5:7)])
ncpEBV$baseline_mean <- rowMeans(ncpEBV[,c(8:10)])

ncpEBV$FoldchangeEBV_baseline <- ncpEBV$EBV_mean/ncpEBV$baseline_mean

ncpEBV_ordered <- ncpEBV[order(ncpEBV$FoldchangeEBV_baseline, decreasing=T),]

str(ncpEBV_ordered)
## 'data.frame':    50868 obs. of  19 variables:
##  $ gene_id               : chr  "ENSG00000001561" "ENSG00000004846" "ENSG00000005108" "ENSG00000006210" ...
##  $ gene_name             : chr  "ENPP4" "ABCB5" "THSD7A" "CX3CL1" ...
##  $ description           : chr  "ectonucleotide pyrophosphatase/phosphodiesterase 4 [Source:HGNC Symbol;Acc:HGNC:3359]" "ATP binding cassette subfamily B member 5 [Source:HGNC Symbol;Acc:HGNC:46]" "thrombospondin type 1 domain containing 7A [Source:HGNC Symbol;Acc:HGNC:22207]" "C-X3-C motif chemokine ligand 1 [Source:HGNC Symbol;Acc:HGNC:10647]" ...
##  $ locus                 : chr  "6:46129993-46146699:+" "7:20615207-20777038:+" "7:11370357-11832198:-" "16:57372458-57385048:+" ...
##  $ HNE_1_MUT_LMP1_1_count: int  0 1 0 1 1 1 0 0 2 1 ...
##  $ HNE_1_MUT_LMP1_2_count: int  1 0 0 0 0 0 0 2 0 0 ...
##  $ HNE_1_MUT_LMP1_3_count: int  1 0 1 0 0 0 1 1 0 0 ...
##  $ HNE_1_WT_LMP1_1_count : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ HNE_1_WT_LMP1_2_count : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ HNE_1_WT_LMP1_3_count : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ HNE_1_MUT_LMP1_1_FPKM : num  0 0.00529 0 0.00414 0.0065 ...
##  $ HNE_1_MUT_LMP1_2_FPKM : num  0.0154 0 0 0 0 ...
##  $ HNE_1_MUT_LMP1_3_FPKM : num  0.01476 0 0.00375 0 0 ...
##  $ HNE_1_WT_LMP1_1_FPKM  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ HNE_1_WT_LMP1_2_FPKM  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ HNE_1_WT_LMP1_3_FPKM  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EBV_mean              : num  0.667 0.333 0.333 0.333 0.333 ...
##  $ baseline_mean         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FoldchangeEBV_baseline: num  Inf Inf Inf Inf Inf ...
ncpEBV2 <- subset(ncpEBV_ordered,ncpEBV_ordered$baseline_mean > 0 & ncpEBV_ordered$EBV_mean > 0)

str(ncpEBV2)
## 'data.frame':    21932 obs. of  19 variables:
##  $ gene_id               : chr  "ENSG00000262660" "ENSG00000267303" "ENSG00000198211" "ENSG00000240963" ...
##  $ gene_name             : chr  "AC139530.2" "AC011511.4" "AC092143.1" "AL645465.1" ...
##  $ description           : chr  "novel protein" "novel transcript" "novel protein (MC1R-TUBB3 readthrough)" "novel transcript, antisense to C1orf100" ...
##  $ locus                 : chr  "17:81703371-81720539:+" "19:10315471-10320678:-" "16:89919165-89936092:+" "1:244375100-244409592:-" ...
##  $ HNE_1_MUT_LMP1_1_count: int  520 617 211 93 1 15 0 0 5 0 ...
##  $ HNE_1_MUT_LMP1_2_count: int  0 0 0 3 63 203 36 54 4 13 ...
##  $ HNE_1_MUT_LMP1_3_count: int  0 0 62 85 49 12 32 0 7 1 ...
##  $ HNE_1_WT_LMP1_1_count : int  0 0 3 1 2 4 0 3 0 0 ...
##  $ HNE_1_WT_LMP1_2_count : int  1 0 0 1 1 2 3 0 0 0 ...
##  $ HNE_1_WT_LMP1_3_count : int  0 4 0 2 1 3 0 0 1 1 ...
##  $ HNE_1_MUT_LMP1_1_FPKM : num  14.5446 17.9294 4.0862 7.7876 0.0913 ...
##  $ HNE_1_MUT_LMP1_2_FPKM : num  0 0 0 0.335 7.679 ...
##  $ HNE_1_MUT_LMP1_3_FPKM : num  0 0 1.53 9.08 5.71 ...
##  $ HNE_1_WT_LMP1_1_FPKM  : num  0 0 0.0421 0.0606 0.1322 ...
##  $ HNE_1_WT_LMP1_2_FPKM  : num  0.0269 0 0 0.0804 0.0877 ...
##  $ HNE_1_WT_LMP1_3_FPKM  : num  0 0.143 0 0.206 0.113 ...
##  $ EBV_mean              : num  173.3 205.7 91 60.3 37.7 ...
##  $ baseline_mean         : num  0.333 1.333 1 1.333 1.333 ...
##  $ FoldchangeEBV_baseline: num  520 154.2 91 45.3 28.2 ...

We have a dataset that is smaller than original by excluding those entries where the mean of EBV infected and the mean of the baseline NCP samples are higher than 0.

range(ncpEBV2$FoldchangeEBV_baseline)
## [1]   0.01107011 520.00000000

The range is all positive but those less than 1 are underexpressed genes or inhibited in EBV infected NCP compared to baseline NCP samples. Lets see what are top 10 overexpressed genes are in stimulated the most or inhibited the most for top and bottom 10 genes. We should also look at a list of those exact genes that the study used as the ‘Gene Ontology Fingerprint’ for ‘EBVaNPC GOF’ perhaps with an internet search. The internet returned the Pubmed article by this GSE study.

topBottom10stimulatedInhibited <- ncpEBV2[c(1:10,21923:21932),]
topBottom10stimulatedInhibited[,c(2,19)]
##          gene_name FoldchangeEBV_baseline
## 42404   AC139530.2           520.00000000
## 43879   AC011511.4           154.25000000
## 17708   AC092143.1            91.00000000
## 31972   AL645465.1            45.25000000
## 43987   AL353997.3            28.25000000
## 46317   AC022384.1            25.55555556
## 95             MPO            22.66666667
## 39685 TRIM6-TRIM34            18.00000000
## 17107         FUT4            16.00000000
## 4383         ITGA4            14.00000000
## 23008    RPL23AP34             0.07692308
## 40275   AC027088.3             0.07692308
## 8269         MYOM3             0.07142857
## 70           CALCR             0.06666667
## 1480         MOXD1             0.06250000
## 11402    SMAD5-AS1             0.05882353
## 45524      HSPE1P7             0.04761905
## 40946   AC026464.2             0.03557312
## 12885       RASA4B             0.03225806
## 42231   AC026954.2             0.01107011

Lets make a matrix and split into testing and training sets comparing the samples, this will be small since only 3 baseline and 3 EBV infected NCP.

colnames(topBottom10stimulatedInhibited)
##  [1] "gene_id"                "gene_name"              "description"           
##  [4] "locus"                  "HNE_1_MUT_LMP1_1_count" "HNE_1_MUT_LMP1_2_count"
##  [7] "HNE_1_MUT_LMP1_3_count" "HNE_1_WT_LMP1_1_count"  "HNE_1_WT_LMP1_2_count" 
## [10] "HNE_1_WT_LMP1_3_count"  "HNE_1_MUT_LMP1_1_FPKM"  "HNE_1_MUT_LMP1_2_FPKM" 
## [13] "HNE_1_MUT_LMP1_3_FPKM"  "HNE_1_WT_LMP1_1_FPKM"   "HNE_1_WT_LMP1_2_FPKM"  
## [16] "HNE_1_WT_LMP1_3_FPKM"   "EBV_mean"               "baseline_mean"         
## [19] "FoldchangeEBV_baseline"
header <- topBottom10stimulatedInhibited$gene_name

class <- c('EBVncp','EBVncp','EBVncp','ncp','ncp','ncp')

df <- topBottom10stimulatedInhibited[,c(5:10)]

str(df)
## 'data.frame':    20 obs. of  6 variables:
##  $ HNE_1_MUT_LMP1_1_count: int  520 617 211 93 1 15 0 0 5 0 ...
##  $ HNE_1_MUT_LMP1_2_count: int  0 0 0 3 63 203 36 54 4 13 ...
##  $ HNE_1_MUT_LMP1_3_count: int  0 0 62 85 49 12 32 0 7 1 ...
##  $ HNE_1_WT_LMP1_1_count : int  0 0 3 1 2 4 0 3 0 0 ...
##  $ HNE_1_WT_LMP1_2_count : int  1 0 0 1 1 2 3 0 0 0 ...
##  $ HNE_1_WT_LMP1_3_count : int  0 4 0 2 1 3 0 0 1 1 ...
ncpEBV_matrix <- data.frame(t(df))
colnames(ncpEBV_matrix) <- header
ncpEBV_matrix$class <- class

str(ncpEBV_matrix)
## 'data.frame':    6 obs. of  21 variables:
##  $ AC139530.2  : int  520 0 0 0 1 0
##  $ AC011511.4  : int  617 0 0 0 0 4
##  $ AC092143.1  : int  211 0 62 3 0 0
##  $ AL645465.1  : int  93 3 85 1 1 2
##  $ AL353997.3  : int  1 63 49 2 1 1
##  $ AC022384.1  : int  15 203 12 4 2 3
##  $ MPO         : int  0 36 32 0 3 0
##  $ TRIM6-TRIM34: int  0 54 0 3 0 0
##  $ FUT4        : int  5 4 7 0 0 1
##  $ ITGA4       : int  0 13 1 0 0 1
##  $ RPL23AP34   : int  0 1 0 5 1 7
##  $ AC027088.3  : int  2 0 0 9 13 4
##  $ MYOM3       : int  1 0 0 7 7 0
##  $ CALCR       : int  0 0 1 6 1 8
##  $ MOXD1       : int  1 0 0 12 2 2
##  $ SMAD5-AS1   : int  0 1 0 14 2 1
##  $ HSPE1P7     : int  0 1 0 0 15 6
##  $ AC026464.2  : int  4 2 3 184 3 66
##  $ RASA4B      : int  2 0 0 59 3 0
##  $ AC026954.2  : int  1 1 1 269 1 1
##  $ class       : chr  "EBVncp" "EBVncp" "EBVncp" "ncp" ...
ncpEBV_matrix$class <- as.factor(ncpEBV_matrix$class)
str(ncpEBV_matrix)
## 'data.frame':    6 obs. of  21 variables:
##  $ AC139530.2  : int  520 0 0 0 1 0
##  $ AC011511.4  : int  617 0 0 0 0 4
##  $ AC092143.1  : int  211 0 62 3 0 0
##  $ AL645465.1  : int  93 3 85 1 1 2
##  $ AL353997.3  : int  1 63 49 2 1 1
##  $ AC022384.1  : int  15 203 12 4 2 3
##  $ MPO         : int  0 36 32 0 3 0
##  $ TRIM6-TRIM34: int  0 54 0 3 0 0
##  $ FUT4        : int  5 4 7 0 0 1
##  $ ITGA4       : int  0 13 1 0 0 1
##  $ RPL23AP34   : int  0 1 0 5 1 7
##  $ AC027088.3  : int  2 0 0 9 13 4
##  $ MYOM3       : int  1 0 0 7 7 0
##  $ CALCR       : int  0 0 1 6 1 8
##  $ MOXD1       : int  1 0 0 12 2 2
##  $ SMAD5-AS1   : int  0 1 0 14 2 1
##  $ HSPE1P7     : int  0 1 0 0 15 6
##  $ AC026464.2  : int  4 2 3 184 3 66
##  $ RASA4B      : int  2 0 0 59 3 0
##  $ AC026954.2  : int  1 1 1 269 1 1
##  $ class       : Factor w/ 2 levels "EBVncp","ncp": 1 1 1 2 2 2

Split into a training and testing set.

set.seed(123)
training <- ncpEBV_matrix[c(1,2,5,6),] #two each
testing <- ncpEBV_matrix[c(3,4),] #one each 
table(training$class)
## 
## EBVncp    ncp 
##      2      2
table(testing$class)
## 
## EBVncp    ncp 
##      1      1

Lets use a randomForest model to see how well it trains and predicts on unseen data.

library(randomForest)
str(training
    )
## 'data.frame':    4 obs. of  21 variables:
##  $ AC139530.2  : int  520 0 1 0
##  $ AC011511.4  : int  617 0 0 4
##  $ AC092143.1  : int  211 0 0 0
##  $ AL645465.1  : int  93 3 1 2
##  $ AL353997.3  : int  1 63 1 1
##  $ AC022384.1  : int  15 203 2 3
##  $ MPO         : int  0 36 3 0
##  $ TRIM6-TRIM34: int  0 54 0 0
##  $ FUT4        : int  5 4 0 1
##  $ ITGA4       : int  0 13 0 1
##  $ RPL23AP34   : int  0 1 1 7
##  $ AC027088.3  : int  2 0 13 4
##  $ MYOM3       : int  1 0 7 0
##  $ CALCR       : int  0 0 1 8
##  $ MOXD1       : int  1 0 2 2
##  $ SMAD5-AS1   : int  0 1 2 1
##  $ HSPE1P7     : int  0 1 15 6
##  $ AC026464.2  : int  4 2 3 66
##  $ RASA4B      : int  2 0 3 0
##  $ AC026954.2  : int  1 1 1 1
##  $ class       : Factor w/ 2 levels "EBVncp","ncp": 1 1 2 2
rf <- randomForest(training[,c(1:20)], training[,21],mtry=6,
                         importance=TRUE, confusion=T)
rf$confusion
##        EBVncp ncp class.error
## EBVncp      2   0           0
## ncp         0   2           0

The model scored 100% on training with all classes of EBV infected NCP identified and all classes of baseline NCP only identified.

Resulting output: EBVncp ncp class.error EBVncp 2 0 0 ncp 0 2 0

Now lets see how well it predicts on the hold out set or our testing set of 1 sample each.

prediction <- predict(rf,testing)
result <- data.frame(predicted=prediction, actual=testing$class)
result
##                        predicted actual
## HNE_1_MUT_LMP1_3_count    EBVncp EBVncp
## HNE_1_WT_LMP1_1_count        ncp    ncp

Looks like a very strong set of genes we selected in using a randomForest model made for classification on our data to predict with 100% accuracy the correct class of a two class model.

Resulting output:

predicted actual HNE_1_MUT_LMP1_3_count EBVncp EBVncp
HNE_1_WT_LMP1_1_count ncp ncp
2 rows

Thanks so much, looks like we can go ahead and add these genes to our database of pathologies. But it would be interesting to look up the actual genes used in the GOF the study used to confirm that the mutation in the H101R gene affected the immune response based on the EBVaNCP GOF.

Keep checking back.