In this project, we seek to understand target genes in the study GSE271486 that is a recently published study analyzing nasopharyngeal carcinoma (NCP) and the effects of Epstein-Barr Viral (EBV) infection to changes in the latent membrane protein 1 or LMP1. The study used machine learning to predict a model developed from 26 EBV infected patients that showed there were changes in oncogenes that could help indicate NPC developing in those infected with EBV. Their gene targets found through the gene ontology fingerprint of EBV and NPC were associated with changes in H101R and its affects on immune regulation processes to the FOXP3-T cell anergy and WNT7A-stem cell population maintenance.

This study, GSE271486, essentially found the most differentially expressed genes in an infected EBV population of 26, then used those gene targets and gene groups from the Gene Ontology fingerprint of NPC and EBV infection to determine the top feature in identifying EBV infection was H101R mutation or genes of most variability as in cell differentiation in cancer indicator using a commercial cell line of NCP called HNE-1 that they infected with the mutant H101R in vitro or in a petri dish and watched changes. They compared the non-infected NPC to the EBV infected NCP petri dish outcomes after 7 days, and confirmed their findings that mutations in H101R to the LMBP-1 region gave a higher risk of NCP. However, the cell line alread had NCP, so we can look at the data they supplied for us already in a csv format that has 3 samples of infected EBV called HNE-1MUT-LMP1 and 3 not infected assumed as it is the other group HNE-1WT_LMP1. There are 3 counts and 3 fragments per kilomillion features (fpkm) for each group of infected EBV or not infected. I expect to see the group of most differentially expressed genes being those related to EBV and NCP fingerprint gene groups from the Gene Ontology Fingerprint.

Sounds like a super fascinating study, lets see what we find and if we can use these top genes in our database of pathologies that we can use to build a machine model to predict EBV associated pathologies, EBV infection, fibromyalgia, Lyme disease, multiple sclerosis, mononucleosis, Hodgkin’s lymphoma, Burkett’s lymphoma, Large B-cell lymphoma, or Natural Killer T-cell Lymphoma.

Lets read in the csv file that was provided.

ncpEBV <- read.csv("GSE271486_HNE_1_MUT_LMP1_vs_HNE_1_WT_LMP1.csv", header=T, sep=',')

colnames(ncpEBV)
##  [1] "gene_id"                "gene_name"              "description"           
##  [4] "locus"                  "HNE_1_MUT_LMP1_1_count" "HNE_1_MUT_LMP1_2_count"
##  [7] "HNE_1_MUT_LMP1_3_count" "HNE_1_WT_LMP1_1_count"  "HNE_1_WT_LMP1_2_count" 
## [10] "HNE_1_WT_LMP1_3_count"  "HNE_1_MUT_LMP1_1_FPKM"  "HNE_1_MUT_LMP1_2_FPKM" 
## [13] "HNE_1_MUT_LMP1_3_FPKM"  "HNE_1_WT_LMP1_1_FPKM"   "HNE_1_WT_LMP1_2_FPKM"  
## [16] "HNE_1_WT_LMP1_3_FPKM"
head(ncpEBV)
##           gene_id gene_name
## 1 ENSG00000000003    TSPAN6
## 2 ENSG00000000005      TNMD
## 3 ENSG00000000419      DPM1
## 4 ENSG00000000457     SCYL3
## 5 ENSG00000000460  C1orf112
## 6 ENSG00000000938       FGR
##                                                                                      description
## 1                                              tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]
## 2                                                tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757]
## 3 dolichyl-phosphate mannosyltransferase subunit 1, catalytic [Source:HGNC Symbol;Acc:HGNC:3005]
## 4                                   SCY1 like pseudokinase 3 [Source:HGNC Symbol;Acc:HGNC:19285]
## 5                        chromosome 1 open reading frame 112 [Source:HGNC Symbol;Acc:HGNC:25565]
## 6              FGR proto-oncogene, Src family tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:3697]
##                     locus HNE_1_MUT_LMP1_1_count HNE_1_MUT_LMP1_2_count
## 1 X:100627109-100639991:-                   1057                    700
## 2 X:100584802-100599885:+                      0                      0
## 3  20:50934867-50958555:-                    937                    678
## 4 1:169849631-169894267:-                    200                    136
## 5 1:169662007-169854080:+                    537                    474
## 6   1:27612064-27635277:-                      0                      0
##   HNE_1_MUT_LMP1_3_count HNE_1_WT_LMP1_1_count HNE_1_WT_LMP1_2_count
## 1                    855                  1643                   826
## 2                      0                     0                     0
## 3                    932                  1114                   930
## 4                    158                   273                   120
## 5                    471                   839                   617
## 6                      0                     0                     0
##   HNE_1_WT_LMP1_3_count HNE_1_MUT_LMP1_1_FPKM HNE_1_MUT_LMP1_2_FPKM
## 1                   875              9.355201              8.273695
## 2                     0              0.000000              0.000000
## 3                   845             19.973180             19.300172
## 4                   177              1.247310              1.132686
## 5                   594              3.772203              4.446549
## 6                     0              0.000000              0.000000
##   HNE_1_MUT_LMP1_3_FPKM HNE_1_WT_LMP1_1_FPKM HNE_1_WT_LMP1_2_FPKM
## 1              9.658906            10.528930            7.0215334
## 2              0.000000             0.000000            0.0000000
## 3             25.357562            17.193387           19.0398968
## 4              1.257724             1.232754            0.7187855
## 5              4.223044             4.267284            4.1627526
## 6              0.000000             0.000000            0.0000000
##   HNE_1_WT_LMP1_3_FPKM
## 1             9.546551
## 2             0.000000
## 3            22.203657
## 4             1.360751
## 5             5.143604
## 6             0.000000
str(ncpEBV)
## 'data.frame':    50868 obs. of  16 variables:
##  $ gene_id               : chr  "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" ...
##  $ gene_name             : chr  "TSPAN6" "TNMD" "DPM1" "SCYL3" ...
##  $ description           : chr  "tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]" "tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757]" "dolichyl-phosphate mannosyltransferase subunit 1, catalytic [Source:HGNC Symbol;Acc:HGNC:3005]" "SCY1 like pseudokinase 3 [Source:HGNC Symbol;Acc:HGNC:19285]" ...
##  $ locus                 : chr  "X:100627109-100639991:-" "X:100584802-100599885:+" "20:50934867-50958555:-" "1:169849631-169894267:-" ...
##  $ HNE_1_MUT_LMP1_1_count: int  1057 0 937 200 537 0 277 2475 1635 796 ...
##  $ HNE_1_MUT_LMP1_2_count: int  700 0 678 136 474 0 256 1634 1753 668 ...
##  $ HNE_1_MUT_LMP1_3_count: int  855 0 932 158 471 0 316 1864 1833 804 ...
##  $ HNE_1_WT_LMP1_1_count : int  1643 0 1114 273 839 0 386 3631 2044 1101 ...
##  $ HNE_1_WT_LMP1_2_count : int  826 0 930 120 617 0 299 1907 1732 878 ...
##  $ HNE_1_WT_LMP1_3_count : int  875 0 845 177 594 0 280 2069 2159 774 ...
##  $ HNE_1_MUT_LMP1_1_FPKM : num  9.36 0 19.97 1.25 3.77 ...
##  $ HNE_1_MUT_LMP1_2_FPKM : num  8.27 0 19.3 1.13 4.45 ...
##  $ HNE_1_MUT_LMP1_3_FPKM : num  9.66 0 25.36 1.26 4.22 ...
##  $ HNE_1_WT_LMP1_1_FPKM  : num  10.53 0 17.19 1.23 4.27 ...
##  $ HNE_1_WT_LMP1_2_FPKM  : num  7.022 0 19.04 0.719 4.163 ...
##  $ HNE_1_WT_LMP1_3_FPKM  : num  9.55 0 22.2 1.36 5.14 ...

Lets do a quick analysis with fold change values and see the most differentially expressed genes using only the counts (no filtering with counts and FPKM or mitochondrial DNA percent as was done in data made for Seurat) from non infected EBV to infected EBV on the NCP commercial cell line.

ncpEBV$EBV_mean <- rowMeans(ncpEBV[,c(5:7)])
ncpEBV$baseline_mean <- rowMeans(ncpEBV[,c(8:10)])

ncpEBV$FoldchangeEBV_baseline <- ncpEBV$EBV_mean/ncpEBV$baseline_mean

ncpEBV_ordered <- ncpEBV[order(ncpEBV$FoldchangeEBV_baseline, decreasing=T),]

str(ncpEBV_ordered)
## 'data.frame':    50868 obs. of  19 variables:
##  $ gene_id               : chr  "ENSG00000001561" "ENSG00000004846" "ENSG00000005108" "ENSG00000006210" ...
##  $ gene_name             : chr  "ENPP4" "ABCB5" "THSD7A" "CX3CL1" ...
##  $ description           : chr  "ectonucleotide pyrophosphatase/phosphodiesterase 4 [Source:HGNC Symbol;Acc:HGNC:3359]" "ATP binding cassette subfamily B member 5 [Source:HGNC Symbol;Acc:HGNC:46]" "thrombospondin type 1 domain containing 7A [Source:HGNC Symbol;Acc:HGNC:22207]" "C-X3-C motif chemokine ligand 1 [Source:HGNC Symbol;Acc:HGNC:10647]" ...
##  $ locus                 : chr  "6:46129993-46146699:+" "7:20615207-20777038:+" "7:11370357-11832198:-" "16:57372458-57385048:+" ...
##  $ HNE_1_MUT_LMP1_1_count: int  0 1 0 1 1 1 0 0 2 1 ...
##  $ HNE_1_MUT_LMP1_2_count: int  1 0 0 0 0 0 0 2 0 0 ...
##  $ HNE_1_MUT_LMP1_3_count: int  1 0 1 0 0 0 1 1 0 0 ...
##  $ HNE_1_WT_LMP1_1_count : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ HNE_1_WT_LMP1_2_count : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ HNE_1_WT_LMP1_3_count : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ HNE_1_MUT_LMP1_1_FPKM : num  0 0.00529 0 0.00414 0.0065 ...
##  $ HNE_1_MUT_LMP1_2_FPKM : num  0.0154 0 0 0 0 ...
##  $ HNE_1_MUT_LMP1_3_FPKM : num  0.01476 0 0.00375 0 0 ...
##  $ HNE_1_WT_LMP1_1_FPKM  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ HNE_1_WT_LMP1_2_FPKM  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ HNE_1_WT_LMP1_3_FPKM  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EBV_mean              : num  0.667 0.333 0.333 0.333 0.333 ...
##  $ baseline_mean         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FoldchangeEBV_baseline: num  Inf Inf Inf Inf Inf ...
ncpEBV2 <- subset(ncpEBV_ordered,ncpEBV_ordered$baseline_mean > 0 & ncpEBV_ordered$EBV_mean > 0)

str(ncpEBV2)
## 'data.frame':    21932 obs. of  19 variables:
##  $ gene_id               : chr  "ENSG00000262660" "ENSG00000267303" "ENSG00000198211" "ENSG00000240963" ...
##  $ gene_name             : chr  "AC139530.2" "AC011511.4" "AC092143.1" "AL645465.1" ...
##  $ description           : chr  "novel protein" "novel transcript" "novel protein (MC1R-TUBB3 readthrough)" "novel transcript, antisense to C1orf100" ...
##  $ locus                 : chr  "17:81703371-81720539:+" "19:10315471-10320678:-" "16:89919165-89936092:+" "1:244375100-244409592:-" ...
##  $ HNE_1_MUT_LMP1_1_count: int  520 617 211 93 1 15 0 0 5 0 ...
##  $ HNE_1_MUT_LMP1_2_count: int  0 0 0 3 63 203 36 54 4 13 ...
##  $ HNE_1_MUT_LMP1_3_count: int  0 0 62 85 49 12 32 0 7 1 ...
##  $ HNE_1_WT_LMP1_1_count : int  0 0 3 1 2 4 0 3 0 0 ...
##  $ HNE_1_WT_LMP1_2_count : int  1 0 0 1 1 2 3 0 0 0 ...
##  $ HNE_1_WT_LMP1_3_count : int  0 4 0 2 1 3 0 0 1 1 ...
##  $ HNE_1_MUT_LMP1_1_FPKM : num  14.5446 17.9294 4.0862 7.7876 0.0913 ...
##  $ HNE_1_MUT_LMP1_2_FPKM : num  0 0 0 0.335 7.679 ...
##  $ HNE_1_MUT_LMP1_3_FPKM : num  0 0 1.53 9.08 5.71 ...
##  $ HNE_1_WT_LMP1_1_FPKM  : num  0 0 0.0421 0.0606 0.1322 ...
##  $ HNE_1_WT_LMP1_2_FPKM  : num  0.0269 0 0 0.0804 0.0877 ...
##  $ HNE_1_WT_LMP1_3_FPKM  : num  0 0.143 0 0.206 0.113 ...
##  $ EBV_mean              : num  173.3 205.7 91 60.3 37.7 ...
##  $ baseline_mean         : num  0.333 1.333 1 1.333 1.333 ...
##  $ FoldchangeEBV_baseline: num  520 154.2 91 45.3 28.2 ...

We have a dataset that is smaller than original by excluding those entries where the mean of EBV infected and the mean of the baseline NCP samples are higher than 0.

range(ncpEBV2$FoldchangeEBV_baseline)
## [1]   0.01107011 520.00000000

The range is all positive but those less than 1 are underexpressed genes or inhibited in EBV infected NCP compared to baseline NCP samples. Lets see what are top 10 overexpressed genes are in stimulated the most or inhibited the most for top and bottom 10 genes. We should also look at a list of those exact genes that the study used as the ‘Gene Ontology Fingerprint’ for ‘EBVaNPC GOF’ perhaps with an internet search. The internet returned the Pubmed article by this GSE study.

topBottom10stimulatedInhibited <- ncpEBV2[c(1:10,21923:21932),]
topBottom10stimulatedInhibited[,c(2,19)]
##          gene_name FoldchangeEBV_baseline
## 42404   AC139530.2           520.00000000
## 43879   AC011511.4           154.25000000
## 17708   AC092143.1            91.00000000
## 31972   AL645465.1            45.25000000
## 43987   AL353997.3            28.25000000
## 46317   AC022384.1            25.55555556
## 95             MPO            22.66666667
## 39685 TRIM6-TRIM34            18.00000000
## 17107         FUT4            16.00000000
## 4383         ITGA4            14.00000000
## 23008    RPL23AP34             0.07692308
## 40275   AC027088.3             0.07692308
## 8269         MYOM3             0.07142857
## 70           CALCR             0.06666667
## 1480         MOXD1             0.06250000
## 11402    SMAD5-AS1             0.05882353
## 45524      HSPE1P7             0.04761905
## 40946   AC026464.2             0.03557312
## 12885       RASA4B             0.03225806
## 42231   AC026954.2             0.01107011

Lets make a matrix and split into testing and training sets comparing the samples, this will be small since only 3 baseline and 3 EBV infected NCP.

colnames(topBottom10stimulatedInhibited)
##  [1] "gene_id"                "gene_name"              "description"           
##  [4] "locus"                  "HNE_1_MUT_LMP1_1_count" "HNE_1_MUT_LMP1_2_count"
##  [7] "HNE_1_MUT_LMP1_3_count" "HNE_1_WT_LMP1_1_count"  "HNE_1_WT_LMP1_2_count" 
## [10] "HNE_1_WT_LMP1_3_count"  "HNE_1_MUT_LMP1_1_FPKM"  "HNE_1_MUT_LMP1_2_FPKM" 
## [13] "HNE_1_MUT_LMP1_3_FPKM"  "HNE_1_WT_LMP1_1_FPKM"   "HNE_1_WT_LMP1_2_FPKM"  
## [16] "HNE_1_WT_LMP1_3_FPKM"   "EBV_mean"               "baseline_mean"         
## [19] "FoldchangeEBV_baseline"
header <- topBottom10stimulatedInhibited$gene_name

class <- c('EBVncp','EBVncp','EBVncp','ncp','ncp','ncp')

df <- topBottom10stimulatedInhibited[,c(5:10)]

str(df)
## 'data.frame':    20 obs. of  6 variables:
##  $ HNE_1_MUT_LMP1_1_count: int  520 617 211 93 1 15 0 0 5 0 ...
##  $ HNE_1_MUT_LMP1_2_count: int  0 0 0 3 63 203 36 54 4 13 ...
##  $ HNE_1_MUT_LMP1_3_count: int  0 0 62 85 49 12 32 0 7 1 ...
##  $ HNE_1_WT_LMP1_1_count : int  0 0 3 1 2 4 0 3 0 0 ...
##  $ HNE_1_WT_LMP1_2_count : int  1 0 0 1 1 2 3 0 0 0 ...
##  $ HNE_1_WT_LMP1_3_count : int  0 4 0 2 1 3 0 0 1 1 ...
ncpEBV_matrix <- data.frame(t(df))
colnames(ncpEBV_matrix) <- header
ncpEBV_matrix$class <- class

str(ncpEBV_matrix)
## 'data.frame':    6 obs. of  21 variables:
##  $ AC139530.2  : int  520 0 0 0 1 0
##  $ AC011511.4  : int  617 0 0 0 0 4
##  $ AC092143.1  : int  211 0 62 3 0 0
##  $ AL645465.1  : int  93 3 85 1 1 2
##  $ AL353997.3  : int  1 63 49 2 1 1
##  $ AC022384.1  : int  15 203 12 4 2 3
##  $ MPO         : int  0 36 32 0 3 0
##  $ TRIM6-TRIM34: int  0 54 0 3 0 0
##  $ FUT4        : int  5 4 7 0 0 1
##  $ ITGA4       : int  0 13 1 0 0 1
##  $ RPL23AP34   : int  0 1 0 5 1 7
##  $ AC027088.3  : int  2 0 0 9 13 4
##  $ MYOM3       : int  1 0 0 7 7 0
##  $ CALCR       : int  0 0 1 6 1 8
##  $ MOXD1       : int  1 0 0 12 2 2
##  $ SMAD5-AS1   : int  0 1 0 14 2 1
##  $ HSPE1P7     : int  0 1 0 0 15 6
##  $ AC026464.2  : int  4 2 3 184 3 66
##  $ RASA4B      : int  2 0 0 59 3 0
##  $ AC026954.2  : int  1 1 1 269 1 1
##  $ class       : chr  "EBVncp" "EBVncp" "EBVncp" "ncp" ...
ncpEBV_matrix$class <- as.factor(ncpEBV_matrix$class)
str(ncpEBV_matrix)
## 'data.frame':    6 obs. of  21 variables:
##  $ AC139530.2  : int  520 0 0 0 1 0
##  $ AC011511.4  : int  617 0 0 0 0 4
##  $ AC092143.1  : int  211 0 62 3 0 0
##  $ AL645465.1  : int  93 3 85 1 1 2
##  $ AL353997.3  : int  1 63 49 2 1 1
##  $ AC022384.1  : int  15 203 12 4 2 3
##  $ MPO         : int  0 36 32 0 3 0
##  $ TRIM6-TRIM34: int  0 54 0 3 0 0
##  $ FUT4        : int  5 4 7 0 0 1
##  $ ITGA4       : int  0 13 1 0 0 1
##  $ RPL23AP34   : int  0 1 0 5 1 7
##  $ AC027088.3  : int  2 0 0 9 13 4
##  $ MYOM3       : int  1 0 0 7 7 0
##  $ CALCR       : int  0 0 1 6 1 8
##  $ MOXD1       : int  1 0 0 12 2 2
##  $ SMAD5-AS1   : int  0 1 0 14 2 1
##  $ HSPE1P7     : int  0 1 0 0 15 6
##  $ AC026464.2  : int  4 2 3 184 3 66
##  $ RASA4B      : int  2 0 0 59 3 0
##  $ AC026954.2  : int  1 1 1 269 1 1
##  $ class       : Factor w/ 2 levels "EBVncp","ncp": 1 1 1 2 2 2

Split into a training and testing set.

set.seed(123)
training <- ncpEBV_matrix[c(1,2,5,6),] #two each
testing <- ncpEBV_matrix[c(3,4),] #one each 
table(training$class)
## 
## EBVncp    ncp 
##      2      2
table(testing$class)
## 
## EBVncp    ncp 
##      1      1

Lets use a randomForest model to see how well it trains and predicts on unseen data.

library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
str(training
    )
## 'data.frame':    4 obs. of  21 variables:
##  $ AC139530.2  : int  520 0 1 0
##  $ AC011511.4  : int  617 0 0 4
##  $ AC092143.1  : int  211 0 0 0
##  $ AL645465.1  : int  93 3 1 2
##  $ AL353997.3  : int  1 63 1 1
##  $ AC022384.1  : int  15 203 2 3
##  $ MPO         : int  0 36 3 0
##  $ TRIM6-TRIM34: int  0 54 0 0
##  $ FUT4        : int  5 4 0 1
##  $ ITGA4       : int  0 13 0 1
##  $ RPL23AP34   : int  0 1 1 7
##  $ AC027088.3  : int  2 0 13 4
##  $ MYOM3       : int  1 0 7 0
##  $ CALCR       : int  0 0 1 8
##  $ MOXD1       : int  1 0 2 2
##  $ SMAD5-AS1   : int  0 1 2 1
##  $ HSPE1P7     : int  0 1 15 6
##  $ AC026464.2  : int  4 2 3 66
##  $ RASA4B      : int  2 0 3 0
##  $ AC026954.2  : int  1 1 1 1
##  $ class       : Factor w/ 2 levels "EBVncp","ncp": 1 1 2 2
rf <- randomForest(training[,c(1:20)], training[,21],mtry=6,
                         importance=TRUE, confusion=T)
rf$confusion
##        EBVncp ncp class.error
## EBVncp      2   0           0
## ncp         0   2           0

The model scored 100% on training with all classes of EBV infected NCP identified and all classes of baseline NCP only identified.

Resulting output: EBVncp ncp class.error EBVncp 2 0 0 ncp 0 2 0

Now lets see how well it predicts on the hold out set or our testing set of 1 sample each.

prediction <- predict(rf,testing)
result <- data.frame(predicted=prediction, actual=testing$class)
result
##                        predicted actual
## HNE_1_MUT_LMP1_3_count    EBVncp EBVncp
## HNE_1_WT_LMP1_1_count        ncp    ncp

Looks like a very strong set of genes we selected in using a randomForest model made for classification on our data to predict with 100% accuracy the correct class of a two class model.

Resulting output:

predicted actual HNE_1_MUT_LMP1_3_count EBVncp EBVncp
HNE_1_WT_LMP1_1_count ncp ncp
2 rows

Thanks so much, looks like we can go ahead and add these genes to our database of pathologies. But it would be interesting to look up the actual genes used in the GOF the study used to confirm that the mutation in the H101R gene affected the immune response based on the EBVaNCP GOF.

Keep checking back.