Part 1 getting to understand an interesting and recent gene study that will give us some more genes to target in our machine model when we build it to predict class of EBV, EBV associated pathologies, MS, mono, fibromyalgia, Lyme disease, NKTC lymphoma, nasopharyngeal carcinoma with EBV, and others. This one compares Hodgkin’s disease in all samples but adds in comparisons to find tumor inflammatory microenvironment changes, tumor mutation burden, the T-cell receptor repertoire changes, and compares to other genes. The study was read, annotated, and notes for takeaways made to clarify the data working with and results to compare with findings in this project to see what genes we can use for our machine model. Part 2 is going to be the machine learning and additional analysis after finding top genes here and building our database systems and data frames to use.

In this study we analyze a study, GSE289903, that explores the tumor neoplastic gene expression to profile the classical Hodgkin’s lymphoma or cHL, and how Epstein-Barr Virus (EBV) infection and/or Human Immunodeficiency Virus (HIV) affect the lymphocyte tumors or neoplasms dissected and then processed through a chip array sequencing of RNA. The tumor inflammatory microenvironment or TIME, the tumor mutational burden or TMB, as well as immune checkpoint inhibitor or ICI biomarkers of antibodies, and the T-cell Receptor Repertoire or TCR of these tumors were evaluated with taking counts and using the genes that had at least one count in processing. I summarized the published and free article published by Science Direct in some key takeaways and to direct this project in finding genes that explain the data or could be used as excellent predictors on sample type as this has 3 sample types of data. We will analyze the counts data that was uploaded and downloaded from link above at the GSE289903 link and compare to the genes we find and top stimulated or up regulated genes and top down regulated or inhibited genes to the genes in this study that showed as having top 5 genes in explaining the molecular cell types from the samples, the ICIs they used in finding TCR affects, TIME, and TMB. This was a 10 page document with tabular and graphical representations of the findings or results of their 12 researcher large research study published within a year of today. The study describes 25 patients in their Table 1 but the data downloaded has 19 samples.

Takeaways or notes on GSE289903:

Great! Now that we summarized the key takeaway points from this study. We can easily go back to find the genes that seem relevant to make a string of to also extract from the data like the top 5 genes in their heatmap figure and the ICI genes and top genes within each sample type.

Lets read in our data of series matrix information and then the counts data.

setwd(path)
seriesInfo1_75 <- read.table('GSE289903_series_matrix.txt', nrow=75)


paged_table(seriesInfo1_75)

Lets skip 33 rows and read it in to see the sample ID with type of sample, there is an error at line 40 so lets only read in 39 rows.

seriesInfo <- read.table('GSE289903_series_matrix.txt', header=F, skip=33, nrow=39)


paged_table(seriesInfo)

The first 2 rows have the sample type and the sample ID.

sampleTypeByID <- data.frame(t(seriesInfo[c(1:2),c(2:20)]))

colnames(sampleTypeByID) <- c("sampleType", "GSE_ID")

sampleTypeByID
##                           sampleType     GSE_ID
## V2  YF02, NS, CHL, EBV+/HIV+, Female GSM8801166
## V3    YF03, MC, CHL, EBV+/HIV-, Male GSM8801167
## V4    YF05, NS, CHL, EBV+/HIV-, Male GSM8801168
## V5  YF06, MC, CHL, EBV+/HIV+, Female GSM8801169
## V6    YF07, MC, CHL, EBV+/HIV-, Male GSM8801170
## V7  YF08, LD, CHL, EBV-/HIV-, Female GSM8801171
## V8    YF09, LR, CHL, EBV+/HIV-, Male GSM8801172
## V9    YF10, NS, CHL, EBV+/HIV+, Male GSM8801173
## V10   YF11, MC, CHL, EBV+/HIV-, Male GSM8801174
## V11   YF12, NS, CHL, EBV+/HIV-, Male GSM8801175
## V12   YF13, NS, CHL, EBV-/HIV-, Male GSM8801176
## V13 YF14, NS, CHL, EBV+/HIV-, Female GSM8801177
## V14 YF15, NS, CHL, EBV+/HIV+, Female GSM8801178
## V15 YF17, NA, CHL, EBV-/HIV-, Female GSM8801179
## V16 YF18, NS, CHL, EBV+/HIV-, Female GSM8801180
## V17   YF19, NS, CHL, EBV+/HIV-, Male GSM8801181
## V18   YF21, MC, CHL, EBV+/HIV-, Male GSM8801182
## V19 YF22, NS, CHL, EBV-/HIV-, Female GSM8801183
## V20 YF23, NS, CHL, EBV-/HIV-, Female GSM8801184

The subtypes are NA for not applicable or didn’t fit in the 4 subtypes of Nodular Sclerosis as NS, Lymphocyte Rich as LR, Lymphocyte Deplete as LD, and mixed cellularity as MC. The YF preheader means nothing to our analysis just an ID. We can exclude it. But we should get the 4 subtypes and NA sample, and also the class as one of the 3 class types.

LymphocyteDeplete <- grep('LD,',sampleTypeByID$sampleType)
LymphocyteRich <- grep('LR,',sampleTypeByID$sampleType)
NotApplicable <- grep('NA,' , sampleTypeByID$sampleType)
MixedCellularity <- grep('MC,', sampleTypeByID$sampleType)
NodularSclerosis <- grep('NS,',sampleTypeByID$sampleType)

We could analyze by cell type of the 4 subtypes to see if any genes show up regardless of EBV or HIV status but having cHL as all 19 samples do have cHL.

Lets add this as a descriptive feature to our table of IDs and sample types.

sampleTypeByID$subtype <- "subtype"
sampleTypeByID$subtype[LymphocyteDeplete] <- 'Lymphocyte Deplete'
sampleTypeByID$subtype[LymphocyteRich] <- 'Lymphocyte Rich'
sampleTypeByID$subtype[NodularSclerosis] <- 'Nodular Sclerosis'
sampleTypeByID$subtype[MixedCellularity] <- 'Mixed Cellularity'
sampleTypeByID$subtype[NotApplicable] <- 'Not Applied'

paged_table(sampleTypeByID)

We know all samples are cHL, but also that all HIV have EBV. We divide these samples into cHL only, cHL + EBV, and cHL + EBV + HIV to get the means of the genes by class and then the fold change values.

Lets add a descriptive feature for the class type to our sampleTypeByID table.

sampleTypeByID$class <- 'class'

cHL <- grep('EBV-/HIV-', sampleTypeByID$sampleType, fixed=TRUE)
cHL_EBV <- grep('EBV+/HIV-', sampleTypeByID$sampleType, fixed=TRUE)
cHL_EBV_HIV <- grep('EBV+/HIV+', sampleTypeByID$sampleType, fixed=TRUE)

sampleTypeByID$class[cHL] <- 'cHL only'
sampleTypeByID$class[cHL_EBV] <- 'cHL & EBV'
sampleTypeByID$class[cHL_EBV_HIV] <- 'cHL & EBV & HIV'

sampleTypeByID
##                           sampleType     GSE_ID            subtype
## V2  YF02, NS, CHL, EBV+/HIV+, Female GSM8801166  Nodular Sclerosis
## V3    YF03, MC, CHL, EBV+/HIV-, Male GSM8801167  Mixed Cellularity
## V4    YF05, NS, CHL, EBV+/HIV-, Male GSM8801168  Nodular Sclerosis
## V5  YF06, MC, CHL, EBV+/HIV+, Female GSM8801169  Mixed Cellularity
## V6    YF07, MC, CHL, EBV+/HIV-, Male GSM8801170  Mixed Cellularity
## V7  YF08, LD, CHL, EBV-/HIV-, Female GSM8801171 Lymphocyte Deplete
## V8    YF09, LR, CHL, EBV+/HIV-, Male GSM8801172    Lymphocyte Rich
## V9    YF10, NS, CHL, EBV+/HIV+, Male GSM8801173  Nodular Sclerosis
## V10   YF11, MC, CHL, EBV+/HIV-, Male GSM8801174  Mixed Cellularity
## V11   YF12, NS, CHL, EBV+/HIV-, Male GSM8801175  Nodular Sclerosis
## V12   YF13, NS, CHL, EBV-/HIV-, Male GSM8801176  Nodular Sclerosis
## V13 YF14, NS, CHL, EBV+/HIV-, Female GSM8801177  Nodular Sclerosis
## V14 YF15, NS, CHL, EBV+/HIV+, Female GSM8801178  Nodular Sclerosis
## V15 YF17, NA, CHL, EBV-/HIV-, Female GSM8801179        Not Applied
## V16 YF18, NS, CHL, EBV+/HIV-, Female GSM8801180  Nodular Sclerosis
## V17   YF19, NS, CHL, EBV+/HIV-, Male GSM8801181  Nodular Sclerosis
## V18   YF21, MC, CHL, EBV+/HIV-, Male GSM8801182  Mixed Cellularity
## V19 YF22, NS, CHL, EBV-/HIV-, Female GSM8801183  Nodular Sclerosis
## V20 YF23, NS, CHL, EBV-/HIV-, Female GSM8801184  Nodular Sclerosis
##               class
## V2  cHL & EBV & HIV
## V3        cHL & EBV
## V4        cHL & EBV
## V5  cHL & EBV & HIV
## V6        cHL & EBV
## V7         cHL only
## V8        cHL & EBV
## V9  cHL & EBV & HIV
## V10       cHL & EBV
## V11       cHL & EBV
## V12        cHL only
## V13       cHL & EBV
## V14 cHL & EBV & HIV
## V15        cHL only
## V16       cHL & EBV
## V17       cHL & EBV
## V18       cHL & EBV
## V19        cHL only
## V20        cHL only

Lets look at the number of samples in each of our 3 classes and in the subtypes.

table(sampleTypeByID$class)
## 
##       cHL & EBV cHL & EBV & HIV        cHL only 
##              10               4               5

There are 4 samples with cHL & EBV & HIV, 10 samples of cHL & EBV, and 5 samples of cHL only.

Lets look at how many samples by subtype there are.

table(sampleTypeByID$subtype)
## 
## Lymphocyte Deplete    Lymphocyte Rich  Mixed Cellularity  Nodular Sclerosis 
##                  1                  1                  5                 11 
##        Not Applied 
##                  1

There are 11 with Nodular Sclerosis, 5 with Mixed Cellularity, and 1 each in Lymphocyte Depleted, Lymphocyte Rich, and Not applied to the subtype.

The row names of our sampleTypeByID table has the column number of our counts data per gene. But we haven’t uploaded the counts data yet. So, we should do that to see how many genes we have that made the quality control filtering and analysis.

setwd(path1)
counts <- read.table('GSE289903_raw_counts.txt', header=T)

paged_table(head(counts,20))

We see the first 10 rows above. Lets look at the structure of the data.

str(counts)
## 'data.frame':    59427 obs. of  19 variables:
##  $ YF02: num  0 0 0 0 11 0 407 0 0 0 ...
##  $ YF03: num  0 0 0 21 8 0 114 35 0 0 ...
##  $ YF05: num  0 0 0 0 0 0 218 58 0 0 ...
##  $ YF06: num  0 0 0 1 55 0 603 202 0 0 ...
##  $ YF07: num  0 0 0 0 0 0 153 45 0 0 ...
##  $ YF08: num  0 0 0 0 0 0 322 20 0 0 ...
##  $ YF09: num  0 0 0 0 0 0 61 248 0 0 ...
##  $ YF10: num  0 0 75.5 0 37.9 ...
##  $ YF11: num  0 0 0 36 0 0 142 0 0 0 ...
##  $ YF12: num  0 0 0 0 0 0 47 0 0 0 ...
##  $ YF13: num  0 0 0 0 0 0 58 48 0 0 ...
##  $ YF14: num  0 0 0 0 16 0 10 51 0 0 ...
##  $ YF15: num  0 0 0 37 47 ...
##  $ YF17: num  0 0 1792 0 113 ...
##  $ YF18: num  0 0 0 0 20.3 ...
##  $ YF19: num  0 0 862 0 16 ...
##  $ YF21: num  0 0 0 13 43 ...
##  $ YF22: num  0 0 0 0 0 0 507 0 0 0 ...
##  $ YF23: num  0 0 3 43 48 ...

There are 59,427 genes with numeric gene expression data that says it is raw counts but the information says they didn’t share raw counts and that their counts data was normalized and only genes that had some counts were included. The 2 outliers were omitted from 21 samples they had for the RNA sequencing data which left us with these 19 samples. The original data had tumor microscopy done on the 25 samples, but this is what we have to work with. The series data gives information on how this data was manipulated before being shared on NCBI site for GSE289903 study. Row 8 of the first table we read in from the series data.

seriesInfo1_75[c(8:13),]
##                        V1
## 8  !Series_overall_design
## 9  !Series_overall_design
## 10 !Series_overall_design
## 11 !Series_overall_design
## 12 !Series_overall_design
## 13           !Series_type
##                                                                                                                                                                        V2
## 8  We performed RNA sequencing of 19 pre-treatment formalin-fixed paraffin-embedded (FFPE) whole lymph node biopsies of cHL, inclusive of EBV-association and HIV status.
## 9                                                                                                                                                                        
## 10                                                                                                        ***************************************************************
## 11                  Raw files for human/patient samples were not submitted to GEO due to concerns about submitting personally identifiable sequence data for open access.
## 12                                                                                                        ***************************************************************
## 13                                                                                                                     Expression profiling by high throughput sequencing

They did RNA sequencing of 19 pre-treatment formalin-fixed paraffin-embedded (FFPE) whole lymph node biopsies of cHL (and those with EBV and/or HIV).

But from my notes above directly from their published article…

RNA sequencing used the whole transcriptome RNA sequencing on the FFPE tumors used above to extract RNA, get the cDNA or complementary DNA libraries to prepare, and then messenger RNA or mRNA sequencing was done as strand specific. Counts generated with aligning FASTQ files using STAR and then quantifying using Salmon. Only those samples passing quality control by library and sequencing kept. There were 21 samples, but 2 were identified as outliers by principal component analysis or PCA and their quality control pipelines, leaving 19 samples. Removing genes with no counts across all samples, then differential gene expression done followed by normalization using DESeq2. Differential Pathway Enrichment was done using the Gene Sequencing Variation Association package or GSVA package, limma, and normalizing counts by DESeq2 using the Hallmark Pathways data set. The cell type proportions as transcripts per million or TPM found estimates with CIBERSORTx in an output file from the Salmon pipeline. Default settings used because recommended of LM22 signature matrix file, no quantile normalization, and 100 permutations.

The column names of our data uses the prefixed YF02-YF23 for 19 samples. We should now rearrange those columns by their sample type into a different data frame. We already have the values from our grep to get the class type in the cHL, cHL_EBV, and cHL_EBV_HIV objects. Lets use those values to arrange our new data frame in that order to get the means to add to the last columns.

DataFrame <- counts[,c(cHL,cHL_EBV,cHL_EBV_HIV)]

paged_table(head(DataFrame))

We can see the column names are out of order, lets compare it to the names of our samples next to their class type.

paged_table(sampleTypeByID[,c(1,4)])

Lets add the genes from the row names as we don’t have the genes in this data and need them.

DataFrame$Gene_ID <- row.names(DataFrame)
paged_table(head(DataFrame))

Lets prepend the sample type to the IDs on table to make inferences on the tabular data if we need to later.

colnames(DataFrame)[1:5] <- paste(colnames(DataFrame)[1:5],sep='_', 'cHL')
colnames(DataFrame)[6:15] <- paste(colnames(DataFrame)[6:15], sep='_',"cHL_EBV")
colnames(DataFrame)[16:19] <- paste(colnames(DataFrame)[16:19],sep='_','cHL_EBV_HIV')

colnames(DataFrame)
##  [1] "YF08_cHL"         "YF13_cHL"         "YF17_cHL"         "YF22_cHL"        
##  [5] "YF23_cHL"         "YF03_cHL_EBV"     "YF05_cHL_EBV"     "YF07_cHL_EBV"    
##  [9] "YF09_cHL_EBV"     "YF11_cHL_EBV"     "YF12_cHL_EBV"     "YF14_cHL_EBV"    
## [13] "YF18_cHL_EBV"     "YF19_cHL_EBV"     "YF21_cHL_EBV"     "YF02_cHL_EBV_HIV"
## [17] "YF06_cHL_EBV_HIV" "YF10_cHL_EBV_HIV" "YF15_cHL_EBV_HIV" "Gene_ID"

Lets add the row means by class to the DataFrame.

DataFrame$cHL_Means <- rowMeans(DataFrame[,c(1:5)])
DataFrame$cHL_EBV_Means <- rowMeans(DataFrame[,c(6:15)])
DataFrame$cHL_EBV_HIV_Means <- rowMeans(DataFrame[,c(16:19)])

DataFrame[c(1:3),c(20:23)]
##             Gene_ID cHL_Means cHL_EBV_Means cHL_EBV_HIV_Means
## 5S_rRNA     5S_rRNA     0.000        0.0000           0.00000
## 5_8S_rRNA 5_8S_rRNA     0.000        0.0000           0.00000
## 7SK             7SK   358.906       86.1891          18.86475

Now we can do the fold change value, since all have cHL, we can use that as the baseline and compare fold change for EBV+HIV- and EBV+HIV+.

DataFrame$EBV_cHL_FoldChange <- DataFrame$cHL_EBV_Means/DataFrame$cHL_Means
DataFrame$EBV_and_HIV_cHL_FoldChange <- DataFrame$cHL_EBV_HIV_Means/DataFrame$cHL_Means

DataFrame[c(1:3),c(20:25)]
##             Gene_ID cHL_Means cHL_EBV_Means cHL_EBV_HIV_Means
## 5S_rRNA     5S_rRNA     0.000        0.0000           0.00000
## 5_8S_rRNA 5_8S_rRNA     0.000        0.0000           0.00000
## 7SK             7SK   358.906       86.1891          18.86475
##           EBV_cHL_FoldChange EBV_and_HIV_cHL_FoldChange
## 5S_rRNA                  NaN                        NaN
## 5_8S_rRNA                NaN                        NaN
## 7SK                0.2401439                 0.05256181

Lets look at summary statistics on this DataFrame.

summary(DataFrame)
##     YF08_cHL            YF13_cHL            YF17_cHL           YF22_cHL       
##  Min.   :     0.00   Min.   :     0.00   Min.   :    0.00   Min.   :    0.00  
##  1st Qu.:     0.00   1st Qu.:     0.00   1st Qu.:    0.00   1st Qu.:    0.00  
##  Median :     0.00   Median :     0.00   Median :    0.00   Median :    0.00  
##  Mean   :    44.39   Mean   :    19.34   Mean   :   32.09   Mean   :   22.61  
##  3rd Qu.:     0.00   3rd Qu.:     0.00   3rd Qu.:    0.00   3rd Qu.:    0.00  
##  Max.   :141120.00   Max.   :114953.00   Max.   :66979.00   Max.   :64755.00  
##                                                                               
##     YF23_cHL          YF03_cHL_EBV        YF05_cHL_EBV      
##  Min.   :0.000e+00   Min.   :     0.00   Min.   :     0.00  
##  1st Qu.:0.000e+00   1st Qu.:     0.00   1st Qu.:     0.00  
##  Median :0.000e+00   Median :     0.00   Median :     0.00  
##  Mean   :4.436e+01   Mean   :    27.47   Mean   :    57.78  
##  3rd Qu.:7.962e+00   3rd Qu.:     0.00   3rd Qu.:     0.00  
##  Max.   :1.742e+05   Max.   :148867.00   Max.   :289355.00  
##                                                             
##   YF07_cHL_EBV        YF09_cHL_EBV        YF11_cHL_EBV        YF12_cHL_EBV     
##  Min.   :     0.00   Min.   :     0.00   Min.   :     0.00   Min.   :    0.00  
##  1st Qu.:     0.00   1st Qu.:     0.00   1st Qu.:     0.00   1st Qu.:    0.00  
##  Median :     0.00   Median :     0.00   Median :     0.00   Median :    0.00  
##  Mean   :    35.23   Mean   :    49.25   Mean   :    25.05   Mean   :   18.25  
##  3rd Qu.:     0.00   3rd Qu.:     0.00   3rd Qu.:     0.00   3rd Qu.:    0.00  
##  Max.   :212856.00   Max.   :329748.00   Max.   :184276.00   Max.   :83420.00  
##                                                                                
##   YF14_cHL_EBV        YF18_cHL_EBV        YF19_cHL_EBV      
##  Min.   :     0.00   Min.   :     0.00   Min.   :     0.00  
##  1st Qu.:     0.00   1st Qu.:     0.00   1st Qu.:     0.00  
##  Median :     0.00   Median :     0.00   Median :     0.00  
##  Mean   :    22.86   Mean   :    24.21   Mean   :    22.27  
##  3rd Qu.:     0.00   3rd Qu.:     0.00   3rd Qu.:     0.00  
##  Max.   :159960.00   Max.   :159745.00   Max.   :143078.00  
##                                                             
##   YF21_cHL_EBV       YF02_cHL_EBV_HIV    YF06_cHL_EBV_HIV   
##  Min.   :     0.00   Min.   :     0.00   Min.   :     0.00  
##  1st Qu.:     0.00   1st Qu.:     0.00   1st Qu.:     0.00  
##  Median :     0.00   Median :     0.00   Median :     0.00  
##  Mean   :    35.71   Mean   :    38.15   Mean   :    58.37  
##  3rd Qu.:     0.00   3rd Qu.:     0.00   3rd Qu.:     8.00  
##  Max.   :179773.00   Max.   :169301.00   Max.   :299787.00  
##                                                             
##  YF10_cHL_EBV_HIV    YF15_cHL_EBV_HIV      Gene_ID            cHL_Means        
##  Min.   :     0.00   Min.   :     0.00   Length:59427       Min.   :     0.00  
##  1st Qu.:     0.00   1st Qu.:     0.00   Class :character   1st Qu.:     0.00  
##  Median :     0.00   Median :     0.00   Mode  :character   Median :     0.00  
##  Mean   :    59.98   Mean   :    63.84                      Mean   :    32.56  
##  3rd Qu.:    10.00   3rd Qu.:     0.00                      3rd Qu.:     6.80  
##  Max.   :252671.00   Max.   :246588.00                      Max.   :112394.40  
##                                                                                
##  cHL_EBV_Means       cHL_EBV_HIV_Means   EBV_cHL_FoldChange
##  Min.   :     0.00   Min.   :     0.00   Min.   :0.000     
##  1st Qu.:     0.00   1st Qu.:     0.00   1st Qu.:0.438     
##  Median :     0.00   Median :     0.00   Median :0.971     
##  Mean   :    31.81   Mean   :    55.08   Mean   :  Inf     
##  3rd Qu.:     4.70   3rd Qu.:     9.25   3rd Qu.:2.061     
##  Max.   :189107.80   Max.   :214479.25   Max.   :  Inf     
##                                          NA's   :35661     
##  EBV_and_HIV_cHL_FoldChange
##  Min.   :0.000             
##  1st Qu.:0.936             
##  Median :1.730             
##  Mean   :  Inf             
##  3rd Qu.:3.450             
##  Max.   :  Inf             
##  NA's   :35807

We have a lot of Inf when the number is so large it can’t be displayed on computer from our 0 means or when dividing a number by a very small number and we have a lot of NAs as NaNs from dividing a very small number or very large number into a very small number as you cannot divide by 0. You can make the numbers infinitesimally small to fit between 0 and 1 but those numbers can be very small. At this point we haven’t made a vector of the values we want to keep that are from the research article we read that found target genes and confirmed genes involved in therapeutic treatment of Hodgkins, EBV, or HIV.

Lets look at the notes above that mention genes of interest:

Ok, so our genes are going to be added to a list:

Lets add these to a character string of genes from research study.

researchStudyGenes <- c("KMT2D","EP300","CARD11","CREBBP","EZH2", "TTN","MUC16", "XBP1","NCOR2","ARID1B","ZFHX3","TP53", "PD-L1","TIGIT","MHC-II","M2","CD4","CD8")

Lets see if any of these genes are in our DataFrame.

DataFrameResearched <- DataFrame[which(DataFrame$Gene_ID %in% researchStudyGenes),]
DataFrameResearched[,c(20,25)]
##        Gene_ID EBV_and_HIV_cHL_FoldChange
## ARID1B  ARID1B                  2.3832654
## CARD11  CARD11                  4.9572650
## CD4        CD4                  1.6534548
## CREBBP  CREBBP                  1.9744713
## EP300    EP300                  1.8721294
## EZH2      EZH2                  0.7481752
## KMT2D    KMT2D                  1.3122912
## MUC16    MUC16                  1.1629747
## NCOR2    NCOR2                  1.3716459
## TIGIT    TIGIT                  3.5623847
## TP53      TP53                  1.9667832
## TTN        TTN                  3.2353523
## XBP1      XBP1                  6.2879673
## ZFHX3    ZFHX3                  0.9466127

There are 13 of the 18 genes from our research article in this dataset of gene counts.

DataFrameResearched$Gene_ID
##  [1] "ARID1B" "CARD11" "CD4"    "CREBBP" "EP300"  "EZH2"   "KMT2D"  "MUC16" 
##  [9] "NCOR2"  "TIGIT"  "TP53"   "TTN"    "XBP1"   "ZFHX3"
researchStudyGenes[-which(DataFrameResearched$Gene_ID %in% researchStudyGenes)]
## [1] "MHC-II" "M2"     "CD4"    "CD8"

CD4 is listed in the data and it should say PD-L1 is missing instead of CD4, but MHC-II, M2, and CD8 are not. Four genes are missing.

We know there are the top 5 genes in TMB from the research study in this data and other genes important to class of this study. Lets order the fold change and get the top up regulated and top bottom regulated genes. We will omit the NaNs and Inf when ordering. but keep the genes in our DataFrameResearched data to combine to this data of fold change values.

DataFrameOrderedEBV_HIV <- DataFrame[order(DataFrame$EBV_and_HIV_cHL_FoldChange, decreasing=T),]

#remove the NaNs and Inf
#df <- df[!is.infinite(rowSums(df)),]
DataFrameOrderedEBV_HIV2 <- DataFrameOrderedEBV_HIV[ !is.nan(DataFrameOrderedEBV_HIV$EBV_and_HIV_cHL_FoldChange),]
DataFrameOrderedEBV_HIV3 <- DataFrameOrderedEBV_HIV2[!is.infinite(DataFrameOrderedEBV_HIV2$EBV_and_HIV_cHL_FoldChange),]
DataFrameOrderedEBV_HIV4 <- subset( DataFrameOrderedEBV_HIV3, DataFrameOrderedEBV_HIV3$EBV_and_HIV_cHL_FoldChange>0)

cHL_EBV_HIV_top10 <- DataFrameOrderedEBV_HIV4[c(1:5,17726:17730),]

paged_table(cHL_EBV_HIV_top10)

May of the genes with top rank in up or down regulation for EBV+HIV+ cHL are not genes of interest in only EBV+ cHL samples. Top genes of cHL+EBV+HIV are:

cHL_EBV_HIV_top10$Gene_ID
##  [1] "IGKV1D-39"  "RPS4Y1"     "HNRNPA1P70" "CCDC8"      "IGKV3-20"  
##  [6] "LTF"        "PES1P2"     "CTSLP3"     "MPO"        "MMP8"
DataFrameOrderedEBV <- DataFrame[order(DataFrame$EBV_cHL_FoldChange, decreasing=T ),]

DataFrameOrderedEBV2 <- DataFrameOrderedEBV[!is.nan(DataFrameOrderedEBV$EBV_cHL_FoldChange),]
DataFrameOrderedEBV3 <- DataFrameOrderedEBV2[!is.infinite(DataFrameOrderedEBV2$EBV_cHL_FoldChange),]
DataFrameOrderedEBV4 <- subset(DataFrameOrderedEBV3, DataFrameOrderedEBV3$EBV_cHL_FoldChange > 0)

cHL_EBV_top10 <- DataFrameOrderedEBV4[c(1:5,17543:17547),]

paged_table(cHL_EBV_top10)

Only one gene that is a top down regulated gene in cHL+EBV is not a gene with a remarkable fold change value of change in the cHL_EBV_HIV data. The top 10 up and down regulated genes for cHL+EBV are:

cHL_EBV_top10$Gene_ID
##  [1] "RPS4Y1"   "Z82243.1" "NR4A2"    "LERFS"    "VAMP5"    "POU2F3"  
##  [7] "COL16A1"  "SLC18A2"  "CPLX2"    "AICDA"

Lets add columns to each data set of top genes from our fold change values and the research then combine them into one data set.

cHL_EBV_HIV_top10$topGene <- "foldchange"
cHL_EBV_HIV_top10$topGene[1:5] <- 'top up regulated cHL_EBV_HIV'
cHL_EBV_HIV_top10$topGene[6:10] <- 'top down regulated cHL_EBV_HIV'

paged_table(cHL_EBV_HIV_top10[,c(20,25,26)])
cHL_EBV_top10$topGene <- 'top gene'
cHL_EBV_top10$topGene[1:5] <- 'top up regulated in cHL_EBV'
cHL_EBV_top10$topGene[6:10] <- 'top down regulated in cHL_EBV'

paged_table(cHL_EBV_top10[,c(20,24,26)])
DataFrameResearched$topGene <- 'GSE289903 gene target'
paged_table(DataFrameResearched[,c(20,24:26)])

Lets be more descriptive and add to topGene feature why in the study it was a target or referenced in the study.

TIME referenced M2, CD4, and CD8

DataFrameResearched$topGene[c(1,11,14)] <- 'cHL+EBV compared to cHL top DE gene'
DataFrameResearched$topGene[9] <- 'cHL, cHL+EBV, & cHL+EBV+HIV top DE gene'
DataFrameResearched$topGene[c(2,4:7)] <- 'top 5 genes ranked by tumor mutation burden TMB of RNA sequencing data cHL'
DataFrameResearched$topGene[c(8,12)] <- 'FLAG genes or frequently mutated genes in cHL'
DataFrameResearched$topGene[10] <- 'ICI immune complex inhibitor gene as well as PD-L1 and MHC-II'
DataFrameResearched$topGene[3] <- 'TIME tumor inflammation microenvironment gene as well as M2 macrophages and CD8 t-cells'
DataFrameResearched$topGene[13] <- 'top referenced gene with NCOR2 in cHL+EBV+HIV'

DataFrameResearched[,c(20,26)]
##        Gene_ID
## ARID1B  ARID1B
## CARD11  CARD11
## CD4        CD4
## CREBBP  CREBBP
## EP300    EP300
## EZH2      EZH2
## KMT2D    KMT2D
## MUC16    MUC16
## NCOR2    NCOR2
## TIGIT    TIGIT
## TP53      TP53
## TTN        TTN
## XBP1      XBP1
## ZFHX3    ZFHX3
##                                                                                        topGene
## ARID1B                                                     cHL+EBV compared to cHL top DE gene
## CARD11              top 5 genes ranked by tumor mutation burden TMB of RNA sequencing data cHL
## CD4    TIME tumor inflammation microenvironment gene as well as M2 macrophages and CD8 t-cells
## CREBBP              top 5 genes ranked by tumor mutation burden TMB of RNA sequencing data cHL
## EP300               top 5 genes ranked by tumor mutation burden TMB of RNA sequencing data cHL
## EZH2                top 5 genes ranked by tumor mutation burden TMB of RNA sequencing data cHL
## KMT2D               top 5 genes ranked by tumor mutation burden TMB of RNA sequencing data cHL
## MUC16                                            FLAG genes or frequently mutated genes in cHL
## NCOR2                                                  cHL, cHL+EBV, & cHL+EBV+HIV top DE gene
## TIGIT                            ICI immune complex inhibitor gene as well as PD-L1 and MHC-II
## TP53                                                       cHL+EBV compared to cHL top DE gene
## TTN                                              FLAG genes or frequently mutated genes in cHL
## XBP1                                             top referenced gene with NCOR2 in cHL+EBV+HIV
## ZFHX3                                                      cHL+EBV compared to cHL top DE gene

Now lets combine these genes into one data set.

topGenes <- rbind(cHL_EBV_HIV_top10,cHL_EBV_top10,DataFrameResearched)

paged_table(topGenes)

We have 34 genes that we can use in part 2 of this project to see how well they do in machine learning to predict class as classic Hodgkin’s lymphoma, classic Hodgkin’s lymphoma with EBV infection, or classic Hodgkin’s lymphoma with EBV and HIV infections.

We will do that on another day but keep checking in. For now we can write a few of these data sets to csv and use it later.

write.csv(topGenes,'top34Genes_cHL_EBV_HIV.csv',row.names=F)
write.csv(sampleTypeByID,'sampleTypeByID_cHL_EBV_HIV.csv',row.names=F)
write.csv(DataFrameOrderedEBV, 'DataFrameAll_59427_orderedEBV.csv',row.names=F)

You can get the topGenes data here. Find the sampleTypeByID data set here. And get the total gene set ordered by EBV over cHL only here.

Thanks so much and look for part 2 soon. That part will have more analysis and making inferences on the duplicate genes in common and those that are more up regulated in the cHL+EBV+HIV class compared to the cHL+EBV, and also see how the genes of the research study did in fold change values in the data provided by fold change comparisons to the cHL only samples that were our baseline as all samples had at minimum classic Hodgkin’s lymphoma. And then we will test these gene sets separately and together in the machine learning algorithms we have come to use most of the time to see if these genes make good predictors of these 3 classes of samples.

We will also go back to the subtypes of the samples possibly before or after making the data and testing those matrices of predictors in predicting our 3 class samples, and see if the subtypes do give any additional information as the study said the genes from the subtypes they found by class type didn’t matter on any of the clinical features of age, gender, subtype, and so on but by TIME, TCR, and TMB.