Classic Hodgkin’s Lymphoma, Epstein-Barr Virus, and Human Immunodeficiency Virus samples to explore Biomarkers in neoplasm RNA samples GSE289903

Part 1 getting to understand an interesting and recent gene study that will give us some more genes to target in our machine model when we build it to predict class of EBV, EBV associated pathologies, MS, mono, fibromyalgia, Lyme disease, NKTC lymphoma, nasopharyngeal carcinoma with EBV, and others. This one compares Hodgkin’s disease in all samples but adds in comparisons to find tumor inflammatory microenvironment changes, tumor mutation burden, the T-cell receptor repertoire changes, and compares to other genes. The study was read, annotated, and notes for takeaways made to clarify the data working with and results to compare with findings in this project to see what genes we can use for our machine model. Part 2 is going to be the machine learning and additional analysis after finding top genes here and building our database systems and data frames to use.

In this study we analyze a study, GSE289903, that explores the tumor neoplastic gene expression to profile the classical Hodgkin’s lymphoma or cHL, and how Epstein-Barr Virus (EBV) infection and/or Human Immunodeficiency Virus (HIV) affect the lymphocyte tumors or neoplasms dissected and then processed through a chip array sequencing of RNA. The tumor inflammatory microenvironment or TIME, the tumor mutational burden or TMB, as well as immune checkpoint inhibitor or ICI biomarkers of antibodies, and the T-cell Receptor Repertoire or TCR of these tumors were evaluated with taking counts and using the genes that had at least one count in processing. I summarized the published and free article published by Science Direct in some key takeaways and to direct this project in finding genes that explain the data or could be used as excellent predictors on sample type as this has 3 sample types of data. We will analyze the counts data that was uploaded and downloaded from link above at the GSE289903 link and compare to the genes we find and top stimulated or up regulated genes and top down regulated or inhibited genes to the genes in this study that showed as having top 5 genes in explaining the molecular cell types from the samples, the ICIs they used in finding TCR affects, TIME, and TMB. This was a 10 page document with tabular and graphical representations of the findings or results of their 12 researcher large research study published within a year of today. The study describes 25 patients in their Table 1 but the data downloaded has 19 samples.

Takeaways or notes on GSE289903:

Classic Hodgkin Lymphoma or cHL is highly associated with EBV and about 30% of cHL cases have EBV.
cHL is a hematolymphoid neoplasm (tumor growth of abnormal growth of cells as in a neoplasm is neoplastic – internet search) in B cells created from dysfunctional germinal centers within the B cells and affects 80k+ people per year worldwide, and expected to increase in incidence within the next 10 years.
non-Hodgkin Lymphoma incidence decreased with the use of antiretroviral therapies but classic Hodgkin Lymphoma has stayed the same rate of incidence.
People with HIV or PWH are 5x likely to get cHL and nearly all cases of cHL in PWH is d/t EBV
whole exome, transcriptome, and T-cell Receptor or TCR repertoire evaluated in this study from tumor cells, all patients had cHL but some were a combination of EBV and/or HIV status.
cHL and HIV+ had increased tumor mutational burden compared with cHL and HIV-
cHL, EBV+, and HIV- tumors had a distinct transcriptional profile compared with cHL, EBV-, and HIV- as well as cHL, EBV+, and HIV+
the tumor microenvironment of the EBV and HIV positive groups had an association with cellular composition changes like increased B cell memory and CD8+ T cells compared to the EBV and HIV negative groups.
The cHL tumor mass is composed mostly of an inflammatory microenvironment with scattered Hodgkin/Reed-Sternberg or HRS cells.
There are 4 histologic subtypes of cHL, that show the tumor architecture and tumor immune microenvironment or TIME composition.
cHL molecular features within these 4 subtypes have notable changes when due to EBV, like a decreased number of somatic mutations in the HRS cells and an increase in immunosuppressive CD4+ T-cell recruitment.
chemotherapy has been the available treatment in cHL and PWH and is affective if in a low and middle income country or LMIC for short
focus of this study is to find the tumor host interactions like the genomic features and TIME characteristics or tumor immune microenvironment and tumor structure while using unavailable high quality techniques unavailable to LMIC with low resource tools available. The goal is to find the immunogenic profiles of cHL in an HIV (and EBV) inclusive environment.
all patients were confirmed to have cHL and EBV and HIV if present, but all had cHL, confirmed by physical exam, chest x-ray, abdominal ultrasound, and bone marrow biopsy.
all patients in study treated with 6 cycles of cytotoxic combination chemotherapy including ABVD for Adriamycin, doxorubicin, bleomycin, and vinblastine as well as dexamethasone (a steroid). Radiation therapy and positron emission therapy were not availale for treatment.
If the patients also had HIV, then they continued their antiretroviral therapy with cHL diagnosis.
children under 18 years of age were followed 2 years after treatment and the adults for 5 years unless censored or until the study ended on 4-26-2024. Table data on survival rates per group are based on a 2 year follow-up.
Antibodies used were the CD3 (PS1 clone), CD20 (L26 clone), CD30 (15B3 clone), and PAX5 to test the hematoxylin and eosin-stained formalin-fixed and paraffin-embedded (FFPE) tissue sections of tumor and immunohistochemistry (IHC). Sent out for genomic testing, with EBV tumor testing done with EBV-encoded RNA or EBER expression where more than 10% tumor with HRS cells considered EBV positive. But all tumors expressing EBER were positive in majority of neoplastic cells.
internet search for clone of antibody returned it is a mouse monoclonal antibody that recognizes same protein used in immunohistochemistry, western blotting, and flow cytometry, that recognizes an epitope in the cytoplasmic domain that is crucial for its functionality in B cell activation and differentiation.
Two hemopathologists subtyped the histologic samples, blinded to the study, as either of 4 subtypes: Nodular Sclerosis (NS), mixed cellularity, lymphocyte rich, or lymphocyte deplete. NS is the most common of cHL tumor types outside of the study with internet search.
The immunohistochemistry or IHC for short, of the tumor tissue used antibodies PD-L1 (E1L3N clone), TIGIT (E5Y1W clone), and MHC-II (LGII-612.14 clone). Samples considered positive if neoplastic cells expressed the proteins in the PD-L1 and MHC-II staining. TIGIT expression scored using a published metric. All done by light microscopy to determine expression of antibodies.
whole exome sequencing kept 21 of 23 confirmed cHL FFPE samples after quality control. Processing of the DNA from FFPE and blood pellets were captured as FASTQs and aligned into BAM files, sorted and indexed by Bio-bambam2, quality control done by Picard, MultipleMetrics, and FastQC. BAM realignment done with coverageBed, variants called by Strelka2, Cadabra, and Mutect2, then merged into a single variant call format or VCF file with vcf2maf. The mutations classed as silent or intronic by maftools were removed. Mutations filtered by keeping those that had a read depth greater than 15x, germline read depth greater than 15x, tumor alternative allele count greater than 5, variant allele frequency greater than 1%, and less than 10% accounting for the small percentage of neoplastic cells. The median tumor depth was 159x and normal depth was 195x after filtering.
RNA sequencing used the whole transcriptome RNA sequencing on the FFPE tumors used above to extract RNA, get the cDNA or complementary DNA libraries to prepare, and then messenger RNA or mRNA sequencing was done as strand specific. Counts generated with aligning FASTQ files using STAR and then quantifying using Salmon. Only those samples passing quality control by library and sequencing kept. There were 21 samples, but 2 were identified as outliers by principal component analysis or PCA and their quality control pipelines, leaving 19 samples. Removing genes with no counts across all samples, then differential gene expression done followed by normalization using DESeq2. Differential Pathway Enrichment was done using the Gene Sequencing Variation Association package or GSVA package, limma, and normalizing counts by DESeq2 using the Hallmark Pathways data set. The cell type proportions as transcripts per million or TPM found estimates with CIBERSORTx in an output file from the Salmon pipeline. Default settings used because recommended of LM22 signature matrix file, no quantile normalization, and 100 permutations.
T-cell receptor or TCR sequencing done with immunoSEQ Human TCRB assay using 25 of the FFPE samples. Pooled libraries quantified with Collibri Library Quantification Kit with a final concentration of 1.5 parts per Million or pM. Raw sequencing data processed with Adaptive Biotechnologies’ pipeline. Filtered to keep only samples with 100 or more productive templates that equaled 22, and only those productive templates with no stop coone and in-frame were considered. T-cell count impacts template count and varies widely based on FFPE tumor block, so the lowest template count of 248 by random downsampling was used and averaged over 100 iterations. There was a Productive Simpson Clonality calculated as square root of Simpson’s diversity index with larger values representing increased clonality. The productive maximum frequency is the number of clones of the most common t-cell receptor clone in a sample. Diversity was measured by taking the unique productive rearrangements indicated by the number of TCR clones. Amino acid sequences were used to calculate metrics with the Adaptive Biotechnologies software, and TCR overlap and epitope analysis done with immunarch package.
Total of 25 cHL patients with 18 adults and remaining 7 children. Median age was 22 years old. Median progression-free survival wasn’t reached due to censoring at 24 months.
Using Table 1 data, there were 25 cHL of all 25 samples, but the EBV+ was 16 with 11 HIV negative and 5 HIV positive, and there was 9 EBV negative that were also all HIV negative. The breakdown is 9 samples of cHL only (without EBV or HIV), 11 samples of cHL and EBV (but no HIV), and 5 samples of cHL, EBV, and HIV.
Table 1 data showed that those with cHL but no EBV or HIV had worse progression-free survival (PFS) of minimum 8.57 months, but all median time was max of 2 years or 24 months. The cHL and EBV but no HIV had a minimum survival of 16.9 months. The cHL with EBV and HIV had the highest minimum survival of 24 months.
Table 1 data showed that most subtypes had the nodular sclerosis subtype in all 3 classes of cHL only, cHL+EBV, and cHL+EBV+HIV. There were no samples of lymphocyte rich in cHL only, and no lymphocyte depleted in the cHL+EBV. There was no mixed cellularity in the cHL+EBV+HIV. There was 1 sample in the 25 that was NA for not applicable to the 4 subtypes, and it was in the cHL only class. The cHL+EBV next most populated subtype was the mixed cellularity and then the lymphocyte rich subtype, none in lymphocyte depleted or NA.
Table 1 data showed that the stage I-IV of cHL in the 3 classes showed that those with cHL only had all classes in stage III-IV, half or more in stage I-II in cHL+EBV class, and more in stage III-IV in cHL+EBV+HIV at 80% or 4/5 patients. Four of the patients in stage I-II were children, the other three were adults.
whole exome sequencing was done to find cHL genomic alterations but with this bulk tumor, whole-exome sequencing, there were many mutations and many were frequently mutated genes or FLAGS like TTN and MUC16. Gene targets were refined by filtering for genes in the FoundationOne Heme 406 targeted gene panel that uses recurrently mutated genes found in a wide array of hematologic neoplasms and used in many previous cHL studies. More biological relevant genes found with large exonic regions. Proceeding this way in bulk sequencing would be difficult so they used a cHL-associated gene panel they made from resourcing gene studies that used the flow-sorted HRS cells, ultra-deep, and/or targeted sequenceing, and came up with 46 genes in their ‘curated panel’ of relevant genes. The tumor mutational burden or TMB was calculated as the total number of mutations or at least 2 mutations per sample.
TMB of cHL+EBV+HIV had a greater score than cHL or cHL+EBV. The cHL+EBV had the lowest TMB showing the impact of EBV on TMB.
Top 5 mutated genes, as a percent of the 21 samples used and of the 46 genes found relevant in TMB, were KMT2D (33%), EP300 (29%), CARD11 (24%), CREBBP (24%), and EZH2 (24%). In figure 1 there are other genes with less importance by % of samples found.
The subtype of the 4 types of cHL made no difference on Tumor mutational burden or TMB, neither did age.
Since the HRC cells are a small percentage of the bulk tumor RNA sequencing, the differential gene expression captured is showing the EBV or HIV status on Tumor inflammation microenvironment or on TIME. All HIV samples had EBV and cHL. Not all EBV samples had HIV. But all samples had cHL. TIME was assessed by class of cHL not subtype.
TIME assessment of cHL+EBV+HIV samples found 26 differentially expressed genes of XBP1, NCOR2, and various IGH genes. The cHL+EBV had 160 differentially expressed genes and also decreased expression of collagen and L ribosomal protein genes compared to cHL only. Genes common the cHL and cHL+EBV with altered gene expression were NCOR2, ARID1B, and ZFHX3.
Gene Set Variation Analysis or GSVA was used to determine TIME as well and to determine the pathway enrichment by viral status. There was more interferon-alfa response as increased enrichment in the cHL+EBV+HIV class compared to cHL+EBV. The cHL+EBV class also had depleted expression of P53, hypoxia, and epithelial-mesenchymal transition for enrichment pathways when compared to cHL only.
The antibodies discussed at beginning of PD-L1, TIGIT, and MHC-II were used to observe expression of potential immune checkpoint inhibitor targets tumor biomarkers using immunohistochemistry staining. This comparison also helped to establish TIME. PD-L1 was expressed by the HRS cells in all classes of cHL, cHL+EBV, and cHL+EBV+HIV. MHC-II and TIGIT were variable or inconsistent across samples so that the EBER (tumor EBV expression), HIV, or other feature assessed was found not to be associated.
In analyzing TIME within 3 classes, they looked at cell type proportion analysis for TIME cellular composition. The M2 macrophages were the most represented cell type with more M2 in the cHL only class compared to cHL+EBV. The CD8+ T cells were significantly increased in the cHL+EBV+HIV class than the cHL only class. The CD4+ naive T-cell levels were higher in cHL+EBV than the cHL+EBV+HIV class.
EBV status was the primary driving force for T-cell Receptor diversity. T cells are known to be important in emerging immunotherapies. The bulk transcriptomics and deconvolution observations led to TCR sequencing to show TCR diversity primarily affected by EBV status.
TCR clonality is (internet search) the genetic identity of T-cells in a population, crucial to diagnose conditions like T-cell lymphomas and other immune disorders. Clonal populations identified in unique genetic rearrangements in TCR by way of PCR to amplify TCR gene sequences. Neoplastic processes can be indicated by the presence of clonal T-cell receptor gene arrangements, and used to follow up test patients to see if clone persists. Takeaway is that the HIV T-cell clone is bad.
For the TCR repertoire characterization, the TCR sequencing showed that HIV tumors had increased clonality, clonality was not shown associated with other features like age, gender, or stage. The findings were that the TCR of cHL+EBV were most similar within the samples using the Morisita overlap index to analyze the intertumoral similarity of TCR repertoires between patients.
Although, all the HIV class samples also had EBV, and all samples had more nodular sclerosis subtypes of tumor cells, more study is needed on cHL+EBV+HIV due to distinct tumor molecular profile.
The overall survival was not the best for those with only cHL and not cHL+EBV or cHL+EBV+HIV that were also treated with HIV medication concurrently during the study. This could be due to the cHL population that had the one or more patients with progressive symptoms in 8 months as it didn’t say how many patients progressed but that the minimum time free from progression of symptoms was 8 months and also the cohort was small. Current epidemiologic studies show that cHL+EBV cases of treatment have improved but the cHL only cases has not, and could be due to the younger age of onset.
The TIME features of tumor inflammation microenvironment was more associated with EBV expression, but tumor mutational burden or TMB was more associated with HIV status. Other studies from European studies, this was an African cohort study, showed slight differences in CD8+ T cells being the same in cHL+EBV and cHL+EBV+HIV tumors, but different PD-1 and TIGIT and CD155 gene expression levels.
The study found it difficult to disentangle the effects of EBV and HIV on tumorigenesis or cHL tumors due to the cHL+EBV being distinct from the cHL+EBV+HIV and also distinct from cHL only, and also that all HIV cases in this study had EBV. They would like to investigate people with HIV and developing cHL.
Immune Checkpoint Inhibitors or ICIs are a common way to optionally treat relapsed or refractory cHL. The ICIs found in this study showed PD-L1 altered by EBV and HIV with all cHL samples positive for PD-L1 from the tumor cells. The stimulation of CD4 T cells that have antitumor properties was shown by retaining MHC-II expression in the response to ICIs and also patient remission. And most of these samples had retained MHC-II expression whether HIV+ or not.
TIGIT, an ICI, is inversely correlated with PD-L1 presence in cHL and advanced disease and is a new treatment target using TIGIT as the target in treating lymphoproliferative diseases. TIGIT is also associated with advanced disease progression of cHL. All samples were positive for TIGIT staining in immunohistochemistry portion of analysis but the amount and location of TIGIT in the HRS cells was greatly varied.
ICIs are a safe and effective 2nd line treatment for cHL+HIV with CATCH-IT.
Although this study found nodular sclerosis subtype most common in all 3 classes of samples, it is more common that the cHL+HIV class has mixed cellularity. In the 5 cHL+EBV+HIV samples in this study in Table 1, there were no patient samples with mixed cellularity.
This study used tumor biosamples to find biomarkers of genes, while others used less invasive and less expensive blood samples. This study was better able to characterize the Tumor inflammation microenvironment and could be useful for treatment strategies. Other studies with cHL mutational profiling and biomarker discoveries used targeted, ultra-deep, and/or circulating tumor DNA sequencing methods from blood. The microdissection and flow sorting of this study technique gives more details and a deeper view of mutational profile of HRS cells in comparison to bulk sequencing that is too expensive to be used clinically or globally in countries like LMIC

Great! Now that we summarized the key takeaway points from this study. We can easily go back to find the genes that seem relevant to make a string of to also extract from the data like the top 5 genes in their heatmap figure and the ICI genes and top genes within each sample type.

Lets read in our data of series matrix information and then the counts data.

setwd(path)
seriesInfo1_75 <- read.table('GSE289903_series_matrix.txt', nrow=75)


paged_table(seriesInfo1_75)

Lets skip 33 rows and read it in to see the sample ID with type of sample, there is an error at line 40 so lets only read in 39 rows.

seriesInfo <- read.table('GSE289903_series_matrix.txt', header=F, skip=33, nrow=39)


paged_table(seriesInfo)

The first 2 rows have the sample type and the sample ID.

sampleTypeByID <- data.frame(t(seriesInfo[c(1:2),c(2:20)]))

colnames(sampleTypeByID) <- c("sampleType", "GSE_ID")

sampleTypeByID

##                           sampleType     GSE_ID
## V2  YF02, NS, CHL, EBV+/HIV+, Female GSM8801166
## V3    YF03, MC, CHL, EBV+/HIV-, Male GSM8801167
## V4    YF05, NS, CHL, EBV+/HIV-, Male GSM8801168
## V5  YF06, MC, CHL, EBV+/HIV+, Female GSM8801169
## V6    YF07, MC, CHL, EBV+/HIV-, Male GSM8801170
## V7  YF08, LD, CHL, EBV-/HIV-, Female GSM8801171
## V8    YF09, LR, CHL, EBV+/HIV-, Male GSM8801172
## V9    YF10, NS, CHL, EBV+/HIV+, Male GSM8801173
## V10   YF11, MC, CHL, EBV+/HIV-, Male GSM8801174
## V11   YF12, NS, CHL, EBV+/HIV-, Male GSM8801175
## V12   YF13, NS, CHL, EBV-/HIV-, Male GSM8801176
## V13 YF14, NS, CHL, EBV+/HIV-, Female GSM8801177
## V14 YF15, NS, CHL, EBV+/HIV+, Female GSM8801178
## V15 YF17, NA, CHL, EBV-/HIV-, Female GSM8801179
## V16 YF18, NS, CHL, EBV+/HIV-, Female GSM8801180
## V17   YF19, NS, CHL, EBV+/HIV-, Male GSM8801181
## V18   YF21, MC, CHL, EBV+/HIV-, Male GSM8801182
## V19 YF22, NS, CHL, EBV-/HIV-, Female GSM8801183
## V20 YF23, NS, CHL, EBV-/HIV-, Female GSM8801184

The subtypes are NA for not applicable or didn’t fit in the 4 subtypes of Nodular Sclerosis as NS, Lymphocyte Rich as LR, Lymphocyte Deplete as LD, and mixed cellularity as MC. The YF preheader means nothing to our analysis just an ID. We can exclude it. But we should get the 4 subtypes and NA sample, and also the class as one of the 3 class types.

LymphocyteDeplete <- grep('LD,',sampleTypeByID$sampleType)
LymphocyteRich <- grep('LR,',sampleTypeByID$sampleType)
NotApplicable <- grep('NA,' , sampleTypeByID$sampleType)
MixedCellularity <- grep('MC,', sampleTypeByID$sampleType)
NodularSclerosis <- grep('NS,',sampleTypeByID$sampleType)

We could analyze by cell type of the 4 subtypes to see if any genes show up regardless of EBV or HIV status but having cHL as all 19 samples do have cHL.

Lets add this as a descriptive feature to our table of IDs and sample types.

sampleTypeByID$subtype <- "subtype"
sampleTypeByID$subtype[LymphocyteDeplete] <- 'Lymphocyte Deplete'
sampleTypeByID$subtype[LymphocyteRich] <- 'Lymphocyte Rich'
sampleTypeByID$subtype[NodularSclerosis] <- 'Nodular Sclerosis'
sampleTypeByID$subtype[MixedCellularity] <- 'Mixed Cellularity'
sampleTypeByID$subtype[NotApplicable] <- 'Not Applied'

paged_table(sampleTypeByID)

We know all samples are cHL, but also that all HIV have EBV. We divide these samples into cHL only, cHL + EBV, and cHL + EBV + HIV to get the means of the genes by class and then the fold change values.

Lets add a descriptive feature for the class type to our sampleTypeByID table.

sampleTypeByID$class <- 'class'

cHL <- grep('EBV-/HIV-', sampleTypeByID$sampleType, fixed=TRUE)
cHL_EBV <- grep('EBV+/HIV-', sampleTypeByID$sampleType, fixed=TRUE)
cHL_EBV_HIV <- grep('EBV+/HIV+', sampleTypeByID$sampleType, fixed=TRUE)

sampleTypeByID$class[cHL] <- 'cHL only'
sampleTypeByID$class[cHL_EBV] <- 'cHL & EBV'
sampleTypeByID$class[cHL_EBV_HIV] <- 'cHL & EBV & HIV'

sampleTypeByID

##                           sampleType     GSE_ID            subtype
## V2  YF02, NS, CHL, EBV+/HIV+, Female GSM8801166  Nodular Sclerosis
## V3    YF03, MC, CHL, EBV+/HIV-, Male GSM8801167  Mixed Cellularity
## V4    YF05, NS, CHL, EBV+/HIV-, Male GSM8801168  Nodular Sclerosis
## V5  YF06, MC, CHL, EBV+/HIV+, Female GSM8801169  Mixed Cellularity
## V6    YF07, MC, CHL, EBV+/HIV-, Male GSM8801170  Mixed Cellularity
## V7  YF08, LD, CHL, EBV-/HIV-, Female GSM8801171 Lymphocyte Deplete
## V8    YF09, LR, CHL, EBV+/HIV-, Male GSM8801172    Lymphocyte Rich
## V9    YF10, NS, CHL, EBV+/HIV+, Male GSM8801173  Nodular Sclerosis
## V10   YF11, MC, CHL, EBV+/HIV-, Male GSM8801174  Mixed Cellularity
## V11   YF12, NS, CHL, EBV+/HIV-, Male GSM8801175  Nodular Sclerosis
## V12   YF13, NS, CHL, EBV-/HIV-, Male GSM8801176  Nodular Sclerosis
## V13 YF14, NS, CHL, EBV+/HIV-, Female GSM8801177  Nodular Sclerosis
## V14 YF15, NS, CHL, EBV+/HIV+, Female GSM8801178  Nodular Sclerosis
## V15 YF17, NA, CHL, EBV-/HIV-, Female GSM8801179        Not Applied
## V16 YF18, NS, CHL, EBV+/HIV-, Female GSM8801180  Nodular Sclerosis
## V17   YF19, NS, CHL, EBV+/HIV-, Male GSM8801181  Nodular Sclerosis
## V18   YF21, MC, CHL, EBV+/HIV-, Male GSM8801182  Mixed Cellularity
## V19 YF22, NS, CHL, EBV-/HIV-, Female GSM8801183  Nodular Sclerosis
## V20 YF23, NS, CHL, EBV-/HIV-, Female GSM8801184  Nodular Sclerosis
##               class
## V2  cHL & EBV & HIV
## V3        cHL & EBV
## V4        cHL & EBV
## V5  cHL & EBV & HIV
## V6        cHL & EBV
## V7         cHL only
## V8        cHL & EBV
## V9  cHL & EBV & HIV
## V10       cHL & EBV
## V11       cHL & EBV
## V12        cHL only
## V13       cHL & EBV
## V14 cHL & EBV & HIV
## V15        cHL only
## V16       cHL & EBV
## V17       cHL & EBV
## V18       cHL & EBV
## V19        cHL only
## V20        cHL only

Lets look at the number of samples in each of our 3 classes and in the subtypes.

table(sampleTypeByID$class)

## 
##       cHL & EBV cHL & EBV & HIV        cHL only 
##              10               4               5

There are 4 samples with cHL & EBV & HIV, 10 samples of cHL & EBV, and 5 samples of cHL only.

Lets look at how many samples by subtype there are.

table(sampleTypeByID$subtype)

## 
## Lymphocyte Deplete    Lymphocyte Rich  Mixed Cellularity  Nodular Sclerosis 
##                  1                  1                  5                 11 
##        Not Applied 
##                  1

There are 11 with Nodular Sclerosis, 5 with Mixed Cellularity, and 1 each in Lymphocyte Depleted, Lymphocyte Rich, and Not applied to the subtype.

The row names of our sampleTypeByID table has the column number of our counts data per gene. But we haven’t uploaded the counts data yet. So, we should do that to see how many genes we have that made the quality control filtering and analysis.

setwd(path1)
counts <- read.table('GSE289903_raw_counts.txt', header=T)

paged_table(head(counts,20))

We see the first 10 rows above. Lets look at the structure of the data.

str(counts)

## 'data.frame':    59427 obs. of  19 variables:
##  $ YF02: num  0 0 0 0 11 0 407 0 0 0 ...
##  $ YF03: num  0 0 0 21 8 0 114 35 0 0 ...
##  $ YF05: num  0 0 0 0 0 0 218 58 0 0 ...
##  $ YF06: num  0 0 0 1 55 0 603 202 0 0 ...
##  $ YF07: num  0 0 0 0 0 0 153 45 0 0 ...
##  $ YF08: num  0 0 0 0 0 0 322 20 0 0 ...
##  $ YF09: num  0 0 0 0 0 0 61 248 0 0 ...
##  $ YF10: num  0 0 75.5 0 37.9 ...
##  $ YF11: num  0 0 0 36 0 0 142 0 0 0 ...
##  $ YF12: num  0 0 0 0 0 0 47 0 0 0 ...
##  $ YF13: num  0 0 0 0 0 0 58 48 0 0 ...
##  $ YF14: num  0 0 0 0 16 0 10 51 0 0 ...
##  $ YF15: num  0 0 0 37 47 ...
##  $ YF17: num  0 0 1792 0 113 ...
##  $ YF18: num  0 0 0 0 20.3 ...
##  $ YF19: num  0 0 862 0 16 ...
##  $ YF21: num  0 0 0 13 43 ...
##  $ YF22: num  0 0 0 0 0 0 507 0 0 0 ...
##  $ YF23: num  0 0 3 43 48 ...

There are 59,427 genes with numeric gene expression data that says it is raw counts but the information says they didn’t share raw counts and that their counts data was normalized and only genes that had some counts were included. The 2 outliers were omitted from 21 samples they had for the RNA sequencing data which left us with these 19 samples. The original data had tumor microscopy done on the 25 samples, but this is what we have to work with. The series data gives information on how this data was manipulated before being shared on NCBI site for GSE289903 study. Row 8 of the first table we read in from the series data.

seriesInfo1_75[c(8:13),]

##                        V1
## 8  !Series_overall_design
## 9  !Series_overall_design
## 10 !Series_overall_design
## 11 !Series_overall_design
## 12 !Series_overall_design
## 13           !Series_type
##                                                                                                                                                                        V2
## 8  We performed RNA sequencing of 19 pre-treatment formalin-fixed paraffin-embedded (FFPE) whole lymph node biopsies of cHL, inclusive of EBV-association and HIV status.
## 9                                                                                                                                                                        
## 10                                                                                                        ***************************************************************
## 11                  Raw files for human/patient samples were not submitted to GEO due to concerns about submitting personally identifiable sequence data for open access.
## 12                                                                                                        ***************************************************************
## 13                                                                                                                     Expression profiling by high throughput sequencing

They did RNA sequencing of 19 pre-treatment formalin-fixed paraffin-embedded (FFPE) whole lymph node biopsies of cHL (and those with EBV and/or HIV).

But from my notes above directly from their published article…

RNA sequencing used the whole transcriptome RNA sequencing on the FFPE tumors used above to extract RNA, get the cDNA or complementary DNA libraries to prepare, and then messenger RNA or mRNA sequencing was done as strand specific. Counts generated with aligning FASTQ files using STAR and then quantifying using Salmon. Only those samples passing quality control by library and sequencing kept. There were 21 samples, but 2 were identified as outliers by principal component analysis or PCA and their quality control pipelines, leaving 19 samples. Removing genes with no counts across all samples, then differential gene expression done followed by normalization using DESeq2. Differential Pathway Enrichment was done using the Gene Sequencing Variation Association package or GSVA package, limma, and normalizing counts by DESeq2 using the Hallmark Pathways data set. The cell type proportions as transcripts per million or TPM found estimates with CIBERSORTx in an output file from the Salmon pipeline. Default settings used because recommended of LM22 signature matrix file, no quantile normalization, and 100 permutations.

The column names of our data uses the prefixed YF02-YF23 for 19 samples. We should now rearrange those columns by their sample type into a different data frame. We already have the values from our grep to get the class type in the cHL, cHL_EBV, and cHL_EBV_HIV objects. Lets use those values to arrange our new data frame in that order to get the means to add to the last columns.

DataFrame <- counts[,c(cHL,cHL_EBV,cHL_EBV_HIV)]

paged_table(head(DataFrame))

We can see the column names are out of order, lets compare it to the names of our samples next to their class type.

paged_table(sampleTypeByID[,c(1,4)])

Lets add the genes from the row names as we don’t have the genes in this data and need them.

DataFrame$Gene_ID <- row.names(DataFrame)
paged_table(head(DataFrame))

Lets prepend the sample type to the IDs on table to make inferences on the tabular data if we need to later.

colnames(DataFrame)[1:5] <- paste(colnames(DataFrame)[1:5],sep='_', 'cHL')
colnames(DataFrame)[6:15] <- paste(colnames(DataFrame)[6:15], sep='_',"cHL_EBV")
colnames(DataFrame)[16:19] <- paste(colnames(DataFrame)[16:19],sep='_','cHL_EBV_HIV')

colnames(DataFrame)

##  [1] "YF08_cHL"         "YF13_cHL"         "YF17_cHL"         "YF22_cHL"        
##  [5] "YF23_cHL"         "YF03_cHL_EBV"     "YF05_cHL_EBV"     "YF07_cHL_EBV"    
##  [9] "YF09_cHL_EBV"     "YF11_cHL_EBV"     "YF12_cHL_EBV"     "YF14_cHL_EBV"    
## [13] "YF18_cHL_EBV"     "YF19_cHL_EBV"     "YF21_cHL_EBV"     "YF02_cHL_EBV_HIV"
## [17] "YF06_cHL_EBV_HIV" "YF10_cHL_EBV_HIV" "YF15_cHL_EBV_HIV" "Gene_ID"

Lets add the row means by class to the DataFrame.

DataFrame$cHL_Means <- rowMeans(DataFrame[,c(1:5)])
DataFrame$cHL_EBV_Means <- rowMeans(DataFrame[,c(6:15)])
DataFrame$cHL_EBV_HIV_Means <- rowMeans(DataFrame[,c(16:19)])

DataFrame[c(1:3),c(20:23)]

##             Gene_ID cHL_Means cHL_EBV_Means cHL_EBV_HIV_Means
## 5S_rRNA     5S_rRNA     0.000        0.0000           0.00000
## 5_8S_rRNA 5_8S_rRNA     0.000        0.0000           0.00000
## 7SK             7SK   358.906       86.1891          18.86475

Now we can do the fold change value, since all have cHL, we can use that as the baseline and compare fold change for EBV+HIV- and EBV+HIV+.

DataFrame$EBV_cHL_FoldChange <- DataFrame$cHL_EBV_Means/DataFrame$cHL_Means
DataFrame$EBV_and_HIV_cHL_FoldChange <- DataFrame$cHL_EBV_HIV_Means/DataFrame$cHL_Means

DataFrame[c(1:3),c(20:25)]

##             Gene_ID cHL_Means cHL_EBV_Means cHL_EBV_HIV_Means
## 5S_rRNA     5S_rRNA     0.000        0.0000           0.00000
## 5_8S_rRNA 5_8S_rRNA     0.000        0.0000           0.00000
## 7SK             7SK   358.906       86.1891          18.86475
##           EBV_cHL_FoldChange EBV_and_HIV_cHL_FoldChange
## 5S_rRNA                  NaN                        NaN
## 5_8S_rRNA                NaN                        NaN
## 7SK                0.2401439                 0.05256181

Lets look at summary statistics on this DataFrame.

summary(DataFrame)

##     YF08_cHL            YF13_cHL            YF17_cHL           YF22_cHL       
##  Min.   :     0.00   Min.   :     0.00   Min.   :    0.00   Min.   :    0.00  
##  1st Qu.:     0.00   1st Qu.:     0.00   1st Qu.:    0.00   1st Qu.:    0.00  
##  Median :     0.00   Median :     0.00   Median :    0.00   Median :    0.00  
##  Mean   :    44.39   Mean   :    19.34   Mean   :   32.09   Mean   :   22.61  
##  3rd Qu.:     0.00   3rd Qu.:     0.00   3rd Qu.:    0.00   3rd Qu.:    0.00  
##  Max.   :141120.00   Max.   :114953.00   Max.   :66979.00   Max.   :64755.00  
##                                                                               
##     YF23_cHL          YF03_cHL_EBV        YF05_cHL_EBV      
##  Min.   :0.000e+00   Min.   :     0.00   Min.   :     0.00  
##  1st Qu.:0.000e+00   1st Qu.:     0.00   1st Qu.:     0.00  
##  Median :0.000e+00   Median :     0.00   Median :     0.00  
##  Mean   :4.436e+01   Mean   :    27.47   Mean   :    57.78  
##  3rd Qu.:7.962e+00   3rd Qu.:     0.00   3rd Qu.:     0.00  
##  Max.   :1.742e+05   Max.   :148867.00   Max.   :289355.00  
##                                                             
##   YF07_cHL_EBV        YF09_cHL_EBV        YF11_cHL_EBV        YF12_cHL_EBV     
##  Min.   :     0.00   Min.   :     0.00   Min.   :     0.00   Min.   :    0.00  
##  1st Qu.:     0.00   1st Qu.:     0.00   1st Qu.:     0.00   1st Qu.:    0.00  
##  Median :     0.00   Median :     0.00   Median :     0.00   Median :    0.00  
##  Mean   :    35.23   Mean   :    49.25   Mean   :    25.05   Mean   :   18.25  
##  3rd Qu.:     0.00   3rd Qu.:     0.00   3rd Qu.:     0.00   3rd Qu.:    0.00  
##  Max.   :212856.00   Max.   :329748.00   Max.   :184276.00   Max.   :83420.00  
##                                                                                
##   YF14_cHL_EBV        YF18_cHL_EBV        YF19_cHL_EBV      
##  Min.   :     0.00   Min.   :     0.00   Min.   :     0.00  
##  1st Qu.:     0.00   1st Qu.:     0.00   1st Qu.:     0.00  
##  Median :     0.00   Median :     0.00   Median :     0.00  
##  Mean   :    22.86   Mean   :    24.21   Mean   :    22.27  
##  3rd Qu.:     0.00   3rd Qu.:     0.00   3rd Qu.:     0.00  
##  Max.   :159960.00   Max.   :159745.00   Max.   :143078.00  
##                                                             
##   YF21_cHL_EBV       YF02_cHL_EBV_HIV    YF06_cHL_EBV_HIV   
##  Min.   :     0.00   Min.   :     0.00   Min.   :     0.00  
##  1st Qu.:     0.00   1st Qu.:     0.00   1st Qu.:     0.00  
##  Median :     0.00   Median :     0.00   Median :     0.00  
##  Mean   :    35.71   Mean   :    38.15   Mean   :    58.37  
##  3rd Qu.:     0.00   3rd Qu.:     0.00   3rd Qu.:     8.00  
##  Max.   :179773.00   Max.   :169301.00   Max.   :299787.00  
##                                                             
##  YF10_cHL_EBV_HIV    YF15_cHL_EBV_HIV      Gene_ID            cHL_Means        
##  Min.   :     0.00   Min.   :     0.00   Length:59427       Min.   :     0.00  
##  1st Qu.:     0.00   1st Qu.:     0.00   Class :character   1st Qu.:     0.00  
##  Median :     0.00   Median :     0.00   Mode  :character   Median :     0.00  
##  Mean   :    59.98   Mean   :    63.84                      Mean   :    32.56  
##  3rd Qu.:    10.00   3rd Qu.:     0.00                      3rd Qu.:     6.80  
##  Max.   :252671.00   Max.   :246588.00                      Max.   :112394.40  
##                                                                                
##  cHL_EBV_Means       cHL_EBV_HIV_Means   EBV_cHL_FoldChange
##  Min.   :     0.00   Min.   :     0.00   Min.   :0.000     
##  1st Qu.:     0.00   1st Qu.:     0.00   1st Qu.:0.438     
##  Median :     0.00   Median :     0.00   Median :0.971     
##  Mean   :    31.81   Mean   :    55.08   Mean   :  Inf     
##  3rd Qu.:     4.70   3rd Qu.:     9.25   3rd Qu.:2.061     
##  Max.   :189107.80   Max.   :214479.25   Max.   :  Inf     
##                                          NA's   :35661     
##  EBV_and_HIV_cHL_FoldChange
##  Min.   :0.000             
##  1st Qu.:0.936             
##  Median :1.730             
##  Mean   :  Inf             
##  3rd Qu.:3.450             
##  Max.   :  Inf             
##  NA's   :35807

We have a lot of Inf when the number is so large it can’t be displayed on computer from our 0 means or when dividing a number by a very small number and we have a lot of NAs as NaNs from dividing a very small number or very large number into a very small number as you cannot divide by 0. You can make the numbers infinitesimally small to fit between 0 and 1 but those numbers can be very small. At this point we haven’t made a vector of the values we want to keep that are from the research article we read that found target genes and confirmed genes involved in therapeutic treatment of Hodgkins, EBV, or HIV.

Lets look at the notes above that mention genes of interest:

Top 5 mutated genes, as a percent of the 21 samples used and of the 46 genes found relevant in TMB, were KMT2D (33%), EP300 (29%), CARD11 (24%), CREBBP (24%), and EZH2 (24%). In figure 1 there are other genes with less importance by % of samples found.
whole exome sequencing was done to find cHL genomic alterations but with this bulk tumor, whole-exome sequencing, there were many mutations and many were frequently mutated genes or FLAGS like TTN and MUC16. Gene targets were refined by filtering for genes in the FoundationOne Heme 406 … came up with 46 genes in their ‘curated panel’ of relevant genes. The tumor mutational burden or TMB was calculated as the total number of mutations or at least 2 mutations per sample.
TIME assessment of cHL+EBV+HIV samples found 26 differentially expressed genes of XBP1, NCOR2, and various IGH genes. The cHL+EBV had 160 differentially expressed genes and also decreased expression of collagen and L ribosomal protein genes compared to cHL only. Genes common the cHL and cHL+EBV with altered gene expression were NCOR2, ARID1B, and ZFHX3.
The antibodies discussed at beginning of PD-L1, TIGIT, and MHC-II were used to observe expression of potential immune checkpoint inhibitor targets tumor biomarkers using immunohistochemistry staining. This comparison also helped to establish TIME. PD-L1 was expressed by the HRS cells in all classes of cHL, cHL+EBV, and cHL+EBV+HIV. MHC-II and TIGIT were variable or inconsistent across samples so that the EBER (tumor EBV expression), HIV, or other feature assessed was found not to be associated.
The cHL+EBV class also had depleted expression of P53, hypoxia, and epithelial-mesenchymal transition for enrichment pathways when compared to cHL only.
In analyzing TIME within 3 classes, they looked at cell type proportion analysis for TIME cellular composition. The M2 macrophages were the most represented cell type with more M2 in the cHL only class compared to cHL+EBV. The CD8+ T cells were significantly increased in the cHL+EBV+HIV class than the cHL only class. The CD4+ naive T-cell levels were higher in cHL+EBV than the cHL+EBV+HIV class.

Ok, so our genes are going to be added to a list:

TIME referenced M2, CD4, and CD8
ICIs referenced PD-L1, TIGIT, and MHC-II
cHL and cHL+EBV referenced NCOR2, ARID1B, and ZFHX3, and also P53 as TP53
cHL+EBV+HIV referenced XBP1 and NCOR2
FLAGS or frequently mutated genes in bulk sequencing referenced TTN and MUC16
TMB referenced top 5 of KMT2D, EP300, CARD11, CREBBP, and EZH2

Lets add these to a character string of genes from research study.

researchStudyGenes <- c("KMT2D","EP300","CARD11","CREBBP","EZH2", "TTN","MUC16", "XBP1","NCOR2","ARID1B","ZFHX3","TP53", "PD-L1","TIGIT","MHC-II","M2","CD4","CD8")

Lets see if any of these genes are in our DataFrame.

DataFrameResearched <- DataFrame[which(DataFrame$Gene_ID %in% researchStudyGenes),]
DataFrameResearched[,c(20,25)]

##        Gene_ID EBV_and_HIV_cHL_FoldChange
## ARID1B  ARID1B                  2.3832654
## CARD11  CARD11                  4.9572650
## CD4        CD4                  1.6534548
## CREBBP  CREBBP                  1.9744713
## EP300    EP300                  1.8721294
## EZH2      EZH2                  0.7481752
## KMT2D    KMT2D                  1.3122912
## MUC16    MUC16                  1.1629747
## NCOR2    NCOR2                  1.3716459
## TIGIT    TIGIT                  3.5623847
## TP53      TP53                  1.9667832
## TTN        TTN                  3.2353523
## XBP1      XBP1                  6.2879673
## ZFHX3    ZFHX3                  0.9466127

There are 13 of the 18 genes from our research article in this dataset of gene counts.

DataFrameResearched$Gene_ID

##  [1] "ARID1B" "CARD11" "CD4"    "CREBBP" "EP300"  "EZH2"   "KMT2D"  "MUC16" 
##  [9] "NCOR2"  "TIGIT"  "TP53"   "TTN"    "XBP1"   "ZFHX3"

researchStudyGenes[-which(DataFrameResearched$Gene_ID %in% researchStudyGenes)]

## [1] "MHC-II" "M2"     "CD4"    "CD8"

CD4 is listed in the data and it should say PD-L1 is missing instead of CD4, but MHC-II, M2, and CD8 are not. Four genes are missing.

We know there are the top 5 genes in TMB from the research study in this data and other genes important to class of this study. Lets order the fold change and get the top up regulated and top bottom regulated genes. We will omit the NaNs and Inf when ordering. but keep the genes in our DataFrameResearched data to combine to this data of fold change values.

DataFrameOrderedEBV_HIV <- DataFrame[order(DataFrame$EBV_and_HIV_cHL_FoldChange, decreasing=T),]

#remove the NaNs and Inf
#df <- df[!is.infinite(rowSums(df)),]
DataFrameOrderedEBV_HIV2 <- DataFrameOrderedEBV_HIV[ !is.nan(DataFrameOrderedEBV_HIV$EBV_and_HIV_cHL_FoldChange),]
DataFrameOrderedEBV_HIV3 <- DataFrameOrderedEBV_HIV2[!is.infinite(DataFrameOrderedEBV_HIV2$EBV_and_HIV_cHL_FoldChange),]
DataFrameOrderedEBV_HIV4 <- subset( DataFrameOrderedEBV_HIV3, DataFrameOrderedEBV_HIV3$EBV_and_HIV_cHL_FoldChange>0)

cHL_EBV_HIV_top10 <- DataFrameOrderedEBV_HIV4[c(1:5,17726:17730),]

paged_table(cHL_EBV_HIV_top10)

May of the genes with top rank in up or down regulation for EBV+HIV+ cHL are not genes of interest in only EBV+ cHL samples. Top genes of cHL+EBV+HIV are:

cHL_EBV_HIV_top10$Gene_ID

##  [1] "IGKV1D-39"  "RPS4Y1"     "HNRNPA1P70" "CCDC8"      "IGKV3-20"  
##  [6] "LTF"        "PES1P2"     "CTSLP3"     "MPO"        "MMP8"

DataFrameOrderedEBV <- DataFrame[order(DataFrame$EBV_cHL_FoldChange, decreasing=T ),]

DataFrameOrderedEBV2 <- DataFrameOrderedEBV[!is.nan(DataFrameOrderedEBV$EBV_cHL_FoldChange),]
DataFrameOrderedEBV3 <- DataFrameOrderedEBV2[!is.infinite(DataFrameOrderedEBV2$EBV_cHL_FoldChange),]
DataFrameOrderedEBV4 <- subset(DataFrameOrderedEBV3, DataFrameOrderedEBV3$EBV_cHL_FoldChange > 0)

cHL_EBV_top10 <- DataFrameOrderedEBV4[c(1:5,17543:17547),]

paged_table(cHL_EBV_top10)

Only one gene that is a top down regulated gene in cHL+EBV is not a gene with a remarkable fold change value of change in the cHL_EBV_HIV data. The top 10 up and down regulated genes for cHL+EBV are:

cHL_EBV_top10$Gene_ID

##  [1] "RPS4Y1"   "Z82243.1" "NR4A2"    "LERFS"    "VAMP5"    "POU2F3"  
##  [7] "COL16A1"  "SLC18A2"  "CPLX2"    "AICDA"

Lets add columns to each data set of top genes from our fold change values and the research then combine them into one data set.

cHL_EBV_HIV_top10$topGene <- "foldchange"
cHL_EBV_HIV_top10$topGene[1:5] <- 'top up regulated cHL_EBV_HIV'
cHL_EBV_HIV_top10$topGene[6:10] <- 'top down regulated cHL_EBV_HIV'

paged_table(cHL_EBV_HIV_top10[,c(20,25,26)])

cHL_EBV_top10$topGene <- 'top gene'
cHL_EBV_top10$topGene[1:5] <- 'top up regulated in cHL_EBV'
cHL_EBV_top10$topGene[6:10] <- 'top down regulated in cHL_EBV'

paged_table(cHL_EBV_top10[,c(20,24,26)])

DataFrameResearched$topGene <- 'GSE289903 gene target'
paged_table(DataFrameResearched[,c(20,24:26)])

Lets be more descriptive and add to topGene feature why in the study it was a target or referenced in the study.

TIME referenced M2, CD4, and CD8

ICIs referenced PD-L1, TIGIT, and MHC-II
cHL and cHL+EBV referenced NCOR2, ARID1B, and ZFHX3, and also P53 as TP53
cHL+EBV+HIV referenced XBP1 and NCOR2
FLAGS or frequently mutated genes in bulk sequencing referenced TTN and MUC16
TMB referenced top 5 of KMT2D, EP300, CARD11, CREBBP, and EZH2

DataFrameResearched$topGene[c(1,11,14)] <- 'cHL+EBV compared to cHL top DE gene'
DataFrameResearched$topGene[9] <- 'cHL, cHL+EBV, & cHL+EBV+HIV top DE gene'
DataFrameResearched$topGene[c(2,4:7)] <- 'top 5 genes ranked by tumor mutation burden TMB of RNA sequencing data cHL'
DataFrameResearched$topGene[c(8,12)] <- 'FLAG genes or frequently mutated genes in cHL'
DataFrameResearched$topGene[10] <- 'ICI immune complex inhibitor gene as well as PD-L1 and MHC-II'
DataFrameResearched$topGene[3] <- 'TIME tumor inflammation microenvironment gene as well as M2 macrophages and CD8 t-cells'
DataFrameResearched$topGene[13] <- 'top referenced gene with NCOR2 in cHL+EBV+HIV'

DataFrameResearched[,c(20,26)]

##        Gene_ID
## ARID1B  ARID1B
## CARD11  CARD11
## CD4        CD4
## CREBBP  CREBBP
## EP300    EP300
## EZH2      EZH2
## KMT2D    KMT2D
## MUC16    MUC16
## NCOR2    NCOR2
## TIGIT    TIGIT
## TP53      TP53
## TTN        TTN
## XBP1      XBP1
## ZFHX3    ZFHX3
##                                                                                        topGene
## ARID1B                                                     cHL+EBV compared to cHL top DE gene
## CARD11              top 5 genes ranked by tumor mutation burden TMB of RNA sequencing data cHL
## CD4    TIME tumor inflammation microenvironment gene as well as M2 macrophages and CD8 t-cells
## CREBBP              top 5 genes ranked by tumor mutation burden TMB of RNA sequencing data cHL
## EP300               top 5 genes ranked by tumor mutation burden TMB of RNA sequencing data cHL
## EZH2                top 5 genes ranked by tumor mutation burden TMB of RNA sequencing data cHL
## KMT2D               top 5 genes ranked by tumor mutation burden TMB of RNA sequencing data cHL
## MUC16                                            FLAG genes or frequently mutated genes in cHL
## NCOR2                                                  cHL, cHL+EBV, & cHL+EBV+HIV top DE gene
## TIGIT                            ICI immune complex inhibitor gene as well as PD-L1 and MHC-II
## TP53                                                       cHL+EBV compared to cHL top DE gene
## TTN                                              FLAG genes or frequently mutated genes in cHL
## XBP1                                             top referenced gene with NCOR2 in cHL+EBV+HIV
## ZFHX3                                                      cHL+EBV compared to cHL top DE gene

Now lets combine these genes into one data set.

topGenes <- rbind(cHL_EBV_HIV_top10,cHL_EBV_top10,DataFrameResearched)

paged_table(topGenes)

We have 34 genes that we can use in part 2 of this project to see how well they do in machine learning to predict class as classic Hodgkin’s lymphoma, classic Hodgkin’s lymphoma with EBV infection, or classic Hodgkin’s lymphoma with EBV and HIV infections.

We will do that on another day but keep checking in. For now we can write a few of these data sets to csv and use it later.

write.csv(topGenes,'top34Genes_cHL_EBV_HIV.csv',row.names=F)
write.csv(sampleTypeByID,'sampleTypeByID_cHL_EBV_HIV.csv',row.names=F)
write.csv(DataFrameOrderedEBV, 'DataFrameAll_59427_orderedEBV.csv',row.names=F)

You can get the topGenes data here. Find the sampleTypeByID data set here. And get the total gene set ordered by EBV over cHL only here.

Thanks so much and look for part 2 soon. That part will have more analysis and making inferences on the duplicate genes in common and those that are more up regulated in the cHL+EBV+HIV class compared to the cHL+EBV, and also see how the genes of the research study did in fold change values in the data provided by fold change comparisons to the cHL only samples that were our baseline as all samples had at minimum classic Hodgkin’s lymphoma. And then we will test these gene sets separately and together in the machine learning algorithms we have come to use most of the time to see if these genes make good predictors of these 3 classes of samples.

We will also go back to the subtypes of the samples possibly before or after making the data and testing those matrices of predictors in predicting our 3 class samples, and see if the subtypes do give any additional information as the study said the genes from the subtypes they found by class type didn’t matter on any of the clinical features of age, gender, subtype, and so on but by TIME, TCR, and TMB.

Classic Hodgkin’s Lymphoma, Epstein-Barr Virus, and Human Immunodeficiency Virus samples to explore Biomarkers in neoplasm RNA samples GSE289903

Janis Harris

2026-03-25