This is Part 3 to the GSE305165 research attached to the end of Part 2 that is attached to the end of Part 1 in analyzing the data of CEL files to extract information and then use that information to predict the classes known and the classes that were found in the study that accompanies this gene expression data on Epstein-Barr Viral (EBV) infection in the elderly population with large B-cell lymphomas of monophasic diffuse (mDLBCL), polyphasic diffuse (pDLBCL), and classic Hodgkin’s Lymphoma (CHL).

In part 2 we got to the part of finding a way to find change or variation between the sample group means divided by sample group medians with 15 groups from the 47 samples into the demographic and class of lymphoma. We need to find the subgroups of the class groups for gender and age equal to or less than 72 and greater than 72 years of age. There was one missed group on the mDLBCL older than 72 change effects of avg/median that was corrected in this Part 3 of project.

Scroll to the ten astericks for Part 3 **********.

=======================================================================

This is part 2, we didn’t get any fold change values due to the samples being same type of large B cell lymphoma but different states depending on immune integrity of senescence or escape or otherwise fine. We did get the means and medians in part 1 after importing and converting the CEL files into a dataframe to work with after configurations that fixed the Rstudio environment of bioconductor but not knitr.

In this Part 2 we will be using an alternate to fold change which is just for sake of having a value that can translate variation by using the standard deviation in a way of the summary statistics for each gene in each sample of CHL, pDLBCL, mDLBCL, CHL and male, CHL and female, pDLBCL and male, pDLBCL and female, mDLBCL and male, mDLBCL and female, CHL and younger or equal to 72 years of age, CHL and older than 72 years of age, pDLBCL and younger or equal to 72 years of age, pDLBCL and older than 72 years of age, mDLBCL and younger or equal to 72 years of age, or mDLBCL and older than 72 years of age.

For this Part 2, we will be dividing the average estimated gene expression value per group and subgroup we identified by the median real sample group or subgroup value of the middle value. We will see which genes have the most variation based on their average or mean value where outliers can greatly skew to be less or more than the real value, and the median value. So if a gene had a group average of 100, but the median was 72 for 3 samples, then the value of 100/72 will be greater than 1 and over estimated, but if the average is 50 and the real median value is 72, then the value will be underestimated and less than 1. For those values greatly over or under estimated we will take those top genes per group. Then run a random forest machine learning model on it to see how well those genes and the genes in the study can predict a 3 class model of CHL, pDLBCL, or mDLBCL. And then how well a 4 class model of the classes this research study proposed of IS for immune senescence or low immune defenses but still there in the elderly and immune compromised. None of the patients here are otherwise immune compromised and they made sure to eliminate anybody who was, but these are elderly patients from 50-94 years of age with 25% of the 47 samples under 64 years and 25% over 82 years of age or about that might be 84. Immune senscence is a natural state of aging as our thymus (General Anatomy) is inactive after puberty and starts shrinking with very little use in making new T cells but by the age of 65 will be entirely inactive as our bodies rely on the T cells our bodies have floating around waiting to be initiated by plasma cells and macrophages of the B cells.

We will see how well the study genes predict the 4 classes as well as 3 classes and our top genes per group from taking the average per group divided by the median of each group and taking the top 20 that is made up of top 10 removing NaNs and Infinites and 0s that are over estimated the most and top 10 same filtering method for under estimated.

Scroll to the ***** to mark the beginning of Part 2 for analysis and machine learning.

===========================================================================

*** Part 1

Hello all we will be analyzing gene expression data from GSE305165. There is a published article that I took the following information from after reading it once and highlighting the important aspects of the study.

This research used Diffuse Large B-Cell Lymphoma or DLBCL to compare it to Classical Hodgkin’s Lymphoma or CHL in Epstein-Barr Virus (EBV) infected patients. The two types of lymphoma affect the elderly populations and have overlap between biomarkers such as IDO1, EBV latent type 2 is specific to CHL, but can be seen in a polymorphic type DLBCL called pDLBCL, and EBV latent type 3 seen mostly in monomorphic DLBCL labeled mDLBCL, but also in pDLBCL. The study saw that the typical region of chromosome 9 specific to high variations of genes in this loci of 9.24 in CHL, had some variations noticed within pDLBCL and mDLBCL. Overall, this study used populations 50 years old or older with no autoimmune or immunodeficient pathologies. However, in the elderly populations there is a natural decline of immune response to pathogens and antigens called immune senescence or IS, and in seriously impacted disease state of the DLBCL there can be a more fatal condition of immune escape, an actual term for both that means pathogens and antigens escape detection by the host immune system and have the ability to make changes that can lead to the host’s death.

This is a very interesting study, not too difficult to read, but overall, these researchers have decides that their clustering of heirarchical did a great job at separating the differences between the classes of large B-cell Lymphomas. They decided that CHL and DLBCLs of pDLBCL and mDLBCL are not separate diseases but the same type of disease where there is a 4th transitional state of disease that overlaps with pDLBCL and CHL that has low interferon gamma.

The four states are the 1st group which IS group which is the mDLBCL that is EBNA2 positive and EBV latent type 3, the 2nd group which is the CHL group that is high in variations at loci of chromosome 9 at 9p24.1 and high in PDL1 gene expression also only EBV latent type 2 gene expression and EBNA2 negative, the 3rd group which is the pDLBCL that is high in interferon gamma or IFN-g and low in variations of 9p24.1 with high gene expression of IDO1 that lead to immune escape and high chance of getting poor prognosis of hemocytic lymphocytosis called HLH that can lead to demise, and the 4th group that is the transition between CHL and pDLBCL where the IFN-g is low and characteristics unlike the other 3 groups as not otherwise specified or NOS.

The study uses 57 samples where 35 are DLBCL with 12 being pDLBCL and 23 being mDLBCL, and the other 22 samples are CHL. All samples have confirmed EBV and no immune deficiency or pathology other than Lymphoma and normal affects of aging in IS.

However, there are only 47 samples in the GSE305605 gene expression omnibus or GEO link above. We will be working with 47 samples.

library(rmarkdown)

## Warning: package 'rmarkdown' was built under R version 4.5.3

series <- read.table("GSE305165_series_matrix.txt/GSE305165_series_matrix.txt", skip=31, header=T, nrow=29)

paged_table(series)

The GSM ID is row 1, age is row 10, diagnosis is row 9 of EBV+CHL, EBV+pDLBCL, or EBV+mDLBCL, gender is row 11, group is row 19. Lets make a table of only those 4 features.

series4 <- series[c(1,9,10,11,19),]

paged_table(series4)

The groups have mostly stuck by definitions but show overlap as the CHL groups should be high 9p24.1 variations but some mDLBCL are also high 9p24.1, and IFNG-L should be mDLBCL but some CHL are classified as this instead of high 9p24.1 and at least one pDLBCL sample, and mDLBCL should all be IS, but some are IFNG-L or 9p24.1 variation high, or even at least one sample is IFNG-H. The study said there was some overlap between the samples, but that most all the latent type 2 EBV were CHL or high 9p24.1. We can still use it to show over lap.

There are 47 samples, 10 must have dropped out and not wanted information shared or unable to share it. The published article said there were 57 samples. Lets see how many samples are here based on diagnosis.

Lets see how many groups.

group <- series4[5,c(2:48)]
group_t <- data.frame(t(group))
colnames(group_t) <- 'group'
group_t$group <- gsub("group","", group_t$group)
table(group_t$group)

## 
## 9p24.1-H    IFNG-H    IFNG-L        IS  
##         9         9        18        10

This is the 4 subtypes of lymphoma the study produced and says the transition state is the one with low IFNG or IFNG-L and not otherwise specified findings. The IS is immune sequesence of mDLBCL, IFNG-H is supposed to be the pDLBCL, and 9p24.1-H is high variations in gene copies at locus 9p24.1 on chromosome 9 for CHL. All of these lymphomas have confirmed EBV infection.

Now for the number samples in each diagnosis.

dx <- series4[2,c(2:48)]
dx_t <- data.frame(t(dx))

colnames(dx_t) <- 'diagnosis'

table(dx_t$diagnosis)

## 
##    diagnosis: EBV+ CHL diagnosis: EBV+ mDLBCL diagnosis: EBV+ pDLBCL 
##                     19                     20                      8

There are 19 EBV+CHL cases, 20 EBV+mDLBCL, and 8 EBV+pDLBCL.

Lets see the age range summary stats.

age <- series4[3,c(2:48)]
age_t <- data.frame(t(age))
colnames(age_t) <- "Age"
age_t$Age <- gsub("age: ","",age_t$Age)
age_t$Age <- as.numeric(age_t$Age)

summary(age_t$Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   50.00   62.50   74.00   72.11   79.50   94.00

The age is 50 years old as the youngest, with a median age of 74 years of age for all 47 patients’ ages lined up in order least to most, with a mean age as the average age for these 47 patients being 72 years old. The oldest is 94 years old. More than 75% of the people are older than 62 years of age and more than half the patients are older than 72 years old, with 25 % of the patients older than almost 80 years of age and 25% of the patients between 50 to 62 years of age.

Lets look at the gender balance of men to women in this study.

gender <- series4[4,c(2:48)]

gender_t <- data.frame(t(gender));
colnames(gender_t) <- 'gender'

gender_t$gender <- gsub("Sex: ", "" , gender_t$gender)

table(gender_t$gender)

## 
## female   male 
##     12     35

There are mostly male in this research study with 35 males and 12 females spread about all samples of EBV+CHL, EBV+pDLBCL, and EBV+mDLBCL.

Lets make a sample GSM ID table as well.

ID <- series4[1,c(2:48)]

ID_t <- data.frame(t(ID))

colnames(ID_t) <- "sampleID"
ID_t

##                        sampleID
## Case02_lymphoma_FFPE GSM9163281
## Case03_lymphoma_FFPE GSM9163282
## Case06_lymphoma_FFPE GSM9163283
## Case08_lymphoma_FFPE GSM9163284
## Case09_lymphoma_FFPE GSM9163285
## Case10_lymphoma_FFPE GSM9163286
## Case11_lymphoma_FFPE GSM9163287
## Case12_lymphoma_FFPE GSM9163288
## Case13_lymphoma_FFPE GSM9163289
## Case14_lymphoma_FFPE GSM9163290
## Case15_lymphoma_FFPE GSM9163291
## Case16_lymphoma_FFPE GSM9163292
## Case17_lymphoma_FFPE GSM9163293
## Case19_lymphoma_FFPE GSM9163294
## Case20_lymphoma_FFPE GSM9163295
## Case21_lymphoma_FFPE GSM9163296
## Case22_lymphoma_FFPE GSM9163297
## Case23_lymphoma_FFPE GSM9163298
## Case24_lymphoma_FFPE GSM9163299
## Case25_lymphoma_FFPE GSM9163300
## Case26_lymphoma_FFPE GSM9163301
## Case27_lymphoma_FFPE GSM9163302
## Case29_lymphoma_FFPE GSM9163303
## Case30_lymphoma_FFPE GSM9163304
## Case31_lymphoma_FFPE GSM9163305
## Case32_lymphoma_FFPE GSM9163306
## Case34_lymphoma_FFPE GSM9163307
## Case35_lymphoma_FFPE GSM9163308
## Case36_lymphoma_FFPE GSM9163309
## Case37_lymphoma_FFPE GSM9163310
## Case38_lymphoma_FFPE GSM9163311
## Case39_lymphoma_FFPE GSM9163312
## Case40_lymphoma_FFPE GSM9163313
## Case41_lymphoma_FFPE GSM9163314
## Case42_lymphoma_FFPE GSM9163315
## Case43_lymphoma_FFPE GSM9163316
## Case44_lymphoma_FFPE GSM9163317
## Case45_lymphoma_FFPE GSM9163318
## Case46_lymphoma_FFPE GSM9163319
## Case49_lymphoma_FFPE GSM9163320
## Case50_lymphoma_FFPE GSM9163321
## Case51_lymphoma_FFPE GSM9163322
## Case52_lymphoma_FFPE GSM9163323
## Case53_lymphoma_FFPE GSM9163324
## Case55_lymphoma_FFPE GSM9163325
## Case56_lymphoma_FFPE GSM9163326
## Case57_lymphoma_FFPE GSM9163327

Lets make a table of the diagnosis

Lets make a table of these 5 characteristics.

characteristics_df <- cbind(ID_t, dx_t, group_t, age_t, gender_t)

paged_table(characteristics_df)

====================================================

*** Attempt 3 (attempt 1 and attempt 2 omitted but are in Part 1 and Part 2)

Got it — you want to read CEL files (Affymetrix microarray data) in Bioconductor 3.22 with R 4.5. Here’s a concise, up-to-date approach that works with the current Bioconductor ecosystem.

Install Bioconductor and Required Packages

Copy code # Install BiocManager if not already installed

install.packages("BiocManager")

Ensure you’re using Bioconductor 3.22

BiocManager::install(version = "3.22")

Install packages for reading CEL files

BiocManager::install(c("affy", "oligo"))

Reading CEL Files (Two Common Approaches)

Option A – Using affy (classic Affymetrix arrays)

Copy code

library(affy)

## Loading required package: BiocGenerics

## Loading required package: generics

## 
## Attaching package: 'generics'

## The following objects are masked from 'package:base':
## 
##     as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
##     setequal, union

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, saveRDS, table, tapply, unique,
##     unsplit, which.max, which.min

## Loading required package: Biobase

## Warning: package 'Biobase' was built under R version 4.5.3

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

Read all CEL files from a directory

data <- ReadAffy(celfile.path = "data/")

the data file in Rstudio with affy library

Inspect

summary(data)

##    Length     Class      Mode 
##        47 AffyBatch        S4

Option B – Using oligo (newer arrays, e.g., Gene ST, Exon ST)

Copy code

library(oligo)

## Loading required package: oligoClasses

## Welcome to oligoClasses version 1.72.0

## 
## Attaching package: 'oligoClasses'

## The following object is masked from 'package:affy':
## 
##     list.celfiles

## Loading required package: Biostrings

## Loading required package: S4Vectors

## Warning: package 'S4Vectors' was built under R version 4.5.3

## Loading required package: stats4

## 
## Attaching package: 'S4Vectors'

## The following object is masked from 'package:utils':
## 
##     findMatches

## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname

## Loading required package: IRanges

## 
## Attaching package: 'IRanges'

## The following object is masked from 'package:grDevices':
## 
##     windows

## Loading required package: XVector

## Loading required package: Seqinfo

## 
## Attaching package: 'Biostrings'

## The following object is masked from 'package:base':
## 
##     strsplit

## ================================================================================

## Welcome to oligo version 1.74.0

## ================================================================================

## 
## Attaching package: 'oligo'

## The following objects are masked from 'package:affy':
## 
##     intensity, MAplot, mm, mm<-, mmindex, pm, pm<-, pmindex,
##     probeNames, rma

Read CEL files

This uses Attempt 2’s version of placing each CEL file from individual folder into one folder we named GSE305165.

library(GEOquery)

## Setting options('download.file.method.GEOquery'='auto')

## Setting options('GEOquery.inmemory.gpl'=FALSE)

getGEOSuppFiles("GSE305165")

## Using locally cached version of supplementary file(s) GSE305165 found here:
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar

##                                                                                                                      size
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar 46592000
##                                                                                                                  isdir
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar FALSE
##                                                                                                                  mode
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar  666
##                                                                                                                                mtime
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar 2026-04-09 16:26:11
##                                                                                                                                ctime
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar 2026-04-09 16:26:06
##                                                                                                                                atime
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar 2026-04-12 23:08:47
##                                                                                                                  exe
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar  no
##                                                                                                                  uname
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar jlcor
##                                                                                                                        udomain
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar DATAMASSAGER1
##                                                                                                                              fname
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar GSE305165_RAW.tar
##                                                                                                                                                                                                         destdir
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165
##                                                                                                                                                                                                                          filepath
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar
##                                                                                                                        GEO
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar GSE305165

untar("GSE305165/GSE305165_RAW.tar", exdir="data/")

cel_files <- list.celfiles("GSE305165/", full.names = TRUE)

data <- read.celfiles(cel_files)

## Loading required package: pd.clariom.s.human

## Loading required package: RSQLite

## Loading required package: DBI

## Platform design info loaded.

## Reading in : GSE305165/GSM9163281_02_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163282_03_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163283_06_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163284_08_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163285_09_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163286_10_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163287_11_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163288_12_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163289_13_Clariom_S_Human_2.CEL
## Reading in : GSE305165/GSM9163290_14_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163291_15_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163292_16_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163293_17_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163294_19_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163295_20_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163296_21_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163297_22_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163298_23_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163299_24_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163300_25_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163301_26_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163302_27_Clariom_S_Human_2.CEL
## Reading in : GSE305165/GSM9163303_29_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163304_30_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163305_31_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163306_32_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163307_34_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163308_35_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163309_36_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163310_37_Clariom_S_Human_2.CEL
## Reading in : GSE305165/GSM9163311_38_Clariom_S_Human_2.CEL
## Reading in : GSE305165/GSM9163312_39_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163313_40_Clariom_S_Human_2.CEL
## Reading in : GSE305165/GSM9163314_41_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163315_42_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163316_43_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163317_44_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163318_45_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163319_46_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163320_49_Clariom_S_Human_2.CEL
## Reading in : GSE305165/GSM9163321_50_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163322_51_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163323_52_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163324_53_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163325_55_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163326_56_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163327_57_Clariom_S_Human_.CEL

Inspect

data

## ExpressionFeatureSet (storageMode: lockedEnvironment)
## assayData: 300304 features, 47 samples 
##   element names: exprs 
## protocolData
##   rowNames: GSM9163281_02_Clariom_S_Human_.CEL
##     GSM9163282_03_Clariom_S_Human_.CEL ...
##     GSM9163327_57_Clariom_S_Human_.CEL (47 total)
##   varLabels: exprs dates
##   varMetadata: labelDescription channel
## phenoData
##   rowNames: GSM9163281_02_Clariom_S_Human_.CEL
##     GSM9163282_03_Clariom_S_Human_.CEL ...
##     GSM9163327_57_Clariom_S_Human_.CEL (47 total)
##   varLabels: index
##   varMetadata: labelDescription channel
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: pd.clariom.s.human

The data in oligo looks like this in Rstudio

The rest works in Rstudio after correcting for the package of bioconductor made for my version of R, but knitr stops working right here.

normalized.data <- rma(data)

Background correcting Normalizing Calculating Expression

normalized.data

ExpressionSet (storageMode: lockedEnvironment) assayData: 27189 features, 47 samples element names: exprs protocolData rowNames: GSM9163281_02_Clariom_S_Human_.CEL GSM9163282_03_Clariom_S_Human_.CEL … GSM9163327_57_Clariom_S_Human_.CEL (47 total) varLabels: exprs dates varMetadata: labelDescription channel phenoData rowNames: GSM9163281_02_Clariom_S_Human_.CEL GSM9163282_03_Clariom_S_Human_.CEL … GSM9163327_57_Clariom_S_Human_.CEL (47 total) varLabels: index varMetadata: labelDescription channel featureData: none experimentData: use ‘experimentData(object)’ Annotation: pd.clariom.s.human

normalized.expr <- exprs(normalized.data)

The row names are the probe IDs in Affymetrix for this cdf Clariom_s_human.

normalized.expr <- as.data.frame(exprs(normalized.data))

Writing it out to read it in to run the rest of the code that will allow other programs knitr isn’t picking on that finally my Rstudio version with R and bioconductor is working with.

write.csv(normalized.expr,'normalized.expr1.csv', row.names=T)

normalized.expr <- read.csv('normalized.expr1.csv', header=T, row.names=1)

paged_table(normalized.expr[1:10,])

output because knitr is having problems with this

The above only shows the first 10 rows, but all 47 samples are there with long ID names per sample.

Map probe IDs into gene symbols.

gse <- getGEO("GSE305165", GSEMatrix = T)

Found 1 file(s) GSE305165_series_matrix.txt.gz Using locally cached version: C:/GSE305165_series_matrix.txt.gz Using locally cached version of GPL23159 found here: C:/GPL23159.soft.gz

The above produces a large list.

gse

$GSE305165_series_matrix.txt.gz ExpressionSet (storageMode: lockedEnvironment) assayData: 21448 features, 47 samples element names: exprs protocolData: none phenoData sampleNames: GSM9163281 GSM9163282 … GSM9163327 (47 total) varLabels: title geo_accession … Sex:ch1 (33 total) varMetadata: labelDescription featureData featureNames: TC0100006437.hg.1 TC0100006476.hg.1 … TSUnmapped00000823.hg.1 (21448 total) fvarLabels: ID probeset_id … SPOT_ID.1 (10 total) fvarMetadata: Column Description labelDescription experimentData: use ‘experimentData(object)’ pubMedIds: 41371409 Annotation: GPL23159

The gse Large list in Rstudio. gse Large list details in source code object view in rstudio

Next get the feature IDs and store it.

feature.data  <- gse$GSE305165_series_matrix.txt.gz@featureData@data

The above produces additional information with alternate gene ID symbols to the Probe IDs as well as other information.

paged_table(feature.data[1:10,])

1st screen 2nd screen 3rd screen

paged_table(feature.data[1:10,])

subset to only get the gene symbols and probe IDs. This is column 1 and 10 but 10 is a mix of various groups in one ID field.

feature.data1 <- feature.data[,c(1,10)]

paged_table(feature.data1[1:20,])

Add an ID column to the normalized.expr data frame of samples by probe ID.

normalized.expr$ID <- row.names(normalized.expr)

paged_table(normalized.expr[1:10,45:48])

The above table shows only first 10 rows and last 4 columns with the ID column added to it.

str(normalized.expr)

## 'data.frame':    27189 obs. of  48 variables:
##  $ GSM9163281_02_Clariom_S_Human_.CEL : num  5.31 5.68 4.97 3.97 3.53 ...
##  $ GSM9163282_03_Clariom_S_Human_.CEL : num  5.49 4.69 5.15 4.57 2.77 ...
##  $ GSM9163283_06_Clariom_S_Human_.CEL : num  4.75 5.03 5.03 3.7 3.66 ...
##  $ GSM9163284_08_Clariom_S_Human_.CEL : num  5.68 5.65 4.29 3.57 3.35 ...
##  $ GSM9163285_09_Clariom_S_Human_.CEL : num  4.82 5.68 4.92 3.86 4.89 ...
##  $ GSM9163286_10_Clariom_S_Human_.CEL : num  5.52 5.59 4.1 3.6 3.8 ...
##  $ GSM9163287_11_Clariom_S_Human_.CEL : num  5.45 5.56 4.91 3.97 3.39 ...
##  $ GSM9163288_12_Clariom_S_Human_.CEL : num  5.77 5.88 3.76 3.98 3.33 ...
##  $ GSM9163289_13_Clariom_S_Human_2.CEL: num  6.18 5.37 4.53 4.24 3.62 ...
##  $ GSM9163290_14_Clariom_S_Human_.CEL : num  5.59 5.7 4.19 4.45 3.64 ...
##  $ GSM9163291_15_Clariom_S_Human_.CEL : num  5.51 5.72 4.87 3.78 3.45 ...
##  $ GSM9163292_16_Clariom_S_Human_.CEL : num  5.66 5.58 4.91 4.59 3.11 ...
##  $ GSM9163293_17_Clariom_S_Human_.CEL : num  5.02 5.12 3.33 3.91 3.39 ...
##  $ GSM9163294_19_Clariom_S_Human_.CEL : num  5.2 5.75 4.15 4.09 3.39 ...
##  $ GSM9163295_20_Clariom_S_Human_.CEL : num  5.32 5.23 4.47 3.88 3.49 ...
##  $ GSM9163296_21_Clariom_S_Human_.CEL : num  5.8 5.97 4.6 3.86 3.7 ...
##  $ GSM9163297_22_Clariom_S_Human_.CEL : num  5.67 5.07 4.27 4.07 2.79 ...
##  $ GSM9163298_23_Clariom_S_Human_.CEL : num  6.01 5.89 3.89 4.14 3.54 ...
##  $ GSM9163299_24_Clariom_S_Human_.CEL : num  6.05 5.81 4.51 4.2 3.16 ...
##  $ GSM9163300_25_Clariom_S_Human_.CEL : num  5.51 5.73 4.13 4.18 3.73 ...
##  $ GSM9163301_26_Clariom_S_Human_.CEL : num  6.31 5.9 4.62 4.2 3.18 ...
##  $ GSM9163302_27_Clariom_S_Human_2.CEL: num  5.9 5.39 4.52 3.58 4.07 ...
##  $ GSM9163303_29_Clariom_S_Human_.CEL : num  5.39 5.14 4.58 4.6 3.69 ...
##  $ GSM9163304_30_Clariom_S_Human_.CEL : num  5.81 6.15 4.56 4.19 3.71 ...
##  $ GSM9163305_31_Clariom_S_Human_.CEL : num  5.67 5.64 5.28 4.02 3.57 ...
##  $ GSM9163306_32_Clariom_S_Human_.CEL : num  5.92 6.06 5.41 4.37 3.46 ...
##  $ GSM9163307_34_Clariom_S_Human_.CEL : num  5.68 6.04 4.29 4.41 3.4 ...
##  $ GSM9163308_35_Clariom_S_Human_.CEL : num  5.79 6.39 4.79 4.35 3.77 ...
##  $ GSM9163309_36_Clariom_S_Human_.CEL : num  5.14 5.21 3.79 4.09 3.96 ...
##  $ GSM9163310_37_Clariom_S_Human_2.CEL: num  6.19 5.26 3.8 4.27 3.49 ...
##  $ GSM9163311_38_Clariom_S_Human_2.CEL: num  5.75 5.85 3.84 3.86 3.39 ...
##  $ GSM9163312_39_Clariom_S_Human_.CEL : num  6.33 5.98 4.89 3.95 3.7 ...
##  $ GSM9163313_40_Clariom_S_Human_2.CEL: num  5.42 5.84 4.49 4.47 3.35 ...
##  $ GSM9163314_41_Clariom_S_Human_.CEL : num  6.33 5.88 4.85 4.35 3.17 ...
##  $ GSM9163315_42_Clariom_S_Human_.CEL : num  5.66 6.07 4.89 4.01 3.4 ...
##  $ GSM9163316_43_Clariom_S_Human_.CEL : num  6.12 6.08 4.44 3.98 3.24 ...
##  $ GSM9163317_44_Clariom_S_Human_.CEL : num  5.96 6.39 5.34 4.23 3.86 ...
##  $ GSM9163318_45_Clariom_S_Human_.CEL : num  4.72 6.16 4.92 4.3 4.12 ...
##  $ GSM9163319_46_Clariom_S_Human_.CEL : num  5.67 5.72 4.81 4.82 3.25 ...
##  $ GSM9163320_49_Clariom_S_Human_2.CEL: num  6.25 5.98 4.73 3.83 3.47 ...
##  $ GSM9163321_50_Clariom_S_Human_.CEL : num  5.95 5.9 4.69 4.1 3.35 ...
##  $ GSM9163322_51_Clariom_S_Human_.CEL : num  6.25 5.73 4.18 4.26 3.17 ...
##  $ GSM9163323_52_Clariom_S_Human_.CEL : num  5.71 6.36 4.4 4.1 3.69 ...
##  $ GSM9163324_53_Clariom_S_Human_.CEL : num  6.16 5.56 4.9 4.23 3.12 ...
##  $ GSM9163325_55_Clariom_S_Human_.CEL : num  6.85 5.68 4.01 3.83 3.5 ...
##  $ GSM9163326_56_Clariom_S_Human_.CEL : num  5.68 5.79 4.03 4.39 3.28 ...
##  $ GSM9163327_57_Clariom_S_Human_.CEL : num  6.29 5.94 4.5 4.19 3.38 ...
##  $ ID                                 : chr  "23064070" "23064071" "23064072" "23064073" ...

normalized.expr <- inner_join(normalized.expr,feature.data1, by="ID")

str(normalized.expr)

## 'data.frame':    27189 obs. of  48 variables:
##  $ GSM9163281_02_Clariom_S_Human_.CEL : num  5.31 5.68 4.97 3.97 3.53 ...
##  $ GSM9163282_03_Clariom_S_Human_.CEL : num  5.49 4.69 5.15 4.57 2.77 ...
##  $ GSM9163283_06_Clariom_S_Human_.CEL : num  4.75 5.03 5.03 3.7 3.66 ...
##  $ GSM9163284_08_Clariom_S_Human_.CEL : num  5.68 5.65 4.29 3.57 3.35 ...
##  $ GSM9163285_09_Clariom_S_Human_.CEL : num  4.82 5.68 4.92 3.86 4.89 ...
##  $ GSM9163286_10_Clariom_S_Human_.CEL : num  5.52 5.59 4.1 3.6 3.8 ...
##  $ GSM9163287_11_Clariom_S_Human_.CEL : num  5.45 5.56 4.91 3.97 3.39 ...
##  $ GSM9163288_12_Clariom_S_Human_.CEL : num  5.77 5.88 3.76 3.98 3.33 ...
##  $ GSM9163289_13_Clariom_S_Human_2.CEL: num  6.18 5.37 4.53 4.24 3.62 ...
##  $ GSM9163290_14_Clariom_S_Human_.CEL : num  5.59 5.7 4.19 4.45 3.64 ...
##  $ GSM9163291_15_Clariom_S_Human_.CEL : num  5.51 5.72 4.87 3.78 3.45 ...
##  $ GSM9163292_16_Clariom_S_Human_.CEL : num  5.66 5.58 4.91 4.59 3.11 ...
##  $ GSM9163293_17_Clariom_S_Human_.CEL : num  5.02 5.12 3.33 3.91 3.39 ...
##  $ GSM9163294_19_Clariom_S_Human_.CEL : num  5.2 5.75 4.15 4.09 3.39 ...
##  $ GSM9163295_20_Clariom_S_Human_.CEL : num  5.32 5.23 4.47 3.88 3.49 ...
##  $ GSM9163296_21_Clariom_S_Human_.CEL : num  5.8 5.97 4.6 3.86 3.7 ...
##  $ GSM9163297_22_Clariom_S_Human_.CEL : num  5.67 5.07 4.27 4.07 2.79 ...
##  $ GSM9163298_23_Clariom_S_Human_.CEL : num  6.01 5.89 3.89 4.14 3.54 ...
##  $ GSM9163299_24_Clariom_S_Human_.CEL : num  6.05 5.81 4.51 4.2 3.16 ...
##  $ GSM9163300_25_Clariom_S_Human_.CEL : num  5.51 5.73 4.13 4.18 3.73 ...
##  $ GSM9163301_26_Clariom_S_Human_.CEL : num  6.31 5.9 4.62 4.2 3.18 ...
##  $ GSM9163302_27_Clariom_S_Human_2.CEL: num  5.9 5.39 4.52 3.58 4.07 ...
##  $ GSM9163303_29_Clariom_S_Human_.CEL : num  5.39 5.14 4.58 4.6 3.69 ...
##  $ GSM9163304_30_Clariom_S_Human_.CEL : num  5.81 6.15 4.56 4.19 3.71 ...
##  $ GSM9163305_31_Clariom_S_Human_.CEL : num  5.67 5.64 5.28 4.02 3.57 ...
##  $ GSM9163306_32_Clariom_S_Human_.CEL : num  5.92 6.06 5.41 4.37 3.46 ...
##  $ GSM9163307_34_Clariom_S_Human_.CEL : num  5.68 6.04 4.29 4.41 3.4 ...
##  $ GSM9163308_35_Clariom_S_Human_.CEL : num  5.79 6.39 4.79 4.35 3.77 ...
##  $ GSM9163309_36_Clariom_S_Human_.CEL : num  5.14 5.21 3.79 4.09 3.96 ...
##  $ GSM9163310_37_Clariom_S_Human_2.CEL: num  6.19 5.26 3.8 4.27 3.49 ...
##  $ GSM9163311_38_Clariom_S_Human_2.CEL: num  5.75 5.85 3.84 3.86 3.39 ...
##  $ GSM9163312_39_Clariom_S_Human_.CEL : num  6.33 5.98 4.89 3.95 3.7 ...
##  $ GSM9163313_40_Clariom_S_Human_2.CEL: num  5.42 5.84 4.49 4.47 3.35 ...
##  $ GSM9163314_41_Clariom_S_Human_.CEL : num  6.33 5.88 4.85 4.35 3.17 ...
##  $ GSM9163315_42_Clariom_S_Human_.CEL : num  5.66 6.07 4.89 4.01 3.4 ...
##  $ GSM9163316_43_Clariom_S_Human_.CEL : num  6.12 6.08 4.44 3.98 3.24 ...
##  $ GSM9163317_44_Clariom_S_Human_.CEL : num  5.96 6.39 5.34 4.23 3.86 ...
##  $ GSM9163318_45_Clariom_S_Human_.CEL : num  4.72 6.16 4.92 4.3 4.12 ...
##  $ GSM9163319_46_Clariom_S_Human_.CEL : num  5.67 5.72 4.81 4.82 3.25 ...
##  $ GSM9163320_49_Clariom_S_Human_2.CEL: num  6.25 5.98 4.73 3.83 3.47 ...
##  $ GSM9163321_50_Clariom_S_Human_.CEL : num  5.95 5.9 4.69 4.1 3.35 ...
##  $ GSM9163322_51_Clariom_S_Human_.CEL : num  6.25 5.73 4.18 4.26 3.17 ...
##  $ GSM9163323_52_Clariom_S_Human_.CEL : num  5.71 6.36 4.4 4.1 3.69 ...
##  $ GSM9163324_53_Clariom_S_Human_.CEL : num  6.16 5.56 4.9 4.23 3.12 ...
##  $ GSM9163325_55_Clariom_S_Human_.CEL : num  6.85 5.68 4.01 3.83 3.5 ...
##  $ GSM9163326_56_Clariom_S_Human_.CEL : num  5.68 5.79 4.03 4.39 3.28 ...
##  $ GSM9163327_57_Clariom_S_Human_.CEL : num  6.29 5.94 4.5 4.19 3.38 ...
##  $ ID                                 : chr  "23064070" "23064071" "23064072" "23064073" ...

The SPOT.ID.1 column has the gene name in it as well as full name and alternate names.

normalized.expr$SPOT_ID.1[1]

## NULL

It seems like this column should be split by each ‘//’ and made into separate columns. We will see about separating it later, am unable to use split function to separate as a list currently. Will ask AI later for extracting only the gene name in parenthesis.

Now lets see how are samples are related in order with the series_txt information we made earlier.

compare <- rbind(colnames(normalized.expr)[1:47],series4[2:48])

paged_table(compare)

It looks like viewing the sample ID by GSM ID is the same and consistent in order, so we should rename these to be smaller by class. All have EBV confirmed diagnosis, but some are CHL, pDLBCL, and mDLBCL. We can extend the group as well.

CHL <- grep('CHL',compare[3,])
mDLBCL <- grep('mDLBCL', compare[3,])
pDLBCL <- grep('pDLBCL',compare[3,])

The column names will be automatically numbered with a dot and the number starting with 1 after a duplicate encountered.

colnames(compare)[CHL] <- 'CHL'
colnames(compare)[mDLBCL] <- 'mDLBCL'
colnames(compare)[pDLBCL] <- 'pDLBCL'

paged_table(compare)

newNames <- colnames(compare)

colnames(normalized.expr)[1:47] <- newNames

Lets add the row means to each gene by type of Lymphoma.

normalized.expr$CHL_mean <- rowMeans(normalized.expr[,CHL])
normalized.expr$pDLBCL_mean <- rowMeans(normalized.expr[,pDLBCL])
normalized.expr$mDLBCL_mean <- rowMeans(normalized.expr[,mDLBCL])

str(normalized.expr)

## 'data.frame':    27189 obs. of  51 variables:
##  $ pDLBCL     : num  5.31 5.68 4.97 3.97 3.53 ...
##  $ CHL        : num  5.49 4.69 5.15 4.57 2.77 ...
##  $ pDLBCL     : num  4.75 5.03 5.03 3.7 3.66 ...
##  $ mDLBCL     : num  5.68 5.65 4.29 3.57 3.35 ...
##  $ CHL        : num  4.82 5.68 4.92 3.86 4.89 ...
##  $ mDLBCL     : num  5.52 5.59 4.1 3.6 3.8 ...
##  $ pDLBCL     : num  5.45 5.56 4.91 3.97 3.39 ...
##  $ CHL        : num  5.77 5.88 3.76 3.98 3.33 ...
##  $ CHL        : num  6.18 5.37 4.53 4.24 3.62 ...
##  $ CHL        : num  5.59 5.7 4.19 4.45 3.64 ...
##  $ CHL        : num  5.51 5.72 4.87 3.78 3.45 ...
##  $ mDLBCL     : num  5.66 5.58 4.91 4.59 3.11 ...
##  $ mDLBCL     : num  5.02 5.12 3.33 3.91 3.39 ...
##  $ CHL        : num  5.2 5.75 4.15 4.09 3.39 ...
##  $ CHL        : num  5.32 5.23 4.47 3.88 3.49 ...
##  $ pDLBCL     : num  5.8 5.97 4.6 3.86 3.7 ...
##  $ CHL        : num  5.67 5.07 4.27 4.07 2.79 ...
##  $ mDLBCL     : num  6.01 5.89 3.89 4.14 3.54 ...
##  $ mDLBCL     : num  6.05 5.81 4.51 4.2 3.16 ...
##  $ mDLBCL     : num  5.51 5.73 4.13 4.18 3.73 ...
##  $ mDLBCL     : num  6.31 5.9 4.62 4.2 3.18 ...
##  $ CHL        : num  5.9 5.39 4.52 3.58 4.07 ...
##  $ mDLBCL     : num  5.39 5.14 4.58 4.6 3.69 ...
##  $ CHL        : num  5.81 6.15 4.56 4.19 3.71 ...
##  $ CHL        : num  5.67 5.64 5.28 4.02 3.57 ...
##  $ CHL        : num  5.92 6.06 5.41 4.37 3.46 ...
##  $ CHL        : num  5.68 6.04 4.29 4.41 3.4 ...
##  $ mDLBCL     : num  5.79 6.39 4.79 4.35 3.77 ...
##  $ mDLBCL     : num  5.14 5.21 3.79 4.09 3.96 ...
##  $ mDLBCL     : num  6.19 5.26 3.8 4.27 3.49 ...
##  $ mDLBCL     : num  5.75 5.85 3.84 3.86 3.39 ...
##  $ mDLBCL     : num  6.33 5.98 4.89 3.95 3.7 ...
##  $ mDLBCL     : num  5.42 5.84 4.49 4.47 3.35 ...
##  $ pDLBCL     : num  6.33 5.88 4.85 4.35 3.17 ...
##  $ pDLBCL     : num  5.66 6.07 4.89 4.01 3.4 ...
##  $ mDLBCL     : num  6.12 6.08 4.44 3.98 3.24 ...
##  $ mDLBCL     : num  5.96 6.39 5.34 4.23 3.86 ...
##  $ CHL        : num  4.72 6.16 4.92 4.3 4.12 ...
##  $ pDLBCL     : num  5.67 5.72 4.81 4.82 3.25 ...
##  $ pDLBCL     : num  6.25 5.98 4.73 3.83 3.47 ...
##  $ CHL        : num  5.95 5.9 4.69 4.1 3.35 ...
##  $ CHL        : num  6.25 5.73 4.18 4.26 3.17 ...
##  $ CHL        : num  5.71 6.36 4.4 4.1 3.69 ...
##  $ mDLBCL     : num  6.16 5.56 4.9 4.23 3.12 ...
##  $ mDLBCL     : num  6.85 5.68 4.01 3.83 3.5 ...
##  $ mDLBCL     : num  5.68 5.79 4.03 4.39 3.28 ...
##  $ CHL        : num  6.29 5.94 4.5 4.19 3.38 ...
##  $ ID         : chr  "23064070" "23064071" "23064072" "23064073" ...
##  $ CHL_mean   : num  5.65 5.71 4.58 4.13 3.54 ...
##  $ pDLBCL_mean: num  5.65 5.74 4.85 4.06 3.45 ...
##  $ mDLBCL_mean: num  5.83 5.72 4.33 4.13 3.48 ...

summary(normalized.expr)

##      pDLBCL            CHL             pDLBCL           mDLBCL      
##  Min.   : 1.104   Min.   : 1.075   Min.   : 1.037   Min.   : 1.056  
##  1st Qu.: 3.237   1st Qu.: 3.305   1st Qu.: 3.230   1st Qu.: 3.209  
##  Median : 4.383   Median : 4.326   Median : 4.376   Median : 4.377  
##  Mean   : 4.467   Mean   : 4.421   Mean   : 4.457   Mean   : 4.462  
##  3rd Qu.: 5.473   3rd Qu.: 5.362   3rd Qu.: 5.474   3rd Qu.: 5.470  
##  Max.   :13.469   Max.   :13.482   Max.   :13.469   Max.   :13.482  
##       CHL             mDLBCL           pDLBCL            CHL        
##  Min.   : 1.115   Min.   : 0.979   Min.   : 0.979   Min.   : 1.071  
##  1st Qu.: 3.289   1st Qu.: 3.193   1st Qu.: 3.221   1st Qu.: 3.195  
##  Median : 4.354   Median : 4.382   Median : 4.377   Median : 4.380  
##  Mean   : 4.433   Mean   : 4.465   Mean   : 4.458   Mean   : 4.474  
##  3rd Qu.: 5.431   3rd Qu.: 5.474   3rd Qu.: 5.482   3rd Qu.: 5.502  
##  Max.   :13.454   Max.   :13.497   Max.   :13.469   Max.   :13.495  
##       CHL              CHL              CHL             mDLBCL      
##  Min.   : 1.009   Min.   : 1.035   Min.   : 1.158   Min.   : 1.097  
##  1st Qu.: 3.218   1st Qu.: 3.268   1st Qu.: 3.293   1st Qu.: 3.222  
##  Median : 4.410   Median : 4.416   Median : 4.412   Median : 4.376  
##  Mean   : 4.503   Mean   : 4.504   Mean   : 4.488   Mean   : 4.498  
##  3rd Qu.: 5.537   3rd Qu.: 5.492   3rd Qu.: 5.457   3rd Qu.: 5.517  
##  Max.   :13.219   Max.   :13.495   Max.   :13.497   Max.   :13.482  
##      mDLBCL            CHL              CHL              pDLBCL       
##  Min.   : 1.131   Min.   : 1.015   Min.   : 0.9349   Min.   : 0.9517  
##  1st Qu.: 3.322   1st Qu.: 3.271   1st Qu.: 3.2113   1st Qu.: 3.2006  
##  Median : 4.418   Median : 4.414   Median : 4.3820   Median : 4.3765  
##  Mean   : 4.495   Mean   : 4.499   Mean   : 4.4727   Mean   : 4.4734  
##  3rd Qu.: 5.474   3rd Qu.: 5.485   3rd Qu.: 5.4897   3rd Qu.: 5.4796  
##  Max.   :13.455   Max.   :13.495   Max.   :13.4823   Max.   :13.4973  
##       CHL             mDLBCL           mDLBCL           mDLBCL      
##  Min.   : 1.083   Min.   : 1.011   Min.   : 1.072   Min.   : 1.035  
##  1st Qu.: 3.256   1st Qu.: 3.194   1st Qu.: 3.189   1st Qu.: 3.194  
##  Median : 4.392   Median : 4.382   Median : 4.400   Median : 4.369  
##  Mean   : 4.480   Mean   : 4.488   Mean   : 4.497   Mean   : 4.480  
##  3rd Qu.: 5.483   3rd Qu.: 5.512   3rd Qu.: 5.528   3rd Qu.: 5.478  
##  Max.   :13.469   Max.   :13.497   Max.   :13.497   Max.   :13.124  
##      mDLBCL            CHL             mDLBCL            CHL         
##  Min.   : 1.075   Min.   : 1.085   Min.   : 1.134   Min.   : 0.9819  
##  1st Qu.: 3.222   1st Qu.: 3.211   1st Qu.: 3.248   1st Qu.: 3.1659  
##  Median : 4.388   Median : 4.394   Median : 4.395   Median : 4.3858  
##  Mean   : 4.485   Mean   : 4.472   Mean   : 4.495   Mean   : 4.4632  
##  3rd Qu.: 5.511   3rd Qu.: 5.502   3rd Qu.: 5.494   3rd Qu.: 5.5239  
##  Max.   :13.469   Max.   :13.190   Max.   :13.497   Max.   :13.4823  
##       CHL              CHL              CHL             mDLBCL      
##  Min.   : 1.005   Min.   : 1.117   Min.   : 1.081   Min.   : 1.072  
##  1st Qu.: 3.205   1st Qu.: 3.153   1st Qu.: 3.222   1st Qu.: 3.188  
##  Median : 4.388   Median : 4.374   Median : 4.396   Median : 4.390  
##  Mean   : 4.482   Mean   : 4.486   Mean   : 4.484   Mean   : 4.488  
##  3rd Qu.: 5.521   3rd Qu.: 5.542   3rd Qu.: 5.500   3rd Qu.: 5.538  
##  Max.   :13.469   Max.   :13.438   Max.   :13.469   Max.   :13.469  
##      mDLBCL           mDLBCL           mDLBCL           mDLBCL      
##  Min.   : 1.100   Min.   : 1.060   Min.   : 1.065   Min.   : 1.119  
##  1st Qu.: 3.305   1st Qu.: 3.202   1st Qu.: 3.211   1st Qu.: 3.180  
##  Median : 4.408   Median : 4.387   Median : 4.405   Median : 4.376  
##  Mean   : 4.496   Mean   : 4.492   Mean   : 4.496   Mean   : 4.501  
##  3rd Qu.: 5.479   3rd Qu.: 5.522   3rd Qu.: 5.522   3rd Qu.: 5.532  
##  Max.   :13.482   Max.   :13.364   Max.   :13.167   Max.   :13.482  
##      mDLBCL           pDLBCL           pDLBCL           mDLBCL      
##  Min.   : 1.107   Min.   : 1.151   Min.   : 1.091   Min.   : 1.067  
##  1st Qu.: 3.227   1st Qu.: 3.181   1st Qu.: 3.196   1st Qu.: 3.182  
##  Median : 4.414   Median : 4.396   Median : 4.390   Median : 4.374  
##  Mean   : 4.493   Mean   : 4.496   Mean   : 4.487   Mean   : 4.502  
##  3rd Qu.: 5.523   3rd Qu.: 5.528   3rd Qu.: 5.525   3rd Qu.: 5.540  
##  Max.   :13.497   Max.   :13.469   Max.   :13.497   Max.   :13.482  
##      mDLBCL            CHL             pDLBCL           pDLBCL      
##  Min.   : 1.035   Min.   : 1.021   Min.   : 1.052   Min.   : 1.111  
##  1st Qu.: 3.191   1st Qu.: 3.203   1st Qu.: 3.165   1st Qu.: 3.131  
##  Median : 4.371   Median : 4.376   Median : 4.372   Median : 4.334  
##  Mean   : 4.483   Mean   : 4.440   Mean   : 4.490   Mean   : 4.489  
##  3rd Qu.: 5.515   3rd Qu.: 5.508   3rd Qu.: 5.539   3rd Qu.: 5.532  
##  Max.   :13.497   Max.   :13.469   Max.   :13.497   Max.   :13.469  
##       CHL              CHL              CHL             mDLBCL      
##  Min.   : 1.099   Min.   : 1.011   Min.   : 1.046   Min.   : 1.036  
##  1st Qu.: 3.180   1st Qu.: 3.168   1st Qu.: 3.219   1st Qu.: 3.121  
##  Median : 4.381   Median : 4.370   Median : 4.397   Median : 4.333  
##  Mean   : 4.484   Mean   : 4.492   Mean   : 4.488   Mean   : 4.479  
##  3rd Qu.: 5.500   3rd Qu.: 5.531   3rd Qu.: 5.515   3rd Qu.: 5.531  
##  Max.   :13.482   Max.   :13.482   Max.   :13.495   Max.   :13.482  
##      mDLBCL           mDLBCL            CHL              ID           
##  Min.   : 1.101   Min.   : 1.011   Min.   : 1.085   Length:27189      
##  1st Qu.: 3.158   1st Qu.: 3.247   1st Qu.: 3.163   Class :character  
##  Median : 4.364   Median : 4.410   Median : 4.357   Mode  :character  
##  Mean   : 4.494   Mean   : 4.499   Mean   : 4.496                     
##  3rd Qu.: 5.525   3rd Qu.: 5.523   3rd Qu.: 5.526                     
##  Max.   :13.497   Max.   :13.497   Max.   :13.497                     
##     CHL_mean       pDLBCL_mean      mDLBCL_mean    
##  Min.   : 1.344   Min.   : 1.342   Min.   : 1.321  
##  1st Qu.: 3.248   1st Qu.: 3.218   1st Qu.: 3.232  
##  Median : 4.396   Median : 4.383   Median : 4.389  
##  Mean   : 4.477   Mean   : 4.477   Mean   : 4.489  
##  3rd Qu.: 5.462   3rd Qu.: 5.481   3rd Qu.: 5.488  
##  Max.   :13.406   Max.   :13.474   Max.   :13.379

paged_table(normalized.expr[c(1:10),c(45:51)])

There is no baseline or healthy sample to compare to for fold change values, so we must look at some of the factors or features to compare like the mean and median values per sample type. This is one way to do a comparison of these types of Lymphomas, to see how far from the median the mean is and use that as a difference per gene on how far the median is from the mean, where a skewed mean with a higher value of mean minus median could indicate that there is variability or an outlier as well skewing the data. The samples are also not balanced as there are only 8 pDLBCL cases but 19 CHL and 20 mDLBCL cases. There are also more men than women and the median age is 72 years of age but all older than 50 years of age. We could factor in age as the older we get the lower our quality of immune response is. There could be weights for each class by those percent younger than 72 and those percent older than 72, as well as the percent of men per class to the percent of females per class. The median value should be very close to the mean if the samples had no significant differences. We know the groups were done by looking directly at values of IFN-G, IDO1, and PDL1, as well as copy number variants at chromosome 9 loci of genes surrounding 9p24.1.

males <- grep('Sex: male',compare[5,])
females <- grep('female', compare[5,])

There are 12 females and 47 males.

We will call x - female and y - male.

CHL_x <- CHL[CHL %in% females]
mDLBCL_x <- mDLBCL[mDLBCL %in% females]
pDLBCL_x <- pDLBCL[pDLBCL %in% females]

CHL_y <- CHL[CHL %in% males]
mDLBCL_y <- mDLBCL[mDLBCL %in% males]
pDLBCL_y <- pDLBCL[pDLBCL %in% males]

Add Mean values of genes for each class by gender to the data frame normalized.expr

normalized.expr$CHL_x_mean <- rowMeans(normalized.expr[,CHL_x])
normalized.expr$mDLBCL_x_mean <- rowMeans(normalized.expr[,mDLBCL_x])
normalized.expr$pDLBCL_x_mean <- rowMeans(normalized.expr[,pDLBCL_x])

normalized.expr$CHL_y_mean <- rowMeans(normalized.expr[,CHL_y])
normalized.expr$mDLBCL_y_mean <- rowMeans(normalized.expr[,mDLBCL_y])
normalized.expr$pDLBCL_y_mean <- rowMeans(normalized.expr[,pDLBCL_y])

Lets add in gene means per sample by age older than or equal to 72 and those younger than 72.

age <- as.numeric(gsub('age: ','',compare[4,]))

compare[4,] <- age

row.names(compare) <- c('titleID','GSM_ID','diagnosis','age', 'gender','group')

paged_table(compare)

write.csv(compare, 'compare_df_6labels_47samples.csv',row.names=T)

young <- which(compare['age',] <= 72)
old <- which(compare['age',] > 72)


CHL_young <- CHL[CHL %in% young]
mDLBCL_young <- mDLBCL[mDLBCL %in% young]
pDLBCL_young <- pDLBCL[pDLBCL %in% young]

CHL_old <- CHL[CHL %in% old]
mDLBCL_old <- mDLBCL[mDLBCL %in% old]
pDLBCL_old <- pDLBCL[pDLBCL %in% old]

Now that we have the old and young indices and the class count in the young and old classes of less than 72 is young or equal to 72, but older than 72 is old, we can get our means per gene for each lymphoma in subcategory by age.

normalized.expr$CHL_young72_mean <- rowMeans(normalized.expr[CHL_young])
normalized.expr$mDLBCL_young72_mean <- rowMeans(normalized.expr[mDLBCL_young])
normalized.expr$pDLBCL_young72_mean <- rowMeans(normalized.expr[pDLBCL_young])

normalized.expr$CHL_old <- rowMeans(normalized.expr[CHL_old])
normalized.expr$mDLBCL_old <- rowMeans(normalized.expr[mDLBCL_old])
normalized.expr$pDLBCL_old <- rowMeans(normalized.expr[pDLBCL_old])

paged_table(normalized.expr[1:10,48:63])

We will add the medians per class of lymphoma only not by group but class of CHL, mDLBCL, or pDLBCL. We won’t use any weights by the gender or age but we can compare the mean values in each subset of the class by gender and age separately to see if there are any noticeable changes.

Calculate row medians for selected columns

df$row_median <- apply(df[cols_to_use], 1, function(row) { median(as.numeric(row), na.rm = TRUE) # Convert to numeric & ignore NAs })

normalized.expr$CHL_median <- apply(normalized.expr[CHL],1,function(row){
  median(as.numeric(row), na.rm=F)
})

normalized.expr$mDLBCL_median <- apply(normalized.expr[mDLBCL],1,function(row){
  median(as.numeric(row), na.rm=F)
})

normalized.expr$pDLBCL_median <- apply(normalized.expr[pDLBCL],1,function(row){
  median(as.numeric(row), na.rm=F)
})

paged_table(normalized.expr[1:10,49:66])

Lets go ahead and add in the medians of each class in its subclass of gender and age. This could be useful when filtering to find genes that are far from the average or having one when compared to where the median is in that group. Its a bunch of copy and paste and replace so not too much typing or creativity to do this little step.

normalized.expr$CHL_x_median <- apply(normalized.expr[CHL_x],1,function(row){
  median(as.numeric(row), na.rm=F)
})

normalized.expr$mDLBCL_x_median <- apply(normalized.expr[mDLBCL_x],1,function(row){
  median(as.numeric(row), na.rm=F)
})

normalized.expr$pDLBCL_x_median <- apply(normalized.expr[pDLBCL_x],1,function(row){
  median(as.numeric(row), na.rm=F)
})

normalized.expr$CHL_y_median <- apply(normalized.expr[CHL_y],1,function(row){
  median(as.numeric(row), na.rm=F)
})

normalized.expr$mDLBCL_y_median <- apply(normalized.expr[mDLBCL_y],1,function(row){
  median(as.numeric(row), na.rm=F)
})

normalized.expr$pDLBCL_y_median <- apply(normalized.expr[pDLBCL_y],1,function(row){
  median(as.numeric(row), na.rm=F)
})

Above we added the medians per class and gender or gender within each class of Lymphoma.

Now for young or old within each class, younger than or equal to 72 and older than 72.

normalized.expr$CHL_young72_median <- apply(normalized.expr[CHL_young],1,function(row){
  median(as.numeric(row), na.rm=F)
})

normalized.expr$mDLBCL_young72_median <- apply(normalized.expr[mDLBCL_young],1,function(row){
  median(as.numeric(row), na.rm=F)
})

normalized.expr$pDLBCL_young72_median <- apply(normalized.expr[pDLBCL_young],1,function(row){
  median(as.numeric(row), na.rm=F)
})

normalized.expr$CHL_old72_median <- apply(normalized.expr[CHL_old],1,function(row){
  median(as.numeric(row), na.rm=F)
})

normalized.expr$mDLBCL_old72_median <- apply(normalized.expr[mDLBCL_old],1,function(row){
  median(as.numeric(row), na.rm=F)
})

normalized.expr$pDLBCL_old72_median <- apply(normalized.expr[pDLBCL_old],1,function(row){
  median(as.numeric(row), na.rm=F)
})

Lets see what we have so far and decide on some filters as markers for genes or we can just grep the genes that were in the study, since we have the chromosome location, or loci of 9p24.1, we should grep all genes in that region since we have that data from the features data we uploaded earlier in this file. I looked and the start and stop is given but not the loci, we have another file that has the loci from another study. The study said at 9p24.1 at start of 5,259,371-5,481,709 using GRCh38 (a greater human chromosome study with UCSB genome browser or similar). This location is where PDL1 encoded by CD274 was targeted, and at 9p24.1 at start of 5,593,784 and end at 5,764,809 for PDL2. The other genes are EBNA2 will be expressed if active EBV infection for latent type 3 EBV as well as LMP1 positive, for the DLBCL types not normally the CHL type that would have EBNA1 for latent type 1 and type 2. They defined EBV latent type 1 as LMP1 and EBNA2 negative. EBV latent type 2 was defined as LMP1 positive but EBNA2 negative. We also need to get the IFN-g for interferon gamma, and IDO1. They confirmed active EBV infection by EBER in situ hybridization

LMP1
EBNA2
IDO1
IFNG
PDL1
PDL2
9p24.1 5,259,371-5,481,709 for CD274 and PDL1 on 5’ strand
9p24.1 5,593,784-5,764,809 for PDL2 encompassed by PDCD1LG2 on 3’ strand

Other genes that were looked at as part of their Gene Expression Validation Analysis included the CD3, CD5, CD10, CD15, CD20, CD30, CD79a, PAX5, EBER, BCL2, BCL6, and MUM1.

Lets make a list of these genes called GSVA for gene study validation analysis.

GSVA <- c("LMP1","EBNA2","IDO1","IFNG","PDL1","PDL2","CD3","CD5","CD10","CD15","CD20","CD30","CD79a","PAX5","EBER","BCL2","BCL6", "MUM1")

The loci of chromosome start and stop have + and - signs, the + is upstream or 5’ and - is downstream or 3’ end when referring to the antiparallel strands of DNA and complementary DNA is the opposite strand to the template or antisense strand.

9p24.1 5,259,371-5,481,709 for CD274 and PDL1 on 5’ strand
9p24.1 5,593,784-5,764,809 for PDL2 on 3’ strand

chr9 <- subset(feature.data, feature.data$seqname == 'chr9') #822X10

  
startPDL1 <- 5259371
endPDL1 <- 5481709


PDL1_loci <- subset(chr9, chr9$strand == '+' & chr9$start >= startPDL1 & chr9$stop <= endPDL1)

paged_table(PDL1_loci)

We can see that by scrolling to the SPOT_ID.1 feature that this is CD274 as study said they used to encompass PDL1 detection.

Now for PDL2 loci

startPDL2 <- 5593784
endPDL2 <- 5764809

PDL2_loci <- subset(chr9, chr9$strand == '-' & chr9$start >= startPDL2 & chr9$stop <= endPDL2)

paged_table(PDL2_loci)

These genes we can add to the list of genes then, we already have CD274, but not PDCD1LG2 for PDL2, but PTPRD and ERP44.

Lets grep these genes in the SPOT_ID.1 column of the normalized.expr data.

GSVA <- c(GSVA,"PTPRD","CD274", "ERP44","PDCD1LG2")

GSVA

##  [1] "LMP1"     "EBNA2"    "IDO1"     "IFNG"     "PDL1"     "PDL2"    
##  [7] "CD3"      "CD5"      "CD10"     "CD15"     "CD20"     "CD30"    
## [13] "CD79a"    "PAX5"     "EBER"     "BCL2"     "BCL6"     "MUM1"    
## [19] "PTPRD"    "CD274"    "ERP44"    "PDCD1LG2"

We will do this later as we will likely have to grep each of these 26 genes separately.

Lets write out the file we have to use later in Part 2 with our samples, probe IDs, messy gene IDs in one column, sample means and medians by class and subclass within the class.

write.csv(normalized.expr,'MeansMediansLymphomas_CHL_DLBCL_21448X78.csv', row.names=F)

This file is about 135 MB large and too large for Google. But Kaggle allows it, get this file here

or 2nd version here

Thanks so much and keep checking in for part 2.

====================================================

***** Part 2

library(rmarkdown)

Lets read in our file we made in part 1.

file <- read.csv("MeansMediansLymphomas_CHL_DLBCL_21448X78.csv", header=T)

paged_table(file[1:10,])

colnames(file)

##  [1] "pDLBCL"                "CHL"                   "pDLBCL.1"             
##  [4] "mDLBCL"                "CHL.1"                 "mDLBCL.1"             
##  [7] "pDLBCL.2"              "CHL.2"                 "CHL.3"                
## [10] "CHL.4"                 "CHL.5"                 "mDLBCL.2"             
## [13] "mDLBCL.3"              "CHL.6"                 "CHL.7"                
## [16] "pDLBCL.3"              "CHL.8"                 "mDLBCL.4"             
## [19] "mDLBCL.5"              "mDLBCL.6"              "mDLBCL.7"             
## [22] "CHL.9"                 "mDLBCL.8"              "CHL.10"               
## [25] "CHL.11"                "CHL.12"                "CHL.13"               
## [28] "mDLBCL.9"              "mDLBCL.10"             "mDLBCL.11"            
## [31] "mDLBCL.12"             "mDLBCL.13"             "mDLBCL.14"            
## [34] "pDLBCL.4"              "pDLBCL.5"              "mDLBCL.15"            
## [37] "mDLBCL.16"             "CHL.14"                "pDLBCL.6"             
## [40] "pDLBCL.7"              "CHL.15"                "CHL.16"               
## [43] "CHL.17"                "mDLBCL.17"             "mDLBCL.18"            
## [46] "mDLBCL.19"             "CHL.18"                "ID"                   
## [49] "CHL_mean"              "pDLBCL_mean"           "mDLBCL_mean"          
## [52] "CHL_x_mean"            "mDLBCL_x_mean"         "pDLBCL_x_mean"        
## [55] "CHL_y_mean"            "mDLBCL_y_mean"         "pDLBCL_y_mean"        
## [58] "CHL_young72_mean"      "mDLBCL_young72_mean"   "pDLBCL_young72_mean"  
## [61] "CHL_old"               "mDLBCL_old"            "pDLBCL_old"           
## [64] "CHL_median"            "mDLBCL_median"         "pDLBCL_median"        
## [67] "CHL_x_median"          "mDLBCL_x_median"       "pDLBCL_x_median"      
## [70] "CHL_y_median"          "mDLBCL_y_median"       "pDLBCL_y_median"      
## [73] "CHL_young72_median"    "mDLBCL_young72_median" "pDLBCL_young72_median"
## [76] "CHL_old72_median"      "mDLBCL_old72_median"   "pDLBCL_old72_median"

The ID is there at column 48. We will use the mean and median columns plus ID columns 48-78.

summary <- file[,48:78]

paged_table(summary[1:10,])

colnames(summary)

##  [1] "ID"                    "CHL_mean"              "pDLBCL_mean"          
##  [4] "mDLBCL_mean"           "CHL_x_mean"            "mDLBCL_x_mean"        
##  [7] "pDLBCL_x_mean"         "CHL_y_mean"            "mDLBCL_y_mean"        
## [10] "pDLBCL_y_mean"         "CHL_young72_mean"      "mDLBCL_young72_mean"  
## [13] "pDLBCL_young72_mean"   "CHL_old"               "mDLBCL_old"           
## [16] "pDLBCL_old"            "CHL_median"            "mDLBCL_median"        
## [19] "pDLBCL_median"         "CHL_x_median"          "mDLBCL_x_median"      
## [22] "pDLBCL_x_median"       "CHL_y_median"          "mDLBCL_y_median"      
## [25] "pDLBCL_y_median"       "CHL_young72_median"    "mDLBCL_young72_median"
## [28] "pDLBCL_young72_median" "CHL_old72_median"      "mDLBCL_old72_median"  
## [31] "pDLBCL_old72_median"

But the gene names are in the feature1 data table with the SPOT_ID.1 column. We can do the analysis and then extract those IDs of the top 20 manually since that column was a bunch of grouped or listed IDs in one column. Lets upload that file as well.

SPOT_IDs <- read.csv("feature.data.csv", header=T, row.names=1)

paged_table(SPOT_IDs[1:10,])

The name will be the one of the Genecards ID in parenthesis.

colnames(SPOT_IDs)

##  [1] "ID"           "probeset_id"  "seqname"      "strand"       "start"       
##  [6] "stop"         "total_probes" "category"     "SPOT_ID"      "SPOT_ID.1"

The genes from study are these ones.

GSVA <- c("LMP1","EBNA2","IDO1","IFNG","PDL1","PDL2","CD3","CD5","CD10","CD15","CD20","CD30","CD79a","PAX5","EBER","BCL2","BCL6", "MUM1","PTPRD","CD274", "ERP44","PDCD1LG2")

GSVA

##  [1] "LMP1"     "EBNA2"    "IDO1"     "IFNG"     "PDL1"     "PDL2"    
##  [7] "CD3"      "CD5"      "CD10"     "CD15"     "CD20"     "CD30"    
## [13] "CD79a"    "PAX5"     "EBER"     "BCL2"     "BCL6"     "MUM1"    
## [19] "PTPRD"    "CD274"    "ERP44"    "PDCD1LG2"

Read in the labels table, that I went ahead and wrote to csv from Part 1 data, you can rerun all of the above in Part 1 or get the samples’ labels here

labels <- read.csv("compare_df_6labels_47samples.csv", header=T, row.names=1)

paged_table(labels)

The group labels are in this data and we didn’t do anything with them earlier, but we will use those as a 4 class model for predicting the new classes of large B-cell Lymphoma based on this research article’s findings.

Let modify the content some of the labels table. Remove the ‘Sex:’ from gender row sample contents and the ’ group’ from group row contents of sample columns.

labels["gender",] <- gsub("Sex: ","",labels["gender",])
labels["gender",]

##        pDLBCL  CHL pDLBCL.1 mDLBCL CHL.1 mDLBCL.1 pDLBCL.2 CHL.2 CHL.3 CHL.4
## gender   male male     male   male  male   female   female  male  male  male
##        CHL.5 mDLBCL.2 mDLBCL.3  CHL.6 CHL.7 pDLBCL.3 CHL.8 mDLBCL.4 mDLBCL.5
## gender  male     male   female female  male   female  male     male     male
##        mDLBCL.6 mDLBCL.7 CHL.9 mDLBCL.8 CHL.10 CHL.11 CHL.12 CHL.13 mDLBCL.9
## gender     male   female  male     male female   male   male   male     male
##        mDLBCL.10 mDLBCL.11 mDLBCL.12 mDLBCL.13 mDLBCL.14 pDLBCL.4 pDLBCL.5
## gender      male      male      male    female      male   female   female
##        mDLBCL.15 mDLBCL.16 CHL.14 pDLBCL.6 pDLBCL.7 CHL.15 CHL.16 CHL.17
## gender      male      male   male     male     male female female   male
##        mDLBCL.17 mDLBCL.18 mDLBCL.19 CHL.18
## gender      male      male      male   male

labels["group",] <- gsub(" group","", labels["group",])
labels["group",]

##       pDLBCL    CHL pDLBCL.1 mDLBCL  CHL.1 mDLBCL.1 pDLBCL.2  CHL.2    CHL.3
## group IFNG-H IFNG-L   IFNG-H IFNG-L IFNG-L 9p24.1-H   IFNG-L IFNG-L 9p24.1-H
##          CHL.4    CHL.5 mDLBCL.2 mDLBCL.3  CHL.6  CHL.7 pDLBCL.3  CHL.8
## group 9p24.1-H 9p24.1-H   IFNG-L       IS IFNG-L IFNG-L   IFNG-L IFNG-L
##       mDLBCL.4 mDLBCL.5 mDLBCL.6 mDLBCL.7  CHL.9 mDLBCL.8   CHL.10 CHL.11
## group       IS       IS       IS       IS IFNG-L   IFNG-H 9p24.1-H IFNG-L
##         CHL.12   CHL.13 mDLBCL.9 mDLBCL.10 mDLBCL.11 mDLBCL.12 mDLBCL.13
## group 9p24.1-H 9p24.1-H       IS      <NA>        IS    IFNG-L    IFNG-H
##       mDLBCL.14 pDLBCL.4 pDLBCL.5 mDLBCL.15 mDLBCL.16 CHL.14 pDLBCL.6 pDLBCL.7
## group  9p24.1-H   IFNG-L   IFNG-H    IFNG-L    IFNG-H IFNG-L   IFNG-H   IFNG-H
##       CHL.15 CHL.16   CHL.17 mDLBCL.17 mDLBCL.18 mDLBCL.19 CHL.18
## group IFNG-L IFNG-L 9p24.1-H        IS        IS        IS IFNG-H

group <- as.data.frame(t(labels["group",]))

table(group$group)

## 
## 9p24.1-H   IFNG-H   IFNG-L       IS 
##        9        9       18       10

They found 9 high chromosome loci 9p24.1 groups, 9 high in interferon gamma or IFNG, 18 low in IFNG, and 10 as immune senescence.

Lets get the variation of mean/median per gene in the first set of CHL_mean/CHL_median from our summary data frame. Then the same for pDLBCL_mean/pDLBCL_median, and mDLBCL_mean/mDLBCL_median.

summary$CHL_change <- summary$CHL_mean/summary$CHL_median
summary$pDLBCL_change <- summary$pDLBCL_mean/summary$pDLBCL_median
summary$mDLBCL_change <- summary$mDLBCL_mean/summary$mDLBCL_median

Now do the same for the gender each group, and then for the age in each group.

summary$CHL_x_change <- summary$CHL_x_mean/summary$CHL_x_median
summary$CHL_y_change <- summary$CHL_y_mean/summary$CHL_y_median

summary$pDLBCL_x_change <- summary$pDLBCL_x_mean/summary$pDLBCL_x_median
summary$pDLBCL_y_change <- summary$pDLBCL_y_mean/summary$pDLBCL_y_median

summary$mDLBCL_x_change <- summary$mDLBCL_x_mean/summary$mDLBCL_x_median
summary$mDLBCL_y_change <- summary$mDLBCL_y_mean/summary$mDLBCL_y_median

Here we add the age changes per group. These ‘change’ are the avg per group divided by median per group. We just call it change but it could be variation or skew or skew change for the over or under estimation the average does when compared to midline real value of median.

summary$CHL_old72_change <- summary$CHL_old/summary$CHL_old72_median #we didn't add 'mean' to the end of the CHL_old column name
summary$CHL_young72_change <- summary$CHL_young72_mean/summary$CHL_young72_median

summary$pDLBCL_old_change <- summary$pDLBCL_old/summary$pDLBCL_old72_median

summary$pDLBCL_young72_change <- summary$pDLBCL_young72_mean/summary$pDLBCL_young72_median

summary$mDLBCL_young72_change <- summary$mDLBCL_young72_mean/summary$mDLBCL_young72_median

colnames(summary)

##  [1] "ID"                    "CHL_mean"              "pDLBCL_mean"          
##  [4] "mDLBCL_mean"           "CHL_x_mean"            "mDLBCL_x_mean"        
##  [7] "pDLBCL_x_mean"         "CHL_y_mean"            "mDLBCL_y_mean"        
## [10] "pDLBCL_y_mean"         "CHL_young72_mean"      "mDLBCL_young72_mean"  
## [13] "pDLBCL_young72_mean"   "CHL_old"               "mDLBCL_old"           
## [16] "pDLBCL_old"            "CHL_median"            "mDLBCL_median"        
## [19] "pDLBCL_median"         "CHL_x_median"          "mDLBCL_x_median"      
## [22] "pDLBCL_x_median"       "CHL_y_median"          "mDLBCL_y_median"      
## [25] "pDLBCL_y_median"       "CHL_young72_median"    "mDLBCL_young72_median"
## [28] "pDLBCL_young72_median" "CHL_old72_median"      "mDLBCL_old72_median"  
## [31] "pDLBCL_old72_median"   "CHL_change"            "pDLBCL_change"        
## [34] "mDLBCL_change"         "CHL_x_change"          "CHL_y_change"         
## [37] "pDLBCL_x_change"       "pDLBCL_y_change"       "mDLBCL_x_change"      
## [40] "mDLBCL_y_change"       "CHL_old72_change"      "CHL_young72_change"   
## [43] "pDLBCL_old_change"     "pDLBCL_young72_change" "mDLBCL_young72_change"

lets get a summary stat report on these values to see if any of our change columns have NAs, NaNs, or Infinites.

summary(summary[,32:45]) #27189 genes and 14 change groups

##    CHL_change     pDLBCL_change    mDLBCL_change     CHL_x_change   
##  Min.   :0.8612   Min.   :0.8116   Min.   :0.8641   Min.   :0.8322  
##  1st Qu.:0.9934   1st Qu.:0.9921   1st Qu.:0.9955   1st Qu.:0.9921  
##  Median :1.0038   Median :1.0018   Median :1.0028   Median :1.0014  
##  Mean   :1.0055   Mean   :1.0051   Mean   :1.0054   Mean   :1.0032  
##  3rd Qu.:1.0155   3rd Qu.:1.0141   3rd Qu.:1.0120   3rd Qu.:1.0124  
##  Max.   :1.2861   Max.   :1.5134   Max.   :1.2959   Max.   :1.3600  
##   CHL_y_change    pDLBCL_x_change  pDLBCL_y_change  mDLBCL_x_change 
##  Min.   :0.8310   Min.   :0.8194   Min.   :0.8535   Min.   :0.8568  
##  1st Qu.:0.9917   1st Qu.:0.9917   1st Qu.:0.9920   1st Qu.:0.9915  
##  Median :1.0032   Median :1.0007   Median :1.0012   Median :1.0012  
##  Mean   :1.0050   Mean   :1.0024   Mean   :1.0042   Mean   :1.0039  
##  3rd Qu.:1.0161   3rd Qu.:1.0109   3rd Qu.:1.0127   3rd Qu.:1.0125  
##  Max.   :1.4108   Max.   :1.3300   Max.   :1.3877   Max.   :1.3811  
##  mDLBCL_y_change  CHL_old72_change CHL_young72_change pDLBCL_old_change
##  Min.   :0.8643   Min.   :0.8183   Min.   :0.7994     Min.   :0.8228   
##  1st Qu.:0.9949   1st Qu.:0.9904   1st Qu.:0.9913     1st Qu.:0.9914   
##  Median :1.0027   Median :1.0020   Median :1.0040     Median :1.0012   
##  Mean   :1.0053   Mean   :1.0043   Mean   :1.0057     Mean   :1.0039   
##  3rd Qu.:1.0122   3rd Qu.:1.0151   3rd Qu.:1.0182     3rd Qu.:1.0128   
##  Max.   :1.2960   Max.   :1.4992   Max.   :1.3811     Max.   :1.3349   
##  pDLBCL_young72_change mDLBCL_young72_change
##  Min.   :0.8574        Min.   :0.8384       
##  1st Qu.:0.9919        1st Qu.:0.9922       
##  Median :1.0009        Median :1.0021       
##  Mean   :1.0032        Mean   :1.0044       
##  3rd Qu.:1.0117        3rd Qu.:1.0138       
##  Max.   :1.4469        Max.   :1.4123

There aren’t any NAs, NaNs, or Infs. This is great and appears the min is above 0, so no more filtering needed. We can order by each change group and take the top 10 and bottom 10 for top 10 genes over estimated and top 10 genes under estimated.

CHL_change <- summary[order(summary$CHL_change, decreasing=T),]
topCHL <- CHL_change[c(1:10,27180:27189),]

pDLBCL_change <- summary[order(summary$pDLBCL_change, decreasing=T),]
top_pDLBCL <- pDLBCL_change[c(1:10,27180:27189),]

mDLBCL_change <- summary[order(summary$mDLBCL_change,decreasing=T),]
top_mDLBCL <- mDLBCL_change[c(1:10,27180:27189),]

Lets combine these tables to their gene names from the SPOT_IDs.

SPOT_ID_b <- SPOT_IDs[,c(1,10)]
topCHL_gene <- merge(topCHL, SPOT_ID_b, by.x="ID",by.y="ID")

paged_table(topCHL_gene)

When merging by ID there only seemed to be 2 genes in the SPOT_IDs. This is possible since there are 27,189 genes and only 21,448 SPOT IDs. That means there are going to be around 5,741 genes without SPOT IDs.

The first gene is KLRB1 and second is MGST1. I scrolled over the SPOT_ID column to get the ID in parenthesis of each row. Lets write this to a new column called ‘Gene_ID’

topCHL_gene$Gene_ID <- c("KLRB1","MGST1")

top_pDLBCL_gene <- merge(top_pDLBCL, SPOT_ID_b, by.x="ID",by.y="ID")
paged_table(top_pDLBCL_gene)

There are 7 genes for top genes of pDLBCL.

top_pDLBCL_gene$Gene_ID <- c("SPRR1B","SPRR2D","S100A7","SPRR2A","TMPRSS11A","USP30","DMKN")                           
paged_table(top_pDLBCL_gene)

top_mDLBCL_gene <- merge(top_mDLBCL, SPOT_ID_b, by.x="ID",by.y="ID")
paged_table(top_mDLBCL_gene)

There are 8 genes for mDLBCL. We can write those to their Gene_ID.

top_mDLBCL_gene$Gene_ID <- c("IL23R","SPRR1B","SPRR2A","MAL2","FABP4","CT45A6","CLDN10","DSG3" )

Lets give each a column where the gene was found to be a top variation skew change from avg/median group values.

topCHL_gene$importance <- c("top gene CHL avg/median group values")
top_pDLBCL_gene$importance <- c("top gene pDLBCL avg/median group values")
top_mDLBCL_gene$importance <- c("top gene mDLBCL avg/median group values")

Lets combine these top genes into a data frame of the top genes of 3 lymphoma classes.

topGenes_CHL_pDLBCL_mDLBCL <- rbind(topCHL_gene,top_pDLBCL_gene, top_mDLBCL_gene)

paged_table(topGenes_CHL_pDLBCL_mDLBCL)

Lets write this out to csv to use later in machine learning. I want to go back to the genes of the study and individually extract those genes and make them its own data table of the original Part 1 78 variables, but also add in our makeshift change variables to take the place of fold change.

GSVA

##  [1] "LMP1"     "EBNA2"    "IDO1"     "IFNG"     "PDL1"     "PDL2"    
##  [7] "CD3"      "CD5"      "CD10"     "CD15"     "CD20"     "CD30"    
## [13] "CD79a"    "PAX5"     "EBER"     "BCL2"     "BCL6"     "MUM1"    
## [19] "PTPRD"    "CD274"    "ERP44"    "PDCD1LG2"

LMP1 <- grep("(LMP1)", SPOT_ID_b$SPOT_ID.1, fixed=T) #empty
EBNA2 <- grep("(EBNA2)", SPOT_ID_b$SPOT_ID.1, fixed=T) #empty
EBER <- grep("(EBER)", SPOT_ID_b$SPOT_ID.1, fixed=T) #empty
PDL2 <- grep("(PDL2)", SPOT_ID_b$SPOT_ID.1, fixed=T) #empty

CD3 <- grep("(CD3)", SPOT_ID_b$SPOT_ID.1, fixed=T) #empty
CD10 <- grep("(CD10)", SPOT_ID_b$SPOT_ID.1, fixed=T) #empty
CD15 <- grep("(CD15)", SPOT_ID_b$SPOT_ID.1, fixed=T)#empty
CD20 <- grep("(CD20)", SPOT_ID_b$SPOT_ID.1, fixed=T) #empty
CD30 <- grep("(CD30)", SPOT_ID_b$SPOT_ID.1, fixed=T) #empty
CD79a <- grep("(CD79a)", SPOT_ID_b$SPOT_ID.1, fixed=T) #empty

IDO1 <- grep("(IDO1)", SPOT_ID_b$SPOT_ID.1, fixed=T) 
IFNG <- grep("(IFNG)", SPOT_ID_b$SPOT_ID.1, fixed=T) 
PDL1 <- grep("(PDL1)", SPOT_ID_b$SPOT_ID.1, fixed=T) 
CD5 <- grep("(CD5)", SPOT_ID_b$SPOT_ID.1, fixed=T) 
PAX5 <- grep("(PAX5)", SPOT_ID_b$SPOT_ID.1, fixed=T) 
BCL2 <- grep("(BCL2)", SPOT_ID_b$SPOT_ID.1, fixed=T)
BCL6 <- grep("(BCL6)", SPOT_ID_b$SPOT_ID.1, fixed=T) 
MUM1 <- grep("(MUM1)", SPOT_ID_b$SPOT_ID.1, fixed=T) 
PTPRD <- grep("(PTPRD)", SPOT_ID_b$SPOT_ID.1, fixed=T) 
CD274 <- grep("(CD274)", SPOT_ID_b$SPOT_ID.1, fixed=T)
ERP44 <- grep("(ERP44)", SPOT_ID_b$SPOT_ID.1, fixed=T) 
PDCD1LG2 <- grep("(PDCD1LG2)", SPOT_ID_b$SPOT_ID.1, fixed=T)

Lets get these ID by gene name by using these indices of the genes in the research study for its own data table.

genesStudied <- SPOT_ID_b[c(IDO1, IFNG, PDL1, CD3, CD5, CD10, CD15, CD20, CD30, CD79a, PAX5, BCL2, BCL6, MUM1, PTPRD, CD274, ERP44, PDCD1LG2),]

paged_table(genesStudied)

There are only 11 genes that made this data frame of genes from the research study genes. Lets get the summary information for change values of these genes,

studied <- merge(summary,genesStudied, by.x='ID', by.y='ID')

paged_table(studied)

studied$Gene_ID <- c("BCL6","IDO1","CD274","PDCD1LG2","PTPRD","PAX5","ERP44","CD5","IFNG","BCL2","MUM1")

studied$importance <- "Research Article GSE305165 genes in the data that had an ID match a gene from the study importance genes"

We have 11 genes from the research study in the data and 17 top change genes from each of the 3 lymphoma classes where not all genes were attached to a gene name by Affymetrix ID tag.

Lets combine these top genes since we already have their importance. We will save the other groups for later. We have 3 of the classes of 14 classes, we still have 11 more subclasses within the three classes to get the top genes for.

topStudied_3Ls <- rbind(topGenes_CHL_pDLBCL_mDLBCL,studied)

Lets write this out to csv.

write.csv(topStudied_3Ls, 'topStudied_and_3L_classes.csv', row.names=F)

Lets combine the summary stats to the file data.

Data <- cbind(file,summary)

colnames(Data)

##   [1] "pDLBCL"                "CHL"                   "pDLBCL.1"             
##   [4] "mDLBCL"                "CHL.1"                 "mDLBCL.1"             
##   [7] "pDLBCL.2"              "CHL.2"                 "CHL.3"                
##  [10] "CHL.4"                 "CHL.5"                 "mDLBCL.2"             
##  [13] "mDLBCL.3"              "CHL.6"                 "CHL.7"                
##  [16] "pDLBCL.3"              "CHL.8"                 "mDLBCL.4"             
##  [19] "mDLBCL.5"              "mDLBCL.6"              "mDLBCL.7"             
##  [22] "CHL.9"                 "mDLBCL.8"              "CHL.10"               
##  [25] "CHL.11"                "CHL.12"                "CHL.13"               
##  [28] "mDLBCL.9"              "mDLBCL.10"             "mDLBCL.11"            
##  [31] "mDLBCL.12"             "mDLBCL.13"             "mDLBCL.14"            
##  [34] "pDLBCL.4"              "pDLBCL.5"              "mDLBCL.15"            
##  [37] "mDLBCL.16"             "CHL.14"                "pDLBCL.6"             
##  [40] "pDLBCL.7"              "CHL.15"                "CHL.16"               
##  [43] "CHL.17"                "mDLBCL.17"             "mDLBCL.18"            
##  [46] "mDLBCL.19"             "CHL.18"                "ID"                   
##  [49] "CHL_mean"              "pDLBCL_mean"           "mDLBCL_mean"          
##  [52] "CHL_x_mean"            "mDLBCL_x_mean"         "pDLBCL_x_mean"        
##  [55] "CHL_y_mean"            "mDLBCL_y_mean"         "pDLBCL_y_mean"        
##  [58] "CHL_young72_mean"      "mDLBCL_young72_mean"   "pDLBCL_young72_mean"  
##  [61] "CHL_old"               "mDLBCL_old"            "pDLBCL_old"           
##  [64] "CHL_median"            "mDLBCL_median"         "pDLBCL_median"        
##  [67] "CHL_x_median"          "mDLBCL_x_median"       "pDLBCL_x_median"      
##  [70] "CHL_y_median"          "mDLBCL_y_median"       "pDLBCL_y_median"      
##  [73] "CHL_young72_median"    "mDLBCL_young72_median" "pDLBCL_young72_median"
##  [76] "CHL_old72_median"      "mDLBCL_old72_median"   "pDLBCL_old72_median"  
##  [79] "ID"                    "CHL_mean"              "pDLBCL_mean"          
##  [82] "mDLBCL_mean"           "CHL_x_mean"            "mDLBCL_x_mean"        
##  [85] "pDLBCL_x_mean"         "CHL_y_mean"            "mDLBCL_y_mean"        
##  [88] "pDLBCL_y_mean"         "CHL_young72_mean"      "mDLBCL_young72_mean"  
##  [91] "pDLBCL_young72_mean"   "CHL_old"               "mDLBCL_old"           
##  [94] "pDLBCL_old"            "CHL_median"            "mDLBCL_median"        
##  [97] "pDLBCL_median"         "CHL_x_median"          "mDLBCL_x_median"      
## [100] "pDLBCL_x_median"       "CHL_y_median"          "mDLBCL_y_median"      
## [103] "pDLBCL_y_median"       "CHL_young72_median"    "mDLBCL_young72_median"
## [106] "pDLBCL_young72_median" "CHL_old72_median"      "mDLBCL_old72_median"  
## [109] "pDLBCL_old72_median"   "CHL_change"            "pDLBCL_change"        
## [112] "mDLBCL_change"         "CHL_x_change"          "CHL_y_change"         
## [115] "pDLBCL_x_change"       "pDLBCL_y_change"       "mDLBCL_x_change"      
## [118] "mDLBCL_y_change"       "CHL_old72_change"      "CHL_young72_change"   
## [121] "pDLBCL_old_change"     "pDLBCL_young72_change" "mDLBCL_young72_change"

Lets now merge this with the SPOT_ID_b table, except there is an error, might be too many rows, so this won’t work.

Data_gene <- merge(SPOT_ID_b,Data, by.x="ID", by.y="ID")

Error in fix.by(by.y, y) : ‘by’ must specify a uniquely valid column 5. stop(ngettext(sum(bad), “‘by’ must specify a uniquely valid column”, “‘by’ must specify uniquely valid columns”), domain = NA) 4. fix.by(by.y, y) 3. merge.data.frame(SPOT_ID_b, Data, by.x = “ID”, by.y = “ID”) 2. merge(SPOT_ID_b, Data, by.x = “ID”, by.y = “ID”) 1. merge(SPOT_ID_b, Data, by.x = “ID”, by.y = “ID”)

Data$ID[1:10]

##  [1] "23064070" "23064071" "23064072" "23064073" "23064074" "23064075"
##  [7] "23064076" "23064077" "23064078" "23064079"

SPOT_ID_b$ID[1:10]

##  [1] "TC0100006437.hg.1" "TC0100006476.hg.1" "TC0100006479.hg.1"
##  [4] "TC0100006480.hg.1" "TC0100006483.hg.1" "TC0100006486.hg.1"
##  [7] "TC0100006490.hg.1" "TC0100006492.hg.1" "TC0100006494.hg.1"
## [10] "TC0100006497.hg.1"

Both the data frames have an ID column and are named the same, so must be an error with the number of columns merging as in a limit. Data has 123 columns and the other has 2.

Lets also put the ID column as the 1st column and not 48 stuck in middle as it was the last column at one point befor all summary stats.

Data1 <- Data[,c(48,1:47,49:123)]

Lets write Data out to csv it has all our summary stats and avg/med per group values and subgroup values of age and gender.

write.csv(Data1, 'Data_27189X123_GSE305165_summaryStatsChangeAvg_med.csv', row.names=F)

The last file we made is very large. Let me see if I can upload it to Kaggle to share, and it was possible. Here is that dataset

The top 3 lymphoma genes and genes studied with just change values and summary mean and median per sample is here.

We still need to test these genes and look at how the study genes did for the major players that were in the data like IFNG, CD274 for PDL2, and IDO1 in each class of lymphoma. And get the summary stats on the subgroups of class by gender and class by age in the 11 other subcategories for the skewed change of avg/median per subcategory samples. Get those genes from the list of groups of gene IDs and names per Affy ID.

This is the end of Part 2. We have that above to do in Part 3 and also machine learning of the different top genes with study’s genes to see how well they predict the 3 classes and also the 4 new group labels.

Thanks so much and keep checking in.

======================================================

********** Part 3

We need to find the change values of variation between the group avg / group medians for the other 11 groups. So, lets start by reading in our data at links above after we import the libraries.

library(rmarkdown)

dataset

data <- read.csv('Data_27189X123_GSE305165_summaryStatsChangeAvg_med.csv', header=T)
paged_table(data[1:10,])

here

topStudied_3Ls <- read.csv("topStudied_and_3L_classes.csv", header=T)

paged_table(topStudied_3Ls)

The above table is the top genes from the studies in the data and the top genes from running a simple measure of variation in the group avg divided by group median for the 3 groups of lymphomas of CHL, pDLBCL, and mDLBCL.

Lets bring in the table on the header demographic information for each group. We changed the new data and topStudied_3Ls tables to have the ID field as first column but the table of demographic information will need to have the information presented differently when we combine this with the other group top genes.

You can get the labels file after edits made here.

labels <- read.csv("compare_df_6labels_47samples.csv", header=T, row.names=1)

paged_table(labels)

We will be predicting by group when we get to our random Forest machine model.

But the groups we are finding the change of avg/med per group are:

CHL, pDLBCL, mDLBCL (3 classes of lymphoma all with EBV, done in part 2)
CHL_x, pDLBCL_x, mDLBCL_x (females within each class of lymphoma)
CHL_y, pDLBCL_y, mDLBCL_y (males within each class of lymphoma)
CHL_young72, pDLBCL_young72, mDLBCL_young72 (younger or equal to 72 years of age in each lymphoma class)
CHL_old72, pDLBCL_old72, mDLBCL_old72 (older than 72 years of age in each lymphoma class)

Looks like we have 15 classes in total and not 14, so one of the other classes must have been missed when added to data. Lets see what class it is.

colnames(data[,c(110:123)])

##  [1] "CHL_change"            "pDLBCL_change"         "mDLBCL_change"        
##  [4] "CHL_x_change"          "CHL_y_change"          "pDLBCL_x_change"      
##  [7] "pDLBCL_y_change"       "mDLBCL_x_change"       "mDLBCL_y_change"      
## [10] "CHL_old72_change"      "CHL_young72_change"    "pDLBCL_old_change"    
## [13] "pDLBCL_young72_change" "mDLBCL_young72_change"

We didn’t add the mDLBCL_old72_change to the data frame. We can do that now because we have the information.

data$mDLBCL_old72_change <- data$mDLBCL_old/data$mDLBCL_old72_median

colnames(data)

##   [1] "ID"                      "pDLBCL"                 
##   [3] "CHL"                     "pDLBCL.1"               
##   [5] "mDLBCL"                  "CHL.1"                  
##   [7] "mDLBCL.1"                "pDLBCL.2"               
##   [9] "CHL.2"                   "CHL.3"                  
##  [11] "CHL.4"                   "CHL.5"                  
##  [13] "mDLBCL.2"                "mDLBCL.3"               
##  [15] "CHL.6"                   "CHL.7"                  
##  [17] "pDLBCL.3"                "CHL.8"                  
##  [19] "mDLBCL.4"                "mDLBCL.5"               
##  [21] "mDLBCL.6"                "mDLBCL.7"               
##  [23] "CHL.9"                   "mDLBCL.8"               
##  [25] "CHL.10"                  "CHL.11"                 
##  [27] "CHL.12"                  "CHL.13"                 
##  [29] "mDLBCL.9"                "mDLBCL.10"              
##  [31] "mDLBCL.11"               "mDLBCL.12"              
##  [33] "mDLBCL.13"               "mDLBCL.14"              
##  [35] "pDLBCL.4"                "pDLBCL.5"               
##  [37] "mDLBCL.15"               "mDLBCL.16"              
##  [39] "CHL.14"                  "pDLBCL.6"               
##  [41] "pDLBCL.7"                "CHL.15"                 
##  [43] "CHL.16"                  "CHL.17"                 
##  [45] "mDLBCL.17"               "mDLBCL.18"              
##  [47] "mDLBCL.19"               "CHL.18"                 
##  [49] "CHL_mean"                "pDLBCL_mean"            
##  [51] "mDLBCL_mean"             "CHL_x_mean"             
##  [53] "mDLBCL_x_mean"           "pDLBCL_x_mean"          
##  [55] "CHL_y_mean"              "mDLBCL_y_mean"          
##  [57] "pDLBCL_y_mean"           "CHL_young72_mean"       
##  [59] "mDLBCL_young72_mean"     "pDLBCL_young72_mean"    
##  [61] "CHL_old"                 "mDLBCL_old"             
##  [63] "pDLBCL_old"              "CHL_median"             
##  [65] "mDLBCL_median"           "pDLBCL_median"          
##  [67] "CHL_x_median"            "mDLBCL_x_median"        
##  [69] "pDLBCL_x_median"         "CHL_y_median"           
##  [71] "mDLBCL_y_median"         "pDLBCL_y_median"        
##  [73] "CHL_young72_median"      "mDLBCL_young72_median"  
##  [75] "pDLBCL_young72_median"   "CHL_old72_median"       
##  [77] "mDLBCL_old72_median"     "pDLBCL_old72_median"    
##  [79] "ID.1"                    "CHL_mean.1"             
##  [81] "pDLBCL_mean.1"           "mDLBCL_mean.1"          
##  [83] "CHL_x_mean.1"            "mDLBCL_x_mean.1"        
##  [85] "pDLBCL_x_mean.1"         "CHL_y_mean.1"           
##  [87] "mDLBCL_y_mean.1"         "pDLBCL_y_mean.1"        
##  [89] "CHL_young72_mean.1"      "mDLBCL_young72_mean.1"  
##  [91] "pDLBCL_young72_mean.1"   "CHL_old.1"              
##  [93] "mDLBCL_old.1"            "pDLBCL_old.1"           
##  [95] "CHL_median.1"            "mDLBCL_median.1"        
##  [97] "pDLBCL_median.1"         "CHL_x_median.1"         
##  [99] "mDLBCL_x_median.1"       "pDLBCL_x_median.1"      
## [101] "CHL_y_median.1"          "mDLBCL_y_median.1"      
## [103] "pDLBCL_y_median.1"       "CHL_young72_median.1"   
## [105] "mDLBCL_young72_median.1" "pDLBCL_young72_median.1"
## [107] "CHL_old72_median.1"      "mDLBCL_old72_median.1"  
## [109] "pDLBCL_old72_median.1"   "CHL_change"             
## [111] "pDLBCL_change"           "mDLBCL_change"          
## [113] "CHL_x_change"            "CHL_y_change"           
## [115] "pDLBCL_x_change"         "pDLBCL_y_change"        
## [117] "mDLBCL_x_change"         "mDLBCL_y_change"        
## [119] "CHL_old72_change"        "CHL_young72_change"     
## [121] "pDLBCL_old_change"       "pDLBCL_young72_change"  
## [123] "mDLBCL_young72_change"   "mDLBCL_old72_change"

Somewhere along the way of updating columns and reordering them, there were duplicates added and why there is a ‘.1’ at the end of the summary columns. I must have added the summary table to the data when I forgot it was already added. So we can delete them here, as the other data frame was already uploaded online. From list above, those are the ending columns of “ID.1”:“pDLBCL_old72_median.1” or 79:109. Lets remove those. Once the dataset is uploaded to Kaggle I am unaware currently of how to change the file to the new one. So we can keep the link above but change it with the edits here. I will rewrite this one out to a different kaggle dataset once completed to share.

Data <- data[,c(1:78,110:124)]
rm(data)

colnames(Data)

##  [1] "ID"                    "pDLBCL"                "CHL"                  
##  [4] "pDLBCL.1"              "mDLBCL"                "CHL.1"                
##  [7] "mDLBCL.1"              "pDLBCL.2"              "CHL.2"                
## [10] "CHL.3"                 "CHL.4"                 "CHL.5"                
## [13] "mDLBCL.2"              "mDLBCL.3"              "CHL.6"                
## [16] "CHL.7"                 "pDLBCL.3"              "CHL.8"                
## [19] "mDLBCL.4"              "mDLBCL.5"              "mDLBCL.6"             
## [22] "mDLBCL.7"              "CHL.9"                 "mDLBCL.8"             
## [25] "CHL.10"                "CHL.11"                "CHL.12"               
## [28] "CHL.13"                "mDLBCL.9"              "mDLBCL.10"            
## [31] "mDLBCL.11"             "mDLBCL.12"             "mDLBCL.13"            
## [34] "mDLBCL.14"             "pDLBCL.4"              "pDLBCL.5"             
## [37] "mDLBCL.15"             "mDLBCL.16"             "CHL.14"               
## [40] "pDLBCL.6"              "pDLBCL.7"              "CHL.15"               
## [43] "CHL.16"                "CHL.17"                "mDLBCL.17"            
## [46] "mDLBCL.18"             "mDLBCL.19"             "CHL.18"               
## [49] "CHL_mean"              "pDLBCL_mean"           "mDLBCL_mean"          
## [52] "CHL_x_mean"            "mDLBCL_x_mean"         "pDLBCL_x_mean"        
## [55] "CHL_y_mean"            "mDLBCL_y_mean"         "pDLBCL_y_mean"        
## [58] "CHL_young72_mean"      "mDLBCL_young72_mean"   "pDLBCL_young72_mean"  
## [61] "CHL_old"               "mDLBCL_old"            "pDLBCL_old"           
## [64] "CHL_median"            "mDLBCL_median"         "pDLBCL_median"        
## [67] "CHL_x_median"          "mDLBCL_x_median"       "pDLBCL_x_median"      
## [70] "CHL_y_median"          "mDLBCL_y_median"       "pDLBCL_y_median"      
## [73] "CHL_young72_median"    "mDLBCL_young72_median" "pDLBCL_young72_median"
## [76] "CHL_old72_median"      "mDLBCL_old72_median"   "pDLBCL_old72_median"  
## [79] "CHL_change"            "pDLBCL_change"         "mDLBCL_change"        
## [82] "CHL_x_change"          "CHL_y_change"          "pDLBCL_x_change"      
## [85] "pDLBCL_y_change"       "mDLBCL_x_change"       "mDLBCL_y_change"      
## [88] "CHL_old72_change"      "CHL_young72_change"    "pDLBCL_old_change"    
## [91] "pDLBCL_young72_change" "mDLBCL_young72_change" "mDLBCL_old72_change"

Columns 79 to 93 are the change columns to extract top 20 genes of 10 most over estimated and 10 most under estimated in the groups.

CHL_x_ordered <- Data[order(Data$CHL_x_change, decreasing=T),]
top_CHL_x <- CHL_x_ordered[c(1:10,27180:27189),]

CHL_y_ordered <- Data[order(Data$CHL_y_change, decreasing=T),]
top_CHL_y <- CHL_y_ordered[c(1:10, 27180:27189),]

pDLBCL_x_ordered <- Data[order(Data$pDLBCL_x_change, decreasing=T),]
top_pDLBCL_x <- pDLBCL_x_ordered[c(1:10, 27180:27189),]

pDLBCL_y_ordered <- Data[order(Data$pDLBCL_y_change, decreasing=T),]
top_pDLBCL_y <- pDLBCL_y_ordered[c(1:10,27180:27189),]

mDLBCL_x_ordered <- Data[order(Data$mDLBCL_x_change, decreasing=T),]
top_mDLBCL_x <- mDLBCL_x_ordered[c(1:10, 27180:27189),]

mDLBCL_y_ordered <- Data[order(Data$mDLBCL_y_change, decreasing=T),]
top_mDLBCL_y <- mDLBCL_y_ordered[c(1:10, 27180:27189),]

That was the gender top genes, but now for age top genes within each class of lymphoma.

CHL_young72_ordered <- Data[order(Data$CHL_young72_change, decreasing=T),]
top_CHL_young72 <- CHL_young72_ordered[c(1:10,27180:27189),]

CHL_old72_ordered <- Data[order(Data$CHL_old72_change, decreasing=T),]
top_CHL_old72 <- CHL_old72_ordered[c(1:10,27180:27189),]

pDLBCL_young72_ordered <- Data[order(Data$pDLBCL_young72_change, decreasing=T),]
top_pDLBCL_young72 <- pDLBCL_young72_ordered[c(1:10,27180:27189),]

pDLBCL_old72_ordered <- Data[order(Data$pDLBCL_old_change, decreasing=T),]
top_pDLBCL_old72 <- pDLBCL_old72_ordered[c(1:10,27180:27189),]

mDLBCL_young72_ordered <- Data[order(Data$mDLBCL_young72_change, decreasing=T),]
top_mDLBCL_young72 <- mDLBCL_young72_ordered[c(1:10,27180:27189),]

mDLBCL_old72_ordered <- Data[order(Data$mDLBCL_old72_change, decreasing=T),]
top_mDLBCL_old72 <- mDLBCL_old72_ordered[c(1:10,27180:27189),]

Now we have the other 12 groups of genes, I could change the intro to 12 right now, but somebody is paying attention, while others could be just browsing. Understanding what is going on and what you are doing is important to know what you are progressing towards.

There was another file we need to upload on the feature gene names for each ID that is an Affymetrix ID. The features table has the gene names but enclosed in parenthesis, there may be 20 genes per group of top genes but when combining these Affy IDs with their gene names, the Affy ID might not be matched and dropped leaving maybe only 1-2 genes or more or less. The features data is 5,670 genes less than our data of genes. So some may be excluded for that reason. This file is extremely large for what it is because of the listed groups in each gene of 21k+ genes. This was in the CEL file that was made using bioconductor in part 1 extracting feature information with the following code that maps the probe IDs to the feature genes:

library(GEOquery)

gse <- getGEO("GSE305165", GSEMatrix = T)

feature.data  <- gse$GSE305165_series_matrix.txt.gz@featureData@data

This is 116 MB due to that last column. Too big for google sheets but kaggle allows it. Get the features data here.

features <- read.csv("feature.data.csv", header=T, row.names=1)

colnames(features)

##  [1] "ID"           "probeset_id"  "seqname"      "strand"       "start"       
##  [6] "stop"         "total_probes" "category"     "SPOT_ID"      "SPOT_ID.1"

ls()

##   [1] "age"                        "age_t"                     
##   [3] "BCL2"                       "BCL6"                      
##   [5] "CD10"                       "CD15"                      
##   [7] "CD20"                       "CD274"                     
##   [9] "CD3"                        "CD30"                      
##  [11] "CD5"                        "CD79a"                     
##  [13] "cel_files"                  "characteristics_df"        
##  [15] "CHL"                        "CHL_change"                
##  [17] "CHL_old"                    "CHL_old72_ordered"         
##  [19] "CHL_x"                      "CHL_x_ordered"             
##  [21] "CHL_y"                      "CHL_y_ordered"             
##  [23] "CHL_young"                  "CHL_young72_ordered"       
##  [25] "chr9"                       "compare"                   
##  [27] "Data"                       "Data1"                     
##  [29] "dx"                         "dx_t"                      
##  [31] "EBER"                       "EBNA2"                     
##  [33] "endPDL1"                    "endPDL2"                   
##  [35] "ERP44"                      "feature.data"              
##  [37] "feature.data1"              "features"                  
##  [39] "females"                    "file"                      
##  [41] "gender"                     "gender_t"                  
##  [43] "genesStudied"               "group"                     
##  [45] "group_t"                    "GSVA"                      
##  [47] "ID"                         "ID_t"                      
##  [49] "IDO1"                       "IFNG"                      
##  [51] "labels"                     "LMP1"                      
##  [53] "males"                      "mDLBCL"                    
##  [55] "mDLBCL_change"              "mDLBCL_old"                
##  [57] "mDLBCL_old72_ordered"       "mDLBCL_x"                  
##  [59] "mDLBCL_x_ordered"           "mDLBCL_y"                  
##  [61] "mDLBCL_y_ordered"           "mDLBCL_young"              
##  [63] "mDLBCL_young72_ordered"     "MUM1"                      
##  [65] "newNames"                   "normalized.expr"           
##  [67] "old"                        "PAX5"                      
##  [69] "PDCD1LG2"                   "PDL1"                      
##  [71] "PDL1_loci"                  "PDL2"                      
##  [73] "PDL2_loci"                  "pDLBCL"                    
##  [75] "pDLBCL_change"              "pDLBCL_old"                
##  [77] "pDLBCL_old72_ordered"       "pDLBCL_x"                  
##  [79] "pDLBCL_x_ordered"           "pDLBCL_y"                  
##  [81] "pDLBCL_y_ordered"           "pDLBCL_young"              
##  [83] "pDLBCL_young72_ordered"     "PTPRD"                     
##  [85] "series"                     "series4"                   
##  [87] "SPOT_ID_b"                  "SPOT_IDs"                  
##  [89] "startPDL1"                  "startPDL2"                 
##  [91] "studied"                    "summary"                   
##  [93] "top_CHL_old72"              "top_CHL_x"                 
##  [95] "top_CHL_y"                  "top_CHL_young72"           
##  [97] "top_mDLBCL"                 "top_mDLBCL_gene"           
##  [99] "top_mDLBCL_old72"           "top_mDLBCL_x"              
## [101] "top_mDLBCL_y"               "top_mDLBCL_young72"        
## [103] "top_pDLBCL"                 "top_pDLBCL_gene"           
## [105] "top_pDLBCL_old72"           "top_pDLBCL_x"              
## [107] "top_pDLBCL_y"               "top_pDLBCL_young72"        
## [109] "topCHL"                     "topCHL_gene"               
## [111] "topGenes_CHL_pDLBCL_mDLBCL" "topStudied_3Ls"            
## [113] "young"

Lets remove some of the clutter of the ordered tables we won’t use again.

rm("mDLBCL_old72_ordered" ,  "mDLBCL_x_ordered"    ,   "mDLBCL_y_ordered"    ,   "mDLBCL_young72_ordered",
"pDLBCL_old72_ordered",   "pDLBCL_x_ordered"    ,   "pDLBCL_y_ordered"    ,   "pDLBCL_young72_ordered","CHL_old72_ordered"   ,   "CHL_x_ordered"    ,      "CHL_y_ordered"  ,        "CHL_young72_ordered")

Now we can see which top genes data sets to get the gene ID to from the features table that we must combine with each top gene table by listing our object in our Environment window of Rstudio.

ls()

##   [1] "age"                        "age_t"                     
##   [3] "BCL2"                       "BCL6"                      
##   [5] "CD10"                       "CD15"                      
##   [7] "CD20"                       "CD274"                     
##   [9] "CD3"                        "CD30"                      
##  [11] "CD5"                        "CD79a"                     
##  [13] "cel_files"                  "characteristics_df"        
##  [15] "CHL"                        "CHL_change"                
##  [17] "CHL_old"                    "CHL_x"                     
##  [19] "CHL_y"                      "CHL_young"                 
##  [21] "chr9"                       "compare"                   
##  [23] "Data"                       "Data1"                     
##  [25] "dx"                         "dx_t"                      
##  [27] "EBER"                       "EBNA2"                     
##  [29] "endPDL1"                    "endPDL2"                   
##  [31] "ERP44"                      "feature.data"              
##  [33] "feature.data1"              "features"                  
##  [35] "females"                    "file"                      
##  [37] "gender"                     "gender_t"                  
##  [39] "genesStudied"               "group"                     
##  [41] "group_t"                    "GSVA"                      
##  [43] "ID"                         "ID_t"                      
##  [45] "IDO1"                       "IFNG"                      
##  [47] "labels"                     "LMP1"                      
##  [49] "males"                      "mDLBCL"                    
##  [51] "mDLBCL_change"              "mDLBCL_old"                
##  [53] "mDLBCL_x"                   "mDLBCL_y"                  
##  [55] "mDLBCL_young"               "MUM1"                      
##  [57] "newNames"                   "normalized.expr"           
##  [59] "old"                        "PAX5"                      
##  [61] "PDCD1LG2"                   "PDL1"                      
##  [63] "PDL1_loci"                  "PDL2"                      
##  [65] "PDL2_loci"                  "pDLBCL"                    
##  [67] "pDLBCL_change"              "pDLBCL_old"                
##  [69] "pDLBCL_x"                   "pDLBCL_y"                  
##  [71] "pDLBCL_young"               "PTPRD"                     
##  [73] "series"                     "series4"                   
##  [75] "SPOT_ID_b"                  "SPOT_IDs"                  
##  [77] "startPDL1"                  "startPDL2"                 
##  [79] "studied"                    "summary"                   
##  [81] "top_CHL_old72"              "top_CHL_x"                 
##  [83] "top_CHL_y"                  "top_CHL_young72"           
##  [85] "top_mDLBCL"                 "top_mDLBCL_gene"           
##  [87] "top_mDLBCL_old72"           "top_mDLBCL_x"              
##  [89] "top_mDLBCL_y"               "top_mDLBCL_young72"        
##  [91] "top_pDLBCL"                 "top_pDLBCL_gene"           
##  [93] "top_pDLBCL_old72"           "top_pDLBCL_x"              
##  [95] "top_pDLBCL_y"               "top_pDLBCL_young72"        
##  [97] "topCHL"                     "topCHL_gene"               
##  [99] "topGenes_CHL_pDLBCL_mDLBCL" "topStudied_3Ls"            
## [101] "young"

The topStudied_3LS table of top genes has already combined the Affy IDs with gene IDs in features table in Part 2 of the 3 lymphomas and research study relevant genes.

colnames(topStudied_3Ls)

##  [1] "ID"                    "CHL_mean"              "pDLBCL_mean"          
##  [4] "mDLBCL_mean"           "CHL_x_mean"            "mDLBCL_x_mean"        
##  [7] "pDLBCL_x_mean"         "CHL_y_mean"            "mDLBCL_y_mean"        
## [10] "pDLBCL_y_mean"         "CHL_young72_mean"      "mDLBCL_young72_mean"  
## [13] "pDLBCL_young72_mean"   "CHL_old"               "mDLBCL_old"           
## [16] "pDLBCL_old"            "CHL_median"            "mDLBCL_median"        
## [19] "pDLBCL_median"         "CHL_x_median"          "mDLBCL_x_median"      
## [22] "pDLBCL_x_median"       "CHL_y_median"          "mDLBCL_y_median"      
## [25] "pDLBCL_y_median"       "CHL_young72_median"    "mDLBCL_young72_median"
## [28] "pDLBCL_young72_median" "CHL_old72_median"      "mDLBCL_old72_median"  
## [31] "pDLBCL_old72_median"   "CHL_change"            "pDLBCL_change"        
## [34] "mDLBCL_change"         "CHL_x_change"          "CHL_y_change"         
## [37] "pDLBCL_x_change"       "pDLBCL_y_change"       "mDLBCL_x_change"      
## [40] "mDLBCL_y_change"       "CHL_old72_change"      "CHL_young72_change"   
## [43] "pDLBCL_old_change"     "pDLBCL_young72_change" "mDLBCL_young72_change"
## [46] "SPOT_ID.1"             "Gene_ID"               "importance"

It is also missing the mDLBCL older than 72 change effect in group avg/group median. We can add it and rearrange the columns now.

topStudied_3Ls$mDLBCL_old72_change <- topStudied_3Ls$mDLBCL_old/topStudied_3Ls$mDLBCL_old72_median

colnames(topStudied_3Ls)

##  [1] "ID"                    "CHL_mean"              "pDLBCL_mean"          
##  [4] "mDLBCL_mean"           "CHL_x_mean"            "mDLBCL_x_mean"        
##  [7] "pDLBCL_x_mean"         "CHL_y_mean"            "mDLBCL_y_mean"        
## [10] "pDLBCL_y_mean"         "CHL_young72_mean"      "mDLBCL_young72_mean"  
## [13] "pDLBCL_young72_mean"   "CHL_old"               "mDLBCL_old"           
## [16] "pDLBCL_old"            "CHL_median"            "mDLBCL_median"        
## [19] "pDLBCL_median"         "CHL_x_median"          "mDLBCL_x_median"      
## [22] "pDLBCL_x_median"       "CHL_y_median"          "mDLBCL_y_median"      
## [25] "pDLBCL_y_median"       "CHL_young72_median"    "mDLBCL_young72_median"
## [28] "pDLBCL_young72_median" "CHL_old72_median"      "mDLBCL_old72_median"  
## [31] "pDLBCL_old72_median"   "CHL_change"            "pDLBCL_change"        
## [34] "mDLBCL_change"         "CHL_x_change"          "CHL_y_change"         
## [37] "pDLBCL_x_change"       "pDLBCL_y_change"       "mDLBCL_x_change"      
## [40] "mDLBCL_y_change"       "CHL_old72_change"      "CHL_young72_change"   
## [43] "pDLBCL_old_change"     "pDLBCL_young72_change" "mDLBCL_young72_change"
## [46] "SPOT_ID.1"             "Gene_ID"               "importance"           
## [49] "mDLBCL_old72_change"

topStudied_3Ls <- topStudied_3Ls[,c(1:45,49,46:48)]
colnames(topStudied_3Ls)

##  [1] "ID"                    "CHL_mean"              "pDLBCL_mean"          
##  [4] "mDLBCL_mean"           "CHL_x_mean"            "mDLBCL_x_mean"        
##  [7] "pDLBCL_x_mean"         "CHL_y_mean"            "mDLBCL_y_mean"        
## [10] "pDLBCL_y_mean"         "CHL_young72_mean"      "mDLBCL_young72_mean"  
## [13] "pDLBCL_young72_mean"   "CHL_old"               "mDLBCL_old"           
## [16] "pDLBCL_old"            "CHL_median"            "mDLBCL_median"        
## [19] "pDLBCL_median"         "CHL_x_median"          "mDLBCL_x_median"      
## [22] "pDLBCL_x_median"       "CHL_y_median"          "mDLBCL_y_median"      
## [25] "pDLBCL_y_median"       "CHL_young72_median"    "mDLBCL_young72_median"
## [28] "pDLBCL_young72_median" "CHL_old72_median"      "mDLBCL_old72_median"  
## [31] "pDLBCL_old72_median"   "CHL_change"            "pDLBCL_change"        
## [34] "mDLBCL_change"         "CHL_x_change"          "CHL_y_change"         
## [37] "pDLBCL_x_change"       "pDLBCL_y_change"       "mDLBCL_x_change"      
## [40] "mDLBCL_y_change"       "CHL_old72_change"      "CHL_young72_change"   
## [43] "pDLBCL_old_change"     "pDLBCL_young72_change" "mDLBCL_young72_change"
## [46] "mDLBCL_old72_change"   "SPOT_ID.1"             "Gene_ID"              
## [49] "importance"

Lets add the gene features, but also notice that we will be manually adding in the Gene_ID from the SPOT_ID.1 column of features with the content in parenthesis, and an importance column to show why that gene was selected by being a top gene of which group subgroup.

We can do the gender subgroups now.

CHL_x_genes <- merge(top_CHL_x,features[,c(1,10)], by.x="ID",by.y="ID")

paged_table(CHL_x_genes)

CHL_x_genes$Gene_ID <- c("NM_001040033","RGS13","S100A12","NM_001033556","JCHAIN","85444","51633","NM_014117",
                         "CCL13")
CHL_x_genes$importance <- "This is the top genes from the CHL group of females only, the number Gene_IDs are Entrez IDs because the other Gene ID wasn't available or there is not a Gene ID for it. Change is the avg/median group values. "

CHL_y_genes <- merge(top_CHL_y,features[,c(1,10)], by.x="ID",by.y="ID")

paged_table(CHL_y_genes)

CHL_y_genes$Gene_ID <- c("CHIT1")
CHL_y_genes$importance <- "This is the top genes from the CHL group of males only, there was only 1 gene of CHL and males from the available features genes of 21k that matched up to the top 20 genes in the data of 27k probe IDs. Change is the avg/median group values.  "

pDLBCL_x_genes <- merge(top_pDLBCL_x,features[,c(1,10)], by.x="ID",by.y="ID")

paged_table(pDLBCL_x_genes)

pDLBCL_x_genes$Gene_ID <- c("CLCA2","MUC7","GCNT4")
pDLBCL_x_genes$importance <- "These are from the top 20 genes of pDLBCL and females, but only 3 genes were matched from their probe IDs to a gene in the features data. Change is the avg/median group values. "

pDLBCL_y_genes <- merge(top_pDLBCL_y,features[,c(1,10)], by.x="ID",by.y="ID")

paged_table(pDLBCL_y_genes)

pDLBCL_y_genes$Gene_ID <- c("SMPX","LOC389834")
pDLBCL_y_genes$importance <- "These are from the top 20 genes of pDLBCL and males, but only 2 genes were matched from their probe IDs to a gene in the features data. Change is the avg/median group values. "

mDLBCL_x_genes <- merge(top_mDLBCL_x,features[,c(1,10)], by.x="ID",by.y="ID")

paged_table(mDLBCL_x_genes)

mDLBCL_x_genes$Gene_ID <- c("IL7R","MAL2","KRT4","CLDN10")
mDLBCL_x_genes$importance <- "These are from the top 20 genes of mDLBCL and females, but only 4 genes were matched from their probe IDs to a gene in the features data. Change is the avg/median group values. "

mDLBCL_y_genes <- merge(top_mDLBCL_y,features[,c(1,10)], by.x="ID",by.y="ID")

paged_table(mDLBCL_y_genes)

mDLBCL_y_genes$Gene_ID <- c("SPRR1B","MAL2","FABP4","CT45A6","CT45A10","CLDN10","DSG3")
mDLBCL_y_genes$importance <- "These are from the top 20 genes of mDLBCL and males, but only 7 genes were matched from their probe IDs to a gene in the features data. Change is the avg/median group values. "

We have extracted the top genes that we could match to the top 20 barcodes of genes by group avg/group median for the gender classes within each lymphoma class. We can add that to our topStudied_3Ls data frame now. Lets add back into this data the orginal samples, the new tables have all samples and summary stats, but not the topStudied_3Ls that only has stats and change values with the added 2 values of Gene_ID and importance.

top3Ls_studied <- merge(Data[,c(1:48)], topStudied_3Ls, by.x="ID", by.y="ID")
colnames(top3Ls_studied)

##  [1] "ID"                    "pDLBCL"                "CHL"                  
##  [4] "pDLBCL.1"              "mDLBCL"                "CHL.1"                
##  [7] "mDLBCL.1"              "pDLBCL.2"              "CHL.2"                
## [10] "CHL.3"                 "CHL.4"                 "CHL.5"                
## [13] "mDLBCL.2"              "mDLBCL.3"              "CHL.6"                
## [16] "CHL.7"                 "pDLBCL.3"              "CHL.8"                
## [19] "mDLBCL.4"              "mDLBCL.5"              "mDLBCL.6"             
## [22] "mDLBCL.7"              "CHL.9"                 "mDLBCL.8"             
## [25] "CHL.10"                "CHL.11"                "CHL.12"               
## [28] "CHL.13"                "mDLBCL.9"              "mDLBCL.10"            
## [31] "mDLBCL.11"             "mDLBCL.12"             "mDLBCL.13"            
## [34] "mDLBCL.14"             "pDLBCL.4"              "pDLBCL.5"             
## [37] "mDLBCL.15"             "mDLBCL.16"             "CHL.14"               
## [40] "pDLBCL.6"              "pDLBCL.7"              "CHL.15"               
## [43] "CHL.16"                "CHL.17"                "mDLBCL.17"            
## [46] "mDLBCL.18"             "mDLBCL.19"             "CHL.18"               
## [49] "CHL_mean"              "pDLBCL_mean"           "mDLBCL_mean"          
## [52] "CHL_x_mean"            "mDLBCL_x_mean"         "pDLBCL_x_mean"        
## [55] "CHL_y_mean"            "mDLBCL_y_mean"         "pDLBCL_y_mean"        
## [58] "CHL_young72_mean"      "mDLBCL_young72_mean"   "pDLBCL_young72_mean"  
## [61] "CHL_old"               "mDLBCL_old"            "pDLBCL_old"           
## [64] "CHL_median"            "mDLBCL_median"         "pDLBCL_median"        
## [67] "CHL_x_median"          "mDLBCL_x_median"       "pDLBCL_x_median"      
## [70] "CHL_y_median"          "mDLBCL_y_median"       "pDLBCL_y_median"      
## [73] "CHL_young72_median"    "mDLBCL_young72_median" "pDLBCL_young72_median"
## [76] "CHL_old72_median"      "mDLBCL_old72_median"   "pDLBCL_old72_median"  
## [79] "CHL_change"            "pDLBCL_change"         "mDLBCL_change"        
## [82] "CHL_x_change"          "CHL_y_change"          "pDLBCL_x_change"      
## [85] "pDLBCL_y_change"       "mDLBCL_x_change"       "mDLBCL_y_change"      
## [88] "CHL_old72_change"      "CHL_young72_change"    "pDLBCL_old_change"    
## [91] "pDLBCL_young72_change" "mDLBCL_young72_change" "mDLBCL_old72_change"  
## [94] "SPOT_ID.1"             "Gene_ID"               "importance"

top_3Ls_3LsGenders <- rbind(top3Ls_studied,CHL_x_genes, CHL_y_genes, pDLBCL_x_genes, pDLBCL_y_genes,
                            mDLBCL_x_genes,mDLBCL_y_genes)

paged_table(top_3Ls_3LsGenders)

There are 54 genes by groups of the 3 lymphomas, the 3 lymphomas only females in each, the 3 lymphomas only males in each, and the study genes. This is because even though we selected more genes, the probe IDs didn’t have a match for most of the genes in the feature list of genes. A few won’t match either, due to there not being a gene ID in parethesis for the Gene card ID we are using. Now we need to add the age group in each of the 3 lymphomas for those younger than or equal to 72 years of age and those older than 72 years of age.

CHL_young72_genes <- merge(top_CHL_young72,features[,c(1,10)], by.x="ID",by.y="ID")

paged_table(CHL_young72_genes)

CHL_young72_genes$Gene_ID <- "NA"
CHL_young72_genes$importance <- "There were none of the top 20 genes of Change skew avg/median in the CHL group of people younger than 72 in the data of genes to match any of the 20 probe IDs found. "

CHL_old72_genes <- merge(top_CHL_old72,features[,c(1,10)], by.x="ID",by.y="ID")

paged_table(CHL_old72_genes)

CHL_old72_genes$Gene_ID <- c("SCRG1","LUM")
CHL_old72_genes$importance <- "There were 2 genes found in the data of genes of CHL younger than 72 years of age to match any of the 20 probe IDs found.  Change is the avg/median group values. "

pDLBCL_young72_genes <- merge(top_pDLBCL_young72,features[,c(1,10)], by.x="ID",by.y="ID")

paged_table(pDLBCL_young72_genes)

pDLBCL_young72_genes$Gene_ID <- c("SPRR1B","S100A7","SPRR2A","GCNT4","CLDN10")
pDLBCL_young72_genes$importance <- "There were 5 genes found in the data of pDLBCL younger than 72 years of age genes to match any of the 20 probe IDs found.  Change is the avg/median group values. "

pDLBCL_old72_genes <- merge(top_pDLBCL_old72,features[,c(1,10)], by.x="ID",by.y="ID")

paged_table(pDLBCL_old72_genes)

pDLBCL_old72_genes$Gene_ID <- c("CLCA2","TACSTD2","MUC7","MGAM2","MAL2","CLDN10","DSG3")
pDLBCL_old72_genes$importance <- "There were 7 genes found in the data of pDLBCL older than 72 years of age genes to match any of the 20 probe IDs found.  Change is the avg/median group values. "

mDLBCL_young72_genes <- merge(top_mDLBCL_young72,features[,c(1,10)], by.x="ID",by.y="ID")

paged_table(mDLBCL_young72_genes)

mDLBCL_young72_genes$Gene_ID <- c("SPRR1B","SPRR2A","UTS2B","KIAA0825","MAL2","NM_201998","CLDN10","NOP10")
mDLBCL_young72_genes$importance <- "There were 8 genes found in the data of mDLBCL younger than 72 years of age genes to match any of the 20 probe IDs found.  Change is the avg/median group values. "

mDLBCL_old72_genes <- merge(top_mDLBCL_old72,features[,c(1,10)], by.x="ID",by.y="ID")

paged_table(mDLBCL_old72_genes)

mDLBCL_old72_genes$Gene_ID <- c("SPRR1B","MAL2","FABP4","CT45A6","DDX3Y","CLDN10","DSG3","DMKN")
mDLBCL_old72_genes$importance <- "There were 8 genes found in the data of mDLBCL older than 72 years of age genes to match any of the 20 probe IDs found.  Change is the avg/median group values. "

Now we have found all top genes from the top 20 change in avg/median group values of probe IDs we filtered top 20 of top 10 over and top 10 under estimated by avg/median to find match in list of genes that matched probes, some 5,650+ genes were left out for the number of probe IDs in the data

Lets add these to our top genes database and it is good to go. We have to omit ,CHL_young72_genes, because it has no rows for its genes and

top_3Ls_3LsGenders_3LsAge <- rbind(top_3Ls_3LsGenders,CHL_old72_genes,
                                   pDLBCL_young72_genes, pDLBCL_old72_genes,
                                   mDLBCL_young72_genes, mDLBCL_old72_genes)

paged_table(top_3Ls_3LsGenders_3LsAge)

We have a total of 84 genes. Lets see them.

top_3Ls_3LsGenders_3LsAge$Gene_ID

##  [1] "IL23R"        "SPRR1B"       "SPRR1B"       "SPRR2D"       "S100A7"      
##  [6] "SPRR2A"       "SPRR2A"       "BCL6"         "TMPRSS11A"    "IDO1"        
## [11] "MAL2"         "FABP4"        "CD274"        "PDCD1LG2"     "PTPRD"       
## [16] "PAX5"         "ERP44"        "CT45A6"       "CD5"          "USP30"       
## [21] "KLRB1"        "IFNG"         "MGST1"        "CLDN10"       "DSG3"        
## [26] "BCL2"         "MUM1"         "DMKN"         "NM_001040033" "RGS13"       
## [31] "S100A12"      "NM_001033556" "JCHAIN"       "85444"        "51633"       
## [36] "NM_014117"    "CCL13"        "CHIT1"        "CLCA2"        "MUC7"        
## [41] "GCNT4"        "SMPX"         "LOC389834"    "IL7R"         "MAL2"        
## [46] "KRT4"         "CLDN10"       "SPRR1B"       "MAL2"         "FABP4"       
## [51] "CT45A6"       "CT45A10"      "CLDN10"       "DSG3"         "SCRG1"       
## [56] "LUM"          "SPRR1B"       "S100A7"       "SPRR2A"       "GCNT4"       
## [61] "CLDN10"       "CLCA2"        "TACSTD2"      "MUC7"         "MGAM2"       
## [66] "MAL2"         "CLDN10"       "DSG3"         "SPRR1B"       "SPRR2A"      
## [71] "UTS2B"        "KIAA0825"     "MAL2"         "NM_201998"    "CLDN10"      
## [76] "NOP10"        "SPRR1B"       "MAL2"         "FABP4"        "CT45A6"      
## [81] "DDX3Y"        "CLDN10"       "DSG3"         "DMKN"

There are 84 genes.

Lets look at the unique genes. These genes for some showed up in other groups for most change by group avg/median.

unique(top_3Ls_3LsGenders_3LsAge$Gene_ID)

##  [1] "IL23R"        "SPRR1B"       "SPRR2D"       "S100A7"       "SPRR2A"      
##  [6] "BCL6"         "TMPRSS11A"    "IDO1"         "MAL2"         "FABP4"       
## [11] "CD274"        "PDCD1LG2"     "PTPRD"        "PAX5"         "ERP44"       
## [16] "CT45A6"       "CD5"          "USP30"        "KLRB1"        "IFNG"        
## [21] "MGST1"        "CLDN10"       "DSG3"         "BCL2"         "MUM1"        
## [26] "DMKN"         "NM_001040033" "RGS13"        "S100A12"      "NM_001033556"
## [31] "JCHAIN"       "85444"        "51633"        "NM_014117"    "CCL13"       
## [36] "CHIT1"        "CLCA2"        "MUC7"         "GCNT4"        "SMPX"        
## [41] "LOC389834"    "IL7R"         "KRT4"         "CT45A10"      "SCRG1"       
## [46] "LUM"          "TACSTD2"      "MGAM2"        "UTS2B"        "KIAA0825"    
## [51] "NM_201998"    "NOP10"        "DDX3Y"

There are 53 unique genes.

Lets see if it will show the genes duplicated in multiple groups.

dupe <- top_3Ls_3LsGenders_3LsAge[duplicated(top_3Ls_3LsGenders_3LsAge$Gene_ID),"Gene_ID"]

dupe

##  [1] "SPRR1B" "SPRR2A" "MAL2"   "CLDN10" "SPRR1B" "MAL2"   "FABP4"  "CT45A6"
##  [9] "CLDN10" "DSG3"   "SPRR1B" "S100A7" "SPRR2A" "GCNT4"  "CLDN10" "CLCA2" 
## [17] "MUC7"   "MAL2"   "CLDN10" "DSG3"   "SPRR1B" "SPRR2A" "MAL2"   "CLDN10"
## [25] "SPRR1B" "MAL2"   "FABP4"  "CT45A6" "CLDN10" "DSG3"   "DMKN"

There are 31 duplicated genes. We see that the genes are duplicated in the list, like CLDN10 is there at least 5 groups, SPRR2B, MAL2 as well as others.

Lets order by Gene_ID and then write it out to csv.

topAll15 <- top_3Ls_3LsGenders_3LsAge[order(top_3Ls_3LsGenders_3LsAge$Gene_ID, decreasing=T),]

topAll15$Gene_ID

##  [1] "UTS2B"        "USP30"        "TMPRSS11A"    "TACSTD2"      "SPRR2D"      
##  [6] "SPRR2A"       "SPRR2A"       "SPRR2A"       "SPRR2A"       "SPRR1B"      
## [11] "SPRR1B"       "SPRR1B"       "SPRR1B"       "SPRR1B"       "SPRR1B"      
## [16] "SMPX"         "SCRG1"        "S100A7"       "S100A7"       "S100A12"     
## [21] "RGS13"        "PTPRD"        "PDCD1LG2"     "PAX5"         "NOP10"       
## [26] "NM_201998"    "NM_014117"    "NM_001040033" "NM_001033556" "MUM1"        
## [31] "MUC7"         "MUC7"         "MGST1"        "MGAM2"        "MAL2"        
## [36] "MAL2"         "MAL2"         "MAL2"         "MAL2"         "MAL2"        
## [41] "LUM"          "LOC389834"    "KRT4"         "KLRB1"        "KIAA0825"    
## [46] "JCHAIN"       "IL7R"         "IL23R"        "IFNG"         "IDO1"        
## [51] "GCNT4"        "GCNT4"        "FABP4"        "FABP4"        "FABP4"       
## [56] "ERP44"        "DSG3"         "DSG3"         "DSG3"         "DSG3"        
## [61] "DMKN"         "DMKN"         "DDX3Y"        "CT45A6"       "CT45A6"      
## [66] "CT45A6"       "CT45A10"      "CLDN10"       "CLDN10"       "CLDN10"      
## [71] "CLDN10"       "CLDN10"       "CLDN10"       "CLDN10"       "CLCA2"       
## [76] "CLCA2"        "CHIT1"        "CD5"          "CD274"        "CCL13"       
## [81] "BCL6"         "BCL2"         "85444"        "51633"

write.csv(topAll15,"topGenes_probeMathed2Gene_84genes_54unique_31duplicated.csv", row.names=F)

We still need to compare the study genes to see how well the study genes compared to each class and group of gender and age within each group. Then see how well the study genes, and top genes by the classes, and groups within the classes predict the class, and how well the new 4 groups that are identified as the new groups in the article using these gene expression values can be predicted by the machine model.

Lets upload this large file to kaggle and you can get it here.

Thanks so much and keep checking in.

Hodgkins and Diffuse Large B-Cell Lymphoma Different Stages same Pathology

Janis Harris

2026-04-08

Ensure you’re using Bioconductor 3.22

Install packages for reading CEL files

Read all CEL files from a directory

Inspect

Read CEL files

Inspect

Calculate row medians for selected columns