Hello all. We will be analyzing gene expression data from GSE305165. The following information comes from the published article, which I read once while highlighting the important aspects of the study.
This research compared Diffuse Large B-Cell Lymphoma (DLBCL) with Classical Hodgkin's Lymphoma (CHL) in patients infected with Epstein-Barr Virus (EBV). Both types of lymphoma affect elderly populations and share biomarkers such as IDO1. EBV latent type 2 is characteristic of CHL but can also be seen in a polymorphic type of DLBCL called pDLBCL, while EBV latent type 3 is seen mostly in monomorphic DLBCL, labeled mDLBCL, but also in pDLBCL. The study saw that the gene variations at locus 9p24.1 on chromosome 9, which are typical of CHL, were also noticed in some pDLBCL and mDLBCL cases. Overall, this study used patients 50 years old or older with no autoimmune or immunodeficient pathologies. However, in elderly populations there is a natural decline of the immune response to pathogens and antigens, called immune senescence or IS. In seriously impacted disease states of DLBCL there can also be a more fatal condition called immune escape, in which pathogens and antigens escape detection by the host immune system and are able to make changes that can lead to the host's death.
This is a very interesting study and not too difficult to read. The researchers concluded that their hierarchical clustering did a great job of separating the classes of large B-cell lymphoma. They decided that CHL and the DLBCLs (pDLBCL and mDLBCL) are not separate diseases but the same type of disease, with a 4th, transitional state that overlaps with pDLBCL and CHL and has low interferon gamma.
The four states are as follows. The 1st group is the IS group, which is the mDLBCL that is EBNA2 positive and EBV latent type 3. The 2nd group is the CHL group, which is high in variations at locus 9p24.1 on chromosome 9 and high in PDL1 gene expression, with only EBV latent type 2 gene expression, and is EBNA2 negative. The 3rd group is the pDLBCL, which is high in interferon gamma (IFN-g) and low in variations at 9p24.1, with high gene expression of IDO1 that leads to immune escape and a high chance of a poor prognosis from hemophagocytic lymphohistiocytosis (HLH), which can be fatal. The 4th group is the transition between CHL and pDLBCL, where IFN-g is low and the characteristics, unlike those of the other 3 groups, are not otherwise specified (NOS).
The study uses 57 samples: 35 are DLBCL (12 pDLBCL and 23 mDLBCL) and the other 22 are CHL. All samples have confirmed EBV and no immune deficiency or pathology other than lymphoma and the normal effects of aging (IS).
However, there are only 47 samples in the GSE305165 Gene Expression Omnibus (GEO) record, so we will be working with 47 samples.
library(rmarkdown)
## Warning: package 'rmarkdown' was built under R version 4.5.3
series <- read.table("GSE305165_series_matrix.txt/GSE305165_series_matrix.txt", skip = 31, header = TRUE, nrows = 29)
paged_table(series)
The GSM ID is row 1, diagnosis (EBV+CHL, EBV+pDLBCL, or EBV+mDLBCL) is row 9, age is row 10, gender is row 11, and group is row 19. Let's make a table of only those 5 features.
series4 <- series[c(1,9,10,11,19),]
paged_table(series4)
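The row numbers used above depend on the exact layout of this particular file. As a minimal sketch, assuming the usual GEO series-matrix convention that per-sample metadata lines start with `!Sample_`, the same rows could be picked out by field name instead of by position. The toy file below is made up for illustration and is not the real GSE305165 matrix.

```r
# Toy stand-in for a GEO series matrix file (all values are made up)
toy_lines <- c(
  '!Series_title\t"toy series"',
  '!Sample_title\t"Case02"\t"Case03"',
  '!Sample_geo_accession\t"GSM1"\t"GSM2"',
  '!Sample_characteristics_ch1\t"diagnosis: EBV+ CHL"\t"diagnosis: EBV+ mDLBCL"',
  '!Sample_characteristics_ch1\t"age: 70"\t"age: 81"',
  '!series_matrix_table_begin',
  '"ID_REF"\t"GSM1"\t"GSM2"'
)
toy_path <- tempfile(fileext = ".txt")
writeLines(toy_lines, toy_path)

# Keep only the per-sample metadata lines, wherever they sit in the file
all_lines    <- readLines(toy_path)
sample_lines <- all_lines[startsWith(all_lines, "!Sample_")]

# Parse the tab-separated, double-quoted fields into a data frame
sample_meta <- read.table(text = sample_lines, sep = "\t", quote = "\"",
                          header = FALSE, stringsAsFactors = FALSE)

# Select rows by field name rather than by row number
wanted <- c("!Sample_geo_accession", "!Sample_characteristics_ch1")
sample_meta[sample_meta$V1 %in% wanted, ]
```

Selecting by name means the code keeps working if the number of header lines changes, at the cost of a little more parsing.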
The groups mostly follow their definitions but show overlap. The CHL samples should be in the high 9p24.1 variation group, but some mDLBCL samples are also 9p24.1-H. Some CHL samples, and at least one pDLBCL sample, are classified as IFNG-L instead. The mDLBCL samples should all be IS, but some are IFNG-L or 9p24.1-H, and at least one is IFNG-H. The study said there was some overlap between the samples, but that almost all of the EBV latent type 2 cases were CHL or 9p24.1-H. We can still use the data to show the overlap.
There are 47 samples here, while the published article said there were 57. The GEO record does not say why; perhaps 10 patients did not want their information shared, or it could not be shared. Let's see how many samples are in each group, and then in each diagnosis.
group <- series4[5,c(2:48)]
group_t <- data.frame(t(group))
colnames(group_t) <- 'group'
group_t$group <- gsub("group","", group_t$group)
table(group_t$group)
##
## 9p24.1-H   IFNG-H   IFNG-L       IS
##        9        9       18       10
These are the 4 subtypes of lymphoma the study produced. It says the transition state is the one with low IFNG (IFNG-L) and not otherwise specified findings. IS is the immune senescence group of mDLBCL, IFNG-H is supposed to be the pDLBCL, and 9p24.1-H is high variation in gene copies at locus 9p24.1 on chromosome 9, for CHL. All of these lymphomas have confirmed EBV infection.
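The four counts printed above add up to 46, one fewer than the 47 samples, so one sample may have a missing or unexpected group label. That is only a guess. A minimal sketch of how to check, using made-up toy labels rather than the real data:

```r
# Toy group labels (made up); one entry is missing on purpose
toy_group <- c("IS", "IFNG-L", NA, "IFNG-H", "IFNG-L", "9p24.1-H")

# table() drops NA by default, so the counts can fall short of the sample total
default_counts <- table(toy_group)
sum(default_counts)          # 5, although there are 6 samples

# useNA = "ifany" adds a column for missing values when there are any
full_counts <- table(toy_group, useNA = "ifany")
sum(full_counts)             # 6

# Which samples have no group label?
which(is.na(toy_group))      # 3
```

On the real table the same check would be `table(group_t$group, useNA = "ifany")`.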
Now for the number of samples in each diagnosis.
dx <- series4[2,c(2:48)]
dx_t <- data.frame(t(dx))
colnames(dx_t) <- 'diagnosis'
table(dx_t$diagnosis)
##
##    diagnosis: EBV+ CHL diagnosis: EBV+ mDLBCL diagnosis: EBV+ pDLBCL
##                     19                     20                      8
There are 19 EBV+CHL cases, 20 EBV+mDLBCL, and 8 EBV+pDLBCL.
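The overlap between diagnosis and group described earlier can be seen directly in a cross-tabulation. This is a minimal sketch on a toy data frame; the values are made up for illustration and are not the GSE305165 counts.

```r
# Toy metadata (made up), one row per sample
toy <- data.frame(
  diagnosis = c("EBV+ CHL", "EBV+ CHL", "EBV+ mDLBCL", "EBV+ mDLBCL", "EBV+ pDLBCL"),
  group     = c("9p24.1-H", "IFNG-L", "IS", "9p24.1-H", "IFNG-H")
)

# Rows are diagnoses, columns are groups; cells off the expected pattern show overlap
overlap <- table(toy$diagnosis, toy$group)
overlap

# Same table with row and column totals added
addmargins(overlap)
```

With the objects built in this post, the equivalent call would be `table(dx_t$diagnosis, group_t$group)`.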
Let's look at the summary statistics for age.
age <- series4[3,c(2:48)]
age_t <- data.frame(t(age))
colnames(age_t) <- "Age"
age_t$Age <- gsub("age: ","",age_t$Age)
age_t$Age <- as.numeric(age_t$Age)
summary(age_t$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   50.00   62.50   74.00   72.11   79.50   94.00
The youngest patient is 50 years old and the oldest is 94. The median age, the middle value when all 47 patients' ages are lined up from least to most, is 74, and the mean (average) age is about 72. About 75% of the patients are older than 62.5 years, half are 74 or older, and about 25% are older than 79.5 years, while the youngest 25% are between 50 and 62.5 years of age.
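To make the quartile reading concrete, here is a minimal sketch on a made-up vector of ages (not the study data). It shows the quartiles that `summary()` reports and how to compute the share of patients at or above a chosen cutoff.

```r
# Toy ages (made up), all 50 or older to mirror the study's inclusion criterion
toy_age <- c(50, 58, 63, 70, 74, 78, 81, 94)

# The quartiles that summary() reports
quantile(toy_age, probs = c(0.25, 0.50, 0.75))

# Share of patients at or above a chosen cutoff, here the median
cutoff <- median(toy_age)
share_at_or_above <- mean(toy_age >= cutoff)
share_at_or_above
```

On the real data the same idea would be `mean(age_t$Age >= 74)`.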
Let's look at the balance of men to women in this study.
gender <- series4[4,c(2:48)]
gender_t <- data.frame(t(gender))
colnames(gender_t) <- 'gender'
gender_t$gender <- gsub("Sex: ", "" , gender_t$gender)
table(gender_t$gender)
##
## female   male
##     12     35
Most patients in this study are male: there are 35 males and 12 females, spread across the EBV+CHL, EBV+pDLBCL, and EBV+mDLBCL samples.
Let's make a table of the sample GSM IDs as well.
ID <- series4[1,c(2:48)]
ID_t <- data.frame(t(ID))
colnames(ID_t) <- "sampleID"
ID_t
## sampleID
## Case02_lymphoma_FFPE GSM9163281
## Case03_lymphoma_FFPE GSM9163282
## Case06_lymphoma_FFPE GSM9163283
## Case08_lymphoma_FFPE GSM9163284
## Case09_lymphoma_FFPE GSM9163285
## Case10_lymphoma_FFPE GSM9163286
## Case11_lymphoma_FFPE GSM9163287
## Case12_lymphoma_FFPE GSM9163288
## Case13_lymphoma_FFPE GSM9163289
## Case14_lymphoma_FFPE GSM9163290
## Case15_lymphoma_FFPE GSM9163291
## Case16_lymphoma_FFPE GSM9163292
## Case17_lymphoma_FFPE GSM9163293
## Case19_lymphoma_FFPE GSM9163294
## Case20_lymphoma_FFPE GSM9163295
## Case21_lymphoma_FFPE GSM9163296
## Case22_lymphoma_FFPE GSM9163297
## Case23_lymphoma_FFPE GSM9163298
## Case24_lymphoma_FFPE GSM9163299
## Case25_lymphoma_FFPE GSM9163300
## Case26_lymphoma_FFPE GSM9163301
## Case27_lymphoma_FFPE GSM9163302
## Case29_lymphoma_FFPE GSM9163303
## Case30_lymphoma_FFPE GSM9163304
## Case31_lymphoma_FFPE GSM9163305
## Case32_lymphoma_FFPE GSM9163306
## Case34_lymphoma_FFPE GSM9163307
## Case35_lymphoma_FFPE GSM9163308
## Case36_lymphoma_FFPE GSM9163309
## Case37_lymphoma_FFPE GSM9163310
## Case38_lymphoma_FFPE GSM9163311
## Case39_lymphoma_FFPE GSM9163312
## Case40_lymphoma_FFPE GSM9163313
## Case41_lymphoma_FFPE GSM9163314
## Case42_lymphoma_FFPE GSM9163315
## Case43_lymphoma_FFPE GSM9163316
## Case44_lymphoma_FFPE GSM9163317
## Case45_lymphoma_FFPE GSM9163318
## Case46_lymphoma_FFPE GSM9163319
## Case49_lymphoma_FFPE GSM9163320
## Case50_lymphoma_FFPE GSM9163321
## Case51_lymphoma_FFPE GSM9163322
## Case52_lymphoma_FFPE GSM9163323
## Case53_lymphoma_FFPE GSM9163324
## Case55_lymphoma_FFPE GSM9163325
## Case56_lymphoma_FFPE GSM9163326
## Case57_lymphoma_FFPE GSM9163327
Let's combine these 5 characteristics into one table.
characteristics_df <- cbind(ID_t, dx_t, group_t, age_t, gender_t)
paged_table(characteristics_df)
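The same table could also be built in one pass instead of one block per characteristic. This is a minimal sketch on a toy stand-in for `series4`; the values, the column name `field`, and the object names `toy_series4` and `meta` are made up for illustration. Unlike the steps above, it strips the `label: ` prefix from every column, including diagnosis.

```r
# Toy stand-in for series4 (made-up values): one row per field, one column per sample,
# with the field name in the first column
toy_series4 <- data.frame(
  field  = c("!Sample_geo_accession", "!Sample_characteristics_ch1",
             "!Sample_characteristics_ch1", "!Sample_characteristics_ch1"),
  Case02 = c("GSM1", "diagnosis: EBV+ CHL", "age: 70", "Sex: male"),
  Case03 = c("GSM2", "diagnosis: EBV+ mDLBCL", "age: 81", "Sex: female"),
  stringsAsFactors = FALSE
)

# Transpose so that samples become rows, dropping the field-name column
meta <- as.data.frame(t(toy_series4[, -1]), stringsAsFactors = FALSE)
colnames(meta) <- c("sampleID", "diagnosis", "Age", "gender")

# Strip any "label: " prefix in one pass instead of one gsub() per column
meta[] <- lapply(meta, function(x) sub("^[^:]+: ", "", x))
meta$Age <- as.numeric(meta$Age)
meta
```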
The next part depends on the type of data. The CEL files are Affymetrix GeneChip files, which are usually read in R with a Bioconductor package such as oligo or affy. The packages took about 20 minutes to download and install on a regular laptop PC over cell phone hotspot wifi. I got through to the end, but didn't convert the files from gz or unzip them.
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(c("oligo", "affy"))
library(oligo) # For newer Affymetrix arrays
## Loading required package: BiocGenerics
## Loading required package: generics
##
## Attaching package: 'generics'
## The following objects are masked from 'package:base':
##
## as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
## setequal, union
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, saveRDS, table, tapply, unique,
## unsplit, which.max, which.min
## Loading required package: oligoClasses
## Welcome to oligoClasses version 1.72.0
## Loading required package: Biobase
## Warning: package 'Biobase' was built under R version 4.5.3
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: Biostrings
## Loading required package: S4Vectors
## Warning: package 'S4Vectors' was built under R version 4.5.3
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:grDevices':
##
## windows
## Loading required package: XVector
## Loading required package: Seqinfo
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
## ================================================================================
## Welcome to oligo version 1.74.0
## ================================================================================
library(affy) # For older arrays; loading it after oligo masks rma() and list.celfiles(), as the messages below show
##
## Attaching package: 'affy'
## The following objects are masked from 'package:oligo':
##
## intensity, MAplot, mm, mm<-, mmindex, pm, pm<-, pmindex,
## probeNames, rma
## The following object is masked from 'package:oligoClasses':
##
## list.celfiles
Unzip the files from the RAW download, then put them all in one folder. I named mine after the GSE accession.
#cel_path <- "path/to/your/CEL/files"
cel_path <- "...GSE305165" # path to your CEL files
setwd(cel_path)
cel_files <- list.celfiles(full.names = TRUE)
cel_files
## [1] "./GSM9163281_02_Clariom_S_Human_.CEL"
## [2] "./GSM9163282_03_Clariom_S_Human_.CEL"
## [3] "./GSM9163283_06_Clariom_S_Human_.CEL"
## [4] "./GSM9163284_08_Clariom_S_Human_.CEL"
## [5] "./GSM9163285_09_Clariom_S_Human_.CEL"
## [6] "./GSM9163286_10_Clariom_S_Human_.CEL"
## [7] "./GSM9163287_11_Clariom_S_Human_.CEL"
## [8] "./GSM9163288_12_Clariom_S_Human_.CEL"
## [9] "./GSM9163289_13_Clariom_S_Human_2.CEL"
## [10] "./GSM9163290_14_Clariom_S_Human_.CEL"
## [11] "./GSM9163291_15_Clariom_S_Human_.CEL"
## [12] "./GSM9163292_16_Clariom_S_Human_.CEL"
## [13] "./GSM9163293_17_Clariom_S_Human_.CEL"
## [14] "./GSM9163294_19_Clariom_S_Human_.CEL"
## [15] "./GSM9163295_20_Clariom_S_Human_.CEL"
## [16] "./GSM9163296_21_Clariom_S_Human_.CEL"
## [17] "./GSM9163297_22_Clariom_S_Human_.CEL"
## [18] "./GSM9163298_23_Clariom_S_Human_.CEL"
## [19] "./GSM9163299_24_Clariom_S_Human_.CEL"
## [20] "./GSM9163300_25_Clariom_S_Human_.CEL"
## [21] "./GSM9163301_26_Clariom_S_Human_.CEL"
## [22] "./GSM9163302_27_Clariom_S_Human_2.CEL"
## [23] "./GSM9163303_29_Clariom_S_Human_.CEL"
## [24] "./GSM9163304_30_Clariom_S_Human_.CEL"
## [25] "./GSM9163305_31_Clariom_S_Human_.CEL"
## [26] "./GSM9163306_32_Clariom_S_Human_.CEL"
## [27] "./GSM9163307_34_Clariom_S_Human_.CEL"
## [28] "./GSM9163308_35_Clariom_S_Human_.CEL"
## [29] "./GSM9163309_36_Clariom_S_Human_.CEL"
## [30] "./GSM9163310_37_Clariom_S_Human_2.CEL"
## [31] "./GSM9163311_38_Clariom_S_Human_2.CEL"
## [32] "./GSM9163312_39_Clariom_S_Human_.CEL"
## [33] "./GSM9163313_40_Clariom_S_Human_2.CEL"
## [34] "./GSM9163314_41_Clariom_S_Human_.CEL"
## [35] "./GSM9163315_42_Clariom_S_Human_.CEL"
## [36] "./GSM9163316_43_Clariom_S_Human_.CEL"
## [37] "./GSM9163317_44_Clariom_S_Human_.CEL"
## [38] "./GSM9163318_45_Clariom_S_Human_.CEL"
## [39] "./GSM9163319_46_Clariom_S_Human_.CEL"
## [40] "./GSM9163320_49_Clariom_S_Human_2.CEL"
## [41] "./GSM9163321_50_Clariom_S_Human_.CEL"
## [42] "./GSM9163322_51_Clariom_S_Human_.CEL"
## [43] "./GSM9163323_52_Clariom_S_Human_.CEL"
## [44] "./GSM9163324_53_Clariom_S_Human_.CEL"
## [45] "./GSM9163325_55_Clariom_S_Human_.CEL"
## [46] "./GSM9163326_56_Clariom_S_Human_.CEL"
## [47] "./GSM9163327_57_Clariom_S_Human_.CEL"
raw_data <- read.celfiles(cel_files)
Error: These do not exist: ./GSM9163281_02_Clariom_S_Human_.CEL ./GSM9163282_03_Clariom_S_Human_.CEL ./GSM9163283_06_Clariom_S_Human_.CEL ./GSM9163284_08_Clariom_S_Human_.CEL ./GSM9163285_09_Clariom_S_Human_.CEL ./GSM9163286_10_Clariom_S_Human_.CEL ./GSM9163287_11_Clariom_S_Human_.CEL ./GSM9163288_12_Clariom_S_Human_.CEL ./GSM9163289_13_Clariom_S_Human_2.CEL ./GSM9163290_14_Clariom_S_Human_.CEL ./GSM9163291_15_Clariom_S_Human_.CEL ./GSM9163292_16_Clariom_S_Human_.CEL ./GSM9163293_17_Clariom_S_Human_.CEL ./GSM9163294_19_Clariom_S_Human_.CEL ./GSM9163295_20_Clariom_S_Human_.CEL ./GSM9163296_21_Clariom_S_Human_.CEL ./GSM9163297_22_Clariom_S_Human_.CEL ./GSM9163298_23_Clariom_S_Human_.CEL ./GSM9163299_24_Clariom_S_Human_.CEL ./GSM9163300_25_Clariom_S_Human_.CEL ./GSM9163301_26_Clariom_S_Human_.CEL ./GSM9163302_27_Clariom_S_Human_2.CEL ./GSM9163303_29_Clariom_S_Human_.CEL ./GSM9163304_30_Clariom_S_Human_.CEL ./GSM9163305_31_Clariom_S_Human_.
The command failed with every folder layout I tried: the RAW download with each sample unzipped into its own folder, a single unzipped folder holding one subfolder per patient with the CEL file inside, and a single folder holding only the CEL files themselves with no subfolders.
I will have to return to this to see how the CEL files can be read in with Bioconductor's affy and oligo packages.
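One possible explanation for the "These do not exist" error, offered as a guess and not a confirmed diagnosis: when an R Markdown document is knitted, knitr restores the working directory at the end of each chunk, so relative paths such as `./GSM9163281_02_Clariom_S_Human_.CEL` that were listed right after `setwd()` may no longer resolve in a later chunk. A minimal sketch of a workaround is to build absolute paths and skip `setwd()` altogether. The helper name `find_cel_files` and the toy folder are hypothetical. The sketch uses only base R, and the final oligo call is left as a comment because it needs the real CEL files.

```r
# Hypothetical helper: list CEL files under a folder (and its subfolders) as absolute paths
find_cel_files <- function(cel_dir) {
  files <- list.files(cel_dir, pattern = "\\.CEL$", full.names = TRUE,
                      recursive = TRUE, ignore.case = TRUE)
  normalizePath(files)
}

# Toy folder with empty placeholder files (not real CEL data)
toy_dir <- file.path(tempdir(), "toy_cel")
dir.create(file.path(toy_dir, "case02"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(toy_dir, "case02", "GSM1_toy.CEL"))
file.create(file.path(toy_dir, "GSM2_toy.CEL"))

cel_files <- find_cel_files(toy_dir)

# Check the paths before handing them to a reader
stopifnot(length(cel_files) > 0, all(file.exists(cel_files)))
basename(cel_files)

# With the real files, the absolute paths could then be passed on, for example:
# raw_data <- oligo::read.celfiles(find_cel_files(cel_path))
```

Because the search is recursive, the same helper would cover both the flat folder and the one-subfolder-per-patient layouts.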
raw_data
# Quick QC plot
boxplot(raw_data, main = "Raw CEL Data", las = 2)
norm_data <- oligo::rma(raw_data) # oligo:: prefix because affy, loaded later, masks rma()
exprs(norm_data)[1:5, 1:5] # First 5 genes × first 5 samples
==================================================
Keep checking back, and we will figure out how to open these CEL files and do our regular analysis and data science on these samples.