Hello all. We will be analyzing gene expression data from GSE305165. The following information comes from the published article, which I read once while highlighting the important aspects of the study.

This research compared Diffuse Large B-Cell Lymphoma (DLBCL) with Classical Hodgkin’s Lymphoma (CHL) in Epstein-Barr Virus (EBV) infected patients. Both types of lymphoma affect elderly populations and share biomarkers such as IDO1. EBV latent type 2 is specific to CHL but can also be seen in a polymorphic type of DLBCL called pDLBCL. EBV latent type 3 is seen mostly in monomorphic DLBCL, labeled mDLBCL, but also in pDLBCL. The study found that the region of chromosome 9 at locus 9p24.1, where high gene variation is typical of CHL, also showed some variation in pDLBCL and mDLBCL. Overall, the study used patients 50 years old or older with no autoimmune or immunodeficiency pathologies. However, elderly populations have a natural decline in immune response to pathogens and antigens, called immune senescence (IS), and in seriously affected disease states of DLBCL there can be a more dangerous condition called immune escape. In both cases pathogens and antigens evade detection by the host immune system and can make changes that lead to the host’s death.

This is a very interesting study and not too difficult to read. The researchers decided that their hierarchical clustering did a good job of separating the classes of large B-cell lymphoma. They concluded that CHL and the DLBCLs (pDLBCL and mDLBCL) are not separate diseases but the same type of disease, with a 4th transitional state that overlaps with pDLBCL and CHL and has low interferon gamma.

The four states are as follows. The 1st group is the IS group, which is the mDLBCL that is EBNA2 positive and EBV latent type 3. The 2nd group is the CHL group, which is high in variations at locus 9p24.1 on chromosome 9, high in PDL1 gene expression, EBNA2 negative, and shows only EBV latent type 2 gene expression. The 3rd group is the pDLBCL, which is high in interferon gamma (IFN-g) and low in variations at 9p24.1, with high IDO1 gene expression that leads to immune escape and a high chance of a poor prognosis from hemophagocytic lymphohistiocytosis (HLH), which can lead to death. The 4th group is the transition between CHL and pDLBCL, where IFN-g is low and the characteristics are unlike the other 3 groups, so it is labeled not otherwise specified (NOS).

The study uses 57 samples: 35 are DLBCL (12 pDLBCL and 23 mDLBCL) and the other 22 are CHL. All samples have confirmed EBV and no immune deficiency or pathology other than lymphoma and the normal effects of aging in IS.

However, there are only 47 samples in the GSE305165 Gene Expression Omnibus (GEO) record linked above, so we will be working with 47 samples.

library(rmarkdown)
## Warning: package 'rmarkdown' was built under R version 4.5.3
series <- read.table("GSE305165_series_matrix.txt/GSE305165_series_matrix.txt", skip=31, header=T, nrow=29)

paged_table(series)

The GSM ID is row 1, diagnosis (EBV+CHL, EBV+pDLBCL, or EBV+mDLBCL) is row 9, age is row 10, gender is row 11, and group is row 19. Let’s make a table of only those 5 features.

series4 <- series[c(1,9,10,11,19),]

paged_table(series4)
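Picking rows by position (1, 9, 10, 11, 19) works for this file but would silently break if GEO added or reordered a metadata line. A minimal sketch of looking rows up by their "label:" prefix instead is shown here. The toy data frame and the helper name `find_row` are assumptions for illustration, not part of the real series matrix.

```r
# Toy stand-in for the series-matrix metadata; the values are made up and
# only mimic the "label: value" layout seen in the tables in this post.
series_toy <- data.frame(
  Sample_title = c("!Sample_geo_accession", "!Sample_characteristics_ch1",
                   "!Sample_characteristics_ch1", "!Sample_characteristics_ch1"),
  CaseA = c("GSM0000001", "diagnosis: EBV+ CHL", "age: 70", "Sex: male"),
  CaseB = c("GSM0000002", "diagnosis: EBV+ mDLBCL", "age: 65", "Sex: female"),
  stringsAsFactors = FALSE
)

# Return the index of the single row whose first sample value starts with
# "<label>:". Column 2 is used because column 1 holds the GEO field names.
find_row <- function(df, label) {
  hits <- grep(paste0("^", label, ":"), df[[2]])
  if (length(hits) != 1) stop("expected exactly one row for label: ", label)
  hits
}

# Row 1 (the GSM IDs) has no label, so it is still taken by position.
rows <- c(1, sapply(c("diagnosis", "age", "Sex"), find_row, df = series_toy))
series_toy[rows, ]
```

With the real `series` object the same call would presumably use the labels "diagnosis", "age", "Sex", and "group", but that depends on the labels actually present in the file.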

The groups mostly follow their definitions but show overlap. The CHL samples should be in the high 9p24.1 variation group, but some mDLBCL samples are also high 9p24.1. Some CHL samples, and at least one pDLBCL sample, are classified as IFNG-L instead. The mDLBCL samples should all be IS, but some are IFNG-L or high 9p24.1, and at least one is IFNG-H. The study said there was some overlap between the samples, but that almost all of the latent type 2 EBV cases were CHL or high 9p24.1. We can still use the data to show the overlap.

There are 47 samples here, while the published article reported 57. The 10 missing samples were presumably withheld, either because the patients did not want their information shared or because it could not be shared. Let’s see how many samples are in each group first.

group <- series4[5,c(2:48)]
group_t <- data.frame(t(group))
colnames(group_t) <- 'group'
group_t$group <- gsub("group","", group_t$group)
table(group_t$group)
## 
## 9p24.1-H    IFNG-H    IFNG-L        IS  
##         9         9        18        10

These are the 4 subtypes of lymphoma the study produced. The study says the transition state is the one with low IFNG (IFNG-L) and not otherwise specified findings. IS is the immune senescence group of mDLBCL, IFNG-H is supposed to be the pDLBCL, and 9p24.1-H is high variation in gene copies at locus 9p24.1 on chromosome 9 for CHL. All of these lymphomas have confirmed EBV infection. Note that the four counts add up to 46, one short of the 47 samples, so one sample is not accounted for in this table.

Now for the number of samples in each diagnosis.

dx <- series4[2,c(2:48)]
dx_t <- data.frame(t(dx))

colnames(dx_t) <- 'diagnosis'

table(dx_t$diagnosis)
## 
##    diagnosis: EBV+ CHL diagnosis: EBV+ mDLBCL diagnosis: EBV+ pDLBCL 
##                     19                     20                      8

There are 19 EBV+CHL cases, 20 EBV+mDLBCL, and 8 EBV+pDLBCL.

Let’s see the age range summary stats.

age <- series4[3,c(2:48)]
age_t <- data.frame(t(age))
colnames(age_t) <- "Age"
age_t$Age <- gsub("age: ","",age_t$Age)
age_t$Age <- as.numeric(age_t$Age)

summary(age_t$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   50.00   62.50   74.00   72.11   79.50   94.00

The youngest patient is 50 years old and the oldest is 94. The median age, the middle value when all 47 patients’ ages are lined up from least to most, is 74, and the mean (average) age is about 72. About 75% of the patients are older than 62.5 years and half are older than 74. About 25% are older than 79.5 years, and about 25% are between 50 and 62.5 years of age.

Let’s look at the gender balance of men to women in this study.

gender <- series4[4,c(2:48)]

gender_t <- data.frame(t(gender));
colnames(gender_t) <- 'gender'

gender_t$gender <- gsub("Sex: ", "" , gender_t$gender)

table(gender_t$gender)
## 
## female   male 
##     12     35

The patients in this study are mostly male, with 35 males and 12 females spread across the EBV+CHL, EBV+pDLBCL, and EBV+mDLBCL samples.
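The group, diagnosis, age, and gender steps all repeat the same pattern: take one row, transpose it, name the column, and strip the "label: " prefix. A minimal sketch of a helper that does this in one call follows. The name `extract_field` and the toy values are hypothetical; only the wide layout mirrors `series4`.

```r
# Pull one metadata row out as a one-column data frame, removing any
# leading "label: " text. `extract_field` is a hypothetical helper name.
extract_field <- function(meta, row, sample_cols, name, numeric = FALSE) {
  values <- unlist(meta[row, sample_cols], use.names = TRUE)
  sample_names <- names(values)          # keep names before any conversion
  values <- trimws(sub("^[^:]*:[[:space:]]*", "", as.character(values)))
  if (numeric) values <- as.numeric(values)
  out <- data.frame(values, row.names = sample_names, stringsAsFactors = FALSE)
  colnames(out) <- name
  out
}

# Toy metadata (made-up values) in the same wide layout as series4.
meta_toy <- data.frame(
  label = c("id", "dx", "age", "sex"),
  CaseA = c("GSM0000001", "diagnosis: EBV+ CHL", "age: 70", "Sex: male"),
  CaseB = c("GSM0000002", "diagnosis: EBV+ mDLBCL", "age: 65", "Sex: female"),
  stringsAsFactors = FALSE
)

age_toy <- extract_field(meta_toy, 3, 2:3, "Age", numeric = TRUE)
sex_toy <- extract_field(meta_toy, 4, 2:3, "gender")
summary(age_toy$Age)
table(sex_toy$gender)
```

One design note: because the prefix is removed with a generic "everything up to the first colon" pattern, the same helper would also clean the diagnosis row, which keeps its "diagnosis: " prefix in the table above.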

Let’s make a sample GSM ID table as well.

ID <- series4[1,c(2:48)]

ID_t <- data.frame(t(ID))

colnames(ID_t) <- "sampleID"
ID_t
##                        sampleID
## Case02_lymphoma_FFPE GSM9163281
## Case03_lymphoma_FFPE GSM9163282
## Case06_lymphoma_FFPE GSM9163283
## Case08_lymphoma_FFPE GSM9163284
## Case09_lymphoma_FFPE GSM9163285
## Case10_lymphoma_FFPE GSM9163286
## Case11_lymphoma_FFPE GSM9163287
## Case12_lymphoma_FFPE GSM9163288
## Case13_lymphoma_FFPE GSM9163289
## Case14_lymphoma_FFPE GSM9163290
## Case15_lymphoma_FFPE GSM9163291
## Case16_lymphoma_FFPE GSM9163292
## Case17_lymphoma_FFPE GSM9163293
## Case19_lymphoma_FFPE GSM9163294
## Case20_lymphoma_FFPE GSM9163295
## Case21_lymphoma_FFPE GSM9163296
## Case22_lymphoma_FFPE GSM9163297
## Case23_lymphoma_FFPE GSM9163298
## Case24_lymphoma_FFPE GSM9163299
## Case25_lymphoma_FFPE GSM9163300
## Case26_lymphoma_FFPE GSM9163301
## Case27_lymphoma_FFPE GSM9163302
## Case29_lymphoma_FFPE GSM9163303
## Case30_lymphoma_FFPE GSM9163304
## Case31_lymphoma_FFPE GSM9163305
## Case32_lymphoma_FFPE GSM9163306
## Case34_lymphoma_FFPE GSM9163307
## Case35_lymphoma_FFPE GSM9163308
## Case36_lymphoma_FFPE GSM9163309
## Case37_lymphoma_FFPE GSM9163310
## Case38_lymphoma_FFPE GSM9163311
## Case39_lymphoma_FFPE GSM9163312
## Case40_lymphoma_FFPE GSM9163313
## Case41_lymphoma_FFPE GSM9163314
## Case42_lymphoma_FFPE GSM9163315
## Case43_lymphoma_FFPE GSM9163316
## Case44_lymphoma_FFPE GSM9163317
## Case45_lymphoma_FFPE GSM9163318
## Case46_lymphoma_FFPE GSM9163319
## Case49_lymphoma_FFPE GSM9163320
## Case50_lymphoma_FFPE GSM9163321
## Case51_lymphoma_FFPE GSM9163322
## Case52_lymphoma_FFPE GSM9163323
## Case53_lymphoma_FFPE GSM9163324
## Case55_lymphoma_FFPE GSM9163325
## Case56_lymphoma_FFPE GSM9163326
## Case57_lymphoma_FFPE GSM9163327
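The sample titles carry case numbers, which gives a quick way to see which cases are absent from GEO. The sketch below rebuilds the 47 titles from the listing above and compares them with 1 to 57. That the case numbering runs from 1 to 57 and matches the article's 57 samples is an assumption, not something the GEO record states.

```r
# Case numbers as they appear in the 47 sample titles listed above.
present <- c(2, 3, 6, 8:17, 19:27, 29:32, 34:46, 49:53, 55:57)
titles <- sprintf("Case%02d_lymphoma_FFPE", present)

# Recover the numeric part of each title and compare with 1..57.
case_no <- as.integer(sub("^Case([0-9]+)_.*$", "\\1", titles))
missing_cases <- setdiff(1:57, case_no)

length(case_no)   # 47
missing_cases     # 1 4 5 7 18 28 33 47 48 54
```

In the notebook itself, `rownames(ID_t)` could be used in place of the rebuilt `titles` vector.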

Let’s combine these 5 characteristics into one table.

characteristics_df <- cbind(ID_t, dx_t, group_t, age_t, gender_t)

paged_table(characteristics_df)
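Once diagnosis and group sit in the same data frame, the overlap between them can be counted directly with a cross-tabulation. The rows below are made up purely to show the call; only the column names `diagnosis` and `group` follow the table built above.

```r
# Made-up rows standing in for characteristics_df.
toy <- data.frame(
  diagnosis = c("EBV+ CHL", "EBV+ CHL", "EBV+ mDLBCL", "EBV+ mDLBCL", "EBV+ pDLBCL"),
  group     = c("9p24.1-H", "IFNG-L", "IS", "9p24.1-H", "IFNG-H"),
  stringsAsFactors = FALSE
)

# Counts of each diagnosis within each expression group.
overlap <- table(diagnosis = toy$diagnosis, group = toy$group)
overlap

# Row-wise proportions: how each diagnosis splits across the groups.
round(prop.table(overlap, margin = 1), 2)
```

On the real data the equivalent call would be `table(characteristics_df$diagnosis, characteristics_df$group)`.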

====================================================

*** Attempt 1

The next part depends on the type of data. The CEL files are Affymetrix gene chip files, which in R are read with Bioconductor packages. The packages take about 20 minutes to download and install on a regular laptop PC over cell phone hotspot wifi. I got through to the end of the install, but had not yet unzipped the files or converted them from gz.

Install Bioconductor manager if not already installed

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

Install packages for reading CEL files

BiocManager::install(c("oligo", "affy"))
  1. Load and Read CEL Files

Load the package

library(oligo)  # For newer Affymetrix arrays
## Loading required package: BiocGenerics
## Loading required package: generics
## 
## Attaching package: 'generics'
## The following objects are masked from 'package:base':
## 
##     as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
##     setequal, union
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, saveRDS, table, tapply, unique,
##     unsplit, which.max, which.min
## Loading required package: oligoClasses
## Welcome to oligoClasses version 1.72.0
## Loading required package: Biobase
## Warning: package 'Biobase' was built under R version 4.5.3
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: Biostrings
## Loading required package: S4Vectors
## Warning: package 'S4Vectors' was built under R version 4.5.3
## Loading required package: stats4
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
## 
##     findMatches
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## Loading required package: IRanges
## 
## Attaching package: 'IRanges'
## The following object is masked from 'package:grDevices':
## 
##     windows
## Loading required package: XVector
## Loading required package: Seqinfo
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
## 
##     strsplit
## ================================================================================
## Welcome to oligo version 1.74.0
## ================================================================================
library(affy)  # For older arrays; loading it after oligo masks some oligo functions (see messages below)
## 
## Attaching package: 'affy'
## The following objects are masked from 'package:oligo':
## 
##     intensity, MAplot, mm, mm<-, mmindex, pm, pm<-, pmindex,
##     probeNames, rma
## The following object is masked from 'package:oligoClasses':
## 
##     list.celfiles

Set the path to your CEL files

Unzip the files from the RAW download, then put them all in one folder. I named mine after the GSE ID.

#cel_path <- "path/to/your/CEL/files"
cel_path <- "...GSE305165" # path to your CEL files
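If the individual files arrive as `.CEL.gz`, they can be decompressed without extra packages. This is a minimal base-R sketch. The helper name `gunzip_dir` is hypothetical, and the demo writes a throwaway placeholder file rather than a real CEL file.

```r
# Decompress every *.gz file in `dir` using base R connections only.
# `gunzip_dir` is a hypothetical helper, not part of any package.
gunzip_dir <- function(dir) {
  gz_files <- list.files(dir, pattern = "\\.gz$", full.names = TRUE)
  for (gz in gz_files) {
    out <- sub("\\.gz$", "", gz)
    in_con <- gzfile(gz, open = "rb")
    out_con <- file(out, open = "wb")
    repeat {
      chunk <- readBin(in_con, what = "raw", n = 1e6)  # read up to ~1 MB at a time
      if (length(chunk) == 0) break
      writeBin(chunk, out_con)
    }
    close(in_con)
    close(out_con)
  }
  invisible(sub("\\.gz$", "", gz_files))
}

# Self-contained demo on a placeholder file in a temporary folder.
tmp <- tempfile("cel_demo_")
dir.create(tmp)
con <- gzfile(file.path(tmp, "example.CEL.gz"), open = "wb")
writeBin(charToRaw("not a real CEL file"), con)
close(con)

unpacked <- gunzip_dir(tmp)
basename(unpacked)
```

The outer `.tar` archive from GEO is a separate step and can be unpacked with base R's `untar()`.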

List CEL files in the directory

setwd(cel_path)

cel_files <- list.celfiles(full.names = TRUE)

cel_files
##  [1] "./GSM9163281_02_Clariom_S_Human_.CEL" 
##  [2] "./GSM9163282_03_Clariom_S_Human_.CEL" 
##  [3] "./GSM9163283_06_Clariom_S_Human_.CEL" 
##  [4] "./GSM9163284_08_Clariom_S_Human_.CEL" 
##  [5] "./GSM9163285_09_Clariom_S_Human_.CEL" 
##  [6] "./GSM9163286_10_Clariom_S_Human_.CEL" 
##  [7] "./GSM9163287_11_Clariom_S_Human_.CEL" 
##  [8] "./GSM9163288_12_Clariom_S_Human_.CEL" 
##  [9] "./GSM9163289_13_Clariom_S_Human_2.CEL"
## [10] "./GSM9163290_14_Clariom_S_Human_.CEL" 
## [11] "./GSM9163291_15_Clariom_S_Human_.CEL" 
## [12] "./GSM9163292_16_Clariom_S_Human_.CEL" 
## [13] "./GSM9163293_17_Clariom_S_Human_.CEL" 
## [14] "./GSM9163294_19_Clariom_S_Human_.CEL" 
## [15] "./GSM9163295_20_Clariom_S_Human_.CEL" 
## [16] "./GSM9163296_21_Clariom_S_Human_.CEL" 
## [17] "./GSM9163297_22_Clariom_S_Human_.CEL" 
## [18] "./GSM9163298_23_Clariom_S_Human_.CEL" 
## [19] "./GSM9163299_24_Clariom_S_Human_.CEL" 
## [20] "./GSM9163300_25_Clariom_S_Human_.CEL" 
## [21] "./GSM9163301_26_Clariom_S_Human_.CEL" 
## [22] "./GSM9163302_27_Clariom_S_Human_2.CEL"
## [23] "./GSM9163303_29_Clariom_S_Human_.CEL" 
## [24] "./GSM9163304_30_Clariom_S_Human_.CEL" 
## [25] "./GSM9163305_31_Clariom_S_Human_.CEL" 
## [26] "./GSM9163306_32_Clariom_S_Human_.CEL" 
## [27] "./GSM9163307_34_Clariom_S_Human_.CEL" 
## [28] "./GSM9163308_35_Clariom_S_Human_.CEL" 
## [29] "./GSM9163309_36_Clariom_S_Human_.CEL" 
## [30] "./GSM9163310_37_Clariom_S_Human_2.CEL"
## [31] "./GSM9163311_38_Clariom_S_Human_2.CEL"
## [32] "./GSM9163312_39_Clariom_S_Human_.CEL" 
## [33] "./GSM9163313_40_Clariom_S_Human_2.CEL"
## [34] "./GSM9163314_41_Clariom_S_Human_.CEL" 
## [35] "./GSM9163315_42_Clariom_S_Human_.CEL" 
## [36] "./GSM9163316_43_Clariom_S_Human_.CEL" 
## [37] "./GSM9163317_44_Clariom_S_Human_.CEL" 
## [38] "./GSM9163318_45_Clariom_S_Human_.CEL" 
## [39] "./GSM9163319_46_Clariom_S_Human_.CEL" 
## [40] "./GSM9163320_49_Clariom_S_Human_2.CEL"
## [41] "./GSM9163321_50_Clariom_S_Human_.CEL" 
## [42] "./GSM9163322_51_Clariom_S_Human_.CEL" 
## [43] "./GSM9163323_52_Clariom_S_Human_.CEL" 
## [44] "./GSM9163324_53_Clariom_S_Human_.CEL" 
## [45] "./GSM9163325_55_Clariom_S_Human_.CEL" 
## [46] "./GSM9163326_56_Clariom_S_Human_.CEL" 
## [47] "./GSM9163327_57_Clariom_S_Human_.CEL"

Read CEL files into an ExpressionFeatureSet object

raw_data <- read.celfiles(cel_files)

Error: These do not exist: ./GSM9163281_02_Clariom_S_Human_.CEL ./GSM9163282_03_Clariom_S_Human_.CEL ./GSM9163283_06_Clariom_S_Human_.CEL ./GSM9163284_08_Clariom_S_Human_.CEL ./GSM9163285_09_Clariom_S_Human_.CEL ./GSM9163286_10_Clariom_S_Human_.CEL ./GSM9163287_11_Clariom_S_Human_.CEL ./GSM9163288_12_Clariom_S_Human_.CEL ./GSM9163289_13_Clariom_S_Human_2.CEL ./GSM9163290_14_Clariom_S_Human_.CEL ./GSM9163291_15_Clariom_S_Human_.CEL ./GSM9163292_16_Clariom_S_Human_.CEL ./GSM9163293_17_Clariom_S_Human_.CEL ./GSM9163294_19_Clariom_S_Human_.CEL ./GSM9163295_20_Clariom_S_Human_.CEL ./GSM9163296_21_Clariom_S_Human_.CEL ./GSM9163297_22_Clariom_S_Human_.CEL ./GSM9163298_23_Clariom_S_Human_.CEL ./GSM9163299_24_Clariom_S_Human_.CEL ./GSM9163300_25_Clariom_S_Human_.CEL ./GSM9163301_26_Clariom_S_Human_.CEL ./GSM9163302_27_Clariom_S_Human_2.CEL ./GSM9163303_29_Clariom_S_Human_.CEL ./GSM9163304_30_Clariom_S_Human_.CEL ./GSM9163305_31_Clariom_S_Human_.

The command did not work in any of the layouts I tried: the RAW download with a separate unzipped folder per patient, one unzipped folder containing a folder per patient with a CEL file in each, or one folder holding only the CEL files themselves.

I will have to return to this to see how the CEL files can be read in with Bioconductor’s affy and oligo packages.
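One possible cause of the "These do not exist" error, assuming this document was knitted with knitr: a `setwd()` call inside one chunk does not carry over to the next chunk, so relative paths such as `./GSM..._.CEL` no longer resolve by the time the files are read. A minimal sketch that avoids `setwd()` by building full paths and checking them first follows. `find_cel_files` is a hypothetical helper, and the demo uses empty placeholder files.

```r
# Return full paths to CEL files (plain or gzipped) and fail early if
# any of them cannot be found. `find_cel_files` is a hypothetical helper.
find_cel_files <- function(cel_dir) {
  cel_dir <- normalizePath(cel_dir, mustWork = TRUE)
  files <- list.files(cel_dir, pattern = "\\.CEL(\\.gz)?$",
                      ignore.case = TRUE, full.names = TRUE)
  if (length(files) == 0) stop("no CEL files found in ", cel_dir)
  stopifnot(all(file.exists(files)))
  files
}

# Demo with empty placeholder files in a temporary folder.
demo_dir <- tempfile("cel_paths_")
dir.create(demo_dir)
file.create(file.path(demo_dir, c("a.CEL", "b.CEL.gz", "notes.txt")))

cel_files <- find_cel_files(demo_dir)
basename(cel_files)

# With real data the result could then be passed on, for example:
# raw_data <- oligo::read.celfiles(cel_files)
```

Because the paths are absolute, the read step works regardless of which directory the chunk happens to run in.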

The following section of code came from the AI generator that gave me the Bioconductor installation instructions for opening CEL files with the affy and oligo packages.

View a summary

raw_data
  2. Basic Quality Check and Normalization

Quick QC plot

boxplot(raw_data, main = "Raw CEL Data", las = 2)

Normalize using RMA (Robust Multi-array Average)

norm_data <- rma(raw_data)

View normalized expression matrix

exprs(norm_data)[1:5, 1:5]  # First 5 genes × first 5 samples
  3. Notes & Tips

oligo works for most modern Affymetrix platforms (e.g., Gene ST, Exon ST arrays).
affy is for older 3′ IVT arrays.
If your .CEL files are inside a .tar or .zip, extract them first.
Always check the platform type in GEO or the experiment metadata to choose the right package.

==================================================

*** Attempt 2

Keep checking back and we will figure out how to open these CEL files and do our regular analysis and data science on these samples.

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("pd.clariom.s.human")

Try another video tutorial

library(affy)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ lubridate::%within%() masks IRanges::%within%()
## ✖ dplyr::collapse()     masks Biostrings::collapse(), IRanges::collapse()
## ✖ dplyr::combine()      masks Biobase::combine(), BiocGenerics::combine()
## ✖ purrr::compact()      masks XVector::compact()
## ✖ dplyr::desc()         masks IRanges::desc()
## ✖ tidyr::expand()       masks S4Vectors::expand()
## ✖ dplyr::filter()       masks stats::filter()
## ✖ dplyr::first()        masks S4Vectors::first()
## ✖ dplyr::lag()          masks stats::lag()
## ✖ lubridate::pm()       masks affy::pm(), oligo::pm()
## ✖ ggplot2::Position()   masks BiocGenerics::Position(), base::Position()
## ✖ purrr::reduce()       masks IRanges::reduce()
## ✖ dplyr::rename()       masks S4Vectors::rename()
## ✖ lubridate::second()   masks S4Vectors::second()
## ✖ lubridate::second<-() masks S4Vectors::second<-()
## ✖ dplyr::slice()        masks XVector::slice(), IRanges::slice()
## ✖ dplyr::summarize()    masks oligo::summarize()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#install.packages("GEOquery")
BiocManager::install("GEOquery")
library(GEOquery)
## Setting options('download.file.method.GEOquery'='auto')
## Setting options('GEOquery.inmemory.gpl'=FALSE)
getGEOSuppFiles("GSE305165")
## Using locally cached version of supplementary file(s) GSE305165 found here:
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar
##                                                                                                                      size
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar 46592000
##                                                                                                                  isdir
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar FALSE
##                                                                                                                  mode
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar  666
##                                                                                                                                mtime
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar 2026-04-09 16:26:11
##                                                                                                                                ctime
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar 2026-04-09 16:26:06
##                                                                                                                                atime
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar 2026-04-09 20:32:52
##                                                                                                                  exe
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar  no
##                                                                                                                  uname
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar jlcor
##                                                                                                                        udomain
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar DATAMASSAGER1
##                                                                                                                              fname
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar GSE305165_RAW.tar
##                                                                                                                                                                                                         destdir
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165
##                                                                                                                                                                                                                          filepath
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar
##                                                                                                                        GEO
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar GSE305165
untar("GSE305165/GSE305165_RAW.tar", exdir="data/")
raw_data <- ReadAffy(celfile.path = "data/")
str(raw_data)
## Formal class 'AffyBatch' [package "affy"] with 10 slots
##   ..@ cdfName          : chr "Clariom_S_Human"
##   ..@ nrow             : Named int 548
##   .. ..- attr(*, "names")= chr "Rows"
##   ..@ ncol             : Named int 548
##   .. ..- attr(*, "names")= chr "Cols"
##   ..@ assayData        :<environment: 0x000001db9ea3c430> 
##   ..@ phenoData        :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
##   .. .. ..@ varMetadata      :'data.frame':  1 obs. of  1 variable:
##   .. .. .. ..$ labelDescription: chr "arbitrary numbering"
##   .. .. ..@ data             :'data.frame':  47 obs. of  1 variable:
##   .. .. .. ..$ sample: int [1:47] 1 2 3 4 5 6 7 8 9 10 ...
##   .. .. ..@ dimLabels        : chr [1:2] "sampleNames" "sampleColumns"
##   .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
##   .. .. .. .. ..@ .Data:List of 1
##   .. .. .. .. .. ..$ : int [1:3] 1 1 0
##   .. .. .. .. ..$ names: chr "AnnotatedDataFrame"
##   ..@ featureData      :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
##   .. .. ..@ varMetadata      :'data.frame':  0 obs. of  1 variable:
##   .. .. .. ..$ labelDescription: chr(0) 
##   .. .. ..@ data             :'data.frame':  300304 obs. of  0 variables
##   .. .. ..@ dimLabels        : chr [1:2] "featureNames" "featureColumns"
##   .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
##   .. .. .. .. ..@ .Data:List of 1
##   .. .. .. .. .. ..$ : int [1:3] 1 1 0
##   .. .. .. .. ..$ names: chr "AnnotatedDataFrame"
##   ..@ experimentData   :Formal class 'MIAME' [package "Biobase"] with 13 slots
##   .. .. ..@ name             : chr ""
##   .. .. ..@ lab              : chr ""
##   .. .. ..@ contact          : chr ""
##   .. .. ..@ title            : chr ""
##   .. .. ..@ abstract         : chr ""
##   .. .. ..@ url              : chr ""
##   .. .. ..@ pubMedIds        : chr ""
##   .. .. ..@ samples          : list()
##   .. .. ..@ hybridizations   : list()
##   .. .. ..@ normControls     : list()
##   .. .. ..@ preprocessing    :List of 2
##   .. .. .. ..$ filenames  : chr [1:47] "data/GSM9163281_02_Clariom_S_Human_.CEL.gz" "data/GSM9163282_03_Clariom_S_Human_.CEL.gz" "data/GSM9163283_06_Clariom_S_Human_.CEL.gz" "data/GSM9163284_08_Clariom_S_Human_.CEL.gz" ...
##   .. .. .. ..$ affyversion: chr NA
##   .. .. ..@ other            :List of 1
##   .. .. .. ..$ : chr ""
##   .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
##   .. .. .. .. ..@ .Data:List of 2
##   .. .. .. .. .. ..$ : int [1:3] 1 0 0
##   .. .. .. .. .. ..$ : int [1:3] 1 1 0
##   .. .. .. .. ..$ names: chr [1:2] "MIAxE" "MIAME"
##   ..@ annotation       : chr "clariomshuman"
##   ..@ protocolData     :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
##   .. .. ..@ varMetadata      :'data.frame':  1 obs. of  1 variable:
##   .. .. .. ..$ labelDescription: chr NA
##   .. .. ..@ data             :'data.frame':  47 obs. of  1 variable:
##   .. .. .. ..$ ScanDate: chr [1:47] "2022-10-06T05:15:22Z" "2022-10-06T05:10:06Z" "2022-10-06T04:16:44Z" "2022-10-06T04:06:10Z" ...
##   .. .. ..@ dimLabels        : chr [1:2] "sampleNames" "sampleColumns"
##   .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
##   .. .. .. .. ..@ .Data:List of 1
##   .. .. .. .. .. ..$ : int [1:3] 1 1 0
##   .. .. .. .. ..$ names: chr "AnnotatedDataFrame"
##   ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
##   .. .. ..@ .Data:List of 4
##   .. .. .. ..$ : int [1:3] 4 5 2
##   .. .. .. ..$ : int [1:3] 2 70 0
##   .. .. .. ..$ : int [1:3] 1 3 0
##   .. .. .. ..$ : int [1:3] 1 2 0
##   .. .. ..$ names: chr [1:4] "R" "Biobase" "eSet" "AffyBatch"
raw_data

Warning: missing cdf environment! in show(AffyBatch)
AffyBatch object
size of arrays=548x548 features (36 kb)
cdf=Clariom_S_Human (??? affyids)
number of samples=47

Error in getCdfInfo(object) :
  Could not obtain CDF environment, problems encountered:
Specified environment does not contain Clariom_S_Human
Library - package clariomshumancdf not installed
Bioconductor - clariomshumancdf not available
7. stop(paste("Could not obtain CDF environment, problems encountered:", paste(unlist(badOut), collapse = ""), sep = ""))
6. getCdfInfo(object)
5. featureNames(object)
4. featureNames(object)
3. cat("number of genes=", length(featureNames(object)), "", sep = "")
2. (new("standardGeneric", .Data = function (object) standardGeneric("show"), generic = structure("show", package = "methods"), package = "methods", group = list(), valueClass = character(0), signature = structure("object", simpleOnly = TRUE), default = new("derivedDefaultMethod", …
1. (new("standardGeneric", .Data = function (object) standardGeneric("show"), generic = structure("show", package = "methods"), package = "methods", group = list(), valueClass = character(0), signature = structure("object", simpleOnly = TRUE), default = new("derivedDefaultMethod", …

After I tried installing the missing packages, the installer reported that clariomshumancdf was installed, but it was not, because the error keeps popping up.

*** This is where the YouTube tutorial fails with this version of R and Bioconductor.

But here are the description and commands to get the expression matrix and combine probe IDs by gene name, giving normalized expression values per sample.

normalized.data <- rma(raw_data)

I tried closing and restarting in case the install would work then. It did not. The error above reports that clariomshumancdf is not available from Bioconductor, so in this version the affy route cannot find a CDF for this array.

normalized.data
normalized.expr <- exprs(normalized.data)

When it works, the video shows a matrix with probes as row names and samples as column headers.

normalized.expr <- as.data.frame(exprs(normalized.data))

The code above, from the video, turns the matrix into a data frame.
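Since the affy route stops at the missing CDF, one alternative worth trying is oligo together with the pd.clariom.s.human annotation package installed at the start of this attempt. This is a sketch under those assumptions, not a tested fix for this data set. `read_clariom` is a hypothetical wrapper, and the calls are namespaced because affy was attached after oligo and masks `rma`.

```r
# Sketch only: needs the Bioconductor packages oligo and pd.clariom.s.human
# to be installed, plus a folder of CEL files. `read_clariom` is hypothetical.
read_clariom <- function(cel_dir) {
  if (!requireNamespace("oligo", quietly = TRUE)) {
    stop("package 'oligo' is not installed")
  }
  cel_files <- list.files(cel_dir, pattern = "\\.CEL(\\.gz)?$",
                          ignore.case = TRUE, full.names = TRUE)
  raw <- oligo::read.celfiles(cel_files)  # oligo looks up the pd.* package from the chip type
  norm <- oligo::rma(raw)                 # background-correct, normalize, summarize
  Biobase::exprs(norm)                    # probeset-by-sample expression matrix
}

# Example call, left commented out because it needs the real files:
# expr <- read_clariom("data/")
# expr[1:5, 1:5]
```

If this works, the resulting matrix would take the place of `exprs(normalized.data)` in the steps that follow.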

Map probe IDs into gene symbols.

gse <- getGEO("GSE305165", GSEMatrix = T)

The above produces a large list.

Next, get the feature data and store it.

feature.data  <- gse$GSE305165_series_matrix.txt.gz@featureData@data

The above produces the probe annotation, with gene symbols for the probe IDs as well as other information.

Subset to get only the probe IDs and gene symbols. These are columns 1 and 11.

feature.data <- feature.data[,c(1,11)]

Merge the two data sets. The dot stands for the piped-in data frame (x), which is normalized.expr after a column named ID is added from its row names, and it is joined with feature.data (y). Both x and y then share the probe column named ID.

normalized.expr <- normalized.expr %>% 
  rownames_to_column(var="ID") %>%
  inner_join(.,feature.data, by="ID")
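After the join, several probes can map to the same gene, so a further step is needed to get one row per gene. A minimal base-R sketch that averages probes per gene symbol follows. The numbers are made up, and `SYMBOL` is an assumed column name, since the actual name of column 11 in the feature data is not shown here. Averaging is only one common choice, not necessarily what the study did.

```r
# Toy normalized values (made up): probes in rows, samples in columns.
expr_toy <- data.frame(
  ID = c("p1", "p2", "p3", "p4"),
  S1 = c(2, 4, 6, 8),
  S2 = c(1, 3, 5, 7)
)

# Toy annotation; `SYMBOL` is an assumed column name.
annot_toy <- data.frame(
  ID = c("p1", "p2", "p3"),
  SYMBOL = c("GENE_A", "GENE_A", "GENE_B")
)

# Inner join on the probe ID; p4 has no symbol and drops out.
merged <- merge(expr_toy, annot_toy, by = "ID")

# Average all probes that map to the same gene symbol.
gene_level <- aggregate(cbind(S1, S2) ~ SYMBOL, data = merged, FUN = mean)
gene_level
```

Other summaries, such as keeping the probe with the highest mean expression, would slot into the same `FUN` argument or a small custom function.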

==========================================

*** Attempt 3

That was the end of the 4 year old video tutorial linked above on reading in CEL files. It stopped at the package incompatibility between Clariom_S_Human and the other dependent packages, so I am trying another AI answer, asking specifically for Bioconductor version 3.22 built for R 4.5.

Got it — you want to read CEL files (Affymetrix microarray data) in Bioconductor 3.22 with R 4.5. Here’s a concise, up-to-date approach that works with the current Bioconductor ecosystem.

  1. Install Bioconductor and Required Packages

Install BiocManager if not already installed

install.packages("BiocManager")

Ensure you’re using Bioconductor 3.22

BiocManager::install(version = "3.22")

Install packages for reading CEL files

BiocManager::install(c("affy", "oligo"))
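Before reinstalling anything, it can help to record which R and Bioconductor versions are actually in use and which of the relevant packages are present. This is a minimal sketch. The `BiocManager::version()` call is wrapped in `tryCatch()` as a precaution in case it cannot determine the version, for example without network access.

```r
# R version string of the running session.
r_version <- R.version.string
r_version

# Bioconductor release, if BiocManager is available.
if (requireNamespace("BiocManager", quietly = TRUE)) {
  bioc_version <- tryCatch(BiocManager::version(), error = function(e) NA)
  print(bioc_version)
  # BiocManager::valid() reports packages that are out of date or too new
  # for this release; it contacts the repositories, so it is left commented.
  # BiocManager::valid()
} else {
  message("BiocManager is not installed")
}

# Which of the packages discussed in this post are installed?
pkgs <- c("affy", "oligo", "pd.clariom.s.human")
installed <- pkgs %in% rownames(installed.packages())
names(installed) <- pkgs
installed
```

A `FALSE` next to a package here would point to a failed install rather than a version mismatch.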
  2. Reading CEL Files (Two Common Approaches)

Option A – Using affy (classic Affymetrix arrays)

library(affy)

Read all CEL files from a directory

data <- ReadAffy(celfile.path = "data/")
[Screenshot: the data file in RStudio with the affy library]

Inspect

summary(data)
##    Length     Class      Mode 
##        47 AffyBatch        S4

Option B – Using oligo (newer arrays, e.g., Gene ST, Exon ST)

library(oligo)

Read CEL files

This uses Attempt 2’s approach of placing each CEL file from its individual folder into one folder, which we named GSE305165. It used the following GEOquery commands:

library(GEOquery)
getGEOSuppFiles("GSE305165")
## Using locally cached version of supplementary file(s) GSE305165 found here:
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar
##                                                                                                                      size
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar 46592000
##                                                                                                                  isdir
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar FALSE
##                                                                                                                  mode
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar  666
##                                                                                                                                mtime
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar 2026-04-09 16:26:11
##                                                                                                                                ctime
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar 2026-04-09 16:26:06
##                                                                                                                                atime
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar 2026-04-09 20:52:56
##                                                                                                                  exe
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar  no
##                                                                                                                  uname
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar jlcor
##                                                                                                                        udomain
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar DATAMASSAGER1
##                                                                                                                              fname
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar GSE305165_RAW.tar
##                                                                                                                                                                                                         destdir
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165
##                                                                                                                                                                                                                          filepath
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar
##                                                                                                                        GEO
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar GSE305165
untar("GSE305165/GSE305165_RAW.tar", exdir="data/")
cel_files <- list.celfiles("GSE305165/", full.names = TRUE)

data <- read.celfiles(cel_files)
## Loading required package: pd.clariom.s.human
## Loading required package: RSQLite
## Loading required package: DBI
## Platform design info loaded.
## Reading in : GSE305165/GSM9163281_02_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163282_03_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163283_06_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163284_08_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163285_09_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163286_10_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163287_11_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163288_12_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163289_13_Clariom_S_Human_2.CEL
## Reading in : GSE305165/GSM9163290_14_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163291_15_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163292_16_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163293_17_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163294_19_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163295_20_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163296_21_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163297_22_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163298_23_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163299_24_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163300_25_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163301_26_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163302_27_Clariom_S_Human_2.CEL
## Reading in : GSE305165/GSM9163303_29_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163304_30_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163305_31_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163306_32_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163307_34_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163308_35_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163309_36_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163310_37_Clariom_S_Human_2.CEL
## Reading in : GSE305165/GSM9163311_38_Clariom_S_Human_2.CEL
## Reading in : GSE305165/GSM9163312_39_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163313_40_Clariom_S_Human_2.CEL
## Reading in : GSE305165/GSM9163314_41_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163315_42_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163316_43_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163317_44_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163318_45_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163319_46_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163320_49_Clariom_S_Human_2.CEL
## Reading in : GSE305165/GSM9163321_50_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163322_51_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163323_52_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163324_53_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163325_55_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163326_56_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163327_57_Clariom_S_Human_.CEL

Inspect

data
## ExpressionFeatureSet (storageMode: lockedEnvironment)
## assayData: 300304 features, 47 samples 
##   element names: exprs 
## protocolData
##   rowNames: GSM9163281_02_Clariom_S_Human_.CEL
##     GSM9163282_03_Clariom_S_Human_.CEL ...
##     GSM9163327_57_Clariom_S_Human_.CEL (47 total)
##   varLabels: exprs dates
##   varMetadata: labelDescription channel
## phenoData
##   rowNames: GSM9163281_02_Clariom_S_Human_.CEL
##     GSM9163282_03_Clariom_S_Human_.CEL ...
##     GSM9163327_57_Clariom_S_Human_.CEL (47 total)
##   varLabels: index
##   varMetadata: labelDescription channel
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: pd.clariom.s.human
The data in oligo looks like this in Rstudio

Option C – Using affyio for low-level access

library(affyio)

Read header and data separately

header <- read.celfile.header("path/to/file.CEL")
cel_data <- read.celfile("path/to/file.CEL")

Tips

Use affy for older 3’ expression arrays. Use oligo for newer whole-transcript arrays. Always check the array type with

affyio::read.celfile.header(cel_files[1])

before choosing the package.

That was the end of the bioconductor AI help for this version of 3.22 in R version 4.5.

================

Let's bring in Attempt 2, after the raw_data read failed because of the wrong CDF version, using the data most recently read in with the oligo library.

The rest works in RStudio after installing the Bioconductor packages built for my version of R, but knitr stops working right here.

normalized.data <- rma(data)

## Background correcting
## Normalizing
## Calculating Expression

normalized.data

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 27189 features, 47 samples 
##   element names: exprs 
## protocolData
##   rowNames: GSM9163281_02_Clariom_S_Human_.CEL
##     GSM9163282_03_Clariom_S_Human_.CEL ...
##     GSM9163327_57_Clariom_S_Human_.CEL (47 total)
##   varLabels: exprs dates
##   varMetadata: labelDescription channel
## phenoData
##   rowNames: GSM9163281_02_Clariom_S_Human_.CEL
##     GSM9163282_03_Clariom_S_Human_.CEL ...
##     GSM9163327_57_Clariom_S_Human_.CEL (47 total)
##   varLabels: index
##   varMetadata: labelDescription channel
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: pd.clariom.s.human

normalized.expr <- exprs(normalized.data)

The row names are the probe IDs in Affymetrix for this cdf Clariom_s_human.

normalized.expr <- as.data.frame(exprs(normalized.data))

We write the data out and read it back in so that the rest of the code can run under knitr, which fails on the steps above even though they now work in my RStudio setup with R and Bioconductor.

write.csv(normalized.expr,'normalized.expr1.csv', row.names=T)
normalized.expr <- read.csv('normalized.expr1.csv', header=T, row.names=1)
paged_table(normalized.expr[1:10,])
output because knitr is having problems with this

The above only shows the first 10 rows, but all 47 samples are there with long ID names per sample.
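
Because the sample names are long, a minimal sketch of trimming them to the GSM accession might look like the following. It assumes every column name starts with the accession followed by an underscore, as in the file names above; `long_names` and `short_names` are illustrative names, not objects from this analysis.

```r
# Two of the CEL-derived column names shown earlier in this write-up.
long_names <- c("GSM9163281_02_Clariom_S_Human_.CEL",
                "GSM9163289_13_Clariom_S_Human_2.CEL")

# Drop everything from the first underscore onward, keeping only the accession.
short_names <- sub("_.*$", "", long_names)
print(short_names)
```

This is only a convenience for display; the later steps rename the columns by lymphoma class anyway.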

Map probe IDs into gene symbols.

gse <- getGEO("GSE305165", GSEMatrix = T)

## Found 1 file(s)
## GSE305165_series_matrix.txt.gz
## Using locally cached version: C:/GSE305165_series_matrix.txt.gz
## Using locally cached version of GPL23159 found here:
## C:/GPL23159.soft.gz

The above produces a large list.

gse

## $GSE305165_series_matrix.txt.gz
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 21448 features, 47 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM9163281 GSM9163282 ... GSM9163327 (47 total)
##   varLabels: title geo_accession ... Sex:ch1 (33 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: TC0100006437.hg.1 TC0100006476.hg.1 ...
##     TSUnmapped00000823.hg.1 (21448 total)
##   fvarLabels: ID probeset_id ... SPOT_ID.1 (10 total)
##   fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
##   pubMedIds: 41371409 
## Annotation: GPL23159

The gse Large list in Rstudio. gse Large list details in source code object view in rstudio

Next get the feature IDs and store it.

feature.data  <- gse$GSE305165_series_matrix.txt.gz@featureData@data

The above produces additional information with alternate gene ID symbols to the Probe IDs as well as other information.

paged_table(feature.data[1:10,])

1st screen 2nd screen 3rd screen

Subset to keep only the probe IDs and gene symbols. These are columns 1 and 10, but column 10 mixes several kinds of annotation in one field.

feature.data1 <- feature.data[,c(1,10)]

paged_table(feature.data1[1:20,])
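
Selecting by position works, but selecting by column name is a little more robust if the column order ever changes. A small sketch on a toy table, assuming the two columns are named ID and SPOT_ID.1 as listed in the fvarLabels above (`toy_features` and its values are made up):

```r
# Toy stand-in for feature.data; the real table has 10 columns.
toy_features <- data.frame(ID = c("a1", "a2"),
                           other = c(1, 2),
                           SPOT_ID.1 = c("x // y", "z // w"),
                           stringsAsFactors = FALSE)

# Selecting by name keeps working even if the column order changes.
toy_subset <- toy_features[, c("ID", "SPOT_ID.1")]
print(names(toy_subset))
```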

Add an ID column to the normalized.expr data frame of samples by probe ID.

normalized.expr$ID <- row.names(normalized.expr)

paged_table(normalized.expr[1:10,45:48])

The above table shows only first 10 rows and last 4 columns with the ID column added to it.

str(normalized.expr)
## 'data.frame':    27189 obs. of  48 variables:
##  $ GSM9163281_02_Clariom_S_Human_.CEL : num  5.31 5.68 4.97 3.97 3.53 ...
##  $ GSM9163282_03_Clariom_S_Human_.CEL : num  5.49 4.69 5.15 4.57 2.77 ...
##  $ GSM9163283_06_Clariom_S_Human_.CEL : num  4.75 5.03 5.03 3.7 3.66 ...
##  $ GSM9163284_08_Clariom_S_Human_.CEL : num  5.68 5.65 4.29 3.57 3.35 ...
##  $ GSM9163285_09_Clariom_S_Human_.CEL : num  4.82 5.68 4.92 3.86 4.89 ...
##  $ GSM9163286_10_Clariom_S_Human_.CEL : num  5.52 5.59 4.1 3.6 3.8 ...
##  $ GSM9163287_11_Clariom_S_Human_.CEL : num  5.45 5.56 4.91 3.97 3.39 ...
##  $ GSM9163288_12_Clariom_S_Human_.CEL : num  5.77 5.88 3.76 3.98 3.33 ...
##  $ GSM9163289_13_Clariom_S_Human_2.CEL: num  6.18 5.37 4.53 4.24 3.62 ...
##  $ GSM9163290_14_Clariom_S_Human_.CEL : num  5.59 5.7 4.19 4.45 3.64 ...
##  $ GSM9163291_15_Clariom_S_Human_.CEL : num  5.51 5.72 4.87 3.78 3.45 ...
##  $ GSM9163292_16_Clariom_S_Human_.CEL : num  5.66 5.58 4.91 4.59 3.11 ...
##  $ GSM9163293_17_Clariom_S_Human_.CEL : num  5.02 5.12 3.33 3.91 3.39 ...
##  $ GSM9163294_19_Clariom_S_Human_.CEL : num  5.2 5.75 4.15 4.09 3.39 ...
##  $ GSM9163295_20_Clariom_S_Human_.CEL : num  5.32 5.23 4.47 3.88 3.49 ...
##  $ GSM9163296_21_Clariom_S_Human_.CEL : num  5.8 5.97 4.6 3.86 3.7 ...
##  $ GSM9163297_22_Clariom_S_Human_.CEL : num  5.67 5.07 4.27 4.07 2.79 ...
##  $ GSM9163298_23_Clariom_S_Human_.CEL : num  6.01 5.89 3.89 4.14 3.54 ...
##  $ GSM9163299_24_Clariom_S_Human_.CEL : num  6.05 5.81 4.51 4.2 3.16 ...
##  $ GSM9163300_25_Clariom_S_Human_.CEL : num  5.51 5.73 4.13 4.18 3.73 ...
##  $ GSM9163301_26_Clariom_S_Human_.CEL : num  6.31 5.9 4.62 4.2 3.18 ...
##  $ GSM9163302_27_Clariom_S_Human_2.CEL: num  5.9 5.39 4.52 3.58 4.07 ...
##  $ GSM9163303_29_Clariom_S_Human_.CEL : num  5.39 5.14 4.58 4.6 3.69 ...
##  $ GSM9163304_30_Clariom_S_Human_.CEL : num  5.81 6.15 4.56 4.19 3.71 ...
##  $ GSM9163305_31_Clariom_S_Human_.CEL : num  5.67 5.64 5.28 4.02 3.57 ...
##  $ GSM9163306_32_Clariom_S_Human_.CEL : num  5.92 6.06 5.41 4.37 3.46 ...
##  $ GSM9163307_34_Clariom_S_Human_.CEL : num  5.68 6.04 4.29 4.41 3.4 ...
##  $ GSM9163308_35_Clariom_S_Human_.CEL : num  5.79 6.39 4.79 4.35 3.77 ...
##  $ GSM9163309_36_Clariom_S_Human_.CEL : num  5.14 5.21 3.79 4.09 3.96 ...
##  $ GSM9163310_37_Clariom_S_Human_2.CEL: num  6.19 5.26 3.8 4.27 3.49 ...
##  $ GSM9163311_38_Clariom_S_Human_2.CEL: num  5.75 5.85 3.84 3.86 3.39 ...
##  $ GSM9163312_39_Clariom_S_Human_.CEL : num  6.33 5.98 4.89 3.95 3.7 ...
##  $ GSM9163313_40_Clariom_S_Human_2.CEL: num  5.42 5.84 4.49 4.47 3.35 ...
##  $ GSM9163314_41_Clariom_S_Human_.CEL : num  6.33 5.88 4.85 4.35 3.17 ...
##  $ GSM9163315_42_Clariom_S_Human_.CEL : num  5.66 6.07 4.89 4.01 3.4 ...
##  $ GSM9163316_43_Clariom_S_Human_.CEL : num  6.12 6.08 4.44 3.98 3.24 ...
##  $ GSM9163317_44_Clariom_S_Human_.CEL : num  5.96 6.39 5.34 4.23 3.86 ...
##  $ GSM9163318_45_Clariom_S_Human_.CEL : num  4.72 6.16 4.92 4.3 4.12 ...
##  $ GSM9163319_46_Clariom_S_Human_.CEL : num  5.67 5.72 4.81 4.82 3.25 ...
##  $ GSM9163320_49_Clariom_S_Human_2.CEL: num  6.25 5.98 4.73 3.83 3.47 ...
##  $ GSM9163321_50_Clariom_S_Human_.CEL : num  5.95 5.9 4.69 4.1 3.35 ...
##  $ GSM9163322_51_Clariom_S_Human_.CEL : num  6.25 5.73 4.18 4.26 3.17 ...
##  $ GSM9163323_52_Clariom_S_Human_.CEL : num  5.71 6.36 4.4 4.1 3.69 ...
##  $ GSM9163324_53_Clariom_S_Human_.CEL : num  6.16 5.56 4.9 4.23 3.12 ...
##  $ GSM9163325_55_Clariom_S_Human_.CEL : num  6.85 5.68 4.01 3.83 3.5 ...
##  $ GSM9163326_56_Clariom_S_Human_.CEL : num  5.68 5.79 4.03 4.39 3.28 ...
##  $ GSM9163327_57_Clariom_S_Human_.CEL : num  6.29 5.94 4.5 4.19 3.38 ...
##  $ ID                                 : chr  "23064070" "23064071" "23064072" "23064073" ...
normalized.expr <- inner_join(normalized.expr,feature.data1, by="ID")
str(normalized.expr)
## 'data.frame':    27189 obs. of  48 variables:
##  $ GSM9163281_02_Clariom_S_Human_.CEL : num  5.31 5.68 4.97 3.97 3.53 ...
##  $ GSM9163282_03_Clariom_S_Human_.CEL : num  5.49 4.69 5.15 4.57 2.77 ...
##  $ GSM9163283_06_Clariom_S_Human_.CEL : num  4.75 5.03 5.03 3.7 3.66 ...
##  $ GSM9163284_08_Clariom_S_Human_.CEL : num  5.68 5.65 4.29 3.57 3.35 ...
##  $ GSM9163285_09_Clariom_S_Human_.CEL : num  4.82 5.68 4.92 3.86 4.89 ...
##  $ GSM9163286_10_Clariom_S_Human_.CEL : num  5.52 5.59 4.1 3.6 3.8 ...
##  $ GSM9163287_11_Clariom_S_Human_.CEL : num  5.45 5.56 4.91 3.97 3.39 ...
##  $ GSM9163288_12_Clariom_S_Human_.CEL : num  5.77 5.88 3.76 3.98 3.33 ...
##  $ GSM9163289_13_Clariom_S_Human_2.CEL: num  6.18 5.37 4.53 4.24 3.62 ...
##  $ GSM9163290_14_Clariom_S_Human_.CEL : num  5.59 5.7 4.19 4.45 3.64 ...
##  $ GSM9163291_15_Clariom_S_Human_.CEL : num  5.51 5.72 4.87 3.78 3.45 ...
##  $ GSM9163292_16_Clariom_S_Human_.CEL : num  5.66 5.58 4.91 4.59 3.11 ...
##  $ GSM9163293_17_Clariom_S_Human_.CEL : num  5.02 5.12 3.33 3.91 3.39 ...
##  $ GSM9163294_19_Clariom_S_Human_.CEL : num  5.2 5.75 4.15 4.09 3.39 ...
##  $ GSM9163295_20_Clariom_S_Human_.CEL : num  5.32 5.23 4.47 3.88 3.49 ...
##  $ GSM9163296_21_Clariom_S_Human_.CEL : num  5.8 5.97 4.6 3.86 3.7 ...
##  $ GSM9163297_22_Clariom_S_Human_.CEL : num  5.67 5.07 4.27 4.07 2.79 ...
##  $ GSM9163298_23_Clariom_S_Human_.CEL : num  6.01 5.89 3.89 4.14 3.54 ...
##  $ GSM9163299_24_Clariom_S_Human_.CEL : num  6.05 5.81 4.51 4.2 3.16 ...
##  $ GSM9163300_25_Clariom_S_Human_.CEL : num  5.51 5.73 4.13 4.18 3.73 ...
##  $ GSM9163301_26_Clariom_S_Human_.CEL : num  6.31 5.9 4.62 4.2 3.18 ...
##  $ GSM9163302_27_Clariom_S_Human_2.CEL: num  5.9 5.39 4.52 3.58 4.07 ...
##  $ GSM9163303_29_Clariom_S_Human_.CEL : num  5.39 5.14 4.58 4.6 3.69 ...
##  $ GSM9163304_30_Clariom_S_Human_.CEL : num  5.81 6.15 4.56 4.19 3.71 ...
##  $ GSM9163305_31_Clariom_S_Human_.CEL : num  5.67 5.64 5.28 4.02 3.57 ...
##  $ GSM9163306_32_Clariom_S_Human_.CEL : num  5.92 6.06 5.41 4.37 3.46 ...
##  $ GSM9163307_34_Clariom_S_Human_.CEL : num  5.68 6.04 4.29 4.41 3.4 ...
##  $ GSM9163308_35_Clariom_S_Human_.CEL : num  5.79 6.39 4.79 4.35 3.77 ...
##  $ GSM9163309_36_Clariom_S_Human_.CEL : num  5.14 5.21 3.79 4.09 3.96 ...
##  $ GSM9163310_37_Clariom_S_Human_2.CEL: num  6.19 5.26 3.8 4.27 3.49 ...
##  $ GSM9163311_38_Clariom_S_Human_2.CEL: num  5.75 5.85 3.84 3.86 3.39 ...
##  $ GSM9163312_39_Clariom_S_Human_.CEL : num  6.33 5.98 4.89 3.95 3.7 ...
##  $ GSM9163313_40_Clariom_S_Human_2.CEL: num  5.42 5.84 4.49 4.47 3.35 ...
##  $ GSM9163314_41_Clariom_S_Human_.CEL : num  6.33 5.88 4.85 4.35 3.17 ...
##  $ GSM9163315_42_Clariom_S_Human_.CEL : num  5.66 6.07 4.89 4.01 3.4 ...
##  $ GSM9163316_43_Clariom_S_Human_.CEL : num  6.12 6.08 4.44 3.98 3.24 ...
##  $ GSM9163317_44_Clariom_S_Human_.CEL : num  5.96 6.39 5.34 4.23 3.86 ...
##  $ GSM9163318_45_Clariom_S_Human_.CEL : num  4.72 6.16 4.92 4.3 4.12 ...
##  $ GSM9163319_46_Clariom_S_Human_.CEL : num  5.67 5.72 4.81 4.82 3.25 ...
##  $ GSM9163320_49_Clariom_S_Human_2.CEL: num  6.25 5.98 4.73 3.83 3.47 ...
##  $ GSM9163321_50_Clariom_S_Human_.CEL : num  5.95 5.9 4.69 4.1 3.35 ...
##  $ GSM9163322_51_Clariom_S_Human_.CEL : num  6.25 5.73 4.18 4.26 3.17 ...
##  $ GSM9163323_52_Clariom_S_Human_.CEL : num  5.71 6.36 4.4 4.1 3.69 ...
##  $ GSM9163324_53_Clariom_S_Human_.CEL : num  6.16 5.56 4.9 4.23 3.12 ...
##  $ GSM9163325_55_Clariom_S_Human_.CEL : num  6.85 5.68 4.01 3.83 3.5 ...
##  $ GSM9163326_56_Clariom_S_Human_.CEL : num  5.68 5.79 4.03 4.39 3.28 ...
##  $ GSM9163327_57_Clariom_S_Human_.CEL : num  6.29 5.94 4.5 4.19 3.38 ...
##  $ ID                                 : chr  "23064070" "23064071" "23064072" "23064073" ...
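
Before relying on a join, it can help to check how many keys actually match. The sketch below uses toy tables and base R's merge() in place of inner_join(); the table names and values are made up for illustration.

```r
toy_expr <- data.frame(ID = c("101", "102", "103"),
                       sample1 = c(5.3, 5.7, 4.9),
                       stringsAsFactors = FALSE)
toy_annot <- data.frame(ID = c("102", "103", "999"),
                        SPOT_ID.1 = c("g2", "g3", "g9"),
                        stringsAsFactors = FALSE)

# Fraction of expression IDs that have a matching annotation row.
overlap <- mean(toy_expr$ID %in% toy_annot$ID)

# merge() on a shared key keeps only matching rows, like an inner join.
joined <- merge(toy_expr, toy_annot, by = "ID")
print(overlap)
print(dim(joined))
```

An overlap near zero would suggest the two tables use different ID systems and need a different key.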

The SPOT_ID.1 column of the feature data has the gene symbol in it as well as the full name and alternate names. Let's check the first entry after the join.

normalized.expr$SPOT_ID.1[1]
## NULL

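The NULL here, together with the unchanged 48 variables in the str() output above, shows that the join did not add SPOT_ID.1 to normalized.expr. One thing to check is the join key: the IDs from oligo look like "23064070", while the feature names in the GEO feature data look like "TC0100006437.hg.1". Once the column is available, it seems like it should be split at each ‘//’ and made into separate columns. We will see about separating it later, as I am currently unable to use the split function to separate it as a list. I will ask AI later about extracting only the gene symbol in parentheses.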
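
As a rough sketch of the splitting idea, the code below splits annotation strings on "//" and pulls out the first token in parentheses. The example strings are hypothetical and only imitate the "field // field" layout described above; the real SPOT_ID.1 entries may need a different pattern. `extract_symbol` is an illustrative helper, not part of any package.

```r
# Hypothetical annotation strings in the "field // field // ..." style.
spot <- c(
  "ACC0001 // SourceDB // Homo sapiens example gene one (GENE1), mRNA. // chr1",
  "no symbol here // --- // ---"
)

# Split each string on "//" and trim whitespace around each field.
fields <- lapply(strsplit(spot, "//", fixed = TRUE), trimws)

# Pull the first parenthesised token; return NA when nothing matches.
extract_symbol <- function(x) {
  m <- regmatches(x, regexpr("\\(([A-Za-z0-9.-]+)\\)", x))
  if (length(m) == 0) NA_character_ else gsub("[()]", "", m)
}
symbols <- vapply(spot, extract_symbol, character(1), USE.NAMES = FALSE)
print(symbols)
```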

Now let's see how our samples line up, in order, with the series_txt information we made earlier.

compare <- rbind(colnames(normalized.expr)[1:47],series4[2:48])

paged_table(compare)

Viewing the sample IDs next to the GSM IDs shows they are the same and in a consistent order, so we can rename the columns to shorter names by class. All samples have a confirmed EBV diagnosis, but they are split among CHL, pDLBCL, and mDLBCL. We can extend this to the group as well.

CHL <- grep('CHL',compare[3,])
mDLBCL <- grep('mDLBCL', compare[3,])
pDLBCL <- grep('pDLBCL',compare[3,])

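Assigning these labels creates duplicate column names. R only numbers duplicates automatically (a dot and a number starting with 1 after the first duplicate) when names are made unique, for example by read.csv or make.unique; a plain colnames assignment keeps the repeats, as the str() output further down shows.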

colnames(compare)[CHL] <- 'CHL'
colnames(compare)[mDLBCL] <- 'mDLBCL'
colnames(compare)[pDLBCL] <- 'pDLBCL'

paged_table(compare)
newNames <- colnames(compare)

colnames(normalized.expr)[1:47] <- newNames
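
If unique column names are wanted, base R's make.unique() adds the dot-and-number suffix explicitly. A small sketch with made-up labels:

```r
# Class labels with repeats, similar in spirit to newNames.
labels <- c("pDLBCL", "CHL", "pDLBCL", "mDLBCL", "CHL")

# make.unique() leaves the first occurrence alone and suffixes later repeats.
unique_labels <- make.unique(labels)
print(unique_labels)
```

Since the rest of this analysis selects columns by index rather than by name, the duplicate names do not break the row means below.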

Let's add the mean of each gene by type of lymphoma.

normalized.expr$CHL_mean <- rowMeans(normalized.expr[,CHL])
normalized.expr$pDLBCL_mean <- rowMeans(normalized.expr[,pDLBCL])
normalized.expr$mDLBCL_mean <- rowMeans(normalized.expr[,mDLBCL])

str(normalized.expr)
## 'data.frame':    27189 obs. of  51 variables:
##  $ pDLBCL     : num  5.31 5.68 4.97 3.97 3.53 ...
##  $ CHL        : num  5.49 4.69 5.15 4.57 2.77 ...
##  $ pDLBCL     : num  4.75 5.03 5.03 3.7 3.66 ...
##  $ mDLBCL     : num  5.68 5.65 4.29 3.57 3.35 ...
##  $ CHL        : num  4.82 5.68 4.92 3.86 4.89 ...
##  $ mDLBCL     : num  5.52 5.59 4.1 3.6 3.8 ...
##  $ pDLBCL     : num  5.45 5.56 4.91 3.97 3.39 ...
##  $ CHL        : num  5.77 5.88 3.76 3.98 3.33 ...
##  $ CHL        : num  6.18 5.37 4.53 4.24 3.62 ...
##  $ CHL        : num  5.59 5.7 4.19 4.45 3.64 ...
##  $ CHL        : num  5.51 5.72 4.87 3.78 3.45 ...
##  $ mDLBCL     : num  5.66 5.58 4.91 4.59 3.11 ...
##  $ mDLBCL     : num  5.02 5.12 3.33 3.91 3.39 ...
##  $ CHL        : num  5.2 5.75 4.15 4.09 3.39 ...
##  $ CHL        : num  5.32 5.23 4.47 3.88 3.49 ...
##  $ pDLBCL     : num  5.8 5.97 4.6 3.86 3.7 ...
##  $ CHL        : num  5.67 5.07 4.27 4.07 2.79 ...
##  $ mDLBCL     : num  6.01 5.89 3.89 4.14 3.54 ...
##  $ mDLBCL     : num  6.05 5.81 4.51 4.2 3.16 ...
##  $ mDLBCL     : num  5.51 5.73 4.13 4.18 3.73 ...
##  $ mDLBCL     : num  6.31 5.9 4.62 4.2 3.18 ...
##  $ CHL        : num  5.9 5.39 4.52 3.58 4.07 ...
##  $ mDLBCL     : num  5.39 5.14 4.58 4.6 3.69 ...
##  $ CHL        : num  5.81 6.15 4.56 4.19 3.71 ...
##  $ CHL        : num  5.67 5.64 5.28 4.02 3.57 ...
##  $ CHL        : num  5.92 6.06 5.41 4.37 3.46 ...
##  $ CHL        : num  5.68 6.04 4.29 4.41 3.4 ...
##  $ mDLBCL     : num  5.79 6.39 4.79 4.35 3.77 ...
##  $ mDLBCL     : num  5.14 5.21 3.79 4.09 3.96 ...
##  $ mDLBCL     : num  6.19 5.26 3.8 4.27 3.49 ...
##  $ mDLBCL     : num  5.75 5.85 3.84 3.86 3.39 ...
##  $ mDLBCL     : num  6.33 5.98 4.89 3.95 3.7 ...
##  $ mDLBCL     : num  5.42 5.84 4.49 4.47 3.35 ...
##  $ pDLBCL     : num  6.33 5.88 4.85 4.35 3.17 ...
##  $ pDLBCL     : num  5.66 6.07 4.89 4.01 3.4 ...
##  $ mDLBCL     : num  6.12 6.08 4.44 3.98 3.24 ...
##  $ mDLBCL     : num  5.96 6.39 5.34 4.23 3.86 ...
##  $ CHL        : num  4.72 6.16 4.92 4.3 4.12 ...
##  $ pDLBCL     : num  5.67 5.72 4.81 4.82 3.25 ...
##  $ pDLBCL     : num  6.25 5.98 4.73 3.83 3.47 ...
##  $ CHL        : num  5.95 5.9 4.69 4.1 3.35 ...
##  $ CHL        : num  6.25 5.73 4.18 4.26 3.17 ...
##  $ CHL        : num  5.71 6.36 4.4 4.1 3.69 ...
##  $ mDLBCL     : num  6.16 5.56 4.9 4.23 3.12 ...
##  $ mDLBCL     : num  6.85 5.68 4.01 3.83 3.5 ...
##  $ mDLBCL     : num  5.68 5.79 4.03 4.39 3.28 ...
##  $ CHL        : num  6.29 5.94 4.5 4.19 3.38 ...
##  $ ID         : chr  "23064070" "23064071" "23064072" "23064073" ...
##  $ CHL_mean   : num  5.65 5.71 4.58 4.13 3.54 ...
##  $ pDLBCL_mean: num  5.65 5.74 4.85 4.06 3.45 ...
##  $ mDLBCL_mean: num  5.83 5.72 4.33 4.13 3.48 ...
summary(normalized.expr)
##      pDLBCL            CHL             pDLBCL           mDLBCL      
##  Min.   : 1.104   Min.   : 1.075   Min.   : 1.037   Min.   : 1.056  
##  1st Qu.: 3.237   1st Qu.: 3.305   1st Qu.: 3.230   1st Qu.: 3.209  
##  Median : 4.383   Median : 4.326   Median : 4.376   Median : 4.377  
##  Mean   : 4.467   Mean   : 4.421   Mean   : 4.457   Mean   : 4.462  
##  3rd Qu.: 5.473   3rd Qu.: 5.362   3rd Qu.: 5.474   3rd Qu.: 5.470  
##  Max.   :13.469   Max.   :13.482   Max.   :13.469   Max.   :13.482  
##       CHL             mDLBCL           pDLBCL            CHL        
##  Min.   : 1.115   Min.   : 0.979   Min.   : 0.979   Min.   : 1.071  
##  1st Qu.: 3.289   1st Qu.: 3.193   1st Qu.: 3.221   1st Qu.: 3.195  
##  Median : 4.354   Median : 4.382   Median : 4.377   Median : 4.380  
##  Mean   : 4.433   Mean   : 4.465   Mean   : 4.458   Mean   : 4.474  
##  3rd Qu.: 5.431   3rd Qu.: 5.474   3rd Qu.: 5.482   3rd Qu.: 5.502  
##  Max.   :13.454   Max.   :13.497   Max.   :13.469   Max.   :13.495  
##       CHL              CHL              CHL             mDLBCL      
##  Min.   : 1.009   Min.   : 1.035   Min.   : 1.158   Min.   : 1.097  
##  1st Qu.: 3.218   1st Qu.: 3.268   1st Qu.: 3.293   1st Qu.: 3.222  
##  Median : 4.410   Median : 4.416   Median : 4.412   Median : 4.376  
##  Mean   : 4.503   Mean   : 4.504   Mean   : 4.488   Mean   : 4.498  
##  3rd Qu.: 5.537   3rd Qu.: 5.492   3rd Qu.: 5.457   3rd Qu.: 5.517  
##  Max.   :13.219   Max.   :13.495   Max.   :13.497   Max.   :13.482  
##      mDLBCL            CHL              CHL              pDLBCL       
##  Min.   : 1.131   Min.   : 1.015   Min.   : 0.9349   Min.   : 0.9517  
##  1st Qu.: 3.322   1st Qu.: 3.271   1st Qu.: 3.2113   1st Qu.: 3.2006  
##  Median : 4.418   Median : 4.414   Median : 4.3820   Median : 4.3765  
##  Mean   : 4.495   Mean   : 4.499   Mean   : 4.4727   Mean   : 4.4734  
##  3rd Qu.: 5.474   3rd Qu.: 5.485   3rd Qu.: 5.4897   3rd Qu.: 5.4796  
##  Max.   :13.455   Max.   :13.495   Max.   :13.4823   Max.   :13.4973  
##       CHL             mDLBCL           mDLBCL           mDLBCL      
##  Min.   : 1.083   Min.   : 1.011   Min.   : 1.072   Min.   : 1.035  
##  1st Qu.: 3.256   1st Qu.: 3.194   1st Qu.: 3.189   1st Qu.: 3.194  
##  Median : 4.392   Median : 4.382   Median : 4.400   Median : 4.369  
##  Mean   : 4.480   Mean   : 4.488   Mean   : 4.497   Mean   : 4.480  
##  3rd Qu.: 5.483   3rd Qu.: 5.512   3rd Qu.: 5.528   3rd Qu.: 5.478  
##  Max.   :13.469   Max.   :13.497   Max.   :13.497   Max.   :13.124  
##      mDLBCL            CHL             mDLBCL            CHL         
##  Min.   : 1.075   Min.   : 1.085   Min.   : 1.134   Min.   : 0.9819  
##  1st Qu.: 3.222   1st Qu.: 3.211   1st Qu.: 3.248   1st Qu.: 3.1659  
##  Median : 4.388   Median : 4.394   Median : 4.395   Median : 4.3858  
##  Mean   : 4.485   Mean   : 4.472   Mean   : 4.495   Mean   : 4.4632  
##  3rd Qu.: 5.511   3rd Qu.: 5.502   3rd Qu.: 5.494   3rd Qu.: 5.5239  
##  Max.   :13.469   Max.   :13.190   Max.   :13.497   Max.   :13.4823  
##       CHL              CHL              CHL             mDLBCL      
##  Min.   : 1.005   Min.   : 1.117   Min.   : 1.081   Min.   : 1.072  
##  1st Qu.: 3.205   1st Qu.: 3.153   1st Qu.: 3.222   1st Qu.: 3.188  
##  Median : 4.388   Median : 4.374   Median : 4.396   Median : 4.390  
##  Mean   : 4.482   Mean   : 4.486   Mean   : 4.484   Mean   : 4.488  
##  3rd Qu.: 5.521   3rd Qu.: 5.542   3rd Qu.: 5.500   3rd Qu.: 5.538  
##  Max.   :13.469   Max.   :13.438   Max.   :13.469   Max.   :13.469  
##      mDLBCL           mDLBCL           mDLBCL           mDLBCL      
##  Min.   : 1.100   Min.   : 1.060   Min.   : 1.065   Min.   : 1.119  
##  1st Qu.: 3.305   1st Qu.: 3.202   1st Qu.: 3.211   1st Qu.: 3.180  
##  Median : 4.408   Median : 4.387   Median : 4.405   Median : 4.376  
##  Mean   : 4.496   Mean   : 4.492   Mean   : 4.496   Mean   : 4.501  
##  3rd Qu.: 5.479   3rd Qu.: 5.522   3rd Qu.: 5.522   3rd Qu.: 5.532  
##  Max.   :13.482   Max.   :13.364   Max.   :13.167   Max.   :13.482  
##      mDLBCL           pDLBCL           pDLBCL           mDLBCL      
##  Min.   : 1.107   Min.   : 1.151   Min.   : 1.091   Min.   : 1.067  
##  1st Qu.: 3.227   1st Qu.: 3.181   1st Qu.: 3.196   1st Qu.: 3.182  
##  Median : 4.414   Median : 4.396   Median : 4.390   Median : 4.374  
##  Mean   : 4.493   Mean   : 4.496   Mean   : 4.487   Mean   : 4.502  
##  3rd Qu.: 5.523   3rd Qu.: 5.528   3rd Qu.: 5.525   3rd Qu.: 5.540  
##  Max.   :13.497   Max.   :13.469   Max.   :13.497   Max.   :13.482  
##      mDLBCL            CHL             pDLBCL           pDLBCL      
##  Min.   : 1.035   Min.   : 1.021   Min.   : 1.052   Min.   : 1.111  
##  1st Qu.: 3.191   1st Qu.: 3.203   1st Qu.: 3.165   1st Qu.: 3.131  
##  Median : 4.371   Median : 4.376   Median : 4.372   Median : 4.334  
##  Mean   : 4.483   Mean   : 4.440   Mean   : 4.490   Mean   : 4.489  
##  3rd Qu.: 5.515   3rd Qu.: 5.508   3rd Qu.: 5.539   3rd Qu.: 5.532  
##  Max.   :13.497   Max.   :13.469   Max.   :13.497   Max.   :13.469  
##       CHL              CHL              CHL             mDLBCL      
##  Min.   : 1.099   Min.   : 1.011   Min.   : 1.046   Min.   : 1.036  
##  1st Qu.: 3.180   1st Qu.: 3.168   1st Qu.: 3.219   1st Qu.: 3.121  
##  Median : 4.381   Median : 4.370   Median : 4.397   Median : 4.333  
##  Mean   : 4.484   Mean   : 4.492   Mean   : 4.488   Mean   : 4.479  
##  3rd Qu.: 5.500   3rd Qu.: 5.531   3rd Qu.: 5.515   3rd Qu.: 5.531  
##  Max.   :13.482   Max.   :13.482   Max.   :13.495   Max.   :13.482  
##      mDLBCL           mDLBCL            CHL              ID           
##  Min.   : 1.101   Min.   : 1.011   Min.   : 1.085   Length:27189      
##  1st Qu.: 3.158   1st Qu.: 3.247   1st Qu.: 3.163   Class :character  
##  Median : 4.364   Median : 4.410   Median : 4.357   Mode  :character  
##  Mean   : 4.494   Mean   : 4.499   Mean   : 4.496                     
##  3rd Qu.: 5.525   3rd Qu.: 5.523   3rd Qu.: 5.526                     
##  Max.   :13.497   Max.   :13.497   Max.   :13.497                     
##     CHL_mean       pDLBCL_mean      mDLBCL_mean    
##  Min.   : 1.344   Min.   : 1.342   Min.   : 1.321  
##  1st Qu.: 3.248   1st Qu.: 3.218   1st Qu.: 3.232  
##  Median : 4.396   Median : 4.383   Median : 4.389  
##  Mean   : 4.477   Mean   : 4.477   Mean   : 4.489  
##  3rd Qu.: 5.462   3rd Qu.: 5.481   3rd Qu.: 5.488  
##  Max.   :13.406   Max.   :13.474   Max.   :13.379
paged_table(normalized.expr[c(1:10),c(45:51)])

There is no baseline or healthy sample to compare against for fold change values, so we have to compare other features, such as the mean and median values per sample type. One way to compare these types of lymphoma is to see how far the mean is from the median for each gene: a large value of mean minus median suggests a skewed distribution, which could indicate variability or an outlier. The samples are also not balanced, as there are only 8 pDLBCL cases but 19 CHL and 20 mDLBCL cases. There are more men than women, and the median age is 72 years, with everyone older than 50. We could factor in age, since the quality of our immune response declines as we get older. Each class could be weighted by the percent of its cases younger than 72 and the percent older than 72, as well as by the percent of males and the percent of females per class. The median value should be very close to the mean if the samples had no significant differences. We know the groups were defined by looking directly at the values of IFN-g, IDO1, and PDL1, as well as copy number variants at the chromosome 9 locus around 9p24.1.
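
To make the mean-minus-median idea concrete, here is a minimal sketch on a toy matrix. The gene names and values are made up; in the analysis the same calculation would be applied to the columns of one lymphoma class.

```r
# Toy expression matrix: 3 genes (rows) by 5 samples (columns).
toy <- matrix(c(5, 5, 5, 5, 5,
                4, 4, 4, 4, 9,
                3, 4, 5, 6, 7),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("geneA", "geneB", "geneC"), NULL))

gene_mean   <- rowMeans(toy)
gene_median <- apply(toy, 1, median)

# A large positive gap points to a high outlier or right skew in that gene.
gap <- gene_mean - gene_median
print(gap)
```

Only geneB, with one high sample, shows a gap; the evenly spread geneC does not, even though its values vary.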

males <- grep('Sex: male',compare[5,])
females <- grep('female', compare[5,])

There are 12 females, which leaves 35 males among the 47 samples.
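
A quick sanity check on counts like these is to confirm that the two index vectors do not overlap and together cover every sample. The sketch below uses made-up labels in the same "Sex: ..." style; `males_idx` and `females_idx` are illustrative names.

```r
# Made-up sex labels in the "Sex: ..." style used in the series matrix.
sex <- c("Sex: male", "Sex: female", "Sex: male", "Sex: male", "Sex: female")

males_idx   <- grep("Sex: male", sex, fixed = TRUE)
females_idx <- grep("Sex: female", sex, fixed = TRUE)

# Every sample should fall in exactly one group.
stopifnot(length(intersect(males_idx, females_idx)) == 0,
          length(males_idx) + length(females_idx) == length(sex))
print(c(males = length(males_idx), females = length(females_idx)))
```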

We will use x for female and y for male.

CHL_x <- CHL[CHL %in% females]
mDLBCL_x <- mDLBCL[mDLBCL %in% females]
pDLBCL_x <- pDLBCL[pDLBCL %in% females]

CHL_y <- CHL[CHL %in% males]
mDLBCL_y <- mDLBCL[mDLBCL %in% males]
pDLBCL_y <- pDLBCL[pDLBCL %in% males]
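
To see how many samples land in each class-by-sex subgroup, a cross-tabulation is a compact check. This is a sketch with made-up labels; in the analysis the two vectors would come from the diagnosis and gender rows of compare.

```r
# Made-up labels standing in for the diagnosis and gender rows.
diagnosis <- c("CHL", "mDLBCL", "pDLBCL", "CHL", "mDLBCL")
sex       <- c("male", "female", "male", "female", "male")

# Cross-tabulate class by sex to see subgroup sizes at a glance.
counts <- table(diagnosis, sex)
print(counts)
```

Small or empty cells are worth knowing about before averaging within a subgroup.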

Add the mean values of genes for each class by gender to the data frame normalized.expr.

normalized.expr$CHL_x_mean <- rowMeans(normalized.expr[,CHL_x])
normalized.expr$mDLBCL_x_mean <- rowMeans(normalized.expr[,mDLBCL_x])
normalized.expr$pDLBCL_x_mean <- rowMeans(normalized.expr[,pDLBCL_x])

normalized.expr$CHL_y_mean <- rowMeans(normalized.expr[,CHL_y])
normalized.expr$mDLBCL_y_mean <- rowMeans(normalized.expr[,mDLBCL_y])
normalized.expr$pDLBCL_y_mean <- rowMeans(normalized.expr[,pDLBCL_y])
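
One caveat with the `[, index]` form: if a subgroup ever contains a single sample, the selection drops to a plain vector and rowMeans() fails. A minimal sketch of the problem and the drop = FALSE workaround, using a toy data frame and a hypothetical one-sample subgroup:

```r
toy_df <- data.frame(s1 = c(1, 2), s2 = c(3, 4), s3 = c(5, 6))
one_sample <- 2L   # hypothetical subgroup containing a single column index

# With a single index, [, i] drops to a plain vector, which rowMeans() rejects.
failed <- inherits(try(rowMeans(toy_df[, one_sample]), silent = TRUE), "try-error")

# drop = FALSE keeps the data frame shape, so rowMeans() still works.
safe_mean <- rowMeans(toy_df[, one_sample, drop = FALSE])
print(failed)
print(safe_mean)
```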

Let's add the gene means per class by age: those 72 or younger and those older than 72.

age <- as.numeric(gsub('age: ','',compare[4,]))

compare[4,] <- age

row.names(compare) <- c('titleID','GSM_ID','diagnosis','age', 'gender','group')

paged_table(compare)
# compare is a character matrix, so use the numeric age vector for the cutoffs
young <- which(age <= 72)
old <- which(age > 72)
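
One thing to watch with age cutoffs is that values stored in a character matrix are compared as text, not as numbers. The sketch below shows the difference on made-up ages; with two-digit ages the two approaches agree, but a three-digit age would not.

```r
# Ages stored as text, as happens when numbers are written into a character matrix.
age_chr <- c("65", "72", "100")

# Text compared with a number is compared as text, so "100" sorts before "72".
as_text <- age_chr <= 72

# Converting first gives the intended numeric comparison.
as_number <- as.numeric(age_chr) <= 72
print(as_text)
print(as_number)
```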


CHL_young <- CHL[CHL %in% young]
mDLBCL_young <- mDLBCL[mDLBCL %in% young]
pDLBCL_young <- pDLBCL[pDLBCL %in% young]

CHL_old <- CHL[CHL %in% old]
mDLBCL_old <- mDLBCL[mDLBCL %in% old]
pDLBCL_old <- pDLBCL[pDLBCL %in% old]

Now that we have the young indices (72 or younger) and the old indices (older than 72) for each class, we can get the per-gene means for each lymphoma class within each age subcategory.

normalized.expr$CHL_young72_mean <- rowMeans(normalized.expr[CHL_young])
normalized.expr$mDLBCL_young72_mean <- rowMeans(normalized.expr[mDLBCL_young])
normalized.expr$pDLBCL_young72_mean <- rowMeans(normalized.expr[pDLBCL_young])

normalized.expr$CHL_old <- rowMeans(normalized.expr[CHL_old])
normalized.expr$mDLBCL_old <- rowMeans(normalized.expr[mDLBCL_old])
normalized.expr$pDLBCL_old <- rowMeans(normalized.expr[pDLBCL_old])

paged_table(normalized.expr[1:10,48:63])

Next we will add the medians per class of lymphoma, meaning CHL, mDLBCL, or pDLBCL, and not per group. We won’t use any weights for gender or age, but we can compare the mean values in each subset of a class by gender and by age separately to see if there are any noticeable changes.

The general pattern to calculate row medians for selected columns looks like this, where df and cols_to_use are placeholders:

df$row_median <- apply(df[cols_to_use], 1, function(row) {
  median(as.numeric(row), na.rm = TRUE) # Convert to numeric & ignore NAs
})

normalized.expr$CHL_median <- apply(normalized.expr[CHL],1,function(row){
  median(as.numeric(row), na.rm=F)
})

normalized.expr$mDLBCL_median <- apply(normalized.expr[mDLBCL],1,function(row){
  median(as.numeric(row), na.rm=F)
})

normalized.expr$pDLBCL_median <- apply(normalized.expr[pDLBCL],1,function(row){
  median(as.numeric(row), na.rm=F)
})

paged_table(normalized.expr[1:10,49:66])

Let’s go ahead and add the medians of each class within its gender and age subclasses. This could be useful when filtering for genes whose mean sits far from the median in that group. It’s a bunch of copy, paste, and replace, so there is not much typing or creativity in this little step.

normalized.expr$CHL_x_median <- apply(normalized.expr[CHL_x],1,function(row){
  median(as.numeric(row), na.rm=F)
})

normalized.expr$mDLBCL_x_median <- apply(normalized.expr[mDLBCL_x],1,function(row){
  median(as.numeric(row), na.rm=F)
})

normalized.expr$pDLBCL_x_median <- apply(normalized.expr[pDLBCL_x],1,function(row){
  median(as.numeric(row), na.rm=F)
})

normalized.expr$CHL_y_median <- apply(normalized.expr[CHL_y],1,function(row){
  median(as.numeric(row), na.rm=F)
})

normalized.expr$mDLBCL_y_median <- apply(normalized.expr[mDLBCL_y],1,function(row){
  median(as.numeric(row), na.rm=F)
})

normalized.expr$pDLBCL_y_median <- apply(normalized.expr[pDLBCL_y],1,function(row){
  median(as.numeric(row), na.rm=F)
})

Above we added the medians by gender within each class of lymphoma.

Now for the age subsets within each class, 72 or younger and older than 72.

normalized.expr$CHL_young72_median <- apply(normalized.expr[CHL_young],1,function(row){
  median(as.numeric(row), na.rm=F)
})

normalized.expr$mDLBCL_young72_median <- apply(normalized.expr[mDLBCL_young],1,function(row){
  median(as.numeric(row), na.rm=F)
})

normalized.expr$pDLBCL_young72_median <- apply(normalized.expr[pDLBCL_young],1,function(row){
  median(as.numeric(row), na.rm=F)
})

normalized.expr$CHL_old72_median <- apply(normalized.expr[CHL_old],1,function(row){
  median(as.numeric(row), na.rm=F)
})

normalized.expr$mDLBCL_old72_median <- apply(normalized.expr[mDLBCL_old],1,function(row){
  median(as.numeric(row), na.rm=F)
})

normalized.expr$pDLBCL_old72_median <- apply(normalized.expr[pDLBCL_old],1,function(row){
  median(as.numeric(row), na.rm=F)
})
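All of that copy and paste could also be done in a loop. The sketch below is one way to do it on toy data. The helper name add_group_stats, the toy data frame, and the subgroup names are all hypothetical and not part of the analysis above.

```r
# Toy data frame: 4 genes x 6 samples, made-up values.
set.seed(1)
demo <- as.data.frame(matrix(round(rnorm(24, mean = 8), 2), nrow = 4))
names(demo) <- paste0("s", 1:6)

# Hypothetical subgroups, given as column indices like CHL_x or CHL_young.
groups <- list(A_x = 1:2, A_y = 3:4, B_old = 5:6)

# Compute every statistic from the original columns first,
# then bind the new columns on in one step.
add_group_stats <- function(df, groups) {
  stats <- list()
  for (g in names(groups)) {
    # drop = FALSE keeps a one-sample group as a 2-D object
    sub <- as.matrix(df[, groups[[g]], drop = FALSE])
    stats[[paste0(g, "_mean")]]   <- rowMeans(sub)
    stats[[paste0(g, "_median")]] <- apply(sub, 1, median)
  }
  cbind(df, as.data.frame(stats))
}

demo_stats <- add_group_stats(demo, groups)
print(names(demo_stats))
```

With a named list of index vectors for every class, gender, and age subset, one call would add all of the mean and median columns at once.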

Let’s see what we have so far and decide on some filters to use as markers for genes, or we can just grep the genes that were in the study. Since the study points to the 9p24.1 locus, we should pull all genes in that region using the features data we uploaded earlier in this file. I looked, and the start and stop positions are given but not the cytoband; we have another file from another study that has it.

The study targeted PDL1, encoded by CD274, at 9p24.1 from 5,259,371 to 5,481,709, and PDL2 at 9p24.1 from 5,593,784 to 5,764,809. These are GRCh38 coordinates (Genome Reference Consortium Human Build 38, the reference assembly you can browse in the UCSC Genome Browser or similar).

For the EBV genes, EBNA2 is expressed along with LMP1 in an active EBV infection of latent type 3. That pattern belongs to the DLBCL types and not normally to CHL, which would have EBNA1 with latent type 1 or type 2. The study defined EBV latent type 1 as LMP1 negative and EBNA2 negative, and EBV latent type 2 as LMP1 positive but EBNA2 negative. We also need IFNG, the gene for interferon gamma (IFN-g), and IDO1. The study confirmed active EBV infection by EBER in situ hybridization.

Other markers that were looked at as part of their gene expression validation analysis included CD3, CD5, CD10, CD15, CD20, CD30, CD79a, PAX5, EBER, BCL2, BCL6, and MUM1.

Let’s make a list of these genes called GSVA, short here for gene study validation analysis (not to be confused with the Bioconductor package GSVA, Gene Set Variation Analysis).

GSVA <- c("LMP1","EBNA2","IDO1","IFNG","PDL1","PDL2","CD3","CD5","CD10","CD15","CD20","CD30","CD79a","PAX5","EBER","BCL2","BCL6", "MUM1")
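Several names in this list are marker names and not official gene symbols, so a grep on them may come back empty. Below is a small sketch of translating them first. The mapping is my own assumption from common usage (for example CD20 is encoded by MS4A1 and MUM1 is IRF4) and should be checked against the annotation. CD3 and CD15 are left out because CD3 is a complex of several genes and CD15 is a carbohydrate epitope, and LMP1, EBNA2, and EBER are EBV transcripts that may not have probes on a human array.

```r
# Assumed mapping from marker names used in the study to HGNC gene symbols.
# Verify each against the SPOT_ID.1 annotation before relying on it.
marker_to_symbol <- c(
  PDL1  = "CD274",
  PDL2  = "PDCD1LG2",
  CD10  = "MME",
  CD20  = "MS4A1",
  CD30  = "TNFRSF8",
  CD79a = "CD79A",
  MUM1  = "IRF4"
)

markers <- c("IDO1", "IFNG", "PDL1", "CD20", "MUM1", "BCL6")

# Replace a marker with its symbol when a mapping exists, otherwise keep it as is.
symbols <- ifelse(markers %in% names(marker_to_symbol),
                  marker_to_symbol[markers],
                  markers)
symbols <- unname(symbols)
print(symbols)
```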

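The features data has a strand column with + and - signs. The + means the feature is annotated on the forward (plus) strand of the reference genome and the - means it is on the reverse (minus) strand. The two strands of DNA are antiparallel and complementary, and the start and stop coordinates are normally reported along the forward strand no matter which strand the gene is on.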

chr9 <- subset(feature.data, feature.data$seqname == 'chr9') #822X10

  
startPDL1 <- 5259371
endPDL1 <- 5481709


PDL1_loci <- subset(chr9, chr9$strand == '+' & chr9$start >= startPDL1 & chr9$stop <= endPDL1)

paged_table(PDL1_loci)

By scrolling to the SPOT_ID.1 feature we can see that this is CD274, which the study said they used for PDL1 detection.

Now for the PDL2 locus.

startPDL2 <- 5593784
endPDL2 <- 5764809

PDL2_loci <- subset(chr9, chr9$strand == '-' & chr9$start >= startPDL2 & chr9$stop <= endPDL2)

paged_table(PDL2_loci)

We can add these genes to the list. We already found CD274 in the PDL1 window. The PDL2 window did not return PDCD1LG2, the gene that encodes PDL2, but it did return PTPRD and ERP44.
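One thing worth checking here is whether the start and stop columns are stored as numbers or as text. As far as I know ERP44 is annotated on the long arm of chromosome 9 and PTPRD sits several megabases past this window, so seeing them in the result could mean the coordinates were compared as text. Also, PDCD1LG2 is, to my knowledge, on the + strand like CD274, so a filter on the - strand would leave it out. The sketch below uses a made-up annotation table, with placeholder gene names and positions, to show how text comparison can let a distant feature through.

```r
# Toy annotation with coordinates stored as text, as can happen after import.
# Gene names and positions are illustrative placeholders, not real annotation.
anno <- data.frame(
  gene  = c("in_window", "far_downstream"),
  start = c("5600000", "99900000"),
  stop  = c("5700000", "100000000"),
  stringsAsFactors = FALSE
)
win_start <- 5593784
win_end   <- 5764809

# Text comparison: "100000000" <= "5764809" is TRUE because "1" sorts before "5".
as_text <- subset(anno, start >= win_start & stop <= win_end)

# Numeric comparison keeps only the feature that really lies inside the window.
as_number <- subset(anno, as.numeric(start) >= win_start & as.numeric(stop) <= win_end)

print(as_text$gene)
print(as_number$gene)
```

Running str() on the chr9 data frame would show whether start and stop are chr or num. If they are chr, wrapping them in as.numeric() inside the subset call would be the safer comparison.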

Let’s grep these genes in the SPOT_ID.1 column of the normalized.expr data.

GSVA <- c(GSVA,"PTPRD","CD274", "ERP44","PDCD1LG2")

GSVA
##  [1] "LMP1"     "EBNA2"    "IDO1"     "IFNG"     "PDL1"     "PDL2"    
##  [7] "CD3"      "CD5"      "CD10"     "CD15"     "CD20"     "CD30"    
## [13] "CD79a"    "PAX5"     "EBER"     "BCL2"     "BCL6"     "MUM1"    
## [19] "PTPRD"    "CD274"    "ERP44"    "PDCD1LG2"

We will do this later, as we will likely have to grep each of these 22 genes separately.
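Here is a rough sketch of how grepping each gene separately could look. The spot vector is a made-up stand-in for the SPOT_ID.1 column, and every string in it, including the look-alike symbol, is invented for illustration.

```r
# Toy stand-in for the messy SPOT_ID.1 column; strings are made up.
spot <- c(
  "probe1 // CD274 // CD274 molecule",
  "probe2 // IDO1 // indoleamine 2,3-dioxygenase 1",
  "probe3 // IDO1P1 // hypothetical look-alike symbol",
  "probe4 // BCL6 // BCL6 transcription repressor"
)
genes <- c("CD274", "IDO1", "PAX5")

# \\b marks a word boundary so "IDO1" does not also match "IDO1P1".
hits <- lapply(genes, function(g) {
  grep(paste0("\\b", g, "\\b"), spot, perl = TRUE)
})
names(hits) <- genes
print(hits)
```

A gene with no match, like PAX5 in this toy example, comes back as an empty integer vector, which makes it easy to see which names need a different symbol.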

Let’s write out the file to use later in Part 2. It has our samples, probe IDs, messy gene IDs in one column, and the sample means and medians by class and by subclass within each class.

write.csv(normalized.expr,'MeansMediansLymphomas_CHL_DLBCL_21448X78.csv', row.names=F)

This file is about 135 MB, which is too large for Google, but Kaggle allows it. Get this file here

or 2nd version here

Thanks so much and keep checking in for part 2.