Hello all. We will be analyzing gene expression data from GSE305165. The background below comes from the published article, which I read once while highlighting the important aspects of the study.
This research compared Diffuse Large B-Cell Lymphoma (DLBCL) with Classical Hodgkin’s Lymphoma (CHL) in Epstein-Barr Virus (EBV) infected patients. Both types of lymphoma affect elderly populations and share biomarkers such as IDO1. EBV latent type 2 is specific to CHL but can also be seen in a polymorphic type of DLBCL called pDLBCL, while EBV latent type 3 is seen mostly in monomorphic DLBCL, labeled mDLBCL, but also in pDLBCL. The study saw that the high gene variation at chromosome 9 locus 9p24.1, typical of CHL, also showed up to some degree in pDLBCL and mDLBCL. Overall, the study used patients 50 years old or older with no autoimmune or immunodeficiency pathologies. In elderly populations, however, there is a natural decline of the immune response to pathogens and antigens called immune senescence (IS). In seriously affected disease states of DLBCL there can also be a more fatal condition, immune escape, in which pathogens and antigens escape detection by the host immune system and can make changes that lead to the host’s death.
This is a very interesting study and not too difficult to read. The researchers decided that their hierarchical clustering did a great job of separating the classes of large B-cell lymphoma. They concluded that CHL and the DLBCLs (pDLBCL and mDLBCL) are not separate diseases but the same type of disease, with a 4th, transitional state that overlaps with pDLBCL and CHL and has low interferon gamma.
The four states are as follows. The 1st group is the IS group, the mDLBCL that is EBNA2 positive with EBV latent type 3. The 2nd group is the CHL group, which is high in variations at locus 9p24.1 on chromosome 9 and high in PDL1 gene expression, with only EBV latent type 2 gene expression and EBNA2 negative. The 3rd group is the pDLBCL, which is high in interferon gamma (IFN-g) and low in 9p24.1 variations, with high IDO1 gene expression that leads to immune escape and a high chance of a poor prognosis from hemophagocytic lymphohistiocytosis (HLH), which can lead to demise. The 4th group is the transition between CHL and pDLBCL, where IFN-g is low and the characteristics are unlike the other 3 groups, so it is labeled not otherwise specified (NOS).
The study uses 57 samples: 35 are DLBCL (12 pDLBCL and 23 mDLBCL) and the other 22 are CHL. All samples have confirmed EBV and no immune deficiency or pathology other than lymphoma and the normal effects of aging in IS.
However, there are only 47 samples in the GSE305165 Gene Expression Omnibus (GEO) record linked above, so we will be working with 47 samples.
library(rmarkdown)
## Warning: package 'rmarkdown' was built under R version 4.5.3
series <- read.table("GSE305165_series_matrix.txt/GSE305165_series_matrix.txt", skip = 31, header = TRUE, nrows = 29)
paged_table(series)
The GSM ID is row 1, diagnosis (EBV+CHL, EBV+pDLBCL, or EBV+mDLBCL) is row 9, age is row 10, gender is row 11, and group is row 19. Let's make a table of only those 5 features.
series4 <- series[c(1,9,10,11,19),]
paged_table(series4)
The groups mostly follow their definitions but show overlap. The CHL samples should be high in 9p24.1 variations, yet some mDLBCL samples are also 9p24.1-H. Some CHL samples, and at least one pDLBCL sample, are classified as IFNG-L instead of 9p24.1-H. The mDLBCL samples should all be IS, but some are IFNG-L or 9p24.1-H, and at least one is IFNG-H. The study said there was some overlap between the samples, but that almost all of the EBV latent type 2 cases were CHL or 9p24.1-H. We can still use the data to show the overlap.
There are 47 samples here, although the published article said there were 57. The other 10 may have been withheld or could not be shared; the reason is not stated.
Let's first see how many samples are in each group.
group <- series4[5,c(2:48)]
group_t <- data.frame(t(group))
colnames(group_t) <- 'group'
group_t$group <- gsub("group","", group_t$group)
table(group_t$group)
##
## 9p24.1-H IFNG-H IFNG-L IS
## 9 9 18 10
These are the 4 subtypes of lymphoma the study produced. It says the transition state is the one with low IFNG (IFNG-L) and not otherwise specified findings. IS is the immune senescence group of mDLBCL, IFNG-H is supposed to be the pDLBCL, and 9p24.1-H is high variation in gene copies at locus 9p24.1 on chromosome 9, for CHL. All of these lymphomas have confirmed EBV infection.
Now for the number of samples in each diagnosis.
dx <- series4[2,c(2:48)]
dx_t <- data.frame(t(dx))
colnames(dx_t) <- 'diagnosis'
table(dx_t$diagnosis)
##
## diagnosis: EBV+ CHL diagnosis: EBV+ mDLBCL diagnosis: EBV+ pDLBCL
## 19 20 8
There are 19 EBV+CHL cases, 20 EBV+mDLBCL, and 8 EBV+pDLBCL.
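The overlap between diagnosis and group discussed earlier can be counted directly with a two-way table. The sketch below uses a small invented data frame (the labels are real category names, but the rows are made up), so it only shows the technique, not the study's numbers.

```r
# Toy labels only: these rows are invented, not the GSE305165 samples.
toy <- data.frame(
  diagnosis = c("EBV+ CHL", "EBV+ CHL", "EBV+ mDLBCL", "EBV+ mDLBCL", "EBV+ pDLBCL"),
  group     = c("9p24.1-H", "IFNG-L", "IS", "9p24.1-H", "IFNG-H")
)

# Rows are diagnoses, columns are groups; cells off the expected pattern
# (for example mDLBCL in 9p24.1-H) are the overlap.
overlap <- table(toy$diagnosis, toy$group)
print(overlap)
```

With the objects already built in this post, `table(dx_t$diagnosis, group_t$group)` should give the same kind of table for the 47 samples, since both were taken from the same columns of series4 in the same order.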
Let's see the summary statistics for age.
age <- series4[3,c(2:48)]
age_t <- data.frame(t(age))
colnames(age_t) <- "Age"
age_t$Age <- gsub("age: ","",age_t$Age)
age_t$Age <- as.numeric(age_t$Age)
summary(age_t$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 50.00 62.50 74.00 72.11 79.50 94.00
The youngest patient is 50 years old and the oldest is 94. The median age of the 47 patients, the middle value when the ages are lined up from least to most, is 74, and the mean (average) age is about 72. About 25% of the patients are 62.5 or younger, about half are 74 or older, and about 25% are 79.5 or older.
Let's look at the gender balance of men to women in this study.
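To double check a reading of the quartiles, the cut points can be computed with `quantile()` and the share of patients on either side counted with `mean()` on a logical vector. The ages below are invented for illustration; substituting `age_t$Age` would apply the same check to the real data.

```r
# Made-up ages for illustration only.
ages <- c(50, 58, 63, 70, 74, 76, 79, 81, 88, 94)

# quantile() returns the same cut points that summary() reports.
q <- quantile(ages, probs = c(0.25, 0.50, 0.75))
print(q)

# Share of patients at or above the median; roughly half by definition.
share_at_or_above_median <- mean(ages >= q[["50%"]])
print(share_at_or_above_median)
```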
gender <- series4[4,c(2:48)]
gender_t <- data.frame(t(gender))
colnames(gender_t) <- 'gender'
gender_t$gender <- gsub("Sex: ", "" , gender_t$gender)
table(gender_t$gender)
##
## female male
## 12 35
The samples are mostly male, with 35 males and 12 females spread across the EBV+CHL, EBV+pDLBCL, and EBV+mDLBCL samples.
Let's make a sample GSM ID table as well.
ID <- series4[1,c(2:48)]
ID_t <- data.frame(t(ID))
colnames(ID_t) <- "sampleID"
ID_t
## sampleID
## Case02_lymphoma_FFPE GSM9163281
## Case03_lymphoma_FFPE GSM9163282
## Case06_lymphoma_FFPE GSM9163283
## Case08_lymphoma_FFPE GSM9163284
## Case09_lymphoma_FFPE GSM9163285
## Case10_lymphoma_FFPE GSM9163286
## Case11_lymphoma_FFPE GSM9163287
## Case12_lymphoma_FFPE GSM9163288
## Case13_lymphoma_FFPE GSM9163289
## Case14_lymphoma_FFPE GSM9163290
## Case15_lymphoma_FFPE GSM9163291
## Case16_lymphoma_FFPE GSM9163292
## Case17_lymphoma_FFPE GSM9163293
## Case19_lymphoma_FFPE GSM9163294
## Case20_lymphoma_FFPE GSM9163295
## Case21_lymphoma_FFPE GSM9163296
## Case22_lymphoma_FFPE GSM9163297
## Case23_lymphoma_FFPE GSM9163298
## Case24_lymphoma_FFPE GSM9163299
## Case25_lymphoma_FFPE GSM9163300
## Case26_lymphoma_FFPE GSM9163301
## Case27_lymphoma_FFPE GSM9163302
## Case29_lymphoma_FFPE GSM9163303
## Case30_lymphoma_FFPE GSM9163304
## Case31_lymphoma_FFPE GSM9163305
## Case32_lymphoma_FFPE GSM9163306
## Case34_lymphoma_FFPE GSM9163307
## Case35_lymphoma_FFPE GSM9163308
## Case36_lymphoma_FFPE GSM9163309
## Case37_lymphoma_FFPE GSM9163310
## Case38_lymphoma_FFPE GSM9163311
## Case39_lymphoma_FFPE GSM9163312
## Case40_lymphoma_FFPE GSM9163313
## Case41_lymphoma_FFPE GSM9163314
## Case42_lymphoma_FFPE GSM9163315
## Case43_lymphoma_FFPE GSM9163316
## Case44_lymphoma_FFPE GSM9163317
## Case45_lymphoma_FFPE GSM9163318
## Case46_lymphoma_FFPE GSM9163319
## Case49_lymphoma_FFPE GSM9163320
## Case50_lymphoma_FFPE GSM9163321
## Case51_lymphoma_FFPE GSM9163322
## Case52_lymphoma_FFPE GSM9163323
## Case53_lymphoma_FFPE GSM9163324
## Case55_lymphoma_FFPE GSM9163325
## Case56_lymphoma_FFPE GSM9163326
## Case57_lymphoma_FFPE GSM9163327
Let's combine these 5 characteristics into one table.
characteristics_df <- cbind(ID_t, dx_t, group_t, age_t, gender_t)
paged_table(characteristics_df)
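The same table can also be built in one pass by transposing the selected rows and stripping the "key: " prefixes with a single pattern. This is only a sketch on an invented two-sample data frame; the accession numbers, the column names, and the assumption that every prefix ends in a colon are mine.

```r
# Toy stand-in for series4: first column is the field name, the rest are samples.
# All values here are invented for illustration.
toy_series <- data.frame(
  field   = c("ID", "diagnosis", "age", "Sex"),
  sampleA = c("GSM0000001", "diagnosis: EBV+ CHL", "age: 61", "Sex: male"),
  sampleB = c("GSM0000002", "diagnosis: EBV+ mDLBCL", "age: 78", "Sex: female"),
  stringsAsFactors = FALSE
)

# Transpose so each sample becomes a row, then name the columns by field.
wide <- as.data.frame(t(toy_series[, -1]), stringsAsFactors = FALSE)
colnames(wide) <- c("sampleID", "diagnosis", "Age", "gender")

# Strip a leading "key: " prefix wherever one is present; values without a
# colon (the sample IDs) are left unchanged.
wide[] <- lapply(wide, function(x) sub("^[^:]+:[[:space:]]*", "", x))
wide$Age <- as.numeric(wide$Age)
print(wide)
```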
====================================================
*** Attempt 1
The next part depends on the type of data. The CEL files are Affymetrix GeneChip files, which are read into R through a Bioconductor library. The packages take about 20 minutes to download and install on a regular laptop PC over cell phone hotspot wifi. I got through to the end, but didn’t convert the files from gz or unzip them.
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(c("oligo", "affy"))
library(oligo) # For newer Affymetrix arrays
## Loading required package: BiocGenerics
## Loading required package: generics
##
## Attaching package: 'generics'
## The following objects are masked from 'package:base':
##
## as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
## setequal, union
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, saveRDS, table, tapply, unique,
## unsplit, which.max, which.min
## Loading required package: oligoClasses
## Welcome to oligoClasses version 1.72.0
## Loading required package: Biobase
## Warning: package 'Biobase' was built under R version 4.5.3
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: Biostrings
## Loading required package: S4Vectors
## Warning: package 'S4Vectors' was built under R version 4.5.3
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:grDevices':
##
## windows
## Loading required package: XVector
## Loading required package: Seqinfo
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
## ================================================================================
## Welcome to oligo version 1.74.0
## ================================================================================
library(affy) # For older 3' arrays; loading it after oligo masks oligo's rma()
##
## Attaching package: 'affy'
## The following objects are masked from 'package:oligo':
##
## intensity, MAplot, mm, mm<-, mmindex, pm, pm<-, pmindex,
## probeNames, rma
## The following object is masked from 'package:oligoClasses':
##
## list.celfiles
Unzip the files from the RAW download, then put them all in one folder. I named mine after the GSE ID.
#cel_path <- "path/to/your/CEL/files"
cel_path <- "...GSE305165" # path to your CEL files
setwd(cel_path)
cel_files <- list.celfiles(full.names = TRUE)
cel_files
## [1] "./GSM9163281_02_Clariom_S_Human_.CEL"
## [2] "./GSM9163282_03_Clariom_S_Human_.CEL"
## [3] "./GSM9163283_06_Clariom_S_Human_.CEL"
## [4] "./GSM9163284_08_Clariom_S_Human_.CEL"
## [5] "./GSM9163285_09_Clariom_S_Human_.CEL"
## [6] "./GSM9163286_10_Clariom_S_Human_.CEL"
## [7] "./GSM9163287_11_Clariom_S_Human_.CEL"
## [8] "./GSM9163288_12_Clariom_S_Human_.CEL"
## [9] "./GSM9163289_13_Clariom_S_Human_2.CEL"
## [10] "./GSM9163290_14_Clariom_S_Human_.CEL"
## [11] "./GSM9163291_15_Clariom_S_Human_.CEL"
## [12] "./GSM9163292_16_Clariom_S_Human_.CEL"
## [13] "./GSM9163293_17_Clariom_S_Human_.CEL"
## [14] "./GSM9163294_19_Clariom_S_Human_.CEL"
## [15] "./GSM9163295_20_Clariom_S_Human_.CEL"
## [16] "./GSM9163296_21_Clariom_S_Human_.CEL"
## [17] "./GSM9163297_22_Clariom_S_Human_.CEL"
## [18] "./GSM9163298_23_Clariom_S_Human_.CEL"
## [19] "./GSM9163299_24_Clariom_S_Human_.CEL"
## [20] "./GSM9163300_25_Clariom_S_Human_.CEL"
## [21] "./GSM9163301_26_Clariom_S_Human_.CEL"
## [22] "./GSM9163302_27_Clariom_S_Human_2.CEL"
## [23] "./GSM9163303_29_Clariom_S_Human_.CEL"
## [24] "./GSM9163304_30_Clariom_S_Human_.CEL"
## [25] "./GSM9163305_31_Clariom_S_Human_.CEL"
## [26] "./GSM9163306_32_Clariom_S_Human_.CEL"
## [27] "./GSM9163307_34_Clariom_S_Human_.CEL"
## [28] "./GSM9163308_35_Clariom_S_Human_.CEL"
## [29] "./GSM9163309_36_Clariom_S_Human_.CEL"
## [30] "./GSM9163310_37_Clariom_S_Human_2.CEL"
## [31] "./GSM9163311_38_Clariom_S_Human_2.CEL"
## [32] "./GSM9163312_39_Clariom_S_Human_.CEL"
## [33] "./GSM9163313_40_Clariom_S_Human_2.CEL"
## [34] "./GSM9163314_41_Clariom_S_Human_.CEL"
## [35] "./GSM9163315_42_Clariom_S_Human_.CEL"
## [36] "./GSM9163316_43_Clariom_S_Human_.CEL"
## [37] "./GSM9163317_44_Clariom_S_Human_.CEL"
## [38] "./GSM9163318_45_Clariom_S_Human_.CEL"
## [39] "./GSM9163319_46_Clariom_S_Human_.CEL"
## [40] "./GSM9163320_49_Clariom_S_Human_2.CEL"
## [41] "./GSM9163321_50_Clariom_S_Human_.CEL"
## [42] "./GSM9163322_51_Clariom_S_Human_.CEL"
## [43] "./GSM9163323_52_Clariom_S_Human_.CEL"
## [44] "./GSM9163324_53_Clariom_S_Human_.CEL"
## [45] "./GSM9163325_55_Clariom_S_Human_.CEL"
## [46] "./GSM9163326_56_Clariom_S_Human_.CEL"
## [47] "./GSM9163327_57_Clariom_S_Human_.CEL"
raw_data <- read.celfiles(cel_files)
Error: These do not exist: ./GSM9163281_02_Clariom_S_Human_.CEL ./GSM9163282_03_Clariom_S_Human_.CEL ./GSM9163283_06_Clariom_S_Human_.CEL ./GSM9163284_08_Clariom_S_Human_.CEL ./GSM9163285_09_Clariom_S_Human_.CEL ./GSM9163286_10_Clariom_S_Human_.CEL ./GSM9163287_11_Clariom_S_Human_.CEL ./GSM9163288_12_Clariom_S_Human_.CEL ./GSM9163289_13_Clariom_S_Human_2.CEL ./GSM9163290_14_Clariom_S_Human_.CEL ./GSM9163291_15_Clariom_S_Human_.CEL ./GSM9163292_16_Clariom_S_Human_.CEL ./GSM9163293_17_Clariom_S_Human_.CEL ./GSM9163294_19_Clariom_S_Human_.CEL ./GSM9163295_20_Clariom_S_Human_.CEL ./GSM9163296_21_Clariom_S_Human_.CEL ./GSM9163297_22_Clariom_S_Human_.CEL ./GSM9163298_23_Clariom_S_Human_.CEL ./GSM9163299_24_Clariom_S_Human_.CEL ./GSM9163300_25_Clariom_S_Human_.CEL ./GSM9163301_26_Clariom_S_Human_.CEL ./GSM9163302_27_Clariom_S_Human_2.CEL ./GSM9163303_29_Clariom_S_Human_.CEL ./GSM9163304_30_Clariom_S_Human_.CEL ./GSM9163305_31_Clariom_S_Human_.
The command didn’t work with any folder layout I tried: the RAW download with a separate unzipped folder per patient, one unzipped folder holding a folder per patient with a CEL file in each, or one folder holding only the actual CEL files with no per-patient folders.
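For the unzip step itself, the gz files can be decompressed from inside R without any extra package. The sketch below is base R only; `gunzip_base` is a hypothetical helper name, and the gzipped file it works on is a small placeholder created on the spot rather than a real CEL file.

```r
# Create a small gzipped placeholder so the sketch runs anywhere; a real
# download would already contain files such as *.CEL.gz.
demo_dir <- file.path(tempdir(), "gz_demo")
dir.create(demo_dir, showWarnings = FALSE)
gz_path <- file.path(demo_dir, "sample1.CEL.gz")
con <- gzfile(gz_path, "wb")
writeBin(charToRaw("placeholder bytes"), con)
close(con)

# Hypothetical helper: decompress one .gz file next to the original.
gunzip_base <- function(path) {
  out <- sub("\\.gz$", "", path)
  src <- gzfile(path, "rb")
  on.exit(close(src))
  dst <- file(out, "wb")
  # Copy in chunks until the source is exhausted.
  repeat {
    chunk <- readBin(src, what = "raw", n = 1e6)
    if (length(chunk) == 0) break
    writeBin(chunk, dst)
  }
  close(dst)
  out
}

gz_files <- list.files(demo_dir, pattern = "\\.gz$", full.names = TRUE)
unzipped <- vapply(gz_files, gunzip_base, character(1))
print(basename(unzipped))
```

Pointing `demo_dir` at the folder that holds the downloaded files would leave the decompressed files sitting in that same single folder.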
I will have to return to this to see how the CEL files can be read in with Bioconductor’s affy and oligo packages.
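One thing worth ruling out is the relative `./` paths. The files were listed successfully but then reported as missing, which is what happens when the working directory changes between the two steps. This is an assumption, not something the output above confirms; knitr, for instance, restores the working directory after each chunk. A minimal sketch of the alternative is to skip `setwd()` and build absolute paths. It uses base `list.files()` on invented placeholder files so it runs without Bioconductor.

```r
# Stand-in directory with empty placeholder files; real CEL files would live in
# your own folder. The file names below are invented for the demonstration.
cel_dir <- file.path(tempdir(), "cel_demo")
dir.create(cel_dir, showWarnings = FALSE)
file.create(file.path(cel_dir, c("sample1.CEL", "sample2.CEL", "notes.txt")))

# Build absolute paths instead of calling setwd(); these stay valid even if
# the working directory changes between steps.
cel_files <- list.files(cel_dir, pattern = "\\.CEL$", full.names = TRUE)
cel_files <- normalizePath(cel_files)

# Confirm every path resolves before handing it to a reader function.
stopifnot(all(file.exists(cel_files)))
print(basename(cel_files))
```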
raw_data
# Quick QC plot
boxplot(raw_data, main = "Raw CEL Data", las = 2)
norm_data <- rma(raw_data)
exprs(norm_data)[1:5, 1:5] # First 5 genes × first 5 samples
==================================================
*** Attempt 2
Keep checking back, and we will figure out how to open these CEL files and do our regular analysis and data science on these samples.
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("pd.clariom.s.human")
Let's try another video tutorial.
library(affy)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.1 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ lubridate::%within%() masks IRanges::%within%()
## ✖ dplyr::collapse() masks Biostrings::collapse(), IRanges::collapse()
## ✖ dplyr::combine() masks Biobase::combine(), BiocGenerics::combine()
## ✖ purrr::compact() masks XVector::compact()
## ✖ dplyr::desc() masks IRanges::desc()
## ✖ tidyr::expand() masks S4Vectors::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::first() masks S4Vectors::first()
## ✖ dplyr::lag() masks stats::lag()
## ✖ lubridate::pm() masks affy::pm(), oligo::pm()
## ✖ ggplot2::Position() masks BiocGenerics::Position(), base::Position()
## ✖ purrr::reduce() masks IRanges::reduce()
## ✖ dplyr::rename() masks S4Vectors::rename()
## ✖ lubridate::second() masks S4Vectors::second()
## ✖ lubridate::second<-() masks S4Vectors::second<-()
## ✖ dplyr::slice() masks XVector::slice(), IRanges::slice()
## ✖ dplyr::summarize() masks oligo::summarize()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#install.packages("GEOquery")
BiocManager::install("GEOquery")
library(GEOquery)
## Setting options('download.file.method.GEOquery'='auto')
## Setting options('GEOquery.inmemory.gpl'=FALSE)
getGEOSuppFiles("GSE305165")
## Using locally cached version of supplementary file(s) GSE305165 found here:
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar
## size
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar 46592000
## isdir
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar FALSE
## mode
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar 666
## mtime
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar 2026-04-09 16:26:11
## ctime
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar 2026-04-09 16:26:06
## atime
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar 2026-04-09 20:32:52
## exe
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar no
## uname
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar jlcor
## udomain
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar DATAMASSAGER1
## fname
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar GSE305165_RAW.tar
## destdir
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165
## filepath
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar
## GEO
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar GSE305165
untar("GSE305165/GSE305165_RAW.tar", exdir="data/")
raw_data <- ReadAffy(celfile.path = "data/")
str(raw_data)
## Formal class 'AffyBatch' [package "affy"] with 10 slots
## ..@ cdfName : chr "Clariom_S_Human"
## ..@ nrow : Named int 548
## .. ..- attr(*, "names")= chr "Rows"
## ..@ ncol : Named int 548
## .. ..- attr(*, "names")= chr "Cols"
## ..@ assayData :<environment: 0x000001db9ea3c430>
## ..@ phenoData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
## .. .. ..@ varMetadata :'data.frame': 1 obs. of 1 variable:
## .. .. .. ..$ labelDescription: chr "arbitrary numbering"
## .. .. ..@ data :'data.frame': 47 obs. of 1 variable:
## .. .. .. ..$ sample: int [1:47] 1 2 3 4 5 6 7 8 9 10 ...
## .. .. ..@ dimLabels : chr [1:2] "sampleNames" "sampleColumns"
## .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
## .. .. .. .. ..@ .Data:List of 1
## .. .. .. .. .. ..$ : int [1:3] 1 1 0
## .. .. .. .. ..$ names: chr "AnnotatedDataFrame"
## ..@ featureData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
## .. .. ..@ varMetadata :'data.frame': 0 obs. of 1 variable:
## .. .. .. ..$ labelDescription: chr(0)
## .. .. ..@ data :'data.frame': 300304 obs. of 0 variables
## .. .. ..@ dimLabels : chr [1:2] "featureNames" "featureColumns"
## .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
## .. .. .. .. ..@ .Data:List of 1
## .. .. .. .. .. ..$ : int [1:3] 1 1 0
## .. .. .. .. ..$ names: chr "AnnotatedDataFrame"
## ..@ experimentData :Formal class 'MIAME' [package "Biobase"] with 13 slots
## .. .. ..@ name : chr ""
## .. .. ..@ lab : chr ""
## .. .. ..@ contact : chr ""
## .. .. ..@ title : chr ""
## .. .. ..@ abstract : chr ""
## .. .. ..@ url : chr ""
## .. .. ..@ pubMedIds : chr ""
## .. .. ..@ samples : list()
## .. .. ..@ hybridizations : list()
## .. .. ..@ normControls : list()
## .. .. ..@ preprocessing :List of 2
## .. .. .. ..$ filenames : chr [1:47] "data/GSM9163281_02_Clariom_S_Human_.CEL.gz" "data/GSM9163282_03_Clariom_S_Human_.CEL.gz" "data/GSM9163283_06_Clariom_S_Human_.CEL.gz" "data/GSM9163284_08_Clariom_S_Human_.CEL.gz" ...
## .. .. .. ..$ affyversion: chr NA
## .. .. ..@ other :List of 1
## .. .. .. ..$ : chr ""
## .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
## .. .. .. .. ..@ .Data:List of 2
## .. .. .. .. .. ..$ : int [1:3] 1 0 0
## .. .. .. .. .. ..$ : int [1:3] 1 1 0
## .. .. .. .. ..$ names: chr [1:2] "MIAxE" "MIAME"
## ..@ annotation : chr "clariomshuman"
## ..@ protocolData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
## .. .. ..@ varMetadata :'data.frame': 1 obs. of 1 variable:
## .. .. .. ..$ labelDescription: chr NA
## .. .. ..@ data :'data.frame': 47 obs. of 1 variable:
## .. .. .. ..$ ScanDate: chr [1:47] "2022-10-06T05:15:22Z" "2022-10-06T05:10:06Z" "2022-10-06T04:16:44Z" "2022-10-06T04:06:10Z" ...
## .. .. ..@ dimLabels : chr [1:2] "sampleNames" "sampleColumns"
## .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
## .. .. .. .. ..@ .Data:List of 1
## .. .. .. .. .. ..$ : int [1:3] 1 1 0
## .. .. .. .. ..$ names: chr "AnnotatedDataFrame"
## ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
## .. .. ..@ .Data:List of 4
## .. .. .. ..$ : int [1:3] 4 5 2
## .. .. .. ..$ : int [1:3] 2 70 0
## .. .. .. ..$ : int [1:3] 1 3 0
## .. .. .. ..$ : int [1:3] 1 2 0
## .. .. ..$ names: chr [1:4] "R" "Biobase" "eSet" "AffyBatch"
raw_data
Warning: missing cdf environment! in show(AffyBatch)
AffyBatch object
size of arrays=548x548 features (36 kb)
cdf=Clariom_S_Human (??? affyids)
number of samples=47
Error in getCdfInfo(object) : Could not obtain CDF environment, problems encountered:
Specified environment does not contain Clariom_S_Human
Library - package clariomshumancdf not installed
Bioconductor - clariomshumancdf not available
7. stop(paste("Could not obtain CDF environment, problems encountered:", paste(unlist(badOut), collapse = ""), sep = ""))
6. getCdfInfo(object)
5. featureNames(object)
4. featureNames(object)
3. cat("number of genes=", length(featureNames(object)), "", sep = "")
2. (new("standardGeneric", .Data = function (object) standardGeneric("show"), generic = structure("show", package = "methods"), package = "methods", group = list(), valueClass = character(0), signature = structure("object", simpleOnly = TRUE), default = new("derivedDefaultMethod", …
1. (new("standardGeneric", .Data = function (object) standardGeneric("show"), generic = structure("show", package = "methods"), package = "methods", group = list(), valueClass = character(0), signature = structure("object", simpleOnly = TRUE), default = new("derivedDefaultMethod", …
After installing the missing packages, the install reported that clariomshumancdf was installed, but it wasn’t, because the error keeps popping up.
*** This is where the tutorial on YouTube fails with this version of R and Bioconductor.
But here are the description and commands for getting the matrix and combining probe IDs by gene name, to give normalized expression values per sample.
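A quick way to see whether an install really landed is to ask R whether the package can be loaded, rather than trusting the install message. In the sketch below, `is_installed` is a hypothetical helper name; it only wraps base R's `requireNamespace()`.

```r
# Hypothetical helper: TRUE for each package that can be loaded from the
# current library paths, FALSE otherwise.
is_installed <- function(pkg) {
  vapply(pkg, requireNamespace, logical(1), quietly = TRUE)
}

# "stats" ships with R, so it serves as a known-good comparison.
status <- is_installed(c("stats", "clariomshumancdf"))
print(status)
```

If the second entry prints FALSE after an install that claimed success, the package never made it into the library.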
normalized.data <- rma(raw_data)
Let me try closing and restarting. Maybe the install will work then. It did not. The error output above says clariomshumancdf is not available from Bioconductor, so this version cannot install the package that affy needs here.
normalized.data
normalized.expr <- exprs(normalized.data)
When it works, the video shows a matrix with probes as row names and samples as column headers.
normalized.expr <- as.data.frame(exprs(normalized.data))
The code above, from the video, turns the matrix into a data frame.
Map probe IDs to gene symbols.
gse <- getGEO("GSE305165", GSEMatrix = TRUE)
The above produces a large list.
Next, get the feature data and store it.
feature.data <- gse$GSE305165_series_matrix.txt.gz@featureData@data
The above produces the annotation for each probe ID, including alternate gene ID symbols and other information.
Subset it to keep only the probe IDs and gene symbols. These are columns 1 and 11.
feature.data <- feature.data[,c(1,11)]
Merge the two data sets. In the pipeline the dot stands for the normalized.expr data frame, passed in as x, which is joined to feature.data as y. Both have a column named ID holding the probe IDs, once the row names of normalized.expr have been moved into a new ID column.
normalized.expr <- normalized.expr %>%
  rownames_to_column(var = "ID") %>%
  inner_join(., feature.data, by = "ID")
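The join and the later step of combining probes by gene name can be tried out on a tiny invented table first. The sketch below swaps in base R's `merge()` and `aggregate()` for the dplyr calls so it runs without extra packages. The probe IDs, gene symbols, and values are made up, and averaging the probes of a gene is just one possible collapsing rule, not necessarily what the tutorial or the study used.

```r
# Invented probe-level values: two probes map to the same gene symbol.
expr <- data.frame(
  ID      = c("probe1", "probe2", "probe3"),
  sampleA = c(5.0, 7.0, 3.0),
  sampleB = c(6.0, 8.0, 4.0)
)
annot <- data.frame(
  ID     = c("probe1", "probe2", "probe3"),
  symbol = c("GENE1", "GENE1", "GENE2")
)

# Inner join on the shared ID column (base R counterpart of inner_join()).
merged <- merge(expr, annot, by = "ID")

# Collapse probes that share a symbol by averaging within each sample.
by_gene <- aggregate(cbind(sampleA, sampleB) ~ symbol, data = merged, FUN = mean)
print(by_gene)
```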
==========================================
*** Attempt 3
That was the end of the 4 year old video tutorial linked above on reading in CEL files. It stopped at the package incompatibility between Clariom_S_Human and the other dependent files, so I am trying an AI answer written specifically for Bioconductor version 3.22, which is built for R 4.5.
Got it — you want to read CEL files (Affymetrix microarray data) in Bioconductor 3.22 with R 4.5. Here’s a concise, up-to-date approach that works with the current Bioconductor ecosystem.
# Install BiocManager if not already installed
install.packages("BiocManager")
BiocManager::install(version = "3.22")
BiocManager::install(c("affy", "oligo"))
Option A – Using affy (classic Affymetrix arrays)
library(affy)
data <- ReadAffy(celfile.path = "data/")
summary(data)
## Length Class Mode
## 47 AffyBatch S4
Option B – Using oligo (newer arrays, e.g., Gene ST, Exon ST)
library(oligo)
This uses the folder from Attempt 2, named GSE305165, with each CEL file moved out of its individual folder into that one folder. It also uses the same GEOquery download step:
library(GEOquery)
getGEOSuppFiles("GSE305165")
## Using locally cached version of supplementary file(s) GSE305165 found here:
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar
## size
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar 46592000
## isdir
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar FALSE
## mode
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar 666
## mtime
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar 2026-04-09 16:26:11
## ctime
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar 2026-04-09 16:26:06
## atime
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar 2026-04-09 20:52:56
## exe
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar no
## uname
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar jlcor
## udomain
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar DATAMASSAGER1
## fname
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar GSE305165_RAW.tar
## destdir
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165
## filepath
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar
## GEO
## C:/Users/jlcor/Desktop/EBV classic Hodgkin Lymphoma and Diffuse large Bcell Lymphoma/GSE305165/GSE305165_RAW.tar GSE305165
untar("GSE305165/GSE305165_RAW.tar", exdir="data/")
cel_files <- list.celfiles("GSE305165/", full.names = TRUE)
data <- read.celfiles(cel_files)
## Loading required package: pd.clariom.s.human
## Loading required package: RSQLite
## Loading required package: DBI
## Platform design info loaded.
## Reading in : GSE305165/GSM9163281_02_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163282_03_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163283_06_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163284_08_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163285_09_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163286_10_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163287_11_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163288_12_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163289_13_Clariom_S_Human_2.CEL
## Reading in : GSE305165/GSM9163290_14_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163291_15_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163292_16_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163293_17_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163294_19_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163295_20_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163296_21_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163297_22_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163298_23_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163299_24_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163300_25_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163301_26_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163302_27_Clariom_S_Human_2.CEL
## Reading in : GSE305165/GSM9163303_29_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163304_30_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163305_31_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163306_32_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163307_34_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163308_35_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163309_36_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163310_37_Clariom_S_Human_2.CEL
## Reading in : GSE305165/GSM9163311_38_Clariom_S_Human_2.CEL
## Reading in : GSE305165/GSM9163312_39_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163313_40_Clariom_S_Human_2.CEL
## Reading in : GSE305165/GSM9163314_41_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163315_42_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163316_43_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163317_44_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163318_45_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163319_46_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163320_49_Clariom_S_Human_2.CEL
## Reading in : GSE305165/GSM9163321_50_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163322_51_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163323_52_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163324_53_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163325_55_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163326_56_Clariom_S_Human_.CEL
## Reading in : GSE305165/GSM9163327_57_Clariom_S_Human_.CEL
data
## ExpressionFeatureSet (storageMode: lockedEnvironment)
## assayData: 300304 features, 47 samples
## element names: exprs
## protocolData
## rowNames: GSM9163281_02_Clariom_S_Human_.CEL
## GSM9163282_03_Clariom_S_Human_.CEL ...
## GSM9163327_57_Clariom_S_Human_.CEL (47 total)
## varLabels: exprs dates
## varMetadata: labelDescription channel
## phenoData
## rowNames: GSM9163281_02_Clariom_S_Human_.CEL
## GSM9163282_03_Clariom_S_Human_.CEL ...
## GSM9163327_57_Clariom_S_Human_.CEL (47 total)
## varLabels: index
## varMetadata: labelDescription channel
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: pd.clariom.s.human
Option C – Using affyio for low-level access
library(affyio)
header <- read.celfile.header("path/to/file.CEL")
cel_data <- read.celfile("path/to/file.CEL")
Tips
Use affy for older 3’ expression arrays. Use oligo for newer whole-transcript arrays. Always check the array type with
affyio::read.celfile.header(cel_files[1])
before choosing the package.
That was the end of the Bioconductor AI help for version 3.22 with R version 4.5.
================
Let's bring in Attempt 2. The raw_data attempt failed because of a CDF version mismatch, so this attempt uses the data read in with the oligo library.
The rest runs in RStudio once the Bioconductor packages match my version of R, but knitr stops working right here.
normalized.data <- rma(data)
## Background correcting
## Normalizing
## Calculating Expression
normalized.data
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 27189 features, 47 samples
## element names: exprs
## protocolData
## rowNames: GSM9163281_02_Clariom_S_Human_.CEL
## GSM9163282_03_Clariom_S_Human_.CEL ...
## GSM9163327_57_Clariom_S_Human_.CEL (47 total)
## varLabels: exprs dates
## varMetadata: labelDescription channel
## phenoData
## rowNames: GSM9163281_02_Clariom_S_Human_.CEL
## GSM9163282_03_Clariom_S_Human_.CEL ...
## GSM9163327_57_Clariom_S_Human_.CEL (47 total)
## varLabels: index
## varMetadata: labelDescription channel
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: pd.clariom.s.human
normalized.expr <- exprs(normalized.data)
The row names are the Affymetrix probe IDs for this Clariom S Human array.
normalized.expr <- as.data.frame(exprs(normalized.data))
Write the table out and read it back in, so the rest of the code can run from the CSV file. knitr fails on the steps above, even though my RStudio setup with R and Bioconductor now runs them.
write.csv(normalized.expr,'normalized.expr1.csv', row.names=T)
normalized.expr <- read.csv('normalized.expr1.csv', header=T, row.names=1)
paged_table(normalized.expr[1:10,])
The above only shows the first 10 rows, but all 47 samples are there with long ID names per sample.
Map probe IDs into gene symbols.
gse <- getGEO("GSE305165", GSEMatrix = T)
## Found 1 file(s)
## GSE305165_series_matrix.txt.gz
## Using locally cached version: C:/GSE305165_series_matrix.txt.gz
## Using locally cached version of GPL23159 found here:
## C:/GPL23159.soft.gz
The above produces a large list.
gse
## $GSE305165_series_matrix.txt.gz
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 21448 features, 47 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM9163281 GSM9163282 ... GSM9163327 (47 total)
## varLabels: title geo_accession ... Sex:ch1 (33 total)
## varMetadata: labelDescription
## featureData
## featureNames: TC0100006437.hg.1 TC0100006476.hg.1 ... TSUnmapped00000823.hg.1 (21448 total)
## fvarLabels: ID probeset_id ... SPOT_ID.1 (10 total)
## fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
## pubMedIds: 41371409
## Annotation: GPL23159
The gse Large list in Rstudio.
Next, get the feature data and store it.
feature.data <- gse$GSE305165_series_matrix.txt.gz@featureData@data
The above gives a table that pairs each probe ID with alternate gene identifiers and symbols, plus other annotation.
paged_table(feature.data[1:10,])
Subset to keep only the probe IDs and the gene annotation. These are columns 1 and 10, but column 10 (SPOT_ID.1) packs several kinds of identifiers into one field.
feature.data1 <- feature.data[,c(1,10)]
paged_table(feature.data1[1:20,])
Add an ID column to the normalized.expr data frame of samples by probe ID.
normalized.expr$ID <- row.names(normalized.expr)
paged_table(normalized.expr[1:10,45:48])
The above table shows only first 10 rows and last 4 columns with the ID column added to it.
str(normalized.expr)
## 'data.frame': 27189 obs. of 48 variables:
## $ GSM9163281_02_Clariom_S_Human_.CEL : num 5.31 5.68 4.97 3.97 3.53 ...
## $ GSM9163282_03_Clariom_S_Human_.CEL : num 5.49 4.69 5.15 4.57 2.77 ...
## $ GSM9163283_06_Clariom_S_Human_.CEL : num 4.75 5.03 5.03 3.7 3.66 ...
## $ GSM9163284_08_Clariom_S_Human_.CEL : num 5.68 5.65 4.29 3.57 3.35 ...
## $ GSM9163285_09_Clariom_S_Human_.CEL : num 4.82 5.68 4.92 3.86 4.89 ...
## $ GSM9163286_10_Clariom_S_Human_.CEL : num 5.52 5.59 4.1 3.6 3.8 ...
## $ GSM9163287_11_Clariom_S_Human_.CEL : num 5.45 5.56 4.91 3.97 3.39 ...
## $ GSM9163288_12_Clariom_S_Human_.CEL : num 5.77 5.88 3.76 3.98 3.33 ...
## $ GSM9163289_13_Clariom_S_Human_2.CEL: num 6.18 5.37 4.53 4.24 3.62 ...
## $ GSM9163290_14_Clariom_S_Human_.CEL : num 5.59 5.7 4.19 4.45 3.64 ...
## $ GSM9163291_15_Clariom_S_Human_.CEL : num 5.51 5.72 4.87 3.78 3.45 ...
## $ GSM9163292_16_Clariom_S_Human_.CEL : num 5.66 5.58 4.91 4.59 3.11 ...
## $ GSM9163293_17_Clariom_S_Human_.CEL : num 5.02 5.12 3.33 3.91 3.39 ...
## $ GSM9163294_19_Clariom_S_Human_.CEL : num 5.2 5.75 4.15 4.09 3.39 ...
## $ GSM9163295_20_Clariom_S_Human_.CEL : num 5.32 5.23 4.47 3.88 3.49 ...
## $ GSM9163296_21_Clariom_S_Human_.CEL : num 5.8 5.97 4.6 3.86 3.7 ...
## $ GSM9163297_22_Clariom_S_Human_.CEL : num 5.67 5.07 4.27 4.07 2.79 ...
## $ GSM9163298_23_Clariom_S_Human_.CEL : num 6.01 5.89 3.89 4.14 3.54 ...
## $ GSM9163299_24_Clariom_S_Human_.CEL : num 6.05 5.81 4.51 4.2 3.16 ...
## $ GSM9163300_25_Clariom_S_Human_.CEL : num 5.51 5.73 4.13 4.18 3.73 ...
## $ GSM9163301_26_Clariom_S_Human_.CEL : num 6.31 5.9 4.62 4.2 3.18 ...
## $ GSM9163302_27_Clariom_S_Human_2.CEL: num 5.9 5.39 4.52 3.58 4.07 ...
## $ GSM9163303_29_Clariom_S_Human_.CEL : num 5.39 5.14 4.58 4.6 3.69 ...
## $ GSM9163304_30_Clariom_S_Human_.CEL : num 5.81 6.15 4.56 4.19 3.71 ...
## $ GSM9163305_31_Clariom_S_Human_.CEL : num 5.67 5.64 5.28 4.02 3.57 ...
## $ GSM9163306_32_Clariom_S_Human_.CEL : num 5.92 6.06 5.41 4.37 3.46 ...
## $ GSM9163307_34_Clariom_S_Human_.CEL : num 5.68 6.04 4.29 4.41 3.4 ...
## $ GSM9163308_35_Clariom_S_Human_.CEL : num 5.79 6.39 4.79 4.35 3.77 ...
## $ GSM9163309_36_Clariom_S_Human_.CEL : num 5.14 5.21 3.79 4.09 3.96 ...
## $ GSM9163310_37_Clariom_S_Human_2.CEL: num 6.19 5.26 3.8 4.27 3.49 ...
## $ GSM9163311_38_Clariom_S_Human_2.CEL: num 5.75 5.85 3.84 3.86 3.39 ...
## $ GSM9163312_39_Clariom_S_Human_.CEL : num 6.33 5.98 4.89 3.95 3.7 ...
## $ GSM9163313_40_Clariom_S_Human_2.CEL: num 5.42 5.84 4.49 4.47 3.35 ...
## $ GSM9163314_41_Clariom_S_Human_.CEL : num 6.33 5.88 4.85 4.35 3.17 ...
## $ GSM9163315_42_Clariom_S_Human_.CEL : num 5.66 6.07 4.89 4.01 3.4 ...
## $ GSM9163316_43_Clariom_S_Human_.CEL : num 6.12 6.08 4.44 3.98 3.24 ...
## $ GSM9163317_44_Clariom_S_Human_.CEL : num 5.96 6.39 5.34 4.23 3.86 ...
## $ GSM9163318_45_Clariom_S_Human_.CEL : num 4.72 6.16 4.92 4.3 4.12 ...
## $ GSM9163319_46_Clariom_S_Human_.CEL : num 5.67 5.72 4.81 4.82 3.25 ...
## $ GSM9163320_49_Clariom_S_Human_2.CEL: num 6.25 5.98 4.73 3.83 3.47 ...
## $ GSM9163321_50_Clariom_S_Human_.CEL : num 5.95 5.9 4.69 4.1 3.35 ...
## $ GSM9163322_51_Clariom_S_Human_.CEL : num 6.25 5.73 4.18 4.26 3.17 ...
## $ GSM9163323_52_Clariom_S_Human_.CEL : num 5.71 6.36 4.4 4.1 3.69 ...
## $ GSM9163324_53_Clariom_S_Human_.CEL : num 6.16 5.56 4.9 4.23 3.12 ...
## $ GSM9163325_55_Clariom_S_Human_.CEL : num 6.85 5.68 4.01 3.83 3.5 ...
## $ GSM9163326_56_Clariom_S_Human_.CEL : num 5.68 5.79 4.03 4.39 3.28 ...
## $ GSM9163327_57_Clariom_S_Human_.CEL : num 6.29 5.94 4.5 4.19 3.38 ...
## $ ID : chr "23064070" "23064071" "23064072" "23064073" ...
normalized.expr <- inner_join(normalized.expr,feature.data1, by="ID")
str(normalized.expr)
## 'data.frame': 27189 obs. of 48 variables:
## $ GSM9163281_02_Clariom_S_Human_.CEL : num 5.31 5.68 4.97 3.97 3.53 ...
## $ GSM9163282_03_Clariom_S_Human_.CEL : num 5.49 4.69 5.15 4.57 2.77 ...
## $ GSM9163283_06_Clariom_S_Human_.CEL : num 4.75 5.03 5.03 3.7 3.66 ...
## $ GSM9163284_08_Clariom_S_Human_.CEL : num 5.68 5.65 4.29 3.57 3.35 ...
## $ GSM9163285_09_Clariom_S_Human_.CEL : num 4.82 5.68 4.92 3.86 4.89 ...
## $ GSM9163286_10_Clariom_S_Human_.CEL : num 5.52 5.59 4.1 3.6 3.8 ...
## $ GSM9163287_11_Clariom_S_Human_.CEL : num 5.45 5.56 4.91 3.97 3.39 ...
## $ GSM9163288_12_Clariom_S_Human_.CEL : num 5.77 5.88 3.76 3.98 3.33 ...
## $ GSM9163289_13_Clariom_S_Human_2.CEL: num 6.18 5.37 4.53 4.24 3.62 ...
## $ GSM9163290_14_Clariom_S_Human_.CEL : num 5.59 5.7 4.19 4.45 3.64 ...
## $ GSM9163291_15_Clariom_S_Human_.CEL : num 5.51 5.72 4.87 3.78 3.45 ...
## $ GSM9163292_16_Clariom_S_Human_.CEL : num 5.66 5.58 4.91 4.59 3.11 ...
## $ GSM9163293_17_Clariom_S_Human_.CEL : num 5.02 5.12 3.33 3.91 3.39 ...
## $ GSM9163294_19_Clariom_S_Human_.CEL : num 5.2 5.75 4.15 4.09 3.39 ...
## $ GSM9163295_20_Clariom_S_Human_.CEL : num 5.32 5.23 4.47 3.88 3.49 ...
## $ GSM9163296_21_Clariom_S_Human_.CEL : num 5.8 5.97 4.6 3.86 3.7 ...
## $ GSM9163297_22_Clariom_S_Human_.CEL : num 5.67 5.07 4.27 4.07 2.79 ...
## $ GSM9163298_23_Clariom_S_Human_.CEL : num 6.01 5.89 3.89 4.14 3.54 ...
## $ GSM9163299_24_Clariom_S_Human_.CEL : num 6.05 5.81 4.51 4.2 3.16 ...
## $ GSM9163300_25_Clariom_S_Human_.CEL : num 5.51 5.73 4.13 4.18 3.73 ...
## $ GSM9163301_26_Clariom_S_Human_.CEL : num 6.31 5.9 4.62 4.2 3.18 ...
## $ GSM9163302_27_Clariom_S_Human_2.CEL: num 5.9 5.39 4.52 3.58 4.07 ...
## $ GSM9163303_29_Clariom_S_Human_.CEL : num 5.39 5.14 4.58 4.6 3.69 ...
## $ GSM9163304_30_Clariom_S_Human_.CEL : num 5.81 6.15 4.56 4.19 3.71 ...
## $ GSM9163305_31_Clariom_S_Human_.CEL : num 5.67 5.64 5.28 4.02 3.57 ...
## $ GSM9163306_32_Clariom_S_Human_.CEL : num 5.92 6.06 5.41 4.37 3.46 ...
## $ GSM9163307_34_Clariom_S_Human_.CEL : num 5.68 6.04 4.29 4.41 3.4 ...
## $ GSM9163308_35_Clariom_S_Human_.CEL : num 5.79 6.39 4.79 4.35 3.77 ...
## $ GSM9163309_36_Clariom_S_Human_.CEL : num 5.14 5.21 3.79 4.09 3.96 ...
## $ GSM9163310_37_Clariom_S_Human_2.CEL: num 6.19 5.26 3.8 4.27 3.49 ...
## $ GSM9163311_38_Clariom_S_Human_2.CEL: num 5.75 5.85 3.84 3.86 3.39 ...
## $ GSM9163312_39_Clariom_S_Human_.CEL : num 6.33 5.98 4.89 3.95 3.7 ...
## $ GSM9163313_40_Clariom_S_Human_2.CEL: num 5.42 5.84 4.49 4.47 3.35 ...
## $ GSM9163314_41_Clariom_S_Human_.CEL : num 6.33 5.88 4.85 4.35 3.17 ...
## $ GSM9163315_42_Clariom_S_Human_.CEL : num 5.66 6.07 4.89 4.01 3.4 ...
## $ GSM9163316_43_Clariom_S_Human_.CEL : num 6.12 6.08 4.44 3.98 3.24 ...
## $ GSM9163317_44_Clariom_S_Human_.CEL : num 5.96 6.39 5.34 4.23 3.86 ...
## $ GSM9163318_45_Clariom_S_Human_.CEL : num 4.72 6.16 4.92 4.3 4.12 ...
## $ GSM9163319_46_Clariom_S_Human_.CEL : num 5.67 5.72 4.81 4.82 3.25 ...
## $ GSM9163320_49_Clariom_S_Human_2.CEL: num 6.25 5.98 4.73 3.83 3.47 ...
## $ GSM9163321_50_Clariom_S_Human_.CEL : num 5.95 5.9 4.69 4.1 3.35 ...
## $ GSM9163322_51_Clariom_S_Human_.CEL : num 6.25 5.73 4.18 4.26 3.17 ...
## $ GSM9163323_52_Clariom_S_Human_.CEL : num 5.71 6.36 4.4 4.1 3.69 ...
## $ GSM9163324_53_Clariom_S_Human_.CEL : num 6.16 5.56 4.9 4.23 3.12 ...
## $ GSM9163325_55_Clariom_S_Human_.CEL : num 6.85 5.68 4.01 3.83 3.5 ...
## $ GSM9163326_56_Clariom_S_Human_.CEL : num 5.68 5.79 4.03 4.39 3.28 ...
## $ GSM9163327_57_Clariom_S_Human_.CEL : num 6.29 5.94 4.5 4.19 3.38 ...
## $ ID : chr "23064070" "23064071" "23064072" "23064073" ...
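Before trusting a join like the one above, it may help to check how many keys the two tables actually share. The sketch below is only an illustration in base R: `expr_toy` and `feat_toy` are hypothetical stand-ins for the real tables, with ID values copied from the outputs shown earlier and made-up expression and annotation values.

```r
# Toy stand-ins for the expression table and the feature table (made-up values)
expr_toy <- data.frame(ID = c("23064070", "23064071", "23064072"),
                       s1 = c(5.31, 5.68, 4.97),
                       stringsAsFactors = FALSE)
feat_toy <- data.frame(ID = c("TC0100006437.hg.1", "TC0100006476.hg.1"),
                       SPOT_ID.1 = c("gene A", "gene B"),
                       stringsAsFactors = FALSE)

# Share of expression IDs that also appear in the feature table
shared  <- expr_toy$ID %in% feat_toy$ID
overlap <- mean(shared)

# An inner join keeps only matching keys, so zero overlap means zero rows
joined   <- merge(expr_toy, feat_toy, by = "ID")
n_joined <- nrow(joined)
```

If `overlap` is 0, the two ID columns use different naming schemes and need a crosswalk column before any join can work.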
The SPOT_ID.1 column of the feature data holds the gene symbol along with the full gene name and alternate names. Check whether it made it into normalized.expr.
normalized.expr$SPOT_ID.1[1]
## NULL
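The NULL means normalized.expr has no SPOT_ID.1 column, and the str output above still lists 48 variables, so the join did not take effect in this run. It is worth checking that the two ID columns share values before joining: the expression IDs shown above look like "23064070", while the GEO feature names look like "TC0100006437.hg.1". Once the column is available, it should be split on each ‘//’ into separate columns. We will see about separating it later, as I am currently unable to use the split function to separate it as a list. I will ask AI later about extracting only the gene name in parentheses.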
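As a rough sketch of the extraction described above, the base R code below pulls the first parenthesised token that looks like a gene symbol and splits one string on '//'. The strings in `toy` are made up to resemble a '//'-separated annotation field and are an assumption about the layout, not values taken from SPOT_ID.1; `extract_symbol` is a hypothetical helper name.

```r
# Return the first parenthesised token that looks like a gene symbol
# (letters, digits, dots, hyphens only), or NA when none is found
extract_symbol <- function(x) {
  hits <- regmatches(x, regexec("\\(([A-Za-z0-9.-]+)\\)", x))
  vapply(hits,
         function(h) if (length(h) >= 2) h[2] else NA_character_,
         character(1), USE.NAMES = FALSE)
}

# Made-up annotation strings in a '//'-separated layout
toy <- c("NM_000001 // RefSeq // Homo sapiens example gene one (GENE1), mRNA. // chr1",
         "NM_000002 // RefSeq // example family 7 (light chain, L system), member 5 (GENE2), mRNA.",
         "---")
symbols <- extract_symbol(toy)

# Splitting on '//' gives the separate fields of the first string
fields <- trimws(strsplit(toy[1], "//", fixed = TRUE)[[1]])
```

The pattern skips parentheses that contain spaces or commas, so a descriptive aside inside the gene name is not mistaken for the symbol.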
Now let's see how our samples line up, in order, with the series_txt information we made earlier.
compare <- rbind(colnames(normalized.expr)[1:47],series4[2:48])
paged_table(compare)
The sample IDs and the GSM IDs match and are in a consistent order, so we can rename the columns with shorter names by class. All samples have a confirmed EBV diagnosis; some are CHL, some pDLBCL, and some mDLBCL. We can extend the grouping later as well.
CHL <- grep('CHL',compare[3,])
mDLBCL <- grep('mDLBCL', compare[3,])
pDLBCL <- grep('pDLBCL',compare[3,])
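Assigning the same name to several columns this way leaves the names duplicated. R does not add a dot and a number automatically here, and the str output further down shows the repeated names. Selecting columns by the index vectors CHL, mDLBCL, and pDLBCL still works, and make.unique() can add numbered suffixes if unique names are needed.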
colnames(compare)[CHL] <- 'CHL'
colnames(compare)[mDLBCL] <- 'mDLBCL'
colnames(compare)[pDLBCL] <- 'pDLBCL'
paged_table(compare)
newNames <- colnames(compare)
colnames(normalized.expr)[1:47] <- newNames
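The renaming step can be tried on a small example first. The sketch below assumes six hypothetical diagnosis labels and shows both the index vectors from grep() and what make.unique() does to repeated names; none of these values come from the real sample table.

```r
# Hypothetical diagnosis labels for six samples
diagnosis <- c("pDLBCL", "CHL", "pDLBCL", "mDLBCL", "CHL", "mDLBCL")

# Column positions of each class, as with grep() on the diagnosis row
chl_idx <- grep("CHL", diagnosis)
m_idx   <- grep("mDLBCL", diagnosis)
p_idx   <- grep("pDLBCL", diagnosis)

# Plain assignment keeps the repeated labels as they are
plain_names <- diagnosis

# make.unique() appends .1, .2, ... to the second and later repeats
unique_names <- make.unique(diagnosis)
```

Keeping the index vectors around is what makes the duplicated names harmless, since every later selection goes by position rather than by name.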
Let's add the row means for each gene by type of lymphoma.
normalized.expr$CHL_mean <- rowMeans(normalized.expr[,CHL])
normalized.expr$pDLBCL_mean <- rowMeans(normalized.expr[,pDLBCL])
normalized.expr$mDLBCL_mean <- rowMeans(normalized.expr[,mDLBCL])
str(normalized.expr)
## 'data.frame': 27189 obs. of 51 variables:
## $ pDLBCL : num 5.31 5.68 4.97 3.97 3.53 ...
## $ CHL : num 5.49 4.69 5.15 4.57 2.77 ...
## $ pDLBCL : num 4.75 5.03 5.03 3.7 3.66 ...
## $ mDLBCL : num 5.68 5.65 4.29 3.57 3.35 ...
## $ CHL : num 4.82 5.68 4.92 3.86 4.89 ...
## $ mDLBCL : num 5.52 5.59 4.1 3.6 3.8 ...
## $ pDLBCL : num 5.45 5.56 4.91 3.97 3.39 ...
## $ CHL : num 5.77 5.88 3.76 3.98 3.33 ...
## $ CHL : num 6.18 5.37 4.53 4.24 3.62 ...
## $ CHL : num 5.59 5.7 4.19 4.45 3.64 ...
## $ CHL : num 5.51 5.72 4.87 3.78 3.45 ...
## $ mDLBCL : num 5.66 5.58 4.91 4.59 3.11 ...
## $ mDLBCL : num 5.02 5.12 3.33 3.91 3.39 ...
## $ CHL : num 5.2 5.75 4.15 4.09 3.39 ...
## $ CHL : num 5.32 5.23 4.47 3.88 3.49 ...
## $ pDLBCL : num 5.8 5.97 4.6 3.86 3.7 ...
## $ CHL : num 5.67 5.07 4.27 4.07 2.79 ...
## $ mDLBCL : num 6.01 5.89 3.89 4.14 3.54 ...
## $ mDLBCL : num 6.05 5.81 4.51 4.2 3.16 ...
## $ mDLBCL : num 5.51 5.73 4.13 4.18 3.73 ...
## $ mDLBCL : num 6.31 5.9 4.62 4.2 3.18 ...
## $ CHL : num 5.9 5.39 4.52 3.58 4.07 ...
## $ mDLBCL : num 5.39 5.14 4.58 4.6 3.69 ...
## $ CHL : num 5.81 6.15 4.56 4.19 3.71 ...
## $ CHL : num 5.67 5.64 5.28 4.02 3.57 ...
## $ CHL : num 5.92 6.06 5.41 4.37 3.46 ...
## $ CHL : num 5.68 6.04 4.29 4.41 3.4 ...
## $ mDLBCL : num 5.79 6.39 4.79 4.35 3.77 ...
## $ mDLBCL : num 5.14 5.21 3.79 4.09 3.96 ...
## $ mDLBCL : num 6.19 5.26 3.8 4.27 3.49 ...
## $ mDLBCL : num 5.75 5.85 3.84 3.86 3.39 ...
## $ mDLBCL : num 6.33 5.98 4.89 3.95 3.7 ...
## $ mDLBCL : num 5.42 5.84 4.49 4.47 3.35 ...
## $ pDLBCL : num 6.33 5.88 4.85 4.35 3.17 ...
## $ pDLBCL : num 5.66 6.07 4.89 4.01 3.4 ...
## $ mDLBCL : num 6.12 6.08 4.44 3.98 3.24 ...
## $ mDLBCL : num 5.96 6.39 5.34 4.23 3.86 ...
## $ CHL : num 4.72 6.16 4.92 4.3 4.12 ...
## $ pDLBCL : num 5.67 5.72 4.81 4.82 3.25 ...
## $ pDLBCL : num 6.25 5.98 4.73 3.83 3.47 ...
## $ CHL : num 5.95 5.9 4.69 4.1 3.35 ...
## $ CHL : num 6.25 5.73 4.18 4.26 3.17 ...
## $ CHL : num 5.71 6.36 4.4 4.1 3.69 ...
## $ mDLBCL : num 6.16 5.56 4.9 4.23 3.12 ...
## $ mDLBCL : num 6.85 5.68 4.01 3.83 3.5 ...
## $ mDLBCL : num 5.68 5.79 4.03 4.39 3.28 ...
## $ CHL : num 6.29 5.94 4.5 4.19 3.38 ...
## $ ID : chr "23064070" "23064071" "23064072" "23064073" ...
## $ CHL_mean : num 5.65 5.71 4.58 4.13 3.54 ...
## $ pDLBCL_mean: num 5.65 5.74 4.85 4.06 3.45 ...
## $ mDLBCL_mean: num 5.83 5.72 4.33 4.13 3.48 ...
summary(normalized.expr)
## pDLBCL CHL pDLBCL mDLBCL
## Min. : 1.104 Min. : 1.075 Min. : 1.037 Min. : 1.056
## 1st Qu.: 3.237 1st Qu.: 3.305 1st Qu.: 3.230 1st Qu.: 3.209
## Median : 4.383 Median : 4.326 Median : 4.376 Median : 4.377
## Mean : 4.467 Mean : 4.421 Mean : 4.457 Mean : 4.462
## 3rd Qu.: 5.473 3rd Qu.: 5.362 3rd Qu.: 5.474 3rd Qu.: 5.470
## Max. :13.469 Max. :13.482 Max. :13.469 Max. :13.482
## CHL mDLBCL pDLBCL CHL
## Min. : 1.115 Min. : 0.979 Min. : 0.979 Min. : 1.071
## 1st Qu.: 3.289 1st Qu.: 3.193 1st Qu.: 3.221 1st Qu.: 3.195
## Median : 4.354 Median : 4.382 Median : 4.377 Median : 4.380
## Mean : 4.433 Mean : 4.465 Mean : 4.458 Mean : 4.474
## 3rd Qu.: 5.431 3rd Qu.: 5.474 3rd Qu.: 5.482 3rd Qu.: 5.502
## Max. :13.454 Max. :13.497 Max. :13.469 Max. :13.495
## CHL CHL CHL mDLBCL
## Min. : 1.009 Min. : 1.035 Min. : 1.158 Min. : 1.097
## 1st Qu.: 3.218 1st Qu.: 3.268 1st Qu.: 3.293 1st Qu.: 3.222
## Median : 4.410 Median : 4.416 Median : 4.412 Median : 4.376
## Mean : 4.503 Mean : 4.504 Mean : 4.488 Mean : 4.498
## 3rd Qu.: 5.537 3rd Qu.: 5.492 3rd Qu.: 5.457 3rd Qu.: 5.517
## Max. :13.219 Max. :13.495 Max. :13.497 Max. :13.482
## mDLBCL CHL CHL pDLBCL
## Min. : 1.131 Min. : 1.015 Min. : 0.9349 Min. : 0.9517
## 1st Qu.: 3.322 1st Qu.: 3.271 1st Qu.: 3.2113 1st Qu.: 3.2006
## Median : 4.418 Median : 4.414 Median : 4.3820 Median : 4.3765
## Mean : 4.495 Mean : 4.499 Mean : 4.4727 Mean : 4.4734
## 3rd Qu.: 5.474 3rd Qu.: 5.485 3rd Qu.: 5.4897 3rd Qu.: 5.4796
## Max. :13.455 Max. :13.495 Max. :13.4823 Max. :13.4973
## CHL mDLBCL mDLBCL mDLBCL
## Min. : 1.083 Min. : 1.011 Min. : 1.072 Min. : 1.035
## 1st Qu.: 3.256 1st Qu.: 3.194 1st Qu.: 3.189 1st Qu.: 3.194
## Median : 4.392 Median : 4.382 Median : 4.400 Median : 4.369
## Mean : 4.480 Mean : 4.488 Mean : 4.497 Mean : 4.480
## 3rd Qu.: 5.483 3rd Qu.: 5.512 3rd Qu.: 5.528 3rd Qu.: 5.478
## Max. :13.469 Max. :13.497 Max. :13.497 Max. :13.124
## mDLBCL CHL mDLBCL CHL
## Min. : 1.075 Min. : 1.085 Min. : 1.134 Min. : 0.9819
## 1st Qu.: 3.222 1st Qu.: 3.211 1st Qu.: 3.248 1st Qu.: 3.1659
## Median : 4.388 Median : 4.394 Median : 4.395 Median : 4.3858
## Mean : 4.485 Mean : 4.472 Mean : 4.495 Mean : 4.4632
## 3rd Qu.: 5.511 3rd Qu.: 5.502 3rd Qu.: 5.494 3rd Qu.: 5.5239
## Max. :13.469 Max. :13.190 Max. :13.497 Max. :13.4823
## CHL CHL CHL mDLBCL
## Min. : 1.005 Min. : 1.117 Min. : 1.081 Min. : 1.072
## 1st Qu.: 3.205 1st Qu.: 3.153 1st Qu.: 3.222 1st Qu.: 3.188
## Median : 4.388 Median : 4.374 Median : 4.396 Median : 4.390
## Mean : 4.482 Mean : 4.486 Mean : 4.484 Mean : 4.488
## 3rd Qu.: 5.521 3rd Qu.: 5.542 3rd Qu.: 5.500 3rd Qu.: 5.538
## Max. :13.469 Max. :13.438 Max. :13.469 Max. :13.469
## mDLBCL mDLBCL mDLBCL mDLBCL
## Min. : 1.100 Min. : 1.060 Min. : 1.065 Min. : 1.119
## 1st Qu.: 3.305 1st Qu.: 3.202 1st Qu.: 3.211 1st Qu.: 3.180
## Median : 4.408 Median : 4.387 Median : 4.405 Median : 4.376
## Mean : 4.496 Mean : 4.492 Mean : 4.496 Mean : 4.501
## 3rd Qu.: 5.479 3rd Qu.: 5.522 3rd Qu.: 5.522 3rd Qu.: 5.532
## Max. :13.482 Max. :13.364 Max. :13.167 Max. :13.482
## mDLBCL pDLBCL pDLBCL mDLBCL
## Min. : 1.107 Min. : 1.151 Min. : 1.091 Min. : 1.067
## 1st Qu.: 3.227 1st Qu.: 3.181 1st Qu.: 3.196 1st Qu.: 3.182
## Median : 4.414 Median : 4.396 Median : 4.390 Median : 4.374
## Mean : 4.493 Mean : 4.496 Mean : 4.487 Mean : 4.502
## 3rd Qu.: 5.523 3rd Qu.: 5.528 3rd Qu.: 5.525 3rd Qu.: 5.540
## Max. :13.497 Max. :13.469 Max. :13.497 Max. :13.482
## mDLBCL CHL pDLBCL pDLBCL
## Min. : 1.035 Min. : 1.021 Min. : 1.052 Min. : 1.111
## 1st Qu.: 3.191 1st Qu.: 3.203 1st Qu.: 3.165 1st Qu.: 3.131
## Median : 4.371 Median : 4.376 Median : 4.372 Median : 4.334
## Mean : 4.483 Mean : 4.440 Mean : 4.490 Mean : 4.489
## 3rd Qu.: 5.515 3rd Qu.: 5.508 3rd Qu.: 5.539 3rd Qu.: 5.532
## Max. :13.497 Max. :13.469 Max. :13.497 Max. :13.469
## CHL CHL CHL mDLBCL
## Min. : 1.099 Min. : 1.011 Min. : 1.046 Min. : 1.036
## 1st Qu.: 3.180 1st Qu.: 3.168 1st Qu.: 3.219 1st Qu.: 3.121
## Median : 4.381 Median : 4.370 Median : 4.397 Median : 4.333
## Mean : 4.484 Mean : 4.492 Mean : 4.488 Mean : 4.479
## 3rd Qu.: 5.500 3rd Qu.: 5.531 3rd Qu.: 5.515 3rd Qu.: 5.531
## Max. :13.482 Max. :13.482 Max. :13.495 Max. :13.482
## mDLBCL mDLBCL CHL ID
## Min. : 1.101 Min. : 1.011 Min. : 1.085 Length:27189
## 1st Qu.: 3.158 1st Qu.: 3.247 1st Qu.: 3.163 Class :character
## Median : 4.364 Median : 4.410 Median : 4.357 Mode :character
## Mean : 4.494 Mean : 4.499 Mean : 4.496
## 3rd Qu.: 5.525 3rd Qu.: 5.523 3rd Qu.: 5.526
## Max. :13.497 Max. :13.497 Max. :13.497
## CHL_mean pDLBCL_mean mDLBCL_mean
## Min. : 1.344 Min. : 1.342 Min. : 1.321
## 1st Qu.: 3.248 1st Qu.: 3.218 1st Qu.: 3.232
## Median : 4.396 Median : 4.383 Median : 4.389
## Mean : 4.477 Mean : 4.477 Mean : 4.489
## 3rd Qu.: 5.462 3rd Qu.: 5.481 3rd Qu.: 5.488
## Max. :13.406 Max. :13.474 Max. :13.379
paged_table(normalized.expr[c(1:10),c(45:51)])
There is no baseline or healthy sample to compare against for fold change values, so we have to compare other features, such as the mean and median values per sample type. One way to compare these types of lymphoma is to see how far the mean is from the median for each gene. A large mean minus median difference suggests skew, which can come from real variability or from an outlier. The median should be very close to the mean if the samples had no significant differences.
The samples are also not balanced, as there are only 8 pDLBCL cases but 19 CHL and 20 mDLBCL cases. There are more men than women, and the median age is 72 years, with all patients older than 50. We could factor in age, since the quality of our immune response declines as we get older. Each class could carry weights for the percent of patients younger than 72 and the percent older than 72, as well as the percent of men and the percent of women per class.
We know the groups were formed by looking directly at values of IFN-G, IDO1, and PDL1, as well as copy number variants in the genes surrounding 9p24.1 on chromosome 9.
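The mean versus median comparison described above can be sketched on a tiny matrix. The values below are made up so that one gene carries a single high outlier; `expr_toy` and the gene names are hypothetical.

```r
# Toy log2 expression values: 3 genes (rows) by 5 samples (columns), made up
expr_toy <- matrix(c(5.0, 5.1, 4.9, 5.0, 5.0,
                     4.0, 4.1, 3.9, 4.0, 8.0,
                     3.0, 3.0, 3.0, 3.0, 3.0),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("geneA", "geneB", "geneC"), NULL))

gene_mean   <- rowMeans(expr_toy)
gene_median <- apply(expr_toy, 1, median)

# Positive values mean the mean sits above the median (right skew or a high outlier)
mean_minus_median <- gene_mean - gene_median

# Gene with the largest gap between mean and median
most_skewed <- names(which.max(abs(mean_minus_median)))
```

With only 8 samples in the smallest class, a single outlier can move the mean a lot, so a gap like this is a flag to look at the raw values rather than a result on its own.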
males <- grep('Sex: male',compare[5,])
females <- grep('female', compare[5,])
There are 12 females among the 47 samples, which leaves 35 males if every sample has a sex label.
We will call females x and males y.
CHL_x <- CHL[CHL %in% females]
mDLBCL_x <- mDLBCL[mDLBCL %in% females]
pDLBCL_x <- pDLBCL[pDLBCL %in% females]
CHL_y <- CHL[CHL %in% males]
mDLBCL_y <- mDLBCL[mDLBCL %in% males]
pDLBCL_y <- pDLBCL[pDLBCL %in% males]
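A quick sanity check on the sex indices is that the two groups do not overlap and add up to the number of samples. The sketch below uses five hypothetical labels in the same "Sex: ..." layout and also shows why the pattern for males needs the "Sex: " prefix.

```r
# Hypothetical sex labels in the "Sex: ..." layout used in the sample table
sex <- c("Sex: male", "Sex: female", "Sex: male", "Sex: male", "Sex: female")

males   <- grep("Sex: male", sex, fixed = TRUE)
females <- grep("Sex: female", sex, fixed = TRUE)

# The two groups should not overlap and should add up to the sample count
counts_ok <- length(intersect(males, females)) == 0 &&
  length(males) + length(females) == length(sex)

# A bare "male" pattern also matches "female", which would double count
loose_male <- grep("male", sex)
```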
Add Mean values of genes for each class by gender to the data frame normalized.expr
normalized.expr$CHL_x_mean <- rowMeans(normalized.expr[,CHL_x])
normalized.expr$mDLBCL_x_mean <- rowMeans(normalized.expr[,mDLBCL_x])
normalized.expr$pDLBCL_x_mean <- rowMeans(normalized.expr[,pDLBCL_x])
normalized.expr$CHL_y_mean <- rowMeans(normalized.expr[,CHL_y])
normalized.expr$mDLBCL_y_mean <- rowMeans(normalized.expr[,mDLBCL_y])
normalized.expr$pDLBCL_y_mean <- rowMeans(normalized.expr[,pDLBCL_y])
Let's add gene means per class by age: 72 or younger, and older than 72.
age <- as.numeric(gsub('age: ','',compare[4,]))
compare[4,] <- age
row.names(compare) <- c('titleID','GSM_ID','diagnosis','age', 'gender','group')
paged_table(compare)
# Use the numeric age vector; the values stored in compare are text
young <- which(age <= 72)
old <- which(age > 72)
CHL_young <- CHL[CHL %in% young]
mDLBCL_young <- mDLBCL[mDLBCL %in% young]
pDLBCL_young <- pDLBCL[pDLBCL %in% young]
CHL_old <- CHL[CHL %in% old]
mDLBCL_old <- mDLBCL[mDLBCL %in% old]
pDLBCL_old <- pDLBCL[pDLBCL %in% old]
Now we have the young and old indices within each class, where young means 72 or younger and old means older than 72. Next we can get the mean per gene for each lymphoma class within each age subcategory.
normalized.expr$CHL_young72_mean <- rowMeans(normalized.expr[CHL_young])
normalized.expr$mDLBCL_young72_mean <- rowMeans(normalized.expr[mDLBCL_young])
normalized.expr$pDLBCL_young72_mean <- rowMeans(normalized.expr[pDLBCL_young])
normalized.expr$CHL_old <- rowMeans(normalized.expr[CHL_old])
normalized.expr$mDLBCL_old <- rowMeans(normalized.expr[mDLBCL_old])
normalized.expr$pDLBCL_old <- rowMeans(normalized.expr[pDLBCL_old])
paged_table(normalized.expr[1:10,48:63])
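One detail of the age split is worth a small sketch: the cutoff behaves differently on numbers and on text. The ages below are hypothetical, and 100 is included only to show the pitfall; it is not an age from the study.

```r
# Hypothetical ages; 100 is included to show the text comparison pitfall
age <- c(55, 72, 73, 100, 68)

young <- which(age <= 72)   # 72 and younger
old   <- which(age > 72)    # older than 72

# The same comparison on text goes character by character, so "100" sorts before "72"
age_text   <- as.character(age)
young_text <- which(age_text <= "72")
```

For two-digit ages both versions agree, which is why the split can look correct even when the ages are stored as text.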
We will add the medians per class of lymphoma (CHL, mDLBCL, or pDLBCL), not per group. We won’t use any weights for gender or age, but we can compare the mean values within each class by gender and by age separately to see if there are any noticeable changes.
The general pattern for a row-wise median, where df and cols_to_use are placeholders, is:
df$row_median <- apply(df[cols_to_use], 1, function(row) {
median(as.numeric(row), na.rm = TRUE) # Convert to numeric & ignore NAs
})
normalized.expr$CHL_median <- apply(normalized.expr[CHL],1,function(row){
median(as.numeric(row), na.rm=F)
})
normalized.expr$mDLBCL_median <- apply(normalized.expr[mDLBCL],1,function(row){
median(as.numeric(row), na.rm=F)
})
normalized.expr$pDLBCL_median <- apply(normalized.expr[pDLBCL],1,function(row){
median(as.numeric(row), na.rm=F)
})
paged_table(normalized.expr[1:10,49:66])
Let's go ahead and add the medians for each class within its gender and age subclasses. This could be useful when filtering for genes whose mean is far from the median of that group. It is mostly copy, paste, and replace, so this step takes little typing or creativity.
normalized.expr$CHL_x_median <- apply(normalized.expr[CHL_x],1,function(row){
median(as.numeric(row), na.rm=F)
})
normalized.expr$mDLBCL_x_median <- apply(normalized.expr[mDLBCL_x],1,function(row){
median(as.numeric(row), na.rm=F)
})
normalized.expr$pDLBCL_x_median <- apply(normalized.expr[pDLBCL_x],1,function(row){
median(as.numeric(row), na.rm=F)
})
normalized.expr$CHL_y_median <- apply(normalized.expr[CHL_y],1,function(row){
median(as.numeric(row), na.rm=F)
})
normalized.expr$mDLBCL_y_median <- apply(normalized.expr[mDLBCL_y],1,function(row){
median(as.numeric(row), na.rm=F)
})
normalized.expr$pDLBCL_y_median <- apply(normalized.expr[pDLBCL_y],1,function(row){
median(as.numeric(row), na.rm=F)
})
Above we added the medians by gender within each class of lymphoma.
Now for age within each class: 72 or younger, and older than 72.
normalized.expr$CHL_young72_median <- apply(normalized.expr[CHL_young],1,function(row){
median(as.numeric(row), na.rm=F)
})
normalized.expr$mDLBCL_young72_median <- apply(normalized.expr[mDLBCL_young],1,function(row){
median(as.numeric(row), na.rm=F)
})
normalized.expr$pDLBCL_young72_median <- apply(normalized.expr[pDLBCL_young],1,function(row){
median(as.numeric(row), na.rm=F)
})
normalized.expr$CHL_old72_median <- apply(normalized.expr[CHL_old],1,function(row){
median(as.numeric(row), na.rm=F)
})
normalized.expr$mDLBCL_old72_median <- apply(normalized.expr[mDLBCL_old],1,function(row){
median(as.numeric(row), na.rm=F)
})
normalized.expr$pDLBCL_old72_median <- apply(normalized.expr[pDLBCL_old],1,function(row){
median(as.numeric(row), na.rm=F)
})
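The repeated mean and median blocks above could also be written as one loop over a named list of column indices. This is only a sketch of that alternative on made-up data, not the code used for the results here; `expr_toy`, `groups`, and the class names are hypothetical.

```r
# Toy expression table: 4 genes by 6 samples, made-up values
set.seed(1)
expr_toy <- as.data.frame(matrix(round(runif(24, 2, 10), 2), nrow = 4))

# Hypothetical named index vectors, standing in for CHL, mDLBCL_x, pDLBCL_old, ...
groups <- list(classA = c(1, 2, 3), classB = c(4, 5), classC = 6)

for (g in names(groups)) {
  # drop = FALSE keeps a one-column group as a table instead of a plain vector
  block <- as.matrix(expr_toy[, groups[[g]], drop = FALSE])
  expr_toy[[paste0(g, "_mean")]]   <- rowMeans(block)
  expr_toy[[paste0(g, "_median")]] <- apply(block, 1, median)
}
```

The `drop = FALSE` matters for small subgroups: with a single sample in a group, plain column selection returns a vector and rowMeans() stops with an error.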
Let's see what we have so far and decide on some filters to use as gene markers, or we can simply grep the genes that were in the study. Since the study reports the chromosome location 9p24.1, we should grep all genes in that region using the feature data we loaded earlier in this file. I looked, and the start and stop positions are given but not the cytoband; we have another file from another study that has it.
The study used GRCh38 coordinates (Genome Reference Consortium human build 38, as shown in the UCSC Genome Browser or similar). At 9p24.1 it targeted 5,259,371-5,481,709 for PDL1, which is encoded by CD274, and 5,593,784-5,764,809 for PDL2.
The other genes relate to EBV latency. EBNA2 is expressed, along with LMP1, when there is an active EBV infection of latent type 3. That pattern belongs to the DLBCL types and not normally to CHL, which would have EBNA1 for latent type 1 and type 2. The study defined EBV latent type 1 as LMP1 negative and EBNA2 negative, and EBV latent type 2 as LMP1 positive but EBNA2 negative. We also need IFNG for interferon gamma (IFN-g), and IDO1. The study confirmed active EBV infection by EBER in situ hybridization.
LMP1
EBNA2
IDO1
IFNG
PDL1
PDL2
9p24.1 5,259,371-5,481,709 for PDL1 (gene CD274), searched on the + strand
9p24.1 5,593,784-5,764,809 for PDL2 (gene PDCD1LG2), searched on the - strand
Other markers that were looked at as part of their Gene Expression Validation Analysis included CD3, CD5, CD10, CD15, CD20, CD30, CD79a, PAX5, EBER, BCL2, BCL6, and MUM1.
Let's make a list of these genes called GSVA, for gene study validation analysis.
GSVA <- c("LMP1","EBNA2","IDO1","IFNG","PDL1","PDL2","CD3","CD5","CD10","CD15","CD20","CD30","CD79a","PAX5","EBER","BCL2","BCL6", "MUM1")
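The start and stop positions in the feature data come with a strand column of + and - signs. The + means the gene lies on the forward (plus) strand of the reference sequence and the - means it lies on the reverse (minus) strand. The two strands are antiparallel and complementary, and the start and stop coordinates are counted along the forward strand in both cases. A gene on the + strand reads 5’ to 3’ toward higher coordinates, and a gene on the - strand reads 5’ to 3’ toward lower coordinates.
9p24.1 5,259,371-5,481,709 for PDL1 (gene CD274), searched on the + strand
9p24.1 5,593,784-5,764,809 for PDL2, searched on the - strand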
chr9 <- subset(feature.data, feature.data$seqname == 'chr9') #822X10
startPDL1 <- 5259371
endPDL1 <- 5481709
# as.numeric() guards against start and stop being stored as text
PDL1_loci <- subset(chr9, chr9$strand == '+' & as.numeric(chr9$start) >= startPDL1 & as.numeric(chr9$stop) <= endPDL1)
paged_table(PDL1_loci)
Scrolling to the SPOT_ID.1 column shows that this is CD274, which the study said it used to detect PDL1.
Now for PDL2 loci
startPDL2 <- 5593784
endPDL2 <- 5764809
# as.numeric() guards against start and stop being stored as text
PDL2_loci <- subset(chr9, chr9$strand == '-' & as.numeric(chr9$start) >= startPDL2 & as.numeric(chr9$stop) <= endPDL2)
paged_table(PDL2_loci)
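We can add these genes to the list. We already have CD274, but the run recorded here returned PTPRD and ERP44 rather than PDCD1LG2 for PDL2. Both of those genes lie well outside the 5,593,784-5,764,809 window (PTPRD starts near 8.3 Mb and ERP44 sits on the long arm at roughly 100 Mb), which is the kind of result a text comparison gives when start and stop are stored as character. Treat these two hits as suspect and check the column types with str(chr9) before relying on them.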
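To show how a coordinate window behaves when positions are stored as text, here is a small sketch on a made-up feature table. The gene names, positions, and `feat_toy` itself are hypothetical; only the window limits are the PDL2 values used above.

```r
# Toy feature table with made-up genes; start and stop stored as text on purpose
feat_toy <- data.frame(
  gene   = c("inside", "far_short_arm", "long_arm"),
  strand = c("-", "-", "-"),
  start  = c("5600000", "8300000", "99900000"),
  stop   = c("5700000", "10600000", "100100000"),
  stringsAsFactors = FALSE
)
win_start <- 5593784
win_end   <- 5764809

# Text comparison: the limits are turned into strings and compared character by character
text_hits <- subset(feat_toy, start >= win_start & stop <= win_end)$gene

# Numeric comparison: convert the columns first
num_hits <- subset(feat_toy,
                   as.numeric(start) >= win_start &
                     as.numeric(stop) <= win_end)$gene
```

Here the text comparison keeps all three rows, while the numeric comparison keeps only the one gene that really lies inside the window.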
Let's grep these genes in the SPOT_ID.1 annotation column.
GSVA <- c(GSVA,"PTPRD","CD274", "ERP44","PDCD1LG2")
GSVA
## [1] "LMP1" "EBNA2" "IDO1" "IFNG" "PDL1" "PDL2"
## [7] "CD3" "CD5" "CD10" "CD15" "CD20" "CD30"
## [13] "CD79a" "PAX5" "EBER" "BCL2" "BCL6" "MUM1"
## [19] "PTPRD" "CD274" "ERP44" "PDCD1LG2"
We will do this later, as we will likely have to grep each of these 22 genes separately.
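Several names in the list are marker or protein names rather than the gene symbols an array annotation would normally use, so a lookup table may be needed before the grep. The sketch below maps the ones I am confident about to HGNC symbols; CD15 is a carbohydrate epitope, and FUT4 is the gene usually listed under that alias. LMP1, EBNA2, and EBER are EBV genes, so a human array would not be expected to carry them. The annotation strings in `annot_toy` and the helper `find_rows` are made up for illustration.

```r
# Marker names from the list above mapped to HGNC gene symbols.
# EBV genes (LMP1, EBNA2, EBER) are viral, so they are left out of this human map.
marker_to_symbol <- list(
  PDL1  = "CD274",
  PDL2  = "PDCD1LG2",
  CD3   = c("CD3D", "CD3E", "CD3G"),
  CD10  = "MME",
  CD15  = "FUT4",
  CD20  = "MS4A1",
  CD30  = "TNFRSF8",
  CD79a = "CD79A",
  MUM1  = "IRF4"
)

# Made-up annotation strings standing in for the SPOT_ID.1 column
annot_toy <- c("example one (CD274), mRNA.",
               "example two (MS4A1), mRNA.",
               "example three (CD3E), mRNA.",
               "example four (CD3EAP), mRNA.")

# Match "(SYMBOL)" exactly so CD3E does not also pick up CD3EAP
find_rows <- function(symbols, annot) {
  hits <- lapply(symbols, function(s) grep(paste0("(", s, ")"), annot, fixed = TRUE))
  sort(unique(unlist(hits)))
}
rows_by_marker <- lapply(marker_to_symbol, find_rows, annot = annot_toy)
```

Names that are already gene symbols, such as IDO1, IFNG, CD5, PAX5, BCL2, and BCL6, can go through the same helper unchanged.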
Let's write out the file to use later in Part 2, with our samples, probe IDs, messy gene IDs in one column, and the sample means and medians by class and by subclass within each class.
write.csv(normalized.expr,'MeansMediansLymphomas_CHL_DLBCL_21448X78.csv', row.names=F)
This file is about 135 MB, which is too large for Google, but Kaggle allows it. Get this file here
or the 2nd version here
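If the file size becomes a problem, one option (an assumption on my part, not something used for the files linked above) is to write the CSV through a gzip connection. The sketch below uses a tiny made-up table and a temporary file path.

```r
# Small made-up table standing in for the large results table
toy <- data.frame(ID = c("a", "b", "c"),
                  mean = c(5.31, 4.69, 3.97),
                  stringsAsFactors = FALSE)

# Write through a gzip connection; the path here is a temporary file
gz_path <- file.path(tempdir(), "toy_means.csv.gz")
write.csv(toy, gzfile(gz_path), row.names = FALSE)

# read.csv() can read the compressed file directly
toy_back <- read.csv(gz_path, stringsAsFactors = FALSE)
```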
Thanks so much and keep checking in for part 2.