Part 1 exploring and cleaning up data to extract information as an analysis of Epstein-Barr Viral infection to get top genes for machine learning to use in comparison against other viral infections like lymphoproliferative B-cell activity in Multiple Sclerosis, Lyme Disease, Mononucleosis, Burketts Lymphoma, and Hodgkins Lymphoma using their relative data as these can not be compared across platform for type of media, and lab analysis differences, sample size differences, and so on.

This project looks at the gene expression data from the National Center for Biotechnology Information (NCBI), GSE253756, that evaluates Epstein Barr Viral infections and the effects IL27 has with IL27RA. Summarizing the study from link above and generalizing with my own input to put into my own words: the IL27RA is a good component to have as it is a subunit of the IL27 gene that allows cytotoxic T-cells to go out into the body and destroy infected cells of EBV. But if you lack the IL27RA to make that subunit, the networking or call to the cytotoxic T-cells or natural killer cells or CD4 cells will not occur and the person they found to lack this was those with alleles that are homozygous or that they inherited the gene allele lacking this gene from both parents. Those folk get infectious mononucleosis and EBV infection regularly, otherwise the rest of the healthy populations have a much more favorable or easier healing process that allows their MH2 and CD4 T-cell networking to hunt and destroy on site those EBV infected cells to prevent reactions typical of mononucleosis and EBV infection like sore throat and malaise or chronic fatigue or lack of energy for brain and body to do anything more than turn over in bed.

The study used 2 samples of healthy cells not infected with EBV as controls and 2 samples each from 2 patients that they stimulated with IL27

The media is lymphoblastic cell lines (LCL) found in peripheral blood mononuclear cells (pbmc). The Epstein Barr virus (EBV) is an agent much like hela cells used from human papilloma viral (HPV) infected tumor cells of Henrietta Lacks, that also works to preserve cells and gain fast division and avoid mammal live testing. The samples are not infected with EBV to test EBV effects but do focus on infectious mononucleosis which is already to be known to be derived from EBV among other pathologies connected to an earlier EBV infection such as general flu and sore throat effects from direct infection of EBV, Burkett Lymphoma, Multiple Sclerosis, head and neck carcinogen, and even Hodgkin Lymphona although not as strongly as the association with mononucleosis. The study begins their summary of the research by stating that those associations with EBV just mentioned are engendered lymphoproliferative diseases by EBV and without the IL27RA-IL27 autocrine axis in response to EBV infected B-cells by healthy T-cells, that EBV hijacks the lymphoblastoid B-cells and can cause those lymphoproliferative diseases of mono, lymphoma, and multiple sclerosis that B-cell activity causes demyelination of the central nervous system or CNS (by oligodendrocytes in brain and shwann cells in peripheral nervous system or PNS)

The data uses RNA-seq to aquire gene expression data with high throughput sequencing, cDNA or complementary DNA that is the RNA copy of the DNA from the LCL type of human body sample. The LCLs are basically stem cells from the lymphoid tissue of white blood cells in the blood sampled from bone where the manufacturing of WBCs and RBCs takes place. In adults usually spine and in kids they have it in long bones but easier to take from the spine like axial skeleton of ilium.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)

Get series data table,

seriesHeader <- read.table('GSE253756_series_matrix.txt', sep="", comment.char="!")
head(seriesHeader)
##       V1         V2         V3         V4         V5         V6         V7
## 1 ID_REF GSM8027531 GSM8027532 GSM8027533 GSM8027534 GSM8027535 GSM8027536
##           V8         V9
## 1 GSM8027537 GSM8027538

Summary Information to data in collection and maintaining data samples in the lab.Shrink the summary information from 209 to 109 rows.

info <- read.csv('GSE253756_series_matrix.txt', sep="", header=T, row.names=NULL)
head(info,10)
##              X.Series_title
## 1     !Series_geo_accession
## 2            !Series_status
## 3   !Series_submission_date
## 4  !Series_last_update_date
## 5           !Series_summary
## 6    !Series_overall_design
## 7              !Series_type
## 8       !Series_contributor
## 9       !Series_contributor
## 10        !Series_sample_id
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Dual.role.of.IL.27.in.Epstein.Barr.virus.infection.revealed.by.IL27RA.deficiency
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       GSE253756
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           Public on Jan 23 2024
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Jan 19 2024
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Jan 23 2024
## 5  Epstein-Barr virus (EBV) infection can engender severe B-cell lymphoproliferative diseases. The primary infection is often asymptomatic or causes infectious mononucleosis (IM), a self-limiting lymphoproliferative disorder. Selective vulnerability to EBV has been reported in association with inherited mutations impairing T cell immunity to EBV.  Herein, we report bi-allelic loss-of-function mutations variants in IL27RA that underlie an acute and severe primary EBV infection with a nevertheless favorable outcome requiring a minimal treatment. One mutant allele (rs201107107) was enriched in the Finnish population (MAF=0.0068) and carried a high risk of severe IM when homozygous. IL27RA codes for the a subunit of the receptor of IL-27. In the absence of IL27RA, phosphorylation of STAT1 and STAT3 in response to IL-27 is abolished in T cells. In in vitro studies, IL-27 exerts a synergistic effect on TCR-dependent T cell proliferation that is abolished in cells from the patients, leading to impaired expansion of potent anti-EBV effector cytotoxic CD8+ T cells. IL-27 is produced by EBV- infected B lymphocytes and an IL27RA-IL-27 autocrine loop is required for maintenance of EBV-transformed B cells. This potentially explains the eventual favorable outcome of the EBV-induced viral disease in IL27RA-deficient patients. Furthermore, we identified neutralizing anti-IL-27 autoantibodies in most individuals who developed sporadic IM and chronic EBV infection. Collectively, these results demonstrate the critical role of IL27RA-IL-27 axis in T cell immunity to EBV, but also the hijacking of this defense by EBV to promote expansion of infected B cells.
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  To investigate transcriptomic impact associated with IL27RA-deficiency, gene expression (including EBV genes) of two control LCLs and two IL27RA-deficient LCLs issued from two patients P1.1 and P.2 were analysed in response to IL-27 (Stim_IL27) compared to non-stimulated cells (Basal).
## 7                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Expression profiling by high throughput sequencing
## 8                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Sylvain,,Latour
## 9                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Mathieu,,Simonin
## 10                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       GSM8027531 GSM8027532 GSM8027533 GSM8027534 GSM8027535 GSM8027536 GSM8027537 GSM8027538

Get only the summary information from the commented ‘!’ rows exluded in table header read.

summaryInformation <- info$X.Series_title[c(89,95,104,109,114,119)]
summaryInformation
## [1] "Total RNA was isolated from LCLs stimulated or not with IL-27 for 6 h hours using the RNeasy Kit (QIAGEN) including DNAse treatment. RNA quality was assessed by capillary electrophoresis by Fragment Analyzer using High Sensitivity RNA reagents (Agilent Technologies) and RNA concentration measured by spectrophotometry withXpose (Trinean) and Fragment Analyzer."                                                                                                 
## [2] "RNAseq libraries were generated from 600 ng of RNA using Universal PlusTM mRNA-Seq with NuQuant® (NuGEN, San Juan, Puerto Rico)"                                                                                                                                                                                                                                                                                                                                           
## [3] "RNAseq libraries were sequenced (paired-end reads 100 bases + 100 bases) on Illumina NovaSeq6000."                                                                                                                                                                                                                                                                                                                                                                         
## [4] "A total of ~90 million of passing filter paired-end reads was produced per library. Reads were mapped and aligned to the human reference genome and to EBV1 and EBV2 reference genomes using STAR52, and the number of uniquely mapped reads for each gene was counted using HTSeq53. Only genes with an average ≥1 read/sample were used in our analysis.DESeq2 was used for normalization and the negative binomial model was used for differential expression analysis."
## [5] "Assembly: FASTQ files were mapped to the ENSEMBL Human GRCh38/hg38 reference using Illumina DRAGEN aligner and counted by featureCounts from the Subread R package"                                                                                                                                                                                                                                                                                                        
## [6] "Supplementary files format and content: csv file includes raw counts for the different samples"

The above summary information describes the protocols and design of this series data published as research with NCBI. The above summary explains the threshold values of the genes to keep as base pairs of nucleotides the messenger RNA or mRNA makes alternately called cDNA in transcription of DNA into RNA in the nucleus to translate a protein in the nucleus that gets packaged by other organelles in the cytoplasm before being sent off into the body through the plasma membrane or to stay as cell activity regulators in the cell. They have filtered out the genes that didn’t have at least 1 read of it in the sample used. The data was normalized using DESeq2 with a negative binomial model for differential expression analysis and then mapped to the human genome assembly 38 or hg38 in ENSEMBL. These files we are looking at are the csv supplementary files that are raw counts for different samples.

The experiment design is not in the summary file for what the samples are in the header, that information is in the NCBI sample page to retrieve the samples. Lets read in that file of the raw counts of gene expression we will be analyzing.

geneData <- read.table('GSE253756_raw.count.txt',sep="", header=T)
str(geneData)
## 'data.frame':    58302 obs. of  9 variables:
##  $ Geneid              : chr  "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" ...
##  $ LCL_P1_1_Stim_IL27  : int  2 0 1631 1047 823 6583 0 1038 1035 2443 ...
##  $ LCL_P1_1_Basal      : int  2 0 1659 919 305 5278 0 931 1546 1713 ...
##  $ LCL_P2_Stim_IL27    : int  2 0 1971 940 937 2969 0 741 704 2898 ...
##  $ LCL_P2_Basal        : int  1 0 2953 1109 700 5015 0 1109 1609 3208 ...
##  $ LCL_Ctrl_1_Stim_IL27: int  2 0 2795 1238 896 4238 2 1439 1281 3371 ...
##  $ LCL_Ctrl_1_Basal    : int  1 0 3111 1008 732 4484 1 1700 2110 2941 ...
##  $ LCL_Ctrl_2_Stim_IL27: int  0 0 2693 1139 935 5157 1 1376 1271 3322 ...
##  $ LCL_Ctrl_2_Basal    : int  1 0 2928 987 713 4542 1 1774 2160 3147 ...

We have a data frame of 58,302 genes for 9 variables. Our header has 9 variables as well and is labeled by sample. The sample type must be the header. Lets look at that information. Make a table and rename columns

sampleHeaderDF <- data.frame(t(seriesHeader[1,]), colnames(geneData))
sampleHeaderDF
##            X1   colnames.geneData.
## V1     ID_REF               Geneid
## V2 GSM8027531   LCL_P1_1_Stim_IL27
## V3 GSM8027532       LCL_P1_1_Basal
## V4 GSM8027533     LCL_P2_Stim_IL27
## V5 GSM8027534         LCL_P2_Basal
## V6 GSM8027535 LCL_Ctrl_1_Stim_IL27
## V7 GSM8027536     LCL_Ctrl_1_Basal
## V8 GSM8027537 LCL_Ctrl_2_Stim_IL27
## V9 GSM8027538     LCL_Ctrl_2_Basal

Remove the 1st row and rename the header to Sample_ID and Sample_Experiment.

names <- c("Sample_ID","Sample_Experiment")
sampleByType_df <- sampleHeaderDF[-1,]

colnames(sampleByType_df) <- names
row.names(sampleByType_df) <- NULL

sampleByType_df
##    Sample_ID    Sample_Experiment
## 1 GSM8027531   LCL_P1_1_Stim_IL27
## 2 GSM8027532       LCL_P1_1_Basal
## 3 GSM8027533     LCL_P2_Stim_IL27
## 4 GSM8027534         LCL_P2_Basal
## 5 GSM8027535 LCL_Ctrl_1_Stim_IL27
## 6 GSM8027536     LCL_Ctrl_1_Basal
## 7 GSM8027537 LCL_Ctrl_2_Stim_IL27
## 8 GSM8027538     LCL_Ctrl_2_Basal

The samples above show their gene study ID next to the study done to that sample worded in name of sample. There are two samples of the 1st and 2nd patient infected with EBV each and 2 control samples from different patients each that have no EBV infection. That sounds like the study we summarized ealier. The samples from each patient infected or not with EBV is with infection or not infected or stimulated with IL27.

The data is raw counts and they said they kept the genes that only had 1 or more reads per sample. So we can remove any genes that didn’t have at least 1 read per sample. I take that to mean it had to be in all samples. Lets summarise the columns across each row that is a gene to be at least 8, if the sum of the reads is not at least 8 we filter it out.

geneData$sampleSumOfCounts <- rowSums(geneData[,2:9], na.rm=F, dims=1)
summary(geneData$sampleSumOfCounts)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##       0.0       0.0       4.0    5400.9     247.8 1208554.0

We used rowSums function to sum up across the row in each column for counts in all of the genes. We see that because there are 8 samples, there should be at least 8 raw counts in this column if it was kept in the data. Lets remove the genes that the data didn’t have at least 1 raw gene per sample.

atLeast8_genes <- subset(geneData, geneData$sampleSumOfCounts >=8)
summary(atLeast8_genes)
##     Geneid          LCL_P1_1_Stim_IL27 LCL_P1_1_Basal     LCL_P2_Stim_IL27
##  Length:26164       Min.   :     0     Min.   :     0.0   Min.   :     0  
##  Class :character   1st Qu.:     4     1st Qu.:     4.0   1st Qu.:     4  
##  Mode  :character   Median :    49     Median :    41.0   Median :    48  
##                     Mean   :  1234     Mean   :  1073.6   Mean   :  1316  
##                     3rd Qu.:  1112     3rd Qu.:   915.2   3rd Qu.:  1133  
##                     Max.   :207921     Max.   :138352.0   Max.   :163087  
##   LCL_P2_Basal    LCL_Ctrl_1_Stim_IL27 LCL_Ctrl_1_Basal LCL_Ctrl_2_Stim_IL27
##  Min.   :     0   Min.   :     0       Min.   :     0   Min.   :     0      
##  1st Qu.:     5   1st Qu.:     6       1st Qu.:     4   1st Qu.:     6      
##  Median :    59   Median :    64       Median :    55   Median :    67      
##  Mean   :  1584   Mean   :  1781       Mean   :  1711   Mean   :  1714      
##  3rd Qu.:  1393   3rd Qu.:  1555       3rd Qu.:  1500   3rd Qu.:  1503      
##  Max.   :175662   Max.   :159813       Max.   :166663   Max.   :151522      
##  LCL_Ctrl_2_Basal sampleSumOfCounts
##  Min.   :     0   Min.   :      8  
##  1st Qu.:     5   1st Qu.:     44  
##  Median :    56   Median :    489  
##  Mean   :  1620   Mean   :  12034  
##  3rd Qu.:  1454   3rd Qu.:  10704  
##  Max.   :153781   Max.   :1208554

We still see some genes have no count in the samples and are retained because total sums of at least 8 doesn’t account for those samples that have no counts while others have more.

dim(atLeast8_genes)
## [1] 26164    10

We went from 58,302 genes to 26,164 genes by getting at least those genes that showed 8 counts across samples. Now lets subset or filter again to only keep the gene if each sample has at least 1 count.

atLeast1geneEach <- subset(atLeast8_genes, 
                        atLeast8_genes$LCL_P1_1_Stim_IL27 >= 1 & 
                        atLeast8_genes$LCL_P1_1_Basal >= 1 &
                        atLeast8_genes$LCL_P2_Stim_IL27 >= 1 & 
                        atLeast8_genes$LCL_P2_Basal >= 1 &
                        atLeast8_genes$LCL_Ctrl_1_Stim_IL27 >= 1 &
                        atLeast8_genes$LCL_Ctrl_1_Basal >= 1 &
                        atLeast8_genes$LCL_Ctrl_2_Stim_IL27 >= 1 & 
                        atLeast8_genes$LCL_Ctrl_2_Basal >=1)
dim(atLeast1geneEach)
## [1] 20857    10

Now we have a data set of 20,857 genes. Lets see the summary of the gene counts per sample.

summary(atLeast1geneEach[2:10])
##  LCL_P1_1_Stim_IL27 LCL_P1_1_Basal   LCL_P2_Stim_IL27  LCL_P2_Basal   
##  Min.   :     1     Min.   :     1   Min.   :     1   Min.   :     1  
##  1st Qu.:    16     1st Qu.:    13   1st Qu.:    15   1st Qu.:    18  
##  Median :   213     Median :   176   Median :   207   Median :   250  
##  Mean   :  1544     Mean   :  1343   Mean   :  1647   Mean   :  1982  
##  3rd Qu.:  1615     3rd Qu.:  1342   3rd Qu.:  1671   3rd Qu.:  2050  
##  Max.   :207921     Max.   :138352   Max.   :163087   Max.   :175662  
##  LCL_Ctrl_1_Stim_IL27 LCL_Ctrl_1_Basal LCL_Ctrl_2_Stim_IL27 LCL_Ctrl_2_Basal
##  Min.   :     1       Min.   :     1   Min.   :     1       Min.   :     1  
##  1st Qu.:    20       1st Qu.:    16   1st Qu.:    22       1st Qu.:    17  
##  Median :   280       Median :   257   Median :   287       Median :   250  
##  Mean   :  2231       Mean   :  2143   Mean   :  2147       Mean   :  2029  
##  3rd Qu.:  2311       3rd Qu.:  2233   3rd Qu.:  2214       3rd Qu.:  2132  
##  Max.   :159813       Max.   :166663   Max.   :151522       Max.   :153781  
##  sampleSumOfCounts
##  Min.   :      8  
##  1st Qu.:    153  
##  Median :   2079  
##  Mean   :  15065  
##  3rd Qu.:  15829  
##  Max.   :1208554

We see that each sample has at least 1 gene raw count and that there is a median of 2,079 counts per gene sum of counts with a mean of 15,065. There is some skew in the data as the data has a mean more than 7X the actual half of the data raw count values of each gene in our 20,857 gene pool. More than half the genes are greater than 2,079 raw counts each as the median is 2,079 while the 3rd quantile has a high increase to 15,829.

Lets plot a histogram of the sample sum of counts per gene.

hist(atLeast1geneEach$sampleSumOfCounts)

plot(atLeast1geneEach$sampleSumOfCounts)

The scatter plot of actual gene as an index mapped to its relative raw count in comparison to other genes is plot above.

Lets get the fold sample means of basal controls and basal infected, and see which genes are being expressed the most or least among them.

colnames(atLeast1geneEach)
##  [1] "Geneid"               "LCL_P1_1_Stim_IL27"   "LCL_P1_1_Basal"      
##  [4] "LCL_P2_Stim_IL27"     "LCL_P2_Basal"         "LCL_Ctrl_1_Stim_IL27"
##  [7] "LCL_Ctrl_1_Basal"     "LCL_Ctrl_2_Stim_IL27" "LCL_Ctrl_2_Basal"    
## [10] "sampleSumOfCounts"

The added features for means of controls, patients, and their respective fold change values of the means of patients to controls without IL27 stimulation.

atLeast1geneEach$basalcontrolMeans <- rowMeans(atLeast1geneEach[,c(7,9)])
atLeast1geneEach$basalPatientMeans <- rowMeans(atLeast1geneEach[,c(3,5)])
atLeast1geneEach$foldchange_pt_ctrl_not_stim <- atLeast1geneEach$basalPatientMeans/atLeast1geneEach$basalcontrolMeans

summary(atLeast1geneEach[,11:13])
##  basalcontrolMeans  basalPatientMeans foldchange_pt_ctrl_not_stim
##  Min.   :     1.0   Min.   :     1    Min.   :   0.0006          
##  1st Qu.:    16.5   1st Qu.:    17    1st Qu.:   0.6958          
##  Median :   255.5   Median :   225    Median :   0.8254          
##  Mean   :  2085.9   Mean   :  1662    Mean   :   1.6157          
##  3rd Qu.:  2178.0   3rd Qu.:  1709    3rd Qu.:   1.0881          
##  Max.   :160222.0   Max.   :157007    Max.   :8971.8286

From summaries of these added features above, there is similar gene expression raw counts in means for controls and patients without the IL-27 stimulation are very close in order of size or magnitude for summary statistics but lets see what gene is naturally more expressed as the max gene.

atLeast1geneEach[which.max(atLeast1geneEach$basalcontrolMeans),]
##               Geneid LCL_P1_1_Stim_IL27 LCL_P1_1_Basal LCL_P2_Stim_IL27
## 1317 ENSG00000075624             130887         119703           150851
##      LCL_P2_Basal LCL_Ctrl_1_Stim_IL27 LCL_Ctrl_1_Basal LCL_Ctrl_2_Stim_IL27
## 1317       175334               159813           166663               151522
##      LCL_Ctrl_2_Basal sampleSumOfCounts basalcontrolMeans basalPatientMeans
## 1317           153781           1208554            160222          147518.5
##      foldchange_pt_ctrl_not_stim
## 1317                   0.9207131

That gene that has a max mean expression value across the controls and patients is ENSG00000075624.

I selected the uniProt summary for this gene with alternate name Actin as ACTB:

“UniProtKB/Swiss-Prot Summary for ACTB Gene Actin is a highly conserved protein that polymerizes to produce filaments that form cross-linked networks in the cytoplasm of cells (PubMed:25255767, 29581253). Actin exists in both monomeric (G-actin) and polymeric (F-actin) forms, both forms playing key functions, such as cell motility and contraction (PubMed:29581253). In addition to their role in the cytoplasmic cytoskeleton, G- and F-actin also localize in the nucleus, and regulate gene transcription and motility and repair of damaged DNA (PubMed:29925947). Plays a role in the assembly of the gamma-tubulin ring complex (gTuRC), which regulates the minus-end nucleation of alpha-beta tubulin heterodimers that grow into microtubule protafilaments (PubMed:39321809, 38609661). Part of the ACTR1A/ACTB filament around which the dynactin complex is built (By similarity). The dynactin multiprotein complex activates the molecular motor dynein for ultra-processive transport along microtubules (By similarity).”

ACTB seems to be a regulating gene that regulates transcription and DNA replication and repair of DNA.

Lets now add the means per sample of the IL-27 stimulated control samples next to patient samples and the fold change values.

atLeast1geneEach$controlMeans_IL27_stim <- rowMeans(atLeast1geneEach[,c(6,8)])
atLeast1geneEach$patientMeans_IL27_stim <- rowMeans(atLeast1geneEach[,c(2,4)])
atLeast1geneEach$foldchange_pt_ctrol_stim <- atLeast1geneEach$patientMeans_IL27_stim/atLeast1geneEach$controlMeans_IL27_stim
summary(atLeast1geneEach[14:16])
##  controlMeans_IL27_stim patientMeans_IL27_stim foldchange_pt_ctrol_stim
##  Min.   :     1         Min.   :     1         Min.   :9.000e-04       
##  1st Qu.:    21         1st Qu.:    17         1st Qu.:5.916e-01       
##  Median :   285         Median :   221         Median :7.453e-01       
##  Mean   :  2189         Mean   :  1595         Mean   :1.967e+00       
##  3rd Qu.:  2270         3rd Qu.:  1656         3rd Qu.:9.744e-01       
##  Max.   :155668         Max.   :185504         Max.   :1.855e+04

Just to see what the ENSEMBL name is for IL27RA lets look it up in genecards and see if it is in our data. The genecards ENSEMBL name is ENSG00000104998 for IL27RA. The ENSEMBL name for IL27 is ENSG00000197272. Lets see if our filtered data has the values. The counts are of IL27 and IL27RA as raw data before stimulated and after, just to see if anything stands out.

IL27RA_IL27 <- atLeast1geneEach[atLeast1geneEach$Geneid=="ENSG00000104998" | atLeast1geneEach$Geneid=="ENSG00000197272",]
IL27RA_IL27
##                Geneid LCL_P1_1_Stim_IL27 LCL_P1_1_Basal LCL_P2_Stim_IL27
## 3025  ENSG00000104998               3088           2319             3664
## 17372 ENSG00000197272                  5              2               21
##       LCL_P2_Basal LCL_Ctrl_1_Stim_IL27 LCL_Ctrl_1_Basal LCL_Ctrl_2_Stim_IL27
## 3025          5819                 5308             8085                 6757
## 17372           44                  145               41                  181
##       LCL_Ctrl_2_Basal sampleSumOfCounts basalcontrolMeans basalPatientMeans
## 3025              8204             43244            8144.5              4069
## 17372               60               499              50.5                23
##       foldchange_pt_ctrl_not_stim controlMeans_IL27_stim patientMeans_IL27_stim
## 3025                    0.4996010                 6032.5                   3376
## 17372                   0.4554455                  163.0                     13
##       foldchange_pt_ctrol_stim
## 3025                 0.5596353
## 17372                0.0797546

Lets add the respective Genecards name to its Geneid.

IL27RA_IL27$genecardName <- c("IL27RA","IL27")
IL27RA_IL27
##                Geneid LCL_P1_1_Stim_IL27 LCL_P1_1_Basal LCL_P2_Stim_IL27
## 3025  ENSG00000104998               3088           2319             3664
## 17372 ENSG00000197272                  5              2               21
##       LCL_P2_Basal LCL_Ctrl_1_Stim_IL27 LCL_Ctrl_1_Basal LCL_Ctrl_2_Stim_IL27
## 3025          5819                 5308             8085                 6757
## 17372           44                  145               41                  181
##       LCL_Ctrl_2_Basal sampleSumOfCounts basalcontrolMeans basalPatientMeans
## 3025              8204             43244            8144.5              4069
## 17372               60               499              50.5                23
##       foldchange_pt_ctrl_not_stim controlMeans_IL27_stim patientMeans_IL27_stim
## 3025                    0.4996010                 6032.5                   3376
## 17372                   0.4554455                  163.0                     13
##       foldchange_pt_ctrol_stim genecardName
## 3025                 0.5596353       IL27RA
## 17372                0.0797546         IL27

The data shows that with stimulation of the IL27RA that the patient mean values compared to the control mean values had a fold change of 56% of IL27RA while IL27 had a fold change of 8%. The non-stimulated or Basal means of patient compared to control samples had a fold change of 50% for IL27RA and 46% for IL27. The IL27RA and IL27 increased by around half in fold change comparison of EBV to healthy without IL-27RA stimulation, but after stimulation with IL27RA the IL27 had less gene expression of EBV infected to healthy comparison by 8% while IL27RA foldchange in stimulation with IL27RA increased 56%. This makes sense and aligns with the study by raw counts of gene expression.

Now, moving forward with comparison of top genes to run a machine learning model on so that we can find those genes that change in EBV using basically only a population of 2 people but be optimistic. Since, this study was to find out if an allele that these EBV infected people had was going to slow IL27 activity by T-cells to fight B-cell contamination by EBV that would show increased IL27, they injected a known subunit to IL27 effects in otherwise healthy folk infected with EBV that should enhance the power of fighting infection of EBV. We will just use these genes as a target for their raw counts and compare in the gene activity in Lyme disease non-related to EBV affects and classify a sample as lyme disease various stages or not. We could also use it as suspect gene targets when we get the top genes from the multiple sclerosis data, or compare it to the genes we have in fibromyalgia to see if it can predict myofascial pain or not in those samples as well, but using the relative data sets and also only if they have these genes we find now We will be searching for other data to these EBV lymphoproliferative pathologies like Burkett and Hodgkin lymphoma as well.

Lets get those top genes by ordering our data frame from most to least, take the top 40 and bottom 40 in fold change of patient to control in non-stimulated feature. The stimulated feature was designed to detect the allele that IL27RA has that makes it ineffective in T-cell moderation of autocrine protection in an EBV infection long term or short term affect.

topGenesEBV <- atLeast1geneEach[order(atLeast1geneEach$foldchange_pt_ctrl_not_stim, decreasing=T),]

topGenes80 <- topGenesEBV[c(1:40,20818:20857),]
dim(topGenes80)
## [1] 80 16

write both data frames to csv as well as our filtered 20,857 gene data frame and original geneData data frame of 58,302 observations, and the sampleByType sample ID and sample type data frame.

write.csv(sampleByType_df, 'sampleID_sampletype_header.csv', row.names=F)
write.csv(geneData,'original_58k_gene_EBV_df.csv', row.names=F)
write.csv(topGenes80,'top80_EBV_patient_vs_control_noStim.csv', row.names=F)
write.csv(topGenesEBV,'ordered_EBV_21k_genes_FC_noStim.csv',row.names=F)

I will share those data sets with you individually. There is more to add to this that will be in a separate file once I get the machine learning dataset of top80 genes to use with our fibromyalgia and lyme disease data. Can we build a model to predict the class using their relative data as having fibromyalgia, lyme disease, or EBV infection? Check back in to see what we come up with, we as in me and help from the internet and my skillset.

sampleByType_df geneData topGenes80 topGenesEBV