These samples are the headers added from three Gene Expression Omnibus studies at

The first two studies are part of the same study that used human liver tumor samples in vitro to compare the effects of the coronavirus over time. The third study used human microvascular blood capillaries in vitro to study the effects of the coronavirus over time.

In the first two studies that used the liver tumor samples to examine the effects of the coronavirus in vitro, there were four groups inoculated or treated with the active coronavirus and four groups not inoculated with the active coranavirus, and two samples that were treated with heat inactivated coronavirus, and two samples that were treated with active coronavirus and IL-1alpha to see the gene expression changes over one hour’s time.

In the the third study that used blood capillaries, there were five samples followed over a 0,12,24,36, and 48 hour time intervals in groups A,B,C,D, and E that compared the time interval values of screening for changes in microarray analysis with a mock group of the same.

This following data is the data of all genes in common between these three studies, cleaned to remove missing values and with the attached gene symbols from the GEO platform for the probe IDs.


The libraries used for this script are tidyr, dplyr, and ggplot2.

library(ggplot2)
library(dplyr)
library(tidyr)
library(DT)
library(stringr)

both <- read.csv('both_clean_liver_capillary_CoV.csv', sep=',', header=TRUE, 
                 na.strings=c('',' '))
dim(both)
## [1] 21754    63
colnames(both)
##  [1] "GENE_SYMBOL"                                  
##  [2] "LiverTumorSamples.GSM2359851_CoV1"            
##  [3] "LiverTumorSamples.GSM2359853_CoV2"            
##  [4] "LiverTumorSamples.GSM2359910_CoV3"            
##  [5] "LiverTumorSamples.GSM2359913_CoV4"            
##  [6] "LiverTumorSamples.GSM2359850_ctrl1"           
##  [7] "LiverTumorSamples.GSM2359852_ctrl2"           
##  [8] "LiverTumorSamples.GSM2359911_ctrl3"           
##  [9] "LiverTumorSamples.GSM2359914_ctrl4"           
## [10] "LiverTumorSamples.GSM2359912_Il1"             
## [11] "LiverTumorSamples.GSM2359917_IL2"             
## [12] "LiverTumorSamples.GSM2359915_inactiveHeatCoV1"
## [13] "LiverTumorSamples.GSM2359916_inactiveHeatCoV2"
## [14] "capillarySamples.GSM2685693_MERS_CoV_0hr_A"   
## [15] "capillarySamples.GSM2685694_MERS_CoV_0hr_B"   
## [16] "capillarySamples.GSM2685695_MERS_CoV_0hr_C"   
## [17] "capillarySamples.GSM2685696_MERS_CoV_0hr_D"   
## [18] "capillarySamples.GSM2685697_MERS_CoV_0hr_E"   
## [19] "capillarySamples.GSM2685698_ctrl_0hr_A"       
## [20] "capillarySamples.GSM2685699_ctrl_0hr_B"       
## [21] "capillarySamples.GSM2685700_ctrl_0hr_C"       
## [22] "capillarySamples.GSM2685701_ctrl_0hr_D"       
## [23] "capillarySamples.GSM2685702_ctrl_0hr_E"       
## [24] "capillarySamples.GSM2685703_MERS_CoV_12hr_A"  
## [25] "capillarySamples.GSM2685704_MERS_CoV_12hr_B"  
## [26] "capillarySamples.GSM2685705_MERS_CoV_12hr_C"  
## [27] "capillarySamples.GSM2685706_MERS_CoV_12hr_D"  
## [28] "capillarySamples.GSM2685707_MERS_CoV_12hr_E"  
## [29] "capillarySamples.GSM2685708_ctrl_12hr_A"      
## [30] "capillarySamples.GSM2685709_ctrl_12hr_B"      
## [31] "capillarySamples.GSM2685710_ctrl_12hr_C"      
## [32] "capillarySamples.GSM2685711_ctrl_12hr_D"      
## [33] "capillarySamples.GSM2685712_ctrl_12hr_E"      
## [34] "capillarySamples.GSM2685713_MERS_CoV_24hr_A"  
## [35] "capillarySamples.GSM2685714_MERS_CoV_24hr_B"  
## [36] "capillarySamples.GSM2685715_MERS_CoV_24hr_C"  
## [37] "capillarySamples.GSM2685716_MERS_CoV_24hr_D"  
## [38] "capillarySamples.GSM2685717_MERS_CoV_24hr_E"  
## [39] "capillarySamples.GSM2685718_ctrl_24hr_A"      
## [40] "capillarySamples.GSM2685719_ctrl_24hr_B"      
## [41] "capillarySamples.GSM2685720_ctrl_24hr_C"      
## [42] "capillarySamples.GSM2685721_ctrl_24hr_D"      
## [43] "capillarySamples.GSM2685722_ctrl_24hr_E"      
## [44] "capillarySamples.GSM2685723_MERS_CoV_36hr_A"  
## [45] "capillarySamples.GSM2685724_MERS_CoV_36hr_B"  
## [46] "capillarySamples.GSM2685725_MERS_CoV_36hr_C"  
## [47] "capillarySamples.GSM2685726_MERS_CoV_36hr_D"  
## [48] "capillarySamples.GSM2685727_MERS_CoV_36hr_E"  
## [49] "capillarySamples.GSM2685728_ctrl_36hr_A"      
## [50] "capillarySamples.GSM2685729_ctrl_36hr_B"      
## [51] "capillarySamples.GSM2685730_ctrl_36hr_C"      
## [52] "capillarySamples.GSM2685731_ctrl_36hr_D"      
## [53] "capillarySamples.GSM2685732_ctrl_36hr_E"      
## [54] "capillarySamples.GSM2685733_MERS_CoV_48hr_A"  
## [55] "capillarySamples.GSM2685734_MERS_CoV_48hr_B"  
## [56] "capillarySamples.GSM2685735_MERS_CoV_48hr_C"  
## [57] "capillarySamples.GSM2685736_MERS_CoV_48hr_D"  
## [58] "capillarySamples.GSM2685737_MERS_CoV_48hr_E"  
## [59] "capillarySamples.GSM2685738_ctrl_48hr_A"      
## [60] "capillarySamples.GSM2685739_ctrl_48hr_B"      
## [61] "capillarySamples.GSM2685740_ctrl_48hr_C"      
## [62] "capillarySamples.GSM2685741_ctrl_48hr_D"      
## [63] "capillarySamples.GSM2685742_ctrl_48hr_E"

Lets group the samples that are our columns with descriptive and GEO ID names into their respective groups, get the fold change between the controls from those groups, attach to the original data table, both, as a different names, then order by the genes that have the most fold change then the least fold change. Take the first 100 genes from both lists, combine into one table of 200 genes and the samples with their fold change values ordered, make into a transposed data frame so that the samples are the rows, the stats removed, and the 200 genes are the header columns to save as a machine learning ready file. ***

Liver tumor study control and CoV treated. Also, the IL-alpha treated and the inactive CoV treated tables are in this code block.

names <- both$GENE_SYMBOL

liverCtrl <- both[,c(6:9)]
row.names(liverCtrl) <- names

liverCoV <- both[,c(2:5)]
row.names(liverCoV) <- names

liverIL <- both[,10:11]
row.names(liverIL) <- names

liverIACoV <- both[,12:13]
row.names(liverIACoV) <- names   

Get the row means of those liver samples groups each.

liverCtrl$CtrlMeanLvr <- rowMeans(liverCtrl)
liverCoV$CoVMeanLvr <- rowMeans(liverCoV)
liverIL$ILMeanLvr <- rowMeans(liverIL)
liverIACoV$IACoVMeanLvr <- rowMeans(liverIACoV)

Get the fold change values of those states as a ratio to the control group values.

fold1 <- as.data.frame(cbind(liverCtrl$CtrlMeanLvr,liverCoV$CoVMeanLvr,liverIL$ILMeanLvr,
               liverIACoV$IACoVMeanLvr))
row.names(fold1) <- names
colnames(fold1) <- c('CtrlMeanLvr','CoVMeanLvr','ILMeanLvr','IACoVMeanLvr')

fold1$FC_CoV <- fold1$CoVMeanLvr/fold1$CtrlMeanLvr
fold1$FC_IL <- fold1$ILMeanLvr/fold1$CtrlMeanLvr
fold1$FC_IACov <- fold1$IACoVMeanLvr/fold1$CtrlMeanLvr

Most expressed in liver samples by fold change of the Coronavirus, inactive CoronaVirus, and the IL-alpha treated Coronavirus as tables.

mostCoV <- fold1[order(fold1$FC_CoV, decreasing = TRUE)[0:100],]
mostIL <- fold1[order(fold1$FC_IL, decreasing = TRUE)[0:100],]
mostIACoV <- fold1[order(fold1$FC_IACov, decreasing = TRUE)[0:100],]

Least expressed in liver samples by fold change of the Coronavirus, inactive CoronaVirus, and the IL-alpha treated Coronavirus as tables.

leastCoV <- fold1[order(fold1$FC_CoV, decreasing = FALSE)[0:100],]
leastIL <- fold1[order(fold1$FC_IL, decreasing = FALSE)[0:100],]
leastIACoV <- fold1[order(fold1$FC_IACov, decreasing = FALSE)[0:100],]

Gene Expressions with most changes in the liver samples.

changes <- rbind(mostCoV,mostIL,mostIACoV,leastCoV,leastIL,leastIACoV)
Changes <- changes[!duplicated(row.names(changes)),]
length(unique(row.names(Changes)))
## [1] 600

Get the magnitude of the fold change genes’ row means.

Changes$MagnitudeFCs <- abs(rowMeans(Changes[,5:7]))

Combine this to the samples data for the liver tumor group.

Changes$Gene <- row.names(Changes)
combined1 <- merge(both, Changes, by.x='GENE_SYMBOL', by.y='Gene')

combined2 <- combined1[order(combined1$MagnitudeFCs, decreasing=TRUE),]

CombinedLiver <- combined2[c(0:100,354:453),]

Machine Learning data for liver samples with 200 genes in the group of most gene expression changes.

names1 <- CombinedLiver$GENE_SYMBOL
names2 <- colnames(CombinedLiver)
row.names(CombinedLiver) <- names1

Combo_lvr_ML <- as.data.frame(t(CombinedLiver))

colnames(Combo_lvr_ML) <- gsub('-','_',colnames(Combo_lvr_ML))
Combo1 <- Combo_lvr_ML[c(2:63),] #remove stats of fold change values and gene symbol row

Lets add a class field called Class_Type to use machine learning on predicting class with these 200 genes and 62 mixed samples of capillary and liver tumor both inoculated with Coronavirus.

a <- rep('liver_CoV', 4)
b <- rep('liver_Ctrl',4)
c <- rep('liver_CoV_IL',2)
d <- rep('liver_IA_CoV',2)
e <- rep('capillary_CoV_0hr',5)
f <- rep('capillary_Ctrl_0hr',5)
g <- rep('capillary_Cov_12hr',5)
h <- rep('capillary_Ctrl_12hr',5)
i <- rep('capillary_Cov_24hr',5)
j <- rep('capillary_Ctrl_24hr',5)
k <- rep('capillary_Cov_36hr',5)
l <- rep('capillary_Ctrl_36hr',5)
m <- rep('capillary_Cov_48hr',5)
n <- rep('capillary_Ctrl_48hr',5)

type <- as.data.frame(c(a,b,c,d,e,f,g,h,i,j,k,l,m,n))
colnames(type) <- 'Class_Type'
row.names(type) <- row.names(Combo1)
Combo2 <- cbind(type,Combo1)

Write this ML ready file to csv.

write.csv(Combo2, 'ML_ready_CoV_14_classes.csv', row.names=TRUE)
Combo3 <- datatable(data=Combo2,  rownames=FALSE,
                      filter=list(position='top'),
                      options=list(
                        dom='Bfrtip',
                        buttons=c('colvis','csv','excel'),
                        language=list(sSearch='Filter:')),
                      extensions=c('Buttons','Responsive')
                      )
Combo3
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

Make a separate ML ready file with a smaller set of classes to classify by liver or capillary and control or CoronaVirus

a <- rep('liver', 4)
b <- rep('liver',4)
c <- rep('liver',2)
d <- rep('liver',2)
e <- rep('capillary',5)
f <- rep('capillary',5)
g <- rep('capillary',5)
h <- rep('capillary',5)
i <- rep('capillary',5)
j <- rep('capillary',5)
k <- rep('capillary',5)
l <- rep('capillary',5)
m <- rep('capillary',5)
n <- rep('capillary',5)

type <- as.data.frame(c(a,b,c,d,e,f,g,h,i,j,k,l,m,n))
colnames(type) <- 'Class_Type'
row.names(type) <- row.names(Combo1)

Combo3 <- cbind(type,Combo1)
write.csv(Combo3, 'ML_ready_CoV_2_classes.csv', row.names=TRUE)
a <- rep('CoV', 4)
b <- rep('Ctrl',4)
c <- rep('CoV_IL',2)
d <- rep('IA_CoV',2)
e <- rep('CoV',5)
f <- rep('Ctrl',5)
g <- rep('Cov',5)
h <- rep('Ctrl',5)
i <- rep('Cov',5)
j <- rep('Ctrl',5)
k <- rep('Cov',5)
l <- rep('Ctrl',5)
m <- rep('Cov',5)
n <- rep('Ctrl',5)

type <- as.data.frame(c(a,b,c,d,e,f,g,h,i,j,k,l,m,n))
colnames(type) <- 'Class_Type'
row.names(type) <- row.names(Combo1)

Combo4 <- cbind(type,Combo1)
write.csv(Combo4, 'ML_ready_CoV_4_classes.csv', row.names=TRUE)


We didn’t do any fold change or stat measures on the capillary samples, but we can plot them by using ggplot2 and group the sets by timed intervals for each group A through E and picking a handful of genes to compare over the 0,12,24,36, and 48 hour time intervals for the control group and the Coronavirus inoculated groups.


When the values are a ratio like this, it is easier to see the larger changes as in 9 compared to a low change like 0.0005, but this just means that compared to the control samples the inoculated Coronavirus had 9 times the gene expression values or had downregulated or suppressed gene expression values to 1/5000th the amount of the normal range of gene expression values respectively. ***

It makes sense to use some genes we already know have a higher magnitude of change, and we have a column for that in the CombinedLiver table called MagnitudeFCs that was already sorted from largest to smallest when made. We’ll just select the first five of those genes to compare in these capillary samples over time.

mostChanged <- CombinedLiver[1:5,c(1,71)]
mostSuppressed <- CombinedLiver[196:200,c(1,71)]
row.names(mostChanged)
## [1] "NEURL3" "DUSP1"  "ATF3"   "PCLO"   "LHB"
row.names(mostSuppressed)
## [1] "RASSF7"       "LOC100335030" "C2orf78"      "DEFB1"        "ZNF610"
capillary <- merge(mostChanged, CombinedLiver, by.x='GENE_SYMBOL', by.y='GENE_SYMBOL')
capillary1 <- merge(mostSuppressed, CombinedLiver, by.x='GENE_SYMBOL', by.y='GENE_SYMBOL')
capillaries <- rbind(capillary,capillary1)
Capillaries <- capillaries[,c(1,15:64)]
row.names(Capillaries) <- Capillaries$GENE_SYMBOL

Capillaries2 <- as.data.frame(t(Capillaries))
Capillaries2 <- Capillaries2[-1,]
row.names(Capillaries2) <- gsub('capillarySamples.','',row.names(Capillaries2))
row.names(Capillaries2) <- gsub('GSM[0-9][0-9][0-9][0-9][0-9][0-9][0-9]_','',row.names(Capillaries2))
row.names(Capillaries2) <- gsub('MERS_','', row.names(Capillaries2))
CoV <- grep('CoV', row.names(Capillaries2))
ctrl <- grep('ctrl', row.names(Capillaries2))

Capillaries2$Class <- 'CoV or ctrl'

Capillaries2[CoV,11] <- 'Coronavirus'
Capillaries2[ctrl,11] <- 'control'

A <- grep('_A', row.names(Capillaries2))
B <- grep('_B', row.names(Capillaries2))
C <- grep('_C', row.names(Capillaries2))
D <- grep('_D', row.names(Capillaries2))
E <- grep('_E', row.names(Capillaries2))

Capillaries2$Group <- 'group'

Capillaries2[A,12] <- 'A'
Capillaries2[B,12] <- 'B'
Capillaries2[C,12] <- 'C'
Capillaries2[D,12] <- 'D'
Capillaries2[E,12] <- 'E'

hr0 <- grep('0hr', row.names(Capillaries2))
hr12 <- grep('12hr', row.names(Capillaries2))
hr24 <- grep('24hr', row.names(Capillaries2))
hr36 <- grep('36hr', row.names(Capillaries2))
hr48 <- grep('48hr', row.names(Capillaries2))

Capillaries2$TimeInterval <- 'time'

Capillaries2[hr0,13] <- '0 hr'
Capillaries2[hr12,13] <- '12 hr'
Capillaries2[hr24,13] <- '24 hr'
Capillaries2[hr36,13] <- '36 hr'
Capillaries2[hr48,13] <- '48 hr'
write.csv(Capillaries2,'FC_10_capillaries_CoV.csv', row.names=TRUE)

The above table has 10 genes as the columns with the added Class (Coronavirus or control), Group (A,B,C,D,E), and TimeInterval (0,12,24,36,48 hours) fields to filter by and plot.

Lets make these group tables for the corona virus and see how they compare over time.

A_group <- filter(Capillaries2, Group=='A' & Class == 'Coronavirus')
B_group <- filter(Capillaries2, Group=='B' & Class == 'Coronavirus')
C_group <- filter(Capillaries2, Group=='C' & Class == 'Coronavirus')
D_group <- filter(Capillaries2, Group=='D' & Class == 'Coronavirus')
E_group <- filter(Capillaries2, Group=='E' & Class == 'Coronavirus')

We will do this for the A_group table and ignore the Group and Class fields, because we made it only the A group of the Coronavirus class.

A_group2 <- A_group[,c(1,3,5,7,9,11:13)]
A_tidy <- gather(A_group2, 'Gene','GeneExpression',1:5)
## Warning: attributes are not identical across measure variables;
## they will be dropped
A_tidy$GeneExpression <- round(as.numeric(A_tidy$GeneExpression),1)
A_tidy$TimeInterval <- as.factor(A_tidy$TimeInterval)
A_tidy$Gene <- as.factor(A_tidy$Gene)
A_tidy
##          Class Group TimeInterval   Gene GeneExpression
## 1  Coronavirus     A         0 hr   ATF3           10.3
## 2  Coronavirus     A        12 hr   ATF3            9.8
## 3  Coronavirus     A        24 hr   ATF3           12.4
## 4  Coronavirus     A        36 hr   ATF3           12.3
## 5  Coronavirus     A        48 hr   ATF3           11.2
## 6  Coronavirus     A         0 hr    LHB            9.1
## 7  Coronavirus     A        12 hr    LHB           10.8
## 8  Coronavirus     A        24 hr    LHB           10.1
## 9  Coronavirus     A        36 hr    LHB            9.9
## 10 Coronavirus     A        48 hr    LHB            9.7
## 11 Coronavirus     A         0 hr   PCLO            6.5
## 12 Coronavirus     A        12 hr   PCLO            6.3
## 13 Coronavirus     A        24 hr   PCLO            6.3
## 14 Coronavirus     A        36 hr   PCLO            6.5
## 15 Coronavirus     A        48 hr   PCLO            6.5
## 16 Coronavirus     A         0 hr  DEFB1            7.6
## 17 Coronavirus     A        12 hr  DEFB1            8.1
## 18 Coronavirus     A        24 hr  DEFB1            7.6
## 19 Coronavirus     A        36 hr  DEFB1            8.1
## 20 Coronavirus     A        48 hr  DEFB1            7.3
## 21 Coronavirus     A         0 hr RASSF7            6.8
## 22 Coronavirus     A        12 hr RASSF7            6.6
## 23 Coronavirus     A        24 hr RASSF7            6.7
## 24 Coronavirus     A        36 hr RASSF7            6.7
## 25 Coronavirus     A        48 hr RASSF7            6.5

ggplot(data = A_tidy, aes(x=Gene, y=GeneExpression, fill=TimeInterval)) +
  geom_bar(stat='identity', position=position_dodge())+
  scale_y_continuous(breaks = seq(0, 15, by=1), limits=c(0,15))+
  scale_fill_brewer(palette='Paired') + 
  theme(legend.position="top")+
  #ggtitle('Group A with Coronavirus for Selected Genes Part 1')+
  xlab('Gene')+
  ylab('Gene Expression Values')


The genes above for Part 1 of the group A samples of coronavirus in blood capillaries show some variation in gene expression values for some of these genes that had the most change in the liver tumor samples. Starting at the initial hour up to 48 hours after being inoculated in vitro, there is an increase then decrease for ATF3 and LHB genes, while a decrease then increase close to initial value with PCLO and slightly with RASSF7. For DEFB1, it has a cyclical increase, decrease, increase, then decrease to stabilize closer to the initial gene expressio value.



Now lets find the other five genes in the group A set of ten genes found to have the most change in the liver tumor samples, and examined here in the blood capillary samples.

A_group3 <- A_group[,c(2,4,6,8,10,11:13)]
A_tidy1 <- gather(A_group3, 'Gene','GeneExpression',1:5)
## Warning: attributes are not identical across measure variables;
## they will be dropped
A_tidy1$GeneExpression <- round(as.numeric(A_tidy1$GeneExpression),1)
A_tidy1$TimeInterval <- as.factor(A_tidy1$TimeInterval)
A_tidy1$Gene <- as.factor(A_tidy1$Gene)

gt3 <- datatable(data=A_tidy1,  rownames=FALSE,
                      filter=list(position='top'),
                      options=list(
                        dom='Bfrtip',
                        buttons=c('colvis','csv','excel'),
                        language=list(sSearch='Filter:')),
                      extensions=c('Buttons','Responsive')
                      )
gt3

ggplot(data = A_tidy1, aes(x=Gene, y=GeneExpression, fill=TimeInterval)) +
  geom_bar(stat='identity', position=position_dodge())+
  scale_y_continuous(breaks = seq(0, 15, by=1), limits=c(0,15))+
  scale_fill_brewer(palette='Paired') + 
  theme(legend.position="bottom")+
  ggtitle('Group A with Coronavirus for Selected Genes Part 2')+
  xlab('Gene')+
  ylab('Gene Expression Values')


The above genes in part 2 of the group A Coronavirus samples over 48 hours, shows that the gene expression values increase up to 24 hours then decrease to 48 hours for most of the genes above.



Lets look back at the platforms and the features removed. The sequence field is an interesting field because it can show copy number variants of genes by the genes that are duplicates at other probes from the samples.

Platform13497 <- read.csv('GPL13497-9755-forSequenceFeature-GSE100509.csv', sep=',',
                   na.strings=c('',' '), header=TRUE)
Platform16699 <- read.csv('GPL16699-forSequenceFeatureGSE89166_GSE89160.csv', sep=',',
                          na.strings=c('',' '), header=TRUE)

The features in Platform13497 and the first five listed ID values for this platform:

colnames(Platform13497)
##  [1] "ID"                   "SPOT_ID"              "CONTROL_TYPE"        
##  [4] "REFSEQ"               "GB_ACC"               "GENE"                
##  [7] "GENE_SYMBOL"          "GENE_NAME"            "UNIGENE_ID"          
## [10] "ENSEMBL_ID"           "TIGR_ID"              "ACCESSION_STRING"    
## [13] "CHROMOSOMAL_LOCATION" "CYTOBAND"             "DESCRIPTION"         
## [16] "GO_ID"                "SEQUENCE"

The features in Platform16699 and the first five listed ID values of that platform:

colnames(Platform16699)
##  [1] "ID"                   "COL"                  "ROW"                 
##  [4] "NAME"                 "SPOT_ID"              "CONTROL_TYPE"        
##  [7] "REFSEQ"               "GB_ACC"               "LOCUSLINK_ID"        
## [10] "GENE_SYMBOL"          "GENE_NAME"            "UNIGENE_ID"          
## [13] "ENSEMBL_ID"           "ACCESSION_STRING"     "CHROMOSOMAL_LOCATION"
## [16] "CYTOBAND"             "DESCRIPTION"          "GO_ID"               
## [19] "SEQUENCE"

Lets keep only the ID (Platform16699) or SPOT_ID (Platform13497), GENE_SYMBOL, DESCRIPTION, and SEQUENCE features of both platforms.

P16699 <- Platform16699[,c(1,10,17,19)]
P13497 <- Platform13497[,c(2,7,15,17)]

Lets also remove the incomplete cases in both platforms.

work16699 <- P16699[complete.cases(P16699),]
work13497 <- P13497[complete.cases(P13497),]

Now merge these data sets to their corresponding samples by SPOT_ID.First read in the samples data for each platform and series.

GSE89166_89160 <- read.csv('GSE89166_GSE89160.csv',sep=',', na.strings=c('',' '),
                           header=TRUE)
GSE100509 <- read.csv('GSE100509.csv', sep=',', header=TRUE, 
                      na.strings=c('',' '))

Now merge the series data sets to their respective platforms of gene informational meta features.

Series100509 <- merge(work13497,GSE100509,by.x='SPOT_ID', by.y='ID_REF')
Series89166_89160 <- merge(work16699,GSE89166_89160,by.x='ID', by.y='ID_REF')

Rename the columns of the Series100509 to the 5 groups for each of CoV and Ctrl over 0,12,24,36, and 48 hours.

colnames(Series100509)
##  [1] "SPOT_ID"                  "GENE_SYMBOL"             
##  [3] "DESCRIPTION"              "SEQUENCE"                
##  [5] "GSM2685693_MERS_CoV_0hr"  "GSM2685694_MERS_CoV_0hr" 
##  [7] "GSM2685695_MERS_CoV_0hr"  "GSM2685696_MERS_CoV_0hr" 
##  [9] "GSM2685697_MERS_CoV_0hr"  "GSM2685698_ctrl_0hr"     
## [11] "GSM2685699_ctrl_0hr"      "GSM2685700_ctrl_0hr"     
## [13] "GSM2685701_ctrl_0hr"      "GSM2685702_ctrl_0hr"     
## [15] "GSM2685703_MERS_CoV_12hr" "GSM2685704_MERS_CoV_12hr"
## [17] "GSM2685705_MERS_CoV_12hr" "GSM2685706_MERS_CoV_12hr"
## [19] "GSM2685707_MERS_CoV_12hr" "GSM2685708_ctrl_12hr"    
## [21] "GSM2685709_ctrl_12hr"     "GSM2685710_ctrl_12hr"    
## [23] "GSM2685711_ctrl_12hr"     "GSM2685712_ctrl_12hr"    
## [25] "GSM2685713_MERS_CoV_24hr" "GSM2685714_MERS_CoV_24hr"
## [27] "GSM2685715_MERS_CoV_24hr" "GSM2685716_MERS_CoV_24hr"
## [29] "GSM2685717_MERS_CoV_24hr" "GSM2685718_ctrl_24hr"    
## [31] "GSM2685719_ctrl_24hr"     "GSM2685720_ctrl_24hr"    
## [33] "GSM2685721_ctrl_24hr"     "GSM2685722_ctrl_24hr"    
## [35] "GSM2685723_MERS_CoV_36hr" "GSM2685724_MERS_CoV_36hr"
## [37] "GSM2685725_MERS_CoV_36hr" "GSM2685726_MERS_CoV_36hr"
## [39] "GSM2685727_MERS_CoV_36hr" "GSM2685728_ctrl_36hr"    
## [41] "GSM2685729_ctrl_36hr"     "GSM2685730_ctrl_36hr"    
## [43] "GSM2685731_ctrl_36hr"     "GSM2685732_ctrl_36hr"    
## [45] "GSM2685733_MERS_CoV_48hr" "GSM2685734_MERS_CoV_48hr"
## [47] "GSM2685735_MERS_CoV_48hr" "GSM2685736_MERS_CoV_48hr"
## [49] "GSM2685737_MERS_CoV_48hr" "GSM2685738_ctrl_48hr"    
## [51] "GSM2685739_ctrl_48hr"     "GSM2685740_ctrl_48hr"    
## [53] "GSM2685741_ctrl_48hr"     "GSM2685742_ctrl_48hr"
group <- rep(1:5,10)
Group <- gsub('1','Group_A',group)
Group <- gsub('2','Group_B',Group)
Group <- gsub('3','Group_C',Group)
Group <- gsub('4','Group_D',Group)
Group <- gsub('5','Group_E',Group)

names <- colnames(Series100509)[5:54]

Names <- paste(names,Group,sep='_')

newNames <- gsub('_MERS','', Names)
newNames
##  [1] "GSM2685693_CoV_0hr_Group_A"   "GSM2685694_CoV_0hr_Group_B"  
##  [3] "GSM2685695_CoV_0hr_Group_C"   "GSM2685696_CoV_0hr_Group_D"  
##  [5] "GSM2685697_CoV_0hr_Group_E"   "GSM2685698_ctrl_0hr_Group_A" 
##  [7] "GSM2685699_ctrl_0hr_Group_B"  "GSM2685700_ctrl_0hr_Group_C" 
##  [9] "GSM2685701_ctrl_0hr_Group_D"  "GSM2685702_ctrl_0hr_Group_E" 
## [11] "GSM2685703_CoV_12hr_Group_A"  "GSM2685704_CoV_12hr_Group_B" 
## [13] "GSM2685705_CoV_12hr_Group_C"  "GSM2685706_CoV_12hr_Group_D" 
## [15] "GSM2685707_CoV_12hr_Group_E"  "GSM2685708_ctrl_12hr_Group_A"
## [17] "GSM2685709_ctrl_12hr_Group_B" "GSM2685710_ctrl_12hr_Group_C"
## [19] "GSM2685711_ctrl_12hr_Group_D" "GSM2685712_ctrl_12hr_Group_E"
## [21] "GSM2685713_CoV_24hr_Group_A"  "GSM2685714_CoV_24hr_Group_B" 
## [23] "GSM2685715_CoV_24hr_Group_C"  "GSM2685716_CoV_24hr_Group_D" 
## [25] "GSM2685717_CoV_24hr_Group_E"  "GSM2685718_ctrl_24hr_Group_A"
## [27] "GSM2685719_ctrl_24hr_Group_B" "GSM2685720_ctrl_24hr_Group_C"
## [29] "GSM2685721_ctrl_24hr_Group_D" "GSM2685722_ctrl_24hr_Group_E"
## [31] "GSM2685723_CoV_36hr_Group_A"  "GSM2685724_CoV_36hr_Group_B" 
## [33] "GSM2685725_CoV_36hr_Group_C"  "GSM2685726_CoV_36hr_Group_D" 
## [35] "GSM2685727_CoV_36hr_Group_E"  "GSM2685728_ctrl_36hr_Group_A"
## [37] "GSM2685729_ctrl_36hr_Group_B" "GSM2685730_ctrl_36hr_Group_C"
## [39] "GSM2685731_ctrl_36hr_Group_D" "GSM2685732_ctrl_36hr_Group_E"
## [41] "GSM2685733_CoV_48hr_Group_A"  "GSM2685734_CoV_48hr_Group_B" 
## [43] "GSM2685735_CoV_48hr_Group_C"  "GSM2685736_CoV_48hr_Group_D" 
## [45] "GSM2685737_CoV_48hr_Group_E"  "GSM2685738_ctrl_48hr_Group_A"
## [47] "GSM2685739_ctrl_48hr_Group_B" "GSM2685740_ctrl_48hr_Group_C"
## [49] "GSM2685741_ctrl_48hr_Group_D" "GSM2685742_ctrl_48hr_Group_E"

Change the column names in Series100509 to the new column names identifying which group the samples is from in A:E.

colnames(Series100509)[5:54] <- newNames
colnames(Series100509)
##  [1] "SPOT_ID"                      "GENE_SYMBOL"                 
##  [3] "DESCRIPTION"                  "SEQUENCE"                    
##  [5] "GSM2685693_CoV_0hr_Group_A"   "GSM2685694_CoV_0hr_Group_B"  
##  [7] "GSM2685695_CoV_0hr_Group_C"   "GSM2685696_CoV_0hr_Group_D"  
##  [9] "GSM2685697_CoV_0hr_Group_E"   "GSM2685698_ctrl_0hr_Group_A" 
## [11] "GSM2685699_ctrl_0hr_Group_B"  "GSM2685700_ctrl_0hr_Group_C" 
## [13] "GSM2685701_ctrl_0hr_Group_D"  "GSM2685702_ctrl_0hr_Group_E" 
## [15] "GSM2685703_CoV_12hr_Group_A"  "GSM2685704_CoV_12hr_Group_B" 
## [17] "GSM2685705_CoV_12hr_Group_C"  "GSM2685706_CoV_12hr_Group_D" 
## [19] "GSM2685707_CoV_12hr_Group_E"  "GSM2685708_ctrl_12hr_Group_A"
## [21] "GSM2685709_ctrl_12hr_Group_B" "GSM2685710_ctrl_12hr_Group_C"
## [23] "GSM2685711_ctrl_12hr_Group_D" "GSM2685712_ctrl_12hr_Group_E"
## [25] "GSM2685713_CoV_24hr_Group_A"  "GSM2685714_CoV_24hr_Group_B" 
## [27] "GSM2685715_CoV_24hr_Group_C"  "GSM2685716_CoV_24hr_Group_D" 
## [29] "GSM2685717_CoV_24hr_Group_E"  "GSM2685718_ctrl_24hr_Group_A"
## [31] "GSM2685719_ctrl_24hr_Group_B" "GSM2685720_ctrl_24hr_Group_C"
## [33] "GSM2685721_ctrl_24hr_Group_D" "GSM2685722_ctrl_24hr_Group_E"
## [35] "GSM2685723_CoV_36hr_Group_A"  "GSM2685724_CoV_36hr_Group_B" 
## [37] "GSM2685725_CoV_36hr_Group_C"  "GSM2685726_CoV_36hr_Group_D" 
## [39] "GSM2685727_CoV_36hr_Group_E"  "GSM2685728_ctrl_36hr_Group_A"
## [41] "GSM2685729_ctrl_36hr_Group_B" "GSM2685730_ctrl_36hr_Group_C"
## [43] "GSM2685731_ctrl_36hr_Group_D" "GSM2685732_ctrl_36hr_Group_E"
## [45] "GSM2685733_CoV_48hr_Group_A"  "GSM2685734_CoV_48hr_Group_B" 
## [47] "GSM2685735_CoV_48hr_Group_C"  "GSM2685736_CoV_48hr_Group_D" 
## [49] "GSM2685737_CoV_48hr_Group_E"  "GSM2685738_ctrl_48hr_Group_A"
## [51] "GSM2685739_ctrl_48hr_Group_B" "GSM2685740_ctrl_48hr_Group_C"
## [53] "GSM2685741_ctrl_48hr_Group_D" "GSM2685742_ctrl_48hr_Group_E"

Remove the ID and SPOT_ID fields of the probe labels that won’t be needed for the analysis.

Series89 <- Series89166_89160[,-1]
Series100 <- Series100509[,-c(1,3,4)]

Now combine the series together by genes in common.

ComboLiverCapillarySequences <- merge(Series89,Series100, by.x='GENE_SYMBOL',
                                      by.y='GENE_SYMBOL')

There should be different genotypes or copy number variations in the SEQUENCE feature column, which will identify which nucleotide jumps, is deleted, rearranged in the gene expressions. This time around lets group by SEQUENCE to see how many genotypes there are in all the genes. This could give more information in the analysis of any genotypes of genes could be more susceptible to pathogenesis of CoV or immunity to it. ***



This next portion of this analysis shows the top five genes in the data, the number of genotypes or copy number variations of the nucleotide sequences are in each gene, and the gene name. This could be useful to determine how well the genotypes within these five genes that were expressed more than the other 65k genes were when analyzing the effects of CoV, inactive CoV, CoV treated with an interleukin, and the control samples.

SeqGroup <- ComboLiverCapillarySequences %>% group_by(GENE_SYMBOL) %>% count(n=n())
SeqGroup <- SeqGroup[order(SeqGroup$n, decreasing=TRUE),-3]

SeqGroup5 <- SeqGroup[1:5,]

genes5sequences <- merge(SeqGroup5,ComboLiverCapillarySequences,
                         by.x='GENE_SYMBOL', by.y='GENE_SYMBOL')

genotypes5 <- genes5sequences %>% group_by(SEQUENCE) %>% count(n=n)

genotypes5$n <- as.factor(genotypes5$n)
SeqGroup5$n <- as.factor(SeqGroup5$n)

genotypes5_1 <- merge(genotypes5, SeqGroup5, by.x='n', by.y='n')
colnames(genotypes5_1)[c(1,3)] <- c('geneCount','genotypeCount')
head(genotypes5_1)
##   geneCount                                                     SEQUENCE
## 1       255 AAACTTACTCCAGAGCTCCTTGTGCATCTGACCAGCACCATCGACAGAATAAACACAGAA
## 2       255 AACAGAGTCCTCAGGGAAGAAAATCGAAGACTTCAGGCTCAACTGAGTCATGTTTCCAGA
## 3       255 GCACCTGTGTTCTTTGAGTTCACATCATGAATGTGGTGATTTCCCAGATACCATCTCAGG
## 4       255 ATGGGGTGCTCTGGGGAAATATTGGAGGGTCATCCATTCCACATTAAAAGAGCAAGTTGT
## 5       255 AAGTGCTTGGAAATACTTGGGTGAATGTTACCAGACTCCTTCTCTCTCAGCTTACAGCCT
## 6       255 AATCTACGAGGCACTTTATGGCAATTCCAAGAAGGGGCTGAAAGGTATGTGTTCTTCTCC
##   genotypeCount GENE_SYMBOL
## 1            15     PDE4DIP
## 2            15     PDE4DIP
## 3            15     PDE4DIP
## 4            15     PDE4DIP
## 5            15     PDE4DIP
## 6            15     PDE4DIP

Write the files above out to csv.

write.csv(ComboLiverCapillarySequences, 'SequencesBothCleaned.csv',row.names=FALSE)

Write this large file into three subsets of three tiers because it is more than 57mb in size, and I want to load it into github.They would be read in separately, then they would be row binded in order of tier1, tier2, and then tier 3: - ComboLiverCapillarySequences <- rbind(SequencesBothCleaned_tier1,SequencesBothCleaned_tier2,SequencesBothCleaned_tier3)

write.csv(ComboLiverCapillarySequences[1:21000,],
          'SequencesBothCleaned_tier1.csv',row.names=FALSE)
write.csv(ComboLiverCapillarySequences[21001:42000,],
          'SequencesBothCleaned_tier2.csv',row.names=FALSE)
write.csv(ComboLiverCapillarySequences[42001:65505,],
          'SequencesBothCleaned_tier3.csv',row.names=FALSE)
write.csv(genotypes5_1, 'Genotypes_5_1.csv', row.names=FALSE)
ComboCNV <- ComboLiverCapillarySequences[,-c(1,2)]
copynumbers <- merge(genotypes5_1,ComboCNV,
                     by.x='SEQUENCE', by.y='SEQUENCE')
CNV <- copynumbers[!duplicated(copynumbers$SEQUENCE),]
write.csv(CNV, 'copyNumbers.csv', row.names=FALSE)
colnames(CNV)
##  [1] "SEQUENCE"                     "geneCount"                   
##  [3] "genotypeCount"                "GENE_SYMBOL"                 
##  [5] "GSM2359851_CoV1"              "GSM2359853_CoV2"             
##  [7] "GSM2359910_CoV3"              "GSM2359913_CoV4"             
##  [9] "GSM2359850_ctrl1"             "GSM2359852_ctrl2"            
## [11] "GSM2359911_ctrl3"             "GSM2359914_ctrl4"            
## [13] "GSM2359912_Il1"               "GSM2359917_IL2"              
## [15] "GSM2359915_inactiveHeatCoV1"  "GSM2359916_inactiveHeatCoV2" 
## [17] "GSM2685693_CoV_0hr_Group_A"   "GSM2685694_CoV_0hr_Group_B"  
## [19] "GSM2685695_CoV_0hr_Group_C"   "GSM2685696_CoV_0hr_Group_D"  
## [21] "GSM2685697_CoV_0hr_Group_E"   "GSM2685698_ctrl_0hr_Group_A" 
## [23] "GSM2685699_ctrl_0hr_Group_B"  "GSM2685700_ctrl_0hr_Group_C" 
## [25] "GSM2685701_ctrl_0hr_Group_D"  "GSM2685702_ctrl_0hr_Group_E" 
## [27] "GSM2685703_CoV_12hr_Group_A"  "GSM2685704_CoV_12hr_Group_B" 
## [29] "GSM2685705_CoV_12hr_Group_C"  "GSM2685706_CoV_12hr_Group_D" 
## [31] "GSM2685707_CoV_12hr_Group_E"  "GSM2685708_ctrl_12hr_Group_A"
## [33] "GSM2685709_ctrl_12hr_Group_B" "GSM2685710_ctrl_12hr_Group_C"
## [35] "GSM2685711_ctrl_12hr_Group_D" "GSM2685712_ctrl_12hr_Group_E"
## [37] "GSM2685713_CoV_24hr_Group_A"  "GSM2685714_CoV_24hr_Group_B" 
## [39] "GSM2685715_CoV_24hr_Group_C"  "GSM2685716_CoV_24hr_Group_D" 
## [41] "GSM2685717_CoV_24hr_Group_E"  "GSM2685718_ctrl_24hr_Group_A"
## [43] "GSM2685719_ctrl_24hr_Group_B" "GSM2685720_ctrl_24hr_Group_C"
## [45] "GSM2685721_ctrl_24hr_Group_D" "GSM2685722_ctrl_24hr_Group_E"
## [47] "GSM2685723_CoV_36hr_Group_A"  "GSM2685724_CoV_36hr_Group_B" 
## [49] "GSM2685725_CoV_36hr_Group_C"  "GSM2685726_CoV_36hr_Group_D" 
## [51] "GSM2685727_CoV_36hr_Group_E"  "GSM2685728_ctrl_36hr_Group_A"
## [53] "GSM2685729_ctrl_36hr_Group_B" "GSM2685730_ctrl_36hr_Group_C"
## [55] "GSM2685731_ctrl_36hr_Group_D" "GSM2685732_ctrl_36hr_Group_E"
## [57] "GSM2685733_CoV_48hr_Group_A"  "GSM2685734_CoV_48hr_Group_B" 
## [59] "GSM2685735_CoV_48hr_Group_C"  "GSM2685736_CoV_48hr_Group_D" 
## [61] "GSM2685737_CoV_48hr_Group_E"  "GSM2685738_ctrl_48hr_Group_A"
## [63] "GSM2685739_ctrl_48hr_Group_B" "GSM2685740_ctrl_48hr_Group_C"
## [65] "GSM2685741_ctrl_48hr_Group_D" "GSM2685742_ctrl_48hr_Group_E"

This feature, SEQUENCE, provides the single nucleotide polymorphisms (SNP)s or copy number variants of each gene’s DNA, many are duplicated. Here are the first few SNPs.

head(CNV$SEQUENCE)
## [1] AAACTTACTCCAGAGCTCCTTGTGCATCTGACCAGCACCATCGACAGAATAAACACAGAA
## [2] AACAGAGTCCTCAGGGAAGAAAATCGAAGACTTCAGGCTCAACTGAGTCATGTTTCCAGA
## [3] AAGGATTTGCTTATAAGGGTTCCTGCTTTCACAAAATTATTCCAGGTTTTATGTGTCAGG
## [4] AAGGATTTGGTTGTAAGGGCTCCCGCTTTCACAGAATTATTCCAGGGTTTATGTGTCAGG
## [5] AAGTGCTTGGAAATACTTGGGTGAATGTTACCAGACTCCTTCTCTCTCAGCTTACAGCCT
## [6] AATCTACGAGGCACTTTATGGCAATTCCAAGAAGGGGCTGAAAGGTATGTGTTCTTCTCC
## 50571 Levels: AAACAAAAAACAGGTTAAGAAAATTACTTGGGTGGGCAGACTTAGGAACGCTCTACTCGG ...


It would be interesting in the fold change values between genotypes of the genes expressed in comparing the capillary samples all within 1 hour after being inoculated with CoV, inactive CoV, CoV and an interleukin alpha, or control group. Then compare the liver tumor samples of each group A through E that was monitored after being inoculated with CoV at 0,12,24,36, and 48 hours side by side with the control groups of groups A through E.

We could also detect patterns to analyze those sequences of copy number variations within the top 5 genes expressed the most number of times in this data. Comparing networks of genes associated with processes in the body like immune response, pathogenesis of disease onset, networks of human processes in the body associated with cancer or subsequent diseases like autoimmune, celiac disease, hemochromatosis, anemia, etc. To see how well any of those genotypes fair.

We have the five genes that had the highest copy number variations within them and merged it with our liver tumor and capillary CoV and control samples in the CNV data table.



I joinged genecards.org and found that this site has some useful information on gene networks for analyzing genes that play a role in a network of genes that function at some level on the human anatomy. Such as anemia, cannabidiol,celiac disease, diabetes type 1 and 2, hemochromatosis, immunity, pain, and tumorigenesis. I downloaded the csv files for each of these networks just mentioned. There is a ranking of each gene as importance the higher the ranked order in the network functional role on the human body.

We can import each of those now.

anemiaNetwork <- read.csv('GeneCards-SearchResults-anemiaNetwork.csv', sep=',',
                          header=TRUE, na.strings=c('',' '))

cannabidiolNetwork <- read.csv('GeneCards-SearchResults-cannabidiolNetwork.csv', sep=',',
                          header=TRUE, na.strings=c('',' '))
celiacDiseaseNetwork <- read.csv('GeneCards-SearchResults-celiacDiseaseNetwork.csv',
                                 sep=',', header=TRUE, na.strings=c('',' '))
diabetesType1Network <- read.csv('GeneCards-SearchResults-diabetesType1Network.csv',
                                 sep=',', header=TRUE, na.strings=c('',' '))
diabetesType2Network <- read.csv('GeneCards-SearchResults-diabetesType2Network.csv',
                                 sep=',', header=TRUE, na.strings=c('',' '))
hemochromatosisNetwork <- read.csv('GeneCards-SearchResults-hemochromatosisNetwork.csv', 
                          sep=',',header=TRUE, na.strings=c('',' '))
immunityNetwork <- read.csv('GeneCards-SearchResults-immunityNetwork.csv', 
                          sep=',',header=TRUE, na.strings=c('',' '))
painNetwork <- read.csv('GeneCards-SearchResults-painNetwork.csv', 
                          sep=',', header=TRUE, na.strings=c('',' '))
tumorigenesisNetwork <- read.csv('GeneCards-SearchResults-tumorigenesisNetwork.csv', sep=',',
                          header=TRUE, na.strings=c('',' '))

We could then take our combined data from the liver and capillary datasets to look at the copy number variants of genotypes within these networks separately to understand how or if these genes have a high number of copy number in control versus diseased and diseased/treated states.

Lets take the top five genes from each network, add a column feature to label the gene network, then combine those tables. They are already ordered each by Relevance.score.

amenia5 <- anemiaNetwork[1:5,]
cannabidiol5 <- cannabidiolNetwork[1:5,]
diabetesOne5 <- diabetesType1Network[1:5,]
diabetesTwo5 <- diabetesType2Network[1:5,]
hemochromatosis5 <- hemochromatosisNetwork[1:5,]
immunity5 <- immunityNetwork[1:5,]
pain5 <- painNetwork[1:5,]
tumorigenesis5 <- tumorigenesisNetwork[1:5,]

Lets merge each network disease pathway with the ComboLiverCapillarySequences data table.

anemia5_path <- merge(amenia5, ComboLiverCapillarySequences, by.x='Gene.Symbol',
                      by.y='GENE_SYMBOL')
cannabidiol5_path <- merge(cannabidiol5, ComboLiverCapillarySequences, by.x='Gene.Symbol',
                      by.y='GENE_SYMBOL')
diabetesOne5_path <- merge(diabetesOne5, ComboLiverCapillarySequences, by.x='Gene.Symbol',
                      by.y='GENE_SYMBOL')
diabetesTwo5_path <- merge(diabetesTwo5, ComboLiverCapillarySequences, by.x='Gene.Symbol',
                      by.y='GENE_SYMBOL')
hemochromatosis5_path <- merge(hemochromatosis5, ComboLiverCapillarySequences, by.x='Gene.Symbol',
                      by.y='GENE_SYMBOL')
immunity5_path <- merge(immunity5, ComboLiverCapillarySequences, by.x='Gene.Symbol',
                      by.y='GENE_SYMBOL')
pain5_path <- merge(pain5, ComboLiverCapillarySequences, by.x='Gene.Symbol',
                      by.y='GENE_SYMBOL')
tumorigenesis5_path <- merge(tumorigenesis5, ComboLiverCapillarySequences, by.x='Gene.Symbol',
                      by.y='GENE_SYMBOL')

Combine the above data tables together.

Network <- as.data.frame(c(rep('anemia',length(anemia5_path$Gene.Symbol)),
                         rep('cannabidiol',length(cannabidiol5_path$Gene.Symbol)),
                         rep('diabetes 1',length(diabetesOne5_path$Gene.Symbol)),
                         rep('diabetes 2', length(diabetesTwo5_path$Gene.Symbol)),
                         rep('hemochromatosis', length(hemochromatosis5_path$Gene.Symbol)),
                         rep('immunity', length(immunity5_path$Gene.Symbol)),
                         rep('pain', length(pain5_path$Gene.Symbol)),
                         rep('tumorigenesis',length(tumorigenesis5_path$Gene.Symbol))))
colnames(Network) <- 'geneNetwork'

networkPathwayGenes <- rbind(anemia5_path,cannabidiol5_path,diabetesOne5_path,                             diabetesTwo5_path,hemochromatosis5_path,immunity5_path,
                             pain5_path,tumorigenesis5_path)
PathwayGenes <- cbind(Network,networkPathwayGenes)

Now group by gene for the count and then by sequence for the copy number variant count within the gene.

pathGenesCount <- PathwayGenes %>% group_by(Gene.Symbol) %>% count(n())
pathGenotypeCount <- PathwayGenes %>% group_by(SEQUENCE) %>% count(n())

pathGenes <- pathGenesCount[,-2]
colnames(pathGenes)[2] <- 'GeneCount'
pathGenos <- pathGenotypeCount[,-2]
colnames(pathGenos)[2] <- 'GenotypeCount'
pG <- merge(pathGenos, PathwayGenes, by.x='SEQUENCE', by.y='SEQUENCE')
pG1 <- merge(pathGenes, pG, by.x='Gene.Symbol', by.y='Gene.Symbol')
colnames(pG1)[c(13,14,17,18)] <- paste('Study1',colnames(pG1)[c(13,14,17,18)],
                                       sep='_')
write.csv(pG1, 'genes-genotypes-networks.csv', row.names=FALSE)
pG2 <- datatable(data=pG1,  rownames=FALSE,
                      #colnames=pretty_headers,
                      filter=list(position='top'),
                      options=list(
                        dom='Bfrtip',
                        buttons=c('colvis','csv','excel'),
                        language=list(sSearch='Filter:')),
                      extensions=c('Buttons','Responsive')
                      )
pG2
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

I want to remove and hold it separate for one of the liver tumor samples studies that uses CoV1, CoV2, ctrl1, and ctrl2, because the values are not scaled the same and these numbers will ruin the analysis included as is. They were already log scaled but seem to be much larger than the other values after scaling.The previous block of code added ‘Study1’ to those larger scaled samples to extract for separate analysis.

scaleSet <- pG1[,c(1:14,17,18)]
otherSet <- pG1[,c(1:12,15,16,19:74)]

Lets add in that the scaleSet columns are from study1 and leave the study2 column names in the otherSet as is. Combine and make a separate data set. I have previously examined cannabidiol in sebum/acne and brain tumor samples to briefly determine the CBD effects on hormone and immune response genes. I added this cannabidiol network to get the top five CBD genes by relative score in the CBD network. We can look at this scale set of CoV and ctrl samples from liver tumors to see if there are any obvious effects to this network that has shown to treat chronic pain in some folks.

Lets get the data set of cannabidiol, CBD, to compare CoV to mock or control liver tumor samples from the larger scaled data set.

CBD <- subset(scaleSet, scaleSet$geneNetwork == 'cannabidiol')
orderCBD <- CBD[order(CBD$Gene.Symbol, decreasing=TRUE),]

We can see how many geno types are unique to a gene by adding up the genotypes to equal the genes in count. For instance, the PPARG gene has 22 counts of that gene in the data with one genotype having 22 counts, and the other genotype for that gene having only 2 counts. So that there are two unique genotypes for the gene PPARG. Lets go ahead and use dplyr to group by SEQUENCE then take the mean of that sequence but for the CoV and ctrl groups separately.

The two separate CoV and Ctrl datasets for the five CBD genes with the highest count of occurence in the data.

CBD_CoV <- CBD[,-c(15,16)]

CBD_ctrl <- CBD[,-c(13,14)]

samples <- as.vector(colnames(CBD_CoV)[13:14])
samples1 <- as.vector(colnames(CBD_ctrl)[13:14])

CBD_CovMeans <- CBD_CoV %>% group_by(SEQUENCE) %>% summarise_at(vars(samples),
                                                               mean)
CBD_CtrlMeans <- CBD_ctrl %>% group_by(SEQUENCE) %>% summarise_at(vars(samples1),
                                                               mean)

It seems a small data set, but with this network, CBD network, there was only one gene with more than one genotype. This is why there are only six genotypes when we selected five genes.

unique(CBD$SEQUENCE)
## [1] TGTACTAGGCCTACTGGGGATCAGAGTTCCCAAGAAAGGAAACCTTTTCTTGTATCTGGA
## [2] AGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGCAACTAGACGCAGCCCGCAGGCAGC
## [3] CCAAGGCTTCATGACAAGGGAGTTTCTAAAGAGCCTGCGAAAGCCTTTTGGTGACTTTAT
## [4] ACAATCAGATTGAAGCTTATCTATGACAGATGTGATCTTAACTGTCGGATCCACAAAAAA
## [5] TGAGATATTTAAGGTTGAATGTTTGTCCTTAGGATAGGCCTATGTGCTAGCCCACAAAGA
## [6] GGGGTATCCTGGGGGACCCAATGTAGGAGCTGCCTTGGCTCAGACATGTTTTCCGTGAAA
## 50571 Levels: AAACAAAAAACAGGTTAAGAAAATTACTTGGGTGGGCAGACTTAGGAACGCTCTACTCGG ...

Lets combine the data means for each group by genotype or unique sequence with the genes they belong to from two features of the CBD data set, Gene.Symbol and SEQUENCE.

geneSeq <- CBD[,c(1:5)]
geneSeq1 <- merge(geneSeq, CBD_CovMeans, by.x='SEQUENCE', by.y='SEQUENCE')
geneSeq2 <- merge(geneSeq1, CBD_CtrlMeans, by.x='SEQUENCE', by.y='SEQUENCE')

geneSeq3 <- geneSeq2[!duplicated(geneSeq2),]

geneSeq4 <- datatable(data=geneSeq3,  rownames=FALSE,
                      #colnames=pretty_headers,
                      filter=list(position='top'),
                      options=list(
                        dom='Bfrtip',
                        buttons=c('colvis','csv','excel'),
                        language=list(sSearch='Filter:')),
                      extensions=c('Buttons','Responsive')
                      )
geneSeq4
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

The above table shows the mean value of each genotype in the four samples, where two samples are liver tumor inoculated with CoV, and the other two are the liver tumor samples without CoV with both having been screened after 1 hour.



Since we want the fold change values within each geno type of the CBD network, lets get the means of each type of either CoV or ctrl. This is now the collective sample means of the genotype means per unique sample of either CoV or ctrl.

geneSeq3$CoV_Genotype_Mean <- rowMeans(geneSeq3[,6:7])
geneSeq3$Ctrl_Geneotype_Mean <- rowMeans(geneSeq3[,8:9])
geneSeq3[,c(1,2,10,11)]
##                                                        SEQUENCE Gene.Symbol
## 1  ACAATCAGATTGAAGCTTATCTATGACAGATGTGATCTTAACTGTCGGATCCACAAAAAA       PPARG
## 3  AGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGCAACTAGACGCAGCCCGCAGGCAGC         INS
## 4  CCAAGGCTTCATGACAAGGGAGTTTCTAAAGAGCCTGCGAAAGCCTTTTGGTGACTTTAT       PPARG
## 24 GGGGTATCCTGGGGGACCCAATGTAGGAGCTGCCTTGGCTCAGACATGTTTTCCGTGAAA         TNF
## 25 TGAGATATTTAAGGTTGAATGTTTGTCCTTAGGATAGGCCTATGTGCTAGCCCACAAAGA       PTGS2
## 35 TGTACTAGGCCTACTGGGGATCAGAGTTCCCAAGAAAGGAAACCTTTTCTTGTATCTGGA        CNR1
##    CoV_Genotype_Mean Ctrl_Geneotype_Mean
## 1          274.69375           386.93375
## 3           11.60250            17.92250
## 4          966.77644          1253.19162
## 24          13.59375            11.72750
## 25          11.61744            11.00081
## 35          74.95375            56.71875

Looking at the above, both genotypes of PPARG decrease in CoV treated samples as well as insulin, INS. But tumor necrosis factor or TNF, prostaglandin-Endoperoxide Synthase 2 or PTGS2 and Cannabinoid Receptor 1 or CNR1 all increase in CoV treated samples. This could mean that in liver tumor samples that are inoculated with Coronavirus, more tumor suppressant or destroyer called TNF is activated, more pain reliever activation in the increased CNR1, and less insulin is produced. Insulin is a diabetic factor that plays a role in glucose processing for energy. If less insulin is produced after one hour, this could mean the body doesn’t want to feed a contaminant it is recognizing in the body by lessening the need for the body to consume more glucose. An aside on the insulin and glucose relationship in regulating the body: Insulin and glucose are supposed to be balanced, where the more glucose is burned by the body for energy will lead to more insulin produced to compensate energy needs. The two types of diabetes controlled by insulin is either an overproduction of insulin that leads to an irregular need to ead more than normal amounts of food to get more glucose in the blood, or an underproduction of insulin that leads to an inadequate amount of glucose the body needs, loss of appetite and irregularly small amounts of food consumed.

Immediately, some more analysis to answer questions of these assertions on the results above can be further confirmed by:

1.) Analyzing the time sequence of the blood capillary samples inoculated with CoV and their respective control groups. Examining how these CBD genes are effected over time.

2.) Confirming that insulin is effected by CoV by looking at the diabetic gene networks for type 1 and type 2 diabetes.

3.) Also, looking at the tumorigenesis gene network to see how the genes that are involved in tumor growth in the body react to CoV in the liver tumor samples and the blood capillary samples.



Lets put an interesting bar chart into this script to show what we have so far in the CBD genes network for liver tumor samples inoculated over one hour’s time. Lets use the mutate function of dplyr to create a new feature that will name each gene with more than one genotype or copy number variant in this data. Then use tidyr to gather the sample means per genotype. And finally, plot the CoV and ctrl means using ggplot2.

CBD_plot <- geneSeq3 %>% mutate(geneClass = paste(Gene.Symbol,GenotypeCount,
                                                  sep='_'))
CBD_plot2 <- gather(CBD_plot, 'Sample','MeanValue',10:11)


ggplot(data = CBD_plot2, aes(x=geneClass, y=MeanValue, fill=Sample)) +
  geom_bar(stat='identity', position=position_dodge())+
  scale_y_continuous(breaks = seq(0, 1300, by=100), limits=c(0,1300))+
  scale_fill_brewer(palette='Paired') + 
  #ggtitle('CBD Network Gene Expression Values Treated with Coronavirus')+
  theme(legend.position="top")+
  #xlab('Gene and Genotype or Single Nucleotide Polymorphism or Copy Number Variant')+
  ylab('Gene Expression Mean Values')

The lower values aren’t as easy to distinguish as the scale of change in Mean values is significantly greater for some genes. We could fix this by making the plot fold change and add that field, or adding a log2 field of the means to put on the same scale for visualization purposes. Or cut off the scale for those less than 100 in Mean Values and those greater than 100.



The following is the genes less than 100 for Mean genotype expression values.

ggplot(data = CBD_plot2, aes(x=geneClass, y=MeanValue, fill=Sample)) +
  geom_bar(stat='identity', position=position_dodge())+
  scale_y_continuous(breaks = seq(0, 100, by=100), limits=c(0,100))+
  scale_fill_brewer(palette='Paired') + 
  theme(legend.position="bottom")+
  #ggtitle('CBD Network Gene Expression Values Less than 100 Treated with CoV')+
  ylab('Gene Expression Mean Values')



The following is the genes greater than 100 for Mean genotype expression values:

ggplot(data = CBD_plot2, aes(x=geneClass, y=MeanValue, fill=Sample)) +
  geom_bar(stat='identity', position=position_dodge())+
  scale_y_continuous(breaks = seq(0, 1300, by=100), limits=c(0,1300))+
  scale_fill_brewer(palette='Paired') + 
  geom_hline(yintercept=100, linetype="dashed", color = "red")+
  theme(legend.position="bottom")+
  #ggtitle('CBD Network Gene levels > 100 Treated with CoV')+
  ylab('Gene Expression Mean Values')



The following is the fold change chart.

CBD_plot$FoldChange <- CBD_plot$CoV_Genotype_Mean/CBD_plot$Ctrl_Geneotype_Mean
CBD_plot3 <- CBD_plot[,c(1:9,12,10,11,13)] #fold change plot data
CBD_plot2$logMean <- log2(CBD_plot2$MeanValue) # log2 scaled plot data

The chart of fold change values of the genotype or single nucleotide polymorphism (SNP) or copy number variant of the CBD gene network top five listed genes by Relevance.score.with an added dashed red line to show that genes below the line decreased and those above increased in percentage of gene expression of CoV/ctrl.

ggplot(data = CBD_plot3, aes(x=geneClass, y=FoldChange)) +
  geom_bar(stat='identity', position=position_dodge())+
  scale_y_continuous(breaks = seq(0, 1.5, by=.1), limits=c(0,1.5))+
  scale_fill_brewer(palette='Paired') + 
  geom_hline(yintercept=1, linetype="dashed", color = "red")+
  ggtitle('CBD Network Gene Expression Fold Change Values Treated with CoV')+
  xlab('Gene and Genotype or Copy Number Variant')+
  ylab('Gene Expression Fold Change Values')




The next chart is the log2 scaled Mean values for the CBD genotypes.

ggplot(data = CBD_plot2, aes(x=geneClass, y=logMean, fill=Sample)) +
  geom_bar(stat='identity', position=position_dodge())+
  scale_y_continuous(breaks = seq(0, 11, by=1), limits=c(0,11))+
  scale_fill_brewer(palette='Paired') + 
  theme(legend.position="bottom") +
  #ggtitle('CBD Network Gene Expression Log2 Mean Values Treated with Coronavirus')+
  ylab('Gene Expression Log2 Mean Values')

Those visuals help establish that the PPARG has two copy number variants of the gene that are expressed mush more in both the control and CoV inoculated samples by mean genotype expression values, fold change values, and log2 scaled values. We can see that CoV infected samples after 1 hour in liver tumor samples increases genotype expressions of CNR_1, PTGS2, and TNF. And that INS and PPARG are decreased in the first hour of inoculation with CoV.



Now, lets start working on the data that confirms or better asserts what is going on with these genes. Recall that we wanted to examine three possibilities to understand why we made the observation above on the what the data provided for the cannabidiol genes network.

1.) Analyzing the time sequence of the blood capillary samples inoculated with CoV and their respective control groups. Examining how these CBD genes are effected over time.

2.) Confirming that insulin is effected by CoV by looking at the diabetic gene networks for type 1 and type 2 diabetes.

3.) Also, looking at the tumorigenesis gene network to see how the genes that are involved in tumor growth in the body react to CoV in the liver tumor samples and the blood capillary samples.



Lets look at 1.) for the time sequence of the blood capillary samples for each of 0, 12, 24, 36, and 48 hours for the mean values of CoV and ctrl samples.

Time 0: We have to go back to the beginning when adding the platform sequence values to the probe ideas of the series. We have the cleaned series for this data, Series100509. Otherwise, there are duplicate gene values, and the SNP values of the CBD gene with 2 SNPs will be the same.

series100509 <- Series100509[,-c(1,3)]
s100 <- series100509[,c(1,2,grep('0hr', colnames(series100509)))]

CBD5 <- cannabidiol5[,1:2]
CBD5_s100 <- merge(CBD5, s100, by.x='Gene.Symbol', by.y='GENE_SYMBOL')

seqN_CBD <- CBD5_s100 %>% group_by(SEQUENCE) %>% count(n=n())
seqN_CBD <- seqN_CBD[,-3]
colnames(seqN_CBD)[2] <- 'GenotypeCount'

gen_CBD <- CBD5_s100 %>% group_by(Gene.Symbol) %>% count(n=n())
gen_CBD <- gen_CBD[,-3]
colnames(gen_CBD)[2] <- 'GeneCount'

CBD5_s100_2 <- merge(seqN_CBD, CBD5_s100, by.x='SEQUENCE', by.y='SEQUENCE')
CBD5_s100_3 <- merge(gen_CBD, CBD5_s100_2, by.x='Gene.Symbol', by.y='Gene.Symbol')

CBD5_s100_3$geneClass <- ifelse(CBD5_s100_3$GeneCount>1, 
                               paste(as.character(CBD5_s100_3$Gene.Symbol), 
                                     CBD5_s100_3$GeneCount,sep='_'),
                               as.character(CBD5_s100_3$Gene.Symbol))
CBD5_s100_3$geneClass[3] <- 'PPARG_1'
CBD5_s100_3 <- CBD5_s100_3[,c(1:2,16,3:15)]
covSamples <- as.vector(colnames(CBD5_s100_3)[7:11])
CBD_seqCoV_t0 <- CBD5_s100_3 %>% group_by(SEQUENCE) %>% summarise_at(vars(covSamples), mean)
CBD_seqCoV_t0$CoV_t0_SeqMeans <- rowMeans(CBD_seqCoV_t0[2:6])

CoV_t0 <- CBD5_s100_3[,1:6]
CoV_t0_Means <- merge(CoV_t0, CBD_seqCoV_t0, by.x='SEQUENCE', by.y='SEQUENCE')
CoV_t0_seqMean <- CoV_t0_Means[,c(1:6,12)]

ctrlSamples <- as.vector(colnames(CBD5_s100_3)[12:16])
CBD_seqCtrl_t0 <- CBD5_s100_3 %>% group_by(SEQUENCE) %>% summarise_at(vars(ctrlSamples), mean)
CBD_seqCtrl_t0$Ctrl_t0_SeqMeans <- rowMeans(CBD_seqCtrl_t0[2:6])

Ctrl_t0_Means <- merge(CoV_t0, CBD_seqCtrl_t0, by.x='SEQUENCE', by.y='SEQUENCE')
Ctrl_t0_seqMean <- Ctrl_t0_Means[,c(1,12)]


t0_CBD <- merge(CoV_t0_seqMean, Ctrl_t0_seqMean, by.x='SEQUENCE', by.y='SEQUENCE')

t0_CBD$FoldChange <- t0_CBD$CoV_t0_SeqMeans/t0_CBD$Ctrl_t0_SeqMeans

ggplot(data = t0_CBD, aes(x=geneClass, y=FoldChange)) +
  geom_bar(stat='identity', position=position_dodge())+
  scale_y_continuous()+
  scale_fill_brewer(palette='Paired') + 
  geom_hline(yintercept=1, linetype="dashed", color = "red")+
  theme(legend.position="bottom")+
  ggtitle('CBD Gene Fold Changes for CoV at 0 Hours')+
  ylab('Gene Expression Fold Change Values')

The above chart is one of the fold change values around 1, with those less than 1 insulin and the cannabidiol receptor 1 genes.


Now we will look at the mean values of those CBD genes on the blood capillary of the CoV and control sample means.

tidyCBD <- as.data.frame(gather(t0_CBD, 'SampleType','MeanValue',7:8))
tidyCBD$SampleType <- gsub('CoV_t0_SeqMeans','CoV', tidyCBD$SampleType)
tidyCBD$SampleType <- gsub('Ctrl_t0_SeqMeans','Control', tidyCBD$SampleType)
ggplot(data = tidyCBD, aes(x=geneClass, y=MeanValue,fill=SampleType)) +
  geom_bar(stat='identity', position=position_dodge())+
  scale_y_continuous()+
  scale_fill_brewer(palette='Paired') + 
  theme(legend.position="bottom")+
  ggtitle('CBD Gene Mean Values at 0 Hours')+
  ylab('Mean Values')

As we see in the chart above the tumor necrosis factor, TNF, gene increases in Coronavirus samples compared to control samples in blood capillaries as it did in liver tumor samples, as well as insulin, INS, decreasing in both tissue types. But that was at zero hours time for blood capillary tissue, closest to the one hour time of the liver tumor samples previously analyzed.


Now, lets look at what happens after 24 hours, skipping 12 hours for these blood capillary samples inoculated with CoV.

series100509_24 <- Series100509[,-c(1,3)]
s100_24 <- series100509[,c(1,2,grep('24hr', colnames(series100509)))]

CBD5_24 <- cannabidiol5[,1:2]
CBD5_s100_24 <- merge(CBD5, s100_24, by.x='Gene.Symbol', by.y='GENE_SYMBOL')

seqN_CBD_24 <- CBD5_s100_24 %>% group_by(SEQUENCE) %>% count(n=n())
seqN_CBD_24 <- seqN_CBD_24[,-3]
colnames(seqN_CBD_24)[2] <- 'GenotypeCount'

gen_CBD_24 <- CBD5_s100_24 %>% group_by(Gene.Symbol) %>% count(n=n())
gen_CBD_24 <- gen_CBD_24[,-3]
colnames(gen_CBD_24)[2] <- 'GeneCount'

CBD5_s100_2_24 <- merge(seqN_CBD_24, CBD5_s100_24, by.x='SEQUENCE', by.y='SEQUENCE')
CBD5_s100_3_24 <- merge(gen_CBD_24, CBD5_s100_2_24, by.x='Gene.Symbol', by.y='Gene.Symbol')

CBD5_s100_3_24$geneClass <- ifelse(CBD5_s100_3_24$GeneCount>1, 
                               paste(as.character(CBD5_s100_3_24$Gene.Symbol), 
                                     CBD5_s100_3_24$GeneCount,sep='_'),
                               as.character(CBD5_s100_3_24$Gene.Symbol))
CBD5_s100_3_24$geneClass[3] <- 'PPARG_1'
CBD5_s100_3_24 <- CBD5_s100_3_24[,c(1:2,16,3:15)]
covSamples_24 <- as.vector(colnames(CBD5_s100_3_24)[7:11])
CBD_seqCoV_t_24 <- CBD5_s100_3_24 %>% group_by(SEQUENCE) %>% summarise_at(vars(covSamples_24), mean)
CBD_seqCoV_t_24$CoV_t_24_SeqMeans <- rowMeans(CBD_seqCoV_t_24[2:6])

CoV_t_24 <- CBD5_s100_3_24[,1:6]
CoV_t_24_Means <- merge(CoV_t_24, CBD_seqCoV_t_24, by.x='SEQUENCE', by.y='SEQUENCE')
CoV_t_24_seqMean <- CoV_t_24_Means[,c(1:6,12)]

ctrlSamples_24 <- as.vector(colnames(CBD5_s100_3_24)[12:16])
CBD_seqCtrl_t_24 <- CBD5_s100_3_24 %>% group_by(SEQUENCE) %>% summarise_at(vars(ctrlSamples_24), mean)
CBD_seqCtrl_t_24$Ctrl_t_24_SeqMeans <- rowMeans(CBD_seqCtrl_t_24[2:6])

Ctrl_t_24_Means <- merge(CoV_t_24, CBD_seqCtrl_t_24, by.x='SEQUENCE', by.y='SEQUENCE')
Ctrl_t_24_seqMean <- Ctrl_t_24_Means[,c(1,12)]


t_24_CBD <- merge(CoV_t_24_seqMean, Ctrl_t_24_seqMean, by.x='SEQUENCE', by.y='SEQUENCE')

t_24_CBD$FoldChange <- t_24_CBD$CoV_t_24_SeqMeans/t_24_CBD$Ctrl_t_24_SeqMeans
ggplot(data = t_24_CBD, aes(x=geneClass, y=FoldChange)) +
  geom_bar(stat='identity', position=position_dodge())+
  scale_y_continuous()+
  scale_fill_brewer(palette='Paired') + 
  geom_hline(yintercept=1, linetype="dashed", color = "red")+
  theme(legend.position="bottom")+
  ggtitle('CBD Gene Fold Changes for CoV at 24 Hours')+
  ylab('Gene Expression Fold Change Values')

Looking at the chart above on the fold change values of the blood capillaries over 24 hours after having been inoculated with CoV compared to a control group, there is a switch or change in those genes that were previously up-regulated and down-regulated.

TNF and CNR1 have negligible changes, INS is now up-regulated after 24 hours of being inoculated, and both genotypes or copy number variants of PPARG have decreased. Previously, at 0 hours in blood capillaries we saw that INS and CNR1 were down-regulated , both PPARG genotypes were negligibly close to no change at all, and TNF and PTGS2 were up-regulated.


tidyCBD_24 <- as.data.frame(gather(t_24_CBD, 'SampleType','MeanValue',7:8))
tidyCBD_24$SampleType <- gsub('CoV_t_24_SeqMeans','CoV', tidyCBD_24$SampleType)
tidyCBD_24$SampleType <- gsub('Ctrl_t_24_SeqMeans','Control', tidyCBD_24$SampleType)
ggplot(data = tidyCBD_24, aes(x=geneClass, y=MeanValue,fill=SampleType)) +
  geom_bar(stat='identity', position=position_dodge())+
  scale_y_continuous()+
  scale_fill_brewer(palette='Paired') + 
  theme(legend.position="bottom")+
  ggtitle('CBD Gene Mean Values at 24 Hours')+
  ylab('Mean Values')

The above bar chart shows the mean values of the control and CoV inoculated samples after 24 hours in the blood capillary samples. In CoV inoculation, the gene expression values for both PPARG genotypes decrease, in INS and PTGS2 there is an up-regulation, and CNR1 and TNF show negligible changes, with a slight decrease in CNR1 gene expression.


Now, lets look at after 48 hours of being inoculated with CoV in these blood capillary samples.

series100509_48 <- Series100509[,-c(1,3)]
s100_48 <- series100509[,c(1,2,grep('48hr', colnames(series100509)))]

CBD5_48 <- cannabidiol5[,1:2]
CBD5_s100_48 <- merge(CBD5, s100_48, by.x='Gene.Symbol', by.y='GENE_SYMBOL')

seqN_CBD_48 <- CBD5_s100_48 %>% group_by(SEQUENCE) %>% count(n=n())
seqN_CBD_48 <- seqN_CBD_48[,-3]
colnames(seqN_CBD_48)[2] <- 'GenotypeCount'

gen_CBD_48 <- CBD5_s100_48 %>% group_by(Gene.Symbol) %>% count(n=n())
gen_CBD_48 <- gen_CBD_48[,-3]
colnames(gen_CBD_48)[2] <- 'GeneCount'

CBD5_s100_2_48 <- merge(seqN_CBD_48, CBD5_s100_48, by.x='SEQUENCE', by.y='SEQUENCE')
CBD5_s100_3_48 <- merge(gen_CBD_48, CBD5_s100_2_48, by.x='Gene.Symbol', by.y='Gene.Symbol')

CBD5_s100_3_48$geneClass <- ifelse(CBD5_s100_3_48$GeneCount>1, 
                               paste(as.character(CBD5_s100_3_48$Gene.Symbol), 
                                     CBD5_s100_3_48$GeneCount,sep='_'),
                               as.character(CBD5_s100_3_48$Gene.Symbol))
CBD5_s100_3_48$geneClass[3] <- 'PPARG_1'
CBD5_s100_3_48 <- CBD5_s100_3_48[,c(1:2,16,3:15)]
covSamples_48 <- as.vector(colnames(CBD5_s100_3_48)[7:11])
CBD_seqCoV_t_48 <- CBD5_s100_3_48 %>% group_by(SEQUENCE) %>% summarise_at(vars(covSamples_48), mean)
CBD_seqCoV_t_48$CoV_t_48_SeqMeans <- rowMeans(CBD_seqCoV_t_48[2:6])

CoV_t_48 <- CBD5_s100_3_48[,1:6]
CoV_t_48_Means <- merge(CoV_t_48, CBD_seqCoV_t_48, by.x='SEQUENCE', by.y='SEQUENCE')
CoV_t_48_seqMean <- CoV_t_48_Means[,c(1:6,12)]

ctrlSamples_48 <- as.vector(colnames(CBD5_s100_3_48)[12:16])
CBD_seqCtrl_t_48 <- CBD5_s100_3_48 %>% group_by(SEQUENCE) %>% summarise_at(vars(ctrlSamples_48), mean)
CBD_seqCtrl_t_48$Ctrl_t_48_SeqMeans <- rowMeans(CBD_seqCtrl_t_48[2:6])

Ctrl_t_48_Means <- merge(CoV_t_48, CBD_seqCtrl_t_48, by.x='SEQUENCE', by.y='SEQUENCE')
Ctrl_t_48_seqMean <- Ctrl_t_48_Means[,c(1,12)]


t_48_CBD <- merge(CoV_t_48_seqMean, Ctrl_t_48_seqMean, by.x='SEQUENCE', by.y='SEQUENCE')

t_48_CBD$FoldChange <- t_48_CBD$CoV_t_48_SeqMeans/t_48_CBD$Ctrl_t_48_SeqMeans
ggplot(data = t_48_CBD, aes(x=geneClass, y=FoldChange)) +
  geom_bar(stat='identity', position=position_dodge())+
  scale_y_continuous()+
  scale_fill_brewer(palette='Paired') + 
  geom_hline(yintercept=1, linetype="dashed", color = "red")+
  theme(legend.position="bottom")+
  ggtitle('CBD Gene Fold Changes for CoV at 48 Hours')+
  ylab('Gene Expression Fold Change Values')

From the above chart of fold changes after 48 hours, we see that not much has changed compared to after 24 hours, except the TNF has now become up-regulated as it was at 0 hours but not after 24 hours.

CNR1 remains having negligible changes as was the case after 24 hours, INS, TNF, and PTGS2 are up-regulated after 48 hours of being inoculated, and both genotypes or copy number variants of PPARG have decreased.

Previously, at 0 hours in blood capillaries we saw that INS and CNR1 were down-regulated , both PPARG genotypes were negligibly close to no change at all, and TNF and PTGS2 were up-regulated.


tidyCBD_48 <- as.data.frame(gather(t_48_CBD, 'SampleType','MeanValue',7:8))
tidyCBD_48$SampleType <- gsub('CoV_t_48_SeqMeans','CoV', tidyCBD_48$SampleType)
tidyCBD_48$SampleType <- gsub('Ctrl_t_48_SeqMeans','Control', tidyCBD_48$SampleType)
ggplot(data = tidyCBD_48, aes(x=geneClass, y=MeanValue,fill=SampleType)) +
  geom_bar(stat='identity', position=position_dodge())+
  scale_y_continuous()+
  scale_fill_brewer(palette='Paired') + 
  theme(legend.position="bottom")+
  ggtitle('CBD Gene Mean Values at 48 Hours')+
  ylab('Mean Values')

The above bar chart shows the mean values of the control and CoV inoculated samples after 48 hours in the blood capillary samples. In CoV inoculation, the gene expression values for both PPARG genotypes decrease. In INS, TNF, and PTGS2 there is an up-regulation, and CNR1 shows negligible changes, with a slight decrease in CNR1 gene expression.



3.) Lets look at the tumorigenesis genes and the liver tumor samples that were inoculated with the Coronavirus.

Get the table with the five genes that were the highest relevance.scoring made earlier to be combined with all our samples. Then select only those liver tumor samples except for the overly scaled samples 1 and 2.Include the heat inactivated CoV and the added IL-alpha1 to CoV samples.

tumorigenesis5_path_1 <- tumorigenesis5_path[,c(1,9,12,13,16:21)]
describe <- tumorigenesis5_path[!duplicated(tumorigenesis5_path$Gene.Symbol),c(1,2,8)]
describe
##    Gene.Symbol                      Description
## 1       CTNNB1                   Catenin Beta 1
## 23        EGFR Epidermal Growth Factor Receptor
## 32        IGF2     Insulin Like Growth Factor 2
## 36        KRAS      KRAS Proto-Oncogene, GTPase
## 40        TP53                Tumor Protein P53
##                                                                                                           DESCRIPTION
## 1  Homo sapiens catenin (cadherin-associated protein), beta 1, 88kDa (CTNNB1), transcript variant 1, mRNA [NM_001904]
## 23                       Homo sapiens epidermal growth factor receptor (EGFR), transcript variant 2, mRNA [NM_201282]
## 32           Homo sapiens insulin-like growth factor 2 (somatomedin A) (IGF2), transcript variant 1, mRNA [NM_000612]
## 36   Homo sapiens v-Ki-ras2 Kirsten rat sarcoma viral oncogene homolog (KRAS), transcript variant a, mRNA [NM_033360]
## 40                                      Homo sapiens tumor protein p53 (TP53), transcript variant 1, mRNA [NM_000546]
colnames(tumorigenesis5_path_1)
##  [1] "Gene.Symbol"                 "SEQUENCE"                   
##  [3] "GSM2359910_CoV3"             "GSM2359913_CoV4"            
##  [5] "GSM2359911_ctrl3"            "GSM2359914_ctrl4"           
##  [7] "GSM2359912_Il1"              "GSM2359917_IL2"             
##  [9] "GSM2359915_inactiveHeatCoV1" "GSM2359916_inactiveHeatCoV2"

Now, group by SEQUENCE to see how many genotypes there are in this set of five tumor creating network of genes.

seqTumor <- tumorigenesis5_path_1 %>% group_by(SEQUENCE) %>% count(n=n())
Seq_tumor <- as.data.frame(seqTumor[,-3])

colnames(Seq_tumor)[2] <- 'genoTypeCount'

Seq_Tumor <- merge(Seq_tumor, tumorigenesis5_path_1, 
                   by.x='SEQUENCE', by.y='SEQUENCE')
Seq_Tumor$CoV_Means <- rowMeans(Seq_Tumor[4:5])
Seq_Tumor$Ctrl_Means <- rowMeans(Seq_Tumor[6:7])
Seq_Tumor$IL_A_Means <- rowMeans(Seq_Tumor[8:9])
Seq_Tumor$IA_CoV_Means <- rowMeans(Seq_Tumor[10:11])
seqTumor1 <- Seq_Tumor %>% group_by(Gene.Symbol) %>% count(n=n())
Seq_tumor1 <- as.data.frame(seqTumor1[,-3])
colnames(Seq_tumor1)[2] <- 'geneCount'
Seq_Tumor1 <- merge(Seq_tumor1, Seq_Tumor, by.x='Gene.Symbol', by.y='Gene.Symbol')
Seq_Tumor1$FoldChange_CoV <- Seq_Tumor1$CoV_Means/Seq_Tumor1$Ctrl_Means
Seq_Tumor1$FoldChange_IL_a <- Seq_Tumor1$IL_A_Means/Seq_Tumor1$Ctrl_Means
Seq_Tumor1$FoldChange_IA_CoV <- Seq_Tumor1$IA_CoV_Means/Seq_Tumor1$Ctrl_Means
EGFR1 <- grep('^GGAAATC', Seq_Tumor1$SEQUENCE)
EGFR2 <- grep('^TATCAA',Seq_Tumor1$SEQUENCE)
EGFR3 <- grep('^AGAACT',Seq_Tumor1$SEQUENCE)

IGF2_1 <- grep('^CTCAACTC',Seq_Tumor1$SEQUENCE)
IGF2_2 <- grep('^TGCTTCC',Seq_Tumor1$SEQUENCE)

CTNNB1_1 <- grep('^TGATCAA',Seq_Tumor1$SEQUENCE)
CTNNB1_2 <- grep('^ATGATGGA',Seq_Tumor1$SEQUENCE)

KRAS_1 <- grep('^TCAGGACT',Seq_Tumor1$SEQUENCE)
KRAS_2 <- grep('^CTGAGTCA',Seq_Tumor1$SEQUENCE)

TP53_1 <- grep('^CTGTGAGG',Seq_Tumor1$SEQUENCE)
TP53_2 <- grep('^CAGCTACG',Seq_Tumor1$SEQUENCE)


Seq_Tumor1$Genotype_Class <- 'class'

Seq_Tumor1$Genotype_Class[EGFR1] <- 'EGFR_1'
Seq_Tumor1$Genotype_Class[EGFR2] <- 'EGFR_2'
Seq_Tumor1$Genotype_Class[EGFR3] <- 'EGFR_3'
Seq_Tumor1$Genotype_Class[IGF2_1] <- 'IGF2_1'
Seq_Tumor1$Genotype_Class[IGF2_2] <- 'IGF2_2'
Seq_Tumor1$Genotype_Class[CTNNB1_1] <- 'CTNNB1_1'
Seq_Tumor1$Genotype_Class[CTNNB1_2] <- 'CTNNB1_2'
Seq_Tumor1$Genotype_Class[KRAS_1] <- 'KRAS_1'
Seq_Tumor1$Genotype_Class[KRAS_2] <- 'KRAS_2'
Seq_Tumor1$Genotype_Class[TP53_1] <- 'TP53_1'
Seq_Tumor1$Genotype_Class[TP53_2] <- 'TP53_2'

Make the data tidy.

tidy_SeqTumor1 <- gather(Seq_Tumor1, 'Statistic','Value',13:19)
foldChanges <- subset(tidy_SeqTumor1, tidy_SeqTumor1$Statistic=="FoldChange_IL_a" | tidy_SeqTumor1$Statistic=="FoldChange_IA_CoV" | tidy_SeqTumor1$Statistic=="FoldChange_CoV")
ggplot(data = foldChanges, aes(x=Statistic, y=Value,fill=Genotype_Class)) +
  geom_bar(stat='identity', position=position_dodge())+
  scale_y_continuous()+
  scale_fill_brewer(palette='Paired') +
  geom_hline(yintercept=1, linetype="dashed", color = "red")+
  theme(legend.position="bottom")+
  ggtitle('Tumorigenesis Gene Values CoV Liver Tumor Samples')+
  ylab('Values')


The above bar chart shows the fold change values of the tumorigenesis genes in the liver tumor samples inoculated with Coronavirus, Coronavirus and interleukin alpha1, and heat inactive CoV compared to the control group and over a one hour time span in vitro or cell culture. The genes closest to the dashed line represent negligible change. IGF2 genotype 1 of 3 and the 1st genotype of KRAS are negligible for changes after being inoculated with either of the three inoculations over one hour.

The EGFR for two genotypes out of three, both genotypes of TP53, and the second genotype of KRAS are increased as they are above the line in fold change values as the chart above shows.

The genotype of the three for EGFR that isn’t increased is decreased in the CoV inoculated samples, but increased in the heat inactivated CoV samples and negligible change in the IL-A mixed with CoV sample. These could be clues to how CoV effects tumor producing genes or the network of genes responsible for causing tumors. However, this study is only done for a one hour time span, not like the blood capillary samples that were done over a 48 hour time span.

These five tumorigenesis genes are shown in the following table that was made earlier:

describe
##    Gene.Symbol                      Description
## 1       CTNNB1                   Catenin Beta 1
## 23        EGFR Epidermal Growth Factor Receptor
## 32        IGF2     Insulin Like Growth Factor 2
## 36        KRAS      KRAS Proto-Oncogene, GTPase
## 40        TP53                Tumor Protein P53
##                                                                                                           DESCRIPTION
## 1  Homo sapiens catenin (cadherin-associated protein), beta 1, 88kDa (CTNNB1), transcript variant 1, mRNA [NM_001904]
## 23                       Homo sapiens epidermal growth factor receptor (EGFR), transcript variant 2, mRNA [NM_201282]
## 32           Homo sapiens insulin-like growth factor 2 (somatomedin A) (IGF2), transcript variant 1, mRNA [NM_000612]
## 36   Homo sapiens v-Ki-ras2 Kirsten rat sarcoma viral oncogene homolog (KRAS), transcript variant a, mRNA [NM_033360]
## 40                                      Homo sapiens tumor protein p53 (TP53), transcript variant 1, mRNA [NM_000546]
describe1 <- datatable(data=describe,  rownames=FALSE,
                      #colnames=pretty_headers,
                      filter=list(position='top'),
                      options=list(
                        dom='Bfrtip',
                        buttons=c('colvis','csv','excel'),
                        language=list(sSearch='Filter:')),
                      extensions=c('Buttons','Responsive')
                      )
describe1
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

We just analyzed the tumorigenesis genes in the liver tumor samples of CoV, but lets look at the blood capillary samples as well to see how these genotypes behave. Are the genes expressed the same or do they differ when looking at blood capillary tissue instead of liver tumor tissue infected with CoV.

We will use the first four columns and the last one of the Seq_Tumor1 table that has the count of the genes and genotypes in it as well as the corresponding given name by which genotype it was.

Seq_cap_tg <- Seq_Tumor1[,c(1:4,20)]
Seq_cap_tg <- Seq_cap_tg[!duplicated(Seq_cap_tg$Genotype_Class),]
colnames(Seq_cap_tg)
## [1] "Gene.Symbol"    "geneCount"      "SEQUENCE"       "genoTypeCount" 
## [5] "Genotype_Class"
Seq_cap_tg$Genotype_Class
##  [1] "CTNNB1_1" "CTNNB1_2" "EGFR_1"   "EGFR_2"   "EGFR_3"   "IGF2_1"  
##  [7] "IGF2_2"   "KRAS_1"   "KRAS_2"   "TP53_1"   "TP53_2"

Now use the Series100 table that has the blood capillary samples by gene name of all genes with gene names and sequence copy variant of each gene.Combine the tumorigenesis table Seq_cap_tg with it.

tg_genes_bcs <- merge(Seq_cap_tg, Series100509,by.x='SEQUENCE', by.y='SEQUENCE')

Lets take the row means of each subset of samples in our blood capillary samples of CoV by 0, 12,24,36, and 48 hours.

colnames(tg_genes_bcs)
##  [1] "SEQUENCE"                     "Gene.Symbol"                 
##  [3] "geneCount"                    "genoTypeCount"               
##  [5] "Genotype_Class"               "SPOT_ID"                     
##  [7] "GENE_SYMBOL"                  "DESCRIPTION"                 
##  [9] "GSM2685693_CoV_0hr_Group_A"   "GSM2685694_CoV_0hr_Group_B"  
## [11] "GSM2685695_CoV_0hr_Group_C"   "GSM2685696_CoV_0hr_Group_D"  
## [13] "GSM2685697_CoV_0hr_Group_E"   "GSM2685698_ctrl_0hr_Group_A" 
## [15] "GSM2685699_ctrl_0hr_Group_B"  "GSM2685700_ctrl_0hr_Group_C" 
## [17] "GSM2685701_ctrl_0hr_Group_D"  "GSM2685702_ctrl_0hr_Group_E" 
## [19] "GSM2685703_CoV_12hr_Group_A"  "GSM2685704_CoV_12hr_Group_B" 
## [21] "GSM2685705_CoV_12hr_Group_C"  "GSM2685706_CoV_12hr_Group_D" 
## [23] "GSM2685707_CoV_12hr_Group_E"  "GSM2685708_ctrl_12hr_Group_A"
## [25] "GSM2685709_ctrl_12hr_Group_B" "GSM2685710_ctrl_12hr_Group_C"
## [27] "GSM2685711_ctrl_12hr_Group_D" "GSM2685712_ctrl_12hr_Group_E"
## [29] "GSM2685713_CoV_24hr_Group_A"  "GSM2685714_CoV_24hr_Group_B" 
## [31] "GSM2685715_CoV_24hr_Group_C"  "GSM2685716_CoV_24hr_Group_D" 
## [33] "GSM2685717_CoV_24hr_Group_E"  "GSM2685718_ctrl_24hr_Group_A"
## [35] "GSM2685719_ctrl_24hr_Group_B" "GSM2685720_ctrl_24hr_Group_C"
## [37] "GSM2685721_ctrl_24hr_Group_D" "GSM2685722_ctrl_24hr_Group_E"
## [39] "GSM2685723_CoV_36hr_Group_A"  "GSM2685724_CoV_36hr_Group_B" 
## [41] "GSM2685725_CoV_36hr_Group_C"  "GSM2685726_CoV_36hr_Group_D" 
## [43] "GSM2685727_CoV_36hr_Group_E"  "GSM2685728_ctrl_36hr_Group_A"
## [45] "GSM2685729_ctrl_36hr_Group_B" "GSM2685730_ctrl_36hr_Group_C"
## [47] "GSM2685731_ctrl_36hr_Group_D" "GSM2685732_ctrl_36hr_Group_E"
## [49] "GSM2685733_CoV_48hr_Group_A"  "GSM2685734_CoV_48hr_Group_B" 
## [51] "GSM2685735_CoV_48hr_Group_C"  "GSM2685736_CoV_48hr_Group_D" 
## [53] "GSM2685737_CoV_48hr_Group_E"  "GSM2685738_ctrl_48hr_Group_A"
## [55] "GSM2685739_ctrl_48hr_Group_B" "GSM2685740_ctrl_48hr_Group_C"
## [57] "GSM2685741_ctrl_48hr_Group_D" "GSM2685742_ctrl_48hr_Group_E"

Add in the row means for each group of CoV or Ctrl over 0,12,24,36, or 48 hours.

tg_genes_bcs$Mean_0hr_Cov <- rowMeans(tg_genes_bcs[,9:13])
tg_genes_bcs$Mean_12hr_Cov <- rowMeans(tg_genes_bcs[,19:23])
tg_genes_bcs$Mean_24hr_Cov <- rowMeans(tg_genes_bcs[,29:33])
tg_genes_bcs$Mean_36hr_Cov <- rowMeans(tg_genes_bcs[,39:43])
tg_genes_bcs$Mean_48hr_Cov <- rowMeans(tg_genes_bcs[,49:53])

tg_genes_bcs$Mean_0hr_Ctrl <- rowMeans(tg_genes_bcs[,14:18])
tg_genes_bcs$Mean_12hr_Ctrl <- rowMeans(tg_genes_bcs[,24:28])
tg_genes_bcs$Mean_24hr_Ctrl <- rowMeans(tg_genes_bcs[,34:38])
tg_genes_bcs$Mean_36hr_Ctrl <- rowMeans(tg_genes_bcs[,44:48])
tg_genes_bcs$Mean_48hr_Ctrl <- rowMeans(tg_genes_bcs[,54:58])

Now, add in the fold change values of the CoV/Ctrl for each of 0,12,24,36, or 48 hours.

tg_genes_bcs$FC_0hr <- tg_genes_bcs$Mean_0hr_Cov/tg_genes_bcs$Mean_0hr_Ctrl
tg_genes_bcs$FC_12hr <- tg_genes_bcs$Mean_12hr_Cov/tg_genes_bcs$Mean_12hr_Ctrl
tg_genes_bcs$FC_24hr <- tg_genes_bcs$Mean_24hr_Cov/tg_genes_bcs$Mean_24hr_Ctrl
tg_genes_bcs$FC_36hr <- tg_genes_bcs$Mean_36hr_Cov/tg_genes_bcs$Mean_36hr_Ctrl
tg_genes_bcs$FC_48hr <- tg_genes_bcs$Mean_48hr_Cov/tg_genes_bcs$Mean_48hr_Ctrl

Lets look at all the columns of sample and statistical information.

head(tg_genes_bcs[,1:4])
##                                                       SEQUENCE Gene.Symbol
## 1 AGAACTCTGAGTGCATACAGTGCCACCCAGAGTGCCTGCCTCAGGCCATGAACATCACCT        EGFR
## 2 ATGATGGAACATGAGATGGGTGGCCACCACCCTGGTGCTGACTATCCAGTTGATGGGCTG      CTNNB1
## 3 CAGCTACGGTTTCCGTCTGGGCTTCTTGCATTCTGGGACAGCCAAGTCTGTGACTTGCAC        TP53
## 4 CTCAACTCAGCTCCTTTAACGCTAATATTTCCGGCAAAATCCCATGCTTGGGTTTTGTCT        IGF2
## 5 CTGAGTCACACTGCATAGGAATTTAGAACCTAACTTTTATAGGTTATCAAAACTGTTGTC        KRAS
## 6 CTGTGAGGGATGTTTGGGAGATGTAAGAAATGTTCTTGCAGTTAAGGGTTAGTTTACAAT        TP53
##   geneCount genoTypeCount
## 1         9             3
## 2        22             2
## 3        22             2
## 4         4             2
## 5         4             2
## 6        22            20
tail(tg_genes_bcs[,69:73])
##       FC_0hr   FC_12hr   FC_24hr   FC_36hr   FC_48hr
## 6  1.0035557 0.8901734 0.9371630 0.9166376 0.9227073
## 7  1.0037408 0.9708363 0.9652115 1.2515678 1.5938946
## 8  1.0035569 1.0252020 1.0569149 1.0924454 1.1058086
## 9  0.9957301 0.9123630 0.9475070 0.9431493 0.9034454
## 10 1.0015810 0.9618076 0.9400880 0.9736967 0.9659112
## 11 0.9983990 1.0491453 1.0020517 0.9762646 0.9773549

Lets use tidyr to gather the fold change columns and mean columns into separate tables to plot.

bc_fc <- as.data.frame(gather(tg_genes_bcs, 'timeInterval_FCs','FC_Value',69:73))
bc_means <- as.data.frame(gather(tg_genes_bcs, 'timeInterval_Means', 'MeanValue',59:68))

Now, we will plot the blood capillary tumorigenesis genes and their genotypes by fold change (FC) of each time interval and compare to the liver tumor samples of tumorigenesis fold change values.

ggplot(data = bc_fc, aes(x=timeInterval_FCs, y=FC_Value,fill=Genotype_Class)) +
  geom_bar(stat='identity', position=position_dodge())+
  scale_y_continuous()+
  scale_fill_brewer(palette='Paired') +
  geom_hline(yintercept=1, linetype="dashed", color = "red")+
  theme_classic()+
  theme(legend.position="bottom")+
  ggtitle('Tumorigenesis Gene Values CoV Capillary Samples')+
  ylab('Fold Change Values')

The above plot shows the fold changes over 48 hours for the blood capillary samples inoculated with CoV. At zero hours after inoculation the fold change values are negligible.

After 12 hours of being infected with CoV, the 2nd and 3rd genotypes of EGFR are increased as well as the 2nd genotype of IGF2. The TP53, CTNNB1, and KRAS genotypes as well as the 1st genotype of EGFR are decreased after 12 hours.

After one day or 24 hours, both genotypes of EGFR that were increased after 12 hours are still increased. The 2nd genotype of IGF2 dropped back down to original values, and all other genotypes are decreased. The first genotype of IGF2 decreased more than after 12 hours. The first genotype of CTNNB1 and TP57 increased slightly more than after 12 hours but still less than original values.

After 36 hours, all the EGFR genotypes increase, with a huge increase in the first genotype of EGFR that was originally decreased after 12 and 24 hours. The CTNNB1 genes remain the same approximate values as they were at 24 hours after inoculation.

After 48 hours, only the EGFR gene is increased with all genotypes increased more than after 0 hours. The first genotype of EGFR that was decreased in the 0,12,and 24 hour samples has now surpassed gene expression values of all the other genotypes as the most expressed genotype in this set of genes.


This type of tissue, blood capillary, when compared to liver tumor tissue, did not show the same genes and behaviors as each other. The TP53 genotypes never increased as it did in the liver tumor samples. The CNNTD1 genotypes were increased in the liver tumor samples as well as all the other genotypes except for the first genotypes of KRAS and EGFR. These blood capillary samples show 48 hours of results in 12 hour increments, offering more information on gene expression of the tumorigenesis genes after two days. ***

means_0_24 <- subset(bc_means, bc_means$timeInterval_Means=='Mean_0hr_Cov' | 
                       bc_means$timeInterval_Means=='Mean_0hr_Ctrl' |
                       bc_means$timeInterval_Means=='Mean_12hr_Cov' |
                       bc_means$timeInterval_Means=='Mean_12hr_Ctrl' |
                       bc_means$timeInterval_Means=='Mean_24hr_Cov' |
                       bc_means$timeInterval_Means=='Mean_24hr_Ctrl' )
means_0_24_a <- subset(means_0_24, means_0_24$Genotype_Class=='CTNNB1_1' |
                         means_0_24$Genotype_Class=='CTNNB1_2' |
                         means_0_24$Genotype_Class=='EGFR_1' |
                         means_0_24$Genotype_Class=='EGFR_2' |
                         means_0_24$Genotype_Class=='EGFR_3' )
ggplot(data = means_0_24_a, aes(x=timeInterval_Means, y=MeanValue,group=Genotype_Class)) +
  geom_line(aes(color=Genotype_Class))+
  geom_point()+
  scale_y_continuous()+
  scale_fill_brewer(palette="paired") +
  theme(legend.position="bottom")+
  ggtitle('Tumorigenesis Gene Values CoV Capillary 0-24 Hours')+
  ylab('Mean Values')
## Warning in pal_name(palette, type): Unknown palette paired


The above line chart shows the direction of increase or decrease over 24 hours at the mean gene expression values of the CTNBB1 and EGFR genotypes.The CTNNB1 genotypes decrease, while most of the EGFR genotypes increase when compared to the control samples at the same time.


ggplot(data = means_0_24_a, aes(x=timeInterval_Means, y=MeanValue,fill=Genotype_Class)) +
  geom_bar(stat='identity', position=position_dodge())+
  scale_y_continuous()+
  scale_fill_brewer(palette="paired") +
  theme_dark()+
  theme(legend.position="bottom")+
  ggtitle('Tumorigenesis Gene Values CoV Capillary 0-24 Hours')+
  ylab('Mean Values')
## Warning in pal_name(palette, type): Unknown palette paired


The above bar chart shows the first 24 hours after inoculation for the genotypes of the CTNNB1 an EGFR genes. CTNNB1 has decreasing values and the EGFR genotypes are increased over 24 hours time.


means_0_24_b <- subset(means_0_24, means_0_24$Genotype_Class=='IGF2_1' |
                         means_0_24$Genotype_Class=='IGF2_2' |
                         means_0_24$Genotype_Class=='TP53_1' |
                         means_0_24$Genotype_Class=='TP53_2' |
                         means_0_24$Genotype_Class=='KRAS_1' |
                         means_0_24$Genotype_Class=='KRAS_2')
ggplot(data = means_0_24_b, aes(x=timeInterval_Means, y=MeanValue,group=Genotype_Class)) +
  geom_line(aes(color=Genotype_Class))+
  geom_point()+
  scale_y_continuous()+
  scale_fill_brewer(palette="paired") +
  theme(legend.position="bottom")+
  ggtitle('Tumorigenesis Gene Values CoV Capillary 0-24 Hours')+
  ylab('Mean Values')
## Warning in pal_name(palette, type): Unknown palette paired


The above plot shows the line of direction for increasing and decreasing gene expression values by genotype for IGF2, KRAS, and TP53 over 24 hours.


ggplot(data = means_0_24_b, aes(x=timeInterval_Means, y=MeanValue,fill=Genotype_Class)) +
  geom_bar(stat='identity', position=position_dodge())+
  scale_y_continuous()+
  scale_fill_brewer(palette="paired") +
  theme_dark()+
  theme(legend.position="bottom")+
  ggtitle('Tumorigenesis Gene Values CoV Capillary 0-24 Hours')+
  ylab('Mean Values')
## Warning in pal_name(palette, type): Unknown palette paired


The above chart shows the 0-24 hour time window of genotype mean expression values for the IGF2, KRAS, and TP53 genes. Start at the mean ctrl values for the hour compared to the CoV sample Means and notice the changes. At zero hours, negligible change. At 12 hours, a decrease in the KRAS and TP53 genotypes, with a slight increase or no noticeable change in the IGF2 genotypes. And after 24 hours all these genotypes decrease in expression values.


Now look at these genes split up into sets of genotypes and in the next 24 hours from time interval 24-48 hours.

means_24_48 <- subset(bc_means, bc_means$timeInterval_Means=='Mean_24hr_Cov' | 
                       bc_means$timeInterval_Means=='Mean_24hr_Ctrl' |
                       bc_means$timeInterval_Means=='Mean_36hr_Cov' |
                       bc_means$timeInterval_Means=='Mean_36hr_Ctrl' |
                       bc_means$timeInterval_Means=='Mean_48hr_Cov' |
                       bc_means$timeInterval_Means=='Mean_48hr_Ctrl' )
means_24_48_a <- subset(means_24_48, means_24_48$Genotype_Class=='CTNNB1_1' |
                         means_24_48$Genotype_Class=='CTNNB1_2' |
                         means_24_48$Genotype_Class=='EGFR_1' |
                         means_24_48$Genotype_Class=='EGFR_2' |
                         means_24_48$Genotype_Class=='EGFR_3' )
ggplot(data = means_24_48_a, aes(x=timeInterval_Means, y=MeanValue,group=Genotype_Class)) +
  geom_line(aes(color=Genotype_Class))+
  geom_point()+
  scale_y_continuous()+
  scale_fill_brewer(palette="paired") +
  theme(legend.position="bottom")+
  ggtitle('Tumorigenesis Gene Values CoV Capillary 24-48 Hours')+
  ylab('Mean Values')
## Warning in pal_name(palette, type): Unknown palette paired


The above line chart shows the increasing and decreasing directions of the genotypes of CTNNB1 and EGFR genes from 24-48 hours. The CTNNB1 genotypes decreased throughout this time interval compared to the control group. The EGFR genotypes all increased except for the first genotype that decreased after 24 hours but increased the remaining time intervals compared to the control groups.


ggplot(data = means_24_48_a, aes(x=timeInterval_Means, y=MeanValue,fill=Genotype_Class)) +
  geom_bar(stat='identity', position=position_dodge())+
  scale_y_continuous()+
  scale_fill_brewer(palette="PuOr") +
  theme_dark()+
  theme(legend.position="bottom")+
  ggtitle('Tumorigenesis Gene Values CoV Capillary 24-48 Hours')+
  ylab('Mean Values')


The above bar chart shows the time interval of 24-48 hours after being infected with the Coronavirus for the genotypes of CTNNB1 and EGFR. We can see there is decreased expression after 24 hours in the CTNNB1 genotypes as well as for the other time intervals through 48 hours while almost at the same levels after 48 hours. For the EGFR genotypes, they increased for all times from 24-48 hours. Except for the 24 hour time interval where the first genotype of EGFR decreased.

means_24_48_b <- subset(means_24_48, means_24_48$Genotype_Class=='IGF2_1' |
                         means_24_48$Genotype_Class=='IGF2_2' |
                         means_24_48$Genotype_Class=='TP53_1' |
                         means_24_48$Genotype_Class=='TP53_2' |
                         means_24_48$Genotype_Class=='KRAS_1' |
                         means_24_48$Genotype_Class=='KRAS_2')
ggplot(data = means_24_48_b, aes(x=timeInterval_Means, y=MeanValue,group=Genotype_Class)) +
  geom_line(aes(color=Genotype_Class))+
  geom_point()+
  scale_y_continuous()+
  scale_fill_brewer(palette="Spectral") +
  theme(legend.position="bottom")+
  ggtitle('Tumorigenesis Gene Values CoV Capillary 24-48 Hours')+
  ylab('Mean Values')


The above line chart shows how the CoV and control gene expression values behave for three of the genes out of five tumorigenesis genes selected for the time interval of 24-48 hours.From each sample, you can see at 24 hours all these genotypes were less than the control gene values, so they decreased. And the same for the 36 hour interval, and the 48 hour interval. ***

ggplot(data = means_24_48_b, aes(x=timeInterval_Means, y=MeanValue,fill=Genotype_Class)) +
  geom_bar(stat='identity', position=position_dodge())+
  scale_y_continuous()+
  scale_fill_brewer(palette="PuOr") +
  theme_dark()+
  theme(legend.position="bottom")+
  ggtitle('Tumorigenesis Gene Values CoV Capillary 24-48 Hours')+
  ylab('Mean Values')


The above bar chart shows the genotypes of IGF2, KRAS, and TP53 for the Coronavirus and control mean values from 24 hours to 48 hours time. The 2nd genotype of IGF2 seems to stay around the same value, while all the other genotypes have decreased expression values compared to the control set of samples for the corresponding time interval.


The liver tumor samples showed increases in all genes except the first genotypes of EGFR and IGF2 over a one hour time span.While the blood capillary samples, showed decreases in all genes after 48 hours, except for the EGFR genotypes. And the genotype of EGFR the most expressed was the same one found under expressed in earlier time intervals and the liver tumor samples.

The information offers some clues of how these genes are affected by the CoV inoculation and the window of changes in two different sample types.

updated 3/25/2020