covid19 study

The access number for this data is GSE151161 and it can be (downloaded)[https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE152418] with the text metadata information shown immediately below and the RAW data values of gene expression. No need to add the platform information, because the RAW data has the gene name attached to the Raw RNA gene expression data. The data resource link: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE152418.

The ages of the patients in this study range from 23-91 years of age with a median age of 56, and there are 18 females and 16 males in 34 total samples.

This is Peripheral Blood Mononuclear Cells with a (wikipedia)[en.wikipedia.org/wiki/Peripheral_blood_mononuclear_cell] definition:" A peripheral blood mononuclear cell (PBMC) is any peripheral blood cell having a round nucleus. These cells consist of lymphocytes (T cells, B cells, NK cells) and monocytes, whereas erythrocytes and platelets have no nuclei, and granulocytes (neutrophils, basophils, and eosinophils) have multi-lobed nuclei."

National Center for Bioinformatics Information (NCBI) study summary: "Series GSE152418 Query DataSets for GSE152418 Status Public on Jul 31, 2020 Title Systems biological assessment of immunity to severe and mild COVID-19 infections Organism Homo sapiens Experiment type Expression profiling by high throughput sequencing Summary The recent emergence of COVID-19 presents a major global crisis. Profound knowledge gaps remain about the interaction between the virus and the immune system. Here, we used a systems biology approach to analyze immune responses in 76 COVID-19 patients and 69 age and sex- matched controls, from Hong Kong and Atlanta. Mass cytometry revealed prolonged plasmablast and effector T cell responses, reduced myeloid expression of HLA-DR and inhibition of mTOR signaling in plasmacytoid DCs (pDCs) during infection. Production of pro-inflammatory cytokines plasma levels of inflammatory mediators, including EN-RAGE, TNFSF14, and Oncostatin-M, which correlated with disease severity, and increased bacterial DNA and endotoxin in plasma in and reduced HLA-DR and CD86 but enhanced EN-RAGE expression in myeloid cells in severe transient expression of IFN stimulated genes in moderate infections, consistent with transcriptomic analysis of bulk PBMCs, that correlated with transient and low levels of plasma COVID-19.

Overall design RNAseq analysis of PBMCs in a group of 17 COVID-19 subjects and 17 healthy controls"

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

source("geneCards.R")

## Warning: package 'rvest' was built under R version 3.6.3

## Loading required package: xml2

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

I manually copied and pasted the header information for each sample in the following table. It gives the age, if healthy, mild, severe, ICU, and one convalescent category for COVID-19 sampled patients, as well as gender and age and Atlanta, GA as the location.

infoHeader <- read.csv('header.csv', sep=',', header=T,na.strings=c('',' ','NA'),
                       stringsAsFactors = F)
colnames(infoHeader)

##  [1] "GSM4614985" "GSM4614986" "GSM4614987" "GSM4614988" "GSM4614989"
##  [6] "GSM4614990" "GSM4614991" "GSM4614992" "GSM4614993" "GSM4614994"
## [11] "GSM4614995" "GSM4614996" "GSM4614997" "GSM4614998" "GSM4614999"
## [16] "GSM4615000" "GSM4615001" "GSM4615003" "GSM4615006" "GSM4615008"
## [21] "GSM4615011" "GSM4615014" "GSM4615016" "GSM4615019" "GSM4615022"
## [26] "X"          "GSM4615025" "GSM4615027" "GSM4615030" "GSM4615032"
## [31] "GSM4615033" "GSM4615034" "GSM4615035" "GSM4615036" "GSM4615037"

row.names(infoHeader) <- infoHeader$X
infoHeader <- infoHeader[,-26]
infoHeader

##                           GSM4614985              GSM4614986
## sampleID              S145_nCOV001_C S147_nCoV001EUHM-Draw-1
## age                               80                      75
## gender                             M                       F
## diseaseState            Convalescent                COVID-19
## severity                Convalescent                Moderate
## geographicLocation  Atlanta, GA, USA        Atlanta, GA, USA
## cellType                        PBMC                    PBMC
##                                 GSM4614987              GSM4614988
## sampleID           S149_nCoV002EUHM-Draw-2 S150_nCoV003EUHM-Draw-1
## age                                     54                      75
## gender                                   M                       F
## diseaseState                      COVID-19                COVID-19
## severity                            Severe                  Severe
## geographicLocation        Atlanta, GA, USA        Atlanta, GA, USA
## cellType                              PBMC                    PBMC
##                                 GSM4614989              GSM4614990
## sampleID           S151_nCoV004EUHM-Draw-1 S152_nCoV006EUHM-Draw-1
## age                                     59                      59
## gender                                   M                       M
## diseaseState                      COVID-19                COVID-19
## severity                               ICU                  Severe
## geographicLocation        Atlanta, GA, USA        Atlanta, GA, USA
## cellType                              PBMC                    PBMC
##                                 GSM4614991               GSM4614992
## sampleID           S153_nCoV007EUHM-Draw-1 S154_nCoV0010EUHM-Draw-1
## age                                     53                       64
## gender                                   F                        F
## diseaseState                      COVID-19                 COVID-19
## severity                          Moderate                   Severe
## geographicLocation        Atlanta, GA, USA         Atlanta, GA, USA
## cellType                              PBMC                     PBMC
##                           GSM4614993              GSM4614994        GSM4614995
## sampleID            S155_nCOV021EUHM S156_nCOV024EUHM-Draw-1 S157_nCOV0029EUHM
## age                               60                      48                47
## gender                             F                       M                 F
## diseaseState                COVID-19                COVID-19          COVID-19
## severity                         ICU                  Severe          Moderate
## geographicLocation  Atlanta, GA, USA        Atlanta, GA, USA  Atlanta, GA, USA
## cellType                        PBMC                    PBMC              PBMC
##                                 GSM4614996              GSM4614997
## sampleID           S175_nCoV024EUHM-Draw-2 S176_nCoV025EUHM-Draw-1
## age                                     48                      56
## gender                                   M                       F
## diseaseState                      COVID-19                COVID-19
## severity                               ICU                  Severe
## geographicLocation        Atlanta, GA, USA        Atlanta, GA, USA
## cellType                              PBMC                    PBMC
##                                 GSM4614998              GSM4614999
## sampleID           S177_nCoV025EUHM-Draw-2 S178_nCoV028EUHM-Draw-1
## age                                     56                      56
## gender                                   F                       M
## diseaseState                      COVID-19                COVID-19
## severity                               ICU                  Severe
## geographicLocation        Atlanta, GA, USA        Atlanta, GA, USA
## cellType                              PBMC                    PBMC
##                                 GSM4615000              GSM4615001
## sampleID           S179_nCoV033EUHM-Draw-1 S180_nCoV034EUHM-Draw-1
## age                                     76                      46
## gender                                   M                       F
## diseaseState                      COVID-19                COVID-19
## severity                            Severe                Moderate
## geographicLocation        Atlanta, GA, USA        Atlanta, GA, USA
## cellType                              PBMC                    PBMC
##                           GSM4615003        GSM4615006        GSM4615008
## sampleID                    S061_257          S062_258          S063_259
## age                               25                70                68
## gender                             M                 F                 F
## diseaseState                 Healthy           Healthy           Healthy
## severity                     Healthy           Healthy           Healthy
## geographicLocation  Atlanta, GA, USA  Atlanta, GA, USA  Atlanta, GA, USA
## cellType                        PBMC              PBMC              PBMC
##                           GSM4615011        GSM4615014        GSM4615016
## sampleID                    S064_260          S065_261          S066_265
## age                               69                29                90
## gender                             M                 F                 M
## diseaseState                 Healthy           Healthy           Healthy
## severity                     Healthy           Healthy           Healthy
## geographicLocation  Atlanta, GA, USA  Atlanta, GA, USA  Atlanta, GA, USA
## cellType                        PBMC              PBMC              PBMC
##                           GSM4615019        GSM4615022        GSM4615025
## sampleID                    S067_270          S068_272          S069_273
## age                               85                28                26
## gender                             F                 M                 F
## diseaseState                 Healthy           Healthy           Healthy
## severity                     Healthy           Healthy           Healthy
## geographicLocation  Atlanta, GA, USA  Atlanta, GA, USA  Atlanta, GA, USA
## cellType                        PBMC              PBMC              PBMC
##                           GSM4615027        GSM4615030        GSM4615032
## sampleID                    S070_279          S071_280          S181_255
## age                               38                84                27
## gender                             M                 F                 M
## diseaseState                 Healthy           Healthy           Healthy
## severity                     Healthy           Healthy           Healthy
## geographicLocation  Atlanta, GA, USA  Atlanta, GA, USA  Atlanta, GA, USA
## cellType                        PBMC              PBMC              PBMC
##                           GSM4615033        GSM4615034        GSM4615035
## sampleID                 S182_SHXA10          S183_263       S184_SHXA18
## age                               50                23                55
## gender                             F                 F                 M
## diseaseState                 Healthy           Healthy           Healthy
## severity                     Healthy           Healthy           Healthy
## geographicLocation  Atlanta, GA, USA  Atlanta, GA, USA  Atlanta, GA, USA
## cellType                        PBMC              PBMC              PBMC
##                           GSM4615036        GSM4615037
## sampleID                    S185_266       S186_SHXA14
## age                               91                57
## gender                             F                 M
## diseaseState                 Healthy           Healthy
## severity                     Healthy           Healthy
## geographicLocation  Atlanta, GA, USA  Atlanta, GA, USA
## cellType                        PBMC              PBMC

The first sample has a class that doesn’t fit this data of healthy or one of three stages of COVID-19. Lets see all the unique classes of disease states.

tHeader <- as.data.frame(t(infoHeader))
head(tHeader)

##                           sampleID age gender  diseaseState      severity
## GSM4614985          S145_nCOV001_C  80      M  Convalescent  Convalescent
## GSM4614986 S147_nCoV001EUHM-Draw-1  75      F      COVID-19      Moderate
## GSM4614987 S149_nCoV002EUHM-Draw-2  54      M      COVID-19        Severe
## GSM4614988 S150_nCoV003EUHM-Draw-1  75      F      COVID-19        Severe
## GSM4614989 S151_nCoV004EUHM-Draw-1  59      M      COVID-19           ICU
## GSM4614990 S152_nCoV006EUHM-Draw-1  59      M      COVID-19        Severe
##            geographicLocation cellType
## GSM4614985   Atlanta, GA, USA     PBMC
## GSM4614986   Atlanta, GA, USA     PBMC
## GSM4614987   Atlanta, GA, USA     PBMC
## GSM4614988   Atlanta, GA, USA     PBMC
## GSM4614989   Atlanta, GA, USA     PBMC
## GSM4614990   Atlanta, GA, USA     PBMC

write.csv(tHeader,'tHeader.csv', row.names=T)
tHeader <- read.csv('tHeader.csv',sep=',',header=T,row.names = 1, stringsAsFactors = F)
tHeader$gender <- trimws(tHeader$gender,which='left',whitespace=' ')
tHeader$geographicLocation <- trimws(tHeader$geographicLocation,which='left',whitespace=' ')
tHeader$diseaseState <- trimws(tHeader$diseaseState,which='left',whitespace=' ')
tHeader$severity <- trimws(tHeader$severity,which='left',whitespace=' ')
tHeader$cellType <- trimws(tHeader$cellType,which='left',whitespace=' ')

unique(tHeader$severity)

## [1] "Convalescent" "Moderate"     "Severe"       "ICU"          "Healthy"

Lets divide this into subgroups based on severity except for the convalescent class. Maybe we can set this aside, and see if we can determine what class it is based on our results in the end of how certain genes we find behave in these other classes of Moderate, Severe, ICU, or Healthy.

convalescent <- subset(tHeader,tHeader$severity=='Convalescent')
moderate <- subset(tHeader, tHeader$severity=='Moderate')
severe <- subset(tHeader, tHeader$severity=='Severe')
ICU <- subset(tHeader, tHeader$severity=='ICU')
healthy <- subset(tHeader, tHeader$severity=='Healthy')
write.csv(tHeader,'tHeader.csv', row.names=T)

info <- read.delim('GSE152418_p20047_Study1_RawCounts.txt',sep='\t',header=T,
                   na.strings=c('',' ','NA'), stringsAsFactors = F)
head(info)

##         ENSEMBLID S145_nCOV001_C S147_nCoV001EUHM.Draw.1
## 1 ENSG00000223972              0                       0
## 2 ENSG00000227232              1                       0
## 3 ENSG00000278267              3                       1
## 4 ENSG00000243485              2                       0
## 5 ENSG00000284332              0                       0
## 6 ENSG00000237613              0                       0
##   S149_nCoV002EUHM.Draw.2 S150_nCoV003EUHM.Draw.1 S151_nCoV004EUHM.Draw.1
## 1                       0                       0                       0
## 2                       3                      16                       1
## 3                       2                       0                       3
## 4                       0                       1                       0
## 5                       0                       0                       0
## 6                       0                       0                       0
##   S152_nCoV006EUHM.Draw.1 S153_nCoV007EUHM.Draw.1 S154_nCoV0010EUHM.Draw.1
## 1                       0                       0                        0
## 2                       2                      18                        6
## 3                       0                       0                        1
## 4                       0                       0                        0
## 5                       0                       0                        0
## 6                       0                       0                        0
##   S155_nCOV021EUHM S156_nCOV024EUHM.Draw.1 S157_nCOV0029EUHM
## 1                2                       0                 2
## 2               20                       0                 2
## 3                3                       2                 2
## 4                0                       0                 0
## 5                0                       0                 0
## 6                0                       0                 0
##   S175_nCoV024EUHM.Draw.2 S176_nCoV025EUHM.Draw.1 S177_nCoV025EUHM.Draw.2
## 1                       0                       0                       0
## 2                       0                      13                      32
## 3                       3                       3                       7
## 4                       0                       0                       0
## 5                       0                       0                       0
## 6                       0                       0                       0
##   S178_nCoV028EUHM.Draw.1 S179_nCoV033EUHM.Draw.1 S180_nCoV034EUHM.Draw.1
## 1                       1                       0                       0
## 2                       3                       2                       9
## 3                       0                       3                       0
## 4                       0                       0                       0
## 5                       0                       0                       0
## 6                       0                       0                       0
##   S061_257 S062_258 S063_259 S064_260 S065_261 S066_265 S067_270 S068_272
## 1        0        0        0        1        0        0        0        0
## 2        6        0        0        1        2        1        0        1
## 3        4        5        3        8        3        9        7        2
## 4        0        0        0        0        0        0        0        0
## 5        0        0        0        0        0        0        0        0
## 6        0        0        0        0        0        0        0        0
##   S069_273 S070_279 S071_280 S181_255 S182_SHXA10 S183_263 S184_SHXA18 S185_266
## 1        0        0        0        0           0        0           0        0
## 2        1        1        2        3           4        1           0        0
## 3        7        2        6        3           2        6          12        5
## 4        0        0        1        0           0        0           0        0
## 5        0        0        0        0           0        0           0        0
## 6        0        0        0        0           0        0           0        0
##   S186_SHXA14
## 1           0
## 2           7
## 3           3
## 4           0
## 5           0
## 6           0

convalescentList <- convalescent$sampleID
moderateList <- moderate$sampleID
severeList <- severe$sampleID
ICUlist <- ICU$sampleID
healthyList <- healthy$sampleID

moderateList

## [1] "S147_nCoV001EUHM-Draw-1" "S153_nCoV007EUHM-Draw-1"
## [3] "S157_nCOV0029EUHM"       "S180_nCoV034EUHM-Draw-1"

severeList

## [1] "S149_nCoV002EUHM-Draw-2"  "S150_nCoV003EUHM-Draw-1" 
## [3] "S152_nCoV006EUHM-Draw-1"  "S154_nCoV0010EUHM-Draw-1"
## [5] "S156_nCOV024EUHM-Draw-1"  "S176_nCoV025EUHM-Draw-1" 
## [7] "S178_nCoV028EUHM-Draw-1"  "S179_nCoV033EUHM-Draw-1"

ICUlist

## [1] "S151_nCoV004EUHM-Draw-1" "S155_nCOV021EUHM"       
## [3] "S175_nCoV024EUHM-Draw-2" "S177_nCoV025EUHM-Draw-2"

healthyList

##  [1] "S061_257"    "S062_258"    "S063_259"    "S064_260"    "S065_261"   
##  [6] "S066_265"    "S067_270"    "S068_272"    "S069_273"    "S070_279"   
## [11] "S071_280"    "S181_255"    "S182_SHXA10" "S183_263"    "S184_SHXA18"
## [16] "S185_266"    "S186_SHXA14"

moderateList2 <- gsub('-','.',moderateList)
moderateDF <- info[,colnames(info) %in% moderateList2]
severeList2 <- gsub('-','.',severeList)
severeDF <- info[,colnames(info) %in% severeList2]
ICUlist2 <- gsub('-','.',ICUlist)
ICUdf <- info[,colnames(info) %in% ICUlist2]
healthyDF <- info[,colnames(info) %in% healthyList]
convalescentDF <- info[,1:2]

cnConv <- colnames(convalescentDF[2])
cnMod <- colnames(moderateDF)
cnSev <- colnames(severeDF)
cnHeal <- colnames(healthyDF)
cnICU <- colnames(ICUdf)
CN_old <- c(cnConv,cnMod,cnSev,cnHeal,cnICU)
CN_old

##  [1] "S145_nCOV001_C"           "S147_nCoV001EUHM.Draw.1" 
##  [3] "S153_nCoV007EUHM.Draw.1"  "S157_nCOV0029EUHM"       
##  [5] "S180_nCoV034EUHM.Draw.1"  "S149_nCoV002EUHM.Draw.2" 
##  [7] "S150_nCoV003EUHM.Draw.1"  "S152_nCoV006EUHM.Draw.1" 
##  [9] "S154_nCoV0010EUHM.Draw.1" "S156_nCOV024EUHM.Draw.1" 
## [11] "S176_nCoV025EUHM.Draw.1"  "S178_nCoV028EUHM.Draw.1" 
## [13] "S179_nCoV033EUHM.Draw.1"  "S061_257"                
## [15] "S062_258"                 "S063_259"                
## [17] "S064_260"                 "S065_261"                
## [19] "S066_265"                 "S067_270"                
## [21] "S068_272"                 "S069_273"                
## [23] "S070_279"                 "S071_280"                
## [25] "S181_255"                 "S182_SHXA10"             
## [27] "S183_263"                 "S184_SHXA18"             
## [29] "S185_266"                 "S186_SHXA14"             
## [31] "S151_nCoV004EUHM.Draw.1"  "S155_nCOV021EUHM"        
## [33] "S175_nCoV024EUHM.Draw.2"  "S177_nCoV025EUHM.Draw.2"

colnames(convalescentDF)[2] <- 'convalescent'
row.names(convalescentDF) <- convalescentDF$ENSEMBLID
colnames(moderateDF) <- c('moderate1','moderate2','moderate3','moderate4')
row.names(moderateDF) <- row.names(convalescentDF)
colnames(severeDF) <- c('severe1','severe2','severe3','severe4','severe5','severe6','severe7',
                        'severe8')
row.names(severeDF) <- row.names(convalescentDF)
colnames(ICUdf) <- c('ICU1','ICU2','ICU3','ICU4')
row.names(ICUdf) <- row.names(convalescentDF)
colnames(healthyDF) <- c('healthy1','healthy2','healthy3','healthy4','healthy5','healthy6',
                         'healthy7','healthy8','healthy9','healthy10','healthy11','healthy12',
                         'healthy13','healthy14','healthy15','healthy16','healthy17')
row.names(healthyDF) <- row.names(convalescentDF)

CN_new <- c(colnames(convalescentDF)[2],colnames(moderateDF),
            colnames(severeDF),colnames(healthyDF),colnames(ICUdf))
CN_new

##  [1] "convalescent" "moderate1"    "moderate2"    "moderate3"    "moderate4"   
##  [6] "severe1"      "severe2"      "severe3"      "severe4"      "severe5"     
## [11] "severe6"      "severe7"      "severe8"      "healthy1"     "healthy2"    
## [16] "healthy3"     "healthy4"     "healthy5"     "healthy6"     "healthy7"    
## [21] "healthy8"     "healthy9"     "healthy10"    "healthy11"    "healthy12"   
## [26] "healthy13"    "healthy14"    "healthy15"    "healthy16"    "healthy17"   
## [31] "ICU1"         "ICU2"         "ICU3"         "ICU4"

CN1 <- as.data.frame(CN_old)
CN1$CN_old <- as.character(paste(CN1$CN_old))
CN1$CN_old <- gsub('[.]','-',CN1$CN_old,perl=T)
CN2 <- as.data.frame(CN_new)
CNs <- cbind(CN1,CN2)

tHeader2 <- tHeader
tHeader2$GSM_ID <- row.names(tHeader2)

HeaderInformation <- merge(CNs,tHeader2,by.x='CN_old', by.y='sampleID')
HeaderInformation <- HeaderInformation[,c(9,1:8)]
head(HeaderInformation)

##       GSM_ID   CN_old   CN_new age gender diseaseState severity
## 1 GSM4615003 S061_257 healthy1  25      M      Healthy  Healthy
## 2 GSM4615006 S062_258 healthy2  70      F      Healthy  Healthy
## 3 GSM4615008 S063_259 healthy3  68      F      Healthy  Healthy
## 4 GSM4615011 S064_260 healthy4  69      M      Healthy  Healthy
## 5 GSM4615014 S065_261 healthy5  29      F      Healthy  Healthy
## 6 GSM4615016 S066_265 healthy6  90      M      Healthy  Healthy
##   geographicLocation cellType
## 1   Atlanta, GA, USA     PBMC
## 2   Atlanta, GA, USA     PBMC
## 3   Atlanta, GA, USA     PBMC
## 4   Atlanta, GA, USA     PBMC
## 5   Atlanta, GA, USA     PBMC
## 6   Atlanta, GA, USA     PBMC

write.csv(HeaderInformation,'HeaderInformation.csv',row.names=F)

Lets get the gene means of each data set.

moderateDF$moderate_mean <- apply(moderateDF,1,mean)
severeDF$severe_mean <- apply(severeDF,1,mean)
ICUdf$ICU_mean <- apply(ICUdf,1,mean)
healthyDF$healthy_mean <- apply(healthyDF,1,mean)

Now combine all the data frames.

dataMeans <- cbind(convalescentDF,healthyDF,moderateDF,severeDF,ICUdf)
DATA <- dataMeans[,c(1:19,21:24,26:33,35:38,20,25,34,39)]
colnames(DATA)

##  [1] "ENSEMBLID"     "convalescent"  "healthy1"      "healthy2"     
##  [5] "healthy3"      "healthy4"      "healthy5"      "healthy6"     
##  [9] "healthy7"      "healthy8"      "healthy9"      "healthy10"    
## [13] "healthy11"     "healthy12"     "healthy13"     "healthy14"    
## [17] "healthy15"     "healthy16"     "healthy17"     "moderate1"    
## [21] "moderate2"     "moderate3"     "moderate4"     "severe1"      
## [25] "severe2"       "severe3"       "severe4"       "severe5"      
## [29] "severe6"       "severe7"       "severe8"       "ICU1"         
## [33] "ICU2"          "ICU3"          "ICU4"          "healthy_mean" 
## [37] "moderate_mean" "severe_mean"   "ICU_mean"

Lets get the fold change ratios of the means of the moderate, severe, and ICU to the healthy means.

DATA$mod_health_foldChange <- DATA$moderate_mean/DATA$healthy_mean
DATA$sevr_health_foldChange <- DATA$severe_mean/DATA$healthy_mean
DATA$ICU_health_foldChange <- DATA$ICU_mean/DATA$healthy_mean

We only want the change, from diseased state to healthy state, and when using division by zero, we will get Inf if the numerator is not zero but the denominator is zero, and NaN when both are zero. To handle this if the numerator is not zero and the denominator is, then the change is the numerator, and if they are both zero, then there is no change, hence zero change will be a 1 because there is no change in over or under expression. My previous use of this was errored, I will go back and fix this mistake, because having a fold change value of 0 actually means the gene had a decrease of 100% or halved, because and increase of 100% is a fold change of 2 and hence doubled. We will write these in with conditions using ifelse.In this regard, we should also change all 0.00000…. to the healthy mean value because if the disease state to healthy state ratio is 0/0.17 then the gene decreased expression 17% to 1. So we will say 1-healthy mean, so that 1 as our no change value between states, 1-the start value, it the disease value for change. Hence, expression in the healthy state to no expression in the disease state. So it decreased expression the level of expression in the healthy state to predict genes related to our grades of disease.

DATA$mod_health_foldChange <- ifelse(DATA$mod_health_foldChange=='Inf',
                                          DATA$moderate_mean, ifelse(DATA$mod_health_foldChange=='NaN', 1, ifelse(DATA$mod_health_foldChange==0, 1-DATA$healthy_mean,
             DATA$mod_health_foldChange)))

DATA$sevr_health_foldChange <- ifelse(DATA$sevr_health_foldChange=='Inf',
        DATA$severe_mean, ifelse(DATA$sevr_health_foldChange=='NaN',  1,
     ifelse(DATA$sevr_health_foldChange==0,1-DATA$healthy_mean,
            DATA$sevr_health_foldChange)))

DATA$ICU_health_foldChange <- ifelse(DATA$ICU_health_foldChange=='Inf',
   DATA$ICU_mean, ifelse(DATA$ICU_health_foldChange=='NaN', 1,
   ifelse(DATA$ICU_health_foldChange==0, 1-DATA$healthy_mean,
          DATA$ICU_health_foldChange)))

write.csv(DATA,'DATA_FCs_GSE152418.csv',row.names=F)

Lets select some genes based on the highest and lowest fold change values. Since we put in the values, our fold change values of zero are significant for those genes that halved. Even smaller values of zero would mean a more significant down regulation. i.e. 0.0 versus 0.001, where the decline in gene expression went from a decrease of 10% versus a decrease of 1000% or 10 times the decrease in the disease state to the healthy state.

DATA1 <- DATA[order(DATA$ICU_health_foldChange,decreasing=T)[c(1:50,60634:60683)],]
dim(DATA1)

## [1] 100  42

write.csv(DATA1,'DATA_100genes.csv',row.names=F)

source('geneCards2.R')
for (i in DATA1$ENSEMBLID){  
  find25genes(i)
}

files <- list.files('./gene scrapes')[1:100]
files

for (i in files){
 file <- paste('./gene scrapes',i, sep='/')
 t <- read.csv(file,sep=',',header=F)
 write.table(t,file='DATA1genes.csv',sep=',',append=T,col.names=F,row.names=F)
}

DATA1genes <- read.delim('DATA1genes.csv',header=F,sep=',',na.strings=c('',' ','NA'))
DATA1genes$V2 <- toupper(DATA1genes$V2)
colnames(DATA1genes) <- c('gene','EnsemblGene','DateSourced')

DATA1genes <- DATA1genes[,-3]
DATA1genes <- DATA1genes[!duplicated(DATA1genes),]

for (i in DATA1genes$gene){
  getSummaries(i,'PBMC')
}

getGeneSummaries('PBMC')

pbmcSumms <- read.csv("proteinGeneSummaries_pbmc.csv")
file.rename("proteinGeneSummaries_pbmc.csv",'proteinGeneSummaries_pbmc1.csv')

pbmcSumms <- read.csv("proteinGeneSummaries_pbmc1.csv")

head(pbmcSumms)

##   proteinSearched    gene
## 1            pbmc  SLC4A1
## 2            pbmc RAP1GAP
## 3            pbmc    GCKR
## 4            pbmc ADAMTS2
## 5            pbmc  CRISP3
## 6            pbmc   OLFM4
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            EntrezSummary
## 1 The protein encoded by this gene is part of the anion exchanger (AE) family and is expressed in the erythrocyte plasma membrane, where it functions as a chloride/bicarbonate exchanger involved in carbon dioxide transport from tissues to lungs. The protein comprises two domains that are structurally and functionally distinct. The N-terminal 40kDa domain is located in the cytoplasm and acts as an attachment site for the red cell skeleton by binding ankyrin. The glycosylated C-terminal membrane-associated domain contains 12-14 membrane spanning segments and carries out the stilbene disulphonate-sensitive exchange transport of anions. The cytoplasmic tail at the extreme C-terminus of the membrane domain binds carbonic anhydrase II. The encoded protein associates with the red cell membrane protein glycophorin A and this association promotes the correct folding and translocation of the exchanger. This protein is predominantly dimeric but forms tetramers in the presence of ankyrin. Many mutations in this gene are known in man, and these mutations can lead to two types of disease: destabilization of red cell membrane leading to hereditary spherocytosis, and defective kidney acid secretion leading to distal renal tubular acidosis. Other mutations that do not give rise to disease result in novel blood group antigens, which form the Diego blood group system. Southeast Asian ovalocytosis (SAO, Melanesian ovalocytosis) results from the heterozygous presence of a deletion in the encoded protein and is common in areas where Plasmodium falciparum malaria is endemic. One null mutation in this gene is known, resulting in very severe anemia and nephrocalcinosis. [provided by RefSeq, Jul 2008]
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    This gene encodes a type of GTPase-activating-protein (GAP) that down-regulates the activity of the ras-related RAP1 protein. RAP1 acts as a molecular switch by cycling between an inactive GDP-bound form and an active GTP-bound form. The product of this gene, RAP1GAP, promotes the hydrolysis of bound GTP and hence returns RAP1 to the inactive state whereas other proteins, guanine nucleotide exchange factors (GEFs), act as RAP1 activators by facilitating the conversion of RAP1 from the GDP- to the GTP-bound form. In general, ras subfamily proteins, such as RAP1, play key roles in receptor-linked signaling pathways that control cell growth and differentiation. RAP1 plays a role in diverse processes such as cell proliferation, adhesion, differentiation, and embryogenesis. Alternative splicing results in multiple transcript variants encoding distinct proteins. [provided by RefSeq, Aug 2011]
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           This gene encodes a protein belonging to the GCKR subfamily of the SIS (Sugar ISomerase) family of proteins. The gene product is a regulatory protein that inhibits glucokinase in liver and pancreatic islet cells by binding non-covalently to form an inactive complex with the enzyme. This gene is considered a susceptibility gene candidate for a form of maturity-onset diabetes of the young (MODY). [provided by RefSeq, Jul 2008]
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      This gene encodes a member of the ADAMTS (a disintegrin and metalloproteinase with thrombospondin motifs) protein family. Members of the family share several distinct protein modules, including a propeptide region, a metalloproteinase domain, a disintegrin-like domain, and a thrombospondin type 1 (TS) motif. Individual members of this family differ in the number of C-terminal TS motifs, and some have unique C-terminal domains. The encoded preproprotein is proteolytically processed to generate the mature procollagen N-proteinase. This proteinase excises the N-propeptide of the fibrillar procollagens types I-III and type V. Mutations in this gene cause Ehlers-Danlos syndrome type VIIC, a recessively inherited connective-tissue disorder. Alternative splicing results in multiple transcript variants, at least one of which encodes an isoform that is proteolytically processed. [provided by RefSeq, Feb 2016]
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       This gene encodes a member of the cysteine-rich secretory protein (CRISP) family within the CRISP, antigen 5 and pathogenesis-related 1 proteins superfamily. The encoded protein has an N-terminal CRISP, antigen 5 and pathogenesis-related 1 proteins domain, a hinge region, and a C-terminal ion channel regulator domain. This protein contains cysteine residues, located in both the N- and C-terminal domains, that form eight disulfide bonds, a distinguishing characteristic of this family. This gene is expressed in the male reproductive tract where it plays a role in sperm function and fertilization, and the female reproductive tract where it plays a role in endometrial receptivity for embryo implantation. This gene is upregulated in certain types of prostate cancer. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Nov 2016]
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              This gene was originally cloned from human myeloblasts and found to be selectively expressed in inflammed colonic epithelium. This gene encodes a member of the olfactomedin family. The encoded protein is an antiapoptotic factor that promotes tumor growth and is an extracellular matrix glycoprotein that facilitates cell adhesion. [provided by RefSeq, Mar 2011]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             GeneCardsSummary
## 1                                             SLC4A1 (Solute Carrier Family 4 Member 1 (Diego Blood Group)) is a Protein Coding gene.                                            Diseases associated with SLC4A1 include Renal Tubular Acidosis, Distal, Autosomal Dominant and Cryohydrocytosis.                                            Among its related pathways are Transport of glucose and other sugars, bile salts and organic acids, metal ions and amine compounds and Neuroscience.                                            Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and transporter activity.                                            An important paralog of this gene is SLC4A2.
## 2                                                                                                                       RAP1GAP (RAP1 GTPase Activating Protein) is a Protein Coding gene.                                            Diseases associated with RAP1GAP include Tuberous Sclerosis and Bleeding Disorder, Platelet-Type, 18.                                            Among its related pathways are Ras signaling pathway and Development Angiotensin activation of ERK.                                            Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and GTPase activator activity.                                            An important paralog of this gene is RAP1GAP2.
## 3                                                                                                                                       GCKR (Glucokinase Regulator) is a Protein Coding gene.                                            Diseases associated with GCKR include Fasting Plasma Glucose Level Quantitative Trait Locus 5 and Maturity-Onset Diabetes Of The Young.                                            Among its related pathways are Regulation of activated PAK-2p34 by proteasome mediated degradation and Hexose transport.                                            Gene Ontology (GO) annotations related to this gene include enzyme binding and protein domain specific binding.                                            
## 4                                                                                                  ADAMTS2 (ADAM Metallopeptidase With Thrombospondin Type 1 Motif 2) is a Protein Coding gene.                                            Diseases associated with ADAMTS2 include Ehlers-Danlos Syndrome, Dermatosparaxis Type and Ehlers-Danlos Syndrome.                                            Among its related pathways are Degradation of the extracellular matrix and Metabolism of proteins.                                            Gene Ontology (GO) annotations related to this gene include peptidase activity and metallopeptidase activity.                                            An important paralog of this gene is ADAMTS3.
## 5                                                                                                                                                                                                                                                                                                                                               CRISP3 (Cysteine Rich Secretory Protein 3) is a Protein Coding gene.                                            Diseases associated with CRISP3 include Prostate Cancer.                                            Among its related pathways are Innate Immune System.                                                                                        An important paralog of this gene is CRISP2.
## 6                                                                                                                                                                                                   OLFM4 (Olfactomedin 4) is a Protein Coding gene.                                            Diseases associated with OLFM4 include Pancreatic Cancer and Ovary Serous Adenocarcinoma.                                            Among its related pathways are Innate Immune System and Adhesion.                                            Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and cadherin binding.                                            An important paralog of this gene is OLFM1.
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  UniProtKB_Summary
## 1 Functions both as a transporter that mediates electroneutral anion exchange across the cell membrane and as a structural protein. Major integral membrane glycoprotein of the erythrocyte membrane; required for normal flexibility and stability of the erythrocyte membrane and for normal erythrocyte shape via the interactions of its cytoplasmic domain with cytoskeletal proteins, glycolytic enzymes, and hemoglobin. Functions as a transporter that mediates the 1:1 exchange of inorganic anions across the erythrocyte membrane. Mediates chloride-bicarbonate exchange in the kidney, and is required for normal acidification of the urine.\n                         B3AT_HUMAN,P02730\n                         
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               GTPase activator for the nuclear Ras-related regulatory protein RAP-1A (KREV-1), converting it to the putatively inactive GDP-bound state.\n                         RPGP1_HUMAN,P47736\n                         
## 3                                                                     Regulates glucokinase (GCK) by forming an inactive complex with this enzyme (PubMed:23621087, PubMed:23733961). Acts by promoting GCK recruitment to the nucleus, possibly to provide a reserve of GCK that can be quickly released in the cytoplasm after a meal (PubMed:10456334). The affinity of GCKR for GCK is modulated by fructose metabolites: GCKR with bound fructose 6-phosphate has increased affinity for GCK, while GCKR with bound fructose 1-phosphate has strongly decreased affinity for GCK and does not inhibit GCK activity (PubMed:23621087, PubMed:23733961).\n                         GCKR_HUMAN,Q14397\n                         
## 4                                                                                                                                                                                                                                                                                                                                       Cleaves the propeptides of type I and II collagen prior to fibril assembly (By similarity). Does not act on type III collagen (By similarity). Cleaves lysyl oxidase LOX at a site downstream of its propeptide cleavage site to produce a short LOX form with reduced collagen-binding activity (PubMed:31152061).\n                         ATS2_HUMAN,O95450\n                         
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       no summary
## 6                                                                                                                                                                                                                                                          May promote proliferation of pancreatic cancer cells by favoring the transition from the S to G2/M phase. In myeloid leukemic cell lines, inhibits cell growth and induces cell differentiation and apoptosis. May play a role in the inhibition of EIF4EBP1 phosphorylation/deactivation. Facilitates cell adhesion, most probably through interaction with cell surface lectins and cadherin.\n                         OLFM4_HUMAN,Q6UX06\n                         
##                 todaysDate
## 1 Fri Aug 14 08:54:39 2020
## 2 Fri Aug 14 08:54:39 2020
## 3 Fri Aug 14 08:54:40 2020
## 4 Fri Aug 14 08:54:41 2020
## 5 Fri Aug 14 08:54:42 2020
## 6 Fri Aug 14 08:54:43 2020

pbmcSumms2 <- pbmcSumms[,c(2:5)]
pbmcSumms2 <- pbmcSumms2[!duplicated(pbmcSumms2),]

pbmcSummaries <- merge(DATA1genes,pbmcSumms2,by.x='gene',by.y='gene')
pbmcSummaries <- pbmcSummaries[!duplicated(pbmcSummaries),]
head(pbmcSummaries)

##        gene     EnsemblGene
## 1 ACBD3-AS1 ENSG00000234478
## 2   ADAMTS2 ENSG00000087116
## 3    ADARB2 ENSG00000235281
## 4      AHSP ENSG00000169877
## 5     AJAP1 ENSG00000284692
## 6       AK6 ENSG00000253333
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       EntrezSummary
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ACBD3-AS1 (ACBD3 Antisense RNA 1) is an RNA Gene, and is affiliated with the lncRNA class.                                            Diseases associated with ACBD3-AS1 include Uterine Corpus Endometrial Carcinoma.                                                                                                                                    
## 2 This gene encodes a member of the ADAMTS (a disintegrin and metalloproteinase with thrombospondin motifs) protein family. Members of the family share several distinct protein modules, including a propeptide region, a metalloproteinase domain, a disintegrin-like domain, and a thrombospondin type 1 (TS) motif. Individual members of this family differ in the number of C-terminal TS motifs, and some have unique C-terminal domains. The encoded preproprotein is proteolytically processed to generate the mature procollagen N-proteinase. This proteinase excises the N-propeptide of the fibrillar procollagens types I-III and type V. Mutations in this gene cause Ehlers-Danlos syndrome type VIIC, a recessively inherited connective-tissue disorder. Alternative splicing results in multiple transcript variants, at least one of which encodes an isoform that is proteolytically processed. [provided by RefSeq, Feb 2016]
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             This gene encodes a member of the double-stranded RNA adenosine deaminase family of RNA-editing enzymes and may play a regulatory role in RNA editing. [provided by RefSeq, Jul 2008]
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           This gene encodes a molecular chaperone which binds specifically to free alpha-globin and is involved in hemoglobin assembly. The encoded protein binds to monomeric alpha-globin until it has been transferred to beta-globin to form a heterodimer, which in turn binds to another heterodimer to form the stable tetrameric hemoglobin. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Dec 2015]
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               AJAP1 (Adherens Junctions Associated Protein 1) is a Protein Coding gene.                                            Diseases associated with AJAP1 include Dental Caries and Migraine, Familial Hemiplegic, 2.                                                                                                                                    
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 This gene encodes a protein that belongs to the adenylate kinase family of enzymes. The protein has a nuclear localization and contains Walker A (P-loop) and Walker B motifs and a metal-coordinating residue. The protein may be involved in regulation of Cajal body formation. In human, AK6 and TAF9 (GeneID: 6880) are two distinct genes that share 5' exons. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Sep 2013]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        GeneCardsSummary
## 1                                                                                                                                                                                                                                                                                                                                            ACBD3-AS1 (ACBD3 Antisense RNA 1) is an RNA Gene, and is affiliated with the lncRNA class.                                            Diseases associated with ACBD3-AS1 include Uterine Corpus Endometrial Carcinoma.                                                                                                                                    
## 2                                             ADAMTS2 (ADAM Metallopeptidase With Thrombospondin Type 1 Motif 2) is a Protein Coding gene.                                            Diseases associated with ADAMTS2 include Ehlers-Danlos Syndrome, Dermatosparaxis Type and Ehlers-Danlos Syndrome.                                            Among its related pathways are Degradation of the extracellular matrix and Metabolism of proteins.                                            Gene Ontology (GO) annotations related to this gene include peptidase activity and metallopeptidase activity.                                            An important paralog of this gene is ADAMTS3.
## 3                                                                                                                                        ADARB2 (Adenosine Deaminase RNA Specific B2 (Inactive)) is a Protein Coding gene.                                            Diseases associated with ADARB2 include Dyschromatosis Symmetrica Hereditaria and Juvenile Astrocytoma.                                            Among its related pathways are ATP/ITP metabolism.                                            Gene Ontology (GO) annotations related to this gene include double-stranded RNA binding.                                            An important paralog of this gene is ADARB1.
## 4                                                                                                                                                                                                                                                              AHSP (Alpha Hemoglobin Stabilizing Protein) is a Protein Coding gene.                                            Diseases associated with AHSP include Beta-Thalassemia and Thalassemia.                                                                                        Gene Ontology (GO) annotations related to this gene include unfolded protein binding and hemoglobin binding.                                            
## 5                                                                                                                                                                                                                                                                                                                                                   AJAP1 (Adherens Junctions Associated Protein 1) is a Protein Coding gene.                                            Diseases associated with AJAP1 include Dental Caries and Migraine, Familial Hemiplegic, 2.                                                                                                                                    
## 6                                                                                                                                                                                                                                                                                  AK6 (Adenylate Kinase 6) is a Protein Coding gene.                                                                                        Among its related pathways are Metabolism and Metabolism of nucleotides.                                            Gene Ontology (GO) annotations related to this gene include ATPase activity and adenylate kinase activity.                                            
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        UniProtKB_Summary
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             no summary
## 2                                                                                                                                                                                                                                                                                                                             Cleaves the propeptides of type I and II collagen prior to fibril assembly (By similarity). Does not act on type III collagen (By similarity). Cleaves lysyl oxidase LOX at a site downstream of its propeptide cleavage site to produce a short LOX form with reduced collagen-binding activity (PubMed:31152061).\n                         ATS2_HUMAN,O95450\n                         
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                        Lacks editing activity. It prevents the binding of other ADAR enzymes to targets in vitro, and decreases the efficiency of these enzymes. Capable of binding to dsRNA but also to ssRNA.\n                         RED2_HUMAN,Q9NS39\n                         
## 4                                                                                                                                                                                                                                                                                                                                                       Acts as a chaperone to prevent the harmful aggregation of alpha-hemoglobin during normal erythroid cell development. Specifically protects free alpha-hemoglobin from precipitation. It is predicted to modulate pathological states of alpha-hemoglobin excess such as beta-thalassemia.\n                         AHSP_HUMAN,Q9NZD4\n                         
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Plays a role in cell adhesion and cell migration.\n                         AJAP1_HUMAN,Q9UKB5\n                         
## 6 Broad-specificity nucleoside monophosphate (NMP) kinase that catalyzes the reversible transfer of the terminal phosphate group between nucleoside triphosphates and monophosphates. AMP and dAMP are the preferred substrates, but CMP and dCMP are also good substrates. IMP is phosphorylated to a much lesser extent. All nucleoside triphosphates ATP, GTP, UTP, CTP, dATP, dCTP, dGTP, and TTP are accepted as phosphate donors. CTP is the best phosphate donor, followed by UTP, ATP, GTP and dCTP. May have a role in nuclear energy homeostasis. Has also ATPase activity. May be involved in regulation of Cajal body (CB) formation.\n                         KAD6_HUMAN,Q9Y3D8\n

DATA1_Summs <- merge(pbmcSummaries,DATA1,by.x='EnsemblGene','ENSEMBLID')

write.csv(DATA1_Summs,'DATA_100FCs_summaries.csv',row.names=F)

We can further divide these groups into females versus males, and age by splitting by the median age.

quantile(tHeader$age,.5)

## 50% 
##  56

sort(tHeader$age)

##  [1] 23 25 26 27 28 29 38 46 47 48 48 50 53 54 55 56 56 56 57 59 59 60 64 68 69
## [26] 70 75 75 76 80 84 85 90 91

The ages of these patients is from 23-91 years. With a median age of 56 years.

summary(tHeader$gender=='F')

##    Mode   FALSE    TRUE 
## logical      16      18

There are also 18 females and 16 males.

The median age is 56 in this data of COVID patients.

moderateAgeList <- tHeader$age[tHeader$sampleID %in% moderateList]
sort(moderateAgeList)

## [1] 46 47 53 75

severeAgeList <- tHeader$age[tHeader$sampleID %in% severeList]
sort(severeAgeList)

## [1] 48 54 56 56 59 64 75 76

ICUageList <- tHeader$age[tHeader$sampleID %in% ICUlist]
sort(ICUageList)

## [1] 48 56 59 60

healthyAgeList <- tHeader$age[tHeader$sampleID %in% healthyList]
sort(healthyAgeList)

##  [1] 23 25 26 27 28 29 38 50 55 57 68 69 70 84 85 90 91

If we split by less than or at 56 years of age on all data sets, we have data that is almost balanced between the two categories of older than 56 or 56 and younger patients.

For gender, lets see if the class balance is distributed almost equally.

moderateGenderList <- tHeader$gender[tHeader$sampleID %in% moderateList]
sort(moderateGenderList)

## [1] "F" "F" "F" "F"

severeGenderList <- tHeader$gender[tHeader$sampleID %in% severeList]
sort(severeGenderList)

## [1] "F" "F" "F" "M" "M" "M" "M" "M"

ICUgenderList <- tHeader$gender[tHeader$sampleID %in% ICUlist]
sort(ICUgenderList)

## [1] "F" "F" "M" "M"

healthyGenderList <- tHeader$gender[tHeader$sampleID %in% healthyList]
sort(healthyGenderList)

##  [1] "F" "F" "F" "F" "F" "F" "F" "F" "F" "M" "M" "M" "M" "M" "M" "M" "M"

When creating divisions by gender, the moderate class of patients is all females, so we wouldn’t get any male data from that category of COVID-19 class, but the divisions by gender are almost equal for the other categories of healthy, ICU, and severe.

Determining what genes are behaving differently when controlling for gender and age could add some useful information of how well aging and gender handle COVID-19 and body reactions.

The age will be split by those older than 56 and those younger than 57. And the genders will be split by those female and males only in the severe, ICU, and healthy data sets, as the moderate data set is entirely females.

moderateList

## [1] "S147_nCoV001EUHM-Draw-1" "S153_nCoV007EUHM-Draw-1"
## [3] "S157_nCOV0029EUHM"       "S180_nCoV034EUHM-Draw-1"

tHeader[tHeader$sampleID %in% moderateList,]

##                           sampleID age gender diseaseState severity
## GSM4614986 S147_nCoV001EUHM-Draw-1  75      F     COVID-19 Moderate
## GSM4614991 S153_nCoV007EUHM-Draw-1  53      F     COVID-19 Moderate
## GSM4614995       S157_nCOV0029EUHM  47      F     COVID-19 Moderate
## GSM4615001 S180_nCoV034EUHM-Draw-1  46      F     COVID-19 Moderate
##            geographicLocation cellType
## GSM4614986   Atlanta, GA, USA     PBMC
## GSM4614991   Atlanta, GA, USA     PBMC
## GSM4614995   Atlanta, GA, USA     PBMC
## GSM4615001   Atlanta, GA, USA     PBMC

Moderate by age younger than 57 and older than 56.

mod56 <- moderateDF[,2:4]
mod57 <- moderateDF[,1]

severe by age younger than 57 and older than 56.

tHeader[tHeader$sampleID %in% severeList,]

##                            sampleID age gender diseaseState severity
## GSM4614987  S149_nCoV002EUHM-Draw-2  54      M     COVID-19   Severe
## GSM4614988  S150_nCoV003EUHM-Draw-1  75      F     COVID-19   Severe
## GSM4614990  S152_nCoV006EUHM-Draw-1  59      M     COVID-19   Severe
## GSM4614992 S154_nCoV0010EUHM-Draw-1  64      F     COVID-19   Severe
## GSM4614994  S156_nCOV024EUHM-Draw-1  48      M     COVID-19   Severe
## GSM4614997  S176_nCoV025EUHM-Draw-1  56      F     COVID-19   Severe
## GSM4614999  S178_nCoV028EUHM-Draw-1  56      M     COVID-19   Severe
## GSM4615000  S179_nCoV033EUHM-Draw-1  76      M     COVID-19   Severe
##            geographicLocation cellType
## GSM4614987   Atlanta, GA, USA     PBMC
## GSM4614988   Atlanta, GA, USA     PBMC
## GSM4614990   Atlanta, GA, USA     PBMC
## GSM4614992   Atlanta, GA, USA     PBMC
## GSM4614994   Atlanta, GA, USA     PBMC
## GSM4614997   Atlanta, GA, USA     PBMC
## GSM4614999   Atlanta, GA, USA     PBMC
## GSM4615000   Atlanta, GA, USA     PBMC

severe56 <- severeDF[,c(1,5:7)]
severe57 <- severeDF[,c(2:4,8)]

ICU by age younger than 57 and older than 56.

tHeader[tHeader$sampleID %in% ICUlist,]

##                           sampleID age gender diseaseState severity
## GSM4614989 S151_nCoV004EUHM-Draw-1  59      M     COVID-19      ICU
## GSM4614993        S155_nCOV021EUHM  60      F     COVID-19      ICU
## GSM4614996 S175_nCoV024EUHM-Draw-2  48      M     COVID-19      ICU
## GSM4614998 S177_nCoV025EUHM-Draw-2  56      F     COVID-19      ICU
##            geographicLocation cellType
## GSM4614989   Atlanta, GA, USA     PBMC
## GSM4614993   Atlanta, GA, USA     PBMC
## GSM4614996   Atlanta, GA, USA     PBMC
## GSM4614998   Atlanta, GA, USA     PBMC

ICU56 <- ICUdf[,c(3,4)]
ICU57 <- ICUdf[,c(1,2)]

healthy by age younger than 57 and older than 56.

tHeader[tHeader$sampleID %in% healthyList,]

##               sampleID age gender diseaseState severity geographicLocation
## GSM4615003    S061_257  25      M      Healthy  Healthy   Atlanta, GA, USA
## GSM4615006    S062_258  70      F      Healthy  Healthy   Atlanta, GA, USA
## GSM4615008    S063_259  68      F      Healthy  Healthy   Atlanta, GA, USA
## GSM4615011    S064_260  69      M      Healthy  Healthy   Atlanta, GA, USA
## GSM4615014    S065_261  29      F      Healthy  Healthy   Atlanta, GA, USA
## GSM4615016    S066_265  90      M      Healthy  Healthy   Atlanta, GA, USA
## GSM4615019    S067_270  85      F      Healthy  Healthy   Atlanta, GA, USA
## GSM4615022    S068_272  28      M      Healthy  Healthy   Atlanta, GA, USA
## GSM4615025    S069_273  26      F      Healthy  Healthy   Atlanta, GA, USA
## GSM4615027    S070_279  38      M      Healthy  Healthy   Atlanta, GA, USA
## GSM4615030    S071_280  84      F      Healthy  Healthy   Atlanta, GA, USA
## GSM4615032    S181_255  27      M      Healthy  Healthy   Atlanta, GA, USA
## GSM4615033 S182_SHXA10  50      F      Healthy  Healthy   Atlanta, GA, USA
## GSM4615034    S183_263  23      F      Healthy  Healthy   Atlanta, GA, USA
## GSM4615035 S184_SHXA18  55      M      Healthy  Healthy   Atlanta, GA, USA
## GSM4615036    S185_266  91      F      Healthy  Healthy   Atlanta, GA, USA
## GSM4615037 S186_SHXA14  57      M      Healthy  Healthy   Atlanta, GA, USA
##            cellType
## GSM4615003     PBMC
## GSM4615006     PBMC
## GSM4615008     PBMC
## GSM4615011     PBMC
## GSM4615014     PBMC
## GSM4615016     PBMC
## GSM4615019     PBMC
## GSM4615022     PBMC
## GSM4615025     PBMC
## GSM4615027     PBMC
## GSM4615030     PBMC
## GSM4615032     PBMC
## GSM4615033     PBMC
## GSM4615034     PBMC
## GSM4615035     PBMC
## GSM4615036     PBMC
## GSM4615037     PBMC

healthy56 <- healthyDF[, c(1,5,8:10,12:15)]
healthy57 <- healthyDF[,c(2:4,6,7,11,16,17)]

Lets get the means of each data frame on age in the four health cases.

mod56$mod56_mean <- apply(mod56,1,mean)
mod57a <- as.data.frame(mod57)
colnames(mod57a) <- 'moderate1' #this is a vector of only 1 dimension
mod57b <- as.data.frame(mod57)
colnames(mod57b) <- 'mod57_mean'
mod57c <- cbind(mod57a,mod57b)
severe56$severe56_mean <- apply(severe56,1,mean)
severe57$severe57_mean <- apply(severe57,1,mean)
ICU56$ICU56_mean <- apply(ICU56,1,mean)
ICU57$ICU57_mean <- apply(ICU57,1,mean)
healthy56$healthy56_mean <- apply(healthy56,1,mean)
healthy57$healthy57_mean <- apply(healthy57,1,mean)

ageMeans <- cbind(mod56,mod57c,severe56,severe57,ICU56,ICU57,healthy56,healthy57)

ageMeans2 <- ageMeans[,c(4,6,11,16,19,22,32,41)]
ageMeans2$gene <- row.names(ageMeans2)
ageMeans3 <- ageMeans2[,c(9,1:8)]
colnames(ageMeans3)

## [1] "gene"           "mod56_mean"     "mod57_mean"     "severe56_mean" 
## [5] "severe57_mean"  "ICU56_mean"     "ICU57_mean"     "healthy56_mean"
## [9] "healthy57_mean"

ageMeans3$ICU_56_FoldChange <- ageMeans3$ICU56_mean/ageMeans3$healthy56_mean
ageMeans3$ICU_57_FoldChange <- ageMeans3$ICU57_mean/ageMeans3$healthy57_mean
ageMeans3$severe_56_FoldChange <- ageMeans3$severe56_mean/ageMeans3$healthy56_mean
ageMeans3$severe_57_FoldChange <- ageMeans3$severe57_mean/ageMeans3$healthy57_mean
ageMeans3$moderate_56_FoldChange <- ageMeans3$mod56_mean/ageMeans3$healthy56_mean
ageMeans3$moderate_57_FoldChange <- ageMeans3$mod57_mean/ageMeans3$healthy57_mean

ageMeans3$ICU_56_FoldChange <- ifelse(ageMeans3$ICU_56_FoldChange=='Inf',
                                      ageMeans3$ICU56_mean,
                                      ifelse(ageMeans3$ICU_56_FoldChange=='NaN',
                                             1,
                  ifelse(ageMeans3$ICU_56_FoldChange==0, 1-ageMeans3$healthy56_mean,
                                             ageMeans3$ICU_56_FoldChange)))

ageMeans3$ICU_57_FoldChange <- ifelse(ageMeans3$ICU_57_FoldChange=='Inf',
                                      ageMeans3$ICU57_mean,
                                      ifelse(ageMeans3$ICU_57_FoldChange=='NaN',
                                             1,
                  ifelse(ageMeans3$ICU_57_FoldChange==0,
                         1-ageMeans3$healthy57_mean,                                             ageMeans3$ICU_57_FoldChange)))

ageMeans3$severe_56_FoldChange <- ifelse(ageMeans3$severe_56_FoldChange=='Inf',
                                      ageMeans3$severe56_mean, ifelse(ageMeans3$severe_56_FoldChange=='NaN',
                                             1,
                    ifelse(ageMeans3$severe_56_FoldChange==0,
                           1-ageMeans3$healthy56_mean,
                                             ageMeans3$severe_56_FoldChange)))

ageMeans3$severe_57_FoldChange <- ifelse(ageMeans3$severe_57_FoldChange=='Inf',
                                      ageMeans3$severe57_mean, ifelse(ageMeans3$severe_57_FoldChange=='NaN',
                                             1,
                      ifelse(ageMeans3$severe_57_FoldChange==0,
                             1-ageMeans3$healthy57_mean,
                                             ageMeans3$severe_57_FoldChange)))



ageMeans3$moderate_56_FoldChange <- ifelse(ageMeans3$moderate_56_FoldChange=='Inf',
                                      ageMeans3$mod56_mean, ifelse(ageMeans3$moderate_56_FoldChange=='NaN',
                                             1,
                ifelse(ageMeans3$moderate_56_FoldChange==0,
                       1-ageMeans3$healthy56_mean,
                                             ageMeans3$moderate_56_FoldChange)))

ageMeans3$moderate_57_FoldChange <- ifelse(ageMeans3$moderate_57_FoldChange=='Inf',
                                      ageMeans3$mod57_mean, ifelse(ageMeans3$moderate_57_FoldChange=='NaN',
                                             1,
                ifelse(ageMeans3$moderate_57_FoldChange==0,
                       1-ageMeans3$healthy57_mean,
                                             ageMeans3$moderate_57_FoldChange)))

AgeFC_df <- ageMeans3[order(ageMeans3$ICU_56_FoldChange,decreasing=T)[c(1:50,60634:60683)],]
dim(AgeFC_df)

## [1] 100  15

Reset gene scrapes folder for new genes to gather since last run of find25genes().

source('geneCards2.R')

for (i in AgeFC_df$gene){
  find25genes(i)
}

files <- list.files('./gene scrapes')[1:100]
files

for (i in files){
 file <- paste('./gene scrapes',i, sep='/')
 t <- read.csv(file,sep=',',header=F)
 write.table(t,file='ageFCgenes.csv',sep=',',append=T,col.names=F,row.names=F)
}

ageFCgenes <- read.delim('ageFCgenes.csv',header=F,sep=',',na.strings=c('',' ','NA'))
ageFCgenes$V2 <- toupper(ageFCgenes$V2)
colnames(ageFCgenes) <- c('gene','EnsemblGene','DateSourced')
ageFCgenes_b <- ageFCgenes[,-3]
ageFCgenes_c <- ageFCgenes_b[!duplicated(ageFCgenes_b),]

for (i in ageFCgenes_c$gene[1:928]){
  getSummaries(i,'PBMC')
}

for (i in ageFCgenes_c$gene[929:length(ageFCgenes_c$gene)]){
  getSummaries(i,'PBMC')
}

getGeneSummaries('PBMC')

ageFC_summs <- read.csv("proteinGeneSummaries_pbmc.csv")
file.rename("proteinGeneSummaries_pbmc.csv",'proteinGeneSummaries_pbmc2.csv')

ageFC_summs <- read.csv("proteinGeneSummaries_pbmc2.csv")

ageFC_summs2 <- ageFC_summs[,c(2:5)]
ageFCs <- merge(ageFCgenes_c, ageFC_summs2,by.x='gene',by.y='gene')

ageFCs2 <- merge(ageFCs,AgeFC_df,by.x='EnsemblGene',by.y='gene')

ageFCs3 <- ageFCs2[!duplicated(ageFCs2),]

write.csv(ageFCs3,'ageFCs_summs.csv',row.names=F)

Now for the gender differences. The moderate data is already all females. We are going to use only the females and males in the severe, ICU, and healthy datasets.

genderSevere <- tHeader[tHeader$sampleID %in% severeList,]
femaleSevere <- severeDF[,c(2,4,6)]
maleSevere <- severeDF[,c(1,3,5,7,8)]

genderICU <- tHeader[tHeader$sampleID %in% ICUlist,]
genderICU

##                           sampleID age gender diseaseState severity
## GSM4614989 S151_nCoV004EUHM-Draw-1  59      M     COVID-19      ICU
## GSM4614993        S155_nCOV021EUHM  60      F     COVID-19      ICU
## GSM4614996 S175_nCoV024EUHM-Draw-2  48      M     COVID-19      ICU
## GSM4614998 S177_nCoV025EUHM-Draw-2  56      F     COVID-19      ICU
##            geographicLocation cellType
## GSM4614989   Atlanta, GA, USA     PBMC
## GSM4614993   Atlanta, GA, USA     PBMC
## GSM4614996   Atlanta, GA, USA     PBMC
## GSM4614998   Atlanta, GA, USA     PBMC

femaleICU <- ICUdf[,c(2,4)]
maleICU <- ICUdf[,c(1,3)]

genderHealthy <- tHeader[tHeader$sampleID %in% healthyList,]
genderHealthy

##               sampleID age gender diseaseState severity geographicLocation
## GSM4615003    S061_257  25      M      Healthy  Healthy   Atlanta, GA, USA
## GSM4615006    S062_258  70      F      Healthy  Healthy   Atlanta, GA, USA
## GSM4615008    S063_259  68      F      Healthy  Healthy   Atlanta, GA, USA
## GSM4615011    S064_260  69      M      Healthy  Healthy   Atlanta, GA, USA
## GSM4615014    S065_261  29      F      Healthy  Healthy   Atlanta, GA, USA
## GSM4615016    S066_265  90      M      Healthy  Healthy   Atlanta, GA, USA
## GSM4615019    S067_270  85      F      Healthy  Healthy   Atlanta, GA, USA
## GSM4615022    S068_272  28      M      Healthy  Healthy   Atlanta, GA, USA
## GSM4615025    S069_273  26      F      Healthy  Healthy   Atlanta, GA, USA
## GSM4615027    S070_279  38      M      Healthy  Healthy   Atlanta, GA, USA
## GSM4615030    S071_280  84      F      Healthy  Healthy   Atlanta, GA, USA
## GSM4615032    S181_255  27      M      Healthy  Healthy   Atlanta, GA, USA
## GSM4615033 S182_SHXA10  50      F      Healthy  Healthy   Atlanta, GA, USA
## GSM4615034    S183_263  23      F      Healthy  Healthy   Atlanta, GA, USA
## GSM4615035 S184_SHXA18  55      M      Healthy  Healthy   Atlanta, GA, USA
## GSM4615036    S185_266  91      F      Healthy  Healthy   Atlanta, GA, USA
## GSM4615037 S186_SHXA14  57      M      Healthy  Healthy   Atlanta, GA, USA
##            cellType
## GSM4615003     PBMC
## GSM4615006     PBMC
## GSM4615008     PBMC
## GSM4615011     PBMC
## GSM4615014     PBMC
## GSM4615016     PBMC
## GSM4615019     PBMC
## GSM4615022     PBMC
## GSM4615025     PBMC
## GSM4615027     PBMC
## GSM4615030     PBMC
## GSM4615032     PBMC
## GSM4615033     PBMC
## GSM4615034     PBMC
## GSM4615035     PBMC
## GSM4615036     PBMC
## GSM4615037     PBMC

femaleHealthy <- healthyDF[,c(2,3,5,7,9,11,13,14,16)]
maleHealthy <- healthyDF[,c(1,4,6,8,10,12,15,17)]

femaleHealthy$femHealthy_mean <- apply(femaleHealthy,1,mean)
maleHealthy$maleHealthy_mean <- apply(maleHealthy,1,mean)
femaleICU$femICU_mean <- apply(femaleICU,1,mean)
maleICU$maleICU_mean <- apply(maleICU,1,mean)
femaleSevere$femSevere_mean <- apply(femaleSevere,1,mean)
maleSevere$maleSevere_mean <- apply(maleSevere,1,mean)

genderDF <- cbind(femaleHealthy,maleHealthy,
                  femaleICU, maleICU,
                  femaleSevere,
                  maleSevere)
colnames(genderDF)

##  [1] "healthy2"         "healthy3"         "healthy5"         "healthy7"        
##  [5] "healthy9"         "healthy11"        "healthy13"        "healthy14"       
##  [9] "healthy16"        "femHealthy_mean"  "healthy1"         "healthy4"        
## [13] "healthy6"         "healthy8"         "healthy10"        "healthy12"       
## [17] "healthy15"        "healthy17"        "maleHealthy_mean" "ICU2"            
## [21] "ICU4"             "femICU_mean"      "ICU1"             "ICU3"            
## [25] "maleICU_mean"     "severe2"          "severe4"          "severe6"         
## [29] "femSevere_mean"   "severe1"          "severe3"          "severe5"         
## [33] "severe7"          "severe8"          "maleSevere_mean"

genderDF_FCs <- genderDF[,c(10,19,22,25,29,35)]
colnames(genderDF_FCs)

## [1] "femHealthy_mean"  "maleHealthy_mean" "femICU_mean"      "maleICU_mean"    
## [5] "femSevere_mean"   "maleSevere_mean"

genderDF_FCs$femSevereHealthy_FoldChange <- genderDF_FCs$femSevere_mean/genderDF_FCs$femHealthy_mean
genderDF_FCs$maleSevereHealthy_FoldChange <-
  genderDF_FCs$maleSevere_mean/genderDF_FCs$maleHealthy_mean
genderDF_FCs$femICUhealthy_FoldChange <- genderDF_FCs$femICU_mean/genderDF_FCs$femHealthy_mean
genderDF_FCs$maleICUhealthy_FoldChange <- genderDF_FCs$maleICU_mean/genderDF_FCs$maleHealthy_mean

genderDF_FCs$femSevereHealthy_FoldChange <- ifelse(genderDF_FCs$femSevereHealthy_FoldChange=='Inf',
              genderDF_FCs$femSevere_mean, ifelse(genderDF_FCs$femSevereHealthy_FoldChange=='NaN',
                                                  1,
          ifelse(genderDF_FCs$femSevereHealthy_FoldChange==0,
                1- genderDF_FCs$femHealthy_mean, genderDF_FCs$femSevereHealthy_FoldChange)))

genderDF_FCs$maleSevereHealthy_FoldChange <- ifelse(genderDF_FCs$maleSevereHealthy_FoldChange=='Inf',
              genderDF_FCs$maleSevere_mean, ifelse(genderDF_FCs$maleSevereHealthy_FoldChange=='NaN',
                                                  1,
                  ifelse(genderDF_FCs$maleSevereHealthy_FoldChange==0,
                         1-genderDF_FCs$maleHealthy_mean,                                                  genderDF_FCs$maleSevereHealthy_FoldChange)))

genderDF_FCs$femICUhealthy_FoldChange <- ifelse(genderDF_FCs$femICUhealthy_FoldChange=='Inf',
              genderDF_FCs$femICU_mean, ifelse(genderDF_FCs$femICUhealthy_FoldChange=='NaN',
                                                  1,
            ifelse(genderDF_FCs$femICUhealthy_FoldChange==0,
                   1-genderDF_FCs$femHealthy_mean,
                                      genderDF_FCs$femICUhealthy_FoldChange)))

genderDF_FCs$maleICUhealthy_FoldChange <- ifelse(genderDF_FCs$maleICUhealthy_FoldChange=='Inf',
              genderDF_FCs$maleICU_mean, ifelse(genderDF_FCs$maleICUhealthy_FoldChange=='NaN',
                                                  1,
              ifelse(genderDF_FCs$maleICUhealthy_FoldChange==0,
                     1-genderDF_FCs$maleHealthy_mean,
                                      genderDF_FCs$maleICUhealthy_FoldChange)))

genderDF_FCs100 <- genderDF_FCs[order(genderDF_FCs$femICUhealthy_FoldChange)[c(1:50,60634:60683)],]
dim(genderDF_FCs100)

## [1] 100  10

genderDF_FCs100$gene <- row.names(genderDF_FCs100)
genderDF_FCs100b <- genderDF_FCs100[,c(11,1:10)]

source('geneCards2.R')

for (i in genderDF_FCs100b$gene){
  find25genes(i)
}

files <- list.files('./gene scrapes')[1:100]
files

for (i in files){
 file <- paste('./gene scrapes',i, sep='/')
 t <- read.csv(file,sep=',',header=F)
 write.table(t,file='genderFCgenes.csv',sep=',',append=T,col.names=F,row.names=F)
}

genderFCgenes <- read.delim('genderFCgenes.csv',header=F,sep=',',na.strings=c('',' ','NA'))
genderFCgenes$V2 <- toupper(genderFCgenes$V2)
colnames(genderFCgenes) <- c('gene','EnsemblGene','DateSourced')
genderFCgenes <- genderFCgenes[,-3]
genderFCgenes <- genderFCgenes[!duplicated(genderFCgenes),]

for (i in genderFCgenes$gene[1:283]){
  getSummaries(i,'PBMC')
}

The file is in the gene scrapes folder from the source code, it is ‘summary.csv’ if there is an http error to backtrack and start at the list where cut in previous chunk.

for (i in genderFCgenes$gene[284:length(genderFCgenes$gene)]){
  getSummaries(i,'PBMC')
}

getGeneSummaries('PBMC')

genderFC_summs <- read.csv("proteinGeneSummaries_pbmc.csv")
file.rename("proteinGeneSummaries_pbmc.csv",'proteinGeneSummaries_pbmc3.csv')

genderFC_summs <- read.csv("proteinGeneSummaries_pbmc3.csv")

genderFC_summs2 <- genderFC_summs[,c(2:5)]
genderFCgenes2 <- genderFCgenes[,-3]
genderFCs <- merge(genderFCgenes2, genderFC_summs2,by.x='gene',by.y='gene')

genderFCs2 <- merge(genderFCs,genderDF_FCs100b,by.x='EnsemblGene',by.y='gene')

genderFCs3 <- genderFCs2[!duplicated(genderFCs2),]

write.csv(genderFCs3,'genderFCs_summs.csv',row.names=F)

Ok, tables are ready for Tableau. For some visual analytics. The moderate data frame is all females, so when we look at the fold change of the moderate samples to healthy samples we are looking at all females to mixed genders of healthy samples. We could also just look at the fold change for moderate COVID-19 females to healthy females separately.

modFemales <- moderateDF
modFemales$femHealthy_mean <- genderDF_FCs$femHealthy_mean

modFemales$femModHealthy_FoldChange <- modFemales$moderate_mean/modFemales$femHealthy_mean

modFemales$femModHealthy_FoldChange <- ifelse(modFemales$femModHealthy_FoldChange=='Inf',
                                       modFemales$moderate_mean, ifelse(modFemales$femModHealthy_FoldChange=='NaN', 1,
        ifelse(modFemales$femModHealthy_FoldChange==0,
              1- modFemales$femHealthy_mean,
               modFemales$femModHealthy_FoldChange)))

modFemales$gene <- row.names(modFemales)
modFemales2 <- modFemales[,c(8,1:7)]
colnames(modFemales2)[6] <- 'femModerate_mean'
modFemales3 <- merge(genderFCgenes2,modFemales2,by.x='EnsemblGene',by.y='gene')
modFemale4 <- merge(genderFC_summs2,modFemales3,by.x='gene', by.y='gene')
modFemales4 <- modFemale4[!duplicated(modFemale4),]
modFemales5 <- modFemales4[order(modFemales4$femModHealthy_FoldChange)[c(0:50,334:383)],]
write.csv(modFemales5,'femaleModerateFCs.csv',row.names=F)

DATA1_Summs highest fold change genes by ICU_health_foldChange

dataML20 <- DATA1_Summs[order(DATA1_Summs$ICU_health_foldChange,
                              decreasing=T)[1:80],]
dataML_20 <- dataML20[!duplicated(dataML20),-c(1,3,4,5)]
row.names(dataML_20) <- NULL
head(dataML_20,10)

##        gene convalescent healthy1 healthy2 healthy3 healthy4 healthy5 healthy6
## 1      HBA2          134        1      212     1063        7       55       31
## 2      GYPB            1        0        0        0        0        0        0
## 3       CA1           13        1        2       13        1        3        0
## 4       HBM            3        0        1        2        0        0        0
## 5     ALAS2           47        0        5       70        3        6        5
## 6       HBD           27        1        8        6        8        5        8
## 7    IFIT1B            1        0        6        0        0        1        0
## 8      AHSP            3        2        0        0        1        0        2
## 9      GYPA            0        0        0        0        0        0        0
## 10 SELENBP1            5        4        3        7        2        6        0
##    healthy7 healthy8 healthy9 healthy10 healthy11 healthy12 healthy13 healthy14
## 1       111        7       39        19       113         5         0         0
## 2         0        0        0         0         0         0         6         0
## 3         2        0        5         0         5         1       137         0
## 4         1        0        0         0         0         1        56         0
## 5        22        1        6         1        14         5       804         0
## 6        32        0        1         5         5         2       191         3
## 7         0        2        1         4         0         0        54         0
## 8         0        1        0         0         0         0        68         0
## 9         0        0        0         0         0         0         1         0
## 10       17        1        4         7         0         0        75         1
##    healthy15 healthy16 healthy17 moderate1 moderate2 moderate3 moderate4
## 1        438       467        97         9     10257     21918      3841
## 2          2         1         0         0        63        66        18
## 3          4        34         4         3       515       715       423
## 4          8        12         3         0       301       371       124
## 5         18        83        23         2      2641      6799      1216
## 6          8        30        12         1       694      1909       845
## 7          1         2         2         3        86       188       205
## 8          3         7         0         0       205       439       115
## 9          0         0         1         4         1         6         8
## 10         2        11         3         2       221       646       176
##    severe1 severe2 severe3 severe4 severe5 severe6 severe7 severe8 ICU1   ICU2
## 1       19       0    2709      12     160    6091    6958   43860 1863 113552
## 2        0       4       8       1       2      57      16     201    7    328
## 3        3     356     109       3       6    2991     152    3296  432   4672
## 4        1     116     116       0       6     355     136    1518  169   2432
## 5       10    1233     835       0      24    4667    2513   16806 1621  27287
## 6       11     241     565       2      10    2274     330    5013  345   7961
## 7        0      61     200       2       2     569      12     440  142   1468
## 8        4      62      59       0       3     278     100    1144  104   1924
## 9        1       0       0       1       0      21       3      26    1     43
## 10       4     121     117       1       0     524     270    1673  149   2790
##    ICU3  ICU4 healthy_mean moderate_mean severe_mean ICU_mean
## 1    17 12906  156.7647059       9006.25    7476.125 32084.50
## 2     1    79    0.5294118         36.75      36.125   103.75
## 3     2  4520   12.4705882        414.00     864.500  2406.50
## 4     1   786    4.9411765        199.00     281.000   847.00
## 5     7  9679   62.7058824       2664.50    3261.000  9648.50
## 6    29  3412   19.1176471        862.25    1055.750  2936.75
## 7     1   937    4.2941176        120.50     160.750   637.00
## 8     9   579    4.9411765        189.75     206.250   654.00
## 9     1    15    0.1176471          4.75       6.500    15.00
## 10    6  1254    8.4117647        261.25     338.750  1049.75
##    mod_health_foldChange sevr_health_foldChange ICU_health_foldChange
## 1               57.45075               47.69010              204.6666
## 2               69.41667               68.23611              195.9722
## 3               33.19811               69.32311              192.9741
## 4               40.27381               56.86905              171.4167
## 5               42.49203               52.00469              153.8691
## 6               45.10231               55.22385              153.6146
## 7               28.06164               37.43493              148.3425
## 8               38.40179               41.74107              132.3571
## 9               40.37500               55.25000              127.5000
## 10              31.05769               40.27098              124.7955

ML_data <- as.data.frame(t(dataML_20))
colnames(ML_data) <- dataML_20$gene
ML_data1 <- ML_data[-c(1),]
ML_data1$sampleID <- row.names(ML_data1)
HeaderInformation$CN_new <- as.character(paste(HeaderInformation$CN_new))
ML_data2 <- merge(HeaderInformation,ML_data1,by.x='CN_new',by.y='sampleID')

colnames(ML_data2)

##  [1] "CN_new"             "GSM_ID"             "CN_old"            
##  [4] "age"                "gender"             "diseaseState"      
##  [7] "severity"           "geographicLocation" "cellType"          
## [10] "HBA2"               "GYPB"               "CA1"               
## [13] "HBM"                "ALAS2"              "HBD"               
## [16] "IFIT1B"             "AHSP"               "GYPA"              
## [19] "SELENBP1"           "IGHV1OR16-3"        "RNU6-675P"         
## [22] "GYPC"               "LOC105373602"       "ENSG00000288524"   
## [25] "HBG2"               "RHAG"               "RAP1GAP"           
## [28] "PSMB3P1"            "ANKRD52"            "IL23A"             
## [31] "ITGA7"              "OR10P1"             "MMP19"             
## [34] "ENSG00000257740"    "METTL7B"            "ZC3H10"            
## [37] "PYM1"               "OR10AE3P"           "PHC1P1"            
## [40] "RNF41"              "SMARCC2"            "PAN2"              
## [43] "TMEM198B"           "BLOC1S1"            "IKZF4"             
## [46] "ORMDL2"             "ENSG00000258921"    "BAZ2A"             
## [49] "DNAJC14"            "CA3-AS1"            "ENSG00000287927"   
## [52] "LOC101929627"       "CA1.1"              "CA2"               
## [55] "LOC107986954"       "ATP6V0D2"           "DEFA4"             
## [58] "ENSG00000253072"    "INSC"               "LOC102724957"      
## [61] "ENSG00000254789"    "LINC02751"          "LOC105376568"      
## [64] "HSALNG0082854"      "EPB42"              "GCKR"              
## [67] "METTL7B.1"          "ADAMTS2"            "SLC6A19"           
## [70] "IFI27"              "SLC4A1"             "RPL29P11"          
## [73] "PI3"                "PRAMEF31P"          "PRTN3"             
## [76] "SEC14L4"            "LOC105375299"       "MIR3147"           
## [79] "ENSG00000280225"    "ZNF716"             "LOC100653233"      
## [82] "VN1R28P"            "ZNF479"             "ENSG00000260653"   
## [85] "ENSG00000270749"    "WFDC1"              "OLFM4"             
## [88] "IGSF21"             "PCDH10"

ML_data3 <- ML_data2[,c(4,5,7,10:29)]
head(ML_data3)

##   age gender     severity HBA2 GYPB CA1 HBM ALAS2 HBD IFIT1B AHSP GYPA SELENBP1
## 1  80      M Convalescent  134    1  13   3    47  27      1    3    0        5
## 2  25      M      Healthy    1    0   1   0     0   1      0    2    0        4
## 3  38      M      Healthy   19    0   0   0     1   5      4    0    0        7
## 4  84      F      Healthy  113    0   5   0    14   5      0    0    0        0
## 5  27      M      Healthy    5    0   1   1     5   2      0    0    0        0
## 6  50      F      Healthy    0    6 137  56   804 191     54   68    1       75
##   IGHV1OR16-3 RNU6-675P GYPC LOC105373602 ENSG00000288524 HBG2 RHAG RAP1GAP
## 1           0         1    1            1               1    8    0       1
## 2           0         0    0            0               0    0    0       2
## 3           0         0    0            0               0    0    1       0
## 4           0         0    0            0               0    2    0       0
## 5           0         0    0            0               0    0    0       1
## 6           0         0    0            0               0    0    1      12
##   PSMB3P1 ANKRD52
## 1       0       0
## 2       0       0
## 3       0       0
## 4       0       0
## 5       0       0
## 6       1       1

write.csv(ML_data3,'ML_data_covid.csv',row.names=F)

library(RANN) #this pkg supplements caret for out of bag validation
library(e1071)
library(caret)
library(randomForest)
library(MASS)
library(gbm)

Reads in the genes as numeric integers, because they were factors with whitespace.

ML_data3 <- read.csv('ML_data_covid.csv',sep=',',na.strings=c('',' ','NA'),
                     stringsAsFactors = F,header=T)

set.seed(8989)
ML_data3b <- ML_data3[-c(1),]
inTrain <- createDataPartition(y=ML_data3b$severity, p=0.7, list=FALSE)

trainingSet <- ML_data3b[inTrain,]
testingSet <- ML_data3b[-inTrain,]
dim(trainingSet)

## [1] 24 23

dim(testingSet)

## [1]  9 23

Lets use random forest, knn, and glm algorithms to predict the ratings.

set.seed(605040)
rf_boot <- train(severity~., method='rf', 
               na.action=na.pass,
               data=(trainingSet),  preProc = c("center", "scale"),
               trControl=trainControl(method='boot'), number=5)

predRF_boot <- predict(rf_boot, testingSet)

DF_boot <- data.frame(predRF_boot, type=testingSet$severity)

length_boot <- length(DF_boot$type)

sum_boot <- sum(DF_boot$predRF_boot==DF_boot$type)

accRF_boot <- (sum_boot/length_boot)

accRF_boot

## [1] 0.3333333

head(DF_boot)

##   predRF_boot    type
## 1      Severe Healthy
## 2     Healthy Healthy
## 3      Severe Healthy
## 4      Severe Healthy
## 5     Healthy Healthy
## 6      Severe     ICU

errored <- DF_boot[(DF_boot$predRF_boot != DF_boot$type),]
errored

##   predRF_boot     type
## 1      Severe  Healthy
## 3      Severe  Healthy
## 4      Severe  Healthy
## 6      Severe      ICU
## 7     Healthy Moderate
## 8     Healthy   Severe

Lets see which samples created an error in our random forest model.

rn <- as.numeric(row.names(errored))
rn

## [1] 1 3 4 6 7 8

rfError <- testingSet[rn,]
rfError

##    age gender severity   HBA2 GYPB  CA1  HBM ALAS2  HBD IFIT1B AHSP GYPA
## 6   50      F  Healthy      0    6  137   56   804  191     54   68    1
## 8   55      M  Healthy    438    2    4    8    18    8      1    3    0
## 9   91      F  Healthy    467    1   34   12    83   30      2    7    0
## 20  60      F      ICU 113552  328 4672 2432 27287 7961   1468 1924   43
## 23  75      F Moderate      9    0    3    0     2    1      3    0    4
## 27  54      M   Severe     19    0    3    1    10   11      0    4    1
##    SELENBP1 IGHV1OR16.3 RNU6.675P GYPC LOC105373602 ENSG00000288524 HBG2 RHAG
## 6        75           0         0    0            0               0    0    1
## 8         2           0         0    0            0               0    0    0
## 9        11           0         1    1            1               1    0    3
## 20     2790           0        24   24           24              24  197  105
## 23        2           2         0    0            0               0    3    2
## 27        4           2         0    0            0               0    1    0
##    RAP1GAP PSMB3P1 ANKRD52
## 6       12       1       1
## 8        0       1       1
## 9        0       0       0
## 20     496      43      43
## 23       2       1       1
## 27       5       0       0

set.seed(8989)

training <- ML_data3[-c(1:2),]
testing <- ML_data3[1:2,]

dim(training)

## [1] 32 23

dim(testing)

## [1]  2 23

Lets use random forest, knn, and glm algorithms to predict the ratings.

rf_boot <- train(severity~., method='rf', 
               na.action=na.pass,
               data=(training),  preProc = c("center", "scale"),
               trControl=trainControl(method='boot'), number=5)

predRF_boot <- predict(rf_boot, testing)

DF_boot <- data.frame(predRF_boot, type=testing$severity)

DF_boot

##   predRF_boot         type
## 1     Healthy Convalescent
## 2     Healthy      Healthy

The random forest classifier thinks the Convalescent class is a healthy sample given our 20 genes and other identifiers.

Lets try classifying it with our other models.

knn_boot <- train(severity ~ .,
                method='knn', preProcess=c('center','scale'),
                tuneLength=10, trControl=trainControl(method='boot'),
                data=training)

predKNN_boot <- predict(knn_boot, testing)

DF_KNN_boot <- data.frame(predKNN_boot, type=testing$severity)

head(DF_KNN_boot)

##   predKNN_boot         type
## 1      Healthy Convalescent
## 2      Healthy      Healthy

Our KNN model or k-nearest neighbor also thinks the Convalescent sample is a healthy sample.

Lets go back to our trainingSet and testingSet that exclude the Convalescent sample.

KNN:

set.seed(121234)
knn_boot <- train(severity ~ .,
                method='knn', preProcess=c('center','scale'),
                tuneLength=10, trControl=trainControl(method='boot'),
                data=trainingSet)

predKNN_boot <- predict(knn_boot, testingSet)

DF_KNN_boot <- data.frame(predKNN_boot, type=testingSet$severity)

DF_KNN_boot

##   predKNN_boot     type
## 1      Healthy  Healthy
## 2      Healthy  Healthy
## 3      Healthy  Healthy
## 4      Healthy  Healthy
## 5      Healthy  Healthy
## 6       Severe      ICU
## 7      Healthy Moderate
## 8      Healthy   Severe
## 9      Healthy   Severe

errored <- DF_KNN_boot[(DF_KNN_boot$predKNN_boot != DF_KNN_boot$type),]
errored

##   predKNN_boot     type
## 6       Severe      ICU
## 7      Healthy Moderate
## 8      Healthy   Severe
## 9      Healthy   Severe

acc <- (length(testingSet$severity)-length(errored$predKNN_boot))/
  length(testingSet$severity)
acc

## [1] 0.5555556

rn <- as.numeric(row.names(errored))
rn

## [1] 6 7 8 9

KNNError <- testingSet[rn,]
KNNError

##    age gender severity   HBA2 GYPB  CA1  HBM ALAS2  HBD IFIT1B AHSP GYPA
## 20  60      F      ICU 113552  328 4672 2432 27287 7961   1468 1924   43
## 23  75      F Moderate      9    0    3    0     2    1      3    0    4
## 27  54      M   Severe     19    0    3    1    10   11      0    4    1
## 33  56      M   Severe   6958   16  152  136  2513  330     12  100    3
##    SELENBP1 IGHV1OR16.3 RNU6.675P GYPC LOC105373602 ENSG00000288524 HBG2 RHAG
## 20     2790           0        24   24           24              24  197  105
## 23        2           2         0    0            0               0    3    2
## 27        4           2         0    0            0               0    1    0
## 33      270           0         0    0            0               0    6   14
##    RAP1GAP PSMB3P1 ANKRD52
## 20     496      43      43
## 23       2       1       1
## 27       5       0       0
## 33      17       0       0

Recursive partitioned trees-rpart.

set.seed(505005)

rpartMod <- train(severity~., method='rpart', data=trainingSet,
                trControl=trainControl(method='boot'))
predrpart <- predict(rpartMod, testingSet)

DF_rpart <- data.frame(predrpart, type=testingSet$severity)
DF_rpart

##   predrpart     type
## 1    Severe  Healthy
## 2   Healthy  Healthy
## 3    Severe  Healthy
## 4    Severe  Healthy
## 5   Healthy  Healthy
## 6    Severe      ICU
## 7   Healthy Moderate
## 8   Healthy   Severe
## 9    Severe   Severe

errored <- DF_rpart[(DF_rpart$predrpart != DF_rpart$type),]
errored

##   predrpart     type
## 1    Severe  Healthy
## 3    Severe  Healthy
## 4    Severe  Healthy
## 6    Severe      ICU
## 7   Healthy Moderate
## 8   Healthy   Severe

rn <- as.numeric(row.names(errored))
rn

## [1] 1 3 4 6 7 8

rpartError <- testingSet[rn,]
rpartError

##    age gender severity   HBA2 GYPB  CA1  HBM ALAS2  HBD IFIT1B AHSP GYPA
## 6   50      F  Healthy      0    6  137   56   804  191     54   68    1
## 8   55      M  Healthy    438    2    4    8    18    8      1    3    0
## 9   91      F  Healthy    467    1   34   12    83   30      2    7    0
## 20  60      F      ICU 113552  328 4672 2432 27287 7961   1468 1924   43
## 23  75      F Moderate      9    0    3    0     2    1      3    0    4
## 27  54      M   Severe     19    0    3    1    10   11      0    4    1
##    SELENBP1 IGHV1OR16.3 RNU6.675P GYPC LOC105373602 ENSG00000288524 HBG2 RHAG
## 6        75           0         0    0            0               0    0    1
## 8         2           0         0    0            0               0    0    0
## 9        11           0         1    1            1               1    0    3
## 20     2790           0        24   24           24              24  197  105
## 23        2           2         0    0            0               0    3    2
## 27        4           2         0    0            0               0    1    0
##    RAP1GAP PSMB3P1 ANKRD52
## 6       12       1       1
## 8        0       1       1
## 9        0       0       0
## 20     496      43      43
## 23       2       1       1
## 27       5       0       0

Latent Dirichlet Model-topic modeler based on tokens with weights and the genes similating tokens or word counts with TF-IDF, counts, or gram weights.

set.seed(505005)

ldaMod <- train(severity~., method='lda', data=trainingSet,
                trControl=trainControl(method='boot'))
predlda <- predict(ldaMod, testingSet)

DF_lda <- data.frame(predlda, type=testingSet$severity)
DF_lda

##    predlda     type
## 1 Moderate  Healthy
## 2  Healthy  Healthy
## 3 Moderate  Healthy
## 4   Severe  Healthy
## 5  Healthy  Healthy
## 6 Moderate      ICU
## 7  Healthy Moderate
## 8  Healthy   Severe
## 9      ICU   Severe

errored <- DF_lda[(DF_lda$predlda != DF_lda$type),]
errored

##    predlda     type
## 1 Moderate  Healthy
## 3 Moderate  Healthy
## 4   Severe  Healthy
## 6 Moderate      ICU
## 7  Healthy Moderate
## 8  Healthy   Severe
## 9      ICU   Severe

rn <- as.numeric(row.names(errored))
rn

## [1] 1 3 4 6 7 8 9

ldaError <- testingSet[rn,]
ldaError

##    age gender severity   HBA2 GYPB  CA1  HBM ALAS2  HBD IFIT1B AHSP GYPA
## 6   50      F  Healthy      0    6  137   56   804  191     54   68    1
## 8   55      M  Healthy    438    2    4    8    18    8      1    3    0
## 9   91      F  Healthy    467    1   34   12    83   30      2    7    0
## 20  60      F      ICU 113552  328 4672 2432 27287 7961   1468 1924   43
## 23  75      F Moderate      9    0    3    0     2    1      3    0    4
## 27  54      M   Severe     19    0    3    1    10   11      0    4    1
## 33  56      M   Severe   6958   16  152  136  2513  330     12  100    3
##    SELENBP1 IGHV1OR16.3 RNU6.675P GYPC LOC105373602 ENSG00000288524 HBG2 RHAG
## 6        75           0         0    0            0               0    0    1
## 8         2           0         0    0            0               0    0    0
## 9        11           0         1    1            1               1    0    3
## 20     2790           0        24   24           24              24  197  105
## 23        2           2         0    0            0               0    3    2
## 27        4           2         0    0            0               0    1    0
## 33      270           0         0    0            0               0    6   14
##    RAP1GAP PSMB3P1 ANKRD52
## 6       12       1       1
## 8        0       1       1
## 9        0       0       0
## 20     496      43      43
## 23       2       1       1
## 27       5       0       0
## 33      17       0       0

These genes that we selected from the 20 genes most expressed in fold change values of ICU samples to healthy samples, weren’t very great in classifying the severity of the sample as moderate, severe, ICU, or healthy. This could be due to the limited samples where 24 were our training set and 9 our testing set with the one unlabeled or mislabeled sample as ‘Convalescent’ excluded. However, our KNN and RF models classified this sample as a ’healthy sample. And maybe it is, but when it came to accuracy in predicting the samples from our model trained on 24 samples, the RF model scored 33% with 3/9 correct. The KNN scored 55% with 5/9 correct. The rpart and lda scored 3/9 and 2/9 correct respectively. But that also included the age and gender features as numeric and factor respectively.

Lets remove those features that aren’t genes and see if our results are better. We are going to remove the age and gender features.

trainingSet2 <- trainingSet[,-c(1:2)]
testingSet2 <- testingSet[,-c(1:2)]

Lets use random forest, knn, and glm algorithms to predict the ratings.

set.seed(605040)
rf_boot <- train(severity~., method='rf', 
               na.action=na.pass,
               data=(trainingSet2),  preProc = c("center", "scale"),
               trControl=trainControl(method='boot'), number=5)

predRF_boot <- predict(rf_boot, testingSet2)

DF_boot <- data.frame(predRF_boot, type=testingSet2$severity)

length_boot <- length(DF_boot$type)

sum_boot <- sum(DF_boot$predRF_boot==DF_boot$type)

accRF_boot <- (sum_boot/length_boot)

accRF_boot

## [1] 0.5555556

DF_boot

##   predRF_boot     type
## 1      Severe  Healthy
## 2     Healthy  Healthy
## 3     Healthy  Healthy
## 4     Healthy  Healthy
## 5     Healthy  Healthy
## 6      Severe      ICU
## 7     Healthy Moderate
## 8     Healthy   Severe
## 9      Severe   Severe

errored <- DF_boot[(DF_boot$predRF_boot != DF_boot$type),]
errored

##   predRF_boot     type
## 1      Severe  Healthy
## 6      Severe      ICU
## 7     Healthy Moderate
## 8     Healthy   Severe

rn <- as.numeric(row.names(errored))
rn

## [1] 1 6 7 8

rpartError <- testingSet[rn,]
rpartError

##    age gender severity   HBA2 GYPB  CA1  HBM ALAS2  HBD IFIT1B AHSP GYPA
## 6   50      F  Healthy      0    6  137   56   804  191     54   68    1
## 20  60      F      ICU 113552  328 4672 2432 27287 7961   1468 1924   43
## 23  75      F Moderate      9    0    3    0     2    1      3    0    4
## 27  54      M   Severe     19    0    3    1    10   11      0    4    1
##    SELENBP1 IGHV1OR16.3 RNU6.675P GYPC LOC105373602 ENSG00000288524 HBG2 RHAG
## 6        75           0         0    0            0               0    0    1
## 20     2790           0        24   24           24              24  197  105
## 23        2           2         0    0            0               0    3    2
## 27        4           2         0    0            0               0    1    0
##    RAP1GAP PSMB3P1 ANKRD52
## 6       12       1       1
## 20     496      43      43
## 23       2       1       1
## 27       5       0       0

The RF model scored better with the age and gender features removed. This time it classified 5/9 correct, better than the other data set of features’ 3/9.

Lets see how the KNN does.

set.seed(121234)
knn_boot <- train(severity ~ .,
                method='knn', preProcess=c('center','scale'),
                tuneLength=10, trControl=trainControl(method='boot'),
                data=trainingSet2)

predKNN_boot <- predict(knn_boot, testingSet2)

DF_KNN_boot <- data.frame(predKNN_boot, type=testingSet2$severity)

DF_KNN_boot

##   predKNN_boot     type
## 1      Healthy  Healthy
## 2      Healthy  Healthy
## 3      Healthy  Healthy
## 4      Healthy  Healthy
## 5      Healthy  Healthy
## 6       Severe      ICU
## 7      Healthy Moderate
## 8      Healthy   Severe
## 9      Healthy   Severe

errored <- DF_KNN_boot[(DF_KNN_boot$predKNN_boot != DF_KNN_boot$type),]
errored

##   predKNN_boot     type
## 6       Severe      ICU
## 7      Healthy Moderate
## 8      Healthy   Severe
## 9      Healthy   Severe

acc <- (length(testingSet2$severity)-length(errored$predKNN_boot))/
  length(testingSet2$severity)
acc

## [1] 0.5555556

rn <- as.numeric(row.names(errored))
rn

## [1] 6 7 8 9

KNNError <- testingSet2[rn,]
KNNError

##    severity   HBA2 GYPB  CA1  HBM ALAS2  HBD IFIT1B AHSP GYPA SELENBP1
## 20      ICU 113552  328 4672 2432 27287 7961   1468 1924   43     2790
## 23 Moderate      9    0    3    0     2    1      3    0    4        2
## 27   Severe     19    0    3    1    10   11      0    4    1        4
## 33   Severe   6958   16  152  136  2513  330     12  100    3      270
##    IGHV1OR16.3 RNU6.675P GYPC LOC105373602 ENSG00000288524 HBG2 RHAG RAP1GAP
## 20           0        24   24           24              24  197  105     496
## 23           2         0    0            0               0    3    2       2
## 27           2         0    0            0               0    1    0       5
## 33           0         0    0            0               0    6   14      17
##    PSMB3P1 ANKRD52
## 20      43      43
## 23       1       1
## 27       0       0
## 33       0       0

The KNN model without the age and gender features scored the same with 5/9 correctly classified.

Lets see if rpart and lda score better with the age and gender features removed.

rpart:

set.seed(505005)

rpartMod <- train(severity~., method='rpart', data=trainingSet2,
                trControl=trainControl(method='boot'))
predrpart <- predict(rpartMod, testingSet2)

DF_rpart <- data.frame(predrpart, type=testingSet2$severity)
DF_rpart

##   predrpart     type
## 1    Severe  Healthy
## 2   Healthy  Healthy
## 3    Severe  Healthy
## 4    Severe  Healthy
## 5   Healthy  Healthy
## 6    Severe      ICU
## 7   Healthy Moderate
## 8   Healthy   Severe
## 9    Severe   Severe

errored <- DF_rpart[(DF_rpart$predrpart != DF_rpart$type),]
errored

##   predrpart     type
## 1    Severe  Healthy
## 3    Severe  Healthy
## 4    Severe  Healthy
## 6    Severe      ICU
## 7   Healthy Moderate
## 8   Healthy   Severe

rn <- as.numeric(row.names(errored))
rn

## [1] 1 3 4 6 7 8

rpartError <- testingSet2[rn,]
rpartError

##    severity   HBA2 GYPB  CA1  HBM ALAS2  HBD IFIT1B AHSP GYPA SELENBP1
## 6   Healthy      0    6  137   56   804  191     54   68    1       75
## 8   Healthy    438    2    4    8    18    8      1    3    0        2
## 9   Healthy    467    1   34   12    83   30      2    7    0       11
## 20      ICU 113552  328 4672 2432 27287 7961   1468 1924   43     2790
## 23 Moderate      9    0    3    0     2    1      3    0    4        2
## 27   Severe     19    0    3    1    10   11      0    4    1        4
##    IGHV1OR16.3 RNU6.675P GYPC LOC105373602 ENSG00000288524 HBG2 RHAG RAP1GAP
## 6            0         0    0            0               0    0    1      12
## 8            0         0    0            0               0    0    0       0
## 9            0         1    1            1               1    0    3       0
## 20           0        24   24           24              24  197  105     496
## 23           2         0    0            0               0    3    2       2
## 27           2         0    0            0               0    1    0       5
##    PSMB3P1 ANKRD52
## 6        1       1
## 8        1       1
## 9        0       0
## 20      43      43
## 23       1       1
## 27       0       0

The rpart scored the same in accuracy in classification of 3/9 correct.

lda:

set.seed(505005)

ldaMod <- train(severity~., method='lda', data=trainingSet2,
                trControl=trainControl(method='boot'))
predlda <- predict(ldaMod, testingSet2)

DF_lda <- data.frame(predlda, type=testingSet2$severity)
DF_lda

##    predlda     type
## 1 Moderate  Healthy
## 2  Healthy  Healthy
## 3 Moderate  Healthy
## 4   Severe  Healthy
## 5  Healthy  Healthy
## 6 Moderate      ICU
## 7 Moderate Moderate
## 8  Healthy   Severe
## 9      ICU   Severe

errored <- DF_lda[(DF_lda$predlda != DF_lda$type),]
errored

##    predlda    type
## 1 Moderate Healthy
## 3 Moderate Healthy
## 4   Severe Healthy
## 6 Moderate     ICU
## 8  Healthy  Severe
## 9      ICU  Severe

rn <- as.numeric(row.names(errored))
rn

## [1] 1 3 4 6 8 9

ldaError <- testingSet2[rn,]
ldaError

##    severity   HBA2 GYPB  CA1  HBM ALAS2  HBD IFIT1B AHSP GYPA SELENBP1
## 6   Healthy      0    6  137   56   804  191     54   68    1       75
## 8   Healthy    438    2    4    8    18    8      1    3    0        2
## 9   Healthy    467    1   34   12    83   30      2    7    0       11
## 20      ICU 113552  328 4672 2432 27287 7961   1468 1924   43     2790
## 27   Severe     19    0    3    1    10   11      0    4    1        4
## 33   Severe   6958   16  152  136  2513  330     12  100    3      270
##    IGHV1OR16.3 RNU6.675P GYPC LOC105373602 ENSG00000288524 HBG2 RHAG RAP1GAP
## 6            0         0    0            0               0    0    1      12
## 8            0         0    0            0               0    0    0       0
## 9            0         1    1            1               1    0    3       0
## 20           0        24   24           24              24  197  105     496
## 27           2         0    0            0               0    1    0       5
## 33           0         0    0            0               0    6   14      17
##    PSMB3P1 ANKRD52
## 6        1       1
## 8        1       1
## 9        0       0
## 20      43      43
## 27       0       0
## 33       0       0

The lda model scored 3/9 correct, which was slightly better than its original classification of 2/9 correct.

We should get genes that we see are relevant to the genders and age because with our genes that had the most fold change in all the patients’ ICU to healthy means there wasn’t enough sampling or balanced samples, or these genes show variation in all severity of COVID-19 cases. We did see in the Tableau vizualizations that there are some genes that increase and decrease in separate groups, like those younger than 57 compared to older than 57 for the severe and ICU cases and for the males versus females. The blog to see those images and link out of the blog to the Tableau chart to explore the genes with variations is at:https://themassagenegotiator.com/blog/f/actual-covid-19-data-to-analyze-gene-expression-in-three-stages to see the genes that were differentiated in certain group comparisons.

The gene IFI27, for example has a fold change that is much higher in the moderate to healthy cases than it is in the severe and ICU cases where it monotonically decreases the more severe the case from moderate -> severe -> ICU.While most of the genes through a visual inspection increase monotonically from less severe to severe. Such as RHAG and RNF41 and most of the genes starting with an I. There are a couple that seem to be high in moderete, decrease in the severe, then increase to more than sever and moderate in the ICU cases. Examples of the latter are RIC3-DT and PCDH10. Those genes in the latter part might not hold any information about COVID-19, but also the females makeup all the moderate cases. That could also mean those genes play a role in female body maintenance and healthy.

I modified the fold change logic for when the disease state is zero but the healthy mean is not zero, and no difference was made in prediction accuracy as this is the 2nd script and the results are the same as the 1st script. The fold change values are different but none of the genes’ values that changed from 0 to 1-healthy mean mattered in our filtering of the top 50 and bottom 50 values of 60683 genes.

We are using 20 genes to predict. We should sample less genes, the more relevant genes that do not show a monotonic increase or decrease between grades of severity in COVID-19 samples. Those genes are likely throwing off our measures.

covid19 study

Janis Corona

8/14/2020