This document is to prepare the Seurat derived data on counts of genes found in the GSE318371 study on Natural Killer T-cell Lymphoma associated with EBV infection. There are previous uploaded documents on deriving this information in Seurat, a library for R that helps manipulate single cell RNA sequencing data in array format. This data is from a large array and there are 30,960 genes with counts extracted from a larger set where only those genes that had more than 3 counts and were in more than 10 features were kept. The top 2,000 highest variability of change from healthy to pathological state are noted here along with the 12 healthy and 20 pathological samples’ statistical measures for variance and standardized values of counts.

I want to only keep the mean values of the counts and then do a fold change comparison of those healthy samples and those diseased samples of NKTC Lymphoma and keep the top 10 genes that have the highest and lowest fold change values for most over expressed or stimulated or up regulated, and those with most underexpressed, inhibited, or down regulated. Then compare those to the values obtained in the top 10 genes of Seurat with the highest variability measure extracted.

Then we can test these genes to see how well they predict a 2 class model using supervised learning algorithms of random forest for cross validation or boot strap aggregating AKA bagging or boosting. Then if they predict well, we can add them to our larger database of genes that are the top predictors for pathologies we have analyzed thus far in multiple sclerosis, mononucleosis, fibromyalgia, Lyme disease, and Epstein-Barr infection. We have a goal of creating a predictive machine that can determine a sample as EBV, EBV associated pathology, fibromyalgia, or Lyme disease.

Another idea is to use these genes from this study to compare to gene set studies on EBV and blood cancers to see if any relationships can be immediately determined.

Lets work with the dataset we made a few weeks ago and filter it to only include the samples by mean, the top genes if it was, and the gene names.

data <- read.csv("all30960genes_NKTC_Lymphoma_EBV_rownamesAdded.csv", header=T, sep=',')
colnames(data)
##   [1] "vf_vst_counts.GSM9493334_mean"                 
##   [2] "vf_vst_counts.GSM9493334_variance"             
##   [3] "vf_vst_counts.GSM9493334_variance.expected"    
##   [4] "vf_vst_counts.GSM9493334_variance.standardized"
##   [5] "vf_vst_counts.GSM9493334_variable"             
##   [6] "vf_vst_counts.GSM9493334_rank"                 
##   [7] "vf_vst_counts.GSM9493335_mean"                 
##   [8] "vf_vst_counts.GSM9493335_variance"             
##   [9] "vf_vst_counts.GSM9493335_variance.expected"    
##  [10] "vf_vst_counts.GSM9493335_variance.standardized"
##  [11] "vf_vst_counts.GSM9493335_variable"             
##  [12] "vf_vst_counts.GSM9493335_rank"                 
##  [13] "vf_vst_counts.GSM9493336_mean"                 
##  [14] "vf_vst_counts.GSM9493336_variance"             
##  [15] "vf_vst_counts.GSM9493336_variance.expected"    
##  [16] "vf_vst_counts.GSM9493336_variance.standardized"
##  [17] "vf_vst_counts.GSM9493336_variable"             
##  [18] "vf_vst_counts.GSM9493336_rank"                 
##  [19] "vf_vst_counts.GSM9493337_mean"                 
##  [20] "vf_vst_counts.GSM9493337_variance"             
##  [21] "vf_vst_counts.GSM9493337_variance.expected"    
##  [22] "vf_vst_counts.GSM9493337_variance.standardized"
##  [23] "vf_vst_counts.GSM9493337_variable"             
##  [24] "vf_vst_counts.GSM9493337_rank"                 
##  [25] "vf_vst_counts.GSM9493338_mean"                 
##  [26] "vf_vst_counts.GSM9493338_variance"             
##  [27] "vf_vst_counts.GSM9493338_variance.expected"    
##  [28] "vf_vst_counts.GSM9493338_variance.standardized"
##  [29] "vf_vst_counts.GSM9493338_variable"             
##  [30] "vf_vst_counts.GSM9493338_rank"                 
##  [31] "vf_vst_counts.GSM9493339_mean"                 
##  [32] "vf_vst_counts.GSM9493339_variance"             
##  [33] "vf_vst_counts.GSM9493339_variance.expected"    
##  [34] "vf_vst_counts.GSM9493339_variance.standardized"
##  [35] "vf_vst_counts.GSM9493339_variable"             
##  [36] "vf_vst_counts.GSM9493339_rank"                 
##  [37] "vf_vst_counts.GSM9493340_mean"                 
##  [38] "vf_vst_counts.GSM9493340_variance"             
##  [39] "vf_vst_counts.GSM9493340_variance.expected"    
##  [40] "vf_vst_counts.GSM9493340_variance.standardized"
##  [41] "vf_vst_counts.GSM9493340_variable"             
##  [42] "vf_vst_counts.GSM9493340_rank"                 
##  [43] "vf_vst_counts.GSM9493341_mean"                 
##  [44] "vf_vst_counts.GSM9493341_variance"             
##  [45] "vf_vst_counts.GSM9493341_variance.expected"    
##  [46] "vf_vst_counts.GSM9493341_variance.standardized"
##  [47] "vf_vst_counts.GSM9493341_variable"             
##  [48] "vf_vst_counts.GSM9493341_rank"                 
##  [49] "vf_vst_counts.GSM9493342_mean"                 
##  [50] "vf_vst_counts.GSM9493342_variance"             
##  [51] "vf_vst_counts.GSM9493342_variance.expected"    
##  [52] "vf_vst_counts.GSM9493342_variance.standardized"
##  [53] "vf_vst_counts.GSM9493342_variable"             
##  [54] "vf_vst_counts.GSM9493342_rank"                 
##  [55] "vf_vst_counts.GSM9493343_mean"                 
##  [56] "vf_vst_counts.GSM9493343_variance"             
##  [57] "vf_vst_counts.GSM9493343_variance.expected"    
##  [58] "vf_vst_counts.GSM9493343_variance.standardized"
##  [59] "vf_vst_counts.GSM9493343_variable"             
##  [60] "vf_vst_counts.GSM9493343_rank"                 
##  [61] "vf_vst_counts.GSM9493344_mean"                 
##  [62] "vf_vst_counts.GSM9493344_variance"             
##  [63] "vf_vst_counts.GSM9493344_variance.expected"    
##  [64] "vf_vst_counts.GSM9493344_variance.standardized"
##  [65] "vf_vst_counts.GSM9493344_variable"             
##  [66] "vf_vst_counts.GSM9493344_rank"                 
##  [67] "vf_vst_counts.GSM9493345_mean"                 
##  [68] "vf_vst_counts.GSM9493345_variance"             
##  [69] "vf_vst_counts.GSM9493345_variance.expected"    
##  [70] "vf_vst_counts.GSM9493345_variance.standardized"
##  [71] "vf_vst_counts.GSM9493345_variable"             
##  [72] "vf_vst_counts.GSM9493345_rank"                 
##  [73] "vf_vst_counts.GSM9493346_mean"                 
##  [74] "vf_vst_counts.GSM9493346_variance"             
##  [75] "vf_vst_counts.GSM9493346_variance.expected"    
##  [76] "vf_vst_counts.GSM9493346_variance.standardized"
##  [77] "vf_vst_counts.GSM9493346_variable"             
##  [78] "vf_vst_counts.GSM9493346_rank"                 
##  [79] "vf_vst_counts.GSM9493347_mean"                 
##  [80] "vf_vst_counts.GSM9493347_variance"             
##  [81] "vf_vst_counts.GSM9493347_variance.expected"    
##  [82] "vf_vst_counts.GSM9493347_variance.standardized"
##  [83] "vf_vst_counts.GSM9493347_variable"             
##  [84] "vf_vst_counts.GSM9493347_rank"                 
##  [85] "vf_vst_counts.GSM9493348_mean"                 
##  [86] "vf_vst_counts.GSM9493348_variance"             
##  [87] "vf_vst_counts.GSM9493348_variance.expected"    
##  [88] "vf_vst_counts.GSM9493348_variance.standardized"
##  [89] "vf_vst_counts.GSM9493348_variable"             
##  [90] "vf_vst_counts.GSM9493348_rank"                 
##  [91] "vf_vst_counts.GSM9493349_mean"                 
##  [92] "vf_vst_counts.GSM9493349_variance"             
##  [93] "vf_vst_counts.GSM9493349_variance.expected"    
##  [94] "vf_vst_counts.GSM9493349_variance.standardized"
##  [95] "vf_vst_counts.GSM9493349_variable"             
##  [96] "vf_vst_counts.GSM9493349_rank"                 
##  [97] "vf_vst_counts.GSM9493350_mean"                 
##  [98] "vf_vst_counts.GSM9493350_variance"             
##  [99] "vf_vst_counts.GSM9493350_variance.expected"    
## [100] "vf_vst_counts.GSM9493350_variance.standardized"
## [101] "vf_vst_counts.GSM9493350_variable"             
## [102] "vf_vst_counts.GSM9493350_rank"                 
## [103] "vf_vst_counts.GSM9493351_mean"                 
## [104] "vf_vst_counts.GSM9493351_variance"             
## [105] "vf_vst_counts.GSM9493351_variance.expected"    
## [106] "vf_vst_counts.GSM9493351_variance.standardized"
## [107] "vf_vst_counts.GSM9493351_variable"             
## [108] "vf_vst_counts.GSM9493351_rank"                 
## [109] "vf_vst_counts.GSM9493332_mean"                 
## [110] "vf_vst_counts.GSM9493332_variance"             
## [111] "vf_vst_counts.GSM9493332_variance.expected"    
## [112] "vf_vst_counts.GSM9493332_variance.standardized"
## [113] "vf_vst_counts.GSM9493332_variable"             
## [114] "vf_vst_counts.GSM9493332_rank"                 
## [115] "vf_vst_counts.GSM9493333_mean"                 
## [116] "vf_vst_counts.GSM9493333_variance"             
## [117] "vf_vst_counts.GSM9493333_variance.expected"    
## [118] "vf_vst_counts.GSM9493333_variance.standardized"
## [119] "vf_vst_counts.GSM9493333_variable"             
## [120] "vf_vst_counts.GSM9493333_rank"                 
## [121] "vf_vst_counts.GSM9493320_mean"                 
## [122] "vf_vst_counts.GSM9493320_variance"             
## [123] "vf_vst_counts.GSM9493320_variance.expected"    
## [124] "vf_vst_counts.GSM9493320_variance.standardized"
## [125] "vf_vst_counts.GSM9493320_variable"             
## [126] "vf_vst_counts.GSM9493320_rank"                 
## [127] "vf_vst_counts.GSM9493329_mean"                 
## [128] "vf_vst_counts.GSM9493329_variance"             
## [129] "vf_vst_counts.GSM9493329_variance.expected"    
## [130] "vf_vst_counts.GSM9493329_variance.standardized"
## [131] "vf_vst_counts.GSM9493329_variable"             
## [132] "vf_vst_counts.GSM9493329_rank"                 
## [133] "vf_vst_counts.GSM9493330_mean"                 
## [134] "vf_vst_counts.GSM9493330_variance"             
## [135] "vf_vst_counts.GSM9493330_variance.expected"    
## [136] "vf_vst_counts.GSM9493330_variance.standardized"
## [137] "vf_vst_counts.GSM9493330_variable"             
## [138] "vf_vst_counts.GSM9493330_rank"                 
## [139] "vf_vst_counts.GSM9493331_mean"                 
## [140] "vf_vst_counts.GSM9493331_variance"             
## [141] "vf_vst_counts.GSM9493331_variance.expected"    
## [142] "vf_vst_counts.GSM9493331_variance.standardized"
## [143] "vf_vst_counts.GSM9493331_variable"             
## [144] "vf_vst_counts.GSM9493331_rank"                 
## [145] "vf_vst_counts.GSM9493321_mean"                 
## [146] "vf_vst_counts.GSM9493321_variance"             
## [147] "vf_vst_counts.GSM9493321_variance.expected"    
## [148] "vf_vst_counts.GSM9493321_variance.standardized"
## [149] "vf_vst_counts.GSM9493321_variable"             
## [150] "vf_vst_counts.GSM9493321_rank"                 
## [151] "vf_vst_counts.GSM9493322_mean"                 
## [152] "vf_vst_counts.GSM9493322_variance"             
## [153] "vf_vst_counts.GSM9493322_variance.expected"    
## [154] "vf_vst_counts.GSM9493322_variance.standardized"
## [155] "vf_vst_counts.GSM9493322_variable"             
## [156] "vf_vst_counts.GSM9493322_rank"                 
## [157] "vf_vst_counts.GSM9493323_mean"                 
## [158] "vf_vst_counts.GSM9493323_variance"             
## [159] "vf_vst_counts.GSM9493323_variance.expected"    
## [160] "vf_vst_counts.GSM9493323_variance.standardized"
## [161] "vf_vst_counts.GSM9493323_variable"             
## [162] "vf_vst_counts.GSM9493323_rank"                 
## [163] "vf_vst_counts.GSM9493324_mean"                 
## [164] "vf_vst_counts.GSM9493324_variance"             
## [165] "vf_vst_counts.GSM9493324_variance.expected"    
## [166] "vf_vst_counts.GSM9493324_variance.standardized"
## [167] "vf_vst_counts.GSM9493324_variable"             
## [168] "vf_vst_counts.GSM9493324_rank"                 
## [169] "vf_vst_counts.GSM9493325_mean"                 
## [170] "vf_vst_counts.GSM9493325_variance"             
## [171] "vf_vst_counts.GSM9493325_variance.expected"    
## [172] "vf_vst_counts.GSM9493325_variance.standardized"
## [173] "vf_vst_counts.GSM9493325_variable"             
## [174] "vf_vst_counts.GSM9493325_rank"                 
## [175] "vf_vst_counts.GSM9493326_mean"                 
## [176] "vf_vst_counts.GSM9493326_variance"             
## [177] "vf_vst_counts.GSM9493326_variance.expected"    
## [178] "vf_vst_counts.GSM9493326_variance.standardized"
## [179] "vf_vst_counts.GSM9493326_variable"             
## [180] "vf_vst_counts.GSM9493326_rank"                 
## [181] "vf_vst_counts.GSM9493327_mean"                 
## [182] "vf_vst_counts.GSM9493327_variance"             
## [183] "vf_vst_counts.GSM9493327_variance.expected"    
## [184] "vf_vst_counts.GSM9493327_variance.standardized"
## [185] "vf_vst_counts.GSM9493327_variable"             
## [186] "vf_vst_counts.GSM9493327_rank"                 
## [187] "vf_vst_counts.GSM9493328_mean"                 
## [188] "vf_vst_counts.GSM9493328_variance"             
## [189] "vf_vst_counts.GSM9493328_variance.expected"    
## [190] "vf_vst_counts.GSM9493328_variance.standardized"
## [191] "vf_vst_counts.GSM9493328_variable"             
## [192] "vf_vst_counts.GSM9493328_rank"                 
## [193] "var.features"                                  
## [194] "var.features.rank"                             
## [195] "genes"

We want to keep the last 3 columns and the columns with the mean of the GSM sample. The rank is for the gene if it was top ranked it is 1 through 10 and goes up to 2,000 for only the top 2,000 genes with the most variability from healthy state to pathological state were identified. We have 30,960 total genes that made the cut of being in at least 10 genes for the barcode fragments and in at least 3 cells of the massive array of cells in the hundreds of thousands.

meanID <- grep('mean',colnames(data))
meanID
##  [1]   1   7  13  19  25  31  37  43  49  55  61  67  73  79  85  91  97 103 109
## [20] 115 121 127 133 139 145 151 157 163 169 175 181 187

There are 32 samples so there must be all samples accounted for, we will attach the IDs to the GSM ID of healthy or NKTC lymphoma as well.

DataMeans <- data[,c(meanID,193:195)]
colnames(DataMeans)
##  [1] "vf_vst_counts.GSM9493334_mean" "vf_vst_counts.GSM9493335_mean"
##  [3] "vf_vst_counts.GSM9493336_mean" "vf_vst_counts.GSM9493337_mean"
##  [5] "vf_vst_counts.GSM9493338_mean" "vf_vst_counts.GSM9493339_mean"
##  [7] "vf_vst_counts.GSM9493340_mean" "vf_vst_counts.GSM9493341_mean"
##  [9] "vf_vst_counts.GSM9493342_mean" "vf_vst_counts.GSM9493343_mean"
## [11] "vf_vst_counts.GSM9493344_mean" "vf_vst_counts.GSM9493345_mean"
## [13] "vf_vst_counts.GSM9493346_mean" "vf_vst_counts.GSM9493347_mean"
## [15] "vf_vst_counts.GSM9493348_mean" "vf_vst_counts.GSM9493349_mean"
## [17] "vf_vst_counts.GSM9493350_mean" "vf_vst_counts.GSM9493351_mean"
## [19] "vf_vst_counts.GSM9493332_mean" "vf_vst_counts.GSM9493333_mean"
## [21] "vf_vst_counts.GSM9493320_mean" "vf_vst_counts.GSM9493329_mean"
## [23] "vf_vst_counts.GSM9493330_mean" "vf_vst_counts.GSM9493331_mean"
## [25] "vf_vst_counts.GSM9493321_mean" "vf_vst_counts.GSM9493322_mean"
## [27] "vf_vst_counts.GSM9493323_mean" "vf_vst_counts.GSM9493324_mean"
## [29] "vf_vst_counts.GSM9493325_mean" "vf_vst_counts.GSM9493326_mean"
## [31] "vf_vst_counts.GSM9493327_mean" "vf_vst_counts.GSM9493328_mean"
## [33] "var.features"                  "var.features.rank"            
## [35] "genes"
top2000Variability <- subset(DataMeans, DataMeans$var.features.rank != "NA")
top2000Var <- top2000Variability[order(top2000Variability$var.features.rank,decreasing=F),]
top2000Var[c(1:10),c(32:35)]
##       vf_vst_counts.GSM9493328_mean var.features var.features.rank  genes
## 5603                      0.8955755         PPBP                 1   PPBP
## 22311                     1.5609617        IGLC2                 2  IGLC2
## 2972                      3.3883910         IGKC                 3   IGKC
## 1414                     21.2710201       S100A9                 4 S100A9
## 5610                      1.2892303         EREG                 5   EREG
## 22312                     0.5342154        IGLC3                 6  IGLC3
## 13803                     0.1337317         SOX5                 7   SOX5
## 5602                      0.7050790          PF4                 8    PF4
## 14280                    15.8304168          LYZ                 9    LYZ
## 10577                     0.9443733       LINGO2                10 LINGO2

We have the top 10 ranked genes with most gene variability but not if they were most underexpressed or overexpressed.

Lets get the fold change values of the genes for the NKTC Lymphoma means over the healthy means.

seriesInfoDesign <-read.csv("GSE318371-GPL34284_series_matrix.txt", sep='\t', nrows=50,stringsAsFactors = T,strip.white=T,na.strings=" ", ncol(32), skip=25, header=F)

dim(seriesInfoDesign)
## [1] 42 30

The 2nd and 17th rows have the sample type per GSM ID.

seriesInfoDesign[c(2,17),]
##                       V1                  V2                  V3
## 2  !Sample_geo_accession        "GSM9493320"        "GSM9493321"
## 17   !Sample_description "Library name: HD1" "Library name: HD2"
##                     V4                  V5                  V6
## 2         "GSM9493322"        "GSM9493323"        "GSM9493324"
## 17 "Library name: HD3" "Library name: HD4" "Library name: HD5"
##                     V7                  V8                  V9
## 2         "GSM9493325"        "GSM9493326"        "GSM9493327"
## 17 "Library name: HD6" "Library name: HD7" "Library name: HD8"
##                    V10                  V11                  V12
## 2         "GSM9493328"         "GSM9493329"         "GSM9493330"
## 17 "Library name: HD9" "Library name: HD10" "Library name: HD11"
##                     V13                       V14                       V15
## 2          "GSM9493331"              "GSM9493335"              "GSM9493336"
## 17 "Library name: HD12" "Library name: patient16" "Library name: patient18"
##                          V16                       V17
## 2               "GSM9493337"              "GSM9493338"
## 17 "Library name: patient19" "Library name: patient20"
##                          V18                       V19
## 2               "GSM9493339"              "GSM9493340"
## 17 "Library name: patient22" "Library name: patient25"
##                          V20                       V21
## 2               "GSM9493341"              "GSM9493342"
## 17 "Library name: patient27" "Library name: patient30"
##                          V22                       V23
## 2               "GSM9493343"              "GSM9493344"
## 17 "Library name: patient31" "Library name: patient36"
##                          V24                       V25
## 2               "GSM9493345"              "GSM9493346"
## 17 "Library name: patient37" "Library name: patient38"
##                          V26                       V27
## 2               "GSM9493347"              "GSM9493348"
## 17 "Library name: patient39" "Library name: patient41"
##                          V28                       V29
## 2               "GSM9493349"              "GSM9493350"
## 17 "Library name: patient44" "Library name: patient51"
##                          V30
## 2               "GSM9493351"
## 17 "Library name: patient52"

The library name that starts with “HD” is for the healthy samples, while other samples are the patient samples. We only care about if healthy or a diseased sample so that we can identify the set class to get the mean for that class. GSM29493320-GSM9493331 are healthy and GSM9493332-GSM9493351 are diseased.

colnames(DataMeans)
##  [1] "vf_vst_counts.GSM9493334_mean" "vf_vst_counts.GSM9493335_mean"
##  [3] "vf_vst_counts.GSM9493336_mean" "vf_vst_counts.GSM9493337_mean"
##  [5] "vf_vst_counts.GSM9493338_mean" "vf_vst_counts.GSM9493339_mean"
##  [7] "vf_vst_counts.GSM9493340_mean" "vf_vst_counts.GSM9493341_mean"
##  [9] "vf_vst_counts.GSM9493342_mean" "vf_vst_counts.GSM9493343_mean"
## [11] "vf_vst_counts.GSM9493344_mean" "vf_vst_counts.GSM9493345_mean"
## [13] "vf_vst_counts.GSM9493346_mean" "vf_vst_counts.GSM9493347_mean"
## [15] "vf_vst_counts.GSM9493348_mean" "vf_vst_counts.GSM9493349_mean"
## [17] "vf_vst_counts.GSM9493350_mean" "vf_vst_counts.GSM9493351_mean"
## [19] "vf_vst_counts.GSM9493332_mean" "vf_vst_counts.GSM9493333_mean"
## [21] "vf_vst_counts.GSM9493320_mean" "vf_vst_counts.GSM9493329_mean"
## [23] "vf_vst_counts.GSM9493330_mean" "vf_vst_counts.GSM9493331_mean"
## [25] "vf_vst_counts.GSM9493321_mean" "vf_vst_counts.GSM9493322_mean"
## [27] "vf_vst_counts.GSM9493323_mean" "vf_vst_counts.GSM9493324_mean"
## [29] "vf_vst_counts.GSM9493325_mean" "vf_vst_counts.GSM9493326_mean"
## [31] "vf_vst_counts.GSM9493327_mean" "vf_vst_counts.GSM9493328_mean"
## [33] "var.features"                  "var.features.rank"            
## [35] "genes"

They are out of order but the first 20 samples are diseased and the next 12 are the healthy samples, then the last 3 are the features, rank, and gene name. Lets rename them.

colnames(DataMeans)[1:20] <- gsub('vf_vst_counts','NKTCL', colnames(DataMeans)[1:20])
colnames(DataMeans)
##  [1] "NKTCL.GSM9493334_mean"         "NKTCL.GSM9493335_mean"        
##  [3] "NKTCL.GSM9493336_mean"         "NKTCL.GSM9493337_mean"        
##  [5] "NKTCL.GSM9493338_mean"         "NKTCL.GSM9493339_mean"        
##  [7] "NKTCL.GSM9493340_mean"         "NKTCL.GSM9493341_mean"        
##  [9] "NKTCL.GSM9493342_mean"         "NKTCL.GSM9493343_mean"        
## [11] "NKTCL.GSM9493344_mean"         "NKTCL.GSM9493345_mean"        
## [13] "NKTCL.GSM9493346_mean"         "NKTCL.GSM9493347_mean"        
## [15] "NKTCL.GSM9493348_mean"         "NKTCL.GSM9493349_mean"        
## [17] "NKTCL.GSM9493350_mean"         "NKTCL.GSM9493351_mean"        
## [19] "NKTCL.GSM9493332_mean"         "NKTCL.GSM9493333_mean"        
## [21] "vf_vst_counts.GSM9493320_mean" "vf_vst_counts.GSM9493329_mean"
## [23] "vf_vst_counts.GSM9493330_mean" "vf_vst_counts.GSM9493331_mean"
## [25] "vf_vst_counts.GSM9493321_mean" "vf_vst_counts.GSM9493322_mean"
## [27] "vf_vst_counts.GSM9493323_mean" "vf_vst_counts.GSM9493324_mean"
## [29] "vf_vst_counts.GSM9493325_mean" "vf_vst_counts.GSM9493326_mean"
## [31] "vf_vst_counts.GSM9493327_mean" "vf_vst_counts.GSM9493328_mean"
## [33] "var.features"                  "var.features.rank"            
## [35] "genes"
colnames(DataMeans)[21:32] <- gsub('vf_vst_counts','healthy',colnames(DataMeans[21:32]))
colnames(DataMeans)
##  [1] "NKTCL.GSM9493334_mean"   "NKTCL.GSM9493335_mean"  
##  [3] "NKTCL.GSM9493336_mean"   "NKTCL.GSM9493337_mean"  
##  [5] "NKTCL.GSM9493338_mean"   "NKTCL.GSM9493339_mean"  
##  [7] "NKTCL.GSM9493340_mean"   "NKTCL.GSM9493341_mean"  
##  [9] "NKTCL.GSM9493342_mean"   "NKTCL.GSM9493343_mean"  
## [11] "NKTCL.GSM9493344_mean"   "NKTCL.GSM9493345_mean"  
## [13] "NKTCL.GSM9493346_mean"   "NKTCL.GSM9493347_mean"  
## [15] "NKTCL.GSM9493348_mean"   "NKTCL.GSM9493349_mean"  
## [17] "NKTCL.GSM9493350_mean"   "NKTCL.GSM9493351_mean"  
## [19] "NKTCL.GSM9493332_mean"   "NKTCL.GSM9493333_mean"  
## [21] "healthy.GSM9493320_mean" "healthy.GSM9493329_mean"
## [23] "healthy.GSM9493330_mean" "healthy.GSM9493331_mean"
## [25] "healthy.GSM9493321_mean" "healthy.GSM9493322_mean"
## [27] "healthy.GSM9493323_mean" "healthy.GSM9493324_mean"
## [29] "healthy.GSM9493325_mean" "healthy.GSM9493326_mean"
## [31] "healthy.GSM9493327_mean" "healthy.GSM9493328_mean"
## [33] "var.features"            "var.features.rank"      
## [35] "genes"

Get the mean of the set rows by gene to make fold change values of the genes.

DataMeans$pathologyMean <- rowMeans(DataMeans[1:20], na.rm=T,dims=1)
DataMeans$healthyMean <- rowMeans(DataMeans[21:32], na.rm=T, dims=1)

DataMeans$foldchangePathologyVsHealthy <- DataMeans$pathologyMean/DataMeans$healthyMean

DataMeans2 <- DataMeans[order(DataMeans$foldchangePathologyVsHealthy,decreasing=T),]
head(DataMeans2)
##       NKTCL.GSM9493334_mean NKTCL.GSM9493335_mean NKTCL.GSM9493336_mean
## 24724                    NA          0.0003390980                    NA
## 29976                    NA                    NA                    NA
## 21921          0.0009161704                    NA           0.001888293
## 12963          0.0042372881          0.0007912287           0.001391373
## 21920          0.0052679798          0.0031649147           0.002583979
## 29338                    NA                    NA                    NA
##       NKTCL.GSM9493337_mean NKTCL.GSM9493338_mean NKTCL.GSM9493339_mean
## 24724                    NA                    NA                    NA
## 29976                    NA                    NA                    NA
## 21921          0.0007212405          0.0007837868           0.000337952
## 12963          0.0016227912          0.0015675736           0.001858736
## 21920          0.0016227912          0.0005598477           0.000844880
## 29338                    NA                    NA                    NA
##       NKTCL.GSM9493340_mean NKTCL.GSM9493341_mean NKTCL.GSM9493342_mean
## 24724                    NA                    NA                    NA
## 29976                    NA                    NA                    NA
## 21921           0.002033390          0.0005197168          0.0002857143
## 12963           0.004066781          0.0011693627          0.0006428571
## 21920           0.003638699          0.0044175924          0.0010000000
## 29338                    NA          0.0002598584                    NA
##       NKTCL.GSM9493343_mean NKTCL.GSM9493344_mean NKTCL.GSM9493345_mean
## 24724                    NA                    NA                    NA
## 29976                    NA                    NA                    NA
## 21921          0.0004090258                    NA          0.0003960082
## 12963          0.0013634195          0.0005694298          0.0011880247
## 21920          0.0025223260          0.0021150248          0.0027720577
## 29338                    NA                    NA                    NA
##       NKTCL.GSM9493346_mean NKTCL.GSM9493347_mean NKTCL.GSM9493348_mean
## 24724                    NA                    NA                    NA
## 29976                    NA                    NA          0.0003144407
## 21921                    NA          0.0002256487                    NA
## 12963           0.001716949          0.0004512975          0.0031444069
## 21920           0.001716949          0.0003760812          0.0011005424
## 29338                    NA                    NA                    NA
##       NKTCL.GSM9493349_mean NKTCL.GSM9493350_mean NKTCL.GSM9493351_mean
## 24724                    NA                    NA                    NA
## 29976                    NA                    NA                    NA
## 21921          0.0003385113          0.0008430411          0.0005055612
## 12963          0.0004513484          0.0036787247          0.0020222447
## 21920          0.0027080904          0.0128755365          0.0104001156
## 29338          0.0001128371                    NA                    NA
##       NKTCL.GSM9493332_mean NKTCL.GSM9493333_mean healthy.GSM9493320_mean
## 24724                    NA                    NA                      NA
## 29976          5.328360e-04                    NA                      NA
## 21921          1.713352e+02           128.0193663            0.0008069164
## 12963          7.279339e+00             3.7898896            0.0005763689
## 21920          4.821100e+00             6.0941260                      NA
## 29338          6.007726e-02             0.1343539                      NA
##       healthy.GSM9493329_mean healthy.GSM9493330_mean healthy.GSM9493331_mean
## 24724                      NA                      NA                      NA
## 29976                      NA                      NA                      NA
## 21921            0.0016596592            0.0013849137            0.0019877788
## 12963            0.0039831821            0.0071092235            0.0043436649
## 21920            0.0130559858            0.0054473271            0.0031657219
## 29338            0.0003319318            0.0004616379            0.0002208643
##       healthy.GSM9493321_mean healthy.GSM9493322_mean healthy.GSM9493323_mean
## 24724                      NA                      NA                      NA
## 29976                      NA                      NA                      NA
## 21921            0.0007748334            0.0006547359            0.0007430034
## 12963            0.0009298001            0.0009821039            0.0014034508
## 21920            0.0001549667            0.0005456133            0.0011557830
## 29338                      NA                      NA                      NA
##       healthy.GSM9493324_mean healthy.GSM9493325_mean healthy.GSM9493326_mean
## 24724                      NA                      NA            0.0000000000
## 29976                      NA                      NA                      NA
## 21921            0.0006256256            0.0004761905            0.0008632783
## 12963            0.0005005005            0.0010317460            0.0022661055
## 21920                      NA            0.0016666667            0.0026977447
## 29338                      NA                      NA            0.0003237294
##       healthy.GSM9493327_mean healthy.GSM9493328_mean var.features
## 24724                      NA                      NA         <NA>
## 29976                      NA            0.0000000000         <NA>
## 21921            0.0006865774            0.0007113387         <NA>
## 12963            0.0036045314            0.0056907099         <NA>
## 21920            0.0084105733            0.0049793712         <NA>
## 29338            0.0006865774            0.0002845355         <NA>
##       var.features.rank           genes pathologyMean  healthyMean
## 24724                NA ENSG00000254339  3.390980e-04 0.0000000000
## 29976                NA            IL37  4.236384e-04 0.0000000000
## 21921                NA ENSG00000280441  1.871030e+01 0.0009479043
## 12963                NA         PTPRCAP  5.550581e-01 0.0027017823
## 21920                NA ENSG00000278996  5.487457e-01 0.0041279754
## 29338                NA ENSG00000281383  4.870095e-02 0.0003848794
##       foldchangePathologyVsHealthy
## 24724                          Inf
## 29976                          Inf
## 21921                   19738.5913
## 12963                     205.4415
## 21920                     132.9334
## 29338                     126.5356
tail(DataMeans2)
##       NKTCL.GSM9493334_mean NKTCL.GSM9493335_mean NKTCL.GSM9493336_mean
## 30955                    NA                    NA                    NA
## 30956                    NA                    NA                    NA
## 30957                    NA                    NA                    NA
## 30958                    NA                    NA                    NA
## 30959                    NA                    NA                    NA
## 30960                    NA                    NA                    NA
##       NKTCL.GSM9493337_mean NKTCL.GSM9493338_mean NKTCL.GSM9493339_mean
## 30955                    NA                    NA                    NA
## 30956                    NA                    NA                    NA
## 30957                    NA                    NA                    NA
## 30958                    NA                    NA                    NA
## 30959                    NA                    NA                    NA
## 30960                    NA                    NA                    NA
##       NKTCL.GSM9493340_mean NKTCL.GSM9493341_mean NKTCL.GSM9493342_mean
## 30955                    NA                    NA                    NA
## 30956                    NA                    NA                    NA
## 30957                    NA                    NA                    NA
## 30958                    NA                    NA                    NA
## 30959                    NA                    NA                    NA
## 30960                    NA                    NA                    NA
##       NKTCL.GSM9493343_mean NKTCL.GSM9493344_mean NKTCL.GSM9493345_mean
## 30955                    NA                    NA                    NA
## 30956                    NA                    NA                    NA
## 30957                    NA                    NA                    NA
## 30958                    NA                    NA                    NA
## 30959                    NA                    NA                    NA
## 30960                    NA                    NA                    NA
##       NKTCL.GSM9493346_mean NKTCL.GSM9493347_mean NKTCL.GSM9493348_mean
## 30955                    NA                    NA                    NA
## 30956                    NA                    NA                    NA
## 30957                    NA                    NA                    NA
## 30958                    NA                    NA                    NA
## 30959                    NA                    NA                    NA
## 30960                    NA                    NA                    NA
##       NKTCL.GSM9493349_mean NKTCL.GSM9493350_mean NKTCL.GSM9493351_mean
## 30955                    NA                    NA                    NA
## 30956                    NA                    NA                    NA
## 30957                    NA                    NA                    NA
## 30958                    NA                    NA                    NA
## 30959                    NA                    NA                    NA
## 30960                    NA                    NA                    NA
##       NKTCL.GSM9493332_mean NKTCL.GSM9493333_mean healthy.GSM9493320_mean
## 30955                    NA                    NA                      NA
## 30956                    NA                    NA                      NA
## 30957                    NA                    NA                      NA
## 30958                    NA                    NA                      NA
## 30959                    NA                    NA                      NA
## 30960                    NA                    NA                      NA
##       healthy.GSM9493329_mean healthy.GSM9493330_mean healthy.GSM9493331_mean
## 30955                      NA                      NA                      NA
## 30956                      NA                      NA                      NA
## 30957                      NA                      NA                      NA
## 30958                      NA                      NA                      NA
## 30959                      NA                      NA                      NA
## 30960                      NA                      NA                      NA
##       healthy.GSM9493321_mean healthy.GSM9493322_mean healthy.GSM9493323_mean
## 30955                      NA                      NA                      NA
## 30956                      NA                      NA                      NA
## 30957                      NA                      NA                      NA
## 30958                      NA                      NA                      NA
## 30959                      NA                      NA                      NA
## 30960                      NA                      NA                      NA
##       healthy.GSM9493324_mean healthy.GSM9493325_mean healthy.GSM9493326_mean
## 30955                      NA                      NA                      NA
## 30956                      NA                      NA                      NA
## 30957                      NA                      NA                      NA
## 30958                      NA                      NA                      NA
## 30959                      NA                      NA                      NA
## 30960                      NA                      NA                      NA
##       healthy.GSM9493327_mean healthy.GSM9493328_mean var.features
## 30955                      NA            0.0004268032         <NA>
## 30956                      NA            0.0004268032         <NA>
## 30957                      NA            0.0002845355         <NA>
## 30958                      NA            0.0002845355         <NA>
## 30959                      NA            0.0004268032         <NA>
## 30960                      NA            0.0005690710         <NA>
##       var.features.rank           genes pathologyMean  healthyMean
## 30955                NA ENSG00000286957           NaN 0.0004268032
## 30956                NA ENSG00000235902           NaN 0.0004268032
## 30957                NA ENSG00000234826           NaN 0.0002845355
## 30958                NA         RNASE13           NaN 0.0002845355
## 30959                NA         CACNA1G           NaN 0.0004268032
## 30960                NA ENSG00000276633           NaN 0.0005690710
##       foldchangePathologyVsHealthy
## 30955                          NaN
## 30956                          NaN
## 30957                          NaN
## 30958                          NaN
## 30959                          NaN
## 30960                          NaN

Lets not include the top 2 in order of highest to lowest as many if not all except 2 healthy and 3 diseased state had a value other than NA, it skewed the value. When you divide by a very small number that number can go infinitesimally into that small number and not useful in this case. Lets also remove the ‘NaN’ values in fold change values before taking the 10 most under expressed and 10 most over expressed genes.

DataMeans3 <- subset(DataMeans2, DataMeans2$foldchangePathologyVsHealthy != 'NaN')

DataMeans4 <- DataMeans3[-c(1:2),]

There are actually a lot of the genes missing information for these top genes, so lets look at those that are the top genes in the set with missing values of NA and then the complete cases of top genes.

top10bottom10_NAs <- DataMeans3[c(1:10,28808:28817),]#there are 28817 genes in this dataset
top10bottom10_NAs$genes
##  [1] "ENSG00000254339" "IL37"            "ENSG00000280441" "PTPRCAP"        
##  [5] "ENSG00000278996" "ENSG00000281383" "ACOT4"           "LINC01670"      
##  [9] "ENSG00000287698" "IGLV3-27"        "CD207"           "POTEC"          
## [13] "ENSG00000260564" "ENSG00000285354" "CATSPER4"        "SEZ6L-AS1"      
## [17] "IGKV1-13"        "TPSAB1"          "ENSG00000290420" "ENSG00000254350"
DataMeansComplete <- DataMeans3[complete.cases(DataMeans3[,c(1:32)]),]
top10bottom10_noNAs <- DataMeansComplete[c(1:10,19120:19129),]
top10bottom10_noNAs$genes
##  [1] "PTPRCAP"         "ANKRD22"         "APOL4"           "CXCL10"         
##  [5] "IFI27"           "GBP1P1"          "IL1R2"           "CCL23"          
##  [9] "ETV7"            "BATF2"           "NCKAP5"          "NOG"            
## [13] "FCER1A"          "H4C5"            "ENSG00000289117" "ENSG00000286797"
## [17] "SULT1A2"         "ENSG00000240086" "NRCAM"           "CCDC144NL"

Lets look at these genes and see if any are in the list of the top 2,000 highest variability genes from Seurat.

commonNAs <- top10bottom10_NAs[which(top10bottom10_NAs$genes %in% top2000Var$genes),]

commonNAs
##  [1] NKTCL.GSM9493334_mean        NKTCL.GSM9493335_mean       
##  [3] NKTCL.GSM9493336_mean        NKTCL.GSM9493337_mean       
##  [5] NKTCL.GSM9493338_mean        NKTCL.GSM9493339_mean       
##  [7] NKTCL.GSM9493340_mean        NKTCL.GSM9493341_mean       
##  [9] NKTCL.GSM9493342_mean        NKTCL.GSM9493343_mean       
## [11] NKTCL.GSM9493344_mean        NKTCL.GSM9493345_mean       
## [13] NKTCL.GSM9493346_mean        NKTCL.GSM9493347_mean       
## [15] NKTCL.GSM9493348_mean        NKTCL.GSM9493349_mean       
## [17] NKTCL.GSM9493350_mean        NKTCL.GSM9493351_mean       
## [19] NKTCL.GSM9493332_mean        NKTCL.GSM9493333_mean       
## [21] healthy.GSM9493320_mean      healthy.GSM9493329_mean     
## [23] healthy.GSM9493330_mean      healthy.GSM9493331_mean     
## [25] healthy.GSM9493321_mean      healthy.GSM9493322_mean     
## [27] healthy.GSM9493323_mean      healthy.GSM9493324_mean     
## [29] healthy.GSM9493325_mean      healthy.GSM9493326_mean     
## [31] healthy.GSM9493327_mean      healthy.GSM9493328_mean     
## [33] var.features                 var.features.rank           
## [35] genes                        pathologyMean               
## [37] healthyMean                  foldchangePathologyVsHealthy
## <0 rows> (or 0-length row.names)

There are no common genes between the top 2,000 genes with highest variability selected in Seurat and the genes having NAs in their samples for each gene.

common <- top10bottom10_noNAs[which(top10bottom10_noNAs$genes %in% top2000Var$genes),]

common
##       NKTCL.GSM9493334_mean NKTCL.GSM9493335_mean NKTCL.GSM9493336_mean
## 11871          0.4882043060           0.104894314           0.239316239
## 5631           0.1628492900           0.069289025           0.105943152
## 15920          0.1984654146           0.025319317           0.026237329
## 3072           0.6386852955           0.004069176           0.017193401
## 3239           0.0019468621           0.004747372           0.004472272
## 1570           0.0033211177           0.018198259           0.015702644
## 3485           0.0026339899           0.005764666           0.003080898
## 4840           0.0006871278           0.002825817           0.001987676
## 9217           0.0053825011           0.023058664           0.042933810
##       NKTCL.GSM9493337_mean NKTCL.GSM9493338_mean NKTCL.GSM9493339_mean
## 11871           0.033717995           1.659388646           0.019009801
## 5631            0.030472413           1.882431979           0.008448800
## 15920           0.012441399           0.360317994           0.032781345
## 3072            0.121168410           0.003023178           3.565224738
## 3239            0.006491165           0.003471056           0.002281176
## 1570            0.016588532           0.009069533           0.002619128
## 3485            0.018932564           0.016907401           0.006336600
## 4840            0.007933646           0.004142873           0.001774248
## 9217            0.094121890           0.031127533           0.009547144
##       NKTCL.GSM9493340_mean NKTCL.GSM9493341_mean NKTCL.GSM9493342_mean
## 11871           0.983197774           0.466055999           0.308642857
## 5631            0.237906678           0.488273891           0.287000000
## 15920           0.698630137           0.090430715           0.077285714
## 3072            0.051583904           0.001819009           0.002857143
## 3239            0.005458048           0.005911778           0.005142857
## 1570            0.006849315           0.021568245           0.016785714
## 3485            0.005779110           0.007081141           0.025000000
## 4840            0.003745719           0.003767946           0.002285714
## 9217            0.046875000           0.010654193           0.045285714
##       NKTCL.GSM9493343_mean NKTCL.GSM9493344_mean NKTCL.GSM9493345_mean
## 11871           0.006203559           0.672415196           0.426184065
## 5631            0.009066739           0.231595217           0.249722794
## 15920           0.005999046           0.315545432           0.062648503
## 3072            0.009339423           0.022695843           0.010217013
## 3239            0.004362942           0.002359066           0.004435292
## 1570            0.077646738           0.016594810           0.017503564
## 3485            0.007226123           0.004962174           0.002217646
## 4840            0.001227078           0.002928496           0.001267226
## 9217            0.083713955           0.009598959           0.109219072
##       NKTCL.GSM9493346_mean NKTCL.GSM9493347_mean NKTCL.GSM9493348_mean
## 11871          1.8103180443           1.362316660          0.6013678170
## 5631           0.4401111929           1.082437006          0.7794198569
## 15920          0.3514021748           0.552012035          0.7934124676
## 3072           0.1241108658           0.006167732          0.0487383067
## 3239           0.0031886191           0.004362542          0.0014149831
## 1570           0.0020439866           0.020308387          0.0072321358
## 3485           0.0008993541           0.013238059          0.0079396274
## 4840           0.0008175946           0.001053027          0.0025941357
## 9217           0.0030251002           0.040240692          0.0005502712
##       NKTCL.GSM9493349_mean NKTCL.GSM9493350_mean NKTCL.GSM9493351_mean
## 11871          1.5785158160           0.015787860           0.114040156
## 5631           2.4244555610           0.027590435           0.142351582
## 15920          0.1884755708           0.194512569           0.155929510
## 3072           0.0103434009           0.004445126           0.014083490
## 3239           0.0016925565           0.002759044           0.004838943
## 1570           0.0080114342           0.024908032           0.070273003
## 3485           0.0003008989           0.005977928           0.001516684
## 4840           0.0011659834           0.001456162           0.001227791
## 9217           0.0066573889           0.005977928           0.026000289
##       NKTCL.GSM9493332_mean NKTCL.GSM9493333_mean healthy.GSM9493320_mean
## 11871           0.070467564          0.1994304023             0.014985591
## 5631            0.036366058          0.3639017444             0.016253602
## 15920           0.228853070          0.1167675329             0.018789625
## 3072            0.001465299          0.0074759701             0.007377522
## 3239            0.000799254          0.0008543966             0.013717579
## 1570            0.008125749          0.0127447490             0.028933718
## 3485            0.000932463          0.0129583482             0.063861671
## 4840            0.000799254          0.0014951940             0.004841499
## 9217            0.001198881          0.0202919188             0.190317003
##       healthy.GSM9493329_mean healthy.GSM9493330_mean healthy.GSM9493331_mean
## 11871             0.016264660             0.005078017            0.0082456011
## 5631              0.027550343             0.027144308            0.0233379960
## 15920             0.012502766             0.017911550            0.0039019362
## 3072              0.016485948             0.008863448            0.0131782375
## 3239              0.006638637             0.007016896            0.0106014872
## 1570              0.097698606             0.098605854            0.0990944563
## 3485              0.005532197             0.001015603            0.0119266730
## 4840              0.001880947             0.001477241            0.0009570787
## 9217              0.064837353             0.073862063            0.0529338143
##       healthy.GSM9493321_mean healthy.GSM9493322_mean healthy.GSM9493323_mean
## 11871             0.006818534             0.009493671             0.007760258
## 5631              0.011777468             0.004801397             0.008420705
## 15920             0.004339067             0.002400698             0.002476678
## 3072              0.010460251             0.019860323             0.014447288
## 3239              0.011235084             0.038411174             0.003715017
## 1570              0.069889974             0.038192929             0.011062495
## 3485              0.027274136             0.045285901             0.026995790
## 4840              0.003564234             0.002073330             0.001651119
## 9217              0.114830311             0.247162811             0.061999505
##       healthy.GSM9493324_mean healthy.GSM9493325_mean healthy.GSM9493326_mean
## 11871             0.013263263             0.006746032             0.019963311
## 5631              0.003503504             0.005634921             0.018560483
## 15920             0.005880881             0.003253968             0.010467249
## 3072              0.024274274             0.005158730             0.016186468
## 3239              0.007007007             0.010555556             0.006474587
## 1570              0.030780781             0.052380952             0.052767886
## 3485              0.016391391             0.065000000             0.003237294
## 4840              0.004379379             0.003095238             0.059566203
## 9217              0.200700701             0.165238095             0.053415345
##       healthy.GSM9493327_mean healthy.GSM9493328_mean    var.features
## 11871             0.013388260             0.010101010         ANKRD22
## 5631              0.025060076             0.018068004          CXCL10
## 15920             0.011757638             0.002703087           IFI27
## 3072              0.012787504             0.020344288           IL1R2
## 3239              0.005921730             0.003983497          NCKAP5
## 1570              0.036903536             0.062882345          FCER1A
## 3485              0.002574665             0.008393797 ENSG00000286797
## 4840              0.001716444             0.002134016 ENSG00000240086
## 9217              0.023086165             0.076682316           NRCAM
##       var.features.rank           genes pathologyMean healthyMean
## 11871               693         ANKRD22   0.557973754 0.011009017
## 5631                 30          CXCL10   0.452981671 0.015842734
## 15920               775           IFI27   0.224373364 0.008032095
## 3072                617           IL1R2   0.233235336 0.014118690
## 3239               1013          NCKAP5   0.003549511 0.010439854
## 1570                 57          FCER1A   0.018804754 0.056599461
## 3485               1259 ENSG00000286797   0.007484284 0.023124093
## 4840               1978 ENSG00000240086   0.002259136 0.007278061
## 9217                613           NRCAM   0.030773045 0.110422124
##       foldchangePathologyVsHealthy
## 11871                   50.6833391
## 5631                    28.5923929
## 15920                   27.9345988
## 3072                    16.5196158
## 3239                     0.3399962
## 1570                     0.3322426
## 3485                     0.3236574
## 4840                     0.3104035
## 9217                     0.2786855

There are 9 genes in common with the top 2,000 highest variable genes from Seurat project in the dataset of samples with only genes that had all samples represented or having complete cases.

common$genes
## [1] "ANKRD22"         "CXCL10"          "IFI27"           "IL1R2"          
## [5] "NCKAP5"          "FCER1A"          "ENSG00000286797" "ENSG00000240086"
## [9] "NRCAM"

The top 4 genes are enhanced or upregulated or stimulated, while those bottom 4 genes are underexpressed or inhibited from healthy state to NKTCL pathological state. We can keep these genes as our top genes and use randomForest to test out these genes in predicting a 2 class model of healthy or NKTCL sample.

Lets see the rank of these genes in Seurat’s top 2,000 genes.

NKTCL <- common$genes

HighVarRank <- top2000Var[which(top2000Var$genes %in% NKTCL),c(34:35)]
HighVarRank
##       var.features.rank           genes
## 5631                 30          CXCL10
## 1570                 57          FCER1A
## 9217                613           NRCAM
## 3072                617           IL1R2
## 11871               693         ANKRD22
## 15920               775           IFI27
## 3239               1013          NCKAP5
## 3485               1259 ENSG00000286797
## 4840               1978 ENSG00000240086

Interesting that the top 2 enhanced genes are in the top 100 while the bottom 3 silenced genes are in the last 1,000 of the top 2,000 genes, while the middle genes are in the top 1/3 of the top 2,000 genes.

We will be using the NKTCL genes we just found sharing common ground with the top 2,000 genes found in Seurat and with complete cases of numeric sequencing readings for the gene in every sample.

Lets make the matrix to test the 2 class model from our dataframe called common.

colnames(common)
##  [1] "NKTCL.GSM9493334_mean"        "NKTCL.GSM9493335_mean"       
##  [3] "NKTCL.GSM9493336_mean"        "NKTCL.GSM9493337_mean"       
##  [5] "NKTCL.GSM9493338_mean"        "NKTCL.GSM9493339_mean"       
##  [7] "NKTCL.GSM9493340_mean"        "NKTCL.GSM9493341_mean"       
##  [9] "NKTCL.GSM9493342_mean"        "NKTCL.GSM9493343_mean"       
## [11] "NKTCL.GSM9493344_mean"        "NKTCL.GSM9493345_mean"       
## [13] "NKTCL.GSM9493346_mean"        "NKTCL.GSM9493347_mean"       
## [15] "NKTCL.GSM9493348_mean"        "NKTCL.GSM9493349_mean"       
## [17] "NKTCL.GSM9493350_mean"        "NKTCL.GSM9493351_mean"       
## [19] "NKTCL.GSM9493332_mean"        "NKTCL.GSM9493333_mean"       
## [21] "healthy.GSM9493320_mean"      "healthy.GSM9493329_mean"     
## [23] "healthy.GSM9493330_mean"      "healthy.GSM9493331_mean"     
## [25] "healthy.GSM9493321_mean"      "healthy.GSM9493322_mean"     
## [27] "healthy.GSM9493323_mean"      "healthy.GSM9493324_mean"     
## [29] "healthy.GSM9493325_mean"      "healthy.GSM9493326_mean"     
## [31] "healthy.GSM9493327_mean"      "healthy.GSM9493328_mean"     
## [33] "var.features"                 "var.features.rank"           
## [35] "genes"                        "pathologyMean"               
## [37] "healthyMean"                  "foldchangePathologyVsHealthy"
class <- c('NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','healthy', 'healthy','healthy','healthy','healthy','healthy','healthy','healthy','healthy','healthy','healthy','healthy')

sampleHeader <- common$genes


class
##  [1] "NKTC Lymphoma" "NKTC Lymphoma" "NKTC Lymphoma" "NKTC Lymphoma"
##  [5] "NKTC Lymphoma" "NKTC Lymphoma" "NKTC Lymphoma" "NKTC Lymphoma"
##  [9] "NKTC Lymphoma" "NKTC Lymphoma" "NKTC Lymphoma" "NKTC Lymphoma"
## [13] "NKTC Lymphoma" "NKTC Lymphoma" "NKTC Lymphoma" "NKTC Lymphoma"
## [17] "NKTC Lymphoma" "NKTC Lymphoma" "NKTC Lymphoma" "NKTC Lymphoma"
## [21] "healthy"       "healthy"       "healthy"       "healthy"      
## [25] "healthy"       "healthy"       "healthy"       "healthy"      
## [29] "healthy"       "healthy"       "healthy"       "healthy"
sampleHeader
## [1] "ANKRD22"         "CXCL10"          "IFI27"           "IL1R2"          
## [5] "NCKAP5"          "FCER1A"          "ENSG00000286797" "ENSG00000240086"
## [9] "NRCAM"
dataMatrix <- data.frame(t(common[1:32]))
dim(dataMatrix)
## [1] 32  9
colnames(dataMatrix) <- sampleHeader

dataMatrix$class <- class
dataMatrix
##                             ANKRD22      CXCL10       IFI27       IL1R2
## NKTCL.GSM9493334_mean   0.488204306 0.162849290 0.198465415 0.638685295
## NKTCL.GSM9493335_mean   0.104894314 0.069289025 0.025319317 0.004069176
## NKTCL.GSM9493336_mean   0.239316239 0.105943152 0.026237329 0.017193401
## NKTCL.GSM9493337_mean   0.033717995 0.030472413 0.012441399 0.121168410
## NKTCL.GSM9493338_mean   1.659388646 1.882431979 0.360317994 0.003023178
## NKTCL.GSM9493339_mean   0.019009801 0.008448800 0.032781345 3.565224738
## NKTCL.GSM9493340_mean   0.983197774 0.237906678 0.698630137 0.051583904
## NKTCL.GSM9493341_mean   0.466055999 0.488273891 0.090430715 0.001819009
## NKTCL.GSM9493342_mean   0.308642857 0.287000000 0.077285714 0.002857143
## NKTCL.GSM9493343_mean   0.006203559 0.009066739 0.005999046 0.009339423
## NKTCL.GSM9493344_mean   0.672415196 0.231595217 0.315545432 0.022695843
## NKTCL.GSM9493345_mean   0.426184065 0.249722794 0.062648503 0.010217013
## NKTCL.GSM9493346_mean   1.810318044 0.440111193 0.351402175 0.124110866
## NKTCL.GSM9493347_mean   1.362316660 1.082437006 0.552012035 0.006167732
## NKTCL.GSM9493348_mean   0.601367817 0.779419857 0.793412468 0.048738307
## NKTCL.GSM9493349_mean   1.578515816 2.424455561 0.188475571 0.010343401
## NKTCL.GSM9493350_mean   0.015787860 0.027590435 0.194512569 0.004445126
## NKTCL.GSM9493351_mean   0.114040156 0.142351582 0.155929510 0.014083490
## NKTCL.GSM9493332_mean   0.070467564 0.036366058 0.228853070 0.001465299
## NKTCL.GSM9493333_mean   0.199430402 0.363901744 0.116767533 0.007475970
## healthy.GSM9493320_mean 0.014985591 0.016253602 0.018789625 0.007377522
## healthy.GSM9493329_mean 0.016264660 0.027550343 0.012502766 0.016485948
## healthy.GSM9493330_mean 0.005078017 0.027144308 0.017911550 0.008863448
## healthy.GSM9493331_mean 0.008245601 0.023337996 0.003901936 0.013178238
## healthy.GSM9493321_mean 0.006818534 0.011777468 0.004339067 0.010460251
## healthy.GSM9493322_mean 0.009493671 0.004801397 0.002400698 0.019860323
## healthy.GSM9493323_mean 0.007760258 0.008420705 0.002476678 0.014447288
## healthy.GSM9493324_mean 0.013263263 0.003503504 0.005880881 0.024274274
## healthy.GSM9493325_mean 0.006746032 0.005634921 0.003253968 0.005158730
## healthy.GSM9493326_mean 0.019963311 0.018560483 0.010467249 0.016186468
## healthy.GSM9493327_mean 0.013388260 0.025060076 0.011757638 0.012787504
## healthy.GSM9493328_mean 0.010101010 0.018068004 0.002703087 0.020344288
##                               NCKAP5      FCER1A ENSG00000286797
## NKTCL.GSM9493334_mean   0.0019468621 0.003321118    0.0026339899
## NKTCL.GSM9493335_mean   0.0047473720 0.018198259    0.0057646660
## NKTCL.GSM9493336_mean   0.0044722719 0.015702644    0.0030808984
## NKTCL.GSM9493337_mean   0.0064911648 0.016588532    0.0189325640
## NKTCL.GSM9493338_mean   0.0034710559 0.009069533    0.0169074012
## NKTCL.GSM9493339_mean   0.0022811761 0.002619128    0.0063366002
## NKTCL.GSM9493340_mean   0.0054580479 0.006849315    0.0057791096
## NKTCL.GSM9493341_mean   0.0059117781 0.021568245    0.0070811408
## NKTCL.GSM9493342_mean   0.0051428571 0.016785714    0.0250000000
## NKTCL.GSM9493343_mean   0.0043629423 0.077646738    0.0072261231
## NKTCL.GSM9493344_mean   0.0023590661 0.016594810    0.0049621736
## NKTCL.GSM9493345_mean   0.0044352923 0.017503564    0.0022176461
## NKTCL.GSM9493346_mean   0.0031886191 0.002043987    0.0008993541
## NKTCL.GSM9493347_mean   0.0043625423 0.020308387    0.0132380594
## NKTCL.GSM9493348_mean   0.0014149831 0.007232136    0.0079396274
## NKTCL.GSM9493349_mean   0.0016925565 0.008011434    0.0003008989
## NKTCL.GSM9493350_mean   0.0027590435 0.024908032    0.0059779277
## NKTCL.GSM9493351_mean   0.0048389427 0.070273003    0.0015166835
## NKTCL.GSM9493332_mean   0.0007992540 0.008125749    0.0009324630
## NKTCL.GSM9493333_mean   0.0008543966 0.012744749    0.0129583482
## healthy.GSM9493320_mean 0.0137175793 0.028933718    0.0638616715
## healthy.GSM9493329_mean 0.0066386369 0.097698606    0.0055321974
## healthy.GSM9493330_mean 0.0070168959 0.098605854    0.0010156034
## healthy.GSM9493331_mean 0.0106014872 0.099094456    0.0119266730
## healthy.GSM9493321_mean 0.0112350845 0.069889974    0.0272741361
## healthy.GSM9493322_mean 0.0384111742 0.038192929    0.0452859014
## healthy.GSM9493323_mean 0.0037150169 0.011062495    0.0269957896
## healthy.GSM9493324_mean 0.0070070070 0.030780781    0.0163913914
## healthy.GSM9493325_mean 0.0105555556 0.052380952    0.0650000000
## healthy.GSM9493326_mean 0.0064745872 0.052767886    0.0032372936
## healthy.GSM9493327_mean 0.0059217302 0.036903536    0.0025746653
## healthy.GSM9493328_mean 0.0039834969 0.062882345    0.0083937971
##                         ENSG00000240086        NRCAM         class
## NKTCL.GSM9493334_mean      0.0006871278 0.0053825011 NKTC Lymphoma
## NKTCL.GSM9493335_mean      0.0028258167 0.0230586640 NKTC Lymphoma
## NKTCL.GSM9493336_mean      0.0019876764 0.0429338104 NKTC Lymphoma
## NKTCL.GSM9493337_mean      0.0079336459 0.0941218897 NKTC Lymphoma
## NKTCL.GSM9493338_mean      0.0041428731 0.0311275333 NKTC Lymphoma
## NKTCL.GSM9493339_mean      0.0017742481 0.0095471443 NKTC Lymphoma
## NKTCL.GSM9493340_mean      0.0037457192 0.0468750000 NKTC Lymphoma
## NKTCL.GSM9493341_mean      0.0037679465 0.0106541935 NKTC Lymphoma
## NKTCL.GSM9493342_mean      0.0022857143 0.0452857143 NKTC Lymphoma
## NKTCL.GSM9493343_mean      0.0012270775 0.0837139546 NKTC Lymphoma
## NKTCL.GSM9493344_mean      0.0029284959 0.0095989588 NKTC Lymphoma
## NKTCL.GSM9493345_mean      0.0012672264 0.1092190718 NKTC Lymphoma
## NKTCL.GSM9493346_mean      0.0008175946 0.0030251002 NKTC Lymphoma
## NKTCL.GSM9493347_mean      0.0010530275 0.0402406920 NKTC Lymphoma
## NKTCL.GSM9493348_mean      0.0025941357 0.0005502712 NKTC Lymphoma
## NKTCL.GSM9493349_mean      0.0011659834 0.0066573889 NKTC Lymphoma
## NKTCL.GSM9493350_mean      0.0014561619 0.0059779277 NKTC Lymphoma
## NKTCL.GSM9493351_mean      0.0012277914 0.0260002889 NKTC Lymphoma
## NKTCL.GSM9493332_mean      0.0007992540 0.0011988810 NKTC Lymphoma
## NKTCL.GSM9493333_mean      0.0014951940 0.0202919188 NKTC Lymphoma
## healthy.GSM9493320_mean    0.0048414986 0.1903170029       healthy
## healthy.GSM9493329_mean    0.0018809471 0.0648373534       healthy
## healthy.GSM9493330_mean    0.0014772413 0.0738620626       healthy
## healthy.GSM9493331_mean    0.0009570787 0.0529338143       healthy
## healthy.GSM9493321_mean    0.0035642337 0.1148303115       healthy
## healthy.GSM9493322_mean    0.0020733304 0.2471628110       healthy
## healthy.GSM9493323_mean    0.0016511186 0.0619995047       healthy
## healthy.GSM9493324_mean    0.0043793794 0.2007007007       healthy
## healthy.GSM9493325_mean    0.0030952381 0.1652380952       healthy
## healthy.GSM9493326_mean    0.0595662027 0.0534153448       healthy
## healthy.GSM9493327_mean    0.0017164435 0.0230861655       healthy
## healthy.GSM9493328_mean    0.0021340162 0.0766823161       healthy

We have our matrix, now lets import randomForest and see how well this data set of top genes predicts a two class model of healthy versus Natural Killer T-cell Lymphoma.

library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
dataMatrix$class <- as.factor(dataMatrix$class)
str(dataMatrix)
## 'data.frame':    32 obs. of  10 variables:
##  $ ANKRD22        : num  0.4882 0.1049 0.2393 0.0337 1.6594 ...
##  $ CXCL10         : num  0.1628 0.0693 0.1059 0.0305 1.8824 ...
##  $ IFI27          : num  0.1985 0.0253 0.0262 0.0124 0.3603 ...
##  $ IL1R2          : num  0.63869 0.00407 0.01719 0.12117 0.00302 ...
##  $ NCKAP5         : num  0.00195 0.00475 0.00447 0.00649 0.00347 ...
##  $ FCER1A         : num  0.00332 0.0182 0.0157 0.01659 0.00907 ...
##  $ ENSG00000286797: num  0.00263 0.00576 0.00308 0.01893 0.01691 ...
##  $ ENSG00000240086: num  0.000687 0.002826 0.001988 0.007934 0.004143 ...
##  $ NRCAM          : num  0.00538 0.02306 0.04293 0.09412 0.03113 ...
##  $ class          : Factor w/ 2 levels "healthy","NKTC Lymphoma": 2 2 2 2 2 2 2 2 2 2 ...
set.seed(123)

inTrain <- sample(1:32,.7*32)

training <- dataMatrix[inTrain,]
testing <- dataMatrix[-inTrain,]
table(training$class)
## 
##       healthy NKTC Lymphoma 
##             9            13
table(testing$class)
## 
##       healthy NKTC Lymphoma 
##             3             7
rf <- randomForest(class ~ ., training, importance=TRUE, na.action=na.omit,mtry=3,ntree=500)
prediction <- predict(rf,testing)
results <- data.frame(predicted=prediction,actual=testing$class)
results
##                             predicted        actual
## NKTCL.GSM9493334_mean   NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493335_mean   NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493337_mean         healthy NKTC Lymphoma
## NKTCL.GSM9493339_mean         healthy NKTC Lymphoma
## NKTCL.GSM9493345_mean   NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493346_mean   NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493349_mean   NKTC Lymphoma NKTC Lymphoma
## healthy.GSM9493320_mean       healthy       healthy
## healthy.GSM9493331_mean       healthy       healthy
## healthy.GSM9493321_mean       healthy       healthy

Using 70% of the samples in the data to train, the model scored decent by only missing 2 of the samples incorrectly predicting healthy when the sample was actually the NKTC lymphoma.

sum(results$predicted==results$actual)/length(results$predicted)
## [1] 0.8

By training 70% of the data, it scored 80% accuracy on the two class model.

rf$confusion
##               healthy NKTC Lymphoma class.error
## healthy             9             0  0.00000000
## NKTC Lymphoma       1            12  0.07692308

In the training model, all healthy classes were predicted correctly, but one NKTC Lymphoma sample was incorrectly predicted healthy when it was actually NKTC Lymphoma.

Lets try a model that trains on 80% of the data instead.

set.seed(123)

inTrain <- sample(1:32,.8*32)

training <- dataMatrix[inTrain,]
testing <- dataMatrix[-inTrain,]
table(training$class)
## 
##       healthy NKTC Lymphoma 
##            10            15
table(testing$class)
## 
##       healthy NKTC Lymphoma 
##             2             5
rf2 <- randomForest(class ~ ., training, importance=TRUE, na.action=na.omit,mtry=3,ntree=500)
prediction2 <- predict(rf2,testing)
results2 <- data.frame(predicted=prediction2,actual=testing$class)
results2
##                             predicted        actual
## NKTCL.GSM9493335_mean   NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493339_mean         healthy NKTC Lymphoma
## NKTCL.GSM9493345_mean   NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493346_mean   NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493349_mean   NKTC Lymphoma NKTC Lymphoma
## healthy.GSM9493320_mean       healthy       healthy
## healthy.GSM9493321_mean       healthy       healthy
sum(results2$predicted==results2$actual)/length(results2$predicted)
## [1] 0.8571429

The model scored better at 85.7% accuracy when predicting the testing hold out set of 7 samples. Only 1 class was incorrectly predicted and that was an NKTC Lymphoma sample incorrectly predicted as healthy.

Lets see how well the model did on training.

rf2$confusion
##               healthy NKTC Lymphoma class.error
## healthy            10             0         0.0
## NKTC Lymphoma       3            12         0.2

All healthy samples were correctly labeled healthy, but there were 3 samples of NKTC Lymphoma that were incorrectly identified as healthy. That could be bad if the person had the pathology and was unaware and then it caused a lot of harm to them and illness that they didn’t know why as they believed the model and that they didn’t have the illness.

Lets try tuning our randomForest model on these same genes, then we will test the genes that Seurat has as the top 10 genes to see how well those genes predict the pathology of NKTC Lymphoma.

We will keep the same training and testing set of 80% since it did better.

rf3 <- randomForest(class ~ ., training, importance=TRUE, na.action=na.omit,mtry=2,ntree=10000)
prediction3 <- predict(rf3,testing)
results3 <- data.frame(predicted=prediction3,actual=testing$class)
results3
##                             predicted        actual
## NKTCL.GSM9493335_mean   NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493339_mean   NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493345_mean   NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493346_mean   NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493349_mean   NKTC Lymphoma NKTC Lymphoma
## healthy.GSM9493320_mean       healthy       healthy
## healthy.GSM9493321_mean       healthy       healthy

When tuning the model using 80% training data, to predict on the testing data, the results were all classes correctly predicted using 10,000 trees instead of 500 and a lower mtry of 2 instead of 3.

Lets see how well the training model did on the training set before using it to predict the two class model of the hold out testing set.

rf3$confusion
##               healthy NKTC Lymphoma class.error
## healthy            10             0   0.0000000
## NKTC Lymphoma       2            13   0.1333333

The training model did well, but not perfect. It correctly predicted the healthy samples, but incorrectly predicted two samples of NKTC Lymphoma as healthy.

Ok, so now lets look at the 10 genes that were found in Seurat. And see what their fold change values are before making a matrix to test these genes on a two class model.

top10Seurat <- top2000Var[c(1:10),]
top10Seurat[,c(34,35)]
##       var.features.rank  genes
## 5603                  1   PPBP
## 22311                 2  IGLC2
## 2972                  3   IGKC
## 1414                  4 S100A9
## 5610                  5   EREG
## 22312                 6  IGLC3
## 13803                 7   SOX5
## 5602                  8    PF4
## 14280                 9    LYZ
## 10577                10 LINGO2

These are the genes not found by fold change but by principal component analysis. They are the most variable changed genes from healthy to pathological state using the filtering settings selected in those earlier uploads of this series GSE318371.

Since we got the genes from complete cases that were in this set, lets use that data set of complete cases of genes in each sample to extract the fold change values for these PCA extracted top genes to see if up or down regulated.

PCA10 <- DataMeansComplete[which(DataMeansComplete$genes %in% top10Seurat$genes),]

PCA10orderedRank <- PCA10[order(PCA10$var.features.rank,decreasing=F),c(34,35,38)]

PCA10orderedRank
##       var.features.rank  genes foldchangePathologyVsHealthy
## 5603                  1   PPBP                    1.0299078
## 22311                 2  IGLC2                    0.9634843
## 2972                  3   IGKC                    0.9858088
## 1414                  4 S100A9                    1.5474204
## 5610                  5   EREG                    0.6300822
## 22312                 6  IGLC3                    1.1257449
## 13803                 7   SOX5                    1.0760944
## 5602                  8    PF4                    0.7387458
## 14280                 9    LYZ                    1.3220576
## 10577                10 LINGO2                    0.6979391
PCA10orderedFC <- PCA10[order(PCA10$foldchangePathologyVsHealthy, decreasing=T),c(34,35,38)]

PCA10orderedFC
##       var.features.rank  genes foldchangePathologyVsHealthy
## 1414                  4 S100A9                    1.5474204
## 14280                 9    LYZ                    1.3220576
## 22312                 6  IGLC3                    1.1257449
## 13803                 7   SOX5                    1.0760944
## 5603                  1   PPBP                    1.0299078
## 2972                  3   IGKC                    0.9858088
## 22311                 2  IGLC2                    0.9634843
## 5602                  8    PF4                    0.7387458
## 10577                10 LINGO2                    0.6979391
## 5610                  5   EREG                    0.6300822

There are equally 5 that are upregulated and 5 that are down regulated from healthy to pathological state.

Make the matrix from PCA10 that has the 12 healthy and 20 NKTC Lymphoma samples with 10 genes of Seurat’s PCA top ranked genes of highest variability.

colnames(PCA10)
##  [1] "NKTCL.GSM9493334_mean"        "NKTCL.GSM9493335_mean"       
##  [3] "NKTCL.GSM9493336_mean"        "NKTCL.GSM9493337_mean"       
##  [5] "NKTCL.GSM9493338_mean"        "NKTCL.GSM9493339_mean"       
##  [7] "NKTCL.GSM9493340_mean"        "NKTCL.GSM9493341_mean"       
##  [9] "NKTCL.GSM9493342_mean"        "NKTCL.GSM9493343_mean"       
## [11] "NKTCL.GSM9493344_mean"        "NKTCL.GSM9493345_mean"       
## [13] "NKTCL.GSM9493346_mean"        "NKTCL.GSM9493347_mean"       
## [15] "NKTCL.GSM9493348_mean"        "NKTCL.GSM9493349_mean"       
## [17] "NKTCL.GSM9493350_mean"        "NKTCL.GSM9493351_mean"       
## [19] "NKTCL.GSM9493332_mean"        "NKTCL.GSM9493333_mean"       
## [21] "healthy.GSM9493320_mean"      "healthy.GSM9493329_mean"     
## [23] "healthy.GSM9493330_mean"      "healthy.GSM9493331_mean"     
## [25] "healthy.GSM9493321_mean"      "healthy.GSM9493322_mean"     
## [27] "healthy.GSM9493323_mean"      "healthy.GSM9493324_mean"     
## [29] "healthy.GSM9493325_mean"      "healthy.GSM9493326_mean"     
## [31] "healthy.GSM9493327_mean"      "healthy.GSM9493328_mean"     
## [33] "var.features"                 "var.features.rank"           
## [35] "genes"                        "pathologyMean"               
## [37] "healthyMean"                  "foldchangePathologyVsHealthy"
PCA10b <- PCA10[,c(1:32)]

genesB <- PCA10$genes


head(PCA10b)
##       NKTCL.GSM9493334_mean NKTCL.GSM9493335_mean NKTCL.GSM9493336_mean
## 1414            56.14074668             9.1726009            14.0559531
## 14280           28.45338983             8.8639087             9.6306897
## 22312            0.27828676             0.6656494             0.5280262
## 13803            0.10364178             0.2857466             0.4873783
## 5603             0.09883188             0.4481745             0.8714967
## 2972             1.58863949             5.0399005             5.4777380
##       NKTCL.GSM9493337_mean NKTCL.GSM9493338_mean NKTCL.GSM9493339_mean
## 1414             12.7582041           22.19706640           16.51334910
## 14280            10.3959611           12.17064159            6.66821561
## 22312             0.6413631            1.05105811            0.26039202
## 13803             0.3067075            0.09965289            0.04123015
## 5603              1.3196899            0.23278468            0.26486989
## 2972              3.5041471            4.29336021            1.45555931
##       NKTCL.GSM9493340_mean NKTCL.GSM9493341_mean NKTCL.GSM9493342_mean
## 1414             29.9675728            5.95517443            10.8132857
## 14280            12.7071918           10.93055285            12.4585000
## 22312             0.1405180            0.36938868             0.4012143
## 13803             0.0854024            0.15792893             0.1267857
## 5603              1.8139983            0.02864939             0.4072143
## 2972              1.7065497            2.09627753             3.8170000
##       NKTCL.GSM9493343_mean NKTCL.GSM9493344_mean NKTCL.GSM9493345_mean
## 1414             14.2298043            28.7486374            16.6500079
## 14280            12.0203149            22.5039453            13.3806431
## 22312             0.3568069             0.2224030             0.3739110
## 13803             0.1337514             0.1227528             0.2149533
## 5603              0.7770809             2.3528838             0.4168383
## 2972              2.1034154             2.7587245             3.0952796
##       NKTCL.GSM9493346_mean NKTCL.GSM9493347_mean NKTCL.GSM9493348_mean
## 1414            75.24241681           19.10109064           25.19990567
## 14280           24.06557109           20.11214742            9.47779263
## 22312            0.04962799            0.55690109            0.43573618
## 13803            0.17676396            0.06130124            0.04213505
## 5603             0.79298504            0.32094772            4.96093074
## 2972             1.74392936            2.80699511            1.64916280
##       NKTCL.GSM9493349_mean NKTCL.GSM9493350_mean NKTCL.GSM9493351_mean
## 1414            19.16635950            12.8743869           18.06203958
## 14280           16.34768872             5.2355150           10.01278348
## 22312            0.61263023             0.9675046            0.46959411
## 13803            0.02903675             0.2963673            0.01856132
## 5603             1.36555459             0.6836297            0.82103134
## 2972             1.73246323             6.7227928            0.92568251
##       NKTCL.GSM9493332_mean NKTCL.GSM9493333_mean healthy.GSM9493320_mean
## 1414            13.23404822            9.13221787               8.1473199
## 14280            5.19808179           10.06422214               5.8773487
## 22312            0.06673771            0.02470630               0.3576945
## 13803            0.02917277            0.06429334               0.2695101
## 5603             0.33422139            0.21352795               0.7465130
## 2972             0.09750899            0.07831969               1.8303170
##       healthy.GSM9493329_mean healthy.GSM9493330_mean healthy.GSM9493331_mean
## 1414               29.2248285              11.3305327               9.6991828
## 14280              18.1343218               6.2842766               8.5866156
## 22312               0.3899093               0.7881082               0.5373629
## 13803               0.1203806               0.1153171               0.1123463
## 5603                0.7561407               1.1591727               1.1468748
## 2972                2.0202478               4.4677315               5.8579842
##       healthy.GSM9493321_mean healthy.GSM9493322_mean healthy.GSM9493323_mean
## 1414                6.2743685               8.9673723              5.89515397
## 14280               7.0196808              11.9966172              4.16816643
## 22312               0.4270107               0.1407682              0.29167011
## 13803               0.1930885               0.1736141              0.09378354
## 5603                0.4886874               0.8538848              0.37183192
## 2972                1.5470324               2.3434090              2.02253777
##       healthy.GSM9493324_mean healthy.GSM9493325_mean healthy.GSM9493326_mean
## 1414               11.6448949              13.1591270             23.20427323
## 14280              13.4121622               9.4002381             13.01553901
## 22312               0.3395896               0.1820635              0.36840401
## 13803               0.1525275               0.1492063              0.03507068
## 5603                0.4872372               1.5381746              0.67454408
## 2972                1.4079079               1.9276190              2.61066149
##       healthy.GSM9493327_mean healthy.GSM9493328_mean
## 1414               17.6065911              21.2710201
## 14280               4.5891692              15.8304168
## 22312               0.1588568               0.5342154
## 13803               0.0592173               0.1337317
## 5603                1.6737899               0.8955755
## 2972                2.6473567               3.3883910
dataPCA_Matrix <- data.frame(t(PCA10b))

colnames(dataPCA_Matrix) <- genesB

dataPCA_Matrix$class <- class

dataPCA_Matrix$class <- as.factor(dataPCA_Matrix$class)

str(dataPCA_Matrix)
## 'data.frame':    32 obs. of  11 variables:
##  $ S100A9: num  56.14 9.17 14.06 12.76 22.2 ...
##  $ LYZ   : num  28.45 8.86 9.63 10.4 12.17 ...
##  $ IGLC3 : num  0.278 0.666 0.528 0.641 1.051 ...
##  $ SOX5  : num  0.1036 0.2857 0.4874 0.3067 0.0997 ...
##  $ PPBP  : num  0.0988 0.4482 0.8715 1.3197 0.2328 ...
##  $ IGKC  : num  1.59 5.04 5.48 3.5 4.29 ...
##  $ IGLC2 : num  0.323 1.553 1.786 0.892 0.818 ...
##  $ PF4   : num  0.0413 0.1868 0.3056 0.4931 0.0826 ...
##  $ LINGO2: num  0.262 1.063 0.847 0.527 0.509 ...
##  $ EREG  : num  1.015 0.291 0.699 0.498 0.252 ...
##  $ class : Factor w/ 2 levels "healthy","NKTC Lymphoma": 2 2 2 2 2 2 2 2 2 2 ...
head(dataPCA_Matrix,10)
##                          S100A9       LYZ     IGLC3       SOX5       PPBP
## NKTCL.GSM9493334_mean 56.140747 28.453390 0.2782868 0.10364178 0.09883188
## NKTCL.GSM9493335_mean  9.172601  8.863909 0.6656494 0.28574658 0.44817452
## NKTCL.GSM9493336_mean 14.055953  9.630690 0.5280262 0.48737825 0.87149672
## NKTCL.GSM9493337_mean 12.758204 10.395961 0.6413631 0.30670754 1.31968987
## NKTCL.GSM9493338_mean 22.197066 12.170642 1.0510581 0.09965289 0.23278468
## NKTCL.GSM9493339_mean 16.513349  6.668216 0.2603920 0.04123015 0.26486989
## NKTCL.GSM9493340_mean 29.967573 12.707192 0.1405180 0.08540240 1.81399829
## NKTCL.GSM9493341_mean  5.955174 10.930553 0.3693887 0.15792893 0.02864939
## NKTCL.GSM9493342_mean 10.813286 12.458500 0.4012143 0.12678571 0.40721429
## NKTCL.GSM9493343_mean 14.229804 12.020315 0.3568069 0.13375145 0.77708092
##                           IGKC     IGLC2        PF4    LINGO2       EREG
## NKTCL.GSM9493334_mean 1.588639 0.3231791 0.04134219 0.2616812 1.01465873
## NKTCL.GSM9493335_mean 5.039901 1.5525037 0.18684300 1.0630722 0.29128518
## NKTCL.GSM9493336_mean 5.477738 1.7860266 0.30560525 0.8470483 0.69936394
## NKTCL.GSM9493337_mean 3.504147 0.8921745 0.49314821 0.5274071 0.49783628
## NKTCL.GSM9493338_mean 4.293360 0.8176016 0.08263352 0.5087896 0.25249132
## NKTCL.GSM9493339_mean 1.455559 0.5998648 0.10687732 0.8781683 0.36160865
## NKTCL.GSM9493340_mean 1.706550 0.4258348 0.70954623 0.3816353 0.61772260
## NKTCL.GSM9493341_mean 2.096278 1.8922887 0.01630611 0.1517573 0.32274410
## NKTCL.GSM9493342_mean 3.817000 1.1034286 0.14450000 0.1280714 0.30864286
## NKTCL.GSM9493343_mean 2.103415 0.6613266 0.28120526 0.6723703 0.07246574
##                               class
## NKTCL.GSM9493334_mean NKTC Lymphoma
## NKTCL.GSM9493335_mean NKTC Lymphoma
## NKTCL.GSM9493336_mean NKTC Lymphoma
## NKTCL.GSM9493337_mean NKTC Lymphoma
## NKTCL.GSM9493338_mean NKTC Lymphoma
## NKTCL.GSM9493339_mean NKTC Lymphoma
## NKTCL.GSM9493340_mean NKTC Lymphoma
## NKTCL.GSM9493341_mean NKTC Lymphoma
## NKTCL.GSM9493342_mean NKTC Lymphoma
## NKTCL.GSM9493343_mean NKTC Lymphoma
tail(dataPCA_Matrix,10)
##                            S100A9       LYZ     IGLC3       SOX5      PPBP
## healthy.GSM9493330_mean 11.330533  6.284277 0.7881082 0.11531715 1.1591727
## healthy.GSM9493331_mean  9.699183  8.586616 0.5373629 0.11234632 1.1468748
## healthy.GSM9493321_mean  6.274369  7.019681 0.4270107 0.19308849 0.4886874
## healthy.GSM9493322_mean  8.967372 11.996617 0.1407682 0.17361414 0.8538848
## healthy.GSM9493323_mean  5.895154  4.168166 0.2916701 0.09378354 0.3718319
## healthy.GSM9493324_mean 11.644895 13.412162 0.3395896 0.15252753 0.4872372
## healthy.GSM9493325_mean 13.159127  9.400238 0.1820635 0.14920635 1.5381746
## healthy.GSM9493326_mean 23.204273 13.015539 0.3684040 0.03507068 0.6745441
## healthy.GSM9493327_mean 17.606591  4.589169 0.1588568 0.05921730 1.6737899
## healthy.GSM9493328_mean 21.271020 15.830417 0.5342154 0.13373168 0.8955755
##                             IGKC     IGLC2       PF4    LINGO2      EREG
## healthy.GSM9493330_mean 4.467732 0.7086142 0.6272736 0.6213646 0.4305235
## healthy.GSM9493331_mean 5.857984 1.3745859 0.5944195 0.5052639 0.3482294
## healthy.GSM9493321_mean 1.547032 0.2962188 0.1782117 1.2142414 0.2019216
## healthy.GSM9493322_mean 2.343409 0.4860323 0.3416630 1.0185509 0.5947185
## healthy.GSM9493323_mean 2.022538 0.4578552 0.1098819 0.1161562 0.1640386
## healthy.GSM9493324_mean 1.407908 0.3403403 0.1461461 1.2079580 0.5110110
## healthy.GSM9493325_mean 1.927619 1.2646825 0.3322222 0.5403175 0.5958730
## healthy.GSM9493326_mean 2.610661 3.4131866 0.5474264 0.2417179 0.7083198
## healthy.GSM9493327_mean 2.647357 0.3359938 1.1686406 0.6624614 0.1890663
## healthy.GSM9493328_mean 3.388391 1.5609617 0.7050790 0.9443733 1.2892303
##                           class
## healthy.GSM9493330_mean healthy
## healthy.GSM9493331_mean healthy
## healthy.GSM9493321_mean healthy
## healthy.GSM9493322_mean healthy
## healthy.GSM9493323_mean healthy
## healthy.GSM9493324_mean healthy
## healthy.GSM9493325_mean healthy
## healthy.GSM9493326_mean healthy
## healthy.GSM9493327_mean healthy
## healthy.GSM9493328_mean healthy
set.seed(123)

inTrain <- sample(1:32,.8*32)

training <- dataPCA_Matrix[inTrain,]
testing <- dataPCA_Matrix[-inTrain,]
table(training$class)
## 
##       healthy NKTC Lymphoma 
##            10            15
table(testing$class)
## 
##       healthy NKTC Lymphoma 
##             2             5

We will keep the same training and testing set of 80% since it did better.

rf4 <- randomForest(class ~ ., training, importance=TRUE, na.action=na.omit,mtry=2,ntree=10000)
prediction4 <- predict(rf4,testing)
results4 <- data.frame(predicted=prediction4,actual=testing$class)
results4
##                             predicted        actual
## NKTCL.GSM9493335_mean   NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493339_mean   NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493345_mean   NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493346_mean   NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493349_mean   NKTC Lymphoma NKTC Lymphoma
## healthy.GSM9493320_mean       healthy       healthy
## healthy.GSM9493321_mean       healthy       healthy

Using our same samples and the same tuned settings that scored the best on the top genes found from fold change values of complete cases in the top 2,000 top variable genes found with PCA in Seurat, the results are the same at 100% accuracy in prediction of class in a two class model. That is with an mtry of 2 and 10,000 trees with default paramter settings within the randomForest function.

How well did the model do on the training data before predicting the hold out test data?

rf4$confusion
##               healthy NKTC Lymphoma class.error
## healthy             3             7   0.7000000
## NKTC Lymphoma       4            11   0.2666667

Well, it actually scored worse in training than our best model on the fold change genes of complete cases that are also in the top 2,000 genes but not the top 10 of those 2,000 genes. But somehow corrected within the model to score perfectly on the hold out prediction testing set. This training model incorrectly classified 7 healthy samples as NKTC lymphoma and 4 NKTC lymphoma samples as healthy.

Seems like either set of genes could be great sets of genes to use to predict NKTC Lymphoma with our tuned settings.

We will add these to our data of top genes for the various pathologies that we analyzed and use them later to predict the samples of various pathologies of fibromyalgia, multiple sclerosis, EBV infection, mononucleosis infection, Lyme disease, and now NKTC lymphoma.

Our genes are the 9 genes in the common data frame and the 10 genes in the PCA10 data frame. Lets write those out to csv.

write.csv(common,'commonToSeurat2000AndFCsCompleteCasesNKTCL.csv',row.names=F)
write.csv(PCA10,'top10SeuratPCA_NKTCL_genes.csv',row.names=F)

You can get those files here for common and the PCA10 sets of genes to add to our larger dataset later.

Lets write out some of our larger data out to csv as well.

write.csv(DataMeans3,'NKTCL_28817X3_noNaNs.csv',row.names=F)
write.csv(DataMeansComplete,'completeCasesNKTCL_19129X38.csv',row.names=F)
write.csv(top2000Var,'PCA_2000topgenesSeurat_NKTCL.csv',row.names=F)

The dataset without NaNs but has the Inf values. DataMeans3 28,817X38 The data of complete cases DataMeansComplete 19,129X38 The Seurat PCA 2,000 high variability genes data no fold change values. top2000Var 2,000 X 35

Thanks and keep checking in when we add those to the dataset of our genes and move on to the next series of EBV associated pathologies. We have nasopharyngeal carcinoma, Hodgkin’s lymphoma, Burkett’s lymphoma, and some others if we see them stirrig some interest or find other data sets lacking in information or some other reason.