This document is to prepare the Seurat derived data on counts of genes found in the GSE318371 study on Natural Killer T-cell Lymphoma associated with EBV infection. There are previous uploaded documents on deriving this information in Seurat, a library for R that helps manipulate single cell RNA sequencing data in array format. This data is from a large array and there are 30,960 genes with counts extracted from a larger set where only those genes that had more than 3 counts and were in more than 10 features were kept. The top 2,000 highest variability of change from healthy to pathological state are noted here along with the 12 healthy and 20 pathological samples’ statistical measures for variance and standardized values of counts.
I want to only keep the mean values of the counts and then do a fold change comparison of those healthy samples and those diseased samples of NKTC Lymphoma and keep the top 10 genes that have the highest and lowest fold change values for most over expressed or stimulated or up regulated, and those with most underexpressed, inhibited, or down regulated. Then compare those to the values obtained in the top 10 genes of Seurat with the highest variability measure extracted.
Then we can test these genes to see how well they predict a 2 class model using supervised learning algorithms of random forest for cross validation or boot strap aggregating AKA bagging or boosting. Then if they predict well, we can add them to our larger database of genes that are the top predictors for pathologies we have analyzed thus far in multiple sclerosis, mononucleosis, fibromyalgia, Lyme disease, and Epstein-Barr infection. We have a goal of creating a predictive machine that can determine a sample as EBV, EBV associated pathology, fibromyalgia, or Lyme disease.
Another idea is to use these genes from this study to compare to gene set studies on EBV and blood cancers to see if any relationships can be immediately determined.
Lets work with the dataset we made a few weeks ago and filter it to only include the samples by mean, the top genes if it was, and the gene names.
data <- read.csv("all30960genes_NKTC_Lymphoma_EBV_rownamesAdded.csv", header=T, sep=',')
colnames(data)
## [1] "vf_vst_counts.GSM9493334_mean"
## [2] "vf_vst_counts.GSM9493334_variance"
## [3] "vf_vst_counts.GSM9493334_variance.expected"
## [4] "vf_vst_counts.GSM9493334_variance.standardized"
## [5] "vf_vst_counts.GSM9493334_variable"
## [6] "vf_vst_counts.GSM9493334_rank"
## [7] "vf_vst_counts.GSM9493335_mean"
## [8] "vf_vst_counts.GSM9493335_variance"
## [9] "vf_vst_counts.GSM9493335_variance.expected"
## [10] "vf_vst_counts.GSM9493335_variance.standardized"
## [11] "vf_vst_counts.GSM9493335_variable"
## [12] "vf_vst_counts.GSM9493335_rank"
## [13] "vf_vst_counts.GSM9493336_mean"
## [14] "vf_vst_counts.GSM9493336_variance"
## [15] "vf_vst_counts.GSM9493336_variance.expected"
## [16] "vf_vst_counts.GSM9493336_variance.standardized"
## [17] "vf_vst_counts.GSM9493336_variable"
## [18] "vf_vst_counts.GSM9493336_rank"
## [19] "vf_vst_counts.GSM9493337_mean"
## [20] "vf_vst_counts.GSM9493337_variance"
## [21] "vf_vst_counts.GSM9493337_variance.expected"
## [22] "vf_vst_counts.GSM9493337_variance.standardized"
## [23] "vf_vst_counts.GSM9493337_variable"
## [24] "vf_vst_counts.GSM9493337_rank"
## [25] "vf_vst_counts.GSM9493338_mean"
## [26] "vf_vst_counts.GSM9493338_variance"
## [27] "vf_vst_counts.GSM9493338_variance.expected"
## [28] "vf_vst_counts.GSM9493338_variance.standardized"
## [29] "vf_vst_counts.GSM9493338_variable"
## [30] "vf_vst_counts.GSM9493338_rank"
## [31] "vf_vst_counts.GSM9493339_mean"
## [32] "vf_vst_counts.GSM9493339_variance"
## [33] "vf_vst_counts.GSM9493339_variance.expected"
## [34] "vf_vst_counts.GSM9493339_variance.standardized"
## [35] "vf_vst_counts.GSM9493339_variable"
## [36] "vf_vst_counts.GSM9493339_rank"
## [37] "vf_vst_counts.GSM9493340_mean"
## [38] "vf_vst_counts.GSM9493340_variance"
## [39] "vf_vst_counts.GSM9493340_variance.expected"
## [40] "vf_vst_counts.GSM9493340_variance.standardized"
## [41] "vf_vst_counts.GSM9493340_variable"
## [42] "vf_vst_counts.GSM9493340_rank"
## [43] "vf_vst_counts.GSM9493341_mean"
## [44] "vf_vst_counts.GSM9493341_variance"
## [45] "vf_vst_counts.GSM9493341_variance.expected"
## [46] "vf_vst_counts.GSM9493341_variance.standardized"
## [47] "vf_vst_counts.GSM9493341_variable"
## [48] "vf_vst_counts.GSM9493341_rank"
## [49] "vf_vst_counts.GSM9493342_mean"
## [50] "vf_vst_counts.GSM9493342_variance"
## [51] "vf_vst_counts.GSM9493342_variance.expected"
## [52] "vf_vst_counts.GSM9493342_variance.standardized"
## [53] "vf_vst_counts.GSM9493342_variable"
## [54] "vf_vst_counts.GSM9493342_rank"
## [55] "vf_vst_counts.GSM9493343_mean"
## [56] "vf_vst_counts.GSM9493343_variance"
## [57] "vf_vst_counts.GSM9493343_variance.expected"
## [58] "vf_vst_counts.GSM9493343_variance.standardized"
## [59] "vf_vst_counts.GSM9493343_variable"
## [60] "vf_vst_counts.GSM9493343_rank"
## [61] "vf_vst_counts.GSM9493344_mean"
## [62] "vf_vst_counts.GSM9493344_variance"
## [63] "vf_vst_counts.GSM9493344_variance.expected"
## [64] "vf_vst_counts.GSM9493344_variance.standardized"
## [65] "vf_vst_counts.GSM9493344_variable"
## [66] "vf_vst_counts.GSM9493344_rank"
## [67] "vf_vst_counts.GSM9493345_mean"
## [68] "vf_vst_counts.GSM9493345_variance"
## [69] "vf_vst_counts.GSM9493345_variance.expected"
## [70] "vf_vst_counts.GSM9493345_variance.standardized"
## [71] "vf_vst_counts.GSM9493345_variable"
## [72] "vf_vst_counts.GSM9493345_rank"
## [73] "vf_vst_counts.GSM9493346_mean"
## [74] "vf_vst_counts.GSM9493346_variance"
## [75] "vf_vst_counts.GSM9493346_variance.expected"
## [76] "vf_vst_counts.GSM9493346_variance.standardized"
## [77] "vf_vst_counts.GSM9493346_variable"
## [78] "vf_vst_counts.GSM9493346_rank"
## [79] "vf_vst_counts.GSM9493347_mean"
## [80] "vf_vst_counts.GSM9493347_variance"
## [81] "vf_vst_counts.GSM9493347_variance.expected"
## [82] "vf_vst_counts.GSM9493347_variance.standardized"
## [83] "vf_vst_counts.GSM9493347_variable"
## [84] "vf_vst_counts.GSM9493347_rank"
## [85] "vf_vst_counts.GSM9493348_mean"
## [86] "vf_vst_counts.GSM9493348_variance"
## [87] "vf_vst_counts.GSM9493348_variance.expected"
## [88] "vf_vst_counts.GSM9493348_variance.standardized"
## [89] "vf_vst_counts.GSM9493348_variable"
## [90] "vf_vst_counts.GSM9493348_rank"
## [91] "vf_vst_counts.GSM9493349_mean"
## [92] "vf_vst_counts.GSM9493349_variance"
## [93] "vf_vst_counts.GSM9493349_variance.expected"
## [94] "vf_vst_counts.GSM9493349_variance.standardized"
## [95] "vf_vst_counts.GSM9493349_variable"
## [96] "vf_vst_counts.GSM9493349_rank"
## [97] "vf_vst_counts.GSM9493350_mean"
## [98] "vf_vst_counts.GSM9493350_variance"
## [99] "vf_vst_counts.GSM9493350_variance.expected"
## [100] "vf_vst_counts.GSM9493350_variance.standardized"
## [101] "vf_vst_counts.GSM9493350_variable"
## [102] "vf_vst_counts.GSM9493350_rank"
## [103] "vf_vst_counts.GSM9493351_mean"
## [104] "vf_vst_counts.GSM9493351_variance"
## [105] "vf_vst_counts.GSM9493351_variance.expected"
## [106] "vf_vst_counts.GSM9493351_variance.standardized"
## [107] "vf_vst_counts.GSM9493351_variable"
## [108] "vf_vst_counts.GSM9493351_rank"
## [109] "vf_vst_counts.GSM9493332_mean"
## [110] "vf_vst_counts.GSM9493332_variance"
## [111] "vf_vst_counts.GSM9493332_variance.expected"
## [112] "vf_vst_counts.GSM9493332_variance.standardized"
## [113] "vf_vst_counts.GSM9493332_variable"
## [114] "vf_vst_counts.GSM9493332_rank"
## [115] "vf_vst_counts.GSM9493333_mean"
## [116] "vf_vst_counts.GSM9493333_variance"
## [117] "vf_vst_counts.GSM9493333_variance.expected"
## [118] "vf_vst_counts.GSM9493333_variance.standardized"
## [119] "vf_vst_counts.GSM9493333_variable"
## [120] "vf_vst_counts.GSM9493333_rank"
## [121] "vf_vst_counts.GSM9493320_mean"
## [122] "vf_vst_counts.GSM9493320_variance"
## [123] "vf_vst_counts.GSM9493320_variance.expected"
## [124] "vf_vst_counts.GSM9493320_variance.standardized"
## [125] "vf_vst_counts.GSM9493320_variable"
## [126] "vf_vst_counts.GSM9493320_rank"
## [127] "vf_vst_counts.GSM9493329_mean"
## [128] "vf_vst_counts.GSM9493329_variance"
## [129] "vf_vst_counts.GSM9493329_variance.expected"
## [130] "vf_vst_counts.GSM9493329_variance.standardized"
## [131] "vf_vst_counts.GSM9493329_variable"
## [132] "vf_vst_counts.GSM9493329_rank"
## [133] "vf_vst_counts.GSM9493330_mean"
## [134] "vf_vst_counts.GSM9493330_variance"
## [135] "vf_vst_counts.GSM9493330_variance.expected"
## [136] "vf_vst_counts.GSM9493330_variance.standardized"
## [137] "vf_vst_counts.GSM9493330_variable"
## [138] "vf_vst_counts.GSM9493330_rank"
## [139] "vf_vst_counts.GSM9493331_mean"
## [140] "vf_vst_counts.GSM9493331_variance"
## [141] "vf_vst_counts.GSM9493331_variance.expected"
## [142] "vf_vst_counts.GSM9493331_variance.standardized"
## [143] "vf_vst_counts.GSM9493331_variable"
## [144] "vf_vst_counts.GSM9493331_rank"
## [145] "vf_vst_counts.GSM9493321_mean"
## [146] "vf_vst_counts.GSM9493321_variance"
## [147] "vf_vst_counts.GSM9493321_variance.expected"
## [148] "vf_vst_counts.GSM9493321_variance.standardized"
## [149] "vf_vst_counts.GSM9493321_variable"
## [150] "vf_vst_counts.GSM9493321_rank"
## [151] "vf_vst_counts.GSM9493322_mean"
## [152] "vf_vst_counts.GSM9493322_variance"
## [153] "vf_vst_counts.GSM9493322_variance.expected"
## [154] "vf_vst_counts.GSM9493322_variance.standardized"
## [155] "vf_vst_counts.GSM9493322_variable"
## [156] "vf_vst_counts.GSM9493322_rank"
## [157] "vf_vst_counts.GSM9493323_mean"
## [158] "vf_vst_counts.GSM9493323_variance"
## [159] "vf_vst_counts.GSM9493323_variance.expected"
## [160] "vf_vst_counts.GSM9493323_variance.standardized"
## [161] "vf_vst_counts.GSM9493323_variable"
## [162] "vf_vst_counts.GSM9493323_rank"
## [163] "vf_vst_counts.GSM9493324_mean"
## [164] "vf_vst_counts.GSM9493324_variance"
## [165] "vf_vst_counts.GSM9493324_variance.expected"
## [166] "vf_vst_counts.GSM9493324_variance.standardized"
## [167] "vf_vst_counts.GSM9493324_variable"
## [168] "vf_vst_counts.GSM9493324_rank"
## [169] "vf_vst_counts.GSM9493325_mean"
## [170] "vf_vst_counts.GSM9493325_variance"
## [171] "vf_vst_counts.GSM9493325_variance.expected"
## [172] "vf_vst_counts.GSM9493325_variance.standardized"
## [173] "vf_vst_counts.GSM9493325_variable"
## [174] "vf_vst_counts.GSM9493325_rank"
## [175] "vf_vst_counts.GSM9493326_mean"
## [176] "vf_vst_counts.GSM9493326_variance"
## [177] "vf_vst_counts.GSM9493326_variance.expected"
## [178] "vf_vst_counts.GSM9493326_variance.standardized"
## [179] "vf_vst_counts.GSM9493326_variable"
## [180] "vf_vst_counts.GSM9493326_rank"
## [181] "vf_vst_counts.GSM9493327_mean"
## [182] "vf_vst_counts.GSM9493327_variance"
## [183] "vf_vst_counts.GSM9493327_variance.expected"
## [184] "vf_vst_counts.GSM9493327_variance.standardized"
## [185] "vf_vst_counts.GSM9493327_variable"
## [186] "vf_vst_counts.GSM9493327_rank"
## [187] "vf_vst_counts.GSM9493328_mean"
## [188] "vf_vst_counts.GSM9493328_variance"
## [189] "vf_vst_counts.GSM9493328_variance.expected"
## [190] "vf_vst_counts.GSM9493328_variance.standardized"
## [191] "vf_vst_counts.GSM9493328_variable"
## [192] "vf_vst_counts.GSM9493328_rank"
## [193] "var.features"
## [194] "var.features.rank"
## [195] "genes"
We want to keep the last 3 columns and the columns with the mean of the GSM sample. The rank is for the gene if it was top ranked it is 1 through 10 and goes up to 2,000 for only the top 2,000 genes with the most variability from healthy state to pathological state were identified. We have 30,960 total genes that made the cut of being in at least 10 genes for the barcode fragments and in at least 3 cells of the massive array of cells in the hundreds of thousands.
meanID <- grep('mean',colnames(data))
meanID
## [1] 1 7 13 19 25 31 37 43 49 55 61 67 73 79 85 91 97 103 109
## [20] 115 121 127 133 139 145 151 157 163 169 175 181 187
There are 32 samples so there must be all samples accounted for, we will attach the IDs to the GSM ID of healthy or NKTC lymphoma as well.
DataMeans <- data[,c(meanID,193:195)]
colnames(DataMeans)
## [1] "vf_vst_counts.GSM9493334_mean" "vf_vst_counts.GSM9493335_mean"
## [3] "vf_vst_counts.GSM9493336_mean" "vf_vst_counts.GSM9493337_mean"
## [5] "vf_vst_counts.GSM9493338_mean" "vf_vst_counts.GSM9493339_mean"
## [7] "vf_vst_counts.GSM9493340_mean" "vf_vst_counts.GSM9493341_mean"
## [9] "vf_vst_counts.GSM9493342_mean" "vf_vst_counts.GSM9493343_mean"
## [11] "vf_vst_counts.GSM9493344_mean" "vf_vst_counts.GSM9493345_mean"
## [13] "vf_vst_counts.GSM9493346_mean" "vf_vst_counts.GSM9493347_mean"
## [15] "vf_vst_counts.GSM9493348_mean" "vf_vst_counts.GSM9493349_mean"
## [17] "vf_vst_counts.GSM9493350_mean" "vf_vst_counts.GSM9493351_mean"
## [19] "vf_vst_counts.GSM9493332_mean" "vf_vst_counts.GSM9493333_mean"
## [21] "vf_vst_counts.GSM9493320_mean" "vf_vst_counts.GSM9493329_mean"
## [23] "vf_vst_counts.GSM9493330_mean" "vf_vst_counts.GSM9493331_mean"
## [25] "vf_vst_counts.GSM9493321_mean" "vf_vst_counts.GSM9493322_mean"
## [27] "vf_vst_counts.GSM9493323_mean" "vf_vst_counts.GSM9493324_mean"
## [29] "vf_vst_counts.GSM9493325_mean" "vf_vst_counts.GSM9493326_mean"
## [31] "vf_vst_counts.GSM9493327_mean" "vf_vst_counts.GSM9493328_mean"
## [33] "var.features" "var.features.rank"
## [35] "genes"
top2000Variability <- subset(DataMeans, DataMeans$var.features.rank != "NA")
top2000Var <- top2000Variability[order(top2000Variability$var.features.rank,decreasing=F),]
top2000Var[c(1:10),c(32:35)]
## vf_vst_counts.GSM9493328_mean var.features var.features.rank genes
## 5603 0.8955755 PPBP 1 PPBP
## 22311 1.5609617 IGLC2 2 IGLC2
## 2972 3.3883910 IGKC 3 IGKC
## 1414 21.2710201 S100A9 4 S100A9
## 5610 1.2892303 EREG 5 EREG
## 22312 0.5342154 IGLC3 6 IGLC3
## 13803 0.1337317 SOX5 7 SOX5
## 5602 0.7050790 PF4 8 PF4
## 14280 15.8304168 LYZ 9 LYZ
## 10577 0.9443733 LINGO2 10 LINGO2
We have the top 10 ranked genes with most gene variability but not if they were most underexpressed or overexpressed.
Lets get the fold change values of the genes for the NKTC Lymphoma means over the healthy means.
seriesInfoDesign <-read.csv("GSE318371-GPL34284_series_matrix.txt", sep='\t', nrows=50,stringsAsFactors = T,strip.white=T,na.strings=" ", ncol(32), skip=25, header=F)
dim(seriesInfoDesign)
## [1] 42 30
The 2nd and 17th rows have the sample type per GSM ID.
seriesInfoDesign[c(2,17),]
## V1 V2 V3
## 2 !Sample_geo_accession "GSM9493320" "GSM9493321"
## 17 !Sample_description "Library name: HD1" "Library name: HD2"
## V4 V5 V6
## 2 "GSM9493322" "GSM9493323" "GSM9493324"
## 17 "Library name: HD3" "Library name: HD4" "Library name: HD5"
## V7 V8 V9
## 2 "GSM9493325" "GSM9493326" "GSM9493327"
## 17 "Library name: HD6" "Library name: HD7" "Library name: HD8"
## V10 V11 V12
## 2 "GSM9493328" "GSM9493329" "GSM9493330"
## 17 "Library name: HD9" "Library name: HD10" "Library name: HD11"
## V13 V14 V15
## 2 "GSM9493331" "GSM9493335" "GSM9493336"
## 17 "Library name: HD12" "Library name: patient16" "Library name: patient18"
## V16 V17
## 2 "GSM9493337" "GSM9493338"
## 17 "Library name: patient19" "Library name: patient20"
## V18 V19
## 2 "GSM9493339" "GSM9493340"
## 17 "Library name: patient22" "Library name: patient25"
## V20 V21
## 2 "GSM9493341" "GSM9493342"
## 17 "Library name: patient27" "Library name: patient30"
## V22 V23
## 2 "GSM9493343" "GSM9493344"
## 17 "Library name: patient31" "Library name: patient36"
## V24 V25
## 2 "GSM9493345" "GSM9493346"
## 17 "Library name: patient37" "Library name: patient38"
## V26 V27
## 2 "GSM9493347" "GSM9493348"
## 17 "Library name: patient39" "Library name: patient41"
## V28 V29
## 2 "GSM9493349" "GSM9493350"
## 17 "Library name: patient44" "Library name: patient51"
## V30
## 2 "GSM9493351"
## 17 "Library name: patient52"
The library name that starts with “HD” is for the healthy samples, while other samples are the patient samples. We only care about if healthy or a diseased sample so that we can identify the set class to get the mean for that class. GSM29493320-GSM9493331 are healthy and GSM9493332-GSM9493351 are diseased.
colnames(DataMeans)
## [1] "vf_vst_counts.GSM9493334_mean" "vf_vst_counts.GSM9493335_mean"
## [3] "vf_vst_counts.GSM9493336_mean" "vf_vst_counts.GSM9493337_mean"
## [5] "vf_vst_counts.GSM9493338_mean" "vf_vst_counts.GSM9493339_mean"
## [7] "vf_vst_counts.GSM9493340_mean" "vf_vst_counts.GSM9493341_mean"
## [9] "vf_vst_counts.GSM9493342_mean" "vf_vst_counts.GSM9493343_mean"
## [11] "vf_vst_counts.GSM9493344_mean" "vf_vst_counts.GSM9493345_mean"
## [13] "vf_vst_counts.GSM9493346_mean" "vf_vst_counts.GSM9493347_mean"
## [15] "vf_vst_counts.GSM9493348_mean" "vf_vst_counts.GSM9493349_mean"
## [17] "vf_vst_counts.GSM9493350_mean" "vf_vst_counts.GSM9493351_mean"
## [19] "vf_vst_counts.GSM9493332_mean" "vf_vst_counts.GSM9493333_mean"
## [21] "vf_vst_counts.GSM9493320_mean" "vf_vst_counts.GSM9493329_mean"
## [23] "vf_vst_counts.GSM9493330_mean" "vf_vst_counts.GSM9493331_mean"
## [25] "vf_vst_counts.GSM9493321_mean" "vf_vst_counts.GSM9493322_mean"
## [27] "vf_vst_counts.GSM9493323_mean" "vf_vst_counts.GSM9493324_mean"
## [29] "vf_vst_counts.GSM9493325_mean" "vf_vst_counts.GSM9493326_mean"
## [31] "vf_vst_counts.GSM9493327_mean" "vf_vst_counts.GSM9493328_mean"
## [33] "var.features" "var.features.rank"
## [35] "genes"
They are out of order but the first 20 samples are diseased and the next 12 are the healthy samples, then the last 3 are the features, rank, and gene name. Lets rename them.
colnames(DataMeans)[1:20] <- gsub('vf_vst_counts','NKTCL', colnames(DataMeans)[1:20])
colnames(DataMeans)
## [1] "NKTCL.GSM9493334_mean" "NKTCL.GSM9493335_mean"
## [3] "NKTCL.GSM9493336_mean" "NKTCL.GSM9493337_mean"
## [5] "NKTCL.GSM9493338_mean" "NKTCL.GSM9493339_mean"
## [7] "NKTCL.GSM9493340_mean" "NKTCL.GSM9493341_mean"
## [9] "NKTCL.GSM9493342_mean" "NKTCL.GSM9493343_mean"
## [11] "NKTCL.GSM9493344_mean" "NKTCL.GSM9493345_mean"
## [13] "NKTCL.GSM9493346_mean" "NKTCL.GSM9493347_mean"
## [15] "NKTCL.GSM9493348_mean" "NKTCL.GSM9493349_mean"
## [17] "NKTCL.GSM9493350_mean" "NKTCL.GSM9493351_mean"
## [19] "NKTCL.GSM9493332_mean" "NKTCL.GSM9493333_mean"
## [21] "vf_vst_counts.GSM9493320_mean" "vf_vst_counts.GSM9493329_mean"
## [23] "vf_vst_counts.GSM9493330_mean" "vf_vst_counts.GSM9493331_mean"
## [25] "vf_vst_counts.GSM9493321_mean" "vf_vst_counts.GSM9493322_mean"
## [27] "vf_vst_counts.GSM9493323_mean" "vf_vst_counts.GSM9493324_mean"
## [29] "vf_vst_counts.GSM9493325_mean" "vf_vst_counts.GSM9493326_mean"
## [31] "vf_vst_counts.GSM9493327_mean" "vf_vst_counts.GSM9493328_mean"
## [33] "var.features" "var.features.rank"
## [35] "genes"
colnames(DataMeans)[21:32] <- gsub('vf_vst_counts','healthy',colnames(DataMeans[21:32]))
colnames(DataMeans)
## [1] "NKTCL.GSM9493334_mean" "NKTCL.GSM9493335_mean"
## [3] "NKTCL.GSM9493336_mean" "NKTCL.GSM9493337_mean"
## [5] "NKTCL.GSM9493338_mean" "NKTCL.GSM9493339_mean"
## [7] "NKTCL.GSM9493340_mean" "NKTCL.GSM9493341_mean"
## [9] "NKTCL.GSM9493342_mean" "NKTCL.GSM9493343_mean"
## [11] "NKTCL.GSM9493344_mean" "NKTCL.GSM9493345_mean"
## [13] "NKTCL.GSM9493346_mean" "NKTCL.GSM9493347_mean"
## [15] "NKTCL.GSM9493348_mean" "NKTCL.GSM9493349_mean"
## [17] "NKTCL.GSM9493350_mean" "NKTCL.GSM9493351_mean"
## [19] "NKTCL.GSM9493332_mean" "NKTCL.GSM9493333_mean"
## [21] "healthy.GSM9493320_mean" "healthy.GSM9493329_mean"
## [23] "healthy.GSM9493330_mean" "healthy.GSM9493331_mean"
## [25] "healthy.GSM9493321_mean" "healthy.GSM9493322_mean"
## [27] "healthy.GSM9493323_mean" "healthy.GSM9493324_mean"
## [29] "healthy.GSM9493325_mean" "healthy.GSM9493326_mean"
## [31] "healthy.GSM9493327_mean" "healthy.GSM9493328_mean"
## [33] "var.features" "var.features.rank"
## [35] "genes"
Get the mean of the set rows by gene to make fold change values of the genes.
DataMeans$pathologyMean <- rowMeans(DataMeans[1:20], na.rm=T,dims=1)
DataMeans$healthyMean <- rowMeans(DataMeans[21:32], na.rm=T, dims=1)
DataMeans$foldchangePathologyVsHealthy <- DataMeans$pathologyMean/DataMeans$healthyMean
DataMeans2 <- DataMeans[order(DataMeans$foldchangePathologyVsHealthy,decreasing=T),]
head(DataMeans2)
## NKTCL.GSM9493334_mean NKTCL.GSM9493335_mean NKTCL.GSM9493336_mean
## 24724 NA 0.0003390980 NA
## 29976 NA NA NA
## 21921 0.0009161704 NA 0.001888293
## 12963 0.0042372881 0.0007912287 0.001391373
## 21920 0.0052679798 0.0031649147 0.002583979
## 29338 NA NA NA
## NKTCL.GSM9493337_mean NKTCL.GSM9493338_mean NKTCL.GSM9493339_mean
## 24724 NA NA NA
## 29976 NA NA NA
## 21921 0.0007212405 0.0007837868 0.000337952
## 12963 0.0016227912 0.0015675736 0.001858736
## 21920 0.0016227912 0.0005598477 0.000844880
## 29338 NA NA NA
## NKTCL.GSM9493340_mean NKTCL.GSM9493341_mean NKTCL.GSM9493342_mean
## 24724 NA NA NA
## 29976 NA NA NA
## 21921 0.002033390 0.0005197168 0.0002857143
## 12963 0.004066781 0.0011693627 0.0006428571
## 21920 0.003638699 0.0044175924 0.0010000000
## 29338 NA 0.0002598584 NA
## NKTCL.GSM9493343_mean NKTCL.GSM9493344_mean NKTCL.GSM9493345_mean
## 24724 NA NA NA
## 29976 NA NA NA
## 21921 0.0004090258 NA 0.0003960082
## 12963 0.0013634195 0.0005694298 0.0011880247
## 21920 0.0025223260 0.0021150248 0.0027720577
## 29338 NA NA NA
## NKTCL.GSM9493346_mean NKTCL.GSM9493347_mean NKTCL.GSM9493348_mean
## 24724 NA NA NA
## 29976 NA NA 0.0003144407
## 21921 NA 0.0002256487 NA
## 12963 0.001716949 0.0004512975 0.0031444069
## 21920 0.001716949 0.0003760812 0.0011005424
## 29338 NA NA NA
## NKTCL.GSM9493349_mean NKTCL.GSM9493350_mean NKTCL.GSM9493351_mean
## 24724 NA NA NA
## 29976 NA NA NA
## 21921 0.0003385113 0.0008430411 0.0005055612
## 12963 0.0004513484 0.0036787247 0.0020222447
## 21920 0.0027080904 0.0128755365 0.0104001156
## 29338 0.0001128371 NA NA
## NKTCL.GSM9493332_mean NKTCL.GSM9493333_mean healthy.GSM9493320_mean
## 24724 NA NA NA
## 29976 5.328360e-04 NA NA
## 21921 1.713352e+02 128.0193663 0.0008069164
## 12963 7.279339e+00 3.7898896 0.0005763689
## 21920 4.821100e+00 6.0941260 NA
## 29338 6.007726e-02 0.1343539 NA
## healthy.GSM9493329_mean healthy.GSM9493330_mean healthy.GSM9493331_mean
## 24724 NA NA NA
## 29976 NA NA NA
## 21921 0.0016596592 0.0013849137 0.0019877788
## 12963 0.0039831821 0.0071092235 0.0043436649
## 21920 0.0130559858 0.0054473271 0.0031657219
## 29338 0.0003319318 0.0004616379 0.0002208643
## healthy.GSM9493321_mean healthy.GSM9493322_mean healthy.GSM9493323_mean
## 24724 NA NA NA
## 29976 NA NA NA
## 21921 0.0007748334 0.0006547359 0.0007430034
## 12963 0.0009298001 0.0009821039 0.0014034508
## 21920 0.0001549667 0.0005456133 0.0011557830
## 29338 NA NA NA
## healthy.GSM9493324_mean healthy.GSM9493325_mean healthy.GSM9493326_mean
## 24724 NA NA 0.0000000000
## 29976 NA NA NA
## 21921 0.0006256256 0.0004761905 0.0008632783
## 12963 0.0005005005 0.0010317460 0.0022661055
## 21920 NA 0.0016666667 0.0026977447
## 29338 NA NA 0.0003237294
## healthy.GSM9493327_mean healthy.GSM9493328_mean var.features
## 24724 NA NA <NA>
## 29976 NA 0.0000000000 <NA>
## 21921 0.0006865774 0.0007113387 <NA>
## 12963 0.0036045314 0.0056907099 <NA>
## 21920 0.0084105733 0.0049793712 <NA>
## 29338 0.0006865774 0.0002845355 <NA>
## var.features.rank genes pathologyMean healthyMean
## 24724 NA ENSG00000254339 3.390980e-04 0.0000000000
## 29976 NA IL37 4.236384e-04 0.0000000000
## 21921 NA ENSG00000280441 1.871030e+01 0.0009479043
## 12963 NA PTPRCAP 5.550581e-01 0.0027017823
## 21920 NA ENSG00000278996 5.487457e-01 0.0041279754
## 29338 NA ENSG00000281383 4.870095e-02 0.0003848794
## foldchangePathologyVsHealthy
## 24724 Inf
## 29976 Inf
## 21921 19738.5913
## 12963 205.4415
## 21920 132.9334
## 29338 126.5356
tail(DataMeans2)
## NKTCL.GSM9493334_mean NKTCL.GSM9493335_mean NKTCL.GSM9493336_mean
## 30955 NA NA NA
## 30956 NA NA NA
## 30957 NA NA NA
## 30958 NA NA NA
## 30959 NA NA NA
## 30960 NA NA NA
## NKTCL.GSM9493337_mean NKTCL.GSM9493338_mean NKTCL.GSM9493339_mean
## 30955 NA NA NA
## 30956 NA NA NA
## 30957 NA NA NA
## 30958 NA NA NA
## 30959 NA NA NA
## 30960 NA NA NA
## NKTCL.GSM9493340_mean NKTCL.GSM9493341_mean NKTCL.GSM9493342_mean
## 30955 NA NA NA
## 30956 NA NA NA
## 30957 NA NA NA
## 30958 NA NA NA
## 30959 NA NA NA
## 30960 NA NA NA
## NKTCL.GSM9493343_mean NKTCL.GSM9493344_mean NKTCL.GSM9493345_mean
## 30955 NA NA NA
## 30956 NA NA NA
## 30957 NA NA NA
## 30958 NA NA NA
## 30959 NA NA NA
## 30960 NA NA NA
## NKTCL.GSM9493346_mean NKTCL.GSM9493347_mean NKTCL.GSM9493348_mean
## 30955 NA NA NA
## 30956 NA NA NA
## 30957 NA NA NA
## 30958 NA NA NA
## 30959 NA NA NA
## 30960 NA NA NA
## NKTCL.GSM9493349_mean NKTCL.GSM9493350_mean NKTCL.GSM9493351_mean
## 30955 NA NA NA
## 30956 NA NA NA
## 30957 NA NA NA
## 30958 NA NA NA
## 30959 NA NA NA
## 30960 NA NA NA
## NKTCL.GSM9493332_mean NKTCL.GSM9493333_mean healthy.GSM9493320_mean
## 30955 NA NA NA
## 30956 NA NA NA
## 30957 NA NA NA
## 30958 NA NA NA
## 30959 NA NA NA
## 30960 NA NA NA
## healthy.GSM9493329_mean healthy.GSM9493330_mean healthy.GSM9493331_mean
## 30955 NA NA NA
## 30956 NA NA NA
## 30957 NA NA NA
## 30958 NA NA NA
## 30959 NA NA NA
## 30960 NA NA NA
## healthy.GSM9493321_mean healthy.GSM9493322_mean healthy.GSM9493323_mean
## 30955 NA NA NA
## 30956 NA NA NA
## 30957 NA NA NA
## 30958 NA NA NA
## 30959 NA NA NA
## 30960 NA NA NA
## healthy.GSM9493324_mean healthy.GSM9493325_mean healthy.GSM9493326_mean
## 30955 NA NA NA
## 30956 NA NA NA
## 30957 NA NA NA
## 30958 NA NA NA
## 30959 NA NA NA
## 30960 NA NA NA
## healthy.GSM9493327_mean healthy.GSM9493328_mean var.features
## 30955 NA 0.0004268032 <NA>
## 30956 NA 0.0004268032 <NA>
## 30957 NA 0.0002845355 <NA>
## 30958 NA 0.0002845355 <NA>
## 30959 NA 0.0004268032 <NA>
## 30960 NA 0.0005690710 <NA>
## var.features.rank genes pathologyMean healthyMean
## 30955 NA ENSG00000286957 NaN 0.0004268032
## 30956 NA ENSG00000235902 NaN 0.0004268032
## 30957 NA ENSG00000234826 NaN 0.0002845355
## 30958 NA RNASE13 NaN 0.0002845355
## 30959 NA CACNA1G NaN 0.0004268032
## 30960 NA ENSG00000276633 NaN 0.0005690710
## foldchangePathologyVsHealthy
## 30955 NaN
## 30956 NaN
## 30957 NaN
## 30958 NaN
## 30959 NaN
## 30960 NaN
Lets not include the top 2 in order of highest to lowest as many if not all except 2 healthy and 3 diseased state had a value other than NA, it skewed the value. When you divide by a very small number that number can go infinitesimally into that small number and not useful in this case. Lets also remove the ‘NaN’ values in fold change values before taking the 10 most under expressed and 10 most over expressed genes.
DataMeans3 <- subset(DataMeans2, DataMeans2$foldchangePathologyVsHealthy != 'NaN')
DataMeans4 <- DataMeans3[-c(1:2),]
There are actually a lot of the genes missing information for these top genes, so lets look at those that are the top genes in the set with missing values of NA and then the complete cases of top genes.
top10bottom10_NAs <- DataMeans3[c(1:10,28808:28817),]#there are 28817 genes in this dataset
top10bottom10_NAs$genes
## [1] "ENSG00000254339" "IL37" "ENSG00000280441" "PTPRCAP"
## [5] "ENSG00000278996" "ENSG00000281383" "ACOT4" "LINC01670"
## [9] "ENSG00000287698" "IGLV3-27" "CD207" "POTEC"
## [13] "ENSG00000260564" "ENSG00000285354" "CATSPER4" "SEZ6L-AS1"
## [17] "IGKV1-13" "TPSAB1" "ENSG00000290420" "ENSG00000254350"
DataMeansComplete <- DataMeans3[complete.cases(DataMeans3[,c(1:32)]),]
top10bottom10_noNAs <- DataMeansComplete[c(1:10,19120:19129),]
top10bottom10_noNAs$genes
## [1] "PTPRCAP" "ANKRD22" "APOL4" "CXCL10"
## [5] "IFI27" "GBP1P1" "IL1R2" "CCL23"
## [9] "ETV7" "BATF2" "NCKAP5" "NOG"
## [13] "FCER1A" "H4C5" "ENSG00000289117" "ENSG00000286797"
## [17] "SULT1A2" "ENSG00000240086" "NRCAM" "CCDC144NL"
Lets look at these genes and see if any are in the list of the top 2,000 highest variability genes from Seurat.
commonNAs <- top10bottom10_NAs[which(top10bottom10_NAs$genes %in% top2000Var$genes),]
commonNAs
## [1] NKTCL.GSM9493334_mean NKTCL.GSM9493335_mean
## [3] NKTCL.GSM9493336_mean NKTCL.GSM9493337_mean
## [5] NKTCL.GSM9493338_mean NKTCL.GSM9493339_mean
## [7] NKTCL.GSM9493340_mean NKTCL.GSM9493341_mean
## [9] NKTCL.GSM9493342_mean NKTCL.GSM9493343_mean
## [11] NKTCL.GSM9493344_mean NKTCL.GSM9493345_mean
## [13] NKTCL.GSM9493346_mean NKTCL.GSM9493347_mean
## [15] NKTCL.GSM9493348_mean NKTCL.GSM9493349_mean
## [17] NKTCL.GSM9493350_mean NKTCL.GSM9493351_mean
## [19] NKTCL.GSM9493332_mean NKTCL.GSM9493333_mean
## [21] healthy.GSM9493320_mean healthy.GSM9493329_mean
## [23] healthy.GSM9493330_mean healthy.GSM9493331_mean
## [25] healthy.GSM9493321_mean healthy.GSM9493322_mean
## [27] healthy.GSM9493323_mean healthy.GSM9493324_mean
## [29] healthy.GSM9493325_mean healthy.GSM9493326_mean
## [31] healthy.GSM9493327_mean healthy.GSM9493328_mean
## [33] var.features var.features.rank
## [35] genes pathologyMean
## [37] healthyMean foldchangePathologyVsHealthy
## <0 rows> (or 0-length row.names)
There are no common genes between the top 2,000 genes with highest variability selected in Seurat and the genes having NAs in their samples for each gene.
common <- top10bottom10_noNAs[which(top10bottom10_noNAs$genes %in% top2000Var$genes),]
common
## NKTCL.GSM9493334_mean NKTCL.GSM9493335_mean NKTCL.GSM9493336_mean
## 11871 0.4882043060 0.104894314 0.239316239
## 5631 0.1628492900 0.069289025 0.105943152
## 15920 0.1984654146 0.025319317 0.026237329
## 3072 0.6386852955 0.004069176 0.017193401
## 3239 0.0019468621 0.004747372 0.004472272
## 1570 0.0033211177 0.018198259 0.015702644
## 3485 0.0026339899 0.005764666 0.003080898
## 4840 0.0006871278 0.002825817 0.001987676
## 9217 0.0053825011 0.023058664 0.042933810
## NKTCL.GSM9493337_mean NKTCL.GSM9493338_mean NKTCL.GSM9493339_mean
## 11871 0.033717995 1.659388646 0.019009801
## 5631 0.030472413 1.882431979 0.008448800
## 15920 0.012441399 0.360317994 0.032781345
## 3072 0.121168410 0.003023178 3.565224738
## 3239 0.006491165 0.003471056 0.002281176
## 1570 0.016588532 0.009069533 0.002619128
## 3485 0.018932564 0.016907401 0.006336600
## 4840 0.007933646 0.004142873 0.001774248
## 9217 0.094121890 0.031127533 0.009547144
## NKTCL.GSM9493340_mean NKTCL.GSM9493341_mean NKTCL.GSM9493342_mean
## 11871 0.983197774 0.466055999 0.308642857
## 5631 0.237906678 0.488273891 0.287000000
## 15920 0.698630137 0.090430715 0.077285714
## 3072 0.051583904 0.001819009 0.002857143
## 3239 0.005458048 0.005911778 0.005142857
## 1570 0.006849315 0.021568245 0.016785714
## 3485 0.005779110 0.007081141 0.025000000
## 4840 0.003745719 0.003767946 0.002285714
## 9217 0.046875000 0.010654193 0.045285714
## NKTCL.GSM9493343_mean NKTCL.GSM9493344_mean NKTCL.GSM9493345_mean
## 11871 0.006203559 0.672415196 0.426184065
## 5631 0.009066739 0.231595217 0.249722794
## 15920 0.005999046 0.315545432 0.062648503
## 3072 0.009339423 0.022695843 0.010217013
## 3239 0.004362942 0.002359066 0.004435292
## 1570 0.077646738 0.016594810 0.017503564
## 3485 0.007226123 0.004962174 0.002217646
## 4840 0.001227078 0.002928496 0.001267226
## 9217 0.083713955 0.009598959 0.109219072
## NKTCL.GSM9493346_mean NKTCL.GSM9493347_mean NKTCL.GSM9493348_mean
## 11871 1.8103180443 1.362316660 0.6013678170
## 5631 0.4401111929 1.082437006 0.7794198569
## 15920 0.3514021748 0.552012035 0.7934124676
## 3072 0.1241108658 0.006167732 0.0487383067
## 3239 0.0031886191 0.004362542 0.0014149831
## 1570 0.0020439866 0.020308387 0.0072321358
## 3485 0.0008993541 0.013238059 0.0079396274
## 4840 0.0008175946 0.001053027 0.0025941357
## 9217 0.0030251002 0.040240692 0.0005502712
## NKTCL.GSM9493349_mean NKTCL.GSM9493350_mean NKTCL.GSM9493351_mean
## 11871 1.5785158160 0.015787860 0.114040156
## 5631 2.4244555610 0.027590435 0.142351582
## 15920 0.1884755708 0.194512569 0.155929510
## 3072 0.0103434009 0.004445126 0.014083490
## 3239 0.0016925565 0.002759044 0.004838943
## 1570 0.0080114342 0.024908032 0.070273003
## 3485 0.0003008989 0.005977928 0.001516684
## 4840 0.0011659834 0.001456162 0.001227791
## 9217 0.0066573889 0.005977928 0.026000289
## NKTCL.GSM9493332_mean NKTCL.GSM9493333_mean healthy.GSM9493320_mean
## 11871 0.070467564 0.1994304023 0.014985591
## 5631 0.036366058 0.3639017444 0.016253602
## 15920 0.228853070 0.1167675329 0.018789625
## 3072 0.001465299 0.0074759701 0.007377522
## 3239 0.000799254 0.0008543966 0.013717579
## 1570 0.008125749 0.0127447490 0.028933718
## 3485 0.000932463 0.0129583482 0.063861671
## 4840 0.000799254 0.0014951940 0.004841499
## 9217 0.001198881 0.0202919188 0.190317003
## healthy.GSM9493329_mean healthy.GSM9493330_mean healthy.GSM9493331_mean
## 11871 0.016264660 0.005078017 0.0082456011
## 5631 0.027550343 0.027144308 0.0233379960
## 15920 0.012502766 0.017911550 0.0039019362
## 3072 0.016485948 0.008863448 0.0131782375
## 3239 0.006638637 0.007016896 0.0106014872
## 1570 0.097698606 0.098605854 0.0990944563
## 3485 0.005532197 0.001015603 0.0119266730
## 4840 0.001880947 0.001477241 0.0009570787
## 9217 0.064837353 0.073862063 0.0529338143
## healthy.GSM9493321_mean healthy.GSM9493322_mean healthy.GSM9493323_mean
## 11871 0.006818534 0.009493671 0.007760258
## 5631 0.011777468 0.004801397 0.008420705
## 15920 0.004339067 0.002400698 0.002476678
## 3072 0.010460251 0.019860323 0.014447288
## 3239 0.011235084 0.038411174 0.003715017
## 1570 0.069889974 0.038192929 0.011062495
## 3485 0.027274136 0.045285901 0.026995790
## 4840 0.003564234 0.002073330 0.001651119
## 9217 0.114830311 0.247162811 0.061999505
## healthy.GSM9493324_mean healthy.GSM9493325_mean healthy.GSM9493326_mean
## 11871 0.013263263 0.006746032 0.019963311
## 5631 0.003503504 0.005634921 0.018560483
## 15920 0.005880881 0.003253968 0.010467249
## 3072 0.024274274 0.005158730 0.016186468
## 3239 0.007007007 0.010555556 0.006474587
## 1570 0.030780781 0.052380952 0.052767886
## 3485 0.016391391 0.065000000 0.003237294
## 4840 0.004379379 0.003095238 0.059566203
## 9217 0.200700701 0.165238095 0.053415345
## healthy.GSM9493327_mean healthy.GSM9493328_mean var.features
## 11871 0.013388260 0.010101010 ANKRD22
## 5631 0.025060076 0.018068004 CXCL10
## 15920 0.011757638 0.002703087 IFI27
## 3072 0.012787504 0.020344288 IL1R2
## 3239 0.005921730 0.003983497 NCKAP5
## 1570 0.036903536 0.062882345 FCER1A
## 3485 0.002574665 0.008393797 ENSG00000286797
## 4840 0.001716444 0.002134016 ENSG00000240086
## 9217 0.023086165 0.076682316 NRCAM
## var.features.rank genes pathologyMean healthyMean
## 11871 693 ANKRD22 0.557973754 0.011009017
## 5631 30 CXCL10 0.452981671 0.015842734
## 15920 775 IFI27 0.224373364 0.008032095
## 3072 617 IL1R2 0.233235336 0.014118690
## 3239 1013 NCKAP5 0.003549511 0.010439854
## 1570 57 FCER1A 0.018804754 0.056599461
## 3485 1259 ENSG00000286797 0.007484284 0.023124093
## 4840 1978 ENSG00000240086 0.002259136 0.007278061
## 9217 613 NRCAM 0.030773045 0.110422124
## foldchangePathologyVsHealthy
## 11871 50.6833391
## 5631 28.5923929
## 15920 27.9345988
## 3072 16.5196158
## 3239 0.3399962
## 1570 0.3322426
## 3485 0.3236574
## 4840 0.3104035
## 9217 0.2786855
There are 9 genes in common with the top 2,000 highest variable genes from Seurat project in the dataset of samples with only genes that had all samples represented or having complete cases.
common$genes
## [1] "ANKRD22" "CXCL10" "IFI27" "IL1R2"
## [5] "NCKAP5" "FCER1A" "ENSG00000286797" "ENSG00000240086"
## [9] "NRCAM"
The top 4 genes are enhanced or upregulated or stimulated, while those bottom 4 genes are underexpressed or inhibited from healthy state to NKTCL pathological state. We can keep these genes as our top genes and use randomForest to test out these genes in predicting a 2 class model of healthy or NKTCL sample.
Lets see the rank of these genes in Seurat’s top 2,000 genes.
NKTCL <- common$genes
HighVarRank <- top2000Var[which(top2000Var$genes %in% NKTCL),c(34:35)]
HighVarRank
## var.features.rank genes
## 5631 30 CXCL10
## 1570 57 FCER1A
## 9217 613 NRCAM
## 3072 617 IL1R2
## 11871 693 ANKRD22
## 15920 775 IFI27
## 3239 1013 NCKAP5
## 3485 1259 ENSG00000286797
## 4840 1978 ENSG00000240086
Interesting that the top 2 enhanced genes are in the top 100 while the bottom 3 silenced genes are in the last 1,000 of the top 2,000 genes, while the middle genes are in the top 1/3 of the top 2,000 genes.
We will be using the NKTCL genes we just found sharing common ground with the top 2,000 genes found in Seurat and with complete cases of numeric sequencing readings for the gene in every sample.
Lets make the matrix to test the 2 class model from our dataframe called common.
colnames(common)
## [1] "NKTCL.GSM9493334_mean" "NKTCL.GSM9493335_mean"
## [3] "NKTCL.GSM9493336_mean" "NKTCL.GSM9493337_mean"
## [5] "NKTCL.GSM9493338_mean" "NKTCL.GSM9493339_mean"
## [7] "NKTCL.GSM9493340_mean" "NKTCL.GSM9493341_mean"
## [9] "NKTCL.GSM9493342_mean" "NKTCL.GSM9493343_mean"
## [11] "NKTCL.GSM9493344_mean" "NKTCL.GSM9493345_mean"
## [13] "NKTCL.GSM9493346_mean" "NKTCL.GSM9493347_mean"
## [15] "NKTCL.GSM9493348_mean" "NKTCL.GSM9493349_mean"
## [17] "NKTCL.GSM9493350_mean" "NKTCL.GSM9493351_mean"
## [19] "NKTCL.GSM9493332_mean" "NKTCL.GSM9493333_mean"
## [21] "healthy.GSM9493320_mean" "healthy.GSM9493329_mean"
## [23] "healthy.GSM9493330_mean" "healthy.GSM9493331_mean"
## [25] "healthy.GSM9493321_mean" "healthy.GSM9493322_mean"
## [27] "healthy.GSM9493323_mean" "healthy.GSM9493324_mean"
## [29] "healthy.GSM9493325_mean" "healthy.GSM9493326_mean"
## [31] "healthy.GSM9493327_mean" "healthy.GSM9493328_mean"
## [33] "var.features" "var.features.rank"
## [35] "genes" "pathologyMean"
## [37] "healthyMean" "foldchangePathologyVsHealthy"
class <- c('NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','NKTC Lymphoma','healthy', 'healthy','healthy','healthy','healthy','healthy','healthy','healthy','healthy','healthy','healthy','healthy')
sampleHeader <- common$genes
class
## [1] "NKTC Lymphoma" "NKTC Lymphoma" "NKTC Lymphoma" "NKTC Lymphoma"
## [5] "NKTC Lymphoma" "NKTC Lymphoma" "NKTC Lymphoma" "NKTC Lymphoma"
## [9] "NKTC Lymphoma" "NKTC Lymphoma" "NKTC Lymphoma" "NKTC Lymphoma"
## [13] "NKTC Lymphoma" "NKTC Lymphoma" "NKTC Lymphoma" "NKTC Lymphoma"
## [17] "NKTC Lymphoma" "NKTC Lymphoma" "NKTC Lymphoma" "NKTC Lymphoma"
## [21] "healthy" "healthy" "healthy" "healthy"
## [25] "healthy" "healthy" "healthy" "healthy"
## [29] "healthy" "healthy" "healthy" "healthy"
sampleHeader
## [1] "ANKRD22" "CXCL10" "IFI27" "IL1R2"
## [5] "NCKAP5" "FCER1A" "ENSG00000286797" "ENSG00000240086"
## [9] "NRCAM"
dataMatrix <- data.frame(t(common[1:32]))
dim(dataMatrix)
## [1] 32 9
colnames(dataMatrix) <- sampleHeader
dataMatrix$class <- class
dataMatrix
## ANKRD22 CXCL10 IFI27 IL1R2
## NKTCL.GSM9493334_mean 0.488204306 0.162849290 0.198465415 0.638685295
## NKTCL.GSM9493335_mean 0.104894314 0.069289025 0.025319317 0.004069176
## NKTCL.GSM9493336_mean 0.239316239 0.105943152 0.026237329 0.017193401
## NKTCL.GSM9493337_mean 0.033717995 0.030472413 0.012441399 0.121168410
## NKTCL.GSM9493338_mean 1.659388646 1.882431979 0.360317994 0.003023178
## NKTCL.GSM9493339_mean 0.019009801 0.008448800 0.032781345 3.565224738
## NKTCL.GSM9493340_mean 0.983197774 0.237906678 0.698630137 0.051583904
## NKTCL.GSM9493341_mean 0.466055999 0.488273891 0.090430715 0.001819009
## NKTCL.GSM9493342_mean 0.308642857 0.287000000 0.077285714 0.002857143
## NKTCL.GSM9493343_mean 0.006203559 0.009066739 0.005999046 0.009339423
## NKTCL.GSM9493344_mean 0.672415196 0.231595217 0.315545432 0.022695843
## NKTCL.GSM9493345_mean 0.426184065 0.249722794 0.062648503 0.010217013
## NKTCL.GSM9493346_mean 1.810318044 0.440111193 0.351402175 0.124110866
## NKTCL.GSM9493347_mean 1.362316660 1.082437006 0.552012035 0.006167732
## NKTCL.GSM9493348_mean 0.601367817 0.779419857 0.793412468 0.048738307
## NKTCL.GSM9493349_mean 1.578515816 2.424455561 0.188475571 0.010343401
## NKTCL.GSM9493350_mean 0.015787860 0.027590435 0.194512569 0.004445126
## NKTCL.GSM9493351_mean 0.114040156 0.142351582 0.155929510 0.014083490
## NKTCL.GSM9493332_mean 0.070467564 0.036366058 0.228853070 0.001465299
## NKTCL.GSM9493333_mean 0.199430402 0.363901744 0.116767533 0.007475970
## healthy.GSM9493320_mean 0.014985591 0.016253602 0.018789625 0.007377522
## healthy.GSM9493329_mean 0.016264660 0.027550343 0.012502766 0.016485948
## healthy.GSM9493330_mean 0.005078017 0.027144308 0.017911550 0.008863448
## healthy.GSM9493331_mean 0.008245601 0.023337996 0.003901936 0.013178238
## healthy.GSM9493321_mean 0.006818534 0.011777468 0.004339067 0.010460251
## healthy.GSM9493322_mean 0.009493671 0.004801397 0.002400698 0.019860323
## healthy.GSM9493323_mean 0.007760258 0.008420705 0.002476678 0.014447288
## healthy.GSM9493324_mean 0.013263263 0.003503504 0.005880881 0.024274274
## healthy.GSM9493325_mean 0.006746032 0.005634921 0.003253968 0.005158730
## healthy.GSM9493326_mean 0.019963311 0.018560483 0.010467249 0.016186468
## healthy.GSM9493327_mean 0.013388260 0.025060076 0.011757638 0.012787504
## healthy.GSM9493328_mean 0.010101010 0.018068004 0.002703087 0.020344288
## NCKAP5 FCER1A ENSG00000286797
## NKTCL.GSM9493334_mean 0.0019468621 0.003321118 0.0026339899
## NKTCL.GSM9493335_mean 0.0047473720 0.018198259 0.0057646660
## NKTCL.GSM9493336_mean 0.0044722719 0.015702644 0.0030808984
## NKTCL.GSM9493337_mean 0.0064911648 0.016588532 0.0189325640
## NKTCL.GSM9493338_mean 0.0034710559 0.009069533 0.0169074012
## NKTCL.GSM9493339_mean 0.0022811761 0.002619128 0.0063366002
## NKTCL.GSM9493340_mean 0.0054580479 0.006849315 0.0057791096
## NKTCL.GSM9493341_mean 0.0059117781 0.021568245 0.0070811408
## NKTCL.GSM9493342_mean 0.0051428571 0.016785714 0.0250000000
## NKTCL.GSM9493343_mean 0.0043629423 0.077646738 0.0072261231
## NKTCL.GSM9493344_mean 0.0023590661 0.016594810 0.0049621736
## NKTCL.GSM9493345_mean 0.0044352923 0.017503564 0.0022176461
## NKTCL.GSM9493346_mean 0.0031886191 0.002043987 0.0008993541
## NKTCL.GSM9493347_mean 0.0043625423 0.020308387 0.0132380594
## NKTCL.GSM9493348_mean 0.0014149831 0.007232136 0.0079396274
## NKTCL.GSM9493349_mean 0.0016925565 0.008011434 0.0003008989
## NKTCL.GSM9493350_mean 0.0027590435 0.024908032 0.0059779277
## NKTCL.GSM9493351_mean 0.0048389427 0.070273003 0.0015166835
## NKTCL.GSM9493332_mean 0.0007992540 0.008125749 0.0009324630
## NKTCL.GSM9493333_mean 0.0008543966 0.012744749 0.0129583482
## healthy.GSM9493320_mean 0.0137175793 0.028933718 0.0638616715
## healthy.GSM9493329_mean 0.0066386369 0.097698606 0.0055321974
## healthy.GSM9493330_mean 0.0070168959 0.098605854 0.0010156034
## healthy.GSM9493331_mean 0.0106014872 0.099094456 0.0119266730
## healthy.GSM9493321_mean 0.0112350845 0.069889974 0.0272741361
## healthy.GSM9493322_mean 0.0384111742 0.038192929 0.0452859014
## healthy.GSM9493323_mean 0.0037150169 0.011062495 0.0269957896
## healthy.GSM9493324_mean 0.0070070070 0.030780781 0.0163913914
## healthy.GSM9493325_mean 0.0105555556 0.052380952 0.0650000000
## healthy.GSM9493326_mean 0.0064745872 0.052767886 0.0032372936
## healthy.GSM9493327_mean 0.0059217302 0.036903536 0.0025746653
## healthy.GSM9493328_mean 0.0039834969 0.062882345 0.0083937971
## ENSG00000240086 NRCAM class
## NKTCL.GSM9493334_mean 0.0006871278 0.0053825011 NKTC Lymphoma
## NKTCL.GSM9493335_mean 0.0028258167 0.0230586640 NKTC Lymphoma
## NKTCL.GSM9493336_mean 0.0019876764 0.0429338104 NKTC Lymphoma
## NKTCL.GSM9493337_mean 0.0079336459 0.0941218897 NKTC Lymphoma
## NKTCL.GSM9493338_mean 0.0041428731 0.0311275333 NKTC Lymphoma
## NKTCL.GSM9493339_mean 0.0017742481 0.0095471443 NKTC Lymphoma
## NKTCL.GSM9493340_mean 0.0037457192 0.0468750000 NKTC Lymphoma
## NKTCL.GSM9493341_mean 0.0037679465 0.0106541935 NKTC Lymphoma
## NKTCL.GSM9493342_mean 0.0022857143 0.0452857143 NKTC Lymphoma
## NKTCL.GSM9493343_mean 0.0012270775 0.0837139546 NKTC Lymphoma
## NKTCL.GSM9493344_mean 0.0029284959 0.0095989588 NKTC Lymphoma
## NKTCL.GSM9493345_mean 0.0012672264 0.1092190718 NKTC Lymphoma
## NKTCL.GSM9493346_mean 0.0008175946 0.0030251002 NKTC Lymphoma
## NKTCL.GSM9493347_mean 0.0010530275 0.0402406920 NKTC Lymphoma
## NKTCL.GSM9493348_mean 0.0025941357 0.0005502712 NKTC Lymphoma
## NKTCL.GSM9493349_mean 0.0011659834 0.0066573889 NKTC Lymphoma
## NKTCL.GSM9493350_mean 0.0014561619 0.0059779277 NKTC Lymphoma
## NKTCL.GSM9493351_mean 0.0012277914 0.0260002889 NKTC Lymphoma
## NKTCL.GSM9493332_mean 0.0007992540 0.0011988810 NKTC Lymphoma
## NKTCL.GSM9493333_mean 0.0014951940 0.0202919188 NKTC Lymphoma
## healthy.GSM9493320_mean 0.0048414986 0.1903170029 healthy
## healthy.GSM9493329_mean 0.0018809471 0.0648373534 healthy
## healthy.GSM9493330_mean 0.0014772413 0.0738620626 healthy
## healthy.GSM9493331_mean 0.0009570787 0.0529338143 healthy
## healthy.GSM9493321_mean 0.0035642337 0.1148303115 healthy
## healthy.GSM9493322_mean 0.0020733304 0.2471628110 healthy
## healthy.GSM9493323_mean 0.0016511186 0.0619995047 healthy
## healthy.GSM9493324_mean 0.0043793794 0.2007007007 healthy
## healthy.GSM9493325_mean 0.0030952381 0.1652380952 healthy
## healthy.GSM9493326_mean 0.0595662027 0.0534153448 healthy
## healthy.GSM9493327_mean 0.0017164435 0.0230861655 healthy
## healthy.GSM9493328_mean 0.0021340162 0.0766823161 healthy
We have our matrix, now lets import randomForest and see how well this data set of top genes predicts a two class model of healthy versus Natural Killer T-cell Lymphoma.
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
dataMatrix$class <- as.factor(dataMatrix$class)
str(dataMatrix)
## 'data.frame': 32 obs. of 10 variables:
## $ ANKRD22 : num 0.4882 0.1049 0.2393 0.0337 1.6594 ...
## $ CXCL10 : num 0.1628 0.0693 0.1059 0.0305 1.8824 ...
## $ IFI27 : num 0.1985 0.0253 0.0262 0.0124 0.3603 ...
## $ IL1R2 : num 0.63869 0.00407 0.01719 0.12117 0.00302 ...
## $ NCKAP5 : num 0.00195 0.00475 0.00447 0.00649 0.00347 ...
## $ FCER1A : num 0.00332 0.0182 0.0157 0.01659 0.00907 ...
## $ ENSG00000286797: num 0.00263 0.00576 0.00308 0.01893 0.01691 ...
## $ ENSG00000240086: num 0.000687 0.002826 0.001988 0.007934 0.004143 ...
## $ NRCAM : num 0.00538 0.02306 0.04293 0.09412 0.03113 ...
## $ class : Factor w/ 2 levels "healthy","NKTC Lymphoma": 2 2 2 2 2 2 2 2 2 2 ...
set.seed(123)
inTrain <- sample(1:32,.7*32)
training <- dataMatrix[inTrain,]
testing <- dataMatrix[-inTrain,]
table(training$class)
##
## healthy NKTC Lymphoma
## 9 13
table(testing$class)
##
## healthy NKTC Lymphoma
## 3 7
rf <- randomForest(class ~ ., training, importance=TRUE, na.action=na.omit,mtry=3,ntree=500)
prediction <- predict(rf,testing)
results <- data.frame(predicted=prediction,actual=testing$class)
results
## predicted actual
## NKTCL.GSM9493334_mean NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493335_mean NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493337_mean healthy NKTC Lymphoma
## NKTCL.GSM9493339_mean healthy NKTC Lymphoma
## NKTCL.GSM9493345_mean NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493346_mean NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493349_mean NKTC Lymphoma NKTC Lymphoma
## healthy.GSM9493320_mean healthy healthy
## healthy.GSM9493331_mean healthy healthy
## healthy.GSM9493321_mean healthy healthy
Using 70% of the samples in the data to train, the model scored decent by only missing 2 of the samples incorrectly predicting healthy when the sample was actually the NKTC lymphoma.
sum(results$predicted==results$actual)/length(results$predicted)
## [1] 0.8
By training 70% of the data, it scored 80% accuracy on the two class model.
rf$confusion
## healthy NKTC Lymphoma class.error
## healthy 9 0 0.00000000
## NKTC Lymphoma 1 12 0.07692308
In the training model, all healthy classes were predicted correctly, but one NKTC Lymphoma sample was incorrectly predicted healthy when it was actually NKTC Lymphoma.
Lets try a model that trains on 80% of the data instead.
set.seed(123)
inTrain <- sample(1:32,.8*32)
training <- dataMatrix[inTrain,]
testing <- dataMatrix[-inTrain,]
table(training$class)
##
## healthy NKTC Lymphoma
## 10 15
table(testing$class)
##
## healthy NKTC Lymphoma
## 2 5
rf2 <- randomForest(class ~ ., training, importance=TRUE, na.action=na.omit,mtry=3,ntree=500)
prediction2 <- predict(rf2,testing)
results2 <- data.frame(predicted=prediction2,actual=testing$class)
results2
## predicted actual
## NKTCL.GSM9493335_mean NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493339_mean healthy NKTC Lymphoma
## NKTCL.GSM9493345_mean NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493346_mean NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493349_mean NKTC Lymphoma NKTC Lymphoma
## healthy.GSM9493320_mean healthy healthy
## healthy.GSM9493321_mean healthy healthy
sum(results2$predicted==results2$actual)/length(results2$predicted)
## [1] 0.8571429
The model scored better at 85.7% accuracy when predicting the testing hold out set of 7 samples. Only 1 class was incorrectly predicted and that was an NKTC Lymphoma sample incorrectly predicted as healthy.
Lets see how well the model did on training.
rf2$confusion
## healthy NKTC Lymphoma class.error
## healthy 10 0 0.0
## NKTC Lymphoma 3 12 0.2
All healthy samples were correctly labeled healthy, but there were 3 samples of NKTC Lymphoma that were incorrectly identified as healthy. That could be bad if the person had the pathology and was unaware and then it caused a lot of harm to them and illness that they didn’t know why as they believed the model and that they didn’t have the illness.
Lets try tuning our randomForest model on these same genes, then we will test the genes that Seurat has as the top 10 genes to see how well those genes predict the pathology of NKTC Lymphoma.
We will keep the same training and testing set of 80% since it did better.
rf3 <- randomForest(class ~ ., training, importance=TRUE, na.action=na.omit,mtry=2,ntree=10000)
prediction3 <- predict(rf3,testing)
results3 <- data.frame(predicted=prediction3,actual=testing$class)
results3
## predicted actual
## NKTCL.GSM9493335_mean NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493339_mean NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493345_mean NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493346_mean NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493349_mean NKTC Lymphoma NKTC Lymphoma
## healthy.GSM9493320_mean healthy healthy
## healthy.GSM9493321_mean healthy healthy
When tuning the model using 80% training data, to predict on the testing data, the results were all classes correctly predicted using 10,000 trees instead of 500 and a lower mtry of 2 instead of 3.
Lets see how well the training model did on the training set before using it to predict the two class model of the hold out testing set.
rf3$confusion
## healthy NKTC Lymphoma class.error
## healthy 10 0 0.0000000
## NKTC Lymphoma 2 13 0.1333333
The training model did well, but not perfect. It correctly predicted the healthy samples, but incorrectly predicted two samples of NKTC Lymphoma as healthy.
Ok, so now lets look at the 10 genes that were found in Seurat. And see what their fold change values are before making a matrix to test these genes on a two class model.
top10Seurat <- top2000Var[c(1:10),]
top10Seurat[,c(34,35)]
## var.features.rank genes
## 5603 1 PPBP
## 22311 2 IGLC2
## 2972 3 IGKC
## 1414 4 S100A9
## 5610 5 EREG
## 22312 6 IGLC3
## 13803 7 SOX5
## 5602 8 PF4
## 14280 9 LYZ
## 10577 10 LINGO2
These are the genes not found by fold change but by principal component analysis. They are the most variable changed genes from healthy to pathological state using the filtering settings selected in those earlier uploads of this series GSE318371.
Since we got the genes from complete cases that were in this set, lets use that data set of complete cases of genes in each sample to extract the fold change values for these PCA extracted top genes to see if up or down regulated.
PCA10 <- DataMeansComplete[which(DataMeansComplete$genes %in% top10Seurat$genes),]
PCA10orderedRank <- PCA10[order(PCA10$var.features.rank,decreasing=F),c(34,35,38)]
PCA10orderedRank
## var.features.rank genes foldchangePathologyVsHealthy
## 5603 1 PPBP 1.0299078
## 22311 2 IGLC2 0.9634843
## 2972 3 IGKC 0.9858088
## 1414 4 S100A9 1.5474204
## 5610 5 EREG 0.6300822
## 22312 6 IGLC3 1.1257449
## 13803 7 SOX5 1.0760944
## 5602 8 PF4 0.7387458
## 14280 9 LYZ 1.3220576
## 10577 10 LINGO2 0.6979391
PCA10orderedFC <- PCA10[order(PCA10$foldchangePathologyVsHealthy, decreasing=T),c(34,35,38)]
PCA10orderedFC
## var.features.rank genes foldchangePathologyVsHealthy
## 1414 4 S100A9 1.5474204
## 14280 9 LYZ 1.3220576
## 22312 6 IGLC3 1.1257449
## 13803 7 SOX5 1.0760944
## 5603 1 PPBP 1.0299078
## 2972 3 IGKC 0.9858088
## 22311 2 IGLC2 0.9634843
## 5602 8 PF4 0.7387458
## 10577 10 LINGO2 0.6979391
## 5610 5 EREG 0.6300822
There are equally 5 that are upregulated and 5 that are down regulated from healthy to pathological state.
Make the matrix from PCA10 that has the 12 healthy and 20 NKTC Lymphoma samples with 10 genes of Seurat’s PCA top ranked genes of highest variability.
colnames(PCA10)
## [1] "NKTCL.GSM9493334_mean" "NKTCL.GSM9493335_mean"
## [3] "NKTCL.GSM9493336_mean" "NKTCL.GSM9493337_mean"
## [5] "NKTCL.GSM9493338_mean" "NKTCL.GSM9493339_mean"
## [7] "NKTCL.GSM9493340_mean" "NKTCL.GSM9493341_mean"
## [9] "NKTCL.GSM9493342_mean" "NKTCL.GSM9493343_mean"
## [11] "NKTCL.GSM9493344_mean" "NKTCL.GSM9493345_mean"
## [13] "NKTCL.GSM9493346_mean" "NKTCL.GSM9493347_mean"
## [15] "NKTCL.GSM9493348_mean" "NKTCL.GSM9493349_mean"
## [17] "NKTCL.GSM9493350_mean" "NKTCL.GSM9493351_mean"
## [19] "NKTCL.GSM9493332_mean" "NKTCL.GSM9493333_mean"
## [21] "healthy.GSM9493320_mean" "healthy.GSM9493329_mean"
## [23] "healthy.GSM9493330_mean" "healthy.GSM9493331_mean"
## [25] "healthy.GSM9493321_mean" "healthy.GSM9493322_mean"
## [27] "healthy.GSM9493323_mean" "healthy.GSM9493324_mean"
## [29] "healthy.GSM9493325_mean" "healthy.GSM9493326_mean"
## [31] "healthy.GSM9493327_mean" "healthy.GSM9493328_mean"
## [33] "var.features" "var.features.rank"
## [35] "genes" "pathologyMean"
## [37] "healthyMean" "foldchangePathologyVsHealthy"
PCA10b <- PCA10[,c(1:32)]
genesB <- PCA10$genes
head(PCA10b)
## NKTCL.GSM9493334_mean NKTCL.GSM9493335_mean NKTCL.GSM9493336_mean
## 1414 56.14074668 9.1726009 14.0559531
## 14280 28.45338983 8.8639087 9.6306897
## 22312 0.27828676 0.6656494 0.5280262
## 13803 0.10364178 0.2857466 0.4873783
## 5603 0.09883188 0.4481745 0.8714967
## 2972 1.58863949 5.0399005 5.4777380
## NKTCL.GSM9493337_mean NKTCL.GSM9493338_mean NKTCL.GSM9493339_mean
## 1414 12.7582041 22.19706640 16.51334910
## 14280 10.3959611 12.17064159 6.66821561
## 22312 0.6413631 1.05105811 0.26039202
## 13803 0.3067075 0.09965289 0.04123015
## 5603 1.3196899 0.23278468 0.26486989
## 2972 3.5041471 4.29336021 1.45555931
## NKTCL.GSM9493340_mean NKTCL.GSM9493341_mean NKTCL.GSM9493342_mean
## 1414 29.9675728 5.95517443 10.8132857
## 14280 12.7071918 10.93055285 12.4585000
## 22312 0.1405180 0.36938868 0.4012143
## 13803 0.0854024 0.15792893 0.1267857
## 5603 1.8139983 0.02864939 0.4072143
## 2972 1.7065497 2.09627753 3.8170000
## NKTCL.GSM9493343_mean NKTCL.GSM9493344_mean NKTCL.GSM9493345_mean
## 1414 14.2298043 28.7486374 16.6500079
## 14280 12.0203149 22.5039453 13.3806431
## 22312 0.3568069 0.2224030 0.3739110
## 13803 0.1337514 0.1227528 0.2149533
## 5603 0.7770809 2.3528838 0.4168383
## 2972 2.1034154 2.7587245 3.0952796
## NKTCL.GSM9493346_mean NKTCL.GSM9493347_mean NKTCL.GSM9493348_mean
## 1414 75.24241681 19.10109064 25.19990567
## 14280 24.06557109 20.11214742 9.47779263
## 22312 0.04962799 0.55690109 0.43573618
## 13803 0.17676396 0.06130124 0.04213505
## 5603 0.79298504 0.32094772 4.96093074
## 2972 1.74392936 2.80699511 1.64916280
## NKTCL.GSM9493349_mean NKTCL.GSM9493350_mean NKTCL.GSM9493351_mean
## 1414 19.16635950 12.8743869 18.06203958
## 14280 16.34768872 5.2355150 10.01278348
## 22312 0.61263023 0.9675046 0.46959411
## 13803 0.02903675 0.2963673 0.01856132
## 5603 1.36555459 0.6836297 0.82103134
## 2972 1.73246323 6.7227928 0.92568251
## NKTCL.GSM9493332_mean NKTCL.GSM9493333_mean healthy.GSM9493320_mean
## 1414 13.23404822 9.13221787 8.1473199
## 14280 5.19808179 10.06422214 5.8773487
## 22312 0.06673771 0.02470630 0.3576945
## 13803 0.02917277 0.06429334 0.2695101
## 5603 0.33422139 0.21352795 0.7465130
## 2972 0.09750899 0.07831969 1.8303170
## healthy.GSM9493329_mean healthy.GSM9493330_mean healthy.GSM9493331_mean
## 1414 29.2248285 11.3305327 9.6991828
## 14280 18.1343218 6.2842766 8.5866156
## 22312 0.3899093 0.7881082 0.5373629
## 13803 0.1203806 0.1153171 0.1123463
## 5603 0.7561407 1.1591727 1.1468748
## 2972 2.0202478 4.4677315 5.8579842
## healthy.GSM9493321_mean healthy.GSM9493322_mean healthy.GSM9493323_mean
## 1414 6.2743685 8.9673723 5.89515397
## 14280 7.0196808 11.9966172 4.16816643
## 22312 0.4270107 0.1407682 0.29167011
## 13803 0.1930885 0.1736141 0.09378354
## 5603 0.4886874 0.8538848 0.37183192
## 2972 1.5470324 2.3434090 2.02253777
## healthy.GSM9493324_mean healthy.GSM9493325_mean healthy.GSM9493326_mean
## 1414 11.6448949 13.1591270 23.20427323
## 14280 13.4121622 9.4002381 13.01553901
## 22312 0.3395896 0.1820635 0.36840401
## 13803 0.1525275 0.1492063 0.03507068
## 5603 0.4872372 1.5381746 0.67454408
## 2972 1.4079079 1.9276190 2.61066149
## healthy.GSM9493327_mean healthy.GSM9493328_mean
## 1414 17.6065911 21.2710201
## 14280 4.5891692 15.8304168
## 22312 0.1588568 0.5342154
## 13803 0.0592173 0.1337317
## 5603 1.6737899 0.8955755
## 2972 2.6473567 3.3883910
dataPCA_Matrix <- data.frame(t(PCA10b))
colnames(dataPCA_Matrix) <- genesB
dataPCA_Matrix$class <- class
dataPCA_Matrix$class <- as.factor(dataPCA_Matrix$class)
str(dataPCA_Matrix)
## 'data.frame': 32 obs. of 11 variables:
## $ S100A9: num 56.14 9.17 14.06 12.76 22.2 ...
## $ LYZ : num 28.45 8.86 9.63 10.4 12.17 ...
## $ IGLC3 : num 0.278 0.666 0.528 0.641 1.051 ...
## $ SOX5 : num 0.1036 0.2857 0.4874 0.3067 0.0997 ...
## $ PPBP : num 0.0988 0.4482 0.8715 1.3197 0.2328 ...
## $ IGKC : num 1.59 5.04 5.48 3.5 4.29 ...
## $ IGLC2 : num 0.323 1.553 1.786 0.892 0.818 ...
## $ PF4 : num 0.0413 0.1868 0.3056 0.4931 0.0826 ...
## $ LINGO2: num 0.262 1.063 0.847 0.527 0.509 ...
## $ EREG : num 1.015 0.291 0.699 0.498 0.252 ...
## $ class : Factor w/ 2 levels "healthy","NKTC Lymphoma": 2 2 2 2 2 2 2 2 2 2 ...
head(dataPCA_Matrix,10)
## S100A9 LYZ IGLC3 SOX5 PPBP
## NKTCL.GSM9493334_mean 56.140747 28.453390 0.2782868 0.10364178 0.09883188
## NKTCL.GSM9493335_mean 9.172601 8.863909 0.6656494 0.28574658 0.44817452
## NKTCL.GSM9493336_mean 14.055953 9.630690 0.5280262 0.48737825 0.87149672
## NKTCL.GSM9493337_mean 12.758204 10.395961 0.6413631 0.30670754 1.31968987
## NKTCL.GSM9493338_mean 22.197066 12.170642 1.0510581 0.09965289 0.23278468
## NKTCL.GSM9493339_mean 16.513349 6.668216 0.2603920 0.04123015 0.26486989
## NKTCL.GSM9493340_mean 29.967573 12.707192 0.1405180 0.08540240 1.81399829
## NKTCL.GSM9493341_mean 5.955174 10.930553 0.3693887 0.15792893 0.02864939
## NKTCL.GSM9493342_mean 10.813286 12.458500 0.4012143 0.12678571 0.40721429
## NKTCL.GSM9493343_mean 14.229804 12.020315 0.3568069 0.13375145 0.77708092
## IGKC IGLC2 PF4 LINGO2 EREG
## NKTCL.GSM9493334_mean 1.588639 0.3231791 0.04134219 0.2616812 1.01465873
## NKTCL.GSM9493335_mean 5.039901 1.5525037 0.18684300 1.0630722 0.29128518
## NKTCL.GSM9493336_mean 5.477738 1.7860266 0.30560525 0.8470483 0.69936394
## NKTCL.GSM9493337_mean 3.504147 0.8921745 0.49314821 0.5274071 0.49783628
## NKTCL.GSM9493338_mean 4.293360 0.8176016 0.08263352 0.5087896 0.25249132
## NKTCL.GSM9493339_mean 1.455559 0.5998648 0.10687732 0.8781683 0.36160865
## NKTCL.GSM9493340_mean 1.706550 0.4258348 0.70954623 0.3816353 0.61772260
## NKTCL.GSM9493341_mean 2.096278 1.8922887 0.01630611 0.1517573 0.32274410
## NKTCL.GSM9493342_mean 3.817000 1.1034286 0.14450000 0.1280714 0.30864286
## NKTCL.GSM9493343_mean 2.103415 0.6613266 0.28120526 0.6723703 0.07246574
## class
## NKTCL.GSM9493334_mean NKTC Lymphoma
## NKTCL.GSM9493335_mean NKTC Lymphoma
## NKTCL.GSM9493336_mean NKTC Lymphoma
## NKTCL.GSM9493337_mean NKTC Lymphoma
## NKTCL.GSM9493338_mean NKTC Lymphoma
## NKTCL.GSM9493339_mean NKTC Lymphoma
## NKTCL.GSM9493340_mean NKTC Lymphoma
## NKTCL.GSM9493341_mean NKTC Lymphoma
## NKTCL.GSM9493342_mean NKTC Lymphoma
## NKTCL.GSM9493343_mean NKTC Lymphoma
tail(dataPCA_Matrix,10)
## S100A9 LYZ IGLC3 SOX5 PPBP
## healthy.GSM9493330_mean 11.330533 6.284277 0.7881082 0.11531715 1.1591727
## healthy.GSM9493331_mean 9.699183 8.586616 0.5373629 0.11234632 1.1468748
## healthy.GSM9493321_mean 6.274369 7.019681 0.4270107 0.19308849 0.4886874
## healthy.GSM9493322_mean 8.967372 11.996617 0.1407682 0.17361414 0.8538848
## healthy.GSM9493323_mean 5.895154 4.168166 0.2916701 0.09378354 0.3718319
## healthy.GSM9493324_mean 11.644895 13.412162 0.3395896 0.15252753 0.4872372
## healthy.GSM9493325_mean 13.159127 9.400238 0.1820635 0.14920635 1.5381746
## healthy.GSM9493326_mean 23.204273 13.015539 0.3684040 0.03507068 0.6745441
## healthy.GSM9493327_mean 17.606591 4.589169 0.1588568 0.05921730 1.6737899
## healthy.GSM9493328_mean 21.271020 15.830417 0.5342154 0.13373168 0.8955755
## IGKC IGLC2 PF4 LINGO2 EREG
## healthy.GSM9493330_mean 4.467732 0.7086142 0.6272736 0.6213646 0.4305235
## healthy.GSM9493331_mean 5.857984 1.3745859 0.5944195 0.5052639 0.3482294
## healthy.GSM9493321_mean 1.547032 0.2962188 0.1782117 1.2142414 0.2019216
## healthy.GSM9493322_mean 2.343409 0.4860323 0.3416630 1.0185509 0.5947185
## healthy.GSM9493323_mean 2.022538 0.4578552 0.1098819 0.1161562 0.1640386
## healthy.GSM9493324_mean 1.407908 0.3403403 0.1461461 1.2079580 0.5110110
## healthy.GSM9493325_mean 1.927619 1.2646825 0.3322222 0.5403175 0.5958730
## healthy.GSM9493326_mean 2.610661 3.4131866 0.5474264 0.2417179 0.7083198
## healthy.GSM9493327_mean 2.647357 0.3359938 1.1686406 0.6624614 0.1890663
## healthy.GSM9493328_mean 3.388391 1.5609617 0.7050790 0.9443733 1.2892303
## class
## healthy.GSM9493330_mean healthy
## healthy.GSM9493331_mean healthy
## healthy.GSM9493321_mean healthy
## healthy.GSM9493322_mean healthy
## healthy.GSM9493323_mean healthy
## healthy.GSM9493324_mean healthy
## healthy.GSM9493325_mean healthy
## healthy.GSM9493326_mean healthy
## healthy.GSM9493327_mean healthy
## healthy.GSM9493328_mean healthy
set.seed(123)
inTrain <- sample(1:32,.8*32)
training <- dataPCA_Matrix[inTrain,]
testing <- dataPCA_Matrix[-inTrain,]
table(training$class)
##
## healthy NKTC Lymphoma
## 10 15
table(testing$class)
##
## healthy NKTC Lymphoma
## 2 5
We will keep the same training and testing set of 80% since it did better.
rf4 <- randomForest(class ~ ., training, importance=TRUE, na.action=na.omit,mtry=2,ntree=10000)
prediction4 <- predict(rf4,testing)
results4 <- data.frame(predicted=prediction4,actual=testing$class)
results4
## predicted actual
## NKTCL.GSM9493335_mean NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493339_mean NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493345_mean NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493346_mean NKTC Lymphoma NKTC Lymphoma
## NKTCL.GSM9493349_mean NKTC Lymphoma NKTC Lymphoma
## healthy.GSM9493320_mean healthy healthy
## healthy.GSM9493321_mean healthy healthy
Using our same samples and the same tuned settings that scored the best on the top genes found from fold change values of complete cases in the top 2,000 top variable genes found with PCA in Seurat, the results are the same at 100% accuracy in prediction of class in a two class model. That is with an mtry of 2 and 10,000 trees with default paramter settings within the randomForest function.
How well did the model do on the training data before predicting the hold out test data?
rf4$confusion
## healthy NKTC Lymphoma class.error
## healthy 3 7 0.7000000
## NKTC Lymphoma 4 11 0.2666667
Well, it actually scored worse in training than our best model on the fold change genes of complete cases that are also in the top 2,000 genes but not the top 10 of those 2,000 genes. But somehow corrected within the model to score perfectly on the hold out prediction testing set. This training model incorrectly classified 7 healthy samples as NKTC lymphoma and 4 NKTC lymphoma samples as healthy.
Seems like either set of genes could be great sets of genes to use to predict NKTC Lymphoma with our tuned settings.
We will add these to our data of top genes for the various pathologies that we analyzed and use them later to predict the samples of various pathologies of fibromyalgia, multiple sclerosis, EBV infection, mononucleosis infection, Lyme disease, and now NKTC lymphoma.
Our genes are the 9 genes in the common data frame and the 10 genes in the PCA10 data frame. Lets write those out to csv.
write.csv(common,'commonToSeurat2000AndFCsCompleteCasesNKTCL.csv',row.names=F)
write.csv(PCA10,'top10SeuratPCA_NKTCL_genes.csv',row.names=F)
You can get those files here for common and the PCA10 sets of genes to add to our larger dataset later.
Lets write out some of our larger data out to csv as well.
write.csv(DataMeans3,'NKTCL_28817X3_noNaNs.csv',row.names=F)
write.csv(DataMeansComplete,'completeCasesNKTCL_19129X38.csv',row.names=F)
write.csv(top2000Var,'PCA_2000topgenesSeurat_NKTCL.csv',row.names=F)
The dataset without NaNs but has the Inf values. DataMeans3 28,817X38 The data of complete cases DataMeansComplete 19,129X38 The Seurat PCA 2,000 high variability genes data no fold change values. top2000Var 2,000 X 35
Thanks and keep checking in when we add those to the dataset of our genes and move on to the next series of EBV associated pathologies. We have nasopharyngeal carcinoma, Hodgkin’s lymphoma, Burkett’s lymphoma, and some others if we see them stirrig some interest or find other data sets lacking in information or some other reason.