This project revisits our previous work on top genes in classic Hodgkin lymphoma (CHL) and diffuse large B-cell lymphoma (DLBCL), study GSE305165. That analysis did not find strong genes when classifying with the random forest classifier, because there was no control or baseline: the samples were only compared to each other for similarity. We then noticed 6 healthy samples in another project we worked on, GSE85599, which covers acute infectious mononucleosis (AIM) and chronic active Epstein-Barr virus (CAEBV). Both studies use total RNA from peripheral blood mononuclear cells (PBMCs), but the CHL and DLBCL study is chip sequencing while the other is array data.
The datasets from each of those studies can be found on Google (direct access here) and on Kaggle. The documentation, from beginning to final output, can be found on the RPubs site for janiscorona and here.
library(rmarkdown)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
data_lymphoma <- read.csv("Data_TableauReady_21448X125.csv", header=T)
summary(data_lymphoma)
## ID SPOT_ID.1 GeneID pDLBCL
## Length :21448 Length :21448 Length :21448 Min. : 1.209
## N.unique :21448 N.unique :21448 N.unique :18769 1st Qu.: 3.242
## N.blank : 0 N.blank : 0 N.blank : 0 Median : 4.319
## Min.nchar: 17 Min.nchar: 105 Min.nchar: 1 Mean : 4.357
## Max.nchar: 35 Max.nchar:32532 Max.nchar: 8556 3rd Qu.: 5.352
## Max. :11.828
## CHL pDLBCL.1 mDLBCL CHL.1
## Min. : 1.214 Min. : 1.310 Min. : 1.345 Min. : 1.233
## 1st Qu.: 3.321 1st Qu.: 3.247 1st Qu.: 3.221 1st Qu.: 3.299
## Median : 4.286 Median : 4.331 Median : 4.323 Median : 4.320
## Mean : 4.343 Mean : 4.359 Mean : 4.349 Mean : 4.370
## 3rd Qu.: 5.257 3rd Qu.: 5.354 3rd Qu.: 5.343 3rd Qu.: 5.342
## Max. :11.637 Max. :11.936 Max. :11.941 Max. :11.825
## mDLBCL.1 pDLBCL.2 CHL.2 CHL.3
## Min. : 1.265 Min. : 1.279 Min. : 1.120 Min. : 1.184
## 1st Qu.: 3.207 1st Qu.: 3.236 1st Qu.: 3.207 1st Qu.: 3.234
## Median : 4.319 Median : 4.332 Median : 4.313 Median : 4.353
## Mean : 4.346 Mean : 4.363 Mean : 4.354 Mean : 4.383
## 3rd Qu.: 5.333 3rd Qu.: 5.364 3rd Qu.: 5.365 3rd Qu.: 5.393
## Max. :11.997 Max. :11.994 Max. :11.907 Max. :11.990
## CHL.4 CHL.5 mDLBCL.2 mDLBCL.3
## Min. : 1.194 Min. : 1.292 Min. : 1.160 Min. : 1.358
## 1st Qu.: 3.273 1st Qu.: 3.310 1st Qu.: 3.231 1st Qu.: 3.338
## Median : 4.361 Median : 4.363 Median : 4.312 Median : 4.365
## Mean : 4.393 Mean : 4.388 Mean : 4.367 Mean : 4.398
## 3rd Qu.: 5.378 3rd Qu.: 5.340 3rd Qu.: 5.362 3rd Qu.: 5.355
## Max. :12.090 Max. :11.984 Max. :11.890 Max. :11.971
## CHL.6 CHL.7 pDLBCL.3 CHL.8
## Min. : 1.297 Min. : 1.289 Min. : 1.311 Min. : 1.236
## 1st Qu.: 3.276 1st Qu.: 3.228 1st Qu.: 3.209 1st Qu.: 3.271
## Median : 4.358 Median : 4.317 Median : 4.305 Median : 4.338
## Mean : 4.386 Mean : 4.356 Mean : 4.348 Mean : 4.372
## 3rd Qu.: 5.362 3rd Qu.: 5.352 3rd Qu.: 5.337 3rd Qu.: 5.351
## Max. :11.945 Max. :11.958 Max. :12.007 Max. :11.958
## mDLBCL.4 mDLBCL.5 mDLBCL.6 mDLBCL.7
## Min. : 1.231 Min. : 1.250 Min. : 1.288 Min. : 1.283
## 1st Qu.: 3.200 1st Qu.: 3.206 1st Qu.: 3.199 1st Qu.: 3.231
## Median : 4.307 Median : 4.339 Median : 4.305 Median : 4.325
## Mean : 4.354 Mean : 4.372 Mean : 4.347 Mean : 4.369
## 3rd Qu.: 5.361 3rd Qu.: 5.381 3rd Qu.: 5.334 3rd Qu.: 5.374
## Max. :11.986 Max. :11.986 Max. :12.055 Max. :12.046
## CHL.9 mDLBCL.8 CHL.10 CHL.11
## Min. : 1.184 Min. : 1.262 Min. : 1.104 Min. : 1.214
## 1st Qu.: 3.223 1st Qu.: 3.262 1st Qu.: 3.187 1st Qu.: 3.220
## Median : 4.341 Median : 4.332 Median : 4.336 Median : 4.329
## Mean : 4.359 Mean : 4.381 Mean : 4.364 Mean : 4.367
## 3rd Qu.: 5.369 3rd Qu.: 5.357 3rd Qu.: 5.404 3rd Qu.: 5.382
## Max. :12.134 Max. :11.952 Max. :11.943 Max. :12.093
## CHL.12 CHL.13 mDLBCL.9 mDLBCL.10
## Min. : 1.221 Min. : 1.274 Min. : 1.248 Min. : 1.341
## 1st Qu.: 3.172 1st Qu.: 3.238 1st Qu.: 3.201 1st Qu.: 3.309
## Median : 4.309 Median : 4.340 Median : 4.335 Median : 4.351
## Mean : 4.363 Mean : 4.376 Mean : 4.370 Mean : 4.392
## 3rd Qu.: 5.388 3rd Qu.: 5.373 3rd Qu.: 5.393 3rd Qu.: 5.357
## Max. :12.150 Max. :12.126 Max. :12.011 Max. :12.004
## mDLBCL.11 mDLBCL.12 mDLBCL.13 mDLBCL.14
## Min. : 1.195 Min. : 1.211 Min. : 1.287 Min. : 1.361
## 1st Qu.: 3.212 1st Qu.: 3.221 1st Qu.: 3.192 1st Qu.: 3.244
## Median : 4.317 Median : 4.336 Median : 4.319 Median : 4.359
## Mean : 4.360 Mean : 4.369 Mean : 4.370 Mean : 4.386
## 3rd Qu.: 5.379 3rd Qu.: 5.377 3rd Qu.: 5.375 3rd Qu.: 5.388
## Max. :12.102 Max. :12.155 Max. :11.945 Max. :12.157
## pDLBCL.4 pDLBCL.5 mDLBCL.15 mDLBCL.16
## Min. : 1.244 Min. : 1.157 Min. : 1.208 Min. : 1.263
## 1st Qu.: 3.203 1st Qu.: 3.217 1st Qu.: 3.203 1st Qu.: 3.206
## Median : 4.338 Median : 4.339 Median : 4.314 Median : 4.314
## Mean : 4.375 Mean : 4.376 Mean : 4.368 Mean : 4.357
## 3rd Qu.: 5.395 3rd Qu.: 5.387 3rd Qu.: 5.376 3rd Qu.: 5.375
## Max. :11.979 Max. :12.027 Max. :11.822 Max. :11.897
## CHL.14 pDLBCL.6 pDLBCL.7 CHL.15
## Min. : 1.228 Min. : 1.127 Min. : 1.226 Min. : 1.320
## 1st Qu.: 3.215 1st Qu.: 3.181 1st Qu.: 3.146 1st Qu.: 3.200
## Median : 4.317 Median : 4.305 Median : 4.275 Median : 4.320
## Mean : 4.355 Mean : 4.363 Mean : 4.353 Mean : 4.367
## 3rd Qu.: 5.392 3rd Qu.: 5.396 3rd Qu.: 5.356 3rd Qu.: 5.374
## Max. :12.061 Max. :11.999 Max. :12.113 Max. :11.938
## CHL.16 CHL.17 mDLBCL.17 mDLBCL.18
## Min. : 1.239 Min. : 1.339 Min. : 1.225 Min. : 1.250
## 1st Qu.: 3.193 1st Qu.: 3.234 1st Qu.: 3.137 1st Qu.: 3.178
## Median : 4.317 Median : 4.345 Median : 4.262 Median : 4.308
## Mean : 4.367 Mean : 4.376 Mean : 4.340 Mean : 4.362
## 3rd Qu.: 5.380 3rd Qu.: 5.377 3rd Qu.: 5.364 3rd Qu.: 5.378
## Max. :12.080 Max. :11.967 Max. :12.221 Max. :11.992
## mDLBCL.19 CHL.18 CHL_mean pDLBCL_mean
## Min. : 1.223 Min. : 1.301 Min. : 1.464 Min. : 1.446
## 1st Qu.: 3.265 1st Qu.: 3.181 1st Qu.: 3.261 1st Qu.: 3.233
## Median : 4.367 Median : 4.305 Median : 4.339 Median : 4.327
## Mean : 4.407 Mean : 4.370 Mean : 4.369 Mean : 4.362
## 3rd Qu.: 5.412 3rd Qu.: 5.374 3rd Qu.: 5.343 3rd Qu.: 5.346
## Max. :11.962 Max. :12.067 Max. :11.975 Max. :11.938
## mDLBCL_mean CHL_x_mean mDLBCL_x_mean pDLBCL_x_mean
## Min. : 1.444 Min. : 1.349 Min. : 1.415 Min. : 1.405
## 1st Qu.: 3.245 1st Qu.: 3.237 1st Qu.: 3.263 1st Qu.: 3.235
## Median : 4.330 Median : 4.343 Median : 4.335 Median : 4.330
## Mean : 4.368 Mean : 4.371 Mean : 4.371 Mean : 4.365
## 3rd Qu.: 5.349 3rd Qu.: 5.363 3rd Qu.: 5.337 3rd Qu.: 5.351
## Max. :11.963 Max. :11.976 Max. :11.961 Max. :11.931
## CHL_y_mean mDLBCL_y_mean pDLBCL_y_mean CHL_young72_mean
## Min. : 1.495 Min. : 1.428 Min. : 1.420 Min. : 1.479
## 1st Qu.: 3.270 1st Qu.: 3.243 1st Qu.: 3.221 1st Qu.: 3.272
## Median : 4.342 Median : 4.329 Median : 4.318 Median : 4.341
## Mean : 4.368 Mean : 4.368 Mean : 4.358 Mean : 4.366
## 3rd Qu.: 5.336 3rd Qu.: 5.350 3rd Qu.: 5.339 3rd Qu.: 5.330
## Max. :11.974 Max. :11.963 Max. :11.945 Max. :11.967
## mDLBCL_young72_mean pDLBCL_young72_mean CHL_old mDLBCL_old
## Min. : 1.387 Min. : 1.474 Min. : 1.416 Min. : 1.424
## 1st Qu.: 3.254 1st Qu.: 3.247 1st Qu.: 3.250 1st Qu.: 3.242
## Median : 4.332 Median : 4.328 Median : 4.339 Median : 4.328
## Mean : 4.369 Mean : 4.360 Mean : 4.373 Mean : 4.368
## 3rd Qu.: 5.349 3rd Qu.: 5.341 3rd Qu.: 5.356 3rd Qu.: 5.350
## Max. :11.931 Max. :11.897 Max. :11.989 Max. :11.976
## pDLBCL_old CHL_median mDLBCL_median pDLBCL_median
## Min. : 1.408 Min. : 1.466 Min. : 1.386 Min. : 1.431
## 1st Qu.: 3.222 1st Qu.: 3.231 1st Qu.: 3.219 1st Qu.: 3.207
## Median : 4.320 Median : 4.328 Median : 4.315 Median : 4.317
## Mean : 4.364 Mean : 4.357 Mean : 4.352 Mean : 4.348
## 3rd Qu.: 5.358 3rd Qu.: 5.343 3rd Qu.: 5.342 3rd Qu.: 5.342
## Max. :11.979 Max. :11.958 Max. :11.968 Max. :11.954
## CHL_x_median mDLBCL_x_median pDLBCL_x_median CHL_y_median
## Min. : 1.348 Min. : 1.351 Min. : 1.400 Min. : 1.466
## 1st Qu.: 3.223 1st Qu.: 3.243 1st Qu.: 3.222 1st Qu.: 3.240
## Median : 4.337 Median : 4.322 Median : 4.329 Median : 4.328
## Mean : 4.363 Mean : 4.361 Mean : 4.360 Mean : 4.360
## 3rd Qu.: 5.359 3rd Qu.: 5.338 3rd Qu.: 5.350 3rd Qu.: 5.345
## Max. :11.944 Max. :11.958 Max. :11.923 Max. :11.967
## mDLBCL_y_median pDLBCL_y_median CHL_young72_median mDLBCL_young72_median
## Min. : 1.391 Min. : 1.423 Min. : 1.468 Min. : 1.366
## 1st Qu.: 3.215 1st Qu.: 3.209 1st Qu.: 3.239 1st Qu.: 3.237
## Median : 4.314 Median : 4.301 Median : 4.325 Median : 4.324
## Mean : 4.352 Mean : 4.347 Mean : 4.355 Mean : 4.358
## 3rd Qu.: 5.347 3rd Qu.: 5.336 3rd Qu.: 5.335 3rd Qu.: 5.343
## Max. :11.968 Max. :11.968 Max. :11.974 Max. :11.956
## pDLBCL_young72_median CHL_old72_median mDLBCL_old72_median pDLBCL_old72_median
## Min. : 1.451 Min. : 1.406 Min. : 1.400 Min. : 1.349
## 1st Qu.: 3.231 1st Qu.: 3.235 1st Qu.: 3.213 1st Qu.: 3.202
## Median : 4.323 Median : 4.339 Median : 4.315 Median : 4.298
## Mean : 4.351 Mean : 4.363 Mean : 4.352 Mean : 4.354
## 3rd Qu.: 5.337 3rd Qu.: 5.356 3rd Qu.: 5.345 3rd Qu.: 5.359
## Max. :11.893 Max. :11.957 Max. :11.968 Max. :12.008
## ID.1 CHL_mean.1 pDLBCL_mean.1 mDLBCL_mean.1
## Length :21448 Min. : 1.464 Min. : 1.446 Min. : 1.444
## N.unique :21448 1st Qu.: 3.261 1st Qu.: 3.233 1st Qu.: 3.245
## N.blank : 0 Median : 4.339 Median : 4.327 Median : 4.330
## Min.nchar: 17 Mean : 4.369 Mean : 4.362 Mean : 4.368
## Max.nchar: 35 3rd Qu.: 5.343 3rd Qu.: 5.346 3rd Qu.: 5.349
## Max. :11.975 Max. :11.938 Max. :11.963
## CHL_x_mean.1 mDLBCL_x_mean.1 pDLBCL_x_mean.1 CHL_y_mean.1
## Min. : 1.349 Min. : 1.415 Min. : 1.405 Min. : 1.495
## 1st Qu.: 3.237 1st Qu.: 3.263 1st Qu.: 3.235 1st Qu.: 3.270
## Median : 4.343 Median : 4.335 Median : 4.330 Median : 4.342
## Mean : 4.371 Mean : 4.371 Mean : 4.365 Mean : 4.368
## 3rd Qu.: 5.363 3rd Qu.: 5.337 3rd Qu.: 5.351 3rd Qu.: 5.336
## Max. :11.976 Max. :11.961 Max. :11.931 Max. :11.974
## mDLBCL_y_mean.1 pDLBCL_y_mean.1 CHL_young72_mean.1 mDLBCL_young72_mean.1
## Min. : 1.428 Min. : 1.420 Min. : 1.479 Min. : 1.387
## 1st Qu.: 3.243 1st Qu.: 3.221 1st Qu.: 3.272 1st Qu.: 3.254
## Median : 4.329 Median : 4.318 Median : 4.341 Median : 4.332
## Mean : 4.368 Mean : 4.358 Mean : 4.366 Mean : 4.369
## 3rd Qu.: 5.350 3rd Qu.: 5.339 3rd Qu.: 5.330 3rd Qu.: 5.349
## Max. :11.963 Max. :11.945 Max. :11.967 Max. :11.931
## pDLBCL_young72_mean.1 CHL_old.1 mDLBCL_old.1 pDLBCL_old.1
## Min. : 1.474 Min. : 1.416 Min. : 1.424 Min. : 1.408
## 1st Qu.: 3.247 1st Qu.: 3.250 1st Qu.: 3.242 1st Qu.: 3.222
## Median : 4.328 Median : 4.339 Median : 4.328 Median : 4.320
## Mean : 4.360 Mean : 4.373 Mean : 4.368 Mean : 4.364
## 3rd Qu.: 5.341 3rd Qu.: 5.356 3rd Qu.: 5.350 3rd Qu.: 5.358
## Max. :11.897 Max. :11.989 Max. :11.976 Max. :11.979
## CHL_median.1 mDLBCL_median.1 pDLBCL_median.1 CHL_x_median.1
## Min. : 1.466 Min. : 1.386 Min. : 1.431 Min. : 1.348
## 1st Qu.: 3.231 1st Qu.: 3.219 1st Qu.: 3.207 1st Qu.: 3.223
## Median : 4.328 Median : 4.315 Median : 4.317 Median : 4.337
## Mean : 4.357 Mean : 4.352 Mean : 4.348 Mean : 4.363
## 3rd Qu.: 5.343 3rd Qu.: 5.342 3rd Qu.: 5.342 3rd Qu.: 5.359
## Max. :11.958 Max. :11.968 Max. :11.954 Max. :11.944
## mDLBCL_x_median.1 pDLBCL_x_median.1 CHL_y_median.1 mDLBCL_y_median.1
## Min. : 1.351 Min. : 1.400 Min. : 1.466 Min. : 1.391
## 1st Qu.: 3.243 1st Qu.: 3.222 1st Qu.: 3.240 1st Qu.: 3.215
## Median : 4.322 Median : 4.329 Median : 4.328 Median : 4.314
## Mean : 4.361 Mean : 4.360 Mean : 4.360 Mean : 4.352
## 3rd Qu.: 5.338 3rd Qu.: 5.350 3rd Qu.: 5.345 3rd Qu.: 5.347
## Max. :11.958 Max. :11.923 Max. :11.967 Max. :11.968
## pDLBCL_y_median.1 CHL_young72_median.1 mDLBCL_young72_median.1
## Min. : 1.423 Min. : 1.468 Min. : 1.366
## 1st Qu.: 3.209 1st Qu.: 3.239 1st Qu.: 3.237
## Median : 4.301 Median : 4.325 Median : 4.324
## Mean : 4.347 Mean : 4.355 Mean : 4.358
## 3rd Qu.: 5.336 3rd Qu.: 5.335 3rd Qu.: 5.343
## Max. :11.968 Max. :11.974 Max. :11.956
## pDLBCL_young72_median.1 CHL_old72_median.1 mDLBCL_old72_median.1
## Min. : 1.451 Min. : 1.406 Min. : 1.400
## 1st Qu.: 3.231 1st Qu.: 3.235 1st Qu.: 3.213
## Median : 4.323 Median : 4.339 Median : 4.315
## Mean : 4.351 Mean : 4.363 Mean : 4.352
## 3rd Qu.: 5.337 3rd Qu.: 5.356 3rd Qu.: 5.345
## Max. :11.893 Max. :11.957 Max. :11.968
## pDLBCL_old72_median.1 CHL_change pDLBCL_change mDLBCL_change
## Min. : 1.349 Min. :0.9016 Min. :0.9048 Min. :0.9283
## 1st Qu.: 3.202 1st Qu.:0.9941 1st Qu.:0.9925 1st Qu.:0.9960
## Median : 4.298 Median :1.0040 Median :1.0020 Median :1.0032
## Mean : 4.354 Mean :1.0055 Mean :1.0051 Mean :1.0056
## 3rd Qu.: 5.359 3rd Qu.:1.0150 3rd Qu.:1.0141 3rd Qu.:1.0121
## Max. :12.008 Max. :1.2240 Max. :1.5134 Max. :1.2959
## CHL_x_change CHL_y_change pDLBCL_x_change pDLBCL_y_change
## Min. :0.8728 Min. :0.9054 Min. :0.9016 Min. :0.8977
## 1st Qu.:0.9924 1st Qu.:0.9925 1st Qu.:0.9921 1st Qu.:0.9921
## Median :1.0015 Median :1.0033 Median :1.0009 Median :1.0012
## Mean :1.0032 Mean :1.0050 Mean :1.0025 Mean :1.0039
## 3rd Qu.:1.0122 3rd Qu.:1.0158 3rd Qu.:1.0108 3rd Qu.:1.0124
## Max. :1.2379 Max. :1.2424 Max. :1.2773 Max. :1.2548
## mDLBCL_x_change mDLBCL_y_change CHL_old72_change CHL_young72_change
## Min. :0.8726 Min. :0.9293 Min. :0.8762 Min. :0.8839
## 1st Qu.:0.9920 1st Qu.:0.9953 1st Qu.:0.9907 1st Qu.:0.9924
## Median :1.0014 Median :1.0030 Median :1.0021 Median :1.0045
## Mean :1.0038 Mean :1.0054 Mean :1.0039 Mean :1.0060
## 3rd Qu.:1.0123 3rd Qu.:1.0122 3rd Qu.:1.0148 3rd Qu.:1.0178
## Max. :1.3753 Max. :1.2939 Max. :1.2085 Max. :1.2139
## pDLBCL_old_change pDLBCL_young72_change mDLBCL_young72_change
## Min. :0.9046 Min. :0.9111 Min. :0.8669
## 1st Qu.:0.9917 1st Qu.:0.9921 1st Qu.:0.9926
## Median :1.0013 Median :1.0009 Median :1.0025
## Mean :1.0041 Mean :1.0029 Mean :1.0047
## 3rd Qu.:1.0129 3rd Qu.:1.0112 3rd Qu.:1.0140
## Max. :1.3349 Max. :1.3463 Max. :1.4123
paged_table(data_lymphoma[1:5,])
colnames(data_lymphoma)
## [1] "ID" "SPOT_ID.1"
## [3] "GeneID" "pDLBCL"
## [5] "CHL" "pDLBCL.1"
## [7] "mDLBCL" "CHL.1"
## [9] "mDLBCL.1" "pDLBCL.2"
## [11] "CHL.2" "CHL.3"
## [13] "CHL.4" "CHL.5"
## [15] "mDLBCL.2" "mDLBCL.3"
## [17] "CHL.6" "CHL.7"
## [19] "pDLBCL.3" "CHL.8"
## [21] "mDLBCL.4" "mDLBCL.5"
## [23] "mDLBCL.6" "mDLBCL.7"
## [25] "CHL.9" "mDLBCL.8"
## [27] "CHL.10" "CHL.11"
## [29] "CHL.12" "CHL.13"
## [31] "mDLBCL.9" "mDLBCL.10"
## [33] "mDLBCL.11" "mDLBCL.12"
## [35] "mDLBCL.13" "mDLBCL.14"
## [37] "pDLBCL.4" "pDLBCL.5"
## [39] "mDLBCL.15" "mDLBCL.16"
## [41] "CHL.14" "pDLBCL.6"
## [43] "pDLBCL.7" "CHL.15"
## [45] "CHL.16" "CHL.17"
## [47] "mDLBCL.17" "mDLBCL.18"
## [49] "mDLBCL.19" "CHL.18"
## [51] "CHL_mean" "pDLBCL_mean"
## [53] "mDLBCL_mean" "CHL_x_mean"
## [55] "mDLBCL_x_mean" "pDLBCL_x_mean"
## [57] "CHL_y_mean" "mDLBCL_y_mean"
## [59] "pDLBCL_y_mean" "CHL_young72_mean"
## [61] "mDLBCL_young72_mean" "pDLBCL_young72_mean"
## [63] "CHL_old" "mDLBCL_old"
## [65] "pDLBCL_old" "CHL_median"
## [67] "mDLBCL_median" "pDLBCL_median"
## [69] "CHL_x_median" "mDLBCL_x_median"
## [71] "pDLBCL_x_median" "CHL_y_median"
## [73] "mDLBCL_y_median" "pDLBCL_y_median"
## [75] "CHL_young72_median" "mDLBCL_young72_median"
## [77] "pDLBCL_young72_median" "CHL_old72_median"
## [79] "mDLBCL_old72_median" "pDLBCL_old72_median"
## [81] "ID.1" "CHL_mean.1"
## [83] "pDLBCL_mean.1" "mDLBCL_mean.1"
## [85] "CHL_x_mean.1" "mDLBCL_x_mean.1"
## [87] "pDLBCL_x_mean.1" "CHL_y_mean.1"
## [89] "mDLBCL_y_mean.1" "pDLBCL_y_mean.1"
## [91] "CHL_young72_mean.1" "mDLBCL_young72_mean.1"
## [93] "pDLBCL_young72_mean.1" "CHL_old.1"
## [95] "mDLBCL_old.1" "pDLBCL_old.1"
## [97] "CHL_median.1" "mDLBCL_median.1"
## [99] "pDLBCL_median.1" "CHL_x_median.1"
## [101] "mDLBCL_x_median.1" "pDLBCL_x_median.1"
## [103] "CHL_y_median.1" "mDLBCL_y_median.1"
## [105] "pDLBCL_y_median.1" "CHL_young72_median.1"
## [107] "mDLBCL_young72_median.1" "pDLBCL_young72_median.1"
## [109] "CHL_old72_median.1" "mDLBCL_old72_median.1"
## [111] "pDLBCL_old72_median.1" "CHL_change"
## [113] "pDLBCL_change" "mDLBCL_change"
## [115] "CHL_x_change" "CHL_y_change"
## [117] "pDLBCL_x_change" "pDLBCL_y_change"
## [119] "mDLBCL_x_change" "mDLBCL_y_change"
## [121] "CHL_old72_change" "CHL_young72_change"
## [123] "pDLBCL_old_change" "pDLBCL_young72_change"
## [125] "mDLBCL_young72_change"
data_mono <- read.csv("CAEBV_genes_32670_FCs.csv", header=T)
summary(data_mono)
## ID gene GSM2279022_AIM GSM2279023_AIM
## Min. : 2996 Length :32670 Min. : 1.034 Min. : 1.095
## 1st Qu.:18724 N.unique :30905 1st Qu.: 1.706 1st Qu.: 1.736
## Median :35662 N.blank : 0 Median : 2.534 Median : 2.553
## Mean :36518 Min.nchar: 1 Mean : 3.050 Mean : 3.062
## 3rd Qu.:54869 Max.nchar: 22 3rd Qu.: 3.965 3rd Qu.: 3.974
## Max. :70515 Max. :12.372 Max. :12.369
## GSM2279024_AIM GSM2279025_CAEBV GSM2279026_AIM GSM2279027_CAEBV
## Min. : 1.088 Min. : 1.063 Min. : 1.091 Min. : 1.087
## 1st Qu.: 1.749 1st Qu.: 1.767 1st Qu.: 1.715 1st Qu.: 1.768
## Median : 2.559 Median : 2.584 Median : 2.542 Median : 2.565
## Mean : 3.055 Mean : 3.051 Mean : 3.056 Mean : 3.066
## 3rd Qu.: 3.939 3rd Qu.: 3.937 3rd Qu.: 3.962 3rd Qu.: 3.936
## Max. :12.423 Max. :12.396 Max. :12.460 Max. :12.311
## GSM2279028_CAEBV GSM2279029_CAEBV GSM2279030_CAEBV GSM2279031_healthy
## Min. : 1.132 Min. : 0.7472 Min. : 1.124 Min. : 1.059
## 1st Qu.: 1.755 1st Qu.: 1.7254 1st Qu.: 1.770 1st Qu.: 1.743
## Median : 2.568 Median : 2.5629 Median : 2.561 Median : 2.589
## Mean : 3.061 Mean : 3.0407 Mean : 3.048 Mean : 3.069
## 3rd Qu.: 3.927 3rd Qu.: 3.9254 3rd Qu.: 3.896 3rd Qu.: 3.988
## Max. :12.471 Max. :12.4559 Max. :12.495 Max. :12.273
## GSM2279032_healthy GSM2279033_healthy GSM2279034_healthy GSM2279035_healthy
## Min. : 1.165 Min. : 1.127 Min. : 1.095 Min. : 1.084
## 1st Qu.: 1.760 1st Qu.: 1.741 1st Qu.: 1.739 1st Qu.: 1.788
## Median : 2.573 Median : 2.554 Median : 2.566 Median : 2.601
## Mean : 3.074 Mean : 3.063 Mean : 3.064 Mean : 3.086
## 3rd Qu.: 3.957 3rd Qu.: 3.944 3rd Qu.: 3.938 3rd Qu.: 3.957
## Max. :12.439 Max. :12.428 Max. :12.390 Max. :12.425
## GSM2279036_healthy GSM2279037_AIM GSM2279038_AIM AIM_mean
## Min. : 1.100 Min. : 1.079 Min. : 1.105 Min. : 1.225
## 1st Qu.: 1.802 1st Qu.: 1.767 1st Qu.: 1.770 1st Qu.: 1.735
## Median : 2.600 Median : 2.588 Median : 2.594 Median : 2.562
## Mean : 3.075 Mean : 3.059 Mean : 3.066 Mean : 3.058
## 3rd Qu.: 3.926 3rd Qu.: 3.948 3rd Qu.: 3.979 3rd Qu.: 3.963
## Max. :12.332 Max. :12.430 Max. :12.373 Max. :12.404
## CAEBV_mean healthy_mean FC_AIM_healthy FC_CAEBV_healthy
## Min. : 1.172 Min. : 1.202 Min. :0.3075 Min. :0.4247
## 1st Qu.: 1.757 1st Qu.: 1.760 1st Qu.:0.9673 1st Qu.:0.9746
## Median : 2.569 Median : 2.582 Median :0.9903 Median :0.9951
## Mean : 3.053 Mean : 3.072 Mean :0.9979 Mean :0.9968
## 3rd Qu.: 3.930 3rd Qu.: 3.955 3rd Qu.:1.0187 3rd Qu.:1.0177
## Max. :12.426 Max. :12.381 Max. :2.6063 Max. :2.1094
paged_table(data_mono)
We only want to keep the gene column and the healthy samples from the mononucleosis and EBV dataset.
colnames(data_mono)
## [1] "ID" "gene" "GSM2279022_AIM"
## [4] "GSM2279023_AIM" "GSM2279024_AIM" "GSM2279025_CAEBV"
## [7] "GSM2279026_AIM" "GSM2279027_CAEBV" "GSM2279028_CAEBV"
## [10] "GSM2279029_CAEBV" "GSM2279030_CAEBV" "GSM2279031_healthy"
## [13] "GSM2279032_healthy" "GSM2279033_healthy" "GSM2279034_healthy"
## [16] "GSM2279035_healthy" "GSM2279036_healthy" "GSM2279037_AIM"
## [19] "GSM2279038_AIM" "AIM_mean" "CAEBV_mean"
## [22] "healthy_mean" "FC_AIM_healthy" "FC_CAEBV_healthy"
healthy <- data_mono[,c(2,12:17)]
paged_table(healthy)
dim(healthy)
## [1] 32670 7
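The positional selection above works as long as the column order never changes. As a minimal sketch, assuming the healthy columns always end in "_healthy", the same subset can be picked by name. The toy table and its gene symbols below are made up for illustration and are not the real data.

```r
# Toy stand-in for data_mono; values and gene symbols are hypothetical.
toy_mono <- data.frame(
  ID = 1:3,
  gene = c("A1BG", "TP53", "MYC"),
  GSM1_AIM = c(2.1, 3.4, 5.0),
  GSM2_healthy = c(2.0, 3.1, 4.8),
  GSM3_healthy = c(2.2, 3.3, 4.9)
)

# Keep the gene column plus every column whose name ends in "_healthy".
healthy_cols <- grep("_healthy$", colnames(toy_mono), value = TRUE)
toy_healthy <- toy_mono[, c("gene", healthy_cols)]
colnames(toy_healthy)
```

Selecting by name keeps the step valid if columns are later added or reordered.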
dim(data_lymphoma)
## [1] 21448 125
colnames(data_lymphoma)
## [1] "ID" "SPOT_ID.1"
## [3] "GeneID" "pDLBCL"
## [5] "CHL" "pDLBCL.1"
## [7] "mDLBCL" "CHL.1"
## [9] "mDLBCL.1" "pDLBCL.2"
## [11] "CHL.2" "CHL.3"
## [13] "CHL.4" "CHL.5"
## [15] "mDLBCL.2" "mDLBCL.3"
## [17] "CHL.6" "CHL.7"
## [19] "pDLBCL.3" "CHL.8"
## [21] "mDLBCL.4" "mDLBCL.5"
## [23] "mDLBCL.6" "mDLBCL.7"
## [25] "CHL.9" "mDLBCL.8"
## [27] "CHL.10" "CHL.11"
## [29] "CHL.12" "CHL.13"
## [31] "mDLBCL.9" "mDLBCL.10"
## [33] "mDLBCL.11" "mDLBCL.12"
## [35] "mDLBCL.13" "mDLBCL.14"
## [37] "pDLBCL.4" "pDLBCL.5"
## [39] "mDLBCL.15" "mDLBCL.16"
## [41] "CHL.14" "pDLBCL.6"
## [43] "pDLBCL.7" "CHL.15"
## [45] "CHL.16" "CHL.17"
## [47] "mDLBCL.17" "mDLBCL.18"
## [49] "mDLBCL.19" "CHL.18"
## [51] "CHL_mean" "pDLBCL_mean"
## [53] "mDLBCL_mean" "CHL_x_mean"
## [55] "mDLBCL_x_mean" "pDLBCL_x_mean"
## [57] "CHL_y_mean" "mDLBCL_y_mean"
## [59] "pDLBCL_y_mean" "CHL_young72_mean"
## [61] "mDLBCL_young72_mean" "pDLBCL_young72_mean"
## [63] "CHL_old" "mDLBCL_old"
## [65] "pDLBCL_old" "CHL_median"
## [67] "mDLBCL_median" "pDLBCL_median"
## [69] "CHL_x_median" "mDLBCL_x_median"
## [71] "pDLBCL_x_median" "CHL_y_median"
## [73] "mDLBCL_y_median" "pDLBCL_y_median"
## [75] "CHL_young72_median" "mDLBCL_young72_median"
## [77] "pDLBCL_young72_median" "CHL_old72_median"
## [79] "mDLBCL_old72_median" "pDLBCL_old72_median"
## [81] "ID.1" "CHL_mean.1"
## [83] "pDLBCL_mean.1" "mDLBCL_mean.1"
## [85] "CHL_x_mean.1" "mDLBCL_x_mean.1"
## [87] "pDLBCL_x_mean.1" "CHL_y_mean.1"
## [89] "mDLBCL_y_mean.1" "pDLBCL_y_mean.1"
## [91] "CHL_young72_mean.1" "mDLBCL_young72_mean.1"
## [93] "pDLBCL_young72_mean.1" "CHL_old.1"
## [95] "mDLBCL_old.1" "pDLBCL_old.1"
## [97] "CHL_median.1" "mDLBCL_median.1"
## [99] "pDLBCL_median.1" "CHL_x_median.1"
## [101] "mDLBCL_x_median.1" "pDLBCL_x_median.1"
## [103] "CHL_y_median.1" "mDLBCL_y_median.1"
## [105] "pDLBCL_y_median.1" "CHL_young72_median.1"
## [107] "mDLBCL_young72_median.1" "pDLBCL_young72_median.1"
## [109] "CHL_old72_median.1" "mDLBCL_old72_median.1"
## [111] "pDLBCL_old72_median.1" "CHL_change"
## [113] "pDLBCL_change" "mDLBCL_change"
## [115] "CHL_x_change" "CHL_y_change"
## [117] "pDLBCL_x_change" "pDLBCL_y_change"
## [119] "mDLBCL_x_change" "mDLBCL_y_change"
## [121] "CHL_old72_change" "CHL_young72_change"
## [123] "pDLBCL_old_change" "pDLBCL_young72_change"
## [125] "mDLBCL_young72_change"
Let's drop the first two columns and the trailing summary-statistic columns from data_lymphoma, keeping only the three group means, and then merge on GeneID in that dataset and gene in the healthy dataset.
lymphomas <- data_lymphoma[,c(3:53)]
paged_table(lymphomas)
Now we will merge the two tables.
Data <- merge(healthy, lymphomas, by.x="gene", by.y="GeneID")
dim(Data)
## [1] 16827 57
paged_table(Data)
colnames(Data)
## [1] "gene" "GSM2279031_healthy" "GSM2279032_healthy"
## [4] "GSM2279033_healthy" "GSM2279034_healthy" "GSM2279035_healthy"
## [7] "GSM2279036_healthy" "pDLBCL" "CHL"
## [10] "pDLBCL.1" "mDLBCL" "CHL.1"
## [13] "mDLBCL.1" "pDLBCL.2" "CHL.2"
## [16] "CHL.3" "CHL.4" "CHL.5"
## [19] "mDLBCL.2" "mDLBCL.3" "CHL.6"
## [22] "CHL.7" "pDLBCL.3" "CHL.8"
## [25] "mDLBCL.4" "mDLBCL.5" "mDLBCL.6"
## [28] "mDLBCL.7" "CHL.9" "mDLBCL.8"
## [31] "CHL.10" "CHL.11" "CHL.12"
## [34] "CHL.13" "mDLBCL.9" "mDLBCL.10"
## [37] "mDLBCL.11" "mDLBCL.12" "mDLBCL.13"
## [40] "mDLBCL.14" "pDLBCL.4" "pDLBCL.5"
## [43] "mDLBCL.15" "mDLBCL.16" "CHL.14"
## [46] "pDLBCL.6" "pDLBCL.7" "CHL.15"
## [49] "CHL.16" "CHL.17" "mDLBCL.17"
## [52] "mDLBCL.18" "mDLBCL.19" "CHL.18"
## [55] "CHL_mean" "pDLBCL_mean" "mDLBCL_mean"
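The summaries earlier show that neither key column is fully unique (GeneID has 18769 unique values in 21448 rows, gene has 30905 in 32670), so some symbols can match more than once in the merge. The sketch below uses made-up keys to show how merge() behaves in that case: it is an inner join by default, and repeated keys on both sides produce every pairing.

```r
# Hypothetical keys: G2 is repeated on both sides, G3 exists only on the right.
left <- data.frame(gene = c("G1", "G2", "G2"), h1 = c(1.0, 2.0, 2.5))
right <- data.frame(GeneID = c("G1", "G2", "G2", "G3"), s1 = c(3.0, 4.0, 4.5, 5.0))

merged <- merge(left, right, by.x = "gene", by.y = "GeneID")

nrow(merged)                   # G1 pairs once, G2 pairs 2 x 2 times, G3 is dropped
table(merged$gene)
sum(duplicated(merged$gene))   # rows that repeat an earlier gene symbol
```

Checking sum(duplicated(Data$gene)) after the real merge would show how many repeated symbols carried through.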
Let's add the healthy mean to the data. It will be our baseline control for computing the fold change values, which we use to extract the most variable genes to test our classifier model on.
Data$healthy_mean <- rowMeans(Data[,c(2:7)])
summary(Data$healthy_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.293 2.072 3.126 3.419 4.368 11.400
Now we can get our fold change values.
Data$foldchange_CHL_healthy <- Data$CHL_mean/Data$healthy_mean
Data$foldchange_pDLBCL_healthy <- Data$pDLBCL_mean/Data$healthy_mean
Data$foldchange_mDLBCL_healthy <- Data$mDLBCL_mean/Data$healthy_mean
paged_table(Data[,c(1:5,55:61)])
The table above shows the first few columns and the last ones after adding the fold change values, computed as the pathology group mean divided by the healthy mean.
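As a small worked example of the calculation, the sketch below computes the same ratio on made-up numbers. It also shows the difference of means, which would be the log2 fold change if the expression values are already on a log2 scale. That is an assumption; the source does not state the scale.

```r
# Hypothetical group means for three genes.
toy <- data.frame(
  gene = c("G1", "G2", "G3"),
  disease_mean = c(6.0, 3.0, 4.0),
  healthy_mean = c(3.0, 3.0, 8.0)
)

# Ratio of group means, the same form used for the foldchange_* columns.
toy$ratio <- toy$disease_mean / toy$healthy_mean

# Assumption: if the inputs are log2 values, the difference is the log2 fold change.
toy$log2fc <- toy$disease_mean - toy$healthy_mean
toy
```

Both columns rank genes from most increased to most decreased, but the rankings can differ, so the choice affects which genes land in the top and bottom 10.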
Data_CHL_ordered <- Data[order(Data$foldchange_CHL_healthy, decreasing=T),]
Data_pDLBCL_ordered <- Data[order(Data$foldchange_pDLBCL_healthy, decreasing=T),]
Data_mDLBCL_ordered <- Data[order(Data$foldchange_mDLBCL_healthy, decreasing=T),]
CHL_20 <- Data_CHL_ordered[c(1:10,16818:16827),]
pDLBCL_20 <- Data_pDLBCL_ordered[c(1:10,16818:16827),]
mDLBCL_20 <- Data_mDLBCL_ordered[c(1:10,16818:16827),]
CHL_genes <- CHL_20$gene
pDLBCL_genes <- pDLBCL_20$gene
mDLBCL_genes <- mDLBCL_20$gene
CHL_genes
## [1] "HIST1H2AJ" "FDCSP" "TMBIM6" "KRTAP2-4" "TMEM47"
## [6] "APOLD1" "KRTAP2-3" "HNRNPR" "BRDT" "FOXF1"
## [11] "ZBTB10" "ZNF888" "HNRNPA1P33" "MAPRE2" "SLC38A2"
## [16] "FFAR2" "NBPF9" "AQP9" "GPCPD1" "EIF4G2"
pDLBCL_genes
## [1] "HIST1H2AJ" "TMBIM6" "KRTAP2-4" "TMEM47" "FDCSP"
## [6] "APOLD1" "KRTAP2-3" "BRDT" "SULF1" "HNRNPR"
## [11] "MAPRE2" "SLC38A2" "ZNF888" "DPM1" "ZBTB10"
## [16] "HNRNPA1P33" "GPCPD1" "FFAR2" "NBPF9" "EIF4G2"
mDLBCL_genes
## [1] "HIST1H2AJ" "TMBIM6" "KRTAP2-4" "APOLD1" "TMEM47"
## [6] "HNRNPR" "CCDC34" "KRTAP2-3" "SULF1" "BRDT"
## [11] "SLC38A2" "ZNF888" "ZBTB10" "DPM1" "HNRNPA1P33"
## [16] "FFAR2" "MAPRE2" "GPCPD1" "NBPF9" "EIF4G2"
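The row indices 16818:16827 above depend on the merged table having exactly 16827 rows. A minimal sketch of the same top-and-bottom selection that does not hard-code the row count is shown here; the helper name top_and_bottom and the toy values are hypothetical.

```r
# Toy fold change table with made-up values.
toy <- data.frame(gene = paste0("G", 1:8),
                  fc = c(1.0, 2.5, 0.4, 1.1, 0.9, 3.0, 0.2, 1.6))

# Hypothetical helper: order by a column (descending), keep the first and last n rows.
top_and_bottom <- function(df, column, n) {
  ordered <- df[order(df[[column]], decreasing = TRUE), ]
  rbind(head(ordered, n), tail(ordered, n))
}

picked <- top_and_bottom(toy, "fc", 2)
picked$gene
```

With n = 10 and the real ordered tables, this would select the same 20 rows as the indexing above.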
commonTopGenes <- c(CHL_genes,pDLBCL_genes,mDLBCL_genes)
uniqueTopGenes <- commonTopGenes[!duplicated(commonTopGenes)]
uniqueTopGenes
## [1] "HIST1H2AJ" "FDCSP" "TMBIM6" "KRTAP2-4" "TMEM47"
## [6] "APOLD1" "KRTAP2-3" "HNRNPR" "BRDT" "FOXF1"
## [11] "ZBTB10" "ZNF888" "HNRNPA1P33" "MAPRE2" "SLC38A2"
## [16] "FFAR2" "NBPF9" "AQP9" "GPCPD1" "EIF4G2"
## [21] "SULF1" "DPM1" "CCDC34"
There are 23 top genes for all these lymphomas.
Let's see which genes appear in more than one of the three lists.
duplicatedTopGenes <- commonTopGenes[duplicated(commonTopGenes)]
duplicatedTopGenes
## [1] "HIST1H2AJ" "TMBIM6" "KRTAP2-4" "TMEM47" "FDCSP"
## [6] "APOLD1" "KRTAP2-3" "BRDT" "HNRNPR" "MAPRE2"
## [11] "SLC38A2" "ZNF888" "ZBTB10" "HNRNPA1P33" "GPCPD1"
## [16] "FFAR2" "NBPF9" "EIF4G2" "HIST1H2AJ" "TMBIM6"
## [21] "KRTAP2-4" "APOLD1" "TMEM47" "HNRNPR" "KRTAP2-3"
## [26] "SULF1" "BRDT" "SLC38A2" "ZNF888" "ZBTB10"
## [31] "DPM1" "HNRNPA1P33" "FFAR2" "MAPRE2" "GPCPD1"
## [36] "NBPF9" "EIF4G2"
These genes are top genes in at least two of the three lymphoma lists (Hodgkin lymphoma, polymorphic DLBCL, and monomorphic DLBCL). A gene that is in all three lists appears twice here.
duplicated <- duplicatedTopGenes[!duplicated(duplicatedTopGenes)]
duplicated
## [1] "HIST1H2AJ" "TMBIM6" "KRTAP2-4" "TMEM47" "FDCSP"
## [6] "APOLD1" "KRTAP2-3" "BRDT" "HNRNPR" "MAPRE2"
## [11] "SLC38A2" "ZNF888" "ZBTB10" "HNRNPA1P33" "GPCPD1"
## [16] "FFAR2" "NBPF9" "EIF4G2" "SULF1" "DPM1"
There are 20 distinct genes that appear in more than one list. We will use these as one set of top genes, and uniqueTopGenes as the other set.
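The difference between "in more than one list" and "in all three lists" can be sketched with a few genes taken from the output above. In the printed lists, FDCSP is in the CHL and pDLBCL lists only, SULF1 is in the pDLBCL and mDLBCL lists only, and HIST1H2AJ is in all three. The three short vectors below are cut-down stand-ins, not the full 20-gene lists.

```r
# Cut-down stand-ins for CHL_genes, pDLBCL_genes and mDLBCL_genes.
chl    <- c("HIST1H2AJ", "FDCSP", "FOXF1")
pdlbcl <- c("HIST1H2AJ", "FDCSP", "SULF1")
mdlbcl <- c("HIST1H2AJ", "SULF1", "CCDC34")

pooled <- c(chl, pdlbcl, mdlbcl)

# duplicated() flags any repeat, so this is "in two or more lists".
in_two_or_more <- unique(pooled[duplicated(pooled)])

# Reduce(intersect, ...) keeps only genes present in every list.
in_all_three <- Reduce(intersect, list(chl, pdlbcl, mdlbcl))

in_two_or_more
in_all_three
```

Running Reduce(intersect, list(CHL_genes, pDLBCL_genes, mDLBCL_genes)) on the real lists would give the strict three-way overlap.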
write.csv(Data_CHL_ordered,"Dataset_CHL_pDLBCL_mDLBCL_AIM_CAEBV_healthy.csv", row.names=F)
The above dataset is going into Tableau to replace the other top genes of CHL and DLBCL.
colnames(Data)
## [1] "gene" "GSM2279031_healthy"
## [3] "GSM2279032_healthy" "GSM2279033_healthy"
## [5] "GSM2279034_healthy" "GSM2279035_healthy"
## [7] "GSM2279036_healthy" "pDLBCL"
## [9] "CHL" "pDLBCL.1"
## [11] "mDLBCL" "CHL.1"
## [13] "mDLBCL.1" "pDLBCL.2"
## [15] "CHL.2" "CHL.3"
## [17] "CHL.4" "CHL.5"
## [19] "mDLBCL.2" "mDLBCL.3"
## [21] "CHL.6" "CHL.7"
## [23] "pDLBCL.3" "CHL.8"
## [25] "mDLBCL.4" "mDLBCL.5"
## [27] "mDLBCL.6" "mDLBCL.7"
## [29] "CHL.9" "mDLBCL.8"
## [31] "CHL.10" "CHL.11"
## [33] "CHL.12" "CHL.13"
## [35] "mDLBCL.9" "mDLBCL.10"
## [37] "mDLBCL.11" "mDLBCL.12"
## [39] "mDLBCL.13" "mDLBCL.14"
## [41] "pDLBCL.4" "pDLBCL.5"
## [43] "mDLBCL.15" "mDLBCL.16"
## [45] "CHL.14" "pDLBCL.6"
## [47] "pDLBCL.7" "CHL.15"
## [49] "CHL.16" "CHL.17"
## [51] "mDLBCL.17" "mDLBCL.18"
## [53] "mDLBCL.19" "CHL.18"
## [55] "CHL_mean" "pDLBCL_mean"
## [57] "mDLBCL_mean" "healthy_mean"
## [59] "foldchange_CHL_healthy" "foldchange_pDLBCL_healthy"
## [61] "foldchange_mDLBCL_healthy"
Data1 <- Data[,1:54]
colnames(Data1)
## [1] "gene" "GSM2279031_healthy" "GSM2279032_healthy"
## [4] "GSM2279033_healthy" "GSM2279034_healthy" "GSM2279035_healthy"
## [7] "GSM2279036_healthy" "pDLBCL" "CHL"
## [10] "pDLBCL.1" "mDLBCL" "CHL.1"
## [13] "mDLBCL.1" "pDLBCL.2" "CHL.2"
## [16] "CHL.3" "CHL.4" "CHL.5"
## [19] "mDLBCL.2" "mDLBCL.3" "CHL.6"
## [22] "CHL.7" "pDLBCL.3" "CHL.8"
## [25] "mDLBCL.4" "mDLBCL.5" "mDLBCL.6"
## [28] "mDLBCL.7" "CHL.9" "mDLBCL.8"
## [31] "CHL.10" "CHL.11" "CHL.12"
## [34] "CHL.13" "mDLBCL.9" "mDLBCL.10"
## [37] "mDLBCL.11" "mDLBCL.12" "mDLBCL.13"
## [40] "mDLBCL.14" "pDLBCL.4" "pDLBCL.5"
## [43] "mDLBCL.15" "mDLBCL.16" "CHL.14"
## [46] "pDLBCL.6" "pDLBCL.7" "CHL.15"
## [49] "CHL.16" "CHL.17" "mDLBCL.17"
## [52] "mDLBCL.18" "mDLBCL.19" "CHL.18"
Let's derive the class labels for Data1 from its column names.
healthySamples <- grep('healthy',colnames(Data1))
CHLSamples <- grep('CHL',colnames(Data1))
pDLBCLSamples <- grep('pDLBCL', colnames(Data1))
mDLBCLSamples <- grep('mDLBCL', colnames(Data1))
class <- c(colnames(Data1)[healthySamples],colnames(Data1)[CHLSamples],
colnames(Data1)[pDLBCLSamples],colnames(Data1)[mDLBCLSamples])
class
## [1] "GSM2279031_healthy" "GSM2279032_healthy" "GSM2279033_healthy"
## [4] "GSM2279034_healthy" "GSM2279035_healthy" "GSM2279036_healthy"
## [7] "CHL" "CHL.1" "CHL.2"
## [10] "CHL.3" "CHL.4" "CHL.5"
## [13] "CHL.6" "CHL.7" "CHL.8"
## [16] "CHL.9" "CHL.10" "CHL.11"
## [19] "CHL.12" "CHL.13" "CHL.14"
## [22] "CHL.15" "CHL.16" "CHL.17"
## [25] "CHL.18" "pDLBCL" "pDLBCL.1"
## [28] "pDLBCL.2" "pDLBCL.3" "pDLBCL.4"
## [31] "pDLBCL.5" "pDLBCL.6" "pDLBCL.7"
## [34] "mDLBCL" "mDLBCL.1" "mDLBCL.2"
## [37] "mDLBCL.3" "mDLBCL.4" "mDLBCL.5"
## [40] "mDLBCL.6" "mDLBCL.7" "mDLBCL.8"
## [43] "mDLBCL.9" "mDLBCL.10" "mDLBCL.11"
## [46] "mDLBCL.12" "mDLBCL.13" "mDLBCL.14"
## [49] "mDLBCL.15" "mDLBCL.16" "mDLBCL.17"
## [52] "mDLBCL.18" "mDLBCL.19"
class[1:6] <- "healthy"
class[7:25] <- "CHL"
class[26:33] <- "pDLBCL"
class[34:53] <- "mDLBCL"
table(class)
## class
## CHL healthy mDLBCL pDLBCL
## 19 6 20 8
class
## [1] "healthy" "healthy" "healthy" "healthy" "healthy" "healthy" "CHL"
## [8] "CHL" "CHL" "CHL" "CHL" "CHL" "CHL" "CHL"
## [15] "CHL" "CHL" "CHL" "CHL" "CHL" "CHL" "CHL"
## [22] "CHL" "CHL" "CHL" "CHL" "pDLBCL" "pDLBCL" "pDLBCL"
## [29] "pDLBCL" "pDLBCL" "pDLBCL" "pDLBCL" "pDLBCL" "mDLBCL" "mDLBCL"
## [36] "mDLBCL" "mDLBCL" "mDLBCL" "mDLBCL" "mDLBCL" "mDLBCL" "mDLBCL"
## [43] "mDLBCL" "mDLBCL" "mDLBCL" "mDLBCL" "mDLBCL" "mDLBCL" "mDLBCL"
## [50] "mDLBCL" "mDLBCL" "mDLBCL" "mDLBCL"
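The index ranges 1:6, 7:25, 26:33 and 34:53 above have to be kept in sync with the sample counts by hand. As a hedged alternative, the labels can be derived from the column names themselves. This sketch assumes every sample name is either a group name with an optional numeric suffix or a GSM accession followed by an underscore and the group name, as in the names printed above.

```r
# A few sample names in the style of colnames(Data1).
sample_names <- c("GSM2279031_healthy", "pDLBCL", "CHL", "pDLBCL.1", "mDLBCL", "CHL.1")

# Strip a trailing ".N" suffix, then strip a leading "GSM<digits>_" prefix.
labels <- sub("\\.[0-9]+$", "", sample_names)
labels <- sub("^GSM[0-9]+_", "", labels)

labels
table(labels)
```

Because each label is computed from its own column name, the labels stay aligned with the columns in whatever order they appear.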
We also have to reorder the columns of Data1 from healthy to mDLBCL so that they match the class feature we made.
Data2 <- Data1[,c(1,healthySamples,CHLSamples,pDLBCLSamples,mDLBCLSamples)]
colnames(Data2)
## [1] "gene" "GSM2279031_healthy" "GSM2279032_healthy"
## [4] "GSM2279033_healthy" "GSM2279034_healthy" "GSM2279035_healthy"
## [7] "GSM2279036_healthy" "CHL" "CHL.1"
## [10] "CHL.2" "CHL.3" "CHL.4"
## [13] "CHL.5" "CHL.6" "CHL.7"
## [16] "CHL.8" "CHL.9" "CHL.10"
## [19] "CHL.11" "CHL.12" "CHL.13"
## [22] "CHL.14" "CHL.15" "CHL.16"
## [25] "CHL.17" "CHL.18" "pDLBCL"
## [28] "pDLBCL.1" "pDLBCL.2" "pDLBCL.3"
## [31] "pDLBCL.4" "pDLBCL.5" "pDLBCL.6"
## [34] "pDLBCL.7" "mDLBCL" "mDLBCL.1"
## [37] "mDLBCL.2" "mDLBCL.3" "mDLBCL.4"
## [40] "mDLBCL.5" "mDLBCL.6" "mDLBCL.7"
## [43] "mDLBCL.8" "mDLBCL.9" "mDLBCL.10"
## [46] "mDLBCL.11" "mDLBCL.12" "mDLBCL.13"
## [49] "mDLBCL.14" "mDLBCL.15" "mDLBCL.16"
## [52] "mDLBCL.17" "mDLBCL.18" "mDLBCL.19"
Now our class feature will line up with columns 2:54 when we add it to the matrix.
Let's build our matrices now, using the genes in the duplicated set and the uniqueTopGenes set from Data2, together with our class feature.
unique23 <- Data2[which(Data2$gene %in% uniqueTopGenes),]
duplicated20 <- Data2[which(Data2$gene %in% duplicated),]
paged_table(unique23)
paged_table(duplicated20)
duplicated20 holds the genes shared between the lymphoma lists. Some gene symbols occur in more than one row with different expression values, so duplicated20 has 22 rows rather than 20. unique23 holds all of the unique top genes across the lymphomas.
We will see how well these two sets of genes predict the class of lymphoma.
unique_mx <- data.frame(t(unique23[,2:54]))
colnames(unique_mx) <- unique23$gene
unique_mx$class <- as.factor(class)
paged_table(unique_mx)
duplicated_mx <- data.frame(t(duplicated20[,2:54]))
colnames(duplicated_mx) <- duplicated20$gene
duplicated_mx$class <- as.factor(class)
paged_table(duplicated_mx)
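Because some gene symbols occur in more than one row, assigning them directly as column names gives the transposed table repeated column names. The sketch below shows the same transpose step on made-up data, with make.unique() used so that every column name is distinct. Using make.unique() here is a suggestion, not what the code above does.

```r
# Toy gene-by-sample table; G2 appears twice with different values.
toy <- data.frame(gene = c("G1", "G2", "G2"),
                  s1 = c(1, 2, 3),
                  s2 = c(4, 5, 6))

# Transpose so that samples are rows and genes are columns.
mx <- data.frame(t(toy[, c("s1", "s2")]))

# Give repeated symbols distinct names (G2, G2.1, ...).
colnames(mx) <- make.unique(as.character(toy$gene))

# Hypothetical labels for the two toy samples.
mx$class <- factor(c("healthy", "CHL"))
colnames(mx)
```

Distinct column names make it unambiguous which column a model or a later subset refers to.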
Now we test how well the unique genes predict class in a 4-class model (healthy, CHL, pDLBCL, and mDLBCL) using a random forest classifier.
set.seed(678)
inTrain <- sample(1:53, .7*53)
training <- unique_mx[inTrain,]
testing <- unique_mx[-inTrain,]
table(training$class)
##
## CHL healthy mDLBCL pDLBCL
## 13 5 15 4
table(testing$class)
##
## CHL healthy mDLBCL pDLBCL
## 6 1 5 4
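With a simple random split, the small classes can end up unevenly divided: here healthy has 5 training samples and 1 testing sample, while pDLBCL is split 4 and 4. A stratified split is one alternative, sketched below with the class counts from the table(class) output earlier. This is an illustration in base R, not the split used in this analysis.

```r
set.seed(678)

# Class labels with the same counts as in this dataset.
labels <- factor(rep(c("CHL", "healthy", "mDLBCL", "pDLBCL"),
                     times = c(19, 6, 20, 8)))

# For each class, sample about 70% of its row indices.
by_class <- split(seq_along(labels), labels)
in_train <- unlist(lapply(by_class, function(idx) {
  idx[sample.int(length(idx), round(0.7 * length(idx)))]
}))

table(labels[in_train])
table(labels[-in_train])
```

Each class then contributes roughly the same fraction of its samples to training and testing, whatever the seed.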
rf_unq <- randomForest(training[1:25],training$class, mtry=8, ntree=5000, confusion=T)
rf_unq$confusion
## CHL healthy mDLBCL pDLBCL class.error
## CHL 10 0 3 0 0.2307692
## healthy 0 5 0 0 0.0000000
## mDLBCL 2 0 12 1 0.2000000
## pDLBCL 3 0 0 1 0.7500000
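In this out-of-bag confusion matrix on the training set, all healthy samples were predicted correctly, the CHL class was 77% correct, the mDLBCL class was 80% correct, and the pDLBCL class was only 25% correct.
Now let's see the prediction on the hold-out validation set, which is about 30% of the samples (16 of 53).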
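The per-class percentages can be reproduced from the confusion matrix itself. The sketch below retypes the counts printed above (rows are the actual class, columns the predicted class) and computes per-class and overall accuracy.

```r
# Counts copied from rf_unq$confusion, without the class.error column.
groups <- c("CHL", "healthy", "mDLBCL", "pDLBCL")
conf <- matrix(c(10, 0,  3, 0,
                  0, 5,  0, 0,
                  2, 0, 12, 1,
                  3, 0,  0, 1),
               nrow = 4, byrow = TRUE,
               dimnames = list(actual = groups, predicted = groups))

# Correct predictions sit on the diagonal.
per_class <- diag(conf) / rowSums(conf)
overall <- sum(diag(conf)) / sum(conf)

round(per_class, 3)
round(overall, 3)
```

The per-class values equal 1 minus the class.error column, and the overall figure is 28 correct out of 37 training samples.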
prediction_unq <- predict(rf_unq, testing)
results_unq <- data.frame(predicted=prediction_unq, actual=testing$class)
results_unq
set.seed(678)
inTrain <- sample(1:53, .7*53)
training <- duplicated_mx[inTrain,]
testing <- duplicated_mx[-inTrain,]
table(training$class)
##
## CHL healthy mDLBCL pDLBCL
## 13 5 15 4
table(testing$class)
##
## CHL healthy mDLBCL pDLBCL
## 6 1 5 4
rf_duplicated <- randomForest(training[1:23],training$class, mtry=8, ntree=5000, confusion=T)
rf_duplicated$confusion
## CHL healthy mDLBCL pDLBCL class.error
## CHL 13 0 0 0 0.0
## healthy 0 5 0 0 0.0
## mDLBCL 0 0 15 0 0.0
## pDLBCL 2 0 0 2 0.5
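The duplicated genes, shared between the lymphoma top-gene lists, seemed to perform much better in prediction, so we will assume these are the top genes for now. In the out-of-bag confusion matrix every class was 100% accurate except pDLBCL at only 50%, which is still better than the unique gene set. The prediction on the hold-out validation set failed with an error saying that features were missing in the testing set. One caveat: if duplicated_mx has 22 gene columns plus class, as the row count noted earlier suggests, then training[1:23] includes the class label among the predictors, which could inflate these results.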
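One way to guard against a label column or mismatched column names slipping into the predictors is to select the predictor columns by name and check them before fitting. The sketch below uses a made-up table and base R only. It is a suggestion for a follow-up run, not a diagnosis of the error above.

```r
# Toy sample-by-gene table with a class column; values are hypothetical.
toy <- data.frame(G1 = c(1.2, 3.4, 2.2, 4.1),
                  G2 = c(0.5, 0.7, 0.6, 0.9),
                  class = factor(c("healthy", "CHL", "healthy", "CHL")))
train <- toy[1:3, ]
test <- toy[4, ]

# Predictors are every column except the label.
predictors <- setdiff(colnames(train), "class")
x_train <- train[, predictors, drop = FALSE]
x_test <- test[, predictors, drop = FALSE]

# Checks to run before fitting and predicting.
stopifnot(!("class" %in% colnames(x_train)),
          !anyDuplicated(colnames(x_train)),
          identical(colnames(x_train), colnames(x_test)))
```

With the real data, x_train and train$class would go to randomForest() and x_test to predict().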