This is an R Markdown document for the gene-expression analysis of low grade glioblastama (LGG) data. There are three possible goals in genomics which we can try using this data.

  1. Class prediction: According to some labelling of samples (group 0 and group 1), can we predict the status of subjects/samples by exploiting the expression profiles?? This is a supervised machine learning algorithm where the desired output is known and it generates a function of inputs to desired outputs.

  2. Class comparison: Are there any differences in the expression profiles between labelled groups of samples. This is also a supervised machine learning algorithm.

  3. Class discovery: Are there any classses in the data groups of the samples characterised by the expression profiles. This is a unsupervised machine learning algorithm. Random Forest, Clustering algorithm etc.,

Files provided:

  1. expression_data_LGG (Genes Vz. sample)
  2. brain_LGG_subset (patient id, group)

From these two files, selected the patients into two groups.

  1. group 0 (patients who are likely to become high grade brain cancer)
  2. group 1 (patients who are likely to be remained with low grade brain cancer)

Out of 36(group 0) and 36(group 1) patients as given in brain_LGG_subset file, expression data was available only for 34(group 0) and 32(group 1) patients. Few patient data was missing.

Missing data :

  1. TCGA.DU.5872.02.RNAseq (group 0)
  2. TCGA.DU.7304.02.RNAseq (group 0)
  3. TCGA.FG.5965.02.RNAseq (group 1)
  4. TCGA.DU.6404.02.RNAseq (group 1)
  5. TCGA.DU.6407.02.RNAseq (group 1)
  6. TCGA.DU.5870.02.RNAseq (group 1)
lgg <- read.csv("~/Google Drive/0-official-MD-Anderson/Projects/lgg/expression_data_LGG.csv")
new_expression_group0 <- lgg[, which(colnames(lgg) %in% c("NAME",
                                                        "TCGA.CS.4941.01.RNAseq",
                                                        "TCGA.DU.5852.01.RNAseq",
                                                        "TCGA.DU.7292.01.RNAseq",
                                                        "TCGA.HT.7469.01.RNAseq",
                                                        "TCGA.DU.6402.01.RNAseq",
                                                        "TCGA.DU.6396.01.RNAseq",
                                                        "TCGA.DU.8161.01.RNAseq",
                                                        "TCGA.HT.8564.01.RNAseq",
                                                        "TCGA.IK.7675.01.RNAseq",
                                                        "TCGA.HT.8012.01.RNAseq",
                                                        "TCGA.E1.5303.01.RNAseq",
                                                        "TCGA.S9.A6UA.01.RNAseq",
                                                        "TCGA.DU.7013.01.RNAseq",
                                                        "TCGA.CS.6186.01.RNAseq",
                                                        "TCGA.DU.8166.01.RNAseq",
                                                        "TCGA.DU.7010.01.RNAseq",
                                                        "TCGA.DU.7298.01.RNAseq",
                                                        "TCGA.DU.5854.01.RNAseq",
                                                        "TCGA.DU.A5TP.01.RNAseq",
                                                        "TCGA.DU.A76K.01.RNAseq",
                                                        "TCGA.FG.6689.01.RNAseq",
                                                        "TCGA.S9.A6WL.01.RNAseq",
                                                        "TCGA.FG.7643.01.RNAseq",
                                                        "TCGA.DH.A669.01.RNAseq",
                                                        "TCGA.DH.A669.02.RNAseq",
                                                        "TCGA.DU.5872.01.RNAseq",
                                                        "TCGA.DU.5872.02.RNAseq",
                                                        "TCGA.HT.8011.01.RNAseq",
                                                        "TCGA.HT.7620.01.RNAseq",
                                                        "TCGA.FG.A6J3.01.RNAseq",
                                                        "TCGA.CS.5395.01.RNAseq",
                                                        "TCGA.DU.7304.01.RNAseq",
                                                        "TCGA.DU.7304.02.RNAseq",
                                                        "TCGA.DU.7300.01.RNAseq",
                                                        "TCGA.DU.7008.01.RNAseq",
                                                        "TCGA.DU.A76R.01.RNAseq"))]
new_expression_group1 <- lgg[, which(colnames(lgg) %in% c(                             
                                                        "TCGA.E1.5319.01.RNAseq",
                                                        "TCGA.TM.A7CF.01.RNAseq",
                                                        "TCGA.TM.A7CF.02.RNAseq",
                                                        "TCGA.DU.6392.01.RNAseq",
                                                        "TCGA.S9.A7R3.01.RNAseq",
                                                        "TCGA.FG.5965.01.RNAseq",
                                                        "TCGA.FG.5965.02.RNAseq",
                                                        "TCGA.E1.5322.01.RNAseq",
                                                        "TCGA.HT.7854.01.RNAseq",
                                                        "TCGA.DU.6404.01.RNAseq",
                                                        "TCGA.DU.6404.02.RNAseq",
                                                        "TCGA.DB.5277.01.RNAseq",
                                                        "TCGA.CS.4942.01.RNAseq",
                                                        "TCGA.DU.6395.01.RNAseq",
                                                        "TCGA.HT.7610.01.RNAseq",
                                                        "TCGA.E1.5302.01.RNAseq",
                                                        "TCGA.DU.7306.01.RNAseq",
                                                        "TCGA.HT.8013.01.RNAseq",
                                                        "TCGA.TM.A7C3.01.RNAseq",
                                                        "TCGA.S9.A6TS.01.RNAseq",
                                                        "TCGA.DU.7302.01.RNAseq",
                                                        "TCGA.E1.5307.01.RNAseq",
                                                        "TCGA.S9.A6TU.01.RNAseq",
                                                        "TCGA.E1.5305.01.RNAseq",
                                                        "TCGA.DU.6399.01.RNAseq",
                                                        "TCGA.DU.6401.01.RNAseq",
                                                        "TCGA.HT.7480.01.RNAseq",
                                                        "TCGA.DU.6408.01.RNAseq",
                                                        "TCGA.DU.6407.01.RNAseq",
                                                        "TCGA.DU.6407.02.RNAseq",
                                                        "TCGA.E1.5311.01.RNAseq",
                                                        "TCGA.DU.7009.01.RNAseq",
                                                        "TCGA.DU.A7T8.01.RNAseq",
                                                        "TCGA.DU.7014.01.RNAseq",
                                                        "TCGA.DU.5870.01.RNAseq",
                                                        "TCGA.DU.5870.02.RNAseq"))]
patient_list <- cbind(new_expression_group0,new_expression_group1)
head(patient_list)
##     NAME TCGA.E1.5303.01.RNAseq TCGA.DU.5854.01.RNAseq
## 1   A1BG               4.286203               3.514726
## 2  A2BP1               2.882978               2.362140
## 3    A2M               8.603549               8.554752
## 4 A4GALT               2.668210               3.638892
## 5   AAAS               4.124813               4.422150
## 6   AACS               3.315060               3.309582
##   TCGA.HT.8011.01.RNAseq TCGA.DU.6396.01.RNAseq TCGA.DU.8166.01.RNAseq
## 1               4.489565             0.05101928               2.742207
## 2               5.395035             5.67813465               6.320532
## 3               7.325603             8.37078645               6.901863
## 4               2.599386             1.95914614               2.041069
## 5               4.982690             4.09249927               3.952144
## 6               3.862184             2.44541998               3.990481
##   TCGA.DU.A76R.01.RNAseq TCGA.DH.A669.02.RNAseq TCGA.DU.6402.01.RNAseq
## 1               2.829135               5.287513               2.179651
## 2               6.436948               2.593714               3.047376
## 3               6.102486               7.419090               9.507524
## 4               4.009560               3.204934               3.164137
## 5               3.982664               5.243121               3.936601
## 6               3.804551               4.705563               3.962370
##   TCGA.FG.7643.01.RNAseq TCGA.DU.7013.01.RNAseq TCGA.FG.6689.01.RNAseq
## 1               2.150818               2.962076               3.496908
## 2               8.371212               4.271336               7.463982
## 3               7.247842               8.294226               7.368326
## 4               2.773666               2.488866               2.767250
## 5               3.469384               4.280211               3.756258
## 6               5.003389               3.794806               3.926532
##   TCGA.HT.7620.01.RNAseq TCGA.HT.8564.01.RNAseq TCGA.DU.8161.01.RNAseq
## 1               2.808414               2.275027               1.962217
## 2               3.797862               7.237400               5.668690
## 3               7.253941               6.914375               8.256354
## 4               2.559708               3.346032               2.191079
## 5               4.302771               4.225703               3.876366
## 6               3.070786               4.906326               3.557249
##   TCGA.CS.5395.01.RNAseq TCGA.DU.7010.01.RNAseq TCGA.S9.A6UA.01.RNAseq
## 1               1.221521               3.214041               4.364277
## 2               5.758602               4.805072               3.232660
## 3               7.249446               7.630540               9.052742
## 4               2.973738               2.886075               4.445530
## 5               3.461423               4.339754               4.311822
## 6               3.871781               3.229247               4.592943
##   TCGA.DU.7298.01.RNAseq TCGA.HT.7469.01.RNAseq TCGA.DU.7300.01.RNAseq
## 1               2.059453               2.516771               1.268565
## 2               6.058195               4.257595               7.614513
## 3               7.819317               6.741015               7.414865
## 4               2.655854               1.681300               3.437628
## 5               4.117105               4.758452               4.167904
## 6               3.404243               3.507530               4.729319
##   TCGA.IK.7675.01.RNAseq TCGA.DU.7304.01.RNAseq TCGA.DU.A5TP.01.RNAseq
## 1               1.115500               2.586928               1.435233
## 2               1.892098               6.728038               4.552080
## 3               7.027047               7.952403               6.987470
## 4               2.637869               3.184238               2.328974
## 5               4.248841               3.926714               4.319409
## 6               3.438810               4.078968               3.443829
##   TCGA.DU.7008.01.RNAseq TCGA.DU.5872.01.RNAseq TCGA.DH.A669.01.RNAseq
## 1               3.662504               3.653025               4.112665
## 2               6.422228               3.568195               1.269689
## 3               6.269714               8.405140               7.187093
## 4               1.058367               3.206838               2.448967
## 5               4.694215               4.109095               5.077949
## 6               4.120674               3.606391               4.274644
##   TCGA.DU.5852.01.RNAseq TCGA.S9.A6WL.01.RNAseq TCGA.FG.A6J3.01.RNAseq
## 1               4.121739               1.901054               4.739103
## 2               5.324998               6.659719               5.441802
## 3               8.539744               7.691267              10.074703
## 4               3.302489               3.026985               4.043398
## 5               4.411346               4.482058               4.011993
## 6               3.854388               4.689586               3.392795
##   TCGA.CS.6186.01.RNAseq TCGA.HT.8012.01.RNAseq TCGA.CS.4941.01.RNAseq
## 1               4.389527               3.579025               2.927597
## 2               2.400435               5.522408               6.438462
## 3               6.461194               7.272095               7.992366
## 4               2.932200               2.907919               3.631744
## 5               4.236094               4.810014               3.249270
## 6               3.541240               3.732900               3.314651
##   TCGA.DU.A76K.01.RNAseq TCGA.DU.7292.01.RNAseq TCGA.DU.A7T8.01.RNAseq
## 1               3.743090               2.796730               3.148230
## 2               7.664797               8.154592               4.239624
## 3               5.654791               7.042877               6.833341
## 4               3.562144               2.912255               4.150886
## 5               4.136664               3.590104               4.759055
## 6               4.810073               5.162783               3.979308
##   TCGA.DU.6408.01.RNAseq TCGA.DU.7302.01.RNAseq TCGA.DU.6407.01.RNAseq
## 1               3.551785              0.2953503               3.989309
## 2               5.889896              7.5138926               3.021798
## 3               7.431369              6.5453323               7.820373
## 4               2.345480              2.4611376               2.580694
## 5               4.496437              4.0025274               4.543815
## 6               3.919760              5.0471587               3.457066
##   TCGA.DU.7306.01.RNAseq TCGA.HT.7480.01.RNAseq TCGA.HT.8013.01.RNAseq
## 1              0.7821583               1.230501               3.584099
## 2             -0.1939539               6.683495               4.164705
## 3              7.7217814               6.834326               8.104305
## 4              2.7750861               2.196196               2.848842
## 5              3.7052644               4.604738               4.498313
## 6              2.6671049               4.146470               3.328954
##   TCGA.DU.6401.01.RNAseq TCGA.S9.A7R3.01.RNAseq TCGA.E1.5305.01.RNAseq
## 1               2.767727               4.038280               2.633282
## 2               2.170136               3.337787               6.484535
## 3               8.145055               7.933655               6.995512
## 4               4.078769               3.065271               2.930693
## 5               4.089603               4.452240               4.184799
## 6               2.862786               2.953221               4.635300
##   TCGA.HT.7854.01.RNAseq TCGA.TM.A7CF.01.RNAseq TCGA.E1.5302.01.RNAseq
## 1               2.039557               3.871484               2.461092
## 2               5.560066               7.628899               4.783197
## 3               7.524931               6.104328               7.481894
## 4               4.644410               2.320636               3.052576
## 5               3.355520               4.025910               4.002463
## 6               3.318029               4.630099               3.657512
##   TCGA.E1.5311.01.RNAseq TCGA.DU.6404.01.RNAseq TCGA.FG.5965.01.RNAseq
## 1               3.161000               2.706335               1.775443
## 2               6.761614               3.793245               5.337451
## 3               5.894455               7.394867               7.379567
## 4               2.709048               4.514972               3.222803
## 5               4.158479               4.555340               4.233452
## 6               4.385011               3.361979               3.716622
##   TCGA.TM.A7C3.01.RNAseq TCGA.DB.5277.01.RNAseq TCGA.S9.A6TS.01.RNAseq
## 1               4.458640               4.047202               4.529308
## 2               4.805218               5.788962               6.340151
## 3               7.325303               7.328254               7.884674
## 4               3.471761               3.148262               3.214321
## 5               4.495663               4.492886               4.758163
## 6               3.377978               3.974205               4.388189
##   TCGA.DU.7009.01.RNAseq TCGA.E1.5307.01.RNAseq TCGA.E1.5319.01.RNAseq
## 1               1.944279               2.987277               2.454796
## 2               6.804482               6.594915               4.208865
## 3               6.797354               7.111364               7.068633
## 4               1.791766               1.393997               3.098703
## 5               4.391662               4.098815               4.204807
## 6               3.708076               4.152721               3.477179
##   TCGA.HT.7610.01.RNAseq TCGA.DU.6395.01.RNAseq TCGA.DU.6392.01.RNAseq
## 1               1.538702               4.550651               3.408787
## 2               8.252028               4.151717               5.151905
## 3               6.685914               7.386582               8.522442
## 4               1.818870               2.307917               3.252413
## 5               3.815857               4.627877               3.855580
## 6               5.205683               3.788052               3.885722
##   TCGA.E1.5322.01.RNAseq TCGA.S9.A6TU.01.RNAseq TCGA.DU.7014.01.RNAseq
## 1               2.979066               3.979370               2.588235
## 2               4.852762               2.838277               4.138658
## 3               7.717477               7.909081               8.053413
## 4               2.129446               4.026027               3.602164
## 5               4.535491               4.812407               4.680492
## 6               3.753070               3.026521               3.565325
##   TCGA.DU.5870.01.RNAseq TCGA.DU.6399.01.RNAseq TCGA.TM.A7CF.02.RNAseq
## 1               1.252023               2.034414               4.688094
## 2               3.754004               2.395654               5.316216
## 3               7.744321               8.088247               5.696277
## 4               2.731204               2.604539               2.807492
## 5               4.450772               4.420419               4.749371
## 6               3.456624               3.739150               3.780533
##   TCGA.CS.4942.01.RNAseq
## 1               2.967074
## 2               6.010099
## 3               8.096648
## 4               2.107511
## 5               3.901328
## 6               3.671747
rownames(patient_list) = make.names(patient_list$NAME, unique=TRUE)
head(patient_list)
##          NAME TCGA.E1.5303.01.RNAseq TCGA.DU.5854.01.RNAseq
## A1BG     A1BG               4.286203               3.514726
## A2BP1   A2BP1               2.882978               2.362140
## A2M       A2M               8.603549               8.554752
## A4GALT A4GALT               2.668210               3.638892
## AAAS     AAAS               4.124813               4.422150
## AACS     AACS               3.315060               3.309582
##        TCGA.HT.8011.01.RNAseq TCGA.DU.6396.01.RNAseq
## A1BG                 4.489565             0.05101928
## A2BP1                5.395035             5.67813465
## A2M                  7.325603             8.37078645
## A4GALT               2.599386             1.95914614
## AAAS                 4.982690             4.09249927
## AACS                 3.862184             2.44541998
##        TCGA.DU.8166.01.RNAseq TCGA.DU.A76R.01.RNAseq
## A1BG                 2.742207               2.829135
## A2BP1                6.320532               6.436948
## A2M                  6.901863               6.102486
## A4GALT               2.041069               4.009560
## AAAS                 3.952144               3.982664
## AACS                 3.990481               3.804551
##        TCGA.DH.A669.02.RNAseq TCGA.DU.6402.01.RNAseq
## A1BG                 5.287513               2.179651
## A2BP1                2.593714               3.047376
## A2M                  7.419090               9.507524
## A4GALT               3.204934               3.164137
## AAAS                 5.243121               3.936601
## AACS                 4.705563               3.962370
##        TCGA.FG.7643.01.RNAseq TCGA.DU.7013.01.RNAseq
## A1BG                 2.150818               2.962076
## A2BP1                8.371212               4.271336
## A2M                  7.247842               8.294226
## A4GALT               2.773666               2.488866
## AAAS                 3.469384               4.280211
## AACS                 5.003389               3.794806
##        TCGA.FG.6689.01.RNAseq TCGA.HT.7620.01.RNAseq
## A1BG                 3.496908               2.808414
## A2BP1                7.463982               3.797862
## A2M                  7.368326               7.253941
## A4GALT               2.767250               2.559708
## AAAS                 3.756258               4.302771
## AACS                 3.926532               3.070786
##        TCGA.HT.8564.01.RNAseq TCGA.DU.8161.01.RNAseq
## A1BG                 2.275027               1.962217
## A2BP1                7.237400               5.668690
## A2M                  6.914375               8.256354
## A4GALT               3.346032               2.191079
## AAAS                 4.225703               3.876366
## AACS                 4.906326               3.557249
##        TCGA.CS.5395.01.RNAseq TCGA.DU.7010.01.RNAseq
## A1BG                 1.221521               3.214041
## A2BP1                5.758602               4.805072
## A2M                  7.249446               7.630540
## A4GALT               2.973738               2.886075
## AAAS                 3.461423               4.339754
## AACS                 3.871781               3.229247
##        TCGA.S9.A6UA.01.RNAseq TCGA.DU.7298.01.RNAseq
## A1BG                 4.364277               2.059453
## A2BP1                3.232660               6.058195
## A2M                  9.052742               7.819317
## A4GALT               4.445530               2.655854
## AAAS                 4.311822               4.117105
## AACS                 4.592943               3.404243
##        TCGA.HT.7469.01.RNAseq TCGA.DU.7300.01.RNAseq
## A1BG                 2.516771               1.268565
## A2BP1                4.257595               7.614513
## A2M                  6.741015               7.414865
## A4GALT               1.681300               3.437628
## AAAS                 4.758452               4.167904
## AACS                 3.507530               4.729319
##        TCGA.IK.7675.01.RNAseq TCGA.DU.7304.01.RNAseq
## A1BG                 1.115500               2.586928
## A2BP1                1.892098               6.728038
## A2M                  7.027047               7.952403
## A4GALT               2.637869               3.184238
## AAAS                 4.248841               3.926714
## AACS                 3.438810               4.078968
##        TCGA.DU.A5TP.01.RNAseq TCGA.DU.7008.01.RNAseq
## A1BG                 1.435233               3.662504
## A2BP1                4.552080               6.422228
## A2M                  6.987470               6.269714
## A4GALT               2.328974               1.058367
## AAAS                 4.319409               4.694215
## AACS                 3.443829               4.120674
##        TCGA.DU.5872.01.RNAseq TCGA.DH.A669.01.RNAseq
## A1BG                 3.653025               4.112665
## A2BP1                3.568195               1.269689
## A2M                  8.405140               7.187093
## A4GALT               3.206838               2.448967
## AAAS                 4.109095               5.077949
## AACS                 3.606391               4.274644
##        TCGA.DU.5852.01.RNAseq TCGA.S9.A6WL.01.RNAseq
## A1BG                 4.121739               1.901054
## A2BP1                5.324998               6.659719
## A2M                  8.539744               7.691267
## A4GALT               3.302489               3.026985
## AAAS                 4.411346               4.482058
## AACS                 3.854388               4.689586
##        TCGA.FG.A6J3.01.RNAseq TCGA.CS.6186.01.RNAseq
## A1BG                 4.739103               4.389527
## A2BP1                5.441802               2.400435
## A2M                 10.074703               6.461194
## A4GALT               4.043398               2.932200
## AAAS                 4.011993               4.236094
## AACS                 3.392795               3.541240
##        TCGA.HT.8012.01.RNAseq TCGA.CS.4941.01.RNAseq
## A1BG                 3.579025               2.927597
## A2BP1                5.522408               6.438462
## A2M                  7.272095               7.992366
## A4GALT               2.907919               3.631744
## AAAS                 4.810014               3.249270
## AACS                 3.732900               3.314651
##        TCGA.DU.A76K.01.RNAseq TCGA.DU.7292.01.RNAseq
## A1BG                 3.743090               2.796730
## A2BP1                7.664797               8.154592
## A2M                  5.654791               7.042877
## A4GALT               3.562144               2.912255
## AAAS                 4.136664               3.590104
## AACS                 4.810073               5.162783
##        TCGA.DU.A7T8.01.RNAseq TCGA.DU.6408.01.RNAseq
## A1BG                 3.148230               3.551785
## A2BP1                4.239624               5.889896
## A2M                  6.833341               7.431369
## A4GALT               4.150886               2.345480
## AAAS                 4.759055               4.496437
## AACS                 3.979308               3.919760
##        TCGA.DU.7302.01.RNAseq TCGA.DU.6407.01.RNAseq
## A1BG                0.2953503               3.989309
## A2BP1               7.5138926               3.021798
## A2M                 6.5453323               7.820373
## A4GALT              2.4611376               2.580694
## AAAS                4.0025274               4.543815
## AACS                5.0471587               3.457066
##        TCGA.DU.7306.01.RNAseq TCGA.HT.7480.01.RNAseq
## A1BG                0.7821583               1.230501
## A2BP1              -0.1939539               6.683495
## A2M                 7.7217814               6.834326
## A4GALT              2.7750861               2.196196
## AAAS                3.7052644               4.604738
## AACS                2.6671049               4.146470
##        TCGA.HT.8013.01.RNAseq TCGA.DU.6401.01.RNAseq
## A1BG                 3.584099               2.767727
## A2BP1                4.164705               2.170136
## A2M                  8.104305               8.145055
## A4GALT               2.848842               4.078769
## AAAS                 4.498313               4.089603
## AACS                 3.328954               2.862786
##        TCGA.S9.A7R3.01.RNAseq TCGA.E1.5305.01.RNAseq
## A1BG                 4.038280               2.633282
## A2BP1                3.337787               6.484535
## A2M                  7.933655               6.995512
## A4GALT               3.065271               2.930693
## AAAS                 4.452240               4.184799
## AACS                 2.953221               4.635300
##        TCGA.HT.7854.01.RNAseq TCGA.TM.A7CF.01.RNAseq
## A1BG                 2.039557               3.871484
## A2BP1                5.560066               7.628899
## A2M                  7.524931               6.104328
## A4GALT               4.644410               2.320636
## AAAS                 3.355520               4.025910
## AACS                 3.318029               4.630099
##        TCGA.E1.5302.01.RNAseq TCGA.E1.5311.01.RNAseq
## A1BG                 2.461092               3.161000
## A2BP1                4.783197               6.761614
## A2M                  7.481894               5.894455
## A4GALT               3.052576               2.709048
## AAAS                 4.002463               4.158479
## AACS                 3.657512               4.385011
##        TCGA.DU.6404.01.RNAseq TCGA.FG.5965.01.RNAseq
## A1BG                 2.706335               1.775443
## A2BP1                3.793245               5.337451
## A2M                  7.394867               7.379567
## A4GALT               4.514972               3.222803
## AAAS                 4.555340               4.233452
## AACS                 3.361979               3.716622
##        TCGA.TM.A7C3.01.RNAseq TCGA.DB.5277.01.RNAseq
## A1BG                 4.458640               4.047202
## A2BP1                4.805218               5.788962
## A2M                  7.325303               7.328254
## A4GALT               3.471761               3.148262
## AAAS                 4.495663               4.492886
## AACS                 3.377978               3.974205
##        TCGA.S9.A6TS.01.RNAseq TCGA.DU.7009.01.RNAseq
## A1BG                 4.529308               1.944279
## A2BP1                6.340151               6.804482
## A2M                  7.884674               6.797354
## A4GALT               3.214321               1.791766
## AAAS                 4.758163               4.391662
## AACS                 4.388189               3.708076
##        TCGA.E1.5307.01.RNAseq TCGA.E1.5319.01.RNAseq
## A1BG                 2.987277               2.454796
## A2BP1                6.594915               4.208865
## A2M                  7.111364               7.068633
## A4GALT               1.393997               3.098703
## AAAS                 4.098815               4.204807
## AACS                 4.152721               3.477179
##        TCGA.HT.7610.01.RNAseq TCGA.DU.6395.01.RNAseq
## A1BG                 1.538702               4.550651
## A2BP1                8.252028               4.151717
## A2M                  6.685914               7.386582
## A4GALT               1.818870               2.307917
## AAAS                 3.815857               4.627877
## AACS                 5.205683               3.788052
##        TCGA.DU.6392.01.RNAseq TCGA.E1.5322.01.RNAseq
## A1BG                 3.408787               2.979066
## A2BP1                5.151905               4.852762
## A2M                  8.522442               7.717477
## A4GALT               3.252413               2.129446
## AAAS                 3.855580               4.535491
## AACS                 3.885722               3.753070
##        TCGA.S9.A6TU.01.RNAseq TCGA.DU.7014.01.RNAseq
## A1BG                 3.979370               2.588235
## A2BP1                2.838277               4.138658
## A2M                  7.909081               8.053413
## A4GALT               4.026027               3.602164
## AAAS                 4.812407               4.680492
## AACS                 3.026521               3.565325
##        TCGA.DU.5870.01.RNAseq TCGA.DU.6399.01.RNAseq
## A1BG                 1.252023               2.034414
## A2BP1                3.754004               2.395654
## A2M                  7.744321               8.088247
## A4GALT               2.731204               2.604539
## AAAS                 4.450772               4.420419
## AACS                 3.456624               3.739150
##        TCGA.TM.A7CF.02.RNAseq TCGA.CS.4942.01.RNAseq
## A1BG                 4.688094               2.967074
## A2BP1                5.316216               6.010099
## A2M                  5.696277               8.096648
## A4GALT               2.807492               2.107511
## AAAS                 4.749371               3.901328
## AACS                 3.780533               3.671747

Now we need to remove the additional column of Gene and write the selected subset of patients gene expression data to a new file patient_list.csv Further the sampleinfo.LGG.csv file is written with patient id and the corresponding group.

patient_list$NAME<-NULL
head(patient_list)
##        TCGA.E1.5303.01.RNAseq TCGA.DU.5854.01.RNAseq
## A1BG                 4.286203               3.514726
## A2BP1                2.882978               2.362140
## A2M                  8.603549               8.554752
## A4GALT               2.668210               3.638892
## AAAS                 4.124813               4.422150
## AACS                 3.315060               3.309582
##        TCGA.HT.8011.01.RNAseq TCGA.DU.6396.01.RNAseq
## A1BG                 4.489565             0.05101928
## A2BP1                5.395035             5.67813465
## A2M                  7.325603             8.37078645
## A4GALT               2.599386             1.95914614
## AAAS                 4.982690             4.09249927
## AACS                 3.862184             2.44541998
##        TCGA.DU.8166.01.RNAseq TCGA.DU.A76R.01.RNAseq
## A1BG                 2.742207               2.829135
## A2BP1                6.320532               6.436948
## A2M                  6.901863               6.102486
## A4GALT               2.041069               4.009560
## AAAS                 3.952144               3.982664
## AACS                 3.990481               3.804551
##        TCGA.DH.A669.02.RNAseq TCGA.DU.6402.01.RNAseq
## A1BG                 5.287513               2.179651
## A2BP1                2.593714               3.047376
## A2M                  7.419090               9.507524
## A4GALT               3.204934               3.164137
## AAAS                 5.243121               3.936601
## AACS                 4.705563               3.962370
##        TCGA.FG.7643.01.RNAseq TCGA.DU.7013.01.RNAseq
## A1BG                 2.150818               2.962076
## A2BP1                8.371212               4.271336
## A2M                  7.247842               8.294226
## A4GALT               2.773666               2.488866
## AAAS                 3.469384               4.280211
## AACS                 5.003389               3.794806
##        TCGA.FG.6689.01.RNAseq TCGA.HT.7620.01.RNAseq
## A1BG                 3.496908               2.808414
## A2BP1                7.463982               3.797862
## A2M                  7.368326               7.253941
## A4GALT               2.767250               2.559708
## AAAS                 3.756258               4.302771
## AACS                 3.926532               3.070786
##        TCGA.HT.8564.01.RNAseq TCGA.DU.8161.01.RNAseq
## A1BG                 2.275027               1.962217
## A2BP1                7.237400               5.668690
## A2M                  6.914375               8.256354
## A4GALT               3.346032               2.191079
## AAAS                 4.225703               3.876366
## AACS                 4.906326               3.557249
##        TCGA.CS.5395.01.RNAseq TCGA.DU.7010.01.RNAseq
## A1BG                 1.221521               3.214041
## A2BP1                5.758602               4.805072
## A2M                  7.249446               7.630540
## A4GALT               2.973738               2.886075
## AAAS                 3.461423               4.339754
## AACS                 3.871781               3.229247
##        TCGA.S9.A6UA.01.RNAseq TCGA.DU.7298.01.RNAseq
## A1BG                 4.364277               2.059453
## A2BP1                3.232660               6.058195
## A2M                  9.052742               7.819317
## A4GALT               4.445530               2.655854
## AAAS                 4.311822               4.117105
## AACS                 4.592943               3.404243
##        TCGA.HT.7469.01.RNAseq TCGA.DU.7300.01.RNAseq
## A1BG                 2.516771               1.268565
## A2BP1                4.257595               7.614513
## A2M                  6.741015               7.414865
## A4GALT               1.681300               3.437628
## AAAS                 4.758452               4.167904
## AACS                 3.507530               4.729319
##        TCGA.IK.7675.01.RNAseq TCGA.DU.7304.01.RNAseq
## A1BG                 1.115500               2.586928
## A2BP1                1.892098               6.728038
## A2M                  7.027047               7.952403
## A4GALT               2.637869               3.184238
## AAAS                 4.248841               3.926714
## AACS                 3.438810               4.078968
##        TCGA.DU.A5TP.01.RNAseq TCGA.DU.7008.01.RNAseq
## A1BG                 1.435233               3.662504
## A2BP1                4.552080               6.422228
## A2M                  6.987470               6.269714
## A4GALT               2.328974               1.058367
## AAAS                 4.319409               4.694215
## AACS                 3.443829               4.120674
##        TCGA.DU.5872.01.RNAseq TCGA.DH.A669.01.RNAseq
## A1BG                 3.653025               4.112665
## A2BP1                3.568195               1.269689
## A2M                  8.405140               7.187093
## A4GALT               3.206838               2.448967
## AAAS                 4.109095               5.077949
## AACS                 3.606391               4.274644
##        TCGA.DU.5852.01.RNAseq TCGA.S9.A6WL.01.RNAseq
## A1BG                 4.121739               1.901054
## A2BP1                5.324998               6.659719
## A2M                  8.539744               7.691267
## A4GALT               3.302489               3.026985
## AAAS                 4.411346               4.482058
## AACS                 3.854388               4.689586
##        TCGA.FG.A6J3.01.RNAseq TCGA.CS.6186.01.RNAseq
## A1BG                 4.739103               4.389527
## A2BP1                5.441802               2.400435
## A2M                 10.074703               6.461194
## A4GALT               4.043398               2.932200
## AAAS                 4.011993               4.236094
## AACS                 3.392795               3.541240
##        TCGA.HT.8012.01.RNAseq TCGA.CS.4941.01.RNAseq
## A1BG                 3.579025               2.927597
## A2BP1                5.522408               6.438462
## A2M                  7.272095               7.992366
## A4GALT               2.907919               3.631744
## AAAS                 4.810014               3.249270
## AACS                 3.732900               3.314651
##        TCGA.DU.A76K.01.RNAseq TCGA.DU.7292.01.RNAseq
## A1BG                 3.743090               2.796730
## A2BP1                7.664797               8.154592
## A2M                  5.654791               7.042877
## A4GALT               3.562144               2.912255
## AAAS                 4.136664               3.590104
## AACS                 4.810073               5.162783
##        TCGA.DU.A7T8.01.RNAseq TCGA.DU.6408.01.RNAseq
## A1BG                 3.148230               3.551785
## A2BP1                4.239624               5.889896
## A2M                  6.833341               7.431369
## A4GALT               4.150886               2.345480
## AAAS                 4.759055               4.496437
## AACS                 3.979308               3.919760
##        TCGA.DU.7302.01.RNAseq TCGA.DU.6407.01.RNAseq
## A1BG                0.2953503               3.989309
## A2BP1               7.5138926               3.021798
## A2M                 6.5453323               7.820373
## A4GALT              2.4611376               2.580694
## AAAS                4.0025274               4.543815
## AACS                5.0471587               3.457066
##        TCGA.DU.7306.01.RNAseq TCGA.HT.7480.01.RNAseq
## A1BG                0.7821583               1.230501
## A2BP1              -0.1939539               6.683495
## A2M                 7.7217814               6.834326
## A4GALT              2.7750861               2.196196
## AAAS                3.7052644               4.604738
## AACS                2.6671049               4.146470
##        TCGA.HT.8013.01.RNAseq TCGA.DU.6401.01.RNAseq
## A1BG                 3.584099               2.767727
## A2BP1                4.164705               2.170136
## A2M                  8.104305               8.145055
## A4GALT               2.848842               4.078769
## AAAS                 4.498313               4.089603
## AACS                 3.328954               2.862786
##        TCGA.S9.A7R3.01.RNAseq TCGA.E1.5305.01.RNAseq
## A1BG                 4.038280               2.633282
## A2BP1                3.337787               6.484535
## A2M                  7.933655               6.995512
## A4GALT               3.065271               2.930693
## AAAS                 4.452240               4.184799
## AACS                 2.953221               4.635300
##        TCGA.HT.7854.01.RNAseq TCGA.TM.A7CF.01.RNAseq
## A1BG                 2.039557               3.871484
## A2BP1                5.560066               7.628899
## A2M                  7.524931               6.104328
## A4GALT               4.644410               2.320636
## AAAS                 3.355520               4.025910
## AACS                 3.318029               4.630099
##        TCGA.E1.5302.01.RNAseq TCGA.E1.5311.01.RNAseq
## A1BG                 2.461092               3.161000
## A2BP1                4.783197               6.761614
## A2M                  7.481894               5.894455
## A4GALT               3.052576               2.709048
## AAAS                 4.002463               4.158479
## AACS                 3.657512               4.385011
##        TCGA.DU.6404.01.RNAseq TCGA.FG.5965.01.RNAseq
## A1BG                 2.706335               1.775443
## A2BP1                3.793245               5.337451
## A2M                  7.394867               7.379567
## A4GALT               4.514972               3.222803
## AAAS                 4.555340               4.233452
## AACS                 3.361979               3.716622
##        TCGA.TM.A7C3.01.RNAseq TCGA.DB.5277.01.RNAseq
## A1BG                 4.458640               4.047202
## A2BP1                4.805218               5.788962
## A2M                  7.325303               7.328254
## A4GALT               3.471761               3.148262
## AAAS                 4.495663               4.492886
## AACS                 3.377978               3.974205
##        TCGA.S9.A6TS.01.RNAseq TCGA.DU.7009.01.RNAseq
## A1BG                 4.529308               1.944279
## A2BP1                6.340151               6.804482
## A2M                  7.884674               6.797354
## A4GALT               3.214321               1.791766
## AAAS                 4.758163               4.391662
## AACS                 4.388189               3.708076
##        TCGA.E1.5307.01.RNAseq TCGA.E1.5319.01.RNAseq
## A1BG                 2.987277               2.454796
## A2BP1                6.594915               4.208865
## A2M                  7.111364               7.068633
## A4GALT               1.393997               3.098703
## AAAS                 4.098815               4.204807
## AACS                 4.152721               3.477179
##        TCGA.HT.7610.01.RNAseq TCGA.DU.6395.01.RNAseq
## A1BG                 1.538702               4.550651
## A2BP1                8.252028               4.151717
## A2M                  6.685914               7.386582
## A4GALT               1.818870               2.307917
## AAAS                 3.815857               4.627877
## AACS                 5.205683               3.788052
##        TCGA.DU.6392.01.RNAseq TCGA.E1.5322.01.RNAseq
## A1BG                 3.408787               2.979066
## A2BP1                5.151905               4.852762
## A2M                  8.522442               7.717477
## A4GALT               3.252413               2.129446
## AAAS                 3.855580               4.535491
## AACS                 3.885722               3.753070
##        TCGA.S9.A6TU.01.RNAseq TCGA.DU.7014.01.RNAseq
## A1BG                 3.979370               2.588235
## A2BP1                2.838277               4.138658
## A2M                  7.909081               8.053413
## A4GALT               4.026027               3.602164
## AAAS                 4.812407               4.680492
## AACS                 3.026521               3.565325
##        TCGA.DU.5870.01.RNAseq TCGA.DU.6399.01.RNAseq
## A1BG                 1.252023               2.034414
## A2BP1                3.754004               2.395654
## A2M                  7.744321               8.088247
## A4GALT               2.731204               2.604539
## AAAS                 4.450772               4.420419
## AACS                 3.456624               3.739150
##        TCGA.TM.A7CF.02.RNAseq TCGA.CS.4942.01.RNAseq
## A1BG                 4.688094               2.967074
## A2BP1                5.316216               6.010099
## A2M                  5.696277               8.096648
## A4GALT               2.807492               2.107511
## AAAS                 4.749371               3.901328
## AACS                 3.780533               3.671747
write.csv(patient_list,file = "patient_list.csv")
sampleinfo <- read.csv("~/Google Drive/0-official-MD-Anderson/Projects/lgg/sampleinfo.LGG.csv")
  1. 34 columns - group0
  2. 32 columns - group1

Now we use bioconductor software package to filter genes which have biological impact and are also statistically reliable.

source("http://bioconductor.org/biocLite.R")
## Bioconductor version 3.1 (BiocInstaller 1.18.5), ?biocLite for help
## A newer version of Bioconductor is available for this version of R,
##   ?BiocUpgrade for help
LGGmat<-as.matrix(sapply(patient_list, as.numeric))
group<-factor(sampleinfo$group)
row.names(LGGmat) <- rownames(patient_list)

Plot density plot to vishualize the data distributions. In this context, it is used to find the outlier samples.

plot ( density ( LGGmat[,1], na.rm=TRUE ), xlim = c(0,16), ylim = c(0,0.35), 
       main = "Density plot", xlab = "Intensity" )
for(i in 2: ncol ( LGGmat ) ) {  
    points ( density ( LGGmat[,i], na.rm = TRUE ), type = "l", col = rainbow(20)[i] ) 
}

In addition, we also do check the normalization of the data

biocLite("vsn")
## BioC_mirror: http://bioconductor.org
## Using Bioconductor version 3.1 (BiocInstaller 1.18.5), R version 3.2.2.
## Installing package(s) 'vsn'
## 
## The downloaded binary packages are in
##  /var/folders/71/vnhmr1ts6w354s54vd5n1xw80000gn/T//RtmpxJzkJx/downloaded_packages
library("vsn")
## Loading required package: Biobase
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## 
## The following object is masked from 'package:stats':
## 
##     xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, as.vector, cbind,
##     colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
##     intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unlist, unsplit
## 
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
meanSdPlot(LGGmat)

The meanSdPlot shows some dependence between mean and standard deviation. The median is shown in red lines.

I still went ahead considering that the data is fine for the time being. (need discussion with Dr. Rao)

In addition, we also check the boxplot, general scatter plot, MA plot, vulcano plot and heatmap of the expression data. If we make the plots with all the data it will not be easily understandable. Therefore, we do some filtering using rowmeans for each gene

meanThresh <- log2(50)
filt1 <- rowMeans(LGGmat[, group == 0]) > meanThresh
filt2 <- rowMeans(LGGmat[, group == 1])> meanThresh
selProbes <- (filt1 | filt2)
LGGmatfilt <- LGGmat[selProbes, ]
dim(LGGmatfilt)
## [1] 2422   66
rowMeans(LGGmatfilt)[1:10]
##      A2M     AAMP     AARS     AASS     AATK     ABAT    ABCA1    ABCA2 
## 7.487085 5.595376 6.324315 5.679096 5.831121 8.082848 6.694927 6.822409 
##    ABCA3    ABCA8 
## 7.281692 6.005804

scatter plot

plot(LGGmatfilt,col=densCols(LGGmatfilt),pch=16,cex=.4)

boxplot

boxplot(LGGmatfilt)

MA Plot

x<-LGGmatfilt[,1]
y<-LGGmatfilt[,3]
plot(x,y,main=paste0("corr=",signif(cor(x,y))))

plot((x+y)/2,y-x,main=paste0("corr=",signif(cor(x,y))))

#Let us do this using log transformed values

x<-log2(LGGmatfilt[,1])
y<-log2(LGGmatfilt[,3])
plot(x,y,main=paste0("corr=",signif(cor(x,y))))

plot((x+y)/2,y-x,main=paste0("corr=",signif(cor(x,y))))

Seems like there is one sample in the group =0 has an outlier. For the time being, I am keeping the calculations taking all samples (66).

Lets do some class discovery using clustering algorithms which is an unsupervised machine learning.

Complex Heatmap for understanding the data. Here the rows and columns are re-organized using hierarchical clustering. Dendrograms on the side will give the idea where actually the gene was originally in the data.

dim(LGGmatfilt)
## [1] 2422   66
biocLite("ComplexHeatmap")
## BioC_mirror: http://bioconductor.org
## Using Bioconductor version 3.1 (BiocInstaller 1.18.5), R version 3.2.2.
## Installing package(s) 'ComplexHeatmap'
## 
## The downloaded binary packages are in
##  /var/folders/71/vnhmr1ts6w354s54vd5n1xw80000gn/T//RtmpxJzkJx/downloaded_packages
library(ComplexHeatmap)
## Loading required package: grid
library(circlize)
library(colorspace)
library(GetoptLong)
## Warning: package 'GetoptLong' was built under R version 3.2.3
## 
## Attaching package: 'GetoptLong'
## 
## The following object is masked from 'package:base':
## 
##     source
Heatmap(LGGmatfilt, name = "key", column_title = "sample", 
        row_title ="gene",
    column_title_gp = gpar(fontsize = 10, fontface = "bold",column_title_side="bottom",show_row_names = FALSE,show_column_names=FALSE))

This shows around 5 clusters in genes…Seems like a good variation in the expression level among these groups. But the original clustering/groups in the sample data is not predicted.

Trying to understand clustering starting with hierarchial clustering!!!!!

cluster columns

install.packages("rafalib")
## 
## The downloaded binary packages are in
##  /var/folders/71/vnhmr1ts6w354s54vd5n1xw80000gn/T//RtmpxJzkJx/downloaded_packages
library(rafalib)

#####previous way of installing rafalib library from github
#library(devtools)
#install_github("ririzarr/rafalib")
#####################################

mypar(1,2)
democlust<-hclust(dist(t(LGGmatfilt)))

groupchar<-as.character(group)
plot(democlust,label=group,cex=0.5,main="cluster columns - plot dendrogram")

# To see each of the group (0,1) in different colours and seperated, we use
#myplclust from rafalib library

myplclust(democlust,cex=0.5,labels=group,lab.col=as.fumeric(groupchar),main="cluster columns - myplclust dendrogram")

democut<-cutree(democlust,h=70)
abline(h=70)
#democut
tab<-table(true=group,cluster=democut)
ftable(tab)
##      cluster  1  2  3  4  5  6  7  8
## true                                
## 0             7  3  4 15  2  1  1  1
## 1            13  2  1 16  0  0  0  0
democut<-cutree(democlust,h=95)
abline(h=95)

#democut
tab<-table(true=group,cluster=democut)
ftable(tab)
##      cluster  1  2
## true              
## 0            22 12
## 1            29  3

We can see that the groups (0,1) are not seperated as in the original data. They are all mixed. Now lets to the clustering of rows (ie., genes)

cluster rows

hc.rows <- hclust(dist(LGGmatfilt))
plot(hc.rows,labels=FALSE,main="row/gene cluster")

transpose the matrix and cluster columns

hc.cols <- hclust(dist(t(LGGmatfilt)))
plot(hc.cols,labels=FALSE,main="column/group cluster")

mypar(1,2)
install.packages("RColorBrewer")
## 
## The downloaded binary packages are in
##  /var/folders/71/vnhmr1ts6w354s54vd5n1xw80000gn/T//RtmpxJzkJx/downloaded_packages
library("RColorBrewer")
heatmap(LGGmatfilt,Colv=as.dendrogram(hc.cols), scale='none',main="row-column/gene-group clustering",col=brewer.pal(11,"BrBG"))

## Colv = NA keep the column index without reordering - so it will be in the same order as the data is provided
heatmap(LGGmatfilt,Colv=NA,scale='none',col=brewer.pal(11,"BrBG"),main="with normal sample order")

#Heatmap.2 using gplots: Another way of doing heatmaps.

install.packages("gplots")
## 
## The downloaded binary packages are in
##  /var/folders/71/vnhmr1ts6w354s54vd5n1xw80000gn/T//RtmpxJzkJx/downloaded_packages
library(gplots)
## 
## Attaching package: 'gplots'
## 
## The following object is masked from 'package:stats':
## 
##     lowess
heatmap.2(LGGmatfilt,Colv=as.dendrogram(hc.cols),Rowv=as.dendrogram(hc.rows),scale='none',col=brewer.pal(11,"BrBG"),main="row-column/gene-group clustering",trace="none")

Here, with colv=NA (no reordering of columns) - we don’t find any type of difference in expression values between group 0 and group 1 samples. But there is clustering among genes and we find some of them are expressed more and hence in a different cluster.

More clustering by imaging the matrix

d <- dist(t(LGGmat))
hc <- hclust(d, method = "complete")

We now split the dendrogram into two clusters (using the function cutree) and compare the resulting clusters with the true classes. We first compute a contingency table and then apply an independence test.

groups <- cutree(hc, k = 2)
table(groups, group)
##       group
## groups  0  1
##      1 31 32
##      2  3  0
fisher.test(groups, group)$p.value
## [1] 0.2391608

Null hypothesis : True labels and the cluster results are independent. P value (0.239) says that we cannot reject the null at a statistical significance of 5%.

To improve our results, we should try to avoid using genes that contribute just noise and no information. A simple approach is to exclude all genes that show no variance across all samples.

genes.var <- apply(LGGmat, 1, var)
genes.var.select <- order(genes.var, decreasing = T)[1:100]
dat.s <- LGGmat[genes.var.select, ]
d.s <- dist(t(dat.s))
hc.s <- hclust(d.s, method = "complete")

Lets divide the sample into two groups and make the contingency table and check the independence.

groups.s <- cutree(hc.s, k = 2)
table(groups.s, group)
##         group
## groups.s  0  1
##        1 20 29
##        2 14  3
fisher.test(groups.s, group)$p.value
## [1] 0.004394799

Reducing the number of genes and using only high variance genes reduced the p.value (0.004) which supports the alternative hypothesis. The real groups and the groups obtained now are dependent.

To check further, lets take 100 random genes and repeat the same procedure as above.

genes.random.select <- sample(nrow(LGGmat), 100)
dat.r <- LGGmat[genes.random.select, ]
d.r <- dist(t(dat.r))
hc.r <- hclust(d.r, method = "complete")

groups.r <- cutree(hc.r, k = 2)
table(groups.r, group)
##         group
## groups.r  0  1
##        1 22 29
##        2 12  3
fisher.test(groups.r, group)$p.value
## [1] 0.01784544

p.value(0.259) says that again the null hypothesis has to be true. No dependence between groups obtained here and the groups already made.

par(mfrow = c(1, 3))
plot(hc, labels = group,main="complete genes")
#The plot shows that the group 0 and group 1 samples are not well seperated.
plot(hc.r, labels = group,main="100 random genes")
#This also did not give the correct seperation. 
plot(hc.s, labels = group,main="100 high variance genes")

For all the three cases discussed above (all genes, 100 random genes, 100 high variance genes, we plot the image, cluster plot and see how they are seperated)

par(mfrow = c(1, 3))
image(as.matrix(d),main="complete genes");
image(as.matrix(d.r),main="100 random genes");
image(as.matrix(d.s),main="100 high-variance genes");

install.packages("cluster")
## 
## The downloaded binary packages are in
##  /var/folders/71/vnhmr1ts6w354s54vd5n1xw80000gn/T//RtmpxJzkJx/downloaded_packages
library(cluster)

kc.r <- kmeans(dat.r, centers=2)
kc.s <- kmeans(dat.s, centers=2)
kc<-kmeans(LGGmatfilt,centers=2)
par(mfrow = c(1, 3))
library(cluster)
install.packages('fpc')
## 
## The downloaded binary packages are in
##  /var/folders/71/vnhmr1ts6w354s54vd5n1xw80000gn/T//RtmpxJzkJx/downloaded_packages
library(fpc)
plotcluster(LGGmatfilt, kc$cluster,main="complete genes")
plotcluster(dat.r, kc.r$cluster,main="100 random genes")
plotcluster(dat.s, kc.s$cluster,main="100 high variance genes")

Now lets combine PCA(Principal Component Analysis) plot after the clustering

par(mfrow = c(1, 3))
clusplot(LGGmatfilt, kc$cluster, main="complete genes",color=TRUE, shade=TRUE,labels=4, lines=0)
clusplot(dat.r, kc.r$cluster, main = "100 random genes", color = TRUE,shade=TRUE,labels=4,lines=0)
clusplot(dat.s, kc.s$cluster, main = "100 high-variance genes", color = TRUE,shade=TRUE,labels=4,lines=0)

This results shows not much of a clear distinction in the clusters. Need discussion with Dr. Rao.

Now lets do some clustering using different clustering methods for row and column (1) as dendrograms row = diana (Divisive ANAlysis Clustering) - large to small cluster - top down
and column = agnes (Agglomerative Nesting) - small to large cluster - bottom up

library(cluster)
Heatmap(LGGmatfilt, name = "key", cluster_rows = as.dendrogram(diana(LGGmatfilt)),
   cluster_columns = as.dendrogram(agnes(t(LGGmatfilt))))

Heatmap(LGGmatfilt, name = "key", cluster_rows = as.dendrogram(diana(LGGmatfilt)),
   cluster_columns = as.dendrogram(diana(t(LGGmatfilt))))

These heatmaps shows two to three clear ditinct groups among the genes. So decided to do K-means clustering and see how it works out..

K-means clustering… Set the number of clusters to k=2. Then perform k-means clustering for all samples and all genes. k-means uses a random starting solution and thus different runs can lead to different results. Run k-means 10 times and use the result with smallest within-cluster variance

k <- 2
withinss <- Inf
for (i in 1:10) {
kmeans.run <- kmeans(t(LGGmat), k)
print(sum(kmeans.run$withinss))
print(table(kmeans.run$cluster, group))
cat("----\n")
if (sum(kmeans.run$withinss) < withinss) {
result <- kmeans.run 
withinss <- sum(result$withinss)}}
## [1] 876898.8
##    group
##      0  1
##   1 17 28
##   2 17  4
## ----
## [1] 881468.4
##    group
##      0  1
##   1 19 15
##   2 15 17
## ----
## [1] 876898.8
##    group
##      0  1
##   1 17  4
##   2 17 28
## ----
## [1] 881468.4
##    group
##      0  1
##   1 19 15
##   2 15 17
## ----
## [1] 878689.2
##    group
##      0  1
##   1 18 29
##   2 16  3
## ----
## [1] 881468.4
##    group
##      0  1
##   1 15 17
##   2 19 15
## ----
## [1] 890256.3
##    group
##      0  1
##   1 13 12
##   2 21 20
## ----
## [1] 890256.3
##    group
##      0  1
##   1 13 12
##   2 21 20
## ----
## [1] 890363.5
##    group
##      0  1
##   1 13 12
##   2 21 20
## ----
## [1] 890256.3
##    group
##      0  1
##   1 13 12
##   2 21 20
## ----
table(group,clusters=result$cluster)
##      clusters
## group  1  2
##     0 17 17
##     1 28  4
fisher.test(result$cluster,group)$p.value
## [1] 0.001404262
kmeans.s <- kmeans(t(dat.s), k)
table(group,clusters=kmeans.s$cluster)
##      clusters
## group  1  2
##     0 18 16
##     1  7 25
d<-dist(t(dat.s))
mds<-cmdscale(d)
plot(mds[,1],mds[,2],col=kmeans.s$cluster)

fisher.test(groups.r, group)$p.value
## [1] 0.01784544

Though the p.value is small (2.527244e-05), we cannot be sure whether the result reflects biology. Original data is for 34 group 0 and 32 group 1 patients. But the k-means divided the group 23 group 0 and 43 group 1. The random selecton of genes have a variying p value (???)

Now lets try PAM (Partitioning around medoids) clustering

result <- pam(t(LGGmat),k)
table(group,cluster=result$clustering)
##      cluster
## group  1  2
##     0 25  9
##     1 23  9
fisher.test(result$clustering, group)$p.value
## [1] 1

To check whether the partitioning is correct or not, we use silhoutte and by examining the silhoutte coefficient, we can decide whether it is performed correctly. For this, we do kmeans starting from num of cluster = 2 to 20 and check the SC(silhoutte coefficient)

Range of SC Interpretation

0.71-1.0 A strong structure has been found

0.51-0.70 A reasonable structure has been found

0.26-0.50 The structure is weak and could be artificial

< 0.25 No substantial structure has been found

par(mfrow = c(1, 1))
asw <- numeric()
for (k in 2:20) {
asw[k] <- pam(t(dat.s), k)$silinfo$avg.width}
plot(1:20, asw, xlab = "# clusters", ylab = "average silhouette width", type = "b")

par(mfrow = c(1, 2))
plot(silhouette(pam(t(dat.s), 2)),main="100 high variance genes with k=2")
k=2
result <- pam(t(dat.s),k)
clusplot(pam(dat.s, k),main="100 high variance genes with k=2")

table(group,cluster=result$clustering)
##      cluster
## group  1  2
##     0 21 13
##     1 13 19
par(mfrow = c(1, 2))
plot(silhouette(pam(t(dat.s), 3)),main="100 high variance genes with k=3")
clusplot(pam(dat.s, 3),main="100 high variance genes with k=3")

Lets check using fpc the clustering by k-means

K-Means Clustering with 2 clusters followed by # Cluster Plot against 1st 2 principal components vary parameters for most readable graph

par(mfrow = c(1, 2))
fit1 <- kmeans(LGGmatfilt, 2)
#fit2<-pam(t(LGGmatfilt),2)

library(cluster) 
clusplot(LGGmatfilt, fit1$cluster, color=TRUE, shade=TRUE, 
    labels=2, lines=0)

# Centroid Plot against 1st 2 discriminant functions
library(fpc)
plotcluster(LGGmatfilt, fit1$cluster)

Here we use the complex Heatmap with km specified as 5. This takes k-means clustering of rows.

ha = HeatmapAnnotation(df = data.frame(group = group))
Heatmap(LGGmatfilt, name = "expression", km = 5, top_annotation = ha, 
    top_annotation_height = unit(4, "mm"), show_row_names = FALSE, 
    show_column_names = FALSE) 

Now lets go back to the t.test and do some more filtering of genes using multiple testing correction (FDR, FWER) which is a class prediction and a **supervised machine learning method*.

library(rafalib)
mypar(1,2)

We use qqnorm function to check the normality assumptions for eg for the 2500th gene

par(mfrow = c(1, 2))
gene<-patient_list[2500,] 
qqnorm(gene[group==1],main="Normal Q-Q - group 0")
qqline(gene[group==1])
qqnorm(gene[group==0],main="Normal Q-Q - group 1")
qqline(gene[group==0])

As we have two conditions with multiple replicates, to find differentially expressed genes we will use the parametric test - t.test

Let us take for eg. 2500th gene and do a simple t-test

t.test(gene[group==1],gene[group==0])
## 
##  Welch Two Sample t-test
## 
## data:  gene[group == 1] and gene[group == 0]
## t = -0.014559, df = 63.993, p-value = 0.9884
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1909576  0.1881943
## sample estimates:
## mean of x mean of y 
##  6.656769  6.658150

Here the p value is 0.06685 - this is larger than alpha = 0.05. Hence null hypothesis is true - No difference between two groups for 2500th gene in the data. Now do t-test for each row.

biocLite("genefilter")
## BioC_mirror: http://bioconductor.org
## Using Bioconductor version 3.1 (BiocInstaller 1.18.5), R version 3.2.2.
## Installing package(s) 'genefilter'
## 
## The downloaded binary packages are in
##  /var/folders/71/vnhmr1ts6w354s54vd5n1xw80000gn/T//RtmpxJzkJx/downloaded_packages
library(genefilter)
## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'
## 
## Attaching package: 'genefilter'
## 
## The following object is masked from 'package:ComplexHeatmap':
## 
##     dist2
## 
## The following object is masked from 'package:base':
## 
##     anyNA
results1<-rowttests(LGGmat,group)
results1<-na.omit(results1)

Look at the top 10 from each of the t-tests and sort them and get the respective names along with the histogram of the p-values

sort( results1$p.value )[1:10]
##  [1] 1.273034e-06 2.342231e-06 2.498085e-06 2.681133e-06 4.491045e-06
##  [6] 6.809912e-06 7.907597e-06 9.589072e-06 9.915291e-06 1.191924e-05
row.names(LGGmat) <- rownames(patient_list)

AffyID <- rownames( LGGmat )

AffyID[order(results1$p.value)][1:10]
##  [1] "TCEB2"  "UGDH"   "ACVR2B" "ABCC3"  "STXBP4" "STYK1"  "SMAD4" 
##  [8] "GBAS"   "CTBP1"  "IGF2AS"
sort( results1$p.value )[1:10]
##  [1] 1.273034e-06 2.342231e-06 2.498085e-06 2.681133e-06 4.491045e-06
##  [6] 6.809912e-06 7.907597e-06 9.589072e-06 9.915291e-06 1.191924e-05
hist(results1$p.value, breaks=100)

#table(results1$p.value)

In addition, it is nice to know if the genes are over or under expressed relative to the control condition using the vulcano plot (results1$dm is same as log2 fold change)

plot(results1$dm, -log10(results1$p.value), pch=".", 
     xlab = expression(mean~log[2]~fold~change),
     ylab = expression(-log[10](p)))

sum(results1$p.value<0.05,na.rm=TRUE & abs(results1$dm)>1)
## [1] 2630
library(rafalib)
#shist(LGGmat)
hist(LGGmat)

We make the vulcano plot using the ggplot library

biocLite("ggplot2")
## BioC_mirror: http://bioconductor.org
## Using Bioconductor version 3.1 (BiocInstaller 1.18.5), R version 3.2.2.
## Installing package(s) 'ggplot2'
## 
## The downloaded binary packages are in
##  /var/folders/71/vnhmr1ts6w354s54vd5n1xw80000gn/T//RtmpxJzkJx/downloaded_packages
install.packages("ggplot2")
## 
## The downloaded binary packages are in
##  /var/folders/71/vnhmr1ts6w354s54vd5n1xw80000gn/T//RtmpxJzkJx/downloaded_packages
require(ggplot2)
## Loading required package: ggplot2
##Highlight genes that have an absolute fold change > 2 and a p-value < Bonferroni cut-off
mypar(1,2)
par(mfrow = c(1, 2))
no_of_genes = dim(LGGmat)[1]
0.05/no_of_genes
## [1] 3.931745e-06
table(results1$threshold)
## < table of extent 0 >
results1$threshold = as.factor(abs(results1$dm) > 2.5 & results1$p.value < 0.005)
results1$clr <- paste(abs(results1$dm) > 2.5, results1$p.value < 0.005)

table(results1$threshold)
## 
## FALSE  TRUE 
## 12690    14
# for p.value <0.005

g = ggplot(data=results1, aes(x=dm, y=-log10(p.value), col=clr)) +
    geom_point(alpha=0.5, size=1.50,shape=15) +
#    theme(legend.position = "none") +
#     theme(legend.key=results1$threshold, legend.title=conditions) +
    xlim(c(-5, 5)) + ylim(c(0, 6.0)) +
    xlab("log2 fold change") + ylab("-log10 p-value") + 
    labs(title="p.value <0.005") +
    geom_hline(yintercept=-log10(0.005), color="red") +
    geom_vline(xintercept=2.5, color="blue") +
    geom_vline(xintercept=-2.5, color="blue") 
g

vulcano_text = results1[(abs(results1$dm) > 2.5) & (results1$p.value < 0.005),]
g + geom_text(data=vulcano_text, aes(x=vulcano_text$dm, y=-log10(vulcano_text$p.value), label=row.names(vulcano_text)), size=3,colour="black")

##Construct the plot object for Bonferroni

results1$threshold = as.factor(abs(results1$dm) > 2.5 & results1$p.value < 0.05/no_of_genes)
results1$clr <- paste(abs(results1$dm) > 2.5, results1$p.value < 0.05)
g = ggplot(data=results1, aes(x=dm, y=-log10(p.value), col=clr)) +
    geom_point(alpha=0.5, size=1.50,shape=15) +
#    theme(legend.position = "none") +
#     theme(legend.key=results1$threshold, legend.title=conditions) +
    xlim(c(-5, 5)) + ylim(c(0, 6.0)) +
    xlab("log2 fold change") + ylab("-log10 p-value") + 
    labs(title="p.value <0.05/number of genes") +
    geom_hline(yintercept=-log10(0.05), color="red") +
    geom_vline(xintercept=2.5, color="blue") +
    geom_vline(xintercept=-2.5, color="blue") 
g

vulcano_text = results1[(abs(results1$dm) > 2.5) & (results1$p.value < 0.05/no_of_genes),]
g + geom_text(data=vulcano_text, aes(x=vulcano_text$dm, y=-log10(vulcano_text$p.value), label=row.names(vulcano_text)), size=3,colour="red")

MA plot

library(lattice)
## 
## Attaching package: 'lattice'
## 
## The following object is masked from 'package:GetoptLong':
## 
##     qq
A = rowMeans(LGGmat)
M = (unlist(LGGmat)) - A
sample = rep(colnames(LGGmat), each=nrow(LGGmat))
df = data.frame(M, A, sample, row.names=NULL, check.names=FALSE)
plt = xyplot(M ~ A | sample, df, panel=panel.smoothScatter)
plt
## Warning in KernSmooth::bkde2D(x, gridsize = nbin, bandwidth = bandwidth):
## Binning grid too coarse for current (small) bandwidth: consider increasing
## 'gridsize'
## Warning in KernSmooth::bkde2D(x, gridsize = nbin, bandwidth = bandwidth):
## Binning grid too coarse for current (small) bandwidth: consider increasing
## 'gridsize'
## Warning in KernSmooth::bkde2D(x, gridsize = nbin, bandwidth = bandwidth):
## Binning grid too coarse for current (small) bandwidth: consider increasing
## 'gridsize'
## Warning in KernSmooth::bkde2D(x, gridsize = nbin, bandwidth = bandwidth):
## Binning grid too coarse for current (small) bandwidth: consider increasing
## 'gridsize'
## Warning in KernSmooth::bkde2D(x, gridsize = nbin, bandwidth = bandwidth):
## Binning grid too coarse for current (small) bandwidth: consider increasing
## 'gridsize'
## Warning in KernSmooth::bkde2D(x, gridsize = nbin, bandwidth = bandwidth):
## Binning grid too coarse for current (small) bandwidth: consider increasing
## 'gridsize'
## Warning in KernSmooth::bkde2D(x, gridsize = nbin, bandwidth = bandwidth):
## Binning grid too coarse for current (small) bandwidth: consider increasing
## 'gridsize'
## Warning in KernSmooth::bkde2D(x, gridsize = nbin, bandwidth = bandwidth):
## Binning grid too coarse for current (small) bandwidth: consider increasing
## 'gridsize'
## Warning in KernSmooth::bkde2D(x, gridsize = nbin, bandwidth = bandwidth):
## Binning grid too coarse for current (small) bandwidth: consider increasing
## 'gridsize'
## Warning in KernSmooth::bkde2D(x, gridsize = nbin, bandwidth = bandwidth):
## Binning grid too coarse for current (small) bandwidth: consider increasing
## 'gridsize'
## Warning in KernSmooth::bkde2D(x, gridsize = nbin, bandwidth = bandwidth):
## Binning grid too coarse for current (small) bandwidth: consider increasing
## 'gridsize'

filt3 <- results1$p.value < 0.0001
table(filt3)
## filt3
## FALSE  TRUE 
## 12657    47
selProbes <- (filt1 | filt2) & filt3
## Warning in (filt1 | filt2) & filt3: longer object length is not a multiple
## of shorter object length
LGGmatfilt1<- LGGmat[selProbes, ]
dim(LGGmatfilt1)
## [1] 13 66

MDS Plot using cmdscale

group<-factor(sampleinfo$group)
library(rafalib)
mypar(1,2)
d <- dist(t(LGGmatfilt))
mds <- cmdscale(d)


plot(mds[,1],mds[,2],bg=as.numeric(group),pch=21,
     xlab="First dimension",ylab="Second dimension")
legend("bottomleft",levels(group),col=seq(along=levels(group)),pch=15)

MDS Plot using svd

y = LGGmatfilt - rowMeans(LGGmatfilt)
s = svd(y)
z = s$d * t(s$v)

# we can make an mds plot
plot(z[1,],z[2,],bg=as.numeric(group),pch=21,
     xlab="First dimension",ylab="Second dimension")
legend("bottomleft",levels(group),col=seq(along=levels(group)),pch=15)

cor(z[1,],mds[,1])
## [1] 1
cor(z[2,],mds[,2])
## [1] -1
plot(s$d^2/sum(s$d^2))

#date<-factor(sampleInfo$date)
#ethnicity<-factor(sampleInfo$ethnicity)
#PC3 <- s$d[1]*s$v[,1]
#PC4 <- s$d[2]*s$v[,2]
#mypar(1,1)
#plot(PC3,PC4,pch=21,bg=as.numeric(ethnicity))
#legend("bottomright",levels(ethnicity),col=seq(along=levels(ethnicity)),pch=15,cex=1.5)

#MDS plot using svd and cmdscale are different by a factor of -1. So tried plotting -1*z[1,] against -1*z[2,] and 

library(rafalib)
mypar(1,2)
group<-factor(sampleinfo$group)
plot(-1*z[1,],-1*z[2,],col=as.numeric(group))
legend("topleft",levels(group),col=seq_along(group),pch=1)
plot(mds[,1],mds[,2],col=as.numeric(group))

Multiple test (FWER - Bonferroni Procedure)

#How many genes have p-values smaller than 0.05?
dim(LGGmat)   #12717 genes and 66 samples
## [1] 12717    66
sum(results1$p.value<0.05,na.rm=TRUE)  # 2630 genes
## [1] 2630
# Bonferroni Procedure (p.value <= alpha/(total number of genes))
#Apply the Bonferroni correction to the p-values obtained above to achieve a #FWER of 0.05. How many genes are called significant under this procedure?
k = 0.05/length(results1$p.value)
sum(results1$p.value<k,na.rm=TRUE)  # Ans 4 (Are we very conservative? We have
## [1] 4
# not really assessed the false negatives at all yet.)

Note that the FDR is a property of a list of features, not each specific feature. The q-value relates FDR to an individual feature. To define the q-value we order features we tested by p-value then compute the FDRs for a list with the most significant, the two most significant, the three most significant, etc… The FDR of the list with the, say, m most significant tests is defined as the q-value of the m-th most significant feature. In other words, the q-value of a feature, is the FDR of the biggest list that includes that gene.

In R, we can compute the q-value using the p.adjust function with the FDR option. Read the help file for p.adjust and then, for our gene expression dataset, compute how many genes achieve an FDR < 0.05

fdr = p.adjust(results1$p.value,method="fdr")  # this is same as Benjamini #Hochenberg
sum(fdr<0.05,na.rm=TRUE)   #Ans 222
## [1] 222
pv.BH <- p.adjust(results1$p.value, method = "BH")
table(pv.BH < 0.05)   #TRUE 222
## 
## FALSE  TRUE 
## 12482   222

Note that controlling the FDR at 0.05 gives us (222-4=118) more genes than the Bonferroni correction. Note that we are controlloing two very different error rates. Here we are saying that we think this list of 222 genes has about 5% false positives. The Bonferroni procedure gave us a list ot 4 genes for which we were quite certain had no false positives. Note again that we have not discussed false negatives.

Now use the qvalue function, in the Bioconductor qvalue package, to estimate q-values using the procedure described by Storey.

Using this estimate how many genes have q-values below 0.05?

biocLite("qvalue")
## BioC_mirror: http://bioconductor.org
## Using Bioconductor version 3.1 (BiocInstaller 1.18.5), R version 3.2.2.
## Installing package(s) 'qvalue'
## 
## The downloaded binary packages are in
##  /var/folders/71/vnhmr1ts6w354s54vd5n1xw80000gn/T//RtmpxJzkJx/downloaded_packages
library(qvalue)
pvals <- results1$p.value[!is.na(results1$p.value)]
length(pvals)  #12704 ;    12717-12704= 13 NAs
## [1] 12704
res = qvalue(pvals)
qvals = res$qvalues
sum(qvals<0.05)     # Ans 648
## [1] 648

Now the list has increased to 648. However, we are also claiming this list has an FDR of 5%. The reason this list is large is because, qvalue estimates FDR differently and is less conservative. Remember that the theory provides bounds for FDR: it guarantees FDR will be less than 0.05. If qvalue does in fact estimate pi0 well then it will provide a list with FDR closer to 0.05.

Read the help file for qvalue and report the estimated proportion of genes for which the null hypothesis is true π0=m0/m

qvalue(pvals)$pi0   #ans 0.6082695  (60% no change ; 40% differentially expressed)
## [1] 0.6090869

Note that we have the number of genes passing the q-value <0.05 threshold is larger (648) with the qvalue function than the p.adjust difference using FDR / BH (222).

Why is this the case? A plot of the ratio of these two estimates will clarify this difference.

plot(qvalue(pvals)$qvalue/p.adjust(pvals,method="fdr"))
abline(h=qvalue(pvals)$pi0,col=2)

#The qvalue function estimates the proportion of genes for which the null #hypothesis is true and provides a less conservative estimate

#To get an idea of how pi0 is estimated, note that if we look at the histogram, #pi0 roughly tells us the proportion that looks about uniform:
    
hist(pvals,breaks=seq(0,1,len=21))
expectedfreq <- length(pvals)/20 #per bin
abline(h=expectedfreq*qvalue(pvals)$pi0,col=2,lty=2)

The figure illustrates of the Benjamini-Hochberg multiple testing adjustment. The black line shows the p-values (y-axis) versus their rank (x-axis), starting with the smallest p-value from the left, then the second smallest, and so on. The red line is a straight line with slope α/m, where m is the number of tests, and α is a target false discovery rate. FDR is controlled at the value α if the genes are selected that lie to the left of the rightmost intersection between the red and black lines: here, this results in 15 significant p–values. Thus, the procedure is relatively conservative since we actually have simulated 15 non–null p–values. The Bonferroni correction is clearly not suitable here as we only get 3 significant p–values.

library(ggplot2)
(ggplot2::qplot(rank(pvals), pvals, xlab = "p-value rank", ylab="p-values"
,main = "Visualization of the BH procedure",aes(color="green"))
+ geom_abline(intercept = 0, slope = 0.05/12717, aes(color = "coral3"))
+ ylim(c(0, 1) ))

Now lets do a multiple testing using the library multtest (multiple t test)

biocLite(“multtest”) library(multtest) adjp<- mt.rawp2adjp(results1\(p.value, proc="BH") head(adjp\)adjp) ## print out the first 6 genes with their raw p and adjusted p sum(adjp\(adjp[, "BH"] < 0.05) adjp\)index[1:10] AffyID[adjp$index[1:10]]