This project revisits our previous work on top genes in chronic Hodgkin's lymphoma (CHL) and diffuse large B-cell lymphoma (DLBCL), from study GSE305165. That analysis did not find strong genes when classifying with the random forest classifier because there was no control or baseline comparison: the samples were only compared to each other for similarity. We then found six healthy samples in another project we worked on, GSE85599, which covers acute infectious mononucleosis (AIM) and chronic active Epstein-Barr virus (CAEBV). Both studies use total RNA from peripheral blood mononuclear cells (PBMCs), but the CHL and DLBCL data are chip sequencing while the other dataset is array data.

The datasets used here come from each of those studies and can be accessed directly through Google here and on Kaggle. The documentation from beginning to final output can be found on the RPubs site for janiscorona and here.

library(rmarkdown)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
data_lymphoma <- read.csv("Data_TableauReady_21448X125.csv", header=TRUE)

summary(data_lymphoma)
##          ID            SPOT_ID.1           GeneID          pDLBCL      
##  Length   :21448   Length   :21448   Length   :21448   Min.   : 1.209  
##  N.unique :21448   N.unique :21448   N.unique :18769   1st Qu.: 3.242  
##  N.blank  :    0   N.blank  :    0   N.blank  :    0   Median : 4.319  
##  Min.nchar:   17   Min.nchar:  105   Min.nchar:    1   Mean   : 4.357  
##  Max.nchar:   35   Max.nchar:32532   Max.nchar: 8556   3rd Qu.: 5.352  
##                                                        Max.   :11.828  
##       CHL            pDLBCL.1          mDLBCL           CHL.1       
##  Min.   : 1.214   Min.   : 1.310   Min.   : 1.345   Min.   : 1.233  
##  1st Qu.: 3.321   1st Qu.: 3.247   1st Qu.: 3.221   1st Qu.: 3.299  
##  Median : 4.286   Median : 4.331   Median : 4.323   Median : 4.320  
##  Mean   : 4.343   Mean   : 4.359   Mean   : 4.349   Mean   : 4.370  
##  3rd Qu.: 5.257   3rd Qu.: 5.354   3rd Qu.: 5.343   3rd Qu.: 5.342  
##  Max.   :11.637   Max.   :11.936   Max.   :11.941   Max.   :11.825  
##     mDLBCL.1         pDLBCL.2          CHL.2            CHL.3       
##  Min.   : 1.265   Min.   : 1.279   Min.   : 1.120   Min.   : 1.184  
##  1st Qu.: 3.207   1st Qu.: 3.236   1st Qu.: 3.207   1st Qu.: 3.234  
##  Median : 4.319   Median : 4.332   Median : 4.313   Median : 4.353  
##  Mean   : 4.346   Mean   : 4.363   Mean   : 4.354   Mean   : 4.383  
##  3rd Qu.: 5.333   3rd Qu.: 5.364   3rd Qu.: 5.365   3rd Qu.: 5.393  
##  Max.   :11.997   Max.   :11.994   Max.   :11.907   Max.   :11.990  
##      CHL.4            CHL.5           mDLBCL.2         mDLBCL.3     
##  Min.   : 1.194   Min.   : 1.292   Min.   : 1.160   Min.   : 1.358  
##  1st Qu.: 3.273   1st Qu.: 3.310   1st Qu.: 3.231   1st Qu.: 3.338  
##  Median : 4.361   Median : 4.363   Median : 4.312   Median : 4.365  
##  Mean   : 4.393   Mean   : 4.388   Mean   : 4.367   Mean   : 4.398  
##  3rd Qu.: 5.378   3rd Qu.: 5.340   3rd Qu.: 5.362   3rd Qu.: 5.355  
##  Max.   :12.090   Max.   :11.984   Max.   :11.890   Max.   :11.971  
##      CHL.6            CHL.7           pDLBCL.3          CHL.8       
##  Min.   : 1.297   Min.   : 1.289   Min.   : 1.311   Min.   : 1.236  
##  1st Qu.: 3.276   1st Qu.: 3.228   1st Qu.: 3.209   1st Qu.: 3.271  
##  Median : 4.358   Median : 4.317   Median : 4.305   Median : 4.338  
##  Mean   : 4.386   Mean   : 4.356   Mean   : 4.348   Mean   : 4.372  
##  3rd Qu.: 5.362   3rd Qu.: 5.352   3rd Qu.: 5.337   3rd Qu.: 5.351  
##  Max.   :11.945   Max.   :11.958   Max.   :12.007   Max.   :11.958  
##     mDLBCL.4         mDLBCL.5         mDLBCL.6         mDLBCL.7     
##  Min.   : 1.231   Min.   : 1.250   Min.   : 1.288   Min.   : 1.283  
##  1st Qu.: 3.200   1st Qu.: 3.206   1st Qu.: 3.199   1st Qu.: 3.231  
##  Median : 4.307   Median : 4.339   Median : 4.305   Median : 4.325  
##  Mean   : 4.354   Mean   : 4.372   Mean   : 4.347   Mean   : 4.369  
##  3rd Qu.: 5.361   3rd Qu.: 5.381   3rd Qu.: 5.334   3rd Qu.: 5.374  
##  Max.   :11.986   Max.   :11.986   Max.   :12.055   Max.   :12.046  
##      CHL.9           mDLBCL.8          CHL.10           CHL.11      
##  Min.   : 1.184   Min.   : 1.262   Min.   : 1.104   Min.   : 1.214  
##  1st Qu.: 3.223   1st Qu.: 3.262   1st Qu.: 3.187   1st Qu.: 3.220  
##  Median : 4.341   Median : 4.332   Median : 4.336   Median : 4.329  
##  Mean   : 4.359   Mean   : 4.381   Mean   : 4.364   Mean   : 4.367  
##  3rd Qu.: 5.369   3rd Qu.: 5.357   3rd Qu.: 5.404   3rd Qu.: 5.382  
##  Max.   :12.134   Max.   :11.952   Max.   :11.943   Max.   :12.093  
##      CHL.12           CHL.13          mDLBCL.9        mDLBCL.10     
##  Min.   : 1.221   Min.   : 1.274   Min.   : 1.248   Min.   : 1.341  
##  1st Qu.: 3.172   1st Qu.: 3.238   1st Qu.: 3.201   1st Qu.: 3.309  
##  Median : 4.309   Median : 4.340   Median : 4.335   Median : 4.351  
##  Mean   : 4.363   Mean   : 4.376   Mean   : 4.370   Mean   : 4.392  
##  3rd Qu.: 5.388   3rd Qu.: 5.373   3rd Qu.: 5.393   3rd Qu.: 5.357  
##  Max.   :12.150   Max.   :12.126   Max.   :12.011   Max.   :12.004  
##    mDLBCL.11        mDLBCL.12        mDLBCL.13        mDLBCL.14     
##  Min.   : 1.195   Min.   : 1.211   Min.   : 1.287   Min.   : 1.361  
##  1st Qu.: 3.212   1st Qu.: 3.221   1st Qu.: 3.192   1st Qu.: 3.244  
##  Median : 4.317   Median : 4.336   Median : 4.319   Median : 4.359  
##  Mean   : 4.360   Mean   : 4.369   Mean   : 4.370   Mean   : 4.386  
##  3rd Qu.: 5.379   3rd Qu.: 5.377   3rd Qu.: 5.375   3rd Qu.: 5.388  
##  Max.   :12.102   Max.   :12.155   Max.   :11.945   Max.   :12.157  
##     pDLBCL.4         pDLBCL.5        mDLBCL.15        mDLBCL.16     
##  Min.   : 1.244   Min.   : 1.157   Min.   : 1.208   Min.   : 1.263  
##  1st Qu.: 3.203   1st Qu.: 3.217   1st Qu.: 3.203   1st Qu.: 3.206  
##  Median : 4.338   Median : 4.339   Median : 4.314   Median : 4.314  
##  Mean   : 4.375   Mean   : 4.376   Mean   : 4.368   Mean   : 4.357  
##  3rd Qu.: 5.395   3rd Qu.: 5.387   3rd Qu.: 5.376   3rd Qu.: 5.375  
##  Max.   :11.979   Max.   :12.027   Max.   :11.822   Max.   :11.897  
##      CHL.14          pDLBCL.6         pDLBCL.7          CHL.15      
##  Min.   : 1.228   Min.   : 1.127   Min.   : 1.226   Min.   : 1.320  
##  1st Qu.: 3.215   1st Qu.: 3.181   1st Qu.: 3.146   1st Qu.: 3.200  
##  Median : 4.317   Median : 4.305   Median : 4.275   Median : 4.320  
##  Mean   : 4.355   Mean   : 4.363   Mean   : 4.353   Mean   : 4.367  
##  3rd Qu.: 5.392   3rd Qu.: 5.396   3rd Qu.: 5.356   3rd Qu.: 5.374  
##  Max.   :12.061   Max.   :11.999   Max.   :12.113   Max.   :11.938  
##      CHL.16           CHL.17         mDLBCL.17        mDLBCL.18     
##  Min.   : 1.239   Min.   : 1.339   Min.   : 1.225   Min.   : 1.250  
##  1st Qu.: 3.193   1st Qu.: 3.234   1st Qu.: 3.137   1st Qu.: 3.178  
##  Median : 4.317   Median : 4.345   Median : 4.262   Median : 4.308  
##  Mean   : 4.367   Mean   : 4.376   Mean   : 4.340   Mean   : 4.362  
##  3rd Qu.: 5.380   3rd Qu.: 5.377   3rd Qu.: 5.364   3rd Qu.: 5.378  
##  Max.   :12.080   Max.   :11.967   Max.   :12.221   Max.   :11.992  
##    mDLBCL.19          CHL.18          CHL_mean       pDLBCL_mean    
##  Min.   : 1.223   Min.   : 1.301   Min.   : 1.464   Min.   : 1.446  
##  1st Qu.: 3.265   1st Qu.: 3.181   1st Qu.: 3.261   1st Qu.: 3.233  
##  Median : 4.367   Median : 4.305   Median : 4.339   Median : 4.327  
##  Mean   : 4.407   Mean   : 4.370   Mean   : 4.369   Mean   : 4.362  
##  3rd Qu.: 5.412   3rd Qu.: 5.374   3rd Qu.: 5.343   3rd Qu.: 5.346  
##  Max.   :11.962   Max.   :12.067   Max.   :11.975   Max.   :11.938  
##   mDLBCL_mean       CHL_x_mean     mDLBCL_x_mean    pDLBCL_x_mean   
##  Min.   : 1.444   Min.   : 1.349   Min.   : 1.415   Min.   : 1.405  
##  1st Qu.: 3.245   1st Qu.: 3.237   1st Qu.: 3.263   1st Qu.: 3.235  
##  Median : 4.330   Median : 4.343   Median : 4.335   Median : 4.330  
##  Mean   : 4.368   Mean   : 4.371   Mean   : 4.371   Mean   : 4.365  
##  3rd Qu.: 5.349   3rd Qu.: 5.363   3rd Qu.: 5.337   3rd Qu.: 5.351  
##  Max.   :11.963   Max.   :11.976   Max.   :11.961   Max.   :11.931  
##    CHL_y_mean     mDLBCL_y_mean    pDLBCL_y_mean    CHL_young72_mean
##  Min.   : 1.495   Min.   : 1.428   Min.   : 1.420   Min.   : 1.479  
##  1st Qu.: 3.270   1st Qu.: 3.243   1st Qu.: 3.221   1st Qu.: 3.272  
##  Median : 4.342   Median : 4.329   Median : 4.318   Median : 4.341  
##  Mean   : 4.368   Mean   : 4.368   Mean   : 4.358   Mean   : 4.366  
##  3rd Qu.: 5.336   3rd Qu.: 5.350   3rd Qu.: 5.339   3rd Qu.: 5.330  
##  Max.   :11.974   Max.   :11.963   Max.   :11.945   Max.   :11.967  
##  mDLBCL_young72_mean pDLBCL_young72_mean    CHL_old         mDLBCL_old    
##  Min.   : 1.387      Min.   : 1.474      Min.   : 1.416   Min.   : 1.424  
##  1st Qu.: 3.254      1st Qu.: 3.247      1st Qu.: 3.250   1st Qu.: 3.242  
##  Median : 4.332      Median : 4.328      Median : 4.339   Median : 4.328  
##  Mean   : 4.369      Mean   : 4.360      Mean   : 4.373   Mean   : 4.368  
##  3rd Qu.: 5.349      3rd Qu.: 5.341      3rd Qu.: 5.356   3rd Qu.: 5.350  
##  Max.   :11.931      Max.   :11.897      Max.   :11.989   Max.   :11.976  
##    pDLBCL_old       CHL_median     mDLBCL_median    pDLBCL_median   
##  Min.   : 1.408   Min.   : 1.466   Min.   : 1.386   Min.   : 1.431  
##  1st Qu.: 3.222   1st Qu.: 3.231   1st Qu.: 3.219   1st Qu.: 3.207  
##  Median : 4.320   Median : 4.328   Median : 4.315   Median : 4.317  
##  Mean   : 4.364   Mean   : 4.357   Mean   : 4.352   Mean   : 4.348  
##  3rd Qu.: 5.358   3rd Qu.: 5.343   3rd Qu.: 5.342   3rd Qu.: 5.342  
##  Max.   :11.979   Max.   :11.958   Max.   :11.968   Max.   :11.954  
##   CHL_x_median    mDLBCL_x_median  pDLBCL_x_median   CHL_y_median   
##  Min.   : 1.348   Min.   : 1.351   Min.   : 1.400   Min.   : 1.466  
##  1st Qu.: 3.223   1st Qu.: 3.243   1st Qu.: 3.222   1st Qu.: 3.240  
##  Median : 4.337   Median : 4.322   Median : 4.329   Median : 4.328  
##  Mean   : 4.363   Mean   : 4.361   Mean   : 4.360   Mean   : 4.360  
##  3rd Qu.: 5.359   3rd Qu.: 5.338   3rd Qu.: 5.350   3rd Qu.: 5.345  
##  Max.   :11.944   Max.   :11.958   Max.   :11.923   Max.   :11.967  
##  mDLBCL_y_median  pDLBCL_y_median  CHL_young72_median mDLBCL_young72_median
##  Min.   : 1.391   Min.   : 1.423   Min.   : 1.468     Min.   : 1.366       
##  1st Qu.: 3.215   1st Qu.: 3.209   1st Qu.: 3.239     1st Qu.: 3.237       
##  Median : 4.314   Median : 4.301   Median : 4.325     Median : 4.324       
##  Mean   : 4.352   Mean   : 4.347   Mean   : 4.355     Mean   : 4.358       
##  3rd Qu.: 5.347   3rd Qu.: 5.336   3rd Qu.: 5.335     3rd Qu.: 5.343       
##  Max.   :11.968   Max.   :11.968   Max.   :11.974     Max.   :11.956       
##  pDLBCL_young72_median CHL_old72_median mDLBCL_old72_median pDLBCL_old72_median
##  Min.   : 1.451        Min.   : 1.406   Min.   : 1.400      Min.   : 1.349     
##  1st Qu.: 3.231        1st Qu.: 3.235   1st Qu.: 3.213      1st Qu.: 3.202     
##  Median : 4.323        Median : 4.339   Median : 4.315      Median : 4.298     
##  Mean   : 4.351        Mean   : 4.363   Mean   : 4.352      Mean   : 4.354     
##  3rd Qu.: 5.337        3rd Qu.: 5.356   3rd Qu.: 5.345      3rd Qu.: 5.359     
##  Max.   :11.893        Max.   :11.957   Max.   :11.968      Max.   :12.008     
##         ID.1         CHL_mean.1     pDLBCL_mean.1    mDLBCL_mean.1   
##  Length   :21448   Min.   : 1.464   Min.   : 1.446   Min.   : 1.444  
##  N.unique :21448   1st Qu.: 3.261   1st Qu.: 3.233   1st Qu.: 3.245  
##  N.blank  :    0   Median : 4.339   Median : 4.327   Median : 4.330  
##  Min.nchar:   17   Mean   : 4.369   Mean   : 4.362   Mean   : 4.368  
##  Max.nchar:   35   3rd Qu.: 5.343   3rd Qu.: 5.346   3rd Qu.: 5.349  
##                    Max.   :11.975   Max.   :11.938   Max.   :11.963  
##   CHL_x_mean.1    mDLBCL_x_mean.1  pDLBCL_x_mean.1   CHL_y_mean.1   
##  Min.   : 1.349   Min.   : 1.415   Min.   : 1.405   Min.   : 1.495  
##  1st Qu.: 3.237   1st Qu.: 3.263   1st Qu.: 3.235   1st Qu.: 3.270  
##  Median : 4.343   Median : 4.335   Median : 4.330   Median : 4.342  
##  Mean   : 4.371   Mean   : 4.371   Mean   : 4.365   Mean   : 4.368  
##  3rd Qu.: 5.363   3rd Qu.: 5.337   3rd Qu.: 5.351   3rd Qu.: 5.336  
##  Max.   :11.976   Max.   :11.961   Max.   :11.931   Max.   :11.974  
##  mDLBCL_y_mean.1  pDLBCL_y_mean.1  CHL_young72_mean.1 mDLBCL_young72_mean.1
##  Min.   : 1.428   Min.   : 1.420   Min.   : 1.479     Min.   : 1.387       
##  1st Qu.: 3.243   1st Qu.: 3.221   1st Qu.: 3.272     1st Qu.: 3.254       
##  Median : 4.329   Median : 4.318   Median : 4.341     Median : 4.332       
##  Mean   : 4.368   Mean   : 4.358   Mean   : 4.366     Mean   : 4.369       
##  3rd Qu.: 5.350   3rd Qu.: 5.339   3rd Qu.: 5.330     3rd Qu.: 5.349       
##  Max.   :11.963   Max.   :11.945   Max.   :11.967     Max.   :11.931       
##  pDLBCL_young72_mean.1   CHL_old.1       mDLBCL_old.1     pDLBCL_old.1   
##  Min.   : 1.474        Min.   : 1.416   Min.   : 1.424   Min.   : 1.408  
##  1st Qu.: 3.247        1st Qu.: 3.250   1st Qu.: 3.242   1st Qu.: 3.222  
##  Median : 4.328        Median : 4.339   Median : 4.328   Median : 4.320  
##  Mean   : 4.360        Mean   : 4.373   Mean   : 4.368   Mean   : 4.364  
##  3rd Qu.: 5.341        3rd Qu.: 5.356   3rd Qu.: 5.350   3rd Qu.: 5.358  
##  Max.   :11.897        Max.   :11.989   Max.   :11.976   Max.   :11.979  
##   CHL_median.1    mDLBCL_median.1  pDLBCL_median.1  CHL_x_median.1  
##  Min.   : 1.466   Min.   : 1.386   Min.   : 1.431   Min.   : 1.348  
##  1st Qu.: 3.231   1st Qu.: 3.219   1st Qu.: 3.207   1st Qu.: 3.223  
##  Median : 4.328   Median : 4.315   Median : 4.317   Median : 4.337  
##  Mean   : 4.357   Mean   : 4.352   Mean   : 4.348   Mean   : 4.363  
##  3rd Qu.: 5.343   3rd Qu.: 5.342   3rd Qu.: 5.342   3rd Qu.: 5.359  
##  Max.   :11.958   Max.   :11.968   Max.   :11.954   Max.   :11.944  
##  mDLBCL_x_median.1 pDLBCL_x_median.1 CHL_y_median.1   mDLBCL_y_median.1
##  Min.   : 1.351    Min.   : 1.400    Min.   : 1.466   Min.   : 1.391   
##  1st Qu.: 3.243    1st Qu.: 3.222    1st Qu.: 3.240   1st Qu.: 3.215   
##  Median : 4.322    Median : 4.329    Median : 4.328   Median : 4.314   
##  Mean   : 4.361    Mean   : 4.360    Mean   : 4.360   Mean   : 4.352   
##  3rd Qu.: 5.338    3rd Qu.: 5.350    3rd Qu.: 5.345   3rd Qu.: 5.347   
##  Max.   :11.958    Max.   :11.923    Max.   :11.967   Max.   :11.968   
##  pDLBCL_y_median.1 CHL_young72_median.1 mDLBCL_young72_median.1
##  Min.   : 1.423    Min.   : 1.468       Min.   : 1.366         
##  1st Qu.: 3.209    1st Qu.: 3.239       1st Qu.: 3.237         
##  Median : 4.301    Median : 4.325       Median : 4.324         
##  Mean   : 4.347    Mean   : 4.355       Mean   : 4.358         
##  3rd Qu.: 5.336    3rd Qu.: 5.335       3rd Qu.: 5.343         
##  Max.   :11.968    Max.   :11.974       Max.   :11.956         
##  pDLBCL_young72_median.1 CHL_old72_median.1 mDLBCL_old72_median.1
##  Min.   : 1.451          Min.   : 1.406     Min.   : 1.400       
##  1st Qu.: 3.231          1st Qu.: 3.235     1st Qu.: 3.213       
##  Median : 4.323          Median : 4.339     Median : 4.315       
##  Mean   : 4.351          Mean   : 4.363     Mean   : 4.352       
##  3rd Qu.: 5.337          3rd Qu.: 5.356     3rd Qu.: 5.345       
##  Max.   :11.893          Max.   :11.957     Max.   :11.968       
##  pDLBCL_old72_median.1   CHL_change     pDLBCL_change    mDLBCL_change   
##  Min.   : 1.349        Min.   :0.9016   Min.   :0.9048   Min.   :0.9283  
##  1st Qu.: 3.202        1st Qu.:0.9941   1st Qu.:0.9925   1st Qu.:0.9960  
##  Median : 4.298        Median :1.0040   Median :1.0020   Median :1.0032  
##  Mean   : 4.354        Mean   :1.0055   Mean   :1.0051   Mean   :1.0056  
##  3rd Qu.: 5.359        3rd Qu.:1.0150   3rd Qu.:1.0141   3rd Qu.:1.0121  
##  Max.   :12.008        Max.   :1.2240   Max.   :1.5134   Max.   :1.2959  
##   CHL_x_change     CHL_y_change    pDLBCL_x_change  pDLBCL_y_change 
##  Min.   :0.8728   Min.   :0.9054   Min.   :0.9016   Min.   :0.8977  
##  1st Qu.:0.9924   1st Qu.:0.9925   1st Qu.:0.9921   1st Qu.:0.9921  
##  Median :1.0015   Median :1.0033   Median :1.0009   Median :1.0012  
##  Mean   :1.0032   Mean   :1.0050   Mean   :1.0025   Mean   :1.0039  
##  3rd Qu.:1.0122   3rd Qu.:1.0158   3rd Qu.:1.0108   3rd Qu.:1.0124  
##  Max.   :1.2379   Max.   :1.2424   Max.   :1.2773   Max.   :1.2548  
##  mDLBCL_x_change  mDLBCL_y_change  CHL_old72_change CHL_young72_change
##  Min.   :0.8726   Min.   :0.9293   Min.   :0.8762   Min.   :0.8839    
##  1st Qu.:0.9920   1st Qu.:0.9953   1st Qu.:0.9907   1st Qu.:0.9924    
##  Median :1.0014   Median :1.0030   Median :1.0021   Median :1.0045    
##  Mean   :1.0038   Mean   :1.0054   Mean   :1.0039   Mean   :1.0060    
##  3rd Qu.:1.0123   3rd Qu.:1.0122   3rd Qu.:1.0148   3rd Qu.:1.0178    
##  Max.   :1.3753   Max.   :1.2939   Max.   :1.2085   Max.   :1.2139    
##  pDLBCL_old_change pDLBCL_young72_change mDLBCL_young72_change
##  Min.   :0.9046    Min.   :0.9111        Min.   :0.8669       
##  1st Qu.:0.9917    1st Qu.:0.9921        1st Qu.:0.9926       
##  Median :1.0013    Median :1.0009        Median :1.0025       
##  Mean   :1.0041    Mean   :1.0029        Mean   :1.0047       
##  3rd Qu.:1.0129    3rd Qu.:1.0112        3rd Qu.:1.0140       
##  Max.   :1.3349    Max.   :1.3463        Max.   :1.4123
paged_table(data_lymphoma[1:5,])
colnames(data_lymphoma)
##   [1] "ID"                      "SPOT_ID.1"              
##   [3] "GeneID"                  "pDLBCL"                 
##   [5] "CHL"                     "pDLBCL.1"               
##   [7] "mDLBCL"                  "CHL.1"                  
##   [9] "mDLBCL.1"                "pDLBCL.2"               
##  [11] "CHL.2"                   "CHL.3"                  
##  [13] "CHL.4"                   "CHL.5"                  
##  [15] "mDLBCL.2"                "mDLBCL.3"               
##  [17] "CHL.6"                   "CHL.7"                  
##  [19] "pDLBCL.3"                "CHL.8"                  
##  [21] "mDLBCL.4"                "mDLBCL.5"               
##  [23] "mDLBCL.6"                "mDLBCL.7"               
##  [25] "CHL.9"                   "mDLBCL.8"               
##  [27] "CHL.10"                  "CHL.11"                 
##  [29] "CHL.12"                  "CHL.13"                 
##  [31] "mDLBCL.9"                "mDLBCL.10"              
##  [33] "mDLBCL.11"               "mDLBCL.12"              
##  [35] "mDLBCL.13"               "mDLBCL.14"              
##  [37] "pDLBCL.4"                "pDLBCL.5"               
##  [39] "mDLBCL.15"               "mDLBCL.16"              
##  [41] "CHL.14"                  "pDLBCL.6"               
##  [43] "pDLBCL.7"                "CHL.15"                 
##  [45] "CHL.16"                  "CHL.17"                 
##  [47] "mDLBCL.17"               "mDLBCL.18"              
##  [49] "mDLBCL.19"               "CHL.18"                 
##  [51] "CHL_mean"                "pDLBCL_mean"            
##  [53] "mDLBCL_mean"             "CHL_x_mean"             
##  [55] "mDLBCL_x_mean"           "pDLBCL_x_mean"          
##  [57] "CHL_y_mean"              "mDLBCL_y_mean"          
##  [59] "pDLBCL_y_mean"           "CHL_young72_mean"       
##  [61] "mDLBCL_young72_mean"     "pDLBCL_young72_mean"    
##  [63] "CHL_old"                 "mDLBCL_old"             
##  [65] "pDLBCL_old"              "CHL_median"             
##  [67] "mDLBCL_median"           "pDLBCL_median"          
##  [69] "CHL_x_median"            "mDLBCL_x_median"        
##  [71] "pDLBCL_x_median"         "CHL_y_median"           
##  [73] "mDLBCL_y_median"         "pDLBCL_y_median"        
##  [75] "CHL_young72_median"      "mDLBCL_young72_median"  
##  [77] "pDLBCL_young72_median"   "CHL_old72_median"       
##  [79] "mDLBCL_old72_median"     "pDLBCL_old72_median"    
##  [81] "ID.1"                    "CHL_mean.1"             
##  [83] "pDLBCL_mean.1"           "mDLBCL_mean.1"          
##  [85] "CHL_x_mean.1"            "mDLBCL_x_mean.1"        
##  [87] "pDLBCL_x_mean.1"         "CHL_y_mean.1"           
##  [89] "mDLBCL_y_mean.1"         "pDLBCL_y_mean.1"        
##  [91] "CHL_young72_mean.1"      "mDLBCL_young72_mean.1"  
##  [93] "pDLBCL_young72_mean.1"   "CHL_old.1"              
##  [95] "mDLBCL_old.1"            "pDLBCL_old.1"           
##  [97] "CHL_median.1"            "mDLBCL_median.1"        
##  [99] "pDLBCL_median.1"         "CHL_x_median.1"         
## [101] "mDLBCL_x_median.1"       "pDLBCL_x_median.1"      
## [103] "CHL_y_median.1"          "mDLBCL_y_median.1"      
## [105] "pDLBCL_y_median.1"       "CHL_young72_median.1"   
## [107] "mDLBCL_young72_median.1" "pDLBCL_young72_median.1"
## [109] "CHL_old72_median.1"      "mDLBCL_old72_median.1"  
## [111] "pDLBCL_old72_median.1"   "CHL_change"             
## [113] "pDLBCL_change"           "mDLBCL_change"          
## [115] "CHL_x_change"            "CHL_y_change"           
## [117] "pDLBCL_x_change"         "pDLBCL_y_change"        
## [119] "mDLBCL_x_change"         "mDLBCL_y_change"        
## [121] "CHL_old72_change"        "CHL_young72_change"     
## [123] "pDLBCL_old_change"       "pDLBCL_young72_change"  
## [125] "mDLBCL_young72_change"
data_mono <- read.csv("CAEBV_genes_32670_FCs.csv", header=TRUE)

summary(data_mono)
##        ID               gene       GSM2279022_AIM   GSM2279023_AIM  
##  Min.   : 2996   Length   :32670   Min.   : 1.034   Min.   : 1.095  
##  1st Qu.:18724   N.unique :30905   1st Qu.: 1.706   1st Qu.: 1.736  
##  Median :35662   N.blank  :    0   Median : 2.534   Median : 2.553  
##  Mean   :36518   Min.nchar:    1   Mean   : 3.050   Mean   : 3.062  
##  3rd Qu.:54869   Max.nchar:   22   3rd Qu.: 3.965   3rd Qu.: 3.974  
##  Max.   :70515                     Max.   :12.372   Max.   :12.369  
##  GSM2279024_AIM   GSM2279025_CAEBV GSM2279026_AIM   GSM2279027_CAEBV
##  Min.   : 1.088   Min.   : 1.063   Min.   : 1.091   Min.   : 1.087  
##  1st Qu.: 1.749   1st Qu.: 1.767   1st Qu.: 1.715   1st Qu.: 1.768  
##  Median : 2.559   Median : 2.584   Median : 2.542   Median : 2.565  
##  Mean   : 3.055   Mean   : 3.051   Mean   : 3.056   Mean   : 3.066  
##  3rd Qu.: 3.939   3rd Qu.: 3.937   3rd Qu.: 3.962   3rd Qu.: 3.936  
##  Max.   :12.423   Max.   :12.396   Max.   :12.460   Max.   :12.311  
##  GSM2279028_CAEBV GSM2279029_CAEBV  GSM2279030_CAEBV GSM2279031_healthy
##  Min.   : 1.132   Min.   : 0.7472   Min.   : 1.124   Min.   : 1.059    
##  1st Qu.: 1.755   1st Qu.: 1.7254   1st Qu.: 1.770   1st Qu.: 1.743    
##  Median : 2.568   Median : 2.5629   Median : 2.561   Median : 2.589    
##  Mean   : 3.061   Mean   : 3.0407   Mean   : 3.048   Mean   : 3.069    
##  3rd Qu.: 3.927   3rd Qu.: 3.9254   3rd Qu.: 3.896   3rd Qu.: 3.988    
##  Max.   :12.471   Max.   :12.4559   Max.   :12.495   Max.   :12.273    
##  GSM2279032_healthy GSM2279033_healthy GSM2279034_healthy GSM2279035_healthy
##  Min.   : 1.165     Min.   : 1.127     Min.   : 1.095     Min.   : 1.084    
##  1st Qu.: 1.760     1st Qu.: 1.741     1st Qu.: 1.739     1st Qu.: 1.788    
##  Median : 2.573     Median : 2.554     Median : 2.566     Median : 2.601    
##  Mean   : 3.074     Mean   : 3.063     Mean   : 3.064     Mean   : 3.086    
##  3rd Qu.: 3.957     3rd Qu.: 3.944     3rd Qu.: 3.938     3rd Qu.: 3.957    
##  Max.   :12.439     Max.   :12.428     Max.   :12.390     Max.   :12.425    
##  GSM2279036_healthy GSM2279037_AIM   GSM2279038_AIM      AIM_mean     
##  Min.   : 1.100     Min.   : 1.079   Min.   : 1.105   Min.   : 1.225  
##  1st Qu.: 1.802     1st Qu.: 1.767   1st Qu.: 1.770   1st Qu.: 1.735  
##  Median : 2.600     Median : 2.588   Median : 2.594   Median : 2.562  
##  Mean   : 3.075     Mean   : 3.059   Mean   : 3.066   Mean   : 3.058  
##  3rd Qu.: 3.926     3rd Qu.: 3.948   3rd Qu.: 3.979   3rd Qu.: 3.963  
##  Max.   :12.332     Max.   :12.430   Max.   :12.373   Max.   :12.404  
##    CAEBV_mean      healthy_mean    FC_AIM_healthy   FC_CAEBV_healthy
##  Min.   : 1.172   Min.   : 1.202   Min.   :0.3075   Min.   :0.4247  
##  1st Qu.: 1.757   1st Qu.: 1.760   1st Qu.:0.9673   1st Qu.:0.9746  
##  Median : 2.569   Median : 2.582   Median :0.9903   Median :0.9951  
##  Mean   : 3.053   Mean   : 3.072   Mean   :0.9979   Mean   :0.9968  
##  3rd Qu.: 3.930   3rd Qu.: 3.955   3rd Qu.:1.0187   3rd Qu.:1.0177  
##  Max.   :12.426   Max.   :12.381   Max.   :2.6063   Max.   :2.1094
paged_table(data_mono)

We only want to keep the gene symbol column (gene) and the six healthy samples from the AIM and CAEBV dataset.

colnames(data_mono)
##  [1] "ID"                 "gene"               "GSM2279022_AIM"    
##  [4] "GSM2279023_AIM"     "GSM2279024_AIM"     "GSM2279025_CAEBV"  
##  [7] "GSM2279026_AIM"     "GSM2279027_CAEBV"   "GSM2279028_CAEBV"  
## [10] "GSM2279029_CAEBV"   "GSM2279030_CAEBV"   "GSM2279031_healthy"
## [13] "GSM2279032_healthy" "GSM2279033_healthy" "GSM2279034_healthy"
## [16] "GSM2279035_healthy" "GSM2279036_healthy" "GSM2279037_AIM"    
## [19] "GSM2279038_AIM"     "AIM_mean"           "CAEBV_mean"        
## [22] "healthy_mean"       "FC_AIM_healthy"     "FC_CAEBV_healthy"
healthy <- data_mono[,c(2,12:17)]

paged_table(healthy)
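
The positional selection above depends on the column order staying fixed. As a minimal sketch, the same subset can be chosen by name instead. The data frame `toy_mono` below is a made-up stand-in for `data_mono` (its values are invented), and the regular expression is an assumption about the naming pattern seen in the column list.

```r
# Toy stand-in for data_mono; column names mimic the real ones, values are invented.
toy_mono <- data.frame(
  ID = c(2996, 3001, 3010),
  gene = c("GENE_A", "GENE_B", "GENE_C"),
  GSM2279022_AIM = c(2.1, 3.4, 5.0),
  GSM2279031_healthy = c(2.0, 3.1, 4.8),
  GSM2279032_healthy = c(2.2, 3.0, 4.9),
  healthy_mean = c(2.1, 3.05, 4.85),
  FC_AIM_healthy = c(1.0, 1.1, 1.03),
  stringsAsFactors = FALSE
)

# Match only per-sample healthy columns (GSM prefix, _healthy suffix),
# which excludes summary columns such as healthy_mean and FC_AIM_healthy.
healthy_cols <- grep("^GSM.*_healthy$", colnames(toy_mono), value = TRUE)
toy_healthy <- toy_mono[, c("gene", healthy_cols)]
print(colnames(toy_healthy))
```

Selecting by name keeps the subset correct even if columns are added or reordered upstream.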
dim(healthy)
## [1] 32670     7
dim(data_lymphoma)
## [1] 21448   125
colnames(data_lymphoma)
##   [1] "ID"                      "SPOT_ID.1"              
##   [3] "GeneID"                  "pDLBCL"                 
##   [5] "CHL"                     "pDLBCL.1"               
##   [7] "mDLBCL"                  "CHL.1"                  
##   [9] "mDLBCL.1"                "pDLBCL.2"               
##  [11] "CHL.2"                   "CHL.3"                  
##  [13] "CHL.4"                   "CHL.5"                  
##  [15] "mDLBCL.2"                "mDLBCL.3"               
##  [17] "CHL.6"                   "CHL.7"                  
##  [19] "pDLBCL.3"                "CHL.8"                  
##  [21] "mDLBCL.4"                "mDLBCL.5"               
##  [23] "mDLBCL.6"                "mDLBCL.7"               
##  [25] "CHL.9"                   "mDLBCL.8"               
##  [27] "CHL.10"                  "CHL.11"                 
##  [29] "CHL.12"                  "CHL.13"                 
##  [31] "mDLBCL.9"                "mDLBCL.10"              
##  [33] "mDLBCL.11"               "mDLBCL.12"              
##  [35] "mDLBCL.13"               "mDLBCL.14"              
##  [37] "pDLBCL.4"                "pDLBCL.5"               
##  [39] "mDLBCL.15"               "mDLBCL.16"              
##  [41] "CHL.14"                  "pDLBCL.6"               
##  [43] "pDLBCL.7"                "CHL.15"                 
##  [45] "CHL.16"                  "CHL.17"                 
##  [47] "mDLBCL.17"               "mDLBCL.18"              
##  [49] "mDLBCL.19"               "CHL.18"                 
##  [51] "CHL_mean"                "pDLBCL_mean"            
##  [53] "mDLBCL_mean"             "CHL_x_mean"             
##  [55] "mDLBCL_x_mean"           "pDLBCL_x_mean"          
##  [57] "CHL_y_mean"              "mDLBCL_y_mean"          
##  [59] "pDLBCL_y_mean"           "CHL_young72_mean"       
##  [61] "mDLBCL_young72_mean"     "pDLBCL_young72_mean"    
##  [63] "CHL_old"                 "mDLBCL_old"             
##  [65] "pDLBCL_old"              "CHL_median"             
##  [67] "mDLBCL_median"           "pDLBCL_median"          
##  [69] "CHL_x_median"            "mDLBCL_x_median"        
##  [71] "pDLBCL_x_median"         "CHL_y_median"           
##  [73] "mDLBCL_y_median"         "pDLBCL_y_median"        
##  [75] "CHL_young72_median"      "mDLBCL_young72_median"  
##  [77] "pDLBCL_young72_median"   "CHL_old72_median"       
##  [79] "mDLBCL_old72_median"     "pDLBCL_old72_median"    
##  [81] "ID.1"                    "CHL_mean.1"             
##  [83] "pDLBCL_mean.1"           "mDLBCL_mean.1"          
##  [85] "CHL_x_mean.1"            "mDLBCL_x_mean.1"        
##  [87] "pDLBCL_x_mean.1"         "CHL_y_mean.1"           
##  [89] "mDLBCL_y_mean.1"         "pDLBCL_y_mean.1"        
##  [91] "CHL_young72_mean.1"      "mDLBCL_young72_mean.1"  
##  [93] "pDLBCL_young72_mean.1"   "CHL_old.1"              
##  [95] "mDLBCL_old.1"            "pDLBCL_old.1"           
##  [97] "CHL_median.1"            "mDLBCL_median.1"        
##  [99] "pDLBCL_median.1"         "CHL_x_median.1"         
## [101] "mDLBCL_x_median.1"       "pDLBCL_x_median.1"      
## [103] "CHL_y_median.1"          "mDLBCL_y_median.1"      
## [105] "pDLBCL_y_median.1"       "CHL_young72_median.1"   
## [107] "mDLBCL_young72_median.1" "pDLBCL_young72_median.1"
## [109] "CHL_old72_median.1"      "mDLBCL_old72_median.1"  
## [111] "pDLBCL_old72_median.1"   "CHL_change"             
## [113] "pDLBCL_change"           "mDLBCL_change"          
## [115] "CHL_x_change"            "CHL_y_change"           
## [117] "pDLBCL_x_change"         "pDLBCL_y_change"        
## [119] "mDLBCL_x_change"         "mDLBCL_y_change"        
## [121] "CHL_old72_change"        "CHL_young72_change"     
## [123] "pDLBCL_old_change"       "pDLBCL_young72_change"  
## [125] "mDLBCL_young72_change"

Let's drop the first two columns and the trailing summary-statistic columns from data_lymphoma, keeping only the three group means, and then merge on GeneID in that dataset with gene in the healthy dataset.

lymphomas <- data_lymphoma[,c(3:53)]

paged_table(lymphomas)
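
As a hedged alternative to slicing columns 3 to 53 by position, the sample columns can be picked out by their names. This sketch uses an invented `toy_lymphoma` data frame, and the pattern assumes that sample columns are named exactly `CHL`, `pDLBCL`, or `mDLBCL`, optionally followed by a numeric suffix such as `.1`.

```r
# Toy stand-in for data_lymphoma; values are invented for illustration.
toy_lymphoma <- data.frame(
  ID = c("probe1", "probe2"),
  SPOT_ID.1 = c("s1", "s2"),
  GeneID = c("GENE_A", "GENE_B"),
  pDLBCL = c(4.1, 5.2),
  CHL = c(4.0, 5.0),
  pDLBCL.1 = c(4.2, 5.1),
  mDLBCL = c(4.3, 5.3),
  CHL.1 = c(4.4, 5.4),
  CHL_mean = c(4.2, 5.2),
  pDLBCL_mean = c(4.15, 5.15),
  mDLBCL_mean = c(4.3, 5.3),
  CHL_x_mean = c(4.1, 5.1),
  CHL_change = c(1.01, 0.99),
  stringsAsFactors = FALSE
)

# Sample columns: a group label with an optional ".<number>" suffix and nothing else.
sample_cols <- grep("^(CHL|pDLBCL|mDLBCL)(\\.[0-9]+)?$", colnames(toy_lymphoma), value = TRUE)
# The three overall group means that the analysis keeps.
mean_cols <- c("CHL_mean", "pDLBCL_mean", "mDLBCL_mean")

toy_lymphomas <- toy_lymphoma[, c("GeneID", sample_cols, mean_cols)]
print(colnames(toy_lymphomas))
```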

Now we will merge the two tables.

Data <- merge(healthy, lymphomas, by.x="gene", by.y="GeneID")

dim(Data)
## [1] 16827    57
paged_table(Data)
colnames(Data)
##  [1] "gene"               "GSM2279031_healthy" "GSM2279032_healthy"
##  [4] "GSM2279033_healthy" "GSM2279034_healthy" "GSM2279035_healthy"
##  [7] "GSM2279036_healthy" "pDLBCL"             "CHL"               
## [10] "pDLBCL.1"           "mDLBCL"             "CHL.1"             
## [13] "mDLBCL.1"           "pDLBCL.2"           "CHL.2"             
## [16] "CHL.3"              "CHL.4"              "CHL.5"             
## [19] "mDLBCL.2"           "mDLBCL.3"           "CHL.6"             
## [22] "CHL.7"              "pDLBCL.3"           "CHL.8"             
## [25] "mDLBCL.4"           "mDLBCL.5"           "mDLBCL.6"          
## [28] "mDLBCL.7"           "CHL.9"              "mDLBCL.8"          
## [31] "CHL.10"             "CHL.11"             "CHL.12"            
## [34] "CHL.13"             "mDLBCL.9"           "mDLBCL.10"         
## [37] "mDLBCL.11"          "mDLBCL.12"          "mDLBCL.13"         
## [40] "mDLBCL.14"          "pDLBCL.4"           "pDLBCL.5"          
## [43] "mDLBCL.15"          "mDLBCL.16"          "CHL.14"            
## [46] "pDLBCL.6"           "pDLBCL.7"           "CHL.15"            
## [49] "CHL.16"             "CHL.17"             "mDLBCL.17"         
## [52] "mDLBCL.18"          "mDLBCL.19"          "CHL.18"            
## [55] "CHL_mean"           "pDLBCL_mean"        "mDLBCL_mean"
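
The summaries earlier show that neither key is unique (30905 unique values in 32670 gene entries, and 18769 unique values in 21448 GeneID entries), so `merge()` returns one row for every matching pair of rows. The sketch below illustrates that behavior on invented data. Collapsing repeated genes to their mean with `aggregate()` is an assumption shown only for comparison, not the approach used in this analysis.

```r
# Invented toy tables with a repeated gene symbol in each.
toy_healthy <- data.frame(gene = c("GENE_A", "GENE_A", "GENE_B"),
                          h1 = c(2.0, 2.4, 4.0),
                          stringsAsFactors = FALSE)
toy_lymph <- data.frame(GeneID = c("GENE_A", "GENE_A", "GENE_B", "GENE_C"),
                        CHL = c(4.0, 4.2, 5.0, 3.0),
                        stringsAsFactors = FALSE)

# Inner merge: GENE_A matches 2 x 2 times, GENE_B once, GENE_C is dropped.
merged <- merge(toy_healthy, toy_lymph, by.x = "gene", by.y = "GeneID")
print(nrow(merged))

# Count how many keys repeat before merging.
print(sum(duplicated(toy_healthy$gene)))
print(sum(duplicated(toy_lymph$GeneID)))

# Optional: collapse repeats to one row per gene (mean), then merge.
healthy_unique <- aggregate(h1 ~ gene, data = toy_healthy, FUN = mean)
lymph_unique <- aggregate(CHL ~ GeneID, data = toy_lymph, FUN = mean)
merged_unique <- merge(healthy_unique, lymph_unique, by.x = "gene", by.y = "GeneID")
print(nrow(merged_unique))
```

Checking `anyDuplicated()` on the merged gene column is a quick way to see whether the same gene can show up more than once in a later top-gene list.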

Let's add the healthy mean to the data. It will serve as our baseline control for computing the fold change values, which we then use to extract the most variable genes to test our classifier model on.

Data$healthy_mean <- rowMeans(Data[,c(2:7)])

summary(Data$healthy_mean)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.293   2.072   3.126   3.419   4.368  11.400
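
A small worked example of the row-wise mean, on invented numbers, may make the step concrete. The column names in `toy` are hypothetical, and the columns are selected by name rather than by position.

```r
# Invented values: two genes, two healthy samples, one disease sample.
toy <- data.frame(gene = c("GENE_A", "GENE_B"),
                  GSM_a_healthy = c(2, 4),
                  GSM_b_healthy = c(4, 6),
                  CHL = c(5, 5),
                  stringsAsFactors = FALSE)

# Average across the healthy columns only, one value per gene (row).
healthy_cols <- grep("_healthy$", colnames(toy), value = TRUE)
toy$healthy_mean <- rowMeans(toy[, healthy_cols])
print(toy$healthy_mean)
```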

Now we can get our fold change values.

Data$foldchange_CHL_healthy <- Data$CHL_mean/Data$healthy_mean
Data$foldchange_pDLBCL_healthy <- Data$pDLBCL_mean/Data$healthy_mean
Data$foldchange_mDLBCL_healthy <- Data$mDLBCL_mean/Data$healthy_mean

paged_table(Data[,c(1:5,55:61)])

The table above shows the first few columns and the last ones after adding the fold change values, computed per group as the pathology group mean divided by the healthy mean.
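
The three fold change columns follow the same formula, group mean divided by healthy mean, so they can also be built in a loop. This is a minimal sketch on invented values; the column naming scheme mirrors the one used above.

```r
# Invented group means for two genes.
toy <- data.frame(gene = c("GENE_A", "GENE_B"),
                  CHL_mean = c(4, 6),
                  pDLBCL_mean = c(2, 3),
                  mDLBCL_mean = c(1, 6),
                  healthy_mean = c(2, 3),
                  stringsAsFactors = FALSE)

# For each group, divide its mean by the healthy baseline mean.
for (grp in c("CHL", "pDLBCL", "mDLBCL")) {
  fc_name <- paste0("foldchange_", grp, "_healthy")
  toy[[fc_name]] <- toy[[paste0(grp, "_mean")]] / toy$healthy_mean
}
print(toy[, grep("^foldchange_", colnames(toy))])
```

A ratio above 1 means the group mean is higher than the healthy mean for that gene, and a ratio below 1 means it is lower.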

# Order the merged data by each group's fold change, largest first.
Data_CHL_ordered <- Data[order(Data$foldchange_CHL_healthy, decreasing=TRUE),]
Data_pDLBCL_ordered <- Data[order(Data$foldchange_pDLBCL_healthy, decreasing=TRUE),]
Data_mDLBCL_ordered <- Data[order(Data$foldchange_mDLBCL_healthy, decreasing=TRUE),]
# Keep the 10 highest and 10 lowest fold change genes per group (rows 1:10 and 16818:16827).
n <- nrow(Data)
CHL_20 <- Data_CHL_ordered[c(1:10,(n-9):n),]
pDLBCL_20 <- Data_pDLBCL_ordered[c(1:10,(n-9):n),]
mDLBCL_20 <- Data_mDLBCL_ordered[c(1:10,(n-9):n),]
CHL_genes <- CHL_20$gene
pDLBCL_genes <- pDLBCL_20$gene
mDLBCL_genes <- mDLBCL_20$gene
CHL_genes
##  [1] "HIST1H2AJ"  "FDCSP"      "TMBIM6"     "KRTAP2-4"   "TMEM47"    
##  [6] "APOLD1"     "KRTAP2-3"   "HNRNPR"     "BRDT"       "FOXF1"     
## [11] "ZBTB10"     "ZNF888"     "HNRNPA1P33" "MAPRE2"     "SLC38A2"   
## [16] "FFAR2"      "NBPF9"      "AQP9"       "GPCPD1"     "EIF4G2"
pDLBCL_genes
##  [1] "HIST1H2AJ"  "TMBIM6"     "KRTAP2-4"   "TMEM47"     "FDCSP"     
##  [6] "APOLD1"     "KRTAP2-3"   "BRDT"       "SULF1"      "HNRNPR"    
## [11] "MAPRE2"     "SLC38A2"    "ZNF888"     "DPM1"       "ZBTB10"    
## [16] "HNRNPA1P33" "GPCPD1"     "FFAR2"      "NBPF9"      "EIF4G2"
mDLBCL_genes
##  [1] "HIST1H2AJ"  "TMBIM6"     "KRTAP2-4"   "APOLD1"     "TMEM47"    
##  [6] "HNRNPR"     "CCDC34"     "KRTAP2-3"   "SULF1"      "BRDT"      
## [11] "SLC38A2"    "ZNF888"     "ZBTB10"     "DPM1"       "HNRNPA1P33"
## [16] "FFAR2"      "MAPRE2"     "GPCPD1"     "NBPF9"      "EIF4G2"
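The row numbers 16818:16827 used above presumably pick the last ten rows of each ordered table. A sketch that avoids hard-coding the row count uses head() and tail() instead. The helper name `top_bottom` and the toy values are assumptions for illustration.

```r
# Hypothetical helper: sort by a column (largest first), then return
# the first n and the last n rows.
top_bottom <- function(df, column, n = 10) {
  ordered <- df[order(df[[column]], decreasing = TRUE), ]
  rbind(head(ordered, n), tail(ordered, n))
}

# Toy data: 8 genes with made-up fold change values.
toy <- data.frame(
  gene       = paste0("G", 1:8),
  foldchange = c(0.9, 3.1, 0.2, 1.0, 2.5, 0.4, 1.7, 1.1)
)

picked <- top_bottom(toy, "foldchange", n = 2)
as.character(picked$gene)
```

With the real data this would be called as, for example, `top_bottom(Data, "foldchange_CHL_healthy")`, and it keeps working if the number of rows changes.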
commonTopGenes <- c(CHL_genes,pDLBCL_genes,mDLBCL_genes)

uniqueTopGenes <- commonTopGenes[!duplicated(commonTopGenes)]

uniqueTopGenes
##  [1] "HIST1H2AJ"  "FDCSP"      "TMBIM6"     "KRTAP2-4"   "TMEM47"    
##  [6] "APOLD1"     "KRTAP2-3"   "HNRNPR"     "BRDT"       "FOXF1"     
## [11] "ZBTB10"     "ZNF888"     "HNRNPA1P33" "MAPRE2"     "SLC38A2"   
## [16] "FFAR2"      "NBPF9"      "AQP9"       "GPCPD1"     "EIF4G2"    
## [21] "SULF1"      "DPM1"       "CCDC34"

There are 23 top genes for all these lymphomas.

Let's see the genes that appear in more than one of the three lists.

duplicatedTopGenes <- commonTopGenes[duplicated(commonTopGenes)]
duplicatedTopGenes
##  [1] "HIST1H2AJ"  "TMBIM6"     "KRTAP2-4"   "TMEM47"     "FDCSP"     
##  [6] "APOLD1"     "KRTAP2-3"   "BRDT"       "HNRNPR"     "MAPRE2"    
## [11] "SLC38A2"    "ZNF888"     "ZBTB10"     "HNRNPA1P33" "GPCPD1"    
## [16] "FFAR2"      "NBPF9"      "EIF4G2"     "HIST1H2AJ"  "TMBIM6"    
## [21] "KRTAP2-4"   "APOLD1"     "TMEM47"     "HNRNPR"     "KRTAP2-3"  
## [26] "SULF1"      "BRDT"       "SLC38A2"    "ZNF888"     "ZBTB10"    
## [31] "DPM1"       "HNRNPA1P33" "FFAR2"      "MAPRE2"     "GPCPD1"    
## [36] "NBPF9"      "EIF4G2"

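These genes are top genes in at least two of the three lymphomas (Hodgkin's, polymorphic DLBCL, and monomorphic DLBCL). A gene is listed once for each additional list it occurs in. Most are in all three lists, but FDCSP, for example, is missing from the mDLBCL list.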

duplicated <- duplicatedTopGenes[!duplicated(duplicatedTopGenes)]

duplicated
##  [1] "HIST1H2AJ"  "TMBIM6"     "KRTAP2-4"   "TMEM47"     "FDCSP"     
##  [6] "APOLD1"     "KRTAP2-3"   "BRDT"       "HNRNPR"     "MAPRE2"    
## [11] "SLC38A2"    "ZNF888"     "ZBTB10"     "HNRNPA1P33" "GPCPD1"    
## [16] "FFAR2"      "NBPF9"      "EIF4G2"     "SULF1"      "DPM1"

There are 20 distinct genes that appear in more than one list. We will use this as one set of top genes and uniqueTopGenes as the other set.
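The difference between "in any list", "in at least two lists", and "in all three lists" can be checked directly with base R set operations. This sketch copies the three gene lists from the output above; the variable names `n_lists`, `in_any`, `in_two_plus`, and `in_all_three` are introduced here for illustration.

```r
CHL_genes <- c("HIST1H2AJ", "FDCSP", "TMBIM6", "KRTAP2-4", "TMEM47",
               "APOLD1", "KRTAP2-3", "HNRNPR", "BRDT", "FOXF1",
               "ZBTB10", "ZNF888", "HNRNPA1P33", "MAPRE2", "SLC38A2",
               "FFAR2", "NBPF9", "AQP9", "GPCPD1", "EIF4G2")
pDLBCL_genes <- c("HIST1H2AJ", "TMBIM6", "KRTAP2-4", "TMEM47", "FDCSP",
                  "APOLD1", "KRTAP2-3", "BRDT", "SULF1", "HNRNPR",
                  "MAPRE2", "SLC38A2", "ZNF888", "DPM1", "ZBTB10",
                  "HNRNPA1P33", "GPCPD1", "FFAR2", "NBPF9", "EIF4G2")
mDLBCL_genes <- c("HIST1H2AJ", "TMBIM6", "KRTAP2-4", "APOLD1", "TMEM47",
                  "HNRNPR", "CCDC34", "KRTAP2-3", "SULF1", "BRDT",
                  "SLC38A2", "ZNF888", "ZBTB10", "DPM1", "HNRNPA1P33",
                  "FFAR2", "MAPRE2", "GPCPD1", "NBPF9", "EIF4G2")

# No list repeats a gene internally, so the count per gene
# equals the number of lists that contain it.
n_lists <- table(c(CHL_genes, pDLBCL_genes, mDLBCL_genes))

in_any       <- names(n_lists)
in_two_plus  <- names(n_lists)[n_lists >= 2]
in_all_three <- Reduce(intersect, list(CHL_genes, pDLBCL_genes, mDLBCL_genes))

length(in_any)
length(in_two_plus)
length(in_all_three)
setdiff(in_two_plus, in_all_three)
```

The counts are 23, 20, and 17. The three genes found in exactly two lists are DPM1, FDCSP, and SULF1, so a strict "common to all three lymphomas" set would contain 17 genes rather than 20.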

write.csv(Data_CHL_ordered,"Dataset_CHL_pDLBCL_mDLBCL_AIM_CAEBV_healthy.csv", row.names=F)

The above dataset is going into Tableau to replace the other top genes of CHL and DLBCL.

colnames(Data)
##  [1] "gene"                      "GSM2279031_healthy"       
##  [3] "GSM2279032_healthy"        "GSM2279033_healthy"       
##  [5] "GSM2279034_healthy"        "GSM2279035_healthy"       
##  [7] "GSM2279036_healthy"        "pDLBCL"                   
##  [9] "CHL"                       "pDLBCL.1"                 
## [11] "mDLBCL"                    "CHL.1"                    
## [13] "mDLBCL.1"                  "pDLBCL.2"                 
## [15] "CHL.2"                     "CHL.3"                    
## [17] "CHL.4"                     "CHL.5"                    
## [19] "mDLBCL.2"                  "mDLBCL.3"                 
## [21] "CHL.6"                     "CHL.7"                    
## [23] "pDLBCL.3"                  "CHL.8"                    
## [25] "mDLBCL.4"                  "mDLBCL.5"                 
## [27] "mDLBCL.6"                  "mDLBCL.7"                 
## [29] "CHL.9"                     "mDLBCL.8"                 
## [31] "CHL.10"                    "CHL.11"                   
## [33] "CHL.12"                    "CHL.13"                   
## [35] "mDLBCL.9"                  "mDLBCL.10"                
## [37] "mDLBCL.11"                 "mDLBCL.12"                
## [39] "mDLBCL.13"                 "mDLBCL.14"                
## [41] "pDLBCL.4"                  "pDLBCL.5"                 
## [43] "mDLBCL.15"                 "mDLBCL.16"                
## [45] "CHL.14"                    "pDLBCL.6"                 
## [47] "pDLBCL.7"                  "CHL.15"                   
## [49] "CHL.16"                    "CHL.17"                   
## [51] "mDLBCL.17"                 "mDLBCL.18"                
## [53] "mDLBCL.19"                 "CHL.18"                   
## [55] "CHL_mean"                  "pDLBCL_mean"              
## [57] "mDLBCL_mean"               "healthy_mean"             
## [59] "foldchange_CHL_healthy"    "foldchange_pDLBCL_healthy"
## [61] "foldchange_mDLBCL_healthy"
Data1 <- Data[,1:54]
colnames(Data1)
##  [1] "gene"               "GSM2279031_healthy" "GSM2279032_healthy"
##  [4] "GSM2279033_healthy" "GSM2279034_healthy" "GSM2279035_healthy"
##  [7] "GSM2279036_healthy" "pDLBCL"             "CHL"               
## [10] "pDLBCL.1"           "mDLBCL"             "CHL.1"             
## [13] "mDLBCL.1"           "pDLBCL.2"           "CHL.2"             
## [16] "CHL.3"              "CHL.4"              "CHL.5"             
## [19] "mDLBCL.2"           "mDLBCL.3"           "CHL.6"             
## [22] "CHL.7"              "pDLBCL.3"           "CHL.8"             
## [25] "mDLBCL.4"           "mDLBCL.5"           "mDLBCL.6"          
## [28] "mDLBCL.7"           "CHL.9"              "mDLBCL.8"          
## [31] "CHL.10"             "CHL.11"             "CHL.12"            
## [34] "CHL.13"             "mDLBCL.9"           "mDLBCL.10"         
## [37] "mDLBCL.11"          "mDLBCL.12"          "mDLBCL.13"         
## [40] "mDLBCL.14"          "pDLBCL.4"           "pDLBCL.5"          
## [43] "mDLBCL.15"          "mDLBCL.16"          "CHL.14"            
## [46] "pDLBCL.6"           "pDLBCL.7"           "CHL.15"            
## [49] "CHL.16"             "CHL.17"             "mDLBCL.17"         
## [52] "mDLBCL.18"          "mDLBCL.19"          "CHL.18"

Let's get the Data1 class IDs from the column names.

healthySamples <- grep('healthy',colnames(Data1))
CHLSamples <- grep('CHL',colnames(Data1))
pDLBCLSamples <- grep('pDLBCL', colnames(Data1))
mDLBCLSamples <- grep('mDLBCL', colnames(Data1))

class <- c(colnames(Data1)[healthySamples],colnames(Data1)[CHLSamples],
           colnames(Data1)[pDLBCLSamples],colnames(Data1)[mDLBCLSamples])
           
class
##  [1] "GSM2279031_healthy" "GSM2279032_healthy" "GSM2279033_healthy"
##  [4] "GSM2279034_healthy" "GSM2279035_healthy" "GSM2279036_healthy"
##  [7] "CHL"                "CHL.1"              "CHL.2"             
## [10] "CHL.3"              "CHL.4"              "CHL.5"             
## [13] "CHL.6"              "CHL.7"              "CHL.8"             
## [16] "CHL.9"              "CHL.10"             "CHL.11"            
## [19] "CHL.12"             "CHL.13"             "CHL.14"            
## [22] "CHL.15"             "CHL.16"             "CHL.17"            
## [25] "CHL.18"             "pDLBCL"             "pDLBCL.1"          
## [28] "pDLBCL.2"           "pDLBCL.3"           "pDLBCL.4"          
## [31] "pDLBCL.5"           "pDLBCL.6"           "pDLBCL.7"          
## [34] "mDLBCL"             "mDLBCL.1"           "mDLBCL.2"          
## [37] "mDLBCL.3"           "mDLBCL.4"           "mDLBCL.5"          
## [40] "mDLBCL.6"           "mDLBCL.7"           "mDLBCL.8"          
## [43] "mDLBCL.9"           "mDLBCL.10"          "mDLBCL.11"         
## [46] "mDLBCL.12"          "mDLBCL.13"          "mDLBCL.14"         
## [49] "mDLBCL.15"          "mDLBCL.16"          "mDLBCL.17"         
## [52] "mDLBCL.18"          "mDLBCL.19"
class[1:6] <- "healthy"
class[7:25] <- "CHL"
class[26:33] <- "pDLBCL"
class[34:53] <- "mDLBCL"


table(class)
## class
##     CHL healthy  mDLBCL  pDLBCL 
##      19       6      20       8
class
##  [1] "healthy" "healthy" "healthy" "healthy" "healthy" "healthy" "CHL"    
##  [8] "CHL"     "CHL"     "CHL"     "CHL"     "CHL"     "CHL"     "CHL"    
## [15] "CHL"     "CHL"     "CHL"     "CHL"     "CHL"     "CHL"     "CHL"    
## [22] "CHL"     "CHL"     "CHL"     "CHL"     "pDLBCL"  "pDLBCL"  "pDLBCL" 
## [29] "pDLBCL"  "pDLBCL"  "pDLBCL"  "pDLBCL"  "pDLBCL"  "mDLBCL"  "mDLBCL" 
## [36] "mDLBCL"  "mDLBCL"  "mDLBCL"  "mDLBCL"  "mDLBCL"  "mDLBCL"  "mDLBCL" 
## [43] "mDLBCL"  "mDLBCL"  "mDLBCL"  "mDLBCL"  "mDLBCL"  "mDLBCL"  "mDLBCL" 
## [50] "mDLBCL"  "mDLBCL"  "mDLBCL"  "mDLBCL"
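The class labels above are assigned by position (1:6, 7:25, and so on). As an alternative sketch, the labels can be derived from the column names themselves, which avoids counting positions by hand. The vector `sample_cols` is a shortened, illustrative subset of the real column names, and `labels` and `ord` are names introduced here.

```r
# Column names in the same pattern as Data1 (shortened subset).
sample_cols <- c("GSM2279031_healthy", "GSM2279032_healthy",
                 "pDLBCL", "CHL", "pDLBCL.1", "mDLBCL", "CHL.1", "mDLBCL.1")

# Drop the ".1", ".2", ... suffix added to repeated names,
# then drop the "GSM..._" accession prefix on the healthy samples.
labels <- sub("\\.[0-9]+$", "", sample_cols)
labels <- sub("^GSM[0-9]+_", "", labels)

# Fix the class order, then get the sample order that matches it.
labels <- factor(labels, levels = c("healthy", "CHL", "pDLBCL", "mDLBCL"))
ord <- order(labels)

sample_cols[ord]
table(labels)
```

Reordering both the columns and the labels with the same `ord` keeps them aligned without a separate reordering step.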

We also have to reorder the Data1 table so its sample columns run from healthy to mDLBCL, in the same order as the class feature we made.

Data2 <- Data1[,c(1,healthySamples,CHLSamples,pDLBCLSamples,mDLBCLSamples)]
colnames(Data2)
##  [1] "gene"               "GSM2279031_healthy" "GSM2279032_healthy"
##  [4] "GSM2279033_healthy" "GSM2279034_healthy" "GSM2279035_healthy"
##  [7] "GSM2279036_healthy" "CHL"                "CHL.1"             
## [10] "CHL.2"              "CHL.3"              "CHL.4"             
## [13] "CHL.5"              "CHL.6"              "CHL.7"             
## [16] "CHL.8"              "CHL.9"              "CHL.10"            
## [19] "CHL.11"             "CHL.12"             "CHL.13"            
## [22] "CHL.14"             "CHL.15"             "CHL.16"            
## [25] "CHL.17"             "CHL.18"             "pDLBCL"            
## [28] "pDLBCL.1"           "pDLBCL.2"           "pDLBCL.3"          
## [31] "pDLBCL.4"           "pDLBCL.5"           "pDLBCL.6"          
## [34] "pDLBCL.7"           "mDLBCL"             "mDLBCL.1"          
## [37] "mDLBCL.2"           "mDLBCL.3"           "mDLBCL.4"          
## [40] "mDLBCL.5"           "mDLBCL.6"           "mDLBCL.7"          
## [43] "mDLBCL.8"           "mDLBCL.9"           "mDLBCL.10"         
## [46] "mDLBCL.11"          "mDLBCL.12"          "mDLBCL.13"         
## [49] "mDLBCL.14"          "mDLBCL.15"          "mDLBCL.16"         
## [52] "mDLBCL.17"          "mDLBCL.18"          "mDLBCL.19"

Now our class feature string will match with columns 2:54 when we add it to the matrix.

Let's make our matrices now from Data2, using the genes in the duplicated set and the uniqueTopGenes set, together with our class feature.

unique23 <- Data2[which(Data2$gene %in% uniqueTopGenes),]
duplicated20 <- Data2[which(Data2$gene %in% duplicated),]

paged_table(unique23)
paged_table(duplicated20)

The duplicated20 table holds the genes shared between the lymphoma top-gene lists. It has 22 rows rather than 20 because some gene symbols occur on more than one row, with different expression values. The unique23 table holds every gene that appears in any of the lymphoma top-gene lists.
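Repeated gene symbols like these can be found and handled before the table is transposed. The sketch below uses a toy table with one repeated symbol and shows two options: keeping every row under a distinct name, or collapsing repeats by averaging. The data and variable names are assumptions for illustration, and neither option is necessarily what was done in this project.

```r
# Toy table: gene "A" appears on two rows with different values.
toy <- data.frame(
  gene = c("A", "B", "A", "C"),
  s1   = c(1, 2, 3, 4),
  s2   = c(5, 6, 7, 8)
)

# Which symbols occur on more than one row?
counts   <- table(toy$gene)
repeated <- names(counts)[counts > 1]

# Option 1: keep every row, but give repeats distinct names (A, A.1, ...).
unique_names <- make.unique(as.character(toy$gene))

# Option 2: collapse repeats to one row per symbol by averaging.
collapsed <- aggregate(toy[-1], by = list(gene = toy$gene), FUN = mean)

repeated
unique_names
collapsed
```

Option 1 keeps all the measurements, while option 2 gives exactly one feature per gene, which makes the column names of the transposed matrix unambiguous.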

We will see how well these two sets of genes predict the class of lymphoma.

unique_mx <- data.frame(t(unique23[,2:54]))

colnames(unique_mx) <- unique23$gene

unique_mx$class <- as.factor(class)


paged_table(unique_mx)
duplicated_mx <- data.frame(t(duplicated20[,2:54]))

colnames(duplicated_mx) <- duplicated20$gene

duplicated_mx$class <- as.factor(class)


paged_table(duplicated_mx)

Now we test how well the unique genes predict class in a four-class model of healthy, CHL, pDLBCL, and mDLBCL using a random forest classifier.

set.seed(678)

inTrain <- sample(1:53, .7*53)

training <- unique_mx[inTrain,]
testing <- unique_mx[-inTrain,]

table(training$class)
## 
##     CHL healthy  mDLBCL  pDLBCL 
##      13       5      15       4
table(testing$class)
## 
##     CHL healthy  mDLBCL  pDLBCL 
##       6       1       5       4
rf_unq <- randomForest(training[1:25],training$class, mtry=8, ntree=5000, confusion=T)
rf_unq$confusion
##         CHL healthy mDLBCL pDLBCL class.error
## CHL      10       0      3      0   0.2307692
## healthy   0       5      0      0   0.0000000
## mDLBCL    2       0     12      1   0.2000000
## pDLBCL    3       0      0      1   0.7500000

In this out-of-bag confusion matrix for the training set, all healthy samples were predicted correctly, the CHL class was 77% correct, the mDLBCL class was 80% correct, and the pDLBCL class was only 25% correct.
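These percentages follow from the confusion matrix: each is the diagonal count divided by its row total, or 1 minus class.error. A short sketch, with the counts copied from the rf_unq output above and the names `conf`, `per_class_accuracy`, and `overall_accuracy` introduced here:

```r
# Confusion matrix for rf_unq (rows = actual class, columns = predicted).
conf <- matrix(
  c(10, 0,  3, 0,
     0, 5,  0, 0,
     2, 0, 12, 1,
     3, 0,  0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(actual    = c("CHL", "healthy", "mDLBCL", "pDLBCL"),
                  predicted = c("CHL", "healthy", "mDLBCL", "pDLBCL"))
)

# Correct predictions sit on the diagonal.
per_class_accuracy <- diag(conf) / rowSums(conf)
overall_accuracy   <- sum(diag(conf)) / sum(conf)

round(per_class_accuracy, 3)
round(overall_accuracy, 3)
```

Overall, 28 of the 37 training samples are classified correctly, about 76%.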

Now let's see the prediction on the hold-out validation set, the 30% of samples (16 of 53) not used for training.

prediction_unq <- predict(rf_unq, testing)

results_unq <- data.frame(predicted=prediction_unq, actual=testing$class)

results_unq

Next we repeat the same split and model with the duplicated gene set.

set.seed(678)

inTrain <- sample(1:53, .7*53)

training <- duplicated_mx[inTrain,]
testing <- duplicated_mx[-inTrain,]

table(training$class)
## 
##     CHL healthy  mDLBCL  pDLBCL 
##      13       5      15       4
table(testing$class)
## 
##     CHL healthy  mDLBCL  pDLBCL 
##       6       1       5       4
rf_duplicated <- randomForest(training[1:23],training$class, mtry=8, ntree=5000, confusion=T)

rf_duplicated$confusion
##         CHL healthy mDLBCL pDLBCL class.error
## CHL      13       0      0      0         0.0
## healthy   0       5      0      0         0.0
## mDLBCL    0       0     15      0         0.0
## pDLBCL    2       0      0      2         0.5

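The duplicated genes, those shared between the lymphoma top-gene lists, seemed to perform much better in prediction, so we will assume these are the top genes for now. In the out-of-bag confusion matrix every class was 100% accurate except the pDLBCL class at only 50%, which is still better than the unique gene set. The prediction on the hold-out test set failed with an error saying features were missing in the testing set, so these figures come from the training data only. One caveat: since duplicated20 has 22 genes, duplicated_mx has 22 gene columns plus class, so training[1:23] may have included the class label itself as a predictor, which would inflate this accuracy.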
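One way to guard against both problems is to select predictors by name rather than by position, and to make the column names unique before splitting, so training and testing carry identical names. Whether repeated gene symbols in the column names caused the "features missing" error is an assumption that has not been confirmed on this data. The toy table and variable names below are for illustration only.

```r
# Toy feature table with a repeated gene symbol and a class column
# (hypothetical values).
toy_mx <- data.frame(matrix(1:12, nrow = 3))
colnames(toy_mx) <- c("A", "B", "A", "class")   # "A" appears twice

# Make the names unique once, before splitting, so that training and
# testing subsets end up with exactly the same column names.
colnames(toy_mx) <- make.unique(colnames(toy_mx))

# Select predictors by name so the class column cannot slip in.
predictors <- setdiff(colnames(toy_mx), "class")

training <- toy_mx[1:2, ]
testing  <- toy_mx[3, , drop = FALSE]

x_train <- training[, predictors]
x_test  <- testing[, predictors, drop = FALSE]

names(x_train)
names(x_test)
```

In the analysis above, `x_train` and `training$class` would then be passed to randomForest(), and `x_test` to predict().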