This analysis is on the infectious mononucleosis gene expression data in GSE109220 in the NCBI online database directory. There are 19 samples. This data has 5 subjects for initial diagnosis and 1 month later. Then 1 subject drops out for 2 months later, while that patient and 2 others drop out for 7 months later. There are 3 healthy controls, 5 initial diagnosis, 5 1-month after infection, 4 2-months after infection, and 2 7-months after infection. This is a total of 19 samples. The cells collected are peripheral blood mononuclear cells or PBMCs at each time point looking at the micro RNA changes. They were scanned and prepared for gene chip micro array analysis. The numeric values shown are already normalized with robust multi array average, then transformed into log 2, and one way ANOVA or analysis of variance. The header information and metadata in the series file contains all the needed information. We will collect that and use that information to design our analysis by number of patients that stayed through whole study from initial infection to 7 months later of 2 patients out of 5. Then do the initial, 1 month, and 2 month analysis with the 5 patients that stayed for the study for 2 months. And finally an analysis of all samples and how the time line with unbalanced samples at each time point compare overall in evaluating top gene expression values in infectious mononucleosis and add it to our database of studies that relate back to EBV or Lyme disease.

data <- read.csv("GSE109220_series_matrix.txt",comment.char="!",
                   header=T,sep='\t')

head(data)
##        ID_REF GSM2935530 GSM2935531 GSM2935532 GSM2935533 GSM2935534 GSM2935535
## 1     14q0_st   0.895331   0.866299   0.555901   0.966886   0.770751   0.754841
## 2   14qI-1_st   1.327430   1.382420   1.239750   1.790480   1.002980   1.150330
## 3 14qI-1_x_st   1.101760   1.117640   0.887508   1.720290   0.777309   0.924656
## 4   14qI-2_st   0.899064   0.869578   0.553979   0.828477   0.793098   0.809946
## 5 14qI-3_x_st   0.941506   1.123190   1.019810   0.981515   0.751912   0.838089
## 6   14qI-4_st   0.859005   0.684318   1.061370   1.163310   0.902312   0.548366
##   GSM2935536 GSM2935537 GSM2935538 GSM2935539 GSM2935540 GSM2935541 GSM2935542
## 1   0.895331   1.043050   1.225800   0.768760   1.009170   0.876800   1.015470
## 2   0.995848   1.108190   1.388010   1.539740   1.547410   1.065380   1.197270
## 3   0.799595   0.967973   1.162340   1.314060   1.321740   0.839706   1.107620
## 4   0.809946   0.655543   0.488946   0.543943   0.880662   0.700726   1.296980
## 5   1.105910   0.826468   0.859821   1.140310   0.848059   1.010360   0.838089
## 6   0.984952   0.866898   0.780459   0.844232   0.945453   1.041640   0.910368
##   GSM2935543 GSM2935544 GSM2935545 GSM2935546 GSM2935547 GSM2935548
## 1   0.661300   0.895331   0.876800   1.225550   0.602426   1.102760
## 2   1.659260   1.086880   1.593780   1.281600   0.844680   0.922029
## 3   1.116890   0.875489   1.409310   1.372510   0.895331   0.696359
## 4   0.673405   1.028730   0.726906   0.809946   0.793098   1.027200
## 5   0.711518   0.937774   0.911435   0.880532   0.559532   0.799354
## 6   0.895331   1.078120   1.209940   0.915795   0.827384   1.458610
metadata <-read.csv("GSE109220_series_matrix.txt", sep='\t', nrows=27,na.strings=c('',' '))

metadata
##                     X.Series_title
## 1            !Series_geo_accession
## 2                   !Series_status
## 3          !Series_submission_date
## 4         !Series_last_update_date
## 5                !Series_pubmed_id
## 6                  !Series_summary
## 7                  !Series_summary
## 8           !Series_overall_design
## 9                     !Series_type
## 10             !Series_contributor
## 11             !Series_contributor
## 12             !Series_contributor
## 13               !Series_sample_id
## 14            !Series_contact_name
## 15      !Series_contact_laboratory
## 16      !Series_contact_department
## 17       !Series_contact_institute
## 18         !Series_contact_address
## 19            !Series_contact_city
## 20           !Series_contact_state
## 21 !Series_contact_zip/postal_code
## 22         !Series_contact_country
## 23      !Series_supplementary_file
## 24             !Series_platform_id
## 25          !Series_platform_taxid
## 26            !Series_sample_taxid
## 27                !Series_relation
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                              miRNA.expression.data.from.human.PBMC.derived.form.acute.infectious.mononucleosis.patients
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             GSE109220
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Public on Jan 17 2018
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           Jan 16 2018
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           Apr 18 2018
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              29379474
## 6  Epstein Barr virus (EBV) is the etiological agent of acute infectious mononucleosis (IM). Since acute IM is a self-resolving disease with most patients regaining health in one to three weeks there have been few studies examining molecular signatures in early acute stages of the disease. microRNAs have been shown, however, to influence immune cell function and consequently the generation of antibody responses in IM. In this study, we performed a comprehensive analysis of differentially expressed microRNAs in early stage uncomplicated acute IM.
## 7                                                                                                                                                                                                                                                                                                                                                         We used microarrays to profile microRNAs from patient peripheral blood obtained at the time of IM diagnosis and at subsequent time points, and identified differentially regulated microRNAs in this process.
## 8                              Peripheral blood samples were derived from symptomatic patients at diagnosis, 1 , 2 and 7 months and processed to obtain PBMCs which were frozen and then maintained in liquid nitrogen until use. PBMCs were also isolated from blood derived from healthy age-matched individuals. Patient (n = 16) and healthy control (n = 3) PBMCs were suspended and lysed in TRIzol Reagent  and cleaned-up with RNeasy spin columns using a modified protocol for isolation of total RNA and hybridized on Affymetrix GeneChip miRNA 4.0 arrays.
## 9                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Non-coding RNA profiling by array
## 10                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Vandana,,Kaul
## 11                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         Sheri,,Krams
## 12                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Olivia,M,Martinez
## 13                                                                                                                                                                                                                                                                                                                                                    GSM2935530 GSM2935531 GSM2935532 GSM2935533 GSM2935534 GSM2935535 GSM2935536 GSM2935537 GSM2935538 GSM2935539 GSM2935540 GSM2935541 GSM2935542 GSM2935543 GSM2935544 GSM2935545 GSM2935546 GSM2935547 GSM2935548 
## 14                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Vandana,,Kaul
## 15                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Transplant Immunology Lab (TIL)
## 16                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Surgery
## 17                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Stanford University
## 18                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            1201 Welch Rd., MSLS P328
## 19                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             Stanford
## 20                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   CA
## 21                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           94305-5731
## 22                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  USA
## 23                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE109nnn/GSE109220/suppl/GSE109220_RAW.tar
## 24                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             GPL19117
## 25                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                32630
## 26                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 9606
## 27                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      BioProject: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA430181

The following reads in the additional metadata information that says the GSM sample is control or patient with other information and the additional gene expression data.

headerInformation <-read.csv("GSE109220_series_matrix.txt", sep='\t', stringsAsFactors = T,strip.white=T,na.strings=" ", header=T,skip=28, nrows=35)

headerInformation[c(1,10,11,13,18:21,33:35),]
##                  X.Sample_title
## 1         !Sample_geo_accession
## 10  !Sample_characteristics_ch1
## 11  !Sample_characteristics_ch1
## 13 !Sample_extract_protocol_ch1
## 18         !Sample_hyb_protocol
## 19        !Sample_scan_protocol
## 20          !Sample_description
## 21      !Sample_data_processing
## 33       !Sample_data_row_count
## 34   !series_matrix_table_begin
## 35                       ID_REF
##                                                                                                                                                                                                                                                                                                                           PBMC_IMDiagnosis_Subject1
## 1                                                                                                                                                                                                                                                                                                                                        GSM2935530
## 10                                                                                                                                                                                                                                                                                                    cell type: Peripheral blood mononuclear cells
## 11                                                                                                                                                                                                                                                                                                                               time: at diagnosis
## 13                                                                                                                                                                                                                                                       PBMCs were derived from whole peripheral blood samples of IM patients or healthy controls.
## 18                                    The labeled targets were hybridized to the GeneChip miRNA 4.0 arrays per standard miRNA protocol, which includes 16 h (overnight) hybridization at 48◦C at 60 RPM in the Affymetrix GeneChip Hybridization oven 645. The arrays were then washed and stained in the Affymetrix GeneChip Fluidics Station 450.
## 19                                                                                                                                                                                                                                                                           The arrays were scanned using the Affymetrix GeneChip Scanner 3000 7G.
## 20                                                                                                                                                                                                                                                                                              microRNA expression data from PBMCs at IM Diagnosis
## 21 The data were analyzed with Partek Genomics Suite Software. Normalization was done suing Robust Multi-array Average (RMA) , followed by log2 transformation of data and one-way ANOVA. Multi-testing correction was done using Benjamini-Hochberg Step-Up FDR-controlling procedure for all the expressed miRNAs (default FDR Alpha level 0.05).
## 33                                                                                                                                                                                                                                                                                                                                            36353
## 34                                                                                                                                                                                                                                                                                                                                                 
## 35                                                                                                                                                                                                                                                                                                                                       GSM2935530
##                                                                                                                                                                                                                                                                                                                           PBMC_IMDiagnosis_Subject2
## 1                                                                                                                                                                                                                                                                                                                                        GSM2935531
## 10                                                                                                                                                                                                                                                                                                    cell type: Peripheral blood mononuclear cells
## 11                                                                                                                                                                                                                                                                                                                               time: at diagnosis
## 13                                                                                                                                                                                                                                                       PBMCs were derived from whole peripheral blood samples of IM patients or healthy controls.
## 18                                    The labeled targets were hybridized to the GeneChip miRNA 4.0 arrays per standard miRNA protocol, which includes 16 h (overnight) hybridization at 48◦C at 60 RPM in the Affymetrix GeneChip Hybridization oven 645. The arrays were then washed and stained in the Affymetrix GeneChip Fluidics Station 450.
## 19                                                                                                                                                                                                                                                                           The arrays were scanned using the Affymetrix GeneChip Scanner 3000 7G.
## 20                                                                                                                                                                                                                                                                                              microRNA expression data from PBMCs at IM Diagnosis
## 21 The data were analyzed with Partek Genomics Suite Software. Normalization was done suing Robust Multi-array Average (RMA) , followed by log2 transformation of data and one-way ANOVA. Multi-testing correction was done using Benjamini-Hochberg Step-Up FDR-controlling procedure for all the expressed miRNAs (default FDR Alpha level 0.05).
## 33                                                                                                                                                                                                                                                                                                                                            36353
## 34                                                                                                                                                                                                                                                                                                                                                 
## 35                                                                                                                                                                                                                                                                                                                                       GSM2935531
##                                                                                                                                                                                                                                                                                                                           PBMC_IMDiagnosis_Subject3
## 1                                                                                                                                                                                                                                                                                                                                        GSM2935532
## 10                                                                                                                                                                                                                                                                                                    cell type: Peripheral blood mononuclear cells
## 11                                                                                                                                                                                                                                                                                                                               time: at diagnosis
## 13                                                                                                                                                                                                                                                       PBMCs were derived from whole peripheral blood samples of IM patients or healthy controls.
## 18                                    The labeled targets were hybridized to the GeneChip miRNA 4.0 arrays per standard miRNA protocol, which includes 16 h (overnight) hybridization at 48◦C at 60 RPM in the Affymetrix GeneChip Hybridization oven 645. The arrays were then washed and stained in the Affymetrix GeneChip Fluidics Station 450.
## 19                                                                                                                                                                                                                                                                           The arrays were scanned using the Affymetrix GeneChip Scanner 3000 7G.
## 20                                                                                                                                                                                                                                                                                              microRNA expression data from PBMCs at IM Diagnosis
## 21 The data were analyzed with Partek Genomics Suite Software. Normalization was done suing Robust Multi-array Average (RMA) , followed by log2 transformation of data and one-way ANOVA. Multi-testing correction was done using Benjamini-Hochberg Step-Up FDR-controlling procedure for all the expressed miRNAs (default FDR Alpha level 0.05).
## 33                                                                                                                                                                                                                                                                                                                                            36353
## 34                                                                                                                                                                                                                                                                                                                                                 
## 35                                                                                                                                                                                                                                                                                                                                       GSM2935532
##                                                                                                                                                                                                                                                                                                                           PBMC_IMDiagnosis_Subject4
## 1                                                                                                                                                                                                                                                                                                                                        GSM2935533
## 10                                                                                                                                                                                                                                                                                                    cell type: Peripheral blood mononuclear cells
## 11                                                                                                                                                                                                                                                                                                                               time: at diagnosis
## 13                                                                                                                                                                                                                                                       PBMCs were derived from whole peripheral blood samples of IM patients or healthy controls.
## 18                                    The labeled targets were hybridized to the GeneChip miRNA 4.0 arrays per standard miRNA protocol, which includes 16 h (overnight) hybridization at 48◦C at 60 RPM in the Affymetrix GeneChip Hybridization oven 645. The arrays were then washed and stained in the Affymetrix GeneChip Fluidics Station 450.
## 19                                                                                                                                                                                                                                                                           The arrays were scanned using the Affymetrix GeneChip Scanner 3000 7G.
## 20                                                                                                                                                                                                                                                                                              microRNA expression data from PBMCs at IM Diagnosis
## 21 The data were analyzed with Partek Genomics Suite Software. Normalization was done suing Robust Multi-array Average (RMA) , followed by log2 transformation of data and one-way ANOVA. Multi-testing correction was done using Benjamini-Hochberg Step-Up FDR-controlling procedure for all the expressed miRNAs (default FDR Alpha level 0.05).
## 33                                                                                                                                                                                                                                                                                                                                            36353
## 34                                                                                                                                                                                                                                                                                                                                                 
## 35                                                                                                                                                                                                                                                                                                                                       GSM2935533
##                                                                                                                                                                                                                                                                                                                           PBMC_IMDiagnosis_Subject5
## 1                                                                                                                                                                                                                                                                                                                                        GSM2935534
## 10                                                                                                                                                                                                                                                                                                    cell type: Peripheral blood mononuclear cells
## 11                                                                                                                                                                                                                                                                                                                               time: at diagnosis
## 13                                                                                                                                                                                                                                                       PBMCs were derived from whole peripheral blood samples of IM patients or healthy controls.
## 18                                    The labeled targets were hybridized to the GeneChip miRNA 4.0 arrays per standard miRNA protocol, which includes 16 h (overnight) hybridization at 48◦C at 60 RPM in the Affymetrix GeneChip Hybridization oven 645. The arrays were then washed and stained in the Affymetrix GeneChip Fluidics Station 450.
## 19                                                                                                                                                                                                                                                                           The arrays were scanned using the Affymetrix GeneChip Scanner 3000 7G.
## 20                                                                                                                                                                                                                                                                                              microRNA expression data from PBMCs at IM Diagnosis
## 21 The data were analyzed with Partek Genomics Suite Software. Normalization was done suing Robust Multi-array Average (RMA) , followed by log2 transformation of data and one-way ANOVA. Multi-testing correction was done using Benjamini-Hochberg Step-Up FDR-controlling procedure for all the expressed miRNAs (default FDR Alpha level 0.05).
## 33                                                                                                                                                                                                                                                                                                                                            36353
## 34                                                                                                                                                                                                                                                                                                                                                 
## 35                                                                                                                                                                                                                                                                                                                                       GSM2935534
##                                                                                                                                                                                                                                                                                                                                PBMC_1month_Subject1
## 1                                                                                                                                                                                                                                                                                                                                        GSM2935535
## 10                                                                                                                                                                                                                                                                                                    cell type: Peripheral blood mononuclear cells
## 11                                                                                                                                                                                                                                                                                                                                    time: 1 month
## 13                                                                                                                                                                                                                                                       PBMCs were derived from whole peripheral blood samples of IM patients or healthy controls.
## 18                                    The labeled targets were hybridized to the GeneChip miRNA 4.0 arrays per standard miRNA protocol, which includes 16 h (overnight) hybridization at 48◦C at 60 RPM in the Affymetrix GeneChip Hybridization oven 645. The arrays were then washed and stained in the Affymetrix GeneChip Fluidics Station 450.
## 19                                                                                                                                                                                                                                                                           The arrays were scanned using the Affymetrix GeneChip Scanner 3000 7G.
## 20                                                                                                                                                                                                                                                                                                   microRNA expression data from PBMCs at 1 month
## 21 The data were analyzed with Partek Genomics Suite Software. Normalization was done suing Robust Multi-array Average (RMA) , followed by log2 transformation of data and one-way ANOVA. Multi-testing correction was done using Benjamini-Hochberg Step-Up FDR-controlling procedure for all the expressed miRNAs (default FDR Alpha level 0.05).
## 33                                                                                                                                                                                                                                                                                                                                            36353
## 34                                                                                                                                                                                                                                                                                                                                                 
## 35                                                                                                                                                                                                                                                                                                                                       GSM2935535
##                                                                                                                                                                                                                                                                                                                                PBMC_1month_Subject2
## 1                                                                                                                                                                                                                                                                                                                                        GSM2935536
## 10                                                                                                                                                                                                                                                                                                    cell type: Peripheral blood mononuclear cells
## 11                                                                                                                                                                                                                                                                                                                                    time: 1 month
## 13                                                                                                                                                                                                                                                       PBMCs were derived from whole peripheral blood samples of IM patients or healthy controls.
## 18                                    The labeled targets were hybridized to the GeneChip miRNA 4.0 arrays per standard miRNA protocol, which includes 16 h (overnight) hybridization at 48◦C at 60 RPM in the Affymetrix GeneChip Hybridization oven 645. The arrays were then washed and stained in the Affymetrix GeneChip Fluidics Station 450.
## 19                                                                                                                                                                                                                                                                           The arrays were scanned using the Affymetrix GeneChip Scanner 3000 7G.
## 20                                                                                                                                                                                                                                                                                                   microRNA expression data from PBMCs at 1 month
## 21 The data were analyzed with Partek Genomics Suite Software. Normalization was done suing Robust Multi-array Average (RMA) , followed by log2 transformation of data and one-way ANOVA. Multi-testing correction was done using Benjamini-Hochberg Step-Up FDR-controlling procedure for all the expressed miRNAs (default FDR Alpha level 0.05).
## 33                                                                                                                                                                                                                                                                                                                                            36353
## 34                                                                                                                                                                                                                                                                                                                                                 
## 35                                                                                                                                                                                                                                                                                                                                       GSM2935536
##                                                                                                                                                                                                                                                                                                                                PBMC_1month_Subject3
## 1                                                                                                                                                                                                                                                                                                                                        GSM2935537
## 10                                                                                                                                                                                                                                                                                                    cell type: Peripheral blood mononuclear cells
## 11                                                                                                                                                                                                                                                                                                                                    time: 1 month
## 13                                                                                                                                                                                                                                                       PBMCs were derived from whole peripheral blood samples of IM patients or healthy controls.
## 18                                    The labeled targets were hybridized to the GeneChip miRNA 4.0 arrays per standard miRNA protocol, which includes 16 h (overnight) hybridization at 48◦C at 60 RPM in the Affymetrix GeneChip Hybridization oven 645. The arrays were then washed and stained in the Affymetrix GeneChip Fluidics Station 450.
## 19                                                                                                                                                                                                                                                                           The arrays were scanned using the Affymetrix GeneChip Scanner 3000 7G.
## 20                                                                                                                                                                                                                                                                                                   microRNA expression data from PBMCs at 1 month
## 21 The data were analyzed with Partek Genomics Suite Software. Normalization was done suing Robust Multi-array Average (RMA) , followed by log2 transformation of data and one-way ANOVA. Multi-testing correction was done using Benjamini-Hochberg Step-Up FDR-controlling procedure for all the expressed miRNAs (default FDR Alpha level 0.05).
## 33                                                                                                                                                                                                                                                                                                                                            36353
## 34                                                                                                                                                                                                                                                                                                                                                 
## 35                                                                                                                                                                                                                                                                                                                                       GSM2935537
##                                                                                                                                                                                                                                                                                                                                PBMC_1month_Subject4
## 1                                                                                                                                                                                                                                                                                                                                        GSM2935538
## 10                                                                                                                                                                                                                                                                                                    cell type: Peripheral blood mononuclear cells
## 11                                                                                                                                                                                                                                                                                                                                    time: 1 month
## 13                                                                                                                                                                                                                                                       PBMCs were derived from whole peripheral blood samples of IM patients or healthy controls.
## 18                                    The labeled targets were hybridized to the GeneChip miRNA 4.0 arrays per standard miRNA protocol, which includes 16 h (overnight) hybridization at 48◦C at 60 RPM in the Affymetrix GeneChip Hybridization oven 645. The arrays were then washed and stained in the Affymetrix GeneChip Fluidics Station 450.
## 19                                                                                                                                                                                                                                                                           The arrays were scanned using the Affymetrix GeneChip Scanner 3000 7G.
## 20                                                                                                                                                                                                                                                                                                   microRNA expression data from PBMCs at 1 month
## 21 The data were analyzed with Partek Genomics Suite Software. Normalization was done suing Robust Multi-array Average (RMA) , followed by log2 transformation of data and one-way ANOVA. Multi-testing correction was done using Benjamini-Hochberg Step-Up FDR-controlling procedure for all the expressed miRNAs (default FDR Alpha level 0.05).
## 33                                                                                                                                                                                                                                                                                                                                            36353
## 34                                                                                                                                                                                                                                                                                                                                                 
## 35                                                                                                                                                                                                                                                                                                                                       GSM2935538
##                                                                                                                                                                                                                                                                                                                                PBMC_1month_Subject5
## 1                                                                                                                                                                                                                                                                                                                                        GSM2935539
## 10                                                                                                                                                                                                                                                                                                    cell type: Peripheral blood mononuclear cells
## 11                                                                                                                                                                                                                                                                                                                                    time: 1 month
## 13                                                                                                                                                                                                                                                       PBMCs were derived from whole peripheral blood samples of IM patients or healthy controls.
## 18                                    The labeled targets were hybridized to the GeneChip miRNA 4.0 arrays per standard miRNA protocol, which includes 16 h (overnight) hybridization at 48◦C at 60 RPM in the Affymetrix GeneChip Hybridization oven 645. The arrays were then washed and stained in the Affymetrix GeneChip Fluidics Station 450.
## 19                                                                                                                                                                                                                                                                           The arrays were scanned using the Affymetrix GeneChip Scanner 3000 7G.
## 20                                                                                                                                                                                                                                                                                                   microRNA expression data from PBMCs at 1 month
## 21 The data were analyzed with Partek Genomics Suite Software. Normalization was done suing Robust Multi-array Average (RMA) , followed by log2 transformation of data and one-way ANOVA. Multi-testing correction was done using Benjamini-Hochberg Step-Up FDR-controlling procedure for all the expressed miRNAs (default FDR Alpha level 0.05).
## 33                                                                                                                                                                                                                                                                                                                                            36353
## 34                                                                                                                                                                                                                                                                                                                                                 
## 35                                                                                                                                                                                                                                                                                                                                       GSM2935539
##                                                                                                                                                                                                                                                                                                                                PBMC_2month_Subject1
## 1                                                                                                                                                                                                                                                                                                                                        GSM2935540
## 10                                                                                                                                                                                                                                                                                                    cell type: Peripheral blood mononuclear cells
## 11                                                                                                                                                                                                                                                                                                                                   time: 2 months
## 13                                                                                                                                                                                                                                                       PBMCs were derived from whole peripheral blood samples of IM patients or healthy controls.
## 18                                    The labeled targets were hybridized to the GeneChip miRNA 4.0 arrays per standard miRNA protocol, which includes 16 h (overnight) hybridization at 48◦C at 60 RPM in the Affymetrix GeneChip Hybridization oven 645. The arrays were then washed and stained in the Affymetrix GeneChip Fluidics Station 450.
## 19                                                                                                                                                                                                                                                                           The arrays were scanned using the Affymetrix GeneChip Scanner 3000 7G.
## 20                                                                                                                                                                                                                                                                                                  microRNA expression data from PBMCs at 2 months
## 21 The data were analyzed with Partek Genomics Suite Software. Normalization was done suing Robust Multi-array Average (RMA) , followed by log2 transformation of data and one-way ANOVA. Multi-testing correction was done using Benjamini-Hochberg Step-Up FDR-controlling procedure for all the expressed miRNAs (default FDR Alpha level 0.05).
## 33                                                                                                                                                                                                                                                                                                                                            36353
## 34                                                                                                                                                                                                                                                                                                                                                 
## 35                                                                                                                                                                                                                                                                                                                                       GSM2935540
##                                                                                                                                                                                                                                                                                                                                PBMC_2month_Subject3
## 1                                                                                                                                                                                                                                                                                                                                        GSM2935541
## 10                                                                                                                                                                                                                                                                                                    cell type: Peripheral blood mononuclear cells
## 11                                                                                                                                                                                                                                                                                                                                   time: 2 months
## 13                                                                                                                                                                                                                                                       PBMCs were derived from whole peripheral blood samples of IM patients or healthy controls.
## 18                                    The labeled targets were hybridized to the GeneChip miRNA 4.0 arrays per standard miRNA protocol, which includes 16 h (overnight) hybridization at 48◦C at 60 RPM in the Affymetrix GeneChip Hybridization oven 645. The arrays were then washed and stained in the Affymetrix GeneChip Fluidics Station 450.
## 19                                                                                                                                                                                                                                                                           The arrays were scanned using the Affymetrix GeneChip Scanner 3000 7G.
## 20                                                                                                                                                                                                                                                                                                  microRNA expression data from PBMCs at 2 months
## 21 The data were analyzed with Partek Genomics Suite Software. Normalization was done suing Robust Multi-array Average (RMA) , followed by log2 transformation of data and one-way ANOVA. Multi-testing correction was done using Benjamini-Hochberg Step-Up FDR-controlling procedure for all the expressed miRNAs (default FDR Alpha level 0.05).
## 33                                                                                                                                                                                                                                                                                                                                            36353
## 34                                                                                                                                                                                                                                                                                                                                                 
## 35                                                                                                                                                                                                                                                                                                                                       GSM2935541
##                                                                                                                                                                                                                                                                                                                                PBMC_2month_Subject4
## 1                                                                                                                                                                                                                                                                                                                                        GSM2935542
## 10                                                                                                                                                                                                                                                                                                    cell type: Peripheral blood mononuclear cells
## 11                                                                                                                                                                                                                                                                                                                                   time: 2 months
## 13                                                                                                                                                                                                                                                       PBMCs were derived from whole peripheral blood samples of IM patients or healthy controls.
## 18                                    The labeled targets were hybridized to the GeneChip miRNA 4.0 arrays per standard miRNA protocol, which includes 16 h (overnight) hybridization at 48◦C at 60 RPM in the Affymetrix GeneChip Hybridization oven 645. The arrays were then washed and stained in the Affymetrix GeneChip Fluidics Station 450.
## 19                                                                                                                                                                                                                                                                           The arrays were scanned using the Affymetrix GeneChip Scanner 3000 7G.
## 20                                                                                                                                                                                                                                                                                                  microRNA expression data from PBMCs at 2 months
## 21 The data were analyzed with Partek Genomics Suite Software. Normalization was done suing Robust Multi-array Average (RMA) , followed by log2 transformation of data and one-way ANOVA. Multi-testing correction was done using Benjamini-Hochberg Step-Up FDR-controlling procedure for all the expressed miRNAs (default FDR Alpha level 0.05).
## 33                                                                                                                                                                                                                                                                                                                                            36353
## 34                                                                                                                                                                                                                                                                                                                                                 
## 35                                                                                                                                                                                                                                                                                                                                       GSM2935542
##                                                                                                                                                                                                                                                                                                                                PBMC_2month_Subject5
## 1                                                                                                                                                                                                                                                                                                                                        GSM2935543
## 10                                                                                                                                                                                                                                                                                                    cell type: Peripheral blood mononuclear cells
## 11                                                                                                                                                                                                                                                                                                                                   time: 2 months
## 13                                                                                                                                                                                                                                                       PBMCs were derived from whole peripheral blood samples of IM patients or healthy controls.
## 18                                    The labeled targets were hybridized to the GeneChip miRNA 4.0 arrays per standard miRNA protocol, which includes 16 h (overnight) hybridization at 48◦C at 60 RPM in the Affymetrix GeneChip Hybridization oven 645. The arrays were then washed and stained in the Affymetrix GeneChip Fluidics Station 450.
## 19                                                                                                                                                                                                                                                                           The arrays were scanned using the Affymetrix GeneChip Scanner 3000 7G.
## 20                                                                                                                                                                                                                                                                                                  microRNA expression data from PBMCs at 2 months
## 21 The data were analyzed with Partek Genomics Suite Software. Normalization was done suing Robust Multi-array Average (RMA) , followed by log2 transformation of data and one-way ANOVA. Multi-testing correction was done using Benjamini-Hochberg Step-Up FDR-controlling procedure for all the expressed miRNAs (default FDR Alpha level 0.05).
## 33                                                                                                                                                                                                                                                                                                                                            36353
## 34                                                                                                                                                                                                                                                                                                                                                 
## 35                                                                                                                                                                                                                                                                                                                                       GSM2935543
##                                                                                                                                                                                                                                                                                                                                PBMC_7month_Subject1
## 1                                                                                                                                                                                                                                                                                                                                        GSM2935544
## 10                                                                                                                                                                                                                                                                                                    cell type: Peripheral blood mononuclear cells
## 11                                                                                                                                                                                                                                                                                                                                   time: 7 months
## 13                                                                                                                                                                                                                                                       PBMCs were derived from whole peripheral blood samples of IM patients or healthy controls.
## 18                                    The labeled targets were hybridized to the GeneChip miRNA 4.0 arrays per standard miRNA protocol, which includes 16 h (overnight) hybridization at 48◦C at 60 RPM in the Affymetrix GeneChip Hybridization oven 645. The arrays were then washed and stained in the Affymetrix GeneChip Fluidics Station 450.
## 19                                                                                                                                                                                                                                                                           The arrays were scanned using the Affymetrix GeneChip Scanner 3000 7G.
## 20                                                                                                                                                                                                                                                                                                  microRNA expression data from PBMCs at 7 months
## 21 The data were analyzed with Partek Genomics Suite Software. Normalization was done suing Robust Multi-array Average (RMA) , followed by log2 transformation of data and one-way ANOVA. Multi-testing correction was done using Benjamini-Hochberg Step-Up FDR-controlling procedure for all the expressed miRNAs (default FDR Alpha level 0.05).
## 33                                                                                                                                                                                                                                                                                                                                            36353
## 34                                                                                                                                                                                                                                                                                                                                                 
## 35                                                                                                                                                                                                                                                                                                                                       GSM2935544
##                                                                                                                                                                                                                                                                                                                                PBMC_7month_Subject3
## 1                                                                                                                                                                                                                                                                                                                                        GSM2935545
## 10                                                                                                                                                                                                                                                                                                    cell type: Peripheral blood mononuclear cells
## 11                                                                                                                                                                                                                                                                                                                                   time: 7 months
## 13                                                                                                                                                                                                                                                       PBMCs were derived from whole peripheral blood samples of IM patients or healthy controls.
## 18                                    The labeled targets were hybridized to the GeneChip miRNA 4.0 arrays per standard miRNA protocol, which includes 16 h (overnight) hybridization at 48◦C at 60 RPM in the Affymetrix GeneChip Hybridization oven 645. The arrays were then washed and stained in the Affymetrix GeneChip Fluidics Station 450.
## 19                                                                                                                                                                                                                                                                           The arrays were scanned using the Affymetrix GeneChip Scanner 3000 7G.
## 20                                                                                                                                                                                                                                                                                                  microRNA expression data from PBMCs at 7 months
## 21 The data were analyzed with Partek Genomics Suite Software. Normalization was done suing Robust Multi-array Average (RMA) , followed by log2 transformation of data and one-way ANOVA. Multi-testing correction was done using Benjamini-Hochberg Step-Up FDR-controlling procedure for all the expressed miRNAs (default FDR Alpha level 0.05).
## 33                                                                                                                                                                                                                                                                                                                                            36353
## 34                                                                                                                                                                                                                                                                                                                                                 
## 35                                                                                                                                                                                                                                                                                                                                       GSM2935545
##                                                                                                                                                                                                                                                                                                                               PBMC_Control_Subject1
## 1                                                                                                                                                                                                                                                                                                                                        GSM2935546
## 10                                                                                                                                                                                                                                                                                                    cell type: Peripheral blood mononuclear cells
## 11                                                                                                                                                                                                                                                                                                                                    time: control
## 13                                                                                                                                                                                                                                                       PBMCs were derived from whole peripheral blood samples of IM patients or healthy controls.
## 18                                    The labeled targets were hybridized to the GeneChip miRNA 4.0 arrays per standard miRNA protocol, which includes 16 h (overnight) hybridization at 48◦C at 60 RPM in the Affymetrix GeneChip Hybridization oven 645. The arrays were then washed and stained in the Affymetrix GeneChip Fluidics Station 450.
## 19                                                                                                                                                                                                                                                                           The arrays were scanned using the Affymetrix GeneChip Scanner 3000 7G.
## 20                                                                                                                                                                                                                                                                             microRNA expression data from PBMCs from healthy subjects (controls)
## 21 The data were analyzed with Partek Genomics Suite Software. Normalization was done suing Robust Multi-array Average (RMA) , followed by log2 transformation of data and one-way ANOVA. Multi-testing correction was done using Benjamini-Hochberg Step-Up FDR-controlling procedure for all the expressed miRNAs (default FDR Alpha level 0.05).
## 33                                                                                                                                                                                                                                                                                                                                            36353
## 34                                                                                                                                                                                                                                                                                                                                                 
## 35                                                                                                                                                                                                                                                                                                                                       GSM2935546
##                                                                                                                                                                                                                                                                                                                               PBMC_Control_Subject2
## 1                                                                                                                                                                                                                                                                                                                                        GSM2935547
## 10                                                                                                                                                                                                                                                                                                    cell type: Peripheral blood mononuclear cells
## 11                                                                                                                                                                                                                                                                                                                                    time: control
## 13                                                                                                                                                                                                                                                       PBMCs were derived from whole peripheral blood samples of IM patients or healthy controls.
## 18                                    The labeled targets were hybridized to the GeneChip miRNA 4.0 arrays per standard miRNA protocol, which includes 16 h (overnight) hybridization at 48◦C at 60 RPM in the Affymetrix GeneChip Hybridization oven 645. The arrays were then washed and stained in the Affymetrix GeneChip Fluidics Station 450.
## 19                                                                                                                                                                                                                                                                           The arrays were scanned using the Affymetrix GeneChip Scanner 3000 7G.
## 20                                                                                                                                                                                                                                                                             microRNA expression data from PBMCs from healthy subjects (controls)
## 21 The data were analyzed with Partek Genomics Suite Software. Normalization was done suing Robust Multi-array Average (RMA) , followed by log2 transformation of data and one-way ANOVA. Multi-testing correction was done using Benjamini-Hochberg Step-Up FDR-controlling procedure for all the expressed miRNAs (default FDR Alpha level 0.05).
## 33                                                                                                                                                                                                                                                                                                                                            36353
## 34                                                                                                                                                                                                                                                                                                                                                 
## 35                                                                                                                                                                                                                                                                                                                                       GSM2935547
##                                                                                                                                                                                                                                                                                                                               PBMC_Control_Subject3
## 1                                                                                                                                                                                                                                                                                                                                        GSM2935548
## 10                                                                                                                                                                                                                                                                                                    cell type: Peripheral blood mononuclear cells
## 11                                                                                                                                                                                                                                                                                                                                    time: control
## 13                                                                                                                                                                                                                                                       PBMCs were derived from whole peripheral blood samples of IM patients or healthy controls.
## 18                                    The labeled targets were hybridized to the GeneChip miRNA 4.0 arrays per standard miRNA protocol, which includes 16 h (overnight) hybridization at 48◦C at 60 RPM in the Affymetrix GeneChip Hybridization oven 645. The arrays were then washed and stained in the Affymetrix GeneChip Fluidics Station 450.
## 19                                                                                                                                                                                                                                                                           The arrays were scanned using the Affymetrix GeneChip Scanner 3000 7G.
## 20                                                                                                                                                                                                                                                                             microRNA expression data from PBMCs from healthy subjects (controls)
## 21 The data were analyzed with Partek Genomics Suite Software. Normalization was done suing Robust Multi-array Average (RMA) , followed by log2 transformation of data and one-way ANOVA. Multi-testing correction was done using Benjamini-Hochberg Step-Up FDR-controlling procedure for all the expressed miRNAs (default FDR Alpha level 0.05).
## 33                                                                                                                                                                                                                                                                                                                                            36353
## 34                                                                                                                                                                                                                                                                                                                                                 
## 35                                                                                                                                                                                                                                                                                                                                       GSM2935548

The above are selected rows of information per sample like the time of collection of data, the cell type as microRNA, and scan and processing type with gene chip array and how the numeric data shown was normalized or processed before being used for analysis or machine learning in this study. You can see a bunch of useful information in the data set if you view it in Rstudio without the clutter of knitr printed version.

We can use the sample column names of the ‘header information’ data frame instead of the GSM ID to each column in our ‘data’ data frame.

This is great how all information was obtained, we know how each sample was collected and when as well as patient information for each sample so we can use that in how we label the GSM samples in our data. It also says there are the correct number of microRNA of 36,353 genes like our data file of gene expression data shows.

str(data)
## 'data.frame':    36353 obs. of  20 variables:
##  $ ID_REF    : chr  "14q0_st" "14qI-1_st" "14qI-1_x_st" "14qI-2_st" ...
##  $ GSM2935530: num  0.895 1.327 1.102 0.899 0.942 ...
##  $ GSM2935531: num  0.866 1.382 1.118 0.87 1.123 ...
##  $ GSM2935532: num  0.556 1.24 0.888 0.554 1.02 ...
##  $ GSM2935533: num  0.967 1.79 1.72 0.828 0.982 ...
##  $ GSM2935534: num  0.771 1.003 0.777 0.793 0.752 ...
##  $ GSM2935535: num  0.755 1.15 0.925 0.81 0.838 ...
##  $ GSM2935536: num  0.895 0.996 0.8 0.81 1.106 ...
##  $ GSM2935537: num  1.043 1.108 0.968 0.656 0.826 ...
##  $ GSM2935538: num  1.226 1.388 1.162 0.489 0.86 ...
##  $ GSM2935539: num  0.769 1.54 1.314 0.544 1.14 ...
##  $ GSM2935540: num  1.009 1.547 1.322 0.881 0.848 ...
##  $ GSM2935541: num  0.877 1.065 0.84 0.701 1.01 ...
##  $ GSM2935542: num  1.015 1.197 1.108 1.297 0.838 ...
##  $ GSM2935543: num  0.661 1.659 1.117 0.673 0.712 ...
##  $ GSM2935544: num  0.895 1.087 0.875 1.029 0.938 ...
##  $ GSM2935545: num  0.877 1.594 1.409 0.727 0.911 ...
##  $ GSM2935546: num  1.226 1.282 1.373 0.81 0.881 ...
##  $ GSM2935547: num  0.602 0.845 0.895 0.793 0.56 ...
##  $ GSM2935548: num  1.103 0.922 0.696 1.027 0.799 ...

Lets keep the ‘ID_REF’ gene field name and change the GSM ID names to their sample type.

newNames <- colnames(headerInformation)[2:20]
colnames(data)[2:20] <- newNames

str(data)
## 'data.frame':    36353 obs. of  20 variables:
##  $ ID_REF                   : chr  "14q0_st" "14qI-1_st" "14qI-1_x_st" "14qI-2_st" ...
##  $ PBMC_IMDiagnosis_Subject1: num  0.895 1.327 1.102 0.899 0.942 ...
##  $ PBMC_IMDiagnosis_Subject2: num  0.866 1.382 1.118 0.87 1.123 ...
##  $ PBMC_IMDiagnosis_Subject3: num  0.556 1.24 0.888 0.554 1.02 ...
##  $ PBMC_IMDiagnosis_Subject4: num  0.967 1.79 1.72 0.828 0.982 ...
##  $ PBMC_IMDiagnosis_Subject5: num  0.771 1.003 0.777 0.793 0.752 ...
##  $ PBMC_1month_Subject1     : num  0.755 1.15 0.925 0.81 0.838 ...
##  $ PBMC_1month_Subject2     : num  0.895 0.996 0.8 0.81 1.106 ...
##  $ PBMC_1month_Subject3     : num  1.043 1.108 0.968 0.656 0.826 ...
##  $ PBMC_1month_Subject4     : num  1.226 1.388 1.162 0.489 0.86 ...
##  $ PBMC_1month_Subject5     : num  0.769 1.54 1.314 0.544 1.14 ...
##  $ PBMC_2month_Subject1     : num  1.009 1.547 1.322 0.881 0.848 ...
##  $ PBMC_2month_Subject3     : num  0.877 1.065 0.84 0.701 1.01 ...
##  $ PBMC_2month_Subject4     : num  1.015 1.197 1.108 1.297 0.838 ...
##  $ PBMC_2month_Subject5     : num  0.661 1.659 1.117 0.673 0.712 ...
##  $ PBMC_7month_Subject1     : num  0.895 1.087 0.875 1.029 0.938 ...
##  $ PBMC_7month_Subject3     : num  0.877 1.594 1.409 0.727 0.911 ...
##  $ PBMC_Control_Subject1    : num  1.226 1.282 1.373 0.81 0.881 ...
##  $ PBMC_Control_Subject2    : num  0.602 0.845 0.895 0.793 0.56 ...
##  $ PBMC_Control_Subject3    : num  1.103 0.922 0.696 1.027 0.799 ...

Lets now do separate tasks to get only the samples all patients followed through with study to get top fold change genes. Then we will do only the 1 month and initial infection study comparison with all patient data available. And then the overall comparison with all samples even though imbalanced to see the top genes. We can do a quick randomForest to predict class of healthy or Infectious Mononucleosis (IM) on each data set analysis afterwards with only 2 classes and 10,000 trees using cross validation.

===============================================================

Only using the patients that stayed through the whole study for this part, so we need to select our samples to be only patient 1 and 3.

colnames(data)
##  [1] "ID_REF"                    "PBMC_IMDiagnosis_Subject1"
##  [3] "PBMC_IMDiagnosis_Subject2" "PBMC_IMDiagnosis_Subject3"
##  [5] "PBMC_IMDiagnosis_Subject4" "PBMC_IMDiagnosis_Subject5"
##  [7] "PBMC_1month_Subject1"      "PBMC_1month_Subject2"     
##  [9] "PBMC_1month_Subject3"      "PBMC_1month_Subject4"     
## [11] "PBMC_1month_Subject5"      "PBMC_2month_Subject1"     
## [13] "PBMC_2month_Subject3"      "PBMC_2month_Subject4"     
## [15] "PBMC_2month_Subject5"      "PBMC_7month_Subject1"     
## [17] "PBMC_7month_Subject3"      "PBMC_Control_Subject1"    
## [19] "PBMC_Control_Subject2"     "PBMC_Control_Subject3"
patients13 <- data[,c(1,2,4,7,9,12,13,16,17:20)]
str(patients13)
## 'data.frame':    36353 obs. of  12 variables:
##  $ ID_REF                   : chr  "14q0_st" "14qI-1_st" "14qI-1_x_st" "14qI-2_st" ...
##  $ PBMC_IMDiagnosis_Subject1: num  0.895 1.327 1.102 0.899 0.942 ...
##  $ PBMC_IMDiagnosis_Subject3: num  0.556 1.24 0.888 0.554 1.02 ...
##  $ PBMC_1month_Subject1     : num  0.755 1.15 0.925 0.81 0.838 ...
##  $ PBMC_1month_Subject3     : num  1.043 1.108 0.968 0.656 0.826 ...
##  $ PBMC_2month_Subject1     : num  1.009 1.547 1.322 0.881 0.848 ...
##  $ PBMC_2month_Subject3     : num  0.877 1.065 0.84 0.701 1.01 ...
##  $ PBMC_7month_Subject1     : num  0.895 1.087 0.875 1.029 0.938 ...
##  $ PBMC_7month_Subject3     : num  0.877 1.594 1.409 0.727 0.911 ...
##  $ PBMC_Control_Subject1    : num  1.226 1.282 1.373 0.81 0.881 ...
##  $ PBMC_Control_Subject2    : num  0.602 0.845 0.895 0.793 0.56 ...
##  $ PBMC_Control_Subject3    : num  1.103 0.922 0.696 1.027 0.799 ...

Now add in row means of each time point and controls or healthy samples, and an overall IM pathology row means per gene to compare.

patients13$initial_mean <- rowMeans(patients13[,c(2,3)], na.rm=FALSE, dims=1)
patients13$means_1month <- rowMeans(patients13[,c(4,5)], na.rm=FALSE,dims=1)
patients13$means_2months <- rowMeans(patients13[,c(6,7)], na.rm=FALSE, dims=1)
patients13$means_7months <- rowMeans(patients13[,c(8,9)], na.rm=FALSE, dims=1)
patients13$IM_total_means <- rowMeans(patients13[,c(2:9)], na.rm=FALSE,dims=1)
patients13$healthy_mean <- rowMeans(patients13[,c(10:12)], na.rm=FALSE, dims=1)

str(patients13)
## 'data.frame':    36353 obs. of  18 variables:
##  $ ID_REF                   : chr  "14q0_st" "14qI-1_st" "14qI-1_x_st" "14qI-2_st" ...
##  $ PBMC_IMDiagnosis_Subject1: num  0.895 1.327 1.102 0.899 0.942 ...
##  $ PBMC_IMDiagnosis_Subject3: num  0.556 1.24 0.888 0.554 1.02 ...
##  $ PBMC_1month_Subject1     : num  0.755 1.15 0.925 0.81 0.838 ...
##  $ PBMC_1month_Subject3     : num  1.043 1.108 0.968 0.656 0.826 ...
##  $ PBMC_2month_Subject1     : num  1.009 1.547 1.322 0.881 0.848 ...
##  $ PBMC_2month_Subject3     : num  0.877 1.065 0.84 0.701 1.01 ...
##  $ PBMC_7month_Subject1     : num  0.895 1.087 0.875 1.029 0.938 ...
##  $ PBMC_7month_Subject3     : num  0.877 1.594 1.409 0.727 0.911 ...
##  $ PBMC_Control_Subject1    : num  1.226 1.282 1.373 0.81 0.881 ...
##  $ PBMC_Control_Subject2    : num  0.602 0.845 0.895 0.793 0.56 ...
##  $ PBMC_Control_Subject3    : num  1.103 0.922 0.696 1.027 0.799 ...
##  $ initial_mean             : num  0.726 1.284 0.995 0.727 0.981 ...
##  $ means_1month             : num  0.899 1.129 0.946 0.733 0.832 ...
##  $ means_2months            : num  0.943 1.306 1.081 0.791 0.929 ...
##  $ means_7months            : num  0.886 1.34 1.142 0.878 0.925 ...
##  $ IM_total_means           : num  0.863 1.265 1.041 0.782 0.917 ...
##  $ healthy_mean             : num  0.977 1.016 0.988 0.877 0.746 ...

Now get the fold change value of each time point mean and overall means over the healthy means.

patients13$FC_initial_healthy <- patients13$initial_mean/patients13$healthy_mean

patients13$FC_1month_healthy <- patients13$means_1month/patients13$healthy_mean

patients13$FC_2months_healthy <- patients13$means_2months/patients13$healthy_mean

patients13$FC_7months_healthy <- patients13$means_7months/patients13$healthy_mean

patients13$FC_all_IM_healthy <- patients13$IM_total_means/patients13$healthy_mean

str(patients13)
## 'data.frame':    36353 obs. of  23 variables:
##  $ ID_REF                   : chr  "14q0_st" "14qI-1_st" "14qI-1_x_st" "14qI-2_st" ...
##  $ PBMC_IMDiagnosis_Subject1: num  0.895 1.327 1.102 0.899 0.942 ...
##  $ PBMC_IMDiagnosis_Subject3: num  0.556 1.24 0.888 0.554 1.02 ...
##  $ PBMC_1month_Subject1     : num  0.755 1.15 0.925 0.81 0.838 ...
##  $ PBMC_1month_Subject3     : num  1.043 1.108 0.968 0.656 0.826 ...
##  $ PBMC_2month_Subject1     : num  1.009 1.547 1.322 0.881 0.848 ...
##  $ PBMC_2month_Subject3     : num  0.877 1.065 0.84 0.701 1.01 ...
##  $ PBMC_7month_Subject1     : num  0.895 1.087 0.875 1.029 0.938 ...
##  $ PBMC_7month_Subject3     : num  0.877 1.594 1.409 0.727 0.911 ...
##  $ PBMC_Control_Subject1    : num  1.226 1.282 1.373 0.81 0.881 ...
##  $ PBMC_Control_Subject2    : num  0.602 0.845 0.895 0.793 0.56 ...
##  $ PBMC_Control_Subject3    : num  1.103 0.922 0.696 1.027 0.799 ...
##  $ initial_mean             : num  0.726 1.284 0.995 0.727 0.981 ...
##  $ means_1month             : num  0.899 1.129 0.946 0.733 0.832 ...
##  $ means_2months            : num  0.943 1.306 1.081 0.791 0.929 ...
##  $ means_7months            : num  0.886 1.34 1.142 0.878 0.925 ...
##  $ IM_total_means           : num  0.863 1.265 1.041 0.782 0.917 ...
##  $ healthy_mean             : num  0.977 1.016 0.988 0.877 0.746 ...
##  $ FC_initial_healthy       : num  0.743 1.263 1.007 0.829 1.314 ...
##  $ FC_1month_healthy        : num  0.92 1.111 0.958 0.836 1.115 ...
##  $ FC_2months_healthy       : num  0.965 1.286 1.094 0.902 1.245 ...
##  $ FC_7months_healthy       : num  0.907 1.319 1.156 1.001 1.239 ...
##  $ FC_all_IM_healthy        : num  0.884 1.245 1.054 0.892 1.228 ...

Lets order the data frame samples and keep the top 5 under and over expressed or top 5 enhanced and top 5 silenced genes for each time point and all timepoints.

colnames(patients13)
##  [1] "ID_REF"                    "PBMC_IMDiagnosis_Subject1"
##  [3] "PBMC_IMDiagnosis_Subject3" "PBMC_1month_Subject1"     
##  [5] "PBMC_1month_Subject3"      "PBMC_2month_Subject1"     
##  [7] "PBMC_2month_Subject3"      "PBMC_7month_Subject1"     
##  [9] "PBMC_7month_Subject3"      "PBMC_Control_Subject1"    
## [11] "PBMC_Control_Subject2"     "PBMC_Control_Subject3"    
## [13] "initial_mean"              "means_1month"             
## [15] "means_2months"             "means_7months"            
## [17] "IM_total_means"            "healthy_mean"             
## [19] "FC_initial_healthy"        "FC_1month_healthy"        
## [21] "FC_2months_healthy"        "FC_7months_healthy"       
## [23] "FC_all_IM_healthy"
X_0 <- patients13[order(patients13$FC_initial_healthy, decreasing=T),]
DX0 <- X_0[c(1:5,36349:36353),c(1,19)]
DX0
##                ID_REF FC_initial_healthy
## 23290 MIMAT0018929_st         12.7639130
## 14878 MIMAT0009448_st          7.3035458
## 22480 MIMAT0018115_st          6.4052125
## 29663 MIMAT0025375_st          6.3131440
## 23138 MIMAT0018776_st          6.1511222
## 14759 MIMAT0009317_st          0.2215564
## 15326 MIMAT0009904_st          0.2215564
## 17635 MIMAT0013157_st          0.2215564
## 20275 MIMAT0015888_st          0.2215564
## 28413 MIMAT0024117_st          0.1667130
X_1 <- patients13[order(patients13$FC_1month_healthy, decreasing=T),]
DX1 <- X_1[c(1:5,36349:36353),c(1,20)]
DX1
##                ID_REF FC_1month_healthy
## 23290 MIMAT0018929_st        12.0429496
## 22480 MIMAT0018115_st         6.0032794
## 33919 MIMAT0029648_st         5.6526514
## 28681 MIMAT0024385_st         5.4238156
## 36038 MIMAT0031767_st         5.4099636
## 14750 MIMAT0009308_st         0.1895457
## 17628 MIMAT0013150_st         0.1895457
## 23664 MIMAT0019306_st         0.1895457
## 28243 MIMAT0023947_st         0.1895457
## 23686 MIMAT0019328_st         0.1696506
X_2 <- patients13[order(patients13$FC_2months_healthy, decreasing=T),]
DX2 <- X_2[c(1:5,36349:36353),c(1,21)]
DX2
##                ID_REF FC_2months_healthy
## 23290 MIMAT0018929_st         10.8804258
## 26585 MIMAT0022271_st          6.2396110
## 33919 MIMAT0029648_st          5.5550380
## 28681 MIMAT0024385_st          5.4038902
## 22480 MIMAT0018115_st          5.3847433
## 13712 MIMAT0008257_st          0.2362490
## 20414 MIMAT0016027_st          0.2362490
## 14794 MIMAT0009352_st          0.2317874
## 11612 MIMAT0006109_st          0.2223497
## 28413 MIMAT0024117_st          0.2181580
X_7 <- patients13[order(patients13$FC_7months_healthy, decreasing=T),]
DX7 <- X_7[c(1:5,36349:36353),c(1,22)]
DX7
##                ID_REF FC_7months_healthy
## 23290 MIMAT0018929_st          8.0489859
## 23138 MIMAT0018776_st          4.2975872
## 22480 MIMAT0018115_st          4.2067260
## 26917 MIMAT0022608_st          4.1895747
## 19436 MIMAT0015036_st          4.0733855
## 26939 MIMAT0022630_st          0.2830678
## 22326 MIMAT0017956_st          0.2513339
## 18370 MIMAT0013946_st          0.2500538
## 28101 MIMAT0023805_st          0.2322748
## 14794 MIMAT0009352_st          0.2166384
X_all <- patients13[order(patients13$FC_all_IM_healthy, decreasing=T),]
DXall <- X_all[c(1:5,36349:36353),c(1,23)]
DXall
##                ID_REF FC_all_IM_healthy
## 23290 MIMAT0018929_st        10.9340686
## 22480 MIMAT0018115_st         5.4999903
## 23138 MIMAT0018776_st         4.9913737
## 28681 MIMAT0024385_st         4.9629552
## 26585 MIMAT0022271_st         4.8911259
## 13589 MIMAT0008134_st         0.2537693
## 14750 MIMAT0009308_st         0.2537693
## 17628 MIMAT0013150_st         0.2537693
## 23664 MIMAT0019306_st         0.2537693
## 28243 MIMAT0023947_st         0.2537693

Lets find out which genes are common and not among these time stamps of IM.

DX01 <- DX0$ID_REF %in% DX1$ID_REF
DX0$ID_REF[DX01]
## [1] "MIMAT0018929_st" "MIMAT0018115_st"

The two genes above are common to initial and 1 month after infection.

DX27 <- DX2$ID_REF %in% DX7$ID_REF
common0127 <- DX2$ID_REF[DX27]
common0127
## [1] "MIMAT0018929_st" "MIMAT0018115_st" "MIMAT0009352_st"

The three genes above are common between 2 months and 7 months after infections. We can visually inspect and see the first two genes are common to initial, 1 month, 2 months, and 7 months after infection.

Now see if these genes are common to all timepoints top genes.

commonAll <- DXall[which(DXall$ID_REF %in% common0127),]
commonAll
##                ID_REF FC_all_IM_healthy
## 23290 MIMAT0018929_st          10.93407
## 22480 MIMAT0018115_st           5.49999

There are two genes common to initial diagnosis of infectious mononucleosis, 1 month later, 2 months later, and 7 months later. Those 2 genes are MIMAT0018929_st and MIMAT0018115_st.

So there is no record of it on genecards.org and an internet search for it turns up nothing but we know it is a gene ID through Affymetrix gene chip scanning so it must be a gene ID with the machine used to scan in PBMCs into the arrays to analyze. Lets just use the 7 months of infection top genes to analyze and use in prediction of the class of healthy or IM when we get there. Lets write out the 7 months ordered decreasing for fold change at 7 months compared to healthy.

write.csv(X_7, 'mononucleosis_7monthsInfected_allGenesOrderedDecreasing.csv', row.names=F)

Lets get the top genes by name in each infectious time stamp month.

DX0$class <- "initial infection"
DX1$class <- "month1 of infection"
DX2$class <- "month2 of infection"
DX7$class <- "month7 of infection"

colnames(DX0) <- c("miroRNA","FoldChangePathologyOverHealthy","class")
colnames(DX1) <- c("miroRNA","FoldChangePathologyOverHealthy","class")
colnames(DX2) <- c("miroRNA","FoldChangePathologyOverHealthy","class")
colnames(DX7) <- c("miroRNA","FoldChangePathologyOverHealthy","class")
topGenesClasses <- rbind(DX0,DX1,DX2,DX7)
topGenesClasses
##                miroRNA FoldChangePathologyOverHealthy               class
## 23290  MIMAT0018929_st                     12.7639130   initial infection
## 14878  MIMAT0009448_st                      7.3035458   initial infection
## 22480  MIMAT0018115_st                      6.4052125   initial infection
## 29663  MIMAT0025375_st                      6.3131440   initial infection
## 23138  MIMAT0018776_st                      6.1511222   initial infection
## 14759  MIMAT0009317_st                      0.2215564   initial infection
## 15326  MIMAT0009904_st                      0.2215564   initial infection
## 17635  MIMAT0013157_st                      0.2215564   initial infection
## 20275  MIMAT0015888_st                      0.2215564   initial infection
## 28413  MIMAT0024117_st                      0.1667130   initial infection
## 232901 MIMAT0018929_st                     12.0429496 month1 of infection
## 224801 MIMAT0018115_st                      6.0032794 month1 of infection
## 33919  MIMAT0029648_st                      5.6526514 month1 of infection
## 28681  MIMAT0024385_st                      5.4238156 month1 of infection
## 36038  MIMAT0031767_st                      5.4099636 month1 of infection
## 14750  MIMAT0009308_st                      0.1895457 month1 of infection
## 17628  MIMAT0013150_st                      0.1895457 month1 of infection
## 23664  MIMAT0019306_st                      0.1895457 month1 of infection
## 28243  MIMAT0023947_st                      0.1895457 month1 of infection
## 23686  MIMAT0019328_st                      0.1696506 month1 of infection
## 232902 MIMAT0018929_st                     10.8804258 month2 of infection
## 26585  MIMAT0022271_st                      6.2396110 month2 of infection
## 339191 MIMAT0029648_st                      5.5550380 month2 of infection
## 286811 MIMAT0024385_st                      5.4038902 month2 of infection
## 224802 MIMAT0018115_st                      5.3847433 month2 of infection
## 13712  MIMAT0008257_st                      0.2362490 month2 of infection
## 20414  MIMAT0016027_st                      0.2362490 month2 of infection
## 14794  MIMAT0009352_st                      0.2317874 month2 of infection
## 11612  MIMAT0006109_st                      0.2223497 month2 of infection
## 284131 MIMAT0024117_st                      0.2181580 month2 of infection
## 232903 MIMAT0018929_st                      8.0489859 month7 of infection
## 231381 MIMAT0018776_st                      4.2975872 month7 of infection
## 224803 MIMAT0018115_st                      4.2067260 month7 of infection
## 26917  MIMAT0022608_st                      4.1895747 month7 of infection
## 19436  MIMAT0015036_st                      4.0733855 month7 of infection
## 26939  MIMAT0022630_st                      0.2830678 month7 of infection
## 22326  MIMAT0017956_st                      0.2513339 month7 of infection
## 18370  MIMAT0013946_st                      0.2500538 month7 of infection
## 28101  MIMAT0023805_st                      0.2322748 month7 of infection
## 147941 MIMAT0009352_st                      0.2166384 month7 of infection
top40 <- topGenesClasses[order(topGenesClasses$FoldChangePathologyOverHealthy,decreasing=T),]
top40
##                miroRNA FoldChangePathologyOverHealthy               class
## 23290  MIMAT0018929_st                     12.7639130   initial infection
## 232901 MIMAT0018929_st                     12.0429496 month1 of infection
## 232902 MIMAT0018929_st                     10.8804258 month2 of infection
## 232903 MIMAT0018929_st                      8.0489859 month7 of infection
## 14878  MIMAT0009448_st                      7.3035458   initial infection
## 22480  MIMAT0018115_st                      6.4052125   initial infection
## 29663  MIMAT0025375_st                      6.3131440   initial infection
## 26585  MIMAT0022271_st                      6.2396110 month2 of infection
## 23138  MIMAT0018776_st                      6.1511222   initial infection
## 224801 MIMAT0018115_st                      6.0032794 month1 of infection
## 33919  MIMAT0029648_st                      5.6526514 month1 of infection
## 339191 MIMAT0029648_st                      5.5550380 month2 of infection
## 28681  MIMAT0024385_st                      5.4238156 month1 of infection
## 36038  MIMAT0031767_st                      5.4099636 month1 of infection
## 286811 MIMAT0024385_st                      5.4038902 month2 of infection
## 224802 MIMAT0018115_st                      5.3847433 month2 of infection
## 231381 MIMAT0018776_st                      4.2975872 month7 of infection
## 224803 MIMAT0018115_st                      4.2067260 month7 of infection
## 26917  MIMAT0022608_st                      4.1895747 month7 of infection
## 19436  MIMAT0015036_st                      4.0733855 month7 of infection
## 26939  MIMAT0022630_st                      0.2830678 month7 of infection
## 22326  MIMAT0017956_st                      0.2513339 month7 of infection
## 18370  MIMAT0013946_st                      0.2500538 month7 of infection
## 13712  MIMAT0008257_st                      0.2362490 month2 of infection
## 20414  MIMAT0016027_st                      0.2362490 month2 of infection
## 28101  MIMAT0023805_st                      0.2322748 month7 of infection
## 14794  MIMAT0009352_st                      0.2317874 month2 of infection
## 11612  MIMAT0006109_st                      0.2223497 month2 of infection
## 14759  MIMAT0009317_st                      0.2215564   initial infection
## 15326  MIMAT0009904_st                      0.2215564   initial infection
## 17635  MIMAT0013157_st                      0.2215564   initial infection
## 20275  MIMAT0015888_st                      0.2215564   initial infection
## 284131 MIMAT0024117_st                      0.2181580 month2 of infection
## 147941 MIMAT0009352_st                      0.2166384 month7 of infection
## 14750  MIMAT0009308_st                      0.1895457 month1 of infection
## 17628  MIMAT0013150_st                      0.1895457 month1 of infection
## 23664  MIMAT0019306_st                      0.1895457 month1 of infection
## 28243  MIMAT0023947_st                      0.1895457 month1 of infection
## 23686  MIMAT0019328_st                      0.1696506 month1 of infection
## 28413  MIMAT0024117_st                      0.1667130   initial infection

Write this out to csv.

write.csv(top40, 'top40_mono_miRNA_4timestamps_orderedFCs.csv', row.names=F)
top20 <- top40$miroRNA[c(1:10,31:40)]
top20
##  [1] "MIMAT0018929_st" "MIMAT0018929_st" "MIMAT0018929_st" "MIMAT0018929_st"
##  [5] "MIMAT0009448_st" "MIMAT0018115_st" "MIMAT0025375_st" "MIMAT0022271_st"
##  [9] "MIMAT0018776_st" "MIMAT0018115_st" "MIMAT0013157_st" "MIMAT0015888_st"
## [13] "MIMAT0024117_st" "MIMAT0009352_st" "MIMAT0009308_st" "MIMAT0013150_st"
## [17] "MIMAT0019306_st" "MIMAT0023947_st" "MIMAT0019328_st" "MIMAT0024117_st"
top <- top20[!duplicated(top20)]
top
##  [1] "MIMAT0018929_st" "MIMAT0009448_st" "MIMAT0018115_st" "MIMAT0025375_st"
##  [5] "MIMAT0022271_st" "MIMAT0018776_st" "MIMAT0013157_st" "MIMAT0015888_st"
##  [9] "MIMAT0024117_st" "MIMAT0009352_st" "MIMAT0009308_st" "MIMAT0013150_st"
## [13] "MIMAT0019306_st" "MIMAT0023947_st" "MIMAT0019328_st"

So we have 15 unique top genes between all timestamps to use as predictors in predicting class as healthy or infectious mononucleosis.

patientsTop <- patients13[patients13$ID_REF %in% top,c(1:12)]
colnames(patientsTop)
##  [1] "ID_REF"                    "PBMC_IMDiagnosis_Subject1"
##  [3] "PBMC_IMDiagnosis_Subject3" "PBMC_1month_Subject1"     
##  [5] "PBMC_1month_Subject3"      "PBMC_2month_Subject1"     
##  [7] "PBMC_2month_Subject3"      "PBMC_7month_Subject1"     
##  [9] "PBMC_7month_Subject3"      "PBMC_Control_Subject1"    
## [11] "PBMC_Control_Subject2"     "PBMC_Control_Subject3"

Two class ML and a five class ML with randomForest after making the respective matrices.

class <- c("mononucleosis","mononucleosis","mononucleosis","mononucleosis","mononucleosis","mononucleosis","mononucleosis","mononucleosis", "healthy", "healthy","healthy")

p13geneHeader <- patientsTop$ID_REF

matrix13 <- data.frame(t(patientsTop[,-1]))
colnames(matrix13) <- p13geneHeader
matrix13$class <- class
str(matrix13)
## 'data.frame':    11 obs. of  16 variables:
##  $ MIMAT0009308_st: num  1.254 0.82 0.975 0.583 1.316 ...
##  $ MIMAT0009352_st: num  0.889 1.364 1.351 0.939 1.053 ...
##  $ MIMAT0009448_st: num  6.61 6.11 3.62 4.15 6.52 ...
##  $ MIMAT0013150_st: num  1.254 0.82 0.975 0.583 1.316 ...
##  $ MIMAT0013157_st: num  0.791 1.077 1.235 0.759 1.672 ...
##  $ MIMAT0015888_st: num  0.791 1.077 1.235 0.759 1.672 ...
##  $ MIMAT0018115_st: num  7.99 6.87 7.29 6.64 6.87 ...
##  $ MIMAT0018776_st: num  7.14 6.99 5.09 5.75 5.46 ...
##  $ MIMAT0018929_st: num  9.56 7.75 8.74 7.6 8.18 ...
##  $ MIMAT0019306_st: num  1.254 0.82 0.975 0.583 1.316 ...
##  $ MIMAT0019328_st: num  1.931 0.775 1.105 0.656 2.748 ...
##  $ MIMAT0022271_st: num  4.8 4.99 3.19 5.17 6.77 ...
##  $ MIMAT0023947_st: num  1.254 0.82 0.975 0.583 1.316 ...
##  $ MIMAT0024117_st: num  0.859 0.5 1.153 0.965 1.187 ...
##  $ MIMAT0025375_st: num  5.83 6.64 4.91 4.48 4.31 ...
##  $ class          : chr  "mononucleosis" "mononucleosis" "mononucleosis" "mononucleosis" ...

write to csv and read back in to avoid row names.

write.csv(matrix13,"matrix1_3_top15genes.csv", row.names=F)
matrix13 <- read.csv("matrix1_3_top15genes.csv", header=T, sep=',')
matrix13$class <- as.factor(matrix13$class)
str(matrix13)
## 'data.frame':    11 obs. of  16 variables:
##  $ MIMAT0009308_st: num  1.254 0.82 0.975 0.583 1.316 ...
##  $ MIMAT0009352_st: num  0.889 1.364 1.351 0.939 1.053 ...
##  $ MIMAT0009448_st: num  6.61 6.11 3.62 4.15 6.52 ...
##  $ MIMAT0013150_st: num  1.254 0.82 0.975 0.583 1.316 ...
##  $ MIMAT0013157_st: num  0.791 1.077 1.235 0.759 1.672 ...
##  $ MIMAT0015888_st: num  0.791 1.077 1.235 0.759 1.672 ...
##  $ MIMAT0018115_st: num  7.99 6.87 7.29 6.64 6.87 ...
##  $ MIMAT0018776_st: num  7.14 6.99 5.09 5.75 5.46 ...
##  $ MIMAT0018929_st: num  9.56 7.75 8.74 7.6 8.18 ...
##  $ MIMAT0019306_st: num  1.254 0.82 0.975 0.583 1.316 ...
##  $ MIMAT0019328_st: num  1.931 0.775 1.105 0.656 2.748 ...
##  $ MIMAT0022271_st: num  4.8 4.99 3.19 5.17 6.77 ...
##  $ MIMAT0023947_st: num  1.254 0.82 0.975 0.583 1.316 ...
##  $ MIMAT0024117_st: num  0.859 0.5 1.153 0.965 1.187 ...
##  $ MIMAT0025375_st: num  5.83 6.64 4.91 4.48 4.31 ...
##  $ class          : Factor w/ 2 levels "healthy","mononucleosis": 2 2 2 2 2 2 2 2 1 1 ...
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
training <- matrix13[c(1,2,3,4,5,7,9,10),]
testing <- matrix13[c(6,8,11),]
table(testing$class)
## 
##       healthy mononucleosis 
##             1             2
table(training$class)
## 
##       healthy mononucleosis 
##             2             6
set.seed(123)
ML15.rf <- randomForest(class ~ ., data=training, ntree=10000, keep.forest=TRUE,importance=TRUE)

print(ML15.rf)
## 
## Call:
##  randomForest(formula = class ~ ., data = training, ntree = 10000,      keep.forest = TRUE, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 10000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 12.5%
## Confusion matrix:
##               healthy mononucleosis class.error
## healthy             1             1         0.5
## mononucleosis       0             6         0.0
predicted <- predict(ML15.rf,testing)

results <- data.frame(predicted=predicted,actual=testing$class)
results
##        predicted        actual
## 6  mononucleosis mononucleosis
## 8  mononucleosis mononucleosis
## 11       healthy       healthy

The two class model predicted with 100% accuracy the class as mononucleosis infection or healthy for these top 15 genes among all timestamps. Lets do this for a five class model to see how well prediction is in predicting initial infection, 1 month infected, 2 months infected, 7 months infected or healthy. We have to go back to the matrix and make the class have 5 classes instead of 2 for respective class.

colnames(patientsTop)
##  [1] "ID_REF"                    "PBMC_IMDiagnosis_Subject1"
##  [3] "PBMC_IMDiagnosis_Subject3" "PBMC_1month_Subject1"     
##  [5] "PBMC_1month_Subject3"      "PBMC_2month_Subject1"     
##  [7] "PBMC_2month_Subject3"      "PBMC_7month_Subject1"     
##  [9] "PBMC_7month_Subject3"      "PBMC_Control_Subject1"    
## [11] "PBMC_Control_Subject2"     "PBMC_Control_Subject3"
class5 <- c("initial infection", "initial infection","1 month infected","1 month infected","2 months infected","2 months infected","7 months infected","7 months infected","healthy","healthy","healthy")
matrix5class <- data.frame(t(patientsTop[,c(2:12)]))
colnames(matrix5class) <- p13geneHeader
matrix5class$class <- class5

write out and read back in to avoid row names.

write.csv(matrix5class,'matrix5class_top15genes_mono.csv',row.names=F)
matrix5class <- read.csv('matrix5class_top15genes_mono.csv', header=T, sep=',')
matrix5class$class <- as.factor(matrix5class$class)
training <- matrix5class[c(1,3,5,7,9,8,10),]
testing <- matrix5class[c(2,4,6,11),]
set.seed(123)
ML15b.rf <- randomForest(class ~ ., data=training, ntree=10000, keep.forest=TRUE,importance=TRUE)

print(ML15b.rf)
## 
## Call:
##  randomForest(formula = class ~ ., data = training, ntree = 10000,      keep.forest = TRUE, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 10000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 85.71%
## Confusion matrix:
##                   1 month infected 2 months infected 7 months infected healthy
## 1 month infected                 0                 0                 1       0
## 2 months infected                0                 0                 1       0
## 7 months infected                1                 1                 0       0
## healthy                          0                 0                 1       1
## initial infection                0                 1                 0       0
##                   initial infection class.error
## 1 month infected                  0         1.0
## 2 months infected                 0         1.0
## 7 months infected                 0         1.0
## healthy                           0         0.5
## initial infection                 0         1.0
predicted2 <- predict(ML15b.rf,testing)

results <- data.frame(predicted=predicted2,actual=testing$class)
results
##            predicted            actual
## 2   1 month infected initial infection
## 4  7 months infected  1 month infected
## 6  7 months infected 2 months infected
## 11           healthy           healthy

We already knew this would be terrible as few observations and many classes to predict the machine loses accuracy dramatically, but in this case the healthy class was predicted with 100% accuracy, and most predictions made by machine in training were with the sample it had the most of, which was 7 months.

Lets do it with one sample to predict for healthy and give the model 2 of each class to train on and only predict one class that is healthy.

training <- matrix5class[c(1:9,11),]
testing <- matrix5class[10,]
set.seed(123)
set.seed(123)
ML15c.rf <- randomForest(class ~ ., data=training, ntree=10000, keep.forest=TRUE,importance=TRUE)

print(ML15c.rf)
## 
## Call:
##  randomForest(formula = class ~ ., data = training, ntree = 10000,      keep.forest = TRUE, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 10000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 80%
## Confusion matrix:
##                   1 month infected 2 months infected 7 months infected healthy
## 1 month infected                 0                 1                 0       0
## 2 months infected                1                 0                 1       0
## 7 months infected                1                 1                 0       0
## healthy                          0                 0                 0       2
## initial infection                1                 1                 0       0
##                   initial infection class.error
## 1 month infected                  1           1
## 2 months infected                 0           1
## 7 months infected                 0           1
## healthy                           0           0
## initial infection                 0           1

Slightly better results in training with less one out of bag error than previous training and testing set splits.

predicted3 <- predict(ML15c.rf,testing)

results <- data.frame(predicted=predicted3,actual=testing$class)
results
##            predicted  actual
## 10 7 months infected healthy

It still prefers to predict the 7 month infected in a 5 class model with balanced classes in training. But it didn’t predict the one healthy class as healthy, but 7 months infected. Could also mean that the late stage mono at 7 months after infection is closer to a healthy sample than an infected one or sample exposed to infectious mononucleosis. These genes of micro RNA still seem good to use in prediction. We will have to find out how to get alternate names in Ensembl or genecards or other name for them to add them to our database. Otherwise its useful only for microRNA and no tie to EBV or Epstein-Barr Viral infection and associated pathologies.

We wanted to also see the fold change values when breaking up the original 19 samples into those with all 5 participating people through 1 month after infection, and another run with all samples and a 2 class problem with unbalanced class training on 5,5,4, and 2 samples for initial infection at diagnosis, 1 month infection, 2 months infection, and 7 months infection respectively. There are also 3 healthy classes that is unbalanced in training compared to the 5, 4, or 2 samples of other classes. I will save that for next project as an extension to this one. We found that with a 2 class model, the machine was able to use these top 15 genes found as top silencers or enhancers in the mononucleosis infected micro RNA gene expression data was able to predict with 100% accuracy a healthy or mono class. But when using time stamps in months of infection as a five class model, the machine was not very good due to low samples, not enough to train the model with or test it.

Thanks.