This analysis is on the infectious mononucleosis gene expression data in GSE109220 in the NCBI online database directory. There are 19 samples. This data has 5 subjects for initial diagnosis and 1 month later. Then 1 subject drops out for 2 months later, while that patient and 2 others drop out for 7 months later. There are 3 healthy controls, 5 initial diagnosis, 5 1-month after infection, 4 2-months after infection, and 2 7-months after infection. This is a total of 19 samples. The cells collected are peripheral blood mononuclear cells or PBMCs at each time point looking at the micro RNA changes. They were scanned and prepared for gene chip micro array analysis. The numeric values shown are already normalized with robust multi array average, then transformed into log 2, and one way ANOVA or analysis of variance. The header information and metadata in the series file contains all the needed information. We will collect that and use that information to design our analysis by number of patients that stayed through whole study from initial infection to 7 months later of 2 patients out of 5. Then do the initial, 1 month, and 2 month analysis with the 5 patients that stayed for the study for 2 months. And finally an analysis of all samples and how the time line with unbalanced samples at each time point compare overall in evaluating top gene expression values in infectious mononucleosis and add it to our database of studies that relate back to EBV or Lyme disease.
data <- read.csv("GSE109220_series_matrix.txt",comment.char="!",
header=T,sep='\t')
head(data)
## ID_REF GSM2935530 GSM2935531 GSM2935532 GSM2935533 GSM2935534 GSM2935535
## 1 14q0_st 0.895331 0.866299 0.555901 0.966886 0.770751 0.754841
## 2 14qI-1_st 1.327430 1.382420 1.239750 1.790480 1.002980 1.150330
## 3 14qI-1_x_st 1.101760 1.117640 0.887508 1.720290 0.777309 0.924656
## 4 14qI-2_st 0.899064 0.869578 0.553979 0.828477 0.793098 0.809946
## 5 14qI-3_x_st 0.941506 1.123190 1.019810 0.981515 0.751912 0.838089
## 6 14qI-4_st 0.859005 0.684318 1.061370 1.163310 0.902312 0.548366
## GSM2935536 GSM2935537 GSM2935538 GSM2935539 GSM2935540 GSM2935541 GSM2935542
## 1 0.895331 1.043050 1.225800 0.768760 1.009170 0.876800 1.015470
## 2 0.995848 1.108190 1.388010 1.539740 1.547410 1.065380 1.197270
## 3 0.799595 0.967973 1.162340 1.314060 1.321740 0.839706 1.107620
## 4 0.809946 0.655543 0.488946 0.543943 0.880662 0.700726 1.296980
## 5 1.105910 0.826468 0.859821 1.140310 0.848059 1.010360 0.838089
## 6 0.984952 0.866898 0.780459 0.844232 0.945453 1.041640 0.910368
## GSM2935543 GSM2935544 GSM2935545 GSM2935546 GSM2935547 GSM2935548
## 1 0.661300 0.895331 0.876800 1.225550 0.602426 1.102760
## 2 1.659260 1.086880 1.593780 1.281600 0.844680 0.922029
## 3 1.116890 0.875489 1.409310 1.372510 0.895331 0.696359
## 4 0.673405 1.028730 0.726906 0.809946 0.793098 1.027200
## 5 0.711518 0.937774 0.911435 0.880532 0.559532 0.799354
## 6 0.895331 1.078120 1.209940 0.915795 0.827384 1.458610
metadata <-read.csv("GSE109220_series_matrix.txt", sep='\t', nrows=27,na.strings=c('',' '))
metadata
## X.Series_title
## 1 !Series_geo_accession
## 2 !Series_status
## 3 !Series_submission_date
## 4 !Series_last_update_date
## 5 !Series_pubmed_id
## 6 !Series_summary
## 7 !Series_summary
## 8 !Series_overall_design
## 9 !Series_type
## 10 !Series_contributor
## 11 !Series_contributor
## 12 !Series_contributor
## 13 !Series_sample_id
## 14 !Series_contact_name
## 15 !Series_contact_laboratory
## 16 !Series_contact_department
## 17 !Series_contact_institute
## 18 !Series_contact_address
## 19 !Series_contact_city
## 20 !Series_contact_state
## 21 !Series_contact_zip/postal_code
## 22 !Series_contact_country
## 23 !Series_supplementary_file
## 24 !Series_platform_id
## 25 !Series_platform_taxid
## 26 !Series_sample_taxid
## 27 !Series_relation
## miRNA.expression.data.from.human.PBMC.derived.form.acute.infectious.mononucleosis.patients
## 1 GSE109220
## 2 Public on Jan 17 2018
## 3 Jan 16 2018
## 4 Apr 18 2018
## 5 29379474
## 6 Epstein Barr virus (EBV) is the etiological agent of acute infectious mononucleosis (IM). Since acute IM is a self-resolving disease with most patients regaining health in one to three weeks there have been few studies examining molecular signatures in early acute stages of the disease. microRNAs have been shown, however, to influence immune cell function and consequently the generation of antibody responses in IM. In this study, we performed a comprehensive analysis of differentially expressed microRNAs in early stage uncomplicated acute IM.
## 7 We used microarrays to profile microRNAs from patient peripheral blood obtained at the time of IM diagnosis and at subsequent time points, and identified differentially regulated microRNAs in this process.
## 8 Peripheral blood samples were derived from symptomatic patients at diagnosis, 1 , 2 and 7 months and processed to obtain PBMCs which were frozen and then maintained in liquid nitrogen until use. PBMCs were also isolated from blood derived from healthy age-matched individuals. Patient (n = 16) and healthy control (n = 3) PBMCs were suspended and lysed in TRIzol Reagent and cleaned-up with RNeasy spin columns using a modified protocol for isolation of total RNA and hybridized on Affymetrix GeneChip miRNA 4.0 arrays.
## 9 Non-coding RNA profiling by array
## 10 Vandana,,Kaul
## 11 Sheri,,Krams
## 12 Olivia,M,Martinez
## 13 GSM2935530 GSM2935531 GSM2935532 GSM2935533 GSM2935534 GSM2935535 GSM2935536 GSM2935537 GSM2935538 GSM2935539 GSM2935540 GSM2935541 GSM2935542 GSM2935543 GSM2935544 GSM2935545 GSM2935546 GSM2935547 GSM2935548
## 14 Vandana,,Kaul
## 15 Transplant Immunology Lab (TIL)
## 16 Surgery
## 17 Stanford University
## 18 1201 Welch Rd., MSLS P328
## 19 Stanford
## 20 CA
## 21 94305-5731
## 22 USA
## 23 ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE109nnn/GSE109220/suppl/GSE109220_RAW.tar
## 24 GPL19117
## 25 32630
## 26 9606
## 27 BioProject: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA430181
The following reads in the additional metadata information that says the GSM sample is control or patient with other information and the additional gene expression data.
headerInformation <-read.csv("GSE109220_series_matrix.txt", sep='\t', stringsAsFactors = T,strip.white=T,na.strings=" ", header=T,skip=28, nrows=35)
#headerInformation[c(1,10,11,13,18:21,33:35),]
The above are selected rows of information per sample like the time of collection of data, the cell type as microRNA, and scan and processing type with gene chip array and how the numeric data shown was normalized or processed before being used for analysis or machine learning in this study. You can see a bunch of useful information in the data set if you view it in Rstudio without the clutter of knitr printed version.
We can use the sample column names of the ‘header information’ data frame instead of the GSM ID to each column in our ‘data’ data frame.
This is great how all information was obtained, we know how each sample was collected and when as well as patient information for each sample so we can use that in how we label the GSM samples in our data. It also says there are the correct number of microRNA of 36,353 genes like our data file of gene expression data shows.
str(data)
## 'data.frame': 36353 obs. of 20 variables:
## $ ID_REF : chr "14q0_st" "14qI-1_st" "14qI-1_x_st" "14qI-2_st" ...
## $ GSM2935530: num 0.895 1.327 1.102 0.899 0.942 ...
## $ GSM2935531: num 0.866 1.382 1.118 0.87 1.123 ...
## $ GSM2935532: num 0.556 1.24 0.888 0.554 1.02 ...
## $ GSM2935533: num 0.967 1.79 1.72 0.828 0.982 ...
## $ GSM2935534: num 0.771 1.003 0.777 0.793 0.752 ...
## $ GSM2935535: num 0.755 1.15 0.925 0.81 0.838 ...
## $ GSM2935536: num 0.895 0.996 0.8 0.81 1.106 ...
## $ GSM2935537: num 1.043 1.108 0.968 0.656 0.826 ...
## $ GSM2935538: num 1.226 1.388 1.162 0.489 0.86 ...
## $ GSM2935539: num 0.769 1.54 1.314 0.544 1.14 ...
## $ GSM2935540: num 1.009 1.547 1.322 0.881 0.848 ...
## $ GSM2935541: num 0.877 1.065 0.84 0.701 1.01 ...
## $ GSM2935542: num 1.015 1.197 1.108 1.297 0.838 ...
## $ GSM2935543: num 0.661 1.659 1.117 0.673 0.712 ...
## $ GSM2935544: num 0.895 1.087 0.875 1.029 0.938 ...
## $ GSM2935545: num 0.877 1.594 1.409 0.727 0.911 ...
## $ GSM2935546: num 1.226 1.282 1.373 0.81 0.881 ...
## $ GSM2935547: num 0.602 0.845 0.895 0.793 0.56 ...
## $ GSM2935548: num 1.103 0.922 0.696 1.027 0.799 ...
Lets keep the ‘ID_REF’ gene field name and change the GSM ID names to their sample type.
newNames <- colnames(headerInformation)[2:20]
colnames(data)[2:20] <- newNames
str(data)
## 'data.frame': 36353 obs. of 20 variables:
## $ ID_REF : chr "14q0_st" "14qI-1_st" "14qI-1_x_st" "14qI-2_st" ...
## $ PBMC_IMDiagnosis_Subject1: num 0.895 1.327 1.102 0.899 0.942 ...
## $ PBMC_IMDiagnosis_Subject2: num 0.866 1.382 1.118 0.87 1.123 ...
## $ PBMC_IMDiagnosis_Subject3: num 0.556 1.24 0.888 0.554 1.02 ...
## $ PBMC_IMDiagnosis_Subject4: num 0.967 1.79 1.72 0.828 0.982 ...
## $ PBMC_IMDiagnosis_Subject5: num 0.771 1.003 0.777 0.793 0.752 ...
## $ PBMC_1month_Subject1 : num 0.755 1.15 0.925 0.81 0.838 ...
## $ PBMC_1month_Subject2 : num 0.895 0.996 0.8 0.81 1.106 ...
## $ PBMC_1month_Subject3 : num 1.043 1.108 0.968 0.656 0.826 ...
## $ PBMC_1month_Subject4 : num 1.226 1.388 1.162 0.489 0.86 ...
## $ PBMC_1month_Subject5 : num 0.769 1.54 1.314 0.544 1.14 ...
## $ PBMC_2month_Subject1 : num 1.009 1.547 1.322 0.881 0.848 ...
## $ PBMC_2month_Subject3 : num 0.877 1.065 0.84 0.701 1.01 ...
## $ PBMC_2month_Subject4 : num 1.015 1.197 1.108 1.297 0.838 ...
## $ PBMC_2month_Subject5 : num 0.661 1.659 1.117 0.673 0.712 ...
## $ PBMC_7month_Subject1 : num 0.895 1.087 0.875 1.029 0.938 ...
## $ PBMC_7month_Subject3 : num 0.877 1.594 1.409 0.727 0.911 ...
## $ PBMC_Control_Subject1 : num 1.226 1.282 1.373 0.81 0.881 ...
## $ PBMC_Control_Subject2 : num 0.602 0.845 0.895 0.793 0.56 ...
## $ PBMC_Control_Subject3 : num 1.103 0.922 0.696 1.027 0.799 ...
Lets now do separate tasks to get only the samples all patients followed through with study to get top fold change genes. Then we will do only the 1 month and initial infection study comparison with all patient data available. And then the overall comparison with all samples even though imbalanced to see the top genes. We can do a quick randomForest to predict class of healthy or Infectious Mononucleosis (IM) on each data set analysis afterwards with only 2 classes and 10,000 trees using cross validation.
===============================================================
Only using the patients that stayed through the whole study for this part, so we need to select our samples to be only patient 1 and 3.
colnames(data)
## [1] "ID_REF" "PBMC_IMDiagnosis_Subject1"
## [3] "PBMC_IMDiagnosis_Subject2" "PBMC_IMDiagnosis_Subject3"
## [5] "PBMC_IMDiagnosis_Subject4" "PBMC_IMDiagnosis_Subject5"
## [7] "PBMC_1month_Subject1" "PBMC_1month_Subject2"
## [9] "PBMC_1month_Subject3" "PBMC_1month_Subject4"
## [11] "PBMC_1month_Subject5" "PBMC_2month_Subject1"
## [13] "PBMC_2month_Subject3" "PBMC_2month_Subject4"
## [15] "PBMC_2month_Subject5" "PBMC_7month_Subject1"
## [17] "PBMC_7month_Subject3" "PBMC_Control_Subject1"
## [19] "PBMC_Control_Subject2" "PBMC_Control_Subject3"
patients13 <- data[,c(1,2,4,7,9,12,13,16,17:20)]
str(patients13)
## 'data.frame': 36353 obs. of 12 variables:
## $ ID_REF : chr "14q0_st" "14qI-1_st" "14qI-1_x_st" "14qI-2_st" ...
## $ PBMC_IMDiagnosis_Subject1: num 0.895 1.327 1.102 0.899 0.942 ...
## $ PBMC_IMDiagnosis_Subject3: num 0.556 1.24 0.888 0.554 1.02 ...
## $ PBMC_1month_Subject1 : num 0.755 1.15 0.925 0.81 0.838 ...
## $ PBMC_1month_Subject3 : num 1.043 1.108 0.968 0.656 0.826 ...
## $ PBMC_2month_Subject1 : num 1.009 1.547 1.322 0.881 0.848 ...
## $ PBMC_2month_Subject3 : num 0.877 1.065 0.84 0.701 1.01 ...
## $ PBMC_7month_Subject1 : num 0.895 1.087 0.875 1.029 0.938 ...
## $ PBMC_7month_Subject3 : num 0.877 1.594 1.409 0.727 0.911 ...
## $ PBMC_Control_Subject1 : num 1.226 1.282 1.373 0.81 0.881 ...
## $ PBMC_Control_Subject2 : num 0.602 0.845 0.895 0.793 0.56 ...
## $ PBMC_Control_Subject3 : num 1.103 0.922 0.696 1.027 0.799 ...
Now add in row means of each time point and controls or healthy samples, and an overall IM pathology row means per gene to compare.
patients13$initial_mean <- rowMeans(patients13[,c(2,3)], na.rm=FALSE, dims=1)
patients13$means_1month <- rowMeans(patients13[,c(4,5)], na.rm=FALSE,dims=1)
patients13$means_2months <- rowMeans(patients13[,c(6,7)], na.rm=FALSE, dims=1)
patients13$means_7months <- rowMeans(patients13[,c(8,9)], na.rm=FALSE, dims=1)
patients13$IM_total_means <- rowMeans(patients13[,c(2:9)], na.rm=FALSE,dims=1)
patients13$healthy_mean <- rowMeans(patients13[,c(10:12)], na.rm=FALSE, dims=1)
str(patients13)
## 'data.frame': 36353 obs. of 18 variables:
## $ ID_REF : chr "14q0_st" "14qI-1_st" "14qI-1_x_st" "14qI-2_st" ...
## $ PBMC_IMDiagnosis_Subject1: num 0.895 1.327 1.102 0.899 0.942 ...
## $ PBMC_IMDiagnosis_Subject3: num 0.556 1.24 0.888 0.554 1.02 ...
## $ PBMC_1month_Subject1 : num 0.755 1.15 0.925 0.81 0.838 ...
## $ PBMC_1month_Subject3 : num 1.043 1.108 0.968 0.656 0.826 ...
## $ PBMC_2month_Subject1 : num 1.009 1.547 1.322 0.881 0.848 ...
## $ PBMC_2month_Subject3 : num 0.877 1.065 0.84 0.701 1.01 ...
## $ PBMC_7month_Subject1 : num 0.895 1.087 0.875 1.029 0.938 ...
## $ PBMC_7month_Subject3 : num 0.877 1.594 1.409 0.727 0.911 ...
## $ PBMC_Control_Subject1 : num 1.226 1.282 1.373 0.81 0.881 ...
## $ PBMC_Control_Subject2 : num 0.602 0.845 0.895 0.793 0.56 ...
## $ PBMC_Control_Subject3 : num 1.103 0.922 0.696 1.027 0.799 ...
## $ initial_mean : num 0.726 1.284 0.995 0.727 0.981 ...
## $ means_1month : num 0.899 1.129 0.946 0.733 0.832 ...
## $ means_2months : num 0.943 1.306 1.081 0.791 0.929 ...
## $ means_7months : num 0.886 1.34 1.142 0.878 0.925 ...
## $ IM_total_means : num 0.863 1.265 1.041 0.782 0.917 ...
## $ healthy_mean : num 0.977 1.016 0.988 0.877 0.746 ...
Now get the fold change value of each time point mean and overall means over the healthy means.
patients13$FC_initial_healthy <- patients13$initial_mean/patients13$healthy_mean
patients13$FC_1month_healthy <- patients13$means_1month/patients13$healthy_mean
patients13$FC_2months_healthy <- patients13$means_2months/patients13$healthy_mean
patients13$FC_7months_healthy <- patients13$means_7months/patients13$healthy_mean
patients13$FC_all_IM_healthy <- patients13$IM_total_means/patients13$healthy_mean
str(patients13)
## 'data.frame': 36353 obs. of 23 variables:
## $ ID_REF : chr "14q0_st" "14qI-1_st" "14qI-1_x_st" "14qI-2_st" ...
## $ PBMC_IMDiagnosis_Subject1: num 0.895 1.327 1.102 0.899 0.942 ...
## $ PBMC_IMDiagnosis_Subject3: num 0.556 1.24 0.888 0.554 1.02 ...
## $ PBMC_1month_Subject1 : num 0.755 1.15 0.925 0.81 0.838 ...
## $ PBMC_1month_Subject3 : num 1.043 1.108 0.968 0.656 0.826 ...
## $ PBMC_2month_Subject1 : num 1.009 1.547 1.322 0.881 0.848 ...
## $ PBMC_2month_Subject3 : num 0.877 1.065 0.84 0.701 1.01 ...
## $ PBMC_7month_Subject1 : num 0.895 1.087 0.875 1.029 0.938 ...
## $ PBMC_7month_Subject3 : num 0.877 1.594 1.409 0.727 0.911 ...
## $ PBMC_Control_Subject1 : num 1.226 1.282 1.373 0.81 0.881 ...
## $ PBMC_Control_Subject2 : num 0.602 0.845 0.895 0.793 0.56 ...
## $ PBMC_Control_Subject3 : num 1.103 0.922 0.696 1.027 0.799 ...
## $ initial_mean : num 0.726 1.284 0.995 0.727 0.981 ...
## $ means_1month : num 0.899 1.129 0.946 0.733 0.832 ...
## $ means_2months : num 0.943 1.306 1.081 0.791 0.929 ...
## $ means_7months : num 0.886 1.34 1.142 0.878 0.925 ...
## $ IM_total_means : num 0.863 1.265 1.041 0.782 0.917 ...
## $ healthy_mean : num 0.977 1.016 0.988 0.877 0.746 ...
## $ FC_initial_healthy : num 0.743 1.263 1.007 0.829 1.314 ...
## $ FC_1month_healthy : num 0.92 1.111 0.958 0.836 1.115 ...
## $ FC_2months_healthy : num 0.965 1.286 1.094 0.902 1.245 ...
## $ FC_7months_healthy : num 0.907 1.319 1.156 1.001 1.239 ...
## $ FC_all_IM_healthy : num 0.884 1.245 1.054 0.892 1.228 ...
Lets order the data frame samples and keep the top 5 under and over expressed or top 5 enhanced and top 5 silenced genes for each time point and all timepoints.
colnames(patients13)
## [1] "ID_REF" "PBMC_IMDiagnosis_Subject1"
## [3] "PBMC_IMDiagnosis_Subject3" "PBMC_1month_Subject1"
## [5] "PBMC_1month_Subject3" "PBMC_2month_Subject1"
## [7] "PBMC_2month_Subject3" "PBMC_7month_Subject1"
## [9] "PBMC_7month_Subject3" "PBMC_Control_Subject1"
## [11] "PBMC_Control_Subject2" "PBMC_Control_Subject3"
## [13] "initial_mean" "means_1month"
## [15] "means_2months" "means_7months"
## [17] "IM_total_means" "healthy_mean"
## [19] "FC_initial_healthy" "FC_1month_healthy"
## [21] "FC_2months_healthy" "FC_7months_healthy"
## [23] "FC_all_IM_healthy"
X_0 <- patients13[order(patients13$FC_initial_healthy, decreasing=T),]
DX0 <- X_0[c(1:5,36349:36353),c(1,19)]
DX0
## ID_REF FC_initial_healthy
## 23290 MIMAT0018929_st 12.7639130
## 14878 MIMAT0009448_st 7.3035458
## 22480 MIMAT0018115_st 6.4052125
## 29663 MIMAT0025375_st 6.3131440
## 23138 MIMAT0018776_st 6.1511222
## 14759 MIMAT0009317_st 0.2215564
## 15326 MIMAT0009904_st 0.2215564
## 17635 MIMAT0013157_st 0.2215564
## 20275 MIMAT0015888_st 0.2215564
## 28413 MIMAT0024117_st 0.1667130
X_1 <- patients13[order(patients13$FC_1month_healthy, decreasing=T),]
DX1 <- X_1[c(1:5,36349:36353),c(1,20)]
DX1
## ID_REF FC_1month_healthy
## 23290 MIMAT0018929_st 12.0429496
## 22480 MIMAT0018115_st 6.0032794
## 33919 MIMAT0029648_st 5.6526514
## 28681 MIMAT0024385_st 5.4238156
## 36038 MIMAT0031767_st 5.4099636
## 14750 MIMAT0009308_st 0.1895457
## 17628 MIMAT0013150_st 0.1895457
## 23664 MIMAT0019306_st 0.1895457
## 28243 MIMAT0023947_st 0.1895457
## 23686 MIMAT0019328_st 0.1696506
X_2 <- patients13[order(patients13$FC_2months_healthy, decreasing=T),]
DX2 <- X_2[c(1:5,36349:36353),c(1,21)]
DX2
## ID_REF FC_2months_healthy
## 23290 MIMAT0018929_st 10.8804258
## 26585 MIMAT0022271_st 6.2396110
## 33919 MIMAT0029648_st 5.5550380
## 28681 MIMAT0024385_st 5.4038902
## 22480 MIMAT0018115_st 5.3847433
## 13712 MIMAT0008257_st 0.2362490
## 20414 MIMAT0016027_st 0.2362490
## 14794 MIMAT0009352_st 0.2317874
## 11612 MIMAT0006109_st 0.2223497
## 28413 MIMAT0024117_st 0.2181580
X_7 <- patients13[order(patients13$FC_7months_healthy, decreasing=T),]
DX7 <- X_7[c(1:5,36349:36353),c(1,22)]
DX7
## ID_REF FC_7months_healthy
## 23290 MIMAT0018929_st 8.0489859
## 23138 MIMAT0018776_st 4.2975872
## 22480 MIMAT0018115_st 4.2067260
## 26917 MIMAT0022608_st 4.1895747
## 19436 MIMAT0015036_st 4.0733855
## 26939 MIMAT0022630_st 0.2830678
## 22326 MIMAT0017956_st 0.2513339
## 18370 MIMAT0013946_st 0.2500538
## 28101 MIMAT0023805_st 0.2322748
## 14794 MIMAT0009352_st 0.2166384
X_all <- patients13[order(patients13$FC_all_IM_healthy, decreasing=T),]
DXall <- X_all[c(1:5,36349:36353),c(1,23)]
DXall
## ID_REF FC_all_IM_healthy
## 23290 MIMAT0018929_st 10.9340686
## 22480 MIMAT0018115_st 5.4999903
## 23138 MIMAT0018776_st 4.9913737
## 28681 MIMAT0024385_st 4.9629552
## 26585 MIMAT0022271_st 4.8911259
## 13589 MIMAT0008134_st 0.2537693
## 14750 MIMAT0009308_st 0.2537693
## 17628 MIMAT0013150_st 0.2537693
## 23664 MIMAT0019306_st 0.2537693
## 28243 MIMAT0023947_st 0.2537693
Lets find out which genes are common and not among these time stamps of IM.
DX01 <- DX0$ID_REF %in% DX1$ID_REF
DX0$ID_REF[DX01]
## [1] "MIMAT0018929_st" "MIMAT0018115_st"
The two genes above are common to initial and 1 month after infection.
DX27 <- DX2$ID_REF %in% DX7$ID_REF
common0127 <- DX2$ID_REF[DX27]
common0127
## [1] "MIMAT0018929_st" "MIMAT0018115_st" "MIMAT0009352_st"
The three genes above are common between 2 months and 7 months after infections. We can visually inspect and see the first two genes are common to initial, 1 month, 2 months, and 7 months after infection.
Now see if these genes are common to all timepoints top genes.
commonAll <- DXall[which(DXall$ID_REF %in% common0127),]
commonAll
## ID_REF FC_all_IM_healthy
## 23290 MIMAT0018929_st 10.93407
## 22480 MIMAT0018115_st 5.49999
There are two genes common to initial diagnosis of infectious mononucleosis, 1 month later, 2 months later, and 7 months later. Those 2 genes are MIMAT0018929_st and MIMAT0018115_st.
So there is no record of it on genecards.org and an internet search for it turns up nothing but we know it is a gene ID through Affymetrix gene chip scanning so it must be a gene ID with the machine used to scan in PBMCs into the arrays to analyze. Lets just use the 7 months of infection top genes to analyze and use in prediction of the class of healthy or IM when we get there. Lets write out the 7 months ordered decreasing for fold change at 7 months compared to healthy.
write.csv(X_7, 'mononucleosis_7monthsInfected_allGenesOrderedDecreasing.csv', row.names=F)
Lets get the top genes by name in each infectious time stamp month.
DX0$class <- "initial infection"
DX1$class <- "month1 of infection"
DX2$class <- "month2 of infection"
DX7$class <- "month7 of infection"
colnames(DX0) <- c("miroRNA","FoldChangePathologyOverHealthy","class")
colnames(DX1) <- c("miroRNA","FoldChangePathologyOverHealthy","class")
colnames(DX2) <- c("miroRNA","FoldChangePathologyOverHealthy","class")
colnames(DX7) <- c("miroRNA","FoldChangePathologyOverHealthy","class")
topGenesClasses <- rbind(DX0,DX1,DX2,DX7)
topGenesClasses
## miroRNA FoldChangePathologyOverHealthy class
## 23290 MIMAT0018929_st 12.7639130 initial infection
## 14878 MIMAT0009448_st 7.3035458 initial infection
## 22480 MIMAT0018115_st 6.4052125 initial infection
## 29663 MIMAT0025375_st 6.3131440 initial infection
## 23138 MIMAT0018776_st 6.1511222 initial infection
## 14759 MIMAT0009317_st 0.2215564 initial infection
## 15326 MIMAT0009904_st 0.2215564 initial infection
## 17635 MIMAT0013157_st 0.2215564 initial infection
## 20275 MIMAT0015888_st 0.2215564 initial infection
## 28413 MIMAT0024117_st 0.1667130 initial infection
## 232901 MIMAT0018929_st 12.0429496 month1 of infection
## 224801 MIMAT0018115_st 6.0032794 month1 of infection
## 33919 MIMAT0029648_st 5.6526514 month1 of infection
## 28681 MIMAT0024385_st 5.4238156 month1 of infection
## 36038 MIMAT0031767_st 5.4099636 month1 of infection
## 14750 MIMAT0009308_st 0.1895457 month1 of infection
## 17628 MIMAT0013150_st 0.1895457 month1 of infection
## 23664 MIMAT0019306_st 0.1895457 month1 of infection
## 28243 MIMAT0023947_st 0.1895457 month1 of infection
## 23686 MIMAT0019328_st 0.1696506 month1 of infection
## 232902 MIMAT0018929_st 10.8804258 month2 of infection
## 26585 MIMAT0022271_st 6.2396110 month2 of infection
## 339191 MIMAT0029648_st 5.5550380 month2 of infection
## 286811 MIMAT0024385_st 5.4038902 month2 of infection
## 224802 MIMAT0018115_st 5.3847433 month2 of infection
## 13712 MIMAT0008257_st 0.2362490 month2 of infection
## 20414 MIMAT0016027_st 0.2362490 month2 of infection
## 14794 MIMAT0009352_st 0.2317874 month2 of infection
## 11612 MIMAT0006109_st 0.2223497 month2 of infection
## 284131 MIMAT0024117_st 0.2181580 month2 of infection
## 232903 MIMAT0018929_st 8.0489859 month7 of infection
## 231381 MIMAT0018776_st 4.2975872 month7 of infection
## 224803 MIMAT0018115_st 4.2067260 month7 of infection
## 26917 MIMAT0022608_st 4.1895747 month7 of infection
## 19436 MIMAT0015036_st 4.0733855 month7 of infection
## 26939 MIMAT0022630_st 0.2830678 month7 of infection
## 22326 MIMAT0017956_st 0.2513339 month7 of infection
## 18370 MIMAT0013946_st 0.2500538 month7 of infection
## 28101 MIMAT0023805_st 0.2322748 month7 of infection
## 147941 MIMAT0009352_st 0.2166384 month7 of infection
top40 <- topGenesClasses[order(topGenesClasses$FoldChangePathologyOverHealthy,decreasing=T),]
top40
## miroRNA FoldChangePathologyOverHealthy class
## 23290 MIMAT0018929_st 12.7639130 initial infection
## 232901 MIMAT0018929_st 12.0429496 month1 of infection
## 232902 MIMAT0018929_st 10.8804258 month2 of infection
## 232903 MIMAT0018929_st 8.0489859 month7 of infection
## 14878 MIMAT0009448_st 7.3035458 initial infection
## 22480 MIMAT0018115_st 6.4052125 initial infection
## 29663 MIMAT0025375_st 6.3131440 initial infection
## 26585 MIMAT0022271_st 6.2396110 month2 of infection
## 23138 MIMAT0018776_st 6.1511222 initial infection
## 224801 MIMAT0018115_st 6.0032794 month1 of infection
## 33919 MIMAT0029648_st 5.6526514 month1 of infection
## 339191 MIMAT0029648_st 5.5550380 month2 of infection
## 28681 MIMAT0024385_st 5.4238156 month1 of infection
## 36038 MIMAT0031767_st 5.4099636 month1 of infection
## 286811 MIMAT0024385_st 5.4038902 month2 of infection
## 224802 MIMAT0018115_st 5.3847433 month2 of infection
## 231381 MIMAT0018776_st 4.2975872 month7 of infection
## 224803 MIMAT0018115_st 4.2067260 month7 of infection
## 26917 MIMAT0022608_st 4.1895747 month7 of infection
## 19436 MIMAT0015036_st 4.0733855 month7 of infection
## 26939 MIMAT0022630_st 0.2830678 month7 of infection
## 22326 MIMAT0017956_st 0.2513339 month7 of infection
## 18370 MIMAT0013946_st 0.2500538 month7 of infection
## 13712 MIMAT0008257_st 0.2362490 month2 of infection
## 20414 MIMAT0016027_st 0.2362490 month2 of infection
## 28101 MIMAT0023805_st 0.2322748 month7 of infection
## 14794 MIMAT0009352_st 0.2317874 month2 of infection
## 11612 MIMAT0006109_st 0.2223497 month2 of infection
## 14759 MIMAT0009317_st 0.2215564 initial infection
## 15326 MIMAT0009904_st 0.2215564 initial infection
## 17635 MIMAT0013157_st 0.2215564 initial infection
## 20275 MIMAT0015888_st 0.2215564 initial infection
## 284131 MIMAT0024117_st 0.2181580 month2 of infection
## 147941 MIMAT0009352_st 0.2166384 month7 of infection
## 14750 MIMAT0009308_st 0.1895457 month1 of infection
## 17628 MIMAT0013150_st 0.1895457 month1 of infection
## 23664 MIMAT0019306_st 0.1895457 month1 of infection
## 28243 MIMAT0023947_st 0.1895457 month1 of infection
## 23686 MIMAT0019328_st 0.1696506 month1 of infection
## 28413 MIMAT0024117_st 0.1667130 initial infection
Write this out to csv.
write.csv(top40, 'top40_mono_miRNA_4timestamps_orderedFCs.csv', row.names=F)
top20 <- top40$miroRNA[c(1:10,31:40)]
top20
## [1] "MIMAT0018929_st" "MIMAT0018929_st" "MIMAT0018929_st" "MIMAT0018929_st"
## [5] "MIMAT0009448_st" "MIMAT0018115_st" "MIMAT0025375_st" "MIMAT0022271_st"
## [9] "MIMAT0018776_st" "MIMAT0018115_st" "MIMAT0013157_st" "MIMAT0015888_st"
## [13] "MIMAT0024117_st" "MIMAT0009352_st" "MIMAT0009308_st" "MIMAT0013150_st"
## [17] "MIMAT0019306_st" "MIMAT0023947_st" "MIMAT0019328_st" "MIMAT0024117_st"
top <- top20[!duplicated(top20)]
top
## [1] "MIMAT0018929_st" "MIMAT0009448_st" "MIMAT0018115_st" "MIMAT0025375_st"
## [5] "MIMAT0022271_st" "MIMAT0018776_st" "MIMAT0013157_st" "MIMAT0015888_st"
## [9] "MIMAT0024117_st" "MIMAT0009352_st" "MIMAT0009308_st" "MIMAT0013150_st"
## [13] "MIMAT0019306_st" "MIMAT0023947_st" "MIMAT0019328_st"
So we have 15 unique top genes between all timestamps to use as predictors in predicting class as healthy or infectious mononucleosis.
patientsTop <- patients13[patients13$ID_REF %in% top,c(1:12)]
colnames(patientsTop)
## [1] "ID_REF" "PBMC_IMDiagnosis_Subject1"
## [3] "PBMC_IMDiagnosis_Subject3" "PBMC_1month_Subject1"
## [5] "PBMC_1month_Subject3" "PBMC_2month_Subject1"
## [7] "PBMC_2month_Subject3" "PBMC_7month_Subject1"
## [9] "PBMC_7month_Subject3" "PBMC_Control_Subject1"
## [11] "PBMC_Control_Subject2" "PBMC_Control_Subject3"
Two class ML and a five class ML with randomForest after making the respective matrices.
class <- c("mononucleosis","mononucleosis","mononucleosis","mononucleosis","mononucleosis","mononucleosis","mononucleosis","mononucleosis", "healthy", "healthy","healthy")
p13geneHeader <- patientsTop$ID_REF
matrix13 <- data.frame(t(patientsTop[,-1]))
colnames(matrix13) <- p13geneHeader
matrix13$class <- class
str(matrix13)
## 'data.frame': 11 obs. of 16 variables:
## $ MIMAT0009308_st: num 1.254 0.82 0.975 0.583 1.316 ...
## $ MIMAT0009352_st: num 0.889 1.364 1.351 0.939 1.053 ...
## $ MIMAT0009448_st: num 6.61 6.11 3.62 4.15 6.52 ...
## $ MIMAT0013150_st: num 1.254 0.82 0.975 0.583 1.316 ...
## $ MIMAT0013157_st: num 0.791 1.077 1.235 0.759 1.672 ...
## $ MIMAT0015888_st: num 0.791 1.077 1.235 0.759 1.672 ...
## $ MIMAT0018115_st: num 7.99 6.87 7.29 6.64 6.87 ...
## $ MIMAT0018776_st: num 7.14 6.99 5.09 5.75 5.46 ...
## $ MIMAT0018929_st: num 9.56 7.75 8.74 7.6 8.18 ...
## $ MIMAT0019306_st: num 1.254 0.82 0.975 0.583 1.316 ...
## $ MIMAT0019328_st: num 1.931 0.775 1.105 0.656 2.748 ...
## $ MIMAT0022271_st: num 4.8 4.99 3.19 5.17 6.77 ...
## $ MIMAT0023947_st: num 1.254 0.82 0.975 0.583 1.316 ...
## $ MIMAT0024117_st: num 0.859 0.5 1.153 0.965 1.187 ...
## $ MIMAT0025375_st: num 5.83 6.64 4.91 4.48 4.31 ...
## $ class : chr "mononucleosis" "mononucleosis" "mononucleosis" "mononucleosis" ...
write to csv and read back in to avoid row names.
write.csv(matrix13,"matrix1_3_top15genes.csv", row.names=F)
matrix13 <- read.csv("matrix1_3_top15genes.csv", header=T, sep=',')
matrix13$class <- as.factor(matrix13$class)
str(matrix13)
## 'data.frame': 11 obs. of 16 variables:
## $ MIMAT0009308_st: num 1.254 0.82 0.975 0.583 1.316 ...
## $ MIMAT0009352_st: num 0.889 1.364 1.351 0.939 1.053 ...
## $ MIMAT0009448_st: num 6.61 6.11 3.62 4.15 6.52 ...
## $ MIMAT0013150_st: num 1.254 0.82 0.975 0.583 1.316 ...
## $ MIMAT0013157_st: num 0.791 1.077 1.235 0.759 1.672 ...
## $ MIMAT0015888_st: num 0.791 1.077 1.235 0.759 1.672 ...
## $ MIMAT0018115_st: num 7.99 6.87 7.29 6.64 6.87 ...
## $ MIMAT0018776_st: num 7.14 6.99 5.09 5.75 5.46 ...
## $ MIMAT0018929_st: num 9.56 7.75 8.74 7.6 8.18 ...
## $ MIMAT0019306_st: num 1.254 0.82 0.975 0.583 1.316 ...
## $ MIMAT0019328_st: num 1.931 0.775 1.105 0.656 2.748 ...
## $ MIMAT0022271_st: num 4.8 4.99 3.19 5.17 6.77 ...
## $ MIMAT0023947_st: num 1.254 0.82 0.975 0.583 1.316 ...
## $ MIMAT0024117_st: num 0.859 0.5 1.153 0.965 1.187 ...
## $ MIMAT0025375_st: num 5.83 6.64 4.91 4.48 4.31 ...
## $ class : Factor w/ 2 levels "healthy","mononucleosis": 2 2 2 2 2 2 2 2 1 1 ...
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
training <- matrix13[c(1,2,3,4,5,7,9,10),]
testing <- matrix13[c(6,8,11),]
table(testing$class)
##
## healthy mononucleosis
## 1 2
table(training$class)
##
## healthy mononucleosis
## 2 6
set.seed(123)
ML15.rf <- randomForest(class ~ ., data=training, ntree=10000, keep.forest=TRUE,importance=TRUE)
print(ML15.rf)
##
## Call:
## randomForest(formula = class ~ ., data = training, ntree = 10000, keep.forest = TRUE, importance = TRUE)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 12.5%
## Confusion matrix:
## healthy mononucleosis class.error
## healthy 1 1 0.5
## mononucleosis 0 6 0.0
predicted <- predict(ML15.rf,testing)
results <- data.frame(predicted=predicted,actual=testing$class)
results
## predicted actual
## 6 mononucleosis mononucleosis
## 8 mononucleosis mononucleosis
## 11 healthy healthy
The two class model predicted with 100% accuracy the class as mononucleosis infection or healthy for these top 15 genes among all timestamps. Lets do this for a five class model to see how well prediction is in predicting initial infection, 1 month infected, 2 months infected, 7 months infected or healthy. We have to go back to the matrix and make the class have 5 classes instead of 2 for respective class.
colnames(patientsTop)
## [1] "ID_REF" "PBMC_IMDiagnosis_Subject1"
## [3] "PBMC_IMDiagnosis_Subject3" "PBMC_1month_Subject1"
## [5] "PBMC_1month_Subject3" "PBMC_2month_Subject1"
## [7] "PBMC_2month_Subject3" "PBMC_7month_Subject1"
## [9] "PBMC_7month_Subject3" "PBMC_Control_Subject1"
## [11] "PBMC_Control_Subject2" "PBMC_Control_Subject3"
class5 <- c("initial infection", "initial infection","1 month infected","1 month infected","2 months infected","2 months infected","7 months infected","7 months infected","healthy","healthy","healthy")
matrix5class <- data.frame(t(patientsTop[,c(2:12)]))
colnames(matrix5class) <- p13geneHeader
matrix5class$class <- class5
write out and read back in to avoid row names.
write.csv(matrix5class,'matrix5class_top15genes_mono.csv',row.names=F)
matrix5class <- read.csv('matrix5class_top15genes_mono.csv', header=T, sep=',')
matrix5class$class <- as.factor(matrix5class$class)
training <- matrix5class[c(1,3,5,7,9,8,10),]
testing <- matrix5class[c(2,4,6,11),]
set.seed(123)
ML15b.rf <- randomForest(class ~ ., data=training, ntree=10000, keep.forest=TRUE,importance=TRUE)
print(ML15b.rf)
##
## Call:
## randomForest(formula = class ~ ., data = training, ntree = 10000, keep.forest = TRUE, importance = TRUE)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 85.71%
## Confusion matrix:
## 1 month infected 2 months infected 7 months infected healthy
## 1 month infected 0 0 1 0
## 2 months infected 0 0 1 0
## 7 months infected 1 1 0 0
## healthy 0 0 1 1
## initial infection 0 1 0 0
## initial infection class.error
## 1 month infected 0 1.0
## 2 months infected 0 1.0
## 7 months infected 0 1.0
## healthy 0 0.5
## initial infection 0 1.0
predicted2 <- predict(ML15b.rf,testing)
results <- data.frame(predicted=predicted2,actual=testing$class)
results
## predicted actual
## 2 1 month infected initial infection
## 4 7 months infected 1 month infected
## 6 7 months infected 2 months infected
## 11 healthy healthy
We already knew this would be terrible as few observations and many classes to predict the machine loses accuracy dramatically, but in this case the healthy class was predicted with 100% accuracy, and most predictions made by machine in training were with the sample it had the most of, which was 7 months.
Lets do it with one sample to predict for healthy and give the model 2 of each class to train on and only predict one class that is healthy.
training <- matrix5class[c(1:9,11),]
testing <- matrix5class[10,]
set.seed(123)
set.seed(123)
ML15c.rf <- randomForest(class ~ ., data=training, ntree=10000, keep.forest=TRUE,importance=TRUE)
print(ML15c.rf)
##
## Call:
## randomForest(formula = class ~ ., data = training, ntree = 10000, keep.forest = TRUE, importance = TRUE)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 80%
## Confusion matrix:
## 1 month infected 2 months infected 7 months infected healthy
## 1 month infected 0 1 0 0
## 2 months infected 1 0 1 0
## 7 months infected 1 1 0 0
## healthy 0 0 0 2
## initial infection 1 1 0 0
## initial infection class.error
## 1 month infected 1 1
## 2 months infected 0 1
## 7 months infected 0 1
## healthy 0 0
## initial infection 0 1
Slightly better results in training with less one out of bag error than previous training and testing set splits.
predicted3 <- predict(ML15c.rf,testing)
results <- data.frame(predicted=predicted3,actual=testing$class)
results
## predicted actual
## 10 7 months infected healthy
It still prefers to predict the 7 month infected in a 5 class model with balanced classes in training. But it didn’t predict the one healthy class as healthy, but 7 months infected. Could also mean that the late stage mono at 7 months after infection is closer to a healthy sample than an infected one or sample exposed to infectious mononucleosis. These genes of micro RNA still seem good to use in prediction. We will have to find out how to get alternate names in Ensembl or genecards or other name for them to add them to our database. Otherwise its useful only for microRNA and no tie to EBV or Epstein-Barr Viral infection and associated pathologies.
We wanted to also see the fold change values when breaking up the original 19 samples into those with all 5 participating people through 1 month after infection, and another run with all samples and a 2 class problem with unbalanced class training on 5,5,4, and 2 samples for initial infection at diagnosis, 1 month infection, 2 months infection, and 7 months infection respectively. There are also 3 healthy classes that is unbalanced in training compared to the 5, 4, or 2 samples of other classes. I will save that for next project as an extension to this one. We found that with a 2 class model, the machine was able to use these top 15 genes found as top silencers or enhancers in the mononucleosis infected micro RNA gene expression data was able to predict with 100% accuracy a healthy or mono class. But when using time stamps in months of infection as a five class model, the machine was not very good due to low samples, not enough to train the model with or test it.
There is an article published almost 10 years ago with an internet search on Micro RNA that has a lot of digestible information about the work of non coding RNA and micro RNA in activating or silencing gene expression and reaches to theories and suggestions that micro RNA has ability to affect transcription at the nuclear level as well as post transcription activity.
There is a site, miRbase.org, that is made for micro RNA and will give the name of our miRNA IDs if we exclude the last letter attachment like ’_st’ and search their search bar. We can search each of these top 15 strings by their ID after manually removing the appended few letters after numeric portion of ID_REF and then get the RNA string, look up the proteins of amino acids and also use BLAST.org to search the string but as RNA or miRNA if available and go about manually finding the chromosome location top ranked for that gene specific to that miRNA that will silence or enhance translation of the final protein.
Thanks.
=====================================================
This part is an extension part 2 to find the patients through 1 month of infection where all 5 patients stayed for the initial diagnosis and 1 month but dropped out for 2 and 7 months. Then we will analyze all samples that are imbalanced with 5, 5, 4, and 2 samples of time point mono progression with the 3 healthy control samples.
Lets get our matrix of all patients only up to and including 1 month.
colnames(data)
## [1] "ID_REF" "PBMC_IMDiagnosis_Subject1"
## [3] "PBMC_IMDiagnosis_Subject2" "PBMC_IMDiagnosis_Subject3"
## [5] "PBMC_IMDiagnosis_Subject4" "PBMC_IMDiagnosis_Subject5"
## [7] "PBMC_1month_Subject1" "PBMC_1month_Subject2"
## [9] "PBMC_1month_Subject3" "PBMC_1month_Subject4"
## [11] "PBMC_1month_Subject5" "PBMC_2month_Subject1"
## [13] "PBMC_2month_Subject3" "PBMC_2month_Subject4"
## [15] "PBMC_2month_Subject5" "PBMC_7month_Subject1"
## [17] "PBMC_7month_Subject3" "PBMC_Control_Subject1"
## [19] "PBMC_Control_Subject2" "PBMC_Control_Subject3"
We want the ID_REF, and columns after it up to column 11 and then the last 3 columns for the healthy controls.
patients1month <- data[,c(1:11,18:20)]
str(patients1month)
## 'data.frame': 36353 obs. of 14 variables:
## $ ID_REF : chr "14q0_st" "14qI-1_st" "14qI-1_x_st" "14qI-2_st" ...
## $ PBMC_IMDiagnosis_Subject1: num 0.895 1.327 1.102 0.899 0.942 ...
## $ PBMC_IMDiagnosis_Subject2: num 0.866 1.382 1.118 0.87 1.123 ...
## $ PBMC_IMDiagnosis_Subject3: num 0.556 1.24 0.888 0.554 1.02 ...
## $ PBMC_IMDiagnosis_Subject4: num 0.967 1.79 1.72 0.828 0.982 ...
## $ PBMC_IMDiagnosis_Subject5: num 0.771 1.003 0.777 0.793 0.752 ...
## $ PBMC_1month_Subject1 : num 0.755 1.15 0.925 0.81 0.838 ...
## $ PBMC_1month_Subject2 : num 0.895 0.996 0.8 0.81 1.106 ...
## $ PBMC_1month_Subject3 : num 1.043 1.108 0.968 0.656 0.826 ...
## $ PBMC_1month_Subject4 : num 1.226 1.388 1.162 0.489 0.86 ...
## $ PBMC_1month_Subject5 : num 0.769 1.54 1.314 0.544 1.14 ...
## $ PBMC_Control_Subject1 : num 1.226 1.282 1.373 0.81 0.881 ...
## $ PBMC_Control_Subject2 : num 0.602 0.845 0.895 0.793 0.56 ...
## $ PBMC_Control_Subject3 : num 1.103 0.922 0.696 1.027 0.799 ...
Lets get the rowMeans of each time point and healthy controls.
patients1month$initialDX_Means <- rowMeans(patients1month[,c(2:6)])
patients1month$month1_Means <- rowMeans(patients1month[,c(7:11)])
patients1month$healthy_Means <- rowMeans(patients1month[,c(12:14)])
patients1month$FC_initial_healthy <- patients1month$initialDX_Means/patients1month$healthy_Means
patients1month$FC_1month_healthy <- patients1month$month1_Means/patients1month$healthy_Means
str(patients1month)
## 'data.frame': 36353 obs. of 19 variables:
## $ ID_REF : chr "14q0_st" "14qI-1_st" "14qI-1_x_st" "14qI-2_st" ...
## $ PBMC_IMDiagnosis_Subject1: num 0.895 1.327 1.102 0.899 0.942 ...
## $ PBMC_IMDiagnosis_Subject2: num 0.866 1.382 1.118 0.87 1.123 ...
## $ PBMC_IMDiagnosis_Subject3: num 0.556 1.24 0.888 0.554 1.02 ...
## $ PBMC_IMDiagnosis_Subject4: num 0.967 1.79 1.72 0.828 0.982 ...
## $ PBMC_IMDiagnosis_Subject5: num 0.771 1.003 0.777 0.793 0.752 ...
## $ PBMC_1month_Subject1 : num 0.755 1.15 0.925 0.81 0.838 ...
## $ PBMC_1month_Subject2 : num 0.895 0.996 0.8 0.81 1.106 ...
## $ PBMC_1month_Subject3 : num 1.043 1.108 0.968 0.656 0.826 ...
## $ PBMC_1month_Subject4 : num 1.226 1.388 1.162 0.489 0.86 ...
## $ PBMC_1month_Subject5 : num 0.769 1.54 1.314 0.544 1.14 ...
## $ PBMC_Control_Subject1 : num 1.226 1.282 1.373 0.81 0.881 ...
## $ PBMC_Control_Subject2 : num 0.602 0.845 0.895 0.793 0.56 ...
## $ PBMC_Control_Subject3 : num 1.103 0.922 0.696 1.027 0.799 ...
## $ initialDX_Means : num 0.811 1.349 1.121 0.789 0.964 ...
## $ month1_Means : num 0.938 1.236 1.034 0.662 0.954 ...
## $ healthy_Means : num 0.977 1.016 0.988 0.877 0.746 ...
## $ FC_initial_healthy : num 0.83 1.33 1.13 0.9 1.29 ...
## $ FC_1month_healthy : num 0.96 1.217 1.046 0.755 1.278 ...
Lets get our top genes 10 silencer and 10 enhancers from fold changes of each group at initial diagnosis and at 1 month progression with mononucleosis. We are
topDX0genes <- patients1month[order(patients1month$FC_initial_healthy, decreasing=T),]
top20DXOgenes <- topDX0genes[c(1:10,36344:36353),]
top20DXOgenes$ID_REF
## [1] "MIMAT0018929_st" "MIMAT0009448_st" "MIMAT0018115_st" "MIMAT0024385_st"
## [5] "MIMAT0005867_st" "MIMAT0008310_st" "MIMAT0016093_st" "MIMAT0018776_st"
## [9] "MIMAT0025375_st" "MIMAT0018913_st" "MIMAT0006308_st" "MIMAT0006715_st"
## [13] "MIMAT0008131_st" "MIMAT0009306_st" "MIMAT0013147_st" "MIMAT0015869_st"
## [17] "MIMAT0023945_st" "MIMAT0024110_st" "MIMAT0022658_st" "MIMAT0026241_st"
Those are the top 20 genes expressed for the initial diagnosis fold change values versus healthy means.
Now for the top 20 genes for 1 month means vs. healthy means.
top1monthgenes <- patients1month[order(patients1month$FC_1month_healthy,decreasing=T),]
top20_1monthgenes <- top1monthgenes[c(1:10,36344:36353),]
top20_1monthgenes$ID_REF
## [1] "MIMAT0018929_st" "MIMAT0018115_st" "MIMAT0024385_st" "MIMAT0022271_st"
## [5] "MIMAT0031767_st" "MIMAT0027774_st" "MIMAT0030467_st" "MIMAT0029648_st"
## [9] "MIMAT0018776_st" "MIMAT0022938_st" "MIMAT0000737_st" "MIMAT0000747_st"
## [13] "MIMAT0003201_st" "MIMAT0006311_st" "MIMAT0008134_st" "MIMAT0009308_st"
## [17] "MIMAT0013150_st" "MIMAT0019306_st" "MIMAT0023947_st" "MIMAT0001415_st"
The above genes are the genes for the 1 month progression with mono fold change over healthy means.
Lets see what genes in common.
common1monthDX0 <- top20_1monthgenes[which(top20_1monthgenes$ID_REF %in% top20DXOgenes$ID_REF),]
common1monthDX0$ID_REF
## [1] "MIMAT0018929_st" "MIMAT0018115_st" "MIMAT0024385_st" "MIMAT0018776_st"
There are 4 genes in common with all 5 participants having top gene expression as silencing or enhanced activity after diagnosis and 1 month of infection. We see the same 2 genes common to the last analysis of only the 2 patients that participated at each time point, MIMAT0018929_st and MIMAT0018155_st as well as MIMAT0024385_st and MIMAT0018776_st.
We will keep these 4 genes in common and run a 2 class model to see if the machine can predict mono or healthy from these 3 healthy and 10 infectious mononucleosis samples.
Lets make our matrix to test these 4 genes.
top4genes1month <- patients1month[which(patients1month$ID_REF %in% common1monthDX0$ID_REF),c(1:14)]
str(top4genes1month)
## 'data.frame': 4 obs. of 14 variables:
## $ ID_REF : chr "MIMAT0018115_st" "MIMAT0018776_st" "MIMAT0018929_st" "MIMAT0024385_st"
## $ PBMC_IMDiagnosis_Subject1: num 7.99 7.14 9.56 6.59
## $ PBMC_IMDiagnosis_Subject2: num 5.35 5.07 6.77 5.82
## $ PBMC_IMDiagnosis_Subject3: num 6.87 6.99 7.75 5.9
## $ PBMC_IMDiagnosis_Subject4: num 6.71 5.73 8.26 6.55
## $ PBMC_IMDiagnosis_Subject5: num 6.42 5.47 8.02 6.92
## $ PBMC_1month_Subject1 : num 7.29 5.09 8.74 5.32
## $ PBMC_1month_Subject2 : num 6.55 6 7.13 5.65
## $ PBMC_1month_Subject3 : num 6.64 5.75 7.6 6.68
## $ PBMC_1month_Subject4 : num 7.63 5.02 9.32 7.76
## $ PBMC_1month_Subject5 : num 5.69 4.89 7.24 5.81
## $ PBMC_Control_Subject1 : num 0.999 1.316 0.751 0.679
## $ PBMC_Control_Subject2 : num 1.218 0.814 0.621 1.146
## $ PBMC_Control_Subject3 : num 1.264 1.315 0.663 1.494
We have our 10 mono samples and 3 healthy samples.
class <- c("mono","mono","mono","mono","mono",
"mono","mono","mono","mono","mono",
"healthy","healthy","healthy")
header1monthmatrix <- top4genes1month$ID_REF
month1_4genes_mx <- data.frame(t(top4genes1month[,c(2:14)]))
colnames(month1_4genes_mx) <- header1monthmatrix
month1_4genes_mx$class <- as.factor(class)
str(month1_4genes_mx)
## 'data.frame': 13 obs. of 5 variables:
## $ MIMAT0018115_st: num 7.99 5.35 6.87 6.71 6.42 ...
## $ MIMAT0018776_st: num 7.14 5.07 6.99 5.73 5.47 ...
## $ MIMAT0018929_st: num 9.56 6.77 7.75 8.26 8.02 ...
## $ MIMAT0024385_st: num 6.59 5.82 5.9 6.55 6.92 ...
## $ class : Factor w/ 2 levels "healthy","mono": 2 2 2 2 2 2 2 2 2 2 ...
table(month1_4genes_mx$class)
##
## healthy mono
## 3 10
We have 3 healthy and 10 mono samples in our data. Lets write this out and read it in to run a machine learning model on it without row names.
write.csv(month1_4genes_mx,'month1_DX0_10mono_3healthy_matrix.csv',row.names=F)
month1_4genes_mx <- read.csv("month1_DX0_10mono_3healthy_matrix.csv",sep=',',header=T)
month1_4genes_mx$class <- as.factor(month1_4genes_mx$class)
Now lets make the training and testing set and see how well our random forest model with classification and 10,000 trees can predict the class in a 2 class problem with only the first month and initial diagnosis samples.
training <- month1_4genes_mx[c(1,2,4,6,7,9,11,13),]
testing <- month1_4genes_mx[c(3,5,8,10,12),]
set.seed(123)
ML4.rf <- randomForest(class ~ ., data=training, ntree=10000, keep.forest=TRUE,importance=TRUE)
print(ML4.rf)
##
## Call:
## randomForest(formula = class ~ ., data = training, ntree = 10000, keep.forest = TRUE, importance = TRUE)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 0%
## Confusion matrix:
## healthy mono class.error
## healthy 2 0 0
## mono 0 6 0
Wow! There is a 100% accuracy in the training model, lets see how well this model predicts the testing set data.
predict4 <- predict(ML4.rf,testing)
predicted4results <- data.frame(predicted=predict4, actual=testing$class)
predicted4results
## predicted actual
## 3 mono mono
## 5 mono mono
## 8 mono mono
## 10 mono mono
## 12 healthy healthy
Cool, these genes must be the genes to predict mono or healthy in a 2 class model, as you can see we got a 100% accuracty in prediction with the machine in using these 4 genes top expressed as silencers or enhancers in the initial diagnosis and 1 month infection samples.
Lets look at the fold change values for these 4 genes and see if they are silencers or enhancers.
overall4genesFCs <- top1monthgenes[top1monthgenes$ID_REF %in% header1monthmatrix,]
overall4genesFCs
## ID_REF PBMC_IMDiagnosis_Subject1 PBMC_IMDiagnosis_Subject2
## 23290 MIMAT0018929_st 9.56456 6.77330
## 22480 MIMAT0018115_st 7.98950 5.35384
## 28681 MIMAT0024385_st 6.58972 5.82029
## 23138 MIMAT0018776_st 7.13740 5.06894
## PBMC_IMDiagnosis_Subject3 PBMC_IMDiagnosis_Subject4
## 23290 7.75470 8.26385
## 22480 6.87418 6.71152
## 28681 5.90241 6.54903
## 23138 6.99133 5.73309
## PBMC_IMDiagnosis_Subject5 PBMC_1month_Subject1 PBMC_1month_Subject2
## 23290 8.02411 8.74426 7.13127
## 22480 6.42185 7.28631 6.54830
## 28681 6.92075 5.32486 5.64545
## 23138 5.46725 5.09389 5.99764
## PBMC_1month_Subject3 PBMC_1month_Subject4 PBMC_1month_Subject5
## 23290 7.59673 9.32027 7.23575
## 22480 6.64466 7.63433 5.68676
## 28681 6.67667 7.76192 5.81186
## 23138 5.75052 5.01959 4.89004
## PBMC_Control_Subject1 PBMC_Control_Subject2 PBMC_Control_Subject3
## 23290 0.751354 0.620609 0.663376
## 22480 0.998890 1.218440 1.263510
## 28681 0.679280 1.145750 1.494090
## 23138 1.316260 0.814413 1.314730
## initialDX_Means month1_Means healthy_Means FC_initial_healthy
## 23290 8.076104 8.005656 0.6784463 11.903821
## 22480 6.670178 6.760072 1.1602800 5.748766
## 28681 6.356440 6.244152 1.1063733 5.745294
## 23138 6.079602 5.350336 1.1484677 5.293664
## FC_1month_healthy
## 23290 11.799984
## 22480 5.826242
## 28681 5.643802
## 23138 4.658674
These 4 genes in initial and 1 month of infection with mono are all enhanced genes and not silenced genes.
We can see how well it handles 3 classes of initial diagnosis, 1 month, and healthy now.
class3 <- c("initial mono","initial mono","initial mono","initial mono","initial mono","1 month mono","1 month mono","1 month mono","1 month mono","1 month mono","healthy","healthy","healthy")
month1_4genes_mx <- data.frame(t(top4genes1month[,c(2:14)]))
colnames(month1_4genes_mx) <- header1monthmatrix
month1_4genes_mx$class <- as.factor(class3)
month1_4genes_mx
## MIMAT0018115_st MIMAT0018776_st MIMAT0018929_st
## PBMC_IMDiagnosis_Subject1 7.98950 7.137400 9.564560
## PBMC_IMDiagnosis_Subject2 5.35384 5.068940 6.773300
## PBMC_IMDiagnosis_Subject3 6.87418 6.991330 7.754700
## PBMC_IMDiagnosis_Subject4 6.71152 5.733090 8.263850
## PBMC_IMDiagnosis_Subject5 6.42185 5.467250 8.024110
## PBMC_1month_Subject1 7.28631 5.093890 8.744260
## PBMC_1month_Subject2 6.54830 5.997640 7.131270
## PBMC_1month_Subject3 6.64466 5.750520 7.596730
## PBMC_1month_Subject4 7.63433 5.019590 9.320270
## PBMC_1month_Subject5 5.68676 4.890040 7.235750
## PBMC_Control_Subject1 0.99889 1.316260 0.751354
## PBMC_Control_Subject2 1.21844 0.814413 0.620609
## PBMC_Control_Subject3 1.26351 1.314730 0.663376
## MIMAT0024385_st class
## PBMC_IMDiagnosis_Subject1 6.58972 initial mono
## PBMC_IMDiagnosis_Subject2 5.82029 initial mono
## PBMC_IMDiagnosis_Subject3 5.90241 initial mono
## PBMC_IMDiagnosis_Subject4 6.54903 initial mono
## PBMC_IMDiagnosis_Subject5 6.92075 initial mono
## PBMC_1month_Subject1 5.32486 1 month mono
## PBMC_1month_Subject2 5.64545 1 month mono
## PBMC_1month_Subject3 6.67667 1 month mono
## PBMC_1month_Subject4 7.76192 1 month mono
## PBMC_1month_Subject5 5.81186 1 month mono
## PBMC_Control_Subject1 0.67928 healthy
## PBMC_Control_Subject2 1.14575 healthy
## PBMC_Control_Subject3 1.49409 healthy
We have our matrix of 4 top genes and 3 classes. Lets write it out and read it in then test the machine model on predicting each class.
write.csv(month1_4genes_mx,"month1_4genes_3classes_mono_matrix.csv",
row.names=F)
genes4_3class <- read.csv("month1_4genes_3classes_mono_matrix.csv", sep=',',header=T)
genes4_3class$class <- as.factor(genes4_3class$class)
str(genes4_3class)
## 'data.frame': 13 obs. of 5 variables:
## $ MIMAT0018115_st: num 7.99 5.35 6.87 6.71 6.42 ...
## $ MIMAT0018776_st: num 7.14 5.07 6.99 5.73 5.47 ...
## $ MIMAT0018929_st: num 9.56 6.77 7.75 8.26 8.02 ...
## $ MIMAT0024385_st: num 6.59 5.82 5.9 6.55 6.92 ...
## $ class : Factor w/ 3 levels "1 month mono",..: 3 3 3 3 3 1 1 1 1 1 ...
training <- genes4_3class[c(1,2,4,6,7,9,11,13),]
testing <- genes4_3class[c(3,5,8,10,12),]
set.seed(123)
ML4_3class.rf <- randomForest(class ~ ., data=training, ntree=10000, keep.forest=TRUE,importance=TRUE)
print(ML4_3class.rf)
##
## Call:
## randomForest(formula = class ~ ., data = training, ntree = 10000, keep.forest = TRUE, importance = TRUE)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 62.5%
## Confusion matrix:
## 1 month mono healthy initial mono class.error
## 1 month mono 1 0 2 0.6666667
## healthy 0 2 0 0.0000000
## initial mono 3 0 0 1.0000000
The model did much worse with 3 classes to predict. It trained on 2-3 samples in each class and in this training model predicted all the healthy samples correctly, only 1 of the 1 month mono correctly and missed 2 of those samples, and none of the initial mono correctly of 3 samples of initial mono.
Lets see if it can predict better on the testing set.
predict4_3class <- predict(ML4_3class.rf,testing)
predicted4_3class_results <- data.frame(predicted=predict4_3class, actual=testing$class)
predicted4_3class_results
## predicted actual
## 3 initial mono initial mono
## 5 initial mono initial mono
## 8 initial mono 1 month mono
## 10 initial mono 1 month mono
## 12 healthy healthy
It predicted 60% or 3 out of 5 samples correctly and missed 2 of 5 samples in predicting the accurate class.
Next we will do the overall samples in a 2 class and 5 class model to see how well those genes predict Mono. But we will also on the side look up the micro RNA that these miRNA strands influence the protein expression into mRNA and a protein. MiRNA can influence what mRNA is made into proteins or inhibit it with its hairpin form that is a strand of single stranded RNA that doubles back on itself to bond like ribonucleic nucleotides and prevent the molecule from creating a protein. They call that a double stranded RNA although it is a single strand RNA that doubles back on itself to block protein formation, or it unwinds and allows mRNA to proceed with protein formation. Either way, the nuclear material of DNA wrapped tightly around beads of histones forming chromatin that encircles or wraps each chromosome is where all proteins in the body are made, including miRNA, mRNA, proteins, transfer DNA, and so on.
This is the plan we developed before this extension with our 15 top genes among the 2 patients that stayed the entire research experiment.
There is a site, miRbase.org, that is made for micro RNA and will give the name of our miRNA IDs if we exclude the last letter attachment like ’_st’ and search their search bar. We can search each of these top 15 strings by their ID after manually removing the appended few letters after numeric portion of ID_REF and then get the RNA string, look up the proteins of amino acids and also use BLAST.org to search the string but as RNA or miRNA if available and go about manually finding the chromosome location top ranked for that gene specific to that miRNA that will silence or enhance translation of the final protein.
Lets see if these top 4 genes of enhancers in early mono infection are in this list.
patientsTop[which(patientsTop$ID_REF %in% header1monthmatrix),]
## ID_REF PBMC_IMDiagnosis_Subject1 PBMC_IMDiagnosis_Subject3
## 22480 MIMAT0018115_st 7.98950 6.87418
## 23138 MIMAT0018776_st 7.13740 6.99133
## 23290 MIMAT0018929_st 9.56456 7.75470
## PBMC_1month_Subject1 PBMC_1month_Subject3 PBMC_2month_Subject1
## 22480 7.28631 6.64466 6.86842
## 23138 5.09389 5.75052 5.45741
## 23290 8.74426 7.59673 8.17706
## PBMC_2month_Subject3 PBMC_7month_Subject1 PBMC_7month_Subject3
## 22480 5.62720 6.14692 3.61504
## 23138 5.55762 5.17261 4.69867
## 23290 6.58651 7.72162 3.19999
## PBMC_Control_Subject1 PBMC_Control_Subject2 PBMC_Control_Subject3
## 22480 0.998890 1.218440 1.263510
## 23138 1.316260 0.814413 1.314730
## 23290 0.751354 0.620609 0.663376
It appears 1 of the genes in the top 4 is not in the top 15 genes, so lets add it.
top16list <- c(patientsTop$ID_REF,header1monthmatrix)
top16 <- top16list[!duplicated(top16list)]
top16
## [1] "MIMAT0009308_st" "MIMAT0009352_st" "MIMAT0009448_st" "MIMAT0013150_st"
## [5] "MIMAT0013157_st" "MIMAT0015888_st" "MIMAT0018115_st" "MIMAT0018776_st"
## [9] "MIMAT0018929_st" "MIMAT0019306_st" "MIMAT0019328_st" "MIMAT0022271_st"
## [13] "MIMAT0023947_st" "MIMAT0024117_st" "MIMAT0025375_st" "MIMAT0024385_st"
Make the dataset of these 16 genes from our Data and write it out to csv to find the proteins these micro RNA affect.
df16 <- data[which(data$ID_REF %in% top16),]
df16
## ID_REF PBMC_IMDiagnosis_Subject1 PBMC_IMDiagnosis_Subject2
## 14750 MIMAT0009308_st 1.253760 1.52557
## 14794 MIMAT0009352_st 0.888682 2.64363
## 14878 MIMAT0009448_st 6.611770 6.30356
## 17628 MIMAT0013150_st 1.253760 1.52557
## 17635 MIMAT0013157_st 0.791461 1.82682
## 20275 MIMAT0015888_st 0.791461 1.82682
## 22480 MIMAT0018115_st 7.989500 5.35384
## 23138 MIMAT0018776_st 7.137400 5.06894
## 23290 MIMAT0018929_st 9.564560 6.77330
## 23664 MIMAT0019306_st 1.253760 1.52557
## 23686 MIMAT0019328_st 1.930870 3.54334
## 26585 MIMAT0022271_st 4.804370 3.26516
## 28243 MIMAT0023947_st 1.253760 1.52557
## 28413 MIMAT0024117_st 0.858988 1.56380
## 28681 MIMAT0024385_st 6.589720 5.82029
## 29663 MIMAT0025375_st 5.826970 2.66354
## PBMC_IMDiagnosis_Subject3 PBMC_IMDiagnosis_Subject4
## 14750 0.820108 1.78429
## 14794 1.363520 2.26717
## 14878 6.113860 4.71800
## 17628 0.820108 1.78429
## 17635 1.077050 1.56078
## 20275 1.077050 1.56078
## 22480 6.874180 6.71152
## 23138 6.991330 5.73309
## 23290 7.754700 8.26385
## 23664 0.820108 1.78429
## 23686 0.775165 3.88250
## 26585 4.992900 4.30547
## 28243 0.820108 1.78429
## 28413 0.499793 2.50218
## 28681 5.902410 6.54903
## 29663 6.642790 5.27552
## PBMC_IMDiagnosis_Subject5 PBMC_1month_Subject1 PBMC_1month_Subject2
## 14750 1.077190 0.97485 0.898028
## 14794 0.849133 1.35099 0.710139
## 14878 6.035440 3.62155 2.016480
## 17628 1.077190 0.97485 0.898028
## 17635 2.200320 1.23523 0.569350
## 20275 2.200320 1.23523 0.569350
## 22480 6.421850 7.28631 6.548300
## 23138 5.467250 5.09389 5.997640
## 23290 8.024110 8.74426 7.131270
## 23664 1.077190 0.97485 0.898028
## 23686 2.417000 1.10510 2.201100
## 26585 5.212060 3.18501 4.930970
## 28243 1.077190 0.97485 0.898028
## 28413 1.235230 1.15299 1.400370
## 28681 6.920750 5.32486 5.645450
## 29663 4.824540 4.90673 4.124050
## PBMC_1month_Subject3 PBMC_1month_Subject4 PBMC_1month_Subject5
## 14750 0.583420 1.77516 1.206380
## 14794 0.939178 1.52217 0.768760
## 14878 4.150270 7.19976 2.825740
## 17628 0.583420 1.77516 1.206380
## 17635 0.758594 1.73440 1.321170
## 20275 0.758594 1.73440 1.321170
## 22480 6.644660 7.63433 5.686760
## 23138 5.750520 5.01959 4.890040
## 23290 7.596730 9.32027 7.235750
## 23664 0.583420 1.77516 1.206380
## 23686 0.655552 2.32579 2.325790
## 26585 5.173840 6.61938 5.189980
## 28243 0.583420 1.77516 1.206380
## 28413 0.965413 1.29613 0.909796
## 28681 6.676670 7.76192 5.811860
## 29663 4.478840 2.18753 2.364130
## PBMC_2month_Subject1 PBMC_2month_Subject3 PBMC_2month_Subject4
## 14750 1.31600 0.998890 1.180520
## 14794 1.05297 0.793028 0.895331
## 14878 6.51860 2.326930 3.523210
## 17628 1.31600 0.998890 1.180520
## 17635 1.67161 0.657761 0.993811
## 20275 1.67161 0.657761 0.993811
## 22480 6.86842 5.627200 6.483520
## 23138 5.45741 5.557620 4.395970
## 23290 8.17706 6.586510 7.371500
## 23664 1.31600 0.998890 1.180520
## 23686 2.74819 1.780610 2.854490
## 26585 6.76735 5.162990 5.172610
## 28243 1.31600 0.998890 1.180520
## 28413 1.18651 0.591569 1.165370
## 28681 5.70551 6.251930 6.164940
## 29663 4.31294 4.521010 1.491890
## PBMC_2month_Subject5 PBMC_7month_Subject1 PBMC_7month_Subject3
## 14750 4.24044 1.661050 0.736952
## 14794 4.00975 0.991218 0.734131
## 14878 5.47309 1.534900 2.501360
## 17628 4.24044 1.661050 0.736952
## 17635 4.92096 2.279960 0.893046
## 20275 4.92096 2.279960 0.893046
## 22480 7.13712 6.146920 3.615040
## 23138 2.99686 5.172610 4.698670
## 23290 7.47487 7.721620 3.199990
## 23664 4.24044 1.661050 0.736952
## 23686 5.62994 3.979800 1.617440
## 26585 6.44556 5.573930 1.747580
## 28243 4.24044 1.661050 0.736952
## 28413 4.73508 3.027160 0.739443
## 28681 6.62477 5.422700 2.053250
## 29663 2.10153 4.085580 2.987850
## PBMC_Control_Subject1 PBMC_Control_Subject2 PBMC_Control_Subject3
## 14750 6.143320 1.813050 4.375250
## 14794 4.714100 1.557880 5.674300
## 14878 0.630559 0.832797 1.150230
## 17628 6.143320 1.813050 4.375250
## 17635 6.992610 1.409610 4.248130
## 20275 6.992610 1.409610 4.248130
## 22480 0.998890 1.218440 1.263510
## 23138 1.316260 0.814413 1.314730
## 23290 0.751354 0.620609 0.663376
## 23664 6.143320 1.813050 4.375250
## 23686 7.734350 1.964680 5.868130
## 26585 0.880532 0.786847 1.200670
## 28243 6.143320 1.813050 4.375250
## 28413 7.143990 1.050430 4.031210
## 28681 0.679280 1.145750 1.494090
## 29663 0.859998 0.895331 1.207480
We only kept original gene expression of microRNA and not the fold change data from respective dataset such as patients 1 and 3 who stayed all the way to the end of study with 15 genes unique to the various time points and this last set of only all members who participated and stayed together until their last time point together. We still have the set of fold changes and genes for all samples to do another time as an extension to this as part 3. All genes are common to the two data sets so far other than the 1 gene that was in the early mono with all participants as a top gene but not in any of the other time points for the only participants who saw experiment out till the end. That gene is the last listed gene in the top16 set of genes. Lets add a class to this data frame to list that the top 15 genes had significant fold changes that were common to all time points and early time points only for last gene. But we would still need to look back at the fold change values to see how they did overall.
merge fold change values used from each sample source to the df16 data frame.
FC_allpatients <- patients13[which(patients13$ID_REF %in% p13geneHeader),c(1,23)]
FC_alltimes <- topDX0genes[which(topDX0genes$ID_REF %in% header1monthmatrix),c(1,19)]
colnames(FC_alltimes)[2] <- "foldchangeValue"
colnames(FC_allpatients)[2] <- "foldchangeValue"
FC2sets <- rbind(FC_allpatients,FC_alltimes)
FC2sets$setSample <- "sample"
FC2sets$setSample[1:15] <- "2 patients 7 timepoints all vs healthy means"
FC2sets$setSample[16:19] <- "2 timepoints 5 patients 1month vs healthy means"
data16FCs <- merge(data,FC2sets,by.x='ID_REF',by.y='ID_REF')
head(data16FCs)
## ID_REF PBMC_IMDiagnosis_Subject1 PBMC_IMDiagnosis_Subject2
## 1 MIMAT0009308_st 1.253760 1.52557
## 2 MIMAT0009352_st 0.888682 2.64363
## 3 MIMAT0009448_st 6.611770 6.30356
## 4 MIMAT0013150_st 1.253760 1.52557
## 5 MIMAT0013157_st 0.791461 1.82682
## 6 MIMAT0015888_st 0.791461 1.82682
## PBMC_IMDiagnosis_Subject3 PBMC_IMDiagnosis_Subject4 PBMC_IMDiagnosis_Subject5
## 1 0.820108 1.78429 1.077190
## 2 1.363520 2.26717 0.849133
## 3 6.113860 4.71800 6.035440
## 4 0.820108 1.78429 1.077190
## 5 1.077050 1.56078 2.200320
## 6 1.077050 1.56078 2.200320
## PBMC_1month_Subject1 PBMC_1month_Subject2 PBMC_1month_Subject3
## 1 0.97485 0.898028 0.583420
## 2 1.35099 0.710139 0.939178
## 3 3.62155 2.016480 4.150270
## 4 0.97485 0.898028 0.583420
## 5 1.23523 0.569350 0.758594
## 6 1.23523 0.569350 0.758594
## PBMC_1month_Subject4 PBMC_1month_Subject5 PBMC_2month_Subject1
## 1 1.77516 1.20638 1.31600
## 2 1.52217 0.76876 1.05297
## 3 7.19976 2.82574 6.51860
## 4 1.77516 1.20638 1.31600
## 5 1.73440 1.32117 1.67161
## 6 1.73440 1.32117 1.67161
## PBMC_2month_Subject3 PBMC_2month_Subject4 PBMC_2month_Subject5
## 1 0.998890 1.180520 4.24044
## 2 0.793028 0.895331 4.00975
## 3 2.326930 3.523210 5.47309
## 4 0.998890 1.180520 4.24044
## 5 0.657761 0.993811 4.92096
## 6 0.657761 0.993811 4.92096
## PBMC_7month_Subject1 PBMC_7month_Subject3 PBMC_Control_Subject1
## 1 1.661050 0.736952 6.143320
## 2 0.991218 0.734131 4.714100
## 3 1.534900 2.501360 0.630559
## 4 1.661050 0.736952 6.143320
## 5 2.279960 0.893046 6.992610
## 6 2.279960 0.893046 6.992610
## PBMC_Control_Subject2 PBMC_Control_Subject3 foldchangeValue
## 1 1.813050 4.37525 0.2537693
## 2 1.557880 5.67430 0.2546938
## 3 0.832797 1.15023 4.7892876
## 4 1.813050 4.37525 0.2537693
## 5 1.409610 4.24813 0.2776024
## 6 1.409610 4.24813 0.2776024
## setSample
## 1 2 patients 7 timepoints all vs healthy means
## 2 2 patients 7 timepoints all vs healthy means
## 3 2 patients 7 timepoints all vs healthy means
## 4 2 patients 7 timepoints all vs healthy means
## 5 2 patients 7 timepoints all vs healthy means
## 6 2 patients 7 timepoints all vs healthy means
Write this table out as our top 16 genes in mono miRNA to find gene IDs for and for our first 2 of 3 sets of analysis from this study.
write.csv(data16FCs,'top16genes_mono_FCs.csv',row.names=FALSE)
We can now look up these genes and add the gene ID to these microRNA using miRbase.org. You can get this file here. We also want to do an overall sample with the unbalanced samples per time period of mono progression and see how well the machine learning model predicts the class in a 2 or 5 class model. We found that with the set of samples where all patients participated for the first month and initial diagnosis that 3 genes were in the top 15 genes from all patients who participated, but added an extra gene. When a 2 class model, the machine was able to predict with 100% accuracy the class, but when we made it a 3 class model of initial diagnosis, 1 month, or healthy, the machine didn’t do well on any class other than healthy class. So these genes must be very good in predicting if a person has mono or not up to 1 month of infection and also going back to other samples up to 7 months of infection. These markers can probably track the genes they encode and later be used to identify similar results in associated pathologies related to EBV infection, like this one of infectious mononucleosis, multiple sclerosis, burkett and hodgkin lymphomas, and throat cancer.
Thanks.
======================================================
Part 3, we find the top genes among all samples which includes all participating patients from beginning until end of research as well as those that dropped out before month 2 and month 7. We will see if the top genes are similar or at least the same top 2 genes in other sets of samples or if we have new genes, and if the machine model is better at predicting a 5 class model, and still 100% accurate in predicting a 2 class model.
I still haven’t searched the 16 top genes by micro RNA protein name encoding or enhancing or silencing, but we can get these top genes then check back with the details on the names of those miRNA and add them to our database on genes and how they predict their respective pathology when we do so that we can build a machine model that can predict by pathology based on the medial or cell line used such as skeletal muscle, peripheral blood mononumclear cell B-cells, or peripheral blood mononuclear cell lymphoblastic B-cells, or RNA-Seq, You can find the file I uploaded earlier in week to google sheets here.
I went ahead and ran all the lines of code to this extension, so our environment is pretty cluttered. We will keep the data and data16FCs data sets.
ls()
## [1] "class" "class3"
## [3] "class5" "common0127"
## [5] "common1monthDX0" "commonAll"
## [7] "data" "data16FCs"
## [9] "df16" "DX0"
## [11] "DX01" "DX1"
## [13] "DX2" "DX27"
## [15] "DX7" "DXall"
## [17] "FC_allpatients" "FC_alltimes"
## [19] "FC2sets" "genes4_3class"
## [21] "header1monthmatrix" "headerInformation"
## [23] "matrix13" "matrix5class"
## [25] "metadata" "ML15.rf"
## [27] "ML15b.rf" "ML15c.rf"
## [29] "ML4.rf" "ML4_3class.rf"
## [31] "month1_4genes_mx" "newNames"
## [33] "overall4genesFCs" "p13geneHeader"
## [35] "patients13" "patients1month"
## [37] "patientsTop" "predict4"
## [39] "predict4_3class" "predicted"
## [41] "predicted2" "predicted3"
## [43] "predicted4_3class_results" "predicted4results"
## [45] "results" "testing"
## [47] "top" "top16"
## [49] "top16list" "top1monthgenes"
## [51] "top20" "top20_1monthgenes"
## [53] "top20DXOgenes" "top40"
## [55] "top4genes1month" "topDX0genes"
## [57] "topGenesClasses" "training"
## [59] "X_0" "X_1"
## [61] "X_2" "X_7"
## [63] "X_all"
We will remove everything except the data and data16FCs.
rm( "class" , "class3" ,
"class5" , "common0127" ,
"common1monthDX0" , "commonAll" ,
"df16" , "DX0" ,
"DX01" , "DX1" ,
"DX2" , "DX27" ,
"DX7" , "DXall" ,
"FC_allpatients" , "FC_alltimes" ,
"FC2sets" ,"genes4_3class" ,
"header1monthmatrix" , "headerInformation" ,
"matrix13" , "matrix5class" ,
"metadata" , "ML15.rf" ,
"ML15b.rf" , "ML15c.rf" ,
"ML4.rf" , "ML4_3class.rf" ,
"month1_4genes_mx" ,"newNames" ,
"overall4genesFCs" , "p13geneHeader" ,
"patients13" , "patients1month" ,
"patientsTop" , "predict4" ,
"predict4_3class" , "predicted" ,
"predicted2" , "predicted3" ,
"predicted4_3class_results", "predicted4results" ,
"results" , "testing" ,
"top" ,"top16" ,
"top16list" , "top1monthgenes" ,
"top20" ,"top20_1monthgenes",
"top20DXOgenes" , "top40" ,
"top4genes1month" , "topDX0genes" ,
"topGenesClasses" ,"training" ,
"X_0" , "X_1" ,
"X_2" ,"X_7" ,
"X_all" )
I have been meaning to do that but wasn’t sure if we could recycle any, since we are using all the features here, we can run code without selecting out special features.
colnames(data)
## [1] "ID_REF" "PBMC_IMDiagnosis_Subject1"
## [3] "PBMC_IMDiagnosis_Subject2" "PBMC_IMDiagnosis_Subject3"
## [5] "PBMC_IMDiagnosis_Subject4" "PBMC_IMDiagnosis_Subject5"
## [7] "PBMC_1month_Subject1" "PBMC_1month_Subject2"
## [9] "PBMC_1month_Subject3" "PBMC_1month_Subject4"
## [11] "PBMC_1month_Subject5" "PBMC_2month_Subject1"
## [13] "PBMC_2month_Subject3" "PBMC_2month_Subject4"
## [15] "PBMC_2month_Subject5" "PBMC_7month_Subject1"
## [17] "PBMC_7month_Subject3" "PBMC_Control_Subject1"
## [19] "PBMC_Control_Subject2" "PBMC_Control_Subject3"
data$healthyMeans <- rowMeans(data[,c(18:20)])
data$AllMonoMeans <- rowMeans(data[,c(2:17)])
data$DX0_Means <- rowMeans(data[,c(2:6)])
data$DX1_Means <- rowMeans(data[,c(7:11)])
data$DX2_Means <- rowMeans(data[,c(12:15)])
data$DX7_Means <- rowMeans(data[,c(16:17)])
colnames(data)
## [1] "ID_REF" "PBMC_IMDiagnosis_Subject1"
## [3] "PBMC_IMDiagnosis_Subject2" "PBMC_IMDiagnosis_Subject3"
## [5] "PBMC_IMDiagnosis_Subject4" "PBMC_IMDiagnosis_Subject5"
## [7] "PBMC_1month_Subject1" "PBMC_1month_Subject2"
## [9] "PBMC_1month_Subject3" "PBMC_1month_Subject4"
## [11] "PBMC_1month_Subject5" "PBMC_2month_Subject1"
## [13] "PBMC_2month_Subject3" "PBMC_2month_Subject4"
## [15] "PBMC_2month_Subject5" "PBMC_7month_Subject1"
## [17] "PBMC_7month_Subject3" "PBMC_Control_Subject1"
## [19] "PBMC_Control_Subject2" "PBMC_Control_Subject3"
## [21] "healthyMeans" "AllMonoMeans"
## [23] "DX0_Means" "DX1_Means"
## [25] "DX2_Means" "DX7_Means"
DX0 means initial diagnosis, DX1 means after 1 month infection, and so on up to DX7 for 7 months infection. We get the means of all samples per gene across the rows of samples in that selection.
data$FC_allMono_healthy <- data$AllMonoMeans/data$healthyMeans
data$FC_DX0_healthy <- data$DX0_Means/data$healthyMeans
data$FC_DX1_healthy <- data$DX1_Means/data$healthyMeans
data$FC_DX2_healthy <- data$DX2_Means/data$healthyMeans
data$FC_DX7_healthy <- data$DX7_Means/data$healthyMeans
colnames(data)
## [1] "ID_REF" "PBMC_IMDiagnosis_Subject1"
## [3] "PBMC_IMDiagnosis_Subject2" "PBMC_IMDiagnosis_Subject3"
## [5] "PBMC_IMDiagnosis_Subject4" "PBMC_IMDiagnosis_Subject5"
## [7] "PBMC_1month_Subject1" "PBMC_1month_Subject2"
## [9] "PBMC_1month_Subject3" "PBMC_1month_Subject4"
## [11] "PBMC_1month_Subject5" "PBMC_2month_Subject1"
## [13] "PBMC_2month_Subject3" "PBMC_2month_Subject4"
## [15] "PBMC_2month_Subject5" "PBMC_7month_Subject1"
## [17] "PBMC_7month_Subject3" "PBMC_Control_Subject1"
## [19] "PBMC_Control_Subject2" "PBMC_Control_Subject3"
## [21] "healthyMeans" "AllMonoMeans"
## [23] "DX0_Means" "DX1_Means"
## [25] "DX2_Means" "DX7_Means"
## [27] "FC_allMono_healthy" "FC_DX0_healthy"
## [29] "FC_DX1_healthy" "FC_DX2_healthy"
## [31] "FC_DX7_healthy"
Lets order the time sets of foldchange values pathology vs healthy from largest to smallest and take the top 10 genes at the top and bottom 10 genes.
allMono_top20 <- data[order(data$FC_allMono_healthy, decreasing=T),]
all20 <- allMono_top20[c(1:10,36344:36353),]
monoDX0_top20 <- data[order(data$FC_DX0_healthy,decreasing=T),]
DX0_20 <- monoDX0_top20[c(1:10,36344:36353),]
monoDX1_top20 <- data[order(data$FC_DX1_healthy,decreasing=T),]
DX1_20 <- monoDX1_top20[c(1:10,36344:36353),]
monoDX2_top20 <- data[order(data$FC_DX2_healthy,decreasing=T),]
DX2_20 <- monoDX2_top20[c(1:10,36344:36353),]
monoDX7_top20 <- data[order(data$FC_DX7_healthy,decreasing=T),]
DX7_20 <- monoDX7_top20[c(1:10,36344:36353),]
Get all micro RNA in each set and compare to get those in common.
allSamplesIDs <- all20$ID_REF
DX0_IDs <- DX0_20$ID_REF
DX1_IDs <- DX1_20$ID_REF
DX2_IDs <- DX2_20$ID_REF
DX7_IDs <- DX7_20$ID_REF
Now lets see what IDs are in common to all of these sets’ top genes.
a <- which(allSamplesIDs %in% DX0_IDs)
aa <- allSamplesIDs[a]
aa
## [1] "MIMAT0018929_st" "MIMAT0018115_st" "MIMAT0024385_st" "MIMAT0009448_st"
## [5] "MIMAT0005867_st" "MIMAT0008310_st" "MIMAT0016093_st" "MIMAT0022658_st"
## [9] "MIMAT0026241_st"
b <- which(DX1_IDs %in% DX2_IDs)
bb <- DX1_IDs[b]
bb
## [1] "MIMAT0018929_st" "MIMAT0018115_st" "MIMAT0024385_st" "MIMAT0022271_st"
## [5] "MIMAT0031767_st" "MIMAT0027774_st"
c <- which(aa %in% bb)
cc <- aa[c]
cc
## [1] "MIMAT0018929_st" "MIMAT0018115_st" "MIMAT0024385_st"
d <- which(DX7_IDs %in% cc)
dd <- DX7_IDs[d]
dd
## [1] "MIMAT0018929_st" "MIMAT0018115_st"
Looks like two genes in common among all top genes for all samples and timepoints. Lets see if they are in the set of 16 top micro RNA genes so far.
e <- which(data16FCs$ID_REF %in% dd)
data16FCs$ID_REF[e]
## [1] "MIMAT0018115_st" "MIMAT0018115_st" "MIMAT0018929_st" "MIMAT0018929_st"
duplicates because these are the two genes common to all sets of sample distributions to analyze with machine learning. It appears in mononucleosis at any state in the infectious state, unless these are substrates in the experiment design, that these two genes are target enhancers, as we found out they were enhancers last time, in mononumcleosis, MIMAT0018115_st and MIMAT0018929_st.
Lets look at the fold change values in our data on all samples and timepoints fold change to healthy.
colnames(all20)
## [1] "ID_REF" "PBMC_IMDiagnosis_Subject1"
## [3] "PBMC_IMDiagnosis_Subject2" "PBMC_IMDiagnosis_Subject3"
## [5] "PBMC_IMDiagnosis_Subject4" "PBMC_IMDiagnosis_Subject5"
## [7] "PBMC_1month_Subject1" "PBMC_1month_Subject2"
## [9] "PBMC_1month_Subject3" "PBMC_1month_Subject4"
## [11] "PBMC_1month_Subject5" "PBMC_2month_Subject1"
## [13] "PBMC_2month_Subject3" "PBMC_2month_Subject4"
## [15] "PBMC_2month_Subject5" "PBMC_7month_Subject1"
## [17] "PBMC_7month_Subject3" "PBMC_Control_Subject1"
## [19] "PBMC_Control_Subject2" "PBMC_Control_Subject3"
## [21] "healthyMeans" "AllMonoMeans"
## [23] "DX0_Means" "DX1_Means"
## [25] "DX2_Means" "DX7_Means"
## [27] "FC_allMono_healthy" "FC_DX0_healthy"
## [29] "FC_DX1_healthy" "FC_DX2_healthy"
## [31] "FC_DX7_healthy"
all20[,c(1,27)]
## ID_REF FC_allMono_healthy
## 23290 MIMAT0018929_st 11.1412967
## 22480 MIMAT0018115_st 5.5498172
## 28681 MIMAT0024385_st 5.3794037
## 26585 MIMAT0022271_st 5.1351868
## 14878 MIMAT0009448_st 5.1276187
## 36038 MIMAT0031767_st 4.7639984
## 33919 MIMAT0029648_st 4.7100233
## 11378 MIMAT0005867_st 4.6915691
## 13765 MIMAT0008310_st 4.6915691
## 20480 MIMAT0016093_st 4.6915691
## 11811 MIMAT0006311_st 0.3350017
## 13589 MIMAT0008134_st 0.3350017
## 14750 MIMAT0009308_st 0.3350017
## 17628 MIMAT0013150_st 0.3350017
## 23664 MIMAT0019306_st 0.3350017
## 28243 MIMAT0023947_st 0.3350017
## 17416 MIMAT0012937_st 0.3274217
## 34625 MIMAT0030354_st 0.3274217
## 26965 MIMAT0022658_st 0.3127430
## 30520 MIMAT0026241_st 0.3127430
Lets add these two genes as triplet appearances in our dataframe of 16 top genes but there are actually already 19 due to duplicate appearances and now should be 21 after we add these appearances. We have to extract only the features needed and change name of fold change field to foldchangeValue for all samples to same name, add another feature by setSample to show where these fold change values taken for the gene expressions.
allTop2_df <- all20[which(all20$ID_REF %in% dd),c(1:20,27)]
colnames(allTop2_df)[21] <- "foldchangeValue"
allTop2_df$setSample <- 'all samples all times DX0-DX7 patient1-patient5'
colnames(allTop2_df)
## [1] "ID_REF" "PBMC_IMDiagnosis_Subject1"
## [3] "PBMC_IMDiagnosis_Subject2" "PBMC_IMDiagnosis_Subject3"
## [5] "PBMC_IMDiagnosis_Subject4" "PBMC_IMDiagnosis_Subject5"
## [7] "PBMC_1month_Subject1" "PBMC_1month_Subject2"
## [9] "PBMC_1month_Subject3" "PBMC_1month_Subject4"
## [11] "PBMC_1month_Subject5" "PBMC_2month_Subject1"
## [13] "PBMC_2month_Subject3" "PBMC_2month_Subject4"
## [15] "PBMC_2month_Subject5" "PBMC_7month_Subject1"
## [17] "PBMC_7month_Subject3" "PBMC_Control_Subject1"
## [19] "PBMC_Control_Subject2" "PBMC_Control_Subject3"
## [21] "foldchangeValue" "setSample"
Now we can rbind it to the data16FCs dataset for our genes to get gene IDs for these listed micro RNA IDs.
data16_allsets <- rbind(data16FCs,allTop2_df)
str(data16_allsets)
## 'data.frame': 21 obs. of 22 variables:
## $ ID_REF : chr "MIMAT0009308_st" "MIMAT0009352_st" "MIMAT0009448_st" "MIMAT0013150_st" ...
## $ PBMC_IMDiagnosis_Subject1: num 1.254 0.889 6.612 1.254 0.791 ...
## $ PBMC_IMDiagnosis_Subject2: num 1.53 2.64 6.3 1.53 1.83 ...
## $ PBMC_IMDiagnosis_Subject3: num 0.82 1.36 6.11 0.82 1.08 ...
## $ PBMC_IMDiagnosis_Subject4: num 1.78 2.27 4.72 1.78 1.56 ...
## $ PBMC_IMDiagnosis_Subject5: num 1.077 0.849 6.035 1.077 2.2 ...
## $ PBMC_1month_Subject1 : num 0.975 1.351 3.622 0.975 1.235 ...
## $ PBMC_1month_Subject2 : num 0.898 0.71 2.016 0.898 0.569 ...
## $ PBMC_1month_Subject3 : num 0.583 0.939 4.15 0.583 0.759 ...
## $ PBMC_1month_Subject4 : num 1.78 1.52 7.2 1.78 1.73 ...
## $ PBMC_1month_Subject5 : num 1.206 0.769 2.826 1.206 1.321 ...
## $ PBMC_2month_Subject1 : num 1.32 1.05 6.52 1.32 1.67 ...
## $ PBMC_2month_Subject3 : num 0.999 0.793 2.327 0.999 0.658 ...
## $ PBMC_2month_Subject4 : num 1.181 0.895 3.523 1.181 0.994 ...
## $ PBMC_2month_Subject5 : num 4.24 4.01 5.47 4.24 4.92 ...
## $ PBMC_7month_Subject1 : num 1.661 0.991 1.535 1.661 2.28 ...
## $ PBMC_7month_Subject3 : num 0.737 0.734 2.501 0.737 0.893 ...
## $ PBMC_Control_Subject1 : num 6.143 4.714 0.631 6.143 6.993 ...
## $ PBMC_Control_Subject2 : num 1.813 1.558 0.833 1.813 1.41 ...
## $ PBMC_Control_Subject3 : num 4.38 5.67 1.15 4.38 4.25 ...
## $ foldchangeValue : num 0.254 0.255 4.789 0.254 0.278 ...
## $ setSample : chr "2 patients 7 timepoints all vs healthy means" "2 patients 7 timepoints all vs healthy means" "2 patients 7 timepoints all vs healthy means" "2 patients 7 timepoints all vs healthy means" ...
Lets write this file out to csv to add our name search IDs to the miRNA IDs later.
write.csv(data16_allsets,
'all16genes_duplicates_mono_allstates_allpatients.csv',
row.names=F)
Lets also write out this data file with means and fold change values per time point set of samples.
write.csv(data,'data_mono_allfeatures_allpatients_alltimepoints.csv',
row.names=F)
Now lets do some machine learning and build our matrix to train and test on first, then test a two class model of mono or healthy to compare for these samples. And then a 5 class model by timepoints of each sample. Clear environment but leave the data16_allsets to use, lets just read it in as data.
data <- read.csv('all16genes_duplicates_mono_allstates_allpatients.csv',
header=T,sep=',')
colnames(data)
## [1] "ID_REF" "PBMC_IMDiagnosis_Subject1"
## [3] "PBMC_IMDiagnosis_Subject2" "PBMC_IMDiagnosis_Subject3"
## [5] "PBMC_IMDiagnosis_Subject4" "PBMC_IMDiagnosis_Subject5"
## [7] "PBMC_1month_Subject1" "PBMC_1month_Subject2"
## [9] "PBMC_1month_Subject3" "PBMC_1month_Subject4"
## [11] "PBMC_1month_Subject5" "PBMC_2month_Subject1"
## [13] "PBMC_2month_Subject3" "PBMC_2month_Subject4"
## [15] "PBMC_2month_Subject5" "PBMC_7month_Subject1"
## [17] "PBMC_7month_Subject3" "PBMC_Control_Subject1"
## [19] "PBMC_Control_Subject2" "PBMC_Control_Subject3"
## [21] "foldchangeValue" "setSample"
We don’t need the fold change values or setSample and we also don’t need duplicated genes.
genesOnly <- data[!duplicated(data$ID_REF),c(1:20)]
colnames(genesOnly)
## [1] "ID_REF" "PBMC_IMDiagnosis_Subject1"
## [3] "PBMC_IMDiagnosis_Subject2" "PBMC_IMDiagnosis_Subject3"
## [5] "PBMC_IMDiagnosis_Subject4" "PBMC_IMDiagnosis_Subject5"
## [7] "PBMC_1month_Subject1" "PBMC_1month_Subject2"
## [9] "PBMC_1month_Subject3" "PBMC_1month_Subject4"
## [11] "PBMC_1month_Subject5" "PBMC_2month_Subject1"
## [13] "PBMC_2month_Subject3" "PBMC_2month_Subject4"
## [15] "PBMC_2month_Subject5" "PBMC_7month_Subject1"
## [17] "PBMC_7month_Subject3" "PBMC_Control_Subject1"
## [19] "PBMC_Control_Subject2" "PBMC_Control_Subject3"
There are 16 classes of mono and 3 healthy classes for 19 total samples for our 2 class model.
geneHeader <- genesOnly$ID_REF
classLabels <- c("mono","mono","mono","mono","mono","mono","mono",
"mono","mono","mono","mono","mono","mono","mono",
"mono","mono","healthy","healthy","healthy")
Leave out the ID_REF field for our matrix.
matrix16_2 <- data.frame(t(genesOnly[,c(2:20)]))
colnames(matrix16_2) <- geneHeader
matrix16_2$class <- classLabels
str(matrix16_2)
## 'data.frame': 19 obs. of 17 variables:
## $ MIMAT0009308_st: num 1.25 1.53 0.82 1.78 1.08 ...
## $ MIMAT0009352_st: num 0.889 2.644 1.364 2.267 0.849 ...
## $ MIMAT0009448_st: num 6.61 6.3 6.11 4.72 6.04 ...
## $ MIMAT0013150_st: num 1.25 1.53 0.82 1.78 1.08 ...
## $ MIMAT0013157_st: num 0.791 1.827 1.077 1.561 2.2 ...
## $ MIMAT0015888_st: num 0.791 1.827 1.077 1.561 2.2 ...
## $ MIMAT0018115_st: num 7.99 5.35 6.87 6.71 6.42 ...
## $ MIMAT0018776_st: num 7.14 5.07 6.99 5.73 5.47 ...
## $ MIMAT0018929_st: num 9.56 6.77 7.75 8.26 8.02 ...
## $ MIMAT0019306_st: num 1.25 1.53 0.82 1.78 1.08 ...
## $ MIMAT0019328_st: num 1.931 3.543 0.775 3.882 2.417 ...
## $ MIMAT0022271_st: num 4.8 3.27 4.99 4.31 5.21 ...
## $ MIMAT0023947_st: num 1.25 1.53 0.82 1.78 1.08 ...
## $ MIMAT0024117_st: num 0.859 1.564 0.5 2.502 1.235 ...
## $ MIMAT0024385_st: num 6.59 5.82 5.9 6.55 6.92 ...
## $ MIMAT0025375_st: num 5.83 2.66 6.64 5.28 4.82 ...
## $ class : chr "mono" "mono" "mono" "mono" ...
write this out to csv and read back in to avoid row names.
write.csv(matrix16_2,'matrix16_2.csv',row.names=F)
matrix16_2 <- read.csv('matrix16_2.csv',header=T, sep=',')
matrix16_2$class <- as.factor(matrix16_2$class)
training <- matrix16_2[c(1,2,3,6,7,8,11,12,13,15,18,19),]
testing <- matrix16_2[c(4,5,9,10,14,16,17),]
set.seed(123)
ML16_2.rf <- randomForest(class ~ ., data=training, ntree=10000, keep.forest=TRUE,importance=TRUE)
print(ML16_2.rf)
##
## Call:
## randomForest(formula = class ~ ., data = training, ntree = 10000, keep.forest = TRUE, importance = TRUE)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 8.33%
## Confusion matrix:
## healthy mono class.error
## healthy 1 1 0.5
## mono 0 10 0.0
Not as well of a performance on this model, when training with all samples that are unbalanced. The other two sets were balanced and found 100% accuracy in predicting mono or healthy, but this one incorrectly predicted a healthy class as mono. There were 2 of the 3 healthy samples manually placed into the training set. Lets see how well it predicts on the testing set with this model.
predicted <- predict(ML16_2.rf,testing)
results <- data.frame(predicted=predicted,actual=testing$class)
results
## predicted actual
## 4 mono mono
## 5 mono mono
## 9 mono mono
## 10 mono mono
## 14 mono mono
## 16 mono mono
## 17 healthy healthy
Well, that is great! It still predicted with 100% accuracy the class of the sample as being one with mono at any time point or being healthy.
Lets see how well it does on the five class sample of initial diagnosis, 1 month infection, 2 months infection, 7 months infection, or healthy.
class5 <- c("DX0","DX0","DX0","DX0","DX0",
"1month","1month","1month","1month","1month",
"2month", "2month", "2month", "2month",
"7month","7month",
"healthy","healthy","healthy")
matrix16_5 <- data.frame(matrix16_2)
matrix16_5$class <- as.factor(class5)
str(matrix16_5)
## 'data.frame': 19 obs. of 17 variables:
## $ MIMAT0009308_st: num 1.25 1.53 0.82 1.78 1.08 ...
## $ MIMAT0009352_st: num 0.889 2.644 1.364 2.267 0.849 ...
## $ MIMAT0009448_st: num 6.61 6.3 6.11 4.72 6.04 ...
## $ MIMAT0013150_st: num 1.25 1.53 0.82 1.78 1.08 ...
## $ MIMAT0013157_st: num 0.791 1.827 1.077 1.561 2.2 ...
## $ MIMAT0015888_st: num 0.791 1.827 1.077 1.561 2.2 ...
## $ MIMAT0018115_st: num 7.99 5.35 6.87 6.71 6.42 ...
## $ MIMAT0018776_st: num 7.14 5.07 6.99 5.73 5.47 ...
## $ MIMAT0018929_st: num 9.56 6.77 7.75 8.26 8.02 ...
## $ MIMAT0019306_st: num 1.25 1.53 0.82 1.78 1.08 ...
## $ MIMAT0019328_st: num 1.931 3.543 0.775 3.882 2.417 ...
## $ MIMAT0022271_st: num 4.8 3.27 4.99 4.31 5.21 ...
## $ MIMAT0023947_st: num 1.25 1.53 0.82 1.78 1.08 ...
## $ MIMAT0024117_st: num 0.859 1.564 0.5 2.502 1.235 ...
## $ MIMAT0024385_st: num 6.59 5.82 5.9 6.55 6.92 ...
## $ MIMAT0025375_st: num 5.83 2.66 6.64 5.28 4.82 ...
## $ class : Factor w/ 5 levels "1month","2month",..: 4 4 4 4 4 1 1 1 1 1 ...
training <- matrix16_5[c(1,2,3,6,7,8,11,12,13,15,18,19),]
testing <- matrix16_5[c(4,5,9,10,14,16,17),]
set.seed(123)
ML16_5.rf <- randomForest(class ~ ., data=training, ntree=10000, keep.forest=TRUE,importance=TRUE)
print(ML16_5.rf)
##
## Call:
## randomForest(formula = class ~ ., data = training, ntree = 10000, keep.forest = TRUE, importance = TRUE)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 83.33%
## Confusion matrix:
## 1month 2month 7month DX0 healthy class.error
## 1month 0 2 0 1 0 1.0000000
## 2month 1 1 0 1 0 0.6666667
## 7month 0 1 0 0 0 1.0000000
## DX0 1 2 0 0 0 1.0000000
## healthy 0 1 0 0 1 0.5000000
It didn’t do well at all, with the best class being predicted the 2 months of infection when in a 5 class model, and next best being the healthy class. It incorrectly predicted the class as initial diagnosis, 1 month infection, or 7 months infected. This about as well as the other 5 class models with either all timepoints but only with participating patients throughout or no drop outs, and only timepoints with all patients before some dropped out. Lets see how well it predicts the classes in the testing set. It could be good.
predicted <- predict(ML16_5.rf,testing)
results <- data.frame(predicted=predicted,actual=testing$class)
results
## predicted actual
## 4 DX0 DX0
## 5 2month DX0
## 9 DX0 1month
## 10 2month 1month
## 14 healthy 2month
## 16 1month 7month
## 17 healthy healthy
Only one of the initial diagnosis mono samples and the only healthy sample was predicted correctly. The genes still correctly predicted the healthy sample in the testing set for all samples in study. But it does the best at only predicting if a sample has mono or not, and not by the time point from initial diagnosis, one month infection, two months infection, seven months infection, or healthy.
We have our data frame of the 16 genes that haven’t changed and seem to be great indicators of mono in these samples. We just have to look up each gene in the micro RNA data base and possibly the BLAST microRNA look up by rank of highest overall for gene target, then use biostrings to get the codon amino acid sequences and see if the amino acids are from diet also like the essential amino acids in the PVT TIM HALL mnemonic or are part of regular body metabolism without diet. And also add it to our database of all gene targets thus far in various cell lines from Lyme disease, Fibromyalgia, multiple sclerosis, EBV stimulated with IL27, and now this one mononucleosis. You can get the file of the microRNA genes in each of 3 analysis sets here
Thanks so much and keep checking back.
==============================================================
Part 4 extension.
I have looked up the micro RNA strand IDs in the MiRbase.org and found some had Ensembl IDs and some didn’t, while most of the micro RNA IDs had an HGNC ID symbol or numeric ID. There are also some that have a genecards.org ID for these miRNA IDs that are the MIR*** name. The miRbase.org gave the HGNC ID and Ensembl ID if available as well as provide a small nucleotide strand of ribonucleic acid to identify the micro RNA. I explored the BLAST database and used the nucleotide with RNA setting to find the nucleotide of micro RNA provided by miRbase.org and then selected the top ranked gene if one was available for that micro RNA ID. Some additional information like the chromosome located or if it is mitochondrial DNA which is circular from our very ancient resources of familial DNA from bacteria, linear is DNA in the nucleus of cell wrapped around histones of each chromosome. Mitochondrial DNA is inside the organelle of cytoplasm of cell outside the nucleus called the mitochondria, it is also where ATP is synthesized and enhanced in the Krebb Cycle or Citric Acid Cycle or TAC as alternate names.
We cannot add the micro RNA in its ID_REF format to our DNA genes database as we need the gene the microRNA enhances or silences by promoting or blocking the messenger RNA production in cytosol of cell. We can use biostrings in the bioconductor package to get the triplet codons of amino acids and select which ones are essential amino acids consumed in diet, branch chain amino acids to build skeletal muscle, or are innate amino acids to the body. This could help direct how to plan a diet to better heal from mono or see if certain amino acid rich foods enhance or promote the mono virus or retard or silence it from replicating and causing the symptoms in hosts the first 3-6 weeks of infection.
Very few of our 16 genes had a gene that could be used to add to the database of current studies’ genes from the multiple sclerosis, EBV IL-27 stimulated, and the TNF-a and MEF2C stimulated fibromyalgia, and Lyme disease genes. But we can look at the additional fields added to our last data set here. We can currently add only the genes that have an Ensembl and genecards ID to our database of target genes by pathology.
Data <- read.csv('all16genes_duplicates_mono_allstates_allpatients_geneNamesToAdd.csv',sep=',',header=T)
str(Data)
## 'data.frame': 21 obs. of 35 variables:
## $ ID_REF : chr "MIMAT0009308_st" "MIMAT0009352_st" "MIMAT0009448_st" "MIMAT0013150_st" ...
## $ hairpin_miRbase.org : chr "bta-mir-382" "bta-mir-584-2" "hsa-mir-1973" "eca-miR-382" ...
## $ HGNC_lookup : chr "MIR382" "MIR584" "MIR1973" "MIR382" ...
## $ HGNC_symbol : int 31875 32840 37061 31875 32083 32083 38946 38946 NA NA ...
## $ Ensembl_geneID : chr "ENSG00000283170" "ENSG00000207714" "ENSG00000284253" "ENSG00000283170" ...
## $ geneCards_ID : chr "MIR382" "MIR584" "MIR1973" "MIR382" ...
## $ description : chr "bta-miR-382 mature miRNA" "bta-miR-584 mature miRNA" "hsa-miR-1973 mature miRNA" "eca-miR-382 mature miRNA" ...
## $ nucleotideSequence_miRbase.org: chr "GAAGUUGUUCGUGGUGGAUUCG" "UGGUUUGCCUGGGACUGAG" "ACCGUGCAAAGGUAGCAUA" "GAAGUUGUUCGUGGUGGAUUCG" ...
## $ BLAST_ranked_gene : chr "BAC CH17-456D16 from chromosome 14, complete sequence\nSequence ID: AC275898.1Length: 214532Number of Matches: "| __truncated__ "Homo sapiens isolate NA24385 chromosome 17\nSequence ID: CP139549.2Length: 84843407Number of Matches: 53\nRelat"| __truncated__ "isolate MEN014 haplogroup L2c mitochondrion, complete genome\nSequence ID: PX403006.1Length: 16567Number of Mat"| __truncated__ "BAC CH17-456D16 from chromosome 14, complete sequence\nSequence ID: AC275898.1Length: 214532Number of Matches: "| __truncated__ ...
## $ chromosome_BLAST_genBank : chr "14" "17" "mitochondrion circular" "14" ...
## $ BLAST_genBank : chr "AC275898.1" " CP139549.2" "PX403006.1" "AC275898.1" ...
## $ miRNA_evidence_miRbase.org : chr "experimental\nArray [3], qRT-PCR [3], cloned [4]" "not_experimental" "experimental\ncloned [1]" "not_experimental" ...
## $ database_links_miRbase.org : chr "" "" "RNAcentral and TissueAtlas" "" ...
## $ predictedTargets_miRbase.org : chr "" "" "targetScan, targetMiner, miRDB" "" ...
## $ PBMC_IMDiagnosis_Subject1 : num 1.254 0.889 6.612 1.254 0.791 ...
## $ PBMC_IMDiagnosis_Subject2 : num 1.53 2.64 6.3 1.53 1.83 ...
## $ PBMC_IMDiagnosis_Subject3 : num 0.82 1.36 6.11 0.82 1.08 ...
## $ PBMC_IMDiagnosis_Subject4 : num 1.78 2.27 4.72 1.78 1.56 ...
## $ PBMC_IMDiagnosis_Subject5 : num 1.077 0.849 6.035 1.077 2.2 ...
## $ PBMC_1month_Subject1 : num 0.975 1.351 3.622 0.975 1.235 ...
## $ PBMC_1month_Subject2 : num 0.898 0.71 2.016 0.898 0.569 ...
## $ PBMC_1month_Subject3 : num 0.583 0.939 4.15 0.583 0.759 ...
## $ PBMC_1month_Subject4 : num 1.78 1.52 7.2 1.78 1.73 ...
## $ PBMC_1month_Subject5 : num 1.206 0.769 2.826 1.206 1.321 ...
## $ PBMC_2month_Subject1 : num 1.32 1.05 6.52 1.32 1.67 ...
## $ PBMC_2month_Subject3 : num 0.999 0.793 2.327 0.999 0.658 ...
## $ PBMC_2month_Subject4 : num 1.181 0.895 3.523 1.181 0.994 ...
## $ PBMC_2month_Subject5 : num 4.24 4.01 5.47 4.24 4.92 ...
## $ PBMC_7month_Subject1 : num 1.661 0.991 1.535 1.661 2.28 ...
## $ PBMC_7month_Subject3 : num 0.737 0.734 2.501 0.737 0.893 ...
## $ PBMC_Control_Subject1 : num 6.143 4.714 0.631 6.143 6.993 ...
## $ PBMC_Control_Subject2 : num 1.813 1.558 0.833 1.813 1.41 ...
## $ PBMC_Control_Subject3 : num 4.38 5.67 1.15 4.38 4.25 ...
## $ foldchangeValue : num 0.254 0.255 4.789 0.254 0.278 ...
## $ setSample : chr "2 patients 7 timepoints all vs healthy means" "2 patients 7 timepoints all vs healthy means" "2 patients 7 timepoints all vs healthy means" "2 patients 7 timepoints all vs healthy means" ...
Lets look at our data on current target genes for pathologies analyzed thus far.
pathologyGenes <- read.csv('pathologies4_MS_FM_EBV_Lyme_addedInformation.csv',sep=',',header=T)
str(pathologyGenes)
## 'data.frame': 169 obs. of 7 variables:
## $ Ensembl_ID : chr "ENSG00000211899" "ENSG00000164458" "ENSG00000211644" "ENSG00000125869" ...
## $ Genecards_ID : chr "IGHM" "TBXT" "IGLV1-51" "LAMP5" ...
## $ FC_pathology_control: num 18550 1051 179 140 105 ...
## $ topGenePathology : chr "Epstein Barr Virus" "Epstein Barr Virus" "Epstein Barr Virus" "Epstein Barr Virus" ...
## $ mediaType : chr "LCLs of PBMCs RNA-Seq format" "LCLs of PBMCs RNA-Seq format" "LCLs of PBMCs RNA-Seq format" "LCLs of PBMCs RNA-Seq format" ...
## $ studySummarized : chr "The EBV or Epstein-Barr Viral infected samples were obtained from lymphoblastic cells in peripheral blood monon"| __truncated__ "The EBV or Epstein-Barr Viral infected samples were obtained from lymphoblastic cells in peripheral blood monon"| __truncated__ "The EBV or Epstein-Barr Viral infected samples were obtained from lymphoblastic cells in peripheral blood monon"| __truncated__ "The EBV or Epstein-Barr Viral infected samples were obtained from lymphoblastic cells in peripheral blood monon"| __truncated__ ...
## $ GSE_study_ID : chr "GSE253756" "GSE253756" "GSE253756" "GSE253756" ...
completeEnsemblGenecards <- Data[grep('ENS',Data$Ensembl_geneID),]
completeEnsemblGenecards$Ensembl_geneID
## [1] "ENSG00000283170" "ENSG00000207714" "ENSG00000284253" "ENSG00000283170"
## [5] "ENSG00000272458" "ENSG00000272458" "ENSG00000110092" "ENSG00000110092"
## [9] "ENSG00000283170" "ENSG00000199107" "ENSG00000284450" "ENSG00000283170"
## [13] "ENSG00000272458" "ENSG00000207656" "ENSG00000110092"
Each Ensembl ID has a genecards ID, but only one unique genecard ID of CNC1 and others are MIR type IDs for only micro RNA and all have same description that they are micro RNA genes and not much else described or was for a few but not included in the gene summary. None of the Genecards.org summaries were included due to the additional information already adding.
colnames(completeEnsemblGenecards)
## [1] "ID_REF" "hairpin_miRbase.org"
## [3] "HGNC_lookup" "HGNC_symbol"
## [5] "Ensembl_geneID" "geneCards_ID"
## [7] "description" "nucleotideSequence_miRbase.org"
## [9] "BLAST_ranked_gene" "chromosome_BLAST_genBank"
## [11] "BLAST_genBank" "miRNA_evidence_miRbase.org"
## [13] "database_links_miRbase.org" "predictedTargets_miRbase.org"
## [15] "PBMC_IMDiagnosis_Subject1" "PBMC_IMDiagnosis_Subject2"
## [17] "PBMC_IMDiagnosis_Subject3" "PBMC_IMDiagnosis_Subject4"
## [19] "PBMC_IMDiagnosis_Subject5" "PBMC_1month_Subject1"
## [21] "PBMC_1month_Subject2" "PBMC_1month_Subject3"
## [23] "PBMC_1month_Subject4" "PBMC_1month_Subject5"
## [25] "PBMC_2month_Subject1" "PBMC_2month_Subject3"
## [27] "PBMC_2month_Subject4" "PBMC_2month_Subject5"
## [29] "PBMC_7month_Subject1" "PBMC_7month_Subject3"
## [31] "PBMC_Control_Subject1" "PBMC_Control_Subject2"
## [33] "PBMC_Control_Subject3" "foldchangeValue"
## [35] "setSample"
Lets keep only the 5th, 6th, and 34th columns to add to our pathologyGenes database. This will keep the fold change value and the Ensemble and Genecard ID only.
genesToAdd <- completeEnsemblGenecards[,c(5,6,34)]
head(genesToAdd)
## Ensembl_geneID geneCards_ID foldchangeValue
## 1 ENSG00000283170 MIR382 0.2537693
## 2 ENSG00000207714 MIR584 0.2546938
## 3 ENSG00000284253 MIR1973 4.7892876
## 4 ENSG00000283170 MIR382 0.2537693
## 5 ENSG00000272458 MIR432 0.2776024
## 6 ENSG00000272458 MIR432 0.2776024
colnames(pathologyGenes)
## [1] "Ensembl_ID" "Genecards_ID" "FC_pathology_control"
## [4] "topGenePathology" "mediaType" "studySummarized"
## [7] "GSE_study_ID"
Rename the columns to our genes to add to match the pathology column names.
colnames(genesToAdd) <- c("Ensembl_ID" , "Genecards_ID"
,"FC_pathology_control")
genesToAdd$topGenePathology <- "mononucleosis"
genesToAdd$mediaType <- "PBMC microRNA"
genesToAdd$studySummarized <- "There were initially 5 patients that had blood drawn and the micro RNA was taken at initial diagnosis, then 1 month later, then 2 months later only for 4 of 5 patients, and then 7 months later for remaining 2 patients. These microRNA had different ID_REF names that were searched in miRbase.org and from there leads to Ensemble ID and/or genecards.org ID provided, BLAST was searched from the short ribonucleic strand that miRbase.org provided from the microRNA ID_REF, and top ranked gene selected to compare on the chromosome. These genes were filtered out of top 10 enhanced and bottom 10 silenced genes in analyzing the gene expression data from the GSE109220 study in NCBI databank. The numeric data was already normalized a specific method by the researchers"
genesToAdd$GSE_study_ID <- 'GSE109220'
head(genesToAdd)
## Ensembl_ID Genecards_ID FC_pathology_control topGenePathology
## 1 ENSG00000283170 MIR382 0.2537693 mononucleosis
## 2 ENSG00000207714 MIR584 0.2546938 mononucleosis
## 3 ENSG00000284253 MIR1973 4.7892876 mononucleosis
## 4 ENSG00000283170 MIR382 0.2537693 mononucleosis
## 5 ENSG00000272458 MIR432 0.2776024 mononucleosis
## 6 ENSG00000272458 MIR432 0.2776024 mononucleosis
## mediaType
## 1 PBMC microRNA
## 2 PBMC microRNA
## 3 PBMC microRNA
## 4 PBMC microRNA
## 5 PBMC microRNA
## 6 PBMC microRNA
## studySummarized
## 1 There were initially 5 patients that had blood drawn and the micro RNA was taken at initial diagnosis, then 1 month later, then 2 months later only for 4 of 5 patients, and then 7 months later for remaining 2 patients. These microRNA had different ID_REF names that were searched in miRbase.org and from there leads to Ensemble ID and/or genecards.org ID provided, BLAST was searched from the short ribonucleic strand that miRbase.org provided from the microRNA ID_REF, and top ranked gene selected to compare on the chromosome. These genes were filtered out of top 10 enhanced and bottom 10 silenced genes in analyzing the gene expression data from the GSE109220 study in NCBI databank. The numeric data was already normalized a specific method by the researchers
## 2 There were initially 5 patients that had blood drawn and the micro RNA was taken at initial diagnosis, then 1 month later, then 2 months later only for 4 of 5 patients, and then 7 months later for remaining 2 patients. These microRNA had different ID_REF names that were searched in miRbase.org and from there leads to Ensemble ID and/or genecards.org ID provided, BLAST was searched from the short ribonucleic strand that miRbase.org provided from the microRNA ID_REF, and top ranked gene selected to compare on the chromosome. These genes were filtered out of top 10 enhanced and bottom 10 silenced genes in analyzing the gene expression data from the GSE109220 study in NCBI databank. The numeric data was already normalized a specific method by the researchers
## 3 There were initially 5 patients that had blood drawn and the micro RNA was taken at initial diagnosis, then 1 month later, then 2 months later only for 4 of 5 patients, and then 7 months later for remaining 2 patients. These microRNA had different ID_REF names that were searched in miRbase.org and from there leads to Ensemble ID and/or genecards.org ID provided, BLAST was searched from the short ribonucleic strand that miRbase.org provided from the microRNA ID_REF, and top ranked gene selected to compare on the chromosome. These genes were filtered out of top 10 enhanced and bottom 10 silenced genes in analyzing the gene expression data from the GSE109220 study in NCBI databank. The numeric data was already normalized a specific method by the researchers
## 4 There were initially 5 patients that had blood drawn and the micro RNA was taken at initial diagnosis, then 1 month later, then 2 months later only for 4 of 5 patients, and then 7 months later for remaining 2 patients. These microRNA had different ID_REF names that were searched in miRbase.org and from there leads to Ensemble ID and/or genecards.org ID provided, BLAST was searched from the short ribonucleic strand that miRbase.org provided from the microRNA ID_REF, and top ranked gene selected to compare on the chromosome. These genes were filtered out of top 10 enhanced and bottom 10 silenced genes in analyzing the gene expression data from the GSE109220 study in NCBI databank. The numeric data was already normalized a specific method by the researchers
## 5 There were initially 5 patients that had blood drawn and the micro RNA was taken at initial diagnosis, then 1 month later, then 2 months later only for 4 of 5 patients, and then 7 months later for remaining 2 patients. These microRNA had different ID_REF names that were searched in miRbase.org and from there leads to Ensemble ID and/or genecards.org ID provided, BLAST was searched from the short ribonucleic strand that miRbase.org provided from the microRNA ID_REF, and top ranked gene selected to compare on the chromosome. These genes were filtered out of top 10 enhanced and bottom 10 silenced genes in analyzing the gene expression data from the GSE109220 study in NCBI databank. The numeric data was already normalized a specific method by the researchers
## 6 There were initially 5 patients that had blood drawn and the micro RNA was taken at initial diagnosis, then 1 month later, then 2 months later only for 4 of 5 patients, and then 7 months later for remaining 2 patients. These microRNA had different ID_REF names that were searched in miRbase.org and from there leads to Ensemble ID and/or genecards.org ID provided, BLAST was searched from the short ribonucleic strand that miRbase.org provided from the microRNA ID_REF, and top ranked gene selected to compare on the chromosome. These genes were filtered out of top 10 enhanced and bottom 10 silenced genes in analyzing the gene expression data from the GSE109220 study in NCBI databank. The numeric data was already normalized a specific method by the researchers
## GSE_study_ID
## 1 GSE109220
## 2 GSE109220
## 3 GSE109220
## 4 GSE109220
## 5 GSE109220
## 6 GSE109220
Lets now add this to our pathologyGenes database and export it out to csv to use later when building our machine to predict the type of pathology a sample has from Lyme disease, multiple sclerosis, mononucleosis, EBV stimulated with IL-27, and fibromyalgia stimulated with TNF-a and MEF2C. At a point further down the timeline as we need to also add in some data on associated EBV diseases of Hodgkin and Burkett Lymphoma and throat sarcoma or head and neck cancer that has loose associations with EBV infection at some point earlier in patient’s life.
pathologyDB <- rbind(pathologyGenes,genesToAdd)
pathologyDB[c(1,20,40,60,90,100,130,165,170:175),]
## Ensembl_ID Genecards_ID FC_pathology_control topGenePathology
## 1 ENSG00000211899 IGHM 1.855040e+04 Epstein Barr Virus
## 20 ENSG00000232977 LINC00327 1.940000e+01 Epstein Barr Virus
## 40 ENSG00000261471 ENSG00000261471 2.538462e+00 Epstein Barr Virus
## 60 ENSG00000112139 MDGA1 3.412663e-02 Epstein Barr Virus
## 90 ENSG00000180818 HOXC10 3.268590e-02 fibromyalgia
## 100 ENSG00000215193 PEX26 2.759844e+03 Lyme Disease 6 months
## 130 ENSG00000163453 IGFBP7 9.100000e+01 Multiple Sclerosis
## 165 ENSG00000228590 MIR4432HG 9.850746e-02 Multiple Sclerosis
## 170 ENSG00000283170 MIR382 2.537693e-01 mononucleosis
## 210 ENSG00000207714 MIR584 2.546938e-01 mononucleosis
## 310 ENSG00000284253 MIR1973 4.789288e+00 mononucleosis
## 410 ENSG00000283170 MIR382 2.537693e-01 mononucleosis
## 510 ENSG00000272458 MIR432 2.776024e-01 mononucleosis
## 610 ENSG00000272458 MIR432 2.776024e-01 mononucleosis
## mediaType
## 1 LCLs of PBMCs RNA-Seq format
## 20 LCLs of PBMCs RNA-Seq format
## 40 LCLs of PBMCs RNA-Seq format
## 60 LCLs of PBMCs RNA-Seq format
## 90 skeletal muscle RNA-Seq format
## 100 RBCs of PBMCs array format
## 130 B-cell of PBMCs RNA-Seq format
## 165 B-cell of PBMCs RNA-Seq format
## 170 PBMC microRNA
## 210 PBMC microRNA
## 310 PBMC microRNA
## 410 PBMC microRNA
## 510 PBMC microRNA
## 610 PBMC microRNA
## studySummarized
## 1 The EBV or Epstein-Barr Viral infected samples were obtained from lymphoblastic cells in peripheral blood mononuclear cells. Not the same as B cells in multiple sclerosis for tissue type. The lymphoblastic B cells are cancerous uncontrolled growth B cells, where the regular B cells are the normal circulating white blood cells active in healthy immunity. There are 2 samples of control1, 2 samples of control2, 2 samples from patient 1, and 2 samples from patient 2. This is a total of 8 samples where each control and sample have a basal or non-stimulated baseline and 4 samples stimulated with IL27. They stimulated the samples with IL27 and found the samples with an allele lacking IL27RA had a slower healing time if any to EBV infection. These 8 samples used the raw counts of genes. So, we have 8 EBV samples as lymphoblastic B-cells of PBMCs stimulated with IL27, 86 Lyme disease samples as RBCs of PBMCs in various states of infection with antibiotics used, 12 Fibromyalgia samples as skeletal muscle of possibly rats and unknown if before or after being stimulated with MEF2 known to enhance pain causing inflammation, or before or after that stimulation as well as stimulation with DEX known to alleviate pain caused by inflammation, and 18 Multiple Sclerosis samples as healthy B-cells of PBMCs without any stimulation but using 5 MS samples of commercial B-cells.
## 20 The EBV or Epstein-Barr Viral infected samples were obtained from lymphoblastic cells in peripheral blood mononuclear cells. Not the same as B cells in multiple sclerosis for tissue type. The lymphoblastic B cells are cancerous uncontrolled growth B cells, where the regular B cells are the normal circulating white blood cells active in healthy immunity. There are 2 samples of control1, 2 samples of control2, 2 samples from patient 1, and 2 samples from patient 2. This is a total of 8 samples where each control and sample have a basal or non-stimulated baseline and 4 samples stimulated with IL27. They stimulated the samples with IL27 and found the samples with an allele lacking IL27RA had a slower healing time if any to EBV infection. These 8 samples used the raw counts of genes. So, we have 8 EBV samples as lymphoblastic B-cells of PBMCs stimulated with IL27, 86 Lyme disease samples as RBCs of PBMCs in various states of infection with antibiotics used, 12 Fibromyalgia samples as skeletal muscle of possibly rats and unknown if before or after being stimulated with MEF2 known to enhance pain causing inflammation, or before or after that stimulation as well as stimulation with DEX known to alleviate pain caused by inflammation, and 18 Multiple Sclerosis samples as healthy B-cells of PBMCs without any stimulation but using 5 MS samples of commercial B-cells.
## 40 The EBV or Epstein-Barr Viral infected samples were obtained from lymphoblastic cells in peripheral blood mononuclear cells. Not the same as B cells in multiple sclerosis for tissue type. The lymphoblastic B cells are cancerous uncontrolled growth B cells, where the regular B cells are the normal circulating white blood cells active in healthy immunity. There are 2 samples of control1, 2 samples of control2, 2 samples from patient 1, and 2 samples from patient 2. This is a total of 8 samples where each control and sample have a basal or non-stimulated baseline and 4 samples stimulated with IL27. They stimulated the samples with IL27 and found the samples with an allele lacking IL27RA had a slower healing time if any to EBV infection. These 8 samples used the raw counts of genes. So, we have 8 EBV samples as lymphoblastic B-cells of PBMCs stimulated with IL27, 86 Lyme disease samples as RBCs of PBMCs in various states of infection with antibiotics used, 12 Fibromyalgia samples as skeletal muscle of possibly rats and unknown if before or after being stimulated with MEF2 known to enhance pain causing inflammation, or before or after that stimulation as well as stimulation with DEX known to alleviate pain caused by inflammation, and 18 Multiple Sclerosis samples as healthy B-cells of PBMCs without any stimulation but using 5 MS samples of commercial B-cells.
## 60 The EBV or Epstein-Barr Viral infected samples were obtained from lymphoblastic cells in peripheral blood mononuclear cells. Not the same as B cells in multiple sclerosis for tissue type. The lymphoblastic B cells are cancerous uncontrolled growth B cells, where the regular B cells are the normal circulating white blood cells active in healthy immunity. There are 2 samples of control1, 2 samples of control2, 2 samples from patient 1, and 2 samples from patient 2. This is a total of 8 samples where each control and sample have a basal or non-stimulated baseline and 4 samples stimulated with IL27. They stimulated the samples with IL27 and found the samples with an allele lacking IL27RA had a slower healing time if any to EBV infection. These 8 samples used the raw counts of genes. So, we have 8 EBV samples as lymphoblastic B-cells of PBMCs stimulated with IL27, 86 Lyme disease samples as RBCs of PBMCs in various states of infection with antibiotics used, 12 Fibromyalgia samples as skeletal muscle of possibly rats and unknown if before or after being stimulated with MEF2 known to enhance pain causing inflammation, or before or after that stimulation as well as stimulation with DEX known to alleviate pain caused by inflammation, and 18 Multiple Sclerosis samples as healthy B-cells of PBMCs without any stimulation but using 5 MS samples of commercial B-cells.
## 90 Fibromyalgia or the FM data obtained media from skeletal muscle using RNA-seq or high throughput sequencing. They injected the pain enducing myocyte enhancing factor 2 into healthy and the FM patients then injected DEX into those patients and found decreased inflammatory factors for pain in healthy and FM patients. Time point of when samples obtained before or after which injection is not apparent or clear.There were 5 healthy and 7 myofascial trigger point pain or fibromyalgia pain samples for a total of 12 samples. This project chose to use the fragments per kilo-million or fpkm instead of counts for each gene in the sample. It was listed under human or homo sapien, but the description for the samples each said derived from rats. Questions on this study are when the gene samples provided and are they human. Were they provided after the MEF2 injection to stimulate pain and see WBC activity or after the DEX injection delivered after the MEF2 injection.
## 100 Lyme disease samples had more acute, healthy, and 1 month infection than the chronic infection samples. The blood of peripheral blood mononuclear cells was examined by array and not high throughput analysis.There were 10 chronic cases of 6 months infection after antibiotics, 21 cases of healthy and uninfected, 28 cases of acute infection, and 27 cases of infected 1 month with antibiotics.
## 130 In the multiple sclerosis high throughput sequencing, the B cells are obtained but not sure if from peripheral blood mononuclear cells and isolated, probably. Or if from bone marrow. There were 3 repeats of healthy controls, 5 repeats of the first MS patient, 5 repeats of the 2nd MS patient, and 5 repeats of the commercial line MS patient purchased by researchers to compare. This totals 15 MS samples and 3 healthy samples for 18 total samples. These samples were 20 base pair long nucleotide strands of complimentary DNA or cDNA.
## 165 In the multiple sclerosis high throughput sequencing, the B cells are obtained but not sure if from peripheral blood mononuclear cells and isolated, probably. Or if from bone marrow. There were 3 repeats of healthy controls, 5 repeats of the first MS patient, 5 repeats of the 2nd MS patient, and 5 repeats of the commercial line MS patient purchased by researchers to compare. This totals 15 MS samples and 3 healthy samples for 18 total samples. These samples were 20 base pair long nucleotide strands of complimentary DNA or cDNA.
## 170 There were initially 5 patients that had blood drawn and the micro RNA was taken at initial diagnosis, then 1 month later, then 2 months later only for 4 of 5 patients, and then 7 months later for remaining 2 patients. These microRNA had different ID_REF names that were searched in miRbase.org and from there leads to Ensemble ID and/or genecards.org ID provided, BLAST was searched from the short ribonucleic strand that miRbase.org provided from the microRNA ID_REF, and top ranked gene selected to compare on the chromosome. These genes were filtered out of top 10 enhanced and bottom 10 silenced genes in analyzing the gene expression data from the GSE109220 study in NCBI databank. The numeric data was already normalized a specific method by the researchers
## 210 There were initially 5 patients that had blood drawn and the micro RNA was taken at initial diagnosis, then 1 month later, then 2 months later only for 4 of 5 patients, and then 7 months later for remaining 2 patients. These microRNA had different ID_REF names that were searched in miRbase.org and from there leads to Ensemble ID and/or genecards.org ID provided, BLAST was searched from the short ribonucleic strand that miRbase.org provided from the microRNA ID_REF, and top ranked gene selected to compare on the chromosome. These genes were filtered out of top 10 enhanced and bottom 10 silenced genes in analyzing the gene expression data from the GSE109220 study in NCBI databank. The numeric data was already normalized a specific method by the researchers
## 310 There were initially 5 patients that had blood drawn and the micro RNA was taken at initial diagnosis, then 1 month later, then 2 months later only for 4 of 5 patients, and then 7 months later for remaining 2 patients. These microRNA had different ID_REF names that were searched in miRbase.org and from there leads to Ensemble ID and/or genecards.org ID provided, BLAST was searched from the short ribonucleic strand that miRbase.org provided from the microRNA ID_REF, and top ranked gene selected to compare on the chromosome. These genes were filtered out of top 10 enhanced and bottom 10 silenced genes in analyzing the gene expression data from the GSE109220 study in NCBI databank. The numeric data was already normalized a specific method by the researchers
## 410 There were initially 5 patients that had blood drawn and the micro RNA was taken at initial diagnosis, then 1 month later, then 2 months later only for 4 of 5 patients, and then 7 months later for remaining 2 patients. These microRNA had different ID_REF names that were searched in miRbase.org and from there leads to Ensemble ID and/or genecards.org ID provided, BLAST was searched from the short ribonucleic strand that miRbase.org provided from the microRNA ID_REF, and top ranked gene selected to compare on the chromosome. These genes were filtered out of top 10 enhanced and bottom 10 silenced genes in analyzing the gene expression data from the GSE109220 study in NCBI databank. The numeric data was already normalized a specific method by the researchers
## 510 There were initially 5 patients that had blood drawn and the micro RNA was taken at initial diagnosis, then 1 month later, then 2 months later only for 4 of 5 patients, and then 7 months later for remaining 2 patients. These microRNA had different ID_REF names that were searched in miRbase.org and from there leads to Ensemble ID and/or genecards.org ID provided, BLAST was searched from the short ribonucleic strand that miRbase.org provided from the microRNA ID_REF, and top ranked gene selected to compare on the chromosome. These genes were filtered out of top 10 enhanced and bottom 10 silenced genes in analyzing the gene expression data from the GSE109220 study in NCBI databank. The numeric data was already normalized a specific method by the researchers
## 610 There were initially 5 patients that had blood drawn and the micro RNA was taken at initial diagnosis, then 1 month later, then 2 months later only for 4 of 5 patients, and then 7 months later for remaining 2 patients. These microRNA had different ID_REF names that were searched in miRbase.org and from there leads to Ensemble ID and/or genecards.org ID provided, BLAST was searched from the short ribonucleic strand that miRbase.org provided from the microRNA ID_REF, and top ranked gene selected to compare on the chromosome. These genes were filtered out of top 10 enhanced and bottom 10 silenced genes in analyzing the gene expression data from the GSE109220 study in NCBI databank. The numeric data was already normalized a specific method by the researchers
## GSE_study_ID
## 1 GSE253756
## 20 GSE253756
## 40 GSE253756
## 60 GSE253756
## 90 GSE215154
## 100 GSE145974
## 130 GSE293036
## 165 GSE293036
## 170 GSE109220
## 210 GSE109220
## 310 GSE109220
## 410 GSE109220
## 510 GSE109220
## 610 GSE109220
Looks good, I selected random rows to see the variation in pathologies of the database of genes so far that are target genes of the pathologies we analyzed so far.
Lets write this out to csv now.
write.csv(pathologyDB, 'pathologyDB_5pathologies_2-8-2026.csv', row.names=F)
You can get our newest gene target pathology database here.
We can remove those database objects now as we are going to be using the Data on our 15 genes of repeats that have genecards and Ensembl IDs to explore their amino acid triplet codons of RNA.
rm(pathologyGenes,pathologyDB,genesToAdd)
We kept the Data with those miRNA without IDs to Ensembl and Genecards.org so that we can look at their codons of amino acids.
RNA <- RNAStringSet(Data$nucleotideSequence_miRbase.org)
RNA
## RNAStringSet object of length 21:
## width seq
## [1] 22 GAAGUUGUUCGUGGUGGAUUCG
## [2] 19 UGGUUUGCCUGGGACUGAG
## [3] 19 ACCGUGCAAAGGUAGCAUA
## [4] 22 GAAGUUGUUCGUGGUGGAUUCG
## [5] 23 UCUUGGAGUAGGUCAUUGGGUGG
## ... ... ...
## [17] 22 UCUUGGAGUAGGUCAUUGGGUG
## [18] 19 GAGGCAGGAGAAUCGCUUG
## [19] 22 AGUGACAUCACAUAUACGGCGG
## [20] 18 GGUGGGCUUCCCGGAGGG
## [21] 24 CCCGGACAGGCGUUCGUGCGACGU
AAs <- translate(RNA)
AAs
## AAStringSet object of length 21:
## width seq
## [1] 7 EVVRGGF
## [2] 6 WFAWD*
## [3] 6 TVQR*H
## [4] 7 EVVRGGF
## [5] 7 SWSRSLG
## ... ... ...
## [17] 7 SWSRSLG
## [18] 6 EAGESL
## [19] 7 SDITYTA
## [20] 6 GGLPGG
## [21] 8 PGQAFVRR
Lets recall our amino acids by their 1 letter abbreviation.
The one-letter abbreviations of amino acids are as follows: * A - Alanine * C - Cysteine * D - Aspartic acid * E - Glutamic acid * F - Phenylalanine * G - Glycine * H - Histidine * I - Isoleucine * K - Lysine * L - Leucine * M - Methionine * N - Asparagine * P - Proline * Q - Glutamine * R - Arginine * S - Serine * T - Threonine * V - Valine * W - Tryptophan * Y - Tyrosine
Branch chain amino acids (BCAAs) are protein building blocks for skeletal muscle of leucine, isoleucine, and valine for VIL. We can see how many more in the silenced or enhanced regions have these BCAAs. The essential amino acids are from a mnemonic,PVT TIM HALL, where the P in this case is for phenylalanine that is abbreviated with an F not P and the A is for arginine abbreviated wiht an R not A as in mnemonic. Essential Amino Acids include the branch chain amino acids or BCAAs of VIL, plus 6 more for histidine (essential in kids but adults have enough of it), phenylalanine, tryptophan, methionine, Lysine, and threonine.
We are looking for F,V,T,W,I,M,H,R,K, and L for our essential amino acids, but H and R are semi-essential.
The three letter alternate and substitute codon chart of ribonucleic acids forming each amino acid is below.Amino Acid Chart
Lets add in the Amino Acids into our Data of miRNA.
Data$AminoAcidString <- AAs
colnames(Data)
## [1] "ID_REF" "hairpin_miRbase.org"
## [3] "HGNC_lookup" "HGNC_symbol"
## [5] "Ensembl_geneID" "geneCards_ID"
## [7] "description" "nucleotideSequence_miRbase.org"
## [9] "BLAST_ranked_gene" "chromosome_BLAST_genBank"
## [11] "BLAST_genBank" "miRNA_evidence_miRbase.org"
## [13] "database_links_miRbase.org" "predictedTargets_miRbase.org"
## [15] "PBMC_IMDiagnosis_Subject1" "PBMC_IMDiagnosis_Subject2"
## [17] "PBMC_IMDiagnosis_Subject3" "PBMC_IMDiagnosis_Subject4"
## [19] "PBMC_IMDiagnosis_Subject5" "PBMC_1month_Subject1"
## [21] "PBMC_1month_Subject2" "PBMC_1month_Subject3"
## [23] "PBMC_1month_Subject4" "PBMC_1month_Subject5"
## [25] "PBMC_2month_Subject1" "PBMC_2month_Subject3"
## [27] "PBMC_2month_Subject4" "PBMC_2month_Subject5"
## [29] "PBMC_7month_Subject1" "PBMC_7month_Subject3"
## [31] "PBMC_Control_Subject1" "PBMC_Control_Subject2"
## [33] "PBMC_Control_Subject3" "foldchangeValue"
## [35] "setSample" "AminoAcidString"
We are looking for F,V,T,W,I,M,H,R,K, and L for our essential amino acids, but H and R are semi-essential.
Data$AminoAcidString <- as.character(Data$AminoAcidString)
Data$AminoAcidString
## [1] "EVVRGGF" "WFAWD*" "TVQR*H" "EVVRGGF" "SWSRSLG" "SWSRSLG"
## [7] "PGQAFVRR" "PGQAFVRR" "SGRQLRR" "SGRQLRR" "GGLPGG" "GGLPGG"
## [13] "EVVRGGF" "RMLLGEP" "WAKGDDWV" "EVVRGGF" "SWSRSLG" "EAGESL"
## [19] "SDITYTA" "GGLPGG" "PGQAFVRR"
Lets see if we can count all essential amino acids in each strand by type of amino acid and then get a percentage ratio of the essential to non-essential amino acid.
F <- grep('F',Data$AminoAcidString)
F
## [1] 1 2 4 7 8 13 16 21
This just says if the amino acid string has that amino acid, we are not adding the amount of times the amino acid is in the string but only if it is present.
Data$F[F] <- 1
Data$F[-F] <- 0
We are looking for F,V,T,W,I,M,H,R,K, and L for our essential amino acids, but H and R are semi-essential.
V <- grep("V",Data$AminoAcidString)
T <- grep("T",Data$AminoAcidString)
W <- grep("W",Data$AminoAcidString)
I <- grep("I",Data$AminoAcidString)
M <- grep("M",Data$AminoAcidString)
H <- grep("H",Data$AminoAcidString)
R <- grep("R",Data$AminoAcidString)
K <- grep("K",Data$AminoAcidString)
L <- grep("L",Data$AminoAcidString)
Data$V[V] <- 1
Data$V[-V] <- 0
Data$T <- 0
Data$T[T] <- 1
Data$W <- 0
Data$W[W] <- 1
Data$I <- 0
Data$I[I] <- 1
Data$M <- 0
Data$M[M] <- 1
Data$H <- 0
Data$H[H] <- 1
Data$R <- 0
Data$R[R] <- 1
Data$K <- 0
Data$K[K] <- 1
Data$L <- 0
Data$L[L] <- 1
Add a feature that shows how many unique essential amino acids in the microRNA strand.
Data$uniqueEssentialAminoAcids <- rowSums(Data[,c(37:45)])
Looking at the data when ordering least essential amino acids to most and comparing fold change, all the strings of only 1 type of essential amino acid present have enhanced fold change, and those strands only have L for Leucine as the only essential amino acid present. Leucine is a BCAA building block of skeletal muscle, and gene expression of these microRNA strands are silenced.
DataOrderedEAA <- Data[order(Data$uniqueEssentialAminoAcids, decreasing=T),]
DataOrderedEAA[,c(6,34:46)]
## geneCards_ID foldchangeValue setSample
## 3 MIR1973 4.7892876 2 patients 7 timepoints all vs healthy means
## 1 MIR382 0.2537693 2 patients 7 timepoints all vs healthy means
## 4 MIR382 0.2537693 2 patients 7 timepoints all vs healthy means
## 7 5.4999903 2 patients 7 timepoints all vs healthy means
## 8 5.8262419 2 timepoints 5 patients 1month vs healthy means
## 13 MIR382 0.2537693 2 patients 7 timepoints all vs healthy means
## 15 MIR664B 4.8911259 2 patients 7 timepoints all vs healthy means
## 16 MIR382 0.2537693 2 patients 7 timepoints all vs healthy means
## 21 5.5498172 all samples all times DX0-DX7 patient1-patient5
## 2 MIR584 0.2546938 2 patients 7 timepoints all vs healthy means
## 5 MIR432 0.2776024 2 patients 7 timepoints all vs healthy means
## 6 MIR432 0.2776024 2 patients 7 timepoints all vs healthy means
## 14 MIR409 0.3515267 2 patients 7 timepoints all vs healthy means
## 17 MIR432 0.2767301 2 patients 7 timepoints all vs healthy means
## 19 MIR489 4.7795913 2 patients 7 timepoints all vs healthy means
## 9 4.9913737 2 patients 7 timepoints all vs healthy means
## 10 4.6586736 2 timepoints 5 patients 1month vs healthy means
## 11 CCND1 11.7999842 2 timepoints 5 patients 1month vs healthy means
## 12 CCND1 10.9340686 2 patients 7 timepoints all vs healthy means
## 18 5.6438020 2 timepoints 5 patients 1month vs healthy means
## 20 CCND1 11.1412967 all samples all times DX0-DX7 patient1-patient5
## AminoAcidString F V T W I M H R K L
## 3 TVQR*H 0 1 1 0 0 0 1 1 0 0
## 1 EVVRGGF 1 1 0 0 0 0 0 1 0 0
## 4 EVVRGGF 1 1 0 0 0 0 0 1 0 0
## 7 PGQAFVRR 1 1 0 0 0 0 0 1 0 0
## 8 PGQAFVRR 1 1 0 0 0 0 0 1 0 0
## 13 EVVRGGF 1 1 0 0 0 0 0 1 0 0
## 15 WAKGDDWV 0 1 0 1 0 0 0 0 1 0
## 16 EVVRGGF 1 1 0 0 0 0 0 1 0 0
## 21 PGQAFVRR 1 1 0 0 0 0 0 1 0 0
## 2 WFAWD* 1 0 0 1 0 0 0 0 0 0
## 5 SWSRSLG 0 0 0 1 0 0 0 1 0 1
## 6 SWSRSLG 0 0 0 1 0 0 0 1 0 1
## 14 RMLLGEP 0 0 0 0 0 1 0 1 0 1
## 17 SWSRSLG 0 0 0 1 0 0 0 1 0 1
## 19 SDITYTA 0 0 1 0 1 0 0 0 0 0
## 9 SGRQLRR 0 0 0 0 0 0 0 1 0 1
## 10 SGRQLRR 0 0 0 0 0 0 0 1 0 1
## 11 GGLPGG 0 0 0 0 0 0 0 0 0 1
## 12 GGLPGG 0 0 0 0 0 0 0 0 0 1
## 18 EAGESL 0 0 0 0 0 0 0 0 0 1
## 20 GGLPGG 0 0 0 0 0 0 0 0 0 1
When we order by fold change and look at the genes that are silenced or reduced in fold change expression, all are reduced to 30% or less than normal expression values, but a few are duplicates. And only 4 genes are silenced, the others are enhanced. These 4 genes are MIR382, MIR432, MIR584, and MIR409.
silenced <- Data[which(Data$foldchangeValue < 1),]
silenced[,c(1,6)]
## ID_REF geneCards_ID
## 1 MIMAT0009308_st MIR382
## 2 MIMAT0009352_st MIR584
## 4 MIMAT0013150_st MIR382
## 5 MIMAT0013157_st MIR432
## 6 MIMAT0015888_st MIR432
## 13 MIMAT0019306_st MIR382
## 14 MIMAT0019328_st MIR409
## 16 MIMAT0023947_st MIR382
## 17 MIMAT0024117_st MIR432
Some strands of the micro RNA by ID_REF name had the same Genecards.org ID name.
If you look at the essential amino acids present in the silenced genes, you will see that none of the silenced genes have isoleucine-I, Threonine-T, Lysine-k, or Tryptophan-W in them.
Now lets look at enhanced genes and search for patterns recognized by simply viewing them.
enhanced <- Data[which(Data$foldchangeValue >1),]
enhanced[,c(1,6)]
## ID_REF geneCards_ID
## 3 MIMAT0009448_st MIR1973
## 7 MIMAT0018115_st
## 8 MIMAT0018115_st
## 9 MIMAT0018776_st
## 10 MIMAT0018776_st
## 11 MIMAT0018929_st CCND1
## 12 MIMAT0018929_st CCND1
## 15 MIMAT0022271_st MIR664B
## 18 MIMAT0024385_st
## 19 MIMAT0025375_st MIR489
## 20 MIMAT0018929_st CCND1
## 21 MIMAT0018115_st
Not all the micro RNA had an alternate genecards name. Leucine is also found in the most expressed gene of CCND1. More of the enhanced or promoted genes have isoleucine, threonine, Lysine, or Histidine than the silenced genes, but not many.
Since these are top expressed or silenced gene micro RNA that influence gene expression by preventing it or promoting it, we can say the silenced genes are the microRNA that are suppressed from producing the proteins they normally make, while the upregulated or enhanced microRNAs are being promoted into production to produce more of their end product proteins for the body during mononucleosis infection from initial diagnosis to seven months after infection.
These microRNA can have applications to processing multiple mRNA strands to promote or deter protein synthesis in the cell nucleus if DNA derived or cell cytosol if mitochondrial DNA derived. The mother carries the mitochondrial DNA as it is part of cell organelle the embryo inherits in meiosis cell generation of an embryo in the oocyte splitting. Coincidentally the DNA tests people have to trace ancestry and heritage of cultures through time are from your mother’s side. This is why if you know you have certain races in your paternal side and their ancestors they won’t show in your DNA test results for cultural heritage you are. The miRbase.org gave information on chromosome as well as mitochondrial DNA source. Some were mitochondrial DNA. We should look at that in our silenced and enhanced data sets just created.
mitochondrial <- Data[grep('mito',Data$chromosome_BLAST_genBank),]
mitochondrial
## ID_REF hairpin_miRbase.org HGNC_lookup HGNC_symbol Ensembl_geneID
## 3 MIMAT0009448_st hsa-mir-1973 MIR1973 37061 ENSG00000284253
## geneCards_ID description nucleotideSequence_miRbase.org
## 3 MIR1973 hsa-miR-1973 mature miRNA ACCGUGCAAAGGUAGCAUA
## BLAST_ranked_gene
## 3 isolate MEN014 haplogroup L2c mitochondrion, complete genome\nSequence ID: PX403006.1Length: 16567Number of Matches: 1\nRange 1: 2581 to 2599, Alignment:\tQuery_7151513 x PX403006.1\nAnchor:\tPX403006.1 (2,581..2,599)\nQuery:\tQuery_7151513 (1..19)\nRelative orientation:\tforward\nSpan on PX403006.1:\t19\nSegments:\t1\nCoverage:\t100.0%\nIdentity:\t100.0%\nMismatches:\t0\nGaps:\t0\n\nDownload FASTA: \tQuery_7151513\n\nLinks & Tools\nBLAST nr: \tPX403006.1\nBLAST to Genome: \tPX403006.1\nFASTA record: \tPX403006.1\nGenBank record: \tPX403006.1
## chromosome_BLAST_genBank BLAST_genBank miRNA_evidence_miRbase.org
## 3 mitochondrion circular PX403006.1 experimental\ncloned [1]
## database_links_miRbase.org predictedTargets_miRbase.org
## 3 RNAcentral and TissueAtlas targetScan, targetMiner, miRDB
## PBMC_IMDiagnosis_Subject1 PBMC_IMDiagnosis_Subject2 PBMC_IMDiagnosis_Subject3
## 3 6.61177 6.30356 6.11386
## PBMC_IMDiagnosis_Subject4 PBMC_IMDiagnosis_Subject5 PBMC_1month_Subject1
## 3 4.718 6.03544 3.62155
## PBMC_1month_Subject2 PBMC_1month_Subject3 PBMC_1month_Subject4
## 3 2.01648 4.15027 7.19976
## PBMC_1month_Subject5 PBMC_2month_Subject1 PBMC_2month_Subject3
## 3 2.82574 6.5186 2.32693
## PBMC_2month_Subject4 PBMC_2month_Subject5 PBMC_7month_Subject1
## 3 3.52321 5.47309 1.5349
## PBMC_7month_Subject3 PBMC_Control_Subject1 PBMC_Control_Subject2
## 3 2.50136 0.630559 0.832797
## PBMC_Control_Subject3 foldchangeValue
## 3 1.15023 4.789288
## setSample AminoAcidString F V T W I M H R
## 3 2 patients 7 timepoints all vs healthy means TVQR*H 0 1 1 0 0 0 1 1
## K L uniqueEssentialAminoAcids
## 3 0 0 4
Only one gene is from mitochondrial DNA and that is the MIR1973 gene in the enhanced genes data set. It is also the only gene with Histidine in it. This micro RNA increased 4 fold at the least when infected with mononucleosis vs healthy.
Well, that was interesting, right? We added these genes to our gene targets pathology data base to use in building a predictive model for associated pathologies with EBV, healthy, and Lyme disease classes when we also find gene targets to other EBV associated pathologies of Hodgkin and Burkett Lymphoma and neck, head, and throat sarcoma.
Write this large Data with added essential amino acids and other added information manually.
write.csv(Data, "Data_mono_addedIDs_EssentialAAs_allNumericValues.csv",row.names=FALSE)
You can get this Data we made here. We aren’t going to do any more with the mononucleosis data analysis and predictive modeling. We well move on to finding the next pathology of one of the EBV associated pathologies next.
Thanks!