This data was obtained from the National Center for Bioinformatics (NCBI) Gene Expression Omnibus (GEO) online data repository on COVID-19 data using a mixed group of patient who all have the autoimmune disease, Rheumatoid Arthritis (RA). Their blood was taken in a special extraction method to preserve the RNA, then treated with abatacept, the RA drug to control autoimmune response for RA patients. There were 76 samples of a week 0 and week 12 where week 0 is the control in 38 patients, and week 12 is the CD80/86 costimulation inhibitor of T cells that produce the same or similar pathological processes as COVID-19 does in patients’ who suffer lung congestion till death in some cases. The access number for this data is GSE151161 and it can be (downloaded)[https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE151161] with the text metadata information shown immediately below and the RAW data values of gene expression. No need to add the platform information, because the RAW data has the gene name attached to the Raw RNA gene expression data. The data resource link: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE151161.
Originally, I thought this was blood infected with COVID-19, but it isn’t. It is blood that simulated the same disease process that kills the patient when the lungs and organs become congested with inflammation from the immune response. I wasn’t able to get the fastq files that were on COVID-19 from a different study to align with a sequence alignment app to identify each gene. Although those files are really helpful if you know how to do that process already. This study did use STAR to align their fastq files, and attached the gene symbols to each sample’s gene expression value.
Here is the original summary on NCBI’s GEO briefly describing this study, and this is why I thought it was COVID-19 samples:
__**"Series GSE151161 Query DataSets for GSE151161 Status Public on Jul 01, 2020 Title Blocking of the CD80/86 axis as a therapeutic approach to prevent progression to more severe forms of COVID-19 Organism Homo sapiens Experiment type Expression profiling by high throughput sequencing Summary In its more severe forms, COVID-19 progresses towards an excessive immune response, leading to the systemic overexpression of proinflammatory cytokines like IL6, mostly from the infected lungs. This cytokine storm can cause multiple organ damage and death. Consequently, there is a pressing need to identify therapies to treat and prevent severe symptoms during COVID-19. Based on previous clinical evidence, we hypothesized that inhibiting T cell co-stimulation by blocking CD80/86 could be an effective therapeutic strategy against progression to severe proinflammatory states. To support this hypothesis, we performed an analysis integrating blood transcriptional data we generated from rheumatoid arthritis patients treated with abatacept -a CD80/86 costimulation inhibitor- with the pathological features associated with COVID-19, particularly in its more severe forms. We have found that many of the biological processes that have been consistently associated with COVID-19 pathology are reversed by CD80/86 co-stimulation inhibition, including the downregulation of IL6 production. Also, analysis of previous transcriptional data from blood of SARS-CoVinfected patients showed that the response to abatacept has a very high level of antagonism to that elicited by COVID-19. Finally, analyzing a recent single cell RNA-seq dataset from bronchoalveolar lavage fluid cells from COVID-19 patients, we found a significant correlation along the main elements of the C80/86 axis: CD86+/80+ antigen presenting cells, activated CD4+ T cells and IL6 production. Our in-silico study provides additional support to the hypothesis that blocking of the CD80/CD86 signaling axis may be protective of the excessive proinflammatory state associated with COVID-19 in the lungs
Overall design Whole RNAseq Blood Samples of patients treated with abatacept at week 0 and week 12"__**
This first section pulls the data in from the commented text file of metadata information, arranges the demographic data to align with each of the RAW sample values.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
source("geneCards.R")
## Warning: package 'rvest' was built under R version 3.6.3
## Loading required package: xml2
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
info <- file('GSE151161_series_matrix.txt')
info2 <- readLines(con=info, n=-1)#, encoding='utf-8')
#info2
Info <- info2[c(30:32,38:44)]
head(Info)
## [1] ""
## [2] "!Sample_title\t\"Patient 1 week0\"\t\"Patient 1 week12\"\t\"Patient 2 week0\"\t\"Patient 2 week12\"\t\"Patient 15 week0\"\t\"Patient 15 week12\"\t\"Patient 24 week0\"\t\"Patient 24 week12\"\t\"Patient 16 week0\"\t\"Patient 16 week12\"\t\"Patient 13 week0\"\t\"Patient 13 week12\"\t\"Patient 22 week0\"\t\"Patient 22 week12\"\t\"Patient 31 week0\"\t\"Patient 31 week12\"\t\"Patient 3 week0\"\t\"Patient 3 week12\"\t\"Patient 18 week0\"\t\"Patient 18 week12\"\t\"Patient 23 week0\"\t\"Patient 23 week12\"\t\"Patient 21 week0\"\t\"Patient 21 week12\"\t\"Patient 25 week0\"\t\"Patient 25 week12\"\t\"Patient 26 week0\"\t\"Patient 26 week12\"\t\"Patient 27 week0\"\t\"Patient 27 week12\"\t\"Patient 20 week0\"\t\"Patient 20 week12\"\t\"Patient 34 week0\"\t\"Patient 34 week12\"\t\"Patient 14 week0\"\t\"Patient 14 week12\"\t\"Patient 17 week0\"\t\"Patient 17 week12\"\t\"Patient 19 week0\"\t\"Patient 19 week12\"\t\"Patient 4 week0\"\t\"Patient 4 week12\"\t\"Patient 35 week0\"\t\"Patient 35 week12\"\t\"Patient 28 week0\"\t\"Patient 28 week12\"\t\"Patient 12 week0\"\t\"Patient 12 week12\"\t\"Patient 8 week0\"\t\"Patient 8 week12\"\t\"Patient 5 week0\"\t\"Patient 5 week12\"\t\"Patient 38 week0\"\t\"Patient 38 week12\"\t\"Patient 10 week0\"\t\"Patient 10 week12\"\t\"Patient 29 week0\"\t\"Patient 29 week12\"\t\"Patient 37 week0\"\t\"Patient 37 week12\"\t\"Patient 6 week0\"\t\"Patient 6 week12\"\t\"Patient 36 week0\"\t\"Patient 36 week12\"\t\"Patient 9 week0\"\t\"Patient 9 week12\"\t\"Patient 32 week0\"\t\"Patient 32 week12\"\t\"Patient 7 week0\"\t\"Patient 7 week12\"\t\"Patient 33 week0\"\t\"Patient 33 week12\"\t\"Patient 11 week0\"\t\"Patient 11 week12\"\t\"Patient 30 week0\"\t\"Patient 30 week12\""
## [3] "!Sample_geo_accession\t\"GSM4567526\"\t\"GSM4567527\"\t\"GSM4567528\"\t\"GSM4567529\"\t\"GSM4567530\"\t\"GSM4567531\"\t\"GSM4567532\"\t\"GSM4567533\"\t\"GSM4567534\"\t\"GSM4567535\"\t\"GSM4567536\"\t\"GSM4567537\"\t\"GSM4567538\"\t\"GSM4567539\"\t\"GSM4567540\"\t\"GSM4567541\"\t\"GSM4567542\"\t\"GSM4567543\"\t\"GSM4567544\"\t\"GSM4567545\"\t\"GSM4567546\"\t\"GSM4567547\"\t\"GSM4567548\"\t\"GSM4567549\"\t\"GSM4567550\"\t\"GSM4567551\"\t\"GSM4567552\"\t\"GSM4567553\"\t\"GSM4567554\"\t\"GSM4567555\"\t\"GSM4567556\"\t\"GSM4567557\"\t\"GSM4567558\"\t\"GSM4567559\"\t\"GSM4567560\"\t\"GSM4567561\"\t\"GSM4567562\"\t\"GSM4567563\"\t\"GSM4567564\"\t\"GSM4567565\"\t\"GSM4567566\"\t\"GSM4567567\"\t\"GSM4567568\"\t\"GSM4567569\"\t\"GSM4567570\"\t\"GSM4567571\"\t\"GSM4567572\"\t\"GSM4567573\"\t\"GSM4567574\"\t\"GSM4567575\"\t\"GSM4567576\"\t\"GSM4567577\"\t\"GSM4567578\"\t\"GSM4567579\"\t\"GSM4567580\"\t\"GSM4567581\"\t\"GSM4567582\"\t\"GSM4567583\"\t\"GSM4567584\"\t\"GSM4567585\"\t\"GSM4567586\"\t\"GSM4567587\"\t\"GSM4567588\"\t\"GSM4567589\"\t\"GSM4567590\"\t\"GSM4567591\"\t\"GSM4567592\"\t\"GSM4567593\"\t\"GSM4567594\"\t\"GSM4567595\"\t\"GSM4567596\"\t\"GSM4567597\"\t\"GSM4567598\"\t\"GSM4567599\"\t\"GSM4567600\"\t\"GSM4567601\""
## [4] "!Sample_source_name_ch1\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\""
## [5] "!Sample_organism_ch1\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\""
## [6] "!Sample_characteristics_ch1\t\"age: 78\"\t\"age: 78\"\t\"age: 69\"\t\"age: 69\"\t\"age: 50\"\t\"age: 50\"\t\"age: 51\"\t\"age: 51\"\t\"age: 54\"\t\"age: 54\"\t\"age: 59\"\t\"age: 59\"\t\"age: 61\"\t\"age: 61\"\t\"age: 62\"\t\"age: 62\"\t\"age: 70\"\t\"age: 70\"\t\"age: 75\"\t\"age: 75\"\t\"age: 35\"\t\"age: 35\"\t\"age: 78\"\t\"age: 78\"\t\"age: 56\"\t\"age: 56\"\t\"age: 68\"\t\"age: 68\"\t\"age: 56\"\t\"age: 56\"\t\"age: 55\"\t\"age: 55\"\t\"age: 74\"\t\"age: 74\"\t\"age: 44\"\t\"age: 44\"\t\"age: 72\"\t\"age: 72\"\t\"age: 47\"\t\"age: 47\"\t\"age: 27\"\t\"age: 27\"\t\"age: 68\"\t\"age: 68\"\t\"age: 66\"\t\"age: 66\"\t\"age: 46\"\t\"age: 46\"\t\"age: 45\"\t\"age: 45\"\t\"age: 71\"\t\"age: 71\"\t\"age: 69\"\t\"age: 69\"\t\"age: 80\"\t\"age: 80\"\t\"age: 55\"\t\"age: 55\"\t\"age: 60\"\t\"age: 60\"\t\"age: 68\"\t\"age: 68\"\t\"age: 37\"\t\"age: 37\"\t\"age: 47\"\t\"age: 47\"\t\"age: 43\"\t\"age: 43\"\t\"age: 58\"\t\"age: 58\"\t\"age: 76\"\t\"age: 76\"\t\"age: 21\"\t\"age: 21\"\t\"age: 63\"\t\"age: 63\""
Info2 <- Info[c(2:3,6:8)]
head(Info2)
## [1] "!Sample_title\t\"Patient 1 week0\"\t\"Patient 1 week12\"\t\"Patient 2 week0\"\t\"Patient 2 week12\"\t\"Patient 15 week0\"\t\"Patient 15 week12\"\t\"Patient 24 week0\"\t\"Patient 24 week12\"\t\"Patient 16 week0\"\t\"Patient 16 week12\"\t\"Patient 13 week0\"\t\"Patient 13 week12\"\t\"Patient 22 week0\"\t\"Patient 22 week12\"\t\"Patient 31 week0\"\t\"Patient 31 week12\"\t\"Patient 3 week0\"\t\"Patient 3 week12\"\t\"Patient 18 week0\"\t\"Patient 18 week12\"\t\"Patient 23 week0\"\t\"Patient 23 week12\"\t\"Patient 21 week0\"\t\"Patient 21 week12\"\t\"Patient 25 week0\"\t\"Patient 25 week12\"\t\"Patient 26 week0\"\t\"Patient 26 week12\"\t\"Patient 27 week0\"\t\"Patient 27 week12\"\t\"Patient 20 week0\"\t\"Patient 20 week12\"\t\"Patient 34 week0\"\t\"Patient 34 week12\"\t\"Patient 14 week0\"\t\"Patient 14 week12\"\t\"Patient 17 week0\"\t\"Patient 17 week12\"\t\"Patient 19 week0\"\t\"Patient 19 week12\"\t\"Patient 4 week0\"\t\"Patient 4 week12\"\t\"Patient 35 week0\"\t\"Patient 35 week12\"\t\"Patient 28 week0\"\t\"Patient 28 week12\"\t\"Patient 12 week0\"\t\"Patient 12 week12\"\t\"Patient 8 week0\"\t\"Patient 8 week12\"\t\"Patient 5 week0\"\t\"Patient 5 week12\"\t\"Patient 38 week0\"\t\"Patient 38 week12\"\t\"Patient 10 week0\"\t\"Patient 10 week12\"\t\"Patient 29 week0\"\t\"Patient 29 week12\"\t\"Patient 37 week0\"\t\"Patient 37 week12\"\t\"Patient 6 week0\"\t\"Patient 6 week12\"\t\"Patient 36 week0\"\t\"Patient 36 week12\"\t\"Patient 9 week0\"\t\"Patient 9 week12\"\t\"Patient 32 week0\"\t\"Patient 32 week12\"\t\"Patient 7 week0\"\t\"Patient 7 week12\"\t\"Patient 33 week0\"\t\"Patient 33 week12\"\t\"Patient 11 week0\"\t\"Patient 11 week12\"\t\"Patient 30 week0\"\t\"Patient 30 week12\""
## [2] "!Sample_geo_accession\t\"GSM4567526\"\t\"GSM4567527\"\t\"GSM4567528\"\t\"GSM4567529\"\t\"GSM4567530\"\t\"GSM4567531\"\t\"GSM4567532\"\t\"GSM4567533\"\t\"GSM4567534\"\t\"GSM4567535\"\t\"GSM4567536\"\t\"GSM4567537\"\t\"GSM4567538\"\t\"GSM4567539\"\t\"GSM4567540\"\t\"GSM4567541\"\t\"GSM4567542\"\t\"GSM4567543\"\t\"GSM4567544\"\t\"GSM4567545\"\t\"GSM4567546\"\t\"GSM4567547\"\t\"GSM4567548\"\t\"GSM4567549\"\t\"GSM4567550\"\t\"GSM4567551\"\t\"GSM4567552\"\t\"GSM4567553\"\t\"GSM4567554\"\t\"GSM4567555\"\t\"GSM4567556\"\t\"GSM4567557\"\t\"GSM4567558\"\t\"GSM4567559\"\t\"GSM4567560\"\t\"GSM4567561\"\t\"GSM4567562\"\t\"GSM4567563\"\t\"GSM4567564\"\t\"GSM4567565\"\t\"GSM4567566\"\t\"GSM4567567\"\t\"GSM4567568\"\t\"GSM4567569\"\t\"GSM4567570\"\t\"GSM4567571\"\t\"GSM4567572\"\t\"GSM4567573\"\t\"GSM4567574\"\t\"GSM4567575\"\t\"GSM4567576\"\t\"GSM4567577\"\t\"GSM4567578\"\t\"GSM4567579\"\t\"GSM4567580\"\t\"GSM4567581\"\t\"GSM4567582\"\t\"GSM4567583\"\t\"GSM4567584\"\t\"GSM4567585\"\t\"GSM4567586\"\t\"GSM4567587\"\t\"GSM4567588\"\t\"GSM4567589\"\t\"GSM4567590\"\t\"GSM4567591\"\t\"GSM4567592\"\t\"GSM4567593\"\t\"GSM4567594\"\t\"GSM4567595\"\t\"GSM4567596\"\t\"GSM4567597\"\t\"GSM4567598\"\t\"GSM4567599\"\t\"GSM4567600\"\t\"GSM4567601\""
## [3] "!Sample_characteristics_ch1\t\"age: 78\"\t\"age: 78\"\t\"age: 69\"\t\"age: 69\"\t\"age: 50\"\t\"age: 50\"\t\"age: 51\"\t\"age: 51\"\t\"age: 54\"\t\"age: 54\"\t\"age: 59\"\t\"age: 59\"\t\"age: 61\"\t\"age: 61\"\t\"age: 62\"\t\"age: 62\"\t\"age: 70\"\t\"age: 70\"\t\"age: 75\"\t\"age: 75\"\t\"age: 35\"\t\"age: 35\"\t\"age: 78\"\t\"age: 78\"\t\"age: 56\"\t\"age: 56\"\t\"age: 68\"\t\"age: 68\"\t\"age: 56\"\t\"age: 56\"\t\"age: 55\"\t\"age: 55\"\t\"age: 74\"\t\"age: 74\"\t\"age: 44\"\t\"age: 44\"\t\"age: 72\"\t\"age: 72\"\t\"age: 47\"\t\"age: 47\"\t\"age: 27\"\t\"age: 27\"\t\"age: 68\"\t\"age: 68\"\t\"age: 66\"\t\"age: 66\"\t\"age: 46\"\t\"age: 46\"\t\"age: 45\"\t\"age: 45\"\t\"age: 71\"\t\"age: 71\"\t\"age: 69\"\t\"age: 69\"\t\"age: 80\"\t\"age: 80\"\t\"age: 55\"\t\"age: 55\"\t\"age: 60\"\t\"age: 60\"\t\"age: 68\"\t\"age: 68\"\t\"age: 37\"\t\"age: 37\"\t\"age: 47\"\t\"age: 47\"\t\"age: 43\"\t\"age: 43\"\t\"age: 58\"\t\"age: 58\"\t\"age: 76\"\t\"age: 76\"\t\"age: 21\"\t\"age: 21\"\t\"age: 63\"\t\"age: 63\""
## [4] "!Sample_characteristics_ch1\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: M\"\t\"gender: M\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: M\"\t\"gender: M\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: M\"\t\"gender: M\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: M\"\t\"gender: M\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: M\"\t\"gender: M\"\t\"gender: F\"\t\"gender: F\"\t\"gender: M\"\t\"gender: M\""
## [5] "!Sample_characteristics_ch1\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\""
Info3 <- gsub('\\t',' ', Info2, perl=T)
Info4 <- gsub('\\\"',' ',Info3, perl=T)
Info5 <- gsub('[!]','',Info4)
Info6 <- gsub('Sample_characteristics_ch1 ','',Info5)
Info6a <- gsub('Sample_title ','',Info6)
Info6b <- gsub('Sample_geo_accession ','',Info6a)
Info7 <- trimws(Info6b,which='right', whitespace='[ ]')
head(Info7)
## [1] "Patient 1 week0 Patient 1 week12 Patient 2 week0 Patient 2 week12 Patient 15 week0 Patient 15 week12 Patient 24 week0 Patient 24 week12 Patient 16 week0 Patient 16 week12 Patient 13 week0 Patient 13 week12 Patient 22 week0 Patient 22 week12 Patient 31 week0 Patient 31 week12 Patient 3 week0 Patient 3 week12 Patient 18 week0 Patient 18 week12 Patient 23 week0 Patient 23 week12 Patient 21 week0 Patient 21 week12 Patient 25 week0 Patient 25 week12 Patient 26 week0 Patient 26 week12 Patient 27 week0 Patient 27 week12 Patient 20 week0 Patient 20 week12 Patient 34 week0 Patient 34 week12 Patient 14 week0 Patient 14 week12 Patient 17 week0 Patient 17 week12 Patient 19 week0 Patient 19 week12 Patient 4 week0 Patient 4 week12 Patient 35 week0 Patient 35 week12 Patient 28 week0 Patient 28 week12 Patient 12 week0 Patient 12 week12 Patient 8 week0 Patient 8 week12 Patient 5 week0 Patient 5 week12 Patient 38 week0 Patient 38 week12 Patient 10 week0 Patient 10 week12 Patient 29 week0 Patient 29 week12 Patient 37 week0 Patient 37 week12 Patient 6 week0 Patient 6 week12 Patient 36 week0 Patient 36 week12 Patient 9 week0 Patient 9 week12 Patient 32 week0 Patient 32 week12 Patient 7 week0 Patient 7 week12 Patient 33 week0 Patient 33 week12 Patient 11 week0 Patient 11 week12 Patient 30 week0 Patient 30 week12"
## [2] "GSM4567526 GSM4567527 GSM4567528 GSM4567529 GSM4567530 GSM4567531 GSM4567532 GSM4567533 GSM4567534 GSM4567535 GSM4567536 GSM4567537 GSM4567538 GSM4567539 GSM4567540 GSM4567541 GSM4567542 GSM4567543 GSM4567544 GSM4567545 GSM4567546 GSM4567547 GSM4567548 GSM4567549 GSM4567550 GSM4567551 GSM4567552 GSM4567553 GSM4567554 GSM4567555 GSM4567556 GSM4567557 GSM4567558 GSM4567559 GSM4567560 GSM4567561 GSM4567562 GSM4567563 GSM4567564 GSM4567565 GSM4567566 GSM4567567 GSM4567568 GSM4567569 GSM4567570 GSM4567571 GSM4567572 GSM4567573 GSM4567574 GSM4567575 GSM4567576 GSM4567577 GSM4567578 GSM4567579 GSM4567580 GSM4567581 GSM4567582 GSM4567583 GSM4567584 GSM4567585 GSM4567586 GSM4567587 GSM4567588 GSM4567589 GSM4567590 GSM4567591 GSM4567592 GSM4567593 GSM4567594 GSM4567595 GSM4567596 GSM4567597 GSM4567598 GSM4567599 GSM4567600 GSM4567601"
## [3] "age: 78 age: 78 age: 69 age: 69 age: 50 age: 50 age: 51 age: 51 age: 54 age: 54 age: 59 age: 59 age: 61 age: 61 age: 62 age: 62 age: 70 age: 70 age: 75 age: 75 age: 35 age: 35 age: 78 age: 78 age: 56 age: 56 age: 68 age: 68 age: 56 age: 56 age: 55 age: 55 age: 74 age: 74 age: 44 age: 44 age: 72 age: 72 age: 47 age: 47 age: 27 age: 27 age: 68 age: 68 age: 66 age: 66 age: 46 age: 46 age: 45 age: 45 age: 71 age: 71 age: 69 age: 69 age: 80 age: 80 age: 55 age: 55 age: 60 age: 60 age: 68 age: 68 age: 37 age: 37 age: 47 age: 47 age: 43 age: 43 age: 58 age: 58 age: 76 age: 76 age: 21 age: 21 age: 63 age: 63"
## [4] "gender: F gender: F gender: F gender: F gender: F gender: F gender: M gender: M gender: F gender: F gender: F gender: F gender: F gender: F gender: M gender: M gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: M gender: M gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: M gender: M gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: F gender: M gender: M gender: F gender: F gender: M gender: M"
## [5] "treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept treatment: control treatment: abatacept"
if (file.exists('patient.csv')){
file.remove('patient.csv')
}
## [1] TRUE
patient <- strsplit(Info7[1], split=' ')
for (i in patient){
write.table(i,file='patient.csv',row.names=F,col.names=F)
}
Patient <- read.csv('patient.csv', sep=',',header=F)
colnames(Patient) <- 'patient'
Patient
## patient
## 1 Patient 1 week0
## 2 Patient 1 week12
## 3 Patient 2 week0
## 4 Patient 2 week12
## 5 Patient 15 week0
## 6 Patient 15 week12
## 7 Patient 24 week0
## 8 Patient 24 week12
## 9 Patient 16 week0
## 10 Patient 16 week12
## 11 Patient 13 week0
## 12 Patient 13 week12
## 13 Patient 22 week0
## 14 Patient 22 week12
## 15 Patient 31 week0
## 16 Patient 31 week12
## 17 Patient 3 week0
## 18 Patient 3 week12
## 19 Patient 18 week0
## 20 Patient 18 week12
## 21 Patient 23 week0
## 22 Patient 23 week12
## 23 Patient 21 week0
## 24 Patient 21 week12
## 25 Patient 25 week0
## 26 Patient 25 week12
## 27 Patient 26 week0
## 28 Patient 26 week12
## 29 Patient 27 week0
## 30 Patient 27 week12
## 31 Patient 20 week0
## 32 Patient 20 week12
## 33 Patient 34 week0
## 34 Patient 34 week12
## 35 Patient 14 week0
## 36 Patient 14 week12
## 37 Patient 17 week0
## 38 Patient 17 week12
## 39 Patient 19 week0
## 40 Patient 19 week12
## 41 Patient 4 week0
## 42 Patient 4 week12
## 43 Patient 35 week0
## 44 Patient 35 week12
## 45 Patient 28 week0
## 46 Patient 28 week12
## 47 Patient 12 week0
## 48 Patient 12 week12
## 49 Patient 8 week0
## 50 Patient 8 week12
## 51 Patient 5 week0
## 52 Patient 5 week12
## 53 Patient 38 week0
## 54 Patient 38 week12
## 55 Patient 10 week0
## 56 Patient 10 week12
## 57 Patient 29 week0
## 58 Patient 29 week12
## 59 Patient 37 week0
## 60 Patient 37 week12
## 61 Patient 6 week0
## 62 Patient 6 week12
## 63 Patient 36 week0
## 64 Patient 36 week12
## 65 Patient 9 week0
## 66 Patient 9 week12
## 67 Patient 32 week0
## 68 Patient 32 week12
## 69 Patient 7 week0
## 70 Patient 7 week12
## 71 Patient 33 week0
## 72 Patient 33 week12
## 73 Patient 11 week0
## 74 Patient 11 week12
## 75 Patient 30 week0
## 76 Patient 30 week12
if (file.exists('sample.csv')){
file.remove('sample.csv')
}
## [1] TRUE
sample <- strsplit(Info7[2], split=' ')
for (i in sample){
write.table(i,file='sample.csv',row.names=F,col.names=F)
}
Sample <- read.csv('sample.csv', sep=',',header=F)
colnames(Sample) <- 'sample'
Sample
## sample
## 1 GSM4567526
## 2 GSM4567527
## 3 GSM4567528
## 4 GSM4567529
## 5 GSM4567530
## 6 GSM4567531
## 7 GSM4567532
## 8 GSM4567533
## 9 GSM4567534
## 10 GSM4567535
## 11 GSM4567536
## 12 GSM4567537
## 13 GSM4567538
## 14 GSM4567539
## 15 GSM4567540
## 16 GSM4567541
## 17 GSM4567542
## 18 GSM4567543
## 19 GSM4567544
## 20 GSM4567545
## 21 GSM4567546
## 22 GSM4567547
## 23 GSM4567548
## 24 GSM4567549
## 25 GSM4567550
## 26 GSM4567551
## 27 GSM4567552
## 28 GSM4567553
## 29 GSM4567554
## 30 GSM4567555
## 31 GSM4567556
## 32 GSM4567557
## 33 GSM4567558
## 34 GSM4567559
## 35 GSM4567560
## 36 GSM4567561
## 37 GSM4567562
## 38 GSM4567563
## 39 GSM4567564
## 40 GSM4567565
## 41 GSM4567566
## 42 GSM4567567
## 43 GSM4567568
## 44 GSM4567569
## 45 GSM4567570
## 46 GSM4567571
## 47 GSM4567572
## 48 GSM4567573
## 49 GSM4567574
## 50 GSM4567575
## 51 GSM4567576
## 52 GSM4567577
## 53 GSM4567578
## 54 GSM4567579
## 55 GSM4567580
## 56 GSM4567581
## 57 GSM4567582
## 58 GSM4567583
## 59 GSM4567584
## 60 GSM4567585
## 61 GSM4567586
## 62 GSM4567587
## 63 GSM4567588
## 64 GSM4567589
## 65 GSM4567590
## 66 GSM4567591
## 67 GSM4567592
## 68 GSM4567593
## 69 GSM4567594
## 70 GSM4567595
## 71 GSM4567596
## 72 GSM4567597
## 73 GSM4567598
## 74 GSM4567599
## 75 GSM4567600
## 76 GSM4567601
if (file.exists('age.csv')){
file.remove('age.csv')
}
## [1] TRUE
age <- strsplit(Info7[3],split='age: ')
for (i in age){
write.table(i,file='age.csv',append=T, row.names=F, col.names = F)
}
Age <- read.csv('age.csv', header=F, sep=',')
colnames(Age) <- 'Age'
Age
## Age
## 1 78
## 2 78
## 3 69
## 4 69
## 5 50
## 6 50
## 7 51
## 8 51
## 9 54
## 10 54
## 11 59
## 12 59
## 13 61
## 14 61
## 15 62
## 16 62
## 17 70
## 18 70
## 19 75
## 20 75
## 21 35
## 22 35
## 23 78
## 24 78
## 25 56
## 26 56
## 27 68
## 28 68
## 29 56
## 30 56
## 31 55
## 32 55
## 33 74
## 34 74
## 35 44
## 36 44
## 37 72
## 38 72
## 39 47
## 40 47
## 41 27
## 42 27
## 43 68
## 44 68
## 45 66
## 46 66
## 47 46
## 48 46
## 49 45
## 50 45
## 51 71
## 52 71
## 53 69
## 54 69
## 55 80
## 56 80
## 57 55
## 58 55
## 59 60
## 60 60
## 61 68
## 62 68
## 63 37
## 64 37
## 65 47
## 66 47
## 67 43
## 68 43
## 69 58
## 70 58
## 71 76
## 72 76
## 73 21
## 74 21
## 75 63
## 76 63
if (file.exists('gender.csv')){
file.remove('gender.csv')
}
## [1] TRUE
gender <- strsplit(Info7[4], split='gender: ')
for (i in gender){
write.table(i,file='gender.csv',row.names=F, col.names=F)
}
Gender= read.csv('gender.csv',sep=',',header=F)
colnames(Gender) <- 'gender'
Gender
## gender
## 1 F
## 2 F
## 3 F
## 4 F
## 5 F
## 6 F
## 7 M
## 8 M
## 9 F
## 10 F
## 11 F
## 12 F
## 13 F
## 14 F
## 15 M
## 16 M
## 17 F
## 18 F
## 19 F
## 20 F
## 21 F
## 22 F
## 23 F
## 24 F
## 25 M
## 26 M
## 27 F
## 28 F
## 29 F
## 30 F
## 31 F
## 32 F
## 33 F
## 34 F
## 35 F
## 36 F
## 37 F
## 38 F
## 39 F
## 40 F
## 41 F
## 42 F
## 43 F
## 44 F
## 45 F
## 46 F
## 47 F
## 48 F
## 49 F
## 50 F
## 51 F
## 52 F
## 53 F
## 54 F
## 55 F
## 56 F
## 57 F
## 58 F
## 59 M
## 60 M
## 61 F
## 62 F
## 63 F
## 64 F
## 65 F
## 66 F
## 67 F
## 68 F
## 69 F
## 70 F
## 71 M
## 72 M
## 73 F
## 74 F
## 75 M
## 76 M
if (file.exists('treatment.csv')){
file.remove('treatment.csv')
}
## [1] TRUE
treatment <- strsplit(Info7[5], split='treatment: ')
for (i in treatment){
write.table(i,file='treatment.csv',row.names=F,col.names=F)
}
Treatment <- read.csv('treatment.csv', sep=',',header=F)
colnames(Treatment) <- 'treatment'
Treatment
## treatment
## 1 control
## 2 abatacept
## 3 control
## 4 abatacept
## 5 control
## 6 abatacept
## 7 control
## 8 abatacept
## 9 control
## 10 abatacept
## 11 control
## 12 abatacept
## 13 control
## 14 abatacept
## 15 control
## 16 abatacept
## 17 control
## 18 abatacept
## 19 control
## 20 abatacept
## 21 control
## 22 abatacept
## 23 control
## 24 abatacept
## 25 control
## 26 abatacept
## 27 control
## 28 abatacept
## 29 control
## 30 abatacept
## 31 control
## 32 abatacept
## 33 control
## 34 abatacept
## 35 control
## 36 abatacept
## 37 control
## 38 abatacept
## 39 control
## 40 abatacept
## 41 control
## 42 abatacept
## 43 control
## 44 abatacept
## 45 control
## 46 abatacept
## 47 control
## 48 abatacept
## 49 control
## 50 abatacept
## 51 control
## 52 abatacept
## 53 control
## 54 abatacept
## 55 control
## 56 abatacept
## 57 control
## 58 abatacept
## 59 control
## 60 abatacept
## 61 control
## 62 abatacept
## 63 control
## 64 abatacept
## 65 control
## 66 abatacept
## 67 control
## 68 abatacept
## 69 control
## 70 abatacept
## 71 control
## 72 abatacept
## 73 control
## 74 abatacept
## 75 control
## 76 abatacept
demographics <- cbind(Patient,Sample,Age,Gender,Treatment)
demographics
## patient sample Age gender treatment
## 1 Patient 1 week0 GSM4567526 78 F control
## 2 Patient 1 week12 GSM4567527 78 F abatacept
## 3 Patient 2 week0 GSM4567528 69 F control
## 4 Patient 2 week12 GSM4567529 69 F abatacept
## 5 Patient 15 week0 GSM4567530 50 F control
## 6 Patient 15 week12 GSM4567531 50 F abatacept
## 7 Patient 24 week0 GSM4567532 51 M control
## 8 Patient 24 week12 GSM4567533 51 M abatacept
## 9 Patient 16 week0 GSM4567534 54 F control
## 10 Patient 16 week12 GSM4567535 54 F abatacept
## 11 Patient 13 week0 GSM4567536 59 F control
## 12 Patient 13 week12 GSM4567537 59 F abatacept
## 13 Patient 22 week0 GSM4567538 61 F control
## 14 Patient 22 week12 GSM4567539 61 F abatacept
## 15 Patient 31 week0 GSM4567540 62 M control
## 16 Patient 31 week12 GSM4567541 62 M abatacept
## 17 Patient 3 week0 GSM4567542 70 F control
## 18 Patient 3 week12 GSM4567543 70 F abatacept
## 19 Patient 18 week0 GSM4567544 75 F control
## 20 Patient 18 week12 GSM4567545 75 F abatacept
## 21 Patient 23 week0 GSM4567546 35 F control
## 22 Patient 23 week12 GSM4567547 35 F abatacept
## 23 Patient 21 week0 GSM4567548 78 F control
## 24 Patient 21 week12 GSM4567549 78 F abatacept
## 25 Patient 25 week0 GSM4567550 56 M control
## 26 Patient 25 week12 GSM4567551 56 M abatacept
## 27 Patient 26 week0 GSM4567552 68 F control
## 28 Patient 26 week12 GSM4567553 68 F abatacept
## 29 Patient 27 week0 GSM4567554 56 F control
## 30 Patient 27 week12 GSM4567555 56 F abatacept
## 31 Patient 20 week0 GSM4567556 55 F control
## 32 Patient 20 week12 GSM4567557 55 F abatacept
## 33 Patient 34 week0 GSM4567558 74 F control
## 34 Patient 34 week12 GSM4567559 74 F abatacept
## 35 Patient 14 week0 GSM4567560 44 F control
## 36 Patient 14 week12 GSM4567561 44 F abatacept
## 37 Patient 17 week0 GSM4567562 72 F control
## 38 Patient 17 week12 GSM4567563 72 F abatacept
## 39 Patient 19 week0 GSM4567564 47 F control
## 40 Patient 19 week12 GSM4567565 47 F abatacept
## 41 Patient 4 week0 GSM4567566 27 F control
## 42 Patient 4 week12 GSM4567567 27 F abatacept
## 43 Patient 35 week0 GSM4567568 68 F control
## 44 Patient 35 week12 GSM4567569 68 F abatacept
## 45 Patient 28 week0 GSM4567570 66 F control
## 46 Patient 28 week12 GSM4567571 66 F abatacept
## 47 Patient 12 week0 GSM4567572 46 F control
## 48 Patient 12 week12 GSM4567573 46 F abatacept
## 49 Patient 8 week0 GSM4567574 45 F control
## 50 Patient 8 week12 GSM4567575 45 F abatacept
## 51 Patient 5 week0 GSM4567576 71 F control
## 52 Patient 5 week12 GSM4567577 71 F abatacept
## 53 Patient 38 week0 GSM4567578 69 F control
## 54 Patient 38 week12 GSM4567579 69 F abatacept
## 55 Patient 10 week0 GSM4567580 80 F control
## 56 Patient 10 week12 GSM4567581 80 F abatacept
## 57 Patient 29 week0 GSM4567582 55 F control
## 58 Patient 29 week12 GSM4567583 55 F abatacept
## 59 Patient 37 week0 GSM4567584 60 M control
## 60 Patient 37 week12 GSM4567585 60 M abatacept
## 61 Patient 6 week0 GSM4567586 68 F control
## 62 Patient 6 week12 GSM4567587 68 F abatacept
## 63 Patient 36 week0 GSM4567588 37 F control
## 64 Patient 36 week12 GSM4567589 37 F abatacept
## 65 Patient 9 week0 GSM4567590 47 F control
## 66 Patient 9 week12 GSM4567591 47 F abatacept
## 67 Patient 32 week0 GSM4567592 43 F control
## 68 Patient 32 week12 GSM4567593 43 F abatacept
## 69 Patient 7 week0 GSM4567594 58 F control
## 70 Patient 7 week12 GSM4567595 58 F abatacept
## 71 Patient 33 week0 GSM4567596 76 M control
## 72 Patient 33 week12 GSM4567597 76 M abatacept
## 73 Patient 11 week0 GSM4567598 21 F control
## 74 Patient 11 week12 GSM4567599 21 F abatacept
## 75 Patient 30 week0 GSM4567600 63 M control
## 76 Patient 30 week12 GSM4567601 63 M abatacept
data <- read.csv('GSE151161_Raw_counts.csv',sep=',', header=T, na.strings=c('',' ','NA'),
stringsAsFactors = F)
colnames(data)[1] <- 'gene'
colnames(data)
## [1] "gene" "P1_w12" "P1_w0" "P2_w12" "P2_w0" "P3_w0" "P3_w12"
## [8] "P4_w0" "P4_w12" "P5_w0" "P5_w12" "P6_w12" "P6_w0" "P7_w12"
## [15] "P7_w0" "P8_w12" "P8_w0" "P9_w12" "P9_w0" "P10_w12" "P10_w0"
## [22] "P11_w12" "P11_w0" "P12_w0" "P12_w12" "P13_w12" "P13_w0" "P14_w12"
## [29] "P14_w0" "P15_w12" "P15_w0" "P16_w0" "P16_w12" "P17_w0" "P17_w12"
## [36] "P18_w12" "P18_w0" "P19_w12" "P19_w0" "P20_w12" "P20_w0" "P21_w0"
## [43] "P21_w12" "P22_w0" "P22_w12" "P23_w12" "P23_w0" "P24_w0" "P24_w12"
## [50] "P25_w12" "P25_w0" "P26_w12" "P26_w0" "P27_w0" "P27_w12" "P28_w0"
## [57] "P28_w12" "P29_w12" "P29_w0" "P30_w12" "P30_w0" "P31_w0" "P31_w12"
## [64] "P32_w12" "P32_w0" "P33_w12" "P33_w0" "P34_w12" "P34_w0" "P35_w12"
## [71] "P35_w0" "P36_w12" "P36_w0" "P37_w0" "P37_w12" "P38_w0" "P38_w12"
names <- as.data.frame(colnames(data)[2:77])
colnames(names) <- 'sample'
names
## sample
## 1 P1_w12
## 2 P1_w0
## 3 P2_w12
## 4 P2_w0
## 5 P3_w0
## 6 P3_w12
## 7 P4_w0
## 8 P4_w12
## 9 P5_w0
## 10 P5_w12
## 11 P6_w12
## 12 P6_w0
## 13 P7_w12
## 14 P7_w0
## 15 P8_w12
## 16 P8_w0
## 17 P9_w12
## 18 P9_w0
## 19 P10_w12
## 20 P10_w0
## 21 P11_w12
## 22 P11_w0
## 23 P12_w0
## 24 P12_w12
## 25 P13_w12
## 26 P13_w0
## 27 P14_w12
## 28 P14_w0
## 29 P15_w12
## 30 P15_w0
## 31 P16_w0
## 32 P16_w12
## 33 P17_w0
## 34 P17_w12
## 35 P18_w12
## 36 P18_w0
## 37 P19_w12
## 38 P19_w0
## 39 P20_w12
## 40 P20_w0
## 41 P21_w0
## 42 P21_w12
## 43 P22_w0
## 44 P22_w12
## 45 P23_w12
## 46 P23_w0
## 47 P24_w0
## 48 P24_w12
## 49 P25_w12
## 50 P25_w0
## 51 P26_w12
## 52 P26_w0
## 53 P27_w0
## 54 P27_w12
## 55 P28_w0
## 56 P28_w12
## 57 P29_w12
## 58 P29_w0
## 59 P30_w12
## 60 P30_w0
## 61 P31_w0
## 62 P31_w12
## 63 P32_w12
## 64 P32_w0
## 65 P33_w12
## 66 P33_w0
## 67 P34_w12
## 68 P34_w0
## 69 P35_w12
## 70 P35_w0
## 71 P36_w12
## 72 P36_w0
## 73 P37_w0
## 74 P37_w12
## 75 P38_w0
## 76 P38_w12
demographics$patient <- gsub('Patient ','P',demographics$patient)
demographics$patient <- gsub(' week','_w',demographics$patient)
demographics$patient <- as.factor(paste(demographics$patient))
demographics$treatment <- as.character(demographics$treatment)
demographics$treatment <- trimws(demographics$treatment, which='right', whitespace=' ')
demographics$gender <- as.character(demographics$gender)
demographics$gender <- trimws(demographics$gender, which='right', whitespace=' ')
demographics$patient
## [1] P1_w0 P1_w12 P2_w0 P2_w12 P15_w0 P15_w12 P24_w0 P24_w12 P16_w0
## [10] P16_w12 P13_w0 P13_w12 P22_w0 P22_w12 P31_w0 P31_w12 P3_w0 P3_w12
## [19] P18_w0 P18_w12 P23_w0 P23_w12 P21_w0 P21_w12 P25_w0 P25_w12 P26_w0
## [28] P26_w12 P27_w0 P27_w12 P20_w0 P20_w12 P34_w0 P34_w12 P14_w0 P14_w12
## [37] P17_w0 P17_w12 P19_w0 P19_w12 P4_w0 P4_w12 P35_w0 P35_w12 P28_w0
## [46] P28_w12 P12_w0 P12_w12 P8_w0 P8_w12 P5_w0 P5_w12 P38_w0 P38_w12
## [55] P10_w0 P10_w12 P29_w0 P29_w12 P37_w0 P37_w12 P6_w0 P6_w12 P36_w0
## [64] P36_w12 P9_w0 P9_w12 P32_w0 P32_w12 P7_w0 P7_w12 P33_w0 P33_w12
## [73] P11_w0 P11_w12 P30_w0 P30_w12
## 76 Levels: P1_w0 P1_w12 P10_w0 P10_w12 P11_w0 P11_w12 P12_w0 P12_w12 ... P9_w12
names$sample
## [1] P1_w12 P1_w0 P2_w12 P2_w0 P3_w0 P3_w12 P4_w0 P4_w12 P5_w0
## [10] P5_w12 P6_w12 P6_w0 P7_w12 P7_w0 P8_w12 P8_w0 P9_w12 P9_w0
## [19] P10_w12 P10_w0 P11_w12 P11_w0 P12_w0 P12_w12 P13_w12 P13_w0 P14_w12
## [28] P14_w0 P15_w12 P15_w0 P16_w0 P16_w12 P17_w0 P17_w12 P18_w12 P18_w0
## [37] P19_w12 P19_w0 P20_w12 P20_w0 P21_w0 P21_w12 P22_w0 P22_w12 P23_w12
## [46] P23_w0 P24_w0 P24_w12 P25_w12 P25_w0 P26_w12 P26_w0 P27_w0 P27_w12
## [55] P28_w0 P28_w12 P29_w12 P29_w0 P30_w12 P30_w0 P31_w0 P31_w12 P32_w12
## [64] P32_w0 P33_w12 P33_w0 P34_w12 P34_w0 P35_w12 P35_w0 P36_w12 P36_w0
## [73] P37_w0 P37_w12 P38_w0 P38_w12
## 76 Levels: P1_w0 P1_w12 P10_w0 P10_w12 P11_w0 P11_w12 P12_w0 P12_w12 ... P9_w12
write.csv(demographics,'demographics.csv',row.names=FALSE)
demographics
## patient sample Age gender treatment
## 1 P1_w0 GSM4567526 78 F control
## 2 P1_w12 GSM4567527 78 F abatacept
## 3 P2_w0 GSM4567528 69 F control
## 4 P2_w12 GSM4567529 69 F abatacept
## 5 P15_w0 GSM4567530 50 F control
## 6 P15_w12 GSM4567531 50 F abatacept
## 7 P24_w0 GSM4567532 51 M control
## 8 P24_w12 GSM4567533 51 M abatacept
## 9 P16_w0 GSM4567534 54 F control
## 10 P16_w12 GSM4567535 54 F abatacept
## 11 P13_w0 GSM4567536 59 F control
## 12 P13_w12 GSM4567537 59 F abatacept
## 13 P22_w0 GSM4567538 61 F control
## 14 P22_w12 GSM4567539 61 F abatacept
## 15 P31_w0 GSM4567540 62 M control
## 16 P31_w12 GSM4567541 62 M abatacept
## 17 P3_w0 GSM4567542 70 F control
## 18 P3_w12 GSM4567543 70 F abatacept
## 19 P18_w0 GSM4567544 75 F control
## 20 P18_w12 GSM4567545 75 F abatacept
## 21 P23_w0 GSM4567546 35 F control
## 22 P23_w12 GSM4567547 35 F abatacept
## 23 P21_w0 GSM4567548 78 F control
## 24 P21_w12 GSM4567549 78 F abatacept
## 25 P25_w0 GSM4567550 56 M control
## 26 P25_w12 GSM4567551 56 M abatacept
## 27 P26_w0 GSM4567552 68 F control
## 28 P26_w12 GSM4567553 68 F abatacept
## 29 P27_w0 GSM4567554 56 F control
## 30 P27_w12 GSM4567555 56 F abatacept
## 31 P20_w0 GSM4567556 55 F control
## 32 P20_w12 GSM4567557 55 F abatacept
## 33 P34_w0 GSM4567558 74 F control
## 34 P34_w12 GSM4567559 74 F abatacept
## 35 P14_w0 GSM4567560 44 F control
## 36 P14_w12 GSM4567561 44 F abatacept
## 37 P17_w0 GSM4567562 72 F control
## 38 P17_w12 GSM4567563 72 F abatacept
## 39 P19_w0 GSM4567564 47 F control
## 40 P19_w12 GSM4567565 47 F abatacept
## 41 P4_w0 GSM4567566 27 F control
## 42 P4_w12 GSM4567567 27 F abatacept
## 43 P35_w0 GSM4567568 68 F control
## 44 P35_w12 GSM4567569 68 F abatacept
## 45 P28_w0 GSM4567570 66 F control
## 46 P28_w12 GSM4567571 66 F abatacept
## 47 P12_w0 GSM4567572 46 F control
## 48 P12_w12 GSM4567573 46 F abatacept
## 49 P8_w0 GSM4567574 45 F control
## 50 P8_w12 GSM4567575 45 F abatacept
## 51 P5_w0 GSM4567576 71 F control
## 52 P5_w12 GSM4567577 71 F abatacept
## 53 P38_w0 GSM4567578 69 F control
## 54 P38_w12 GSM4567579 69 F abatacept
## 55 P10_w0 GSM4567580 80 F control
## 56 P10_w12 GSM4567581 80 F abatacept
## 57 P29_w0 GSM4567582 55 F control
## 58 P29_w12 GSM4567583 55 F abatacept
## 59 P37_w0 GSM4567584 60 M control
## 60 P37_w12 GSM4567585 60 M abatacept
## 61 P6_w0 GSM4567586 68 F control
## 62 P6_w12 GSM4567587 68 F abatacept
## 63 P36_w0 GSM4567588 37 F control
## 64 P36_w12 GSM4567589 37 F abatacept
## 65 P9_w0 GSM4567590 47 F control
## 66 P9_w12 GSM4567591 47 F abatacept
## 67 P32_w0 GSM4567592 43 F control
## 68 P32_w12 GSM4567593 43 F abatacept
## 69 P7_w0 GSM4567594 58 F control
## 70 P7_w12 GSM4567595 58 F abatacept
## 71 P33_w0 GSM4567596 76 M control
## 72 P33_w12 GSM4567597 76 M abatacept
## 73 P11_w0 GSM4567598 21 F control
## 74 P11_w12 GSM4567599 21 F abatacept
## 75 P30_w0 GSM4567600 63 M control
## 76 P30_w12 GSM4567601 63 M abatacept
We can subset into control and treatment with our demographics data frame’s treatment column.
control <- subset(demographics, demographics$treatment =='control')
abatacept <- subset(demographics, demographics$treatment=='abatacept')
There are 38 samples in each group of control or abatacept.
Lets divide the groups into males and females.
femaleControl <- subset(control, control$gender=='F')
maleControl <- subset(control, control$gender=='M')
femaleTreatment <- subset(abatacept, abatacept$gender=='F')
maleTreatment <- subset(abatacept, abatacept$gender=='M')
There are 32 females and 6 males, totalling the number of samples in the control group and the treatment group because it was the same group of people’s blood 12 weeks later.
controlList <- control$patient
treatmentList <- abatacept$patient
femaleControlList <- femaleControl$patient
maleControlList <- maleControl$patient
femaleTreatmentList <- femaleTreatment$patient
maleTreatmentList <- maleTreatment$patient
gene <- data$gene
control2 <- colnames(data) %in% controlList
controlDF <- data[,control2]
controlDF$gene <- gene
abatacept2 <- colnames(data) %in% treatmentList
treatmentDF <- data[,abatacept2]
treatmentDF$gene <- gene
femaleControlList2 <- colnames(controlDF) %in% femaleControlList
femControlDF <- controlDF[,femaleControlList2]
femControlDF$gene <- gene
maleControlList2 <- colnames(controlDF) %in% maleControlList
maleControlDF <- controlDF[,maleControlList2]
maleControlDF$gene <- gene
femaleTreatmentList2 <- colnames(treatmentDF) %in% femaleTreatmentList
femTreatmentDF <- treatmentDF[,femaleTreatmentList2]
femTreatmentDF$gene <- gene
maleTreatmentList2 <- colnames(treatmentDF) %in% maleTreatmentList
maleTreatmentDF <- treatmentDF[,maleTreatmentList2]
maleTreatmentDF$gene <- gene
controlDF2 <- controlDF %>% group_by(gene) %>% summarise_at(vars('P1_w0':'P38_w0'),mean)
treatmentDF2 <- treatmentDF %>% group_by(gene) %>% summarise_at(vars('P1_w12':'P38_w12'),mean)
femTreatmentDF2 <- femTreatmentDF %>% group_by(gene) %>% summarise_at(vars('P1_w12':'P38_w12'),mean)
maleTreatmentDF2 <- maleTreatmentDF %>% group_by(gene) %>%
summarise_at(vars('P24_w12':'P37_w12'),mean)
femControlDF2 <- femControlDF %>% group_by(gene) %>% summarise_at(vars('P1_w0':'P38_w0'),mean)
maleControlDF2 <- maleControlDF %>% group_by(gene) %>% summarise_at(vars('P24_w0':'P37_w0'),mean)
gene <- controlDF2$gene
femControlDF3 <- femControlDF2[,-1]
maleControlDF3 <- maleControlDF2[,-1]
femTreatmentDF3 <- femTreatmentDF2[,-1]
maleTreatmentDF3 <- maleTreatmentDF2[,-1]
controlDF3 <- controlDF2[,-1]
treatmentDF3 <- treatmentDF2[,-1]
controlDF3$ctrl_mean <- apply(controlDF3, 1,mean, na.rm=T)
treatmentDF3$trt_mean <- apply(treatmentDF3, 1,mean, na.rm=T)
femTreatmentDF3$fem_trt_mean <- apply(femTreatmentDF3, 1,mean, na.rm=T)
maleTreatmentDF3$mal_trt_mean <- apply(maleTreatmentDF3, 1,mean, na.rm=T)
femControlDF3$fem_ctrl_mean <- apply(femControlDF3, 1,mean, na.rm=T)
maleControlDF3$mal_ctrl_mean <- apply(maleControlDF3, 1,mean, na.rm=T)
DATA_RA <- cbind(controlDF3,treatmentDF3,maleControlDF3,femControlDF3,
maleTreatmentDF3,femTreatmentDF3)
DATA_RA$gene <- gene
colnames(DATA_RA)
## [1] "P1_w0" "P2_w0" "P3_w0" "P4_w0"
## [5] "P5_w0" "P6_w0" "P7_w0" "P8_w0"
## [9] "P9_w0" "P10_w0" "P11_w0" "P12_w0"
## [13] "P13_w0" "P14_w0" "P15_w0" "P16_w0"
## [17] "P17_w0" "P18_w0" "P19_w0" "P20_w0"
## [21] "P21_w0" "P22_w0" "P23_w0" "P24_w0"
## [25] "P25_w0" "P26_w0" "P27_w0" "P28_w0"
## [29] "P29_w0" "P30_w0" "P31_w0" "P32_w0"
## [33] "P33_w0" "P34_w0" "P35_w0" "P36_w0"
## [37] "P37_w0" "P38_w0" "ctrl_mean" "P1_w12"
## [41] "P2_w12" "P3_w12" "P4_w12" "P5_w12"
## [45] "P6_w12" "P7_w12" "P8_w12" "P9_w12"
## [49] "P10_w12" "P11_w12" "P12_w12" "P13_w12"
## [53] "P14_w12" "P15_w12" "P16_w12" "P17_w12"
## [57] "P18_w12" "P19_w12" "P20_w12" "P21_w12"
## [61] "P22_w12" "P23_w12" "P24_w12" "P25_w12"
## [65] "P26_w12" "P27_w12" "P28_w12" "P29_w12"
## [69] "P30_w12" "P31_w12" "P32_w12" "P33_w12"
## [73] "P34_w12" "P35_w12" "P36_w12" "P37_w12"
## [77] "P38_w12" "trt_mean" "P24_w0" "P25_w0"
## [81] "P30_w0" "P31_w0" "P33_w0" "P37_w0"
## [85] "mal_ctrl_mean" "P1_w0" "P2_w0" "P3_w0"
## [89] "P4_w0" "P5_w0" "P6_w0" "P7_w0"
## [93] "P8_w0" "P9_w0" "P10_w0" "P11_w0"
## [97] "P12_w0" "P13_w0" "P14_w0" "P15_w0"
## [101] "P16_w0" "P17_w0" "P18_w0" "P19_w0"
## [105] "P20_w0" "P21_w0" "P22_w0" "P23_w0"
## [109] "P26_w0" "P27_w0" "P28_w0" "P29_w0"
## [113] "P32_w0" "P34_w0" "P35_w0" "P36_w0"
## [117] "P38_w0" "fem_ctrl_mean" "P24_w12" "P25_w12"
## [121] "P30_w12" "P31_w12" "P33_w12" "P37_w12"
## [125] "mal_trt_mean" "P1_w12" "P2_w12" "P3_w12"
## [129] "P4_w12" "P5_w12" "P6_w12" "P7_w12"
## [133] "P8_w12" "P9_w12" "P10_w12" "P11_w12"
## [137] "P12_w12" "P13_w12" "P14_w12" "P15_w12"
## [141] "P16_w12" "P17_w12" "P18_w12" "P19_w12"
## [145] "P20_w12" "P21_w12" "P22_w12" "P23_w12"
## [149] "P26_w12" "P27_w12" "P28_w12" "P29_w12"
## [153] "P32_w12" "P34_w12" "P35_w12" "P36_w12"
## [157] "P38_w12" "fem_trt_mean" "gene"
DATA_RA2 <- DATA_RA[,c(159,1:38,40:77,79:84,86:117,119:124,126:157,39,78,85,118,125,158)]
colnames(DATA_RA2)
## [1] "gene" "P1_w0" "P2_w0" "P3_w0"
## [5] "P4_w0" "P5_w0" "P6_w0" "P7_w0"
## [9] "P8_w0" "P9_w0" "P10_w0" "P11_w0"
## [13] "P12_w0" "P13_w0" "P14_w0" "P15_w0"
## [17] "P16_w0" "P17_w0" "P18_w0" "P19_w0"
## [21] "P20_w0" "P21_w0" "P22_w0" "P23_w0"
## [25] "P24_w0" "P25_w0" "P26_w0" "P27_w0"
## [29] "P28_w0" "P29_w0" "P30_w0" "P31_w0"
## [33] "P32_w0" "P33_w0" "P34_w0" "P35_w0"
## [37] "P36_w0" "P37_w0" "P38_w0" "P1_w12"
## [41] "P2_w12" "P3_w12" "P4_w12" "P5_w12"
## [45] "P6_w12" "P7_w12" "P8_w12" "P9_w12"
## [49] "P10_w12" "P11_w12" "P12_w12" "P13_w12"
## [53] "P14_w12" "P15_w12" "P16_w12" "P17_w12"
## [57] "P18_w12" "P19_w12" "P20_w12" "P21_w12"
## [61] "P22_w12" "P23_w12" "P24_w12" "P25_w12"
## [65] "P26_w12" "P27_w12" "P28_w12" "P29_w12"
## [69] "P30_w12" "P31_w12" "P32_w12" "P33_w12"
## [73] "P34_w12" "P35_w12" "P36_w12" "P37_w12"
## [77] "P38_w12" "P24_w0.1" "P25_w0.1" "P30_w0.1"
## [81] "P31_w0.1" "P33_w0.1" "P37_w0.1" "P1_w0.1"
## [85] "P2_w0.1" "P3_w0.1" "P4_w0.1" "P5_w0.1"
## [89] "P6_w0.1" "P7_w0.1" "P8_w0.1" "P9_w0.1"
## [93] "P10_w0.1" "P11_w0.1" "P12_w0.1" "P13_w0.1"
## [97] "P14_w0.1" "P15_w0.1" "P16_w0.1" "P17_w0.1"
## [101] "P18_w0.1" "P19_w0.1" "P20_w0.1" "P21_w0.1"
## [105] "P22_w0.1" "P23_w0.1" "P26_w0.1" "P27_w0.1"
## [109] "P28_w0.1" "P29_w0.1" "P32_w0.1" "P34_w0.1"
## [113] "P35_w0.1" "P36_w0.1" "P38_w0.1" "P24_w12.1"
## [117] "P25_w12.1" "P30_w12.1" "P31_w12.1" "P33_w12.1"
## [121] "P37_w12.1" "P1_w12.1" "P2_w12.1" "P3_w12.1"
## [125] "P4_w12.1" "P5_w12.1" "P6_w12.1" "P7_w12.1"
## [129] "P8_w12.1" "P9_w12.1" "P10_w12.1" "P11_w12.1"
## [133] "P12_w12.1" "P13_w12.1" "P14_w12.1" "P15_w12.1"
## [137] "P16_w12.1" "P17_w12.1" "P18_w12.1" "P19_w12.1"
## [141] "P20_w12.1" "P21_w12.1" "P22_w12.1" "P23_w12.1"
## [145] "P26_w12.1" "P27_w12.1" "P28_w12.1" "P29_w12.1"
## [149] "P32_w12.1" "P34_w12.1" "P35_w12.1" "P36_w12.1"
## [153] "P38_w12.1" "ctrl_mean" "trt_mean" "mal_ctrl_mean"
## [157] "fem_ctrl_mean" "mal_trt_mean" "fem_trt_mean"
DATA_RA2$trt_ctrl_FoldChange <- DATA_RA2$trt_mean/DATA_RA2$ctrl_mean
DATA_RA2$fem_trt_ctrl_FoldChange <- DATA_RA2$fem_trt_mean/DATA_RA2$fem_ctrl_mean
DATA_RA2$mal_trt_ctrl_FoldChange <- DATA_RA2$mal_trt_mean/DATA_RA2$mal_ctrl_mean
DATA_RA2$fem_mal_trt_FoldChange <- DATA_RA2$fem_trt_mean/DATA_RA2$mal_trt_mean
DATA_RA2$fem_mal_ctrl_FoldChange <- DATA_RA2$fem_ctrl_mean/DATA_RA2$mal_ctrl_mean
Lets add in age as well.
quantile(demographics$Age)
## 0% 25% 50% 75% 100%
## 21.0 47.0 59.5 69.0 80.0
Lets divide the data into those patients with RA who are younger than 50 and those older than 50.
less50 <- subset(demographics,demographics$Age < 50)
great50 <- subset(demographics,demographics$Age >=50)
gene <- controlDF$gene
less50b <- less50$patient
great50b <- great50$patient
less50c <- colnames(controlDF) %in% less50b
great50c <- colnames(controlDF) %in% great50b
less50ctrlDF <- controlDF[,less50c]
great50ctrlDF <- controlDF[,great50c]
less50trtDF <- treatmentDF[,less50c]
great50trtDF <- treatmentDF[,great50c]
less50ctrlDF$gene <- gene
less50trtDF$gene <- gene
great50ctrlDF$gene <- gene
great50trtDF$gene <- gene
less50ctrlDF2 <- less50ctrlDF %>% group_by(gene) %>% summarise_at(vars('P4_w0':'P36_w0'),mean)
less50trtDF2 <- less50trtDF %>% group_by(gene) %>% summarise_at(vars('P4_w12':'P36_w12'),mean)
great50ctrlDF2 <- great50ctrlDF %>% group_by(gene) %>% summarise_at(vars('P1_w0':'P38_w0'),mean)
great50trtDF2 <- great50trtDF %>% group_by(gene) %>% summarise_at(vars('P1_w12':'P38_w12'),mean)
gene <- great50trtDF2$gene
less50ctrlDF3 <- less50ctrlDF2[,-1]
less50trtDF3 <- less50trtDF2[,-1]
great50ctrlDF3 <- great50ctrlDF2[,-1]
great50trtDF3 <- great50trtDF2[,-1]
less50ctrlDF3$less50_ctrl_mean <- apply(less50ctrlDF3,1,mean)
less50trtDF3$less50_trt_mean <- apply(less50trtDF3,1,mean)
great50ctrlDF3$great50_ctrl_mean <- apply(great50ctrlDF3,1,mean)
great50trtDF3$great50_trt_mean <- apply(great50trtDF3,1,mean)
DATA_RA3 <- cbind(DATA_RA2, less50ctrlDF3, great50ctrlDF3, less50trtDF3,great50trtDF3)
colnames(DATA_RA3)
## [1] "gene" "P1_w0"
## [3] "P2_w0" "P3_w0"
## [5] "P4_w0" "P5_w0"
## [7] "P6_w0" "P7_w0"
## [9] "P8_w0" "P9_w0"
## [11] "P10_w0" "P11_w0"
## [13] "P12_w0" "P13_w0"
## [15] "P14_w0" "P15_w0"
## [17] "P16_w0" "P17_w0"
## [19] "P18_w0" "P19_w0"
## [21] "P20_w0" "P21_w0"
## [23] "P22_w0" "P23_w0"
## [25] "P24_w0" "P25_w0"
## [27] "P26_w0" "P27_w0"
## [29] "P28_w0" "P29_w0"
## [31] "P30_w0" "P31_w0"
## [33] "P32_w0" "P33_w0"
## [35] "P34_w0" "P35_w0"
## [37] "P36_w0" "P37_w0"
## [39] "P38_w0" "P1_w12"
## [41] "P2_w12" "P3_w12"
## [43] "P4_w12" "P5_w12"
## [45] "P6_w12" "P7_w12"
## [47] "P8_w12" "P9_w12"
## [49] "P10_w12" "P11_w12"
## [51] "P12_w12" "P13_w12"
## [53] "P14_w12" "P15_w12"
## [55] "P16_w12" "P17_w12"
## [57] "P18_w12" "P19_w12"
## [59] "P20_w12" "P21_w12"
## [61] "P22_w12" "P23_w12"
## [63] "P24_w12" "P25_w12"
## [65] "P26_w12" "P27_w12"
## [67] "P28_w12" "P29_w12"
## [69] "P30_w12" "P31_w12"
## [71] "P32_w12" "P33_w12"
## [73] "P34_w12" "P35_w12"
## [75] "P36_w12" "P37_w12"
## [77] "P38_w12" "P24_w0.1"
## [79] "P25_w0.1" "P30_w0.1"
## [81] "P31_w0.1" "P33_w0.1"
## [83] "P37_w0.1" "P1_w0.1"
## [85] "P2_w0.1" "P3_w0.1"
## [87] "P4_w0.1" "P5_w0.1"
## [89] "P6_w0.1" "P7_w0.1"
## [91] "P8_w0.1" "P9_w0.1"
## [93] "P10_w0.1" "P11_w0.1"
## [95] "P12_w0.1" "P13_w0.1"
## [97] "P14_w0.1" "P15_w0.1"
## [99] "P16_w0.1" "P17_w0.1"
## [101] "P18_w0.1" "P19_w0.1"
## [103] "P20_w0.1" "P21_w0.1"
## [105] "P22_w0.1" "P23_w0.1"
## [107] "P26_w0.1" "P27_w0.1"
## [109] "P28_w0.1" "P29_w0.1"
## [111] "P32_w0.1" "P34_w0.1"
## [113] "P35_w0.1" "P36_w0.1"
## [115] "P38_w0.1" "P24_w12.1"
## [117] "P25_w12.1" "P30_w12.1"
## [119] "P31_w12.1" "P33_w12.1"
## [121] "P37_w12.1" "P1_w12.1"
## [123] "P2_w12.1" "P3_w12.1"
## [125] "P4_w12.1" "P5_w12.1"
## [127] "P6_w12.1" "P7_w12.1"
## [129] "P8_w12.1" "P9_w12.1"
## [131] "P10_w12.1" "P11_w12.1"
## [133] "P12_w12.1" "P13_w12.1"
## [135] "P14_w12.1" "P15_w12.1"
## [137] "P16_w12.1" "P17_w12.1"
## [139] "P18_w12.1" "P19_w12.1"
## [141] "P20_w12.1" "P21_w12.1"
## [143] "P22_w12.1" "P23_w12.1"
## [145] "P26_w12.1" "P27_w12.1"
## [147] "P28_w12.1" "P29_w12.1"
## [149] "P32_w12.1" "P34_w12.1"
## [151] "P35_w12.1" "P36_w12.1"
## [153] "P38_w12.1" "ctrl_mean"
## [155] "trt_mean" "mal_ctrl_mean"
## [157] "fem_ctrl_mean" "mal_trt_mean"
## [159] "fem_trt_mean" "trt_ctrl_FoldChange"
## [161] "fem_trt_ctrl_FoldChange" "mal_trt_ctrl_FoldChange"
## [163] "fem_mal_trt_FoldChange" "fem_mal_ctrl_FoldChange"
## [165] "P4_w0" "P8_w0"
## [167] "P9_w0" "P11_w0"
## [169] "P12_w0" "P14_w0"
## [171] "P19_w0" "P23_w0"
## [173] "P32_w0" "P36_w0"
## [175] "less50_ctrl_mean" "P1_w0"
## [177] "P2_w0" "P3_w0"
## [179] "P5_w0" "P6_w0"
## [181] "P7_w0" "P10_w0"
## [183] "P13_w0" "P15_w0"
## [185] "P16_w0" "P17_w0"
## [187] "P18_w0" "P20_w0"
## [189] "P21_w0" "P22_w0"
## [191] "P24_w0" "P25_w0"
## [193] "P26_w0" "P27_w0"
## [195] "P28_w0" "P29_w0"
## [197] "P30_w0" "P31_w0"
## [199] "P33_w0" "P34_w0"
## [201] "P35_w0" "P37_w0"
## [203] "P38_w0" "great50_ctrl_mean"
## [205] "P4_w12" "P8_w12"
## [207] "P9_w12" "P11_w12"
## [209] "P12_w12" "P14_w12"
## [211] "P19_w12" "P23_w12"
## [213] "P32_w12" "P36_w12"
## [215] "less50_trt_mean" "P1_w12"
## [217] "P2_w12" "P3_w12"
## [219] "P5_w12" "P6_w12"
## [221] "P7_w12" "P10_w12"
## [223] "P13_w12" "P15_w12"
## [225] "P16_w12" "P17_w12"
## [227] "P18_w12" "P20_w12"
## [229] "P21_w12" "P22_w12"
## [231] "P24_w12" "P25_w12"
## [233] "P26_w12" "P27_w12"
## [235] "P28_w12" "P29_w12"
## [237] "P30_w12" "P31_w12"
## [239] "P33_w12" "P34_w12"
## [241] "P35_w12" "P37_w12"
## [243] "P38_w12" "great50_trt_mean"
DATA_RA4 <- DATA_RA3[,c(1:77,154:159,175,204,215,244,160:164)]
colnames(DATA_RA4)
## [1] "gene" "P1_w0"
## [3] "P2_w0" "P3_w0"
## [5] "P4_w0" "P5_w0"
## [7] "P6_w0" "P7_w0"
## [9] "P8_w0" "P9_w0"
## [11] "P10_w0" "P11_w0"
## [13] "P12_w0" "P13_w0"
## [15] "P14_w0" "P15_w0"
## [17] "P16_w0" "P17_w0"
## [19] "P18_w0" "P19_w0"
## [21] "P20_w0" "P21_w0"
## [23] "P22_w0" "P23_w0"
## [25] "P24_w0" "P25_w0"
## [27] "P26_w0" "P27_w0"
## [29] "P28_w0" "P29_w0"
## [31] "P30_w0" "P31_w0"
## [33] "P32_w0" "P33_w0"
## [35] "P34_w0" "P35_w0"
## [37] "P36_w0" "P37_w0"
## [39] "P38_w0" "P1_w12"
## [41] "P2_w12" "P3_w12"
## [43] "P4_w12" "P5_w12"
## [45] "P6_w12" "P7_w12"
## [47] "P8_w12" "P9_w12"
## [49] "P10_w12" "P11_w12"
## [51] "P12_w12" "P13_w12"
## [53] "P14_w12" "P15_w12"
## [55] "P16_w12" "P17_w12"
## [57] "P18_w12" "P19_w12"
## [59] "P20_w12" "P21_w12"
## [61] "P22_w12" "P23_w12"
## [63] "P24_w12" "P25_w12"
## [65] "P26_w12" "P27_w12"
## [67] "P28_w12" "P29_w12"
## [69] "P30_w12" "P31_w12"
## [71] "P32_w12" "P33_w12"
## [73] "P34_w12" "P35_w12"
## [75] "P36_w12" "P37_w12"
## [77] "P38_w12" "ctrl_mean"
## [79] "trt_mean" "mal_ctrl_mean"
## [81] "fem_ctrl_mean" "mal_trt_mean"
## [83] "fem_trt_mean" "less50_ctrl_mean"
## [85] "great50_ctrl_mean" "less50_trt_mean"
## [87] "great50_trt_mean" "trt_ctrl_FoldChange"
## [89] "fem_trt_ctrl_FoldChange" "mal_trt_ctrl_FoldChange"
## [91] "fem_mal_trt_FoldChange" "fem_mal_ctrl_FoldChange"
DATA_RA4$less50_trt_ctrl_FoldChange <- DATA_RA4$less50_trt_mean/DATA_RA4$less50_ctrl_mean
DATA_RA4$great50_trt_ctrl_FoldChange <- DATA_RA4$great50_trt_mean/DATA_RA4$great50_ctrl_mean
colnames(DATA_RA4)
## [1] "gene" "P1_w0"
## [3] "P2_w0" "P3_w0"
## [5] "P4_w0" "P5_w0"
## [7] "P6_w0" "P7_w0"
## [9] "P8_w0" "P9_w0"
## [11] "P10_w0" "P11_w0"
## [13] "P12_w0" "P13_w0"
## [15] "P14_w0" "P15_w0"
## [17] "P16_w0" "P17_w0"
## [19] "P18_w0" "P19_w0"
## [21] "P20_w0" "P21_w0"
## [23] "P22_w0" "P23_w0"
## [25] "P24_w0" "P25_w0"
## [27] "P26_w0" "P27_w0"
## [29] "P28_w0" "P29_w0"
## [31] "P30_w0" "P31_w0"
## [33] "P32_w0" "P33_w0"
## [35] "P34_w0" "P35_w0"
## [37] "P36_w0" "P37_w0"
## [39] "P38_w0" "P1_w12"
## [41] "P2_w12" "P3_w12"
## [43] "P4_w12" "P5_w12"
## [45] "P6_w12" "P7_w12"
## [47] "P8_w12" "P9_w12"
## [49] "P10_w12" "P11_w12"
## [51] "P12_w12" "P13_w12"
## [53] "P14_w12" "P15_w12"
## [55] "P16_w12" "P17_w12"
## [57] "P18_w12" "P19_w12"
## [59] "P20_w12" "P21_w12"
## [61] "P22_w12" "P23_w12"
## [63] "P24_w12" "P25_w12"
## [65] "P26_w12" "P27_w12"
## [67] "P28_w12" "P29_w12"
## [69] "P30_w12" "P31_w12"
## [71] "P32_w12" "P33_w12"
## [73] "P34_w12" "P35_w12"
## [75] "P36_w12" "P37_w12"
## [77] "P38_w12" "ctrl_mean"
## [79] "trt_mean" "mal_ctrl_mean"
## [81] "fem_ctrl_mean" "mal_trt_mean"
## [83] "fem_trt_mean" "less50_ctrl_mean"
## [85] "great50_ctrl_mean" "less50_trt_mean"
## [87] "great50_trt_mean" "trt_ctrl_FoldChange"
## [89] "fem_trt_ctrl_FoldChange" "mal_trt_ctrl_FoldChange"
## [91] "fem_mal_trt_FoldChange" "fem_mal_ctrl_FoldChange"
## [93] "less50_trt_ctrl_FoldChange" "great50_trt_ctrl_FoldChange"
DATA_RA4_FC <- DATA_RA4[,c(1,78:94)]
write.csv(DATA_RA4, 'all_RA_FC.csv',row.names=FALSE)
Note that I caught this late by a day. WE need to set the NaNs to 1 not zero, because there isn’t a halving of gene expression (foldchange = 0) but no change of fold change =1. We only want the change, from diseased state to healthy state, and when using division by zero, we will get Inf if the numerator is not zero but the denominator is zero, and NaN when both are zero. To handle this if the numerator is not zero and the denominator is, then the change is the numerator, and if they are both zero, then there is no change, hence zero change will be a 1 because there is no change in over or under expression. My previous use of this was errored, I will go back and fix this mistake, because having a fold change value of 0 actually means the gene had a decrease of 100% or halved, because and increase of 100% is a fold change of 2 and hence doubled. We will write these in with conditions using ifelse.
DATA_RA4_FC$trt_ctrl_FoldChange <- ifelse(DATA_RA4_FC$trt_ctrl_FoldChange=='Inf',
DATA_RA4_FC$trt_mean,
ifelse(DATA_RA4_FC$trt_ctrl_FoldChange=='NaN',
1,
DATA_RA4_FC$trt_ctrl_FoldChange))
DATA_RA4_FC$fem_trt_ctrl_FoldChange <- ifelse(DATA_RA4_FC$fem_trt_ctrl_FoldChange=='Inf',
DATA_RA4_FC$fem_trt_mean,
ifelse(DATA_RA4_FC$fem_trt_ctrl_FoldChange=='NaN',
1,
DATA_RA4_FC$fem_trt_ctrl_FoldChange))
DATA_RA4_FC$mal_trt_ctrl_FoldChange <- ifelse(DATA_RA4_FC$mal_trt_ctrl_FoldChange=='Inf',
DATA_RA4_FC$mal_trt_mean,
ifelse(DATA_RA4_FC$mal_trt_ctrl_FoldChange=='NaN',
1,
DATA_RA4_FC$mal_trt_ctrl_FoldChange))
DATA_RA4_FC$fem_mal_trt_FoldChange <- ifelse(DATA_RA4_FC$fem_mal_trt_FoldChange=='Inf',
DATA_RA4_FC$fem_trt_mean,
ifelse(DATA_RA4_FC$fem_mal_trt_FoldChange=='NaN',
1,
DATA_RA4_FC$fem_mal_trt_FoldChange))
DATA_RA4_FC$fem_mal_ctrl_FoldChange <- ifelse(DATA_RA4_FC$fem_mal_ctrl_FoldChange=='Inf',
DATA_RA4_FC$fem_ctrl_mean,
ifelse(DATA_RA4_FC$fem_mal_ctrl_FoldChange=='NaN',
1,
DATA_RA4_FC$fem_mal_ctrl_FoldChange))
DATA_RA4_FC$less50_trt_ctrl_FoldChange <- ifelse(DATA_RA4_FC$less50_trt_ctrl_FoldChange=='Inf',
DATA_RA4_FC$less50_trt_mean,
ifelse(DATA_RA4_FC$less50_trt_ctrl_FoldChange=='NaN',
1,
DATA_RA4_FC$less50_trt_ctrl_FoldChange))
DATA_RA4_FC$great50_trt_ctrl_FoldChange <- ifelse(DATA_RA4_FC$great50_trt_ctrl_FoldChange=='Inf',
DATA_RA4_FC$great50_trt_mean,
ifelse(DATA_RA4_FC$great50_trt_ctrl_FoldChange=='NaN',
1,
DATA_RA4_FC$great50_trt_ctrl_FoldChange))
Lets select some genes based on the highest and lowest fold change values. Since we put in the values, our fold change values of zero are significant for those genes that halved. Even smaller values of zero would mean a more significant down regulation. i.e. 0.0 versus 0.001, where the decline in gene expression went from a decrease of 10% versus a decrease of 1000% or 10 times the decrease in the disease state to the healthy state.
write.csv(DATA_RA4_FC,'RA_FCs.csv',row.names=FALSE)
head(DATA_RA4_FC)
## gene ctrl_mean trt_mean mal_ctrl_mean fem_ctrl_mean mal_trt_mean
## 1 1-Dec 0.26315789 0.2105263 0.6666667 0.18750 0.3333333
## 2 1-Mar 707.13157895 633.0526316 710.5000000 706.50000 640.0833333
## 3 1-Sep 829.44736842 876.8684211 1030.0000000 791.84375 906.1666667
## 4 10-Mar 1.89473684 2.0263158 2.5000000 1.78125 1.5000000
## 5 10-Sep 31.76315789 30.8947368 35.6666667 31.03125 28.8333333
## 6 11-Mar 0.02631579 0.0000000 0.0000000 0.03125 0.0000000
## fem_trt_mean less50_ctrl_mean great50_ctrl_mean less50_trt_mean
## 1 0.18750 0.00 0.35714286 0.1
## 2 631.73438 630.45 734.51785714 576.4
## 3 871.37500 667.50 887.28571429 734.8
## 4 2.12500 1.60 2.00000000 1.6
## 5 31.28125 23.80 34.60714286 22.7
## 6 0.00000 0.00 0.03571429 0.0
## great50_trt_mean trt_ctrl_FoldChange fem_trt_ctrl_FoldChange
## 1 0.250000 0.8000000 1.0000000
## 2 653.285714 0.8952402 0.8941746
## 3 927.607143 1.0571719 1.1004381
## 4 2.178571 1.0694444 1.1929825
## 5 33.821429 0.9726595 1.0080564
## 6 0.000000 0.0000000 0.0000000
## mal_trt_ctrl_FoldChange fem_mal_trt_FoldChange fem_mal_ctrl_FoldChange
## 1 0.5000000 0.5625000 0.2812500
## 2 0.9008914 0.9869565 0.9943702
## 3 0.8797735 0.9616057 0.7687803
## 4 0.6000000 1.4166667 0.7125000
## 5 0.8084112 1.0848988 0.8700350
## 6 1.0000000 1.0000000 0.0312500
## less50_trt_ctrl_FoldChange great50_trt_ctrl_FoldChange
## 1 0.1000000 0.7000000
## 2 0.9142676 0.8894075
## 3 1.1008240 1.0454436
## 4 1.0000000 1.0892857
## 5 0.9537815 0.9772962
## 6 1.0000000 0.0000000
Lets get the highest and least expressed genes for 10 of each in fold change values for each group of: - younger than 50 treatment/control fold change - older than 50 treatment/control fold change - females to males control fold change - females to males treatment fold change - females treatment/control fold change - males treatment/contral fold change - treatment/control fold change for all participants
treatControl <- DATA_RA4_FC[order(DATA_RA4_FC$trt_ctrl_FoldChange,decreasing=T)[c(1:10,56627:56636)],]
young <- DATA_RA4_FC[order(DATA_RA4_FC$less50_trt_ctrl_FoldChange,decreasing=T)[c(1:10,56627:56636)],]
old <- DATA_RA4_FC[order(DATA_RA4_FC$great50_trt_ctrl_FoldChange,decreasing=T)[c(1:10,56627:56636)],]
fems <- DATA_RA4_FC[order(DATA_RA4_FC$fem_trt_ctrl_FoldChange,decreasing=T)[c(1:10,56627:56636)],]
males <- DATA_RA4_FC[order(DATA_RA4_FC$mal_trt_ctrl_FoldChange,decreasing=T)[c(1:10,56627:56636)],]
fems_males_trt <- DATA_RA4_FC[order(DATA_RA4_FC$fem_mal_trt_FoldChange,decreasing=T)[c(1:10,56627:56636)],]
fems_males_ctrl <- DATA_RA4_FC[order(DATA_RA4_FC$fem_mal_trt_FoldChange,decreasing=T)[c(1:10,56627:56636)],]
write.csv(treatControl,'treatControl.csv', row.names=F)
write.csv(young, 'young.csv', row.names=F)
write.csv(old, 'old.csv', row.names=F)
write.csv(fems,'fems.csv',row.names=F)
write.csv(males, 'males.csv', row.names=F)
write.csv(fems_males_trt,'fems_males_trt.csv',row.names=F)
write.csv(fems_males_ctrl,'fems_males_ctrl.csv',row.names=F)
The above files are the: - treatment/control of all patients for ratio of week 12 to week 0 treated with abatacept in RA blood - younger than 50 year old patients mixed genders week12 to week0 - older than 50 year old patients mixed genders week12 to week0 - female patients mixed ages week12 to week0 - male patients mixed ages week12 to week0 - females to males mixed ages week12 ratios - females to males mixed ages week0 ratios
Lets now do some machine learning on these data sets. We will take the top 2 genes and bottom 2 from each data frame and then use those as factors or gene variables to split our data of samples up into training and testing sets using various machine learning algorithms in R and possibly in python with the R reticulate package.
treat2 <- treatControl[c(1:2,19:20),]
young2 <- young[c(1:2,19:20),]
old2 <- old[c(1:2,19:20),]
fems2 <- fems[c(1:2,19:20),]
males2 <- males[c(1:2,19:20),]
fems_males_trt2 <- fems_males_trt[c(1:2,19:20),]
fems_males_ctrl2 <- fems_males_ctrl[c(1:2,19:20),]
ML_genes_df <-rbind(treat2,young2,old2,fems2,males2,fems_males_trt2,
fems_males_ctrl2)
ML_genes_df2 <- ML_genes_df[!duplicated(ML_genes_df),]
ML_genes_List <- sort(ML_genes_df2$gene)
ML_genes_List
## [1] "AL161626.1" "CTB-57H20.1" "HEPH" "KCNJ10"
## [5] "OR2L3" "PGF" "RP11-317J10.4" "RP11-561N12.5"
## [9] "RP11-806O11.1" "XIST" "ZNF804B" "ZNF863P"
## [13] "ZNF962P" "ZNF99" "ZP2" "ZSCAN1"
Lets create our ML data frame from our original data.
ML_DF <- subset(data, data$gene %in% ML_genes_List)
ML_DF2 <- as.data.frame(t(ML_DF))
colnames(ML_DF2) <- ML_DF$gene
ML_DF2$patientID <- colnames(ML_DF)
ML_DF3 <- ML_DF2[2:77,]
ML_DF4 <- merge(ML_DF3, demographics,by.x='patientID',by.y='patient')
ML_DF5 <- ML_DF4[,c(1,18:21,2:17)]
colnames(ML_DF5)
## [1] "patientID" "sample" "Age" "gender"
## [5] "treatment" "KCNJ10" "OR2L3" "ZNF863P"
## [9] "CTB-57H20.1" "RP11-561N12.5" "ZNF804B" "HEPH"
## [13] "XIST" "RP11-806O11.1" "RP11-317J10.4" "AL161626.1"
## [17] "ZNF962P" "PGF" "ZP2" "ZNF99"
## [21] "ZSCAN1"
We now have some outcome variables to predict the outcome or class of each sample after training our models. Lets take a look at the class or variable type of each of our gene and demographic variables.
str(ML_DF5)
## 'data.frame': 76 obs. of 21 variables:
## $ patientID : chr "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
## $ sample : Factor w/ 76 levels "GSM4567526","GSM4567527",..: 1 2 55 56 73 74 47 48 11 12 ...
## $ Age : num 78 78 80 80 21 21 46 46 59 59 ...
## $ gender : chr "F" "F" "F" "F" ...
## $ treatment : chr "control" "abatacept" "control" "abatacept" ...
## $ KCNJ10 : Factor w/ 15 levels " 0"," 1",..: 1 1 2 2 4 4 3 1 4 4 ...
## ..- attr(*, "names")= chr "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
## $ OR2L3 : Factor w/ 14 levels " 0"," 1",..: 1 1 1 3 1 3 1 1 2 1 ...
## ..- attr(*, "names")= chr "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
## $ ZNF863P : Factor w/ 6 levels " 0"," 0",..: 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "names")= chr "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
## $ CTB-57H20.1 : Factor w/ 9 levels " 0"," 1",..: 1 1 1 1 1 3 1 1 1 1 ...
## ..- attr(*, "names")= chr "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
## $ RP11-561N12.5: Factor w/ 8 levels " 0"," 1",..: 1 2 1 1 2 1 1 1 1 2 ...
## ..- attr(*, "names")= chr "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
## $ ZNF804B : Factor w/ 7 levels " 0"," 1",..: 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "names")= chr "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
## $ HEPH : Factor w/ 43 levels " 7"," 8",..: 18 10 18 6 13 13 35 28 24 8 ...
## ..- attr(*, "names")= chr "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
## $ XIST : Factor w/ 68 levels " 0","0","10248",..: 43 63 41 25 30 14 6 50 37 61 ...
## ..- attr(*, "names")= chr "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
## $ RP11-806O11.1: Factor w/ 8 levels " 0"," 1",..: 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "names")= chr "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
## $ RP11-317J10.4: Factor w/ 10 levels " 0"," 1",..: 1 1 1 1 2 1 1 1 1 2 ...
## ..- attr(*, "names")= chr "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
## $ AL161626.1 : Factor w/ 25 levels " 0"," 1",..: 14 6 1 1 2 5 1 1 6 15 ...
## ..- attr(*, "names")= chr "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
## $ ZNF962P : Factor w/ 7 levels " 0"," 1",..: 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "names")= chr "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
## $ PGF : Factor w/ 25 levels " 0"," 1",..: 1 3 3 2 1 2 15 18 2 2 ...
## ..- attr(*, "names")= chr "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
## $ ZP2 : Factor w/ 6 levels " 0"," 1",..: 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "names")= chr "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
## $ ZNF99 : Factor w/ 7 levels " 0"," 1",..: 2 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "names")= chr "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
## $ ZSCAN1 : Factor w/ 7 levels " 0"," 1",..: 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "names")= chr "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
write.csv(ML_DF5,'ML_genes.csv',row.names=F)
ML_GENES <- read.csv('ML_genes.csv', sep=',',header=T,na.strings=c('',' ','NA'),
stringsAsFactors = T)
str(ML_GENES)
## 'data.frame': 76 obs. of 21 variables:
## $ patientID : Factor w/ 76 levels "P1_w0","P1_w12",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ sample : Factor w/ 76 levels "GSM4567526","GSM4567527",..: 1 2 55 56 73 74 47 48 11 12 ...
## $ Age : int 78 78 80 80 21 21 46 46 59 59 ...
## $ gender : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ treatment : Factor w/ 2 levels "abatacept","control": 2 1 2 1 2 1 2 1 2 1 ...
## $ KCNJ10 : int 0 0 1 1 3 3 2 0 3 3 ...
## $ OR2L3 : int 0 0 0 2 0 2 0 0 1 0 ...
## $ ZNF863P : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CTB.57H20.1 : int 0 0 0 0 0 2 0 0 0 0 ...
## $ RP11.561N12.5: int 0 1 0 0 1 0 0 0 0 1 ...
## $ ZNF804B : int 0 0 0 0 0 0 0 0 0 0 ...
## $ HEPH : int 28 19 28 15 22 22 93 47 36 17 ...
## $ XIST : int 24708 45725 24619 19769 22017 16788 12388 26615 23288 40961 ...
## $ RP11.806O11.1: int 0 0 0 0 0 0 0 0 0 0 ...
## $ RP11.317J10.4: int 0 0 0 0 1 0 0 0 0 1 ...
## $ AL161626.1 : int 44 5 0 0 1 4 0 0 5 81 ...
## $ ZNF962P : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PGF : int 0 2 2 1 0 1 19 75 1 1 ...
## $ ZP2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ZNF99 : int 1 0 0 0 0 0 0 0 0 0 ...
## $ ZSCAN1 : int 0 0 0 0 0 0 0 0 0 0 ...
Lets rename the rows to the patientID.
row.names(ML_GENES) <- ML_GENES$patientID
ML_GENES2 <- ML_GENES[,-c(1:2)]
head(ML_GENES2)
## Age gender treatment KCNJ10 OR2L3 ZNF863P CTB.57H20.1 RP11.561N12.5
## P1_w0 78 F control 0 0 0 0 0
## P1_w12 78 F abatacept 0 0 0 0 1
## P10_w0 80 F control 1 0 0 0 0
## P10_w12 80 F abatacept 1 2 0 0 0
## P11_w0 21 F control 3 0 0 0 1
## P11_w12 21 F abatacept 3 2 0 2 0
## ZNF804B HEPH XIST RP11.806O11.1 RP11.317J10.4 AL161626.1 ZNF962P PGF
## P1_w0 0 28 24708 0 0 44 0 0
## P1_w12 0 19 45725 0 0 5 0 2
## P10_w0 0 28 24619 0 0 0 0 2
## P10_w12 0 15 19769 0 0 0 0 1
## P11_w0 0 22 22017 0 1 1 0 0
## P11_w12 0 22 16788 0 0 4 0 1
## ZP2 ZNF99 ZSCAN1
## P1_w0 0 1 0
## P1_w12 0 0 0
## P10_w0 0 0 0
## P10_w12 0 0 0
## P11_w0 0 0 0
## P11_w12 0 0 0
Our machine learning data frame is the ML_GENES2 data, we have the class types of the variables as integer or factor. We can train to classify by gender, treatment, or age.
Lets get started testing our machine learning algorithms from the caret package for random forest (rf), k-nearest neighbor (knn), generalized linear models (glm), and recursive partitioned trees (rpart).
We will likely have to change the setting in our glm model because it is set for numeric regression and not classification, the same for our rf model.
library(RANN) #this pkg supplements caret for out of bag validation
library(e1071)
library(caret)
library(randomForest)
library(MASS)
library(gbm)
set.seed(12345)
inTrain <- createDataPartition(y=ML_GENES2$treatment, p=0.7, list=FALSE)
trainingSet <- ML_GENES2[inTrain,]
testingSet <- ML_GENES2[-inTrain,]
dim(trainingSet)
## [1] 54 19
dim(testingSet)
## [1] 22 19
Lets use random forest, knn, and glm algorithms to predict the ratings.
rf_boot <- train(treatment~., method='rf',
na.action=na.pass,
data=(trainingSet), preProc = c("center", "scale","knnImpute"),
trControl=trainControl(method='boot'), number=5)
predRF_boot <- predict(rf_boot, testingSet)
DF_boot <- data.frame(predRF_boot, type=testingSet$treatment)
length_boot <- length(DF_boot$type)
sum_boot <- sum(DF_boot$predRF_boot==DF_boot$type)
accRF_boot <- (sum_boot/length_boot)
accRF_boot
## [1] 0.7272727
head(DF_boot,30)
## predRF_boot type
## 1 control control
## 2 abatacept abatacept
## 3 control control
## 4 abatacept control
## 5 control control
## 6 abatacept abatacept
## 7 control control
## 8 abatacept abatacept
## 9 control abatacept
## 10 abatacept abatacept
## 11 abatacept control
## 12 abatacept control
## 13 abatacept abatacept
## 14 control control
## 15 abatacept control
## 16 abatacept abatacept
## 17 abatacept abatacept
## 18 control abatacept
## 19 control control
## 20 abatacept abatacept
## 21 control control
## 22 abatacept abatacept
errored <- DF_boot[(DF_boot$predRF_boot != DF_boot$type),]
errored
## predRF_boot type
## 4 abatacept control
## 9 control abatacept
## 11 abatacept control
## 12 abatacept control
## 15 abatacept control
## 18 control abatacept
Lets see which samples created an error in our random forest model.
rn <- as.numeric(row.names(errored))
rn
## [1] 4 9 11 12 15 18
rfError <- testingSet[rn,]
rfError
## Age gender treatment KCNJ10 OR2L3 ZNF863P CTB.57H20.1 RP11.561N12.5
## P2_w0 69 F control 0 1 0 0 0
## P27_w12 56 F abatacept 0 0 0 0 0
## P29_w0 55 F control 2 3 0 0 0
## P3_w0 70 F control 1 4 0 0 0
## P35_w0 68 F control 0 4 0 0 0
## P4_w12 27 F abatacept 0 1 0 0 0
## ZNF804B HEPH XIST RP11.806O11.1 RP11.317J10.4 AL161626.1 ZNF962P PGF
## P2_w0 0 29 37182 0 0 3 1 30
## P27_w12 0 27 9291 0 1 0 1 7
## P29_w0 0 31 18424 0 0 0 0 0
## P3_w0 0 26 31608 0 0 1 0 8
## P35_w0 0 31 30501 0 0 0 0 7
## P4_w12 0 16 15650 0 0 0 0 0
## ZP2 ZNF99 ZSCAN1
## P2_w0 0 0 0
## P27_w12 0 0 0
## P29_w0 0 0 0
## P3_w0 0 0 0
## P35_w0 0 0 0
## P4_w12 0 0 0
All the six errors in classification of treatment type were from the female group with a mixed age group. All the males were classified correctly.
lets tune our model and see if we can get prediction accuracy on treatment. methods in trainControl(): boot-boot strap aggregating, LOOCV-leave one out cross validation cv-cross validation repeatedcv-? LGOCV-for repeated training/test splits Scored one better with 5 instead of 6 all females but wasn’t reproduced optimism_boot-? boot632-? boot_all-? adaptive-cv adaptive_boot adaptive_LGOCV none-only fits one model to entire training set for bagged and forest models timeslice
rf_boot <- train(treatment~., method='rf',
na.action=na.pass,
data=(trainingSet), preProc = c("center", "scale"),
trControl=trainControl(method='cv'), number=7)
predRF_boot <- predict(rf_boot, testingSet)
DF_boot <- data.frame(predRF_boot, type=testingSet$treatment)
length_boot <- length(DF_boot$type)
sum_boot <- sum(DF_boot$predRF_boot==DF_boot$type)
accRF_boot <- (sum_boot/length_boot)
accRF_boot
## [1] 0.6818182
head(DF_boot,30)
## predRF_boot type
## 1 control control
## 2 control abatacept
## 3 control control
## 4 abatacept control
## 5 control control
## 6 abatacept abatacept
## 7 control control
## 8 abatacept abatacept
## 9 control abatacept
## 10 abatacept abatacept
## 11 abatacept control
## 12 abatacept control
## 13 abatacept abatacept
## 14 control control
## 15 abatacept control
## 16 abatacept abatacept
## 17 abatacept abatacept
## 18 control abatacept
## 19 control control
## 20 abatacept abatacept
## 21 control control
## 22 abatacept abatacept
errored <- DF_boot[(DF_boot$predRF_boot != DF_boot$type),]
errored
## predRF_boot type
## 2 control abatacept
## 4 abatacept control
## 9 control abatacept
## 11 abatacept control
## 12 abatacept control
## 15 abatacept control
## 18 control abatacept
rn <- as.numeric(row.names(errored))
rn
## [1] 2 4 9 11 12 15 18
rfError <- testingSet[rn,]
rfError
## Age gender treatment KCNJ10 OR2L3 ZNF863P CTB.57H20.1 RP11.561N12.5
## P13_w12 59 F abatacept 3 0 0 0 1
## P2_w0 69 F control 0 1 0 0 0
## P27_w12 56 F abatacept 0 0 0 0 0
## P29_w0 55 F control 2 3 0 0 0
## P3_w0 70 F control 1 4 0 0 0
## P35_w0 68 F control 0 4 0 0 0
## P4_w12 27 F abatacept 0 1 0 0 0
## ZNF804B HEPH XIST RP11.806O11.1 RP11.317J10.4 AL161626.1 ZNF962P PGF
## P13_w12 0 17 40961 0 1 81 0 1
## P2_w0 0 29 37182 0 0 3 1 30
## P27_w12 0 27 9291 0 1 0 1 7
## P29_w0 0 31 18424 0 0 0 0 0
## P3_w0 0 26 31608 0 0 1 0 8
## P35_w0 0 31 30501 0 0 0 0 7
## P4_w12 0 16 15650 0 0 0 0 0
## ZP2 ZNF99 ZSCAN1
## P13_w12 0 0 0
## P2_w0 0 0 0
## P27_w12 0 0 0
## P29_w0 0 0 0
## P3_w0 0 0 0
## P35_w0 0 0 0
## P4_w12 0 0 0
The tuning didn’t produce better results. The errors ranged from 6-8 with a week 12 male aged 62 in with the female errors.
Lets see how well our rf model can classify by gender instead of treatment.
rf_boot <- train(gender~., method='rf',
na.action=na.pass,
data=(trainingSet), preProc = c("center", "scale"),
trControl=trainControl(method='boot'), number=5)
predRF_boot <- predict(rf_boot, testingSet)
DF_boot <- data.frame(predRF_boot, type=testingSet$gender)
length_boot <- length(DF_boot$type)
sum_boot <- sum(DF_boot$predRF_boot==DF_boot$type)
accRF_boot <- (sum_boot/length_boot)
accRF_boot
## [1] 1
head(DF_boot,30)
## predRF_boot type
## 1 F F
## 2 F F
## 3 F F
## 4 F F
## 5 F F
## 6 F F
## 7 M M
## 8 M M
## 9 F F
## 10 F F
## 11 F F
## 12 F F
## 13 M M
## 14 F F
## 15 F F
## 16 F F
## 17 M M
## 18 F F
## 19 F F
## 20 F F
## 21 F F
## 22 F F
errored <- DF_boot[(DF_boot$predRF_boot != DF_boot$type),]
errored
## [1] predRF_boot type
## <0 rows> (or 0-length row.names)
rn <- as.numeric(row.names(errored))
rn
## numeric(0)
rfError <- testingSet[rn,]
rfError
## [1] Age gender treatment KCNJ10 OR2L3
## [6] ZNF863P CTB.57H20.1 RP11.561N12.5 ZNF804B HEPH
## [11] XIST RP11.806O11.1 RP11.317J10.4 AL161626.1 ZNF962P
## [16] PGF ZP2 ZNF99 ZSCAN1
## <0 rows> (or 0-length row.names)
The random forest model was able to classify the gender based on the genes used and the age and treatment group 100% accurately.
Now lets add an age variable for those younger than 50 and those 50 or older and remove our other age variable that correlates with this new feature adding.
ML_GENES3 <- ML_GENES2
ML_GENES3$Age <- ifelse(ML_GENES3$Age < 50, 'younger','older')
head(ML_GENES3)
## Age gender treatment KCNJ10 OR2L3 ZNF863P CTB.57H20.1 RP11.561N12.5
## P1_w0 older F control 0 0 0 0 0
## P1_w12 older F abatacept 0 0 0 0 1
## P10_w0 older F control 1 0 0 0 0
## P10_w12 older F abatacept 1 2 0 0 0
## P11_w0 younger F control 3 0 0 0 1
## P11_w12 younger F abatacept 3 2 0 2 0
## ZNF804B HEPH XIST RP11.806O11.1 RP11.317J10.4 AL161626.1 ZNF962P PGF
## P1_w0 0 28 24708 0 0 44 0 0
## P1_w12 0 19 45725 0 0 5 0 2
## P10_w0 0 28 24619 0 0 0 0 2
## P10_w12 0 15 19769 0 0 0 0 1
## P11_w0 0 22 22017 0 1 1 0 0
## P11_w12 0 22 16788 0 0 4 0 1
## ZP2 ZNF99 ZSCAN1
## P1_w0 0 1 0
## P1_w12 0 0 0
## P10_w0 0 0 0
## P10_w12 0 0 0
## P11_w0 0 0 0
## P11_w12 0 0 0
set.seed(12345)
inTrain <- createDataPartition(y=ML_GENES3$Age, p=0.7, list=FALSE)
trainingSet2 <- ML_GENES3[inTrain,]
testingSet2 <- ML_GENES3[-inTrain,]
dim(trainingSet2)
## [1] 54 19
dim(testingSet2)
## [1] 22 19
rf_boot <- train(Age~., method='rf',
na.action=na.pass,
data=(trainingSet2), preProc = c("center", "scale"),
trControl=trainControl(method='boot'), number=5)
predRF_boot <- predict(rf_boot, testingSet2)
DF_boot <- data.frame(predRF_boot, type=testingSet2$Age)
length_boot <- length(DF_boot$type)
sum_boot <- sum(DF_boot$predRF_boot==DF_boot$type)
accRF_boot <- (sum_boot/length_boot)
accRF_boot
## [1] 0.7727273
head(DF_boot,30)
## predRF_boot type
## 1 older younger
## 2 older younger
## 3 older older
## 4 older older
## 5 older older
## 6 older younger
## 7 older younger
## 8 older older
## 9 older older
## 10 older older
## 11 older older
## 12 older older
## 13 older older
## 14 older older
## 15 older older
## 16 older older
## 17 older older
## 18 older older
## 19 older older
## 20 younger younger
## 21 older older
## 22 older younger
errored <- DF_boot[(DF_boot$predRF_boot != DF_boot$type),]
errored
## predRF_boot type
## 1 older younger
## 2 older younger
## 6 older younger
## 7 older younger
## 22 older younger
rn <- as.numeric(row.names(errored))
rn
## [1] 1 2 6 7 22
rfError <- testingSet2[rn,]
rfError
## Age gender treatment KCNJ10 OR2L3 ZNF863P CTB.57H20.1 RP11.561N12.5
## P11_w12 younger F abatacept 3 2 0 2 0
## P12_w12 younger F abatacept 0 0 0 0 0
## P19_w0 younger F control 1 0 0 0 0
## P19_w12 younger F abatacept 1 0 0 2 0
## P9_w0 younger F control 0 0 0 0 0
## ZNF804B HEPH XIST RP11.806O11.1 RP11.317J10.4 AL161626.1 ZNF962P PGF
## P11_w12 0 22 16788 0 0 4 0 1
## P12_w12 0 47 26615 0 0 0 0 75
## P19_w0 0 50 18051 0 0 1 0 2
## P19_w12 0 41 19099 0 0 0 0 2
## P9_w0 0 17 25929 0 0 0 0 1
## ZP2 ZNF99 ZSCAN1
## P11_w12 0 0 0
## P12_w12 0 0 0
## P19_w0 0 0 1
## P19_w12 0 0 0
## P9_w0 0 0 0
When using our random forest model to predict the age as younger than 50 or 50 or older, it misclassified 5 samples that were all females. Lets see what those ages were for our females that were misclassified as older.
ageErrorList <- row.names(rfError)
ageError <- row.names(ML_GENES2) %in% ageErrorList
ML_GENES2[ageError,]
## Age gender treatment KCNJ10 OR2L3 ZNF863P CTB.57H20.1 RP11.561N12.5
## P11_w12 21 F abatacept 3 2 0 2 0
## P12_w12 46 F abatacept 0 0 0 0 0
## P19_w0 47 F control 1 0 0 0 0
## P19_w12 47 F abatacept 1 0 0 2 0
## P9_w0 47 F control 0 0 0 0 0
## ZNF804B HEPH XIST RP11.806O11.1 RP11.317J10.4 AL161626.1 ZNF962P PGF
## P11_w12 0 22 16788 0 0 4 0 1
## P12_w12 0 47 26615 0 0 0 0 75
## P19_w0 0 50 18051 0 0 1 0 2
## P19_w12 0 41 19099 0 0 0 0 2
## P9_w0 0 17 25929 0 0 0 0 1
## ZP2 ZNF99 ZSCAN1
## P11_w12 0 0 0
## P12_w12 0 0 0
## P19_w0 0 0 1
## P19_w12 0 0 0
## P9_w0 0 0 0
The females that were misclassified as older than 50 when they were younger was 4 unique females but a couple in the week 12 and one in the week0, and the 4th from both week0 and week12.
Lets move on to exploring those genes that could make more of a difference in detection RA, since this data isn’t on COVID-19 genes, but symptoms related to COVID-19 after knocking down an interleukin with an inhibitor. We used both under and over expressed genes, lets remove the under expressed genes.
treat2b <- treatControl[c(1:2),]
young2b <- young[c(1:2),]
old2b <- old[c(1:2),]
fems2b <- fems[c(1:2),]
males2b <- males[c(1:2),]
fems_males_trt2b <- fems_males_trt[c(1:2),]
fems_males_ctrl2b <- fems_males_ctrl[c(1:2),]
ML_genes_dfb <-rbind(treat2b,young2b,old2b,fems2b,males2b,fems_males_trt2b,
fems_males_ctrl2b)
ML_genes_df2b <- ML_genes_dfb[!duplicated(ML_genes_dfb),]
ML_genes_Listb <- sort(ML_genes_df2b$gene)
ML_genes_Listb
## [1] "AL161626.1" "CTB-57H20.1" "HEPH" "KCNJ10"
## [5] "OR2L3" "PGF" "RP11-317J10.4" "RP11-561N12.5"
## [9] "RP11-806O11.1" "XIST"
upReg <- c('Age','gender','treatment',ML_genes_Listb)
upReg2 <- gsub('-','.',upReg)
ML_up <- ML_GENES2[,colnames(ML_GENES2) %in% upReg2]
ML_up
## Age gender treatment KCNJ10 OR2L3 CTB.57H20.1 RP11.561N12.5 HEPH XIST
## P1_w0 78 F control 0 0 0 0 28 24708
## P1_w12 78 F abatacept 0 0 0 1 19 45725
## P10_w0 80 F control 1 0 0 0 28 24619
## P10_w12 80 F abatacept 1 2 0 0 15 19769
## P11_w0 21 F control 3 0 0 1 22 22017
## P11_w12 21 F abatacept 3 2 2 0 22 16788
## P12_w0 46 F control 2 0 0 0 93 12388
## P12_w12 46 F abatacept 0 0 0 0 47 26615
## P13_w0 59 F control 3 1 0 0 36 23288
## P13_w12 59 F abatacept 3 0 0 1 17 40961
## P14_w0 44 F control 0 0 0 0 54 10248
## P14_w12 44 F abatacept 3 2 0 1 34 18257
## P15_w0 50 F control 0 0 0 0 28 10754
## P15_w12 50 F abatacept 1 2 0 1 27 23239
## P16_w0 54 F control 3 0 0 0 16 31607
## P16_w12 54 F abatacept 1 1 1 1 36 40660
## P17_w0 72 F control 2 0 0 0 28 16866
## P17_w12 72 F abatacept 2 0 0 1 12 24028
## P18_w0 75 F control 0 1 0 0 41 6869
## P18_w12 75 F abatacept 1 4 0 0 21 20913
## P19_w0 47 F control 1 0 0 0 50 18051
## P19_w12 47 F abatacept 1 0 2 0 41 19099
## P2_w0 69 F control 0 1 0 0 29 37182
## P2_w12 69 F abatacept 1 2 0 0 37 26427
## P20_w0 55 F control 0 0 0 0 15 46330
## P20_w12 55 F abatacept 0 2 0 0 57 34773
## P21_w0 78 F control 0 0 0 0 33 16612
## P21_w12 78 F abatacept 0 3 0 0 37 18768
## P22_w0 61 F control 0 0 0 0 40 13119
## P22_w12 61 F abatacept 0 0 0 1 14 20901
## P23_w0 35 F control 1 1 0 0 17 17706
## P23_w12 35 F abatacept 0 1 2 0 36 22766
## P24_w0 51 M control 0 2 1 0 1 0
## P24_w12 51 M abatacept 2 1 0 0 0 0
## P25_w0 56 M control 0 1 0 0 0 0
## P25_w12 56 M abatacept 3 1 0 1 3 0
## P26_w0 68 F control 4 0 0 0 20 24503
## P26_w12 68 F abatacept 4 1 1 1 24 22650
## P27_w0 56 F control 1 1 0 0 47 18171
## P27_w12 56 F abatacept 0 0 0 0 27 9291
## P28_w0 66 F control 0 3 0 0 23 24365
## P28_w12 66 F abatacept 1 5 0 2 27 24685
## P29_w0 55 F control 2 3 0 0 31 18424
## P29_w12 55 F abatacept 4 1 0 0 22 16213
## P3_w0 70 F control 1 4 0 0 26 31608
## P3_w12 70 F abatacept 0 1 0 0 23 20866
## P30_w0 63 M control 0 1 0 0 1 0
## P30_w12 63 M abatacept 0 1 0 0 0 0
## P31_w0 62 M control 0 1 1 0 0 2
## P31_w12 62 M abatacept 9 0 1 0 0 0
## P32_w0 43 F control 49 0 0 0 15 22362
## P32_w12 43 F abatacept 0 0 0 0 49 13612
## P33_w0 76 M control 1 1 0 0 2 0
## P33_w12 76 M abatacept 1 0 0 1 1 0
## P34_w0 74 F control 1 1 1 0 36 26925
## P34_w12 74 F abatacept 1 4 0 0 37 26379
## P35_w0 68 F control 0 4 0 0 31 30501
## P35_w12 68 F abatacept 0 4 0 0 30 27625
## P36_w0 37 F control 0 0 0 0 22 19285
## P36_w12 37 F abatacept 0 5 0 0 7 23282
## P37_w0 60 M control 0 2 0 0 0 0
## P37_w12 60 M abatacept 0 2 0 1 0 0
## P38_w0 69 F control 1 0 0 1 8 25390
## P38_w12 69 F abatacept 3 0 1 0 10 33212
## P4_w0 27 F control 0 0 0 0 20 18096
## P4_w12 27 F abatacept 0 1 0 0 16 15650
## P5_w0 71 F control 2 2 0 0 28 29488
## P5_w12 71 F abatacept 3 3 0 1 14 41655
## P6_w0 68 F control 1 1 0 0 63 10778
## P6_w12 68 F abatacept 0 1 2 0 81 13843
## P7_w0 58 F control 2 2 0 0 18 50830
## P7_w12 58 F abatacept 2 1 0 0 15 25678
## P8_w0 45 F control 0 0 0 0 20 24721
## P8_w12 45 F abatacept 0 2 0 0 28 22624
## P9_w0 47 F control 0 0 0 0 17 25929
## P9_w12 47 F abatacept 0 1 1 0 7 15750
## RP11.806O11.1 RP11.317J10.4 AL161626.1 PGF
## P1_w0 0 0 44 0
## P1_w12 0 0 5 2
## P10_w0 0 0 0 2
## P10_w12 0 0 0 1
## P11_w0 0 1 1 0
## P11_w12 0 0 4 1
## P12_w0 0 0 0 19
## P12_w12 0 0 0 75
## P13_w0 0 0 5 1
## P13_w12 0 1 81 1
## P14_w0 0 0 0 0
## P14_w12 0 0 0 0
## P15_w0 0 0 35 0
## P15_w12 0 0 13 6
## P16_w0 0 0 0 5
## P16_w12 10 1 2 9
## P17_w0 0 0 0 7
## P17_w12 0 0 0 0
## P18_w0 0 0 0 0
## P18_w12 0 0 1 0
## P19_w0 0 0 1 2
## P19_w12 0 0 0 2
## P2_w0 0 0 3 30
## P2_w12 0 2 0 6
## P20_w0 0 0 0 0
## P20_w12 0 0 0 2
## P21_w0 1 0 0 3
## P21_w12 3 0 32 0
## P22_w0 0 0 3 10
## P22_w12 0 0 17 12
## P23_w0 0 0 0 0
## P23_w12 0 0 0 0
## P24_w0 0 0 0 1
## P24_w12 0 0 2 1
## P25_w0 0 0 7 1
## P25_w12 0 0 4 1
## P26_w0 0 0 0 3
## P26_w12 0 0 0 3
## P27_w0 0 0 0 13
## P27_w12 0 1 0 7
## P28_w0 0 0 0 0
## P28_w12 0 2 0 1
## P29_w0 0 0 0 0
## P29_w12 0 0 0 1
## P3_w0 0 0 1 8
## P3_w12 0 1 3 7
## P30_w0 0 0 1 0
## P30_w12 0 0 0 10
## P31_w0 0 0 9 1
## P31_w12 0 1 3 21
## P32_w0 0 0 0 1
## P32_w12 0 0 0 1
## P33_w0 0 0 34 0
## P33_w12 0 0 2 1
## P34_w0 0 0 2 2
## P34_w12 0 0 0 9
## P35_w0 0 0 0 7
## P35_w12 0 0 0 1
## P36_w0 0 0 0 0
## P36_w12 0 0 20 2
## P37_w0 0 0 1 0
## P37_w12 0 1 0 7
## P38_w0 0 0 1 3
## P38_w12 0 0 0 0
## P4_w0 0 0 0 0
## P4_w12 0 0 0 0
## P5_w0 0 0 0 2
## P5_w12 0 0 0 4
## P6_w0 0 0 38 2
## P6_w12 1 1 0 0
## P7_w0 0 0 2 2
## P7_w12 0 0 5 7
## P8_w0 0 0 0 0
## P8_w12 0 0 0 2
## P9_w0 0 0 0 1
## P9_w12 0 0 0 0
Lets now repeat our random forest classifier on this data by treatment, then gender, and finally by Age younger than 50 or 50 and older.
set.seed(12345)
inTrain <- createDataPartition(y=ML_up$treatment, p=0.7, list=FALSE)
trainingSet <- ML_up[inTrain,]
testingSet <- ML_up[-inTrain,]
dim(trainingSet)
## [1] 54 13
dim(testingSet)
## [1] 22 13
Lets use random forest, knn, and glm algorithms to predict the ratings.
rf_boot <- train(treatment~., method='rf',
na.action=na.pass,
data=(trainingSet), preProc = c("center", "scale"),
trControl=trainControl(method='boot'), number=5)
predRF_boot <- predict(rf_boot, testingSet)
DF_boot <- data.frame(predRF_boot, type=testingSet$treatment)
length_boot <- length(DF_boot$type)
sum_boot <- sum(DF_boot$predRF_boot==DF_boot$type)
accRF_boot <- (sum_boot/length_boot)
accRF_boot
## [1] 0.6818182
head(DF_boot,30)
## predRF_boot type
## 1 control control
## 2 control abatacept
## 3 control control
## 4 abatacept control
## 5 control control
## 6 abatacept abatacept
## 7 control control
## 8 abatacept abatacept
## 9 control abatacept
## 10 abatacept abatacept
## 11 abatacept control
## 12 abatacept control
## 13 abatacept abatacept
## 14 control control
## 15 abatacept control
## 16 abatacept abatacept
## 17 abatacept abatacept
## 18 control abatacept
## 19 control control
## 20 abatacept abatacept
## 21 control control
## 22 abatacept abatacept
errored <- DF_boot[(DF_boot$predRF_boot != DF_boot$type),]
errored
## predRF_boot type
## 2 control abatacept
## 4 abatacept control
## 9 control abatacept
## 11 abatacept control
## 12 abatacept control
## 15 abatacept control
## 18 control abatacept
Lets see which samples created an error in our random forest model.
rn <- as.numeric(row.names(errored))
rn
## [1] 2 4 9 11 12 15 18
rfError <- testingSet[rn,]
rfError
## Age gender treatment KCNJ10 OR2L3 CTB.57H20.1 RP11.561N12.5 HEPH XIST
## P13_w12 59 F abatacept 3 0 0 1 17 40961
## P2_w0 69 F control 0 1 0 0 29 37182
## P27_w12 56 F abatacept 0 0 0 0 27 9291
## P29_w0 55 F control 2 3 0 0 31 18424
## P3_w0 70 F control 1 4 0 0 26 31608
## P35_w0 68 F control 0 4 0 0 31 30501
## P4_w12 27 F abatacept 0 1 0 0 16 15650
## RP11.806O11.1 RP11.317J10.4 AL161626.1 PGF
## P13_w12 0 1 81 1
## P2_w0 0 0 3 30
## P27_w12 0 1 0 7
## P29_w0 0 0 0 0
## P3_w0 0 0 1 8
## P35_w0 0 0 0 7
## P4_w12 0 0 0 0
With only our genes with over expression the classification by treatment wasn’t improved. Lets try by gender, and recall when we used gender to classify on all the selected genes we scored a 100% accuracy. We will see if it is as good or worse.
set.seed(12345)
inTrain <- createDataPartition(y=ML_up$gender, p=0.7, list=FALSE)
trainingSet <- ML_up[inTrain,]
testingSet <- ML_up[-inTrain,]
dim(trainingSet)
## [1] 54 13
dim(testingSet)
## [1] 22 13
Lets use random forest, knn, and glm algorithms to predict the ratings.
rf_boot <- train(gender~., method='rf',
na.action=na.pass,
data=(trainingSet), preProc = c("center", "scale"),
trControl=trainControl(method='boot'), number=5)
predRF_boot <- predict(rf_boot, testingSet)
DF_boot <- data.frame(predRF_boot, type=testingSet$gender)
length_boot <- length(DF_boot$type)
sum_boot <- sum(DF_boot$predRF_boot==DF_boot$type)
accRF_boot <- (sum_boot/length_boot)
accRF_boot
## [1] 1
head(DF_boot,30)
## predRF_boot type
## 1 F F
## 2 F F
## 3 F F
## 4 F F
## 5 F F
## 6 F F
## 7 F F
## 8 F F
## 9 M M
## 10 F F
## 11 F F
## 12 F F
## 13 F F
## 14 M M
## 15 F F
## 16 M M
## 17 F F
## 18 F F
## 19 F F
## 20 F F
## 21 F F
## 22 F F
errored <- DF_boot[(DF_boot$predRF_boot != DF_boot$type),]
errored
## [1] predRF_boot type
## <0 rows> (or 0-length row.names)
Lets see which samples created an error in our random forest model.
rn <- as.numeric(row.names(errored))
rn
## numeric(0)
rfError <- testingSet[rn,]
rfError
## [1] Age gender treatment KCNJ10 OR2L3
## [6] CTB.57H20.1 RP11.561N12.5 HEPH XIST RP11.806O11.1
## [11] RP11.317J10.4 AL161626.1 PGF
## <0 rows> (or 0-length row.names)
Our model on only most expressed genes scored 100% classifying by gender. Our selected genes must have a significant change by gender to be able to classify each sample as male or female correctly for all samples.
Now we’ll add the age group.
ML_up2 <- ML_up
ML_up2$Age <- ifelse(ML_up2$Age < 50, 'younger','older')
head(ML_up2)
## Age gender treatment KCNJ10 OR2L3 CTB.57H20.1 RP11.561N12.5 HEPH
## P1_w0 older F control 0 0 0 0 28
## P1_w12 older F abatacept 0 0 0 1 19
## P10_w0 older F control 1 0 0 0 28
## P10_w12 older F abatacept 1 2 0 0 15
## P11_w0 younger F control 3 0 0 1 22
## P11_w12 younger F abatacept 3 2 2 0 22
## XIST RP11.806O11.1 RP11.317J10.4 AL161626.1 PGF
## P1_w0 24708 0 0 44 0
## P1_w12 45725 0 0 5 2
## P10_w0 24619 0 0 0 2
## P10_w12 19769 0 0 0 1
## P11_w0 22017 0 1 1 0
## P11_w12 16788 0 0 4 1
set.seed(12345)
inTrain <- createDataPartition(y=ML_up2$gender, p=0.7, list=FALSE)
trainingSet <- ML_up2[inTrain,]
testingSet <- ML_up2[-inTrain,]
dim(trainingSet)
## [1] 54 13
dim(testingSet)
## [1] 22 13
Lets use random forest, knn, and glm algorithms to predict the ratings.
rf_boot <- train(Age~., method='rf',
na.action=na.pass,
data=(trainingSet), preProc = c("center", "scale"),
trControl=trainControl(method='boot'), number=5)
predRF_boot <- predict(rf_boot, testingSet)
DF_boot <- data.frame(predRF_boot, type=testingSet$Age)
length_boot <- length(DF_boot$type)
sum_boot <- sum(DF_boot$predRF_boot==DF_boot$type)
accRF_boot <- (sum_boot/length_boot)
accRF_boot
## [1] 0.7727273
head(DF_boot,30)
## predRF_boot type
## 1 older younger
## 2 older younger
## 3 older older
## 4 older older
## 5 younger older
## 6 older younger
## 7 younger older
## 8 older older
## 9 older older
## 10 older older
## 11 older older
## 12 older older
## 13 older older
## 14 older older
## 15 younger younger
## 16 older older
## 17 older older
## 18 older older
## 19 older older
## 20 younger younger
## 21 older older
## 22 younger younger
errored <- DF_boot[(DF_boot$predRF_boot != DF_boot$type),]
errored
## predRF_boot type
## 1 older younger
## 2 older younger
## 5 younger older
## 6 older younger
## 7 younger older
Lets see which samples created an error in our random forest model.
rn <- as.numeric(row.names(errored))
rn
## [1] 1 2 5 6 7
rfError <- testingSet[rn,]
rfError
## Age gender treatment KCNJ10 OR2L3 CTB.57H20.1 RP11.561N12.5 HEPH
## P11_w0 younger F control 3 0 0 1 22
## P12_w0 younger F control 2 0 0 0 93
## P18_w0 older F control 0 1 0 0 41
## P19_w0 younger F control 1 0 0 0 50
## P20_w0 older F control 0 0 0 0 15
## XIST RP11.806O11.1 RP11.317J10.4 AL161626.1 PGF
## P11_w0 22017 0 1 1 0
## P12_w0 12388 0 0 0 19
## P18_w0 6869 0 0 0 0
## P19_w0 18051 0 0 1 2
## P20_w0 46330 0 0 0 0
This data frame using only the most expressed few genes in each group didn’t classify better than the last group. The same misclassifications were still by female, but not all are older misclassifications. Because there are a few females that were older and classified as younger based on these genes in RA patients most expressed in week12 compared to week0 of having been treated with Abatacept in both groups. These are also unique females and all in the 1st week or week0. That does show some improvement, then because all week 12 samples were able to be classified correctly as younger than 50 or 50 plus in age.
I doublt we will do better with other models, but here are the schematics or coding parameters to use with altered data in some cases.
knn_boot <- train(treatment ~ .,
method='knn', preProcess=c('center','scale'),
tuneLength=10, trControl=trainControl(method='boot'),
data=trainingSet)
predKNN_boot <- predict(knn_boot, testingSet)
DF_KNN_boot <- data.frame(predKNN_boot, type=testingSet$treatment)
length_KNN_boot <- length(DF_KNN_boot$type)
sum_KNN_boot <- sum(DF_KNN_boot$predKNN_boot==DF_KNN_boot$type)
accKNN_boot <- (sum_KNN_boot/length_KNN_boot)
accKNN_boot
## [1] 0.7272727
head(DF_KNN_boot,30)
## predKNN_boot type
## 1 abatacept control
## 2 control control
## 3 control control
## 4 abatacept abatacept
## 5 control control
## 6 control control
## 7 control control
## 8 control control
## 9 abatacept abatacept
## 10 control control
## 11 control control
## 12 control abatacept
## 13 abatacept control
## 14 abatacept abatacept
## 15 control abatacept
## 16 abatacept abatacept
## 17 control control
## 18 abatacept abatacept
## 19 control abatacept
## 20 control abatacept
## 21 control control
## 22 control control
errored <- DF_KNN_boot[(DF_KNN_boot$predKNN_boot != DF_KNN_boot$type),]
errored
## predKNN_boot type
## 1 abatacept control
## 12 control abatacept
## 13 abatacept control
## 15 control abatacept
## 19 control abatacept
## 20 control abatacept
Lets see which samples created an error in our random forest model.
rn <- as.numeric(row.names(errored))
rn
## [1] 1 12 13 15 19 20
rfError <- testingSet[rn,]
rfError
## Age gender treatment KCNJ10 OR2L3 CTB.57H20.1 RP11.561N12.5 HEPH
## P11_w0 younger F control 3 0 0 1 22
## P27_w12 older F abatacept 0 0 0 0 27
## P3_w0 older F control 1 4 0 0 26
## P32_w12 younger F abatacept 0 0 0 0 49
## P38_w12 older F abatacept 3 0 1 0 10
## P4_w12 younger F abatacept 0 1 0 0 16
## XIST RP11.806O11.1 RP11.317J10.4 AL161626.1 PGF
## P11_w0 22017 0 1 1 0
## P27_w12 9291 0 1 0 7
## P3_w0 31608 0 0 1 8
## P32_w12 13612 0 0 0 1
## P38_w12 33212 0 0 0 0
## P4_w12 15650 0 0 0 0
GeneralizedBoostedModel
gbmMod <- train(treatment~., method='gbm', data=trainingSet, verbose=FALSE )
predGbm <- predict(gbmMod, testingSet)
sumGBM0 <- sum(predGbm==testingSet$treatment)
lengthGBM0 <- length(testingSet$treatment)
accuracy_gbmMod <- sumGBM0/lengthGBM0
accuracy_gbmMod
## [1] 0.3636364
DF_GBM <- data.frame(predGbm, type=testingSet$treatment)
head(DF_GBM,30)
## predGbm type
## 1 control control
## 2 abatacept control
## 3 abatacept control
## 4 control abatacept
## 5 control control
## 6 control control
## 7 control control
## 8 control control
## 9 control abatacept
## 10 abatacept control
## 11 abatacept control
## 12 abatacept abatacept
## 13 abatacept control
## 14 control abatacept
## 15 control abatacept
## 16 control abatacept
## 17 abatacept control
## 18 abatacept abatacept
## 19 control abatacept
## 20 control abatacept
## 21 abatacept control
## 22 control control
errored <- DF_GBM[(DF_GBM$predGbm != DF_GBM$type),]
errored
## predGbm type
## 2 abatacept control
## 3 abatacept control
## 4 control abatacept
## 9 control abatacept
## 10 abatacept control
## 11 abatacept control
## 13 abatacept control
## 14 control abatacept
## 15 control abatacept
## 16 control abatacept
## 17 abatacept control
## 19 control abatacept
## 20 control abatacept
## 21 abatacept control
Lets see which samples created an error in our random forest model.
rn <- as.numeric(row.names(errored))
rn
## [1] 2 3 4 9 10 11 13 14 15 16 17 19 20 21
rfError <- testingSet[rn,]
rfError
## Age gender treatment KCNJ10 OR2L3 CTB.57H20.1 RP11.561N12.5 HEPH
## P12_w0 younger F control 2 0 0 0 93
## P16_w0 older F control 3 0 0 0 16
## P17_w12 older F abatacept 2 0 0 1 12
## P25_w12 older M abatacept 3 1 0 1 3
## P26_w0 older F control 4 0 0 0 20
## P27_w0 older F control 1 1 0 0 47
## P3_w0 older F control 1 4 0 0 26
## P31_w12 older M abatacept 9 0 1 0 0
## P32_w12 younger F abatacept 0 0 0 0 49
## P33_w12 older M abatacept 1 0 0 1 1
## P34_w0 older F control 1 1 1 0 36
## P38_w12 older F abatacept 3 0 1 0 10
## P4_w12 younger F abatacept 0 1 0 0 16
## P6_w0 older F control 1 1 0 0 63
## XIST RP11.806O11.1 RP11.317J10.4 AL161626.1 PGF
## P12_w0 12388 0 0 0 19
## P16_w0 31607 0 0 0 5
## P17_w12 24028 0 0 0 0
## P25_w12 0 0 0 4 1
## P26_w0 24503 0 0 0 3
## P27_w0 18171 0 0 0 13
## P3_w0 31608 0 0 1 8
## P31_w12 0 0 1 3 21
## P32_w12 13612 0 0 0 1
## P33_w12 0 0 0 2 1
## P34_w0 26925 0 0 2 2
## P38_w12 33212 0 0 0 0
## P4_w12 15650 0 0 0 0
## P6_w0 10778 0 0 38 2
Linkage dirichlet allocation model
ldaMod <- train(treatment~., method='lda', data=trainingSet)
predlda <- predict(ldaMod, testingSet)
sumLDA0 <- sum(predlda==testingSet$treatment)
lengthLDA0 <- length(testingSet$treatment)
accuracy_ldaMod <- sumLDA0/lengthLDA0
accuracy_ldaMod
## [1] 0.5454545
DF_LDA <- data.frame(predlda, type=testingSet$treatment)
head(DF_LDA,30)
## predlda type
## 1 abatacept control
## 2 control control
## 3 control control
## 4 control abatacept
## 5 control control
## 6 control control
## 7 control control
## 8 control control
## 9 control abatacept
## 10 control control
## 11 control control
## 12 control abatacept
## 13 abatacept control
## 14 abatacept abatacept
## 15 control abatacept
## 16 control abatacept
## 17 abatacept control
## 18 abatacept abatacept
## 19 control abatacept
## 20 control abatacept
## 21 control control
## 22 control control
errored <- DF_LDA[(DF_LDA$predlda != DF_LDA$type),]
errored
## predlda type
## 1 abatacept control
## 4 control abatacept
## 9 control abatacept
## 12 control abatacept
## 13 abatacept control
## 15 control abatacept
## 16 control abatacept
## 17 abatacept control
## 19 control abatacept
## 20 control abatacept
Lets see which samples created an error in our random forest model.
rn <- as.numeric(row.names(errored))
rn
## [1] 1 4 9 12 13 15 16 17 19 20
rfError <- testingSet[rn,]
rfError
## Age gender treatment KCNJ10 OR2L3 CTB.57H20.1 RP11.561N12.5 HEPH
## P11_w0 younger F control 3 0 0 1 22
## P17_w12 older F abatacept 2 0 0 1 12
## P25_w12 older M abatacept 3 1 0 1 3
## P27_w12 older F abatacept 0 0 0 0 27
## P3_w0 older F control 1 4 0 0 26
## P32_w12 younger F abatacept 0 0 0 0 49
## P33_w12 older M abatacept 1 0 0 1 1
## P34_w0 older F control 1 1 1 0 36
## P38_w12 older F abatacept 3 0 1 0 10
## P4_w12 younger F abatacept 0 1 0 0 16
## XIST RP11.806O11.1 RP11.317J10.4 AL161626.1 PGF
## P11_w0 22017 0 1 1 0
## P17_w12 24028 0 0 0 0
## P25_w12 0 0 0 4 1
## P27_w12 9291 0 1 0 7
## P3_w0 31608 0 0 1 8
## P32_w12 13612 0 0 0 1
## P33_w12 0 0 0 2 1
## P34_w0 26925 0 0 2 2
## P38_w12 33212 0 0 0 0
## P4_w12 15650 0 0 0 0
The random forest classification was better using the most expressed genes and leaving out the least expressed genes by treatment type. But this was designed to have the outcome from the start, because both weeks were given Abatacept and blood taken in week12 after giving it Abatacept. So, there isn’t much to gain in comparing RA patients’ blood in three months.
The generalized linear model is a regression, but we can make that data and test it.
ML_up3 <- ML_up2
ML_up3$treatment <- gsub('control',0,ML_up3$treatment)
ML_up3$treatment <- gsub('abatacept',1,ML_up3$treatment)
ML_up3$treatment <- as.integer(ML_up3$treatment)
set.seed(12345)
inTrain <- createDataPartition(y=ML_up3$treatment, p=0.7, list=FALSE)
trainingSet1 <- ML_up3[inTrain,]
testingSet1 <- ML_up3[-inTrain,]
trainingSet <- ML_up3[inTrain,-c(1:2)]
testingSet <- ML_up3[-inTrain,-c(1:2)]
dim(trainingSet)
## [1] 54 11
dim(testingSet)
## [1] 22 11
glmMod2 <- train(treatment ~ .,
method='glm', data=trainingSet)
predglm2 <- predict(glmMod2, testingSet)
DF_glm2 <- data.frame(predglm2, type=testingSet$treatment)
DF_glm2$predglm2 <- ifelse(DF_glm2$predglm2<.5,0,1)
length_glm2 <- length(DF_glm2$type)
sum_glm2 <- sum((DF_glm2$predglm2)==DF_glm2$type)
accglm2 <- (sum_glm2/length_glm2)
accglm2
## [1] 0.5909091
head(DF_glm2,30)
## predglm2 type
## P10_w12 0 1
## P13_w0 0 0
## P15_w12 1 1
## P2_w12 0 1
## P21_w12 0 1
## P22_w0 0 0
## P25_w0 0 0
## P25_w12 1 1
## P27_w0 0 0
## P28_w0 0 0
## P29_w12 0 1
## P3_w12 0 1
## P31_w0 1 0
## P32_w12 0 1
## P35_w12 0 1
## P36_w0 0 0
## P37_w0 0 0
## P4_w0 0 0
## P6_w0 0 0
## P6_w12 1 1
## P8_w0 0 0
## P8_w12 0 1
errored <- DF_glm2[(DF_glm2$predglm2 != DF_glm2$type),]
errored
## predglm2 type
## P10_w12 0 1
## P2_w12 0 1
## P21_w12 0 1
## P29_w12 0 1
## P3_w12 0 1
## P31_w0 1 0
## P32_w12 0 1
## P35_w12 0 1
## P8_w12 0 1
Lets see which samples created an error in our random forest model.
rn <- row.names(errored)
rn
## [1] "P10_w12" "P2_w12" "P21_w12" "P29_w12" "P3_w12" "P31_w0" "P32_w12"
## [8] "P35_w12" "P8_w12"
rfError <- testingSet1[rn,]
rfError
## Age gender treatment KCNJ10 OR2L3 CTB.57H20.1 RP11.561N12.5 HEPH
## P10_w12 older F 1 1 2 0 0 15
## P2_w12 older F 1 1 2 0 0 37
## P21_w12 older F 1 0 3 0 0 37
## P29_w12 older F 1 4 1 0 0 22
## P3_w12 older F 1 0 1 0 0 23
## P31_w0 older M 0 0 1 1 0 0
## P32_w12 younger F 1 0 0 0 0 49
## P35_w12 older F 1 0 4 0 0 30
## P8_w12 younger F 1 0 2 0 0 28
## XIST RP11.806O11.1 RP11.317J10.4 AL161626.1 PGF
## P10_w12 19769 0 0 0 1
## P2_w12 26427 0 2 0 6
## P21_w12 18768 3 0 32 0
## P29_w12 16213 0 0 0 1
## P3_w12 20866 0 1 3 7
## P31_w0 2 0 0 9 1
## P32_w12 13612 0 0 0 1
## P35_w12 27625 0 0 0 1
## P8_w12 22624 0 0 0 2
For regression using the linear regression of generalized linear models in caret, most misclassifications for treatment were the older females, but one younger female and one older male misclassified in the wrong treatment group. They were all erroneously classified as week12 except for the one misclassified male in week0 who was misclassifed as week12.
We know the most expressed genes improve the classification of the treatment or control slightly and the females 100% in either most or most & least expressed genes according to their fold change values of treatment/control ratios. And the age was improved in classifying younger than 50 or 50 and older to week0 females instead of by a mix of treatment and control for females misclassified by age using both the most and least fold change values. The random forest model we used was more accurate in making those classifications but we did test out some other machine learning models for k-nearest neighbor, a topic modeling algorithm called latent dirichlet allocation, gradient boosted models, and generalized linear model with a modified data frame for regression. Under the hood the models work differently to group the samples according to their most logical class according to our feature vectors of the genes and in some cases the age, gender, and/or treatment, depending on which model and if classification or regression used and which feature was the predictor or outcome feature. The topic modeling algorithm does decent in classifying because it usually used a bunch of tokens or words from a document that it weights and sorts or classifies according to topic of a story it has trained already or in its bank. It uses the genes and demographics as tokens with a numerical weight to classify.