RA simulating COVID-19 study with ML

This data was obtained from the National Center for Bioinformatics (NCBI) Gene Expression Omnibus (GEO) online data repository on COVID-19 data using a mixed group of patient who all have the autoimmune disease, Rheumatoid Arthritis (RA). Their blood was taken in a special extraction method to preserve the RNA, then treated with abatacept, the RA drug to control autoimmune response for RA patients. There were 76 samples of a week 0 and week 12 where week 0 is the control in 38 patients, and week 12 is the CD80/86 costimulation inhibitor of T cells that produce the same or similar pathological processes as COVID-19 does in patients’ who suffer lung congestion till death in some cases. The access number for this data is GSE151161 and it can be (downloaded)[https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE151161] with the text metadata information shown immediately below and the RAW data values of gene expression. No need to add the platform information, because the RAW data has the gene name attached to the Raw RNA gene expression data. The data resource link: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE151161.

Originally, I thought this was blood infected with COVID-19, but it isn’t. It is blood that simulated the same disease process that kills the patient when the lungs and organs become congested with inflammation from the immune response. I wasn’t able to get the fastq files that were on COVID-19 from a different study to align with a sequence alignment app to identify each gene. Although those files are really helpful if you know how to do that process already. This study did use STAR to align their fastq files, and attached the gene symbols to each sample’s gene expression value.

Here is the original summary on NCBI’s GEO briefly describing this study, and this is why I thought it was COVID-19 samples:

__**"Series GSE151161 Query DataSets for GSE151161 Status Public on Jul 01, 2020 Title Blocking of the CD80/86 axis as a therapeutic approach to prevent progression to more severe forms of COVID-19 Organism Homo sapiens Experiment type Expression profiling by high throughput sequencing Summary In its more severe forms, COVID-19 progresses towards an excessive immune response, leading to the systemic overexpression of proinflammatory cytokines like IL6, mostly from the infected lungs. This cytokine storm can cause multiple organ damage and death. Consequently, there is a pressing need to identify therapies to treat and prevent severe symptoms during COVID-19. Based on previous clinical evidence, we hypothesized that inhibiting T cell co-stimulation by blocking CD80/86 could be an effective therapeutic strategy against progression to severe proinflammatory states. To support this hypothesis, we performed an analysis integrating blood transcriptional data we generated from rheumatoid arthritis patients treated with abatacept -a CD80/86 costimulation inhibitor- with the pathological features associated with COVID-19, particularly in its more severe forms. We have found that many of the biological processes that have been consistently associated with COVID-19 pathology are reversed by CD80/86 co-stimulation inhibition, including the downregulation of IL6 production. Also, analysis of previous transcriptional data from blood of SARS-CoVinfected patients showed that the response to abatacept has a very high level of antagonism to that elicited by COVID-19. Finally, analyzing a recent single cell RNA-seq dataset from bronchoalveolar lavage fluid cells from COVID-19 patients, we found a significant correlation along the main elements of the C80/86 axis: CD86+/80+ antigen presenting cells, activated CD4+ T cells and IL6 production. Our in-silico study provides additional support to the hypothesis that blocking of the CD80/CD86 signaling axis may be protective of the excessive proinflammatory state associated with COVID-19 in the lungs

Overall design Whole RNAseq Blood Samples of patients treated with abatacept at week 0 and week 12"__**

This first section pulls the data in from the commented text file of metadata information, arranges the demographic data to align with each of the RAW sample values.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

source("geneCards.R")

## Warning: package 'rvest' was built under R version 3.6.3

## Loading required package: xml2

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

info <- file('GSE151161_series_matrix.txt')
info2 <- readLines(con=info, n=-1)#, encoding='utf-8')

#info2

Info <- info2[c(30:32,38:44)]
head(Info)

## [1] ""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [2] "!Sample_title\t\"Patient 1 week0\"\t\"Patient 1 week12\"\t\"Patient 2 week0\"\t\"Patient 2 week12\"\t\"Patient 15 week0\"\t\"Patient 15 week12\"\t\"Patient 24 week0\"\t\"Patient 24 week12\"\t\"Patient 16 week0\"\t\"Patient 16 week12\"\t\"Patient 13 week0\"\t\"Patient 13 week12\"\t\"Patient 22 week0\"\t\"Patient 22 week12\"\t\"Patient 31 week0\"\t\"Patient 31 week12\"\t\"Patient 3 week0\"\t\"Patient 3 week12\"\t\"Patient 18 week0\"\t\"Patient 18 week12\"\t\"Patient 23 week0\"\t\"Patient 23 week12\"\t\"Patient 21 week0\"\t\"Patient 21 week12\"\t\"Patient 25 week0\"\t\"Patient 25 week12\"\t\"Patient 26 week0\"\t\"Patient 26 week12\"\t\"Patient 27 week0\"\t\"Patient 27 week12\"\t\"Patient 20 week0\"\t\"Patient 20 week12\"\t\"Patient 34 week0\"\t\"Patient 34 week12\"\t\"Patient 14 week0\"\t\"Patient 14 week12\"\t\"Patient 17 week0\"\t\"Patient 17 week12\"\t\"Patient 19 week0\"\t\"Patient 19 week12\"\t\"Patient 4 week0\"\t\"Patient 4 week12\"\t\"Patient 35 week0\"\t\"Patient 35 week12\"\t\"Patient 28 week0\"\t\"Patient 28 week12\"\t\"Patient 12 week0\"\t\"Patient 12 week12\"\t\"Patient 8 week0\"\t\"Patient 8 week12\"\t\"Patient 5 week0\"\t\"Patient 5 week12\"\t\"Patient 38 week0\"\t\"Patient 38 week12\"\t\"Patient 10 week0\"\t\"Patient 10 week12\"\t\"Patient 29 week0\"\t\"Patient 29 week12\"\t\"Patient 37 week0\"\t\"Patient 37 week12\"\t\"Patient 6 week0\"\t\"Patient 6 week12\"\t\"Patient 36 week0\"\t\"Patient 36 week12\"\t\"Patient 9 week0\"\t\"Patient 9 week12\"\t\"Patient 32 week0\"\t\"Patient 32 week12\"\t\"Patient 7 week0\"\t\"Patient 7 week12\"\t\"Patient 33 week0\"\t\"Patient 33 week12\"\t\"Patient 11 week0\"\t\"Patient 11 week12\"\t\"Patient 30 week0\"\t\"Patient 30 week12\""
## [3] "!Sample_geo_accession\t\"GSM4567526\"\t\"GSM4567527\"\t\"GSM4567528\"\t\"GSM4567529\"\t\"GSM4567530\"\t\"GSM4567531\"\t\"GSM4567532\"\t\"GSM4567533\"\t\"GSM4567534\"\t\"GSM4567535\"\t\"GSM4567536\"\t\"GSM4567537\"\t\"GSM4567538\"\t\"GSM4567539\"\t\"GSM4567540\"\t\"GSM4567541\"\t\"GSM4567542\"\t\"GSM4567543\"\t\"GSM4567544\"\t\"GSM4567545\"\t\"GSM4567546\"\t\"GSM4567547\"\t\"GSM4567548\"\t\"GSM4567549\"\t\"GSM4567550\"\t\"GSM4567551\"\t\"GSM4567552\"\t\"GSM4567553\"\t\"GSM4567554\"\t\"GSM4567555\"\t\"GSM4567556\"\t\"GSM4567557\"\t\"GSM4567558\"\t\"GSM4567559\"\t\"GSM4567560\"\t\"GSM4567561\"\t\"GSM4567562\"\t\"GSM4567563\"\t\"GSM4567564\"\t\"GSM4567565\"\t\"GSM4567566\"\t\"GSM4567567\"\t\"GSM4567568\"\t\"GSM4567569\"\t\"GSM4567570\"\t\"GSM4567571\"\t\"GSM4567572\"\t\"GSM4567573\"\t\"GSM4567574\"\t\"GSM4567575\"\t\"GSM4567576\"\t\"GSM4567577\"\t\"GSM4567578\"\t\"GSM4567579\"\t\"GSM4567580\"\t\"GSM4567581\"\t\"GSM4567582\"\t\"GSM4567583\"\t\"GSM4567584\"\t\"GSM4567585\"\t\"GSM4567586\"\t\"GSM4567587\"\t\"GSM4567588\"\t\"GSM4567589\"\t\"GSM4567590\"\t\"GSM4567591\"\t\"GSM4567592\"\t\"GSM4567593\"\t\"GSM4567594\"\t\"GSM4567595\"\t\"GSM4567596\"\t\"GSM4567597\"\t\"GSM4567598\"\t\"GSM4567599\"\t\"GSM4567600\"\t\"GSM4567601\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
## [4] "!Sample_source_name_ch1\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\"\t\"Whole Blood\""                                                                                                                                                                                                                                                                                                                                                                                                      
## [5] "!Sample_organism_ch1\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\"\t\"Homo sapiens\""                                                                                                                                                                                                                                                                                                                             
## [6] "!Sample_characteristics_ch1\t\"age: 78\"\t\"age: 78\"\t\"age: 69\"\t\"age: 69\"\t\"age: 50\"\t\"age: 50\"\t\"age: 51\"\t\"age: 51\"\t\"age: 54\"\t\"age: 54\"\t\"age: 59\"\t\"age: 59\"\t\"age: 61\"\t\"age: 61\"\t\"age: 62\"\t\"age: 62\"\t\"age: 70\"\t\"age: 70\"\t\"age: 75\"\t\"age: 75\"\t\"age: 35\"\t\"age: 35\"\t\"age: 78\"\t\"age: 78\"\t\"age: 56\"\t\"age: 56\"\t\"age: 68\"\t\"age: 68\"\t\"age: 56\"\t\"age: 56\"\t\"age: 55\"\t\"age: 55\"\t\"age: 74\"\t\"age: 74\"\t\"age: 44\"\t\"age: 44\"\t\"age: 72\"\t\"age: 72\"\t\"age: 47\"\t\"age: 47\"\t\"age: 27\"\t\"age: 27\"\t\"age: 68\"\t\"age: 68\"\t\"age: 66\"\t\"age: 66\"\t\"age: 46\"\t\"age: 46\"\t\"age: 45\"\t\"age: 45\"\t\"age: 71\"\t\"age: 71\"\t\"age: 69\"\t\"age: 69\"\t\"age: 80\"\t\"age: 80\"\t\"age: 55\"\t\"age: 55\"\t\"age: 60\"\t\"age: 60\"\t\"age: 68\"\t\"age: 68\"\t\"age: 37\"\t\"age: 37\"\t\"age: 47\"\t\"age: 47\"\t\"age: 43\"\t\"age: 43\"\t\"age: 58\"\t\"age: 58\"\t\"age: 76\"\t\"age: 76\"\t\"age: 21\"\t\"age: 21\"\t\"age: 63\"\t\"age: 63\""

Info2 <- Info[c(2:3,6:8)]
head(Info2)

## [1] "!Sample_title\t\"Patient 1 week0\"\t\"Patient 1 week12\"\t\"Patient 2 week0\"\t\"Patient 2 week12\"\t\"Patient 15 week0\"\t\"Patient 15 week12\"\t\"Patient 24 week0\"\t\"Patient 24 week12\"\t\"Patient 16 week0\"\t\"Patient 16 week12\"\t\"Patient 13 week0\"\t\"Patient 13 week12\"\t\"Patient 22 week0\"\t\"Patient 22 week12\"\t\"Patient 31 week0\"\t\"Patient 31 week12\"\t\"Patient 3 week0\"\t\"Patient 3 week12\"\t\"Patient 18 week0\"\t\"Patient 18 week12\"\t\"Patient 23 week0\"\t\"Patient 23 week12\"\t\"Patient 21 week0\"\t\"Patient 21 week12\"\t\"Patient 25 week0\"\t\"Patient 25 week12\"\t\"Patient 26 week0\"\t\"Patient 26 week12\"\t\"Patient 27 week0\"\t\"Patient 27 week12\"\t\"Patient 20 week0\"\t\"Patient 20 week12\"\t\"Patient 34 week0\"\t\"Patient 34 week12\"\t\"Patient 14 week0\"\t\"Patient 14 week12\"\t\"Patient 17 week0\"\t\"Patient 17 week12\"\t\"Patient 19 week0\"\t\"Patient 19 week12\"\t\"Patient 4 week0\"\t\"Patient 4 week12\"\t\"Patient 35 week0\"\t\"Patient 35 week12\"\t\"Patient 28 week0\"\t\"Patient 28 week12\"\t\"Patient 12 week0\"\t\"Patient 12 week12\"\t\"Patient 8 week0\"\t\"Patient 8 week12\"\t\"Patient 5 week0\"\t\"Patient 5 week12\"\t\"Patient 38 week0\"\t\"Patient 38 week12\"\t\"Patient 10 week0\"\t\"Patient 10 week12\"\t\"Patient 29 week0\"\t\"Patient 29 week12\"\t\"Patient 37 week0\"\t\"Patient 37 week12\"\t\"Patient 6 week0\"\t\"Patient 6 week12\"\t\"Patient 36 week0\"\t\"Patient 36 week12\"\t\"Patient 9 week0\"\t\"Patient 9 week12\"\t\"Patient 32 week0\"\t\"Patient 32 week12\"\t\"Patient 7 week0\"\t\"Patient 7 week12\"\t\"Patient 33 week0\"\t\"Patient 33 week12\"\t\"Patient 11 week0\"\t\"Patient 11 week12\"\t\"Patient 30 week0\"\t\"Patient 30 week12\""                                                                                                                                                                                                                              
## [2] "!Sample_geo_accession\t\"GSM4567526\"\t\"GSM4567527\"\t\"GSM4567528\"\t\"GSM4567529\"\t\"GSM4567530\"\t\"GSM4567531\"\t\"GSM4567532\"\t\"GSM4567533\"\t\"GSM4567534\"\t\"GSM4567535\"\t\"GSM4567536\"\t\"GSM4567537\"\t\"GSM4567538\"\t\"GSM4567539\"\t\"GSM4567540\"\t\"GSM4567541\"\t\"GSM4567542\"\t\"GSM4567543\"\t\"GSM4567544\"\t\"GSM4567545\"\t\"GSM4567546\"\t\"GSM4567547\"\t\"GSM4567548\"\t\"GSM4567549\"\t\"GSM4567550\"\t\"GSM4567551\"\t\"GSM4567552\"\t\"GSM4567553\"\t\"GSM4567554\"\t\"GSM4567555\"\t\"GSM4567556\"\t\"GSM4567557\"\t\"GSM4567558\"\t\"GSM4567559\"\t\"GSM4567560\"\t\"GSM4567561\"\t\"GSM4567562\"\t\"GSM4567563\"\t\"GSM4567564\"\t\"GSM4567565\"\t\"GSM4567566\"\t\"GSM4567567\"\t\"GSM4567568\"\t\"GSM4567569\"\t\"GSM4567570\"\t\"GSM4567571\"\t\"GSM4567572\"\t\"GSM4567573\"\t\"GSM4567574\"\t\"GSM4567575\"\t\"GSM4567576\"\t\"GSM4567577\"\t\"GSM4567578\"\t\"GSM4567579\"\t\"GSM4567580\"\t\"GSM4567581\"\t\"GSM4567582\"\t\"GSM4567583\"\t\"GSM4567584\"\t\"GSM4567585\"\t\"GSM4567586\"\t\"GSM4567587\"\t\"GSM4567588\"\t\"GSM4567589\"\t\"GSM4567590\"\t\"GSM4567591\"\t\"GSM4567592\"\t\"GSM4567593\"\t\"GSM4567594\"\t\"GSM4567595\"\t\"GSM4567596\"\t\"GSM4567597\"\t\"GSM4567598\"\t\"GSM4567599\"\t\"GSM4567600\"\t\"GSM4567601\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
## [3] "!Sample_characteristics_ch1\t\"age: 78\"\t\"age: 78\"\t\"age: 69\"\t\"age: 69\"\t\"age: 50\"\t\"age: 50\"\t\"age: 51\"\t\"age: 51\"\t\"age: 54\"\t\"age: 54\"\t\"age: 59\"\t\"age: 59\"\t\"age: 61\"\t\"age: 61\"\t\"age: 62\"\t\"age: 62\"\t\"age: 70\"\t\"age: 70\"\t\"age: 75\"\t\"age: 75\"\t\"age: 35\"\t\"age: 35\"\t\"age: 78\"\t\"age: 78\"\t\"age: 56\"\t\"age: 56\"\t\"age: 68\"\t\"age: 68\"\t\"age: 56\"\t\"age: 56\"\t\"age: 55\"\t\"age: 55\"\t\"age: 74\"\t\"age: 74\"\t\"age: 44\"\t\"age: 44\"\t\"age: 72\"\t\"age: 72\"\t\"age: 47\"\t\"age: 47\"\t\"age: 27\"\t\"age: 27\"\t\"age: 68\"\t\"age: 68\"\t\"age: 66\"\t\"age: 66\"\t\"age: 46\"\t\"age: 46\"\t\"age: 45\"\t\"age: 45\"\t\"age: 71\"\t\"age: 71\"\t\"age: 69\"\t\"age: 69\"\t\"age: 80\"\t\"age: 80\"\t\"age: 55\"\t\"age: 55\"\t\"age: 60\"\t\"age: 60\"\t\"age: 68\"\t\"age: 68\"\t\"age: 37\"\t\"age: 37\"\t\"age: 47\"\t\"age: 47\"\t\"age: 43\"\t\"age: 43\"\t\"age: 58\"\t\"age: 58\"\t\"age: 76\"\t\"age: 76\"\t\"age: 21\"\t\"age: 21\"\t\"age: 63\"\t\"age: 63\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
## [4] "!Sample_characteristics_ch1\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: M\"\t\"gender: M\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: M\"\t\"gender: M\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: M\"\t\"gender: M\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: M\"\t\"gender: M\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: F\"\t\"gender: M\"\t\"gender: M\"\t\"gender: F\"\t\"gender: F\"\t\"gender: M\"\t\"gender: M\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [5] "!Sample_characteristics_ch1\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\"\t\"treatment: control\"\t\"treatment: abatacept\""

Info3 <- gsub('\\t',' ', Info2, perl=T)
Info4 <- gsub('\\\"',' ',Info3, perl=T)
Info5 <- gsub('[!]','',Info4)
Info6 <- gsub('Sample_characteristics_ch1  ','',Info5)
Info6a <- gsub('Sample_title  ','',Info6)
Info6b <- gsub('Sample_geo_accession  ','',Info6a)
Info7 <- trimws(Info6b,which='right', whitespace='[ ]')
head(Info7)

## [1] "Patient 1 week0   Patient 1 week12   Patient 2 week0   Patient 2 week12   Patient 15 week0   Patient 15 week12   Patient 24 week0   Patient 24 week12   Patient 16 week0   Patient 16 week12   Patient 13 week0   Patient 13 week12   Patient 22 week0   Patient 22 week12   Patient 31 week0   Patient 31 week12   Patient 3 week0   Patient 3 week12   Patient 18 week0   Patient 18 week12   Patient 23 week0   Patient 23 week12   Patient 21 week0   Patient 21 week12   Patient 25 week0   Patient 25 week12   Patient 26 week0   Patient 26 week12   Patient 27 week0   Patient 27 week12   Patient 20 week0   Patient 20 week12   Patient 34 week0   Patient 34 week12   Patient 14 week0   Patient 14 week12   Patient 17 week0   Patient 17 week12   Patient 19 week0   Patient 19 week12   Patient 4 week0   Patient 4 week12   Patient 35 week0   Patient 35 week12   Patient 28 week0   Patient 28 week12   Patient 12 week0   Patient 12 week12   Patient 8 week0   Patient 8 week12   Patient 5 week0   Patient 5 week12   Patient 38 week0   Patient 38 week12   Patient 10 week0   Patient 10 week12   Patient 29 week0   Patient 29 week12   Patient 37 week0   Patient 37 week12   Patient 6 week0   Patient 6 week12   Patient 36 week0   Patient 36 week12   Patient 9 week0   Patient 9 week12   Patient 32 week0   Patient 32 week12   Patient 7 week0   Patient 7 week12   Patient 33 week0   Patient 33 week12   Patient 11 week0   Patient 11 week12   Patient 30 week0   Patient 30 week12"                                                                                                                                                                                                                
## [2] "GSM4567526   GSM4567527   GSM4567528   GSM4567529   GSM4567530   GSM4567531   GSM4567532   GSM4567533   GSM4567534   GSM4567535   GSM4567536   GSM4567537   GSM4567538   GSM4567539   GSM4567540   GSM4567541   GSM4567542   GSM4567543   GSM4567544   GSM4567545   GSM4567546   GSM4567547   GSM4567548   GSM4567549   GSM4567550   GSM4567551   GSM4567552   GSM4567553   GSM4567554   GSM4567555   GSM4567556   GSM4567557   GSM4567558   GSM4567559   GSM4567560   GSM4567561   GSM4567562   GSM4567563   GSM4567564   GSM4567565   GSM4567566   GSM4567567   GSM4567568   GSM4567569   GSM4567570   GSM4567571   GSM4567572   GSM4567573   GSM4567574   GSM4567575   GSM4567576   GSM4567577   GSM4567578   GSM4567579   GSM4567580   GSM4567581   GSM4567582   GSM4567583   GSM4567584   GSM4567585   GSM4567586   GSM4567587   GSM4567588   GSM4567589   GSM4567590   GSM4567591   GSM4567592   GSM4567593   GSM4567594   GSM4567595   GSM4567596   GSM4567597   GSM4567598   GSM4567599   GSM4567600   GSM4567601"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
## [3] "age: 78   age: 78   age: 69   age: 69   age: 50   age: 50   age: 51   age: 51   age: 54   age: 54   age: 59   age: 59   age: 61   age: 61   age: 62   age: 62   age: 70   age: 70   age: 75   age: 75   age: 35   age: 35   age: 78   age: 78   age: 56   age: 56   age: 68   age: 68   age: 56   age: 56   age: 55   age: 55   age: 74   age: 74   age: 44   age: 44   age: 72   age: 72   age: 47   age: 47   age: 27   age: 27   age: 68   age: 68   age: 66   age: 66   age: 46   age: 46   age: 45   age: 45   age: 71   age: 71   age: 69   age: 69   age: 80   age: 80   age: 55   age: 55   age: 60   age: 60   age: 68   age: 68   age: 37   age: 37   age: 47   age: 47   age: 43   age: 43   age: 58   age: 58   age: 76   age: 76   age: 21   age: 21   age: 63   age: 63"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
## [4] "gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: M   gender: M   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: M   gender: M   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: M   gender: M   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: M   gender: M   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: F   gender: M   gender: M   gender: F   gender: F   gender: M   gender: M"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [5] "treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept   treatment: control   treatment: abatacept"

if (file.exists('patient.csv')){
  file.remove('patient.csv')
}

## [1] TRUE

patient <- strsplit(Info7[1], split='   ')
for (i in patient){
  write.table(i,file='patient.csv',row.names=F,col.names=F)
}

Patient <- read.csv('patient.csv', sep=',',header=F)
colnames(Patient) <- 'patient'
Patient

##              patient
## 1    Patient 1 week0
## 2   Patient 1 week12
## 3    Patient 2 week0
## 4   Patient 2 week12
## 5   Patient 15 week0
## 6  Patient 15 week12
## 7   Patient 24 week0
## 8  Patient 24 week12
## 9   Patient 16 week0
## 10 Patient 16 week12
## 11  Patient 13 week0
## 12 Patient 13 week12
## 13  Patient 22 week0
## 14 Patient 22 week12
## 15  Patient 31 week0
## 16 Patient 31 week12
## 17   Patient 3 week0
## 18  Patient 3 week12
## 19  Patient 18 week0
## 20 Patient 18 week12
## 21  Patient 23 week0
## 22 Patient 23 week12
## 23  Patient 21 week0
## 24 Patient 21 week12
## 25  Patient 25 week0
## 26 Patient 25 week12
## 27  Patient 26 week0
## 28 Patient 26 week12
## 29  Patient 27 week0
## 30 Patient 27 week12
## 31  Patient 20 week0
## 32 Patient 20 week12
## 33  Patient 34 week0
## 34 Patient 34 week12
## 35  Patient 14 week0
## 36 Patient 14 week12
## 37  Patient 17 week0
## 38 Patient 17 week12
## 39  Patient 19 week0
## 40 Patient 19 week12
## 41   Patient 4 week0
## 42  Patient 4 week12
## 43  Patient 35 week0
## 44 Patient 35 week12
## 45  Patient 28 week0
## 46 Patient 28 week12
## 47  Patient 12 week0
## 48 Patient 12 week12
## 49   Patient 8 week0
## 50  Patient 8 week12
## 51   Patient 5 week0
## 52  Patient 5 week12
## 53  Patient 38 week0
## 54 Patient 38 week12
## 55  Patient 10 week0
## 56 Patient 10 week12
## 57  Patient 29 week0
## 58 Patient 29 week12
## 59  Patient 37 week0
## 60 Patient 37 week12
## 61   Patient 6 week0
## 62  Patient 6 week12
## 63  Patient 36 week0
## 64 Patient 36 week12
## 65   Patient 9 week0
## 66  Patient 9 week12
## 67  Patient 32 week0
## 68 Patient 32 week12
## 69   Patient 7 week0
## 70  Patient 7 week12
## 71  Patient 33 week0
## 72 Patient 33 week12
## 73  Patient 11 week0
## 74 Patient 11 week12
## 75  Patient 30 week0
## 76 Patient 30 week12

if (file.exists('sample.csv')){
  file.remove('sample.csv')
}

## [1] TRUE

sample <- strsplit(Info7[2], split='   ')
for (i in sample){
  write.table(i,file='sample.csv',row.names=F,col.names=F)
}

Sample <- read.csv('sample.csv', sep=',',header=F)
colnames(Sample) <- 'sample'
Sample

##        sample
## 1  GSM4567526
## 2  GSM4567527
## 3  GSM4567528
## 4  GSM4567529
## 5  GSM4567530
## 6  GSM4567531
## 7  GSM4567532
## 8  GSM4567533
## 9  GSM4567534
## 10 GSM4567535
## 11 GSM4567536
## 12 GSM4567537
## 13 GSM4567538
## 14 GSM4567539
## 15 GSM4567540
## 16 GSM4567541
## 17 GSM4567542
## 18 GSM4567543
## 19 GSM4567544
## 20 GSM4567545
## 21 GSM4567546
## 22 GSM4567547
## 23 GSM4567548
## 24 GSM4567549
## 25 GSM4567550
## 26 GSM4567551
## 27 GSM4567552
## 28 GSM4567553
## 29 GSM4567554
## 30 GSM4567555
## 31 GSM4567556
## 32 GSM4567557
## 33 GSM4567558
## 34 GSM4567559
## 35 GSM4567560
## 36 GSM4567561
## 37 GSM4567562
## 38 GSM4567563
## 39 GSM4567564
## 40 GSM4567565
## 41 GSM4567566
## 42 GSM4567567
## 43 GSM4567568
## 44 GSM4567569
## 45 GSM4567570
## 46 GSM4567571
## 47 GSM4567572
## 48 GSM4567573
## 49 GSM4567574
## 50 GSM4567575
## 51 GSM4567576
## 52 GSM4567577
## 53 GSM4567578
## 54 GSM4567579
## 55 GSM4567580
## 56 GSM4567581
## 57 GSM4567582
## 58 GSM4567583
## 59 GSM4567584
## 60 GSM4567585
## 61 GSM4567586
## 62 GSM4567587
## 63 GSM4567588
## 64 GSM4567589
## 65 GSM4567590
## 66 GSM4567591
## 67 GSM4567592
## 68 GSM4567593
## 69 GSM4567594
## 70 GSM4567595
## 71 GSM4567596
## 72 GSM4567597
## 73 GSM4567598
## 74 GSM4567599
## 75 GSM4567600
## 76 GSM4567601

if (file.exists('age.csv')){
  file.remove('age.csv')
}

## [1] TRUE

age <- strsplit(Info7[3],split='age: ')

for (i in age){
  write.table(i,file='age.csv',append=T, row.names=F, col.names = F)
}

Age <- read.csv('age.csv', header=F, sep=',')
colnames(Age) <- 'Age'
Age

##    Age
## 1   78
## 2   78
## 3   69
## 4   69
## 5   50
## 6   50
## 7   51
## 8   51
## 9   54
## 10  54
## 11  59
## 12  59
## 13  61
## 14  61
## 15  62
## 16  62
## 17  70
## 18  70
## 19  75
## 20  75
## 21  35
## 22  35
## 23  78
## 24  78
## 25  56
## 26  56
## 27  68
## 28  68
## 29  56
## 30  56
## 31  55
## 32  55
## 33  74
## 34  74
## 35  44
## 36  44
## 37  72
## 38  72
## 39  47
## 40  47
## 41  27
## 42  27
## 43  68
## 44  68
## 45  66
## 46  66
## 47  46
## 48  46
## 49  45
## 50  45
## 51  71
## 52  71
## 53  69
## 54  69
## 55  80
## 56  80
## 57  55
## 58  55
## 59  60
## 60  60
## 61  68
## 62  68
## 63  37
## 64  37
## 65  47
## 66  47
## 67  43
## 68  43
## 69  58
## 70  58
## 71  76
## 72  76
## 73  21
## 74  21
## 75  63
## 76  63

if (file.exists('gender.csv')){
  file.remove('gender.csv')
}

## [1] TRUE

gender <- strsplit(Info7[4], split='gender: ')
for (i in gender){
  write.table(i,file='gender.csv',row.names=F, col.names=F)
}

Gender= read.csv('gender.csv',sep=',',header=F)
colnames(Gender) <- 'gender'
Gender

##    gender
## 1    F   
## 2    F   
## 3    F   
## 4    F   
## 5    F   
## 6    F   
## 7    M   
## 8    M   
## 9    F   
## 10   F   
## 11   F   
## 12   F   
## 13   F   
## 14   F   
## 15   M   
## 16   M   
## 17   F   
## 18   F   
## 19   F   
## 20   F   
## 21   F   
## 22   F   
## 23   F   
## 24   F   
## 25   M   
## 26   M   
## 27   F   
## 28   F   
## 29   F   
## 30   F   
## 31   F   
## 32   F   
## 33   F   
## 34   F   
## 35   F   
## 36   F   
## 37   F   
## 38   F   
## 39   F   
## 40   F   
## 41   F   
## 42   F   
## 43   F   
## 44   F   
## 45   F   
## 46   F   
## 47   F   
## 48   F   
## 49   F   
## 50   F   
## 51   F   
## 52   F   
## 53   F   
## 54   F   
## 55   F   
## 56   F   
## 57   F   
## 58   F   
## 59   M   
## 60   M   
## 61   F   
## 62   F   
## 63   F   
## 64   F   
## 65   F   
## 66   F   
## 67   F   
## 68   F   
## 69   F   
## 70   F   
## 71   M   
## 72   M   
## 73   F   
## 74   F   
## 75   M   
## 76      M

if (file.exists('treatment.csv')){
  file.remove('treatment.csv')
}

## [1] TRUE

treatment <- strsplit(Info7[5], split='treatment: ')
for (i in treatment){
  write.table(i,file='treatment.csv',row.names=F,col.names=F)
}

Treatment <- read.csv('treatment.csv', sep=',',header=F)
colnames(Treatment) <- 'treatment'
Treatment

##       treatment
## 1    control   
## 2  abatacept   
## 3    control   
## 4  abatacept   
## 5    control   
## 6  abatacept   
## 7    control   
## 8  abatacept   
## 9    control   
## 10 abatacept   
## 11   control   
## 12 abatacept   
## 13   control   
## 14 abatacept   
## 15   control   
## 16 abatacept   
## 17   control   
## 18 abatacept   
## 19   control   
## 20 abatacept   
## 21   control   
## 22 abatacept   
## 23   control   
## 24 abatacept   
## 25   control   
## 26 abatacept   
## 27   control   
## 28 abatacept   
## 29   control   
## 30 abatacept   
## 31   control   
## 32 abatacept   
## 33   control   
## 34 abatacept   
## 35   control   
## 36 abatacept   
## 37   control   
## 38 abatacept   
## 39   control   
## 40 abatacept   
## 41   control   
## 42 abatacept   
## 43   control   
## 44 abatacept   
## 45   control   
## 46 abatacept   
## 47   control   
## 48 abatacept   
## 49   control   
## 50 abatacept   
## 51   control   
## 52 abatacept   
## 53   control   
## 54 abatacept   
## 55   control   
## 56 abatacept   
## 57   control   
## 58 abatacept   
## 59   control   
## 60 abatacept   
## 61   control   
## 62 abatacept   
## 63   control   
## 64 abatacept   
## 65   control   
## 66 abatacept   
## 67   control   
## 68 abatacept   
## 69   control   
## 70 abatacept   
## 71   control   
## 72 abatacept   
## 73   control   
## 74 abatacept   
## 75   control   
## 76    abatacept

demographics <- cbind(Patient,Sample,Age,Gender,Treatment)
demographics

##              patient     sample Age gender    treatment
## 1    Patient 1 week0 GSM4567526  78   F      control   
## 2   Patient 1 week12 GSM4567527  78   F    abatacept   
## 3    Patient 2 week0 GSM4567528  69   F      control   
## 4   Patient 2 week12 GSM4567529  69   F    abatacept   
## 5   Patient 15 week0 GSM4567530  50   F      control   
## 6  Patient 15 week12 GSM4567531  50   F    abatacept   
## 7   Patient 24 week0 GSM4567532  51   M      control   
## 8  Patient 24 week12 GSM4567533  51   M    abatacept   
## 9   Patient 16 week0 GSM4567534  54   F      control   
## 10 Patient 16 week12 GSM4567535  54   F    abatacept   
## 11  Patient 13 week0 GSM4567536  59   F      control   
## 12 Patient 13 week12 GSM4567537  59   F    abatacept   
## 13  Patient 22 week0 GSM4567538  61   F      control   
## 14 Patient 22 week12 GSM4567539  61   F    abatacept   
## 15  Patient 31 week0 GSM4567540  62   M      control   
## 16 Patient 31 week12 GSM4567541  62   M    abatacept   
## 17   Patient 3 week0 GSM4567542  70   F      control   
## 18  Patient 3 week12 GSM4567543  70   F    abatacept   
## 19  Patient 18 week0 GSM4567544  75   F      control   
## 20 Patient 18 week12 GSM4567545  75   F    abatacept   
## 21  Patient 23 week0 GSM4567546  35   F      control   
## 22 Patient 23 week12 GSM4567547  35   F    abatacept   
## 23  Patient 21 week0 GSM4567548  78   F      control   
## 24 Patient 21 week12 GSM4567549  78   F    abatacept   
## 25  Patient 25 week0 GSM4567550  56   M      control   
## 26 Patient 25 week12 GSM4567551  56   M    abatacept   
## 27  Patient 26 week0 GSM4567552  68   F      control   
## 28 Patient 26 week12 GSM4567553  68   F    abatacept   
## 29  Patient 27 week0 GSM4567554  56   F      control   
## 30 Patient 27 week12 GSM4567555  56   F    abatacept   
## 31  Patient 20 week0 GSM4567556  55   F      control   
## 32 Patient 20 week12 GSM4567557  55   F    abatacept   
## 33  Patient 34 week0 GSM4567558  74   F      control   
## 34 Patient 34 week12 GSM4567559  74   F    abatacept   
## 35  Patient 14 week0 GSM4567560  44   F      control   
## 36 Patient 14 week12 GSM4567561  44   F    abatacept   
## 37  Patient 17 week0 GSM4567562  72   F      control   
## 38 Patient 17 week12 GSM4567563  72   F    abatacept   
## 39  Patient 19 week0 GSM4567564  47   F      control   
## 40 Patient 19 week12 GSM4567565  47   F    abatacept   
## 41   Patient 4 week0 GSM4567566  27   F      control   
## 42  Patient 4 week12 GSM4567567  27   F    abatacept   
## 43  Patient 35 week0 GSM4567568  68   F      control   
## 44 Patient 35 week12 GSM4567569  68   F    abatacept   
## 45  Patient 28 week0 GSM4567570  66   F      control   
## 46 Patient 28 week12 GSM4567571  66   F    abatacept   
## 47  Patient 12 week0 GSM4567572  46   F      control   
## 48 Patient 12 week12 GSM4567573  46   F    abatacept   
## 49   Patient 8 week0 GSM4567574  45   F      control   
## 50  Patient 8 week12 GSM4567575  45   F    abatacept   
## 51   Patient 5 week0 GSM4567576  71   F      control   
## 52  Patient 5 week12 GSM4567577  71   F    abatacept   
## 53  Patient 38 week0 GSM4567578  69   F      control   
## 54 Patient 38 week12 GSM4567579  69   F    abatacept   
## 55  Patient 10 week0 GSM4567580  80   F      control   
## 56 Patient 10 week12 GSM4567581  80   F    abatacept   
## 57  Patient 29 week0 GSM4567582  55   F      control   
## 58 Patient 29 week12 GSM4567583  55   F    abatacept   
## 59  Patient 37 week0 GSM4567584  60   M      control   
## 60 Patient 37 week12 GSM4567585  60   M    abatacept   
## 61   Patient 6 week0 GSM4567586  68   F      control   
## 62  Patient 6 week12 GSM4567587  68   F    abatacept   
## 63  Patient 36 week0 GSM4567588  37   F      control   
## 64 Patient 36 week12 GSM4567589  37   F    abatacept   
## 65   Patient 9 week0 GSM4567590  47   F      control   
## 66  Patient 9 week12 GSM4567591  47   F    abatacept   
## 67  Patient 32 week0 GSM4567592  43   F      control   
## 68 Patient 32 week12 GSM4567593  43   F    abatacept   
## 69   Patient 7 week0 GSM4567594  58   F      control   
## 70  Patient 7 week12 GSM4567595  58   F    abatacept   
## 71  Patient 33 week0 GSM4567596  76   M      control   
## 72 Patient 33 week12 GSM4567597  76   M    abatacept   
## 73  Patient 11 week0 GSM4567598  21   F      control   
## 74 Patient 11 week12 GSM4567599  21   F    abatacept   
## 75  Patient 30 week0 GSM4567600  63   M      control   
## 76 Patient 30 week12 GSM4567601  63      M    abatacept

data <- read.csv('GSE151161_Raw_counts.csv',sep=',', header=T, na.strings=c('',' ','NA'),
                 stringsAsFactors = F)

colnames(data)[1] <- 'gene'
colnames(data)

##  [1] "gene"    "P1_w12"  "P1_w0"   "P2_w12"  "P2_w0"   "P3_w0"   "P3_w12" 
##  [8] "P4_w0"   "P4_w12"  "P5_w0"   "P5_w12"  "P6_w12"  "P6_w0"   "P7_w12" 
## [15] "P7_w0"   "P8_w12"  "P8_w0"   "P9_w12"  "P9_w0"   "P10_w12" "P10_w0" 
## [22] "P11_w12" "P11_w0"  "P12_w0"  "P12_w12" "P13_w12" "P13_w0"  "P14_w12"
## [29] "P14_w0"  "P15_w12" "P15_w0"  "P16_w0"  "P16_w12" "P17_w0"  "P17_w12"
## [36] "P18_w12" "P18_w0"  "P19_w12" "P19_w0"  "P20_w12" "P20_w0"  "P21_w0" 
## [43] "P21_w12" "P22_w0"  "P22_w12" "P23_w12" "P23_w0"  "P24_w0"  "P24_w12"
## [50] "P25_w12" "P25_w0"  "P26_w12" "P26_w0"  "P27_w0"  "P27_w12" "P28_w0" 
## [57] "P28_w12" "P29_w12" "P29_w0"  "P30_w12" "P30_w0"  "P31_w0"  "P31_w12"
## [64] "P32_w12" "P32_w0"  "P33_w12" "P33_w0"  "P34_w12" "P34_w0"  "P35_w12"
## [71] "P35_w0"  "P36_w12" "P36_w0"  "P37_w0"  "P37_w12" "P38_w0"  "P38_w12"

names <- as.data.frame(colnames(data)[2:77])
colnames(names) <- 'sample'
names

##     sample
## 1   P1_w12
## 2    P1_w0
## 3   P2_w12
## 4    P2_w0
## 5    P3_w0
## 6   P3_w12
## 7    P4_w0
## 8   P4_w12
## 9    P5_w0
## 10  P5_w12
## 11  P6_w12
## 12   P6_w0
## 13  P7_w12
## 14   P7_w0
## 15  P8_w12
## 16   P8_w0
## 17  P9_w12
## 18   P9_w0
## 19 P10_w12
## 20  P10_w0
## 21 P11_w12
## 22  P11_w0
## 23  P12_w0
## 24 P12_w12
## 25 P13_w12
## 26  P13_w0
## 27 P14_w12
## 28  P14_w0
## 29 P15_w12
## 30  P15_w0
## 31  P16_w0
## 32 P16_w12
## 33  P17_w0
## 34 P17_w12
## 35 P18_w12
## 36  P18_w0
## 37 P19_w12
## 38  P19_w0
## 39 P20_w12
## 40  P20_w0
## 41  P21_w0
## 42 P21_w12
## 43  P22_w0
## 44 P22_w12
## 45 P23_w12
## 46  P23_w0
## 47  P24_w0
## 48 P24_w12
## 49 P25_w12
## 50  P25_w0
## 51 P26_w12
## 52  P26_w0
## 53  P27_w0
## 54 P27_w12
## 55  P28_w0
## 56 P28_w12
## 57 P29_w12
## 58  P29_w0
## 59 P30_w12
## 60  P30_w0
## 61  P31_w0
## 62 P31_w12
## 63 P32_w12
## 64  P32_w0
## 65 P33_w12
## 66  P33_w0
## 67 P34_w12
## 68  P34_w0
## 69 P35_w12
## 70  P35_w0
## 71 P36_w12
## 72  P36_w0
## 73  P37_w0
## 74 P37_w12
## 75  P38_w0
## 76 P38_w12

demographics$patient <- gsub('Patient ','P',demographics$patient)
demographics$patient <- gsub(' week','_w',demographics$patient)
demographics$patient <- as.factor(paste(demographics$patient))
demographics$treatment <- as.character(demographics$treatment)
demographics$treatment <- trimws(demographics$treatment, which='right', whitespace=' ')
demographics$gender <- as.character(demographics$gender)
demographics$gender <- trimws(demographics$gender, which='right', whitespace=' ')


demographics$patient

##  [1] P1_w0   P1_w12  P2_w0   P2_w12  P15_w0  P15_w12 P24_w0  P24_w12 P16_w0 
## [10] P16_w12 P13_w0  P13_w12 P22_w0  P22_w12 P31_w0  P31_w12 P3_w0   P3_w12 
## [19] P18_w0  P18_w12 P23_w0  P23_w12 P21_w0  P21_w12 P25_w0  P25_w12 P26_w0 
## [28] P26_w12 P27_w0  P27_w12 P20_w0  P20_w12 P34_w0  P34_w12 P14_w0  P14_w12
## [37] P17_w0  P17_w12 P19_w0  P19_w12 P4_w0   P4_w12  P35_w0  P35_w12 P28_w0 
## [46] P28_w12 P12_w0  P12_w12 P8_w0   P8_w12  P5_w0   P5_w12  P38_w0  P38_w12
## [55] P10_w0  P10_w12 P29_w0  P29_w12 P37_w0  P37_w12 P6_w0   P6_w12  P36_w0 
## [64] P36_w12 P9_w0   P9_w12  P32_w0  P32_w12 P7_w0   P7_w12  P33_w0  P33_w12
## [73] P11_w0  P11_w12 P30_w0  P30_w12
## 76 Levels: P1_w0 P1_w12 P10_w0 P10_w12 P11_w0 P11_w12 P12_w0 P12_w12 ... P9_w12

names$sample

##  [1] P1_w12  P1_w0   P2_w12  P2_w0   P3_w0   P3_w12  P4_w0   P4_w12  P5_w0  
## [10] P5_w12  P6_w12  P6_w0   P7_w12  P7_w0   P8_w12  P8_w0   P9_w12  P9_w0  
## [19] P10_w12 P10_w0  P11_w12 P11_w0  P12_w0  P12_w12 P13_w12 P13_w0  P14_w12
## [28] P14_w0  P15_w12 P15_w0  P16_w0  P16_w12 P17_w0  P17_w12 P18_w12 P18_w0 
## [37] P19_w12 P19_w0  P20_w12 P20_w0  P21_w0  P21_w12 P22_w0  P22_w12 P23_w12
## [46] P23_w0  P24_w0  P24_w12 P25_w12 P25_w0  P26_w12 P26_w0  P27_w0  P27_w12
## [55] P28_w0  P28_w12 P29_w12 P29_w0  P30_w12 P30_w0  P31_w0  P31_w12 P32_w12
## [64] P32_w0  P33_w12 P33_w0  P34_w12 P34_w0  P35_w12 P35_w0  P36_w12 P36_w0 
## [73] P37_w0  P37_w12 P38_w0  P38_w12
## 76 Levels: P1_w0 P1_w12 P10_w0 P10_w12 P11_w0 P11_w12 P12_w0 P12_w12 ... P9_w12

write.csv(demographics,'demographics.csv',row.names=FALSE)

demographics

##    patient     sample Age gender treatment
## 1    P1_w0 GSM4567526  78      F   control
## 2   P1_w12 GSM4567527  78      F abatacept
## 3    P2_w0 GSM4567528  69      F   control
## 4   P2_w12 GSM4567529  69      F abatacept
## 5   P15_w0 GSM4567530  50      F   control
## 6  P15_w12 GSM4567531  50      F abatacept
## 7   P24_w0 GSM4567532  51      M   control
## 8  P24_w12 GSM4567533  51      M abatacept
## 9   P16_w0 GSM4567534  54      F   control
## 10 P16_w12 GSM4567535  54      F abatacept
## 11  P13_w0 GSM4567536  59      F   control
## 12 P13_w12 GSM4567537  59      F abatacept
## 13  P22_w0 GSM4567538  61      F   control
## 14 P22_w12 GSM4567539  61      F abatacept
## 15  P31_w0 GSM4567540  62      M   control
## 16 P31_w12 GSM4567541  62      M abatacept
## 17   P3_w0 GSM4567542  70      F   control
## 18  P3_w12 GSM4567543  70      F abatacept
## 19  P18_w0 GSM4567544  75      F   control
## 20 P18_w12 GSM4567545  75      F abatacept
## 21  P23_w0 GSM4567546  35      F   control
## 22 P23_w12 GSM4567547  35      F abatacept
## 23  P21_w0 GSM4567548  78      F   control
## 24 P21_w12 GSM4567549  78      F abatacept
## 25  P25_w0 GSM4567550  56      M   control
## 26 P25_w12 GSM4567551  56      M abatacept
## 27  P26_w0 GSM4567552  68      F   control
## 28 P26_w12 GSM4567553  68      F abatacept
## 29  P27_w0 GSM4567554  56      F   control
## 30 P27_w12 GSM4567555  56      F abatacept
## 31  P20_w0 GSM4567556  55      F   control
## 32 P20_w12 GSM4567557  55      F abatacept
## 33  P34_w0 GSM4567558  74      F   control
## 34 P34_w12 GSM4567559  74      F abatacept
## 35  P14_w0 GSM4567560  44      F   control
## 36 P14_w12 GSM4567561  44      F abatacept
## 37  P17_w0 GSM4567562  72      F   control
## 38 P17_w12 GSM4567563  72      F abatacept
## 39  P19_w0 GSM4567564  47      F   control
## 40 P19_w12 GSM4567565  47      F abatacept
## 41   P4_w0 GSM4567566  27      F   control
## 42  P4_w12 GSM4567567  27      F abatacept
## 43  P35_w0 GSM4567568  68      F   control
## 44 P35_w12 GSM4567569  68      F abatacept
## 45  P28_w0 GSM4567570  66      F   control
## 46 P28_w12 GSM4567571  66      F abatacept
## 47  P12_w0 GSM4567572  46      F   control
## 48 P12_w12 GSM4567573  46      F abatacept
## 49   P8_w0 GSM4567574  45      F   control
## 50  P8_w12 GSM4567575  45      F abatacept
## 51   P5_w0 GSM4567576  71      F   control
## 52  P5_w12 GSM4567577  71      F abatacept
## 53  P38_w0 GSM4567578  69      F   control
## 54 P38_w12 GSM4567579  69      F abatacept
## 55  P10_w0 GSM4567580  80      F   control
## 56 P10_w12 GSM4567581  80      F abatacept
## 57  P29_w0 GSM4567582  55      F   control
## 58 P29_w12 GSM4567583  55      F abatacept
## 59  P37_w0 GSM4567584  60      M   control
## 60 P37_w12 GSM4567585  60      M abatacept
## 61   P6_w0 GSM4567586  68      F   control
## 62  P6_w12 GSM4567587  68      F abatacept
## 63  P36_w0 GSM4567588  37      F   control
## 64 P36_w12 GSM4567589  37      F abatacept
## 65   P9_w0 GSM4567590  47      F   control
## 66  P9_w12 GSM4567591  47      F abatacept
## 67  P32_w0 GSM4567592  43      F   control
## 68 P32_w12 GSM4567593  43      F abatacept
## 69   P7_w0 GSM4567594  58      F   control
## 70  P7_w12 GSM4567595  58      F abatacept
## 71  P33_w0 GSM4567596  76      M   control
## 72 P33_w12 GSM4567597  76      M abatacept
## 73  P11_w0 GSM4567598  21      F   control
## 74 P11_w12 GSM4567599  21      F abatacept
## 75  P30_w0 GSM4567600  63      M   control
## 76 P30_w12 GSM4567601  63      M abatacept

We can subset into control and treatment with our demographics data frame’s treatment column.

control <- subset(demographics, demographics$treatment =='control')
abatacept <- subset(demographics, demographics$treatment=='abatacept')

There are 38 samples in each group of control or abatacept.

Lets divide the groups into males and females.

femaleControl <- subset(control, control$gender=='F')
maleControl <- subset(control, control$gender=='M')
femaleTreatment <- subset(abatacept, abatacept$gender=='F')
maleTreatment <- subset(abatacept, abatacept$gender=='M')

There are 32 females and 6 males, totalling the number of samples in the control group and the treatment group because it was the same group of people’s blood 12 weeks later.

controlList <- control$patient
treatmentList <- abatacept$patient

femaleControlList <- femaleControl$patient
maleControlList <- maleControl$patient
femaleTreatmentList <- femaleTreatment$patient
maleTreatmentList <- maleTreatment$patient

gene <- data$gene

control2 <- colnames(data) %in% controlList
controlDF <- data[,control2]
controlDF$gene <- gene

abatacept2 <- colnames(data) %in% treatmentList
treatmentDF <- data[,abatacept2]
treatmentDF$gene <- gene

femaleControlList2 <- colnames(controlDF) %in% femaleControlList
femControlDF <- controlDF[,femaleControlList2]
femControlDF$gene <- gene

maleControlList2 <- colnames(controlDF) %in% maleControlList
maleControlDF <- controlDF[,maleControlList2]
maleControlDF$gene <- gene

femaleTreatmentList2 <- colnames(treatmentDF) %in% femaleTreatmentList
femTreatmentDF <- treatmentDF[,femaleTreatmentList2]
femTreatmentDF$gene <- gene

maleTreatmentList2 <- colnames(treatmentDF) %in% maleTreatmentList
maleTreatmentDF <- treatmentDF[,maleTreatmentList2]
maleTreatmentDF$gene <- gene

controlDF2 <- controlDF %>% group_by(gene) %>% summarise_at(vars('P1_w0':'P38_w0'),mean)
treatmentDF2 <- treatmentDF %>% group_by(gene) %>% summarise_at(vars('P1_w12':'P38_w12'),mean)

femTreatmentDF2 <- femTreatmentDF %>% group_by(gene) %>% summarise_at(vars('P1_w12':'P38_w12'),mean)
maleTreatmentDF2 <- maleTreatmentDF %>% group_by(gene) %>%
  summarise_at(vars('P24_w12':'P37_w12'),mean)

femControlDF2 <- femControlDF %>% group_by(gene) %>% summarise_at(vars('P1_w0':'P38_w0'),mean)
maleControlDF2 <- maleControlDF %>% group_by(gene) %>% summarise_at(vars('P24_w0':'P37_w0'),mean)

gene <- controlDF2$gene

femControlDF3 <- femControlDF2[,-1]
maleControlDF3 <- maleControlDF2[,-1]
femTreatmentDF3 <- femTreatmentDF2[,-1]
maleTreatmentDF3 <- maleTreatmentDF2[,-1]
controlDF3 <- controlDF2[,-1]
treatmentDF3 <- treatmentDF2[,-1]

controlDF3$ctrl_mean <- apply(controlDF3, 1,mean, na.rm=T)

treatmentDF3$trt_mean <- apply(treatmentDF3, 1,mean, na.rm=T)

femTreatmentDF3$fem_trt_mean <- apply(femTreatmentDF3, 1,mean, na.rm=T)

maleTreatmentDF3$mal_trt_mean <- apply(maleTreatmentDF3, 1,mean, na.rm=T)

femControlDF3$fem_ctrl_mean <- apply(femControlDF3, 1,mean, na.rm=T)

maleControlDF3$mal_ctrl_mean <- apply(maleControlDF3, 1,mean, na.rm=T)

DATA_RA <- cbind(controlDF3,treatmentDF3,maleControlDF3,femControlDF3,
                 maleTreatmentDF3,femTreatmentDF3)
DATA_RA$gene <- gene

colnames(DATA_RA)

##   [1] "P1_w0"         "P2_w0"         "P3_w0"         "P4_w0"        
##   [5] "P5_w0"         "P6_w0"         "P7_w0"         "P8_w0"        
##   [9] "P9_w0"         "P10_w0"        "P11_w0"        "P12_w0"       
##  [13] "P13_w0"        "P14_w0"        "P15_w0"        "P16_w0"       
##  [17] "P17_w0"        "P18_w0"        "P19_w0"        "P20_w0"       
##  [21] "P21_w0"        "P22_w0"        "P23_w0"        "P24_w0"       
##  [25] "P25_w0"        "P26_w0"        "P27_w0"        "P28_w0"       
##  [29] "P29_w0"        "P30_w0"        "P31_w0"        "P32_w0"       
##  [33] "P33_w0"        "P34_w0"        "P35_w0"        "P36_w0"       
##  [37] "P37_w0"        "P38_w0"        "ctrl_mean"     "P1_w12"       
##  [41] "P2_w12"        "P3_w12"        "P4_w12"        "P5_w12"       
##  [45] "P6_w12"        "P7_w12"        "P8_w12"        "P9_w12"       
##  [49] "P10_w12"       "P11_w12"       "P12_w12"       "P13_w12"      
##  [53] "P14_w12"       "P15_w12"       "P16_w12"       "P17_w12"      
##  [57] "P18_w12"       "P19_w12"       "P20_w12"       "P21_w12"      
##  [61] "P22_w12"       "P23_w12"       "P24_w12"       "P25_w12"      
##  [65] "P26_w12"       "P27_w12"       "P28_w12"       "P29_w12"      
##  [69] "P30_w12"       "P31_w12"       "P32_w12"       "P33_w12"      
##  [73] "P34_w12"       "P35_w12"       "P36_w12"       "P37_w12"      
##  [77] "P38_w12"       "trt_mean"      "P24_w0"        "P25_w0"       
##  [81] "P30_w0"        "P31_w0"        "P33_w0"        "P37_w0"       
##  [85] "mal_ctrl_mean" "P1_w0"         "P2_w0"         "P3_w0"        
##  [89] "P4_w0"         "P5_w0"         "P6_w0"         "P7_w0"        
##  [93] "P8_w0"         "P9_w0"         "P10_w0"        "P11_w0"       
##  [97] "P12_w0"        "P13_w0"        "P14_w0"        "P15_w0"       
## [101] "P16_w0"        "P17_w0"        "P18_w0"        "P19_w0"       
## [105] "P20_w0"        "P21_w0"        "P22_w0"        "P23_w0"       
## [109] "P26_w0"        "P27_w0"        "P28_w0"        "P29_w0"       
## [113] "P32_w0"        "P34_w0"        "P35_w0"        "P36_w0"       
## [117] "P38_w0"        "fem_ctrl_mean" "P24_w12"       "P25_w12"      
## [121] "P30_w12"       "P31_w12"       "P33_w12"       "P37_w12"      
## [125] "mal_trt_mean"  "P1_w12"        "P2_w12"        "P3_w12"       
## [129] "P4_w12"        "P5_w12"        "P6_w12"        "P7_w12"       
## [133] "P8_w12"        "P9_w12"        "P10_w12"       "P11_w12"      
## [137] "P12_w12"       "P13_w12"       "P14_w12"       "P15_w12"      
## [141] "P16_w12"       "P17_w12"       "P18_w12"       "P19_w12"      
## [145] "P20_w12"       "P21_w12"       "P22_w12"       "P23_w12"      
## [149] "P26_w12"       "P27_w12"       "P28_w12"       "P29_w12"      
## [153] "P32_w12"       "P34_w12"       "P35_w12"       "P36_w12"      
## [157] "P38_w12"       "fem_trt_mean"  "gene"

DATA_RA2 <- DATA_RA[,c(159,1:38,40:77,79:84,86:117,119:124,126:157,39,78,85,118,125,158)]
colnames(DATA_RA2)

##   [1] "gene"          "P1_w0"         "P2_w0"         "P3_w0"        
##   [5] "P4_w0"         "P5_w0"         "P6_w0"         "P7_w0"        
##   [9] "P8_w0"         "P9_w0"         "P10_w0"        "P11_w0"       
##  [13] "P12_w0"        "P13_w0"        "P14_w0"        "P15_w0"       
##  [17] "P16_w0"        "P17_w0"        "P18_w0"        "P19_w0"       
##  [21] "P20_w0"        "P21_w0"        "P22_w0"        "P23_w0"       
##  [25] "P24_w0"        "P25_w0"        "P26_w0"        "P27_w0"       
##  [29] "P28_w0"        "P29_w0"        "P30_w0"        "P31_w0"       
##  [33] "P32_w0"        "P33_w0"        "P34_w0"        "P35_w0"       
##  [37] "P36_w0"        "P37_w0"        "P38_w0"        "P1_w12"       
##  [41] "P2_w12"        "P3_w12"        "P4_w12"        "P5_w12"       
##  [45] "P6_w12"        "P7_w12"        "P8_w12"        "P9_w12"       
##  [49] "P10_w12"       "P11_w12"       "P12_w12"       "P13_w12"      
##  [53] "P14_w12"       "P15_w12"       "P16_w12"       "P17_w12"      
##  [57] "P18_w12"       "P19_w12"       "P20_w12"       "P21_w12"      
##  [61] "P22_w12"       "P23_w12"       "P24_w12"       "P25_w12"      
##  [65] "P26_w12"       "P27_w12"       "P28_w12"       "P29_w12"      
##  [69] "P30_w12"       "P31_w12"       "P32_w12"       "P33_w12"      
##  [73] "P34_w12"       "P35_w12"       "P36_w12"       "P37_w12"      
##  [77] "P38_w12"       "P24_w0.1"      "P25_w0.1"      "P30_w0.1"     
##  [81] "P31_w0.1"      "P33_w0.1"      "P37_w0.1"      "P1_w0.1"      
##  [85] "P2_w0.1"       "P3_w0.1"       "P4_w0.1"       "P5_w0.1"      
##  [89] "P6_w0.1"       "P7_w0.1"       "P8_w0.1"       "P9_w0.1"      
##  [93] "P10_w0.1"      "P11_w0.1"      "P12_w0.1"      "P13_w0.1"     
##  [97] "P14_w0.1"      "P15_w0.1"      "P16_w0.1"      "P17_w0.1"     
## [101] "P18_w0.1"      "P19_w0.1"      "P20_w0.1"      "P21_w0.1"     
## [105] "P22_w0.1"      "P23_w0.1"      "P26_w0.1"      "P27_w0.1"     
## [109] "P28_w0.1"      "P29_w0.1"      "P32_w0.1"      "P34_w0.1"     
## [113] "P35_w0.1"      "P36_w0.1"      "P38_w0.1"      "P24_w12.1"    
## [117] "P25_w12.1"     "P30_w12.1"     "P31_w12.1"     "P33_w12.1"    
## [121] "P37_w12.1"     "P1_w12.1"      "P2_w12.1"      "P3_w12.1"     
## [125] "P4_w12.1"      "P5_w12.1"      "P6_w12.1"      "P7_w12.1"     
## [129] "P8_w12.1"      "P9_w12.1"      "P10_w12.1"     "P11_w12.1"    
## [133] "P12_w12.1"     "P13_w12.1"     "P14_w12.1"     "P15_w12.1"    
## [137] "P16_w12.1"     "P17_w12.1"     "P18_w12.1"     "P19_w12.1"    
## [141] "P20_w12.1"     "P21_w12.1"     "P22_w12.1"     "P23_w12.1"    
## [145] "P26_w12.1"     "P27_w12.1"     "P28_w12.1"     "P29_w12.1"    
## [149] "P32_w12.1"     "P34_w12.1"     "P35_w12.1"     "P36_w12.1"    
## [153] "P38_w12.1"     "ctrl_mean"     "trt_mean"      "mal_ctrl_mean"
## [157] "fem_ctrl_mean" "mal_trt_mean"  "fem_trt_mean"

DATA_RA2$trt_ctrl_FoldChange <- DATA_RA2$trt_mean/DATA_RA2$ctrl_mean
DATA_RA2$fem_trt_ctrl_FoldChange <- DATA_RA2$fem_trt_mean/DATA_RA2$fem_ctrl_mean
DATA_RA2$mal_trt_ctrl_FoldChange <- DATA_RA2$mal_trt_mean/DATA_RA2$mal_ctrl_mean

DATA_RA2$fem_mal_trt_FoldChange <- DATA_RA2$fem_trt_mean/DATA_RA2$mal_trt_mean
DATA_RA2$fem_mal_ctrl_FoldChange <- DATA_RA2$fem_ctrl_mean/DATA_RA2$mal_ctrl_mean

Lets add in age as well.

quantile(demographics$Age)

##   0%  25%  50%  75% 100% 
## 21.0 47.0 59.5 69.0 80.0

Lets divide the data into those patients with RA who are younger than 50 and those older than 50.

less50 <- subset(demographics,demographics$Age < 50)
great50 <- subset(demographics,demographics$Age >=50)

gene <- controlDF$gene

less50b <- less50$patient
great50b <- great50$patient

less50c <- colnames(controlDF) %in% less50b
great50c <- colnames(controlDF) %in% great50b

less50ctrlDF <- controlDF[,less50c]
great50ctrlDF <- controlDF[,great50c]
less50trtDF <- treatmentDF[,less50c]
great50trtDF <- treatmentDF[,great50c]

less50ctrlDF$gene <- gene
less50trtDF$gene <- gene
great50ctrlDF$gene <- gene
great50trtDF$gene <- gene

less50ctrlDF2 <- less50ctrlDF %>% group_by(gene) %>% summarise_at(vars('P4_w0':'P36_w0'),mean)
less50trtDF2 <- less50trtDF %>% group_by(gene) %>% summarise_at(vars('P4_w12':'P36_w12'),mean)
great50ctrlDF2 <- great50ctrlDF %>% group_by(gene) %>% summarise_at(vars('P1_w0':'P38_w0'),mean)
great50trtDF2 <- great50trtDF %>% group_by(gene) %>% summarise_at(vars('P1_w12':'P38_w12'),mean)

gene <- great50trtDF2$gene

less50ctrlDF3 <- less50ctrlDF2[,-1]
less50trtDF3 <- less50trtDF2[,-1]
great50ctrlDF3 <- great50ctrlDF2[,-1]
great50trtDF3 <- great50trtDF2[,-1]

less50ctrlDF3$less50_ctrl_mean <- apply(less50ctrlDF3,1,mean)
less50trtDF3$less50_trt_mean <- apply(less50trtDF3,1,mean)
great50ctrlDF3$great50_ctrl_mean <- apply(great50ctrlDF3,1,mean)
great50trtDF3$great50_trt_mean <- apply(great50trtDF3,1,mean)

DATA_RA3 <- cbind(DATA_RA2, less50ctrlDF3, great50ctrlDF3, less50trtDF3,great50trtDF3)
colnames(DATA_RA3)

##   [1] "gene"                    "P1_w0"                  
##   [3] "P2_w0"                   "P3_w0"                  
##   [5] "P4_w0"                   "P5_w0"                  
##   [7] "P6_w0"                   "P7_w0"                  
##   [9] "P8_w0"                   "P9_w0"                  
##  [11] "P10_w0"                  "P11_w0"                 
##  [13] "P12_w0"                  "P13_w0"                 
##  [15] "P14_w0"                  "P15_w0"                 
##  [17] "P16_w0"                  "P17_w0"                 
##  [19] "P18_w0"                  "P19_w0"                 
##  [21] "P20_w0"                  "P21_w0"                 
##  [23] "P22_w0"                  "P23_w0"                 
##  [25] "P24_w0"                  "P25_w0"                 
##  [27] "P26_w0"                  "P27_w0"                 
##  [29] "P28_w0"                  "P29_w0"                 
##  [31] "P30_w0"                  "P31_w0"                 
##  [33] "P32_w0"                  "P33_w0"                 
##  [35] "P34_w0"                  "P35_w0"                 
##  [37] "P36_w0"                  "P37_w0"                 
##  [39] "P38_w0"                  "P1_w12"                 
##  [41] "P2_w12"                  "P3_w12"                 
##  [43] "P4_w12"                  "P5_w12"                 
##  [45] "P6_w12"                  "P7_w12"                 
##  [47] "P8_w12"                  "P9_w12"                 
##  [49] "P10_w12"                 "P11_w12"                
##  [51] "P12_w12"                 "P13_w12"                
##  [53] "P14_w12"                 "P15_w12"                
##  [55] "P16_w12"                 "P17_w12"                
##  [57] "P18_w12"                 "P19_w12"                
##  [59] "P20_w12"                 "P21_w12"                
##  [61] "P22_w12"                 "P23_w12"                
##  [63] "P24_w12"                 "P25_w12"                
##  [65] "P26_w12"                 "P27_w12"                
##  [67] "P28_w12"                 "P29_w12"                
##  [69] "P30_w12"                 "P31_w12"                
##  [71] "P32_w12"                 "P33_w12"                
##  [73] "P34_w12"                 "P35_w12"                
##  [75] "P36_w12"                 "P37_w12"                
##  [77] "P38_w12"                 "P24_w0.1"               
##  [79] "P25_w0.1"                "P30_w0.1"               
##  [81] "P31_w0.1"                "P33_w0.1"               
##  [83] "P37_w0.1"                "P1_w0.1"                
##  [85] "P2_w0.1"                 "P3_w0.1"                
##  [87] "P4_w0.1"                 "P5_w0.1"                
##  [89] "P6_w0.1"                 "P7_w0.1"                
##  [91] "P8_w0.1"                 "P9_w0.1"                
##  [93] "P10_w0.1"                "P11_w0.1"               
##  [95] "P12_w0.1"                "P13_w0.1"               
##  [97] "P14_w0.1"                "P15_w0.1"               
##  [99] "P16_w0.1"                "P17_w0.1"               
## [101] "P18_w0.1"                "P19_w0.1"               
## [103] "P20_w0.1"                "P21_w0.1"               
## [105] "P22_w0.1"                "P23_w0.1"               
## [107] "P26_w0.1"                "P27_w0.1"               
## [109] "P28_w0.1"                "P29_w0.1"               
## [111] "P32_w0.1"                "P34_w0.1"               
## [113] "P35_w0.1"                "P36_w0.1"               
## [115] "P38_w0.1"                "P24_w12.1"              
## [117] "P25_w12.1"               "P30_w12.1"              
## [119] "P31_w12.1"               "P33_w12.1"              
## [121] "P37_w12.1"               "P1_w12.1"               
## [123] "P2_w12.1"                "P3_w12.1"               
## [125] "P4_w12.1"                "P5_w12.1"               
## [127] "P6_w12.1"                "P7_w12.1"               
## [129] "P8_w12.1"                "P9_w12.1"               
## [131] "P10_w12.1"               "P11_w12.1"              
## [133] "P12_w12.1"               "P13_w12.1"              
## [135] "P14_w12.1"               "P15_w12.1"              
## [137] "P16_w12.1"               "P17_w12.1"              
## [139] "P18_w12.1"               "P19_w12.1"              
## [141] "P20_w12.1"               "P21_w12.1"              
## [143] "P22_w12.1"               "P23_w12.1"              
## [145] "P26_w12.1"               "P27_w12.1"              
## [147] "P28_w12.1"               "P29_w12.1"              
## [149] "P32_w12.1"               "P34_w12.1"              
## [151] "P35_w12.1"               "P36_w12.1"              
## [153] "P38_w12.1"               "ctrl_mean"              
## [155] "trt_mean"                "mal_ctrl_mean"          
## [157] "fem_ctrl_mean"           "mal_trt_mean"           
## [159] "fem_trt_mean"            "trt_ctrl_FoldChange"    
## [161] "fem_trt_ctrl_FoldChange" "mal_trt_ctrl_FoldChange"
## [163] "fem_mal_trt_FoldChange"  "fem_mal_ctrl_FoldChange"
## [165] "P4_w0"                   "P8_w0"                  
## [167] "P9_w0"                   "P11_w0"                 
## [169] "P12_w0"                  "P14_w0"                 
## [171] "P19_w0"                  "P23_w0"                 
## [173] "P32_w0"                  "P36_w0"                 
## [175] "less50_ctrl_mean"        "P1_w0"                  
## [177] "P2_w0"                   "P3_w0"                  
## [179] "P5_w0"                   "P6_w0"                  
## [181] "P7_w0"                   "P10_w0"                 
## [183] "P13_w0"                  "P15_w0"                 
## [185] "P16_w0"                  "P17_w0"                 
## [187] "P18_w0"                  "P20_w0"                 
## [189] "P21_w0"                  "P22_w0"                 
## [191] "P24_w0"                  "P25_w0"                 
## [193] "P26_w0"                  "P27_w0"                 
## [195] "P28_w0"                  "P29_w0"                 
## [197] "P30_w0"                  "P31_w0"                 
## [199] "P33_w0"                  "P34_w0"                 
## [201] "P35_w0"                  "P37_w0"                 
## [203] "P38_w0"                  "great50_ctrl_mean"      
## [205] "P4_w12"                  "P8_w12"                 
## [207] "P9_w12"                  "P11_w12"                
## [209] "P12_w12"                 "P14_w12"                
## [211] "P19_w12"                 "P23_w12"                
## [213] "P32_w12"                 "P36_w12"                
## [215] "less50_trt_mean"         "P1_w12"                 
## [217] "P2_w12"                  "P3_w12"                 
## [219] "P5_w12"                  "P6_w12"                 
## [221] "P7_w12"                  "P10_w12"                
## [223] "P13_w12"                 "P15_w12"                
## [225] "P16_w12"                 "P17_w12"                
## [227] "P18_w12"                 "P20_w12"                
## [229] "P21_w12"                 "P22_w12"                
## [231] "P24_w12"                 "P25_w12"                
## [233] "P26_w12"                 "P27_w12"                
## [235] "P28_w12"                 "P29_w12"                
## [237] "P30_w12"                 "P31_w12"                
## [239] "P33_w12"                 "P34_w12"                
## [241] "P35_w12"                 "P37_w12"                
## [243] "P38_w12"                 "great50_trt_mean"

DATA_RA4 <- DATA_RA3[,c(1:77,154:159,175,204,215,244,160:164)]
colnames(DATA_RA4)

##  [1] "gene"                    "P1_w0"                  
##  [3] "P2_w0"                   "P3_w0"                  
##  [5] "P4_w0"                   "P5_w0"                  
##  [7] "P6_w0"                   "P7_w0"                  
##  [9] "P8_w0"                   "P9_w0"                  
## [11] "P10_w0"                  "P11_w0"                 
## [13] "P12_w0"                  "P13_w0"                 
## [15] "P14_w0"                  "P15_w0"                 
## [17] "P16_w0"                  "P17_w0"                 
## [19] "P18_w0"                  "P19_w0"                 
## [21] "P20_w0"                  "P21_w0"                 
## [23] "P22_w0"                  "P23_w0"                 
## [25] "P24_w0"                  "P25_w0"                 
## [27] "P26_w0"                  "P27_w0"                 
## [29] "P28_w0"                  "P29_w0"                 
## [31] "P30_w0"                  "P31_w0"                 
## [33] "P32_w0"                  "P33_w0"                 
## [35] "P34_w0"                  "P35_w0"                 
## [37] "P36_w0"                  "P37_w0"                 
## [39] "P38_w0"                  "P1_w12"                 
## [41] "P2_w12"                  "P3_w12"                 
## [43] "P4_w12"                  "P5_w12"                 
## [45] "P6_w12"                  "P7_w12"                 
## [47] "P8_w12"                  "P9_w12"                 
## [49] "P10_w12"                 "P11_w12"                
## [51] "P12_w12"                 "P13_w12"                
## [53] "P14_w12"                 "P15_w12"                
## [55] "P16_w12"                 "P17_w12"                
## [57] "P18_w12"                 "P19_w12"                
## [59] "P20_w12"                 "P21_w12"                
## [61] "P22_w12"                 "P23_w12"                
## [63] "P24_w12"                 "P25_w12"                
## [65] "P26_w12"                 "P27_w12"                
## [67] "P28_w12"                 "P29_w12"                
## [69] "P30_w12"                 "P31_w12"                
## [71] "P32_w12"                 "P33_w12"                
## [73] "P34_w12"                 "P35_w12"                
## [75] "P36_w12"                 "P37_w12"                
## [77] "P38_w12"                 "ctrl_mean"              
## [79] "trt_mean"                "mal_ctrl_mean"          
## [81] "fem_ctrl_mean"           "mal_trt_mean"           
## [83] "fem_trt_mean"            "less50_ctrl_mean"       
## [85] "great50_ctrl_mean"       "less50_trt_mean"        
## [87] "great50_trt_mean"        "trt_ctrl_FoldChange"    
## [89] "fem_trt_ctrl_FoldChange" "mal_trt_ctrl_FoldChange"
## [91] "fem_mal_trt_FoldChange"  "fem_mal_ctrl_FoldChange"

DATA_RA4$less50_trt_ctrl_FoldChange <- DATA_RA4$less50_trt_mean/DATA_RA4$less50_ctrl_mean
DATA_RA4$great50_trt_ctrl_FoldChange <- DATA_RA4$great50_trt_mean/DATA_RA4$great50_ctrl_mean
colnames(DATA_RA4)

##  [1] "gene"                        "P1_w0"                      
##  [3] "P2_w0"                       "P3_w0"                      
##  [5] "P4_w0"                       "P5_w0"                      
##  [7] "P6_w0"                       "P7_w0"                      
##  [9] "P8_w0"                       "P9_w0"                      
## [11] "P10_w0"                      "P11_w0"                     
## [13] "P12_w0"                      "P13_w0"                     
## [15] "P14_w0"                      "P15_w0"                     
## [17] "P16_w0"                      "P17_w0"                     
## [19] "P18_w0"                      "P19_w0"                     
## [21] "P20_w0"                      "P21_w0"                     
## [23] "P22_w0"                      "P23_w0"                     
## [25] "P24_w0"                      "P25_w0"                     
## [27] "P26_w0"                      "P27_w0"                     
## [29] "P28_w0"                      "P29_w0"                     
## [31] "P30_w0"                      "P31_w0"                     
## [33] "P32_w0"                      "P33_w0"                     
## [35] "P34_w0"                      "P35_w0"                     
## [37] "P36_w0"                      "P37_w0"                     
## [39] "P38_w0"                      "P1_w12"                     
## [41] "P2_w12"                      "P3_w12"                     
## [43] "P4_w12"                      "P5_w12"                     
## [45] "P6_w12"                      "P7_w12"                     
## [47] "P8_w12"                      "P9_w12"                     
## [49] "P10_w12"                     "P11_w12"                    
## [51] "P12_w12"                     "P13_w12"                    
## [53] "P14_w12"                     "P15_w12"                    
## [55] "P16_w12"                     "P17_w12"                    
## [57] "P18_w12"                     "P19_w12"                    
## [59] "P20_w12"                     "P21_w12"                    
## [61] "P22_w12"                     "P23_w12"                    
## [63] "P24_w12"                     "P25_w12"                    
## [65] "P26_w12"                     "P27_w12"                    
## [67] "P28_w12"                     "P29_w12"                    
## [69] "P30_w12"                     "P31_w12"                    
## [71] "P32_w12"                     "P33_w12"                    
## [73] "P34_w12"                     "P35_w12"                    
## [75] "P36_w12"                     "P37_w12"                    
## [77] "P38_w12"                     "ctrl_mean"                  
## [79] "trt_mean"                    "mal_ctrl_mean"              
## [81] "fem_ctrl_mean"               "mal_trt_mean"               
## [83] "fem_trt_mean"                "less50_ctrl_mean"           
## [85] "great50_ctrl_mean"           "less50_trt_mean"            
## [87] "great50_trt_mean"            "trt_ctrl_FoldChange"        
## [89] "fem_trt_ctrl_FoldChange"     "mal_trt_ctrl_FoldChange"    
## [91] "fem_mal_trt_FoldChange"      "fem_mal_ctrl_FoldChange"    
## [93] "less50_trt_ctrl_FoldChange"  "great50_trt_ctrl_FoldChange"

DATA_RA4_FC <- DATA_RA4[,c(1,78:94)]
write.csv(DATA_RA4, 'all_RA_FC.csv',row.names=FALSE)

Note that I caught this late by a day. WE need to set the NaNs to 1 not zero, because there isn’t a halving of gene expression (foldchange = 0) but no change of fold change =1. We only want the change, from diseased state to healthy state, and when using division by zero, we will get Inf if the numerator is not zero but the denominator is zero, and NaN when both are zero. To handle this if the numerator is not zero and the denominator is, then the change is the numerator, and if they are both zero, then there is no change, hence zero change will be a 1 because there is no change in over or under expression. My previous use of this was errored, I will go back and fix this mistake, because having a fold change value of 0 actually means the gene had a decrease of 100% or halved, because and increase of 100% is a fold change of 2 and hence doubled. We will write these in with conditions using ifelse.

DATA_RA4_FC$trt_ctrl_FoldChange <- ifelse(DATA_RA4_FC$trt_ctrl_FoldChange=='Inf',
                                          DATA_RA4_FC$trt_mean,
                                          ifelse(DATA_RA4_FC$trt_ctrl_FoldChange=='NaN',
                                                 1,
                                                 DATA_RA4_FC$trt_ctrl_FoldChange))
DATA_RA4_FC$fem_trt_ctrl_FoldChange <- ifelse(DATA_RA4_FC$fem_trt_ctrl_FoldChange=='Inf',
                                          DATA_RA4_FC$fem_trt_mean,
                                          ifelse(DATA_RA4_FC$fem_trt_ctrl_FoldChange=='NaN',
                                                 1,
                                                 DATA_RA4_FC$fem_trt_ctrl_FoldChange))
DATA_RA4_FC$mal_trt_ctrl_FoldChange <- ifelse(DATA_RA4_FC$mal_trt_ctrl_FoldChange=='Inf',
                                          DATA_RA4_FC$mal_trt_mean,
                                          ifelse(DATA_RA4_FC$mal_trt_ctrl_FoldChange=='NaN',
                                                 1,
                                                 DATA_RA4_FC$mal_trt_ctrl_FoldChange))
DATA_RA4_FC$fem_mal_trt_FoldChange <- ifelse(DATA_RA4_FC$fem_mal_trt_FoldChange=='Inf',
                                          DATA_RA4_FC$fem_trt_mean,
                                          ifelse(DATA_RA4_FC$fem_mal_trt_FoldChange=='NaN',
                                                 1,
                                                 DATA_RA4_FC$fem_mal_trt_FoldChange))
DATA_RA4_FC$fem_mal_ctrl_FoldChange <- ifelse(DATA_RA4_FC$fem_mal_ctrl_FoldChange=='Inf',
                                          DATA_RA4_FC$fem_ctrl_mean,
                                          ifelse(DATA_RA4_FC$fem_mal_ctrl_FoldChange=='NaN',
                                                 1,
                                                 DATA_RA4_FC$fem_mal_ctrl_FoldChange))
DATA_RA4_FC$less50_trt_ctrl_FoldChange <- ifelse(DATA_RA4_FC$less50_trt_ctrl_FoldChange=='Inf',
                                          DATA_RA4_FC$less50_trt_mean,
                                          ifelse(DATA_RA4_FC$less50_trt_ctrl_FoldChange=='NaN',
                                                 1,
                                                 DATA_RA4_FC$less50_trt_ctrl_FoldChange))
DATA_RA4_FC$great50_trt_ctrl_FoldChange <- ifelse(DATA_RA4_FC$great50_trt_ctrl_FoldChange=='Inf',
                                          DATA_RA4_FC$great50_trt_mean,
                                          ifelse(DATA_RA4_FC$great50_trt_ctrl_FoldChange=='NaN',
                                                 1,
                                                 DATA_RA4_FC$great50_trt_ctrl_FoldChange))

Lets select some genes based on the highest and lowest fold change values. Since we put in the values, our fold change values of zero are significant for those genes that halved. Even smaller values of zero would mean a more significant down regulation. i.e. 0.0 versus 0.001, where the decline in gene expression went from a decrease of 10% versus a decrease of 1000% or 10 times the decrease in the disease state to the healthy state.

write.csv(DATA_RA4_FC,'RA_FCs.csv',row.names=FALSE)

head(DATA_RA4_FC)

##     gene    ctrl_mean    trt_mean mal_ctrl_mean fem_ctrl_mean mal_trt_mean
## 1  1-Dec   0.26315789   0.2105263     0.6666667       0.18750    0.3333333
## 2  1-Mar 707.13157895 633.0526316   710.5000000     706.50000  640.0833333
## 3  1-Sep 829.44736842 876.8684211  1030.0000000     791.84375  906.1666667
## 4 10-Mar   1.89473684   2.0263158     2.5000000       1.78125    1.5000000
## 5 10-Sep  31.76315789  30.8947368    35.6666667      31.03125   28.8333333
## 6 11-Mar   0.02631579   0.0000000     0.0000000       0.03125    0.0000000
##   fem_trt_mean less50_ctrl_mean great50_ctrl_mean less50_trt_mean
## 1      0.18750             0.00        0.35714286             0.1
## 2    631.73438           630.45      734.51785714           576.4
## 3    871.37500           667.50      887.28571429           734.8
## 4      2.12500             1.60        2.00000000             1.6
## 5     31.28125            23.80       34.60714286            22.7
## 6      0.00000             0.00        0.03571429             0.0
##   great50_trt_mean trt_ctrl_FoldChange fem_trt_ctrl_FoldChange
## 1         0.250000           0.8000000               1.0000000
## 2       653.285714           0.8952402               0.8941746
## 3       927.607143           1.0571719               1.1004381
## 4         2.178571           1.0694444               1.1929825
## 5        33.821429           0.9726595               1.0080564
## 6         0.000000           0.0000000               0.0000000
##   mal_trt_ctrl_FoldChange fem_mal_trt_FoldChange fem_mal_ctrl_FoldChange
## 1               0.5000000              0.5625000               0.2812500
## 2               0.9008914              0.9869565               0.9943702
## 3               0.8797735              0.9616057               0.7687803
## 4               0.6000000              1.4166667               0.7125000
## 5               0.8084112              1.0848988               0.8700350
## 6               1.0000000              1.0000000               0.0312500
##   less50_trt_ctrl_FoldChange great50_trt_ctrl_FoldChange
## 1                  0.1000000                   0.7000000
## 2                  0.9142676                   0.8894075
## 3                  1.1008240                   1.0454436
## 4                  1.0000000                   1.0892857
## 5                  0.9537815                   0.9772962
## 6                  1.0000000                   0.0000000

Lets get the highest and least expressed genes for 10 of each in fold change values for each group of: - younger than 50 treatment/control fold change - older than 50 treatment/control fold change - females to males control fold change - females to males treatment fold change - females treatment/control fold change - males treatment/contral fold change - treatment/control fold change for all participants

treatControl <- DATA_RA4_FC[order(DATA_RA4_FC$trt_ctrl_FoldChange,decreasing=T)[c(1:10,56627:56636)],]

young <- DATA_RA4_FC[order(DATA_RA4_FC$less50_trt_ctrl_FoldChange,decreasing=T)[c(1:10,56627:56636)],]
old <- DATA_RA4_FC[order(DATA_RA4_FC$great50_trt_ctrl_FoldChange,decreasing=T)[c(1:10,56627:56636)],]

fems <- DATA_RA4_FC[order(DATA_RA4_FC$fem_trt_ctrl_FoldChange,decreasing=T)[c(1:10,56627:56636)],]

males <- DATA_RA4_FC[order(DATA_RA4_FC$mal_trt_ctrl_FoldChange,decreasing=T)[c(1:10,56627:56636)],]

fems_males_trt <- DATA_RA4_FC[order(DATA_RA4_FC$fem_mal_trt_FoldChange,decreasing=T)[c(1:10,56627:56636)],]

fems_males_ctrl <- DATA_RA4_FC[order(DATA_RA4_FC$fem_mal_trt_FoldChange,decreasing=T)[c(1:10,56627:56636)],]

write.csv(treatControl,'treatControl.csv', row.names=F)
write.csv(young, 'young.csv', row.names=F)
write.csv(old, 'old.csv', row.names=F)
write.csv(fems,'fems.csv',row.names=F)
write.csv(males, 'males.csv', row.names=F)
write.csv(fems_males_trt,'fems_males_trt.csv',row.names=F)
write.csv(fems_males_ctrl,'fems_males_ctrl.csv',row.names=F)

The above files are the: - treatment/control of all patients for ratio of week 12 to week 0 treated with abatacept in RA blood - younger than 50 year old patients mixed genders week12 to week0 - older than 50 year old patients mixed genders week12 to week0 - female patients mixed ages week12 to week0 - male patients mixed ages week12 to week0 - females to males mixed ages week12 ratios - females to males mixed ages week0 ratios

Lets now do some machine learning on these data sets. We will take the top 2 genes and bottom 2 from each data frame and then use those as factors or gene variables to split our data of samples up into training and testing sets using various machine learning algorithms in R and possibly in python with the R reticulate package.

treat2 <- treatControl[c(1:2,19:20),]
young2 <- young[c(1:2,19:20),]
old2 <- old[c(1:2,19:20),]
fems2 <- fems[c(1:2,19:20),]
males2 <- males[c(1:2,19:20),]
fems_males_trt2 <- fems_males_trt[c(1:2,19:20),]
fems_males_ctrl2 <- fems_males_ctrl[c(1:2,19:20),]

ML_genes_df <-rbind(treat2,young2,old2,fems2,males2,fems_males_trt2,
                    fems_males_ctrl2)
ML_genes_df2 <- ML_genes_df[!duplicated(ML_genes_df),]
ML_genes_List <- sort(ML_genes_df2$gene)
ML_genes_List

##  [1] "AL161626.1"    "CTB-57H20.1"   "HEPH"          "KCNJ10"       
##  [5] "OR2L3"         "PGF"           "RP11-317J10.4" "RP11-561N12.5"
##  [9] "RP11-806O11.1" "XIST"          "ZNF804B"       "ZNF863P"      
## [13] "ZNF962P"       "ZNF99"         "ZP2"           "ZSCAN1"

Lets create our ML data frame from our original data.

ML_DF <- subset(data, data$gene %in% ML_genes_List)
ML_DF2 <- as.data.frame(t(ML_DF))
colnames(ML_DF2) <- ML_DF$gene
ML_DF2$patientID <- colnames(ML_DF)
ML_DF3 <- ML_DF2[2:77,]
ML_DF4 <- merge(ML_DF3, demographics,by.x='patientID',by.y='patient')
ML_DF5 <- ML_DF4[,c(1,18:21,2:17)]
colnames(ML_DF5)

##  [1] "patientID"     "sample"        "Age"           "gender"       
##  [5] "treatment"     "KCNJ10"        "OR2L3"         "ZNF863P"      
##  [9] "CTB-57H20.1"   "RP11-561N12.5" "ZNF804B"       "HEPH"         
## [13] "XIST"          "RP11-806O11.1" "RP11-317J10.4" "AL161626.1"   
## [17] "ZNF962P"       "PGF"           "ZP2"           "ZNF99"        
## [21] "ZSCAN1"

We now have some outcome variables to predict the outcome or class of each sample after training our models. Lets take a look at the class or variable type of each of our gene and demographic variables.

str(ML_DF5)

## 'data.frame':    76 obs. of  21 variables:
##  $ patientID    : chr  "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
##  $ sample       : Factor w/ 76 levels "GSM4567526","GSM4567527",..: 1 2 55 56 73 74 47 48 11 12 ...
##  $ Age          : num  78 78 80 80 21 21 46 46 59 59 ...
##  $ gender       : chr  "F" "F" "F" "F" ...
##  $ treatment    : chr  "control" "abatacept" "control" "abatacept" ...
##  $ KCNJ10       : Factor w/ 15 levels "    0","    1",..: 1 1 2 2 4 4 3 1 4 4 ...
##   ..- attr(*, "names")= chr  "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
##  $ OR2L3        : Factor w/ 14 levels "    0","    1",..: 1 1 1 3 1 3 1 1 2 1 ...
##   ..- attr(*, "names")= chr  "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
##  $ ZNF863P      : Factor w/ 6 levels "    0","   0",..: 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "names")= chr  "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
##  $ CTB-57H20.1  : Factor w/ 9 levels "    0","    1",..: 1 1 1 1 1 3 1 1 1 1 ...
##   ..- attr(*, "names")= chr  "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
##  $ RP11-561N12.5: Factor w/ 8 levels "    0","    1",..: 1 2 1 1 2 1 1 1 1 2 ...
##   ..- attr(*, "names")= chr  "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
##  $ ZNF804B      : Factor w/ 7 levels "    0","    1",..: 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "names")= chr  "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
##  $ HEPH         : Factor w/ 43 levels "    7","    8",..: 18 10 18 6 13 13 35 28 24 8 ...
##   ..- attr(*, "names")= chr  "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
##  $ XIST         : Factor w/ 68 levels " 0","0","10248",..: 43 63 41 25 30 14 6 50 37 61 ...
##   ..- attr(*, "names")= chr  "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
##  $ RP11-806O11.1: Factor w/ 8 levels "    0","    1",..: 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "names")= chr  "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
##  $ RP11-317J10.4: Factor w/ 10 levels "    0","    1",..: 1 1 1 1 2 1 1 1 1 2 ...
##   ..- attr(*, "names")= chr  "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
##  $ AL161626.1   : Factor w/ 25 levels "    0","    1",..: 14 6 1 1 2 5 1 1 6 15 ...
##   ..- attr(*, "names")= chr  "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
##  $ ZNF962P      : Factor w/ 7 levels "    0","    1",..: 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "names")= chr  "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
##  $ PGF          : Factor w/ 25 levels "    0","    1",..: 1 3 3 2 1 2 15 18 2 2 ...
##   ..- attr(*, "names")= chr  "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
##  $ ZP2          : Factor w/ 6 levels "    0","    1",..: 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "names")= chr  "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
##  $ ZNF99        : Factor w/ 7 levels "    0","    1",..: 2 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "names")= chr  "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...
##  $ ZSCAN1       : Factor w/ 7 levels "    0","    1",..: 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "names")= chr  "P1_w0" "P1_w12" "P10_w0" "P10_w12" ...

write.csv(ML_DF5,'ML_genes.csv',row.names=F)
ML_GENES <- read.csv('ML_genes.csv', sep=',',header=T,na.strings=c('',' ','NA'),
                     stringsAsFactors = T)
str(ML_GENES)

## 'data.frame':    76 obs. of  21 variables:
##  $ patientID    : Factor w/ 76 levels "P1_w0","P1_w12",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ sample       : Factor w/ 76 levels "GSM4567526","GSM4567527",..: 1 2 55 56 73 74 47 48 11 12 ...
##  $ Age          : int  78 78 80 80 21 21 46 46 59 59 ...
##  $ gender       : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ treatment    : Factor w/ 2 levels "abatacept","control": 2 1 2 1 2 1 2 1 2 1 ...
##  $ KCNJ10       : int  0 0 1 1 3 3 2 0 3 3 ...
##  $ OR2L3        : int  0 0 0 2 0 2 0 0 1 0 ...
##  $ ZNF863P      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CTB.57H20.1  : int  0 0 0 0 0 2 0 0 0 0 ...
##  $ RP11.561N12.5: int  0 1 0 0 1 0 0 0 0 1 ...
##  $ ZNF804B      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ HEPH         : int  28 19 28 15 22 22 93 47 36 17 ...
##  $ XIST         : int  24708 45725 24619 19769 22017 16788 12388 26615 23288 40961 ...
##  $ RP11.806O11.1: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ RP11.317J10.4: int  0 0 0 0 1 0 0 0 0 1 ...
##  $ AL161626.1   : int  44 5 0 0 1 4 0 0 5 81 ...
##  $ ZNF962P      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PGF          : int  0 2 2 1 0 1 19 75 1 1 ...
##  $ ZP2          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ZNF99        : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ ZSCAN1       : int  0 0 0 0 0 0 0 0 0 0 ...

Lets rename the rows to the patientID.

row.names(ML_GENES) <- ML_GENES$patientID
ML_GENES2 <- ML_GENES[,-c(1:2)]
head(ML_GENES2)

##         Age gender treatment KCNJ10 OR2L3 ZNF863P CTB.57H20.1 RP11.561N12.5
## P1_w0    78      F   control      0     0       0           0             0
## P1_w12   78      F abatacept      0     0       0           0             1
## P10_w0   80      F   control      1     0       0           0             0
## P10_w12  80      F abatacept      1     2       0           0             0
## P11_w0   21      F   control      3     0       0           0             1
## P11_w12  21      F abatacept      3     2       0           2             0
##         ZNF804B HEPH  XIST RP11.806O11.1 RP11.317J10.4 AL161626.1 ZNF962P PGF
## P1_w0         0   28 24708             0             0         44       0   0
## P1_w12        0   19 45725             0             0          5       0   2
## P10_w0        0   28 24619             0             0          0       0   2
## P10_w12       0   15 19769             0             0          0       0   1
## P11_w0        0   22 22017             0             1          1       0   0
## P11_w12       0   22 16788             0             0          4       0   1
##         ZP2 ZNF99 ZSCAN1
## P1_w0     0     1      0
## P1_w12    0     0      0
## P10_w0    0     0      0
## P10_w12   0     0      0
## P11_w0    0     0      0
## P11_w12   0     0      0

Our machine learning data frame is the ML_GENES2 data, we have the class types of the variables as integer or factor. We can train to classify by gender, treatment, or age.

Lets get started testing our machine learning algorithms from the caret package for random forest (rf), k-nearest neighbor (knn), generalized linear models (glm), and recursive partitioned trees (rpart).

We will likely have to change the setting in our glm model because it is set for numeric regression and not classification, the same for our rf model.

library(RANN) #this pkg supplements caret for out of bag validation
library(e1071)
library(caret)
library(randomForest)
library(MASS)
library(gbm)

set.seed(12345)
inTrain <- createDataPartition(y=ML_GENES2$treatment, p=0.7, list=FALSE)

trainingSet <- ML_GENES2[inTrain,]
testingSet <- ML_GENES2[-inTrain,]
dim(trainingSet)

## [1] 54 19

dim(testingSet)

## [1] 22 19

Lets use random forest, knn, and glm algorithms to predict the ratings.

rf_boot <- train(treatment~., method='rf', 
               na.action=na.pass,
               data=(trainingSet),  preProc = c("center", "scale","knnImpute"),
               trControl=trainControl(method='boot'), number=5)

predRF_boot <- predict(rf_boot, testingSet)

DF_boot <- data.frame(predRF_boot, type=testingSet$treatment)

length_boot <- length(DF_boot$type)

sum_boot <- sum(DF_boot$predRF_boot==DF_boot$type)

accRF_boot <- (sum_boot/length_boot)

accRF_boot

## [1] 0.7272727

head(DF_boot,30)

##    predRF_boot      type
## 1      control   control
## 2    abatacept abatacept
## 3      control   control
## 4    abatacept   control
## 5      control   control
## 6    abatacept abatacept
## 7      control   control
## 8    abatacept abatacept
## 9      control abatacept
## 10   abatacept abatacept
## 11   abatacept   control
## 12   abatacept   control
## 13   abatacept abatacept
## 14     control   control
## 15   abatacept   control
## 16   abatacept abatacept
## 17   abatacept abatacept
## 18     control abatacept
## 19     control   control
## 20   abatacept abatacept
## 21     control   control
## 22   abatacept abatacept

errored <- DF_boot[(DF_boot$predRF_boot != DF_boot$type),]
errored

##    predRF_boot      type
## 4    abatacept   control
## 9      control abatacept
## 11   abatacept   control
## 12   abatacept   control
## 15   abatacept   control
## 18     control abatacept

Lets see which samples created an error in our random forest model.

rn <- as.numeric(row.names(errored))
rn

## [1]  4  9 11 12 15 18

rfError <- testingSet[rn,]
rfError

##         Age gender treatment KCNJ10 OR2L3 ZNF863P CTB.57H20.1 RP11.561N12.5
## P2_w0    69      F   control      0     1       0           0             0
## P27_w12  56      F abatacept      0     0       0           0             0
## P29_w0   55      F   control      2     3       0           0             0
## P3_w0    70      F   control      1     4       0           0             0
## P35_w0   68      F   control      0     4       0           0             0
## P4_w12   27      F abatacept      0     1       0           0             0
##         ZNF804B HEPH  XIST RP11.806O11.1 RP11.317J10.4 AL161626.1 ZNF962P PGF
## P2_w0         0   29 37182             0             0          3       1  30
## P27_w12       0   27  9291             0             1          0       1   7
## P29_w0        0   31 18424             0             0          0       0   0
## P3_w0         0   26 31608             0             0          1       0   8
## P35_w0        0   31 30501             0             0          0       0   7
## P4_w12        0   16 15650             0             0          0       0   0
##         ZP2 ZNF99 ZSCAN1
## P2_w0     0     0      0
## P27_w12   0     0      0
## P29_w0    0     0      0
## P3_w0     0     0      0
## P35_w0    0     0      0
## P4_w12    0     0      0

All the six errors in classification of treatment type were from the female group with a mixed age group. All the males were classified correctly.

lets tune our model and see if we can get prediction accuracy on treatment. methods in trainControl(): boot-boot strap aggregating, LOOCV-leave one out cross validation cv-cross validation repeatedcv-? LGOCV-for repeated training/test splits Scored one better with 5 instead of 6 all females but wasn’t reproduced optimism_boot-? boot632-? boot_all-? adaptive-cv adaptive_boot adaptive_LGOCV none-only fits one model to entire training set for bagged and forest models timeslice

rf_boot <- train(treatment~., method='rf', 
               na.action=na.pass,
               data=(trainingSet),  preProc = c("center", "scale"),
               trControl=trainControl(method='cv'), number=7)

predRF_boot <- predict(rf_boot, testingSet)

DF_boot <- data.frame(predRF_boot, type=testingSet$treatment)

length_boot <- length(DF_boot$type)

sum_boot <- sum(DF_boot$predRF_boot==DF_boot$type)

accRF_boot <- (sum_boot/length_boot)

accRF_boot

## [1] 0.6818182

head(DF_boot,30)

##    predRF_boot      type
## 1      control   control
## 2      control abatacept
## 3      control   control
## 4    abatacept   control
## 5      control   control
## 6    abatacept abatacept
## 7      control   control
## 8    abatacept abatacept
## 9      control abatacept
## 10   abatacept abatacept
## 11   abatacept   control
## 12   abatacept   control
## 13   abatacept abatacept
## 14     control   control
## 15   abatacept   control
## 16   abatacept abatacept
## 17   abatacept abatacept
## 18     control abatacept
## 19     control   control
## 20   abatacept abatacept
## 21     control   control
## 22   abatacept abatacept

errored <- DF_boot[(DF_boot$predRF_boot != DF_boot$type),]
errored

##    predRF_boot      type
## 2      control abatacept
## 4    abatacept   control
## 9      control abatacept
## 11   abatacept   control
## 12   abatacept   control
## 15   abatacept   control
## 18     control abatacept

rn <- as.numeric(row.names(errored))
rn

## [1]  2  4  9 11 12 15 18

rfError <- testingSet[rn,]
rfError

##         Age gender treatment KCNJ10 OR2L3 ZNF863P CTB.57H20.1 RP11.561N12.5
## P13_w12  59      F abatacept      3     0       0           0             1
## P2_w0    69      F   control      0     1       0           0             0
## P27_w12  56      F abatacept      0     0       0           0             0
## P29_w0   55      F   control      2     3       0           0             0
## P3_w0    70      F   control      1     4       0           0             0
## P35_w0   68      F   control      0     4       0           0             0
## P4_w12   27      F abatacept      0     1       0           0             0
##         ZNF804B HEPH  XIST RP11.806O11.1 RP11.317J10.4 AL161626.1 ZNF962P PGF
## P13_w12       0   17 40961             0             1         81       0   1
## P2_w0         0   29 37182             0             0          3       1  30
## P27_w12       0   27  9291             0             1          0       1   7
## P29_w0        0   31 18424             0             0          0       0   0
## P3_w0         0   26 31608             0             0          1       0   8
## P35_w0        0   31 30501             0             0          0       0   7
## P4_w12        0   16 15650             0             0          0       0   0
##         ZP2 ZNF99 ZSCAN1
## P13_w12   0     0      0
## P2_w0     0     0      0
## P27_w12   0     0      0
## P29_w0    0     0      0
## P3_w0     0     0      0
## P35_w0    0     0      0
## P4_w12    0     0      0

The tuning didn’t produce better results. The errors ranged from 6-8 with a week 12 male aged 62 in with the female errors.

Lets see how well our rf model can classify by gender instead of treatment.

rf_boot <- train(gender~., method='rf', 
               na.action=na.pass,
               data=(trainingSet),  preProc = c("center", "scale"),
               trControl=trainControl(method='boot'), number=5)

predRF_boot <- predict(rf_boot, testingSet)

DF_boot <- data.frame(predRF_boot, type=testingSet$gender)

length_boot <- length(DF_boot$type)

sum_boot <- sum(DF_boot$predRF_boot==DF_boot$type)

accRF_boot <- (sum_boot/length_boot)

accRF_boot

## [1] 1

head(DF_boot,30)

##    predRF_boot type
## 1            F    F
## 2            F    F
## 3            F    F
## 4            F    F
## 5            F    F
## 6            F    F
## 7            M    M
## 8            M    M
## 9            F    F
## 10           F    F
## 11           F    F
## 12           F    F
## 13           M    M
## 14           F    F
## 15           F    F
## 16           F    F
## 17           M    M
## 18           F    F
## 19           F    F
## 20           F    F
## 21           F    F
## 22           F    F

errored <- DF_boot[(DF_boot$predRF_boot != DF_boot$type),]
errored

## [1] predRF_boot type       
## <0 rows> (or 0-length row.names)

rn <- as.numeric(row.names(errored))
rn

## numeric(0)

rfError <- testingSet[rn,]
rfError

##  [1] Age           gender        treatment     KCNJ10        OR2L3        
##  [6] ZNF863P       CTB.57H20.1   RP11.561N12.5 ZNF804B       HEPH         
## [11] XIST          RP11.806O11.1 RP11.317J10.4 AL161626.1    ZNF962P      
## [16] PGF           ZP2           ZNF99         ZSCAN1       
## <0 rows> (or 0-length row.names)

The random forest model was able to classify the gender based on the genes used and the age and treatment group 100% accurately.

Now lets add an age variable for those younger than 50 and those 50 or older and remove our other age variable that correlates with this new feature adding.

ML_GENES3 <- ML_GENES2
ML_GENES3$Age <- ifelse(ML_GENES3$Age < 50, 'younger','older')
head(ML_GENES3)

##             Age gender treatment KCNJ10 OR2L3 ZNF863P CTB.57H20.1 RP11.561N12.5
## P1_w0     older      F   control      0     0       0           0             0
## P1_w12    older      F abatacept      0     0       0           0             1
## P10_w0    older      F   control      1     0       0           0             0
## P10_w12   older      F abatacept      1     2       0           0             0
## P11_w0  younger      F   control      3     0       0           0             1
## P11_w12 younger      F abatacept      3     2       0           2             0
##         ZNF804B HEPH  XIST RP11.806O11.1 RP11.317J10.4 AL161626.1 ZNF962P PGF
## P1_w0         0   28 24708             0             0         44       0   0
## P1_w12        0   19 45725             0             0          5       0   2
## P10_w0        0   28 24619             0             0          0       0   2
## P10_w12       0   15 19769             0             0          0       0   1
## P11_w0        0   22 22017             0             1          1       0   0
## P11_w12       0   22 16788             0             0          4       0   1
##         ZP2 ZNF99 ZSCAN1
## P1_w0     0     1      0
## P1_w12    0     0      0
## P10_w0    0     0      0
## P10_w12   0     0      0
## P11_w0    0     0      0
## P11_w12   0     0      0

set.seed(12345)
inTrain <- createDataPartition(y=ML_GENES3$Age, p=0.7, list=FALSE)

trainingSet2 <- ML_GENES3[inTrain,]
testingSet2 <- ML_GENES3[-inTrain,]
dim(trainingSet2)

## [1] 54 19

dim(testingSet2)

## [1] 22 19

rf_boot <- train(Age~., method='rf', 
               na.action=na.pass,
               data=(trainingSet2),  preProc = c("center", "scale"),
               trControl=trainControl(method='boot'), number=5)

predRF_boot <- predict(rf_boot, testingSet2)

DF_boot <- data.frame(predRF_boot, type=testingSet2$Age)

length_boot <- length(DF_boot$type)

sum_boot <- sum(DF_boot$predRF_boot==DF_boot$type)

accRF_boot <- (sum_boot/length_boot)

accRF_boot

## [1] 0.7727273

head(DF_boot,30)

##    predRF_boot    type
## 1        older younger
## 2        older younger
## 3        older   older
## 4        older   older
## 5        older   older
## 6        older younger
## 7        older younger
## 8        older   older
## 9        older   older
## 10       older   older
## 11       older   older
## 12       older   older
## 13       older   older
## 14       older   older
## 15       older   older
## 16       older   older
## 17       older   older
## 18       older   older
## 19       older   older
## 20     younger younger
## 21       older   older
## 22       older younger

errored <- DF_boot[(DF_boot$predRF_boot != DF_boot$type),]
errored

##    predRF_boot    type
## 1        older younger
## 2        older younger
## 6        older younger
## 7        older younger
## 22       older younger

rn <- as.numeric(row.names(errored))
rn

## [1]  1  2  6  7 22

rfError <- testingSet2[rn,]
rfError

##             Age gender treatment KCNJ10 OR2L3 ZNF863P CTB.57H20.1 RP11.561N12.5
## P11_w12 younger      F abatacept      3     2       0           2             0
## P12_w12 younger      F abatacept      0     0       0           0             0
## P19_w0  younger      F   control      1     0       0           0             0
## P19_w12 younger      F abatacept      1     0       0           2             0
## P9_w0   younger      F   control      0     0       0           0             0
##         ZNF804B HEPH  XIST RP11.806O11.1 RP11.317J10.4 AL161626.1 ZNF962P PGF
## P11_w12       0   22 16788             0             0          4       0   1
## P12_w12       0   47 26615             0             0          0       0  75
## P19_w0        0   50 18051             0             0          1       0   2
## P19_w12       0   41 19099             0             0          0       0   2
## P9_w0         0   17 25929             0             0          0       0   1
##         ZP2 ZNF99 ZSCAN1
## P11_w12   0     0      0
## P12_w12   0     0      0
## P19_w0    0     0      1
## P19_w12   0     0      0
## P9_w0     0     0      0

When using our random forest model to predict the age as younger than 50 or 50 or older, it misclassified 5 samples that were all females. Lets see what those ages were for our females that were misclassified as older.

ageErrorList <- row.names(rfError)
ageError <- row.names(ML_GENES2) %in% ageErrorList
ML_GENES2[ageError,]

##         Age gender treatment KCNJ10 OR2L3 ZNF863P CTB.57H20.1 RP11.561N12.5
## P11_w12  21      F abatacept      3     2       0           2             0
## P12_w12  46      F abatacept      0     0       0           0             0
## P19_w0   47      F   control      1     0       0           0             0
## P19_w12  47      F abatacept      1     0       0           2             0
## P9_w0    47      F   control      0     0       0           0             0
##         ZNF804B HEPH  XIST RP11.806O11.1 RP11.317J10.4 AL161626.1 ZNF962P PGF
## P11_w12       0   22 16788             0             0          4       0   1
## P12_w12       0   47 26615             0             0          0       0  75
## P19_w0        0   50 18051             0             0          1       0   2
## P19_w12       0   41 19099             0             0          0       0   2
## P9_w0         0   17 25929             0             0          0       0   1
##         ZP2 ZNF99 ZSCAN1
## P11_w12   0     0      0
## P12_w12   0     0      0
## P19_w0    0     0      1
## P19_w12   0     0      0
## P9_w0     0     0      0

The females that were misclassified as older than 50 when they were younger was 4 unique females but a couple in the week 12 and one in the week0, and the 4th from both week0 and week12.

Lets move on to exploring those genes that could make more of a difference in detection RA, since this data isn’t on COVID-19 genes, but symptoms related to COVID-19 after knocking down an interleukin with an inhibitor. We used both under and over expressed genes, lets remove the under expressed genes.

treat2b <- treatControl[c(1:2),]
young2b <- young[c(1:2),]
old2b <- old[c(1:2),]
fems2b <- fems[c(1:2),]
males2b <- males[c(1:2),]
fems_males_trt2b <- fems_males_trt[c(1:2),]
fems_males_ctrl2b <- fems_males_ctrl[c(1:2),]

ML_genes_dfb <-rbind(treat2b,young2b,old2b,fems2b,males2b,fems_males_trt2b,
                    fems_males_ctrl2b)
ML_genes_df2b <- ML_genes_dfb[!duplicated(ML_genes_dfb),]
ML_genes_Listb <- sort(ML_genes_df2b$gene)
ML_genes_Listb

##  [1] "AL161626.1"    "CTB-57H20.1"   "HEPH"          "KCNJ10"       
##  [5] "OR2L3"         "PGF"           "RP11-317J10.4" "RP11-561N12.5"
##  [9] "RP11-806O11.1" "XIST"

upReg <- c('Age','gender','treatment',ML_genes_Listb)
upReg2 <- gsub('-','.',upReg)

ML_up <- ML_GENES2[,colnames(ML_GENES2) %in% upReg2]
ML_up

##         Age gender treatment KCNJ10 OR2L3 CTB.57H20.1 RP11.561N12.5 HEPH  XIST
## P1_w0    78      F   control      0     0           0             0   28 24708
## P1_w12   78      F abatacept      0     0           0             1   19 45725
## P10_w0   80      F   control      1     0           0             0   28 24619
## P10_w12  80      F abatacept      1     2           0             0   15 19769
## P11_w0   21      F   control      3     0           0             1   22 22017
## P11_w12  21      F abatacept      3     2           2             0   22 16788
## P12_w0   46      F   control      2     0           0             0   93 12388
## P12_w12  46      F abatacept      0     0           0             0   47 26615
## P13_w0   59      F   control      3     1           0             0   36 23288
## P13_w12  59      F abatacept      3     0           0             1   17 40961
## P14_w0   44      F   control      0     0           0             0   54 10248
## P14_w12  44      F abatacept      3     2           0             1   34 18257
## P15_w0   50      F   control      0     0           0             0   28 10754
## P15_w12  50      F abatacept      1     2           0             1   27 23239
## P16_w0   54      F   control      3     0           0             0   16 31607
## P16_w12  54      F abatacept      1     1           1             1   36 40660
## P17_w0   72      F   control      2     0           0             0   28 16866
## P17_w12  72      F abatacept      2     0           0             1   12 24028
## P18_w0   75      F   control      0     1           0             0   41  6869
## P18_w12  75      F abatacept      1     4           0             0   21 20913
## P19_w0   47      F   control      1     0           0             0   50 18051
## P19_w12  47      F abatacept      1     0           2             0   41 19099
## P2_w0    69      F   control      0     1           0             0   29 37182
## P2_w12   69      F abatacept      1     2           0             0   37 26427
## P20_w0   55      F   control      0     0           0             0   15 46330
## P20_w12  55      F abatacept      0     2           0             0   57 34773
## P21_w0   78      F   control      0     0           0             0   33 16612
## P21_w12  78      F abatacept      0     3           0             0   37 18768
## P22_w0   61      F   control      0     0           0             0   40 13119
## P22_w12  61      F abatacept      0     0           0             1   14 20901
## P23_w0   35      F   control      1     1           0             0   17 17706
## P23_w12  35      F abatacept      0     1           2             0   36 22766
## P24_w0   51      M   control      0     2           1             0    1     0
## P24_w12  51      M abatacept      2     1           0             0    0     0
## P25_w0   56      M   control      0     1           0             0    0     0
## P25_w12  56      M abatacept      3     1           0             1    3     0
## P26_w0   68      F   control      4     0           0             0   20 24503
## P26_w12  68      F abatacept      4     1           1             1   24 22650
## P27_w0   56      F   control      1     1           0             0   47 18171
## P27_w12  56      F abatacept      0     0           0             0   27  9291
## P28_w0   66      F   control      0     3           0             0   23 24365
## P28_w12  66      F abatacept      1     5           0             2   27 24685
## P29_w0   55      F   control      2     3           0             0   31 18424
## P29_w12  55      F abatacept      4     1           0             0   22 16213
## P3_w0    70      F   control      1     4           0             0   26 31608
## P3_w12   70      F abatacept      0     1           0             0   23 20866
## P30_w0   63      M   control      0     1           0             0    1     0
## P30_w12  63      M abatacept      0     1           0             0    0     0
## P31_w0   62      M   control      0     1           1             0    0     2
## P31_w12  62      M abatacept      9     0           1             0    0     0
## P32_w0   43      F   control     49     0           0             0   15 22362
## P32_w12  43      F abatacept      0     0           0             0   49 13612
## P33_w0   76      M   control      1     1           0             0    2     0
## P33_w12  76      M abatacept      1     0           0             1    1     0
## P34_w0   74      F   control      1     1           1             0   36 26925
## P34_w12  74      F abatacept      1     4           0             0   37 26379
## P35_w0   68      F   control      0     4           0             0   31 30501
## P35_w12  68      F abatacept      0     4           0             0   30 27625
## P36_w0   37      F   control      0     0           0             0   22 19285
## P36_w12  37      F abatacept      0     5           0             0    7 23282
## P37_w0   60      M   control      0     2           0             0    0     0
## P37_w12  60      M abatacept      0     2           0             1    0     0
## P38_w0   69      F   control      1     0           0             1    8 25390
## P38_w12  69      F abatacept      3     0           1             0   10 33212
## P4_w0    27      F   control      0     0           0             0   20 18096
## P4_w12   27      F abatacept      0     1           0             0   16 15650
## P5_w0    71      F   control      2     2           0             0   28 29488
## P5_w12   71      F abatacept      3     3           0             1   14 41655
## P6_w0    68      F   control      1     1           0             0   63 10778
## P6_w12   68      F abatacept      0     1           2             0   81 13843
## P7_w0    58      F   control      2     2           0             0   18 50830
## P7_w12   58      F abatacept      2     1           0             0   15 25678
## P8_w0    45      F   control      0     0           0             0   20 24721
## P8_w12   45      F abatacept      0     2           0             0   28 22624
## P9_w0    47      F   control      0     0           0             0   17 25929
## P9_w12   47      F abatacept      0     1           1             0    7 15750
##         RP11.806O11.1 RP11.317J10.4 AL161626.1 PGF
## P1_w0               0             0         44   0
## P1_w12              0             0          5   2
## P10_w0              0             0          0   2
## P10_w12             0             0          0   1
## P11_w0              0             1          1   0
## P11_w12             0             0          4   1
## P12_w0              0             0          0  19
## P12_w12             0             0          0  75
## P13_w0              0             0          5   1
## P13_w12             0             1         81   1
## P14_w0              0             0          0   0
## P14_w12             0             0          0   0
## P15_w0              0             0         35   0
## P15_w12             0             0         13   6
## P16_w0              0             0          0   5
## P16_w12            10             1          2   9
## P17_w0              0             0          0   7
## P17_w12             0             0          0   0
## P18_w0              0             0          0   0
## P18_w12             0             0          1   0
## P19_w0              0             0          1   2
## P19_w12             0             0          0   2
## P2_w0               0             0          3  30
## P2_w12              0             2          0   6
## P20_w0              0             0          0   0
## P20_w12             0             0          0   2
## P21_w0              1             0          0   3
## P21_w12             3             0         32   0
## P22_w0              0             0          3  10
## P22_w12             0             0         17  12
## P23_w0              0             0          0   0
## P23_w12             0             0          0   0
## P24_w0              0             0          0   1
## P24_w12             0             0          2   1
## P25_w0              0             0          7   1
## P25_w12             0             0          4   1
## P26_w0              0             0          0   3
## P26_w12             0             0          0   3
## P27_w0              0             0          0  13
## P27_w12             0             1          0   7
## P28_w0              0             0          0   0
## P28_w12             0             2          0   1
## P29_w0              0             0          0   0
## P29_w12             0             0          0   1
## P3_w0               0             0          1   8
## P3_w12              0             1          3   7
## P30_w0              0             0          1   0
## P30_w12             0             0          0  10
## P31_w0              0             0          9   1
## P31_w12             0             1          3  21
## P32_w0              0             0          0   1
## P32_w12             0             0          0   1
## P33_w0              0             0         34   0
## P33_w12             0             0          2   1
## P34_w0              0             0          2   2
## P34_w12             0             0          0   9
## P35_w0              0             0          0   7
## P35_w12             0             0          0   1
## P36_w0              0             0          0   0
## P36_w12             0             0         20   2
## P37_w0              0             0          1   0
## P37_w12             0             1          0   7
## P38_w0              0             0          1   3
## P38_w12             0             0          0   0
## P4_w0               0             0          0   0
## P4_w12              0             0          0   0
## P5_w0               0             0          0   2
## P5_w12              0             0          0   4
## P6_w0               0             0         38   2
## P6_w12              1             1          0   0
## P7_w0               0             0          2   2
## P7_w12              0             0          5   7
## P8_w0               0             0          0   0
## P8_w12              0             0          0   2
## P9_w0               0             0          0   1
## P9_w12              0             0          0   0

Lets now repeat our random forest classifier on this data by treatment, then gender, and finally by Age younger than 50 or 50 and older.

set.seed(12345)
inTrain <- createDataPartition(y=ML_up$treatment, p=0.7, list=FALSE)

trainingSet <- ML_up[inTrain,]
testingSet <- ML_up[-inTrain,]
dim(trainingSet)

## [1] 54 13

dim(testingSet)

## [1] 22 13

Lets use random forest, knn, and glm algorithms to predict the ratings.

rf_boot <- train(treatment~., method='rf', 
               na.action=na.pass,
               data=(trainingSet),  preProc = c("center", "scale"),
               trControl=trainControl(method='boot'), number=5)

predRF_boot <- predict(rf_boot, testingSet)

DF_boot <- data.frame(predRF_boot, type=testingSet$treatment)

length_boot <- length(DF_boot$type)

sum_boot <- sum(DF_boot$predRF_boot==DF_boot$type)

accRF_boot <- (sum_boot/length_boot)

accRF_boot

## [1] 0.6818182

head(DF_boot,30)

##    predRF_boot      type
## 1      control   control
## 2      control abatacept
## 3      control   control
## 4    abatacept   control
## 5      control   control
## 6    abatacept abatacept
## 7      control   control
## 8    abatacept abatacept
## 9      control abatacept
## 10   abatacept abatacept
## 11   abatacept   control
## 12   abatacept   control
## 13   abatacept abatacept
## 14     control   control
## 15   abatacept   control
## 16   abatacept abatacept
## 17   abatacept abatacept
## 18     control abatacept
## 19     control   control
## 20   abatacept abatacept
## 21     control   control
## 22   abatacept abatacept

errored <- DF_boot[(DF_boot$predRF_boot != DF_boot$type),]
errored

##    predRF_boot      type
## 2      control abatacept
## 4    abatacept   control
## 9      control abatacept
## 11   abatacept   control
## 12   abatacept   control
## 15   abatacept   control
## 18     control abatacept

Lets see which samples created an error in our random forest model.

rn <- as.numeric(row.names(errored))
rn

## [1]  2  4  9 11 12 15 18

rfError <- testingSet[rn,]
rfError

##         Age gender treatment KCNJ10 OR2L3 CTB.57H20.1 RP11.561N12.5 HEPH  XIST
## P13_w12  59      F abatacept      3     0           0             1   17 40961
## P2_w0    69      F   control      0     1           0             0   29 37182
## P27_w12  56      F abatacept      0     0           0             0   27  9291
## P29_w0   55      F   control      2     3           0             0   31 18424
## P3_w0    70      F   control      1     4           0             0   26 31608
## P35_w0   68      F   control      0     4           0             0   31 30501
## P4_w12   27      F abatacept      0     1           0             0   16 15650
##         RP11.806O11.1 RP11.317J10.4 AL161626.1 PGF
## P13_w12             0             1         81   1
## P2_w0               0             0          3  30
## P27_w12             0             1          0   7
## P29_w0              0             0          0   0
## P3_w0               0             0          1   8
## P35_w0              0             0          0   7
## P4_w12              0             0          0   0

With only our genes with over expression the classification by treatment wasn’t improved. Lets try by gender, and recall when we used gender to classify on all the selected genes we scored a 100% accuracy. We will see if it is as good or worse.

set.seed(12345)
inTrain <- createDataPartition(y=ML_up$gender, p=0.7, list=FALSE)

trainingSet <- ML_up[inTrain,]
testingSet <- ML_up[-inTrain,]
dim(trainingSet)

## [1] 54 13

dim(testingSet)

## [1] 22 13

Lets use random forest, knn, and glm algorithms to predict the ratings.

rf_boot <- train(gender~., method='rf', 
               na.action=na.pass,
               data=(trainingSet),  preProc = c("center", "scale"),
               trControl=trainControl(method='boot'), number=5)

predRF_boot <- predict(rf_boot, testingSet)

DF_boot <- data.frame(predRF_boot, type=testingSet$gender)

length_boot <- length(DF_boot$type)

sum_boot <- sum(DF_boot$predRF_boot==DF_boot$type)

accRF_boot <- (sum_boot/length_boot)

accRF_boot

## [1] 1

head(DF_boot,30)

##    predRF_boot type
## 1            F    F
## 2            F    F
## 3            F    F
## 4            F    F
## 5            F    F
## 6            F    F
## 7            F    F
## 8            F    F
## 9            M    M
## 10           F    F
## 11           F    F
## 12           F    F
## 13           F    F
## 14           M    M
## 15           F    F
## 16           M    M
## 17           F    F
## 18           F    F
## 19           F    F
## 20           F    F
## 21           F    F
## 22           F    F

errored <- DF_boot[(DF_boot$predRF_boot != DF_boot$type),]
errored

## [1] predRF_boot type       
## <0 rows> (or 0-length row.names)

Lets see which samples created an error in our random forest model.

rn <- as.numeric(row.names(errored))
rn

## numeric(0)

rfError <- testingSet[rn,]
rfError

##  [1] Age           gender        treatment     KCNJ10        OR2L3        
##  [6] CTB.57H20.1   RP11.561N12.5 HEPH          XIST          RP11.806O11.1
## [11] RP11.317J10.4 AL161626.1    PGF          
## <0 rows> (or 0-length row.names)

Our model on only most expressed genes scored 100% classifying by gender. Our selected genes must have a significant change by gender to be able to classify each sample as male or female correctly for all samples.

Now we’ll add the age group.

ML_up2 <- ML_up
ML_up2$Age <- ifelse(ML_up2$Age < 50, 'younger','older')
head(ML_up2)

##             Age gender treatment KCNJ10 OR2L3 CTB.57H20.1 RP11.561N12.5 HEPH
## P1_w0     older      F   control      0     0           0             0   28
## P1_w12    older      F abatacept      0     0           0             1   19
## P10_w0    older      F   control      1     0           0             0   28
## P10_w12   older      F abatacept      1     2           0             0   15
## P11_w0  younger      F   control      3     0           0             1   22
## P11_w12 younger      F abatacept      3     2           2             0   22
##          XIST RP11.806O11.1 RP11.317J10.4 AL161626.1 PGF
## P1_w0   24708             0             0         44   0
## P1_w12  45725             0             0          5   2
## P10_w0  24619             0             0          0   2
## P10_w12 19769             0             0          0   1
## P11_w0  22017             0             1          1   0
## P11_w12 16788             0             0          4   1

set.seed(12345)
inTrain <- createDataPartition(y=ML_up2$gender, p=0.7, list=FALSE)

trainingSet <- ML_up2[inTrain,]
testingSet <- ML_up2[-inTrain,]
dim(trainingSet)

## [1] 54 13

dim(testingSet)

## [1] 22 13

Lets use random forest, knn, and glm algorithms to predict the ratings.

rf_boot <- train(Age~., method='rf', 
               na.action=na.pass,
               data=(trainingSet),  preProc = c("center", "scale"),
               trControl=trainControl(method='boot'), number=5)

predRF_boot <- predict(rf_boot, testingSet)

DF_boot <- data.frame(predRF_boot, type=testingSet$Age)

length_boot <- length(DF_boot$type)

sum_boot <- sum(DF_boot$predRF_boot==DF_boot$type)

accRF_boot <- (sum_boot/length_boot)

accRF_boot

## [1] 0.7727273

head(DF_boot,30)

##    predRF_boot    type
## 1        older younger
## 2        older younger
## 3        older   older
## 4        older   older
## 5      younger   older
## 6        older younger
## 7      younger   older
## 8        older   older
## 9        older   older
## 10       older   older
## 11       older   older
## 12       older   older
## 13       older   older
## 14       older   older
## 15     younger younger
## 16       older   older
## 17       older   older
## 18       older   older
## 19       older   older
## 20     younger younger
## 21       older   older
## 22     younger younger

errored <- DF_boot[(DF_boot$predRF_boot != DF_boot$type),]
errored

##   predRF_boot    type
## 1       older younger
## 2       older younger
## 5     younger   older
## 6       older younger
## 7     younger   older

Lets see which samples created an error in our random forest model.

rn <- as.numeric(row.names(errored))
rn

## [1] 1 2 5 6 7

rfError <- testingSet[rn,]
rfError

##            Age gender treatment KCNJ10 OR2L3 CTB.57H20.1 RP11.561N12.5 HEPH
## P11_w0 younger      F   control      3     0           0             1   22
## P12_w0 younger      F   control      2     0           0             0   93
## P18_w0   older      F   control      0     1           0             0   41
## P19_w0 younger      F   control      1     0           0             0   50
## P20_w0   older      F   control      0     0           0             0   15
##         XIST RP11.806O11.1 RP11.317J10.4 AL161626.1 PGF
## P11_w0 22017             0             1          1   0
## P12_w0 12388             0             0          0  19
## P18_w0  6869             0             0          0   0
## P19_w0 18051             0             0          1   2
## P20_w0 46330             0             0          0   0

This data frame using only the most expressed few genes in each group didn’t classify better than the last group. The same misclassifications were still by female, but not all are older misclassifications. Because there are a few females that were older and classified as younger based on these genes in RA patients most expressed in week12 compared to week0 of having been treated with Abatacept in both groups. These are also unique females and all in the 1st week or week0. That does show some improvement, then because all week 12 samples were able to be classified correctly as younger than 50 or 50 plus in age.

I doublt we will do better with other models, but here are the schematics or coding parameters to use with altered data in some cases.

knn_boot <- train(treatment ~ .,
                method='knn', preProcess=c('center','scale'),
                tuneLength=10, trControl=trainControl(method='boot'),
                data=trainingSet)

predKNN_boot <- predict(knn_boot, testingSet)

DF_KNN_boot <- data.frame(predKNN_boot, type=testingSet$treatment)

length_KNN_boot <- length(DF_KNN_boot$type)

sum_KNN_boot <- sum(DF_KNN_boot$predKNN_boot==DF_KNN_boot$type)

accKNN_boot <- (sum_KNN_boot/length_KNN_boot)

accKNN_boot

## [1] 0.7272727

head(DF_KNN_boot,30)

##    predKNN_boot      type
## 1     abatacept   control
## 2       control   control
## 3       control   control
## 4     abatacept abatacept
## 5       control   control
## 6       control   control
## 7       control   control
## 8       control   control
## 9     abatacept abatacept
## 10      control   control
## 11      control   control
## 12      control abatacept
## 13    abatacept   control
## 14    abatacept abatacept
## 15      control abatacept
## 16    abatacept abatacept
## 17      control   control
## 18    abatacept abatacept
## 19      control abatacept
## 20      control abatacept
## 21      control   control
## 22      control   control

errored <- DF_KNN_boot[(DF_KNN_boot$predKNN_boot != DF_KNN_boot$type),]
errored

##    predKNN_boot      type
## 1     abatacept   control
## 12      control abatacept
## 13    abatacept   control
## 15      control abatacept
## 19      control abatacept
## 20      control abatacept

Lets see which samples created an error in our random forest model.

rn <- as.numeric(row.names(errored))
rn

## [1]  1 12 13 15 19 20

rfError <- testingSet[rn,]
rfError

##             Age gender treatment KCNJ10 OR2L3 CTB.57H20.1 RP11.561N12.5 HEPH
## P11_w0  younger      F   control      3     0           0             1   22
## P27_w12   older      F abatacept      0     0           0             0   27
## P3_w0     older      F   control      1     4           0             0   26
## P32_w12 younger      F abatacept      0     0           0             0   49
## P38_w12   older      F abatacept      3     0           1             0   10
## P4_w12  younger      F abatacept      0     1           0             0   16
##          XIST RP11.806O11.1 RP11.317J10.4 AL161626.1 PGF
## P11_w0  22017             0             1          1   0
## P27_w12  9291             0             1          0   7
## P3_w0   31608             0             0          1   8
## P32_w12 13612             0             0          0   1
## P38_w12 33212             0             0          0   0
## P4_w12  15650             0             0          0   0

GeneralizedBoostedModel

gbmMod <- train(treatment~., method='gbm', data=trainingSet, verbose=FALSE )
predGbm <- predict(gbmMod, testingSet)
sumGBM0 <- sum(predGbm==testingSet$treatment)
lengthGBM0 <- length(testingSet$treatment)
accuracy_gbmMod <- sumGBM0/lengthGBM0 
accuracy_gbmMod

## [1] 0.3636364

DF_GBM <- data.frame(predGbm, type=testingSet$treatment)
head(DF_GBM,30)

##      predGbm      type
## 1    control   control
## 2  abatacept   control
## 3  abatacept   control
## 4    control abatacept
## 5    control   control
## 6    control   control
## 7    control   control
## 8    control   control
## 9    control abatacept
## 10 abatacept   control
## 11 abatacept   control
## 12 abatacept abatacept
## 13 abatacept   control
## 14   control abatacept
## 15   control abatacept
## 16   control abatacept
## 17 abatacept   control
## 18 abatacept abatacept
## 19   control abatacept
## 20   control abatacept
## 21 abatacept   control
## 22   control   control

errored <- DF_GBM[(DF_GBM$predGbm != DF_GBM$type),]
errored

##      predGbm      type
## 2  abatacept   control
## 3  abatacept   control
## 4    control abatacept
## 9    control abatacept
## 10 abatacept   control
## 11 abatacept   control
## 13 abatacept   control
## 14   control abatacept
## 15   control abatacept
## 16   control abatacept
## 17 abatacept   control
## 19   control abatacept
## 20   control abatacept
## 21 abatacept   control

Lets see which samples created an error in our random forest model.

rn <- as.numeric(row.names(errored))
rn

##  [1]  2  3  4  9 10 11 13 14 15 16 17 19 20 21

rfError <- testingSet[rn,]
rfError

##             Age gender treatment KCNJ10 OR2L3 CTB.57H20.1 RP11.561N12.5 HEPH
## P12_w0  younger      F   control      2     0           0             0   93
## P16_w0    older      F   control      3     0           0             0   16
## P17_w12   older      F abatacept      2     0           0             1   12
## P25_w12   older      M abatacept      3     1           0             1    3
## P26_w0    older      F   control      4     0           0             0   20
## P27_w0    older      F   control      1     1           0             0   47
## P3_w0     older      F   control      1     4           0             0   26
## P31_w12   older      M abatacept      9     0           1             0    0
## P32_w12 younger      F abatacept      0     0           0             0   49
## P33_w12   older      M abatacept      1     0           0             1    1
## P34_w0    older      F   control      1     1           1             0   36
## P38_w12   older      F abatacept      3     0           1             0   10
## P4_w12  younger      F abatacept      0     1           0             0   16
## P6_w0     older      F   control      1     1           0             0   63
##          XIST RP11.806O11.1 RP11.317J10.4 AL161626.1 PGF
## P12_w0  12388             0             0          0  19
## P16_w0  31607             0             0          0   5
## P17_w12 24028             0             0          0   0
## P25_w12     0             0             0          4   1
## P26_w0  24503             0             0          0   3
## P27_w0  18171             0             0          0  13
## P3_w0   31608             0             0          1   8
## P31_w12     0             0             1          3  21
## P32_w12 13612             0             0          0   1
## P33_w12     0             0             0          2   1
## P34_w0  26925             0             0          2   2
## P38_w12 33212             0             0          0   0
## P4_w12  15650             0             0          0   0
## P6_w0   10778             0             0         38   2

Linkage dirichlet allocation model

ldaMod <- train(treatment~., method='lda', data=trainingSet)
predlda <- predict(ldaMod, testingSet)
sumLDA0 <- sum(predlda==testingSet$treatment)
lengthLDA0 <- length(testingSet$treatment)
accuracy_ldaMod <- sumLDA0/lengthLDA0 
accuracy_ldaMod

## [1] 0.5454545

DF_LDA <- data.frame(predlda, type=testingSet$treatment)
head(DF_LDA,30)

##      predlda      type
## 1  abatacept   control
## 2    control   control
## 3    control   control
## 4    control abatacept
## 5    control   control
## 6    control   control
## 7    control   control
## 8    control   control
## 9    control abatacept
## 10   control   control
## 11   control   control
## 12   control abatacept
## 13 abatacept   control
## 14 abatacept abatacept
## 15   control abatacept
## 16   control abatacept
## 17 abatacept   control
## 18 abatacept abatacept
## 19   control abatacept
## 20   control abatacept
## 21   control   control
## 22   control   control

errored <- DF_LDA[(DF_LDA$predlda != DF_LDA$type),]
errored

##      predlda      type
## 1  abatacept   control
## 4    control abatacept
## 9    control abatacept
## 12   control abatacept
## 13 abatacept   control
## 15   control abatacept
## 16   control abatacept
## 17 abatacept   control
## 19   control abatacept
## 20   control abatacept

Lets see which samples created an error in our random forest model.

rn <- as.numeric(row.names(errored))
rn

##  [1]  1  4  9 12 13 15 16 17 19 20

rfError <- testingSet[rn,]
rfError

##             Age gender treatment KCNJ10 OR2L3 CTB.57H20.1 RP11.561N12.5 HEPH
## P11_w0  younger      F   control      3     0           0             1   22
## P17_w12   older      F abatacept      2     0           0             1   12
## P25_w12   older      M abatacept      3     1           0             1    3
## P27_w12   older      F abatacept      0     0           0             0   27
## P3_w0     older      F   control      1     4           0             0   26
## P32_w12 younger      F abatacept      0     0           0             0   49
## P33_w12   older      M abatacept      1     0           0             1    1
## P34_w0    older      F   control      1     1           1             0   36
## P38_w12   older      F abatacept      3     0           1             0   10
## P4_w12  younger      F abatacept      0     1           0             0   16
##          XIST RP11.806O11.1 RP11.317J10.4 AL161626.1 PGF
## P11_w0  22017             0             1          1   0
## P17_w12 24028             0             0          0   0
## P25_w12     0             0             0          4   1
## P27_w12  9291             0             1          0   7
## P3_w0   31608             0             0          1   8
## P32_w12 13612             0             0          0   1
## P33_w12     0             0             0          2   1
## P34_w0  26925             0             0          2   2
## P38_w12 33212             0             0          0   0
## P4_w12  15650             0             0          0   0

The random forest classification was better using the most expressed genes and leaving out the least expressed genes by treatment type. But this was designed to have the outcome from the start, because both weeks were given Abatacept and blood taken in week12 after giving it Abatacept. So, there isn’t much to gain in comparing RA patients’ blood in three months.

The generalized linear model is a regression, but we can make that data and test it.

ML_up3 <- ML_up2
ML_up3$treatment <- gsub('control',0,ML_up3$treatment)
ML_up3$treatment <- gsub('abatacept',1,ML_up3$treatment)
ML_up3$treatment <- as.integer(ML_up3$treatment)

set.seed(12345)
inTrain <- createDataPartition(y=ML_up3$treatment, p=0.7, list=FALSE)

trainingSet1 <- ML_up3[inTrain,]
testingSet1 <- ML_up3[-inTrain,]

trainingSet <- ML_up3[inTrain,-c(1:2)]
testingSet <- ML_up3[-inTrain,-c(1:2)]
dim(trainingSet)

## [1] 54 11

dim(testingSet)

## [1] 22 11

glmMod2 <- train(treatment ~ .,
                method='glm', data=trainingSet)

predglm2 <- predict(glmMod2, testingSet)

DF_glm2 <- data.frame(predglm2, type=testingSet$treatment)

DF_glm2$predglm2 <- ifelse(DF_glm2$predglm2<.5,0,1)

length_glm2 <- length(DF_glm2$type)

sum_glm2 <- sum((DF_glm2$predglm2)==DF_glm2$type)

accglm2 <- (sum_glm2/length_glm2)

accglm2

## [1] 0.5909091

head(DF_glm2,30)

##         predglm2 type
## P10_w12        0    1
## P13_w0         0    0
## P15_w12        1    1
## P2_w12         0    1
## P21_w12        0    1
## P22_w0         0    0
## P25_w0         0    0
## P25_w12        1    1
## P27_w0         0    0
## P28_w0         0    0
## P29_w12        0    1
## P3_w12         0    1
## P31_w0         1    0
## P32_w12        0    1
## P35_w12        0    1
## P36_w0         0    0
## P37_w0         0    0
## P4_w0          0    0
## P6_w0          0    0
## P6_w12         1    1
## P8_w0          0    0
## P8_w12         0    1

errored <- DF_glm2[(DF_glm2$predglm2 != DF_glm2$type),]
errored

##         predglm2 type
## P10_w12        0    1
## P2_w12         0    1
## P21_w12        0    1
## P29_w12        0    1
## P3_w12         0    1
## P31_w0         1    0
## P32_w12        0    1
## P35_w12        0    1
## P8_w12         0    1

Lets see which samples created an error in our random forest model.

rn <- row.names(errored)
rn

## [1] "P10_w12" "P2_w12"  "P21_w12" "P29_w12" "P3_w12"  "P31_w0"  "P32_w12"
## [8] "P35_w12" "P8_w12"

rfError <- testingSet1[rn,]
rfError

##             Age gender treatment KCNJ10 OR2L3 CTB.57H20.1 RP11.561N12.5 HEPH
## P10_w12   older      F         1      1     2           0             0   15
## P2_w12    older      F         1      1     2           0             0   37
## P21_w12   older      F         1      0     3           0             0   37
## P29_w12   older      F         1      4     1           0             0   22
## P3_w12    older      F         1      0     1           0             0   23
## P31_w0    older      M         0      0     1           1             0    0
## P32_w12 younger      F         1      0     0           0             0   49
## P35_w12   older      F         1      0     4           0             0   30
## P8_w12  younger      F         1      0     2           0             0   28
##          XIST RP11.806O11.1 RP11.317J10.4 AL161626.1 PGF
## P10_w12 19769             0             0          0   1
## P2_w12  26427             0             2          0   6
## P21_w12 18768             3             0         32   0
## P29_w12 16213             0             0          0   1
## P3_w12  20866             0             1          3   7
## P31_w0      2             0             0          9   1
## P32_w12 13612             0             0          0   1
## P35_w12 27625             0             0          0   1
## P8_w12  22624             0             0          0   2

For regression using the linear regression of generalized linear models in caret, most misclassifications for treatment were the older females, but one younger female and one older male misclassified in the wrong treatment group. They were all erroneously classified as week12 except for the one misclassified male in week0 who was misclassifed as week12.

We know the most expressed genes improve the classification of the treatment or control slightly and the females 100% in either most or most & least expressed genes according to their fold change values of treatment/control ratios. And the age was improved in classifying younger than 50 or 50 and older to week0 females instead of by a mix of treatment and control for females misclassified by age using both the most and least fold change values. The random forest model we used was more accurate in making those classifications but we did test out some other machine learning models for k-nearest neighbor, a topic modeling algorithm called latent dirichlet allocation, gradient boosted models, and generalized linear model with a modified data frame for regression. Under the hood the models work differently to group the samples according to their most logical class according to our feature vectors of the genes and in some cases the age, gender, and/or treatment, depending on which model and if classification or regression used and which feature was the predictor or outcome feature. The topic modeling algorithm does decent in classifying because it usually used a bunch of tokens or words from a document that it weights and sorts or classifies according to topic of a story it has trained already or in its bank. It uses the genes and demographics as tokens with a numerical weight to classify.

RA simulating COVID-19 study with ML

Janis Corona

8/12/2020