Sample data from Maia

Got two example files from Maia:

  1. Read counts for each gene by sample. This is the data Maia produces before using edgeR to test for differential expression; it is the input to that calculation.

  2. The differential expression results, which are what is usually handed off to the researcher.

Read in the file with read counts for each gene by sample (“CF_countsSum.csv”):

counts <- read.csv("CF_countsSum.csv")
head(counts)
##                 X HAP77_CTL HAP79_CTL HAP83_CTL CF573_CTL CF580_CTL
## 1 ENSG00000000419        31        29         5        74       178
## 2 ENSG00000000457         2         2         2        10        18
## 3 ENSG00000000460         7         0         4        10        24
## 4 ENSG00000000971       107       159        44       596       635
## 5 ENSG00000001036        38        40        17        76       135
## 6 ENSG00000001084       109        81        19       153       374
##   CF582_CTL CF586_CTL HAP77_RV1B HAP79_RV1B HAP83_RV1B CF573_RV1B
## 1        61       328         42        135         59         73
## 2        18        45          8         14          9          8
## 3         8        74          7         19         36         11
## 4       179       732        117        437        424        172
## 5        40       339         46        151        116         56
## 6       127       777        125        492        159         61
##   CF580_RV1B CF582_RV1B CF586_RV1B
## 1         80        130        121
## 2         18         52         13
## 3         14         43         12
## 4        323       1391        309
## 5         67        179        116
## 6         66        276        136
str(counts)
## 'data.frame':    14273 obs. of  15 variables:
##  $ X         : Factor w/ 14273 levels "ENSG00000000419",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ HAP77_CTL : int  31 2 7 107 38 109 50 17 86 38 ...
##  $ HAP79_CTL : int  29 2 0 159 40 81 29 20 74 22 ...
##  $ HAP83_CTL : int  5 2 4 44 17 19 6 1 7 7 ...
##  $ CF573_CTL : int  74 10 10 596 76 153 53 38 223 81 ...
##  $ CF580_CTL : int  178 18 24 635 135 374 175 101 453 184 ...
##  $ CF582_CTL : int  61 18 8 179 40 127 60 21 83 41 ...
##  $ CF586_CTL : int  328 45 74 732 339 777 346 160 1626 212 ...
##  $ HAP77_RV1B: int  42 8 7 117 46 125 42 27 78 35 ...
##  $ HAP79_RV1B: int  135 14 19 437 151 492 66 39 157 65 ...
##  $ HAP83_RV1B: int  59 9 36 424 116 159 53 12 50 71 ...
##  $ CF573_RV1B: int  73 8 11 172 56 61 29 22 87 66 ...
##  $ CF580_RV1B: int  80 18 14 323 67 66 38 33 104 44 ...
##  $ CF582_RV1B: int  130 52 43 1391 179 276 79 89 171 110 ...
##  $ CF586_RV1B: int  121 13 12 309 116 136 67 49 161 74 ...
names(counts)
##  [1] "X"          "HAP77_CTL"  "HAP79_CTL"  "HAP83_CTL"  "CF573_CTL" 
##  [6] "CF580_CTL"  "CF582_CTL"  "CF586_CTL"  "HAP77_RV1B" "HAP79_RV1B"
## [11] "HAP83_RV1B" "CF573_RV1B" "CF580_RV1B" "CF582_RV1B" "CF586_RV1B"
rownames(counts[1:10, ])
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

There are 14 samples: 7 controls (_CTL) and 7 treated with RV1B (_RV1B), i.e. 7 samples before and 7 after treatment.
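The 7 + 7 split can be checked from the column names alone. This is a minimal sketch, assuming every sample name follows the `individual_treatment` pattern seen in `names(counts)`; the vector `samples` is typed in by hand so the snippet runs without the CSV file.

```r
# Sample column names copied from names(counts), minus the gene ID column "X"
samples <- c("HAP77_CTL", "HAP79_CTL", "HAP83_CTL", "CF573_CTL", "CF580_CTL",
             "CF582_CTL", "CF586_CTL", "HAP77_RV1B", "HAP79_RV1B", "HAP83_RV1B",
             "CF573_RV1B", "CF580_RV1B", "CF582_RV1B", "CF586_RV1B")

# Split each name at the underscore into individual and treatment
individual <- sub("_.*$", "", samples)
treatment  <- sub("^.*_", "", samples)

# Number of samples per treatment
n_per_treatment <- table(treatment)
print(n_per_treatment)

# TRUE if every individual has exactly one CTL and one RV1B sample
paired <- all(table(individual, treatment) == 1)
print(paired)
```

With the real data frame loaded, `samples <- names(counts)[-1]` would replace the hand-typed vector.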

Read in the file with differential gene expression (“CF_edgeR_DE_CFresponse2RV1B.csv”), obtained after processing with edgeR:

tDat <- read.csv("CF_edgeR_DE_CFresponse2RV1B.csv")
head(tDat)
##                 X logFC logCPM    LR    PValue       FDR
## 1 ENSG00000134326 7.140  7.632 166.8 3.789e-38 4.271e-34
## 2 ENSG00000134321 7.859 10.069 165.8 5.985e-38 4.271e-34
## 3 ENSG00000183486 6.222  8.084 157.2 4.524e-36 2.153e-32
## 4 ENSG00000135114 6.446  8.437 154.4 1.894e-35 6.760e-32
## 5 ENSG00000157601 5.335  9.410 152.4 5.196e-35 1.483e-31
## 6 ENSG00000119917 6.306 10.356 146.7 9.183e-34 2.184e-30
str(tDat)
## 'data.frame':    14273 obs. of  6 variables:
##  $ X     : Factor w/ 14273 levels "ENSG00000000419",..: 5251 5249 10843 5353 7642 3891 3892 5884 11122 5885 ...
##  $ logFC : num  7.14 7.86 6.22 6.45 5.33 ...
##  $ logCPM: num  7.63 10.07 8.08 8.44 9.41 ...
##  $ LR    : num  167 166 157 154 152 ...
##  $ PValue: num  3.79e-38 5.98e-38 4.52e-36 1.89e-35 5.20e-35 ...
##  $ FDR   : num  4.27e-34 4.27e-34 2.15e-32 6.76e-32 1.48e-31 ...

logFC - log_2 of the fold change

logCPM - log_2 of counts per million reads. This value can be used to calculate log_2 RPKM or FPKM by subtracting log_2 of the gene length (in kilobases)

LR - likelihood ratio statistic

FDR - false discovery rate, controlled using Benjamini and Hochberg's algorithm
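The column definitions can be tried out on the six rows printed by `head(tDat)`. This is a sketch under stated assumptions: the values are copied by hand from the output above, the gene length of 2000 bp is made up purely for illustration, and base R's `p.adjust(method = "BH")` is assumed to be equivalent to the adjustment used to produce the FDR column.

```r
# Values copied by hand from head(tDat)
logFC  <- c(7.140, 7.859, 6.222, 6.446, 5.335, 6.306)
logCPM <- c(7.632, 10.069, 8.084, 8.437, 9.410, 10.356)
PValue <- c(3.789e-38, 5.985e-38, 4.524e-36, 1.894e-35, 5.196e-35, 9.183e-34)
FDR    <- c(4.271e-34, 4.271e-34, 2.153e-32, 6.760e-32, 1.483e-31, 2.184e-30)

# logFC is on the log2 scale, so 2^logFC gives the fold change itself
fold_change <- 2^logFC

# log2(RPKM) = logCPM - log2(gene length in kilobases)
# ASSUMPTION: 2000 bp is a hypothetical gene length, not taken from the data
gene_length_bp <- 2000
log2_rpkm <- logCPM - log2(gene_length_bp / 1000)

# Benjamini-Hochberg adjustment of the six smallest p-values,
# with n set to the total number of genes tested (14273 rows in tDat)
bh <- p.adjust(PValue, method = "BH", n = 14273)

# Largest relative difference from the FDR column in the file
max_rel_diff <- max(abs(bh - FDR) / FDR)
print(signif(bh, 4))
print(max_rel_diff)
```

Passing only six p-values works here because they are the six smallest in the file, so their ranks are unchanged; on the full table the call would simply be `p.adjust(tDat$PValue, method = "BH")`.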

It looks like this file contains the differential expression results for the CF “group” (“CF573”, “CF580”, “CF582”, “CF586”).
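To look at the counts behind these results, the CF columns can be picked out of the counts table by name. This is a minimal sketch, assuming all CF samples have column names starting with "CF"; `counts_head` is a hand-typed excerpt of `head(counts)` (two genes, one HAP and one CF individual) so the snippet runs on its own.

```r
# Excerpt of head(counts), typed in by hand
counts_head <- data.frame(
  X          = c("ENSG00000000419", "ENSG00000000457"),
  HAP77_CTL  = c(31, 2),
  CF573_CTL  = c(74, 10),
  HAP77_RV1B = c(42, 8),
  CF573_RV1B = c(73, 8)
)

# Columns whose name starts with "CF"
cf_cols <- grep("^CF", names(counts_head), value = TRUE)

# Keep the gene ID column plus the CF samples
cf_counts <- counts_head[, c("X", cf_cols)]
print(cf_counts)
```

The same two lines applied to `counts` would give a table with the gene ID and the eight CF columns.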

Study design (?):

Two groups: HAP and CF. The HAP group has 3 individuals (or replicates) and the CF group has 4 individuals (or replicates). In both groups, gene expression was measured before (_CTL) and after (_RV1B) treatment with RV1B.
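If the before/after samples really are paired within individuals, one way to encode that for the CF group is a design matrix with an individual term and a treatment term. This is only a sketch of one possible encoding; the notes do not say which model Maia actually fitted in edgeR, and the data frame `design_info` is a hypothetical helper built from the column names.

```r
# CF sample names copied from names(counts)
cf_samples <- c("CF573_CTL", "CF580_CTL", "CF582_CTL", "CF586_CTL",
                "CF573_RV1B", "CF580_RV1B", "CF582_RV1B", "CF586_RV1B")

# Sample annotation derived from the names (hypothetical helper table)
design_info <- data.frame(
  sample     = cf_samples,
  individual = factor(sub("_.*$", "", cf_samples)),
  treatment  = factor(sub("^.*_", "", cf_samples), levels = c("CTL", "RV1B"))
)

# Individual terms absorb per-person baseline differences;
# the treatmentRV1B column is the RV1B versus CTL effect
design <- model.matrix(~ individual + treatment, data = design_info)
print(design)
```

The resulting matrix has 8 rows (one per CF sample) and 5 columns: the intercept, three individual contrasts, and `treatmentRV1B`.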