Exercises 3

Exercise 1: questions 1-10

1. Download the dataset GDS2447 from GEO Datasets (NCBI):

In terminal:

wget ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS2nnn/GDS2447/soft/GDS2447.soft.gz

## --2019-03-21 16:00:50--  ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS2nnn/GDS2447/soft/GDS2447.soft.gz
##            => ‘GDS2447.soft.gz’
## Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.13, 2607:f220:41e:250::12
## Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.13|:21... connected.
## Logging in as anonymous ... Logged in!
## ==> SYST ... done.    ==> PWD ... done.
## ==> TYPE I ... done.  ==> CWD (1) /geo/datasets/GDS2nnn/GDS2447/soft ... done.
## ==> SIZE GDS2447.soft.gz ... 2162905
## ==> PASV ... done.    ==> RETR GDS2447.soft.gz ... done.
## Length: 2162905 (2.1M) (unauthoritative)
## 
##      0K .......... .......... .......... .......... ..........  2%  138K 15s
##     50K .......... .......... .......... .......... ..........  4%  262K 11s
##    100K .......... .......... .......... .......... ..........  7% 2.55M 7s
##    150K .......... .......... .......... .......... ..........  9% 2.54M 6s
##    200K .......... .......... .......... .......... .......... 11%  469K 5s
##    250K .......... .......... .......... .......... .......... 14% 2.37M 4s
##    300K .......... .......... .......... .......... .......... 16% 2.65M 4s
##    350K .......... .......... .......... .......... .......... 18% 2.66M 3s
##    400K .......... .......... .......... .......... .......... 21%  778K 3s
##    450K .......... .......... .......... .......... .......... 23% 2.75M 3s
##    500K .......... .......... .......... .......... .......... 26% 2.62M 2s
##    550K .......... .......... .......... .......... .......... 28% 2.60M 2s
##    600K .......... .......... .......... .......... .......... 30% 2.50M 2s
##    650K .......... .......... .......... .......... .......... 33% 2.45M 2s
##    700K .......... .......... .......... .......... .......... 35% 2.57M 2s
##    750K .......... .......... .......... .......... .......... 37% 2.56M 2s
##    800K .......... .......... .......... .......... .......... 40% 2.74M 1s
##    850K .......... .......... .......... .......... .......... 42% 2.70M 1s
##    900K .......... .......... .......... .......... .......... 44% 2.80M 1s
##    950K .......... .......... .......... .......... .......... 47% 2.08M 1s
##   1000K .......... .......... .......... .......... .......... 49% 2.86M 1s
##   1050K .......... .......... .......... .......... .......... 52% 2.52M 1s
##   1100K .......... .......... .......... .......... .......... 54% 2.46M 1s
##   1150K .......... .......... .......... .......... .......... 56% 2.80M 1s
##   1200K .......... .......... .......... .......... .......... 59% 2.88M 1s
##   1250K .......... .......... .......... .......... .......... 61% 2.61M 1s
##   1300K .......... .......... .......... .......... .......... 63% 2.85M 1s
##   1350K .......... .......... .......... .......... .......... 66% 2.74M 1s
##   1400K .......... .......... .......... .......... .......... 68% 2.90M 1s
##   1450K .......... .......... .......... .......... .......... 71% 2.42M 0s
##   1500K .......... .......... .......... .......... .......... 73% 2.60M 0s
##   1550K .......... .......... .......... .......... .......... 75% 2.48M 0s
##   1600K .......... .......... .......... .......... .......... 78% 2.72M 0s
##   1650K .......... .......... .......... .......... .......... 80% 2.38M 0s
##   1700K .......... .......... .......... .......... .......... 82% 2.02M 0s
##   1750K .......... .......... .......... .......... .......... 85% 2.67M 0s
##   1800K .......... .......... .......... .......... .......... 87% 2.73M 0s
##   1850K .......... .......... .......... .......... .......... 89% 2.71M 0s
##   1900K .......... .......... .......... .......... .......... 92% 2.63M 0s
##   1950K .......... .......... .......... .......... .......... 94% 2.96M 0s
##   2000K .......... .......... .......... .......... .......... 97% 2.50M 0s
##   2050K .......... .......... .......... .......... .......... 99% 2.60M 0s
##   2100K .......... ..                                         100% 3.04M=1.4s
## 
## 2019-03-21 16:00:54 (1.43 MB/s) - ‘GDS2447.soft.gz’ saved [2162905]

Pre-process:

Unzipping:

gunzip GDS2447.soft.gz

Cleaning the data, i.e. removing the comments:

grep -v ^[!^#] GDS2447.soft > GDS2447.clean

Reading the file in R:

raw.data <- read.table('GDS2447.clean', sep = '\t', header = TRUE)
dim(raw.data)

## [1] 33202    17

Removing the first two columns:

data <- raw.data[, -c(1, 2)]
dim(data)

## [1] 33202    15

head(data)

##    GSM144131   GSM144132   GSM144133  GSM144134   GSM144135  GSM144136
## 1 254.147000 165.4640000 334.1650000 202.555000 263.8830000 285.248000
## 2   0.171102   0.0632793   0.0700499   0.176690   0.0827437   0.316200
## 3   0.261324   0.2304610   0.1074160   0.358684   0.7966200   0.119989
## 4   0.699315   0.7008360   0.6627730   0.982541   0.4121350   0.308548
## 5   7.363850   8.6595500  15.1613000  10.231400  17.4776000  15.071000
## 6  17.357700   7.2464300  24.3139000  37.564400  23.0226000  15.463100
##     GSM144122   GSM144123   GSM144124   GSM144125   GSM144126   GSM144127
## 1 168.6640000 182.1780000 185.8640000 210.9910000 185.8930000 211.9420000
## 2   0.0574015   0.0580385   0.0475359   0.0342564   0.0722826   0.0299415
## 3   0.1764020   0.0964890   0.2515190   0.0918525   0.0811391   0.0515375
## 4   1.4016300   0.9375940   0.5687420   0.7422920   0.7696850   0.4549370
## 5  26.1969000  22.6432000  35.4197000  34.4893000  35.7904000  15.7142000
## 6   5.6211400   6.8915300   7.3169000   6.2578800   7.0759800   9.3557000
##     GSM144128   GSM144129   GSM144130
## 1 271.4500000 220.0460000 310.1760000
## 2   0.0302075   0.0798368   0.0809686
## 3   0.1401160   0.0663585   0.2115270
## 4   0.2848950   0.5722410   0.3224470
## 5  15.3359000   9.0004300  18.5867000
## 6  14.5014000  13.3420000  38.4430000

Checking if data is normalized:

boxplot(data)

As the data is not normalized, we will use the log_2 transformation:

normalized.data <- log2(data)
boxplot(normalized.data, main = expression(paste('Boxplot of the ', log2, ' data')))

2. Describe what this dataset is about and how many samples exist from each class:

This dataset consists of transcriptional profiling of samples of the homo sapiens organism and includes two classes, the one of tobacco smokers (nicotine dependents) in which there are 6 samples and the non smoking (control) class in which there are 9 samples (15 samples in total).

The first information is in the descriptions (title, summary, organism) in the GEO’s site, to see click here.

The latter information is found here, which is found after clicking in the “Experiment design and value distribution” left at the bottom at the GEO’s site of GDS2447 and then in the image it will appear on its right.

For the questions 3, 4, 5, 6 assume the significance threshold at 0.05:

3. In the dataset, two classes are present, classA and classB.

Find out which and how many genes are higher in classA than classB:

Making the labels:

labels <- c(rep('tobacco', 6), rep('control', 9))

Creating a new function, the ttest.greater, which applies t.test for a gene (row) when there are two levels/classes, the alternative hypothesis is H1: μ1 > μ2, and returns the p-value:

ttest.greater <- function(vec, labels)
{
  levels <- unique(labels)
  v1 <- vec[ labels == levels[1] ]
  v2 <- vec[ labels == levels[2] ]
  pval <- t.test(v1, v2, alternative = 'greater')$p.value
  pval
}

Applying the ttest.greater function in the data (the normalized data):

pvals <- apply(normalized.data, 1, ttest.greater, labels)

Checking truthness:

length(pvals) == dim(normalized.data)[1]

## [1] TRUE

We consider “tobacco” as classA and “control” as classB, hence the genes that are higher in classA than classB are:

higher.names <- raw.data[,2][pvals < 0.05]

Answer to how many genes are higher in classA than classB:

length(which(pvals < 0.05))

## [1] 7884

Alternative preferred way:

sum(pvals < 0.05)

## [1] 7884

4. Which and how many genes are higher in classB than classA?

In similar manner, we will create the function ttest.less and apply it to the data:

ttest.less <- function(vec, labels)
{
  levels <- unique(labels)
  v1 <- vec[ labels == levels[1] ]
  v2 <- vec[ labels == levels[2] ]
  pval <- t.test(v1, v2, alternative = 'less')$p.value
  pval
}
the.pvals <- apply(normalized.data, 1, ttest.less, labels)

The genes that are higher in classB than classA are:

less.names <- raw.data[,2][the.pvals < 0.05]

Answer to how many genes are higher in classB than classA:

sum(the.pvals < 0.05)

## [1] 4902

5. Which and how many genes are different between classA and classB?

ttest <- function(vec, labels)
{
  levels <- unique(labels)
  v1 <- vec[ labels == levels[1] ]
  v2 <- vec[ labels == levels[2] ]
  pval <- t.test(v1, v2)$p.value
  pval
}
pvalues <- apply(normalized.data, 1, ttest, labels)

The genes that are different between classA and classB are:

DEG.names <- raw.data[,2][pvalues<0.05]

Answer to how many genes are different between classA and classB:

sum(pvalues<0.05)

## [1] 9243

6. Which and how many genes are not different between classA and classB?

The genes that are not different between classA and classB are:

non.DEGs.names <- raw.data[,2][pvalues>=0.05]

Answer to how many genes are not different between classA and classB:

sum(pvalues>=0.05)

## [1] 23959

For question 7 to 8 use a random subset of 500 genes to speed up calculations:

data.subset <- normalized.data[sample(dim(normalized.data)[1], 500),]
dim(data.subset)

## [1] 500  15

head(data.subset)

##        GSM144131  GSM144132  GSM144133  GSM144134  GSM144135 GSM144136
## 19556  5.2208948  6.9693232  7.2808357  6.4459468  5.7585031  6.804221
## 4667  -1.5840078 -1.5534261 -2.4401852 -1.5249442 -2.2586184 -3.354087
## 8397   0.2454470 -0.2575953  0.2348814 -1.1919791 -0.4015071  0.449873
## 26442  3.2219838  1.6587873  2.6375987  1.4445664  2.1329453  2.460048
## 23374  5.1336663  4.4713110  5.5289588  5.3890126  5.5411834  5.118451
## 15660  0.6033688  1.0474337  0.5969256 -0.1009946 -0.5306454 -1.299479
##        GSM144122  GSM144123  GSM144124  GSM144125  GSM144126  GSM144127
## 19556  9.2435858  8.2203590  8.4094461  8.5775346  8.7261602  7.9718309
## 4667  -3.8892615 -3.3338404 -3.5736833 -2.6410097 -2.5612984 -2.3173404
## 8397  -0.0391473 -0.4832284 -0.6784967 -0.4044047 -0.7026419  1.3945526
## 26442  1.9745220  1.6725023  1.2320897  2.1260147  2.7203638  2.4570794
## 23374  5.0737388  4.7555606  5.0948525  5.0726161  5.6941392  5.2285263
## 15660  0.5918900  0.2710910  1.2476784  0.9455253  0.6264578 -0.3190222
##        GSM144128     GSM144129  GSM144130
## 19556  6.9549058  6.273094e+00  5.9525316
## 4667  -2.1830547 -2.478652e+00 -1.7148085
## 8397   0.2297479  2.963106e-01 -0.4677110
## 26442  2.5307874  2.643441e+00  1.8807874
## 23374  5.4162959  5.026251e+00  4.7355222
## 15660  0.2317027  1.442688e-05  0.9516773

7. Construct your own statistic as k=μ2/μ1, where μ1 is the mean of classA and μ2 the mean of classB:

Creating the function getmeans so we can inspect the values of the means:

getmeans <- function(vec, labels){
    levels <- unique(labels)
    v1 <- vec[ labels == levels[1] ] ; v2 <- vec[ labels == levels[2] ]
    # list(mean1=mean(v1), mean2=mean(v2))
    c(mean(v2), mean(v1))
}
themeans <- t(apply(as.matrix(data.subset), 1, getmeans, labels))
dim(themeans)

## [1] 500   2

head(themeans)

##             [,1]        [,2]
## 19556  7.8143832  6.41328750
## 4667  -2.7436610 -2.11921139
## 8397  -0.0950021 -0.15348003
## 26442  2.1375097  2.25932155
## 23374  5.1219448  5.19709720
## 15660  0.5052239  0.05276826

Creating the function k.statistic for this statistic:

k.statistic <- function(vec, labels)
{
  levels <- unique(labels)
  v1 <- vec[ labels == levels[1] ]
  v2 <- vec[ labels == levels[2] ]
  m1 <- mean(v1)
  m2 <- mean(v2)
  # If m1==m2==0, then the result should be 1, but 0/0 is not defined, so:
  if (m1 == 0 && m2 == 0) return(1)
  # if m1 == 0, we don't want an inf result, so absolute difference intead:
  if (m1 == 0) return(1+abs(m2))
  # We also added 1 so we can bring it closer to making m2/m1 compared to 1.
  
  # If m2 == 0, then m2/m1 doesn't yield a comparison of the means, so we should do the same as before:
  if (m2 == 0) return(1+abs(m1))
  # The previous two if statements are equivalent to one command: if (m1  == 0 || m2 == 0) return(1+abs(m1-m2)), but doing it in two steps so to be more descriptive and hence more clear (comprehensible)
  k <- m2/m1
  # if m1 and m2 have different signs (we spotted this several times before when we applied to the genes, i.e. margin 1 in apply, the getmeans function we created), then again return the absolute difference of the means plus one since we want to have values close to 1 here (independently of the observed value):
  if (k < 0) return(1+abs(m1-m2))
  # if nothing of the above stands, i.e. else, return the k:
  k
}

Applying this function to our data, i.e. to the data.subset:

obs.stats <- apply(as.matrix(data.subset), 1, k.statistic, labels)
length(obs.stats) == 500

## [1] TRUE

Use the shuffling approach to define the null distribution of the statistic for each gene.

Replicate the shuffling process 1000 times:

Constructing the null distribution of the statistic for each gene:

# We know that there are 6 tobacco and 9 control:
num_tobacco <- 6
num_control <- 9
num_samples <- num_tobacco + num_control

reps <- 1000 # the length of the repetitions we will use

# Since filling NAs is the default, we only need the number of rows and the number of columns for initialization:
stats <- matrix( nrow = nrow(data.subset), ncol = reps )

library(parallel)

tdata <- as.data.frame(t(data.subset))
indices.chosen.randomly <- colnames(tdata)
dim(tdata)

## [1]  15 500

# now the genes are the columns, and the samples are the rows
for (i in 1:reps){
  # print(i)
  shuffled.indices <- sample(1:num_samples, num_samples)
  shuffled.data <- tdata[shuffled.indices,]
  stats[,i] <- unlist(mclapply(shuffled.data, k.statistic, labels, mc.cores = 4))
}
dim(stats)

## [1]  500 1000

Now, every row of the matrix stats refers to a gene (the 1st row is about the 1st gene obtained randomly, the 2nd row about the 2nd gene obtained randomly, and so on), is a vector 1x1000 and includes 1000 values for the k-distribution of that gene.

8. Using the distribution of k under the null, calculate the pvalue for the H1: “classA greater than classB”:

shuffle.pvals <- array(dim = nrow(data.subset)) # initialization
for(i in 1:nrow(data.subset)){
  shuffle.pvals[i] <- sum(stats[i,] <= obs.stats[i])/reps
}

9. For the same set of 500 genes (as in question 7), report the t-test pvalue (for the same H1) (as in question 8):

p.vals <- apply(data.subset, 1, ttest.greater, labels)
p.vals.ttest.mat <- as.matrix(p.vals)

How do the two p-value sets compare to each other?

i.e. how many pvalues are smaller when using t-test than using k and vice versa?

As comparison operations in R are evaluated in an elementwise manner when used between matrices, we will take advantage of this fact to perform our comparisons and assign the result to a variable named comparison.mat:

p.vals.ktest.mat <- as.matrix(shuffle.pvals)
comparison.mat <- p.vals.ttest.mat < p.vals.ktest.mat

Now, the table function will give us the frequencies of TRUE and FALSE:

table(comparison.mat)

## comparison.mat
## FALSE  TRUE 
##   189   311

10. Correct the pvalues (multiple testing correction) found in question 5, using the FDR and the Bonferroni:

pvalues <- apply(normalized.data, 1, ttest, labels)
is.numeric(pvalues)

## [1] TRUE

# Since it's numeric, we can use it as argument in p.adjust:
pvalues.fdr <- p.adjust(pvalues, method = 'fdr')
pvalues.bonferroni <- p.adjust(pvalues, method = "bonferroni")

Which and how many genes have been found after correcting for multiple testing?

FDR (False Discovery Rate) method:

DEGs: Differentially Expressed Genes

The DEGs between nicotine dependent and control groups are:

DEGs.after.correction <- raw.data[,2][pvalues.fdr<0.05]

Answer to how many these genes are:

sum(pvalues.fdr<0.05)

## [1] 1343

Bonferroni correction

The DEGs between nicotine dependent and control groups are:

raw.data[,2][pvalues.bonferroni<0.05]

##  [1] CLVS1      RHPN1      C1QTNF8    CLSTN3     LYNX1      COG5      
##  [7] SMG1P2     ADAP2      IGF2       hCG2042414 TCEAL8     226454    
## 28138 Levels: 100206 100220 100410 100473 100747 100861 100925 ... ZZZ3

Answer to how many these genes are:

sum(pvalues.bonferroni<0.05)

## [1] 12