In terminal:
wget ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS2nnn/GDS2447/soft/GDS2447.soft.gz
## --2019-03-21 16:00:50-- ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS2nnn/GDS2447/soft/GDS2447.soft.gz
## => ‘GDS2447.soft.gz’
## Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.13, 2607:f220:41e:250::12
## Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.13|:21... connected.
## Logging in as anonymous ... Logged in!
## ==> SYST ... done. ==> PWD ... done.
## ==> TYPE I ... done. ==> CWD (1) /geo/datasets/GDS2nnn/GDS2447/soft ... done.
## ==> SIZE GDS2447.soft.gz ... 2162905
## ==> PASV ... done. ==> RETR GDS2447.soft.gz ... done.
## Length: 2162905 (2.1M) (unauthoritative)
##
## 0K .......... .......... .......... .......... .......... 2% 138K 15s
## 50K .......... .......... .......... .......... .......... 4% 262K 11s
## 100K .......... .......... .......... .......... .......... 7% 2.55M 7s
## 150K .......... .......... .......... .......... .......... 9% 2.54M 6s
## 200K .......... .......... .......... .......... .......... 11% 469K 5s
## 250K .......... .......... .......... .......... .......... 14% 2.37M 4s
## 300K .......... .......... .......... .......... .......... 16% 2.65M 4s
## 350K .......... .......... .......... .......... .......... 18% 2.66M 3s
## 400K .......... .......... .......... .......... .......... 21% 778K 3s
## 450K .......... .......... .......... .......... .......... 23% 2.75M 3s
## 500K .......... .......... .......... .......... .......... 26% 2.62M 2s
## 550K .......... .......... .......... .......... .......... 28% 2.60M 2s
## 600K .......... .......... .......... .......... .......... 30% 2.50M 2s
## 650K .......... .......... .......... .......... .......... 33% 2.45M 2s
## 700K .......... .......... .......... .......... .......... 35% 2.57M 2s
## 750K .......... .......... .......... .......... .......... 37% 2.56M 2s
## 800K .......... .......... .......... .......... .......... 40% 2.74M 1s
## 850K .......... .......... .......... .......... .......... 42% 2.70M 1s
## 900K .......... .......... .......... .......... .......... 44% 2.80M 1s
## 950K .......... .......... .......... .......... .......... 47% 2.08M 1s
## 1000K .......... .......... .......... .......... .......... 49% 2.86M 1s
## 1050K .......... .......... .......... .......... .......... 52% 2.52M 1s
## 1100K .......... .......... .......... .......... .......... 54% 2.46M 1s
## 1150K .......... .......... .......... .......... .......... 56% 2.80M 1s
## 1200K .......... .......... .......... .......... .......... 59% 2.88M 1s
## 1250K .......... .......... .......... .......... .......... 61% 2.61M 1s
## 1300K .......... .......... .......... .......... .......... 63% 2.85M 1s
## 1350K .......... .......... .......... .......... .......... 66% 2.74M 1s
## 1400K .......... .......... .......... .......... .......... 68% 2.90M 1s
## 1450K .......... .......... .......... .......... .......... 71% 2.42M 0s
## 1500K .......... .......... .......... .......... .......... 73% 2.60M 0s
## 1550K .......... .......... .......... .......... .......... 75% 2.48M 0s
## 1600K .......... .......... .......... .......... .......... 78% 2.72M 0s
## 1650K .......... .......... .......... .......... .......... 80% 2.38M 0s
## 1700K .......... .......... .......... .......... .......... 82% 2.02M 0s
## 1750K .......... .......... .......... .......... .......... 85% 2.67M 0s
## 1800K .......... .......... .......... .......... .......... 87% 2.73M 0s
## 1850K .......... .......... .......... .......... .......... 89% 2.71M 0s
## 1900K .......... .......... .......... .......... .......... 92% 2.63M 0s
## 1950K .......... .......... .......... .......... .......... 94% 2.96M 0s
## 2000K .......... .......... .......... .......... .......... 97% 2.50M 0s
## 2050K .......... .......... .......... .......... .......... 99% 2.60M 0s
## 2100K .......... .. 100% 3.04M=1.4s
##
## 2019-03-21 16:00:54 (1.43 MB/s) - ‘GDS2447.soft.gz’ saved [2162905]
Unzipping:
gunzip GDS2447.soft.gz
Cleaning the data, i.e. removing the comments:
grep -v ^[!^#] GDS2447.soft > GDS2447.clean
Reading the file in R:
raw.data <- read.table('GDS2447.clean', sep = '\t', header = TRUE)
dim(raw.data)
## [1] 33202 17
Removing the first two columns:
data <- raw.data[, -c(1, 2)]
dim(data)
## [1] 33202 15
head(data)
## GSM144131 GSM144132 GSM144133 GSM144134 GSM144135 GSM144136
## 1 254.147000 165.4640000 334.1650000 202.555000 263.8830000 285.248000
## 2 0.171102 0.0632793 0.0700499 0.176690 0.0827437 0.316200
## 3 0.261324 0.2304610 0.1074160 0.358684 0.7966200 0.119989
## 4 0.699315 0.7008360 0.6627730 0.982541 0.4121350 0.308548
## 5 7.363850 8.6595500 15.1613000 10.231400 17.4776000 15.071000
## 6 17.357700 7.2464300 24.3139000 37.564400 23.0226000 15.463100
## GSM144122 GSM144123 GSM144124 GSM144125 GSM144126 GSM144127
## 1 168.6640000 182.1780000 185.8640000 210.9910000 185.8930000 211.9420000
## 2 0.0574015 0.0580385 0.0475359 0.0342564 0.0722826 0.0299415
## 3 0.1764020 0.0964890 0.2515190 0.0918525 0.0811391 0.0515375
## 4 1.4016300 0.9375940 0.5687420 0.7422920 0.7696850 0.4549370
## 5 26.1969000 22.6432000 35.4197000 34.4893000 35.7904000 15.7142000
## 6 5.6211400 6.8915300 7.3169000 6.2578800 7.0759800 9.3557000
## GSM144128 GSM144129 GSM144130
## 1 271.4500000 220.0460000 310.1760000
## 2 0.0302075 0.0798368 0.0809686
## 3 0.1401160 0.0663585 0.2115270
## 4 0.2848950 0.5722410 0.3224470
## 5 15.3359000 9.0004300 18.5867000
## 6 14.5014000 13.3420000 38.4430000
Checking if data is normalized:
boxplot(data)
As the data is not normalized, we will use the log_2 transformation:
normalized.data <- log2(data)
boxplot(normalized.data, main = expression(paste('Boxplot of the ', log2, ' data')))
This dataset consists of transcriptional profiling of samples of the homo sapiens organism and includes two classes, the one of tobacco smokers (nicotine dependents) in which there are 6 samples and the non smoking (control) class in which there are 9 samples (15 samples in total).
The first information is in the descriptions (title, summary, organism) in the GEO’s site, to see click here.
The latter information is found here, which is found after clicking in the “Experiment design and value distribution” left at the bottom at the GEO’s site of GDS2447 and then in the image it will appear on its right.
Making the labels:
labels <- c(rep('tobacco', 6), rep('control', 9))
Creating a new function, the ttest.greater, which applies t.test for a gene (row) when there are two levels/classes, the alternative hypothesis is H1: μ1 > μ2, and returns the p-value:
ttest.greater <- function(vec, labels)
{
levels <- unique(labels)
v1 <- vec[ labels == levels[1] ]
v2 <- vec[ labels == levels[2] ]
pval <- t.test(v1, v2, alternative = 'greater')$p.value
pval
}
Applying the ttest.greater function in the data (the normalized data):
pvals <- apply(normalized.data, 1, ttest.greater, labels)
Checking truthness:
length(pvals) == dim(normalized.data)[1]
## [1] TRUE
We consider “tobacco” as classA and “control” as classB, hence the genes that are higher in classA than classB are:
higher.names <- raw.data[,2][pvals < 0.05]
Answer to how many genes are higher in classA than classB:
length(which(pvals < 0.05))
## [1] 7884
Alternative preferred way:
sum(pvals < 0.05)
## [1] 7884
In similar manner, we will create the function ttest.less and apply it to the data:
ttest.less <- function(vec, labels)
{
levels <- unique(labels)
v1 <- vec[ labels == levels[1] ]
v2 <- vec[ labels == levels[2] ]
pval <- t.test(v1, v2, alternative = 'less')$p.value
pval
}
the.pvals <- apply(normalized.data, 1, ttest.less, labels)
The genes that are higher in classB than classA are:
less.names <- raw.data[,2][the.pvals < 0.05]
Answer to how many genes are higher in classB than classA:
sum(the.pvals < 0.05)
## [1] 4902
ttest <- function(vec, labels)
{
levels <- unique(labels)
v1 <- vec[ labels == levels[1] ]
v2 <- vec[ labels == levels[2] ]
pval <- t.test(v1, v2)$p.value
pval
}
pvalues <- apply(normalized.data, 1, ttest, labels)
The genes that are different between classA and classB are:
DEG.names <- raw.data[,2][pvalues<0.05]
Answer to how many genes are different between classA and classB:
sum(pvalues<0.05)
## [1] 9243
The genes that are not different between classA and classB are:
non.DEGs.names <- raw.data[,2][pvalues>=0.05]
Answer to how many genes are not different between classA and classB:
sum(pvalues>=0.05)
## [1] 23959
data.subset <- normalized.data[sample(dim(normalized.data)[1], 500),]
dim(data.subset)
## [1] 500 15
head(data.subset)
## GSM144131 GSM144132 GSM144133 GSM144134 GSM144135 GSM144136
## 19556 5.2208948 6.9693232 7.2808357 6.4459468 5.7585031 6.804221
## 4667 -1.5840078 -1.5534261 -2.4401852 -1.5249442 -2.2586184 -3.354087
## 8397 0.2454470 -0.2575953 0.2348814 -1.1919791 -0.4015071 0.449873
## 26442 3.2219838 1.6587873 2.6375987 1.4445664 2.1329453 2.460048
## 23374 5.1336663 4.4713110 5.5289588 5.3890126 5.5411834 5.118451
## 15660 0.6033688 1.0474337 0.5969256 -0.1009946 -0.5306454 -1.299479
## GSM144122 GSM144123 GSM144124 GSM144125 GSM144126 GSM144127
## 19556 9.2435858 8.2203590 8.4094461 8.5775346 8.7261602 7.9718309
## 4667 -3.8892615 -3.3338404 -3.5736833 -2.6410097 -2.5612984 -2.3173404
## 8397 -0.0391473 -0.4832284 -0.6784967 -0.4044047 -0.7026419 1.3945526
## 26442 1.9745220 1.6725023 1.2320897 2.1260147 2.7203638 2.4570794
## 23374 5.0737388 4.7555606 5.0948525 5.0726161 5.6941392 5.2285263
## 15660 0.5918900 0.2710910 1.2476784 0.9455253 0.6264578 -0.3190222
## GSM144128 GSM144129 GSM144130
## 19556 6.9549058 6.273094e+00 5.9525316
## 4667 -2.1830547 -2.478652e+00 -1.7148085
## 8397 0.2297479 2.963106e-01 -0.4677110
## 26442 2.5307874 2.643441e+00 1.8807874
## 23374 5.4162959 5.026251e+00 4.7355222
## 15660 0.2317027 1.442688e-05 0.9516773
Creating the function getmeans so we can inspect the values of the means:
getmeans <- function(vec, labels){
levels <- unique(labels)
v1 <- vec[ labels == levels[1] ] ; v2 <- vec[ labels == levels[2] ]
# list(mean1=mean(v1), mean2=mean(v2))
c(mean(v2), mean(v1))
}
themeans <- t(apply(as.matrix(data.subset), 1, getmeans, labels))
dim(themeans)
## [1] 500 2
head(themeans)
## [,1] [,2]
## 19556 7.8143832 6.41328750
## 4667 -2.7436610 -2.11921139
## 8397 -0.0950021 -0.15348003
## 26442 2.1375097 2.25932155
## 23374 5.1219448 5.19709720
## 15660 0.5052239 0.05276826
Creating the function k.statistic for this statistic:
k.statistic <- function(vec, labels)
{
levels <- unique(labels)
v1 <- vec[ labels == levels[1] ]
v2 <- vec[ labels == levels[2] ]
m1 <- mean(v1)
m2 <- mean(v2)
# If m1==m2==0, then the result should be 1, but 0/0 is not defined, so:
if (m1 == 0 && m2 == 0) return(1)
# if m1 == 0, we don't want an inf result, so absolute difference intead:
if (m1 == 0) return(1+abs(m2))
# We also added 1 so we can bring it closer to making m2/m1 compared to 1.
# If m2 == 0, then m2/m1 doesn't yield a comparison of the means, so we should do the same as before:
if (m2 == 0) return(1+abs(m1))
# The previous two if statements are equivalent to one command: if (m1 == 0 || m2 == 0) return(1+abs(m1-m2)), but doing it in two steps so to be more descriptive and hence more clear (comprehensible)
k <- m2/m1
# if m1 and m2 have different signs (we spotted this several times before when we applied to the genes, i.e. margin 1 in apply, the getmeans function we created), then again return the absolute difference of the means plus one since we want to have values close to 1 here (independently of the observed value):
if (k < 0) return(1+abs(m1-m2))
# if nothing of the above stands, i.e. else, return the k:
k
}
Applying this function to our data, i.e. to the data.subset:
obs.stats <- apply(as.matrix(data.subset), 1, k.statistic, labels)
length(obs.stats) == 500
## [1] TRUE
Constructing the null distribution of the statistic for each gene:
# We know that there are 6 tobacco and 9 control:
num_tobacco <- 6
num_control <- 9
num_samples <- num_tobacco + num_control
reps <- 1000 # the length of the repetitions we will use
# Since filling NAs is the default, we only need the number of rows and the number of columns for initialization:
stats <- matrix( nrow = nrow(data.subset), ncol = reps )
library(parallel)
tdata <- as.data.frame(t(data.subset))
indices.chosen.randomly <- colnames(tdata)
dim(tdata)
## [1] 15 500
# now the genes are the columns, and the samples are the rows
for (i in 1:reps){
# print(i)
shuffled.indices <- sample(1:num_samples, num_samples)
shuffled.data <- tdata[shuffled.indices,]
stats[,i] <- unlist(mclapply(shuffled.data, k.statistic, labels, mc.cores = 4))
}
dim(stats)
## [1] 500 1000
Now, every row of the matrix stats refers to a gene (the 1st row is about the 1st gene obtained randomly, the 2nd row about the 2nd gene obtained randomly, and so on), is a vector 1x1000 and includes 1000 values for the k-distribution of that gene.
shuffle.pvals <- array(dim = nrow(data.subset)) # initialization
for(i in 1:nrow(data.subset)){
shuffle.pvals[i] <- sum(stats[i,] <= obs.stats[i])/reps
}
p.vals <- apply(data.subset, 1, ttest.greater, labels)
p.vals.ttest.mat <- as.matrix(p.vals)
As comparison operations in R are evaluated in an elementwise manner when used between matrices, we will take advantage of this fact to perform our comparisons and assign the result to a variable named comparison.mat:
p.vals.ktest.mat <- as.matrix(shuffle.pvals)
comparison.mat <- p.vals.ttest.mat < p.vals.ktest.mat
Now, the table function will give us the frequencies of TRUE and FALSE:
table(comparison.mat)
## comparison.mat
## FALSE TRUE
## 189 311
pvalues <- apply(normalized.data, 1, ttest, labels)
is.numeric(pvalues)
## [1] TRUE
# Since it's numeric, we can use it as argument in p.adjust:
pvalues.fdr <- p.adjust(pvalues, method = 'fdr')
pvalues.bonferroni <- p.adjust(pvalues, method = "bonferroni")
FDR (False Discovery Rate) method:
DEGs: Differentially Expressed Genes
The DEGs between nicotine dependent and control groups are:
DEGs.after.correction <- raw.data[,2][pvalues.fdr<0.05]
Answer to how many these genes are:
sum(pvalues.fdr<0.05)
## [1] 1343
Bonferroni correction
The DEGs between nicotine dependent and control groups are:
raw.data[,2][pvalues.bonferroni<0.05]
## [1] CLVS1 RHPN1 C1QTNF8 CLSTN3 LYNX1 COG5
## [7] SMG1P2 ADAP2 IGF2 hCG2042414 TCEAL8 226454
## 28138 Levels: 100206 100220 100410 100473 100747 100861 100925 ... ZZZ3
Answer to how many these genes are:
sum(pvalues.bonferroni<0.05)
## [1] 12