Dosen : Prof. Suhartono - UIN MALIKI MALANG
Author : Lia Wahyuliningtyas
Mahasiswa Magister
Informatika
In the previous chapter, the knowledge of both the generative model and the values of the parameters provided us with probabilities we could use for decision making – for instance, whether we had really found an epitope. In many real situations, neither the generative model nor the parameters are known, and we will need to estimate them using the data we have collected. Statistical modeling works from the data upwards to a model that might plausibly explain the data2121 Even if we have found a model that perfectly explains all our current data, it could always be that reality is more complex. A new set of data lets us conclude that another model is needed, and may include the current model as a special case or approximation.. This upward-reasoning step is called statistical inference. This chapter will show us some of the distributions and estimation mechanisms that serve as building blocks for inference. Although the examples in this chapter are all parametric (i.e., the statistical models only have a small number of unknown parameters), the principles we discuss will generalize.
The normal distribution, or bell shaped curve, is often a good model for continuous measurements. Distributions can also be more complicated mixtures of these elementary ones.
Let’s revisit the epitope example from the previous chapter, starting without the tricky outlier.
load("../Documents/data/data/e100.RData")
e99 = e100[-which.max(e100)]
barplot(table(e99), space = 0.8, col = "chartreuse4")
library("vcd")
## Loading required package: grid
gf1 = goodfit( e99, "poisson")
rootogram(gf1, xlab = "", rect_gp = gpar(fill = "chartreuse4"))
What value for the Poisson mean makes the data the most probable? In a first step, we tally the outcomes.
table(e100)
## e100
## 0 1 2 7
## 58 34 7 1
table(rpois(100, 3))
##
## 0 1 2 3 4 5 6 9
## 5 24 27 11 17 7 8 1
prod(dpois(c(0, 1, 2, 7), lambda = 3) ^ (c(58, 34, 7, 1)))
## [1] 1.392143e-110
loglikelihood = function(lambda, data = e100) {
sum(log(dpois(data, lambda)))
}
Now we can compute the likelihood for a whole series of lambda values from 0.05 to 0.95
lambdas = seq(0.05, 0.95, length = 100)
loglik = vapply(lambdas, loglikelihood, numeric(1))
plot(lambdas, loglik, type = "l", col = "red", ylab = "", lwd = 2,
xlab = expression(lambda))
m0 = mean(e100)
abline(v = m0, col = "blue", lwd = 2)
abline(h = loglikelihood(m0), col = "purple", lwd = 2)
m0
## [1] 0.55
gf = goodfit(e100, "poisson")
names(gf)
## [1] "observed" "count" "fitted" "type" "method" "df" "par"
gf$par
## $lambda
## [1] 0.55
gf$observed
## [1] 58 34 7 0 0 0 0 1
gf$fitted
## [1] 5.769498e+01 3.173224e+01 8.726366e+00 1.599834e+00 2.199771e-01
## [6] 2.419749e-02 2.218103e-03 1.742795e-04
gf$type
## [1] "poisson"
cb <- rbinom(120, prob = 0.8, size = 1)
cb
## [1] 0 0 1 1 1 1 0 0 0 0 0 1 0 1 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 0 1 1 1 0 0 0 1 0 1 1 1
## [75] 0 1 1 0 1 1 1 1 1 0 1 1 1 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 0 0 0 1 0 1 1 1 1
## [112] 0 1 1 1 1 1 1 0 1
table(cb)
## cb
## 0 1
## 32 88
probs = seq(0, 0.3, by = 0.005)
likelihood = dbinom(sum(cb), prob = probs, size = length(cb))
plot(probs, likelihood, pch = 16, xlab = "probability of success",
ylab = "likelihood", cex=0.6)
probs[which.max(likelihood)]
## [1] 0.3
Here’s a function we use to calculate it:
loglikelihood = function(theta, n = 300, k = 40) {
115 + k * log(theta) + (n - k) * log(1 - theta)
}
which we plot for the range of θ from 0 to 1 (Figure 2.6).
thetas = seq(0, 1, by = 0.001)
plot(thetas, loglikelihood(thetas), xlab = expression(theta),
ylab = expression(paste("log f(", theta, " | y)")),type = "l")
DNA count modeling: base pairs This section combines estimation and testing by simulation in a real example. Data from one strand of DNA for the genes of Staphylococcus aureus bacterium are available in a fasta file staphsequence.ffn.txt, which we can read with a function from the Bioconductor package Biostrings.
library("Biostrings")
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:vcd':
##
## tile
## The following object is masked from 'package:grDevices':
##
## windows
## Loading required package: XVector
## Loading required package: GenomeInfoDb
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:grid':
##
## pattern
## The following object is masked from 'package:base':
##
## strsplit
staph = readDNAStringSet("../Documents/data/data/staphsequence.ffn.txt", "fasta")
letterFrequency(staph[[1]], letters = "ACGT", OR = 0)
## A C G T
## 522 219 229 392
letterFrq = vapply(staph, letterFrequency, FUN.VALUE = numeric(4),
letters = "ACGT", OR = 0)
colnames(letterFrq) = paste0("gene", seq(along = staph))
tab10 = letterFrq[, 1:10]
computeProportions = function(x) { x/sum(x) }
prop10 = apply(tab10, 2, computeProportions)
round(prop10, digits = 2)
## gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 gene9 gene10
## A 0.38 0.36 0.35 0.37 0.35 0.33 0.33 0.34 0.38 0.27
## C 0.16 0.16 0.13 0.15 0.15 0.15 0.16 0.16 0.14 0.16
## G 0.17 0.17 0.23 0.19 0.22 0.22 0.20 0.21 0.20 0.20
## T 0.29 0.31 0.30 0.29 0.27 0.30 0.30 0.29 0.28 0.36
p0 = rowMeans(prop10)
p0
## A C G T
## 0.3470531 0.1518313 0.2011442 0.2999714
cs = colSums(tab10)
cs
## gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 gene9 gene10
## 1362 1134 246 1113 1932 2661 831 1515 1287 696
expectedtab10 = outer(p0, cs, FUN = "*")
round(expectedtab10)
## gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 gene9 gene10
## A 473 394 85 386 671 924 288 526 447 242
## C 207 172 37 169 293 404 126 230 195 106
## G 274 228 49 224 389 535 167 305 259 140
## T 409 340 74 334 580 798 249 454 386 209
randomtab10 = sapply(cs, function(s) { rmultinom(1, s, p0) } )
all(colSums(randomtab10) == cs)
## [1] TRUE
Now we repeat this B = 1000 times. For each table we compute our test statistic from Section 1.4.1 in Chapter 1 (the function stat) and store the results in the vector simulstat. Together, these values constitute our null distribution, as they were generated under the null hypothesis that p0 is the vector of multinomial proportions for each of the 10 genes.
stat = function(obsvd, exptd = 20 * pvec) {
sum((obsvd - exptd)^2 / exptd)
}
B = 1000
simulstat = replicate(B, {
randomtab10 = sapply(cs, function(s) { rmultinom(1, s, p0) })
stat(randomtab10, expectedtab10)
})
S1 = stat(tab10, expectedtab10)
sum(simulstat >= S1)
## [1] 0
hist(simulstat, col = "lavender", breaks = seq(0, 75, length.out=50))
abline(v = S1, col = "red")
abline(v = quantile(simulstat, probs = c(0.95, 0.99)),
col = c("darkgreen", "blue"), lty = 2)