In R, the family of uniform distribution functions is given by dunif() (Density), runif() (Random sampling), punif() (culmulative Probability), and qunif() (Quantiles). When these names evade you, type ?runif (or any other of these function names that you can remember) and R will give you a helpful reminder of all of them. If you can’t remember any, type ??uniform into the R console (note double question marks!), or ‘uniform distribution R’ into Google.
punif(.7, min=0, max=1) # cumulative up to .7
## [1] 0.7
punif(-5, min=0, max=1)
## [1] 0
punif(5, min=0, max=1)
## [1] 1
qunif(.4, min=0, max=1) # for what x does P(x) = .4?
## [1] 0.4
runif(5, min=0, max=1)
## [1] 0.8074689 0.9900246 0.9499684 0.3668860 0.2923250
dunif(-1, min=0, max=1)
## [1] 0
dunif(0, min=0, max=1)
## [1] 1
dunif(.2, min=0, max=1)
## [1] 1
dunif(.5, min=0, max=1)
## [1] 1
dunif(2, min=0, max=1)
## [1] 0
curve(dunif(x, min=0, max=1), xlim=c(-1, 2), ylim=c(0,2), xlab="x", ylab="density p(x)")
The same holds for other common distributions: for the binomial distribution R has rbinom(), dbinom(), qbinom(), pbinom(); for the normal distribution it has rnorm(), dnorm(), etc.
The expected value or mean of a random variable is a probability-weighted average of the possible values. Summations mean we’re talking about discrete RVs, and integrals are for continuous. \[ \begin{aligned} \mathbb{E}_P(X) =& \sum_{x \in X}\ x P(x) & \text{ (discrete)} \\ \mathbb{E}_p(X) =& \int_{-\infty}^\infty\ x p(x) dx \ \ \ & \text{ (continuous)} \\ \end{aligned} \] Expected value is your ‘best guess’ for the true value of a RV, in the sense that - if you have to make a point estimate, \(P(X)\) is your distribution on answers, and your response is penalized by the following function - it is the response that will minimize your average penalty. \[ \begin{aligned} \mathbf{penalty}(\mathbf{guess}) = [\mathbf{guess} - \mathbf{true\ value}]^2 \end{aligned} \] For example, the expected value of a \(U(0,1)\) random variable is \(.5\). However, this doesn’t mean that the expected value is a high-probability guess: in this case it has exactly the same probability as an infinite number of possible values. In fact, if your distribution is bimodal, for example, you may find that the expected value is a value which has very low probability. As an extreme example, let \(P(X)\) be given by the following function:
bimodal.uniform.density = Vectorize(function(x) {
if (x < 0) return(0)
else if (x <= 1) return(.5)
else if (x < 2) return(0)
else if (x <= 3) return(.5)
else return(0)
})
curve(bimodal.uniform.density(x), xlim=c(-1, 4), ylim=c(0,1), xlab="x", ylab="density p(x)")
Make sure you keep in mind the difference between the mean of a random variable (often written \(\mu\)) and the sample mean of some data set (often written \(\hat{\mu}\)). These are very different things. \(\hat{\mu}\) can be used as an estimate of \(\mu\), but it is not the mean of the RV - it’s the arithmetic mean of the values obtained in some sequences of samples from the RV. If the sample is a sequence of \(n\) values \([x_1, x_2, ..., x_n]\), then its sample mean \(\hat{\mu}\) is \[ \begin{aligned} \hat{\mu} = \frac{\sum_{i=1}^n\ x_i}{n} \end{aligned} \] Convergence to the true mean (divergence \(\to\) 0) as \(n\) increases:
true.mean = .5
n.samples = 50
par(mfrow=c(2,2))
for (j in 1:4) {
samples = runif(n.samples, min=0, max=1)
divergence = function(n) {
return(mean(samples[1:n]) - true.mean)
}
sample.divergence = sapply(1:n.samples, FUN=divergence)
plot(1:n.samples, sample.divergence, pch=20, col='blue', main=paste('run #', j), xlab='index i', ylab='divergence in mean up to i', ylim=c(-.5, .5))
abline(h=0, col='red') # 0 divergence would mean perfect estimate
}
Sample mean is an unbiased estimator, in the sense that the expected deviation of the sample mean from the true value is 0. \[ \begin{aligned} \mathbf{Bias}_P(\hat{\mu}) =& \mathbb{E}_P(\hat{\mu} - \mu) \\ =& 0 \end{aligned} \] We can get an approximate picture of this estimator’s bias by sampling many data sets and observing what the average divergence from the true mean is.
true.mean = .5
n.sims = 5000
n = 15 # number of samples per simulated data set
sims = sapply(1:n.sims, function(i){
samples = runif(n, min=0, max=1)
mu.hat = mean(samples)
return(mu.hat)
})
mean(sims - true.mean)
## [1] -0.0008227525
plot(sims, pch=20, xlab='Sample', ylab='sample mean')
abline(h=true.mean, col='red')
Example of convergence to true mean:
true.mean = .5
n = 15 # number of samples per simulated data set
Ns = floor(10^seq(from=.5, to=6, by=.5)) # 3, 10, 31, 100, 316, 1000, ...
mean.divergence = rep(NA, length(Ns))
names(mean.divergence) = paste('N=', Ns, sep='')
for (i in 1:length(Ns)) {
n.sims = Ns[i]
sims = sapply(1:n.sims, function(i){
samples = runif(n, min=0, max=1)
mu.hat = mean(samples)
return(mu.hat)
})
mean.divergence[i] = mean(sims - true.mean)
}
mean.divergence
## N=3 N=10 N=31 N=100 N=316
## -4.133387e-02 2.860110e-02 1.722205e-02 7.127371e-03 -5.389688e-04
## N=1000 N=3162 N=10000 N=31622 N=1e+05
## -3.924059e-03 -1.025805e-03 5.578166e-04 4.697092e-04 8.666730e-05
## N=316227 N=1e+06
## -6.349103e-05 3.285976e-05
plot(log(Ns, base=10), mean.divergence, xlab='N', ylab='mean divergence in N sims, 15 samples',
pch=20, ylim=c(-.08, .08), xaxt='n') # xaxt='n' suppresses x-axis
axis(1, at=log(Ns, base=10), labels=Ns) # use N instead of log(N) to label x-axis
abline(h=0, col='red') # an unbiased estimator's value should be converging to this point
The variance of a RV is a common measure of how spread out it is. \[ \begin{aligned} \text{Var}_P(X) =& \mathbb{E}_P([X - \mathbb{E}_P(X)]^2)\\ =& \mathbb{E}_P(X^2) - \mathbb{E}_P(X)^2 \end{aligned} \] This is the ‘penalty’ we were talking about above: the mean is the sum-of-squares-minimizing point estimate of a RV.
Variance is often written \(\sigma^2\), since its square root - the standard deviation - is written \(\sigma\).
Again, the variance of a distribution is not the same thing as the (uncorrected) sample variance, which is \[ S^2 = 1/n \times \sum_{i=1}^n [x_i - \hat{\mu}]^2. \]
Unlike sample mean, sample variance is not an unbiased estimator! For example, the true variance of a \(U(a,b)\) random variable is \((b-a)^2/12\). But if we sample from this distribution, the expected (uncorrected) sample variance is not equal to the true variance. For example:
true.variance = 1/12
n = 15
samples = runif(n, min=0, max=1)
summary(samples)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.007269 0.175200 0.298200 0.370500 0.614700 0.762700
sum((samples - mean(samples))^2)/n # sample variance
## [1] 0.06447093
sum((samples - mean(samples))^2)/n - true.variance
## [1] -0.0188624
By doing this many times we can work out approximately how different the (uncorrected) sample variance will be from the true variance, on average.
true.variance = 1/12
n.sims = 5000
n = 15 # number of samples per simulated data set
sims = sapply(1:n.sims, function(i){
samples = runif(n, min=0, max=1)
mu.hat = mean(samples)
return(sum((samples - mu.hat)^2)/n)
})
mean(sims - true.variance)
## [1] -0.00562567
plot(sims, pch=20, xlab='Sample', ylab='uncorrected sample variance S^2')
abline(h=true.variance, col='red')
Example of non-convergence to true variance:
true.variance = 1/12
n = 15 # number of samples per simulated data set
Ns = floor(10^seq(from=.5, to=6, by=.5)) # 3, 10, 31, 100, 316, 1000, ...
mean.divergence = rep(NA, length(Ns))
names(mean.divergence) = paste('N=', Ns, sep='')
for (i in 1:length(Ns)) {
n.sims = Ns[i]
sims = sapply(1:n.sims, function(i){
samples = runif(n, min=0, max=1)
mu.hat = mean(samples)
return(sum((samples - mu.hat)^2)/n)
})
mean.divergence[i] = mean(sims - true.variance)
}
mean.divergence
## N=3 N=10 N=31 N=100 N=316
## 0.005417070 -0.003016105 -0.013780489 -0.008139291 -0.003820838
## N=1000 N=3162 N=10000 N=31622 N=1e+05
## -0.004592789 -0.005538563 -0.006064320 -0.005577406 -0.005555838
## N=316227 N=1e+06
## -0.005598699 -0.005523349
plot(log(Ns, base=10), mean.divergence, xlab='N', ylab='mean divergence in N sims, 15 samples',
pch=20, ylim=c(-.03, .03), xaxt='n') # xaxt='n' suppresses x-axis
axis(1, at=log(Ns, base=10), labels=Ns) # use N instead of log(N) to label x-axis
abline(h=0, col='red') # an unbiased estimator's value should be converging to this point
The sample variance is frequently substantially different from the true variance - it’s biased. However, we can get an unbiased estimator by using a different estimator, \(s^2\). (This is, by the way, what R’s var() function computes.) \[
\begin{aligned}
s^2 = 1/(n-1) \times \sum_{i=1}^n [x_i - \hat{\mu}]^2
\end{aligned}
\]
true.variance = 1/12
n.sims = 5000
n = 15 # each sim will have us taking this many samples
sims = sapply(1:n.sims, function(i){
samples = runif(n, min=0, max=1)
mu.hat = mean(samples)
return(sum((samples - mu.hat)^2)/(n-1))
})
mean(sims - true.variance)
## [1] -0.000738395
plot(sims, pch=20, xlab='Sample', ylab='corrected sample variance s^2')
abline(h=true.variance, col='red')
Convergence:
true.variance = 1/12
n = 15 # number of samples per simulated data set
Ns = floor(10^seq(from=.5, to=6, by=.5)) # 3, 10, 31, 100, 316, 1000, ...
mean.divergence = rep(NA, length(Ns))
names(mean.divergence) = paste('N=', Ns, sep='')
for (i in 1:length(Ns)) {
n.sims = Ns[i]
sims = sapply(1:n.sims, function(i){
samples = runif(n, min=0, max=1)
mu.hat = mean(samples)
return(sum((samples - mu.hat)^2)/(n-1))
})
mean.divergence[i] = mean(sims - true.variance)
}
mean.divergence
## N=3 N=10 N=31 N=100 N=316
## -1.075525e-02 -1.006333e-02 2.170952e-03 1.102632e-03 3.989974e-04
## N=1000 N=3162 N=10000 N=31622 N=1e+05
## -3.630847e-04 -7.530779e-05 6.234220e-05 6.079765e-05 -8.771804e-05
## N=316227 N=1e+06
## -5.334962e-06 -2.485221e-05
plot(log(Ns, base=10), mean.divergence, xlab='N', ylab='mean divergence in N sims, 15 samples',
pch=20, ylim=c(-.03, .03), xaxt='n') # xaxt='n' suppresses x-axis
axis(1, at=log(Ns, base=10), labels=Ns) # use N instead of log(N) to label x-axis
abline(h=0, col='red') # an unbiased estimator's value should be converging to this point
With large \(n\) the difference is not much because the difference between dividing by \(n\) vs. \(n-1\) is not a big deal. When \(n\) is small, as frequently when you’re doing an experiment, it makes a substantial difference.
The normal (aka Gaussian) distribution is one of the most common and useful distributions. It has an inscrutable closed form that you absolutely do not need to memorize. What you do need to know is what it looks like, how it’s parametrized, and how to work with it in R.
The normal distribution is parametrized by a mean \(\mu\) and a standard deviation \(\sigma\). It has support over the entire real line - in other words, a variable can’t be normally distributed if it doesn’t make sense to talk about negative values. (Consequence: contrary to popular belief, heights are not normally distributed.)
A standard normal distribution is simply one with \(\mu = 0\) and \(\sigma = 1\). It looks like this:
curve(dnorm(x, mean=0, sd=1), xlim=c(-3,3))
In R, use pnorm() to get the cumulative distribution, dnorm() to get the density function, rnorm() to sample from the normal, and qnorm() to get quantiles.
pnorm(q=-1, mean=0, sd=1)
## [1] 0.1586553
pnorm(q=0, mean=0, sd=1)
## [1] 0.5
pnorm(q=1, mean=0, sd=1)
## [1] 0.8413447
pnorm(q=1, mean=0, sd=1) == 1 - pnorm(q=-1, mean=0, sd=1)
## [1] TRUE
(Geometrically, what does this latter identity mean?)
To get the probability that a standard normal variable falls between -2 and 2:
pnorm(q=2, mean=0, sd=1) - pnorm(q=-2, mean=0, sd=1)
## [1] 0.9544997
A 95%-coverage interval, a bit more precisely. Store ‘1.96’ for later use.
pnorm(q=1.96, mean=0, sd=1) - pnorm(q=-1.96, mean=0, sd=1)
## [1] 0.9500042
What’s the smallest value s.t. there’s a .4 probability that a standard normal variable falls below it - i.e., the point of cumulative probability .4?
qnorm(.4, mean=0, sd=1)
## [1] -0.2533471
pnorm(-0.2533471, mean=0, sd=1)
## [1] 0.4
One way to get a precise 95% coverage interval, with 2.5% of the probability in each tail:
c(lower=qnorm(.025, mean=0, sd=1), upper=qnorm(1 - .025, mean=0, sd=1))
## lower upper
## -1.959964 1.959964
Sample 100 random numbers from this distribution:
rnorm(n=100, mean=0, sd=1)
## [1] 0.573427005 -0.606139152 1.032931746 0.209416188 -0.134255598
## [6] 2.434555115 0.805865709 -0.710278638 0.447311740 0.739716619
## [11] 1.243092120 0.562658944 -0.156176666 0.525286452 -1.780177211
## [16] -0.161045710 1.131014802 0.330228947 0.693138510 -0.401128716
## [21] 1.034646154 -0.552195187 0.007467308 0.192257769 1.556117428
## [26] 0.381317968 1.161365657 -0.230816829 0.884465515 -0.754754762
## [31] 0.211141521 0.590983044 1.416191442 0.658945379 -0.269249190
## [36] 0.437164825 0.156328956 0.690685920 -0.599007679 0.232493772
## [41] -1.719485719 0.224598454 0.940099103 0.338204506 -0.336088242
## [46] 0.417623376 0.236875303 0.113934954 -0.642674081 0.322397347
## [51] -0.686182063 0.101921347 1.647733715 -1.014129061 0.347215955
## [56] 1.389886351 -1.905117281 -0.444805110 -1.045609740 -0.257600730
## [61] -1.905661588 -0.783613093 -0.631675251 0.146443006 1.106331969
## [66] -1.975395515 1.119089059 -0.059130739 0.310949932 -0.075807718
## [71] 1.127801228 1.233576673 0.497837474 1.385764183 -2.608087084
## [76] -1.556964347 -1.159872216 0.292823123 0.526142791 0.883057317
## [81] -1.718286852 -0.066391645 1.258624776 -2.488382879 0.720669439
## [86] -0.311970120 0.830124075 -1.927724081 -0.115785093 1.609383169
## [91] -1.111322369 1.394712119 0.547208253 -1.897064906 0.638834189
## [96] -1.386838670 1.232943599 -0.207351845 -0.176451089 -0.418508055
Notation: in probability & statistics, joint distributions are usually written in the form \[ P(X = x_i, Y = y_j) \] for discrete variables, or \(p(X=x_i, Y=y_I)\) for continuous variables. This notation is equivalent to either of the following (modulo capitalization). \[ \begin{aligned} & P(X = x_i\ \&\ Y = y_j)\\ & P(X = x_i \wedge Y = y_j) \end{aligned} \] So, for example, we might be interested in the probability that the direct object of a verb in English is sentence-initial and definite vs. indefinite. Then we would be trying to estimate the probabilities on the right-hand side of the following joint probability statements: \[ \begin{aligned} P(\mathbf{initial} = \mathbf{yes}, \mathbf{definite} = \mathbf{yes}) &= p_1\\ P(\mathbf{initial} = \mathbf{no}, \mathbf{definite} = \mathbf{no}) &= p_2\\ P(\mathbf{initial} = \mathbf{yes}, \mathbf{indefinite} = \mathbf{yes}) &= p_3\\ P(\mathbf{initial} = \mathbf{no}, \mathbf{indefinite} = \mathbf{no}) &= 1 - (p_1 + p_2 + p_3)\\ \end{aligned} \]
Since this is a partition - a mutually exclusive and exhaustive list of possible outcomes - the marginal probability \(P(\mathbf{initial} = \mathbf{yes})\) can be found by summing over the joint probabilities of \(P(\mathbf{initial} = \mathbf{yes}, \mathbf{definite})\) for all possible values of \(\mathbf{definite}\). \[ \begin{aligned} P(\mathbf{initial} = \mathbf{yes}) &= P(\mathbf{initial} = \mathbf{yes}, \mathbf{definite} = \mathbf{yes}) + P(\mathbf{initial} = \mathbf{yes}, \mathbf{definite} = \mathbf{no})\\ &= p_1 + p_3 \end{aligned} \] For continuous variables, it works the same except that the summation becomes an interval. \[ p(X) &= \int_{-\infty}^\infty\ P(X,Y=y) \ dy \]
Directed graphical models (‘Bayes nets’, ‘Probabilistic graphical models’, ‘Directed Acyclic Graphs’, ‘DAGs’) express conditional (in)dependence relationships between sets of variables.
Here’s a motivating linguistic example (example and analysis borrowed from R. Levy’s notes). Suppose you are a listener and you hear a speaker say
From your own linguistic experience, you might be able to come up with a number of possible explanations for this disfluency that could guide your expectations about upcoming linguistic material. For example, disfluencies can be generated by (among others)
Here’s a little graphical model depicting the intuitive relationship among variables here:
This is read as follows: “W can cause D, and A can also cause D. W and A are extrinsically given, in the sense that they are not the daughters of any variables that are included in our model.” If we don’t know anything about \(D\), \(W\) and \(A\) are independent, and will usually be uncorrelated (except by accident). But, the disfluency demands a cause. As a result, \(W\) and \(A\) become dependent, given \(D\), and are in an explaining-away relationship. If we subsequently learn that there was a distraction, this will reduce the inferred probability of a difficult upcoming word, and vice versa.
Some more examples of directed graphs from Levy, Appendix C:
Make sure you can identify the following relations between nodes:
Acyclicity (the ‘A’ in ‘DAG’) means that no node \(X_i\) should be both a parent and an ancestor of a node \(X_j\). (Equivalently, no node can be its own ancestor or descendant.)
The key feature of a DAG, as a summary of a class of probability distributions, is that a node is screened off from its non-descendants, conditional on its parents - more formally, \[ P(X_i | \mathbf{Parents}(X_i)) = P(X_i | \mathbf{Parents}(X_i), \mathbf{Ancestors}(X_i)) \] This relationship does not necessarily hold if we also condition on the values of some of \(X_i\)’s descendants. That is, for \(X_j \in \mathbf{Descendants}(X_i)\), \[ P(X_i | \mathbf{Parents}(X_i), X_j) \overset{??}{\underset{??}{=}} P(X_i | \mathbf{Parents}(X_i), X_j, \mathbf{Ancestors}(X_i)) \] For example, in (d) above, \(X_4\) is conditionally independent of \(X_3\) given its parents \(\{X_1,X_2\}\). But it’s not conditionally independent of \(X_3\) given \(\{X_1,X_2,X_6\}\), where \(X_6\) is one of its daughters. Can you explain why?
The nice thing about a DAG, then, is that when you have \(n\) variables you don’t need to specify \(2^{n-1}\) probabilities. Usually, you can factorize the distribution into a much smaller set of conditional probability statements - intuitively, starting with ‘exogenous’ or ‘uncaused’ variables at the top and working your way down to the bottom.
When you code up a DAG, parent- and daughterhood are really easy to read off from the code. Ancestor- and descendant-hood can be more difficult: ensuring that \(X_i\) is not an ancestor/descendant of \(X_j\) involves checking all possible connections. As an exercise, let’s work through the following code and then draw a DAG that encodes the conditional independence relationships between the variables in this diagram.
rain.sim = function(i) {
flip = function(p) runif(1) < p
Zeus.angry = flip(.3)
Zeus.crying = flip(.2)
thunder = Zeus.angry && flip(.4)
rain = (Zeus.angry && flip(.8)) || (Zeus.crying && flip(.99))
forgetful.gardener = flip(.4)
sprinkler = forgetful.gardener && flip(.7)
wet.grass = (rain && flip(.8)) || (sprinkler && flip(.9)) || flip(.1)
return(c(Zeus.angry = Zeus.angry, Zeus.crying = Zeus.crying, thunder = thunder, forgetful = forgetful.gardener, sprinkler = sprinkler, wet.grass = wet.grass))
}
sapply(1:10, FUN=rain.sim)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## Zeus.angry FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## Zeus.crying FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## thunder FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## forgetful FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE
## sprinkler FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE
## wet.grass FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
thunder and wet.grass? If you hear thunder, have you learned something about how likely the grass is to be wet? Why or why not?query = function(variable.name, n.sims = 10000) {
sims = sapply(1:n.sims, FUN=rain.sim)
proportion = length(which(sims[variable.name,] == TRUE))/n.sims
return(proportion)
}
query('Zeus.angry')
## [1] 0.2981
query('thunder')
## [1] 0.1237
Frequently, what we’re doing with a DAG is observing some ‘downstream’ variables (effects) and trying to infer the values of some ‘upstream’ variables (causes).
conditional.query = function(variable.name, conditioners, n.sims=10000) {
sims = sapply(1:n.sims, FUN=rain.sim)
conditioners.true = 1:n.sims
for (i in 1:length(conditioners)) {
conditioners.true = intersect(conditioners.true, which(sims[conditioners[i],] == TRUE))
}
all.true = intersect(conditioners.true, which(sims[variable.name,] == TRUE))
proportion = length(all.true)/length(conditioners.true)
return(proportion)
}
query('wet.grass')
## [1] 0.5425
conditional.query('wet.grass', 'thunder')
## [1] 0.7789735
conditional.query('wet.grass', c('thunder', 'Zeus.angry'))
## [1] 0.7813799
conditional.query('wet.grass', 'Zeus.angry')
## [1] 0.792998
Lots of linguistic applications of this kind of model. For example, here’s the schematic model of communication that I presented on Day 1: