Probability & Statistics for Linguists: Class 4(Jan 14, 2015)

CRVs continued

In R, the family of uniform distribution functions is given by dunif() (Density), runif() (Random sampling), punif() (culmulative Probability), and qunif() (Quantiles). When these names evade you, type ?runif (or any other of these function names that you can remember) and R will give you a helpful reminder of all of them. If you can’t remember any, type ??uniform into the R console (note double question marks!), or ‘uniform distribution R’ into Google.

punif(.7, min=0, max=1) # cumulative up to .7

## [1] 0.7

punif(-5, min=0, max=1)

## [1] 0

punif(5, min=0, max=1)

## [1] 1

qunif(.4, min=0, max=1) # for what x does P(x) = .4?

## [1] 0.4

runif(5, min=0, max=1)

## [1] 0.8074689 0.9900246 0.9499684 0.3668860 0.2923250

dunif(-1, min=0, max=1)

## [1] 0

dunif(0, min=0, max=1)

## [1] 1

dunif(.2, min=0, max=1)

## [1] 1

dunif(.5, min=0, max=1)

## [1] 1

dunif(2, min=0, max=1)

## [1] 0

curve(dunif(x, min=0, max=1), xlim=c(-1, 2), ylim=c(0,2), xlab="x", ylab="density p(x)")

The same holds for other common distributions: for the binomial distribution R has rbinom(), dbinom(), qbinom(), pbinom(); for the normal distribution it has rnorm(), dnorm(), etc.

Expected value

The expected value or mean of a random variable is a probability-weighted average of the possible values. Summations mean we’re talking about discrete RVs, and integrals are for continuous. \[ \begin{aligned} \mathbb{E}_P(X) =& \sum_{x \in X}\ x P(x) & \text{ (discrete)} \\ \mathbb{E}_p(X) =& \int_{-\infty}^\infty\ x p(x) dx \ \ \ & \text{ (continuous)} \\ \end{aligned} \] Expected value is your ‘best guess’ for the true value of a RV, in the sense that - if you have to make a point estimate, \(P(X)\) is your distribution on answers, and your response is penalized by the following function - it is the response that will minimize your average penalty. \[ \begin{aligned} \mathbf{penalty}(\mathbf{guess}) = [\mathbf{guess} - \mathbf{true\ value}]^2 \end{aligned} \] For example, the expected value of a \(U(0,1)\) random variable is \(.5\). However, this doesn’t mean that the expected value is a high-probability guess: in this case it has exactly the same probability as an infinite number of possible values. In fact, if your distribution is bimodal, for example, you may find that the expected value is a value which has very low probability. As an extreme example, let \(P(X)\) be given by the following function:

bimodal.uniform.density = Vectorize(function(x) {
    if (x < 0) return(0)
    else if (x <= 1) return(.5)
    else if (x < 2) return(0)
    else if (x <= 3) return(.5)
    else return(0)
})
curve(bimodal.uniform.density(x), xlim=c(-1, 4), ylim=c(0,1), xlab="x", ylab="density p(x)")

What is \(\mathbb{E}_P(X)\)?
What is \(P(\mathbb{E}_P(X)))\)?

Make sure you keep in mind the difference between the mean of a random variable (often written \(\mu\)) and the sample mean of some data set (often written \(\hat{\mu}\)). These are very different things. \(\hat{\mu}\) can be used as an estimate of \(\mu\), but it is not the mean of the RV - it’s the arithmetic mean of the values obtained in some sequences of samples from the RV. If the sample is a sequence of \(n\) values \([x_1, x_2, ..., x_n]\), then its sample mean \(\hat{\mu}\) is \[ \begin{aligned} \hat{\mu} = \frac{\sum_{i=1}^n\ x_i}{n} \end{aligned} \] Convergence to the true mean (divergence \(\to\) 0) as \(n\) increases:

true.mean = .5
n.samples = 50
par(mfrow=c(2,2))
for (j in 1:4) {
  samples = runif(n.samples, min=0, max=1)
  divergence = function(n) {
    return(mean(samples[1:n]) - true.mean)
  }
  sample.divergence = sapply(1:n.samples, FUN=divergence)
  plot(1:n.samples, sample.divergence, pch=20, col='blue', main=paste('run #', j), xlab='index i', ylab='divergence in mean up to i', ylim=c(-.5, .5))
  abline(h=0, col='red') # 0 divergence would mean perfect estimate 
}

Sample mean is an unbiased estimator, in the sense that the expected deviation of the sample mean from the true value is 0. \[ \begin{aligned} \mathbf{Bias}_P(\hat{\mu}) =& \mathbb{E}_P(\hat{\mu} - \mu) \\ =& 0 \end{aligned} \] We can get an approximate picture of this estimator’s bias by sampling many data sets and observing what the average divergence from the true mean is.

true.mean = .5
n.sims = 5000
n = 15 # number of samples per simulated data set
sims = sapply(1:n.sims, function(i){
    samples = runif(n, min=0, max=1)
    mu.hat = mean(samples)
    return(mu.hat)
})
mean(sims - true.mean)

## [1] -0.0008227525

plot(sims, pch=20, xlab='Sample', ylab='sample mean')
abline(h=true.mean, col='red')

Example of convergence to true mean:

true.mean = .5
n = 15 # number of samples per simulated data set
Ns = floor(10^seq(from=.5, to=6, by=.5)) # 3, 10, 31, 100, 316, 1000, ...
mean.divergence = rep(NA, length(Ns))
names(mean.divergence) = paste('N=', Ns, sep='')
for (i in 1:length(Ns)) {
  n.sims = Ns[i]
  sims = sapply(1:n.sims, function(i){
    samples = runif(n, min=0, max=1)
    mu.hat = mean(samples)
    return(mu.hat)
  })
  mean.divergence[i] = mean(sims - true.mean)
}
mean.divergence

##           N=3          N=10          N=31         N=100         N=316 
## -4.133387e-02  2.860110e-02  1.722205e-02  7.127371e-03 -5.389688e-04 
##        N=1000        N=3162       N=10000       N=31622       N=1e+05 
## -3.924059e-03 -1.025805e-03  5.578166e-04  4.697092e-04  8.666730e-05 
##      N=316227       N=1e+06 
## -6.349103e-05  3.285976e-05

plot(log(Ns, base=10), mean.divergence, xlab='N', ylab='mean divergence in N sims, 15 samples', 
     pch=20, ylim=c(-.08, .08), xaxt='n') # xaxt='n' suppresses x-axis
axis(1, at=log(Ns, base=10), labels=Ns) # use N instead of log(N) to label x-axis 
abline(h=0, col='red') # an unbiased estimator's value should be converging to this point

Variance

The variance of a RV is a common measure of how spread out it is. \[ \begin{aligned} \text{Var}_P(X) =& \mathbb{E}_P([X - \mathbb{E}_P(X)]^2)\\ =& \mathbb{E}_P(X^2) - \mathbb{E}_P(X)^2 \end{aligned} \] This is the ‘penalty’ we were talking about above: the mean is the sum-of-squares-minimizing point estimate of a RV.

Variance is often written \(\sigma^2\), since its square root - the standard deviation - is written \(\sigma\).

Again, the variance of a distribution is not the same thing as the (uncorrected) sample variance, which is \[ S^2 = 1/n \times \sum_{i=1}^n [x_i - \hat{\mu}]^2. \]

Unlike sample mean, sample variance is not an unbiased estimator! For example, the true variance of a \(U(a,b)\) random variable is \((b-a)^2/12\). But if we sample from this distribution, the expected (uncorrected) sample variance is not equal to the true variance. For example:

true.variance = 1/12
n = 15 
samples = runif(n, min=0, max=1)
summary(samples)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.007269 0.175200 0.298200 0.370500 0.614700 0.762700

sum((samples - mean(samples))^2)/n # sample variance

## [1] 0.06447093

sum((samples - mean(samples))^2)/n - true.variance

## [1] -0.0188624

By doing this many times we can work out approximately how different the (uncorrected) sample variance will be from the true variance, on average.

true.variance = 1/12
n.sims = 5000
n = 15 # number of samples per simulated data set
sims = sapply(1:n.sims, function(i){
    samples = runif(n, min=0, max=1)
    mu.hat = mean(samples)
    return(sum((samples - mu.hat)^2)/n)
})
mean(sims - true.variance)

## [1] -0.00562567

plot(sims, pch=20, xlab='Sample', ylab='uncorrected sample variance S^2')
abline(h=true.variance, col='red')

Example of non-convergence to true variance:

true.variance = 1/12
n = 15 # number of samples per simulated data set
Ns = floor(10^seq(from=.5, to=6, by=.5)) # 3, 10, 31, 100, 316, 1000, ...
mean.divergence = rep(NA, length(Ns))
names(mean.divergence) = paste('N=', Ns, sep='')
for (i in 1:length(Ns)) {
  n.sims = Ns[i]
  sims = sapply(1:n.sims, function(i){
    samples = runif(n, min=0, max=1)
    mu.hat = mean(samples)
    return(sum((samples - mu.hat)^2)/n)
  })
  mean.divergence[i] = mean(sims - true.variance)
}
mean.divergence

##          N=3         N=10         N=31        N=100        N=316 
##  0.005417070 -0.003016105 -0.013780489 -0.008139291 -0.003820838 
##       N=1000       N=3162      N=10000      N=31622      N=1e+05 
## -0.004592789 -0.005538563 -0.006064320 -0.005577406 -0.005555838 
##     N=316227      N=1e+06 
## -0.005598699 -0.005523349

plot(log(Ns, base=10), mean.divergence, xlab='N', ylab='mean divergence in N sims, 15 samples', 
     pch=20, ylim=c(-.03, .03), xaxt='n') # xaxt='n' suppresses x-axis
axis(1, at=log(Ns, base=10), labels=Ns) # use N instead of log(N) to label x-axis 
abline(h=0, col='red') # an unbiased estimator's value should be converging to this point

The sample variance is frequently substantially different from the true variance - it’s biased. However, we can get an unbiased estimator by using a different estimator, \(s^2\). (This is, by the way, what R’s var() function computes.) \[ \begin{aligned} s^2 = 1/(n-1) \times \sum_{i=1}^n [x_i - \hat{\mu}]^2 \end{aligned} \]

true.variance = 1/12
n.sims = 5000
n = 15 # each sim will have us taking this many samples
sims = sapply(1:n.sims, function(i){
    samples = runif(n, min=0, max=1)
    mu.hat = mean(samples)
    return(sum((samples - mu.hat)^2)/(n-1))
})
mean(sims - true.variance)

## [1] -0.000738395

plot(sims, pch=20, xlab='Sample', ylab='corrected sample variance s^2')
abline(h=true.variance, col='red')

Convergence:

true.variance = 1/12
n = 15 # number of samples per simulated data set
Ns = floor(10^seq(from=.5, to=6, by=.5)) # 3, 10, 31, 100, 316, 1000, ...
mean.divergence = rep(NA, length(Ns))
names(mean.divergence) = paste('N=', Ns, sep='')
for (i in 1:length(Ns)) {
  n.sims = Ns[i]
  sims = sapply(1:n.sims, function(i){
    samples = runif(n, min=0, max=1)
    mu.hat = mean(samples)
    return(sum((samples - mu.hat)^2)/(n-1))
  })
  mean.divergence[i] = mean(sims - true.variance)
}
mean.divergence

##           N=3          N=10          N=31         N=100         N=316 
## -1.075525e-02 -1.006333e-02  2.170952e-03  1.102632e-03  3.989974e-04 
##        N=1000        N=3162       N=10000       N=31622       N=1e+05 
## -3.630847e-04 -7.530779e-05  6.234220e-05  6.079765e-05 -8.771804e-05 
##      N=316227       N=1e+06 
## -5.334962e-06 -2.485221e-05

plot(log(Ns, base=10), mean.divergence, xlab='N', ylab='mean divergence in N sims, 15 samples', 
     pch=20, ylim=c(-.03, .03), xaxt='n') # xaxt='n' suppresses x-axis
axis(1, at=log(Ns, base=10), labels=Ns) # use N instead of log(N) to label x-axis 
abline(h=0, col='red') # an unbiased estimator's value should be converging to this point

With large \(n\) the difference is not much because the difference between dividing by \(n\) vs. \(n-1\) is not a big deal. When \(n\) is small, as frequently when you’re doing an experiment, it makes a substantial difference.

The normal distribution

The normal (aka Gaussian) distribution is one of the most common and useful distributions. It has an inscrutable closed form that you absolutely do not need to memorize. What you do need to know is what it looks like, how it’s parametrized, and how to work with it in R.

The normal distribution is parametrized by a mean \(\mu\) and a standard deviation \(\sigma\). It has support over the entire real line - in other words, a variable can’t be normally distributed if it doesn’t make sense to talk about negative values. (Consequence: contrary to popular belief, heights are not normally distributed.)

A standard normal distribution is simply one with \(\mu = 0\) and \(\sigma = 1\). It looks like this:

curve(dnorm(x, mean=0, sd=1), xlim=c(-3,3))

In R, use pnorm() to get the cumulative distribution, dnorm() to get the density function, rnorm() to sample from the normal, and qnorm() to get quantiles.

pnorm(q=-1, mean=0, sd=1)

## [1] 0.1586553

pnorm(q=0, mean=0, sd=1)

## [1] 0.5

pnorm(q=1, mean=0, sd=1)

## [1] 0.8413447

pnorm(q=1, mean=0, sd=1) == 1 - pnorm(q=-1, mean=0, sd=1)

## [1] TRUE

(Geometrically, what does this latter identity mean?)

To get the probability that a standard normal variable falls between -2 and 2:

pnorm(q=2, mean=0, sd=1) - pnorm(q=-2, mean=0, sd=1)

## [1] 0.9544997

A 95%-coverage interval, a bit more precisely. Store ‘1.96’ for later use.

pnorm(q=1.96, mean=0, sd=1) - pnorm(q=-1.96, mean=0, sd=1)

## [1] 0.9500042

What’s the smallest value s.t. there’s a .4 probability that a standard normal variable falls below it - i.e., the point of cumulative probability .4?

qnorm(.4, mean=0, sd=1)

## [1] -0.2533471

pnorm(-0.2533471, mean=0, sd=1)

## [1] 0.4

One way to get a precise 95% coverage interval, with 2.5% of the probability in each tail:

c(lower=qnorm(.025, mean=0, sd=1), upper=qnorm(1 - .025, mean=0, sd=1))

##     lower     upper 
## -1.959964  1.959964

Sample 100 random numbers from this distribution:

rnorm(n=100, mean=0, sd=1)

##   [1]  0.573427005 -0.606139152  1.032931746  0.209416188 -0.134255598
##   [6]  2.434555115  0.805865709 -0.710278638  0.447311740  0.739716619
##  [11]  1.243092120  0.562658944 -0.156176666  0.525286452 -1.780177211
##  [16] -0.161045710  1.131014802  0.330228947  0.693138510 -0.401128716
##  [21]  1.034646154 -0.552195187  0.007467308  0.192257769  1.556117428
##  [26]  0.381317968  1.161365657 -0.230816829  0.884465515 -0.754754762
##  [31]  0.211141521  0.590983044  1.416191442  0.658945379 -0.269249190
##  [36]  0.437164825  0.156328956  0.690685920 -0.599007679  0.232493772
##  [41] -1.719485719  0.224598454  0.940099103  0.338204506 -0.336088242
##  [46]  0.417623376  0.236875303  0.113934954 -0.642674081  0.322397347
##  [51] -0.686182063  0.101921347  1.647733715 -1.014129061  0.347215955
##  [56]  1.389886351 -1.905117281 -0.444805110 -1.045609740 -0.257600730
##  [61] -1.905661588 -0.783613093 -0.631675251  0.146443006  1.106331969
##  [66] -1.975395515  1.119089059 -0.059130739  0.310949932 -0.075807718
##  [71]  1.127801228  1.233576673  0.497837474  1.385764183 -2.608087084
##  [76] -1.556964347 -1.159872216  0.292823123  0.526142791  0.883057317
##  [81] -1.718286852 -0.066391645  1.258624776 -2.488382879  0.720669439
##  [86] -0.311970120  0.830124075 -1.927724081 -0.115785093  1.609383169
##  [91] -1.111322369  1.394712119  0.547208253 -1.897064906  0.638834189
##  [96] -1.386838670  1.232943599 -0.207351845 -0.176451089 -0.418508055

Multivariate probability

Notation: in probability & statistics, joint distributions are usually written in the form \[ P(X = x_i, Y = y_j) \] for discrete variables, or \(p(X=x_i, Y=y_I)\) for continuous variables. This notation is equivalent to either of the following (modulo capitalization). \[ \begin{aligned} & P(X = x_i\ \&\ Y = y_j)\\ & P(X = x_i \wedge Y = y_j) \end{aligned} \] So, for example, we might be interested in the probability that the direct object of a verb in English is sentence-initial and definite vs. indefinite. Then we would be trying to estimate the probabilities on the right-hand side of the following joint probability statements: \[ \begin{aligned} P(\mathbf{initial} = \mathbf{yes}, \mathbf{definite} = \mathbf{yes}) &= p_1\\ P(\mathbf{initial} = \mathbf{no}, \mathbf{definite} = \mathbf{no}) &= p_2\\ P(\mathbf{initial} = \mathbf{yes}, \mathbf{indefinite} = \mathbf{yes}) &= p_3\\ P(\mathbf{initial} = \mathbf{no}, \mathbf{indefinite} = \mathbf{no}) &= 1 - (p_1 + p_2 + p_3)\\ \end{aligned} \]

Since this is a partition - a mutually exclusive and exhaustive list of possible outcomes - the marginal probability \(P(\mathbf{initial} = \mathbf{yes})\) can be found by summing over the joint probabilities of \(P(\mathbf{initial} = \mathbf{yes}, \mathbf{definite})\) for all possible values of \(\mathbf{definite}\). \[ \begin{aligned} P(\mathbf{initial} = \mathbf{yes}) &= P(\mathbf{initial} = \mathbf{yes}, \mathbf{definite} = \mathbf{yes}) + P(\mathbf{initial} = \mathbf{yes}, \mathbf{definite} = \mathbf{no})\\ &= p_1 + p_3 \end{aligned} \] For continuous variables, it works the same except that the summation becomes an interval. \[ p(X) &= \int_{-\infty}^\infty\ P(X,Y=y) \ dy \]

Directed graphical models

Directed graphical models (‘Bayes nets’, ‘Probabilistic graphical models’, ‘Directed Acyclic Graphs’, ‘DAGs’) express conditional (in)dependence relationships between sets of variables.

Here’s a motivating linguistic example (example and analysis borrowed from R. Levy’s notes). Suppose you are a listener and you hear a speaker say

“I saw an exhibition about the, uh, …”

From your own linguistic experience, you might be able to come up with a number of possible explanations for this disfluency that could guide your expectations about upcoming linguistic material. For example, disfluencies can be generated by (among others)

The fact that an upcoming word is difficult to produce (e.g., a low frequency item like astrolabe)
The fact that the speaker’s attention was distracted by something in the non-linguistic environment.

Here’s a little graphical model depicting the intuitive relationship among variables here:

Disfluency model

This is read as follows: “W can cause D, and A can also cause D. W and A are extrinsically given, in the sense that they are not the daughters of any variables that are included in our model.” If we don’t know anything about \(D\), \(W\) and \(A\) are independent, and will usually be uncorrelated (except by accident). But, the disfluency demands a cause. As a result, \(W\) and \(A\) become dependent, given \(D\), and are in an explaining-away relationship. If we subsequently learn that there was a distraction, this will reduce the inferred probability of a difficult upcoming word, and vice versa.

Some more examples of directed graphs from Levy, Appendix C:

Some DAGs

Make sure you can identify the following relations between nodes:

\(X_i\) is an ancestor of \(X_j\)
\(X_i\) is a parent (or ‘immediate ancestor’) of \(X_j\)
\(X_i\) is an descendant of \(X_j\)
\(X_i\) is a daughter (or ‘immediate descendant’) of \(X_j\)

Acyclicity (the ‘A’ in ‘DAG’) means that no node \(X_i\) should be both a parent and an ancestor of a node \(X_j\). (Equivalently, no node can be its own ancestor or descendant.)

The key feature of a DAG, as a summary of a class of probability distributions, is that a node is screened off from its non-descendants, conditional on its parents - more formally, \[ P(X_i | \mathbf{Parents}(X_i)) = P(X_i | \mathbf{Parents}(X_i), \mathbf{Ancestors}(X_i)) \] This relationship does not necessarily hold if we also condition on the values of some of \(X_i\)’s descendants. That is, for \(X_j \in \mathbf{Descendants}(X_i)\), \[ P(X_i | \mathbf{Parents}(X_i), X_j) \overset{??}{\underset{??}{=}} P(X_i | \mathbf{Parents}(X_i), X_j, \mathbf{Ancestors}(X_i)) \] For example, in (d) above, \(X_4\) is conditionally independent of \(X_3\) given its parents \(\{X_1,X_2\}\). But it’s not conditionally independent of \(X_3\) given \(\{X_1,X_2,X_6\}\), where \(X_6\) is one of its daughters. Can you explain why?

The nice thing about a DAG, then, is that when you have \(n\) variables you don’t need to specify \(2^{n-1}\) probabilities. Usually, you can factorize the distribution into a much smaller set of conditional probability statements - intuitively, starting with ‘exogenous’ or ‘uncaused’ variables at the top and working your way down to the bottom.

When you code up a DAG, parent- and daughterhood are really easy to read off from the code. Ancestor- and descendant-hood can be more difficult: ensuring that \(X_i\) is not an ancestor/descendant of \(X_j\) involves checking all possible connections. As an exercise, let’s work through the following code and then draw a DAG that encodes the conditional independence relationships between the variables in this diagram.

rain.sim = function(i) {
    flip = function(p) runif(1) < p
    Zeus.angry = flip(.3)
    Zeus.crying = flip(.2)
    thunder = Zeus.angry && flip(.4)
    rain = (Zeus.angry && flip(.8)) || (Zeus.crying && flip(.99))
    forgetful.gardener = flip(.4)
    sprinkler = forgetful.gardener && flip(.7)
    wet.grass = (rain && flip(.8)) || (sprinkler && flip(.9)) || flip(.1)
    return(c(Zeus.angry = Zeus.angry, Zeus.crying = Zeus.crying, thunder = thunder, forgetful = forgetful.gardener, sprinkler = sprinkler, wet.grass = wet.grass))
}
sapply(1:10, FUN=rain.sim)

##              [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
## Zeus.angry  FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
## Zeus.crying FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## thunder     FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## forgetful   FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
## sprinkler   FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
## wet.grass   FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE

What is the relationship between thunder and wet.grass? If you hear thunder, have you learned something about how likely the grass is to be wet? Why or why not?
What if you also saw that it had rained - should learning about whether there was thunder still influence your best guess about the probability of rain?

query = function(variable.name, n.sims = 10000) {
    sims = sapply(1:n.sims, FUN=rain.sim)
    proportion = length(which(sims[variable.name,] == TRUE))/n.sims
    return(proportion)
}
query('Zeus.angry')

## [1] 0.2981

query('thunder')

## [1] 0.1237

Frequently, what we’re doing with a DAG is observing some ‘downstream’ variables (effects) and trying to infer the values of some ‘upstream’ variables (causes).

conditional.query = function(variable.name, conditioners, n.sims=10000) {
    sims = sapply(1:n.sims, FUN=rain.sim)
    conditioners.true = 1:n.sims
    for (i in 1:length(conditioners)) {
        conditioners.true = intersect(conditioners.true, which(sims[conditioners[i],] == TRUE))
    }
    all.true = intersect(conditioners.true, which(sims[variable.name,] == TRUE))
    proportion = length(all.true)/length(conditioners.true)
    return(proportion)
}
query('wet.grass')

## [1] 0.5425

conditional.query('wet.grass', 'thunder')

## [1] 0.7789735

conditional.query('wet.grass', c('thunder', 'Zeus.angry'))

## [1] 0.7813799

conditional.query('wet.grass', 'Zeus.angry')

## [1] 0.792998

Lots of linguistic applications of this kind of model. For example, here’s the schematic model of communication that I presented on Day 1:

Communication over a noisy channel