Correlation

Suppose the daily maximum temperature in Palo Alto in January is normally distributed. (Technically this isn’t possible - why? - but it could be approximately normal.) We could measure temperature in either F or C. Obviously, we want our statistical analyses to be robust to this choice, since it’s nothing but a linear transformation. (Technically, a ‘positive affine transformation’: \(y = ax + b\) where \(a\) is positive and \(b\) is any real number.)

mu.F = 58 # real data
sigma.F = 8 # I made this up
samples.F = rnorm(100, mu.F, sigma.F)
F.to.C = function(t) (t - 32) * 5/9
samples.C = sapply(samples.F, FUN=F.to.C)
plot(density(samples.F), col='blue', 
    xlim=c(min(samples.C), max(samples.F)), ylim=c(0,.1))
lines(density(samples.C), col='red')
legend('topright', c('Celsius', 'Fahrenheit'), 
       lty=1, col=c('red', 'blue'), text.col=c('red', 'blue'))

Notice the difference in both mean and variance.

meanvar(samples.C)
##     mean variance       sd 
## 14.19947 18.73816  4.32876
meanvar(samples.F)
##      mean  variance        sd 
## 57.559050 60.711642  7.791768

Nevertheless, these samples carry exactly the same information: they are the same measurements, just expressed in different units! As a result they are perfectly correlated, and plotting one against the other gives a perfectly straight line.

cor(samples.F, samples.C)
## [1] 1
plot(samples.F, samples.C, pch=20, col='blue',
     xlim=c(30,80), ylim=c(0,80))

In general, you can add or subtract anything, and multiply or divide by any positive number, without affecting correlation.
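
For instance, here is a quick check with an arbitrarily chosen positive slope and intercept (the values 2.7 and -14 are made up purely for illustration); the result should be 1 up to floating-point error.

a = 2.7   # any positive number
b = -14   # any real number
cor(samples.F, a * samples.F + b)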

\(z\)-scoring temperatures is one way to see in what sense they are the same regardless of unit. Suppose we have a normally distributed RV \(X\). We can \(z\)-score \(X\) by rescaling it to a standard normal RV, like this: for all \(x\), \[ z(x) = \frac{x - \mathbb{E}(X)}{\sigma_X} \] Then \(\mathbb{E}(z(X)) = 0\) and \(\sigma_{z(X)} = 1\). [Remember: \(\sigma_X\), \(X\)’s standard deviation, is equal to \(\sqrt{\mathbf{Var}(X)}\).] Measurements are now stated in units of standard deviations: \(z(x) = 1.2\) is read “\(x\) is 1.2 standard deviations above the mean”.

\(z\)-scoring RVs gives us a way to compare them using common units. In effect, it gives us a scale-independent quantitative measurement of the position of an individual in the distribution - how much the individual’s position deviates from the mean, relative to the overall distribution of deviations.

z.score = function(x, mu, sigma) (x - mu)/sigma
z.sigma.F = sapply(samples.F, FUN=function(x) {
  z.score(x, mu.F, sigma.F)
  }) 
z.sigma.C = sapply(samples.C, FUN=function(x) {
  z.score(x, F.to.C(mu.F), sigma.F * 5/9)
  }) 
approx.equal = function(v1, v2) abs(v1-v2) < 10^-15
approx.equal(z.sigma.F, z.sigma.C)
##   [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [57] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [71] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [85] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [99] TRUE TRUE
plot(z.sigma.F, z.sigma.C, pch=20, col='blue')
abline(a=0, b=1, col='red', lty=4)

There are various ways to interpret such a situation: if \(X\) and \(Y\) are perfectly correlated, then

  • either one can be predicted exactly from the value of the other;
  • their \(z\)-scores are identical;
  • each is a positive affine transformation of the other.

Perfect negative correlation means, similarly, that we can predict either perfectly from the other: the only difference is that the line slopes downward instead of upward, and the correlation coefficient is negative.

disease.rate = rbeta(20, 4, 6) 
  # disease rates for 20 cities, drawn from a Beta(4,6)
non.diseased.rate = 1 - disease.rate
  # proportion of people without the disease
cor(disease.rate, non.diseased.rate)
## [1] -1
plot(disease.rate, non.diseased.rate, 
     pch=20, col='blue', xlim=c(0,1), ylim=c(0,1))
abline(a=1, b=-1, col='red', lty=4)

OK, but how is cor() actually calculated? If \(X\) and \(Y\) have both been \(z\)-scored (so each has mean 0 and standard deviation 1) - as we did for the temperatures above - the correlation is just the expected value of their product, which we can approximate from \(n\) paired samples: \[ \begin{aligned} \rho_{XY} &= \mathbb{E}[X \times Y] \\ &\approx \frac{\sum_{i=1}^n \ x_i \times y_i}{n} \end{aligned} \]
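
We can sanity-check this with the \(z\)-scored temperatures computed above (z.sigma.F and z.sigma.C). The result is only approximately 1, since we standardized with the population parameters rather than the sample mean and standard deviation.

n = length(z.sigma.F)
sum(z.sigma.F * z.sigma.C) / n  # approximates rho = 1
cor(samples.F, samples.C)       # the exact sample correlation, for comparison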

More generally, for any \(X\) and \(Y\) (not necessarily normally distributed), \[ \rho_{XY} = \frac{\mathbf{Cov}(X,Y)}{\sqrt{\mathbf{Var}(X)\mathbf{Var}(Y)}}, \] where the covariance is given by \[ \begin{aligned} \mathbf{Cov}(X,Y) &= \mathbb{E}[(X - \mathbb{E}(X)) \times (Y - \mathbb{E}(Y))]\\ &\approx \frac{\sum_{i=1}^n (x_i - \hat{\mu}_X) \times (y_i - \hat{\mu}_Y)}{n-1} . \end{aligned} \]

Note: covariance and correlation are defined for numerical variables, but they can also be applied to categorical variables, as long as the categorical outcomes are coded as numbers (as in the example below).

Here are some examples using bivariate normal distributions with various correlations.

rbivariate <- function(rho, n.samples, mean.x = 0, sd.x=1, mean.y=0, sd.y=1) {
   z1 = rnorm(n.samples)
   z2 = rnorm(n.samples)
   x = sqrt(1 - rho^2) * sd.x * z1 + rho * sd.x * z2 + mean.x
   y = sd.y * z2 + mean.y
   return(list(x=x,y=y))
}
cors = seq(from=-1, to=1, by=.25)
cors
## [1] -1.00 -0.75 -0.50 -0.25  0.00  0.25  0.50  0.75  1.00
par(mfrow=c(3,3))
for (i in 1:length(cors)) {
    data = rbivariate(cors[i], 200)
    plot(data$x, data$y, col='blue', pch=20,
         main=paste('correlation =', cors[i]))
    abline(a=0, b=cors[i], col='red', lty=4)
}

Pseudo-Old English again [Levy p.40]

proportions = matrix(
  c(0.224, 0.655,
    0.014, 0.107), 
  nrow=2, byrow=T,
  dimnames=list(c('obj.preverbal', 'obj.postverbal'),
                  c('pronoun','not pronoun')))
proportions
##                pronoun not pronoun
## obj.preverbal    0.224       0.655
## obj.postverbal   0.014       0.107

Below we code “object preverbal” = 0, “object postverbal” = 1, and we also code “pronoun” = 0 and “not pronoun” = 1. When there are only two ways for a variable to come out, you can assign 1 and 0 arbitrarily to the outcomes like this. (With more than 2 possible outcomes numeric coding is trickier - why?)

our.data = matrix(c(rep(c(0,0), 224), rep(c(1,0), 14), rep(c(0,1), 655), rep(c(1,1), 107)),
                     ncol=2, byrow=T)
# create a matrix of 1000 observations with these exact proportions
colnames(our.data) = c('postverbal?', 'non-pronoun?')
rownames(our.data) = paste('s', 1:length(our.data[,1]), sep='') # label the 1000 observations s1, s2, ...
head(our.data)
##    postverbal? non-pronoun?
## s1           0            0
## s2           0            0
## s3           0            0
## s4           0            0
## s5           0            0
## s6           0            0

How well does object position predict pronominality?

sample.covariance = function(X,Y) {
  if (length(X) != length(Y)) stop("vectors have unequal lengths")
  else sum((X - mean(X)) * (Y - mean(Y))) / (length(X) - 1)
}
postverbal = our.data[,1]
non.pronoun = our.data[,2]
sample.covariance(postverbal, non.pronoun)
## [1] 0.01481281
cov(postverbal, non.pronoun) # built-in function for this
## [1] 0.01481281

This differs from Levy’s calculation slightly, because he calculates the uncorrected covariance.

uncorrected.covariance = function(X,Y) {
  if (length(X) != length(Y)) stop("vectors have unequal lengths")
  else sum((X - mean(X)) * (Y - mean(Y))) / length(X)
}
uncorrected.covariance(postverbal, non.pronoun)
## [1] 0.014798

When would one or the other choice be appropriate?

In any case this number is hard to interpret, but we can get a correlation coefficient by dividing it by the square root of the product of the variances.

sample.correlation = function(X,Y) {
  sample.covariance(X,Y) / sqrt(var(X) * var(Y))
}
sample.correlation(postverbal, non.pronoun)
## [1] 0.1065491
cor(postverbal, non.pronoun) # built-in function for this
## [1] 0.1065491

Non-linear correlation, parameters, and overfitting

The examples we have seen so far all compute linear correlation. Frequently you will want to compute correlations with non-linear functions - especially, correlations between some data and a model that makes interesting, non-linear predictions. If you can write down your model as a function that predicts a numerical value for each observation, this is straightforward. Here’s an example from Lassiter & Goodman, “How many kinds of reasoning?”, 2014, Cognition.

Our model assumed that the by-condition proportions on the \(y\)-axis are a power-law function of the baseline proportions on the \(x\)-axis: \[ \mathbf{prop}_M(Y) = \mathbf{prop}_M(X)^{\alpha_M} \] We measured both quantities and evaluated the model’s performance using the best-fit estimate of the parameter \(\alpha_M\), i.e., the value that minimized the model’s squared error: \[ \begin{aligned} \mathbf{squared\ error}(\alpha, M) &= \sum_{i=1}^n [\mathbf{prop}_M(y_i) - \mathbf{prop}_M(x_i) ^ \alpha]^2\\ \hat{\alpha}_M &= \text{arg min}_{\alpha \in \mathbb{R}}\ \mathbf{squared\ error}(\alpha, M) \end{aligned} \] Divided by \(n\), the squared error is just the (uncorrected) mean squared difference between the model’s predictions \(\mathbf{prop}_M(X)^{\alpha_M}\) and the data \(\mathbf{prop}_M(Y)\). With these best-case parameter estimates we calculated by-condition \(R^2\) values:

[Figure: L&G ’14 by-condition correlations]
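
Here is a minimal sketch of this kind of fitting procedure in R, using optimize() to minimize the squared error. The vectors prop.x and prop.y below are invented purely for illustration - they are not the actual L&G data, and this is not the original analysis code.

# hypothetical baseline and by-condition proportions (made up for illustration)
prop.x = c(0.05, 0.20, 0.45, 0.70, 0.90)
prop.y = c(0.01, 0.10, 0.35, 0.65, 0.88)
squared.error = function(alpha) sum((prop.y - prop.x^alpha)^2)
alpha.hat = optimize(squared.error, interval=c(0, 10))$minimum  # best-fit exponent
alpha.hat
cor(prop.y, prop.x^alpha.hat)^2  # R^2 between the data and the fitted model's predictions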

The correlation between data and model is a measure of how well we can predict the observed value of \(\mathbf{prop}_M(Y)\) if we know all three of the following:

  • the observed value of \(\mathbf{prop}_M(X)\)
  • the form of the model
  • the fitted value of \(\alpha_M\)

As mentioned, what we reported was the \(R^2\) value, i.e., the square of the correlation. \(R^2\) always lies between 0 and 1, and it is never greater than the absolute value of \(\rho\) (the two are equal only when \(|\rho|\) is 0 or 1). It’s often glossed as the “proportion of variance explained” - here, the proportion of the variance inherent in \(Y\) that can be explained using \(X\) and the fitted model. There’s some controversy about whether this is a good gloss in general. Supposing it is, what we see is that our model explained a lot, but certainly not all, of the variability inherent in participants’ responses.

Note, though, that we used the observed values of \(\mathbf{prop}_M(X)\) to fit the model, and then also used them to evaluate the performance of the model! If we had chosen \(\alpha_M\) values at random, for example, the \(R^2\) values would probably have been much lower. Should we be worried about this? In general, yes. Overfitting by using the same data to fit parameters and to evaluate a model is a potentially serious worry (not just in statistics! this is a major problem in linguistic theory in general). Overparametrizing a model is a different, but related, worry, because it can give you too many degrees of freedom - in effect, you can avoid being shown to be wrong by making your model more flexible.

In this case, we judged that a single numerical parameter was the minimum that we could assume in order to distinguish a variety of semantically different modal words (possible, plausible, likely, probable, certain, necessary). We also tried to avoid overfitting worries by comparing our correlations with the split-half correlation, which involves splitting the data into two halves and asking how well we can predict the data in one half on the basis of features of the other half. This is different from, but related to, the idea of avoiding overfitting by dividing data into training and test sets.
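
For concreteness, here is one simple way to compute a split-half correlation in R, on invented data (a matrix of hypothetical per-participant responses). This is just a sketch of the general idea, not the procedure we actually used.

set.seed(1)
responses = matrix(rbeta(20 * 10, 2, 2), nrow=20)  # 20 fake participants x 10 items
half1 = sample(1:20, 10)                      # a random half of the participants
half2 = setdiff(1:20, half1)                  # the remaining participants
item.means.1 = colMeans(responses[half1, ])   # by-item means within each half
item.means.2 = colMeans(responses[half2, ])
cor(item.means.1, item.means.2)               # split-half correlation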

Central limit theorem

If we have a large number of i.i.d. RVs (with finite variance), their sum will be approximately normally distributed, regardless of what the underlying distribution is. This useful result is called the Central Limit Theorem. For example, consider taking the sum of \(n\) samples from a \(U(0,1)\) distribution, varying \(n\).

n.sims = 1000  
par(mfrow=c(2,3))
for (n in 1:6) {
    sampler = function(i) sum(runif(n, 0, 1))
    samples = sapply(1:n.sims, FUN=sampler)
    curve(dnorm(x, mean(samples), sd(samples)), 
          col='red', xlim=c(0, max(samples)),  main=paste('n. samples =', n), xlab='sampled value', ylab='est. density')
    lines(density(samples), col='blue') 
}

Already with the sum of 4 uniform random variables, the distribution is very closely approximated by a normal distribution (red).

For interest: many theorists have invoked the CLT to explain the ubiquity of approximately normal distributions in the real world - heights, measurements of baked goods, lifetimes of lightbulbs, and so on. But it is really unclear whether the CLT does any explanatory work here: see Aidan Lyon, “Why are normal distributions normal?”, British Journal for the Philosophy of Science 2013 for a lucid discussion.

For example, let’s look at the sum