The Height Data

We will use the height data from Richard McElreath’s Statistical Rethinking course. The data comes from the !Kung San people of the Kalahari desert. We will download the spreadsheet .csv file from git-hub, specifying that in this case the field separator is a ‘;’ and not a ‘,’ by setting the sep variable to ‘;’ using “sep=‘;’”. We then save the table in a list called data. Finally, head will show the first few row of the table.

data = read.csv("https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv",sep=';')
data = data.frame(data)
head(data)
##    height   weight age male
## 1 151.765 47.82561  63    1
## 2 139.700 36.48581  63    0
## 3 136.525 31.86484  65    0
## 4 156.845 53.04191  41    1
## 5 145.415 41.27687  51    0
## 6 163.830 62.99259  35    1

The data is indexed by column and row starting from 1, so

data[3,2]
## [1] 31.86484

pulls out the second column (weight) or the third row.

Question: What would you type to pull get the data for the age of the 6’th entry?

In addition to single entries, we can pull out a whole row by leaving the column entry blank:

data[3,]
##    height   weight age male
## 3 136.525 31.86484  65    0

Question: By the same logic, how would you pull out a whole column of data?

We can all specific a column of data by name with using data[‘height’], or data$height:

head(data['height'])
##    height
## 1 151.765
## 2 139.700
## 3 136.525
## 4 156.845
## 5 145.415
## 6 163.830
head(data$height)
## [1] 151.765 139.700 136.525 156.845 145.415 163.830

Plot our Data

You can also embed plots using plot(x_data,y_data), for example:

Working with data

For our first simple model we just want to work with the data in the part of the age spectrum where the mean is constant. Practically, that means that the data will distributed around a single average value. We can extract the data for ages greater than 20 say as follows:

data$age>20
##   [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
##  [12]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
##  [23]  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE
##  [34]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [45] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE
##  [56] FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [67]  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
##  [78] FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
##  [89]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE
## [100]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
## [111] FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## [122]  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE
## [133]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE
## [144] FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE
## [155]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
## [166]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
## [177]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
## [188] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE
## [199]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [210]  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
## [221] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE
## [232] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
## [243] FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
## [254]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [265]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE
## [276]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
## [287]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## [298] FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
## [309] FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
## [320]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [331] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
## [342]  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
## [353]  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
## [364] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
## [375]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
## [386]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
## [397]  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
## [408] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
## [419]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE
## [430]  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
## [441] FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
## [452]  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
## [463] FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE
## [474]  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE
## [485]  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE
## [496]  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE
## [507]  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE
## [518] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
## [529]  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE
## [540] FALSE  TRUE  TRUE FALSE  TRUE

this returns a vector of 544 TRUE and FALSE values, TRUE for each entry where the age is greater than 20 and FALSE otherwise. We can feed this vector into the rows specifier of data to extract only those entries for which the age is greater than 20. Note, we must include the comma to tell R that we want to pull out the columns:

head(data[data$age>20,])
##    height   weight age male
## 1 151.765 47.82561  63    1
## 2 139.700 36.48581  63    0
## 3 136.525 31.86484  65    0
## 4 156.845 53.04191  41    1
## 5 145.415 41.27687  51    0
## 6 163.830 62.99259  35    1

To pull out the heights, we can use

height = data[data$age>20,'height']

Visualizing a Distribution

Now that we have the adult heights, we can visualize the distribution using a histogram:

hist(height)

We can change the number of bars by specifying the breaks variable, and specify that we’re interesting the probability distribution (the frequencies divided by the total number of entries) by using prob = TRUE

hist(height, breaks=20, prob = TRUE)

Finding Mean and Varience

R has many built in functions to help us find empirical mean, variance, and to deal with a multitude of common probability distributions. (Aside: In fact, R can basically do anything we’ll learn the matheatmics of, including interval estimates, boot strapping, etc. Know when to use the tools is the hard part)

To find the mean we could sum and divide by the length:

sum(height)/length(height)
## [1] 154.7477

or use built-in functions. In the code below, cat stands for concatanate and print:

m = mean(height)
s = sd(height)
s2 = var(height)

cat("Mean:", m, ", Standard Deviation:", s, ", Varience:", s2)
## Mean: 154.7477 , Standard Deviation: 7.799343 , Varience: 60.82974

Working with Probability Functions

We can now plug the mean and the variance we found into the probability density function for the normal distribution built into R. For each probability distribution R has some form of the following functions:

Example: Plotting Distributions

For a first example, lets plot the normal distribution we just calculated on top of the height histogram. To add to a plot we will call the functions lines as if it were a plot call.

First, we generate a sequence of \(x\) values from 130 to 180 (roughly the range of the data) at a step of .1 using the function seq. We then feed these values into dnorm with the mean and standard deviation computed above:

hist(height, breaks=20, prob = TRUE)

x = seq(130, 180, by=.1)
lines(x, dnorm(x=x, mean=m, sd=s))

Example: Determining Probabilities

Now, assume I am 120 centimeters. I can compute the probability of someone being shorter than me using the CDF pnorm:

pnorm(q=170, mean=m, sd=s)
## [1] 0.974743

So 97.5% of the sampled population will be shorter than me.

Question: How can I figure out what percent of the population is taller than me?

Well we ruminate on that, lets think about how we determine the probability that someone is between 150 and 170. Computing the CDF’s we find that

print(pnorm(q=150, mean=m, sd=s))
## [1] 0.2713504
print(pnorm(q=170, mean=m, sd=s))
## [1] 0.974743

So 97.5% of the population is below 170, but 27.1% is below 150. That means the amount between 150 and 170 is (97.5% - 27.1% = ) 70.4% of the population. For a known probability distribution, we can now use R to probe regions where we may not have real data!

Exercise: Show that the probability of a member of the population having height between 173 and 177 cm is 0.747%.

Exercise: Use the qnorm function to find the value of x such that 10% of the population is less than x. Check your value using pnorm.

Exercise: Use the qnorm function to find the value c such that 90% of the population is inside the interval [m-c, m+c].

Exercise: Use your results to compute the 95% and 99% confidence intervals for the mean.

Simulations

Simulating Sampling Error

Let assume now that we draw 50 data points from a probability distribution; for clarity we will fix mean= 150 and sd=10

m_sim = 150
s_sim = 10

We can use rnorm to sample the distribution 50 times and then compute the fit mean and sd from the result:

x_samp = rnorm(50, m_sim, s_sim)

m_fit = mean(x_samp)
s_fit = sd(x_samp)

Now, how does this sample-fit distribution compare to the original?

x = seq(120, 180, by=.1)
plot(x, dnorm(x=x, mean=m_sim, sd=s_sim), type='l', ylab = "Probability",col="red")
lines(x, dnorm(x=x, mean=m_fit, sd=s_fit),col="blue")

legend(1, 95, legend=c("Origional", "Sample-Fit"),
       col=c("red", "blue"), lty=1:2, cex=0.8)

We can see there’s quite a difference! Run the code again and you’ll see this isn’t fixed. In fact, we can you a for loop to run the code 100 times and see how diverse of outcomes we get:

x = seq(120, 180, by=.1)
plot(x, dnorm(x=x, mean=m_sim, sd=s_sim), type='l', ylab = "Probability",col="red")

for (i in 1:100) {
  x_samp = rnorm(50, m_sim, s_sim)
  m_fit = mean(x_samp)
  s_fit = sd(x_samp)
  lines(x, dnorm(x=x, mean=m_fit, sd=s_fit),col="blue")
}
  
lines(x, dnorm(x=x, mean=m_sim, sd=s_sim), col="red", lwd = 3)

legend(1, 95, legend=c("Origional", "Sample-Fit"),
       col=c("red", "blue"), lty=1:2, cex=0.8)

We can see there’s quite a bit of variation, even when we start from a known distribution. How do we quantify this?

*Question: How would you expect the distribution of sample-fit distributions to change as we varied the sample size? Verify your results by modifying the code above.

Sample Distribution of Means

Like before, lets try to get a handle on the sample distribution of means. This will be very similar to the last computation, but we will have to save each of the means we compute. To do that, we’ll start by building a dummy list will all 0’s of the proper size and then save the computed sample means into the i’th entry. We can then take a histogram:

x = seq(120, 180, by=.1)

means = rep(0, 100)

plot(x, dnorm(x=x, mean=m_sim, sd=s_sim), type='l', ylab = "Probability",col="red")

for (i in 1:100) {
  x_samp = rnorm(50, m_sim, s_sim)
  m_fit = mean(x_samp)
  means[i] = m_fit
}
  
hist(means, probability = TRUE)

legend(1, 95, legend=c("Origional", "Sample-Fit"),
       col=c("red", "blue"), lty=1:2, cex=0.8)

Notice that the histogram of means is much more centered at the mean than the original probability distribution.

Exercise: Using dnorm, plot the proper probability density function over the histrogram of means. What will mean and sd be?

Exercise: Repeat the simulation above but this time plot the histogram of the sample standard deviations. Do you think these follow a normal distribution?

Bootstrap Confidence Interval (CI):

Now that we have a way of generating samples of means, bootstrapping the CI for the mean is straight forward: For the 90% CI we just generate 1000 samples and throw away the largest 5% and the smallest 5%, the boundary will then be out values. Luckily, R has a built in function to do this for us: quantile(means, c(.05,.95)) will return the 5% quantile and the 95% quantile.

M = 1000
N = 100

means = rep(0, M)

for (i in 1:M) {
  x_samp = rnorm(N, m_sim, s_sim)
  m_fit = mean(x_samp)
  means[i] = m_fit
}

quantile(means, c(.05,.95))
##       5%      95% 
## 148.3579 151.5599

This directly gives the bootstrap CI. Notice however that it will change under multiple runs, you should always pick \(N\) and \(M\) large enough that your bootstrap CI’s endpoints are stable under different runs.

Exercise: What is the bootstrap 95% CI?

Exercise: Bootstrap the 95% CI for the fit sd from the last section.

Central Limit Theorem

We want to test the claim that many little random binary flips can add up to something that looks like a Gaussian distribution. Consider the following cartoon example: There are N genes that effect height. Each gene has a different probability of being active in a random person, ie neither the expected value for each genes contribution to a persons overall height, nor the variance are the same between genes. If generate M random humans heights under these assumptions, what will the distribution of heights look like?

Here’s one way we could simulate this: Each of those N genes has a (uniform) random probability between 0 and 1 of flipping on. We will generate those random probabilities into a fixed vector using the runif function:

N = 1000
gene_probs = runif(N)
head(gene_probs)
## [1] 0.1811334 0.1119085 0.9904869 0.1704202 0.9219167 0.7444811

Now, for a given person we generate a random number for each gene and compare it to the threshold:

N = 1000
gene_expression = runif(N)

genes_on = gene_probs>gene_expression
head(genes_on)
## [1] FALSE FALSE  TRUE FALSE  TRUE  TRUE

Finally, we add the number of true values up, these are the gene that successfully contribute to adding height to a person.

N = 1000
gene_expression = runif(N)

genes_on = gene_probs>gene_expression
sum(genes_on)
## [1] 497

Lets put these in a loop and run them for 1000 people:

N = 1000
M = 10000

gene_total = rep(0,M)

for (i in 1:M){
  gene_expression = runif(N)
  genes_on = gene_probs>gene_expression
  gene_total[i] = sum(genes_on)
}

hist(gene_total,breaks=50)

This looks very Gaussian. The fact that small binary decisions can in aggregate add up to normally distributed phenomena gives us a strong reason to believe Gaussian noise in a wide variety of situations.

Fitting Probabilistic Models

Fitting a Distribution

We saw two different methods for fitting probability distributions: The methods of moments and the log-likelihood method. In this case, R has a built in function to minimize the negative log-likelihood of a model, we just need to define a function that takes our parameters and returns a number, namely the log-likelihood that the data was pulled from a distribution with the parameters. The built in function won’t solve the problem explicitly, it will use automatic methods like gradient decent.

require(stats4)
## Loading required package: stats4
LL <- function(mu, sigma) {
     x = height
     R = dnorm(x, mean=mu, sd=sigma,log=TRUE)
     return(-sum(R))
}

mle(LL, start = list(mu = 1, sigma=1))
## 
## Call:
## mle(minuslogl = LL, start = list(mu = 1, sigma = 1))
## 
## Coefficients:
##        mu     sigma 
##  183.9476 2496.4526

We immediately see a huge problem: the predicted mean is far from the real mean and the standard deviation is several orders of magnitude too large. What went wrong? The answer is that if you don’t have reasonable starting conditions automatic fitting can often find a local minimum that is far away from the global minimum. Lets try something a little closer to the real value:

LL <- function(mu, sigma) {
     x = height
     R = dnorm(x, mean=mu, sd=sigma,log=TRUE)
     return(-sum(R))
}

mle(LL, start = list(mu = 150, sigma=10))
## Warning in dnorm(x, mean = mu, sd = sigma, log = TRUE): NaNs produced
## 
## Call:
## mle(minuslogl = LL, start = list(mu = 150, sigma = 10))
## 
## Coefficients:
##         mu      sigma 
## 154.747724   7.787552

This is a much more reasonable answer. Compare it with the answer above for the method of moments fit:

Mean: 154.7477, Standard Deviation: 7.799343

So maximum likelihood can work, but wee need to be careful about our starting conditions.

Fitting a Model

Now, we can use maximum likelihood to fit more advanced models. Indeed if we look the mle function we used didn’t even know if the model was a probability distribution, it just wanted to be given a likelihood. Assume we have a simple additive model like

\[ y = f(x) + \epsilon\,,\hspace{3em} \epsilon \sim \mathcal{N}(\epsilon \,|\, 0, \sigma)\,, \] where \(f(x)\) is a deterministic function and \(\epsilon\) is a random variable accounting for noise, and is drawn for a normal distribution centered at 0 with a constant variance for all values of \(x\). The likelihood of choosing \(y\) given that we are at \(x\) is

\[ \mathbb{P}(y\,|\,x) = \mathcal{N}(\epsilon \,|\, f(x), \sigma) \] That is, \(y\) is drawn from a normal distribution centered at \(f(x)\). We can use this to write the log-likelihood, and then given a function \(f(x)\) we can try to fit both the function and the distribution to the data.

Lets recall the shape of the data:

I propose that the data follows an underlying hill function: \[ f(x\,|\,a,b) = b\frac{x}{a+x}\,. \] We can see that for \(x\ll a\), \(f\) is approximately the linear function \(f\approx \frac{b}{a} x\) whereas for \(x\gg a\), \(f\approx b\). Lets fit this by minimizing the negative log-likelihood:

LL <- function(a, b, sigma) {
     y = data$height
     x = data$age
     R = dnorm(y, mean=b*(x)/((a) + (x)), sd=sigma,log=TRUE)
     return(-sum(R))
}

mle(LL, start = list(a = 2, b = 154.747724, sigma=7.78))
## 
## Call:
## mle(minuslogl = LL, start = list(a = 2, b = 154.747724, sigma = 7.78))
## 
## Coefficients:
##          a          b      sigma 
##   2.289411 162.315912  14.328473

How have I picked my starting parameters? Well, \(b\) is the large \(x\) value so we would expect \(b = \mu\) from before. Similarly, we calculated \(\sigma\) from the heights of the adult males, its reasonable to use that as a starting value.

Question: The model above doesn’t quite fit the data as tightly as we would like. What can be done to improve it?