We will use the height data from Richard McElreath's Statistical Rethinking course. The data comes from the !Kung San people of the Kalahari Desert. We will download the spreadsheet .csv file from GitHub, specifying that in this case the field separator is a ';' and not a ',' by setting the sep argument with sep=';'. We then save the table in a data frame called data. Finally, head shows the first few rows of the table.
data = read.csv("https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv",sep=';')
data = data.frame(data)
head(data)
## height weight age male
## 1 151.765 47.82561 63 1
## 2 139.700 36.48581 63 0
## 3 136.525 31.86484 65 0
## 4 156.845 53.04191 41 1
## 5 145.415 41.27687 51 0
## 6 163.830 62.99259 35 1
The data is indexed by row and then column, starting from 1, so
data[3,2]
## [1] 31.86484
pulls out the entry in the third row and second column (weight).
Question: What would you type to get the age of the 6th entry?
In addition to single entries, we can pull out a whole row by leaving the column entry blank:
data[3,]
## height weight age male
## 3 136.525 31.86484 65 0
Question: By the same logic, how would you pull out a whole column of data?
We can also select a specific column of data by name using data['height'] or data$height:
head(data['height'])
## height
## 1 151.765
## 2 139.700
## 3 136.525
## 4 156.845
## 5 145.415
## 6 163.830
head(data$height)
## [1] 151.765 139.700 136.525 156.845 145.415 163.830
For our first simple model we just want to work with the data in the part of the age spectrum where the mean height is constant. Practically, that means the data will be distributed around a single average value. We can extract the data for ages greater than 20, say, as follows:
data$age>20
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## [12] TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
## ... (output truncated; 544 logical values in all)
This returns a vector of 544 TRUE and FALSE values: TRUE for each entry where the age is greater than 20 and FALSE otherwise. We can feed this vector into the row specifier of data to extract only those entries for which the age is greater than 20. Note that we must include the trailing comma to tell R that we are selecting rows and want all of the columns:
head(data[data$age>20,])
## height weight age male
## 1 151.765 47.82561 63 1
## 2 139.700 36.48581 63 0
## 3 136.525 31.86484 65 0
## 4 156.845 53.04191 41 1
## 5 145.415 41.27687 51 0
## 6 163.830 62.99259 35 1
To pull out the heights, we can use
height = data[data$age>20,'height']
Now that we have the adult heights, we can visualize the distribution using a histogram:
hist(height)
We can change the number of bars by specifying the breaks argument, and specify that we're interested in the probability distribution (the frequencies divided by the total number of entries) by using prob = TRUE:
hist(height, breaks=20, prob = TRUE)
R has many built-in functions to help us find the empirical mean and variance, and to deal with a multitude of common probability distributions. (Aside: In fact, R can do essentially everything we'll learn the mathematics of, including interval estimates, bootstrapping, etc. Knowing when to use the tools is the hard part.)
To find the mean we could sum and divide by the length:
sum(height)/length(height)
## [1] 154.7477
or use built-in functions. In the code below, cat stands for concatenate and print:
m = mean(height)
s = sd(height)
s2 = var(height)
cat("Mean:", m, ", Standard Deviation:", s, ", Varience:", s2)
## Mean: 154.7477 , Standard Deviation: 7.799343 , Varience: 60.82974
We can now plug the mean and the variance we found into the probability density function for the normal distribution built into R. For each probability distribution R has some form of the following functions:

- dnorm(x=, mean=, sd=) - Returns the value of the probability density function (PDF) at the point x for a distribution with mean given by mean and standard deviation given by sd. For the normal distribution, the PDF is the bell curve, and dnorm can be used to plot it.
- pnorm(q=, mean=, sd=) - Returns the cumulative distribution function (CDF) at q of a distribution with mean and sd as above. The CDF is the probability \(p\) that a random value \(x\) drawn from the normal distribution is less than q. Use this to determine things like "What is the probability of finding a height less than 60 cm?"
- qnorm(p=, mean=, sd=) - Returns the inverse CDF of the normal distribution at p with mean and sd as above. This function takes a probability p and returns the value q such that a fraction p of the distribution is less than q. For example, if you want to know how tall you have to be to be taller than 67% of the population, use qnorm(p=.67, mean=, sd=).
- rnorm(n=, mean=, sd=) - Returns n random samples from the normal distribution with mean and sd as above. Use this function for simulations and bootstrapping.

For a first example, let's plot the normal distribution we just calculated on top of the height histogram. To add to an existing plot we will call the function lines as if it were a plot call.
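Before building that plot, here is one quick call to each of the four functions, using the m and s we computed above (a sanity check; the rnorm draws will differ every time you run it):

dnorm(x=155, mean=m, sd=s)   # density of the fitted bell curve at 155 cm
pnorm(q=155, mean=m, sd=s)   # probability that a height is below 155 cm
qnorm(p=0.5, mean=m, sd=s)   # median height; for a normal, equal to the mean
rnorm(n=3, mean=m, sd=s)     # three random heights drawn from the fit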
First, we generate a sequence of \(x\) values from 130 to 180 (roughly the range of the data) at a step of .1 using the function seq. We then feed these values into dnorm with the mean and standard deviation computed above:
hist(height, breaks=20, prob = TRUE)
x = seq(130, 180, by=.1)
lines(x, dnorm(x=x, mean=m, sd=s))
Now, assume I am 170 centimeters tall. I can compute the probability of someone being shorter than me using the CDF pnorm:
pnorm(q=170, mean=m, sd=s)
## [1] 0.974743
So 97.5% of the sampled population will be shorter than me.
Question: How can I figure out what percent of the population is taller than me?
While we ruminate on that, let's think about how we determine the probability that someone is between 150 and 170 cm. Computing the CDFs, we find that
print(pnorm(q=150, mean=m, sd=s))
## [1] 0.2713504
print(pnorm(q=170, mean=m, sd=s))
## [1] 0.974743
So 97.5% of the population is below 170 cm, but 27.1% is below 150 cm. That means the proportion between 150 and 170 cm is 0.974743 − 0.271350 ≈ 0.703, or about 70.3% of the population. For a known probability distribution, we can now use R to probe regions where we may not have real data!
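Equivalently, we can let R do the subtraction in one step:

pnorm(q=170, mean=m, sd=s) - pnorm(q=150, mean=m, sd=s)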
Exercise: Show that the probability of a member of the population having height between 173 and 177 cm is 0.747%.
Exercise: Use the qnorm function to find the value of x such that 10% of the population is less than x. Check your value using pnorm.
Exercise: Use the qnorm function to find the value c such that 90% of the population is inside the interval [m-c, m+c].
Exercise: Use your results to compute the 95% and 99% confidence intervals for the mean.
Let's assume now that we draw 50 data points from a known probability distribution; for concreteness we will fix mean = 150 and sd = 10:
m_sim = 150
s_sim = 10
We can use rnorm to sample the distribution 50 times and then compute the fit mean and sd from the result:
x_samp = rnorm(50, m_sim, s_sim)
m_fit = mean(x_samp)
s_fit = sd(x_samp)
Now, how does this sample-fit distribution compare to the original?
x = seq(120, 180, by=.1)
plot(x, dnorm(x=x, mean=m_sim, sd=s_sim), type='l', ylab = "Probability", col="red")
lines(x, dnorm(x=x, mean=m_fit, sd=s_fit), col="blue")
legend("topright", legend=c("Original", "Sample-Fit"),
       col=c("red", "blue"), lty=1, cex=0.8)
We can see there's quite a difference! Run the code again and you'll see the result isn't fixed. In fact, we can use a for loop to run the fit 100 times and see how diverse the outcomes are:
x = seq(120, 180, by=.1)
plot(x, dnorm(x=x, mean=m_sim, sd=s_sim), type='l', ylab = "Probability",col="red")
for (i in 1:100) {
x_samp = rnorm(50, m_sim, s_sim)
m_fit = mean(x_samp)
s_fit = sd(x_samp)
lines(x, dnorm(x=x, mean=m_fit, sd=s_fit),col="blue")
}
lines(x, dnorm(x=x, mean=m_sim, sd=s_sim), col="red", lwd = 3)
legend("topright", legend=c("Original", "Sample-Fit"),
       col=c("red", "blue"), lty=1, cex=0.8)
We can see there’s quite a bit of variation, even when we start from a known distribution. How do we quantify this?
Question: How would you expect the distribution of sample-fit distributions to change as we vary the sample size? Verify your prediction by modifying the code above.
As before, let's try to get a handle on the sampling distribution of the mean. This will be very similar to the last computation, but we will have to save each of the means we compute. To do that, we'll start by building a dummy vector of all 0s of the proper size and then save each computed sample mean into the i'th entry. We can then take a histogram:
means = rep(0, 100)
for (i in 1:100) {
  x_samp = rnorm(50, m_sim, s_sim)
  means[i] = mean(x_samp)
}
hist(means, probability = TRUE)
Notice that the histogram of means is concentrated much more tightly around the mean than the original probability distribution.
Exercise: Using dnorm, plot the appropriate probability density function over the histogram of means. What should its mean and sd be?
Exercise: Repeat the simulation above but this time plot the histogram of the sample standard deviations. Do you think these follow a normal distribution?
Now that we have a way of generating samples of means, bootstrapping the CI for the mean is straightforward: for the 90% CI we just generate 1000 sample means and throw away the largest 5% and the smallest 5%; the boundaries are then our interval endpoints. Luckily, R has a built-in function to do this for us: quantile(means, c(.05,.95)) will return the 5% quantile and the 95% quantile.
M = 1000
N = 100
means = rep(0, M)
for (i in 1:M) {
x_samp = rnorm(N, m_sim, s_sim)
m_fit = mean(x_samp)
means[i] = m_fit
}
quantile(means, c(.05,.95))
## 5% 95%
## 148.3579 151.5599
This directly gives the bootstrap CI. Notice, however, that it will change across runs; you should always pick \(N\) and \(M\) large enough that your bootstrap CI's endpoints are stable from run to run.
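A minimal way to check stability is to wrap the bootstrap in a function and run it twice, comparing the endpoints (a quick sketch; the helper name boot_means is mine, not built into R):

boot_means = function(M, N) {
  means = rep(0, M)
  for (i in 1:M) {
    means[i] = mean(rnorm(N, m_sim, s_sim))
  }
  quantile(means, c(.05, .95))
}
boot_means(1000, 100)  # first run
boot_means(1000, 100)  # second run: endpoints should agree closely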
Exercise: What is the bootstrap 95% CI?
Exercise: Bootstrap the 95% CI for the fit sd from the last section.
We want to test the claim that many little random binary flips can add up to something that looks like a Gaussian distribution. Consider the following cartoon example: there are N genes that affect height. Each gene has a different probability of being active in a random person, i.e., neither the expected value of each gene's contribution to a person's overall height nor its variance is the same between genes. If we generate M random humans' heights under these assumptions, what will the distribution of heights look like?
Here’s one way we could simulate this: Each of those N genes has a (uniform) random probability between 0 and 1 of flipping on. We will generate those random probabilities into a fixed vector using the runif function:
N = 1000
gene_probs = runif(N)
head(gene_probs)
## [1] 0.1811334 0.1119085 0.9904869 0.1704202 0.9219167 0.7444811
Now, for a given person we generate a random number for each gene and compare it to the threshold:
N = 1000
gene_expression = runif(N)
genes_on = gene_probs>gene_expression
head(genes_on)
## [1] FALSE FALSE TRUE FALSE TRUE TRUE
Finally, we add up the number of TRUE values; these are the genes that successfully contribute to adding height to this person.
N = 1000
gene_expression = runif(N)
genes_on = gene_probs>gene_expression
sum(genes_on)
## [1] 497
Let's put these in a loop and run them for M = 10000 people:
N = 1000
M = 10000
gene_total = rep(0,M)
for (i in 1:M){
gene_expression = runif(N)
genes_on = gene_probs>gene_expression
gene_total[i] = sum(genes_on)
}
hist(gene_total,breaks=50)
This looks very Gaussian. The fact that small binary decisions can, in aggregate, add up to normally distributed phenomena gives us a strong reason to expect Gaussian noise in a wide variety of situations.
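To check the Gaussian claim visually, we can overlay a normal density whose mean and sd are estimated from the simulated totals themselves (a sketch using the gene_total vector from above):

hist(gene_total, breaks=50, prob=TRUE)
x = seq(min(gene_total), max(gene_total), length.out=200)
lines(x, dnorm(x, mean=mean(gene_total), sd=sd(gene_total)), col="red", lwd=2)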
We saw two different methods for fitting probability distributions: the method of moments and the maximum likelihood method. R has a built-in function to minimize the negative log-likelihood of a model; we just need to define a function that takes our parameters and returns a number, namely the negative log-likelihood of the data under a distribution with those parameters. The built-in function won't solve the problem analytically; it will use numerical methods like gradient descent.
require(stats4)
## Loading required package: stats4
LL <- function(mu, sigma) {
x = height
R = dnorm(x, mean=mu, sd=sigma,log=TRUE)
return(-sum(R))
}
mle(LL, start = list(mu = 1, sigma=1))
##
## Call:
## mle(minuslogl = LL, start = list(mu = 1, sigma = 1))
##
## Coefficients:
## mu sigma
## 183.9476 2496.4526
We immediately see a huge problem: the predicted mean is far from the real mean and the standard deviation is several orders of magnitude too large. What went wrong? The answer is that without reasonable starting conditions, automatic fitting can often find a local minimum that is far away from the global minimum. Let's try starting values a little closer to the real ones:
LL <- function(mu, sigma) {
x = height
R = dnorm(x, mean=mu, sd=sigma,log=TRUE)
return(-sum(R))
}
mle(LL, start = list(mu = 150, sigma=10))
## Warning in dnorm(x, mean = mu, sd = sigma, log = TRUE): NaNs produced
##
## Call:
## mle(minuslogl = LL, start = list(mu = 150, sigma = 10))
##
## Coefficients:
## mu sigma
## 154.747724 7.787552
This is a much more reasonable answer. (The NaN warning just means the optimizer tried a negative value of sigma at some point during the search.) Compare it with the answer above from the method of moments fit:
Mean: 154.7477, Standard Deviation: 7.799343
So maximum likelihood can work, but we need to be careful about our starting conditions.
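If you want to reuse the fitted values rather than just print them, you can store the fit and extract the coefficients; coef is the standard accessor for mle objects:

fit = mle(LL, start = list(mu = 150, sigma = 10))
coef(fit)            # named vector with entries mu and sigma
coef(fit)["sigma"]   # just the fitted standard deviation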
Now we can use maximum likelihood to fit more advanced models. Indeed, the mle function we used didn't even know that the model was a probability distribution; it just needed to be given a (negative log-)likelihood. Assume we have a simple additive model like
\[ y = f(x) + \epsilon\,,\hspace{3em} \epsilon \sim \mathcal{N}(\epsilon \,|\, 0, \sigma)\,, \] where \(f(x)\) is a deterministic function and \(\epsilon\) is a random variable accounting for noise, drawn from a normal distribution centered at 0 with a constant variance for all values of \(x\). The likelihood of observing \(y\) given that we are at \(x\) is
\[ \mathbb{P}(y\,|\,x) = \mathcal{N}(y \,|\, f(x), \sigma) \] That is, \(y\) is drawn from a normal distribution centered at \(f(x)\). We can use this to write the log-likelihood, and then, given a function \(f(x)\), we can try to fit both the function and the distribution to the data.
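Written out for data points \((x_i, y_i)\), the negative log-likelihood we will hand to the optimizer is
\[ -\log \mathcal{L} = -\sum_i \log \mathcal{N}(y_i \,|\, f(x_i), \sigma) = \sum_i \left[\frac{\big(y_i - f(x_i)\big)^2}{2\sigma^2} + \log\left(\sigma\sqrt{2\pi}\right)\right]\,, \]
which is exactly the quantity that summing dnorm(..., log=TRUE) and flipping the sign computes in the code below.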
Let's recall the shape of the full height-versus-age data:
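A quick way to reproduce the scatter plot, using the data frame loaded at the start of the section:

plot(data$age, data$height, xlab="Age", ylab="Height (cm)")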
I propose that the data follows an underlying Hill function: \[ f(x\,|\,a,b) = b\frac{x}{a+x}\,. \] We can see that for \(x\ll a\), \(f\) is approximately the linear function \(f\approx \frac{b}{a} x\), whereas for \(x\gg a\), \(f\approx b\). Let's fit this by minimizing the negative log-likelihood:
LL <- function(a, b, sigma) {
y = data$height
x = data$age
R = dnorm(y, mean=b*(x)/((a) + (x)), sd=sigma,log=TRUE)
return(-sum(R))
}
mle(LL, start = list(a = 2, b = 154.747724, sigma=7.78))
##
## Call:
## mle(minuslogl = LL, start = list(a = 2, b = 154.747724, sigma = 7.78))
##
## Coefficients:
## a b sigma
## 2.289411 162.315912 14.328473
How did I pick the starting parameters? Well, \(b\) is the large-\(x\) limit of \(f\), so we would expect \(b \approx \mu\) from before. Similarly, we calculated \(\sigma\) from the heights of the adults, so it's reasonable to use that as a starting value.
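To judge the fit by eye, we can overlay the fitted Hill curve on the raw data (a sketch reusing the coef pattern from earlier; the plotting range is my choice):

fit = mle(LL, start = list(a = 2, b = 154.747724, sigma = 7.78))
p = coef(fit)
plot(data$age, data$height, xlab="Age", ylab="Height (cm)")
x = seq(0, 90, by=.5)
lines(x, p["b"]*x/(p["a"] + x), col="red", lwd=2)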
Question: The model above doesn’t quite fit the data as tightly as we would like. What can be done to improve it?