Here follows a number of data analytic questions. Use JAGS and R to build models that probe these questions. The JAGS documentation can be found here: http://www.stats.ox.ac.uk/~nicholls/MScMCMC15/jags_user_manual.pdf
Below is a code scaffold you can copy-n-paste into R. Right now the scaffold contains a simple model for two binomial rates, but this should be replaced with a model that matches the relevant questions.
→ Read through the code to see if you can figure out what does what and then run it to make sure it works. It should print out some statistics and some pretty graphs.
library(rjags)
# The model specification
model_string <- "model{
for(i in 1:length(y1) ) {
y1[i] ~ dbern(theta1)
}
for(i in 1:length(y2) ) {
y2[i] ~ dbern(theta2)
}
theta_diff <- theta2 - theta1
# Priors
theta1 ~ dbeta(1, 1)
theta2 ~ dbeta(1, 1)
}"
data_list <- list(
y1 =c(0, 1, 0, 0, 0, 0, 1, 0, 0, 0),
y2 = c(0, 0, 1, 1, 1, 0, 1, 1, 1, 0)
)
params_to_monitor <- c("theta1", "theta2", "theta_diff")
# Running the model
model <- jags.model( textConnection(model_string), data_list, n.chains = 3, n.adapt= 1000)
update(model, 1000); # Burning 1000 samples to the MCMC gods...
mcmc_samples <- coda.samples(model, params_to_monitor, n.iter=5000)
# Inspect the results
plot(mcmc_samples)
summary(mcmc_samples)
Hint: dbern
is the Bernoulli distribution which is the special case of the binomial distribution when there is just one trial (dbern(x) == dbinom(x, 1)
). That is, if the data is coded as 4 successes out of 6 (x = 4, n = 6) it would be most convenient to use a binomial distribution. If the data is coded like c(1, 1, 1, 1, 0, 0) it would be more convenient to use a Bernoulli distribution. The result would in any case be the same.
To inspect and manipulate samples from individual parameters it is useful to convert the mcmc “object” into a simple matrix which gets one column per parameter:
s <- as.matrix(mcmc_samples)
head(s)
## theta1 theta2 theta_diff
## [1,] 0.22025 0.4737 0.2534
## [2,] 0.15750 0.4735 0.3160
## [3,] 0.25720 0.5568 0.2996
## [4,] 0.17944 0.4204 0.2410
## [5,] 0.12313 0.7586 0.6355
## [6,] 0.09005 0.3529 0.2628
This is useful as you can, for example, plot and compare the individual parameters.
# The probability that the rate theta1 is smaller than theta2
mean( s[, "theta1"] < s[, "theta2"] )
## [1] 0.9617
# The distribution of the difference between theta1 and theta2
hist(s[, "theta_diff"])
→ Calculate the probability that the difference between the two underlying rates is smaller than 0.2.
Hint: abs(x - y)
calculates the absolute difference between x and y.
Farmer Jöns has a huge number of cows. Earlier this year he ran an experiment where he gave 10 cows medicine A and 10 medicine B and then measured whether they got sick (0) or not (1) during the summer season. Here is the resulting data:
cowA <- c(0, 1, 0, 0, 0, 0, 1, 0, 0, 0)
cowB <- c(0, 0, 1, 1, 1, 0, 1, 1, 1, 0)
→ Jöns now wants to know: How effective are the drugs? What is the evidence that medicine A is better or worse than medicine B?
Farmer Jöns has a huge number of cows. Earlier this year he ran an experiment where he gave 10 cows a special diet that he had heard could make them produce more milk. He recorded the number of liters of milk from these “diet” cows and from 15 “normal” cows during one month. This is the data:
diet_milk <- c(651, 679, 374, 601, 401, 609, 767, 709, 704, 679)
normal_milk <- c(798, 1139, 529, 609, 553, 743, 151, 544, 488, 555, 257, 692, 678, 675, 538)
→ Jöns now wants to know: Was the diet any good, does it results in better milk production?
Hint: To model this you might find it useful to use the Normal distribution which is called dnorm
in JAGS. A statement using dnorm
could look like:
for(i in 1:length(y) ) {
y[i] ~ dnorm(mu, tau)
}
A tricky thing here is that dnorm
is parameterized with precision (tau
in the above statement) rather than the standard deviation . The relation between precision (\(\tau\)) and standard deviation (\(\sigma\)) is simple: \(\tau = 1/\sigma^2\) . So to be able to use standard deviation in JAGS you can rewrite the statement above as:
for(i in 1:length(y) ) {
y[i] ~ dnorm(mu, 1 / pow( sigma, 2) )
}
Farmer Jöns has a huge number of cows. Due to a recent radioactive leak in a nearby power plant he fears that some of them have become mutant cows. Jöns is interested in measuring the effectiveness of a diet on normal cows, but not on mutant cows (that might produce excessive amounts of milk, or nearly no milk at all!). The following data set contains the amount of milk for cows on a diet and cows on normal diet:
diet_milk <- c(651, 679, 374, 601, 4000, 401, 609, 767, 3890, 704, 679)
normal_milk <- c(798, 1139, 529, 609, 553, 743, 3,151, 544, 488, 15, 257, 692, 678, 675, 538)
Some of the data points might come from mutant cows (aka outliers).
→ Jöns now wants to know: Was the diet any good, does it results in better milk production for non-mutant cows?
Tip: Basically we have an outlier problem. A conventional trick in this situation is to supplement the normal distribution for a distribution with wider tails that is more sensitive to the central values and disregards the far away values (this is a little bit like trimming away some amount of the data on the left and on the right). A good choice for such a distribution is the t distribution which is like the normal but with a third parameter called the degrees of freedom. The lower this parameter the wider the tails, when this parameter is larger than about 50 the t is practically the same as the normal. A good choice for the problem with the mutant cows would be to use a t distribution with around 3 degrees of freedom:
for(i in 1:length(y) ) {
y[i] ~ dt(mu, 1 / pow( sigma, 2), 3)
}
Farmer Jöns has a huge number of cows. He also has chickens. He tries different diets on them too with the hope that they will produce more eggs. Below is the number of eggs produced in one week by chickens on a diet and chickens eating normal chicken stuff:
diet_eggs <- c(6, 4, 2, NA, 4, 3, 0, 4, 0, 6, 3)
normal_eggs <- c(4, 2, 1, 1, 2, 1, 2, 1, NA, 2, 1)
→ Jöns now wants to know: Was the diet any good, does it result in the chickens producing more eggs?
Hint: The Poisson distribution is a discrete distribution that is often a reasonable choice when one wants to model count data (like, for example, counts of eggs). The Poisson has one parameter \(\lambda\) which stands for the mean count. In JAGS you would use the Poisson like this:
for(i in 1:length(y) ) {
y[i] ~ dpois(lambda)
}
It’s often common to have all data in a data frame / table. Copy-n-paste the following into R and inspect the resulting data frame d
.
d <-
structure(list(milk = c(651, 679, 374, 601, 401, 609, 767, 709,
704, 679, 798, 1139, 529, 609, 553, 743, 151, 544, 488, 555,
257, 692, 678, 675, 538), group = c(1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)), .Names = c("milk",
"group"), row.names = c(NA, -25L), class = "data.frame")
Looking at d
you should see that it contains the same data as in exercise (4) but coded with one cow per row (The mutant cows were perhaps just a dream…). The diet group is coded as a 1 and the normal group is coded as a 2. This data could be read into JAGS by using the following data list:
data_list <- list(
y = d$milk,
x = d$group
)
→ Modify the model from (4) to work with this data format instead.
Hint: In your JAGS code you can loop over the group variable and use it to pick out the relevant parameters like this:
for(i in 1:length(y)) {
y[i] ~ dnorm( mu[ x[i] ], 1 / pow(sigma[ x[i]], 2) )
}
mu[1] ~ dunif(0, 100000)
mu[2] ~ dunif(0, 100000)
AKA indexception; you use an index (i
) to pick out an index (x[i]
) used to pick out a parameter (mu[x[i]]
).
Farmer Jöns has a huge number of cows. He also has a huge number of different diets he wants to try. In addition to the diet he already tried, he tries another diet (let’s call it diet 2) on 10 more cows. Copy-n-paste the following into R and inspect the resulting data frame d
.
d <-
structure(list(milk = c(651, 679, 374, 601, 401, 609, 767, 709,
704, 679, 798, 1139, 529, 609, 553, 743, 151, 544, 488, 555,
257, 692, 678, 675, 538, 1061, 721, 595, 784, 877, 562, 800,
684, 741, 516), group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3)), .Names = c("milk", "group"), row.names = c(NA, 35L), class = "data.frame")
It contains the same data as in the last exercise but with 10 added rows for diet 2 which is coded with group = 3.
→ Now Jöns now wants to know: Which diet seems best, if any? How much more milk should he be expecting to produce using the best diet compared to the others?
Hint: It is possible to use a loop when constructing priors too. In that way you don’t have to change the model when adding more groups, e.g:
for(i in 1:length(y)) {
y[i] ~ dnorm( mu[ x[i] ], 1 / pow(sigma[ x[i]], 2) )
}
for(i in 1:max(x)) {
mu[i] ~ dunif(0, 100000)
}
Farmer Jöns has a huge number of cows. He is wondering whether the amount of time a cow spends outside in the sunshine affects how much milk she produces. To test this he makes a controlled experiment where he picks out 20 cows and assigns each a number of hours she should spend outside each day. The experiment runs for a month and Jöns records the number of liters of milk each cow produces. Copy-n-paste the following into R and inspect the resulting data frame d
.
d <-
structure(list(milk = c(685, 691, 476, 1151, 879, 725,
1190, 1107, 809, 539, 298, 805, 820, 498, 1026, 1217, 1177, 684,
1061, 834), hours = c(3, 7, 6, 10, 6, 5, 10, 11, 9, 3, 6, 6,
3, 5, 8, 11, 12, 9, 5, 5)), .Names = c("milk", "hours"
), row.names = c(NA, -20L), class = "data.frame")
→ Using this data on hours of sunshine and resulting liters of milk Jöns wants to know: Does sunshine affect milk production positively or negatively?
Hint 1: A model probing the question above requires quite small changes to the model you developed in (8).
Hint 2: You do remember the equation for the (linear regression) line? mu <- beta0 + beta1 * x
I don’t count on anybody getting this far during the tutorial, but if you have: Congratulations! Here follows three pro-level exercises that will take you through multilevel regression, hierarchical linear models (sometimes called Mixed-effects models), and psychophysical modeling using logistic regression (yeah!).
Farmer Jöns has a huge number of cows and he knows that all cows are not the same. Therefore he wants to test if it could be the case that some cows produce more milk when allowed to be outside in the sun while for other cows sunshine might not matter that much for the milk production. He therefore selects four cows and for each cow varies the number of hours she spends outside each day, changing this number each month. The experiment runs for eight months yielding eight datapoints per cow. Copy-n-paste the following into R and inspect the resulting data frame d
.
d <-
structure(list(cow = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4), hours = c(2,
3, 5, 6, 8, 9, 11, 12, 2, 3, 5, 6, 8, 9, 11, 12, 2, 3, 5, 6,
8, 9, 11, 12, 2, 3, 5, 6, 8, 9, 11, 12), milk = c(672, 1013,
808, 486, 648, 720, 1100, 950, 746, 721, 654, 1156, 964, 505,
1104, 903, 560, 817, 829, 975, 1169, 1044, 1722, 1672, 457, 977,
896, 794, 1116, 1155, 1228, 1243)), .Names = c("cow", "hours",
"milk"), class = "data.frame", row.names = c(NA, -32L))
→ Jöns wants to know: Does it seem like the four cows react in the same way to increased amounts of sunshine? Or is sunshine more beneficial to some cows (or rather, to some cows’ milk production) than other?
Hint: This is a straight forward extension of the model from the last exercise (9). You just need to add some more loop magic similar to that in exercise (8).
Farmer Jöns has a huge number of cows and unlimited time to run experiments on them it seems… Jöns continued the study from last exercise (10) and collected data on ten new cows. Copy-n-paste the following into R and inspect the resulting data frame d
.
d <-
structure(list(cow = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5,
5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7,
7, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10,
10, 10, 10, 10, 10), hours = c(2, 3, 5, 6, 8, 9, 11, 12, 2, 3,
5, 6, 8, 9, 11, 12, 2, 3, 5, 6, 8, 9, 11, 12, 2, 3, 5, 6, 8,
9, 11, 12, 2, 3, 5, 6, 8, 9, 11, 12, 2, 3, 5, 6, 8, 9, 11, 12,
2, 3, 5, 6, 8, 9, 11, 12, 2, 3, 5, 6, 8, 9, 11, 12, 2, 3, 5,
6, 8, 9, 11, 12, 2, 3, 5, 6, 8, 9, 11, 12), milk = c(891, 742,
796, 761, 674, 1166, 955, 485, 806, 605, 552, 755, 660, 752,
839, 660, 941, 891, 806, 1371, 1379, 1322, 1733, 1817, 849, 864,
921, 840, 876, 903, 924, 1064, 435, 1165, 1061, 639, 870, 902,
1239, 1110, 834, 869, 1049, 1422, 1286, 1726, 1296, 1814, 732,
805, 945, 823, 964, 881, 978, 1307, 694, 617, 795, 1022, 518,
157, 824, 483, 501, 863, 640, 472, 791, 747, 814, 910, 579, 809,
689, 826, 1032, 927, 828, 1149)), .Names = c("cow", "hours",
"milk"), class = "data.frame", row.names = c(NA, -80L))
And while you are at it, why not take a look at the data by running the following command (assuming you have the ggplot2 package installed.)
library(ggplot2)
qplot(hours, milk, facets = ~ cow, geom="line", data=d, ylim=c(0, 2000))
→ Jöns wants to know: “While I just measured ten cows, what would be an estimate of the effect of sunshine on all my cows and roughly how much would the effect of sunshine vary from cow to cow?”
Hint: What Jöns is asking you to do is not just estimate the distribution of the data but in addition estimate the distribution of the regression parameters for the effect of sunshine on the different cows. While you might not know much about how this parameter might be distributed a reasonable starting point would be to assume a normal distribution. Some cows are more positively affected by sunshine and some cows are more goth…
Fun fact: The model you’ve most probably will developed will be identical to that resulting from the following call to lmer
.
fit <- lmer(milk ~ 1 + hours + (1 + hours | cow), data=d)
summary(fit)
The results will also be identical iff you use the same priors as ´lmer´ (within approximation error).
Farmer Jöns has a huge number of cows. He also has a huge interest in the light intensity discrimination abilities of cows. He therefore performs an experiment on one of his favorite cows, Blenda. In 90 trials Jöns shows Blenda two led-lights, one that always has the same intensity (the “standard” at an intensity of 10) and one with different intensity in different trials (the “test” ranging from 6-14). In each trial Blenda has to indicate in a 2-forced-choice-task which light source she experiences as the most bright. Copy-n-paste the following into R and inspect the resulting data frame d
.
d <-
structure(list(test_intensity = c(6, 6, 6, 6, 6, 6, 6, 6, 6,
6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10,
10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12,
12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
14, 14, 14, 14, 14, 14, 14, 14, 14, 14), choice = c(0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 0)), .Names = c("test_intensity", "choice"), row.names = c(NA,
-90L), class = "data.frame")
In d
test_intensity
shows the intensity of the test stimulus and choice
show the choice of Blenda where the choice of the standard is coded as 0 and the choice of the test is coded as 1.
→ Help Jöns fit a psychometric function to the choice data!
Hint: If \(\theta\) is a probability then the log odds \(x\) of theta is defined as:
\[x = log(\frac{\theta}{1 - \theta}) \]
The conversion the other way (which is probably what you want in your JAGS model) is:
\[\theta = \frac{1}{1 + exp(-x)} \]