# one-time setup: install the helper packages from GitHub
library(devtools)
devtools::install_github("schmettow/mascutils")
devtools::install_github("schmettow/bayr")

# packages used throughout this case study
library(tidyverse)
library(rstanarm)
library(bayr)
Data in social science is extremely noisy, which requires inferential statistics to separate the signal from the noise. For almost a century, social science researchers have relied on the classic frequentist framework of null hypothesis significance testing (NHST) to decide whether the outcome of a study supports their assumptions. This has failed dramatically.
A promising approach to replace NHST is the Bayesian framework. It is surprisingly intuitive and extremely versatile. For applied researchers, Bayesian statistics connects seamlessly with rational decision making.
During the presentation, I will run you through a synthetic case study that, I assume, fits your field of expertise. My aim is to illustrate a Bayesian statistical workflow.
The new PISA study is out. The results have been carefully scrutinized by the national education administration (NEA), which came up with two crucial observations:
In a certain region, the regional government has purposefully fostered a modern teaching style for years, e.g., “the teacher as a learning companion”. The results are somewhat better in that region. However, there is no unambiguous causal relation between school type and student performance.
There seems to be a disadvantage for children from poor families. However, this effect was too uncertain for confident decision making. (Any action would be very costly.)
Additional research is required, and it boils down to two questions: Does the modern teaching style really produce better scores? And how strong is the disadvantage for students from low-income families?
We gather data:
# make the Schools data set available
attach(Schools)
| Obs | Stud | income | School | Type | score |
|---|---|---|---|---|---|
| 1 | 1 | 0.456 | 1 | A | 80.8 |
| 7 | 7 | 0.139 | 9 | A | 82.1 |
| 43 | 43 | 0.464 | 1 | A | 88.9 |
| 174 | 174 | 1.306 | 10 | B | 74.2 |
| 150 | 150 | -0.647 | 6 | B | 87.0 |
| 154 | 154 | 0.527 | 8 | B | 77.8 |
We approach the first question in classic style: we run a t-test on the difference in average score between the two types of school.
\(H_0:\) The two school types have the same mean score.
\(H_1:\) The two school types differ in mean score.
\(H_0\) cannot be rejected (\(p = 0.141\)).
0.141 is the probability of data at least as extreme as this arising when, in fact, the school types are equal (\(H_0\)).
This is strictly the definition. Does it sound helpful? Note that this only makes a statement about \(H_0\). Is that what you want? No? Wouldn’t it be nice to say something like:
With a probability of 97.5%, the true difference between school types is at least x?
What is a statistic? A statistic is a single number that summarizes (or represents) a data set. For example, the mean is a statistic and summarizes the overall magnitude of a variable. By saying “the average Dutchman is 1.86m tall”, the whole sample is summarized into one number. Statistics can express other properties of data, too, such as dispersion or association.
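A minimal sketch in R, with made-up numbers:

# a made-up sample of body heights (in meters)
height <- c(1.79, 1.92, 1.88, 1.81, 1.90)
mean(height) # magnitude: the average height
sd(height)   # dispersion: how much individuals vary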
In our studies we can never be certain about a statistic, be it due to sampling error or measurement error. An inferential statistic encapsulates the statistic with a statement of uncertainty. The best-known statement of uncertainty in classic statistics is the confidence interval.
An inferential statistic is a summary score on data that comes with a statement of certainty.
For example: \([0.1, 0.4]\)
“If we were to repeat the experiment over and over, then 95% of the time the confidence intervals contain the true mean.”
We run a simple group mean regression model:
# Bayesian regression with school type as the only predictor
M_1 <- stan_glm(score ~ Type,
                data = D_1)
and get these coefficients:
# extract the coefficient table
T_1 <- fixef(M_1)
T_1
| fixef | center | lower | upper |
|---|---|---|---|
| Intercept | 73.32 | 70.78 | 75.82 |
| TypeB | 2.58 | -0.76 | 5.79 |
Well, how do we read this table? What are the fixed effects? First of all, don’t contemplate too long on the term fixed. It means nothing and is just a label. It may help to imagine it being called “fried effect” or “filos effect”.
We start by looking at the column center, which reflects the overall magnitude of the coefficients. The intercept is the average score in type A schools. The TypeB effect is the difference of type B relative to A. Add it to the intercept and you get the average score of type B schools, as the quick check below shows.
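A minimal sketch of that arithmetic, using the center estimates from the table above:

# a sketch: recovering the group means from the coefficients
73.32        # average score, type A (the intercept)
73.32 + 2.58 # average score, type B (intercept plus TypeB effect)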
Per coefficient, we get the center estimate, which is a best guess. It marks the most likely region for the true effect. We make our magnitude statement:
Overall, the difference is in the region of \(2.58\)
Lower and upper mark the region where the true effect lies with a certainty of 95%. In Bayesian statistics this is called a credibility interval. Ironically (and fortunately), it is precisely what many people falsely assume the frequentist confidence interval to be. It can always be understood as a bet: if someone puts EUR 5 on the true value being outside that range (smaller or larger), you hold EUR 95 against it. We make our certainty statement:
With 95% certainty, the true value is in the interval \([-0.76, 5.79]_{CI95}\)
Keep in mind that the 95% is a conventional number, but not a magic one (dark magic, maybe). It fully depends on what is at stake and what attitude one takes towards the risk of being wrong. When overhauling the education system of a nation, you probably want to be highly certain that what you are doing has the desired effect. In contrast, a young entrepreneur is probably much less risk-averse. With Bayesian statistics, you can shape the credibility limits to your needs, as the sketch below shows.
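A minimal sketch, using rstanarm’s posterior_interval function on the model above:

# a sketch: credibility limits at any level you need,
# e.g. 80% instead of the conventional 95%
posterior_interval(M_1, prob = 0.80)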
Where do these intervals come from? In classic statistics, some arcane asymptotic rules (i.e., higher math) are used to compute them, and you only get these two points, lower and upper.
In Bayesian statistics, the posterior distribution holds complete gradual information on certainty.
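A minimal sketch of this in rstanarm: the model object carries the posterior draws, and the credibility limits are simply their quantiles.

# a sketch: credibility limits as quantiles of the posterior draws
P_1 <- as.data.frame(M_1)            # one row per posterior draw
quantile(P_1$TypeB, c(.025, .975))   # reproduces lower and upper for TypeB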
With the posterior distribution, we can now ask what the probability is that the difference is at least 2.
This probability is exactly the area of the distribution inside the green rectangle. By simply counting draws on the posterior, we get precisely:
With 64.375% certainty, type B schools have an advantage of at least 2 score points.
That is not enough certainty to take action.
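A minimal sketch of that counting, again on the posterior draws:

# a sketch: share of posterior draws where the TypeB effect is at least 2
P_1 <- as.data.frame(M_1)
mean(P_1$TypeB >= 2) # proportion of draws, i.e. certainty of an advantage of 2+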
Now we tackle the question of economic status and student performance, using a linear regression model:
# Bayesian regression with income as a metric predictor
M_2 <- stan_glm(score ~ income,
                data = D_1)
and get these coefficients:
T_2 <- fixef(M_2)
T_2
| fixef | center | lower | upper |
|---|---|---|---|
| Intercept | 74.70 | 73.20 | 76.21 |
| income | 2.97 | 1.50 | 4.49 |
The previous model made a statement about the average student. Do we care about the average student? No. We care about a fair society, and that means breaking with heritage and socialization. Routinely, educational researchers ask for household income, which we use here as a proxy for social stratum.
We explore the data graphically (a sketch of such a plot follows below): in both school types, students with higher income get better scores. However, the effect is much weaker in B. The crossover point lies slightly above average income, which is why the simple group comparison showed no clear advantage for B. Below the crossover, a student has poorer chances in A.
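A minimal sketch of the exploratory plot with ggplot2 (loaded with the tidyverse above):

# a sketch: score over income, with separate regression lines per school type
D_1 %>%
  ggplot(aes(x = income, y = score, color = Type)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)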
We run a model that contains both predictors, income and type of education. The term income:Type is an interaction effect; we will see in a minute what it means.
# Bayesian regression with both predictors and their interaction
M_3 <-
  stan_glm(score ~ income + Type + income:Type,
           data = D_1)
We extract the effects:
T_3 <- fixef(M_3)
T_3
| fixef | center | lower | upper |
|---|---|---|---|
| Intercept | 73.44 | 70.94 | 75.79 |
| income | 5.04 | 2.72 | 7.30 |
| TypeB | 2.37 | -0.77 | 5.54 |
| income:TypeB | -3.68 | -6.64 | -0.47 |
This needs explanation; we start by just looking at magnitude (center):
income:TypeB is an interaction effect. Look at the graph once again: the slopes of the two groups differ, and B has the shallower slope. This difference in slopes is precisely the interaction effect. To get the absolute slope in group B, add the interaction effect to the main effect, as the quick check below shows.
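A one-line sketch of that arithmetic, using the center estimates from the table:

# a sketch: absolute income slope in group B = main effect + interaction
5.04 + (-3.68) # = 1.36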
A statistic is a single number that somehow summarizes a bunch of numbers. The average is a statistic that represents overall magnitude.
An inferential statistic is a single number that summarizes a bunch of numbers and comes with an uncertainty statement. We have seen credibility intervals as one way to reflect uncertainty.
In applied research, the magnitude as well as the uncertainty matter. (Experimental social science is utterly obsessed with certainty; they are the real non-quants.)
And this is how it goes:
The average score of average-income students in type A schools is \(73.44\). This estimate is quite precise: we can be 95% certain that the true value lies within the interval \([70.94, 75.79]_{CI95}\).
The income effect in group A is remarkable: per unit on the income scale, the score increases by about 5 points, \(5.04 [2.72, 7.30]_{CI95}\). The effect is almost certainly positive, but there is considerable uncertainty about its real magnitude.
The school type effect reflects the advantage average-income students have in B. The effect is small compared to the income effect, and there is considerable uncertainty in both directions: we can neither safely deny that the effect is relevant, nor assume it to be strong.
The interaction effect indicates that the income effect in B is weaker than in A. It seems to be reduced by more than half, which is really good news. However, we must be careful with conclusions, as the uncertainty is immense: there is quite a chance that the reduction is much smaller than that.
In other words: you have to hit the field again and gather more data. Uncertainty means lack of information, and gathering more data adds information.
And that is another thing you can do with Bayesian statistics, but not with p-values: if you are too uncertain, just gather more data until you are satisfied (or run out of resources).
Yes, you read that correctly: in the frequentist framework of null hypothesis significance testing, this is not allowed! No “adaptive testing”! You may fully replicate your study, but you may not simply stack the two data sets. A pretty stupid rule, isn’t it?
Let’s gamble! (Gambling is a great metaphor for understanding Bayesian concepts.)
- You meet a female German student at a party. What is she studying?
- An unknown person enters the room. What is her IQ? The closest guess wins.
- You try to replicate the Stroop effect and fail. Are you going to write a paper titled “The Stroop effect debunked”?
- A student takes a short version of an IQ test and scores 165. How certain are you that this is real?
- A student takes a short version of an IQ test and scores 105. How certain are you that this is real?
Exercise: what effects in school research are solidly known?
Using prior knowledge to come to conclusions is natural. With Bayesian statistics, you can incorporate prior knowledge into your models.
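A minimal sketch of how that looks in rstanarm; the prior values here are invented for illustration, not recommendations:

# a sketch: an informative prior on the slope coefficient,
# e.g. if earlier studies suggest a moderate positive income effect
M_4 <- stan_glm(score ~ income,
                prior = normal(location = 3, scale = 1),
                data = D_1)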
You are the manager of the follow-up study to confirm B’s reduction of inequality. You collect another few hundred observations using the same protocol. The data set D_2 contains the new data stacked onto the original observations. Analyze it in much the same way.
| Obs | Stud | income | School | Type | score |
|---|---|---|---|---|---|
| 4 | 4 | -1.140 | 9 | A | 62.5 |
| 308 | 308 | 0.408 | 6 | B | 81.2 |
| 78 | 78 | 0.089 | 5 | B | 78.2 |
| 51 | 51 | -1.399 | 4 | A | 71.0 |
| 311 | 311 | 0.702 | 9 | A | 63.2 |
| 320 | 320 | -0.105 | 5 | B | 81.7 |