# one-time setup: install the helper packages from GitHub
library(devtools)
devtools::install_github("schmettow/mascutils")
devtools::install_github("schmettow/bayr")

# packages used throughout this case study
library(tidyverse)
library(rstanarm)
library(bayr)
Data in social science is extremely noisy, which requires inferential statistics to separate the signal from the noise. For almost a century, social science researchers have relied on the classic frequentist framework of null hypothesis significance testing (NHST) to decide whether the outcome of a study supports their assumptions. This has failed dramatically.
A promising approach to replace NHST is the Bayesian framework. It is surprisingly intuitive and extremely versatile. For applied researchers, Bayesian statistics connects seamlessly with rational decision making.
During the presentation, I will run you through a synthetic case study that, I assume, fits your field of expertise. My aim is to illustrate a Bayesian statistical workflow.
The new PISA study is out. The results have been carefully scrutinized by the national education administration (NEA), which came up with two crucial observations:
In a certain region, the regional government has purposefully fostered a modern teaching style for years, e.g., “the teacher as a learning companion”. The results are somewhat better in that region. However, there is no unambiguous causal relation between school type and student performance.
There seems to be a disadvantage for children from poor families. However, this effect was too uncertain for confident decision making. (Any action would be very costly.)
Additional research is required, and it boils down to two questions: Does the modern teaching style really produce better scores? And how strong is the disadvantage for students from low-income families?
We gather data:
# make the Schools data set available
attach(Schools)
| Obs | Stud | income | School | Type | score |
|---|---|---|---|---|---|
| 1 | 1 | 0.456 | 1 | A | 80.8 |
| 7 | 7 | 0.139 | 9 | A | 82.1 |
| 43 | 43 | 0.464 | 1 | A | 88.9 |
| 174 | 174 | 1.306 | 10 | B | 74.2 |
| 150 | 150 | -0.647 | 6 | B | 87.0 |
| 154 | 154 | 0.527 | 8 | B | 77.8 |
We approach the first question in classic style: we run a t-test on the difference in average score between the two types of school.
\(H_0:\) The two school types have the same mean score.
\(H_1:\) The two school types differ in mean score.
\(H_0\) cannot be rejected (\(p = 0.141\)).
0.141 is the probability of data at least as extreme as this arising when, in fact, the school types are equal (\(H_0\)).
This is strictly the definition. Does it sound helpful? Note that this only makes a statement about \(H_0\). Is that what you want? No? Wouldn’t it be nice to say something like:
With a probability of 97.5%, the true difference between school types is at least x?
What is a statistic? A statistic is a single number that summarizes (or represents) a data set. For example, the mean is a statistic and summarizes the overall magnitude of a variable. By saying “the average Dutchman is 1.86m tall”, the whole sample is summarized into one number. Statistics can express other properties of data, too, such as dispersion or association.
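A minimal sketch in R, with made-up numbers:

# a made-up sample of body heights (in meters)
height <- c(1.79, 1.92, 1.88, 1.81, 1.90)
mean(height) # magnitude: the average height
sd(height)   # dispersion: how much individuals vary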
In our studies we can never be certain about a statistic, be it due to sampling error or measurement error. An inferential statistic encapsulates the statistic with a statement of uncertainty. The best-known statement of uncertainty in classic statistics is the confidence interval.
An inferential statistic is a summary score on data that comes with a statement of certainty.
For example: \([0.1, 0.4]\)
“If we were to repeat the experiment over and over, then 95% of the time the confidence intervals contain the true mean.”
We run a simple group mean regression model:
# Bayesian regression with school type as the only predictor
M_1 <- stan_glm(score ~ Type,
                data = D_1)
and get these coefficients:
# extract the coefficient table
T_1 <- fixef(M_1)
T_1
| fixef | center | lower | upper |
|---|---|---|---|
| Intercept | 73.32 | 70.78 | 75.82 |
| TypeB | 2.58 | -0.76 | 5.79 |
Well, how do we read this table? What are the fixed effects? First of all, don’t contemplate too long on the term fixed. It means nothing and is just a label. It may help to imagine it being called “fried effect” or “filos effect”.
We start by looking at the column center, which reflects the overall magnitude of the coefficients. The intercept is the average score in type A schools. The TypeB effect is the difference of type B relative to A. Add it to the intercept and you get the average score of type B schools, as the quick check below shows.
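A minimal sketch of that arithmetic, using the center estimates from the table above:

# a sketch: recovering the group means from the coefficients
73.32        # average score, type A (the intercept)
73.32 + 2.58 # average score, type B (intercept plus TypeB effect)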
Per coefficient, we get the center estimate, which is a best guess. It marks the most likely region for the true effect. We make our magnitude statement:
Overall, the difference is in the region of \(2.58\)
Lower and upper mark the region where the true effect lies with a certainty of 95%. In Bayesian statistics this is called a credibility interval. Ironically (and fortunately), it is precisely what many people falsely assume the frequentist confidence interval to be. It can always be understood as a bet: if someone puts EUR 5 on the true value being outside that range (smaller or larger), you hold EUR 95 against it. We make our certainty statement:
With 95% certainty, the true value is in the interval \([-0.76, 5.79]_{CI95}\)
Keep in mind that the 95% is a conventional number, but not a magic one (dark magic, maybe). It fully depends on what is at stake and what attitude one takes towards the risk of being wrong. When overhauling the education system of a nation, you probably want to be highly certain that what you are doing has the desired effect. In contrast, a young entrepreneur is probably much less risk-averse. With Bayesian statistics, you can shape the credibility limits to your needs, as the sketch below shows.
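A minimal sketch, using rstanarm’s posterior_interval function on the model above:

# a sketch: credibility limits at any level you need,
# e.g. 80% instead of the conventional 95%
posterior_interval(M_1, prob = 0.80)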
Where do these intervals come from? In classic statistics, some arcane asymptotic rules (i.e., higher math) are used to compute them, and you only get these two points, lower and upper.
In Bayesian statistics, the posterior distribution holds complete gradual information on certainty.
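A minimal sketch of this in rstanarm: the model object carries the posterior draws, and the credibility limits are simply their quantiles.

# a sketch: credibility limits as quantiles of the posterior draws
P_1 <- as.data.frame(M_1)            # one row per posterior draw
quantile(P_1$TypeB, c(.025, .975))   # reproduces lower and upper for TypeB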
With the posterior distribution, we can now ask what the probability is that the difference is at least 2.
This probability is exactly the area of the distribution inside the green rectangle. By simply counting draws on the posterior, we get precisely:
With 64.375% certainty, type B schools have an advantage of at least 2 score points.
That is not enough certainty to take action.
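A minimal sketch of that counting, again on the posterior draws:

# a sketch: share of posterior draws where the TypeB effect is at least 2
P_1 <- as.data.frame(M_1)
mean(P_1$TypeB >= 2) # proportion of draws, i.e. certainty of an advantage of 2+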
Now we tackle the question of economic status and student performance, using a linear regression model:
# Bayesian regression with income as a metric predictor
M_2 <- stan_glm(score ~ income,
                data = D_1)
and get these coefficients:
T_2 <- fixef(M_2)
T_2
| fixef | center | lower | upper |
|---|---|---|---|
| Intercept | 74.70 | 73.20 | 76.21 |
| income | 2.97 | 1.50 | 4.49 |
The previous model made a statement about the average student. Do we care about the average student? No. We care about a fair society, and that means breaking with heritage and socialization. Routinely, educational researchers ask for household income, which we use here as a proxy for social stratum.
We explore the data graphically (a sketch of such a plot follows below): in both school types, students with higher income get better scores. However, the effect is much weaker in B. The crossover point lies slightly above average income, which is why the simple group comparison showed no clear advantage for B. Below the crossover, a student has poorer chances in A.
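A minimal sketch of the exploratory plot with ggplot2 (loaded with the tidyverse above):

# a sketch: score over income, with separate regression lines per school type
D_1 %>%
  ggplot(aes(x = income, y = score, color = Type)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)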
We run a model that contains both predictors, income and type of education. The term income:Type is an interaction effect; we will see in a minute what it means.
# Bayesian regression with both predictors and their interaction
M_3 <-
  stan_glm(score ~ income + Type + income:Type,
           data = D_1)
We extract the effects:
T_3 <- fixef(M_3)
T_3
| fixef | center | lower | upper |
|---|---|---|---|
| Intercept | 73.44 | 70.94 | 75.79 |
| income | 5.04 | 2.72 | 7.30 |
| TypeB | 2.37 | -0.77 | 5.54 |
| income:TypeB | -3.68 | -6.64 | -0.47 |
This needs explanation; we start by just looking at magnitude (center):
income:TypeB is an interaction effect. Look at the graph once again: the slopes of the two groups differ, and B has the shallower slope. This difference in slopes is precisely the interaction effect. To get the absolute slope in group B, add the interaction effect to the main effect, as the quick check below shows.
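A one-line sketch of that arithmetic, using the center estimates from the table:

# a sketch: absolute income slope in group B = main effect + interaction
5.04 + (-3.68) # = 1.36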
A statistic is a single number that somehow summarizes a bunch of numbers. The average is a statistic that represents overall magnitude.
An inferential statistic is a single number that summarizes a bunch of numbers and comes with an uncertainty statement. We have seen credibility intervals as one way to reflect uncertainty.
In applied research, the magnitude as well as the uncertainty matter. (Experimental social science is utterly obsessed with certainty; they are the real non-quants.)
And this is how it goes:
The average score of average-income students in type A schools is \(73.44\). This estimate is quite precise: we can be 95% certain that the true value lies within the interval \([70.94, 75.79]_{CI95}\).
The income effect in group A is remarkable: per unit on the income scale, the score increases by about 5 points, \(5.04 [2.72, 7.30]_{CI95}\). The effect is almost certainly positive, but there is considerable uncertainty about its real magnitude.
The school type effect reflects the advantage average-income students have in B. The effect is small compared to the income effect, and there is considerable uncertainty in both directions: we can neither safely deny that the effect is relevant, nor assume it to be strong.
The interaction effect indicates that the income effect in B is weaker than in A. It seems to be reduced by more than half, which is really good news. However, we must be careful with conclusions, as the uncertainty is immense: there is quite a chance that the reduction is much smaller than that.
In other words: you have to hit the field again and gather more data. Uncertainty means lack of information, and gathering more data adds information.
And that is another thing you can do with Bayesian statistics, but not with p-values: if you are too uncertain, just gather more data until you are satisfied (or run out of resources).
Yes, you read that correctly: in the frequentist framework of null hypothesis significance testing, this is not allowed! No “adaptive testing”! You may fully replicate your study, but you may not simply stack the two data sets. A pretty stupid rule, isn’t it?
Let’s gamble! (Gambling is a great metaphor for understanding Bayesian concepts.)
- You meet a female German student at a party. What is she studying?
- An unknown person enters the room. What is her IQ? The closest guess wins.
- You try to replicate the Stroop effect and fail. Are you going to write a paper titled “The Stroop effect debunked”?
- A student takes a short version of an IQ test and scores 165. How certain are you that this is real?
- A student takes a short version of an IQ test and scores 105. How certain are you that this is real?
Exercise: what effects in school research are solidly known?
Using prior knowledge to come to conclusions is natural. With Bayesian statistics, you can incorporate prior knowledge into your models.
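A minimal sketch of how that looks in rstanarm; the prior values here are invented for illustration, not recommendations:

# a sketch: an informative prior on the slope coefficient,
# e.g. if earlier studies suggest a moderate positive income effect
M_4 <- stan_glm(score ~ income,
                prior = normal(location = 3, scale = 1),
                data = D_1)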
You are the manager of the follow-up study to confirm B’s reduction of inequality. You collect another few hundred observations using the same protocol. The data set D_2 contains the new data stacked onto the original observations. Analyze it in much the same way.
| Obs | Stud | income | School | Type | score |
|---|---|---|---|---|---|
| 4 | 4 | -1.140 | 9 | A | 62.5 |
| 308 | 308 | 0.408 | 6 | B | 81.2 |
| 78 | 78 | 0.089 | 5 | B | 78.2 |
| 51 | 51 | -1.399 | 4 | A | 71.0 |
| 311 | 311 | 0.702 | 9 | A | 63.2 |
| 320 | 320 | -0.105 | 5 | B | 81.7 |