As human beings we make our decisions based on beliefs. For example, we trust a person or a company more when we can look back at a series of successful transactions. And we have remarkable skills to recall what has just happened, but also what happened yesterday or years ago. By integrating over all the evidence, we form a view of the world. Usually, when evidence is strong and abundant, this view becomes more stable. We believe more strongly and feel less in doubt. Nobody would deny that humans learn by such experience; nevertheless, humans sometimes are terrible decision makers, for a variety of psychological reasons, to name just three:
Bayesian statistics builds on the very intuitive idea that belief is formed by a series of encounters, but formalizes these concepts: belief (or certainty or credibility or credence) is measured on a probability scale (0 = impossible, 1 = certain) and the encounters are called data. When new data arrives, a transition occurs from what you knew before, the prior belief, to what you know after seeing the data, the posterior belief.
The weather forecast announced that tomorrow morning there is a 60% chance of rain. After getting up you glance at the sky and see heavy clouds. What is the chance of rain now? You may not be able to quantify it exactly, but you will be more certain that it is going to rain (\(p > 60\%\)), as there is another piece of evidence.
Bayesian statistics can be reduced to three elements:
Bayesian statistics formalizes the transition from prior belief to posterior belief in a remarkably simple formula:
\[\text{posterior}\ \propto \text{prior}\times\text{likelihood}\]
In very plain words this is:
what you believe now is a combination of what you knew before and what you have just seen in the data.
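As a toy illustration of the formula, consider the rain scenario again. The following minimal sketch in R takes the prior from the forecast; the likelihoods of seeing heavy clouds under rain versus no rain are assumed numbers, purely for illustration.

prior      <- c(rain = 0.6, no_rain = 0.4)   # prior belief from the forecast
likelihood <- c(rain = 0.8, no_rain = 0.3)   # assumed P(heavy clouds | ...)
posterior  <- prior * likelihood / sum(prior * likelihood)
posterior                                    # belief in rain rises to 0.8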
The data is usually the present observation or study. But that does not exclude that prior knowledge is grounded in data, too. In experimental psychology, researchers entertain themselves with repetitions of the very same experimental paradigm, with slight variations maybe. For example, in the famous Stroop effect, participants have to name the ink color of a word. When the word is itself a color word and refers to a different color, response times typically increase. This effect has been replicated in many dozens of published studies and, probably, thousands of student experiments. Cumulative evidence is so strong that, were someone to repeat the experiment another time and find the reverse effect, no one would seriously take this as a debunking. This extreme example illustrates another principle that follows from Bayes' rule:
Today’s posterior is tomorrow’s prior.
There is no principled difference between prior and posterior. They are just levels of belief (credences) at different points in time. Just as today will become one of your yesterdays, another round of Stroop will add to the accumulated evidence. Prior knowledge can be firm, as with the Stroop task, or faint, as with almost everything else in the social sciences. The same holds for the likelihood of the present data. Large sample sizes and powerful experimental designs have more weight.
Prior belief and likelihood can be congruent. That is what classic statisticians bluntly call the confirmation of one's hypothesis. As in the Stroop example, they can also contradict each other. In such cases one often gets less certain about the situation. In the initial rain prediction scenario, a sunny sky would contradict yesterday's forecast and one is left with less certainty about how the weather will turn out. Still, neither prior nor likelihood will ever get thrown out of the equation completely. It is a matter of strength of belief versus strength of evidence.
Whatever happens in the individual case, it is generally accepted that scientific progress is incremental over long periods of time. Under this perspective the idea of updating one's belief is innocent, almost trivial. It is common sense that if you are too uncertain about a situation, you had better gather more information. Once you have reached a satisfactory level of certainty, you proceed to act (or publish). If you have enjoyed a classic education in statistics, it should embarrass you that incremental collection of evidence is impossible when doing classic, i.e. frequentist, statistics. Neither is there a way to express one's previous belief when doing a t-test, nor can you just continue your data collection until satisfactory certainty is reached. In the fictional example of the Stroop task, the classic data analysis pretends that no one has ever done such an experiment before. At the same time, it is strictly forbidden to invite further participants to the lab when the test results point in the right direction, but evidence is still too weak. If you planned the study with, say, \(N = 20\), this is what you have to do, no less, no more. If you reach your goal with fewer participants, you must continue testing nonetheless. If you are unsatisfied with the level of certainty (e.g., \(p = .52\)), the only permissible action is to dump your data and start over from zero. There are many other ways in which frequentist statistics is flawed, including some deeply philosophical ones. For a common person, the denial of incremental progress is deeply counter-intuitive. For a common researcher, it is just a millstone around the neck.
In the following, the essential ideas and procedures of Bayesian analysis will be introduced. For illustration, these concepts will be framed by a virtual case study, which comprises a full Bayesian workflow with one of the simplest models possible.
It is a scientific principle that every event has its causes (from the same universe). The better these causes are understood, the better will be any prediction of what is going to happen in the next moment, given that one knows the laws of physics. Laplace's demon is a classic thought experiment on the issue: the demon is said to have perfect knowledge of the laws of physics and of the universe's current state. Within naturalistic thinking, the demon should be able to perfectly predict what is to happen next. Of course, such an entity could never exist, because it would actually be a computer that matches the universe in size. In addition, there are limits to how precisely we can measure the current state, although physicists and engineers have pushed this very far.
When Violet did her experiment to prove the superiority of design B, the only two things she knew about the state of affairs were that the participant sitting in front of her is a member of a very loose group of people called the "typical user" and the design he or she was exposed to. That is painstakingly little to pin down the neural state of affairs. Her lack of knowledge is profound, but still not a problem, as the research question was gross, too, not asking for more than the difference in average duration. Instead, imagine Violet and a colleague had invented a silly game where they both guess the time-on-task of individual participants. Whoever comes closest wins. As both players are clever people, they do not just randomly announce numbers, but let themselves be guided by data from previous sessions. A very simple but reasonable approach would be to always guess the average ToT of all previous sessions. As gross as this is, it qualifies as a model, more precisely the grand mean model [LM]. The model explains all observations by the population mean.
Of course, Violet would never expect her grand mean model to precisely predict the outcome of a session. Still, imagine a device that has perfect knowledge of the car rental website, the complete current neural state of the participant and the physical environment both are in. The device would also have a complete and valid psychological theory. With this device, Violet could always make a perfect prediction of the outcome. Unfortunately, real design researchers are far from being Laplace's demon. Routinely borrowing instruments from the social sciences, precision of measurement is humble and the understanding of neural processes during web navigation is highly incomplete. Participants vary in many complex ways in their neural state and a myriad of small unrelated forces (SMURF) can push the user towards completion or hinder them.
The Laplace demon has perfect knowledge of all SMURF trajectories and therefore can produce a perfect prediction. Violet is completely ignorant of any SMURFs and her predictions will be off many times. A common way to conceive of this situation is that observed values \(y_i\) are composed of the expected value under the model, \(\mu_i\), and a random part, \(\epsilon_i\):
\[y_i = \mu_i + \epsilon_i\]
Generally, statistical models consist of these two parts: the likelihood to describe the association between predictors and expected values and the random part, which describes the overall influence of the unexplained SMURFs.
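To make the two parts concrete, here is a minimal simulation sketch in R; the population mean and the amount of noise are made-up numbers standing in for the grand mean model and the SMURFs.

set.seed(42)
mu  <- 106                            # assumed population mean ToT in seconds
eps <- rnorm(20, mean = 0, sd = 30)   # random part: the unexplained SMURFs
y   <- mu + eps                       # twenty simulated observations
mean(y)                               # close to, but not exactly, 106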
The likelihood function states the dependency of the outcome on the predictor variables. The dependency can be a complex mathematical function of multiple predictors, or as simple as the population average. A common likelihood function is the linear function. For example, in their guessing game, Violet could try to improve her population model by also taking the age of participants into account. Older people tend to be slower. Violet creates a plot from past records. The ellipsoid form of the point cloud indicates that ToT somehow depends on age. Violet draws a straight line with an upward slope to approximate the relationship. It seems that 30 year old persons have an average ToT of around 90 seconds, which increases to around 120 seconds for 50 year olds. Arithmetically, this is an increase of around 1.5 seconds per year of age.
Violet can use this information to improve her gambling. Instead of stoically calling the population mean, she uses a linear function as predictor: \(90 + (\textrm{age} - 30) \times 1.5\). In Bayesian statistics, this is called a likelihood function and the general form for a single linear likelihood function is:
\[\mu_i = \beta_0 + \beta_1x_{1i}\\\]
Likelihood functions connect the expected value \(\mu_i\) with observed variables \(x_{i1}, x_{i2}, ..., x_{ik}\) and (to-be-estimated) parameters, e.g. \(\beta_0, \beta_1\). The likelihood function is often called the deterministic part of a model, because its prediction strictly depends on the observed predictor values and the parameters, but nothing else. For example, two persons of age 30 will always be predicted to use up 90 seconds. Apparently, this is not the case for real data.
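As a sketch of this deterministic behavior, the guessing-game predictor can be written as a small R function; the intercept and slope are the values read off Violet's plot.

predict_tot <- function(age) 90 + (age - 30) * 1.5   # linear likelihood function
predict_tot(c(30, 30, 50))   # two 30-year-olds get the identical prediction of 90 s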
The linear model is very common in statistical modelling, but likelihoods can take almost any mathematical form. For example:
In the vast majority of cases, the likelihood function is the interesting part of the model, where researchers transform their theoretical considerations or practical questions into a mathematical form. The parameters of the likelihood function are estimated and answer the pressing questions, such as:
A subtle but noteworthy feature of likelihood functions is that \(\mu_i\) and \(x_i\) carry the index \(i\). Potentially, every observation \(i\) has its own realization of the predictors and gets a unique expected value, whereas the parameters \(\beta_0, \beta_1\) and so forth are single values that apply to all observations at once. In fact, we can conceive of statistical models as operating on multiple levels, where there are always these two: the observation level and the population level. When introducing multi-level models, we will see how this principle extends to more than these two levels. Another related idea is that parameters summarize patterns found in data. Any summary implies repetition and that is what the likelihood expresses: the pattern that repeats across observations and is therefore predictable.
The random part of a statistical model is what changes between observations and is not predictable. When using the grand mean model, the only information we are using is that the person is from the target population. Everything else is left to the unobserved SMURFs and that goes into the random part of the model. Fortunately, SMURFs don't work completely arbitrarily. Frequently, recognizable patterns of randomness emerge. These patterns can be formulated mathematically as probability and density distributions. A probability distribution is typically characterized as a probability mass function that assigns probabilities to outcomes, such as:
Probability distributions are mathematical functions that assign probabilities to the outcomes of a measured variable \(y\). Consider a participant who is asked to complete three tasks of constant difficulty, such that there is a chance of \(30\%\) for each one to be solved. The outcome variable of interest is the number of correct completions (0, 1, 2 or 3). Under idealized conditions, the following random distribution gives the probability of every possible outcome.
Further, we observe that the most probable outcome is exactly one correct task, which occurs with a probability of \(P(y = 1) = 0.441\). At the same time, there is ample probability for all tasks failing, \(P(y = 0) = 0.343\). We may also look at combined events, say the probability of less than two correct. That is precisely the sum \(P(y \leq 1) = P(y = 0) + P(y = 1) = 0.784\).
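These numbers can be reproduced with R's built-in binomial functions; a quick sketch for the three-tasks example:

dbinom(0:3, size = 3, prob = 0.3)        # P(y = 0), ..., P(y = 3): 0.343 0.441 0.189 0.027
sum(dbinom(0:1, size = 3, prob = 0.3))   # P(y <= 1) = 0.784
pbinom(1, size = 3, prob = 0.3)          # the same, via the cumulative distribution function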
We can bundle basic events by adding up the probabilities. An extreme case of that is the universal event that includes all possible outcomes. You can say with absolute certainty that the outcome is, indeed, between zero and three and certainty means the probability is 1, or: \(P(0 \leq y \leq 3) = 1\). As a matter of fact, all probability (and density) distributions fulfill that property.
More precisely, the area under the PMF must be exactly one, and that brings us directly to a second way of characterizing the random distribution: the cumulative distribution function (CDF) renders the probability of the outcome being smaller than or equal to \(y\). In the case of discrete outcomes, this is just stacking (or summing) over all outcomes, just as we did for \(P(y\leq1)\) above. The CDF of the three-tasks example is shown in the graph below. We recognize the left starting point, which is exactly \(P(y = 0)\), and observe a large jump to \(P(y \leq 1)\). Finally, at \(y \leq 3\) the function reaches the upper limit of 1, which is full certainty.
In the three-tasks example, reading and recombining probabilities is like counting blocks and stacking them upon each other. This is how most children learn basic arithmetic. When the outcome measure is continuous, rather than discrete, some high school math is required. The most common continuous measure is probably duration. As we will see, durations take quite tricky random patterns, so for the sake of simplicity, consider the distribution of intelligence quotients (IQ). Strictly speaking, the IQ is not continuous, as one usually only measures and reports whole-number scores. Still, for instructional purposes, assume that the IQ is given with arbitrary precision, be it \(114.9\), \(100.0001\) or \(\pi \times 20\).
We observe that the most likely IQ is 100 and that almost nobody reaches scores higher than 150 or lower than 50. But how likely is it to have an IQ of exactly 100? Less likely than you might think! With continuous measures, we can no longer think in blocks that have a certain area. In fact, the probability of having an IQ of exactly \(100.00...0\) is exactly zero. The block of IQ = 100 is infinitely narrow and therefore has an area of zero. Generally, with continuous outcome variables, we can no longer read off probabilities directly. Therefore, probability mass functions don't apply; instead, the association between outcome and probability is given by what is called a probability density function. What PDFs share with PMFs is that the area under the curve is always exactly one. They differ in that PDFs return a density for every possible outcome, which by itself is not as useful as a probability, but can be converted into probabilities.
Practically, nobody is really interested in infinite precision. When asking “what is the probability of IQ = 100?”, the answer is “zero”, but what was really meant was: “what is the probability of an IQ in a close interval around 100?”. Once we speak of intervals, we clearly have areas larger than zero. The graph below shows the area in the range of 85 to 115.
But how large is this area exactly? As the distribution is curved, we can no longer simply count virtual blocks. Recall that the CDF gives the probability mass, i.e. the area under the curve, for outcomes up to a chosen point. Continuous distributions have CDFs, too, and the graph below shows the CDF for the IQs. We observe how the curve starts to rise from zero at around 50, has its steepest point at 100, just to slow down and run against 1.
Take a look at the following graph. It shows the two areas \(IQ \leq 85\) and \(IQ \leq 115\). The magic patch in the center is just the desired interval.
And here the CDF comes into play. For any point of the PDF, the CDF yields the area up to this point and we can compute the area of the interval by simple subtraction:
\[
\begin{aligned}
P(IQ \leq 85) &= 0.159\\
P(IQ \leq 115) &= 0.841\\
P(85 \leq IQ \leq 115) &= P(IQ \leq 115) - P(IQ \leq 85) = 0.683
\end{aligned}
\]
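In R, the Gaussian CDF is available as pnorm, so this subtraction is a one-liner; mean 100 and standard deviation 15 are the values of the IQ example.

pnorm(115, mean = 100, sd = 15) - pnorm(85, mean = 100, sd = 15)   # 0.683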
Probability and density distributions usually are expressed as mathematical functions. For example, the function for the case of task completion is the binomial distribution, which gives the probability for \(y\) successes in \(k\) trials at a base probability of \(p\):
\[ Pr(y|p,k) = {k \choose y}p^y(1-p)^{k-y} \]
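Written as code, the formula can be checked against R's built-in dbinom; the function below is just a transcription of the PMF above.

binom_pmf <- function(y, p, k) choose(k, y) * p^y * (1 - p)^(k - y)
binom_pmf(0:3, p = 0.3, k = 3)       # the three-tasks probabilities, again
dbinom(0:3, size = 3, prob = 0.3)    # identical values from the built-in PMF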
In most cases where the binomial distribution applies, the base probability \(p\) is the parameter of interest, whereas the number of trials is known beforehand and therefore does not require estimation. For that reason, the binomial distribution is commonly taken as a one-parameter distribution. When discussing the binomial in more detail, we will learn that \(p\) determines the location of the distribution, as well as how widely it is dispersed (preview Figure XY). The distribution that approximated the IQs is called the Gaussian distribution (or Normal distribution). The Gaussian distribution function takes two parameters: \(\mu\) determines the location of the distribution, say an average IQ of 98, 100 or 102, and \(\sigma\) gives the dispersion, independently (preview Figure XY).
Location and dispersion are two immediate properties of plotted distributions that have intuitive interpretations. The location of a distribution usually reflects where the most typical values come to lie (100 in the IQ example). When an experimenter asks for the difference of two designs in ToT, this is purely about location. Dispersion can either represent uncertainty or variation in a population, depending on the research design and statistical model. The most common interpretation is uncertainty. The basic problem with dispersion is that spreading out a distribution influences how typical the most typical values are. The fake IQ data basically is a perfect Gaussian distribution with a mean of 100 and a standard deviation of 15. The density of this distribution at an IQ of 100 is about 0.027. If IQs had a standard deviation of 30, the density at 100 would fall to about 0.013. If you were in a game to guess an unfamiliar person's IQ, in both cases 100 would be the best guess, but you would have a considerably higher chance of being right when dispersion is low.
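The two densities quoted above can be reproduced with dnorm; a quick check:

dnorm(100, mean = 100, sd = 15)   # about 0.027
dnorm(100, mean = 100, sd = 30)   # about 0.013, half as "typical"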
The perspective of uncertainty routinely occurs in the experimental comparison of conditions, e.g. design A compared to design B. What causes experimenters worry is when the residual distributions in their models are widely spread. Roughly speaking, residuals are the variation that is not predicted by the model. The source of this variation is unknown and usually called measurement error. It resides in the realm of the SMURFs. With stronger measurement-error dispersion, the two estimated locations get less certainty assigned, which blurs the difference between the two.
The second perspective on dispersion is that it indicates variation from a known source. Frequently, this source is differences between persons. The IQ is an extreme example of this, as these tests are purposefully designed to have the desired distribution. In chapter [LMM] we will encounter several sources of variation, but I am mostly concerned about human variation. Commonly, experimental researchers are obsessed with differences in location, which to my mind confuses "the most typical" with "in general". Only when variation between participants is low do these gradually become the same. We will re-encounter this idea when turning to multi-level models.
Most distributions routinely used in statistics have one or two parameters. Generally, if there is one parameter, it determines both location and dispersion, whereas two-parameter distributions can vary location and dispersion independently, to some extent. The Gaussian distribution is a special case, as \(\mu\) purely does location, whereas \(\sigma\) is just dispersion. With common two-parameter distributions, both parameters influence location and dispersion in more or less twisted ways. For example, mean and variance of a two-parameter binomial distribution both depend on the chance of success \(p\) and the number of trials \(k\), as \(\textrm{M} = pk\) and \(\textrm{Var} = kp(1-p)\).
In this book I advocate the thoughtful choice of distributions rather than running batteries of goodness-of-fit tests to confirm that one of them, the Gaussian, is an adequate approximation. It usually is trivial to determine whether a measure is discrete (like everything that is counted) or (quasi)continuous, and that is the most salient feature of distributions. A second, nearly as obvious, feature of any measure is its range. Practically all physical measures, such as duration, size or temperature, have natural lower bounds, which typically results in scales of measurement that are non-negative. Counts have a lower boundary, too (zero), but there can also be a known upper bound, such as the number of trials. Statistical distributions can be classified the same way: having no bounds (Gaussian, t), one bound (usually the lower; Poisson, exponential) or two bounds (binomial, beta).
Many dozens of PMFs and PDFs are known in statistical science and are candidates to choose from. A first orientation grounds on superficial characteristics of measures, such as discrete/continuous or range, but that is sometimes not sufficient. For example, the pattern of randomness in the three-tasks example falls into a binomial distribution only when all trials have the same chance of success. If the tasks are very similar in content and structure, learning is likely to happen and the chance of success differs between trials. Using the binomial distribution when chances are not constant leads to severely mistaken statistical models.
For most distributions, strict mathematical definitions exist for the circumstances under which randomness takes this particular pattern. Frequently, there are one or more natural phenomena that accurately fall into this pattern, such as the number of radioactive isotope cores decaying in a certain interval (Poisson distributed) or … . This is particularly the case for the canonical four random distributions that follow. Why are these canonical? The pragmatic answer is: they cover the basic types of measures: chance of success in a number of trials (binomial), counting (Poisson) and continuous measures (exponential, Gaussian).
However,
Imagine standing slightly elevated in front of a moving crowd and choosing one pedestrian to follow with your eyes. The task is easier when the target person is tall and moves straight on, while the rest of the crowd is moving in random, curly patterns. This method also works to identify the waiter in a crowded bar.
The theorists' answer is twofold: first, they are exponential …; second, they are lowest entropy.
Random distributions:
A very basic performance variable in design research is task success. Think of devices in high-risk situations such as medical infusion pumps in surgery. These devices are remarkably simple, giving a medication at a certain rate into the bloodstream for a given time. Yet, they are operated by humans under high pressure and must therefore be extremely error-proof in handling. Imagine the European government set up a law that manufacturers of medical infusion pumps must prove 90% error-free operation in routine tasks. A possible validation study could be as follows: a sample of \(N = 30\) experienced nurses are invited to a testing lab and asked to complete ten standard tasks with the device. The number of error-free task completions per nurse is the recorded performance variable to validate the 90% claim. Under somewhat idealized conditions, namely that all nurses have the same proficiency with the device and all tasks have a success chance of 90%, the outcome follows a Binomial distribution and the results could look like the following:
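Such an idealized study is easy to simulate; a sketch in R, where each nurse contributes one draw from a Binomial distribution with ten trials and a 90% chance of success:

set.seed(1)
correct <- rbinom(30, size = 10, prob = 0.9)   # error-free completions per nurse
table(correct)                                 # frequency of each score in the sample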
Speaking about the Binomial distribution in terms of successes in a number of attempts is common. As a matter of fact, any binary classification of outcomes is amenable to Binomial modelling, like on/off, red/blue, … . Imagine Jane's big boss needs a catchy phrase for an investor meeting. Together they decide that the return rate of customers could be a good measure, translating into a statement such as "eighty percent of customers come back". To prove (or disprove) the claim, Jane uses the customer database and divides all individuals into two groups: those who have precisely one record and those who returned (no matter how many times). This process results in a distribution that has two possible outcomes: \(0\) for one-timers and \(1\) for returners. This is, in fact, a special case of the Binomial distribution with \(k = 1\) attempts. Examples are given in the first row of the figure.
A Binomial distribution has two parameters: \(p\) is the chance of success and \(k\) is the number of attempts. \(p\) is a probability and therefore can take values in the range from zero to one. With larger \(p\) the distribution moves to the right. The mean of a Binomial distribution is the probability scaled by the number of attempts, \(M = kp\). Logically, there cannot be more successes than \(k\), but with larger \(k\) the distribution gets wider. The variance is \(\textrm{Var} = kp(1-p)\). As mean and variance depend on the exact same parameters, they cannot be set independently. In fact, the relation \(\textrm{Var} = M(1-p)\) is parabolic in \(p\), so that variance is largest at \(p = .5\), but decreases towards both boundaries. A Binomial distribution with, say, \(k = 10\) and \(p = .4\) always has mean \(4\) and variance \(2.4\). This means, in turn, that an outcome with a mean of \(4\) and a variance of \(3\) is not Binomially distributed. This occurs frequently when the success rate is not identical across trials. A common solution is to use hierarchical distributions, where the parameter \(p\) itself is distributed, rather than fixed. A common distribution for \(p\) is the beta distribution; the logit-normal distribution is an alternative.
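The tight coupling of mean and variance is easy to verify numerically, for instance for the \(k = 10\), \(p = .4\) case just mentioned:

k <- 10; p <- 0.4
c(mean = k * p, variance = k * p * (1 - p))   # 4.0 and 2.4, fixed by k and p
var(rbinom(1e5, size = k, prob = p))          # a large simulation lands close to 2.4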
The Binomial distribution has two boundaries, zero below and the number of attempts \(k\) above. While a lower boundary of zero is often natural, one cannot always speak of a number of attempts. For example, the number of times a customer returns to the car rental website does not yield a natural interpretation as a number of attempts. Rather, one could imagine the situation such that any moment is an opportunity to hire a car, while every single moment has a very, very small chance that a car is indeed hired. Under these conditions, an infinite (or painstakingly large) number of opportunities and a very low rate, the random pattern is neatly summarized by Poisson distributions.
Some counting processes have no natural upper limit like the number of trials in a test. In design research, a number of measures are such unbounded counts:
These measures can often be modelled as Poisson distributed. A useful way to think of unbounded counts is that the event can happen at every moment, but with a very small chance. Think of a longer interaction sequence of a user with a system, where errors are recorded. It can be conceived as an almost infinite number of opportunities to err, with a very small chance of anything happening. The Poisson distribution is a so-called limiting case of the binomial distribution, with infinite \(k\) and infinitely small \(p\). Of course, such a situation is completely idealized. Yet, Poisson distributions fit such situations well enough.
Poisson distributions possess only one parameter \(\lambda\) (lambda), which is strictly positive and determines mean and variance of the distribution alike: \(\lambda = M = \textrm{Var}\). As a matter of fact, there cannot be massively dispersed distributions close to zero, nor narrow ones far from it. Owing to the lower boundary, Poisson distributions are asymmetric, with the left tail always being steeper. Higher \(\lambda\)s push the distribution away from the boundary and the skew diminishes. It is common practice to approximate counts in the high numbers by normal distributions.
The linkage between mean and variance is very strict. Only a certain amount of randomness can be contained. If there is more randomness, and that is almost certainly so, Poisson distributions are not appropriate. One speaks of overdispersion in such a case.
Consider a very simple video game, subway smurfer, where the player jumps and runs a little blue avatar on the roof of a train and catches items passing by. Many items have been placed into the game, but catching a single one is very difficult. The developers are aware that too low a success rate would demotivate players as much as when the game is made too easy. In this experiment, only one player is recorded, and in wonderful ways this player never suffers from fatigue, nor does he get better with training. The player plays 100 times and records the catches after every run. In this idealized situation, the distribution of catches would, indeed, follow a Poisson distribution, as in the figure below.
Consider a variation of the experiment with 100 players doing one game each and less restrictive rules. Players come differently equipped to perform visual search tasks and coordinate actions at high speed. They are tested at different times of the day and by chance feel a bit groggy or energized. The chance of catching varies between players, which violates the assumption borrowed from the Binomial case, a constant chance \(p\). The extra variation is seen in the wider of the two distributions.
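This difference is easy to reproduce in a simulation sketch. The Gamma distribution for the player-level rates is just one convenient, assumed way to let the catch rate vary, and the average rate of 20 catches per game is made up.

set.seed(7)
ideal   <- rpois(100, lambda = 20)               # one ideal player, constant rate
rates   <- rgamma(100, shape = 10, rate = 0.5)   # 100 players with varying rates, mean 20
varying <- rpois(100, lambda = rates)            # one game per player
c(mean(ideal), var(ideal))       # mean and variance roughly equal
c(mean(varying), var(varying))   # variance clearly exceeds the mean: overdispersion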
Poisson distributions' lower boundary can cause trouble: the measure at hand must truly include the lower bound. A person can perform a sequence with no errors, catch zero items or have no friends on Facebook. But you cannot complete an interaction sequence in zero steps or have a conversation with less than two statements. Fortunately, once a count measure has a lower boundary right of zero, the offset is often available, such as the minimum number of steps necessary to complete a task. In such a case, the number of erroneous steps can be derived and used as a measure instead:
\[ \textrm{\#errors} = \textrm{\#steps} - \textrm{\#necessary steps} \]
Another lower-bound problem arises when there are hurdles. In traffic research, the frequency of public transport use certainly is an interesting variable. A straightforward assessment would be to ask bus passengers "How many times have you taken the bus in the last five days?". This clearly is a count measure, but it cannot be zero, because the person is sitting in the bus right now. This could be solved by a more inclusive form of inquiry, such as approaching random households. But the problem is deeper: actually, the whole population consists of two classes, those who use public transport and those who don't.
Exponential distributions apply to measures of duration. They have the same generating process as Poisson distributions, except that the duration until an event happens is the variable of interest, rather than the number of events in a given time. Under the same idealized conditions of a completely unaffected subway smurfer player and constant catchability of items, the duration between any two catches is exponentially distributed. More generally, the chance for the event to happen is the same at any moment, completely independent of how long one has been waiting. For this property, the exponential distribution is called memoryless.
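Memorylessness can be checked numerically with the exponential CDF in R; the rate and waiting times below are arbitrary illustration values.

rate <- 0.1; s <- 5; t <- 3
pexp(t, rate, lower.tail = FALSE)                                          # P(T > t)
pexp(s + t, rate, lower.tail = FALSE) / pexp(s, rate, lower.tail = FALSE)  # P(T > s + t | T > s), identical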
Durations are common measures in design research, most importantly, time-on-task and reaction time. Unfortunately, the exponential distribution is a poor approximation of the random pattern found in duration measures. That is for two reasons: first, the exponential distribution shares with Poisson, that
The best known distributions are normal distributions or Gaussian distributions. These distributions arise mathematically under the assumption of a myriad of small unrelated forces (SMURF) pushing performance (or any other outcome) up or down. As SMURFs work in all directions independently, their effects often average out and the majority of observations stays clumped together in the center, more or less.
Normal distributions have two parameters: \(\mu\) marks the center and mean of the distribution. The linear models introduced later aim at predicting \(\mu\). The second parameter \(\sigma\) represents the dispersion of the random pattern. When randomness is pronounced, the center of the distribution gets less mass assigned, as the tails get wider.
Unlike Poisson and Binomial distributions, mean and variance of the distribution can be set independently and overdispersion is never an issue.
Normal distributions have the compelling interpretation of summarizing the effect of SMURFs. They serve to capture randomness in a broad class of regression models and other statistical approaches. The problem with normal distributions is that they only capture the pattern of randomness under two assumptions. The first assumption is that the outcome is continuous. While that holds for duration as a measure of performance, it would not hold for counting the errors a user makes. The second assumption is that the SMURFs are truly additive, like the forces that add up when two pool balls collide. This appears subtle at first, but it has the far-reaching consequence that the outcome variable must have an infinite range in both directions, which is impossible.
The normal distribution is called "normal", because people normally use it. Of course not. It got its name for a deeper reason, commonly known (and held with awe) as the central limit theorem. Basically, this theorem proves what we have passingly observed for binomial and Poisson distributions: the more they move to the right, the more symmetric they get. The central limit theorem proves that, in the long run, a wide range of distributions become indistinguishable from the normal distribution. In practice, infinity is relative. In some cases, it is reasonable to trade in some fidelity for convenience, and good approximations make effective statisticians. As a general rule, the normal distribution approximates other distributions well when the majority of measures stay far from the natural boundaries. That is the case in experiments with very many attempts and moderate chances (e.g. signal detection experiments), when counts are in the high numbers (number of clicks in a complex task) or with long durations. However, these rules are no guarantee and careful model criticism is essential.
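As a quick numeric illustration of this rule of thumb, a binomial distribution with many attempts and a moderate chance can be compared with its Gaussian counterpart; the numbers are arbitrary.

k <- 500; p <- 0.4
pbinom(220, size = k, prob = p)                        # exact binomial CDF
pnorm(220, mean = k * p, sd = sqrt(k * p * (1 - p)))   # normal approximation, nearly the same value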
Measurement is prime, and specialized (non-central-limit) distributions remain the first recommendation for capturing measurement errors. The true salvation of normal distributions is their application in multi-level models. While last century's statistics was ruled by questions of location, with variance considered a nuisance, the new statistics cares about variation. Most notably, the amount of variation in a population is added as a central idea in multi-level modelling, which is commonly referred to as random effects. These models can become highly complex and convenience is needed more than ever. Normal distributions tie things together in multi-level models, as they keep location and dispersion neatly apart.
The dilemma is then solved with the introduction of generalized linear models (GLM), a framework for using linear models with appropriate error distributions. Fortunately, MLM and GLM work seamlessly together. With MLM we can conveniently build graceful likelihood models, using normal distributions for populations. The GLM part is a thin layer to get the measurement scale right and choose the right error distribution, just like a looking glass.
The five canonical random distributions match the basic types of measurements, but they make strict assumptions about the data-generating process. A majority of data does not meet these conditions. A routine problem is that binomial and Poisson distributions assume that the chance for an event to occur is strictly constant. Take the number of successful tasks in such a study: a Poisson distribution with \(\lambda = M = \textrm{Var}\) emerges only if all users in the study have the same chance of errors. Could that be the case?
In generalized multi-level models it can happen that a normal distribution sits underneath a Poisson distribution. Strictly, that is not a plugin distribution, but a related concept.
Frequentist statistics fails to accommodate the basic fact that by far most research is incremental and always has been. Bayesian statistics embraces the idea of a gradual increase in certainty. Why has it not been adopted earlier? The reason is that it used to be computationally unfeasible. The innocent multiplication of prior and likelihood turns into a complex integral, which in most cases has no analytic solution. If you have enjoyed a classic statistics education, you may remember that the computation of sums of squares (explained and residual) can be done with paper and pencil in reasonable time. And that is precisely how statistical computations were performed before the advent of electronic computing machinery. In the frequentist statistical framework (some call it a zoo), ingenious mathematicians have developed procedures that were rather easy to compute. That made statistical data analysis possible in those times. It came at costs, though:
Expensive computation is a thing of the past. Modern computers can simulate realistic worlds in real time, and they solve the complex integrals of Bayesian statistics hands down. When analytical solutions do not exist, the integrals can still be solved using numerical procedures. Numerical procedures have been used in frequentist statistics, too; for example, the iterative least squares algorithm applies to Generalized Linear Models, and the Newton-Raphson optimizer can be used to find the maximum likelihood estimate. However, these procedures are too limited, as they fail for the highly multidimensional problems that are common in Linear Mixed-Effects Models. Moreover, they do not allow one to approximate integrals.
Most Bayesian estimation engines these days are grounded on a numerical procedure called Markov-Chain Monte-Carlo (MCMC) sampling. The method differs from those mentioned earlier in that it basically is a random number generator. The basic MCMC algorithm is so simple that it can be explained on half a page and implemented in 25 lines of code. Despite its simplicity, the MCMC algorithm is applicable to practically all statistical problems one can imagine. Being so simple and generic at the same time must come at some cost. The downside of MCMC sampling still is computing time. Models with little data and few variables, like the rainfall case above, are estimated within a few minutes. Linear mixed-effects models, which we will encounter later in this book, can take hours, and large psychometric models (which are beyond the scope) can take up to a few days.
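To back up the claim about simplicity, here is a minimal sketch of a Metropolis sampler for a single chance-of-rain parameter, written in plain R. The data (rain on 12 of 20 cloudy mornings) and the flat prior are assumptions for illustration only; this is not the sampler used by the estimation engine below.

set.seed(123)
y <- 12; n <- 20                             # hypothetical: rain on 12 of 20 cloudy mornings
log_post <- function(p) {                    # flat prior, so the posterior is proportional to the likelihood
  if (p <= 0 || p >= 1) return(-Inf)         # outside the parameter space
  dbinom(y, n, p, log = TRUE)
}
n_iter <- 5000
chain  <- numeric(n_iter)
p_cur  <- 0.5                                # arbitrary starting value
for (i in 1:n_iter) {
  p_prop <- p_cur + rnorm(1, sd = 0.1)       # random-walk proposal
  if (log(runif(1)) < log_post(p_prop) - log_post(p_cur))
    p_cur <- p_prop                          # accept; otherwise stay where we are
  chain[i] <- p_cur
}
quantile(chain, c(.025, .5, .975))           # posterior summary read off the random walk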
Still, the MCMC algorithm not only delivers accurate point estimates, it produces the full posterior distribution. This lets us characterize a parameter's magnitude and degree of (un)certainty. Let's run an analysis on the 20 rainfall observations to see how this happens.
library(rstanarm)    # provides stan_glm
library(tidyverse)   # provides the pipe %>%

M_1 <-
  Rain %>%
  stan_glm(rain ~ cloudy - 1,   # no intercept: estimate P(rain|cloudy) directly
           family = binomial,   # binary outcome: rain or no rain
           data = .)
What the estimation does is calculate the posterior distribution from the observations. The posterior distribution contains the probability (more precisely: the density) for every possible value of the parameter in question. The following density plot represents our belief about the parameter \(P(rain|cloudy)\) after we have observed twenty days:
From the posterior distribution, we can derive all kinds of summary statistics, such as:
We can also make non-standard evaluations on the posterior distribution, for example: How certain is it that \(P(rain|cloudy) < 0.7\)? We’ll demonstrate the use of this in the next section.
Coming back to MCMC: how is this distribution actually produced? In plain words, MCMC makes a random walk through parameter space. Regions where the true value is more likely to be are simply visited more often. The posterior distribution plots above are actually just frequency plots.
[ILLUSTRATE RANDOM WALK]
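Because the posterior arrives as a collection of MCMC draws, summaries and non-standard evaluations reduce to counting and averaging over those draws. A sketch, assuming the model M_1 from above; the coefficient sits on the logit scale (binomial family) and its exact column name depends on how cloudy is coded in the Rain data.

draws <- as.data.frame(M_1)            # one row per MCMC draw, one column per parameter
p_rain_cloudy <- plogis(draws[, 1])    # back-transform the (assumed) cloudy coefficient to a probability
mean(p_rain_cloudy < 0.7)              # posterior certainty that P(rain|cloudy) < 0.7
posterior_interval(M_1, prob = 0.95)   # 95% credibility interval (still on the logit scale)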
Reconsider Jane and Andrew. When asked whether users can, on average, complete a transaction within 99 seconds, they looked at some time-on-task measures. Recall how the research question was defined (based on some fictional laws): the slogan "rent a car in 99 seconds" is only legitimate when the average time of all users is 99 seconds or below. Taking the average of a set of measures is everyday knowledge. Often, what is intuitive to people is what they think about the least, without seeing the alternatives. So, just to be sure that we are on the same page, the statement is not about:
All of the above, including the mean, are numbers that summarize the observations. A single number that summarizes a set of observations is called a statistic; formally it is just
\(s = f_s(D)\)
with \(s\) being a statistic on data \(D\), defined by the function \(f_s\). Characterizing a data set by appropriate statistics is called descriptive statistics.
Most commonly used statistics fall into three classes: cardinality, central tendency (or location) of the data, and dispersion of the data. Cardinality statistics give the number of units of a class in a data set. The most common cardinalities are the number of observed performance measures \(N_{Obs}\) and the number of individual participants \(N_{Part}\) in the study.
Reporting the calculated mean of time-on-task in two designs is descriptive. It summarizes the data set in a reasonable way. It is informative in so far as one learns which design to prefer, but coarsely. Consider a case where two designs were just slight variations, come at precisely the same costs, and the researcher has collected performance measures on five subjects per group, without the option of inviting more participants.
Reconsider Jane and Andrew. What did they know about the current state of affairs when running a particular session? Using some time-on-task measures, they disproved the claim "rent a car in 99 seconds". Recall how precisely the question was phrased: on average, users had to be able to complete the transaction in 99 seconds. The statistic of interest is the mean. The claim was debunked with almost no effort, by calculating:
\[\hat M_{ToT} = \frac{1}{n}\sum_i ToT_i = 105.975\]
So, if everybody would just do descriptive statistics, we would be done here: identify the quantitative summary that suits your situation and compute it. However, the field of statistics' strongest contribution is that the level of uncertainty can also be quantified. This uncertainty arises from incomplete knowledge: only a small fraction of potential users have been tested, whereas the claim is about the whole population. Notice that I have not written the sample mean as \(M\), but put a "hat" on it. That is to denote that the sample mean is an estimate of the population mean. The term estimate usually denotes that it is reasonable (intuitively, as well as strictly mathematically) to assume that the statistic obtained from the sample is useful for making claims about the whole population. At the same time, every sample of users carries only partial information about the whole population, such that the estimate is imperfect.
A consequence of this imperfection is that when another sample is drawn, one usually does not obtain precisely the same estimate. The fluctuation of samples from an unknown population is what classic frequentist statistics draws upon. In the case of Jane and Andrew, a frequentist statistician would ask the question:
How certain can you be that in the population of users \(M_{ToT} \leq 99\)?
For frequentist thinkers the idea of the sample is central and the mathematical underpinning rests on a thought experiment: what would all other possible samples look like? Here, we follow the Bayesian approach and consider the fluctuation in sample statistics a consequence of uncertainty. Every sample carries incomplete information, and inferential Bayesian statistics centers around the full quantification of uncertainty. As we have seen, uncertainty about a future event occurring is crucial for decision making on rational grounds.
The p-value […] Imagine a giant replication study that showed that 35% of published psychology studies are not replicable. In fact, such a study exists and it has far-reaching consequences. […] [REF] shows that rejection bias is one reason: studies that have not reached the magic .05 level of significance have a lower chance of publication. [REF] sees as another reason that authors implicitly trick the system by the garden-of-forking-paths strategy. The research project collects an array of variables, followed by a fishing expedition. Theory is conveniently considered last, but written up in a pseudo a-priori way.