These threats can overlap, so it is common for a single issue in an experiment to undermine more than one type of validity.

Internal Validity is….

“… addressed by experimental arrangements that help rule out or make implausible factors that could explain the effects we wish to attribute to the intervention” (Kazdin, 1992)

Kazdin is saying that we want to render alternative explanations implausible. We want to set up our experiment so that it effectively neutralizes their impact, leaving the intervention as the only logical explanation of the effect.

Internal validity is the priority: it matters most for our conclusions, and it logically comes before external validity. One should first have a factual finding before worrying about whether it generalizes (Kazdin, 1992).

Threats to Internal Validity

History

  • any unplanned event that occurs inside or outside the experiment during its course. For example, a participant is fired from work, or a student dies on the UW campus in the middle of a year-long experiment; that tragedy affects other students who may be future participants, so they are now different from students who took part before it.

Maturation, Age, sophistication

  • changes within the subject. It’s similar to a history effect but contained within the person. It also includes things like gained wisdom, boredom, and fatigue.

Testing (aka testing effects, practice effects)

  • Taking a pretest may impact the posttest. This is a big issue in IQ testing. The issue is basically: if one is exposed to stimuli, does that exposure have lasting impacts?

Instrumentation

  • Change in measurement instruments over time.

  • Consider how we measured “gender” 30 years ago: it was treated as binary and conflated with ‘sex’. Now we measure both (or at least we should). But that makes it harder to compare today’s trends with what took place 30 years ago. This is not to say we shouldn’t be curious about the ways gender can be experienced, only that the measurement has changed and so comparisons are subject to this type of threat.

Regression to the mean

  • this statistical property refers to the tendency for extreme events (which are rare) to be followed by events that are closer to the average.

  • A classic case of statistical regression exists in athletics (though I think often described in fighter pilot training schools). An athlete would perform really poorly, and this would represent the extreme event. The coach would yell at the athlete and berate them, thinking that this will motivate them to play better. The next game, the player plays much better. Why?

    • Statistically we expect that after a rare event, performance will track more closely to the average. So the athlete playing ‘better’ just means they play closer to their average performance. No yelling, no hugging, no other coaching factor made a difference; the player just had a random bad day.

    • But the coach sees the improvement, (mis-)attributes the improved performance to their motivational style (yelling), and then continues to yell at players having a bad game.

  • As an example, let’s say you wanted to provide some kind of treatment, or intervention, to help reduce stress. In this research you gave a survey measuring stress, and you may be interested in the students who are the most stressed. If you included them in your study, provided some treatment, and then re-measured their stress levels a week later, regression to the mean would predict that those students revert back toward their baseline level of stress regardless of receiving the treatment (this is why we need control groups). A small simulation sketch follows below.
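
Here is a minimal simulation sketch of the stress example, using made-up numbers: each student’s underlying stress is stable, each measurement adds random noise, and the students who look most stressed on the pretest score closer to the average at the second measurement even though no treatment was given.

# Regression to the mean with hypothetical stress scores
set.seed(42)
n <- 500
true_stress <- rnorm(n, mean = 50, sd = 10)   # each student's underlying stress level
pretest  <- true_stress + rnorm(n, sd = 10)   # noisy measurement at time 1
posttest <- true_stress + rnorm(n, sd = 10)   # noisy measurement at time 2, no treatment at all

top <- pretest > quantile(pretest, 0.90)      # select the "most stressed" students at time 1

mean(pretest[top])    # this group looks extremely stressed at time 1
mean(posttest[top])   # the same group scores closer to the overall mean at time 2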

Selection Biases

  • Systematic differences between groups based on selection or assignment.
  • Conclusions about an independent variable are only clear when the groups under study have been shown to be basically the same. Random assignment helps ensure this, but if there are systematic differences between the groups before the intervention, then your experiment will be unclear (a small sketch of random assignment follows below).
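
A minimal sketch of random assignment in R, using hypothetical participant IDs; the point is only that group membership is decided by chance rather than by any characteristic of the participants.

# Randomly assign 20 hypothetical participants to two groups
set.seed(3)
participants <- paste0("P", 1:20)
group <- sample(rep(c("treatment", "control"), each = 10))  # shuffle the group labels
data.frame(participants, group)                             # each participant's assignment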

Attrition

  • Participants drop out for various reasons, but if those reasons are systematic, that will change the make-up of your groups. For instance, in college, many students have to drop out of school. Do you think differences in ethnicity could be used to predict that, and is ethnicity a proxy for something else (think: yes)? What could be happening?

Combination of selection and other threats.

  • If you are not using random assignment, you may get selection biases combined with other threats like those above, such as a history threat.

Diffusion or imitation of treatment.

  • Elements of an intervention intended for one group are accidentally provided to another group, or mixed into a different treatment. For example, a therapist delivering CBT also throws in some psychoanalytically inspired ideas on attachment and defense mechanisms.

Special treatment or reactions of controls.

  • The “no treatment” or “control” group gets special treatment, such as more money, more monitoring of well-being, or special privileges. As in, “I know it’s so hard to wait for your turn to get the therapy, here, let me buy you a sandwich.”

External Validity

It’s all about generalizability. Do the results from your study extend beyond your sample? Can you make statements about larger populations?

Threats

Sample characteristics

  • You can’t generalize to groups that don’t share the same traits, such as age, gender, ethnicity, education, etc. This is what most introduction-to-research students see as the major threats. Sometimes they are, sometimes not. For instance, if doing biological research, a case can be made that since human beings share about 99.9% genetic similarity (differences are rare and minor, e.g., melanin in skin), you might not need a diverse sample of humans. On the other hand, there may very well be important differences that can be flagged, or tracked by proxy of ethnicity. For instance, heart disease among African Americans is a known problem for several reasons: perhaps economic forces lead to a poorer diet, but there is also evidence of a gene variant that makes salt consumption problematic. East Asians suffer from alcohol intolerance at a much higher rate than the world population due to a deficiency in the enzyme aldehyde dehydrogenase.

    • The point I’m making is that having a diverse sample makes external generalizations easier, but sometimes diverse samples are not necessary.

Stimulus characteristics

  • Things like settings. For example, a particular therapist out of several has uncommon talents, or other features of the experiment may affect outcomes.

Contextual characteristics.

  • Conditions in which the experiment is embedded.

    1. Reactivity of experimental arrangement – the effect of the fact that a participant knows she is being studied.

    2. Multiple treatment interference – if subjects are exposed to more than one treatment, multiple treatment interference refers to drawing conclusions about a given treatment in the context of other treatments.

    3. Novelty effects – the newness of some intervention is the source of change, not the intervention itself. Brake lights used to be located only near the trunk. In the late 1970s some evidence showed that putting a brake light higher up, near the back window, reduced rear-end collisions, and studies supported this. But now that nearly all cars have that higher brake light near the back window, its effectiveness has waned.

Assessment characteristics

  1. Reactivity of assessment – the extent to which people show change because they know they’ve been assessed.

  2. Test sensitization – does a pretest change a person’s amenability to treatment?

  3. Timing of measurement and treatment effects – the timing of measurement may affect whether an intervention appears effective. For example, let’s say you perform an experiment where one group is treated a particular way, and then you immediately measure their score on some survey. The immediacy of that measurement may not allow for generalizability because, had you waited an hour, for example, the treatment effect might have gone away.

Construct Validity

Construct validity addresses the presumed cause, that is, the explanation of the causal relation. Is the explanation or interpretation of the investigator plausible? Is the reason for the relation between the intervention and behavior change due to the construct (explanation, interpretation) given by the investigator? Answers to these questions focus specifically on construct validity (Kazdin, 1992).

Threats

Attention and contact

  • Imagine a perfect study where all internal threats have been handled well. And let’s say the study is testing the effectiveness of a medicine against a control group given a placebo. Let’s further say that the drug absolutely has an effect. But in the administration of the treatment, the doctor spends time and attention discussing the effect of the drug and how the medical effects impact critical biological processes.

    • In this example, both the drug and the doctor’s attention and contact with the participant will have an impact on the behavior being measured. So there are actually two constructs at play, and if you don’t separate them they become confounded, such that the effects of the actual medicine may be overestimated.

Single operations and narrow stimulus sampling

  • These threats to construct validity often have in common the notion of “interaction”. Imagine a therapist giving two different treatments to two different groups, and let’s say one treatment (treatment A) is shown to be superior to the other. The conclusion would be that treatment A, and not the therapist, was the main factor. This appears logical because you could say that both groups were treated by the same therapist, and any therapist effects would cancel out.

    • The problem is that there may be an interaction between therapist and treatment. For example, maybe the therapist feels more comfortable giving treatment A, whereas the therapist may feel reluctant to give treatment B. As a consequence, this interaction between therapist and treatment may be the reason that the group receiving treatment A improved. This is a muddled, confounded construct.

Experimenter Expectancy

  • These show up because experimenters are human and can be biased against certain interventions or certain people, or in favor of certain treatments and people. As a consequence of these unintentional biases, we as experimenters can influence our participants in very subtle ways. It is often to address this set of threats that double-blind experiments are used: the experimenter, who may hold unrecognized biases, is blind to which subjects receive which treatment, and the second blind is that the participants themselves do not know which group they’ve been assigned to.

Cues of the experimental situation aka demand characteristics

  • Demand characteristics are interesting and are often useful fodder for experiments themselves. For instance, are there any forms of communication that could be picked up by participants prior to the study that would bias their performance in that study? This could be as subtle as the instructions given on a survey, smiles from research assistants, or rumors from other participants who completed the study. The bottom line is that demand characteristics represent avenues for influencing participants in ways one might think are irrelevant.

Statistical Conclusion Validity

Statistical tests are necessary because they help researchers determine whether differences between groups are real or whether they are due to chance. Consider a simple example.

Essentially, if you flip a coin 10 times and find that heads came up 7 times, you may wonder whether or not the coin is fair. It turns out that statistics can tell us the likelihood of flipping a coin 10 times and getting 7 or more heads: about 17% of the time (see the p-value below).

binom.test(n=10,x=7,alternative = 'greater')
## 
##  Exact binomial test
## 
## data:  7 and 10
## number of successes = 7, number of trials = 10, p-value = 0.1719
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
##  0.3933758 1.0000000
## sample estimates:
## probability of success 
##                    0.7

What this means for my silly coin-flip example is that an event that seems relatively rare, like 7 out of 10 coin tosses coming up heads, actually happens about 17% of the time: far from never, but also well short of half the time.

When you design an experiment, ultimately you’re going to have some statistical summary of that effort, and it might be as simple a measurement as: did it work, yes or no?

If you treat the counts of yes’s and no’s like heads and tails, statistics can tell you whether the number of yes’s versus the number of no’s would be expected by chance. A hypothetical example follows below.
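
For instance, a minimal sketch with made-up counts: suppose 14 of 20 participants said the intervention worked. The same binom.test used above can report whether that proportion of yes’s is larger than the 50/50 split chance would predict.

# Hypothetical counts: 14 "yes" out of 20 participants; is that more than chance (p = .5)?
binom.test(x = 14, n = 20, p = 0.5, alternative = 'greater')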

So statistical conclusion validity speaks to the way you use quantitative evaluation and whether or not you use it appropriately in your final analysis.

Threats

Low statistical power

  • Power is the ability of a test to detect a difference. More specifically, power is the probability of rejecting the null hypothesis when the null hypothesis is false. The threats to statistical validity listed below all contribute to weak power, but in this particular section it’s best to discuss what we mean by power in relation to sample size and effect size.

  • At an intuitive level, effect size is an effort to quantify the size of some factor’s influence on a dependent variable relative to other factors. My favorite example in class is to talk about the effects that gravity and friction have on a falling object. If the dependent variable is the time it takes for an object to fall 10 feet, and you were to compare objects such as a balled-up, crumpled piece of paper versus a hammer, I’m sure you would not be surprised that the hammer hits the ground a bit sooner than the paper. So the question of effect size is “what are all of the factors of influence?” as well as “which factors are more important?”

    • In this case, the factor with the largest effect size is gravity. Without gravity, nothing falls. A smaller effect that is absolutely real, but smaller, is air friction: the atmosphere, the air in our classroom, will slow down the falling crumpled piece of paper a bit. In other words, friction is not nothing, but compared to gravity it’s small.

  • If you have large effects, then you don’t need a whole lot of statistical power. Imagine you were in a dark room looking for a tiny diamond earring. Statistical power would be the brightness of your flashlight. Because the diamond is small, you may need a relatively powerful flashlight to detect it. On the other hand, if you were looking for a refrigerator, you might just need the glow of your smartphone.

  • In other words, small effect sizes require massive amounts of statistical power to detect differences, and the reason is that small effect sizes can be masked by statistical noise. To use our diamond example: you may be looking for the diamond, but some glitter from your makeup, or your kid’s art project, will also reflect a little bit of light, and so now you’ve got a lot of background noise that makes it hard to detect the diamond.

  • If you’re looking for a huge factor with a huge effect size like a refrigerator in a dark room, the room could be filled with washing machines and dishwashers and diamonds and glitter but the refrigerator is so huge you’re going to have an easy time finding it.

  • The most well-known way to increase statistical power is to increase sample size. Having said that, some researchers (e.g., McElreath, 2020) lament that the ease with which we are now able to collect data may discourage researchers from actually thinking through and being clear about their scientific proposals. Being clear about your proposal, constructs, and DAGs may actually obviate the need to increase sample sizes.

  • The other way to change power is to adjust the alpha level of your test. This is a bit of a controversial thing to try. The standard alpha level in psychology is .05. What this means is that you have set up your experiment such that you’re willing to be wrong (to claim an effect that is really just chance) 5% of the time. If you’re willing to be wrong 30% of the time, you would effectively be changing your alpha from .05 to .30, and in doing so you would be more likely to detect differences; however, those differences would be more likely to be due to chance/noise (a small sketch of both sample size and alpha appears after this list).

    • Using the flashlight example and diamonds: let’s say a very strong flashlight is being used. This light will help you distinguish diamonds from glitter 95 times out of 100 (5% alpha). That means that out of 100 sparkles you think are diamonds, 5 are actually some other sparkly thing.
    • Using a cheaper flashlight (changing alpha from .05 to .30), you will still see sparkles (yay! significance!), but now out of 100 sparkles only 70 will be diamonds; 30 will be something other than a diamond.
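
As a rough illustration of the sample-size and alpha points above, here is a minimal sketch using R’s built-in power.t.test; the effect sizes and the .30 alpha are made-up values chosen for teaching purposes.

# Participants needed per group (two-sample t-test, 80% power, alpha = .05)
power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.80)  # small effect: roughly 394 per group
power.t.test(delta = 0.8, sd = 1, sig.level = 0.05, power = 0.80)  # large effect: roughly 26 per group

# Relaxing alpha to .30 lowers the required n, but more "findings" will be chance/noise
power.t.test(delta = 0.2, sd = 1, sig.level = 0.30, power = 0.80)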

Variability in Procedures

  • If your efforts to control internal validity are weak, or your procedures are sloppy, that will introduce noise into your study, making it hard to statistically differentiate real effects from that noise (diamonds vs. sparkles). So the solution is to make sure that your procedures are clean and do not contribute to noise (don’t spill glitter). It would not be wrong to think about this as a ratio, with noise in the denominator and the main effect in the numerator: \(\frac{\text{Effect}}{\color{magenta}{\text{Noise}}}\)

The more noise you have in the denominator, the smaller the overall ratio becomes, approaching 0. The fraction’s denominator gets bigger and bigger, which makes the fraction smaller and smaller (the arrows mean “approach”): \[\frac{\text{Effect}}{\color{magenta}{\text{Noise}\rightarrow\infty}}\color{red}{\rightarrow0\text{ boo!!}}\]

So, if you can reduce the amount of noise in your study, that would mean the main effects will be easier to see: \[\frac{\text{Effect}}{\color{blue}{\text{Noise}\rightarrow0}}\color{teal}{\rightarrow\infty\text{ yesss}}\]

(Of course you’ll never have a perfect study with zero noise and zero variability, so your denominator will never actually reach zero.)
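
To make the effect-over-noise ratio concrete, here is a minimal simulated sketch (the means and SDs are made up): the same true effect produces a large test statistic when noise is low and a much weaker one when noise is high.

# Same true effect (a 5-point mean difference), different amounts of noise
set.seed(7)
low_noise  <- t.test(rnorm(50, mean = 5, sd = 5),  rnorm(50, mean = 0, sd = 5))
high_noise <- t.test(rnorm(50, mean = 5, sd = 25), rnorm(50, mean = 0, sd = 25))

low_noise$statistic   # big t: the effect stands out against small noise
high_noise$statistic  # small t: the same effect is buried in noise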

Subject Heterogeneity

  • Subject heterogeneity is also related to power and effect size. Similar to reducing noise in procedures, making participants more similar on various traits will reduce variability and noise, and thus make it easier for the statistical test to detect a difference, if one is present. Of course, this will reduce generalizability, and that can also be a concern.

  • Another technique would be to use a within-subjects design because, as you should know, it reduces the influence of these individual differences.

Unreliability of measures

  • Again, this is about contributing to noise in the denominator mentioned above. If your measures are unreliable, then the scores they produce will be partially a result of random noise, and that means you’ll have a harder time detecting true differences.

Multiple comparison error rates

  • This particular problem has to do with what’s more recently referred to as “p-hacking”. Here p stands for probability, specifically the p-value. When you look at a data set, there are possibly infinite ways to organize the data. Yes…that’s right. Big ol’ \(\infty\).
    • How? Well, you could combine variables, like ethnicity + income.
    • You could make arbitrary groupings of subjects:
      • Everyone below $10k per year versus everyone above. Or everyone below $10.5k. Or $10.55k…and so on. Do you now see how you can make an infinite number of comparisons? If you make these arbitrary combinations, eventually you will discover a significant effect.
    • This is a common way for unethical researchers to get published. Their original hypothesis didn’t work, but since they have the data, they start making new/different groupings of variables and participants in their data, and voilà, significance!
  • If you’re doing several comparisons in your data, then sometimes you must account for the fact that you’re increasing the chance of finding a random, chance event. In stats, you can learn about this by reading about ‘family-wise error rates’ (a small simulation sketch follows below).
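
As a minimal sketch of why multiple comparisons matter (all of the numbers are made up), the code below runs 20 t-tests on pure noise; on average about one in twenty will look “significant” at .05 by chance alone, and a family-wise correction such as Bonferroni inflates the p-values to account for that.

# 20 comparisons on pure noise: no true effects anywhere
set.seed(1)
p_values <- replicate(20, t.test(rnorm(30), rnorm(30))$p.value)

sum(p_values < .05)                # how many look "significant" purely by chance (expect about 1 in 20)
p.adjust(p_values, "bonferroni")   # family-wise correction adjusts each p-value upward accordingly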

References

Kazdin, A. E. (1992). Research design in clinical psychology (2nd ed.). Allyn and Bacon.

McElreath, R. (2020). Statistical rethinking: A Bayesian course with examples in R and Stan. CRC Press. ISBN: 0429642318.