These threats can overlap, so it is common for a single issue in an experiment to threaten several types of validity at once.
“… addressed by experimental arrangements that help rule out or make implausible factors that could explain the effects we wish to attribute to the intervention” (Kazdin, 1992)
Kazdin is saying that we want to render alternative explanations implausible. We want to set up the experiment so that it effectively neutralizes their impact, leaving the intervention as the only logical explanation of the effect.
Internal validity (IV) is the priority, both because our conclusions rest on it and because it logically comes before external validity. One should first establish that a finding is real before worrying about whether it generalizes (Kazdin, 1992).
This statistical property refers to the tendency for extreme (and therefore rare) events to be followed by events that are closer to the average.
A classic case of statistical regression comes from athletics (though it is often described in the context of fighter pilot training schools). An athlete performs really poorly, and this represents the extreme event. The coach yells at the athlete and berates them, thinking this will motivate them to play better. The next game, the player plays much better. Why?

Statistically, we expect that after a rare event, performance will track more closely to the average. So the athlete playing ‘better’ just means they played closer to their average performance. No yelling, no hugging, no other coaching factor made a difference; the player simply had a random bad day.

But the coach sees the improvement, (mis-)attributes it to their motivational style (yelling), and then continues to yell at players having bad games.
As an example, say you wanted to provide some kind of treatment or intervention to help reduce stress. If you gave a survey measuring stress, you might be most interested in the students who scored as the most stressed. If you included them in your study, provided the treatment, and then re-measured their stress levels a week later, regression to the mean would predict that those students would drift back toward their baseline level of stress regardless of the treatment (this is why we need control groups).
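To make this concrete, here is a minimal simulation sketch of regression to the mean. All of the numbers (means, standard deviations, the top-10% cutoff) are made up for illustration; the point is only that the group selected for being extreme looks "better" at re-measurement with no treatment at all.

# each "student" has a stable baseline stress level plus day-to-day measurement noise
set.seed(42)
baseline <- rnorm(1000, mean = 50, sd = 5)           # hypothetical true stress levels
day1 <- baseline + rnorm(1000, mean = 0, sd = 10)    # noisy measurement on day 1
day2 <- baseline + rnorm(1000, mean = 0, sd = 10)    # noisy re-measurement later

# select the most "stressed" students on day 1 (top 10%); note: no treatment is given
extreme <- day1 > quantile(day1, 0.90)
mean(day1[extreme])   # very high, partly because we selected on this measurement
mean(day2[extreme])   # closer to the overall average, with no intervention at all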
It’s all about generalizability. Do the results from your study extend beyond your sample? Can you make statements about larger populations?
You can’t generalize to groups that don’t share the same traits, such as age, gender, ethnicity, or education. This is what most introductory research students see as the major threat. Sometimes it is, sometimes not. For instance, since human beings share about 99.9% of their genetic makeup (the differences are rare and minor, e.g., melanin in skin), a case can be made that biological research might not always need diverse samples of humans. On the other hand, there may well be important differences that can be flagged, or tracked by proxy, through ethnicity. For instance, heart disease among African Americans is a known problem for several possible reasons: economic forces may lead to poorer diets, but there is also evidence of a genetic variant that makes salt consumption more problematic. Similarly, East Asians experience alcohol intolerance at a much higher rate than the world population because of a less active variant of the enzyme aldehyde dehydrogenase.
These threats concern the conditions in which the experiment is embedded:
Reactivity of experimental arrangement – the effect of a participant knowing that she is being studied.
Multiple treatment interference – when subjects are exposed to more than one treatment, the difficulty of drawing conclusions about a given treatment delivered in the context of the other treatments.
Novelty effects – the newness of an intervention, rather than the intervention itself, is the source of change. Brake lights used to sit only down near the trunk. In the late 70s, evidence showed that adding a brake light up near the back window reduced rear-end collisions, and studies supported this. But now that nearly all cars have the higher, center-mounted brake light, its effectiveness has waned.
Reactivity of assessment – the extent to which people show change because they know they are being assessed.
Test sensitization – does the pre-test change a person’s amenability to the treatment?
Timing of measurement and treatment effects – the timing of measurement may determine whether an intervention appears effective. For example, suppose one group is treated a particular way and you immediately measure their score on some survey. The immediacy of that measurement may limit generalizability because, had you waited an hour, the treatment effect might already have faded.
Construct validity addresses the presumed cause, or the explanation, of the causal relation. Is the investigator’s explanation or interpretation plausible? Is the relation between the intervention and the behavior change due to the construct (the explanation, the interpretation) offered by the investigator? Answers to these questions are the specific focus of construct validity (Kazdin, 1992).
Imagine a perfect study in which all internal threats have been handled well. Let’s say the study tests the effectiveness of a medicine against a control group given a placebo, and let’s further say the drug absolutely has an effect. But in administering the treatment, the doctor spends extra time and attention discussing the drug and how its effects impact critical biological processes. Now the observed improvement may be due partly to the drug and partly to that attention and raised expectation, so attributing the effect entirely to the drug’s pharmacology is a construct validity problem.
These threats to construct validity often have in common the notion of “interaction”. Imagine a therapist giving two different treatments to two different groups, and let’s say one treatment (treatment A) is shown to be superior to the other. The conclusion would be that treatment A, and not the therapist, was the main factor. This appears logical because both groups were treated by the same therapist, so any therapist effects should cancel out. But suppose the therapist believes more strongly in treatment A and delivers it with more enthusiasm; that interaction between therapist and treatment, rather than treatment A itself, could account for the difference.
Statistical tests are necessary because they help researchers determine whether differences between groups are real or whether they could be due to chance. Consider a simple example.
Essentially, if you flip a coin 10 times and heads comes up 7 times, you may wonder whether or not the coin is fair. It turns out that statistics can tell us how likely it is to toss a fair coin 10 times and get 7 or more heads: about 17% of the time (see the p-value below).
binom.test(x = 7, n = 10, alternative = "greater")
##
## Exact binomial test
##
## data: 7 and 10
## number of successes = 7, number of trials = 10, p-value = 0.1719
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
## 0.3933758 1.0000000
## sample estimates:
## probability of success
## 0.7
What this means for my silly coin-flip example is that an event that seems relatively rare, like 7 (or more) heads out of 10 tosses, actually happens by chance alone about 17% of the time: not never, but also nowhere near 50% of the time.
When you design an experiment, ultimately you’re going to have some statistical summary of that effort. And it might be as simple a measurement as whether it worked: yes or no?
If you treat the yes’s and no’s like heads and tails, the same kind of statistical test can tell you whether the observed count of yes’s versus no’s would be expected by chance. The code below sketches this.
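As a sketch, suppose (hypothetically) that 14 of 20 participants said the treatment worked. The counts here are made up, but the same exact binomial test from above asks whether that split is surprising if yes/no were really a 50/50 coin flip.

# hypothetical counts: 14 "yes" responses out of 20 participants
# two-sided test of whether the true probability of "yes" differs from 0.5
binom.test(x = 14, n = 20, p = 0.5)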
So statistical conclusion validity speaks to the way you use quantitative methods and whether or not you apply them appropriately in your final analysis.
Power is the ability of a test to detect a difference. More specifically, power is the probability of rejecting the null hypothesis when the null hypothesis is false. The threats to statistical conclusion validity discussed below all contribute to weak power, but in this section it is best to discuss what we mean by power in relation to sample size and effect size.
At an intuitive level, effect size is an effort to quantify the size of some factor’s influence on a dependent variable relative to other factors. My favorite example in class is the effect that gravity and friction (air resistance) have on a falling object. If the dependent variable is the time it takes an object to fall 10 feet, and you compare objects such as a balled-up, crumpled piece of paper versus a hammer, I’m sure you would not be surprised that the hammer hits the ground a bit faster than the paper. So the questions behind effect size are “what are all of the factors of influence?” and “which factors are more important?”
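To put a number on this idea, here is a rough sketch of one common standardized effect-size measure, Cohen’s d: the difference between two group means divided by their pooled standard deviation. The scores below are made up purely for illustration.

# hypothetical scores for two groups
group_a <- c(12, 15, 14, 10, 13, 16)   # e.g., treatment
group_b <- c(9, 11, 10, 8, 12, 10)     # e.g., control

# pooled standard deviation, then Cohen's d = mean difference / pooled SD
pooled_sd <- sqrt(((length(group_a) - 1) * var(group_a) +
                   (length(group_b) - 1) * var(group_b)) /
                  (length(group_a) + length(group_b) - 2))
(mean(group_a) - mean(group_b)) / pooled_sd
# Cohen's conventional labels: about 0.2 = small, 0.5 = medium, 0.8 = large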
If you have large effects, then you don’t need a whole lot of statistical power. Imagine you were in a dark room looking for a tiny diamond earring. Statistical power would be the brightness of your flashlight. Because the diamond is small, you may need a relatively powerful flashlight to detect it. On the other hand, if you were looking for a refrigerator, you might just need the glow of your smartphone.

In other words, small effect sizes require large amounts of statistical power to detect, and the reason is that small effects can be masked by statistical noise. To use our diamond example: you may be looking for the diamond, but some glitter from your makeup, or from your kid’s art project, will also reflect a little bit of light, and now you have a lot of background noise that makes it hard to detect the diamond.

If you’re looking for a factor with a huge effect size, like a refrigerator in a dark room, the room could be filled with washing machines and dishwashers and diamonds and glitter, but the refrigerator is so big you’re going to have an easy time finding it.
The most well-known way to increase statistical power is to increase sample size. Having said that, some researchers (e.g., McElreath, 2020) lament that the ease with which we can now collect data may discourage researchers from actually thinking through and being clear about their scientific proposals. Being clear about your proposal, constructs, and DAGs may actually obviate the need to increase sample sizes.
The other way to change power is to adjust the alpha level of your test, which is a bit controversial. The standard alpha level in psychology is .05, which means you have set up your experiment such that you are willing to be wrong (falsely declare a difference) 5% of the time. If you were willing to be wrong 30% of the time, you would effectively change your alpha from .05 to .30, and in doing so you would be more likely to detect differences; however, those “differences” would be more likely to be due to chance/noise.
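A hedged sketch of these trade-offs using base R’s power.t.test() (a standard two-sample t-test power calculation; the effect sizes and power level below are just illustrative choices, not recommendations):

# large effect (delta = 0.8 SD): relatively few participants per group are needed
power.t.test(delta = 0.8, sd = 1, sig.level = 0.05, power = 0.80)

# small effect (delta = 0.2 SD): far more participants per group are needed
power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.80)

# loosening alpha to .30 shrinks the required sample size,
# but at the cost of many more chance findings
power.t.test(delta = 0.2, sd = 1, sig.level = 0.30, power = 0.80)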
The more noise you have in the denominator, the more the overall ratio shrinks toward 0. The fraction’s denominator gets bigger and bigger, which makes the fraction smaller and smaller (the arrows mean “approaches”): \[\frac{\text{Effect}}{\color{magenta}{\text{Noise}\rightarrow\infty}}\color{red}{\rightarrow0\text{ boo!!}}\]

So, if you can reduce the amount of noise in your study, the effect will be easier to see: \[\frac{\text{Effect}}{\color{blue}{\text{Noise}\rightarrow0}}\color{teal}{\rightarrow\infty\text{ yesss}}\]

(Of course, you will never have a perfect study with zero noise and zero variability, so your denominator will never actually approach zero.)
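Here is a toy simulation of that ratio in action. The numbers are made up: the true difference between the groups (the “effect”) is the same in both cases, and only the standard deviation (the “noise”) changes.

set.seed(1)
# same 2-point true difference in means, low noise vs. high noise
low_noise_a  <- rnorm(30, mean = 0, sd = 1)
low_noise_b  <- rnorm(30, mean = 2, sd = 1)
high_noise_a <- rnorm(30, mean = 0, sd = 20)
high_noise_b <- rnorm(30, mean = 2, sd = 20)

t.test(low_noise_a,  low_noise_b)$p.value   # tiny p-value: the effect stands out
t.test(high_noise_a, high_noise_b)$p.value  # likely far from significant: same effect, drowned in noise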
Subject heterogeneity is also related to power and effect size. Just as standardizing procedures reduces noise, making participants more similar on various traits reduces variability, and thus makes it easier for the statistical test to detect a difference if one is present. Of course, this also reduces generalizability, which can be a concern.
Another technique is to use a within-subjects design because, as you should know, it reduces the influence of these individual differences (see the sketch below).
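A minimal simulated sketch of why that helps (all numbers are made up): each person has their own baseline, and the treatment adds a small constant improvement. The paired analysis uses each person as their own control, so the large person-to-person differences no longer count as noise.

set.seed(7)
baseline <- rnorm(25, mean = 100, sd = 15)          # large individual differences
pre  <- baseline + rnorm(25, mean = 0, sd = 5)
post <- baseline + 5 + rnorm(25, mean = 0, sd = 5)  # small true treatment effect

t.test(pre, post)                 # between-subjects style: individual differences count as noise
t.test(pre, post, paired = TRUE)  # within-subjects: the same data, with that noise removed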
Kazdin, A. E. (1992). Research Design in Clinical Psychology (2nd ed.). Allyn and Bacon.

McElreath, R. (2020). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press. ISBN: 0429642318.