Hypothesis Testing

2024-11-18

What Are Hypothesis Tests?

Hypothesis tests are a means to test whether a claim regarding a value of interest is valid to a certain degree of statistical certainty.

A good analogy for hypothesis tests is determination of guilt/innocence in the US court system: there is a default assumption (innocence) that must be overcome via presentation of overwhelming evidence (of guilt) beyond a threshold of “reasonable doubt”. Hypothesis tests are structured in a similar fashion.

Hypo Test Basics

As in the court analogy, hypothesis tests start with two complementary statements or conditions:

Default/Null hypothesis ($H_{0}$): analogous to “innocence”, it’s the condition that holds when there is insufficient evidence to suggest otherwise
Alternative hypothesis ($H_{1}$): analogous to “guilt”, it’s the condition that the presentation of evidence is meant to “prove”

Each hypothesis test uses a test level, $\alpha$ (0 < $\alpha$ < 1), to define the standard of “reasonable doubt” beyond which the statistical evidence presented is considered sufficient to reject the default condition and “prove” the alternative.

NOTE: Hypothesis tests DO hold up in court, but should NOT be considered as proving “absolute” truths, since by their very nature they include the possibility of error

α and Error Types

Since we cannot prove that our conclusions are 100% true using a hypothesis test, we are looking to minimize errors when drawing conclusions with them.

Types of error:

$H_{0}$ concluded, $H_{0}$ is “truth”: Not an error!
$H_{0}$ concluded, $H_{1}$ is “truth”: Using our analogy, this is the case of being declared innocent when actually guilty in court, or a “false negative” (also called Type II error)
$H_{1}$ concluded, $H_{0}$ is “truth”: Again with our analogy, this is the case of being declared guilty when actually innocent, or a “false positive” (also called Type I error)
$H_{1}$ concluded, $H_{1}$ is “truth”: Not an error!

As in criminal justice, we are looking to avoid “declaring guilt when innocent”, and so we set our $\alpha$ level to represent our tolerance for making those (Type I) errors.
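The link between $\alpha$ and the Type I error rate can be checked with a small simulation. This is a minimal sketch, not part of the original deck: it assumes an upper-tail, one-sample test of the mean with a hypothetical target mean (`mu0`), hypothetical population standard deviation, and arbitrary seed and simulation count. Every sample is generated with the null hypothesis true, so each rejection is a false positive.

```r
# Simulate many upper-tail one-sample tests where H0 is actually true,
# then count how often H0 is (wrongly) rejected.
set.seed(42)          # arbitrary seed, for reproducibility only
alpha  <- 0.05        # tolerated Type I error rate
mu0    <- 10          # hypothetical target mean (H0 is true here)
n      <- 30          # sample size per simulated test
n_sims <- 5000        # number of simulated tests (arbitrary)

rejected <- replicate(n_sims, {
  y  <- rnorm(n, mean = mu0, sd = 2)          # data generated under H0
  z0 <- (mean(y) - mu0) / (sd(y) / sqrt(n))   # upper-tail test statistic
  pnorm(z0) > (1 - alpha)                     # TRUE means H0 was rejected
})

type1_rate <- mean(rejected)  # observed share of false positives
type1_rate
```

The observed rate should land close to $\alpha$. With n = 30 and the sample standard deviation standing in for the true one, it may run slightly above $\alpha$.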

Setting $\alpha$

The value of $\alpha$ should always be set ahead of a test, and preferably even before gathering the data for the test, if possible. The magnitude depends greatly on the subject of the test and the “user” of the test’s tolerance for risk.

The conventional default is $\alpha$ = 0.05 (5% risk of Type I error).

Looser or tighter tolerances may be necessary in different circumstances. For instance, when determining whether a drug with potentially dangerous side effects shows better outcomes against a certain disease than a placebo (or a currently accepted treatment), $\alpha$ = 0.01 or even $\alpha$ = 0.001 may be appropriate!
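To get a feel for what a tighter $\alpha$ demands, one can look at how far out on the standard normal distribution a 1-sided test statistic has to fall before the null hypothesis is rejected. The sketch below uses base R's `qnorm` (the inverse of `pnorm`); the variable names are illustrative only.

```r
# 1-sided critical values of the standard normal for several alpha levels
alphas <- c(0.05, 0.01, 0.001)
z_crit <- qnorm(1 - alphas)   # statistic must exceed this to reject H0

data.frame(alpha = alphas, z_critical = round(z_crit, 3))
# z_critical: 1.645, 2.326, 3.090
```

Moving from $\alpha$ = 0.05 to $\alpha$ = 0.001 nearly doubles the distance (in standard errors) that the sample mean must sit from the target before the evidence counts as sufficient.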

Types of Tests

There are multiple hypothesis test types, each of which serves a different purpose:

Tests for the mean - One can test a sample against a known historical population or target mean (1-sample test of the mean), or against the mean of another sample (2-sample test)
Tests of proportions - One can test a sample against a known value (1-sample test), or against the proportional outcome of another sample (2-sample test). (Note: these tests typically require sample sizes of 30 or greater)
Tests of variance - As above, one can test a sample’s variance against a target or known variance, or test against the variance of another sample

In addition, each of the tests above must be further characterized as 1-sided or 2-sided, with 2-sided tests simply determining if the values tested are statistically “different”, while 1-sided tests determine if one value is greater than or less than the other to the desired minimum level of statistical significance.

We will concern ourselves the rest of the way only with 1-sample tests of the mean.

One Sample Test of the Mean

Use the one-sample test option to test against an historical process mean or a target condition.

When setting up your hypotheses, remember that, as in court, it’s invalid logic to attempt to “prove a negative.”

Choose one-sided or two-sided
- 1-sided: when attempting to prove the sample has a higher or lower mean relative to the standard
- 2-sided: when it’s sufficient to prove the sample mean is simply “different” (falls to either side of the standard)
Determine the planning level for $\alpha$ (typically $\alpha$ = 0.05)
Collect/identify data

One Sample Test, Continued

Set up hypotheses and test conditions (assumes $n\ge30$):

	1-sided (Lower Tail)	1-sided (Upper Tail)	2-sided
$H_{0}$:	$\mu$ $\geq$ $\mu_{0}$	$\mu$ $\leq$ $\mu_{0}$	$\mu$ $=$ $\mu_{0}$
$H_{1}$:	$\mu$ $<$ $\mu_{0}$	$\mu$ $>$ $\mu_{0}$	$\mu$ $\neq$ $\mu_{0}$
$\text{Test Statistic}$:	$Z_{0} = \frac{\mu_{0}-\overline{y}}{\frac{s}{\sqrt{n}}}$	$Z_{0} = \frac{\overline{y} - \mu_{0}}{\frac{s}{\sqrt{n}}}$	$Z_{0} = \left|\frac{\overline{y} - \mu_{0}}{\frac{s}{\sqrt{n}}}\right|$
$\text{Comparison}$:	$P(Z_{0}) > (1-\alpha)$	$P(Z_{0})>(1-\alpha)$	$P(Z_{0}) > (1-\frac{\alpha}{2})$

$\mu$ is the “true population mean”, $\mu_{0}$ is the target mean, $Z_{0}$ is the location input into the cumulative distribution function (P) of the standard normal distribution, $\overline{y}$ is the sample mean, s is the sample standard deviation, n is the sample size, and $\alpha$ is the significance level for the test.
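The three columns of the table can be folded into one small helper. This is a minimal sketch under the table's own assumptions (large sample, standard normal reference distribution); the function name `one_sample_z`, its arguments, and the example numbers are hypothetical and do not come from any package.

```r
# One-sample test of the mean, following the table column by column.
# y_bar: sample mean, s: sample sd, n: sample size, mu0: target mean.
one_sample_z <- function(y_bar, s, n, mu0, alpha = 0.05,
                         type = c("lower", "upper", "two-sided")) {
  type <- match.arg(type)
  se <- s / sqrt(n)                         # standard error of the mean
  z0 <- switch(type,
    "lower"     = (mu0 - y_bar) / se,       # 1-sided, lower tail
    "upper"     = (y_bar - mu0) / se,       # 1-sided, upper tail
    "two-sided" = abs(y_bar - mu0) / se     # 2-sided
  )
  # 2-sided tests split alpha across both tails
  threshold <- if (type == "two-sided") 1 - alpha / 2 else 1 - alpha
  list(z0 = z0, p_z0 = pnorm(z0), threshold = threshold,
       reject_h0 = pnorm(z0) > threshold)
}

# Hypothetical numbers: sample mean 52, s = 6, n = 36, target mean 50
res <- one_sample_z(y_bar = 52, s = 6, n = 36, mu0 = 50, type = "upper")
res$z0         # (52 - 50) / (6 / sqrt(36)) = 2
res$reject_h0  # TRUE, since pnorm(2) is about 0.977 > 0.95
```

Returning the statistic, the cumulative probability, and the threshold together makes it easy to see why a given test did or did not reject.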

One Sample Test Conclusions

For a 1-sided test, if the result of the comparison is FALSE, we can only state that we “fail to reject the null hypothesis”, which essentially means that there is insufficient evidence of a difference. Conversely, if the comparison is TRUE, i.e. $P(Z_{0})>(1-\alpha)$, then we “reject the null hypothesis with (1- $\alpha$ )*100 percent confidence in $H_{1}$”. So, if $\alpha$ is 0.05, we would have 95% confidence in $H_{1}$, which is the minimum confidence the result would provide.

For a 2-sided test, the statements are the same, just note the differences in the test statistic and the comparison value.

In R, it’s possible to determine the cumulative probability for the standard normal distribution at $Z_{0}$ using the base R pnorm function and perform the comparison, for example if $Z_{0}$ is 2.5 and $\alpha$ = 0.05 (1-sided test):

alpha <- 0.05
zee0 <- 2.5
pnorm(zee0) > (1-alpha)

## [1] TRUE

Since the result of the comparison is TRUE, we would reject the null hypothesis with 95% confidence!
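The same base R approach extends to the 2-sided comparison, where only the threshold changes. As a sketch, assume a hypothetical test statistic of 1.8 (already in absolute value) and $\alpha$ = 0.05:

```r
alpha <- 0.05
zee0  <- 1.8   # hypothetical test statistic, already in absolute value

one_sided <- pnorm(zee0) > (1 - alpha)      # threshold is 0.95
two_sided <- pnorm(zee0) > (1 - alpha / 2)  # threshold is 0.975

c(one_sided = one_sided, two_sided = two_sided)
# one_sided is TRUE, two_sided is FALSE (pnorm(1.8) is about 0.964)
```

The same evidence can therefore be sufficient for a 1-sided claim and insufficient for a 2-sided one, which is why the choice of test should be made before looking at the data.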

Visual of 1-Sided, 1-Sample Test

When we perform a 1-sided (lower tail), 1-sample test, we compare the cumulative probability (area under the probability density function) to the left of $Z_{0}$ with $1-\alpha$, where $\alpha$ is our tolerated probability of a Type I error. For the example in the previous slide, the area represented by pnorm(zee0) is shown visually in green below:

Visual of 2-Sided, 1-Sample Test

For the 2-sided test option, we are only concerned with whether the sample mean is “different” from the standard, hence the use of absolute value when computing the test statistic. This also means that the acceptable risk $\alpha$ is split evenly between the lower and upper tails of the distribution. Since the risk is split, we compare the cumulative probability at $Z_{0}$ with $1-\frac{\alpha}{2}$ rather than $1-\alpha$, and the absolute value lets us visualize this on the upper tail of the distribution, as seen below:

Practical Example: 90’s Action!

You and a buddy (both in your mid-40s) are sitting at the bar during a data science conference, and the conversation turns to movies. Your friend has been consuming some adult beverages, and with the pontificating voice of a man deeply in his cups he states, “I’ll tell you man, action movies in the 90s were just DIFFERENT, you know? They were just so much better then than at any other point in history! I’ll bet you could just watch some random ones and see the difference!!” He then randomly high-fives the woman next to him, who smirks, looks at you, and rolls her eyes (behind your buddy’s back).

You’ve been pacing yourself, so you know he’s probably full of hot air, but you have your laptop and therefore access to the Movie Database and a means to test his “hypothesis”.

90’s Action, Continued!

The first thing you want to do is look at action movie ratings across the years to see if he’s even in the ballpark…

90’s Action, Continued…

Based on the histogram, you’re pretty sure he’s full of it, but you figure you’ll indulge him, since he was at least sober enough to claim you would know just from a random sample. So, you set things up to try it…

$H_{0}$: $\mu_{90s}\le\mu_{all}$

$H_{1}$: $\mu_{90s}>\mu_{all}$

TS: $Z_{0} = \frac{\overline{y} - \mu_{all}}{\frac{s}{\sqrt{n}}}$

Since you’re feeling good, you decide to only test his hypothesis to 90% certainty, so $\alpha$ = 0.1

You’ve already subset the database by genre for just such an “emergency”, so it’s easy to determine the all-action ratings average: mean(action$rating) = 5.2920222

So you pick thirty random 90s movies, since we have to be “responsible” about sample size if not adult beverage consumption, and get to work.
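The sampling and test setup might look roughly like the sketch below. Important assumption: the `action` data frame here is a SYNTHETIC stand-in generated with random numbers, with hypothetical `year` and `rating` columns, so that the code runs on its own. It is not the real movie data, and its output will not match the numbers on these slides.

```r
set.seed(1990)  # arbitrary seed so the sketch is repeatable

# SYNTHETIC stand-in for the real `action` data frame (hypothetical values)
action <- data.frame(
  year   = sample(1950:2005, 2000, replace = TRUE),
  rating = round(pmin(pmax(rnorm(2000, mean = 5.3, sd = 1.5), 1), 10), 1)
)

alpha    <- 0.1
mu_all   <- mean(action$rating)                     # all-action average
nineties <- action[action$year >= 1990 & action$year <= 1999, ]
samp     <- nineties[sample(nrow(nineties), 30), ]  # 30 random 90s films

y_bar <- mean(samp$rating)                 # sample mean
s     <- sd(samp$rating)                   # sample standard deviation
n     <- nrow(samp)                        # sample size
z0    <- (y_bar - mu_all) / (s / sqrt(n))  # upper-tail test statistic

pnorm(z0) > (1 - alpha)                    # TRUE would mean: reject H0
```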

90’s Action, the Conclusion!

First 5 90s Action Films Sampled
title	year	rating
Tai ji zhang san feng	1993	7.1
I Love Trouble	1994	5.1
Last Action Hero	1993	5.4
Johnny Mnemonic	1995	4.7
Muro-ami	1999	5.5

TS = -0.4131594

Comparison: Is 0.3397449 > 0.9? –> FALSE

Unfortunately for your buddy, the evidence doesn’t overcome the default assumption, and you cannot reject the null hypothesis… there is insufficient evidence that the 90s had better action movies on average.
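The comparison can be reproduced from the reported test statistic alone, using the same `pnorm` approach as before. Only the variable names are assumptions; the numbers are the ones shown on this slide.

```r
alpha <- 0.1
ts    <- -0.4131594    # test statistic reported above
p_ts  <- pnorm(ts)     # cumulative probability to the left of ts

p_ts                   # about 0.3397
p_ts > (1 - alpha)     # FALSE: fail to reject H0
```

A negative test statistic means the sampled 90s films actually averaged slightly below the all-action mean, so an upper-tail test had no chance of rejecting the null hypothesis here.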
