Introduction to Hypothesis Testing

2026-02-07


The framework behind important decision making

Yuefan Ma, 2/7/2026

What Is Hypothesis Testing?

Hypothesis testing is a standard statistical procedure that allows us to:
- test claims about a population parameter
- make data-driven decisions
- quantify uncertainty in conclusions

The core idea is that we compare the observed data with what we would expect to see if a starting hypothesis were true.

The Two Hypotheses

In every hypothesis test, we have two competing hypotheses:

Null hypothesis

- The default assumption or guess
- Represents no effect or no difference

Alternative hypothesis

- What we want to prove or investigate
- Represents that an effect or difference exists

- Left-tailed (orange dot in the graph): tests whether the parameter is less than the null value, mu < mu0. Example: “Does a drug reduce blood pressure?”
- Right-tailed (blue dot in the graph): tests whether the parameter is greater than the null value, mu > mu0. Example: “Does training increase strength?”
- Two-tailed (black dot in the graph): tests whether the parameter is different from the null value, mu not equal to mu0. Example: “Is there any difference due to the treatment?”

Choosing the Correct Type:
1. Start from the research question: which direction do you expect the effect to go?
2. Check previous studies to see which direction they found.
3. Be conservative: use the two-tailed test if you are uncertain.
4. Choose the type before collecting data.

Statistical Implications:
- One-tailed test: more powerful for detecting effects in the specified direction.
- Two-tailed test: less powerful in any single direction, but more conservative because it looks for effects in both directions.
- Critical values differ between the two tests: at alpha = 0.05, the one-tailed test uses z = 1.645 and the two-tailed test uses z = 1.96.
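The critical values quoted above can be reproduced with qnorm(). This is a minimal sketch, assuming a standard normal test statistic and alpha = 0.05; the variable names are illustrative only.

```r
alpha <- 0.05

# One-tailed test: all of alpha sits in a single tail
z_one_tailed <- qnorm(1 - alpha)

# Two-tailed test: alpha is split evenly between both tails
z_two_tailed <- qnorm(1 - alpha / 2)

cat("One-tailed critical value:", round(z_one_tailed, 3), "\n")
cat("Two-tailed critical value:", round(z_two_tailed, 3), "\n")
```

Because the two-tailed test splits alpha across both tails, its threshold is further from zero, which is why it needs a more extreme statistic to reject.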

The Mathematical Formulation

For a population mean mu:

\[ \begin{aligned} H_0&: \mu = \mu_0\\ H_a&: \mu \neq \mu_0 \quad \text{(two-tailed test)} \end{aligned} \]

or, for a directional (one-tailed) test:

\[ H_a: \mu > \mu_0 \quad \text{or} \quad H_a: \mu < \mu_0 \]

Test Statistic

z-test (when sigma is known) \[ Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \sim N(0,1) \]

t-test (when sigma is unknown and is estimated by the sample standard deviation s) \[ T = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \sim t_{n-1} \]
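As a rough illustration of the two formulas, both statistics can be computed by hand. The sample `x`, the null value `mu0`, and the "known" population standard deviation `sigma` below are hypothetical values chosen only for this sketch.

```r
# Hypothetical sample, used only for illustration
x   <- c(78, 82, 69, 91, 74, 85, 77, 80)
mu0 <- 75            # null value
n   <- length(x)

# z statistic: requires a known population sd (assumed here to be 8)
sigma  <- 8
z_stat <- (mean(x) - mu0) / (sigma / sqrt(n))

# t statistic: uses the sample sd s in place of sigma
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# The hand-computed t statistic should agree with t.test()
t_check <- unname(t.test(x, mu = mu0)$statistic)

cat("z =", round(z_stat, 3), "\n")
cat("t =", round(t_stat, 3), "(t.test gives", round(t_check, 3), ")\n")
```

The only difference between the two is the denominator: the z statistic uses sigma, while the t statistic uses s and is therefore compared against a t distribution with n - 1 degrees of freedom.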

Visualizing the Rejection Region

Figure 1: For a significance level of alpha = 0.05 (two-tailed test), we reject H0 if |Z| > 1.96

This plot shows the decision rule in hypothesis testing. The pink line is the normal distribution curve, which represents the sampling distribution under the null hypothesis; it shows where the test statistic would fall if H0 were true. The blue lines are the critical values, located at z = ±1.96 for alpha = 0.05. They are the thresholds for statistical significance and are calculated with qnorm(1 - alpha/2). The maroon areas are the rejection regions, the extreme tails of the distribution (|Z| > 1.96). If the observed test statistic falls in either of these areas, we reject the null hypothesis. The green area is the non-rejection region, where -1.96 < z < 1.96. If the observed statistic falls in this region, we fail to reject the null hypothesis.

Understanding the p-value

The p-value is the probability of obtaining test results at least as extreme as the observed results, assuming the null hypothesis is true.


Mathematical Expression

For a two-tailed test:

\[ p\text{-value} = 2 \times P(Z \geq |z_{\text{obs}}| \mid H_0) \]

For a one-tailed test:

\[ p\text{-value} = P(Z \geq z_{\text{obs}} \mid H_0) \quad \text{(right-tailed)} \]

\[ p\text{-value} = P(Z \leq z_{\text{obs}} \mid H_0) \quad \text{(left-tailed)} \]

Decision Rule

\[ \text{If } p \leq \alpha: \quad \text{Reject } H_0 \]

\[ \text{If } p > \alpha: \quad \text{Fail to reject } H_0 \]
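The p-value expressions and the decision rule can be sketched in a few lines of R. The helper names `p_value_z` and `decide` and the observed value `z_obs = 2.1` are hypothetical, introduced only for this illustration; the `alternative` labels simply mirror the ones t.test() uses.

```r
# p-value for a z statistic under a standard normal null distribution
p_value_z <- function(z_obs, alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  switch(alternative,
    two.sided = 2 * pnorm(-abs(z_obs)),            # area in both tails
    greater   = pnorm(z_obs, lower.tail = FALSE),  # right-tail area
    less      = pnorm(z_obs)                       # left-tail area
  )
}

# Decision rule: reject H0 when p <= alpha
decide <- function(p, alpha = 0.05) {
  if (p <= alpha) "Reject H0" else "Fail to reject H0"
}

z_obs   <- 2.1   # hypothetical observed statistic
p_two   <- p_value_z(z_obs, "two.sided")
p_right <- p_value_z(z_obs, "greater")

cat("two-tailed p   =", round(p_two, 4), "->", decide(p_two), "\n")
cat("right-tailed p =", round(p_right, 4), "->", decide(p_right), "\n")
```

For a positive observed statistic, the two-tailed p-value is exactly twice the right-tailed one, which is the numerical counterpart of the one-tailed test being more powerful in its chosen direction.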

Visualizing the p-value

Figure 2: The p-value for a two-tailed test is the combined area in both tails beyond the observed statistic

The purple line marks the observed statistic, z = 2.4. The p-value is the dark blue area beyond |z| in both tails, calculated as: p_val <- 2 * (1 - pnorm(abs(z_obs)))

The interpretation: a small p-value means strong evidence against the null hypothesis. Here the p-value is 0.0164, which is below alpha = 0.05, so we reject the null hypothesis. In other words, if the null hypothesis were true, we would see results at least this extreme only about 1.64 percent of the time by random chance.

You decide whether to reject or fail to reject the null hypothesis by checking whether the p-value is at most alpha (reject) or greater than alpha (fail to reject). Alpha is usually set to 0.05.

Statistical Power Analysis (3D)

Figure 3: Statistical power increases with larger effect sizes and larger sample sizes

This 3D surface has three axes. The x-axis shows the sample size (the number of observations). The y-axis shows the effect size (the magnitude of the true effect). The z-axis shows power, the probability of detecting an effect if it exists.

The key pattern to notice is that power increases with a larger sample size as well as with a larger effect size. This is shown by moving right on the x-axis or moving up on the y-axis. Small studies with small effects have low power, shown as the red region. Large studies with large effects have high power, shown as the yellow region.

The practical implications of this surface are:
- Study design: a study needs a sufficient sample size to detect the expected effect.
- Resource planning: larger samples cost more money but provide more reliable results.
- Interpretation: low-power studies may fail to detect real effects.
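The pattern described above can be checked numerically on a small grid. This is only a sketch under stated assumptions: it uses the normal approximation for a two-sided one-sample z-test, which is not necessarily how the figure was produced, and `power_z`, `n_values`, and `d_values` are hypothetical names.

```r
# Approximate power of a two-sided one-sample z-test
# for standardized effect size d and sample size n
power_z <- function(n, d, alpha = 0.05) {
  z_crit <- qnorm(1 - alpha / 2)
  shift  <- d * sqrt(n)   # mean of the test statistic under the alternative
  pnorm(shift - z_crit) + pnorm(-shift - z_crit)
}

n_values <- c(10, 30, 100)
d_values <- c(0.2, 0.5, 0.8)

# Rows are sample sizes, columns are effect sizes
power_grid <- outer(n_values, d_values, power_z)
dimnames(power_grid) <- list(paste0("n=", n_values), paste0("d=", d_values))

print(round(power_grid, 3))
```

Reading down a column (larger n) or across a row (larger d) should show power rising, which matches the move from the red region to the yellow region on the surface.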

Standard Guidelines

Adequate power: >= 0.8

Effect size conventions (Cohen's d):
- Small: d = 0.2
- Medium: d = 0.5
- Large: d = 0.8

Power Calculation Formula

Power depends on:
1. Effect size (d)
2. Sample size (n)
3. Significance level (alpha)
4. Test type (one-tailed vs. two-tailed)
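One way to work with these four inputs in R is power.t.test() from the built-in stats package, which solves for whichever of sample size, effect, or power is left unspecified. The sketch below assumes a two-sample, two-sided t-test with sd = 1 (so that `delta` equals the effect size d); those design choices are assumptions for illustration, not something fixed by the text above.

```r
# Solve for the per-group sample size needed to reach 80% power
# for a medium effect (d = 0.5) at alpha = 0.05
needed <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80,
                       type = "two.sample", alternative = "two.sided")
cat("n per group for d = 0.5:", ceiling(needed$n), "\n")

# Or solve for the power achieved with a fixed sample size
achieved <- power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.05,
                         type = "two.sample", alternative = "two.sided")
cat("power with n = 30 per group:", round(achieved$power, 3), "\n")
```

Switching `alternative` to "one.sided" or changing `sig.level` shows how the fourth and third inputs in the list above move the result.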

Complete R Example: One-Sample t-Test

# Step 1: generate sample data
set.seed(123)  # fix the seed so the results are reproducible
test_scores <- rnorm(n = 50, mean = 76, sd = 13)

# Step 2: state the hypotheses
# H0: mu = 75 (the new method is the same as the old method)
# H1: mu > 75 (the new method is better than the old method)

# Step 3: perform the one-sample, right-tailed t-test
result <- t.test(test_scores, mu = 75, alternative = "greater")

# Step 4: display the summary statistics
cat("Sample mean:", round(mean(test_scores), 2), "\n")
cat("t-statistic:", round(result$statistic, 3), "\n")
cat("p-value:", round(result$p.value, 4), "\n")

# Step 5: decide whether to reject or fail to reject H0 (reject when p <= alpha)
if (result$p.value <= 0.05) {
  cat("Reject H0: evidence of improvement\n")
} else {
  cat("Fail to reject H0: no evidence of improvement\n")
}

Running the Example Code

set.seed(123)
test_scores <- rnorm(n = 50, mean = 76, sd = 13)
result <- t.test(test_scores, mu = 75, alternative = "greater")
cat("ACTUAL OUTPUT:\n")
cat("==============\n\n")
cat("t =", round(result$statistic, 3), "\n")
cat("df =", result$parameter, "\n")
cat("p-value =", round(result$p.value, 4), "\n\n")
if (result$p.value <= 0.05) {
  cat("DECISION: Reject H0 (p <= 0.05)\n")
  cat("CONCLUSION: Evidence supports the improvement\n")
} else {
  cat("DECISION: Fail to reject H0\n")
  cat("CONCLUSION: No significant evidence\n")
}
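Alongside the test decision, it can help to report an effect size and a confidence interval for the same data. The sketch below reuses the simulated `test_scores`; the one-sample form of Cohen's d and the names `mu0`, `cohens_d`, and `ci` are assumptions added for illustration.

```r
set.seed(123)
test_scores <- rnorm(n = 50, mean = 76, sd = 13)
mu0 <- 75

# Cohen's d for a one-sample test: standardized distance from the null value
cohens_d <- (mean(test_scores) - mu0) / sd(test_scores)

# Two-sided 95% confidence interval for the population mean
ci <- t.test(test_scores, mu = mu0, conf.level = 0.95)$conf.int

cat("Cohen's d:", round(cohens_d, 3), "\n")
cat("95% CI for the mean: [", round(ci[1], 2), ",", round(ci[2], 2), "]\n")
```

The effect size does not depend on n in the way the t statistic does (t equals d times the square root of n), so it gives a sense of magnitude that the p-value alone does not.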

Key Takeaways

What We’ve Learned:
1. Framework: Hypothesis testing provides a structured approach to statistical inference.
2. Components: A clear formulation of H0 and H1, a test statistic, and a p-value.
3. Decision: Decisions are based on evidence from the test statistic and a pre-defined alpha level.
4. Visualization: Graphical representations help with understanding.
5. Power analysis: Very important for study design and interpretation.

Common Pitfalls to Avoid:
- Misinterpreting p-values: p < 0.05 does not by itself mean the effect is real.
- Confusing statistical and practical significance: a statistically significant result is not necessarily practically important.
- Ignoring assumptions: the validity of a test depends on its assumptions.
- Multiple testing: conducting many tests increases the Type I error rate.

Best Practices:
1. Pre-specify: set the hypotheses and alpha level before data collection.
2. Check assumptions: make sure the test's assumptions hold before conducting it.
3. Report effect sizes: report the effect size and confidence intervals, not only the p-value.
4. Consider power: when designing a study, check that the sample size gives adequate power to detect the expected effect.
5. Interpret results: read the result in the context of the experiment.
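The multiple-testing pitfall can be made concrete with a short calculation. This sketch assumes the tests are independent and each is run at alpha = 0.05; the Bonferroni adjustment shown is one common remedy, included here only as an illustration.

```r
alpha <- 0.05
m <- c(1, 5, 10, 20)   # number of independent tests

# Probability of at least one false positive across m independent tests
fwer <- 1 - (1 - alpha)^m

# Bonferroni correction: test each hypothesis at alpha / m instead
bonferroni_alpha <- alpha / m

print(data.frame(tests = m, fwer = round(fwer, 3), bonferroni_alpha))
```

With a single test the chance of a false positive equals alpha, but it grows quickly as more tests are run, which is why the per-test threshold has to shrink.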