MATH 343: APPLIED STATISTICS NOTES
1. Introduction to Hypothesis Testing
Imagine a doctor testing a new drug. The existing drug (Drug A) has a 60% success rate. The new drug (Drug B) is more expensive to produce.
The crucial question: Is Drug B significantly better than Drug A, or is its higher observed success rate in a small trial just due to random chance?
Hypothesis testing is the formal, statistical framework we use to answer these kinds of questions. It allows us to make data-driven inferences about a population based on sample data, while quantifying the uncertainty of those inferences.
2. Core Components
A. Null and Alternative Hypotheses
Every hypothesis test sets up two competing claims.
- Null Hypothesis (H₀): Represents the status quo or no effect.
  Examples:
  - “The new drug is no better than the old one.”
  - “The mean height of women is 65 inches.”
  - “The coin is fair.”
- Alternative Hypothesis (H₁ or Hₐ): Represents the effect we want to detect.
  Examples:
  - “The new drug is better than the old one.”
  - “The mean height of women is not 65 inches.”
  - “The coin is biased.”
Formulating Hypotheses:
Two-tailed test:
H₀: μ = k
H₁: μ ≠ k

One-tailed test:
H₀: μ ≤ k, H₁: μ > k
or
H₀: μ ≥ k, H₁: μ < k
The choice between one-tailed and two-tailed must be made before looking at the data.
B. Type I and Type II Errors & Power
Because we use samples, we can never be 100% certain.
| Decision | H₀ True | H₀ False |
|---|---|---|
| Reject H₀ | Type I Error (α) | Correct Decision (Power = 1 - β) |
| Fail to Reject H₀ | Correct (1 - α) | Type II Error (β) |
Type I Error (α): Rejecting a true null hypothesis.
Consequence: Concluding an effect exists when it doesn’t. (e.g., Convicting an innocent person, adopting a new drug that is no better).
Significance Level (α): The pre-chosen probability of making a Type I error. Common choices are 0.05 (5%), 0.01 (1%), and 0.10 (10%). This is our threshold for “unlikely.”
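A small simulation sketch makes this concrete: when H₀ is true, roughly a proportion α of all tests still reject it. The settings below (10,000 simulated samples of size 30 from a standard normal population, tested with the one-sample t-test introduced later in these notes) are purely illustrative.
set.seed(123)
# Simulate 10,000 one-sample t-tests on data where H0 (mu = 0) is actually true
p_values <- replicate(10000, t.test(rnorm(30, mean = 0), mu = 0)$p.value)
# Proportion of (false) rejections at alpha = 0.05: close to 0.05, as expected
mean(p_values <= 0.05)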
Type II Error (β): Failing to reject a false null hypothesis.
Consequence: Concluding no effect exists when it actually does. (e.g., Letting a guilty person go free, sticking with an old drug when a new one is better).
β depends on the sample size, the true effect size, and the chosen α.
Power (1 - β): Probability of correctly rejecting a false null. Researchers typically aim for 80% power.
Ways to increase power:
- Increase sample size (n).
- Increase effect size.
- Increase α (but this raises Type I error risk).
NB: There is a trade-off between Type I and Type II errors. Decreasing α makes it harder to reject H₀, which in turn increases β (and so decreases power), unless compensated for by a larger sample size.
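To make this trade-off concrete, the sketch below uses base R's power.t.test(); the effect size (delta = 0.5 standard deviations) and the sample size n = 30 are assumed values chosen only for illustration.
# Power of a one-sample, one-sided t-test with n = 30 and an assumed effect of 0.5 SD
power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.05,
             type = "one.sample", alternative = "one.sided")$power
# Tightening alpha to 0.01 lowers the power
power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.01,
             type = "one.sample", alternative = "one.sided")$power
# Sample size needed to restore 80% power at alpha = 0.01
power.t.test(power = 0.80, delta = 0.5, sd = 1, sig.level = 0.01,
             type = "one.sample", alternative = "one.sided")$n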
C. The p-value: Making the Decision
The p-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis (H₀) is true.
How to interpret it: A small p-value (typically ≤ α) means that the observed data would be very unlikely if the null hypothesis were true. This provides evidence against H₀.
The Decision Rule:
If p-value ≤ α, we reject the null hypothesis (H₀). The result is “statistically significant.”
If p-value > α, we fail to reject the null hypothesis (H₀). We don’t have enough evidence to support H₁.
Crucial: A p-value > α does not prove H₀ is true. It only means the evidence wasn’t strong enough to reject it. Also, “statistically significant” does not necessarily mean “practically important.”
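A minimal sketch of the decision rule in R; the values of alpha and p_value below are placeholders used only to show the comparison.
alpha <- 0.05
p_value <- 0.018  # placeholder value, e.g. from some test
# Compare the p-value to the significance level
decision <- if (p_value <= alpha) "Reject H0 (statistically significant)" else "Fail to reject H0"
decision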
3. One-Sample Tests
A. One-Sample z-test
When to use: To test a hypothesis about a population mean (μ) when:
- The population standard deviation (σ) is known.
- The sample size is large (n ≥ 30, thanks to the Central Limit Theorem), or the population is normally distributed.
Test statistic:
\[ z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} \]
where:
x̄ = sample mean
μ₀ = hypothesized population mean under H₀
σ = known population standard deviation
n = sample size
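As a minimal sketch, this formula translates directly into R; the helper name one_sample_z is illustrative, not a built-in function (Example 1 below works through the same numbers).
# z statistic for a one-sample z-test
one_sample_z <- function(xbar, mu0, sigma, n) {
  (xbar - mu0) / (sigma / sqrt(n))
}
one_sample_z(xbar = 20.6, mu0 = 20, sigma = 1.5, n = 35)  # about 2.37, as in Example 1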
B. One-Sample t-test
When to use it: To test a hypothesis about a population mean (μ) when:
- The population standard deviation (σ) is unknown (which is almost always the case in real life).
- We use the sample standard deviation (s) as an estimate.
- The sample size is small (n < 30) and the population is approximately normal.
Test statistic:
\[ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \]
where s is the sample standard deviation.
with Degrees of Freedom (df): df = n - 1.

The t-distribution is slightly wider and more variable than the z-distribution, accounting for the extra uncertainty from estimating σ with s. As df increases, the t-distribution approaches the z-distribution.
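One quick way to see this convergence in R is to compare t critical values with the corresponding z critical value (two-tailed, α = 0.05):
# z critical value: about 1.96
qnorm(0.975)
# t critical values shrink toward 1.96 as df grows
sapply(c(5, 10, 30, 100, 1000), function(df) qt(0.975, df))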
4. Solved Examples
Example 1: One-Sample z-test (Manual Calculation)
Problem: A company claims its energy bars have an average of 20 grams of protein. You know the population standard deviation is 1.5 grams. You take a sample of 35 bars and find a sample mean of 20.6 grams. At α = 0.05, is there evidence to suggest the mean protein is different from 20 grams?
Solution:
1. Hypotheses:
H₀: μ = 20 (The mean is 20g)
H₁: μ ≠ 20 (The mean is not 20g) -> Two-tailed test
2. Significance Level: α = 0.05
3. Test Statistic:
- x̄ = 20.6, μ₀ = 20, σ = 1.5, n = 35
\[ z = \frac{20.6 - 20}{1.5 / \sqrt{35}} = \frac{0.6}{1.5 / 5.916} = \frac{0.6}{0.2535} \approx 2.37 \]
4. Find p-value:
For a two-tailed test, p-value = 2 * P(Z > |2.37|).
From Z-table, P(Z > 2.37) ≈ 0.0089.
p-value = 2 * 0.0089 = 0.0178.
5. Decision:
- p-value (0.0178) < α (0.05). Therefore, we reject H₀.
6. Conclusion:
At the 5% significance level, there is sufficient evidence to conclude that the true mean protein content of the energy bars is different from 20 grams.
Verification in R:
# Hypothesized mean, known population SD, and sample results from the problem
mu0 <- 20
sigma <- 1.5
n <- 35
xbar <- 20.6
# z statistic and two-tailed p-value
z <- (xbar - mu0) / (sigma / sqrt(n))
p_value <- 2 * (1 - pnorm(abs(z)))
list(z_statistic = z, p_value = p_value)
## $z_statistic
## [1] 2.366432
##
## $p_value
## [1] 0.01796048
Interpretation: The p-value (≈ 0.018) is below 0.05, so we reject H₀, matching the manual calculation.
Example 2: One-Sample t-test (Practical with R & Python)
Problem: A car manufacturer claims a new model gets at least 40 MPG. A consumer agency tests a random sample of 12 cars, with the following results:
39.2, 40.5, 38.7, 41.0, 39.8, 40.9, 38.5, 39.9, 40.2, 39.3, 41.1, 38.6
Test the manufacturer’s claim at α = 0.05. Assume MPG is approximately normally distributed.
Solution (Manual Steps First):
1. Hypotheses:
H₀: μ ≥ 40 (Manufacturer's claim is true)
H₁: μ < 40 (The mean MPG is less than 40) -> One-tailed (left-tailed) test
2. Significance Level: α = 0.05
3. Calculate Sample Statistics:
Calculate the mean (x̄) and standard deviation (s) of the sample data:
\[ \bar{x} = \frac{\text{sum of all values}}{12} \approx 39.808 \]
s ≈ 0.951 (calculated using the sample standard deviation formula)
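These values can be checked quickly in R (the data are repeated here so the snippet stands alone):
mpg <- c(39.2, 40.5, 38.7, 41.0, 39.8, 40.9, 38.5, 39.9, 40.2, 39.3, 41.1, 38.6)
mean(mpg)  # about 39.808
sd(mpg)    # about 0.951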
4. Test Statistic:
\[ t = \frac{\bar{x} - \mu_{0}}{s / \sqrt{n}} = \frac{39.808 - 40}{0.951 / \sqrt{12}} = \frac{-0.192}{0.951 / 3.464} = \frac{-0.192}{0.275} \approx -0.698 \]
df = n - 1 = 11
5. Find p-value (using t-table for df = 11):
This is a left-tailed test. We need P(T < -0.698).
From a t-table, for df = 11, the value 0.698 falls between 0.697 and 1.363. The corresponding one-tailed probabilities are 0.25 and 0.10.
We can estimate p-value > 0.10. (Software will give a more precise value.)
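For reference, the exact left-tail probability can be obtained from R's pt() function using the t statistic above:
pt(-0.698, df = 11)  # about 0.25, consistent with the table-based estimate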
6. Decision (Manual):
Our estimated p-value (> 0.10) is greater than α (0.05). Therefore, we fail to reject H₀.
Now, let’s solve it precisely with code:
In R:
# Sample data
mpg_data <- c(39.2, 40.5, 38.7, 41.0, 39.8, 40.9, 38.5, 39.9, 40.2, 39.3, 41.1, 38.6)
# Perform one-sample t-test (alternative="less" for H1: mu < 40)
test_result <- t.test(mpg_data, mu = 40, alternative = "less")
# Print the results
print(test_result)
##
## One Sample t-test
##
## data: mpg_data
## t = -0.69814, df = 11, p-value = 0.2498
## alternative hypothesis: true mean is less than 40
## 95 percent confidence interval:
## -Inf 40.30138
## sample estimates:
## mean of x
## 39.80833
Conclusion from R: The p-value is 0.2498. Since 0.2498 > 0.05, we fail to reject H₀. There is not enough evidence to reject the manufacturer’s claim that the mean MPG is at least 40.
In Python:
import numpy as np
from scipy import stats

# Sample data
mpg_data = np.array([39.2, 40.5, 38.7, 41.0, 39.8, 40.9, 38.5, 39.9, 40.2, 39.3, 41.1, 38.6])

# Perform one-sample t-test; alternative="less" sets H1: mu < 40
t_stat, p_value = stats.ttest_1samp(mpg_data, popmean=40, alternative='less')

# Print the results
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")

# With 'alternative' specified, the returned p-value is already one-tailed.
# On older SciPy versions without 'alternative', halve the two-sided p-value
# (when the t statistic has the sign expected under H1).
Conclusion is the same as in R.
5. Exercises & Assignments
Exercise 1: Conceptual
Define in your own words: Null Hypothesis, p-value, Power.
Describe the consequences of Type I and Type II errors in the context of a clinical trial for a new cancer drug (H₀: The new drug is no better than the standard).
If you set α = 0.01 instead of α = 0.05, what happens to the power of a test, assuming all else remains equal? Why?
Exercise 2: Hypothesis Formulation
Formulate the appropriate H₀ and H₁ for each scenario:
A historian believes the average age of soldiers in a civil war was less than 25.
A quality control manager needs to ensure that bottles are filled to 500 ml.
A researcher is testing if a new teaching method leads to a different exam score than the old method.
Assignment 1: z-test
A national survey found that the average American adult works 43.7 hours per week. The population standard deviation is assumed to be 4.6 hours. You survey 50 adults in your state and find they work an average of 45.1 hours per week. At the α = 0.01 level, is there significant evidence to conclude that workers in your state work more than the national average?
(Solve this manually, showing all steps, and then verify your answer using R or Python code).
Assignment 2: t-test
The recommended daily calcium intake for adults is 1000 mg. A nutritionist believes the intake for women in their 50s is too low. She collects data from a random sample of 15 women:
980, 1005, 1010, 942, 865, 1200, 1105, 978, 1020, 999, 870, 1050, 1055, 955, 907
Test the nutritionist’s belief at the α = 0.05 level. Assume the population is approximately normal. (Solve this using R or Python. Include your code and output in your answer).
6. Solutions
Exercise 1 Solutions
Null hypothesis = default assumption.
p-value = probability of extreme data given H₀.
Power = correctly rejecting false H₀.
Type I Error: Concluding drug works when it does not.
Type II Error: Missing a true effect.
Lowering α (e.g., from 0.05 to 0.01) decreases power: a stricter threshold makes it harder to reject H₀, so β increases, all else being equal.
Exercise 2 Solutions
- H₀: μ ≥ 25, H₁: μ < 25
- H₀: μ = 500, H₁: μ ≠ 500
- H₀: μ_new = μ_old, H₁: μ_new ≠ μ_old
Assignment 1 Solution (z-test)
Manual:
H₀: μ ≤ 43.7, H₁: μ > 43.7 (Right-tailed)
α = 0.01
\[ z = \frac{45.1 - 43.7}{4.6 / \sqrt{50}} = \frac{1.4}{4.6 / 7.071} = \frac{1.4}{0.6506} \approx 2.15 \]
p-value = P(Z > 2.15) ≈ 0.0158 (from Z-table)
Decision: 0.0158 > 0.01 → p-value > α → Fail to Reject H₀.
Conclusion:
Insufficient evidence to conclude that workers in the state work more than the national average.
Verification in R:
# National (hypothesized) mean, known population SD, and sample results
mu0 <- 43.7
sigma <- 4.6
n <- 50
xbar <- 45.1
# z statistic and right-tailed p-value
z <- (xbar - mu0) / (sigma / sqrt(n))
p_value <- 1 - pnorm(z)
list(z_statistic = z, p_value = p_value)
## $z_statistic
## [1] 2.152064
##
## $p_value
## [1] 0.01569615
Interpretation: Fail to reject H₀ (p ≈ 0.016 > 0.01). At the 1% level there is no significant evidence that workers in the state work more than the national average.
Assignment 2 Solution (t-test)
# Daily calcium intake (mg) for the sample of 15 women
calcium_data <- c(980, 1005, 1010, 942, 865, 1200, 1105, 978, 1020, 999,
870, 1050, 1055, 955, 907)
# H1: mu < 1000, so alternative = "less"
t.test(calcium_data, mu = 1000, alternative = "less")
##
## One Sample t-test
##
## data: calcium_data
## t = -0.17434, df = 14, p-value = 0.432
## alternative hypothesis: true mean is less than 1000
## 95 percent confidence interval:
## -Inf 1035.804
## sample estimates:
## mean of x
## 996.0667
Conclusion:
The p-value (0.432) is greater than α (0.05). We fail to reject H₀. There is not sufficient evidence at the 0.05 level to support the claim that calcium intake for women in their 50s is below 1000 mg. (Note: The sample mean is about 996 mg, which is below 1000, but the test tells us this difference is not statistically significant; it could easily be due to random sampling variation.)