March 26, 2024

Context: Normal Distribution

A Normal Distribution has the following properties:

  • the mean, median, and mode are (approximately) equal
  • the data are distributed symmetrically about the mean
  • occurrences that are near the mean are more frequent than occurrences farther away from the mean

We will generate random data for this demonstration:

data <- rnorm(1000, mean = 50, sd = 10)

This data is:

  • 1000 real numbers
  • normally distributed around the mean of 50
  • with a variance of 100 (standard deviation of 10)
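
We can quickly verify these properties empirically (a minimal check; the exact values vary from run to run since no seed is set):

mean(data)    # should be close to 50
median(data)  # should be close to the mean
var(data)     # should be close to 100
hist(data)    # bell-shaped and centered near 50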

Context: Point Estimation & Sampling Distributions

  • A random sample is derived from taking a random subset of a population.
  • A statistic is a function of the observations of a random sample, such as the mean \(\overline{X}\) or variance \(S^2\)
  • A point estimate is a single instance of a statistic for a specific random sample, such as mean \(\overline{x}\), or variance \(s^2\)
  • A point estimate serves as a reasonable estimation of a population parameter when the true value is unknown
  • If you take many random samples, each of the same size, the statistic of interest is a random variable, and its point estimates follow a Sampling Distribution
  • The sampling distribution describes the frequency with which each value of the point estimate occurs across random samples
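
For example, a single random sample and its point estimates might look like this (the sample size of 50 is arbitrary):

samp <- sample(data, 50)  # one random sample from our generated population
x_bar <- mean(samp)       # point estimate of the population mean
s2 <- var(samp)           # point estimate of the population variance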

Context: Central Limit Theorem

The Central Limit Theorem states that:

  • For a population with an unknown mean \(\mu\), taking random samples of the same size n, where n is sufficiently large:
  • the sampling distribution of the sample mean (as our statistic of interest) will be approximately normally distributed
  • and the mean of the sampling distribution will approximately equal the mean of the population:
  • \[\mu_{\overline{X}} = \frac{\mu + ... + \mu}{n} = \mu\]
  • with variance:
  • \[\sigma^2_{\overline{X}} = \frac{\sigma^2 + ... + \sigma^2}{n^2} = \frac{\sigma^2}{n}\]

Assuming we don’t know the population mean is 50, if we

  • take a random sample of 200 observations of our data
  • repeat this process 100 times

we can derive a sampling distribution of the mean of the random samples

sample_avgs <- replicate(100, mean(sample(data, 200, replace = TRUE)))
avg_sample_avg <- mean(sample_avgs)

  • This process yielded a sampling distribution of the mean centered at 49.81, which is very close to the actual mean of 50
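
Per the CLT, the standard deviation of this sampling distribution should also be near \(\frac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{200}} \approx 0.71\), which we can check against the sample_avgs vector from above:

sd(sample_avgs)    # should be near 10/sqrt(200), about 0.71
hist(sample_avgs)  # approximately normal, centered near 50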

Hypothesis Testing: Null and Alternative Hypotheses

  • A statistical hypothesis is a statement about a parameter of a population
  • The null hypothesis, \(H_0\), is a statement that is assumed to be true about a particular population parameter, such as the value of the mean
  • The alternative hypothesis, \(H_1\), is a statement about the population parameter that contradicts the null hypothesis, such as stating:
  • \(\mu < \mu_0\) - lower one-sided \(H_1\), \(\mu\) is less than what \(H_0\) states
  • \(\mu > \mu_0\) - upper one-sided \(H_1\), \(\mu\) is greater than what \(H_0\) states
  • \(\mu \neq \mu_0\) - two-sided \(H_1\), \(\mu\) is either greater or less than what \(H_0\) states

Hypothesis Testing: The General Procedure

  • 1: Identify the population parameter of interest
  • 2: State the null hypothesis \(H_0\)
  • 3: State the alternative hypothesis we will test against the null hypothesis, \(H_1\)
  • 4: Define the test statistic based on the point estimate of the parameter of interest.
  • The test statistic is the z-score the point estimate would have within the sampling distribution if the value stated by \(H_0\) were the true population parameter
  • 5: Determine what rejection criteria will be used to reject or fail to reject \(H_0\)
  • A common approach is to specify the significance level, \(\alpha\), where if the test statistic falls outside of the bounds set by \(z_{\alpha}\), then we reject \(H_0\)
  • 6: Take the random sample and calculate the sample test statistic
  • 7: Decide whether to reject \(H_0\) or fail to reject \(H_0\) based on the rejection criteria (Note: we never “accept” the null hypothesis!)
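
The whole procedure can be condensed into a few lines of R. Below is a minimal sketch for an upper one-sided test on the mean with known variance; z_test_upper is a hypothetical helper, not part of the original notes:

# Hypothetical helper: upper one-sided z-test on the mean, variance known
z_test_upper <- function(x, mu0, sigma, alpha = 0.05) {
    z0 <- (mean(x) - mu0)/(sigma/sqrt(length(x)))  # steps 4 & 6: test statistic
    z_crit <- qnorm(1 - alpha)                     # step 5: rejection criterion
    list(z0 = z0, z_crit = z_crit, reject_H0 = z0 > z_crit)  # step 7: decision
}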

Hypothesis Testing on the Mean of a Normal Distribution, Variance Known

  • For our randomly generated population data, let's assume we didn't know the true mean was 50. And let's say it's been claimed that the population mean is 49, a claim we want to test.
  • 1: Parameter of Interest: \(\mu\)
  • 2: \(H_0: \mu_0 = 49\)
  • 3: \(H_1: \mu > 49\), an upper one-sided hypothesis test
  • 4: Test statistic: \(Z_0 = \frac{\overline{X}-\mu_0}{\frac{\sigma}{\sqrt{n}}}\)
  • 5: Let us select \(\alpha\) = 0.05, where \(z_{\alpha}\) = 1.645. If our test statistic is greater than 1.645, we reject \(H_0\)
  • 6: Let’s take our random sample, n = 300, and determine the test statistic:
samp_mean <- mean(sample(data, 300))
print((samp_mean - 49)/(10/sqrt(300)))
[1] 2.448078
  • With a test statistic \(z_0 = 2.45\), we reject the null hypothesis based on our rejection criteria, as \(z_{\alpha} = 1.645 < z_0 = 2.45\)
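
For comparison, the hypothetical z_test_upper helper sketched after the general procedure bundles these steps:

z_test_upper(sample(data, 300), mu0 = 49, sigma = 10)
# reject_H0 gives the decision; a fresh random sample may or may not
# reject, as the discussion of Type II errors below makes clear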

Visualization

  • We make a plot showing the distribution of sample means if 49 were the true mean, and we draw a vertical line at the critical value set by our rejection criterion, \(z_{\alpha} = 1.645\).
  • Then we show where our test statistic lies, which is within the rejection region (greater than \(z_{\alpha}\))
library(plotly)  # needed for plot_ly

xval <- seq(from = 49 - 4 * (10/sqrt(300)), to = 49 + 4 * (10/sqrt(300)),
    length.out = 1000)
yval <- dnorm(xval, mean = 49, sd = 10/sqrt(300))

# Critical value mapped onto the sample-mean scale: mu_0 + z_alpha * sigma/sqrt(n)
critical_value <- 1.645 * (10/sqrt(300)) + 49

plot <- plot_ly() %>%
    add_lines(x = xval, y = yval, name = "Distribution if mean is 49") %>%
    add_lines(x = c(critical_value, critical_value), y = c(0, max(yval)),
        name = "z_alpha") %>%
    add_lines(x = c(samp_mean, samp_mean), y = c(0, max(yval)),
        name = "test statistic")

plot

Type I and II Errors

  • There is a probability we could come to the wrong conclusion from the particular random sample we took, and this probability is determined by our rejection criteria
  • Type I Error: We reject the null hypothesis when it is actually true
  • This occurs if \(H_0\) is true but our test statistic falls within the rejection region
  • The value for \(\alpha\) we chose can be interpreted as the probability of having a type I error (for \(\alpha = 0.05\), probability is 5%)
  • Example: if the true mean were 49 and we happened to get the test statistic we did, we would have falsely rejected \(H_0\)
  • Type II Error: We fail to reject the null hypothesis when it is actually false. We would have to know the “true” mean to find it!
  • This occurs if our test statistic falls within the “acceptance” region of \(H_0\) (technically, the confidence interval set by \(\alpha\))
  • The probability of having a type II error is denoted as \(\beta\), and for an upper one-sided hypothesis test on the mean with variance known, it is calculated as:

\[\beta = \Phi\left(z_{\alpha} - \frac{(\mu_{true}-\mu_{0})\sqrt{n}}{\sigma}\right)\]

  • Because we know the true mean is 50, we can calculate \(\beta\):

z_beta <- 1.645 - (50 - 49) * (sqrt(300)/10)
beta <- pnorm(z_beta, mean = 0, sd = 1, lower.tail = TRUE)
print(beta)
[1] 0.4653156
  • We have a 46% probability of committing a type II error!
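
The formula for \(\beta\) also shows how the probability of a Type II error shrinks as the sample size grows, using the same \(\mu_{true} = 50\), \(\mu_0 = 49\), \(\sigma = 10\), and \(\alpha = 0.05\):

# Type II error probability as the sample size grows
n_vals <- c(100, 300, 500, 1000)
beta_vals <- pnorm(1.645 - (50 - 49) * sqrt(n_vals)/10)
round(beta_vals, 3)  # beta shrinks toward 0 as n increases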

Final Thoughts

  • Key takeaway: Hypothesis testing only allows us to reject a claim about a population parameter, not discover the true value of the population parameter.
  • Caveat: if our claim about the population parameter is correct, or very close to correct, then a random sample would be unlikely to yield a test statistic that rejects it; and if one did, we would be committing a Type I error.
  • Key takeaway 2: For a fixed significance level \(\alpha\), increasing the sample size decreases the probability of a Type II error and makes the test more powerful (the Type I error rate stays at \(\alpha\) by construction)