2024-03-21

Introduction to Hypothesis Testing

  • Hypothesis testing is a statistical method used to make inferences about population parameters based on sample data.
  • It involves formulating a null hypothesis and an alternative hypothesis, collecting data, and using statistical tests to determine whether there is enough evidence to reject the null hypothesis.
  • This presentation covers the basics of hypothesis testing: hypothesis formulation, test statistics, p-values, and interpretation of results.

Hypothesis Formulation

  • Null Hypothesis (\(H_0\)):
    Represents the status quo or no effect.
  • Alternative Hypothesis (\(H_1\) or \(H_a\)):
    Represents the research hypothesis or the effect we are testing for.

Examples:
- \(H_0\): The mean temperature is \(75^\circ\)F.
- \(H_1\): The mean temperature is not equal to \(75^\circ\)F.
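
In symbols, writing \(\mu\) for the population mean temperature in degrees Fahrenheit, this two-sided pair of hypotheses can be stated as:

\[ H_0: \mu = 75 \quad \text{versus} \quad H_1: \mu \neq 75 \]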

Test Statistics

  • Test statistics are used to quantify the evidence against the null hypothesis.
  • Common test statistics include z-score, t-statistic, F-statistic, etc.
  • The choice of test statistic depends on the hypothesis being tested and the type of data.

Example:
- For testing the population mean with known standard deviation, the z-statistic is often used.
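
When the population standard deviation is not known, it is typically replaced by the sample standard deviation, which gives a t-statistic instead. A minimal sketch of that computation follows; the sample values and the hypothesized mean `mu0` are made up for illustration and do not come from this presentation.

```r
# Hypothetical sample and hypothesized mean (illustrative values only)
scores <- c(72, 78, 81, 69, 75, 80, 77, 74)
mu0 <- 75

n <- length(scores)
# t-statistic: the standard error is estimated from the sample standard deviation
t_stat <- (mean(scores) - mu0) / (sd(scores) / sqrt(n))

# Cross-check against R's built-in one-sample t-test
t_builtin <- unname(t.test(scores, mu = mu0)$statistic)
all.equal(t_stat, t_builtin)
```

The hand computation and `t.test()` should agree, which makes this a quick sanity check on the formula.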

Formula for the z-Statistic

The z-statistic measures how many standard errors the sample mean lies from the hypothesized population mean. It is calculated using the formula:

\[ z = \frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{n}}} \]

  • \(\bar{X}\): Sample mean
  • \(\mu\): Population mean under the null hypothesis
  • \(\sigma\): Population standard deviation
  • \(n\): Sample size
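
The formula translates directly into a small R function. This is a sketch only: the function name `z_statistic` and the input values are hypothetical and chosen so that the arithmetic is easy to follow.

```r
# z-statistic for a sample mean when the population sd is known
z_statistic <- function(x_bar, mu, sigma, n) {
  stopifnot(sigma > 0, n > 0)
  (x_bar - mu) / (sigma / sqrt(n))
}

# Hypothetical inputs: the standard error is 8 / sqrt(64) = 1
z_statistic(x_bar = 52, mu = 50, sigma = 8, n = 64)
## [1] 2
```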

P-Value

  • The p-value is the probability, computed assuming the null hypothesis is true, of obtaining a test statistic at least as extreme as the observed value.
  • A small p-value indicates strong evidence against the null hypothesis; the null hypothesis is rejected when the p-value falls below the significance level.
  • The significance level (usually denoted by \(\alpha\)) is the threshold used to determine statistical significance.

\[ \text{p-value} = P(T \text{ at least as extreme as } t \mid H_0) \]
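
For a z-statistic in a two-sided test, this probability can be read off the standard normal distribution with `pnorm()`. The sketch below assumes a standard normal null distribution; the helper name `two_sided_p`, the observed value `z_obs`, and the choice of `alpha = 0.05` are illustrative assumptions.

```r
# Two-sided p-value for a z-statistic under a standard normal null distribution
two_sided_p <- function(z) {
  2 * pnorm(-abs(z))
}

alpha <- 0.05   # significance level (a common choice, assumed here)
z_obs <- 1.5    # hypothetical observed z-statistic

p_value <- two_sided_p(z_obs)
p_value             # roughly 0.134
p_value <= alpha    # FALSE here: do not reject H0 at the 5% level
```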

Example Dataset: Exam Scores

Consider a dataset of exam scores:

# Generate example data
set.seed(123)
exam_scores <- rnorm(100, mean = 75, sd = 10)

# View the first few scores
head(exam_scores)
## [1] 69.39524 72.69823 90.58708 75.70508 76.29288 92.15065
tail(exam_scores)
## [1] 88.60652 68.99740 96.87333 90.32611 72.64300 64.73579
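
Before plotting, it can help to look at the numerical summaries that feed into a test of the mean. A short sketch, recreating the same simulated scores (the variable names `x_bar`, `s`, and `se` are introduced here for illustration):

```r
# Recreate the simulated exam scores defined above
set.seed(123)
exam_scores <- rnorm(100, mean = 75, sd = 10)

n <- length(exam_scores)
x_bar <- mean(exam_scores)   # sample mean
s <- sd(exam_scores)         # sample standard deviation
se <- s / sqrt(n)            # estimated standard error of the mean

round(c(n = n, mean = x_bar, sd = s, se = se), 3)
```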

Histogram of Exam Scores

Code:

# Histogram of exam scores (requires the ggplot2 package)
library(ggplot2)
histo <- ggplot() +
  geom_histogram(aes(x = exam_scores), bins = 15, fill = "skyblue",
                 color = "black") +
  labs(x = "Exam Scores", y = "Frequency") +
  theme_minimal()

Histogram:

Box Plot of Exam Scores

# Box plot of exam scores
box <- ggplot() +
  geom_boxplot(aes(y = exam_scores), fill = "salmon") +
  labs(x = "", y = "Exam Scores") +
  theme_minimal()

Box Plot:

3D Scatterplot

Code:

# Generate example data for 3D scatterplot
set.seed(789)
x <- rnorm(100)
y <- rnorm(100)
z <- 2*x + 3*y + rnorm(100)

# 3D Scatterplot
library(plotly)
scatter <- plot_ly(x = x, y = y, z = z, type = "scatter3d", 
                   mode = "markers", marker = list(size = 5)) %>%
  layout(scene = list(xaxis = list(title = "X"),
                      yaxis = list(title = "Y"),
                      zaxis = list(title = "Z")),
         title = "3D Scatterplot of X, Y, and Z")

Scatterplot:

Hypothesis Testing Example

## 
##  One Sample t-test
## 
## data:  exam_scores
## t = 0.99041, df = 99, p-value = 0.3244
## alternative hypothesis: true mean is not equal to 75
## 95 percent confidence interval:
##  74.09283 77.71528
## sample estimates:
## mean of x 
##  75.90406

Based on this test, there is not enough evidence to reject the null hypothesis at the 5% significance level (p-value = 0.3244). The sample mean of 75.90 is not significantly different from 75. The 95% confidence interval for the population mean, [74.09, 77.72], contains 75, which is consistent with this conclusion.
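
The code that produced the output above is not shown. The output is consistent with a two-sided one-sample t-test of the simulated exam scores against a hypothesized mean of 75, so a plausible reconstruction (an assumption, not necessarily the exact call that was used) is:

```r
# Recreate the simulated exam scores used throughout
set.seed(123)
exam_scores <- rnorm(100, mean = 75, sd = 10)

# Two-sided one-sample t-test of H0: mean = 75 (assumed reconstruction)
result <- t.test(exam_scores, mu = 75, alternative = "two.sided",
                 conf.level = 0.95)
result

# Decision at the 5% significance level
alpha <- 0.05
result$p.value <= alpha   # FALSE: fail to reject H0
```

The individual pieces of the result, such as `result$statistic`, `result$p.value`, and `result$conf.int`, can be extracted from the returned object for reporting.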

Example of Calculating the z-Statistic

\[ z = \frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{n}}} \]

Suppose we have a sample with the following characteristics:
- Sample mean (\(\bar{X}\)): 78
- Population mean (\(\mu\)): 75
- Population standard deviation (\(\sigma\)): 10
- Sample size (\(n\)): 100

Substituting these values into the formula:

\[ z = \frac{78 - 75}{\frac{10}{\sqrt{100}}} = \frac{3}{1} = 3 \]

  • In this example, the z-statistic is 3, indicating that the sample mean is 3 standard errors above the hypothesized population mean.
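
For a two-sided test, the p-value that corresponds to this z-statistic follows from the standard normal distribution function, written here as \(\Phi\) (a symbol introduced for this illustration):

\[ \text{p-value} = 2\,\bigl(1 - \Phi(3)\bigr) \approx 2 \times 0.00135 = 0.0027 \]

At a significance level of \(\alpha = 0.05\), a p-value this small would lead to rejecting \(H_0\).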

Conclusion

  • Hypothesis testing is a powerful tool for making inferences about population parameters based on sample data.
  • It involves formulating null and alternative hypotheses, choosing a test statistic, calculating the p-value, and interpreting the results.