2024-10-31

Hypothesis Testing

A hypothesis test is a method of statistical inference in which a test statistic computed from the data is used to decide whether the data are consistent with a particular hypothesis.

Hypothesis testing is important in data science because it gives us a principled way to make decisions under uncertainty while controlling the risk of Type I (false positive) and Type II (false negative) errors.

Common hypothesis tests include t-tests, z-tests, and chi-squared tests, which use the t, z, and chi-squared statistics, respectively.

One-Sample T-Test

One such test, the one-sample t-test (a form of Student’s t-test), was published by English statistician William Sealy Gosset in 1908.

The test was originally developed to solve a problem in industrial quality control. It applies to continuous data drawn as a random sample from a normally distributed population.

There are 5 main steps in one-sample t-testing:
1. Formulate and Define Hypotheses
2. Choose a Significance Level
3. Calculate Necessary Statistics
4. Calculate the P-Value
5. Come to a Conclusion

Example: National Health and Nutrition Examination Survey Dataset

To demonstrate a one-sample t-test, we will use data from the National Health and Nutrition Examination Survey (NHANES). This survey was created to assess the health and nutrition of adults and children in the United States.

We will use the Total Cholesterol measurements to test whether the population mean exceeds 200 mg/dL, the upper limit of the healthy range.

First, we will transform the data from mmol/L to mg/dL and take an initial look at the data.
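The unit conversion can be sketched as follows. This is a minimal illustration, assuming the standard cholesterol conversion of roughly 38.67 mg/dL per mmol/L; the vector `chol_mmol` holds made-up readings rather than NHANES data, and the names `MMOL_TO_MGDL` and `chol_mgdl` are chosen here for illustration only.

```r
# Conversion factor for cholesterol: 1 mmol/L is about 38.67 mg/dL
MMOL_TO_MGDL <- 38.67

# Hypothetical readings in mmol/L (illustrative values, not NHANES data)
chol_mmol <- c(4.2, 5.1, 6.0, NA, 4.8)

# Convert to mg/dL, then drop missing values
chol_mgdl <- chol_mmol * MMOL_TO_MGDL
chol_mgdl <- chol_mgdl[!is.na(chol_mgdl)]

# Quick numeric summary of the converted values
summary(chol_mgdl)
```

With the real data, the same multiplication would be applied to the cholesterol column before plotting or testing.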

Scatter Plot of NHANES Cholesterol Data

The scatter plot shows each observation in the dataset as a point. Most of the points fall in roughly the 100-300 mg/dL range.

Histogram of Cholesterol Data from the NHANES Dataset

From the histogram, we can see that the data is approximately normally distributed.
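A histogram is only a visual check, so it may help to back it up with a formal test. The sketch below uses synthetic data as a stand-in for the cholesterol column (the values and the names `x`, `x_sub`, and `sw` are assumptions for illustration), and it subsamples because `shapiro.test()` accepts at most 5000 values.

```r
set.seed(1)

# Synthetic stand-in for the cholesterol column (not NHANES data)
x <- rnorm(8000, mean = 190, sd = 40)

# shapiro.test() accepts between 3 and 5000 values, so test a random subsample
x_sub <- sample(x, size = min(length(x), 5000))
sw <- shapiro.test(x_sub)

# A small p-value would suggest a departure from normality
sw$p.value

# A normal Q-Q plot is a useful visual companion to the histogram:
# qqnorm(x); qqline(x)
```

With very large samples, formal normality tests tend to flag even minor departures, so the result is best read alongside the plots rather than on its own.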

1. Defining Hypotheses

We will test whether the population mean is greater than 200 mg/dL. \[ \begin{aligned} \text{Null hypothesis } (H_0)&: \mu = 200 \\ \text{Alternative hypothesis } (H_1)&: \mu > 200 \end{aligned} \] The null hypothesis is that the population mean equals 200 mg/dL. The alternative hypothesis is that the population mean is greater than 200 mg/dL.

Next we will choose a significance level.

2. Choosing a Significance Level

Because a lower cholesterol level is an important indicator of overall health, we will use a strict significance level of 0.01. This means that, if the null hypothesis is true, there is a 1% chance of incorrectly rejecting it (a Type I error).

\[ \text{Significance level: } \alpha = 0.01 \]
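The significance level can also be expressed as a critical value of the t-distribution. The sketch below assumes a hypothetical sample size `n` of 50, chosen only for illustration; the real dataset's size would be used in practice.

```r
alpha <- 0.01
n <- 50  # hypothetical sample size, for illustration only

# Upper-tail critical value for a one-sided test with n - 1 degrees of freedom
t_crit <- qt(1 - alpha, df = n - 1)
t_crit

# Sanity check: the upper-tail probability beyond t_crit equals alpha
pt(t_crit, df = n - 1, lower.tail = FALSE)
```

In a one-sided test of this kind, a t-statistic larger than `t_crit` would lead to rejecting the null hypothesis at the 0.01 level.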

Now we proceed to equations for necessary statistics.

3. t-Statistic

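The t-statistic is given by \[ t = \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} \] where the sample mean and the sample standard deviation are \[ \begin{aligned} \bar{x} &= \frac{\sum_{i=1}^n{x_i}}{n} \\ s &= \sqrt{\frac{\sum_{i=1}^n{(x_{i}-\bar{x})^2}}{n-1}} \end{aligned} \] Here n is the sample size and \[ \mu = 200 \] is the hypothesized mean. Under the null hypothesis, t follows a t-distribution with n - 1 degrees of freedom.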
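To make the formula concrete, the sketch below computes the t-statistic by hand and compares it with the value reported by `t.test()`. The data are synthetic (not NHANES), and the names `x`, `mu0`, `t_manual`, and `t_builtin` are illustrative assumptions.

```r
set.seed(42)

# Synthetic sample standing in for the cholesterol values (not NHANES data)
x <- rnorm(100, mean = 195, sd = 40)
mu0 <- 200  # hypothesized mean

n <- length(x)
x_bar <- mean(x)  # sample mean
s <- sd(x)        # sample standard deviation (n - 1 denominator)

# t-statistic from the formula above
t_manual <- (x_bar - mu0) / (s / sqrt(n))

# The same statistic as computed by t.test()
t_builtin <- unname(t.test(x, mu = mu0)$statistic)

all.equal(t_manual, t_builtin)
```

The two values should agree up to floating-point tolerance, which is a quick way to confirm that the formula and the built-in function describe the same calculation.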

4. Calculate the P-Value

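We can calculate the p-value in R by calling t.test(), storing the result, and accessing its $p.value element. Note that t.test() performs a two-sided test by default; a one-sided test matching our alternative hypothesis would require the argument alternative = "greater".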

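# One-sample t-test against mu = 200 (two-sided, the t.test() default)
result <- t.test(df$TotChol, mu=200)
# Extract the p-value from the returned test object
pvalue <- result$p.value
print(pvalue)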
## [1] 1.038955e-133

The p-value is 1.038955e-133.
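Because the choice of alternative hypothesis changes the p-value, it can be useful to compare the options side by side. The sketch below uses synthetic data drawn with a true mean below 200 (an assumption made purely for illustration, not NHANES data) and runs `t.test()` with each setting of its `alternative` argument.

```r
set.seed(7)

# Synthetic sample drawn with a true mean below 200 (not NHANES data)
x <- rnorm(200, mean = 190, sd = 40)
mu0 <- 200

# Two-sided test (the default): H1 is mu != 200
p_two <- t.test(x, mu = mu0)$p.value

# One-sided tests: H1 is mu > 200, and H1 is mu < 200
p_greater <- t.test(x, mu = mu0, alternative = "greater")$p.value
p_less <- t.test(x, mu = mu0, alternative = "less")$p.value

c(two_sided = p_two, greater = p_greater, less = p_less)
```

The two one-sided p-values sum to 1, and the two-sided p-value is twice the smaller of them. A very small two-sided p-value can therefore coexist with a large p-value for the "greater" alternative when the sample mean falls below the hypothesized mean.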

5. Come to a Conclusion

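With our p-value, 1.038955e-133, we can use our significance level to come to a conclusion about the population mean.
Because \[ 1.038955 \times 10^{-133} < 0.01 \] we reject the null hypothesis \[ H_0: \mu=200 \]
However, this p-value comes from a two-sided test, because t.test() was called without an alternative argument. Rejecting the null hypothesis therefore tells us only that the population mean differs from 200 mg/dL, not in which direction it differs.
As the comparison of the sample mean and the hypothesized mean below shows, the sample mean does not lie above 200 mg/dL. Thus, we cannot conclude that the population mean is greater than 200 mg/dL.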

Sample Mean vs Hypothesized Mean

Lastly, we can compare our hypothesized mean with our sample mean. The sample mean does not exceed the hypothesized mean, so the data give no support to the alternative hypothesis that the population mean is greater than 200 mg/dL.

References