Hypothesis Testing in R: A How-To Guide

2024-10-21

Steps for Hypothesis Testing

Define the Hypothesis:
- Null Hypothesis \(H_0\): This is the baseline assumption you are testing about your dataset. It usually states that a parameter is equal to a specific hypothesized value.
- Alternative Hypothesis \(H_a\): This is the statement you accept if the null hypothesis is rejected. Make this the claim that you want to find evidence for.
Choose the Test:
- You must choose a test based on the data and the parameter you are testing (e.g. t-test, chi-square, etc.)
Set the Significance Level (your \(\alpha\) value):
- This is the probability of rejecting the null hypothesis when it’s true (common choices are 0.05 and 0.01)
Calculate the Test Statistic:
- This involves computing a value (z, t, chi-square, etc) that helps determine how far the observed data deviates from the null hypothesis.
Reject or do not reject the Null Hypothesis:
- Compare the test statistic to a critical value, or compare the p-value to \(\alpha\), to determine whether to reject \(H_0\). Reject \(H_0\) when the p-value is less than \(\alpha\).
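As a minimal sketch of the last step, the snippet below shows that the critical-value rule and the p-value rule lead to the same decision for a two-sided t-test. The values of `t_stat` and `df` are made up purely for illustration and do not come from any dataset in this guide.

```r
# Hypothetical inputs, chosen only for illustration
alpha  <- 0.05   # significance level
t_stat <- 2.5    # an assumed, already-computed test statistic
df     <- 9      # assumed degrees of freedom (sample size minus 1)

# Route 1: compare the test statistic to the two-sided critical value
critical_value     <- qt(1 - alpha / 2, df)
reject_by_critical <- abs(t_stat) > critical_value

# Route 2: compare the two-sided p-value to alpha
p_value     <- 2 * pt(-abs(t_stat), df)
reject_by_p <- p_value < alpha

# The two routes give the same decision
print(c(by_critical_value = reject_by_critical, by_p_value = reject_by_p))
```

Either route can be used. The p-value route is usually more convenient in R because most test functions report a p-value directly.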

Example 1: One-Sample t-Test for Mean

A one-sample t-test is used to test whether the mean of a sample differs from a known or hypothesized population mean.

Example: I have a vector of heights in centimeters. I want to test whether the mean height of the population this vector was sampled from is 165 cm.

heights <- c(162, 164, 167, 170, 160, 165, 163, 168, 172, 159, 190, 154, 204)

Set up the hypothesis test as follows:
- The mean height is 165 cm; this will be our \(H_0\)
- The mean height is not 165 cm; this will be our \(H_a\)
- Our \(\alpha\) will be 0.05, which corresponds to a 95% confidence level
- We will reject the null hypothesis if the p-value is less than 0.05

Perform the t-test in R:

h_test_result <- t.test(heights, mu = 165)
print(h_test_result$p.value)

## [1] 0.3003206

As we can see, our p-value is 0.3003206, which is greater than the 0.05 that we set for our \(\alpha\). Therefore we fail to reject the null hypothesis: the data do not provide enough evidence to conclude that the mean height differs from 165 cm.
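The object returned by `t.test` holds more than the p-value. As a small sketch using the same `heights` vector, the snippet below pulls out the test statistic, the degrees of freedom, and the 95% confidence interval. It also shows how the `alternative` argument requests a one-sided test. The one-sided test is an extra illustration and is not part of the example above.

```r
heights <- c(162, 164, 167, 170, 160, 165, 163, 168, 172, 159, 190, 154, 204)

# Two-sided test of H0: mean = 165 (the default alternative)
h_test_result <- t.test(heights, mu = 165)

print(h_test_result$statistic)  # the t test statistic
print(h_test_result$parameter)  # degrees of freedom (n - 1)
print(h_test_result$conf.int)   # 95% confidence interval for the mean

# One-sided test of H0: mean = 165 against Ha: mean > 165
h_greater <- t.test(heights, mu = 165, alternative = "greater")
print(h_greater$p.value)
```

The hypothesized value of 165 lies inside the 95% confidence interval, which is consistent with a two-sided p-value above 0.05.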

Visual Representation of Hypothesis Testing

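When \(H_0\) is true, the test statistic for a mean follows a t-distribution, which looks very much like a standard normal curve once the sample is reasonably large.

In a two-sided test (a test of equality), \(\alpha\) is split between the two tails of the curve, leaving a “green” zone of area \(1 - \alpha\) in the middle. A test statistic that falls in this green zone means that we cannot reject the null hypothesis.

A test statistic that falls outside this zone leads us to reject the null hypothesis. Equivalently, the p-value is then smaller than \(\alpha\). This lets us draw a conclusion about our hypotheses at the chosen significance level, although never with complete certainty.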
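One possible way to draw this picture in base R is sketched below. It assumes the `heights` data and the hypothesized mean of 165 from Example 1, shades the central non-rejection zone in green, and marks the observed test statistic with a dashed line. The plotting choices (axis range, colors) are arbitrary.

```r
heights <- c(162, 164, 167, 170, 160, 165, 163, 168, 172, 159, 190, 154, 204)
alpha <- 0.05
n     <- length(heights)
df    <- n - 1

# Observed test statistic and two-sided critical value
t_stat <- (mean(heights) - 165) / (sd(heights) / sqrt(n))
crit   <- qt(1 - alpha / 2, df)

# Density curve of the t-distribution with n - 1 degrees of freedom
x <- seq(-4, 4, length.out = 400)
plot(x, dt(x, df), type = "l", xlab = "t", ylab = "Density")

# Shade the central zone of area 1 - alpha
mid <- x[abs(x) < crit]
polygon(c(mid, rev(mid)), c(dt(mid, df), rep(0, length(mid))),
        col = "lightgreen", border = NA)
lines(x, dt(x, df))

# Mark where the observed statistic falls
abline(v = t_stat, lty = 2)
```

For these data the dashed line falls inside the shaded zone.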

Equations for Hypothesis Testing (T-distribution)

These are the equations you can use to do hypothesis testing on a mean manually.

Test statistic: \(T_0 = \frac{\bar{x} - \mu_0}{\frac{s}{\sqrt{n}}}\)

The test statistic is \(T_0\)
The mean of your sample is: \(\bar{x}\)
The hypothesized value (your guess for the mean): \(\mu_0\)
The standard deviation of your sample: \(s\)
The number of observations in your sample: \(n\)

The p-value tells you whether to reject your null hypothesis, just as in the previous example: reject \(H_0\) when the p-value is less than \(\alpha\).

P-value for two sided: (\(\mu \neq \mu_0\)): \(2(1 - F(|T_0|))\)
P-value for one sided: (\(\mu > \mu_0\)): \(1 - F(T_0)\)
P-value for one sided: (\(\mu < \mu_0\)): \(F(T_0)\)

\(F\) is the CDF of the t-distribution with \(n - 1\) degrees of freedom, evaluated at the given test statistic. For large samples it is well approximated by the standard normal CDF \(\Phi\). If you would rather decide by comparing \(T_0\) to a critical value, you must also compute \(t_{\alpha/2, n-1}\) (two sided) or \(t_{\alpha, n-1}\) (one sided).
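The formulas above can be checked by hand in R. The sketch below reuses the `heights` vector and the hypothesized mean of 165 from Example 1. Because the sample standard deviation \(s\) is used, it evaluates the p-values with `pt`, the CDF of the t-distribution with \(n - 1\) degrees of freedom, which is also what `t.test` uses. The variable names are illustrative.

```r
heights <- c(162, 164, 167, 170, 160, 165, 163, 168, 172, 159, 190, 154, 204)
mu_0  <- 165    # hypothesized mean
alpha <- 0.05   # significance level

n     <- length(heights)
x_bar <- mean(heights)
s     <- sd(heights)
df    <- n - 1

# Test statistic: (sample mean - hypothesized mean) / standard error
test_stat <- (x_bar - mu_0) / (s / sqrt(n))

# P-values for the three alternatives
p_two_sided <- 2 * (1 - pt(abs(test_stat), df))
p_greater   <- 1 - pt(test_stat, df)
p_less      <- pt(test_stat, df)

# Two-sided critical value, for deciding without a p-value
critical_value <- qt(1 - alpha / 2, df)

print(c(statistic = test_stat, p_value = p_two_sided, critical = critical_value))
```

The two-sided p-value computed this way should agree with `t.test(heights, mu = 165)$p.value`.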

Hypothesis Tests on Standard Deviation

Now let's look at hypothesis testing on the standard deviation. It is a lot like testing the mean, but the equations are a little different.

We will use the same dataset as before, a vector of heights:

heights <- c(162, 164, 167, 170, 160, 165, 163, 168, 172, 159, 190, 154, 204)

For this example, let's test whether the standard deviation is equal to 25.3, using a significance level \(\alpha\) of 0.05.

Unfortunately, R does not have a built-in function for the Chi-Squared test on a standard deviation, so the test statistic has to be computed manually. Thankfully, R does have a function for the Chi-Squared CDF, pchisq, which we will use to compute the p-value and decide whether to reject our null hypothesis. Again, we will reject if the calculated p-value is less than the \(\alpha\) of 0.05.

We calculated the two-sided p-value to be about 0.0173. Since 0.0173 is less than 0.05, we reject the null hypothesis. This means the data provide evidence, at the 5% significance level, that the standard deviation is not 25.3.

Hypothesis Tests of Standard Deviation Code

hypothesis <- 25.3
n <- length(heights)
s2 <- var(heights)

# Calculate Chi-squared test statistic
chi_squared <- (n - 1) * (s2) / (hypothesis^2)

# Degrees of freedom
df <- n - 1

# Lower-tail and upper-tail probabilities of the Chi-squared statistic
p_lower <- pchisq(chi_squared, df)
p_upper <- 1 - pchisq(chi_squared, df)

# Two-tailed p-value: double the smaller of the two tails
p_value <- 2 * min(p_lower, p_upper)

print(round(p_value, 4))

## [1] 0.0173

Graphical Representation of Standard Deviation Testing (For Equality)

The non-rejection region works very similarly to the “green” zone in mean testing. The only difference is that standard deviation testing uses the Chi-Squared distribution, which is not symmetric, so the two tails have different critical values.
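A rough way to draw the equality case in base R is sketched below, assuming the `heights` data, the hypothesized standard deviation of 25.3, and \(\alpha = 0.05\) from the example above. The axis range and colors are arbitrary choices.

```r
heights <- c(162, 164, 167, 170, 160, 165, 163, 168, 172, 159, 190, 154, 204)
sigma_0 <- 25.3
alpha   <- 0.05
df      <- length(heights) - 1

# Observed test statistic and the two critical values
chi_stat   <- df * var(heights) / sigma_0^2
lower_crit <- qchisq(alpha / 2, df)
upper_crit <- qchisq(1 - alpha / 2, df)

# Density curve of the Chi-Squared distribution with n - 1 degrees of freedom
x <- seq(0, 35, length.out = 400)
plot(x, dchisq(x, df), type = "l", xlab = "Chi-squared", ylab = "Density")

# Shade the non-rejection zone between the two critical values
mid <- x[x > lower_crit & x < upper_crit]
polygon(c(mid, rev(mid)), c(dchisq(mid, df), rep(0, length(mid))),
        col = "lightgreen", border = NA)
lines(x, dchisq(x, df))

# Mark where the observed statistic falls
abline(v = chi_stat, lty = 2)
```

For these data the dashed line falls to the left of the shaded zone, in the lower rejection region.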

Graphical Representation of Standard Deviation Testing (For Greater Than)

Graphical Representation of Standard Deviation Testing (For Less Than)
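For the two one-sided cases, all of \(\alpha\) sits in a single tail. The sketch below computes the critical value and p-value for each direction, again assuming the `heights` data, a hypothesized standard deviation of 25.3, and \(\alpha = 0.05\). The variable names are illustrative.

```r
heights <- c(162, 164, 167, 170, 160, 165, 163, 168, 172, 159, 190, 154, 204)
sigma_0 <- 25.3
alpha   <- 0.05
df      <- length(heights) - 1
chi_stat <- df * var(heights) / sigma_0^2

# Greater than: reject when the statistic lands in the upper tail
crit_greater <- qchisq(1 - alpha, df)
p_greater    <- 1 - pchisq(chi_stat, df)

# Less than: reject when the statistic lands in the lower tail
crit_less <- qchisq(alpha, df)
p_less    <- pchisq(chi_stat, df)

print(c(statistic = chi_stat, crit_greater = crit_greater, crit_less = crit_less))
print(c(p_greater = p_greater, p_less = p_less))
```

In each direction, comparing the statistic to the critical value gives the same decision as comparing the p-value to \(\alpha\).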

Equations for Standard Deviation Testing (Chi-Squared Distribution)

Much like with the mean, you can do hypothesis testing on the standard deviation manually using equations.

Test statistic: \(\chi_0^2 = (n - 1)s^2 / \sigma_0^2\)

The test statistic is \(\chi_0^2\)
The variance of your sample (the standard deviation squared) is: \(s^2\)
The hypothesized value (your guess for the standard deviation): \(\sigma_0\)
The number of observations in your sample: \(n\)

The p-value tells you whether to reject your null hypothesis, much like with mean testing.

P-value for two sided: (\(\sigma \neq \sigma_0\)): \(2\min(F(\chi_0^2), 1 - F(\chi_0^2))\)
P-value for one sided: (\(\sigma > \sigma_0\)): \(1 - F(\chi_0^2)\)
P-value for one sided: (\(\sigma < \sigma_0\)): \(F(\chi_0^2)\)

\(F\) is the CDF of the Chi-Squared distribution with \(n - 1\) degrees of freedom.
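These steps can be wrapped in a small helper function. The sketch below is one possible version: `sd_chisq_test` is a hypothetical name, not a function from base R or any package, and the two-sided p-value doubles the smaller tail probability, which is a common convention assumed here.

```r
# Hypothetical helper: Chi-Squared test for a single standard deviation
sd_chisq_test <- function(x, sigma_0,
                          alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  df        <- length(x) - 1
  statistic <- df * var(x) / sigma_0^2

  # Tail probabilities under H0
  p_lower <- pchisq(statistic, df)
  p_upper <- pchisq(statistic, df, lower.tail = FALSE)

  p_value <- switch(alternative,
                    two.sided = 2 * min(p_lower, p_upper),
                    greater   = p_upper,
                    less      = p_lower)

  list(statistic = statistic, df = df, p_value = p_value)
}

heights <- c(162, 164, 167, 170, 160, 165, 163, 168, 172, 159, 190, 154, 204)
result  <- sd_chisq_test(heights, sigma_0 = 25.3)
print(result$p_value)
```

Passing `alternative = "greater"` or `alternative = "less"` returns the corresponding one-sided p-value instead.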