2025-03-15

What is Hypothesis Testing?

  • Goal: Assess how strongly the sample data support your claim (hypothesis) over a competing claim

  • Hypothesis: A claim/statement made about a property of a population

  • Null Hypothesis (\(H_0\)): A statement that the value of the population parameter is equal to some claimed value

  • Alternative Hypothesis (\(H_A\)): A statement that the population parameter differs from the value claimed in \(H_0\)

Test Statistic

  • Test statistic: A number that summarizes the evidence in the sample against \(H_0\); it is not itself a probability
  • Calculated from the sample data to measure how far the data diverge from \(H_0\)
  • Extreme values indicate that our data are far from what is expected under \(H_0\)

Test Statistic

How to Calculate

Population Parameter: \(\mu\)

Sample Statistic: \(\bar{x}\)

Standard Error: \(\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\)

Test Statistic: \(z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}\)
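As a minimal sketch of the two formulas above, the R snippet below computes the standard error and z for made-up numbers. The values of mu, sigma, n, and xbar are hypothetical and chosen only so the arithmetic is easy to follow.

```r
# Hypothetical inputs (not taken from any data set in these slides)
mu    <- 100   # population mean claimed by H0
sigma <- 15    # population standard deviation
n     <- 36    # sample size
xbar  <- 105   # observed sample mean

# Standard error of the sample mean: sigma / sqrt(n)
std_error <- sigma / sqrt(n)   # 15 / 6 = 2.5

# z test statistic: distance of xbar from mu in standard-error units
z <- (xbar - mu) / std_error   # 5 / 2.5 = 2
```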

Significance Level & P-Value

  • Significance Level \(\alpha\): Probability value used as the cutoff for determining significant evidence against \(H_0\)
  • P-Value: Probability of observing data at least as extreme as your sample if \(H_0\) were true
\(\alpha = 1 - CL\), where CL is the confidence level (a common default is \(\alpha = 0.05\))

Reject \(H_0\) when \(p\text{-value} \lt \alpha\)

Fail to reject \(H_0\) when \(p\text{-value} \ge \alpha\)
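The decision rule can be sketched in R as follows. The helper name decide() is hypothetical, and the sketch assumes a one-tailed p-value taken from the standard normal distribution with pnorm().

```r
# decide() is a hypothetical helper, not part of base R
decide <- function(z, alpha = 0.05) {
  # One-tailed p-value: probability of a z at least this extreme under H0
  p_value <- pnorm(-abs(z))
  decision <- if (p_value < alpha) "reject H0" else "fail to reject H0"
  list(p_value = p_value, decision = decision)
}

decide(2.5)$decision   # p-value is about 0.006, so "reject H0"
decide(0.5)$decision   # p-value is about 0.31, so "fail to reject H0"
```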

Example 1

Reject the Null Hypothesis

You want to research whether fertility is related to education in Swiss provinces around the year 1888. You predict that provinces where 20% or more of draftees have an education beyond primary school have a lower-than-average fertility rate. Use a confidence level of 95% to test your hypothesis.

Example 1 (Code)

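library(dplyr)
# I am using "swiss", which is a data set built into R

# Here, I am creating a function that contains the z-test equation.
# Note: it divides by the sample standard deviation, not by the standard
# error sd / sqrt(n) from the "How to Calculate" slide; dividing by the
# standard error would give a larger |z|.
# pop_parameter and sd are looked up when z_test() is called, so they must
# be assigned before the call.
z_test <- function(xbar) (xbar - pop_parameter) / sd

# Next, I need to find all parameters needed to perform the z-test
pop_parameter = mean(swiss$Fertility)
# pop_parameter = 70.14255
sample_stat <- swiss %>% filter(Education >= 20)
xbar = mean(sample_stat$Fertility)
# xbar = 49.48333
# The variable sd shadows the name of the sd() function; calls to sd() still work
sd = sd(sample_stat$Fertility)
# sd = 10.59876
z <- z_test(xbar)
# z = -1.94921

# Lastly, use the z-score to determine the one-tailed p-value
p_value = pnorm(-abs(z))
# p-value = 0.02564
# Significance level = 0.05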

0.02564 < 0.05: The p-value is less than the significance level, so we reject \(H_0\). The data suggest that provinces where 20% or more of draftees had an education beyond primary school had below-average fertility in 1888.
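The code above divides by the sample standard deviation. As a sketch of the same comparison using the standard error \(\sigma / \sqrt{n}\) instead, the snippet below is one possible variant. The helper z_test_se() is hypothetical, and using the sample standard deviation in place of \(\sigma\) is an assumption that is only approximate for a small group of provinces.

```r
# z_test_se() is a hypothetical helper that takes every input explicitly
z_test_se <- function(xbar, mu, s, n) (xbar - mu) / (s / sqrt(n))

high_edu <- swiss[swiss$Education >= 20, ]   # same filter as above
z_se <- z_test_se(
  xbar = mean(high_edu$Fertility),
  mu   = mean(swiss$Fertility),
  s    = sd(high_edu$Fertility),   # assumption: sample sd stands in for sigma
  n    = nrow(high_edu)
)
p_value_se <- pnorm(-abs(z_se))    # one-tailed p-value
p_value_se < 0.05                  # TRUE would mean we reject H0
```

Because the standard error shrinks as n grows, |z| from this version is larger than the value obtained by dividing by the standard deviation alone.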

Example 1 (Plot)

library(ggplot2)

# Refer to columns by name inside aes(); ggplot2 looks them up in `swiss`
ggplot(swiss, aes(x = Education, y = Fertility)) + 
  geom_point(color = "hotpink") + 
  geom_smooth(method = "lm", se = FALSE, formula = y ~ poly(x, 3), color = "turquoise") + 
  labs(title = "Education vs. Fertility in 1888 Swiss Provinces", 
       x = "% of Draftees w/ Education Beyond Primary School", 
       y = "Fertility")

Example 2

Fail to Reject the Null Hypothesis

This time, you want to research whether fertility is related to Catholic status in Swiss provinces around the year 1888. You predict that provinces with an above-average percentage of Catholics have an above-average fertility rate. Use a confidence level of 95% to test your hypothesis.

Example 2 (Code)

# I need to find all parameters needed to perform the z-test.
# z_test() is the function defined in Example 1; it reads pop_parameter
# and sd from the workspace when it is called.
pop_parameter = mean(swiss$Fertility)
# pop_parameter = 70.14255
mean_catholic = mean(swiss$Catholic)
# mean_catholic = 41.14383
sample_stat <- swiss %>% filter(Catholic > mean_catholic)
xbar = mean(sample_stat$Fertility)
# xbar = 74.27895
sd = sd(sample_stat$Fertility)
# sd = 16.75621
z <- z_test(xbar)
# z = 0.24686

# Lastly, use the z-score to determine the one-tailed p-value
p_value = pnorm(-abs(z))
# p-value = 0.40251
# Significance level = 0.05

0.40251 > 0.05: The p-value is greater than the significance level of 0.05, so we fail to reject \(H_0\). This test does not show a significant relationship between Catholic status and fertility in 1888 Swiss provinces. However, the plot suggests that more research with alternate statistical tests may be needed to determine other potential relationships.
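One such alternate test, sketched here as an assumption and not as part of the original analysis, is a Pearson correlation test with cor.test() from R's stats package. It uses every province in the data set instead of comparing a subgroup mean.

```r
# Pearson correlation test between % Catholic and fertility (all 47 provinces)
ct <- cor.test(swiss$Catholic, swiss$Fertility, method = "pearson")

ct$estimate   # sample correlation coefficient
ct$p.value    # two-sided p-value for H0: true correlation is 0

# Same decision rule as before, at a significance level of 0.05
ct$p.value < 0.05
```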

Example 2 (Plot)

ggplot(swiss, aes(x = Catholic, y = Fertility)) + 
  geom_point(color = "orange") + 
  geom_smooth(method = "lm", se = FALSE, formula = y ~ poly(x, 3), 
              color = "turquoise") + 
  labs(title = "Catholic Status vs. Fertility in 1888 Swiss Provinces", 
       x = "% Catholic", 
       y = "Fertility")

Plotly example using Swiss data

library(plotly)
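# Move the row names into a Province column so plotly can color by province
swiss_v2 <- data.frame(Province = rownames(swiss), swiss)
swiss_v2$Province <- as.factor(swiss_v2$Province)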
fig1 <- plot_ly(swiss_v2, x = ~Infant.Mortality, y = ~Fertility, 
                type = 'scatter', mode = 'markers', 
                size = ~Education, color = ~Province, 
                colors = 'Paired',
                sizes = c(10,50),
                marker = list(opacity=0.5, sizemode = 'diameter'),
                hoverinfo = 'text',
                text = ~paste('Province:', Province, '<br>Education:', Education))
              
fig1 <- fig1 %>% layout(title = "Infant Mortality vs. Fertility vs. Education per Province",
                        xaxis = list(showgrid = FALSE),
                        yaxis = list(showgrid = FALSE),
                        showlegend = FALSE)

Plotly example using Swiss data

fig1

This slideshow was created for the purpose of a homework assignment submission. Thank you for your time!