1. Chi-Square Goodness-of-Fit Test

Theory

The goodness-of-fit test evaluates whether an observed distribution differs from a hypothesized theoretical distribution.

Hypotheses

\(H_0\): The observed frequencies match the expected frequencies.
\(H_A\): The observed frequencies differ significantly from the expected frequencies.

Test Statistic

\[ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \]

where:

- \(O_i\) = observed frequency for category \(i\),
- \(E_i\) = expected frequency under \(H_0\),
- \(k\) = number of categories.

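Under \(H_0\), and provided the expected frequencies are large enough (a common rule of thumb is at least 5 per category), \(\chi^2\) approximately follows a chi-square distribution with \(k - 1\) degrees of freedom.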
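The statistic and its p-value can also be computed by hand, which makes the formula concrete. The sketch below is a minimal illustration; the counts in `observed` are hypothetical and are not taken from the example that follows.

```r
# Hypothetical counts for three categories (illustrative only)
observed <- c(45, 35, 20)
expected_prop <- c(0.5, 0.3, 0.2)

# Expected frequencies under H0: total count times hypothesized proportions
expected <- sum(observed) * expected_prop

# Pearson statistic: sum of (O - E)^2 / E over all categories
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: number of categories minus one
df <- length(observed) - 1

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

c(statistic = chi_sq, df = df, p_value = p_value)
```

The same numbers should come out of `chisq.test(observed, p = expected_prop)`, which is the more convenient route in practice.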

Example: Simulated Data

Let’s test whether people’s drink preferences follow a given probability distribution.

# Load required packages
library(ggplot2)

library(dplyr)

library(knitr)

# Simulated observed counts
observed_counts <- c(50, 30, 20)  # Coffee, Tea, Juice
expected_proportions <- c(0.5, 0.3, 0.2)  # Expected proportions
total_count <- sum(observed_counts)

# Compute expected frequencies
expected_counts <- total_count * expected_proportions

# Perform Chi-square Goodness-of-Fit Test
chisq_test <- chisq.test(observed_counts, p = expected_proportions)

# Create a table
df_goodness <- data.frame(
  Category = c("Coffee", "Tea", "Juice"),
  Observed = observed_counts,
  Expected = expected_counts
)

# Display the table
kable(df_goodness, caption = "Observed vs Expected Frequencies")

Observed vs Expected Frequencies
Category	Observed	Expected
Coffee	50	50
Tea	30	30
Juice	20	20

# Print test result
chisq_test

## 
##  Chi-squared test for given probabilities
## 
## data:  observed_counts
## X-squared = 0, df = 2, p-value = 1
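As a check on this output, substituting the observed and expected counts from the table into the test statistic gives:

\[ \chi^2 = \frac{(50 - 50)^2}{50} + \frac{(30 - 30)^2}{30} + \frac{(20 - 20)^2}{20} = 0 \]

Every term vanishes because each observed count equals its expected count, so the statistic takes its smallest possible value and the p-value is 1. The degrees of freedom are \(k - 1 = 3 - 1 = 2\).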

Visualizing Observed vs Expected Frequencies

# Bars show observed counts; red points mark expected counts
ggplot(df_goodness, aes(x = Category, y = Observed, fill = Category)) +
  geom_col(alpha = 0.8) +
  geom_point(aes(y = Expected), size = 4, color = "red") +
  labs(title = "Observed vs Expected Frequencies", y = "Count") +
  theme_minimal()

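Interpretation

If the p-value is below 0.05, reject \(H_0\): the observed distribution differs from the expected one. If the p-value is 0.05 or above, fail to reject \(H_0\): the data give no evidence against the expected distribution. Failing to reject does not prove that the two distributions match.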
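Because the counts in the example equal their expected values exactly, the decision rule is easier to see on data with sampling noise. The sketch below draws a hypothetical sample of 100 respondents; the seed, the sample size, and the names `sim_counts` and `sim_test` are assumptions made for illustration.

```r
set.seed(123)  # arbitrary seed, used only for reproducibility

# Assumed population proportions for the three drinks
expected_prop <- c(Coffee = 0.5, Tea = 0.3, Juice = 0.2)

# Draw 100 hypothetical respondents and tabulate their choices
choices <- sample(names(expected_prop), size = 100, replace = TRUE,
                  prob = expected_prop)
sim_counts <- table(factor(choices, levels = names(expected_prop)))

# Goodness-of-fit test against the assumed proportions
sim_test <- chisq.test(sim_counts, p = expected_prop)

# Rule-of-thumb check: expected counts of at least 5 in every category
all(sim_test$expected >= 5)

# Decision at the 5% significance level
alpha <- 0.05
if (sim_test$p.value < alpha) {
  message("Reject H0: the counts depart from the assumed proportions")
} else {
  message("Fail to reject H0: no evidence of a departure")
}
```

Since the sample is drawn from the hypothesized proportions, rejections should occur in roughly 5% of repeated runs with different seeds.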

2. Chi-Square Test for Independence

This test evaluates whether two categorical variables are associated.

\(H_0\): The two categorical variables are independent.

\(H_A\): There is an association between the variables.

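For a contingency table with \(r\) rows and \(c\) columns, the expected count for the cell in row \(i\) and column \(j\) is:

\(E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}\)

The test statistic sums over all cells:

\(\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\)

Under \(H_0\), \(\chi^2\) approximately follows a chi-square distribution with \((r - 1)(c - 1)\) degrees of freedom.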
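These two formulas translate almost directly into R. The sketch below is a minimal illustration on a hypothetical 2 x 3 table; the counts and the labels in `tab` are invented for the example.

```r
# Hypothetical 2 x 3 table of counts (illustrative only)
tab <- matrix(c(20, 30, 50,
                30, 30, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(Group = c("G1", "G2"),
                              Answer = c("Yes", "No", "Unsure")))

# Expected count per cell: (row total * column total) / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson statistic summed over all cells
chi_sq <- sum((tab - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

c(statistic = chi_sq, df = df, p_value = p_value)
```

`chisq.test(tab)` should report the same statistic; its `$expected` component holds the same matrix as `expected` here.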

Example: Hair and Eye Color (Real Data)

Let’s analyze whether hair color and eye color are independent.

# Load dataset
data(HairEyeColor)

# Convert to a 2D contingency table (sum over Sex)
hair_eye_data <- margin.table(HairEyeColor, c(1, 2))

# Perform Chi-Square Test
chisq_indep_test <- chisq.test(hair_eye_data)

# Display contingency table
kable(hair_eye_data, caption = "Hair Color vs Eye Color Contingency Table")

Hair Color vs Eye Color Contingency Table
	Brown	Blue	Hazel	Green
Black	68	20	15	5
Brown	119	84	54	29
Red	26	17	14	14
Blond	7	94	10	16

# Print test results
chisq_indep_test

## 
##  Pearson's Chi-squared test
## 
## data:  hair_eye_data
## X-squared = 138.29, df = 9, p-value < 2.2e-16
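The degrees of freedom in this output can be checked by hand. The table has 4 hair colors and 4 eye colors, so:

\[ df = (r - 1)(c - 1) = (4 - 1)(4 - 1) = 9 \]

As one worked cell, the expected count for black hair and brown eyes uses the row total 108, the column total 220, and the grand total 592:

\[ E_{\text{Black, Brown}} = \frac{108 \times 220}{592} \approx 40.14 \]

The observed count of 68 is well above this value, so this cell is one of those contributing to the large statistic. With a p-value below 2.2e-16, \(H_0\) is rejected: hair color and eye color are associated in these data.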

library(vcd)  # provides mosaic() for shaded mosaic plots

mosaic(hair_eye_data, shade = TRUE, legend = TRUE, main = "Hair vs Eye Color Association")
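A numeric companion to the mosaic plot is the matrix of Pearson residuals, which shows which cells depart most from independence. The sketch below rebuilds the table from the built-in dataset so it runs on its own; the names `hair_eye` and `indep_test` are chosen for illustration, and the cutoff of 2 is a rule of thumb rather than a formal test.

```r
# Rebuild the 2D hair-by-eye table from the built-in dataset
hair_eye <- margin.table(HairEyeColor, c(1, 2))

# Chi-square test for independence
indep_test <- chisq.test(hair_eye)

# Pearson residuals: (observed - expected) / sqrt(expected)
round(indep_test$residuals, 2)

# Cells whose residual exceeds 2 in absolute value
which(abs(indep_test$residuals) > 2, arr.ind = TRUE)
```

Positive residuals mark combinations that occur more often than independence would predict, and negative residuals mark combinations that occur less often.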

📊 Dataset

Product Type	Region A	Region B
Type 1	40	60
Type 2	30	50
Type 3	50	40

📝 Task for Students

Create the contingency table in R.
Perform a chi-square test for independence.
Interpret the result.

# Create the contingency table
product_sales <- matrix(c(40, 60, 30, 50, 50, 40), nrow = 3, byrow = TRUE,
                       dimnames = list(ProductType = c("Type 1", "Type 2", "Type 3"),
                                       Region = c("Region A", "Region B")))

product_sales

##            Region
## ProductType Region A Region B
##      Type 1       40       60
##      Type 2       30       50
##      Type 3       50       40

Help Code

# Create contingency table
product_data <- matrix(c(40, 60, 30, 50, 50, 40), nrow = 3, byrow = TRUE)
colnames(product_data) <- c("Region A", "Region B")
rownames(product_data) <- c("Type 1", "Type 2", "Type 3")

# Perform Chi-square test
product_chisq_test <- chisq.test(product_data)

# Display table
kable(product_data, caption = "Product Sales Contingency Table")

Product Sales Contingency Table
	Region A	Region B
Type 1	40	60
Type 2	30	50
Type 3	50	40

# Print test results
product_chisq_test

## 
##  Pearson's Chi-squared test
## 
## data:  product_data
## X-squared = 6.8625, df = 2, p-value = 0.03235
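The reported statistic can be reproduced by hand. The row totals are 100, 80, and 90, the column totals are 120 and 150, and the grand total is 270, so the expected counts are:

\[ E_{11} = \frac{100 \times 120}{270} \approx 44.44, \quad E_{12} \approx 55.56, \quad E_{21} \approx 35.56, \quad E_{22} \approx 44.44, \quad E_{31} = 40, \quad E_{32} = 50 \]

Summing \((O_{ij} - E_{ij})^2 / E_{ij}\) over the six cells gives:

\[ \chi^2 \approx 0.444 + 0.356 + 0.868 + 0.694 + 2.500 + 2.000 \approx 6.86 \]

The degrees of freedom are \((3 - 1)(2 - 1) = 2\). Since the p-value of 0.03235 is below 0.05, \(H_0\) is rejected at the 5% level: the data suggest that product type and region are associated. The two Type 3 cells contribute most of the statistic.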

Non Parametric Hypothesis Test (Chi-Square)

Dr. Debashis Chatterjee

2025-02-19

Introduction