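The chi-square test is a nonparametric statistical method for analyzing categorical data. It is used when we have frequency counts and want to test whether:
- an observed distribution matches a hypothesized distribution (goodness-of-fit test), or
- two categorical variables are independent (test for independence).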
Unlike parametric tests, the chi-square test does not assume normality and is useful for count-based data.
The goodness-of-fit test evaluates whether an observed distribution differs from a hypothesized theoretical distribution.
\[ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \]
where:
- \(O_i\) = observed frequency for category \(i\),
- \(E_i\) = expected frequency under \(H_0\),
- \(k\) = number of categories.
Under \(H_0\), \(\chi^2\) approximately follows a chi-square distribution with \(k - 1\) degrees of freedom, provided the sample is large enough that the expected counts are not too small.
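To make the formula concrete, the statistic and its p-value can be computed by hand and checked against `chisq.test()`. This is a minimal sketch: the counts in `obs` are hypothetical and chosen only so that the result is not trivially zero, and the variable names are illustrative.

```r
# Hypothetical observed counts for k = 3 categories (illustrative only)
obs <- c(45, 35, 20)
p0  <- c(0.5, 0.3, 0.2)   # hypothesized proportions under H0

# Expected counts: E_i = n * p_i
expd <- sum(obs) * p0

# Test statistic: sum over categories of (O_i - E_i)^2 / E_i
chi_sq <- sum((obs - expd)^2 / expd)

# Degrees of freedom (k - 1) and upper-tail p-value
df_gof  <- length(obs) - 1
p_value <- pchisq(chi_sq, df = df_gof, lower.tail = FALSE)

# The built-in test should agree with the manual calculation
builtin <- chisq.test(obs, p = p0)
c(manual = chi_sq, builtin = unname(builtin$statistic), p_value = p_value)
```

With these counts the statistic is 25/50 + 25/30 + 0, roughly 1.33 on 2 degrees of freedom, which gives a p-value of about 0.51.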
Let’s test whether people’s drink preferences follow a given probability distribution.
# Load required packages
library(ggplot2)
library(dplyr)
library(knitr)
# Simulated observed counts
observed_counts <- c(50, 30, 20) # Coffee, Tea, Juice
expected_proportions <- c(0.5, 0.3, 0.2) # Expected proportions
total_count <- sum(observed_counts)
# Compute expected frequencies
expected_counts <- total_count * expected_proportions
# Perform Chi-square Goodness-of-Fit Test
chisq_test <- chisq.test(observed_counts, p = expected_proportions)
# Create a table
df_goodness <- data.frame(
Category = c("Coffee", "Tea", "Juice"),
Observed = observed_counts,
Expected = expected_counts
)
# Display the table
kable(df_goodness, caption = "Observed vs Expected Frequencies")
Category | Observed | Expected |
---|---|---|
Coffee | 50 | 50 |
Tea | 30 | 30 |
Juice | 20 | 20 |
# Print test result
chisq_test
##
## Chi-squared test for given probabilities
##
## data: observed_counts
## X-squared = 0, df = 2, p-value = 1
Visualizing Observed vs Expected Frequencies
ggplot(df_goodness, aes(x = Category, y = Observed, fill = Category)) +
geom_bar(stat = "identity", alpha = 0.8) +
geom_point(aes(y = Expected), size = 4, color = "red") +
labs(title = "Observed vs Expected Frequencies", y = "Count") +
theme_minimal()
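Interpretation
- If p-value < 0.05, reject \(H_0\): the observed distribution differs significantly from the expected distribution.
- If p-value ≥ 0.05, fail to reject \(H_0\): the data are consistent with the expected distribution (this does not prove that the two match).

In this example the simulated counts equal the expected counts exactly, so \(\chi^2 = 0\) and the p-value is 1: we fail to reject \(H_0\).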
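The decision rule can also be applied programmatically. The sketch below assumes a significance level of 0.05 and uses hypothetical counts that depart from the hypothesized proportions; `alpha`, `obs_counts`, `null_props`, and `decision` are illustrative names.

```r
alpha <- 0.05                    # assumed significance level
obs_counts <- c(70, 20, 10)      # hypothetical counts: Coffee, Tea, Juice
null_props <- c(0.5, 0.3, 0.2)   # hypothesized proportions under H0

gof <- chisq.test(obs_counts, p = null_props)

# Compare the p-value with alpha and report the decision
decision <- if (gof$p.value < alpha) {
  "Reject H0: the observed distribution differs from the expected one"
} else {
  "Fail to reject H0: the data are consistent with the expected distribution"
}
decision
```

Here the statistic is 400/50 + 100/30 + 100/20, roughly 16.33 on 2 degrees of freedom, so the p-value falls well below 0.05 and \(H_0\) is rejected.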
#### Chi-Square Test for Independence
This test evaluates whether two categorical variables are associated.
\(H_0\): The two categorical variables are independent.
\(H_A\): There is an association between the variables.
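For a contingency table with \(r\) rows and \(c\) columns, the expected count in row \(i\), column \(j\) is:
\(E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}\)
and the test statistic is:
\(\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\)
Under \(H_0\), \(\chi^2\) approximately follows a chi-square distribution with \((r - 1)(c - 1)\) degrees of freedom.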
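The two formulas above can be checked by hand in a few lines. The sketch below uses R's built-in `HairEyeColor` data, collapsed over `Sex`, and compares a manual calculation with `chisq.test()`. The names `tab`, `expected`, and `fit` are illustrative.

```r
# Collapse the built-in HairEyeColor array over Sex to get a Hair x Eye table
tab <- margin.table(HairEyeColor, c(1, 2))

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson statistic summed over all cells
chi_sq_indep <- sum((tab - expected)^2 / expected)

# Degrees of freedom (r - 1)(c - 1) and upper-tail p-value
df_indep <- (nrow(tab) - 1) * (ncol(tab) - 1)
p_indep  <- pchisq(chi_sq_indep, df = df_indep, lower.tail = FALSE)

# Cross-check against the built-in test
fit <- chisq.test(tab)
c(manual = chi_sq_indep, builtin = unname(fit$statistic), df = df_indep)
```

For a 4 x 4 table the degrees of freedom are (4 - 1)(4 - 1) = 9, and the expected counts stored in `fit$expected` should match the `outer()` calculation.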
Let’s analyze whether hair color and eye color are independent.
# Load dataset
data(HairEyeColor)
# Convert to a 2D contingency table (sum over Sex, the third dimension)
hair_eye_data <- margin.table(HairEyeColor, c(1, 2))
# Perform Chi-Square Test
chisq_indep_test <- chisq.test(hair_eye_data)
# Display contingency table
kable(hair_eye_data, caption = "Hair Color vs Eye Color Contingency Table")
Hair / Eye | Brown | Blue | Hazel | Green |
---|---|---|---|---|
Black | 68 | 20 | 15 | 5 |
Brown | 119 | 84 | 54 | 29 |
Red | 26 | 17 | 14 | 14 |
Blond | 7 | 94 | 10 | 16 |
# Print test results
chisq_indep_test
##
## Pearson's Chi-squared test
##
## data: hair_eye_data
## X-squared = 138.29, df = 9, p-value < 2.2e-16
# Mosaic plot of the association (requires the vcd package)
library(vcd)
mosaic(hair_eye_data, shade = TRUE, legend = TRUE, main = "Hair vs Eye Color Association")
As a second example, consider whether product type and sales region are independent, based on the following counts:
Product Type | Region A | Region B |
---|---|---|
Type 1 | 40 | 60 |
Type 2 | 30 | 50 |
Type 3 | 50 | 40 |
# Create the contingency table
# byrow = TRUE fills row by row, so values are listed as (Region A, Region B) pairs
product_sales <- matrix(c(40, 60, 30, 50, 50, 40), nrow = 3, byrow = TRUE,
  dimnames = list(ProductType = c("Type 1", "Type 2", "Type 3"),
                  Region = c("Region A", "Region B")))
product_sales
## Region
## ProductType Region A Region B
## Type 1 40 60
## Type 2 30 50
## Type 3 50 40
# Create contingency table
product_data <- matrix(c(40, 60, 30, 50, 50, 40), nrow = 3, byrow = TRUE)
colnames(product_data) <- c("Region A", "Region B")
rownames(product_data) <- c("Type 1", "Type 2", "Type 3")
# Perform Chi-square test
product_chisq_test <- chisq.test(product_data)
# Display table
kable(product_data, caption = "Product Sales Contingency Table")
Product Type | Region A | Region B |
---|---|---|
Type 1 | 40 | 60 |
Type 2 | 30 | 50 |
Type 3 | 50 | 40 |
# Print test results
product_chisq_test
##
## Pearson's Chi-squared test
##
## data: product_data
## X-squared = 6.8625, df = 2, p-value = 0.03235
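With a p-value of 0.03235, independence of product type and region is rejected at the 5% level. To see which cells drive the result, one option is to inspect the expected counts and Pearson residuals stored in the test object. The sketch below rebuilds the table so that it runs on its own; `sales_tab`, `fit_sales`, and `contrib` are illustrative names.

```r
# Rebuild the 3 x 2 table of counts (rows: product type, columns: region)
sales_tab <- matrix(c(40, 60,
                      30, 50,
                      50, 40),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(ProductType = c("Type 1", "Type 2", "Type 3"),
                                    Region = c("Region A", "Region B")))

fit_sales <- chisq.test(sales_tab)

# Expected counts under independence
round(fit_sales$expected, 2)

# Pearson residuals: (observed - expected) / sqrt(expected)
round(fit_sales$residuals, 2)

# Each cell's contribution to the statistic; these sum to X-squared
contrib <- fit_sales$residuals^2
sum(contrib)
```

Type 3 contributes most of the statistic (2.5 + 2.0 of the 6.86 total): Region A has more Type 3 counts than independence predicts (50 observed vs. 40 expected) and Region B has fewer (40 observed vs. 50 expected).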