            size
Species      Small Medium Large
  setosa        47      3     0
  versicolor    11     36     3
  virginica      1     32    17
3. Chi-Squared Goodness of Fit Test
This test evaluates a single categorical variable. We want to know whether the proportions of “Small,” “Medium,” and “Large” flowers differ significantly from an equal split.
Hypotheses
$H_0$: The proportions of Small, Medium, and Large flowers are equal ($1/3$ each).
$H_1$: The proportions are not all equal.
    # Visualizing the distribution
    flower %>%
      ggplot(aes(x = size)) +
      geom_bar(fill = "#a139ca", alpha = 0.7) +
      labs(
        title = "Goodness of Fit: Observed Frequencies",
        subtitle = "Proportion of flowers by size",
        x = "Size Category",
        y = "Frequency"
      ) +
      theme_test(base_size = 15)
    # Performing the test
    goodness_fit <- flower %>%
      select(size) %>%
      table() %>%
      chisq.test()
    goodness_fit
Chi-squared test for given probabilities
data: .
X-squared = 28.44, df = 2, p-value = 6.673e-07
Interpretation: The p-value is approximately $6.67 \times 10^{-7}$. Since this is far below the conventional 0.05 significance level, we reject the null hypothesis. The proportions of flower sizes in this dataset differ significantly from an equal split.
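To see where the reported statistic comes from, the calculation can be reproduced by hand. The sketch below assumes the observed counts are the column totals of the Species by size table (59 Small, 71 Medium, 20 Large); the variable names are illustrative and not part of the original code.

```r
# Observed counts per size category (column totals of the Species x size table)
observed <- c(Small = 59, Medium = 71, Large = 20)

# Under H0 each category is expected to hold one third of the flowers
expected <- sum(observed) * rep(1 / 3, length(observed))

# Chi-squared statistic: sum of (O - E)^2 / E over the categories
x_squared <- sum((observed - expected)^2 / expected)

# Degrees of freedom: number of categories minus one
dof <- length(observed) - 1

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(x_squared, df = dof, lower.tail = FALSE)

x_squared  # 28.44
p_value    # about 6.67e-07
```

Both values agree with the `chisq.test()` output printed above, which confirms that the default null hypothesis of `chisq.test()` on a one-way table is equal proportions.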
4. Chi-Squared Test of Independence
This test evaluates the relationship between two categorical variables: Species and Size.
Hypotheses
$H_0$: Species and Size are independent (no relationship).
$H_1$: Species and Size are dependent (there is an association).
    # Visualizing the relationship
    flower %>%
      ggplot(aes(size, fill = Species)) +
      geom_bar(alpha = 0.7) +
      labs(
        title = "Test of Independence: Species vs. Size",
        subtitle = "Stacked bar chart showing categorical overlap",
        x = "Size Category",
        y = "Count"
      ) +
      theme_test(base_size = 15) +
      scale_fill_manual(values = c(
        "setosa" = "#d48ce1",
        "versicolor" = "#a139ca",
        "virginica" = "#6d2683"
      ))
Performing the Test
    # Run the independence test
    independence_test <- table(flower) %>%
      chisq.test()
    independence_test
Interpretation: The p-value is vanishingly small, far below the conventional 0.05 significance level, so we reject the null hypothesis. This suggests that Species and Size are not independent: knowing the species helps predict the likely size of the flower.
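The independence statistic can also be rebuilt step by step, which makes the role of the expected counts explicit. This is a minimal sketch that assumes the observed counts are exactly those in the Species by size table shown earlier; the object names are illustrative.

```r
# Observed Species x size counts, copied from the contingency table
observed <- matrix(
  c(47,  3,  0,
    11, 36,  3,
     1, 32, 17),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Species = c("setosa", "versicolor", "virginica"),
    size    = c("Small", "Medium", "Large")
  )
)

# Expected count under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-squared statistic, with (rows - 1) * (columns - 1) degrees of freedom
x_squared <- sum((observed - expected)^2 / expected)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(x_squared, df = dof, lower.tail = FALSE)

# Cross-check the hand calculation against the built-in test
builtin <- chisq.test(observed)
all.equal(unname(builtin$statistic), x_squared)
```

The cross-check works because `chisq.test()` applies a continuity correction only to 2 x 2 tables, so for a 3 x 3 table it computes the plain Pearson statistic.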
5. Checking Assumptions (Expected Values)
A Chi-squared test may be unreliable if more than 20% of the expected frequencies are less than 5. In such cases, Fisher’s Exact Test is preferred.
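            size
Species         Small   Medium    Large
  setosa     19.66667 23.66667 6.666667
  versicolor 19.66667 23.66667 6.666667
  virginica  19.66667 23.66667 6.666667
Here every expected count is at least 5 (the smallest is about 6.67), so the Chi-squared approximation is acceptable for this table.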
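The 20% rule of thumb can be checked in code rather than by eye. The helper below is a hypothetical sketch (`choose_test` and `max_share` are names invented for illustration): it runs `chisq.test()`, measures the share of expected counts below 5, and falls back to `fisher.test()` when that share exceeds the threshold. The second table uses made-up numbers purely to trigger the fallback.

```r
# Hypothetical helper: pick the test based on the expected-count rule of thumb
choose_test <- function(tab, max_share = 0.20) {
  # suppressWarnings() hides chisq.test()'s own small-count warning,
  # because the rule is checked explicitly on the next lines
  chi <- suppressWarnings(chisq.test(tab))
  share_small <- mean(chi$expected < 5)  # fraction of cells with E < 5
  if (share_small > max_share) fisher.test(tab) else chi
}

# Species x size counts from this tutorial: no expected count is below 5
flower_tab <- matrix(c(47, 3, 0, 11, 36, 3, 1, 32, 17), nrow = 3, byrow = TRUE)
choose_test(flower_tab)$method

# A deliberately tiny 2 x 2 table (made-up numbers): every expected count is 2
tiny_tab <- matrix(c(3, 1, 1, 3), nrow = 2)
choose_test(tiny_tab)$method
```

For the tutorial's table the helper keeps the Chi-squared test, while the tiny table is routed to Fisher's Exact Test.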
Systematic Checklist (Cheat Sheet)
Discretizing Data: cut(variable, breaks = n, labels = c(...))
Contingency Table: table(var1, var2)
Statistical Test: chisq.test(table_object)
Checking Validity: test_object$expected
Alternative for Small Samples: fisher.test(table_object)
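The first four checklist items can be chained into one short workflow. The sketch below uses the built-in `iris` data and assumes, for illustration only, that the size categories come from cutting `Sepal.Length` into three equal-width bins; this section does not show how `flower` was built, so the binning variable may differ from the tutorial's.

```r
# Assumption: size is derived by cutting Sepal.Length into three equal-width bins
size <- cut(iris$Sepal.Length, breaks = 3,
            labels = c("Small", "Medium", "Large"))

# Contingency table of the two categorical variables
tab <- table(Species = iris$Species, size = size)

# Chi-squared test of independence on the table
test <- chisq.test(tab)

# Expected counts, used to check the validity rule of thumb
test$expected

# Share of cells with an expected count below 5 (should not exceed 0.20)
mean(test$expected < 5)
```

If the last value were above 0.20, `fisher.test(tab)` would be the safer choice.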
Summary: You have now worked through both types of Chi-squared tests! You can determine whether a category distribution is balanced and whether two categorical variables are associated with one another.