Loading Libraries
library(readxl)
library(ggplot2)
library(rcompanion)
RQ: Do students prefer tea, coffee, soda, and water equally?
Import dataset
DatasetA2 <- read_excel("DatasetA2.xlsx")
Reviewing the data and dataset structure
head(DatasetA2)
## # A tibble: 6 × 2
## StudentID FavoriteDrink
## <dbl> <chr>
## 1 1 Soda
## 2 2 Soda
## 3 3 Soda
## 4 4 Coffee
## 5 5 Soda
## 6 6 Coffee
str(DatasetA2)
## tibble [100 × 2] (S3: tbl_df/tbl/data.frame)
## $ StudentID : num [1:100] 1 2 3 4 5 6 7 8 9 10 ...
## $ FavoriteDrink: chr [1:100] "Soda" "Soda" "Soda" "Coffee" ...
The dataset contains 100 observations and 2 variables:
StudentID: Numerical identifier for each student
FavoriteDrink: Categorical variable with four options (Coffee, Soda,
Tea, Water)
Creating a frequency table
beverage_table <- table(DatasetA2$FavoriteDrink)
print("Frequency Table for Beverage Preferences : ")
## [1] "Frequency Table for Beverage Preferences : "
print(beverage_table)
##
## Coffee Soda Tea Water
## 26 29 28 17
The frequency table shows the count of students who prefer each
beverage:
Coffee: 26 students, Soda: 29 students, Tea: 28 students, Water: 17
students
Total: 100 students
Calculate percentages for better understanding
beverage_percentages <- prop.table(beverage_table) * 100
print("Percentage Distribution:")
## [1] "Percentage Distribution:"
print(round(beverage_percentages, 1))
##
## Coffee Soda Tea Water
## 26 29 28 17
Percentage breakdown: This shows Soda is the most preferred (29%)
and Water is the least preferred (17%)
Creating Bar Graph
ggplot(DatasetA2, aes(x = FavoriteDrink, fill = FavoriteDrink)) +
geom_bar() +
labs(
x = "Beverage Type",
y = "Number of Students",
title = "Distribution of Beverage Preferences Among Students"
) +
theme_minimal() +
theme(
text = element_text(size = 14),
axis.title = element_text(size = 14),
axis.text = element_text(size = 14),
plot.title = element_text(size = 14, face = "bold"),
legend.position = "none"
) +
# Place each count label just above its bar instead of overlapping the bar top
geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5)

Conduct Chi-Square Goodness of Fit Test
# Observed frequencies from our data
observed <- as.vector(beverage_table)
# Expected proportions (equal preference = 25% each)
expected_proportions <- c(0.25, 0.25, 0.25, 0.25)
# Run the chi-square test
chi_result_a2 <- chisq.test(x = observed, p = expected_proportions)
print("Chi-Square Test Results:")
## [1] "Chi-Square Test Results:"
print(chi_result_a2)
##
## Chi-squared test for given probabilities
##
## data: observed
## X-squared = 3.6, df = 3, p-value = 0.308
X-squared = 3.6
df = 3
p-value = 0.308
Statistical Significance: p > .05 → The result is NOT
statistically significant
This means we fail to reject the null hypothesis
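The reported statistic can be reproduced by hand, which shows where the value 3.6 comes from. The sketch below re-enters the observed counts from the frequency table instead of reading the Excel file; the names `obs`, `exp_counts`, `chi_sq_manual`, `df_manual`, and `p_manual` are illustrative and are not part of the original script.

```r
# Observed counts copied from the frequency table above
obs <- c(Coffee = 26, Soda = 29, Tea = 28, Water = 17)

# Expected counts under equal preference: N / k for each category
exp_counts <- rep(sum(obs) / length(obs), length(obs))

# Chi-square statistic: sum over categories of (O - E)^2 / E
chi_sq_manual <- sum((obs - exp_counts)^2 / exp_counts)

# Degrees of freedom: number of categories minus 1
df_manual <- length(obs) - 1

# p-value: upper-tail probability of the chi-square distribution
p_manual <- pchisq(chi_sq_manual, df = df_manual, lower.tail = FALSE)

print(chi_sq_manual)
print(df_manual)
print(round(p_manual, 3))
```

The arithmetic is (1 + 16 + 9 + 64) / 25 = 3.6 on 3 degrees of freedom, which should agree with the chisq.test() output.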
Calculate expected counts for reference
total_n <- sum(observed)
expected_counts <- total_n * expected_proportions
print("Expected counts (if preferences were equal):")
## [1] "Expected counts (if preferences were equal):"
print(expected_counts)
## [1] 25 25 25 25
If preferences were perfectly equal, we would expect 25 students for
each beverage
Expected counts: Coffee=25, Soda=25, Tea=25, Water=25
Our observed counts differ slightly from these expected values
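The chi-square approximation is commonly considered reliable when every expected count is at least 5. That threshold is a widely used rule of thumb, not something stated in the original analysis. A minimal sketch of the check, using counts re-entered from the frequency table and the illustrative names `obs_counts` and `check`:

```r
# Observed counts copied from the frequency table above
obs_counts <- c(Coffee = 26, Soda = 29, Tea = 28, Water = 17)

# chisq.test() stores the expected counts it used in the $expected component
check <- chisq.test(x = obs_counts, p = rep(0.25, 4))
print(check$expected)

# Rule-of-thumb check (assumed threshold): all expected counts >= 5
assumption_met <- all(check$expected >= 5)
print(assumption_met)
```

With 25 expected students per beverage, the check passes comfortably.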
Create comparison table
comparison <- data.frame(
Beverage = names(beverage_table),
Observed = observed,
Expected = expected_counts,
Difference = observed - expected_counts
)
print("Observed vs Expected Comparison:")
## [1] "Observed vs Expected Comparison:"
print(comparison)
## Beverage Observed Expected Difference
## 1 Coffee 26 25 1
## 2 Soda 29 25 4
## 3 Tea 28 25 3
## 4 Water 17 25 -8
Water shows the largest deviation from what was expected
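To see how much each beverage contributes to the overall statistic, the per-category terms (O - E)^2 / E and the standardized residuals can be inspected. This is a sketch using counts re-entered from the frequency table; the names `obs_counts`, `test_res`, and `contributions` are illustrative, and treating residuals beyond roughly ±2 as notable is a common guideline rather than part of the original analysis.

```r
# Observed counts copied from the frequency table above
obs_counts <- c(Coffee = 26, Soda = 29, Tea = 28, Water = 17)

test_res <- chisq.test(x = obs_counts, p = rep(0.25, 4))

# Contribution of each category to the chi-square statistic
contributions <- (obs_counts - test_res$expected)^2 / test_res$expected
print(round(contributions, 2))

# Standardized residuals: (O - E) / sqrt(N * p * (1 - p))
print(round(test_res$stdres, 2))

# Category with the largest contribution
print(names(which.max(contributions)))
```

Water accounts for 2.56 of the total 3.6, yet its standardized residual stays inside ±2, which is consistent with the non-significant overall test.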
Since p > 0.05, we do NOT calculate effect size
Effect size is calculated only when p < 0.05
With p = 0.308, the differences between observed and expected
frequencies are not statistically significant
Final interpretation and report
print("FINDINGS FOR SCENARIO A2: A chi-square goodness-of-fit test was conducted to determine whether students preferred tea, coffee, soda, and water equally.")
## [1] "FINDINGS FOR SCENARIO A2: A chi-square goodness-of-fit test was conducted to determine whether students preferred tea, coffee, soda, and water equally."
if(chi_result_a2$p.value < 0.05) {
cat("The results indicated that the observed frequencies were significantly different from the expected frequencies")
} else {
cat("The results indicated that the observed frequencies were NOT significantly different from the expected frequencies")
}
## The results indicated that the observed frequencies were NOT significantly different from the expected frequencies
cat(", χ²(", chi_result_a2$parameter, ") = ", round(chi_result_a2$statistic, 2),
", p = ", round(chi_result_a2$p.value, 3), ". ", sep="")
## , χ²(3) = 3.6, p = 0.308.
if(chi_result_a2$p.value < 0.05) {
cat("This suggests that students do not prefer all beverages equally.")
} else {
cat("This suggests there is not enough evidence that students prefer some beverages over others; the data are consistent with equal preference.")
}
## This suggests there is not enough evidence that students prefer some beverages over others; the data are consistent with equal preference.
cat("\n\nBased on the observed frequencies: Coffee (", observed[1], "), Soda (", observed[2],
"), Tea (", observed[3], "), and Water (", observed[4], "), the variations from the expected 25 per beverage are consistent with random sampling variation rather than true preference differences.", sep = "")
##
##
## Based on the observed frequencies: Coffee (26), Soda (29), Tea (28), and Water (17), the variations from the expected 25 per beverage are consistent with random sampling variation rather than true preference differences.
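Null Hypothesis: Students prefer the four beverages in equal
proportions (25% each), so observed and expected frequencies do not differ
Alternative Hypothesis: At least one beverage is preferred in a
proportion other than 25%, so observed and expected frequencies differ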
Test Used: Chi-Square Goodness of Fit
Results: χ²(3, N = 100) = 3.60, p = .308
Decision: Fail to reject the null hypothesis
Conclusion: There is not enough evidence to conclude that students
prefer some beverages over others. The variations in preferences (Soda
slightly higher, Water lower) are not statistically significant and
could be due to random sampling variation.