Loading Libraries

library(readxl)
library(ggplot2)
library(rcompanion)

RQ: Do students prefer tea, coffee, soda, and water equally?

Import dataset

DatasetA2 <- read_excel("DatasetA2.xlsx")

Reviewing the data and dataset structure

head(DatasetA2)
## # A tibble: 6 × 2
##   StudentID FavoriteDrink
##       <dbl> <chr>        
## 1         1 Soda         
## 2         2 Soda         
## 3         3 Soda         
## 4         4 Coffee       
## 5         5 Soda         
## 6         6 Coffee
str(DatasetA2)
## tibble [100 × 2] (S3: tbl_df/tbl/data.frame)
##  $ StudentID    : num [1:100] 1 2 3 4 5 6 7 8 9 10 ...
##  $ FavoriteDrink: chr [1:100] "Soda" "Soda" "Soda" "Coffee" ...

The dataset contains 100 observations and 2 variables:

StudentID: Numerical identifier for each student

FavoriteDrink: Categorical variable with four options (Coffee, Soda, Tea, Water)
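
Before tabulating, it can help to confirm that the category labels are clean and that no responses are missing. The sketch below is illustrative only: `demo_data` is a hypothetical stand-in rebuilt from the counts reported in the frequency table in this document (the row order is an assumption), not the actual Excel file.

```r
# Hypothetical stand-in for DatasetA2, rebuilt from the reported counts.
# Assumption: row order does not matter for these checks.
drinks <- rep(c("Coffee", "Soda", "Tea", "Water"), times = c(26, 29, 28, 17))
demo_data <- data.frame(StudentID = seq_along(drinks), FavoriteDrink = drinks)

# Confirm the categories are exactly the four expected labels
# (catches typos or case differences such as "soda" vs "Soda").
found_levels <- sort(unique(demo_data$FavoriteDrink))
print(found_levels)

# Count missing responses and duplicated student IDs.
n_missing <- sum(is.na(demo_data$FavoriteDrink))
n_dup_ids <- sum(duplicated(demo_data$StudentID))
print(c(missing = n_missing, duplicate_ids = n_dup_ids))
```

On the real data, the same three checks can be run by replacing `demo_data` with `DatasetA2`.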

Creating a frequency table

beverage_table <- table(DatasetA2$FavoriteDrink)
print("Frequency Table for Beverage Preferences : ")
## [1] "Frequency Table for Beverage Preferences : "
print(beverage_table)
## 
## Coffee   Soda    Tea  Water 
##     26     29     28     17

The frequency table shows the count of students who prefer each beverage:

Coffee: 26 students, Soda: 29 students, Tea: 28 students, Water: 17 students

Total: 100 students

Calculate percentages for better understanding

beverage_percentages <- prop.table(beverage_table) * 100
print("Percentage Distribution:")
## [1] "Percentage Distribution:"
print(round(beverage_percentages, 1))
## 
## Coffee   Soda    Tea  Water 
##     26     29     28     17

Percentage breakdown: Soda is the most preferred beverage (29%) and Water is the least preferred (17%). Because the sample contains exactly 100 students, the percentages are identical to the raw counts.

Creating Bar Graph

ggplot(DatasetA2, aes(x = FavoriteDrink, fill = FavoriteDrink)) +
  geom_bar() +
  labs(
    x = "Beverage Type",
    y = "Number of Students",
    title = "Distribution of Beverage Preferences Among Students"
  ) +
  theme_minimal() +
  theme(
    text = element_text(size = 14),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 14),
    plot.title = element_text(size = 14, face = "bold"),
    legend.position = "none"
  ) +
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5)  # place count labels just above each bar

Conduct Chi-Square Goodness of Fit Test

# Observed frequencies from our data
observed <- as.vector(beverage_table)

# Expected proportions (equal preference = 25% each)
expected_proportions <- c(0.25, 0.25, 0.25, 0.25)

# Run the chi-square test
chi_result_a2 <- chisq.test(x = observed, p = expected_proportions)
print("Chi-Square Test Results:")
## [1] "Chi-Square Test Results:"
print(chi_result_a2)
## 
##  Chi-squared test for given probabilities
## 
## data:  observed
## X-squared = 3.6, df = 3, p-value = 0.308

X-squared = 3.6

df = 3

p-value = 0.308

Statistical Significance: p > .05 → The result is NOT statistically significant

This means we fail to reject the null hypothesis
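
To see where the test statistic comes from, the calculation can be reproduced by hand in base R. This is a minimal sketch using the observed counts from the frequency table; the names `obs`, `exp_counts`, `contrib`, `chi_sq`, `deg_free`, and `p_val` are introduced here for illustration only.

```r
# Observed counts as reported in the frequency table.
obs <- c(Coffee = 26, Soda = 29, Tea = 28, Water = 17)

# Expected counts under equal preference: n / k for each of the k categories.
exp_counts <- rep(sum(obs) / length(obs), length(obs))

# Per-category contribution (O - E)^2 / E; their sum is the test statistic.
contrib <- (obs - exp_counts)^2 / exp_counts
chi_sq <- sum(contrib)

# Degrees of freedom = k - 1; the p-value is the upper-tail chi-square area.
deg_free <- length(obs) - 1
p_val <- pchisq(chi_sq, df = deg_free, lower.tail = FALSE)

print(round(contrib, 2))
cat("X-squared =", chi_sq, " df =", deg_free, " p =", round(p_val, 3), "\n")
```

The sum of the contributions should agree with the X-squared value reported by chisq.test(), and the per-category values show how much each beverage adds to the total.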

Calculate expected counts for reference

total_n <- sum(observed)
expected_counts <- total_n * expected_proportions
print("Expected counts (if preferences were equal):")
## [1] "Expected counts (if preferences were equal):"
print(expected_counts)
## [1] 25 25 25 25

If preferences were perfectly equal, we would expect 25 students for each beverage

Expected counts: Coffee=25, Soda=25, Tea=25, Water=25

Our observed counts differ slightly from these expected values
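
The expected counts also matter for the test itself: a common rule of thumb is that the chi-square approximation is adequate when every expected count is at least 5. The sketch below checks this using the `expected` component that chisq.test() returns; the cutoff of 5 is a conventional guideline, not something stated in the assignment, and `fit`, `min_expected`, and `all_ok` are illustrative names.

```r
# Observed counts as reported in the frequency table.
obs <- c(Coffee = 26, Soda = 29, Tea = 28, Water = 17)

# Re-run the goodness-of-fit test against equal proportions.
fit <- chisq.test(x = obs, p = rep(0.25, 4))

# Expected counts are stored on the returned test object.
print(fit$expected)

# Rule-of-thumb check: every expected count should be at least 5.
min_expected <- min(fit$expected)
all_ok <- all(fit$expected >= 5)
cat("Smallest expected count:", min_expected,
    "| all expected counts >= 5:", all_ok, "\n")
```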

Create comparison table

comparison <- data.frame(
  Beverage = names(beverage_table),
  Observed = observed,
  Expected = expected_counts,
  Difference = observed - expected_counts
)
print("Observed vs Expected Comparison:")
## [1] "Observed vs Expected Comparison:"
print(comparison)
##   Beverage Observed Expected Difference
## 1   Coffee       26       25          1
## 2     Soda       29       25          4
## 3      Tea       28       25          3
## 4    Water       17       25         -8

Water shows the largest deviation from its expected count (8 fewer students than expected).

Since p > 0.05, we do NOT calculate an effect size; an effect size is only calculated when p < 0.05.

With p = 0.308, the differences between observed and expected frequencies are not statistically significant.
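
For reference only, this is how an effect size could be obtained if a result were significant. The sketch computes Cohen's w and a goodness-of-fit version of Cramér's V by hand in base R; choosing these two measures is an assumption on my part, since the analysis above does not name one. The rcompanion package loaded at the top appears to offer a helper for this (cramerVFit), but it is not used here so that the example stays self-contained.

```r
# Observed counts as reported in the frequency table.
obs <- c(Coffee = 26, Soda = 29, Tea = 28, Water = 17)
fit <- chisq.test(x = obs, p = rep(0.25, 4))

chi_sq <- unname(fit$statistic)  # test statistic without its "X-squared" name
n <- sum(obs)                    # total sample size
k <- length(obs)                 # number of categories

# Cohen's w: sqrt(chi-square / n)
cohens_w <- sqrt(chi_sq / n)

# Cramer's V for goodness of fit: sqrt(chi-square / (n * (k - 1)))
cramers_v <- sqrt(chi_sq / (n * (k - 1)))

cat("Cohen's w =", round(cohens_w, 3), "\n")
cat("Cramer's V (goodness of fit) =", round(cramers_v, 3), "\n")
```

Because the test here is not significant, these values are shown only to illustrate the calculation and are not part of the reported findings.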

Final interpretation and report

print("FINDINGS FOR SCENARIO A2: A chi-square goodness-of-fit test was conducted to determine whether students preferred tea, coffee, soda, and water equally.")
## [1] "FINDINGS FOR SCENARIO A2: A chi-square goodness-of-fit test was conducted to determine whether students preferred tea, coffee, soda, and water equally."
if(chi_result_a2$p.value < 0.05) {
  cat("The results indicated that the observed frequencies were significantly different from the expected frequencies")
} else {
  cat("The results indicated that the observed frequencies were NOT significantly different from the expected frequencies")
}
## The results indicated that the observed frequencies were NOT significantly different from the expected frequencies
cat(", χ²(", chi_result_a2$parameter, ") = ", round(chi_result_a2$statistic, 2), 
    ", p = ", round(chi_result_a2$p.value, 3), ". ", sep="")
## , χ²(3) = 3.6, p = 0.308.
if(chi_result_a2$p.value < 0.05) {
  cat("This suggests that students do not prefer all beverages equally.")
} else {
  cat("This provides no evidence that students' preferences differ from an equal split across the four beverages.")
}
## This provides no evidence that students' preferences differ from an equal split across the four beverages.
cat("\n\nBased on the observed frequencies: Coffee (", observed[1], "), Soda (", observed[2], 
    "), Tea (", observed[3], "), and Water (", observed[4], "), the variations from the expected 25 per beverage are consistent with random sampling variation.", sep = "")
## 
## 
## Based on the observed frequencies: Coffee (26), Soda (29), Tea (28), and Water (17), the variations from the expected 25 per beverage are consistent with random sampling variation.

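Null Hypothesis: The observed frequencies do not differ from the expected frequencies (each beverage is preferred by 25% of students)

Alternative Hypothesis: The observed frequencies differ from the expected frequencies (at least one beverage is preferred by a proportion other than 25%)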

Test Used: Chi-Square Goodness of Fit

Results: χ²(3) = 3.6, p = .308

Decision: Fail to reject the null hypothesis

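Conclusion: There is no statistically significant evidence that students' beverage preferences differ from an equal split. The variations in the sample (Soda slightly higher, Water lower) are not statistically significant and could be due to random sampling variation. Failing to reject the null hypothesis does not prove that preferences are exactly equal.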