Problem 1: Chi-Square Goodness-of-Fit Test
ACTN3 has two alleles: R and X. In a sample of 436 people,
244 were classified as R
192 were classified as X
We want to test whether the two alleles are equally likely.
Hypotheses
\(H_0\): \(p_R = p_X = 0.5\)
\(H_a\): At least one proportion differs from 0.5
# Observed counts
observed_actn3 <- c(R = 244, X = 192)
# Expected proportions under H0
p_null_actn3 <- c(0.5, 0.5)
# Total sample size
n_total <- sum(observed_actn3)
# Expected counts
expected_actn3 <- p_null_actn3 * n_total
observed_actn3
## R X
## 244 192
expected_actn3
## [1] 218 218
Both expected counts equal 218, well above 5, so the chi-square goodness-of-fit test is appropriate.
chisq_actn3 <- chisq.test(observed_actn3, p = p_null_actn3)
chisq_actn3
##
## Chi-squared test for given probabilities
##
## data: observed_actn3
## X-squared = 6.2018, df = 1, p-value = 0.01276
chisq_actn3$p.value
## [1] 0.01276179
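As a cross-check, the reported statistic can be recomputed directly from the definition \(\chi^2 = \sum (O_i - E_i)^2 / E_i\). The sketch below is a minimal illustration; the names `observed`, `expected`, `chi_sq_gof`, and `p_gof` are introduced here and are not part of the original analysis.

```r
# Observed allele counts from the problem statement
observed <- c(R = 244, X = 192)

# Expected counts under H0: both alleles have probability 0.5
expected <- sum(observed) * c(R = 0.5, X = 0.5)

# The chi-square approximation assumes every expected count is at least 5
stopifnot(all(expected >= 5))

# Chi-square statistic: sum over categories of (O - E)^2 / E
chi_sq_gof <- sum((observed - expected)^2 / expected)

# Upper-tail p-value with df = (number of categories) - 1
p_gof <- pchisq(chi_sq_gof, df = length(observed) - 1, lower.tail = FALSE)

round(c(statistic = chi_sq_gof, p_value = p_gof), 4)
```

Both values should agree with the `chisq.test()` output (6.2018 and 0.01276), since no continuity correction is applied when testing against given probabilities.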
Interpretation
Test statistic: \(\chi^2 \approx 6.20\)
P-value: approximately 0.0128
Since the p-value is less than 0.05, we reject the null hypothesis.
Conclusion: There is evidence that the two alleles R and X are not equally likely in this population. The observed distribution of ACTN3 alleles differs significantly from a 50/50 split.
Problem 2: Chi-Square Test of Association
Who Is More Likely to Take Vitamins: Males or Females?
We will use the dataset NutritionStudy and the variables:
VitaminUse (No, Occasional, Regular)
Sex (Male, Female)
Note: The problem text said “Gender,” but in this dataset the column is named Sex.
Hypotheses
\(H_0\): Vitamin use is not associated with sex.
\(H_a\): Vitamin use is associated with sex.
# Load the tidyverse for read_csv(), glimpse(), and the %>% pipe
library(tidyverse)
# Load the NutritionStudy dataset (make sure NutritionStudy.csv is in your working directory)
#setwd("~/Data 101/HW8")
NutritionStudy <- read_csv("NutritionStudy.csv")
## Rows: 315 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Smoke, Sex, VitaminUse
## dbl (14): ID, Age, Quetelet, Vitamin, Calories, Fat, Fiber, Alcohol, Cholest...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(NutritionStudy)
## Rows: 315
## Columns: 17
## $ ID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ Age <dbl> 64, 76, 38, 40, 72, 40, 65, 58, 35, 55, 66, 40, 57, 66, …
## $ Smoke <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "N…
## $ Quetelet <dbl> 21.4838, 23.8763, 20.0108, 25.1406, 20.9850, 27.5214, 22…
## $ Vitamin <dbl> 1, 1, 2, 3, 1, 3, 2, 1, 3, 3, 1, 2, 3, 1, 3, 3, 1, 1, 3,…
## $ Calories <dbl> 1298.8, 1032.5, 2372.3, 2449.5, 1952.1, 1366.9, 2213.9, …
## $ Fat <dbl> 57.0, 50.1, 83.6, 97.5, 82.6, 56.0, 52.0, 63.4, 57.8, 39…
## $ Fiber <dbl> 6.3, 15.8, 19.1, 26.5, 16.2, 9.6, 28.7, 10.9, 20.3, 15.5…
## $ Alcohol <dbl> 0.0, 0.0, 14.1, 0.5, 0.0, 1.3, 0.0, 0.0, 0.6, 0.0, 1.0, …
## $ Cholesterol <dbl> 170.3, 75.8, 257.9, 332.6, 170.8, 154.6, 255.1, 214.1, 2…
## $ BetaDiet <dbl> 1945, 2653, 6321, 1061, 2863, 1729, 5371, 823, 2895, 330…
## $ RetinolDiet <dbl> 890, 451, 660, 864, 1209, 1439, 802, 2571, 944, 493, 535…
## $ BetaPlasma <dbl> 200, 124, 328, 153, 92, 148, 258, 64, 218, 81, 184, 91, …
## $ RetinolPlasma <dbl> 915, 727, 721, 615, 799, 654, 834, 825, 517, 562, 935, 7…
## $ Sex <chr> "Female", "Female", "Female", "Female", "Female", "Femal…
## $ VitaminUse <chr> "Regular", "Regular", "Occasional", "No", "Regular", "No…
## $ PriorSmoke <dbl> 2, 1, 2, 2, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1,…
Create the contingency table:
vit_sex_table <- table(NutritionStudy$VitaminUse, NutritionStudy$Sex)
vit_sex_table
##
## Female Male
## No 87 24
## Occasional 77 5
## Regular 109 13
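Because the sample contains far more females than males, raw counts are hard to compare across columns. A minimal sketch, rebuilding the table from the counts printed above (the object names `vit_sex_counts` and `col_props` are illustrative), converts each column to proportions:

```r
# Observed counts copied from the contingency table above
vit_sex_counts <- matrix(
  c( 87, 24,
     77,  5,
    109, 13),
  nrow = 3, byrow = TRUE,
  dimnames = list(VitaminUse = c("No", "Occasional", "Regular"),
                  Sex = c("Female", "Male"))
)

# margin = 2 divides each cell by its column (Sex) total
col_props <- prop.table(vit_sex_counts, margin = 2)

round(col_props, 3)
```

Each column sums to 1, so the rows can be compared directly between the sexes.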
Perform the chi-square test of association:
chi_vit_sex <- chisq.test(vit_sex_table)
chi_vit_sex
##
## Pearson's Chi-squared test
##
## data: vit_sex_table
## X-squared = 11.071, df = 2, p-value = 0.003944
chi_vit_sex$p.value
## [1] 0.003944277
Check the expected counts to make sure the test's assumptions are met (every expected count should be at least 5):
chi_vit_sex$expected
##
## Female Male
## No 96.20000 14.80000
## Occasional 71.06667 10.93333
## Regular 105.73333 16.26667
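Each expected count is (row total × column total) / grand total. The sketch below, which assumes the observed counts printed earlier and uses illustrative object names, rebuilds the expected table and the test statistic from that formula:

```r
# Observed counts copied from the contingency table
vit_sex_counts <- matrix(
  c( 87, 24,
     77,  5,
    109, 13),
  nrow = 3, byrow = TRUE,
  dimnames = list(VitaminUse = c("No", "Occasional", "Regular"),
                  Sex = c("Female", "Male"))
)

# Expected counts under independence: row total * column total / grand total
expected_assoc <- outer(rowSums(vit_sex_counts), colSums(vit_sex_counts)) /
  sum(vit_sex_counts)

# The smallest expected count should be at least 5
min(expected_assoc)

# Pearson chi-square statistic and its degrees of freedom (r - 1)(c - 1)
chi_sq_assoc <- sum((vit_sex_counts - expected_assoc)^2 / expected_assoc)
df_assoc <- (nrow(vit_sex_counts) - 1) * (ncol(vit_sex_counts) - 1)

# Upper-tail p-value
p_assoc <- pchisq(chi_sq_assoc, df = df_assoc, lower.tail = FALSE)

round(c(statistic = chi_sq_assoc, df = df_assoc, p_value = p_assoc), 4)
```

The smallest expected count is in the Male/Occasional cell (about 10.9), so the condition holds for every cell.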
Interpretation
From the data:
Observed contingency table (VitaminUse by Sex):
VitaminUse    Female    Male
No                87      24
Occasional        77       5
Regular          109      13
Chi-square statistic: \(\chi^2 \approx 11.07\)
P-value: approximately 0.00394
Since p ≈ 0.0039 < 0.05, we reject the null hypothesis.
Conclusion: There is a significant association between vitamin use and sex. In this sample, vitamin use patterns differ between males and females: 24 of 42 males (57%) reported no vitamin use, compared with 87 of 273 females (32%).
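To see which cells drive the association, one option is to inspect the Pearson residuals, \((O - E)/\sqrt{E}\), which `chisq.test()` returns in its `residuals` component. This is a supplementary sketch using the observed counts from the contingency table; the object names are illustrative.

```r
# Observed counts copied from the contingency table
vit_sex_counts <- matrix(
  c( 87, 24,
     77,  5,
    109, 13),
  nrow = 3, byrow = TRUE,
  dimnames = list(VitaminUse = c("No", "Occasional", "Regular"),
                  Sex = c("Female", "Male"))
)

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson_resid <- chisq.test(vit_sex_counts)$residuals

round(pearson_resid, 2)
```

Cells with residuals far from 0 contribute most to the statistic. The largest positive residual is in the Male/No cell, meaning more males reported no vitamin use than independence would predict.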
Problem 3: ANOVA — Fish Gill Beat Rates and Calcium Level
Professor Baldwin studied how water calcium level might affect fish gill beat rate (beats per minute).
Factor: Calcium level in water
Low (0.71 mg/L)
Medium (5.24 mg/L)
High (18.24 mg/L)
Response: GillRate (beats per minute)
Each group: 30 fish per calcium level (90 fish in total)
Dataset: FishGills3
Hypotheses
\(H_0\): \(\mu_{Low} = \mu_{Medium} = \mu_{High}\)
\(H_a\): Not all mean gill rates are equal
# Load the FishGills3 dataset
FishGills3 <- read_csv("FishGills3.csv")
## Rows: 90 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Calcium
## dbl (1): GillRate
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
FishGills3$Calcium <- as.factor(FishGills3$Calcium)
head(FishGills3)
## # A tibble: 6 × 2
## Calcium GillRate
## <fct> <dbl>
## 1 Low 55
## 2 Low 63
## 3 Low 78
## 4 Low 85
## 5 Low 65
## 6 Low 98
summary(FishGills3)
## Calcium GillRate
## High :30 Min. :33.00
## Low :30 1st Qu.:48.00
## Medium:30 Median :62.50
## Mean :61.78
## 3rd Qu.:72.00
## Max. :98.00
Compute group means:
FishGills3 %>%
  group_by(Calcium) %>%
  summarise(
    mean_gill = mean(GillRate),
    sd_gill = sd(GillRate),
    n = n()
  )
## # A tibble: 3 × 4
## Calcium mean_gill sd_gill n
## <fct> <dbl> <dbl> <int>
## 1 High 58.2 13.8 30
## 2 Low 68.5 16.2 30
## 3 Medium 58.7 14.3 30
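One-way ANOVA assumes roughly equal variability across groups. A common rule of thumb (an assumption of this sketch, not a step from the original analysis) is that the largest group standard deviation should be less than twice the smallest. Using the rounded standard deviations printed above:

```r
# Group standard deviations copied (rounded) from the summary table
sd_gill <- c(High = 13.8, Low = 16.2, Medium = 14.3)

# Ratio of the largest to the smallest standard deviation
sd_ratio <- max(sd_gill) / min(sd_gill)
sd_ratio

# TRUE means the rule of thumb is satisfied
sd_ratio < 2
```

The ratio is well below 2, so the equal-variance condition looks reasonable for these data.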
Run the ANOVA:
anova_model <- aov(GillRate ~ Calcium, data = FishGills3)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Calcium 2 2037 1018.6 4.648 0.0121 *
## Residuals 87 19064 219.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
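The F statistic is the ratio of the between-group mean square to the residual mean square, where each mean square is a sum of squares divided by its degrees of freedom. A minimal sketch using the rounded sums of squares from the table above (so the results match only up to rounding; the object names are illustrative):

```r
# Sums of squares and degrees of freedom copied from the ANOVA table
ss_between <- 2037
df_between <- 2
ss_resid   <- 19064
df_resid   <- 87

# Mean squares: sum of squares divided by degrees of freedom
ms_between <- ss_between / df_between
ms_resid   <- ss_resid / df_resid

# F statistic and its upper-tail p-value
f_stat  <- ms_between / ms_resid
p_anova <- pf(f_stat, df_between, df_resid, lower.tail = FALSE)

round(c(F = f_stat, p_value = p_anova), 4)
```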
F-statistic: F ≈ 4.65
P-value: approximately 0.0121
Since p ≈ 0.012 < 0.05, we reject the null hypothesis.
Conclusion: There is evidence that mean gill beat rate differs among at least some of the calcium levels.
Post-hoc Comparison: Tukey’s HSD
Since the ANOVA result is significant, we use Tukey’s HSD to see which pairs of calcium levels differ.
TukeyHSD(anova_model)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = GillRate ~ Calcium, data = FishGills3)
##
## $Calcium
## diff lwr upr p adj
## Low-High 10.333333 1.219540 19.4471264 0.0222533
## Medium-High 0.500000 -8.613793 9.6137931 0.9906108
## Medium-Low -9.833333 -18.947126 -0.7195402 0.0313247
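With equal group sizes, each Tukey interval is the difference in means plus or minus \(q_{0.95;\,k,\,df}\sqrt{MSE/n}\), where \(q\) is a studentized range quantile. The sketch below takes the residual sum of squares and degrees of freedom from the ANOVA table and \(n = 30\) fish per group, and uses base R's `qtukey()` and `ptukey()`. Object names are illustrative, and the results agree with the output above only up to rounding.

```r
# Inputs copied from the ANOVA table and the group sizes
mse         <- 19064 / 87   # residual mean square
df_resid    <- 87
k_groups    <- 3
n_per_group <- 30

# Standard error on the studentized-range scale (equal group sizes)
se_q <- sqrt(mse / n_per_group)

# Half-width of each 95% family-wise interval
half_width <- qtukey(0.95, nmeans = k_groups, df = df_resid) * se_q

# Differences in means copied from the TukeyHSD output
diffs <- c("Low-High" = 10.333333, "Medium-High" = 0.5, "Medium-Low" = -9.833333)

# Adjusted p-values from the studentized range distribution
p_adj <- ptukey(abs(diffs) / se_q, nmeans = k_groups, df = df_resid,
                lower.tail = FALSE)

data.frame(
  comparison  = names(diffs),
  lwr         = unname(diffs - half_width),
  upr         = unname(diffs + half_width),
  p_adj       = unname(p_adj),
  significant = unname(p_adj < 0.05)
)
```

A comparison is significant at the 0.05 level exactly when its interval excludes 0, which here holds for Low-High and Medium-Low but not for Medium-High.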
Interpretation of Tukey’s HSD:
Low vs High is significant (adjusted p ≈ 0.022): mean gill rate is about 10.3 beats per minute higher in low-calcium water.
Low vs Medium is also significant (adjusted p ≈ 0.031): mean gill rate is about 9.8 beats per minute higher in low-calcium water.
Medium vs High is not statistically significant at the 0.05 level (adjusted p ≈ 0.99).
Overall Conclusion
Calcium level in the water has a significant effect on mean gill beat rate. Based on the Tukey post-hoc results, fish in low-calcium water have a significantly higher mean gill rate than fish in either medium- or high-calcium water, while the medium and high groups do not differ significantly from each other.