Three problems — one chi-square goodness-of-fit, one chi-square test of independence, and one one-way ANOVA. Each one follows the same pattern: state hypotheses, run the test, interpret the result.
ACTN3 is a gene that encodes alpha-actinin-3, a protein in fast-twitch muscle fibers. The gene has two main alleles: R (functional) and X (non-functional). The R allele is linked to better performance in strength/speed/power sports; the X allele is associated with endurance.
A study of 436 people classified 244 as R and 192 as X. Does this provide evidence that the two options are NOT equally likely?
State your hypotheses:
p₁ = proportion of people who have the R allele
p₂ = proportion of people who have the X allele
- H₀: p₁ = p₂ = 0.5
- H₁: at least one proportion is not 0.5
# We believe that no subject could have been classified as both. We also believe that the observations came from random sampling. Therefore we believe the observations in the data are independent.
# We are using counts and not percentages
# checking expected counts >= 5 condition:
# np = n(1-p) = 436(0.5) = 218. Both are greater than or equal to 5.
# using R as a check:
observed <- c(244, 192)
observed_test <- chisq.test(observed) # put results in observed_test
observed_test$expected # pulling expected values out of observed_test
## [1] 218 218
# returns 218 and 218
# Q1. Run a chi-square goodness-of-fit test.
# observed is a vector of counts in the order c(R allele, X allele)
# (Hint: observed <- c(244, 192); chisq.test(observed))
observed <- c(244, 192)
chisq.test(observed)
##
## Chi-squared test for given probabilities
##
## data: observed
## X-squared = 6.2018, df = 1, p-value = 0.01276
# Q2. What is the p-value? At α = 0.05, do you reject H₀?
# The p-value is 0.01276. Since the p-value is less than α = 0.05,
# we reject the null hypothesis that the proportion of people with the R allele is equal to the proportion of people with the X allele.
Q3. Write your conclusion in plain English:
The p-value is 0.01276. Since the p-value is less than α = 0.05, we reject the null hypothesis that the proportion of people with the R allele is equal to the proportion of people with the X allele. Assuming the null hypothesis is true (that the R allele and X allele are equally likely), the probability of observing a split at least as extreme as 244 R alleles and 192 X alleles, purely due to sampling variability, is only about 1.3%. Because this probability is so small, we reject the null hypothesis in favor of the alternative that the alleles do not occur in equal proportions. Since the sample had more R’s than X’s, the R allele appears to be more common than the X allele. So yes, this study provides evidence that the two alleles are not equally likely.
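As a cross-check on the goodness-of-fit result, the statistic and p-value can be rebuilt by hand from the observed and expected counts. This is only a sketch of the arithmetic behind `chisq.test()`; the variable names `expected`, `chi_sq`, and `p_value` are introduced here for illustration.

```r
# Observed allele counts from the study (244 R, 192 X)
observed <- c(R = 244, X = 192)

# Expected counts under H0: each allele has probability 0.5
expected <- sum(observed) * c(0.5, 0.5)

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq <- sum((observed - expected)^2 / expected)

# Upper-tail p-value with df = (number of categories - 1) = 1
p_value <- pchisq(chi_sq, df = length(observed) - 1, lower.tail = FALSE)

round(chi_sq, 4)   # 6.2018, matching the X-squared value reported by chisq.test()
round(p_value, 5)  # 0.01276, matching the reported p-value
```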
The NutritionStudy.csv dataset contains data on vitamin
use (VitaminUse) and gender (Sex) for many
participants. Is there a significant association between these two
variables?
Download NutritionStudy.csv from the Datasets folder on
Blackboard.
library(dplyr) # provides glimpse(), group_by(), and summarize()
nutrition <- read.csv("NutritionStudy.csv")
glimpse(nutrition)
## Rows: 315
## Columns: 17
## $ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ Age <int> 64, 76, 38, 40, 72, 40, 65, 58, 35, 55, 66, 40, 57, 66, …
## $ Smoke <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "N…
## $ Quetelet <dbl> 21.4838, 23.8763, 20.0108, 25.1406, 20.9850, 27.5214, 22…
## $ Vitamin <int> 1, 1, 2, 3, 1, 3, 2, 1, 3, 3, 1, 2, 3, 1, 3, 3, 1, 1, 3,…
## $ Calories <dbl> 1298.8, 1032.5, 2372.3, 2449.5, 1952.1, 1366.9, 2213.9, …
## $ Fat <dbl> 57.0, 50.1, 83.6, 97.5, 82.6, 56.0, 52.0, 63.4, 57.8, 39…
## $ Fiber <dbl> 6.3, 15.8, 19.1, 26.5, 16.2, 9.6, 28.7, 10.9, 20.3, 15.5…
## $ Alcohol <dbl> 0.0, 0.0, 14.1, 0.5, 0.0, 1.3, 0.0, 0.0, 0.6, 0.0, 1.0, …
## $ Cholesterol <dbl> 170.3, 75.8, 257.9, 332.6, 170.8, 154.6, 255.1, 214.1, 2…
## $ BetaDiet <int> 1945, 2653, 6321, 1061, 2863, 1729, 5371, 823, 2895, 330…
## $ RetinolDiet <int> 890, 451, 660, 864, 1209, 1439, 802, 2571, 944, 493, 535…
## $ BetaPlasma <int> 200, 124, 328, 153, 92, 148, 258, 64, 218, 81, 184, 91, …
## $ RetinolPlasma <int> 915, 727, 721, 615, 799, 654, 834, 825, 517, 562, 935, 7…
## $ Sex <chr> "Female", "Female", "Female", "Female", "Female", "Femal…
## $ VitaminUse <chr> "Regular", "Regular", "Occasional", "No", "Regular", "No…
## $ PriorSmoke <int> 2, 1, 2, 2, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1,…
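State your hypotheses:
- H₀: VitaminUse and Sex are independent (there is no association between them)
- H₁: VitaminUse and Sex are not independent (there is an association between them)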
# Q4. Build a contingency table of VitaminUse and Sex using table().
two_way_vitamin_table <- table(nutrition$VitaminUse , nutrition$Sex)
two_way_vitamin_table
##
## Female Male
## No 87 24
## Occasional 77 5
## Regular 109 13
# Q5. Run a chi-square test of independence on that table.
# We assume that no subject would select multiple vitamin use descriptors. It appears that the study has excluded those whose gender does not fit in the exclusive male / female categories.
# We also believe that the observations came from random sampling. Therefore we believe the observations in the data are independent.
# We are using counts and not percentages
# All expected cell counts should be >= 5
chisq.test(two_way_vitamin_table)$expected
##
## Female Male
## No 96.20000 14.80000
## Occasional 71.06667 10.93333
## Regular 105.73333 16.26667
all(chisq.test(two_way_vitamin_table)$expected >= 5)
## [1] TRUE
# The table of expected values and the TRUE returned by the all() check confirm that the expected cell count condition is satisfied
chisq.test(two_way_vitamin_table)
##
## Pearson's Chi-squared test
##
## data: two_way_vitamin_table
## X-squared = 11.071, df = 2, p-value = 0.003944
# Q6. What is the p-value? Do you reject H₀ at α = 0.05?
# The p-value is 0.003944. Since the p-value is less than α = 0.05,
# we reject the null hypothesis that vitamin use and sex are independent.
Q7. Write your conclusion in plain English:
The p-value is 0.003944, which is much less than α = 0.05, so we reject the null hypothesis. If VitaminUse and Sex were actually independent, the probability of seeing an association at least this strong, purely due to random sampling, would be only about 0.39%. Because that’s so unlikely, we conclude that there is a significant association between vitamin use and sex: vitamin use patterns differ between males and females. Looking at the data, women appear more likely than men to use vitamins “Occasionally” or “Regularly”, while a notably larger share of men fall into the “No” vitamin use category than would be expected if the two variables were unrelated.
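One way to see which cells drive the association is to compare the vitamin use distribution within each sex and to inspect the standardized residuals from the test. The sketch below rebuilds the contingency table from the counts printed above, so it runs without the CSV file; the object names `vitamin_table` and `vitamin_test` are introduced here for illustration.

```r
# Rebuild the 3 x 2 contingency table from the counts shown in the output above
vitamin_table <- matrix(
  c(87, 77, 109,   # Female column: No, Occasional, Regular
    24,  5,  13),  # Male column:   No, Occasional, Regular
  nrow = 3,
  dimnames = list(VitaminUse = c("No", "Occasional", "Regular"),
                  Sex = c("Female", "Male"))
)

vitamin_test <- chisq.test(vitamin_table)

# Share of each vitamin use category within each sex (columns sum to 1)
round(prop.table(vitamin_table, margin = 2), 3)

# Standardized residuals: positive means more observations than expected
# under independence, negative means fewer
round(vitamin_test$stdres, 2)
```

Cells whose standardized residuals are large in absolute value (roughly beyond ±2, a common rule of thumb) are the ones contributing most to the chi-square statistic.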
Researchers wanted to know how water chemistry affects fish ventilation. Fish were randomly assigned to one of three tanks with different calcium levels: Low, Medium, or High.
The team counted gill rates (beats per minute) for 30 fish in each
tank. The data is in FishGills3.csv.
Download FishGills3.csv from the Datasets folder on
Blackboard.
fish <- read.csv("FishGills3.csv")
glimpse(fish)
## Rows: 90
## Columns: 2
## $ Calcium <chr> "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low", "Low"…
## $ GillRate <int> 55, 63, 78, 85, 65, 98, 68, 84, 44, 87, 48, 86, 93, 64, 83, 7…
# conditions check for ANOVA
# Independence - the fish were randomly assigned to the tanks
# Check whether groups have reasonable / similar sample size and variance
fish |>
  group_by(Calcium) |>
  summarize(n = n(), sd = sd(GillRate))
## # A tibble: 3 × 3
## Calcium n sd
## <chr> <int> <dbl>
## 1 High 30 13.8
## 2 Low 30 16.2
## 3 Medium 30 14.3
# Based upon the tibble, each calcium level (group) has the same number of observations at n = 30. The standard deviations (13.78, 16.23, 14.28) of the groups are close enough to satisfy the similar / roughly equal variance condition.
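The "close enough" judgment can be made explicit with a common textbook rule of thumb: the largest group standard deviation should be less than twice the smallest. The threshold of 2 is an assumed convention, not something stated in the assignment, and the values below are copied from the summary above rather than recomputed from the data.

```r
# Group standard deviations as reported in the summary above
group_sd <- c(High = 13.78, Low = 16.23, Medium = 14.28)

# Ratio of the largest to the smallest standard deviation
sd_ratio <- max(group_sd) / min(group_sd)
round(sd_ratio, 2)  # 1.18

# Rule of thumb (assumed): ratio below 2 supports the equal-variance condition
sd_ratio < 2        # TRUE
```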
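State your hypotheses:
- H₀: μ_Low = μ_Medium = μ_High (mean gill rate is the same at all three calcium levels)
- H₁: at least one calcium level has a different mean gill rate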
# Q8. Run a one-way ANOVA testing GillRate by Calcium.
# (Hint: aov(GillRate ~ Calcium, data = fish))
fish_anova <- aov(GillRate ~ Calcium, data = fish)
fish_anova
## Call:
## aov(formula = GillRate ~ Calcium, data = fish)
##
## Terms:
## Calcium Residuals
## Sum of Squares 2037.222 19064.333
## Deg. of Freedom 2 87
##
## Residual standard error: 14.80305
## Estimated effects may be unbalanced
# Q9. Use summary() on the result. What is the F statistic and p-value?
summary(fish_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## Calcium 2 2037 1018.6 4.648 0.0121 *
## Residuals 87 19064 219.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Given the summary(fish_anova) results:
# The F statistic value is 4.648.
# The p-value is 0.0121
# Q10. At α = 0.05, do you reject H₀?
# Since p-value = 0.0121 < 0.05, we reject the null hypothesis in favor of the alternative hypothesis.
Q11. Write your conclusion in plain English:
The F-statistic is 4.648 with a p-value of 0.0121.
Since 0.0121 is less than α = 0.05, we reject the null hypothesis that mean gill rate is the same across all three calcium levels. Assuming H₀ were true (equal means across all calcium levels), the probability of seeing group differences this large purely by chance would be only about 1.2%. Because this is unlikely, we conclude that calcium level in the water does have a significant effect on fish gill ventilation rate — at least one calcium level produces a different average gill rate than the others.
The F statistic of 4.648 is the ratio of the variability between groups to the variability within groups (mean square between / mean square within). Here the between-group variability is about 4.6 times the within-group variability. A ratio this far above 1, together with the small p-value, suggests that the differences between the calcium groups are unlikely to be due to random chance alone. This matches our decision to reject the null hypothesis.
Calcium levels do seem to have an effect on gill rate.
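The F statistic and p-value can also be reconstructed from the sums of squares and degrees of freedom printed by `aov()`. This is a sketch of the arithmetic only; the variable names are introduced here for illustration, and the inputs are copied from the output above.

```r
# Sums of squares and degrees of freedom from the aov() output
ss_between <- 2037.222   # Calcium
df_between <- 2
ss_within  <- 19064.333  # Residuals
df_within  <- 87

# Mean squares: sum of squares divided by its degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F statistic: between-group mean square over within-group mean square
f_stat <- ms_between / ms_within

# Upper-tail p-value from the F distribution with (2, 87) degrees of freedom
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

round(f_stat, 3)   # 4.648, matching summary(fish_anova)
round(p_value, 4)  # 0.0121, matching Pr(>F)
```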