Homework 8

In your markdown, answer the following problems. Include the following:

Your hypotheses
p-value
conclusion

Problem 1

ACTN3 is a gene that encodes alpha-actinin-3, a protein in fast-twitch muscle fibers, important for activities like sprinting and weightlifting. The gene has two main alleles: R (functional) and X (non-functional). The R allele is linked to better performance in strength, speed, and power sports, while the X allele is associated with endurance due to a greater reliance on slow-twitch fibers. However, athletic performance is influenced by various factors, including training, environment, and other genes, making the ACTN3 genotype just one contributing factor. A study examines the ACTN3 genetic alleles R and X, also associated with fast-twitch muscles. Of the 436 people in this sample, 244 were classified as R, and 192 were classified as X. Does the sample provide evidence that the two options are not equally likely? Conduct the test using a chi-square goodness-of-fit test.

Answer

This is a goodness-of-fit test, with one variable (presence of a specific allele of ACTN3 related to athletic performance) and two categories (presence of R allele and presence of X allele). Of 436 cases, 244 possess R and 192 possess X alleles. We want to test whether the observed difference is statistically significant.

Hypotheses

The null hypothesis is that the two alleles are equally likely, while the alternative hypothesis is that their proportions differ. Here \(p_1\) and \(p_2\) are the population proportions of the R and X alleles.

\(H_0\): \(p_1\) = \(p_2\) = 1/2

\(H_a\): \(p_1\) \(\neq\) \(p_2\) (equivalently, \(p_1\) \(\neq\) 1/2)

# Observed counts
observed <- c(244, 192)

# Null values
null_proportions <- c(1/2, 1/2)

Check the expected values to make sure that we can perform the chi-square test.

# Expected values
expected_values <- null_proportions*sum(observed) 
expected_values

## [1] 218 218

Both expected counts are greater than 5, so we can perform the chi-square test.

# Perform chi-squared goodness-of-fit test
chisq.test(observed)

## 
##  Chi-squared test for given probabilities
## 
## data:  observed
## X-squared = 6.2018, df = 1, p-value = 0.01276

Based on the p-value of 0.01276 (less than 0.05), we reject the hypothesis that the outcomes are equally likely and conclude, at the 5% significance level, that the probabilities of the R and X alleles are different.

Note that the null proportions were not passed to chisq.test() because the function assumes equal probabilities by default.
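As a cross-check, the test statistic can be reproduced by hand from the observed and expected counts. The sketch below uses only the counts given in the problem; the names `chi_sq`, `p_value`, and `explicit_test` are illustrative and not part of the original analysis. It also passes the null proportions explicitly through the `p` argument, which should agree with the default.

```r
# Observed counts and null proportions from the problem statement
observed <- c(R = 244, X = 192)
null_proportions <- c(1/2, 1/2)

# Expected counts under the null hypothesis
expected_values <- null_proportions * sum(observed)

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq <- sum((observed - expected_values)^2 / expected_values)

# Upper-tail p-value with (number of categories - 1) degrees of freedom
p_value <- pchisq(chi_sq, df = length(observed) - 1, lower.tail = FALSE)

# Same test with the null proportions stated explicitly
explicit_test <- chisq.test(observed, p = null_proportions)

round(c(chi_sq = chi_sq, p_value = p_value), 5)
```

Both routes should match the X-squared value and p-value reported by chisq.test(observed).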

Problem 1 Conclusion

p-value = 0.01276. It is below 0.05, so there is significant evidence that the probabilities of obtaining the R and X alleles are different.

Problem 2

Who Is More Likely to Take Vitamins: Males or Females? The dataset NutritionStudy contains, among other things, information about vitamin use and the gender of the participants. Is there a significant association between these two variables? Use the variables VitaminUse and Gender to conduct a chi-square analysis and give the results. (Test for Association)

Answer

This is a test of association between gender and vitamin use, using data from the NutritionStudy dataset. In this copy of the dataset the gender variable is named Sex.

Load the dataset, which is assumed to be located in the same directory as the Rmd file. Read the dataset and summarize its variables.

library(tidyverse)   # provides read_csv() and the dplyr verbs used below
setwd(getwd())       # no-op: when knitting, the wd is already the Rmd file's folder

df<- read_csv("NutritionStudy.csv")

summary(df)

##        ID             Age           Smoke              Quetelet    
##  Min.   :  1.0   Min.   :19.00   Length:315         Min.   :16.33  
##  1st Qu.: 79.5   1st Qu.:39.00   Class :character   1st Qu.:21.80  
##  Median :158.0   Median :48.00   Mode  :character   Median :24.74  
##  Mean   :158.0   Mean   :50.15                      Mean   :26.16  
##  3rd Qu.:236.5   3rd Qu.:62.50                      3rd Qu.:28.85  
##  Max.   :315.0   Max.   :83.00                      Max.   :50.40  
##     Vitamin         Calories           Fat             Fiber      
##  Min.   :1.000   Min.   : 445.2   Min.   : 14.40   Min.   : 3.10  
##  1st Qu.:1.000   1st Qu.:1338.0   1st Qu.: 53.95   1st Qu.: 9.15  
##  Median :2.000   Median :1666.8   Median : 72.90   Median :12.10  
##  Mean   :1.965   Mean   :1796.7   Mean   : 77.03   Mean   :12.79  
##  3rd Qu.:3.000   3rd Qu.:2100.4   3rd Qu.: 95.25   3rd Qu.:15.60  
##  Max.   :3.000   Max.   :6662.2   Max.   :235.90   Max.   :36.80  
##     Alcohol         Cholesterol       BetaDiet     RetinolDiet    
##  Min.   :  0.000   Min.   : 37.7   Min.   : 214   Min.   :  30.0  
##  1st Qu.:  0.000   1st Qu.:155.0   1st Qu.:1116   1st Qu.: 480.0  
##  Median :  0.300   Median :206.3   Median :1802   Median : 707.0  
##  Mean   :  3.279   Mean   :242.5   Mean   :2186   Mean   : 832.7  
##  3rd Qu.:  3.200   3rd Qu.:308.9   3rd Qu.:2836   3rd Qu.:1037.0  
##  Max.   :203.000   Max.   :900.7   Max.   :9642   Max.   :6901.0  
##    BetaPlasma     RetinolPlasma        Sex             VitaminUse       
##  Min.   :   0.0   Min.   : 179.0   Length:315         Length:315        
##  1st Qu.:  90.0   1st Qu.: 466.0   Class :character   Class :character  
##  Median : 140.0   Median : 566.0   Mode  :character   Mode  :character  
##  Mean   : 189.9   Mean   : 602.8                                        
##  3rd Qu.: 230.0   3rd Qu.: 716.0                                        
##  Max.   :1415.0   Max.   :1727.0                                        
##    PriorSmoke   
##  Min.   :1.000  
##  1st Qu.:1.000  
##  Median :2.000  
##  Mean   :1.638  
##  3rd Qu.:2.000  
##  Max.   :3.000

# Create a contingency table of Sex by VitaminUse and display it

observed_dataset<- table(df$Sex, df$VitaminUse)
observed_dataset

##         
##           No Occasional Regular
##   Female  87         77     109
##   Male    24          5      13

Hypotheses

The null hypothesis is that VitaminUse is not associated with Sex, while the alternative hypothesis is that both variables are associated.

\(H_0\): VitaminUse is not associated with Sex.

\(H_a\): VitaminUse is associated with Sex.

# Perform the Chi square test:

chisq.test(observed_dataset)

## 
##  Pearson's Chi-squared test
## 
## data:  observed_dataset
## X-squared = 11.071, df = 2, p-value = 0.003944

Because the p-value is smaller than 0.05, we reject the hypothesis that VitaminUse is not associated with Sex and conclude, at the 5% significance level, that the association between VitaminUse and Sex is statistically significant.
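The chi-square test of association relies on expected counts that are large enough, and the test alone does not say which cells drive the result. The following sketch rebuilds the contingency table from the counts printed above, so it does not need the CSV file. The names `vitamin_by_sex` and `test_result` are illustrative assumptions, not part of the original analysis.

```r
# Counts copied from the contingency table printed above
vitamin_by_sex <- matrix(
  c(87, 77, 109,
    24,  5,  13),
  nrow = 2, byrow = TRUE,
  dimnames = list(Sex = c("Female", "Male"),
                  VitaminUse = c("No", "Occasional", "Regular"))
)

test_result <- chisq.test(vitamin_by_sex)

# Expected counts under independence (each should be at least 5)
test_result$expected

# Pearson residuals: (observed - expected) / sqrt(expected)
round(test_result$residuals, 2)

# Share of each sex in each vitamin-use category (rows sum to 1)
round(prop.table(vitamin_by_sex, margin = 1), 3)
```

In this table the share of males reporting no vitamin use (24 of 42) is higher than the share of females (87 of 273), which suggests the direction of the association.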

Problem 2 Conclusion

p-value = 0.003944. It is well below 0.05, so the association between VitaminUse and Sex is statistically significant.

Problem 3

Most fish use gills for respiration in water, and researchers can observe how fast a fish’s gill cover beats to study ventilation, much like we might observe a person’s breathing rate. Professor Brad Baldwin is interested in how water chemistry might affect gill beat rates. In one experiment, he randomly assigned fish to tanks with different calcium levels. One tank was low in calcium (0.71 mg/L), the second tank had a medium amount (5.24 mg/L), and the third tank had water with a high calcium level (18.24 mg/L). His research team counted gill rates (beats per minute) for samples of 30 fish in each tank. The results are stored in FishGills3. Perform ANOVA test to see if the mean gill rate differs depending on the calcium level of the water.

Answer

We will run a one-way ANOVA test to examine whether fish breathing rate (mean gill rate) depends on water calcium level, using 3 groups of fish in 3 water tanks containing different levels of calcium.

Load the FishGills3 dataset, which is assumed to be located in the same directory as the Rmd file. Read the dataset and summarize its variables.

setwd(getwd())       # no-op: when knitting, the wd is already the Rmd file's folder

df<- read_csv("FishGills3.csv")

## Rows: 90 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Calcium
## dbl (1): GillRate
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Summarize the basic statistics of df and calculate the mean gill rate grouped by calcium level

summary(df)

##    Calcium             GillRate    
##  Length:90          Min.   :33.00  
##  Class :character   1st Qu.:48.00  
##  Mode  :character   Median :62.50  
##                     Mean   :61.78  
##                     3rd Qu.:72.00  
##                     Max.   :98.00

df |> group_by(Calcium) |> summarize(mean_gillrate = mean(GillRate))

## # A tibble: 3 × 2
##   Calcium mean_gillrate
##   <chr>           <dbl>
## 1 High             58.2
## 2 Low              68.5
## 3 Medium           58.7

One can observe that the mean gill rate is lower at higher calcium levels, with nearly all of the drop occurring between the low and medium calcium levels.

Hypotheses

The null hypothesis is that the mean gill rates for the 3 calcium levels are the same. The alternative hypothesis is that at least one mean gill rate is different from the others.

\(H_0\): \(\mu_L\) = \(\mu_M\) = \(\mu_H\)

\(H_a\): not all \(\mu_i\) are equal

# Perform ANOVA

anova_result <- aov(GillRate ~ Calcium, data = df)

anova_result

## Call:
##    aov(formula = GillRate ~ Calcium, data = df)
## 
## Terms:
##                   Calcium Residuals
## Sum of Squares   2037.222 19064.333
## Deg. of Freedom         2        87
## 
## Residual standard error: 14.80305
## Estimated effects may be unbalanced

# Use summary() to get the degrees of freedom, F value, and p-value

summary(anova_result)

##             Df Sum Sq Mean Sq F value Pr(>F)  
## Calcium      2   2037  1018.6   4.648 0.0121 *
## Residuals   87  19064   219.1                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value (0.0121) is below 0.05, indicating evidence against the null hypothesis. The ANOVA test suggests that the mean gill rates differ significantly depending on the calcium level.
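The F value and p-value in the summary table can be reproduced from the sums of squares and degrees of freedom in the aov() output above. This is a minimal sketch with the numbers copied by hand; the variable names are illustrative.

```r
# Sums of squares and degrees of freedom copied from the aov() output
ss_between <- 2037.222    # Calcium
ss_within  <- 19064.333   # Residuals
df_between <- 2           # 3 groups - 1
df_within  <- 87          # 90 fish - 3 groups

# Mean squares and F statistic
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within
f_stat <- ms_between / ms_within

# Upper-tail p-value from the F distribution
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

round(c(F = f_stat, p_value = p_value), 4)
```

The result should agree with the F value and Pr(>F) columns printed by summary(anova_result).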

Following the class activity, do the Tukey’s Honestly Significant Difference (HSD) test on the ANOVA model.

# Tukey's Honestly Significant Difference (HSD) test on the ANOVA model
library(tidyverse)  # not strictly needed here: TukeyHSD() comes from base R's stats package

TukeyHSD(anova_result)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = GillRate ~ Calcium, data = df)
## 
## $Calcium
##                  diff        lwr        upr     p adj
## Low-High    10.333333   1.219540 19.4471264 0.0222533
## Medium-High  0.500000  -8.613793  9.6137931 0.9906108
## Medium-Low  -9.833333 -18.947126 -0.7195402 0.0313247

The HSD test shows that the largest difference between mean gill rates is between the Low and High calcium levels (10.33 beats per minute, p = 0.022). The difference between the Low and Medium levels is almost as large (9.83 beats per minute, p = 0.031). The difference between the Medium and High levels is not statistically significant (p = 0.991).
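To see where the lwr and upr columns come from, the interval half-width can be rebuilt from the residual mean square and the studentized range distribution. The sketch below assumes the balanced design described in the problem (30 fish per tank) and copies the residual sum of squares from the ANOVA output; the variable names are illustrative.

```r
# Residual mean square and its degrees of freedom, from the ANOVA output
ms_within <- 19064.333 / 87
df_within <- 87
n_per_group <- 30   # fish per tank, from the problem statement
n_groups <- 3

# Standard error of a difference between two group means
se_diff <- sqrt(ms_within * (1 / n_per_group + 1 / n_per_group))

# Tukey critical value: studentized range quantile divided by sqrt(2)
crit <- qtukey(0.95, nmeans = n_groups, df = df_within) / sqrt(2)

# Half-width shared by all three intervals (balanced design)
half_width <- crit * se_diff

# Example: interval for Low - High, whose observed difference is 10.333333
low_high <- 10.333333 + c(lwr = -1, upr = 1) * half_width
round(low_high, 3)
```

Because every group has the same size, all three comparisons share one half-width, so a difference is significant exactly when its absolute value exceeds that half-width.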

Problem 3 Conclusion

p-value = 0.0121. The fish breathing rate (mean gill rate) differs significantly depending on the calcium level in the water: it is higher at the low calcium level than at the medium or high levels. I looked into this and it makes sense. High calcium levels make gill membranes “harder”, impermeable to vital ions (such as Na+ and K+). Low calcium levels make membranes permeable and induce faster loss of vital ions. To restore ionic balance, fish need more energy, hence pump more water (breathe harder) to get more oxygen. The differences in mean_gillrate show that the effect is most pronounced and statistically significant when the calcium level increases about sevenfold (from 0.71 to 5.24 mg/L), and is not statistically significant when it increases further from 5.24 to 18.24 mg/L, probably because membranes are already saturated with calcium by the time water reaches 3 mg/L.

Published

Published at https://rpubs.com/rmiranda/1361375

Raul Miranda

2025-10-28