Homework 8

Problem 1

# Observed counts
observed <- c(244, 192)

# Null values
theoretical_prop <- rep(1/2, 2)

Hypothesis

\(H_0\): \(p_1 = p_2 = 1/2\)
\(H_a\): at least one \(p_i \neq 1/2\)

# Expected values
expected_values <- theoretical_prop * sum(observed)
expected_values

## [1] 218 218

Both expected counts are greater than 5, so the chi-squared approximation is appropriate.

# Perform chi-squared goodness-of-fit test
chisq.test(observed)

## 
##  Chi-squared test for given probabilities
## 
## data:  observed
## X-squared = 6.2018, df = 1, p-value = 0.01276

With a p-value of 0.01276, which is less than the typical significance level of 0.05, we reject the null hypothesis that the R and X alleles are equally likely. The data suggest that the two alleles occur with different probabilities.
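As a check on the test above, the statistic can be reproduced by hand from the observed and expected counts. This is a minimal sketch; the names `expected`, `x_squared`, and `p_value` are illustrative and not part of the original analysis.

```r
# Observed counts from Problem 1
observed <- c(244, 192)

# Expected counts under H0: both alleles equally likely
expected <- rep(1/2, 2) * sum(observed)

# Chi-squared statistic: sum of (O - E)^2 / E over the categories
x_squared <- sum((observed - expected)^2 / expected)

# Upper-tail p-value with df = (number of categories) - 1
p_value <- pchisq(x_squared, df = length(observed) - 1, lower.tail = FALSE)

round(x_squared, 4)
round(p_value, 5)
```

Both values should agree, up to rounding, with the `chisq.test(observed)` output.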

Problem 2

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

df <- read_csv("NutritionStudy.csv")

## Rows: 315 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): Smoke, Sex, VitaminUse
## dbl (14): ID, Age, Quetelet, Vitamin, Calories, Fat, Fiber, Alcohol, Cholest...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

summary(df)

##        ID             Age           Smoke              Quetelet    
##  Min.   :  1.0   Min.   :19.00   Length:315         Min.   :16.33  
##  1st Qu.: 79.5   1st Qu.:39.00   Class :character   1st Qu.:21.80  
##  Median :158.0   Median :48.00   Mode  :character   Median :24.74  
##  Mean   :158.0   Mean   :50.15                      Mean   :26.16  
##  3rd Qu.:236.5   3rd Qu.:62.50                      3rd Qu.:28.85  
##  Max.   :315.0   Max.   :83.00                      Max.   :50.40  
##     Vitamin         Calories           Fat             Fiber      
##  Min.   :1.000   Min.   : 445.2   Min.   : 14.40   Min.   : 3.10  
##  1st Qu.:1.000   1st Qu.:1338.0   1st Qu.: 53.95   1st Qu.: 9.15  
##  Median :2.000   Median :1666.8   Median : 72.90   Median :12.10  
##  Mean   :1.965   Mean   :1796.7   Mean   : 77.03   Mean   :12.79  
##  3rd Qu.:3.000   3rd Qu.:2100.4   3rd Qu.: 95.25   3rd Qu.:15.60  
##  Max.   :3.000   Max.   :6662.2   Max.   :235.90   Max.   :36.80  
##     Alcohol         Cholesterol       BetaDiet     RetinolDiet    
##  Min.   :  0.000   Min.   : 37.7   Min.   : 214   Min.   :  30.0  
##  1st Qu.:  0.000   1st Qu.:155.0   1st Qu.:1116   1st Qu.: 480.0  
##  Median :  0.300   Median :206.3   Median :1802   Median : 707.0  
##  Mean   :  3.279   Mean   :242.5   Mean   :2186   Mean   : 832.7  
##  3rd Qu.:  3.200   3rd Qu.:308.9   3rd Qu.:2836   3rd Qu.:1037.0  
##  Max.   :203.000   Max.   :900.7   Max.   :9642   Max.   :6901.0  
##    BetaPlasma     RetinolPlasma        Sex             VitaminUse       
##  Min.   :   0.0   Min.   : 179.0   Length:315         Length:315        
##  1st Qu.:  90.0   1st Qu.: 466.0   Class :character   Class :character  
##  Median : 140.0   Median : 566.0   Mode  :character   Mode  :character  
##  Mean   : 189.9   Mean   : 602.8                                        
##  3rd Qu.: 230.0   3rd Qu.: 716.0                                        
##  Max.   :1415.0   Max.   :1727.0                                        
##    PriorSmoke   
##  Min.   :1.000  
##  1st Qu.:1.000  
##  Median :2.000  
##  Mean   :1.638  
##  3rd Qu.:2.000  
##  Max.   :3.000

observed_dataset <- table(df$VitaminUse, df$Sex)
observed_dataset

##             
##              Female Male
##   No             87   24
##   Occasional     77    5
##   Regular       109   13

Hypothesis

\(H_0\) : Vitamin use is not associated with gender
\(H_a\) : Vitamin use is associated with gender

chisq.test(observed_dataset)

## 
##  Pearson's Chi-squared test
## 
## data:  observed_dataset
## X-squared = 11.071, df = 2, p-value = 0.003944

With a p-value of 0.0039, which is less than the typical significance level of 0.05, there is sufficient evidence to reject the null hypothesis.

Therefore, we conclude that there is a significant association between vitamin use and gender.
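The chi-squared test of independence relies on the expected counts being large enough (commonly at least 5 in every cell). A minimal sketch of that check follows; it rebuilds the table from the counts printed above so that it runs without the CSV file, and the names `counts` and `result` are illustrative.

```r
# Counts copied from the observed_dataset table (rows: VitaminUse, columns: Sex)
counts <- matrix(
  c( 87, 24,
     77,  5,
    109, 13),
  nrow = 3, byrow = TRUE,
  dimnames = list(VitaminUse = c("No", "Occasional", "Regular"),
                  Sex = c("Female", "Male"))
)

result <- chisq.test(counts)

# Expected counts under independence: (row total * column total) / grand total
result$expected

# TRUE when every expected count exceeds 5
all(result$expected > 5)
```

The condition applies to the expected counts, not the observed ones, so the observed count of 5 in the Occasional/Male cell does not by itself rule out the test.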

Problem 3

df2 <- read_csv("FishGills3.csv")

## Rows: 90 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Calcium
## dbl (1): GillRate
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

summary(df2)

##    Calcium             GillRate    
##  Length:90          Min.   :33.00  
##  Class :character   1st Qu.:48.00  
##  Mode  :character   Median :62.50  
##                     Mean   :61.78  
##                     3rd Qu.:72.00  
##                     Max.   :98.00

Hypothesis

\(H_0\): \(\mu_L\) = \(\mu_M\) = \(\mu_H\)
\(H_a\): not all \(\mu_i\) are equal

anova_result <- aov(GillRate ~ Calcium, data = df2)

anova_result

## Call:
##    aov(formula = GillRate ~ Calcium, data = df2)
## 
## Terms:
##                   Calcium Residuals
## Sum of Squares   2037.222 19064.333
## Deg. of Freedom         2        87
## 
## Residual standard error: 14.80305
## Estimated effects may be unbalanced

summary(anova_result)

##             Df Sum Sq Mean Sq F value Pr(>F)  
## Calcium      2   2037  1018.6   4.648 0.0121 *
## Residuals   87  19064   219.1                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With a p-value of 0.0121, which is less than the typical significance level of 0.05, there is sufficient evidence to reject the null hypothesis. This suggests that the mean gill rate differs depending on the calcium level of the water.
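The F value and p-value in the summary table can be reproduced from the sums of squares and degrees of freedom printed by `aov()`. This is a sketch using numbers copied from the output above; the variable names are illustrative.

```r
# Values copied from the aov() output
ss_calcium   <- 2037.222
ss_residuals <- 19064.333
df_calcium   <- 2
df_residuals <- 87

# Mean squares: sum of squares divided by degrees of freedom
ms_calcium   <- ss_calcium / df_calcium
ms_residuals <- ss_residuals / df_residuals

# F statistic and its upper-tail p-value
f_value <- ms_calcium / ms_residuals
p_value <- pf(f_value, df_calcium, df_residuals, lower.tail = FALSE)

round(f_value, 3)
round(p_value, 4)
```

These should agree, up to rounding, with the `F value` and `Pr(>F)` columns of `summary(anova_result)`.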

TukeyHSD(anova_result)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = GillRate ~ Calcium, data = df2)
## 
## $Calcium
##                  diff        lwr        upr     p adj
## Low-High    10.333333   1.219540 19.4471264 0.0222533
## Medium-High  0.500000  -8.613793  9.6137931 0.9906108
## Medium-Low  -9.833333 -18.947126 -0.7195402 0.0313247

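Low calcium differs significantly from both High (difference 10.33, adjusted p = 0.022) and Medium (difference 9.83, adjusted p = 0.031), while Medium and High do not differ significantly (adjusted p = 0.991). The largest difference is between the Low and High calcium levels.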
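The width of the Tukey intervals can be reconstructed from the residual mean square and the studentized range distribution. The sketch below assumes a balanced design with 30 fish per calcium level (90 rows over three groups). The group sizes are not shown in the output, although the identical interval widths are consistent with equal group sizes. All variable names are illustrative.

```r
ms_residuals <- 19064.333 / 87   # residual mean square from the ANOVA table
df_residuals <- 87
k            <- 3                # number of calcium levels
n_per_group  <- 30               # ASSUMPTION: balanced design, 90 rows / 3 groups

# Critical value of the studentized range at the 95% family-wise level
q_crit <- qtukey(0.95, nmeans = k, df = df_residuals)

# Half-width of each pairwise interval for equal group sizes
half_width <- q_crit / sqrt(2) * sqrt(ms_residuals * 2 / n_per_group)

# Interval for Low - High, using the difference reported by TukeyHSD()
diff_low_high <- 10.333333
c(lwr = diff_low_high - half_width, upr = diff_low_high + half_width)
```

Under the stated assumption, the interval should be close to the `lwr` and `upr` values reported for `Low-High`.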

Jake G

2025-10-27
