# Observed counts
observed <- c(244, 192)
# Null values
theoritical_prop <- rep(1/2, 2)
Hypotheses
\(H_0\): \(p_1 = p_2 = 1/2\)
\(H_a\): at least one \(p_i \neq 1/2\)
# Expected values
expected_values <- theoritical_prop*sum(observed)
expected_values
## [1] 218 218
Both expected counts are greater than 5, so the chi-squared approximation is reasonable.
# Perform chi-squared goodness-of-fit test
chisq.test(observed, p = theoritical_prop)
##
## Chi-squared test for given probabilities
##
## data: observed
## X-squared = 6.2018, df = 1, p-value = 0.01276
With a p-value of 0.01276, which is less than the typical significance level of 0.05, we reject the null hypothesis that the
R and X alleles are equally likely. The data suggest that the two alleles occur with different
probabilities.
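The statistic reported by `chisq.test()` can be reproduced by hand, which makes the role of the expected counts explicit. The following is a minimal sketch that uses only the counts above; the names `expected`, `chi_sq`, and `p_value` are introduced here for illustration and are not part of the original analysis.

```r
# Observed allele counts from above
observed <- c(244, 192)

# Expected counts under H0: both alleles equally likely
expected <- rep(1/2, 2) * sum(observed)

# Pearson chi-squared statistic: sum of (O - E)^2 / E
chi_sq <- sum((observed - expected)^2 / expected)

# Upper-tail p-value with df = (number of categories) - 1
p_value <- pchisq(chi_sq, df = length(observed) - 1, lower.tail = FALSE)

round(chi_sq, 4)   # 6.2018
round(p_value, 5)  # 0.01276
```

Both values agree with the `chisq.test()` output shown above.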
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df <- read_csv("NutritionStudy.csv")
## Rows: 315 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Smoke, Sex, VitaminUse
## dbl (14): ID, Age, Quetelet, Vitamin, Calories, Fat, Fiber, Alcohol, Cholest...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(df)
##        ID             Age           Smoke              Quetelet
##  Min.   :  1.0   Min.   :19.00   Length:315         Min.   :16.33
##  1st Qu.: 79.5   1st Qu.:39.00   Class :character   1st Qu.:21.80
##  Median :158.0   Median :48.00   Mode  :character   Median :24.74
##  Mean   :158.0   Mean   :50.15                      Mean   :26.16
##  3rd Qu.:236.5   3rd Qu.:62.50                      3rd Qu.:28.85
##  Max.   :315.0   Max.   :83.00                      Max.   :50.40
##     Vitamin         Calories           Fat             Fiber
##  Min.   :1.000   Min.   : 445.2   Min.   : 14.40   Min.   : 3.10
##  1st Qu.:1.000   1st Qu.:1338.0   1st Qu.: 53.95   1st Qu.: 9.15
##  Median :2.000   Median :1666.8   Median : 72.90   Median :12.10
##  Mean   :1.965   Mean   :1796.7   Mean   : 77.03   Mean   :12.79
##  3rd Qu.:3.000   3rd Qu.:2100.4   3rd Qu.: 95.25   3rd Qu.:15.60
##  Max.   :3.000   Max.   :6662.2   Max.   :235.90   Max.   :36.80
##     Alcohol         Cholesterol       BetaDiet     RetinolDiet
##  Min.   :  0.000   Min.   : 37.7   Min.   : 214   Min.   :  30.0
##  1st Qu.:  0.000   1st Qu.:155.0   1st Qu.:1116   1st Qu.: 480.0
##  Median :  0.300   Median :206.3   Median :1802   Median : 707.0
##  Mean   :  3.279   Mean   :242.5   Mean   :2186   Mean   : 832.7
##  3rd Qu.:  3.200   3rd Qu.:308.9   3rd Qu.:2836   3rd Qu.:1037.0
##  Max.   :203.000   Max.   :900.7   Max.   :9642   Max.   :6901.0
##    BetaPlasma     RetinolPlasma        Sex             VitaminUse
##  Min.   :   0.0   Min.   : 179.0   Length:315         Length:315
##  1st Qu.:  90.0   1st Qu.: 466.0   Class :character   Class :character
##  Median : 140.0   Median : 566.0   Mode  :character   Mode  :character
##  Mean   : 189.9   Mean   : 602.8
##  3rd Qu.: 230.0   3rd Qu.: 716.0
##  Max.   :1415.0   Max.   :1727.0
##    PriorSmoke
##  Min.   :1.000
##  1st Qu.:1.000
##  Median :2.000
##  Mean   :1.638
##  3rd Qu.:2.000
##  Max.   :3.000
observed_dataset <- table(df$VitaminUse, df$Sex)
observed_dataset
##
##              Female Male
##   No             87   24
##   Occasional     77    5
##   Regular       109   13
Hypotheses
\(H_0\): Vitamin use is not associated with gender
\(H_a\): Vitamin use is associated with gender
chisq.test(observed_dataset)
##
## Pearson's Chi-squared test
##
## data: observed_dataset
## X-squared = 11.071, df = 2, p-value = 0.003944
With a p-value of 0.0039, which is less than the typical significance level of 0.05, there is sufficient evidence to reject the null hypothesis.
Therefore, we conclude that there is a significant association
between vitamin use and gender.
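The test of independence rests on expected counts computed from the row and column totals, and the approximation is usually considered adequate when every expected count is at least 5. The sketch below rebuilds the table from the counts printed above (so it does not need the CSV file) and reproduces the statistic by hand; `observed_counts`, `expected_counts`, `df_ind`, `chi_sq`, and `p_value` are illustrative names, not objects from the original analysis.

```r
# Observed counts copied from the contingency table above
observed_counts <- matrix(
  c( 87, 24,
     77,  5,
    109, 13),
  nrow = 3, byrow = TRUE,
  dimnames = list(VitaminUse = c("No", "Occasional", "Regular"),
                  Sex = c("Female", "Male"))
)

# Expected counts under independence: (row total * column total) / grand total
expected_counts <- outer(rowSums(observed_counts), colSums(observed_counts)) /
  sum(observed_counts)

# Check the usual rule of thumb: all expected counts at least 5
all(expected_counts >= 5)

# Pearson statistic and p-value with df = (rows - 1) * (columns - 1)
chi_sq  <- sum((observed_counts - expected_counts)^2 / expected_counts)
df_ind  <- (nrow(observed_counts) - 1) * (ncol(observed_counts) - 1)
p_value <- pchisq(chi_sq, df = df_ind, lower.tail = FALSE)

round(chi_sq, 3)  # 11.071
```

The smallest expected count is for occasional vitamin users who are male (about 10.9), so the condition holds even though only 5 were observed in that cell.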
df2 <- read_csv("FishGills3.csv")
## Rows: 90 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Calcium
## dbl (1): GillRate
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(df2)
##    Calcium             GillRate
##  Length:90          Min.   :33.00
##  Class :character   1st Qu.:48.00
##  Mode  :character   Median :62.50
##                     Mean   :61.78
##                     3rd Qu.:72.00
##                     Max.   :98.00
Hypotheses
\(H_0\): \(\mu_L = \mu_M = \mu_H\)
\(H_a\): not all \(\mu_i\) are equal
anova_result <- aov(GillRate ~ Calcium, data = df2)
anova_result
## Call:
## aov(formula = GillRate ~ Calcium, data = df2)
##
## Terms:
##                   Calcium Residuals
## Sum of Squares   2037.222 19064.333
## Deg. of Freedom         2        87
##
## Residual standard error: 14.80305
## Estimated effects may be unbalanced
summary(anova_result)
##             Df Sum Sq Mean Sq F value Pr(>F)
## Calcium      2   2037  1018.6   4.648 0.0121 *
## Residuals   87  19064   219.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With a p-value of 0.0121, which is less than the typical significance level of 0.05, there is sufficient evidence to reject the null hypothesis. This suggests that the mean gill rate differs depending on the calcium level of the water.
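The F value in the ANOVA table is the ratio of the between-group mean square to the residual mean square. As a rough check, it can be recomputed from the sums of squares printed by `aov()`; the variable names below are illustrative, and the inputs are the rounded values from the output rather than the raw data.

```r
# Sums of squares and degrees of freedom copied from the aov output above
ss_between <- 2037.222
ss_within  <- 19064.333
df_between <- 2
df_within  <- 87

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F statistic and its upper-tail p-value
f_stat  <- ms_between / ms_within
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

round(f_stat, 3)  # 4.648
```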
TukeyHSD(anova_result)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = GillRate ~ Calcium, data = df2)
##
## $Calcium
##                  diff        lwr        upr     p adj
## Low-High    10.333333   1.219540 19.4471264 0.0222533
## Medium-High  0.500000  -8.613793  9.6137931 0.9906108
## Medium-Low  -9.833333 -18.947126 -0.7195402 0.0313247
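The Tukey comparisons show that the Low calcium group differs significantly from both the High group (Low minus High = 10.33, adjusted p = 0.022) and the Medium group (Medium minus Low = -9.83, adjusted p = 0.031), while the Medium and High groups do not differ significantly (adjusted p = 0.991). The largest difference is between the Low and High calcium levels.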
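The interval limits in the Tukey table come from the studentized range distribution. The sketch below shows roughly how the Low-High interval could be reconstructed from the ANOVA output. It assumes the 90 fish are split evenly across the three calcium levels (30 per group), which is not stated in the output above; `n_per_group`, `q_crit`, and `half_width` are illustrative names.

```r
# Quantities copied from the ANOVA output above
ms_within   <- 19064.333 / 87  # residual mean square
df_within   <- 87
n_groups    <- 3
n_per_group <- 30              # assumption: 90 fish split evenly across 3 levels

# Critical value from the studentized range distribution
q_crit <- qtukey(0.95, nmeans = n_groups, df = df_within)

# Half-width of each 95% family-wise interval for a difference of two means
half_width <- q_crit / sqrt(2) * sqrt(ms_within * (2 / n_per_group))

# Interval for Low - High, using the difference reported by TukeyHSD()
diff_low_high <- 10.333333
c(lwr = diff_low_high - half_width, upr = diff_low_high + half_width)
```

If the equal-group-size assumption holds, the result should be close to the reported interval of roughly 1.22 to 19.45. All three intervals in the table have the same width, which is what a balanced design would produce.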