Instructions

Task 0

Create an R code chunk below to load the tidyverse, ggplot2, and the NHANES package. Further, load the NHANES dataset.

library(ggplot2)
library(tidyverse)
library(NHANES)
data("NHANES")

Homework question 1

According to a 2021 data brief published by the National Center for Health Statistics (NCHS) in the United States, males have a higher prevalence of diabetes than women across age groups. Conduct a hypothesis test at 10% level of significance to assess this claim using the NHANES dataset.

Task 1

Write down the null and alternative hypotheses. Describe any notation used.

H_0: p_m = p_w
H_A: p_m > p_w

Here p_m is the population proportion of men with diabetes, and p_w is the population proportion of women with diabetes.

Task 2

Create an R code chunk below and write a command to generate a suitable graphical summary capturing the two variables described above.

tab <- table(NHANES$Gender, NHANES$Diabetes)
prop.table(tab, margin= NULL)
##         
##                  No        Yes
##   female 0.46581457 0.03621424
##   male   0.45709069 0.04088050
barplot(
  table(NHANES$Gender, NHANES$Diabetes),
  beside = TRUE,
  legend = TRUE,
  ylab = "Count",
  main = "Diabetes by sex"
)

Task 3

Based on the plot above, would you say that there is a relationship between the two variables? Based on the plot, there appears to be a relationship between sex and diabetes. Few participants have diabetes, but among those who do, there seem to be somewhat more men than women.
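
The joint proportions printed in Task 2 divide every cell by the grand total, which makes the two sexes hard to compare directly. A minimal sketch of the row-wise version follows. The cell counts are not printed in the output above; they are back-calculated from the Task 2 proportions under the assumption that 9,858 rows have both variables recorded, and the names `counts`, `joint`, and `by_sex` are illustrative.

```r
# Cell counts back-calculated from the joint proportions in Task 2
# (assumption: 9,858 complete rows; not printed in the original output).
counts <- matrix(
  c(4592, 357,
    4506, 403),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("female", "male"), Diabetes = c("No", "Yes"))
)

# Sanity check: these joint proportions should reproduce the Task 2 table.
joint <- prop.table(counts)

# Row-wise proportions: share of each sex with and without diabetes.
by_sex <- prop.table(counts, margin = 1)
round(by_sex, 4)
```

With `margin = 1` each row sums to 1, so the `Yes` column gives the within-sex prevalence that the hypotheses in Task 1 refer to.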

Task 4

Create a code chunk and conduct the hypothesis test.

prop.test(x = table(NHANES$Gender, NHANES$Diabetes), alternative = "greater")
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  table(NHANES$Gender, NHANES$Diabetes)
## X-squared = 3.2964, df = 1, p-value = 0.03472
## alternative hypothesis: greater
## 95 percent confidence interval:
##  0.0009167922 1.0000000000
## sample estimates:
##    prop 1    prop 2 
## 0.9278642 0.9179059
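
The call above passes the Gender by Diabetes table directly, so `prop.test()` treats the first column (`No`) as the success and reports the proportion without diabetes for females and then males. The following sketch, which is an alternative and not the original analysis, states the same test in terms of the `Yes` counts with men first, so that the estimates line up with p_m and p_w. The counts are back-calculated from the Task 2 proportions (assuming 9,858 complete rows), and `conf.level = 0.90` is chosen here only to match the 10% level of significance.

```r
# Counts back-calculated from the Task 2 proportions (assumption, see above).
yes_counts  <- c(male = 403, female = 357)    # participants with diabetes
group_sizes <- c(male = 4909, female = 4949)  # participants with Diabetes recorded

# One-sided test of H_A: p_m > p_w, with diabetes ("Yes") as the success.
res <- prop.test(
  x = yes_counts,
  n = group_sizes,
  alternative = "greater",
  conf.level = 0.90
)

res$estimate  # sample proportions with diabetes: men, then women
res$p.value   # one-sided p-value
```

Both calls compare the same two groups in the same direction, so the test statistic and p-value should agree with the output above; only the reported estimates and the confidence level differ.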

Task 5

Identify the p-value of this test and interpret the p-value. The p-value is 0.03472. Assuming the null hypothesis is true (men and women have the same prevalence of diabetes), there is about a 3.47% chance of observing a difference in sample proportions at least as large, in the direction of the alternative, as the one seen in the sample.

Task 6

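Make a decision for this test and write the conclusion in the context of the problem. Because the p-value (0.03472) is below the 10% level of significance, we reject the null hypothesis. There is statistically significant evidence that the proportion of men with diabetes is higher than the proportion of women with diabetes, which is consistent with the NCHS claim.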

Task 7

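Describe the type of error we could be committing and what that could translate to in a clinical setting. Because we rejected the null hypothesis, we could be making a Type I error: concluding that men have a higher prevalence of diabetes when they do not. In a clinical setting, this could lead to unnecessary screening or interventions, misallocation of healthcare resources, or overemphasis on gender-based risk.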
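
To make the idea of a Type I error concrete, the following simulation sketch generates data for which the null hypothesis is true and records how often the test still rejects at the 10% level. The group sizes and the shared prevalence are assumptions back-calculated from the Task 2 proportions, and the seed and the number of repetitions are arbitrary choices.

```r
set.seed(1)

alpha    <- 0.10
n_m      <- 4909         # assumed number of men with Diabetes recorded
n_w      <- 4949         # assumed number of women with Diabetes recorded
p_common <- 760 / 9858   # assumed shared prevalence, so H_0 holds by design

# Repeat the study many times under H_0 and record each rejection.
reject <- replicate(2000, {
  x_m <- rbinom(1, n_m, p_common)
  x_w <- rbinom(1, n_w, p_common)
  p <- prop.test(c(x_m, x_w), c(n_m, n_w), alternative = "greater")$p.value
  p < alpha
})

# Share of simulated studies that reject a true null hypothesis.
type1_rate <- mean(reject)
type1_rate
```

Every rejection in this simulation is a Type I error by construction, so `type1_rate` should land near `alpha`.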

Task 8

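What is the most obvious limitation of the analysis we have conducted as it may or may not support the NCHS claim we set out to investigate? The test pools all ages, so it cannot show whether the difference holds within each age group, which is what the NCHS claim states. It also does not adjust for potential confounding variables such as BMI, age, race, socioeconomic status, diet, and physical activity.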

Homework question 2

Is there a relationship between someone’s sexual orientation (SexOrientation) and their mental health assessed in terms of self-reported number of days where the participant felt down, depressed, or hopeless (Depressed)?

Task 1

Which type of test should you conduct to investigate this relationship? A one-way ANOVA.

Task 2

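Write down the null and alternative hypotheses.

H_0: m_1 = m_2 = m_3
H_A: at least one mean differs

Here m_1, m_2, and m_3 are the population mean Depressed scores for the three sexual orientation groups (Bisexual, Heterosexual, Homosexual).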

Task 3

df <- NHANES %>%
  filter(!is.na(SexOrientation),
         !is.na(Depressed)) %>%
  mutate(
    Depressed_num = as.numeric(Depressed)   # factor level codes (1 to 3), not a count of days
  )
summary(df$Depressed_num)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.285   1.000   3.000
anova_model <- aov(Depressed_num ~ SexOrientation, data = df)
summary(anova_model)
##                  Df Sum Sq Mean Sq F value Pr(>F)    
## SexOrientation    2     25  12.476   38.24 <2e-16 ***
## Residuals      4833   1577   0.326                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(anova_model)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Depressed_num ~ SexOrientation, data = df)
## 
## $SexOrientation
##                               diff        lwr         upr     p adj
## Heterosexual-Bisexual   -0.4019753 -0.5262990 -0.27765157 0.0000000
## Homosexual-Bisexual     -0.1193277 -0.3095019  0.07084639 0.3049939
## Homosexual-Heterosexual  0.2826476  0.1360730  0.42922209 0.0000188

Task 4

Explain the calculation for the degrees of freedom reported in the test output. Does that match with what you would expect based on the data dictionary descriptions?

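The between-groups degrees of freedom equal the number of groups minus one: 3 - 1 = 2. The within-groups (residual) degrees of freedom equal the number of observations minus the number of groups: 4836 - 3 = 4833. They represent the remaining variability among individuals within groups. Yes, this matches what I would expect for a one-way ANOVA with three sexual orientation groups.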

Task 5

Create a suitable numerical summary to investigate the issue observed in Task 4. Does this now make sense? Explain. A suitable numerical summary is the mean, standard deviation, and sample size of Depressed_num for each sexual orientation group. After examining the numerical summary, it becomes clear that while the group means differ somewhat, there is substantial variability within each sexual orientation group.
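
The summary described above can be sketched as follows. The data frame `toy` is hypothetical and tiny, and the three level names are an assumption about how Depressed is coded; the point is only to show what `as.numeric()` does to a factor and how a per-group table of mean, standard deviation, and sample size can be built in base R.

```r
# Hypothetical toy data (not NHANES); level names are an assumption.
toy <- data.frame(
  SexOrientation = c("Bisexual", "Bisexual", "Heterosexual", "Heterosexual",
                     "Heterosexual", "Homosexual", "Homosexual"),
  Depressed = factor(
    c("Several", "Most", "None", "None", "Several", "None", "Several"),
    levels = c("None", "Several", "Most")
  )
)

# as.numeric() on a factor returns the level codes, not a number of days.
toy$Depressed_num <- as.numeric(toy$Depressed)

# Counts per category show which levels occur in each group.
level_counts <- table(toy$SexOrientation, toy$Depressed)

# Mean, standard deviation, and sample size of the codes per group.
group_summary <- aggregate(
  Depressed_num ~ SexOrientation,
  data = toy,
  FUN = function(x) c(mean = mean(x), sd = sd(x), n = length(x))
)
group_summary
```

Applied to the real `df`, the same two steps would show how many response levels are present and how the group sizes compare.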

Task 6

Based on the p-value above, can we claim a relationship between the two variables? Yes. The p-value is below 2e-16, so there is statistically significant evidence of a relationship between sexual orientation and the mean Depressed score (Depressed_num). This suggests that at least one sexual orientation group differs from the others on average.

Task 7

Check the conditions for conducting this test. Independence: the observations are treated as independent because NHANES data are collected from different individuals. Variable types: sexual orientation is a categorical variable with more than two levels, and Depressed is treated as quantitative after conversion to numeric codes, although it is recorded in ordered categories. Sample size: the number of observations in each group is large enough for the Central Limit Theorem to apply. Equal variance: the variability across groups appears reasonably similar.
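
Because Depressed is recorded in ordered categories, the quantitative-response condition is the weakest one here. One option, offered as an alternative and not as the analysis used above, is a chi-square test of independence on the two-way table. The sketch below uses made-up counts purely to show the mechanics; none of the numbers are NHANES results, and `toy_tab` is a hypothetical name.

```r
# Hypothetical counts for illustration only (not NHANES results).
toy_tab <- matrix(
  c( 60,  40, 20,
    900, 200, 60,
     50,  20, 10),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    SexOrientation = c("Bisexual", "Heterosexual", "Homosexual"),
    Depressed = c("None", "Several", "Most")
  )
)

chi <- chisq.test(toy_tab)

chi$parameter      # degrees of freedom: (rows - 1) * (columns - 1)
min(chi$expected)  # smallest expected count, usually wanted to be at least 5
```

For a table with 3 rows and 3 columns the degrees of freedom are (3 - 1) * (3 - 1) = 4, which is the kind of check Task 4 asks for.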

Homework question 3

On an average, how different is the reported BMI for participants who do any moderate or vigorous-intensity sports, fitness or recreational activities when compared to those who do not?

Task 1

Write down the parameter we are setting out to investigate - in notation as well as words.

m_1 = mean BMI for participants who engage in moderate or vigorous physical activity
m_2 = mean BMI for participants who do not engage in moderate or vigorous physical activity

The parameter is m_1 - m_2, the difference in mean BMI between the two groups.

Task 2

Create an R code chunk below and write a command to generate a suitable graphical summary capturing the two variables described above.

boxplot(BMI~PhysActive,
        data=NHANES,
        xlab="Moderate or Vigorous Physical Activity",
        ylab= "Body Mass Index (BMI)", 
        main= "BMI by Physical activity Status")

Task 3

Based on the plot above, would you say that there is a relationship between the two variables? Based on the plot, there appears to be a relationship between physical activity participation and BMI.

Task 4

Create a code chunk and calculate a 99% confidence interval for the parameter.

t.test(BMI~PhysActive,
       data=NHANES,
       conf.level=.99)
## 
##  Welch Two Sample t-test
## 
## data:  BMI by PhysActive
## t = 15.427, df = 7121.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 99 percent confidence interval:
##  1.942116 2.720909
## sample estimates:
##  mean in group No mean in group Yes 
##          29.43069          27.09918

Task 5

Based on the output in Task 4, what is the ordering of groups used by R? Does this match with your order in Task 1? R orders the groups by factor level, which here is alphabetical (No, then Yes), so it reports m_no - m_yes. This does not match the ordering in Task 1, which was m_yes - m_no (m_1 - m_2). The difference in ordering explains why the sign is reversed relative to the original parameter definition.

Task 6

Interpret the confidence interval reported by R using the ordering in the output. We are 99% confident that the true mean BMI of participants who do not do moderate or vigorous-intensity activities is between 1.94 and 2.72 units higher than the true mean BMI of those who do.
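
As a check on the interval and on the effect of group ordering, the 99% confidence interval can be rebuilt from the numbers printed in Task 4. This is a sketch that relies on the rounded values in that output, so the last digits may differ slightly; the variable names are illustrative.

```r
# Values copied from the Welch t-test output in Task 4.
mean_no  <- 29.43069
mean_yes <- 27.09918
t_stat   <- 15.427
df_welch <- 7121.5

diff_no_yes <- mean_no - mean_yes        # R's ordering: No minus Yes
se_diff     <- diff_no_yes / t_stat      # standard error implied by t = diff / SE
t_crit      <- qt(0.995, df = df_welch)  # critical value for 99% confidence

ci_no_yes <- diff_no_yes + c(-1, 1) * t_crit * se_diff
ci_yes_no <- -rev(ci_no_yes)             # same interval for Yes minus No

round(ci_no_yes, 2)
round(ci_yes_no, 2)
```

Reversing the order of subtraction flips the sign of both endpoints and swaps them, so the interval for m_yes - m_no is entirely negative while describing the same difference.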

Task 7

Based on the interval alone, what do you expect is the decision of a hypothesis test conducted at 1% level of significance investigating difference between the two groups? Explain your reasoning. We expect to reject the null hypothesis of no difference, because the 99% confidence interval (1.94, 2.72) does not contain 0, and a 99% interval corresponds to a two-sided test at the 1% level of significance.

Task 8

Based on the results above, can we claim that doing moderate or vigorous-intensity sports, fitness or recreational activities causes one’s BMI to be lower? No, we cannot claim causation. NHANES is an observational study, so the direction of the relationship is unclear and other variables may explain it. We can only say that there is an association between physical activity and lower BMI.

Homework question 4

Does Amazon sell lego sets for a higher price on average than the lego website? Let us revisit the lego_sample dataset from the openintro package to investigate.

Task 1

Write code to load the required dataset and create a differences column based on the two price columns. Be mindful of the order.

library(openintro)
data("lego_sample")
lego_sample <- lego_sample |>
  mutate(price_diff = amazon_price - price)

Task 2

Create a numerical summary to investigate the claim descriptively a.k.a. without conducting any inference.

# price_diff (amazon_price - price) was already created in Task 1
lego_sample |>
  summarise(
    mean_diff = mean(price_diff, na.rm = TRUE),
    median_diff = median(price_diff, na.rm = TRUE),
    sd_diff = sd(price_diff, na.rm = TRUE),
    n = n()
  )

Task 3

Write R code to conduct a hypothesis test.

lego_sample |>
  mutate(price_diff = amazon_price - price) |>
  t.test(
    formula = price_diff ~ 1,
    alternative = "greater",
    data = _
  )
## 
##  One Sample t-test
## 
## data:  price_diff
## t = 3.318, df = 74, p-value = 0.0007041
## alternative hypothesis: true mean is greater than 0
## 95 percent confidence interval:
##  3.613865      Inf
## sample estimates:
## mean of x 
##  7.257067

Task 4

Based on the p-value, can we claim that legos are more expensive on Amazon than on the company website at 5% level of significance? Yes. The p-value (0.0007041) is below 0.05, so we reject the null hypothesis. There is statistically significant evidence that, on average, legos are more expensive on Amazon than on the company website.
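
The decision can be traced back to the numbers printed in Task 3. The sketch below recomputes the one-sided p-value and the lower confidence bound from the reported mean difference, t statistic, and degrees of freedom. It uses the rounded output values, so small discrepancies in the last digits are expected, and the variable names are illustrative.

```r
# Values copied from the one-sample t-test output in Task 3.
mean_diff <- 7.257067
t_stat    <- 3.318
df        <- 74

n       <- df + 1              # number of non-missing price differences
se_mean <- mean_diff / t_stat  # standard error implied by t = mean / SE

# One-sided p-value for H_A: mean difference > 0.
p_value <- pt(t_stat, df = df, lower.tail = FALSE)

# Lower bound of the one-sided 95% confidence interval.
lower_bound <- mean_diff - qt(0.95, df = df) * se_mean

c(n = n, p_value = p_value, lower_bound = lower_bound)
```

The p-value is well below 0.05 and the lower bound is above 0, and both point to the same decision.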

External resources used

If you used any external resources to write editable code or debug code that won’t work, please list them here. Please avoid saying things such as Googled it!, though that’s where you might begin. Please provide specific reference such as StackExchange, R-bloggers, notes from class such and such etc.