library(tidyverse)
library(car)
library(kableExtra)
library(broom)
bmd <- readr::read_csv("~/Downloads/bmd.csv", show_col_types = FALSE)
bmd_new <- bmd %>%
  mutate(
    sex = factor(RIAGENDR, levels = c(1, 2), labels = c("Male", "Female")),
    ethnicity = factor(RIDRETH1, levels = c(1, 2, 3, 4, 5),
                       labels = c("Mexican American", "Other Hispanic",
                                  "Non-Hispanic White", "Non-Hispanic Black", "Other")),
    smokernew = factor(smoker, levels = c(1, 2, 3), labels = c("Current", "Past", "Never"))
  )
# Display sample size
data.frame(
  Metric = "Sample Size",
  Value = paste(nrow(bmd_new), "bmd new labels")
) %>%
  kable()

| Metric | Value |
|---|---|
| Sample Size | 2898 bmd new labels |
# Remove observations with missing BMD or ethnicity
bmd_new2 <- bmd_new %>%
  filter(!is.na(DXXOFBMD) & !is.na(ethnicity))
# Display sample size
data.frame(
  Metric = "Sample Size",
  Value = paste(nrow(bmd_new2), "final bmd new labels")
) %>%
  kable()

| Metric | Value |
|---|---|
| Sample Size | 2286 final bmd new labels |
Interpretation: The initial sample contains 2898 observations with BMD data. After removing 612 observations with missing BMD or ethnicity, the final analytic sample is 2286.
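As a quick consistency check on the exclusions (a sketch using the objects created above; output not shown):

# Number of observations dropped for missing BMD or ethnicity (should equal 2898 - 2286 = 612)
nrow(bmd_new) - nrow(bmd_new2)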
##Part 1: One-Way ANOVA
#Research Question: Do mean BMD values differ across ethnicity groups? The ANOVA indicates that mean BMD differs significantly across groups, although the group means are numerically close to one another.
#Step 1: State Hypotheses
##Null Hypothesis
#(H0): μ_Mexican American = μ_Other Hispanic = μ_Non-Hispanic White = μ_Non-Hispanic Black = μ_Other
#Plain Language: All five ethnicity groups have equal population mean BMD.
##Alternative Hypothesis
#(H1): Not all group means are equal; at least one μ differs.
#Plain Language: At least one population mean differs from the others.
##Significance level: α = 0.05
#Step 2: Exploratory Analysis
# Calculate summary statistics for BMD by ethnicity group
summary_stats <- bmd_new2 %>%
  group_by(ethnicity) %>%
  summarise(
    n = n(),
    Mean = mean(DXXOFBMD),
    SD = sd(DXXOFBMD),
    Median = median(DXXOFBMD),
    Min = min(DXXOFBMD),
    Max = max(DXXOFBMD)
  )
summary_stats %>%
  kable(digits = 2,
        caption = "Descriptive Statistics: BMD by Ethnicity")

| ethnicity | n | Mean | SD | Median | Min | Max |
|---|---|---|---|---|---|---|
| Mexican American | 255 | 0.95 | 0.15 | 0.96 | 0.44 | 1.46 |
| Other Hispanic | 244 | 0.95 | 0.16 | 0.94 | 0.55 | 1.39 |
| Non-Hispanic White | 846 | 0.89 | 0.16 | 0.88 | 0.37 | 1.46 |
| Non-Hispanic Black | 532 | 0.97 | 0.16 | 0.98 | 0.44 | 1.55 |
| Other | 409 | 0.90 | 0.15 | 0.89 | 0.54 | 1.34 |
# Create boxplots with individual points
ggplot(bmd_new2,
       aes(x = ethnicity, y = DXXOFBMD, fill = ethnicity)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.1, size = 0.5) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "BMD Across Ethnicities",
    subtitle = "BMD Data",
    x = "Ethnicity",
    y = "BMD",
    fill = "Ethnicity"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

#Interpretation: What patterns do you observe in BMD across ethnicity groups?
#The medians are very similar across the ethnicity groups.
#The Non-Hispanic Black group is the only one whose boxplot sits slightly higher than the others.
#The remaining groups have similar medians and interquartile ranges.
#Step 3: ANOVA F-test
# Fit the one-way ANOVA model
anova_model <- aov(DXXOFBMD ~ ethnicity, data = bmd_new2)
# Display the ANOVA table
summary(anova_model)

##               Df Sum Sq Mean Sq F value Pr(>F)
## ethnicity      4   2.95  0.7371   30.18 <2e-16 ***
## Residuals   2281  55.70  0.0244
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Decision and interpretation: The p-value (< 2e-16) is below α = 0.05, so we reject the null hypothesis. There is a statistically significant difference in mean BMD across the ethnicity groups; at least one group mean differs from the others.
#For ethnicity, the degrees of freedom = 4 and F = 30.18.
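As a hand check against the ANOVA table above: F = MS_ethnicity / MS_residual = 0.7371 / 0.0244 ≈ 30.2, which matches the reported F value of 30.18 up to rounding.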
#Step 4: Assumption Checks
# Create diagnostic plots
par(mfrow = c(2, 2))
plot(anova_model)
par(mfrow = c(1, 1))
#Step 4 B - Equal Variances
# Levene's test for homogeneity of variance
levene_test <- leveneTest(DXXOFBMD ~ ethnicity, data = bmd_new2)
print(levene_test)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.5728 0.1788
##       2281
#Interpretation: The p-value = 0.1788. Since this is greater than 0.05, we fail to reject the null hypothesis of equal variances, so the homogeneity-of-variance assumption is reasonable.
#The Q-Q plot suggests the residuals are approximately normally distributed, since the points roughly follow the reference line.
#With approximately normal residuals and equal variances, the ANOVA assumptions are not violated.
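As an optional supplementary check of normality (not part of the original output; a sketch assuming anova_model from Step 3), a Shapiro-Wilk test can be run on the model residuals to complement the Q-Q plot:

# Supplementary normality check on the ANOVA residuals (output not shown)
resid_vals <- residuals(anova_model)
shapiro.test(resid_vals)  # applicable here since n = 2286 is within the test's 5000-sample limit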
#Step 5: Post-hoc Comparisons
# Conduct Tukey HSD test
tukey_results <- TukeyHSD(anova_model)
print(tukey_results)

##   Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = bmd_new2)
##
## $ethnicity
## diff lwr upr
## Other Hispanic-Mexican American -0.004441273 -0.042644130 0.03376158
## Non-Hispanic White-Mexican American -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American 0.022391619 -0.010100052 0.05488329
## Other-Mexican American -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic 0.026832892 -0.006150151 0.05981593
## Other-Other Hispanic -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White 0.084768868 0.061164402 0.10837333
## Other-Non-Hispanic White 0.009057905 -0.016633367 0.03474918
## Other-Non-Hispanic Black -0.075710963 -0.103764519 -0.04765741
## p adj
## Other Hispanic-Mexican American 0.9978049
## Non-Hispanic White-Mexican American 0.0000003
## Non-Hispanic Black-Mexican American 0.3276613
## Other-Mexican American 0.0001919
## Non-Hispanic White-Other Hispanic 0.0000036
## Non-Hispanic Black-Other Hispanic 0.1722466
## Other-Other Hispanic 0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White 0.8719109
## Other-Non-Hispanic Black 0.0000000
| Comparison | Mean Diff | 95% CI | p-value | Significant? |
|---|---|---|---|---|
| Other Hispanic-Mexican American | -0.004 | (-0.043, 0.034) | 0.998 | No |
| Non-Hispanic White-Mexican American | -0.062 | (-0.093, -0.032) | <0.001 | Yes |
| Non-Hispanic Black-Mexican American | 0.022 | (-0.010, 0.055) | 0.328 | No |
| Other-Mexican American | -0.053 | (-0.087, -0.019) | <0.001 | Yes |
| Non-Hispanic White-Other Hispanic | -0.058 | (-0.089, -0.027) | <0.001 | Yes |
| Non-Hispanic Black-Other Hispanic | 0.027 | (-0.006, 0.060) | 0.172 | No |
| Other-Other Hispanic | -0.049 | (-0.083, -0.014) | 0.001 | Yes |
| Non-Hispanic Black-Non-Hispanic White | 0.085 | (0.061, 0.108) | <0.001 | Yes |
| Other-Non-Hispanic White | 0.009 | (-0.017, 0.035) | 0.872 | No |
| Other-Non-Hispanic Black | -0.076 | (-0.104, -0.048) | <0.001 | Yes |
#Interpret: The adjusted p-values for "Non-Hispanic White-Mexican American", "Other-Mexican American", "Non-Hispanic White-Other Hispanic", "Other-Other Hispanic", "Non-Hispanic Black-Non-Hispanic White", and "Other-Non-Hispanic Black" are below 0.05, so we reject the null hypothesis of equal means for those pairs; these group differences are statistically significant. The remaining comparisons are not statistically significant.
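Since broom is loaded but otherwise unused, one way to build the comparison table above programmatically is sketched below (column names follow broom's tidy() output for TukeyHSD objects; output not shown):

# Tidy the Tukey results and flag comparisons significant at alpha = 0.05
tukey_table <- broom::tidy(tukey_results) %>%
  mutate(significant = adj.p.value < 0.05)
tukey_table %>% kable(digits = 3)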
#Step 6: Effect Size
# Extract sum of squares from ANOVA table
anova_summary <- summary(anova_model)[[1]]
ss_treatment <- anova_summary$`Sum Sq`[1]
ss_total <- sum(anova_summary$`Sum Sq`)
# Calculate eta-squared
eta_squared <- ss_treatment / ss_total
cat("Eta-squared (η²):", round(eta_squared, 4), "\n")## Eta-squared (η²): 0.0503
## Percentage of variance explained: 5.03 %
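As a quick hand check against the ANOVA table: η² = SS_ethnicity / (SS_ethnicity + SS_residual) = 2.95 / (2.95 + 55.70) ≈ 0.050, which agrees with the value computed in R.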
Interpretation: η² = 0.0503, so ethnicity explains about 5.03% of the variance in BMD. By conventional benchmarks this is a small effect size. Practically, the effect is real but modest in magnitude: ethnicity accounts for only about 5% of the variation in BMD.
###Part 2: Correlation Analysis
#Step 1: Create Total Intake Variables
bmd_new3 <- bmd_new2 %>%
  mutate(
    calcium_total = coalesce(calcium, 0) + coalesce(DSQTCALC, 0),
    vitd_total = coalesce(vitd, 0) + coalesce(DSQTVD, 0)
  )
#Step 2: Assess and Apply Transformations
#Create histograms and Q-Q plots for the continuous variables (BMI, calcium, vitamin D, MET)
hist(bmd_new3$BMXBMI, main = "Histogram: BMI",
     xlab = "Value", col = "lightblue")

hist(bmd_new3$calcium_total, main = "Histogram: Calcium Total",
     xlab = "Value", col = "lightblue")

qqnorm(bmd_new3$calcium_total, main = "Q-Q Plot: Calcium Total")
qqline(bmd_new3$calcium_total, col = "red", lwd = 2)

hist(bmd_new3$vitd_total, main = "Histogram: Vitamin D Total",
     xlab = "Value", col = "lightblue")

qqnorm(bmd_new3$vitd_total, main = "Q-Q Plot: Vitamin D Total")
qqline(bmd_new3$vitd_total, col = "red", lwd = 2)

#Log transformations
bmd_new3$log_BMXBMI <- log(bmd_new3$BMXBMI)
bmd_new3$log_calcium_total <- log(bmd_new3$calcium_total+1)
#SQRT transformations
bmd_new3$sqrt_totmet <- sqrt(bmd_new3$totmet)
bmd_new3$sqrt_vitd_total <- sqrt(bmd_new3$vitd_total)
#Re-check normality of the transformed variables
qqnorm(bmd_new3$log_BMXBMI, main = "Q-Q Plot: Log BMI")
qqline(bmd_new3$log_BMXBMI, col = "red", lwd = 2)

qqnorm(bmd_new3$log_calcium_total, main = "Q-Q Plot: Log Calcium Total")
qqline(bmd_new3$log_calcium_total, col = "red", lwd = 2)

qqnorm(bmd_new3$sqrt_totmet, main = "Q-Q Plot: sqrt MET")
qqline(bmd_new3$sqrt_totmet, col = "red", lwd = 2)

qqnorm(bmd_new3$sqrt_vitd_total, main = "Q-Q Plot: sqrt Vitamin D Total")
qqline(bmd_new3$sqrt_vitd_total, col = "red", lwd = 2)
Documentation: I applied transformations to the four variables BMI, calcium, vitamin D, and MET, using log for BMI and total calcium and square root for total MET and total vitamin D. The goal of the transformations was to reduce right skew and bring the distributions closer to normality.
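To quantify how much the transformations reduce skew, one option is to compare sample skewness before and after transforming (a supplementary sketch; skewness_check() is a hypothetical helper, not part of the assignment code; output not shown):

# Sample skewness (third standardized moment); hypothetical helper for illustration
skewness_check <- function(x) {
  x <- x[!is.na(x)]
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# Compare raw vs. transformed skewness (assumes bmd_new3 from the steps above)
c(
  calcium_raw = skewness_check(bmd_new3$calcium_total),
  calcium_log = skewness_check(bmd_new3$log_calcium_total),
  vitd_raw    = skewness_check(bmd_new3$vitd_total),
  vitd_sqrt   = skewness_check(bmd_new3$sqrt_vitd_total)
)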
#Step 3: Three Correlation Tables
# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
  # Select variables and remove missing values
  d <- data %>%
    select(all_of(c(x, y))) %>%
    drop_na()
  # Need at least 3 observations for correlation
  if (nrow(d) < 3) {
    return(tibble(
      x = x,
      y = y,
      n = nrow(d),
      estimate = NA_real_,
      p_value = NA_real_
    ))
  }
  # Calculate Pearson correlation
  ct <- cor.test(d[[x]], d[[y]], method = "pearson")
  # Return tidy results
  tibble(
    x = x,
    y = y,
    n = nrow(d),
    estimate = unname(ct$estimate),
    p_value = ct$p.value
  )
}
#Table A: Predictors vs. Outcome (BMD)
#Examine correlations between potential predictors and BMD:
#Age vs. BMD
#BMI vs. BMD
#Total calcium vs. BMD
#Total vitamin D vs. BMD
#Use transformed versions where appropriate.
# Use the corr_pair function to create correlations
# Use bind_rows() to combine multiple correlation results
# Format with kable()
# Example structure:
tableA <- bind_rows(
  corr_pair(bmd_new3, "RIDAGEYR", "DXXOFBMD"),
  corr_pair(bmd_new3, "log_BMXBMI", "DXXOFBMD"),
  corr_pair(bmd_new3, "log_calcium_total", "DXXOFBMD"),
  corr_pair(bmd_new3, "sqrt_vitd_total", "DXXOFBMD")
) %>%
  mutate(
    estimate = round(estimate, 3),
    p_value = signif(p_value, 3)
  )
tableA %>% kable()

| x | y | n | estimate | p_value |
|---|---|---|---|---|
| RIDAGEYR | DXXOFBMD | 2286 | -0.232 | 0.0000 |
| log_BMXBMI | DXXOFBMD | 2275 | 0.425 | 0.0000 |
| log_calcium_total | DXXOFBMD | 2286 | 0.030 | 0.1450 |
| sqrt_vitd_total | DXXOFBMD | 2286 | -0.051 | 0.0151 |
Interpret the results: Which variables show statistically significant correlations with BMD, and what is the direction and strength of these relationships? Age (RIDAGEYR, r = -0.232), log BMI (r = 0.425), and sqrt total vitamin D (r = -0.051) have p-values below 0.05, so we reject the null hypothesis of no correlation for those variables; log total calcium (p = 0.145) is not statistically significant. Age and vitamin D show weak inverse associations with BMD, while BMI shows a moderate positive association.
#Table B: Predictors vs. Exposure (Physical Activity)
#Examine correlations between potential confounders and physical activity:
#Age vs. MET
#BMI vs. MET
#Total calcium vs. MET
#Total vitamin D vs. MET
tableB <- bind_rows(
  corr_pair(bmd_new3, "RIDAGEYR", "sqrt_totmet"),
  corr_pair(bmd_new3, "log_BMXBMI", "sqrt_totmet"),
  corr_pair(bmd_new3, "log_calcium_total", "sqrt_totmet"),
  corr_pair(bmd_new3, "sqrt_vitd_total", "sqrt_totmet")
) %>%
  mutate(
    estimate = round(estimate, 3),
    p_value = signif(p_value, 3)
  )
tableB %>% kable()

| x | y | n | estimate | p_value |
|---|---|---|---|---|
| RIDAGEYR | sqrt_totmet | 1588 | -0.105 | 2.67e-05 |
| log_BMXBMI | sqrt_totmet | 1582 | 0.009 | 7.32e-01 |
| log_calcium_total | sqrt_totmet | 1588 | 0.072 | 4.36e-03 |
| sqrt_vitd_total | sqrt_totmet | 1588 | -0.002 | 9.34e-01 |
Interpret the results: Which variables are associated with physical activity level?
Age (p = 2.67e-05) and log total calcium (p = 0.00436) are significantly associated with sqrt MET, although both correlations are weak (r = -0.105 and r = 0.072). Log BMI (p = 0.732) and sqrt total vitamin D (p = 0.934) are not significantly associated with physical activity.
#Table C: Predictors vs. Each Other
#Examine correlations among all predictor variables (all pairwise combinations):
#- Age, BMI, Total calcium, Total vitamin D
preds <- c("RIDAGEYR", "BMXBMI", "log_calcium_total", "sqrt_vitd_total")
pairs <- combn(preds, 2, simplify = FALSE)
tableC <- map_dfr(pairs, ~ corr_pair(bmd_new3, .x[1], .x[2])) %>%
  mutate(
    estimate = round(estimate, 3),
    p_value = signif(p_value, 3)
  )
tableC %>% kable()

| x | y | n | estimate | p_value |
|---|---|---|---|---|
| RIDAGEYR | BMXBMI | 2275 | -0.098 | 2.70e-06 |
| RIDAGEYR | log_calcium_total | 2286 | -0.024 | 2.58e-01 |
| RIDAGEYR | sqrt_vitd_total | 2286 | 0.146 | 0.00e+00 |
| BMXBMI | log_calcium_total | 2275 | 0.032 | 1.23e-01 |
| BMXBMI | sqrt_vitd_total | 2275 | 0.027 | 2.01e-01 |
| log_calcium_total | sqrt_vitd_total | 2286 | 0.224 | 0.00e+00 |
Interpret the results: Are there strong correlations among predictors that might indicate multicollinearity concerns in future regression analyses? The largest correlations among predictors are age with sqrt vitamin D (r = 0.146) and log calcium with sqrt vitamin D (r = 0.224). Both are statistically significant but modest in magnitude, so severe multicollinearity is unlikely, though these pairs are worth monitoring in future regression analyses.
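Looking ahead to regression, one way to screen for multicollinearity is with variance inflation factors via car::vif(), since car is already loaded. This is an illustrative sketch only (the model specification is an assumption, not part of the assignment; output not shown):

# Illustrative VIF check on the candidate predictors
vif_model <- lm(DXXOFBMD ~ RIDAGEYR + log_BMXBMI + log_calcium_total + sqrt_vitd_total,
                data = bmd_new3)
car::vif(vif_model)  # values near 1 indicate little multicollinearity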
Create scatterplots to visualize key relationships:
# Example: BMD vs BMI
ggplot(bmd_new3, aes(x = BMXBMI, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "BMI (BMXBMI)",
    y = "BMD",
    title = "BMD vs BMI"
  )

# Example: BMD vs Physical Activity
ggplot(bmd_new3, aes(x = sqrt_totmet, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "sqrt(Total MET)",
    y = "BMD",
    title = "BMD vs MET (transformed)"
  )

Write a structured reflection (200–400 words) addressing the following:
A. Comparing Methods (ANOVA vs Correlation)
B. Assumptions and Limitations
C. R Programming Challenge
A. Comparing Methods (ANOVA vs Correlation): I would use ANOVA when I have one categorical predictor and a continuous outcome, and correlation when both variables are continuous. For example, in epidemiological research on cancer, I would use ANOVA to compare an outcome across cancer stages, but I would use correlation to study how the outcome relates to participants' age. The key difference is that ANOVA compares means across multiple treatments, exposure levels, or populations in public health without inflating the chance of false discoveries from many pairwise tests, while correlation describes the strength and direction of a linear relationship between two continuous variables in a health context. An ANOVA-style question would be: do COVID-19 vaccination rates differ across geographic areas? A correlation-style question would be: how are age-related changes associated with diabetes risk, to help improve prevention programs for at-risk populations?

B. Assumptions and Limitations: The assumptions that were most challenging to meet in this assignment were linearity and normality; BMI and calcium were skewed and needed transformations. Violations of these assumptions can affect my conclusions: with a small, unbalanced sample and severe violations, estimates and p-values become unreliable, which can lead to false conclusions. A limitation of cross-sectional correlation analysis for understanding the relationship between nutrition/activity and bone health is that both variables are measured at the same time, so causality cannot be determined. Self-reported nutrition and activity data can also introduce bias, since participants may not report their intake or activity accurately.

C. R Programming Challenge: The most difficult part of the R coding was converting ethnicity, sex, and smoker status correctly. I was unsure whether the labels should come from the codebook in the assignment or be written on my own; this was before the professor's template was available, and after spending hours on it I asked the TA for assistance. Asking for help is important because R code can be complex and hard to understand. The R skills I strengthened through Assignment 1 were recognizing which variables needed to be recoded, using the correct variables from the codebook, and catching small details such as missing commas, which prevents errors when running the code.

# Submission checklist
Before submitting, verify: