Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.

AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.


Part 0: Data Preparation (10 points)

# Load required packages
library(tidyverse)
library(kableExtra)
library(broom)
# Import the dataset — update the path if needed
bmd <- read.csv('bmd.csv')

# Quick check
glimpse(bmd)
## Rows: 2,898
## Columns: 14
## $ SEQN     <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …
# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
  mutate(
    RIDRETH1 = factor(RIDRETH1,
                      levels = 1:5,
                      labels = c("Mexican American", "Other Hispanic",
                                 "Non-Hispanic White", "Non-Hispanic Black", "Other")),

    # Recode RIAGENDR as a labeled factor
    RIAGENDR = factor(RIAGENDR,
                      levels = c(1, 2),
                      labels = c("Male", "Female")),

    # Recode smoker as a labeled factor
    smoker = factor(smoker,
                    levels = c(1, 2, 3),
                    labels = c("Current", "Past", "Never"))
  )
# Report missing values for the key variables
cat("Total N:", nrow(bmd), "\n")
## Total N: 2898
cat("Missing DXXOFBMD:", sum(is.na(bmd$DXXOFBMD)), "\n")
## Missing DXXOFBMD: 612
cat("Missing calcium:", sum(is.na(bmd$calcium)), "\n")
## Missing calcium: 293
# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(calcium))

cat("Final analytic N:", nrow(bmd_analytic), "\n")
## Final analytic N: 2129

The original dataset contained N = 2,898 observations. After excluding the 769 participants with missing values for dietary calcium intake or total femur BMD, the analytic sample included N = 2,129 participants.
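The exclusion counts can be cross-checked with a little arithmetic. The sketch below is illustrative only: it hard-codes the four counts printed in the output above rather than reading bmd.csv, and the variable names are hypothetical. Because some participants are missing both variables, the two missing-value counts overlap, and inclusion-exclusion recovers the size of that overlap.

```r
# Counts copied from the output above (not recomputed from the data)
n_total           <- 2898
n_missing_bmd     <- 612
n_missing_calcium <- 293
n_analytic        <- 2129

# Participants dropped for missing BMD or missing calcium
n_excluded <- n_total - n_analytic

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
n_missing_both <- n_missing_bmd + n_missing_calcium - n_excluded

cat("Excluded:", n_excluded, "\n")
cat("Missing both BMD and calcium:", n_missing_both, "\n")
```

Under these counts, 769 participants are excluded and 136 of them are missing both variables, which is why the two missing-value counts sum to more than the number excluded.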


Part 1: Exploratory Visualization (15 points)

Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?

# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
    title   = "Association Between Dietary Calcium Intake and Femur Bone Mineral Density",
    x       = "Dietary Calcium Intake (mg/day)",
    y       = "Total Femur Bone Mineral Density (g/cm²)"
  ) +
  theme_minimal()

Written interpretation (3–5 sentences):

The scatterplot suggests a positive linear association between dietary calcium intake and total femur bone mineral density (BMD). As calcium intake increases, femur BMD tends to increase slightly, although the points are widely dispersed. This indicates that the relationship appears relatively weak, with substantial variability in BMD across different levels of calcium intake. There do not appear to be strong non-linear patterns, though a few observations with very high calcium intake may represent potential outliers. Overall, the plot suggests a modest positive relationship between calcium intake and femur BMD.


Part 2: Simple Linear Regression (40 points)

Step 1 — Fit the Model (5 points)

# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)

# Display the full model summary
summary(model)
## 
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55653 -0.10570 -0.00561  0.10719  0.62624 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.992e-01  7.192e-03 125.037  < 2e-16 ***
## calcium     3.079e-05  7.453e-06   4.131 3.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared:  0.007959,   Adjusted R-squared:  0.007493 
## F-statistic: 17.07 on 1 and 2127 DF,  p-value: 3.751e-05
nrow(bmd)
## [1] 2898
sum(is.na(bmd$DXXOFBMD))
## [1] 612
sum(is.na(bmd$calcium))
## [1] 293
nrow(bmd_analytic)
## [1] 2129

Step 2 — Interpret the Coefficients (10 points)

A. Intercept (β₀):

The intercept (β₀) represents the estimated total femur bone mineral density (BMD) when dietary calcium intake is 0 mg/day. Based on the regression model, the predicted BMD at zero calcium intake is approximately 0.899 g/cm². In reality, consuming 0 mg/day of calcium is unlikely, so this value is not particularly meaningful in a practical sense. Instead, the intercept mainly serves as the baseline point where the regression line crosses the y-axis.

B. Slope (β₁):

The slope (β₁) represents the estimated difference in mean total femur bone mineral density associated with a 1 mg/day difference in dietary calcium intake. According to the model, each additional 1 mg/day of calcium intake is associated with a femur BMD that is higher by approximately 0.0000308 g/cm², or about 0.0031 g/cm² per 100 mg/day. This indicates a positive association, meaning higher calcium intake is associated with slightly higher BMD. However, the effect size is very small: even a 500 mg/day difference in calcium intake corresponds to a difference in predicted BMD of only about 0.015 g/cm².
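Because a 1 mg/day increment is too small to be intuitive, it can help to rescale the slope to larger increments. The following is a minimal sketch that assumes the rounded slope printed by summary(model); the variable names are hypothetical and the model is not refit here.

```r
# Slope copied from summary(model) above, rounded as printed
b1 <- 3.079e-05   # g/cm^2 per 1 mg/day of calcium

# Rescale to increments that are easier to interpret
b1_per_100  <- b1 * 100
b1_per_1000 <- b1 * 1000

cat("Per 100 mg/day: ", b1_per_100, "g/cm^2\n")
cat("Per 1000 mg/day:", b1_per_1000, "g/cm^2\n")
```

Rescaling the predictor changes only the units of the slope; the t-statistic, p-value, and R² would be the same if the model were refit with calcium expressed in hundreds of mg/day.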

Step 3 — Statistical Inference (15 points)

# 95% confidence interval for the slope
confint(model)  # returns 95% CIs for both the intercept and the slope
##                    2.5 %       97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium     1.617334e-05 4.540649e-05

State your hypotheses:

  • H₀: β₁ = 0. There is no linear association between dietary calcium intake and total femur bone mineral density.
  • H₁: β₁ ≠ 0. There is a linear association between dietary calcium intake and total femur bone mineral density.

Report the test results:

From the regression output, the estimated slope for calcium intake has a t-statistic of 4.131 with 2127 degrees of freedom and a p-value of 3.75 × 10⁻⁵. Because the p-value is much smaller than the conventional significance level of 0.05, we reject the null hypothesis. This provides strong statistical evidence of a linear association between dietary calcium intake and total femur bone mineral density. Specifically, higher calcium intake is associated with slightly higher BMD.
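The reported test statistic can be reproduced by hand from the coefficient table. This is a rough sketch that assumes the rounded estimate and standard error printed above, so the results agree with the summary only up to rounding; the variable names are hypothetical.

```r
# Values copied from the coefficient table above (rounded as printed)
est      <- 3.079e-05   # slope estimate
se       <- 7.453e-06   # standard error of the slope
df_resid <- 2127        # residual degrees of freedom (n - 2)

# t-statistic for H0: beta1 = 0
t_stat <- est / se

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

# With a single predictor, the overall F-statistic equals t squared
f_stat <- t_stat^2

cat("t =", round(t_stat, 3), " p =", signif(p_value, 3),
    " F =", round(f_stat, 2), "\n")
```

The squared t-statistic matching the F-statistic in the summary is a useful consistency check: in simple linear regression the slope test and the overall model test are the same test.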

Interpret the 95% confidence interval for β₁:

The 95% confidence interval for the slope ranges from 1.62 × 10⁻⁵ to 4.54 × 10⁻⁵ g/cm² per mg/day of calcium intake. This means that we are 95% confident that the true increase in femur BMD associated with a 1 mg/day increase in calcium intake lies within this range. Because the entire interval is positive and does not include zero, it further supports the conclusion that calcium intake is positively associated with femur BMD. However, the magnitude of the effect is relatively small.
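The interval returned by confint() can be rebuilt as estimate ± critical t value × standard error. The sketch below assumes the rounded estimate and standard error from the summary output, so it matches confint() only to about three significant figures; the variable names are hypothetical.

```r
# Values copied from the coefficient table above (rounded as printed)
est      <- 3.079e-05
se       <- 7.453e-06
df_resid <- 2127

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# Confidence interval for the slope, per 1 mg/day
ci_slope <- est + c(-1, 1) * t_crit * se

# The same interval expressed per 100 mg/day
ci_slope_per_100 <- ci_slope * 100

print(ci_slope)
print(ci_slope_per_100)
```

Expressing the interval per 100 mg/day makes the small magnitude easier to read without changing the inference.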

Step 4 — Model Fit: R² and Residual Standard Error (10 points)

R² (coefficient of determination):

  • Multiple R² = 0.007959
  • Residual Standard Error (RSE) = 0.1582
  • Outcome units = g/cm² (BMD)

The R² value for this model is 0.00796, meaning that approximately 0.8% of the variability in total femur bone mineral density (BMD) is explained by dietary calcium intake. This indicates that the model has very limited explanatory power, as calcium intake alone accounts for only a small proportion of the variation in BMD. Although the association is statistically significant, the low R² suggests that many other factors likely influence BMD. These may include variables such as age, sex, body mass index, physical activity, vitamin D levels, and other health or lifestyle factors.
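Several quantities in the summary output are tied to R² by simple identities, which gives a quick consistency check. This is a sketch under the assumption that the rounded values printed above are used as inputs; the variable names are hypothetical.

```r
# Values copied from summary(model) above (rounded as printed)
r_squared <- 0.007959
f_stat    <- 17.07
n         <- 2129
df_resid  <- 2127   # n - 2 for one predictor plus an intercept

# In simple linear regression, R^2 is the squared Pearson correlation;
# the sign of r follows the (positive) slope
r <- sqrt(r_squared)

# R^2 recovered from the F-statistic: F / (F + residual df)
r2_from_f <- f_stat / (f_stat + df_resid)

# Adjusted R^2 for a single predictor
adj_r2 <- 1 - (1 - r_squared) * (n - 1) / df_resid

cat("r =", round(r, 4), "\n")
cat("R^2 from F =", signif(r2_from_f, 4), "\n")
cat("Adjusted R^2 =", signif(adj_r2, 4), "\n")
```

A correlation of roughly 0.09 is another way to see how weak the linear association is, even though it is statistically distinguishable from zero in a sample this large.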

Residual Standard Error (RSE):

The residual standard error (RSE) of the model is 0.158 g/cm², which represents the typical difference between the observed BMD values and the values predicted by the regression model. In other words, the model’s predictions for BMD are off by about 0.158 g/cm² on average. This error is roughly five times the 0.031 g/cm² difference in predicted BMD associated with a 1,000 mg/day difference in calcium intake, which further indicates that calcium intake alone does not strongly predict femur bone mineral density. Additional predictors would likely improve the model’s accuracy.
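To make the definition of the RSE concrete, the sketch below computes it by hand as the square root of the residual sum of squares divided by the residual degrees of freedom. It uses simulated data, not the homework dataset: the sample size, coefficients, and noise level are assumptions chosen only to resemble the scale of the real model.

```r
# Simulated data for illustration only (NOT the bmd dataset)
set.seed(553)
n_sim       <- 200
calcium_sim <- runif(n_sim, min = 200, max = 2000)
bmd_sim     <- 0.9 + 3e-05 * calcium_sim + rnorm(n_sim, sd = 0.16)

fit_sim <- lm(bmd_sim ~ calcium_sim)

# RSE by hand: sqrt(residual sum of squares / residual df)
rse_manual <- sqrt(sum(residuals(fit_sim)^2) / df.residual(fit_sim))

# RSE as reported by summary()
rse_summary <- summary(fit_sim)$sigma

print(all.equal(rse_manual, rse_summary))
```

Dividing by the residual degrees of freedom (n − 2 here) rather than n accounts for the two estimated coefficients.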


Part 3: Prediction (20 points)

# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)

# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")
##         fit       lwr      upr
## 1 0.9299936 0.9229112 0.937076
# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")
##         fit       lwr      upr
## 1 0.9299936 0.6195964 1.240391

Written interpretation (3–6 sentences):

  • Predicted BMD (fit) = 0.9299936 ≈ 0.930 g/cm²
  • 95% Confidence Interval = (0.9229, 0.9371) g/cm²
  • 95% Prediction Interval = (0.6196, 1.2404) g/cm²

[Answer all four questions from the assignment description:

  1. What is the predicted BMD at calcium = 1,000 mg/day? (Report with units.)
  2. What does the 95% confidence interval represent?
  3. What is the 95% prediction interval, and why is it wider than the CI?
  4. Is 1,000 mg/day a meaningful value to predict at, given the data?]

The regression model predicts that the total femur bone mineral density (BMD) for an individual with a dietary calcium intake of 1,000 mg/day is approximately 0.930 g/cm². The 95% confidence interval (0.923 to 0.937 g/cm²) represents the range in which the true mean BMD for individuals consuming 1,000 mg/day of calcium is expected to fall with 95% confidence. The 95% prediction interval (0.620 to 1.240 g/cm²) represents the range in which the BMD of a single new individual with calcium intake of 1,000 mg/day is likely to fall. The prediction interval is wider than the confidence interval because it accounts for both uncertainty in estimating the mean response and the natural variability of individual BMD values. A value of 1,000 mg/day is a meaningful value for prediction because it falls within the typical recommended and observed range of daily calcium intake.
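The relationship between the two intervals can be checked numerically. The sketch below assumes only the rounded values printed above (coefficients, RSE, and the confidence interval from predict()); it backs out the standard error of the mean response from the confidence interval and then adds the residual variance to approximate the prediction interval. Variable names are hypothetical, and agreement with predict() is limited by rounding of the inputs.

```r
# Values copied from the output above (rounded as printed)
b0       <- 0.8992
b1       <- 3.079e-05
rse      <- 0.1582
df_resid <- 2127
fit      <- 0.9299936
ci_lwr   <- 0.9229112
ci_upr   <- 0.937076

# Point prediction at calcium = 1000 mg/day from the coefficients
fit_manual <- b0 + b1 * 1000

# Standard error of the mean response, recovered from the CI half-width
t_crit  <- qt(0.975, df = df_resid)
se_mean <- (ci_upr - ci_lwr) / (2 * t_crit)

# Prediction SE combines uncertainty in the mean with individual scatter
se_pred  <- sqrt(rse^2 + se_mean^2)
pred_int <- fit + c(-1, 1) * t_crit * se_pred

cat("Fitted value:", round(fit_manual, 3), "\n")
cat("Approximate 95% PI:", round(pred_int, 3), "\n")
```

The residual variance term dominates the prediction standard error here, which is why the prediction interval is so much wider than the confidence interval despite the large sample.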


Part 4: Reflection (15 points)

Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.

A. Statistical Insight (6 points)

The simple linear regression model suggests that there is a statistically significant but very small positive association between dietary calcium intake and total femur bone mineral density (BMD). As calcium intake increases, BMD tends to increase slightly; however, the model explains less than 1% of the variability in BMD, indicating that calcium intake alone is not a strong predictor. This result was not entirely surprising because bone density is influenced by many biological and lifestyle factors beyond diet alone. A major limitation of interpreting this simple linear regression as causal evidence is that the data come from a cross-sectional survey, meaning that exposure and outcome are measured at the same time. Because of this, we cannot determine whether higher calcium intake actually causes higher BMD. Additionally, the observed association may be influenced by confounding variables such as age, sex, body mass index, physical activity, vitamin D status, hormonal factors, or overall health behaviors.

B. From ANOVA to Regression (5 points)

This assignment also highlights the difference between one-way ANOVA and regression analysis. In Homework 1, one-way ANOVA was used to compare the mean BMD across different ethnic groups, which allowed us to determine whether there were statistically significant differences in average BMD between categories of a categorical variable. In contrast, simple linear regression models the relationship between BMD and a continuous predictor, such as calcium intake, and estimates how much the outcome changes with each unit increase in the predictor. Regression therefore provides additional information, including the slope of the relationship and the ability to make predictions for specific values of the predictor. ANOVA is most appropriate when comparing group means for categorical variables, whereas regression is preferable when examining relationships involving continuous predictors or when prediction is of interest.
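The two methods are also closely connected: a one-way ANOVA gives the same overall F-test as a linear regression on a categorical predictor. The sketch below illustrates this with simulated data, not the Homework 1 or Homework 2 datasets; the group labels, group means, and sample sizes are assumptions made up for the example.

```r
# Simulated data for illustration only (NOT the homework data)
set.seed(553)
group <- factor(rep(c("A", "B", "C"), each = 50))
group_means <- c(0.90, 0.95, 1.00)
y <- rnorm(150, mean = group_means[as.integer(group)], sd = 0.15)

# One-way ANOVA
aov_fit <- aov(y ~ group)
f_aov   <- summary(aov_fit)[[1]][["F value"]][1]

# Linear regression with the same categorical predictor
lm_fit <- lm(y ~ group)
f_lm   <- unname(summary(lm_fit)$fstatistic["value"])

# The overall F-statistics agree
print(all.equal(f_aov, f_lm))
```

In the regression version, the coefficients are differences between each group mean and the reference group mean, so the same fitted model supports both the overall test and specific group contrasts.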

C. R Programming Growth (4 points)

From a programming perspective, one of the most challenging parts of this assignment was correctly structuring the R Markdown code chunks and ensuring that the regression model and prediction commands ran without errors. Initially, interpreting the output and formatting the results within the document required careful attention. Working through these challenges helped reinforce my understanding of how to run regression models in R, interpret model summaries, and generate predictions and confidence intervals using the predict() function. As a result, I now feel more confident in fitting linear regression models in R and interpreting their statistical output.


End of Homework 2