EPI 553 — Homework 2: Simple Linear Regression

Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.

AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.

Part 0: Data Preparation (10 points)

# Load required packages
library(tidyverse)
library(kableExtra)
library(broom)

# Import the dataset — update the path if needed
bmd <- read.csv("C:/Users/abbym/OneDrive/Desktop/STATS553/Assignments/Homework #2/bmd(in).csv")

# Quick check
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
  mutate(
    RIDRETH1 = factor(RIDRETH1,
                      levels = 1:5,
                      labels = c("Mexican American", "Other Hispanic",
                                 "Non-Hispanic White", "Non-Hispanic Black", "Other")),

    # Recode RIAGENDR as a labeled factor
    RIAGENDR = factor(RIAGENDR,
                      levels = c(1, 2),
                      labels = c("Male", "Female")),

    # Recode smoker as a labeled factor
    smoker = factor(smoker,
                    levels = c(1, 2, 3),
                    labels = c("Current", "Past", "Never"))
  )

# Report missing values for the key variables
cat("Total N:", nrow(bmd), "\n")

## Total N: 2898

cat("Missing DXXOFBMD:", sum(is.na(bmd$DXXOFBMD)), "\n")

## Missing DXXOFBMD: 612

cat("Missing calcium:", sum(is.na(bmd$calcium)), "\n")

## Missing calcium: 293

# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(calcium))

cat("Final analytic N:", nrow(bmd_analytic), "\n")

## Final analytic N: 2129

Part 1: Exploratory Visualization (15 points)

Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?

# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
    title   = "The Association between Daily Calcium Intake and Bone Mineral Density",
    x       = "Calcium (mg/day)",
    y       = "Bone Mineral Density (g/cm²)"
  ) +
  theme_minimal()

Written interpretation (3–5 sentences):

[Describe what the scatterplot reveals. Is there a visible linear trend? In which direction? Does the relationship appear strong or weak? Are there any notable outliers or non-linearities?] The points on the scatterplot do not show a strong linear trend. The fitted line given is in the positive direction. The relationship appears weak. There are a a few outliers when it comes to calcium intake, as well as bone mineral density. —

Part 2: Simple Linear Regression (40 points)

Step 1 — Fit the Model (5 points)

# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)

# Display the full model summary
summary(model)

## 
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55653 -0.10570 -0.00561  0.10719  0.62624 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.992e-01  7.192e-03 125.037  < 2e-16 ***
## calcium     3.079e-05  7.453e-06   4.131 3.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared:  0.007959,   Adjusted R-squared:  0.007493 
## F-statistic: 17.07 on 1 and 2127 DF,  p-value: 3.751e-05

Step 2 — Interpret the Coefficients (10 points)

A. Intercept (β₀):

[Interpret the intercept in 2–4 sentences. What does it represent numerically? Is it a meaningful quantity in this context?] The intercept represents what bone mineral density would be if calcium intake was 0 mg/day. The intercept of 8.992e-01 g/cm² is not very meaningful, because it would be hard to not have any calcium intake during a day, especially during a long period of time.

B. Slope (β₁):

[Interpret the slope in 2–4 sentences. For every 1-unit increase in calcium (mg/day), what is the estimated change in BMD (g/cm²)? State the direction. Is the effect large or small given typical calcium intake ranges?] For every 1 unit increase in calcium, bone mineral density increases by 3.079e-05 g/cm². The direction is positive, and even though it is an increase by a very small amount, the effect is fairly large given typical calcium intake ranges and the ranges of BMD measures.
### Step 3 — Statistical Inference (15 points)

# 95% confidence interval for the slope
confint(model)

##                    2.5 %       97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium     1.617334e-05 4.540649e-05

State your hypotheses:

H₀: There is no linear relationship between bone mineral density and calcium intake. β1=0
H₁: There is a linear relationship between bone mineral density and calcium intake. β1≠0

Report the test results:

[Report the t-statistic, degrees of freedom, and p-value from the summary() output above. State your conclusion: do you reject H₀? What does this mean for the association between calcium and BMD?] The t-statistic is 4.131, the degrees of freedom are 2127, and the p-value is 3.75e-05. You reject the null hypothesis. This means that there is a linear relationship between bone mineral density and calcium intake. Interpret the 95% confidence interval for β₁:

[Interpret the CI in plain language — what range of values is plausible for the true slope?] The range of values possible for the true slope are 1.617334e-05 to 4.540649e-05. We are 95% confident that the true linear relationship between calcium intake and BMD is within this range of slope. ### Step 4 — Model Fit: R² and Residual Standard Error (10 points)

R² (coefficient of determination):

[What proportion of the variance in BMD is explained by dietary calcium intake? Based on this R², how well does the model fit the data? What does this suggest about the importance of other predictors?] 0.796% of the variance in BMD is explained by dietary calcium intake. This means that the model doesn’t fit the data very well, and that other predictors are important for determining what can predict the bone mineral density of a person. Residual Standard Error (RSE):

[Report the RSE. Express it in the units of the outcome and explain what it tells you about the average prediction error of the model.] The RSE is 0.1582. This means that the model is off, on average, by 0.1582 g/cm² when predicting bone mineral density. This also tells us that compared to the prediction value of calcium intake on BMD, the possibility and magnitude of the error is high. —

Part 3: Prediction (20 points)

# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)

# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")

##         fit       lwr      upr
## 1 0.9299936 0.9229112 0.937076

# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")

##         fit       lwr      upr
## 1 0.9299936 0.6195964 1.240391

Written interpretation (3–6 sentences):

[Answer all four questions from the assignment description:

What is the predicted BMD at calcium = 1,000 mg/day? (Report with units.) The predicted BMD at calcium = 1,000 mg/day is [0.9229112 g/cm², 0.937076 g/cm²].
What does the 95% confidence interval represent? The 95% confidence interval represents an estimation of the mean bone mineral density of the population of people who have a calcium intake of 1,000 mg/day.
What is the 95% prediction interval, and why is it wider than the CI? The 95% prediction interval is [0.6195964 g/cm², 1.240391 g/cm²], which is the prediction of the bone mineral density of a single individual who consumes 1,000 mg of calcium per day. The prediction interval is wider than the confidence interval, because it is much harder to predict the bone mineral density of a single person than a mean of an entire population, so the interval has to be wider to be 95% confident that the true value for an individual person is included in the range.
Is 1,000 mg/day a meaningful value to predict at, given the data?] Yes, 1,000 mg/day is a meaningful value to predict at. There are many data points surrounding the 1,000 mg/day area on the scatterplot comparing calcium intake and BMD. —

Part 4: Reflection (15 points)

Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.

A. Statistical Insight (6 points)

[What does the regression model tell you about the calcium–BMD relationship? Were the results surprising? What are the key limitations of interpreting SLR from a cross-sectional survey as causal evidence? What confounders might explain the observed association?]

B. From ANOVA to Regression (5 points)

[Homework 1 used one-way ANOVA to compare mean BMD across ethnic groups. Now you have used SLR to model BMD as a function of a continuous predictor. Compare these two approaches: what kinds of questions does each method answer? What does regression give you that ANOVA does not? When would you prefer one over the other?]

C. R Programming Growth (4 points)

[What was the most challenging part of this assignment from a programming perspective? How did you work through it? What R skill do you feel more confident about after completing this homework?]

The regression models tells us that daily calcium intake may affect the bone mineral density of a person, but adding more variables may help to explain the association more, and make the model stronger. The results were somewhat surprising, as I expected calcium intake to have more of an association with bone mineral density. The key limitations of interpreting SLR from a cross-sectional survey as causal evidence is that you are unable to establish temporality, meaning you are unsure if the calcium intake or bone mineral density occurred first, or which was measured first. There also may be independence issues, as with a cross-sectional study like BRFSS, there may be geographic or household clusters included. Confounders such as age, gender, physical activity, BMI, and more might explain the observed association. ANOVA answered whether BMD differed amongst different groups of people, whereas SLR measures the linear relationship between calcium intake and BMD. Regression helps to give a quantifiable number regarding the strength of the relationship between calcium intake and BMD, whereas ANOVA tells us whether BMD varied by different groups, which groups it varied by, and how much it varied by. You would prefer ANOVA when you are trying to decide if categories of certain things, like age group, gender, race/ethnicity, and more, differ when it comes to a continuous outcome. Simple linear regression is good at comparing two continuous variables and quantifying the relationship between the two variables. Since all of coding was completed for us, the most challenging part of the assignment was making sure I was interpreting the correct values and interpreting them correctly. I have had troubles in the past with R coding for assignments where certain values or variables had to be changed in order to receive the correct outcome. Throughout these assignments, and this assignment, I have become more confident in proofreading the code in order to obtain the results and outcomes I am looking for. I make sure to read the code carefully and determine what needs to be changed to fit my data and what I am looking for and what can stay the same. —

End of Homework 2