Instructions

Due by end of Module 4.

Set up environment

For knitting, global chunk options are set in an Environment Setup chunk with echo=FALSE, the random seed is set globally, and the required packages are installed and loaded.
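A minimal sketch of such a setup chunk (the exact options, seed, and package list are assumptions, inferred from the functions used later in this document):

# Environment Setup (echo=FALSE in the actual chunk)
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
set.seed(123)  # assumed seed value; the original is set globally

library(knitr)       # kable()
library(kableExtra)  # kable_styling(), add_header_above(), row_spec()
library(broom)       # tidy(), glance()
library(dplyr)       # filter(), mutate(), select(), bind_rows(), %>%
library(GGally)      # ggpairs()
library(car)         # vif()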

Question 1

Consider the multiple linear regression model:

\({ \mu = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p }\)

Show that \(\beta_j\) gives the difference in the expected response for a one-unit increase in \(x_j\), no matter what the values of \(x_j\) and the other predictors. Hint: Compare the expected response when \(x_j = x\) and \(x_j = x + 1\).

When \({x_j = x:}\)

\({\mu_1 = \beta_0 + \beta_1 x_1 + \cdots + \beta_j x + \cdots + \beta_p x_p}\)

When \({x_j = x + 1:}\)

\({\mu_2 = \beta_0 + \beta_1 x_1 + \cdots + \beta_j (x+1) + \cdots + \beta_p x_p}\)

\({\mu_2 = \beta_0 + \beta_1 x_1 + \cdots + \beta_j x + \beta_j \cdot 1 + \cdots + \beta_p x_p}\)

Take the difference:

\({\mu_2 - \mu_1 = [\beta_j x + \beta_j \cdot 1] - [\beta_j x] = \beta_j}\)

Therefore, \({\mu_2 - \mu_1 = \beta_j}\)

Conclusion

The coefficient \({\beta_j}\) gives the partial effect of \({x_j}\) on the expected value of the response variable. It quantifies how much the response is expected to change per unit increase in \({x_j}\), assuming all other variables remain fixed.
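A quick numeric illustration of this result (the coefficient values are arbitrary, chosen only for the example):

# Two-predictor illustration: mu changes by exactly b1 when x1 rises by one unit
b0 <- 2; b1 <- 1.5; b2 <- -0.7
mu <- function(x1, x2) b0 + b1 * x1 + b2 * x2
mu(x1 = 4, x2 = 10) - mu(x1 = 3, x2 = 10)  # 1.5, i.e., beta_1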

Question 2

In a study designed to examine the effects of adding oats to the typical American diet, individuals were randomly divided into two groups. Twice a day, the first group substituted oats for other foods containing carbohydrates. The members of the second group did not make any changes to their diet. One outcome of interest is the serum cholesterol level of each individual eight weeks after the start of the study. Explanatory variables that might be associated with this response include diet group, serum cholesterol level at the start of the study, body mass index, and sex. The estimated coefficients and standard errors from a multiple linear regression model based on a sample of size 76 and containing four explanatory variables are in the following table.

Variable               Coefficient   Standard Error
Diet group             -11.25        4.33
Baseline cholesterol    0.85         0.07
Body mass index         0.23         0.65
Sex                    -3.02         4.42
a. Conduct tests of the null hypotheses that each of the four coefficients in the population regression is equal to 0. At the 0.05 significance level, which of the explanatory variables are associated with serum cholesterol level eight weeks after the start of the study, adjusting for the others?
# Regression results
coefs <- c(DietGroup = -11.25, BaselineChol = 0.85, BMI = 0.23, Sex = -3.02)
ses <- c(DietGroup = 4.33, BaselineChol = 0.07, BMI = 0.65, Sex = 4.42)
n <- 76
p <- 4
df <- n - p - 1

# Compute test statistics
t_stats <- coefs/ses
p_vals <- 2 * pt(-abs(t_stats), df = df)

# Create results data frame
results <- data.frame(Coefficient = coefs, `Standard Error` = ses, `t-value` = t_stats,
    `p-value` = p_vals, Significant = p_vals < 0.05, check.names = FALSE)

# Format values for output
results_fmt <- format(results, digits = 4)

# Print table with title above header
kable(results_fmt, align = "c", escape = FALSE) %>%
    kable_styling(font_size = 12) %>%
    add_header_above(c(`Hypothesis Testing Summary` = 6), bold = TRUE) %>%
    row_spec(0, bold = TRUE)
Hypothesis Testing Summary

               Coefficient   Standard Error   t-value    p-value     Significant
DietGroup      -11.25        4.33             -2.5982    1.139e-02   TRUE
BaselineChol    0.85         0.07             12.1429    5.371e-19   TRUE
BMI             0.23         0.65              0.3538    7.245e-01   FALSE
Sex            -3.02         4.42             -0.6833    4.967e-01   FALSE

Hypothesis Testing for Each Coefficient

Null Hypothesis (H₀): β = 0 (no effect)

Alternative Hypothesis (HA): β ≠ 0 (a nonzero effect)

We compute:

\({{t = \frac{\text{Coefficient}}{\text{Standard Error}} \quad \text{and} \quad p = 2 \cdot P(T_{71} > |t|)}}\)
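For example, for diet group: \({t = \frac{-11.25}{4.33} = -2.60}\) and \({p = 2 \cdot P(T_{71} > 2.60) \approx 0.011}\), matching the table above.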

Interpretation: At the 0.05 level, diet group (p = 0.011) and baseline cholesterol (p < 0.001) are significantly associated with serum cholesterol eight weeks after the start of the study, adjusting for the other variables; body mass index (p = 0.724) and sex (p = 0.497) are not.

b. What is the probability distribution of each of the test statistics you calculated?
# b) Probability distribution of the test statistics
cat("\nEach test statistic follows a t-distribution with", df, "degrees of freedom.\n\n")
Output: 
Output: Each test statistic follows a t-distribution with 71 degrees of freedom.

Probability Distribution of Test Statistics

Each of the test statistics follows a t-distribution with \({{df = n - p - 1 = 76 - 4 - 1 = 71}}\) degrees of freedom.

This is appropriate for inference in a multiple linear regression model where the errors are assumed normally distributed.
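As a sanity check (a one-line sketch using the df computed above), the two-sided 5% critical value is:

qt(0.975, df = 71)  # about 1.99; coefficients with |t| above this are significant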

c. If an individual’s body mass index were to increase by \({1\ \text{kg/m}^2}\) while the values of all other explanatory variables remained constant, what would happen to his or her serum cholesterol level?
# c) Effect of 1 unit increase in BMI
delta1 <- coefs["BMI"] * 1
cat("For a 1 unit increase in BMI, serum cholesterol changes by", delta1,
    "units (mg/dL).\n\n")
Output: For a 1 unit increase in BMI, serum cholesterol changes by 0.23 units (mg/dL).

Effect of 1-unit Increase in BMI

• Coefficient for BMI = 0.23

• A 1 kg/m² increase in BMI → predicted increase in serum cholesterol = 0.23 mg/dL

Interpretation

Holding all else constant, a 1-unit increase in BMI is associated with an average increase of 0.23 mg/dL in serum cholesterol after 8 weeks. However, this effect is not statistically significant (p = 0.724), so it may be due to random variation.

d. If an individual’s body mass index were to increase by \({10\ \text{kg/m}^2}\) while the values of all other explanatory variables remained constant, what would happen to his or her serum cholesterol level?
# d) Effect of 10 unit increase in BMI
delta10 <- coefs["BMI"] * 10
cat("For a 10 unit increase in BMI, serum cholesterol changes by", delta10,
    "units (mg/dL).\n\n")
Output: For a 10 unit increase in BMI, serum cholesterol changes by 2.3 units (mg/dL).

Effect of 10-unit Increase in BMI

• Multiply the coefficient by 10: \({{10 \times 0.23 = 2.3}}\)

Interpretation

A 10-unit increase in BMI is associated with an average increase of 2.3 mg/dL in serum cholesterol, holding other variables constant. Again, this effect is not statistically significant and should be interpreted with caution.

e. The variable sex is coded so that 1 represents a male and 0 a female. Who is more likely to have a higher serum cholesterol level eight weeks after the start of the study, a male or a female? By how much on average?

# e) Sex effect interpretation
if (coefs["Sex"] < 0) {
    sex_result <- paste0("Males have on average ", abs(coefs["Sex"]), " units lower serum cholesterol than females, adjusting for other variables.")
} else {
    sex_result <- paste0("Males have on average ", abs(coefs["Sex"]), " units higher serum cholesterol than females, adjusting for other variables.")
}
cat(sex_result, "\n")
Output: Males have on average 3.02 units lower serum cholesterol than females, adjusting for other variables.

Effect of Sex on Cholesterol

• Sex is coded: 1 = male, 0 = female

• Coefficient = -3.02

Interpretation

Males (coded 1) have, on average, 3.02 mg/dL lower serum cholesterol than females after 8 weeks, adjusting for diet group, baseline cholesterol, and BMI. However, this difference is not statistically significant (p = 0.497), so the evidence does not support a true sex-based difference in cholesterol outcomes.

Final Conclusions

1. Significant predictors of cholesterol level at 8 weeks: diet group (p = 0.011) and baseline cholesterol (p < 0.001).

2. Non-significant predictors: body mass index (p = 0.724) and sex (p = 0.497).

3. Implications: adjusting for the other variables, the two diet groups differ by an estimated 11.25 mg/dL in serum cholesterol at eight weeks, and baseline cholesterol is the strongest predictor of the eight-week value.

Question 3

For the population of low birth weight infants, a significant linear relationship was found to exist between systolic blood pressure and gestational age. The relevant data are in the file lowbwt. The measurements of systolic blood pressure are in the variable sbp, and the corresponding gestational ages are in gestage. Also in the data set is apgar5, the five-minute apgar score for each infant.

a. Construct a two-way scatter plot of systolic blood pressure versus five-minute apgar score. Does there appear to be a linear relationship between these two variables?

# Fit linear model
model_simple <- lm(sbp ~ apgar5, data = lowbwt)

# Create new data frame for predictions
new_apgar <- data.frame(apgar5 = seq(min(lowbwt$apgar5), max(lowbwt$apgar5),
    length.out = 100))
pred <- predict(model_simple, newdata = new_apgar, interval = "confidence",
    level = 0.95)

# Plot
plot(lowbwt$apgar5, lowbwt$sbp, xlab = "Five-Minute Apgar Score", ylab = "Systolic Blood Pressure",
    main = "SBP vs Five-Minute Apgar Score", pch = 19, col = "blue")
# Draw the 95% confidence band first, then the fitted line on top of it
polygon(c(new_apgar$apgar5, rev(new_apgar$apgar5)), c(pred[, "lwr"], rev(pred[,
    "upr"])), col = rgb(1, 0, 0, 0.2), border = NA)
lines(new_apgar$apgar5, pred[, "fit"], col = "red", lwd = 2)

# Summary of the model
summary(model_simple)
Output: 
Output: Call:
Output: lm(formula = sbp ~ apgar5, data = lowbwt)
Output: 
Output: Residuals:
Output:     Min      1Q  Median      3Q     Max 
Output: -26.510  -6.406   0.246   4.943  39.397 
Output: 
Output: Coefficients:
Output:             Estimate Std. Error t value Pr(>|t|)    
Output: (Intercept)  42.7192     3.1410  13.601   <2e-16 ***
Output: apgar5        0.6977     0.4687   1.489     0.14    
Output: ---
Output: Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Output: 
Output: Residual standard error: 11.33 on 98 degrees of freedom
Output: Multiple R-squared:  0.02211,   Adjusted R-squared:  0.01214 
Output: F-statistic: 2.216 on 1 and 98 DF,  p-value: 0.1398
# 95% Confidence Interval for coefficients
confint(model_simple, level = 0.95)
Output:                  2.5 %    97.5 %
Output: (Intercept) 36.4860127 48.952311
Output: apgar5      -0.2323795  1.627848
# Tidy regression summary
tidy(model_simple, conf.int = TRUE)

Hypothesis Testing for Each Coefficient

Null Hypothesis (H₀): β = 0 (no effect). Alternative Hypothesis (HA): β ≠ 0 (a nonzero effect).

We compute: \({{t = \frac{\text{Coefficient}}{\text{Standard Error}} \quad \text{and} \quad p = 2 \cdot P(T_{98} > |t|)}}\)

Here the reference distribution has n - 2 = 98 degrees of freedom, since this simple regression has one predictor.

Interpretation

The scatter plot shows the relationship between systolic blood pressure and the five-minute Apgar score. A linear regression line (in red) and its 95% confidence band (shaded) are overlaid.

Conclusion

Based on both the scatter plot and the regression results, there is no strong evidence of a linear relationship between systolic blood pressure and five-minute Apgar score in this sample (p = 0.140, 95% CI includes 0).
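Since this is a simple regression, the correlation is the square root of the R² above, about 0.15 here; a one-line check, assuming lowbwt is loaded:

cor(lowbwt$apgar5, lowbwt$sbp)  # roughly 0.15, consistent with R-squared = 0.022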

b. Using systolic blood pressure as the response and gestational age and apgar score as the explanatory variables, fit the least squares model. Interpret the estimated coefficient of gestational age. What does it mean in words? Similarly, interpret the estimated coefficient of five-minute apgar score.
# Fit multiple linear regression model
model_multi <- lm(sbp ~ gestage + apgar5, data = lowbwt)

# View model summary
summary(model_multi)
Output: 
Output: Call:
Output: lm(formula = sbp ~ gestage + apgar5, data = lowbwt)
Output: 
Output: Residuals:
Output:     Min      1Q  Median      3Q     Max 
Output: -22.374  -8.180  -1.088   4.985  39.424 
Output: 
Output: Coefficients:
Output:             Estimate Std. Error t value Pr(>|t|)   
Output: (Intercept)   9.8034    12.6629   0.774   0.4407   
Output: gestage       1.1848     0.4424   2.678   0.0087 **
Output: apgar5        0.4875     0.4613   1.057   0.2932   
Output: ---
Output: Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Output: 
Output: Residual standard error: 10.99 on 97 degrees of freedom
Output: Multiple R-squared:  0.08944,   Adjusted R-squared:  0.07066 
Output: F-statistic: 4.764 on 2 and 97 DF,  p-value: 0.01063
# Get 95% confidence intervals for coefficients
confint(model_multi, level = 0.95)
Output:                   2.5 %    97.5 %
Output: (Intercept) -15.3289936 34.935829
Output: gestage       0.3067380  2.062913
Output: apgar5       -0.4280932  1.403123

Multiple Linear Regression Summary:

Model: sbp ~ gestage + apgar5

Term                   Coefficient   95% CI            p-value
Intercept              9.80          [-15.33, 34.94]   0.441
Gestational Age        1.18          [0.31, 2.06]      0.009
5-Minute Apgar Score   0.49          [-0.43, 1.40]     0.293

Interpretation

Gestational age: each additional week of gestational age is associated with an estimated 1.18 mmHg increase in mean systolic blood pressure, holding five-minute Apgar score constant (p = 0.009).

Five-minute apgar score: each one-point increase in the five-minute Apgar score is associated with an estimated 0.49 mmHg increase in mean systolic blood pressure, holding gestational age constant; this effect is not statistically significant (p = 0.293).

c. What is the estimated mean systolic blood pressure for the population of low birth weight infants whose gestational age is 31 weeks and whose five-minute apgar score is 7?
# Fit the linear model: SBP ~ gestational age + apgar5
model <- lm(sbp ~ gestage + apgar5, data = lowbwt)

# Create new data for prediction
new_data <- data.frame(gestage = 31, apgar5 = 7)

# Predict mean SBP
predicted_sbp <- predict(model, newdata = new_data)
cat("The estimated mean systolic blood pressure for the population of low birth weight infants whose gestational age is 31 weeks and whose five-minute apgar score is 7:",
    round(predicted_sbp, 2), "mmHg\n")
Output: The estimated mean systolic blood pressure for the population of low birth weight infants whose gestational age is 31 weeks and whose five-minute apgar score is 7: 49.95 mmHg

Conclusion

The estimated mean systolic blood pressure for the population of low birth weight infants whose gestational age is 31 weeks and whose five-minute apgar score is 7 is 49.945 mmHg.
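Equivalently, by hand with the rounded coefficients (small rounding differences from the R output are expected):

\({\hat{y} = 9.8034 + 1.1848(31) + 0.4875(7) = 9.8034 + 36.7288 + 3.4125 \approx 49.94}\)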

d. Test the null hypothesis that the coefficient for the five-minute apgar score is 0 at the 0.05 level of significance. What is the p-value? What do you conclude?
# Fit the multiple linear regression model
model_multi <- lm(sbp ~ gestage + apgar5, data = lowbwt)

# Display model summary
summary(model_multi)
Output: 
Output: Call:
Output: lm(formula = sbp ~ gestage + apgar5, data = lowbwt)
Output: 
Output: Residuals:
Output:     Min      1Q  Median      3Q     Max 
Output: -22.374  -8.180  -1.088   4.985  39.424 
Output: 
Output: Coefficients:
Output:             Estimate Std. Error t value Pr(>|t|)   
Output: (Intercept)   9.8034    12.6629   0.774   0.4407   
Output: gestage       1.1848     0.4424   2.678   0.0087 **
Output: apgar5        0.4875     0.4613   1.057   0.2932   
Output: ---
Output: Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Output: 
Output: Residual standard error: 10.99 on 97 degrees of freedom
Output: Multiple R-squared:  0.08944,   Adjusted R-squared:  0.07066 
Output: F-statistic: 4.764 on 2 and 97 DF,  p-value: 0.01063
# Extract the p-value for apgar5
pval_apgar5 <- summary(model_multi)$coefficients["apgar5", "Pr(>|t|)"]
cat("P-value for Apgar5 coefficient:", pval_apgar5, "\n")
Output: P-value for Apgar5 coefficient: 0.2932439
# Decision rule at alpha = 0.05
if (pval_apgar5 < 0.05) {
    cat("Reject the null hypothesis: Apgar score is a significant predictor of SBP.\n")
} else {
    cat("Fail to reject the null hypothesis: No significant relationship between Apgar score and SBP.\n")
}
Output: Fail to reject the null hypothesis: No significant relationship between Apgar score and SBP.

Hypothesis Test for Apgar Score Coefficient

H0: The coefficient for five-minute Apgar score is 0 (i.e., it has no effect on systolic blood pressure).

HA: The coefficient is not 0 (i.e., Apgar score affects SBP).

Result

p-value = 0.293. Since this is greater than the significance level of 0.05, we fail to reject the null hypothesis.

Interpretation

At the 0.05 significance level, there is no statistically significant association between five-minute Apgar score and systolic blood pressure, after adjusting for gestational age. The test statistic used is:

\({{t_{\text{apgar5}} = \frac{\hat{\beta}_{\text{apgar5}}}{\text{SE}(\hat{\beta}_{\text{apgar5}})}}}\)

Where:

\({{\hat{\beta}_{\text{apgar5}}}}\) is the estimated regression coefficient for the five-minute Apgar score.

\({{\text{SE}(\hat{\beta}_{\text{apgar5}})}}\) is the standard error of that estimate.

\({{t_{\text{apgar5}}}}\) is the corresponding t-statistic used to test whether apgar5 is significantly associated with SBP.

\({{t_{\text{apgar5}} = \frac{0.4875}{0.4613} = 1.057}}\)

Conclusion

There is not enough evidence to conclude that five-minute Apgar score is significantly associated with systolic blood pressure after adjusting for gestational age.

e. Comment on the magnitude of \({R^2}\). Does the inclusion of five-minute apgar score in the model already containing gestational age improve your ability to predict systolic blood pressure?

Model Predictive Power

The Multiple R-squared value of 0.089 indicates that the model explains only about 8.9% of the variability in systolic blood pressure. The Adjusted R-squared, which accounts for the number of predictors in the model, is even lower at 7.1%. This suggests that the model has limited explanatory power overall.

When comparing this to a simpler model with gestational age alone, we would expect a small increase in R² from adding five-minute Apgar score as a second predictor. However, the p-value for Apgar score is not statistically significant (p = 0.293), and the marginal gain in R² is minimal.

Conclusion

The inclusion of five-minute Apgar score in a model already containing gestational age does not meaningfully improve the prediction of systolic blood pressure. The modest increase in R² is not enough to justify its inclusion as a valuable predictor in this context.
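To quantify the marginal gain directly, a short sketch (assuming lowbwt is loaded) compares the gestational-age-only model with the full model; for a single added predictor, the partial F-test p-value equals the t-test p-value above (about 0.293):

# Nested-model comparison: does apgar5 add anything beyond gestage?
model_gestage <- lm(sbp ~ gestage, data = lowbwt)
model_full <- lm(sbp ~ gestage + apgar5, data = lowbwt)
summary(model_gestage)$r.squared  # R-squared with gestational age alone
summary(model_full)$r.squared     # 0.0894 with apgar5 added
anova(model_gestage, model_full)  # partial F-test for adding apgar5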

Question 4

Refer to the cdi data set. You have been asked to evaluate two alternative models for predicting the number of active physicians (physicians, your y variable) in a CDI. Proposed model I includes as predictor variables total population (population), land area (area), and total personal income (total). Proposed model II includes as predictor variables population density (total population divided by land area), percent of population greater than 64 years old (pct65), and total personal income (total).

a. Prepare plots for each of the predictor variables. What noteworthy information is provided by your plots?
# Create population density variable
cdi$pop_density <- cdi$population/cdi$area

# -------------------------------------------- Part (a): Plot
# distributions of predictor variables
# --------------------------------------------

# Model I predictors: population, area, total
model1_vars <- c("population", "area", "total")
# Model II predictors: pop_density, pct65, total
model2_vars <- c("pop_density", "pct65", "total")

# Set layout to 1 row, 3 columns
par(mfrow = c(1, 3))

# For Model I: Population Vs. Physicians
plot(cdi$population, cdi$physicians, main = "Population vs. Physicians",
    xlab = "Population", ylab = "Physicians", col = "limegreen", pch = 16)
abline(lm(physicians ~ population, data = cdi), col = "black", lwd = 2)

# Area Vs. Physicians
plot(cdi$area, cdi$physicians, main = "Area vs. Physicians", xlab = "Area",
    ylab = "Physicians", col = "orange", pch = 16)
abline(lm(physicians ~ area, data = cdi), col = "black", lwd = 2)

# Total Income Vs. Physicians
plot(cdi$total, cdi$physicians, main = "Income vs. Physicians", xlab = "Income",
    ylab = "Physicians", col = "purple", pch = 16)
abline(lm(physicians ~ total, data = cdi), col = "black", lwd = 2)

# Reset layout
par(mfrow = c(1, 1))

# Set layout to 1 row, 3 columns
par(mfrow = c(1, 3))

# For Model II: Population Density vs. Physicians
plot(cdi$pop_density, cdi$physicians, main = "Pop Density vs. Physicians",
    xlab = "Pop Density", ylab = "Physicians", col = "turquoise", pch = 16)
abline(lm(physicians ~ pop_density, data = cdi), col = "black", lwd = 2)

# Percent Age 65+ vs. Physicians
plot(cdi$pct65, cdi$physicians, main = "Percent Age 65+ vs. Physicians",
    xlab = "Percent Age 65+", ylab = "Physicians", col = "magenta", pch = 16)
abline(lm(physicians ~ pct65, data = cdi), col = "black", lwd = 2)

# Total Personal Income Vs. Physicians
plot(cdi$total, cdi$physicians, main = "Income vs. Physicians", xlab = "Income",
    ylab = "Physicians", col = "dodgerblue", pch = 16)
abline(lm(physicians ~ total, data = cdi), col = "black", lwd = 2)

# Reset layout
par(mfrow = c(1, 1))

Part 1: Plots of Predictor Variables

Model I: Predictors - population, area, total

• Population: Highly right-skewed; a small number of counties have very large populations.

• Area: Very right-skewed; most counties are small in land area, with a few being very large.

• Total Personal Income: Also skewed to the right; matches the population distribution as expected since more populous counties will typically have higher total income.

Model II: Predictors - pop_density, pct65, total

• Population Density: Extreme right skew; indicates a few very densely populated counties.

• Percent ≥ 65 (pct65): Relatively normally distributed, some slight left skew.

• Total Personal Income: Same interpretation as in Model I.

The histograms of these predictors (sketched below) reveal non-normal distributions and potential outliers, which may affect model assumptions and influence the choice of transformation or robust methods if needed.
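The histogram code itself does not appear above, so here is a minimal sketch that would produce the distributions just described (assuming cdi is loaded and pop_density has been created):

# Histograms of the predictors for both proposed models
par(mfrow = c(2, 3))
hist(cdi$population, main = "Population", xlab = "Population")
hist(cdi$area, main = "Land Area", xlab = "Area")
hist(cdi$total, main = "Total Personal Income", xlab = "Income")
hist(cdi$pop_density, main = "Population Density", xlab = "Pop Density")
hist(cdi$pct65, main = "Percent Age 65+", xlab = "Percent 65+")
par(mfrow = c(1, 1))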

Next, I will compute and display scatterplot matrices and correlation matrices to evaluate linear relationships and multicollinearity between predictors. 

b. Obtain the scatter plot matrix and the correlation matrix for each proposed model. Summarize the information provided.
# -------------------------------------------- Part (b): Scatterplot
# matrix and correlation matrix
# --------------------------------------------

# Model I scatterplot matrix
model1_data <- cdi[, c(model1_vars, "physicians")]
ggpairs(model1_data, title = "Model I: Scatterplot Matrix")

# Model II scatterplot matrix
model2_data <- cdi[, c(model2_vars, "physicians")]
ggpairs(model2_data, title = "Model II: Scatterplot Matrix")

# Model I Correlation
cor_model1 <- round(cor(model1_data), 2)
print("Correlation matrix for Model I:")
Output: [1] "Correlation matrix for Model I:"
print(cor_model1)
Output:            population area total physicians
Output: population       1.00 0.17  0.99       0.94
Output: area             0.17 1.00  0.13       0.08
Output: total            0.99 0.13  1.00       0.95
Output: physicians       0.94 0.08  0.95       1.00
# Model II Correlation
cor_model2 <- round(cor(model2_data), 2)
print("Correlation matrix for Model II:")
Output: [1] "Correlation matrix for Model II:"
print(cor_model2)
Output:             pop_density pct65 total physicians
Output: pop_density        1.00  0.03  0.32       0.41
Output: pct65              0.03  1.00 -0.02       0.00
Output: total              0.32 -0.02  1.00       0.95
Output: physicians         0.41  0.00  0.95       1.00

Part 2: Correlation and Scatterplot Analysis

Model I: Predictors – population, area, total

• Scatterplot Matrix Insights:

• Strong positive association between population and total income.

• physicians increases with both population and total; since those two predictors are themselves nearly collinear, multicollinearity is a concern.

• area shows weak correlation with all variables including physicians.

• Correlation Matrix:

• High correlation between population and total (0.99) suggests multicollinearity.

• Strong correlation between population and physicians (0.94), and between total and physicians (0.95).

• area has negligible correlation with physicians (0.08).

Model II: Predictors – pop_density, pct65, total

• Scatterplot Matrix Insights:

• Moderate positive relationship between pop_density and physicians.

• Very strong correlation between total income and physicians.

• Little or no correlation involving pct65.

• Correlation Matrix:

• total and physicians: very strong correlation (0.95).

• pop_density and physicians: moderate correlation (0.41).

• pct65 has near-zero correlation with physicians and other predictors.

Conclusion

• Multicollinearity is likely a concern in Model I due to the high correlation between population and total.

• Model II has lower multicollinearity, but also includes a weak predictor (pct65).

Next, I will fit both regression models and calculate R² and adjusted R² to compare predictive performance. 

c. For each proposed model, fit the first-order regression model with three predictor variables (Model I: physicians ~ population + area + total; Model II: physicians ~ pop_density + pct65 + total).
# -------------------------------------------- Part (c): Fit
# first-order regression models
# --------------------------------------------

# Model I: physicians ~ population + area + total
model1 <- lm(physicians ~ population + area + total, data = cdi)

# Model II: physicians ~ pop_density + pct65 + total
model2 <- lm(physicians ~ pop_density + pct65 + total, data = cdi)

summary(model1)
Output: 
Output: Call:
Output: lm(formula = physicians ~ population + area + total, data = cdi)
Output: 
Output: Residuals:
Output:     Min      1Q  Median      3Q     Max 
Output: -1855.6  -215.2   -74.6    79.0  3689.0 
Output: 
Output: Coefficients:
Output:               Estimate Std. Error t value Pr(>|t|)    
Output: (Intercept) -1.332e+01  3.537e+01  -0.377 0.706719    
Output: population   8.366e-04  2.867e-04   2.918 0.003701 ** 
Output: area        -6.552e-02  1.821e-02  -3.597 0.000358 ***
Output: total        9.413e-02  1.330e-02   7.078 5.89e-12 ***
Output: ---
Output: Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Output: 
Output: Residual standard error: 560.4 on 436 degrees of freedom
Output: Multiple R-squared:  0.9026,    Adjusted R-squared:  0.902 
Output: F-statistic:  1347 on 3 and 436 DF,  p-value: < 2.2e-16
summary(model2)
Output: 
Output: Call:
Output: lm(formula = physicians ~ pop_density + pct65 + total, data = cdi)
Output: 
Output: Residuals:
Output:      Min       1Q   Median       3Q      Max 
Output: -3055.75  -175.30   -38.05    72.88  3045.81 
Output: 
Output: Coefficients:
Output:               Estimate Std. Error t value Pr(>|t|)    
Output: (Intercept) -1.706e+02  8.353e+01  -2.042   0.0418 *  
Output: pop_density  9.616e-02  1.224e-02   7.857  3.1e-14 ***
Output: pct65        6.340e+00  6.384e+00   0.993   0.3212    
Output: total        1.266e-01  2.084e-03  60.723  < 2e-16 ***
Output: ---
Output: Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Output: 
Output: Residual standard error: 533.5 on 436 degrees of freedom
Output: Multiple R-squared:  0.9117,    Adjusted R-squared:  0.9111 
Output: F-statistic:  1501 on 3 and 436 DF,  p-value: < 2.2e-16

Interpretation & Analysis

Model I: physicians ~ population + area + total. All three predictors are statistically significant at the 0.05 level; the number of physicians increases with population and total income, while larger land area is associated with fewer physicians (R² = 0.903).

Model II: physicians ~ pop_density + pct65 + total. Population density and total income are strong, statistically significant predictors, while pct65 is not (p = 0.321); R² = 0.912.

d. Calculate \({R^2}\) for each model. Is one model clearly preferable in terms of this measure?
# -------------------------------------------- Part (d): Compare R²
# and Adjusted R² --------------------------------------------

cat("Model I R-squared:", summary(model1)$r.squared, "\n")
Output: Model I R-squared: 0.9026432
cat("Model I Adjusted R-squared:", summary(model1)$adj.r.squared, "\n")
Output: Model I Adjusted R-squared: 0.9019733
cat("Model I AIC:", AIC(model1), "\n\n")
Output: Model I AIC: 6823.792
cat("Model II R-squared:", summary(model2)$r.squared, "\n")
Output: Model II R-squared: 0.9117491
cat("Model II Adjusted R-squared:", summary(model2)$adj.r.squared, "\n")
Output: Model II Adjusted R-squared: 0.9111419
cat("Model II AIC:", AIC(model2), "\n\n")
Output: Model II AIC: 6780.585
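The chunk above prints AIC but not BIC, which the comparison table below also lists; a two-line sketch (using stats::BIC on the two fitted models) would produce those values:

cat("Model I BIC:", BIC(model1), "\n")
cat("Model II BIC:", BIC(model2), "\n")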

Model Fitting and R² Comparison

Model      R²       Adjusted R²   AIC       BIC
Model I    0.9026   0.9020        6821.79   6838.14
Model II   0.9117   0.9111        6778.58   6794.93

Variance Inflation Factors
# -------------------------------------------- Additional: VIF to
# assess multicollinearity
# --------------------------------------------

cat("VIFs (Varianace Inflation Factors) for Model I:\n")
Output: VIFs (Varianace Inflation Factors) for Model I:
print(vif(model1))
Output: population       area      total 
Output:  41.636907   1.114256  41.052464
cat("\nVIFs (C=Variance Inflation Factors) for Model II:\n")
Output: 
Output: VIFs (C=Variance Inflation Factors) for Model II:
print(vif(model2))
Output: pop_density       pct65       total 
Output:    1.112730    1.001991    1.112357

Conclusion

Model II is preferable: it has the higher R² (0.912 vs 0.903) and adjusted R², the lower AIC and BIC, and far less multicollinearity (all VIFs near 1, versus VIFs above 40 for population and total in Model I). On R² alone the difference is modest, so the multicollinearity diagnostics are what make the choice clear.

Question 5

Refer to the cdi data set.

a. For each geographic region, regress the number of serious crimes in a CDI (crime, your y variable) against population density (total population divided by land area), per capita personal income (income), and percent high school graduates (graduates). State the estimated regression function.
# Create population density variable
cdi$pop_density <- cdi$population/cdi$area

# Initialize list to store results
region_results <- list()

# Loop through each region and fit linear model.  Note: the loop variable must
# differ from the column name; subset(cdi, region == region) compares the
# column to itself and keeps every row, so all four "regional" fits would be
# identical.
for (r in sort(unique(cdi$region))) {
    region_data <- cdi[cdi$region == r, ]

    # Fit model: crime ~ pop_density + income + graduates
    model <- lm(crime ~ pop_density + income + graduates, data = region_data)

    # Store tidy summary and performance metrics
    region_results[[as.character(r)]] <- list(model = model, coefficients = tidy(model,
        conf.int = TRUE), summary = glance(model))
}

# Display results for each region
for (r in names(region_results)) {
    cat(paste0("\n\n======================\nRegion ", r, "\n======================\n"))
    print(region_results[[r]]$coefficients)
    cat("\nModel Summary:\n")
    print(region_results[[r]]$summary)
}
Output: (Coefficient tables and model summaries print here for Regions 1 through 4;
Output: the per-region estimates are summarized region by region below.)

Regression Equation

\({\text{crime} = \beta_0 + \beta_1 \cdot \text{pop\_density} + \beta_2 \cdot \text{income} + \beta_3 \cdot \text{graduates} + \varepsilon}\)


Estimated Regression Functions by Region

We regressed crime on:

• pop_density (population / area)

• income (per capita personal income)

• graduates (percent high school graduates)

for each of the four regions (labeled 1 through 4). Each region has its own estimated coefficients, summarized below.

Region 1

Estimated Regression Equation

\({\widehat{\text{crime}} = -64{,}466 + 17.38 \cdot \text{pop\_density} - 1.41 \cdot \text{income} + 1182.58 \cdot \text{graduates}}\)

Coefficient Summary:

Predictor     Estimate   95% CI               p-value   Interpretation
Intercept     -64,466    [-148,915, 19,983]   0.133     Not significant
pop_density   17.38      [15.73, 19.04]       < 0.001   Strong positive effect
income        -1.41      [-2.92, 0.10]        0.067     Marginally non-significant
graduates     1182.58    [-89.51, 2454.67]    0.068     Marginally non-significant

Model Fit

R² = 0.835 and adjusted R² = 0.830: the three predictors explain about 83% of the variation in crime in Region 1.

Interpretation

Each one-unit increase in population density is associated with about 17.4 additional serious crimes, holding income and education constant; income and percent high school graduates have only marginal, non-significant effects.

Conclusion for Region 1

Region 1 shows a very strong model, with over 83% of crime variation explained. Population density is the dominant predictor and highly statistically significant. Income and education show marginal effects but are not reliable predictors in this model.

Region 2

Estimated Regression Equation

\({\widehat{\text{crime}} = -4{,}163 + 33.62 \cdot \text{pop\_density} + 0.10 \cdot \text{income} - 2.76 \cdot \text{graduates}}\)

Coefficient Summary

Predictor     Estimate   95% CI               p-value   Interpretation
Intercept     -4,163     [-103,232, 94,906]   0.934     Not meaningful
pop_density   33.62      [25.97, 41.26]       < 0.001   Significant, strongly positive
income        0.10       [-3.21, 3.42]        0.951     Not significant
graduates     -2.76      [-1,608, 1,602]      0.997     Not significant

Model Fit

R² = 0.529 and adjusted R² = 0.515: a moderate fit, driven almost entirely by population density.

Interpretation

Each one-unit increase in population density is associated with about 33.6 additional serious crimes, adjusting for income and education; neither income nor percent high school graduates shows any association.

Conclusion for Region 2

The crime rate in Region 2 is strongly associated with population density, but not with income or education levels. The model is statistically sound overall (F-test), but only one variable (density) is driving this performance. This supports population density as a consistent crime predictor in urban/suburban settings.

Region 3

Estimated Regression Equation

\({\widehat{\text{crime}} = 38{,}863 + 5.54 \cdot \text{pop\_density} + 1.96 \cdot \text{income} - 670.88 \cdot \text{graduates}}\)

Coefficient Summary

Predictor     Estimate   95% CI                p-value   Interpretation
Intercept     38,863     [-21,731, 99,456]     0.207     Not significant
pop_density   5.54       [0.84, 10.23]         0.021     Significant and positive
income        1.96       [-0.21, 4.12]         0.076     Marginally non-significant
graduates     -670.88    [-1681.62, 339.85]    0.192     Not significant

Model Fit

R² = 0.093 and adjusted R² = 0.074: the predictors explain less than 10% of the variation in crime.

Interpretation

Population density remains positively associated with crime, but far more weakly than in Regions 1 and 2; income is marginal and education is not significant.

Conclusion for Region 3

The model is statistically significant overall but explains less than 10% of crime variation, indicating a poor fit. While population density remains a consistent predictor, income and education do not significantly contribute in this region.

Region 4

Estimated Regression Equation

\({\widehat{\text{crime}} = 129{,}323 + 5.72 \cdot \text{pop\_density} + 4.34 \cdot \text{income} - 2159.92 \cdot \text{graduates}}\)

Coefficient Summary

Predictor     Estimate    95% CI                p-value   Interpretation
Intercept     129,323     [-61,119, 319,766]    0.180     Not significant
pop_density   5.72        [-5.79, 17.22]        0.325     Not significant
income        4.34        [-1.10, 9.79]         0.116     Not significant
graduates     -2159.92    [-4857.60, 537.76]    0.115     Not significant

Model Fit

R² = 0.087 and adjusted R² = 0.049: the weakest fit of the four regions.

Interpretation

No predictor is individually significant in Region 4.

Conclusion for Region 4

The model for Region 4 is not statistically significant and explains less than 9% of the variation in crime. None of the predictors — population density, income, or percent high school graduates — are reliable indicators of serious crime levels in this region.

b. Are the estimated regression functions for the four regions similar? Discuss.
# Initialize list to store results
all_results <- list()

# Loop through each region
for (r in sort(unique(cdi$region))) {
    region_data <- filter(cdi, region == r)

    # Fit model
    model <- lm(crime ~ pop_density + income + graduates, data = region_data)

    # Extract coefficients and model summary
    coef_tbl <- tidy(model, conf.int = TRUE)
    model_stats <- glance(model)

    # Merge into a tidy summary
    coef_tbl <- coef_tbl %>%
        mutate(Region = r, r_squared = model_stats$r.squared, adj_r_squared = model_stats$adj.r.squared,
            sigma = model_stats$sigma) %>%
        select(Region, term, estimate, conf.low, conf.high, p.value, r_squared,
            adj_r_squared, sigma)

    all_results[[r]] <- coef_tbl
}

# Combine all regions into one data frame
region_comparison <- bind_rows(all_results)

# View result
print(region_comparison)
Output: # A tibble: 16 × 9
Output:    Region term      estimate conf.low conf.high  p.value r_squared adj_r_squared
Output:     <int> <chr>        <dbl>    <dbl>     <dbl>    <dbl>     <dbl>         <dbl>
Output:  1      1 (Interce… -6.45e+4 -1.49e+5   2.00e+4 1.33e- 1    0.835         0.830 
Output:  2      1 pop_dens…  1.74e+1  1.57e+1   1.90e+1 5.28e-38    0.835         0.830 
Output:  3      1 income    -1.41e+0 -2.92e+0   1.03e-1 6.74e- 2    0.835         0.830 
Output:  4      1 graduates  1.18e+3 -8.95e+1   2.45e+3 6.81e- 2    0.835         0.830 
Output:  5      2 (Interce… -4.16e+3 -1.03e+5   9.49e+4 9.34e- 1    0.529         0.515 
Output:  6      2 pop_dens…  3.36e+1  2.60e+1   4.13e+1 4.79e-14    0.529         0.515 
Output:  7      2 income     1.02e-1 -3.21e+0   3.42e+0 9.51e- 1    0.529         0.515 
Output:  8      2 graduates -2.76e+0 -1.61e+3   1.60e+3 9.97e- 1    0.529         0.515 
Output:  9      3 (Interce…  3.89e+4 -2.17e+4   9.95e+4 2.07e- 1    0.0925        0.0741
Output: 10      3 pop_dens…  5.54e+0  8.39e-1   1.02e+1 2.12e- 2    0.0925        0.0741
Output: 11      3 income     1.96e+0 -2.08e-1   4.12e+0 7.61e- 2    0.0925        0.0741
Output: 12      3 graduates -6.71e+2 -1.68e+3   3.40e+2 1.92e- 1    0.0925        0.0741
Output: 13      4 (Interce…  1.29e+5 -6.11e+4   3.20e+5 1.80e- 1    0.0867        0.0491
Output: 14      4 pop_dens…  5.72e+0 -5.79e+0   1.72e+1 3.25e- 1    0.0867        0.0491
Output: 15      4 income     4.34e+0 -1.10e+0   9.79e+0 1.16e- 1    0.0867        0.0491
Output: 16      4 graduates -2.16e+3 -4.86e+3   5.38e+2 1.15e- 1    0.0867        0.0491
Output: # ℹ 1 more variable: sigma <dbl>

Let’s compare the R² values and coefficients across regions:

Region   R²      MSE        Notes
1        0.835   7.87e+08   Strong model fit; predictors explain most crime variance
2        0.529   1.09e+09   Moderate fit
3        0.093   1.37e+09   Weak explanatory power
4        0.087   6.69e+09   Very weak model fit

This suggests that the relationship between crime and the predictors differs notably across regions, both in strength and possibly in direction (based on coefficient values, which we can detail further).
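One way to formalize this comparison (a sketch, assuming cdi and pop_density as created above) is to test whether region-specific coefficients improve on a single pooled regression:

# Pooled model vs. model with region interactions
pooled <- lm(crime ~ pop_density + income + graduates, data = cdi)
by_region <- lm(crime ~ (pop_density + income + graduates) * factor(region),
    data = cdi)
anova(pooled, by_region)  # a small p-value means the functions differ by region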

c. Calculate MSE and \({R^2}\) for each region. Are these measures similar for the four regions? Discuss.
# Print coefficients for each region
cat("\nEstimated Regression Equations by Region:\n")
Output: 
Output: Estimated Regression Equations by Region:
for (r in names(region_results)) {
    cat(paste0("Region ", r, ":\n"))
    # The list element is named "coefficients" (lowercase); the original
    # $Coefficients returned NULL
    print(region_results[[r]]$coefficients)
    cat("\n")
}
Output: (A tidy coefficient table prints here for each of Regions 1 through 4;
Output: the values match the comparison table above.)
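Since this part asks for MSE and R² specifically, a small sketch (reusing the glance() summaries already stored in region_results; MSE taken as the squared residual standard error, sigma^2) prints them per region:

# Extract R-squared and MSE for each region's model
for (r in names(region_results)) {
    s <- region_results[[r]]$summary  # glance() output stored earlier
    cat(sprintf("Region %s: R-squared = %.3f, MSE = %.3g\n",
        r, s$r.squared, s$sigma^2))
}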

Interpretation

Model Summary

• R²: Indicates how much of the variance in crime is explained. R² > 0.7 = strong model; R² < 0.2 = weak.

• Adjusted R²: Penalized for number of predictors — useful when comparing regions.

• p-value (F-statistic): Shows whether the model overall is statistically significant.

• AIC: Helps compare model fit — lower is better.

Cross-Region Comparison

• Region 1: Strongest relationship (R² = 0.835); population density is a highly significant predictor.

• Region 2: Moderate explanatory power (R² = 0.529).

• Regions 3 & 4: Weak fit (R² below 0.10); the predictors do not explain crime well in these regions.

MSE and R² Summary

• R² values vary widely: from about 0.09 in Region 4 to 0.84 in Region 1.

• MSE increases substantially in Region 4, indicating greater unexplained variance and likely more variability or larger outliers.

Conclusion

The estimated regression functions are not similar across regions. Region 1 shows a strong and reliable model, while Regions 3 and 4 show very poor fits, suggesting that the same predictors do not perform equally well across geographies.