Instructions
Due by end of Module 4.
Set up environment
For knitting, global chunk options are set in an echo=FALSE environment-setup chunk. The random seed is set globally, and required packages are installed and their libraries loaded.
Question 1
Consider the multiple linear regression model:
\({ \mu = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p }\)
Show that \(\beta_j\) gives the difference in the expected response for a one-unit increase in \(x_j\), no matter what the values of \(x_j\) and the other predictors. Hint: Compare the expected response when \(x_j = x\) and \(x_j = x + 1\).
When \({x_j = x:}\)
\({\mu_1 = \beta_0 + \beta_1 x_1 + \cdots + \beta_j x + \cdots + \beta_p x_p}\)
When \({x_j = x + 1:}\)
\({\mu_2 = \beta_0 + \beta_1 x_1 + \cdots + \beta_j (x+1) + \cdots + \beta_p x_p}\)
\({\mu_2 = \beta_0 + \beta_1 x_1 + \cdots + \beta_j x + \beta_j \cdot 1 + \cdots + \beta_p x_p}\)
Take the difference:
\({\mu_2 - \mu_1 = [\beta_j x + \beta_j \cdot 1] - [\beta_j x] = \beta_j}\)
Therefore, \({\mu_2 - \mu_1 = \beta_j}\)
Interpretation
Statistical Implication
In multiple regression, each coefficient isolates the effect of its corresponding variable net of (i.e., controlling for) the other variables.
This differs from simple regression, where the coefficient reflects both direct and indirect associations.
Conclusion
The coefficient \({\beta_j}\) gives the partial effect of \({x_j}\) on the expected value of the response variable. It quantifies how much the response is expected to change per unit increase in \({x_j}\), assuming all other variables remain fixed.
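A minimal numeric sketch (with hypothetical coefficients, not from any fitted model) illustrates the result: bumping one predictor by a unit shifts the fitted mean by exactly that predictor's coefficient, whatever the other values are.

# Numeric check with hypothetical coefficients
beta0 <- 1; beta1 <- 2; beta2 <- -0.5
mu <- function(x1, x2) beta0 + beta1 * x1 + beta2 * x2
mu(4, 7) - mu(3, 7)  # returns 2, i.e., beta1, for any choice of x1 and x2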
Question 2
In a study designed to examine the effects of adding oats to the typical American diet, individuals were randomly divided into two groups. Twice a day, the first group substituted oats for other foods containing carbohydrates. The members of the second group did not make any changes to their diet. One outcome of interest is the serum cholesterol level of each individual eight weeks after the start of the study. Explanatory variables that might be associated with this response include diet group, serum cholesterol level at the start of the study, body mass index, and sex. The estimated coefficients and standard errors from a multiple linear regression model based on a sample of size 76 and containing four explanatory variables are in the following table.
| Variable | Coefficient | Standard Error |
|---|---|---|
| Diet group | -11.25 | 4.33 |
| Baseline cholesterol | 0.85 | 0.07 |
| Body mass index | 0.23 | 0.65 |
| Sex | -3.02 | 4.42 |
# Regression results
coefs <- c(DietGroup = -11.25, BaselineChol = 0.85, BMI = 0.23, Sex = -3.02)
ses <- c(DietGroup = 4.33, BaselineChol = 0.07, BMI = 0.65, Sex = 4.42)
n <- 76
p <- 4
df <- n - p - 1
# Compute test statistics
t_stats <- coefs/ses
p_vals <- 2 * pt(-abs(t_stats), df = df)
# Create results data frame
results <- data.frame(Coefficient = coefs, `Standard Error` = ses, `t-value` = t_stats,
`p-value` = p_vals, Significant = p_vals < 0.05, check.names = FALSE)
# Format values for output
results_fmt <- format(results, digits = 4)
# Print table with title above header
kable(results_fmt, align = "c", escape = FALSE) %>%
kable_styling(font_size = 12) %>%
add_header_above(c(`Hypothesis Testing Summary` = 6), bold = TRUE) %>%
row_spec(0, bold = TRUE)
| | Coefficient | Standard Error | t-value | p-value | Significant |
|---|---|---|---|---|---|
| DietGroup | -11.25 | 4.33 | -2.5982 | 1.139e-02 | TRUE |
| BaselineChol | 0.85 | 0.07 | 12.1429 | 5.371e-19 | TRUE |
| BMI | 0.23 | 0.65 | 0.3538 | 7.245e-01 | FALSE |
| Sex | -3.02 | 4.42 | -0.6833 | 4.967e-01 | FALSE |
Hypothesis Testing for Each Coefficient
Null Hypothesis (H₀): β = 0 (no effect)
Alternative Hypothesis (HA): β ≠ 0 (significant effect)
We compute:
\({{t = \frac{\text{Coefficient}}{\text{Standard Error}} \quad \text{and} \quad p = 2 \cdot P(T_{71} > |t|)}}\)
Interpretation:
Diet group and baseline cholesterol are significantly associated with serum cholesterol at 8 weeks.
BMI and sex do not show statistically significant effects, adjusting for the other variables.
# b) Probability distribution of the test statistics
cat("\nEach test statistic follows a t-distribution with", df, "degrees of freedom.\n\n")
Output:
Output: Each test statistic follows a t-distribution with 71 degrees of freedom.
Probability Distribution of Test Statistics
Each of the test statistics follows a t-distribution with \({{df = n - p - 1 = 76 - 4 - 1 = 71}}\) degrees of freedom.
This is appropriate for inference in a multiple linear regression model where the errors are assumed normally distributed.
# c) Effect of 1 unit increase in BMI
delta1 <- coefs["BMI"] * 1
cat("For a 1 unit increase in BMI, serum cholesterol changes by", delta1,
"units (mg/dL).\n\n")
Output: For a 1 unit increase in BMI, serum cholesterol changes by 0.23 units (mg/dL).
Effect of 1-unit Increase in BMI
• Coefficient for BMI = 0.23
• A 1 kg/m² increase in BMI → predicted increase in serum cholesterol = 0.23 mg/dL
Interpretation
Holding all else constant, a 1-unit increase in BMI is associated with an average increase of 0.23 mg/dL in serum cholesterol after 8 weeks. However, this effect is not statistically significant (p = 0.724), so it may be due to random variation.
# d) Effect of 10 unit increase in BMI
delta10 <- coefs["BMI"] * 10
cat("For a 10 unit increase in BMI, serum cholesterol changes by", delta10,
"units (mg/dL).\n\n")
Output: For a 10 unit increase in BMI, serum cholesterol changes by 2.3 units (mg/dL).
Effect of 10-unit Increase in BMI
• Multiply the coefficient by 10: \({{10 \times 0.23 = 2.3}}\)
Interpretation
A 10-unit increase in BMI is associated with an average increase of 2.3 mg/dL in serum cholesterol, holding other variables constant. Again, this effect is not statistically significant and should be interpreted with caution.
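To quantify that caution, note that the standard error scales by the same factor of 10, giving a wide 95% confidence interval. A sketch using the quantities defined earlier (not part of the original output):

# 95% CI for the 10-unit BMI effect: 10*estimate +/- t critical value * 10*SE
delta10 <- 10 * coefs["BMI"]  # 2.3
se10 <- 10 * ses["BMI"]  # 6.5
delta10 + c(-1, 1) * qt(0.975, df) * se10  # roughly (-10.7, 15.3); spans 0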
e. The variable sex is coded so that 1 represents a male and 0 a female. Who is more likely to have a higher serum cholesterol level eight weeks after the start of the study, a male or a female? By how much on average?
# e) Sex effect interpretation
direction <- if (coefs["Sex"] < 0) "lower" else "higher"
sex_result <- paste0("Males have on average ", abs(coefs["Sex"]), " units ",
    direction, " serum cholesterol than females, adjusting for other variables.")
cat(sex_result, "\n")
Output: Males have on average 3.02 units lower serum cholesterol than females, adjusting for other variables.
Effect of Sex on Cholesterol
• Sex is coded: 1 = male, 0 = female
• Coefficient = -3.02
Interpretation
Males (coded 1) have, on average, 3.02 mg/dL lower serum cholesterol than females after 8 weeks, adjusting for diet, baseline cholesterol, and BMI. However, this difference is not statistically significant (p = 0.497), meaning the evidence does not support a true sex-based difference in cholesterol outcomes.
Final Conclusions
1. Significant predictors of cholesterol level at 8 weeks:
• Diet group: the oats group lowered cholesterol by about 11.25 mg/dL (interval check below).
• Baseline cholesterol: strong positive relationship.
2. Non-significant predictors:
• BMI: weak positive trend, not significant.
• Sex: males showed slightly lower cholesterol, but not significantly so.
Adding oats to the diet has a statistically significant, and likely clinically meaningful, cholesterol-lowering effect. Initial cholesterol level is a strong predictor of the eight-week level. There is no clear evidence that BMI or sex influences the outcome in this study.
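As flagged above, a quick interval check for the diet-group effect supports the first conclusion. This is a sketch using the coefficients and degrees of freedom already defined, not part of the original output:

# 95% CI for the diet-group effect: estimate +/- t critical value * SE
coefs["DietGroup"] + c(-1, 1) * qt(0.975, df) * ses["DietGroup"]
# roughly -19.9 to -2.6 mg/dL; excludes 0, consistent with p = 0.011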
Question 3
For the population of low birth weight infants, a significant linear relationship was found to exist between systolic blood pressure and gestational age. The relevant data are in the file lowbwt. The measurements of systolic blood pressure are in the variable sbp, and the corresponding gestational ages are in gestage. Also in the data set is apgar5, the five-minute Apgar score for each infant.
Construct a two-way scatter plot of systolic blood pressure versus five-minute apgar score. Does there appear to be a linear relationship between these two variables?
# Fit linear model
model_simple <- lm(sbp ~ apgar5, data = lowbwt)
# Create new data frame for predictions
new_apgar <- data.frame(apgar5 = seq(min(lowbwt$apgar5), max(lowbwt$apgar5),
length.out = 100))
pred <- predict(model_simple, newdata = new_apgar, interval = "confidence",
level = 0.95)
# Plot
plot(lowbwt$apgar5, lowbwt$sbp, xlab = "Five-Minute Apgar Score", ylab = "Systolic Blood Pressure",
    main = "SBP vs Five-Minute Apgar Score", pch = 19, col = "blue")
# Draw the confidence band first so the fitted line stays visible on top of it
polygon(c(new_apgar$apgar5, rev(new_apgar$apgar5)), c(pred[, "lwr"], rev(pred[, "upr"])),
    col = rgb(1, 0, 0, 0.2), border = NA)
lines(new_apgar$apgar5, pred[, "fit"], col = "red", lwd = 2)
# Summary of the model
summary(model_simple)
Output:
Output: Call:
Output: lm(formula = sbp ~ apgar5, data = lowbwt)
Output:
Output: Residuals:
Output: Min 1Q Median 3Q Max
Output: -26.510 -6.406 0.246 4.943 39.397
Output:
Output: Coefficients:
Output: Estimate Std. Error t value Pr(>|t|)
Output: (Intercept) 42.7192 3.1410 13.601 <2e-16 ***
Output: apgar5 0.6977 0.4687 1.489 0.14
Output: ---
Output: Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Output:
Output: Residual standard error: 11.33 on 98 degrees of freedom
Output: Multiple R-squared: 0.02211, Adjusted R-squared: 0.01214
Output: F-statistic: 2.216 on 1 and 98 DF, p-value: 0.1398
# 95% Confidence Interval for coefficients
confint(model_simple, level = 0.95)
Output: 2.5 % 97.5 %
Output: (Intercept) 36.4860127 48.952311
Output: apgar5 -0.2323795 1.627848
# Tidy regression summary
tidy(model_simple, conf.int = TRUE)
Hypothesis Testing for Each Coefficient
Null Hypothesis (H₀): β = 0 (no effect)
Alternative Hypothesis (HA): β ≠ 0 (significant effect)
We compute: \({{t = \frac{\text{Coefficient}}{\text{Standard Error}} \quad \text{and} \quad p = 2 \cdot P(T_{98} > |t|)}}\)
(For this simple regression, the t-distribution has n - 2 = 98 degrees of freedom, matching the residual df in the output above.)
Interpretation
The scatter plot shows the relationship between systolic blood pressure and the five-minute Apgar score. A linear regression line (in red) and its 95% confidence band (shaded) are overlaid.
Slope (apgar5 coefficient): For each 1-point increase in five-minute Apgar score, systolic blood pressure increases on average by 0.70 mmHg, though this is not statistically significant (p = 0.140).
R² = 0.022: Only about 2.2% of the variance in systolic blood pressure is explained by Apgar score.
95% CI for slope: Ranges from -0.23 to 1.63, crossing zero → further evidence that the association may not be statistically meaningful.
Conclusion
Based on both the scatter plot and the regression results, there is no strong evidence of a linear relationship between systolic blood pressure and five-minute Apgar score in this sample (p = 0.140, 95% CI includes 0).
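An equivalent check: in simple linear regression, the test of the slope is identical to the test of the Pearson correlation, so cor.test should reproduce the same t statistic and p-value. A sketch, assuming lowbwt is loaded:

# Pearson correlation test for sbp vs. apgar5; equivalent to the slope t-test
cor.test(lowbwt$sbp, lowbwt$apgar5)  # r ≈ 0.149 (r^2 ≈ 0.022), p ≈ 0.14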
# Fit multiple linear regression model
model_multi <- lm(sbp ~ gestage + apgar5, data = lowbwt)
# View model summary
summary(model_multi)
Output:
Output: Call:
Output: lm(formula = sbp ~ gestage + apgar5, data = lowbwt)
Output:
Output: Residuals:
Output: Min 1Q Median 3Q Max
Output: -22.374 -8.180 -1.088 4.985 39.424
Output:
Output: Coefficients:
Output: Estimate Std. Error t value Pr(>|t|)
Output: (Intercept) 9.8034 12.6629 0.774 0.4407
Output: gestage 1.1848 0.4424 2.678 0.0087 **
Output: apgar5 0.4875 0.4613 1.057 0.2932
Output: ---
Output: Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Output:
Output: Residual standard error: 10.99 on 97 degrees of freedom
Output: Multiple R-squared: 0.08944, Adjusted R-squared: 0.07066
Output: F-statistic: 4.764 on 2 and 97 DF, p-value: 0.01063
# Get 95% confidence intervals for coefficients
confint(model_multi, level = 0.95)
Output: 2.5 % 97.5 %
Output: (Intercept) -15.3289936 34.935829
Output: gestage 0.3067380 2.062913
Output: apgar5 -0.4280932 1.403123
Multiple Linear Regression Summary:
Model: sbp ~ gestage + apgar5
| Term | Coefficient | 95% CI | p-value |
|---|---|---|---|
| Intercept | 9.80 | [-15.33, 34.94] | 0.441 |
| Gestational Age | 1.18 | [0.31, 2.06] | 0.009 |
| 5-Minute Apgar Score | 0.49 | [-0.43, 1.40] | 0.293 |
Interpretation
Gestational age
The estimated coefficient for gestational age is 1.185. This means that, holding the five-minute Apgar score constant, each additional week of gestational age is associated with an average increase of about 1.18 mmHg in systolic blood pressure.
This effect is statistically significant (p = 0.009), suggesting gestational age is a meaningful predictor of systolic blood pressure.
Five-minute apgar score
The estimated coefficient is 0.4875. This means that, holding gestational age constant, each one-point increase in the five-minute Apgar score is associated with an average increase of about 0.49 mmHg in systolic blood pressure.
This effect is not statistically significant (p = 0.293), so we cannot conclude a meaningful linear relationship between Apgar score and SBP after accounting for gestational age.
# Fit the linear model: SBP ~ gestational age + apgar5
model <- lm(sbp ~ gestage + apgar5, data = lowbwt)
# Create new data for prediction
new_data <- data.frame(gestage = 31, apgar5 = 7)
# Predict mean SBP
predicted_sbp <- predict(model, newdata = new_data)
cat("The estimated mean systolic blood pressure for the population of low birth weight infants whose gestational age is 31 weeks and whose five-minute apgar score is 7:",
round(predicted_sbp, 2), "mmHg\n")
Output: The estimated mean systolic blood pressure for the population of low birth weight infants whose gestational age is 31 weeks and whose five-minute apgar score is 7: 49.95 mmHg
Conclusion
The estimated mean systolic blood pressure for the population of low birth weight infants whose gestational age is 31 weeks and whose five-minute apgar score is 7 is 49.945 mmHg.
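The point estimate alone carries no measure of uncertainty; a confidence interval for this mean response comes from the same predict() call. A sketch (the interval bounds are not part of the original output):

# 95% CI for the mean SBP at gestage = 31 and apgar5 = 7
predict(model, newdata = new_data, interval = "confidence", level = 0.95)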
# Extract the p-value for apgar5
pval_apgar5 <- summary(model_multi)$coefficients["apgar5", "Pr(>|t|)"]
cat("P-value for Apgar5 coefficient:", pval_apgar5, "\n")
Output: P-value for Apgar5 coefficient: 0.2932439
# Decision rule at alpha = 0.05
if (pval_apgar5 < 0.05) {
cat("Reject the null hypothesis: Apgar score is a significant predictor of SBP.\n")
} else {
cat("Fail to reject the null hypothesis: No significant relationship between Apgar score and SBP.\n")
}
Output: Fail to reject the null hypothesis: No significant relationship between Apgar score and SBP.
Hypothesis Test for Apgar Score Coefficient
H0: The coefficient for five-minute Apgar score is 0 (i.e., it has no effect on systolic blood pressure).
HA: The coefficient is not 0 (i.e., Apgar score affects SBP).
Result
p-value = 0.293
Since the p-value (0.293) is greater than the 0.05 significance level, we fail to reject the null hypothesis.
Interpretation
There is no statistically significant association between the five-minute Apgar score and systolic blood pressure after adjusting for gestational age. The test statistic behind this conclusion is
\({{t_{\text{apgar5}} = \frac{\hat{\beta}_{\text{apgar5}}}{\text{SE}(\hat{\beta}_{\text{apgar5}})}}}\)
Where:
\({{\hat{\beta}_{\text{apgar5}}}}\) is the estimated regression coefficient for the five-minute Apgar score.
\({{\text{SE}(\hat{\beta}_{\text{apgar5}})}}\) is the standard error of that estimate.
\({{t_{\text{apgar5}}}}\) is the corresponding t-statistic used to test whether apgar5 is significantly associated with SBP.
\({{t_{\text{apgar5}} = \frac{0.4875}{0.4613} = 1.057}}\)
Conclusion
There is not enough evidence to conclude that five-minute Apgar score is significantly associated with systolic blood pressure after adjusting for gestational age.
Model Predictive Power
The Multiple R-squared value of 0.089 indicates that the model explains only about 8.9% of the variability in systolic blood pressure. The Adjusted R-squared, which accounts for the number of predictors in the model, is even lower at 7.1%. This suggests that the model has limited explanatory power overall.
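As a check, the adjusted value follows from the standard formula relating the two:
\({R^2_{\text{adj}} = 1 - (1 - R^2)\frac{n-1}{n-p-1} = 1 - (1 - 0.0894)\cdot\frac{99}{97} \approx 0.0707}\)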
When comparing this to a simpler model with gestational age alone, we would expect a small increase in R² from adding five-minute Apgar score as a second predictor. However, the p-value for Apgar score is not statistically significant (p = 0.293), and the marginal gain in R² is minimal.
Conclusion
The inclusion of five-minute Apgar score in a model already containing gestational age does not meaningfully improve the prediction of systolic blood pressure. The modest increase in R² is not enough to justify its inclusion as a valuable predictor in this context.
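This comparison can be made formal with a partial F-test between the nested models. For a single added predictor, the F statistic equals the square of that predictor's t statistic (1.057² ≈ 1.12), so the p-value matches 0.293. A minimal sketch, reusing model_multi from above:

# Partial F-test: does apgar5 add predictive value beyond gestage?
model_reduced <- lm(sbp ~ gestage, data = lowbwt)
anova(model_reduced, model_multi)  # F = t^2 ≈ 1.12, p ≈ 0.293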
Question 4
Refer to the cdi data set. You have been asked to evaluate two alternative models for predicting the number of active physicians (physicians, your y variable) in a CDI. Proposed model I includes as predictor variables total population (population), land area (area), and total personal income (total). Proposed model II includes as predictor variables population density (total population divided by land area), percent of population greater than 64 years old (pct65), and total personal income (total).
# Create population density variable
cdi$pop_density <- cdi$population/cdi$area
# --------------------------------------------
# Part (a): Plot each predictor against physicians
# --------------------------------------------
# Model I predictors: population, area, total
model1_vars <- c("population", "area", "total")
# Model II predictors: pop_density, pct65, total
model2_vars <- c("pop_density", "pct65", "total")
# Set layout to 1 row, 3 columns
par(mfrow = c(1, 3))
# For Model I: Population Vs. Physicians
plot(cdi$population, cdi$physicians, main = "Population vs. Physicians",
xlab = "Population", ylab = "Physicians", col = "limegreen", pch = 16)
abline(lm(physicians ~ population, data = cdi), col = "black", lwd = 2)
# Area Vs. Physicians
plot(cdi$area, cdi$physicians, main = "Area vs. Physicians", xlab = "Area",
ylab = "Physicians", col = "orange", pch = 16)
abline(lm(physicians ~ area, data = cdi), col = "black", lwd = 2)
# Total Income Vs. Physicians
plot(cdi$total, cdi$physicians, main = "Income vs. Physicians", xlab = "Income",
ylab = "Physicians", col = "purple", pch = 16)
abline(lm(physicians ~ total, data = cdi), col = "black", lwd = 2)
# Reset layout
par(mfrow = c(1, 1))
# Set layout to 1 row, 3 columns
par(mfrow = c(1, 3))
# For Model II: Population Density vs. Physicians
plot(cdi$pop_density, cdi$physicians, main = "Pop Density vs. Physicians",
xlab = "Pop Density", ylab = "Physicians", col = "turquoise", pch = 16)
abline(lm(physicians ~ pop_density, data = cdi), col = "black", lwd = 2)
# Percent Age 65+ vs. Physicians
plot(cdi$pct65, cdi$physicians, main = "Percent Age 65+ vs. Physicians",
xlab = "Percent Age 65+", ylab = "Physicians", col = "magenta", pch = 16)
abline(lm(physicians ~ pct65, data = cdi), col = "black", lwd = 2)
# Total Personal Income Vs. Physicians
plot(cdi$total, cdi$physicians, main = "Income vs. Physicians", xlab = "Income",
ylab = "Physicians", col = "dodgerblue", pch = 16)
abline(lm(physicians ~ total, data = cdi), col = "black", lwd = 2)
# Reset layout
par(mfrow = c(1, 1))
Part 1: Plots of Predictor Variables
Model I: Predictors - population, area, total
• Population: Highly right-skewed; a small number of counties have very large populations.
• Area: Very right-skewed; most counties are small in land area, with a few being very large.
• Total Personal Income: Also skewed to the right; matches the population distribution as expected since more populous counties will typically have higher total income.
Model II: Predictors - pop_density, pct65, total
• Population Density: Extreme right skew; indicates a few very densely populated counties.
• Percent ≥ 65 (pct65): Relatively normally distributed, some slight left skew.
• Total Personal Income: Same interpretation as in Model I.
These plots reveal strongly skewed, non-normal predictor values and potential outliers, which may affect model assumptions and influence the choice of transformations or robust methods.
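The skewness claims above concern each predictor's marginal distribution; a hypothetical histogram sketch (not one of the original chunks) would display them directly:

# Histograms of the predictors used by Models I and II
par(mfrow = c(2, 3))
for (v in c("population", "area", "total", "pop_density", "pct65")) {
    hist(cdi[[v]], main = v, xlab = v, col = "gray")
}
par(mfrow = c(1, 1))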
Next, I will compute and display scatterplot matrices and correlation matrices to evaluate linear relationships and multicollinearity between predictors.
# --------------------------------------------
# Part (b): Scatterplot matrix and correlation matrix
# --------------------------------------------
# Model I scatterplot matrix
model1_data <- cdi[, c(model1_vars, "physicians")]
ggpairs(model1_data, title = "Model I: Scatterplot Matrix")
# Model II scatterplot matrix
model2_data <- cdi[, c(model2_vars, "physicians")]
ggpairs(model2_data, title = "Model II: Scatterplot Matrix")
# Model I Correlation
cor_model1 <- round(cor(model1_data), 2)
print("Correlation matrix for Model I:")
Output: [1] "Correlation matrix for Model I:"
print(cor_model1)
Output: population area total physicians
Output: population 1.00 0.17 0.99 0.94
Output: area 0.17 1.00 0.13 0.08
Output: total 0.99 0.13 1.00 0.95
Output: physicians 0.94 0.08 0.95 1.00
# Model II Correlation
cor_model2 <- round(cor(model2_data), 2)
print("Correlation matrix for Model II:")
Output: [1] "Correlation matrix for Model II:"
print(cor_model2)
Output: pop_density pct65 total physicians
Output: pop_density 1.00 0.03 0.32 0.41
Output: pct65 0.03 1.00 -0.02 0.00
Output: total 0.32 -0.02 1.00 0.95
Output: physicians 0.41 0.00 0.95 1.00
Part 2: Correlation and Scatterplot Analysis
Model I: Predictors – population, area, total
• Scatterplot Matrix Insights:
• Strong positive association between population and total income.
• physicians increases with both population and total, but those two predictors are themselves nearly collinear, so their separate effects will be hard to distinguish.
• area shows weak correlation with all variables, including physicians.
• Correlation Matrix:
• High correlation between population and total (0.99) suggests multicollinearity.
• Very strong correlations between population and physicians (0.94), and between total and physicians (0.95).
• area has negligible correlation with physicians.
Model II: Predictors – pop_density, pct65, total
• Scatterplot Matrix Insights:
• Moderate positive relationship between pop_density and physicians.
• Very strong correlation between total income and physicians.
• Little or no correlation involving pct65.
• Correlation Matrix:
• total and physicians: very strong correlation (0.95).
• pop_density and physicians: moderate correlation (0.41).
• pct65 has near-zero correlation with physicians and other predictors.
Conclusion
• Multicollinearity is likely a concern in Model I due to high correlation between population and total.
• Model II has lower multicollinearity, but also includes a weak predictor (pct65).
Next, I will fit both regression models and calculate R² and adjusted R² to compare predictive performance.
# --------------------------------------------
# Part (c): Fit first-order regression models
# --------------------------------------------
# Model I: physicians ~ population + area + total
model1 <- lm(physicians ~ population + area + total, data = cdi)
# Model II: physicians ~ pop_density + pct65 + total
model2 <- lm(physicians ~ pop_density + pct65 + total, data = cdi)
summary(model1)
Output:
Output: Call:
Output: lm(formula = physicians ~ population + area + total, data = cdi)
Output:
Output: Residuals:
Output: Min 1Q Median 3Q Max
Output: -1855.6 -215.2 -74.6 79.0 3689.0
Output:
Output: Coefficients:
Output: Estimate Std. Error t value Pr(>|t|)
Output: (Intercept) -1.332e+01 3.537e+01 -0.377 0.706719
Output: population 8.366e-04 2.867e-04 2.918 0.003701 **
Output: area -6.552e-02 1.821e-02 -3.597 0.000358 ***
Output: total 9.413e-02 1.330e-02 7.078 5.89e-12 ***
Output: ---
Output: Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Output:
Output: Residual standard error: 560.4 on 436 degrees of freedom
Output: Multiple R-squared: 0.9026, Adjusted R-squared: 0.902
Output: F-statistic: 1347 on 3 and 436 DF, p-value: < 2.2e-16
summary(model2)
Output:
Output: Call:
Output: lm(formula = physicians ~ pop_density + pct65 + total, data = cdi)
Output:
Output: Residuals:
Output: Min 1Q Median 3Q Max
Output: -3055.75 -175.30 -38.05 72.88 3045.81
Output:
Output: Coefficients:
Output: Estimate Std. Error t value Pr(>|t|)
Output: (Intercept) -1.706e+02 8.353e+01 -2.042 0.0418 *
Output: pop_density 9.616e-02 1.224e-02 7.857 3.1e-14 ***
Output: pct65 6.340e+00 6.384e+00 0.993 0.3212
Output: total 1.266e-01 2.084e-03 60.723 < 2e-16 ***
Output: ---
Output: Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Output:
Output: Residual standard error: 533.5 on 436 degrees of freedom
Output: Multiple R-squared: 0.9117, Adjusted R-squared: 0.9111
Output: F-statistic: 1501 on 3 and 436 DF, p-value: < 2.2e-16
Interpretation & Analysis
Model I:
R² = 0.9026: the model explains ~90.3% of the variation in the number of physicians.
Adjusted R² = 0.9020: very close to R², indicating the model is not overfitted.
AIC/BIC: slightly higher than Model II, suggesting Model I is less optimal when penalizing complexity.
Concern: strong multicollinearity between population and total → potential instability in coefficient estimates.
Model II:
R² = 0.9117: explains ~91.2% of the variance, slightly better than Model I.
Adjusted R² = 0.9111: still excellent and higher than Model I.
AIC/BIC: both lower than Model I, indicating better model fit with fewer complications.
Note: pct65 correlates weakly with the response, but pop_density and total still drive predictive power.
# --------------------------------------------
# Part (d): Compare R² and Adjusted R²
# --------------------------------------------
cat("Model I R-squared:", summary(model1)$r.squared, "\n")
Output: Model I R-squared: 0.9026432
cat("Model I Adjusted R-squared:", summary(model1)$adj.r.squared, "\n")
Output: Model I Adjusted R-squared: 0.9019733
cat("Model I AIC:", AIC(model1), "\n\n")
Output: Model I AIC: 6823.792
cat("Model II R-squared:", summary(model2)$r.squared, "\n")
Output: Model II R-squared: 0.9117491
cat("Model II Adjusted R-squared:", summary(model2)$adj.r.squared, "\n")
Output: Model II Adjusted R-squared: 0.9111419
cat("Model II AIC:", AIC(model2), "\n\n")
Output: Model II AIC: 6780.585
Model Fitting and R² Comparison
| Model | R² | Adjusted R² | AIC | BIC |
|---|---|---|---|---|
| Model I | 0.9026 | 0.9020 | 6823.79 | 6844.23 |
| Model II | 0.9117 | 0.9111 | 6780.59 | 6801.02 |
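The table's BIC values can be reproduced with stats::BIC(), analogous to the AIC printout above. A sketch (output omitted; exact values depend on the log-likelihood convention):

# BIC comparison (lower is better)
cat("Model I BIC:", BIC(model1), "\n")
cat("Model II BIC:", BIC(model2), "\n")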
# --------------------------------------------
# Additional: VIF to assess multicollinearity
# --------------------------------------------
cat("VIFs (Varianace Inflation Factors) for Model I:\n")
Output: VIFs (Varianace Inflation Factors) for Model I:
print(vif(model1))
Output: population area total
Output: 41.636907 1.114256 41.052464
cat("\nVIFs (C=Variance Inflation Factors) for Model II:\n")
Output:
Output: VIFs (C=Variance Inflation Factors) for Model II:
print(vif(model2))
Output: pop_density pct65 total
Output: 1.112730 1.001991 1.112357
Conclusion
Model II is clearly preferable in terms of R², adjusted R², AIC, and BIC.
It achieves higher explanatory power and better model fit without multicollinearity.
This makes Model II more stable and interpretable.
Question 5
Refer to the cdi data set.
# Create population density variable
cdi$pop_density <- cdi$population/cdi$area
# Initialize list to store results
region_results <- list()
# Loop through each region and fit linear model
for (r in sort(unique(cdi$region))) {
    # Use a loop variable distinct from the column name, so the subset
    # actually filters by region (region == region would select all rows)
    region_data <- subset(cdi, region == r)
    # Fit model: crime ~ pop_density + income + graduates
    model <- lm(crime ~ pop_density + income + graduates, data = region_data)
    # Store tidy summary and performance metrics
    region_results[[as.character(r)]] <- list(model = model, coefficients = tidy(model,
        conf.int = TRUE), summary = glance(model))
}
# Display results for each region
for (r in names(region_results)) {
    cat(paste0("\n\n======================\nRegion ", r, "\n======================\n"))
    print(region_results[[r]]$coefficients %>%
        select(term, estimate, conf.low, conf.high, p.value))
    cat("\nModel Summary:\n")
    print(region_results[[r]]$summary %>%
        select(r.squared, adj.r.squared, statistic))
}
Output:
Output:
Output: ======================
Output: Region 1
Output: ======================
Output: # A tibble: 4 × 5
Output:   term        estimate conf.low conf.high  p.value
Output:   <chr>          <dbl>    <dbl>     <dbl>    <dbl>
Output: 1 (Intercept) -6.45e+4 -1.49e+5   2.00e+4 1.33e- 1
Output: 2 pop_density  1.74e+1  1.57e+1   1.90e+1 5.28e-38
Output: 3 income      -1.41e+0 -2.92e+0   1.03e-1 6.74e- 2
Output: 4 graduates    1.18e+3 -8.95e+1   2.45e+3 6.81e- 2
Output:
Output: Model Summary:
Output: # A tibble: 1 × 3
Output:   r.squared adj.r.squared statistic
Output:       <dbl>         <dbl>     <dbl>
Output: 1     0.835         0.830      167.
Output:
Output:
Output: ======================
Output: Region 2
Output: ======================
Output: # A tibble: 4 × 5
Output:   term        estimate conf.low conf.high  p.value
Output:   <chr>          <dbl>    <dbl>     <dbl>    <dbl>
Output: 1 (Intercept) -4.16e+3 -1.03e+5   9.49e+4 9.34e- 1
Output: 2 pop_density  3.36e+1  2.60e+1   4.13e+1 4.79e-14
Output: 3 income       1.02e-1 -3.21e+0   3.42e+0 9.51e- 1
Output: 4 graduates   -2.76e+0 -1.61e+3   1.60e+3 9.97e- 1
Output:
Output: Model Summary:
Output: # A tibble: 1 × 3
Output:   r.squared adj.r.squared statistic
Output:       <dbl>         <dbl>     <dbl>
Output: 1     0.529         0.515      38.9
Output:
Output:
Output: ======================
Output: Region 3
Output: ======================
Output: # A tibble: 4 × 5
Output:   term        estimate conf.low conf.high  p.value
Output:   <chr>          <dbl>    <dbl>     <dbl>    <dbl>
Output: 1 (Intercept)  3.89e+4 -2.17e+4   9.95e+4 2.07e- 1
Output: 2 pop_density  5.54e+0  8.39e-1   1.02e+1 2.12e- 2
Output: 3 income       1.96e+0 -2.08e-1   4.12e+0 7.61e- 2
Output: 4 graduates   -6.71e+2 -1.68e+3   3.40e+2 1.92e- 1
Output:
Output: Model Summary:
Output: # A tibble: 1 × 3
Output:   r.squared adj.r.squared statistic
Output:       <dbl>         <dbl>     <dbl>
Output: 1    0.0925        0.0741      5.03
Output:
Output:
Output: ======================
Output: Region 4
Output: ======================
Output: # A tibble: 4 × 5
Output:   term        estimate conf.low conf.high  p.value
Output:   <chr>          <dbl>    <dbl>     <dbl>    <dbl>
Output: 1 (Intercept)  1.29e+5 -6.11e+4   3.20e+5 1.80e- 1
Output: 2 pop_density  5.72e+0 -5.79e+0   1.72e+1 3.25e- 1
Output: 3 income       4.34e+0 -1.10e+0   9.79e+0 1.16e- 1
Output: 4 graduates   -2.16e+3 -4.86e+3   5.38e+2 1.15e- 1
Output:
Output: Model Summary:
Output: # A tibble: 1 × 3
Output:   r.squared adj.r.squared statistic
Output:       <dbl>         <dbl>     <dbl>
Output: 1    0.0867        0.0491      2.31
Regression Equation
Within each region, the fitted model has the form
\({\text{crime} = \beta_0 + \beta_1 \cdot \text{pop\_density} + \beta_2 \cdot \text{income} + \beta_3 \cdot \text{graduates} + \varepsilon}\)
with a separate set of estimated coefficients per region, summarized in the sections below.
Estimated Regression Functions by Region
We regressed crime on:
• pop_density (population / area)
• income (per capita personal income)
• graduates (percent high school graduates)
separately for each of the four regions (labeled 1 through 4). Each region's coefficients are stored in the coefficients element of region_results.
Region 1
Estimated Regression Equation
\({\text{crime} = -64{,}466 + 17.38 \cdot \text{pop\_density} - 1.41 \cdot \text{income} + 1182.58 \cdot \text{graduates}}\)
Coefficient Summary:
| Predictor | Estimate | 95% CI | p-value | Interpretation |
|---|---|---|---|---|
| Intercept | -64,466 | [-148,915, 19,983] | 0.133 | Not significant |
| pop_density | 17.38 | [15.73, 19.04] | < 0.001 | Strong positive effect |
| income | -1.41 | [-2.92, 0.10] | 0.067 | Marginally non-significant |
| graduates | 1182.58 | [-89.51, 2454.67] | 0.068 | Marginally non-significant |
Model Fit
R²: 0.835 → Model explains 83.5% of variation in crime.
Adjusted R²: 0.830
F-statistic: 167.25 (p < 0.001) → Highly statistically significant overall model.
Interpretation
pop_density is a highly significant predictor of serious crime. Each additional unit of population density is associated with an increase of ~17.4 crimes, holding income and education constant.
income has a small negative effect; marginally non-significant (p ≈ 0.067).
graduates is positively associated with crime — unexpected — and marginally non-significant. This may reflect urban effects (e.g., both education and crime are higher in large metro counties).
Conclusion for Region 1
Region 1 shows a very strong model, with over 83% of crime variation explained. Population density is the dominant predictor and highly statistically significant. Income and education show marginal effects but are not reliable predictors in this model.
Region 2
Estimated Regression Equation
\({\text{crime} = -4{,}163 + 33.62 \cdot \text{pop\_density} + 0.10 \cdot \text{income} - 2.76 \cdot \text{graduates}}\)
Coefficient Summary:
| Predictor | Estimate | 95% CI | p-value | Interpretation |
|---|---|---|---|---|
| Intercept | -4,163 | [-103,232, 94,906] | 0.934 | Not meaningful |
| pop_density | 33.62 | [25.97, 41.26] | <0.001 | Statistically significant and strongly positive |
| income | 0.10 | [-3.21, 3.42] | 0.951 | Not significant |
| graduates | -2.76 | [-1,608, 1,602] | 0.997 | Not significant |
R²: 0.529 → ~53% of the variation in crime is explained by the model.
Adjusted R²: 0.515 → Adjusted for number of predictors.
F-statistic: 38.86 with p < 0.001 → Overall model is statistically significant.
Interpretation
The model is moderately strong, explaining over half the variation in crime.
Population density is the only statistically significant predictor, with a large positive effect: For each unit increase in density, serious crimes increase by ~33.6 incidents (holding income and education constant).
Income and education do not meaningfully contribute in this region — their coefficients are nearly zero and confidence intervals include both positive and negative values.
Conclusion for Region 2
The crime rate in Region 2 is strongly associated with population density, but not with income or education levels. The model is statistically sound overall (F-test), but only one variable (density) is driving this performance. This supports population density as a consistent crime predictor in urban/suburban settings.
Region 3
Estimated Regression Equation
\({\text{crime} = 38{,}863 + 5.54 \cdot \text{pop\_density} + 1.96 \cdot \text{income} - 670.88 \cdot \text{graduates}}\)
Coefficient Summary
| Predictor | Estimate | 95% CI | p-value | Interpretation |
|---|---|---|---|---|
| Intercept | 38,863 | [-21,731, 99,456] | 0.207 | Not significant |
| pop_density | 5.54 | [0.84, 10.23] | 0.021 | Significant and positive |
| income | 1.96 | [-0.21, 4.12] | 0.076 | Marginally non-significant |
| graduates | -670.88 | [-1681.62, 339.85] | 0.192 | Not significant |
Model Fit
R²: 0.093 → Model explains only 9.3% of the variation in crime.
Adjusted R²: 0.074
F-statistic: 5.03 (p = 0.0024) → Model is statistically significant overall, but weak explanatory power.
Interpretation
pop_density is the only statistically significant predictor. For every unit increase in population density, serious crimes increase by ~5.5 incidents.
income has a weak positive association with crime but is not significant (p ≈ 0.076).
graduates has a negative effect but also lacks statistical significance.
Conclusion for Region 3
The model is statistically significant overall but explains less than 10% of crime variation, indicating a poor fit. While population density remains a consistent predictor, income and education do not significantly contribute in this region.
Region 4
Estimated Regression Equation
\({\text{crime} = 129{,}323 + 5.72 \cdot \text{pop\_density} + 4.34 \cdot \text{income} - 2159.92 \cdot \text{graduates}}\)
Coefficient Summary
| Predictor | Estimate | 95% CI | p-value | Interpretation |
|---|---|---|---|---|
| Intercept | 129,323 | [-61,119, 319,766] | 0.180 | Not significant |
| pop_density | 5.72 | [-5.79, 17.22] | 0.325 | Not significant |
| income | 4.34 | [-1.10, 9.79] | 0.116 | Not significant |
| graduates | -2159.92 | [-4857.60, 537.76] | 0.115 | Not significant |
Model Fit
R²: 0.087 → Model explains only 8.7% of crime variation.
Adjusted R²: 0.049
F-statistic: 2.31 (p = 0.084) → Model is not statistically significant at the 0.05 level.
Interpretation
No predictor is statistically significant.
The model’s explanatory power is very weak, suggesting these variables do not meaningfully predict crime in Region 4.
The high standard errors and wide confidence intervals confirm high uncertainty in estimates.
Conclusion for Region 4
The model for Region 4 is not statistically significant and explains less than 9% of the variation in crime. None of the predictors — population density, income, or percent high school graduates — are reliable indicators of serious crime levels in this region.
# Initialize list to store results
all_results <- list()
# Loop through each region
for (r in sort(unique(cdi$region))) {
region_data <- filter(cdi, region == r)
# Fit model
model <- lm(crime ~ pop_density + income + graduates, data = region_data)
# Extract coefficients and model summary
coef_tbl <- tidy(model, conf.int = TRUE)
model_stats <- glance(model)
# Merge into a tidy summary
coef_tbl <- coef_tbl %>%
mutate(Region = r, r_squared = model_stats$r.squared, adj_r_squared = model_stats$adj.r.squared,
sigma = model_stats$sigma) %>%
select(Region, term, estimate, conf.low, conf.high, p.value, r_squared,
adj_r_squared, sigma)
all_results[[r]] <- coef_tbl
}
# Combine all regions into one data frame
region_comparison <- bind_rows(all_results)
# View result
print(region_comparison)
Output: # A tibble: 16 × 9
Output: Region term estimate conf.low conf.high p.value r_squared adj_r_squared
Output: <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
Output: 1 1 (Interce… -6.45e+4 -1.49e+5 2.00e+4 1.33e- 1 0.835 0.830
Output: 2 1 pop_dens… 1.74e+1 1.57e+1 1.90e+1 5.28e-38 0.835 0.830
Output: 3 1 income -1.41e+0 -2.92e+0 1.03e-1 6.74e- 2 0.835 0.830
Output: 4 1 graduates 1.18e+3 -8.95e+1 2.45e+3 6.81e- 2 0.835 0.830
Output: 5 2 (Interce… -4.16e+3 -1.03e+5 9.49e+4 9.34e- 1 0.529 0.515
Output: 6 2 pop_dens… 3.36e+1 2.60e+1 4.13e+1 4.79e-14 0.529 0.515
Output: 7 2 income 1.02e-1 -3.21e+0 3.42e+0 9.51e- 1 0.529 0.515
Output: 8 2 graduates -2.76e+0 -1.61e+3 1.60e+3 9.97e- 1 0.529 0.515
Output: 9 3 (Interce… 3.89e+4 -2.17e+4 9.95e+4 2.07e- 1 0.0925 0.0741
Output: 10 3 pop_dens… 5.54e+0 8.39e-1 1.02e+1 2.12e- 2 0.0925 0.0741
Output: 11 3 income 1.96e+0 -2.08e-1 4.12e+0 7.61e- 2 0.0925 0.0741
Output: 12 3 graduates -6.71e+2 -1.68e+3 3.40e+2 1.92e- 1 0.0925 0.0741
Output: 13 4 (Interce… 1.29e+5 -6.11e+4 3.20e+5 1.80e- 1 0.0867 0.0491
Output: 14 4 pop_dens… 5.72e+0 -5.79e+0 1.72e+1 3.25e- 1 0.0867 0.0491
Output: 15 4 income 4.34e+0 -1.10e+0 9.79e+0 1.16e- 1 0.0867 0.0491
Output: 16 4 graduates -2.16e+3 -4.86e+3 5.38e+2 1.15e- 1 0.0867 0.0491
Output: # ℹ 1 more variable: sigma <dbl>
Let’s compare the R² values and coefficients across regions:
| Region | R² | MSE | Notes |
|---|---|---|---|
| 1 | 0.835 | 7.87e+08 | Strong model fit; predictors explain most crime variance |
| 2 | 0.529 | 1.09e+09 | Moderate fit |
| 3 | 0.093 | 1.37e+09 | Weak explanatory power |
| 4 | 0.087 | 6.69e+09 | Very weak model fit |
Region 1: a strong model with high explanatory power; Region 2: moderate.
Regions 3 and 4: very weak fits; the predictors explain little of the crime variation.
This suggests that the relationship between crime and the predictors differs notably across regions, both in strength and possibly in direction (based on coefficient values, which we can detail further).
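That claim can be tested formally with a Chow-type nested-model comparison: fit one pooled model and one that lets every coefficient vary by region, then compare with a partial F-test. A sketch, assuming region is coded 1 through 4 in cdi (output not shown):

# Do the regression functions differ across regions?
pooled <- lm(crime ~ pop_density + income + graduates, data = cdi)
by_region <- lm(crime ~ (pop_density + income + graduates) * factor(region),
    data = cdi)
anova(pooled, by_region)  # small p-value -> coefficients differ by region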
Across regions, population density remains the strongest predictor; income and education effects are weaker and vary in sign from region to region.
Model Summary
• R²: Indicates how much of the variance in crime is explained. R² > 0.7 = strong model; R² < 0.2 = weak.
• Adjusted R²: Penalized for number of predictors — useful when comparing regions.
• p-value (F-statistic): Shows whether the model overall is statistically significant.
• AIC: Helps compare model fit — lower is better.
Cross-Region Comparison
• Region 1: the strongest relationship (high R², significant pop_density effect).
• Region 2: moderate explanatory power.
• Regions 3 & 4: weak fit; the predictors do not explain crime well here.
MSE and R² Summary
• R² values vary widely: 0.087 to 0.835.
• MSE increases substantially in Region 4, indicating greater unexplained variance and likely more variability or larger outliers.
Conclusion
The estimated regression functions are not similar across regions. Region 1 shows a strong and reliable model, while Regions 3 and 4 show very poor fits, suggesting that the same predictors do not perform equally well across geographies.