Instructions

Due by end of Module 4.

Set up environment

For knitting, global chunk options are set in an Environment Setup chunk with echo=FALSE, the random seed is set globally, and the required packages are installed and loaded.
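A minimal sketch of such a setup chunk (the exact options, seed, and package list are assumptions, inferred from the functions used later in this document):

# Environment Setup (echo=FALSE in the actual chunk)
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
set.seed(123)  # assumed seed value; the original is set globally

library(knitr)       # kable()
library(kableExtra)  # kable_styling(), add_header_above(), row_spec()
library(broom)       # tidy(), glance()
library(dplyr)       # filter(), mutate(), select(), bind_rows(), %>%
library(GGally)      # ggpairs()
library(car)         # vif()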

Question 1

Consider the multiple linear regression model:

\({ \mu = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p }\)

Show that \(\beta_j\) gives the difference in the expected response for a one-unit increase in \(x_j\), no matter what the values of \(x_j\) and the other predictors. Hint: Compare the expected response when \(x_j = x\) and \(x_j = x + 1\).

When \({x_j = x:}\)

\({\mu_1 = \beta_0 + \beta_1 x_1 + \cdots + \beta_j x + \cdots + \beta_p x_p}\)

When \({x_j = x + 1:}\)

\({\mu_2 = \beta_0 + \beta_1 x_1 + \cdots + \beta_j (x+1) + \cdots + \beta_p x_p}\)

\({\mu_2 = \beta_0 + \beta_1 x_1 + \cdots + \beta_j x + \beta_j \cdot 1 + \cdots + \beta_p x_p}\)

Take the difference:

\({\mu_2 - \mu_1 = [\beta_j x + \beta_j \cdot 1] - [\beta_j x] = \beta_j}\)

Therefore, \({\mu_2 - \mu_1 = \beta_j}\)

Conclusion

The coefficient \({\beta_j}\) gives the partial effect of \({x_j}\) on the expected value of the response variable. It quantifies how much the response is expected to change per unit increase in \({x_j}\), assuming all other variables remain fixed.
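A quick numeric illustration of this result (the coefficient values are arbitrary, chosen only for the example):

# Two-predictor illustration: mu changes by exactly b1 when x1 rises by one unit
b0 <- 2; b1 <- 1.5; b2 <- -0.7
mu <- function(x1, x2) b0 + b1 * x1 + b2 * x2
mu(x1 = 4, x2 = 10) - mu(x1 = 3, x2 = 10)  # 1.5, i.e., beta_1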

Question 2

In a study designed to examine the effects of adding oats to the typical American diet, individuals were randomly divided into two groups. Twice a day, the first group substituted oats for other foods containing carbohydrates. The members of the second group did not make any changes to their diet. One outcome of interest is the serum cholesterol level of each individual eight weeks after the start of the study. Explanatory variables that might be associated with this response include diet group, serum cholesterol level at the start of the study, body mass index, and sex. The estimated coefficients and standard errors from a multiple linear regression model based on a sample of size 76 and containing four explanatory variables are in the following table.

Variable               Coefficient   Standard Error
Diet group             -11.25        4.33
Baseline cholesterol    0.85         0.07
Body mass index         0.23         0.65
Sex                    -3.02         4.42
a. Conduct tests of the null hypotheses that each of the four coefficients in the population regression is equal to 0. At the 0.05 significance level, which of the explanatory variables are associated with serum cholesterol level eight weeks after the start of the study, adjusting for the others?
# Regression results
coefs <- c(DietGroup = -11.25, BaselineChol = 0.85, BMI = 0.23, Sex = -3.02)
ses <- c(DietGroup = 4.33, BaselineChol = 0.07, BMI = 0.65, Sex = 4.42)
n <- 76
p <- 4
df <- n - p - 1

# Compute test statistics
t_stats <- coefs/ses
p_vals <- 2 * pt(-abs(t_stats), df = df)

# Create results data frame
results <- data.frame(Coefficient = coefs, `Standard Error` = ses, `t-value` = t_stats,
    `p-value` = p_vals, Significant = p_vals < 0.05, check.names = FALSE)

# Format values for output
results_fmt <- format(results, digits = 4)

# Print table with title above header
kable(results_fmt, align = "c", escape = FALSE) %>%
    kable_styling(font_size = 12) %>%
    add_header_above(c(`Hypothesis Testing Summary` = 6), bold = TRUE) %>%
    row_spec(0, bold = TRUE)
Hypothesis Testing Summary

               Coefficient   Standard Error   t-value    p-value     Significant
DietGroup      -11.25        4.33             -2.5982    1.139e-02   TRUE
BaselineChol    0.85         0.07             12.1429    5.371e-19   TRUE
BMI             0.23         0.65              0.3538    7.245e-01   FALSE
Sex            -3.02         4.42             -0.6833    4.967e-01   FALSE

Hypothesis Testing for Each Coefficient

Null Hypothesis (H₀): β = 0 (no effect)

Alternative Hypothesis (HA): β ≠ 0 (a nonzero effect)

We compute:

\({{t = \frac{\text{Coefficient}}{\text{Standard Error}} \quad \text{and} \quad p = 2 \cdot P(T_{71} > |t|)}}\)
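For example, for diet group: \({t = \frac{-11.25}{4.33} = -2.60}\) and \({p = 2 \cdot P(T_{71} > 2.60) \approx 0.011}\), matching the table above.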

Interpretation: At the 0.05 level, diet group (p = 0.011) and baseline cholesterol (p < 0.001) are significantly associated with serum cholesterol eight weeks after the start of the study, adjusting for the other variables; body mass index (p = 0.724) and sex (p = 0.497) are not.

b. What is the probability distribution of each of the test statistics you calculated?
# b) Probability distribution of the test statistics
cat("\nEach test statistic follows a t-distribution with", df, "degrees of freedom.\n\n")
Output: 
Output: Each test statistic follows a t-distribution with 71 degrees of freedom.

Probability Distribution of Test Statistics

Each of the test statistics follows a t-distribution with \({{df = n - p - 1 = 76 - 4 - 1 = 71}}\) degrees of freedom.

This is appropriate for inference in a multiple linear regression model where the errors are assumed normally distributed.
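As a sanity check (a one-line sketch using the df computed above), the two-sided 5% critical value is:

qt(0.975, df = 71)  # about 1.99; coefficients with |t| above this are significant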

c. If an individual’s body mass index were to increase by \({1\ \text{kg/m}^2}\) while the values of all other explanatory variables remained constant, what would happen to his or her serum cholesterol level?
# c) Effect of 1 unit increase in BMI
delta1 <- coefs["BMI"] * 1
cat("For a 1 unit increase in BMI, serum cholesterol changes by", delta1,
    "units (mg/dL).\n\n")
Output: For a 1 unit increase in BMI, serum cholesterol changes by 0.23 units (mg/dL).

Effect of 1-unit Increase in BMI

• Coefficient for BMI = 0.23

• A 1 kg/m² increase in BMI → predicted increase in serum cholesterol = 0.23 mg/dL

Interpretation

Holding all else constant, a 1-unit increase in BMI is associated with an average increase of 0.23 mg/dL in serum cholesterol after 8 weeks. However, this effect is not statistically significant (p = 0.724), so it may be due to random variation.

d. If an individual’s body mass index were to increase by \({10\ \text{kg/m}^2}\) while the values of all other explanatory variables remained constant, what would happen to his or her serum cholesterol level?
# d) Effect of 10 unit increase in BMI
delta10 <- coefs["BMI"] * 10
cat("For a 10 unit increase in BMI, serum cholesterol changes by", delta10,
    "units (mg/dL).\n\n")
Output: For a 10 unit increase in BMI, serum cholesterol changes by 2.3 units (mg/dL).

Effect of 10-unit Increase in BMI

• Multiply the coefficient by 10: \({{10 \times 0.23 = 2.3}}\)

Interpretation

A 10-unit increase in BMI is associated with an average increase of 2.3 mg/dL in serum cholesterol, holding other variables constant. Again, this effect is not statistically significant and should be interpreted with caution.

e. The variable sex is coded so that 1 represents a male and 0 a female. Who is more likely to have a higher serum cholesterol level eight weeks after the start of the study, a male or a female? By how much on average?

# e) Sex effect interpretation
if (coefs["Sex"] < 0) {
    sex_result <- paste0("Males have on average ", abs(coefs["Sex"]), " units lower serum cholesterol than females, adjusting for other variables.")
} else {
    sex_result <- paste0("Males have on average ", abs(coefs["Sex"]), " units higher serum cholesterol than females, adjusting for other variables.")
}
cat(sex_result, "\n")
Output: Males have on average 3.02 units lower serum cholesterol than females, adjusting for other variables.

Effect of Sex on Cholesterol

• Sex is coded: 1 = male, 0 = female

• Coefficient = -3.02

Interpretation

Males (coded 1) have, on average, 3.02 mg/dL lower serum cholesterol than females after 8 weeks, adjusting for diet group, baseline cholesterol, and BMI. However, this difference is not statistically significant (p = 0.497), so the evidence does not support a true sex-based difference in cholesterol outcomes.

Final Conclusions

1. Significant predictors of cholesterol level at 8 weeks: diet group (p = 0.011) and baseline cholesterol (p < 0.001).

2. Non-significant predictors: body mass index (p = 0.724) and sex (p = 0.497).

3. Implications: adjusting for the other variables, the two diet groups differ by an estimated 11.25 mg/dL in serum cholesterol at eight weeks, and baseline cholesterol is the strongest predictor of the eight-week value.

Question 3

For the population of low birth weight infants, a significant linear relationship was found to exist between systolic blood pressure and gestational age. The relevant data are in the file lowbwt. The measurements of systolic blood pressure are in the variable sbp, and the corresponding gestational ages are in gestage. Also in the data set is apgar5, the five-minute apgar score for each infant.

a. Construct a two-way scatter plot of systolic blood pressure versus five-minute apgar score. Does there appear to be a linear relationship between these two variables?

# Fit linear model
model_simple <- lm(sbp ~ apgar5, data = lowbwt)

# Create new data frame for predictions
new_apgar <- data.frame(apgar5 = seq(min(lowbwt$apgar5), max(lowbwt$apgar5),
    length.out = 100))
pred <- predict(model_simple, newdata = new_apgar, interval = "confidence",
    level = 0.95)

# Plot
plot(lowbwt$apgar5, lowbwt$sbp, xlab = "Five-Minute Apgar Score", ylab = "Systolic Blood Pressure",
    main = "SBP vs Five-Minute Apgar Score", pch = 19, col = "blue")
# Draw the 95% confidence band first, then the fitted line on top of it
polygon(c(new_apgar$apgar5, rev(new_apgar$apgar5)), c(pred[, "lwr"], rev(pred[,
    "upr"])), col = rgb(1, 0, 0, 0.2), border = NA)
lines(new_apgar$apgar5, pred[, "fit"], col = "red", lwd = 2)

# Summary of the model
summary(model_simple)
Output: 
Output: Call:
Output: lm(formula = sbp ~ apgar5, data = lowbwt)
Output: 
Output: Residuals:
Output:     Min      1Q  Median      3Q     Max 
Output: -26.510  -6.406   0.246   4.943  39.397 
Output: 
Output: Coefficients:
Output:             Estimate Std. Error t value Pr(>|t|)    
Output: (Intercept)  42.7192     3.1410  13.601   <2e-16 ***
Output: apgar5        0.6977     0.4687   1.489     0.14    
Output: ---
Output: Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Output: 
Output: Residual standard error: 11.33 on 98 degrees of freedom
Output: Multiple R-squared:  0.02211,   Adjusted R-squared:  0.01214 
Output: F-statistic: 2.216 on 1 and 98 DF,  p-value: 0.1398
# 95% Confidence Interval for coefficients
confint(model_simple, level = 0.95)
Output:                  2.5 %    97.5 %
Output: (Intercept) 36.4860127 48.952311
Output: apgar5      -0.2323795  1.627848
# Tidy regression summary
tidy(model_simple, conf.int = TRUE)

Hypothesis Testing for Each Coefficient

Null Hypothesis (H₀): β = 0 (no effect). Alternative Hypothesis (HA): β ≠ 0 (a nonzero effect).

We compute: \({{t = \frac{\text{Coefficient}}{\text{Standard Error}} \quad \text{and} \quad p = 2 \cdot P(T_{98} > |t|)}}\)

Here the reference distribution has n - 2 = 98 degrees of freedom, since this simple regression has one predictor.

Interpretation

The scatter plot shows the relationship between systolic blood pressure and the five-minute Apgar score. A linear regression line (in red) and its 95% confidence band (shaded) are overlaid.

Conclusion

Based on both the scatter plot and the regression results, there is no strong evidence of a linear relationship between systolic blood pressure and five-minute Apgar score in this sample (p = 0.140, 95% CI includes 0).
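Since this is a simple regression, the correlation is the square root of the R² above, about 0.15 here; a one-line check, assuming lowbwt is loaded:

cor(lowbwt$apgar5, lowbwt$sbp)  # roughly 0.15, consistent with R-squared = 0.022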

b. Using systolic blood pressure as the response and gestational age and apgar score as the explanatory variables, fit the least squares model. Interpret the estimated coefficient of gestational age. What does it mean in words? Similarly, interpret the estimated coefficient of five-minute apgar score.
# Fit multiple linear regression model
model_multi <- lm(sbp ~ gestage + apgar5, data = lowbwt)

# View model summary
summary(model_multi)
Output: 
Output: Call:
Output: lm(formula = sbp ~ gestage + apgar5, data = lowbwt)
Output: 
Output: Residuals:
Output:     Min      1Q  Median      3Q     Max 
Output: -22.374  -8.180  -1.088   4.985  39.424 
Output: 
Output: Coefficients:
Output:             Estimate Std. Error t value Pr(>|t|)   
Output: (Intercept)   9.8034    12.6629   0.774   0.4407   
Output: gestage       1.1848     0.4424   2.678   0.0087 **
Output: apgar5        0.4875     0.4613   1.057   0.2932   
Output: ---
Output: Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Output: 
Output: Residual standard error: 10.99 on 97 degrees of freedom
Output: Multiple R-squared:  0.08944,   Adjusted R-squared:  0.07066 
Output: F-statistic: 4.764 on 2 and 97 DF,  p-value: 0.01063
# Get 95% confidence intervals for coefficients
confint(model_multi, level = 0.95)
Output:                   2.5 %    97.5 %
Output: (Intercept) -15.3289936 34.935829
Output: gestage       0.3067380  2.062913
Output: apgar5       -0.4280932  1.403123

Multiple Linear Regression Summary:

Model: sbp ~ gestage + apgar5

Term                   Coefficient   95% CI            p-value
Intercept              9.80          [-15.33, 34.94]   0.441
Gestational Age        1.18          [0.31, 2.06]      0.009
5-Minute Apgar Score   0.49          [-0.43, 1.40]     0.293

Interpretation

Gestational age: each additional week of gestational age is associated with an estimated 1.18 mmHg increase in mean systolic blood pressure, holding five-minute Apgar score constant (p = 0.009).

Five-minute apgar score: each one-point increase in the five-minute Apgar score is associated with an estimated 0.49 mmHg increase in mean systolic blood pressure, holding gestational age constant; this effect is not statistically significant (p = 0.293).

c. What is the estimated mean systolic blood pressure for the population of low birth weight infants whose gestational age is 31 weeks and whose five-minute apgar score is 7?
# Fit the linear model: SBP ~ gestational age + apgar5
model <- lm(sbp ~ gestage + apgar5, data = lowbwt)

# Create new data for prediction
new_data <- data.frame(gestage = 31, apgar5 = 7)

# Predict mean SBP
predicted_sbp <- predict(model, newdata = new_data)
cat("The estimated mean systolic blood pressure for the population of low birth weight infants whose gestational age is 31 weeks and whose five-minute apgar score is 7:",
    round(predicted_sbp, 2), "mmHg\n")
Output: The estimated mean systolic blood pressure for the population of low birth weight infants whose gestational age is 31 weeks and whose five-minute apgar score is 7: 49.95 mmHg

Conclusion

The estimated mean systolic blood pressure for the population of low birth weight infants whose gestational age is 31 weeks and whose five-minute apgar score is 7 is 49.945 mmHg.
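Equivalently, by hand with the rounded coefficients (small rounding differences from the R output are expected):

\({\hat{y} = 9.8034 + 1.1848(31) + 0.4875(7) = 9.8034 + 36.7288 + 3.4125 \approx 49.94}\)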

d. Test the null hypothesis that the coefficient for the five-minute apgar score is 0 at the 0.05 level of significance. What is the p-value? What do you conclude?
# Fit the multiple linear regression model
model_multi <- lm(sbp ~ gestage + apgar5, data = lowbwt)

# Display model summary
summary(model_multi)
Output: 
Output: Call:
Output: lm(formula = sbp ~ gestage + apgar5, data = lowbwt)
Output: 
Output: Residuals:
Output:     Min      1Q  Median      3Q     Max 
Output: -22.374  -8.180  -1.088   4.985  39.424 
Output: 
Output: Coefficients:
Output:             Estimate Std. Error t value Pr(>|t|)   
Output: (Intercept)   9.8034    12.6629   0.774   0.4407   
Output: gestage       1.1848     0.4424   2.678   0.0087 **
Output: apgar5        0.4875     0.4613   1.057   0.2932   
Output: ---
Output: Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Output: 
Output: Residual standard error: 10.99 on 97 degrees of freedom
Output: Multiple R-squared:  0.08944,   Adjusted R-squared:  0.07066 
Output: F-statistic: 4.764 on 2 and 97 DF,  p-value: 0.01063
# Extract the p-value for apgar5
pval_apgar5 <- summary(model_multi)$coefficients["apgar5", "Pr(>|t|)"]
cat("P-value for Apgar5 coefficient:", pval_apgar5, "\n")
Output: P-value for Apgar5 coefficient: 0.2932439
# Decision rule at alpha = 0.05
if (pval_apgar5 < 0.05) {
    cat("Reject the null hypothesis: Apgar score is a significant predictor of SBP.\n")
} else {
    cat("Fail to reject the null hypothesis: No significant relationship between Apgar score and SBP.\n")
}
Output: Fail to reject the null hypothesis: No significant relationship between Apgar score and SBP.

Hypothesis Test for Apgar Score Coefficient

H0: The coefficient for five-minute Apgar score is 0 (i.e., it has no effect on systolic blood pressure).

HA: The coefficient is not 0 (i.e., Apgar score affects SBP).

Result

p-value = 0.293. Since this is greater than the significance level of 0.05, we fail to reject the null hypothesis.

Interpretation

At the 0.05 significance level, there is no statistically significant association between five-minute Apgar score and systolic blood pressure, after adjusting for gestational age. The test statistic used is:

\({{t_{\text{apgar5}} = \frac{\hat{\beta}_{\text{apgar5}}}{\text{SE}(\hat{\beta}_{\text{apgar5}})}}}\)

Where:

\({{\hat{\beta}_{\text{apgar5}}}}\) is the estimated regression coefficient for the five-minute Apgar score.

\({{\text{SE}(\hat{\beta}_{\text{apgar5}})}}\) is the standard error of that estimate.

\({{t_{\text{apgar5}}}}\) is the corresponding t-statistic used to test whether apgar5 is significantly associated with SBP.

\({{t_{\text{apgar5}} = \frac{0.4875}{0.4613} = 1.057}}\)

Conclusion

There is not enough evidence to conclude that five-minute Apgar score is significantly associated with systolic blood pressure after adjusting for gestational age.

e. Comment on the magnitude of \({R^2}\). Does the inclusion of five-minute apgar score in the model already containing gestational age improve your ability to predict systolic blood pressure?

Model Predictive Power

The Multiple R-squared value of 0.089 indicates that the model explains only about 8.9% of the variability in systolic blood pressure. The Adjusted R-squared, which accounts for the number of predictors in the model, is even lower at 7.1%. This suggests that the model has limited explanatory power overall.

When comparing this to a simpler model with gestational age alone, we would expect a small increase in R² from adding five-minute Apgar score as a second predictor. However, the p-value for Apgar score is not statistically significant (p = 0.293), and the marginal gain in R² is minimal.

Conclusion

The inclusion of five-minute Apgar score in a model already containing gestational age does not meaningfully improve the prediction of systolic blood pressure. The modest increase in R² is not enough to justify its inclusion as a valuable predictor in this context.
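To quantify the marginal gain directly, a short sketch (assuming lowbwt is loaded) compares the gestational-age-only model with the full model; for a single added predictor, the partial F-test p-value equals the t-test p-value above (about 0.293):

# Nested-model comparison: does apgar5 add anything beyond gestage?
model_gestage <- lm(sbp ~ gestage, data = lowbwt)
model_full <- lm(sbp ~ gestage + apgar5, data = lowbwt)
summary(model_gestage)$r.squared  # R-squared with gestational age alone
summary(model_full)$r.squared     # 0.0894 with apgar5 added
anova(model_gestage, model_full)  # partial F-test for adding apgar5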

Question 4

Refer to the cdi data set. You have been asked to evaluate two alternative models for predicting the number of active physicians (physicians, your y variable) in a CDI. Proposed model I includes as predictor variables total population (population), land area (area), and total personal income (total). Proposed model II includes as predictor variables population density (total population divided by land area), percent of population greater than 64 years old (pct65), and total personal income (total).

a. Prepare plots for each of the predictor variables. What noteworthy information is provided by your plots?
# Create population density variable
cdi$pop_density <- cdi$population/cdi$area

# -------------------------------------------- Part (a): Plot
# distributions of predictor variables
# --------------------------------------------

# Model I predictors: population, area, total
model1_vars <- c("population", "area", "total")
# Model II predictors: pop_density, pct65, total
model2_vars <- c("pop_density", "pct65", "total")

# Set layout to 1 row, 3 columns
par(mfrow = c(1, 3))

# For Model I: Population Vs. Physicians
plot(cdi$population, cdi$physicians, main = "Population vs. Physicians",
    xlab = "Population", ylab = "Physicians", col = "limegreen", pch = 16)
abline(lm(physicians ~ population, data = cdi), col = "black", lwd = 2)

# Area Vs. Physicians
plot(cdi$area, cdi$physicians, main = "Area vs. Physicians", xlab = "Area",
    ylab = "Physicians", col = "orange", pch = 16)
abline(lm(physicians ~ area, data = cdi), col = "black", lwd = 2)

# Total Income Vs. Physicians
plot(cdi$total, cdi$physicians, main = "Income vs. Physicians", xlab = "Income",
    ylab = "Physicians", col = "purple", pch = 16)
abline(lm(physicians ~ total, data = cdi), col = "black", lwd = 2)

# Reset layout
par(mfrow = c(1, 1))

# Set layout to 1 row, 3 columns
par(mfrow = c(1, 3))

# For Model II: Population Density vs. Physicians
plot(cdi$pop_density, cdi$physicians, main = "Pop Density vs. Physicians",
    xlab = "Pop Density", ylab = "Physicians", col = "turquoise", pch = 16)
abline(lm(physicians ~ pop_density, data = cdi), col = "black", lwd = 2)

# Percent Age 65+ vs. Physicians
plot(cdi$pct65, cdi$physicians, main = "Percent Age 65+ vs. Physicians",
    xlab = "Percent Age 65+", ylab = "Physicians", col = "magenta", pch = 16)
abline(lm(physicians ~ pct65, data = cdi), col = "black", lwd = 2)

# Total Personal Income Vs. Physicians
plot(cdi$total, cdi$physicians, main = "Income vs. Physicians", xlab = "Income",
    ylab = "Physicians", col = "dodgerblue", pch = 16)
abline(lm(physicians ~ total, data = cdi), col = "black", lwd = 2)

# Reset layout
par(mfrow = c(1, 1))

Part 1: Plots of Predictor Variables

Model I: Predictors - population, area, total

• Population: Highly right-skewed; a small number of counties have very large populations.

• Area: Very right-skewed; most counties are small in land area, with a few being very large.

• Total Personal Income: Also skewed to the right; matches the population distribution as expected since more populous counties will typically have higher total income.

Model II: Predictors - pop_density, pct65, total

• Population Density: Extreme right skew; indicates a few very densely populated counties.

• Percent ≥ 65 (pct65): Relatively normally distributed, some slight left skew.

• Total Personal Income: Same interpretation as in Model I.

The histograms of these predictors (sketched below) reveal non-normal distributions and potential outliers, which may affect model assumptions and influence the choice of transformation or robust methods if needed.
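The histogram code itself does not appear above, so here is a minimal sketch that would produce the distributions just described (assuming cdi is loaded and pop_density has been created):

# Histograms of the predictors for both proposed models
par(mfrow = c(2, 3))
hist(cdi$population, main = "Population", xlab = "Population")
hist(cdi$area, main = "Land Area", xlab = "Area")
hist(cdi$total, main = "Total Personal Income", xlab = "Income")
hist(cdi$pop_density, main = "Population Density", xlab = "Pop Density")
hist(cdi$pct65, main = "Percent Age 65+", xlab = "Percent 65+")
par(mfrow = c(1, 1))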

Next, I will compute and display scatterplot matrices and correlation matrices to evaluate linear relationships and multicollinearity between predictors. 

b. Obtain the scatter plot matrix and the correlation matrix for each proposed model. Summarize the information provided.
# -------------------------------------------- Part (b): Scatterplot
# matrix and correlation matrix
# --------------------------------------------

# Model I scatterplot matrix
model1_data <- cdi[, c(model1_vars, "physicians")]
ggpairs(model1_data, title = "Model I: Scatterplot Matrix")

# Model II scatterplot matrix
model2_data <- cdi[, c(model2_vars, "physicians")]
ggpairs(model2_data, title = "Model II: Scatterplot Matrix")

# Model I Correlation
cor_model1 <- round(cor(model1_data), 2)
print("Correlation matrix for Model I:")
Output: [1] "Correlation matrix for Model I:"
print(cor_model1)
Output:            population area total physicians
Output: population       1.00 0.17  0.99       0.94
Output: area             0.17 1.00  0.13       0.08
Output: total            0.99 0.13  1.00       0.95
Output: physicians       0.94 0.08  0.95       1.00
# Model II Correlation
cor_model2 <- round(cor(model2_data), 2)
print("Correlation matrix for Model II:")
Output: [1] "Correlation matrix for Model II:"
print(cor_model2)
Output:             pop_density pct65 total physicians
Output: pop_density        1.00  0.03  0.32       0.41
Output: pct65              0.03  1.00 -0.02       0.00
Output: total              0.32 -0.02  1.00       0.95
Output: physicians         0.41  0.00  0.95       1.00

Part 2: Correlation and Scatterplot Analysis

Model I: Predictors – population, area, total

• Scatterplot Matrix Insights:

• Strong positive association between population and total income.

• physicians increases with both population and total; since those two predictors are themselves nearly collinear, multicollinearity is a concern.

• area shows weak correlation with all variables including physicians.

• Correlation Matrix:

• High correlation between population and total (0.99) suggests multicollinearity.

• Strong correlation between population and physicians (0.94), and between total and physicians (0.95).

• area has negligible correlation with physicians (0.08).

Model II: Predictors – pop_density, pct65, total

• Scatterplot Matrix Insights:

• Moderate positive relationship between pop_density and physicians.

• Very strong correlation between total income and physicians.

• Little or no correlation involving pct65.

• Correlation Matrix:

• total and physicians: very strong correlation (0.95).

• pop_density and physicians: moderate correlation (0.41).

• pct65 has near-zero correlation with physicians and other predictors.

Conclusion

• Multicollinearity is likely a concern in Model I due to the high correlation between population and total.

• Model II has lower multicollinearity, but also includes a weak predictor (pct65).

Next, I will fit both regression models and calculate R² and adjusted R² to compare predictive performance. 

c. For each proposed model, fit the first-order regression model with three predictor variables (Model I: physicians ~ population + area + total; Model II: physicians ~ pop_density + pct65 + total).
# -------------------------------------------- Part (c): Fit
# first-order regression models
# --------------------------------------------

# Model I: physicians ~ population + area + total
model1 <- lm(physicians ~ population + area + total, data = cdi)

# Model II: physicians ~ pop_density + pct65 + total
model2 <- lm(physicians ~ pop_density + pct65 + total, data = cdi)

summary(model1)
Output: 
Output: Call:
Output: lm(formula = physicians ~ population + area + total, data = cdi)
Output: 
Output: Residuals:
Output:     Min      1Q  Median      3Q     Max 
Output: -1855.6  -215.2   -74.6    79.0  3689.0 
Output: 
Output: Coefficients:
Output:               Estimate Std. Error t value Pr(>|t|)    
Output: (Intercept) -1.332e+01  3.537e+01  -0.377 0.706719    
Output: population   8.366e-04  2.867e-04   2.918 0.003701 ** 
Output: area        -6.552e-02  1.821e-02  -3.597 0.000358 ***
Output: total        9.413e-02  1.330e-02   7.078 5.89e-12 ***
Output: ---
Output: Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Output: 
Output: Residual standard error: 560.4 on 436 degrees of freedom
Output: Multiple R-squared:  0.9026,    Adjusted R-squared:  0.902 
Output: F-statistic:  1347 on 3 and 436 DF,  p-value: < 2.2e-16
summary(model2)
Output: 
Output: Call:
Output: lm(formula = physicians ~ pop_density + pct65 + total, data = cdi)
Output: 
Output: Residuals:
Output:      Min       1Q   Median       3Q      Max 
Output: -3055.75  -175.30   -38.05    72.88  3045.81 
Output: 
Output: Coefficients:
Output:               Estimate Std. Error t value Pr(>|t|)    
Output: (Intercept) -1.706e+02  8.353e+01  -2.042   0.0418 *  
Output: pop_density  9.616e-02  1.224e-02   7.857  3.1e-14 ***
Output: pct65        6.340e+00  6.384e+00   0.993   0.3212    
Output: total        1.266e-01  2.084e-03  60.723  < 2e-16 ***
Output: ---
Output: Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Output: 
Output: Residual standard error: 533.5 on 436 degrees of freedom
Output: Multiple R-squared:  0.9117,    Adjusted R-squared:  0.9111 
Output: F-statistic:  1501 on 3 and 436 DF,  p-value: < 2.2e-16

Interpretation & Analysis

Model I: physicians ~ population + area + total. All three predictors are statistically significant at the 0.05 level; the number of physicians increases with population and total income, while larger land area is associated with fewer physicians (R² = 0.903).

Model II: physicians ~ pop_density + pct65 + total. Population density and total income are strong, statistically significant predictors, while pct65 is not (p = 0.321); R² = 0.912.

d. Calculate \({R^2}\) for each model. Is one model clearly preferable in terms of this measure?
# -------------------------------------------- Part (d): Compare R²
# and Adjusted R² --------------------------------------------

cat("Model I R-squared:", summary(model1)$r.squared, "\n")
Output: Model I R-squared: 0.9026432
cat("Model I Adjusted R-squared:", summary(model1)$adj.r.squared, "\n")
Output: Model I Adjusted R-squared: 0.9019733
cat("Model I AIC:", AIC(model1), "\n\n")
Output: Model I AIC: 6823.792
cat("Model II R-squared:", summary(model2)$r.squared, "\n")
Output: Model II R-squared: 0.9117491
cat("Model II Adjusted R-squared:", summary(model2)$adj.r.squared, "\n")
Output: Model II Adjusted R-squared: 0.9111419
cat("Model II AIC:", AIC(model2), "\n\n")
Output: Model II AIC: 6780.585
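The chunk above prints AIC but not BIC, which the comparison table below also lists; a two-line sketch (using stats::BIC on the two fitted models) would produce those values:

cat("Model I BIC:", BIC(model1), "\n")
cat("Model II BIC:", BIC(model2), "\n")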

Model Fitting and R² Comparison

Model      R²       Adjusted R²   AIC       BIC
Model I    0.9026   0.9020        6821.79   6838.14
Model II   0.9117   0.9111        6778.58   6794.93

Variance Inflation Factors
# -------------------------------------------- Additional: VIF to
# assess multicollinearity
# --------------------------------------------

cat("VIFs (Varianace Inflation Factors) for Model I:\n")
Output: VIFs (Varianace Inflation Factors) for Model I:
print(vif(model1))
Output: population       area      total 
Output:  41.636907   1.114256  41.052464
cat("\nVIFs (C=Variance Inflation Factors) for Model II:\n")
Output: 
Output: VIFs (C=Variance Inflation Factors) for Model II:
print(vif(model2))
Output: pop_density       pct65       total 
Output:    1.112730    1.001991    1.112357

Conclusion

Model II is preferable: it has the higher R² (0.912 vs 0.903) and adjusted R², the lower AIC and BIC, and far less multicollinearity (all VIFs near 1, versus VIFs above 40 for population and total in Model I). On R² alone the difference is modest, so the multicollinearity diagnostics are what make the choice clear.

Question 5

Refer to the cdi data set.

a. For each geographic region, regress the number of serious crimes in a CDI (crime, your y variable) against population density (total population divided by land area), per capita personal income (income), and percent high school graduates (graduates). State the estimated regression function.
# Create population density variable
cdi$pop_density <- cdi$population/cdi$area

# Initialize list to store results
region_results <- list()

# Loop through each region and fit linear model.  Note: the loop variable must
# differ from the column name; subset(cdi, region == region) compares the
# column to itself and keeps every row, so all four "regional" fits would be
# identical.
for (r in sort(unique(cdi$region))) {
    region_data <- cdi[cdi$region == r, ]

    # Fit model: crime ~ pop_density + income + graduates
    model <- lm(crime ~ pop_density + income + graduates, data = region_data)

    # Store tidy summary and performance metrics
    region_results[[as.character(r)]] <- list(model = model, coefficients = tidy(model,
        conf.int = TRUE), summary = glance(model))
}

# Display results for each region
for (r in names(region_results)) {
    cat(paste0("\n\n======================\nRegion ", r, "\n======================\n"))
    print(region_results[[r]]$coefficients)
    cat("\nModel Summary:\n")
    print(region_results[[r]]$summary)
}
Output: (Coefficient tables and model summaries print here for Regions 1 through 4;
Output: the per-region estimates are summarized region by region below.)

Regression Equation

\({\text{crime} = \beta_0 + \beta_1 \cdot \text{pop\_density} + \beta_2 \cdot \text{income} + \beta_3 \cdot \text{graduates} + \varepsilon}\)


Estimated Regression Functions by Region

We regressed crime on:

• pop_density (population / area)

• income (per capita personal income)

• graduates (percent high school graduates)

for each of the four regions (labeled 1 through 4). Each region has its own estimated coefficients, summarized below.

Region 1

Estimated Regression Equation

\({\widehat{\text{crime}} = -64{,}466 + 17.38 \cdot \text{pop\_density} - 1.41 \cdot \text{income} + 1182.58 \cdot \text{graduates}}\)

Coefficient Summary:

Predictor     Estimate   95% CI               p-value   Interpretation
Intercept     -64,466    [-148,915, 19,983]   0.133     Not significant
pop_density   17.38      [15.73, 19.04]       < 0.001   Strong positive effect
income        -1.41      [-2.92, 0.10]        0.067     Marginally non-significant
graduates     1182.58    [-89.51, 2454.67]    0.068     Marginally non-significant

Model Fit

R² = 0.835 and adjusted R² = 0.830: the three predictors explain about 83% of the variation in crime in Region 1.

Interpretation

Each one-unit increase in population density is associated with about 17.4 additional serious crimes, holding income and education constant; income and percent high school graduates have only marginal, non-significant effects.

Conclusion for Region 1

Region 1 shows a very strong model, with over 83% of crime variation explained. Population density is the dominant predictor and highly statistically significant. Income and education show marginal effects but are not reliable predictors in this model.

Region 2

Estimated Regression Equation

\({\widehat{\text{crime}} = -4{,}163 + 33.62 \cdot \text{pop\_density} + 0.10 \cdot \text{income} - 2.76 \cdot \text{graduates}}\)

Coefficient Summary

Predictor     Estimate   95% CI               p-value   Interpretation
Intercept     -4,163     [-103,232, 94,906]   0.934     Not meaningful
pop_density   33.62      [25.97, 41.26]       < 0.001   Significant, strongly positive
income        0.10       [-3.21, 3.42]        0.951     Not significant
graduates     -2.76      [-1,608, 1,602]      0.997     Not significant

Model Fit

R² = 0.529 and adjusted R² = 0.515: a moderate fit, driven almost entirely by population density.

Interpretation

Each one-unit increase in population density is associated with about 33.6 additional serious crimes, adjusting for income and education; neither income nor percent high school graduates shows any association.

Conclusion for Region 2

The crime rate in Region 2 is strongly associated with population density, but not with income or education levels. The model is statistically sound overall (F-test), but only one variable (density) is driving this performance. This supports population density as a consistent crime predictor in urban/suburban settings.

Region 3

Estimated Regression Equation

\({\widehat{\text{crime}} = 38{,}863 + 5.54 \cdot \text{pop\_density} + 1.96 \cdot \text{income} - 670.88 \cdot \text{graduates}}\)

Coefficient Summary

Predictor     Estimate   95% CI                p-value   Interpretation
Intercept     38,863     [-21,731, 99,456]     0.207     Not significant
pop_density   5.54       [0.84, 10.23]         0.021     Significant and positive
income        1.96       [-0.21, 4.12]         0.076     Marginally non-significant
graduates     -670.88    [-1681.62, 339.85]    0.192     Not significant

Model Fit

R² = 0.093 and adjusted R² = 0.074: the predictors explain less than 10% of the variation in crime.

Interpretation

Population density remains positively associated with crime, but far more weakly than in Regions 1 and 2; income is marginal and education is not significant.

Conclusion for Region 3

The model is statistically significant overall but explains less than 10% of crime variation, indicating a poor fit. While population density remains a consistent predictor, income and education do not significantly contribute in this region.

Region 4

Estimated Regression Equation

\({\widehat{\text{crime}} = 129{,}323 + 5.72 \cdot \text{pop\_density} + 4.34 \cdot \text{income} - 2159.92 \cdot \text{graduates}}\)

Coefficient Summary

Predictor     Estimate    95% CI                p-value   Interpretation
Intercept     129,323     [-61,119, 319,766]    0.180     Not significant
pop_density   5.72        [-5.79, 17.22]        0.325     Not significant
income        4.34        [-1.10, 9.79]         0.116     Not significant
graduates     -2159.92    [-4857.60, 537.76]    0.115     Not significant

Model Fit

R² = 0.087 and adjusted R² = 0.049: the weakest fit of the four regions.

Interpretation

No predictor is individually significant in Region 4.

Conclusion for Region 4

The model for Region 4 is not statistically significant and explains less than 9% of the variation in crime. None of the predictors — population density, income, or percent high school graduates — are reliable indicators of serious crime levels in this region.

b. Are the estimated regression functions for the four regions similar? Discuss.
# Initialize list to store results
all_results <- list()

# Loop through each region
for (r in sort(unique(cdi$region))) {
    region_data <- filter(cdi, region == r)

    # Fit model
    model <- lm(crime ~ pop_density + income + graduates, data = region_data)

    # Extract coefficients and model summary
    coef_tbl <- tidy(model, conf.int = TRUE)
    model_stats <- glance(model)

    # Merge into a tidy summary
    coef_tbl <- coef_tbl %>%
        mutate(Region = r, r_squared = model_stats$r.squared, adj_r_squared = model_stats$adj.r.squared,
            sigma = model_stats$sigma) %>%
        select(Region, term, estimate, conf.low, conf.high, p.value, r_squared,
            adj_r_squared, sigma)

    all_results[[r]] <- coef_tbl
}

# Combine all regions into one data frame
region_comparison <- bind_rows(all_results)

# View result
print(region_comparison)
Output: # A tibble: 16 × 9
Output:    Region term      estimate conf.low conf.high  p.value r_squared adj_r_squared
Output:     <int> <chr>        <dbl>    <dbl>     <dbl>    <dbl>     <dbl>         <dbl>
Output:  1      1 (Interce… -6.45e+4 -1.49e+5   2.00e+4 1.33e- 1    0.835         0.830 
Output:  2      1 pop_dens…  1.74e+1  1.57e+1   1.90e+1 5.28e-38    0.835         0.830 
Output:  3      1 income    -1.41e+0 -2.92e+0   1.03e-1 6.74e- 2    0.835         0.830 
Output:  4      1 graduates  1.18e+3 -8.95e+1   2.45e+3 6.81e- 2    0.835         0.830 
Output:  5      2 (Interce… -4.16e+3 -1.03e+5   9.49e+4 9.34e- 1    0.529         0.515 
Output:  6      2 pop_dens…  3.36e+1  2.60e+1   4.13e+1 4.79e-14    0.529         0.515 
Output:  7      2 income     1.02e-1 -3.21e+0   3.42e+0 9.51e- 1    0.529         0.515 
Output:  8      2 graduates -2.76e+0 -1.61e+3   1.60e+3 9.97e- 1    0.529         0.515 
Output:  9      3 (Interce…  3.89e+4 -2.17e+4   9.95e+4 2.07e- 1    0.0925        0.0741
Output: 10      3 pop_dens…  5.54e+0  8.39e-1   1.02e+1 2.12e- 2    0.0925        0.0741
Output: 11      3 income     1.96e+0 -2.08e-1   4.12e+0 7.61e- 2    0.0925        0.0741
Output: 12      3 graduates -6.71e+2 -1.68e+3   3.40e+2 1.92e- 1    0.0925        0.0741
Output: 13      4 (Interce…  1.29e+5 -6.11e+4   3.20e+5 1.80e- 1    0.0867        0.0491
Output: 14      4 pop_dens…  5.72e+0 -5.79e+0   1.72e+1 3.25e- 1    0.0867        0.0491
Output: 15      4 income     4.34e+0 -1.10e+0   9.79e+0 1.16e- 1    0.0867        0.0491
Output: 16      4 graduates -2.16e+3 -4.86e+3   5.38e+2 1.15e- 1    0.0867        0.0491
Output: # ℹ 1 more variable: sigma <dbl>

Let’s compare the R² values and coefficients across regions:

Region   R²      MSE        Notes
1        0.835   7.87e+08   Strong model fit; predictors explain most crime variance
2        0.529   1.09e+09   Moderate fit
3        0.093   1.37e+09   Weak explanatory power
4        0.087   6.69e+09   Very weak model fit

This suggests that the relationship between crime and the predictors differs notably across regions, both in strength and possibly in direction (based on coefficient values, which we can detail further).
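One way to formalize this comparison (a sketch, assuming cdi and pop_density as created above) is to test whether region-specific coefficients improve on a single pooled regression:

# Pooled model vs. model with region interactions
pooled <- lm(crime ~ pop_density + income + graduates, data = cdi)
by_region <- lm(crime ~ (pop_density + income + graduates) * factor(region),
    data = cdi)
anova(pooled, by_region)  # a small p-value means the functions differ by region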

c. Calculate MSE and \({R^2}\) for each region. Are these measures similar for the four regions? Discuss.
# Print coefficients for each region
cat("\nEstimated Regression Equations by Region:\n")
Output: 
Output: Estimated Regression Equations by Region:
for (r in names(region_results)) {
    cat(paste0("Region ", r, ":\n"))
    # The list element is named "coefficients" (lowercase); the original
    # $Coefficients returned NULL
    print(region_results[[r]]$coefficients)
    cat("\n")
}
Output: (A tidy coefficient table prints here for each of Regions 1 through 4;
Output: the values match the comparison table above.)
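Since this part asks for MSE and R² specifically, a small sketch (reusing the glance() summaries already stored in region_results; MSE taken as the squared residual standard error, sigma^2) prints them per region:

# Extract R-squared and MSE for each region's model
for (r in names(region_results)) {
    s <- region_results[[r]]$summary  # glance() output stored earlier
    cat(sprintf("Region %s: R-squared = %.3f, MSE = %.3g\n",
        r, s$r.squared, s$sigma^2))
}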

Interpretation

Model Summary

• R²: Indicates how much of the variance in crime is explained. R² > 0.7 = strong model; R² < 0.2 = weak.

• Adjusted R²: Penalized for number of predictors — useful when comparing regions.

• p-value (F-statistic): Shows whether the model overall is statistically significant.

• AIC: Helps compare model fit — lower is better.

Cross-Region Comparison

• Region 1: Strongest relationship (R² = 0.835); population density is a highly significant predictor.

• Region 2: Moderate explanatory power (R² = 0.529).

• Regions 3 & 4: Weak fit (R² below 0.10); the predictors do not explain crime well in these regions.

MSE and R² Summary

• R² values vary widely: from about 0.09 in Region 4 to 0.84 in Region 1.

• MSE increases substantially in Region 4, indicating greater unexplained variance and likely more variability or larger outliers.

Conclusion

The estimated regression functions are not similar across regions. Region 1 shows a strong and reliable model, while Regions 3 and 4 show very poor fits, suggesting that the same predictors do not perform equally well across geographies.