This is the last homework. Part 1 uses linear regression on country-level data. Part 2 uses logistic regression on a medical dataset.
Download AllCountries.csv from the Datasets folder on
Blackboard. The dataset has 217 countries with variables including GDP,
LifeExpectancy, Health, Internet, CO2, Energy, Electricity, and
more.
countries <- read.csv("AllCountries.csv")
head(countries)
## Country Code LandArea Population Density GDP Rural CO2 PumpPrice
## 1 Afghanistan AFG 652.86 37.172 56.9 521 74.5 0.29 0.70
## 2 Albania ALB 27.40 2.866 104.6 5254 39.7 1.98 1.36
## 3 Algeria DZA 2381.74 42.228 17.7 4279 27.4 3.74 0.28
## 4 American Samoa ASM 0.20 0.055 277.3 NA 12.8 NA NA
## 5 Andorra AND 0.47 0.077 163.8 42030 11.9 5.83 NA
## 6 Angola AGO 1246.70 30.810 24.7 3432 34.5 1.29 0.97
## Military Health ArmedForces Internet Cell HIV Hunger Diabetes BirthRate
## 1 3.72 2.01 323 11.4 67.4 NA 30.3 9.6 32.5
## 2 4.08 9.51 9 71.8 123.7 0.1 5.5 10.1 11.7
## 3 13.81 10.73 317 47.7 111.0 0.1 4.7 6.7 22.3
## 4 NA NA NA NA NA NA NA NA NA
## 5 NA 14.02 NA 98.9 104.4 NA NA 8.0 NA
## 6 9.40 5.43 117 14.3 44.7 1.9 23.9 3.9 41.3
## DeathRate ElderlyPop LifeExpectancy FemaleLabor Unemployment Energy
## 1 6.6 2.6 64.0 50.3 1.5 NA
## 2 7.5 13.6 78.5 55.9 13.9 808
## 3 4.8 6.4 76.3 16.4 12.1 1328
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 8.4 2.5 61.8 76.4 7.3 545
## Electricity Developed
## 1 NA NA
## 2 2309 1
## 3 1363 1
## 4 NA NA
## 5 NA NA
## 6 312 1
Fit a simple linear regression model predicting
LifeExpectancy from GDP.
model1 <- lm(LifeExpectancy ~ GDP, data = countries)
summary(model1)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.352 -3.882 1.550 4.458 9.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
## GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
Report the intercept and slope. What does the slope mean in plain English (e.g., “for every X increase in GDP, life expectancy increases by Y”)?
The intercept is about 68.4 years and the slope on GDP is about 0.000248. The slope means that for every $1 increase in GDP per capita, predicted life expectancy increases by roughly 0.000248 years — i.e., for every additional $1,000 in GDP per capita, life expectancy rises by about 0.25 years.
What does the R² value tell you about how well GDP explains life expectancy?
The R² is about 0.43, meaning GDP alone explains roughly 43% of the variation in life expectancy across countries. That is a moderate relationship — GDP matters, but more than half of the variation is driven by other factors.
Fit a multiple regression predicting LifeExpectancy from
GDP, Health, and Internet.
model2 <- lm(LifeExpectancy ~ GDP + Health + Internet, data = countries)
summary(model2)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5662 -1.8227 0.4108 2.5422 9.4161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
## GDP 2.367e-05 2.287e-05 1.035 0.302025
## Health 2.479e-01 6.619e-02 3.745 0.000247 ***
## Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
Interpret the coefficient on Health (controlling
for GDP and Internet).
Holding GDP and Internet constant, the Health
coefficient (about 0.25) means that each one-unit increase in health
spending is associated with roughly a 0.25-year increase in predicted
life expectancy, and this effect is statistically significant (p ≈
0.0003).
How does the adjusted R² compare to the simple model in Q1? What does that suggest about adding predictors?
The adjusted R² rises substantially (to about 0.72, vs ~0.43 in Q1).
Adding Health and Internet explains much more
of the variation in life expectancy, so these predictors meaningfully
improve the model — Internet access in particular is a strong predictor
(GDP becomes non-significant once they are included).
For the simple model in Q1 (LifeExpectancy ~ GDP):
For homoscedasticity, I would look at a residuals-vs-fitted plot: ideally the residuals are randomly scattered around 0 with roughly constant spread across all fitted values (no funnel/fan shape). For normality of residuals, I would look at a Q-Q plot: ideally the points fall close to the straight diagonal line.
par(mfrow = c(1, 2))
# Residuals vs Fitted (homoscedasticity)
plot(fitted(model1), resid(model1),
xlab = "Fitted values", ylab = "Residuals",
main = "Residuals vs Fitted")
abline(h = 0, col = "red")
# Q-Q plot (normality)
qqnorm(resid(model1)); qqline(resid(model1), col = "red")
Your reflection:
The residuals-vs-fitted plot shows a clear funnel pattern (more spread at lower fitted values), which suggests heteroscedasticity — the constant-variance assumption is violated. The Q-Q plot bends away from the line in the tails (especially the lower tail), indicating the residuals are left-skewed rather than perfectly normal. So the simple linear model does not fully satisfy the regression assumptions.
For the multiple regression in Q2, calculate the RMSE (root mean squared error).
rmse <- sqrt(mean(residuals(model2)^2))
rmse
## [1] 4.056417
What does the RMSE represent in the context of predicting life expectancy? How would large residuals for certain countries affect your confidence in the model?
The RMSE (about 4.1 years) is the typical size of the model’s prediction error — on average, the model’s life-expectancy predictions are off by roughly 4.1 years. Large residuals for certain countries mean the model predicts those countries poorly (they are outliers/unusual cases), which lowers confidence that the model generalizes well and signals that important factors for those countries are missing.
Suppose you fit a regression predicting CO2 using both
Energy and Electricity. These two predictors
are highly correlated.
Explain in 2-3 sentences how this multicollinearity could affect (a) the interpretation of the coefficients and (b) the reliability of the model.
Energy and Electricity move
together, the model cannot separate their individual effects, so the
coefficients become unstable and hard to interpret — they can even flip
sign or appear non-significant even though both relate to CO2. (b)
Multicollinearity inflates the standard errors of the coefficients,
making the estimates imprecise and sensitive to small changes in the
data; the overall predictions may still be fine, but the individual
coefficient estimates are unreliable.This part uses the Pima Indians Diabetes dataset (768 patients, binary outcome: 0 = no diabetes, 1 = diabetes).
Don’t change this chunk — it loads and cleans the data:
url <- "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
data <- read.csv(url, header = FALSE)
colnames(data) <- c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
"Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome")
data$Outcome <- as.factor(data$Outcome)
# Replace impossible 0 values with NA
data$Glucose[data$Glucose == 0] <- NA
data$BloodPressure[data$BloodPressure == 0] <- NA
data$BMI[data$BMI == 0] <- NA
colSums(is.na(data))
## Pregnancies Glucose BloodPressure
## 0 5 35
## SkinThickness Insulin BMI
## 0 0 11
## DiabetesPedigreeFunction Age Outcome
## 0 0 0
Fit a logistic regression predicting Outcome from
Glucose, BMI, and Age.
model_log <- glm(Outcome ~ Glucose + BMI + Age, data = data, family = "binomial")
summary(model_log)
##
## Call:
## glm(formula = Outcome ~ Glucose + BMI + Age, family = "binomial",
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.032377 0.711037 -12.703 < 2e-16 ***
## Glucose 0.035548 0.003481 10.212 < 2e-16 ***
## BMI 0.089753 0.014377 6.243 4.3e-10 ***
## Age 0.028699 0.007809 3.675 0.000238 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 974.75 on 751 degrees of freedom
## Residual deviance: 724.96 on 748 degrees of freedom
## (16 observations deleted due to missingness)
## AIC: 732.96
##
## Number of Fisher Scoring iterations: 4
Get the summary of the model. For each predictor, does an increase RAISE or LOWER the odds of diabetes? Which predictors are significant (p < 0.05)?
All three coefficients are positive, so an increase in
Glucose, BMI, or Age each RAISES
the odds of diabetes. All three are statistically significant (p <
0.05) — Glucose and BMI very strongly (p <
0.001), and Age as well (p ≈ 0.0002).
Use threshold 0.5 to convert predicted probabilities into 0/1 predictions, then build a confusion matrix.
data$pred_prob <- predict(model_log, data, type = "response")
data$pred_class <- ifelse(data$pred_prob > 0.5, 1, 0)
cm <- table(Actual = data$Outcome, Predicted = data$pred_class)
cm
## Predicted
## Actual 0 1
## 0 429 59
## 1 114 150
Report the confusion matrix counts: TP, TN, FP, FN.
Treating Outcome = 1 (diabetes) as positive: TN = 429, FP = 59, FN = 114, TP = 150. True negatives are correctly predicted non-diabetics, true positives correctly predicted diabetics, false positives are healthy patients flagged as diabetic, and false negatives are diabetics the model missed.
From your confusion matrix, calculate accuracy, precision, and recall.
TN <- cm[1, 1]; FP <- cm[1, 2]
FN <- cm[2, 1]; TP <- cm[2, 2]
accuracy <- (TP + TN) / sum(cm)
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
round(c(accuracy = accuracy, precision = precision, recall = recall), 3)
## accuracy precision recall
## 0.770 0.718 0.568
Report all three values. In a medical screening context, which is more important — precision or recall? Why?
Accuracy ≈ 0.77, precision ≈ 0.72, recall ≈ 0.57. In medical screening, recall is more important: it measures how many actual diabetics the model catches. A false negative (missing a real case) is far more dangerous than a false positive (a healthy person flagged for follow-up testing), so we want to minimize missed cases even at the cost of some extra false alarms.
Plot the ROC curve and compute the AUC.
# install.packages("pROC") if needed
library(pROC)
roc_obj <- roc(data$Outcome, data$pred_prob)
plot(roc_obj, main = "ROC Curve", col = "blue")
auc(roc_obj)
## Area under the curve: 0.828
Report the AUC. Is your model closer to random guessing (AUC = 0.5) or perfect (AUC = 1)? Describe its overall performance in one sentence.
The AUC is about 0.83, which is much closer to perfect (1) than to random guessing (0.5). This means the model has good discriminating ability — it correctly ranks a randomly chosen diabetic patient as higher-risk than a randomly chosen non-diabetic about 83% of the time.