This is the last homework. Part 1 uses linear regression on country-level data. Part 2 uses logistic regression on a medical dataset.
Download AllCountries.csv from the Datasets folder on
Blackboard. The dataset has 217 countries with variables including GDP,
LifeExpectancy, Health, Internet, CO2, Energy, Electricity, and
more.
countries <- read.csv("AllCountries.csv")
head(countries)
## Country Code LandArea Population Density GDP Rural CO2 PumpPrice
## 1 Afghanistan AFG 652.86 37.172 56.9 521 74.5 0.29 0.70
## 2 Albania ALB 27.40 2.866 104.6 5254 39.7 1.98 1.36
## 3 Algeria DZA 2381.74 42.228 17.7 4279 27.4 3.74 0.28
## 4 American Samoa ASM 0.20 0.055 277.3 NA 12.8 NA NA
## 5 Andorra AND 0.47 0.077 163.8 42030 11.9 5.83 NA
## 6 Angola AGO 1246.70 30.810 24.7 3432 34.5 1.29 0.97
## Military Health ArmedForces Internet Cell HIV Hunger Diabetes BirthRate
## 1 3.72 2.01 323 11.4 67.4 NA 30.3 9.6 32.5
## 2 4.08 9.51 9 71.8 123.7 0.1 5.5 10.1 11.7
## 3 13.81 10.73 317 47.7 111.0 0.1 4.7 6.7 22.3
## 4 NA NA NA NA NA NA NA NA NA
## 5 NA 14.02 NA 98.9 104.4 NA NA 8.0 NA
## 6 9.40 5.43 117 14.3 44.7 1.9 23.9 3.9 41.3
## DeathRate ElderlyPop LifeExpectancy FemaleLabor Unemployment Energy
## 1 6.6 2.6 64.0 50.3 1.5 NA
## 2 7.5 13.6 78.5 55.9 13.9 808
## 3 4.8 6.4 76.3 16.4 12.1 1328
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 8.4 2.5 61.8 76.4 7.3 545
## Electricity Developed
## 1 NA NA
## 2 2309 1
## 3 1363 1
## 4 NA NA
## 5 NA NA
## 6 312 1
Fit a simple linear regression model predicting
LifeExpectancy from GDP.
# Your code:
model1 <- lm(LifeExpectancy ~ GDP, data = countries)
summary(model1)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.352 -3.882 1.550 4.458 9.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
## GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
Report the intercept and slope. What does the slope mean in plain English (e.g., “for every X increase in GDP, life expectancy increases by Y”)?
Intercept = 68.42 Slope = .0002476
Life expectancy is predicted to increase by .0002476 years for every one unit increase in GDP.
What does the R² value tell you about how well GDP explains life expectancy?
Fit a multiple regression predicting LifeExpectancy from
GDP, Health, and Internet.
# Your code:
model2 <- lm(LifeExpectancy ~ GDP + Health + Internet, data = countries)
summary(model2)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5662 -1.8227 0.4108 2.5422 9.4161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
## GDP 2.367e-05 2.287e-05 1.035 0.302025
## Health 2.479e-01 6.619e-02 3.745 0.000247 ***
## Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
Interpret the coefficient on Health (controlling
for GDP and Internet).
Health coefficient = .2479 Given that the levels of GDP and internet remain constant, the coefficient of Health equals .2479. This implies that for a one unit increment in health spending, there is a 1/4-year increment in life expectancy.
How does the adjusted R² compare to the simple model in Q1? What does that suggest about adding predictors?
Simple model R² = .4272 Adjusted R² = .7164 The value of R² went from .4272 in the simple model to .7164 in the adjusted model. This indicates that the inclusion of both health and internet has significantly enhanced the explanatory capability of the model.
For the simple model in Q1 (LifeExpectancy ~ GDP):
In order to check homoscedasticity, we must look at whether residuals are spread evenly across values. In order to check normality of residuals we must check to see whether the residuals follow a straight line in the Q-Q plot.
# Your code:
par(mfrow = c(1,2))
plot(model1$fitted.values, model1$residuals,
main = "Residual Plot",
xlab = "Fitted Plot",
ylab = "Residuals")
qqnorm(model2$residuals)
qqline(model2$residuals, col = "red")
Your reflection:
The residual plot should show no clear pattern for the model to be appropriate. The Q-Q plot should roughly follow the line, which indicates normality. —
For the multiple regression in Q2, calculate the RMSE (root mean squared error).
# Hint: sqrt(mean(residuals(model)^2))
sqrt(mean(model2$residuals^2))
## [1] 4.056417
What does the RMSE represent in the context of predicting life expectancy? How would large residuals for certain countries affect your confidence in the model?
Suppose you fit a regression predicting CO2 using both
Energy and Electricity. These two predictors
are highly correlated.
Explain in 2-3 sentences how this multicollinearity could affect (a) the interpretation of the coefficients and (b) the reliability of the model.
This part uses the Pima Indians Diabetes dataset (768 patients, binary outcome: 0 = no diabetes, 1 = diabetes).
Don’t change this chunk — it loads and cleans the data:
url <- "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
data <- read.csv(url, header = FALSE)
colnames(data) <- c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
"Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome")
data$Outcome <- as.factor(data$Outcome)
# Replace impossible 0 values with NA
data$Glucose[data$Glucose == 0] <- NA
data$BloodPressure[data$BloodPressure == 0] <- NA
data$BMI[data$BMI == 0] <- NA
colSums(is.na(data))
## Pregnancies Glucose BloodPressure
## 0 5 35
## SkinThickness Insulin BMI
## 0 0 11
## DiabetesPedigreeFunction Age Outcome
## 0 0 0
Fit a logistic regression predicting Outcome from
Glucose, BMI, and Age.
# Hint: glm(Outcome ~ Glucose + BMI + Age, data = data, family = "binomial")
model <- glm(Outcome ~ Glucose + BMI + Age, data = data, family = "binomial")
summary(model)
##
## Call:
## glm(formula = Outcome ~ Glucose + BMI + Age, family = "binomial",
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.032377 0.711037 -12.703 < 2e-16 ***
## Glucose 0.035548 0.003481 10.212 < 2e-16 ***
## BMI 0.089753 0.014377 6.243 4.3e-10 ***
## Age 0.028699 0.007809 3.675 0.000238 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 974.75 on 751 degrees of freedom
## Residual deviance: 724.96 on 748 degrees of freedom
## (16 observations deleted due to missingness)
## AIC: 732.96
##
## Number of Fisher Scoring iterations: 4
Get the summary of the model. For each predictor, does an increase RAISE or LOWER the odds of diabetes? Which predictors are significant (p < 0.05)? An increase of Glucose (p < .05) strongly increase the likelihood of diabetes. A higher BMI (p < .05) also will likely result in a higher risk of diabetes. Older age (p < .05) is linked to higher odds of diabetes. As shown above, all three predictor’s p values are less than .05, indicating that they contribute to the predicting diabetes model. —
Use threshold 0.5 to convert predicted probabilities into 0/1 predictions, then build a confusion matrix.
# Hint:
# data$pred_prob <- predict(model, data, type = "response")
# data$pred_class <- ifelse(data$pred_prob > 0.5, 1, 0)
# table(Actual = data$Outcome, Predicted = data$pred_class)
clean_data <- na.omit(data)
model <- glm(Outcome ~ Glucose + BMI + Age,
data = clean_data,
family = "binomial")
clean_data$pred_prob <- predict(model, type = "response")
clean_data$pred_class <- ifelse(clean_data$pred_prob > .5, 1, 0)
cm <- table(Actual = clean_data$Outcome,
Predicted = clean_data$pred_class)
cm
## Predicted
## Actual 0 1
## 0 419 56
## 1 108 141
Report the confusion matrix counts: TP, TN, FP, FN. TP = 141 TN = 419 FP = 56 FN = 108 —
From your confusion matrix, calculate accuracy, precision, and recall.
# Your code:
TP <- 141
TN <- 419
FP <- 56
FN <- 108
accuracy <- (TP + TN) / (TP + TN + FP + FN)
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
accuracy
## [1] 0.7734807
precision
## [1] 0.715736
recall
## [1] 0.5662651
Report all three values. In a medical screening context, which is more important — precision or recall? Why? Accuracy of the model is 0.773. This means that the model is accurate 77% of the time. Precision of the model is 0.716, which implies that when it predicts diabetes, then 72% of the time it is accurate. Recall of the model is 0.566, which implies that it is able to identify about 57% of people having diabetes. Recall becomes more significant in medical screening as it becomes preferable to detect as many sick patients as possible, even at the cost of wrongly identifying healthy individuals.
Plot the ROC curve and compute the AUC.
# install.packages("pROC") if needed
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Your code:
Report the AUC. Is your model closer to random guessing (AUC = 0.5) or perfect (AUC = 1)? Describe its overall performance in one sentence.