This is the last homework. Part 1 uses linear regression on country-level data. Part 2 uses logistic regression on a medical dataset.


Part 1 — Linear Regression: AllCountries dataset

Download AllCountries.csv from the Datasets folder on Blackboard. The dataset has 217 countries with variables including GDP, LifeExpectancy, Health, Internet, CO2, Energy, Electricity, and more.

countries <- read.csv("AllCountries.csv")
head(countries)
##          Country Code LandArea Population Density   GDP Rural  CO2 PumpPrice
## 1    Afghanistan  AFG   652.86     37.172    56.9   521  74.5 0.29      0.70
## 2        Albania  ALB    27.40      2.866   104.6  5254  39.7 1.98      1.36
## 3        Algeria  DZA  2381.74     42.228    17.7  4279  27.4 3.74      0.28
## 4 American Samoa  ASM     0.20      0.055   277.3    NA  12.8   NA        NA
## 5        Andorra  AND     0.47      0.077   163.8 42030  11.9 5.83        NA
## 6         Angola  AGO  1246.70     30.810    24.7  3432  34.5 1.29      0.97
##   Military Health ArmedForces Internet  Cell HIV Hunger Diabetes BirthRate
## 1     3.72   2.01         323     11.4  67.4  NA   30.3      9.6      32.5
## 2     4.08   9.51           9     71.8 123.7 0.1    5.5     10.1      11.7
## 3    13.81  10.73         317     47.7 111.0 0.1    4.7      6.7      22.3
## 4       NA     NA          NA       NA    NA  NA     NA       NA        NA
## 5       NA  14.02          NA     98.9 104.4  NA     NA      8.0        NA
## 6     9.40   5.43         117     14.3  44.7 1.9   23.9      3.9      41.3
##   DeathRate ElderlyPop LifeExpectancy FemaleLabor Unemployment Energy
## 1       6.6        2.6           64.0        50.3          1.5     NA
## 2       7.5       13.6           78.5        55.9         13.9    808
## 3       4.8        6.4           76.3        16.4         12.1   1328
## 4        NA         NA             NA          NA           NA     NA
## 5        NA         NA             NA          NA           NA     NA
## 6       8.4        2.5           61.8        76.4          7.3    545
##   Electricity Developed
## 1          NA        NA
## 2        2309         1
## 3        1363         1
## 4          NA        NA
## 5          NA        NA
## 6         312         1

Q1 — Simple Linear Regression

Fit a simple linear regression model predicting LifeExpectancy from GDP.

# Your code:

model1 <- lm(LifeExpectancy ~ GDP, data = countries)


summary(model1)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

Report the intercept and slope. What does the slope mean in plain English (e.g., “for every X increase in GDP, life expectancy increases by Y”)? Intercept: 68.42 Slope: 0.0002476

The slope means that for every 1 unit increase in GDP, the predicted life expectancy increases by 0.0002476 years on average. Because GDP is measured in dollars a more useful interpretation is that for every $1,000 increase in GDP per person, the predicted life expectancy increases by about 0.248 yearson average

What does the R² value tell you about how well GDP explains life expectancy? The R² value is 0.4304, which means that approximately 43.0% of the variation in life expectancy across countries is explained by GDP. This tells us that GDP is a moderately strong predictor of life expectancy, but other factors also influence life expectancy —

Q2 — Multiple Linear Regression

Fit a multiple regression predicting LifeExpectancy from GDP, Health, and Internet.

# Your code:
model2 <- lm(LifeExpectancy ~ GDP + Health + Internet,
             data = countries)

summary(model2)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = countries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

Interpret the coefficient on Health (controlling for GDP and Internet). The coefficient for Health is 0.2479, which means that by keeping GDP and Internet use constant, a one unit increase in the Health variable is associated with an increase of 0.248 years in predicted life expectancy

How does the adjusted R² compare to the simple model in Q1? What does that suggest about adding predictors?

The adjusted R² for the multiple regression model is 0.7164 compared with the original 0.4272 for the simple regression model so this means that adding Health and Internet greatly improved the model’s ability to explain differences in life expectancy across countries

Q3 — Checking Assumptions

For the simple model in Q1 (LifeExpectancy ~ GDP):

  1. Briefly describe what you would CHECK to evaluate homoscedasticity and normality of residuals. What would an ideal outcome look like?

To evaluate homoscedasticity, I would examine the residual plot to see whether the residuals are randomly scattered around the horizontal line at zero with no clear pattern or funnel shape. Ideally, the points should have roughly equal spread across all fitted values.

To evaluate normality of residuals, I would examine the Q-Q plot. Ideally, the points should lie close to the reference line, indicating that the residuals are approximately normally distributed.

  1. Then code your check (residual plot + Q-Q plot of residuals) and reflect on what you see.
# Your code:

plot(model1$fitted.values,
     residuals(model1),
     xlab = "Fitted Values",
     ylab = "Residuals")
abline(h = 0, col = "red")

qqnorm(residuals(model1))
qqline(residuals(model1), col = "red")

Your reflection:

The residual plot isn’t completely random. There seems to be a pattern where the residuals get smaller as the fitted values increase which suggests the model may not meet the homoscedasticity assumption perfectly.

The Q-Q plot follows the reference line pretty well in the middle but the points start to drift away from the line at both ends and this suggests the residuals are roughly normal overall but there are some changes in the tails

Q4 — Diagnosing Fit (RMSE)

For the multiple regression in Q2, calculate the RMSE (root mean squared error).

# Hint: sqrt(mean(residuals(model)^2))

rmse <- sqrt(mean(residuals(model2)^2))
rmse
## [1] 4.056417

What does the RMSE represent in the context of predicting life expectancy? How would large residuals for certain countries affect your confidence in the model?

The RMSE of 4.06 means that the model’s predictions are off by about 4 years of life expectancy on average. A lower RMSE indicates that the model makes more accurate predictions.

Large residuals mean the model is making bigger prediction errors for certain countries. If many countries have large residuals, I would have less confidence in the model because it suggests it does not predict life expectancy accurately for those countries

Q5 — Multicollinearity (no code)

Suppose you fit a regression predicting CO2 using both Energy and Electricity. These two predictors are highly correlated.

Explain in 2-3 sentences how this multicollinearity could affect (a) the interpretation of the coefficients and (b) the reliability of the model.

Part 2 — Logistic Regression: Pima Indians Diabetes

This part uses the Pima Indians Diabetes dataset (768 patients, binary outcome: 0 = no diabetes, 1 = diabetes).

Don’t change this chunk — it loads and cleans the data:

url <- "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
data <- read.csv(url, header = FALSE)
colnames(data) <- c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome")
data$Outcome <- as.factor(data$Outcome)

# Replace impossible 0 values with NA
data$Glucose[data$Glucose == 0] <- NA
data$BloodPressure[data$BloodPressure == 0] <- NA
data$BMI[data$BMI == 0] <- NA

colSums(is.na(data))
##              Pregnancies                  Glucose            BloodPressure 
##                        0                        5                       35 
##            SkinThickness                  Insulin                      BMI 
##                        0                        0                       11 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                        0                        0                        0

Q6 — Fit the Logistic Model

Fit a logistic regression predicting Outcome from Glucose, BMI, and Age.

# Hint: glm(Outcome ~ Glucose + BMI + Age, data = data, family = "binomial")

log_model <- glm(Outcome ~ Glucose + BMI + Age, data = data,family = "binomial"
)

summary(log_model)
## 
## Call:
## glm(formula = Outcome ~ Glucose + BMI + Age, family = "binomial", 
##     data = data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.032377   0.711037 -12.703  < 2e-16 ***
## Glucose      0.035548   0.003481  10.212  < 2e-16 ***
## BMI          0.089753   0.014377   6.243  4.3e-10 ***
## Age          0.028699   0.007809   3.675 0.000238 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 974.75  on 751  degrees of freedom
## Residual deviance: 724.96  on 748  degrees of freedom
##   (16 observations deleted due to missingness)
## AIC: 732.96
## 
## Number of Fisher Scoring iterations: 4

Get the summary of the model. For each predictor, does an increase RAISE or LOWER the odds of diabetes? Which predictors are significant (p < 0.05)? All three predictors have positive coefficients, so increasing Glucose, BMI, or Age is associated with higher odds of having diabetes.

All three predictors are statistically significant because their p-values are less than 0.05:

Q7 — Confusion Matrix

Use threshold 0.5 to convert predicted probabilities into 0/1 predictions, then build a confusion matrix.

# Hint:
# data$pred_prob  <- predict(model, data, type = "response")
# data$pred_class <- ifelse(data$pred_prob > 0.5, 1, 0)
# table(Actual = data$Outcome, Predicted = data$pred_class)

data$pred_prob <- predict(log_model, data, type = "response")

data$pred_class <- ifelse(data$pred_prob > 0.5, 1, 0)

table(Actual = data$Outcome,Predicted = data$pred_class)
##       Predicted
## Actual   0   1
##      0 429  59
##      1 114 150

Report the confusion matrix counts: TP, TN, FP, FN. TP = 150 TN = 429 FP = 59 FN = 114 —

Q8 — Accuracy, Precision, Recall

From your confusion matrix, calculate accuracy, precision, and recall.

# Your code:
cm <- table(Actual = data$Outcome,Predicted = data$pred_class)

TN <- cm[1,1]
FP <- cm[1,2]
FN <- cm[2,1]
TP <- cm[2,2]

accuracy <- (TP + TN) / (TP + TN + FP + FN)
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)

accuracy
## [1] 0.7699468
precision
## [1] 0.7177033
recall
## [1] 0.5681818

Report all three values. In a medical screening context, which is more important — precision or recall? Why?

The model has an accuracy of 77.0%, a precision of 71.8%, and a recall of 56.8%

I think Recall is usually more important than precision because it is better to identify as many people with diabetes as possible. Missing someone who actually has diabetes (a false negative) could delay treatment and prevent them from getting the care they need

Q9 — ROC and AUC

Plot the ROC curve and compute the AUC.

# install.packages("pROC") if needed
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# Your code:
roc_curve <- roc(data$Outcome, data$pred_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve)

auc(roc_curve)
## Area under the curve: 0.828

Report the AUC. Is your model closer to random guessing (AUC = 0.5) or perfect (AUC = 1)? Describe its overall performance in one sentence.

The AUC of 0.828 is much closer to 1 than 0.5 which means the model performs well at distinguishing between patients with and without diabetes.

Overall the model has good predictive performance and does a good job classifying diabetes outcomes