Homework 6 — Linear and Logistic Regression

This is the last homework. Part 1 uses linear regression on country-level data. Part 2 uses logistic regression on a medical dataset.

Part 1 — Linear Regression: AllCountries dataset

Download AllCountries.csv from the Datasets folder on Blackboard. The dataset has 217 countries with variables including GDP, LifeExpectancy, Health, Internet, CO2, Energy, Electricity, and more.

countries <- read_csv("AllCountries.csv")

## Rows: 217 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): Country, Code
## dbl (24): LandArea, Population, Density, GDP, Rural, CO2, PumpPrice, Militar...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(countries)

## # A tibble: 6 × 26
##   Country Code  LandArea Population Density   GDP Rural   CO2 PumpPrice Military
##   <chr>   <chr>    <dbl>      <dbl>   <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
## 1 Afghan… AFG     653.       37.2      56.9   521  74.5  0.29      0.7      3.72
## 2 Albania ALB      27.4       2.87    105.   5254  39.7  1.98      1.36     4.08
## 3 Algeria DZA    2382.       42.2      17.7  4279  27.4  3.74      0.28    13.8 
## 4 Americ… ASM       0.2       0.055   277.     NA  12.8 NA        NA       NA   
## 5 Andorra AND       0.47      0.077   164.  42030  11.9  5.83     NA       NA   
## 6 Angola  AGO    1247.       30.8      24.7  3432  34.5  1.29      0.97     9.4 
## # ℹ 16 more variables: Health <dbl>, ArmedForces <dbl>, Internet <dbl>,
## #   Cell <dbl>, HIV <dbl>, Hunger <dbl>, Diabetes <dbl>, BirthRate <dbl>,
## #   DeathRate <dbl>, ElderlyPop <dbl>, LifeExpectancy <dbl>, FemaleLabor <dbl>,
## #   Unemployment <dbl>, Energy <dbl>, Electricity <dbl>, Developed <dbl>

Q1 — Simple Linear Regression

Fit a simple linear regression model predicting LifeExpectancy from GDP.

# Your code:

model <- lm(countries$LifeExpectancy~ countries$GDP, data = countries)
summary(model)

## 
## Call:
## lm(formula = countries$LifeExpectancy ~ countries$GDP, data = countries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.842e+01  5.415e-01  126.36   <2e-16 ***
## countries$GDP 2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

Report the intercept and slope. What does the slope mean in plain English (e.g., “for every X increase in GDP, life expectancy increases by Y”)?

For every $1 increase in GDP per capita, life expectancy is predicted to increase by 0.000248 years on average.

What does the R² value tell you about how well GDP explains life expectancy?

GDP alone explains about 43% of the variation in life expectancy across countries. This is a moderately strong relationship, but a lot of variation (57%) is still left unexplained by GDP.

Q2 — Multiple Linear Regression

Fit a multiple regression predicting LifeExpectancy from GDP, Health, and Internet.

# Your code:
model1 <- lm(countries$LifeExpectancy~ countries$GDP + countries$Health + countries$Internet, data = countries)
summary(model1)

## 
## Call:
## lm(formula = countries$LifeExpectancy ~ countries$GDP + countries$Health + 
##     countries$Internet, data = countries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        5.908e+01  8.149e-01  72.499  < 2e-16 ***
## countries$GDP      2.367e-05  2.287e-05   1.035 0.302025    
## countries$Health   2.479e-01  6.619e-02   3.745 0.000247 ***
## countries$Internet 1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

Interpret the coefficient on Health (controlling for GDP and Internet). Holding GDP and Internet usage constant, for every 1 percentage point increase in health expenditure (as % of GDP), life expectancy is predicted to increase by 0.248 years on average.

How does the adjusted R² compare to the simple model in Q1? What does that suggest about adding predictors?

Adding Health and Internet substantially improved the model. The three predictors together explain about 71.6% of the variation in life expectancy

Q3 — Checking Assumptions

For the simple model in Q1 (LifeExpectancy ~ GDP):

Briefly describe what you would CHECK to evaluate homoscedasticity and normality of residuals. What would an ideal outcome look like?
Then code your check (residual plot + Q-Q plot of residuals) and reflect on what you see.

# Your code:

plot(model, which = 1)

plot(model, which = 2)

Your reflection:

Homoscedasticity: Look for random scatter (no funnel shape) in Residuals vs Fitted. Normality: Points should roughly follow the diagonal line in Q-Q plot. however there is some deviation in the tail.

Q4 — Diagnosing Fit (RMSE)

For the multiple regression in Q2, calculate the RMSE (root mean squared error).

# Hint: sqrt(mean(residuals(model)^2))

sqrt(mean(residuals(model)^2))

## [1] 5.868172

What does the RMSE represent in the context of predicting life expectancy? How would large residuals for certain countries affect your confidence in the model?

On average, the model’s predictions of life expectancy are off by about 5.87 years

Large residuals for specific countries mean the model is making much larger prediction errors for those nations than for others. This would reduce our confidence in the model. The model is less reliable when making predictions for outlier countries.

Q5 — Multicollinearity (no code)

Suppose you fit a regression predicting CO2 using both Energy and Electricity. These two predictors are highly correlated.

Explain in 2-3 sentences how this multicollinearity could affect (a) the interpretation of the coefficients and (b) the reliability of the model.

High correlation between Energy and Electricity makes it hard to separate their individual effects. Coefficients become unstable, large standard errors, insignificant p-values even if jointly important.

b) Interpretation of “effect of Energy holding Electricity constant” becomes unreliable.

Part 2 — Logistic Regression: Pima Indians Diabetes

This part uses the Pima Indians Diabetes dataset (768 patients, binary outcome: 0 = no diabetes, 1 = diabetes).

Don’t change this chunk — it loads and cleans the data:

url <- "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
data <- read.csv(url, header = FALSE)
colnames(data) <- c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome")
data$Outcome <- as.factor(data$Outcome)

# Replace impossible 0 values with NA
data$Glucose[data$Glucose == 0] <- NA
data$BloodPressure[data$BloodPressure == 0] <- NA
data$BMI[data$BMI == 0] <- NA

colSums(is.na(data))

##              Pregnancies                  Glucose            BloodPressure 
##                        0                        5                       35 
##            SkinThickness                  Insulin                      BMI 
##                        0                        0                       11 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                        0                        0                        0

Q6 — Fit the Logistic Model

Fit a logistic regression predicting Outcome from Glucose, BMI, and Age.

# Hint: glm(Outcome ~ Glucose + BMI + Age, data = data, family = "binomial")

model_log <- glm(Outcome ~ Glucose + BMI + Age, data = data, family = "binomial")

summary(model_log)

## 
## Call:
## glm(formula = Outcome ~ Glucose + BMI + Age, family = "binomial", 
##     data = data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.032377   0.711037 -12.703  < 2e-16 ***
## Glucose      0.035548   0.003481  10.212  < 2e-16 ***
## BMI          0.089753   0.014377   6.243  4.3e-10 ***
## Age          0.028699   0.007809   3.675 0.000238 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 974.75  on 751  degrees of freedom
## Residual deviance: 724.96  on 748  degrees of freedom
##   (16 observations deleted due to missingness)
## AIC: 732.96
## 
## Number of Fisher Scoring iterations: 4

Get the summary of the model. For each predictor, does an increase RAISE or LOWER the odds of diabetes? Which predictors are significant (p < 0.05)?

Glucose: Raises the odds of diabetes (positive coefficient). BMI: Raises the odds of diabetes (positive coefficient). Age: Raises the odds of diabetes (positive coefficient).

Which predictors are significant (p < 0.05)?

Glucose: Significant (p < 0.0000000000000002) BMI: Significant (p = 0.00000000043) Age: Significant (p = 0.000238)

All of them are statistically significant

Q7 — Confusion Matrix

Use threshold 0.5 to convert predicted probabilities into 0/1 predictions, then build a confusion matrix.

# Hint:
 data$pred_prob  <- predict(model_log, data, type = "response")
 data$pred_class <- ifelse(data$pred_prob > 0.5, 1, 0)
 table(Actual = data$Outcome, Predicted = data$pred_class)

##       Predicted
## Actual   0   1
##      0 429  59
##      1 114 150

Report the confusion matrix counts: TP, TN, FP, FN.

(TN) = 429 (FP) = 59 (FN) = 114 (TP) = 150

Q8 — Accuracy, Precision, Recall

From your confusion matrix, calculate accuracy, precision, and recall.

# Your code:

accuracy <- (429 + 150) / (429 + 59 + 114 + 150)
precision <- 150 / (150 + 59)
recall <- 150 / (150 + 114)

Report all three values. In a medical screening context, which is more important — precision or recall? Why? Accuracy = (77.0%) Precision = (71.8%) Recall = (56.8%)

Recall is more important in this context. Missing someone who actually has diabetes (False Negative) is more dangerous than incorrectly flagging someone who does not have it (False Positive). A high recall ensures that most people with diabetes are correctly identified and can receive treatment

Q9 — ROC and AUC

Plot the ROC curve and compute the AUC.

# install.packages("pROC") if needed
library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

# Your code:
roc_obj <- roc(data$Outcome, data$pred_prob)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

plot(roc_obj)

auc(roc_obj)

## Area under the curve: 0.828

Report the AUC. Is your model closer to random guessing (AUC = 0.5) or perfect (AUC = 1)? Describe its overall performance in one sentence. The model performs quite well overall. An AUC of 0.828 is much better than random guessing (AUC = 0.5). In fact, it is closer to a perfect model. the model can reasonably distinguish between people who have diabetes and those who do not.