This is the last homework. Part 1 uses linear regression on country-level data. Part 2 uses logistic regression on a medical dataset.
Download AllCountries.csv from the Datasets folder on
Blackboard. The dataset has 217 countries with variables including GDP,
LifeExpectancy, Health, Internet, CO2, Energy, Electricity, and
more.
countries <- read.csv("AllCountries.csv")
head(countries)
## Country Code LandArea Population Density GDP Rural CO2 PumpPrice
## 1 Afghanistan AFG 652.86 37.172 56.9 521 74.5 0.29 0.70
## 2 Albania ALB 27.40 2.866 104.6 5254 39.7 1.98 1.36
## 3 Algeria DZA 2381.74 42.228 17.7 4279 27.4 3.74 0.28
## 4 American Samoa ASM 0.20 0.055 277.3 NA 12.8 NA NA
## 5 Andorra AND 0.47 0.077 163.8 42030 11.9 5.83 NA
## 6 Angola AGO 1246.70 30.810 24.7 3432 34.5 1.29 0.97
## Military Health ArmedForces Internet Cell HIV Hunger Diabetes BirthRate
## 1 3.72 2.01 323 11.4 67.4 NA 30.3 9.6 32.5
## 2 4.08 9.51 9 71.8 123.7 0.1 5.5 10.1 11.7
## 3 13.81 10.73 317 47.7 111.0 0.1 4.7 6.7 22.3
## 4 NA NA NA NA NA NA NA NA NA
## 5 NA 14.02 NA 98.9 104.4 NA NA 8.0 NA
## 6 9.40 5.43 117 14.3 44.7 1.9 23.9 3.9 41.3
## DeathRate ElderlyPop LifeExpectancy FemaleLabor Unemployment Energy
## 1 6.6 2.6 64.0 50.3 1.5 NA
## 2 7.5 13.6 78.5 55.9 13.9 808
## 3 4.8 6.4 76.3 16.4 12.1 1328
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 8.4 2.5 61.8 76.4 7.3 545
## Electricity Developed
## 1 NA NA
## 2 2309 1
## 3 1363 1
## 4 NA NA
## 5 NA NA
## 6 312 1
Fit a simple linear regression model predicting
LifeExpectancy from GDP.
# Your code:
# Using the lm (y~x, data=) function and then the summary function.
my_model <- lm(LifeExpectancy ~ GDP, data = countries)
summary(my_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.352 -3.882 1.550 4.458 9.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
## GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
Report the intercept and slope. What does the slope mean in plain English (e.g., “for every X increase in GDP, life expectancy increases by Y”)? The intercept is 68.42 years. The slope is 2.476e-04 years per unit of GDP. For every 1 unit increase in GDP (I am unsure what the specific unit of GDP is in this dataset), life expectancy increases by 0.00025 years.
What does the R² value tell you about how well GDP explains life expectancy? The R2 value says 42.72% of variation in life expectancy is predicted by GDP. —
Fit a multiple regression predicting LifeExpectancy from
GDP, Health, and Internet.
# Your code:
# Using the lm function again and putting multiple variables in the x position, separating them with +.
my_multi_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = countries)
summary(my_multi_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5662 -1.8227 0.4108 2.5422 9.4161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
## GDP 2.367e-05 2.287e-05 1.035 0.302025
## Health 2.479e-01 6.619e-02 3.745 0.000247 ***
## Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
Interpret the coefficient on Health (controlling
for GDP and Internet). For every 1 unit increase in health
(unsure of units), life expectancy increases by 0.2479 years,
controlling for GDP and internet.
How does the adjusted R² compare to the simple model in Q1? What does that suggest about adding predictors? The adjusted R2 is significantly higher in this model than the simple model (0.7164 compared to 0.4272). Adding predictors explains a greater amount of the variation in life expectancy. —
For the simple model in Q1 (LifeExpectancy ~ GDP):
To check homoscedasticity, I would use a residual plot (plot residuals on y and fitted values on x). Ideally, the points should be randomly scattered around the horizontal line with no funnel pattern.
To check normality of residuals, I would use a Q-Q plot. Ideally, the points should fall very close to the line on the plot.
# Your code:
# Using the plot() function and specifying 1 and 2. 1 gives me the residual plot and 2 gives me the Q-Q plot.
plot(my_model, which=1)
plot(my_model, which=2)
Your reflection: The Q-Q plot looks close to ideal, indicating the residuals follow a normal distribution.
The residual plot is concerning.
This question doesn’t ask about linearity, but the residuals don’t follow a straight line (the line they follow is curved), indicating the assumption of linearity is violated. In terms of homoelasticity, the spread of residuals is wider at lower fitted values and narrower at higher fitted values. The spread is not even. It is a funnel shape, which is the exact opposite of ideal, so homoelasticity is violated. 2/4 assumptions (linearity and equal variance) are violated with this data, indicating that a linear model is not a good fit for this data as-is. The data could potentially be transformed to make a linear model appropriate. —
For the multiple regression in Q2, calculate the RMSE (root mean squared error).
# Hint: sqrt(mean(residuals(model)^2))
# Using formula in hint and inputting the name of my model.
sqrt(mean(residuals(my_multi_model)^2))
## [1] 4.056417
What does the RMSE represent in the context of predicting
life expectancy? How would large residuals for certain
countries affect your confidence in the model? On average, life
expectancy prediction by the model is off by 4.056417 years. Large
residuals significantly decreases confidence in the model because we
square the residuals when calculating SSresiduals, so larger misses
dramatically increase the sum of residuals.
—
Suppose you fit a regression predicting CO2 using both
Energy and Electricity. These two predictors
are highly correlated.
Explain in 2-3 sentences how this multicollinearity could affect (a) the interpretation of the coefficients and (b) the reliability of the model. If two predictor variables are highly correlated, the assumption of no multicollineraity is violated, and a linear model is not a good fit for the data as is. Multicollinearity makes coefficients bounce around because using variables so similar to one another makes it hard for the model to credit the correct variable for a change in y. Multicollineraity also affects reliability because p value can become very large. —
This part uses the Pima Indians Diabetes dataset (768 patients, binary outcome: 0 = no diabetes, 1 = diabetes).
Don’t change this chunk — it loads and cleans the data:
url <- "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
data <- read.csv(url, header = FALSE)
colnames(data) <- c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
"Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome")
data$Outcome <- as.factor(data$Outcome)
# Replace impossible 0 values with NA
data$Glucose[data$Glucose == 0] <- NA
data$BloodPressure[data$BloodPressure == 0] <- NA
data$BMI[data$BMI == 0] <- NA
colSums(is.na(data))
## Pregnancies Glucose BloodPressure
## 0 5 35
## SkinThickness Insulin BMI
## 0 0 11
## DiabetesPedigreeFunction Age Outcome
## 0 0 0
Fit a logistic regression predicting Outcome from
Glucose, BMI, and Age.
# Hint: glm(Outcome ~ Glucose + BMI + Age, data = data, family = "binomial")
# Using the hint and storing my model so I can use the summary function.
model <- glm(Outcome ~ Glucose + BMI + Age, data = data, family = "binomial")
summary(model)
##
## Call:
## glm(formula = Outcome ~ Glucose + BMI + Age, family = "binomial",
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.032377 0.711037 -12.703 < 2e-16 ***
## Glucose 0.035548 0.003481 10.212 < 2e-16 ***
## BMI 0.089753 0.014377 6.243 4.3e-10 ***
## Age 0.028699 0.007809 3.675 0.000238 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 974.75 on 751 degrees of freedom
## Residual deviance: 724.96 on 748 degrees of freedom
## (16 observations deleted due to missingness)
## AIC: 732.96
##
## Number of Fisher Scoring iterations: 4
Get the summary of the model. For each predictor, does an increase RAISE or LOWER the odds of diabetes? Which predictors are significant (p < 0.05)? For each of the predictors (glucose, BMI, and age), an increase in the predictor raises the odds of diabetes. All predictors are significant. —
Use threshold 0.5 to convert predicted probabilities into 0/1 predictions, then build a confusion matrix.
# Hint:
# data$pred_prob <- predict(model, data, type = "response")
# data$pred_class <- ifelse(data$pred_prob > 0.5, 1, 0)
# table(Actual = data$Outcome, Predicted = data$pred_class)
data$pred_prob <- predict(model, data, type = "response")
data$pred_class <- ifelse(data$pred_prob > 0.5, 1, 0)
table(Actual = data$Outcome, Predicted = data$pred_class)
## Predicted
## Actual 0 1
## 0 429 59
## 1 114 150
Report the confusion matrix counts: TP, TN, FP, FN. TP: 150 TN: 429 FP: 59 FN: 114 —
From your confusion matrix, calculate accuracy, precision, and
recall. Accuracy = (TP + TN) / total
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
# Your code:
# Using the formulas I added above and the numbers from the confusion matrix I just reported.
(150+429)/sum(150,429,59,114)
## [1] 0.7699468
150/(150+59)
## [1] 0.7177033
150/(150+114)
## [1] 0.5681818
Report all three values. In a medical screening context, which is more important — precision or recall? Why? Accuracy = 0.7699468 Precision = 0.7177033 Recall = 0.5681818 Recall matters more. Precision does not take into account false negatives. A model can give many false negatives and still have a high precision. Recall does consider false negatives, and false negatives can be very dangerous for a patient. —
Plot the ROC curve and compute the AUC.
# install.packages("pROC") if needed
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Your code:
# Using the roc function and identifying the "outcome" and "pred_prob" columns. Then using the auc function on the roc.
my_roc <- roc(data$Outcome, data$pred_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(my_roc, main = "ROC Curve — Diabetes Model", col = "blue")
auc(my_roc)
## Area under the curve: 0.828
Report the AUC. Is your model closer to random guessing (AUC = 0.5) or perfect (AUC = 1)? Describe its overall performance in one sentence. The AUC is 0.828. This is closer to 1 than it is to 0.5, so my model is not random guessing. Overall, my model is performing well at predicting true positives while minimizing false positives.