This is the last homework. Part 1 uses linear regression on country-level data. Part 2 uses logistic regression on a medical dataset.


Part 1 — Linear Regression: AllCountries dataset

Download AllCountries.csv from the Datasets folder on Blackboard. The dataset has 217 countries with variables including GDP, LifeExpectancy, Health, Internet, CO2, Energy, Electricity, and more.

countries <- read.csv("AllCountries.csv")
head(countries)
##          Country Code LandArea Population Density   GDP Rural  CO2 PumpPrice
## 1    Afghanistan  AFG   652.86     37.172    56.9   521  74.5 0.29      0.70
## 2        Albania  ALB    27.40      2.866   104.6  5254  39.7 1.98      1.36
## 3        Algeria  DZA  2381.74     42.228    17.7  4279  27.4 3.74      0.28
## 4 American Samoa  ASM     0.20      0.055   277.3    NA  12.8   NA        NA
## 5        Andorra  AND     0.47      0.077   163.8 42030  11.9 5.83        NA
## 6         Angola  AGO  1246.70     30.810    24.7  3432  34.5 1.29      0.97
##   Military Health ArmedForces Internet  Cell HIV Hunger Diabetes BirthRate
## 1     3.72   2.01         323     11.4  67.4  NA   30.3      9.6      32.5
## 2     4.08   9.51           9     71.8 123.7 0.1    5.5     10.1      11.7
## 3    13.81  10.73         317     47.7 111.0 0.1    4.7      6.7      22.3
## 4       NA     NA          NA       NA    NA  NA     NA       NA        NA
## 5       NA  14.02          NA     98.9 104.4  NA     NA      8.0        NA
## 6     9.40   5.43         117     14.3  44.7 1.9   23.9      3.9      41.3
##   DeathRate ElderlyPop LifeExpectancy FemaleLabor Unemployment Energy
## 1       6.6        2.6           64.0        50.3          1.5     NA
## 2       7.5       13.6           78.5        55.9         13.9    808
## 3       4.8        6.4           76.3        16.4         12.1   1328
## 4        NA         NA             NA          NA           NA     NA
## 5        NA         NA             NA          NA           NA     NA
## 6       8.4        2.5           61.8        76.4          7.3    545
##   Electricity Developed
## 1          NA        NA
## 2        2309         1
## 3        1363         1
## 4          NA        NA
## 5          NA        NA
## 6         312         1

Q1 — Simple Linear Regression

Fit a simple linear regression model predicting LifeExpectancy from GDP.

# Your code:
Linreg <- lm(LifeExpectancy ~ GDP, data = countries)
  summary(Linreg)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

Report the intercept and slope. What does the slope mean in plain English (e.g., “for every X increase in GDP, life expectancy increases by Y”)?

#The slope means that for every one-unit increase in GDP, life expectancy is expected to change by the slopes average value.

What does the R² value tell you about how well GDP explains life expectancy?

#The R^2 value represents the degree of variation in life expectancy that can be attribuited to GDP. Higher R^2 translates to GDP having a larger toll in differences in life expectancy.

Q2 — Multiple Linear Regression

Fit a multiple regression predicting LifeExpectancy from GDP, Health, and Internet.

# Your code:
Linreg2 <- lm(LifeExpectancy ~ GDP + Health + Internet, data = countries)
  summary(Linreg2)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = countries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

Interpret the coefficient on Health (controlling for GDP and Internet).

#The Health coefficient represents how much life expectancy is expected to change for every one-unit increase in Health but maintaining GDP& Internet the same.

How does the adjusted R² compare to the simple model in Q1? What does that suggest about adding predictors?

#The adjusted R^2 model compres to the simple model as it accounts for more nuance within the facets of the research topic explored. It suggests the importance that accounting for all attributing factors to best explain the differences in life expectancy


Q3 — Checking Assumptions

For the simple model in Q1 (LifeExpectancy ~ GDP):

  1. Briefly describe what you would CHECK to evaluate homoscedasticity and normality of residuals. What would an ideal outcome look like? #In regards to homoscedasticity I checked the residual plot to examine whether the points are randomly scattered or have a distinct pattern. In terms of normality I checked the QQ plot to see if the points follow the straight line. Only if the two can be determined then one can determine whether the assumptions are reasonably met.

  2. Then code your check (residual plot + Q-Q plot of residuals) and reflect on what you see.

# Your code:

plot(Linreg, which = 1)

plot(Linreg, which = 2)

Your reflection: #The residual plot is not scattered as the points follow a curved pattern instead of being evenly scattered around zero. This signifies the relationship between GDP and life expectancy cannot not be seen as perfectly linear. The QQ plot mostly follows the line in the middle, but the points start to move away from the line at both ends. In analyzin the plots I make assertion that the assumptions are only partially met, making the model a moderate but not strong line of reasoning. —

Q4 — Diagnosing Fit (RMSE)

For the multiple regression in Q2, calculate the RMSE (root mean squared error).

# Hint: sqrt(mean(residuals(model)^2))
sqrt(mean(residuals(Linreg2)^2))
## [1] 4.056417

What does the RMSE represent in the context of predicting life expectancy? How would large residuals for certain countries affect your confidence in the model?

RMSE represents how off ones predictions are from the actual average life expectancy values. Larger residuals correlates to the model making bigger prediction errors, so one begins to have less confidence in its predictions.


Q5 — Multicollinearity (no code)

Suppose you fit a regression predicting CO2 using both Energy and Electricity. These two predictors are highly correlated.

Explain in 2-3 sentences how this multicollinearity could affect (a) the interpretation of the coefficients and (b) the reliability of the model. #If Energy and Electricity are highly correlated, it becomes harder to tell which variable is actually affecting CO2 because they contain similar information. This can make the coefficients less reliable and cause the model estimates to change more than expected.

#It is hard to distinguish between which variables are most affecting CO2 when predictaors such as energy and electricity are highly correlated. Leading to coefficients to be less reliable as models fluctuation of change is expected to be larger.

Part 2 — Logistic Regression: Pima Indians Diabetes

This part uses the Pima Indians Diabetes dataset (768 patients, binary outcome: 0 = no diabetes, 1 = diabetes).

Don’t change this chunk — it loads and cleans the data:

url <- "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
data <- read.csv(url, header = FALSE)
colnames(data) <- c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome")
data$Outcome <- as.factor(data$Outcome)

# Replace impossible 0 values with NA
data$Glucose[data$Glucose == 0] <- NA
data$BloodPressure[data$BloodPressure == 0] <- NA
data$BMI[data$BMI == 0] <- NA

colSums(is.na(data))
##              Pregnancies                  Glucose            BloodPressure 
##                        0                        5                       35 
##            SkinThickness                  Insulin                      BMI 
##                        0                        0                       11 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                        0                        0                        0

Q6 — Fit the Logistic Model

Fit a logistic regression predicting Outcome from Glucose, BMI, and Age.

# Hint: glm(Outcome ~ Glucose + BMI + Age, data = data, family = "binomial")
Linreg3 <- glm(Outcome ~ Glucose + BMI + Age,
              data = data,
              family = "binomial")

summary(Linreg3)
## 
## Call:
## glm(formula = Outcome ~ Glucose + BMI + Age, family = "binomial", 
##     data = data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.032377   0.711037 -12.703  < 2e-16 ***
## Glucose      0.035548   0.003481  10.212  < 2e-16 ***
## BMI          0.089753   0.014377   6.243  4.3e-10 ***
## Age          0.028699   0.007809   3.675 0.000238 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 974.75  on 751  degrees of freedom
## Residual deviance: 724.96  on 748  degrees of freedom
##   (16 observations deleted due to missingness)
## AIC: 732.96
## 
## Number of Fisher Scoring iterations: 4

Get the summary of the model. For each predictor, does an increase RAISE or LOWER the odds of diabetes? Which predictors are significant (p < 0.05)?

#Glucose increase raises the odds of diabetes, BMI increase raises the odds of diabetes, and increase in Age.

#The statistically significant predictors are all three again, Glucose, BMI, and Age as they are all less than p < 0.05.


Q7 — Confusion Matrix

Use threshold 0.5 to convert predicted probabilities into 0/1 predictions, then build a confusion matrix.

# Hint:
# data$pred_prob  <- predict(model, data, type = "response")
# data$pred_class <- ifelse(data$pred_prob > 0.5, 1, 0)
# table(Actual = data$Outcome, Predicted = data$pred_class)

data$pred_prob <- predict (Linreg3, data,
    type = "response")

data$pred_class <- ifelse(data$pred_prob > 0.5, 1, 0)
  table(Actual = data$Outcome,
      Predicted = data$pred_class)
##       Predicted
## Actual   0   1
##      0 429  59
##      1 114 150

Report the confusion matrix counts: TP, TN, FP, FN.


Q8 — Accuracy, Precision, Recall

From your confusion matrix, calculate accuracy, precision, and recall.

# Your code:

cm <- table(Actual = data$Outcome, Predicted = data$pred_class)
    TP <- cm[2,2]
  TN <- cm[1,1]
  FP <- cm[1,2]
 FN <- cm[2,1]

accuracy <- (TP + TN) / sum(cm)
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)

accuracy
## [1] 0.7699468
  precision
## [1] 0.7177033
    recall
## [1] 0.5681818

Report all three values. In a medical screening context, which is more important — precision or recall? Why? #[1] 0.7699468 #[2] 0.7177033 #[3] 0.5681818

#In a medical screening test, precision is more important than recall as making any error in collection of biometric data can jeapordize the integreity of the data used in conducting fair statisical analysis.


Q9 — ROC and AUC

Plot the ROC curve and compute the AUC.

# install.packages("pROC") if needed

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# Your code:
roc_curve <- roc(data$Outcome, data$pred_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
  plot(roc_curve) 

  auc(roc_curve)
## Area under the curve: 0.828

Report the AUC. Is your model closer to random guessing (AUC = 0.5) or perfect (AUC = 1)? Describe its overall performance in one sentence.

#The models AUC consists of how well the model separates people with and without diabetes. A value closer to 1 indicates the models perform notably well, while a value closer to 0.5 means it is as significant as a blind guess.