DATA 101 HW 9

library(tidyverse)

## Warning: package 'ggplot2' was built under R version 4.4.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Task 1

all_countries <- read_csv("C:/Users/desir_7411ic3/Downloads/AllCountries.csv")

Task 2

slm2 <- lm(LifeExpectancy ~ GDP, data=all_countries)

summary(slm2)

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = all_countries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

The R-squared value of 0.4304 indicates that this linear regression model explains about 43% of the variance in life expectancy, which practically means that a country’s GDP is a statistically significant factor in life expectancy but not a great predictor on its own. The intercept of 68.42 means that, according to this model, a country with absolutely no domestic output (GDP of zero) would have an average life expectancy of 68.42 years; since no country has a GDP of zero, this is an extrapolation rather than a realistic prediction. The slope coefficient (0.0002476) means that each additional dollar of GDP is associated with an increase of 0.0002476 years in average life expectancy, or about 2.2 hours.
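The slope is easier to read after rescaling it. The lines below are a minimal sketch that uses only the rounded estimates printed in the summary above; the variable names and the example GDP value of 10,000 are my own choices for illustration.

```r
# Rounded estimates copied from the summary(slm2) output
intercept <- 6.842e+01
slope     <- 2.476e-04

# Convert the per-dollar slope from years to hours (365.25 days per year)
hours_per_dollar <- slope * 365.25 * 24

# The same slope expressed per 1,000 dollars of GDP, in years
years_per_1000 <- slope * 1000

# Predicted life expectancy at a hypothetical GDP of 10,000
predicted_at_10000 <- intercept + slope * 10000

round(c(hours_per_dollar   = hours_per_dollar,
        years_per_1000     = years_per_1000,
        predicted_at_10000 = predicted_at_10000), 3)
```

Reading the slope per 1,000 dollars (about a quarter of a year) is usually more practical than reading it per single dollar.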

Task 3

mlr3 <- lm(LifeExpectancy ~ GDP + Health + Internet, all_countries)

summary(mlr3)

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = all_countries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

Controlling for GDP and Internet, the Health coefficient is 0.2479. This means that each additional percentage point of total government expenditures on healthcare is associated with 0.2479 more years of average life expectancy, or just under three months. The adjusted R-squared value is 0.7164, meaning that the model explains about 71.64% of the variance in LifeExpectancy after adjusting for the number of predictors. This is a substantial improvement over 42.72% from the simple model, indicating that the combination of GDP, health expenditures, and Internet access forms a decent predictor of average life expectancy. GDP itself is no longer statistically significant (p = 0.302) once Health and Internet are in the model.
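As a check on the two adjusted R-squared values being compared, they can be recomputed from the printed R-squared values and degrees of freedom. This is a sketch using the standard adjustment formula; `adj_r2` is a helper name I made up, and the sample sizes are inferred from the residual degrees of freedom in the two summaries (177 + 2 and 169 + 4).

```r
# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# where n is the number of rows used and p the number of predictors
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Simple model: R^2 = 0.4304, 177 residual df, 1 predictor
adj_simple <- adj_r2(0.4304, n = 177 + 2, p = 1)

# Multiple model: R^2 = 0.7213, 169 residual df, 3 predictors
adj_multiple <- adj_r2(0.7213, n = 169 + 4, p = 3)

round(c(simple = adj_simple, multiple = adj_multiple), 4)
```

One caveat: the two models were fit on different rows (38 versus 44 observations dropped for missingness), so the comparison is not strictly like for like.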

Task 4

Homoscedasticity can be checked by creating a Residuals vs Fitted plot and looking for changes in the spread of the residuals across the fitted values. Ideally, the residuals would form a band of roughly constant width around zero. If the spread fanned out or narrowed, the standard errors and p-values from the model would be unreliable. If the residuals trended away from zero in some localized area, this would suggest that a subset of the data behaves differently from the rest, or that the relationship is not linear.

Normality is checked with a Q-Q plot of the residuals. Ideally, the points will lie close to the diagonal. Small departures from the normal line don’t really matter at this sample size because of the CLT, but the same departures at a sample size \(n<30\) would probably invalidate any conclusions drawn from the model.

par(mfrow=c(2,2)); plot(slm2); par(mfrow=c(1,1))

The Residuals vs Fitted plot shows much larger prediction errors for low-GDP countries than for high-GDP countries, which is a strong violation of homoscedasticity. This is probably because poorer countries tend to have compounding issues that dramatically lower life expectancy (e.g. war, malnutrition).
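A rough numeric version of that visual check is to compare the spread of the residuals in the lower and upper halves of the fitted values. The sketch below uses simulated data, not AllCountries: `gdp`, `life`, and the noise levels are assumptions chosen so that low-GDP observations are noisier, mimicking the pattern described above.

```r
set.seed(101)
n <- 200

# Simulated (hypothetical) data: noisier outcomes at low GDP
gdp      <- runif(n, 500, 60000)
noise_sd <- ifelse(gdp < 15000, 8, 2)
life     <- 65 + 0.0003 * gdp + rnorm(n, sd = noise_sd)

sim_fit <- lm(life ~ gdp)

# Split observations at the median fitted value and compare residual SDs
low_half <- fitted(sim_fit) <= median(fitted(sim_fit))
spread   <- tapply(resid(sim_fit),
                   ifelse(low_half, "low fitted", "high fitted"),
                   sd)
spread
```

A much larger standard deviation in one half than the other points to unequal variance. The same split could be applied to `slm2` by swapping `sim_fit` for the fitted model.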

The Q-Q plot shows roughly normal residuals, with moderate departures in both tails. The curve on the left tail indicates that some poor countries have much lower life expectancy than predicted, while a few departures on the right tail show that there are a few high-GDP countries with lower life expectancy than expected.

Task 5

resid5 <- resid(mlr3)

rmse5 <- sqrt(mean(resid5^2))
rmse5

## [1] 4.056417

The multiple linear regression model predicting life expectancy from GDP, health spending, and Internet access has an RMSE of about 4.06, so its predictions are typically off by about 4 years. Countries with very low life expectancies might be inflating this value, because RMSE squares each residual and therefore weights large errors heavily, so I would be interested in recalculating the RMSE without the outliers in this dataset. The residuals for the majority of the observations are small (the middle half fall between about -1.8 and 2.5 years), so I think that the RMSE overstates the error of this model for most countries of the world.
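Two quick calculations support this reading. The first recovers the RMSE from the residual standard error printed in the Task 3 summary, since the two differ only in dividing by n rather than by the residual degrees of freedom. The second uses a made-up residual vector (`toy_resid`, not taken from the data) to show how a couple of large residuals pull the RMSE well above the typical error.

```r
# 1. RMSE from the printed residual standard error (4.104 on 169 df)
rse      <- 4.104
df_resid <- 169
n_used   <- 169 + 4            # residual df plus 4 estimated coefficients
rmse_from_rse <- rse * sqrt(df_resid / n_used)
rmse_from_rse                  # close to the 4.056417 reported above

# 2. Hypothetical residuals: twenty small errors and two large ones
toy_resid <- c(rep(c(-1.5, 1.5), 10), -14, -12)

toy_rmse <- sqrt(mean(toy_resid^2))        # dominated by the two large values
toy_mae  <- mean(abs(toy_resid))           # mean absolute error, less sensitive
toy_rmse_trimmed <- sqrt(mean(toy_resid[abs(toy_resid) < 10]^2))

round(c(rmse = toy_rmse, mae = toy_mae, rmse_trimmed = toy_rmse_trimmed), 3)
```

Reporting the mean absolute error next to the RMSE for `mlr3` would be one way to show how much the extreme countries contribute.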

Task 6

Multicollinearity is much more dangerous than I originally thought. If two variables like Energy and Electricity are highly correlated but treated as fully independent, then the idea of holding one variable constant to find a regression coefficient for the other makes little sense, and the resulting coefficient is hard to interpret. Plus, the model can’t separate the effects of the correlated variables, so even if the model performs well overall, the standard errors for those coefficients will be inflated and their p-values won’t be trustworthy.
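One common way to quantify this is the variance inflation factor (VIF), which for each predictor equals 1 / (1 - R²) from regressing that predictor on the others. The sketch below computes it by hand in base R on simulated data; the variables borrow the names Energy, Electricity, and Internet, but the values are generated here and `vif_by_hand` is a helper I wrote for illustration, not a function from any package.

```r
set.seed(42)
n <- 150

# Simulated (hypothetical) predictors: electricity is built from energy,
# so the two are highly correlated; internet is independent of both
energy      <- rnorm(n, mean = 100, sd = 20)
electricity <- 0.9 * energy + rnorm(n, sd = 4)
internet    <- rnorm(n, mean = 50, sd = 15)
sim <- data.frame(energy, electricity, internet)

# VIF for each predictor: regress it on the remaining predictors
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_by_hand(sim, c("energy", "electricity", "internet"))
round(vifs, 2)
```

Values near 1 indicate little overlap with the other predictors. A frequently used rule of thumb treats values above 5 or 10 as a sign that the coefficient's standard error is badly inflated.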