1. Import Data

library(readr)  # provides read_csv()

data <- read_csv("AllCountries.csv")

## Rows: 217 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): Country, Code
## dbl (24): LandArea, Population, Density, GDP, Rural, CO2, PumpPrice, Militar...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Simple Linear Regression

model1 <- lm(LifeExpectancy ~ GDP, data = data)
summary(model1)

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

Intercept: This is the predicted life expectancy when GDP is $0 (about 68.4 years). No country has a GDP of zero, so it is best read as the model's baseline rather than a realistic prediction.

Slope: The coefficient is 2.476e-04 years per dollar, so for every $1000 increase in GDP per capita, predicted life expectancy rises by about 0.2476 years. The positive value shows that countries with higher GDP tend to have higher life expectancies.
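As a quick check of the unit conversion, the reported coefficients can be plugged in by hand. This is a small sketch that hard-codes the rounded estimates from the summary output rather than reading them from model1; the $10,000 GDP value is an arbitrary illustration, not a country in the data.

```r
# Rounded estimates copied from the summary(model1) output
intercept <- 68.42             # predicted life expectancy at GDP = 0
slope_per_dollar <- 2.476e-04  # years per $1 of GDP per capita

# Rescale the slope to a $1000 change in GDP per capita
slope_per_1000 <- slope_per_dollar * 1000

# Predicted life expectancy for a hypothetical GDP per capita of $10,000
gdp_example <- 10000
pred_example <- intercept + slope_per_dollar * gdp_example

slope_per_1000
pred_example
```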

R² Value: This tells us what percentage of the differences in life expectancy can be explained by GDP. The rest of the variation comes from other factors like healthcare, education, and living conditions. In this case, our model explains roughly 43% of the variation in life expectancy (R² = 0.4304, adjusted R² = 0.4272).
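To make the "share of variation explained" idea concrete, R² can be computed by hand as one minus the ratio of residual to total sum of squares. The sketch below uses small synthetic data (the variables x and y are made up for illustration and are not the AllCountries data) and compares the manual value with the one summary() reports.

```r
set.seed(1)

# Synthetic data for illustration only (not the AllCountries data)
x <- runif(50, 0, 100)
y <- 60 + 0.2 * x + rnorm(50, sd = 5)
fit <- lm(y ~ x)

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((y - mean(y))^2)
r2_manual <- 1 - ss_res / ss_tot

# Should agree with the value reported by summary()
all.equal(r2_manual, summary(fit)$r.squared)
```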

3. Multiple Linear Regression

model2 <- lm(LifeExpectancy ~ GDP + Health + Internet, data = data)
summary(model2)

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

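Health Coefficient: This shows how much predicted life expectancy changes when the share of government spending that goes to healthcare increases by 1 percentage point, while keeping GDP and Internet the same: about 0.25 years per percentage point. It describes the association with healthcare spending after accounting for wealth and Internet access. Note that GDP is no longer statistically significant once Health and Internet are in the model (p = 0.30).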
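A rough 95% confidence interval for the Health coefficient can be built from the reported estimate and standard error. This sketch hard-codes the rounded values from the summary(model2) output; calling confint(model2) on the fitted model would be the more direct route.

```r
# Rounded values copied from the summary(model2) output
est <- 0.2479       # Health estimate
se <- 0.06619       # its standard error
df_resid <- 169     # residual degrees of freedom

# 95% interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df_resid)
ci <- c(lower = est - t_crit * se, upper = est + t_crit * se)
ci
```

An interval that excludes zero is consistent with the small p-value reported for Health.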

Adjusted R² Comparison:

data.frame(
  Model = c("GDP only", "GDP + Health + Internet"),
  AdjR2 = c(summary(model1)$adj.r.squared, summary(model2)$adj.r.squared)
)

##                     Model     AdjR2
## 1                GDP only 0.4271795
## 2 GDP + Health + Internet 0.7163973

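The adjusted R² rises from about 0.43 for the GDP-only model to about 0.72 for the multiple regression model, meaning adding Health and Internet explains more of the variation in life expectancy than GDP alone. The two models were fit on slightly different sets of countries (38 vs. 44 rows dropped for missing values), so the comparison is approximate.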
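Adjusted R² penalizes R² for the number of predictors. As a worked check, the sketch below applies the standard formula to the rounded R² values from the two summaries; the helper adj_r2 is a name introduced here for illustration, and n is taken as the residual degrees of freedom plus the number of estimated coefficients (179 and 173).

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Rounded R-squared values copied from the two summary() outputs
adj1 <- adj_r2(0.4304, n = 179, p = 1)  # GDP only
adj2 <- adj_r2(0.7213, n = 173, p = 3)  # GDP + Health + Internet

round(c(adj1, adj2), 4)
```

Small differences from the reported values come from using R² rounded to four decimals.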

4. Checking Assumptions

How to Check

Homoscedasticity (Equal Variance): Make a plot of residuals vs fitted values. We want to see random scatter with similar spread throughout, with no funnel or cone shapes. If the spread changes with the fitted values, the standard errors and p-values from the model may be unreliable.

Normality: Use a Q-Q plot. We want points to follow the diagonal line. This checks if our errors are normally distributed.

Code and Tests

par(mfrow = c(2, 2))  # 2 x 2 grid of diagnostic plots

# Residuals vs fitted: look for even spread around zero
plot(fitted(model1), residuals(model1), main = "Residuals vs Fitted",
     xlab = "Fitted Values", ylab = "Residuals", pch = 16, col = rgb(0,0,1,0.5))
abline(h = 0, col = "red", lwd = 2, lty = 2)

# Normal Q-Q plot: points should follow the reference line
qqnorm(residuals(model1), pch = 16, col = rgb(0,0,1,0.5))
qqline(residuals(model1), col = "red", lwd = 2)

# Scale-location: square root of absolute standardized residuals
plot(fitted(model1), sqrt(abs(rstandard(model1))), main = "Scale-Location",
     xlab = "Fitted Values", ylab = "sqrt(|Standardized Residuals|)", pch = 16, col = rgb(0,0,1,0.5))

# Histogram of residuals: look for rough symmetry
hist(residuals(model1), breaks = 30, col = "lightblue", main = "Histogram of Residuals")

par(mfrow = c(1, 1))  # reset the plotting layout

Residuals are slightly left-skewed (the largest negative residual is about -16.4, while the largest positive one is about 9.3), which is non-ideal, but the Q-Q plot looks reasonable (though there is a slight bow).
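Formal tests can complement the visual checks. The sketch below runs a Shapiro-Wilk test for normality and a hand-computed studentized Breusch-Pagan statistic for constant variance. It uses synthetic data (x, y, and fit are made up here, not the AllCountries data), so the printed p-values say nothing about model1; the same calls could be pointed at residuals(model1) instead.

```r
set.seed(42)

# Synthetic data for illustration only (not the AllCountries data)
n <- 150
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n)
fit <- lm(y ~ x)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(residuals(fit))

# Studentized Breusch-Pagan statistic computed by hand:
# regress the squared residuals on the predictor; under constant variance,
# n * R^2 of that auxiliary regression is approximately chi-squared with 1 df
e2 <- residuals(fit)^2
aux <- lm(e2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(shapiro_p = sw$p.value, bp_p = bp_p)
```

Small p-values would point to non-normal residuals or non-constant variance, respectively.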

5. RMSE and Residuals

rmse <- sqrt(mean(residuals(model2)^2))
rmse

## [1] 4.056417

RMSE: This is the typical size of the prediction error, in years. In other words, predictions from the multiple regression model are, on average, around 4 years off for the countries used to fit it.
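The RMSE computed here divides by the number of observations, while the residual standard error in the model summary divides by the residual degrees of freedom, which is why the two differ slightly. A minimal sketch of the conversion, using the rounded values from the summary(model2) output (n = 173 is the 169 residual degrees of freedom plus 4 coefficients):

```r
# Rounded values copied from the summary(model2) output
rse <- 4.104      # residual standard error
df_resid <- 169   # residual degrees of freedom
n <- 173          # observations used in the fit

# RSE^2 = SSE / df_resid and RMSE^2 = SSE / n, so:
rmse_from_rse <- rse * sqrt(df_resid / n)
rmse_from_rse
```

The result should land close to the RMSE reported above, up to rounding of the residual standard error.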

Impact on Confidence: Countries with large positive residuals (actual > predicted) might have strong healthcare systems or healthy cultural practices that our model doesn’t capture. Countries with large negative residuals (actual < predicted) might have wars, diseases, or poor infrastructure. Large residuals in either direction suggest that our model is missing some important factors, so we should look into other variables like education or pollution.
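One way to find the countries with the largest residuals is to attach the residuals to the data and sort. The sketch below uses a synthetic data frame (toy, with placeholder country labels and made-up x and y) rather than the real data. Fitting with na.action = na.exclude is what keeps the residuals aligned with the original rows when some values are missing.

```r
set.seed(7)

# Synthetic data for illustration only (placeholder country labels)
toy <- data.frame(
  Country = paste0("Country", 1:30),
  x = runif(30, 0, 100)
)
toy$y <- 60 + 0.2 * toy$x + rnorm(30, sd = 4)
toy$y[c(3, 11)] <- NA  # mimic rows with missing values

# na.exclude pads residuals with NA so they line up with the original rows
fit <- lm(y ~ x, data = toy, na.action = na.exclude)
toy$resid <- residuals(fit)

# Sort by residual and drop the rows that were not used in the fit
ranked <- toy[order(toy$resid), c("Country", "resid")]
ranked <- ranked[!is.na(ranked$resid), ]

head(ranked, 3)  # actual well below predicted
tail(ranked, 3)  # actual well above predicted
```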

6. Multicollinearity

Scenario: If we try to predict CO2 emissions using both Energy and Electricity, these variables are likely highly correlated (countries that use more energy also tend to use more electricity).

Effects:

Hard to Interpret: We can’t tell which variable is really affecting CO2 because they move together.
Unreliable Results: The coefficients become unstable and standard errors get larger, making relationships harder to detect.

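Solutions: Fix it by removing one of the correlated variables, combining them into a single measure, or using methods designed for correlated predictors, such as ridge regression or principal components regression.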
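A common diagnostic is the variance inflation factor (VIF), which is 1 / (1 - R²) from regressing one predictor on the others. The sketch below computes it by hand on synthetic data; energy, electricity, and co2 are simulated stand-ins, not the AllCountries columns. It also shows how the standard error for energy grows when the nearly collinear predictor is added.

```r
set.seed(123)

# Synthetic, nearly collinear predictors (illustration only)
n <- 100
energy <- rnorm(n, mean = 50, sd = 10)
electricity <- energy + rnorm(n, sd = 1)
co2 <- 5 + 0.3 * energy + rnorm(n, sd = 2)

# VIF for energy: regress it on the other predictor
r2_energy <- summary(lm(energy ~ electricity))$r.squared
vif_energy <- 1 / (1 - r2_energy)

# Standard error of the energy coefficient with and without electricity
se_both <- summary(lm(co2 ~ energy + electricity))$coefficients["energy", "Std. Error"]
se_alone <- summary(lm(co2 ~ energy))$coefficients["energy", "Std. Error"]

c(vif = vif_energy, se_both = se_both, se_alone = se_alone)
```

A frequently used rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity.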

HW9

Rebecca Murphy