HW 9

Loading libraries

library(tidyverse)

Loading data set

countries <- read_csv("AllCountries.csv")

Handling NAs

colSums(is.na(countries))

##        Country           Code       LandArea     Population        Density 
##              0              0              8              1              8 
##            GDP          Rural            CO2      PumpPrice       Military 
##             30              3             13             50             67 
##         Health    ArmedForces       Internet           Cell            HIV 
##             29             49             13             15             81 
##         Hunger       Diabetes      BirthRate      DeathRate     ElderlyPop 
##             52             10             15             15             24 
## LifeExpectancy    FemaleLabor   Unemployment         Energy    Electricity 
##             18             30             30             82             76 
##      Developed 
##             75
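These counts matter because lm() drops any row that has a missing value in a variable the model uses. As a minimal sketch, the toy data frame below (invented values, not taken from AllCountries.csv) shows how to count the rows that would survive for a given set of variables:

```r
# Toy data frame with invented values, standing in for `countries`
toy <- data.frame(
  LifeExpectancy = c(70, NA, 65, 80, 75),
  GDP            = c(1000, 2000, NA, 40000, 9000),
  Health         = c(10, 12, 9, NA, 15)
)

# Missing values per column, same call as above
na_counts <- colSums(is.na(toy))
print(na_counts)

# Rows with no missing value in the variables each model would use
n_simple   <- sum(complete.cases(toy[, c("LifeExpectancy", "GDP")]))
n_multiple <- sum(complete.cases(toy))
print(c(simple = n_simple, multiple = n_multiple))
```

Applied to the real data, the same idea accounts for the "observations deleted due to missingness" lines in the model summaries.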

Simple Linear Regression

simple_model <- lm(LifeExpectancy ~ GDP, data = countries)

summary(simple_model)

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

Interpretation

The intercept is 68.42, which represents the predicted life expectancy (in years) for a country with a GDP of 0. No country has a GDP of 0, so this value is an extrapolation rather than a realistic prediction.

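The coefficient for GDP is 0.0002476, which means that for every 1-unit increase in GDP (per capita, in US dollars) predicted life expectancy increases by about 0.0002476 years. Put on a more readable scale, that is about 0.25 years for every additional $1,000 of GDP.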

The p-values for both the intercept and the GDP coefficient are extremely small (< 2e-16), which means the relationship between GDP and life expectancy is statistically significant.

R² is 0.4304, which tells us that GDP explains about 43% of the variation in life expectancy across countries.
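Because the GDP coefficient is so small per dollar, it can help to express it per $1,000. The sketch below uses simulated data (invented numbers; `gdp` and `life` are hypothetical stand-ins for the real columns) to show that rescaling the predictor only rescales the slope and leaves the fit unchanged:

```r
# Simulated data (invented); gdp is in dollars per capita
set.seed(1)
gdp  <- runif(100, 500, 60000)
life <- 68 + 0.00025 * gdp + rnorm(100, sd = 5)

fit_dollars   <- lm(life ~ gdp)            # slope per $1
fit_thousands <- lm(life ~ I(gdp / 1000))  # slope per $1,000

slope_dollars   <- unname(coef(fit_dollars)[2])
slope_thousands <- unname(coef(fit_thousands)[2])
print(c(per_dollar = slope_dollars, per_thousand = slope_thousands))

# The same rescaling applied to the estimate reported in the summary above
reported_per_thousand <- 2.476e-04 * 1000
print(reported_per_thousand)
```

The intercept, R², and p-values are identical in both versions; only the unit of the slope changes.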

Multiple Linear Regression

multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = countries)

summary(multiple_model)

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = countries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

Interpretation

The intercept (around 59.08) represents the predicted life expectancy when GDP, Health, and Internet access are all zero.

Coefficients:

GDP: The coefficient for GDP (about 0.0000237) suggests that for each 1-unit increase in GDP, predicted life expectancy increases by about 0.0000237 years, holding Health and Internet constant. The p-value (0.302) shows that GDP is not statistically significant in this model once Health and Internet are included.

Health: The Health coefficient (about 0.2479) indicates that for every one-percentage-point increase in Health, predicted life expectancy increases by about 0.2479 years, holding GDP and Internet constant. The p-value (0.000247) shows that this effect is statistically significant.

Internet: The Internet coefficient (about 0.1903) means that for every one-percentage-point increase in internet access among the population, predicted life expectancy increases by about 0.1903 years, holding GDP and Health constant. The p-value is extremely small (< 2e-16), showing that this effect is statistically significant.

R² (about 0.7213; the adjusted R² is 0.7164) means that about 72% of the variation in life expectancy across countries is explained by GDP, Health, and Internet together.

R² increased from about 0.43 to 0.72, which shows that the additional predictors (Health and Internet) greatly improve the model. This suggests that life expectancy is influenced by more than just GDP.
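Since adding predictors can never lower R², the adjusted R² is the fairer number for comparing the two models. As a worked check, the helper below (`adj_r2` is a name introduced here for illustration) reproduces the adjusted values from the R², sample size, and number of predictors reported in the two summaries:

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# n = residual degrees of freedom + number of estimated coefficients
n_simple   <- 177 + 2   # 179 countries used by the simple model
n_multiple <- 169 + 4   # 173 countries used by the multiple model

adj_simple   <- adj_r2(0.4304, n_simple, p = 1)
adj_multiple <- adj_r2(0.7213, n_multiple, p = 3)
print(round(c(simple = adj_simple, multiple = adj_multiple), 4))
```

Note that the two models were fit on slightly different sets of countries (179 versus 173) because of missing values, so the comparison is approximate rather than exact.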

Checking Assumptions

par(mfrow = c(2, 2))
plot(simple_model)
par(mfrow = c(1, 1))

Homoscedasticity

To check homoscedasticity, we look at the Residuals vs Fitted plot from the diagnostic panel. This lets us see whether the residuals stay evenly spread out across all fitted values.

Ideal outcome: The points should be scattered randomly with no clear shape or pattern, and the vertical spread of residuals should be roughly the same across all fitted values. This suggests the model has constant variance and predicts consistently across different GDP values.

What a violation looks like: A “fan shape” or any other shrinking/widening of the spread means the variance is not constant. A curved pattern is a related warning sign, but it points to a non-linear relationship rather than to unequal variance. If the variance is not constant, the standard errors and p-values might not be trustworthy, and the model might perform better for some countries (like low-GDP ones) than others.

Reflection (based on my model): There is some mild violation here. The residuals show a slight curve and the spread gets a little wider for higher fitted values. This means the model predicts low-GDP countries more consistently than high-GDP countries. Not severe, but something to keep in mind.
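A visual check like this can be backed up with a numeric one. The sketch below computes a studentized Breusch-Pagan statistic by hand in base R on simulated data (invented values whose error spread grows with `x`); it is an illustration of the idea, not a result from the countries data:

```r
# Simulated data (invented) whose error spread grows with x
set.seed(1)
x <- runif(200, 1, 10)
y <- 2 + 3 * x + rnorm(200, sd = 0.5 * x)
fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the predictor
aux <- lm(resid(fit)^2 ~ x)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary regression,
# compared to a chi-squared distribution with 1 df (one predictor)
bp_stat <- length(x) * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(statistic = bp_stat, p_value = bp_p))
```

A small p-value is evidence against constant variance. The same steps could be applied to simple_model by regressing its squared residuals on GDP.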

Normality of Residuals

To check normality, we use the Normal Q–Q plot, which compares the residuals to a theoretical normal distribution.

Ideal outcome: Points should fall mostly along the diagonal line, with only small deviations at the ends. This indicates the residuals are normally distributed, which supports reliable hypothesis tests.

What a violation looks like: Strong curvature (S-shape), heavy tails, or points far away from the line. This can mean the p-values and confidence intervals may be less accurate.

Reflection (based on my model): There is some deviation from normality. The middle section follows the line fairly well, but the tails pull away from it, especially on the low end. This means a few countries have unusually large residuals, mostly negative ones (the residuals range from -16.35 to 9.33), so the model overpredicts life expectancy for those countries. Not terrible, but it suggests the p-values and confidence intervals are somewhat less reliable than they look.
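The Q-Q plot can be complemented with a formal test. As a minimal sketch, the code below runs shapiro.test() from base R on the residuals of a model fit to simulated data with deliberately skewed errors (invented values, not the countries data):

```r
# Simulated data (invented) with skewed, mean-zero errors
set.seed(1)
x <- runif(200, 0, 10)
y <- 1 + 2 * x + (rexp(200) - 1)
fit <- lm(y ~ x)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(resid(fit))
print(sw)
```

A small p-value indicates non-normal residuals. With a few hundred observations the test picks up even mild departures, so it is best read together with the plot rather than instead of it.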

Diagnosing model fit

residuals_multiple <- resid(multiple_model)

rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple

## [1] 4.056417

The RMSE for the multiple regression model is about 4.06, which means the model’s predictions are typically off by about 4 years from the actual life expectancy. This is not perfect, but it’s reasonable since life expectancy depends on many factors.
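The RMSE (4.06) is slightly smaller than the residual standard error in the summary (4.104) because the RMSE divides by the number of observations while the residual standard error divides by the residual degrees of freedom. The sketch below shows the relationship on mtcars, a data set that ships with R and is used here only as a stand-in, and then applies the same conversion to the reported values:

```r
# mtcars ships with R; it is only a stand-in for the countries data
fit <- lm(mpg ~ wt + hp, data = mtcars)

n        <- nobs(fit)
df_resid <- df.residual(fit)

rmse <- sqrt(mean(resid(fit)^2))  # divides the squared error by n
rse  <- sigma(fit)                # divides by the residual degrees of freedom

print(c(rmse = rmse, rse = rse, rse_rescaled = rse * sqrt(df_resid / n)))

# Same conversion applied to the values reported above (169 df, 173 countries)
reported_check <- 4.104 * sqrt(169 / 173)
print(reported_check)
```

The rescaled value agrees with the RMSE above up to the rounding of 4.104.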

Large residuals for certain countries can lower confidence in the model’s predictions. If a country keeps showing big errors, it might mean the model is missing important variables or that the country is an outlier with unusual conditions.
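To see which observations drive the error, the residuals can be ranked by absolute size. The sketch below again uses mtcars as a stand-in, where the residual names are car names:

```r
# mtcars ships with R; it is only a stand-in for the countries data
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- resid(fit)

# Three observations with the largest absolute residuals
top <- head(sort(abs(res), decreasing = TRUE), 3)
print(top)

# Share of the total squared error that comes from those three observations
share <- sum(top^2) / sum(res^2)
print(share)
```

For multiple_model, the residual names should be the row numbers of the rows that were kept, so something like countries$Country[as.integer(names(top))] would presumably recover the country names. That mapping is an assumption about how the data were loaded and is worth verifying on the real data.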

For further investigation, I would check if adding variables like sanitation, vaccination access, or poverty rate improves the model, since these can strongly affect life expectancy and might explain the countries with large errors.

Question 6

If Energy and Electricity are highly correlated, the model can’t really tell which one is actually affecting CO₂ emissions, because they move together. This makes their coefficients unstable and sometimes even misleading. One variable might look insignificant or even have the “wrong” sign, not because it doesn’t matter, but because the model is confused by the overlap. The overall predictions might still be okay, but I wouldn’t fully trust the individual coefficients when multicollinearity is this strong.
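One common way to quantify this is the variance inflation factor (VIF), which is 1 / (1 - R²) from regressing one predictor on the others. The sketch below computes it by hand on simulated data (invented values; `energy`, `electricity`, and `co2` are hypothetical stand-ins for the real columns) and shows how the standard error of a coefficient grows when a near-duplicate predictor is added:

```r
# Simulated data (invented): electricity is nearly a copy of energy
set.seed(42)
energy      <- rnorm(100)
electricity <- energy + rnorm(100, sd = 0.1)
co2         <- 1 + 2 * energy + rnorm(100)

# VIF = 1 / (1 - R^2), where R^2 comes from regressing one predictor on the other
r2_aux <- summary(lm(electricity ~ energy))$r.squared
vif    <- 1 / (1 - r2_aux)

# Standard error of the energy coefficient with and without electricity
fit_single <- lm(co2 ~ energy)
fit_both   <- lm(co2 ~ energy + electricity)
se_single  <- summary(fit_single)$coefficients["energy", "Std. Error"]
se_both    <- summary(fit_both)$coefficients["energy", "Std. Error"]

print(c(correlation = cor(energy, electricity), vif = vif))
print(c(se_alone = se_single, se_with_electricity = se_both))
```

With only two predictors the VIF is the same for both. A frequently quoted rule of thumb treats values above about 5 or 10 as a sign of problematic multicollinearity.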