AllCountries <- read.csv("AllCountries.csv")
# simple linear regression: LifeExpectancy ~ GDP
simple_model <- lm(LifeExpectancy ~ GDP, data = AllCountries)
summary(simple_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = AllCountries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.352 -3.882 1.550 4.458 9.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
## GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
The intercept (~68.42) represents the predicted life expectancy for a country with a GDP per capita of $0. This is not practically meaningful, but mathematically it is the y-intercept.
The coefficient for GDP (0.0002476) means that for every $1 increase in GDP per capita, predicted life expectancy increases by about 0.0002476 years, or roughly 0.25 years per $1,000.
Both p-values (< 2e-16) are extremely small, indicating that the intercept and the GDP coefficient are both significantly different from zero and that GDP is a statistically significant predictor of life expectancy.
The R-squared (0.4304) means GDP alone explains about 43% of the variation in life expectancy across countries. That makes it an important factor, but more than half of the variation is due to other influences.
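Because a $1 change in GDP per capita is tiny, the slope is easier to read on a larger scale. The following is a minimal sketch that uses only the estimates reported in the summary above; the variable names and the $10,000 example value are chosen here for illustration.

```r
# Estimates copied from summary(simple_model) above
intercept <- 68.42
gdp_coef  <- 2.476e-04   # years of life expectancy per $1 of GDP per capita

# Predicted change in life expectancy per $1,000 and per $10,000 of GDP per capita
per_1000  <- gdp_coef * 1000
per_10000 <- gdp_coef * 10000

# Predicted life expectancy at an illustrative GDP per capita of $10,000
pred_10000 <- intercept + gdp_coef * 10000

round(c(per_1000 = per_1000, per_10000 = per_10000, pred_10000 = pred_10000), 3)
```

On this scale the slope is about 0.25 years per $1,000, and the fitted line predicts roughly 70.9 years at a GDP per capita of $10,000.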
# multiple linear regression: LifeExpectancy ~ GDP + Health + Internet
multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
summary(multiple_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5662 -1.8227 0.4108 2.5422 9.4161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
## GDP 2.367e-05 2.287e-05 1.035 0.302025
## Health 2.479e-01 6.619e-02 3.745 0.000247 ***
## Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
The intercept (around 59.08) represents the predicted life expectancy when GDP, Health, and Internet are all 0. This is not practically meaningful, but mathematically it is simply the y-intercept of the regression plane.
The coefficient for Health (around 0.248) means that for every 1-unit increase in Health spending, life expectancy increases by about 0.25 years, holding GDP and Internet constant.
Looking at the p-values:
Health and Internet have p-values < 0.001, indicating they are statistically significant predictors of life expectancy. GDP has a p-value around 0.30, meaning it is not statistically significant once Health and Internet are included.
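To see where the GDP p-value comes from, it can be reproduced from the estimate and standard error printed in the coefficient table. This is a small sketch using the rounded values from the output above, so the result matches the reported p-value only approximately; the variable names are illustrative.

```r
# Values copied from the GDP row of summary(multiple_model) above
estimate  <- 2.367e-05
std_error <- 2.287e-05
df_resid  <- 169          # residual degrees of freedom from the summary

# t statistic = estimate / standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution with df_resid degrees of freedom
p_value <- 2 * pt(-abs(t_value), df = df_resid)

round(c(t = t_value, p = p_value), 3)
```

The estimate is only about one standard error away from zero, which is why the test cannot distinguish the GDP effect from zero in this model.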
Homoscedasticity can be checked by examining a residuals vs fitted plot. Ideally, the residuals should be scattered randomly around the horizontal line at 0 with no visible pattern. A violation would look like a funnel shape or a curved pattern. Overall, the assumption of constant variance was reasonably met, meaning the model is fairly reliable.
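A residuals vs fitted plot can be drawn with base R as sketched below. The data here are simulated purely for illustration (they are not the AllCountries data), and the names `x`, `y`, and `fit` are hypothetical.

```r
set.seed(1)  # simulated data for illustration only, not the AllCountries data
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)   # constant error variance by construction
fit <- lm(y ~ x)

# Residuals vs fitted values: look for a random band around the line at 0
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)
```

For the model fitted in this report, `plot(multiple_model, which = 1)` should produce the built-in version of the same diagnostic.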
Normality can be checked by using a Q-Q plot to assess whether the residuals follow a normal distribution. Ideally, the points on the plot fall close to the diagonal line. A violation would look like strong curvature or heavy tails (points far from the line at the ends). The points mostly followed the diagonal line, with a few outliers, suggesting that the normality assumption is reasonably met and the inference is reliable.
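A Q-Q plot of the residuals can be produced along the following lines. Again the data are simulated for illustration only, and `fit` and `std_res` are hypothetical names.

```r
set.seed(2)  # simulated data for illustration only, not the AllCountries data
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n)   # normal errors by construction
fit <- lm(y ~ x)

# Standardized residuals put the points on a comparable scale
std_res <- rstandard(fit)

# Points close to the reference line suggest approximately normal residuals
qqnorm(std_res, main = "Normal Q-Q plot of standardized residuals")
qqline(std_res, lty = 2)
```

For the model in this report, `plot(multiple_model, which = 2)` should give the equivalent built-in plot.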
plot(AllCountries$GDP, AllCountries$LifeExpectancy,
xlab = "GDP per Capita ($US)",
ylab = "Life Expectancy (Years)",
main = "Life Expectancy vs GDP")
abline(simple_model, col = 1, lwd = 2)
### 5
residuals_multiple <- resid(multiple_model)
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 4.056417
The RMSE is about 4.06, meaning the model's predicted life expectancy values typically differ from the actual values by about 4 years. Large residuals can reduce confidence in the model because they may indicate nonlinearity or missing predictors. We can follow up by investigating nonlinear relationships or outliers.
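The RMSE (4.056) is slightly smaller than the residual standard error in the summary (4.104) because the RMSE divides the sum of squared residuals by the number of observations used, while the residual standard error divides by the residual degrees of freedom. A short check using the rounded values from the output above (variable names are illustrative):

```r
# Values copied from summary(multiple_model) above
rse      <- 4.104   # residual standard error
df_resid <- 169     # residual degrees of freedom
n_used   <- df_resid + 4   # observations used: 169 + 4 estimated coefficients

# RSE  = sqrt(SSR / df_resid), RMSE = sqrt(SSR / n_used)
# so   RMSE = RSE * sqrt(df_resid / n_used)
rmse_from_rse <- rse * sqrt(df_resid / n_used)
rmse_from_rse
```

The result agrees with `rmse_multiple` to about three decimal places; the small difference comes from using the rounded residual standard error.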
When predictors are highly correlated, the model has difficulty separating their individual effects because both try to explain a similar part of the variation in CO2. This makes the model less certain about the coefficients (their standard errors become larger), so it becomes unreliable for judging which predictor has the stronger effect on CO2.
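One common way to quantify this is the variance inflation factor (VIF), which for each predictor equals 1 / (1 - R²) from regressing that predictor on the others. The sketch below uses simulated data, not the AllCountries data, and `vif_manual` is a hypothetical helper written here in base R rather than a function from any package.

```r
set.seed(3)  # simulated data for illustration only
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # x2 is almost a copy of x1
x3 <- rnorm(n)                  # unrelated predictor
y  <- 1 + x1 + x2 + x3 + rnorm(n)

# Hypothetical helper: VIF = 1 / (1 - R^2) from regressing one predictor on the rest
vif_manual <- function(target, others) {
  d  <- data.frame(target = target, others)
  r2 <- summary(lm(target ~ ., data = d))$r.squared
  1 / (1 - r2)
}

vifs <- c(
  x1 = vif_manual(x1, data.frame(x2, x3)),
  x2 = vif_manual(x2, data.frame(x1, x3)),
  x3 = vif_manual(x3, data.frame(x1, x2))
)
round(vifs, 1)

# Standard errors of the correlated predictors are inflated relative to x3
se <- summary(lm(y ~ x1 + x2 + x3))$coefficients[, "Std. Error"]
round(se, 3)
```

In this simulation the two nearly identical predictors have very large VIFs and much larger standard errors than the unrelated predictor, which is the uncertainty described above.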