Hw 9
library(readr)
setwd("~/Desktop/datasets")
AllCountries <- read_csv("AllCountries.csv")
1
# Fit simple linear regression: life expectancy ~ GDP
simple_model <- lm(LifeExpectancy ~ GDP, data = AllCountries)
# View the model summary
summary(simple_model)
Call:
lm(formula = LifeExpectancy ~ GDP, data = AllCountries)
Residuals:
Min 1Q Median 3Q Max
-16.352 -3.882 1.550 4.458 9.330
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.901 on 177 degrees of freedom
(38 observations deleted due to missingness)
Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
The intercept (about 68.4) is the predicted life expectancy, in years, when GDP is 0 (the y-intercept).
The GDP coefficient (about 0.000248) means that each $1 increase in GDP is associated with an increase of about 0.000248 years in predicted life expectancy, or roughly 0.25 years per $1,000.
Both p-values are < 0.05 (each is < 2e-16), so the intercept and the GDP slope are statistically significant.
R² is about 0.430 (adjusted R² about 0.427), so GDP alone explains about 43% of the variance in life expectancy.
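As a quick check on how to read the two coefficients, the fitted line can be evaluated by hand. This is a minimal sketch: the coefficients are copied from the summary output above, and the GDP value of 10,000 is a hypothetical input chosen only for illustration.

```r
# Coefficients copied from the summary(simple_model) output above
b0 <- 68.42        # intercept
b1 <- 2.476e-04    # slope for GDP

# Hypothetical GDP value, chosen only for illustration
gdp_example <- 10000

# Predicted life expectancy from the fitted line: intercept + slope * GDP
predicted <- b0 + b1 * gdp_example
predicted
```

This gives 70.896 years. With the model object in memory, predict(simple_model, newdata = data.frame(GDP = 10000)) should agree up to the rounding of the printed coefficients.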
2
# Fit multiple linear regression: life expectancy ~ GDP + Health + Internet
multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
# View the model summary
summary(multiple_model)
Call:
lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
Residuals:
Min 1Q Median 3Q Max
-14.5662 -1.8227 0.4108 2.5422 9.4161
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
GDP 2.367e-05 2.287e-05 1.035 0.302025
Health 2.479e-01 6.619e-02 3.745 0.000247 ***
Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.104 on 169 degrees of freedom
(44 observations deleted due to missingness)
Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
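Coefficients (slopes), each holding the other predictors constant:
GDP: Positive (about 0.0000237). Countries with greater GDP tend to have slightly longer life expectancy, but this slope is not statistically significant here.
Health: Positive (about 0.248). A one-unit increase in Health is associated with about 0.248 more years of life expectancy.
Internet: Positive (about 0.190). A one-unit increase in Internet is associated with about 0.190 more years of life expectancy.
P-values: Health (0.000247) and Internet (< 2e-16) are significant (< 0.05). GDP (0.302) is not significant once Health and Internet are in the model.
Adjusted R²: about 0.716. This means about 71.6% of the variance in life expectancy is explained by this model, after adjusting for the number of predictors. This is better than the adjusted R² of the previous model (42.7%), which suggests that Health and Internet are helpful in predicting life expectancy.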
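One caveat on the comparison: the two summaries dropped different numbers of rows (38 vs. 44 observations), so the models were not fit to exactly the same countries, and anova() will not compare models fit to different-sized datasets. A minimal sketch of one way around this is to refit both models on the rows that are complete for every variable. The helper name fit_on_common_rows and the toy data are hypothetical and only illustrate the idea.

```r
# Hypothetical helper: fit several formulas on the rows that are complete
# for every variable used in any of the formulas.
fit_on_common_rows <- function(data, formulas) {
  vars <- unique(unlist(lapply(formulas, all.vars)))
  common <- data[complete.cases(data[, vars]), ]
  lapply(formulas, function(f) lm(f, data = common))
}

# Toy data (simulated, NOT the AllCountries values) with some missing cells
set.seed(1)
n <- 120
toy <- data.frame(
  GDP = runif(n, 500, 50000),
  Health = runif(n, 2, 20),
  Internet = runif(n, 0, 95)
)
toy$LifeExpectancy <- 60 + 0.25 * toy$Health + 0.2 * toy$Internet +
  rnorm(n, sd = 4)
toy$Health[1:6] <- NA

fits <- fit_on_common_rows(
  toy,
  list(LifeExpectancy ~ GDP, LifeExpectancy ~ GDP + Health + Internet)
)

# Both models now use the same rows, so the nested F-test can be run
anova(fits[[1]], fits[[2]])
```

For the homework data the call would be fit_on_common_rows(AllCountries, list(LifeExpectancy ~ GDP, LifeExpectancy ~ GDP + Health + Internet)), assuming AllCountries is loaded as above.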
3
Homoscedasticity: if the assumption holds, the Residuals vs Fitted plot should show a flat red line with a random scatter of points around it (constant variance), and the Scale-Location plot should show a flat, horizontal line with an even spread of points.
Normality: in the Q-Q Residuals plot, the points should fall close to the dashed line, with no unusual patterns or outliers.
par(mfrow = c(2, 2)); plot(simple_model); par(mfrow = c(1, 1))
Homoscedasticity: for simple_model, the Residuals vs Fitted plot shows a curved pattern rather than a flat, random scatter, which points to a non-linear relationship between life expectancy and GDP. The Scale-Location plot is not flat either: the points are strongly skewed to the right, a sign of heteroscedasticity.
Normality: The residuals are roughly normal, but the points trail away from the dashed line towards the top of the graph. This indicates that there are “extreme” countries in the dataset that act as outliers.
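The plots can be backed up with rough numeric checks. The sketch below is an assumption-laden companion to the plots, not a replacement for them: the helper name diagnostic_summary is hypothetical, and the toy data are simulated so that the spread of the residuals grows on purpose.

```r
# Hypothetical helper: two rough numeric companions to the diagnostic plots
diagnostic_summary <- function(model) {
  r <- resid(model)
  f <- fitted(model)
  c(
    # Shapiro-Wilk p-value: small values suggest non-normal residuals
    shapiro_p = unname(shapiro.test(r)$p.value),
    # Rank correlation of |residual| with fitted value: values far from 0
    # suggest the residual spread changes with the fitted value
    spread_trend = cor(abs(r), f, method = "spearman")
  )
}

# Toy data (simulated, NOT AllCountries): noise sd grows with x on purpose
set.seed(2)
x <- runif(200, 1, 10)
y <- 2 + 3 * x + rnorm(200, sd = x)
toy_model <- lm(y ~ x)

out <- diagnostic_summary(toy_model)
out
```

With the homework model in memory, diagnostic_summary(simple_model) would give the same two numbers for the GDP model.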
4
Diagnosing Model Fit (RMSE and Residuals): For the multiple regression model from Question 2 (LifeExpectancy ~ GDP + Health + Internet), calculate the RMSE and explain what it represents in the context of predicting life expectancy. How would large residuals for certain countries (e.g., those with unusually high or low life expectancy) affect your confidence in the model’s predictions, and what might you investigate further?
# Calculate residuals
residuals_multiple <- resid(multiple_model)
# Calculate RMSE for multiple model
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
[1] 4.056417
RMSE = 4.056 years, meaning the model’s predictions of life expectancy are typically off by about 4 years. Large residuals for certain countries (the more “extreme” countries in the dataset) would lower my confidence in the model’s predictions for countries like them, because they show that the model fits some countries much worse than the RMSE suggests. To make the model more robust, I would investigate other factors that could predict life expectancy, such as median household income (MHI) or death rates.
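The RMSE (4.056) is slightly smaller than the "Residual standard error" (4.104) printed by summary(multiple_model). The RMSE computed here divides the sum of squared residuals by the number of observations, while the residual standard error divides by the residual degrees of freedom. A short sketch reconciling the two from the printed output, assuming four estimated coefficients (intercept plus three slopes):

```r
rse <- 4.104      # "Residual standard error" from summary(multiple_model)
df_resid <- 169   # residual degrees of freedom from the same summary
n_coef <- 4       # intercept + GDP + Health + Internet
n <- df_resid + n_coef   # observations actually used in the fit

# sqrt(SSE / n) = sqrt(SSE / df_resid) * sqrt(df_resid / n)
rmse_from_summary <- rse * sqrt(df_resid / n)
round(rmse_from_summary, 3)
```

The result rounds to 4.056, which matches rmse_multiple.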
5
The model becomes less reliable because multicollinearity makes it harder to isolate the influence of either variable. Since energy and electricity are closely related, their coefficient estimates have inflated standard errors, so small changes in the data or a few outliers can cause the coefficients to swing.
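One common way to quantify this is the variance inflation factor, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the other predictors. The sketch below computes it in base R. The helper name vif_manual, the column names Energy and Electricity, and the toy data are all assumptions made for illustration and are not the homework data.

```r
# Hypothetical helper: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on all the other predictors.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    aux_fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux_fit)$r.squared)
  })
}

# Toy data (simulated, NOT AllCountries): Electricity tracks Energy closely
set.seed(3)
n <- 150
toy <- data.frame(Energy = rnorm(n, 100, 20), GDP = rnorm(n, 50, 10))
toy$Electricity <- 0.8 * toy$Energy + rnorm(n, sd = 3)

vifs <- vif_manual(toy, c("Energy", "Electricity", "GDP"))
round(vifs, 1)
```

In this toy example the two correlated predictors get large VIFs while the unrelated predictor stays near 1. A VIF well above roughly 5 to 10 is a common rule-of-thumb warning sign.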