# 1. Load the data
df <- read.csv("AllCountries.csv")
# 2. Fit Linear Regression
model1 <- lm(LifeExpectancy ~ GDP, data = df)
summary(model1)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -16.352  -3.882   1.550   4.458   9.330
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
# Report Coefficients
coefficients(model1)
##  (Intercept)          GDP
## 6.842208e+01 2.476441e-04
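Intercept (\(\beta_0 = 68.42\)): The predicted average life expectancy is approximately 68.42 years for a country with a GDP per capita of $0. Because a GDP per capita of $0 lies outside the range of any real country, the intercept is an extrapolation that anchors the line rather than a realistic prediction.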
Slope (\(\beta_1 = 0.0002476\)): For every $1 increase in GDP per capita, life expectancy is predicted to increase by approximately 0.0002476 years, or equivalently about 0.25 years for every $1,000 increase.
\(R^2\) Value (\(0.4304\)): GDP explains 43.04% of the variation in life expectancy across the 179 countries used in the fit (38 observations were dropped because of missing values).
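Because a $1 change in GDP per capita is tiny, the slope is easier to read on a larger scale. The following is a minimal sketch that types in the coefficients reported above by hand (rather than reading them from `model1`); the GDP value of $10,000 is an arbitrary illustration, not a value taken from the dataset.

```r
# Coefficients copied from the coefficients(model1) output above
b0 <- 6.842208e+01   # intercept (years)
b1 <- 2.476441e-04   # slope (years per $1 of GDP per capita)

# Slope expressed per $1,000 of GDP per capita
slope_per_1000 <- b1 * 1000
print(slope_per_1000)

# Predicted life expectancy at a hypothetical GDP per capita of $10,000
pred_10k <- b0 + b1 * 10000
print(pred_10k)
```

With the fitted model in memory, `predict(model1, newdata = data.frame(GDP = 10000))` should give the same prediction.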
# 3. Fit Multiple Linear Regression
model2 <- lm(LifeExpectancy ~ GDP + Health + Internet, data = df)
summary(model2)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -14.5662  -1.8227   0.4108   2.5422   9.4161
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
Coefficient for Health (\(0.2479\)): Holding GDP and Internet constant, each one-percentage-point increase in government expenditures on healthcare is associated with a predicted increase of 0.2479 years in life expectancy.
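Adjusted \(R^2\) Comparison (\(0.7164\) vs. \(0.4272\)): The adjusted \(R^2\) rose from 0.4272 in the simple model to 0.7164 in the multiple regression model. This suggests that adding Health and Internet as predictors substantially improves the model’s ability to explain variation in life expectancy, even after the penalty for the extra predictors. The two models were fitted to slightly different samples (38 versus 44 observations dropped for missingness), so the comparison is approximate. Note also that GDP is no longer statistically significant (p = 0.302) once Health and Internet are in the model.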
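The adjusted \(R^2\) values can be checked by hand from the reported \(R^2\), the sample size, and the number of predictors. The sketch below uses the standard formula \(1 - (1 - R^2)(n - 1)/(n - p - 1)\); the helper name `adj_r2` is introduced here for illustration, and the inputs are the rounded values from the two summaries, so agreement is only up to rounding.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# n = residual degrees of freedom + number of estimated coefficients
n1 <- 177 + 2   # simple model: intercept + GDP
n2 <- 169 + 4   # multiple model: intercept + GDP + Health + Internet

adj1 <- adj_r2(0.4304, n = n1, p = 1)
adj2 <- adj_r2(0.7213, n = n2, p = 3)
print(round(c(adj1, adj2), 4))
```

The penalty factor \((n - 1)/(n - p - 1)\) is close to 1 in both models, which is why the adjusted values sit only slightly below the unadjusted ones.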
# 4. Checking Assumptions
par(mfrow = c(1, 2))    # Show both diagnostic plots side by side
plot(model1, which = 1) # Residuals vs. Fitted: homoscedasticity check
plot(model1, which = 2) # Normal Q-Q: normality check
Homoscedasticity: An ideal “Residuals vs. Fitted” plot shows a random scatter of points around zero with no clear pattern. A violation (like a funnel shape) would indicate that the residual variance changes with the fitted values, so the model predicts less precisely at some levels of GDP and its standard errors become unreliable.
Normality: An ideal “Normal Q-Q” plot has points falling along a straight diagonal line. Deviations, especially in the tails, suggest the residuals are not normally distributed, which can affect the validity of the p-values and confidence intervals.
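To make the funnel pattern concrete, here is a small sketch on simulated data (not `AllCountries.csv`) in which the noise is deliberately made to grow with the predictor. The variable names, the seed, and the choice of the Shapiro-Wilk test as a numeric companion to the Q-Q plot are all assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed so the simulation is reproducible

# Simulated data (not AllCountries.csv): noise grows with x, a "funnel" pattern
n <- 500
x <- runif(n, min = 1, max = 10)
y <- 2 + 3 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# Compare residual spread in the lower and upper halves of the fitted values
res <- residuals(fit)
upper <- fitted(fit) > median(fitted(fit))
sd_low <- sd(res[!upper])
sd_high <- sd(res[upper])
print(c(low = sd_low, high = sd_high))

# Shapiro-Wilk test as a numeric companion to the Normal Q-Q plot
sw <- shapiro.test(res)
print(sw$p.value)
```

A clearly larger spread in the upper half is the numeric counterpart of a funnel in the “Residuals vs. Fitted” plot; calling `plot(fit, which = 1)` on this simulated fit shows the same thing visually.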
# 5. Calculate RMSE
rmse_val <- sqrt(mean(residuals(model2)^2))
print(rmse_val)
## [1] 4.056417
RMSE (\(4.0564\)): This represents the typical deviation between the observed life expectancy and the values predicted by the multiple regression model (model2). The model’s predictions are typically off by about 4.06 years.
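The RMSE is slightly smaller than the residual standard error of 4.104 reported by `summary(model2)` because the RMSE divides the sum of squared residuals by the number of observations, while the residual standard error divides by the residual degrees of freedom. A quick check, using the rounded values from the output above (so the match is approximate):

```r
# Values copied from the summary(model2) output above
rse <- 4.104        # residual standard error
df_resid <- 169     # residual degrees of freedom
n <- df_resid + 4   # observations used: 4 coefficients were estimated

# RSE = sqrt(SSE / df_resid) and RMSE = sqrt(SSE / n), so
# RMSE = RSE * sqrt(df_resid / n)
rmse_from_rse <- rse * sqrt(df_resid / n)
print(rmse_from_rse)
```

The result agrees with the printed RMSE of 4.056417 to about three decimal places.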
Large Residuals: Countries whose observed life expectancy falls far from the model’s prediction would lower confidence in the model’s generalizability. One might investigate whether regional factors or specific healthcare policies explain why those countries are outliers.
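One way to find such countries is to rank observations by their standardized residuals. The sketch below runs on a simulated stand-in for the real data: the country labels, the single predictor, and the planted outlier are all hypothetical, and the cutoff of 3 is a common rule of thumb rather than a fixed standard.

```r
set.seed(42)  # arbitrary seed so the simulation is reproducible

# Simulated stand-in for the real data; column values are hypothetical
n <- 50
sim <- data.frame(
  Country  = paste0("Country_", seq_len(n)),
  Internet = runif(n, min = 0, max = 100)
)
sim$LifeExpectancy <- 60 + 0.2 * sim$Internet + rnorm(n, sd = 1)
sim$LifeExpectancy[7] <- sim$LifeExpectancy[7] - 30  # planted outlier

fit <- lm(LifeExpectancy ~ Internet, data = sim)

# Standardized residuals; |value| > 3 is a rule of thumb, not a hard cutoff
sim$std_resid <- rstandard(fit)
flagged <- sim[abs(sim$std_resid) > 3, c("Country", "std_resid")]
print(flagged)
```

On the real data, note that `lm` drops rows with missing values, so `rstandard(model2)` would have to be matched to countries through the row names of the residuals rather than assigned directly as a new column of `df`.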
# 6. Multicollinearity Scenario
Multicollinearity between Energy and Electricity makes it difficult to isolate the individual effect of each predictor on CO2 emissions. It can lead to unstable regression coefficients with inflated standard errors, where a small change in the data could substantially change the estimates or even flip their signs.
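The effect can be illustrated with a variance inflation factor (VIF), computed here by hand as \(1 / (1 - R_j^2)\), where \(R_j^2\) comes from regressing one predictor on the other. This is a sketch on simulated data: the values of Energy, Electricity, and CO2 below are made up to be strongly correlated and are not taken from any real dataset.

```r
set.seed(123)  # arbitrary seed so the simulation is reproducible

# Simulated data (hypothetical values): Electricity closely tracks Energy
n <- 200
Energy <- rnorm(n, mean = 50, sd = 10)
Electricity <- Energy + rnorm(n, sd = 1)
CO2 <- 5 + 0.3 * Energy + 0.2 * Electricity + rnorm(n, sd = 2)

fit <- lm(CO2 ~ Energy + Electricity)

# VIF for Energy: 1 / (1 - R^2) from regressing Energy on the other predictor
r2_energy <- summary(lm(Energy ~ Electricity))$r.squared
vif_energy <- 1 / (1 - r2_energy)

print(cor(Energy, Electricity))
print(vif_energy)
print(summary(fit)$coefficients[, c("Estimate", "Std. Error")])
```

VIF values above roughly 5 or 10 are commonly treated as a warning sign. In that situation, dropping one of the two predictors or combining them into a single measure are typical remedies.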