knitr::opts_chunk$set(echo = TRUE)
# Load the dataset
# This chunk will now show up in your report
AllCountries <- read.csv("AllCountries.csv")
head(AllCountries)
## Country Code LandArea Population Density GDP Rural CO2 PumpPrice
## 1 Afghanistan AFG 652.86 37.172 56.9 521 74.5 0.29 0.70
## 2 Albania ALB 27.40 2.866 104.6 5254 39.7 1.98 1.36
## 3 Algeria DZA 2381.74 42.228 17.7 4279 27.4 3.74 0.28
## 4 American Samoa ASM 0.20 0.055 277.3 NA 12.8 NA NA
## 5 Andorra AND 0.47 0.077 163.8 42030 11.9 5.83 NA
## 6 Angola AGO 1246.70 30.810 24.7 3432 34.5 1.29 0.97
## Military Health ArmedForces Internet Cell HIV Hunger Diabetes BirthRate
## 1 3.72 2.01 323 11.4 67.4 NA 30.3 9.6 32.5
## 2 4.08 9.51 9 71.8 123.7 0.1 5.5 10.1 11.7
## 3 13.81 10.73 317 47.7 111.0 0.1 4.7 6.7 22.3
## 4 NA NA NA NA NA NA NA NA NA
## 5 NA 14.02 NA 98.9 104.4 NA NA 8.0 NA
## 6 9.40 5.43 117 14.3 44.7 1.9 23.9 3.9 41.3
## DeathRate ElderlyPop LifeExpectancy FemaleLabor Unemployment Energy
## 1 6.6 2.6 64.0 50.3 1.5 NA
## 2 7.5 13.6 78.5 55.9 13.9 808
## 3 4.8 6.4 76.3 16.4 12.1 1328
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 8.4 2.5 61.8 76.4 7.3 545
## Electricity Developed
## 1 NA NA
## 2 2309 1
## 3 1363 1
## 4 NA NA
## 5 NA NA
## 6 312 1
model1 <- lm(LifeExpectancy ~ GDP, data = AllCountries)
summary(model1)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = AllCountries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.352 -3.882 1.550 4.458 9.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
## GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
model2 <- lm(LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
summary(model2)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5662 -1.8227 0.4108 2.5422 9.4161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
## GDP 2.367e-05 2.287e-05 1.035 0.302025
## Health 2.479e-01 6.619e-02 3.745 0.000247 ***
## Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
par(mfrow = c(1, 2))
plot(model1, which = 1)
plot(model1, which = 2)
rmse_val <- sqrt(mean(resid(model2)^2))
print(paste("RMSE:", rmse_val))
## [1] "RMSE: 4.05641669659996"
The simple linear regression model relates a country’s wealth (GDP per capita) to the average life expectancy of its citizens. The intercept (68.42 years) is the predicted life expectancy for a country with zero GDP, which is an extrapolation beyond the observed data rather than a realistic scenario. The slope (2.476e-04) is the estimated increase in life expectancy, in years, for each additional dollar of GDP per capita, or roughly 0.25 years per additional $1,000. The R-squared of 0.4304 means that GDP alone accounts for about 43% of the variation in life expectancy across the 179 countries with complete data.
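To make the slope concrete, the fitted line can be evaluated by hand. The sketch below copies the rounded estimates from the `summary(model1)` output above; the helper name `predict_life` and the example GDP values are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied (rounded) from the summary(model1) output above
b0 <- 68.42       # intercept, in years
b1 <- 2.476e-04   # slope, in years per dollar of GDP per capita

# Hypothetical helper: fitted life expectancy for a given GDP per capita
predict_life <- function(gdp) b0 + b1 * gdp

# Illustrative GDP values (not taken from the dataset)
gdp_values <- c(1000, 10000, 40000)
predicted <- predict_life(gdp_values)
print(data.frame(GDP = gdp_values, Predicted = predicted))

# Estimated change in life expectancy per additional $1,000 of GDP per capita
per_thousand <- b1 * 1000
print(per_thousand)
```

With the fitted object available, `predict(model1, newdata = data.frame(GDP = gdp_values))` should give the same predictions without the rounding.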
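By adding health expenditure and internet access to the model, we can see how each factor relates to life expectancy while holding GDP constant. The Health coefficient (0.248, p < 0.001) is the estimated change in life expectancy associated with a one-unit increase in government health spending, with GDP and internet access held fixed; it describes an association, not a proven causal effect. Notably, the GDP coefficient shrinks to 2.367e-05 and is no longer statistically significant (p = 0.302) once Health and Internet are included. The Adjusted R-squared rises from 0.4272 in the first model to 0.7164, which suggests that healthcare and internet access explain considerably more of the variation in life expectancy than GDP alone. Because the two models were fit to slightly different sets of countries (38 versus 44 observations dropped for missing values), this comparison is approximate.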
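One way to compare a small and a large model fairly is to fit both on the same rows. The following is a minimal sketch on synthetic stand-in data (the `toy` data frame and its column names are assumptions for illustration and are not the `AllCountries` values): it keeps only rows that are complete for every variable used by either model, then compares Adjusted R-squared and runs an F test for the nested models.

```r
set.seed(1)
n <- 200

# Synthetic stand-in data (NOT the AllCountries values)
toy <- data.frame(gdp = rexp(n, rate = 1 / 10000))
toy$internet <- pmin(100, 20 + 0.002 * toy$gdp + rnorm(n, sd = 10))
toy$health <- rnorm(n, mean = 10, sd = 3)
toy$life <- 60 + 0.2 * toy$internet + 0.25 * toy$health + rnorm(n, sd = 4)

# Introduce some missing values, as in real survey data
toy$health[sample(n, 15)] <- NA
toy$internet[sample(n, 10)] <- NA

# Keep only rows that are complete for every variable in either model
vars <- c("life", "gdp", "health", "internet")
common <- toy[complete.cases(toy[, vars]), ]

small <- lm(life ~ gdp, data = common)
large <- lm(life ~ gdp + health + internet, data = common)

# Both models now use the same number of observations
print(c(small = nobs(small), large = nobs(large)))

# Adjusted R-squared on the common sample
adj <- c(small = summary(small)$adj.r.squared,
         large = summary(large)$adj.r.squared)
print(adj)

# F test for the nested models (valid because the rows are identical)
print(anova(small, large))
```

For the models in this report, the same idea would presumably apply by subsetting `AllCountries` on `LifeExpectancy`, `GDP`, `Health`, and `Internet` before refitting both models.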
To trust the inference from our simple regression model, we must check the assumptions of constant error variance (homoscedasticity) and normally distributed residuals. In the Residuals vs. Fitted plot, we look for points scattered randomly around zero with no visible pattern; a funnel shape would indicate heteroscedasticity, meaning the spread of the errors changes with the fitted value (and therefore with GDP), while a curved trend would suggest that the relationship is not linear. The Normal Q-Q plot checks whether the residuals follow a normal distribution; if the points deviate markedly from the diagonal line, the model’s significance tests and confidence intervals may be unreliable.
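The visual checks can be complemented with numeric ones. The sketch below uses synthetic data that is heteroscedastic by construction (the variables `x` and `y` are assumptions for illustration, not columns of `AllCountries`). It runs a Shapiro-Wilk test on the residuals and computes a studentized Breusch-Pagan statistic by hand in base R: the squared residuals are regressed on the predictor, and n times the R-squared of that auxiliary regression is compared with a chi-squared distribution.

```r
set.seed(42)
n <- 150
x <- runif(n, 0, 10)

# Synthetic response whose error spread grows with x
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * (1 + x))

fit <- lm(y ~ x)
e <- resid(fit)

# Normality check: Shapiro-Wilk test on the residuals
sw <- shapiro.test(e)
print(sw$p.value)

# Studentized Breusch-Pagan test, computed by hand:
# regress squared residuals on the predictor, then use n * R-squared
aux <- lm(e^2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)  # df = number of predictors
print(c(statistic = bp_stat, p_value = bp_p))
```

A small p-value from the second test points to non-constant variance. These tests are a supplement to the diagnostic plots rather than a replacement, since with many observations they can flag departures too small to matter in practice.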
The Root Mean Square Error (RMSE) summarizes the typical magnitude of the model’s prediction errors. Because the errors are squared before averaging, large errors count more heavily than they would in a simple average of absolute errors. For the multiple regression model, the RMSE of about 4.06 means that predicted life expectancy typically deviates from the observed value by roughly four years. This figure is computed on the same data used to fit the model, so it likely understates the error on new countries. Unusually large residuals for certain countries may point to factors the current model does not account for, such as local social policies or environmental conditions, and would warrant further investigation.
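The RMSE printed above is close to, but not the same as, the residual standard error reported by `summary(model2)`. Both are built from the same residual sum of squares; they differ only in the divisor (number of observations versus residual degrees of freedom). The arithmetic below uses the rounded values from the output above to recover one from the other; the variable names are illustrative.

```r
# Values copied from the summary(model2) output above
sigma_hat <- 4.104   # residual standard error
df_resid  <- 169     # residual degrees of freedom
p         <- 4       # estimated coefficients: intercept + 3 predictors
n         <- df_resid + p   # observations actually used in the fit

# sigma_hat = sqrt(RSS / df_resid), while RMSE = sqrt(RSS / n)
rss  <- sigma_hat^2 * df_resid
rmse <- sqrt(rss / n)
print(round(rmse, 3))
```

The result agrees with the printed RMSE of about 4.056 up to rounding of the residual standard error, which confirms that the two figures describe the same fit.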
If we modeled CO2 emissions using both energy and electricity consumption as predictors, we would likely encounter multicollinearity. Because electricity is a form of energy, the two variables are likely to be highly correlated, which makes it difficult for the regression to separate the individual contribution of each one. The overlap inflates the standard errors and makes the coefficient estimates unstable: small changes in the data can shift them substantially or even flip their signs. The model’s overall predictions may remain reasonable, but it becomes hard to tell which predictor is actually associated with emissions, so the individual coefficients should not be interpreted with confidence.
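A common way to quantify this problem is the variance inflation factor (VIF): for each predictor, regress it on the other predictors and compute 1 / (1 - R-squared). The sketch below does this in base R on synthetic data in which electricity is constructed to track energy closely; the `toy` data frame and its values are assumptions for illustration, not the `AllCountries` measurements.

```r
set.seed(7)
n <- 120

# Synthetic predictors: electricity is built to move with energy
energy <- rnorm(n, mean = 2000, sd = 600)
electricity <- 1.5 * energy + rnorm(n, sd = 150)
co2 <- 1 + 0.002 * energy + rnorm(n, sd = 1)
toy <- data.frame(co2, energy, electricity)

# Pairwise correlation between the two predictors
r <- cor(toy$energy, toy$electricity)
print(r)

# VIF for energy: 1 / (1 - R-squared from regressing it on the other predictor)
r2_energy <- summary(lm(energy ~ electricity, data = toy))$r.squared
vif_energy <- 1 / (1 - r2_energy)
print(vif_energy)

# Standard error of the energy coefficient with and without electricity
both  <- lm(co2 ~ energy + electricity, data = toy)
alone <- lm(co2 ~ energy, data = toy)
se_both  <- coef(summary(both))["energy", "Std. Error"]
se_alone <- coef(summary(alone))["energy", "Std. Error"]
print(c(alone = se_alone, both = se_both))
```

A frequently quoted rule of thumb treats VIF values above roughly 5 or 10 as a warning sign. In that case, dropping one of the overlapping predictors or combining them into a single measure are typical remedies.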