knitr::opts_chunk$set(echo = TRUE)
# Load the dataset and preview the first six rows
AllCountries <- read.csv("AllCountries.csv")
head(AllCountries)
##          Country Code LandArea Population Density   GDP Rural  CO2 PumpPrice
## 1    Afghanistan  AFG   652.86     37.172    56.9   521  74.5 0.29      0.70
## 2        Albania  ALB    27.40      2.866   104.6  5254  39.7 1.98      1.36
## 3        Algeria  DZA  2381.74     42.228    17.7  4279  27.4 3.74      0.28
## 4 American Samoa  ASM     0.20      0.055   277.3    NA  12.8   NA        NA
## 5        Andorra  AND     0.47      0.077   163.8 42030  11.9 5.83        NA
## 6         Angola  AGO  1246.70     30.810    24.7  3432  34.5 1.29      0.97
##   Military Health ArmedForces Internet  Cell HIV Hunger Diabetes BirthRate
## 1     3.72   2.01         323     11.4  67.4  NA   30.3      9.6      32.5
## 2     4.08   9.51           9     71.8 123.7 0.1    5.5     10.1      11.7
## 3    13.81  10.73         317     47.7 111.0 0.1    4.7      6.7      22.3
## 4       NA     NA          NA       NA    NA  NA     NA       NA        NA
## 5       NA  14.02          NA     98.9 104.4  NA     NA      8.0        NA
## 6     9.40   5.43         117     14.3  44.7 1.9   23.9      3.9      41.3
##   DeathRate ElderlyPop LifeExpectancy FemaleLabor Unemployment Energy
## 1       6.6        2.6           64.0        50.3          1.5     NA
## 2       7.5       13.6           78.5        55.9         13.9    808
## 3       4.8        6.4           76.3        16.4         12.1   1328
## 4        NA         NA             NA          NA           NA     NA
## 5        NA         NA             NA          NA           NA     NA
## 6       8.4        2.5           61.8        76.4          7.3    545
##   Electricity Developed
## 1          NA        NA
## 2        2309         1
## 3        1363         1
## 4          NA        NA
## 5          NA        NA
## 6         312         1
# Simple linear regression: life expectancy on GDP
model1 <- lm(LifeExpectancy ~ GDP, data = AllCountries)
summary(model1)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = AllCountries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16
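
The GDP slope in the summary above is printed in scientific notation because it is expressed per single dollar. The following is a minimal sketch of rescaling the predictor to thousands of dollars, which makes the coefficient easier to read without changing the fit. The data frame `sim` and its values are simulated stand-ins, assumed only for illustration; they are not the AllCountries data.

```r
# Simulated stand-in data (hypothetical; not the AllCountries values)
set.seed(1)
sim <- data.frame(GDP = runif(100, min = 500, max = 60000))
sim$LifeExpectancy <- 68 + 0.00025 * sim$GDP + rnorm(100, sd = 5)

# Same model, with GDP in dollars and in thousands of dollars
fit_dollars   <- lm(LifeExpectancy ~ GDP, data = sim)
fit_thousands <- lm(LifeExpectancy ~ I(GDP / 1000), data = sim)

# The slope is multiplied by 1000; R-squared is unchanged
slope_dollars   <- unname(coef(fit_dollars)[2])
slope_thousands <- unname(coef(fit_thousands)[2])
r2_dollars      <- summary(fit_dollars)$r.squared
r2_thousands    <- summary(fit_thousands)$r.squared

print(c(per_dollar = slope_dollars, per_thousand = slope_thousands))
print(c(r2_dollars = r2_dollars, r2_thousands = r2_thousands))
```

Applied to model1, the same rescaling turns the slope of 2.476e-04 years per dollar into 0.2476 years per $1,000.
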
# Multiple regression: add health expenditure and internet access
model2 <- lm(LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
summary(model2)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16
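
The two summaries above drop different numbers of rows (38 versus 44), so the models are fitted to slightly different sets of countries. One way to make the comparison like for like is to refit both on the rows that are complete for every variable involved. The sketch below assumes simulated data with hypothetical values and artificially inserted missing entries; only the column names mirror the report.

```r
# Simulated stand-in data with missing values (hypothetical, for illustration)
set.seed(2)
n <- 120
sim <- data.frame(
  GDP      = runif(n, 500, 60000),
  Health   = runif(n, 2, 20),
  Internet = runif(n, 5, 95)
)
sim$LifeExpectancy <- 59 + 0.00002 * sim$GDP + 0.25 * sim$Health +
  0.19 * sim$Internet + rnorm(n, sd = 4)
sim$Health[sample(n, 10)] <- NA  # extra missingness in one predictor

# Keep only rows that are complete for every variable used by either model
vars   <- c("LifeExpectancy", "GDP", "Health", "Internet")
common <- sim[complete.cases(sim[, vars]), ]

fit_small <- lm(LifeExpectancy ~ GDP, data = common)
fit_large <- lm(LifeExpectancy ~ GDP + Health + Internet, data = common)

# Both models now use the same rows
n_small <- nobs(fit_small)
n_large <- nobs(fit_large)
adj_r2  <- c(small = summary(fit_small)$adj.r.squared,
             large = summary(fit_large)$adj.r.squared)
print(adj_r2)

# Nested models fitted to identical data can also be compared with an F-test
print(anova(fit_small, fit_large))
```
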
# Diagnostic plots for model1, shown side by side
par(mfrow = c(1, 2))
plot(model1, which = 1)  # Residuals vs. Fitted
plot(model1, which = 2)  # Normal Q-Q

# In-sample RMSE of model2, in years of life expectancy
rmse_val <- sqrt(mean(resid(model2)^2))
print(paste("RMSE:", rmse_val))
## [1] "RMSE: 4.05641669659996"
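
The RMSE printed above (about 4.06) is slightly smaller than the residual standard error reported by `summary(model2)` (4.104). The difference comes from the divisor: the RMSE divides the residual sum of squares by the number of observations, while the residual standard error divides by the residual degrees of freedom. A minimal sketch of that relationship, using simulated data with hypothetical variables `x1`, `x2`, and `y`:

```r
# Simulated stand-in data (hypothetical, for illustration only)
set.seed(3)
sim <- data.frame(x1 = rnorm(80), x2 = rnorm(80))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(80, sd = 3)
fit <- lm(y ~ x1 + x2, data = sim)

# In-sample RMSE: divides the residual sum of squares by n
rmse <- sqrt(mean(resid(fit)^2))

# Residual standard error: divides by the residual degrees of freedom
rse <- sigma(fit)

# The two differ only by the factor sqrt(df / n)
n  <- nobs(fit)
df <- df.residual(fit)
rmse_from_rse <- rse * sqrt(df / n)

print(c(rmse = rmse, rse = rse, rmse_from_rse = rmse_from_rse))
```

For model2, the same arithmetic gives 4.104 × sqrt(169 / 173) ≈ 4.06, which agrees with the printed RMSE.
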
  1. The simple linear regression model estimates the relationship between a country’s wealth (GDP per capita) and the average life expectancy of its citizens. The intercept (68.42) is the predicted life expectancy, in years, for a country with zero GDP per capita; this is an extrapolation beyond the observed data and mainly serves to anchor the line. The slope (2.476e-04) is the estimated increase in life expectancy for each additional dollar of GDP per capita, or about 0.25 years per additional $1,000, and it is statistically significant (p < 2e-16). The R-squared of 0.4304 means that GDP alone accounts for about 43% of the variation in life expectancy across the 179 countries with complete data.

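  2. Adding health expenditure and internet access to the model shows how each factor relates to life expectancy while holding the others constant. The Health coefficient (0.248, p = 0.0002) is the estimated change in life expectancy, in years, for a one-unit increase in Health at fixed GDP and Internet; it describes an association, not a proven causal effect. Internet is also strongly significant (0.190, p < 2e-16), whereas GDP is no longer significant once the other two predictors are included (p = 0.302), which suggests that much of its apparent effect in the first model is shared with health spending and internet access. Adjusted R-squared rises from 0.4272 to 0.7164, so the larger model explains considerably more of the variation in life expectancy, although the two models are fitted to slightly different samples (179 versus 173 countries) because of missing values.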

  3. To judge the reliability of the simple regression model, we check the assumptions of constant variance (homoscedasticity) and normality of the residuals. In the Residuals vs. Fitted plot, the points should scatter randomly around zero; a funnel shape would indicate heteroscedasticity, meaning the spread of the residuals changes across fitted values, and a curved pattern would indicate that a straight line does not describe the relationship adequately. The Normal Q-Q plot checks whether the residuals are approximately normal; points that deviate substantially from the diagonal line suggest that the model’s significance tests and confidence intervals may be unreliable. The residual summary for model1 (minimum -16.35, median 1.55, maximum 9.33) already hints at a left-skewed distribution.

  4. The Root Mean Square Error (RMSE) measures the typical size of the model’s prediction errors in the units of the response. Here it is computed from the residuals of model2, and its value of about 4.06 means that predicted life expectancy typically differs from the observed value by roughly four years. Because it is calculated on the same data used to fit the model, it is likely to understate the error on new data. Unusually large residuals for particular countries suggest factors the model does not capture, such as local social policies or environmental conditions, and may warrant further investigation.
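
The checks described in points 3 and 4 can also be approximated numerically, as a complement to the plots. The sketch below is one possible approach under stated assumptions: the data are simulated, the `Country` labels are hypothetical, and the auxiliary regression of squared residuals on fitted values is only a rough, informal check for non-constant variance, not a formal test.

```r
# Simulated stand-in data with country labels (hypothetical, for illustration)
set.seed(4)
n <- 100
sim <- data.frame(Country = paste0("Country", seq_len(n)),
                  GDP = runif(n, 500, 60000))
sim$LifeExpectancy <- 68 + 0.00025 * sim$GDP + rnorm(n, sd = 5)
fit <- lm(LifeExpectancy ~ GDP, data = sim)

# Normality: Shapiro-Wilk test on the residuals
# (a small p-value suggests the residuals are not normal)
sw <- shapiro.test(resid(fit))
print(sw$p.value)

# Constant variance: regress squared residuals on fitted values;
# a clearly non-zero slope hints at heteroscedasticity
aux <- lm(resid(fit)^2 ~ fitted(fit))
print(summary(aux)$coefficients)

# Largest absolute residuals: the observations the model fits worst
sim$residual <- resid(fit)
worst <- head(sim[order(-abs(sim$residual)), c("Country", "residual")], 5)
print(worst)
```

With real data that contain missing values, `resid()` returns fewer values than the data frame has rows; fitting with `na.action = na.exclude` pads the residuals with `NA` so that they line up with the original rows.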

In a scenario where we model CO2 emissions using both energy and electricity consumption as predictors, we would likely encounter multicollinearity. Because electricity is a form of energy, the two variables are highly correlated, which makes it difficult for the regression to separate the individual contribution of each. The result is unstable coefficient estimates with inflated standard errors, so it becomes hard to tell which predictor is more strongly associated with emissions, even though the model’s overall fit and predictions are largely unaffected.

  1. Multicollinearity makes individual coefficients hard to pin down because the predictors carry largely redundant information. When energy and electricity consumption are used together, the model cannot reliably apportion the variation in CO2 emissions between them. This does not necessarily mean the overall model is wrong, but the individual coefficients for those two variables become difficult to trust or to explain clearly in a report.
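
The effect described above can be illustrated with a variance inflation factor (VIF) computed by hand. This is a minimal sketch on simulated data: the variables `Energy`, `Electricity`, and `CO2` are hypothetical stand-ins constructed so that the two predictors are strongly correlated, and the numbers say nothing about the real AllCountries values.

```r
# Simulated stand-in data (hypothetical): Electricity tracks Energy closely
set.seed(5)
n <- 150
Energy      <- runif(n, 200, 8000)
Electricity <- 0.9 * Energy + rnorm(n, sd = 300)
CO2         <- 0.5 + 0.001 * Energy + rnorm(n, sd = 1)
sim <- data.frame(CO2, Energy, Electricity)

# Pairwise correlation between the two predictors
r <- cor(sim$Energy, sim$Electricity)

# VIF for Energy: 1 / (1 - R^2) from regressing it on the other predictor
r2_energy  <- summary(lm(Energy ~ Electricity, data = sim))$r.squared
vif_energy <- 1 / (1 - r2_energy)

# Standard error of the Energy coefficient with and without Electricity
se_alone <- summary(lm(CO2 ~ Energy, data = sim))$coefficients["Energy", "Std. Error"]
se_both  <- summary(lm(CO2 ~ Energy + Electricity,
                       data = sim))$coefficients["Energy", "Std. Error"]

print(c(correlation = r, vif_energy = vif_energy))
print(c(se_alone = se_alone, se_both = se_both))
```

A VIF well above 1 means the standard error of that coefficient is inflated relative to a model in which the predictors are uncorrelated, which is why the individual estimates become hard to interpret while the fitted values remain usable.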