# read the data
dat <- read.csv("AllCountries.csv")
# View the first rows
head(dat)
## Country Code LandArea Population Density GDP Rural CO2 PumpPrice
## 1 Afghanistan AFG 652.86 37.172 56.9 521 74.5 0.29 0.70
## 2 Albania ALB 27.40 2.866 104.6 5254 39.7 1.98 1.36
## 3 Algeria DZA 2381.74 42.228 17.7 4279 27.4 3.74 0.28
## 4 American Samoa ASM 0.20 0.055 277.3 NA 12.8 NA NA
## 5 Andorra AND 0.47 0.077 163.8 42030 11.9 5.83 NA
## 6 Angola AGO 1246.70 30.810 24.7 3432 34.5 1.29 0.97
## Military Health ArmedForces Internet Cell HIV Hunger Diabetes BirthRate
## 1 3.72 2.01 323 11.4 67.4 NA 30.3 9.6 32.5
## 2 4.08 9.51 9 71.8 123.7 0.1 5.5 10.1 11.7
## 3 13.81 10.73 317 47.7 111.0 0.1 4.7 6.7 22.3
## 4 NA NA NA NA NA NA NA NA NA
## 5 NA 14.02 NA 98.9 104.4 NA NA 8.0 NA
## 6 9.40 5.43 117 14.3 44.7 1.9 23.9 3.9 41.3
## DeathRate ElderlyPop LifeExpectancy FemaleLabor Unemployment Energy
## 1 6.6 2.6 64.0 50.3 1.5 NA
## 2 7.5 13.6 78.5 55.9 13.9 808
## 3 4.8 6.4 76.3 16.4 12.1 1328
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 8.4 2.5 61.8 76.4 7.3 545
## Electricity Developed
## 1 NA NA
## 2 2309 1
## 3 1363 1
## 4 NA NA
## 5 NA NA
## 6 312 1
# simple linear regression model
lrModel1 <- lm(LifeExpectancy ~ GDP, data = dat)
# summary of model
summary(lrModel1)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.352 -3.882 1.550 4.458 9.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
## GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
Interpretation:
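The intercept of 68.42 is the estimated life expectancy (in years) for a country with a GDP per capita of $0. A GDP of $0 is not realistic, so the intercept mainly anchors the regression line rather than describing an actual country.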
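The slope of 0.0002476 indicates that each $1 increase in GDP per capita is associated with an increase of about 0.00025 years in predicted life expectancy. On a more readable scale, that is roughly 0.25 years per $1,000 of GDP per capita.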
The R-squared value of 0.4304 means that GDP explains about 43% of the variation in life expectancy across countries. This is a moderate relationship: GDP is an important factor, but the remaining 57% of the variation is left to other variables, which may be just as influential (or more).
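The slope is easier to read once it is rescaled. The short sketch below only reuses the two coefficients printed in the summary output; the variable names and the $10,000 example value are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from the summary(lrModel1) output above
intercept <- 68.42             # predicted life expectancy at GDP = 0
slope_per_dollar <- 2.476e-04  # change in life expectancy per $1 of GDP

# Rescale the slope to a $1,000 change in GDP per capita
slope_per_1000 <- slope_per_dollar * 1000

# Predicted life expectancy at an illustrative GDP per capita of $10,000
gdp_example <- 10000
pred_example <- intercept + slope_per_dollar * gdp_example

slope_per_1000
pred_example
```

With the fitted model available, predict(lrModel1, newdata = data.frame(GDP = 10000)) should give essentially the same prediction, up to rounding of the printed coefficients.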
# multiple linear regression model
lrModel2 <- lm(LifeExpectancy ~ GDP + Health + Internet, data = dat)
# summary of model
summary(lrModel2)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5662 -1.8227 0.4108 2.5422 9.4161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
## GDP 2.367e-05 2.287e-05 1.035 0.302025
## Health 2.479e-01 6.619e-02 3.745 0.000247 ***
## Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
Interpretation:
The intercept of 59.08 represents the predicted life expectancy for a country with GDP, healthcare spending, and internet usage all equal to zero.
The coefficient for Health is 0.2479, which means that for every 1 percentage point increase in the share of government spending on healthcare, life expectancy increases by about 0.25 years, holding GDP and Internet constant. This variable is statistically significant (p = 0.000247).
The Internet variable has a coefficient of 0.1903, meaning that a 1 percentage point increase in internet access is associated with an increase of about 0.19 years in life expectancy, holding GDP and Health constant.
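GDP is no longer statistically significant in this model (p = 0.302). One plausible explanation is that GDP is correlated with Health and Internet, so once those predictors are in the model, GDP adds little unique information about life expectancy.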
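The Adjusted R-squared value is 0.7164, which means that about 72% of the variation in life expectancy across countries is explained by this model after accounting for the number of predictors. That is a clear improvement over the simple model, whose Adjusted R-squared was 0.4272.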
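The Adjusted R-squared can be reproduced by hand from the other numbers in the summary, which makes the penalty for extra predictors concrete. This is a minimal sketch using only the printed values; the variable names are assumptions introduced here.

```r
# Values copied from the summary(lrModel2) output above
r2 <- 0.7213      # Multiple R-squared
df_resid <- 169   # residual degrees of freedom
p <- 3            # number of predictors (GDP, Health, Internet)

# Observations actually used in the fit: residual df + predictors + intercept
n <- df_resid + p + 1

# Adjusted R-squared shrinks R-squared according to the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid
round(adj_r2, 4)
```

Because the sample is large relative to three predictors, the adjustment is small, which is why the two R-squared values in the output are so close.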
# diagnostic plots for assumptions
par(mfrow=c(2,2))
plot(lrModel1)
par(mfrow=c(1,1))
Interpretation:
Overall: For the simple model (lrModel1), the assumptions are mostly met, with mild violations of linearity, normality, and constant variance.
# residuals vs. order
plot(resid(lrModel1), type = "b",
main = "Residuals vs Observation Order",
ylab = "Residuals")
abline(h = 0, lty = 2)
Interpretation:
Conclusion: The independence assumption seems reasonable in this model.
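A numeric complement to the residuals-vs-order plot is the Durbin-Watson statistic, where values near 2 suggest no first-order autocorrelation and values near 0 suggest strong positive autocorrelation. The sketch below computes it in base R; the helper name durbin_watson and the simulated residuals are assumptions for illustration, not output from lrModel1. Note that the rows here appear to be in alphabetical order by country, so any order-based check is only a rough guide.

```r
# Durbin-Watson statistic: sum of squared successive differences
# divided by the sum of squared residuals (hypothetical helper name)
durbin_watson <- function(e) {
  sum(diff(e)^2) / sum(e^2)
}

# Simulated residuals for illustration only (not from lrModel1)
set.seed(42)
e_independent <- rnorm(200)       # no dependence between neighbors
e_trending <- cumsum(rnorm(200))  # random walk: strongly autocorrelated

dw_independent <- durbin_watson(e_independent)
dw_trending <- durbin_watson(e_trending)

dw_independent
dw_trending
```

With the fitted model in memory, the same helper could be applied as durbin_watson(resid(lrModel1)).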
# calculate the residuals for multiple model
residuals_lrModel2 <- resid(lrModel2)
# compute the RMSE
rmse_lrModel2 <- sqrt(mean(residuals_lrModel2^2))
rmse_lrModel2
## [1] 4.056417
Interpretation:
The RMSE (Root Mean Squared Error) for the multiple regression model is ~4.06. This means that the model’s predicted life expectancy values typically miss the actual values by about 4 years.
This gives an estimate of the model’s prediction error. Very large residuals for some countries (the most negative residual is about -14.6 years) lower our confidence in the model’s predictions for those countries. Those points are worth examining to see whether unique factors that the model does not capture are at work.
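The RMSE above is slightly smaller than the residual standard error of 4.104 reported by summary(lrModel2). The difference comes from the divisor: the residual standard error divides the residual sum of squares by the residual degrees of freedom, while the RMSE as computed here divides by the number of observations. A rough check using only the printed values (variable names are assumptions):

```r
# Values copied from the outputs above
rse <- 4.104       # residual standard error from summary(lrModel2)
df_resid <- 169    # residual degrees of freedom
n <- df_resid + 4  # observations used: 3 predictors + intercept

# Recover the residual sum of squares, then divide by n instead of df
rss <- rse^2 * df_resid
rmse_from_rse <- sqrt(rss / n)
rmse_from_rse
```

The result agrees with the reported RMSE of 4.056417 to about three decimal places; the small gap is due to rounding in the printed residual standard error.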
If there is a high correlation between energy and electricity, multicollinearity would be present in the regression model. This means the two predictors carry overlapping information, which makes it difficult to isolate how much of a unique effect each variable has on CO2 emissions. Multicollinearity can cause unstable coefficient estimates, as small changes in the data can lead to large shifts in the slope values. It can also inflate standard errors, which can make a predictor look statistically insignificant even when it matters. Finally, it would make it generally more difficult to interpret whether energy or electricity carries the real effect on emissions. To improve this model, we could drop one of the correlated predictors, or combine them into one variable.
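One common way to quantify this overlap is the variance inflation factor (VIF), which for a given predictor equals 1 / (1 - R^2), where R^2 comes from regressing that predictor on the other predictors. The sketch below computes it in base R on simulated data; the data frame sim, its columns, and the helper vif_manual are hypothetical and do not come from AllCountries.csv.

```r
# Simulated data for illustration only: electricity is built to track energy
set.seed(1)
n <- 100
energy <- rnorm(n, mean = 2000, sd = 500)
electricity <- 1.5 * energy + rnorm(n, sd = 100)
co2 <- 0.002 * energy + rnorm(n, sd = 0.5)
sim <- data.frame(co2, energy, electricity)

# VIF for one predictor: regress it on the other predictors,
# then take 1 / (1 - R^2) (hypothetical helper name)
vif_manual <- function(predictor, others, data) {
  fit <- lm(reformulate(others, response = predictor), data = data)
  1 / (1 - summary(fit)$r.squared)
}

cor_ee <- cor(sim$energy, sim$electricity)
vif_energy <- vif_manual("energy", "electricity", sim)

cor_ee
vif_energy
```

A frequently cited rule of thumb treats VIF values above about 5 or 10 as a sign of problematic multicollinearity, though the threshold is a judgment call rather than a strict cutoff.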