Question 1
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(tibble)
AllCountries <- read.csv("AllCountries.csv")
head(AllCountries)
## Country Code LandArea Population Density GDP Rural CO2 PumpPrice
## 1 Afghanistan AFG 652.86 37.172 56.9 521 74.5 0.29 0.70
## 2 Albania ALB 27.40 2.866 104.6 5254 39.7 1.98 1.36
## 3 Algeria DZA 2381.74 42.228 17.7 4279 27.4 3.74 0.28
## 4 American Samoa ASM 0.20 0.055 277.3 NA 12.8 NA NA
## 5 Andorra AND 0.47 0.077 163.8 42030 11.9 5.83 NA
## 6 Angola AGO 1246.70 30.810 24.7 3432 34.5 1.29 0.97
## Military Health ArmedForces Internet Cell HIV Hunger Diabetes BirthRate
## 1 3.72 2.01 323 11.4 67.4 NA 30.3 9.6 32.5
## 2 4.08 9.51 9 71.8 123.7 0.1 5.5 10.1 11.7
## 3 13.81 10.73 317 47.7 111.0 0.1 4.7 6.7 22.3
## 4 NA NA NA NA NA NA NA NA NA
## 5 NA 14.02 NA 98.9 104.4 NA NA 8.0 NA
## 6 9.40 5.43 117 14.3 44.7 1.9 23.9 3.9 41.3
## DeathRate ElderlyPop LifeExpectancy FemaleLabor Unemployment Energy
## 1 6.6 2.6 64.0 50.3 1.5 NA
## 2 7.5 13.6 78.5 55.9 13.9 808
## 3 4.8 6.4 76.3 16.4 12.1 1328
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 8.4 2.5 61.8 76.4 7.3 545
## Electricity Developed
## 1 NA NA
## 2 2309 1
## 3 1363 1
## 4 NA NA
## 5 NA NA
## 6 312 1
tail(AllCountries)
## Country Code LandArea Population Density GDP Rural CO2
## 212 Vietnam VNM 310.07 95.540 308.1 2564 64.1 1.82
## 213 Virgin Islands (U.S.) VIR 0.35 0.107 305.6 NA 4.3 NA
## 214 West Bank and Gaza PSE 6.02 4.569 759.0 3199 23.8 NA
## 215 Yemen, Rep. YEM 527.97 28.499 54.0 944 63.4 0.88
## 216 Zambia ZMB 743.39 17.352 23.3 1540 56.5 0.29
## 217 Zimbabwe ZWE 386.85 14.439 37.3 2147 67.8 0.88
## PumpPrice Military Health ArmedForces Internet Cell HIV Hunger Diabetes
## 212 0.80 8.10 8.95 522 49.6 125.6 0.3 10.8 6.0
## 213 NA NA NA NA 64.4 NA NA NA 12.3
## 214 1.54 NA NA NA 65.2 81.2 NA NA 10.6
## 215 0.92 NA 0.00 40 26.7 54.4 NA 34.4 5.4
## 216 1.40 5.66 7.13 16 27.9 78.6 11.5 44.5 3.9
## 217 1.34 5.61 14.51 51 27.1 85.3 13.3 46.6 1.8
## BirthRate DeathRate ElderlyPop LifeExpectancy FemaleLabor Unemployment
## 212 16.5 5.8 7.4 76.5 79.1 1.9
## 213 12.8 7.8 19.1 79.4 64.6 8.4
## 214 31.4 3.5 3.1 73.6 20.3 30.2
## 215 31.0 6.4 2.9 65.2 6.2 12.9
## 216 37.8 7.6 2.5 62.3 71.7 7.2
## 217 32.3 7.9 2.8 61.7 79.5 4.9
## Energy Electricity Developed
## 212 NA 1424 1
## 213 NA NA NA
## 214 NA NA NA
## 215 NA 220 1
## 216 NA 717 1
## 217 NA 609 1
str(AllCountries)
## 'data.frame': 217 obs. of 26 variables:
## $ Country : chr "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## $ Code : chr "AFG" "ALB" "DZA" "ASM" ...
## $ LandArea : num 652.86 27.4 2381.74 0.2 0.47 ...
## $ Population : num 37.172 2.866 42.228 0.055 0.077 ...
## $ Density : num 56.9 104.6 17.7 277.3 163.8 ...
## $ GDP : int 521 5254 4279 NA 42030 3432 16864 11653 4212 NA ...
## $ Rural : num 74.5 39.7 27.4 12.8 11.9 34.5 75.4 8.1 36.9 56.6 ...
## $ CO2 : num 0.29 1.98 3.74 NA 5.83 1.29 5.74 4.78 1.9 8.41 ...
## $ PumpPrice : num 0.7 1.36 0.28 NA NA 0.97 NA 1.1 0.77 NA ...
## $ Military : num 3.72 4.08 13.81 NA NA ...
## $ Health : num 2.01 9.51 10.73 NA 14.02 ...
## $ ArmedForces : int 323 9 317 NA NA 117 0 105 49 NA ...
## $ Internet : num 11.4 71.8 47.7 NA 98.9 14.3 76 75.8 69.7 97.2 ...
## $ Cell : num 67.4 123.7 111 NA 104.4 ...
## $ HIV : num NA 0.1 0.1 NA NA 1.9 NA 0.4 0.2 NA ...
## $ Hunger : num 30.3 5.5 4.7 NA NA 23.9 NA 3.8 4.3 NA ...
## $ Diabetes : num 9.6 10.1 6.7 NA 8 3.9 13.2 5.5 7.1 11.6 ...
## $ BirthRate : num 32.5 11.7 22.3 NA NA 41.3 16.1 17 13.1 11 ...
## $ DeathRate : num 6.6 7.5 4.8 NA NA 8.4 5.8 7.6 9.7 8.9 ...
## $ ElderlyPop : num 2.6 13.6 6.4 NA NA 2.5 7.2 11.3 11.4 13.6 ...
## $ LifeExpectancy: num 64 78.5 76.3 NA NA 61.8 76.5 76.7 74.8 76 ...
## $ FemaleLabor : num 50.3 55.9 16.4 NA NA 76.4 NA 57.1 55.8 NA ...
## $ Unemployment : num 1.5 13.9 12.1 NA NA 7.3 NA 9.5 17.7 NA ...
## $ Energy : int NA 808 1328 NA NA 545 NA 2030 1016 NA ...
## $ Electricity : int NA 2309 1363 NA NA 312 NA 3075 1962 NA ...
## $ Developed : int NA 1 1 NA NA 1 NA 2 1 NA ...
colSums(is.na(AllCountries))
## Country Code LandArea Population Density
## 0 0 8 1 8
## GDP Rural CO2 PumpPrice Military
## 30 3 13 50 67
## Health ArmedForces Internet Cell HIV
## 29 49 13 15 81
## Hunger Diabetes BirthRate DeathRate ElderlyPop
## 52 10 15 15 24
## LifeExpectancy FemaleLabor Unemployment Energy Electricity
## 18 30 30 82 76
## Developed
## 75
Question 2
simple_regression <- lm(LifeExpectancy ~ GDP, data = AllCountries )
summary(simple_regression)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = AllCountries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.352 -3.882 1.550 4.458 9.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
## GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
The intercept (68.42) is the model's predicted life expectancy, in years, for a country with a GDP per capita of $0. No country has a GDP that low, so this value is an extrapolation rather than a realistic prediction.
GDP: The slope (0.0002476) means that each $1 increase in GDP per capita is associated with an increase of about 0.0002476 years in predicted life expectancy, or roughly 0.25 years per $1,000. The p-value (< 2e-16) shows that the relationship between GDP and life expectancy is highly statistically significant.
The adjusted R-squared of 0.4272 means that approximately 42.72% of the variation in life expectancy across countries is explained by GDP per capita.
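A slope of 0.0002476 years per dollar is hard to read, so one option is to refit the model with GDP in thousands of dollars. The sketch below uses simulated data as a stand-in for AllCountries (the data frame `sim` and its values are hypothetical), and only illustrates that rescaling the predictor multiplies the slope by the same factor without changing the fit.

```r
# Simulated stand-in for AllCountries (hypothetical values, not the real data)
set.seed(1)
sim <- data.frame(GDP = runif(100, min = 500, max = 60000))
sim$LifeExpectancy <- 68 + 0.00025 * sim$GDP + rnorm(100, sd = 5)

# Same model, once with GDP in dollars and once in thousands of dollars
fit_dollars   <- lm(LifeExpectancy ~ GDP, data = sim)
fit_thousands <- lm(LifeExpectancy ~ I(GDP / 1000), data = sim)

# The slope per $1,000 equals 1,000 times the slope per $1
slope_per_dollar   <- unname(coef(fit_dollars)[2])
slope_per_thousand <- unname(coef(fit_thousands)[2])
slope_per_dollar * 1000
slope_per_thousand

# R-squared is unchanged by the rescaling
summary(fit_dollars)$r.squared
summary(fit_thousands)$r.squared
```

Applied to the real data, the same idea would be `lm(LifeExpectancy ~ I(GDP / 1000), data = AllCountries)`.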
Question 3
multiple_regression <- lm(LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
summary(multiple_regression)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5662 -1.8227 0.4108 2.5422 9.4161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
## GDP 2.367e-05 2.287e-05 1.035 0.302025
## Health 2.479e-01 6.619e-02 3.745 0.000247 ***
## Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
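Intercept: When GDP, Health, and Internet are all 0, the predicted life expectancy is approximately 59.08 years.
GDP: (Positive) around 0.000024, meaning that each $1 increase in GDP per capita is associated with an increase of about 0.000024 years in life expectancy, holding Health and Internet constant. This coefficient is not statistically significant (p = 0.302).
Health: (Positive) around 0.248, meaning that each one-unit increase in Health is associated with an increase of about 0.248 years in life expectancy, holding GDP and Internet constant. This coefficient is significant (p = 0.000247).
Internet: (Positive) around 0.190, meaning that each one-unit increase in Internet is associated with an increase of about 0.190 years in life expectancy, holding GDP and Health constant. This coefficient is highly significant (p < 2e-16).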
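The two models above were fit on different rows: the simple regression used 179 countries and the multiple regression used 173, because `lm()` drops any row with a missing value in the variables it needs. A fair comparison refits both models on the same complete cases. The sketch below shows this on simulated data (the data frame `sim` and all its values are hypothetical), where Internet is built to be correlated with GDP, which is one plausible reason a GDP coefficient can lose significance once Internet is added.

```r
# Simulated stand-in for AllCountries (hypothetical values, not the real data)
set.seed(2)
n <- 150
sim <- data.frame(GDP = runif(n, 500, 60000))
sim$Internet <- pmax(0, pmin(100, 10 + 0.0015 * sim$GDP + rnorm(n, sd = 10)))
sim$Health <- runif(n, 2, 15)
sim$LifeExpectancy <- 59 + 0.25 * sim$Health + 0.19 * sim$Internet + rnorm(n, sd = 4)
sim$GDP[sample(n, 10)] <- NA  # introduce some missing values

# Keep only rows that are complete for every variable in the larger model
complete <- na.omit(sim[, c("LifeExpectancy", "GDP", "Health", "Internet")])

fit_small <- lm(LifeExpectancy ~ GDP, data = complete)
fit_full  <- lm(LifeExpectancy ~ GDP + Health + Internet, data = complete)

# How strongly the two predictors move together
cor(complete$GDP, complete$Internet)

# F-test for whether Health and Internet add explanatory power beyond GDP
anova(fit_small, fit_full)
```

`anova()` only makes sense when both models use the same observations, which is why the `na.omit()` step comes first.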
Question 4
par(mfrow=c(2,2)); plot(simple_regression); par(mfrow=c(1,1))
Homoscedasticity: (Residuals vs Fitted) The smoothed line is not flat and the residuals are not randomly scattered around zero, which suggests non-linearity and heteroscedasticity.
(Scale-Location) The line is also not horizontal, and the spread of the residuals changes across the fitted values.
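Normality: (Q-Q Plot) The residuals deviate from the reference line toward the upper right tail, so the distribution is not exactly normal. The residual summary for this model (minimum -16.352, median 1.550, maximum 9.330) suggests that the longer tail is the lower one, which points to a left skew rather than a right skew.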
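A common remedy to try when a GDP model shows curvature and uneven spread is to log-transform GDP. The sketch below is an illustration on simulated data only (the data frame `sim` is hypothetical and is generated so that the true relationship is logarithmic); whether the transformation helps on AllCountries would need to be checked with the same diagnostic plots.

```r
# Simulated stand-in: life expectancy rises with log(GDP), not GDP itself
set.seed(3)
sim <- data.frame(GDP = exp(runif(150, log(500), log(60000))))
sim$LifeExpectancy <- 30 + 4.5 * log(sim$GDP) + rnorm(150, sd = 4)

fit_raw <- lm(LifeExpectancy ~ GDP, data = sim)       # straight line in GDP
fit_log <- lm(LifeExpectancy ~ log(GDP), data = sim)  # straight line in log(GDP)

# Compare how much variation each specification explains
r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared
r2_raw
r2_log
```

The diagnostic call used above, `par(mfrow=c(2,2)); plot(...); par(mfrow=c(1,1))`, can be rerun on the log model to see whether the curvature and changing spread are reduced.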
Question 5
residuals_for_multiple_regression <- resid(multiple_regression)
rmse_multiple_regression <- sqrt(mean(residuals_for_multiple_regression^2))
rmse_multiple_regression
## [1] 4.056417
RMSE (Root Mean Squared Error) measures the typical size of the prediction error in the units of the response, here years of life expectancy. An RMSE of 4.056417 means that the model's predicted life expectancy typically differs from the actual life expectancy by about 4.06 years.
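The RMSE (4.056) is slightly smaller than the residual standard error in the summary output (4.104) because the RMSE divides the sum of squared residuals by the number of observations, while the residual standard error divides by the residual degrees of freedom. The check below uses only the numbers reported above; the variable names are illustrative.

```r
rse      <- 4.104         # residual standard error from summary(multiple_regression)
df_resid <- 169           # residual degrees of freedom from the same output
n        <- df_resid + 4  # intercept + 3 slopes, so 173 observations were used

# RSE = sqrt(SSE / df) and RMSE = sqrt(SSE / n), so RMSE = RSE * sqrt(df / n)
rmse_from_rse <- rse * sqrt(df_resid / n)
rmse_from_rse
```

The result agrees with the computed RMSE of 4.056417 up to the rounding of the reported residual standard error.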
Large residuals would reduce confidence in the model's predictions for countries with unusually high or low life expectancy, since they may point to non-linearity or outliers.
I would investigate these cases using the Residuals vs Leverage plot to find out whether they are outliers that influence the model.
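Cook's distance gives a numeric companion to the Residuals vs Leverage plot. The sketch below uses simulated data with one planted influential point (the data frame `sim` is hypothetical), and the 4/n cutoff is a common rule of thumb rather than a strict test.

```r
# Simulated data with one deliberately influential observation
set.seed(4)
sim <- data.frame(x = rnorm(50))
sim$y <- 2 + 3 * sim$x + rnorm(50)
sim$x[50] <- 6    # far from the other x values (high leverage)
sim$y[50] <- -10  # far from the fitted line (large residual)

fit <- lm(y ~ x, data = sim)

# Cook's distance combines residual size and leverage for each observation
cd <- cooks.distance(fit)

# Flag observations above the rule-of-thumb cutoff 4 / n
flagged <- which(cd > 4 / nobs(fit))
flagged
```

On the real model, `cooks.distance(multiple_regression)` would give the same kind of per-country values, and the flagged rows could then be inspected by name.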
Question 6
If I used a multiple regression model to predict CO2 emissions from Energy and Electricity and found that the two predictors are highly correlated, that would be multicollinearity. It would be hard to tell which predictor matters more for CO2 emissions, because the model cannot separate their individual effects. The coefficient estimates would be unstable, with inflated standard errors, and the p-values would be unreliable.
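One way to quantify this is the variance inflation factor (VIF), which can be computed in base R as 1 / (1 - R²) from a regression of one predictor on the other. The sketch below uses simulated values for Energy and Electricity (hypothetical numbers, not the AllCountries data), and the cutoff of 10 is a common rule of thumb.

```r
# Simulated predictors that are strongly correlated by construction
set.seed(5)
n <- 100
Energy      <- rnorm(n, mean = 2000, sd = 500)
Electricity <- 1.5 * Energy + rnorm(n, sd = 100)

# Correlation between the two predictors
r <- cor(Energy, Electricity)
r

# VIF for Energy: regress it on the other predictor and use 1 / (1 - R^2)
r2  <- summary(lm(Energy ~ Electricity))$r.squared
vif <- 1 / (1 - r2)
vif  # values above about 10 are often read as serious multicollinearity
```

With only two predictors the VIF is the same for both. Possible responses include dropping one of the two predictors or combining them into a single measure.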