Question 1

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(tibble)


AllCountries <- read.csv("AllCountries.csv")

head(AllCountries)
##          Country Code LandArea Population Density   GDP Rural  CO2 PumpPrice
## 1    Afghanistan  AFG   652.86     37.172    56.9   521  74.5 0.29      0.70
## 2        Albania  ALB    27.40      2.866   104.6  5254  39.7 1.98      1.36
## 3        Algeria  DZA  2381.74     42.228    17.7  4279  27.4 3.74      0.28
## 4 American Samoa  ASM     0.20      0.055   277.3    NA  12.8   NA        NA
## 5        Andorra  AND     0.47      0.077   163.8 42030  11.9 5.83        NA
## 6         Angola  AGO  1246.70     30.810    24.7  3432  34.5 1.29      0.97
##   Military Health ArmedForces Internet  Cell HIV Hunger Diabetes BirthRate
## 1     3.72   2.01         323     11.4  67.4  NA   30.3      9.6      32.5
## 2     4.08   9.51           9     71.8 123.7 0.1    5.5     10.1      11.7
## 3    13.81  10.73         317     47.7 111.0 0.1    4.7      6.7      22.3
## 4       NA     NA          NA       NA    NA  NA     NA       NA        NA
## 5       NA  14.02          NA     98.9 104.4  NA     NA      8.0        NA
## 6     9.40   5.43         117     14.3  44.7 1.9   23.9      3.9      41.3
##   DeathRate ElderlyPop LifeExpectancy FemaleLabor Unemployment Energy
## 1       6.6        2.6           64.0        50.3          1.5     NA
## 2       7.5       13.6           78.5        55.9         13.9    808
## 3       4.8        6.4           76.3        16.4         12.1   1328
## 4        NA         NA             NA          NA           NA     NA
## 5        NA         NA             NA          NA           NA     NA
## 6       8.4        2.5           61.8        76.4          7.3    545
##   Electricity Developed
## 1          NA        NA
## 2        2309         1
## 3        1363         1
## 4          NA        NA
## 5          NA        NA
## 6         312         1
tail(AllCountries)
##                   Country Code LandArea Population Density  GDP Rural  CO2
## 212               Vietnam  VNM   310.07     95.540   308.1 2564  64.1 1.82
## 213 Virgin Islands (U.S.)  VIR     0.35      0.107   305.6   NA   4.3   NA
## 214    West Bank and Gaza  PSE     6.02      4.569   759.0 3199  23.8   NA
## 215           Yemen, Rep.  YEM   527.97     28.499    54.0  944  63.4 0.88
## 216                Zambia  ZMB   743.39     17.352    23.3 1540  56.5 0.29
## 217              Zimbabwe  ZWE   386.85     14.439    37.3 2147  67.8 0.88
##     PumpPrice Military Health ArmedForces Internet  Cell  HIV Hunger Diabetes
## 212      0.80     8.10   8.95         522     49.6 125.6  0.3   10.8      6.0
## 213        NA       NA     NA          NA     64.4    NA   NA     NA     12.3
## 214      1.54       NA     NA          NA     65.2  81.2   NA     NA     10.6
## 215      0.92       NA   0.00          40     26.7  54.4   NA   34.4      5.4
## 216      1.40     5.66   7.13          16     27.9  78.6 11.5   44.5      3.9
## 217      1.34     5.61  14.51          51     27.1  85.3 13.3   46.6      1.8
##     BirthRate DeathRate ElderlyPop LifeExpectancy FemaleLabor Unemployment
## 212      16.5       5.8        7.4           76.5        79.1          1.9
## 213      12.8       7.8       19.1           79.4        64.6          8.4
## 214      31.4       3.5        3.1           73.6        20.3         30.2
## 215      31.0       6.4        2.9           65.2         6.2         12.9
## 216      37.8       7.6        2.5           62.3        71.7          7.2
## 217      32.3       7.9        2.8           61.7        79.5          4.9
##     Energy Electricity Developed
## 212     NA        1424         1
## 213     NA          NA        NA
## 214     NA          NA        NA
## 215     NA         220         1
## 216     NA         717         1
## 217     NA         609         1
str(AllCountries)
## 'data.frame':    217 obs. of  26 variables:
##  $ Country       : chr  "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##  $ Code          : chr  "AFG" "ALB" "DZA" "ASM" ...
##  $ LandArea      : num  652.86 27.4 2381.74 0.2 0.47 ...
##  $ Population    : num  37.172 2.866 42.228 0.055 0.077 ...
##  $ Density       : num  56.9 104.6 17.7 277.3 163.8 ...
##  $ GDP           : int  521 5254 4279 NA 42030 3432 16864 11653 4212 NA ...
##  $ Rural         : num  74.5 39.7 27.4 12.8 11.9 34.5 75.4 8.1 36.9 56.6 ...
##  $ CO2           : num  0.29 1.98 3.74 NA 5.83 1.29 5.74 4.78 1.9 8.41 ...
##  $ PumpPrice     : num  0.7 1.36 0.28 NA NA 0.97 NA 1.1 0.77 NA ...
##  $ Military      : num  3.72 4.08 13.81 NA NA ...
##  $ Health        : num  2.01 9.51 10.73 NA 14.02 ...
##  $ ArmedForces   : int  323 9 317 NA NA 117 0 105 49 NA ...
##  $ Internet      : num  11.4 71.8 47.7 NA 98.9 14.3 76 75.8 69.7 97.2 ...
##  $ Cell          : num  67.4 123.7 111 NA 104.4 ...
##  $ HIV           : num  NA 0.1 0.1 NA NA 1.9 NA 0.4 0.2 NA ...
##  $ Hunger        : num  30.3 5.5 4.7 NA NA 23.9 NA 3.8 4.3 NA ...
##  $ Diabetes      : num  9.6 10.1 6.7 NA 8 3.9 13.2 5.5 7.1 11.6 ...
##  $ BirthRate     : num  32.5 11.7 22.3 NA NA 41.3 16.1 17 13.1 11 ...
##  $ DeathRate     : num  6.6 7.5 4.8 NA NA 8.4 5.8 7.6 9.7 8.9 ...
##  $ ElderlyPop    : num  2.6 13.6 6.4 NA NA 2.5 7.2 11.3 11.4 13.6 ...
##  $ LifeExpectancy: num  64 78.5 76.3 NA NA 61.8 76.5 76.7 74.8 76 ...
##  $ FemaleLabor   : num  50.3 55.9 16.4 NA NA 76.4 NA 57.1 55.8 NA ...
##  $ Unemployment  : num  1.5 13.9 12.1 NA NA 7.3 NA 9.5 17.7 NA ...
##  $ Energy        : int  NA 808 1328 NA NA 545 NA 2030 1016 NA ...
##  $ Electricity   : int  NA 2309 1363 NA NA 312 NA 3075 1962 NA ...
##  $ Developed     : int  NA 1 1 NA NA 1 NA 2 1 NA ...
colSums(is.na(AllCountries))
##        Country           Code       LandArea     Population        Density 
##              0              0              8              1              8 
##            GDP          Rural            CO2      PumpPrice       Military 
##             30              3             13             50             67 
##         Health    ArmedForces       Internet           Cell            HIV 
##             29             49             13             15             81 
##         Hunger       Diabetes      BirthRate      DeathRate     ElderlyPop 
##             52             10             15             15             24 
## LifeExpectancy    FemaleLabor   Unemployment         Energy    Electricity 
##             18             30             30             82             76 
##      Developed 
##             75

Question 2

simple_regression <- lm(LifeExpectancy ~ GDP, data = AllCountries)

summary(simple_regression)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = AllCountries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

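The intercept (68.42) is the model's predicted life expectancy, in years, for a country with a GDP per capita of $0. Since no real country has a GDP per capita of $0, this value is an extrapolation that anchors the regression line rather than a realistic prediction.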

GDP: The slope is 0.0002476, which means that each $1 increase in GDP per capita is associated with an increase of about 0.0002476 years in predicted life expectancy, or roughly 0.25 years per $1,000. The p-value (p < 2e-16) means the association between GDP and life expectancy is statistically significant; on its own it does not show that GDP determines life expectancy.
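Because the per-dollar slope is so small, it can help to rescale it. The following is a minimal sketch that uses the rounded coefficients from the printed summary as literals; the $10,000 example value is a hypothetical input chosen only for illustration.

```r
# Coefficients copied (rounded) from the summary(simple_regression) output.
intercept <- 68.42
slope_per_dollar <- 2.476e-04

# Rescale the slope to a $1,000 change in GDP per capita.
slope_per_1000 <- slope_per_dollar * 1000
slope_per_1000

# Predicted life expectancy at a hypothetical GDP per capita of $10,000.
gdp_example <- 10000
predicted <- intercept + slope_per_dollar * gdp_example
predicted
```

With the fitted model object available, `predict(simple_regression, newdata = data.frame(GDP = 10000))` should give the same prediction up to rounding.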

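The adjusted R-squared of 0.4272 means that approximately 42.72% of the variation in life expectancy is explained by GDP per capita, after adjusting for the number of predictors. This is based on the 179 countries with complete data, since 38 observations were dropped for missing values.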
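As a check on where the adjusted value comes from, it can be reproduced from the multiple R-squared and the sample size reported in the summary. This sketch uses the standard adjustment formula with the printed values as literals.

```r
# Values copied from the summary(simple_regression) output.
r_squared <- 0.4304
df_resid <- 177
p <- 1                      # number of predictors (GDP only)
n <- df_resid + p + 1       # 179 countries used in the fit

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
round(adj_r_squared, 4)
```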

Question 3

multiple_regression <- lm(LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)

summary(multiple_regression)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

The intercept (59.08) is the predicted life expectancy, in years, for a country where GDP, Health, and Internet are all 0.

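GDP: (Positive) about 0.0000237, meaning each $1 increase in GDP is associated with a 0.0000237-year increase in predicted life expectancy, holding Health and Internet constant. With p = 0.302, this coefficient is not statistically significant once Health and Internet are in the model.

Health: (Positive) about 0.248, meaning each one-unit increase in Health is associated with a 0.248-year increase in predicted life expectancy, holding GDP and Internet constant (p = 0.000247).

Internet: (Positive) about 0.190, meaning each one-unit increase in Internet is associated with a 0.190-year increase in predicted life expectancy, holding GDP and Health constant (p < 2e-16).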
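To see how the three coefficients combine, the sketch below computes a prediction by hand from the rounded estimates in the printed summary. The country profile (GDP of 10000, Health of 10, Internet of 50) is hypothetical and was chosen only to keep the arithmetic easy to follow.

```r
# Coefficients copied (rounded) from the summary(multiple_regression) output.
coefs <- c(intercept = 59.08, GDP = 2.367e-05, Health = 0.2479, Internet = 0.1903)

# Hypothetical country profile; the leading 1 multiplies the intercept.
country <- c(intercept = 1, GDP = 10000, Health = 10, Internet = 50)

# Contribution of each term to the prediction, then their sum.
contributions <- coefs * country
contributions
predicted <- sum(contributions)
predicted
```

For this profile the Internet term contributes far more than the GDP term, which fits with GDP losing significance once the other predictors are included.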

Question 4

par(mfrow = c(2, 2))
plot(simple_regression)
par(mfrow = c(1, 1))

Homoscedasticity: (Residuals vs Fitted) The smoothed line is not flat and the residuals are not randomly scattered around zero. The curvature suggests the relationship is not fully linear, and the non-random scatter is consistent with heteroscedasticity.

(Scale-Location) The line is also not horizontal, and the spread of the residuals changes across the fitted values, which again points to non-constant variance.

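Normality: (Q-Q Plot) The residuals deviate from the reference line in the tails, so they are not perfectly normal. The residual summary for this model (minimum -16.352, median 1.550, maximum 9.330) shows a longer lower tail, which indicates a slight left skew rather than a right skew.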
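A visual reading of the Q-Q plot can be backed up with a numeric check. The sketch below runs on simulated data, not on AllCountries: the variable names, the error distribution, and the coefficients are all assumptions chosen to produce left-skewed residuals. It computes the sample skewness by hand and runs the Shapiro-Wilk test from base R.

```r
set.seed(123)

# Simulated stand-in data (hypothetical, not the AllCountries values).
n <- 200
x <- runif(n, 0, 50000)
errors <- 5 - rexp(n, rate = 1 / 5)   # mean 0, long lower tail (left skew)
y <- 68 + 0.00025 * x + errors

fit <- lm(y ~ x)
r <- resid(fit)

# Sample skewness: negative means a longer lower tail.
skewness <- mean((r - mean(r))^3) / sd(r)^3
skewness

# Shapiro-Wilk test: a small p-value is evidence against normality.
shapiro_p <- shapiro.test(r)$p.value
shapiro_p
```

The same two checks could be applied to `resid(simple_regression)` to confirm the direction of the skew in the actual model.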

Question 5

residuals_for_multiple_regression <- resid(multiple_regression)

rmse_multiple_regression <- sqrt(mean(residuals_for_multiple_regression^2))

rmse_multiple_regression
## [1] 4.056417

RMSE (Root Mean Squared Error) measures the typical prediction error in the units of the response, here years of life expectancy. An RMSE of 4.056 means that the model's predicted life expectancy typically differs from the actual life expectancy by about 4.06 years.
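The RMSE is slightly smaller than the residual standard error of 4.104 reported by `summary()`, because the RMSE divides the residual sum of squares by the number of observations while the residual standard error divides by the residual degrees of freedom. The sketch below recovers the RMSE from the printed summary values; the small difference from 4.056417 comes from the rounding of 4.104.

```r
# Values copied from the summary(multiple_regression) output.
rse <- 4.104          # residual standard error
df_resid <- 169       # residual degrees of freedom
n <- df_resid + 4     # 3 predictors + intercept, so 173 observations

# Recover the residual sum of squares, then divide by n instead of df.
rss <- rse^2 * df_resid
rmse_from_summary <- sqrt(rss / n)
rmse_from_summary
```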

Large residuals would reduce confidence in the model's predictions for countries with unusually high or low life expectancy, because they may reflect non-linearity or outliers that the model does not capture.

I would investigate these points using the Residuals vs Leverage plot to find out whether they are influential outliers, that is, outliers that noticeably change the fitted model.
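Influence can also be checked numerically with Cook's distance, the quantity drawn as contours on the Residuals vs Leverage plot. The sketch below uses simulated data with one deliberately injected outlier; the data, the injected point, and the 4/n cutoff (a common rule of thumb, not a strict test) are all assumptions for illustration.

```r
set.seed(1)

# Simulated stand-in data (hypothetical, not the AllCountries values).
n <- 50
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n)

# Inject one high-leverage point that sits far from the trend (row 51).
x <- c(x, 30)
y <- c(y, 2)

fit <- lm(y ~ x)

# Cook's distance measures how much the fit changes if a point is dropped.
cooks <- cooks.distance(fit)
threshold <- 4 / length(cooks)
flagged <- which(cooks > threshold)
most_influential <- unname(which.max(cooks))

flagged
most_influential
```

On the real model, `cooks.distance(multiple_regression)` would return one value per country used in the fit, and the flagged rows could then be inspected individually.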

Question 6

If I used a multiple regression model to predict CO2 emissions from Energy and Electricity and found that the two predictors were highly correlated, it would be hard to tell which one matters more for predicting CO2 emissions. Multicollinearity inflates the standard errors, so the coefficients become unstable and the p-values unreliable, and we cannot separate the individual effect of each predictor.
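One way to quantify this is the variance inflation factor (VIF), which for a predictor equals 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The sketch below uses simulated data: the lowercase `energy`, `electricity`, and `co2` variables are hypothetical stand-ins, not the AllCountries columns, and the VIF is computed by hand with base R.

```r
set.seed(42)

# Simulated stand-in data with two strongly correlated predictors.
n <- 100
energy <- rnorm(n, mean = 2000, sd = 500)
electricity <- 1.5 * energy + rnorm(n, sd = 100)
co2 <- 1 + 0.002 * energy + rnorm(n, sd = 0.5)

# Pairwise correlation between the two predictors.
predictor_cor <- cor(energy, electricity)
predictor_cor

# VIF for energy: regress it on the other predictor and use that R-squared.
r2_energy <- summary(lm(energy ~ electricity))$r.squared
vif_energy <- 1 / (1 - r2_energy)
vif_energy

# Coefficient table: compare the standard errors to the estimates.
summary(lm(co2 ~ energy + electricity))$coefficients
```

A VIF above roughly 5 to 10 is often treated as a warning sign. In that case, dropping one of the two predictors or combining them is a common remedy.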