setwd("C:/Users/COCO3/Downloads")
AllCountries <- read.csv("AllCountries.csv")
colSums(is.na(AllCountries)) 
##        Country           Code       LandArea     Population        Density 
##              0              0              8              1              8 
##            GDP          Rural            CO2      PumpPrice       Military 
##             30              3             13             50             67 
##         Health    ArmedForces       Internet           Cell            HIV 
##             29             49             13             15             81 
##         Hunger       Diabetes      BirthRate      DeathRate     ElderlyPop 
##             52             10             15             15             24 
## LifeExpectancy    FemaleLabor   Unemployment         Energy    Electricity 
##             18             30             30             82             76 
##      Developed 
##             75
# Simple Linear Regression

simple_model <- lm(LifeExpectancy ~ GDP, data = AllCountries)

summary(simple_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = AllCountries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16
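The GDP slope looks tiny only because it is expressed per single unit of GDP. Rescaling it with the estimates printed above makes it easier to read. This is a rough sketch that assumes GDP is recorded per capita in US dollars, which the output itself does not confirm; `b0`, `b1`, and `gdp_step` are names introduced here for illustration.

```r
# Estimates copied from the summary(simple_model) output above
b0 <- 6.842e+01   # intercept
b1 <- 2.476e-04   # slope for GDP

# Assumed step size: 10,000 units of GDP (a hypothetical choice for readability)
gdp_step <- 10000

# Change in predicted life expectancy for one step of GDP
effect_per_step <- b1 * gdp_step
effect_per_step

# Predicted life expectancy for a country with GDP = 10,000
pred_at_step <- b0 + b1 * gdp_step
pred_at_step
```

Under that assumption, each additional 10,000 in GDP goes with roughly 2.5 more years of predicted life expectancy.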
# Multiple Linear Regression
multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)

summary(multiple_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16
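The two models are not fit to exactly the same countries, because `lm()` by default drops any row with a missing value in a variable the model uses. The row counts can be recovered from the printed degrees of freedom; the variable names below are introduced here for illustration.

```r
# Residual df and deleted-row counts copied from the two summaries above
df_simple   <- 177; deleted_simple   <- 38
df_multiple <- 169; deleted_multiple <- 44

# Rows used = residual df + number of estimated coefficients (intercept included)
n_simple   <- df_simple + 2    # intercept + GDP
n_multiple <- df_multiple + 4  # intercept + GDP + Health + Internet

# Adding the deleted rows back should give the same total for both models
total_rows <- c(simple   = n_simple + deleted_simple,
                multiple = n_multiple + deleted_multiple)
total_rows
```

Both totals come to 217 rows, with 179 countries used in the simple model and 173 in the multiple model, so the rise in R-squared from 0.4304 to 0.7213 is measured on slightly different samples.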
# Checking Assumptions
par(mfrow = c(2, 2))
plot(simple_model)

par(mfrow = c(1, 1))

### Comparing to outcome

Homoscedasticity Analysis: The assumption holds reasonably well. The variance of the residuals is roughly consistent, with only small departures. The outliers could be a concern, but they are not a major problem, so the model looks fairly reliable on this assumption.

Normality Analysis: There is some violation here. The residuals are close to normal in the middle, but the left tail is heavier. This suggests that some countries have a much lower life expectancy than their GDP would predict. The model is less reliable on this assumption than on homoscedasticity, so there is more to watch out for.

# Diagnosing Model Fit

residuals_multiple <- resid(multiple_model)

rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 4.056417
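This RMSE is slightly smaller than the residual standard error of 4.104 reported by `summary(multiple_model)`. Both are built from the same residual sum of squares, but RMSE divides by the number of observations while the residual standard error divides by the residual degrees of freedom. A rough check using only the printed figures (the names below are introduced here for illustration):

```r
rse      <- 4.104        # residual standard error from summary(multiple_model)
df_resid <- 169          # residual degrees of freedom from the same output
n        <- df_resid + 4 # observations used: df plus 4 estimated coefficients

# rse^2 * df_resid is the residual sum of squares; dividing by n gives the MSE
rmse_from_rse <- sqrt(rse^2 * df_resid / n)
rmse_from_rse
```

The result agrees with the RMSE computed from the residuals, up to rounding in the printed standard error.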
cor(AllCountries[, c("Energy", "Electricity")], use = "complete.obs")
##                Energy Electricity
## Energy      1.0000000   0.7970054
## Electricity 0.7970054   1.0000000

This multicollinearity might affect both the interpretation of the regression coefficients and the reliability of the model. With a correlation of about 0.80, Energy and Electricity carry much of the same information, so the model cannot tell which of the two is driving CO2 emissions. The individual coefficient estimates become unstable, and the effect of each factor on its own gets blurred. Essentially, because Energy and Electricity are so closely related, a multiple regression that includes both is not the best way to measure their individual impacts on CO2 emissions.
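One common way to put a number on this is the variance inflation factor (VIF). With exactly two predictors it depends only on their correlation: VIF = 1 / (1 - r^2). The sketch below applies that formula to the correlation printed above. It assumes a model with only Energy and Electricity as predictors, which is not fit anywhere in this document, and a CO2 model would also drop rows with missing CO2, so the actual value could differ somewhat.

```r
r <- 0.7970054             # cor(Energy, Electricity) from the output above

vif <- 1 / (1 - r^2)       # variance inflation factor, two-predictor case
se_inflation <- sqrt(vif)  # factor by which coefficient standard errors grow

round(c(vif = vif, se_inflation = se_inflation), 2)
```

Under these assumptions the VIF is about 2.74, meaning the standard errors of the two coefficients would be roughly 1.66 times larger than with uncorrelated predictors. That is the loss of precision described above.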