setwd("C:/Users/COCO3/Downloads")
AllCountries <- read.csv("AllCountries.csv")
colSums(is.na(AllCountries))
##        Country           Code       LandArea     Population        Density
##              0              0              8              1              8
##            GDP          Rural            CO2      PumpPrice       Military
##             30              3             13             50             67
##         Health    ArmedForces       Internet           Cell            HIV
##             29             49             13             15             81
##         Hunger       Diabetes      BirthRate      DeathRate     ElderlyPop
##             52             10             15             15             24
## LifeExpectancy    FemaleLabor   Unemployment         Energy    Electricity
##             18             30             30             82             76
##      Developed
##             75
# Simple Linear Regression
simple_model <- lm(LifeExpectancy ~ GDP, data = AllCountries)
summary(simple_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = AllCountries)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -16.352  -3.882   1.550   4.458   9.330
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
The intercept is 68.42 and the slope is 0.0002476.
The intercept is in years: when GDP per capita is $0, the model predicts a life expectancy of 68.42 years. No country has a GDP per capita of $0, so this is an extrapolation rather than a realistic estimate.
The slope is the predicted change in life expectancy, in years, for each $1 increase in GDP per capita: 0.0002476 years. That is easier to read per $1,000: each additional $1,000 of GDP per capita is associated with about 0.25 more years of life expectancy (roughly 3 months).
R² is 0.4304, which means GDP per capita explains about 43% of the variation in life expectancy between countries.
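As a quick check on the scale of the slope, the arithmetic can be sketched in R. The coefficients are copied from the summary output above; the $10,000 GDP value is a hypothetical input chosen only for illustration.

```r
# Coefficients copied from summary(simple_model)
intercept <- 68.42      # predicted life expectancy (years) at GDP = 0
slope     <- 2.476e-04  # change in years per $1 of GDP per capita

# Rescale the slope to a more readable unit: years per $1,000
slope_per_1000 <- slope * 1000
slope_per_1000          # about 0.25 years

# Same effect expressed in months
slope_per_1000 * 12     # about 3 months

# Predicted life expectancy at a hypothetical GDP per capita of $10,000
gdp_example <- 10000
predicted   <- intercept + slope * gdp_example
predicted               # about 70.9 years
```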
# Multiple Linear Regression
multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
summary(multiple_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -14.5662  -1.8227   0.4108   2.5422   9.4161
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
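Intercept: 59.08 years
GDP: 0.00002367 years per $1 of GDP per capita
Health: 0.2479 years per one-unit increase in Health
Internet: 0.1903 years per one-unit increase in Internet
Adjusted R² = 0.7164 (71.64%)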
The Health coefficient is 0.2479, which means life expectancy is predicted to rise by about 0.25 years for every 1 percentage point increase in Health (government health care), holding GDP and internet access constant.
The simple model’s adjusted R² was 42.72%, and this model’s adjusted R² is 71.64%, an improvement of about 29 percentage points. That is a substantial gain in explained variation. Note also that once Health and Internet are in the model, the GDP coefficient is no longer statistically significant (p = 0.302). So although GDP alone explains a good share of the differences in life expectancy, government health care and internet access account for much of why some countries have higher life expectancy than others.
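The adjusted R² values can be reproduced from the summary output using the standard formula. This is only a sketch: the sample sizes are inferred from the residual degrees of freedom (177 + 2 and 169 + 4), and the helper name `adj_r2` is made up for illustration. Because the two models drop different numbers of rows (38 versus 44), they are not fit to exactly the same countries, so the comparison is approximate.

```r
# Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1)
# n = observations used, p = number of predictors
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Simple model: R2 = 0.4304, 177 residual df + 2 coefficients = 179 rows
adj_simple <- adj_r2(0.4304, n = 179, p = 1)

# Multiple model: R2 = 0.7213, 169 residual df + 4 coefficients = 173 rows
adj_multiple <- adj_r2(0.7213, n = 173, p = 3)

round(adj_simple, 4)    # matches the 0.4272 reported by summary()
round(adj_multiple, 4)  # matches the 0.7164 reported by summary()

# Improvement in percentage points
round((adj_multiple - adj_simple) * 100, 1)
```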
# Checking Assumptions
par(mfrow = c(2, 2))
plot(simple_model)
par(mfrow = c(1, 1))
How to check the assumption of homoscedasticity: Look at the Residuals vs. Fitted plot (and the Scale-Location plot) to see whether the spread of the residuals stays consistent across the fitted values.
Ideal outcome: The points are scattered randomly around zero, and the spread of the residuals is roughly equal across all the fitted values.
Violation looks like: The spread of the residuals changes with the fitted values, for example a funnel shape that widens or narrows. A curved pattern in the same plot points to a non-linear relationship, which is a separate problem. With unequal variance, the standard errors are unreliable, so the p-values, hypothesis tests, and confidence intervals cannot be trusted.
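To see what a violation looks like numerically, here is a small sketch on simulated data (not the AllCountries data). It builds one data set with constant error variance and one where the error spread grows with the predictor, then checks whether the size of the residuals tracks the fitted values. The variable names and the correlation check are illustrative choices, not a formal test.

```r
set.seed(42)
n <- 300
x <- runif(n, min = 1, max = 10)

# Constant variance: error spread does not depend on x
y_const <- 5 + 2 * x + rnorm(n, sd = 2)

# Non-constant variance: error spread grows with x (funnel shape)
y_funnel <- 5 + 2 * x + rnorm(n, sd = 0.8 * x)

fit_const  <- lm(y_const ~ x)
fit_funnel <- lm(y_funnel ~ x)

# Informal check: do absolute residuals grow with the fitted values?
cor_const  <- cor(abs(resid(fit_const)),  fitted(fit_const))
cor_funnel <- cor(abs(resid(fit_funnel)), fitted(fit_funnel))

cor_const   # expected to be near 0
cor_funnel  # expected to be clearly positive
```

Calling `plot(fit_funnel, which = 1)` on the second fit shows the same pattern visually in the Residuals vs. Fitted plot.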
How to check the assumption of normality of residuals: Look at the Normal Q-Q plot to see whether the residuals follow a normal distribution.
Ideal outcome: Most points fall close to the line, with no S-shaped curve and few outliers.
Violation looks like: An S-shaped curve, a skewed pattern, or many points far from the line. This would mean the p-values could be inaccurate and the prediction intervals could be biased.
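A numeric companion to the Q-Q plot is the Shapiro-Wilk test, available in base R as `shapiro.test()`. The sketch below uses simulated data (not the AllCountries data) to contrast normal errors with strongly left-skewed errors; a small p-value suggests the residuals are not normal.

```r
set.seed(1)
n <- 200
x <- runif(n, 0, 10)

# Normal errors
y_normal <- 3 + 1.5 * x + rnorm(n)

# Left-skewed errors: a negated exponential, centered near zero
y_skewed <- 3 + 1.5 * x - (rexp(n, rate = 0.5) - 2)

fit_normal <- lm(y_normal ~ x)
fit_skewed <- lm(y_skewed ~ x)

# Shapiro-Wilk test on the residuals: the null hypothesis is normality
p_normal <- shapiro.test(resid(fit_normal))$p.value
p_skewed <- shapiro.test(resid(fit_skewed))$p.value

p_normal  # usually well above 0.05 when the errors are normal
p_skewed  # very small when the errors are strongly skewed
```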
### Comparing to outcome
Homoscedasticity Analysis: The assumption holds reasonably well, with only small departures. The variance is fairly consistent, although a few outliers could be a concern. This isn’t a big deal; the model is fairly reliable on this point.
Normality Analysis: There is some violation here. The residuals look roughly normal in the middle but show a heavier tail on the left side. This could mean there are countries whose life expectancy is much lower than their GDP predicts. This assumption is less well met than homoscedasticity, so there are more concerns to watch out for.
# Diagnosing Model Fit
residuals_multiple <- resid(multiple_model)
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 4.056417
In this context, the RMSE is the typical size of the model’s prediction error: predicted life expectancy differs from actual life expectancy by roughly 4.06 years.
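The RMSE here (4.056) is slightly smaller than the residual standard error in the summary (4.104) because the two divide the same sum of squared residuals by different numbers: RMSE uses the number of observations, while the residual standard error uses the residual degrees of freedom. A sketch of the conversion, using values from the output above (n = 173 is inferred as 169 residual df plus 4 coefficients):

```r
rse <- 4.104   # residual standard error from summary(multiple_model)
df  <- 169     # residual degrees of freedom
n   <- 173     # observations used: df + 4 estimated coefficients

# RSE^2 * df recovers the residual sum of squares
rss <- rse^2 * df

# RMSE divides by n instead of df
rmse_from_rse <- sqrt(rss / n)
rmse_from_rse  # close to the 4.056417 computed above, up to rounding of 4.104
```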
Large residuals for certain countries would lower confidence in the model’s predictions for those countries. A large residual may mean the measurements contain some error, or that important factors are missing from the model. For example, if small countries consistently have large residuals, the model might not work well for that type or size of country, and its predictions there would be inaccurate.
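One way to find which observations have large residuals is to look at standardized residuals. The sketch below uses simulated data with one planted outlier (not the AllCountries data); the cutoff of 2 is a common rule of thumb, not a fixed rule. On the real model, the same idea would apply to `rstandard(multiple_model)`.

```r
set.seed(7)
n <- 30
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n)

# Plant one unusual observation far above the line
outlier_index <- 10
y[outlier_index] <- y[outlier_index] + 15

fit <- lm(y ~ x)

# Standardized residuals: residuals scaled by their estimated standard deviation
std_res <- rstandard(fit)

# Flag observations whose standardized residual exceeds 2 in absolute value
flagged <- which(abs(std_res) > 2)
flagged
```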
To investigate further, I would consider adding variables such as health care quality, lifestyle factors, and how corrupt or just the government is.
cor(AllCountries[, c("Energy", "Electricity")], use = "complete.obs")
##                Energy Electricity
## Energy      1.0000000   0.7970054
## Electricity 0.7970054   1.0000000
Energy and Electricity are strongly correlated (r ≈ 0.80). This multicollinearity can affect both the interpretation of the regression coefficients and the reliability of the model. When two predictors move together, the model cannot cleanly tell whether Energy or Electricity is driving CO2 emissions, so their individual coefficients become unstable and their standard errors are inflated. Essentially, because Energy and Electricity are so closely related, a multiple regression that includes both is not a good tool for separating their individual effects on CO2 emissions.
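One way to gauge how much this correlation inflates coefficient uncertainty is the variance inflation factor (VIF). For a model with exactly two predictors, VIF = 1 / (1 - r²), where r is their correlation. The sketch below assumes Energy and Electricity are the only two predictors in the CO2 model, which the text above does not state; with more predictors, the VIF would need the R² from regressing one predictor on all the others.

```r
r <- 0.7970054   # correlation between Energy and Electricity (from the cor output)

# Variance inflation factor for the two-predictor case
vif <- 1 / (1 - r^2)
vif              # about 2.74

# Standard errors are inflated by the square root of the VIF
se_inflation <- sqrt(vif)
se_inflation     # about 1.66
```

A VIF near 2.7 means each coefficient’s variance is roughly 2.7 times what it would be if the two predictors were uncorrelated.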