library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("C:/Users/Thu Nguyen/Downloads/HW9Dataset")
library(dplyr)
library(lubridate)
library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(forcats)
library(ggplot2)
countries <- read.csv("AllCountries.csv")
simple_linear_model <- lm(LifeExpectancy ~ GDP, data = countries)

summary(simple_linear_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

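This means that GDP is a statistically significant predictor of life expectancy (p < 2e-16). The slope of 2.476e-04 means that each one-unit increase in GDP is associated with an average increase of about 0.00025 years in life expectancy. The R^2 of 0.4304 means that about 43% of the variation in life expectancy across countries is explained by GDP.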
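
As a rough illustration of what the slope means, the fitted equation can be evaluated by hand with the rounded coefficients printed in the summary. The GDP value of 10,000 is a hypothetical input chosen only for the arithmetic; it is not taken from the dataset.

```r
# Rounded coefficients copied from summary(simple_linear_model)
intercept <- 68.42
slope_gdp <- 2.476e-04

# Hypothetical GDP value, chosen only to illustrate the calculation
gdp_example <- 10000

# Predicted life expectancy = intercept + slope * GDP
predicted_life <- intercept + slope_gdp * gdp_example
predicted_life

# Change in predicted life expectancy for a 1,000-unit increase in GDP
slope_gdp * 1000
```

With the model object available, `predict(simple_linear_model, newdata = data.frame(GDP = 10000))` should give essentially the same number, up to rounding of the coefficients.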

multiple_linear_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = countries)

summary(multiple_linear_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = countries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

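The R^2 of 0.7213 means that about 72% of the variation in life expectancy across countries is explained by GDP, Health, and Internet together. Health and Internet are both significant predictors, while GDP is no longer significant (p = 0.302) once they are included.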
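
The adjusted R-squared in the output can be reproduced from the other reported quantities, which shows how it penalizes extra predictors. This is a sketch using the rounded numbers printed in the summary, so it should match 0.7164 only up to rounding.

```r
# Values copied from summary(multiple_linear_model)
r_squared    <- 0.7213
residual_df  <- 169   # n - p - 1
n_predictors <- 3     # GDP, Health, Internet

# Number of observations actually used in the fit
n_used <- residual_df + n_predictors + 1

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r_squared <- 1 - (1 - r_squared) * (n_used - 1) / residual_df
adj_r_squared
```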

# Predictor (GDP) goes on the x-axis so the line from lm(LifeExpectancy ~ GDP) is drawn correctly
plot(countries$GDP, countries$LifeExpectancy,
     xlab="GDP", ylab="Life Expectancy", main="Life Expectancy vs GDP")
abline(simple_linear_model, col=1, lwd=2)

There is an overall positive trend: as GDP increases, life expectancy generally rises as well. The trend is harder to see because GDP values are spread over a very wide range. Countries with a life expectancy of roughly 55 to 75 are concentrated at low GDP values and follow the trend reasonably well, which gives some support to the linearity assumption. Beyond a life expectancy of about 75, however, GDP spreads out toward very high values, suggesting that extremely high GDP levels do not increase life expectancy at the same linear rate.

plot(resid(simple_linear_model), type="b", main="Residuals vs Order", ylab="Residuals")
abline(h=0, lty=2)

The independence assumption may be compromised in the simple linear model: instead of scattering randomly around zero, the residuals appear tightly clustered over some stretches of the observation order and widely dispersed over others.
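
One rough numerical companion to a residuals-vs-order plot is the Durbin-Watson statistic, which measures correlation between successive residuals. The sketch below uses a hypothetical helper named `durbin_watson` and demonstrates it on R's built-in `cars` data rather than the countries data. It is only informative when the row order carries meaning (for example, time); if the rows are in an arbitrary order, the statistic says little about independence.

```r
# Hypothetical helper: Durbin-Watson statistic from a fitted lm object.
# Values near 2 suggest little lag-1 correlation between successive residuals.
durbin_watson <- function(model) {
  e <- resid(model)
  sum(diff(e)^2) / sum(e^2)
}

# Demonstration on R's built-in cars data (not the countries data)
demo_model <- lm(dist ~ speed, data = cars)
dw_demo <- durbin_watson(demo_model)
dw_demo
```

Applied to this analysis, the call would be `durbin_watson(simple_linear_model)`.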

par(mfrow=c(2,2)); plot(simple_linear_model); par(mfrow=c(1,1))

Residuals vs Fitted: The points cluster heavily around fitted values of 70–75 and curve downward, with uneven spread around 0. The curvature suggests that a straight line does not fully capture the relationship, and the uneven spread suggests heteroscedasticity (non-constant variance).

Q–Q Plot: The right tail rises above the line, indicating slight non-normality, though the violation is not severe.

Scale–Location: Residual spread decreases as fitted values increase, further suggesting heteroscedasticity.

Residuals vs Leverage: Most points sit at low leverage on the left and none stands out as extreme, so there is no indication of highly influential observations.

residuals_multiple <- resid(multiple_linear_model)


rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 4.056417
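
The RMSE above (4.056) is slightly smaller than the residual standard error in the summary (4.104) because the RMSE divides the residual sum of squares by n, while the residual standard error divides by the residual degrees of freedom. A quick check with the printed numbers, assuming n = 173 observations were used (169 residual degrees of freedom plus 4 estimated coefficients):

```r
# Values copied from the output above
residual_se <- 4.104   # residual standard error
residual_df <- 169     # residual degrees of freedom
n_used      <- 173     # 169 + 4 estimated coefficients

# RMSE = sqrt(RSS / n) and residual SE = sqrt(RSS / df),
# so RMSE = residual SE * sqrt(df / n)
rmse_from_se <- residual_se * sqrt(residual_df / n_used)
rmse_from_se
```

The result agrees with `rmse_multiple` up to the rounding of the printed residual standard error.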

Reflection: Large residuals for some countries, especially those with lower GDP levels, suggest that the model’s predictions are less accurate for these cases. To improve the analysis, additional variables that influence life expectancy could be added to the multiple linear regression model, while predictors that contribute little once the others are included, such as GDP (p = 0.302 in the three-predictor model), could be transformed or removed.

multiple_linear_model2 <- lm(CO2 ~ Energy + Electricity, data = countries)

summary(multiple_linear_model2)
## 
## Call:
## lm(formula = CO2 ~ Energy + Electricity, data = countries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.7559  -1.1406  -0.2020   0.7143   7.3751 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.998e-01  2.655e-01   3.012  0.00311 ** 
## Energy       3.122e-03  1.066e-04  29.290  < 2e-16 ***
## Electricity -7.044e-04  5.526e-05 -12.747  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.331 on 131 degrees of freedom
##   (83 observations deleted due to missingness)
## Multiple R-squared:  0.899,  Adjusted R-squared:  0.8974 
## F-statistic: 582.8 on 2 and 131 DF,  p-value: < 2.2e-16
cor(countries[, c("CO2", "Energy", "Electricity")], use = "complete.obs")
##                   CO2    Energy Electricity
## CO2         1.0000000 0.8795736   0.4871233
## Energy      0.8795736 1.0000000   0.7969352
## Electricity 0.4871233 0.7969352   1.0000000

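Multicollinearity concerns the correlation between predictors, so the relevant value is the correlation between Energy and Electricity (0.797 ≈ 0.80), which is fairly strong. The correlations of CO₂ with Energy (0.879 ≈ 0.88) and with Electricity (0.487 ≈ 0.49) describe how each predictor relates to the response, not multicollinearity. Correlated predictors do not bias the coefficient estimates, but they inflate standard errors and widen confidence intervals, which makes the individual coefficients less stable. One possible sign of this is that Electricity has a negative coefficient in the model even though its correlation with CO₂ is positive.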
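
To put a number on the multicollinearity, the variance inflation factor (VIF) can be computed from the predictor correlation. This is a minimal sketch that relies on the model having exactly two predictors, in which case the VIF of each is 1 / (1 - r^2), with r the correlation between them; the correlation value is copied from the `cor()` output.

```r
# Correlation between the two predictors, copied from the cor() output
r_energy_electricity <- 0.7969352

# With two predictors, R^2 from regressing one on the other equals r^2,
# so both predictors share the same variance inflation factor
vif_two_predictors <- 1 / (1 - r_energy_electricity^2)
vif_two_predictors
```

This comes out to roughly 2.7, meaning the coefficient variances are about 2.7 times what they would be with uncorrelated predictors. Common rules of thumb treat VIF values above 5 or 10 as serious, so the collinearity here is noticeable but not extreme. If the car package is installed, its `vif()` function reports the same quantity directly from the fitted model.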