VLe HW9 Data 101

1. Loading the CSV File

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(lubridate)
library(zoo)

## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(forcats)
library(ggplot2)

countries <- read.csv("AllCountries.csv")

2. Simple Linear Regression Model

simple_linear_model <- lm(LifeExpectancy ~ GDP, data = countries)

summary(simple_linear_model)

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

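Interpretation: Both the intercept and the GDP coefficient are marked with three stars (p < 0.001), so GDP is a statistically significant predictor of a country's life expectancy. The slope of 2.476e-04 means each additional unit of GDP is associated with an increase of about 0.00025 years in predicted life expectancy. The R^2 of 0.4304 indicates that GDP explains about 43% of the variation in life expectancy across countries.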
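To make the slope concrete, the fitted line can be evaluated by hand from the coefficients printed in the summary. This is only a small sketch: the GDP values of 1,000 and 11,000 are hypothetical inputs chosen for illustration, and they are assumed to be in the same units as the GDP column of AllCountries.csv.

```r
# Coefficients copied from summary(simple_linear_model)
intercept <- 68.42
slope     <- 2.476e-04

# Hypothetical GDP values, assumed to be in the units of the GDP column
gdp <- c(1000, 11000)

# Predicted life expectancy from the fitted line
predicted <- intercept + slope * gdp
predicted

# Change in predicted life expectancy for a 10,000-unit increase in GDP
diff(predicted)
```

Under these assumptions, a 10,000-unit difference in GDP corresponds to roughly 2.5 years of predicted life expectancy.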

3. Multiple Linear Regression Model

multiple_linear_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = countries)

summary(multiple_linear_model)

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = countries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

Interpretation: Health and Internet are each marked with three stars (p < 0.001), while GDP has no stars (p = 0.302). Once Health and Internet are in the model, GDP is no longer a statistically significant predictor of life expectancy, whereas Health and Internet remain significant. The R^2 of 0.7213 indicates that GDP, Health, and Internet together explain about 72% of the variation in life expectancy across countries, up from 43% for GDP alone.
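Because R^2 never decreases when predictors are added, the adjusted R^2 is the fairer number to compare against the simple model. The sketch below reproduces the reported adjusted value from the printed R^2 and degrees of freedom; the variable names are illustrative only.

```r
# Values copied from summary(multiple_linear_model)
r_squared   <- 0.7213
residual_df <- 169   # residual degrees of freedom
p           <- 3     # number of predictors: GDP, Health, Internet

# Observations actually used in the fit (rows with missing values are dropped)
n <- residual_df + p + 1

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / residual_df
round(adj_r_squared, 4)
```

Note that this model was fitted on 173 countries while the simple model used 179 (177 residual degrees of freedom plus 2 coefficients), so the two R^2 values are not computed on exactly the same rows.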

4. Checking Assumptions (For Linear Model)

# Predictor (GDP) on the x-axis and response (LifeExpectancy) on the y-axis,
# matching the model formula so that abline() draws the fitted line correctly
plot(countries$GDP, countries$LifeExpectancy,
     xlab="GDP", ylab="Life Expectancy", main="Life Expectancy vs GDP")
abline(simple_linear_model, col=1, lwd=2)

Interpretation: There is a positive trend: as GDP increases, so does life expectancy. The trend is hard to see because GDP ranges up to extremely large values. Countries with life expectancy from below 55 up to about 75 are packed into a narrow band of low GDP, while countries above 75 are spread over a much wider range of GDP. Life expectancy therefore rises quickly at low GDP and levels off at high GDP, so a straight line is only a rough approximation and linearity is questionable at the high-GDP end.
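When a predictor such as GDP spans several orders of magnitude, one common remedy is to regress on its logarithm instead. The sketch below uses simulated data, not the AllCountries values, purely to show the mechanics; `gdp_sim` and `life_sim` are made-up names and the numbers used to generate them are assumptions.

```r
set.seed(1)  # reproducible simulated data, not the AllCountries values

# Simulated predictor spanning several orders of magnitude, like GDP
gdp_sim <- exp(runif(150, min = log(500), max = log(100000)))

# Simulated response that rises quickly at first and then levels off
life_sim <- 40 + 3.5 * log(gdp_sim) + rnorm(150, sd = 3)

fit_raw <- lm(life_sim ~ gdp_sim)        # straight line in raw GDP
fit_log <- lm(life_sim ~ log(gdp_sim))   # straight line in log(GDP)

# Compare how much variation each version explains
c(raw = summary(fit_raw)$r.squared,
  log = summary(fit_log)$r.squared)
```

The same idea could be tried on the real data with lm(LifeExpectancy ~ log(GDP), data = countries), provided every GDP value is positive.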

plot(resid(simple_linear_model), type="b", main="Residuals vs Order", ylab="Residuals")
abline(h=0, lty=2)

Ideal outcome: The residuals scatter randomly around the dashed line at 0 with no trend, cycle, or clustering from one observation to the next, which is what independent residuals look like.

Violations: Independence may be violated in the simple linear model, since the residuals are tightly clustered in some stretches of the plot and widely spread in others.
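A numeric companion to the residuals-vs-order plot is the Durbin-Watson statistic, which can be computed directly from the residuals. The sketch below uses simulated data with independent errors for illustration; values near 2 suggest no first-order autocorrelation, while values well below or above 2 suggest positive or negative autocorrelation.

```r
set.seed(42)  # simulated data for illustration only

x <- rnorm(200)
y <- 2 + 0.5 * x + rnorm(200)   # errors are independent by construction
fit <- lm(y ~ x)

# Durbin-Watson statistic: sum of squared successive differences
# of the residuals divided by the sum of squared residuals
e  <- resid(fit)
dw <- sum(diff(e)^2) / sum(e^2)
dw
```

The statistic depends on the row order, so for country-level data it is only meaningful if the order of the rows carries some meaning.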

par(mfrow=c(2,2)); plot(simple_linear_model); par(mfrow=c(1,1))

Residuals vs Fitted (Homoscedasticity): Most points form a dense cloud at fitted values from below 70 up to about 75, and the trend line curves downward at higher fitted values. The residuals are not evenly spread around the horizontal line at 0, so homoscedasticity is not fully met. The curved pattern also suggests that a straight line does not capture the relationship, and the uneven spread suggests heteroscedasticity.

Q-Q Residuals (Normality): The points pull away from the reference line in the right tail, which indicates the residuals are not exactly normal, although the violation is not severe.

Scale-Location: The spread of the residuals decreases as the fitted values increase, with most points concentrated on the left side, which indicates heteroscedasticity.

Residuals vs Leverage: Almost all points sit at low leverage on the left side of the plot, which suggests there are no problematic high-leverage points and that no single outlier is strongly influential.
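The visual checks above can be backed up with a few numeric ones from base R. This sketch uses the built-in mtcars data as a stand-in so that it runs on its own; the 2 * (number of coefficients) / n leverage cutoff is a common rule of thumb, not a strict test.

```r
# mtcars ships with R; it stands in for the countries data here
fit <- lm(mpg ~ wt, data = mtcars)

# Normality of residuals: Shapiro-Wilk test (a small p-value suggests non-normality)
shapiro.test(resid(fit))

# Leverage: hat values above 2 * (number of coefficients) / n deserve a look
lev <- hatvalues(fit)
high_leverage <- names(lev)[lev > 2 * length(coef(fit)) / nrow(mtcars)]
high_leverage

# Influence: Cook's distance for each observation, largest first
head(sort(cooks.distance(fit), decreasing = TRUE), 3)
```

Replacing fit with simple_linear_model would apply the same checks to the life expectancy model.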

5. Diagnosing Model Fit

# For multiple model

# Calculate residuals
residuals_multiple <- resid(multiple_linear_model)

# Calculate RMSE for multiple model
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple

## [1] 4.056417

Multiple RMSE = 4.06: This is the typical size of the prediction error. The life expectancy predicted from GDP, Health, and Internet differs from the actual value by roughly 4 years.
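The RMSE is slightly smaller than the residual standard error of 4.104 reported in the model summary. Both come from the same squared residuals, but they use different divisors, as this short sketch with the printed values shows.

```r
# Values copied from summary(multiple_linear_model)
residual_se <- 4.104   # residual standard error
residual_df <- 169     # residual degrees of freedom
n           <- 173     # observations used: 169 + 4 estimated coefficients

# Residual standard error divides the squared residuals by residual_df,
# while the RMSE computed above divides by n
rmse_from_se <- residual_se * sqrt(residual_df / n)
round(rmse_from_se, 2)
```

The result agrees with rmse_multiple up to the rounding of the printed standard error.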

Reflection: Large residuals for certain countries, such as those with lower GDPs, may indicate less accurate predictions for them. One way to investigate further is to add more predictors that may affect a country's life expectancy to the multiple linear regression model and to remove predictors that contribute little, such as GDP.

6. Hypothetical Example

multiple_linear_model2 <- lm(CO2 ~ Energy + Electricity, data = countries)

summary(multiple_linear_model2)

## 
## Call:
## lm(formula = CO2 ~ Energy + Electricity, data = countries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.7559  -1.1406  -0.2020   0.7143   7.3751 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.998e-01  2.655e-01   3.012  0.00311 ** 
## Energy       3.122e-03  1.066e-04  29.290  < 2e-16 ***
## Electricity -7.044e-04  5.526e-05 -12.747  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.331 on 131 degrees of freedom
##   (83 observations deleted due to missingness)
## Multiple R-squared:  0.899,  Adjusted R-squared:  0.8974 
## F-statistic: 582.8 on 2 and 131 DF,  p-value: < 2.2e-16

Checking multicollinearity

cor(countries[, c("CO2", "Energy", "Electricity")], use = "complete.obs")

##                   CO2    Energy Electricity
## CO2         1.0000000 0.8795736   0.4871233
## Energy      0.8795736 1.0000000   0.7969352
## Electricity 0.4871233 0.7969352   1.0000000

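Interpretation:

CO2 - Energy: 0.879 ~ 0.88 - Strong correlation between the response and a predictor.

CO2 - Electricity: 0.487 ~ 0.49 - Moderate correlation between the response and a predictor.

Energy - Electricity: 0.797 ~ 0.80 - Strong correlation between the two predictors. This is the value that signals multicollinearity.

Because Energy and Electricity are strongly correlated, the model has difficulty separating their individual effects. This inflates the standard errors, widens the confidence intervals, and makes the coefficient estimates unstable, so the individual coefficients are less reliable. A possible sign of this is that Electricity has a negative coefficient in the model even though its correlation with CO2 is positive.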
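Multicollinearity concerns the correlation between the predictors, here Energy and Electricity (0.797). With exactly two predictors, the variance inflation factor (VIF) follows directly from that correlation, as sketched below using the value from the correlation matrix.

```r
# Correlation between the two predictors, copied from the correlation matrix
r_energy_electricity <- 0.7969352

# With exactly two predictors, both share the same variance inflation factor
vif <- 1 / (1 - r_energy_electricity^2)
round(vif, 2)

# Standard errors are inflated by the square root of the VIF
round(sqrt(vif), 2)
```

A VIF of about 2.7 is below the commonly used warning thresholds of 5 or 10, so the multicollinearity here is noticeable but moderate.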