AllCountries <- read.csv("AllCountries.csv")
# simple linear regression: LifeExpectancy ~ GDP
simple_model <- lm(LifeExpectancy ~ GDP, data = AllCountries)

summary(simple_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = AllCountries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

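Interpretation

The intercept (68.42) is the predicted life expectancy in years for a country with a GDP of 0. The slope (2.476e-04) means each additional $1 of GDP per capita is associated with a 0.0002476-year increase in predicted life expectancy, or about 0.25 years per $1,000. R-squared is 0.4304, so GDP alone explains about 43% of the variation in life expectancy.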
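To make the fitted line concrete, here is a small sketch that plugs a few GDP values into the coefficient estimates reported in the simple-model summary above. The GDP values are hypothetical inputs chosen only for illustration, and the estimates are the rounded ones from the printout, so results may differ slightly from what the fitted object would return.

```r
# Coefficient estimates copied (rounded) from summary(simple_model) above
intercept <- 68.42
slope_per_dollar <- 2.476e-04

# Hypothetical GDP per capita values in $US, chosen only for illustration
gdp_examples <- c(1000, 10000, 40000)

# Predicted life expectancy (years) on the fitted line
predicted_life <- intercept + slope_per_dollar * gdp_examples

# The same slope expressed per $1,000 of GDP, which is easier to read
slope_per_1000 <- slope_per_dollar * 1000

data.frame(GDP = gdp_examples, PredictedLifeExpectancy = predicted_life)
slope_per_1000
```

With the fitted object available, `predict(simple_model, newdata = data.frame(GDP = gdp_examples))` should give the same predictions using the unrounded coefficients.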

# multiple linear regression: LifeExpectancy ~ GDP + Health + Internet
multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)

summary(multiple_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

Interpretation

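Health and Internet have p-values below 0.001, so both are statistically significant predictors of life expectancy. GDP has a p-value of about 0.30, so it is not statistically significant once Health and Internet are included. The model explains about 72% of the variation in life expectancy (R-squared = 0.7213), up from about 43% for the GDP-only model.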
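As a check on where those p-values come from, the sketch below recomputes the t statistics, two-sided p-values, and 95% confidence intervals from the estimates and standard errors printed in the multiple-model summary above. The inputs are the rounded values from the printout, so the results should match the table only approximately.

```r
# Estimates and standard errors copied (rounded) from summary(multiple_model)
est <- c(GDP = 2.367e-05, Health = 2.479e-01, Internet = 1.903e-01)
se  <- c(GDP = 2.287e-05, Health = 6.619e-02, Internet = 1.656e-02)
df_resid <- 169  # residual degrees of freedom from the same summary

# t statistic = estimate / standard error
t_stat <- est / se

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

# 95% confidence interval for each coefficient
t_crit <- qt(0.975, df = df_resid)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

round(cbind(t_stat, p_value, ci), 6)
```

The interval for GDP spans zero, which is another way of seeing that its effect is not distinguishable from zero in this model, while the intervals for Health and Internet lie entirely above zero.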

4

  • Homoscedasticity can be checked by examining a residuals vs fitted plot. Ideally, the residuals scatter randomly around the horizontal line at 0 with no pattern. A violation would look like a funnel shape or a curved pattern. Overall, the assumption of constant variance was reasonably met, so the model's standard errors and p-values are fairly reliable.

  • Normality can be checked by using a Q-Q plot to assess whether the residuals follow a normal distribution. Ideally, the points fall close to the diagonal line; a violation would look like strong curvature or heavy tails (points far from the line). The points mostly followed the diagonal line, with a few outliers, suggesting that inference based on the model is reasonably reliable.
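The two checks described above can be sketched in base R as follows. The helper name `check_assumptions` is hypothetical, and the demo fits a stand-in model on the built-in `mtcars` data only so that the block runs on its own; in this analysis the call would be `check_assumptions(multiple_model)`.

```r
# Hypothetical helper: draws a residuals vs fitted plot and a normal Q-Q plot
# for any fitted lm object, and returns the plotted values invisibly.
check_assumptions <- function(model) {
  res <- resid(model)
  fit <- fitted(model)

  old_par <- par(mfrow = c(1, 2))  # two panels side by side
  on.exit(par(old_par))            # restore the previous layout on exit

  # Homoscedasticity: look for a random band around 0, not a funnel or curve
  plot(fit, res,
       xlab = "Fitted values", ylab = "Residuals",
       main = "Residuals vs Fitted")
  abline(h = 0, lty = 2)

  # Normality: points should fall close to the reference line
  qqnorm(res, main = "Normal Q-Q")
  qqline(res)

  invisible(data.frame(fitted = fit, residual = res))
}

# Stand-in model (assumption: built-in mtcars data, not the AllCountries data)
demo_model <- lm(mpg ~ wt + hp, data = mtcars)
diag_values <- check_assumptions(demo_model)
```

Base R also provides these plots directly: `plot(multiple_model, which = 1:2)` draws the residuals vs fitted and normal Q-Q panels for a fitted `lm` object.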

plot(AllCountries$GDP, AllCountries$LifeExpectancy,
     xlab = "GDP per Capita ($US)",
     ylab = "Life Expectancy (Years)",
     main = "Life Expectancy vs GDP")

abline(simple_model, col = 1, lwd = 2)

5

residuals_multiple <- resid(multiple_model)

rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 4.056417

Interpretation

An RMSE of about 4 means that, on average, the model’s predicted life expectancy values differ from the actual values by about 4 years. Large residuals can reduce confidence in the model because they may indicate nonlinearity or missing predictors. Next steps would be to investigate nonlinear relationships and outliers.
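The RMSE above (4.056) is slightly smaller than the residual standard error in the model summary (4.104) because the RMSE divides the residual sum of squares by the number of observations, while the residual standard error divides by the residual degrees of freedom. A minimal sketch of that relationship, using only the rounded values printed in the summary:

```r
# Values copied (rounded) from summary(multiple_model) above
rse <- 4.104      # residual standard error
df_resid <- 169   # residual degrees of freedom
n_coef <- 4       # intercept + GDP + Health + Internet

# Observations actually used in the fit = residual df + estimated coefficients
n_obs <- df_resid + n_coef

# Residual sum of squares implied by the residual standard error
rss <- rse^2 * df_resid

# RMSE divides by n instead of the residual degrees of freedom
rmse_check <- sqrt(rss / n_obs)
rmse_check
```

The result agrees with `rmse_multiple` to about two decimal places; the small difference comes from using the rounded standard error.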

6

When predictors are highly correlated, the model has difficulty separating their individual effects because both explain a similar share of the variation in CO2. This makes the model less certain about the coefficients (their standard errors grow), so it becomes unreliable for judging which predictor has the stronger effect on CO2.
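The effect described above can be illustrated with simulated data. This is a sketch under stated assumptions: the variables `x1`, `x2_corr`, and `x2_indep` are hypothetical and are not taken from the AllCountries data. It fits the same kind of model twice, once with a second predictor that is nearly a copy of the first and once with an unrelated one, and compares the coefficient standard errors.

```r
set.seed(1)
n <- 200

# Simulated predictors (hypothetical, for illustration only)
x1 <- rnorm(n)
x2_corr <- x1 + rnorm(n, sd = 0.1)  # nearly a copy of x1
x2_indep <- rnorm(n)                # unrelated to x1
noise <- rnorm(n)

# Same true coefficients in both scenarios
y_corr <- 1 + 2 * x1 + 2 * x2_corr + noise
y_indep <- 1 + 2 * x1 + 2 * x2_indep + noise

fit_corr <- lm(y_corr ~ x1 + x2_corr)
fit_indep <- lm(y_indep ~ x1 + x2_indep)

# Standard errors of the coefficients in each model
se_corr <- coef(summary(fit_corr))[, "Std. Error"]
se_indep <- coef(summary(fit_indep))[, "Std. Error"]

# Variance inflation factor for x1: 1 / (1 - R^2) from regressing x1 on x2_corr
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2_corr))$r.squared)

cor(x1, x2_corr)
rbind(correlated = se_corr[2:3], independent = se_indep[2:3])
vif_x1
```

The standard error for `x1` is much larger in the correlated fit even though the true coefficients are identical, which is why the individual effects become hard to compare.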