Load libraries and dataset

library(tidyverse)
countries <- read_csv("AllCountries.csv")

Simple Linear Regression

Predict LifeExpectancy (average life expectancy in years) based on GDP (gross domestic product per capita in $US)

# Fit simple linear regression model
simple_model <- lm(LifeExpectancy ~ GDP, data = countries)

# Interpret the model
summary(simple_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

Interpretation

Intercept: ~68.4, representing predicted life expectancy in years when the GDP is $0.

Slope: ~0.00025, meaning for every $1 increase in GDP per capita, life expectancy is predicted to increase by around 0.00025 years.

R²: ~0.43, meaning around 43% of the variance in life expectancy is explained by GDP alone. This is a weak fit: more than half of the variance is left unexplained.
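To make the small slope easier to read, the arithmetic below rescales it to a $1,000 change in GDP per capita and plugs a few GDP values into the fitted line. This is a minimal sketch that uses the rounded coefficients from the summary output above; the helper `predict_life_expectancy` and the example GDP values are hypothetical and chosen only for illustration.

```r
# Coefficients as reported by summary(simple_model), rounded
intercept <- 68.42      # (Intercept) estimate, 6.842e+01
slope     <- 2.476e-04  # GDP estimate: years per $1 of GDP per capita

# Rescale the slope to a $1,000 change in GDP per capita
slope_per_1000 <- slope * 1000

# Hypothetical helper: predicted life expectancy at a given GDP per capita
predict_life_expectancy <- function(gdp) {
  intercept + slope * gdp
}

slope_per_1000
predict_life_expectancy(c(1000, 10000, 40000))
```

With the fitted model in memory, `predict(simple_model, newdata = data.frame(GDP = 10000))` should give the unrounded equivalent of the second prediction.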


Multiple Linear Regression

Predict LifeExpectancy using GDP, Health, and Internet as predictors.

# Fit the multiple linear regression model
multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = countries)

# Interpret the model 
summary(multiple_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = countries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

Interpretation

Health coefficient: ~0.248, meaning that while holding GDP and Internet constant, each one-unit increase in Health (health expenditures) is associated with a predicted increase of about 0.248 years in life expectancy.

Adjusted R²: ~0.716, meaning around 71.6% of the variance in life expectancy is explained by this model after adjusting for the number of predictors. This is higher than the simple model’s 43%, showing that the additional predictors account for substantially more of the variation in life expectancy and give more accurate predictions.
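As a check on how adjusted R² relates to R², the sketch below applies the standard adjustment formula to the values printed in the two summaries. The function name `adjusted_r2` is hypothetical; the sample sizes are inferred from the reported residual degrees of freedom plus the number of estimated coefficients.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# n = residual degrees of freedom + number of estimated coefficients
adj_simple   <- adjusted_r2(r2 = 0.4304, n = 177 + 2, p = 1)
adj_multiple <- adjusted_r2(r2 = 0.7213, n = 169 + 4, p = 3)

round(c(simple = adj_simple, multiple = adj_multiple), 4)
```

The penalty is small in both models because the number of predictors is tiny relative to the number of countries, so adjusted R² sits just below R² in each case.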


Checking Assumptions

Using the simple linear regression model from Q1, describe how you would check the assumptions of homoscedasticity and normality of residuals.

I will use a residuals vs fitted plot and a scale-location plot to check the assumption of homoscedasticity, and a Q-Q plot to check the assumption of normality.

For homoscedasticity, an ideal residuals vs fitted plot will show a consistent vertical spread of points across the range of fitted values, with an even scatter around zero, and an ideal scale-location plot will have a roughly horizontal red line with an even spread of points. Violations indicate that the prediction errors have unequal variance, which makes the model’s standard errors, and therefore its p-values and confidence intervals, unreliable.

For normality, in an ideal Q-Q plot the residuals will follow the straight diagonal line. Violations may indicate that the model’s p-values and confidence intervals are unreliable, so effects could appear falsely significant.
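The three plots described above can be requested one at a time through the `which` argument of `plot()` for an `lm` object. The sketch below is illustrative only: it fits a model to simulated data (the data frame `toy`, its values, and the output file name are hypothetical stand-ins, not the AllCountries data) and writes the plots to a temporary PDF.

```r
set.seed(1)

# Simulated stand-in data with the same variable names as the simple model
toy <- data.frame(GDP = runif(150, min = 500, max = 60000))
toy$LifeExpectancy <- 68 + 0.00025 * toy$GDP + rnorm(150, sd = 5)
toy_model <- lm(LifeExpectancy ~ GDP, data = toy)

# Write the three diagnostic plots side by side to a temporary PDF
out_file <- file.path(tempdir(), "diagnostics.pdf")
pdf(out_file, width = 12, height = 4)
par(mfrow = c(1, 3))
plot(toy_model, which = 1)  # residuals vs fitted: homoscedasticity
plot(toy_model, which = 3)  # scale-location: homoscedasticity
plot(toy_model, which = 2)  # normal Q-Q: normality of residuals
dev.off()
```

Because the simulated errors are normal with constant variance, these plots should look close to the ideal patterns, which gives a baseline to compare against the real model.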

Check Assumptions

plot(simple_model)

Reflection

The residuals vs fitted plot shows residuals that are centered fairly evenly around zero, but their spread is uneven across the fitted values, suggesting unequal variance and some heteroscedasticity.

The scale-location plot shows a spread that increases towards the middle of the fitted values (predicted life expectancy in years), indicating some heteroscedasticity.

Overall the plots’ uneven spread patterns do not match the ideal outcomes for homoscedasticity, but they are not severe enough to invalidate the model.

The Q-Q residuals plot deviates from the diagonal line at the left and right tails, which does not match the ideal outcome for normality, but the deviation is not severe enough to invalidate the model.


Diagnosing Model Fit

Using the multiple regression model from Q2, calculate the RMSE and explain what it represents in the context of predicting life expectancy.

# Calculate residuals
residuals_multiple <- resid(multiple_model)

# Calculate RMSE for multiple model
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 4.056417

RMSE: 4.06, meaning the model’s predicted life expectancies are typically off by about 4.06 years for the countries used to fit it.
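The RMSE is slightly smaller than the residual standard error of 4.104 reported by `summary(multiple_model)`. A short sketch of why, using only the numbers printed above: the RMSE computed here divides the sum of squared residuals by the number of observations, while the residual standard error divides by the residual degrees of freedom. The variable names below are illustrative.

```r
rse      <- 4.104              # residual standard error from summary(multiple_model)
df_resid <- 169                # residual degrees of freedom from the same output
n_coef   <- 4                  # intercept + GDP + Health + Internet
n_obs    <- df_resid + n_coef  # 173 countries with complete data

# Convert from dividing by df_resid to dividing by n_obs
rmse_from_rse <- rse * sqrt(df_resid / n_obs)
round(rmse_from_rse, 3)
```

The result agrees with the computed RMSE to three decimal places; the remaining difference comes from the rounding of 4.104 in the printed summary.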

How would large residuals for certain countries affect your confidence in the model’s predictions, and what might you investigate further?

Large residuals for certain countries would decrease my confidence in the model’s predictions, as the RMSE may be inflated by these outliers and the model clearly fits those countries poorly. I would look into the countries with unusually high or low life expectancies to see why they are unusual, such as whether infant mortality is high or there are deadly conflicts in the country.
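One way to find the countries worth investigating is to flag those with large standardized residuals. The sketch below is a minimal illustration on simulated data: the data frame `toy`, the country labels, the planted outlier, and the cutoff of 2 are all assumptions made for the example, not results from the AllCountries data.

```r
set.seed(42)

# Simulated stand-in data with one deliberately planted outlier
toy <- data.frame(
  Country  = paste0("Country", 1:30),
  Internet = runif(30, min = 5, max = 95)
)
toy$LifeExpectancy <- 60 + 0.2 * toy$Internet + rnorm(30, sd = 3)
toy$LifeExpectancy[7] <- toy$LifeExpectancy[7] - 20  # planted outlier

toy_model <- lm(LifeExpectancy ~ Internet, data = toy)

# Standardized residuals put every country on a comparable scale
toy$std_resid <- rstandard(toy_model)

# Flag countries whose standardized residual exceeds 2 in absolute value
flagged <- toy[abs(toy$std_resid) > 2, c("Country", "LifeExpectancy", "std_resid")]
flagged[order(-abs(flagged$std_resid)), ]
```

On real data with missing values, the names of `rstandard(multiple_model)` identify which rows of the data were actually used in the fit, which helps when matching residuals back to country names.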


Hypothetical Example

Explain how this multicollinearity might affect the interpretation of the regression coefficients and the reliability of the model.

This multicollinearity may decrease the reliability of the model by inflating the standard errors of the regression coefficients for energy and electricity. Because the model cannot separate the effects of the two predictors, their individual coefficients become unstable and hard to interpret, and their p-values may make a genuinely useful predictor look non-significant.
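The inflation of standard errors can be illustrated with a variance inflation factor (VIF), computed as 1 / (1 - R²) from regressing one predictor on the others. The sketch below uses simulated data in which Electricity is built to be almost a linear function of Energy; the variable values, the helper `vif_manual`, and the effect sizes are all hypothetical.

```r
set.seed(123)
n <- 200

# Simulated, hypothetical data: Electricity is nearly a linear function of Energy
sim <- data.frame(Energy = rnorm(n, mean = 100, sd = 20))
sim$Electricity    <- 0.9 * sim$Energy + rnorm(n, sd = 3)
sim$Internet       <- rnorm(n, mean = 50, sd = 15)  # unrelated predictor
sim$LifeExpectancy <- 50 + 0.05 * sim$Energy + 0.05 * sim$Electricity +
  0.1 * sim$Internet + rnorm(n, sd = 2)

# VIF for one predictor: 1 / (1 - R^2) from regressing it on the other predictors
vif_manual <- function(target, others, data) {
  fit <- lm(reformulate(others, response = target), data = data)
  1 / (1 - summary(fit)$r.squared)
}

vif_energy   <- vif_manual("Energy", c("Electricity", "Internet"), sim)
vif_internet <- vif_manual("Internet", c("Energy", "Electricity"), sim)

# Standard error of the Energy coefficient with and without Electricity
fit_with    <- lm(LifeExpectancy ~ Energy + Electricity + Internet, data = sim)
fit_without <- lm(LifeExpectancy ~ Energy + Internet, data = sim)
se_with     <- coef(summary(fit_with))["Energy", "Std. Error"]
se_without  <- coef(summary(fit_without))["Energy", "Std. Error"]

c(vif_energy = vif_energy, vif_internet = vif_internet)
c(se_with = se_with, se_without = se_without)
```

In this setup the VIF for Energy should be large while the VIF for the unrelated predictor stays near 1, and the standard error of the Energy coefficient should grow noticeably once Electricity enters the model.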