library(tidyverse)
countries <- read_csv("AllCountries.csv")
Predict LifeExpectancy (average life expectancy in years) based on GDP (gross domestic product per capita in $US)
# Fit simple linear regression model
simple_model <- lm(LifeExpectancy ~ GDP, data = countries)
# Interpret the model
summary(simple_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.352 -3.882 1.550 4.458 9.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
## GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
Interpretation
Intercept: ~68.4, representing predicted life
expectancy in years when the GDP is $0.
Slope: ~0.00025, meaning for every $1 increase in
GDP per capita, life expectancy is predicted to increase by around
0.00025 years.
R²: ~0.43, meaning around 43% of the variance in life expectancy is explained by GDP alone. This is a modest fit: more than half of the variance is left unexplained.
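As a worked example of the intercept and slope, the fitted line can be evaluated by hand using the rounded coefficients from the summary output above. This is only a sketch: the GDP value of $10,000 is a hypothetical input chosen for illustration, and the variable names are my own.

```r
# Rounded coefficients copied from summary(simple_model)
intercept <- 68.42      # predicted life expectancy (years) at GDP = 0
slope     <- 2.476e-04  # change in years per $1 of GDP per capita

# Hypothetical GDP per capita, chosen only for illustration
gdp_example <- 10000

# Predicted life expectancy from the fitted line
predicted_life <- intercept + slope * gdp_example
predicted_life

# The slope is easier to read per $1,000 of GDP
slope_per_1000 <- slope * 1000
slope_per_1000
```

With the model object in the session, `predict(simple_model, newdata = data.frame(GDP = 10000))` should give nearly the same value, differing only by the rounding of the coefficients.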
Predict LifeExpectancy using GDP, Health, and Internet as predictors.
# Fit the multiple linear regression model
multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = countries)
# Interpret the model
summary(multiple_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5662 -1.8227 0.4108 2.5422 9.4161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
## GDP 2.367e-05 2.287e-05 1.035 0.302025
## Health 2.479e-01 6.619e-02 3.745 0.000247 ***
## Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
Interpretation
Health coefficient: ~0.248, meaning that while
controlling for GDP and Internet, each one-unit increase in Health is
associated with a predicted increase of about 0.248 years in life
expectancy.
Adjusted R²: ~0.716, meaning around 71.6% of the variance in life expectancy is explained by this model after adjusting for the number of predictors. This is higher than the simple model’s adjusted R² of ~0.427, showing that the additional predictors account for more of the variation in life expectancy and give more accurate predictions.
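The adjusted R² values reported by `summary()` can be reproduced from the multiple R² and the sample sizes, which makes the penalty for extra predictors explicit. The sketch below assumes the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1); the helper name `adj_r2` is my own, and n is recovered as the residual degrees of freedom plus the number of estimated coefficients.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Simple model: 177 residual df + 2 coefficients = 179 countries
adj_simple <- adj_r2(0.4304, n = 179, p = 1)

# Multiple model: 169 residual df + 4 coefficients = 173 countries
adj_multiple <- adj_r2(0.7213, n = 173, p = 3)

round(c(simple = adj_simple, multiple = adj_multiple), 4)
```

Note that the two models were fit on slightly different sets of countries (179 versus 173) because of missing values, so the comparison is approximate.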
From the simple linear regression model in Q1, describe how you would check the assumptions of homoscedasticity and normality of residuals
I will use a residuals vs fitted and scale-location plot to check the
assumption of homoscedasticity, and a Q-Q plot to check the assumption
of normality.
For homoscedasticity, an ideal residuals vs fitted plot will show a
consistent spread of points along the whole range of fitted values,
evenly scattered around zero, and an ideal scale-location plot will have
a roughly horizontal red line with an even spread of points. Violations
mean the error variance changes with the fitted values, so the standard
errors, and therefore the p-values and confidence intervals, may be
unreliable.
For normality, in an ideal Q-Q plot the residuals will follow the
straight diagonal line. Violations suggest that the p-values and
confidence intervals may be inaccurate, so coefficients could appear
significant when they are not.
Check Assumptions
plot(simple_model)
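Calling `plot()` on an `lm` object steps through several diagnostic plots. A minimal sketch for drawing only the three plots discussed here uses the `which` argument of `plot.lm`. The fallback model on R's built-in `cars` data is a stand-in I added so the snippet runs on its own; in this document `simple_model` from Q1 would be used.

```r
# Use simple_model if it exists in the session; otherwise fall back to a
# stand-in model on the built-in cars data (assumption, for illustration only)
diag_model <- if (exists("simple_model")) simple_model else lm(dist ~ speed, data = cars)

old_par <- par(mfrow = c(1, 3))  # three panels side by side
plot(diag_model, which = 1)      # Residuals vs Fitted
plot(diag_model, which = 3)      # Scale-Location
plot(diag_model, which = 2)      # Normal Q-Q
par(old_par)                     # restore the previous layout
```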
Reflection
The residuals vs fitted plot shows residuals that are fairly evenly
scattered around zero, but their spread is somewhat uneven across the
fitted values, suggesting some heteroscedasticity.
The scale-location plot shows a spread that increases towards the
middle of the fitted values (predicted life expectancy in years),
indicating some heteroscedasticity.
Overall the plots’ uneven spread patterns do not
match the ideal outcomes for homoscedasticity, but they are not severe
enough to invalidate the model.
The Q-Q residuals plot deviates from the diagonal line at the left and right tails, which does not match the ideal outcome for normality, but the deviation is not severe enough to invalidate the model.
From the multiple regression model from Q2, calculate the RMSE and explain what it represents in the context of predicting life expectancy
# Calculate residuals
residuals_multiple <- resid(multiple_model)
# Calculate RMSE for multiple model
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 4.056417
RMSE: 4.06, meaning the model’s predictions typically miss actual
life expectancy by about 4.06 years.
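The RMSE is closely related to the residual standard error in the summary output: both come from the same sum of squared residuals, but the RMSE divides by n while the residual standard error divides by the residual degrees of freedom. As a rough check, assuming only the rounded values printed above (the variable names are my own), the RMSE can be recovered like this:

```r
# Values copied from summary(multiple_model)
rse      <- 4.104  # residual standard error
df_resid <- 169    # residual degrees of freedom
n        <- df_resid + 4  # 3 predictors + intercept = 4 coefficients

# RMSE = sqrt(RSS / n), while RSE = sqrt(RSS / df_resid)
rmse_from_rse <- rse * sqrt(df_resid / n)
rmse_from_rse

# With the model in the session, the equivalent calculation would be:
# sigma(multiple_model) * sqrt(df.residual(multiple_model) / nobs(multiple_model))
```

The result agrees with the 4.056417 computed from the residuals up to rounding of the printed standard error.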
How would large residuals for certain countries affect your
confidence in the model’s predictions, and what might you investigate
further?
Large residuals for certain countries would decrease my confidence in the model’s predictions, as the RMSE may be skewed by these outlying life expectancies. I would look into the countries with unusually high or low life expectancies to see why they are unusual, for example whether infant mortality is high or there are deadly conflicts in the country.
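One way to find the countries worth investigating is to flag observations with large standardized residuals. The sketch below is an assumption-laden illustration: the cutoff of 2 is a common rule of thumb rather than a strict test, and the fallback model on R's built-in `mtcars` data is a stand-in so the snippet runs on its own; in this document `multiple_model` from Q2 would be used.

```r
# Use multiple_model if available; otherwise a stand-in fit on mtcars
# (assumption, for illustration only)
check_model <- if (exists("multiple_model")) multiple_model else lm(mpg ~ wt + hp, data = mtcars)

# Standardized residuals put every observation on a common scale
std_res <- rstandard(check_model)

# Flag observations more than 2 standard deviations from the fitted value
flagged <- std_res[abs(std_res) > 2]

# Largest absolute residuals first
flagged[order(abs(flagged), decreasing = TRUE)]
```

The names of `flagged` are the row labels of the data used in the fit, so for the life expectancy model they can be matched back to rows of `countries`.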
Explain how this multicollinearity might affect the
interpretation of the regression coefficients and the reliability of the
model.
This multicollinearity may decrease the reliability of the model by inflating the standard errors of the regression coefficients for energy and electricity. The individual coefficient estimates become unstable and their p-values less trustworthy, making it harder to tell which predictors are actually significant.
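The effect on standard errors can be illustrated with a small simulation. This is a sketch on made-up data, not the countries data: `x1` and `x2` are hypothetical stand-ins for two highly correlated predictors, and the variance inflation factor is computed by hand as 1 / (1 - R²) from regressing one predictor on the other.

```r
set.seed(1)  # reproducible simulated data (not the countries data)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # x2 is almost a copy of x1
y  <- 2 * x1 + rnorm(n)

fit_single <- lm(y ~ x1)       # x1 alone
fit_both   <- lm(y ~ x1 + x2)  # x1 together with its near-duplicate

# Standard error of the x1 coefficient in each model
se_single <- coef(summary(fit_single))["x1", "Std. Error"]
se_both   <- coef(summary(fit_both))["x1", "Std. Error"]

# Variance inflation factor for x1: 1 / (1 - R^2 of x1 regressed on x2)
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2))$r.squared)

c(se_single = se_single, se_both = se_both, vif_x1 = vif_x1)
```

Adding the near-duplicate predictor leaves the overall fit almost unchanged but makes the standard error for `x1` much larger, which is the pattern described above.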