library(dplyr)
library(tidyverse)
setwd("C:/Users/Joanne G/OneDrive/Data101(Fall 2025)/Datasets")
all_Countries_df <- read.csv("AllCountries.csv")
simple_reg_model <- lm(LifeExpectancy ~ GDP, data = all_Countries_df)
summary(simple_reg_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = all_Countries_df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -16.352  -3.882   1.550   4.458   9.330
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
Interpretation: The intercept of 68.42 means that a country with a GDP per capita of $0 would have a predicted life expectancy of about 68.4 years. This is an extrapolation, since no real country has a GDP of zero. The slope of 0.0002476 indicates that each additional $1 in GDP per capita is associated with a 0.00025-year increase in life expectancy, or about 2.5 years for every $10,000 increase. The R² value of 0.43 shows that GDP explains about 43% of the variation in life expectancy across countries, a moderately strong relationship in which GDP is important but not the only influencing factor.
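The arithmetic behind this interpretation can be checked with a short sketch. The coefficients are copied by hand from the summary output, and the GDP levels and the helper name `predict_life_expectancy` are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from the summary(simple_reg_model) output (rounded as printed).
intercept <- 68.42
slope <- 2.476e-04

# Hypothetical helper: predicted life expectancy for a given GDP per capita.
predict_life_expectancy <- function(gdp) intercept + slope * gdp

# Illustrative GDP per capita levels (assumed values, not taken from the data set).
gdp_values <- c(0, 10000, 20000, 50000)
predictions <- predict_life_expectancy(gdp_values)
print(data.frame(GDP = gdp_values, PredictedLifeExpectancy = predictions))

# Change in predicted life expectancy for a $10,000 increase in GDP per capita.
increase_per_10k <- predictions[2] - predictions[1]
print(increase_per_10k)
```

With the fitted model object available, `predict(simple_reg_model, newdata = data.frame(GDP = gdp_values))` should give the same predictions up to rounding of the printed coefficients.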
multiple_reg_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = all_Countries_df)
summary(multiple_reg_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = all_Countries_df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -14.5662  -1.8227   0.4108   2.5422   9.4161
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
Interpretation: The adjusted R² rises from 0.43 in the simple regression model from Question 1 to 0.72 in the new model, indicating that adding Health and Internet access greatly improves the model’s ability to explain variation in life expectancy. This substantial increase suggests that these additional predictors contribute meaningful explanatory power beyond GDP alone. Once they are included, the GDP coefficient is no longer statistically significant (p = 0.30), which suggests that much of GDP's apparent effect in the simple model overlaps with Health and Internet. Life expectancy is therefore influenced by multiple social and economic factors, not just economic output. The two models are fitted on slightly different samples (179 and 173 countries) because of missing values, so the comparison is approximate.
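The adjusted R² values being compared can be reproduced from the printed R², the sample size, and the number of predictors. The sketch below uses the standard adjustment formula; the helper name `adjusted_r2` is an assumption, and the sample sizes are recovered from the residual degrees of freedom in the two outputs (177 + 2 and 169 + 4).

```r
# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# where n is the number of observations used and p the number of predictors.
adjusted_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Simple model: R^2 = 0.4304, 177 residual df + 2 coefficients = 179 observations.
adj_simple <- adjusted_r2(r2 = 0.4304, n = 179, p = 1)

# Multiple model: R^2 = 0.7213, 169 residual df + 4 coefficients = 173 observations.
adj_multiple <- adjusted_r2(r2 = 0.7213, n = 173, p = 3)

print(round(c(simple = adj_simple, multiple = adj_multiple), 4))
```

Because the adjustment penalizes each added predictor, comparing the adjusted values (rather than raw R²) is the fairer way to judge models with different numbers of predictors.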
par(mfrow=c(2,2)); plot(simple_reg_model); par(mfrow=c(1,1))
Interpretation: For linearity, an ideal residuals vs. fitted plot would show residuals scattered randomly around zero with no clear pattern, while a curve or systematic trend indicates non-linearity, suggesting the model may not fully capture the relationship between GDP and life expectancy. For homoscedasticity, an ideal scale-location plot shows residuals spread evenly across fitted values, whereas a funnel or fan shape indicates heteroscedasticity, meaning predictions may be more reliable for some countries than others. For normality of residuals, an ideal Q-Q plot shows points lying closely along the reference line, while deviations at the tails suggest non-normality, which could affect confidence intervals and hypothesis tests.
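The visual checks described above can be supplemented with rough numeric summaries. The following is a minimal sketch, not part of the original analysis: the helper `diagnostic_summary` is hypothetical, and it is demonstrated on R's built-in `cars` data set so that it runs without `AllCountries.csv`.

```r
# Hypothetical helper: rough numeric companions to the three diagnostic plots.
diagnostic_summary <- function(model) {
  res <- resid(model)
  fit <- fitted(model)
  list(
    # Normality: Shapiro-Wilk test; a small p-value suggests non-normal residuals.
    shapiro_p = shapiro.test(res)$p.value,
    # Homoscedasticity: rank correlation between |residuals| and fitted values;
    # values far from 0 hint at a funnel or fan shape.
    spread_trend = cor(abs(res), fit, method = "spearman"),
    # Linearity: correlation between residuals and squared centered fitted values;
    # values far from 0 hint at curvature the straight line misses.
    curvature = cor(res, (fit - mean(fit))^2)
  )
}

# Demonstration on a built-in data set (stopping distance vs. speed).
cars_fit <- lm(dist ~ speed, data = cars)
checks <- diagnostic_summary(cars_fit)
print(checks)
```

Calling `diagnostic_summary(simple_reg_model)` would apply the same checks to the GDP model. These numbers are only a complement to the plots, which remain the primary diagnostic.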
# Simple regression RMSE (Question 1)
residuals_simple <- resid(simple_reg_model)
rmse_simple <- sqrt(mean(residuals_simple^2))
rmse_simple
## [1] 5.868172
# Multiple regression RMSE (Question 2)
residuals_multiple <- resid(multiple_reg_model)
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 4.056417
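The RMSE values above differ slightly from the residual standard errors in the model summaries (5.901 and 4.104) because RMSE divides the residual sum of squares by the number of observations, while the residual standard error divides by the residual degrees of freedom. The sketch below converts one to the other; the helper name `rmse_from_rse` is an assumption, and the inputs are copied from the printed summaries.

```r
# RMSE = sqrt(RSS / n) and residual standard error = sqrt(RSS / df),
# so RMSE = RSE * sqrt(df / n).
rmse_from_rse <- function(rse, df, n) rse * sqrt(df / n)

# Simple model: RSE 5.901 on 177 df, 179 observations used.
rmse_simple_check <- rmse_from_rse(rse = 5.901, df = 177, n = 179)

# Multiple model: RSE 4.104 on 169 df, 173 observations used.
rmse_multiple_check <- rmse_from_rse(rse = 4.104, df = 169, n = 173)

print(c(simple = rmse_simple_check, multiple = rmse_multiple_check))
```

The results should agree with the computed RMSE values of about 5.87 and 4.06, up to rounding of the printed standard errors.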
Interpretation: Large residuals for certain countries, especially those with unusually high or low life expectancy, would reduce confidence in the model’s predictions for those cases, because they indicate the model is not capturing all the factors influencing life expectancy. The multiple regression model has an RMSE of about 4.06 years, so its predictions are reasonably accurate on average, but such outliers suggest that important predictors may be missing. To improve the model, it would be important to investigate the countries with the largest residuals and consider including other relevant variables, such as healthcare quality, education, inequality, or environmental factors.
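One way to find the countries with the largest residuals is sketched below. The helper `largest_residuals` is hypothetical, and it is demonstrated on R's built-in `mtcars` data so that it runs on its own. It assumes the model was fitted with the default `na.action`, so that the residual names match the row names of the data frame.

```r
# Hypothetical helper: return the n rows with the largest absolute residuals.
largest_residuals <- function(model, data, n = 5) {
  res <- resid(model)
  # Order residuals by absolute size, largest first, and keep the top n.
  top <- order(abs(res), decreasing = TRUE)[seq_len(min(n, length(res)))]
  # Residual names are the row names of the rows actually used in the fit,
  # so rows dropped for missing values are skipped automatically.
  rows <- names(res)[top]
  data.frame(data[rows, , drop = FALSE], residual = unname(res[top]))
}

# Demonstration on a built-in data set (fuel economy vs. weight).
mtcars_fit <- lm(mpg ~ wt, data = mtcars)
top_cases <- largest_residuals(mtcars_fit, mtcars, n = 3)
print(top_cases[, c("mpg", "wt", "residual")])
```

Applied to this analysis, `largest_residuals(multiple_reg_model, all_Countries_df)` would list the countries the model fits worst, which is a natural starting point for the investigation described above.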