Load the dataset from a CSV file

library(tidyverse)  # attaches dplyr, so a separate library(dplyr) call is not needed

setwd("C:/Users/Joanne G/OneDrive/Data101(Fall 2025)/Datasets")
all_Countries_df <- read.csv("AllCountries.csv")

1- Simple Linear Regression (Fitting and Interpretation): Using the AllCountries dataset, fit a simple linear regression model to predict LifeExpectancy (average life expectancy in years) based on GDP (gross domestic product per capita in $US). Report the intercept and slope coefficients and interpret their meaning in the context of the dataset. What does the R² value tell you about how well GDP explains variation in life expectancy across countries?

simple_reg_model <- lm(LifeExpectancy ~ GDP, data = all_Countries_df)

summary(simple_reg_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = all_Countries_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

Interpretation: The intercept of 68.42 means that a country with a GDP per capita of $0 would have a predicted life expectancy of about 68.4 years. No country has a GDP per capita of $0, so the intercept is best read as the baseline of the fitted line rather than a realistic prediction. The slope of 0.0002476 indicates that each additional $1 in GDP per capita is associated with a 0.00025-year increase in predicted life expectancy, or about 2.5 years for every $10,000 increase. The R² value of 0.43 shows that GDP explains about 43% of the variation in life expectancy across countries, a moderate relationship in which GDP is important but not the only influencing factor.
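The arithmetic behind this interpretation can be checked with a minimal sketch that plugs the coefficients reported by summary() into the fitted line. The helper name predict_life is hypothetical and is not part of the original analysis; the numbers are the rounded estimates from the output above, so results are approximate.

```r
# Rounded coefficients copied from the summary() output of simple_reg_model
intercept <- 68.42      # predicted life expectancy (years) at GDP = 0
slope     <- 2.476e-04  # change in predicted years per additional $1 of GDP per capita

# Hypothetical helper: fitted line LifeExpectancy = intercept + slope * GDP
predict_life <- function(gdp) intercept + slope * gdp

# Difference in predicted life expectancy for a $10,000 increase in GDP per capita
effect_10k <- predict_life(10000) - predict_life(0)
effect_10k  # about 2.476 years
```

With the fitted model in memory, coef(simple_reg_model) returns the unrounded versions of the same two numbers.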

2- Multiple Linear Regression (Fitting and Interpretation): Fit a multiple linear regression model to predict LifeExpectancy using GDP, Health (percentage of government expenditures on healthcare), and Internet (percentage of population with internet access) as predictors. Interpret the coefficient for Health, explaining what it means in terms of life expectancy while controlling for GDP and Internet. How does the adjusted R² compare to the simple regression model from Question 1, and what does this suggest about the additional predictors?

multiple_reg_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = all_Countries_df)

summary(multiple_reg_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = all_Countries_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

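Interpretation: The Health coefficient of 0.248 means that, holding GDP and Internet constant, each additional percentage point of government expenditure devoted to healthcare is associated with an increase of about 0.25 years in predicted life expectancy, and this association is statistically significant (p = 0.000247). The adjusted R² rises from 0.427 in the simple regression model from Question 1 to 0.716 in the new model, indicating that adding Health and Internet greatly improves the model’s ability to explain variation in life expectancy. GDP is no longer significant once Health and Internet are included (p = 0.30), which suggests that much of its apparent effect in Question 1 overlaps with these predictors. The two models are fit on slightly different samples (38 versus 44 observations dropped for missingness), so the comparison is approximate. Overall, the additional predictors contribute meaningful explanatory power beyond GDP alone, showing that life expectancy is influenced by multiple social and economic factors, not just economic output.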
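As a rough consistency check, the adjusted R² values can be reproduced from the reported R², sample size, and number of predictors using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The function name adj_r2 is hypothetical; the sample sizes are inferred from the residual degrees of freedom in the two summaries (177 + 2 = 179 and 169 + 4 = 173), and the inputs are rounded, so agreement is only expected to about four decimal places.

```r
# Hypothetical helper: adjusted R-squared from R-squared, sample size n,
# and number of predictors p (not counting the intercept)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Simple model: R² = 0.4304, 177 residual df + 2 estimated coefficients = 179 rows
adj_simple <- adj_r2(0.4304, n = 179, p = 1)

# Multiple model: R² = 0.7213, 169 residual df + 4 estimated coefficients = 173 rows
adj_multiple <- adj_r2(0.7213, n = 173, p = 3)

round(adj_simple, 4)    # 0.4272
round(adj_multiple, 4)  # 0.7164
```

Because the penalty term (n - 1)/(n - p - 1) is close to 1 for both models, the jump in adjusted R² reflects a real gain in explained variation rather than an artifact of adding predictors.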

3- Checking Assumptions (Homoscedasticity and Normality): For the simple linear regression model from Question 1 (LifeExpectancy ~ GDP), describe how you would check the assumptions of homoscedasticity and normality of residuals. For each assumption, explain what an ideal outcome would look like and what a violation might indicate about the model’s reliability for predicting life expectancy. Afterward, code your answer and reflect on whether it matched the ideal outcome.

par(mfrow=c(2,2)); plot(simple_reg_model); par(mfrow=c(1,1))

Interpretation: For linearity, an ideal residuals vs. fitted plot shows residuals scattered randomly around zero with no clear pattern. A curve or systematic trend indicates non-linearity, suggesting the model does not fully capture the relationship between GDP and life expectancy. For homoscedasticity, an ideal scale-location plot shows residuals spread evenly across fitted values. A funnel or fan shape indicates heteroscedasticity, meaning predictions are more reliable for some countries than for others and the reported standard errors may be misleading. For normality of residuals, an ideal Q-Q plot shows points lying closely along the diagonal reference line. Deviations at the tails indicate non-normality, which can affect confidence intervals and hypothesis tests. For this model, the residual summary from Question 1 (minimum -16.35, maximum 9.33, median 1.55) points to a left-skewed residual distribution, so the normality assumption is unlikely to be met perfectly.
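The visual checks can be supplemented with simple numeric checks. The sketch below uses simulated data, not the AllCountries dataset, so that it runs on its own; the variable names and the simulated relationship are assumptions made for illustration. It applies the Shapiro-Wilk test to the residuals for normality and, as an informal check of constant variance, a Spearman correlation between the absolute residuals and the fitted values.

```r
# Simulated data for illustration only (NOT the AllCountries dataset)
set.seed(1)
x <- runif(200, min = 0, max = 100)
y <- 50 + 0.2 * x + rnorm(200, sd = 5)  # constant-variance, normal errors by construction
fit <- lm(y ~ x)
res <- resid(fit)

# Diagnostic plots: 1 = residuals vs fitted, 2 = normal Q-Q, 3 = scale-location
par(mfrow = c(1, 3))
plot(fit, which = c(1, 2, 3))
par(mfrow = c(1, 1))

# Normality: a small p-value would suggest the residuals are not normal
sw <- shapiro.test(res)

# Informal homoscedasticity check: a strong monotone association between
# |residuals| and fitted values would suggest a fan or funnel shape
spread_check <- cor.test(abs(res), fitted(fit), method = "spearman")

sw$p.value
spread_check$estimate
```

Replacing fit with simple_reg_model would apply the same checks to the model from Question 1.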

4- Diagnosing Model Fit (RMSE and Residuals): For the multiple regression model from Question 2 (LifeExpectancy ~ GDP + Health + Internet), calculate the RMSE and explain what it represents in the context of predicting life expectancy. How would large residuals for certain countries (e.g., those with unusually high or low life expectancy) affect your confidence in the model’s predictions, and what might you investigate further?

# Simple regression calculation (Question 1), for comparison
residuals_simple <- resid(simple_reg_model)

rmse_simple <- sqrt(mean(residuals_simple^2))
rmse_simple
## [1] 5.868172
# Multiple regression calculation (Question 2)
residuals_multiple <- resid(multiple_reg_model)

rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 4.056417

Interpretation: The RMSE of about 4.06 means that the multiple regression model’s predictions differ from the observed life expectancy by roughly 4 years for a typical country, compared with about 5.87 years for the simple model. Large residuals for certain countries, especially those with unusually high or low life expectancy, would reduce confidence in the model’s predictions for those cases, because they indicate that the model is not capturing all the factors influencing life expectancy there. The residual summary from Question 2 shows at least one country whose life expectancy is overpredicted by about 14.6 years, far beyond the typical error. To improve the model, it would be important to investigate the countries with the largest residuals and consider including other relevant variables, such as healthcare quality, education, inequality, or environmental factors.
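The RMSE computed here divides the sum of squared residuals by the number of observations n, while the residual standard error printed by summary() divides by the residual degrees of freedom. The two are therefore linked by RMSE = RSE × sqrt(df / n). The sketch below uses that relationship to cross-check the reported values; the function name rmse_from_rse is hypothetical, and the inputs are the rounded figures from the two summaries, so agreement is expected only to about two decimal places.

```r
# Hypothetical helper: convert a residual standard error (RSE) into an RMSE
# rse: residual standard error, df: residual degrees of freedom, n: rows used
rmse_from_rse <- function(rse, df, n) rse * sqrt(df / n)

# Simple model: RSE 5.901 on 177 df, n = 177 + 2 = 179
rmse_simple_check <- rmse_from_rse(5.901, df = 177, n = 179)

# Multiple model: RSE 4.104 on 169 df, n = 169 + 4 = 173
rmse_multiple_check <- rmse_from_rse(4.104, df = 169, n = 173)

round(rmse_simple_check, 2)    # 5.87
round(rmse_multiple_check, 2)  # 4.06
```

Both values agree with the RMSEs computed directly from the residuals, which confirms that each calculation used the intended model.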

5- Hypothetical Example (Multicollinearity in Multiple Regression): Suppose you are analyzing the AllCountries dataset and fit a multiple linear regression model to predict CO2 emissions (metric tons per capita) using Energy (kilotons of oil equivalent) and Electricity (kWh per capita) as predictors. You notice that Energy and Electricity are highly correlated. Explain how this multicollinearity might affect the interpretation of the regression coefficients and the reliability of the model.

Interpretation: If Energy and Electricity are highly correlated, the model cannot cleanly separate their individual contributions to CO2 emissions, because the data contain little information about what happens when one changes while the other stays fixed. As a result, the coefficient estimates would have inflated standard errors and would be unstable: they could change dramatically, or even flip sign, with small changes in the data, and neither predictor might appear significant even if the pair jointly predicts CO2 emissions well. The overall predictive accuracy of the model might remain acceptable, but the high correlation prevents us from confidently determining the importance of each variable, limiting the model’s usefulness for understanding the specific effects of Energy and Electricity on CO2 emissions.
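The effect can be illustrated with a small simulation. The data below are simulated and are not the AllCountries dataset; the variable names only mirror the hypothetical scenario, and the coefficients used to generate the data are arbitrary assumptions. The sketch computes the variance inflation factor (VIF) for energy by hand as 1 / (1 - R²), where R² comes from regressing energy on electricity, and compares the standard error of the energy coefficient with and without the correlated predictor.

```r
# Simulated data for illustration only (NOT the AllCountries dataset)
set.seed(42)
n <- 100
energy      <- rnorm(n, mean = 50, sd = 10)
electricity <- energy + rnorm(n, sd = 1)          # nearly a copy of energy
co2         <- 2 + 0.1 * energy + rnorm(n, sd = 1)

fit_both <- lm(co2 ~ energy + electricity)  # both correlated predictors
fit_one  <- lm(co2 ~ energy)                # energy alone

# VIF for energy: 1 / (1 - R²) from regressing it on the other predictor
r2_aux     <- summary(lm(energy ~ electricity))$r.squared
vif_energy <- 1 / (1 - r2_aux)

# Standard error of the energy coefficient in each model
se_both <- summary(fit_both)$coefficients["energy", "Std. Error"]
se_one  <- summary(fit_one)$coefficients["energy", "Std. Error"]

vif_energy
c(with_electricity = se_both, energy_only = se_one)
```

A VIF well above the common rule-of-thumb thresholds of 5 or 10, together with a much larger standard error in the two-predictor model, is the numeric signature of the instability described above.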