Step 1:

countries <- read.csv('AllCountries.csv')

Step 2:

Simple Linear Regression (Fitting and Interpretation): Using the AllCountries dataset, fit a simple linear regression model to predict LifeExpectancy (average life expectancy in years) based on GDP (gross domestic product per capita in $US). Report the intercept and slope coefficients and interpret their meaning in the context of the dataset. What does the R² value tell you about how well GDP explains variation in life expectancy across countries?
simple_model <- lm(LifeExpectancy ~ GDP, data = countries)
simple_model
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
## 
## Coefficients:
## (Intercept)          GDP  
##   6.842e+01    2.476e-04
summary(simple_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

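The intercept is 68.42, meaning the model predicts a life expectancy of 68.42 years for a country whose GDP per capita is $0. Since a GDP of zero is not realistic, this value is best read as the baseline of the fitted line rather than as a meaningful prediction.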

The slope coefficient for GDP is 0.0002476, which means that each additional $1 of GDP per capita is associated with an average increase of 0.0002476 years in predicted life expectancy, or roughly 0.25 years per additional $1,000.
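
Because GDP is measured in single dollars, the slope is easier to read after rescaling. The short sketch below only reuses the two coefficients printed above (it does not refit the model). The names b0 and b1 and the $10,000 example value are illustrative choices, not part of the original analysis.

```r
# Coefficients copied from the summary(simple_model) output above
b0 <- 68.42        # intercept (years)
b1 <- 2.476e-04    # slope (years per $1 of GDP per capita)

# Slope expressed per $1,000 of GDP per capita
slope_per_1000 <- b1 * 1000

# Predicted life expectancy for a hypothetical country with GDP = $10,000
pred_10000 <- b0 + b1 * 10000

slope_per_1000
pred_10000
```

This gives about 0.25 years per $1,000 and a predicted life expectancy of about 70.9 years at $10,000.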

R² is 0.4304, indicating that about 43% of the variance in life expectancy across countries is explained by GDP. The remaining 57% or so is left unexplained by this model.

Step 3:

Multiple Linear Regression (Fitting and Interpretation) Fit a multiple linear regression model to predict LifeExpectancy using GDP, Health (percentage of government expenditures on healthcare), and Internet (percentage of population with internet access) as predictors. Interpret the coefficient for Health, explaining what it means in terms of life expectancy while controlling for GDP and Internet. How does the adjusted R² compare to the simple regression model from Question 1, and what does this suggest about the additional predictors?
multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = countries)
multiple_model
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = countries)
## 
## Coefficients:
## (Intercept)          GDP       Health     Internet  
##   5.908e+01    2.367e-05    2.479e-01    1.903e-01
summary(multiple_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = countries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

The coefficient for Health is 0.248, which means that each one-percentage-point increase in the share of government expenditures going to healthcare is associated with an increase of about 0.25 years in predicted life expectancy, holding GDP and Internet constant.

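The adjusted R² for the multiple regression model is 0.7164, much higher than the 0.4272 of the simple regression model. Because adjusted R² penalizes extra predictors, this increase suggests that Health and Internet add real explanatory power and explain life expectancy better than GDP alone. (The two models were fit on slightly different sets of countries, 179 versus 173, because of missing values, so the comparison is approximate.)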
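
As a quick consistency check, the adjusted R² values can be recomputed from the reported R², the sample size, and the number of predictors. This is a sketch using only numbers copied from the two summaries above. The helper adj_r2 is a hypothetical name introduced here, and the sample sizes are inferred from the residual degrees of freedom.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# n = residual degrees of freedom + number of estimated coefficients
n_simple   <- 177 + 2   # intercept + GDP
n_multiple <- 169 + 4   # intercept + GDP + Health + Internet

adj_simple   <- adj_r2(0.4304, n_simple, p = 1)
adj_multiple <- adj_r2(0.7213, n_multiple, p = 3)

round(c(simple = adj_simple, multiple = adj_multiple), 4)
```

The results agree with the 0.4272 and 0.7164 printed by summary().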

Step 4:

Checking Assumptions (Homoscedasticity and Normality) For the simple linear regression model from Question 1 (LifeExpectancy ~ GDP), describe how you would check the assumptions of homoscedasticity and normality of residuals. For each assumption, explain what an ideal outcome would look like and what a violation might indicate about the model’s reliability for predicting life expectancy. Afterward, code your answer and reflect on whether it matched the ideal outcome.

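We check homoscedasticity using the Residuals vs. Fitted plot. Ideally, the residuals are randomly scattered around zero in a band of roughly constant width, with no clear pattern, indicating constant error variance. A violation, such as a funnel shape in which the spread grows or shrinks with the fitted values, would mean the error variance is not constant (heteroscedasticity), which makes the standard errors, and therefore the p-values and prediction intervals, less reliable.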

We check the normality of residuals using the Q-Q residuals plot. Ideally, the points fall close to the diagonal reference line. Systematic departures, especially in the tails, would indicate that the errors are not normally distributed. This mainly undermines the p-values, confidence intervals, and prediction intervals rather than the point predictions themselves.

par(mfrow=c(2,2)); plot(simple_model); par(mfrow=c(1,1))

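From the plots, the residuals don’t look random: there is a slight curve and an uneven spread, indicating that the relationship is not fully linear and that the error variance isn’t constant. The Q-Q plot also bends at the ends, showing that the residuals are not perfectly normal. Overall, the model does not match the ideal outcome. It is usable as a rough summary, but its standard errors and prediction intervals should be treated with caution.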
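
Visual checks can be backed up with numeric ones. The sketch below uses simulated stand-in data, not the AllCountries data, so it runs on its own. It applies a Shapiro-Wilk test to the residuals and a hand-computed studentized Breusch-Pagan check (squared residuals regressed on the fitted values). All variable names and the simulated values are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed so the simulated example is repeatable

# Simulated stand-in data (NOT the AllCountries data): error spread grows with x
n <- 200
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)
fit <- lm(y ~ x)

# Normality: Shapiro-Wilk test on the residuals
# (a small p-value suggests the residuals are not normal)
sw <- shapiro.test(resid(fit))

# Constant variance: regress squared residuals on the fitted values.
# n * R^2 of this auxiliary regression is the studentized Breusch-Pagan
# statistic, compared against a chi-squared distribution with 1 df.
e2 <- resid(fit)^2
fv <- fitted(fit)
aux <- lm(e2 ~ fv)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(shapiro_p = sw$p.value, bp_p = bp_p)
```

To run the same checks on the real model, replace fit with simple_model and n with nobs(simple_model).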

Step 5:

Diagnosing Model Fit (RMSE and Residuals) For the multiple regression model from Question 2 (LifeExpectancy ~ GDP + Health + Internet), calculate the RMSE and explain what it represents in the context of predicting life expectancy. How would large residuals for certain countries (e.g., those with unusually high or low life expectancy) affect your confidence in the model’s predictions, and what might you investigate further?
residuals_multiple <- resid(multiple_model)

rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 4.056417

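The RMSE is 4.06, which means the model’s predictions typically miss the actual life expectancy by about 4 years. Because the residuals are squared before averaging, large errors count more heavily than they would in a simple average of absolute errors.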
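
The RMSE is slightly smaller than the residual standard error of 4.104 reported by summary(multiple_model) because the RMSE divides the residual sum of squares by the number of observations, while the residual standard error divides by the residual degrees of freedom. The sketch below links the two using only the printed values. The variable names are illustrative.

```r
rse <- 4.104   # residual standard error from summary(multiple_model)
df  <- 169     # residual degrees of freedom
n   <- df + 4  # observations used: 169 + 4 estimated coefficients = 173

# RSE = sqrt(RSS / df) and RMSE = sqrt(RSS / n), so RMSE = RSE * sqrt(df / n)
rmse_from_rse <- rse * sqrt(df / n)
rmse_from_rse
```

The result is approximately 4.056, which matches the RMSE computed above up to the rounding of the printed residual standard error.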

If some countries have very large residuals, it lowers our confidence in the accuracy of life expectancy predictions for those countries.

One thing I might investigate further is adding more predictors to see if we can improve the accuracy of our predictions as much as possible.
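
Before adding predictors, it may also help to see which observations have the largest residuals. The sketch below uses simulated stand-in data with one planted outlier, not the AllCountries data, and flags rows whose standardized residual exceeds 2 in absolute value. The cutoff of 2 is a common rule of thumb rather than a fixed standard, and all names are hypothetical.

```r
set.seed(2)  # arbitrary seed so the simulated example is repeatable

# Simulated stand-in data (NOT the AllCountries data)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[10] <- y[10] + 10   # plant one unusually large observation at row 10
fit <- lm(y ~ x)

# Standardized residuals put every residual on a common scale
std_res <- rstandard(fit)

# Rows whose standardized residual exceeds 2 in absolute value
flagged <- which(abs(std_res) > 2)
flagged
```

Applied to multiple_model, the same two lines (rstandard() followed by which()) would point to the countries worth a closer look.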

Step 6:

Hypothetical Example (Multicollinearity in Multiple Regression) Suppose you are analyzing the AllCountries dataset and fit a multiple linear regression model to predict CO2 emissions (metric tons per capita) using Energy (kilotons of oil equivalent) and Electricity (kWh per capita) as predictors. You notice that Energy and Electricity are highly correlated. Explain how this multicollinearity might affect the interpretation of the regression coefficients and the reliability of the model.

If Energy and Electricity are highly correlated, they carry much of the same information. The model then cannot cleanly separate their effects, so we cannot tell which variable is responsible for a change in CO2 emissions. The individual coefficients also become unstable: their standard errors are inflated, and even a small change in the data could change their size or sign dramatically. This lowers the reliability of the model for interpretation. The overall predictions might still be accurate, but we could not be sure which variable contributes more to the outcome.
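
The effect can be illustrated with a small simulation. The sketch below uses made-up predictors x1 and x2, not the actual Energy and Electricity columns, where x2 is nearly a copy of x1. It computes the variance inflation factor by hand and compares the standard error of the x1 coefficient with and without x2 in the model.

```r
set.seed(3)  # arbitrary seed so the simulated example is repeatable

# Simulated stand-in predictors (NOT the AllCountries data)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # x2 is almost identical to x1
y  <- 1 + x1 + x2 + rnorm(n)

# Variance inflation factor for x1:
# 1 / (1 - R^2) from regressing x1 on the other predictor
r2_x1  <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2_x1)

# Standard error of the x1 coefficient with and without x2 in the model
se_alone    <- summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]
se_together <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]

c(cor = cor(x1, x2), vif = vif_x1,
  se_alone = se_alone, se_together = se_together)
```

A large VIF and a much larger standard error once both predictors are included are the numerical signs of the instability described above.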