---
title: "HW 9"
author: "Hadiyah Sumter"
date: "2025-11-29"
output: html_document
---

setwd("~/Desktop/DATA-101")
countries <- read.csv("AllCountries.csv")

## 2 Simple Linear Regression (Fitting and Interpretation)

Using the AllCountries dataset, fit a simple linear regression model to predict LifeExpectancy (average life expectancy in years) based on GDP (gross domestic product per capita in $US). Report the intercept and slope coefficients and interpret their meaning in the context of the dataset. What does the R² value tell you about how well GDP explains variation in life expectancy across countries?

simple_model <- lm(LifeExpectancy ~ GDP, data = countries)
simple_model
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
## 
## Coefficients:
## (Intercept)          GDP  
##   6.842e+01    2.476e-04
summary(simple_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16
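
Reading the output above: the intercept (68.42) is the predicted life expectancy in years for a country with a GDP per capita of $0, which is an extrapolation rather than a realistic case. The slope (2.476e-04) is the expected change in life expectancy for each additional $1 of GDP per capita. The R² of 0.4304 means GDP alone accounts for roughly 43% of the variation in life expectancy across the countries in the fit. The short sketch below restates those numbers in friendlier units. The values are copied by hand from the printed summary, and the $10,000 GDP figure is a hypothetical input chosen only for illustration.

```r
# Coefficients copied from the summary(simple_model) output above.
# In a live session, coef(simple_model) returns the same values unrounded.
intercept <- 68.42       # predicted life expectancy (years) when GDP = 0
slope     <- 2.476e-04   # change in life expectancy per $1 of GDP per capita

# Rescale the slope to years per $1,000 of GDP per capita
slope_per_1000 <- slope * 1000
slope_per_1000

# Predicted life expectancy for a hypothetical country with GDP = $10,000
gdp_example <- 10000
predicted   <- intercept + slope * gdp_example
predicted

# R-squared from the output, expressed as a percentage of variance explained
r_squared <- 0.4304
round(100 * r_squared, 1)
```

Expressed per $1,000, the slope is about a quarter of a year of life expectancy, which is easier to interpret than the per-dollar figure.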

## 3 Multiple Linear Regression (Fitting and Interpretation)

Fit a multiple linear regression model to predict LifeExpectancy using GDP, Health (percentage of government expenditures on healthcare), and Internet (percentage of population with internet access) as predictors. Interpret the coefficient for Health, explaining what it means in terms of life expectancy while controlling for GDP and Internet. How does the adjusted R² compare to the simple regression model from Question 1, and what does this suggest about the additional predictors?

multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = countries)
multiple_model
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = countries)
## 
## Coefficients:
## (Intercept)          GDP       Health     Internet  
##   5.908e+01    2.367e-05    2.479e-01    1.903e-01
summary(multiple_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = countries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16
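
Reading the output above: the Health coefficient (0.2479) means that a one-percentage-point increase in the share of government spending on healthcare is associated with about 0.25 more years of life expectancy, holding GDP and Internet constant. Adjusted R² rises from 0.4272 in the simple model to 0.7164 here, which suggests the added predictors explain substantially more variation, though GDP itself is no longer significant (p = 0.302) once Internet is in the model. One caution: the two models were fit on slightly different samples because of missing values. The sketch below recovers the sample sizes from the printed degrees of freedom and recomputes adjusted R² with the standard formula. The helper `adj_r2` is a name introduced here for illustration, and the 5-point gap is a hypothetical input.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# n is recovered as residual df + number of estimated coefficients
n_simple   <- 177 + 2   # intercept + GDP
n_multiple <- 169 + 4   # intercept + GDP + Health + Internet

adj_simple   <- adj_r2(0.4304, n_simple, p = 1)
adj_multiple <- adj_r2(0.7213, n_multiple, p = 3)

# Should agree with the printed adjusted R-squared values up to rounding
round(c(simple = adj_simple, multiple = adj_multiple), 4)

# Health coefficient: years of life expectancy per percentage point of spending
health_coef <- 0.2479
health_coef * 5   # implied difference for a 5-point gap, other predictors fixed
```

Because the simple model used 179 countries and the multiple model 173, a stricter comparison would refit the simple model on the same 173 rows.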

## 4 Checking Assumptions (Homoscedasticity and Normality)

For the simple linear regression model from Question 1 (LifeExpectancy ~ GDP), describe how you would check the assumptions of homoscedasticity and normality of residuals. For each assumption, explain what an ideal outcome would look like and what a violation might indicate about the model’s reliability for predicting life expectancy. Afterward, code your answer and reflect on whether it matched the ideal outcome.

To assess homoscedasticity, I examine the Residuals vs. Fitted plot generated by plot(simple_model). An ideal outcome would show residuals randomly scattered around zero with no clear pattern and a roughly constant vertical spread, indicating equal variance across all fitted values. A funnel shape (spread that widens or narrows as the fitted values change) would suggest heteroscedasticity, meaning the size of the model’s prediction errors depends on the fitted value, which makes the standard errors and confidence intervals unreliable. A curved pattern in the same plot would instead point to a non-linear relationship that the straight-line model misses.

To assess normality of residuals, I review the Normal Q–Q Plot from the same diagnostic output. Ideally, the residuals should fall closely along the diagonal line, which indicates that they follow a roughly normal distribution. Major deviations or curvature in the Q–Q plot would indicate non-normality, which can affect the accuracy of hypothesis tests and prediction intervals.

par(mfrow = c(2,2))
plot(simple_model)

par(mfrow = c(1,1))

Using the plot(simple_model) diagnostics, the Residuals vs Fitted plot showed a fairly random scatter with no strong patterns, suggesting that the homoscedasticity assumption was met. The Q–Q plot showed that most points followed the diagonal line closely with only minor deviations at the tails, indicating approximate normality of residuals. Overall, the diagnostic plots were close to the ideal outcomes, so the model appears reasonably reliable for predicting life expectancy based on GDP.
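
As a numeric complement to the visual checks, the sketch below draws the two relevant panels one at a time and adds two rough summaries. It runs on simulated data that satisfies the assumptions by construction, not on AllCountries. The object names (`sim_model`, `spread_ratio`) and the split-at-the-median spread comparison are assumptions made for illustration, not a formal test of constant variance.

```r
set.seed(42)  # arbitrary seed so the simulation is reproducible

# Simulated data with constant-variance, normally distributed errors
x <- runif(150, min = 0, max = 100)
y <- 60 + 0.2 * x + rnorm(150, sd = 4)
sim_model <- lm(y ~ x)

# Draw one diagnostic panel at a time: 1 = Residuals vs Fitted, 2 = Normal Q-Q
plot(sim_model, which = 1)
plot(sim_model, which = 2)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw <- shapiro.test(resid(sim_model))
sw$p.value

# Rough spread check: residual SD in the upper vs lower half of fitted values.
# A ratio far from 1 hints at a funnel shape.
lower_half   <- fitted(sim_model) <= median(fitted(sim_model))
spread_ratio <- sd(resid(sim_model)[!lower_half]) / sd(resid(sim_model)[lower_half])
spread_ratio
```

Swapping `sim_model` for `simple_model` would apply the same checks to the life expectancy model.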

## 5 Diagnosing Model Fit (RMSE and Residuals)

For the multiple regression model from Question 2 (LifeExpectancy ~ GDP + Health + Internet), calculate the RMSE and explain what it represents in the context of predicting life expectancy. How would large residuals for certain countries (e.g., those with unusually high or low life expectancy) affect your confidence in the model’s predictions, and what might you investigate further?

residuals_multiple <- resid(multiple_model)
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 4.056417
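
The RMSE is slightly smaller than the residual standard error of 4.104 reported by summary(multiple_model). The likely reason is the denominator: the residual standard error divides the residual sum of squares by the residual degrees of freedom (169), while the RMSE as computed here divides by the number of observations. The sketch below checks that relationship using only numbers copied from the printed output, so it matches the RMSE only up to rounding.

```r
# Values copied from summary(multiple_model)
rse <- 4.104   # residual standard error
df  <- 169     # residual degrees of freedom
n   <- df + 4  # observations used: df plus 4 estimated coefficients

# RSE divides the residual sum of squares by df; RMSE divides it by n
rss  <- rse^2 * df
rmse <- sqrt(rss / n)
rmse
```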

The RMSE of about 4.06 years is the typical distance between the model’s predicted life expectancy and the actual value. It is measured in years, the same unit as the response, so it can be read directly, and a lower RMSE means more accurate predictions. If certain countries have large residuals, meaning the model predicts their life expectancy poorly, confidence in the model’s predictions for similar countries drops. Large errors might indicate outliers, missing predictors, or country-specific conditions that the model does not capture, and those countries would need further investigation.
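
One way to find the countries worth investigating is to rank rows by absolute residual. The sketch below uses a small made-up data frame rather than AllCountries, so every value and name in it (`toy`, `toy_model`, the single-letter countries) is hypothetical. It passes `na.action = na.exclude` so that the residual vector stays aligned with the original rows even when a row is dropped for missing data.

```r
# Toy data standing in for AllCountries; all values and names are made up
toy <- data.frame(
  Country        = c("A", "B", "C", "D", "E", "F", "G", "H"),
  LifeExpectancy = c(55, 62, 68, 71, 74, 48, 79, 81),
  Internet       = c(5, 15, 30, 45, 60, 50, NA, 90)
)

# na.exclude pads residuals with NA for dropped rows, keeping row alignment
toy_model    <- lm(LifeExpectancy ~ Internet, data = toy, na.action = na.exclude)
toy$residual <- resid(toy_model)

# Rank rows by absolute residual, largest first (NA rows sort to the end)
ranked <- toy[order(abs(toy$residual), decreasing = TRUE), ]
head(ranked, 3)
```

With the default `na.omit`, resid() returns only the rows used in the fit, so assigning it back to the full data frame would fail or misalign.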

## 6 Hypothetical Example (Multicollinearity in Multiple Regression)

Suppose you are analyzing the AllCountries dataset and fit a multiple linear regression model to predict CO2 emissions (metric tons per capita) using Energy (kilotons of oil equivalent) and Electricity (kWh per capita) as predictors. You notice that Energy and Electricity are highly correlated. Explain how this multicollinearity might affect the interpretation of the regression coefficients and the reliability of the model.

If Energy and Electricity are highly correlated, the model has multicollinearity. This means the two predictors contain very similar information, making it difficult for the regression to separate their individual effects on CO₂ emissions. As a result, the coefficients may become unstable, change signs unexpectedly, or have large standard errors. While the model may still predict CO₂ emissions accurately overall, the reliability of the individual coefficients is reduced, and it becomes harder to determine which predictor truly explains the variation in emissions.
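
A small simulation can illustrate the effect on standard errors. The data below are generated, not taken from AllCountries: the variable names mirror the question, but the coefficients, sample size, and noise levels are arbitrary assumptions. For a model with exactly two predictors, the variance inflation factor reduces to 1 / (1 - r²), where r is the correlation between them.

```r
set.seed(101)  # arbitrary seed so the simulation is reproducible
n <- 200

# Simulated stand-ins for the real columns; values are not from AllCountries
Energy      <- rnorm(n, mean = 50, sd = 10)
Electricity <- 2 * Energy + rnorm(n, sd = 2)   # nearly a linear function of Energy
CO2         <- 0.1 * Energy + 0.05 * Electricity + rnorm(n)

# Pairwise correlation between the two predictors
r <- cor(Energy, Electricity)

# Variance inflation factor for a two-predictor model
vif <- 1 / (1 - r^2)

# Standard error of the Energy coefficient with and without Electricity
se_both  <- summary(lm(CO2 ~ Energy + Electricity))$coefficients["Energy", "Std. Error"]
se_alone <- summary(lm(CO2 ~ Energy))$coefficients["Energy", "Std. Error"]

c(correlation = r, VIF = vif, se_both = se_both, se_alone = se_alone)
```

The Energy coefficient’s standard error is much larger when Electricity is also in the model, which is the instability described above. Common responses include dropping one of the two predictors or combining them into a single measure.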