HW 9

Author

Sajutee Mukrabine

1- Upload the data set as a csv

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("C:/Users/sajut/OneDrive/Desktop/DATA_101")
AllCountries <- read_csv("AllCountries.csv")
Rows: 217 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Country, Code
dbl (24): LandArea, Population, Density, GDP, Rural, CO2, PumpPrice, Militar...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2- Simple Linear Regression (Fitting and Interpretation): Using the AllCountries dataset, fit a simple linear regression model to predict LifeExpectancy (average life expectancy in years) based on GDP (gross domestic product per capita in $US). Report the intercept and slope coefficients and interpret their meaning in the context of the dataset. What does the R² value tell you about how well GDP explains variation in life expectancy across countries?

simple_model <- lm(LifeExpectancy ~ GDP, data = AllCountries)

summary(simple_model)

Call:
lm(formula = LifeExpectancy ~ GDP, data = AllCountries)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.352  -3.882   1.550   4.458   9.330 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.901 on 177 degrees of freedom
  (38 observations deleted due to missingness)
Multiple R-squared:  0.4304,    Adjusted R-squared:  0.4272 
F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

The intercept is 68.42: a country with a GDP per capita of $0 is predicted to have a life expectancy of about 68.42 years. No country actually has a GDP of $0, so the intercept is best read as the baseline of the fitted line rather than a realistic prediction. The slope is 0.0002476: for every $1 increase in GDP per capita, predicted life expectancy increases by about 0.0002476 years, or roughly 0.25 years for every $1,000. The R² of 0.4304 means that GDP explains about 43% of the variation in life expectancy across countries. GDP is therefore a helpful predictor, but more than half of the variation is due to other factors.
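As a quick check on the slope interpretation, the fitted line can be evaluated by hand from the rounded coefficients in the summary output. This is only a sketch: `predict_life_exp` is a hypothetical helper name, and the GDP values are illustrative inputs rather than rows from the dataset.

```r
# Rounded coefficients copied from summary(simple_model)
intercept <- 68.42
slope     <- 2.476e-04   # years of life expectancy per $1 of GDP per capita

# Hypothetical helper: predicted life expectancy for a given GDP per capita
predict_life_exp <- function(gdp) {
  intercept + slope * gdp
}

# Illustrative GDP values (not taken from the data)
predict_life_exp(c(0, 1000, 10000))

# Change in predicted life expectancy per $1,000 of GDP per capita
slope * 1000
```

With the fitted object available, `predict(simple_model, newdata = data.frame(GDP = 10000))` should give essentially the same value, up to rounding of the coefficients.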

3- Multiple Linear Regression (Fitting and Interpretation) Fit a multiple linear regression model to predict LifeExpectancy using GDP, Health (percentage of government expenditures on healthcare), and Internet (percentage of population with internet access) as predictors. Interpret the coefficient for Health, explaining what it means in terms of life expectancy while controlling for GDP and Internet. How does the adjusted R² compare to the simple regression model from Question 1, and what does this suggest about the additional predictors?

multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)

summary(multiple_model)

Call:
lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.5662  -1.8227   0.4108   2.5422   9.4161 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
GDP         2.367e-05  2.287e-05   1.035 0.302025    
Health      2.479e-01  6.619e-02   3.745 0.000247 ***
Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.104 on 169 degrees of freedom
  (44 observations deleted due to missingness)
Multiple R-squared:  0.7213,    Adjusted R-squared:  0.7164 
F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

The coefficient for Health is 0.248, which indicates that a 1 percentage point increase in the share of government expenditures going to healthcare is associated with an increase in predicted life expectancy of about 0.25 years, holding GDP and Internet access constant. The adjusted R² of the multiple regression (0.7164) is greater than that of the simple model (0.4272), which suggests that including Health and Internet improves the model and explains more of the variation in life expectancy across countries.
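The adjusted R² values being compared can be reproduced from the printed output, which shows how the penalty for extra predictors works. The sketch below assumes the usual formula 1 - (1 - R²)(n - 1)/(n - p - 1), with n recovered as the residual degrees of freedom plus the number of estimated coefficients; `adj_r2` is a hypothetical helper name.

```r
# Hypothetical helper: adjusted R-squared from R-squared,
# sample size n, and number of predictors p (excluding the intercept)
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# n = residual degrees of freedom + number of estimated coefficients
n_simple   <- 177 + 2   # intercept + GDP
n_multiple <- 169 + 4   # intercept + GDP + Health + Internet

adj_simple   <- adj_r2(0.4304, n_simple, p = 1)
adj_multiple <- adj_r2(0.7213, n_multiple, p = 3)

round(c(simple = adj_simple, multiple = adj_multiple), 4)
```

The two models are fit on slightly different sets of countries (179 versus 173 rows after rows with missing values are dropped), so the comparison is approximate rather than exact.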

4- Checking Assumptions (Homoscedasticity and Normality) For the simple linear regression model from Question 1 (LifeExpectancy ~ GDP), describe how you would check the assumptions of homoscedasticity and normality of residuals. For each assumption, explain what an ideal outcome would look like and what a violation might indicate about the model’s reliability for predicting life expectancy. Afterwards, code your answer and reflect if it matched the ideal outcome.

To check homoscedasticity, we look at the Residuals vs Fitted plot. Ideally, the points should be spread out evenly with no pattern. If we see a funnel shape or curve, it means the model’s errors are uneven, which can make predictions less reliable.

To check normality, we look at the Q–Q plot. Ideally, the points should follow the straight line. If they bend away from the line, the residuals are not normal, which can affect the accuracy of the model.

par(mfrow = c(2, 2))   # 2 x 2 grid for the four diagnostic plots
plot(simple_model)
par(mfrow = c(1, 1))   # restore the single-plot layout

Based on the diagnostic plots, the residuals appeared to be spread fairly evenly across the fitted values, indicating that the assumption of homoscedasticity was reasonably satisfied. In the Q–Q plot, most points followed the line, so the residuals were close to normal. Overall, the results were fairly close to the ideal patterns, indicating that the simple regression model is reasonably reliable for predicting life expectancy from GDP.
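Visual checks can be backed up with numeric tests. The sketch below is one possible approach, not part of the assignment: it runs a Shapiro-Wilk test for normality and a hand-computed studentized Breusch-Pagan statistic for constant variance (n times the R² from regressing the squared residuals on the predictors). `check_residuals` is a hypothetical helper name, and the example uses R's built-in `cars` data as a stand-in, so the printed values say nothing about AllCountries.

```r
# Hypothetical helper: numeric companions to the diagnostic plots
check_residuals <- function(model) {
  e  <- resid(model)
  e2 <- e^2
  # Predictors only (drop the intercept column of the design matrix)
  Z  <- model.matrix(model)[, -1, drop = FALSE]

  # Normality: Shapiro-Wilk test on the residuals
  sw <- shapiro.test(e)

  # Constant variance: regress squared residuals on the predictors;
  # n * R^2 is compared with a chi-square distribution
  aux     <- lm(e2 ~ Z)
  bp_stat <- length(e) * summary(aux)$r.squared
  bp_p    <- pchisq(bp_stat, df = ncol(Z), lower.tail = FALSE)

  list(shapiro_p = sw$p.value, bp_stat = bp_stat, bp_p = bp_p)
}

# Stand-in example on R's built-in `cars` data (not AllCountries)
demo_model <- lm(dist ~ speed, data = cars)
check_residuals(demo_model)
```

Calling `check_residuals(simple_model)` would apply the same checks to the life expectancy model. Small p-values would suggest a departure from normality or constant variance, but with a couple of hundred countries the plots remain the more informative guide.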

5- Diagnosing Model Fit (RMSE and Residuals) For the multiple regression model from Question 2 (LifeExpectancy ~ GDP + Health + Internet), calculate the RMSE and explain what it represents in the context of predicting life expectancy. How would large residuals for certain countries (e.g., those with unusually high or low life expectancy) affect your confidence in the model’s predictions, and what might you investigate further?

residuals_multiple <- resid(multiple_model)

rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
[1] 4.056417

The RMSE of 4.06 means that the model’s predictions of life expectancy are typically off by about 4 years. If some countries have very large differences between predicted and actual life expectancy, the model is less reliable for them. These countries might be unusual because of particular circumstances, such as poor health care, wars, or very good health systems. It would be worth investigating these countries further, or adding more predictors, to improve the model.
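The RMSE differs slightly from the residual standard error of 4.104 in the model summary because RMSE divides the residual sum of squares by n, while the residual standard error divides by the residual degrees of freedom. A minimal sketch of that relationship, using only the rounded values printed in the output (so the result matches the computed RMSE only to about three decimal places):

```r
# Values copied from summary(multiple_model)
rse      <- 4.104        # residual standard error
df_resid <- 169          # residual degrees of freedom
n        <- df_resid + 4 # 4 estimated coefficients, so n = 173

# Residual sum of squares implied by the residual standard error
rss <- rse^2 * df_resid

# RMSE divides by n instead of the residual degrees of freedom
rmse_from_rse <- sqrt(rss / n)
rmse_from_rse
```

Because n is larger than the residual degrees of freedom, the RMSE is always a little smaller than the residual standard error for the same model.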

6- Hypothetical Example (Multicollinearity in Multiple Regression) Suppose you are analyzing the AllCountries dataset and fit a multiple linear regression model to predict CO2 emissions (metric tons per capita) using Energy (kilotons of oil equivalent) and Electricity (kWh per capita) as predictors. You notice that Energy and Electricity are highly correlated. Explain how this multicollinearity might affect the interpretation of the regression coefficients and the reliability of the model.

If Energy and Electricity are highly correlated, it is hard to know how much each one affects CO2 emissions on its own. The coefficients may be unstable, with inflated standard errors, so the p-values may not show significance even if the variables matter. The model can still predict CO2 fairly well, but the interpretation of each individual coefficient is unreliable. One solution is to drop one of the correlated variables or combine them into a single predictor.
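One common way to quantify this is the variance inflation factor (VIF). With exactly two predictors, the VIF for each one equals 1 / (1 - r²), where r is the correlation between them. The sketch below uses hypothetical correlation values, since the actual correlation between Energy and Electricity is not computed here; `vif_two` is an illustrative helper name.

```r
# Hypothetical helper: VIF for a model with exactly two predictors
# whose correlation is r
vif_two <- function(r) {
  1 / (1 - r^2)
}

# Hypothetical correlations (not estimated from AllCountries)
r_values <- c(0, 0.5, 0.9, 0.95)

data.frame(
  correlation  = r_values,
  vif          = vif_two(r_values),
  se_inflation = sqrt(vif_two(r_values))  # factor by which standard errors grow
)
```

At a correlation of 0.95 the VIF is a little over 10, meaning the standard errors of the two coefficients are roughly three times larger than they would be with uncorrelated predictors.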