HW_9

1. Import the dataset

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

countries <- read_csv("C:/DATA101/AllCountries.csv")

## Rows: 217 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): Country, Code
## dbl (24): LandArea, Population, Density, GDP, Rural, CO2, PumpPrice, Militar...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Simple Linear Regression (Fitting and Interpretation):

Using the AllCountries dataset, fit a simple linear regression model to predict LifeExpectancy (average life expectancy in years) based on GDP (gross domestic product per capita in $US). Report the intercept and slope coefficients and interpret their meaning in the context of the dataset. What does the R² value tell you about how well GDP explains variation in life expectancy across countries?

simple_model <- lm(LifeExpectancy ~ GDP, data = countries)
simple_model

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
## 
## Coefficients:
## (Intercept)          GDP  
##   6.842e+01    2.476e-04

summary(simple_model)

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

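Intercept: The intercept is roughly 68.42, which is the predicted LifeExpectancy (in years) when GDP is 0. A GDP of 0 is outside the range of realistic values, so this is an extrapolation rather than a meaningful prediction. Slope coefficient: The slope is roughly 0.0002476, which means that for every $1 increase in GDP per capita, predicted life expectancy increases by about 0.0002476 years (roughly 0.25 years per $1,000). R²: The R² value is 0.4304, which means roughly 43.04% of the variation in life expectancy across countries is explained by GDP (the adjusted R² is 0.4272).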
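Because the slope is so small per single dollar, it can help to rescale it. The sketch below simply reuses the rounded coefficients printed in the summary above; the $10,000 GDP value is a hypothetical example, not a country from the dataset.

```r
# Rounded coefficients copied from the summary(simple_model) output above
intercept <- 68.42
slope     <- 2.476e-04   # years of life expectancy per $1 of GDP per capita

# Rescale the slope to a more readable unit: years per $1,000
slope_per_1000 <- slope * 1000
slope_per_1000            # about 0.25 years per $1,000

# Predicted life expectancy for a hypothetical GDP per capita of $10,000
pred_10000 <- intercept + slope * 10000
pred_10000                # about 70.9 years
```

With the fitted model object available, `predict(simple_model, newdata = data.frame(GDP = 10000))` should give essentially the same number, up to rounding of the coefficients.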

3. Multiple Linear Regression (Fitting and Interpretation):

Fit a multiple linear regression model to predict LifeExpectancy using GDP, Health (percentage of government expenditures on healthcare), and Internet (percentage of population with internet access) as predictors. Interpret the coefficient for Health, explaining what it means in terms of life expectancy while controlling for GDP and Internet. How does the adjusted R² compare to the simple regression model from Question 1, and what does this suggest about the additional predictors?

multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = countries)

summary(multiple_model)

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = countries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

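Health coefficient: The Health coefficient is roughly 0.2479, which is positive: holding GDP and Internet constant, each 1 percentage point increase in Health is associated with an increase of about 0.2479 years in predicted life expectancy. Adjusted R²: The adjusted R² is now 0.7164, much greater than the simple model's 0.4272, which suggests that Health and Internet add real explanatory power beyond GDP alone. Together the three predictors explain roughly 72% of the variance in life expectancy (multiple R² = 0.7213). Note that GDP is no longer statistically significant (p = 0.302) once Health and Internet are in the model.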
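As a sanity check on the comparison, the adjusted R² values can be reproduced from the multiple R² values using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). This is a minimal sketch: the sample sizes are inferred from the residual degrees of freedom in the two summaries (177 + 2 = 179 and 169 + 4 = 173), and `adj_r2` is a helper name introduced here for illustration.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Simple model: R^2 = 0.4304, n = 179 countries used, p = 1 predictor
adj_simple <- adj_r2(0.4304, n = 179, p = 1)
adj_simple      # about 0.4272

# Multiple model: R^2 = 0.7213, n = 173 countries used, p = 3 predictors
adj_multiple <- adj_r2(0.7213, n = 173, p = 3)
adj_multiple    # about 0.7164
```

The two models were fit on slightly different sets of countries (179 vs. 173) because of missing values, so the comparison is close but not exactly like-for-like.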

4. Checking Assumptions (Homoscedasticity and Normality):

For the simple linear regression model from Question 1 (LifeExpectancy ~ GDP), describe how you would check the assumptions of homoscedasticity and normality of residuals. For each assumption, explain what an ideal outcome would look like and what a violation might indicate about the model’s reliability for predicting life expectancy. Afterwards, code your answer and reflect on whether it matched the ideal outcome.

My answer: I would check both assumptions by plotting the model's residuals. For homoscedasticity, I would use the residuals vs. fitted plot and look for fanning or any sharp changes in spread. Ideally there would be little curvature and the residuals would be scattered evenly around the zero line. A funnel shape, where the spread changes as the fitted values increase or decrease, would be a violation, which would mean the standard errors and my statistical conclusions would not be as reliable. For normality, I would use the normal Q-Q plot: ideally the points fall close to the diagonal line, and strong departures in the tails would make the p-values and confidence intervals less trustworthy.

par(mfrow = c(2, 2))  # 2x2 grid for the four diagnostic plots
plot(simple_model)
par(mfrow = c(1, 1))  # restore the single-plot layout

The outcome is close to ideal, with a mostly flat cloud. Most of the residuals are scattered near the zero line, although some veer away from it.
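The two assumptions can also be looked at one plot at a time, with a numeric check alongside the visual one. The sketch below uses simulated data (an assumption, so that it runs without AllCountries.csv); `sim_model`, `x`, and `y` are hypothetical names, and the same calls could be applied to `simple_model` instead.

```r
# Simulated stand-in data; values are made up for illustration only
set.seed(101)
n <- 200
x <- runif(n, min = 500, max = 60000)
y <- 68 + 0.00025 * x + rnorm(n, sd = 5)
sim_model <- lm(y ~ x)

# Homoscedasticity: residuals vs. fitted values (look for a funnel shape)
plot(sim_model, which = 1)

# Normality: normal Q-Q plot of the residuals (points should follow the line)
plot(sim_model, which = 2)

# Shapiro-Wilk test as a rough numeric companion to the Q-Q plot;
# a small p-value suggests the residuals are not normally distributed
shapiro_result <- shapiro.test(resid(sim_model))
shapiro_result$p.value
```

The plots remain the main evidence; with larger samples the Shapiro-Wilk test can flag departures from normality that are too small to matter in practice.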

5. Diagnosing Model Fit (RMSE and Residuals):

For the multiple regression model from Question 2 (LifeExpectancy ~ GDP + Health + Internet), calculate the RMSE and explain what it represents in the context of predicting life expectancy.

How would large residuals for certain countries (e.g., those with unusually high or low life expectancy) affect your confidence in the model’s predictions, and what might you investigate further?

# Calculate residuals
residuals_multiple <- resid(multiple_model)

# Calculate RMSE for multiple model
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple

## [1] 4.056417

Multiple model: RMSE = 4.06 years, meaning the model's predictions typically miss the actual life expectancy by about 4.1 years. Large residuals for certain countries would mean the model fits those countries poorly, which would lower my confidence in its predictions for similar countries. I would investigate which countries have the largest residuals and check whether they reflect data errors or factors missing from the model before deciding whether to remove them.
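The RMSE is slightly smaller than the residual standard error (4.104) reported by `summary(multiple_model)` because the RMSE divides the squared residuals by n while the residual standard error divides by the residual degrees of freedom. A quick check using only the numbers printed above, with n = 173 inferred from 169 degrees of freedom plus 4 coefficients:

```r
# Values copied from the summary(multiple_model) output above
rse <- 4.104   # residual standard error
df  <- 169     # residual degrees of freedom
n   <- 173     # observations used: df + 4 estimated coefficients

# RSE = sqrt(SSE / df) and RMSE = sqrt(SSE / n), so RMSE = RSE * sqrt(df / n)
rmse_check <- rse * sqrt(df / n)
rmse_check     # about 4.056, matching rmse_multiple up to rounding
```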

6. Hypothetical Example (Multicollinearity in Multiple Regression):

Suppose you are analyzing the AllCountries dataset and fit a multiple linear regression model to predict CO2 emissions (metric tons per capita) using Energy (kilotons of oil equivalent) and Electricity (kWh per capita) as predictors. You notice that Energy and Electricity are highly correlated. Explain how this multicollinearity might affect the interpretation of the regression coefficients and the reliability of the model.

The model would be less reliable because multicollinearity makes it hard to separate the individual effects of Energy and Electricity. It inflates the standard errors of their coefficients, which reduces the precision of the estimated change in CO2 emissions for each predictor and can make the coefficients unstable, even changing sign from one sample to another. The model's overall predictions may still be reasonable, but the individual coefficients should be interpreted with caution.
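One common way to quantify this is the variance inflation factor (VIF), computed for a predictor as 1 / (1 - R²), where R² comes from regressing that predictor on the others. The sketch below uses simulated data, not the AllCountries values; the variable names and numbers are hypothetical and chosen only so that the two predictors are strongly correlated.

```r
# Simulated, strongly correlated predictors (illustrative values only)
set.seed(42)
n <- 150
energy      <- rnorm(n, mean = 100, sd = 20)
electricity <- 30 * energy + rnorm(n, sd = 60)
co2         <- 0.05 * energy + 0.001 * electricity + rnorm(n)

# Correlation between the two predictors
cor_pred <- cor(energy, electricity)
cor_pred

# VIF for energy: regress it on the other predictor, then 1 / (1 - R^2)
r2_energy  <- summary(lm(energy ~ electricity))$r.squared
vif_energy <- 1 / (1 - r2_energy)
vif_energy

# Coefficient table for the full model: note the inflated standard errors
fit <- lm(co2 ~ energy + electricity)
summary(fit)$coefficients
```

A VIF above roughly 5 to 10 is a commonly used rule of thumb for problematic multicollinearity. Typical remedies are dropping one of the correlated predictors or combining them into a single measure.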