library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("C:/Users/njnav/OneDrive/Data 101/Week 9/Import")
allcountries <- read_csv("AllCountries.csv")
## Rows: 217 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Country, Code
## dbl (24): LandArea, Population, Density, GDP, Rural, CO2, PumpPrice, Militar...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Simple Linear Regression (Fitting and Interpretation): Using the AllCountries dataset, fit a simple linear regression model to predict LifeExpectancy (average life expectancy in years) based on GDP (gross domestic product per capita in $US). Report the intercept and slope coefficients and interpret their meaning in the context of the dataset. What does the R² value tell you about how well GDP explains variation in life expectancy across countries?
simple_model <- lm(LifeExpectancy ~ GDP, data = allcountries)
simple_model
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = allcountries)
##
## Coefficients:
## (Intercept) GDP
## 6.842e+01 2.476e-04
summary(simple_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = allcountries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.352 -3.882 1.550 4.458 9.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
## GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
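The intercept is 68.42, the predicted life expectancy in years for a country with a GDP per capita of $0. No country has a GDP of zero, so the intercept mainly anchors the line rather than describing a real case. The slope for GDP is 0.0002476, which means that each additional $1 of GDP per capita is associated with an increase of about 0.0002476 years in predicted life expectancy. R² = 0.4304, so GDP alone explains about 43% of the variance in life expectancy across countries.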
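Because GDP per capita is measured in single dollars, the slope is easier to read after rescaling. A minimal sketch using the coefficients reported above; the variable names, the $1,000 and $10,000 steps, and the $20,000 example country are chosen only for illustration:

```r
# Coefficients copied from the summary(simple_model) output above
intercept <- 68.42
slope_per_dollar <- 2.476e-04

# Rescale the slope to more interpretable GDP steps
slope_per_1000  <- slope_per_dollar * 1000    # years per $1,000 of GDP per capita
slope_per_10000 <- slope_per_dollar * 10000   # years per $10,000 of GDP per capita

# Predicted life expectancy at a hypothetical GDP per capita of $20,000
predicted_at_20000 <- intercept + slope_per_dollar * 20000

c(per_1000 = slope_per_1000, per_10000 = slope_per_10000, at_20000 = predicted_at_20000)
```

Read this way, a $10,000 difference in GDP per capita corresponds to roughly 2.5 years of predicted life expectancy under this model.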
Multiple Linear Regression (Fitting and Interpretation): Fit a multiple linear regression model to predict LifeExpectancy using GDP, Health (percentage of government expenditures on healthcare), and Internet (percentage of population with internet access) as predictors. Interpret the coefficient for Health, explaining what it means in terms of life expectancy while controlling for GDP and Internet. How does the adjusted R² compare to the simple regression model from Question 1, and what does this suggest about the additional predictors?
multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = allcountries)
multiple_model
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = allcountries)
##
## Coefficients:
## (Intercept) GDP Health Internet
## 5.908e+01 2.367e-05 2.479e-01 1.903e-01
summary(multiple_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = allcountries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5662 -1.8227 0.4108 2.5422 9.4161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
## GDP 2.367e-05 2.287e-05 1.035 0.302025
## Health 2.479e-01 6.619e-02 3.745 0.000247 ***
## Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
The coefficient for Health is 0.2479 and is positive, which means that each additional percentage point of government expenditures directed towards healthcare is associated with an increase of about 0.2479 years in predicted life expectancy, holding GDP and Internet constant. The adjusted R² is 0.7164, which is larger than the adjusted R² of 0.4272 from Question 1. This suggests that the additional predictors explain considerably more of the variance in life expectancy, about 72% compared with about 43%.
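One caveat when comparing the two models: the simple model dropped 38 rows for missingness and the multiple model dropped 44, so they were fit to slightly different sets of countries. A fairer comparison refits both on the same complete rows. The sketch below shows the idea on simulated data; the data frame `sim` and its generating coefficients are made up for illustration and are not the AllCountries values.

```r
set.seed(1)

# Simulated stand-in for the country data (hypothetical values, not AllCountries)
n <- 200
sim <- data.frame(
  GDP      = runif(n, 500, 60000),
  Health   = runif(n, 2, 25),
  Internet = runif(n, 1, 95)
)
sim$LifeExpectancy <- 60 + 0.0001 * sim$GDP + 0.2 * sim$Health +
  0.15 * sim$Internet + rnorm(n, sd = 4)

# Introduce missing values so the two models would otherwise use different rows
sim$Health[sample(n, 15)] <- NA
sim$Internet[sample(n, 10)] <- NA

# Keep only rows that are complete for every variable used by either model
vars <- c("LifeExpectancy", "GDP", "Health", "Internet")
complete_rows <- sim[complete.cases(sim[, vars]), ]

simple_fit   <- lm(LifeExpectancy ~ GDP, data = complete_rows)
multiple_fit <- lm(LifeExpectancy ~ GDP + Health + Internet, data = complete_rows)

# Both models now use the same observations, so adjusted R-squared is comparable
c(n_simple = nobs(simple_fit), n_multiple = nobs(multiple_fit))
c(adj_r2_simple   = summary(simple_fit)$adj.r.squared,
  adj_r2_multiple = summary(multiple_fit)$adj.r.squared)
```

Applied to the real data, the same pattern would replace `sim` with `allcountries`.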
Checking Assumptions (Homoscedasticity and Normality): For the simple linear regression model from Question 1 (LifeExpectancy ~ GDP), describe how you would check the assumptions of homoscedasticity and normality of residuals. For each assumption, explain what an ideal outcome would look like and what a violation might indicate about the model’s reliability for predicting life expectancy. Afterwards, code your answer and reflect if it matched the ideal outcome.
We check for homoscedasticity by examining the residuals vs fitted
plot for a constant variance of residuals around zero. Ideally, the
spread of the residuals is roughly the same across all fitted values,
with no funnel shape or other pattern. If the variance is unequal, the
homoscedasticity assumption isn’t met, which makes the standard errors
and significance tests less trustworthy.
Then we check normality by examining the Q-Q plot. Ideally, the
residuals are normally distributed and stay close to the reference line.
If they are not, the model may produce many large overestimates or
underestimates, and its p-values and intervals become less reliable.
par(mfrow=c(2,2)); plot(simple_model); par(mfrow=c(1,1))
Residuals vs Fitted: The smoothed line does not stay near zero. The residuals vary widely at low fitted values and trend downward at high fitted values, so they are not evenly spread around zero and homoscedasticity isn’t met.
Scale-Location: The spread is large at low fitted values, which indicates heteroscedasticity.
Q-Q Residuals: The residuals deviate from the line at both the lower and upper ends of the plot, showing that they are not exactly normally distributed.
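The visual readings above can be complemented with numeric checks. A minimal sketch on simulated data with deliberately non-constant variance: `shapiro.test()` from base R tests the normality of the residuals, and the heteroscedasticity check is a hand-rolled Breusch-Pagan style statistic (n times the R² from regressing the squared residuals on the predictor), written out here to avoid assuming an add-on package is installed. The data frame `sim` is hypothetical and is not the AllCountries data.

```r
set.seed(2)

# Simulated data whose error spread grows with x (hypothetical, not AllCountries)
n <- 150
sim <- data.frame(x = runif(n, 1, 10))
sim$y <- 5 + 2 * sim$x + rnorm(n, sd = 0.5 * sim$x)

fit <- lm(y ~ x, data = sim)
sim$res_sq <- resid(fit)^2

# Normality: Shapiro-Wilk test on the residuals (small p-value suggests non-normality)
normality_test <- shapiro.test(resid(fit))

# Homoscedasticity: regress squared residuals on the predictor;
# n * R-squared is approximately chi-squared with 1 df under constant variance
aux_fit <- lm(res_sq ~ x, data = sim)
bp_stat <- n * summary(aux_fit)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(shapiro_p = normality_test$p.value, bp_stat = bp_stat, bp_p = bp_p)
```

Small p-values from either test would point in the same direction as the plots: the corresponding assumption is doubtful.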
Diagnosing Model Fit (RMSE and Residuals): For the multiple regression model from Question 2 (LifeExpectancy ~ GDP + Health + Internet), calculate the RMSE and explain what it represents in the context of predicting life expectancy. How would large residuals for certain countries (e.g., those with unusually high or low life expectancy) affect your confidence in the model’s predictions, and what might you investigate further?
residuals_multiple <- resid(multiple_model)
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 4.056417
The RMSE is 4.056417, which means that the model’s predictions typically miss the observed life expectancy by about 4 years. Countries with large residuals are ones the model predicts poorly, which lowers our confidence in its predictions, particularly for countries with unusually high or low life expectancy. To investigate further, we could identify which countries have the largest residuals, check them for data errors or unusual circumstances, and try adding or removing predictors to see whether the fit changes substantially.
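The RMSE computed here divides the sum of squared residuals by n, while the residual standard error in `summary()` divides by the residual degrees of freedom, which is why the two values differ slightly (4.056 versus 4.104). A short check using the numbers reported above, assuming n equals the residual degrees of freedom plus the four estimated coefficients:

```r
# Values copied from summary(multiple_model)
rse         <- 4.104  # residual standard error
df_residual <- 169    # residual degrees of freedom
n_coef      <- 4      # intercept + GDP + Health + Internet

n_obs <- df_residual + n_coef  # observations actually used in the fit

# RSE^2 * df = RSS, so RMSE = sqrt(RSS / n) = RSE * sqrt(df / n)
rmse_from_rse <- rse * sqrt(df_residual / n_obs)
rmse_from_rse
```

The result agrees with the RMSE of about 4.056 reported above, up to rounding of the residual standard error.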
Hypothetical Example (Multicollinearity in Multiple Regression): Suppose you are analyzing the AllCountries dataset and fit a multiple linear regression model to predict CO2 emissions (metric tons per capita) using Energy (kilotons of oil equivalent) and Electricity (kWh per capita) as predictors. You notice that Energy and Electricity are highly correlated. Explain how this multicollinearity might affect the interpretation of the regression coefficients and the reliability of the model.
If Energy and Electricity are highly correlated, the model has strong multicollinearity. This inflates the standard errors of the coefficients, which makes the estimates unstable (they can change size or even sign with small changes in the data) and makes the individual p-values unreliable, even if the model as a whole fits well. Because the two predictors carry nearly the same information, it is difficult to tell which one is driving the predictions, so it would probably be best to remove one of them.
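The effect described above can be seen with a variance inflation factor (VIF), computed for a predictor as 1 / (1 - R²) from regressing it on the other predictors. A minimal sketch on simulated data; the lowercase variables `energy`, `electricity`, and `co2` are hypothetical stand-ins and are not taken from AllCountries:

```r
set.seed(3)

# Simulated predictors that are almost copies of each other (hypothetical data)
n <- 200
energy      <- rnorm(n)
electricity <- energy + rnorm(n, sd = 0.1)
co2         <- 1 + 0.5 * energy + 0.5 * electricity + rnorm(n)
sim <- data.frame(co2, energy, electricity)

# VIF for a predictor = 1 / (1 - R-squared from regressing it on the other predictors)
r2_energy  <- summary(lm(energy ~ electricity, data = sim))$r.squared
vif_energy <- 1 / (1 - r2_energy)

# Compare the standard error of energy with and without the redundant predictor
se_both <- summary(lm(co2 ~ energy + electricity, data = sim))$coefficients["energy", "Std. Error"]
se_one  <- summary(lm(co2 ~ energy, data = sim))$coefficients["energy", "Std. Error"]

c(vif_energy = vif_energy, se_both = se_both, se_one = se_one)
```

A large VIF together with a much larger standard error in the two-predictor model is the pattern that would justify dropping one of the correlated predictors.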