library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
All_Countries <- read.csv("C:/Users/nika/Downloads/R/csv/AllCountries.csv")

Assignment 1

  1. Simple Linear Regression (Fitting and Interpretation): Using the AllCountries dataset, fit a simple linear regression model to predict LifeExpectancy (average life expectancy in years) based on GDP (gross domestic product per capita in $US). Report the intercept and slope coefficients and interpret their meaning in the context of the dataset. What does the R² value tell you about how well GDP explains variation in life expectancy across countries?
simple_model <- lm(LifeExpectancy ~ GDP, data = All_Countries)
summary(simple_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = All_Countries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16
coefficients(simple_model)
##  (Intercept)          GDP 
## 6.842208e+01 2.476441e-04
plot(All_Countries$GDP, All_Countries$LifeExpectancy,
     xlab = "GDP", 
     ylab = "Life Expectancy", 
     main = "Life Expectancy vs GDP",
     pch = 16)
abline(simple_model, col = "red", lwd = 2)

The intercept is 68.42, which is the predicted average life expectancy (in years) for a country whose GDP per capita is zero. No country has zero GDP, so this is an extrapolated baseline rather than a realistic prediction. The slope is 2.476e-04, meaning that each one-dollar increase in GDP per capita is associated with an increase of 0.0002476 years in predicted life expectancy, or about 0.25 years per additional 1,000 dollars. The R² of 0.4304 means that about 43 percent of the variation in life expectancy across countries is explained by GDP. GDP is therefore a moderately strong predictor, but more than half of the variation is left unexplained.
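As a quick check on the scale of the slope, the reported coefficients can be plugged in directly. This is a minimal sketch that uses the estimates printed above; the GDP value of 10,000 is an arbitrary illustration, not a country from the dataset.

```r
# Coefficients copied from coefficients(simple_model) above
intercept <- 68.42208
slope     <- 2.476441e-04   # years per one-dollar increase in GDP per capita

# Rescale the slope to a more readable unit: years per 1,000 dollars
slope_per_1000 <- slope * 1000

# Predicted life expectancy for a hypothetical country with GDP = 10,000
gdp_example <- 10000
predicted   <- intercept + slope * gdp_example

round(slope_per_1000, 3)
round(predicted, 2)
```

With the real data loaded, `predict(simple_model, newdata = data.frame(GDP = 10000))` should give the same prediction.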

Assignment 2

  2. Multiple Linear Regression (Fitting and Interpretation): Fit a multiple linear regression model to predict LifeExpectancy using GDP, Health (percentage of government expenditures on healthcare), and Internet (percentage of population with internet access) as predictors. Interpret the coefficient for Health, explaining what it means in terms of life expectancy while controlling for GDP and Internet. How does the adjusted R² compare to the simple regression model from Question 1, and what does this suggest about the additional predictors?
# Keep only rows with complete data for the outcome and all three predictors
All_Countries <- All_Countries |>
  filter(!is.na(LifeExpectancy), !is.na(GDP), !is.na(Health), !is.na(Internet))

multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = All_Countries)
summary(multiple_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = All_Countries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16
coefficients(multiple_model)
##  (Intercept)          GDP       Health     Internet 
## 5.908027e+01 2.367169e-05 2.478764e-01 1.903116e-01
plot(All_Countries$LifeExpectancy, fitted(multiple_model),
     xlab = "Observed Life Expectancy",
     ylab = "Predicted Life Expectancy",
     main = "Multiple Regression: Observed vs Predicted",
     pch = 16)
abline(0, 1, col = "red", lwd = 2)

The coefficient for Health is 0.2479, indicating that each 1 percentage point increase in the share of government expenditure spent on healthcare is associated with a 0.248-year increase in predicted life expectancy, holding GDP and Internet constant. The adjusted R² is 0.7164, compared with 0.4272 for the simple regression. This is a substantial improvement, suggesting that Health and Internet explain considerable variation in life expectancy beyond GDP alone. GDP itself is no longer statistically significant once Health and Internet are included (p = 0.302). The two models were fit on slightly different samples (179 and 173 countries), so the comparison is approximate.
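The adjusted R² values being compared can be reproduced from the printed R² and degrees of freedom. The sketch below assumes the usual formula, adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), and takes the rounded inputs from the two summaries, so small rounding differences are expected.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# n is recovered from the residual degrees of freedom: n = df + p + 1
adj_simple   <- adj_r2(r2 = 0.4304, n = 177 + 2, p = 1)
adj_multiple <- adj_r2(r2 = 0.7213, n = 169 + 4, p = 3)

round(c(simple = adj_simple, multiple = adj_multiple), 4)
```

Because the penalty term grows with the number of predictors, the gain in adjusted R² here reflects a real improvement in fit rather than just the addition of extra variables.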

Assignment 3

  3. Checking Assumptions (Homoscedasticity and Normality): For the simple linear regression model from Question 1 (LifeExpectancy ~ GDP), describe how you would check the assumptions of homoscedasticity and normality of residuals. For each assumption, explain what an ideal outcome would look like and what a violation might indicate about the model’s reliability for predicting life expectancy. Afterward, code your answer and reflect on whether it matched the ideal outcome.

To check homoscedasticity, I would plot the residuals against the fitted values and also look at the scale-location plot. Ideally, the residuals are randomly scattered around the horizontal line at zero with roughly constant spread and no clear pattern, such as a funnel shape. If homoscedasticity is violated, the model's standard errors are unreliable, so the p-values, hypothesis tests, and confidence intervals may be misleading. To check the normality of residuals, I would use a normal Q-Q plot. Ideally, the points fall approximately along the diagonal reference line with no severe departures in the tails. If normality is violated, the coefficient tests and the prediction intervals for life expectancy may be inaccurate, particularly in small samples.

par(mfrow=c(2,2)); plot(simple_model); par(mfrow=c(1,1))

The residuals vs fitted plot shows points that appear randomly scattered with no clear pattern, suggesting that the linearity and homoscedasticity assumptions are met. The normal Q-Q plot shows residuals closely following the diagonal line, indicating that the normality assumption is satisfied. The scale-location plot shows relatively constant variance across fitted values, which supports homoscedasticity. Finally, the residuals vs leverage plot does not reveal any highly influential outliers.
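The visual checks can be complemented with formal tests. The sketch below is illustrative only: it runs on simulated data (the variables `gdp`, `life`, and `toy_model` are hypothetical stand-ins, not AllCountries), uses the Shapiro-Wilk test for normality, and computes a studentized Breusch-Pagan statistic by hand with base R.

```r
set.seed(42)

# Simulated stand-in for the real data (hypothetical values, not AllCountries)
n    <- 150
gdp  <- runif(n, min = 500, max = 60000)
life <- 65 + 0.00025 * gdp + rnorm(n, sd = 5)
toy_model <- lm(life ~ gdp)

# Normality: Shapiro-Wilk test on the residuals
# (a small p-value suggests the residuals are not normal)
sw <- shapiro.test(resid(toy_model))

# Homoscedasticity: studentized Breusch-Pagan test computed by hand.
# Regress the squared residuals on the predictor; n * R^2 is approximately
# chi-squared with 1 degree of freedom under constant variance.
aux     <- lm(resid(toy_model)^2 ~ gdp)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(shapiro_p = sw$p.value, breusch_pagan_p = bp_p)
```

Replacing `toy_model` with `simple_model` would apply the same checks to the real fit. The car package also offers a related score test through `ncvTest()`.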

Assignment 4

  4. Diagnosing Model Fit (RMSE and Residuals): For the multiple regression model from Question 2 (LifeExpectancy ~ GDP + Health + Internet), calculate the RMSE and explain what it represents in the context of predicting life expectancy. How would large residuals for certain countries (e.g., those with unusually high or low life expectancy) affect your confidence in the model’s predictions, and what might you investigate further?
library(car)
## Warning: package 'car' was built under R version 4.5.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.5.2
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
crPlots(multiple_model)

plot(resid(multiple_model), type="b",
     main="Residuals vs Order", ylab="Residuals"); abline(h=0, lty=2)

residuals_simple <- resid(simple_model)
rmse_simple <- sqrt(mean(residuals_simple^2))
rmse_simple
## [1] 5.868172
residuals_multiple <- resid(multiple_model)
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 4.056417

The RMSE for the multiple regression model is 4.056 years, meaning that the model's predictions typically deviate from the observed life expectancy by about 4 years. This is lower than the RMSE of 5.868 years for the simple model.
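The RMSE is closely related to the residual standard error printed by `summary()`: both are based on the residual sum of squares, but the RMSE divides by n while the residual standard error divides by the residual degrees of freedom. As a rough consistency check, assuming the rounded values from the summaries above, the RMSE can be recovered as follows (`rmse_from_rse` is a helper introduced here for illustration).

```r
# Residual standard error and residual df copied from the summary() output
rse_simple   <- 5.901; df_simple   <- 177
rse_multiple <- 4.104; df_multiple <- 169

# RSS = RSE^2 * df, and n = df + number of estimated coefficients
rmse_from_rse <- function(rse, df, n_coef) {
  n <- df + n_coef
  sqrt(rse^2 * df / n)
}

rmse_check_simple   <- rmse_from_rse(rse_simple, df_simple, n_coef = 2)
rmse_check_multiple <- rmse_from_rse(rse_multiple, df_multiple, n_coef = 4)

round(c(simple = rmse_check_simple, multiple = rmse_check_multiple), 3)
```

The results agree with `rmse_simple` and `rmse_multiple` to about three decimal places; the small gap comes from rounding in the printed standard errors.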

Large residuals for certain countries indicate that the model substantially over-predicts or under-predicts for them. This reduces confidence in the model because:

  1. Model Incompleteness: Important factors affecting those specific countries are missing. For example, countries with unusually high life expectancy despite low GDP, Health, or Internet values may have excellent public health systems or healthy cultural practices. Countries with unusually low life expectancy may face armed conflicts or disease epidemics not captured by our predictors.
  2. Limited Generalizability: If the model fails for post-conflict nations or small island states, we cannot confidently use it to predict outcomes for similar countries.
  3. Systematic Bias: Large residuals may reveal blind spots. For instance, the model may consistently under-predict for certain regions, indicating missing regional factors.

For countries with large residuals, I would investigate:

  - Armed conflicts and political instability (explains low life expectancy despite resources)
  - Disease prevalence, such as HIV/AIDS or malaria (major health burdens not in our predictors)
  - Healthcare system quality and efficiency (spending doesn’t equal effectiveness)
  - Environmental factors like water quality and sanitation
  - Income inequality and education levels (GDP doesn’t capture distribution)
  - Data quality issues or measurement errors
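One way to find which countries to investigate is to rank observations by standardized residual. The sketch below runs on simulated data: the `toy` data frame, its `Country` column, and all values are hypothetical, and only the model formula mirrors the one used above.

```r
set.seed(1)

# Simulated stand-in data; the column names mirror the model above, but
# the Country column and all values are hypothetical
toy <- data.frame(
  Country  = paste("Country", 1:60),
  GDP      = runif(60, 500, 60000),
  Health   = runif(60, 3, 20),
  Internet = runif(60, 1, 95)
)
toy$LifeExpectancy <- 59 + 0.25 * toy$Health + 0.19 * toy$Internet +
  rnorm(60, sd = 4)

toy_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = toy)

# Attach raw and standardized residuals, then sort by absolute size
toy$residual     <- resid(toy_model)
toy$std_residual <- rstandard(toy_model)
worst <- toy[order(abs(toy$std_residual), decreasing = TRUE), ]

# The five observations the model fits worst
top5 <- head(worst[, c("Country", "LifeExpectancy", "residual", "std_residual")], 5)
top5
```

Standardized residuals beyond roughly ±2 are a common, informal flag for observations worth a closer look.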

Assignment 5

  5. Hypothetical Example (Multicollinearity in Multiple Regression): Suppose you are analyzing the AllCountries dataset and fit a multiple linear regression model to predict CO2 emissions (metric tons per capita) using Energy (kilotons of oil equivalent) and Electricity (kWh per capita) as predictors. You notice that Energy and Electricity are highly correlated. Explain how this multicollinearity might affect the interpretation of the regression coefficients and the reliability of the model.
cor(All_Countries[, c("Energy", "CO2", "Electricity")], use = "complete.obs")
##                Energy       CO2 Electricity
## Energy      1.0000000 0.8689607   0.8289231
## CO2         0.8689607 1.0000000   0.5216719
## Electricity 0.8289231 0.5216719   1.0000000

Multicollinearity between Energy and Electricity can lead to unstable coefficient estimates, inflated standard errors, and unreliable tests of the individual coefficients.

Energy-Electricity = 0.83, indicating strong correlation between the two predictors. Energy-CO2 = 0.87, indicating that Energy is a highly relevant predictor. Electricity-CO2 = 0.52, so Electricity is moderately correlated with CO2 and may also be a useful predictor.

Implications: The coefficient standard errors for Energy and Electricity may be inflated due to their high correlation (0.83). Individual p-values for these predictors might appear non-significant even though the overall model fits well and both variables are truly related to CO2. The coefficients will also be unstable: small changes in the data could dramatically change whether Energy or Electricity appears more important.

A reasonable next step would be to drop one predictor, keeping either Energy or Electricity but not both.
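To gauge how severe the inflation is, the variance inflation factor (VIF) can be computed from the correlation above. With exactly two predictors, each predictor's VIF equals 1 / (1 - r²), where r is their correlation. The sketch below relies on that two-predictor simplification and the printed correlation; it is not a fitted model.

```r
# Correlation between Energy and Electricity from the matrix above
r_energy_electricity <- 0.8289231

# With exactly two predictors, each one's VIF is 1 / (1 - r^2)
vif_two_predictors <- 1 / (1 - r_energy_electricity^2)

# Standard errors are inflated by roughly sqrt(VIF) relative to
# uncorrelated predictors
se_inflation <- sqrt(vif_two_predictors)

round(c(VIF = vif_two_predictors, SE_inflation = se_inflation), 2)
```

Common rules of thumb treat VIF values above 5 or 10 as a serious concern, so a value in this range would point to noticeable but not extreme inflation. For a fitted model, `car::vif()` reports the same quantity for each predictor.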