## Upload Dataset:

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("~/Downloads")
AllCountries <- read_csv("AllCountries (1).csv")
## Rows: 217 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): Country, Code
## dbl (24): LandArea, Population, Density, GDP, Rural, CO2, PumpPrice, Militar...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Preview the first rows of the loaded data
head(AllCountries)

2. Simple Linear Regression (Fitting and Interpretation):

Using the AllCountries dataset, fit a simple linear regression model to predict LifeExpectancy (average life expectancy in years) based on GDP (gross domestic product per capita in $US). Report the intercept and slope coefficients and interpret their meaning in the context of the dataset. What does the R² value tell you about how well GDP explains variation in life expectancy across countries?

model1 <- lm(LifeExpectancy ~ GDP, data = AllCountries)
summary(model1)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = AllCountries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

Interpretation:
  • The intercept (about 68.42) is the predicted life expectancy, in years, for a country with a GDP per capita of $0. No country has a GDP of zero, so this value is not practically meaningful; it is simply the y-intercept of the fitted line.
  • The GDP coefficient (about 0.0002476) means that each additional $1 of GDP per capita is associated with a predicted life expectancy about 0.0002476 years higher. Equivalently, each additional $1,000 is associated with about 0.25 more years.
  • Both the intercept and the GDP coefficient have p-values below 2e-16, well under 0.05, so both are statistically significant.
  • R² is about 0.43, so GDP alone explains about 43% of the variation in life expectancy across countries. That is a decent fit, but the remaining 57% reflects other factors that also influence life expectancy.
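The per-$1,000 reading of the slope can be checked by refitting with GDP expressed in thousands of dollars. The sketch below uses a made-up data frame (`toy`, with invented values) rather than AllCountries, so the numbers it prints are illustrative only; the point is that rescaling a predictor changes the units of the slope but not the fit.

```r
# Toy data standing in for AllCountries (values are invented for illustration)
set.seed(1)
toy <- data.frame(GDP = runif(50, 500, 60000))
toy$LifeExpectancy <- 68 + 0.00025 * toy$GDP + rnorm(50, sd = 5)

# Same model, with GDP in dollars and in thousands of dollars
fit_dollars   <- lm(LifeExpectancy ~ GDP, data = toy)
fit_thousands <- lm(LifeExpectancy ~ I(GDP / 1000), data = toy)

# The slope per $1,000 is exactly 1000 times the slope per $1
coef(fit_dollars)[["GDP"]] * 1000
coef(fit_thousands)[[2]]

# Rescaling leaves R-squared unchanged
summary(fit_dollars)$r.squared
summary(fit_thousands)$r.squared

# 95% confidence interval for the slope per $1,000
confint(fit_thousands)[2, ]
```

Applied to the real data, the same idea would be `lm(LifeExpectancy ~ I(GDP / 1000), data = AllCountries)`, which should report a slope of about 0.2476.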

3. Multiple Linear Regression (Fitting and Interpretation):

Fit a multiple linear regression model to predict LifeExpectancy using GDP, Health (percentage of government expenditures on healthcare), and Internet (percentage of population with internet access) as predictors. Interpret the coefficient for Health, explaining what it means in terms of life expectancy while controlling for GDP and Internet. How does the adjusted R² compare to the simple regression model from Question 1, and what does this suggest about the additional predictors?

model2 <- lm(LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
summary(model2)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16
Interpretation:
  • The Health coefficient (0.2479) means that each 1-percentage-point increase in the share of government expenditures going to healthcare is associated with a predicted life expectancy about 0.25 years higher, holding GDP and Internet constant.
  • Health is statistically significant (p = 0.000247), so it adds predictive information beyond GDP and Internet.
  • GDP is no longer significant (p ≈ 0.30) once Health and Internet are included.
  • Internet (0.1903) is significant (p < 2e-16): each additional percentage point of the population with internet access is associated with about 0.19 more years of life expectancy, holding GDP and Health constant.
  • Adjusted R² = 0.716, which is much higher than the simple model’s 0.427. Adding Health and Internet improves the model, which now explains roughly 72% of the variation in life expectancy.
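One caveat when comparing the two models: the simple model was fit on 179 countries (38 dropped for missingness) and the multiple model on 173 (44 dropped), so the two adjusted R² values come from slightly different samples. A minimal sketch of a like-for-like comparison follows. It uses an invented data frame (`toy`) with the same column names, so the printed values are illustrative only.

```r
# Toy data standing in for AllCountries (values are invented for illustration)
set.seed(2)
n <- 120
toy <- data.frame(
  GDP      = runif(n, 500, 60000),
  Health   = runif(n, 2, 25),
  Internet = runif(n, 1, 95)
)
toy$LifeExpectancy <- 59 + 0.00002 * toy$GDP + 0.25 * toy$Health +
  0.19 * toy$Internet + rnorm(n, sd = 4)

# Knock out some values to mimic the missingness in the real data
toy$Health[sample(n, 10)]  <- NA
toy$Internet[sample(n, 8)] <- NA

# Keep only rows that are complete for every variable used by either model
vars <- c("LifeExpectancy", "GDP", "Health", "Internet")
complete_rows <- toy[complete.cases(toy[, vars]), ]

simple_fit   <- lm(LifeExpectancy ~ GDP, data = complete_rows)
multiple_fit <- lm(LifeExpectancy ~ GDP + Health + Internet, data = complete_rows)

# Both models now use the same observations
c(n_simple = nobs(simple_fit), n_multiple = nobs(multiple_fit))

# Adjusted R-squared compared on the same sample
c(adj_r2_simple   = summary(simple_fit)$adj.r.squared,
  adj_r2_multiple = summary(multiple_fit)$adj.r.squared)
```

With only six countries differing between the two samples, the conclusion for AllCountries is unlikely to change, but fitting both models on the same rows removes the doubt.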

5. Diagnosing Model Fit (RMSE and Residuals):

For the multiple regression model from Question 2 (LifeExpectancy ~ GDP + Health + Internet), calculate the RMSE and explain what it represents in the context of predicting life expectancy. How would large residuals for certain countries (e.g., those with unusually high or low life expectancy) affect your confidence in the model’s predictions, and what might you investigate further?

model2 <- lm(LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)

residuals_model2 <- resid(model2)

rmse_model2 <- sqrt(mean(residuals_model2^2))
rmse_model2
## [1] 4.056417

Interpretation:
  • The multiple regression model has an RMSE of about 4.06 years, so its predictions typically miss a country's actual life expectancy by about 4 years.
  • Large residuals for certain countries would mean the model predicted their life expectancy poorly. These could be countries with unusually high or low life expectancy, or countries influenced by factors not included in the model.
  • Overall, the multiple model performs reasonably well, but large residuals would lower my confidence in predictions for those countries. I would look more closely at outliers and at missing predictors to understand why the model struggled in those cases.
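The RMSE of 4.06 is slightly smaller than the residual standard error of 4.104 reported by `summary(model2)` because the RMSE divides the sum of squared residuals by the number of observations, while the residual standard error divides by the residual degrees of freedom. The sketch below first checks this with the printed values, then shows on an invented data frame (`toy`, with hypothetical country labels) how the rows with the largest residuals could be listed for follow-up.

```r
# Check using the values printed in summary(model2)
rse      <- 4.104   # residual standard error
df_resid <- 169     # residual degrees of freedom
n_obs    <- 173     # 217 rows minus 44 dropped for missingness
rse * sqrt(df_resid / n_obs)   # approximately 4.06, matching rmse_model2 up to rounding

# Toy data (invented) to show the same relationship and a residual listing
set.seed(3)
toy <- data.frame(Country = paste0("Country_", 1:40), x = runif(40, 0, 100))
toy$y <- 60 + 0.2 * toy$x + rnorm(40, sd = 4)
fit <- lm(y ~ x, data = toy)

rmse <- sqrt(mean(resid(fit)^2))
all.equal(rmse, sigma(fit) * sqrt(df.residual(fit) / nobs(fit)))

# The five observations the model predicts worst, by absolute residual
toy$residual <- resid(fit)
worst <- toy[order(abs(toy$residual), decreasing = TRUE), ][1:5, c("Country", "residual")]
worst
```

For the real model, note that `resid(model2)` covers only the 173 rows used in the fit, so residuals need to be matched to countries through the model frame rather than attached directly to the full 217-row data.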

6. Hypothetical Example (Multicollinearity in Multiple Regression):

Suppose you are analyzing the AllCountries dataset and fit a multiple linear regression model to predict CO2 emissions (metric tons per capita) using Energy (kilotons of oil equivalent) and Electricity (kWh per capita) as predictors. You notice that Energy and Electricity are highly correlated. Explain how this multicollinearity might affect the interpretation of the regression coefficients and the reliability of the model.

model3 <- lm(CO2 ~ Energy + Electricity, data = AllCountries)
summary(model3)
## 
## Call:
## lm(formula = CO2 ~ Energy + Electricity, data = AllCountries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.7559  -1.1406  -0.2020   0.7143   7.3751 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.998e-01  2.655e-01   3.012  0.00311 ** 
## Energy       3.122e-03  1.066e-04  29.290  < 2e-16 ***
## Electricity -7.044e-04  5.526e-05 -12.747  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.331 on 131 degrees of freedom
##   (83 observations deleted due to missingness)
## Multiple R-squared:  0.899,  Adjusted R-squared:  0.8974 
## F-statistic: 582.8 on 2 and 131 DF,  p-value: < 2.2e-16
cor(AllCountries[, c("CO2", "Energy", "Electricity")], use = "complete.obs")
##                   CO2    Energy Electricity
## CO2         1.0000000 0.8795736   0.4871233
## Energy      0.8795736 1.0000000   0.7969352
## Electricity 0.4871233 0.7969352   1.0000000

Interpretation: Energy and Electricity are strongly correlated with each other (r ≈ 0.80), which signals multicollinearity between the two predictors.

Energy is also strongly correlated with CO₂ (0.88), while Electricity is only moderately correlated with CO₂ (0.49).

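Implications: Because the two predictors carry overlapping information, their coefficients may be unstable and harder to interpret, and their individual p-values may not fully reflect their true importance, even though the model as a whole fits well (R² ≈ 0.899). One visible symptom is the sign of Electricity: on its own it is positively correlated with CO₂ (0.49), yet its coefficient in the model is negative (-7.044e-04), because the coefficient describes the association with CO₂ while Energy is held constant.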

A reasonable next step would be to keep the stronger predictor (Energy), refit the model without Electricity, and check how much the fit changes.
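The degree of multicollinearity can be quantified with the variance inflation factor (VIF). When a model has exactly two predictors, the VIF of each equals 1 / (1 - r²), where r is the correlation between them. The sketch below applies that formula to the Energy–Electricity correlation printed above; the correlation matrix used `use = "complete.obs"` on the same three variables as the model, so it should describe the same rows the model was fit on.

```r
# Energy-Electricity correlation taken from the correlation matrix above
r <- 0.7969352

# VIF for either predictor in a two-predictor model
vif <- 1 / (1 - r^2)
vif          # about 2.74

# Factor by which each coefficient's standard error is inflated
sqrt(vif)    # about 1.66
```

A VIF near 2.7 sits below the thresholds of 5 or 10 that are often quoted as rules of thumb, which suggests the inflation here is real but moderate. That is consistent with both coefficients still having small standard errors in the summary output.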