1. Load the dataset from a CSV file

all_countries <- read.csv("AllCountries.csv")

head(all_countries)
##          Country Code LandArea Population Density   GDP Rural  CO2 PumpPrice
## 1    Afghanistan  AFG   652.86     37.172    56.9   521  74.5 0.29      0.70
## 2        Albania  ALB    27.40      2.866   104.6  5254  39.7 1.98      1.36
## 3        Algeria  DZA  2381.74     42.228    17.7  4279  27.4 3.74      0.28
## 4 American Samoa  ASM     0.20      0.055   277.3    NA  12.8   NA        NA
## 5        Andorra  AND     0.47      0.077   163.8 42030  11.9 5.83        NA
## 6         Angola  AGO  1246.70     30.810    24.7  3432  34.5 1.29      0.97
##   Military Health ArmedForces Internet  Cell HIV Hunger Diabetes BirthRate
## 1     3.72   2.01         323     11.4  67.4  NA   30.3      9.6      32.5
## 2     4.08   9.51           9     71.8 123.7 0.1    5.5     10.1      11.7
## 3    13.81  10.73         317     47.7 111.0 0.1    4.7      6.7      22.3
## 4       NA     NA          NA       NA    NA  NA     NA       NA        NA
## 5       NA  14.02          NA     98.9 104.4  NA     NA      8.0        NA
## 6     9.40   5.43         117     14.3  44.7 1.9   23.9      3.9      41.3
##   DeathRate ElderlyPop LifeExpectancy FemaleLabor Unemployment Energy
## 1       6.6        2.6           64.0        50.3          1.5     NA
## 2       7.5       13.6           78.5        55.9         13.9    808
## 3       4.8        6.4           76.3        16.4         12.1   1328
## 4        NA         NA             NA          NA           NA     NA
## 5        NA         NA             NA          NA           NA     NA
## 6       8.4        2.5           61.8        76.4          7.3    545
##   Electricity Developed
## 1          NA        NA
## 2        2309         1
## 3        1363         1
## 4          NA        NA
## 5          NA        NA
## 6         312         1
dim(all_countries)
## [1] 217  26
colSums(is.na(all_countries))
##        Country           Code       LandArea     Population        Density 
##              0              0              8              1              8 
##            GDP          Rural            CO2      PumpPrice       Military 
##             30              3             13             50             67 
##         Health    ArmedForces       Internet           Cell            HIV 
##             29             49             13             15             81 
##         Hunger       Diabetes      BirthRate      DeathRate     ElderlyPop 
##             52             10             15             15             24 
## LifeExpectancy    FemaleLabor   Unemployment         Energy    Electricity 
##             18             30             30             82             76 
##      Developed 
##             75
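
Because `lm()` drops any row with a missing value in a variable it uses, it can help to count the usable rows before fitting. The following is a minimal sketch that uses only the six rows printed by `head()` above; the `toy` data frame is an illustrative assumption, not the full dataset.

```r
# Toy data frame built from the six rows shown by head() above;
# the real analysis would use all_countries instead.
toy <- data.frame(
  Country        = c("Afghanistan", "Albania", "Algeria",
                     "American Samoa", "Andorra", "Angola"),
  GDP            = c(521, 5254, 4279, NA, 42030, 3432),
  LifeExpectancy = c(64.0, 78.5, 76.3, NA, NA, 61.8)
)

# Rows with no NA in the variables the model will use
usable   <- complete.cases(toy[, c("LifeExpectancy", "GDP")])
n_usable <- sum(usable)
n_usable

# lm() drops the incomplete rows (default na.action is na.omit),
# so the number of fitted observations matches the count above
toy_fit <- lm(LifeExpectancy ~ GDP, data = toy)
n_fit   <- nobs(toy_fit)
n_fit
```

Running the same `complete.cases()` check on `all_countries` with each model's variables should account for the "observations deleted due to missingness" lines in the model summaries.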

2. Simple Linear Regression (Fitting and Interpretation): Using the AllCountries dataset, fit a simple linear regression model to predict LifeExpectancy (average life expectancy in years) based on GDP (gross domestic product per capita in $US). Report the intercept and slope coefficients and interpret their meaning in the context of the dataset. What does the R² value tell you about how well GDP explains variation in life expectancy across countries?

#summary(all_countries)

simple_model <- lm(LifeExpectancy ~ GDP,data = all_countries)
summary(simple_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = all_countries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

The Interpretation
- The intercept is about 68.42 (6.842e+01), which is the predicted life expectancy in years when GDP per capita is $0. No country has a GDP of 0, so this is not practically meaningful; mathematically it is just the y-intercept.
- The slope is about 2.476e-04, which means each additional $1 of GDP per capita is associated with an increase of 2.476e-04 years in predicted life expectancy, or roughly 0.25 years per additional $1,000.
- Both p-values (< 2e-16) are below 0.05, so the intercept and slope are statistically significant.
- R² is 0.4304 (adjusted 0.4272), so GDP alone explains about 43% of the variance in life expectancy across countries. That is decent, but it leaves room for improvement.
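
The slope is easier to read on a larger scale. The sketch below copies the rounded coefficients from the `summary(simple_model)` output and works out the change per $1,000 of GDP and one prediction; the GDP value of $10,000 is a hypothetical example, not a country in the data.

```r
# Coefficients copied (rounded) from the summary(simple_model) output above
intercept <- 6.842e+01
slope     <- 2.476e-04

# Change in predicted life expectancy per extra $1,000 of GDP per capita
slope_per_1000 <- slope * 1000
slope_per_1000

# Predicted life expectancy at a hypothetical GDP per capita of $10,000
gdp_example  <- 10000
pred_example <- intercept + slope * gdp_example
pred_example
```

With the fitted object in scope, `predict(simple_model, newdata = data.frame(GDP = 10000))` should give nearly the same value, differing only because the printed coefficients are rounded.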

3. Multiple Linear Regression (Fitting and Interpretation) Fit a multiple linear regression model to predict LifeExpectancy using GDP, Health (percentage of government expenditures on healthcare), and Internet (percentage of population with internet access) as predictors. Interpret the coefficient for Health, explaining what it means in terms of life expectancy while controlling for GDP and Internet. How does the adjusted R² compare to the simple regression model from Question 1, and what does this suggest about the additional predictors?

multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = all_countries)

summary(multiple_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = all_countries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

The Interpretations
- The Health coefficient is about 0.2479 (2.479e-01): holding GDP and Internet constant, each 1 percentage point increase in Health (the share of government expenditures spent on healthcare) is associated with an increase of about 0.25 years in predicted life expectancy.
- The adjusted R² is 0.7164, compared with 0.4272 for the simple regression. That large improvement suggests Health and Internet explain substantial variation in life expectancy beyond GDP alone. GDP itself is no longer significant (p = 0.302) once Health and Internet are in the model.
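
The adjusted R² values can be reproduced from the plain R², the sample size, and the number of predictors, which shows how the adjustment penalizes extra predictors. This is a sketch using the rounded values printed in the two summaries; `adj_r2` is a helper name introduced here for illustration.

```r
# Adjusted R-squared from R-squared, sample size n, and predictor count p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# n = residual degrees of freedom + number of estimated coefficients,
# both read from the summaries above
adj_simple   <- adj_r2(r2 = 0.4304, n = 177 + 2, p = 1)
adj_multiple <- adj_r2(r2 = 0.7213, n = 169 + 4, p = 3)

round(c(simple = adj_simple, multiple = adj_multiple), 4)
```

The two models were fit on slightly different samples (179 versus 173 countries) because of missing values, so the comparison is informative but not exactly like for like.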

4. Checking Assumptions (Homoscedasticity and Normality) For the simple linear regression model from Question 1 (LifeExpectancy ~ GDP), describe how you would check the assumptions of homoscedasticity and normality of residuals. For each assumption, explain what an ideal outcome would look like and what a violation might indicate about the model’s reliability for predicting life expectancy. Afterwards, code your answer and reflect on whether it matched the ideal outcome.

# Visual linearity check: predictor (GDP) on the x-axis and response on the
# y-axis, so the fitted line from simple_model is drawn on matching axes
plot(all_countries$GDP, all_countries$LifeExpectancy,
     xlab="GDP", ylab="Life Expectancy", main="Life Expectancy vs GDP")
abline(simple_model, col=1, lwd=2)

# Core diagnostics
par(mfrow=c(2,2)); plot(simple_model); par(mfrow=c(1,1))


Answer:
- To check homoscedasticity for the simple linear regression model from Question 1 (LifeExpectancy ~ GDP), use the Residuals vs Fitted plot; to check normality of the residuals, use the Normal Q-Q plot.
- For the Residuals vs Fitted plot, the ideal outcome is residuals scattered randomly around 0 with a consistent spread across the range of fitted values. A violation, such as a funnel shape or a curved trend, would indicate non-constant variance (heteroscedasticity) or a missed nonlinear relationship, which makes the standard errors and prediction intervals unreliable.
- For the Q-Q plot, the ideal outcome is residuals that follow the straight diagonal line. A violation, such as points bending away from the line in the tails, would indicate skewed or heavy-tailed residuals, which makes p-values and intervals for predicted life expectancy less trustworthy.
- Result: the Residuals vs Fitted plot is not a flat cloud. The residuals are not scattered randomly around 0, and their spread is not consistent across the fitted values, so it does not match the ideal outcome.
- Result: in the Q-Q plot, the tails fall slightly off the line but do not deviate far from it, so normality is approximately satisfied and the plot roughly matches the ideal outcome.
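
The visual checks above can be backed up with numbers. The sketch below is one possible approach, run on simulated data rather than the AllCountries dataset: it computes a Breusch-Pagan-style statistic by hand (Koenker's studentized form, chosen here to avoid extra packages) and a Shapiro-Wilk test. All variable names and the simulated relationship are assumptions for illustration.

```r
set.seed(1)  # simulated data, NOT the AllCountries dataset

n <- 200
x <- runif(n, 1, 10)
y <- 60 + 2 * x + rnorm(n, sd = x)   # error spread grows with x on purpose
sim_fit <- lm(y ~ x)
res <- resid(sim_fit)

# Breusch-Pagan-style check (Koenker's studentized form), done by hand:
# regress the squared residuals on the predictor; under constant variance,
# n * R^2 is approximately chi-square with 1 degree of freedom
aux     <- lm(res^2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
bp_p

# Shapiro-Wilk test of normality on the residuals
sw <- shapiro.test(res)
sw$p.value
```

A small p-value in the first test points to heteroscedasticity; a small p-value in the second points to non-normal residuals. To apply the same idea to the model in this report, the residuals would come from `resid(simple_model)` and the predictor from `model.frame(simple_model)$GDP`.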

5. Diagnosing Model Fit (RMSE and Residuals) For the multiple regression model from Question 2 (LifeExpectancy ~ GDP + Health + Internet), calculate the RMSE and explain what it represents in the context of predicting life expectancy. How would large residuals for certain countries (e.g., those with unusually high or low life expectancy) affect your confidence in the model’s predictions, and what might you investigate further?

# calculate residuals
residuals_multiple <- resid(multiple_model)

# calculate RMSE
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 4.056417


- The RMSE for the multiple regression model from Question 2 (LifeExpectancy ~ GDP + Health + Internet) is 4.056417, which means the model’s predictions miss the actual life expectancy by about 4.06 years in a typical (root-mean-square) sense.
- Large residuals for certain countries reduce confidence in the model’s predictions: they show the model fits those countries poorly, they can pull the fitted coefficients toward themselves, and 44 observations were dropped because of missing data, which the model does not account for. It would be worth identifying which countries have the largest residuals and then checking those countries’ values and the missingness in each column.
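
The RMSE can be cross-checked against the residual standard error in the model summary, since both come from the same residual sum of squares and differ only in the divisor. This sketch uses the rounded values printed by `summary(multiple_model)`; the variable names are introduced here for illustration.

```r
# Values copied from the summary(multiple_model) output above
rse <- 4.104   # residual standard error
df  <- 169     # residual degrees of freedom
k   <- 4       # estimated coefficients (intercept + 3 predictors)
n   <- df + k  # observations actually used in the fit

# RSE = sqrt(RSS / df) and RMSE = sqrt(RSS / n), so RMSE = RSE * sqrt(df / n)
rmse_from_rse <- rse * sqrt(df / n)
rmse_from_rse
```

The result should agree with the value of `rmse_multiple` printed above up to the rounding of the residual standard error.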

6. Hypothetical Example (Multicollinearity in Multiple Regression) Suppose you are analyzing the AllCountries dataset and fit a multiple linear regression model to predict CO2 emissions (metric tons per capita) using Energy (kilotons of oil equivalent) and Electricity (kWh per capita) as predictors. You notice that Energy and Electricity are highly correlated. Explain how this multicollinearity might affect the interpretation of the regression coefficients and the reliability of the model.


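- This multicollinearity makes it hard to tell which predictor really matters. Energy and Electricity carry overlapping information, so their coefficients can become unstable (large standard errors, with signs or magnitudes that shift under small changes in the data) and their individual p-values unreliable. The model’s overall predictions can still be reasonable, but neither coefficient can be read as that predictor’s separate effect on CO2.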
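
One common way to quantify this problem is the variance inflation factor (VIF). The sketch below uses simulated data, not the AllCountries dataset, with two deliberately correlated predictors; the variable names and coefficients are assumptions for illustration. It computes the VIF by hand and compares the standard error of one coefficient with and without the correlated partner.

```r
set.seed(42)  # simulated data, NOT the AllCountries dataset

n <- 200
energy      <- rnorm(n, mean = 2000, sd = 500)
electricity <- 2 * energy + rnorm(n, sd = 200)   # nearly a linear function of energy
co2         <- 0.002 * energy + rnorm(n, sd = 1)

cor_ee <- cor(energy, electricity)

# VIF for a predictor = 1 / (1 - R^2), where R^2 comes from regressing
# that predictor on the other predictors
r2_energy  <- summary(lm(energy ~ electricity))$r.squared
vif_energy <- 1 / (1 - r2_energy)

# Standard error of the energy coefficient, alone and with electricity added
se_alone <- coef(summary(lm(co2 ~ energy)))["energy", "Std. Error"]
se_both  <- coef(summary(lm(co2 ~ energy + electricity)))["energy", "Std. Error"]

c(correlation = cor_ee, VIF = vif_energy, se_alone = se_alone, se_both = se_both)
```

A VIF well above the usual rule-of-thumb thresholds (often quoted as 5 or 10) and a much larger standard error in the two-predictor model are the numeric signs of the instability described above.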