Question 1

Step 1: Load and Prepare the Data

The necessary libraries are loaded, and the dataset is imported into RStudio and prepared for use.

Question 2

We fit a linear regression model to predict life expectancy based on GDP.

# Fit the linear regression model
model <- lm(LifeExpectancy ~ GDP, data = All)
# View the model summary
summary(model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = All)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

Interpretation:
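One way to read the GDP slope reported above is to rescale it to a larger GDP difference. The sketch below is only arithmetic on the printed coefficients. The step of 10,000 GDP units is an illustrative assumption, not something taken from the assignment, and GDP is assumed to be in whatever units the dataset uses.

```r
# Values copied from the summary(model) output above
gdp_slope <- 2.476e-04   # change in LifeExpectancy (years) per 1 unit of GDP
r_squared <- 0.4304      # Multiple R-squared of the simple model

# Illustrative step size (an assumption chosen for readability)
gdp_step <- 10000

# Predicted difference in life expectancy for a gdp_step difference in GDP
effect_years <- gdp_slope * gdp_step
effect_years

# Share of the variation in life expectancy explained by GDP alone, in percent
r_squared_pct <- r_squared * 100
r_squared_pct
```

Under these assumptions, a GDP difference of 10,000 corresponds to a predicted difference of about 2.5 years of life expectancy, and GDP alone accounts for about 43% of the variation.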

Question 3

We will conduct a multiple regression model to predict LifeExpectancy using GDP, Health (percentage of government expenditures on healthcare), and Internet (percentage of population with internet access) as predictors.

# Fit the Multiple regression model
model2 <- lm(LifeExpectancy ~ GDP + Health + Internet, data = All)
# View the model summary
summary(model2)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = All)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

Interpretation:

- The coefficient for Health is 0.2479, suggesting a positive association with life expectancy while controlling for GDP and Internet. The p-value is 0.000247, indicating that, after controlling for GDP and internet access, health spending has a statistically significant effect on life expectancy. A 1 percentage point increase in government expenditures on healthcare is associated with an average increase in life expectancy of 0.2479 years.
- The adjusted R² has increased to 71.64%, compared with 42.72% for the simple linear regression in Question 2. The multiple regression therefore explains about 28.9 percentage points more of the variation in life expectancy than the simple linear model. Note that the two models are fit on slightly different samples (179 and 173 countries) because of missing values.
- The additional predictors provide more insight into how health investment and technological infrastructure contribute to better global health outcomes and longer life expectancy.
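The adjusted R² comparison above uses two models that dropped different numbers of rows (38 and 44 observations deleted due to missingness). A minimal sketch of a like-for-like comparison is to refit both models on the rows that are complete for all variables. The data frame `toy` below is simulated and purely hypothetical; it only stands in for `All`, so the numbers it produces say nothing about the real dataset.

```r
# Gap in adjusted R-squared, using the values printed in the two summaries
0.7164 - 0.4272   # about 0.289, i.e. roughly 28.9 percentage points

# Hypothetical toy data standing in for `All` (all values are simulated)
set.seed(1)
n <- 200
toy <- data.frame(
  GDP      = runif(n, 500, 60000),
  Health   = runif(n, 2, 25),
  Internet = runif(n, 1, 95)
)
toy$LifeExpectancy <- 59 + 0.25 * toy$Health + 0.19 * toy$Internet +
  rnorm(n, sd = 4)
toy$Health[sample(n, 10)] <- NA   # inject some missing values

# Keep only the rows that are complete for every variable in the larger model
vars <- c("LifeExpectancy", "GDP", "Health", "Internet")
common <- toy[complete.cases(toy[, vars]), ]

# Fit both models on the same rows so the comparison is like for like
simple_fit   <- lm(LifeExpectancy ~ GDP, data = common)
multiple_fit <- lm(LifeExpectancy ~ GDP + Health + Internet, data = common)

adj_r2 <- c(
  simple   = summary(simple_fit)$adj.r.squared,
  multiple = summary(multiple_fit)$adj.r.squared
)
adj_r2
```

With the real data, replacing `toy` with `All` would apply the same idea, assuming the column names match those used in the models above.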

Question 4

Homoscedasticity Assumption

The homoscedasticity assumption states that the variance of the residuals is constant across all levels of the independent variables.

Ideal Outcome:

- Residuals are equally spread around zero across all predicted values
- The residuals vs fitted plot shows no clear pattern
- The points form a constant “band” or “cloud”

Violations:

- Standard errors may be incorrect
- A funnel shape indicates heteroscedasticity
- Coefficient estimates remain unbiased but become inefficient

# Scale-Location plot
plot(model, which = 3)

# Residuals vs Fitted plot
plot(model, which = 1)

Reflection: The residuals are spread roughly equally around zero across all predicted values, which matches the ideal outcome.
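A numeric check can complement the visual inspection of the plots. The sketch below is in the spirit of the Breusch-Pagan test (studentized form): it regresses the squared residuals on the fitted values and compares n times the auxiliary R² to a chi-squared distribution with 1 degree of freedom. The helper `bp_pvalue` and the simulated data are hypothetical illustrations, not part of the original analysis. For a simple regression, using the fitted values is equivalent to using the single predictor.

```r
# Simulated data: one outcome with constant variance, one with a funnel shape
set.seed(42)
n <- 300
x <- runif(n, min = 1, max = 10)
y_const  <- 2 + 0.5 * x + rnorm(n, sd = 1)        # constant error variance
y_funnel <- 2 + 0.5 * x + rnorm(n, sd = 0.4 * x)  # error spread grows with x

fit_const  <- lm(y_const ~ x)
fit_funnel <- lm(y_funnel ~ x)

# Hypothetical helper: p-value from an auxiliary regression of the squared
# residuals on the fitted values (statistic = n * R-squared, 1 df)
bp_pvalue <- function(fit) {
  sq_resid    <- residuals(fit)^2
  fitted_vals <- fitted(fit)
  aux  <- lm(sq_resid ~ fitted_vals)
  stat <- length(sq_resid) * summary(aux)$r.squared
  pchisq(stat, df = 1, lower.tail = FALSE)
}

p_const  <- bp_pvalue(fit_const)
p_funnel <- bp_pvalue(fit_funnel)
c(constant = p_const, funnel = p_funnel)
```

A small p-value is evidence against constant variance. Calling `bp_pvalue(model)` on the fitted model from Question 2 would apply the same check to the real data.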

Normality Assumption

The normality assumption states that the residuals are drawn from a normal distribution.

Ideal outcome: Points fall approximately on the straight line in the Q-Q plot, which indicates normally distributed residuals.

Violation: Strong curvature or an S-shape. Such a shape indicates non-normality, which may affect hypothesis tests and confidence intervals.

# Q-Q plot
plot(model, which = 2) 

Reflection: Points fall approximately on the straight line in the Q-Q plot. This suggests that the residuals are normally distributed, so it matches the ideal outcome.
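The Q-Q plot can be supplemented with a formal test. The sketch below applies the Shapiro-Wilk test (`shapiro.test` from base R's stats package) to the residuals of two models fit on simulated data, one with normal errors and one with skewed errors. The data and variable names are hypothetical and serve only to show how the p-value behaves in each case.

```r
# Simulated data: same predictor, normal versus right-skewed errors
set.seed(7)
n <- 150
x <- rnorm(n)
y_normal <- 1 + 2 * x + rnorm(n)         # normally distributed errors
y_skewed <- 1 + 2 * x + (rexp(n) - 1)    # right-skewed errors with mean zero

# Shapiro-Wilk test on the residuals of each fitted model
sw_normal <- shapiro.test(residuals(lm(y_normal ~ x)))
sw_skewed <- shapiro.test(residuals(lm(y_skewed ~ x)))

c(normal = sw_normal$p.value, skewed = sw_skewed$p.value)
```

A small p-value is evidence against normality. Running `shapiro.test(residuals(model))` would apply the same test to the model from Question 2.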

Question 5

The RMSE (root mean squared error) measures the typical magnitude of the prediction errors.

# Calculate RMSE for the multiple regression model
resid <- residuals(model2)    # observed minus fitted life expectancy
RMSE <- sqrt(mean(resid^2))   # square root of the mean squared residual
RMSE
## [1] 4.056417
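As a consistency check, the RMSE can be related to the residual standard error printed by `summary(model2)`. The RMSE divides the residual sum of squares by the number of observations n, while the residual standard error divides it by the residual degrees of freedom, so RMSE = RSE × sqrt(df / n). The sketch below uses only numbers from the output above; n = 173 is inferred from 169 residual degrees of freedom plus 4 estimated coefficients.

```r
# Values copied from the summary(model2) output
rse      <- 4.104   # residual standard error
df_resid <- 169     # residual degrees of freedom
n_coef   <- 4       # intercept + GDP + Health + Internet

# Number of observations actually used in the fit
n_obs <- df_resid + n_coef

# RMSE implied by the residual standard error
rmse_from_rse <- rse * sqrt(df_resid / n_obs)
rmse_from_rse
```

The result agrees with the computed RMSE of 4.056417 up to the rounding of the printed residual standard error.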

Explanation:

- The RMSE is 4.056417, which represents the typical prediction error in life expectancy (in years). It indicates that the predictions are typically off by about 4.06 years. The value is high, suggesting that the model is not performing well and requires improvement to enhance predictive accuracy.
- When some countries have large residuals, confidence in the predictive model is reduced. Large residuals mean that the model fails to capture important factors influencing life expectancy in these cases, which may indicate that the model is misspecified or missing key variables.
- Further, I would investigate whether these large residuals cluster by region, income level, or specific country characteristics that are not captured by GDP, health expenditures, and internet access.
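The clustering check described above can be sketched as follows. The data are simulated, and the grouping column `Region` is a hypothetical stand-in for whatever region or income-level variable the real dataset provides. The threshold of 2 for standardized residuals is a common rule of thumb, used here as an assumption.

```r
# Simulated data in which one region is systematically lower than the others
set.seed(3)
toy <- data.frame(
  Region = rep(c("A", "B", "C"), each = 40),
  x      = runif(120, 0, 10)
)
toy$y <- 5 + 1.5 * toy$x + rnorm(120)
toy$y[toy$Region == "C"] <- toy$y[toy$Region == "C"] - 4

# Fit a model that ignores Region, then inspect its standardized residuals
fit <- lm(y ~ x, data = toy)
toy$std_resid <- rstandard(fit)

# Mean standardized residual per region: values far from zero suggest that
# the model misses something specific to that group
region_means <- tapply(toy$std_resid, toy$Region, mean)
region_means

# Count of large residuals (|standardized residual| > 2) in each region
table(Region = toy$Region, Large = abs(toy$std_resid) > 2)
```

If the large residuals concentrate in one group, adding that grouping variable (or a related predictor) to the model is a natural next step.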

Question 6

Impact of Multicollinearity on Interpretation

- First, multicollinearity makes coefficients unstable and unreliable. The estimated coefficients for energy (β₁) and electricity (β₂) will be very sensitive to slight changes in the data, so a small change, such as adding or removing a country, can substantially change the results.
- Second, it inflates standard errors, making predictors appear statistically insignificant even if they are important.
- Lastly, it makes the results difficult to interpret, because the separate effect of each predictor cannot be cleanly isolated.

Effect on Model Reliability

- First, multicollinearity makes the t-statistic for each predictor smaller because the standard error is inflated, leading to insignificant p-values even when the predictors matter. This undermines the individual significance tests, although the overall fit of the model (R² and the F-test) is largely unaffected.
- Second, it makes it hard to establish which predictor is responsible for a significant change in CO₂ emissions, which reduces the reliability of the model.
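The inflation of standard errors can be illustrated with a variance inflation factor (VIF), computed here by hand as 1 / (1 − R²), where R² comes from regressing one predictor on the other. The data below are simulated, and the variable names `energy`, `electricity`, and `co2` are hypothetical stand-ins for the predictors and outcome discussed above. A VIF above roughly 5 to 10 is often treated as a warning sign, which is a rule of thumb rather than a strict cutoff.

```r
# Simulated data with two strongly correlated predictors
set.seed(11)
n <- 100
energy      <- rnorm(n, mean = 50, sd = 10)
electricity <- 0.9 * energy + rnorm(n, sd = 2)   # nearly a copy of energy
co2         <- 3 + 0.5 * energy + 0.3 * electricity + rnorm(n, sd = 5)

# VIF for each predictor: 1 / (1 - R^2) from regressing it on the other one
vif_energy      <- 1 / (1 - summary(lm(energy ~ electricity))$r.squared)
vif_electricity <- 1 / (1 - summary(lm(electricity ~ energy))$r.squared)
c(energy = vif_energy, electricity = vif_electricity)

# Standard error of the energy coefficient, alone versus next to electricity
se_alone <- summary(lm(co2 ~ energy))$coefficients["energy", "Std. Error"]
se_joint <- summary(lm(co2 ~ energy + electricity))$coefficients["energy", "Std. Error"]
c(alone = se_alone, joint = se_joint)
```

With only two predictors, both VIFs are equal, and the standard error of the energy coefficient grows noticeably once electricity enters the model.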