Question 1

Load the data set from a CSV file

setwd("C:/Users/Mulut/Desktop/Classes/Data101/HW9")

AllCountries <- read.csv("AllCountries.csv")

head(AllCountries)
##          Country Code LandArea Population Density   GDP Rural  CO2 PumpPrice
## 1    Afghanistan  AFG   652.86     37.172    56.9   521  74.5 0.29      0.70
## 2        Albania  ALB    27.40      2.866   104.6  5254  39.7 1.98      1.36
## 3        Algeria  DZA  2381.74     42.228    17.7  4279  27.4 3.74      0.28
## 4 American Samoa  ASM     0.20      0.055   277.3    NA  12.8   NA        NA
## 5        Andorra  AND     0.47      0.077   163.8 42030  11.9 5.83        NA
## 6         Angola  AGO  1246.70     30.810    24.7  3432  34.5 1.29      0.97
##   Military Health ArmedForces Internet  Cell HIV Hunger Diabetes BirthRate
## 1     3.72   2.01         323     11.4  67.4  NA   30.3      9.6      32.5
## 2     4.08   9.51           9     71.8 123.7 0.1    5.5     10.1      11.7
## 3    13.81  10.73         317     47.7 111.0 0.1    4.7      6.7      22.3
## 4       NA     NA          NA       NA    NA  NA     NA       NA        NA
## 5       NA  14.02          NA     98.9 104.4  NA     NA      8.0        NA
## 6     9.40   5.43         117     14.3  44.7 1.9   23.9      3.9      41.3
##   DeathRate ElderlyPop LifeExpectancy FemaleLabor Unemployment Energy
## 1       6.6        2.6           64.0        50.3          1.5     NA
## 2       7.5       13.6           78.5        55.9         13.9    808
## 3       4.8        6.4           76.3        16.4         12.1   1328
## 4        NA         NA             NA          NA           NA     NA
## 5        NA         NA             NA          NA           NA     NA
## 6       8.4        2.5           61.8        76.4          7.3    545
##   Electricity Developed
## 1          NA        NA
## 2        2309         1
## 3        1363         1
## 4          NA        NA
## 5          NA        NA
## 6         312         1

Question 2

Simple Linear Regression (Fitting and Interpretation): Using the AllCountries dataset, fit a simple linear regression model to predict LifeExpectancy (average life expectancy in years) based on GDP (gross domestic product per capita in $US). Report the intercept and slope coefficients and interpret their meaning in the context of the dataset. What does the R² value tell you about how well GDP explains variation in life expectancy across countries?

simple_model <- lm(LifeExpectancy ~ GDP, data = AllCountries)
simple_model
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = AllCountries)
## 
## Coefficients:
## (Intercept)          GDP  
##   6.842e+01    2.476e-04
summary(simple_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = AllCountries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

Results

Based on the simple linear regression results, the intercept is 68.42, meaning that a country with a GDP per capita of $0 is predicted to have an average life expectancy of about 68.42 years. No country actually has a GDP per capita of $0, so this value is an extrapolation that serves mainly as a mathematical baseline for the model. The slope for GDP is 0.0002476, indicating that each additional $1 of GDP per capita is associated with a predicted increase of 0.0002476 years in life expectancy, or about 2.5 years per additional $10,000. This positive slope shows that higher-income countries tend to have longer life expectancies. The R² value of 0.4304 means that about 43% of the variation in life expectancy across countries is explained by GDP alone, suggesting that GDP is an important predictor but that more than half of the variation is due to other factors not included in the simple model.
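The numbers quoted above can also be pulled from the fitted object instead of being read off the printout. The following is a minimal sketch; `describe_simple_fit` and its `step` argument are hypothetical names introduced here for illustration, and the $10,000 step is only a convenient rescaling of the slope.

```r
# Hypothetical helper: summarise a one-predictor lm fit.
# `step` rescales the slope so it is easier to read (e.g. per $10,000 of GDP).
describe_simple_fit <- function(model, step = 1) {
  est <- coef(model)
  list(
    intercept      = unname(est[1]),
    slope          = unname(est[2]),
    slope_per_step = unname(est[2]) * step,
    r_squared      = summary(model)$r.squared
  )
}

# Runs only when the model fitted above is in the workspace.
if (exists("simple_model")) {
  print(describe_simple_fit(simple_model, step = 10000))
}
```

With the slope reported above (2.476e-04), a $10,000 step corresponds to roughly 2.48 years of predicted life expectancy.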

Question 3

Multiple Linear Regression (Fitting and Interpretation): Fit a multiple linear regression model to predict LifeExpectancy using GDP, Health (percentage of government expenditures on healthcare), and Internet (percentage of population with internet access) as predictors. Interpret the coefficient for Health, explaining what it means in terms of life expectancy while controlling for GDP and Internet. How does the adjusted R² compare to the simple regression model from Question 2, and what does this suggest about the additional predictors?

multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, 
                     data = AllCountries)

multiple_model
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
## 
## Coefficients:
## (Intercept)          GDP       Health     Internet  
##   5.908e+01    2.367e-05    2.479e-01    1.903e-01
summary(multiple_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

Results

The coefficient for Health is 0.2479, which means that for each 1 percentage point increase in the share of government spending directed toward healthcare, a country’s life expectancy is predicted to increase by about 0.25 years, or roughly three months, holding GDP and Internet access constant. This suggests that healthcare spending has a positive association with life expectancy (p = 0.000247) that is separate from a country’s economic wealth or level of technological access. Notably, GDP itself is no longer statistically significant in this model (p = 0.302), which suggests that much of its apparent effect in the simple model is shared with Internet and Health. Comparing model fit, the multiple regression model has an adjusted R² of 0.7164, which is substantially higher than the 0.4272 from the simple regression in Question 2. This large increase indicates that adding Health and Internet greatly improves the model’s ability to explain variation in life expectancy across countries. In other words, while GDP alone explains about 43% of the differences in life expectancy, including healthcare spending and internet access raises that explanatory power to over 71%, suggesting that these additional predictors capture important social and developmental factors that substantially strengthen the model.
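One caveat on this comparison: the two summaries report different numbers of dropped rows (38 versus 44), so the two adjusted R² values were computed on slightly different sets of countries. A possible way to put them on equal footing is to refit the smaller model on exactly the rows the larger model used. This is a sketch only; `compare_adj_r2` is a hypothetical helper, not part of the original analysis.

```r
# Hypothetical helper: refit a smaller formula on exactly the rows used by a
# larger model, so the two adjusted R-squared values are comparable.
compare_adj_r2 <- function(small_formula, big_model) {
  used_rows   <- model.frame(big_model)  # complete cases the big model used
  small_model <- lm(small_formula, data = used_rows)
  c(
    small = summary(small_model)$adj.r.squared,
    big   = summary(big_model)$adj.r.squared,
    n     = nrow(used_rows)
  )
}

# Runs only when the model fitted above is in the workspace.
if (exists("multiple_model")) {
  print(compare_adj_r2(LifeExpectancy ~ GDP, multiple_model))
}
```

If the gap between the two values stays large on the shared rows, the improvement can be attributed to the added predictors rather than to the change in sample.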

Question 4

Checking Assumptions (Homoscedasticity and Normality): For the simple linear regression model from Question 2 (LifeExpectancy ~ GDP), describe how you would check the assumptions of homoscedasticity and normality of residuals. For each assumption, explain what an ideal outcome would look like and what a violation might indicate about the model’s reliability for predicting life expectancy. Afterward, code your answer and reflect on whether it matched the ideal outcome.

par(mfrow=c(2,2)); plot(simple_model); par(mfrow=c(1,1))

Results

To check the assumptions of the simple linear regression model LifeExpectancy ~ GDP, I examined both homoscedasticity and normality of the residuals using diagnostic plots. Homoscedasticity was evaluated with the Residuals vs Fitted and Scale-Location plots, where the ideal outcome would be a random, horizontal scatter of points with no clear pattern and a roughly constant spread across all fitted values. However, the Residuals vs Fitted plot shows a noticeable curved pattern. The curve itself points to a nonlinear relationship between GDP and life expectancy, and the spread of the residuals is not constant, especially at higher fitted life expectancy values, which suggests heteroscedasticity. These violations indicate that the model may not predict life expectancy equally well across all GDP levels and that the accuracy of the standard errors may be compromised. Normality was checked using the Q-Q plot, where an ideal plot would show points falling closely along the reference line. Instead, the Q-Q plot reveals systematic deviations, especially in the tails, indicating that the residuals are not normally distributed. Such a violation suggests that hypothesis tests and confidence intervals may be less reliable. Overall, the diagnostics did not match the ideal outcome for either assumption.
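The visual checks can be backed up with numeric ones. The sketch below uses only base R: a Shapiro-Wilk test on the residuals for normality, and a hand-computed Breusch-Pagan-style check (studentized form, using the fitted values as the auxiliary regressor) for constant variance. `check_residual_assumptions` is a hypothetical helper introduced here, and these tests are offered as a complement to the plots, not as the method the assignment requires.

```r
# Hypothetical helper: numeric companions to the diagnostic plots.
check_residual_assumptions <- function(model) {
  res      <- resid(model)
  fit_vals <- fitted(model)

  # Normality: Shapiro-Wilk test on the residuals
  # (a small p-value is evidence against normality).
  sw <- shapiro.test(res)

  # Constant variance: regress squared residuals on the fitted values.
  # n * R^2 of this auxiliary regression is compared with a chi-square
  # distribution with 1 degree of freedom
  # (a small p-value is evidence of heteroscedasticity).
  aux     <- lm(I(res^2) ~ fit_vals)
  bp_stat <- length(res) * summary(aux)$r.squared
  bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

  c(shapiro_p = sw$p.value, bp_stat = bp_stat, bp_p = bp_p)
}

# Runs only when the model fitted above is in the workspace.
if (exists("simple_model")) {
  print(check_residual_assumptions(simple_model))
}
```

Neither test detects curvature in the mean, so the Residuals vs Fitted plot remains the main tool for spotting a nonlinear pattern.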

Question 5

Diagnosing Model Fit (RMSE and Residuals): For the multiple regression model from Question 3 (LifeExpectancy ~ GDP + Health + Internet), calculate the RMSE and explain what it represents in the context of predicting life expectancy. How would large residuals for certain countries (e.g., those with unusually high or low life expectancy) affect your confidence in the model’s predictions, and what might you investigate further?

residuals_multiple <- resid(multiple_model)

rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 4.056417

Results

The RMSE for the multiple regression model is 4.06, which means that the model’s predictions of life expectancy are typically off by about 4 years. This indicates a reasonably good fit for country-level data, but not perfect accuracy. If certain countries show much larger residuals than this, especially those with unusually high or low life expectancy, it would reduce confidence in the model’s predictions for those cases and suggest that important factors affecting those countries are not included in the model. To investigate further, I would identify which countries have the largest residuals and check whether they share characteristics the model omits or whether their data contain errors.
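To see which countries drive the largest errors, the residuals can be sorted by absolute size. The sketch below wraps the RMSE calculation and that ranking in two hypothetical helpers, `rmse` and `largest_residuals`. Mapping residual names back to `AllCountries$Country` assumes the data frame still has the default row numbers that `read.csv` assigns.

```r
# Hypothetical helper: root mean squared error of a fitted model.
rmse <- function(model) sqrt(mean(resid(model)^2))

# Hypothetical helper: the n observations with the largest absolute residuals.
# Names are the row names of the data used in the fit.
largest_residuals <- function(model, n = 5) {
  res <- resid(model)
  head(res[order(abs(res), decreasing = TRUE)], n)
}

# Runs only when the objects created above are in the workspace.
if (exists("multiple_model") && exists("AllCountries")) {
  print(rmse(multiple_model))   # same calculation as above
  print(sigma(multiple_model))  # residual standard error: divides by df, not n
  top <- largest_residuals(multiple_model)
  # Assumes default row names, so names(top) are row numbers in AllCountries.
  print(data.frame(Country  = AllCountries$Country[as.integer(names(top))],
                   Residual = unname(top)))
}
```

The RMSE (4.056) is slightly smaller than the residual standard error in the summary (4.104) because the RMSE divides the sum of squared residuals by the 173 observations used, while the residual standard error divides by the 169 residual degrees of freedom.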

Question 6

Hypothetical Example (Multicollinearity in Multiple Regression): Suppose you are analyzing the AllCountries dataset and fit a multiple linear regression model to predict CO2 emissions (metric tons per capita) using Energy (kilotons of oil equivalent) and Electricity (kWh per capita) as predictors. You notice that Energy and Electricity are highly correlated. Explain how this multicollinearity might affect the interpretation of the regression coefficients and the reliability of the model.

Answer

In this hypothetical model predicting CO2 emissions using Energy and Electricity, the fact that these two predictors are highly correlated means the regression suffers from multicollinearity, which makes it difficult for the model to determine the unique contribution of each variable. When predictors move together, the model cannot reliably separate their individual effects, causing the estimated coefficients to become unstable and overly sensitive to small changes in the data. Multicollinearity also inflates the standard errors of the coefficients, reducing our confidence in the estimates and making it harder to interpret which predictor actually influences CO2 emissions. Overall, multicollinearity does not necessarily harm prediction accuracy, but it severely harms interpretability and the stability of the model’s conclusions.
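The inflation of standard errors can be illustrated with a small simulation. The data below are synthetic stand-ins, not the AllCountries values: `energy` and `electricity` are generated to be almost identical, and the coefficients, sample size, and seed are arbitrary assumptions chosen for the sketch. The variance inflation factor (VIF) is computed by hand as 1 / (1 - R²), where R² comes from regressing one predictor on the other.

```r
set.seed(101)  # arbitrary seed, for reproducibility only
n <- 200

# Synthetic stand-ins (NOT the AllCountries data): two nearly identical predictors.
energy      <- rnorm(n, mean = 50, sd = 10)
electricity <- energy + rnorm(n, sd = 1)
co2         <- 1 + 0.1 * energy + 0.1 * electricity + rnorm(n)

fit_both  <- lm(co2 ~ energy + electricity)
fit_alone <- lm(co2 ~ energy)

# Variance inflation factor for energy: 1 / (1 - R^2), where R^2 comes from
# regressing energy on the other predictor.
vif_energy <- 1 / (1 - summary(lm(energy ~ electricity))$r.squared)

# Standard error of the energy slope with and without its correlated partner.
se_both  <- coef(summary(fit_both))["energy", "Std. Error"]
se_alone <- coef(summary(fit_alone))["energy", "Std. Error"]

print(c(correlation = cor(energy, electricity),
        vif         = vif_energy,
        se_both     = se_both,
        se_alone    = se_alone))
```

In a setup like this, the standard error of the `energy` slope is much larger when `electricity` is also in the model, even though the overall fit changes little. A common rule of thumb treats VIF values above about 5 or 10 as a warning sign.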