setwd("~/Downloads/Data 101 Course materials/Data Sets")

countries <- read.csv("AllCountries.csv")
head(countries)
##          Country Code LandArea Population Density   GDP Rural  CO2 PumpPrice
## 1    Afghanistan  AFG   652.86     37.172    56.9   521  74.5 0.29      0.70
## 2        Albania  ALB    27.40      2.866   104.6  5254  39.7 1.98      1.36
## 3        Algeria  DZA  2381.74     42.228    17.7  4279  27.4 3.74      0.28
## 4 American Samoa  ASM     0.20      0.055   277.3    NA  12.8   NA        NA
## 5        Andorra  AND     0.47      0.077   163.8 42030  11.9 5.83        NA
## 6         Angola  AGO  1246.70     30.810    24.7  3432  34.5 1.29      0.97
##   Military Health ArmedForces Internet  Cell HIV Hunger Diabetes BirthRate
## 1     3.72   2.01         323     11.4  67.4  NA   30.3      9.6      32.5
## 2     4.08   9.51           9     71.8 123.7 0.1    5.5     10.1      11.7
## 3    13.81  10.73         317     47.7 111.0 0.1    4.7      6.7      22.3
## 4       NA     NA          NA       NA    NA  NA     NA       NA        NA
## 5       NA  14.02          NA     98.9 104.4  NA     NA      8.0        NA
## 6     9.40   5.43         117     14.3  44.7 1.9   23.9      3.9      41.3
##   DeathRate ElderlyPop LifeExpectancy FemaleLabor Unemployment Energy
## 1       6.6        2.6           64.0        50.3          1.5     NA
## 2       7.5       13.6           78.5        55.9         13.9    808
## 3       4.8        6.4           76.3        16.4         12.1   1328
## 4        NA         NA             NA          NA           NA     NA
## 5        NA         NA             NA          NA           NA     NA
## 6       8.4        2.5           61.8        76.4          7.3    545
##   Electricity Developed
## 1          NA        NA
## 2        2309         1
## 3        1363         1
## 4          NA        NA
## 5          NA        NA
## 6         312         1
colSums(is.na(countries))
##        Country           Code       LandArea     Population        Density 
##              0              0              8              1              8 
##            GDP          Rural            CO2      PumpPrice       Military 
##             30              3             13             50             67 
##         Health    ArmedForces       Internet           Cell            HIV 
##             29             49             13             15             81 
##         Hunger       Diabetes      BirthRate      DeathRate     ElderlyPop 
##             52             10             15             15             24 
## LifeExpectancy    FemaleLabor   Unemployment         Energy    Electricity 
##             18             30             30             82             76 
##      Developed 
##             75

Simple Linear Regression (Fitting and Interpretation): Using the AllCountries dataset, fit a simple linear regression model to predict LifeExpectancy (average life expectancy in years) based on GDP (gross domestic product per capita in $US).

simple_model <- lm(LifeExpectancy ~ GDP, data = countries)
simple_model
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
## 
## Coefficients:
## (Intercept)          GDP  
##   6.842e+01    2.476e-04
summary(simple_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

Report the intercept and slope coefficients and interpret their meaning in the context of the dataset. What does the R2 value tell you about how well GDP explains variation in life expectancy across countries?

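The intercept (the predicted value of y when x is 0) is 68.42. In context, the model predicts a life expectancy of 68.42 years for a country with a GDP per capita of $0. No country has a GDP of zero, so this value is an extrapolation that mainly anchors the regression line.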

The slope is 0.0002476. For every additional dollar of GDP per capita, predicted life expectancy increases by 0.0002476 years, or about 0.25 years for every additional $1,000.

The R-squared value of 0.4304 (adjusted: 0.4272) means that GDP alone explains about 43% of the variation in life expectancy across countries. The remaining 57% or so is not explained by this model.
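To make the slope easier to read, the reported coefficients can be rescaled. The short sketch below uses the rounded estimates copied from the summary output above rather than the fitted object, so the results are approximate. The variable names and the GDP value of $10,000 are chosen here purely for illustration.

```r
# Rounded coefficients copied from summary(simple_model) above
intercept <- 68.42
slope_per_dollar <- 2.476e-04

# Change in predicted life expectancy per additional $1,000 of GDP
slope_per_1000 <- slope_per_dollar * 1000

# Predicted life expectancy at a hypothetical GDP per capita of $10,000
pred_10000 <- intercept + slope_per_dollar * 10000

slope_per_1000
pred_10000
```

With the fitted object available, predict(simple_model, newdata = data.frame(GDP = 10000)) should give essentially the same prediction without the rounding.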

Multiple Linear Regression (Fitting and Interpretation): Fit a multiple linear regression model to predict LifeExpectancy using GDP, Health (percentage of government expenditures on healthcare), and Internet (percentage of population with internet access) as predictors.

multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = countries)
multiple_model
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = countries)
## 
## Coefficients:
## (Intercept)          GDP       Health     Internet  
##   5.908e+01    2.367e-05    2.479e-01    1.903e-01
summary(multiple_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = countries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

Interpret the coefficient for Health, explaining what it means in terms of life expectancy while controlling for GDP and Internet. How does the adjusted R2 compare to the simple regression model from Question 1, and what does this suggest about the additional predictors?

The coefficient for Health is 0.248, which means that when the percentage of government expenditures on healthcare increases by 1 percentage point, predicted life expectancy increases by about 0.248 years, holding GDP and Internet constant.

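The adjusted R-squared increased from about 0.43 to about 0.72, which is a substantial increase. The model with Health and Internet explains about 72% of the variation in life expectancy, roughly 29 percentage points more than GDP alone, which suggests the additional predictors carry information that GDP does not. GDP itself is no longer statistically significant (p = 0.302) once Health and Internet are included.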
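As a check on how the adjusted values relate to the plain R-squared values, the standard adjustment formula can be applied to the numbers reported in the two summaries. This is a sketch using rounded inputs; the function name adj_r2 is introduced here for illustration, and the sample sizes are inferred from the residual degrees of freedom in the output.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# n = residual degrees of freedom + number of estimated coefficients
n_simple   <- 177 + 2   # intercept + GDP
n_multiple <- 169 + 4   # intercept + GDP + Health + Internet

adj_simple   <- adj_r2(0.4304, n_simple, 1)
adj_multiple <- adj_r2(0.7213, n_multiple, 3)

round(c(simple = adj_simple, multiple = adj_multiple), 4)
```

Note that the two models are fitted on different numbers of countries (179 and 173) because of missing values, so the comparison is not made on exactly the same data.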

Checking Assumptions (Homoscedasticity and Normality): For the simple linear regression model from Question 1 (LifeExpectancy ~ GDP), describe how you would check the assumptions of homoscedasticity and normality of residuals. For each assumption, explain what an ideal outcome would look like and what a violation might indicate about the model’s reliability for predicting life expectancy. Afterwards, code your answer and reflect if it matched the ideal outcome.

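To check homoscedasticity, I would plot the residuals against the fitted values and look at the Scale–Location plot. An ideal outcome is a random cloud of points with roughly constant spread around the line. A funnel shape would be a violation and would mean the standard errors and p-values are less trustworthy. To check normality, I would use a Q–Q plot of the residuals. An ideal outcome is points falling along the diagonal line. Large departures in the tails would mean that intervals around predicted life expectancy are less reliable. In either case, a violation would indicate that the model is not as reliable for predicting life expectancy and would need to be adjusted.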
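The visual checks can be complemented with numeric ones. The sketch below is an assumption added for illustration, not part of the original analysis: check_assumptions is a hypothetical helper that runs a Shapiro-Wilk test on the residuals and a simplified studentized Breusch-Pagan style check (squared residuals regressed on the fitted values). It is demonstrated on simulated stand-in data whose noise grows with x.

```r
# Hypothetical helper: numeric companions to the diagnostic plots
check_assumptions <- function(model) {
  res <- resid(model)
  fit <- fitted(model)
  n <- length(res)

  # Normality: Shapiro-Wilk test on the residuals
  shapiro_p <- shapiro.test(res)$p.value

  # Constant variance: regress squared residuals on the fitted values;
  # n * R^2 of this auxiliary regression is compared with chi-squared(1)
  sq <- res^2
  aux <- lm(sq ~ fit)
  bp_stat <- n * summary(aux)$r.squared
  bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

  c(shapiro_p = shapiro_p, bp_p = bp_p)
}

# Simulated stand-in data (not AllCountries): noise grows with x
set.seed(1)
x <- runif(200, 0, 10)
y <- 2 + 3 * x + rnorm(200, sd = 1 + x)
demo_model <- lm(y ~ x)

checks <- check_assumptions(demo_model)
checks
```

Small p-values suggest a violation of the corresponding assumption. On the real data the call would be check_assumptions(simple_model), and the plots should still be read alongside the numbers.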

par(mfrow=c(2,2)); plot(simple_model); par(mfrow=c(1,1))

Residuals vs Fitted: The residuals mostly follow the red line, with a high concentration of points around fitted values of 70 and fewer points as the fitted values get higher. Thus, linearity looks reasonably good. The spread around the line is fairly even, which indicates the homoscedasticity assumption is generally met. There is slight variation in the spread, suggesting mild heteroscedasticity, but it is not severe enough to invalidate the model.

Scale–Location: A cloud-like spread with more variation as the fitted values increase. This indicates some heteroscedasticity (the variance grows for higher predicted life expectancy).

Q–Q plot: The tails deviate; the left tail and the right tail are both a bit low. This means that the residuals are not perfectly normal, but they generally follow the line, so the normality assumption is approximately met.

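Residuals vs Leverage: The dashed Cook's distance lines are far from the red line in places, especially toward the higher leverage values. Points falling beyond the dashed lines would be influential observations worth a closer look.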
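Influence can also be checked numerically with Cook's distance. The following is a minimal sketch, assuming the common 4 / n rule of thumb as a cutoff; flag_influential is a hypothetical helper, and the data are a small stand-in with one planted outlying point rather than the AllCountries values.

```r
# Hypothetical helper: flag observations whose Cook's distance exceeds 4 / n,
# a common rule of thumb (not a strict cutoff)
flag_influential <- function(model) {
  cd <- cooks.distance(model)
  cd[cd > 4 / length(cd)]
}

# Stand-in data: a clean linear trend plus one planted outlying point (row 21)
x <- c(1:20, 40)
y <- c(2 * (1:20) + rep(c(-0.5, 0.5), 10), 0)
demo_model <- lm(y ~ x)

flagged <- flag_influential(demo_model)
names(flagged)
```

On the real data, flag_influential(simple_model) would return the row names of any flagged countries, which can then be looked up in countries$Country.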

plot(countries$GDP, countries$LifeExpectancy,
     xlab="GDP", ylab="Life Expectancy", main="Life Expectancy vs. GDP")
abline(simple_model, col=1, lwd=2)

While there is a somewhat positive linear trend, with life expectancy increasing as GDP increases, many points lie far from the regression line. The model should be adjusted to improve the accuracy of its predictions.
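One possible adjustment, offered here as an assumption rather than something the original analysis tested, is to model life expectancy against the logarithm of GDP. The sketch below uses simulated stand-in data in which that relationship holds by construction, so it only illustrates how the two fits could be compared.

```r
set.seed(42)
# Simulated stand-in data (not the AllCountries values): life expectancy
# rises with the logarithm of GDP, plus random noise
gdp <- exp(seq(log(500), log(80000), length.out = 100))
life <- 30 + 5 * log(gdp) + rnorm(100, sd = 2)

linear_fit <- lm(life ~ gdp)
log_fit    <- lm(life ~ log(gdp))

# Compare the share of variation explained by each fit
r2 <- c(linear = summary(linear_fit)$r.squared,
        log    = summary(log_fit)$r.squared)
round(r2, 3)
```

On the real data the equivalent call would be lm(LifeExpectancy ~ log(GDP), data = countries), followed by the same diagnostic plots. Whether it actually improves the fit would need to be checked.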

Diagnosing Model Fit (RMSE and Residuals): For the multiple regression model from Question 2 (LifeExpectancy ~ GDP + Health + Internet), calculate the RMSE and explain what it represents in the context of predicting life expectancy. How would large residuals for certain countries (e.g., those with unusually high or low life expectancy) affect your confidence in the model’s predictions, and what might you investigate further?

# Calculate residuals
residuals_simple <- resid(simple_model)

# Calculate RMSE for simple model
rmse_simple <- sqrt(mean(residuals_simple^2))
rmse_simple
## [1] 5.868172
# For multiple model

# Calculate residuals
residuals_multiple <- resid(multiple_model)

# Calculate RMSE for multiple model
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 4.056417

Simple model: RMSE = 5.868 years, so predictions are typically off by about 5.9 years.

Multiple model: RMSE = 4.056 years, so predictions are typically off by about 4.1 years. The error drops by about 1.8 years.
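These RMSE values are slightly smaller than the residual standard errors in the summaries (5.901 and 4.104) because RMSE divides the squared residuals by the number of observations, while the residual standard error divides by the residual degrees of freedom. The sketch below reproduces the RMSE values from the rounded summary figures; rse_to_rmse is a name introduced here for illustration.

```r
# Convert a residual standard error (divides by degrees of freedom)
# into an RMSE (divides by the number of observations)
rse_to_rmse <- function(rse, df_resid, n) {
  rse * sqrt(df_resid / n)
}

# n = residual degrees of freedom + number of estimated coefficients
rmse_simple_check   <- rse_to_rmse(5.901, 177, 179)
rmse_multiple_check <- rse_to_rmse(4.104, 169, 173)

round(c(simple = rmse_simple_check, multiple = rmse_multiple_check), 3)
```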

Overall, the multiple model improves the accuracy. It includes additional relevant factors (government spending on health and internet usage) that are associated with life expectancy, and it both raises the adjusted R-squared and reduces the RMSE.

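Large residuals for certain countries would lower my confidence in the model’s predictions for those countries. To address this, I would investigate what those countries have in common, then revise the model by adding further predictors that prove statistically significant (low p-values, marked ***) and recheck the fit.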
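A first step in that investigation is to list which observations have the largest residuals. The helper below is hypothetical and shown on a toy data frame, not the AllCountries data. It relies on the fact that lm() drops rows with missing values and keeps the original row names on the residuals, so those names can be used to look up labels.

```r
# Hypothetical helper: the k observations with the largest absolute residuals
largest_residuals <- function(model, data, label_col, k = 5) {
  res <- resid(model)
  # names(res) are the row names of the rows lm() actually used
  rows <- names(res)
  out <- data.frame(
    label = data[rows, label_col],
    fitted = fitted(model),
    residual = res,
    row.names = NULL
  )
  out <- out[order(-abs(out$residual)), ]
  out[seq_len(min(k, nrow(out))), ]
}

# Toy stand-in data with one missing x value (row D is dropped by lm)
toy <- data.frame(
  name = c("A", "B", "C", "D", "E", "F"),
  x = c(1, 2, 3, NA, 5, 6),
  y = c(2.1, 3.9, 6.2, 8.0, 20, 12.1)
)
toy_model <- lm(y ~ x, data = toy)

top <- largest_residuals(toy_model, toy, "name", k = 2)
top
```

On the real data the call would be largest_residuals(multiple_model, countries, "Country"), assuming countries still has the default row names assigned by read.csv.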

Hypothetical Example (Multicollinearity in Multiple Regression): Suppose you are analyzing the AllCountries dataset and fit a multiple linear regression model to predict CO2 emissions (metric tons per capita) using Energy (kilotons of oil equivalent) and Electricity (kWh per capita) as predictors. You notice that Energy and Electricity are highly correlated. Explain how this multicollinearity might affect the interpretation of the regression coefficients and the reliability of the model.

This multicollinearity would make it difficult to separate the individual effects of Energy and Electricity, so it would be hard to tell whether one matters more than the other. It can make the coefficient estimates unstable and inflate their standard errors, which makes the p-values unreliable and the model harder to interpret reliably.
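A common way to quantify this is the variance inflation factor (VIF), which for each predictor equals 1 / (1 - R²) from regressing that predictor on the others. The sketch below computes it with base R only. vif_manual is a hypothetical helper, and the data are simulated so that the two predictors are nearly identical; they are not the actual Energy and Electricity columns.

```r
# Hypothetical helper: variance inflation factor for each predictor,
# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j
# on the remaining predictors
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Simulated stand-in data: two nearly identical predictors
set.seed(7)
energy <- rnorm(100)
electricity <- energy + rnorm(100, sd = 0.1)
sim <- data.frame(energy, electricity)

vifs <- vif_manual(sim, c("energy", "electricity"))
round(vifs, 1)
cor(sim$energy, sim$electricity)
```

As a rule of thumb, VIF values above roughly 5 to 10 are often taken as a sign of problematic multicollinearity. On the real data the call would be vif_manual(countries, c("Energy", "Electricity")).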