Homework 9

Question 1

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(car)

## Warning: package 'car' was built under R version 4.5.2

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.5.2

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

df <- read.csv("AllCountries.csv")

Question 2

simple_model <- lm(LifeExpectancy ~ GDP, data = df)

summary(simple_model)

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

The intercept is 68.42, so the model predicts a life expectancy of about 68.4 years when GDP per capita is zero. That value lies outside the range of realistic GDP values, so the intercept is an extrapolation rather than a meaningful estimate. The slope is 2.476e-04: each additional dollar of GDP per capita is associated with an increase of about 0.00025 years in predicted life expectancy, so a $10,000 increase in GDP per capita predicts roughly a 2.5-year increase. The R² of 0.43 shows that GDP explains about 43% of the variation in life expectancy across countries, so GDP is an important predictor of life expectancy, but not the only one.
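The arithmetic behind the $10,000 interpretation can be checked by hand. This is a minimal sketch that hard-codes the two coefficients reported in the summary output above; the variable names and the $10,000 starting value are illustrative only and are not part of the original analysis.

```r
# Coefficients copied from summary(simple_model) above
intercept <- 68.42      # predicted life expectancy at GDP = 0
slope     <- 2.476e-04  # change in predicted years per 1 dollar of GDP per capita

# Predicted change in life expectancy for a $10,000 increase in GDP per capita
delta_10k <- slope * 10000

# Predicted life expectancy at two hypothetical GDP values (illustrative inputs)
pred_10k <- intercept + slope * 10000
pred_20k <- intercept + slope * 20000

delta_10k
pred_20k - pred_10k
```

Both printed values are the same because the model is linear: any $10,000 step in GDP shifts the prediction by the same amount, regardless of the starting point.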

Question 3

multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = df)

summary(multiple_model)

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

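Holding GDP and Internet constant, Health is a significant positive predictor of life expectancy (p = 0.000247): a one-unit increase in Health (one percentage point) is associated with roughly 0.25 additional years of life expectancy. With Health and Internet added, the adjusted R² rises from 0.4272 to 0.7164, an increase of about 29 percentage points. Both new predictors are statistically significant and should be included in the model, while GDP is no longer significant (p = 0.30) once they are included. The two models were fit on slightly different samples (179 and 173 countries) because of missing values, so the comparison is approximate.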
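One way to make the comparison between the two models exact is to fit both on the same complete cases and then run a nested F-test. The sketch below uses a simulated data frame as a stand-in, since AllCountries.csv is not reproduced here; the data frame `sim`, its values, and the object names are all hypothetical, and only the column names follow the homework.

```r
# Simulated stand-in data (hypothetical values, NOT AllCountries.csv)
set.seed(1)
n <- 200
sim <- data.frame(
  GDP      = rexp(n, rate = 1 / 15000),
  Health   = runif(n, 2, 25),
  Internet = runif(n, 0, 95)
)
sim$LifeExpectancy <- 59 + 0.25 * sim$Health + 0.19 * sim$Internet + rnorm(n, sd = 4)
sim$Internet[sample(n, 10)] <- NA  # mimic missing values in one predictor

# Keep only rows that are complete for every variable in the larger model
vars <- c("LifeExpectancy", "GDP", "Health", "Internet")
cc <- sim[complete.cases(sim[, vars]), ]

# Fit both models on the same rows so they are directly comparable
small <- lm(LifeExpectancy ~ GDP, data = cc)
big   <- lm(LifeExpectancy ~ GDP + Health + Internet, data = cc)

adj_small <- summary(small)$adj.r.squared
adj_big   <- summary(big)$adj.r.squared

# Nested F-test: do Health and Internet jointly improve the fit?
nested_test <- anova(small, big)
nested_test
```

With the real data, replacing `sim` with `df` would apply the same steps to the models from Questions 2 and 3.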

Question 4

To check the assumptions of homoscedasticity and normality for the simple model, we use three diagnostic plots: residuals vs fitted, scale-location, and Q-Q residuals. For homoscedasticity, we look at the residuals vs fitted plot and the scale-location plot. Ideally, the residuals vs fitted plot shows the residuals scattered randomly around 0 with no pattern, and the scale-location plot shows a flat red line with a consistent spread of points across the fitted values. For normality, we look at the Q-Q residuals plot, where the points should follow the line closely.

par(mfrow=c(2,2)); plot(simple_model); par(mfrow=c(1,1))

For homoscedasticity, we look at the residuals vs fitted plot and the scale-location plot. The residuals vs fitted plot shows a curved pattern, which suggests that the relationship between life expectancy and GDP is not perfectly linear. The scale-location plot shows an upward trend at the higher fitted values, which indicates heteroscedasticity. For normality, we look at the Q-Q plot: the points deviate from the line at both the lower and upper ends, indicating slightly non-normal residuals.
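The visual checks can be backed up with formal tests. The sketch below is not part of the original analysis: it runs a Shapiro-Wilk test on the residuals for normality and, if the car package is available, `ncvTest()` for non-constant variance. It uses simulated data with a curved relationship built in; the variables `x`, `y`, and `fit` are hypothetical stand-ins for GDP, LifeExpectancy, and `simple_model`.

```r
# Simulated stand-in data with a deliberately curved relationship (hypothetical)
set.seed(2)
x <- rexp(150, rate = 1 / 15000)
y <- 60 + 6 * log1p(x / 1000) + rnorm(150, sd = 3)

fit <- lm(y ~ x)

# Normality of residuals: a small p-value suggests non-normal residuals
sw <- shapiro.test(resid(fit))
sw

# Non-constant variance score test from car, run only if the package is installed
if (requireNamespace("car", quietly = TRUE)) {
  print(car::ncvTest(fit))
}
```

These tests complement the plots rather than replace them: a curved residual pattern points to a misspecified functional form, which neither test is designed to detect directly.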

Question 5

residuals_simple <- resid(simple_model)

rmse_simple <- sqrt(mean(residuals_simple^2))
rmse_simple

## [1] 5.868172

residuals_multiple <- resid(multiple_model)

rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple

## [1] 4.056417

We see an RMSE of ~4.06 for the multiple model, meaning its in-sample predictions of life expectancy are typically off by about 4.06 years. This is an improvement over the simple model, which has an RMSE of ~5.87. However, large residuals for countries with unusually high or low life expectancy would suggest the model is missing other key predictors. Further investigation should focus on outlier countries that disproportionately affect the model and on potentially important predictors that are not included in it.
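The RMSE values above are slightly smaller than the residual standard errors in the summary output (5.901 and 4.104) because RMSE divides the residual sum of squares by n, while the residual standard error divides by the residual degrees of freedom. The sketch below shows the conversion using only numbers reported earlier; the helper `rmse_from_rse` is a hypothetical name introduced for illustration.

```r
# Convert a residual standard error to RMSE:
#   RSE  = sqrt(RSS / df),  RMSE = sqrt(RSS / n)  =>  RMSE = RSE * sqrt(df / n)
rmse_from_rse <- function(rse, df, n) {
  rse * sqrt(df / n)
}

# Simple model: RSE 5.901 on 177 df, n = 177 + 2 coefficients = 179
rmse_simple_check <- rmse_from_rse(5.901, df = 177, n = 179)

# Multiple model: RSE 4.104 on 169 df, n = 169 + 4 coefficients = 173
rmse_multiple_check <- rmse_from_rse(4.104, df = 169, n = 173)

round(c(simple = rmse_simple_check, multiple = rmse_multiple_check), 2)
```

Rounded to two decimals, both values agree with the RMSEs computed directly from the residuals; the small remaining difference comes from the rounding of the residual standard errors in the printed summary.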

Question 6

Because energy and electricity are highly correlated with each other, the model would struggle to separate their individual contributions to CO2 emissions. As a result, the coefficients for energy and electricity would likely have inflated standard errors and be unstable, making them hard to interpret. While this wouldn’t harm the overall prediction accuracy of the model, the researcher could not trust the individual coefficients, which makes it difficult to determine how important each variable is.
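The effect described above can be illustrated with a variance inflation factor (VIF). The sketch below uses simulated data in which electricity is built to be nearly collinear with energy; the variables `energy`, `electricity`, and `co2` and all of their values are hypothetical, not taken from the homework data. The VIF is computed by hand in base R as 1 / (1 - R²) from regressing one predictor on the other, which is the same quantity `car::vif()` reports for a model with these two predictors.

```r
# Simulated stand-in data (hypothetical): electricity is nearly a multiple of energy
set.seed(3)
n <- 100
energy      <- rnorm(n, mean = 100, sd = 20)
electricity <- 0.9 * energy + rnorm(n, sd = 2)
co2         <- 5 + 0.3 * energy + 0.2 * electricity + rnorm(n, sd = 5)

fit <- lm(co2 ~ energy + electricity)

# VIF for energy: how much the variance of its coefficient is inflated
# by its correlation with the other predictor
r2_energy  <- summary(lm(energy ~ electricity))$r.squared
vif_energy <- 1 / (1 - r2_energy)

vif_energy
summary(fit)$coefficients  # note the large standard errors on both slopes
summary(fit)$r.squared     # overall fit is unaffected by the collinearity
```

A common rule of thumb treats a VIF above 5 or 10 as a sign of problematic multicollinearity. In that case, dropping one of the two predictors or combining them is a typical remedy.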