Homework 9

Problem 1

Using the AllCountries dataset, fit a simple linear regression model to predict LifeExpectancy (average life expectancy in years) based on GDP (gross domestic product per capita in $US). Report the intercept and slope coefficients and interpret their meaning in the context of the dataset. What does the R² value tell you about how well GDP explains variation in life expectancy across countries?

lin_model <- lm(LifeExpectancy ~ GDP, data=all_countries)

summary(lin_model)

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = all_countries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

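The intercept of 68.42 is the predicted life expectancy, in years, for a country with a GDP per capita of zero. No country has that, so it serves mainly as a baseline. The slope of 2.476e-04 means each additional dollar of GDP per capita is associated with a 0.0002476-year increase in predicted life expectancy, or about 0.25 years per 1,000 dollars. The relationship is positive and statistically significant (p < 2e-16). The R$^2$ value of 0.4304 indicates that about 43% of the variance in LifeExpectancy across countries is explained by GDP (the adjusted R$^2$ is 0.4272), so more than half of the variation is left unexplained.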
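As a quick check on that interpretation, the sketch below rescales the slope and computes one prediction by hand. The estimates are copied from the summary output above; the GDP value of 10,000 is a hypothetical chosen only for illustration.

```r
# Estimates copied from the summary(lin_model) output above
intercept <- 6.842e+01   # years
slope     <- 2.476e-04   # years per dollar of GDP per capita

# Rescale the slope to a more readable unit: years per 1,000 dollars
slope_per_1000 <- slope * 1000

# Predicted life expectancy at a hypothetical GDP per capita of 10,000
gdp_example <- 10000
predicted   <- intercept + slope * gdp_example

slope_per_1000   # 0.2476
predicted        # 70.896
```

With the fitted model in memory, predict(lin_model, newdata = data.frame(GDP = 10000)) should give essentially the same value, up to the rounding of the printed coefficients.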

Problem 2

Fit a multiple linear regression model to predict LifeExpectancy using GDP, Health (percentage of government expenditures on healthcare), and Internet (percentage of population with internet access) as predictors. Interpret the coefficient for Health, explaining what it means in terms of life expectancy while controlling for GDP and Internet. How does the adjusted R² compare to the simple regression model from Question 1, and what does this suggest about the additional predictors?

multi_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data=all_countries)

summary(multi_model)

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = all_countries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

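The coefficient for Health is 0.2479: holding GDP and Internet constant, a one-percentage-point increase in the share of government expenditures spent on healthcare is associated with an increase of about 0.25 years in predicted life expectancy (p = 0.000247). All three coefficients are positive, but GDP is no longer significant once Health and Internet are in the model (p = 0.30). Adjusted R$^2$ rose by about 29 percentage points, from 0.4272 to 0.7164, meaning roughly 72% of the variation in LifeExpectancy can be explained by GDP, Health, and Internet together. This suggests that Health and Internet add explanatory information beyond GDP. The two models were fit on slightly different samples (38 versus 44 observations dropped for missingness), so the comparison is approximate.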
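The adjusted R$^2$ values can be reproduced from the printed multiple R$^2$ values with the standard adjustment formula. This is a sketch that assumes the sample sizes implied by the outputs (residual degrees of freedom plus the number of estimated coefficients: 177 + 2 = 179 and 169 + 4 = 173). The helper name adj_r2 is my own.

```r
# Adjusted R^2 from R^2, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Inputs copied from the two summary() outputs
simple_adj <- adj_r2(0.4304, n = 179, p = 1)
multi_adj  <- adj_r2(0.7213, n = 173, p = 3)

# Gain in adjusted R^2, in proportion units (multiply by 100 for points)
gain <- multi_adj - simple_adj

round(c(simple = simple_adj, multiple = multi_adj, gain = gain), 4)
```

Because the adjustment penalizes each extra predictor, a gain this large is unlikely to come only from adding more terms to the model.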

Problem 3

For the simple linear regression model from Question 1 (LifeExpectancy ~ GDP), describe how you would check the assumptions of homoscedasticity and normality of residuals. For each assumption, explain what an ideal outcome would look like and what a violation might indicate about the model’s reliability for predicting life expectancy. Afterward, code your answer and reflect on whether it matched the ideal outcome.

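For the simple linear regression model in Problem 1, I would check homoscedasticity with the Residuals vs Fitted and Scale-Location plots, and normality of residuals with the Normal Q-Q plot. For homoscedasticity, the ideal outcome is a band of residuals scattered randomly around zero with roughly constant spread, no fanning, and a nearly flat trend line. For normality, the ideal outcome is points that hug the diagonal reference line of the Q-Q plot with little or no curve. Fanning would mean the standard errors, and therefore the p-values and prediction intervals, are unreliable. Departures from the Q-Q line would mean that tests and intervals based on normally distributed errors are less trustworthy, so the model's predictions of life expectancy would carry more uncertainty than the output suggests.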

par(mfrow=c(2,2)); plot(lin_model); par(mfrow=c(1,1))

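Ultimately, the outcome is fairly close to ideal, although the trend lines in the Residuals vs Fitted and Residuals vs Leverage plots start off very steep and then change direction, which may indicate that the relationship between GDP and LifeExpectancy is not strictly linear.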
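The visual checks can be backed up with formal tests. The sketch below uses simulated stand-in data, not the AllCountries dataset, so that it runs on its own; the names gdp, life, and sim_model are assumptions for illustration. It applies the Shapiro-Wilk test for normality and a hand-computed studentized Breusch-Pagan (Koenker) test for constant variance using base R only.

```r
# Simulated stand-in data (NOT the AllCountries dataset)
set.seed(1)
n    <- 200
gdp  <- runif(n, min = 500, max = 60000)
life <- 68 + 2.5e-4 * gdp + rnorm(n, sd = 6)
sim_model <- lm(life ~ gdp)

# Normality of residuals: Shapiro-Wilk test
# A small p-value is evidence against normally distributed residuals
sw <- shapiro.test(resid(sim_model))

# Homoscedasticity: studentized Breusch-Pagan (Koenker) test
# Regress squared residuals on the predictor; n * R^2 is compared
# to a chi-squared distribution with 1 df (one predictor)
aux     <- lm(resid(sim_model)^2 ~ gdp)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

sw$p.value
bp_p
```

To apply this to the homework model, replace sim_model with lin_model and regress the squared residuals on the GDP values actually used in the fit, for example model.frame(lin_model)$GDP, so that the rows dropped for missingness stay aligned.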

Problem 4

For the multiple regression model from Question 2 (LifeExpectancy ~ GDP + Health + Internet), calculate the RMSE and explain what it represents in the context of predicting life expectancy. How would large residuals for certain countries (e.g., those with unusually high or low life expectancy) affect your confidence in the model’s predictions, and what might you investigate further?

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

crPlots(multi_model)

multiple_rmse <- sqrt(mean(resid(multi_model)^2))
multiple_rmse

## [1] 4.056417

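The returned RMSE was about 4.06, meaning predictions from the multiple regression model typically miss the observed life expectancy by about 4 years. Because RMSE squares each residual, a few countries with large residuals inflate it disproportionately. Large residuals for countries with unusually high or low life expectancy would lower my confidence in the model's predictions for similar countries, since the model is systematically missing something about them. I would investigate those countries for data errors, high leverage or influence on the fit, and predictors missing from the model.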
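The RMSE of 4.056 differs slightly from the residual standard error of 4.104 in the summary output because RMSE divides by the number of observations while the residual standard error divides by the residual degrees of freedom. A short check, assuming n = 173 (169 residual degrees of freedom plus 4 estimated coefficients):

```r
# Values copied from the outputs above
rmse <- 4.056417   # sqrt(mean(resid(multi_model)^2))
n    <- 173        # assumed: 169 residual df + 4 coefficients
df   <- 169        # residual degrees of freedom

# Residual standard error = sqrt(SSE / df), while RMSE = sqrt(SSE / n),
# so the two differ by a factor of sqrt(n / df)
rse <- rmse * sqrt(n / df)

round(rse, 3)   # 4.104
```

The two measures agree closely here because the sample is large relative to the number of predictors.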

Problem 5

Suppose you are analyzing the AllCountries dataset and fit a multiple linear regression model to predict CO2 emissions (metric tons per capita) using Energy (kilotons of oil equivalent) and Electricity (kWh per capita) as predictors. You notice that Energy and Electricity are highly correlated. Explain how this multicollinearity might affect the interpretation of the regression coefficients and the reliability of the model.

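Multicollinearity would affect the regression coefficients because when Energy and Electricity are highly correlated, the model cannot cleanly separate their individual effects on CO2 emissions. The coefficient estimates become unstable and their standard errors inflate, so a coefficient can change size or even sign with small changes in the data, and a predictor that matters can appear insignificant. The model's overall fit and predictions may still be reasonable, but the interpretation of each coefficient as the effect of one predictor while holding the other constant is unreliable.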
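A small simulation illustrates the effect. The data below are simulated, not the AllCountries variables; the names energy, electricity, and co2 are placeholders chosen to mirror the problem. The variance inflation factor (VIF) is computed by hand as 1 / (1 - R$^2$), where R$^2$ comes from regressing one predictor on the other.

```r
# Simulated data with two highly correlated predictors
# (placeholder names; NOT the AllCountries variables)
set.seed(42)
n           <- 200
energy      <- rnorm(n)
electricity <- energy + rnorm(n, sd = 0.1)   # nearly a copy of energy
co2         <- 1 + 2 * energy + 1 * electricity + rnorm(n)

both  <- lm(co2 ~ energy + electricity)
alone <- lm(co2 ~ energy)

# VIF for energy: 1 / (1 - R^2) from regressing it on the other predictor
r2_energy  <- summary(lm(energy ~ electricity))$r.squared
vif_energy <- 1 / (1 - r2_energy)

# Standard error of the energy coefficient with and without electricity
se_both  <- coef(summary(both))["energy", "Std. Error"]
se_alone <- coef(summary(alone))["energy", "Std. Error"]

vif_energy
c(with_electricity = se_both, without_electricity = se_alone)
```

A common rule of thumb treats a VIF above 5 or 10 as a sign of problematic collinearity. On a fitted model, the vif() function from the car package loaded in Problem 4 reports the same quantity.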