Homework 9 - Data 101

Loading the Dataset

library(tidyverse)
setwd("C:/Users/sarah/Downloads")

countries <- read_csv("AllCountries.csv")
## Rows: 217 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): Country, Code
## dbl (24): LandArea, Population, Density, GDP, Rural, CO2, PumpPrice, Militar...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(countries)
## # A tibble: 6 × 26
##   Country Code  LandArea Population Density   GDP Rural   CO2 PumpPrice Military
##   <chr>   <chr>    <dbl>      <dbl>   <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
## 1 Afghan… AFG     653.       37.2      56.9   521  74.5  0.29      0.7      3.72
## 2 Albania ALB      27.4       2.87    105.   5254  39.7  1.98      1.36     4.08
## 3 Algeria DZA    2382.       42.2      17.7  4279  27.4  3.74      0.28    13.8 
## 4 Americ… ASM       0.2       0.055   277.     NA  12.8 NA        NA       NA   
## 5 Andorra AND       0.47      0.077   164.  42030  11.9  5.83     NA       NA   
## 6 Angola  AGO    1247.       30.8      24.7  3432  34.5  1.29      0.97     9.4 
## # ℹ 16 more variables: Health <dbl>, ArmedForces <dbl>, Internet <dbl>,
## #   Cell <dbl>, HIV <dbl>, Hunger <dbl>, Diabetes <dbl>, BirthRate <dbl>,
## #   DeathRate <dbl>, ElderlyPop <dbl>, LifeExpectancy <dbl>, FemaleLabor <dbl>,
## #   Unemployment <dbl>, Energy <dbl>, Electricity <dbl>, Developed <dbl>

Fit a Simple Linear Regression Model

simple_fit <- lm(LifeExpectancy ~ GDP, data = countries) 
summary(simple_fit)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

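Intercept: 68.42. This means that when the GDP of a country is 0, the predicted life expectancy is 68.42 years. Since a GDP of 0 is not realistic, the intercept mainly anchors the line rather than describing an actual country.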

Slope: 0.0002476. This means that predicted life expectancy increases by 0.0002476 years for every one-unit increase in GDP, or about 0.25 years for every 1,000-unit increase.

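Adjusted R-squared: 0.4272. This means that about 43% of the variation in life expectancy across countries is explained by GDP (multiple R-squared is 0.4304; the adjusted value of 0.4272 applies a small penalty for the number of predictors).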

Equation: Predicted LifeExpectancy = 68.42 + 0.0002476(GDP)
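To make the equation concrete, here is a minimal sketch that plugs a few GDP values into it. The coefficients are the rounded ones from the output above; the three GDP values are hypothetical and chosen only for illustration.

```r
# Coefficients copied (rounded) from the summary(simple_fit) output above
intercept <- 68.42
slope     <- 0.0002476

# Hypothetical GDP values, chosen only for illustration
gdp_values <- c(1000, 10000, 40000)

# Apply the fitted equation: predicted = intercept + slope * GDP
predicted <- intercept + slope * gdp_values
data.frame(GDP = gdp_values, PredictedLifeExpectancy = predicted)

# Slope rescaled: change in predicted life expectancy per 1,000 units of GDP
slope_per_1000 <- slope * 1000
slope_per_1000
```

In the same R session, predict(simple_fit, newdata = data.frame(GDP = gdp_values)) should give nearly the same numbers, differing only because it uses the unrounded coefficients.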

Fit a Multiple Linear Regression Model

multiple_fit <- lm(LifeExpectancy ~ GDP + Health + Internet, data = countries)
summary(multiple_fit)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = countries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

Coefficient of Health: 0.2479. Holding GDP and Internet constant, this slope coefficient means that for every one-percentage-point increase in Health, predicted life expectancy increases by 0.2479 years.

Adjusted R-squared: 0.7164. This means that about 72% of the variation in life expectancy is explained by Health, Internet, and GDP together (multiple R-squared is 0.7213; the adjusted value is 0.7164). The adjusted R-squared of the multiple model is higher than that of the simple model by 0.2892 (0.7164 vs. 0.4272), or about 29 percentage points. This suggests that adding Health and Internet as predictors helps explain life expectancy better than GDP alone.
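As a check on the two adjusted R-squared values, the sketch below recomputes them from the reported multiple R-squared, the number of predictors, and the sample size. The helper name adj_r2 is my own, not a built-in function. The sample sizes are recovered from the residual degrees of freedom in the two outputs, which also shows that the two models were fit to slightly different sets of countries (179 vs. 173), so the comparison is approximate.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
# (adj_r2 is a hypothetical helper written for this check)
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# n = residual degrees of freedom + number of estimated coefficients
n_simple   <- 177 + 2   # 179 countries used by simple_fit
n_multiple <- 169 + 4   # 173 countries used by multiple_fit

adj_simple   <- adj_r2(0.4304, n_simple, 1)
adj_multiple <- adj_r2(0.7213, n_multiple, 3)

# Both values, then their difference, rounded to four decimals
round(c(simple = adj_simple, multiple = adj_multiple), 4)
round(adj_multiple - adj_simple, 4)
```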

Checking Assumptions

  1. Linearity - There needs to be a linear relationship between life expectancy and GDP, not a quadratic or exponential one. If this assumption is violated, the fitted line systematically over- or under-predicts life expectancy over parts of the GDP range.

  2. Independence - Observations must be independent. If residuals go up and down randomly around zero without patterns, independence is plausible. A violation can make the standard errors and significance tests misleading.

  3. Homoscedasticity (Equal Variance) - The spread of the residuals (errors) should be roughly the same across all values of GDP. A violation makes the standard errors, and therefore the p-values and confidence intervals, unreliable.

  4. Normality of Residuals - The errors (residuals) should follow a normal distribution. A violation mainly affects the reliability of p-values and confidence intervals, especially in small samples.

  5. No Multicollinearity - Predictors should not be highly correlated with each other. This assumption applies to multiple linear models; since we are only checking a simple linear model here, I will not include it in the code.

Code

1) Linearity

options(scipen = 999)
plot(countries$GDP, countries$LifeExpectancy,
     xlab="GDP", ylab="LifeExpectancy", main="LifeExpectancy vs GDP")
abline(simple_fit, col=1, lwd=2)

There is hardly any linear trend in the plot above. The trend looks more logarithmic: life expectancy rises sharply as GDP increases from low values, then levels off at around 75-85 years as GDP increases further. The linearity assumption is violated for our simple model.
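One common remedy for a curve of this shape is to regress on the logarithm of the predictor. The sketch below illustrates the idea on simulated data (not the AllCountries data), so the variable names and numbers are assumptions made only for the demonstration.

```r
set.seed(1)

# Simulated data (NOT the AllCountries data): y rises quickly at low x, then levels off
x <- exp(runif(200, min = log(500), max = log(100000)))
y <- 40 + 3.5 * log(x) + rnorm(200, sd = 3)
sim <- data.frame(x = x, y = y)

# Fit a straight line in x and a straight line in log(x)
linear_fit <- lm(y ~ x, data = sim)
log_fit    <- lm(y ~ log(x), data = sim)

# Compare how much variation each specification explains
c(linear = summary(linear_fit)$r.squared,
  log    = summary(log_fit)$r.squared)
```

For the homework data, the analogous call would be lm(LifeExpectancy ~ log(GDP), data = countries), assuming every non-missing GDP value is positive. I have not run it here, so no results are claimed.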

2) Independence

plot(resid(simple_fit), type="b", main="Residuals vs Order", ylab="Residuals")
abline(h=0, lty=2)

Residuals vs Order: Residuals bounce around zero without a downward or upward trend. However, they reach further into the negatives than into the positives (the minimum residual is -16.352, while the maximum is 9.330), which is a sign of skewness rather than of dependence.

Implication: Independence is unlikely to be violated in the simple model (LifeExpectancy ~ GDP).
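A numeric companion to the residuals vs. order plot is the Durbin-Watson statistic, which is close to 2 when consecutive residuals are uncorrelated and well below 2 when they are positively correlated. The sketch below computes it by hand on simulated residuals so that no extra package is assumed; dw_stat is a helper name I made up.

```r
set.seed(3)

# Durbin-Watson statistic: sum of squared successive differences over sum of squares
# (dw_stat is a hypothetical helper, not a built-in function)
dw_stat <- function(e) {
  e <- as.numeric(e)
  sum(diff(e)^2) / sum(e^2)
}

# Simulated residuals: independent vs. positively autocorrelated (AR(1), rho = 0.8)
independent_res <- rnorm(200)
correlated_res  <- arima.sim(model = list(ar = 0.8), n = 200)

dw_independent <- dw_stat(independent_res)
dw_correlated  <- dw_stat(correlated_res)
c(independent = dw_independent, correlated = dw_correlated)
```

In the same session, dw_stat(resid(simple_fit)) would apply this to the homework model. Because the countries are listed alphabetically, the row order carries little meaning, so this check is only a rough one.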

3) Core Diagnostics (linearity, homoscedasticity, normality, and influence)

par(mfrow=c(2,2)); plot(simple_fit); par(mfrow=c(1,1))

Residuals vs. Fitted: There is a clear pattern in this diagnostic plot. The residuals are widely spread vertically at the lower fitted values, but the spread narrows as fitted values increase, and the residuals trend downward. The downward trend shows that linearity is violated, and the uneven spread around zero suggests that homoscedasticity is violated as well.

Scale–Location: As in the Residuals vs. Fitted plot, the points are crowded horizontally at the lower fitted values and spread out vertically there. This means that for low-GDP countries, the residuals for life expectancy vary widely. As fitted values increase, the variance of the residuals gets slightly smaller and then rises again towards the end, as the red line suggests. Since the variance of the residuals is not constant, there is heteroscedasticity.

Q–Q plot: Residuals mainly follow the diagonal line. However, the tails diverge, especially the right tail. This means that the residuals are not perfectly normal, but they do not depart far from a normal distribution, so normality is approximately met.

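Residuals vs Leverage: Residuals drift away from zero as leverage increases. This plot is used to spot influential countries: a point with both high leverage and a large residual (beyond the Cook's distance contours) would have an outsized effect on the fitted line.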

Diagnose Model Fit for Multiple Linear Regression

# Calculate residuals
residuals_m <- resid(multiple_fit)

# Calculate RMSE for multiple model
rmse_m <- sqrt(mean(residuals_m^2))
rmse_m
## [1] 4.056417

RMSE is the root mean squared error. The RMSE is 4.056, which means the model's predictions of life expectancy are typically off by about 4 years. This number quantifies how close predictions are, on average, to the actual values of life expectancy in the dataset.
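The RMSE is closely related to the residual standard error (4.104) reported by summary(multiple_fit): both come from the same residual sum of squares, but the RMSE divides by the number of observations while the residual standard error divides by the residual degrees of freedom. The sketch below recovers the RMSE from the summary output alone, up to rounding of the reported 4.104.

```r
# Values copied from the summary(multiple_fit) output above
rse <- 4.104   # residual standard error (rounded in the output)
df  <- 169     # residual degrees of freedom
n   <- 173     # observations used: 169 + 4 estimated coefficients

# Both statistics share the same residual sum of squares (RSS)
rss  <- rse^2 * df
rmse <- sqrt(rss / n)
rmse   # about 4.056, agreeing with rmse_m up to rounding
```

Since the RMSE divides by the larger number, it is always slightly smaller than the residual standard error for the same model.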

Large residuals for certain countries reduce my confidence in the model’s predictions, because they inflate the RMSE. Countries with unusually high or low life expectancy can pull the model’s predictions away from the actual values. This is similar to the mean: if there are many outliers, the mean won’t be representative of the population. Similarly, outliers in the residuals for certain countries can damage the model’s prediction accuracy. I might investigate further by checking for outliers using a residuals vs. fitted plot. If the residuals don’t follow a pattern and no points sit far away from the rest, then outliers are unlikely to be distorting the model.
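Besides reading the plot, outliers can be flagged numerically with standardized residuals. The sketch below uses simulated data with one planted outlier (not the AllCountries data); the cutoff of 3 is a common rule of thumb, not a strict rule.

```r
set.seed(42)

# Simulated data (not the AllCountries data) with one planted outlier at row 10
x <- runif(50, 0, 10)
y <- 2 * x + rnorm(50, sd = 1)
y[10] <- y[10] + 15

fit <- lm(y ~ x)

# Standardized residuals: residuals scaled by their estimated standard deviation
std_res <- rstandard(fit)

# Rule of thumb: flag observations with |standardized residual| > 3
flagged <- which(abs(std_res) > 3)
flagged
```

The same two lines applied to rstandard(multiple_fit) would flag rows of the homework model; the names of the returned vector should correspond to row numbers in countries, which can then be looked up by Country.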

Hypothetical Example

If the Energy and Electricity variables are highly correlated, then your interpretation of their regression coefficients in a model predicting CO2 emissions may be wrong. When interpreting a regression coefficient, you hold all other variables constant. However, if two variables are highly correlated with each other, one rarely changes without the other, so their coefficient estimates depend on each other. This makes the interpretation unreliable; you aren’t truly holding all other variables constant.

This also reduces the reliability of the model. If the coefficient estimates depend on each other, their standard errors are inflated, and it becomes hard to tell which variables are actually significant predictors. You can’t tell whether Energy or Electricity is the better predictor of CO2 emissions if they are highly correlated. The model as a whole may still predict CO2 emissions reasonably well, but the individual coefficients become unstable, so conclusions about each variable’s effect are not trustworthy.
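A standard way to quantify this problem is the variance inflation factor (VIF), computed as 1 / (1 - R-squared) from regressing one predictor on the others. The sketch below computes it by hand on simulated stand-ins for Energy and Electricity, so every number is an assumption made for illustration and says nothing about the real columns. Values above roughly 5 to 10 are commonly read as a warning sign.

```r
set.seed(7)

# Simulated stand-ins for two highly correlated predictors
# (NOT the real Energy, Electricity, or CO2 columns)
n <- 200
energy      <- rnorm(n, mean = 100, sd = 20)
electricity <- 0.9 * energy + rnorm(n, sd = 3)
co2         <- 1 + 0.05 * energy + 0.05 * electricity + rnorm(n, sd = 2)

# Correlation between the two predictors
cor(energy, electricity)

# VIF for a predictor = 1 / (1 - R^2) from regressing it on the other predictors
r2_energy  <- summary(lm(energy ~ electricity))$r.squared
vif_energy <- 1 / (1 - r2_energy)
vif_energy

# Coefficient table: note the standard errors when both predictors are included
summary(lm(co2 ~ energy + electricity))$coefficients
```

For the homework data, cor(countries$Energy, countries$Electricity, use = "complete.obs") would be a first check of whether the hypothetical actually applies.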