setwd("~/Downloads/Data 101 Course materials/Data Sets")
countries <- read.csv("AllCountries.csv")
head(countries)
## Country Code LandArea Population Density GDP Rural CO2 PumpPrice
## 1 Afghanistan AFG 652.86 37.172 56.9 521 74.5 0.29 0.70
## 2 Albania ALB 27.40 2.866 104.6 5254 39.7 1.98 1.36
## 3 Algeria DZA 2381.74 42.228 17.7 4279 27.4 3.74 0.28
## 4 American Samoa ASM 0.20 0.055 277.3 NA 12.8 NA NA
## 5 Andorra AND 0.47 0.077 163.8 42030 11.9 5.83 NA
## 6 Angola AGO 1246.70 30.810 24.7 3432 34.5 1.29 0.97
## Military Health ArmedForces Internet Cell HIV Hunger Diabetes BirthRate
## 1 3.72 2.01 323 11.4 67.4 NA 30.3 9.6 32.5
## 2 4.08 9.51 9 71.8 123.7 0.1 5.5 10.1 11.7
## 3 13.81 10.73 317 47.7 111.0 0.1 4.7 6.7 22.3
## 4 NA NA NA NA NA NA NA NA NA
## 5 NA 14.02 NA 98.9 104.4 NA NA 8.0 NA
## 6 9.40 5.43 117 14.3 44.7 1.9 23.9 3.9 41.3
## DeathRate ElderlyPop LifeExpectancy FemaleLabor Unemployment Energy
## 1 6.6 2.6 64.0 50.3 1.5 NA
## 2 7.5 13.6 78.5 55.9 13.9 808
## 3 4.8 6.4 76.3 16.4 12.1 1328
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 8.4 2.5 61.8 76.4 7.3 545
## Electricity Developed
## 1 NA NA
## 2 2309 1
## 3 1363 1
## 4 NA NA
## 5 NA NA
## 6 312 1
colSums(is.na(countries))
## Country Code LandArea Population Density
## 0 0 8 1 8
## GDP Rural CO2 PumpPrice Military
## 30 3 13 50 67
## Health ArmedForces Internet Cell HIV
## 29 49 13 15 81
## Hunger Diabetes BirthRate DeathRate ElderlyPop
## 52 10 15 15 24
## LifeExpectancy FemaleLabor Unemployment Energy Electricity
## 18 30 30 82 76
## Developed
## 75
Simple Linear Regression (Fitting and Interpretation): Using the AllCountries dataset, fit a simple linear regression model to predict LifeExpectancy (average life expectancy in years) based on GDP (gross domestic product per capita in $US).
simple_model <- lm(LifeExpectancy ~ GDP, data = countries)
simple_model
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
##
## Coefficients:
## (Intercept) GDP
## 6.842e+01 2.476e-04
summary(simple_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.352 -3.882 1.550 4.458 9.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
## GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
Report the intercept and slope coefficients and interpret their meaning in the context of the dataset. What does the R2 value tell you about how well GDP explains variation in life expectancy across countries?
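The intercept (the predicted value of y when x is 0) is 68.42, so the model predicts a life expectancy of 68.42 years for a country with a GDP per capita of $0. No country has a GDP of zero, so this is an extrapolation that mainly anchors the line rather than a meaningful prediction.
The slope is 0.0002476: for every additional dollar of GDP per capita, predicted life expectancy increases by 0.0002476 years, or about 0.25 years for every additional $1,000.
The R-squared value of 0.4304 (adjusted: 0.4272) means that about 43% of the variation in life expectancy across countries is explained by GDP alone, leaving about 57% unexplained.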
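To make the slope easier to read, the fitted coefficients can be plugged into the regression equation directly. The sketch below copies the rounded estimates from the summary output (68.42 and 0.0002476) rather than refitting the model; the GDP value of $10,000 is an arbitrary illustration, not a value taken from the dataset.

```r
# Rounded estimates copied from summary(simple_model)
intercept <- 68.42
slope     <- 2.476e-04   # years of life expectancy per $1 of GDP per capita

# Expected change in life expectancy for a $1,000 increase in GDP per capita
change_per_1000 <- slope * 1000

# Predicted life expectancy for a hypothetical country with GDP per capita of $10,000
pred_10k <- intercept + slope * 10000

change_per_1000
pred_10k
```

With the data loaded, `predict(simple_model, newdata = data.frame(GDP = 10000))` should give essentially the same prediction, up to rounding of the coefficients.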
Multiple Linear Regression (Fitting and Interpretation): Fit a multiple linear regression model to predict LifeExpectancy using GDP, Health (percentage of government expenditures on healthcare), and Internet (percentage of population with internet access) as predictors.
multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = countries)
multiple_model
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = countries)
##
## Coefficients:
## (Intercept) GDP Health Internet
## 5.908e+01 2.367e-05 2.479e-01 1.903e-01
summary(multiple_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5662 -1.8227 0.4108 2.5422 9.4161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
## GDP 2.367e-05 2.287e-05 1.035 0.302025
## Health 2.479e-01 6.619e-02 3.745 0.000247 ***
## Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
Interpret the coefficient for Health, explaining what it means in terms of life expectancy while controlling for GDP and Internet. How does the adjusted R2 compare to the simple regression model from Question 1, and what does this suggest about the additional predictors?
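The coefficient for Health is 0.248, which means that when the percentage of government expenditures on healthcare increases by 1 percentage point, predicted life expectancy increases by about 0.248 years, holding GDP and Internet constant.
The adjusted R-squared increased from 0.4272 to 0.7164, a substantial increase of about 29 percentage points. This suggests that Health and Internet explain a large share of the variation in life expectancy that GDP alone does not. Once they are included, GDP is no longer statistically significant (p = 0.302). The two models are fit on slightly different samples (179 and 173 countries) because of missing values, so the comparison is approximate.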
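As a check on how adjusted R-squared penalizes extra predictors, the reported values can be recomputed from the multiple R-squared, the sample size, and the number of predictors. This is a sketch using the standard adjustment formula; the helper name `adj_r2` is made up for illustration, and the sample sizes are inferred from the residual degrees of freedom in the two summaries.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# n = residual degrees of freedom + number of estimated coefficients
n_simple   <- 177 + 2   # intercept + GDP
n_multiple <- 169 + 4   # intercept + GDP + Health + Internet

adj_simple   <- adj_r2(0.4304, n_simple,   p = 1)
adj_multiple <- adj_r2(0.7213, n_multiple, p = 3)

round(c(simple = adj_simple, multiple = adj_multiple), 4)
adj_multiple - adj_simple   # gain in adjusted R-squared
```

Because the inputs are rounded to four decimals, the recomputed values should agree with the summary output only up to rounding.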
Checking Assumptions (Homoscedasticity and Normality): For the simple linear regression model from Question 1 (LifeExpectancy ~ GDP), describe how you would check the assumptions of homoscedasticity and normality of residuals. For each assumption, explain what an ideal outcome would look like and what a violation might indicate about the model’s reliability for predicting life expectancy. Afterward, code your answer and reflect on whether it matched the ideal outcome.
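To check homoscedasticity, I would plot the residuals against the fitted values and look at the Scale-Location plot. An ideal outcome is a random cloud of points with roughly constant vertical spread around the zero line; a funnel or fan shape would be a violation. To check normality, I would use a Q-Q plot of the residuals, where an ideal outcome is the points falling along the diagonal line. A violation of either assumption would mean that the standard errors, p-values, and prediction intervals are less trustworthy, so the model would be less reliable for predicting life expectancy and would need to be adjusted.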
par(mfrow=c(2,2)); plot(simple_model); par(mfrow=c(1,1))
Residuals vs Fitted: The points mostly follow the red line, with a high concentration around a fitted value of 70 and fewer points at higher fitted values, so linearity looks reasonably good. The spread of the residuals around the line is fairly even, which indicates that the homoscedasticity assumption is generally met. There is slight variation in the spread, suggesting mild heteroscedasticity, but not severe enough to invalidate the model.
Scale–Location: A cloud-like spread with more variation as the fitted values increase, indicating some heteroscedasticity (the variance grows for high life expectancy predictions).
Q–Q plot: The tails deviate; both the left and right tails fall a bit below the line. This means that the residuals are not perfectly normal, but they generally follow the line, so the normality assumption is approximately met.
Residuals vs Leverage: The dashed Cook's distance contours sit far from the red line in places, especially toward the higher leverage values. Any points falling outside the dashed contours would be influential observations worth examining individually.
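Visual checks can be backed up with formal tests. The sketch below is run on simulated stand-in data (the variables `x` and `y` are hypothetical, not the AllCountries columns) and uses only base R: `shapiro.test()` for normality, and a Breusch-Pagan-style check computed by hand from an auxiliary regression of the squared residuals. It is one possible approach under those assumptions, not the only way to test them.

```r
set.seed(101)

# Simulated stand-in data (hypothetical; not the AllCountries values)
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)   # constant-variance, normal errors
fit <- lm(y ~ x)

# Normality: Shapiro-Wilk test on the residuals
sw <- shapiro.test(resid(fit))

# Homoscedasticity: Breusch-Pagan-style test done by hand.
# Regress squared residuals on the predictor; n * R^2 is approximately
# chi-squared with 1 degree of freedom under constant variance.
aux     <- lm(resid(fit)^2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(shapiro_p = sw$p.value, bp_p = bp_p)
```

Small p-values would point toward a violation. To apply the same idea to the homework model, `fit` would be replaced by `simple_model` and `x` by the GDP values of the rows actually used in the fit.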
# Scatterplot of the raw data with the fitted regression line overlaid
plot(countries$GDP, countries$LifeExpectancy,
     xlab = "GDP", ylab = "Life Expectancy", main = "Life Expectancy vs. GDP")
abline(simple_model, col = 1, lwd = 2)
While there is a somewhat positive linear trend, with life expectancy increasing as GDP increases, many points lie far from the regression line. The model should be adjusted to improve the accuracy of its predictions.
Diagnosing Model Fit (RMSE and Residuals): For the multiple regression model from Question 2 (LifeExpectancy ~ GDP + Health + Internet), calculate the RMSE and explain what it represents in the context of predicting life expectancy. How would large residuals for certain countries (e.g., those with unusually high or low life expectancy) affect your confidence in the model’s predictions, and what might you investigate further?
# Calculate residuals
residuals_simple <- resid(simple_model)
# Calculate RMSE for simple model
rmse_simple <- sqrt(mean(residuals_simple^2))
rmse_simple
## [1] 5.868172
# For multiple model
# Calculate residuals
residuals_multiple <- resid(multiple_model)
# Calculate RMSE for multiple model
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 4.056417
Simple model: RMSE = 5.868 years, so predictions are typically off by about 5.9 years.
Multiple model: RMSE = 4.056 years, so predictions are typically off by about 4.1 years. The error drops by about 1.8 years.
Overall, the multiple model improves accuracy. It includes additional relevant factors (government spending on health and internet usage) that are associated with life expectancy, and it both raises the adjusted R-squared and reduces the RMSE.
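The RMSE values are slightly smaller than the residual standard errors in the summaries (5.901 and 4.104) because RMSE divides the residual sum of squares by the number of observations, while the residual standard error divides by the residual degrees of freedom. The sketch below checks that relationship using numbers copied from the output; the helper name `rmse_from_rse` is made up for illustration.

```r
# Values copied from the summary() output
rse_simple   <- 5.901; df_simple   <- 177; n_simple   <- 177 + 2
rse_multiple <- 4.104; df_multiple <- 169; n_multiple <- 169 + 4

# RSE = sqrt(RSS / df) and RMSE = sqrt(RSS / n), so RMSE = RSE * sqrt(df / n)
rmse_from_rse <- function(rse, df, n) rse * sqrt(df / n)

rmse_simple_check   <- rmse_from_rse(rse_simple,   df_simple,   n_simple)
rmse_multiple_check <- rmse_from_rse(rse_multiple, df_multiple, n_multiple)

round(c(simple = rmse_simple_check, multiple = rmse_multiple_check), 3)
```

The recomputed values should match the RMSEs of 5.868 and 4.056 up to rounding of the inputs.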
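Large residuals for certain countries would lower my confidence in the model’s predictions, especially for countries like them. To address this, I would investigate those countries for data errors or unusual circumstances, then revise the model by adding further predictors that are statistically significant (low p-values, marked ***) and recheck the diagnostics.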
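One way to find the observations with large residuals is to standardize the residuals and flag those beyond a cutoff. The sketch below uses simulated stand-in data with one deliberately planted outlier (the `country`, `x`, and `y` columns are hypothetical), and the cutoff of 2 is a common rule of thumb rather than a fixed standard.

```r
set.seed(42)

# Simulated stand-in data (hypothetical; not the AllCountries values)
n   <- 50
dat <- data.frame(
  country = paste0("Country_", seq_len(n)),
  x       = runif(n, 0, 10)
)
dat$y <- 60 + 1.5 * dat$x + rnorm(n, sd = 2)
dat$y[7] <- dat$y[7] - 25   # plant one unusually low outcome

fit <- lm(y ~ x, data = dat)

# Standardized residuals; |value| > 2 is a common rule-of-thumb cutoff
dat$std_resid <- rstandard(fit)
flagged <- dat[abs(dat$std_resid) > 2, c("country", "y", "std_resid")]
flagged[order(-abs(flagged$std_resid)), ]
```

For the homework model, `rstandard(multiple_model)` could be used the same way; because rows with missing values are dropped during fitting, the names of the returned vector are the safest way to match residuals back to rows of `countries`.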
Hypothetical Example (Multicollinearity in Multiple Regression): Suppose you are analyzing the AllCountries dataset and fit a multiple linear regression model to predict CO2 emissions (metric tons per capita) using Energy (kilotons of oil equivalent) and Electricity (kWh per capita) as predictors. You notice that Energy and Electricity are highly correlated. Explain how this multicollinearity might affect the interpretation of the regression coefficients and the reliability of the model.
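This multicollinearity would make it difficult to separate the individual effects of Energy and Electricity, because the two predictors carry much of the same information. That makes it hard to tell whether one matters more than the other. It can inflate the standard errors, making the coefficients unstable (they may change size or even sign with small changes in the data) and the p-values unreliable, which would make conclusions about the individual predictors unreliable even if the model's overall predictions remain reasonable.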
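The effect can be illustrated with a small simulation. The sketch below generates hypothetical `energy`, `electricity`, and `co2` vectors (not the AllCountries values) in which the two predictors are almost copies of each other, computes the variance inflation factor (VIF) by hand in base R, and compares the standard error of the energy coefficient with and without the second predictor.

```r
set.seed(7)

# Simulated stand-in data (hypothetical; not the AllCountries values)
n           <- 150
energy      <- rnorm(n, mean = 50, sd = 10)
electricity <- 2 * energy + rnorm(n, sd = 3)   # nearly a copy of energy
co2         <- 1 + 0.05 * energy + 0.02 * electricity + rnorm(n, sd = 1)

# How strongly are the two predictors related?
r <- cor(energy, electricity)

# VIF for energy = 1 / (1 - R^2) from regressing energy on the other predictor
r2_energy  <- summary(lm(energy ~ electricity))$r.squared
vif_energy <- 1 / (1 - r2_energy)

# Standard error of the energy coefficient with and without electricity
se_alone    <- coef(summary(lm(co2 ~ energy)))["energy", "Std. Error"]
se_together <- coef(summary(lm(co2 ~ energy + electricity)))["energy", "Std. Error"]

c(correlation = r, vif_energy = vif_energy,
  se_alone = se_alone, se_together = se_together)
```

A VIF well above 10 is often treated as a sign of serious multicollinearity, and the larger standard error in the two-predictor model shows why the individual coefficients become hard to interpret.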