countries <- read.csv("AllCountries.csv")
Simple Linear Regression (Fitting and Interpretation): Using the AllCountries dataset, fit a simple linear regression model to predict LifeExpectancy (average life expectancy in years) based on GDP (gross domestic product per capita in $US). Report the intercept and slope coefficients and interpret their meaning in the context of the dataset. What does the R² value tell you about how well GDP explains variation in life expectancy across countries?
simple_model <- lm(LifeExpectancy ~ GDP, data = countries)
simple_model
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
##
## Coefficients:
## (Intercept) GDP
## 6.842e+01 2.476e-04
summary(simple_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.352 -3.882 1.550 4.458 9.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
## GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
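The intercept is 68.42, meaning that a country with a GDP per capita of $0 would have a predicted life expectancy of about 68.42 years. Since a GDP of $0 is not realistic, the intercept is best read as the baseline of the fitted line rather than a real-world prediction. The slope is 0.0002476, indicating that each additional $1 of GDP per capita is associated with an increase of 0.0002476 years in predicted life expectancy, or about 0.25 years per $1,000. Lastly, the R² is 0.4304, which tells us that about 43% of the variation in life expectancy across countries is explained by GDP alone.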
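Because the slope is tiny when measured per dollar, it can help to rescale it. The short sketch below only redoes the arithmetic with the coefficients copied from the summary output; the $10,000 country is a made-up example, not a row from the dataset.

```r
# Coefficients copied from the summary(simple_model) output
intercept <- 68.42
slope     <- 2.476e-04   # years of life expectancy per $1 of GDP per capita

# Rescale the slope to a more readable unit: years per $1,000 of GDP per capita
slope_per_1000 <- slope * 1000
slope_per_1000            # 0.2476

# Predicted life expectancy for a hypothetical country with GDP of $10,000
gdp_example <- 10000
predicted   <- intercept + slope * gdp_example
predicted                 # 70.896
```

With the fitted model available, `predict(simple_model, newdata = data.frame(GDP = 10000))` should give essentially the same number, up to rounding of the coefficients.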
Multiple Linear Regression (Fitting and Interpretation) Fit a multiple linear regression model to predict LifeExpectancy using GDP, Health (percentage of government expenditures on healthcare), and Internet (percentage of population with internet access) as predictors. Interpret the coefficient for Health, explaining what it means in terms of life expectancy while controlling for GDP and Internet. How does the adjusted R² compare to the simple regression model from Question 1, and what does this suggest about the additional predictors?
multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = countries)
multiple_model
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = countries)
##
## Coefficients:
## (Intercept) GDP Health Internet
## 5.908e+01 2.367e-05 2.479e-01 1.903e-01
summary(multiple_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5662 -1.8227 0.4108 2.5422 9.4161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
## GDP 2.367e-05 2.287e-05 1.035 0.302025
## Health 2.479e-01 6.619e-02 3.745 0.000247 ***
## Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
The coefficient for Health is 0.2479, which indicates that for every one-percentage-point increase in the share of government expenditures going to healthcare, predicted life expectancy increases by about 0.25 years, holding GDP and Internet access constant. The adjusted R² for the multiple regression model is 0.7164, a large increase over the adjusted R² of the simple regression model (0.4272). This suggests that adding Health and Internet makes the model considerably better at explaining differences in life expectancy across countries.
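As a quick check on the comparison, the adjusted R² values can be rebuilt from the reported R² values and sample sizes. This is a sketch using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1); the helper name `adj_r2` is made up for illustration, and n is recovered as residual degrees of freedom plus the number of estimated coefficients.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Simple model: 177 residual df + 2 coefficients = 179 observations
adj_simple   <- adj_r2(0.4304, n = 177 + 2, p = 1)

# Multiple model: 169 residual df + 4 coefficients = 173 observations
adj_multiple <- adj_r2(0.7213, n = 169 + 4, p = 3)

round(c(simple = adj_simple, multiple = adj_multiple), 4)
```

The two models were fit on slightly different sets of countries (179 versus 173 after dropping missing values), so the comparison is approximate rather than exact.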
Checking Assumptions (Homoscedasticity and Normality) For the simple linear regression model from Question 1 (LifeExpectancy ~ GDP), describe how you would check the assumptions of homoscedasticity and normality of residuals. For each assumption, explain what an ideal outcome would look like and what a violation might indicate about the model’s reliability for predicting life expectancy. Afterwards, code your answer and reflect on whether it matched the ideal outcome.
+ We check homoscedasticity by looking at the Residuals vs Fitted plot. Ideally, the points are randomly scattered around zero with no pattern, meaning the spread of the model’s errors is consistent. A pattern such as a funnel or a curve indicates that the errors change across the fitted values (and therefore across GDP), which makes the standard errors and predictions less reliable.
+ We check the normality of residuals with the Q-Q plot. The points should follow the straight line closely, which indicates the residuals are roughly normal. If the points bend away from the line substantially, the residuals are not normally distributed, which may affect the accuracy of prediction intervals and statistical tests.
par(mfrow=c(2,2)); plot(simple_model); par(mfrow=c(1,1))
The Residuals vs Fitted plot shows a clear pattern instead of a random scatter, which suggests that the errors do not have constant variance. The Q-Q plot also shows the points curving away from the line, which means the residuals are not normally distributed. Neither plot matches the ideal outcome, so the simple regression model has limitations: its standard errors and p-values should be read with caution, although it still gives a general sense of how GDP relates to life expectancy.
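The plots can be backed up with formal tests. The sketch below uses simulated data as a stand-in for the CSV (the variables `x` and `y` and all numbers are invented, with error spread deliberately growing with `x`), then applies a Shapiro-Wilk test for normality and a hand-computed Breusch-Pagan statistic (Koenker's n·R² form) for constant variance. It is an illustration of the idea, not a result from the AllCountries data.

```r
set.seed(1)

# Simulated stand-in data (NOT the AllCountries values)
n <- 200
x <- runif(n, 500, 50000)                           # plays the role of GDP
y <- 60 + 2e-4 * x + rnorm(n, sd = 1 + x / 10000)   # error spread grows with x
sim_model <- lm(y ~ x)

# Normality of residuals: a small p-value suggests non-normal residuals
sw <- shapiro.test(resid(sim_model))
sw$p.value

# Constant variance: regress squared residuals on the predictor.
# Under homoscedasticity, n * R^2 is approximately chi-squared with 1 df.
aux     <- lm(resid(sim_model)^2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
bp_p    # a small p-value suggests non-constant variance
```

To apply the same checks to the real model, `sim_model` would be replaced by `simple_model` and `x` by the GDP values actually used in the fit, for example `model.frame(simple_model)$GDP`.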
Diagnosing Model Fit (RMSE and Residuals) For the multiple regression model from Question 2 (LifeExpectancy ~ GDP + Health + Internet), calculate the RMSE and explain what it represents in the context of predicting life expectancy. How would large residuals for certain countries (e.g., those with unusually high or low life expectancy) affect your confidence in the model’s predictions, and what might you investigate further?
residuals_multiple <- resid(multiple_model)
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 4.056417
The RMSE is about 4.06, which means that the model’s predictions are typically off by about 4 years compared to the actual life expectancy values. If a country has a large residual, the model predicted its life expectancy poorly, which lowers our confidence in predictions for countries like it. We would investigate whether those countries share something the model leaves out or whether their data contain errors. To improve the model, we could look into adding other predictors that may affect life expectancy, such as education, sanitation, and overall healthcare quality. These additional variables may help reduce the errors and make the predictions more accurate.
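The RMSE is closely related to the residual standard error printed by `summary()`: both come from the same sum of squared residuals, but the RMSE divides by n while the residual standard error divides by the residual degrees of freedom. A minimal sketch of that relationship, using only numbers copied from the output:

```r
# Values copied from summary(multiple_model)
rse      <- 4.104        # residual standard error
df_resid <- 169          # residual degrees of freedom
n        <- df_resid + 4 # 173 observations (4 estimated coefficients)

# RSE = sqrt(RSS / df_resid) and RMSE = sqrt(RSS / n),
# so RMSE = RSE * sqrt(df_resid / n)
rmse_from_rse <- rse * sqrt(df_resid / n)
rmse_from_rse   # close to the 4.056417 computed from the residuals
```

The small gap between the two numbers comes from the rounding of 4.104 and shrinks as the sample size grows relative to the number of predictors.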
Hypothetical Example (Multicollinearity in Multiple Regression) Suppose you are analyzing the AllCountries dataset and fit a multiple linear regression model to predict CO2 emissions (metric tons per capita) using Energy (kilotons of oil equivalent) and Electricity (kWh per capita) as predictors. You notice that Energy and Electricity are highly correlated. Explain how this multicollinearity might affect the interpretation of the regression coefficients and the reliability of the model.
+ If Energy and Electricity are highly correlated, they move together, which makes it hard for the model to tell which one is driving CO2 emissions. As a result, the coefficient estimates have large standard errors and can change a lot with small changes in the dataset. The model may still predict CO2 reasonably well overall, but the individual coefficients become uncertain, so we cannot clearly tell which variable contributes more to the outcome.
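The effect can be illustrated with simulated data. In the sketch below, `energy`, `electricity`, and `co2` are invented variables built to be nearly collinear (they are not the AllCountries columns), and the variance inflation factor is computed by hand as 1 / (1 - R²) from regressing one predictor on the other.

```r
set.seed(42)

# Simulated, hypothetical data: electricity is almost a linear function of energy
n           <- 150
energy      <- rnorm(n, mean = 100, sd = 20)
electricity <- 50 * energy + rnorm(n, sd = 100)
co2         <- 0.05 * energy + 0.001 * electricity + rnorm(n)

# How strongly the two predictors move together
cor(energy, electricity)

# Variance inflation factor for energy; values above roughly 5 to 10
# are commonly treated as a warning sign
r2_aux     <- summary(lm(energy ~ electricity))$r.squared
vif_energy <- 1 / (1 - r2_aux)
vif_energy

# Standard error of the energy coefficient with and without electricity
se_both  <- summary(lm(co2 ~ energy + electricity))$coefficients["energy", "Std. Error"]
se_alone <- summary(lm(co2 ~ energy))$coefficients["energy", "Std. Error"]
c(with_electricity = se_both, energy_only = se_alone)
```

The standard error for `energy` is much larger when its near-duplicate is also in the model, which is the instability described above. Common responses include dropping one of the two predictors or combining them into a single measure.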