Load the dataset from a CSV file
all_countries <- read.csv("AllCountries.csv")
Simple Linear Regression (Fitting and Interpretation): Using the AllCountries dataset, fit a simple linear regression model to predict LifeExpectancy (average life expectancy in years) based on GDP (gross domestic product per capita in $US). Report the intercept and slope coefficients and interpret their meaning in the context of the dataset. What does the R² value tell you about how well GDP explains variation in life expectancy across countries?
model <- lm(LifeExpectancy ~ GDP, data = all_countries)
summary(model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = all_countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.352 -3.882 1.550 4.458 9.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
## GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
The intercept (68.42) is the predicted average life expectancy, in years, for a country with a GDP per capita of $0. A GDP of $0 is unrealistic, so the intercept mainly serves as the baseline of the fitted line rather than as a meaningful prediction.
The slope (2.476e-04) is the predicted change in life expectancy, in years, for every $1 increase in GDP per capita. The number is small because GDP is measured in much larger amounts than a single dollar: a $1,000 increase in GDP per capita corresponds to a predicted increase of about 0.25 years.
The R² value of 0.4304 means that about 43.04% of the variation in life expectancy across countries is explained by differences in GDP. The remaining 56.96% is due to factors that this model does not account for.
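The interpretation above can be checked with a little arithmetic. The sketch below copies the rounded coefficients from the `summary(model)` output and rescales the slope to years per $1,000; the GDP value of $10,000 is an illustrative assumption, not a figure from the dataset.

```r
# Coefficients copied (rounded) from the summary(model) output above
intercept <- 68.42
slope <- 2.476e-04   # years per $1 of GDP per capita

# Rescale the slope to a more readable unit: years per $1,000
slope_per_1000 <- slope * 1000

# Predicted life expectancy at a hypothetical GDP of $10,000 per capita
gdp_example <- 10000   # illustrative value, not taken from the dataset
pred_example <- intercept + slope * gdp_example

slope_per_1000
pred_example
```

With the fitted model object available, `predict(model, newdata = data.frame(GDP = 10000))` should give essentially the same prediction without the rounding.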
Multiple Linear Regression (Fitting and Interpretation) Fit a multiple linear regression model to predict LifeExpectancy using GDP, Health (percentage of government expenditures on healthcare), and Internet (percentage of population with internet access) as predictors. Interpret the coefficient for Health, explaining what it means in terms of life expectancy while controlling for GDP and Internet. How does the adjusted R² compare to the simple regression model from Question 1, and what does this suggest about the additional predictors?
model_multi <- lm(LifeExpectancy ~ GDP + Health + Internet, data = all_countries)
summary(model_multi)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = all_countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5662 -1.8227 0.4108 2.5422 9.4161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
## GDP 2.367e-05 2.287e-05 1.035 0.302025
## Health 2.479e-01 6.619e-02 3.745 0.000247 ***
## Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
The Health coefficient (0.2479) means that, holding GDP and Internet access constant, the model predicts that each 1 percentage point increase in the share of government expenditures going to healthcare is associated with an increase in life expectancy of approximately 0.2479 years, or about 3 months.
The adjusted R² rises from 0.4272 in the simple regression model from Question 1 to 0.7164 here. This suggests that adding Health and Internet as predictors dramatically improves our ability to explain the variation in life expectancy. Instead of explaining only about 43% of the differences with wealth (GDP) alone, we now explain nearly 72% of the variation by also accounting for healthcare priorities and technological infrastructure. Note also that GDP is no longer statistically significant (p = 0.302) once Health and Internet are in the model.
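As a rough check on the comparison above, the adjusted R² values can be reproduced from the reported R² values and degrees of freedom. This is a minimal sketch using the standard adjusted R² formula; the helper name `adj_r2` is made up for illustration, and the sample sizes are reconstructed as residual degrees of freedom plus the number of estimated coefficients.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Sample sizes reconstructed from the summaries: residual df + coefficients
n_simple <- 177 + 2   # intercept + GDP
n_multi  <- 169 + 4   # intercept + GDP + Health + Internet

adj_simple <- adj_r2(0.4304, n_simple, 1)
adj_multi  <- adj_r2(0.7213, n_multi, 3)

round(c(simple = adj_simple, multiple = adj_multi), 3)
```

The two models were fit on slightly different sets of countries (179 versus 173 after rows with missing values were dropped), so the comparison is not on exactly the same sample.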
Checking Assumptions (Homoscedasticity and Normality) For the simple linear regression model from Question 1 (LifeExpectancy ~ GDP), describe how you would check the assumptions of homoscedasticity and normality of residuals. For each assumption, explain what an ideal outcome would look like and what a violation might indicate about the model’s reliability for predicting life expectancy. Afterward, code your answer and reflect on whether it matched the ideal outcome.
par(mfrow = c(1, 2))
plot(model, which = 1)
plot(model, which = 2)
How to check (homoscedasticity): Create a Residuals vs. Fitted Values plot. Ideally, the residuals scatter randomly around zero with roughly the same spread across all fitted values.
Violation: If the spread of the residuals increases or decreases as the predicted values change (often looking like a fan or funnel), the assumption is violated. This means the model’s accuracy is inconsistent: it might predict life expectancy well for one group of countries but be very unreliable for another.
Result: The plot shows a funnel shape along with a slight curve, so the spread of the residuals is not constant. There is more variation in life expectancy among lower-GDP countries than among higher-GDP countries. Because of this violation, the standard errors may be biased.
How to check (normality): Create a Normal Q-Q (Quantile-Quantile) plot. Ideally, the points fall close to the diagonal reference line.
Violation: If the points curve away from the line at the tails, the residuals are not normally distributed. This suggests that the model may be systematically over- or under-predicting in certain cases, which can make hypothesis tests (and their p-values) less accurate.
Result: The plot shows points pulling away from the line at the bottom left and top right. The residuals are left-skewed, meaning the model substantially over-predicts life expectancy for a few specific countries (outliers).
Overall, the model does not match the ideal outcome. GDP alone is a noisy predictor, and the relationship may be non-linear, which helps explain why neither assumption is fully met.
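The visual checks above can be complemented with formal tests. The sketch below is not part of the original analysis: it uses synthetic data (all variable names and values are made up) in which the error spread grows with the predictor, then applies the Shapiro-Wilk test for normality and a studentized Breusch-Pagan statistic computed by hand for constant variance.

```r
set.seed(1)
n <- 200
x <- runif(n, 1, 10)
# Error spread grows with x, so homoscedasticity is violated by construction
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)
fit <- lm(y ~ x)

# Normality of residuals: a small p-value suggests non-normal residuals
sw <- shapiro.test(residuals(fit))

# Constant variance: regress squared residuals on the predictor;
# n * R-squared is approximately chi-squared with 1 df under homoscedasticity
aux <- lm(residuals(fit)^2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(shapiro_p = sw$p.value, bp_stat = bp_stat, bp_p = bp_p)
```

Replacing `fit` with the `model` object from Question 1 (and `x` with the GDP values actually used in the fit) would apply the same checks to the real data; the tests support the plots rather than replace them.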
Diagnosing Model Fit (RMSE and Residuals) For the multiple regression model from Question 2 (LifeExpectancy ~ GDP + Health + Internet), calculate the RMSE and explain what it represents in the context of predicting life expectancy. How would large residuals for certain countries (e.g., those with unusually high or low life expectancy) affect your confidence in the model’s predictions, and what might you investigate further?
multi_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = all_countries)
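# Store the residuals under a name that does not mask the residuals() function
resid_multi <- residuals(multi_model)
# RMSE: square root of the mean squared residual
rmse_multi <- sqrt(mean(resid_multi^2))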
rmse_multi
## [1] 4.056417
The RMSE of 4.056417 represents the typical distance between a country’s actual life expectancy and the life expectancy predicted by our model. In other words, the model’s predictions are typically off by about 4 years. Given that life expectancy generally ranges from about 50 to 85 years in this dataset, an error of 4 years is relatively small, but it is still meaningful when comparing the health outcomes of different nations.
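The RMSE is slightly smaller than the residual standard error of 4.104 reported by `summary()` because the RMSE divides the sum of squared residuals by the number of observations, while the residual standard error divides by the residual degrees of freedom. A small sketch, using only the numbers printed in the output above, shows how the two relate:

```r
# Values copied from the summary(model_multi) output
rse <- 4.104        # residual standard error
df_resid <- 169     # residual degrees of freedom
n_obs <- df_resid + 4   # add back the 4 estimated coefficients

# RMSE = sqrt(SSR / n), RSE = sqrt(SSR / df), so RMSE = RSE * sqrt(df / n)
rmse_from_rse <- rse * sqrt(df_resid / n_obs)
rmse_from_rse
```

The result agrees with the directly computed `rmse_multi` up to the rounding of the reported residual standard error.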
Large residuals reduce confidence in the model’s predictions. If certain countries have residuals much larger than the RMSE of 4.06 years, such as 10 or 15 years, the model is failing to capture unique circumstances in those countries. In both the simple and multiple regression models above, the largest residual in absolute value is negative (-16.35 and -14.57), meaning the model over-predicts how long people live in those countries based on their wealth, health spending, and internet access.
To investigate further, it might help to add categorical variables, such as whether a country is developed or developing, which would also say something about how the country is likely to progress. It would also be useful to add variables that record major events, such as conflicts (internal or external) and epidemics. GDP alone cannot capture these events, apart from a sharp change during the affected period that depends on how severe the event was and how the country responded.
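The large-residual check described above can be sketched in code. The example below uses a synthetic stand-in for the real data (the `toy` data frame, its `Country` labels, and all values are hypothetical) with one planted outlier, and flags observations whose residual exceeds twice the RMSE in absolute value. The cutoff of twice the RMSE is an arbitrary choice for illustration.

```r
set.seed(42)
# Synthetic stand-in for the real data; all names and values are hypothetical
toy <- data.frame(
  Country = paste0("Country_", 1:50),
  Internet = runif(50, 5, 95)
)
toy$LifeExpectancy <- 60 + 0.2 * toy$Internet + rnorm(50, sd = 3)
toy$LifeExpectancy[7] <- toy$LifeExpectancy[7] - 25   # plant one outlier

fit <- lm(LifeExpectancy ~ Internet, data = toy)
toy$resid <- residuals(fit)
rmse <- sqrt(mean(toy$resid^2))

# Flag observations whose residual exceeds twice the RMSE in absolute value
flagged <- toy[abs(toy$resid) > 2 * rmse, c("Country", "LifeExpectancy", "resid")]
flagged[order(flagged$resid), ]
```

With the real dataset, rows with missing values are dropped during fitting, so `residuals()` is shorter than the data frame. Fitting with `na.action = na.exclude` pads the residuals with `NA` so they line up with the original rows.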
Hypothetical Example (Multicollinearity in Multiple Regression) Suppose you are analyzing the AllCountries dataset and fit a multiple linear regression model to predict CO2 emissions (metric tons per capita) using Energy (kilotons of oil equivalent) and Electricity (kWh per capita) as predictors. You notice that Energy and Electricity are highly correlated. Explain how this multicollinearity might affect the interpretation of the regression coefficients and the reliability of the model.
cor(all_countries[, c("Energy", "Electricity")], use = "complete.obs")
## Energy Electricity
## Energy 1.0000000 0.7970054
## Electricity 0.7970054 1.0000000
Both Energy and Electricity should logically have a positive relationship with CO2 emissions. Because the two predictors are strongly correlated (r = 0.797), the model struggles to separate their individual effects. It may assign a positive coefficient to one (for example, Energy) and a negative coefficient to the other (for example, Electricity) simply to fit the data. In that case, you cannot reliably conclude that increasing Electricity reduces CO2 emissions: the negative sign would likely be an artifact of multicollinearity, not a real-world trend.
Multicollinearity also inflates the standard errors of the coefficients, so the model becomes very sensitive to small changes in the data. If you added or removed just five countries from AllCountries.csv, the coefficients for Energy and Electricity might change drastically. The individual coefficients are therefore less reliable, even if the model’s overall predictions remain reasonable. More detailed data is needed.
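One common way to quantify this effect is the variance inflation factor (VIF). With exactly two predictors, the VIF of each one is 1 / (1 - r²), where r is their correlation. The sketch below is not part of the original analysis and uses only the correlation reported above:

```r
# Correlation between Energy and Electricity, copied from the cor() output
r <- 0.7970054

# With two predictors, each one's variance inflation factor is 1 / (1 - r^2)
vif <- 1 / (1 - r^2)

# Standard errors are inflated by the square root of the VIF
se_inflation <- sqrt(vif)

round(c(vif = vif, se_inflation = se_inflation), 2)
```

Rule-of-thumb cutoffs of 5 or 10 are often cited for the VIF, so a value near 2.7 would usually be read as moderate rather than severe. Treat this as a rough guide: the standard errors are still noticeably larger than they would be with uncorrelated predictors.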