setwd("~/Documents/Data 101")
all_countries <- read.csv("AllCountries.csv")
simple_model <- lm(LifeExpectancy ~ GDP, data = all_countries)
summary(simple_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = all_countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.352 -3.882 1.550 4.458 9.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
## GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
Our intercept for this linear model is 6.842e+01, or 68.42. In the context of the problem, this means that when GDP per capita is $0, the predicted average life expectancy is 68.42 years. Our slope for this linear model is 2.476e-04, or 0.0002476. In the context of the problem, this means that for each $1 increase in GDP per capita, predicted life expectancy increases by 0.0002476 years on average. Our adjusted R-squared for this model is 0.4272. In the context of the problem, this means that about 42.72% of the variance in life expectancy can be explained by GDP alone, after adjusting for the number of predictors. The other 57.28% is left unexplained by this model and is attributable to other factors.
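Because the slope is so small per dollar, it can be easier to read at a larger scale. The sketch below plugs a few GDP values into the fitted line using the rounded coefficients from the summary output; the GDP values themselves are hypothetical and chosen only for illustration.

```r
# Coefficients copied (rounded) from the summary(simple_model) output
intercept <- 68.42
slope <- 2.476e-04

# Hypothetical GDP per capita values, in dollars (illustration only)
gdp_values <- c(0, 1000, 10000, 50000)

# Fitted line: predicted life expectancy = intercept + slope * GDP
predicted <- intercept + slope * gdp_values

data.frame(GDP = gdp_values, PredictedLifeExpectancy = predicted)
```

Read this way, a $10,000 difference in GDP per capita corresponds to a difference of about 2.476 years in predicted life expectancy.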
multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = all_countries)
summary(multiple_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = all_countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5662 -1.8227 0.4108 2.5422 9.4161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
## GDP 2.367e-05 2.287e-05 1.035 0.302025
## Health 2.479e-01 6.619e-02 3.745 0.000247 ***
## Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
The coefficient for Health is 2.479e-01, or 0.2479. In the context of our problem, this means that for every one-unit increase in Health (one percentage point of government spending on healthcare), predicted life expectancy increases by about 0.248 years. Because the model controls for GDP and Internet, this effect is estimated with GDP and Internet held constant. Our adjusted R-squared from the first model was 0.4272, and our adjusted R-squared from the multiple model was 0.7164. Adding the new variables, Health and Internet, raises the share of the variance of life expectancy explained from 42.72% to 71.64%, an increase of about 29 percentage points.
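As a check on the two adjusted R-squared values, the sketch below recomputes them from the reported multiple R-squared using the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1). The sample sizes are not printed directly in the output; they are inferred here as residual degrees of freedom plus the number of estimated coefficients, which is an assumption of this sketch.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Simple model: 177 residual df + 2 coefficients = 179 observations (inferred)
adj_simple <- adjusted_r2(r2 = 0.4304, n = 177 + 2, p = 1)

# Multiple model: 169 residual df + 4 coefficients = 173 observations (inferred)
adj_multiple <- adjusted_r2(r2 = 0.7213, n = 169 + 4, p = 3)

# Gain in adjusted R-squared, in percentage points
gain_points <- 100 * (adj_multiple - adj_simple)

c(simple = adj_simple, multiple = adj_multiple, gain_points = gain_points)
```

The two models are fit on slightly different sets of countries (38 versus 44 observations dropped for missing values), so the comparison is approximate rather than exact.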
I would check the homoscedasticity and normality of residuals by creating the residual plots. For homoscedasticity, ideally we want the residuals in the residuals vs. fitted plot to be spread out evenly across the graph and scattered randomly around 0, forming a horizontal band of roughly constant width. What we don’t want is a funnel or cone shape that widens or narrows to one side, which means that our model has unequal variance. For the normality, we are going to look at the Q-Q plot. We want the points in our Q-Q plot to fall close to the reference line, so we want our residuals to lie along the 45 degree line. What we don’t want is our residuals curving away from the line instead of staying with it. This could mean that the residuals are skewed or heavy-tailed rather than normally distributed.
par(mfrow=c(2,2)); plot(simple_model); par(mfrow=c(1,1))
For our homoscedasticity, we see that our residuals vs. fitted plot is curved, which suggests that the relationship between life expectancy and GDP may not be linear. There is also a clear pattern: most of the fitted values are clustered around 70, and the residuals decrease as the fitted values increase. This means that our residuals vs. fitted plot does not match the ideal outcome, and there is clear evidence of heteroscedasticity.
For our normality, we see that in our Q-Q plot our residuals stay with the 45 degree line, but there is some deviation at the tails. This means that there could be a slight skew at either of the tails, because the residuals do not completely line up with our line. I do believe that this matches the ideal outcome pretty well, because it shows that our residuals are approximately following a normal distribution.
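To put a number on what an even band versus a funnel looks like, the sketch below uses simulated data only, not the AllCountries data. It fits one model whose errors have constant spread and one whose spread grows with the predictor, then compares the residual standard deviation in the upper and lower halves of the fitted values. The helper `spread_ratio` is a hypothetical name introduced here, and this is a rough check rather than a formal test.

```r
# Simulated data for illustration; none of this comes from AllCountries.csv
set.seed(1)
x <- runif(300, 1, 10)
y_even   <- 2 + 3 * x + rnorm(300, sd = 2)        # constant spread
y_funnel <- 2 + 3 * x + rnorm(300, sd = 0.5 * x)  # spread grows with x

fit_even   <- lm(y_even ~ x)
fit_funnel <- lm(y_funnel ~ x)

# Hypothetical helper: residual SD in the upper half of the fitted values
# divided by residual SD in the lower half. Values near 1 suggest an even band.
spread_ratio <- function(fit) {
  upper <- fitted(fit) > median(fitted(fit))
  sd(resid(fit)[upper]) / sd(resid(fit)[!upper])
}

ratio_even   <- spread_ratio(fit_even)
ratio_funnel <- spread_ratio(fit_funnel)
c(even = ratio_even, funnel = ratio_funnel)
```

A ratio well above or below 1 is the numeric counterpart of the funnel shape in a residuals vs. fitted plot. The same helper could be applied to `simple_model` as a complement to reading the plot by eye.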
residuals_multiple <- resid(multiple_model)
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 4.056417
Our RMSE is about 4.06, which means that the model’s predictions of life expectancy are typically off by about 4.06 years. Large residuals mean the model makes big errors for some countries, reducing confidence in its predictions. I would look at variables like education, inequality, healthcare quality, or possible outliers and data issues in those countries to further interpret our RMSE.
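The RMSE computed above is slightly smaller than the residual standard error of 4.104 in the summary output. Both are built from the same sum of squared residuals, but the RMSE divides by the number of observations while the residual standard error divides by the residual degrees of freedom. The sketch below converts one into the other using only the rounded numbers printed in the output; the observation count of 173 is inferred as 169 residual degrees of freedom plus 4 coefficients.

```r
# Values copied from the summary(multiple_model) output
rse <- 4.104        # residual standard error (rounded)
df_resid <- 169     # residual degrees of freedom

# Inferred number of observations used in the fit: df + 4 coefficients
n_obs <- df_resid + 4

# RSE^2 * df is the sum of squared residuals; dividing by n gives mean squared error
rmse_from_rse <- rse * sqrt(df_resid / n_obs)
rmse_from_rse
```

The result agrees with the directly computed RMSE of 4.056417 up to the rounding of the residual standard error.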
When Energy and Electricity are highly correlated, multicollinearity makes it difficult to separate their individual effects on CO₂ emissions. The regression coefficients may become unstable and have large standard errors, meaning small changes in the data could lead to big changes in the estimates. As a result, the coefficients may not be statistically significant even if the variables are important. This reduces confidence in interpreting their individual impact, even though the overall model may still predict well.
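The effect on standard errors can be seen in a small simulation. The sketch below uses simulated variables, not real data; the names `energy`, `electricity`, `unrelated`, and `co2` are stand-ins for illustration. It compares the standard error of the `energy` coefficient when the second predictor is nearly a copy of it against the case where the second predictor is independent, and computes a variance inflation factor by hand as 1 / (1 - R^2).

```r
# Simulated data for illustration only
set.seed(42)
n <- 200
energy      <- rnorm(n)
electricity <- energy + rnorm(n, sd = 0.1)  # nearly a copy of energy
unrelated   <- rnorm(n)                     # independent of energy

co2_collinear   <- 1 + 2 * energy + 2 * electricity + rnorm(n)
co2_independent <- 1 + 2 * energy + 2 * unrelated + rnorm(n)

fit_collinear   <- lm(co2_collinear ~ energy + electricity)
fit_independent <- lm(co2_independent ~ energy + unrelated)

# Standard error of the energy coefficient in each model
se_collinear   <- summary(fit_collinear)$coefficients["energy", "Std. Error"]
se_independent <- summary(fit_independent)$coefficients["energy", "Std. Error"]

# Variance inflation factor for energy: 1 / (1 - R^2) from regressing
# energy on the other predictor
vif_energy <- 1 / (1 - summary(lm(energy ~ electricity))$r.squared)

c(se_collinear = se_collinear, se_independent = se_independent, vif = vif_energy)
```

The standard error grows roughly with the square root of the variance inflation factor, which is why a truly important predictor can end up with a non-significant coefficient.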