Countries <- read.csv("Allcountries.csv")
LE_GDP_SimpleModel <- lm(LifeExpectancy ~ GDP, data = Countries)
summary(LE_GDP_SimpleModel)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = Countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.352 -3.882 1.550 4.458 9.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
## GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
Intercept: 68.42. Slope coefficient: 0.0002476. R squared: 0.43.
When GDP is zero, the predicted life expectancy is 68.42 years. No country actually has a GDP of zero, so the intercept is an extrapolation rather than a realistic prediction. The slope coefficient is 0.0002476, which means that for every one-dollar increase in GDP, predicted life expectancy goes up by 0.0002476 years. The R squared value is 0.43, which means that GDP accounts for about 43 percent of the variation in life expectancy.
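A per-dollar slope is hard to read, so it can help to rescale it. The short sketch below only reuses the two coefficients printed in the summary above; the GDP values of 1,000 and 10,000 dollars are illustrative choices, not values taken from the data.

```r
# Coefficients copied from summary(LE_GDP_SimpleModel)
intercept <- 6.842e+01
slope <- 2.476e-04

# Change in predicted life expectancy for a 1,000 dollar increase in GDP
per_thousand <- slope * 1000

# Predicted life expectancy at two illustrative GDP values
pred_low <- intercept + slope * 1000
pred_high <- intercept + slope * 10000

per_thousand
pred_low
pred_high
```

On this scale, a 1,000 dollar increase in GDP goes with an increase of about a quarter of a year in predicted life expectancy.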
LE2_Model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = Countries)
summary(LE2_Model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = Countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5662 -1.8227 0.4108 2.5422 9.4161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
## GDP 2.367e-05 2.287e-05 1.035 0.302025
## Health 2.479e-01 6.619e-02 3.745 0.000247 ***
## Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
Coefficient (Health): 0.2479. Adjusted R squared: 0.7164.
The coefficient for Health is 0.2479, which means that, holding GDP and Internet constant, predicted life expectancy increases by about 0.25 years for every one-unit increase in Health. It is a slope, not the point where the line starts on the y axis; that point is the intercept (59.08). The adjusted R squared in this model is 0.7164, which means that the three variables together account for about 72 percent of the variability in life expectancy, after adjusting for the number of predictors. This is larger than the adjusted R squared of 0.4272 from question 1, so this model has more explanatory power. The additional predictors, Health and Internet, explain variation in life expectancy that GDP alone does not. Once they are included, GDP is no longer statistically significant (p = 0.302).
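As a check on the comparison, the adjusted R squared for both models can be recomputed from the multiple R squared values in the two summaries. This is a sketch using the standard adjustment formula; the helper name `adj_r2` is made up for illustration, and the sample sizes are inferred as residual degrees of freedom plus the number of estimated coefficients (177 + 2 and 169 + 4).

```r
# Adjusted R squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
# r2 = multiple R squared, n = observations used, p = number of predictors
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Values copied from the two summary() outputs
adj_simple <- adj_r2(r2 = 0.4304, n = 179, p = 1)
adj_multiple <- adj_r2(r2 = 0.7213, n = 173, p = 3)

round(adj_simple, 4)
round(adj_multiple, 4)
```

The two models were fit on slightly different sets of countries (179 versus 173) because of missing values, so the comparison is approximate rather than exact.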
Normality is mostly fulfilled by the residuals, but they show some skew toward the end of the data. The residual summary for the simple model (minimum -16.352, median 1.550, maximum 9.330) suggests that the longer tail is on the negative side. Ideally, the residuals would follow the dotted reference lines in their respective graphs, which would indicate that the model assumptions are reasonable.
par(mfrow=c(2,2)); plot(LE_GDP_SimpleModel); par(mfrow=c(1,1))
The diagnostic plots for LE_GDP_SimpleModel do not match the ideal outcomes described above.
RMSE
LE2_residuals <- resid(LE2_Model)
LE2_rmse <- sqrt(mean(LE2_residuals^2))
LE2_rmse
## [1] 4.056417
What does this mean? What do large residuals for certain countries mean? What would have to be investigated further?
This result means that the model's predictions for life expectancy are typically off by about 4.06 years. If certain countries have larger residuals, it means that some other factor, which this model does not take into account, is affecting life expectancy in those countries. Those countries would have to be investigated further to identify what that factor is.
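The RMSE (4.056) is slightly smaller than the residual standard error in the summary (4.104) because the RMSE divides the sum of squared residuals by the number of observations, while the residual standard error divides by the residual degrees of freedom. The sketch below recovers one from the other using only numbers from the output above, assuming n = 173 (169 degrees of freedom plus 4 estimated coefficients).

```r
# Values copied from summary(LE2_Model)
rse <- 4.104   # residual standard error
df <- 169      # residual degrees of freedom
n <- df + 4    # observations used: df plus 4 coefficients

# RSS = rse^2 * df, and RMSE = sqrt(RSS / n)
rss <- rse^2 * df
rmse_from_rse <- sqrt(rss / n)

rmse_from_rse
```

The result agrees with LE2_rmse up to the rounding of the residual standard error.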
Hypothetical Q (Multicollinearity in Multiple Regression): Suppose you are analyzing the AllCountries dataset and fit a multiple linear regression model to predict CO2 emissions (metric tons per capita) using Energy (kilotons of oil equivalent) and Electricity (kWh per capita) as predictors. You notice that Energy and Electricity are highly correlated. Explain how this multicollinearity might affect the interpretation of the regression coefficients and the reliability of the model.
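This would make the individual coefficients unreliable, because the model would not be able to tell which of the two variables is responsible for the effect on predicted CO2 emissions. The coefficient estimates could change a lot with small changes in the data, and their standard errors would be inflated, so a predictor could look non-significant even when the two predictors together predict CO2 emissions well.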
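The inflation of standard errors can be illustrated with a small simulation. This is only a sketch on made-up data, not the AllCountries values: the variables `energy`, `electricity`, and `co2` are simulated stand-ins, and the sample size, seed, and noise levels are arbitrary assumptions.

```r
set.seed(1)
n <- 200

# Simulated stand-ins: electricity is almost a copy of energy
energy <- rnorm(n)
electricity <- energy + rnorm(n, sd = 0.1)
co2 <- 2 * energy + rnorm(n)

# Fit with both correlated predictors, and with energy alone
fit_both <- lm(co2 ~ energy + electricity)
fit_one <- lm(co2 ~ energy)

# Standard error of the energy coefficient in each model
se_both <- summary(fit_both)$coefficients["energy", "Std. Error"]
se_one <- summary(fit_one)$coefficients["energy", "Std. Error"]

# Variance inflation factor for energy: 1 / (1 - R^2) from
# regressing energy on the other predictor
r2_aux <- summary(lm(energy ~ electricity))$r.squared
vif_energy <- 1 / (1 - r2_aux)

c(se_both = se_both, se_one = se_one, vif_energy = vif_energy)
```

A variance inflation factor well above 10 is a common rule-of-thumb sign that the predictors overlap too much for their separate coefficients to be interpreted.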