For Task 1, I used a simple linear regression model to predict LifeExpectancy from GDP in the “AllCountries” dataset. The intercept was 68.42, meaning that a country with a GDP of 0 would have a predicted life expectancy of 68.42 years. The slope for GDP was 0.0002476. This means that for each $1 increase in GDP, predicted life expectancy increases by 0.0002476 years on average (about 0.25 years per $1,000). The R-squared value was 0.4304, meaning that about 43.04% of the variation in life expectancy across countries is explained by GDP in this model.
data <- read.csv("C:/Users/camer/OneDrive/Desktop/Data101/AllCountries.csv")
model1 <- lm(LifeExpectancy ~ GDP, data = data)
summary(model1)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -16.352  -3.882   1.550   4.458   9.330
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
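To make the slope easier to read, the fitted line can be evaluated by hand. This is a minimal sketch that hard-codes the rounded intercept and slope from the summary output; the names `b0`, `b1`, and `predict_life` and the example GDP of 10,000 are illustrative assumptions, not part of the original analysis.

```r
# Rounded coefficients copied from summary(model1)
b0 <- 68.42       # intercept (years)
b1 <- 0.0002476   # slope (years per $1 of GDP)

# Hypothetical helper: predicted life expectancy at a given GDP
predict_life <- function(gdp) b0 + b1 * gdp

# Effect of a $1,000 increase in GDP, in years
effect_per_1000 <- b1 * 1000

# Prediction for an illustrative GDP of $10,000
pred_10000 <- predict_life(10000)

print(effect_per_1000)
print(pred_10000)
```

With the fitted model object available, `predict(model1, newdata = data.frame(GDP = 10000))` should give nearly the same value, differing only because the coefficients here are rounded.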
For Task 2, I used a multiple linear regression model, again predicting LifeExpectancy, this time using GDP, Health, and Internet. The coefficient for Health was 0.2479, meaning that a one-unit increase in Health is associated with an increase of about 0.248 years in predicted life expectancy, holding GDP and Internet constant. Basically, nations with greater investment or culture in healthcare tend to have a higher life expectancy. The adjusted R-squared was 0.7164, compared to 0.4272 for the simple model in Task 1, which means this model explains much more of the variation in life expectancy across countries. All in all, the increase in adjusted R-squared means that the model explains more variation because it includes other relevant factors.
model2 <- lm(LifeExpectancy ~ GDP + Health + Internet, data = data)
summary(model2)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -14.5662  -1.8227   0.4108   2.5422   9.4161
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
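The adjusted R-squared values reported by `summary()` can be reproduced from the multiple R-squared, the number of observations, and the number of predictors. The sketch below uses the rounded values from the two outputs; the helper name `adj_r2` is a hypothetical name introduced for illustration, and the observation counts are inferred from the residual degrees of freedom.

```r
# Hypothetical helper: adjusted R-squared from R-squared,
# n observations, and p predictors
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Model 1: 177 residual df + 2 estimated coefficients = 179 observations
adj1 <- adj_r2(0.4304, n = 179, p = 1)

# Model 2: 169 residual df + 4 estimated coefficients = 173 observations
adj2 <- adj_r2(0.7213, n = 173, p = 3)

print(round(adj1, 3))
print(round(adj2, 3))
```

Because the two models were fit on slightly different sets of countries (179 versus 173 complete rows), the comparison between them is approximate rather than exact.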
For Task 3, I started by checking homoscedasticity. I used a residuals vs. fitted plot; in the ideal case the residuals would be scattered randomly around 0 with equal spread across the fitted values. To check normality I used a histogram of the residuals. In the ideal case the histogram would look bell shaped. A skewed or uneven histogram would indicate that the residuals are not normally distributed.
Based on my results, the homoscedasticity does not look perfect: the residuals do not seem to be randomly scattered, and the histogram isn’t completely bell shaped. The histogram is reasonably symmetric, whereas the residuals show far too regular a pattern in the residuals vs. fitted plot to be considered random. The homoscedasticity assumption seems to be in much worse shape than the normality assumption.
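# Residuals vs. fitted values: checks for constant spread (homoscedasticity)
plot(model1$fitted.values, model1$residuals,
     xlab = "Fitted Values",
     ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0)  # reference line at zero
# Histogram of residuals: checks for approximate normality
hist(model1$residuals,
     main = "Histogram of Residuals",
     xlab = "Residuals")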
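Visual checks like these can be backed up with simple numeric summaries. The following is a rough sketch on simulated data (not the AllCountries data), built so that the noise grows with the predictor; the variable names and the choice to split at the median fitted value are assumptions made for illustration, not a formal test of homoscedasticity.

```r
set.seed(1)  # reproducible simulated data (not the AllCountries data)

# Simulate a response whose noise grows with x (heteroscedastic by design)
x <- runif(200, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(200, sd = 0.2 + 0.3 * x)
sim_fit <- lm(y ~ x)

res <- residuals(sim_fit)
fit <- fitted(sim_fit)

# Compare residual spread in the lower and upper halves of the fitted values;
# a ratio far from 1 suggests non-constant variance
upper <- fit > median(fit)
spread_ratio <- sd(res[upper]) / sd(res[!upper])

# Shapiro-Wilk test: a small p-value suggests non-normal residuals
normality_p <- shapiro.test(res)$p.value

print(spread_ratio)
print(normality_p)
```

The same two summaries could be computed from `residuals(model1)` and `fitted(model1)` to complement the plots.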
##Q4
I calculated the RMSE for the multiple regression model used previously. The RMSE is the root mean squared error, which measures the typical size of the model’s prediction errors. Since the response variable for the model is life expectancy, the RMSE of 4.056417 is in years. Basically, the model’s predictions are typically off by about 4.06 years. Large residuals for certain countries would lower my confidence in the model’s predictions because a large residual indicates that the predicted life expectancy is far from the true value. I would investigate those countries to find out how to fix the problem.
res <- residuals(model2)
rmse <- sqrt(mean(res^2))
rmse
## [1] 4.056417
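The RMSE is slightly smaller than the residual standard error of 4.104 reported by `summary(model2)` because the two divide the sum of squared residuals by different quantities. As a quick check, assuming the rounded values from that summary output, the relationship can be sketched as follows (the variable names are illustrative):

```r
# Values copied from summary(model2)
rse <- 4.104   # residual standard error
df  <- 169     # residual degrees of freedom
n   <- df + 4  # observations used: 3 predictors + intercept

# RSE divides the squared residuals by df, RMSE divides by n,
# so RMSE = RSE * sqrt(df / n)
rmse_from_rse <- rse * sqrt(df / n)

print(round(rmse_from_rse, 2))
```

The result agrees with the directly computed RMSE up to rounding of the residual standard error.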
##Q5
Lastly, energy and electricity are highly correlated with each other, which means that multicollinearity is likely. Since both variables would give the model similar information, the regression would likely have difficulty separating their individual effects on CO2 emissions.
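One common way to quantify this is the variance inflation factor (VIF). The sketch below uses simulated stand-ins for the two predictors (not the real data) and computes the VIF by hand as 1 / (1 - R²); the variable names and the noise level are assumptions chosen only to illustrate the idea.

```r
set.seed(42)  # reproducible simulated data (not the real dataset)

# Two predictors that carry nearly the same information
energy      <- rnorm(100)
electricity <- energy + rnorm(100, sd = 0.1)

# Pairwise correlation between the predictors
r <- cor(energy, electricity)

# VIF for energy: 1 / (1 - R^2) from regressing it on the other predictor
r2_energy  <- summary(lm(energy ~ electricity))$r.squared
vif_energy <- 1 / (1 - r2_energy)

print(r)
print(vif_energy)
```

A VIF well above roughly 5 to 10 is often treated as a warning sign, although that threshold is a rule of thumb rather than a strict cutoff.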