Problem 1

countries <- read.csv("AllCountries.csv")

Problem 2

linear <- lm(LifeExpectancy ~ GDP, data = countries)

summary(linear)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

Interpretation

The intercept is 68.42, which means that, according to the model, a country with a GDP of 0 dollars would have a predicted life expectancy of 68.42 years. The slope is 0.0002476, which means that for every 1 dollar increase in GDP, predicted life expectancy increases by 0.0002476 years. The R^2 value of 0.43 tells us that there is a moderately strong relationship between GDP and life expectancy: about 43% of the variation in life expectancy across countries is explained by GDP.
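Because a per-dollar slope is hard to read, it can help to rescale it. The sketch below only reuses the two coefficients printed in the summary; the $1,000 step and the $10,000 example GDP are illustrative choices, not values from the assignment.

```r
# Coefficients copied from the summary(linear) output
intercept <- 68.42
slope_per_dollar <- 2.476e-04

# Rescale the slope to a more readable unit (illustrative choice: per $1,000)
slope_per_1000 <- slope_per_dollar * 1000
slope_per_1000

# Predicted life expectancy at an illustrative GDP of $10,000
gdp_example <- 10000
pred_example <- intercept + slope_per_dollar * gdp_example
pred_example
```

Read this way, the model predicts roughly a quarter of a year of extra life expectancy per additional $1,000 of GDP, and about 70.9 years at a GDP of $10,000.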

Problem 3

multiple <- lm(LifeExpectancy ~ GDP + Health + Internet, data = countries)

# View the model summary
summary(multiple)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = countries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

Interpretation

The Health coefficient is 0.2479, which means that for every 1 percentage point increase in the share of government expenditure directed towards healthcare, predicted life expectancy increases by 0.2479 years (almost 3 months), holding GDP and Internet access constant. Compared to the previous model, R^2 increased from 0.43 to 0.72 (and adjusted R^2 from 0.43 to 0.72), meaning that adding Health and Internet explains much more of the variation in life expectancy than relying on GDP alone. Note that GDP is no longer statistically significant (p = 0.30) once Health and Internet are in the model.
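Since R^2 never decreases when predictors are added, adjusted R^2 is the fairer comparison between the two models. As a rough check, the sketch below recomputes adjusted R^2 from the printed R^2 values; `adj_r2` is a helper name introduced here, and the sample sizes are inferred from the residual degrees of freedom in the two summaries (177 + 2 and 169 + 4).

```r
# Adjusted R^2 from R^2, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Values taken from the two summary() outputs
adj_simple   <- adj_r2(0.4304, n = 179, p = 1)  # LifeExpectancy ~ GDP
adj_multiple <- adj_r2(0.7213, n = 173, p = 3)  # ~ GDP + Health + Internet

round(c(simple = adj_simple, multiple = adj_multiple), 4)
```

The two models were fitted on slightly different sets of countries (179 versus 173 complete cases), so the comparison is approximate.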

Problem 4

par(mfrow = c(2, 2))
plot(linear)
par(mfrow = c(1, 1))

Interpretation

I plotted the diagnostic plots for the simple model to check homoscedasticity and normality. The Residuals vs Fitted plot shows a non-constant spread: the residuals are more spread out for countries with low fitted life expectancy. Though it’s not severe, it means that the model predicts countries with high life expectancy better than countries with low life expectancy. For normality, the Q-Q plot shows mild deviations in the tails, meaning that the residuals are approximately, though not perfectly, normal.
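The visual reading can be backed up with simple numeric checks. The sketch below uses simulated data, not the countries data, with an error spread that is deliberately wider at low values, so every number in it is illustrative only. The same two checks could be run on `resid(linear)` and `fitted(linear)`.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

n <- 200
x <- runif(n, min = 1, max = 10)
# Error spread shrinks as x grows, mimicking wider residuals at low fitted values
y <- 60 + 2 * x + rnorm(n, mean = 0, sd = 0.5 * (11 - x))

sim_fit <- lm(y ~ x)

# Rough heteroscedasticity signal: do absolute residuals shrink as fitted values rise?
spread_cor <- cor(abs(resid(sim_fit)), fitted(sim_fit))
spread_cor

# Shapiro-Wilk test of residual normality (base stats package)
sw <- shapiro.test(resid(sim_fit))
sw$p.value
```

A clearly negative `spread_cor` matches the pattern described above, where residuals are wider at low fitted values. A small Shapiro-Wilk p-value would suggest a departure from normality.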

Problem 5

residuals_multiple <- resid(multiple)

rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 4.056417

Interpretation

With an RMSE of 4.06, the model’s predictions for life expectancy are typically off by about 4 years. Overall, the model does a decent job of predicting life expectancy, but it is not perfect and still has room for error. Large residuals for certain countries would lessen my confidence in the model’s predictions for those countries and would lead me to consider other variables to include in the model in hopes of more accurate predictions.
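The RMSE of 4.056 is slightly smaller than the residual standard error of 4.104 in the model summary because the RMSE divides the squared residuals by n, while the residual standard error divides by the residual degrees of freedom. The sketch below checks this from the printed values alone; the sample size is inferred as 169 + 4.

```r
# Values copied from the summary(multiple) output
rse      <- 4.104        # residual standard error
df_resid <- 169          # residual degrees of freedom
n_obs    <- df_resid + 4 # 3 slopes + intercept

# RSE = sqrt(SSE / df) and RMSE = sqrt(SSE / n), so RMSE = RSE * sqrt(df / n)
rmse_from_rse <- rse * sqrt(df_resid / n_obs)
rmse_from_rse
```

The result agrees with `rmse_multiple` up to the rounding of the printed residual standard error. Both are in-sample measures, so the error on new countries may be somewhat larger.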

Problem 6

cor(countries[, c("CO2", "Energy", "Electricity")], use = "complete.obs")
##                   CO2    Energy Electricity
## CO2         1.0000000 0.8795736   0.4871233
## Energy      0.8795736 1.0000000   0.7969352
## Electricity 0.4871233 0.7969352   1.0000000

Interpretation

From the results, we can see that there is a high correlation between Energy and Electricity (r = 0.80), which indicates multicollinearity if both are used as predictors in the model. Because of this, it will be difficult to accurately separate each variable’s effect on CO2 emissions, making the coefficient estimates less reliable.
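One way to put a number on this is the variance inflation factor (VIF). The sketch below assumes a model with exactly two predictors, Energy and Electricity, predicting CO2; that model is an assumption, since it is not fitted in this report. With only two predictors, the VIF of each is 1 / (1 - r^2), where r is their correlation.

```r
# Correlation between Energy and Electricity, copied from the matrix above
r_energy_electricity <- 0.7969352

# VIF for either predictor in a two-predictor model (assumed: CO2 ~ Energy + Electricity)
vif_two_predictors <- 1 / (1 - r_energy_electricity^2)
vif_two_predictors
```

A VIF of about 2.7 means the variance of each coefficient estimate is roughly 2.7 times what it would be if the two predictors were uncorrelated.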