DATA 101 Homework 9

Robert Miller

Reading the data:

# read the data

dat <- read.csv("AllCountries.csv")

# View the first rows 

head(dat)
##          Country Code LandArea Population Density   GDP Rural  CO2 PumpPrice
## 1    Afghanistan  AFG   652.86     37.172    56.9   521  74.5 0.29      0.70
## 2        Albania  ALB    27.40      2.866   104.6  5254  39.7 1.98      1.36
## 3        Algeria  DZA  2381.74     42.228    17.7  4279  27.4 3.74      0.28
## 4 American Samoa  ASM     0.20      0.055   277.3    NA  12.8   NA        NA
## 5        Andorra  AND     0.47      0.077   163.8 42030  11.9 5.83        NA
## 6         Angola  AGO  1246.70     30.810    24.7  3432  34.5 1.29      0.97
##   Military Health ArmedForces Internet  Cell HIV Hunger Diabetes BirthRate
## 1     3.72   2.01         323     11.4  67.4  NA   30.3      9.6      32.5
## 2     4.08   9.51           9     71.8 123.7 0.1    5.5     10.1      11.7
## 3    13.81  10.73         317     47.7 111.0 0.1    4.7      6.7      22.3
## 4       NA     NA          NA       NA    NA  NA     NA       NA        NA
## 5       NA  14.02          NA     98.9 104.4  NA     NA      8.0        NA
## 6     9.40   5.43         117     14.3  44.7 1.9   23.9      3.9      41.3
##   DeathRate ElderlyPop LifeExpectancy FemaleLabor Unemployment Energy
## 1       6.6        2.6           64.0        50.3          1.5     NA
## 2       7.5       13.6           78.5        55.9         13.9    808
## 3       4.8        6.4           76.3        16.4         12.1   1328
## 4        NA         NA             NA          NA           NA     NA
## 5        NA         NA             NA          NA           NA     NA
## 6       8.4        2.5           61.8        76.4          7.3    545
##   Electricity Developed
## 1          NA        NA
## 2        2309         1
## 3        1363         1
## 4          NA        NA
## 5          NA        NA
## 6         312         1

Simple Linear Regression: LifeExpectancy ~ GDP

# simple linear regression model

lrModel1 <- lm(LifeExpectancy ~ GDP, data = dat)

# summary of model 

summary(lrModel1)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

Interpretation:

The intercept of 68.42 is the estimated life expectancy (in years) for a country with a GDP per capita of $0. No country has a GDP of zero, so this value is an extrapolation rather than a practically meaningful prediction.

The slope of 0.0002476 indicates that each $1 increase in GDP per capita is associated with an increase of approximately 0.00025 years in predicted life expectancy, or roughly 0.25 years per $1,000.

The R-squared value of 0.4304 means that GDP explains about 43% of the variation in life expectancy across countries. This indicates a moderate relationship: GDP is an important factor, but other variables may be equally or more influential.
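To make the slope easier to read, the fitted line can be evaluated at a few GDP values. The sketch below hard-codes the rounded coefficients from the summary output above; the helper name `predict_life` and the example GDP values are illustrative assumptions, not part of the original analysis.

```r
# Rounded coefficients copied from summary(lrModel1) above
b0 <- 68.42       # intercept (years)
b1 <- 2.476e-04   # slope (years per $1 of GDP per capita)

# Hypothetical helper: fitted life expectancy at a given GDP per capita
predict_life <- function(gdp) b0 + b1 * gdp

# Illustrative GDP values (not taken from the data)
preds <- predict_life(c(1000, 10000, 40000))
round(preds, 2)

# Slope rescaled to a more readable unit: years per $1,000
slope_per_1000 <- b1 * 1000
slope_per_1000
```

With the fitted model object available, `predict(lrModel1, newdata = data.frame(GDP = 10000))` should give nearly the same value, differing only by the rounding of the coefficients.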

Multiple Linear Regression: LifeExpectancy ~ GDP + Health + Internet

# multiple linear regression model

lrModel2 <- lm(LifeExpectancy ~ GDP + Health + Internet, data = dat)

# summary of model 

summary(lrModel2)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

Interpretation:

The intercept of 59.08 represents the predicted life expectancy for a country with GDP, healthcare spending, and internet usage all equal to zero.

The coefficient for Health is 0.2479, which means that for every 1 percentage point increase in the share of government spending on healthcare, predicted life expectancy increases by about 0.25 years, holding GDP and Internet constant. This variable is statistically significant (p = 0.000247).

The Internet variable has a coefficient of 0.1903, meaning that a 1 percentage point increase in internet access is associated with an increase of about 0.19 years in life expectancy, holding GDP and Health constant.

GDP is no longer statistically significant in this model (p = 0.302). This suggests that GDP shares much of its explanatory information with Health and Internet: once those two predictors are included, GDP adds little on its own.

The Adjusted R-squared value is 0.7164, which means that about 72% of the variation in life expectancy across countries is explained by this model.
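The adjusted R-squared can be recovered from the ordinary R-squared, which shows how the penalty for extra predictors works. This is a sketch using the rounded values printed in the summary output; the variable names are illustrative.

```r
# Values copied from summary(lrModel2) above
r2       <- 0.7213   # multiple R-squared
df_resid <- 169      # residual degrees of freedom
p        <- 3        # number of predictors (GDP, Health, Internet)

# Observations actually used in the fit: residual df + predictors + intercept
n <- df_resid + p + 1

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)   # about 0.7164, in line with the summary output
```

Because only three predictors are used with 173 observations, the penalty is small and the adjusted value stays close to 0.7213.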

Check Assumptions of Linear Regression:

# diagnostic plots for assumptions

par(mfrow=c(2,2))
plot(lrModel1)

par(mfrow=c(1,1))

Interpretation:

  • Residuals vs Fitted: A curved, spreading pattern appears, which may suggest some non-linearity and heteroscedasticity (non-constant variance).
  • Q-Q Plot: The tails deviate from the line, which means the residuals are not perfectly normal, but the departure is acceptable.
  • Scale-Location: Slight upward trend. The residual spread increases with the fitted values, indicating mild heteroscedasticity.
  • Residuals vs Leverage: Some points have higher leverage, but no strongly influential outliers are shown.

Overall: Assumptions are mostly met, with mild violations of linearity, normality, and constant variance.
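The upward trend in a Scale-Location plot can also be summarized numerically. The sketch below uses simulated data (an assumption, not the AllCountries data) in which the error spread grows with the predictor, and measures the rank correlation between the fitted values and the square root of the absolute standardized residuals, which is the quantity that plot displays.

```r
set.seed(1)

# Simulated data: the error standard deviation grows with x
x <- seq(1, 10, length.out = 500)
y <- 2 + 3 * x + rnorm(500, sd = 0.5 * x)

fit <- lm(y ~ x)

# Quantity shown on the y-axis of the Scale-Location plot
scale_loc <- sqrt(abs(rstandard(fit)))

# A clearly positive rank correlation points to spread increasing with fit
trend <- cor(fitted(fit), scale_loc, method = "spearman")
trend
```

Replacing `fit` with `lrModel1` would apply the same check to the model above; a value near zero would be consistent with constant variance.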

# residuals vs. order

plot(resid(lrModel1), type = "b",
     main = "Residuals vs Observation Order",
     ylab = "Residuals")
abline(h = 0, lty = 2)

Interpretation:

  • The residuals bounce around zero with no clear upward or downward trend, which suggests the errors are not related to the order of the data.

Conclusion: The independence assumption seems reasonable in this model.
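A numeric companion to the residuals-vs-order plot is the Durbin-Watson statistic, which is not used in the analysis above and is offered here only as a sketch. The helper `dw_stat` is a hypothetical name; it computes the statistic directly from a residual vector, and values near 2 suggest little first-order correlation between consecutive residuals.

```r
# Hypothetical helper: Durbin-Watson statistic from a vector of residuals
dw_stat <- function(e) {
  e <- e[!is.na(e)]                # drop any missing residuals
  sum(diff(e)^2) / sum(e^2)        # squared successive differences over squared residuals
}

# Toy checks on hand-made residual sequences
dw_alternating <- dw_stat(c(1, -1, 1, -1))  # sign flips every step: 3
dw_constant    <- dw_stat(c(1, 1, 1, 1))    # no change between steps: 0

# With the fitted model available, one would call:
# dw_stat(resid(lrModel1))
```

Since the rows appear to be in alphabetical order by country, the observation order carries little real meaning here, so this check is mainly a sanity check.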

Diagnosing Model Fit (RMSE and Residuals)

# calculate the residuals for multiple model

residuals_lrModel2 <- resid(lrModel2)

# compute the RMSE

rmse_lrModel2 <- sqrt(mean(residuals_lrModel2^2))
rmse_lrModel2
## [1] 4.056417

Interpretation:

The RMSE (Root Mean Squared Error) for the multiple regression model is ~4.06. This means that the model’s predicted life expectancy values typically deviate from the actual values by about 4 years.

This gives an estimate of the model’s prediction error. Very large residuals for some countries (that is, outliers the model fits poorly) would lower our confidence in the model's predictions for those countries. Such points are worth examining to understand whether unique factors are at work that the model does not capture.
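The RMSE is closely related to the residual standard error reported by `summary(lrModel2)`: both are based on the same sum of squared residuals, but the RMSE divides by the number of observations while the residual standard error divides by the residual degrees of freedom. The sketch below converts one into the other using the rounded values from the output above; the variable names are illustrative.

```r
# Values copied from summary(lrModel2) above
rse      <- 4.104   # residual standard error
df_resid <- 169     # residual degrees of freedom
n_coef   <- 4       # intercept + GDP + Health + Internet

# Observations used in the fit
n <- df_resid + n_coef

# RSE^2 * df = RSS, and RMSE = sqrt(RSS / n)
rmse_from_rse <- rse * sqrt(df_resid / n)
round(rmse_from_rse, 2)   # about 4.06
```

This explains why the RMSE (about 4.06) is slightly smaller than the residual standard error (4.104).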

Multicollinearity in Multiple Regression

If there is a high correlation between energy and electricity, multicollinearity would be present in the regression model. This means the two predictors carry overlapping information, which makes it difficult to isolate how much of a unique effect each one has on CO2 emissions. Multicollinearity can cause unstable coefficient estimates, as small changes in the data can lead to large shifts in the slope values. It can also inflate standard errors, which can make a genuinely important predictor appear statistically insignificant. Finally, it would make it more difficult to tell whether energy or electricity carries the real effect on emissions. To improve this model, we could drop one of the correlated predictors or combine them into one variable.
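One common way to quantify this is the variance inflation factor (VIF), which is not computed in the analysis above. With exactly two predictors, the VIF of each depends only on their correlation r, as VIF = 1 / (1 - r^2). The sketch below uses a hypothetical helper `vif_from_r` and illustrative correlation values, not values from the data.

```r
# Hypothetical helper: VIF for a model with exactly two predictors,
# given the correlation r between them
vif_from_r <- function(r) 1 / (1 - r^2)

# Illustrative correlations (assumed, not computed from AllCountries.csv)
vif_none <- vif_from_r(0)     # uncorrelated predictors: 1, no inflation
vif_high <- vif_from_r(0.9)   # strongly correlated predictors: about 5.26

# With the data loaded, the actual correlation could be obtained with:
# r <- cor(dat$Energy, dat$Electricity, use = "complete.obs")
# vif_from_r(r)
```

The square root of the VIF gives the factor by which the standard error of a coefficient is inflated relative to the uncorrelated case, so a VIF of about 5.26 corresponds to standard errors roughly 2.3 times larger.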