2. Load the Data into R. How many observations and how many variables do you have?

Loading the data into R shows there are 247 observations of 121 variables.
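A minimal sketch of how the counts can be read off, using a tiny inline CSV in place of the real file. The three rows and their values are made up for illustration, and `Country.Code` is an assumed column name.

```r
# Toy CSV standing in for the real file (hypothetical values)
csv_text <- "Country.Code,SP.POP.GROW,SP.DYN.LE00.IN
AAA,1.2,70.1
BBB,0.4,81.3
CCC,2.9,58.6"

toy <- read.csv(text = csv_text, as.is = TRUE)

# dim() returns c(number of rows, number of columns)
dims <- dim(toy)
cat("observations:", dims[1], " variables:", dims[2], "\n")
```

On the real data frame the same call to `dim()` reports the observations and variables in one step.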

3. Some of the observations are country level observations, and others are region level observations. Move all of the region level observations into a separate data frame; that is, remove all observations associated with the following country codes: CEB CSS EAP EAS ECA ECS EMU EUU FCS HIC HPC LAC LCN LDC LIC LMC LMY MEA MIC MNA NOC OEC OED OSS PSS SSA SSF SST UMC and WLD.

Removing these observations reduces the data set to 217 observations of 121 variables.
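One possible way to do the split by code rather than by row position is sketched below. It assumes the country codes live in a column called `Country.Code`, which is a guess about the file layout, and it runs on a five-row toy data frame with made-up values.

```r
# The 30 region codes listed in the question
region_codes <- c("CEB", "CSS", "EAP", "EAS", "ECA", "ECS", "EMU", "EUU",
                  "FCS", "HIC", "HPC", "LAC", "LCN", "LDC", "LIC", "LMC",
                  "LMY", "MEA", "MIC", "MNA", "NOC", "OEC", "OED", "OSS",
                  "PSS", "SSA", "SSF", "SST", "UMC", "WLD")

# Toy data (hypothetical values); Country.Code is an assumed column name
toy <- data.frame(
  Country.Code   = c("USA", "WLD", "FRA", "EUU", "KEN"),
  SP.DYN.LE00.IN = c(78.5, 71.9, 82.3, 80.6, 62.1),
  stringsAsFactors = FALSE
)

# Logical mask: TRUE for rows whose code is a region aggregate
is_region <- toy$Country.Code %in% region_codes

regions   <- toy[is_region, ]    # region level observations
countries <- toy[!is_region, ]   # country level observations
```

Matching on the code keeps the split correct even if the row order of the file changes.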

4. Which variable corresponds to life expectancy at birth?

SP.DYN.LE00.IN is the variable that corresponds to life expectancy at birth.

5. Plot life expectancy at birth versus Proportion of seats held by women in national parliaments.

6. Plot life expectancy at birth versus Population growth.

7. From your two plots above, does life expectancy seem more correlated with population growth or proportion of seats held by women?

Life expectancy at birth appears more strongly correlated with population growth than with the proportion of seats held by women.
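The visual impression can be backed up with correlation coefficients. The sketch below uses the three World Bank variable names from this report on a small toy data frame; the values are invented, so the printed numbers say nothing about the real data.

```r
# Toy data (hypothetical values), one row has a missing life expectancy
toy <- data.frame(
  SP.DYN.LE00.IN = c(55, 60, 65, 70, 75, 80, 82, NA),
  SP.POP.GROW    = c(3.1, 2.8, 2.0, 1.5, 1.1, 0.6, 0.3, 1.0),
  SG.GEN.PARL.ZS = c(20, 35, 12, 28, 18, 30, 25, 22)
)

# use = "complete.obs" drops rows where either variable is missing
r_growth <- cor(toy$SP.DYN.LE00.IN, toy$SP.POP.GROW,    use = "complete.obs")
r_parl   <- cor(toy$SP.DYN.LE00.IN, toy$SG.GEN.PARL.ZS, use = "complete.obs")

# Compare absolute values: the sign only gives the direction
print(c(growth = r_growth, parliament = r_parl))
```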

8. Create a linear model of life expectancy as described by population growth

a. Examine the residual plots.

b. Remove any outlier(s).

As shown in the residual plots above, there is one extreme outlier (observation 173) that must be removed. It pulls the Residuals vs Fitted curve far from the expected flat line, and it lies far outside the Cook’s distance contours in the Residuals vs Leverage plot.
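Besides reading the plots, the most influential point can be located numerically with `cooks.distance()`. This is a sketch on simulated data with one planted outlier, not the analysis that was run for this report; the variable names `x` and `y` are placeholders.

```r
set.seed(1)

# 20 points on a downward trend plus one point far from it
x <- c(seq(0, 3, length.out = 20), 8)
y <- c(78 - 5 * x[1:20] + rnorm(20, sd = 2), 75)
toy <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = toy)

# Cook's distance combines leverage and residual size for each observation
cd <- cooks.distance(fit)
worst <- which.max(cd)   # index of the most influential row

# Drop that row by position and refit
toy_clean <- toy[-worst, ]
fit_clean <- lm(y ~ x, data = toy_clean)
```

Removing by row index found this way avoids matching on an exact floating-point value of a predictor.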

c. Create a linear model of life expectancy as described by population growth for your data set with outliers removed. Call this model mod1.

d. Examine the residual plots, and explain any issues that you see with the residuals.

These residual plots are much better with outlier 173 removed. However, the residuals in the Residuals vs Fitted plot are still not spread uniformly around the y = 0 line, so there may be further outliers or other variables not taken into account. In the Q-Q plot the candidate outliers are points 107, 115 and 183, which lie furthest from the reference line, though none departs as far as outlier 173 did. The Scale-Location plot is probably the best of the four: its center line is not perfectly flat, but it is flatter than in the other plots. In the Residuals vs Leverage plot, point 107 again stands out, but it is still within the Cook’s distance contours.

9. Find the 20 variables in your data set that have the fewest missing values.

The 20 variables with the fewest missing values are AG.SRF.TOTL.K2, SP.POP.GROW, AG.LND.TOTL.K2, EN.POP.DNST, EG.ELC.ACCS.RU.ZS, EG.ELC.ACCS.ZS, EN.BIR.THRD.NO, EN.FSH.THRD.NO, EN.HPT.THRD.NO, EN.MAM.THRD.NO, SP.RUR.TOTL.ZS, IT.NET.SECR, IT.MLT.MAIN, AG.LND.EL5M.ZS, AG.LND.FRST.ZS, SP.DYN.LE00.IN, NY.GDP.PCAP.CD, NY.GDP.PETR.RT.ZS, SH.MED.PHYS.ZS and SE.SEC.ENRL.FE.ZS.
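A list like this can be produced by counting `NA` values per column and sorting. The sketch below uses a four-column toy data frame and keeps the top 2; for the real data the same idea would keep the top 20.

```r
# Toy data with different amounts of missingness per column
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c(1, 2, 3, 4),
  c = c(NA, NA, NA, 4),
  d = c(1, 2, NA, 4)
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column
na_counts <- sort(colSums(is.na(toy)))

# Names of the columns with the fewest missing values
fewest <- head(names(na_counts), 2)
print(na_counts)
```

Note that identifier columns such as country names usually have no missing values either, so they may need to be excluded before taking the top of the list.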

10. Create a linear model that describes life expectancy as a linear function of the 20 variables with the fewest missing values.

11. Reduce the number of variables in your model, and create a new model called mod2.

The variables SP.DYN.LE00.IN and SP.POP.GROW were removed when creating mod2. SP.DYN.LE00.IN was removed because it is the variable that is being described, and SP.POP.GROW was removed because it is already in the linear model of mod1.
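An alternative way to reduce a model is automated backward selection with `step()`, which drops terms while AIC improves. This is a different technique from the manual removal described above, shown only as a sketch on simulated data where just `x1` truly matters; all variable names are placeholders.

```r
set.seed(42)
n <- 100

# Simulated predictors; only x1 influences y
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 70 + 5 * toy$x1 + rnorm(n)

# Start from the full model
full <- lm(y ~ x1 + x2 + x3, data = toy)

# Backward elimination by AIC; trace = 0 silences the step-by-step log
reduced <- step(full, direction = "backward", trace = 0)

# Terms that survived the selection
kept <- attr(terms(reduced), "term.labels")
print(kept)
```

`step()` requires every candidate model to be fitted on the same rows, so rows with missing values should be removed before calling it on real data.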

12. Compare mod1 with mod2 by:

a. Compare the residuals in the two models.

## 
## Call:
## lm(formula = SP.DYN.LE00.IN ~ SP.POP.GROW, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.1766  -3.9456  -0.4208   4.3489  17.9961 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  77.5698     0.7277  106.60   <2e-16 ***
## SP.POP.GROW  -4.8615     0.4309  -11.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.606 on 207 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  0.3808, Adjusted R-squared:  0.3778 
## F-statistic: 127.3 on 1 and 207 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = SP.DYN.LE00.IN ~ AG.SRF.TOTL.K2 + AG.LND.TOTL.K2 + 
##     EN.POP.DNST + EG.ELC.ACCS.RU.ZS + EG.ELC.ACCS.ZS + EN.BIR.THRD.NO + 
##     EN.FSH.THRD.NO + EN.HPT.THRD.NO + EN.MAM.THRD.NO + SP.RUR.TOTL.ZS + 
##     IT.NET.SECR + IT.MLT.MAIN + AG.LND.EL5M.ZS + AG.LND.FRST.ZS + 
##     NY.GDP.PCAP.CD + NY.GDP.PETR.RT.ZS + SH.MED.PHYS.ZS + SE.SEC.ENRL.FE.ZS, 
##     data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.0266  -1.8857   0.4022   2.4374   9.9660 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        4.705e+01  4.038e+00  11.651  < 2e-16 ***
## AG.SRF.TOTL.K2    -2.958e-06  2.993e-06  -0.988  0.32440    
## AG.LND.TOTL.K2     2.911e-06  3.116e-06   0.934  0.35139    
## EN.POP.DNST        4.791e-04  5.139e-04   0.932  0.35243    
## EG.ELC.ACCS.RU.ZS  9.743e-02  3.175e-02   3.069  0.00249 ** 
## EG.ELC.ACCS.ZS     3.685e-02  3.769e-02   0.978  0.32948    
## EN.BIR.THRD.NO    -8.675e-03  1.435e-02  -0.605  0.54616    
## EN.FSH.THRD.NO     2.356e-03  5.979e-03   0.394  0.69397    
## EN.HPT.THRD.NO     5.954e-04  1.569e-03   0.380  0.70477    
## EN.MAM.THRD.NO     3.198e-03  1.617e-02   0.198  0.84345    
## SP.RUR.TOTL.ZS    -3.797e-02  1.871e-02  -2.029  0.04399 *  
## IT.NET.SECR        4.630e-06  7.830e-06   0.591  0.55505    
## IT.MLT.MAIN        1.233e-08  1.566e-08   0.788  0.43204    
## AG.LND.EL5M.ZS     1.259e-02  1.868e-02   0.674  0.50135    
## AG.LND.FRST.ZS     2.451e-02  1.300e-02   1.886  0.06100 .  
## NY.GDP.PCAP.CD     8.922e-05  2.095e-05   4.258 3.36e-05 ***
## NY.GDP.PETR.RT.ZS -9.481e-02  2.850e-02  -3.327  0.00107 ** 
## SH.MED.PHYS.ZS     5.281e-01  3.232e-01   1.634  0.10412    
## SE.SEC.ENRL.FE.ZS  2.740e-01  8.260e-02   3.317  0.00111 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.935 on 174 degrees of freedom
##   (23 observations deleted due to missingness)
## Multiple R-squared:  0.8022, Adjusted R-squared:  0.7817 
## F-statistic:  39.2 on 18 and 174 DF,  p-value: < 2.2e-16

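Comparing the residual summaries above, the median is close to 0 in both models, and in each model the 1Q and 3Q values are close in absolute value, which suggests the bulk of the residuals is roughly symmetric. The extremes are less balanced: in both models the minimum residual is larger in magnitude than the maximum (-22.1766 vs 17.9961 for mod1, -14.0266 vs 9.9660 for mod2). The residuals of mod2 are also smaller overall, with a residual standard error of 3.935 compared with 6.606 for mod1.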

b. Use ANOVA to determine whether the extra variables are significant.

## Analysis of Variance Table
## 
## Response: SP.DYN.LE00.IN
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## SP.POP.GROW   1 5555.8  5555.8  127.31 < 2.2e-16 ***
## Residuals   207 9033.4    43.6                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: SP.DYN.LE00.IN
##                    Df Sum Sq Mean Sq  F value    Pr(>F)    
## AG.SRF.TOTL.K2      1    2.9     2.9   0.1863  0.666511    
## AG.LND.TOTL.K2      1  105.7   105.7   6.8240  0.009780 ** 
## EN.POP.DNST         1  446.6   446.6  28.8365 2.494e-07 ***
## EG.ELC.ACCS.RU.ZS   1 8693.8  8693.8 561.3990 < 2.2e-16 ***
## EG.ELC.ACCS.ZS      1   59.8    59.8   3.8585  0.051088 .  
## EN.BIR.THRD.NO      1    0.4     0.4   0.0246  0.875559    
## EN.FSH.THRD.NO      1    2.1     2.1   0.1342  0.714550    
## EN.HPT.THRD.NO      1    7.3     7.3   0.4745  0.491829    
## EN.MAM.THRD.NO      1   40.3    40.3   2.6028  0.108489    
## SP.RUR.TOTL.ZS      1  445.1   445.1  28.7423 2.600e-07 ***
## IT.NET.SECR         1   67.1    67.1   4.3331  0.038842 *  
## IT.MLT.MAIN         1   10.1    10.1   0.6507  0.420980    
## AG.LND.EL5M.ZS      1   59.9    59.9   3.8661  0.050861 .  
## AG.LND.FRST.ZS      1   88.2    88.2   5.6930  0.018107 *  
## NY.GDP.PCAP.CD      1  413.7   413.7  26.7120 6.428e-07 ***
## NY.GDP.PETR.RT.ZS   1  282.1   282.1  18.2185 3.230e-05 ***
## SH.MED.PHYS.ZS      1   31.5    31.5   2.0317  0.155844    
## SE.SEC.ENRL.FE.ZS   1  170.4   170.4  11.0040  0.001107 ** 
## Residuals         174 2694.5    15.5                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

At the alpha = 0.05 level, half of the extra variables (9 of 18) are insignificant. The significant ones are AG.LND.TOTL.K2, EN.POP.DNST, EG.ELC.ACCS.RU.ZS, SP.RUR.TOTL.ZS, IT.NET.SECR, AG.LND.FRST.ZS, NY.GDP.PCAP.CD, NY.GDP.PETR.RT.ZS and SE.SEC.ENRL.FE.ZS. Two other variables worth noting are EG.ELC.ACCS.ZS and AG.LND.EL5M.ZS, which are insignificant at alpha = 0.05, but only barely: their respective p-values are 0.051088 and 0.050861.
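Calling `anova()` on a single model gives sequential tests, one per term in the order entered. Passing two nested models instead gives a single F-test for the extra terms as a group. The sketch below shows that second form on simulated data; it assumes the smaller model's predictors are a subset of the larger model's and that both are fitted on the same complete rows. The names `growth`, `gdp`, `noise` and `life` are placeholders.

```r
set.seed(7)
n <- 80

toy <- data.frame(growth = runif(n, 0, 3),
                  gdp    = runif(n, 1, 50),
                  noise  = rnorm(n))
toy$life <- 75 - 4 * toy$growth + 0.3 * toy$gdp + rnorm(n, sd = 2)
toy$gdp[c(3, 10)] <- NA   # introduce some missing values

# Both models must use the same rows, so keep complete cases only
cc <- toy[complete.cases(toy), ]

small <- lm(life ~ growth, data = cc)
big   <- lm(life ~ growth + gdp + noise, data = cc)

# One F-test for the joint contribution of gdp and noise
cmp <- anova(small, big)
p_value <- cmp[["Pr(>F)"]][2]
print(cmp)
```

A small p-value here means the extra variables, taken together, explain significantly more variance than the smaller model alone.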

c. Using other appropriate methods of comparison.
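Other common comparison measures are adjusted R-squared, AIC and BIC. The sketch below collects them into one table for two models fitted on the same simulated data; the model and variable names are placeholders, and AIC or BIC values are only comparable when both models use the same observations.

```r
set.seed(3)
n <- 60

toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 70 + 3 * toy$x1 + 2 * toy$x2 + rnorm(n)

m_small <- lm(y ~ x1, data = toy)
m_big   <- lm(y ~ x1 + x2, data = toy)

# Higher adjusted R-squared is better; lower AIC and BIC are better
comparison <- data.frame(
  model  = c("m_small", "m_big"),
  adj_r2 = c(summary(m_small)$adj.r.squared, summary(m_big)$adj.r.squared),
  AIC    = c(AIC(m_small), AIC(m_big)),
  BIC    = c(BIC(m_small), BIC(m_big))
)
print(comparison)
```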

(BONUS) This topic will get you up to 20 points added to your project grade. Very little partial credit will be given; each part must be essentially correct before credit will be given for that part. You do not need to do all the parts.

(a) (4 points) Plot the countries of the world, shaded by life expectancy.

(b) (2 points) Create a linear model of life expectancy as a function of health expenditures per capita. Consider whether either or both variables need to be transformed! Call this mod3.

Neither variable needs to be transformed in its current state.
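One way to check whether a transformation helps is to fit the model with and without it and compare the fits. The sketch below does this for a log-transformed predictor on simulated data that was deliberately built so the log form fits better; it illustrates the check only and says nothing about the real data. The names `spend` and `life` are placeholders.

```r
set.seed(11)
n <- 100

# Simulated spending spread over several orders of magnitude
spend <- exp(runif(n, log(20), log(8000)))
life  <- 45 + 4 * log(spend) + rnorm(n, sd = 2)
toy <- data.frame(spend = spend, life = life)

raw_fit <- lm(life ~ spend, data = toy)
log_fit <- lm(life ~ log(spend), data = toy)

# R-squared is comparable here because the response is the same in both fits
r2 <- c(raw = summary(raw_fit)$r.squared,
        log = summary(log_fit)$r.squared)
print(r2)
```

A scatterplot of the response against both the raw and the logged predictor is a useful companion to this numeric check.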

(c) (4 points) Compute the residuals. Which country has the most positive residual? Least positive? Interpret the residuals.

## 
## Call:
## lm(formula = SP.DYN.LE00.IN ~ SH.XPD.PCAP, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.627  -4.503   1.591   5.254  10.570 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.789e+01  5.805e-01 116.945   <2e-16 ***
## SH.XPD.PCAP 2.647e-03  2.791e-04   9.483   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.853 on 186 degrees of freedom
##   (28 observations deleted due to missingness)
## Multiple R-squared:  0.3259, Adjusted R-squared:  0.3223 
## F-statistic: 89.92 on 1 and 186 DF,  p-value: < 2.2e-16
## [1] 10.5701
## [1] -19.62739

The most positive residual belongs to observation 109, with 10.5701, and the least positive to observation 183, with -19.62739.
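To report country names rather than row numbers, the residuals can be attached back to the data frame and looked up with `which.max()` and `which.min()`. This sketch assumes a name column called `Country.Name`, which is a guess about the file layout, and uses a six-row toy data frame with invented values.

```r
# Toy data (hypothetical values); row 4 has a missing predictor
toy <- data.frame(
  Country.Name = c("A", "B", "C", "D", "E", "F"),
  spend = c(100, 500, 1000, NA, 3000, 5000),
  life  = c(60, 72, 70, 75, 74, 83),
  stringsAsFactors = FALSE
)

# na.exclude pads residuals with NA so they line up with the original rows
fit <- lm(life ~ spend, data = toy, na.action = na.exclude)
toy$resid <- residuals(fit)

# which.max() and which.min() skip NA values
most_pos  <- toy$Country.Name[which.max(toy$resid)]
least_pos <- toy$Country.Name[which.min(toy$resid)]
cat("most positive:", most_pos, " least positive:", least_pos, "\n")
```

A positive residual means the country's life expectancy is higher than the model predicts from its spending; a negative residual means it is lower.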

(d) (10 points) Plot the countries of the world, shaded by the residuals of mod3. Comment on any patterns.

Appendix - Code for answers

suppressMessages(library(ggplot2))
suppressMessages(library(dplyr))

data <- read.csv("http://stat.slu.edu/~speegle/student_data/student_data_long25.csv", as.is = TRUE)

data <- data[-c(35, 47, 59, 60, 61, 62, 65, 70, 71, 91, 94, 119, 125, 126, 127, 130, 131, 143, 146, 151, 168, 171, 172, 174, 186, 202, 204, 205, 230, 240),]

plot1 <- ggplot(data, aes(x = SP.DYN.LE00.IN, y = SG.GEN.PARL.ZS)) + geom_point() + ggtitle("Life Expectancy at Birth vs Proportion of Seats held by Women in National Parliaments")
plot1

plot2 <- ggplot(data, aes(x = SP.DYN.LE00.IN, y = SP.POP.GROW)) + geom_point() + ggtitle("Life Expectancy at Birth vs Population Growth")
plot2

lm1 <- lm(SP.DYN.LE00.IN~SP.POP.GROW, data = data)

plot(lm1)

data <- filter(data, SP.POP.GROW != 8.08855915954617)

mod1 <- lm(SP.DYN.LE00.IN~SP.POP.GROW, data = data)

plot(mod1)

lm2 <- lm(SP.DYN.LE00.IN~AG.SRF.TOTL.K2 + SP.POP.GROW + AG.LND.TOTL.K2 + EN.POP.DNST + EG.ELC.ACCS.RU.ZS + EG.ELC.ACCS.ZS + EN.BIR.THRD.NO + EN.FSH.THRD.NO + EN.HPT.THRD.NO + EN.MAM.THRD.NO + SP.RUR.TOTL.ZS + IT.NET.SECR + IT.MLT.MAIN + AG.LND.EL5M.ZS + AG.LND.FRST.ZS + SP.DYN.LE00.IN + NY.GDP.PCAP.CD + NY.GDP.PETR.RT.ZS + SH.MED.PHYS.ZS + SE.SEC.ENRL.FE.ZS, data = data)

mod2 <- lm(SP.DYN.LE00.IN~AG.SRF.TOTL.K2 + AG.LND.TOTL.K2 + EN.POP.DNST + EG.ELC.ACCS.RU.ZS + EG.ELC.ACCS.ZS + EN.BIR.THRD.NO + EN.FSH.THRD.NO + EN.HPT.THRD.NO + EN.MAM.THRD.NO + SP.RUR.TOTL.ZS + IT.NET.SECR + IT.MLT.MAIN + AG.LND.EL5M.ZS + AG.LND.FRST.ZS + NY.GDP.PCAP.CD + NY.GDP.PETR.RT.ZS + SH.MED.PHYS.ZS + SE.SEC.ENRL.FE.ZS, data = data)

anova(mod1)
anova(mod2)

library(maps)
map("world", fill = TRUE, col = data$SP.DYN.LE00.IN, plot = TRUE)

mod3 <- lm(SP.DYN.LE00.IN~SH.XPD.PCAP, data = data)

summary(mod3)
mod3_residuals <- mod3$residuals
max(mod3_residuals)
min(mod3_residuals)