Loading the data into R shows there are 247 observations of 121 variables.
Removing these rows reduces the data set to 217 observations of 121 variables.
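Dropping rows by negative index changes the number of observations but not the number of variables, which `dim()` confirms. A minimal sketch on a toy data frame (the data frame `toy` and the indices in `drop_rows` are made up for illustration and are not the rows removed from the real data):

```r
# Toy data frame standing in for the real data (hypothetical values).
toy <- data.frame(id = 1:10, x = rnorm(10), y = rnorm(10))
dim(toy)                       # rows, then columns

# Negative indices drop rows; the columns are untouched.
drop_rows <- c(2, 5, 9)
toy_small <- toy[-drop_rows, ]
dim(toy_small)                 # three fewer rows, same number of columns
```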
SP.DYN.LE00.IN is the variable that corresponds to life expectancy at birth.
As shown in the residual plots above, there is one extreme outlier (observation 173) that must be removed. It pulls the Residuals vs Fitted curve far from the expected flat line and lies far beyond the Cook’s distance contour in the Residuals vs Leverage plot.
These residual plots are much better with outlier 173 removed. However, in the Residuals vs Fitted plot the residuals are still not spread uniformly about the y = 0 line, so there may be further outliers or explanatory variables not taken into account. In the Q-Q plot the candidate outliers are points 107, 115 and 183, which lie furthest from the reference line, though none departs as far as outlier 173 did. The Scale-Location plot is probably the best of the four: its center line is not perfectly flat, but it is flatter than in the other plots. In the Residuals vs Leverage plot, point 107 again stands out, but it remains within Cook’s distance.
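The influence of a single point can also be checked numerically with `cooks.distance()`, which gives the quantity behind the contours in the Residuals vs Leverage plot. The sketch below uses simulated data with one planted outlier. The data, the names `sim`, `fit` and `fit2`, and the planted values are all assumptions for illustration, not the report's data.

```r
set.seed(1)
# Simulated data: 50 well-behaved points plus one planted outlier (hypothetical).
x <- c(rnorm(50), 8)
y <- c(70 - 4 * x[1:50] + rnorm(50, sd = 2), 75)
sim <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = sim)
cd <- cooks.distance(fit)

# Row index of the most influential observation.
worst <- unname(which.max(cd))
worst

# Refit without that row and compare the slopes.
fit2 <- lm(y ~ x, data = sim[-worst, ])
coef(fit)["x"]
coef(fit2)["x"]
```

Removing a row by its index, as here, avoids matching on a floating-point value of the predictor.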
The 20 variables with the fewest missing values are AG.SRF.TOTL.K2, SP.POP.GROW, AG.LND.TOTL.K2, EN.POP.DNST, EG.ELC.ACCS.RU.ZS, EG.ELC.ACCS.ZS, EN.BIR.THRD.NO, EN.FSH.THRD.NO, EN.HPT.THRD.NO, EN.MAM.THRD.NO, SP.RUR.TOTL.ZS, IT.NET.SECR, IT.MLT.MAIN, AG.LND.EL5M.ZS, AG.LND.FRST.ZS, SP.DYN.LE00.IN, NY.GDP.PCAP.CD, NY.GDP.PETR.RT.ZS, SH.MED.PHYS.ZS and SE.SEC.ENRL.FE.ZS.
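One way to rank variables by missingness is to count the `NA` values in each column and sort the counts. The sketch below does this on a small made-up data frame. The data frame `toy` and the choice `k <- 2` are assumptions for illustration; on the real data one would use `k <- 20`.

```r
# Toy data frame with differing amounts of missingness (hypothetical).
toy <- data.frame(
  a = c(1, NA, 3, NA, 5),
  b = c(1, 2, 3, 4, 5),
  c = c(NA, NA, NA, 4, 5),
  d = c(1, 2, NA, 4, 5)
)

# Number of missing values in each column, smallest first.
na_counts <- sort(colSums(is.na(toy)))
na_counts

# Names of the k columns with the fewest missing values.
k <- 2
fewest <- names(na_counts)[seq_len(k)]
fewest
```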
The variables SP.DYN.LE00.IN and SP.POP.GROW were removed when creating mod2. SP.DYN.LE00.IN was removed because it is the response variable, and SP.POP.GROW was removed because it is already the predictor in mod1.
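Excluding names from a list of candidate predictors can be done programmatically rather than by editing a long formula by hand. The sketch below uses `setdiff()` and `reformulate()` on a short, illustrative subset of the 20 variables listed above. The vector `vars` and the object names are assumptions, not part of the original analysis.

```r
# A short, illustrative subset of the candidate variables.
vars <- c("AG.SRF.TOTL.K2", "SP.POP.GROW", "EN.POP.DNST",
          "SP.DYN.LE00.IN", "NY.GDP.PCAP.CD")
response <- "SP.DYN.LE00.IN"

# Drop the response and the predictor already used in mod1.
predictors <- setdiff(vars, c(response, "SP.POP.GROW"))

# Build the model formula from the remaining names.
f <- reformulate(predictors, response = response)
f
```

The resulting formula can be passed directly to `lm()`, for example `lm(f, data = data)`.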
##
## Call:
## lm(formula = SP.DYN.LE00.IN ~ SP.POP.GROW, data = data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -22.1766  -3.9456  -0.4208   4.3489  17.9961
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  77.5698     0.7277  106.60   <2e-16 ***
## SP.POP.GROW  -4.8615     0.4309  -11.28   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.606 on 207 degrees of freedom
## (7 observations deleted due to missingness)
## Multiple R-squared: 0.3808, Adjusted R-squared: 0.3778
## F-statistic: 127.3 on 1 and 207 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = SP.DYN.LE00.IN ~ AG.SRF.TOTL.K2 + AG.LND.TOTL.K2 +
## EN.POP.DNST + EG.ELC.ACCS.RU.ZS + EG.ELC.ACCS.ZS + EN.BIR.THRD.NO +
## EN.FSH.THRD.NO + EN.HPT.THRD.NO + EN.MAM.THRD.NO + SP.RUR.TOTL.ZS +
## IT.NET.SECR + IT.MLT.MAIN + AG.LND.EL5M.ZS + AG.LND.FRST.ZS +
## NY.GDP.PCAP.CD + NY.GDP.PETR.RT.ZS + SH.MED.PHYS.ZS + SE.SEC.ENRL.FE.ZS,
## data = data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -14.0266  -1.8857   0.4022   2.4374   9.9660
##
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)        4.705e+01  4.038e+00  11.651  < 2e-16 ***
## AG.SRF.TOTL.K2    -2.958e-06  2.993e-06  -0.988  0.32440
## AG.LND.TOTL.K2     2.911e-06  3.116e-06   0.934  0.35139
## EN.POP.DNST        4.791e-04  5.139e-04   0.932  0.35243
## EG.ELC.ACCS.RU.ZS  9.743e-02  3.175e-02   3.069  0.00249 **
## EG.ELC.ACCS.ZS     3.685e-02  3.769e-02   0.978  0.32948
## EN.BIR.THRD.NO    -8.675e-03  1.435e-02  -0.605  0.54616
## EN.FSH.THRD.NO     2.356e-03  5.979e-03   0.394  0.69397
## EN.HPT.THRD.NO     5.954e-04  1.569e-03   0.380  0.70477
## EN.MAM.THRD.NO     3.198e-03  1.617e-02   0.198  0.84345
## SP.RUR.TOTL.ZS    -3.797e-02  1.871e-02  -2.029  0.04399 *
## IT.NET.SECR        4.630e-06  7.830e-06   0.591  0.55505
## IT.MLT.MAIN        1.233e-08  1.566e-08   0.788  0.43204
## AG.LND.EL5M.ZS     1.259e-02  1.868e-02   0.674  0.50135
## AG.LND.FRST.ZS     2.451e-02  1.300e-02   1.886  0.06100 .
## NY.GDP.PCAP.CD     8.922e-05  2.095e-05   4.258 3.36e-05 ***
## NY.GDP.PETR.RT.ZS -9.481e-02  2.850e-02  -3.327  0.00107 **
## SH.MED.PHYS.ZS     5.281e-01  3.232e-01   1.634  0.10412
## SE.SEC.ENRL.FE.ZS  2.740e-01  8.260e-02   3.317  0.00111 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.935 on 174 degrees of freedom
## (23 observations deleted due to missingness)
## Multiple R-squared: 0.8022, Adjusted R-squared: 0.7817
## F-statistic: 39.2 on 18 and 174 DF, p-value: < 2.2e-16
Comparing the residual summaries above, the median is very close to 0 in both models, and the 1Q and 3Q values are close to each other in absolute value. The minimum and maximum are less balanced, however: in both models the minimum residual is larger in magnitude than the maximum, so the largest errors are overestimates of life expectancy.
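The five numbers printed under "Residuals" in `summary()` are the quantiles of the model residuals, so the comparison can be reproduced directly. The sketch below uses simulated data. The data and the names `fit` and `five` are assumptions for illustration.

```r
set.seed(2)
# Simulated regression (hypothetical data) to illustrate the summary.
x <- rnorm(100)
y <- 70 - 4 * x + rnorm(100, sd = 3)
fit <- lm(y ~ x)

# Minimum, quartiles and maximum of the residuals.
five <- quantile(resid(fit), probs = c(0, 0.25, 0.5, 0.75, 1))
names(five) <- c("Min", "1Q", "Median", "3Q", "Max")
five

# TRUE when the lower tail extends further from zero than the upper tail.
abs(five[["Min"]]) > five[["Max"]]
```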
## Analysis of Variance Table
##
## Response: SP.DYN.LE00.IN
##              Df Sum Sq Mean Sq F value    Pr(>F)
## SP.POP.GROW   1 5555.8  5555.8  127.31 < 2.2e-16 ***
## Residuals   207 9033.4    43.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: SP.DYN.LE00.IN
##                    Df Sum Sq Mean Sq  F value    Pr(>F)
## AG.SRF.TOTL.K2      1    2.9     2.9   0.1863  0.666511
## AG.LND.TOTL.K2      1  105.7   105.7   6.8240  0.009780 **
## EN.POP.DNST         1  446.6   446.6  28.8365 2.494e-07 ***
## EG.ELC.ACCS.RU.ZS   1 8693.8  8693.8 561.3990 < 2.2e-16 ***
## EG.ELC.ACCS.ZS      1   59.8    59.8   3.8585  0.051088 .
## EN.BIR.THRD.NO      1    0.4     0.4   0.0246  0.875559
## EN.FSH.THRD.NO      1    2.1     2.1   0.1342  0.714550
## EN.HPT.THRD.NO      1    7.3     7.3   0.4745  0.491829
## EN.MAM.THRD.NO      1   40.3    40.3   2.6028  0.108489
## SP.RUR.TOTL.ZS      1  445.1   445.1  28.7423 2.600e-07 ***
## IT.NET.SECR         1   67.1    67.1   4.3331  0.038842 *
## IT.MLT.MAIN         1   10.1    10.1   0.6507  0.420980
## AG.LND.EL5M.ZS      1   59.9    59.9   3.8661  0.050861 .
## AG.LND.FRST.ZS      1   88.2    88.2   5.6930  0.018107 *
## NY.GDP.PCAP.CD      1  413.7   413.7  26.7120 6.428e-07 ***
## NY.GDP.PETR.RT.ZS   1  282.1   282.1  18.2185 3.230e-05 ***
## SH.MED.PHYS.ZS      1   31.5    31.5   2.0317  0.155844
## SE.SEC.ENRL.FE.ZS   1  170.4   170.4  11.0040  0.001107 **
## Residuals         174 2694.5    15.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
At the alpha = 0.05 level, nine of the 18 extra variables are significant in the ANOVA table for mod2: AG.LND.TOTL.K2, EN.POP.DNST, EG.ELC.ACCS.RU.ZS, SP.RUR.TOTL.ZS, IT.NET.SECR, AG.LND.FRST.ZS, NY.GDP.PCAP.CD, NY.GDP.PETR.RT.ZS and SE.SEC.ENRL.FE.ZS. The other nine are insignificant. Two of these are worth noting, EG.ELC.ACCS.ZS and AG.LND.EL5M.ZS, which only just miss significance at alpha = 0.05, with p-values of 0.051088 and 0.050861 respectively.
Neither variable needs to be transformed in its current state.
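When reading a table like this, it helps to remember that `anova()` on a single `lm` fit reports sequential sums of squares: each term is tested after the terms listed before it, so the order of the formula can change which terms look significant. The sketch below shows this on simulated, correlated predictors. The data and the names `x1`, `x2`, `a12` and `a21` are assumptions for illustration.

```r
set.seed(3)
# Simulated, correlated predictors (hypothetical data); only x2 drives y.
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)
y <- 2 * x2 + rnorm(n)

# Sequential ANOVA tables for the two orderings of the same model.
a12 <- anova(lm(y ~ x1 + x2))
a21 <- anova(lm(y ~ x2 + x1))

# Sum of squares credited to x1 when it enters first versus last.
a12["x1", "Sum Sq"]
a21["x1", "Sum Sq"]

# Terms significant at alpha = 0.05 in the first ordering.
p <- a12[["Pr(>F)"]]
rownames(a12)[!is.na(p) & p < 0.05]
```

The t-tests in `summary()` do not depend on the order, because each tests a coefficient with all other terms already in the model.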
##
## Call:
## lm(formula = SP.DYN.LE00.IN ~ SH.XPD.PCAP, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -19.627  -4.503   1.591   5.254  10.570
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.789e+01  5.805e-01 116.945   <2e-16 ***
## SH.XPD.PCAP 2.647e-03  2.791e-04   9.483   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.853 on 186 degrees of freedom
## (28 observations deleted due to missingness)
## Multiple R-squared: 0.3259, Adjusted R-squared: 0.3223
## F-statistic: 89.92 on 1 and 186 DF, p-value: < 2.2e-16
## [1] 10.5701
## [1] -19.62739
The countries with the most positive and most negative residuals are observation 109, with a residual of 10.5701, and observation 183, with a residual of -19.62739.
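One way to identify which rows produce the extreme residuals is to use the names of the residual vector, which are the row names of the data. Rows dropped for missing values are skipped, so a position in the residual vector need not equal a row number. The sketch below uses a toy data frame with one missing value. The data and the names `toy`, `hi` and `lo` are assumptions for illustration.

```r
set.seed(4)
# Toy data with labelled rows and one missing predictor value (hypothetical).
toy <- data.frame(x = rnorm(20), y = rnorm(20))
rownames(toy) <- paste0("obs", 1:20)
toy$x[5] <- NA

fit <- lm(y ~ x, data = toy)
r <- residuals(fit)
length(r)                     # one fewer than nrow(toy): row obs5 was dropped

# Row names of the largest and smallest residuals.
hi <- names(which.max(r))
lo <- names(which.min(r))

# Look the rows up by name in the original data.
toy[c(hi, lo), ]
```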
# Load plotting and data-manipulation packages quietly.
suppressMessages(library(ggplot2))
suppressMessages(library(dplyr))
# Read the data, then drop 30 rows by index (247 observations become 217).
data <- read.csv("http://stat.slu.edu/~speegle/student_data/student_data_long25.csv", as.is = TRUE)
data <- data[-c(35, 47, 59, 60, 61, 62, 65, 70, 71, 91, 94, 119, 125, 126, 127, 130, 131, 143, 146, 151, 168, 171, 172, 174, 186, 202, 204, 205, 230, 240),]
# Scatterplots of life expectancy against two candidate variables.
plot1 <- ggplot(data, aes(x = SP.DYN.LE00.IN, y = SG.GEN.PARL.ZS)) + geom_point() + ggtitle("Life Expectancy at Birth vs Proportion of Seats held by Women in National Parliaments")
plot1
plot2 <- ggplot(data, aes(x = SP.DYN.LE00.IN, y = SP.POP.GROW)) + geom_point() + ggtitle("Life Expectancy at Birth vs Population Growth")
plot2
# Initial simple regression and its four diagnostic plots.
lm1 <- lm(SP.DYN.LE00.IN~SP.POP.GROW, data = data)
plot(lm1)
# Remove the extreme outlier by its SP.POP.GROW value, then refit.
data <- filter(data, SP.POP.GROW != 8.08855915954617)
mod1 <- lm(SP.DYN.LE00.IN~SP.POP.GROW, data = data)
plot(mod1)
# lm2 lists the response and SP.POP.GROW among the predictors; mod2 drops both.
lm2 <- lm(SP.DYN.LE00.IN~AG.SRF.TOTL.K2 + SP.POP.GROW + AG.LND.TOTL.K2 + EN.POP.DNST + EG.ELC.ACCS.RU.ZS + EG.ELC.ACCS.ZS + EN.BIR.THRD.NO + EN.FSH.THRD.NO + EN.HPT.THRD.NO + EN.MAM.THRD.NO + SP.RUR.TOTL.ZS + IT.NET.SECR + IT.MLT.MAIN + AG.LND.EL5M.ZS + AG.LND.FRST.ZS + SP.DYN.LE00.IN + NY.GDP.PCAP.CD + NY.GDP.PETR.RT.ZS + SH.MED.PHYS.ZS + SE.SEC.ENRL.FE.ZS, data = data)
mod2 <- lm(SP.DYN.LE00.IN~AG.SRF.TOTL.K2 + AG.LND.TOTL.K2 + EN.POP.DNST + EG.ELC.ACCS.RU.ZS + EG.ELC.ACCS.ZS + EN.BIR.THRD.NO + EN.FSH.THRD.NO + EN.HPT.THRD.NO + EN.MAM.THRD.NO + SP.RUR.TOTL.ZS + IT.NET.SECR + IT.MLT.MAIN + AG.LND.EL5M.ZS + AG.LND.FRST.ZS + NY.GDP.PCAP.CD + NY.GDP.PETR.RT.ZS + SH.MED.PHYS.ZS + SE.SEC.ENRL.FE.ZS, data = data)
anova(mod1)
anova(mod2)
# Draw a filled world map, passing the life expectancy values as fill colours.
library(maps)
map("world", fill = TRUE, col = data$SP.DYN.LE00.IN, plot = TRUE)
# Regress life expectancy on SH.XPD.PCAP and find the extreme residuals.
mod3 <- lm(SP.DYN.LE00.IN~SH.XPD.PCAP, data = data)
summary(mod3)
mod3_residuals <- mod3$residuals
max(mod3_residuals)
min(mod3_residuals)