StatsProject.2

Introduction

This exploration is a statistical inquiry in the Human Sciences, focused on development economics. To keep the exploration practically meaningful, I adopt the framework of development established by the UN Sustainable Development Goals. The main aim, however, is to strengthen my statistical intuition by working with real-world data. I therefore hold this practical framework fixed and fit linear models of increasing complexity to it. In doing so, I expect to address both the statistical and the practical requirements of data exploration by examining different outcomes and predictors independently.

Considering sustainable development goals 4, 7 & 8, I focus on data about world education, energy production and consumption, as well as econometric indicators.

Outcome Variables:

Primary dependent variable: Access to electricity (% of population)

Predictor Variables:

Primary independent variable: Government expenditure on education, total (% of GDP)

Secondary independent variable: Foreign direct investment, net inflows (BoP, current US$)

library(tidyverse)

## ── Attaching packages ───────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.1.0     ✔ purrr   0.3.0
## ✔ tibble  2.0.1     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.3.1     ✔ forcats 0.3.0

## ── Conflicts ──────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(wbstats)

Raw Data Selection

worldBankVars <- c("EG.ELC.ACCS.ZS","SE.XPD.TOTL.GD.ZS","BX.KLT.DINV.CD.WD","EG.IMP.CONS.ZS")

varsNames <- c("access_elect","education","foreignDirectInvestment", "energy_imports")

worldBankData <- wb(country = "all", indicator = worldBankVars, startdate = 2000, enddate = 2016, return_wide = TRUE)

Processed Data

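new_data_format <- worldBankData %>%
  rename_at(vars(worldBankVars), ~ varsNames) %>% # rename indicator codes to readable names
  drop_na() %>%                                 # keep only rows with all four indicators present
  group_by(country) %>%
  arrange(date) %>%
  # top_n(1) without a weight column ranks by the last column (education, see the
  # message below), so this keeps each country's highest-education row,
  # not necessarily its most recent year
  top_n(1) %>%
  ungroup()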

## Selecting by education

nrow(new_data_format)

## [1] 173
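The message "Selecting by education" indicates that top_n(1) ranked each country's rows by education rather than by date. The following is a minimal sketch of the difference, using a hypothetical toy data frame (the values are invented, not World Bank data) and assuming the dplyr package is installed. It is an illustration only; the models in this report use the data as processed above.

```r
library(dplyr)

# Hypothetical toy data: two countries, two years each (invented values)
toy <- data.frame(
  country      = c("A", "A", "B", "B"),
  date         = c(2014, 2015, 2014, 2015),
  access_elect = c(50, 55, 90, 92),
  education    = c(5.0, 4.0, 3.0, 3.5)
)

# Keep each country's most recent year
latest <- toy %>%
  group_by(country) %>%
  filter(date == max(date)) %>%
  ungroup()

# Keep each country's row with the highest education value,
# which is what ranking by education does
by_education <- toy %>%
  group_by(country) %>%
  filter(education == max(education)) %>%
  ungroup()

print(latest)
print(by_education)
```

For country A the two selections differ: the most recent year is 2015, but the highest education value was recorded in 2014.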

Model 1

model1 <- lm(access_elect ~ education, new_data_format)

summary(model1)

## 
## Call:
## lm(formula = access_elect ~ education, data = new_data_format)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -72.249  -4.109  11.638  16.232  22.956 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   69.409      6.440  10.777   <2e-16 ***
## education      2.656      1.205   2.205   0.0288 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.25 on 171 degrees of freedom
## Multiple R-squared:  0.02765,    Adjusted R-squared:  0.02197 
## F-statistic: 4.863 on 1 and 171 DF,  p-value: 0.02877

confint(model1)

##                 2.5 %    97.5 %
## (Intercept) 56.696181 82.121159
## education    0.278628  5.034033

plot(model1)

Model 1: Primary outcome vs. primary predictor (random variable x1)

The R summary function displays significant statistical information about the first model.

The model can be summarized in the y = mx + b form: y = 2.656x + 69.409

b is the B0 intercept of the linear relationship: the predicted access to electricity when education expenditure is 0% of GDP. As seen in the R console, the intercept is highly significantly different from zero, with p < 2E-16.

m, the slope of the linear relationship, is the expected change in y for a one-unit change in x: each additional percentage point of GDP spent on education is associated with about 2.66 more percentage points of the population having access to electricity. As seen in the R console, the slope has p = 0.0288, so it is statistically significant at the 5% level, and its 95% confidence interval (0.28 to 5.03) excludes zero.

Ultimately, model1 displays a positive association between expenditure on education and the percentage of the population with access to electricity, with a slope that is significant at the 5% level. The fit is weak, however: the R-squared of 0.02765 means that education expenditure explains under 3% of the variation in access. Furthermore, the residual summary (a minimum of -72.249 against a median of 11.638) and model1's q-q plot point to a long lower tail, that is, a group of countries whose access to electricity falls far below what the model predicts. A practical reading is that these are low-income countries, although education expenditure is measured as a share of GDP, so the model cannot confirm this by itself. All in all, model1 introduces the general relationship under inquiry: empirical wealth trends.
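As a short worked check, the t value, p-value and 95% confidence interval of the slope can be reproduced from the estimate and standard error printed in the summary. The sketch below uses the rounded values copied from the console output, so the results agree with the console only up to rounding.

```r
# Values copied (rounded) from the model1 summary output
estimate  <- 2.656   # slope for education
std_error <- 1.205   # its standard error
df_resid  <- 171     # residual degrees of freedom

# t statistic: the estimate measured in standard errors
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# 95% confidence interval: estimate +/- critical t * standard error
t_crit <- qt(0.975, df = df_resid)
ci     <- estimate + c(-1, 1) * t_crit * std_error

print(round(c(t = t_value, p = p_value), 4))
print(round(ci, 3))
```

The interval excluding zero and the p-value falling below 0.05 are two views of the same test.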

Model 2

model2 <- lm(access_elect ~ foreignDirectInvestment, new_data_format)

summary(model2)

## 
## Call:
## lm(formula = access_elect ~ foreignDirectInvestment, data = new_data_format)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -75.025  -7.533  12.935  17.848  18.343 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             8.195e+01  2.029e+00  40.384   <2e-16 ***
## foreignDirectInvestment 1.223e-11  7.484e-12   1.634    0.104    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.41 on 171 degrees of freedom
## Multiple R-squared:  0.01537,    Adjusted R-squared:  0.00961 
## F-statistic: 2.669 on 1 and 171 DF,  p-value: 0.1042

confint(model2)

##                                 2.5 %       97.5 %
## (Intercept)              7.794347e+01 8.595474e+01
## foreignDirectInvestment -2.546370e-12 2.699835e-11

plot(model2)

Model 2: Primary outcome (Yi) vs. secondary predictor (random variable x2)

The R summary function displays significant statistical information about the second model.

The model can be summarized in the y = m2x2 + b form: y = 1.223E-11x2 + 81.95

b is the B0 intercept of the linear relationship. As seen in the R console, the intercept is highly significant with p < 2E-16 (the same level of significance as in model 1, but with a higher estimate: 81.95 against 69.409).

m2, the slope of the linear relationship, is the expected change in y for a one-unit (one US dollar) change in x2. As seen in the R console, the slope has p = 0.104, so it is not statistically significant at the 5% level or even at the 10% level, and its 95% confidence interval includes zero.

Ultimately, the model displays a positive estimated association between foreign direct investment and the percentage of the population with access to electricity, but the slope is not statistically significant at conventional levels. Furthermore, model2's diagnostics reveal residual patterns similar to those of model 1. This may reflect the standard empirical wealth association in world economic data, which could also induce collinearity between the primary and secondary predictors; similar residuals alone do not establish collinearity, though, so it would need to be checked directly.
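Collinearity between two predictors can be checked directly through their correlation and the variance inflation factor (VIF). The sketch below uses simulated stand-ins for the two predictors (the names x1 and x2 and all values are assumptions, not the World Bank data), so it shows only the mechanics of the check and says nothing about the actual data.

```r
set.seed(1)

# Simulated stand-ins (assumed values, not the real data)
n  <- 173
x1 <- rnorm(n, mean = 4.5, sd = 1.5)             # education, % of GDP
x2 <- 1e9 * (2 + 0.5 * x1 + rnorm(n, sd = 3))    # FDI, US$

# Pairwise correlation between the predictors
r <- cor(x1, x2)

# VIF for x1: regress it on the other predictor and use 1 / (1 - R^2)
r2  <- summary(lm(x1 ~ x2))$r.squared
vif <- 1 / (1 - r2)

print(round(c(correlation = r, vif = vif), 3))
```

With only two predictors, the R-squared of this auxiliary regression equals the squared correlation. A VIF near 1 indicates that collinearity has little effect on the standard errors; values above roughly 5 to 10 are commonly treated as a warning sign.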

Model 3

model3 <- lm(access_elect ~ education + foreignDirectInvestment, new_data_format)

summary(model3)

## 
## Call:
## lm(formula = access_elect ~ education + foreignDirectInvestment, 
##     data = new_data_format)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -71.298  -5.307  11.748  16.693  24.102 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             6.806e+01  6.452e+00  10.549   <2e-16 ***
## education               2.714e+00  1.198e+00   2.265   0.0248 *  
## foreignDirectInvestment 1.269e-11  7.398e-12   1.716   0.0880 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.1 on 170 degrees of freedom
## Multiple R-squared:  0.04421,    Adjusted R-squared:  0.03296 
## F-statistic: 3.931 on 2 and 170 DF,  p-value: 0.02142

confint(model3)

##                                 2.5 %       97.5 %
## (Intercept)              5.532396e+01 8.079633e+01
## education                3.484825e-01 5.079125e+00
## foreignDirectInvestment -1.909138e-12 2.729781e-11

plot(model3)

Model 3: Primary outcome vs. primary predictor + secondary predictor

The R summary function displays significant statistical information about the third model.

The model can be summarized in the y = m1x1 + m2x2 + b form: y = 2.714x1 + 1.269E-11x2 + 68.06

b is the B0 intercept of the linear relationship. As seen in the R console, the intercept is statistically significant with p < 2E-16. Notably, its standard error (6.452) is slightly larger than in model 1 (6.440).

m1, the coefficient on x1, is the expected change in y for a one-unit change in x1 with x2 held fixed. As seen in the R console, this coefficient has p = 0.0248, so it is statistically significant at the 5% level. Model 3, a more complex model, gives a slightly larger estimate for the primary predictor than model 1 (2.714 against 2.656).

m2, the coefficient on x2, is the expected change in y for a one-unit change in x2 with x1 held fixed. As seen in the R console, this coefficient has p = 0.0880, so it is significant at the 10% level but not at the 5% level. Model 3 gives a slightly larger estimate for x2 than model 2 (1.269E-11 against 1.223E-11), with a smaller p-value (0.0880 against 0.104).

Altogether, the model displays positive estimated associations between both predictors and the outcome, and all three coefficients (the intercept and the two slopes) have p < alpha at a significance level of alpha = 0.1. Furthermore, model 3's diagnostics display residual patterns similar to those of models 1 and 2, which reinforces the practical association discussed in model2's exploration section.

Ultimately, model3 is statistically significant overall, but it is a weak fit. Its R-squared of 0.04421 is the highest of the three models; because R-squared never decreases when a predictor is added, the adjusted R-squared is the fairer comparison, and at 0.03296 it is also the highest. Even so, the model explains under 5% of the variation in access to electricity. Furthermore, model3 has an F-statistic of 3.931 with a p-value of 0.02142, slightly smaller than model1's p-value of 0.02877.

Altogether, adding the secondary predictor slightly increased the standard error of the estimated B0 (from 6.440 to 6.452), and with it the width of its confidence interval, while keeping its estimate and p-value consistent with model 1. The standard errors of the two slopes, by contrast, decreased slightly (education from 1.205 to 1.198; foreign direct investment from 7.484E-12 to 7.398E-12), and the slope estimates barely changed, which suggests that collinearity between the predictors is not inflating the estimates. Practically considered, the small differences among the three models are consistent with a general understanding of development in which both internal and external investment accompany access to electricity.

All in all, the third model suggests that, of the two predictors, education expenditure has the stronger statistical association with this measure of development. That is to say, when development is viewed as the share of a population with access to electricity and modelled as the sum of independent measures of internal and external investment in the national economy, internal investment is significant at the 5% level while external investment is significant only at the 10% level. Because the two predictors are measured in different units (% of GDP and current US$), the sizes of their coefficients cannot be compared directly.
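The very small coefficient on x2 is a matter of units: it is the change in y per single US dollar. Expressed per billion US$, 1.269E-11 becomes about 0.0127 percentage points. The sketch below uses simulated data (the variable names and values are assumptions, not the World Bank data) to illustrate that rescaling a predictor multiplies its coefficient by the scale factor while leaving the t value, and hence the p-value, unchanged.

```r
set.seed(2)

# Simulated data (assumed values, not the real data)
n   <- 173
fdi <- abs(rnorm(n, mean = 5e9, sd = 2e10))    # FDI in US$
y   <- 60 + 1e-9 * fdi + rnorm(n, sd = 25)     # outcome in percentage points

# Same model, with the predictor in US$ and in billions of US$
fit_usd     <- lm(y ~ fdi)
fdi_billion <- fdi / 1e9
fit_billion <- lm(y ~ fdi_billion)

coef_usd <- coef(summary(fit_usd))["fdi", ]
coef_bn  <- coef(summary(fit_billion))["fdi_billion", ]

print(coef_usd)
print(coef_bn)

# The reported model3 coefficient, re-expressed per billion US$
print(1.269e-11 * 1e9)
```

Rescaling changes only how the coefficient reads, not the strength of the evidence for it.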
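Since R-squared cannot decrease when a predictor is added, the adjusted R-squared, which penalizes each extra predictor, is a fairer basis for comparing the three models. As a worked check, the sketch below recomputes it from the R-squared values printed in the console, with n = 173 observations; adj_r2 is a helper name introduced here for illustration.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 173  # rows in new_data_format

# R-squared values copied from the three summary outputs
adj1 <- adj_r2(0.02765, n, p = 1)  # model1: education
adj2 <- adj_r2(0.01537, n, p = 1)  # model2: foreign direct investment
adj3 <- adj_r2(0.04421, n, p = 2)  # model3: both predictors

print(round(c(model1 = adj1, model2 = adj2, model3 = adj3), 5))
```

The results agree with the adjusted R-squared values in the console up to rounding, and model3 remains the highest after the penalty.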

Anova

anova(model1,model2,model3)

## Analysis of Variance Table
## 
## Model 1: access_elect ~ education
## Model 2: access_elect ~ foreignDirectInvestment
## Model 3: access_elect ~ education + foreignDirectInvestment
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1    171 108995                              
## 2    171 110372  0   -1377.1                 
## 3    170 107139  1    3232.8 5.1295 0.02478 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

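To assess the increasing complexity of the models, I conduct an ANOVA test. As seen in the console, model3 has the lowest RSS: 107139. Because anova() compares each model with the one listed before it, the row for model 2 carries no test (models 1 and 2 have the same number of parameters and are not nested), and the F-statistic of 5.1295 with p: 0.02478 in the row for model 3 tests model3 against model2, that is, the contribution of education once foreign direct investment is already in the model. This suggests that model3 fits significantly better than model2 at the 5% level; its comparison with model1 corresponds to the test of the foreign direct investment coefficient in model3 (p: 0.0880).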
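The F-statistic in the table can be reproduced from the residual sums of squares. The last row compares model3 with model2, the model listed immediately before it. The sketch below recomputes that test from the rounded RSS values in the console and applies the same formula to the nested pair model1 and model3; partial_f is a helper name introduced here for illustration.

```r
# RSS values copied (rounded) from the anova table
rss     <- c(model1 = 108995, model2 = 110372, model3 = 107139)
df_full <- 170  # residual degrees of freedom of model3

# Partial F test for a reduced model nested in a full model
partial_f <- function(rss_reduced, rss_full, df_diff, df_full) {
  f <- ((rss_reduced - rss_full) / df_diff) / (rss_full / df_full)
  p <- pf(f, df_diff, df_full, lower.tail = FALSE)
  c(F = f, p = p)
}

# model3 against model2: does adding education help?
m3_vs_m2 <- partial_f(rss[["model2"]], rss[["model3"]], 1, df_full)

# model3 against model1: does adding foreign direct investment help?
m3_vs_m1 <- partial_f(rss[["model1"]], rss[["model3"]], 1, df_full)

print(round(m3_vs_m2, 4))
print(round(m3_vs_m1, 4))
```

Each comparison adds a single predictor, so its p-value matches the t test of that predictor's coefficient in model3, up to rounding. In R, calling anova(model1, model3) and anova(model2, model3) separately would report these two nested comparisons directly.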