library(tidyr)
library(dplyr)
library(ggplot2)
library(DT)

Part 1 - Introduction

For nations around the globe, education spending represents an important and often large segment of domestic spending. Globally, nations spend differing amounts on education with similarly differing results in the impact said spending makes in terms of primary completion rate. In this project we will examine if there is any correlation between government spending on education and primary completion rate in Low Income Countries (LIC), Middle Income Countries (MIC), and High Income Countries (HIC) between the years of 2000-2014.

Part 2 - Data

Below we will load the data has been collected by the World Bank. The dataset is from the World Bank World Development Indicators open data library and is updated yearly. We will be using the following data:

After loading, we will subset the data to isolate Low, Middle, and High Income country groups.

comp <- read.csv("/Users/jp/Dropbox/Data Science MS/Courses/Data 606 - Statistics and Probability for Data Analytics/Project/data/API_SE.PRM.CMPT.ZS_DS2_en_csv_v2_10516401.csv",
                 header=TRUE, na.strings="",skip=4)


spend <- read.csv("/Users/jp/Dropbox/Data Science MS/Courses/Data 606 - Statistics and Probability for Data Analytics/Project/data/API_SE.XPD.TOTL.GB.ZS_DS2_en_csv_v2_10515724.csv",
               header=TRUE, na.strings="",skip=3)

spend2 <- spend %>% 
    gather(key="Year", value="Spend",5:62) %>% 
    subset(!is.na(Spend), select=-c(X2018,X,Indicator.Name)) %>% 
    group_by(Country.Name) %>% 
    arrange(Year, .by_group=TRUE) %>% 
    mutate(spend_diff= Spend-lag(Spend, default=first(Spend)))
spend2$Year <- as.integer(substr(spend2$Year, 2, 5))

comp2 <- comp %>% 
    gather(key="Year", value="Completion",5:62) %>% 
    subset(!is.na(Completion), select=-c(X2018,X,Indicator.Name)) %>% 
    group_by(Country.Name) %>% 
    arrange(Year, .by_group=TRUE) %>% 
    mutate(comp_diff= Completion-lag(Completion, default=first(Completion)))
comp2$Year <- as.integer(substr(comp2$Year, 2, 5))

df <- merge(spend2, comp2, by=c("Country.Name","Year")) %>% 
    select(c(Country.Name,Country.Code.x,Year,Spend,Completion,spend_diff,comp_diff))
colnames(df) <- c("Country.Name","Country.Code","Year","Spend","Completion","spend_diff","comp_diff")

#remove regions and designated groups by world bank, leaving only individual countries
# remove <- c("ARB","CEB","CSS","EAP","EAR","EAS","ECA","ECS","EUU","FCS","FSM","HIC",
#             "HPC","IBD","IBT","IDA","IDB","IDX","INX","LAC","LCN","LDC","LIC","LMC",
#             "LMY","LTE","MEA","MIC","MNA","NAC","OED","OSS","PRE","PSS","PST","SAS",
#             "SSA","SSF","SST","TEA","TEC","TLA","TMN","TSA","TSS","WLD","ZAF")

df$category <- ifelse(df$spend_diff<0,"Decrease", 
                      ifelse(df$spend_diff>0,"Increase",NA))

#df <- df[!(df$Country.Code %in% remove),c(8,1,2,3,4,5,6,7)] %>% subset(!is.na(category))
#df <- df[(df$Country.Code %in% remove),c(8,1,2,3,4,5,6,7)] %>% subset(!is.na(category))

income_df <- subset(df, Country.Code=="LIC" | Country.Code=="HIC" | Country.Code=="MIC" )
income_df <- income_df[income_df$Year>=2000,] 
income_df <- income_df[income_df$Year<=2014,]

datatable(income_df)

Cases

Each case represents the education expenditure and corresponding primary completion rate (as percentage of GDP and relevant age group, respectively) per group per year.

Part 3 - Exploratory data analysis

Below we will graph spending and completion rates for each of the groups.

With Regards to primary completion rate, we make the following observations from the graphs above:

Regarding Education Spending, we make the following observations:

Part 4 - Inference

Below we will attempt to model primary completion rate and education spending to test whether education spending is a good predictor of primary completion rate.

## 
## Call:
## lm(formula = Completion ~ Spend, data = low_inc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.714  -6.850   1.960   5.059  10.591 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -10.237     53.032  -0.193    0.852
## Spend          4.383      3.251   1.348    0.215
## 
## Residual standard error: 8.014 on 8 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.08323 
## F-statistic: 1.817 on 1 and 8 DF,  p-value: 0.2146
## [1] 0.4302219

## 
## Call:
## lm(formula = Completion ~ Spend, data = mid_inc)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.540 -2.660  1.147  2.733  4.015 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   65.224     29.863   2.184   0.0515 .
## Spend          1.653      1.983   0.834   0.4223  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.599 on 11 degrees of freedom
## Multiple R-squared:  0.05941,    Adjusted R-squared:  -0.0261 
## F-statistic: 0.6948 on 1 and 11 DF,  p-value: 0.4223

## [1] 0.2437402

## 
## Call:
## lm(formula = Completion ~ Spend, data = hi_inc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2189 -0.1712  0.1716  0.3961  0.9019 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 108.0663     8.1678   13.23 6.42e-09 ***
## Spend        -0.8545     0.6425   -1.33    0.206    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6589 on 13 degrees of freedom
## Multiple R-squared:  0.1198, Adjusted R-squared:  0.05205 
## F-statistic: 1.769 on 1 and 13 DF,  p-value: 0.2064

## [1] -0.3460598

Part 5 - Conclusion

As shown by the models above we conclude the following:

References

https://data.worldbank.org/indicator/se.xpd.totl.gb.zs https://data.worldbank.org/indicator/SE.PRM.CMPT.ZS