DATA 606 Data Project Proposal

Research question

Does an increase in government spending on education increase the primary school completion rate?

Cases

Each case is one country in one year, described by its government expenditure on education (as a percentage of total government expenditure) and its primary completion rate (as a percentage of the relevant age group).

The cases are from the World Bank World Development Indicators.

The indicators that will be used are:

SE.XPD.TOTL.GB.ZS - Government expenditure on education, total (% of government expenditure)
SE.PRM.CMPT.ZS - Primary completion rate, total (% of relevant age group)

Data collection

The data were collected by the World Bank. Both indicators come from the World Bank’s open data library, which is updated yearly. The most recent observed values for each indicator are for 2017.

Type of study

This study is an observational study.

Data Source

The data comes from the World Bank data catalog:
https://datacatalog.worldbank.org/dataset/world-development-indicators

Dependent Variable

The dependent variable is the change in a country's primary completion rate relative to its previous available observation.

Independent Variables

The independent variables are the spending category (whether education expenditure increased or decreased relative to the previous available observation) and the size of that change in education expenditure.
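
To make these definitions concrete, here is a minimal sketch in base R on made-up numbers. The data frame `toy` and its values are hypothetical and are not taken from the World Bank files. The sketch assumes the change is measured against the previous available observation for the same country, and it leaves each country's first observation as `NA` rather than 0.

```r
# Hypothetical toy data: three observations for each of two countries
toy <- data.frame(
  Country = c("A", "A", "A", "B", "B", "B"),
  Year    = c(2000, 2001, 2003, 2000, 2001, 2002),
  Spend   = c(10.0, 10.5, 10.2, 15.0, 15.0, 16.0)
)

# Sort so that differences are taken in time order within each country
toy <- toy[order(toy$Country, toy$Year), ]

# Change from the previous available observation; NA for the first one
toy$spend_diff <- ave(toy$Spend, toy$Country,
                      FUN = function(x) c(NA, diff(x)))

# Spending category; unchanged or undefined differences get NA
toy$category <- ifelse(toy$spend_diff < 0, "Decrease",
                       ifelse(toy$spend_diff > 0, "Increase", NA))

print(toy)
```

In this toy example country A has no value for 2002, so its 2003 change is measured against 2001 and is not strictly a year-over-year change.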

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g., scatter plots, boxplots). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

Because the two indicators have different numbers of data points, the analysis is restricted to cases where both variables are observed for the same country and year.
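
A minimal sketch of this restriction, assuming it is implemented as an inner join on country and year. The data frames `spend_toy` and `comp_toy` and their values are hypothetical.

```r
# Hypothetical toy data with only partly overlapping country-years
spend_toy <- data.frame(
  Country    = c("A", "A", "B"),
  Year       = c(2000, 2001, 2000),
  spend_diff = c(0.5, -0.2, 1.0)
)
comp_toy <- data.frame(
  Country   = c("A", "B", "B"),
  Year      = c(2001, 2000, 2001),
  comp_diff = c(1.5, -0.7, 2.0)
)

# merge() defaults to all = FALSE, i.e. an inner join: only country-years
# present in both data frames are kept
both <- merge(spend_toy, comp_toy, by = c("Country", "Year"))

print(both)
```

Only the pairs (A, 2001) and (B, 2000) appear in both inputs, so the result has two rows and no missing values in either variable.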

Below is some descriptive information:

library(tidyr)
library(dplyr)

# Primary completion rate (SE.PRM.CMPT.ZS); skip the metadata lines that precede the header row
comp <- read.csv("/Users/jp/Dropbox/Data Science MS/Courses/Data 606 - Statistics and Probability for Data Analytics/Project/data/API_SE.PRM.CMPT.ZS_DS2_en_csv_v2_10516401.csv",
                 header=TRUE, na.strings="", skip=4)

# Government expenditure on education (SE.XPD.TOTL.GB.ZS); same file layout
spend <- read.csv("/Users/jp/Dropbox/Data Science MS/Courses/Data 606 - Statistics and Probability for Data Analytics/Project/data/API_SE.XPD.TOTL.GB.ZS_DS2_en_csv_v2_10515724.csv",
                  header=TRUE, na.strings="", skip=4)

# Reshape spending to long format (one row per country and year), drop missing
# values, and compute the change from each country's previous available observation
spend2 <- spend %>% 
    gather(key="Year", value="Spend",5:62) %>% 
    subset(!is.na(Spend), select=-c(X2018,X,Indicator.Name)) %>% 
    group_by(Country.Name) %>% 
    arrange(Year, .by_group=TRUE) %>% 
    mutate(spend_diff= Spend-lag(Spend, default=first(Spend)))
# Year columns are named like "X1960"; strip the "X" prefix
spend2$Year <- as.integer(substr(spend2$Year, 2, 5))

# Same steps for the primary completion rate
comp2 <- comp %>% 
    gather(key="Year", value="Completion",5:62) %>% 
    subset(!is.na(Completion), select=-c(X2018,X,Indicator.Name)) %>% 
    group_by(Country.Name) %>% 
    arrange(Year, .by_group=TRUE) %>% 
    mutate(comp_diff= Completion-lag(Completion, default=first(Completion)))
comp2$Year <- as.integer(substr(comp2$Year, 2, 5))

df <- merge(spend2, comp2, by=c("Country.Name","Year")) %>% 
    select(c(Country.Name,Country.Code.x,Year,spend_diff,comp_diff))
colnames(df) <- c("Country.Name","Country.Code","Year","spend_diff","comp_diff")

# Remove regions and other aggregate groups defined by the World Bank, leaving only individual countries.
# Note: "ZAF" is South Africa's country code rather than an aggregate, so this list also drops South Africa.
remove <- c("ARB","CEB","CSS","EAP","EAR","EAS","ECA","ECS","EUU","FCS","FSM","HIC",
            "HPC","IBD","IBT","IDA","IDB","IDX","INX","LAC","LCN","LDC","LIC","LMC",
            "LMY","LTE","MEA","MIC","MNA","NAC","OED","OSS","PRE","PSS","PST","SAS",
            "SSA","SSF","SST","TEA","TEC","TLA","TMN","TSA","TSS","WLD","ZAF")

df$category <- ifelse(df$spend_diff<0,"Decrease", 
                      ifelse(df$spend_diff>0,"Increase",NA))

df <- df[!(df$Country.Code %in% remove),c(6,1,2,3,4,5)] %>% subset(!is.na(category))
head(df,10)

##    category Country.Name Country.Code Year spend_diff comp_diff
## 1  Increase      Albania          ALB 2000    0.45457   5.04882
## 2  Increase      Albania          ALB 2001    0.44490  -0.19255
## 3  Increase      Albania          ALB 2003    0.68668  -1.97823
## 4  Increase      Albania          ALB 2004    0.12956  -1.03155
## 5  Increase      Albania          ALB 2007    0.24424   2.75822
## 6  Increase      Albania          ALB 2013    0.94672  -7.93729
## 7  Decrease      Albania          ALB 2015   -0.80621  -0.53393
## 8  Increase      Albania          ALB 2016    2.27926   1.13591
## 9  Decrease      Albania          ALB 2017   -6.07333   1.12819
## 11 Increase       Angola          AGO 2010    1.22998   0.96724

df %>% group_by(category) %>% summarize(Cases=n())

## # A tibble: 2 x 2
##   category Cases
##   <chr>    <int>
## 1 Decrease   791
## 2 Increase   851

barplot(table(df$category), main="Education Spending Observations", sub="Number of observations per group")

library(ggplot2)
# Boxplots of the change in each variable by spending category; the y-axis is limited for readability
qplot(as.factor(df$category), df$spend_diff, geom="boxplot", ylim=c(-5,5), main="Spending Change")

qplot(as.factor(df$category), df$comp_diff, geom="boxplot", ylim=c(-10,10), main="Primary Completion Change")
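
The boxplots can be complemented by numerical summaries per spending category. The following is a minimal sketch in base R on made-up numbers. The data frame `toy` and its values are hypothetical, and the choice of statistics (count, mean, median, standard deviation) is an assumption about what would be useful here, not a result from the World Bank data.

```r
# Hypothetical toy data: change in completion rate by spending category
toy <- data.frame(
  category  = c("Decrease", "Decrease", "Increase", "Increase", "Increase"),
  comp_diff = c(-1.0, 2.0, 0.5, 1.5, 4.0)
)

# One row of summary statistics per category
summary_by_group <- do.call(rbind, lapply(
  split(toy$comp_diff, toy$category),
  function(x) data.frame(n = length(x), mean = mean(x),
                         median = median(x), sd = sd(x))
))

print(summary_by_group)
```

Applied to the real data, the same pattern would use `df$comp_diff` and `df$category` in place of the toy columns.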