Does increased government spending on education increase primary school completion rate?
Each case represents the education expenditure and corresponding primary completion rate (as a percentage of total government expenditure and of the relevant age group, respectively) for one country in one year.
The cases are from the World Bank World Development Indicators.
The indicators that will be used are:
SE.XPD.TOTL.GB.ZS - Government expenditure on education, total (% of government expenditure)
SE.PRM.CMPT.ZS - Primary completion rate, total (% of relevant age group)
These data were collected by the World Bank. The dataset comes from the World Bank’s open data library and is updated yearly. The last observed values in each dataset are for 2017.
This study is an observational study.
The data comes from the World Bank data catalog:
https://datacatalog.worldbank.org/dataset/world-development-indicators
The dependent variable is the primary completion rate change for a given country.
The independent variables are the spending category (year-over-year increase or decrease) and the education expenditure change.
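As a minimal sketch of how these two variables can be derived, the example below computes a year-over-year change per country and labels its sign. The data frame `toy` and its values are invented for illustration, and it uses base R rather than the dplyr pipeline used in this project. The first year of each country is left as NA here, whereas the project code sets it to zero; in both cases that row receives no category.

```r
# Hypothetical toy data (values invented for illustration only)
toy <- data.frame(
  Country.Name = c("A", "A", "A", "B", "B"),
  Year         = c(2000, 2001, 2002, 2000, 2001),
  Spend        = c(10.0, 10.5, 10.2, 15.0, 15.0)
)

# Order by country and year so that differences are taken in time order
toy <- toy[order(toy$Country.Name, toy$Year), ]

# Change from the previous observation within each country;
# the first observation has no predecessor, so it is NA
toy$spend_diff <- ave(toy$Spend, toy$Country.Name,
                      FUN = function(x) c(NA, diff(x)))

# Spending category from the sign of the change;
# a zero or missing change is left uncategorized (NA)
toy$category <- ifelse(toy$spend_diff < 0, "Decrease",
                ifelse(toy$spend_diff > 0, "Increase", NA))

print(toy)
```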
Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
Because the two variables have different numbers of data points, we analyze only those cases where both variables are observed for the same country and year.
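The matching rule can be illustrated with a small inner join. This is only a sketch: `spend_toy` and `comp_toy` are hypothetical data frames with invented values, and `merge()` with its default settings keeps only the country-year pairs present in both inputs.

```r
# Hypothetical toy data (values invented for illustration only)
spend_toy <- data.frame(
  Country.Name = c("A", "A", "B"),
  Year         = c(2000, 2001, 2000),
  spend_diff   = c(0.5, -0.3, 1.1)
)
comp_toy <- data.frame(
  Country.Name = c("A", "B", "B"),
  Year         = c(2001, 2000, 2001),
  comp_diff    = c(2.0, -1.0, 0.4)
)

# Inner join: rows without a partner in the other data frame are dropped
both <- merge(spend_toy, comp_toy, by = c("Country.Name", "Year"))

print(both)
```

Here (A, 2000) and (B, 2001) appear in only one of the two inputs, so they do not reach the merged result.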
Below is some descriptive information:
library(tidyr)
library(dplyr)
comp <- read.csv("/Users/jp/Dropbox/Data Science MS/Courses/Data 606 - Statistics and Probability for Data Analytics/Project/data/API_SE.PRM.CMPT.ZS_DS2_en_csv_v2_10516401.csv",
header=TRUE, na.strings="",skip=4)
spend <- read.csv("/Users/jp/Dropbox/Data Science MS/Courses/Data 606 - Statistics and Probability for Data Analytics/Project/data/API_SE.XPD.TOTL.GB.ZS_DS2_en_csv_v2_10515724.csv",
header=TRUE, na.strings="",skip=3)
# Reshape spending to long format and compute the change from the
# previous available year within each country
spend2 <- spend %>%
  gather(key="Year", value="Spend", 5:62) %>%
  subset(!is.na(Spend), select=-c(X2018, X, Indicator.Name)) %>%
  group_by(Country.Name) %>%
  arrange(Year, .by_group=TRUE) %>%
  mutate(spend_diff = Spend - lag(Spend, default=first(Spend)))
spend2$Year <- as.integer(substr(spend2$Year, 2, 5))
# Same reshaping and differencing for the primary completion rate
comp2 <- comp %>%
  gather(key="Year", value="Completion", 5:62) %>%
  subset(!is.na(Completion), select=-c(X2018, X, Indicator.Name)) %>%
  group_by(Country.Name) %>%
  arrange(Year, .by_group=TRUE) %>%
  mutate(comp_diff = Completion - lag(Completion, default=first(Completion)))
comp2$Year <- as.integer(substr(comp2$Year, 2, 5))
df <- merge(spend2, comp2, by=c("Country.Name","Year")) %>%
select(c(Country.Name,Country.Code.x,Year,spend_diff,comp_diff))
colnames(df) <- c("Country.Name","Country.Code","Year","spend_diff","comp_diff")
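# Remove regions and groups designated by the World Bank, leaving only individual countries.
# Note: FSM (Micronesia, Fed. Sts.) and ZAF (South Africa) are country codes,
# so this list also drops those two countries.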
remove <- c("ARB","CEB","CSS","EAP","EAR","EAS","ECA","ECS","EUU","FCS","FSM","HIC",
"HPC","IBD","IBT","IDA","IDB","IDX","INX","LAC","LCN","LDC","LIC","LMC",
"LMY","LTE","MEA","MIC","MNA","NAC","OED","OSS","PRE","PSS","PST","SAS",
"SSA","SSF","SST","TEA","TEC","TLA","TMN","TSA","TSS","WLD","ZAF")
df$category <- ifelse(df$spend_diff<0,"Decrease",
ifelse(df$spend_diff>0,"Increase",NA))
df <- df[!(df$Country.Code %in% remove),c(6,1,2,3,4,5)] %>% subset(!is.na(category))
head(df,10)
## category Country.Name Country.Code Year spend_diff comp_diff
## 1 Increase Albania ALB 2000 0.45457 5.04882
## 2 Increase Albania ALB 2001 0.44490 -0.19255
## 3 Increase Albania ALB 2003 0.68668 -1.97823
## 4 Increase Albania ALB 2004 0.12956 -1.03155
## 5 Increase Albania ALB 2007 0.24424 2.75822
## 6 Increase Albania ALB 2013 0.94672 -7.93729
## 7 Decrease Albania ALB 2015 -0.80621 -0.53393
## 8 Increase Albania ALB 2016 2.27926 1.13591
## 9 Decrease Albania ALB 2017 -6.07333 1.12819
## 11 Increase Angola AGO 2010 1.22998 0.96724
df %>% group_by(category) %>% summarize(Cases=n())
## # A tibble: 2 x 2
## category Cases
## <chr> <int>
## 1 Decrease 791
## 2 Increase 851
barplot(table(df$category), main="Education Spending Observations", sub="Number of observations per group")
library(ggplot2)
# qplot() is the ggplot2 quick-plot function; stats::qqplot() does not accept a geom argument
qplot(as.factor(df$category), df$spend_diff, geom="boxplot", ylim=c(-5,5), main="Spending Change")
qplot(as.factor(df$category), df$comp_diff, geom="boxplot", ylim=c(-10,10), main="Primary Completion Change")
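The group counts and boxplots above could be complemented with numeric summaries per spending category. The following is a minimal base R sketch under stated assumptions: `df_toy` is a hypothetical stand-in for `df` with invented values, and the choice of statistics (count, mean, median, standard deviation) is only one reasonable option.

```r
# Hypothetical stand-in for df (values invented for illustration only)
df_toy <- data.frame(
  category  = c("Increase", "Increase", "Decrease", "Decrease", "Increase"),
  comp_diff = c(5.0, -0.2, -0.5, 1.1, 2.8)
)

# Split the completion changes by category and summarize each group;
# the result is a matrix with one row per category
by_group <- do.call(rbind, lapply(
  split(df_toy$comp_diff, df_toy$category),
  function(x) c(n = length(x), mean = mean(x), median = median(x), sd = sd(x))
))

print(round(by_group, 3))
```

The same call applied to `df$comp_diff` and `df$spend_diff` would summarize the real data for each group.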