For nations around the globe, education spending represents an important and often large segment of domestic spending. Globally, nations spend differing amounts on education with similarly differing results in the impact said spending makes in terms of primary completion rate. In this project we will examine if there is any correlation between government spending on education and primary completion rate in Low Income Countries (LIC), Middle Income Countries (MIC), and High Income Countries (HIC) between the years of 2000-2014.
Below we will load the data has been collected by the World Bank. The dataset is from the World Bank World Development Indicators open data library and is updated yearly. We will be using the following data:
SE.XPD.TOTL.GB.ZS - Government expenditure on education, total (% of government expenditure)
SE.PRM.CMPT.ZS - Primary completion rate, total (% of relevant age group)
After loading, we will subset the data to isolate Low, Middle, and High Income country groups.
comp <- read.csv("/Users/jp/Dropbox/Data Science MS/Courses/Data 606 - Statistics and Probability for Data Analytics/Project/data/API_SE.PRM.CMPT.ZS_DS2_en_csv_v2_10516401.csv",
header=TRUE, na.strings="",skip=4)
spend <- read.csv("/Users/jp/Dropbox/Data Science MS/Courses/Data 606 - Statistics and Probability for Data Analytics/Project/data/API_SE.XPD.TOTL.GB.ZS_DS2_en_csv_v2_10515724.csv",
header=TRUE, na.strings="",skip=3)
spend2 <- spend %>%
gather(key="Year", value="Spend",5:62) %>%
subset(!is.na(Spend), select=-c(X2018,X,Indicator.Name)) %>%
group_by(Country.Name) %>%
arrange(Year, .by_group=TRUE) %>%
mutate(spend_diff= Spend-lag(Spend, default=first(Spend)))
spend2$Year <- as.integer(substr(spend2$Year, 2, 5))
comp2 <- comp %>%
gather(key="Year", value="Completion",5:62) %>%
subset(!is.na(Completion), select=-c(X2018,X,Indicator.Name)) %>%
group_by(Country.Name) %>%
arrange(Year, .by_group=TRUE) %>%
mutate(comp_diff= Completion-lag(Completion, default=first(Completion)))
comp2$Year <- as.integer(substr(comp2$Year, 2, 5))
df <- merge(spend2, comp2, by=c("Country.Name","Year")) %>%
select(c(Country.Name,Country.Code.x,Year,Spend,Completion,spend_diff,comp_diff))
colnames(df) <- c("Country.Name","Country.Code","Year","Spend","Completion","spend_diff","comp_diff")
#remove regions and designated groups by world bank, leaving only individual countries
# remove <- c("ARB","CEB","CSS","EAP","EAR","EAS","ECA","ECS","EUU","FCS","FSM","HIC",
# "HPC","IBD","IBT","IDA","IDB","IDX","INX","LAC","LCN","LDC","LIC","LMC",
# "LMY","LTE","MEA","MIC","MNA","NAC","OED","OSS","PRE","PSS","PST","SAS",
# "SSA","SSF","SST","TEA","TEC","TLA","TMN","TSA","TSS","WLD","ZAF")
df$category <- ifelse(df$spend_diff<0,"Decrease",
ifelse(df$spend_diff>0,"Increase",NA))
#df <- df[!(df$Country.Code %in% remove),c(8,1,2,3,4,5,6,7)] %>% subset(!is.na(category))
#df <- df[(df$Country.Code %in% remove),c(8,1,2,3,4,5,6,7)] %>% subset(!is.na(category))
income_df <- subset(df, Country.Code=="LIC" | Country.Code=="HIC" | Country.Code=="MIC" )
income_df <- income_df[income_df$Year>=2000,]
income_df <- income_df[income_df$Year<=2014,]
datatable(income_df)
Each case represents the education expenditure and corresponding primary completion rate (as percentage of GDP and relevant age group, respectively) per group per year.
Below we will graph spending and completion rates for each of the groups.
ggplot(income_df, aes(x=Year,y=Completion, colour=Country.Code) ) + geom_line() +
ggtitle("Completion Rate") +
ylab("Primary Completion (% of relevant age group)")
ggplot(income_df, aes(x=Year,y=Spend, colour=Country.Code)) + geom_line() +
ggtitle("Education Spending") +
ylab("Education Spending (% of GDP")
With Regards to primary completion rate, we make the following observations from the graphs above:
Regarding Education Spending, we make the following observations:
Below we will attempt to model primary completion rate and education spending to test whether education spending is a good predictor of primary completion rate.
low_inc <- subset(income_df,Country.Code=="LIC")
ml <- lm(Completion ~ Spend, low_inc)
plot_model <- function(model,df){
plot(df$Completion ~ df$Spend, main=paste(deparse(substitute(df))))
abline(model)
}
plot_model(ml,low_inc)
plot(abs(ml$residuals) ~ ml$fitted.values,
xlab = "Fitted Values", ylab = "Absolute Value of Residuals")
##
## Call:
## lm(formula = Completion ~ Spend, data = low_inc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.714 -6.850 1.960 5.059 10.591
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.237 53.032 -0.193 0.852
## Spend 4.383 3.251 1.348 0.215
##
## Residual standard error: 8.014 on 8 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.08323
## F-statistic: 1.817 on 1 and 8 DF, p-value: 0.2146
## [1] 0.4302219
mid_inc <- subset(income_df,Country.Code=="MIC")
mm <- lm(Completion ~ Spend, mid_inc)
plot_model(mm,mid_inc)
##
## Call:
## lm(formula = Completion ~ Spend, data = mid_inc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.540 -2.660 1.147 2.733 4.015
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65.224 29.863 2.184 0.0515 .
## Spend 1.653 1.983 0.834 0.4223
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.599 on 11 degrees of freedom
## Multiple R-squared: 0.05941, Adjusted R-squared: -0.0261
## F-statistic: 0.6948 on 1 and 11 DF, p-value: 0.4223
plot(abs(mm$residuals) ~ mm$fitted.values,
xlab = "Fitted Values", ylab = "Absolute Value of Residuals")
## [1] 0.2437402
hi_inc <- subset(income_df,Country.Code=="HIC")
mh <- lm(Completion ~ Spend, hi_inc)
plot_model(mh,hi_inc)
##
## Call:
## lm(formula = Completion ~ Spend, data = hi_inc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2189 -0.1712 0.1716 0.3961 0.9019
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 108.0663 8.1678 13.23 6.42e-09 ***
## Spend -0.8545 0.6425 -1.33 0.206
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6589 on 13 degrees of freedom
## Multiple R-squared: 0.1198, Adjusted R-squared: 0.05205
## F-statistic: 1.769 on 1 and 13 DF, p-value: 0.2064
plot(abs(mh$residuals) ~ mh$fitted.values,
xlab = "Fitted Values", ylab = "Absolute Value of Residuals")
## [1] -0.3460598
As shown by the models above we conclude the following:
While there is some correlation between education spending and primary completion rate, the data does not show that spending is a good predictor of primary completion rate.
Linear regression is not suitable for describing the relationship between Spend and Completion Rate. This is evidenced by, inconsistency in residuals, non-normal residuals, and very low Adjusted R squared measures for each model.
The observed p-values for each category is in each case, large (>.2). This indicates that the observed values for the dependant variable are likely due to chance and as a result, Spend is not an adequate explanatory varible for Primary Completion Rate.