This project investigates data from the World Bank’s World Development Indicators to see whether increased government expenditure on education improves youth literacy.
This is an observational study of the relationship between the change in government expenditure on education and the corresponding change in youth literacy within a one year time period. Each case is a country with data for these two indicators for a specified one year period: (1) government expenditure on education and (2) youth literacy.
Hypothesis testing is done to determine if there is a significant difference in the change in youth literacy between the group with increased expenditure on education and the group with decreased expenditure on education.
The data used in this study is the World Bank’s collection of development indicators, which “presents the most current and accurate global development data available”. This data covers the years 1960 through 2017.
This study focuses on two indicators: government expenditure on education (SE.XPD.TOTL.GB.ZS) and youth literacy rate (SE.ADT.1524.LT.ZS).
A complete case is a country with data for both indicator codes within a one year time period.
After investigating the data for complete cases for one time period, not enough complete cases were generated. For example, for the period 2010-2011, this only resulted in 17 complete cases. In order to increase the number of cases for the study, several time periods were selected.
The first task is to generate enough cases to assign each one to one of two groups: the group with increased spending on education and the group with decreased spending on education.
library(dplyr)
library(tidyr)
library(knitr)
library(ggplot2)
library(stringr)
library(DT)
library(psych)
This indicator code is for government expenditure on education, which is measured as percent of government expenditure.
The code below selects observations so that this indicator is not NA for the one year time period. One year time periods from 1996 to 2016 are selected.
# Return the countries with non-missing values of `indicator_code`
# in both the `start` and `end` year columns of the global `data` frame.
get_completeCases <- function(start, end, indicator_code){
  complete_cases <- data %>%
    dplyr::filter(Indicator.Code == indicator_code,
                  !is.na(.data[[start]]),
                  !is.na(.data[[end]])) %>%
    # all_of() makes it explicit that start/end hold column names
    dplyr::select(Country.Name, Country.Code, dplyr::all_of(c(start, end)))
  return(complete_cases)
}
#education expenditure
indicator_code <- 'SE.XPD.TOTL.GB.ZS'
educ_expend_p1 <- get_completeCases("X2015", "X2016", indicator_code)
educ_expend_p2 <- get_completeCases("X2014", "X2015", indicator_code)
educ_expend_p3 <- get_completeCases("X2013", "X2014", indicator_code)
educ_expend_p4 <- get_completeCases("X2012", "X2013", indicator_code)
educ_expend_p5 <- get_completeCases("X2011", "X2012", indicator_code)
educ_expend_p6 <- get_completeCases("X2010", "X2011", indicator_code)
educ_expend_p7 <- get_completeCases("X2009", "X2010", indicator_code)
educ_expend_p8 <- get_completeCases("X2008", "X2009", indicator_code)
educ_expend_p9 <- get_completeCases("X2007", "X2008", indicator_code)
educ_expend_p10 <- get_completeCases("X2006", "X2007", indicator_code)
educ_expend_p11 <- get_completeCases("X2005", "X2006", indicator_code)
educ_expend_p12 <- get_completeCases("X2004", "X2005", indicator_code)
educ_expend_p13 <- get_completeCases("X2003", "X2004", indicator_code)
educ_expend_p14 <- get_completeCases("X2002", "X2003", indicator_code)
educ_expend_p15 <- get_completeCases("X2001", "X2002", indicator_code)
educ_expend_p16 <- get_completeCases("X2000", "X2001", indicator_code)
educ_expend_p17 <- get_completeCases("X1999", "X2000", indicator_code)
educ_expend_p18 <- get_completeCases("X1998", "X1999", indicator_code)
educ_expend_p19 <- get_completeCases("X1997", "X1998", indicator_code)
educ_expend_p20 <- get_completeCases("X1996", "X1997", indicator_code)
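The twenty calls above differ only in the year pair, so the same result could be produced with a loop. The following is a minimal sketch, not the code used in this study: `wdi_toy` is a made-up data frame in the same wide layout (its values are invented), and `get_complete_cases_from` is a hypothetical variant of the function that takes the data frame as an argument instead of reading the global `data`.

```r
library(dplyr)

# Toy stand-in for the WDI table (values are invented for illustration).
wdi_toy <- data.frame(
  Country.Name   = c("A", "B", "C"),
  Country.Code   = c("AAA", "BBB", "CCC"),
  Indicator.Code = "SE.XPD.TOTL.GB.ZS",
  X2014 = c(10.0, NA, 12.0),
  X2015 = c(11.0, 9.5, NA),
  X2016 = c(11.5, 9.0, 13.0),
  stringsAsFactors = FALSE
)

# Hypothetical variant of get_completeCases() that takes the data frame explicitly.
get_complete_cases_from <- function(df, start, end, indicator_code) {
  df %>%
    dplyr::filter(Indicator.Code == indicator_code,
                  !is.na(.data[[start]]),
                  !is.na(.data[[end]])) %>%
    dplyr::select(Country.Name, Country.Code, dplyr::all_of(c(start, end)))
}

# Build consecutive year pairs: X2014/X2015 and X2015/X2016.
years  <- 2014:2016
starts <- paste0("X", head(years, -1))
ends   <- paste0("X", tail(years, -1))

# One data frame of complete cases per period, stored in a list.
expend_by_period <- Map(
  function(s, e) get_complete_cases_from(wdi_toy, s, e, "SE.XPD.TOTL.GB.ZS"),
  starts, ends
)

# Number of complete cases in each period.
sapply(expend_by_period, nrow)
```

Keeping the periods in a list also makes it possible to count complete cases per period before deciding how many periods to pool.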
This indicator code is for youth literacy rate, which is measured as percent of people ages 15 - 24.
The code below selects observations so that this indicator is not NA for the one year time period. One year time periods from 1996 to 2016 are selected.
indicator_code <- 'SE.ADT.1524.LT.ZS'
literacy_p1 <- get_completeCases("X2015", "X2016", indicator_code)
literacy_p2 <- get_completeCases("X2014", "X2015", indicator_code)
literacy_p3 <- get_completeCases("X2013", "X2014", indicator_code)
literacy_p4 <- get_completeCases("X2012", "X2013", indicator_code)
literacy_p5 <- get_completeCases("X2011", "X2012", indicator_code)
literacy_p6 <- get_completeCases("X2010", "X2011", indicator_code)
literacy_p7 <- get_completeCases("X2009", "X2010", indicator_code)
literacy_p8 <- get_completeCases("X2008", "X2009", indicator_code)
literacy_p9 <- get_completeCases("X2007", "X2008", indicator_code)
literacy_p10 <- get_completeCases("X2006", "X2007", indicator_code)
literacy_p11 <- get_completeCases("X2005", "X2006", indicator_code)
literacy_p12 <- get_completeCases("X2004", "X2005", indicator_code)
literacy_p13 <- get_completeCases("X2003", "X2004", indicator_code)
literacy_p14 <- get_completeCases("X2002", "X2003", indicator_code)
literacy_p15 <- get_completeCases("X2001", "X2002", indicator_code)
literacy_p16 <- get_completeCases("X2000", "X2001", indicator_code)
literacy_p17 <- get_completeCases("X1999", "X2000", indicator_code)
literacy_p18 <- get_completeCases("X1998", "X1999", indicator_code)
literacy_p19 <- get_completeCases("X1997", "X1998", indicator_code)
literacy_p20 <- get_completeCases("X1996", "X1997", indicator_code)
The expenditure data for each one year time period is stored in its own data frame. A column called Period is added to hold an easy-to-read description of the time period.
The year columns are renamed Expend.Begin and Expend.End, which makes it easier to reference the years as the start and end points of the time period.
#rename years and add period tracking --> education spending complete cases
educ_expend_p1$Period <- rep("2015 - 2016", times=nrow(educ_expend_p1))
educ_expend_p2$Period <- rep("2014 - 2015", times=nrow(educ_expend_p2))
educ_expend_p3$Period <- rep("2013 - 2014", times=nrow(educ_expend_p3))
educ_expend_p4$Period <- rep("2012 - 2013", times=nrow(educ_expend_p4))
educ_expend_p5$Period <- rep("2011 - 2012", times=nrow(educ_expend_p5))
educ_expend_p6$Period <- rep("2010 - 2011", times=nrow(educ_expend_p6))
educ_expend_p7$Period <- rep("2009 - 2010", times=nrow(educ_expend_p7))
educ_expend_p8$Period <- rep("2008 - 2009", times=nrow(educ_expend_p8))
educ_expend_p9$Period <- rep("2007 - 2008", times=nrow(educ_expend_p9))
educ_expend_p10$Period <- rep("2006 - 2007", times=nrow(educ_expend_p10))
educ_expend_p11$Period <- rep("2005 - 2006", times=nrow(educ_expend_p11))
educ_expend_p12$Period <- rep("2004 - 2005", times=nrow(educ_expend_p12))
educ_expend_p13$Period <- rep("2003 - 2004", times=nrow(educ_expend_p13))
educ_expend_p14$Period <- rep("2002 - 2003", times=nrow(educ_expend_p14))
educ_expend_p15$Period <- rep("2001 - 2002", times=nrow(educ_expend_p15))
educ_expend_p16$Period <- rep("2000 - 2001", times=nrow(educ_expend_p16))
educ_expend_p17$Period <- rep("1999 - 2000", times=nrow(educ_expend_p17))
educ_expend_p18$Period <- rep("1998 - 1999", times=nrow(educ_expend_p18))
educ_expend_p19$Period <- rep("1997 - 1998", times=nrow(educ_expend_p19))
educ_expend_p20$Period <- rep("1996 - 1997", times=nrow(educ_expend_p20))
rename_colnames <- function(dataFrame){
colnames(dataFrame)[3] <- "Expend.Begin"
colnames(dataFrame)[4] <- "Expend.End"
return(dataFrame)
}
educ_expend_p1 <- rename_colnames(educ_expend_p1)
educ_expend_p2 <- rename_colnames(educ_expend_p2)
educ_expend_p3 <- rename_colnames(educ_expend_p3)
educ_expend_p4 <- rename_colnames(educ_expend_p4)
educ_expend_p5 <- rename_colnames(educ_expend_p5)
educ_expend_p6 <- rename_colnames(educ_expend_p6)
educ_expend_p7 <- rename_colnames(educ_expend_p7)
educ_expend_p8 <- rename_colnames(educ_expend_p8)
educ_expend_p9 <- rename_colnames(educ_expend_p9)
educ_expend_p10 <- rename_colnames(educ_expend_p10)
educ_expend_p11 <- rename_colnames(educ_expend_p11)
educ_expend_p12 <- rename_colnames(educ_expend_p12)
educ_expend_p13 <- rename_colnames(educ_expend_p13)
educ_expend_p14 <- rename_colnames(educ_expend_p14)
educ_expend_p15 <- rename_colnames(educ_expend_p15)
educ_expend_p16 <- rename_colnames(educ_expend_p16)
educ_expend_p17 <- rename_colnames(educ_expend_p17)
educ_expend_p18 <- rename_colnames(educ_expend_p18)
educ_expend_p19 <- rename_colnames(educ_expend_p19)
educ_expend_p20 <- rename_colnames(educ_expend_p20)
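Because the period labels are typed by hand, a label can drift from the year columns it describes. As a hedged alternative, the sketch below derives the label from the column names themselves. `label_period` and the `toy` data frame are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: derive the Period label from the two year columns
# (positions 3 and 4) and rename them with the given prefix.
label_period <- function(df, prefix) {
  year_cols <- colnames(df)[3:4]            # e.g. "X2004", "X2005"
  years <- sub("^X", "", year_cols)         # drop the leading "X"
  df$Period <- rep(paste(years[1], "-", years[2]), times = nrow(df))
  colnames(df)[3:4] <- paste0(prefix, c(".Begin", ".End"))
  df
}

# Invented values, same column layout as the per-period data frames.
toy <- data.frame(
  Country.Name = c("A", "B"),
  Country.Code = c("AAA", "BBB"),
  X2004 = c(14.2, 16.0),
  X2005 = c(14.9, 15.1),
  stringsAsFactors = FALSE
)

labelled <- label_period(toy, "Expend")
colnames(labelled)
unique(labelled$Period)
```

The same helper could be called with the prefix "Literacy" for the literacy data frames, which would avoid defining two functions with the same name.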
The literacy data for each one year time period is stored in its own data frame. A column called Period is added to hold an easy-to-read description of the time period.
The year columns are renamed Literacy.Begin and Literacy.End, which makes it easier to reference the years as the start and end points of the time period.
#rename years and add period tracking --> literacy complete cases
literacy_p1$Period <- rep("2015 - 2016", times=nrow(literacy_p1))
literacy_p2$Period <- rep("2014 - 2015", times=nrow(literacy_p2))
literacy_p3$Period <- rep("2013 - 2014", times=nrow(literacy_p3))
literacy_p4$Period <- rep("2012 - 2013", times=nrow(literacy_p4))
literacy_p5$Period <- rep("2011 - 2012", times=nrow(literacy_p5))
literacy_p6$Period <- rep("2010 - 2011", times=nrow(literacy_p6))
literacy_p7$Period <- rep("2009 - 2010", times=nrow(literacy_p7))
literacy_p8$Period <- rep("2008 - 2009", times=nrow(literacy_p8))
literacy_p9$Period <- rep("2007 - 2008", times=nrow(literacy_p9))
literacy_p10$Period <- rep("2006 - 2007", times=nrow(literacy_p10))
literacy_p11$Period <- rep("2005 - 2006", times=nrow(literacy_p11))
literacy_p12$Period <- rep("2004 - 2005", times=nrow(literacy_p12))
literacy_p13$Period <- rep("2003 - 2004", times=nrow(literacy_p13))
literacy_p14$Period <- rep("2002 - 2003", times=nrow(literacy_p14))
literacy_p15$Period <- rep("2001 - 2002", times=nrow(literacy_p15))
literacy_p16$Period <- rep("2000 - 2001", times=nrow(literacy_p16))
literacy_p17$Period <- rep("1999 - 2000", times=nrow(literacy_p17))
literacy_p18$Period <- rep("1998 - 1999", times=nrow(literacy_p18))
literacy_p19$Period <- rep("1997 - 1998", times=nrow(literacy_p19))
literacy_p20$Period <- rep("1996 - 1997", times=nrow(literacy_p20))
rename_colnames <- function(dataFrame){
colnames(dataFrame)[3] <- "Literacy.Begin"
colnames(dataFrame)[4] <- "Literacy.End"
return(dataFrame)
}
literacy_p1 <- rename_colnames(literacy_p1)
literacy_p2 <- rename_colnames(literacy_p2)
literacy_p3 <- rename_colnames(literacy_p3)
literacy_p4 <- rename_colnames(literacy_p4)
literacy_p5 <- rename_colnames(literacy_p5)
literacy_p6 <- rename_colnames(literacy_p6)
literacy_p7 <- rename_colnames(literacy_p7)
literacy_p8 <- rename_colnames(literacy_p8)
literacy_p9 <- rename_colnames(literacy_p9)
literacy_p10 <- rename_colnames(literacy_p10)
literacy_p11 <- rename_colnames(literacy_p11)
literacy_p12 <- rename_colnames(literacy_p12)
literacy_p13 <- rename_colnames(literacy_p13)
literacy_p14 <- rename_colnames(literacy_p14)
literacy_p15 <- rename_colnames(literacy_p15)
literacy_p16 <- rename_colnames(literacy_p16)
literacy_p17 <- rename_colnames(literacy_p17)
literacy_p18 <- rename_colnames(literacy_p18)
literacy_p19 <- rename_colnames(literacy_p19)
literacy_p20 <- rename_colnames(literacy_p20)
The code below joins the expenditure and literacy data for each time period, producing one data frame per period that holds the columns of both.
#join spending and literacy data: only keep observations with match on both tables.
educ_expend_literacy_p1 <- dplyr::inner_join(educ_expend_p1, literacy_p1, by="Country.Code")
educ_expend_literacy_p2 <- dplyr::inner_join(educ_expend_p2, literacy_p2, by="Country.Code")
educ_expend_literacy_p3 <- dplyr::inner_join(educ_expend_p3, literacy_p3, by="Country.Code")
educ_expend_literacy_p4 <- dplyr::inner_join(educ_expend_p4, literacy_p4, by="Country.Code")
educ_expend_literacy_p5 <- dplyr::inner_join(educ_expend_p5, literacy_p5, by="Country.Code")
educ_expend_literacy_p6 <- dplyr::inner_join(educ_expend_p6, literacy_p6, by="Country.Code")
educ_expend_literacy_p7 <- dplyr::inner_join(educ_expend_p7, literacy_p7, by="Country.Code")
educ_expend_literacy_p8 <- dplyr::inner_join(educ_expend_p8, literacy_p8, by="Country.Code")
educ_expend_literacy_p9 <- dplyr::inner_join(educ_expend_p9, literacy_p9, by="Country.Code")
educ_expend_literacy_p10 <- dplyr::inner_join(educ_expend_p10, literacy_p10, by="Country.Code")
educ_expend_literacy_p11 <- dplyr::inner_join(educ_expend_p11, literacy_p11, by="Country.Code")
educ_expend_literacy_p12 <- dplyr::inner_join(educ_expend_p12, literacy_p12, by="Country.Code")
educ_expend_literacy_p13 <- dplyr::inner_join(educ_expend_p13, literacy_p13, by="Country.Code")
educ_expend_literacy_p14 <- dplyr::inner_join(educ_expend_p14, literacy_p14, by="Country.Code")
educ_expend_literacy_p15 <- dplyr::inner_join(educ_expend_p15, literacy_p15, by="Country.Code")
educ_expend_literacy_p16 <- dplyr::inner_join(educ_expend_p16, literacy_p16, by="Country.Code")
educ_expend_literacy_p17 <- dplyr::inner_join(educ_expend_p17, literacy_p17, by="Country.Code")
educ_expend_literacy_p18 <- dplyr::inner_join(educ_expend_p18, literacy_p18, by="Country.Code")
educ_expend_literacy_p19 <- dplyr::inner_join(educ_expend_p19, literacy_p19, by="Country.Code")
educ_expend_literacy_p20 <- dplyr::inner_join(educ_expend_p20, literacy_p20, by="Country.Code")
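Joining on Country.Code alone means the other columns shared by both tables (Country.Name and Period) are kept twice, with the suffixes .x and .y. A small sketch with invented values shows the effect. The data frames `expend_toy` and `literacy_toy` are illustrative assumptions.

```r
library(dplyr)

# Invented values: only country BBB appears in both tables.
expend_toy <- data.frame(
  Country.Name = c("A", "B"), Country.Code = c("AAA", "BBB"),
  Expend.Begin = c(14.2, 16.0), Expend.End = c(14.9, 15.1),
  Period = "2010 - 2011", stringsAsFactors = FALSE)

literacy_toy <- data.frame(
  Country.Name = c("B", "C"), Country.Code = c("BBB", "CCC"),
  Literacy.Begin = c(91.0, 88.4), Literacy.End = c(91.6, 88.9),
  Period = "2010 - 2011", stringsAsFactors = FALSE)

# inner_join keeps only rows whose Country.Code appears in both tables.
joined_toy <- dplyr::inner_join(expend_toy, literacy_toy, by = "Country.Code")

nrow(joined_toy)
colnames(joined_toy)
```

This is why later code refers to Country.Name.x and Period.x. Joining on all three shared columns would be one way to avoid the duplicates.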
The code below creates a single data frame that stores all the complete cases.
#combine rows of tables
educ_expend_literacy <-
dplyr::bind_rows(educ_expend_literacy_p1,educ_expend_literacy_p2, educ_expend_literacy_p3,
educ_expend_literacy_p4, educ_expend_literacy_p5, educ_expend_literacy_p6,
educ_expend_literacy_p7, educ_expend_literacy_p8, educ_expend_literacy_p9,
educ_expend_literacy_p10, educ_expend_literacy_p11, educ_expend_literacy_p12,
educ_expend_literacy_p13, educ_expend_literacy_p14, educ_expend_literacy_p15,
educ_expend_literacy_p16, educ_expend_literacy_p17, educ_expend_literacy_p18,
educ_expend_literacy_p19, educ_expend_literacy_p20)
nrow(educ_expend_literacy)
## [1] 144
The code below adds two calculated columns: Change.Educ_Expend and Change.Literacy.
#Calculated column: Change.Educ_Expend
educ_expend_literacy$Change.Educ_Expend <-
educ_expend_literacy$Expend.End - educ_expend_literacy$Expend.Begin
#Calculated column: Change.Literacy
educ_expend_literacy$Change.Literacy <-
educ_expend_literacy$Literacy.End - educ_expend_literacy$Literacy.Begin
The code below adds a new column, Group.
Observations with a change in educational expenditure that is less than zero are assigned the group label Less-Expenditure.
Observations with a change in educational expenditure that is greater than zero are assigned the group label More-Expenditure.
Observations with a change of exactly zero receive no label.
#Group cases in positive or negative expenditure
educ_expend_literacy$Group[educ_expend_literacy$Change.Educ_Expend < 0] <- "Less-Expenditure"
educ_expend_literacy$Group[educ_expend_literacy$Change.Educ_Expend > 0] <- "More-Expenditure"
#Drop group that is the same
#test <- educ_expend_literacy[-which(educ_expend_literacy$Group=="Same-Expenditure"),]
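The same labelling rule can be written with dplyr::case_when, which makes the treatment of a zero change explicit. This is a sketch on an invented three-row data frame (`changes_toy`), not a replacement for the code above.

```r
library(dplyr)

# Invented changes: one decrease, one unchanged, one increase.
changes_toy <- data.frame(Change.Educ_Expend = c(-1.2, 0, 0.8))

changes_toy <- changes_toy %>%
  dplyr::mutate(Group = dplyr::case_when(
    Change.Educ_Expend < 0 ~ "Less-Expenditure",
    Change.Educ_Expend > 0 ~ "More-Expenditure"
    # a change of exactly zero matches neither rule and stays NA
  ))

changes_toy
```

Rows left as NA can then be dropped or counted separately before the two groups are compared.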
Each case is assigned to one of two groups: Less-Expenditure or More-Expenditure.
datatable(educ_expend_literacy %>% dplyr::select(Group, Country.Name.x, Country.Code, Period.x,
Expend.Begin, Expend.End, Change.Educ_Expend,
Literacy.Begin, Literacy.End, Change.Literacy) %>%
dplyr::arrange(Group))
## [1] 75
## [1] 69
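The code that builds `group_more_expend` and `group_less_expend` is not shown in this report. One plausible way to obtain such subsets is to filter on the Group column, as in the sketch below. This is an assumption and not necessarily how the two groups were created. The data frame `cases_toy` and its values are invented.

```r
library(dplyr)

# Invented cases, including one without a group label.
cases_toy <- data.frame(
  Group = c("More-Expenditure", "Less-Expenditure", "More-Expenditure", NA),
  Change.Literacy = c(0.4, -0.1, 0.2, 0.0),
  stringsAsFactors = FALSE
)

# filter() drops rows where the condition is FALSE or NA.
group_more_toy <- cases_toy %>% dplyr::filter(Group == "More-Expenditure")
group_less_toy <- cases_toy %>% dplyr::filter(Group == "Less-Expenditure")

nrow(group_more_toy)
nrow(group_less_toy)
```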
Below are the summary statistics of the change in literacy rate for the group with increased spending on education.
#describe(group_more_expend$Change.Educ_Expend)
describe(group_more_expend$Change.Literacy)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 75 0.27 1.06 0.12 0.24 0.22 -3.37 4.08 7.45 0.01 4.19
## se
## X1 0.12
Below are the summary statistics of the change in literacy rate for the group with decreased spending on education.
#describe(group_less_expend$Change.Educ_Expend)
describe(group_less_expend$Change.Literacy)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 69 0.28 1.04 0.13 0.16 0.42 -1.8 6.53 8.33 3.27 17.18
## se
## X1 0.12
The plot below shows the relationship between change in government spending in education and change in literacy rate.
The plot below shows a side by side box plot of the change in literacy rate for the two groups of increased and decreased spending on education. The two groups look very similar: according to the summary statistics above, the median change is 0.12 for the group with increased spending and 0.13 for the group with decreased spending.
Below is the distribution of the change in literacy rate. The distribution roughly resembles a normal distribution with some skew to the right.
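The plotting code is not shown in this report. The three plots described above (scatter plot, side by side box plot, histogram) could be drawn with ggplot2 roughly as follows. This is a sketch on simulated data: `plot_toy` and its values are invented, and the axis labels are assumptions.

```r
library(ggplot2)

# Simulated stand-in for the combined data frame (not the study data).
set.seed(1)
plot_toy <- data.frame(
  Change.Educ_Expend = rnorm(40),
  Change.Literacy    = rnorm(40, mean = 0.3)
)
plot_toy$Group <- ifelse(plot_toy$Change.Educ_Expend > 0,
                         "More-Expenditure", "Less-Expenditure")

# Relationship between the two changes.
scatter_plot <- ggplot(plot_toy, aes(x = Change.Educ_Expend, y = Change.Literacy)) +
  geom_point() +
  labs(x = "Change in education expenditure", y = "Change in youth literacy")

# Change in literacy, one box per group.
box_plot <- ggplot(plot_toy, aes(x = Group, y = Change.Literacy)) +
  geom_boxplot()

# Distribution of the change in literacy across all cases.
hist_plot <- ggplot(plot_toy, aes(x = Change.Literacy)) +
  geom_histogram(bins = 15)
```

Printing any of the three objects (for example `print(box_plot)`) renders the plot.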
This inference is going to investigate the difference in the average change in youth literacy between two groups. The first group includes cases when there is an increase in government expenditure in education. The second group includes cases when there is a decrease in government expenditure in education.
For each of the two groups, the mean of the change in literacy is going to be calculated. To determine if there is a significant difference between these two means, a hypothesis test is going to be done. The t-distribution is going to be used to perform the hypothesis test.
This inference assumes that the change in literacy within a one year period is independent of other observations with different time periods and/or different countries or regions. The cases are built by selecting one year time periods from 1996 to 2016. The sample size for each of the groups is less than 10% of the population.
The sample size for each group is greater than 30.
The distribution of the change in literacy roughly resembles a normal distribution (see histogram above), which suggests that if the sample size were increased, the distribution would be approximately normal.
Null hypothesis: There is no difference in the average change in literacy between the group with increased government expenditure on education and the group with decreased government expenditure on education over a one year period.
Alternative hypothesis: There is a difference in the average change in literacy between the group with increased government expenditure on education and the group with decreased government expenditure on education over a one year period.
The code below performs the calculations for the hypothesis test using the t distribution.
group1 <- group_more_expend
group2 <- group_less_expend
group1_size <- nrow(group1)
group2_size <- nrow(group2)
group1_mean <- mean(group1$Change.Literacy)
group2_mean <- mean(group2$Change.Literacy)
group1_sd <- sd(group1$Change.Literacy)
group2_sd <- sd(group2$Change.Literacy)
diff_mean <- group1_mean - group2_mean
SE <- sqrt(group1_sd^2/group1_size + group2_sd^2/group2_size)
critical_val95 <- abs(qt(.05/2, min(group1_size, group2_size)-1)) #2-tail
t_score <- (diff_mean - 0)/SE
p_value <- (1-pt(abs(t_score), df=min(group1_size,group2_size)-1))*2
CI95_lower <- diff_mean - SE * critical_val95
CI95_upper <- diff_mean + SE * critical_val95
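As a cross-check on manual calculations like these, R's built-in `t.test` performs a two-sample test directly. The sketch below uses simulated values, not the study data. With `var.equal = FALSE` it runs Welch's test, which produces the same t statistic as the formula above but uses the Welch-Satterthwaite degrees of freedom. These are never smaller than `min(n1, n2) - 1`, so the manual approach is the more conservative of the two.

```r
# Simulated stand-ins for the two groups (not the study data).
set.seed(42)
more_toy <- rnorm(75, mean = 0.27, sd = 1.06)
less_toy <- rnorm(69, mean = 0.28, sd = 1.04)

# Welch two-sample t-test, two-sided, 95% confidence interval by default.
welch <- t.test(more_toy, less_toy, alternative = "two.sided", var.equal = FALSE)

# The same t statistic computed by hand, as in the code above.
se_toy <- sqrt(var(more_toy) / length(more_toy) + var(less_toy) / length(less_toy))
t_toy  <- (mean(more_toy) - mean(less_toy)) / se_toy

welch$statistic   # t statistic
welch$parameter   # Welch-Satterthwaite degrees of freedom
welch$p.value
welch$conf.int
```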
Below is a table of the calculations.
G1-size | G2-size | G1-mean | G2-mean | Diff-mean | G1-sd | G2-sd | SE | t-score | P-value | Critical_val95 | CI95 |
---|---|---|---|---|---|---|---|---|---|---|---|
75 | 69 | 0.2652 | 0.2754 | -0.0101 | 1.0618 | 1.038 | 0.1751 | -0.058 | 0.9539 | 1.9955 | -0.3595 to 0.3392 |
As you can see above, the p-value is 0.95, far above the 0.05 significance level, so we fail to reject the null hypothesis.
Based on the result of the hypothesis test (p-value of 0.9539), there is no statistically significant difference in the average change in literacy between groups with increased expenditure and decreased expenditure on education over a one year period.
The 95% confidence interval of the difference in the average change in literacy between these two groups is (-0.3595, 0.3392). This confidence interval includes zero.
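The reported figures can be re-derived from the rounded summary statistics in the table alone. The sketch below repeats the arithmetic with those rounded inputs, so its results may differ from the table in the last decimal place. The variable names are chosen for illustration.

```r
# Rounded inputs copied from the table of calculations.
n1 <- 75;        n2 <- 69
mean1 <- 0.2652; mean2 <- 0.2754
sd1 <- 1.0618;   sd2 <- 1.038

diff_mean <- mean1 - mean2
se        <- sqrt(sd1^2 / n1 + sd2^2 / n2)   # standard error of the difference
deg_free  <- min(n1, n2) - 1                 # conservative degrees of freedom
t_score   <- diff_mean / se
p_value   <- 2 * (1 - pt(abs(t_score), df = deg_free))
crit      <- qt(1 - 0.05 / 2, df = deg_free) # two-sided 95% critical value
ci        <- c(diff_mean - crit * se, diff_mean + crit * se)

round(c(SE = se, t = t_score, p = p_value, crit = crit), 4)
round(ci, 4)
```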
The average change in literacy for the group with increased expenditure is actually slightly lower (at 0.2652) than for the group with decreased expenditure (at 0.2754). I did not expect to see this. I expected the group with increased expenditure on education to have a higher mean. This study only looked at one possible explanatory variable and a very limited set of data. Future work should investigate a larger data set, consider more factors, or design a better observational study.
Diez, D.M., Barr, C.D., & Cetinkaya-Rundel, M. (2015). OpenIntro Statistics. Retrieved from https://www.openintro.org/stat/textbook.php
World Development Indicators: https://datacatalog.worldbank.org/dataset/world-development-indicators