Data 606 Data Project

Part 1 - Introduction

This project investigates data from the World Bank’s World Development Indicators to see whether increased government expenditure on education improves youth literacy.

This is an observational study of the relationship between the change in government expenditure on education and the corresponding change in youth literacy within a one-year period. Each case is a country with data for two indicators over a specified one-year period: (1) government expenditure on education and (2) youth literacy.

Hypothesis testing is done to determine if there is a significant difference in the change in youth literacy between the group with increased expenditure on education and the group with decreased expenditure on education.

Part 2 - Data

The data used in this study is the World Bank’s collection of development indicators, which “presents the most current and accurate global development data available”. This data covers the years 1960 through 2017.

Source: https://datacatalog.worldbank.org/dataset/world-development-indicators
Release Date: June 11, 2010
Last Updated: March 1, 2018

This study focuses on the two indicators below.

SE.XPD.TOTL.GB.ZS - Government expenditure on education, total (% of government expenditure)
SE.ADT.1524.LT.ZS - Literacy rate, youth total (% of people ages 15-24)

Part 3 - Exploratory data analysis

About complete cases

A complete case is a country with data for both selected indicator codes SE.XPD.TOTL.GB.ZS and SE.ADT.1524.LT.ZS within a one year time period.

A single time period did not yield enough complete cases. For example, the period 2010-2011 produced only 17 complete cases. To increase the number of cases for the study, several time periods were selected.

The first task is to generate enough cases to assign to one of two groups: the group with increased spending on education and the group with decreased spending on education.

Load libraries

library(dplyr)
library(tidyr)
library(knitr)
library(ggplot2)
library(stringr)
library(DT)
library(psych)

Load data
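
The loading code is not shown in this report. The sketch below is only an illustration of how the column names used later (Country.Name, Indicator.Code, X2015, and so on) can arise: it assumes the data comes from a CSV export whose header row uses names such as "Country Name" and "2015", and it reads a tiny made-up sample from a string instead of the real file. The file name in the comment is hypothetical.

```r
# Toy stand-in for the World Bank CSV export (made-up countries and values).
csv_text <- 'Country Name,Country Code,Indicator Name,Indicator Code,2015,2016
Aland,ALD,Youth literacy,SE.ADT.1524.LT.ZS,90.1,90.5
Borduria,BRD,Youth literacy,SE.ADT.1524.LT.ZS,,88.0'

# check.names = TRUE (the default) converts "Country Name" to Country.Name
# and "2015" to X2015, matching the names referenced in this report.
data <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# With a downloaded file this would look like (hypothetical path):
# data <- read.csv("WDIData.csv", stringsAsFactors = FALSE)
colnames(data)
```

Empty cells in a year column are read as NA, which is what the complete-case filter below relies on.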

Generate complete cases for indicator code ‘SE.XPD.TOTL.GB.ZS’

This indicator code is for government expenditure on education, which is measured as percent of government expenditure.

The code below selects observations so that this indicator is not NA for the one year time period. One year time periods from 1996 to 2016 are selected.

Function to retrieve complete cases

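get_completeCases <- function(start, end, indicator_code){
  # Uses the global data frame `data` loaded above.
  # Keep rows for the requested indicator where both year columns are non-missing.
  complete_cases <-
    data %>%
    dplyr::filter(Indicator.Code == indicator_code,
                  !is.na(.data[[start]]),
                  !is.na(.data[[end]])) %>%
    # all_of() selects the columns named by the character values of start and end
    dplyr::select("Country.Name", "Country.Code", dplyr::all_of(c(start, end)))
  return(complete_cases)
}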

#education expenditure
indicator_code <- 'SE.XPD.TOTL.GB.ZS'
educ_expend_p1 <- get_completeCases("X2015", "X2016", indicator_code)
educ_expend_p2 <- get_completeCases("X2014", "X2015", indicator_code)
educ_expend_p3 <- get_completeCases("X2013", "X2014", indicator_code)
educ_expend_p4 <- get_completeCases("X2012", "X2013", indicator_code)
educ_expend_p5 <- get_completeCases("X2011", "X2012", indicator_code)
educ_expend_p6 <- get_completeCases("X2010", "X2011", indicator_code)
educ_expend_p7 <- get_completeCases("X2009", "X2010", indicator_code)
educ_expend_p8 <- get_completeCases("X2008", "X2009", indicator_code)
educ_expend_p9 <- get_completeCases("X2007", "X2008", indicator_code)
educ_expend_p10 <- get_completeCases("X2006", "X2007", indicator_code)
educ_expend_p11 <- get_completeCases("X2005", "X2006", indicator_code)
educ_expend_p12 <- get_completeCases("X2004", "X2005", indicator_code)
educ_expend_p13 <- get_completeCases("X2003", "X2004", indicator_code)
educ_expend_p14 <- get_completeCases("X2002", "X2003", indicator_code)
educ_expend_p15 <- get_completeCases("X2001", "X2002", indicator_code)
educ_expend_p16 <- get_completeCases("X2000", "X2001", indicator_code)
educ_expend_p17 <- get_completeCases("X1999", "X2000", indicator_code)
educ_expend_p18 <- get_completeCases("X1998", "X1999", indicator_code)
educ_expend_p19 <- get_completeCases("X1997", "X1998", indicator_code)
educ_expend_p20 <- get_completeCases("X1996", "X1997", indicator_code)
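
The twenty near-identical calls above could also be generated in a loop. The following is a minimal sketch under stated assumptions, not the code used in this report: complete_cases_for is a hypothetical helper, the toy data frame stands in for the real data, and the Begin and End column names are illustrative.

```r
library(dplyr)

# Toy data in the same wide layout as the real data (values are made up).
data <- data.frame(
  Country.Name   = c("A", "B", "C"),
  Country.Code   = c("AAA", "BBB", "CCC"),
  Indicator.Code = "SE.XPD.TOTL.GB.ZS",
  X2014 = c(10, NA, 12),
  X2015 = c(11, 14, NA),
  X2016 = c(12, 15, 13),
  stringsAsFactors = FALSE
)

# Hypothetical helper: same filter as get_completeCases, but it takes the
# data frame as an argument and adds the Period label in the same step.
complete_cases_for <- function(df, start_year, indicator_code) {
  start <- paste0("X", start_year)
  end   <- paste0("X", start_year + 1)
  df %>%
    dplyr::filter(Indicator.Code == indicator_code,
                  !is.na(.data[[start]]), !is.na(.data[[end]])) %>%
    dplyr::select(Country.Name, Country.Code,
                  Begin = dplyr::all_of(start), End = dplyr::all_of(end)) %>%
    dplyr::mutate(Period = paste(start_year, "-", start_year + 1))
}

# One data frame per period, stacked into a single long table.
all_periods <- dplyr::bind_rows(
  lapply(2014:2015, complete_cases_for, df = data,
         indicator_code = "SE.XPD.TOTL.GB.ZS")
)
all_periods
```

For the full study the loop would run over 1996:2015, once per indicator code. Building the Period label from the year also avoids hand-typed labels drifting out of sync with the data.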

Generate complete cases for indicator code ‘SE.ADT.1524.LT.ZS’

This indicator code is for youth literacy rate, which is measured as percent of people ages 15 - 24.

The code below selects observations so that this indicator is not NA for the one year time period. One year time periods from 1996 to 2016 are selected.

indicator_code <- 'SE.ADT.1524.LT.ZS'
literacy_p1 <- get_completeCases("X2015", "X2016", indicator_code)
literacy_p2 <- get_completeCases("X2014", "X2015", indicator_code)
literacy_p3 <- get_completeCases("X2013", "X2014", indicator_code)
literacy_p4 <- get_completeCases("X2012", "X2013", indicator_code)
literacy_p5 <- get_completeCases("X2011", "X2012", indicator_code)
literacy_p6 <- get_completeCases("X2010", "X2011", indicator_code)
literacy_p7 <- get_completeCases("X2009", "X2010", indicator_code)
literacy_p8 <- get_completeCases("X2008", "X2009", indicator_code)
literacy_p9 <- get_completeCases("X2007", "X2008", indicator_code)
literacy_p10 <- get_completeCases("X2006", "X2007", indicator_code)
literacy_p11 <- get_completeCases("X2005", "X2006", indicator_code)
literacy_p12 <- get_completeCases("X2004", "X2005", indicator_code)
literacy_p13 <- get_completeCases("X2003", "X2004", indicator_code)
literacy_p14 <- get_completeCases("X2002", "X2003", indicator_code)
literacy_p15 <- get_completeCases("X2001", "X2002", indicator_code)
literacy_p16 <- get_completeCases("X2000", "X2001", indicator_code)
literacy_p17 <- get_completeCases("X1999", "X2000", indicator_code)
literacy_p18 <- get_completeCases("X1998", "X1999", indicator_code)
literacy_p19 <- get_completeCases("X1997", "X1998", indicator_code)
literacy_p20 <- get_completeCases("X1996", "X1997", indicator_code)

Format expenditure data

The expenditure data for each one-year period is stored in its own data frame. A Period column is added to each data frame to hold a readable label for its time period.

The year columns are going to be renamed as Expend.Begin and Expend.End, which makes it easier to reference the years as start and end points of the time period.

#rename years and add period tracking --> education spending complete cases
educ_expend_p1$Period <- rep("2015 - 2016", times=nrow(educ_expend_p1))
educ_expend_p2$Period <- rep("2014 - 2015", times=nrow(educ_expend_p2))
educ_expend_p3$Period <- rep("2013 - 2014", times=nrow(educ_expend_p3))              
educ_expend_p4$Period <- rep("2012 - 2013", times=nrow(educ_expend_p4))
educ_expend_p5$Period <- rep("2011 - 2012", times=nrow(educ_expend_p5))             
educ_expend_p6$Period <- rep("2010 - 2011", times=nrow(educ_expend_p6))  
educ_expend_p7$Period <- rep("2009 - 2010", times=nrow(educ_expend_p7))  
educ_expend_p8$Period <- rep("2008 - 2009", times=nrow(educ_expend_p8))
educ_expend_p9$Period <- rep("2007 - 2008", times=nrow(educ_expend_p9)) 
educ_expend_p10$Period <- rep("2006 - 2007", times=nrow(educ_expend_p10)) 
educ_expend_p11$Period <- rep("2005 - 2006", times=nrow(educ_expend_p11)) 
educ_expend_p12$Period <- rep("2004 - 2005", times=nrow(educ_expend_p12))
educ_expend_p13$Period <- rep("2003 - 2004", times=nrow(educ_expend_p13)) 
educ_expend_p14$Period <- rep("2002 - 2003", times=nrow(educ_expend_p14)) 
educ_expend_p15$Period <- rep("2001 - 2002", times=nrow(educ_expend_p15)) 
educ_expend_p16$Period <- rep("2000 - 2001", times=nrow(educ_expend_p16)) 
educ_expend_p17$Period <- rep("1999 - 2000", times=nrow(educ_expend_p17)) 
educ_expend_p18$Period <- rep("1998 - 1999", times=nrow(educ_expend_p18)) 
educ_expend_p19$Period <- rep("1997 - 1998", times=nrow(educ_expend_p19)) 
educ_expend_p20$Period <- rep("1996 - 1997", times=nrow(educ_expend_p20)) 

rename_colnames <- function(dataFrame){
colnames(dataFrame)[3] <- "Expend.Begin"
colnames(dataFrame)[4] <- "Expend.End"
return(dataFrame)
}
educ_expend_p1 <- rename_colnames(educ_expend_p1)
educ_expend_p2 <- rename_colnames(educ_expend_p2)
educ_expend_p3 <- rename_colnames(educ_expend_p3)
educ_expend_p4 <- rename_colnames(educ_expend_p4)
educ_expend_p5 <- rename_colnames(educ_expend_p5)
educ_expend_p6 <- rename_colnames(educ_expend_p6)
educ_expend_p7 <- rename_colnames(educ_expend_p7)
educ_expend_p8 <- rename_colnames(educ_expend_p8)
educ_expend_p9 <- rename_colnames(educ_expend_p9)
educ_expend_p10 <- rename_colnames(educ_expend_p10)
educ_expend_p11 <- rename_colnames(educ_expend_p11)
educ_expend_p12 <- rename_colnames(educ_expend_p12)
educ_expend_p13 <- rename_colnames(educ_expend_p13)
educ_expend_p14 <- rename_colnames(educ_expend_p14)
educ_expend_p15 <- rename_colnames(educ_expend_p15)
educ_expend_p16 <- rename_colnames(educ_expend_p16)
educ_expend_p17 <- rename_colnames(educ_expend_p17)
educ_expend_p18 <- rename_colnames(educ_expend_p18)
educ_expend_p19 <- rename_colnames(educ_expend_p19)
educ_expend_p20 <- rename_colnames(educ_expend_p20)

Format literacy data

The literacy data for each one-year period is stored in its own data frame. A Period column is added to each data frame to hold a readable label for its time period.

The year columns are going to be renamed as Literacy.Begin and Literacy.End, which makes it easier to reference the years as start and end points of the time period.

#rename years and add period tracking --> literacy complete cases
literacy_p1$Period <- rep("2015 - 2016", times=nrow(literacy_p1))
literacy_p2$Period <- rep("2014 - 2015", times=nrow(literacy_p2))
literacy_p3$Period <- rep("2013 - 2014", times=nrow(literacy_p3))              
literacy_p4$Period <- rep("2012 - 2013", times=nrow(literacy_p4))
literacy_p5$Period <- rep("2011 - 2012", times=nrow(literacy_p5))             
literacy_p6$Period <- rep("2010 - 2011", times=nrow(literacy_p6))  
literacy_p7$Period <- rep("2009 - 2010", times=nrow(literacy_p7))  
literacy_p8$Period <- rep("2008 - 2009", times=nrow(literacy_p8))
literacy_p9$Period <- rep("2007 - 2008", times=nrow(literacy_p9)) 
literacy_p10$Period <- rep("2006 - 2007", times=nrow(literacy_p10)) 
literacy_p11$Period <- rep("2005 - 2006", times=nrow(literacy_p11)) 
literacy_p12$Period <- rep("2004 - 2005", times=nrow(literacy_p12))
literacy_p13$Period <- rep("2003 - 2004", times=nrow(literacy_p13)) 
literacy_p14$Period <- rep("2002 - 2003", times=nrow(literacy_p14)) 
literacy_p15$Period <- rep("2001 - 2002", times=nrow(literacy_p15)) 
literacy_p16$Period <- rep("2000 - 2001", times=nrow(literacy_p16)) 
literacy_p17$Period <- rep("1999 - 2000", times=nrow(literacy_p17)) 
literacy_p18$Period <- rep("1998 - 1999", times=nrow(literacy_p18)) 
literacy_p19$Period <- rep("1997 - 1998", times=nrow(literacy_p19)) 
literacy_p20$Period <- rep("1996 - 1997", times=nrow(literacy_p20)) 

rename_colnames <- function(dataFrame){
colnames(dataFrame)[3] <- "Literacy.Begin"
colnames(dataFrame)[4] <- "Literacy.End"
return(dataFrame)
}
literacy_p1 <- rename_colnames(literacy_p1)
literacy_p2 <- rename_colnames(literacy_p2)
literacy_p3 <- rename_colnames(literacy_p3)
literacy_p4 <- rename_colnames(literacy_p4)
literacy_p5 <- rename_colnames(literacy_p5)
literacy_p6 <- rename_colnames(literacy_p6)
literacy_p7 <- rename_colnames(literacy_p7)
literacy_p8 <- rename_colnames(literacy_p8)
literacy_p9 <- rename_colnames(literacy_p9)
literacy_p10 <- rename_colnames(literacy_p10)
literacy_p11 <- rename_colnames(literacy_p11)
literacy_p12 <- rename_colnames(literacy_p12)
literacy_p13 <- rename_colnames(literacy_p13)
literacy_p14 <- rename_colnames(literacy_p14)
literacy_p15 <- rename_colnames(literacy_p15)
literacy_p16 <- rename_colnames(literacy_p16)
literacy_p17 <- rename_colnames(literacy_p17)
literacy_p18 <- rename_colnames(literacy_p18)
literacy_p19 <- rename_colnames(literacy_p19)
literacy_p20 <- rename_colnames(literacy_p20)

Join expenditure and literacy data

The code below joins the expenditure and literacy data on Country.Code, producing one data frame per time period that holds the columns from both tables. Only countries present in both tables are kept.

#join spending and literacy data: only keep observations with match on both tables. 
educ_expend_literacy_p1 <- dplyr::inner_join(educ_expend_p1, literacy_p1, by="Country.Code") 
educ_expend_literacy_p2 <- dplyr::inner_join(educ_expend_p2, literacy_p2, by="Country.Code") 
educ_expend_literacy_p3 <- dplyr::inner_join(educ_expend_p3, literacy_p3, by="Country.Code") 
educ_expend_literacy_p4 <- dplyr::inner_join(educ_expend_p4, literacy_p4, by="Country.Code")
educ_expend_literacy_p5 <- dplyr::inner_join(educ_expend_p5, literacy_p5, by="Country.Code")
educ_expend_literacy_p6 <- dplyr::inner_join(educ_expend_p6, literacy_p6, by="Country.Code")
educ_expend_literacy_p7 <- dplyr::inner_join(educ_expend_p7, literacy_p7, by="Country.Code")
educ_expend_literacy_p8 <- dplyr::inner_join(educ_expend_p8, literacy_p8, by="Country.Code")
educ_expend_literacy_p9 <- dplyr::inner_join(educ_expend_p9, literacy_p9, by="Country.Code")
educ_expend_literacy_p10 <- dplyr::inner_join(educ_expend_p10, literacy_p10, by="Country.Code")
educ_expend_literacy_p11 <- dplyr::inner_join(educ_expend_p11, literacy_p11, by="Country.Code")
educ_expend_literacy_p12 <- dplyr::inner_join(educ_expend_p12, literacy_p12, by="Country.Code")
educ_expend_literacy_p13 <- dplyr::inner_join(educ_expend_p13, literacy_p13, by="Country.Code")
educ_expend_literacy_p14 <- dplyr::inner_join(educ_expend_p14, literacy_p14, by="Country.Code")
educ_expend_literacy_p15 <- dplyr::inner_join(educ_expend_p15, literacy_p15, by="Country.Code")
educ_expend_literacy_p16 <- dplyr::inner_join(educ_expend_p16, literacy_p16, by="Country.Code")
educ_expend_literacy_p17 <- dplyr::inner_join(educ_expend_p17, literacy_p17, by="Country.Code")
educ_expend_literacy_p18 <- dplyr::inner_join(educ_expend_p18, literacy_p18, by="Country.Code")
educ_expend_literacy_p19 <- dplyr::inner_join(educ_expend_p19, literacy_p19, by="Country.Code")
educ_expend_literacy_p20 <- dplyr::inner_join(educ_expend_p20, literacy_p20, by="Country.Code")
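
Because both tables carry Country.Name and Period columns but the join key is only Country.Code, dplyr adds .x and .y suffixes to the duplicated names, which is why later code refers to Country.Name.x and Period.x. A minimal sketch with made-up values illustrates this:

```r
library(dplyr)

# Toy tables for a single period (values are made up).
expend <- data.frame(Country.Name = c("A", "B"), Country.Code = c("AAA", "BBB"),
                     Expend.Begin = c(10, 14), Expend.End = c(11, 13),
                     Period = "2015 - 2016", stringsAsFactors = FALSE)
literacy <- data.frame(Country.Name = c("B", "C"), Country.Code = c("BBB", "CCC"),
                       Literacy.Begin = c(95.0, 80.0), Literacy.End = c(95.4, 81.0),
                       Period = "2015 - 2016", stringsAsFactors = FALSE)

# inner_join keeps only countries that appear in both tables (here: BBB).
joined <- dplyr::inner_join(expend, literacy, by = "Country.Code")
colnames(joined)
```

Joining on all three shared columns (Country.Code, Country.Name and Period) would avoid the suffixes; that is a possible alternative, not what the code above does.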

Build a single data frame for all the different time periods

The code below creates a single data frame that stores all the complete cases.

#combine rows of tables
educ_expend_literacy <- 
dplyr::bind_rows(educ_expend_literacy_p1,educ_expend_literacy_p2, educ_expend_literacy_p3, 
                 educ_expend_literacy_p4, educ_expend_literacy_p5, educ_expend_literacy_p6, 
                 educ_expend_literacy_p7, educ_expend_literacy_p8, educ_expend_literacy_p9,
                 educ_expend_literacy_p10, educ_expend_literacy_p11, educ_expend_literacy_p12,
                 educ_expend_literacy_p13, educ_expend_literacy_p14, educ_expend_literacy_p15,
                 educ_expend_literacy_p16, educ_expend_literacy_p17, educ_expend_literacy_p18,
                 educ_expend_literacy_p19, educ_expend_literacy_p20)
nrow(educ_expend_literacy)

## [1] 144

Add calculated columns

The code below adds two calculated columns: Change.Educ_Expend and Change.Literacy.

#Calculated column: Change.Educ_Expend
educ_expend_literacy$Change.Educ_Expend <-  
  educ_expend_literacy$Expend.End - educ_expend_literacy$Expend.Begin
#Calculated column: Change.Literacy
educ_expend_literacy$Change.Literacy <-  
  educ_expend_literacy$Literacy.End - educ_expend_literacy$Literacy.Begin

Create the two groups

The code below adds a new column Group.

Observations with change in educational expenditure that is less than zero are assigned the group label Less-Expenditure.

Observations with change in educational expenditure that is greater than zero are assigned the group label More-Expenditure.

#Group cases in positive or negative expenditure
educ_expend_literacy$Group[educ_expend_literacy$Change.Educ_Expend < 0] <- "Less-Expenditure"
educ_expend_literacy$Group[educ_expend_literacy$Change.Educ_Expend > 0] <- "More-Expenditure"
#Drop group that is the same
#test <- educ_expend_literacy[-which(educ_expend_literacy$Group=="Same-Expenditure"),]

Preview of cases

Each case is assigned to one of two groups: Less-Expenditure or More-Expenditure.

datatable(educ_expend_literacy %>% dplyr::select(Group, Country.Name.x, Country.Code, Period.x,
              Expend.Begin, Expend.End, Change.Educ_Expend, 
              Literacy.Begin, Literacy.End, Change.Literacy) %>%
              dplyr::arrange(Group))

Summary statistics

Size of group with increased government expenditure on education: 75 cases
Size of group with decreased government expenditure on education: 69 cases

## [1] 75

## [1] 69
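
The code that builds group_more_expend and group_less_expend is not shown in this report. One plausible way to obtain them, assuming they are plain subsets on the Group column, is sketched below with made-up values.

```r
library(dplyr)

# Toy stand-in for educ_expend_literacy (values are made up).
cases <- data.frame(
  Change.Educ_Expend = c(1.2, -0.5, 0.3, -2.0, 0),
  Change.Literacy    = c(0.4, 0.1, -0.2, 0.6, 0.0)
)

# Label each case; a change of exactly zero matches neither rule and stays NA.
cases$Group <- dplyr::case_when(
  cases$Change.Educ_Expend < 0 ~ "Less-Expenditure",
  cases$Change.Educ_Expend > 0 ~ "More-Expenditure"
)

# filter() drops rows where the condition is NA, so the zero-change case
# ends up in neither group.
group_more_expend <- dplyr::filter(cases, Group == "More-Expenditure")
group_less_expend <- dplyr::filter(cases, Group == "Less-Expenditure")
nrow(group_more_expend)
nrow(group_less_expend)
```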

Group with increased expenditure on education:

Below are the summary statistics of the change in literacy rate for the group with increased spending on education.

#describe(group_more_expend$Change.Educ_Expend)
describe(group_more_expend$Change.Literacy)

##    vars  n mean   sd median trimmed  mad   min  max range skew kurtosis
## X1    1 75 0.27 1.06   0.12    0.24 0.22 -3.37 4.08  7.45 0.01     4.19
##      se
## X1 0.12

Group with decreased expenditure on education:

Below are the summary statistics of the change in literacy rate for the group with decreased spending on education.

#describe(group_less_expend$Change.Educ_Expend)
describe(group_less_expend$Change.Literacy)

##    vars  n mean   sd median trimmed  mad  min  max range skew kurtosis
## X1    1 69 0.28 1.04   0.13    0.16 0.42 -1.8 6.53  8.33 3.27    17.18
##      se
## X1 0.12

Scatter plot

The plot below shows the relationship between change in government spending in education and change in literacy rate.

Box plot

The plot below shows a side-by-side box plot of the change in literacy rate for the two groups with increased and decreased spending on education. The two groups look very similar: per the summary statistics above, their medians (0.12 and 0.13) and means (0.27 and 0.28) are nearly identical.
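
The plotting code is not included in this report. As a rough sketch of how such a side-by-side box plot could be drawn with ggplot2 (the data frame below is made up and only mimics the Group and Change.Literacy columns):

```r
library(ggplot2)

# Toy stand-in for the combined cases (values are made up).
cases <- data.frame(
  Group = rep(c("Less-Expenditure", "More-Expenditure"), each = 4),
  Change.Literacy = c(0.1, 0.3, -0.2, 0.6, 0.2, 0.0, 0.5, -0.1)
)

# One box per group, with the change in literacy on the y axis.
p <- ggplot(cases, aes(x = Group, y = Change.Literacy)) +
  geom_boxplot() +
  labs(x = "Group", y = "Change in youth literacy rate (percentage points)")
p
```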

Histogram of change in literacy

Below is the distribution of the change in literacy rate. The distribution roughly resembles a normal distribution with some skew to the right.

Part 4 - Inference

This inference is going to investigate the difference in the average change in youth literacy between two groups. The first group includes cases when there is an increase in government expenditure in education. The second group includes cases when there is a decrease in government expenditure in education.

For each of the two groups, the mean of the change in literacy is calculated. To determine whether there is a significant difference between these two means, a hypothesis test is performed using the t-distribution.

This inference assumes that the change in literacy within a one-year period is independent of other observations with different time periods and/or different countries or regions. The cases are built by selecting one-year time periods from 1996 to 2016. The sample size for each of the groups is less than 10% of the population.

The sample size for each group is greater than 30.

The distribution of the change in literacy roughly resembles a normal distribution (see histogram above), which suggests that if the sample size were increased, the distribution would be approximately normal.

Hypothesis test

Null hypothesis: There is no difference in the average change in literacy between the group with increased government expenditure on education and the group with decreased government expenditure on education over a one-year period.

Alternative hypothesis: There is a difference in the average change in literacy between the group with increased government expenditure on education and the group with decreased government expenditure on education over a one-year period.

The code below performs the calculations for the hypothesis test using the t distribution.

group1 <- group_more_expend
group2 <- group_less_expend
group1_size <- nrow(group1)
group2_size <- nrow(group2)
group1_mean <- mean(group1$Change.Literacy) 
group2_mean <- mean(group2$Change.Literacy)
group1_sd <- sd(group1$Change.Literacy)
group2_sd <- sd(group2$Change.Literacy)
diff_mean <- group1_mean - group2_mean
SE <- sqrt(group1_sd^2/group1_size + group2_sd^2/group2_size)
critical_val95 <- abs(qt(.05/2, min(group1_size, group2_size)-1)) #2-tail
t_score <- (diff_mean - 0)/SE
p_value <- (1-pt(abs(t_score), df=min(group1_size,group2_size)-1))*2
CI95_lower <- diff_mean - SE * critical_val95
CI95_upper <- diff_mean + SE * critical_val95
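
As a cross-check, the manual calculation can be compared with R's built-in t.test. The sketch below uses simulated values (not the study data) with roughly the same group sizes, means, and standard deviations. The manual version uses the conservative degrees of freedom min(n1, n2) - 1, while t.test with var.equal = FALSE uses the Welch approximation, so the t statistic agrees but the degrees of freedom and p-value can differ slightly.

```r
set.seed(606)
g1 <- rnorm(75, mean = 0.27, sd = 1.06)  # simulated, not the report's data
g2 <- rnorm(69, mean = 0.28, sd = 1.04)  # simulated, not the report's data

# Manual calculation, following the same formulas as the code above.
se       <- sqrt(var(g1) / length(g1) + var(g2) / length(g2))
t_manual <- (mean(g1) - mean(g2)) / se
df_cons  <- min(length(g1), length(g2)) - 1
p_manual <- 2 * pt(abs(t_manual), df = df_cons, lower.tail = FALSE)

# Built-in Welch two-sample test: same t statistic, different df.
welch <- t.test(g1, g2, var.equal = FALSE)
c(manual_t = t_manual, welch_t = unname(welch$statistic))
c(manual_df = df_cons, welch_df = unname(welch$parameter))
c(manual_p = p_manual, welch_p = welch$p.value)
```

The Welch degrees of freedom are never smaller than min(n1, n2) - 1, so the manual p-value is at least as large as the one from t.test.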

Below is a table of the calculations.

G1-size	G2-size	G1-mean	G2-mean	Diff-mean	G1-sd	G2-sd	SE	t-score	P-value	Critical_val95	CI95
75	69	0.2652	0.2754	-0.0101	1.0618	1.038	0.1751	-0.058	0.9539	1.9955	-0.3595 to 0.3392

As shown above, the p-value is 0.95, which is far above the 0.05 significance level, so we fail to reject the null hypothesis. The data do not provide evidence of a difference between the two groups.

Part 5 - Conclusion

Based on the result of the hypothesis test (p-value of 0.9539), there is no significant difference in the average change in literacy between the groups with increased and decreased expenditure on education over a one-year period.

The 95% confidence interval of the difference in the average change in literacy between these two groups is (-0.3595, 0.3392). This confidence interval includes zero.

The average change in literacy for the group with increased expenditure (0.2652) is actually slightly lower than for the group with decreased expenditure (0.2754). I did not expect this; I expected the group with increased expenditure on education to have the higher mean. This study looked at only one possible explanatory variable and a very limited set of data. Future work should investigate a larger data set, consider more factors, or design a better observational study.

References

Diez, D.M., Barr, C.D., & Cetinkaya-Rundel, M. (2015). OpenIntro Statistics. Retrieved from https://www.openintro.org/stat/textbook.php

World Development Indicators: https://datacatalog.worldbank.org/dataset/world-development-indicators