Introduction

The education sector is one of the most important areas of development in any country. The educational system is the foundation on which a country’s human resource is moulded so they can make a contribution to the country’s development. Year after year governments touts on the increased amount of money allocated and spent on their educational system. The government of Guyana also touts on increased spending in the education sector, but still the majority of Guyana’s population remains uneducated. I am curious to discover whether the amount spent on education influences a country’s literacy rate, pupil to teacher ratio and the number of students enroled in primary, secondary and tertiary schools.

Data Acquisition

Functions

World Bank API

The World Bank database was my primary source where I retrieved my data. So I found it fitting to write a function to retrieve, clean and transform the data from the World Bank website via its API.

fetchIndicatorValues <- function(indicator)
{
  url.base <- sprintf("http://api.worldbank.org/countries/all/indicators/%s?format=xml&per_page=10000",indicator)
  doc.xml <- xmlParse(url.base)
  
  # get root element of the data set
  doc.xml.root <- xmlRoot(doc.xml)
  
  ## getting number of pages from the xml root attribute
  pages <- as.integer(xmlAttrs(doc.xml.root)[["pages"]])
  
  # creating an empty dataframe to store the query results
  data.set <- data.frame(indicator = character(0),country = character(0),value = double(0),decimal=integer(0),stringsAsFactors = FALSE)
  
  # iterating over the pages and storing the query restults into the data.set dataframe
  for(i in 1:pages)
  {
    url.page <- sprintf("%s&page=%s",url.base,i) 
    doc.xml <- xmlParse(url.page)
    doc.xml.root <- xmlRoot(doc.xml)
    doc.xml.elements <-xmlChildren(doc.xml.root)
    doc.df <- xmlToDataFrame(doc.xml.elements,stringsAsFactors = FALSE)
    data.set <- rbind(data.set,doc.df)
  }
  
  # Transforming the value column from chracter to double class type
  data.set$value <- as.double(data.set$value)
  # Transforming the date column from chracter to integer class type
  data.set$date <- as.integer(data.set$date)
  #removing unwanted columns
  data.set$indicator <- data.set$decimal <- NULL
  # removing records with NA values
  data.set <- na.omit(data.set)
  # Removing continent and special country grouping
  data.set <- subset(data.set, !(country %in% group.items$country))
  # removing rownames so a recount of the row numbers can be triggered
  row.names(data.set) <- NULL
  # Replacing ' from country names to an empty space 
  data.set$country <- str_replace_all(data.set$country,"'"," ")
  return(data.set)
}

Calculating Mean Values

The datasets contain values for each country for several years. For this project and the analysis that follows later down in the pipeline. I needed to calculate the mean value and expenditure over that time frame. The function below does exactly that.

calculateMeanVals <- function(countryName,tableName, col1,col2,col3 = NA)
{
 if(is.na(col3))
    {
     col3 <- ""
    }
    else
    {
     col3 <- sprintf(",%s",col3)
    }
  
  # Select the records for a country and calculates the min/max date and average value and expenditure for that period
  
  df <- sqldf(sprintf("Select  country,min(date) as minDate,max(date) as maxDate ,AVG(%s)as %s,AVG(%s)as %s 
                      %s
                      FROM 
                      '%s' where country = '%s'
                      GROUP BY country %s"
                      ,col1, 
                      sprintf("avg_%s",col1) ,
                      col2,
                      sprintf("avg_%s",col2),
                      col3,
                      tableName,
                      countryName,
                      col3))
  
  return(df)
}

List of grouped regions

The datasets contained values for grouped regions e.g. Caribbean small states, Euro area, etc. I created a text file that contains the grouped regions that will be used to filter out those values from the datasets.

# list of regions that are grouped in the dataset
group.items <- read.table("https://raw.githubusercontent.com/jasonjgy2000/IS607/master/Projects/607%20Proj/Data/expList.txt",stringsAsFactors = FALSE,header = FALSE,sep = "\t",col.names = c("country"))

Government expenditure on education, total (% of government expenditure)

Retrieving the general government expenditure on education (current, capital, and transfers) is expressed as a percentage of total general government expenditure on all sectors and storing it into a data frame.

gov.exp.df <- fetchIndicatorValues("SE.XPD.TOTL.GB.ZS")
head(gov.exp.df)

##       country date    value
## 1 Afghanistan 2014 18.06834
## 2 Afghanistan 2013 18.46120
## 3 Afghanistan 2012 12.82317
## 4 Afghanistan 2011 19.10585
## 5 Afghanistan 2010 22.24014
## 6     Albania 2007 11.17719

Literacy rate, adult total (% of people ages 15 and above)

Retrieving and cleaning the adults (15+) literacy rate percentage dataset. It contains the total percentage of the population age 15 and above who can, with understanding, read and write a short, simple statement on their everyday life.

literacy.rate.df <- fetchIndicatorValues("SE.ADT.LITR.ZS")
head(literacy.rate.df)

##       country date    value
## 1 Afghanistan 2011 31.74112
## 2 Afghanistan 1979 18.15768
## 3     Albania 2012 97.24696
## 4     Albania 2011 96.84530
## 5     Albania 2008 95.93864
## 6     Albania 2001 98.71297

literacy.rate.expenditure <-dplyr::inner_join(gov.exp.df,literacy.rate.df,by=c("country","date"))

colnames(literacy.rate.expenditure) <- c("country","date","expenditure","literacyrate")


countries <- literacy.rate.expenditure[!duplicated(literacy.rate.expenditure[,"country"]),1]

  for(i in 1: length(countries))
  {
    if(i == 1)
    {
      literacy.rate.expenditure.avg <-  calculateMeanVals(countries[i],"literacy.rate.expenditure","expenditure","literacyrate")
    }
    else
    {
literacy.rate.expenditure.avg <- rbind(literacy.rate.expenditure.avg,calculateMeanVals(countries[i],"literacy.rate.expenditure","expenditure","literacyrate"))  
    }
    
  }

## Loading required package: tcltk

head(literacy.rate.expenditure.avg)

##       country minDate maxDate avg_expenditure avg_literacyrate
## 1 Afghanistan    2011    2011        19.10585         31.74112
## 2     Albania    2001    2001        10.58193         98.71297
## 3   Argentina    2001    2001        16.35761         97.19331
## 4     Armenia    2011    2011        12.59334         99.74442
## 5       Aruba    2000    2010        21.99895         97.05695
## 6  Azerbaijan    2007    2011         8.86377         99.72788

#summary(literacy.rate.expenditure.avg)

Pupil to teacher ratio in primary Schools

Below I am parsing the number of pupils enrolled in primary school divided by the number of primary school teachers. This dataset was parsed from a CSV file, cleaned and transformed into a data frame.

# load Csv
pupil.teacher.df <- read.csv(file="https://raw.githubusercontent.com/jasonjgy2000/IS607/master/Projects/607%20Proj/Data/se.prm.enrl.tc.zs_Indicator_en_csv_v2.csv",skip = 4,header = TRUE, stringsAsFactors = FALSE,check.names = FALSE)

# removing columns not needed 
pupil.teacher.df$`Country Code` <- pupil.teacher.df$`Indicator Name` <- pupil.teacher.df$`Indicator Code` <-  NULL

# renaming first column from  Country name to country
colnames(pupil.teacher.df)[1] <- "country"

#removing extra column with NA's
pupil.teacher.df[,58] <- NULL

dplyr::tbl_df(pupil.teacher.df)

## Source: local data frame [248 x 57]
## 
##                 country  1960  1961  1962  1963  1964  1965  1966  1967
##                   (chr) (lgl) (lgl) (lgl) (lgl) (lgl) (lgl) (lgl) (lgl)
## 1                 Aruba    NA    NA    NA    NA    NA    NA    NA    NA
## 2               Andorra    NA    NA    NA    NA    NA    NA    NA    NA
## 3           Afghanistan    NA    NA    NA    NA    NA    NA    NA    NA
## 4                Angola    NA    NA    NA    NA    NA    NA    NA    NA
## 5               Albania    NA    NA    NA    NA    NA    NA    NA    NA
## 6            Arab World    NA    NA    NA    NA    NA    NA    NA    NA
## 7  United Arab Emirates    NA    NA    NA    NA    NA    NA    NA    NA
## 8             Argentina    NA    NA    NA    NA    NA    NA    NA    NA
## 9               Armenia    NA    NA    NA    NA    NA    NA    NA    NA
## 10       American Samoa    NA    NA    NA    NA    NA    NA    NA    NA
## ..                  ...   ...   ...   ...   ...   ...   ...   ...   ...
## Variables not shown: 1968 (lgl), 1969 (lgl), 1970 (dbl), 1971 (dbl), 1972
##   (dbl), 1973 (dbl), 1974 (dbl), 1975 (dbl), 1976 (dbl), 1977 (dbl), 1978
##   (dbl), 1979 (dbl), 1980 (dbl), 1981 (dbl), 1982 (dbl), 1983 (dbl), 1984
##   (dbl), 1985 (dbl), 1986 (dbl), 1987 (dbl), 1988 (dbl), 1989 (dbl), 1990
##   (dbl), 1991 (dbl), 1992 (dbl), 1993 (dbl), 1994 (dbl), 1995 (dbl), 1996
##   (dbl), 1997 (dbl), 1998 (dbl), 1999 (dbl), 2000 (dbl), 2001 (dbl), 2002
##   (dbl), 2003 (dbl), 2004 (dbl), 2005 (dbl), 2006 (dbl), 2007 (dbl), 2008
##   (dbl), 2009 (dbl), 2010 (dbl), 2011 (dbl), 2012 (dbl), 2013 (dbl), 2014
##   (dbl), 2015 (lgl)

# wide to long transformation
pupil.teacher.df <- pupil.teacher.df %>%  gather("date","value",2:57)

# change Year column from factor to integer class type
pupil.teacher.df$date <- as.integer(as.character(pupil.teacher.df$date))

# removing records with NA values
pupil.teacher.df <- na.omit(pupil.teacher.df)

# Removing continent and special country grouping
pupil.teacher.df <- subset(pupil.teacher.df, !(country %in% group.items$country))

# removing rownames so a recount of the row numbers can be triggered
row.names(pupil.teacher.df) <- NULL

# Replacing ' from country names to an empty space 
pupil.teacher.df$country <- str_replace_all(pupil.teacher.df$country,"'"," ")

# joining the pupil teacher ratio data frame to the expenditure data frame by country and date
pupil.techer.expenditure <-dplyr::inner_join(gov.exp.df,pupil.teacher.df,by=c("country","date"))

# assigning new column names to the pupil teacher ratio expenditure data frame
colnames(pupil.techer.expenditure) <- c("country","date","expenditure","pupilTeacher")

# selecting distinct country names from the pupil.teacher.expenditure data frame
countries <- pupil.techer.expenditure[!duplicated(pupil.techer.expenditure[,"country"]),1]

# creating the average pupil teacher expenditure data frame.

  for(i in 1: length(countries))
  {
    if(i == 1)
    {
      pupil.techer.expenditure.avg <-  calculateMeanVals(countries[i],"pupil.techer.expenditure","expenditure","pupilTeacher",NA)
    }
    else
    {
pupil.techer.expenditure.avg <- rbind(pupil.techer.expenditure.avg,calculateMeanVals(countries[i],"pupil.techer.expenditure","expenditure","pupilTeacher",NA))  
    }
  }  


head(pupil.techer.expenditure.avg)

##               country minDate maxDate avg_expenditure avg_pupilTeacher
## 1         Afghanistan    2010    2012        18.05639         44.20456
## 2             Albania    1997    2003        10.61149         22.14333
## 3             Algeria    2008    2008        11.42941         23.23052
## 4              Angola    2010    2010         8.68477         45.58651
## 5 Antigua and Barbuda    2009    2009         6.91662         16.22446
## 6           Argentina    1998    2008        16.03012         17.65021

Student enrollment primary, secondary and tertiary

Retrieving and cleaning the datasets that contain the gross enrolment ratio for primary, secondary and tertiary schools.

enrollment.primary <- fetchIndicatorValues("SE.PRM.ENRR")
head(enrollment.primary)

##       country date     value
## 1 Afghanistan 2013 105.85978
## 2 Afghanistan 2012 103.94475
## 3 Afghanistan 2011  97.40102
## 4 Afghanistan 2010  99.39882
## 5 Afghanistan 2009  95.72796
## 6 Afghanistan 2008  99.46714

enrollment.secondary <- fetchIndicatorValues("SE.SEC.ENRR")
head(enrollment.secondary)

##       country date    value
## 1 Afghanistan 2013 54.31084
## 2 Afghanistan 2012 54.03824
## 3 Afghanistan 2011 51.77018
## 4 Afghanistan 2010 50.18150
## 5 Afghanistan 2009 44.23704
## 6 Afghanistan 2008 38.37213

enrollment.tertiary <- fetchIndicatorValues("SE.TER.ENRR")
head(enrollment.tertiary)

##       country date   value
## 1 Afghanistan 2011 3.74394
## 2 Afghanistan 2009 3.91140
## 3 Afghanistan 2004 1.29312
## 4 Afghanistan 2003 1.27942
## 5 Afghanistan 1990 2.29743
## 6 Afghanistan 1987 1.75979

enrollment.primary$type <- "primary"
enrollment.secondary$type <- "secondary"
enrollment.tertiary$type <- "tertiary"

# combining the primary, secondary and tertiary data frames into one data frame

enrollment.all <- rbind(enrollment.primary,enrollment.secondary,enrollment.tertiary)

# joining the enrollment data frame to the expenditure data frame by country and date
enrollment.expenditure <-dplyr::inner_join(gov.exp.df,enrollment.all,by=c("country","date"))

colnames(enrollment.expenditure) <- c("country","date","expenditure","enrollment","type")

# selecting distinct country names from the enrollment.all data frame
countries <- enrollment.all[!duplicated(enrollment.all[,"country"]),1]

# creating the average entrollment expenditure data frame. 
  for(i in 1: length(countries))
  {
    if(i == 1)
    {
      enrollment.expenditure.avg <-  calculateMeanVals(countries[i],"enrollment.expenditure","expenditure","enrollment","type")
    }
    else
    {
enrollment.expenditure.avg <- rbind(enrollment.expenditure.avg,calculateMeanVals(countries[i],"enrollment.expenditure","expenditure","enrollment","type"))  
    }
    
  } 

head(enrollment.expenditure.avg)

##       country minDate maxDate avg_expenditure avg_enrollment      type
## 1 Afghanistan    2010    2013        18.15759      101.65109   primary
## 2 Afghanistan    2010    2013        18.15759       52.57519 secondary
## 3 Afghanistan    2011    2011        19.10585        3.74394  tertiary
## 4     Albania    1997    2003        10.23420      101.32601   primary
## 5     Albania    1997    2007        10.52824       72.42682 secondary
## 6     Albania    1997    2007        10.52824       18.16968  tertiary

Visualization

Government expenditure on education, total (% of government expenditure)

ggplot(gov.exp.df, aes(x = value)) + geom_histogram() +ggtitle("Government Expenditure")

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Literacy rate, adult total (% of people ages 15 and above)

plot(literacy.rate.expenditure.avg$avg_literacyrate~literacy.rate.expenditure.avg$avg_expenditure, xlab = "Expenditure",ylab = "Literacy Rate",main ="Government Expenditure vs Literacy Rate")

Pupil to teacher ratio in primary Schools

plot(pupil.techer.expenditure.avg$avg_pupilTeacher~pupil.techer.expenditure.avg$avg_expenditure, xlab = "Expenditure",ylab = "Pupil Teacher Ratio",main ="Government Expenditure vs Pupil Teacher Ratio")

Student enrollment primary, secondary and tertiary

plot(enrollment.expenditure.avg$avg_enrollment~enrollment.expenditure.avg$avg_expenditure, xlab = "Expenditure",ylab = "Student Enrollment",main ="Government Expenditure vs Student Enrollment")

Analysis

Literacy rate, adult total (% of people ages 15 and above)

literacy.rate.cor <- cor.test(literacy.rate.expenditure.avg$avg_expenditure,literacy.rate.expenditure.avg$avg_literacyrate)

literacy.rate.cor$estimate

##        cor 
## -0.1934485

Pupil to teacher ratio in primary Schools

pupil.teacher.cor <-  cor.test(pupil.techer.expenditure.avg$avg_expenditure,pupil.techer.expenditure.avg$avg_pupilTeacher)
pupil.teacher.cor$estimate

##       cor 
## 0.1625357

Student enrollment primary, secondary and tertiary

student.enrollment.cor <-  cor.test(enrollment.expenditure.avg$avg_expenditure,enrollment.expenditure.avg$avg_enrollment)
student.enrollment.cor$estimate

##        cor 
## -0.1080285

df <- data.frame(item = character(0), corvalue = double(0), stringsAsFactors = FALSE)
df <- rbind(df,data.frame(item = "Literacy Rate",value = literacy.rate.cor$estimate , stringsAsFactors = FALSE))
df <- rbind(df,data.frame(item = "Pupil To Teacher Ratio",value = pupil.teacher.cor$estimate , stringsAsFactors = FALSE))
df <- rbind(df,data.frame(item = "Student Enrollment",value = student.enrollment.cor$estimate , stringsAsFactors = FALSE))
row.names(df) <- NULL
ggplot(df, aes(x = item, y=value)) + geom_histogram(stat = "identity") + ggtitle("Correlation")

Conclusion

There is an extremely weak relationship between government expenditure into the educational system and their country’s literacy rate, pupil to teacher ratio and the number of students enrolled in primary, secondary and tertiary schools.

IS607: Data Acquisition and Management Final Project

Jason Joseph

December 1, 2015

Introduction

Data Acquisition

Functions

World Bank API

Calculating Mean Values

List of grouped regions

Government expenditure on education, total (% of government expenditure)

Literacy rate, adult total (% of people ages 15 and above)

Pupil to teacher ratio in primary Schools

Student enrollment primary, secondary and tertiary

Visualization

Government expenditure on education, total (% of government expenditure)

Literacy rate, adult total (% of people ages 15 and above)

Pupil to teacher ratio in primary Schools

Student enrollment primary, secondary and tertiary

Analysis

Literacy rate, adult total (% of people ages 15 and above)

Pupil to teacher ratio in primary Schools

Student enrollment primary, secondary and tertiary

Conclusion