The education sector is one of the most important areas of development in any country. The educational system is the foundation on which a country’s human resource is moulded so they can make a contribution to the country’s development. Year after year governments touts on the increased amount of money allocated and spent on their educational system. The government of Guyana also touts on increased spending in the education sector, but still the majority of Guyana’s population remains uneducated. I am curious to discover whether the amount spent on education influences a country’s literacy rate, pupil to teacher ratio and the number of students enroled in primary, secondary and tertiary schools.
The World Bank database was my primary source where I retrieved my data. So I found it fitting to write a function to retrieve, clean and transform the data from the World Bank website via its API.
fetchIndicatorValues <- function(indicator)
{
url.base <- sprintf("http://api.worldbank.org/countries/all/indicators/%s?format=xml&per_page=10000",indicator)
doc.xml <- xmlParse(url.base)
# get root element of the data set
doc.xml.root <- xmlRoot(doc.xml)
## getting number of pages from the xml root attribute
pages <- as.integer(xmlAttrs(doc.xml.root)[["pages"]])
# creating an empty dataframe to store the query results
data.set <- data.frame(indicator = character(0),country = character(0),value = double(0),decimal=integer(0),stringsAsFactors = FALSE)
# iterating over the pages and storing the query restults into the data.set dataframe
for(i in 1:pages)
{
url.page <- sprintf("%s&page=%s",url.base,i)
doc.xml <- xmlParse(url.page)
doc.xml.root <- xmlRoot(doc.xml)
doc.xml.elements <-xmlChildren(doc.xml.root)
doc.df <- xmlToDataFrame(doc.xml.elements,stringsAsFactors = FALSE)
data.set <- rbind(data.set,doc.df)
}
# Transforming the value column from chracter to double class type
data.set$value <- as.double(data.set$value)
# Transforming the date column from chracter to integer class type
data.set$date <- as.integer(data.set$date)
#removing unwanted columns
data.set$indicator <- data.set$decimal <- NULL
# removing records with NA values
data.set <- na.omit(data.set)
# Removing continent and special country grouping
data.set <- subset(data.set, !(country %in% group.items$country))
# removing rownames so a recount of the row numbers can be triggered
row.names(data.set) <- NULL
# Replacing ' from country names to an empty space
data.set$country <- str_replace_all(data.set$country,"'"," ")
return(data.set)
}
The datasets contain values for each country for several years. For this project and the analysis that follows later down in the pipeline. I needed to calculate the mean value and expenditure over that time frame. The function below does exactly that.
calculateMeanVals <- function(countryName,tableName, col1,col2,col3 = NA)
{
if(is.na(col3))
{
col3 <- ""
}
else
{
col3 <- sprintf(",%s",col3)
}
# Select the records for a country and calculates the min/max date and average value and expenditure for that period
df <- sqldf(sprintf("Select country,min(date) as minDate,max(date) as maxDate ,AVG(%s)as %s,AVG(%s)as %s
%s
FROM
'%s' where country = '%s'
GROUP BY country %s"
,col1,
sprintf("avg_%s",col1) ,
col2,
sprintf("avg_%s",col2),
col3,
tableName,
countryName,
col3))
return(df)
}
The datasets contained values for grouped regions e.g. Caribbean small states, Euro area, etc. I created a text file that contains the grouped regions that will be used to filter out those values from the datasets.
# list of regions that are grouped in the dataset
group.items <- read.table("https://raw.githubusercontent.com/jasonjgy2000/IS607/master/Projects/607%20Proj/Data/expList.txt",stringsAsFactors = FALSE,header = FALSE,sep = "\t",col.names = c("country"))
Retrieving the general government expenditure on education (current, capital, and transfers) is expressed as a percentage of total general government expenditure on all sectors and storing it into a data frame.
gov.exp.df <- fetchIndicatorValues("SE.XPD.TOTL.GB.ZS")
head(gov.exp.df)
## country date value
## 1 Afghanistan 2014 18.06834
## 2 Afghanistan 2013 18.46120
## 3 Afghanistan 2012 12.82317
## 4 Afghanistan 2011 19.10585
## 5 Afghanistan 2010 22.24014
## 6 Albania 2007 11.17719
Retrieving and cleaning the adults (15+) literacy rate percentage dataset. It contains the total percentage of the population age 15 and above who can, with understanding, read and write a short, simple statement on their everyday life.
literacy.rate.df <- fetchIndicatorValues("SE.ADT.LITR.ZS")
head(literacy.rate.df)
## country date value
## 1 Afghanistan 2011 31.74112
## 2 Afghanistan 1979 18.15768
## 3 Albania 2012 97.24696
## 4 Albania 2011 96.84530
## 5 Albania 2008 95.93864
## 6 Albania 2001 98.71297
literacy.rate.expenditure <-dplyr::inner_join(gov.exp.df,literacy.rate.df,by=c("country","date"))
colnames(literacy.rate.expenditure) <- c("country","date","expenditure","literacyrate")
countries <- literacy.rate.expenditure[!duplicated(literacy.rate.expenditure[,"country"]),1]
for(i in 1: length(countries))
{
if(i == 1)
{
literacy.rate.expenditure.avg <- calculateMeanVals(countries[i],"literacy.rate.expenditure","expenditure","literacyrate")
}
else
{
literacy.rate.expenditure.avg <- rbind(literacy.rate.expenditure.avg,calculateMeanVals(countries[i],"literacy.rate.expenditure","expenditure","literacyrate"))
}
}
## Loading required package: tcltk
head(literacy.rate.expenditure.avg)
## country minDate maxDate avg_expenditure avg_literacyrate
## 1 Afghanistan 2011 2011 19.10585 31.74112
## 2 Albania 2001 2001 10.58193 98.71297
## 3 Argentina 2001 2001 16.35761 97.19331
## 4 Armenia 2011 2011 12.59334 99.74442
## 5 Aruba 2000 2010 21.99895 97.05695
## 6 Azerbaijan 2007 2011 8.86377 99.72788
#summary(literacy.rate.expenditure.avg)
Below I am parsing the number of pupils enrolled in primary school divided by the number of primary school teachers. This dataset was parsed from a CSV file, cleaned and transformed into a data frame.
# load Csv
pupil.teacher.df <- read.csv(file="https://raw.githubusercontent.com/jasonjgy2000/IS607/master/Projects/607%20Proj/Data/se.prm.enrl.tc.zs_Indicator_en_csv_v2.csv",skip = 4,header = TRUE, stringsAsFactors = FALSE,check.names = FALSE)
# removing columns not needed
pupil.teacher.df$`Country Code` <- pupil.teacher.df$`Indicator Name` <- pupil.teacher.df$`Indicator Code` <- NULL
# renaming first column from Country name to country
colnames(pupil.teacher.df)[1] <- "country"
#removing extra column with NA's
pupil.teacher.df[,58] <- NULL
dplyr::tbl_df(pupil.teacher.df)
## Source: local data frame [248 x 57]
##
## country 1960 1961 1962 1963 1964 1965 1966 1967
## (chr) (lgl) (lgl) (lgl) (lgl) (lgl) (lgl) (lgl) (lgl)
## 1 Aruba NA NA NA NA NA NA NA NA
## 2 Andorra NA NA NA NA NA NA NA NA
## 3 Afghanistan NA NA NA NA NA NA NA NA
## 4 Angola NA NA NA NA NA NA NA NA
## 5 Albania NA NA NA NA NA NA NA NA
## 6 Arab World NA NA NA NA NA NA NA NA
## 7 United Arab Emirates NA NA NA NA NA NA NA NA
## 8 Argentina NA NA NA NA NA NA NA NA
## 9 Armenia NA NA NA NA NA NA NA NA
## 10 American Samoa NA NA NA NA NA NA NA NA
## .. ... ... ... ... ... ... ... ... ...
## Variables not shown: 1968 (lgl), 1969 (lgl), 1970 (dbl), 1971 (dbl), 1972
## (dbl), 1973 (dbl), 1974 (dbl), 1975 (dbl), 1976 (dbl), 1977 (dbl), 1978
## (dbl), 1979 (dbl), 1980 (dbl), 1981 (dbl), 1982 (dbl), 1983 (dbl), 1984
## (dbl), 1985 (dbl), 1986 (dbl), 1987 (dbl), 1988 (dbl), 1989 (dbl), 1990
## (dbl), 1991 (dbl), 1992 (dbl), 1993 (dbl), 1994 (dbl), 1995 (dbl), 1996
## (dbl), 1997 (dbl), 1998 (dbl), 1999 (dbl), 2000 (dbl), 2001 (dbl), 2002
## (dbl), 2003 (dbl), 2004 (dbl), 2005 (dbl), 2006 (dbl), 2007 (dbl), 2008
## (dbl), 2009 (dbl), 2010 (dbl), 2011 (dbl), 2012 (dbl), 2013 (dbl), 2014
## (dbl), 2015 (lgl)
# wide to long transformation
pupil.teacher.df <- pupil.teacher.df %>% gather("date","value",2:57)
# change Year column from factor to integer class type
pupil.teacher.df$date <- as.integer(as.character(pupil.teacher.df$date))
# removing records with NA values
pupil.teacher.df <- na.omit(pupil.teacher.df)
# Removing continent and special country grouping
pupil.teacher.df <- subset(pupil.teacher.df, !(country %in% group.items$country))
# removing rownames so a recount of the row numbers can be triggered
row.names(pupil.teacher.df) <- NULL
# Replacing ' from country names to an empty space
pupil.teacher.df$country <- str_replace_all(pupil.teacher.df$country,"'"," ")
# joining the pupil teacher ratio data frame to the expenditure data frame by country and date
pupil.techer.expenditure <-dplyr::inner_join(gov.exp.df,pupil.teacher.df,by=c("country","date"))
# assigning new column names to the pupil teacher ratio expenditure data frame
colnames(pupil.techer.expenditure) <- c("country","date","expenditure","pupilTeacher")
# selecting distinct country names from the pupil.teacher.expenditure data frame
countries <- pupil.techer.expenditure[!duplicated(pupil.techer.expenditure[,"country"]),1]
# creating the average pupil teacher expenditure data frame.
for(i in 1: length(countries))
{
if(i == 1)
{
pupil.techer.expenditure.avg <- calculateMeanVals(countries[i],"pupil.techer.expenditure","expenditure","pupilTeacher",NA)
}
else
{
pupil.techer.expenditure.avg <- rbind(pupil.techer.expenditure.avg,calculateMeanVals(countries[i],"pupil.techer.expenditure","expenditure","pupilTeacher",NA))
}
}
head(pupil.techer.expenditure.avg)
## country minDate maxDate avg_expenditure avg_pupilTeacher
## 1 Afghanistan 2010 2012 18.05639 44.20456
## 2 Albania 1997 2003 10.61149 22.14333
## 3 Algeria 2008 2008 11.42941 23.23052
## 4 Angola 2010 2010 8.68477 45.58651
## 5 Antigua and Barbuda 2009 2009 6.91662 16.22446
## 6 Argentina 1998 2008 16.03012 17.65021
Retrieving and cleaning the datasets that contain the gross enrolment ratio for primary, secondary and tertiary schools.
enrollment.primary <- fetchIndicatorValues("SE.PRM.ENRR")
head(enrollment.primary)
## country date value
## 1 Afghanistan 2013 105.85978
## 2 Afghanistan 2012 103.94475
## 3 Afghanistan 2011 97.40102
## 4 Afghanistan 2010 99.39882
## 5 Afghanistan 2009 95.72796
## 6 Afghanistan 2008 99.46714
enrollment.secondary <- fetchIndicatorValues("SE.SEC.ENRR")
head(enrollment.secondary)
## country date value
## 1 Afghanistan 2013 54.31084
## 2 Afghanistan 2012 54.03824
## 3 Afghanistan 2011 51.77018
## 4 Afghanistan 2010 50.18150
## 5 Afghanistan 2009 44.23704
## 6 Afghanistan 2008 38.37213
enrollment.tertiary <- fetchIndicatorValues("SE.TER.ENRR")
head(enrollment.tertiary)
## country date value
## 1 Afghanistan 2011 3.74394
## 2 Afghanistan 2009 3.91140
## 3 Afghanistan 2004 1.29312
## 4 Afghanistan 2003 1.27942
## 5 Afghanistan 1990 2.29743
## 6 Afghanistan 1987 1.75979
enrollment.primary$type <- "primary"
enrollment.secondary$type <- "secondary"
enrollment.tertiary$type <- "tertiary"
# combining the primary, secondary and tertiary data frames into one data frame
enrollment.all <- rbind(enrollment.primary,enrollment.secondary,enrollment.tertiary)
# joining the enrollment data frame to the expenditure data frame by country and date
enrollment.expenditure <-dplyr::inner_join(gov.exp.df,enrollment.all,by=c("country","date"))
colnames(enrollment.expenditure) <- c("country","date","expenditure","enrollment","type")
# selecting distinct country names from the enrollment.all data frame
countries <- enrollment.all[!duplicated(enrollment.all[,"country"]),1]
# creating the average entrollment expenditure data frame.
for(i in 1: length(countries))
{
if(i == 1)
{
enrollment.expenditure.avg <- calculateMeanVals(countries[i],"enrollment.expenditure","expenditure","enrollment","type")
}
else
{
enrollment.expenditure.avg <- rbind(enrollment.expenditure.avg,calculateMeanVals(countries[i],"enrollment.expenditure","expenditure","enrollment","type"))
}
}
head(enrollment.expenditure.avg)
## country minDate maxDate avg_expenditure avg_enrollment type
## 1 Afghanistan 2010 2013 18.15759 101.65109 primary
## 2 Afghanistan 2010 2013 18.15759 52.57519 secondary
## 3 Afghanistan 2011 2011 19.10585 3.74394 tertiary
## 4 Albania 1997 2003 10.23420 101.32601 primary
## 5 Albania 1997 2007 10.52824 72.42682 secondary
## 6 Albania 1997 2007 10.52824 18.16968 tertiary
ggplot(gov.exp.df, aes(x = value)) + geom_histogram() +ggtitle("Government Expenditure")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
plot(literacy.rate.expenditure.avg$avg_literacyrate~literacy.rate.expenditure.avg$avg_expenditure, xlab = "Expenditure",ylab = "Literacy Rate",main ="Government Expenditure vs Literacy Rate")
plot(pupil.techer.expenditure.avg$avg_pupilTeacher~pupil.techer.expenditure.avg$avg_expenditure, xlab = "Expenditure",ylab = "Pupil Teacher Ratio",main ="Government Expenditure vs Pupil Teacher Ratio")
plot(enrollment.expenditure.avg$avg_enrollment~enrollment.expenditure.avg$avg_expenditure, xlab = "Expenditure",ylab = "Student Enrollment",main ="Government Expenditure vs Student Enrollment")
literacy.rate.cor <- cor.test(literacy.rate.expenditure.avg$avg_expenditure,literacy.rate.expenditure.avg$avg_literacyrate)
literacy.rate.cor$estimate
## cor
## -0.1934485
pupil.teacher.cor <- cor.test(pupil.techer.expenditure.avg$avg_expenditure,pupil.techer.expenditure.avg$avg_pupilTeacher)
pupil.teacher.cor$estimate
## cor
## 0.1625357
student.enrollment.cor <- cor.test(enrollment.expenditure.avg$avg_expenditure,enrollment.expenditure.avg$avg_enrollment)
student.enrollment.cor$estimate
## cor
## -0.1080285
df <- data.frame(item = character(0), corvalue = double(0), stringsAsFactors = FALSE)
df <- rbind(df,data.frame(item = "Literacy Rate",value = literacy.rate.cor$estimate , stringsAsFactors = FALSE))
df <- rbind(df,data.frame(item = "Pupil To Teacher Ratio",value = pupil.teacher.cor$estimate , stringsAsFactors = FALSE))
df <- rbind(df,data.frame(item = "Student Enrollment",value = student.enrollment.cor$estimate , stringsAsFactors = FALSE))
row.names(df) <- NULL
ggplot(df, aes(x = item, y=value)) + geom_histogram(stat = "identity") + ggtitle("Correlation")
There is an extremely weak relationship between government expenditure into the educational system and their country’s literacy rate, pupil to teacher ratio and the number of students enrolled in primary, secondary and tertiary schools.