In this analysis I will utilise the OECD dataset on Goverment Expenditure on Education for the years 2010-2015. I will join this dataset with the World Bank Cuntry classification and apply suitable Data Preprocessing techniques to reveal patterns in spending in the combined data.
The World Bank data is processed first. This dataset classifies 211 countries by region and income group. It is originally in excel format with multiple sheets and headers. After loading using the readxl library, I filter and subset this dataset and use the mutate function to create factors of “Income Group” and “Region”. A scan and treatment for erroneous data and missing values is perfromed and the results provided.
Next, the OECD dataset containing Education spending data for Years 2010-2015 is processed. This dataset is not in tidy format as variables are stored as observations. The dataset is loaded, columns re-named, to align with the world bank data, and the dplyr spread function is used to create new variables for the year 2015. A scan and treatment for erroneous data and missing values is performed and the results provided. Additionally, I utilise the rowMeans function and the mutate function to impute the average spend value for that country where a spend value is missing for a particular country-year.
Finally, both datasets are joined. A scan for missing values and outliers is made. Lastly the distribution of produce distributions for the indicators used and transform one of the indicator variables to produce a more normal distribution.
https://datahelpdesk.worldbank.org/knowledgebase/articles/906519-world-bank-country-and-lending-groups, download date “25/10/2019”, license: https://datacatalog.worldbank.org/public-licenses#cc-other
Source: Education at a glance: Educational finance indicators, https://data.oecd.org/eduresource/public-spending-on-education.htm, download date “25/10/2019”
# clean the environment
rm(list=ls())
library(readr)
library(dplyr)
library(tidyr)
library(Hmisc)
library(ggplot2)
library(knitr)
library(readxl)
library(outliers)
country_worldbank <- read_excel("data/country_class_worldbank.xls", sheet = "List of economies", skip = 3)
-
/
New names:
* x -> x..1
* x -> x..2
cat(" Dimensions: " , dim(country_worldbank))
Dimensions: 275 9
head(country_worldbank)
Relevant Columns in world bankdataset: Economy, Code, Region, Income group
# read in OECD eduaction expenditure data
education <- read_csv("data/OECD_ED_Spend_1.csv")
Parsed with column specification:
cols(
LOCATION = [31mcol_character()[39m,
INDICATOR = [31mcol_character()[39m,
SUBJECT = [31mcol_character()[39m,
MEASURE = [31mcol_character()[39m,
FREQUENCY = [31mcol_character()[39m,
TIME = [32mcol_double()[39m,
Value = [32mcol_double()[39m,
`Flag Codes` = [31mcol_character()[39m
)
cat(" Dimensions: " , dim(education))
Dimensions: 5401 8
head(education)
Relevant Columns in OECD Education dataset: LOCATION, INDICATOR, TIME, Value
Summarise the types of variables and data structures, check the attributes in the data and apply proper data type conversions. In addition to the R codes and outputs, explain briefly the steps that you have taken. In this section, show that you have fulfilled minimum requirements 2-4.
Verify Data types - all character columns Columns of interst: “Economy” rename -> “CountryName” “Code” rename -> “CountryCode” “Income group” reaname -> “IncomeGroup” to remove space. Alter to be a factor - ordered. “Region” column to remain unchanged. Alter to be a factor - ordered. All other columns to be removed,
cat("############## country_worldbank ##################\n")
############## country_worldbank ##################
str(country_worldbank)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 275 obs. of 9 variables:
$ x..1 : chr "x" "1" "2" "3" ...
$ x..2 : chr "x" NA NA NA ...
$ Economy : chr "x" "Afghanistan" "Albania" "Algeria" ...
$ Code : chr "x" "AFG" "ALB" "DZA" ...
$ X : chr "x" NA NA NA ...
$ Region : chr "x" "South Asia" "Europe & Central Asia" "Middle East & North Africa" ...
$ Income group : chr "x" "Low income" "Upper middle income" "Upper middle income" ...
$ Lending category: chr "x" "IDA" "IBRD" "IBRD" ...
$ Other : chr "x" "HIPC" NA NA ...
describe(country_worldbank)
country_worldbank
9 Variables 275 Observations
----------------------------------------------------------------------------------------------------------------------------------
x..1
n missing distinct
269 6 223
lowest : 1 10 100 101 102
highest: Geographic classifications in this table cover all income levels. IDA countries are those that lack the financial ability to borrow from IBRD. IDA credits are deeply concessional—interest-free loans and grants for programs aimed at boosting economic growth and improving living conditions. IBRD loans are noncessional. Blend countries are eligible for IDA credits because of their low per capita incomes but are also eligible for IBRD because they are financially creditworthy. Note: The term country, used interchangeably with economy, does not imply political independence but refers to any territory for which authorities report separate social or economic statistics. Income classifications set on 1 July 2019 remain in effect until 1 July 2020. Argentina, which was temporarily unclassified in July 2016 pending release of revised national accounts statistics, was classified as upper middle income for FY17 as of 29 September 2016 based on alternative conversion factors. Also effective 29 September 2016, Syrian Arab Republic is reclassified from IBRD lending category to IDA-only. On 29 March 2017, new country codes were introduced to align World Bank 3-letter codes with ISO 3-letter codes: Andorra (AND), Dem. Rep. Congo (COD), Isle of Man (IMN), Kosovo (XKX), Romania (ROU), Timor-Leste (TLS), and West Bank and Gaza (PSE). This table classifies all World Bank member countries (189), and all other economies with populations of more than 30,000. For operational and analytical purposes, economies are divided among income groups according to 2018 gross national income (GNI) per capita, calculated using the World Bank Atlas method. The groups are: low income, $1,025 or less; lower middle income, $1,026 - 3,995; upper middle income, $3,996 - 12,375; and high income, $12,375 or more. The effective operational cutoff for IDA eligibility is $1,175 or less. x
----------------------------------------------------------------------------------------------------------------------------------
x..2
n missing distinct value
1 274 1 x
Value x
Frequency 1
Proportion 1
----------------------------------------------------------------------------------------------------------------------------------
Economy
n missing distinct
265 10 265
lowest : Afghanistan Albania Algeria American Samoa Andorra
highest: World x Yemen, Rep. Zambia Zimbabwe
----------------------------------------------------------------------------------------------------------------------------------
Code
n missing distinct
265 10 265
lowest : ABW AFG AGO ALB AND, highest: XKX YEM ZAF ZMB ZWE
----------------------------------------------------------------------------------------------------------------------------------
X
n missing distinct value
1 274 1 x
Value x
Frequency 1
Proportion 1
----------------------------------------------------------------------------------------------------------------------------------
Region
n missing distinct
219 56 8
Value East Asia & Pacific Europe & Central Asia Latin America & Caribbean Middle East & North Africa
Frequency 38 58 42 21
Proportion 0.174 0.265 0.192 0.096
Value North America South Asia Sub-Saharan Africa x
Frequency 3 8 48 1
Proportion 0.014 0.037 0.219 0.005
----------------------------------------------------------------------------------------------------------------------------------
Income group
n missing distinct
219 56 5
Value High income Low income Lower middle income Upper middle income x
Frequency 80 31 47 60 1
Proportion 0.365 0.142 0.215 0.274 0.005
----------------------------------------------------------------------------------------------------------------------------------
Lending category
n missing distinct
219 56 5
Value .. Blend IBRD IDA x
Frequency 74 17 68 59 1
Proportion 0.338 0.078 0.311 0.269 0.005
----------------------------------------------------------------------------------------------------------------------------------
Other
n missing distinct
59 216 3
Value EMU HIPC x
Frequency 19 39 1
Proportion 0.322 0.661 0.017
----------------------------------------------------------------------------------------------------------------------------------
# select, rename and filter.
country_worldbank <- country_worldbank %>%
select("Economy", "Code", "Region", "Income group") %>%
rename("CountryName" = "Economy", "CountryCode" = "Code","IncomeGroup" = "Income group") %>%
filter(!is.na(Region) & !is.na(`IncomeGroup`) & nchar(CountryCode) ==3 )
# Set region and Income Group as factors. Order by Region.
country_worldbank <- country_worldbank %>%
mutate(Region = factor(as.character(Region),
labels = c("E.Asia", "EUR.Cen.Asia", "Lat.Amer.Carr", "M.East.N.Africa",
"N.Amer", "Sth.Asia", "Sub.S.Africa" ),
levels = c("East Asia & Pacific", "Europe & Central Asia",
"Latin America & Caribbean","Middle East & North Africa",
"North America", "South Asia", "Sub-Saharan Africa")),
IncomeGroup = factor(IncomeGroup,
labels = c("High","Upper.Mid", "Low.Mid", "Low"),
levels = c("High income","Upper middle income", "Lower middle income", "Low income")))
# dispaly and verify factors
cat("Regions:")
Regions:
levels(country_worldbank$Region)
[1] "E.Asia" "EUR.Cen.Asia" "Lat.Amer.Carr" "M.East.N.Africa" "N.Amer" "Sth.Asia" "Sub.S.Africa"
cat("Income Groups: ")
Income Groups:
levels(country_worldbank$IncomeGroup)
[1] "High" "Upper.Mid" "Low.Mid" "Low"
head(country_worldbank)
cat(" Dimensions: " , dim(country_worldbank))
Dimensions: 218 4
worldbank dataset now has four columns and 218 observations Two factors have been created.
Verify Data types - all character columns Columns of interst: “LOCATON” rename -> “CountryCode” “MEASURE” filter to only include -> PC_GDP “SUBJECT” rename -> IndicatorCode. filter to only include ->“PRY_NTRY”, “TRY”" “TIME” column to remain unchanged for now.
cat("\n\n################# oecd education #####################\n")
################# oecd education #####################
str(education)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 5401 obs. of 8 variables:
$ LOCATION : chr "AUT" "AUT" "AUT" "AUT" ...
$ INDICATOR : chr "EDUEXP" "EDUEXP" "EDUEXP" "EDUEXP" ...
$ SUBJECT : chr "TRY" "TRY" "TRY" "TRY" ...
$ MEASURE : chr "USD_STUDENT" "USD_STUDENT" "USD_STUDENT" "USD_STUDENT" ...
$ FREQUENCY : chr "A" "A" "A" "A" ...
$ TIME : num 1995 2000 2005 2008 2009 ...
$ Value : num NA NA NA NA NA ...
$ Flag Codes: chr "M" "M" "M" "M" ...
- attr(*, "spec")=
.. cols(
.. LOCATION = [31mcol_character()[39m,
.. INDICATOR = [31mcol_character()[39m,
.. SUBJECT = [31mcol_character()[39m,
.. MEASURE = [31mcol_character()[39m,
.. FREQUENCY = [31mcol_character()[39m,
.. TIME = [32mcol_double()[39m,
.. Value = [32mcol_double()[39m,
.. `Flag Codes` = [31mcol_character()[39m
.. )
head(education)
describe(education)
education
8 Variables 5401 Observations
----------------------------------------------------------------------------------------------------------------------------------
LOCATION
n missing distinct
5401 0 46
lowest : ARG AUS AUT BEL BRA, highest: SVN SWE TUR USA ZAF
----------------------------------------------------------------------------------------------------------------------------------
INDICATOR
n missing distinct value
5401 0 1 EDUEXP
Value EDUEXP
Frequency 5401
Proportion 1
----------------------------------------------------------------------------------------------------------------------------------
SUBJECT
n missing distinct
5401 0 5
Value EARLYCHILDEDU PRY PRY_NTRY SRY TRY
Frequency 1077 1068 1100 1052 1104
Proportion 0.199 0.198 0.204 0.195 0.204
----------------------------------------------------------------------------------------------------------------------------------
MEASURE
n missing distinct
5401 0 2
Value PC_GDP USD_STUDENT
Frequency 2715 2686
Proportion 0.503 0.497
----------------------------------------------------------------------------------------------------------------------------------
FREQUENCY
n missing distinct value
5401 0 1 A
Value A
Frequency 5401
Proportion 1
----------------------------------------------------------------------------------------------------------------------------------
TIME
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
5401 0 12 0.993 2009 6.421 1995 2000 2008 2011 2014 2015 2016
Value 1995 2000 2005 2008 2009 2010 2011 2012 2013 2014 2015 2016
Frequency 443 444 452 452 450 452 450 444 450 452 452 460
Proportion 0.082 0.082 0.084 0.084 0.083 0.084 0.083 0.082 0.083 0.084 0.084 0.085
----------------------------------------------------------------------------------------------------------------------------------
Value
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
2487 2914 2251 1 4247 5653 6.763e-01 9.816e-01 1.692e+00 4.200e+00 8.131e+03 1.163e+04 1.538e+04
lowest : 0.000 0.072 0.073 0.074 0.086, highest: 27578.869 29328.235 30003.244 45800.558 48906.889
----------------------------------------------------------------------------------------------------------------------------------
Flag Codes
n missing distinct value
2914 2487 1 M
Value M
Frequency 2914
Proportion 1
----------------------------------------------------------------------------------------------------------------------------------
# rename character columns and subset charcater columns plus numeric values for years 2010-2015
keep_cols_education = c("CountryName", "CountryCode", "IndicatorCode")
years <- paste(2010:2015)
keep_cols_education = c(keep_cols_education, years)
# rename coumns and filte to include only the indicator codes for
# "primary to post secondary non-tertiary" and "Tertiary".
education <- education %>%
rename(Measure = MEASURE, IndicatorCode = SUBJECT, CountryCode = LOCATION, Time = TIME) %>%
select(CountryCode, Measure, IndicatorCode, Time, Value) %>%
filter(Measure == "PC_GDP" & IndicatorCode %in% c("PRY_NTRY", "TRY") & Time %in% years)
head(education)
cat(" Dimensions: " , dim(education))
Dimensions: 552 5
education datset now has five columns and 552 observations
Check if data is tidy: Exah variable in its own column. Ecah observation in its own row. Check observation for Australia and USA
country_worldbank %>% filter(CountryCode %in% c("AUS", "USA") )
Data is in tidy format. One observation, no variables in columns
education %>% filter(CountryCode %in% c("AUS", "USA") & Time > 2013)
Data is NOT in tidy format. Time column needs to be spread out. Indicator codes to be summed inot one value fr each country/year
# Checking the number of NA values for each variable.
colSums(is.na(education))
CountryCode Measure IndicatorCode Time Value
0 0 0 0 112
# we see 112 missing values that can be dropped
education <- education[complete.cases(education), ]
# sum the percentage spent on education for each country, year
# count the number of records the sum to filter when
education <- education %>% group_by(CountryCode, Measure, Time) %>%
summarise(code_count = n(), ED_PCG = sum(Value)) %>%
filter(code_count == 2) %>%
select(-code_count )
# now spread the data to
education <- education %>%
spread(Time, value =ED_PCG, drop = TRUE, fill = NA)
head(education)
Now data is in tidy format.
In OECD Education dataset.. Create / mutate the average of all Education expenditure values for row. Thsn aerage is used to impute missing values for years with no value for pecentage spent on education.
# Checking the number of NA values for each variable.
colSums(is.na(education))
CountryCode Measure 2010 2011 2012 2013 2014 2015
0 0 12 11 4 1 3 4
# imputation of means where NAs exist for value in years
education$col_avg = rowMeans(education[, years], na.rm = TRUE)
education <- education %>% mutate(`2010` = coalesce(`2010`,col_avg ),
`2011` = coalesce(`2011`,col_avg ),
`2012` = coalesce(`2012`,col_avg ),
`2013` = coalesce(`2013`,col_avg ),
`2014` = coalesce(`2014`,col_avg ),
`2015` = coalesce(`2015`,col_avg ))
# we still ahve some NAs that can be dropped later.
colSums(is.na(education))
CountryCode Measure 2010 2011 2012 2013 2014 2015 col_avg
0 0 0 0 0 0 0 0 0
data <- left_join(country_worldbank, education, by = c("CountryCode"), suffix = c("", ".EDUCATION") )
head(data)
Scan the data for missing values, inconsistencies and obvious errors. In this step, you should fulfil the minimum requirement #7. In addition to the R codes and outputs, explain your methodology (i.e. explain why you have chosen that methodology and the actions that you have taken to handle these values) and communicate your results clearly.
# verify dimensions and NAs
dim(data)
[1] 218 12
colSums(is.na(data))
CountryName CountryCode Region IncomeGroup Measure 2010 2011 2012 2013 2014
0 0 0 0 176 176 176 176 176 176
2015 col_avg
176 176
# key columns are "CountryName" and "CountryCode".
# verify NAs do not exist in these columns
colSums(is.na(data[,0:4]))
CountryName CountryCode Region IncomeGroup
0 0 0 0
# SCAN for empty values in year columns
data %>% filter(is.na(`2010`) & is.na(`2011`)& is.na(`2012`)& is.na(`2013`)& is.na(`2014`) & is.na(`2015`) )
#drop these out of the datset
data <- data %>% filter(!is.na(`2010`) | !is.na(`2011`) | !is.na(`2012`) | !is.na(`2013`) | !is.na(`2014`) | !is.na(`2015`) )
colSums(is.na(data))
CountryName CountryCode Region IncomeGroup Measure 2010 2011 2012 2013 2014
0 0 0 0 0 0 0 0 0 0
2015 col_avg
0 0
# plot a
boxplot(data$`2010`)
boxplot(data$`2011`)
boxplot(data$`2012`)
boxplot(data$`2013`)
boxplot(data$`2014`)
boxplot(data$`2015`)
y = data$`2011`
# his of 2011 data
hist(data$`2011`)
# transform with log and plot hist.
hist(log(data$`2011`))